CodeSecBench

June 7, 2026 · Fafa Agbetsise · 0.5.1 → 0.5.2

How a public AI-app SAST benchmark raised our detector’s recall by up to 15 points in one afternoon

We built CodeSecBench Tier C: six deliberately vulnerable AI apps with 149 labeled lines (about 89 of them vulnerable) across all six AI-app security categories. Running getdebug 0.5.1 against the first two repos surfaced gaps we could fix in code the same day. Here’s the loop, the numbers, and what 0.5.2 ships.

ai-app-security benchmark calibration

Two days ago Fafa flagged a problem with our outreach pitch: we tell prospective users getdebug catches AI-app security bugs, but our own benchmark was a handful of one-bug micro-fixtures. Useful for unit testing the detector. Useless for the question a real developer asks: does this catch the bugs my app actually has?

So we built CodeSecBench Tier C — a public corpus of six deliberately-vulnerable AI apps, each ~40 files, each carrying 12–18 labeled bugs across all six AI-app security categories. The truth lives in a separate public repository so scanners never see the labels at scan time. The first two repos are live; this post is the first calibration cycle.

The benchmark, in one paragraph

Six target repos under getdebug-ai/cst-* (cst- = CodeSecBench Tier C), each a different stack: Next.js + Vercel AI SDK, Vite + Express + LangChain.js, SvelteKit + Anthropic, Express + tool-calling agent, FastAPI + Python, CrewAI multi-agent. Total corpus: ~89 vulnerable rows + 33 safe near-misses + 27 borderline cases = 149 labeled lines. Each repo has a known-safe.ts hallucination control file — any scanner finding inside it is a guaranteed false positive. The truth lives at getdebug-ai/codesecbench-truth; the “don’t peek” norm is documented in the README, same as every honest public benchmark.

File paths in each repo are randomized — domain-appropriate, not template-matched. A vendor allowlisting lib/user.ts won’t generalize from repo #1 to repo #2 (server/services/personalization.ts). The benchmark measures detection skill, not memorization.

Running getdebug 0.5.1 against the first two repos

We ran getdebug analyze . --quiet --json against the two completed repos and scored against the truth file using a span+tolerance JOIN scorer (any finding whose line span overlaps a truth row’s span, ±5 lines, credits the row).
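The span+tolerance JOIN can be sketched roughly as follows. This is a minimal illustration rather than the scorer shipped in codesecbench-truth: the Span type, the Score function, and the file paths are hypothetical, and it assumes only vulnerable truth rows are passed in, each credited at most once.

```go
package main

import "fmt"

// Span is an inclusive line range within one file (hypothetical type).
type Span struct {
	File       string
	Start, End int
}

// overlaps reports whether finding f touches truth row t once t is
// widened by tol lines on each side.
func overlaps(f, t Span, tol int) bool {
	return f.File == t.File && f.Start <= t.End+tol && f.End >= t.Start-tol
}

// Score joins findings against vulnerable truth rows. A credited row is a
// TP, an uncredited row is an FN, and a finding that touches no row is an FP.
func Score(findings, truth []Span, tol int) (tp, fp, fn int) {
	credited := make([]bool, len(truth))
	for _, f := range findings {
		hit := false
		for i, t := range truth {
			if overlaps(f, t, tol) {
				credited[i] = true
				hit = true
			}
		}
		if !hit {
			fp++
		}
	}
	for _, c := range credited {
		if c {
			tp++
		} else {
			fn++
		}
	}
	return tp, fp, fn
}

func main() {
	// Illustrative paths and line numbers, not taken from the corpus.
	truth := []Span{
		{"app/api/chat/route.ts", 30, 30},
		{"lib/tools.ts", 12, 14},
	}
	findings := []Span{
		{"app/api/chat/route.ts", 27, 27}, // 3 lines off: within tolerance
		{"lib/known-safe.ts", 5, 5},       // touches no truth row
	}
	tp, fp, fn := Score(findings, truth, 5)
	fmt.Println(tp, fp, fn) // 1 1 1
}
```

How the real scorer treats safe near-miss and borderline rows is not covered here; this sketch only shows the overlap test and the TP/FP/FN bookkeeping.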

Repo                  Findings   TP   FP   FN   Precision   Recall
cst-nextjs-chat            6     3    1   10        75%       23%
cst-vite-rag               6     4    1   11        80%       27%
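For readers checking the arithmetic, the Precision and Recall columns follow from the TP, FP, and FN counts in the usual way. The helper names below are illustrative only:

```go
package main

import "fmt"

// pct returns num/den as a percentage rounded to the nearest integer.
func pct(num, den int) int {
	if den == 0 {
		return 0
	}
	return (200*num + den) / (2 * den)
}

// Precision is TP / (TP + FP).
func Precision(tp, fp int) int { return pct(tp, tp+fp) }

// Recall is TP / (TP + FN).
func Recall(tp, fn int) int { return pct(tp, tp+fn) }

func main() {
	fmt.Println(Precision(3, 1), Recall(3, 10)) // cst-nextjs-chat: 75 23
	fmt.Println(Precision(4, 1), Recall(4, 11)) // cst-vite-rag:    80 27
}
```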

23% and 27% recall. Far short of where we’d need to be for an outreach pitch. But the data is useful, because the misses cluster. In both repos the detector missed the same canonical CWE patterns:

  • Shell & SQL injection via args.X — the canonical SDK tool-callable shape. Repo #1’s execAsync(args.command) and repo #2’s sql.unsafe(args.query) are both classic CWE-78 / CWE-89 sinks. The detector’s existing regex only matched exec(tool.input.X) — the SDK’s typed-tool-ref form. Real code uses args.X, where args is the typed function parameter.
  • API key returned in JSON response body — the second-most-common Next.js / Express leak after NEXT_PUBLIC_. Pattern is Response.json({apiKey: process.env.X_API_KEY}). The detector had no rule for this shape at all.

The fix: two new regexes, sixty minutes

Both gaps are regex-detectable. Pattern A, the args.X form, only needed args added as a valid identifier prefix alongside tool, block, toolUse, etc., plus a sink list extended to cover SQL calls such as sql.unsafe, db.unsafe, pool.unsafe, and db.prepare:

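// Shell and SQL sinks whose call arguments reference args.X.
// Go's regexp (RE2) has no free-spacing mode, so the pattern is
// concatenated from pieces instead of wrapped inside one raw string.
var unsafeToolOutputArgsRe = regexp.MustCompile(
  `\b(?:exec|execSync|execAsync|spawn|spawnSync|eval|run|runCommand|` +
    `runSync|sql\.unsafe|db\.unsafe|db\.query|pool\.unsafe|pool\.query|` +
    `client\.unsafe|client\.query|db\.prepare)` +
    `\s*\(\s*[^)]{0,160}?\bargs\.\w+`,
)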

Pattern B — the key-in-response form — is a fresh detector with a response-context anchor:

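// A response call whose object literal maps a key-like field to an env secret.
// Concatenated for the same reason: newlines and indentation inside a raw
// string would become literal parts of the pattern.
var keyInResponseRe = regexp.MustCompile(
  `(?s)(?:Response\.json|res\.json|res\.send|return\s+json|` +
    `return\s+Response\.json)` +
    `\s*\(\s*\{[^}]{0,400}?` +
    `(?:apiKey|api_key|secret|token|key)\s*:\s*` +
    `process\.env\.[A-Z][A-Z0-9_]*(?:KEY|TOKEN|SECRET)`,
)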

Both new patterns ship with explicit negative tests. Parameterized SQL via the tagged template (sql`SELECT * FROM users WHERE id = ${userId}`) doesn’t fire. Legitimate SDK construction (new OpenAI({apiKey: process.env.OPENAI_API_KEY})) doesn’t fire. The point is to catch new shapes, not over-fire on safe ones.
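A rough way to exercise those positive and negative cases is a small table-driven check. The two patterns are the ones shown above, written as concatenated strings because Go's regexp package has no free-spacing mode. The check struct and the exact input strings are illustrative, not the project's actual test suite.

```go
package main

import (
	"fmt"
	"regexp"
)

var unsafeToolOutputArgsRe = regexp.MustCompile(
	`\b(?:exec|execSync|execAsync|spawn|spawnSync|eval|run|runCommand|` +
		`runSync|sql\.unsafe|db\.unsafe|db\.query|pool\.unsafe|pool\.query|` +
		`client\.unsafe|client\.query|db\.prepare)` +
		`\s*\(\s*[^)]{0,160}?\bargs\.\w+`)

var keyInResponseRe = regexp.MustCompile(
	`(?s)(?:Response\.json|res\.json|res\.send|return\s+json|` +
		`return\s+Response\.json)` +
		`\s*\(\s*\{[^}]{0,400}?` +
		`(?:apiKey|api_key|secret|token|key)\s*:\s*` +
		`process\.env\.[A-Z][A-Z0-9_]*(?:KEY|TOKEN|SECRET)`)

// check pairs one source snippet with the pattern it is run against
// and whether that pattern is expected to fire (hypothetical helper type).
type check struct {
	name string
	re   *regexp.Regexp
	src  string
	want bool
}

func main() {
	checks := []check{
		{"shell sink via args", unsafeToolOutputArgsRe, "await execAsync(args.command)", true},
		{"sql sink via args", unsafeToolOutputArgsRe, "await sql.unsafe(args.query)", true},
		{"tagged template", unsafeToolOutputArgsRe, "sql`SELECT * FROM users WHERE id = ${userId}`", false},
		{"key in response body", keyInResponseRe, "return Response.json({apiKey: process.env.X_API_KEY})", true},
		{"sdk construction", keyInResponseRe, "new OpenAI({apiKey: process.env.OPENAI_API_KEY})", false},
	}
	for _, c := range checks {
		got := c.re.MatchString(c.src)
		fmt.Printf("%-22s fired=%-5v ok=%v\n", c.name, got, got == c.want)
	}
}
```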

The numbers after 0.5.2

Repo                 0.5.1 recall   0.5.2 recall   Delta
cst-nextjs-chat          23%           38%        +15pp
cst-vite-rag             27%           40%        +13pp

Per-category, repo #1:
  client-side-llm-key      50% → 100%   (+50pp)
  unsafe-tool-output        0% →  33%   (+33pp)
  prompt-injection         50% →  50%   (flat)
  pii-in-prompt            50% →  50%   (flat)
  unsafe-role-merge         0% →   0%   (flat)
  unbounded-stream          0% →   0%   (flat — label-line issue, see below)

The 50pp jump on client-side-llm-key means the second carrier in both repos (a server route returning the key in a response body) is now caught. The 33pp jump on unsafe-tool-output means the canonical CWE-78 and CWE-89 sinks, execAsync(args.command) and sql.unsafe(args.query), are caught. Neither pattern is contrived; we saw both in the wild while building the corpus.

What didn’t move: unsafe-role-merge, prompt-injection, pii-in-prompt, and one of the unbounded-stream rows in repo #1. The unbounded-stream miss is a label issue: the detector hit at line 42 (stream: true) while the truth label is at line 48 (the for-await loop), 6 lines apart and so outside the ±5 tolerance. Widening that label to a span fixes it; we’ll do that in the v0.1.1 truth release. The role-merge and prompt-injection misses are real detector gaps, and they’re the next calibration target.
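The tolerance arithmetic behind that unbounded-stream miss can be checked with a few lines. The withinTolerance helper is hypothetical, and the widened span of lines 42 to 48 is an assumption about what the corrected label might cover:

```go
package main

import "fmt"

// withinTolerance reports whether a finding at line hit credits a truth
// label spanning [start, end], allowing tol lines of slack on each side.
func withinTolerance(hit, start, end, tol int) bool {
	return hit >= start-tol && hit <= end+tol
}

func main() {
	// Current label: a single line at 48; the detector reports line 42.
	fmt.Println(withinTolerance(42, 48, 48, 5)) // false: 6 lines away
	// Assumed fix: the label widened to cover lines 42 through 48.
	fmt.Println(withinTolerance(42, 42, 48, 5)) // true
}
```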

The loop

Each repo becomes a learning artifact. Add a repo → score with the current tool → identify the gaps → ship detector fixes → re-score all earlier repos with the new version → build the next repo → repeat. The /results page tracks the time-series. Each calibration cycle gets a blog post.

Next: cst-sveltekit-stream (#3) is in author-mode now. The Anthropic SDK takes the system prompt as a separate top-level parameter (anthropic.messages.create({system: "...", messages: [...]})), not as a role inside the messages array. The existing role: "system" detector won’t see it. That’s the kind of stack-specific blind spot the corpus exists to surface.
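To make the blind spot concrete, here is a small sketch. Both regexes are hypothetical stand-ins: the real getdebug role: "system" pattern is not shown in this post, and topLevelSystemRe is only one possible shape a fix could take.

```go
package main

import (
	"fmt"
	"regexp"
)

// roleSystemRe stands in for a detector keyed on role: "system" entries
// inside a messages array (hypothetical pattern).
var roleSystemRe = regexp.MustCompile(`\brole\s*:\s*["']system["']`)

// topLevelSystemRe is a hypothetical complement that looks for a system
// key in the object literal passed to messages.create.
var topLevelSystemRe = regexp.MustCompile(
	`messages\.create\s*\(\s*\{[^}]{0,400}?\bsystem\s*:`)

func main() {
	roleStyle := `messages: [{role: "system", content: prompt}, {role: "user", content: q}]`
	topLevelStyle := `anthropic.messages.create({system: prompt, messages: [{role: "user", content: q}]})`

	fmt.Println(roleSystemRe.MatchString(roleStyle))         // true
	fmt.Println(roleSystemRe.MatchString(topLevelStyle))     // false: the blind spot
	fmt.Println(topLevelSystemRe.MatchString(topLevelStyle)) // true
}
```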

Try it yourself

If you’re building a SAST tool that targets AI-app categories, CodeSecBench is for you. The corpus + truth file are MIT licensed, and there’s a vendor-side scorer at codesecbench-truth/score.js (zero deps, ~120 lines). Run your tool against the public targets, JOIN against the truth, open a PR with your results. The don’t-peek norm is the only ask.

If you’re a developer wondering whether your own AI app has any of these patterns today: run npm i -g @getdebug/cli, then getdebug analyze . from your project directory. 0.5.2 lands on npm with the next release; the LLM-augmented pass (--local-llm) is already there.