CodeSecBench
← All posts

June 7, 2026 · Fafa Agbetsise · 0.5.3 → 0.5.4

Targeting the weakest category: how cst-express-agent took our unsafe-tool-output detector from 42% to 76%

When you build a benchmark, you can see which detector categories are weakest. We saw unsafe-tool-output sitting at 42% mean recall across three repos. So we built repo #4 to specifically stress that surface: an Express + Anthropic agent with 6 different tool-output sink shapes. Here's what 0.5.4 ships, and what the data tells us to target next.

ai-app-security benchmark calibration agent-tools

CodeSecBench has six AI-app security categories. After three calibration cycles (repos #1 cst-nextjs-chat, #2 cst-vite-rag, #3 cst-sveltekit-stream) the per-category recall data showed an obvious pattern: unsafe-tool-output, unsafe-role-merge, and unbounded-stream were our weakest three.

So instead of designing repo #4 with the original distribution-even spec, we rebalanced. cst-express-agent ships 6 unsafe-tool-output carriers — an agent-framework idiom of Express + Anthropic tool-calling that naturally houses dense tool-output sinks. The repo is the calibration target for the UTO detector. Here’s what we found and what 0.5.4 ships.

The 6 tool-output sinks the repo seeds

#  File                                Pattern
─────────────────────────────────────────────────────────────────────────
4  src/tools/shell.ts                  execAsync(args.command)           ← CWE-78
5  src/tools/sql.ts                    sql.unsafe(args.query)            ← CWE-89
6  src/tools/file-write.ts             writeFileSync(args.path, ...)     ← CWE-22
7  src/tools/eval-math.ts              new Function(`return (${args.expr})`)  ← CWE-94
8  src/routes/render-message.ts        res.send(`...${marked.parse(...)}...`)  ← CWE-79
9  src/tools/spawn-subprocess.ts       spawn(args.bin, args.argsString.split())  ← CWE-78

Three of these (shell, sql, spawn) were already detector-covered by 0.5.2’s unsafeToolOutputArgsRe. Three weren’t (file-write, Function-constructor, HTML-render). On first scan: 3 of 6 UTO carriers caught, total repo recall 29%.

What 0.5.4 ships

  • unsafeToolOutputArgsRe sink list extended with writeFileSync, writeFile, appendFileSync, appendFile, createWriteStream, and Function (the constructor form — new Function(`...${args.X}`) is the eval-equivalent the existing detector was missing). Catches #6 and #7 immediately.
  • scanUnboundedFetch — new unbounded-stream detector for fetch(args.url, { ...no signal: option }). Naive paren-balancing scans the call body, then checks for signal:. The tool-side streaming-fetch pattern that stream: true and .messages.stream( don’t reach.
  • scanHtmlKeyEmbed — new client-side-llm-key sub-detector. res.send of a backtick-quoted HTML template that embeds ${process.env.X_API_KEY} as an attribute or text node. The leak shape: server-rendered HTML containing a credential.
  • PIP context-gate relaxation by file name — when the file path matches a strong AI-context signal (user-snapshot.ts, profile-context.ts, prompt-builder.ts, etc.), the piiInPromptRe finding fires without requiring an LLM-call marker in the same file. Catches the cross-file case where the helper that JSON.stringify(user) is imported by the LLM-calling route. Fixes a real miss on cst-sveltekit-stream that was sitting at 0% recall on PIP.

The numbers across the corpus

Repo                    0.5.3 recall   0.5.4 recall   Delta
cst-nextjs-chat              38%            46%        +8pp
cst-vite-rag                 40%            47%        +7pp
cst-sveltekit-stream         36%            50%       +14pp
cst-express-agent            29%            59%       +30pp

Mean recall per category   0.5.3 → 0.5.4   Delta
unsafe-tool-output           42%  → 76%   +34pp ← TARGETED — now strongest
pii-in-prompt                25%  → 50%   +25pp
unbounded-stream             17%  → 38%   +21pp
prompt-injection             42%  → 46%    +4pp
client-side-llm-key          63%  → 63%    flat
unsafe-role-merge            17%  →  8%    -9pp ← NOW WEAKEST

The targeting worked. UTO went from joint-weakest (42%) to strongest (76%) in one cycle. PIP +25pp, UBS +21pp. The 4-line file-name signal regex closed a 50% recall gap on PIP across two repos.

Repos #1, #2, and #3 all improved on this cycle even though the new detectors were cross-cutting (file-write sinks, fetch-without-signal). The corpus design’s randomized file paths mean each detector that works at all works across all stacks that share the underlying pattern. That’s the whole point of generalising the regex shapes vs. tying them to a specific framework.

What the data says next

unsafe-role-merge is now the single weakest category at 8% mean recall. That’s the natural target for repos #5 and #6. The remaining gaps cluster around three patterns:

  • Agent-protocol role pass-throughmessages.map((m) => ({ role: m.role, content: m.content })) where m.role isn’t constrained to an allowlist. Inter-agent messages cross trust boundaries without role validation.
  • Persona file loadersreadFileSync(join(personasDir, `${personaName}.txt`)) reads system-prompt text from a path the LLM caller can influence. Path traversal + role escalation in one pattern.
  • CrewAI-style agent delegation — sub-agent system prompts constructed from parent-agent output. The role boundary is structural to the framework; the merge happens at the framework layer.

cst-fastapi-tools (UBS-heavy — async generators, asyncio cancellation, StreamingResponse) and cst-crewai-multiagent (URM-heavy — agent-to-agent role boundaries) are next. The pattern is now clear: design each repo to stress a known-weak category, ship detectors that close the gap, watch the mean recall move measurably. The benchmark is doing what a benchmark is supposed to do.

Previous in this series: What SvelteKit + Anthropic taught our SAST detector that Next + OpenAI couldn’t · How a public AI-app SAST benchmark made our detector 15% better in one afternoon.