June 7, 2026 · Fafa Agbetsise · 0.5.3 → 0.5.4
Targeting the weakest category: how cst-express-agent took our unsafe-tool-output detector from 42% to 76%
When you build a benchmark, you can see which detector categories are weakest. We saw unsafe-tool-output sitting at 42% mean recall across three repos. So we built repo #4 to specifically stress that surface: an Express + Anthropic agent with 6 different tool-output sink shapes. Here's what 0.5.4 ships, and what the data tells us to target next.
CodeSecBench has six AI-app security categories. After three calibration cycles (repos #1 cst-nextjs-chat, #2 cst-vite-rag, #3 cst-sveltekit-stream) the per-category recall data showed an obvious pattern: unsafe-tool-output, unsafe-role-merge, and unbounded-stream were our weakest three.
So instead of designing repo #4 with the original distribution-even spec, we rebalanced. cst-express-agent ships 6 unsafe-tool-output (UTO) carriers. Its agent-framework idiom, Express + Anthropic tool-calling, naturally houses dense tool-output sinks, which makes the repo the calibration target for the UTO detector. Here’s what we found and what 0.5.4 ships.
The 6 tool-output sinks the repo seeds
#  File                           Pattern
─────────────────────────────────────────────────────────────────────────────────────
4  src/tools/shell.ts             execAsync(args.command)                   ← CWE-78
5  src/tools/sql.ts               sql.unsafe(args.query)                    ← CWE-89
6  src/tools/file-write.ts        writeFileSync(args.path, ...)             ← CWE-22
7  src/tools/eval-math.ts         new Function(`return (${args.expr})`)     ← CWE-94
8  src/routes/render-message.ts   res.send(`...${marked.parse(...)}...`)    ← CWE-79
9  src/tools/spawn-subprocess.ts  spawn(args.bin, args.argsString.split())  ← CWE-78
Three of these (shell, sql, spawn) were already detector-covered by 0.5.2’s unsafeToolOutputArgsRe. Three weren’t (file-write, Function-constructor, HTML-render). On first scan: 3 of 6 UTO carriers caught, total repo recall 29%.
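To make the 3-of-6 split concrete, here is a minimal sketch of a sink-list regex. It is not the real unsafeToolOutputArgsRe, whose source isn't shown in this post. The names COVERED_SINKS, escapeRe, buildSinkRe, and isFlagged are hypothetical, and the sketch assumes the detector looks for a known sink call whose argument list mentions args.<name> on the same line.

```typescript
// Hypothetical sink list, modeled on the three patterns described as already covered.
const COVERED_SINKS = ["execAsync", "sql.unsafe", "spawn"];

// Escape regex metacharacters so "sql.unsafe" matches a literal dot.
function escapeRe(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a regex matching `<sink>( ... args.<name>` on one line.
// The leading group stops "respawn(" from matching the "spawn" sink.
function buildSinkRe(sinks: string[]): RegExp {
  const alt = sinks.map(escapeRe).join("|");
  return new RegExp(`(?:^|[^\\w.])(${alt})\\s*\\([^)]*\\bargs\\.\\w+`);
}

// True when the line calls one of the listed sinks with an args.<name> argument.
function isFlagged(line: string, sinks: string[]): boolean {
  return buildSinkRe(sinks).test(line);
}
```

In a sketch like this, a sink that isn't in the list is invisible no matter how dangerous the call is, so writeFileSync(args.path, ...) and new Function(...) pass unflagged until their names are added.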
What 0.5.4 ships
- unsafeToolOutputArgsRe sink list extended with writeFileSync, writeFile, appendFileSync, appendFile, createWriteStream, and Function (the constructor form: new Function(`...${args.X}`) is the eval-equivalent the existing detector was missing). Catches #6 and #7 immediately.
- scanUnboundedFetch: new unbounded-stream detector for fetch(args.url, { ...no signal: option }). Naive paren-balancing scans the call body, then checks for signal:. This is the tool-side streaming-fetch pattern that stream: true and .messages.stream( don’t reach.
- scanHtmlKeyEmbed: new client-side-llm-key sub-detector. Flags res.send of a backtick-quoted HTML template that embeds ${process.env.X_API_KEY} as an attribute or text node. The leak shape: server-rendered HTML containing a credential.
- PIP (pii-in-prompt) context-gate relaxation by file name: when the file path matches a strong AI-context signal (user-snapshot.ts, profile-context.ts, prompt-builder.ts, etc.), the piiInPromptRe finding fires without requiring an LLM-call marker in the same file. Catches the cross-file case where the helper that calls JSON.stringify(user) is imported by the LLM-calling route. Fixes a real miss on cst-sveltekit-stream that was sitting at 0% recall on PIP.
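The paren-balancing step in scanUnboundedFetch can be sketched roughly as follows. This is an illustration under assumptions, not the shipped detector: the Finding shape and the names findCallEnd and scanUnboundedFetchSketch are hypothetical, and only the two steps described above (balance the parentheses, then look for signal:) are taken from the description.

```typescript
// Hypothetical finding shape; the real detector's output format is not shown here.
interface Finding {
  index: number; // offset of the `fetch(` call in the source text
  call: string;  // full call text, from `fetch` to its matching `)`
}

// Return the index just past the ")" that closes the "(" at openIdx,
// or -1 if the parentheses never balance. Naive on purpose: it does not
// skip parentheses inside strings, template literals, or comments.
function findCallEnd(src: string, openIdx: number): number {
  let depth = 0;
  for (let i = openIdx; i < src.length; i++) {
    if (src[i] === "(") depth++;
    else if (src[i] === ")") {
      depth--;
      if (depth === 0) return i + 1;
    }
  }
  return -1;
}

// Flag fetch(args.<name>, ...) calls whose body never contains `signal:`.
// A shorthand `{ signal }` property would not be recognized by this sketch.
function scanUnboundedFetchSketch(src: string): Finding[] {
  const findings: Finding[] = [];
  const callRe = /\bfetch\s*\(\s*args\.\w+/g;
  let m: RegExpExecArray | null;
  while ((m = callRe.exec(src)) !== null) {
    const openIdx = src.indexOf("(", m.index);
    const end = findCallEnd(src, openIdx);
    if (end === -1) continue; // unbalanced call: skip rather than guess
    const call = src.slice(m.index, end);
    if (!/\bsignal\s*:/.test(call)) findings.push({ index: m.index, call });
  }
  return findings;
}
```

Balancing parentheses instead of stopping at the first ")" matters because option objects often contain nested calls, such as a headers helper or AbortSignal.timeout(...), that would otherwise cut the scanned body short.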
The numbers across the corpus
Repo                    0.5.3 recall   0.5.4 recall   Delta
─────────────────────────────────────────────────────────────────────────
cst-nextjs-chat         38%            46%            +8pp
cst-vite-rag            40%            47%            +7pp
cst-sveltekit-stream    36%            50%            +14pp
cst-express-agent       29%            59%            +30pp

Mean recall per category   0.5.3 → 0.5.4   Delta
─────────────────────────────────────────────────────────────────────────
unsafe-tool-output         42% → 76%       +34pp   ← TARGETED, now strongest
pii-in-prompt              25% → 50%       +25pp
unbounded-stream           17% → 38%       +21pp
prompt-injection           42% → 46%       +4pp
client-side-llm-key        63% → 63%       flat
unsafe-role-merge          17% → 8%        -9pp    ← NOW WEAKEST
The targeting worked. UTO went from 42% to strongest (76%) in one cycle. PIP gained 25pp and UBS (unbounded-stream) gained 21pp. The 4-line file-name signal regex closed a 50% recall gap on PIP across two repos.
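As a rough picture of how a file-name signal can relax a context gate, consider the sketch below. It assumes the gate reduces to two booleans plus a path check. AI_CONTEXT_FILE_RE, GateInput, and shouldReportPii are hypothetical names, and the regex lists only the three file names given as examples in this post, not the real signal list.

```typescript
// Hypothetical file-name signal built from the three example names in the post.
// The real list is longer and is not reproduced here.
const AI_CONTEXT_FILE_RE =
  /(?:^|[\\\/])(?:user-snapshot|profile-context|prompt-builder)\.[cm]?[jt]sx?$/;

interface GateInput {
  filePath: string;
  piiPatternMatched: boolean; // stand-in for a piiInPromptRe hit
  hasLlmCallMarker: boolean;  // stand-in for "an LLM call appears in this file"
}

// Report a PII finding when the pattern matched AND either the file itself
// calls an LLM or the file name strongly suggests AI-prompt context.
function shouldReportPii(input: GateInput): boolean {
  if (!input.piiPatternMatched) return false;
  return input.hasLlmCallMarker || AI_CONTEXT_FILE_RE.test(input.filePath);
}
```

The trade-off in a gate like this is that the file-name branch accepts some risk of false positives in exchange for catching helpers whose LLM call lives in a different file.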
Repos #1, #2, and #3 all improved this cycle because the new detectors are cross-cutting (file-write sinks, fetch-without-signal) rather than specific to one repo. The corpus design’s randomized file paths mean that a detector that works at all works across every stack that shares the underlying pattern. That’s the whole point of generalising the regex shapes rather than tying them to a specific framework.
What the data says next
unsafe-role-merge is now the single weakest category at 8% mean recall. That’s the natural target for repos #5 and #6. The remaining gaps cluster around three patterns:
- Agent-protocol role pass-through: messages.map((m) => ({ role: m.role, content: m.content })) where m.role isn’t constrained to an allowlist. Inter-agent messages cross trust boundaries without role validation.
- Persona file loaders: readFileSync(join(personasDir, `${personaName}.txt`)) reads system-prompt text from a path the LLM caller can influence. Path traversal + role escalation in one pattern.
- CrewAI-style agent delegation: sub-agent system prompts constructed from parent-agent output. The role boundary is structural to the framework; the merge happens at the framework layer.
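The first pattern is easiest to see side by side with a constrained variant. The sketch below is illustrative only: the message types, ALLOWED_ROLES, passThrough, and withAllowlist are hypothetical names, and the choice of "user" and "assistant" as the allowed roles is an assumption for the example, not something a benchmark repo prescribes.

```typescript
type Role = "user" | "assistant";

interface InboundMessage {
  role: string; // arrives from another agent or client, so not trusted
  content: string;
}

interface SafeMessage {
  role: Role;
  content: string;
}

// Assumed allowlist for this example; "system" is deliberately excluded.
const ALLOWED_ROLES: ReadonlySet<string> = new Set(["user", "assistant"]);

// The pass-through shape: whatever role arrives is forwarded unchanged,
// so an inbound "system" message keeps its elevated role.
function passThrough(messages: InboundMessage[]): InboundMessage[] {
  return messages.map((m) => ({ role: m.role, content: m.content }));
}

// Constrained variant: any role outside the allowlist is rejected.
function withAllowlist(messages: InboundMessage[]): SafeMessage[] {
  return messages.map((m) => {
    if (!ALLOWED_ROLES.has(m.role)) {
      throw new Error(`role not allowed: ${m.role}`);
    }
    return { role: m.role as Role, content: m.content };
  });
}
```

The two functions differ only in the allowlist check, which is part of why this category is hard to catch with regex shapes: a detector has to recognize the absence of a guard around m.role, not the presence of a dangerous call.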
cst-fastapi-tools (UBS-heavy — async generators, asyncio cancellation, StreamingResponse) and cst-crewai-multiagent (URM-heavy — agent-to-agent role boundaries) are next. The pattern is now clear: design each repo to stress a known-weak category, ship detectors that close the gap, watch the mean recall move measurably. The benchmark is doing what a benchmark is supposed to do.
Previous in this series: What SvelteKit + Anthropic taught our SAST detector that Next + OpenAI couldn’t · How a public AI-app SAST benchmark made our detector 15% better in one afternoon.