June 7, 2026 · Fafa Agbetsise · 0.5.2 → 0.5.3
What SvelteKit + Anthropic taught our SAST detector that Next + OpenAI couldn't
Adding cst-sveltekit-stream to CodeSecBench Tier C dropped getdebug's recall from 38% to 7% on first scan. The same six AI-app categories, the same bugs — but a different framework + SDK changed every pattern shape. Here's what we shipped in 0.5.3 to close the gap.
Yesterday we shipped getdebug 0.5.2 with two new regex detectors after measuring the first two CodeSecBench Tier C repos (a Next 14 + Vercel AI SDK chatbot and a Vite + Express + LangChain.js RAG dashboard). Recall jumped 15 percentage points on each. Felt good. Then we added the third repo — a SvelteKit + Anthropic streaming app — and recall dropped to 7%.
This is the entire point of the corpus.
Same six categories. Different stack. Different patterns.
The bugs in cst-sveltekit-stream were the same six categories that the other two repos cover — client-side-llm-key, prompt-injection, unsafe-tool-output, unsafe-role-merge, pii-in-prompt, unbounded-stream. The detector knew about all six. But every concrete pattern in the new repo had a different shape:
Pattern                        Next/Express/OpenAI                             SvelteKit/Anthropic
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
LLM key in client bundle       NEXT_PUBLIC_OPENAI_API_KEY                      $env/static/public import
LLM key in response body       Response.json({apiKey: process.env.X_API_KEY})  json({apiKey: BARE_IDENT})
LLM operator channel           role: 'system' in messages[]                    system: param at the top level of the call
HTML rendered from LLM output  dangerouslySetInnerHTML                         {@html ...}
Streaming LLM call             stream: true                                    anthropic.messages.stream()
Unbounded client read          useEffect with no cleanup                       onMount with no onDestroy
Five of the six existing detectors structurally could not see the SvelteKit / Anthropic shape. The detector for role: 'system' inside messages[] cannot find anything when Anthropic puts the system prompt in a separate top-level system: parameter — it’s not a regex distance problem, it’s an entirely different shape. Same for {@html ...} vs dangerouslySetInnerHTML.
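To make the shape mismatch concrete, here is a minimal sketch. The regex is a hypothetical, simplified stand-in for a "system role inside messages[]" detector, not getdebug's actual pattern, and the two snippets are illustrative call shapes rather than code from the corpus.

```typescript
// Hypothetical stand-in for a detector anchored on role: 'system' in messages[].
const roleSystemInMessages = /role:\s*['"]system['"]/;

// OpenAI-style shape: the operator prompt is a message whose role is "system".
const openAiShape = `
const res = await openai.chat.completions.create({
  model: MODEL,
  messages: [
    { role: 'system', content: \`You are a helper. \${userBio}\` },
    { role: 'user', content: question },
  ],
});`;

// Anthropic-style shape: the operator prompt is a top-level system parameter.
const anthropicShape = `
const res = await anthropic.messages.create({
  model: MODEL,
  max_tokens: 1024,
  system: \`You are a helper. \${userBio}\`,
  messages: [{ role: 'user', content: question }],
});`;

const hitsOpenAi = roleSystemInMessages.test(openAiShape);       // true
const hitsAnthropic = roleSystemInMessages.test(anthropicShape); // false: no role: 'system' anywhere
```

Both snippets merge user-controlled text into the operator channel, but only the first contains the token sequence the regex is anchored on. No amount of widening the match distance changes that.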
7% recall on first scan. One TP (the canonical db.prepare(args.sql).all() SQL injection sink, caught by 0.5.2’s args.X detector). One false positive (the safe sibling db.prepare call inside an allowlist guard — the detector couldn’t see the guard).
What 0.5.3 ships
Five new detectors plus one precision fix:
- scanPublicEnvKey catches import { PUBLIC_X_API_KEY } from '$env/static/public'. The SvelteKit / Astro / Nuxt equivalent of the NEXT_PUBLIC_ mistake. Same build-time inlining contract, different framework idiom.
- scanKeyInResponseSvelte catches return json({apiKey: ANTHROPIC_API_KEY}) where the key was destructured from $env/static/private at import time. 0.5.2’s version anchors on process.env.X_API_KEY directly; this one accepts bare-identifier values shaped like API keys.
- scanSvelteHtmlSink catches Svelte’s {@html X} directive, gated by a ±40-line context check for marked.parse / role / tool / messages references so static-content {@html ...} calls don’t fire.
- scanAnthropicSystemMerge catches Anthropic’s top-level system: parameter. Two cases handled: inline template literals system: `...${user}...`, and the indirect form system: systemPrompt where systemPrompt was assigned a template-literal-with-interpolation earlier in the file.
- scanAnthropicUnboundedStream catches .messages.stream( method calls with the same abort-in-scope gate the existing stream: true detector uses.
- Precision fix: db.prepare(args.X) inside an allowlist guard (*_ALLOWLIST.has(args.X) within the prior 30 lines) no longer fires. The 0.5.2 false positive on cst-sveltekit-stream/src/lib/server/agent-tools/database-safe.ts is cleared.
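The two context gates in that list (the ±40-line marker check for {@html X} and the prior-30-line allowlist check for db.prepare) can be sketched roughly as follows. This is a simplified illustration under our own assumptions, not getdebug's source: the names windowHas, findHtmlSinks and findPrepareSinks are hypothetical, and the regexes are deliberately cruder than a production detector would be.

```typescript
// True if `pattern` matches any line with 0-based index in [from, to], clamped to the file.
function windowHas(lines: string[], from: number, to: number, pattern: RegExp): boolean {
  return lines.slice(Math.max(0, from), Math.max(0, to + 1)).some((l) => pattern.test(l));
}

// {@html X} sink: fire only when an LLM-ish marker appears within ±40 lines.
function findHtmlSinks(source: string): number[] {
  const lines = source.split("\n");
  const llmMarker = /marked\.parse|\brole\b|\btool\b|\bmessages\b/;
  const hits: number[] = [];
  lines.forEach((line, i) => {
    if (/\{@html\s+[^}]+\}/.test(line) && windowHas(lines, i - 40, i + 40, llmMarker)) {
      hits.push(i + 1); // report 1-based line numbers
    }
  });
  return hits;
}

// db.prepare(args.X) sink: suppress when *_ALLOWLIST.has(args.X) appears in the prior 30 lines.
function findPrepareSinks(source: string): number[] {
  const lines = source.split("\n");
  const hits: number[] = [];
  lines.forEach((line, i) => {
    const m = /db\.prepare\(\s*args\.(\w+)/.exec(line);
    if (!m) return;
    const guard = new RegExp(`\\w*_ALLOWLIST\\.has\\(\\s*args\\.${m[1]}\\s*\\)`);
    if (!windowHas(lines, i - 30, i - 1, guard)) hits.push(i + 1);
  });
  return hits;
}
```

A line-window gate is a proximity heuristic, not a dataflow proof: a guard that happens to sit nearby but on a different code path would still suppress the finding. That trade-off is the price of staying regex-only.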
All five detectors ship with passing positive tests and negative tests. The Anthropic system-merge detector specifically tests against a static-string assignment to make sure we don’t over-fire on the common safe case.
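As an illustration of what those positive and negative cases look like for the system-merge detector, here is a rough sketch. It assumes single-line template literals and simple const/let/var assignments; findSystemMerges is a hypothetical name and this is not the shipped scanAnthropicSystemMerge.

```typescript
// Hypothetical sketch of a top-level system: merge detector (line-based, single-line literals only).
function findSystemMerges(source: string): number[] {
  const lines = source.split("\n");
  // Regex source for a template literal containing at least one ${...} interpolation.
  const interpolated = "`[^`]*\\$\\{[^`]*`";
  const inline = new RegExp(`\\bsystem:\\s*${interpolated}`);
  const indirect = /\bsystem:\s*([A-Za-z_]\w*)\s*(?:[,}]|$)/;
  const hits: number[] = [];
  lines.forEach((line, i) => {
    // Case 1: system: `...${user}...` written inline.
    if (inline.test(line)) {
      hits.push(i + 1);
      return;
    }
    // Case 2: system: someIdent, where someIdent was assigned an interpolated literal earlier.
    const m = indirect.exec(line);
    if (!m) return;
    const assigned = new RegExp(`\\b(?:const|let|var)\\s+${m[1]}\\s*=\\s*${interpolated}`);
    if (lines.slice(0, i).some((l) => assigned.test(l))) hits.push(i + 1);
  });
  return hits;
}

// Positive: inline interpolation.
const inlineCase = "await anthropic.messages.create({ system: `You help ${user.name}.`, messages });";
// Positive: indirect form via an earlier interpolated assignment.
const indirectCase = [
  "const systemPrompt = `Context: ${profile.bio}`;",
  "await anthropic.messages.create({ system: systemPrompt, messages });",
].join("\n");
// Negative: static string assignment, the common safe case.
const staticCase = [
  "const systemPrompt = `You are a terse assistant.`;",
  "await anthropic.messages.create({ system: systemPrompt, messages });",
].join("\n");
```

The static case is the one worth pinning in a test: it shares every token with the indirect positive except the interpolation, so a detector that keys only on system: plus an identifier would over-fire on it.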
The numbers
Repo                  Metric     0.5.2  0.5.3  Delta
cst-nextjs-chat       recall     38%    38%    0pp (no SvelteKit/Anthropic code)
cst-vite-rag          recall     40%    40%    0pp (same reason)
cst-sveltekit-stream  recall     7%     36%    +29pp
cst-sveltekit-stream  precision  50%    100%   +50pp (allowlist guard FP cleared)
Repos #1 and #2 held flat because 0.5.3’s new detectors are SvelteKit/Anthropic-specific — they don’t fire on Next/OpenAI shapes, which is the correct behavior. The point of stack-aware detectors is precisely that they’re narrow.
What still doesn’t fire (and why)
Repo #3 sits at 36% recall after 0.5.3. Nine vulnerabilities remain undetected, including:
- PII in profile-context.ts: the canonical JSON.stringify(profile) pattern. The existing detector requires an LLM-call marker (messages: / chat.completions / etc.) within ±20 lines. The profile-context module is a pure helper imported by routes, so the LLM call lives in a different file. Real bug, real shape, structurally invisible to the current detector. Candidate for v0.5.4.
- PII in SvelteKit load function: SELECT * from a user table returned via +layout.server.ts’s load(). The full row is serialized into the page payload AND becomes available to downstream prompt builders. No current detector for the DB-side leak shape.
- Svelte onMount without onDestroy (streaming without cleanup): the Svelte 5 hook analogue of the React useEffect leak the corpus also hasn’t caught yet. Adjacent to Svelte 5 rune patterns generally.
- Type-laundered roles: parsed.rawMessages.map((m) => ({ role: m.role as any, content: m.content })), an arbitrary role string smuggled past Zod via a type cast. Requires the detector either to look at Zod schemas (role: z.string()) or to track as any casts adjacent to role-bearing structures.
The methodology debt this surfaced
The 0.5.3 scanPublicEnvKey detector fires at the import line (line 1): import { PUBLIC_ANTHROPIC_API_KEY } from '$env/static/public'. The truth label for the same bug is at the usage line (line 7), where the variable is actually read. The scorer’s ±5-line tolerance can’t close a 6-line gap, so the detector’s correct find is scored as out-of-scope. That’s a labeling issue, not a detector bug: the truth needs a span (line 1 to line 7) to credit both the import and the usage as a single labeled vulnerability. Truth-version bump owed: cst-sveltekit-stream v0.1.1.
There is a similar issue with the Anthropic stream detector. It fires at line 20, on the .messages.stream( call, but the truth label for the unbounded-stream bug spans lines 28-36 (the for-await loop). The detector’s hit lands inside an adjacent label (unsafe-role-merge at lines 20-25) instead. Line overlap counts as a TP regardless of category, which is the right call, but the per-category recall for unbounded-stream reads as 0% even though the bug was caught. Same fix: widen the truth label to span lines 20-36.
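The arithmetic behind both labeling gaps can be checked with a small sketch. It assumes the scorer credits a finding when its line falls inside the label's span widened by the tolerance on each side; that is our reading of "±5-line tolerance", not the scorer's actual code, and the Label, Finding, credits and categoryRecall names are hypothetical.

```typescript
interface Label { category: string; start: number; end: number } // inclusive, 1-based lines
interface Finding { category: string; line: number }

// Assumed rule: a finding is credited if it lands within [start - tolerance, end + tolerance].
function credits(f: Finding, label: Label, tolerance: number): boolean {
  return f.line >= label.start - tolerance && f.line <= label.end + tolerance;
}

// Per-category recall: share of a category's labels credited by any finding, regardless of the finding's category.
function categoryRecall(findings: Finding[], labels: Label[], category: string, tolerance: number): number {
  const inCategory = labels.filter((l) => l.category === category);
  if (inCategory.length === 0) return 0;
  const caught = inCategory.filter((l) => findings.some((f) => credits(f, l, tolerance)));
  return caught.length / inCategory.length;
}

const TOLERANCE = 5;

// Case 1: import on line 1, truth label on line 7.
const importFinding: Finding = { category: "client-side-llm-key", line: 1 };
const pointLabel: Label = { category: "client-side-llm-key", start: 7, end: 7 };
const spanLabel: Label = { category: "client-side-llm-key", start: 1, end: 7 };
const creditedBefore = credits(importFinding, pointLabel, TOLERANCE); // false: 1 < 7 - 5
const creditedAfter = credits(importFinding, spanLabel, TOLERANCE);   // true

// Case 2: stream detector fires on line 20, truth label covers the for-await loop on 28-36.
const streamFinding: Finding = { category: "unbounded-stream", line: 20 };
const roleMerge: Label = { category: "unsafe-role-merge", start: 20, end: 25 };
const narrowStream: Label = { category: "unbounded-stream", start: 28, end: 36 };
const wideStream: Label = { category: "unbounded-stream", start: 20, end: 36 };
const recallBefore = categoryRecall([streamFinding], [roleMerge, narrowStream], "unbounded-stream", TOLERANCE); // 0
const recallAfter = categoryRecall([streamFinding], [roleMerge, wideStream], "unbounded-stream", TOLERANCE);    // 1
```

In both cases the detector output is unchanged; only the truth span moves, which is why the fix belongs in the truth file rather than in the detector.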
The loop, repo by repo
Each Tier C repo surfaces a different set of detector gaps. The pattern is becoming clear: shared categories give you a common vocabulary, but the implementation patterns are stack-specific. A “catch all unsafe-role-merge bugs” detector is too abstract to ship. What you ship is per-SDK, per-framework detectors that compose into the abstract category coverage promise.
Three repos shipped, three more to go: cst-express-agent (Express + Anthropic tool-calling, expected to surface client-component-roles patterns + multi-tool chain risks), cst-fastapi-tools (FastAPI Python, expected to surface async-streaming + Pydantic-trust patterns), and cst-crewai-multiagent (CrewAI multi-agent, expected to surface agent-to-agent role-merge + tool-orchestration risks). Each will get its own calibration cycle and a follow-up post.
The benchmark is open
CodeSecBench is MIT-licensed end to end — the corpus, the truth file, the scorer. If you’re building a SAST tool that targets AI-app patterns, clone the repos, run your tool, score against the truth, open a PR to add your numbers to the leaderboard. We’d especially welcome runs from teams whose tools approach these patterns differently (taint analysis, dataflow, LLM-as-judge). The detector landscape for AI-app security is young — if the benchmark surfaces something the regex approach can’t reach, that’s the most useful data we’ll get.
Previous in this series: How a public AI-app SAST benchmark made our detector 15% better in one afternoon.