CodeSecBench

June 7, 2026 · Fafa Agbetsise · 0.5.2 → 0.5.3

What SvelteKit + Anthropic taught our SAST detector that Next + OpenAI couldn't

Adding cst-sveltekit-stream to CodeSecBench Tier C dropped getdebug's recall from 38% to 7% on first scan. The same six AI-app categories, the same bugs — but a different framework + SDK changed every pattern shape. Here's what we shipped in 0.5.3 to close the gap.

ai-app-security benchmark calibration sveltekit anthropic

Yesterday we shipped getdebug 0.5.2 with two new regex detectors after measuring the first two CodeSecBench Tier C repos (a Next 14 + Vercel AI SDK chatbot and a Vite + Express + LangChain.js RAG dashboard). Recall jumped 15 percentage points on each. Felt good. Then we added the third repo — a SvelteKit + Anthropic streaming app — and recall dropped to 7%.

This is the entire point of the corpus.

Same six categories. Different stack. Different patterns.

The bugs in cst-sveltekit-stream were the same six categories that the other two repos cover — client-side-llm-key, prompt-injection, unsafe-tool-output, unsafe-role-merge, pii-in-prompt, unbounded-stream. The detector knew about all six. But every concrete pattern in the new repo had a different shape:

Pattern                          Next/Express/OpenAI         SvelteKit/Anthropic
──────────────────────────────────────────────────────────────────────────────────
LLM key in client bundle         NEXT_PUBLIC_OPENAI_API_KEY  $env/static/public import
LLM key in response body         Response.json({apiKey:      json({apiKey: BARE_IDENT})
                                  process.env.X_API_KEY})
LLM operator channel             role: 'system' in           system: param at the
                                  messages[]                  top level of the call
HTML rendered from LLM output    dangerouslySetInnerHTML     {@html ...}
Streaming LLM call               stream: true                anthropic.messages.stream()
Unbounded client read            useEffect with no cleanup   onMount with no onDestroy

Five of the six existing detectors structurally could not see the SvelteKit / Anthropic shape. The detector for role: 'system' inside messages[] cannot find anything when Anthropic puts the system prompt in a separate top-level system: parameter — it’s not a regex distance problem, it’s an entirely different shape. Same for {@html ...} vs dangerouslySetInnerHTML.
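To make the shape difference concrete, here is a minimal sketch. The regex is a hypothetical stand-in for the existing role: 'system' detector (getdebug's actual pattern is not shown in this post), and the two call snippets, including the MODEL, userBio and question names, are illustrative rather than taken from the corpus.

```typescript
// Hypothetical stand-in for the existing detector: it anchors on a
// `role: 'system'` entry, which only exists in OpenAI-style messages[].
const SYSTEM_ROLE_RE = /role\s*:\s*['"]system['"]/;

// OpenAI-style call: the operator prompt is an element of messages[].
const openAiShape = [
  "await openai.chat.completions.create({",
  "  model: MODEL,",
  "  messages: [",
  "    { role: 'system', content: `You are a bot. ${userBio}` },",
  "    { role: 'user', content: question },",
  "  ],",
  "});",
].join("\n");

// Anthropic-style call: the operator prompt is a top-level `system` parameter.
const anthropicShape = [
  "await anthropic.messages.create({",
  "  model: MODEL,",
  "  max_tokens: 1024,",
  "  system: `You are a bot. ${userBio}`,",
  "  messages: [{ role: 'user', content: question }],",
  "});",
].join("\n");

// The same bug (user data merged into the operator channel) is present in both.
const matchesOpenAi = SYSTEM_ROLE_RE.test(openAiShape);
const matchesAnthropic = SYSTEM_ROLE_RE.test(anthropicShape);
```

Here matchesOpenAi is true and matchesAnthropic is false. No amount of widening the match distance helps, because the token the pattern anchors on never appears in the second call.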

7% recall on first scan. One TP (the canonical db.prepare(args.sql).all() SQL injection sink, caught by 0.5.2’s args.X detector). One false positive (the safe sibling db.prepare call inside an allowlist guard — the detector couldn’t see the guard).

What 0.5.3 ships

Five new detectors plus one precision fix:

  • scanPublicEnvKey — catches import { PUBLIC_X_API_KEY } from '$env/static/public'. The SvelteKit / Astro / Nuxt equivalent of the NEXT_PUBLIC_ mistake. Same build-time inlining contract, different framework idiom.
  • scanKeyInResponseSvelte — catches return json({apiKey: ANTHROPIC_API_KEY}) where the key was destructured from $env/static/private at import time. 0.5.2’s version anchors on process.env.X_API_KEY directly; this one accepts bare-identifier values shaped like API keys.
  • scanSvelteHtmlSink — catches Svelte’s {@html X} directive, gated by a ±40-line context check for marked.parse / role / tool / messages references so static-content {@html ...} calls don’t fire.
  • scanAnthropicSystemMerge — catches Anthropic’s top-level system: parameter. Two cases handled: inline template literals system: `...${user}...` , and the indirect form system: systemPrompt where systemPrompt was assigned a template-literal-with-interpolation earlier in the file.
  • scanAnthropicUnboundedStream — catches .messages.stream( method calls with the same abort-in-scope gate the existing stream: true detector uses.
  • Precision fix: db.prepare(args.X) inside an allowlist guard (*_ALLOWLIST.has(args.X) within the prior 30 lines) no longer fires. The 0.5.2 false positive on cst-sveltekit-stream/src/lib/server/agent-tools/database-safe.ts is cleared.
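Two of the items above rely on a line-window gate: the {@html} detector fires only when LLM-related context is nearby, and the db.prepare detector is suppressed when an allowlist guard precedes it. The sketch below shows one way such gates could work. The function names, regexes and Finding type are assumptions made for illustration, not getdebug's actual implementation; only the window sizes (±40 lines, prior 30 lines) and the marker names come from the descriptions above.

```typescript
interface Finding {
  line: number; // 1-indexed line of the match
  detector: string;
}

// True when any line in [idx - before, idx + after] matches `re`.
function windowHas(lines: string[], idx: number, before: number, after: number, re: RegExp): boolean {
  const start = Math.max(0, idx - before);
  const end = Math.min(lines.length, idx + after + 1);
  return lines.slice(start, end).some((text) => re.test(text));
}

// Illustrative patterns (assumed, not the shipped ones).
const HTML_SINK_RE = /\{@html\s+[^}]+\}/;
const LLM_CONTEXT_RE = /marked\.parse|\brole\b|\btool\b|\bmessages\b/;
const PREPARE_ARGS_RE = /\.prepare\(\s*args\.\w+/;
const ALLOWLIST_GUARD_RE = /\w*_ALLOWLIST\.has\(\s*args\.\w+\s*\)/;

// Positive gate: a sink only counts when LLM context sits within ±40 lines.
function scanSvelteHtmlSinkSketch(source: string): Finding[] {
  const lines = source.split("\n");
  const findings: Finding[] = [];
  lines.forEach((text, idx) => {
    if (HTML_SINK_RE.test(text) && windowHas(lines, idx, 40, 40, LLM_CONTEXT_RE)) {
      findings.push({ line: idx + 1, detector: "svelte-html-sink" });
    }
  });
  return findings;
}

// Negative gate: a sink is dropped when a guard sits within the prior 30 lines.
function scanPrepareArgsSketch(source: string): Finding[] {
  const lines = source.split("\n");
  const findings: Finding[] = [];
  lines.forEach((text, idx) => {
    if (PREPARE_ARGS_RE.test(text) && !windowHas(lines, idx, 30, 0, ALLOWLIST_GUARD_RE)) {
      findings.push({ line: idx + 1, detector: "prepare-args" });
    }
  });
  return findings;
}
```

A line window is cheap but blunt: it cannot tell whether the guard actually dominates the sink in control flow, only that the two appear near each other.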

All five detectors ship with passing positive tests and negative tests. The Anthropic system-merge detector specifically tests against a static-string assignment to make sure we don’t over-fire on the common safe case.
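As a rough illustration of the system-merge cases and the static-string negative test, the sketch below handles the inline and indirect forms with plain regexes. The function name, the regexes and the single-match simplification are assumptions for this example; the shipped detector may work differently.

```typescript
// Inline form: system: `...${x}...` (template literal with interpolation).
const INLINE_SYSTEM_RE = /\bsystem\s*:\s*`[^`]*\$\{[^`]*`/;
// Indirect form: system: someIdentifier (a bare identifier, not a call or literal).
const INDIRECT_SYSTEM_RE = /\bsystem\s*:\s*([A-Za-z_$][\w$]*)\s*(?=[,}\r\n]|$)/;

// Escape regex metacharacters so an identifier like `$prompt` is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Simplification: only the first `system: ident` occurrence is inspected.
function hasSystemMergeSketch(source: string): boolean {
  if (INLINE_SYSTEM_RE.test(source)) return true;

  const indirect = INDIRECT_SYSTEM_RE.exec(source);
  if (indirect === null) return false;
  const ident = indirect[1];
  if (ident === undefined) return false;

  // Look only at code before the call for an interpolated template assignment.
  const earlier = source.slice(0, indirect.index);
  const assignRe = new RegExp(
    "\\b(?:const|let|var)\\s+" + escapeRegExp(ident) + "\\s*=\\s*`[^`]*\\$\\{[^`]*`"
  );
  return assignRe.test(earlier);
}
```

Under these assumptions, a systemPrompt assigned from a plain quoted string, or from a template literal with no ${...} in it, does not fire, which is the common safe case the negative test is meant to protect.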

The numbers

Repo                        0.5.2 recall   0.5.3 recall   Delta
cst-nextjs-chat                 38%            38%        0pp (no SvelteKit/Anthropic code)
cst-vite-rag                    40%            40%        0pp (same reason)
cst-sveltekit-stream             7%            36%       +29pp

Repo #3 precision               50%           100%       +50pp (allowlist guard FP cleared)

Repos #1 and #2 held flat because 0.5.3’s new detectors are SvelteKit/Anthropic-specific — they don’t fire on Next/OpenAI shapes, which is the correct behavior. The point of stack-aware detectors is precisely that they’re narrow.
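For readers who want to sanity-check the Repo #3 row, the arithmetic can be reproduced as below. The post does not state the total number of labeled vulnerabilities; the figure of 14 is an inference from the reported percentages together with the nine findings listed as still undetected in the next section, and the 0.5.3 count of five true positives with no false positives is inferred the same way.

```typescript
// Rounded percentage helper.
function pct(numerator: number, denominator: number): number {
  return Math.round((100 * numerator) / denominator);
}

// ASSUMPTION: 14 labeled vulnerabilities in cst-sveltekit-stream,
// inferred from the reported numbers rather than stated in the post.
const TOTAL_LABELED = 14;

// 0.5.2: one true positive, one false positive (both stated in the post).
const recall052 = pct(1, TOTAL_LABELED);
const precision052 = pct(1, 1 + 1);

// 0.5.3: ASSUMPTION of five true positives and zero false positives.
const recall053 = pct(5, TOTAL_LABELED);
const precision053 = pct(5, 5 + 0);
```

With these inputs the four values come out as 7, 50, 36 and 100, matching the table.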

What still doesn’t fire (and why)

Repo #3 sits at 36% recall after 0.5.3. Nine vulnerabilities remain undetected, including:

  • PII in profile-context.ts — the canonical JSON.stringify(profile) pattern. The existing detector requires an LLM-call marker (messages: / chat.completions / etc.) within ±20 lines. The profile-context module is a pure helper imported by routes — the LLM call lives in a different file. Real bug, real shape, structurally invisible to the current detector. Candidate for v0.5.4.
  • PII in SvelteKit load function: SELECT * from a user table returned via +layout.server.ts’s load(). The full row is serialized into the page payload and also becomes available to downstream prompt builders. No current detector covers the DB-side leak shape.
  • Svelte onMount without onDestroy (streaming without cleanup): the Svelte 5 hook analogue of the React useEffect leak, which the detector also hasn’t caught anywhere in the corpus yet. Adjacent to Svelte 5 rune patterns generally.
  • Type-laundered roles: parsed.rawMessages.map((m) => ({ role: m.role as any, content: m.content })), an arbitrary role string smuggled past Zod via a type cast. Catching it requires the detector either to look at Zod schemas (role: z.string()) or to track as any casts adjacent to role-bearing structures.
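The second option for type-laundered roles, tracking as any casts next to a role field, could start from something as small as the sketch below. This is a hypothetical candidate, not a shipped getdebug detector, and the regex and function name are invented for illustration.

```typescript
// Flags `role: <expression> as any` within a single object-literal field.
// The lazy [^,}\n]+? keeps the match inside the role field itself.
const ROLE_CAST_RE = /\brole\s*:\s*[^,}\n]+?\bas\s+any\b/;

// Returns the 1-indexed line numbers where a role value is cast to `any`.
function findRoleCastsSketch(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((text, idx) => {
    if (ROLE_CAST_RE.test(text)) hits.push(idx + 1);
  });
  return hits;
}
```

A cast on a neighbouring field (for example content: body as any) is deliberately ignored, since the comma ends the role field. A single-line regex would still miss casts split across lines, which is one reason the Zod-schema route may be the more robust of the two.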

The methodology debt this surfaced

The 0.5.3 publicEnvKey detector fires at the import line (line 1): import { PUBLIC_ANTHROPIC_API_KEY } from '$env/static/public'. The truth label for the same bug is at the usage line (line 7) where the var is actually read. The scorer’s ±5-line tolerance can’t close a 6-line gap, so the detector’s correct find lands as out-of-scope. That’s a labeling issue, not a detector bug — the truth needs a span (line 1 to line 7) to credit both the import and the usage as a single labeled vulnerability. Truth-version bump owed: cst-sveltekit-stream v0.1.1.
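The effect of switching from a point label to a span label can be sketched as follows. How the real scorer applies its tolerance is not described in this post, so the isCredited rule here (finding line within the label span, widened by the tolerance on both sides) is an assumption, as are the type and field names.

```typescript
interface TruthLabel {
  category: string;
  startLine: number;
  endLine: number;
}

// ASSUMED scoring rule: a finding is credited when its line falls inside
// the label span extended by `tolerance` lines on each side.
function isCredited(findingLine: number, label: TruthLabel, tolerance: number): boolean {
  return findingLine >= label.startLine - tolerance && findingLine <= label.endLine + tolerance;
}

const TOLERANCE = 5;

// Current truth: a point label at the usage line.
const pointLabel: TruthLabel = { category: "client-side-llm-key", startLine: 7, endLine: 7 };
// Proposed truth: a span covering both the import and the usage.
const spanLabel: TruthLabel = { category: "client-side-llm-key", startLine: 1, endLine: 7 };

// The detector fires on the import at line 1.
const creditedWithPoint = isCredited(1, pointLabel, TOLERANCE);
const creditedWithSpan = isCredited(1, spanLabel, TOLERANCE);
```

Under this rule creditedWithPoint is false (line 1 is six lines from line 7) and creditedWithSpan is true, with no change to the detector or the tolerance.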

Similar issue on the Anthropic stream detector: it fires at line 20 (the .messages.stream( call), but the truth label for the unbounded-stream bug spans lines 28-36 (the for-await loop). The detector’s hit lands inside an adjacent label (unsafe-role-merge at lines 20-25) instead. The scorer counts any line overlap as a TP regardless of category, which is the right call, but the per-category recall for unbounded-stream reads as 0% even though the bug was caught. Same fix: widen the truth label to span lines 20-36.
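A small sketch of why the per-category number misleads, and how widening the label changes it. The overlap rule (finding line inside the label span, no tolerance) and the idea that one hit may credit every label it overlaps are simplifying assumptions for this example; the real scorer may differ.

```typescript
interface Label {
  category: string;
  startLine: number;
  endLine: number;
}

interface Hit {
  category: string;
  line: number;
}

// ASSUMED overlap rule: the hit line falls inside the label span.
function overlaps(hit: Hit, label: Label): boolean {
  return hit.line >= label.startLine && hit.line <= label.endLine;
}

// Categories of all labels touched by at least one hit, ignoring the hit's own category.
function creditedCategories(hits: Hit[], labels: Label[]): string[] {
  return labels
    .filter((label) => hits.some((hit) => overlaps(hit, label)))
    .map((label) => label.category);
}

// The stream detector's single hit on the .messages.stream( call.
const streamHits: Hit[] = [{ category: "unbounded-stream", line: 20 }];

const currentLabels: Label[] = [
  { category: "unsafe-role-merge", startLine: 20, endLine: 25 },
  { category: "unbounded-stream", startLine: 28, endLine: 36 },
];

const widenedLabels: Label[] = [
  { category: "unsafe-role-merge", startLine: 20, endLine: 25 },
  { category: "unbounded-stream", startLine: 20, endLine: 36 },
];

const creditedNow = creditedCategories(streamHits, currentLabels);
const creditedAfterWidening = creditedCategories(streamHits, widenedLabels);
```

With the current labels only unsafe-role-merge is credited, so unbounded-stream shows no hits. With the widened label both categories are credited by the same line-20 finding.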

The loop, repo by repo

Each Tier C repo surfaces a different set of detector gaps. The pattern is becoming clear: shared categories give you a common vocabulary, but the implementation patterns are stack-specific. A “catch all unsafe-role-merge bugs” detector is too abstract to ship. What you ship is per-SDK, per-framework detectors that compose into the abstract category coverage promise.

Three repos shipped, three more to go: cst-express-agent (Express + Anthropic tool-calling, expected to surface client-component-roles patterns + multi-tool chain risks), cst-fastapi-tools (FastAPI Python, expected to surface async-streaming + Pydantic-trust patterns), and cst-crewai-multiagent (CrewAI multi-agent, expected to surface agent-to-agent role-merge + tool-orchestration risks). Each will get its own calibration cycle and a follow-up post.

The benchmark is open

CodeSecBench is MIT-licensed end to end — the corpus, the truth file, the scorer. If you’re building a SAST tool that targets AI-app patterns, clone the repos, run your tool, score against the truth, open a PR to add your numbers to the leaderboard. We’d especially welcome runs from teams whose tools approach these patterns differently (taint analysis, dataflow, LLM-as-judge). The detector landscape for AI-app security is young — if the benchmark surfaces something the regex approach can’t reach, that’s the most useful data we’ll get.

Previous in this series: How a public AI-app SAST benchmark made our detector 15% better in one afternoon.