Blog
Calibration cycles + methodology notes
Each calibration cycle pairs a new corpus repo with the detector changes it surfaced. Methodology notes, span-label decisions, and tool submissions are posted as they land. Subscribe via RSS.
-
June 7, 2026 · 0.5.1 → 0.5.2
How a public AI-app SAST benchmark made our detector 15% better in one afternoon
We built CodeSecBench Tier C: six deliberately vulnerable AI apps with 149 labeled bugs across all six AI-app security categories. Running getdebug 0.5.1 against the first two repos surfaced gaps we could fix in code the same day. Here's the loop, the numbers, and what 0.5.2 ships.
ai-app-security benchmark calibration
-
June 7, 2026 · 0.5.2 → 0.5.3
What SvelteKit + Anthropic taught our SAST detector that Next + OpenAI couldn't
Adding cst-sveltekit-stream to CodeSecBench Tier C dropped getdebug's recall from 38% to 7% on first scan. The same six AI-app categories, the same bugs — but a different framework + SDK changed every pattern shape. Here's what we shipped in 0.5.3 to close the gap.
ai-app-security benchmark calibration sveltekit anthropic
-
June 7, 2026 · 0.5.3 → 0.5.4
Targeting the weakest category: how cst-express-agent took our unsafe-tool-output detector from 42% to 76%
When you build a benchmark, you can see which detector categories are weakest. We saw unsafe-tool-output sitting at 42% mean recall across three repos. So we built repo #4 to specifically stress that surface: an Express + Anthropic agent with 6 different tool-output sink shapes. Here's what 0.5.4 ships, and what the data tells us to target next.
ai-app-security benchmark calibration agent-tools