CodeSecBench

CodeSecBench: blog
Calibration cycles, methodology decisions, and tool-submission writeups for the CodeSecBench public AI-app SAST benchmark.
https://codesecbench.org/en-us

How a public AI-app SAST benchmark made our detector 15% better in one afternoon
https://codesecbench.org/blog/codesecbench-tier-c-0-5-2-calibration/
We built CodeSecBench Tier C: six deliberately-vulnerable AI apps with 149 labeled bugs across all six AI-app security categories. Running getdebug 0.5.1 against the first two repos surfaced gaps we could fix in code the same day. Here's the loop, the numbers, and what 0.5.2 ships.
Published: Sun, 07 Jun 2026 00:00:00 GMT
Tags: ai-app-security, benchmark, calibration
Author: Fafa Agbetsise

What SvelteKit + Anthropic taught our SAST detector that Next + OpenAI couldn't
https://codesecbench.org/blog/codesecbench-tier-c-0-5-3-sveltekit/
Adding cst-sveltekit-stream to CodeSecBench Tier C dropped getdebug's recall from 38% to 7% on first scan. The same six AI-app categories, the same bugs, but a different framework + SDK changed every pattern shape. Here's what we shipped in 0.5.3 to close the gap.
Published: Sun, 07 Jun 2026 00:00:00 GMT
Tags: ai-app-security, benchmark, calibration, sveltekit, anthropic
Author: Fafa Agbetsise

Targeting the weakest category: how cst-express-agent took our unsafe-tool-output detector from 42% to 76%
https://codesecbench.org/blog/codesecbench-tier-c-0-5-4-agent-tools/
When you build a benchmark, you can see which detector categories are weakest. We saw unsafe-tool-output sitting at 42% mean recall across three repos. So we built repo #4 to specifically stress that surface: an Express + Anthropic agent with 6 different tool-output sink shapes. Here's what 0.5.4 ships, and what the data tells us to target next.
Published: Sun, 07 Jun 2026 00:00:00 GMT
Tags: ai-app-security, benchmark, calibration, agent-tools
Author: Fafa Agbetsise