<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>CodeSecBench — blog</title><description>Calibration cycles, methodology decisions, and tool-submission writeups for the CodeSecBench public AI-app SAST benchmark.</description><link>https://codesecbench.org/</link><language>en-us</language><item><title>How a public AI-app SAST benchmark made our detector 15% better in one afternoon</title><link>https://codesecbench.org/blog/codesecbench-tier-c-0-5-2-calibration/</link><guid isPermaLink="true">https://codesecbench.org/blog/codesecbench-tier-c-0-5-2-calibration/</guid><description>We built CodeSecBench Tier C — six deliberately-vulnerable AI apps with 149 labeled bugs across all six AI-app security categories. Running getdebug 0.5.1 against the first two repos surfaced gaps we could fix in code the same day. Here&apos;s the loop, the numbers, and what 0.5.2 ships.</description><pubDate>Sun, 07 Jun 2026 00:00:00 GMT</pubDate><category>ai-app-security</category><category>benchmark</category><category>calibration</category><author>Fafa Agbetsise</author></item><item><title>What SvelteKit + Anthropic taught our SAST detector that Next + OpenAI couldn&apos;t</title><link>https://codesecbench.org/blog/codesecbench-tier-c-0-5-3-sveltekit/</link><guid isPermaLink="true">https://codesecbench.org/blog/codesecbench-tier-c-0-5-3-sveltekit/</guid><description>Adding cst-sveltekit-stream to CodeSecBench Tier C dropped getdebug&apos;s recall from 38% to 7% on first scan. The same six AI-app categories, the same bugs — but a different framework + SDK changed every pattern shape. 
Here&apos;s what we shipped in 0.5.3 to close the gap.</description><pubDate>Sun, 07 Jun 2026 00:00:00 GMT</pubDate><category>ai-app-security</category><category>benchmark</category><category>calibration</category><category>sveltekit</category><category>anthropic</category><author>Fafa Agbetsise</author></item><item><title>Targeting the weakest category: how cst-express-agent took our unsafe-tool-output detector from 42% to 76%</title><link>https://codesecbench.org/blog/codesecbench-tier-c-0-5-4-agent-tools/</link><guid isPermaLink="true">https://codesecbench.org/blog/codesecbench-tier-c-0-5-4-agent-tools/</guid><description>When you build a benchmark, you can see which detector categories are weakest. We saw unsafe-tool-output sitting at 42% mean recall across three repos. So we built repo #4 to specifically stress that surface: an Express + Anthropic agent with six different tool-output sink shapes. Here&apos;s what 0.5.4 ships, and what the data tells us to target next.</description><pubDate>Sun, 07 Jun 2026 00:00:00 GMT</pubDate><category>ai-app-security</category><category>benchmark</category><category>calibration</category><category>agent-tools</category><author>Fafa Agbetsise</author></item></channel></rss>