Blog

Calibration cycles + methodology notes

Each calibration cycle pairs a new corpus repo with the detector changes it surfaced. Methodology notes, span-label decisions, and tool submissions are added as they land. Subscribe via RSS.

June 12, 2026 · 0.5.4 → 0.5.5

The first Python target: how cst-fastapi-tools took our async-streaming detectors from 38% to 100%

Five calibration cycles in, the corpus was still all JavaScript and TypeScript. cst-fastapi-tools is the first Python repo — built UBS-heavy to stress the one surface Python makes uniquely error-prone: async streaming and cancellation. Here's the nine-detector wave 0.5.5 ships, why the explanatory comments in the fixtures forced us to strip comments before detecting, and the methodology bug we fixed in the scorer along the way.

ai-app-security benchmark calibration python fastapi streaming
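
The comment-stripping step mentioned in this entry can be illustrated with a minimal sketch. It assumes the detectors are text-based and that findings need stable line and column positions; the function name `strip_comments` is hypothetical and is not taken from getdebug. It uses the standard-library tokenizer so that a `#` inside a string literal is left alone.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Blank out '#' comments while keeping every line and column in place.

    Hypothetical helper for illustration; not getdebug's implementation.
    Raises tokenize.TokenError or SyntaxError on source that does not tokenize.
    """
    lines = source.splitlines(keepends=True)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start          # rows are 1-based, columns 0-based
            end_col = tok.end[1]          # a comment never spans lines
            line = lines[row - 1]
            # Overwrite the comment with spaces so offsets stay unchanged.
            lines[row - 1] = line[:col] + " " * (end_col - col) + line[end_col:]
    return "".join(lines)
```

Replacing comments with spaces rather than deleting them is a design choice for this sketch: a pattern that matches on the stripped text still reports the same line and column as in the original file.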

June 12, 2026 · 0.5.5 → 0.5.6

The agent-framework blind spot: how cst-crewai-multiagent doubled our unsafe-role-merge recall

Two cycles ago, unsafe-role-merge was our weakest category at 8%. The 0.5.5 Python wave nudged it to 22%. The problem was that every URM detector we had looked for a dict — {"role": "system"}. CrewAI doesn't use a dict. An agent's backstory IS its system prompt; its role IS its authority; both are constructor keyword arguments. Here's the repo built to expose that, the three detectors that close it, and the one design trap we had to avoid: a Task description is not a vulnerability.

ai-app-security benchmark calibration python crewai agents
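
As a rough illustration of the shift this entry describes, from matching a `{"role": "system"}` dict to matching constructor keyword arguments, here is a minimal sketch built on Python's `ast` module. The function name, the `AUTHORITY_KWARGS` set, and the "non-constant value" heuristic are all assumptions made for this example; the three detectors shipped in 0.5.6 are not shown in this summary and may work differently.

```python
import ast

# Assumption: these Agent keyword arguments act as system-level text.
AUTHORITY_KWARGS = {"role", "backstory"}


def find_dynamic_agent_kwargs(source: str) -> list[tuple[int, str]]:
    """Return (line, kwarg) pairs where Agent(...) gets a non-literal
    role or backstory, e.g. an f-string that interpolates a variable.

    Hypothetical heuristic for illustration only.
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both Agent(...) and crewai.Agent(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "Agent":
            continue  # Task(description=...) is deliberately not flagged
        for kw in node.keywords:
            if kw.arg in AUTHORITY_KWARGS and not isinstance(kw.value, ast.Constant):
                findings.append((kw.value.lineno, kw.arg))
    return findings
```

The check is restricted to `Agent` calls so that a dynamic `Task` description, which the entry calls out as not a vulnerability, produces no finding.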

June 12, 2026 · 0.5.6 → 0.5.7

A whole new language: cst-go-agent and the .go detector path that moved every category

Seven repos in, the corpus was entirely Python and TypeScript. The detectors had never seen a line of Go. cst-go-agent forces a .go AI-app scanner path to exist at all — and because the Go target stresses all six categories at once, building it moved every single one. Here's the new detector file, the LLM-SDK gate that keeps getdebug's own CLI silent, and two precise bugs the first scan surfaced.

ai-app-security benchmark calibration go golang

June 12, 2026 · 0.5.7 → 0.5.8

Four languages, eight repos: cst-rails closes the Tier C corpus

The last repo of the run is Ruby on Rails — the fourth language. By now the routine is mechanical: a new language scores ~6% on first scan, a .rb detector path closes it, the corpus mean ticks up. cst-rails went 6% to 100%. Here's the Ruby wave, the three comment-poisoning bugs the first scan caught (again), and what the whole eight-repo arc actually proved.

ai-app-security benchmark calibration ruby rails retrospective

June 7, 2026 · 0.5.1 → 0.5.2

How a public AI-app SAST benchmark made our detector 15% better in one afternoon

We built CodeSecBench Tier C — six deliberately-vulnerable AI apps with 149 labeled bugs across all six AI-app security categories. Running getdebug 0.5.1 against the first two repos surfaced gaps we could fix in code the same day. Here's the loop, the numbers, and what 0.5.2 ships.

ai-app-security benchmark calibration

June 7, 2026 · 0.5.2 → 0.5.3

What SvelteKit + Anthropic taught our SAST detector that Next + OpenAI couldn't

Adding cst-sveltekit-stream to CodeSecBench Tier C dropped getdebug's recall from 38% to 7% on first scan. The same six AI-app categories, the same bugs — but a different framework + SDK changed every pattern shape. Here's what we shipped in 0.5.3 to close the gap.

ai-app-security benchmark calibration sveltekit anthropic

June 7, 2026 · 0.5.3 → 0.5.4

Targeting the weakest category: how cst-express-agent took our unsafe-tool-output detector from 42% to 76%

When you build a benchmark, you can see which detector categories are weakest. We saw unsafe-tool-output sitting at 42% mean recall across three repos. So we built repo #4 to specifically stress that surface: an Express + Anthropic agent with 6 different tool-output sink shapes. Here's what 0.5.4 ships, and what the data tells us to target next.

ai-app-security benchmark calibration agent-tools
