CodeSecBench

Leaderboard · Four sections

Results

Recall and precision per tool, per section. The methodology, span-label tolerance, hallucination accounting, and oracle confirmation rules are documented in /methodology. The corpus, truth files, and scoring harness go public once Tier C lands its final two repositories; the submit-a-tool flow will open on this page at that point.

Section A — JS/TS AI-app fixtures

15 hand-crafted JavaScript/TypeScript fixtures across six AI-app categories. Each fixture is either deliberately vulnerable or deliberately safe (near-miss). Scoring is per-row TP/FP/FN against labeled spans. Head-to-head: getdebug vs gitleaks vs trufflehog.

Tool Recall Precision TP FP FN
getdebug (graded tool) 75% 55% 6 5 2
gitleaks 25% 67% 2 1 6
trufflehog 0% 0% 0 0 8
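
The recall and precision columns follow from the TP/FP/FN counts. The sketch below assumes the standard definitions (recall = TP / (TP + FN), precision = TP / (TP + FP)) and assumes a ratio with a zero denominator is reported as 0%, as the trufflehog row suggests. The function name is illustrative and is not part of the scoring harness.

```python
def recall_precision(tp: int, fp: int, fn: int) -> tuple[int, int]:
    """Return (recall %, precision %) rounded to whole percents."""
    # Assumption: a ratio with a zero denominator is reported as 0.
    recall = 100 * tp / (tp + fn) if (tp + fn) else 0.0
    precision = 100 * tp / (tp + fp) if (tp + fp) else 0.0
    return round(recall), round(precision)


# (TP, FP, FN) per tool, copied from the Section A table.
section_a = {
    "getdebug": (6, 5, 2),
    "gitleaks": (2, 1, 6),
    "trufflehog": (0, 0, 8),
}

for tool, counts in section_a.items():
    print(tool, recall_precision(*counts))
```

Run against the Section A counts, this reproduces the published 75% / 55%, 25% / 67%, and 0% / 0% cells.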

Section B — Python AI-app fixtures

10 Python AI-app fixtures across five categories. Head-to-head: getdebug vs bandit vs semgrep. Bandit and Semgrep are the de facto Python SAST baselines.

Tool Recall Precision TP FP FN
getdebug (graded tool) 100% 100% 5 0 0
bandit 20% 50% 1 1 4
semgrep 20% 50% 1 1 4

Section C — App-shaped corpus (Tier C)

Six hand-authored AI-app repositories with span-labeled truth files, four of them baselined so far. Results are per-target recall and precision for each baselined version of getdebug. Multi-tool comparison opens once another vendor submits a result against the same truth files.
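
Scoring against span-labeled truth files comes down to deciding whether a reported finding lands on a labeled span. The actual tolerance rules live in /methodology. The Python sketch below only illustrates the idea, under loud assumptions: the record layout (file, category, line range), the tolerance of 2 lines, and the one-credit-per-span rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    file: str
    category: str
    start: int  # first labeled line (1-indexed)
    end: int    # last labeled line (inclusive)


@dataclass(frozen=True)
class Finding:
    file: str
    category: str
    line: int


def matches(finding: Finding, span: Span, tolerance: int = 2) -> bool:
    """True if the finding hits the span, allowing `tolerance` lines of slack."""
    return (
        finding.file == span.file
        and finding.category == span.category
        and span.start - tolerance <= finding.line <= span.end + tolerance
    )


def score(findings, truth, tolerance: int = 2) -> tuple[int, int, int]:
    """Return (TP, FP, FN); each truth span is credited at most once."""
    unmatched = list(truth)
    tp = fp = 0
    for f in findings:
        hit = next((s for s in unmatched if matches(f, s, tolerance)), None)
        if hit is None:
            fp += 1
        else:
            unmatched.remove(hit)
            tp += 1
    return tp, fp, len(unmatched)


# Hypothetical truth spans and findings, for illustration only.
truth = [
    Span("app/api/chat/route.ts", "prompt-injection", 10, 14),
    Span("lib/tools.ts", "unsafe-tool-output", 30, 32),
]
findings = [
    Finding("app/api/chat/route.ts", "prompt-injection", 15),  # within tolerance
    Finding("lib/tools.ts", "pii-in-prompt", 31),              # wrong category
]
print(score(findings, truth))
```

In this sketch a finding with the right line but the wrong category counts as a false positive and leaves the span unmatched, which is one way a detection can land under a neighbouring category instead of the labeled one.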

Calibration cycle 1 · 0.5.1 → 0.5.2

Per-category recall · Mean across cst-nextjs-chat + cst-vite-rag

The 0.5.2 detector wave targeted two patterns: execAsync(args.X) / sql.unsafe(args.X) (the canonical SDK tool-callable shape feeding shell or raw-SQL sinks), and server routes returning process.env.X_API_KEY in a JSON response body. Both patterns appeared in both repos, under different file paths.

Category 0.5.1 0.5.2 Δ Note
client-side-llm-key 50% 100% +50pp
unsafe-tool-output 0% 29% +29pp
prompt-injection 42% 42% flat
pii-in-prompt 50% 50% flat
unsafe-role-merge 0% 0% flat next target
unbounded-stream 25% 25% flat label-line debt on repo #1
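
To make the two shapes targeted by the 0.5.2 wave concrete, here is a rough line-based regex sketch in Python. It is not getdebug's detector logic, which this page does not describe. The patterns, the rule-to-category mapping, and the sample snippet are illustrative assumptions, and a real detector would likely need AST-level context rather than regexes.

```python
import re

# Hypothetical patterns approximating the two shapes described above.
TOOL_ARG_SINK = re.compile(r"\b(?:execAsync|sql\.unsafe)\(\s*args\.\w+")
ENV_KEY_IN_RESPONSE = re.compile(r"\bjson\([^)]*process\.env\.\w*API_KEY")


def scan(source: str) -> list[tuple[int, str]]:
    """Return (line number, assumed category) for each matching line."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if TOOL_ARG_SINK.search(line):
            hits.append((lineno, "unsafe-tool-output"))
        if ENV_KEY_IN_RESPONSE.search(line):
            hits.append((lineno, "client-side-llm-key"))
    return hits


# Made-up TypeScript lines; the last one is a safe near-miss (constant argument).
sample = """\
const out = await execAsync(args.command);
const rows = await sql.unsafe(args.query);
return Response.json({ key: process.env.OPENAI_API_KEY });
const safe = await execAsync("ls -la");
"""
print(scan(sample))
```

The first three lines are flagged and the constant-argument call is not, which mirrors the vulnerable versus near-miss split used in the fixtures.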

Calibration cycle 2 · 0.5.2 → 0.5.3

Per-category recall · cst-sveltekit-stream only

When cst-sveltekit-stream joined the corpus, recall dropped to 7% on first scan: the six bug categories were the same, but a different framework (SvelteKit) and a different SDK (Anthropic) changed every pattern shape. The 0.5.3 detector wave targeted five SvelteKit/Anthropic-specific shapes plus one precision fix. Repos #1 and #2 held flat: the new detectors are stack-specific by design and correctly don't fire on Next/OpenAI code.

Category 0.5.2 0.5.3 Δ Note
client-side-llm-key 0% 50% +50pp
unsafe-tool-output 33% 67% +34pp
unsafe-role-merge 0% 33% +33pp
prompt-injection 0% 50% +50pp
pii-in-prompt 0% 0% flat next target
unbounded-stream 0% 0% flat credit landed under unsafe-role-merge (label-span issue)

Calibration cycle 3 · 0.5.3 → 0.5.4

Per-category recall · Mean across all 4 shipped repos

cst-express-agent (repo #4) was built to be heavy on unsafe-tool-output (UTO) by design: 6 UTO carriers across 6 different sink shapes (shell exec, raw SQL, file write with path traversal, Function constructor eval, HTML render of LLM output, spawn with arg-splitting). The 0.5.4 detector wave extended the sink list and added fetch-without-signal and HTML-embed-key detection, plus a relaxation of the pii-in-prompt (PIP) context gate by file name.

Category 0.5.3 0.5.4 Δ Note
unsafe-tool-output 42% 76% +34pp targeted
pii-in-prompt 25% 50% +25pp
unbounded-stream 17% 38% +21pp
prompt-injection 42% 46% +4pp
client-side-llm-key 63% 63% flat
unsafe-role-merge 17% 8% -9pp now weakest — cst-crewai-multiagent target
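
The Δ column in the calibration tables is a difference in percentage points (pp), not a relative change. A small Python sketch of that arithmetic, using the cycle 3 figures above; the helper name is illustrative:

```python
def delta_pp(before: int, after: int) -> str:
    """Format the change between two percentages in percentage points."""
    diff = after - before
    return "flat" if diff == 0 else f"{diff:+d}pp"


# Recall per category as (0.5.3, 0.5.4), copied from the cycle 3 table.
cycle_3 = {
    "unsafe-tool-output": (42, 76),
    "pii-in-prompt": (25, 50),
    "unbounded-stream": (17, 38),
    "prompt-injection": (42, 46),
    "client-side-llm-key": (63, 63),
    "unsafe-role-merge": (17, 8),
}

for category, (before, after) in cycle_3.items():
    print(f"{category}: {delta_pp(before, after)}")
```

Read this way, the unsafe-role-merge drop from 17% to 8% is -9pp, even though in relative terms recall roughly halved.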

Per-target recall + precision, per version

Each cell is recall / precision. A dash (-) means the repo wasn't yet authored at that version.

Target 0.5.1 0.5.2 0.5.3 0.5.4
cst-nextjs-chat 23% / 75% 38% / 83% 38% / 83% 46% / 86%
cst-vite-rag - 40% / 86% 40% / 86% 47% / 88%
cst-sveltekit-stream - - 36% / 100% 50% / 100%
cst-express-agent - - - 59% / 100%

Multi-tool comparison opens once another vendor submits a result against the same truth files — see /governance.

Real-world repos — 24 GitHub projects

Finding-count comparison on 24 public AI-app repos. Three sub-categories: a known-leaky baseline (1 repo, ~150 planted secrets; high recall expected), popular references (3 repos; high precision expected, near-zero findings), and AI starters (20 repos; the noise-floor sample).

Synthetic recall isn't real-world recall

On a known-plant baseline, broader regex pattern sets win. On real, less-curated AI starter repos, lower false-positive rates win. Both numbers matter; neither alone is the whole picture.

Synthetic · leaky-baseline · ~150 planted secrets

Recall test

Tool Hits
getdebug 9
gitleaks 22
trufflehog 12

gitleaks ships the broadest regex pattern set today. Detector parity work targets closing this gap; the bench will track it.

Noise-floor · 23 less-curated + popular AI starters

Noise-floor test

Tool Hits Repos
getdebug 5 2/23
gitleaks 12 4/23
trufflehog 8 4/23

Lower is better here: a human has to triage every finding the scanner emits. CodeSecBench has done the manual classification for false-positive analysis; see /methodology.

Wall-clock per scan — median across 24 repos

Two tools, gitleaks and getdebug, sit in CI's comfort zone (sub-300ms median). trufflehog's standout feature is its live-API verifier, which this run disables for a fair shape-match comparison. With verification on, scan time goes up further and the finding set shrinks to verified secrets only.

Tool Median Min Max n
gitleaks 176ms 74ms 14289ms 24
getdebug 227ms 30ms 1624ms 24
trufflehog 1779ms 1666ms 3283ms 24
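
The wall-clock figures are medians with min and max alongside. A minimal Python sketch of that summary follows. The timings in it are made up for illustration and are not benchmark data, and the function name is hypothetical.

```python
import statistics


def summarize(durations_ms: list[float]) -> str:
    """Summarize per-repo scan times as median, min, max, and sample size."""
    return (
        f"median {statistics.median(durations_ms):.0f}ms, "
        f"min {min(durations_ms):.0f}ms, "
        f"max {max(durations_ms):.0f}ms, "
        f"n = {len(durations_ms)}"
    )


# Made-up timings: four quick repos and one very slow outlier.
sample_ms = [90, 150, 200, 310, 5000]

print(summarize(sample_ms))
print(f"mean {statistics.mean(sample_ms):.0f}ms")
```

The median barely moves when one repo is very slow, while the mean is dragged upward. That is why a tool can show a multi-second worst case and still post a sub-300ms median.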

Per-tool totals (across all 24 repos)

Tool Total findings Repos with findings
getdebug (graded tool) 14 3 / 24
gitleaks 34 5 / 24
trufflehog 20 5 / 24
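
The per-tool totals can be cross-checked against the two sub-tables: total findings should equal leaky-baseline hits plus noise-floor hits. A short Python sketch of that check, using only figures from this page; it assumes the single leaky-baseline repo counts as one extra repo with findings for any tool that hit something there.

```python
# Figures copied from the leaky-baseline, noise-floor, and per-tool totals tables.
leaky_hits = {"getdebug": 9, "gitleaks": 22, "trufflehog": 12}
noise_floor = {  # tool -> (hits, repos with findings out of 23)
    "getdebug": (5, 2),
    "gitleaks": (12, 4),
    "trufflehog": (8, 4),
}
reported_totals = {  # tool -> (total findings, repos with findings out of 24)
    "getdebug": (14, 3),
    "gitleaks": (34, 5),
    "trufflehog": (20, 5),
}


def combine(tool: str) -> tuple[int, int]:
    """Recompute (total findings, repos with findings) from the sub-tables."""
    hits, repos = noise_floor[tool]
    # Assumption: the baseline repo adds one repo if the tool found anything in it.
    baseline_repo = 1 if leaky_hits[tool] > 0 else 0
    return leaky_hits[tool] + hits, repos + baseline_repo


for tool in leaky_hits:
    print(tool, combine(tool), combine(tool) == reported_totals[tool])
```

All three tools reconcile, so the totals table is the sum of the recall test and the noise-floor test rather than a separate measurement.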

By category

Reading these numbers