CodeSecBench

Public benchmark · AI-app SAST

The SAST tools your team already trusts weren't designed for AI apps.

AI applications introduce a different attack surface — prompt injection, unsafe role merging, client-side LLM keys, PII flowing into prompts, unbounded streams, unsafe tool outputs. None of these patterns existed when Snyk, Checkmarx, Veracode, GitHub Advanced Security, or any other enterprise SAST tool calibrated its rule set. CodeSecBench is the public benchmark that measures who catches what.

Headline numbers

Head-to-head, on the same fixtures

Section A · JS/TS · 15 fixtures

Recall on AI-app categories

getdebug
75%
gitleaks
25%
trufflehog
0%

gitleaks and trufflehog are secret scanners by design — they score high on secret-shape categories and low elsewhere. The point isn't "they're bad," it's "they're not built for the AI-app surface."

Section B · Python · 10 fixtures

Recall on Python AI-app categories

getdebug
100%
bandit
20%
semgrep
20%

bandit and semgrep are the Python SAST baselines. They catch classic CWE patterns reliably. AI-app behavioral patterns are a different category — not under-tuned coverage, missing coverage.

Full per-tool breakdowns, per-category mean recall, real-world finding-count comparison, and per-version trend on /results.

Four classes of target, measuring four things

Different target shapes measure different parts of a SAST tool's behavior. A benchmark that only uses one class can be gamed; a benchmark that uses all four can't.

Maintain a SAST tool? Get on the leaderboard.

CodeSecBench will grade any tool that submits. The maintainer is transparent — see governance — and a multi-maintainer model takes over the moment a second tool's maintainer joins.

The corpus, the truth files, and the score.js harness go public once Tier C lands its final two repositories (cycles 5 and 6 in flight). The methodology, results, targets, and SAST landscape are all browsable now.