Public benchmark · AI-app SAST

The SAST tools your team already trusts weren't designed for AI apps.

AI applications introduce a different attack surface — prompt injection, unsafe role merging, client-side LLM keys, PII flowing into prompts, unbounded streams, unsafe tool outputs. None of these patterns existed when Snyk, Checkmarx, Veracode, GitHub Advanced Security, or any other enterprise SAST tool calibrated its rule set. CodeSecBench is the public benchmark that measures who catches what.

See the head-to-head results Why this category needs a benchmark

Headline numbers

Head-to-head, on the same fixtures

Section A · JS/TS · 15 fixtures

Recall on AI-app categories

getdebug: 75%
gitleaks: 25%
trufflehog: 0%

gitleaks and trufflehog are secret scanners by design — they score high on secret-shape categories and low elsewhere. The point isn't "they're bad," it's "they're not built for the AI-app surface."

Section B · Python · 10 fixtures

Recall on Python AI-app categories

getdebug: 100%
bandit: 20%
semgrep: 20%

bandit and semgrep are the Python SAST baselines. They catch classic CWE patterns reliably. AI-app behavioral patterns are a different category — not under-tuned coverage, missing coverage.

Full per-tool breakdowns, per-category mean recall, real-world finding-count comparison, and per-version trend on /results.

Four classes of target, measuring four things

Different target shapes measure different parts of a SAST tool's behavior. A benchmark that only uses one class can be gamed; a benchmark that uses all four can't.

Section A

JS/TS micro-fixtures

Hand-crafted, one-pattern-per-file. Measures whether a tool can detect a category at its purest. 15 fixtures, six categories.
Section B

Python micro-fixtures

Same shape as Section A but Python idioms. Catches tools that have JS/TS coverage but no equivalent Python pass.
Section C

App-shaped repositories

Eight hand-authored AI apps with the patterns at app density, not one-bug-per-file. Span-labeled truth files. Where micro-fixtures measure detection, this measures detection at noise.
Real-world

24 public GitHub repos

Mid-popularity AI app templates plus a known-leaky baseline plus popular references. Measures finding-count behavior in the wild — high counts on the baseline are good; high counts on references suggest noise.

Browse the full corpus →

Maintain a SAST tool? Get on the leaderboard.

CodeSecBench will grade any tool that submits. The maintainer is transparent — see governance — and a multi-maintainer model takes over the moment a second tool's maintainer joins.

The corpus, the truth files, and the score.js harness go public once Tier C lands its final two repositories (cycles 5 and 6 in flight). The methodology, results, targets, and SAST landscape are all browsable now.

Read the methodology Neutrality model

The SAST tools your team already trusts weren't designed for AI apps.

Head-to-head, on the same fixtures

Recall on AI-app categories

Recall on Python AI-app categories

Four classes of target, measuring four things

JS/TS micro-fixtures

Python micro-fixtures

App-shaped repositories

24 public GitHub repos

Maintain a SAST tool? Get on the leaderboard.