Leaderboard · Four sections

Results

Recall and precision per tool, per section. Methodology, span-label tolerance, hallucination accounting, and oracle confirmation rules are documented in /methodology. The corpus, truth files, and scoring harness go public once Tier C lands its final two repositories; the submit-a-tool flow will open here at that point.

Section A — JS/TS AI-app fixtures

15 hand-crafted JavaScript/TypeScript fixtures across six AI-app categories. Each fixture is either deliberately vulnerable or deliberately safe (near-miss). Scoring is per-row TP/FP/FN against labeled spans. Head-to-head: getdebug vs gitleaks vs trufflehog.

Tool	Recall	Precision	TP	FP	FN
getdebug (graded tool)	75%	55%	6	5	2
gitleaks	25%	67%	2	1	6
trufflehog	0%	0%	0	0	8
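
As a sanity check on the table above, recall and precision follow from the TP/FP/FN counts in the usual way: recall = TP / (TP + FN) and precision = TP / (TP + FP). The sketch below is not the benchmark's scoring harness, which is not yet public. The function names are hypothetical, and returning None for an empty denominator is an assumption of this sketch; the page itself displays 0% precision for trufflehog, which emitted no findings on these fixtures.

```python
from typing import Optional

def recall(tp: int, fn: int) -> Optional[float]:
    """Share of labeled vulnerabilities that the tool found."""
    total = tp + fn
    return tp / total if total else None

def precision(tp: int, fp: int) -> Optional[float]:
    """Share of emitted findings that matched a labeled span."""
    total = tp + fp
    return tp / total if total else None

def as_percent(value: Optional[float]) -> str:
    """Format a ratio as a whole percentage, or n/a when undefined."""
    return "n/a" if value is None else f"{value:.0%}"

# (TP, FP, FN) per tool, copied from the Section A table.
SECTION_A = {
    "getdebug": (6, 5, 2),
    "gitleaks": (2, 1, 6),
    "trufflehog": (0, 0, 8),
}

for tool, (tp, fp, fn) in SECTION_A.items():
    print(tool, as_percent(recall(tp, fn)), as_percent(precision(tp, fp)))
```

The getdebug and gitleaks rows reproduce the published percentages. For trufflehog the sketch reports recall of 0% and leaves precision undefined, since there are no findings to be precise about.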

Section B — Python AI-app fixtures

10 Python AI-app fixtures across five categories. Head-to-head: getdebug vs bandit vs semgrep. Bandit and Semgrep are the de facto Python SAST baselines.

Tool	Recall	Precision	TP	FP	FN
getdebug (graded tool)	100%	100%	5	0	0
bandit	20%	50%	1	1	4
semgrep	20%	50%	1	1	4

Section C — App-shaped corpus (Tier C)

Eight hand-authored AI-app repositories with span-labeled truth files. All eight baselined. Per-target recall + precision across each baselined version of getdebug. Multi-tool comparison opens once another vendor submits a result against the same truth files.

Calibration cycle 1 · 0.5.1 → 0.5.2

Per-category recall · Mean across cst-nextjs-chat + cst-vite-rag

The 0.5.2 detector wave targeted two patterns: execAsync(args.X) / sql.unsafe(args.X) (the canonical SDK tool-callable shape feeding shell or raw-SQL sinks), and server routes returning process.env.X_API_KEY in a JSON response body. Both patterns appeared in both repos, under different file paths.

Category	0.5.1	0.5.2	Δ	Note
client-side-llm-key	50%	100%	+50pp
unsafe-tool-output	0%	29%	+29pp
prompt-injection	42%	42%	flat
pii-in-prompt	50%	50%	flat
unsafe-role-merge	0%	0%	flat	next target
unbounded-stream	25%	25%	flat	label-line debt on repo #1
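
To make the two shapes targeted in cycle 1 concrete, here is a minimal line-based sketch of what matching them could look like. It is not getdebug's implementation: the regexes, the scan function, and the mapping of each shape to a category name are all assumptions made for illustration, and the JavaScript sample is invented.

```python
import re
from typing import List, Tuple

# Hypothetical patterns; the real detectors are not published on this page.
# Shape 1: a tool-call argument passed straight into a shell or raw-SQL sink.
TOOL_ARG_TO_SINK = re.compile(r"\b(?:execAsync|sql\.unsafe)\(\s*args\.\w+")
# Shape 2: an API key read from the environment inside a JSON response body.
ENV_KEY_IN_JSON = re.compile(r"\.json\(\s*\{[^}]*process\.env\.\w*API_KEY")

def scan(source: str) -> List[Tuple[int, str]]:
    """Return (line number, assumed category) for each matching line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if TOOL_ARG_TO_SINK.search(line):
            findings.append((lineno, "unsafe-tool-output"))
        if ENV_KEY_IN_JSON.search(line):
            findings.append((lineno, "client-side-llm-key"))
    return findings

# Invented JavaScript sample: three risky lines and one near-miss.
SAMPLE_JS = """\
const out = await execAsync(args.command);
return Response.json({ key: process.env.OPENAI_API_KEY });
const rows = await sql.unsafe(args.query);
const listing = await execAsync("ls");
"""

for finding in scan(SAMPLE_JS):
    print(finding)
```

The last sample line is a near-miss: it calls the same sink with a constant string, so the sketch leaves it alone. A single-line regex like this would miss calls split across lines, which is one reason to treat it as an illustration only.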

Calibration cycle 2 · 0.5.2 → 0.5.3

Per-category recall · cst-sveltekit-stream only

When cst-sveltekit-stream joined the corpus, recall dropped to 7% on first scan — the same six categories of bug but a different framework (SvelteKit) and a different SDK (Anthropic) changed every pattern shape. The 0.5.3 detector wave targeted five SvelteKit/Anthropic-specific shapes plus one precision fix. Repos #1 and #2 held flat: the new detectors are stack-specific by design and correctly don't fire on Next/OpenAI code.

Category	0.5.2	0.5.3	Δ	Note
client-side-llm-key	0%	50%	+50pp
unsafe-tool-output	33%	67%	+34pp
unsafe-role-merge	0%	33%	+33pp
prompt-injection	0%	50%	+50pp
pii-in-prompt	0%	0%	flat	next target
unbounded-stream	0%	0%	flat	credit landed under unsafe-role-merge (label-span issue)

Calibration cycle 3 · 0.5.3 → 0.5.4

Per-category recall · Mean across all 4 shipped repos

cst-express-agent (repo #4) was built to be heavy on unsafe-tool-output (UTO) by design: 6 UTO carriers across 6 different sink shapes (shell exec, raw SQL, file write with path traversal, Function constructor eval, HTML render of LLM output, spawn with arg-splitting). The 0.5.4 detector wave extended the sink list and added three things: a fetch-without-signal detector, an HTML-embed-key detector, and a relaxation of the pii-in-prompt (PIP) context gate by file name.

Category	0.5.3	0.5.4	Δ	Note
unsafe-tool-output	42%	76%	+34pp	targeted
pii-in-prompt	25%	50%	+25pp
unbounded-stream	17%	38%	+21pp
prompt-injection	42%	46%	+4pp
client-side-llm-key	63%	63%	flat
unsafe-role-merge	17%	8%	-9pp	now weakest — cst-crewai-multiagent target

Calibration cycle 4 · 0.5.4 → 0.5.5

Per-category recall · Mean across all 5 shipped repos

cst-fastapi-tools (repo #5) is the corpus's first Python target and is heavy on unbounded-stream (UBS) by design: 6 UBS carriers across FastAPI's async surface. These are a StreamingResponse generator with no is_disconnected() poll, an SSE generator with no finally, an httpx stream with no timeout, a websocket receive loop with no disconnect branch, an async generator with no try/finally, and a BackgroundTasks drain loop with no bound. The pre-existing Python prefilters only caught the stream=True kwarg, so first-scan recall was 38%. The 0.5.5 wave ships nine Python detectors (function-scoped, comment/docstring-stripped) that took the target to 100%, with zero regression on the four JS/TS repos, since the detectors are .py-gated.

Category	0.5.4	0.5.5	Δ	Note
unbounded-stream	43%	64%	+21pp	targeted — Python target 50% → 100%
unsafe-role-merge	8%	25%	+17pp	still weakest — cst-crewai-multiagent (0.5.6) target
client-side-llm-key	67%	78%	+11pp
unsafe-tool-output	68%	79%	+11pp
pii-in-prompt	50%	60%	+10pp
prompt-injection	45%	55%	+9pp
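
One of the unbounded-stream carriers described for cycle 4 is an async generator with no try/finally. The sketch below shows one way such a check could be made function-scoped and immune to comments. It deliberately swaps techniques: it uses Python's ast module rather than the stripped-source detectors the page describes, and the function name and sample source are hypothetical.

```python
import ast
from typing import List

def async_generators_without_finally(source: str) -> List[str]:
    """Name every async generator function that has no try/finally.

    Parsing to an AST discards comments, so a comment that merely
    mentions 'finally' cannot satisfy the check.
    """
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.AsyncFunctionDef):
            continue
        inner = list(ast.walk(node))  # scope the check to this function
        has_yield = any(isinstance(n, ast.Yield) for n in inner)
        has_finally = any(isinstance(n, ast.Try) and n.finalbody for n in inner)
        if has_yield and not has_finally:
            flagged.append(node.name)
    return flagged

# Invented sample: one leaky generator, one guarded generator.
SAMPLE_PY = '''
async def leaky_events():
    # finally: this comment must not count as cleanup
    while True:
        yield "tick"

async def guarded_events():
    try:
        while True:
            yield "tick"
    finally:
        print("client gone, releasing resources")
'''

print(async_generators_without_finally(SAMPLE_PY))
```

Only leaky_events is reported. The comment containing the word "finally" has no effect because it never reaches the syntax tree.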

Calibration cycle 5 · 0.5.5 → 0.5.6

Per-category recall · Mean across all 6 shipped repos

cst-crewai-multiagent (repo #6) is the corpus's first agent-framework target, built to be heavy on unsafe-role-merge (URM). Every prior URM detector looked for a message dict: {"role": "system"} or {"role": role_var}. CrewAI doesn't use a dict. An Agent's backstory is its system prompt and its role is its authority, and both are constructor kwargs. Five of six URM carriers were invisible on first scan (the sixth, a dict-shape tool call, was already caught). The 0.5.6 wave ships three Agent-surface detectors. The five non-URM carriers were caught with no change, because the 0.5.5 Python wave generalised straight from FastAPI to CrewAI.

Category	0.5.5	0.5.6	Δ	Note
unsafe-role-merge	22%	50%	+28pp	targeted — URM-heavy target 17% → 100%; corpus URM doubled
client-side-llm-key	82%	82%	flat	non-URM carriers already covered
unsafe-tool-output	81%	81%	flat
unbounded-stream	69%	69%	flat
pii-in-prompt	67%	67%	flat
prompt-injection	62%	62%	flat
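
The cycle 5 observation is that in CrewAI the role and backstory arrive as Agent constructor kwargs, not as a message dict. A minimal sketch of an Agent-surface check follows, assuming the rule "flag role or backstory when the value is not a plain literal". That rule, the function name, and the sample are hypothetical, and the sketch uses Python's ast module rather than whatever the 0.5.6 detectors actually do.

```python
import ast
from typing import List, Tuple

WATCHED_KWARGS = {"role", "backstory"}

def dynamic_agent_kwargs(source: str) -> List[Tuple[int, str]]:
    """Return (line, kwarg) where Agent(...) gets a non-literal role or backstory."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both Agent(...) and crewai.Agent(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "Agent":
            continue
        for kw in node.keywords:
            # Variables and f-strings are not ast.Constant, so they are flagged.
            if kw.arg in WATCHED_KWARGS and not isinstance(kw.value, ast.Constant):
                findings.append((node.lineno, kw.arg))
    return findings

# Invented sample: the source is only parsed, never imported or executed.
SAMPLE_CREW = '''
from crewai import Agent

def build(user_bio, requested_role):
    return Agent(
        role=requested_role,
        goal="Answer support tickets",
        backstory=f"You are a helper. About the user: {user_bio}",
    )

fixed = Agent(role="Researcher", goal="Summarise papers", backstory="A careful analyst.")
'''

print(dynamic_agent_kwargs(SAMPLE_CREW))
```

The first Agent call is reported twice, once for each watched kwarg, while the all-literal call is not reported. A dict-only detector would see nothing in either call, which matches the first-scan blind spot described above.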

Calibration cycle 6 · 0.5.6 → 0.5.7

Per-category recall · Mean across all 7 shipped repos

cst-go-agent (repo #7) is the corpus's first non-Python/JavaScript target. There was no .go AI-app detector path at all, so first scan was 6% (only the language-agnostic secret scanner fired). 0.5.7 adds aiapp_regex_go.go — Go-idiom detectors (fmt.Sprintf/strings.Join prompts, anthropic.F System, openai-go Role, os/exec, json.Marshal, NewStreaming) gated on an LLM-SDK marker so getdebug's own Go CLI stays silent. A new language is a coverage stressor, not a category stressor — so unlike every prior cycle, this one moved all six categories.

Category	0.5.6	0.5.7	Δ	Note
prompt-injection	47%	71%	+24pp	biggest mover — Go target carries 4 PI carriers
unsafe-role-merge	41%	59%	+18pp
pii-in-prompt	57%	71%	+14pp
unbounded-stream	61%	72%	+11pp
unsafe-tool-output	74%	83%	+9pp
client-side-llm-key	77%	85%	+8pp
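
The gating idea from cycle 6 (run the Go detectors only in files that carry an LLM-SDK marker) can be sketched as below. The marker is assumed to be an import of one of the two Go SDKs the page names, and the single detector shown flags every fmt.Sprintf or strings.Join call, which is far broader than a real detector would want. All names and both Go samples are hypothetical.

```python
import re
from typing import List

# Assumed marker: an import path of the Anthropic or OpenAI Go SDK.
LLM_SDK_MARKER = re.compile(
    r'"github\.com/(?:anthropics/anthropic-sdk-go|openai/openai-go)'
)
# One illustrative shape: a string built with fmt.Sprintf or strings.Join.
PROMPT_BUILD = re.compile(r"\b(?:fmt\.Sprintf|strings\.Join)\(")

def scan_go(source: str) -> List[int]:
    """Return matching line numbers, but only for files that import an LLM SDK."""
    if not LLM_SDK_MARKER.search(source):
        return []  # gate closed: ordinary Go code stays silent
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if PROMPT_BUILD.search(line)
    ]

AGENT_GO = '''package main

import (
    "fmt"

    "github.com/anthropics/anthropic-sdk-go"
)

func prompt(userInput string) string {
    return fmt.Sprintf("Summarise this: %s", userInput)
}
'''

CLI_GO = '''package main

import "fmt"

func banner(version string) string {
    return fmt.Sprintf("tool %s", version)
}
'''

print(scan_go(AGENT_GO), scan_go(CLI_GO))
```

Both files contain the same fmt.Sprintf shape, but only the one importing an SDK is scanned, which is the behaviour the page describes for keeping getdebug's own Go CLI quiet.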

Calibration cycle 7 · 0.5.7 → 0.5.8

Per-category recall · Mean across all 8 shipped repos

cst-rails (repo #8) adds the corpus's fourth language, Ruby (on Rails). No .rb AI-app detector path existed, so first scan was 6%. 0.5.8 adds aiapp_regex_rb.go: fourteen Ruby detectors (including #{} interpolation, params[:role], backtick exec, File.read persona, user.to_json, ActionController::Live) gated on an LLM-SDK marker. The first scan caught the same comment-poisoning bug in three places at once (a case-sensitive gate, and request_timeout / ensure suppressions fooled by comments), making Ruby the third language in a row to surface it. This closes the eight-repo, four-language corpus at 76% recall, with 100% precision on every repo.

Category	0.5.7	0.5.8	Δ
prompt-injection	60%	75%	+15pp
unsafe-role-merge	50%	65%	+15pp
pii-in-prompt	63%	75%	+13pp
unsafe-tool-output	73%	85%	+12pp
unbounded-stream	65%	75%	+10pp
client-side-llm-key	80%	87%	+7pp

Per-target recall + precision, per version

Each cell is recall / precision. Em-dash = repo wasn't yet authored at that version.

Target	0.5.1	0.5.2	0.5.3	0.5.4	0.5.5	0.5.6	0.5.7	0.5.8
cst-nextjs-chat	23% / 75%	38% / 83%	38% / 83%	46% / 86%	46% / 100%	46% / 100%	46% / 100%	46% / 100%
cst-vite-rag	—	40% / 86%	40% / 86%	47% / 88%	47% / 100%	47% / 100%	47% / 100%	47% / 100%
cst-sveltekit-stream	—	—	36% / 100%	50% / 100%	50% / 100%	50% / 100%	50% / 100%	50% / 100%
cst-express-agent	—	—	—	59% / 100%	59% / 100%	59% / 100%	59% / 100%	59% / 100%
cst-fastapi-tools	—	—	—	—	100% / 100%	100% / 100%	100% / 100%	100% / 100%
cst-crewai-multiagent	—	—	—	—	—	100% / 100%	100% / 100%	100% / 100%
cst-go-agent	—	—	—	—	—	—	100% / 100%	100% / 100%
cst-rails	—	—	—	—	—	—	—	100% / 100%

Multi-tool comparison opens once another vendor submits a result against the same truth files — see /governance.

Real-world repos — 24 GitHub projects

Finding-count comparison on 24 public AI-app repos. Three sub-categories: a known-leaky baseline (1 repo, ~150 planted secrets; high recall expected), popular references (3 repos; high precision expected, near-zero findings), and AI starters (20 repos; the noise-floor sample).

Synthetic recall isn't real-world recall

On a known-plant baseline, broader regex pattern sets win. On real less-curated AI starter repos, lower false-positive rates win. Both numbers matter; neither alone is the whole picture.

Synthetic · leaky-baseline · ~150 planted secrets

Recall test

Tool	Hits
getdebug	9
gitleaks	22
trufflehog	12

gitleaks ships the broadest regex pattern set today. Detector parity work targets closing this gap; the bench will track it.

Noise-floor · 23 less-curated + popular AI starters

Noise-floor test

Tool	Hits	Repos
getdebug	5	2/23
gitleaks	12	4/23
trufflehog	8	4/23

Lower is better here — every finding the scanner emits, a human triages. CodeSecBench has done the manual classification for false-positive analysis; see /methodology.

Wall-clock per scan — median across 24 repos

gitleaks and getdebug sit in CI's comfort zone (sub-300ms median). trufflehog's distinguishing feature is its live-API verifier, which this run disables for a fair shape-match comparison. With verification on, scan time rises further and the finding set shrinks to verified secrets only.

Tool	Median	Min	Max	n
gitleaks	176ms	74ms	14289ms	24
getdebug	227ms	30ms	1624ms	24
trufflehog	1779ms	1666ms	3283ms	24

Per-tool totals (across all 24 repos)

Tool	Total findings	Repos with findings
getdebug (graded tool)	14	3 / 24
gitleaks	34	5 / 24
trufflehog	20	5 / 24

By category

leaky-repo-baseline (1)

Repo	getdebug	gitleaks	trufflehog
Plazmaz/leaky-repo	9	22	12

popular-reference (3)

Repo	getdebug	gitleaks	trufflehog
vercel/ai-chatbot	0	0	0
langchain-ai/chat-langchain	0	0	0
modelcontextprotocol/servers	0	0	0

ai-starter (20)

Repo	getdebug	gitleaks	trufflehog
amjadraza/langchain-streamlit-docker-template	0	0	0
joshuasundance-swca/langchain-research-assistant-docker	0	0	0
rahulsamant37/langchain-langgraph-starter	0	0	0
oisee/zllm	0	0	0
NJUxlj/Travel-Agent-based-on-Qwen2-RLHF	3	4	3
ssgrummons/rag-with-milvus-langchain-streamlit	0	0	0
CronusL-1141/AI-company	0	1	0
Sinapsis-AI/sinapsis-langchain	0	0	0
rryyqn/ai-chatbot	0	0	0
D-artisan/ai-chatbot	0	0	0
arvindsis11/Ai-Healthcare-Chatbot	0	0	1
Ramakm/AI-Chatbot	0	0	0
stackitcloud/rag-template	0	0	1
The-Swarm-Corporation/Multi-Agent-RAG-Template	0	0	0
xyspg/RAG-template	0	0	0
mia-platform/ai-rag-template	0	0	0
alexeykrol/claude-code-starter	2	5	3
hamzafarooq/claude-code-starter	0	0	0
davidhershey/ClaudePlaysPokemonStarter	0	0	0
ArtemXTech/claude-code-obsidian-starter	0	2	0
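
The per-tool totals and the noise-floor figures can be re-derived from the per-repo rows above. The sketch below copies only the rows with at least one non-zero cell, since all-zero rows change neither sums nor repo counts. The helper name and data layout are illustrative and not part of the benchmark harness.

```python
from typing import Dict, Tuple

TOOLS = ("getdebug", "gitleaks", "trufflehog")
BASELINE = "Plazmaz/leaky-repo"

# Rows copied from the by-category tables (getdebug, gitleaks, trufflehog).
NONZERO_ROWS: Dict[str, Tuple[int, int, int]] = {
    "Plazmaz/leaky-repo": (9, 22, 12),
    "NJUxlj/Travel-Agent-based-on-Qwen2-RLHF": (3, 4, 3),
    "CronusL-1141/AI-company": (0, 1, 0),
    "arvindsis11/Ai-Healthcare-Chatbot": (0, 0, 1),
    "stackitcloud/rag-template": (0, 0, 1),
    "alexeykrol/claude-code-starter": (2, 5, 3),
    "ArtemXTech/claude-code-obsidian-starter": (0, 2, 0),
}

def totals(rows: Dict[str, Tuple[int, int, int]]) -> Dict[str, Tuple[int, int]]:
    """Per tool: (total findings, number of repos with at least one finding)."""
    result = {}
    for index, tool in enumerate(TOOLS):
        counts = [row[index] for row in rows.values()]
        result[tool] = (sum(counts), sum(1 for c in counts if c > 0))
    return result

all_repos = totals(NONZERO_ROWS)
noise_floor = totals({r: c for r, c in NONZERO_ROWS.items() if r != BASELINE})

print("all 24 repos:", all_repos)
print("23 repos without the leaky baseline:", noise_floor)
```

With the baseline included, the sums match the per-tool totals table (14, 34, and 20 findings). With it excluded, they match the noise-floor table (5, 12, and 8 hits).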

Reading these numbers

Sections A and B are fairness-tuned for behavioral AI-app patterns. getdebug, gitleaks, and trufflehog all run on the same JS/TS fixtures — the latter two are secret-detection tools and will score low on non-secret categories by design. Same for Section B: bandit + semgrep are Python SAST baselines, not AI-app aware.
Section C tests at app density: the same six categories at app scale rather than one bug per file. The lower per-target recall numbers reflect the step up in difficulty from micro-fixtures.
The real-world section counts findings, not TP/FP/FN, because no truth labels exist for these repos yet. High counts on the leaky baseline are good; high counts on popular references suggest noise.