Leaderboard · Four sections
Results
Recall and precision per tool, per section. Methodology, span-label tolerance, hallucination accounting, and oracle confirmation rules are documented in /methodology. The corpus, truth files, and scoring harness go public once Tier C lands its final two repositories, at which point the submit-a-tool flow opens here.
Section A — JS/TS AI-app fixtures
15 hand-crafted JavaScript/TypeScript fixtures across six AI-app categories. Each fixture is either deliberately vulnerable or deliberately safe (near-miss). Scoring is per-row TP/FP/FN against labeled spans. Head-to-head: getdebug vs gitleaks vs trufflehog.
| Tool | Recall | Precision | TP | FP | FN |
|---|---|---|---|---|---|
| getdebug (graded tool) | 75% | 55% | 6 | 5 | 2 |
| gitleaks | 25% | 67% | 2 | 1 | 6 |
| trufflehog | 0% | 0% | 0 | 0 | 8 |
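The Recall and Precision columns follow from the TP/FP/FN counts. A minimal sketch in Python, assuming the standard definitions (recall = TP / (TP + FN), precision = TP / (TP + FP)) and assuming that an empty denominator is reported as 0% rather than as undefined, which is how the trufflehog row reads. The function name `score` is illustrative and is not part of the scoring harness:

```python
def score(tp: int, fp: int, fn: int) -> tuple[int, int]:
    """Return (recall %, precision %) rounded to whole percentages."""
    # Assumption: an empty denominator is reported as 0% instead of undefined.
    recall = 100 * tp / (tp + fn) if (tp + fn) else 0.0
    precision = 100 * tp / (tp + fp) if (tp + fp) else 0.0
    return round(recall), round(precision)


# (TP, FP, FN) counts copied from the Section A table.
section_a = {
    "getdebug": (6, 5, 2),
    "gitleaks": (2, 1, 6),
    "trufflehog": (0, 0, 8),
}

for tool, counts in section_a.items():
    recall, precision = score(*counts)
    print(f"{tool}: recall {recall}%, precision {precision}%")
```

Under these assumptions the sketch reproduces the percentages in the table, including 6 / 11 rounding to 55% for getdebug's precision.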
Section B — Python AI-app fixtures
10 Python AI-app fixtures across five categories. Head-to-head: getdebug vs bandit vs semgrep. Bandit and Semgrep are the de facto Python SAST baselines.
| Tool | Recall | Precision | TP | FP | FN |
|---|---|---|---|---|---|
| getdebug (graded tool) | 100% | 100% | 5 | 0 | 0 |
| bandit | 20% | 50% | 1 | 1 | 4 |
| semgrep | 20% | 50% | 1 | 1 | 4 |
Section C — App-shaped corpus (Tier C)
Six hand-authored AI-app repositories with span-labeled truth files, four of which are baselined so far. Results are per-target recall and precision for each baselined version of getdebug.
Calibration cycle 1 · 0.5.1 → 0.5.2
Per-category recall · Mean across cst-nextjs-chat + cst-vite-rag
The 0.5.2 detector wave targeted two patterns: execAsync(args.X) / sql.unsafe(args.X) (the canonical SDK tool-callable shape feeding shell or raw-SQL sinks), and server routes that return process.env.X_API_KEY in a JSON response body. Both patterns appeared in both repos under different file paths.
| Category | 0.5.1 | 0.5.2 | Δ | Note |
|---|---|---|---|---|
| client-side-llm-key | 50% | 100% | +50pp | |
| unsafe-tool-output | 0% | 29% | +29pp | |
| prompt-injection | 42% | 42% | flat | |
| pii-in-prompt | 50% | 50% | flat | |
| unsafe-role-merge | 0% | 0% | flat | next target |
| unbounded-stream | 25% | 25% | flat | label-line debt on repo #1 |
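The two shapes targeted by the 0.5.2 wave can be illustrated with a toy matcher. This is a rough regex sketch in Python, not getdebug's actual detector, whose implementation is not described on this page. The pattern names and the sample snippets are assumptions made up for illustration:

```python
import re

# Hypothetical patterns, for illustration only.
# 1) A tool-call argument passed straight into a shell or raw-SQL sink.
TOOL_ARG_SINK = re.compile(r"\b(?:execAsync|sql\.unsafe)\(\s*args\.\w+")
# 2) A server route putting an API key from the environment into a JSON body.
ENV_KEY_IN_JSON = re.compile(r"\.json\([^)]*process\.env\.\w*_API_KEY")


def match_shapes(source: str) -> list[str]:
    """Return the names of the shapes that appear in a JS/TS snippet."""
    hits = []
    if TOOL_ARG_SINK.search(source):
        hits.append("tool-arg-sink")
    if ENV_KEY_IN_JSON.search(source):
        hits.append("env-key-in-json")
    return hits


# Made-up JS/TS snippets: two vulnerable shapes and one near-miss.
vulnerable = "execute: async (args) => execAsync(args.command)"
leaky_route = "return Response.json({ key: process.env.OPENAI_API_KEY })"
near_miss = "execute: async (args) => execAsync(ALLOWED[args.command])"

print(match_shapes(vulnerable))   # ['tool-arg-sink']
print(match_shapes(leaky_route))  # ['env-key-in-json']
print(match_shapes(near_miss))    # []
```

A line-level regex like this cannot follow data flow across variables or files, which is one reason the same bug category can need new detectors when the framework or SDK changes.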
Calibration cycle 2 · 0.5.2 → 0.5.3
Per-category recall · cst-sveltekit-stream only
When cst-sveltekit-stream joined the corpus, recall dropped to 7% on first scan — the same six categories of bug but a different framework (SvelteKit) and a different SDK (Anthropic) changed every pattern shape. The 0.5.3 detector wave targeted five SvelteKit/Anthropic-specific shapes plus one precision fix. Repos #1 and #2 held flat: the new detectors are stack-specific by design and correctly don't fire on Next/OpenAI code.
| Category | 0.5.2 | 0.5.3 | Δ | Note |
|---|---|---|---|---|
| client-side-llm-key | 0% | 50% | +50pp | |
| unsafe-tool-output | 33% | 67% | +34pp | |
| unsafe-role-merge | 0% | 33% | +33pp | |
| prompt-injection | 0% | 50% | +50pp | |
| pii-in-prompt | 0% | 0% | flat | next target |
| unbounded-stream | 0% | 0% | flat | credit landed under unsafe-role-merge (label-span issue) |
Calibration cycle 3 · 0.5.3 → 0.5.4
Per-category recall · Mean across all 4 shipped repos
cst-express-agent (repo #4) was built to be heavy on unsafe-tool-output by design: 6 unsafe-tool-output carriers across 6 different sink shapes (shell exec, raw SQL, file write with path traversal, Function constructor eval, HTML render of LLM output, spawn with arg-splitting). The 0.5.4 detector wave extended the sink list and added fetch-without-signal detection, HTML-embed-key detection, and a relaxation of the pii-in-prompt context gate by file name.
| Category | 0.5.3 | 0.5.4 | Δ | Note |
|---|---|---|---|---|
| unsafe-tool-output | 42% | 76% | +34pp | targeted |
| pii-in-prompt | 25% | 50% | +25pp | |
| unbounded-stream | 17% | 38% | +21pp | |
| prompt-injection | 42% | 46% | +4pp | |
| client-side-llm-key | 63% | 63% | flat | |
| unsafe-role-merge | 17% | 8% | -9pp | now weakest — cst-crewai-multiagent target |
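The Δ column in the calibration tables is a difference in percentage points, not a relative change. A small sketch, assuming the delta is taken on the rounded percentages shown in the table. The helper name `delta_pp` is illustrative:

```python
def delta_pp(before: int, after: int) -> str:
    """Format the change between two percentages in percentage points."""
    diff = after - before
    if diff == 0:
        return "flat"
    # The "+d" format spec prints an explicit sign for positive values.
    return f"{diff:+d}pp"


print(delta_pp(42, 76))  # +34pp
print(delta_pp(63, 63))  # flat
print(delta_pp(17, 8))   # -9pp
```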
Per-target recall + precision, per version
Each cell is recall / precision. A dash cell (—) means the repo had not yet been authored at that version.
| Target | 0.5.1 | 0.5.2 | 0.5.3 | 0.5.4 |
|---|---|---|---|---|
| cst-nextjs-chat | 23% / 75% | 38% / 83% | 38% / 83% | 46% / 86% |
| cst-vite-rag | — | 40% / 86% | 40% / 86% | 47% / 88% |
| cst-sveltekit-stream | — | — | 36% / 100% | 50% / 100% |
| cst-express-agent | — | — | — | 59% / 100% |
Multi-tool comparison opens once another vendor submits a result against the same truth files — see /governance.
Real-world repos — 24 GitHub projects
Finding-count comparison on 24 public AI-app repos in three sub-categories: a known-leaky baseline (1 repo, ~150 planted secrets; high recall expected), popular references (3 repos; high precision expected, near-zero findings), and AI starters (20 repos; the noise-floor sample).
Synthetic recall isn't real-world recall
On a known-plant baseline, broader regex pattern sets win. On real, less-curated AI starter repos, lower false-positive rates win. Both numbers matter; neither alone is the whole picture.
Synthetic · leaky-baseline · ~150 planted secrets
Recall test
| Tool | Hits |
|---|---|
| getdebug | 9 |
| gitleaks | 22 |
| trufflehog | 12 |
gitleaks ships the broadest regex pattern set today. Detector parity work targets closing this gap; the bench will track it.
Noise-floor · 23 repos (3 popular references + 20 less-curated AI starters)
Noise-floor test
| Tool | Hits | Repos |
|---|---|---|
| getdebug | 5 | 2/23 |
| gitleaks | 12 | 4/23 |
| trufflehog | 8 | 4/23 |
Lower is better here — every finding the scanner emits, a human triages. CodeSecBench has done the manual classification for false-positive analysis; see /methodology.
Wall-clock per scan — median across 24 repos
Two tools sit in CI's comfort zone (sub-300ms median): gitleaks and getdebug. trufflehog's headline feature is its live-API verifier, which this run disables for a fair shape-match comparison. With verification on, scan time increases further and the finding set shrinks to verified findings only.
| Tool | Median | n | Min | Max |
|---|---|---|---|---|
| gitleaks | 176ms | 24 | 74ms | 14289ms |
| getdebug | 227ms | 24 | 30ms | 1624ms |
| trufflehog | 1779ms | 24 | 1666ms | 3283ms |
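The median is a natural choice here because a single slow repo can dominate the mean, as the gitleaks maximum of 14289ms against a 176ms median suggests. A sketch using Python's `statistics` module on made-up timings. The per-repo timings are not published on this page, so the list below is purely illustrative and only echoes the gitleaks minimum, median, and maximum:

```python
from statistics import mean, median

# Hypothetical per-repo wall-clock timings in milliseconds, with one outlier.
# Only the first, middle, and last values echo the published gitleaks figures.
timings_ms = [74, 150, 176, 180, 14289]

print(median(timings_ms))  # 176
print(mean(timings_ms))    # 2973.8
```

With one outlier in five samples, the mean lands far above every typical scan, while the median stays at the middle value.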
Per-tool totals (across all 24 repos)
| Tool | Total findings | Repos with findings |
|---|---|---|
| getdebug (graded tool) | 14 | 3 / 24 |
| gitleaks | 34 | 5 / 24 |
| trufflehog | 20 | 5 / 24 |
By category
leaky-repo-baseline (1)
| Repo | getdebug | gitleaks | trufflehog |
|---|---|---|---|
| Plazmaz/leaky-repo | 9 | 22 | 12 |
popular-reference (3)
| Repo | getdebug | gitleaks | trufflehog |
|---|---|---|---|
| vercel/ai-chatbot | 0 | 0 | 0 |
| langchain-ai/chat-langchain | 0 | 0 | 0 |
| modelcontextprotocol/servers | 0 | 0 | 0 |
ai-starter (20)
| Repo | getdebug | gitleaks | trufflehog |
|---|---|---|---|
| amjadraza/langchain-streamlit-docker-template | 0 | 0 | 0 |
| joshuasundance-swca/langchain-research-assistant-docker | 0 | 0 | 0 |
| rahulsamant37/langchain-langgraph-starter | 0 | 0 | 0 |
| oisee/zllm | 0 | 0 | 0 |
| NJUxlj/Travel-Agent-based-on-Qwen2-RLHF | 3 | 4 | 3 |
| ssgrummons/rag-with-milvus-langchain-streamlit | 0 | 0 | 0 |
| CronusL-1141/AI-company | 0 | 1 | 0 |
| Sinapsis-AI/sinapsis-langchain | 0 | 0 | 0 |
| rryyqn/ai-chatbot | 0 | 0 | 0 |
| D-artisan/ai-chatbot | 0 | 0 | 0 |
| arvindsis11/Ai-Healthcare-Chatbot | 0 | 0 | 1 |
| Ramakm/AI-Chatbot | 0 | 0 | 0 |
| stackitcloud/rag-template | 0 | 0 | 1 |
| The-Swarm-Corporation/Multi-Agent-RAG-Template | 0 | 0 | 0 |
| xyspg/RAG-template | 0 | 0 | 0 |
| mia-platform/ai-rag-template | 0 | 0 | 0 |
| alexeykrol/claude-code-starter | 2 | 5 | 3 |
| hamzafarooq/claude-code-starter | 0 | 0 | 0 |
| davidhershey/ClaudePlaysPokemonStarter | 0 | 0 | 0 |
| ArtemXTech/claude-code-obsidian-starter | 0 | 2 | 0 |
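The per-tool totals can be cross-checked against the per-repo tables. A short sketch that sums only the rows with at least one non-zero count, copied from the tables above. The structure and the function name `totals` are illustrative assumptions, not the bench's own tooling:

```python
# Column order matches the per-repo tables.
TOOLS = ("getdebug", "gitleaks", "trufflehog")
TOTAL_REPOS = 24

# Rows with at least one finding; the other 17 repos are all zeros.
nonzero_rows = {
    "Plazmaz/leaky-repo": (9, 22, 12),
    "NJUxlj/Travel-Agent-based-on-Qwen2-RLHF": (3, 4, 3),
    "CronusL-1141/AI-company": (0, 1, 0),
    "arvindsis11/Ai-Healthcare-Chatbot": (0, 0, 1),
    "stackitcloud/rag-template": (0, 0, 1),
    "alexeykrol/claude-code-starter": (2, 5, 3),
    "ArtemXTech/claude-code-obsidian-starter": (0, 2, 0),
}


def totals(rows: dict[str, tuple[int, int, int]]) -> dict[str, tuple[int, int]]:
    """Return {tool: (total findings, repos with at least one finding)}."""
    result = {}
    for i, tool in enumerate(TOOLS):
        counts = [row[i] for row in rows.values()]
        result[tool] = (sum(counts), sum(1 for c in counts if c > 0))
    return result


for tool, (findings, repos) in totals(nonzero_rows).items():
    print(f"{tool}: {findings} findings in {repos} / {TOTAL_REPOS} repos")
```

The sums agree with the per-tool totals table: 14, 34, and 20 findings across 3, 5, and 5 repos respectively.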
Reading these numbers
- Sections A and B are fairness-tuned for behavioral AI-app patterns. getdebug, gitleaks, and trufflehog all run on the same JS/TS fixtures — the latter two are secret-detection tools and will score low on non-secret categories by design. Same for Section B: bandit + semgrep are Python SAST baselines, not AI-app aware.
- Section C is app-density: the same six categories at app scale, not one-bug-per-file. The lower per-target recall numbers reflect the difficulty step-up from micro-fixtures.
- Real-world counts findings, not TP/FP/FN — no truth labels exist for these (yet). High counts on the leaky baseline are good; high counts on popular references suggest noise.