CodeSecBench

Methodology

How CodeSecBench scores tools — span labels, hallucination control, category disagreement, borderline rows, and the local-LLM model rule.

Notes from the first end-to-end run (getdebug 0.5.1 vs cst-nextjs-chat, 2026-06-07). The decisions below are locked in before the corpus pattern clones to repos #2–#6 — otherwise the gaps replicate 5×.

Decision 1 — Span labels in truth files (not only point labels)

Problem surfaced. getdebug fired on app/api/chat/route.ts:42 (stream: true); the truth label for the same bug was at app/api/chat/route.ts:48 (the for await loop). The bug spans both lines (and several in between: the whole streaming-call block, lines 38–48). The finding at :42 sits 6 lines from the point label at :48, so a 5-line tolerance can’t reach back to it.

Decision. Truth rows support lineStart and lineEnd, not just line. A row with lineStart: 38, lineEnd: 48 means “any finding overlapping [38, 48] credits this row.” Point labels (line: N) stay supported as lineStart = lineEnd = N. The schema in schema.json allows either shape.
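The overlap rule can be sketched as below. This is a minimal illustration, not the scorer's actual code: only line, lineStart, and lineEnd come from the schema described above, while the TruthRow and Finding shapes, the credits function, and the tolerance parameter are hypothetical names chosen for the example.

```typescript
// Hypothetical shapes. Only `line`, `lineStart`, and `lineEnd` are taken from
// the text; everything else here is an assumption made for illustration.
type TruthRow = {
  id: string;
  file: string;
  line?: number;
  lineStart?: number;
  lineEnd?: number;
};
type Finding = { file: string; line: number; endLine?: number };

// Normalize a truth row to an inclusive [start, end] span.
// A point label (line: N) becomes lineStart = lineEnd = N.
function toSpan(row: TruthRow): [number, number] {
  if (row.lineStart !== undefined && row.lineEnd !== undefined) {
    return [row.lineStart, row.lineEnd];
  }
  if (row.line !== undefined) {
    return [row.line, row.line];
  }
  throw new Error(`truth row ${row.id} has neither line nor lineStart/lineEnd`);
}

// A finding credits a row when both are in the same file and their line
// ranges overlap. `tolerance` (assumed, default 0) widens the row's span
// on both sides.
function credits(finding: Finding, row: TruthRow, tolerance = 0): boolean {
  if (finding.file !== row.file) return false;
  const [start, end] = toSpan(row);
  const fStart = finding.line;
  const fEnd = finding.endLine ?? finding.line;
  return fStart <= end + tolerance && fEnd >= start - tolerance;
}

const file = "app/api/chat/route.ts";
const finding: Finding = { file, line: 42 };

// Point label at :48 with a 5-line tolerance: covers 43..53, misses :42.
console.log(credits(finding, { id: "point", file, line: 48 }, 5)); // false

// Span label [38, 48]: the finding at :42 overlaps it.
console.log(credits(finding, { id: "span", file, lineStart: 38, lineEnd: 48 })); // true
```

Normalizing point labels to a one-line span keeps a single overlap check for both row shapes.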

For repo #1, existing point labels stay for now; the next pass through this corpus widens the point labels that should be spans (notably #11, #4, #9, and #10, and anywhere else a bug genuinely spans multiple lines). Repos #2–#6 are authored with span labels from day one.

Decision 2 — Hallucination detection via negative-control files

Problem surfaced. The local-LLM pass on cst-nextjs-chat fired three findings that any honest reviewer would call hallucinations:

The scorer counts these as “out-of-scope” (no truth row in span) and does NOT penalize them. A developer reading the output sees them as noise; the corpus should call that out.

Decision. Every Tier C repo includes a lib/known-safe.ts (or equivalent): a file of innocuous helper code, ~50–100 lines, with no vulnerabilities of any category. The truth file labels it with a whole-file safe declaration, so any finding inside known-safe.ts is a guaranteed false positive. The scorer treats hits there as FPs, not out-of-scope.

This adds a third precision grade beyond strict-JOIN precision and total-finding-count precision: hallucination rate = findings-in-known-safe / total-findings.
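As a rough sketch of that ratio: the ToolFinding shape and the way known-safe files are passed in (a set of paths) are assumptions for illustration, since the whole-file safe declaration format is not specified here. Returning 0 when there are no findings is also an assumed convention.

```typescript
// Hypothetical finding shape; only the file path matters for this metric.
type ToolFinding = { file: string; line: number };

// hallucination rate = findings-in-known-safe / total-findings.
// Returns 0 for an empty finding list (assumed convention, avoids 0/0).
function hallucinationRate(
  findings: ToolFinding[],
  knownSafeFiles: Set<string>,
): number {
  if (findings.length === 0) return 0;
  const inKnownSafe = findings.filter((f) => knownSafeFiles.has(f.file)).length;
  return inKnownSafe / findings.length;
}

// Made-up findings, not results from any real run.
const sampleFindings: ToolFinding[] = [
  { file: "app/api/chat/route.ts", line: 42 },
  { file: "lib/user.ts", line: 9 },
  { file: "lib/checkout.ts", line: 9 },
  { file: "lib/known-safe.ts", line: 17 },
];
console.log(hallucinationRate(sampleFindings, new Set(["lib/known-safe.ts"]))); // 0.25
```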

Decision 3 — Track category disagreement separately; do not penalize TP

Problem surfaced. getdebug correctly identified lib/user.ts:9 as a security issue but labeled it unsafe-role-merge when the truth says pii-in-prompt. The line is right; the category is debatable (the string IS in a system-role merge AND contains PII — both are true). Strict category matching would call this an FN + FP. That’s punishing a correct find.

Decision. TP/FP/FN are scored on line-overlap only. Category disagreement gets its own roll-up, categoryAgreement = TP-with-matching-category / total-TP, which is quoted separately. A tool with 100% recall and 70% category agreement is still a great tool; a tool with 100% recall and 0% category agreement may have a taxonomy mismatch worth investigating.
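The roll-up could be computed along these lines. The TruePositive shape and its field names are hypothetical, and returning null when there are no true positives is an assumed convention rather than documented scorer behavior.

```typescript
// Hypothetical record for one true positive, i.e. a finding already matched
// to a truth row by line overlap.
type TruePositive = {
  location: string;
  findingCategory: string;
  truthCategory: string;
};

// categoryAgreement = TP-with-matching-category / total-TP.
// Returns null when there are no TPs (assumed convention; ratio undefined).
function categoryAgreement(tps: TruePositive[]): number | null {
  if (tps.length === 0) return null;
  const matching = tps.filter(
    (tp) => tp.findingCategory === tp.truthCategory,
  ).length;
  return matching / tps.length;
}

// Two illustrative TPs: the lib/user.ts:9 case from the text (categories
// differ) and one made-up TP where the categories agree.
const sampleTps: TruePositive[] = [
  {
    location: "lib/user.ts:9",
    findingCategory: "unsafe-role-merge",
    truthCategory: "pii-in-prompt",
  },
  {
    location: "example.ts:1",
    findingCategory: "pii-in-prompt",
    truthCategory: "pii-in-prompt",
  },
];
console.log(categoryAgreement(sampleTps)); // 0.5
```

Both rows still count as TPs; the disagreement only lowers the separate agreement figure.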

Decision 4 — Borderline rows do not move the precision number

Problem surfaced. getdebug fired on lib/checkout.ts:9 (B5, labeled safe with confidence 0 — UUID-as-PII, threat-model-dependent). Under strict JOIN that’s an FP that drags precision down by ~17 points on a small corpus.

Decision. Rows with confidence: 0 are excluded from headline precision/recall. They’re still printed in the per-fixture detail with a “borderline agreement: tool agreed with our adjudication / tool disagreed (defensible)” breakdown. A vendor that flags B5 is defensibly right; penalizing them would reward tools for matching our adjudication rather than for being correct, which corrupts the benchmark.
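A minimal sketch of that exclusion, assuming each truth row has already been joined to the tool's output. The RowResult shape, the "vulnerable"/"safe" label values, and the hit flag are hypothetical; only confidence: 0 as the borderline marker comes from the text.

```typescript
// Hypothetical per-row result after joining tool findings to truth rows.
type RowResult = {
  id: string;
  label: "vulnerable" | "safe";
  confidence: number; // only the value 0 (borderline) is given in the text
  hit: boolean; // true if the tool fired on this row
};

type Score = {
  precision: number | null;
  recall: number | null;
  borderlineAgreed: string[];
  borderlineDisagreed: string[];
};

function scoreRows(rows: RowResult[]): Score {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  const borderlineAgreed: string[] = [];
  const borderlineDisagreed: string[] = [];

  for (const row of rows) {
    const adjudicatedVulnerable = row.label === "vulnerable";
    if (row.confidence === 0) {
      // Borderline rows go to the detail breakdown, never the headline.
      const agreed = row.hit === adjudicatedVulnerable;
      (agreed ? borderlineAgreed : borderlineDisagreed).push(row.id);
      continue;
    }
    if (adjudicatedVulnerable && row.hit) tp++;
    else if (adjudicatedVulnerable) fn++;
    else if (row.hit) fp++;
  }

  return {
    // null when the denominator is zero (assumed convention).
    precision: tp + fp > 0 ? tp / (tp + fp) : null,
    recall: tp + fn > 0 ? tp / (tp + fn) : null,
    borderlineAgreed,
    borderlineDisagreed,
  };
}

// Made-up rows: three vulnerable rows hit, one missed, and a borderline
// safe row (B5) that the tool flagged.
const result = scoreRows([
  { id: "A1", label: "vulnerable", confidence: 1, hit: true },
  { id: "A2", label: "vulnerable", confidence: 1, hit: true },
  { id: "A3", label: "vulnerable", confidence: 1, hit: true },
  { id: "A4", label: "vulnerable", confidence: 1, hit: false },
  { id: "B5", label: "safe", confidence: 0, hit: true },
]);
console.log(result.precision, result.recall, result.borderlineDisagreed);
// 1 0.75 [ 'B5' ]
```

With these rows, a strict JOIN would count B5 as an FP and report precision 3/4; excluding it leaves headline precision at 3/3 and records B5 under the "tool disagreed (defensible)" side of the breakdown.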

Decision 5 — Re-run with the documented 7B model before quoting --local-llm numbers

Problem surfaced. The only chat model available locally was qwen2.5-coder:1.5b. getdebug’s docs list the default as qwen2.5-coder:7b. The 1.5B model showed a 7% spurious-finding rate; the 7B model’s rate is likely lower (and the model itself slower).

Decision. Before publishing any --local-llm headline number from CodeSecBench, run with qwen2.5-coder:7b, deepseek-r1:7b, and llama3.1:8b, and report each separately. The 1.5B run is published as “indicative of low-budget local SAST” but not as the headline.

What this means for repo #1’s recorded numbers

Open follow-ups