Methodology
How CodeSecBench scores tools — span labels, hallucination control, category disagreement, borderline rows, and the local-LLM model rule.
Notes from the first end-to-end run (getdebug 0.5.1 vs cst-nextjs-chat, 2026-06-07). The decisions below are locked in before the corpus pattern clones to repos #2–#6 — otherwise the gaps replicate 5×.
Decision 1 — Span labels in truth files (not only point labels)
Problem surfaced. getdebug fired on app/api/chat/route.ts:42 (stream: true); the truth label for the same bug was at app/api/chat/route.ts:48 (the for await loop). The bug spans both lines (and several in between — the whole streaming-call block, lines 38–48). A point label at :48 with a 5-line tolerance can’t reach back to :42.
Decision. Truth rows support lineStart and lineEnd, not just line. A row with lineStart: 38, lineEnd: 48 means “any finding overlapping [38, 48] credits this row.” Point labels (line: N) stay supported as lineStart = lineEnd = N. The schema in schema.json allows either shape.
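The overlap rule above can be sketched in a few lines. This is a minimal illustration, not the actual score.js: the field names line, lineStart, and lineEnd come from the text, while the finding shape and the helper names truthSpan and creditsRow are assumptions made for the example.

```javascript
// Sketch only: field names follow the prose (line, lineStart, lineEnd);
// the finding shape and function names are assumptions for illustration.

/** Normalize a truth row to a closed interval [start, end]. */
function truthSpan(row) {
  if (row.lineStart !== undefined && row.lineEnd !== undefined) {
    return { start: row.lineStart, end: row.lineEnd };
  }
  // A point label is treated as a span of length one.
  return { start: row.line, end: row.line };
}

/** True when the finding's line range overlaps the truth row's span. */
function creditsRow(finding, row) {
  if (finding.file !== row.file) return false;
  const span = truthSpan(row);
  // A finding without an explicit end covers a single line.
  const findingEnd = finding.lineEnd ?? finding.lineStart;
  return finding.lineStart <= span.end && findingEnd >= span.start;
}

const finding = { file: "app/api/chat/route.ts", lineStart: 42 };
const pointRow = { file: "app/api/chat/route.ts", line: 48 };
const spanRow = { file: "app/api/chat/route.ts", lineStart: 38, lineEnd: 48 };

console.log(creditsRow(finding, pointRow)); // false
console.log(creditsRow(finding, spanRow)); // true
```

The point label at :48 misses the finding at :42 (and would still miss it with a 5-line tolerance, since 48 - 5 = 43), while the span [38, 48] credits it.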
For repo #1, existing point labels stay; the next pass through this corpus widens the lines that should be spans (notably #11, #4, #9, #10 — anywhere a bug genuinely spans multiple lines). Repos #2–#6 are authored with span labels from day one.
Decision 2 — Hallucination detection via negative-control files
Problem surfaced. The local-LLM pass on cst-nextjs-chat fired three findings that any honest reviewer would call hallucinations:
- db/seed.ts:24-30 claimed prompt-injection on a hardcoded Drizzle insert.
- lib/rag.ts:18-20 claimed xss on a for-loop reading files from disk.
- lib/tools/calculator.ts:1-20 claimed sql-injection on a regex-allowlist arithmetic parser.
The scorer counts these as “out-of-scope” (no truth row in span), so they are not penalized. A developer reading the output sees them as noise; the corpus should call that out.
Decision. Every Tier C repo includes a lib/known-safe.ts (or equivalent) — a file of innocuous helper code, ~50–100 lines, with no vulnerabilities of any category. Any finding inside known-safe.ts is a guaranteed false positive — labeled in the truth file as a whole-file safe declaration. The scorer treats hits there as FPs, not out-of-scope.
This adds a third precision grade beyond strict-JOIN precision and total-finding-count precision: hallucination rate = findings-in-known-safe / total-findings.
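As a rough sketch of the hallucination-rate formula: the set safeFiles stands in for the whole-file safe declarations, whose encoding in the truth file is an assumption here, and the sample findings are hypothetical rather than taken from a real run.

```javascript
// Sketch only: `safeFiles` stands in for the whole-file safe declarations
// in the truth file; how they are actually encoded is an assumption.

/** Fraction of all findings that land inside a declared known-safe file. */
function hallucinationRate(findings, safeFiles) {
  if (findings.length === 0) return 0;
  const inSafe = findings.filter((f) => safeFiles.has(f.file)).length;
  return inSafe / findings.length;
}

const safeFiles = new Set(["lib/known-safe.ts"]);

// Hypothetical findings, for illustration only.
const sampleFindings = [
  { file: "lib/known-safe.ts", lineStart: 12 }, // guaranteed false positive
  { file: "app/api/chat/route.ts", lineStart: 42 },
  { file: "lib/user.ts", lineStart: 9 },
  { file: "lib/checkout.ts", lineStart: 9 },
];

console.log(hallucinationRate(sampleFindings, safeFiles)); // 0.25
```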
Decision 3 — Track category disagreement separately; do not penalize TP
Problem surfaced. getdebug correctly identified lib/user.ts:9 as a security issue but labeled it unsafe-role-merge when the truth says pii-in-prompt. The line is right; the category is debatable (the string IS in a system-role merge AND contains PII — both are true). Strict category matching would call this an FN + FP. That’s punishing a correct find.
Decision. TP/FP/FN are scored on line-overlap only. Category-disagreement gets its own roll-up — categoryAgreement = TP-with-matching-category / total-TP. Quoted separately. A tool with 100% recall and 70% category agreement is still a great tool; a tool with 100% recall and 0% category agreement may have a taxonomy mismatch worth investigating.
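The roll-up could look something like the sketch below. It assumes each true positive is stored as a pair of the tool's category and the truth category; the field names findingCategory and truthCategory, and the second sample row, are hypothetical.

```javascript
// Sketch only: each entry pairs a finding with the truth row it overlapped.
// The field names findingCategory / truthCategory are assumptions.

/** TP-with-matching-category / total-TP; null when there are no TPs. */
function categoryAgreement(truePositives) {
  if (truePositives.length === 0) return null;
  const agreeing = truePositives.filter(
    (tp) => tp.findingCategory === tp.truthCategory
  ).length;
  return agreeing / truePositives.length;
}

const truePositives = [
  // Line matches, category differs: still a TP, but it lowers agreement.
  { file: "lib/user.ts", findingCategory: "unsafe-role-merge", truthCategory: "pii-in-prompt" },
  // Hypothetical second TP where tool and truth agree on the category.
  { file: "example.ts", findingCategory: "xss", truthCategory: "xss" },
];

console.log(categoryAgreement(truePositives)); // 0.5
```

Returning null for an empty TP set keeps "no true positives" distinguishable from "0% agreement" in the report.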
Decision 4 — Borderline rows do not move the precision number
Problem surfaced. getdebug fired on lib/checkout.ts:9 (B5, labeled safe with confidence 0 — UUID-as-PII, threat-model-dependent). Under strict JOIN that’s an FP that drags precision down by ~17 points on a small corpus.
Decision. Rows with confidence: 0 are excluded from headline precision/recall. They’re still printed in the per-fixture detail with a “borderline agreement: tool agreed with our adjudication / tool disagreed (defensible)” breakdown. A vendor that flags B5 is defensibly right — penalizing them rewards aligning with our adjudication, which corrupts the benchmark.
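A small sketch of the exclusion, under stated assumptions: the label and confidence fields follow the text, but the "vulnerable" label value, the confidence value 1 for ordinary rows, and the joined-finding shape are invented for the example. The numbers are hypothetical, chosen so that one borderline hit costs roughly 17 points.

```javascript
// Sketch only: `label` and `confidence` follow the prose; the "vulnerable"
// value and the joined shape (a finding plus the row it hit) are assumptions.

/** Precision over joined findings, skipping hits on confidence-0 rows. */
function headlinePrecision(joined) {
  const counted = joined.filter((j) => j.row.confidence !== 0);
  const tp = counted.filter((j) => j.row.label === "vulnerable").length;
  return counted.length === 0 ? null : tp / counted.length;
}

/** Same ratio with borderline rows counted, as under a strict JOIN. */
function strictPrecision(joined) {
  const tp = joined.filter((j) => j.row.label === "vulnerable").length;
  return joined.length === 0 ? null : tp / joined.length;
}

// Hypothetical small run: five true positives plus one hit on a borderline row.
const joinedRun = [
  ...Array.from({ length: 5 }, () => ({ row: { label: "vulnerable", confidence: 1 } })),
  { row: { id: "B5", label: "safe", confidence: 0 } },
];

console.log(strictPrecision(joinedRun).toFixed(3)); // "0.833"
console.log(headlinePrecision(joinedRun).toFixed(3)); // "1.000"
```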
Decision 5 — Re-run with the documented 7B model before quoting --local-llm numbers
Problem surfaced. The only chat model available locally was qwen2.5-coder:1.5b, while getdebug’s docs give the default as qwen2.5-coder:7b. The 1.5B model showed a 7% spurious-finding rate; the 7B model is likely to show a lower rate (and to run slower).
Decision. Before publishing any --local-llm headline number from CodeSecBench, run with qwen2.5-coder:7b, deepseek-r1:7b, and llama3.1:8b, and report each separately. The 1.5B run is published as “indicative of low-budget local SAST” but not the headline.
What this means for repo #1’s recorded numbers
- Default pass: 23% recall / 75% precision stands as the baseline.
- --local-llm with qwen 1.5B: 38% recall / 83% join-precision stands as indicative, with the 7B re-run owed.
- Files getdebug-0.5.1-default-2026-06-07.json and getdebug-0.5.1-local-llm-2026-06-07.json are correct under the v0.1.0 methodology; they will not be retroactively rescored when v0.2.0 lands. The CHANGELOG marks v0.2.0 as a methodology-revision break.
Open follow-ups
- Author lib/known-safe.ts in cst-nextjs-chat with ~50 lines of helpers (hallucination control file).
- Widen #11 (and other multi-line bugs) to span labels in targets/cst-nextjs-chat.json. Bump truthVersion to 0.1.1.
- Update score.js to (a) treat any hit in a whole-file-safe file as FP, (b) emit categoryAgreement rollup, (c) skip confidence-0 rows from headline metrics.
- Re-run getdebug 0.5.1 with --local-llm-model=deepseek-r1:7b for an apples-to-apples-with-docs comparison.