Evaluation & Tuning

Drift Detector was tuned through 21 rounds using the scientific method: pre-registered hypotheses, controlled probes, and a hard FP=0 (zero false positives) invariant that was never broken.

Corpus

  • Real corpus: 1,283 labeled entries extracted from production Claude Code sessions (drift=377, ok=906). A further 1,028 unlabeled entries are excluded from eval.
  • Synthetic corpus: 190 hand-crafted sessions covering edge cases, adversarial patterns, and calibration attacks. 100% accuracy, FP=0.
  • Adversarial unit test: 37 cases targeting latent FP/FN patterns in classify_user_reply(). Permanent regression gate.

Round history

| Round | F1 | Key change |
|---|---|---|
| R0–R17 | 0.21→0.587 | Rule-based gate chain |
| R18 | 0.633 | 22-feature LR classifier |
| R19 | 0.9543 | ExtraTree 43-feature + DCD steps=8, t=0.58 |
| R20 | 0.977 | 11 new classify_user_reply patterns + DCD steps=10 |
| R21 | 0.9973 | 17 patterns + exact-match gate + URL-only gate |

Reproduce

python3 scripts/backtest_real.py --dcd --dcd-steps 10
# Expected: F1=0.9973 tp=375 fp=0 fn=2 tn=906

python3 scripts/eval_morin.py
# Expected: n=190 acc=1.0 FP=0.0 FN=0.0

python3 scripts/adversarial_classify_test.py
# Expected: PASS=37 FAIL=0
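The F1 figure in the expected backtest output can be checked by hand from the quoted counts. The snippet below is a minimal sketch of that arithmetic; the function name f1_from_counts is illustrative and is not taken from the project's scripts.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Counts quoted in the expected output of backtest_real.py.
tp, fp, fn, tn = 375, 0, 2, 906

precision = tp / (tp + fp)   # no false positives, so precision is 1.0
recall = tp / (tp + fn)      # 375 of the 377 drift entries are caught
f1 = f1_from_counts(tp, fp, fn)

print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
# prints: precision=1.0000 recall=0.9947 F1=0.9973
```

The counts also agree with the corpus labels: tp + fn = 377 drift entries and tn + fp = 906 ok entries.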

R21 mechanism additions

R21 added 17 patterns to classify_user_reply() plus two precision gates (the first two items below). The third item records a cascade triggered by one of the new patterns:

  • Exact-match gate: frozenset(["try it", "c"]). Matching these as substrings would fire broadly, so the gate requires the whole reply to equal one of them; this captures the 2 real-corpus TPs with zero FPs.
  • URL-only gate: bare URL paste = implicit redirect / agent failure signal; verified 0 FP on 906 ok entries.
  • DCD cascade via "i just logged into": this single pattern enabled a 4-entry cascade. It converted one entry's follow to correction_substance, which DCD then found at N+1, N+2, and N+3 for the three preceding entries.
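The two precision gates above can be sketched as follows. This is a minimal illustration, not the code inside classify_user_reply(): the function names, the strip-and-lowercase normalization, and the URL regex are all assumptions. Only the frozenset contents come from the R21 notes.

```python
import re
from typing import Optional

# The two phrases are quoted from the R21 notes; everything else is illustrative.
EXACT_MATCH_PHRASES = frozenset(["try it", "c"])

# Assumption: a "bare URL" is a single http(s) token with nothing around it.
URL_ONLY_RE = re.compile(r"https?://\S+")


def _normalize(reply: str) -> str:
    """Assumed normalization: trim surrounding whitespace and lowercase."""
    return reply.strip().lower()


def exact_match_gate(reply: str) -> bool:
    """Fire only when the whole reply equals a gated phrase, never on substrings."""
    return _normalize(reply) in EXACT_MATCH_PHRASES


def url_only_gate(reply: str) -> bool:
    """Fire only when the reply consists of a URL and nothing else."""
    return URL_ONLY_RE.fullmatch(reply.strip()) is not None


def precision_gates(reply: str) -> Optional[str]:
    """Return the name of the first gate that fires, or None if neither does."""
    if exact_match_gate(reply):
        return "exact_match"
    if url_only_gate(reply):
        return "url_only"
    return None


for text in ["try it", "let me try it later", "cat", "https://example.com/docs"]:
    print(repr(text), "->", precision_gates(text))
```

The design point is the use of set membership rather than substring search: a reply such as "cat" contains "c" but is not equal to it, so the gate stays silent.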

Adversarial hardening (2026-06-22)

20 sessions added to eval_session_corpus.json targeting:

  • FP-bait: single precision hedges in technical prose (should NOT fire)
  • FN-bait: academic hedge vocabulary, velocity-ramp drift, pleasantry cascade, oscillating patterns (should fire)
  • Edge cases: recovery/relapse, calibration attacks, single spike

Key findings documented in EXPERIMENTS.md:

  • 16 _INLINE patterns are latent FPs: they can fire in ok context but produce 0 FPs on the 906 ok entries of the real corpus (latent risk only, not actionable)
  • 13 correction phrasings are FN gaps ("off track", "doesn't match", "going in circles"); they are not added yet, pending 0-FP verification on the 906 ok entries
  • Drift scorer: the trajectory functions default to thr=70, which differs from the profile's thr=50; eval is self-consistent
  • Paraphrase drift without explicit markers is out of scope by design (caveman profile is vocabulary-based, not semantic)

Falsification log (DO-NOT-RE-ATTACK)

| Pattern | Why killed |
|---|---|
| "check again" | 4 FPs in ok entries |
| "try now" | 2 FPs (exact "Try now" in ok context) |
| "melloa is sudo pass" | 3 FPs (credential provision in ok context) |
| threshold < 0.58 | FPs increase faster than TPs |
| HGB/LGBM/softvote/ET49 | All worse than ET43 baseline |
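The kill criterion behind the pattern rows above amounts to counting hits on ok-labeled entries and rejecting any candidate with a non-zero count. Below is a minimal sketch of such a check, assuming case-insensitive substring matching. The helper names and the three toy replies are invented for illustration and are not drawn from the real corpus.

```python
from typing import Iterable, List


def false_positive_hits(pattern: str, ok_replies: Iterable[str]) -> List[str]:
    """Return every ok-labeled reply that the candidate pattern would fire on."""
    needle = pattern.lower()
    return [reply for reply in ok_replies if needle in reply.lower()]


def passes_fp_invariant(pattern: str, ok_replies: Iterable[str]) -> bool:
    """A candidate survives only if it fires on zero ok-labeled replies."""
    return not false_positive_hits(pattern, ok_replies)


# Toy stand-in for the 906 ok entries (invented examples, not real data).
ok_replies = [
    "Try now",
    "looks good, thanks",
    "please check again after the deploy",
]

for candidate in ("check again", "try now", "off track"):
    hits = false_positive_hits(candidate, ok_replies)
    verdict = "keep" if not hits else f"kill ({len(hits)} FP)"
    print(f"{candidate!r}: {verdict}")
```

Running a check of this kind against the full ok set before adding a pattern is what keeps the FP=0 invariant intact; a pattern that fails it goes into the log rather than into the classifier.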

Scientific method discipline

Every hypothesis was pre-registered with a prediction before any probe was run. Verdicts are documented in EXPERIMENTS.md with evidence references. Retractions are struck through and accompanied by dated corrections; nothing is silently edited.

The ledger is the moat: dead ends are documented so no future session re-litigates settled questions.