Evaluation & Tuning
Drift Detector was tuned through 21 rounds using the scientific method: pre-registered hypotheses, controlled probes, and a hard FP=0 invariant that was never broken.
Corpus
- Real corpus: 1,283 labeled entries extracted from production Claude Code sessions (drift=377, ok=906). A further 1,028 unlabeled entries are excluded from eval.
- Synthetic corpus: 190 hand-crafted sessions covering edge cases, adversarial patterns, and calibration attacks. 100% accuracy, FP=0.
- Adversarial unit test: 37 cases targeting latent FP/FN patterns in classify_user_reply(). Permanent regression gate.
Round history
| Round | F1 | Key change |
|---|---|---|
| R0–R17 | 0.21→0.587 | Rule-based gate chain |
| R18 | 0.633 | 22-feature LR classifier |
| R19 | 0.9543 | ExtraTree 43-feature + DCD steps=8, t=0.58 |
| R20 | 0.977 | 11 new classify_user_reply patterns + DCD steps=10 |
| R21 | 0.9973 | 17 patterns + exact-match gate + URL-only gate |
Reproduce
python3 scripts/backtest_real.py --dcd --dcd-steps 10
# Expected: F1=0.9973 tp=375 fp=0 fn=2 tn=906
python3 scripts/eval_morin.py
# Expected: n=190 acc=1.0 FP=0.0 FN=0.0
python3 scripts/adversarial_classify_test.py
# Expected: PASS=37 FAIL=0
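The headline F1 can be cross-checked by hand from the expected confusion counts above. The following is a minimal sketch of that arithmetic; the helper name `precision_recall_f1` is illustrative and is not part of the project's scripts.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the expected backtest output: tp=375 fp=0 fn=2
precision, recall, f1 = precision_recall_f1(tp=375, fp=0, fn=2)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
# prints: precision=1.0000 recall=0.9947 F1=0.9973
```

With FP=0 the precision is exactly 1, so the F1 is determined entirely by the two remaining false negatives out of 377 drift entries.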
R21 mechanism additions
R21 added 17 patterns to classify_user_reply() plus two precision gates:
- Exact-match gate: frozenset(["try it", "c"]). A substring match would fire broadly; exact match captures the 2 real-corpus TPs with zero FPs.
- URL-only gate: a bare URL paste is treated as an implicit redirect / agent failure signal; verified 0 FP on the 906 ok entries.
- DCD cascade via "i just logged into": a single pattern enabled a 4-entry cascade by converting one entry's follow to correction_substance, which DCD then found at N+1, N+2, and N+3 for the three predecessors.
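The difference between the two precision gates and a naive substring match can be illustrated with a short sketch. Only the frozenset contents come from the text above; the normalization (strip and lowercase), the URL regex, and all function names are assumptions made for illustration, not the actual classify_user_reply() implementation.

```python
import re

# Phrases from the R21 exact-match gate.
EXACT_MATCH_PHRASES = frozenset(["try it", "c"])

# Assumed pattern: a single http(s) URL with no surrounding text.
URL_ONLY_RE = re.compile(r"https?://\S+")


def normalize(reply: str) -> str:
    """Assumed normalization: trim whitespace and lowercase."""
    return reply.strip().lower()


def exact_match_gate(reply: str) -> bool:
    """Fire only when the whole reply equals one of the gate phrases."""
    return normalize(reply) in EXACT_MATCH_PHRASES


def substring_match(reply: str) -> bool:
    """Naive alternative: fire when any gate phrase appears anywhere."""
    text = normalize(reply)
    return any(phrase in text for phrase in EXACT_MATCH_PHRASES)


def url_only_gate(reply: str) -> bool:
    """Fire only when the reply is a bare URL and nothing else."""
    return URL_ONLY_RE.fullmatch(reply.strip()) is not None


if __name__ == "__main__":
    for reply in ["c", "Try it", "check the cache", "https://example.com/docs"]:
        print(repr(reply), exact_match_gate(reply), substring_match(reply), url_only_gate(reply))
```

Because "c" is a substring of almost any English sentence, the substring variant fires on a reply such as "check the cache", while the exact-match gate stays silent. That is the broad firing the gate is meant to avoid.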
Adversarial hardening (2026-06-22)
20 sessions added to eval_session_corpus.json targeting:
- FP-bait: single precision hedges in technical prose (should NOT fire)
- FN-bait: academic hedge vocabulary, velocity-ramp drift, pleasantry cascade, oscillating patterns (should fire)
- Edge cases: recovery/relapse, calibration attacks, single spike
Key findings documented in EXPERIMENTS.md:
- 16 _INLINE patterns are latent FPs: they fire in ok-context, but produce 0 FPs on the 906 ok entries of the real corpus (latent risk only, not actionable).
- 13 correction phrasings are FN gaps ("off track", "doesn't match", "going in circles"); not added yet, pending 0-FP verification on ok=906.
- Drift scorer: the trajectory functions default to thr=70, which differs from the profile's thr=50; the eval is self-consistent.
- Paraphrase drift without explicit markers is out of scope by design (caveman profile is vocabulary-based, not semantic)
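The 0-FP verification that candidate phrasings must pass before being added could look roughly like the sketch below. It assumes case-insensitive substring matching and a plain list of ok-labeled reply texts; the function names and the toy entries are invented for illustration and say nothing about what the real 906 ok entries contain.

```python
def count_false_positives(pattern: str, ok_entries: list[str]) -> int:
    """Count ok-labeled entries that a candidate pattern would fire on."""
    needle = pattern.lower()
    return sum(1 for text in ok_entries if needle in text.lower())


def passes_zero_fp_gate(pattern: str, ok_entries: list[str]) -> bool:
    """A candidate is admissible only if it fires on no ok entry."""
    return count_false_positives(pattern, ok_entries) == 0


# Toy stand-in for the ok set (hypothetical data, not the real corpus).
toy_ok_entries = [
    "Looks good, thanks.",
    "Thanks, I'll check again tomorrow.",
    "Ship it.",
]

for candidate in ["going in circles", "check again"]:
    hits = count_false_positives(candidate, toy_ok_entries)
    verdict = "admissible" if hits == 0 else "rejected"
    print(f"{candidate!r}: {hits} FP(s) on toy ok set -> {verdict}")
```

Running the same check against the full ok=906 set, rather than a toy list, is what the hard FP=0 invariant requires before any pattern is merged.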
Falsification log (DO-NOT-RE-ATTACK)
| Pattern | Why killed |
|---|---|
| "check again" | 4 FPs in ok entries |
| "try now" | 2 FPs (exact "Try now" in ok context) |
| "melloa is sudo pass" | 3 FPs (credential provision in ok context) |
| threshold < 0.58 | FPs increase faster than TPs |
| HGB/LGBM/softvote/ET49 | All worse than ET43 baseline |
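One way to keep the falsification log enforceable is to encode the killed patterns as data and consult them before proposing a new pattern. This is a hypothetical sketch, not something the project is stated to do: the dictionary mirrors the three pattern rows of the table above, and the names `FALSIFIED_PATTERNS` and `falsification_reason` are invented here.

```python
from typing import Optional

# Killed patterns and the recorded reason, copied from the falsification log.
FALSIFIED_PATTERNS = {
    "check again": "4 FPs in ok entries",
    "try now": '2 FPs (exact "Try now" in ok context)',
    "melloa is sudo pass": "3 FPs (credential provision in ok context)",
}


def falsification_reason(candidate: str) -> Optional[str]:
    """Return why a candidate was killed, or None if it is not in the log."""
    return FALSIFIED_PATTERNS.get(candidate.strip().lower())


for candidate in ["Try now", "off track"]:
    reason = falsification_reason(candidate)
    if reason is None:
        print(f"{candidate!r}: not in the falsification log")
    else:
        print(f"{candidate!r}: do not re-attack ({reason})")
```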
Scientific method discipline
Every hypothesis was pre-registered with a prediction before any probe was run.
Verdicts are documented in EXPERIMENTS.md with evidence references. Retractions
are struck-through with dated corrections; nothing is silently edited.
The ledger is the moat: dead ends are documented so no future session re-litigates settled questions.