Evaluation & Tuning¶

Drift Detector was tuned through 21 rounds using the scientific method: pre-registered hypotheses, controlled probes, and a hard FP=0 invariant that was never broken.

Corpus¶

Real corpus: 1,283 labeled entries extracted from production Claude Code sessions. Labels: drift=377, ok=906, unlabeled=1,028 (excluded from eval).
Synthetic corpus: 190 hand-crafted sessions covering edge cases, adversarial patterns, and calibration attacks. 100% accuracy, FP=0.
Adversarial unit test: 37 cases targeting latent FP/FN patterns in classify_user_reply(). Permanent regression gate.

Round history¶

Round	F1	Key change
R0–R17	0.21→0.587	Rule-based gate chain
R18	0.633	22-feature LR classifier
R19	0.9543	ExtraTree 43-feature + DCD steps=8, t=0.58
R20	0.977	11 new classify_user_reply patterns + DCD steps=10
R21	0.9973	17 patterns + exact-match gate + URL-only gate

Reproduce¶

python3 scripts/backtest_real.py --dcd --dcd-steps 10
# Expected: F1=0.9973 tp=375 fp=0 fn=2 tn=906

python3 scripts/eval_morin.py
# Expected: n=190 acc=1.0 FP=0.0 FN=0.0

python3 scripts/adversarial_classify_test.py
# Expected: PASS=37 FAIL=0

R21 mechanism additions¶

R21 added 17 patterns to classify_user_reply() plus two precision gates:

Exact-match gate: frozenset(["try it", "c"]) — substring match would fire broadly; exact-match captures the 2 real corpus TPs with zero FPs.
URL-only gate: bare URL paste = implicit redirect / agent failure signal; verified 0 FP on 906 ok entries.
DCD cascade via "i just logged into": single pattern enabled a 4-entry cascade by converting one entry's follow to correction_substance, which DCD then found at N+1, N+2, N+3 for three predecessors.

Adversarial hardening (2026-06-22)¶

20 sessions added to eval_session_corpus.json targeting:

FP-bait: single precision hedges in technical prose (should NOT fire)
FN-bait: academic hedge vocabulary, velocity-ramp drift, pleasantry cascade, oscillating patterns (should fire)
Edge cases: recovery/relapse, calibration attacks, single spike

Key findings documented in EXPERIMENTS.md:

16 _INLINE patterns are latent FPs (fire in ok-context, but 0 FP in real corpus 906 ok entries — latent risk only, not actionable)
13 correction phrasings are FN gaps ("off track", "doesn't match", "going in circles") — not added yet pending 0-FP verification on ok=906
Drift scorer: default thr=70 in trajectory functions ≠ profile thr=50; eval is self-consistent
Paraphrase drift without explicit markers is out of scope by design (caveman profile is vocabulary-based, not semantic)

Falsification log (DO-NOT-RE-ATTACK)¶

Pattern	Why killed
"check again"	4 FPs in ok entries
"try now"	2 FPs (exact "Try now" in ok context)
"melloa is sudo pass"	3 FPs (credential provision in ok context)
threshold < 0.58	FPs increase faster than TPs
HGB/LGBM/softvote/ET49	All worse than ET43 baseline

Scientific method discipline¶

Every hypothesis was pre-registered with a prediction before any probe was run. Verdicts are documented in EXPERIMENTS.md with evidence references. Retractions are struck-through with dated corrections; nothing is silently edited.

The ledger is the moat: dead ends are documented so no future session re-litigates settled questions.