Overview — what this corpus is
This is a 2000-decision Burl harvest collected on fresh seeds from
gus/data/corpus_train_chunk_0-99.pt. For every Texas-42 decision
point, we ran the wax-museum tool-loop against Burl (Gemma 4 E2B + LoRA
via MLX-LM, batched at 6 sequences, max_tokens=2048) on variant
D_required_first ("Turn 1: belief_trajectory()
first"). Each decision is then classified into one of 14 buckets based on
how Burl's final play compared to π (policy head), Q-mean (K=200
belief-sampled), and the oracle's E[Q] argmax — using
per_decision_train_chunk_0_99_k200.jsonl as ground truth.
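The bucket assignment reduces to a three-way comparison against the oracle argmax. A minimal sketch of that logic, with bucket semantics taken from the descriptions on this page (the real classifier also handles forced/illegal commits and any edge cases not named here):

```python
def classify(final, pi, qmean, oracle):
    """Map one scored decision to a bucket by comparing Burl's final play,
    the policy head (pi), the K=200 Q-mean argmax, and the oracle's E[Q]
    argmax. Guarded cases (forced/illegal) are assumed filtered out first."""
    if final == oracle:                      # strict pool
        if pi == oracle and qmean == oracle:
            return "ALL_AGREE_CORRECT"
        if qmean == oracle:
            return "BOTH_FIX"                # pi wrong, Q-mean right
        if pi == oracle:
            return "BURL_FOLLOWS_PI_RIGHT"   # pi right, Q-mean wrong
        # both heads wrong: did they agree on the same wrong play?
        return "BURL_INDEPENDENT_RIGHT" if pi == qmean else "BURL_ALONE_FIXES"
    # loss pool: Burl's final play is wrong
    if pi == oracle and qmean == oracle:
        return "BURL_BREAKS_CONSENSUS"
    if pi == oracle:
        return "BURL_PARROTS_QMEAN_WRONG" if final == qmean else "BURL_DRIFTS_FROM_PI"
    if qmean == oracle:
        return "QMEAN_ALONE_FIXES"
    # all three wrong from here
    if final == pi == qmean:
        return "ALL_AGREE_WRONG"
    if final == pi:
        return "BURL_PARROTS_PI_WRONG"
    return "BURL_INDEPENDENT_WRONG"
```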
The corpus has two downstream uses:
- Strict pool (1062 rows, 202 non-trivial): positive STaR examples — Burl reasoned to the oracle answer and we want to lock that behaviour in.
- Loss pool (719 rows, sharpest target 299 BURL_BREAKS_CONSENSUS): rationalization STaR targets — Burl produced a wrong final play; the oracle's correct answer is shown to the model and we ask it to write the trajectory that *would* have led there.
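The pool split itself is a bucket-label partition over the per-decision JSONL. A sketch, assuming a `bucket` field per row (the field name is an assumption; the guarded buckets are excluded from both pools):

```python
import json

STRICT = {"ALL_AGREE_CORRECT", "BURL_ALONE_FIXES", "BOTH_FIX",
          "BURL_INDEPENDENT_RIGHT", "BURL_FOLLOWS_PI_RIGHT"}
GUARDED = {"FORCED_COMMIT", "ILLEGAL", "OTHER"}

def split_pools(path):
    """Partition harvested decisions into the strict (positive STaR) and
    loss (rationalization STaR) pools by bucket label; guarded rows are
    dropped. Returns (strict, non-trivial strict, loss)."""
    strict, loss = [], []
    with open(path) as fh:
        for line in fh:
            row = json.loads(line)
            if row["bucket"] in STRICT:
                strict.append(row)
            elif row["bucket"] not in GUARDED:
                loss.append(row)
    nontrivial = [r for r in strict if r["bucket"] != "ALL_AGREE_CORRECT"]
    return strict, nontrivial, loss
```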
Why we re-ran the harvest (the v1 contamination story)
The first 2000-decision batched run (harvest_batched_20260425_031033,
1398/2000 before SIGKILL) was contaminated by a turn-1 truncation bug. The
user's morning intuition flagged it: "we didn't have short-decision
problems before batched mode". Three parallel agents validated the
diagnosis: max_tokens=1024 was too tight for the model's
preferred thinking-then-tool-call pattern. In ~11.4% of decisions, the model
exhausted its budget mid-thinking-block on turn 1 and never reached the
belief_trajectory() tool call. The harness re-prompted, the model
eventually called belief_trajectory() on turn 2, and we lost
~2500 chars of mid-sentence reasoning that was never persisted as a
structured thinking event. The bucket-parity gate was a
false pass — distribution-level errors cancelled out, but 43% of
decisions actually changed bucket between the sequential and batched runs.
Fix: max_tokens 1024 → 2048 (justified by the data — sequential's p99 was
1770 chars / ~600 tokens, max 2639 chars / ~900 tokens; 2048 covers the full
observed range with comfortable margin), plus a per-wave OOM-resilience layer
(a try/except around the classifier, a quarantine.jsonl ledger,
SIGKILL-recoverable sentinel files, and a --rerun-quarantined retry pass).
The rerun was launched as v2 at 07:29:10. Zero quarantine fires across the
5h 46m run.
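The resilience layer can be sketched as a per-wave wrapper (function and file names here are illustrative, not the harness's actual API): failures land in the ledger instead of killing the run, and a sentinel file per wave makes a SIGKILLed run resumable.

```python
import json
from pathlib import Path

def classify_wave(wave_id, decisions, classify_fn, out_dir):
    """Per-wave resilience sketch: a failing classification is appended to
    a quarantine.jsonl ledger instead of aborting the harvest, and a
    sentinel file marks the wave done so a restarted run skips it."""
    out = Path(out_dir)
    sentinel = out / f"wave_{wave_id}.done"
    if sentinel.exists():              # wave finished before a crash/restart
        return
    quarantine = out / "quarantine.jsonl"
    for dec in decisions:
        try:
            classify_fn(dec)
        except Exception as exc:       # OOM-adjacent failures go to the ledger
            with quarantine.open("a") as fh:
                fh.write(json.dumps({"wave": wave_id,
                                     "decision": dec.get("gi"),
                                     "error": repr(exc)}) + "\n")
    sentinel.touch()
```

A later `--rerun-quarantined` pass would then re-read the ledger and retry just those decisions.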
Reference docs (siblings to this page): LENGTH_STATS_COMPARISON.md · PARITY_AUDIT.md · HARVEST_RESILIENCE_NOTES.md · prod_2000_v2.log
Validation gates — v2 vs v1 vs sequential baseline
| Gate | Sequential 560 (max_tokens=8192) | Batched v1 (max_tokens=1024) — CONTAMINATED | Batched v2 (max_tokens=2048) |
|---|---|---|---|
| belief NOT at turn 1 | ~0% | 11.4% | 0.0% |
| truncated_at_cap | 0 | ~1.2% | 0 |
| bailed | 0 | 0 | 0 |
| illegal commits | 0 | 0 | 0 |
| matches_bot rate | ~56% | ~58% | 61.7% |
| forced_commit rate | 12.3% | ~12% | 10.9% |
| quarantine fires | n/a | n/a | 0 / 333 waves |
| strict pool rate | 294/560 (52.5%) | contaminated | 1062/2000 (53.1%) |
Strict pool — Burl matched the oracle (1062 rows, 53.1% of corpus)
These are the rows where Burl's final play matched the oracle's argmax over E[Q]. The strict pool is the full positive-example corpus for filter-only STaR. The non-trivial subset (excluding ALL_AGREE_CORRECT, where π already had the right answer trivially) is the more interesting training material: 202 rows across the four "Burl-did-something-real" buckets.
ALL_AGREE_CORRECT 860 (43.0%) -0.2pp vs seq
All three (π, Q-mean, Burl) match the oracle. Trivial win — π already had it; Burl just didn't break it.
6 sample decisions
- gi=5 seed=5 • final=21 bot=21 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s — decision_5/transcript.live
- gi=348 seed=348 • final=23 bot=23 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=74.9s — decision_348/transcript.live
- gi=718 seed=718 • final=15 bot=15 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=58.5s — decision_718/transcript.live
- gi=1033 seed=1033 • final=15 bot=15 • Δeq=+0.00 • n_turns=5 • belief=[1, 3] • tools=['belief_trajectory', 'explore_game', 'belief_trajectory', 'probe_best_case'] • wall=38.8s — decision_1033/transcript.live
- gi=1357 seed=1357 • final=20 bot=20 • Δeq=+0.00 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case', 'probe_worst_case'] • wall=65.5s — decision_1357/transcript.live
- gi=1667 seed=1667 • final=12 bot=12 • Δeq=+0.00 • n_turns=12 • belief=[1, 2] • ext=3 • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case'] • wall=79.5s — decision_1667/transcript.live
BURL_ALONE_FIXES 14 (0.7%) -0.2pp vs seq
Burl alone matches the oracle. Both π and Q-mean are wrong, *and pick distinct wrong plays*. The cleanest gold signal: Burl reasoned its way past two heads that disagreed on the wrong answer.
6 sample decisions
- gi=180 seed=180 • final=19 bot=19 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=61.9s — decision_180/transcript.live
- gi=376 seed=376 • final=7 bot=7 • Δeq=+0.00 • n_turns=6 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=96.5s — decision_376/transcript.live
- gi=653 seed=653 • final=25 bot=10 • Δeq=+0.37 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=59.5s — decision_653/transcript.live
- gi=940 seed=940 • final=7 bot=7 • Δeq=+0.00 • n_turns=6 • belief=[1, 4] • tools=['belief_trajectory', 'explore_game', 'belief_trajectory', 'probe_best_case'] • wall=57.8s — decision_940/transcript.live
- gi=1131 seed=1131 • final=12 bot=12 • Δeq=+0.00 • n_turns=8 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=91.5s — decision_1131/transcript.live
- gi=1204 seed=1204 • final=20 bot=20 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=36.7s — decision_1204/transcript.live
BOTH_FIX 52 (2.6%) +1.0pp vs seq
Burl + Q-mean match the oracle; π is wrong. The 'belief sampling' route shows up cleanly here.
6 sample decisions
- gi=16 seed=16 • final=20 bot=9 • Δeq=+0.01 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=81.1s — decision_16/transcript.live
- gi=275 seed=275 • final=3 bot=3 • Δeq=+0.00 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=42.5s — decision_275/transcript.live
- gi=515 seed=515 • final=1 bot=1 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=71.4s — decision_515/transcript.live
- gi=896 seed=896 • final=12 bot=12 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case'] • wall=54.9s — decision_896/transcript.live
- gi=1056 seed=1056 • final=8 bot=8 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=55.1s — decision_1056/transcript.live
- gi=1416 seed=1416 • final=11 bot=11 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=46.9s — decision_1416/transcript.live
BURL_INDEPENDENT_RIGHT 88 (4.4%) -0.1pp vs seq
Burl matches the oracle; π and Q-mean agree on the *same* wrong play. Burl bucked a wrong consensus — the strongest evidence of independent reasoning.
6 sample decisions
- gi=3 seed=3 • final=12 bot=12 • Δeq=+0.00 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=74.9s — decision_3/transcript.live
- gi=156 seed=156 • final=16 bot=16 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=42.0s — decision_156/transcript.live
- gi=408 seed=408 • final=5 bot=27 • Δeq=+0.70 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=59.8s — decision_408/transcript.live
- gi=731 seed=731 • final=13 bot=13 • Δeq=+0.00 • n_turns=6 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=49.6s — decision_731/transcript.live
- gi=1184 seed=1184 • final=16 bot=16 • Δeq=+0.00 • n_turns=7 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case', 'probe_worst_case'] • wall=81.5s — decision_1184/transcript.live
- gi=1560 seed=1560 • final=25 bot=25 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=38.2s — decision_1560/transcript.live
BURL_FOLLOWS_PI_RIGHT 48 (2.4%) +0.1pp vs seq
π and Burl match the oracle; Q-mean is wrong. Burl correctly *did not* deviate to follow Q-mean's belief-sampling answer.
6 sample decisions
- gi=131 seed=131 • final=16 bot=16 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=56.1s — decision_131/transcript.live
- gi=504 seed=504 • final=0 bot=0 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=46.2s — decision_504/transcript.live
- gi=656 seed=656 • final=24 bot=18 • Δeq=+0.01 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=52.9s — decision_656/transcript.live
- gi=1171 seed=1171 • final=20 bot=20 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=38.0s — decision_1171/transcript.live
- gi=1517 seed=1517 • final=16 bot=16 • Δeq=+0.00 • n_turns=6 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=56.9s — decision_1517/transcript.live
- gi=1751 seed=1751 • final=9 bot=7 • Δeq=+0.03 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=156.4s — decision_1751/transcript.live
Loss buckets — Burl produced a wrong final play (719 rows, 36.0%)
Burl's regret across these rows is what STaR rationalization is meant to fix. The oracle's correct answer gets exposed to the model and we train it to write the trajectory that would have produced that answer. BURL_BREAKS_CONSENSUS is the sharpest signal: π and Q-mean both already knew the right answer; Burl alone deviated. There is no "the heads were confused" excuse here — Burl invented a wrong story.
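A rationalization target from this pool can be sketched as below. Field names (`decision_prompt`, `oracle_play`) and the prompt wording are assumptions for illustration; the real builder lives alongside the STaR scripts:

```python
def build_rationalization_example(row):
    """Turn one loss-pool row into a rationalization STaR target: the
    oracle's correct play is revealed, and the model is asked to write
    the reasoning trajectory that would have reached it."""
    prompt = (
        f"{row['decision_prompt']}\n\n"
        f"The optimal play here is {row['oracle_play']}. "
        "Walk through the belief_trajectory() and probe evidence that "
        "leads to this play, then commit it."
    )
    return {"gi": row["gi"], "prompt": prompt,
            "target_play": row["oracle_play"]}
```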
BURL_BREAKS_CONSENSUS 299 (14.9%) -2.4pp vs seq
π and Q-mean match the oracle, but Burl alone deviates to a wrong play. The mirror of BURL_INDEPENDENT_RIGHT — and the sharpest STaR rationalization target. The right answer was already known by both heads; Burl invented a wrong story.
6 sample decisions
- gi=1 seed=1 • final=25 bot=19 • Δeq=-3.53 • n_turns=8 • belief=[1] • ext=3 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s — decision_1/transcript.live
- gi=308 seed=308 • final=11 bot=11 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=46.0s — decision_308/transcript.live
- gi=662 seed=662 • final=27 bot=21 • Δeq=-0.03 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=53.4s — decision_662/transcript.live
- gi=954 seed=954 • final=13 bot=10 • Δeq=-3.96 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=46.0s — decision_954/transcript.live
- gi=1310 seed=1310 • final=8 bot=3 • Δeq=-0.33 • n_turns=7 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'explore_game'] • wall=49.6s — decision_1310/transcript.live
- gi=1680 seed=1680 • final=11 bot=21 • Δeq=-4.75 • n_turns=8 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=55.1s — decision_1680/transcript.live
BURL_INDEPENDENT_WRONG 148 (7.4%) +1.5pp vs seq
All three are wrong, but Burl picks something neither π nor Q-mean did. Burl made up its own incorrect answer.
6 sample decisions
- gi=0 seed=0 • final=7 bot=14 • Δeq=-4.62 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s — decision_0/transcript.live
- gi=318 seed=318 • final=14 bot=17 • Δeq=-4.23 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.8s — decision_318/transcript.live
- gi=532 seed=532 • final=20 bot=2 • Δeq=-1.45 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=25.9s — decision_532/transcript.live
- gi=857 seed=857 • final=11 bot=15 • Δeq=-14.76 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=97.4s — decision_857/transcript.live
- gi=1260 seed=1260 • final=5 bot=14 • Δeq=-4.38 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case'] • wall=59.6s — decision_1260/transcript.live
- gi=1655 seed=1655 • final=14 bot=0 • Δeq=-13.03 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=34.3s — decision_1655/transcript.live
ALL_AGREE_WRONG 100 (5.0%) +0.9pp vs seq
All three pick the same wrong play. The data says even the oracle's frontier is hard here.
6 sample decisions
- gi=41 seed=41 • final=4 bot=8 • Δeq=-4.11 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_best_case'] • wall=64.5s — decision_41/transcript.live
- gi=336 seed=336 • final=23 bot=20 • Δeq=+0.86 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=66.9s — decision_336/transcript.live
- gi=819 seed=819 • final=24 bot=26 • Δeq=-0.36 • n_turns=8 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'explore_game', 'explore_game', 'probe_best_case'] • wall=63.5s — decision_819/transcript.live
- gi=1044 seed=1044 • final=27 bot=26 • Δeq=-3.97 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=84.9s — decision_1044/transcript.live
- gi=1301 seed=1301 • final=26 bot=15 • Δeq=+1.05 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=87.5s — decision_1301/transcript.live
- gi=1590 seed=1590 • final=15 bot=16 • Δeq=-1.06 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=40.5s — decision_1590/transcript.live
BURL_PARROTS_PI_WRONG 57 (2.9%) +0.6pp vs seq
π == Burl, both wrong. Burl mirrored a wrong policy head.
6 sample decisions
- gi=40 seed=40 • final=20 bot=12 • Δeq=-8.39 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=64.5s — decision_40/transcript.live
- gi=356 seed=356 • final=26 bot=27 • Δeq=-0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=45.8s — decision_356/transcript.live
- gi=830 seed=830 • final=2 bot=9 • Δeq=-1.88 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=63.3s — decision_830/transcript.live
- gi=1054 seed=1054 • final=9 bot=3 • Δeq=-6.11 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=57.4s — decision_1054/transcript.live
- gi=1251 seed=1251 • final=5 bot=17 • Δeq=-0.01 • n_turns=8 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=49.2s — decision_1251/transcript.live
- gi=1515 seed=1515 • final=14 bot=12 • Δeq=-5.06 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=56.9s — decision_1515/transcript.live
QMEAN_ALONE_FIXES 37 (1.9%) +0.1pp vs seq
Q-mean matches the oracle; π and Burl don't. Burl missed a fix it could have inherited from belief-sampling.
6 sample decisions
- gi=46 seed=46 • final=20 bot=14 • Δeq=-14.05 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=45.0s — decision_46/transcript.live
- gi=224 seed=224 • final=20 bot=21 • Δeq=-10.40 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=56.7s — decision_224/transcript.live
- gi=709 seed=709 • final=25 bot=4 • Δeq=-9.35 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case'] • wall=48.9s — decision_709/transcript.live
- gi=909 seed=909 • final=10 bot=0 • Δeq=-0.08 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=59.4s — decision_909/transcript.live
- gi=1139 seed=1139 • final=20 bot=4 • Δeq=-11.96 • n_turns=7 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case', 'probe_worst_case'] • wall=56.0s — decision_1139/transcript.live
- gi=1544 seed=1544 • final=19 bot=17 • Δeq=-0.31 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=42.9s — decision_1544/transcript.live
BURL_PARROTS_QMEAN_WRONG 41 (2.0%) +0.0pp vs seq
π was right; Burl == Q-mean, and both are wrong. Burl trusted Q-mean over a correct π.
6 sample decisions
- gi=75 seed=75 • final=8 bot=6 • Δeq=-31.41 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=57.5s — decision_75/transcript.live
- gi=434 seed=434 • final=22 bot=3 • Δeq=-0.02 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=61.8s — decision_434/transcript.live
- gi=661 seed=661 • final=19 bot=23 • Δeq=-0.77 • n_turns=7 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case'] • wall=53.4s — decision_661/transcript.live
- gi=929 seed=929 • final=1 bot=4 • Δeq=-0.28 • n_turns=9 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=75.4s — decision_929/transcript.live
- gi=1092 seed=1092 • final=20 bot=2 • Δeq=-0.36 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=67.8s — decision_1092/transcript.live
- gi=1507 seed=1507 • final=24 bot=16 • Δeq=-9.45 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case'] • wall=37.9s — decision_1507/transcript.live
BURL_DRIFTS_FROM_PI 37 (1.9%) +0.1pp vs seq
π was right; Burl picked something other than π and other than Q-mean — wandered off into a wrong answer of its own.
6 sample decisions
- gi=4 seed=4 • final=10 bot=0 • Δeq=-5.61 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s — decision_4/transcript.live
- gi=200 seed=200 • final=25 bot=13 • Δeq=-8.40 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=69.4s — decision_200/transcript.live
- gi=574 seed=574 • final=14 bot=2 • Δeq=-5.18 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=54.5s — decision_574/transcript.live
- gi=980 seed=980 • final=15 bot=14 • Δeq=-12.17 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_best_case'] • wall=42.3s — decision_980/transcript.live
- gi=1349 seed=1349 • final=6 bot=5 • Δeq=-0.89 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=51.4s — decision_1349/transcript.live
- gi=1559 seed=1559 • final=19 bot=7 • Δeq=-10.76 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=57.9s — decision_1559/transcript.live
Guarded buckets — harness handled it
These rows are excluded from regret math but report on the harness's
behaviour. FORCED_COMMIT means the model never produced a
legal commit and the Phase A guard force-committed the highest-E[Q] legal
play; this preserves the bid (no illegal commits) but tells us how often
Burl stalls. ILLEGAL and OTHER should both be 0
— they are.
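The guard behaviour described above can be sketched as a small function (names here are illustrative, not the harness's actual API):

```python
def phase_a_guard(model_play, legal_plays, expected_q):
    """If the model's committed play is missing or illegal, force-commit
    the legal play with the highest E[Q]. Returns (play, forced) so the
    harvest can tag FORCED_COMMIT rows and exclude them from regret math."""
    if model_play in legal_plays:
        return model_play, False
    forced = max(legal_plays, key=lambda p: expected_q[p])
    return forced, True
```

The (play, forced) pair is what keeps the "illegal commits = 0" gate true by construction while still surfacing stall frequency as its own signal.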
FORCED_COMMIT 219 (10.9%) -1.4pp vs seq
Burl never produced a legal commit — Phase A guard force-committed the highest-E[Q] legal play. Excluded from regret math; valuable as a 'how often does the agent stall' signal.
6 sample decisions
gi=2 seed=2 • final=11 bot=11 • Δeq=+0.00 • n_turns=13 • belief=[1, 2] • FORCED • ext=3 • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s— decision_2/transcript.livegi=394 seed=394 • final=21 bot=21 • Δeq=+0.00 • n_turns=10 • belief=[1] • FORCED • ext=1 • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'explore_game', 'explore_game', 'probe_best_case'] • wall=52.3s— decision_394/transcript.livegi=649 seed=649 • final=3 bot=3 • Δeq=+0.00 • n_turns=10 • belief=[1] • FORCED • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'ask_rule', 'ask_rule', 'explore_game', 'probe_best_case'] • wall=59.5s— decision_649/transcript.livegi=963 seed=963 • final=24 bot=24 • Δeq=+0.00 • n_turns=8 • belief=[1, 2] • FORCED • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=105.3s— decision_963/transcript.livegi=1307 seed=1307 • final=18 bot=18 • Δeq=+0.00 • n_turns=14 • belief=[1, 2] • FORCED • ext=3 • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case', 'ask_rule', 'ask_rule'] • wall=76.0s— decision_1307/transcript.livegi=1637 seed=1637 • final=9 bot=9 • Δeq=+0.00 • n_turns=8 • belief=[1] • FORCED • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=87.3s— decision_1637/transcript.live
ILLEGAL 0 (0.0%) +0.0pp vs seq
Final play is None / illegal. Phase A is supposed to keep this at 0.
0 sample decisions
(no decisions in this bucket)

OTHER 0 (0.0%) +0.0pp vs seq
Catch-all. Should be 0 after the v2 OTHER-triage classifier.
0 sample decisions
(no decisions in this bucket)

What's next — STaR run-3
The corpus is ready. Recommended next steps:
- Build the filtered STaR corpus from the strict pool with the --min-assistant-chars 300 filter, which strips templated tail rows (the prior-experiment sub-100-char artifacts that caused the 71-row STaR collapse). We landed on 300 as the right cutoff: it strips ~40% of rows uniformly across all buckets (row-decomposition noise, not signal) and loses zero gold-bucket decisions.

```
PYTHONPATH=. .venv/bin/python scratch/belief_trajectory_rollout/star/build_filtered_corpus.py \
    --harvest harvest_batched_20260425_072910 \
    --strict-pool-buckets ALL_AGREE_CORRECT,BURL_ALONE_FIXES,BOTH_FIX,BURL_INDEPENDENT_RIGHT,BURL_FOLLOWS_PI_RIGHT \
    --min-assistant-chars 300 \
    --out scratch/belief_trajectory_rollout/star/filter_corpus_v2_min300.jsonl
```

- Train LoRA filter-only STaR with rank=8, lr=3e-5, 1 epoch, and val-loss early stopping. The val-loss + early-stop wiring already lives in burl/train/star_mlx.py from a prior session. Start with conservative hyperparams; the prior 71-row run collapsed at lr=1e-4 on memorizable templated rows, and both causes are now addressed.

```
PYTHONPATH=. .venv/bin/python burl/train/star_mlx.py \
    --train-corpus scratch/belief_trajectory_rollout/star/filter_corpus_v2_min300.train.jsonl \
    --val-corpus scratch/belief_trajectory_rollout/star/filter_corpus_v2_min300.val.jsonl \
    --rank 8 --lr 3e-5 --epochs 1 \
    --steps-per-eval 50 --early-stop-val-rise 0.02 --early-stop-patience 2 \
    --out scratch/belief_trajectory_rollout/star/run3_$(date +%Y%m%d_%H%M%S)/
```

- Eval on the held-out 560 (sequential gold corpus) — same buckets; compare regret and matches_bot deltas against the unadapted model. Win criterion: Burl's regret drops below 0.517 (the deployable Q-mean router's regret) by a meaningful margin without harming π.
- (Stretch) Rationalization pass on BURL_BREAKS_CONSENSUS (299 rows) — only after filter-only shows a clean signal. That bucket alone is 3.1× the size of any bucket we tried in prior STaR experiments.
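For the eval step, the regret comparison can be sketched as below. This assumes each row carries its bucket label and a signed E[Q] delta field (called `delta_eq` here, an assumed name), negative when Burl's play loses equity against the oracle argmax; guarded rows stay excluded as elsewhere on this page:

```python
EXCLUDED = frozenset({"FORCED_COMMIT", "ILLEGAL", "OTHER"})

def mean_regret(rows):
    """Mean per-decision regret over scored rows: the E[Q] gap between
    the oracle argmax and the final play, i.e. -delta_eq floored at 0
    (rows that match or beat the reference contribute zero regret)."""
    scored = [r for r in rows if r["bucket"] not in EXCLUDED]
    return sum(max(0.0, -r["delta_eq"]) for r in scored) / len(scored)
```

The win criterion would then be `mean_regret(adapted_rows) < 0.517` on the held-out 560 with matches_bot not regressing.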
Open questions still warm:
- In v1's parity audit, Burl's redundant-tool-call pattern (gi=7 looped belief_trajectory; gi=13 repeated explore_game{play:17}) was symptomatic of mid-thinking truncation. v2 should have eliminated this; a sample sweep over five v2 BURL_BREAKS_CONSENSUS decisions would confirm the cleanup before STaR training treats those patterns as natural reasoning failures.
- v2 budget extensions: 733 total, max 3 per decision. Worth checking whether the extension trigger fires on "reasoning incomplete" or "tool reject"; the latter is the legitimate use case, while the former would indicate the same thinking-budget issue at a smaller scale.
Generated by scratch/belief_trajectory_rollout/build_review.py
on the 2000-decision harvest. Static page — no server needed; reload after
re-running the build script to refresh.