Overview — what this corpus is
This is a 2000-decision Burl harvest collected on fresh seeds from
gus/data/corpus_train_chunk_0-99.pt. For every Texas-42 decision
point, we ran the wax-museum tool-loop against Burl (Gemma 4 E2B + LoRA
via MLX-LM, batched at 6 sequences, max_tokens=2048) on variant
D_required_first ("Turn 1: belief_trajectory()
first"). Each decision is then classified into one of 14 buckets based on
how Burl's final play compared to π (policy head), Q-mean (K=200
belief-sampled), and the oracle's E[Q] argmax — using
per_decision_train_chunk_0_99_k200.jsonl as ground truth.
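The bucket assignment reduces to a three-way comparison against the oracle argmax. A minimal sketch of that logic, with bucket semantics taken from the descriptions on this page (the real classifier also handles forced/illegal commits and any edge cases not named here):

```python
def classify(final, pi, qmean, oracle):
    """Map one scored decision to a bucket by comparing Burl's final play,
    the policy head (pi), the K=200 Q-mean argmax, and the oracle's E[Q]
    argmax. Guarded cases (forced/illegal) are assumed filtered out first."""
    if final == oracle:                      # strict pool
        if pi == oracle and qmean == oracle:
            return "ALL_AGREE_CORRECT"
        if qmean == oracle:
            return "BOTH_FIX"                # pi wrong, Q-mean right
        if pi == oracle:
            return "BURL_FOLLOWS_PI_RIGHT"   # pi right, Q-mean wrong
        # both heads wrong: did they agree on the same wrong play?
        return "BURL_INDEPENDENT_RIGHT" if pi == qmean else "BURL_ALONE_FIXES"
    # loss pool: Burl's final play is wrong
    if pi == oracle and qmean == oracle:
        return "BURL_BREAKS_CONSENSUS"
    if pi == oracle:
        return "BURL_PARROTS_QMEAN_WRONG" if final == qmean else "BURL_DRIFTS_FROM_PI"
    if qmean == oracle:
        return "QMEAN_ALONE_FIXES"
    # all three wrong from here
    if final == pi == qmean:
        return "ALL_AGREE_WRONG"
    if final == pi:
        return "BURL_PARROTS_PI_WRONG"
    return "BURL_INDEPENDENT_WRONG"
```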
The corpus has two downstream uses:
- Strict pool (1062 rows, 202 non-trivial): positive STaR examples — Burl reasoned to the oracle answer and we want to lock that behaviour in.
- Loss pool (719 rows, sharpest target 299 BURL_BREAKS_CONSENSUS): rationalization STaR targets — Burl produced a wrong final play; the oracle's correct answer is shown to the model and we ask it to write the trajectory that *would* have led there.
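The pool split itself is a bucket-label partition over the per-decision JSONL. A sketch, assuming a `bucket` field per row (the field name is an assumption; the guarded buckets are excluded from both pools):

```python
import json

STRICT = {"ALL_AGREE_CORRECT", "BURL_ALONE_FIXES", "BOTH_FIX",
          "BURL_INDEPENDENT_RIGHT", "BURL_FOLLOWS_PI_RIGHT"}
GUARDED = {"FORCED_COMMIT", "ILLEGAL", "OTHER"}

def split_pools(path):
    """Partition harvested decisions into the strict (positive STaR) and
    loss (rationalization STaR) pools by bucket label; guarded rows are
    dropped. Returns (strict, non-trivial strict, loss)."""
    strict, loss = [], []
    with open(path) as fh:
        for line in fh:
            row = json.loads(line)
            if row["bucket"] in STRICT:
                strict.append(row)
            elif row["bucket"] not in GUARDED:
                loss.append(row)
    nontrivial = [r for r in strict if r["bucket"] != "ALL_AGREE_CORRECT"]
    return strict, nontrivial, loss
```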
Why we re-ran the harvest (the v1 contamination story)
The first 2000-decision batched run (harvest_batched_20260425_031033,
1398/2000 before SIGKILL) was contaminated by a turn-1 truncation bug. The
user's morning intuition flagged it: "we didn't have short-decision
problems before batched mode". Three parallel agents validated the
diagnosis: max_tokens=1024 was too tight for the model's
preferred thinking-then-tool-call pattern. In ~11.4% of decisions, the model
exhausted its budget mid-thinking-block on turn 1 and never reached the
belief_trajectory() tool call. The harness re-prompted, the model
eventually called belief_trajectory() on turn 2, and we lost
~2500 chars of mid-sentence reasoning that was never persisted as a
structured thinking event. The bucket-parity gate was a
false pass — distribution-level errors cancelled out, but 43% of
decisions actually changed bucket between the sequential and batched runs.
Fix: max_tokens 1024 → 2048 (justified by the data — sequential's p99 was
1770 chars / ~600 tokens, max 2639 chars / ~900 tokens; 2048 covers the full
observed range with comfortable margin), plus a per-wave OOM-resilience layer
(a try/except around the classifier, a quarantine.jsonl ledger,
SIGKILL-recoverable sentinel files, and a --rerun-quarantined retry pass).
The rerun was launched as v2 at 07:29:10. Zero quarantine fires across the
5h 46m run.
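The resilience layer can be sketched as a per-wave wrapper (function and file names here are illustrative, not the harness's actual API): failures land in the ledger instead of killing the run, and a sentinel file per wave makes a SIGKILLed run resumable.

```python
import json
from pathlib import Path

def classify_wave(wave_id, decisions, classify_fn, out_dir):
    """Per-wave resilience sketch: a failing classification is appended to
    a quarantine.jsonl ledger instead of aborting the harvest, and a
    sentinel file marks the wave done so a restarted run skips it."""
    out = Path(out_dir)
    sentinel = out / f"wave_{wave_id}.done"
    if sentinel.exists():              # wave finished before a crash/restart
        return
    quarantine = out / "quarantine.jsonl"
    for dec in decisions:
        try:
            classify_fn(dec)
        except Exception as exc:       # OOM-adjacent failures go to the ledger
            with quarantine.open("a") as fh:
                fh.write(json.dumps({"wave": wave_id,
                                     "decision": dec.get("gi"),
                                     "error": repr(exc)}) + "\n")
    sentinel.touch()
```

A later `--rerun-quarantined` pass would then re-read the ledger and retry just those decisions.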
Reference docs (siblings to this page): LENGTH_STATS_COMPARISON.md · PARITY_AUDIT.md · HARVEST_RESILIENCE_NOTES.md · prod_2000_v2.log
Validation gates — v2 vs v1 vs sequential baseline
| Gate | Sequential 560 (max_tokens=8192) | Batched v1 (max_tokens=1024) — CONTAMINATED | Batched v2 (max_tokens=2048) |
|---|---|---|---|
| belief NOT at turn 1 | ~0% | 11.4% | 0.0% |
| truncated_at_cap | 0 | ~1.2% | 0 |
| bailed | 0 | 0 | 0 |
| illegal commits | 0 | 0 | 0 |
| matches_bot rate | ~56% | ~58% | 61.7% |
| forced_commit rate | 12.3% | ~12% | 10.9% |
| quarantine fires | n/a | n/a | 0 / 333 waves |
| strict pool rate | 294/560 (52.5%) | contaminated | 1062/2000 (53.1%) |
Strict pool — Burl matched the oracle (1062 rows, 53.1% of corpus)
These are the rows where Burl's final play matched the oracle's argmax over E[Q]. The strict pool is the full positive-example corpus for filter-only STaR. The non-trivial subset (excluding ALL_AGREE_CORRECT, where π already had the right answer trivially) is the more interesting training material: 202 rows across the four "Burl-did-something-real" buckets.
ALL_AGREE_CORRECT 860 (43.0%) -0.2pp vs seq
All three (π, Q-mean, Burl) match the oracle. Trivial win — π already had it; Burl just didn't break it.
6 sample decisions
- gi=5 seed=5 • final=21 bot=21 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s — decision_5/transcript.live
- gi=348 seed=348 • final=23 bot=23 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=74.9s — decision_348/transcript.live
- gi=718 seed=718 • final=15 bot=15 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=58.5s — decision_718/transcript.live
- gi=1033 seed=1033 • final=15 bot=15 • Δeq=+0.00 • n_turns=5 • belief=[1, 3] • tools=['belief_trajectory', 'explore_game', 'belief_trajectory', 'probe_best_case'] • wall=38.8s — decision_1033/transcript.live
- gi=1357 seed=1357 • final=20 bot=20 • Δeq=+0.00 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case', 'probe_worst_case'] • wall=65.5s — decision_1357/transcript.live
- gi=1667 seed=1667 • final=12 bot=12 • Δeq=+0.00 • n_turns=12 • belief=[1, 2] • ext=3 • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case'] • wall=79.5s — decision_1667/transcript.live
BURL_ALONE_FIXES 14 (0.7%) -0.2pp vs seq
Burl alone matches the oracle. Both π and Q-mean are wrong, *and pick distinct wrong plays*. The cleanest gold signal: Burl reasoned its way past two heads that disagreed on the wrong answer.
6 sample decisions
- gi=180 seed=180 • final=19 bot=19 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=61.9s — decision_180/transcript.live
- gi=376 seed=376 • final=7 bot=7 • Δeq=+0.00 • n_turns=6 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=96.5s — decision_376/transcript.live
- gi=653 seed=653 • final=25 bot=10 • Δeq=+0.37 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=59.5s — decision_653/transcript.live
- gi=940 seed=940 • final=7 bot=7 • Δeq=+0.00 • n_turns=6 • belief=[1, 4] • tools=['belief_trajectory', 'explore_game', 'belief_trajectory', 'probe_best_case'] • wall=57.8s — decision_940/transcript.live
- gi=1131 seed=1131 • final=12 bot=12 • Δeq=+0.00 • n_turns=8 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=91.5s — decision_1131/transcript.live
- gi=1204 seed=1204 • final=20 bot=20 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=36.7s — decision_1204/transcript.live
BOTH_FIX 52 (2.6%) +1.0pp vs seq
Burl + Q-mean match the oracle; π is wrong. The 'belief sampling' route shows up cleanly here.
6 sample decisions
- gi=16 seed=16 • final=20 bot=9 • Δeq=+0.01 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=81.1s — decision_16/transcript.live
- gi=275 seed=275 • final=3 bot=3 • Δeq=+0.00 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=42.5s — decision_275/transcript.live
- gi=515 seed=515 • final=1 bot=1 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=71.4s — decision_515/transcript.live
- gi=896 seed=896 • final=12 bot=12 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case'] • wall=54.9s — decision_896/transcript.live
- gi=1056 seed=1056 • final=8 bot=8 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=55.1s — decision_1056/transcript.live
- gi=1416 seed=1416 • final=11 bot=11 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=46.9s — decision_1416/transcript.live
BURL_INDEPENDENT_RIGHT 88 (4.4%) -0.1pp vs seq
Burl matches the oracle; π and Q-mean agree on the *same* wrong play. Burl bucked a wrong consensus — the strongest evidence of independent reasoning.
6 sample decisions
- gi=3 seed=3 • final=12 bot=12 • Δeq=+0.00 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=74.9s — decision_3/transcript.live
- gi=156 seed=156 • final=16 bot=16 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=42.0s — decision_156/transcript.live
- gi=408 seed=408 • final=5 bot=27 • Δeq=+0.70 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=59.8s — decision_408/transcript.live
- gi=731 seed=731 • final=13 bot=13 • Δeq=+0.00 • n_turns=6 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=49.6s — decision_731/transcript.live
- gi=1184 seed=1184 • final=16 bot=16 • Δeq=+0.00 • n_turns=7 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case', 'probe_worst_case'] • wall=81.5s — decision_1184/transcript.live
- gi=1560 seed=1560 • final=25 bot=25 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=38.2s — decision_1560/transcript.live
BURL_FOLLOWS_PI_RIGHT 48 (2.4%) +0.1pp vs seq
π and Burl match the oracle; Q-mean is wrong. Burl correctly *did not* deviate to follow Q-mean's belief-sampling answer.
6 sample decisions
- gi=131 seed=131 • final=16 bot=16 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=56.1s — decision_131/transcript.live
- gi=504 seed=504 • final=0 bot=0 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=46.2s — decision_504/transcript.live
- gi=656 seed=656 • final=24 bot=18 • Δeq=+0.01 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=52.9s — decision_656/transcript.live
- gi=1171 seed=1171 • final=20 bot=20 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=38.0s — decision_1171/transcript.live
- gi=1517 seed=1517 • final=16 bot=16 • Δeq=+0.00 • n_turns=6 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=56.9s — decision_1517/transcript.live
- gi=1751 seed=1751 • final=9 bot=7 • Δeq=+0.03 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=156.4s — decision_1751/transcript.live
Loss buckets — Burl produced a wrong final play (719 rows, 36.0%)
Burl's regret across these rows is what STaR rationalization is meant to fix. The oracle's correct answer gets exposed to the model and we train it to write the trajectory that would have produced that answer. BURL_BREAKS_CONSENSUS is the sharpest signal: π and Q-mean both already knew the right answer; Burl alone deviated. There is no "the heads were confused" excuse here — Burl invented a wrong story.
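A rationalization target from this pool can be sketched as below. Field names (`decision_prompt`, `oracle_play`) and the prompt wording are assumptions for illustration; the real builder lives alongside the STaR scripts:

```python
def build_rationalization_example(row):
    """Turn one loss-pool row into a rationalization STaR target: the
    oracle's correct play is revealed, and the model is asked to write
    the reasoning trajectory that would have reached it."""
    prompt = (
        f"{row['decision_prompt']}\n\n"
        f"The optimal play here is {row['oracle_play']}. "
        "Walk through the belief_trajectory() and probe evidence that "
        "leads to this play, then commit it."
    )
    return {"gi": row["gi"], "prompt": prompt,
            "target_play": row["oracle_play"]}
```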
BURL_BREAKS_CONSENSUS 299 (14.9%) -2.4pp vs seq
π and Q-mean match the oracle, but Burl alone deviates to a wrong play. The mirror of BURL_INDEPENDENT_RIGHT — and the sharpest STaR rationalization target. The right answer was already known by both heads; Burl invented a wrong story.
6 sample decisions
- gi=1 seed=1 • final=25 bot=19 • Δeq=-3.53 • n_turns=8 • belief=[1] • ext=3 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s — decision_1/transcript.live
- gi=308 seed=308 • final=11 bot=11 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=46.0s — decision_308/transcript.live
- gi=662 seed=662 • final=27 bot=21 • Δeq=-0.03 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=53.4s — decision_662/transcript.live
- gi=954 seed=954 • final=13 bot=10 • Δeq=-3.96 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=46.0s — decision_954/transcript.live
- gi=1310 seed=1310 • final=8 bot=3 • Δeq=-0.33 • n_turns=7 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'explore_game'] • wall=49.6s — decision_1310/transcript.live
- gi=1680 seed=1680 • final=11 bot=21 • Δeq=-4.75 • n_turns=8 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=55.1s — decision_1680/transcript.live
BURL_INDEPENDENT_WRONG 148 (7.4%) +1.5pp vs seq
All three are wrong, but Burl picks something neither π nor Q-mean did. Burl made up its own incorrect answer.
6 sample decisions
- gi=0 seed=0 • final=7 bot=14 • Δeq=-4.62 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s — decision_0/transcript.live
- gi=318 seed=318 • final=14 bot=17 • Δeq=-4.23 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.8s — decision_318/transcript.live
- gi=532 seed=532 • final=20 bot=2 • Δeq=-1.45 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=25.9s — decision_532/transcript.live
- gi=857 seed=857 • final=11 bot=15 • Δeq=-14.76 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=97.4s — decision_857/transcript.live
- gi=1260 seed=1260 • final=5 bot=14 • Δeq=-4.38 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case'] • wall=59.6s — decision_1260/transcript.live
- gi=1655 seed=1655 • final=14 bot=0 • Δeq=-13.03 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=34.3s — decision_1655/transcript.live
ALL_AGREE_WRONG 100 (5.0%) +0.9pp vs seq
All three pick the same wrong play. The data says even the oracle's frontier is hard here.
6 sample decisions
- gi=41 seed=41 • final=4 bot=8 • Δeq=-4.11 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_best_case'] • wall=64.5s — decision_41/transcript.live
- gi=336 seed=336 • final=23 bot=20 • Δeq=+0.86 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=66.9s — decision_336/transcript.live
- gi=819 seed=819 • final=24 bot=26 • Δeq=-0.36 • n_turns=8 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'explore_game', 'explore_game', 'probe_best_case'] • wall=63.5s — decision_819/transcript.live
- gi=1044 seed=1044 • final=27 bot=26 • Δeq=-3.97 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=84.9s — decision_1044/transcript.live
- gi=1301 seed=1301 • final=26 bot=15 • Δeq=+1.05 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=87.5s — decision_1301/transcript.live
- gi=1590 seed=1590 • final=15 bot=16 • Δeq=-1.06 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=40.5s — decision_1590/transcript.live
BURL_PARROTS_PI_WRONG 57 (2.9%) +0.6pp vs seq
π == Burl, both wrong. Burl mirrored a wrong policy head.
6 sample decisions
- gi=40 seed=40 • final=20 bot=12 • Δeq=-8.39 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=64.5s — decision_40/transcript.live
- gi=356 seed=356 • final=26 bot=27 • Δeq=-0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=45.8s — decision_356/transcript.live
- gi=830 seed=830 • final=2 bot=9 • Δeq=-1.88 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=63.3s — decision_830/transcript.live
- gi=1054 seed=1054 • final=9 bot=3 • Δeq=-6.11 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=57.4s — decision_1054/transcript.live
- gi=1251 seed=1251 • final=5 bot=17 • Δeq=-0.01 • n_turns=8 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=49.2s — decision_1251/transcript.live
- gi=1515 seed=1515 • final=14 bot=12 • Δeq=-5.06 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=56.9s — decision_1515/transcript.live
QMEAN_ALONE_FIXES 37 (1.9%) +0.1pp vs seq
Q-mean matches the oracle; π and Burl don't. Burl missed a fix it could have inherited from belief-sampling.
6 sample decisions
- gi=46 seed=46 • final=20 bot=14 • Δeq=-14.05 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=45.0s — decision_46/transcript.live
- gi=224 seed=224 • final=20 bot=21 • Δeq=-10.40 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=56.7s — decision_224/transcript.live
- gi=709 seed=709 • final=25 bot=4 • Δeq=-9.35 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case'] • wall=48.9s — decision_709/transcript.live
- gi=909 seed=909 • final=10 bot=0 • Δeq=-0.08 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=59.4s — decision_909/transcript.live
- gi=1139 seed=1139 • final=20 bot=4 • Δeq=-11.96 • n_turns=7 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case', 'probe_worst_case'] • wall=56.0s — decision_1139/transcript.live
- gi=1544 seed=1544 • final=19 bot=17 • Δeq=-0.31 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=42.9s — decision_1544/transcript.live
BURL_PARROTS_QMEAN_WRONG 41 (2.0%) +0.0pp vs seq
π was right; Burl == Q-mean, and both are wrong. Burl trusted Q-mean over a correct π.
6 sample decisions
- gi=75 seed=75 • final=8 bot=6 • Δeq=-31.41 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=57.5s — decision_75/transcript.live
- gi=434 seed=434 • final=22 bot=3 • Δeq=-0.02 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=61.8s — decision_434/transcript.live
- gi=661 seed=661 • final=19 bot=23 • Δeq=-0.77 • n_turns=7 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case'] • wall=53.4s — decision_661/transcript.live
- gi=929 seed=929 • final=1 bot=4 • Δeq=-0.28 • n_turns=9 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=75.4s — decision_929/transcript.live
- gi=1092 seed=1092 • final=20 bot=2 • Δeq=-0.36 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=67.8s — decision_1092/transcript.live
- gi=1507 seed=1507 • final=24 bot=16 • Δeq=-9.45 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case'] • wall=37.9s — decision_1507/transcript.live
BURL_DRIFTS_FROM_PI 37 (1.9%) +0.1pp vs seq
π was right; Burl picked something other than π and other than Q-mean — wandered off into a wrong answer of its own.
6 sample decisions
- gi=4 seed=4 • final=10 bot=0 • Δeq=-5.61 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s — decision_4/transcript.live
- gi=200 seed=200 • final=25 bot=13 • Δeq=-8.40 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=69.4s — decision_200/transcript.live
- gi=574 seed=574 • final=14 bot=2 • Δeq=-5.18 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=54.5s — decision_574/transcript.live
- gi=980 seed=980 • final=15 bot=14 • Δeq=-12.17 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_best_case'] • wall=42.3s — decision_980/transcript.live
- gi=1349 seed=1349 • final=6 bot=5 • Δeq=-0.89 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=51.4s — decision_1349/transcript.live
- gi=1559 seed=1559 • final=19 bot=7 • Δeq=-10.76 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=57.9s — decision_1559/transcript.live
Guarded buckets — harness handled it
These rows are excluded from regret math but report on the harness's
behaviour. FORCED_COMMIT means the model never produced a
legal commit and the Phase A guard force-committed the highest-E[Q] legal
play; this preserves the bid (no illegal commits) but tells us how often
Burl stalls. ILLEGAL and OTHER should both be 0
— they are.
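The guard behaviour described above can be sketched as a small function (names here are illustrative, not the harness's actual API):

```python
def phase_a_guard(model_play, legal_plays, expected_q):
    """If the model's committed play is missing or illegal, force-commit
    the legal play with the highest E[Q]. Returns (play, forced) so the
    harvest can tag FORCED_COMMIT rows and exclude them from regret math."""
    if model_play in legal_plays:
        return model_play, False
    forced = max(legal_plays, key=lambda p: expected_q[p])
    return forced, True
```

The (play, forced) pair is what keeps the "illegal commits = 0" gate true by construction while still surfacing stall frequency as its own signal.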
FORCED_COMMIT 219 (10.9%) -1.4pp vs seq
Burl never produced a legal commit — Phase A guard force-committed the highest-E[Q] legal play. Excluded from regret math; valuable as a 'how often does the agent stall' signal.
6 sample decisions
gi=2 seed=2 • final=11 bot=11 • Δeq=+0.00 • n_turns=13 • belief=[1, 2] • FORCED • ext=3 • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9s— decision_2/transcript.livegi=394 seed=394 • final=21 bot=21 • Δeq=+0.00 • n_turns=10 • belief=[1] • FORCED • ext=1 • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'explore_game', 'explore_game', 'probe_best_case'] • wall=52.3s— decision_394/transcript.livegi=649 seed=649 • final=3 bot=3 • Δeq=+0.00 • n_turns=10 • belief=[1] • FORCED • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'ask_rule', 'ask_rule', 'explore_game', 'probe_best_case'] • wall=59.5s— decision_649/transcript.livegi=963 seed=963 • final=24 bot=24 • Δeq=+0.00 • n_turns=8 • belief=[1, 2] • FORCED • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=105.3s— decision_963/transcript.livegi=1307 seed=1307 • final=18 bot=18 • Δeq=+0.00 • n_turns=14 • belief=[1, 2] • FORCED • ext=3 • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case', 'ask_rule', 'ask_rule'] • wall=76.0s— decision_1307/transcript.livegi=1637 seed=1637 • final=9 bot=9 • Δeq=+0.00 • n_turns=8 • belief=[1] • FORCED • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=87.3s— decision_1637/transcript.live
ILLEGAL 0 (0.0%) +0.0pp vs seq
Final play is None / illegal. Phase A is supposed to keep this at 0.
0 sample decisions
(no decisions in this bucket)

OTHER 0 (0.0%) +0.0pp vs seq
Catch-all. Should be 0 after the v2 OTHER-triage classifier.
0 sample decisions
(no decisions in this bucket)

What's next — STaR run-3
The corpus is ready. Recommended next steps:
- Build the filtered STaR corpus from the strict pool with the --min-assistant-chars 300 filter, which strips templated tail rows (the prior-experiment sub-100-char artifacts that caused the 71-row STaR collapse). We landed on 300 as the right cutoff: it strips ~40% of rows uniformly across all buckets (row-decomposition noise, not signal) and loses zero gold-bucket decisions.

```
PYTHONPATH=. .venv/bin/python scratch/belief_trajectory_rollout/star/build_filtered_corpus.py \
    --harvest harvest_batched_20260425_072910 \
    --strict-pool-buckets ALL_AGREE_CORRECT,BURL_ALONE_FIXES,BOTH_FIX,BURL_INDEPENDENT_RIGHT,BURL_FOLLOWS_PI_RIGHT \
    --min-assistant-chars 300 \
    --out scratch/belief_trajectory_rollout/star/filter_corpus_v2_min300.jsonl
```

- Train LoRA filter-only STaR with rank=8, lr=3e-5, 1 epoch, and val-loss early stopping. The val-loss + early-stop wiring already lives in burl/train/star_mlx.py from a prior session. Start with conservative hyperparams; the prior 71-row run collapsed at lr=1e-4 on memorizable templated rows, and both causes are now addressed.

```
PYTHONPATH=. .venv/bin/python burl/train/star_mlx.py \
    --train-corpus scratch/belief_trajectory_rollout/star/filter_corpus_v2_min300.train.jsonl \
    --val-corpus scratch/belief_trajectory_rollout/star/filter_corpus_v2_min300.val.jsonl \
    --rank 8 --lr 3e-5 --epochs 1 \
    --steps-per-eval 50 --early-stop-val-rise 0.02 --early-stop-patience 2 \
    --out scratch/belief_trajectory_rollout/star/run3_$(date +%Y%m%d_%H%M%S)/
```

- Eval on the held-out 560 (sequential gold corpus) — same buckets; compare regret and matches_bot deltas against the unadapted model. Win criterion: Burl's regret drops below 0.517 (the deployable Q-mean router's regret) by a meaningful margin without harming π.
- (Stretch) Rationalization pass on BURL_BREAKS_CONSENSUS (299 rows) — only after filter-only shows a clean signal. That bucket alone is 3.1× the size of any bucket we tried in prior STaR experiments.
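For the eval step, the regret comparison can be sketched as below. This assumes each row carries its bucket label and a signed E[Q] delta field (called `delta_eq` here, an assumed name), negative when Burl's play loses equity against the oracle argmax; guarded rows stay excluded as elsewhere on this page:

```python
EXCLUDED = frozenset({"FORCED_COMMIT", "ILLEGAL", "OTHER"})

def mean_regret(rows):
    """Mean per-decision regret over scored rows: the E[Q] gap between
    the oracle argmax and the final play, i.e. -delta_eq floored at 0
    (rows that match or beat the reference contribute zero regret)."""
    scored = [r for r in rows if r["bucket"] not in EXCLUDED]
    return sum(max(0.0, -r["delta_eq"]) for r in scored) / len(scored)
```

The win criterion would then be `mean_regret(adapted_rows) < 0.517` on the held-out 560 with matches_bot not regressing.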
Open questions still warm:
- In v1's parity audit, Burl's redundant-tool-call pattern (gi=7 looped belief_trajectory; gi=13 repeated explore_game{play:17}) was symptomatic of mid-thinking truncation. v2 should have eliminated this; a sample sweep over five v2 BURL_BREAKS_CONSENSUS decisions would confirm the cleanup before STaR training treats those patterns as natural reasoning failures.
- v2 budget extensions: 733 total, max 3 per decision. Worth checking whether the extension trigger fires on "reasoning incomplete" or "tool reject"; the latter is the legitimate use case, while the former would indicate the same thinking-budget issue at a smaller scale.
Generated by scratch/belief_trajectory_rollout/build_review.py
on the 2000-decision harvest. Static page — no server needed; reload after
re-running the build script to refresh.