Burl harvest v2 — 2000-decision corpus review

harvest_batched_20260425_072910/ • finished 2026-04-25 13:15 local • all gates passed

Decisions
2,000
Wall
5h 46m
Strict pool
1,062
Loss buckets
719
Forced commit
219 (10.9%)
Illegal
0
Quarantines
0

Overview — what this corpus is

This is a 2000-decision Burl harvest collected on fresh seeds from gus/data/corpus_train_chunk_0-99.pt. For every Texas-42 decision point, we ran the wax-museum tool-loop against Burl (Gemma 4 E2B + LoRA via MLX-LM, batched at 6 sequences, max_tokens=2048) on variant D_required_first ("Turn 1: belief_trajectory() first"). Each decision is then classified into one of 14 buckets based on how Burl's final play compared to π (policy head), Q-mean (K=200 belief-sampled), and the oracle's E[Q] argmax — using per_decision_train_chunk_0_99_k200.jsonl as ground truth.

The corpus has two downstream uses:

Why we re-ran the harvest (the v1 contamination story)

The first 2000-decision batched run (harvest_batched_20260425_031033, 1398/2000 before SIGKILL) was contaminated by a turn-1 truncation bug. The user's morning intuition flagged it: "we didn't have short-decision problems before batched mode". Three parallel agents validated:

v1 root cause: max_tokens=1024 was too tight for the model's preferred thinking-then-tool-call pattern. In ~11.4% of decisions, the model exhausted its budget mid-thinking-block on turn 1 and never reached the belief_trajectory() tool call. The harness re-prompted, the model eventually called belief_trajectory() on turn 2, and we lost ~2500 chars of mid-sentence reasoning that was never persisted as a structured thinking event. Bucket parity gate was a false pass — distribution-level errors cancelled out, but 43% of decisions actually changed bucket between sequential and batched.

Fix: max_tokens 1024 → 2048 (justified by the data — sequential's p99 was 1770 chars / ~600 tokens, max 2639 / ~900 tokens; 2048 covers the full observed range with 9× margin), plus a per-wave OOM-resilience layer (try/except classifier, quarantine.jsonl ledger, SIGKILL-recoverable sentinel files, --rerun-quarantined retry pass). Reverted run was launched as v2 at 07:29:10. Zero quarantine fires across the 5h 46m run.

Reference docs (siblings to this page): LENGTH_STATS_COMPARISON.md · PARITY_AUDIT.md · HARVEST_RESILIENCE_NOTES.md · prod_2000_v2.log

Validation gates — v2 vs v1 vs sequential baseline

GateSequential 560 (max_tokens=8192)Batched v1 (max_tokens=1024) — CONTAMINATEDBatched v2 (max_tokens=2048)
belief NOT at turn 1~0%11.4%0.0%
truncated_at_cap0~1.2%0
bailed000
illegal commits000
matches_bot rate~56%~58%61.7%
forced_commit rate12.3%~12%10.9%
quarantine firesn/an/a0 / 333 waves
strict pool / 560294 (52.5%)contaminated→ 1062 / 2000 (53.1%)
All gates pass. v2 is shippable as the canonical corpus for STaR run-3.

Strict pool — Burl matched the oracle (1062 rows, 53.1% of corpus)

These are the rows where Burl's final play matched the oracle's argmax over E[Q]. The strict pool is the full positive-example corpus for STaR filter-only. The non-trivial subset (excluding ALL_AGREE_CORRECT, where π already had the right answer trivially) is the more interesting training material: 202 rows across the four "Burl-did-something-real" buckets.

ALL_AGREE_CORRECT 860 (43.0%) -0.2pp vs seq

All three (π, Q-mean, Burl) match the oracle. Trivial win — π already had it; Burl just didn't break it.

6 sample decisions
  • gi=5 seed=5 • final=21 bot=21 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9sdecision_5/transcript.live
  • gi=348 seed=348 • final=23 bot=23 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=74.9sdecision_348/transcript.live
  • gi=718 seed=718 • final=15 bot=15 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=58.5sdecision_718/transcript.live
  • gi=1033 seed=1033 • final=15 bot=15 • Δeq=+0.00 • n_turns=5 • belief=[1, 3] • tools=['belief_trajectory', 'explore_game', 'belief_trajectory', 'probe_best_case'] • wall=38.8sdecision_1033/transcript.live
  • gi=1357 seed=1357 • final=20 bot=20 • Δeq=+0.00 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case', 'probe_worst_case'] • wall=65.5sdecision_1357/transcript.live
  • gi=1667 seed=1667 • final=12 bot=12 • Δeq=+0.00 • n_turns=12 • belief=[1, 2] • ext=3 • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case'] • wall=79.5sdecision_1667/transcript.live

BURL_ALONE_FIXES 14 (0.7%) -0.2pp vs seq

Burl alone matches the oracle. Both π and Q-mean are wrong, *and pick distinct wrong plays*. The cleanest gold signal: Burl reasoned its way past two heads that disagreed on the wrong answer.

6 sample decisions
  • gi=180 seed=180 • final=19 bot=19 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=61.9sdecision_180/transcript.live
  • gi=376 seed=376 • final=7 bot=7 • Δeq=+0.00 • n_turns=6 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=96.5sdecision_376/transcript.live
  • gi=653 seed=653 • final=25 bot=10 • Δeq=+0.37 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=59.5sdecision_653/transcript.live
  • gi=940 seed=940 • final=7 bot=7 • Δeq=+0.00 • n_turns=6 • belief=[1, 4] • tools=['belief_trajectory', 'explore_game', 'belief_trajectory', 'probe_best_case'] • wall=57.8sdecision_940/transcript.live
  • gi=1131 seed=1131 • final=12 bot=12 • Δeq=+0.00 • n_turns=8 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=91.5sdecision_1131/transcript.live
  • gi=1204 seed=1204 • final=20 bot=20 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=36.7sdecision_1204/transcript.live

BOTH_FIX 52 (2.6%) +1.0pp vs seq

Burl + Q-mean match the oracle; π is wrong. The 'belief sampling' route shows up cleanly here.

6 sample decisions
  • gi=16 seed=16 • final=20 bot=9 • Δeq=+0.01 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=81.1sdecision_16/transcript.live
  • gi=275 seed=275 • final=3 bot=3 • Δeq=+0.00 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=42.5sdecision_275/transcript.live
  • gi=515 seed=515 • final=1 bot=1 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=71.4sdecision_515/transcript.live
  • gi=896 seed=896 • final=12 bot=12 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case'] • wall=54.9sdecision_896/transcript.live
  • gi=1056 seed=1056 • final=8 bot=8 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=55.1sdecision_1056/transcript.live
  • gi=1416 seed=1416 • final=11 bot=11 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=46.9sdecision_1416/transcript.live

BURL_INDEPENDENT_RIGHT 88 (4.4%) -0.1pp vs seq

Burl matches the oracle; π and Q-mean agree on the *same* wrong play. Burl bucked a wrong consensus — the strongest evidence of independent reasoning.

6 sample decisions
  • gi=3 seed=3 • final=12 bot=12 • Δeq=+0.00 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=74.9sdecision_3/transcript.live
  • gi=156 seed=156 • final=16 bot=16 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=42.0sdecision_156/transcript.live
  • gi=408 seed=408 • final=5 bot=27 • Δeq=+0.70 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=59.8sdecision_408/transcript.live
  • gi=731 seed=731 • final=13 bot=13 • Δeq=+0.00 • n_turns=6 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=49.6sdecision_731/transcript.live
  • gi=1184 seed=1184 • final=16 bot=16 • Δeq=+0.00 • n_turns=7 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case', 'probe_worst_case'] • wall=81.5sdecision_1184/transcript.live
  • gi=1560 seed=1560 • final=25 bot=25 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=38.2sdecision_1560/transcript.live

BURL_FOLLOWS_PI_RIGHT 48 (2.4%) +0.1pp vs seq

π and Burl match the oracle; Q-mean is wrong. Burl correctly *did not* deviate to follow Q-mean's belief-sampling answer.

6 sample decisions
  • gi=131 seed=131 • final=16 bot=16 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=56.1sdecision_131/transcript.live
  • gi=504 seed=504 • final=0 bot=0 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=46.2sdecision_504/transcript.live
  • gi=656 seed=656 • final=24 bot=18 • Δeq=+0.01 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=52.9sdecision_656/transcript.live
  • gi=1171 seed=1171 • final=20 bot=20 • Δeq=+0.00 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=38.0sdecision_1171/transcript.live
  • gi=1517 seed=1517 • final=16 bot=16 • Δeq=+0.00 • n_turns=6 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=56.9sdecision_1517/transcript.live
  • gi=1751 seed=1751 • final=9 bot=7 • Δeq=+0.03 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=156.4sdecision_1751/transcript.live

Loss buckets — Burl produced a wrong final play (719 rows, 36.0%)

Burl's regret across these rows is what STaR rationalization is meant to fix. The oracle's correct answer gets exposed to the model and we train it to write the trajectory that would have produced that answer. BURL_BREAKS_CONSENSUS is the sharpest signal: π and Q-mean both already knew the right answer; Burl alone deviated. There is no "the heads were confused" excuse here — Burl invented a wrong story.

BURL_BREAKS_CONSENSUS 299 (14.9%) -2.4pp vs seq

π and Q-mean match the oracle, but Burl alone deviates to a wrong play. The mirror of BURL_INDEPENDENT_RIGHT — and the sharpest STaR rationalization target. The right answer was already known by both heads; Burl invented a wrong story.

6 sample decisions
  • gi=1 seed=1 • final=25 bot=19 • Δeq=-3.53 • n_turns=8 • belief=[1] • ext=3 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9sdecision_1/transcript.live
  • gi=308 seed=308 • final=11 bot=11 • Δeq=+0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=46.0sdecision_308/transcript.live
  • gi=662 seed=662 • final=27 bot=21 • Δeq=-0.03 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=53.4sdecision_662/transcript.live
  • gi=954 seed=954 • final=13 bot=10 • Δeq=-3.96 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=46.0sdecision_954/transcript.live
  • gi=1310 seed=1310 • final=8 bot=3 • Δeq=-0.33 • n_turns=7 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'explore_game'] • wall=49.6sdecision_1310/transcript.live
  • gi=1680 seed=1680 • final=11 bot=21 • Δeq=-4.75 • n_turns=8 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=55.1sdecision_1680/transcript.live

BURL_INDEPENDENT_WRONG 148 (7.4%) +1.5pp vs seq

All three are wrong, but Burl picks something neither π nor Q-mean did. Burl made up its own incorrect answer.

6 sample decisions
  • gi=0 seed=0 • final=7 bot=14 • Δeq=-4.62 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9sdecision_0/transcript.live
  • gi=318 seed=318 • final=14 bot=17 • Δeq=-4.23 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.8sdecision_318/transcript.live
  • gi=532 seed=532 • final=20 bot=2 • Δeq=-1.45 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=25.9sdecision_532/transcript.live
  • gi=857 seed=857 • final=11 bot=15 • Δeq=-14.76 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=97.4sdecision_857/transcript.live
  • gi=1260 seed=1260 • final=5 bot=14 • Δeq=-4.38 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case'] • wall=59.6sdecision_1260/transcript.live
  • gi=1655 seed=1655 • final=14 bot=0 • Δeq=-13.03 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=34.3sdecision_1655/transcript.live

ALL_AGREE_WRONG 100 (5.0%) +0.9pp vs seq

All three pick the same wrong play. The data says even the oracle's frontier is hard here.

6 sample decisions
  • gi=41 seed=41 • final=4 bot=8 • Δeq=-4.11 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_best_case'] • wall=64.5sdecision_41/transcript.live
  • gi=336 seed=336 • final=23 bot=20 • Δeq=+0.86 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=66.9sdecision_336/transcript.live
  • gi=819 seed=819 • final=24 bot=26 • Δeq=-0.36 • n_turns=8 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'explore_game', 'explore_game', 'probe_best_case'] • wall=63.5sdecision_819/transcript.live
  • gi=1044 seed=1044 • final=27 bot=26 • Δeq=-3.97 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=84.9sdecision_1044/transcript.live
  • gi=1301 seed=1301 • final=26 bot=15 • Δeq=+1.05 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=87.5sdecision_1301/transcript.live
  • gi=1590 seed=1590 • final=15 bot=16 • Δeq=-1.06 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=40.5sdecision_1590/transcript.live

BURL_PARROTS_PI_WRONG 57 (2.9%) +0.6pp vs seq

π == Burl, both wrong. Burl mirrored a wrong policy head.

6 sample decisions
  • gi=40 seed=40 • final=20 bot=12 • Δeq=-8.39 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=64.5sdecision_40/transcript.live
  • gi=356 seed=356 • final=26 bot=27 • Δeq=-0.00 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=45.8sdecision_356/transcript.live
  • gi=830 seed=830 • final=2 bot=9 • Δeq=-1.88 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=63.3sdecision_830/transcript.live
  • gi=1054 seed=1054 • final=9 bot=3 • Δeq=-6.11 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=57.4sdecision_1054/transcript.live
  • gi=1251 seed=1251 • final=5 bot=17 • Δeq=-0.01 • n_turns=8 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_best_case', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=49.2sdecision_1251/transcript.live
  • gi=1515 seed=1515 • final=14 bot=12 • Δeq=-5.06 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=56.9sdecision_1515/transcript.live

QMEAN_ALONE_FIXES 37 (1.9%) +0.1pp vs seq

Q-mean matches the oracle; π and Burl don't. Burl missed a fix it could have inherited from belief-sampling.

6 sample decisions
  • gi=46 seed=46 • final=20 bot=14 • Δeq=-14.05 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=45.0sdecision_46/transcript.live
  • gi=224 seed=224 • final=20 bot=21 • Δeq=-10.40 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=56.7sdecision_224/transcript.live
  • gi=709 seed=709 • final=25 bot=4 • Δeq=-9.35 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case'] • wall=48.9sdecision_709/transcript.live
  • gi=909 seed=909 • final=10 bot=0 • Δeq=-0.08 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=59.4sdecision_909/transcript.live
  • gi=1139 seed=1139 • final=20 bot=4 • Δeq=-11.96 • n_turns=7 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case', 'probe_worst_case'] • wall=56.0sdecision_1139/transcript.live
  • gi=1544 seed=1544 • final=19 bot=17 • Δeq=-0.31 • n_turns=5 • belief=[1, 2] • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case'] • wall=42.9sdecision_1544/transcript.live

BURL_PARROTS_QMEAN_WRONG 41 (2.0%) +0.0pp vs seq

π was right; Burl == Q-mean, and both are wrong. Burl trusted Q-mean over a correct π.

6 sample decisions
  • gi=75 seed=75 • final=8 bot=6 • Δeq=-31.41 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=57.5sdecision_75/transcript.live
  • gi=434 seed=434 • final=22 bot=3 • Δeq=-0.02 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=61.8sdecision_434/transcript.live
  • gi=661 seed=661 • final=19 bot=23 • Δeq=-0.77 • n_turns=7 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case'] • wall=53.4sdecision_661/transcript.live
  • gi=929 seed=929 • final=1 bot=4 • Δeq=-0.28 • n_turns=9 • belief=[1] • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=75.4sdecision_929/transcript.live
  • gi=1092 seed=1092 • final=20 bot=2 • Δeq=-0.36 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=67.8sdecision_1092/transcript.live
  • gi=1507 seed=1507 • final=24 bot=16 • Δeq=-9.45 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case'] • wall=37.9sdecision_1507/transcript.live

BURL_DRIFTS_FROM_PI 37 (1.9%) +0.1pp vs seq

π was right; Burl picked something other than π and other than Q-mean — wandered off into a wrong answer of its own.

6 sample decisions
  • gi=4 seed=4 • final=10 bot=0 • Δeq=-5.61 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9sdecision_4/transcript.live
  • gi=200 seed=200 • final=25 bot=13 • Δeq=-8.40 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=69.4sdecision_200/transcript.live
  • gi=574 seed=574 • final=14 bot=2 • Δeq=-5.18 • n_turns=5 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=54.5sdecision_574/transcript.live
  • gi=980 seed=980 • final=15 bot=14 • Δeq=-12.17 • n_turns=6 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_best_case'] • wall=42.3sdecision_980/transcript.live
  • gi=1349 seed=1349 • final=6 bot=5 • Δeq=-0.89 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=51.4sdecision_1349/transcript.live
  • gi=1559 seed=1559 • final=19 bot=7 • Δeq=-10.76 • n_turns=4 • belief=[1] • tools=['belief_trajectory', 'explore_game', 'probe_best_case'] • wall=57.9sdecision_1559/transcript.live

Guarded buckets — harness handled it

These rows are excluded from regret math but report on the harness's behaviour. FORCED_COMMIT means the model never produced a legal commit and the Phase A guard force-committed the highest-E[Q] legal play; this preserves the bid (no illegal commits) but tells us how often Burl stalls. ILLEGAL and OTHER should both be 0 — they are.

FORCED_COMMIT 219 (10.9%) -1.4pp vs seq

Burl never produced a legal commit — Phase A guard force-committed the highest-E[Q] legal play. Excluded from regret math; valuable as a 'how often does the agent stall' signal.

6 sample decisions
  • gi=2 seed=2 • final=11 bot=11 • Δeq=+0.00 • n_turns=13 • belief=[1, 2] • FORCED • ext=3 • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=74.9sdecision_2/transcript.live
  • gi=394 seed=394 • final=21 bot=21 • Δeq=+0.00 • n_turns=10 • belief=[1] • FORCED • ext=1 • tools=['belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'explore_game', 'explore_game', 'probe_best_case'] • wall=52.3sdecision_394/transcript.live
  • gi=649 seed=649 • final=3 bot=3 • Δeq=+0.00 • n_turns=10 • belief=[1] • FORCED • ext=1 • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'ask_rule', 'ask_rule', 'explore_game', 'probe_best_case'] • wall=59.5sdecision_649/transcript.live
  • gi=963 seed=963 • final=24 bot=24 • Δeq=+0.00 • n_turns=8 • belief=[1, 2] • FORCED • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=105.3sdecision_963/transcript.live
  • gi=1307 seed=1307 • final=18 bot=18 • Δeq=+0.00 • n_turns=14 • belief=[1, 2] • FORCED • ext=3 • tools=['belief_trajectory', 'belief_trajectory', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case', 'ask_rule', 'ask_rule'] • wall=76.0sdecision_1307/transcript.live
  • gi=1637 seed=1637 • final=9 bot=9 • Δeq=+0.00 • n_turns=8 • belief=[1] • FORCED • tools=['belief_trajectory', 'explore_game', 'probe_best_case', 'probe_worst_case', 'explore_game', 'explore_game', 'probe_best_case', 'probe_worst_case'] • wall=87.3sdecision_1637/transcript.live

ILLEGAL 0 (0.0%) +0.0pp vs seq

Final play is None / illegal. Phase A is supposed to keep this at 0.

0 sample decisions(no decisions in this bucket)

OTHER 0 (0.0%) +0.0pp vs seq

Catch-all. Should be 0 after the v2 OTHER-triage classifier.

0 sample decisions(no decisions in this bucket)

What's next — STaR run-3

The corpus is ready. Recommended next steps:

  1. Build the filtered STaR corpus from the strict pool, with the --min-assistant-chars 300 filter to strip templated tail rows (the prior-experiment sub-100-char artifacts that caused the 71-row STaR collapse). We landed on 300 as the right cutoff — strips ~40% of rows uniformly across all buckets (so it's row-decomposition noise, not signal); zero gold-bucket decisions are lost. PYTHONPATH=. .venv/bin/python scratch/belief_trajectory_rollout/star/build_filtered_corpus.py \ --harvest harvest_batched_20260425_072910 \ --strict-pool-buckets ALL_AGREE_CORRECT,BURL_ALONE_FIXES,BOTH_FIX,BURL_INDEPENDENT_RIGHT,BURL_FOLLOWS_PI_RIGHT \ --min-assistant-chars 300 \ --out scratch/belief_trajectory_rollout/star/filter_corpus_v2_min300.jsonl
  2. Train LoRA filter-only STaR with rank=8, lr=3e-5, 1 epoch, and val-loss + early-stopping. The val-loss + early-stop wiring already lives in burl/train/star_mlx.py from a prior session. Start with conservative hyperparams; the prior 71-row run collapsed at lr=1e-4 because of memorizable templated rows — both are now fixed. PYTHONPATH=. .venv/bin/python burl/train/star_mlx.py \ --train-corpus scratch/belief_trajectory_rollout/star/filter_corpus_v2_min300.train.jsonl \ --val-corpus scratch/belief_trajectory_rollout/star/filter_corpus_v2_min300.val.jsonl \ --rank 8 --lr 3e-5 --epochs 1 \ --steps-per-eval 50 --early-stop-val-rise 0.02 --early-stop-patience 2 \ --out scratch/belief_trajectory_rollout/star/run3_$(date +%Y%m%d_%H%M%S)/
  3. Eval on the held-out 560 (sequential gold corpus) — same buckets, compare regret and matches_bot deltas vs the unadapted model. Win criterion: regret on Burl drops below 0.517 (the deployable Q-mean router's regret) by a meaningful margin without harming π.
  4. (Stretch) Rationalization pass on BURL_BREAKS_CONSENSUS (299 rows) — only after filter-only shows a clean signal. That bucket alone is 3.1× the size of any bucket we tried in prior STaR experiments.

Open questions still warm:


Generated by scratch/belief_trajectory_rollout/build_review.py on the 2000-decision harvest. Static page — no server needed; reload after re-running the build script to refresh.