Why this exists
The standing question in this project is: can a general-purpose
language model — the kind that produces blog posts and Python — be taught
to play Texas 42 well by giving it tools to query the game state, and
asking it to think out loud between tool calls?
Not "can it generate text about 42." Can it actually play, well enough
that the moves it makes stand up to oracle E[Q] scrutiny.
The answer is "sort of, with care." The interesting question is exactly
where it succeeds and where it doesn't, and what the failure shapes look
like, because that's how we know whether the next round of training
might fix it.
Over the last three days we collected 2,000 carefully instrumented
decision points and classified each one by how the model's play compared
to two reference estimators and a near-oracle. This page is the readable
version of what we found.
The numbers
Strict pool (Burl right): 1,062 (53.1%)
Loss buckets (Burl wrong): 719 (36.0%)
Forced commits: 219 (10.9%)
Out of 2,000 decisions, Burl picked the oracle's argmax 1,062
times (53.1%). Of those, 860
were trivial — the cases where every estimator agreed and was right; we
keep them in the corpus as filler so training doesn't only see hard
cases. The other 202 are the cases where Burl did
something the simpler estimators didn't, and where Burl turned out to be
correct.
On the wrong side: 719 losses spread across seven shapes of
failure. The biggest loss bucket — and the most interesting one for
retraining — is BURL_BREAKS_CONSENSUS
at 299 decisions, where both quick
estimators already agreed on the correct play and Burl alone deviated to
a worse one. Those are the rationalization targets: we know the right
answer, we have a recording of the model's reasoning, we ask the model
to rewrite the reasoning toward the correct answer, and we
fine-tune on the rewrite.
The 15 buckets
Every decision lands in exactly one bucket. The bucket is determined
purely by which of (π, Q-mean, Burl) match the oracle. The classifier
is in scratch/belief_trajectory_rollout/tag_corpus.py and
is open-source-able if anyone wants to read 200 lines of decision-tree
Python.
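The shape of that decision tree follows directly from the bucket definitions on this page. Here is a minimal sketch of the classification logic — not the actual tag_corpus.py; the bucket names are taken from this page, but the exact tie-break order and handling of edge cases are assumptions:

```python
def tag_decision(pi_play, qmean_play, burl_play, oracle_play,
                 forced=False, legal=True):
    """Assign one bucket from the four plays alone (sketch; the real
    classifier may differ). Guarded buckets take priority, then gold,
    then the seven loss shapes."""
    if not legal:
        return "ILLEGAL"
    if forced:
        return "FORCED_COMMIT"  # excluded from regret math either way
    pi_ok = pi_play == oracle_play
    q_ok = qmean_play == oracle_play
    burl_ok = burl_play == oracle_play
    if burl_ok:
        if pi_ok and q_ok:
            return "ALL_AGREE_CORRECT"
        if pi_ok:
            return "BURL_FOLLOWS_PI_RIGHT"
        if q_ok:
            return "BOTH_FIX"
        # both quick estimators wrong: did they at least agree with each other?
        return "BURL_INDEPENDENT_RIGHT" if pi_play == qmean_play else "BURL_ALONE_FIXES"
    if pi_ok and q_ok:
        return "BURL_BREAKS_CONSENSUS"
    if pi_ok:
        return "BURL_PARROTS_QMEAN_WRONG" if burl_play == qmean_play else "BURL_DRIFTS_FROM_PI"
    if q_ok:
        return "QMEAN_ALONE_FIXES"
    # all three wrong from here down
    if pi_play == qmean_play == burl_play:
        return "ALL_AGREE_WRONG"
    if burl_play == pi_play:
        return "BURL_PARROTS_PI_WRONG"
    if burl_play != qmean_play:
        return "BURL_INDEPENDENT_WRONG"
    return "OTHER"  # catch-all: e.g. burl echoed Q-mean while π was also wrong
```

The function is total: every (π, Q-mean, Burl, oracle) combination lands in exactly one bucket, which is the invariant the corpus relies on.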
Gold buckets (1,062 decisions — Burl matched the oracle)
All Agree Correct 860 (43.0%)
Both quick estimators and the model all picked the same play, and that play was the engine's optimum. The easy round.
At the 42 table: Everyone at the table — including your partner, the kibitzer, and the engine — would have played the same domino. Nothing surprising; nobody got tested.
Sample decisions (5)
- decision_5 — gi=5 seed=5 · played 21 (oracle 21) · Δeq=+0.00 · 5 turns · 75s wall
- decision_423 — gi=423 seed=423 · played 16 (oracle 16) · Δeq=+0.00 · 5 turns · 49s wall
- decision_858 — gi=858 seed=858 · played 22 (oracle 22) · Δeq=+0.00 · 4 turns · 57s wall
- decision_1228 — gi=1228 seed=1228 · played 23 (oracle 23) · Δeq=+0.00 · 4 turns · 51s wall
- decision_1614 — gi=1614 seed=1614 · played 18 (oracle 18) · Δeq=+0.00 · 6 turns · 50s wall
Burl Alone Fixes 14 (0.7%)
The model alone matched the oracle's optimum. The two quick estimators were wrong, and they were wrong in different ways — they didn't even agree with each other. The model reasoned past both of them.
At the 42 table: Your partner says one thing, your nephew says another, and you — taking your time, looking at the trick, the trumps still out, who's been void in what — pick a third domino and it turns out you were right.
Sample decisions (5)
- decision_180 — gi=180 seed=180 · played 19 (oracle 19) · Δeq=+0.00 · 5 turns · 62s wall
- decision_376 — gi=376 seed=376 · played 7 (oracle 7) · Δeq=+0.00 · 6 turns · 96s wall
- decision_653 — gi=653 seed=653 · played 25 (oracle 10) · Δeq=+0.37 · 4 turns · 60s wall
- decision_940 — gi=940 seed=940 · played 7 (oracle 7) · Δeq=+0.00 · 6 turns · 58s wall
- decision_1131 — gi=1131 seed=1131 · played 12 (oracle 12) · Δeq=+0.00 · 8 turns · 92s wall
Both Fix 52 (2.6%)
The model and Q-mean both got it; π didn't. The 'belief sampling' route shows up cleanly here.
At the 42 table: The instinctive policy ('what does this kind of position usually want?') was wrong, but if you took a moment and thought about whose hand could contain what, the right play was clear. The model and Q-mean both took that moment; π didn't.
Sample decisions (5)
- decision_16 — gi=16 seed=16 · played 20 (oracle 9) · Δeq=+0.01 · 5 turns · 81s wall
- decision_358 — gi=358 seed=358 · played 1 (oracle 1) · Δeq=+0.00 · 8 turns · 46s wall
- decision_665 — gi=665 seed=665 · played 19 (oracle 19) · Δeq=+0.00 · 8 turns · 53s wall
- decision_1038 — gi=1038 seed=1038 · played 14 (oracle 14) · Δeq=+0.00 · 6 turns · 50s wall
- decision_1416 — gi=1416 seed=1416 · played 11 (oracle 11) · Δeq=+0.00 · 4 turns · 47s wall
Burl Independent Right 88 (4.4%)
The model bucked a wrong consensus. Both quick estimators landed on the same wrong play. The model picked something else, and that something else was the oracle's optimum.
At the 42 table: Your partner and the kibitzer agree, and they're both wrong. The model heard them and said 'no, here's why' — and was right.
Sample decisions (5)
- decision_3 — gi=3 seed=3 · played 12 (oracle 12) · Δeq=+0.00 · 5 turns · 75s wall
- decision_203 — gi=203 seed=203 · played 9 (oracle 9) · Δeq=+0.00 · 5 turns · 69s wall
- decision_620 — gi=620 seed=620 · played 14 (oracle 0) · Δeq=+0.81 · 6 turns · 81s wall
- decision_934 — gi=934 seed=934 · played 12 (oracle 1) · Δeq=+0.52 · 5 turns · 71s wall
- decision_1496 — gi=1496 seed=1496 · played 12 (oracle 12) · Δeq=+0.00 · 5 turns · 66s wall
Burl Follows Pi Right 48 (2.4%)
π was already correct. Q-mean was wrong. The model correctly stuck with π and didn't get pulled off course by the belief-sampled answer.
At the 42 table: Your gut was right; a more elaborate analysis would have led you astray. The model resisted being talked out of it.
Sample decisions (5)
- decision_131 — gi=131 seed=131 · played 16 (oracle 16) · Δeq=+0.00 · 4 turns · 56s wall
- decision_512 — gi=512 seed=512 · played 9 (oracle 9) · Δeq=+0.00 · 6 turns · 71s wall
- decision_772 — gi=772 seed=772 · played 7 (oracle 7) · Δeq=+0.00 · 4 turns · 80s wall
- decision_1410 — gi=1410 seed=1410 · played 17 (oracle 17) · Δeq=+0.00 · 6 turns · 98s wall
- decision_1661 — gi=1661 seed=1661 · played 3 (oracle 3) · Δeq=+0.00 · 4 turns · 66s wall
Loss buckets (719 decisions — Burl was wrong)
Burl Breaks Consensus 299 (14.9%)
Both quick estimators agreed on the optimum, but the model alone deviated to a worse play. There is no 'the heads were confused' excuse — the right answer was already locally available; the model invented a wrong story.
At the 42 table: Your partner and the kibitzer agreed, and they were correct. You overruled them anyway and lost a trick you didn't need to lose. Most painful bucket — and the cleanest target for retraining: we know the right answer, and we have a record of how the model talked itself out of it.
Sample decisions (5)
- decision_1 — gi=1 seed=1 · played 25 (oracle 19) · Δeq=-3.53 · 8 turns · 75s wall
- decision_367 — gi=367 seed=367 · played 25 (oracle 25) · Δeq=+0.00 · 7 turns · 59s wall
- decision_737 — gi=737 seed=737 · played 8 (oracle 18) · Δeq=-5.70 · 7 turns · 70s wall
- decision_1178 — gi=1178 seed=1178 · played 17 (oracle 4) · Δeq=-3.63 · 5 turns · 46s wall
- decision_1611 — gi=1611 seed=1611 · played 24 (oracle 22) · Δeq=-3.52 · 10 turns · 64s wall
Burl Independent Wrong 148 (7.4%)
All three are wrong, but the model picked a *different* wrong play than the two estimators. The model wasn't parroting either of them — it was wrong on its own terms.
At the 42 table: The whole table is making the same mistake, and on top of that you find a way to make a fourth, original mistake.
Sample decisions (5)
- decision_0 — gi=0 seed=0 · played 7 (oracle 14) · Δeq=-4.62 · 5 turns · 75s wall
- decision_351 — gi=351 seed=351 · played 25 (oracle 15) · Δeq=-4.67 · 5 turns · 75s wall
- decision_630 — gi=630 seed=630 · played 9 (oracle 16) · Δeq=-0.10 · 5 turns · 50s wall
- decision_1075 — gi=1075 seed=1075 · played 27 (oracle 1) · Δeq=-10.43 · 5 turns · 89s wall
- decision_1521 — gi=1521 seed=1521 · played 26 (oracle 21) · Δeq=-0.55 · 5 turns · 78s wall
All Agree Wrong 100 (5.0%)
All three (the two estimators and the model) pick the same wrong play. The position is genuinely hard, even for the oracle.
At the 42 table: Everyone at the table agrees on what to play. They're all wrong. The hand was unlucky or required reading something nobody could read.
Sample decisions (5)
- decision_41 — gi=41 seed=41 · played 4 (oracle 8) · Δeq=-4.11 · 6 turns · 64s wall
- decision_424 — gi=424 seed=424 · played 19 (oracle 2) · Δeq=-1.77 · 4 turns · 49s wall
- decision_946 — gi=946 seed=946 · played 27 (oracle 23) · Δeq=-0.00 · 5 turns · 52s wall
- decision_1263 — gi=1263 seed=1263 · played 13 (oracle 10) · Δeq=-0.72 · 7 turns · 60s wall
- decision_1590 — gi=1590 seed=1590 · played 15 (oracle 16) · Δeq=-1.06 · 4 turns · 40s wall
Burl Parrots Pi Wrong 57 (2.9%)
The model echoed π's instinctive answer, and that instinct was wrong.
At the 42 table: You went with your gut and your gut steered you off the cliff.
Sample decisions (5)
- decision_40 — gi=40 seed=40 · played 20 (oracle 12) · Δeq=-8.39 · 5 turns · 64s wall
- decision_466 — gi=466 seed=466 · played 9 (oracle 7) · Δeq=-4.00 · 8 turns · 62s wall
- decision_942 — gi=942 seed=942 · played 9 (oracle 6) · Δeq=-1.41 · 5 turns · 52s wall
- decision_1168 — gi=1168 seed=1168 · played 5 (oracle 9) · Δeq=-0.02 · 6 turns · 52s wall
- decision_1502 — gi=1502 seed=1502 · played 27 (oracle 21) · Δeq=-0.79 · 5 turns · 55s wall
Qmean Alone Fixes 37 (1.9%)
The belief-sampling estimator alone got it right; the model and π didn't.
At the 42 table: If the model had taken the belief-sampling answer seriously, it would have been right. It didn't.
Sample decisions (5)
- decision_46 — gi=46 seed=46 · played 20 (oracle 14) · Δeq=-14.05 · 4 turns · 45s wall
- decision_234 — gi=234 seed=234 · played 20 (oracle 8) · Δeq=-10.69 · 7 turns · 74s wall
- decision_794 — gi=794 seed=794 · played 16 (oracle 2) · Δeq=-3.61 · 6 turns · 79s wall
- decision_1128 — gi=1128 seed=1128 · played 15 (oracle 3) · Δeq=-8.25 · 4 turns · 92s wall
- decision_1441 — gi=1441 seed=1441 · played 13 (oracle 18) · Δeq=-0.01 · 5 turns · 71s wall
Burl Parrots Qmean Wrong 41 (2.0%)
π was correct. The model echoed Q-mean and Q-mean was wrong.
At the 42 table: Your gut was right. You overthought it and copied the elaborate analysis. Both of you ended up wrong together.
Sample decisions (5)
- decision_75 — gi=75 seed=75 · played 8 (oracle 6) · Δeq=-31.41 · 5 turns · 58s wall
- decision_527 — gi=527 seed=527 · played 12 (oracle 25) · Δeq=-0.04 · 4 turns · 39s wall
- decision_831 — gi=831 seed=831 · played 4 (oracle 6) · Δeq=-0.51 · 8 turns · 63s wall
- decision_1092 — gi=1092 seed=1092 · played 20 (oracle 2) · Δeq=-0.36 · 5 turns · 68s wall
- decision_1600 — gi=1600 seed=1600 · played 20 (oracle 20) · Δeq=+0.00 · 4 turns · 52s wall
Burl Drifts From Pi 37 (1.9%)
π was correct. The model picked something other than π and other than Q-mean — its own original wrong answer.
At the 42 table: Your gut was right. You overruled it and went off-piste. Nobody at the table agrees with where you went.
Sample decisions (5)
- decision_4 — gi=4 seed=4 · played 10 (oracle 0) · Δeq=-5.61 · 5 turns · 75s wall
- decision_271 — gi=271 seed=271 · played 12 (oracle 23) · Δeq=-0.00 · 8 turns · 42s wall
- decision_616 — gi=616 seed=616 · played 23 (oracle 23) · Δeq=+0.00 · 8 turns · 42s wall
- decision_1195 — gi=1195 seed=1195 · played 9 (oracle 15) · Δeq=-7.48 · 6 turns · 44s wall
- decision_1550 — gi=1550 seed=1550 · played 27 (oracle 23) · Δeq=-5.12 · 4 turns · 66s wall
Guarded buckets — handled by the harness
Forced Commit 219 (10.9%)
The model never produced a legal commit on its own; the harness's safety net force-played the highest-E[Q] legal domino. We exclude these from the regret math, but they're tracked because they tell us how often the model 'stalls'.
At the 42 table: The model couldn't decide. The clock was running. The dealer played the best legal domino on its behalf so the hand could continue.
Sample decisions (5)
- decision_2 — gi=2 seed=2 · played 11 (oracle 11) · Δeq=+0.00 · 13 turns · forced · 75s wall
- decision_453 — gi=453 seed=453 · played 19 (oracle 19) · Δeq=+0.00 · 10 turns · forced · 61s wall
- decision_775 — gi=775 seed=775 · played 4 (oracle 4) · Δeq=+0.00 · 8 turns · forced · 64s wall
- decision_1157 — gi=1157 seed=1157 · played 20 (oracle 23) · Δeq=-14.46 · 8 turns · forced · 62s wall
- decision_1574 — gi=1574 seed=1574 · played 10 (oracle 10) · Δeq=+0.00 · 13 turns · forced · 76s wall
Illegal 0 (0.0%)
The model produced no legal final play. Phase A guards exist to keep this at zero.
At the 42 table: Should never happen. Did not happen this run.
Sample decisions (0)
(none)
Other 0 (0.0%)
Catch-all bucket. Kept at zero by the v2 classifier.
At the 42 table: Bookkeeping. Empty.
Sample decisions (0)
(none)
The bug we caught (and why we ran the harvest twice)
Three days ago we ran the same 2,000-decision harvest with a slightly
smaller per-turn token budget (1,024 tokens instead of 2,048). It
finished. The bucket distribution looked plausible. Aggregate parity
against the older 560-decision sequential baseline was within 5 percentage
points everywhere. We were ready to ship.
Then the user (still half-awake over coffee) asked something
inconvenient:
"why did the previous run not have any short-decision
problems and this one does?"
Three parallel investigation agents ran the comparison. The verdict:
in roughly 11.4% of decisions, the model had been running out of token
budget while still mid-sentence in its very first thinking block — never
reaching the tool call. The harness re-prompted, the model recovered on
turn 2, and the trace looked superficially fine. But the chain of
reasoning that had been mid-flight when the budget hit was permanently
lost.
More damningly: 43% of decisions had moved between buckets relative to
the slow sequential baseline. The bucket parity gate had been a false
pass. Errors were cancelling out at the distribution level. Per-decision
they were everywhere.
We doubled the token budget to 2,048, added a per-wave OOM-resilience
layer (so a single batch failure quarantines just that batch and the
run keeps going), wrote a SIGKILL-recovery sentinel so a hard crash
doesn't lose the wave's work, and re-ran. Five hours and forty-six
minutes later, this is what we're looking at. Zero quarantine fires
across 333 batches. Zero of the 2,000 decisions show the truncation
signature any longer.
For 42 players: this is the moment in the construction of any complex
thing where you realize the foundation is two inches off and you have
to redo the framing. Painful, expensive, correct.
For the stats prof
A few numbers that probably matter to one person reading this and not
to anyone else:
| Quantity | Sequential 560 baseline (8192 tok) | Batched v1 (1024 tok, contaminated) | Batched v2 (2048 tok) |
| --- | --- | --- | --- |
| Mean π regret | 0.551 | — | (unchanged) |
| Mean Q-mean regret | 0.517 | — | (unchanged) |
| Mean Burl regret (legal, non-forced) | 2.441 | drifted | tba — see below |
| P(belief tool called on turn 1) | ≈100% | ≈88.6% | 100% |
| P(no terminator on assistant turn) | 0.1% | 2.3% | 0.0% |
| P(forced commit) | 12.3% | ≈12% | 10.9% |
| n_turns mean / median / p95 / max | 5.5 / 4 / 10 / 16 | contaminated | 6.3 / 5 / 11 / 14 |
| Wall per decision (median) | 37 s sequential | — | 57 s batched-of-6 |
| Quarantine fires in 5h46m | n/a | n/a | 0 / 333 |
The strict pool grew from 294 of 560 (52.5%) on the sequential baseline
to 1,062 of 2,000 (53.1%) on v2 — not a
statistically significant shift in rate at this n, but it does
mean the absolute count of non-trivial gold examples (202)
is now 3.9× the count we had to work with last time.
The headline bucket distribution between v2-batched and the sequential
560 falls within ±2 percentage points everywhere except
BURL_BREAKS_CONSENSUS (17.3% sequential → 14.9% v2,
Δ = −2.4pp). That direction is what we'd hope for if the v1 truncation
bug was inflating the bucket; it's still slightly elevated, suggesting a
residual real-policy disagreement between batched and sequential modes
that we'll want to look at on a per-decision-trajectory basis before
declaring full parity.
Open question worth chewing on: the two harvests share aligned
(seed, declaration, narrator_seat, legal_plays) tuples for
560 decisions. We can do a paired McNemar test on bucket flips between
sequential and v2 batched on those 560, conditional on the v1 truncation
bug being absent in v2. Last time the same test against contaminated v1
detected the bug; this time we expect it to fail-to-reject, which would
formally confirm parity. (I haven't run it yet.)
What's next
The corpus is now ready for the next round of training, called
STaR — short for Self-Taught Reasoner. The plan:
- Filter-only pass. Take the 202 non-trivial gold-bucket decisions, format them as supervised-learning examples ("here's the state, here's the trace the model produced, here's the correct play it landed on"), and fine-tune a LoRA adapter on them. This locks in the cases where Burl already reasons well; the hope is that the structure of "good" reasoning generalizes beyond the specific positions.
- Rationalization pass. Take the 299 BURL_BREAKS_CONSENSUS decisions — where we know the right answer and we have a recording of Burl's wrong reasoning. Show the model the right answer, ask it to rewrite its reasoning so the trajectory ends at that answer, and fine-tune on those rewritten traces. This is the more ambitious half of STaR; it's gated on the filter-only pass showing a clean improvement first.
- Evaluate on a held-out sample to confirm the new adapter improves regret without breaking other behaviour, and ship the adapter — or shelve it and iterate — based on what we see.
The honest answer is we don't know yet whether STaR will work at this
corpus size. A previous attempt at filter-only on a much smaller corpus
(71 examples) collapsed: the loss dropped to 0.10 in a hundred iterations,
the adapter learned to memorize templated tail-content from the
transcripts, and zero out of three eval positions came out right. We
learned three things from that failure that are now baked into the v2
pipeline: don't fine-tune on tiny corpora; strip rows whose content is
short enough to be memorizable boilerplate; instrument val-loss with
early-stopping so a memorization spiral can't run to completion. With
1,062 strict-pool rows (vs 71) and a min-character filter on the
reasoning content, we're in a better position. But it's research, not
engineering — we'll run, look at what comes out, and decide.
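The two mechanical guards are simple enough to sketch. This is an illustration of the idea only — the field name `reasoning`, the thresholds, and the patience value are placeholders, not the actual pipeline's settings:

```python
def drop_memorizable_rows(rows, min_chars=500):
    """Strip rows whose reasoning text is short enough to be templated
    boilerplate an adapter could memorize outright."""
    return [r for r in rows if len(r["reasoning"]) >= min_chars]

class EarlyStopper:
    """Stop fine-tuning once validation loss hasn't improved for
    `patience` consecutive evals, so a memorization spiral (train loss
    plunging while val loss stalls) can't run to completion."""

    def __init__(self, patience=3, min_delta=0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

Neither guard would have saved the 71-example run on its own — the corpus was simply too small — but together they turn the failure mode from silent to loud.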
Generated by scratch/belief_trajectory_rollout/build_review_family.py
on the 2000-decision harvest. Static page; the engineer-targeted version
with cleaner tables and fewer 42-table analogies is at
HARVEST_REVIEW.html in this same directory.