Why this exists
The standing question in this project is: can a general-purpose
language model — the kind that produces blog posts and Python — be taught
to play Texas 42 well by giving it tools to query the game state, and
asking it to think out loud between tool calls?
Not "can it generate text about 42." Can it actually play, well enough
that the moves it makes stand up to oracle E[Q] scrutiny.
The answer is "sort of, with care." The interesting question is exactly
where it succeeds and where it doesn't, and what the failure shapes look
like, because that's how we know whether the next round of training
might fix it.
Over the last three days we collected 2,000 carefully instrumented
decision points and classified each one by how the model's play compared
to two reference estimators and a near-oracle. This page is the readable
version of what we found.
The numbers
Strict pool (Burl right): 1,062 (53.1%)
Loss buckets (Burl wrong): 719 (36.0%)
Forced commits: 219 (10.9%)
Out of 2,000 decisions, Burl picked the oracle's argmax 1,062
times (53.1%). Of those, 860
were trivial — the cases where every estimator agreed and was right; we
keep them in the corpus as filler so training doesn't only see hard
cases. The other 202 are the cases where Burl did
something the simpler estimators didn't, and where Burl turned out to be
correct.
On the wrong side: 719 losses spread across seven shapes of
failure. The biggest loss bucket — and the most interesting one for
retraining — is BURL_BREAKS_CONSENSUS
at 299 decisions, where both quick
estimators already agreed on the correct play and Burl alone deviated to
a worse one. Those are the rationalization targets: we know the right
answer, we have a recording of the model's reasoning, we ask the model
to rewrite the reasoning toward the correct answer, and we
fine-tune on the rewrite.
The 15 buckets
Every decision lands in exactly one bucket. The bucket is determined
purely by which of (π, Q-mean, Burl) match the oracle. The classifier
is in scratch/belief_trajectory_rollout/tag_corpus.py and
is open-source-able if anyone wants to read 200 lines of decision-tree
Python.
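The shape of that decision tree follows directly from the bucket definitions on this page. Here is a minimal sketch of the classification logic — not the actual tag_corpus.py; the bucket names are taken from this page, but the exact tie-break order and handling of edge cases are assumptions:

```python
def tag_decision(pi_play, qmean_play, burl_play, oracle_play,
                 forced=False, legal=True):
    """Assign one bucket from the four plays alone (sketch; the real
    classifier may differ). Guarded buckets take priority, then gold,
    then the seven loss shapes."""
    if not legal:
        return "ILLEGAL"
    if forced:
        return "FORCED_COMMIT"  # excluded from regret math either way
    pi_ok = pi_play == oracle_play
    q_ok = qmean_play == oracle_play
    burl_ok = burl_play == oracle_play
    if burl_ok:
        if pi_ok and q_ok:
            return "ALL_AGREE_CORRECT"
        if pi_ok:
            return "BURL_FOLLOWS_PI_RIGHT"
        if q_ok:
            return "BOTH_FIX"
        # both quick estimators wrong: did they at least agree with each other?
        return "BURL_INDEPENDENT_RIGHT" if pi_play == qmean_play else "BURL_ALONE_FIXES"
    if pi_ok and q_ok:
        return "BURL_BREAKS_CONSENSUS"
    if pi_ok:
        return "BURL_PARROTS_QMEAN_WRONG" if burl_play == qmean_play else "BURL_DRIFTS_FROM_PI"
    if q_ok:
        return "QMEAN_ALONE_FIXES"
    # all three wrong from here down
    if pi_play == qmean_play == burl_play:
        return "ALL_AGREE_WRONG"
    if burl_play == pi_play:
        return "BURL_PARROTS_PI_WRONG"
    if burl_play != qmean_play:
        return "BURL_INDEPENDENT_WRONG"
    return "OTHER"  # catch-all: e.g. burl echoed Q-mean while π was also wrong
```

The function is total: every (π, Q-mean, Burl, oracle) combination lands in exactly one bucket, which is the invariant the corpus relies on.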
Gold buckets (1,062 decisions — Burl matched the oracle)
All Agree Correct 860 (43.0%)
Both quick estimators and the model all picked the same play, and that play was the engine's optimum. The easy round.
At the 42 table: Everyone at the table — including your partner, the kibitzer, and the engine — would have played the same domino. Nothing surprising; nobody got tested.
Sample decisions (5)
- decision_5 — gi=5 seed=5 · played 21 (oracle 21) · Δeq=+0.00 · 5 turns · 75s wall
- decision_423 — gi=423 seed=423 · played 16 (oracle 16) · Δeq=+0.00 · 5 turns · 49s wall
- decision_858 — gi=858 seed=858 · played 22 (oracle 22) · Δeq=+0.00 · 4 turns · 57s wall
- decision_1228 — gi=1228 seed=1228 · played 23 (oracle 23) · Δeq=+0.00 · 4 turns · 51s wall
- decision_1614 — gi=1614 seed=1614 · played 18 (oracle 18) · Δeq=+0.00 · 6 turns · 50s wall
Burl Alone Fixes 14 (0.7%)
The model alone matched the oracle's optimum. The two quick estimators were wrong, and they were wrong in different ways — they didn't even agree with each other. The model reasoned past both of them.
At the 42 table: Your partner says one thing, your nephew says another, and you — taking your time, looking at the trick, the trumps still out, who's been void in what — pick a third domino and it turns out you were right.
Sample decisions (5)
- decision_180 — gi=180 seed=180 · played 19 (oracle 19) · Δeq=+0.00 · 5 turns · 62s wall
- decision_376 — gi=376 seed=376 · played 7 (oracle 7) · Δeq=+0.00 · 6 turns · 96s wall
- decision_653 — gi=653 seed=653 · played 25 (oracle 10) · Δeq=+0.37 · 4 turns · 60s wall
- decision_940 — gi=940 seed=940 · played 7 (oracle 7) · Δeq=+0.00 · 6 turns · 58s wall
- decision_1131 — gi=1131 seed=1131 · played 12 (oracle 12) · Δeq=+0.00 · 8 turns · 92s wall
Both Fix 52 (2.6%)
The model and Q-mean both got it; π didn't. The 'belief sampling' route shows up cleanly here.
At the 42 table: The instinctive policy ('what does this kind of position usually want?') was wrong, but if you took a moment and thought about whose hand could contain what, the right play was clear. The model and Q-mean both took that moment; π didn't.
Sample decisions (5)
- decision_16 — gi=16 seed=16 · played 20 (oracle 9) · Δeq=+0.01 · 5 turns · 81s wall
- decision_358 — gi=358 seed=358 · played 1 (oracle 1) · Δeq=+0.00 · 8 turns · 46s wall
- decision_665 — gi=665 seed=665 · played 19 (oracle 19) · Δeq=+0.00 · 8 turns · 53s wall
- decision_1038 — gi=1038 seed=1038 · played 14 (oracle 14) · Δeq=+0.00 · 6 turns · 50s wall
- decision_1416 — gi=1416 seed=1416 · played 11 (oracle 11) · Δeq=+0.00 · 4 turns · 47s wall
Burl Independent Right 88 (4.4%)
The model bucked a wrong consensus. Both quick estimators landed on the same wrong play. The model picked something else, and that something else was the oracle's optimum.
At the 42 table: Your partner and the kibitzer agree, and they're both wrong. The model heard them and said 'no, here's why' — and was right.
Sample decisions (5)
- decision_3 — gi=3 seed=3 · played 12 (oracle 12) · Δeq=+0.00 · 5 turns · 75s wall
- decision_203 — gi=203 seed=203 · played 9 (oracle 9) · Δeq=+0.00 · 5 turns · 69s wall
- decision_620 — gi=620 seed=620 · played 14 (oracle 0) · Δeq=+0.81 · 6 turns · 81s wall
- decision_934 — gi=934 seed=934 · played 12 (oracle 1) · Δeq=+0.52 · 5 turns · 71s wall
- decision_1496 — gi=1496 seed=1496 · played 12 (oracle 12) · Δeq=+0.00 · 5 turns · 66s wall
Burl Follows Pi Right 48 (2.4%)
π was already correct. Q-mean was wrong. The model correctly stuck with π and didn't get pulled off course by the belief-sampled answer.
At the 42 table: Your gut was right; a more elaborate analysis would have led you astray. The model resisted being talked out of it.
Sample decisions (5)
- decision_131 — gi=131 seed=131 · played 16 (oracle 16) · Δeq=+0.00 · 4 turns · 56s wall
- decision_512 — gi=512 seed=512 · played 9 (oracle 9) · Δeq=+0.00 · 6 turns · 71s wall
- decision_772 — gi=772 seed=772 · played 7 (oracle 7) · Δeq=+0.00 · 4 turns · 80s wall
- decision_1410 — gi=1410 seed=1410 · played 17 (oracle 17) · Δeq=+0.00 · 6 turns · 98s wall
- decision_1661 — gi=1661 seed=1661 · played 3 (oracle 3) · Δeq=+0.00 · 4 turns · 66s wall
Loss buckets (719 decisions — Burl was wrong)
Burl Breaks Consensus 299 (14.9%)
Both quick estimators agreed on the optimum, but the model alone deviated to a worse play. There is no 'the heads were confused' excuse — the right answer was already locally available; the model invented a wrong story.
At the 42 table: Your partner and the kibitzer agreed, and they were correct. You overruled them anyway and lost a trick you didn't need to lose. Most painful bucket — and the cleanest target for retraining: we know the right answer, and we have a record of how the model talked itself out of it.
Sample decisions (5)
- decision_1 — gi=1 seed=1 · played 25 (oracle 19) · Δeq=-3.53 · 8 turns · 75s wall
- decision_367 — gi=367 seed=367 · played 25 (oracle 25) · Δeq=+0.00 · 7 turns · 59s wall
- decision_737 — gi=737 seed=737 · played 8 (oracle 18) · Δeq=-5.70 · 7 turns · 70s wall
- decision_1178 — gi=1178 seed=1178 · played 17 (oracle 4) · Δeq=-3.63 · 5 turns · 46s wall
- decision_1611 — gi=1611 seed=1611 · played 24 (oracle 22) · Δeq=-3.52 · 10 turns · 64s wall
Burl Independent Wrong 148 (7.4%)
All three are wrong, but the model picked a *different* wrong play than the two estimators. The model wasn't parroting either of them — it was wrong on its own terms.
At the 42 table: The whole table is making the same mistake, and on top of that you find a way to make a fourth, original mistake.
Sample decisions (5)
- decision_0 — gi=0 seed=0 · played 7 (oracle 14) · Δeq=-4.62 · 5 turns · 75s wall
- decision_351 — gi=351 seed=351 · played 25 (oracle 15) · Δeq=-4.67 · 5 turns · 75s wall
- decision_630 — gi=630 seed=630 · played 9 (oracle 16) · Δeq=-0.10 · 5 turns · 50s wall
- decision_1075 — gi=1075 seed=1075 · played 27 (oracle 1) · Δeq=-10.43 · 5 turns · 89s wall
- decision_1521 — gi=1521 seed=1521 · played 26 (oracle 21) · Δeq=-0.55 · 5 turns · 78s wall
All Agree Wrong 100 (5.0%)
All three (the two estimators and the model) pick the same wrong play. The position is genuinely hard, even for the oracle.
At the 42 table: Everyone at the table agrees on what to play. They're all wrong. The hand was unlucky or required reading something nobody could read.
Sample decisions (5)
- decision_41 — gi=41 seed=41 · played 4 (oracle 8) · Δeq=-4.11 · 6 turns · 64s wall
- decision_424 — gi=424 seed=424 · played 19 (oracle 2) · Δeq=-1.77 · 4 turns · 49s wall
- decision_946 — gi=946 seed=946 · played 27 (oracle 23) · Δeq=-0.00 · 5 turns · 52s wall
- decision_1263 — gi=1263 seed=1263 · played 13 (oracle 10) · Δeq=-0.72 · 7 turns · 60s wall
- decision_1590 — gi=1590 seed=1590 · played 15 (oracle 16) · Δeq=-1.06 · 4 turns · 40s wall
Burl Parrots Pi Wrong 57 (2.9%)
The model echoed π's instinctive answer, and that instinct was wrong.
At the 42 table: You went with your gut and your gut steered you off the cliff.
Sample decisions (5)
- decision_40 — gi=40 seed=40 · played 20 (oracle 12) · Δeq=-8.39 · 5 turns · 64s wall
- decision_466 — gi=466 seed=466 · played 9 (oracle 7) · Δeq=-4.00 · 8 turns · 62s wall
- decision_942 — gi=942 seed=942 · played 9 (oracle 6) · Δeq=-1.41 · 5 turns · 52s wall
- decision_1168 — gi=1168 seed=1168 · played 5 (oracle 9) · Δeq=-0.02 · 6 turns · 52s wall
- decision_1502 — gi=1502 seed=1502 · played 27 (oracle 21) · Δeq=-0.79 · 5 turns · 55s wall
Qmean Alone Fixes 37 (1.9%)
The belief-sampling estimator alone got it right; the model and π didn't.
At the 42 table: If the model had taken the belief-sampling answer seriously, it would have been right. It didn't.
Sample decisions (5)
- decision_46 — gi=46 seed=46 · played 20 (oracle 14) · Δeq=-14.05 · 4 turns · 45s wall
- decision_234 — gi=234 seed=234 · played 20 (oracle 8) · Δeq=-10.69 · 7 turns · 74s wall
- decision_794 — gi=794 seed=794 · played 16 (oracle 2) · Δeq=-3.61 · 6 turns · 79s wall
- decision_1128 — gi=1128 seed=1128 · played 15 (oracle 3) · Δeq=-8.25 · 4 turns · 92s wall
- decision_1441 — gi=1441 seed=1441 · played 13 (oracle 18) · Δeq=-0.01 · 5 turns · 71s wall
Burl Parrots Qmean Wrong 41 (2.0%)
π was correct. The model echoed Q-mean and Q-mean was wrong.
At the 42 table: Your gut was right. You overthought it and copied the elaborate analysis. Both of you ended up wrong together.
Sample decisions (5)
- decision_75 — gi=75 seed=75 · played 8 (oracle 6) · Δeq=-31.41 · 5 turns · 58s wall
- decision_527 — gi=527 seed=527 · played 12 (oracle 25) · Δeq=-0.04 · 4 turns · 39s wall
- decision_831 — gi=831 seed=831 · played 4 (oracle 6) · Δeq=-0.51 · 8 turns · 63s wall
- decision_1092 — gi=1092 seed=1092 · played 20 (oracle 2) · Δeq=-0.36 · 5 turns · 68s wall
- decision_1600 — gi=1600 seed=1600 · played 20 (oracle 20) · Δeq=+0.00 · 4 turns · 52s wall
Burl Drifts From Pi 37 (1.9%)
π was correct. The model picked something other than π and other than Q-mean — its own original wrong answer.
At the 42 table: Your gut was right. You overruled it and went off-piste. Nobody at the table agrees with where you went.
Sample decisions (5)
- decision_4 — gi=4 seed=4 · played 10 (oracle 0) · Δeq=-5.61 · 5 turns · 75s wall
- decision_271 — gi=271 seed=271 · played 12 (oracle 23) · Δeq=-0.00 · 8 turns · 42s wall
- decision_616 — gi=616 seed=616 · played 23 (oracle 23) · Δeq=+0.00 · 8 turns · 42s wall
- decision_1195 — gi=1195 seed=1195 · played 9 (oracle 15) · Δeq=-7.48 · 6 turns · 44s wall
- decision_1550 — gi=1550 seed=1550 · played 27 (oracle 23) · Δeq=-5.12 · 4 turns · 66s wall
Guarded buckets — handled by the harness
Forced Commit 219 (10.9%)
The model never produced a legal commit on its own; the harness's safety net force-played the highest-E[Q] legal domino. We exclude these from the regret math, but they're tracked because they tell us how often the model 'stalls'.
At the 42 table: The model couldn't decide. The clock was running. The dealer played the best legal domino on its behalf so the hand could continue.
Sample decisions (5)
- decision_2 — gi=2 seed=2 · played 11 (oracle 11) · Δeq=+0.00 · 13 turns · forced · 75s wall
- decision_453 — gi=453 seed=453 · played 19 (oracle 19) · Δeq=+0.00 · 10 turns · forced · 61s wall
- decision_775 — gi=775 seed=775 · played 4 (oracle 4) · Δeq=+0.00 · 8 turns · forced · 64s wall
- decision_1157 — gi=1157 seed=1157 · played 20 (oracle 23) · Δeq=-14.46 · 8 turns · forced · 62s wall
- decision_1574 — gi=1574 seed=1574 · played 10 (oracle 10) · Δeq=+0.00 · 13 turns · forced · 76s wall
Illegal 0 (0.0%)
The model produced no legal final play. Phase A guards exist to keep this at zero.
At the 42 table: Should never happen. Did not happen this run.
Sample decisions (0)
(none)
Other 0 (0.0%)
Catch-all bucket. Kept at zero by the v2 classifier.
At the 42 table: Bookkeeping. Empty.
Sample decisions (0)
(none)
The bug we caught (and why we ran the harvest twice)
Three days ago we ran the same 2,000-decision harvest with a slightly
smaller per-turn token budget (1,024 tokens instead of 2,048). It
finished. The bucket distribution looked plausible. Aggregate parity
against the older 560-decision sequential baseline was within 5 percentage
points everywhere. We were ready to ship.
Then the user (still half-awake over coffee) asked something
inconvenient:
"why did the previous run not have any short-decision
problems and this one does?"
Three parallel investigation agents ran the comparison. The verdict:
in roughly 11.4% of decisions, the model had been running out of token
budget while still mid-sentence in its very first thinking block — never
reaching the tool call. The harness re-prompted, the model recovered on
turn 2, and the trace looked superficially fine. But the chain of
reasoning that had been mid-flight when the budget hit was permanently
lost.
More damningly: 43% of decisions had moved between buckets relative to
the slow sequential baseline. The bucket parity gate had been a false
pass. Errors were cancelling out at the distribution level. Per-decision
they were everywhere.
We doubled the token budget to 2,048, added a per-wave OOM-resilience
layer (so a single batch failure quarantines just that batch and the
run keeps going), wrote a SIGKILL-recovery sentinel so a hard crash
doesn't lose the wave's work, and re-ran. Five hours and forty-six
minutes later, this is what we're looking at. Zero quarantine fires
across 333 batches. Zero of the 2,000 decisions show the truncation
signature any longer.
For 42 players: this is the moment in the construction of any complex
thing where you realize the foundation is two inches off and you have
to redo the framing. Painful, expensive, correct.
For the stats prof
A few numbers that probably matter to one person reading this and not
to anyone else:
| Quantity | Sequential 560 baseline (8192 tok) | Batched v1 (1024 tok, contaminated) | Batched v2 (2048 tok) |
| --- | --- | --- | --- |
| Mean π regret | 0.551 | — | (unchanged) |
| Mean Q-mean regret | 0.517 | — | (unchanged) |
| Mean Burl regret (legal, non-forced) | 2.441 | drifted | tba — see below |
| P(belief tool called on turn 1) | ≈100% | ≈88.6% | 100% |
| P(no terminator on assistant turn) | 0.1% | 2.3% | 0.0% |
| P(forced commit) | 12.3% | ≈12% | 10.9% |
| n_turns mean / median / p95 / max | 5.5 / 4 / 10 / 16 | contaminated | 6.3 / 5 / 11 / 14 |
| Wall per decision (median) | 37 s sequential | — | 57 s batched-of-6 |
| Quarantine fires in 5h46m | n/a | n/a | 0 / 333 |
The strict pool grew from 294 of 560 (52.5%) on the sequential baseline
to 1,062 of 2,000 (53.1%) on v2 — not a
statistically significant shift in rate at this n, but it does
mean the absolute count of non-trivial gold examples (202)
is now 3.9× the count we had to work with last time.
The headline bucket distribution between v2-batched and the sequential
560 falls within ±2 percentage points everywhere except
BURL_BREAKS_CONSENSUS (17.3% sequential → 14.9% v2,
Δ = −2.4pp). That direction is what we'd hope for if the v1 truncation
bug was inflating the bucket; it's still slightly elevated, suggesting a
residual real-policy disagreement between batched and sequential modes
that we'll want to look at on a per-decision-trajectory basis before
declaring full parity.
Open question worth chewing on: the two harvests share aligned
(seed, declaration, narrator_seat, legal_plays) tuples for
560 decisions. We can do a paired McNemar test on bucket flips between
sequential and v2 batched on those 560, conditional on the v1 truncation
bug being absent in v2. Last time the same test against contaminated v1
detected the bug; this time we expect it to fail-to-reject, which would
formally confirm parity. (I haven't run it yet.)
What's next
The corpus is now ready for the next round of training, called
STaR — short for Self-Taught Reasoner. The plan:
- Filter-only pass. Take the 202 non-trivial gold-bucket decisions, format them as supervised-learning examples ("here's the state, here's the trace the model produced, here's the correct play it landed on"), and fine-tune a LoRA adapter on them. This locks in the cases where Burl already reasons well; the hope is that the structure of "good" reasoning generalizes beyond the specific positions.
- Rationalization pass. Take the 299 BURL_BREAKS_CONSENSUS decisions — where we know the right answer and we have a recording of Burl's wrong reasoning. Show the model the right answer, ask it to rewrite its reasoning so the trajectory ends at that answer, and fine-tune on those rewritten traces. This is the more ambitious half of STaR; it's gated on the filter-only pass showing a clean improvement first.
- Evaluate on a held-out sample to confirm the new adapter improves regret without breaking other behaviour, and ship the adapter — or shelve it and iterate — based on what we see.
The honest answer is we don't know yet whether STaR will work at this
corpus size. A previous attempt at filter-only on a much smaller corpus
(71 examples) collapsed: the loss dropped to 0.10 in a hundred iterations,
the adapter learned to memorize templated tail-content from the
transcripts, and zero out of three eval positions came out right. We
learned three things from that failure that are now baked into the v2
pipeline: don't fine-tune on tiny corpora; strip rows whose content is
short enough to be memorizable boilerplate; instrument val-loss with
early-stopping so a memorization spiral can't run to completion. With
1,062 strict-pool rows (vs 71) and a min-character filter on the
reasoning content, we're in a better position. But it's research, not
engineering — we'll run, look at what comes out, and decide.
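The two mechanical guards are simple enough to sketch. This is an illustration of the idea only — the field name `reasoning`, the thresholds, and the patience value are placeholders, not the actual pipeline's settings:

```python
def drop_memorizable_rows(rows, min_chars=500):
    """Strip rows whose reasoning text is short enough to be templated
    boilerplate an adapter could memorize outright."""
    return [r for r in rows if len(r["reasoning"]) >= min_chars]

class EarlyStopper:
    """Stop fine-tuning once validation loss hasn't improved for
    `patience` consecutive evals, so a memorization spiral (train loss
    plunging while val loss stalls) can't run to completion."""

    def __init__(self, patience=3, min_delta=0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

Neither guard would have saved the 71-example run on its own — the corpus was simply too small — but together they turn the failure mode from silent to loud.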
Generated by scratch/belief_trajectory_rollout/build_review_family.py
on the 2000-decision harvest. Static page; the engineer-targeted version
with cleaner tables and fewer 42-table analogies is at
HARVEST_REVIEW.html in this same directory.