# Batched-vs-Sequential Parity Audit

Reference: `harvest_20260424_133611/` (sequential). Rerun: `harvest_batched_20260425_010306/` (batch=6, max_tokens=1024). Both 560 decisions on `D_required_first`.

## Headline finding

**Batched is NOT faithful to sequential.** Bucket-level parity hid systematic content regressions. 241/560 decisions changed bucket between runs (43%). The most consequential failure mode is a **turn-1 max-token truncation**: 64 batched decisions emit a 2300–3000-char prose dump on turn 1 with no tool call, get re-prompted, and call `belief_trajectory` on turn 2 — explaining the entire `belief_called_turns=[2]` anomaly.

## Per-decision summary

| gi  | seq bucket | bat bucket | seq turns | bat turns | seq asst chars | bat asst chars | final play (seq → bat) |
|----:|---|---|---:|---:|---:|---:|---|
| 7   | BURL_BREAKS_CONSENSUS | BURL_BREAKS_CONSENSUS | 5  | 11 | 1446 | 3896 | 25 → 25 |
| 23  | BURL_BREAKS_CONSENSUS | BURL_BREAKS_CONSENSUS | 5  | 5  | 3543 | 2172 | 21 → 21 |
| 159 | BURL_ALONE_FIXES      | BURL_ALONE_FIXES      | 5  | 5  | 1738 | 1462 | 5 → 5  |
| 13  | FORCED_COMMIT         | FORCED_COMMIT         | 10 | 8  | 3784 |  992 | 24 → 24 (forced) |
| 42  | ALL_AGREE_CORRECT     | ALL_AGREE_CORRECT     | 7  | 6  | 2565 | 3716 | 24 → 24 |

(Final plays match in all 5 picks. Trajectories do not.)

## Diff highlights

### gi=7 (BURL_BREAKS_CONSENSUS) — same play, very different reasoning path
- **Seq**: probes 25 directly. 4-tool sequence (`belief → explore → probe_best → probe_worst`), commits 25 cleanly on turn 5.
- **Bat**: explores 17, attempts to commit 17, which is **rejected** (must follow suit 4). Re-runs `belief_trajectory` on turn 7 (2nd time). Tries 6, **rejected**. Eventually commits 25 on turn 11, reasoning it is "hoping the engine accepts it as a void play". Two `budget_extended` events fired.
- Tool sequence ballooned to 7 calls; turn count more than doubled (5 → 11). The final play coincidentally matches, but the *thinking trace is unrecognizable*: a STaR target trained on this would learn a fundamentally different policy.

### gi=23 (BURL_BREAKS_CONSENSUS) — same play, leaner batched trace
- Same final play (21), same belief_called_turns=[1], same n_turns=5. Tool sequences differ (`probe_best, probe_worst` in bat vs `explore_game, probe_best` in seq); seq commits with 989-char concluding paragraph, bat with 639-char. Reasoning is shorter but coherent. No truncation.

### gi=159 (BURL_ALONE_FIXES) — final-turn reasoning compressed but coherent
- Both commit 5. Sequential turn 4 has a 431-char paragraph reasoning about probe_worst_case alongside the tool call. Batched turn 4 is a bare 53-char tool call with no inline reasoning, but the turn-5 thinking block (517 chars) makes up most of the lost reasoning. **Net asst chars dropped 1738 → 1462 (-16%)**. No mid-sentence cuts.

### gi=13 (FORCED_COMMIT) — different forced reason, redundant tool calls in batched
- Both forced-commit play 24. **Forced reasons differ**: seq `"highest-E[Q] probed play (mean=-11.10, probed=1)"` vs bat `"highest-E[Q] oracle scan (mean=-11.10, n_legal=1)"`. The mean and final play match; the forcing path is different.
- Bat called `belief_trajectory` twice in a row (turns 1, 2), then `explore_game{play:17}` twice in a row (turns 7, 8). This *redundant-call loop* is exactly the failure pattern truncation produces: the model lost track of state. Batched also had `thinking_chars=3626` (vs seq=435); the model deliberated disproportionately in turn 3 (a 2200-char thinking block).
- Wall time 23.5s → 118.5s (5×).

### gi=42 (belief_turns=[2] anomaly) — TURN-1 TRUNCATION
- **Sequential**: turn 1 = `thinking` (2074 chars) + `assistant_text` (`<|tool_call>call:belief_trajectory{}`). Clean.
- **Batched**: turn 1 = `assistant_text` containing **2730 chars of unstructured prose (`<|channel>thought\nHere's a thinking process to arrive at the decision: 1. Analyze the Game State…`)** ending mid-sentence at `*Checking for`, with **no tool call emitted**. The harness re-prompts; turn 2 finally calls `belief_trajectory`. This is the canonical signature: the model overflowed `max_tokens=1024` while still in the thinking phase, the last-turn output was persisted as `assistant_text` instead of structured `thinking + tool_call`, and the harness counted that truncated output as turn 1.
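This truncation signature is mechanically checkable. A minimal sketch, assuming each persisted decision carries a `turns` list of `{"role", "text"}` dicts (hypothetical field names; the real harvest schema may differ):

```python
def turn1_truncated(decision, lo=2300, hi=3000):
    """Flag decisions whose turn 1 is a long prose assistant_text with no
    tool call: the model ran out of tokens mid-thought and never emitted
    the <|tool_call> tag. Field names are assumptions, not confirmed schema."""
    turn1 = decision["turns"][0]
    text = turn1.get("text", "")
    return (
        turn1.get("role") == "assistant_text"  # persisted as prose, not thinking
        and "<|tool_call>" not in text         # never reached the tool call
        and lo <= len(text) <= hi              # the 2300-3000-char prose dump
    )
```

Running this over both harvests would confirm the 64-decision count without hand-inspecting traces.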

## belief_turns=[2] explanation

Across the 71 batched decisions flagged for the delayed-belief anomaly (`belief_called_turns=[2]` and close variants):
- **64 / 71** have turn 1 = a 2300–3000-char prose `assistant_text` with NO `<|tool_call>` tag — i.e., model exhausted its 1024-token budget while still emitting the thinking-process scaffold and never reached the tool call.
- The remaining 7 show a similar pattern with `belief_called_turns=[2,3]` or `[3]`, except one outlier (gi=145) whose turn 1 is only 34 chars.
- **Root cause**: `max_tokens=1024` is too tight for the model's preferred thinking-then-tool-call pattern. Sequential mode either had a higher cap or was less constrained per request. The harness correctly counts the truncated turn as turn 1, and the model recovers on turn 2, *but ~2500 chars of mid-sentence reasoning were never persisted as a `thinking` block, and the resulting trajectory is not a faithful sample of the policy under proper sampling conditions*.
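The 64/71 split can be reproduced with a small tally over the rerun. A sketch, again assuming hypothetical field names (`belief_called_turns`, `turns`, `text`) for the harvest JSON:

```python
from collections import Counter

def delayed_belief_breakdown(decisions):
    """For decisions whose first belief call lands on turn 2, count how
    many show the truncated-turn-1 signature vs anything else."""
    tally = Counter()
    for d in decisions:
        if d.get("belief_called_turns") != [2]:
            continue
        text = d["turns"][0].get("text", "")
        truncated = "<|tool_call>" not in text and len(text) >= 2300
        tally["truncated_turn1" if truncated else "other"] += 1
    return tally
```

Extending the filter to `[2,3]` and `[3]` would cover the 7 remaining variant cases.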

## Verdict

**Batched mode has TWO distinct systematic failure modes, not random jitter:**

1. **Turn-1 truncation (~64/560 = 11.4%)**: the `max_tokens=1024` cap chops thinking blocks mid-sentence. The recovered turn 2 calls a tool, but the model has effectively lost a turn of structured reasoning.
2. **Re-prompt loops & redundant tool calls** (gi=7, gi=13): commit rejections + tighter token budgets push the model into stuck states where it re-calls `belief_trajectory` or repeats the same `explore_game{play:X}`. This drives wall time up 3–5× and inflates `tool_sequence` length.
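The second failure mode is also easy to flag from `tool_sequence` alone. A sketch that finds every run of identical consecutive calls (the gi=13 pattern):

```python
def redundant_runs(tool_sequence):
    """Return (tool, run_length) for every run of >= 2 identical consecutive
    calls, e.g. belief_trajectory called twice in a row on turns 1-2."""
    runs, i = [], 0
    while i < len(tool_sequence):
        j = i
        # extend j while the next call repeats the current one
        while j + 1 < len(tool_sequence) and tool_sequence[j + 1] == tool_sequence[i]:
            j += 1
        if j > i:
            runs.append((tool_sequence[i], j - i + 1))
        i = j + 1
    return runs
```

Any decision with a non-empty result is a candidate stuck state worth diffing against its sequential counterpart.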

The **560-rerun bucket parity gate was a false pass.** Bucket distributions matched within ±5pp because (a) buckets are coarse summary stats over 560 decisions, and (b) some failures bias toward `FORCED_COMMIT` while others bias toward `ALL_AGREE_CORRECT`, so the errors partially cancel at the distribution level. Per-gi bucket changes (43%!) and per-trajectory content regressions (severe in 4/5 sampled decisions) are the ground truth.

**Recommendation before any 2000-decision production run**:
- Raise `max_tokens` to ≥2048 and retest the belief=[2] rate. Target: <1% of decisions.
- Investigate why batched commit rejections trigger redundant `belief_trajectory` re-calls (gi=7, gi=13). Possibly the batched harness re-feeds turn-rejection context differently than sequential.
- Re-run the 560-decision set and compare per-gi `tool_sequence` and `n_turns` distributions, not just bucket distributions. The current parity check is too coarse.
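A stricter parity gate along those lines could look like the following sketch, assuming both runs load as dicts keyed by gi with `tool_sequence` and `n_turns` fields (hypothetical layout, not the confirmed harvest schema):

```python
def per_gi_regressions(seq_run, bat_run):
    """List every gi whose tool_sequence or n_turns differs between runs.
    Bucket distributions can match while this list is large -- the false
    pass described above -- so gate on this list being near-empty."""
    return sorted(
        gi
        for gi, s in seq_run.items()
        if gi in bat_run
        and (s["tool_sequence"] != bat_run[gi]["tool_sequence"]
             or s["n_turns"] != bat_run[gi]["n_turns"])
    )
```

Gating on `len(per_gi_regressions(...)) / 560` rather than bucket deltas would have caught the 43% per-gi drift immediately.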
