# Per-turn assistant content length comparison

Source: `assistant_text` events in `events.jsonl`. One row per (decision_dir, turn) — i.e. one assistant generation block. The three harvests share aligned `decision_N` indices (verified: 560/560 sequential and batched-560 decisions match on `(seed, declaration, narrator_seat, legal_plays)`).
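A minimal sketch of how such per-turn rows can be extracted. The exact event schema is an assumption: field names (`type`, `text`) and the `decision_*/events.jsonl` layout are illustrative, not confirmed by the harvest code.

```python
import json
from pathlib import Path

def assistant_turn_lengths(harvest_root):
    """Yield one row per assistant generation block:
    (decision_dir, turn_index, char_len), ordered by decision dir.

    ASSUMPTION: each event line carries a `type` field and, for
    `assistant_text` events, a `text` field with the raw content."""
    rows = []
    for events_path in sorted(Path(harvest_root).glob("decision_*/events.jsonl")):
        turn = 0
        with events_path.open() as f:
            for line in f:
                ev = json.loads(line)
                if ev.get("type") == "assistant_text":
                    rows.append((events_path.parent.name, turn, len(ev["text"])))
                    turn += 1
    return rows
```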

## Sequential `max_tokens` default

The sequential harvester ran with **`max_tokens=8192`** (per the comment in `harvest_batched.py`: `MODEL_MAX_TOKENS = 1024  # batched cap — sequential used 8192`). Both batched harvests ran with **1024**.

## Per-harvest stats (chars per assistant turn)

| harvest | n_turns | mean | median | p90 | p99 | max | %≥1024 | %≥1500 | %≥2048 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| sequential_560 | 3371 | 262 | 54 | 807 | 1770 | 2639 | 6.1% | 2.1% | 0.6% |
| batched_560_rerun | 3496 | 317 | 54 | 911 | 2814 | 3220 | 8.4% | 4.1% | 2.7% |
| batched_2000_killed | 8709 | 303 | 54 | 867 | 2795 | 3193 | 7.6% | 3.6% | 2.2% |
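The table's columns can be reproduced from the raw per-turn char lengths with a small summary helper; this is a sketch using a nearest-rank percentile, which may differ slightly from whatever interpolation the original analysis used.

```python
import statistics

def turn_stats(lengths, thresholds=(1024, 1500, 2048)):
    """Summarise per-turn char lengths: mean/median/p90/p99/max plus
    the fraction of turns at or above each length threshold."""
    xs = sorted(lengths)
    n = len(xs)

    def pct(p):  # nearest-rank percentile
        return xs[min(n - 1, int(p / 100 * n))]

    stats = {
        "n_turns": n,
        "mean": round(statistics.mean(xs)),
        "median": pct(50),
        "p90": pct(90),
        "p99": pct(99),
        "max": xs[-1],
    }
    for t in thresholds:
        stats[f"%>={t}"] = round(100 * sum(x >= t for x in xs) / n, 1)
    return stats
```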

## Paired same-seed A/B: sequential_560 vs batched_560_rerun

Matched `(decision_N, turn)` keys present in both harvests: **2875**.

| stat | sequential | batched | delta (batched − seq) |
|---|---:|---:|---:|
| mean | 255 | 323 | +68 |
| median | 54 | 54 | +0 |
| p90 | 796 | 960 | +164 |
| p99 | 1763 | 2845 | +1082 |
| max | 2639 | 3220 | +581 |

Per-turn mean delta (batched − sequential): **+68.6 chars** across 2875 matched turns.
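The pairing itself is an inner join on `(decision_N, turn)` keys; a sketch of the matched-delta computation, assuming rows shaped like the `(decision_dir, turn, char_len)` tuples above:

```python
def paired_delta(seq_rows, bat_rows):
    """Inner-join two harvests on (decision_dir, turn) and return
    (n_matched, mean char delta) computed as batched - sequential.

    Turns present in only one harvest are dropped, so the comparison
    is strictly same-seed, same-turn."""
    seq = {(d, t): n for d, t, n in seq_rows}
    bat = {(d, t): n for d, t, n in bat_rows}
    keys = seq.keys() & bat.keys()
    deltas = [bat[k] - seq[k] for k in keys]
    return len(deltas), sum(deltas) / len(deltas)
```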

## Truncation signal

Heuristic: well-formed Gemma turns end with `<tool_call|>` or `<channel|>`. Turns ending with neither terminator were likely truncated at the token cap.

| harvest | n_turns | no-terminator | ≥2800 chars (≈ at 1024-token cap) |
|---|---:|---:|---:|
| sequential_560 | 3371 | 4 (0.1%) | 0 (0.0%) |
| batched_560_rerun | 3496 | 81 (2.3%) | 41 (1.2%) |
| batched_2000_killed | 8709 | 178 (2.0%) | 85 (1.0%) |
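Both columns of the table reduce to two per-turn flags; a sketch of the heuristic, where the 2800-char ceiling is the document's own rough char-equivalent of the 1024-token cap:

```python
def truncation_flags(text, cap_chars=2800):
    """Return (no_terminator, at_cap) for one assistant turn.

    no_terminator: turn ends with neither well-formed terminator,
                   so it was likely cut mid-generation.
    at_cap:        turn length sits at/above ~2800 chars, i.e. at
                   the approximate 1024-token ceiling."""
    terminators = ("<tool_call|>", "<channel|>")
    no_terminator = not text.rstrip().endswith(terminators)
    at_cap = len(text) >= cap_chars
    return no_terminator, at_cap
```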

## Interpretation

**Are batched outputs systematically shorter than sequential? No — they are slightly *longer*.** Mean is +68 chars, p90 +164, p99 +1082 in batched vs sequential on the matched-seed pair. Median is identical (54). The systematic-shortening hypothesis is not supported by the data; if anything, batched generations run a bit longer. The original 41% sub-100-char rate in the STaR corpus is a property of the harvest decomposition (most turns are bare `<|tool_call>...<tool_call|>` strings around 50 chars), not a batched-mode regression.

**Where batched does diverge from sequential is the upper tail.** 2.3% of batched turns lack a proper terminator (vs 0.1% sequential) and 1.2% sit at the 1024-token ceiling (≥2800 chars; sequential never reaches that range). So batched is truncating a small tail mid-thought, while sequential — given 8192 tokens of headroom — finishes them. Sequential's own p99 is 1770 chars (~600 tokens), well under 1024.

**What `max_tokens` is justified?** Sequential's p99 is 1770 chars and max 2639 chars (≈900 tokens). **A 1024-token cap is borderline** — it covers the median and p90 cleanly, but truncates ~1–2% of turns including the longest reasoning blocks. **2048 tokens is the safe floor**: it covers sequential's full observed range with margin and eliminates the truncation-at-cap behavior currently visible in batched runs. It roughly doubles the worst-case per-sequence KV reservation, but extra cache is actually consumed only by the small fraction of turns that run long. If KV budget allows, **set `max_tokens=2048`**.
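The chars-to-tokens reasoning above can be sanity-checked with a rough conversion. The ~2.8 chars/token ratio is an assumption back-derived from the numbers in this document (≈2800 chars at the 1024-token cap, 2639 chars ≈ 900 tokens); real per-turn ratios vary, which is exactly why 1024 is borderline rather than safe.

```python
CHARS_PER_TOKEN = 2.8  # rough ratio implied by the figures above; an assumption

def fits_under_cap(n_chars, max_tokens, chars_per_token=CHARS_PER_TOKEN):
    """Estimate whether a turn of n_chars would finish within max_tokens."""
    return n_chars / chars_per_token <= max_tokens
```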
