# Harvest resilience + max_tokens revert — ready to launch

Surgical changes to `scratch/belief_trajectory_rollout/harvest_batched.py` so the production 2000-decision rerun survives transient Metal OOMs / SIGKILLs and gives the model enough room to think.

## Diff summary

| File | Lines added | Lines removed | Net |
|---|---:|---:|---:|
| `scratch/belief_trajectory_rollout/harvest_batched.py` | ~250 | ~10 | +240 |

Only one file touched. No restructuring of the existing lockstep loop — additions are: (a) constant revert, (b) a self-contained resilience module (~150 lines, OOM classifier + quarantine I/O + sentinel I/O), (c) per-wave try/except + sentinel write/clear, (d) four CLI flags, (e) resume-time orphan-sentinel detection.

## What changed

### 1. `MODEL_MAX_TOKENS` 1024 → 2048
Cross-checked with `LENGTH_STATS_COMPARISON.md` — sequential's p99 is 1770 chars (~600 tokens), max 2639 chars (~900 tokens). 1024-token cap was truncating ~1.2% of batched turns; 2048 covers full sequential range with margin. `PARITY_AUDIT.md` also pins 64 of 560 batched decisions to a turn-1 truncation that 2048 should eliminate. Override via `--max-tokens N`.
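
As a rough sanity check on the margin, the chars-to-tokens ratio implied by the p99 figures can be applied to the longest sequential turn. This is a back-of-envelope sketch only: it assumes the ratio implied by "1770 chars (~600 tokens)" holds across turns, and the helper names are hypothetical.

```python
# Figures quoted above from LENGTH_STATS_COMPARISON.md.
P99_CHARS = 1770
P99_TOKENS_APPROX = 600
MAX_CHARS = 2639

# Assumption: one chars-per-token ratio applies to every turn.
CHARS_PER_TOKEN = P99_CHARS / P99_TOKENS_APPROX  # 2.95


def estimate_tokens(n_chars: int) -> float:
    """Approximate token count for a turn of n_chars characters."""
    return n_chars / CHARS_PER_TOKEN


def headroom(cap_tokens: int, n_chars: int) -> float:
    """How many times the estimated token count fits under the cap."""
    return cap_tokens / estimate_tokens(n_chars)


if __name__ == "__main__":
    for cap in (1024, 2048):
        ratio = headroom(cap, MAX_CHARS)
        print(f"cap={cap}: {ratio:.2f}x the longest sequential turn")
```

Under that assumption the longest sequential turn fits under 1024 tokens only narrowly, while 2048 leaves more than a 2x margin.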

### 2. Per-wave OOM resilience
Wraps `model.step_batch(...)` in a try/except. `_is_oom_like(exc)` classifies an exception as quarantinable if its message contains any of: `metal::malloc`, `Resource limit`, `Resource exhausted`, `broadcast_shapes`, `out of memory`, OR if it is a `MemoryError`. On catch:

- Append a record to `<harvest_dir>/quarantine.jsonl` containing `{wave_idx, gi_range, seeds, error_kind, error_message, max_tokens_at_failure, quarantined_at}`.
- Mark every decision in the wave `gen_failed=True` so `_finalize` skips writing `trace_summary.json` (matches existing semantics — preserves resume idempotency).
- Continue to the next wave.

A non-OOM `RuntimeError` keeps the original behavior (mark `gen_failed`, break out of the wave) and is NOT quarantined, so unfamiliar errors follow the pre-existing failure path and are re-attempted by a normal `--resume`.
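
The classify-then-quarantine flow described above can be sketched roughly as follows. This is not the code in `harvest_batched.py`: the marker list and record fields come from this section, but case-sensitive matching, the `error_kind` rule, the timestamp format, and the `run_wave` wrapper (with `step` standing in for the `model.step_batch(...)` loop) are assumptions made for illustration.

```python
import json
import time
from pathlib import Path

# Substrings named in this section; case-sensitive matching is an assumption.
OOM_MARKERS = (
    "metal::malloc",
    "Resource limit",
    "Resource exhausted",
    "broadcast_shapes",
    "out of memory",
)


def is_oom_like(exc: BaseException) -> bool:
    """True if the exception should quarantine the wave."""
    if isinstance(exc, MemoryError):
        return True
    return any(marker in str(exc) for marker in OOM_MARKERS)


def error_kind(exc: BaseException) -> str:
    """First matching marker, else the exception class name (hypothetical rule)."""
    for marker in OOM_MARKERS:
        if marker in str(exc):
            return marker
    return type(exc).__name__


def append_quarantine(harvest_dir, wave_idx, gi_range, seeds, exc, max_tokens):
    """Append one JSON line describing the failed wave to quarantine.jsonl."""
    record = {
        "wave_idx": wave_idx,
        "gi_range": list(gi_range),
        "seeds": list(seeds),
        "error_kind": error_kind(exc),
        "error_message": str(exc),
        "max_tokens_at_failure": max_tokens,
        "quarantined_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with (Path(harvest_dir) / "quarantine.jsonl").open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def run_wave(step, harvest_dir, wave_idx, gi_range, seeds, max_tokens):
    """Run one wave; return "ok", "quarantined", or "failed"."""
    try:
        step()  # stands in for the wave's model.step_batch(...) loop
        return "ok"
    except (RuntimeError, MemoryError) as exc:
        if is_oom_like(exc):
            append_quarantine(harvest_dir, wave_idx, gi_range, seeds, exc, max_tokens)
            return "quarantined"
        return "failed"  # original path: gen_failed, no quarantine record
```

In both failure cases the caller would still mark the wave's decisions `gen_failed`; only the OOM-like branch leaves a record behind for `--rerun-quarantined`.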

### 3. SIGKILL recovery via wave sentinel
Before each wave's first `step_batch` call we write `<harvest_dir>/wave_in_progress.txt` with `{wave_idx, gi_range, seeds, max_tokens, started_at, pid}`. We clear it after the wave's generate loop exits. SIGKILL can't be caught — but on next `--resume`, if a sentinel exists, the resume path quarantines that wave (with `error_kind="SIGKILL_or_crash"`) and clears the file.
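
A minimal sketch of the sentinel lifecycle, assuming the sentinel body is a single JSON object (the field names are the ones listed above). The function names, the write-then-rename step, and the `error_message` text are illustrative choices, not taken from the script.

```python
import json
import os
import time
from pathlib import Path

SENTINEL_NAME = "wave_in_progress.txt"


def write_sentinel(harvest_dir, wave_idx, gi_range, seeds, max_tokens):
    """Record that a wave is starting; called before its first step_batch."""
    payload = {
        "wave_idx": wave_idx,
        "gi_range": list(gi_range),
        "seeds": list(seeds),
        "max_tokens": max_tokens,
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "pid": os.getpid(),
    }
    path = Path(harvest_dir) / SENTINEL_NAME
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload))
    # Rename into place so a kill mid-write cannot leave a half-written sentinel.
    os.replace(tmp, path)
    return path


def clear_sentinel(harvest_dir):
    """Remove the sentinel once the wave's generate loop has exited."""
    (Path(harvest_dir) / SENTINEL_NAME).unlink(missing_ok=True)


def recover_orphan_sentinel(harvest_dir):
    """On resume: quarantine the wave named by a leftover sentinel, if any."""
    path = Path(harvest_dir) / SENTINEL_NAME
    if not path.exists():
        return None
    info = json.loads(path.read_text())
    record = {
        "wave_idx": info["wave_idx"],
        "gi_range": info["gi_range"],
        "seeds": info["seeds"],
        "error_kind": "SIGKILL_or_crash",
        "error_message": "orphan wave sentinel found on resume",
        "max_tokens_at_failure": info["max_tokens"],
        "quarantined_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with (Path(harvest_dir) / "quarantine.jsonl").open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    path.unlink()
    return record
```

The key property is that the sentinel outlives the process: nothing has to run at kill time, and the next `--resume` does all the bookkeeping.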

### 4. `--rerun-quarantined` flag
With `--resume <dir> --rerun-quarantined`:
- Reads all `quarantine.jsonl` records in the dir.
- Excludes any gi already in `quarantine_resolved.jsonl` (succeeded on retry) or `quarantine_terminal.jsonl` (failed again — drop from corpus).
- Runs the remaining gi list at `--retry-batch-size` (default 4, down from the normal batch size of 8 or 6).
- After retry success, appends `{gi, seed, resolved_at}` to `quarantine_resolved.jsonl`.
- After retry failure, appends `{gi, seed, reason, terminal_at}` to `quarantine_terminal.jsonl`.
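
The selection step above amounts to a set difference over the three JSONL files. A minimal sketch, assuming `gi_range` is stored as an explicit list of gi values (as in the test plan's `gi_range=[12, 13, 14, 15]`) and that resolved/terminal records carry a `gi` field; the helper names are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path: Path):
    """Parse a JSONL file into a list of dicts; a missing file yields []."""
    if not path.exists():
        return []
    with path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]


def pending_quarantined_gis(harvest_dir):
    """gi values that were quarantined and are neither resolved nor terminal."""
    d = Path(harvest_dir)
    quarantined = set()
    for rec in read_jsonl(d / "quarantine.jsonl"):
        quarantined.update(rec["gi_range"])
    settled = set()
    for name in ("quarantine_resolved.jsonl", "quarantine_terminal.jsonl"):
        settled.update(rec["gi"] for rec in read_jsonl(d / name))
    return sorted(quarantined - settled)


def chunk(items, size):
    """Split the retry list into waves of at most `size` decisions."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

For example, with gi 12..15 quarantined and gi 12 already resolved, `chunk(pending_quarantined_gis(d), 2)` would yield the retry waves `[[13, 14], [15]]`.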

### 5. `--max-tokens N` override
Constant `MODEL_MAX_TOKENS` is the default; `--max-tokens N` overrides. Used at `GemmaLocalNativeBatched(max_tokens=...)` construction and embedded in sentinel + quarantine records for forensics.

## New CLI surface

```
--max-tokens N            Default 2048. Override per-turn generation cap.
--rerun-quarantined       With --resume: retry quarantined gi's only.
--retry-batch-size N      Default 4. Batch size for --rerun-quarantined.
--inject-oom-at-wave N    TEST: raise fake metal::malloc at wave N.
```

Pre-existing flags unchanged: `--resume`, `--limit`, `--variant`, `--batch-size`, `--decisions`, `--model-repo`, `--adapter-path`, `--corpus-path`.
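
For orientation, the four new flags could be declared with `argparse` roughly as below. The flag names and defaults are from the table above; the argument types, the `--resume` declaration, and the check that `--rerun-quarantined` requires `--resume` are assumptions about how the script might wire them, not a copy of its parser.

```python
import argparse

MODEL_MAX_TOKENS = 2048  # default after the revert described above


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="harvest_batched flags (sketch)")
    # Pre-existing flag, declared here only so the dependency check can run.
    p.add_argument("--resume", metavar="DIR", default=None)
    p.add_argument("--max-tokens", type=int, default=MODEL_MAX_TOKENS,
                   help="Override per-turn generation cap.")
    p.add_argument("--rerun-quarantined", action="store_true",
                   help="With --resume: retry quarantined gi's only.")
    p.add_argument("--retry-batch-size", type=int, default=4,
                   help="Batch size for --rerun-quarantined.")
    p.add_argument("--inject-oom-at-wave", type=int, default=None,
                   help="TEST: raise fake metal::malloc at wave N.")
    return p


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Hypothetical guard: retrying quarantined waves only makes sense on a resume.
    if args.rerun_quarantined and args.resume is None:
        parser.error("--rerun-quarantined requires --resume <dir>")
    return args
```

With no arguments this yields `max_tokens=2048`, `retry_batch_size=4`, and injection disabled.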

## Failure mode → behavior table

| Failure | Detected by | Behavior | Recoverable? |
|---|---|---|---|
| Metal OOM (`metal::malloc`, `Resource limit`, `Resource exhausted`, `out of memory`) | RuntimeError msg match | Quarantine wave, log warning, advance to next wave | Yes via `--rerun-quarantined` |
| mlx-lm 0.31.2 broadcast bug (`broadcast_shapes`) | RuntimeError msg match | Quarantine wave, advance | Yes via `--rerun-quarantined` |
| Python `MemoryError` | `isinstance(exc, MemoryError)` | Quarantine wave, advance | Yes via `--rerun-quarantined` |
| OS OOM-killer / SIGKILL | Orphaned `wave_in_progress.txt` at next `--resume` | Quarantine that wave's gi range, clear sentinel, continue normally | Yes via `--rerun-quarantined` |
| Unknown `RuntimeError` | Catch but classifier says no | Original behavior: mark wave's decisions `gen_failed`, break wave (resume re-attempts in normal flow) | Yes via normal `--resume` |
| `_init_decision_state` failure | Existing try/except in init pass | Logged, decision skipped, wave continues | Decision dropped permanently from this run |
| Normal completion | n/a | `_finalize` writes `trace_summary.json`, `n_completed += 1` | n/a |

## Test plan (smoke without actually OOMing)

The `--inject-oom-at-wave N` flag triggers `RuntimeError("metal::malloc test injection")` on step 1 of wave N. Use it like this:

```bash
# 1. Smoke run with injection at wave 3 (waves 0..2 succeed; wave 3 quarantines).
PYTHONPATH=. .venv/bin/python -u scratch/belief_trajectory_rollout/harvest_batched.py \
    --variant D_required_first --batch-size 4 --limit 16 \
    --inject-oom-at-wave 3 \
    --corpus-path gus/data/corpus_train_chunk_0-99.pt
```

Expected post-conditions:
- 12 decisions completed (waves 0..2, gi 0..11).
- Wave 3 (gi 12..15) NOT in `decision_*/trace_summary.json`.
- `quarantine.jsonl` contains one record with `gi_range=[12, 13, 14, 15]`, `error_kind="metal::malloc"`.
- `wave_in_progress.txt` does NOT exist (cleared post-wave).
- Status line `wave wall=...s ... quarantined=True` in `harvest_batched_STATUS.txt`.
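
The file-level post-conditions above can be checked mechanically. The following is a hedged sketch of such a checker: it assumes waves cover contiguous gi ranges of `--batch-size` decisions (consistent with "waves 0..2, gi 0..11") and that decision directory names end in the integer gi. It does not inspect `harvest_batched_STATUS.txt`, and both function names are hypothetical.

```python
import json
from pathlib import Path


def wave_gi_ranges(limit, batch_size):
    """Contiguous gi lists per wave; limit=16, batch_size=4 gives 4 waves."""
    return [list(range(s, min(s + batch_size, limit)))
            for s in range(0, limit, batch_size)]


def check_smoke_run(harvest_dir, limit=16, batch_size=4, injected_wave=3):
    """Return a list of problems; an empty list means every check passed."""
    d = Path(harvest_dir)
    waves = wave_gi_ranges(limit, batch_size)
    expected_done = {gi for i, wave in enumerate(waves)
                     if i != injected_wave for gi in wave}
    # Assumes directory names like decision_12 (integer suffix after the last "_").
    done = {int(p.parent.name.rsplit("_", 1)[-1])
            for p in d.glob("decision_*/trace_summary.json")}
    problems = []
    if done != expected_done:
        problems.append(f"completed gi mismatch: {sorted(done ^ expected_done)}")

    qpath = d / "quarantine.jsonl"
    lines = qpath.read_text().splitlines() if qpath.exists() else []
    records = [json.loads(line) for line in lines if line.strip()]
    if len(records) != 1:
        problems.append(f"expected 1 quarantine record, found {len(records)}")
    else:
        rec = records[0]
        if rec.get("gi_range") != waves[injected_wave]:
            problems.append(f"unexpected gi_range: {rec.get('gi_range')}")
        if rec.get("error_kind") != "metal::malloc":
            problems.append(f"unexpected error_kind: {rec.get('error_kind')}")

    if (d / "wave_in_progress.txt").exists():
        problems.append("wave_in_progress.txt still present")
    return problems
```

Running `check_smoke_run("harvest_batched_<ts>")` after step 1 should return an empty list if the run behaved as described.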

```bash
# 2. Re-attempt the quarantined batch.
PYTHONPATH=. .venv/bin/python -u scratch/belief_trajectory_rollout/harvest_batched.py \
    --resume harvest_batched_<ts> --rerun-quarantined --retry-batch-size 2 \
    --variant D_required_first \
    --corpus-path gus/data/corpus_train_chunk_0-99.pt
```

Expected post-conditions:
- gi 12..15 each get `decision_<gi>/trace_summary.json`.
- `quarantine_resolved.jsonl` contains 4 records.
- `quarantine_terminal.jsonl` not created (no terminal failures).

```bash
# 3. SIGKILL recovery — manual: kill the process during a wave, then resume.
#    a. Start a normal harvest in the background.
#    b. Mid-wave, kill -9 the process.
#    c. Confirm wave_in_progress.txt exists in harvest_batched_<ts>/.
#    d. Resume:
PYTHONPATH=. .venv/bin/python -u scratch/belief_trajectory_rollout/harvest_batched.py \
    --resume harvest_batched_<ts> --batch-size 4 \
    --variant D_required_first --limit 16 \
    --corpus-path gus/data/corpus_train_chunk_0-99.pt
```

Expected: status line `orphan wave sentinel detected: ...`; the killed wave appears in `quarantine.jsonl`; sentinel cleared; remaining waves run normally; `--rerun-quarantined` follow-up retries the killed wave.

## Backwards compatibility

Default-flag behavior is byte-identical to before, modulo the `MODEL_MAX_TOKENS` 1024→2048 default change. No new files persist unless the resilience layer fires: `quarantine.jsonl` is never created on a clean run, and `wave_in_progress.txt` exists only while a wave is in flight and is cleared once the wave's generate loop exits. Existing `--resume` semantics preserved: idempotent skip-pass via `_is_trace_summary_complete`.

## Recommended launch command

```bash
PYTHONPATH=. .venv/bin/python -u scratch/belief_trajectory_rollout/harvest_batched.py \
    --variant D_required_first \
    --batch-size 6 \
    --limit 2000 \
    --max-tokens 2048 \
    --corpus-path gus/data/corpus_train_chunk_0-99.pt \
    2>&1 | tee scratch/belief_trajectory_rollout/prod_2000_v2.log
```

Notes for the operator:
- `--max-tokens 2048` is the floor justified by `LENGTH_STATS_COMPARISON.md`. If team-lead wants a different value, just edit that one flag.
- `--batch-size 6` matches the prior killed run; if 2048-token cap creates Metal pressure at 6 sequences, the resilience layer will quarantine the offending wave rather than dying. Then run `--rerun-quarantined --retry-batch-size 2` afterward.
- The killed `harvest_batched_20260425_031033/` partial outputs (1398/2000 done) are preserved per instruction. Decide separately whether to resume that dir vs. start fresh — a fresh run avoids any file-state ambiguity from the SIGKILL.

## Files preserved

`scratch/belief_trajectory_rollout/harvest_batched_20260425_031033/` is untouched (1398 partial decisions). Useful as comparison data — do not delete.

## Sibling-agent crosslinks

- `LENGTH_STATS_COMPARISON.md` — confirms 2048 floor; sequential p99=1770 chars (~600 tokens), max 2639 chars (~900 tokens); 1.2% of batched turns currently sit at the 1024-token cap.
- `PARITY_AUDIT.md` — confirms turn-1 truncations from the 1024 cap account for the `belief_called_turns=[2]` anomaly observed in 64/560 batched decisions; raising the cap should remove this regression class.