What we learned letting a reasoning knob retune our benchmark

Same fleet, same seed, same number of games per iteration. Only the OpenAI reasoning_effort flag changed, and the absolute score moved more than for any other single lever in this benchmark.

Dice · April 2026

Arena War is an iterative, spatial, multi-agent coding benchmark. On the same seed, same fleet, and same number of games per iteration, turning up OpenAI's reasoning_effort from the default to high on one model produced a larger absolute-score swing than switching modes, switching models, or bumping n from 10 to 25. It also introduced a new failure mode — extraction errors — that broke the adversarial comparison at a fixed token budget. This post documents the 2×2×2 (mode × reasoning × model) cell-by-cell, states precisely what the result means, and is clear about what it doesn't.

TL;DR

Reproduction: runner at arena-war-eval-v0.3.3, schemaVersion 6, per-game seeds are a deterministic function of the run seed, the model index, the iteration, and the game index. All four JSON artifacts are on the machine that produced them.
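The exact derivation lives in the runner; what matters for reproduction is only the property stated above: each game's seed is a pure function of (run seed, model index, iteration, game index). A minimal sketch of that property, with an illustrative hash that is not the runner's:

```javascript
// Hypothetical sketch — the real derivation is in eval-runner.js.
// The only property relied on is determinism: same inputs, same seed.
function perGameSeed(runSeed, modelIndex, iteration, gameIndex) {
  // Mix the four integers with a simple 32-bit multiply/xor-shift hash.
  let h = runSeed >>> 0;
  for (const x of [modelIndex, iteration, gameIndex]) {
    h = Math.imul(h ^ (x >>> 0), 0x9e3779b1) >>> 0;
    h = (h ^ (h >>> 15)) >>> 0;
  }
  return h;
}

// Same inputs always yield the same seed; changing any argument changes it.
const s1 = perGameSeed(424242, 0, 1, 7);
const s2 = perGameSeed(424242, 0, 1, 7);
```

Because every step of the mix is bijective on 32-bit integers, two runs with the same seed replay the exact same 25 game boards per iteration; the only stochastic element left is model generation.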

[Figure 1: bar chart of best-iteration score by cell (frontier pair, n=25 games per iteration, seed 424242). Bars (opus / gpt-5.4): self-play · default 14.5% / 38.6%; self-play · high 53.9% / 57.0%; adversarial · default 40.8% / 40.9%; adversarial · high 46.4% / 38.3%. Dashed line: held-out reference ≈ 20%.]
Figure 1. Best-iteration score across the four (mode × reasoning) cells. Each bar is the highest per-iteration mean territory fraction recorded by the model across up to 6 iterations at n=25 games each. The dashed horizontal line is the held-out reference's mean territory across its 25 matched games against each model, averaged across the four runs (≈ 20%). The two self-play columns show the starkest split: at default reasoning, opus finishes at 14.5% while gpt-5.4 finishes at 38.6%; raising gpt-5.4's reasoning_effort to high lifts both models' best-iteration scores into the 50s.

The benchmark in one paragraph

Arena War is a 60×60 circular grid with four players. Each tick, every player's algorithm returns a prioritized list of [row, col] cells it wants to claim; the engine resolves claims in order, respecting adjacency and the circular mask. A game ends at a fixed tick budget, when the grid fills, or when no player can make progress (a stalemate). The primary score is the percentage of cells occupied by the player's color at end-of-game. An eval iteration is n=25 games, each with a different per-game seed, against a fixed fleet of baseline opponents. The prompt gives a model the full game rules plus its prior-iteration scores and winner code (its own only in self-play; its own plus the two anonymized top opponent algorithms in adversarial), and asks it to return a single JavaScript function. The runner extracts, compiles, and runs that function headlessly against the baseline fleet, then feeds the scores back into the next iteration's prompt. Early stop fires when an iteration fails to produce a new best (STALE) or when a run-level plateau rule finds the latest iteration's CI95 overlapping the running best's CI95. The output JSON is self-describing: version, schema, seed, per-game seeds, per-iteration failure flags, bootstrap pairwise diffs, a Bradley-Terry opponent-aware rating, a best-vs-best head-to-head matrix, and a held-out reference benchmark.
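To make the contract concrete: the runner looks for a named function (see Figure 4's caption: `function myAlgorithm(…) { … }`). The exact signature is defined by the prompt and not reproduced in this post; the sketch below assumes, for illustration only, that the function receives the current grid and its own player id:

```javascript
// Illustrative only — the prompt defines the real signature. Assumed here:
// grid is a 2D array (0 = empty, 1-4 = player colors), playerId is 1-4,
// and the return value is a prioritized list of [row, col] claims.
function myAlgorithm(grid, playerId) {
  const claims = [];
  const size = grid.length; // 60 in the real arena
  // Naive strategy: claim every empty cell adjacent to one of our cells,
  // in scan order. The engine enforces adjacency and the circular mask
  // regardless of what we return.
  for (let r = 0; r < size; r++) {
    for (let c = 0; c < size; c++) {
      if (grid[r][c] !== 0) continue;
      const neighbors = [[r - 1, c], [r + 1, c], [r, c - 1], [r, c + 1]];
      if (neighbors.some(([nr, nc]) =>
          grid[nr] !== undefined && grid[nr][nc] === playerId)) {
        claims.push([r, c]);
      }
    }
  }
  return claims;
}
```

A strategy this naive is roughly what the baseline fleet punishes; the iterative prompt exists precisely to push models past it.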

What we varied

| Axis | Values | Models affected |
| --- | --- | --- |
| Mode | self-play, adversarial | both |
| Reasoning effort | default (omitted), high | gpt-5.4-2026-03-05 only (reasoning_effort is an OpenAI reasoning-family parameter; it is omitted for anthropic and for non-reasoning OpenAI models) |
| Model | claude-opus-4-7, gpt-5.4-2026-03-05 | — |

Held constant: n=25 games per iteration, seed 424242, baseline fleet (Density Wave, Diagonal Spiral, Greedy BFS), held-out reference algorithm (HeldOutReference-v1), up to 6 iterations with CI-overlap plateau early-stop, and an 8192-token visible output budget per call. At reasoning=high, the OpenAI provider raises max_completion_tokens to a floor of 32768 to cover both internal reasoning and visible output; non-reasoning calls are untouched. Each call has a per-effort request timeout (30 minutes at high; the OpenAI SDK's default of 10 minutes was empirically insufficient for long iterative prompts).
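The per-call option logic described above can be sketched as follows. This is a hypothetical reconstruction, not the runner's code; the function name and the `timeoutMs` field are illustrative, while the numeric constants are the ones stated in the text:

```javascript
// Hypothetical sketch of the per-call options described in the text.
const VISIBLE_BUDGET = 8192;          // visible-output token budget
const HIGH_COMPLETION_FLOOR = 32768;  // floor at reasoning=high

function openaiCallOptions(reasoningEffort) {
  const high = reasoningEffort === 'high';
  return {
    // Reasoning models spend completion tokens on hidden traces too, so
    // the cap at high must cover reasoning + visible output.
    max_completion_tokens: high
      ? Math.max(VISIBLE_BUDGET, HIGH_COMPLETION_FLOOR)
      : VISIBLE_BUDGET,
    // Omit the flag entirely at default (and for non-reasoning models).
    ...(high ? { reasoning_effort: 'high' } : {}),
    // Per-effort request timeout: 30 min at high, SDK-default 10 min otherwise.
    timeoutMs: high ? 30 * 60 * 1000 : 10 * 60 * 1000,
  };
}
```

Note the asymmetry this creates: at default, the flag is absent from the request body entirely, which is what "default (omitted)" in the axis table means.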

Four runs fall out of the design:

| Run | Mode | Reasoning | Notes |
| --- | --- | --- | --- |
| A | self-play | default | Baseline n=25; prior session |
| B | adversarial | default | Baseline n=25; prior session |
| C2 | self-play | high | This session; re-ran after surfacing a timeout bug |
| D | adversarial | high | This session |

Cell-by-cell results

The 2×2×2

| Mode × reasoning | opus best | gpt-5.4 best | Pairwise best-vs-best | H2H best-vs-best (25 games) | opus Δ vs reference | gpt-5.4 Δ vs reference |
| --- | --- | --- | --- | --- | --- | --- |
| Self-play, default | 14% (iter 2) | 39% (iter 1) | Δ = −24.1%, CI [−28.9, −18.4], sig (b_better) | gpt 15-9-1, Δ = −6.9 [−13.5, −0.0], sig | −10.8% [−17.4, −3.6], sig (ref better) | +13.6% [+11.9, +15.5], sig, 25/25 wins |
| Self-play, high | 54% (iter 3) | 57% (iter 3) | Δ = −3.0%, CI [−7.4, +1.3], tied | gpt 17-6-2, Δ = −8.3 [−13.0, −3.3], sig (b_better) | +31.0% [+27.2, +34.5], sig, 25/25 wins | +42.3% [+39.9, +44.7], sig, 25/25 wins |
| Adversarial, default | 41% (iter 1) | 41% (iter 3) | Δ = −0.04%, CI [−4.8, +4.9], tied | opus 12-13-0, Δ = −2.9 [−6.4, +0.9], tied | +15.0% [+12.3, +17.8], sig, 25/25 wins | +20.2% [+15.8, +24.4], sig, 23/25 wins |
| Adversarial, high | 46% (iter 4) | 38% (iter 1) | Δ = +8.1%, CI [+4.3, +12.4], sig (a_better) | opus 24-1-0, Δ = +13.6 [+9.6, +17.7], sig | +32.1% [+29.5, +34.8], sig, 25/25 wins | +10.5% [+7.0, +13.8], sig, 22/25 wins |

All percentages are cells-claimed-at-end-of-game; all CIs are 95% bootstrap CIs with 4000 resamples; significance is CI-excludes-zero at α = 0.05.
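The CI convention used throughout the tables can be sketched in a few lines: a percentile bootstrap on the difference in means, with significance defined as the 95% CI excluding zero. This is a minimal illustration of the convention; the runner's exact resampling scheme may differ:

```javascript
// Percentile bootstrap on the difference in means of two game-score samples.
// Significant iff the 95% CI excludes zero — the convention in the tables.
function bootstrapDiffCI(a, b, resamples = 4000, rng = Math.random) {
  const mean = xs => xs.reduce((s, x) => s + x, 0) / xs.length;
  const resample = xs => xs.map(() => xs[Math.floor(rng() * xs.length)]);
  const diffs = [];
  for (let i = 0; i < resamples; i++) {
    diffs.push(mean(resample(a)) - mean(resample(b)));
  }
  diffs.sort((x, y) => x - y);
  const lo = diffs[Math.floor(0.025 * resamples)];       // 2.5th percentile
  const hi = diffs[Math.ceil(0.975 * resamples) - 1];    // 97.5th percentile
  return { delta: mean(a) - mean(b), lo, hi, significant: lo > 0 || hi < 0 };
}
```

The "tied" verdicts in the table are exactly the rows where this interval straddles zero.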

Iteration traces

At reasoning=high, both models' iteration traces become less monotonic and more volatile. The four panels below plot per-iteration mean territory for both models in each of the four cells. Failed iterations (SYN_ERR) are drawn as an open circle on the x-axis.

[Figure 2: four panels of per-iteration mean territory (y: 0–60%) vs iteration index, SYN_ERR iterations as open circles. Panel bests (opus / gpt-5.4): self-play · default 14% / 39%; self-play · high 54% / 57%; adversarial · default 41% / 41%; adversarial · high 46% / 38%.]
Figure 2. Per-iteration mean territory for both models across the four cells. Each dot is n=25 games. Open circles on the x-axis denote SYN_ERR iterations where the runner could not extract a parseable function from the model's output. Dashed line segments cross over a failed iteration; solid segments are consecutive successful iterations. At reasoning=high in self-play (top right), both models step-function to their best near iteration 3; in adversarial (bottom right), opus iterates cleanly to 46% at iteration 4 while gpt-5.4 fails to iterate past iter 1.

Where gpt-5.4's +18pp really came from

Line up gpt-5.4's iter-by-iter self-play traces, default vs high (the top two panels of Figure 2):

At iter 1 the two configurations are statistically indistinguishable. The +18pp gain is not a baseline lift; it's a higher ceiling the model can reach on the iterative prompt, once it has seen its own prior winner code. This matches the intuition that reasoning effort helps most when the task is "think carefully about what you already wrote and improve it" — exactly the iterative prompt's job.

Three load-bearing findings

1. At the frontier, matched-lobby H2H and cross-lobby pairwise can disagree — and H2H is the stronger evidence

In Run C2 (self-play, reasoning=high), the bootstrap pairwise comparison between opus's iter-3 (54%) and gpt-5.4's iter-3 (57%) returns a tied verdict:

pairwise: Δ = -3.0%, CI95 = [-7.4, +1.3]

But if you take those same two algorithms and drop them into a shared lobby — identical grid, identical seed sequence, four-player free-for-all — 25 games gives:

h2h:      gpt-5.4 wins 17, opus wins 6, 2 draws
          Δ = -8.3%, CI95 = [-13.0, -3.3]    (significant)
[Figure 3: matched-lobby head-to-head, self-play · high. Best-iter opus vs best-iter gpt-5.4 on the same 25 seeds, four-player free-for-all: gpt-5.4 17 wins, opus 6 wins, 2 draws; Δ mean territory = −8.3%, 95% CI [−13.0, −3.3], significant. The pairwise-bootstrap verdict on the same two algorithms was tied (CI [−7.4, +1.3]).]
Figure 3. The pairwise bootstrap compares separately-played game sets — opus played its 25 games in a lobby with three baselines, gpt-5.4 played its 25 in a different lobby with the same three baselines. That averages across different opponent draws and inflates variance. The matched-lobby H2H puts both models in the same four-player game with two baselines as the third and fourth seats, so the variance from opponent draw cancels. Implication: when the two frontier leaders are close and the pairwise CI touches zero, prefer the matched-lobby H2H verdict in reporting.

2. reasoning=high has a non-trivial downside: extraction failures

Across the two high runs, gpt-5.4 produced 4 SYNTAX_ERROR iterations (C2 iter 2, C2 iter 5, D iter 2, D iter 3). Across the two default runs, zero. All four failures happen on iterations where the prompt grew — iter 2 onward in self-play adds the iter-1 winner source; iter 3 onward adds iter-2 winner source too; adversarial adds anonymized opponent code blocks on top.

[Figure 4: gpt-5.4 iterations, parseable vs SYN_ERR by run: self-play · default 3/3 parseable; adversarial · default 3/3 parseable; self-play · high 3/5 parseable (2 SYN_ERR); adversarial · high 1/3 parseable (2 SYN_ERR, early-stopped). All 4 SYN_ERR iterations happen under reasoning=high, zero under default.]
Figure 4. Iteration-by-iteration extraction status for gpt-5.4 across all four runs. Green blocks are iterations where the runner successfully extracted function myAlgorithm(…) { … } from the model's response and evaluated it. Red blocks are iterations where the call completed (the model produced text) but the output contained no parseable named function. All four red blocks sit under reasoning=high; the two runs at default reasoning have zero. The most likely mechanism is token-budget exhaustion on internal reasoning traces.

The most likely mechanism is token budget exhaustion on internal reasoning. At reasoning=high the provider raises max_completion_tokens to a 32768 floor to cover both internal reasoning and the visible response. On a short prompt, a 529-token completion (we measured this in a smoke test) uses <2% of that floor. On an iterative prompt with 3500+ characters of prior winner code, reasoning traces are substantially longer, and the model can exhaust its budget before it emits the parseable function myAlgorithm(...) { ... } block. What we see — "call returned text, no named function in the text" — is consistent with a partial emission where the model wrote its reasoning summary and ran out of budget before writing the code.
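The failure signature ("call returned text, no named function") is easy to reproduce against any extractor of the shape the runner uses. The sketch below is not the runner's parser — it's a minimal brace-walking extractor that exhibits the same three outcomes: full function extracted, no function at all, or a truncated function that started but never closed:

```javascript
// Illustrative extractor: succeeds only on a complete named function.
// A budget-truncated emission starts the function but never closes it.
function extractFunction(responseText, name = 'myAlgorithm') {
  const start = responseText.indexOf(`function ${name}`);
  if (start === -1) return null; // no named function in the output at all
  // Walk braces to the matching close; truncation leaves depth > 0 forever.
  let depth = 0, seenOpen = false;
  for (let i = start; i < responseText.length; i++) {
    if (responseText[i] === '{') { depth++; seenOpen = true; }
    if (responseText[i] === '}') depth--;
    if (seenOpen && depth === 0) return responseText.slice(start, i + 1);
  }
  return null; // SYN_ERR: function opened but the text ran out
}
```

Both null paths count as the same SYN_ERR in the dashboard, which is why the logs alone can't distinguish "model never wrote code" from "model ran out of budget mid-function".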

This has a consequence for the adversarial comparison in Run D: gpt-5.4 fails to extract on iters 2 and 3, gets no new scores, and the plateau rule early-stops it at iter-1's 38%. Opus meanwhile iterates normally to 46% at iter 4. The pairwise Δ = +8.1 sig and the H2H 24-1-0 are both consequences of gpt-5.4 getting three fewer effective iterations than opus. We do not read this as "opus beats gpt-5.4 at adversarial reasoning=high." We read it as "the 32k token floor is insufficient for gpt-5.4 at high on the adversarial iterative prompt." The clean rerun would be with a 48k or 64k floor.

3. opus self-play is genuinely noisy at n=25 — don't draw strong conclusions from a single opus number

Opus's best iteration score in self-play at default reasoning was 14% at n=25. At reasoning=high — where opus doesn't even receive the reasoning knob — opus's best was 54% at n=25. The +40pp swing is pure sampling variance between two independent runs against the same seed.

We investigated: anthropic models don't expose a reasoning_effort equivalent on their API (Sonnet/Opus 4 have extended-thinking via a separate parameter, which we don't toggle here). Opus's iterations are independently sampled temperature=0.7 calls with the same prompt each time. The deterministic game seeds don't stabilize this because the stochastic element is in the generation, not the game.

This means: at n=25, opus self-play numbers should always be reported with a large grain of salt. The n=10 number from a prior run (52% at iter 1) was a particularly unreliable overstatement: it sat inside a wide n=10 CI95 of [34, 56], yet it drove a headline we later had to retract. In any follow-up, opus self-play should be run at least 3 times to get a distribution over best-iteration scores, not a point estimate.

Two tests that are less noisy:

The held-out reference, as an anchor

Every run includes a held-out reference benchmark: 25 games of the model's best iteration against HeldOutReference-v1, a frozen hand-written opponent that no model sees in training or in prompts. At n=25 the reference result is the single most comparable number across runs because (a) it's always the same opponent, (b) the seed derivation is the same, and (c) the significance test is the same.

[Figure 5: Δ vs held-out reference, all 8 (model × run) entries, sorted by Δ, 95% bootstrap CI error bars: +42.3% gpt-5.4 · self-play · high; +32.1% opus · adv · high; +31.0% opus · self-play · high; +20.2% gpt-5.4 · adv · default; +15.0% opus · adv · default; +13.6% gpt-5.4 · self-play · default; +10.5% gpt-5.4 · adv · high (★); −10.8% opus · self-play · default. ★ = early-stopped at iter 1 due to SYN_ERR cascade; Δ is iter-1 only (n=25).]
Figure 5. Δ vs held-out reference across all eight (model × run) entries, sorted by Δ. Error bars are 95% bootstrap CIs. Seven of eight entries are significantly above zero. The one negative entry — opus · self-play · default — is the small-N variance outlier discussed in Finding 3. The top two entries, both under reasoning=high, are separated by ~1pp and are statistically tied within their CIs. The prior reference-Δ record was +20.2%; the new record is +42.3%, a 2.1× bump.


Bradley-Terry ratings (an opponent-aware sanity check)

Arena War also computes a Bradley-Terry rating via MM iteration across the full four-player game set (all baselines, all model iterations). It's useful as a second opinion on the leaderboard.

For Run D (adversarial, reasoning=high):

| Player | Kind | Elo | Games | Wins |
| --- | --- | --- | --- | --- |
| gpt-5.4-2026-03-05 | model | 1341.5 | 79 | 74 |
| claude-opus-4-7 | model | 1153.9 | 454 | 383 |
| Density Wave | baseline | 978.1 | 529 | 314 |
| Diagonal Spiral | baseline | 821.6 | 529 | 188 |
| Greedy BFS | baseline | 704.9 | 529 | 101 |

gpt-5.4 has a 188-point Elo lead over opus — despite opus winning the matched-lobby H2H 24-1-0. The reason is sample-size imbalance: gpt-5.4 has only 79 games in the rating pool (iter 1 plus the reference/H2H runs; it never generated a valid iter-2/3 algorithm) and won 74 of them, while opus has 454 games (five full iterations plus reference/H2H runs) and won 383. BT rewards the higher win rate no matter how few games produced it; the uncertainty of a small sample never shows up as an Elo decrement.

This is a known pathology of BT on unbalanced sample sizes. We flag it in the dashboard and don't treat it as the primary verdict for adversarial reasoning=high. The pairwise CI + H2H pair is the real story.
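The pathology is visible in a toy Bradley-Terry fit. The sketch below uses the standard MM update on a pairwise win matrix; how the runner decomposes four-player games into pairwise results is not shown in this post and is an assumption here:

```javascript
// Bradley-Terry strengths via MM iteration on a pairwise win matrix.
// wins[i][j] = number of times player i beat player j. Ratings carry no
// uncertainty, so a 4-1 record can out-rate a 300-of-400 record.
function bradleyTerry(wins, iters = 200) {
  const n = wins.length;
  let p = new Array(n).fill(1);
  for (let t = 0; t < iters; t++) {
    const next = p.map((_, i) => {
      let w = 0, denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        w += wins[i][j];
        const games = wins[i][j] + wins[j][i];
        if (games > 0) denom += games / (p[i] + p[j]);
      }
      return denom > 0 ? w / denom : p[i];
    });
    const norm = next.reduce((s, x) => s + x, 0) / n; // fix the scale
    p = next.map(x => x / norm);
  }
  return p; // relative strengths; an Elo-style score is base + 400·log10(p[i])
}
```

With two players and a 4-1 record, the fitted strength ratio converges to exactly 4 — the same ratio a 400-100 record would produce, with no penalty for the thinner sample. That is the gpt-5.4 (79 games) vs opus (454 games) situation in miniature.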

Methodology

What this doesn't prove

  1. Not a frontier-model ranking: n=25 is enough to get tight CIs on reference-Δ and on matched-lobby H2H, but not enough to give a stable point estimate of opus's self-play best. Any ranking that ignores the +40pp between the two opus self-play runs is overfit to one sample.
  2. Not a general coding benchmark: Arena War measures one thing — the ability to iteratively improve a spatial territory algorithm through competition feedback. It does not test reasoning about modular code, long-horizon agents, tool use, architecture, debugging, or open-ended engineering. gpt-5.4's +18pp here is not evidence that reasoning=high helps more generally.
  3. Not a "high is always better" claim: In adversarial at reasoning=high, gpt-5.4 crashed into extraction failures and its comparison collapsed. At a fixed 32k completion floor, high with adversarial prompts is unstable. If you're running reasoning=high on long prompts in production, budget accordingly.
  4. Not an explanation of the underlying cause of +18pp: We can observe that the improvement happens at iteration 3 when the iterative prompt is active, not at iteration 1. We can't claim internal-chain-of-thought quality, attention sharpness, or any mechanism — we're showing behavior, not internals.
  5. Not comparable across EVAL_VERSIONs without re-running: Each EVAL_VERSION bump is marked score-affecting in the CHANGELOG. The v0.3.3 runs here should not be averaged with v0.3.2 or older numbers without a side-by-side replication.
  6. Not a symmetric cross-provider test: the reasoning knob gives the two families different advantages. opus gets no reasoning_effort knob in this eval; anthropic exposes an extended-thinking parameter we don't exercise here. A follow-up with extended-thinking on opus and reasoning_effort=high on gpt-5.4 simultaneously would be a fairer cross-provider frontier test.

What we'd do next

Appendix · reproducibility

| Run | Mode | Reasoning | EVAL_VERSION | Schema | Seed | JSON |
| --- | --- | --- | --- | --- | --- | --- |
| A | self-play | default | arena-war-eval-v0.3.2 | 5 | 424242 | eval-results-frontier-n25-selfplay.json |
| B | adversarial | default | arena-war-eval-v0.3.2 | 5 | 424242 | eval-results-frontier-n25-adversarial.json |
| C2 | self-play | high | arena-war-eval-v0.3.3 | 6 | 424242 | eval-results-frontier-n25-selfplay-reasoning-high.json |
| D | adversarial | high | arena-war-eval-v0.3.3 | 6 | 424242 | eval-results-frontier-n25-adversarial-reasoning-high.json |

The v0.3.3 runner adds --reasoning-effort, records protocol.reasoningEffort, and scales the OpenAI request timeout by effort level. See CHANGELOG in eval-runner.js and PR #17 for the full diff.

To reproduce Run C2:

npm run eval -- \
  --model claude-opus-4-7@anthropic \
  --model gpt-5.4-2026-03-05@openai \
  --mode self-play \
  --games-per-iter 25 \
  --iterations 6 \
  --seed 424242 \
  --reasoning-effort high

To reproduce Run D, swap --mode adversarial.

See the full, interactive results dashboard →