GPT-5.5 is the new clean frontier on Arena War

Same seed, same six-iteration budget, same 25-game scoring loop. We reran the frontier comparison for GPT-5.5 and Claude Opus 4.7 with no extraction failures in the published run.

Dice · April 2026 · arena-war-eval-v0.3.4

TL;DR

[Chart: Best-iteration score, clean frontier self-play (n=25 games per iteration, 6 iterations, seed 424242). Y-axis: 0% to 60%. GPT-5.5: 53%, best at iteration 3. Claude Opus 4.7: 41%, best at iteration 6. Reference line: held-out reference avg ≈ 18.7%.]
Figure 1. GPT-5.5 sets the clean self-play ceiling at 53% on iteration 3. Claude reaches 41% by iteration 6. Both bars use exactly the same game count and seed schedule.

The benchmark in one paragraph

Arena War measures a narrow capability: whether a model can iteratively improve a spatial territory algorithm after receiving competitive feedback. Each iteration asks the model to write one JavaScript function, runs that function in 25 seeded games against fixed baselines, then feeds the model its score history and current winner source. The claim is not general coding ability. The claim is repeated competitive algorithm improvement under a reproducible protocol.
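The loop described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: `runIterations`, `generate`, and `scoreGames` are hypothetical names, and the sketch assumes "current winner source" means the best-scoring source so far.

```javascript
// Hypothetical sketch of the write -> score -> feedback loop.
// `generate` stands in for the model call; `scoreGames` stands in for the
// 25 seeded games against fixed baselines. Neither is the real runner API.
function runIterations({ generate, scoreGames, iterations, gamesPerIter, seed }) {
  const history = [];
  let best = null;
  for (let iter = 1; iter <= iterations; iter++) {
    // The model is shown its score history and the current winner's source.
    const source = generate({
      history: history.slice(),
      winnerSource: best ? best.source : null,
    });
    // Every iteration is scored with the same game count and seed.
    const score = scoreGames(source, { games: gamesPerIter, seed });
    history.push({ iter, score });
    // Assumption: ties keep the earlier iteration as the winner.
    if (best === null || score > best.score) best = { iter, score, source };
  }
  return { history, best };
}

// Usage with stubbed scores copied from GPT-5.5's rounded published curve.
const curve = [0.50, 0.51, 0.53, 0.52, 0.53, 0.51];
let call = 0;
const result = runIterations({
  generate: ({ history }) => `function candidate${history.length + 1}() {}`,
  scoreGames: () => curve[call++],
  iterations: 6,
  gamesPerIter: 25,
  seed: 424242,
});
console.log(result.best.iter); // 3
```

Keeping the full `history` rather than only `best` matters because the benchmark reads the whole curve, not a single point.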

Protocol

Mode: self-play
Models: GPT-5.5, Claude Opus 4.7
Iteration budget: 6 iterations per model
Games per iteration: 25
Seed: 424242
Grid: 60×60, 4 players
OpenAI reasoning: high for GPT-5.5; Anthropic has no equivalent runtime flag
Plateau: early-stop disabled with --plateau-patience 99 so both models produce equal-length traces
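To see why a patience of 99 disables early stopping in a six-iteration run, here is a small sketch. It assumes the usual meaning of plateau patience (stop after that many consecutive iterations without a new best score); the runner's exact rule is not documented here, and `iterationsRun` is a hypothetical helper.

```javascript
// Assumed semantics: stop once `patience` consecutive iterations
// fail to beat the best score seen so far.
function iterationsRun(scores, patience) {
  let best = -Infinity;
  let stale = 0;
  let ran = 0;
  for (const score of scores) {
    ran++;
    if (score > best) {
      best = score;
      stale = 0;
    } else {
      stale++;
    }
    if (stale >= patience) break;
  }
  return ran;
}

// Claude Opus 4.7's published per-iteration scores, in percent.
const claudeCurve = [24, 13, 12, 18, 39, 41];
console.log(iterationsRun(claudeCurve, 99)); // 6: patience can never be exhausted
console.log(iterationsRun(claudeCurve, 2));  // 3: a hypothetical tight patience stops before the recovery
```

Under this assumed rule, a small patience would have cut Claude's trace before its late recovery, which is one reason equal-length traces make the two curves comparable.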

Results

Model | Best score | Best CI95 | Latest score | Extraction failures
GPT-5.5 | 53% · iter 3 | [50.2%, 56.7%] | 51% · iter 6 | 0
Claude Opus 4.7 | 41% · iter 6 | [38.7%, 43.9%] | 41% · iter 6 | 0
[Chart: Per-iteration territory curve. No extraction failures; every dot is n=25 scored games. X-axis: iterations 1 to 6. Y-axis: 0% to 60%. GPT-5.5: 50%, 51%, 53%, 52%, 53%, 51%. Claude Opus 4.7: 24%, 13%, 12%, 18%, 39%, 41%.]
Figure 2. GPT-5.5 starts high (50%) and stays near its ceiling; Claude regresses early before recovering late. This is why the curve, not a single point, is the primary benchmark signal.

Where the frontier comparison lands

The dashboard's cross-lobby pairwise bootstrap and matched-lobby H2H agree this time. GPT-5.5's best iteration is statistically higher than Claude's best iteration, and the direct replay puts both algorithms in the same four-player lobby with the same baselines, removing opponent-draw variance.

[Chart: Matched-lobby head-to-head, best vs best. Claude Opus 4.7 iter 6 vs GPT-5.5 iter 3, same 25 seeds. Claude Opus 4.7 wins: 2. GPT-5.5 wins: 23.]
Pairwise bootstrap: GPT-5.5 over Claude Opus 4.7 by +12.2% when read winner-minus-runner-up; CI [8.1%, 16.2%].
Figure 3. The matched-lobby replay is the cleanest frontier comparison here: GPT-5.5 wins 23-2 over Claude Opus 4.7. Read from Claude Opus 4.7's perspective, the difference is Δ = −14.6% with CI [−17.2%, −11.8%].
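A matched-lobby comparison of this kind can be sketched as a paired analysis: both algorithms are scored on the same seeds, per-game differences are taken, and the mean difference is bootstrapped. The code below is an illustration under assumptions, not the dashboard's implementation: `headToHead` is a hypothetical name, the percentile bootstrap and resample count are guesses, and a "win" is assumed to mean the higher per-game score.

```javascript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Paired comparison: scoresA[i] and scoresB[i] come from the same seeded game.
function headToHead(scoresA, scoresB, { resamples = 2000, seed = 1 } = {}) {
  if (scoresA.length === 0 || scoresA.length !== scoresB.length) {
    throw new Error("score arrays must be non-empty and the same length");
  }
  const diffs = scoresA.map((a, i) => a - scoresB[i]);
  const n = diffs.length;
  const delta = diffs.reduce((s, d) => s + d, 0) / n;
  const winsA = diffs.filter((d) => d > 0).length;
  const winsB = diffs.filter((d) => d < 0).length;

  // Percentile bootstrap over games: resample the paired differences.
  const rand = mulberry32(seed);
  const means = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * (resamples - 1))];
  const hi = means[Math.ceil(0.975 * (resamples - 1))];
  return { winsA, winsB, delta, ci95: [lo, hi] };
}

// Usage with made-up per-game territory shares (not the published data).
const summary = headToHead([0.5, 0.6, 0.4], [0.3, 0.7, 0.2]);
console.log(summary.winsA, summary.winsB);
```

The sign of `delta` depends on argument order, which is why the same result reads as positive for one model and negative from the other's perspective.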

Reliability

This run replaces the prior failure-case story because it is clean: every published iteration extracted a named function and scored. Score regressions still appear as normal learning-curve information, but there are no missing data points hidden behind syntax failures.

[Chart: Extraction reliability, clean run. Green means the runner extracted and evaluated a named JavaScript function. GPT-5.5: 6 of 6 parseable. Claude Opus 4.7: 6 of 6 parseable. Legend: parseable, extraction failure.]
Figure 4. The published comparison excludes the adversarial GPT-5.5 attempt that failed extraction. The self-play frontier run shown here has zero extraction failures across both models.
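For readers unfamiliar with what "extracted a named function" involves, the check might look roughly like the sketch below. Everything here is an assumption for illustration: the runner's real extraction logic is not shown in this post, `extractNamedFunction` and the function name `nextMove` are hypothetical, and a real harness would evaluate untrusted model output in a sandbox rather than with `new Function`.

```javascript
// Build the triple-backtick marker programmatically.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:javascript|js)?\\s*\\n([\\s\\S]*?)" + FENCE);

// Hypothetical check: pull code out of a model response, confirm it parses,
// and confirm it defines a function with the expected name.
function extractNamedFunction(responseText, expectedName) {
  const match = FENCED_BLOCK.exec(responseText);
  // Fall back to the raw response when no fenced block is present.
  const source = (match ? match[1] : responseText).trim();
  try {
    const factory = new Function(
      `${source}\nreturn typeof ${expectedName} === "function" ? ${expectedName} : null;`
    );
    const fn = factory();
    return fn ? { ok: true, fn, source } : { ok: false, reason: "missing-function" };
  } catch (err) {
    const reason = err instanceof SyntaxError ? "syntax-error" : "runtime-error";
    return { ok: false, reason };
  }
}

// Usage: a parseable response and one that fails extraction.
const good = extractNamedFunction(
  "Here is my strategy:\n" + FENCE + "js\nfunction nextMove(state) { return 1; }\n" + FENCE,
  "nextMove"
);
const bad = extractNamedFunction("function nextMove( {", "nextMove");
console.log(good.ok, bad.ok, bad.reason);
```

An iteration that returns `ok: false` would have no score at all, which is the kind of missing data point the clean run avoids.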

Held-out reference

The reference algorithm is frozen and never exposed in prompts or in the output JSON. It is not the benchmark target; it is an anchor that makes cross-release movement easier to interpret.

[Chart: Δ vs held-out reference. Best iteration only; error bars are 95% bootstrap CIs. GPT-5.5: +34.7%. Claude Opus 4.7: +15.4%.]
Figure 5. Both frontier models beat the frozen held-out reference, but GPT-5.5's reference margin (+34.7%) is more than double Claude's (+15.4%).

Iteration table

Iter | GPT-5.5 | Claude Opus 4.7
1 | 50% [45.3%, 54.7%] | 24% [18.7%, 28.5%]
2 | 51% [48.7%, 53.6%] | 13% [11.8%, 15.2%]
3 | 53% [50.2%, 56.7%] | 12% [10.4%, 13.1%]
4 | 52% [49.6%, 55.3%] | 18% [16%, 20.5%]
5 | 53% [51.2%, 54.9%] | 39% [37.5%, 41.1%]
6 | 51% [49.4%, 53.4%] | 41% [38.7%, 43.9%]

Reproduction

ANTHROPIC_API_KEY=... npm --prefix ../gameval run eval -- \
  --model claude-opus-4-7@anthropic \
  --iterations 6 --games-per-iter 25 --mode self-play \
  --seed 424242 --plateau-patience 99 \
  --output ../gameval/.devin-claude-opus47-selfplay-6iter.json

OPENAI_API_KEY=... npm --prefix ../gameval run eval -- \
  --model gpt-5.5-2026-04-23@openai \
  --iterations 6 --games-per-iter 25 --mode self-play \
  --seed 424242 --reasoning-effort high --plateau-patience 99 \
  --output ../gameval/.devin-gpt55-selfplay-high-6iter.json

node scripts/update-arena-war-results.mjs \
  --eval ../gameval/.devin-claude-opus47-selfplay-6iter.json \
  --eval ../gameval/.devin-gpt55-selfplay-high-6iter.json \
  --gameval-root ../gameval

What this does not prove

This does not prove GPT-5.5 is better at all coding tasks, all games, or all adversarial settings. It shows a clean win on this benchmark's self-play protocol: iterative improvement of one spatial territory algorithm under a fixed scoring loop.