GPT-5.5 is the new clean frontier on Arena War

Same seed, same six-iteration budget, same 25-game scoring loop. We reran the frontier comparison for GPT-5.5 and Claude Opus 4.7 with no extraction failures in the published run.


Dice · April 2026 · arena-war-eval-v0.3.4

TL;DR

GPT-5.5 reached 53% territory at iteration 3 (CI [50.2%, 56.7%]), versus 41% for Claude Opus 4.7 at iteration 6.
The best-iteration pairwise bootstrap favors GPT-5.5 by +12.2% over Claude Opus 4.7, with CI [8.1%, 16.2%].
The matched-lobby best-vs-best replay is stronger evidence: GPT-5.5 wins 23-2 across 25 seed-matched games.
Held-out reference anchor: GPT-5.5 is +34.7% over the frozen reference; Claude is +15.4%.
An adversarial GPT-5.5 attempt failed extraction at iteration 4, so it is intentionally excluded. The published artifact is the clean self-play comparison.

Figure 1. GPT-5.5 sets the clean self-play ceiling at 53% on iteration 3. Claude reaches 41% by iteration 6. Both bars use exactly the same game count and seed schedule.

The benchmark in one paragraph

Arena War measures a narrow capability: whether a model can iteratively improve a spatial territory algorithm after receiving competitive feedback. Each iteration asks the model to write one JavaScript function, runs that function in 25 seeded games against fixed baselines, then feeds the model its score history and current winner source. The claim is not general coding ability. The claim is repeated competitive algorithm improvement under a reproducible protocol.
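The loop described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: `propose` and `score_game` are hypothetical callables standing in for the model call and the game engine. Reusing one seed list in every iteration, and treating the "current winner source" as the best-scoring source so far, are both assumptions.

```python
import random
from typing import Callable, List, Tuple


def run_iterations(
    propose: Callable[[List[float], str], str],  # hypothetical: (score history, winner source) -> new source
    score_game: Callable[[str, int], float],     # hypothetical: (source, game seed) -> territory share
    iterations: int = 6,
    games_per_iter: int = 25,
    seed: int = 424242,
) -> List[Tuple[int, float]]:
    """Return (iteration, mean territory share) pairs for one model."""
    rng = random.Random(seed)
    # Assumption: the same per-game seeds are reused in every iteration,
    # so scores stay comparable across iterations and across models.
    game_seeds = [rng.randrange(2**31) for _ in range(games_per_iter)]

    history: List[float] = []
    best_source = ""
    trace: List[Tuple[int, float]] = []
    for iteration in range(1, iterations + 1):
        # The model sees its score history and the current winner source.
        source = propose(history, best_source)
        mean_score = sum(score_game(source, s) for s in game_seeds) / games_per_iter
        # Track the best source so far (assumed meaning of "winner source").
        if not history or mean_score > max(history):
            best_source = source
        history.append(mean_score)
        trace.append((iteration, mean_score))
    return trace
```

With plateau early-stopping disabled, the loop always runs the full budget, which is why both models produce six data points.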

Protocol

Mode	self-play
Models	GPT-5.5, Claude Opus 4.7
Iteration budget	6 iterations per model
Games per iteration	25
Seed	424242
Grid	60×60, 4 players
OpenAI reasoning	`high` for GPT-5.5; Anthropic has no equivalent runtime flag
Plateau	early-stop disabled with `--plateau-patience 99` so both models produce equal-length traces

Results

Model	Best score	Best CI95	Latest score	Extraction failures
GPT-5.5	53% · iter 3	[50.2%, 56.7%]	51% · iter 6	0
Claude Opus 4.7	41% · iter 6	[38.7%, 43.9%]	41% · iter 6	0

Figure 2. GPT-5.5 starts high (50%) and stays near its ceiling; Claude regresses early before recovering late. This is why the curve, not a single point, is the primary benchmark signal.

Where the frontier comparison lands

The dashboard's cross-lobby pairwise bootstrap and matched-lobby head-to-head (H2H) replay agree this time. GPT-5.5's best iteration scores higher than Claude's best iteration, with a pairwise bootstrap CI that excludes zero. The direct replay puts both algorithms in the same four-player lobby with the same baselines, which removes opponent-draw variance.
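For readers unfamiliar with the technique, a percentile bootstrap on the difference of mean scores can be sketched as below. The dashboard's exact resampling scheme is not described in this post, so the independent (unpaired) resampling of each model's per-game scores is an assumption, as are the function name and its parameters. Per-game scores are not published here, so the caller must supply them.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_diff_ci(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed mean(a) - mean(b), CI lower bound, CI upper bound)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each model's games with replacement, independently.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    # Percentile interval: cut alpha/2 from each tail of the sorted differences.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(a) - mean(b), lower, upper
```

An interval that excludes zero, such as the reported [8.1%, 16.2%], is what supports reading the best-iteration gap as more than game-to-game noise.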

Figure 3. The matched-lobby replay is the cleanest frontier comparison here: GPT-5.5 wins 23-2 over Claude Opus 4.7, with Δ −14.6% from Claude Opus 4.7's perspective and CI [−17.2%, −11.8%].
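As a rough sanity check on the 23-2 record, an exact two-sided sign test can be computed directly. This test is an addition for illustration and is not part of the dashboard's reported statistics; it assumes the 25 seed-matched games are independent and that none were ties.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Exact two-sided sign test p-value under a 50/50 null, ties excluded."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a record at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(f"p = {sign_test_p(23, 2):.2e}")
```

For 23 wins and 2 losses the p-value is 652 / 2^25, about 1.9e-5, so a record this lopsided is very unlikely if the two algorithms were evenly matched.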

Reliability

This run replaces the prior failure-case story because it is clean: every published iteration extracted a named function and was scored. Score regressions still appear as normal learning-curve information, but no data points are missing because of syntax failures.
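The harness's extraction logic is not shown in this post. Purely as an illustration of what "extracted a named function" could mean, the sketch below looks for a named JavaScript function declaration in a model response and counts responses that lack one. The regular expression and both function names are hypothetical.

```python
import re
from typing import Iterable, Optional

# Hypothetical pattern: a classic `function name(` declaration.
NAMED_FUNCTION = re.compile(r"\bfunction\s+([A-Za-z_$][\w$]*)\s*\(")


def extract_function_name(response: str) -> Optional[str]:
    """Return the first named function in the response, or None if absent."""
    match = NAMED_FUNCTION.search(response)
    return match.group(1) if match else None


def count_extraction_failures(responses: Iterable[str]) -> int:
    """Count responses from which no named function could be extracted."""
    return sum(1 for r in responses if extract_function_name(r) is None)
```

Under this reading, a clean run is one where the failure count is zero for every model, so each iteration contributes a scored data point.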

Figure 4. The published comparison excludes the adversarial GPT-5.5 attempt that failed extraction. The self-play frontier run shown here has zero extraction failures across both models.

Held-out reference

The reference algorithm is frozen and never exposed in prompts or in the output JSON. It is not the benchmark target; it is an anchor that makes cross-release movement easier to interpret.

Figure 5. Both frontier models beat the frozen held-out reference, but GPT-5.5's reference margin (+34.7%) is more than double Claude's (+15.4%).

Iteration table

Iter	GPT-5.5	Claude Opus 4.7
1	50% [45.3%, 54.7%]	24% [18.7%, 28.5%]
2	51% [48.7%, 53.6%]	13% [11.8%, 15.2%]
3	53% [50.2%, 56.7%]	12% [10.4%, 13.1%]
4	52% [49.6%, 55.3%]	18% [16.0%, 20.5%]
5	53% [51.2%, 54.9%]	39% [37.5%, 41.1%]
6	51% [49.4%, 53.4%]	41% [38.7%, 43.9%]
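The headline numbers can be re-derived from the iteration table with a few lines of code. The scores below are the rounded percentages from the table; the helper name and dictionary layout are illustrative only.

```python
from typing import Dict, List

# Rounded territory percentages per iteration, copied from the table above.
ITERATION_SCORES: Dict[str, List[int]] = {
    "GPT-5.5": [50, 51, 53, 52, 53, 51],
    "Claude Opus 4.7": [24, 13, 12, 18, 39, 41],
}


def summarize(scores: List[int]) -> Dict[str, int]:
    """Best score, first iteration reaching it, latest score, net change."""
    best = max(scores)
    return {
        "best": best,
        "best_iter": scores.index(best) + 1,  # first iteration that hits the best
        "latest": scores[-1],
        "net_change": scores[-1] - scores[0],
    }


if __name__ == "__main__":
    for model, scores in ITERATION_SCORES.items():
        print(model, summarize(scores))
```

GPT-5.5 ties its best score of 53% at iterations 3 and 5, and taking the first occurrence matches the reported iteration 3. Claude Opus 4.7 peaks at 41% on iteration 6 after dipping to 12% at iteration 3. The rounded best-score gap is 12 points, consistent with the +12.2% bootstrap estimate computed from unrounded scores.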

Reproduction

ANTHROPIC_API_KEY=... npm --prefix ../gameval run eval -- \
  --model claude-opus-4-7@anthropic \
  --iterations 6 --games-per-iter 25 --mode self-play \
  --seed 424242 --plateau-patience 99 \
  --output ../gameval/.devin-claude-opus47-selfplay-6iter.json

OPENAI_API_KEY=... npm --prefix ../gameval run eval -- \
  --model gpt-5.5-2026-04-23@openai \
  --iterations 6 --games-per-iter 25 --mode self-play \
  --seed 424242 --reasoning-effort high --plateau-patience 99 \
  --output ../gameval/.devin-gpt55-selfplay-high-6iter.json

node scripts/update-arena-war-results.mjs \
  --eval ../gameval/.devin-claude-opus47-selfplay-6iter.json \
  --eval ../gameval/.devin-gpt55-selfplay-high-6iter.json \
  --gameval-root ../gameval

What this does not prove

This does not prove GPT-5.5 is better at all coding tasks, all games, or all adversarial settings. It shows a clean win on this benchmark's self-play protocol: iterative improvement of one spatial territory algorithm under a fixed scoring loop.