GPT-5.5 is the new clean frontier on Arena War
Same seed, same six-iteration budget, same 25-game scoring loop. We reran the frontier comparison for GPT-5.5 and Claude Opus 4.7 with no extraction failures in the published run.
DashboardTL;DR
- GPT-5.5 reached 53% territory at iteration 3 (CI [50.2%, 56.7%]), versus 41% for Claude Opus 4.7 at iteration 6.
- The best-iteration pairwise bootstrap favors GPT-5.5 by +12.2% over Claude Opus 4.7, with CI
[8.1%, 16.2%]. - The matched-lobby best-vs-best replay is stronger evidence: GPT-5.5 wins 23-2 across 25 seed-matched games.
- Held-out reference anchor: GPT-5.5 is +34.7% over the frozen reference; Claude is +15.4%.
- An adversarial GPT-5.5 attempt failed extraction at iteration 4, so it is intentionally excluded. The published artifact is the clean self-play comparison.
The benchmark in one paragraph
Arena War measures a narrow capability: whether a model can iteratively improve a spatial territory algorithm after receiving competitive feedback. Each iteration asks the model to write one JavaScript function, runs that function in 25 seeded games against fixed baselines, then feeds the model its score history and current winner source. The claim is not general coding ability. The claim is repeated competitive algorithm improvement under a reproducible protocol.
Protocol
| Mode | self-play |
|---|---|
| Models | GPT-5.5, Claude Opus 4.7 |
| Iteration budget | 6 iterations per model |
| Games per iteration | 25 |
| Seed | 424242 |
| Grid | 60×60, 4 players |
| OpenAI reasoning | high for GPT-5.5; Anthropic has no equivalent runtime flag |
| Plateau | early-stop disabled with --plateau-patience 99 so both models produce equal-length traces |
Results
| Model | Best score | Best CI95 | Latest score | Extraction failures |
|---|---|---|---|---|
| GPT-5.5 | 53% · iter 3 | [50.2%, 56.7%] | 51% · iter 6 | 0 |
| Claude Opus 4.7 | 41% · iter 6 | [38.7%, 43.9%] | 41% · iter 6 | 0 |
Where the frontier comparison lands
The dashboard's cross-lobby pairwise bootstrap and matched-lobby H2H agree this time. GPT-5.5's best iteration is statistically higher than Claude's best iteration, and the direct replay puts both algorithms in the same four-player lobby with the same baselines, removing opponent-draw variance.
Reliability
The reason this replaces the prior failure-case story is that the run is clean: every published iteration extracted a named function and scored. Score regressions still appear as normal learning-curve information, but there are no missing data points hidden behind syntax failures.
Held-out reference
The reference algorithm is frozen and never exposed in prompts or in the output JSON. It is not the benchmark target; it is an anchor that makes cross-release movement easier to interpret.
Iteration table
| Iter | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|
| 1 | 50% [45.3%, 54.7%] | 24% [18.7%, 28.5%] |
| 2 | 51% [48.7%, 53.6%] | 13% [11.8%, 15.2%] |
| 3 | 53% [50.2%, 56.7%] | 12% [10.4%, 13.1%] |
| 4 | 52% [49.6%, 55.3%] | 18% [16%, 20.5%] |
| 5 | 53% [51.2%, 54.9%] | 39% [37.5%, 41.1%] |
| 6 | 51% [49.4%, 53.4%] | 41% [38.7%, 43.9%] |
Reproduction
ANTHROPIC_API_KEY=... npm --prefix ../gameval run eval -- \
--model claude-opus-4-7@anthropic \
--iterations 6 --games-per-iter 25 --mode self-play \
--seed 424242 --plateau-patience 99 \
--output ../gameval/.devin-claude-opus47-selfplay-6iter.json
OPENAI_API_KEY=... npm --prefix ../gameval run eval -- \
--model gpt-5.5-2026-04-23@openai \
--iterations 6 --games-per-iter 25 --mode self-play \
--seed 424242 --reasoning-effort high --plateau-patience 99 \
--output ../gameval/.devin-gpt55-selfplay-high-6iter.json
node scripts/update-arena-war-results.mjs \
--eval ../gameval/.devin-claude-opus47-selfplay-6iter.json \
--eval ../gameval/.devin-gpt55-selfplay-high-6iter.json \
--gameval-root ../gameval
What this does not prove
This does not prove GPT-5.5 is better at all coding tasks, all games, or all adversarial settings. It shows a clean win on this benchmark's self-play protocol: iterative improvement of one spatial territory algorithm under a fixed scoring loop.