AI eval · arena-war-eval-v0.3.4
Arena War
A reproducible LLM-vs-LLM coding benchmark where models iteratively write JavaScript territory algorithms and improve through competitive feedback.
Learning curves, leaderboard, held-out reference results, head-to-head matrix, and live mini replay.
Sandbox: Interactive arena for replaying baseline strategies or model-generated algorithms from the bundled sample run.
Writeup: Research-style essay on the clean GPT-5.5 vs Claude Opus frontier comparison.
Methodology
Arena War measures a narrow capability: whether a model can iteratively improve a spatial territory algorithm after receiving competitive feedback. Each iteration asks the model to write one JavaScript function, runs that function in 25 seeded games against fixed baselines, then feeds the model its score history and current winner source. The claim is not general coding ability; it is repeated competitive algorithm improvement under a reproducible protocol.
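The iteration loop described above can be sketched roughly as follows. This is a minimal illustration, not the Arena War harness: every name here (`makeRng`, `playGame`, `runIteration`, `buildFeedback`) is hypothetical, the "game" is a toy stand-in for the real engine, and reading "winner source" as the source of the best-scoring submission so far is an assumption.

```javascript
// Hypothetical sketch: none of these names come from the Arena War codebase.
const GAMES_PER_ITERATION = 25; // from the methodology: 25 seeded games per iteration

// Small deterministic PRNG so a given seed always replays the same game.
function makeRng(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Toy stand-in for the real engine: each cell goes to the higher bidder.
// Returns the candidate's share of the board, a number in [0, 1].
function playGame(candidate, baseline, seed, cells = 20) {
  const rng = makeRng(seed);
  let owned = 0;
  for (let cell = 0; cell < cells; cell++) {
    const noise = rng();
    if (candidate(cell, noise) > baseline(cell, noise)) owned++;
  }
  return owned / cells;
}

// One iteration: the candidate faces the same fixed baselines on the same seeds.
function runIteration(candidate, baselines, baseSeed) {
  let total = 0;
  for (let g = 0; g < GAMES_PER_ITERATION; g++) {
    const baseline = baselines[g % baselines.length];
    total += playGame(candidate, baseline, baseSeed + g);
  }
  return total / GAMES_PER_ITERATION;
}

// Feedback for the next prompt: score history plus the source of the
// best-scoring submission so far (an assumed reading of "winner source").
function buildFeedback(submissions) {
  const history = submissions.map((s) => s.score);
  const best = submissions.reduce((a, b) => (b.score > a.score ? b : a));
  return { history, winnerSource: best.source };
}

// Illustrative baselines and candidate; the real ones are not shown in this page.
const baselines = [(cell, noise) => noise, (cell, noise) => 0.5];
const candidateSource = "(cell, noise) => (cell % 2 === 0 ? 0.9 : 0.1)";
const candidate = (cell, noise) => (cell % 2 === 0 ? 0.9 : 0.1);

const score = runIteration(candidate, baselines, 1234);
const feedback = buildFeedback([{ source: candidateSource, score }]);
console.log(score, feedback.history.length);
```

Because the seeds and baselines are fixed, rerunning `runIteration` with the same arguments yields the same score, which is the property that makes iteration-over-iteration comparisons meaningful.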
Load-bearing findings
- GPT-5.5 reached 53% in a clean six-iteration self-play run, ahead of Claude Opus 4.7's 41% best score.
- Best-vs-best matched-lobby replay favored GPT-5.5 23-2 over Claude Opus 4.7.
- The bundled dashboard sample now removes the prior extraction-failure case and focuses on a clean frontier comparison.
- HeldOutReference-v1 remains the frozen anchor because the reference source is never exposed to prompts.
For agents
- Do not overclaim general coding ability; this benchmark is scoped to iterative spatial algorithm improvement under adversarial feedback.
- Prefer matched-lobby head-to-head reporting when pairwise confidence intervals (CIs) touch zero and frontier scores are close.
- Treat max_ticks as an engine livelock bug; board_full and stalemate are legitimate outcome modes.
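The outcome-mode guidance in the last item could be applied when post-processing game records, along these lines. This is a sketch under stated assumptions: the record shape (`{ seed, endReason }`), the function name `partitionByOutcome`, and the sample records are all hypothetical. Only the three outcome names `max_ticks`, `board_full`, and `stalemate` come from the text above.

```javascript
// Hypothetical sketch: the record shape and function name are assumptions.
const LEGITIMATE_END_REASONS = new Set(["board_full", "stalemate"]);

// Split game records into legitimate outcomes, suspected engine livelocks,
// and anything with an end reason this sketch does not recognize.
function partitionByOutcome(games) {
  const legitimate = [];
  const livelockBugs = [];
  const unrecognized = [];
  for (const game of games) {
    if (LEGITIMATE_END_REASONS.has(game.endReason)) {
      legitimate.push(game);
    } else if (game.endReason === "max_ticks") {
      // Per the guidance above, this signals an engine bug, not a result.
      livelockBugs.push(game);
    } else {
      unrecognized.push(game);
    }
  }
  return { legitimate, livelockBugs, unrecognized };
}

// Made-up records for illustration only; these are not benchmark data.
const sampleGames = [
  { seed: 1, endReason: "board_full" },
  { seed: 2, endReason: "stalemate" },
  { seed: 3, endReason: "max_ticks" },
];

const buckets = partitionByOutcome(sampleGames);
console.log(buckets.legitimate.length, buckets.livelockBugs.length);
```

Keeping the `max_ticks` games in their own bucket, rather than silently dropping them, leaves the seeds available for reproducing the livelock in the engine.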