AI eval · arena-war-eval-v0.3.3
Arena War
A reproducible LLM-vs-LLM coding benchmark where models iteratively write JavaScript territory algorithms and improve through competitive feedback.
Claim
Measures a model's ability to iteratively improve a spatial territory algorithm through adversarial competition feedback.
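The iterate-and-improve loop behind this claim can be sketched as follows. This is a conceptual outline, not the harness's real API: `askModel` and `playGames` are hypothetical stand-ins for the model call and the game engine.

```javascript
// Hypothetical sketch of the iterative-improvement loop: each iteration,
// the model writes (or revises) a JavaScript strategy, the strategy plays
// a batch of games, and the competitive results feed the next prompt.
function runCell(askModel, playGames, iterations, gamesPerIter) {
  const history = [];
  let feedback = null; // first iteration has no prior results
  for (let iter = 0; iter < iterations; iter++) {
    const source = askModel(feedback); // model emits algorithm source
    const results = playGames(source, gamesPerIter); // e.g. win share
    history.push({ iter, results });
    feedback = results; // competitive feedback closes the loop
  }
  return history;
}
```

The learning curves on the dashboard are, in essence, `history` plotted per cell.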
- Canonical design: 2×2×2 (mode × reasoning × model)
- Games per iteration: 25 (standard frontier comparison n)
- Run seed: 424242 (deterministic per-game seed derivation)
- Public sample best: 46% (claude-opus-4-7, adversarial high, iter 4)
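Deterministic per-game seed derivation might look like the sketch below. The mixing scheme (splitmix32) and the `(iteration, gameIndex)` combination are assumptions for illustration; the actual derivation is internal to the harness.

```javascript
// Hypothetical sketch: derive a stable per-game seed from the run seed
// (424242) plus the iteration and game index, so every one of the 25
// games per iteration replays identically. splitmix32 is just a
// well-known 32-bit mixer, not necessarily what the harness uses.
function splitmix32(seed) {
  let s = seed >>> 0;
  return function () {
    s = (s + 0x9e3779b9) >>> 0;
    let z = s;
    z = Math.imul(z ^ (z >>> 16), 0x21f0aaad);
    z = Math.imul(z ^ (z >>> 15), 0x735a2d97);
    return (z ^ (z >>> 15)) >>> 0;
  };
}

function gameSeed(runSeed, iteration, gameIndex) {
  // Fold (iteration, gameIndex) into the run seed so seeds are unique
  // per game but fully reproducible from the single published run seed.
  const next = splitmix32(runSeed ^ (iteration * 1000 + gameIndex));
  return next();
}
```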
- Dashboard: learning curves, leaderboard, held-out reference results, head-to-head matrix, and a live mini replay.
- Sandbox: interactive arena for replaying baseline strategies or model-generated algorithms from the bundled sample run.
- Writeup: research-style essay on the canonical frontier runs and the reasoning-effort finding.
Load-bearing findings
- The published writeup documents a 57% gpt-5.4 self-play high run; the bundled dashboard sample is the adversarial high cell.
- Matched-lobby H2H beat cross-lobby pairwise as the clearer signal when frontier models were close.
- High reasoning effort introduced extraction failures on long iterative prompts, making token budget part of the result.
- HeldOutReference-v1 anchors comparisons because the reference source is never exposed to prompts.
Protocol
- Mode cells: self-play and adversarial.
- Reasoning cells: default and high.
- Models: claude-opus-4-7 and gpt-5.4-2026-03-05.
- Baseline opponents: Greedy BFS, Diagonal Spiral, and Density Wave.
- Current public sample: adversarial, reasoning=high, n=25, seed 424242.
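For intuition about the baseline opponents, a "Greedy BFS" strategy can be sketched as a breadth-first search toward the nearest unclaimed cell. The grid encoding (0 = unclaimed) and move format are illustrative assumptions; the bundled baseline's exact rules (tie-breaking, walls, simultaneous moves) live in the engine.

```javascript
// Hypothetical sketch of a Greedy BFS baseline: BFS outward from the
// bot's cell, return the first step on a shortest path to the nearest
// unclaimed cell (grid value 0), or null if the board is fully claimed.
function greedyBfsMove(grid, start) {
  const rows = grid.length, cols = grid[0].length;
  const dirs = [[-1, 0], [1, 0], [0, -1], [0, 1]];
  const seen = new Set([`${start.r},${start.c}`]);
  // Each queue entry remembers the first step taken out of `start`.
  const queue = [{ r: start.r, c: start.c, first: null }];
  while (queue.length) {
    const { r, c, first } = queue.shift();
    if (grid[r][c] === 0 && first) return first; // nearest unclaimed cell
    for (const [dr, dc] of dirs) {
      const nr = r + dr, nc = c + dc;
      if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
      const key = `${nr},${nc}`;
      if (seen.has(key)) continue;
      seen.add(key);
      queue.push({ r: nr, c: nc, first: first ?? [dr, dc] });
    }
  }
  return null; // no unclaimed cells reachable
}
```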
For agents
- Do not overclaim general coding ability; this benchmark is scoped to iterative spatial algorithm improvement under adversarial feedback.
- Prefer matched-lobby head-to-head reporting when pairwise CIs touch zero and frontier scores are close.
- Treat max_ticks as an engine livelock bug; board_full and stalemate are legitimate outcome modes.
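Matched-lobby head-to-head reporting, as recommended above, can be sketched as scoring only the games whose lobby seed both models played, so lobby difficulty cancels out. The `seed -> score` map shape is an assumption for illustration.

```javascript
// Hypothetical sketch of matched-lobby H2H: compare two models only on
// lobby seeds they both played; unmatched lobbies are excluded rather
// than averaged, which is what makes the signal clean when scores are
// close. `resultsA`/`resultsB` map lobby seed -> per-game score.
function matchedLobbyH2H(resultsA, resultsB) {
  let wins = 0, losses = 0, ties = 0;
  for (const seed of Object.keys(resultsA)) {
    if (!(seed in resultsB)) continue; // unmatched lobby: excluded
    if (resultsA[seed] > resultsB[seed]) wins++;
    else if (resultsA[seed] < resultsB[seed]) losses++;
    else ties++;
  }
  return { wins, losses, ties };
}
```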