Mean territory percentage by eval iteration per model. Shaded bands are ±1 std dev across games. Hollow × markers are iterations that failed before scoring, pinned to the previous scored result so stalled development is still visible. GPT-5.4 iteration labels are shown on the curve because its later attempts failed before producing new scores. Where a held-out reference algorithm ran, its mean is drawn as a dashed horizontal line — models below it failed to beat a hand-written algorithm whose source they never saw.
Figure 1: Read the trajectory, not the absolute peak. A model with a steeper slope under the same protocol is the stronger learner. A flat line at a high value is still a good result — it means the model reached peak fast. The reference line is the trust anchor: it is frozen across eval versions and never exposed to the models.
Ordered by best score; ties are broken in favor of the earlier best iteration (see the sketch below the table). Click a row for net improvement, successful iterations, and stop reason.
| # | Model | Best score | Iter | Replay |
|---|---|---|---|---|
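
A minimal TypeScript sketch of that ordering rule; the `Row` shape and field names are illustrative assumptions, not the eval's actual schema:

```ts
interface Row {
  model: string;
  bestScore: number; // best mean territory % across iterations (assumed field)
  bestIter: number;  // iteration index that produced bestScore (assumed field)
}

// Higher best score first; on an exact tie, the earlier best iteration ranks above.
function leaderboardOrder(rows: Row[]): Row[] {
  return [...rows].sort(
    (a, b) => b.bestScore - a.bestScore || a.bestIter - b.bestIter
  );
}
```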
The reference algorithm's source is never included in any prompt or in eval-results.json. It is a frozen anchor that lets us compare across eval versions without models ever seeing it. A model "losing to the reference" with a CI that excludes zero is the most trustworthy single claim this eval produces.
Every model shown above was evaluated under the same run configuration. This strip is what makes the comparison apples-to-apples.
A good result is primarily relative: a strong model finishes with a better score than its peers, improves faster across iterations, or both under the same protocol.
All models in this run are judged under the same shared setup: same baseline opponents, same iteration budget, and the same scoring loop.
The learning curve is primary. Final score matters, but how a model improves after feedback is the benchmark's core capability signal.
The arena replay is a follow-on inspection tool. Use it after you understand the benchmark result, not instead of it.
Arena War measures iterative algorithm improvement under adversarial pressure: a model must read the task, write code, learn from reward signals, and produce stronger follow-up algorithms over repeated rounds. It does not claim to measure general coding ability from one chart. For deeper methodology, see benchmark-methodology.md.
Arena War is a 4-player territorial expansion game on a circular grid (default 60×60, ~2724 in-play cells). Each tick, every algorithm returns a frontier of up to max(1, floor(N/8)) candidate cells (7 cells/tick at N=60); the engine enforces claim limits and resolves conflicts (same-tick collisions stay unclaimed). Games end in one of three distinct modes, and every game carries its terminationReason in the output JSON (schemaVersion ≥ 7). The exact grid size for this run is shown in the Shared Protocol strip above.
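
A minimal sketch of the frontier cap and the same-tick collision rule, assuming illustrative function names and a flat numeric cell key (neither is the engine's actual API):

```ts
// Per-tick frontier cap: max(1, floor(N/8)) candidate cells.
// At the default N = 60 this yields 7 cells per tick.
function frontierCap(n: number): number {
  return Math.max(1, Math.floor(n / 8));
}

// Same-tick collision rule: a cell requested by more than one
// player in the same tick stays unclaimed.
function resolveTick(claims: Map<number, string[]>): Map<number, string> {
  const resolved = new Map<number, string>();
  for (const [cell, players] of claims) {
    if (players.length === 1) resolved.set(cell, players[0]);
  }
  return resolved;
}
```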
Every reachable in-circle cell has been claimed. Territory percentages sum to ~100%. The winner is simply the player who claimed the most territory before the board filled.
No player claimed any cell during a full tick. Visible unclaimed territory is expected and legitimate; it's an informative signal of mutually blocking adversarial play. The winner is whoever holds the largest territory at arrest time, even if that's only 21% of the full board.
The node eval runner caps runaway games at size × size ticks. Normal play terminates via board_full or stalemate well before this, so hitting max_ticks in production data signals a livelock bug worth investigating.
Territory % alone can't distinguish a 21% board_full (tight 4-way split) from a 21% stalemate (no-progress arrest) from a 21% max_ticks (livelock bug). The joint distribution of (terminationReason, pct, ticks) is the real signal. Inspect the per-iteration "Termination mix" row below to see how often each mode fired.
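
One way to inspect that joint distribution from per-game results; the `GameResult` fields mirror the output JSON described above, but treat the exact shape as an assumption:

```ts
type TerminationReason = "board_full" | "stalemate" | "max_ticks";

interface GameResult {
  terminationReason: TerminationReason; // present for schemaVersion >= 7
  pct: number;   // winner's territory % (assumed field name)
  ticks: number; // game length in ticks (assumed field name)
}

// Termination mix: count plus mean (pct, ticks) per termination mode.
function terminationMix(games: GameResult[]) {
  const mix = new Map<TerminationReason, { n: number; pct: number; ticks: number }>();
  for (const g of games) {
    const m = mix.get(g.terminationReason) ?? { n: 0, pct: 0, ticks: 0 };
    m.n += 1;
    m.pct += g.pct;
    m.ticks += g.ticks;
    mix.set(g.terminationReason, m);
  }
  return [...mix].map(([reason, m]) => ({
    reason,
    count: m.n,
    meanPct: m.pct / m.n,
    meanTicks: m.ticks / m.n,
  }));
}
```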
Full spec, claim-resolution rules, and stalemate-cause taxonomy in benchmark-methodology.md §9.
Bootstrap comparison of each model pair's best iteration scores (4000 resamples, α = 0.05). A CI that excludes zero means the pair is statistically separable at this run's sample size.
Row model's best algorithm vs column model's best algorithm across seed-matched games. Each cell shows row wins – column wins (draws in parentheses) and the mean territory Δ with a 95% CI.
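
A minimal sketch of the percentile bootstrap behind these cells, using the stated parameters; the per-game delta array (row territory minus column territory on the same seed) is assumed to be extracted upstream:

```ts
// Percentile bootstrap CI on seed-matched per-game territory deltas.
function bootstrapCI(
  deltas: number[],
  resamples = 4000,
  alpha = 0.05
): [number, number] {
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(Math.random() * deltas.length)];
    }
    means.push(sum / deltas.length);
  }
  means.sort((a, b) => a - b);
  return [
    means[Math.floor((alpha / 2) * resamples)],
    means[Math.ceil((1 - alpha / 2) * resamples) - 1],
  ];
}

// A pair is statistically separable when its CI excludes zero.
const separable = ([lo, hi]: [number, number]) => lo > 0 || hi < 0;
```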
How often each model's iterations hit an annotated failure mode: extraction errors, runtime crashes, out-of-bounds exploits, timeouts, regressions, and plateau stalls. Use this to distinguish models that scored well from models that scored well and stayed stable.
Counts are per-iteration occurrences across this run. An iteration may carry multiple flags (e.g. a regression that also stalled).
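
Because one iteration can carry several flags, counting is per occurrence rather than per iteration; a sketch, with flag names following the list above and the iteration shape assumed:

```ts
type FailureFlag =
  | "extraction_error" | "runtime_crash" | "out_of_bounds"
  | "timeout" | "regression" | "plateau_stall";

// Per-flag occurrence counts; one iteration may increment several flags.
function flagCounts(iterations: { flags: FailureFlag[] }[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const it of iterations) {
    for (const flag of it.flags) {
      counts[flag] = (counts[flag] ?? 0) + 1;
    }
  }
  return counts;
}
```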
The selected iteration's algorithm sits in seat 0 (player 1) against the shared baseline opponents. Game auto-loops; click Pause to freeze, or Open in full arena to inspect a single game end-to-end.
The benchmark comes first. After reviewing the leaderboard and best iterations here, use the sandbox arena to replay the strongest algorithms against each other.