Mean territory percentage by eval iteration per model. Shaded bands are ±1 std dev across games. Hollow × markers are iterations that failed before scoring, pinned to the previous scored result so stalled development is still visible. GPT-5.4 iteration labels are shown on the curve because its later attempts failed before producing new scores. Where a held-out reference algorithm ran, its mean is drawn as a dashed horizontal line — models below it failed to beat a hand-written algorithm whose source they never saw.
Figure 1: Read the trajectory, not the absolute peak. A model with a steeper slope under the same protocol is the stronger learner. A flat line at a high value is still a good result — it means the model reached peak fast. The reference line is the trust anchor: it is frozen across eval versions and never exposed to the models.
Ordered by best score; ties are broken in favor of the earlier best iteration (see the sketch below the table). Click a row for net improvement, successful iterations, and stop reason.
| # | Model | Best score | Iter | Replay |
|---|---|---|---|---|
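
A minimal TypeScript sketch of that ordering rule; the `Row` shape and field names are illustrative assumptions, not the eval's actual schema:

```ts
interface Row {
  model: string;
  bestScore: number; // best mean territory % across iterations (assumed field)
  bestIter: number;  // iteration index that produced bestScore (assumed field)
}

// Higher best score first; on an exact tie, the earlier best iteration ranks above.
function leaderboardOrder(rows: Row[]): Row[] {
  return [...rows].sort(
    (a, b) => b.bestScore - a.bestScore || a.bestIter - b.bestIter
  );
}
```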
The reference algorithm's source is never included in any prompt or in eval-results.json. It is a frozen anchor that lets us compare across eval versions without models ever seeing it. A model "losing to the reference" with a CI that excludes zero is the most trustworthy single claim this eval produces.
Every model shown above was evaluated under the same run configuration. This strip is what makes the comparison apples-to-apples.
A good result is primarily relative: a strong model finishes with a better score than its peers, improves faster across iterations, or both under the same protocol.
All models in this run are judged under the same shared setup: same baseline opponents, same iteration budget, and the same scoring loop.
The learning curve is primary. Final score matters, but how a model improves after feedback is the benchmark's core capability signal.
The arena replay is a follow-on inspection tool. Use it after you understand the benchmark result, not instead of it.
Arena War measures iterative algorithm improvement under adversarial pressure: a model must read the task, write code, learn from reward signals, and produce stronger follow-up algorithms over repeated rounds. It does not claim to measure general coding ability from one chart. For deeper methodology, see benchmark-methodology.md.
Arena War is a 4-player territorial expansion game on a circular grid (default 60×60, ~2724 in-play cells). Each tick, every algorithm returns a frontier of up to max(1, floor(N/8)) candidate cells (7 cells/tick at N=60); the engine enforces claim limits and resolves conflicts (same-tick collisions stay unclaimed). Games end in one of three distinct modes, and every game carries its terminationReason in the output JSON (schemaVersion ≥ 7). The exact grid size for this run is shown in the Shared Protocol strip above.
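
A minimal sketch of the frontier cap and the same-tick collision rule, assuming illustrative function names and a flat numeric cell key (neither is the engine's actual API):

```ts
// Per-tick frontier cap: max(1, floor(N/8)) candidate cells.
// At the default N = 60 this yields 7 cells per tick.
function frontierCap(n: number): number {
  return Math.max(1, Math.floor(n / 8));
}

// Same-tick collision rule: a cell requested by more than one
// player in the same tick stays unclaimed.
function resolveTick(claims: Map<number, string[]>): Map<number, string> {
  const resolved = new Map<number, string>();
  for (const [cell, players] of claims) {
    if (players.length === 1) resolved.set(cell, players[0]);
  }
  return resolved;
}
```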
Every reachable in-circle cell has been claimed. Territory percentages sum to ~100%. The winner is simply the player who claimed the most territory before the board filled.
No player claimed any cell during a full tick. Visible unclaimed territory is expected and legitimate; it's an informative signal of mutually blocking adversarial play. The winner is whoever holds the largest territory at arrest time, even if that's only 21% of the full board.
The node eval runner caps runaway games at size × size ticks. Normal play terminates via board_full or stalemate well before this, so hitting max_ticks in production data signals a livelock bug worth investigating.
Territory % alone can't distinguish a 21% board_full (tight 4-way split) from a 21% stalemate (no-progress arrest) from a 21% max_ticks (livelock bug). The joint distribution of (terminationReason, pct, ticks) is the real signal. Inspect the per-iteration "Termination mix" row below to see how often each mode fired.
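
One way to inspect that joint distribution from per-game results; the `GameResult` fields mirror the output JSON described above, but treat the exact shape as an assumption:

```ts
type TerminationReason = "board_full" | "stalemate" | "max_ticks";

interface GameResult {
  terminationReason: TerminationReason; // present for schemaVersion >= 7
  pct: number;   // winner's territory % (assumed field name)
  ticks: number; // game length in ticks (assumed field name)
}

// Termination mix: count plus mean (pct, ticks) per termination mode.
function terminationMix(games: GameResult[]) {
  const mix = new Map<TerminationReason, { n: number; pct: number; ticks: number }>();
  for (const g of games) {
    const m = mix.get(g.terminationReason) ?? { n: 0, pct: 0, ticks: 0 };
    m.n += 1;
    m.pct += g.pct;
    m.ticks += g.ticks;
    mix.set(g.terminationReason, m);
  }
  return [...mix].map(([reason, m]) => ({
    reason,
    count: m.n,
    meanPct: m.pct / m.n,
    meanTicks: m.ticks / m.n,
  }));
}
```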
Full spec, claim-resolution rules, and stalemate-cause taxonomy in benchmark-methodology.md §9.
Bootstrap comparison of each model pair's best iteration scores (4000 resamples, α = 0.05). A CI that excludes zero means the pair is statistically separable at this run's sample size.
Row model's best algorithm vs column model's best algorithm across seed-matched games. Each cell shows row wins – column wins (draws in parentheses) and the mean territory Δ with a 95% CI.
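
A minimal sketch of the percentile bootstrap behind these cells, using the stated parameters; the per-game delta array (row territory minus column territory on the same seed) is assumed to be extracted upstream:

```ts
// Percentile bootstrap CI on seed-matched per-game territory deltas.
function bootstrapCI(
  deltas: number[],
  resamples = 4000,
  alpha = 0.05
): [number, number] {
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(Math.random() * deltas.length)];
    }
    means.push(sum / deltas.length);
  }
  means.sort((a, b) => a - b);
  return [
    means[Math.floor((alpha / 2) * resamples)],
    means[Math.ceil((1 - alpha / 2) * resamples) - 1],
  ];
}

// A pair is statistically separable when its CI excludes zero.
const separable = ([lo, hi]: [number, number]) => lo > 0 || hi < 0;
```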
How often each model's iterations hit an annotated failure mode: extraction errors, runtime crashes, out-of-bounds exploits, timeouts, regressions, and plateau stalls. Use this to distinguish models that scored well from models that scored well and stayed stable.
Counts are per-iteration occurrences across this run. An iteration may carry multiple flags (e.g. a regression that also stalled).
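
Because one iteration can carry several flags, counting is per occurrence rather than per iteration; a sketch, with flag names following the list above and the iteration shape assumed:

```ts
type FailureFlag =
  | "extraction_error" | "runtime_crash" | "out_of_bounds"
  | "timeout" | "regression" | "plateau_stall";

// Per-flag occurrence counts; one iteration may increment several flags.
function flagCounts(iterations: { flags: FailureFlag[] }[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const it of iterations) {
    for (const flag of it.flags) {
      counts[flag] = (counts[flag] ?? 0) + 1;
    }
  }
  return counts;
}
```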
The selected iteration's algorithm sits in seat 0 (player 1) against the shared baseline opponents. Game auto-loops; click Pause to freeze, or Open in full arena to inspect a single game end-to-end.
The benchmark comes first. After reviewing the leaderboard and best iterations here, use the sandbox arena to replay the strongest algorithms against each other.