AI eval · arena-war-eval-v0.3.3

Arena War

A reproducible LLM-vs-LLM coding benchmark where models iteratively write JavaScript territory algorithms and improve through competitive feedback.

Claim

Measures a model's ability to iteratively improve a spatial territory algorithm through adversarial competition feedback.

Canonical design
2×2×2

mode × reasoning × model

Games per iteration
25

n for the standard frontier comparison

Run seed
424242

deterministic per-game seed derivation

Public sample best
46%

claude-opus-4-7 adversarial high, iter 4

Dashboard

Learning curves, leaderboard, held-out reference results, head-to-head matrix, and live mini replay.

Sandbox

Interactive arena for replaying baseline strategies or model-generated algorithms from the bundled sample run.
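The sandbox replays algorithms written against the arena's JavaScript API, which is not documented on this page. As a minimal sketch, assuming a hypothetical interface where the engine calls `nextMove(state)` each tick with a grid of cell owners and expects a target cell to claim (the state shape, field names, and return contract are all assumptions, not the real API):

```javascript
// Hypothetical strategy interface: the real sandbox API is not documented here.
// Assumes state = { grid, playerId, position }, where grid[y][x] is an owner id
// or null, and the engine expects a { x, y } cell to claim (or null if none).
function greedyNearestStrategy(state) {
  const { grid, position } = state; // assumed state shape
  let best = null;
  let bestDist = Infinity;
  for (let y = 0; y < grid.length; y++) {
    for (let x = 0; x < grid[y].length; x++) {
      if (grid[y][x] !== null) continue; // skip already-claimed cells
      const dist = Math.abs(x - position.x) + Math.abs(y - position.y);
      if (dist < bestDist) {
        bestDist = dist;
        best = { x, y };
      }
    }
  }
  return best; // null when the board is full
}
```

A model-generated entry would presumably follow the same shape, with the iterative loop rewriting the body of the strategy between competitions.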

Writeup

Research-style essay on the canonical frontier runs and the reasoning-effort finding.

Load-bearing findings

  • The published writeup documents a 57% gpt-5.4 self-play high run; the bundled dashboard sample covers the adversarial high cell, where the best is 46%.
  • Matched-lobby head-to-head comparison gave a clearer signal than cross-lobby pairwise scoring when frontier models were close.
  • High reasoning effort introduced extraction failures on long iterative prompts, making token budget part of the result.
  • HeldOutReference-v1 anchors comparisons because the reference source is never exposed to prompts.

Protocol

  • Mode cells: self-play and adversarial.
  • Reasoning cells: default and high.
  • Models: claude-opus-4-7 and gpt-5.4-2026-03-05.
  • Baseline opponents: Greedy BFS, Diagonal Spiral, and Density Wave.
  • Current public sample: adversarial, reasoning=high, n=25, seed 424242.
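The per-game seed derivation is described only as deterministic. One common way to implement such a scheme is to mix the run seed with the game index through an integer hash, sketched below under that assumption; the engine's actual hash function is not published here:

```javascript
// Hypothetical derivation: the actual scheme is not published on this page.
// Assumes each game's seed is a deterministic hash of the run seed and the
// game index, so the n=25 games replay identically for run seed 424242.
function perGameSeed(runSeed, gameIndex) {
  // splitmix32-style finalizer: a common mixing choice, not necessarily the engine's.
  let h = (runSeed ^ (gameIndex * 0x9e3779b9)) >>> 0;
  h = Math.imul(h ^ (h >>> 16), 0x21f0aaad) >>> 0;
  h = Math.imul(h ^ (h >>> 15), 0x735a2d97) >>> 0;
  return (h ^ (h >>> 15)) >>> 0; // unsigned 32-bit seed
}
```

Because every step is a bijection on 32-bit integers, distinct game indices under the same run seed yield distinct per-game seeds.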

For agents

  • Do not overclaim general coding ability; this benchmark is scoped to iterative spatial algorithm improvement under adversarial feedback.
  • Prefer matched-lobby head-to-head reporting when pairwise CIs touch zero and frontier scores are close.
  • Treat max_ticks as an engine livelock bug; board_full and stalemate are legitimate outcome modes.
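The "CIs touch zero" guidance above can be expressed as a simple decision rule. The sketch below is illustrative only: the 95% normal approximation and the paired-per-game score differences are assumptions, not the benchmark's published statistics.

```javascript
// Illustrative decision rule, not the benchmark's exact statistics.
// Given per-game score differences between two models, approximate a 95% CI
// on the mean difference; if the CI spans zero, the pairwise signal is
// inconclusive and matched-lobby head-to-head reporting is preferred.
function preferHeadToHead(scoreDiffs) {
  const n = scoreDiffs.length;
  const mean = scoreDiffs.reduce((a, b) => a + b, 0) / n;
  const variance =
    scoreDiffs.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n); // normal approximation
  return mean - halfWidth <= 0 && mean + halfWidth >= 0;
}
```

With n=25 games per iteration, close frontier scores will often trip this condition, which is why the matched-lobby matrix is the recommended fallback.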
