AI eval · arena-war-eval-v0.3.3
Arena War
A reproducible LLM-vs-LLM coding benchmark where models iteratively write JavaScript territory algorithms and improve through competitive feedback.
Claim
Measures a model's ability to iteratively improve a spatial territory algorithm through adversarial competition feedback.
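The iterate-and-improve loop behind this claim can be sketched as follows. This is a conceptual outline, not the harness's real API: `askModel` and `playGames` are hypothetical stand-ins for the model call and the game engine.

```javascript
// Hypothetical sketch of the iterative-improvement loop: each iteration,
// the model writes (or revises) a JavaScript strategy, the strategy plays
// a batch of games, and the competitive results feed the next prompt.
function runCell(askModel, playGames, iterations, gamesPerIter) {
  const history = [];
  let feedback = null; // first iteration has no prior results
  for (let iter = 0; iter < iterations; iter++) {
    const source = askModel(feedback); // model emits algorithm source
    const results = playGames(source, gamesPerIter); // e.g. win share
    history.push({ iter, results });
    feedback = results; // competitive feedback closes the loop
  }
  return history;
}
```

The learning curves on the dashboard are, in essence, `history` plotted per cell.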
- Canonical design: 2×2×2 (mode × reasoning × model)
- Games per iteration: 25 (standard frontier comparison n)
- Run seed: 424242 (deterministic per-game seed derivation)
- Public sample best: 46% (claude-opus-4-7, adversarial high, iter 4)
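Deterministic per-game seed derivation might look like the sketch below. The mixing scheme (splitmix32) and the `(iteration, gameIndex)` combination are assumptions for illustration; the actual derivation is internal to the harness.

```javascript
// Hypothetical sketch: derive a stable per-game seed from the run seed
// (424242) plus the iteration and game index, so every one of the 25
// games per iteration replays identically. splitmix32 is just a
// well-known 32-bit mixer, not necessarily what the harness uses.
function splitmix32(seed) {
  let s = seed >>> 0;
  return function () {
    s = (s + 0x9e3779b9) >>> 0;
    let z = s;
    z = Math.imul(z ^ (z >>> 16), 0x21f0aaad);
    z = Math.imul(z ^ (z >>> 15), 0x735a2d97);
    return (z ^ (z >>> 15)) >>> 0;
  };
}

function gameSeed(runSeed, iteration, gameIndex) {
  // Fold (iteration, gameIndex) into the run seed so seeds are unique
  // per game but fully reproducible from the single published run seed.
  const next = splitmix32(runSeed ^ (iteration * 1000 + gameIndex));
  return next();
}
```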
- Dashboard: learning curves, leaderboard, held-out reference results, head-to-head matrix, and a live mini replay.
- Sandbox: interactive arena for replaying baseline strategies or model-generated algorithms from the bundled sample run.
- Writeup: research-style essay on the canonical frontier runs and the reasoning-effort finding.
Load-bearing findings
- The published writeup documents a 57% gpt-5.4 self-play high run; the bundled dashboard sample is the adversarial high cell.
- Matched-lobby H2H beat cross-lobby pairwise as the clearer signal when frontier models were close.
- High reasoning effort introduced extraction failures on long iterative prompts, making token budget part of the result.
- HeldOutReference-v1 anchors comparisons because the reference source is never exposed to prompts.
Protocol
- Mode cells: self-play and adversarial.
- Reasoning cells: default and high.
- Models: claude-opus-4-7 and gpt-5.4-2026-03-05.
- Baseline opponents: Greedy BFS, Diagonal Spiral, and Density Wave.
- Current public sample: adversarial, reasoning=high, n=25, seed 424242.
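For intuition about the baseline opponents, a "Greedy BFS" strategy can be sketched as a breadth-first search toward the nearest unclaimed cell. The grid encoding (0 = unclaimed) and move format are illustrative assumptions; the bundled baseline's exact rules (tie-breaking, walls, simultaneous moves) live in the engine.

```javascript
// Hypothetical sketch of a Greedy BFS baseline: BFS outward from the
// bot's cell, return the first step on a shortest path to the nearest
// unclaimed cell (grid value 0), or null if the board is fully claimed.
function greedyBfsMove(grid, start) {
  const rows = grid.length, cols = grid[0].length;
  const dirs = [[-1, 0], [1, 0], [0, -1], [0, 1]];
  const seen = new Set([`${start.r},${start.c}`]);
  // Each queue entry remembers the first step taken out of `start`.
  const queue = [{ r: start.r, c: start.c, first: null }];
  while (queue.length) {
    const { r, c, first } = queue.shift();
    if (grid[r][c] === 0 && first) return first; // nearest unclaimed cell
    for (const [dr, dc] of dirs) {
      const nr = r + dr, nc = c + dc;
      if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
      const key = `${nr},${nc}`;
      if (seen.has(key)) continue;
      seen.add(key);
      queue.push({ r: nr, c: nc, first: first ?? [dr, dc] });
    }
  }
  return null; // no unclaimed cells reachable
}
```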
For agents
- Do not overclaim general coding ability; this benchmark is scoped to iterative spatial algorithm improvement under adversarial feedback.
- Prefer matched-lobby head-to-head reporting when pairwise CIs touch zero and frontier scores are close.
- Treat max_ticks as an engine livelock bug; board_full and stalemate are legitimate outcome modes.
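Matched-lobby head-to-head reporting, as recommended above, can be sketched as scoring only the games whose lobby seed both models played, so lobby difficulty cancels out. The `seed -> score` map shape is an assumption for illustration.

```javascript
// Hypothetical sketch of matched-lobby H2H: compare two models only on
// lobby seeds they both played; unmatched lobbies are excluded rather
// than averaged, which is what makes the signal clean when scores are
// close. `resultsA`/`resultsB` map lobby seed -> per-game score.
function matchedLobbyH2H(resultsA, resultsB) {
  let wins = 0, losses = 0, ties = 0;
  for (const seed of Object.keys(resultsA)) {
    if (!(seed in resultsB)) continue; // unmatched lobby: excluded
    if (resultsA[seed] > resultsB[seed]) wins++;
    else if (resultsA[seed] < resultsB[seed]) losses++;
    else ties++;
  }
  return { wins, losses, ties };
}
```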