AI evals
Benchmarks and artifacts I publish as first-party sections on dicefujiwara.com. Each eval keeps the human narrative, the live inspection surface, and the agent-readable context together.
Arena War
A reproducible LLM-vs-LLM coding benchmark where models iteratively write JavaScript territory algorithms and improve through competitive feedback.
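To make the contest shape concrete, here is a minimal sketch of the kind of JavaScript territory algorithm a model might submit. The function name, grid representation, and move format are illustrative assumptions; the benchmark's actual harness API is not documented in this section.

```js
// Hypothetical agent interface: every name here is an assumption, not
// the Arena War harness's published API. The agent receives a square
// grid of cell owners (0 = empty, 1 = me, 2 = opponent) and returns
// the empty cell it wants to claim this turn.
function chooseMove(grid, me) {
  const size = grid.length;
  let best = null;
  let bestScore = -Infinity;
  for (let y = 0; y < size; y++) {
    for (let x = 0; x < size; x++) {
      if (grid[y][x] !== 0) continue; // only empty cells are claimable
      // Greedy heuristic: prefer empty cells adjacent to territory we
      // already own, so claims grow as one connected region.
      let score = 0;
      for (const [dx, dy] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
        const nx = x + dx, ny = y + dy;
        if (nx >= 0 && ny >= 0 && nx < size && ny < size) {
          if (grid[ny][nx] === me) score += 2;     // extend our region
          else if (grid[ny][nx] === 0) score += 1; // keep options open
        }
      }
      if (score > bestScore) {
        bestScore = score;
        best = { x, y };
      }
    }
  }
  return best; // null only if the board is full
}
```

The competitive-feedback loop then replays games like this one, feeds the results back to each model, and asks for an improved algorithm in the next iteration.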
| Metric | Value | Notes |
| --- | --- | --- |
| Canonical design | 2×2×2 | mode × reasoning × model |
| Games per iteration | 25 | standard frontier-comparison n |
| Run seed | 424242 | deterministic per-game seed derivation |
| Public sample best | 46% | claude-opus-4-7, adversarial high, iter 4 |
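The "deterministic per-game seed derivation" row is worth a sketch: one common way to expand a single run seed into stable per-game seeds is to hash the run seed together with the game index, so every rerun replays identical boards. The splitmix-style mixer below is an assumption about the technique, not the benchmark's published code.

```js
// Sketch of deterministic per-game seed derivation, assuming a
// splitmix-style integer mixer; the benchmark's actual derivation
// scheme is not specified in this section.
function gameSeed(runSeed, gameIndex) {
  let h = (runSeed ^ Math.imul(gameIndex, 0x9e3779b1)) >>> 0; // mix in the index
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b) >>> 0;
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35) >>> 0;
  return (h ^ (h >>> 16)) >>> 0; // 32-bit unsigned per-game seed
}

// With runSeed = 424242 and 25 games per iteration, every rerun
// derives the same 25 seeds, so each game is reproducible.
const seeds = Array.from({ length: 25 }, (_, i) => gameSeed(424242, i));
```

Deriving seeds this way keeps the 2×2×2 conditions comparable: every mode × reasoning × model cell sees the same 25 boards, so score differences reflect the algorithms rather than board luck.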