Evals

Benchmarks and artifacts I publish as first-party sections on dicefujiwara.com. Each eval keeps the human narrative, live inspection surface, and agent-readable context together.

arena-war-eval-v0.3.3 · schema v6

Arena War

A reproducible LLM-vs-LLM coding benchmark in which models iteratively write JavaScript territory algorithms and refine them through competitive feedback.
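
To make the setup concrete, here is a minimal sketch of what a submitted territory algorithm and one head-to-head game could look like. The grid size, move signature, and scoring below are illustrative assumptions rather than the arena's actual rules, and the two hard-coded strategies stand in for model-written code.

```js
// Illustrative sketch only: the arena's real rules and algorithm interface
// are not reproduced here; this just shows the shape of an LLM-vs-LLM
// territory matchup on a toy grid.
const SIZE = 8;

// A submitted "territory algorithm": given the board and a player id,
// return the cell to claim this turn.
function greedyCenter(board, player) {
  let best = null;
  for (let r = 0; r < SIZE; r++) {
    for (let c = 0; c < SIZE; c++) {
      if (board[r][c] !== 0) continue;
      const dist = Math.abs(r - SIZE / 2) + Math.abs(c - SIZE / 2);
      if (!best || dist < best.dist) best = { r, c, dist };
    }
  }
  return best;
}

function firstOpen(board, player) {
  for (let r = 0; r < SIZE; r++)
    for (let c = 0; c < SIZE; c++)
      if (board[r][c] === 0) return { r, c };
  return null;
}

// Pit two algorithms against each other and count claimed territory.
function playGame(algoA, algoB) {
  const board = Array.from({ length: SIZE }, () => Array(SIZE).fill(0));
  const players = [{ id: 1, algo: algoA }, { id: 2, algo: algoB }];
  for (let turn = 0; turn < SIZE * SIZE; turn++) {
    const { id, algo } = players[turn % 2];
    const move = algo(board, id);
    if (!move || board[move.r][move.c] !== 0) continue; // illegal move forfeits the turn
    board[move.r][move.c] = id;
  }
  const counts = { 1: 0, 2: 0 };
  for (const row of board) for (const cell of row) if (cell !== 0) counts[cell]++;
  return counts;
}

console.log(playGame(greedyCenter, firstOpen)); // territory claimed per player
```

In the actual eval, each model rewrites its algorithm between iterations after seeing the competitive results; the stand-in strategies above exist only so the harness sketch runs end to end.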

Canonical design: 2×2×2 (mode × reasoning × model)
Games per iteration: 25 (standard frontier comparison n)
Run seed: 424242 (deterministic per-game seed derivation; sketched below)
Public sample best: 46% (claude-opus-4-7, adversarial high, iteration 4)
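
The run seed row implies that each game's RNG seed is derived deterministically from the single published run seed, so any game in an iteration can be replayed exactly. A minimal sketch of one such derivation, assuming an FNV-1a hash over the run seed, condition label, iteration, and game index; the hash choice and field order are illustrative, not the eval's actual scheme.

```js
// Illustrative only: derive a stable 32-bit per-game seed from the run seed
// plus the condition and game index, so re-running the eval replays the same games.
function fnv1a32(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

function gameSeed(runSeed, condition, iteration, gameIndex) {
  // Field order is an assumption; any fixed, documented order works.
  return fnv1a32(`${runSeed}|${condition}|${iteration}|${gameIndex}`);
}

console.log(gameSeed(424242, "adversarial-high", 4, 0)); // stable across runs
```

Deriving per-game seeds this way keeps games independent of execution order while staying reproducible from the one seed that is published.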
