AI evals
Benchmarks and artifacts I publish as first-party sections on dicefujiwara.com. Each eval keeps the human narrative, the live inspection surface, and the agent-readable context together.
Arena War
A reproducible LLM-vs-LLM coding benchmark where models iteratively write JavaScript territory algorithms and improve through competitive feedback.
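To make the contest shape concrete, here is a minimal sketch of the kind of JavaScript territory algorithm a model might submit. The function name, grid representation, and move format are illustrative assumptions; the benchmark's actual harness API is not documented in this section.

```js
// Hypothetical agent interface: every name here is an assumption, not
// the Arena War harness's published API. The agent receives a square
// grid of cell owners (0 = empty, 1 = me, 2 = opponent) and returns
// the empty cell it wants to claim this turn.
function chooseMove(grid, me) {
  const size = grid.length;
  let best = null;
  let bestScore = -Infinity;
  for (let y = 0; y < size; y++) {
    for (let x = 0; x < size; x++) {
      if (grid[y][x] !== 0) continue; // only empty cells are claimable
      // Greedy heuristic: prefer empty cells adjacent to territory we
      // already own, so claims grow as one connected region.
      let score = 0;
      for (const [dx, dy] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
        const nx = x + dx, ny = y + dy;
        if (nx >= 0 && ny >= 0 && nx < size && ny < size) {
          if (grid[ny][nx] === me) score += 2;     // extend our region
          else if (grid[ny][nx] === 0) score += 1; // keep options open
        }
      }
      if (score > bestScore) {
        bestScore = score;
        best = { x, y };
      }
    }
  }
  return best; // null only if the board is full
}
```

The competitive-feedback loop then replays games like this one, feeds the results back to each model, and asks for an improved algorithm in the next iteration.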
| Metric | Value | Notes |
| --- | --- | --- |
| Canonical design | 2×2×2 | mode × reasoning × model |
| Games per iteration | 25 | standard frontier-comparison n |
| Run seed | 424242 | deterministic per-game seed derivation |
| Public sample best | 46% | claude-opus-4-7, adversarial high, iter 4 |
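The "deterministic per-game seed derivation" row is worth a sketch: one common way to expand a single run seed into stable per-game seeds is to hash the run seed together with the game index, so every rerun replays identical boards. The splitmix-style mixer below is an assumption about the technique, not the benchmark's published code.

```js
// Sketch of deterministic per-game seed derivation, assuming a
// splitmix-style integer mixer; the benchmark's actual derivation
// scheme is not specified in this section.
function gameSeed(runSeed, gameIndex) {
  let h = (runSeed ^ Math.imul(gameIndex, 0x9e3779b1)) >>> 0; // mix in the index
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b) >>> 0;
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35) >>> 0;
  return (h ^ (h >>> 16)) >>> 0; // 32-bit unsigned per-game seed
}

// With runSeed = 424242 and 25 games per iteration, every rerun
// derives the same 25 seeds, so each game is reproducible.
const seeds = Array.from({ length: 25 }, (_, i) => gameSeed(424242, i));
```

Deriving seeds this way keeps the 2×2×2 conditions comparable: every mode × reasoning × model cell sees the same 25 boards, so score differences reflect the algorithms rather than board luck.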