Evals
Homemade benchmarks and experiments. Trying to understand what models can do, can't do, and claim to do.
MoneyBall Bench
An eval for LLM negotiation under information asymmetry: seven agent variants, 67 runs, and 519 player–team negotiations against a calibrated GM.
Full writeup with figures for the seven agent variants, leakage analysis, and the failure modes that mattered.
GitHub: Source code, run scripts, result JSONs, and companion analysis notes.
Arena War
A reproducible LLM-vs-LLM coding benchmark where models iteratively write JavaScript territory algorithms and improve through competitive feedback.
Learning curves, leaderboard, held-out reference results, head-to-head matrix, and live mini replay.
Sandbox: Interactive arena for replaying baseline strategies or model-generated algorithms from the bundled sample run.
Writeup: Research-style essay on the clean GPT-5.5 vs Claude Opus frontier comparison.