Evals

Homemade benchmarks and experiments for understanding what models can do, what they can't do, and what they claim to do.

moneyball-bench-v3 · schema v1

MoneyBall Bench

An eval for LLM negotiation under information asymmetry: seven agent variants, 67 runs, and 519 player–team negotiations against a calibrated GM.

Findings

Full writeup with figures for the seven agent variants, leakage analysis, and the failure modes that mattered.

GitHub

Source code, run scripts, result JSONs, and companion analysis notes.

arena-war-eval-v0.3.4 · schema v7

Arena War

A reproducible LLM-vs-LLM coding benchmark where models iteratively write JavaScript territory algorithms and improve through competitive feedback.

Dashboard

Learning curves, leaderboard, held-out reference results, head-to-head matrix, and live mini replay.

Sandbox

Interactive arena for replaying baseline strategies or model-generated algorithms from the bundled sample run.

Writeup

Research-style essay on the clean GPT-5.5 vs Claude Opus frontier comparison.
