
MoneyBall Bench

A research-grade benchmark measuring LLM negotiation under information asymmetry. Seven agent variants, 67 runs, 519 player–team negotiations against a calibrated GM.

Findings

Full writeup with publication-quality graphics covering all seven agent variants, leakage analysis, and failure modes.

GitHub

Source code, run scripts, result JSONs, and companion analysis documents.

Methodology

MoneyBall Bench v3 puts an LLM agent in the role of an NBA sports agent negotiating six contracts against six team-GM LLMs that hold private reservation prices. The agent's score is its earned commission, a single arithmetic number that requires no judgment, and a separate LLM judge grades each negotiation thread for information leakage on a 0/1/2 scale. Three orchestration-side mechanisms (per-run noise, an independent close-deal backstop, and a per-pair rejection budget) prevent the benchmark from collapsing into binary search on the GM's reservation price.
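To make the binary-search concern concrete, here is a minimal sketch of how per-run noise and a per-pair rejection budget could be enforced on the orchestration side. It is not the repository's implementation: the names `PairState`, `start_run`, and `submit_offer`, the budget of 3, and the 5% noise band are all assumptions chosen for illustration. The close-deal backstop is left out because this summary does not describe its rules.

```python
import random
from dataclasses import dataclass


@dataclass
class PairState:
    """Orchestrator-side state for one player-team pair (hypothetical)."""
    true_reservation: float        # GM's private ceiling, never shown to the agent
    rejections_left: int = 3       # assumed budget size, not taken from the benchmark
    noisy_reservation: float = 0.0


def start_run(pairs, seed, noise_frac=0.05):
    """Perturb every reservation price once per run (assumed uniform noise)."""
    rng = random.Random(seed)
    for state in pairs.values():
        jitter = rng.uniform(-noise_frac, noise_frac)
        state.noisy_reservation = state.true_reservation * (1 + jitter)


def submit_offer(state, ask):
    """Return 'accepted', 'rejected', or 'closed' for one ask."""
    if state.rejections_left <= 0:
        return "closed"            # budget spent: no further probing of this pair
    if ask <= state.noisy_reservation:
        return "accepted"
    state.rejections_left -= 1
    return "rejected"
```

With a budget of 3, an agent gets at most three rejected asks per pair before the pair closes, which is too few to bisect a price range to any useful precision. Because the threshold is re-drawn for every run, a price learned in one run does not carry over exactly to the next.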

Load-bearing findings

DeepSeek V3 and V4 Pro are statistically tied at ~$18.3M net commission; V4 Flash and Qwen3 Max sit $1.5–2.4M lower.
The GM holds the line: extraction rate ≤9% and hard-leak rate ≤3% across all variants.
The rejection budget is essentially unused — no agent probes upward against the GM's risk framing.
Adverse-framing players (Okafor, Torres) are systematically underpriced at 10–12% capture vs 23–25% for clean-narrative players.

For agents

The GM stack is a load-bearing benchmark component: changing the GM model produces a different benchmark version, so results are tagged with gm_stack_version.
Score is arithmetic (commission earned minus auto-sign penalties), not LLM-judged. Leakage is a separate orthogonal measurement.
Pre-registered hypotheses H2b and H2c broke — leakage is too rare to explain variance, and routing doesn't separate models.
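The first two points above can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's scoring code: the 10% commission rate, the flat $500,000 auto-sign penalty, the rule that auto-signed deals earn no commission, and the names `Deal`, `net_commission`, and `comparable` are all hypothetical. Only the shape of the formula (commission earned minus auto-sign penalties) and the `gm_stack_version` tag come from the text.

```python
from dataclasses import dataclass

COMMISSION_RATE = 0.10  # hypothetical rate; the real value lives in the repository


@dataclass
class Deal:
    player: str
    contract_value: float  # total contract value in dollars
    auto_signed: bool      # True if the orchestrator signed the deal for the agent


def net_commission(deals, auto_sign_penalty=500_000.0):
    """Commission earned minus auto-sign penalties: plain arithmetic, no judge."""
    # Assumption: auto-signed deals earn no commission and each costs a flat penalty.
    earned = sum(d.contract_value * COMMISSION_RATE for d in deals if not d.auto_signed)
    penalties = auto_sign_penalty * sum(1 for d in deals if d.auto_signed)
    return earned - penalties


def comparable(run_a, run_b):
    """Two result records are comparable only if they share a GM stack version."""
    return run_a["gm_stack_version"] == run_b["gm_stack_version"]
```

Checking `comparable` before aggregating result JSONs keeps scores produced against different GM models from being averaged together, since those count as different benchmark versions.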

Source: https://github.com/dicefuji/moneyballbench

Daisuke Fujiwara
