AI eval · moneyball-bench-v3

MoneyBall Bench

An eval for LLM negotiation under information asymmetry: seven agent variants, 67 runs, and 519 player–team negotiations against a calibrated GM.

Findings

Full writeup with figures for the seven agent variants, leakage analysis, and the failure modes that mattered.

GitHub

Source code, run scripts, result JSONs, and companion analysis notes.

Methodology

MoneyBall Bench v3 puts an LLM in the role of an NBA sports agent negotiating six contracts against six team-GM LLMs that hold private reservation prices. The score is earned commission, a judge-free arithmetic number, while a separate LLM judge grades each thread for information leakage on a 0/1/2 scale. Three guardrails — per-run noise, an independent close-deal backstop, and a per-pair rejection budget — keep the task from collapsing into binary search.
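To make the rejection-budget guardrail concrete, here is a minimal sketch of a per-pair counter. Everything in it is an assumption for illustration: the class name `RejectionBudget`, the default budget of 3, and the rule that a pair stops negotiating once the budget is spent are hypothetical and not taken from the benchmark's code. The team label in the usage note is also made up.

```python
from collections import defaultdict


class RejectionBudget:
    """Hypothetical per-pair rejection counter (illustrative only)."""

    def __init__(self, max_rejections: int = 3):
        # max_rejections is an assumed value, not the benchmark's setting.
        self.max_rejections = max_rejections
        self._used = defaultdict(int)  # (player, team) -> rejections so far

    def record_rejection(self, player: str, team: str) -> bool:
        """Count one rejected offer; return True if the pair may keep negotiating."""
        self._used[(player, team)] += 1
        return self._used[(player, team)] < self.max_rejections

    def remaining(self, player: str, team: str) -> int:
        """Rejections left for this player-team pair, never below zero."""
        return max(0, self.max_rejections - self._used[(player, team)])
```

With a finite budget such as `RejectionBudget(max_rejections=2)`, an agent representing Okafor against a hypothetical "TeamA" gets two rejected offers before that pair closes, so it cannot bisect the GM's reservation price with unlimited probes.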

Load-bearing findings

  • DeepSeek V3 and V4 Pro are statistically tied at ~$18.3M net commission; V4 Flash and Qwen3 Max trail by $1.5–2.4M.
  • The GM mostly keeps its private numbers private: extraction rate ≤9% and hard-leak rate ≤3% across all variants.
  • The rejection budget goes almost unused: agents rarely counter upward once the GM frames a player as risky.
  • Okafor and Torres, the adverse-framing players, land at 10–12% capture vs 23–25% for clean-narrative players.

For agents

  • The GM stack is part of the benchmark definition: changing the GM model means a different benchmark version. Results are tagged with gm_stack_version.
  • Score is arithmetic (commission earned minus auto-sign penalties), not LLM-judged. Leakage is measured separately.
  • Pre-registered hypotheses H2b and H2c did not hold up: leakage is too rare to explain variance, and routing does not separate models.
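The scoring and tagging rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's implementation: the `Deal` and `RunResult` types, the 4% commission rate, the flat $100,000 auto-sign penalty, and the version strings are all hypothetical. Only the shape of the formula (commission earned minus auto-sign penalties) and the `gm_stack_version` tag come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deal:
    contract_value: float  # total signed contract value, in dollars
    auto_signed: bool      # hypothetical flag: deal closed by the auto-sign fallback


@dataclass
class RunResult:
    gm_stack_version: str  # tag named in the text; the value format is assumed
    deals: List[Deal]


def net_commission(run: RunResult,
                   commission_rate: float = 0.04,
                   auto_sign_penalty: float = 100_000.0) -> float:
    """Commission earned minus auto-sign penalties (rate and penalty are assumed)."""
    gross = sum(d.contract_value * commission_rate for d in run.deals)
    penalties = auto_sign_penalty * sum(1 for d in run.deals if d.auto_signed)
    return gross - penalties


def comparable(a: RunResult, b: RunResult) -> bool:
    """Scores are only comparable when both runs share a GM stack version."""
    return a.gm_stack_version == b.gm_stack_version
```

Because the score is plain arithmetic over signed deals, it can be recomputed from the result JSONs without calling any model; the `comparable` check reflects the rule that a different GM stack means a different benchmark version.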
