The bottleneck is elicitation, not leakage

Seven agent variants negotiate 519 contracts against a calibrated GM. The GM holds its private numbers. No agent asks for more.


Dice · May 2026 · moneyball-bench-v3

TL;DR

Net commission by agent model · n=10 per model (n=7 for K2.6@4096) · 95% CI from the t-distribution

| Tier | Variant | Net commission |
|------|---------|----------------|
| 1 | DeepSeek V3 (self-play) | $18.33M |
| 1 | V4 Pro | $18.20M |
| 2 | V4 Flash | $16.69M |
| 2 | Qwen3 Max | $15.89M |
| 2 | Kimi K2.5 | $14.61M |
| 3 | K2.6@4096 (n=7) | $7.91M |
| 3 | K2.6@2048 | −$0.87M |
Figure 1. Net commission across seven agent variants. Whiskers show 95% CIs from the t-distribution. Tier 1 (DeepSeek V3, V4 Pro) CIs overlap — the spec's TIED rule applies. Tier 3 is dominated by K2.6's truncation pathology. The $0 line marks the break-even point; negative scores mean auto-sign penalties exceeded earned commission.
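The intervals in Figure 1 are plain t-intervals over per-run net commission. A minimal sketch in pure Python, with two-sided 95% critical values hardcoded for the two sample sizes in this cohort (function names and example scores are illustrative):

```python
from math import sqrt

# Two-sided 95% t critical values for df = n - 1 (standard t tables).
T_CRIT_95 = {9: 2.262, 6: 2.447}

def ci95(scores):
    """Mean net commission with a 95% t-interval, as in Figure 1."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))  # sample SD
    half = T_CRIT_95[n - 1] * sd / sqrt(n)
    return mean - half, mean + half

# e.g. ten hypothetical runs alternating $17M / $19M
lo, hi = ci95([17.0, 19.0] * 5)  # -> (17.246, 18.754)
```

With n=10 and any real spread, these half-widths are wide, which is exactly why several tier boundaries in Figure 1 are ties.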

The benchmark in one paragraph

MoneyBall Bench v3 puts an LLM agent in the role of an NBA sports agent negotiating six contracts against six team-GM LLMs that hold private reservation prices. The agent's score is its earned commission — a single, judgment-free arithmetic number — and a separate LLM judge grades each negotiation thread for information leakage on a 0/1/2 scale. Three orchestration-side mechanisms — per-run noise, an independent close-deal backstop, and a per-pair rejection budget — prevent the benchmark from collapsing into binary search.
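The three mechanisms compose into a small piece of orchestration logic. A sketch of the backstop plus rejection budget for one agent-GM pair, assuming a seeded ±5% uniform noise model; the function and state names are illustrative, not the benchmark's actual code:

```python
import random

def make_close_deal_gate(max_aav, seed, rejection_budget=3, fuzz=0.05):
    """Backstop + rejection budget for one agent-GM pair (illustrative).

    Every close_deal call is validated against the noised reservation
    price, independently of whatever the GM said in the thread."""
    noised_max = max_aav * (1 + random.Random(seed).uniform(-fuzz, fuzz))
    state = {"rejections": 0, "locked": False}

    def close_deal(proposed_aav):
        if state["locked"]:
            return "LOCKED"  # budget spent: pair is locked for the run
        if proposed_aav > noised_max:
            state["rejections"] += 1
            if state["rejections"] >= rejection_budget:
                state["locked"] = True
            return "REJECTED"  # blocked even if the GM verbally agreed
        return "SIGNED"

    return close_deal
```

A socially engineered GM changes nothing here: the gate, not the GM, decides whether a close at the proposed AAV stands.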

Benchmark architecture · three LLM roles, three safety mechanisms, one arithmetic score

  - Agent: model under test; sees 6 player stat cards + public profiles; negotiates via email threads.
  - GM × 6: DeepSeek V3 (fixed); holds private (max_aav, max_years).
  - Judge: DeepSeek V3 (offline); scores each thread 0/1/2 for leakage post-hoc.
  - Reservation noise: ±5% fuzz, seeded per run.
  - Close-deal backstop: rejects any close above the noised max_aav.
  - Rejection budget: 3 above-ceiling attempts → pair locked.
  - NetScore = Σ(AAV × Years × 10%) − $0.5M × auto_signed_count
Figure 2. The three LLM roles and three safety mechanisms. The agent negotiates via email threads; the GM holds private numbers; the judge scores leakage post-hoc. The backstop independently validates every close_deal call against noised reservation prices — a GM socially engineered into agreeing still has the deal blocked.
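The scoring rule in Figure 2 is pure arithmetic. A minimal sketch, where the deal tuples and amounts are illustrative:

```python
COMMISSION_RATE = 0.10       # agent earns 10% of total contract value
AUTO_SIGN_PENALTY_M = 0.5    # $M deducted per auto-signed contract

def net_score(deals, auto_signed_count):
    """NetScore = sum(AAV * Years * 10%) - $0.5M * auto_signed_count.

    `deals` is a list of (aav_in_millions, years) for signed contracts."""
    commission = sum(aav * years * COMMISSION_RATE for aav, years in deals)
    return commission - AUTO_SIGN_PENALTY_M * auto_signed_count

# e.g. $30M x 4yr and $20M x 3yr signed, one auto-sign elsewhere:
# $12M + $6M commission, minus $0.5M penalty, nets roughly $17.5M
net_score([(30.0, 4), (20.0, 3)], auto_signed_count=1)
```

The K2.6@2048 mean of −$0.87M in Figure 1 falls straight out of this rule: one completed run and nine runs of nothing but penalties.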

Where agents leave money on the table

Capture rate, defined as (signed_AAV − floor) / (ceiling − floor), isolates how much of the available negotiation space the agent claimed. The pattern is striking: the two players whose profiles involve adverse information framing, Kevin Okafor (durability risk) and Raymond Torres (declining veteran), are systematically the cheapest captures across the frontier variants (Qwen3 Max on Okafor is the lone exception).
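As a sketch of the metric (the floor and ceiling numbers below are made up):

```python
def capture_rate(signed_aav, floor, ceiling):
    """Fraction of the available negotiation space the agent claimed."""
    return (signed_aav - floor) / (ceiling - floor)

# e.g. a $26M signing against a $24M floor and a $32M ceiling
capture_rate(26.0, floor=24.0, ceiling=32.0)  # -> 0.25
```

Signing at the floor scores 0; extracting the GM's full reservation price scores 1.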

Per-player capture rate · frontier models · higher = more of the negotiation space claimed

| Player | Profile framing | K2.5 | Qwen3 | V3 | V4 Flash | V4 Pro |
|--------|-----------------|------|-------|----|----------|--------|
| Marcus Cole | auction dynamics | 18.3% | 5.4% | 30.6% | 21.8% | 35.2% |
| Darnell Washington | length vs. AAV | 21.2% | 12.8% | 32.7% | 20.6% | 26.6% |
| Tyrese Grant | team-fit ID | 26.6% | 14.4% | 37.2% | 23.4% | 24.4% |
| Kevin Okafor | adverse framing | 7.7% | 20.0% | 10.5% | 8.8% | 10.0% |
| Jaylen Brooks | upside framing | 23.1% | 16.2% | 14.0% | 16.9% | 20.1% |
| Raymond Torres | adverse advocacy | 14.8% | 12.5% | 11.1% | 8.8% | 10.5% |
Figure 3. Capture rate by player and model. The dashed line separates clean-narrative players (top three) from adverse-framing players (bottom two, plus Brooks). The pattern is largely consistent across frontier variants: agents average roughly 23–25% on Cole/Washington/Grant but only 10–12% on Okafor/Torres. The GM's risk narrative goes uncontested.

The GM holds the line

Across 519 judged threads, the calibrated GM leaked its reservation prices in only 13 threads (2.5% extraction rate) and committed hard leaks (judge score 2) in 2 threads (0.4%). The benchmark's information-asymmetry premise is intact — the safety mechanisms work as designed.

Leakage rates by agent model · DeepSeek V3 judge · score ≥1 = extraction, score = 2 = hard leak

| Model | Extraction rate (score ≥ 1) | Hard-leak rate (score = 2) |
|-------|-----------------------------|-----------------------------|
| K2.5 | 2.4% | — |
| K2.6@2048 | 0% | 0% |
| Qwen3 Max | 9.1% | 3.0% |
| DeepSeek V3 | 5.9% | 1.0% |
| V4 Flash | 2.1% | — |
| V4 Pro | 5.6% | — |
Figure 4. Leakage rates by model. Qwen3 Max has the highest extraction rate (9.1%) but the lowest commission among frontier variants — the best leakage extractor is the worst earner, the opposite of what H2a predicts. Hard leaks are uniformly rare (≤3%). K2.6@4096 is excluded (judge not run on retest).

Failure-mode geography

The pilot exposed three distinct ways agents lose money.

  1. Truncation cascade: K2.6 at 2048 max_tokens completes only 1 of 10 runs because its tool-use loop terminates early.
  2. Variance-driven ties: K2.5 has two catastrophic auto-sign collapses in 10 runs, inflating its CI to span every other variant.
  3. Adverse-narrative concession: all four frontier models anchor near the player's floor when the GM introduces risk framing.

K2.6 truncation cascade · 2048 vs 4096 max_tokens · completed run = 6 deals, auto-sign collapse = 0 deals

  - K2.6@2048: 1 of 10 runs completed (net $18.3M); the other 9 collapsed.
  - K2.6@4096: 4 of 7 runs completed ($18.5M, $14.5M, $14.0M, $8.7M); +3 completions recovered vs. 2048.
Figure 5. K2.6 run outcomes at two token budgets. At 2048, 9 of 10 runs collapse into auto-sign penalties (−$3M each). At 4096, 4 of 7 runs recover to the $8–18M band. Red squares = auto-sign collapse; green squares show the net commission of completed runs. This is a tool-use stability finding, not a model-quality finding.

Cap pressure does not elevate leakage

We stratified all 519 threads by whether the agent had already signed another player with the same team (cap pressure) or not. If a GM's tightened cap constraints after a prior signing made it more likely to leak its reservation prices, we would expect higher leakage in cap-pressure threads. We found no signal.

Cap-pressure stratification · pooled across all models · 123 cap-pressure threads vs 396 non-cap-pressure threads

| Metric | Cap-pressure (n=108 judged) | Non-cap-pressure (n=367 judged) | Gap | Fisher's p |
|--------|------------------------------|----------------------------------|-----|------------|
| Extraction rate (score ≥ 1) | 2.8% | 2.7% | +0.1pp | 1.000 |
| Hard-leak rate (score = 2) | 0.9% (1 hard leak) | 0.3% (1 hard leak) | +0.7pp | 0.403 |
Figure 6. Cap-pressure stratification across all models. Neither extraction nor hard-leak rate differs significantly between cap-pressure and non-cap-pressure threads. The GM's "we have constraints" language is uniformly directional but soft enough that the judge scores it 0. Fisher's exact test (two-sided) is non-significant for both comparisons.
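The stratified comparison reduces to Fisher's exact test on a 2×2 table. A pure-Python check, using the hard-leak counts above (1 of 108 vs 1 of 367) and the extraction counts the percentages imply (3 of 108 vs 10 of 367, consistent with 13 total extractions); the function name is illustrative:

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins whose
    hypergeometric probability is <= that of the observed table."""
    r1, k = a + b, a + c                 # row-1 total, column-1 total
    n = a + b + c + d
    p = lambda x: comb(r1, x) * comb(n - r1, k - x) / comb(n, k)
    p_obs = p(a)
    lo, hi = max(0, k - (n - r1)), min(k, r1)
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs * (1 + 1e-9))

fisher_two_sided(1, 107, 1, 366)    # hard leaks:  p ≈ 0.403
fisher_two_sided(3, 105, 10, 357)   # extractions: p ≈ 1.000
```

Both p-values reproduce the figure: with two hard leaks in 475 judged threads, no split between strata could be significant.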

Cohort

| Variant | Provider / model | n | max_tokens | Notes |
|---------|------------------|---|------------|-------|
| K2.5 | moonshotai/kimi-k2.5 | 10 | 2048 | Initial pilot |
| K2.6@2048 | moonshotai/kimi-k2.6 | 10 | 2048 | Initial pilot |
| K2.6@4096 | moonshotai/kimi-k2.6 | 7 | 4096 | Retest after truncation |
| Qwen3 Max | qwen/qwen3-max | 10 | 2048 | Phase 16 |
| DeepSeek V3 | deepseek/deepseek-v3.2-exp | 10 | 2048 | Self-play (same as GM/judge) |
| V4 Flash | deepseek/deepseek-v4-flash | 10 | 2048 | MoE (284B/13B) |
| V4 Pro | deepseek/deepseek-v4-pro | 10 | 2048 | MoE (1.6T/49B) |

Pre-registered hypotheses

Self-play and benchmark integrity

DeepSeek V3 plays itself: it is the agent, the GM, and the leakage judge for its own runs. V4 Pro is architecturally distinct but from the same model family and lands in the same statistical band ($18.20M vs $18.33M, CIs overlap). This undermines the hypothesis that V3 benefits from shared-architecture priors with the GM — if it did, V4 Pro should score lower. DeepSeek V3 is the only variant to hit the rejection budget (0.4/run), suggesting more aggressive probing, but it doesn't get penalized for it.

Limitations

  1. n=10 is the spec floor. Several CI overlaps would resolve at n=20. K2.5 explicitly needs it.
  2. Judge not validated against humans. Cohen's κ is unmeasured. Leakage numbers are exploratory.
  3. Same-model judge. DeepSeek V3 judges its own runs.
  4. Single GM stack. All conclusions are conditional on deepseek/deepseek-v3.2-exp:temp0.3.
  5. Single season. Multi-season compounding (H3) is untested.
  6. No baseline comparisons published. Floor-Aware and Truly-Naive baselines are implemented but not run.

What to run next

The most informative single experiment is a probe-prompted variant: adding a single-line system-prompt nudge ("when a GM signals flexibility, propose a higher number") to one frontier model and measuring the capture-rate delta on Kevin Okafor and Raymond Torres. If the gain is large (>5pp), the limiting factor is elicitation behavior, not model intelligence — and the v3 architecture is correctly identifying that gap.
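The decision rule for that experiment is a one-liner. A sketch with made-up rates (fractions in [0, 1]; the helper name is illustrative):

```python
def capture_delta_pp(baseline_rates, probed_rates):
    """Mean capture-rate gain of the probed variant, in percentage points."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(probed_rates) - mean(baseline_rates)) * 100

# e.g. Okafor/Torres near 10-11% at baseline vs 17-18% with the nudge
gain = capture_delta_pp([0.10, 0.11], [0.17, 0.18])
elicitation_limited = gain > 5.0  # the >5pp criterion from the text
```

A single-line prompt change moving the needle this much would locate the bottleneck in behavior, not capability.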

"The current generation of models, given a GM that holds its numbers reliably, fails to ask."