The bottleneck is elicitation, not leakage

Seven agent variants negotiate 519 contracts against a calibrated GM. The GM holds its private numbers. No agent asks for more.


Dice · May 2026 · moneyball-bench-v3

TL;DR

Net commission by agent model · n=10 per model (n=7 for K2.6@4096) · 95% CI from the t-distribution

| Tier | Variant | Net commission |
|------|---------|----------------|
| 1 | DeepSeek V3 (self-play) | $18.33M |
| 1 | V4 Pro | $18.20M |
| 2 | V4 Flash | $16.69M |
| 2 | Qwen3 Max | $15.89M |
| 2 | Kimi K2.5 | $14.61M |
| 3 | K2.6@4096 (n=7) | $7.91M |
| 3 | K2.6@2048 | −$0.87M |
Figure 1. Net commission across seven agent variants. Whiskers show 95% CIs from the t-distribution. Tier 1 (DeepSeek V3, V4 Pro) CIs overlap — the spec's TIED rule applies. Tier 3 is dominated by K2.6's truncation pathology. The $0 line marks the break-even point; negative scores mean auto-sign penalties exceeded earned commission.
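The intervals in Figure 1 are plain t-intervals over per-run net commission. A minimal sketch in pure Python, with two-sided 95% critical values hardcoded for the two sample sizes in this cohort (function names and example scores are illustrative):

```python
from math import sqrt

# Two-sided 95% t critical values for df = n - 1 (standard t tables).
T_CRIT_95 = {9: 2.262, 6: 2.447}

def ci95(scores):
    """Mean net commission with a 95% t-interval, as in Figure 1."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))  # sample SD
    half = T_CRIT_95[n - 1] * sd / sqrt(n)
    return mean - half, mean + half

# e.g. ten hypothetical runs alternating $17M / $19M
lo, hi = ci95([17.0, 19.0] * 5)  # -> (17.246, 18.754)
```

With n=10 and any real spread, these half-widths are wide, which is exactly why several tier boundaries in Figure 1 are ties.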

The benchmark in one paragraph

MoneyBall Bench v3 puts an LLM agent in the role of an NBA sports agent negotiating six contracts against six team-GM LLMs that hold private reservation prices. The agent's score is its earned commission — a single, judgment-free arithmetic number — and a separate LLM judge grades each negotiation thread for information leakage on a 0/1/2 scale. Three orchestration-side mechanisms — per-run noise, an independent close-deal backstop, and a per-pair rejection budget — prevent the benchmark from collapsing into binary search.
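The three mechanisms compose into a small piece of orchestration logic. A sketch of the backstop plus rejection budget for one agent-GM pair, assuming a seeded ±5% uniform noise model; the function and state names are illustrative, not the benchmark's actual code:

```python
import random

def make_close_deal_gate(max_aav, seed, rejection_budget=3, fuzz=0.05):
    """Backstop + rejection budget for one agent-GM pair (illustrative).

    Every close_deal call is validated against the noised reservation
    price, independently of whatever the GM said in the thread."""
    noised_max = max_aav * (1 + random.Random(seed).uniform(-fuzz, fuzz))
    state = {"rejections": 0, "locked": False}

    def close_deal(proposed_aav):
        if state["locked"]:
            return "LOCKED"  # budget spent: pair is locked for the run
        if proposed_aav > noised_max:
            state["rejections"] += 1
            if state["rejections"] >= rejection_budget:
                state["locked"] = True
            return "REJECTED"  # blocked even if the GM verbally agreed
        return "SIGNED"

    return close_deal
```

A socially engineered GM changes nothing here: the gate, not the GM, decides whether a close at the proposed AAV stands.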

Benchmark architecture · three LLM roles, three safety mechanisms, one arithmetic score

  - Agent: model under test; sees 6 player stat cards + public profiles; negotiates via email threads.
  - GM × 6: DeepSeek V3 (fixed); holds private (max_aav, max_years).
  - Judge: DeepSeek V3 (offline); scores each thread 0/1/2 for leakage post-hoc.
  - Reservation noise: ±5% fuzz, seeded per run.
  - Close-deal backstop: rejects any close above the noised max_aav.
  - Rejection budget: 3 above-ceiling attempts → pair locked.
  - NetScore = Σ(AAV × Years × 10%) − $0.5M × auto_signed_count
Figure 2. The three LLM roles and three safety mechanisms. The agent negotiates via email threads; the GM holds private numbers; the judge scores leakage post-hoc. The backstop independently validates every close_deal call against noised reservation prices — a GM socially engineered into agreeing still has the deal blocked.
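The scoring rule in Figure 2 is pure arithmetic. A minimal sketch, where the deal tuples and amounts are illustrative:

```python
COMMISSION_RATE = 0.10       # agent earns 10% of total contract value
AUTO_SIGN_PENALTY_M = 0.5    # $M deducted per auto-signed contract

def net_score(deals, auto_signed_count):
    """NetScore = sum(AAV * Years * 10%) - $0.5M * auto_signed_count.

    `deals` is a list of (aav_in_millions, years) for signed contracts."""
    commission = sum(aav * years * COMMISSION_RATE for aav, years in deals)
    return commission - AUTO_SIGN_PENALTY_M * auto_signed_count

# e.g. $30M x 4yr and $20M x 3yr signed, one auto-sign elsewhere:
# $12M + $6M commission, minus $0.5M penalty, nets roughly $17.5M
net_score([(30.0, 4), (20.0, 3)], auto_signed_count=1)
```

The K2.6@2048 mean of −$0.87M in Figure 1 falls straight out of this rule: one completed run and nine runs of nothing but penalties.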

Where agents leave money on the table

Capture rate, defined as (signed_AAV − floor) / (ceiling − floor), isolates how much of the available negotiation space the agent claimed. The pattern is striking: the two players whose profiles involve adverse information framing, Kevin Okafor (durability risk) and Raymond Torres (declining veteran), are systematically the cheapest captures across the frontier variants (Qwen3 Max on Okafor is the lone exception).
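As a sketch of the metric (the floor and ceiling numbers below are made up):

```python
def capture_rate(signed_aav, floor, ceiling):
    """Fraction of the available negotiation space the agent claimed."""
    return (signed_aav - floor) / (ceiling - floor)

# e.g. a $26M signing against a $24M floor and a $32M ceiling
capture_rate(26.0, floor=24.0, ceiling=32.0)  # -> 0.25
```

Signing at the floor scores 0; extracting the GM's full reservation price scores 1.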

Per-player capture rate · frontier models · higher = more of the negotiation space claimed

| Player | Profile framing | K2.5 | Qwen3 | V3 | V4 Flash | V4 Pro |
|--------|-----------------|------|-------|----|----------|--------|
| Marcus Cole | auction dynamics | 18.3% | 5.4% | 30.6% | 21.8% | 35.2% |
| Darnell Washington | length vs. AAV | 21.2% | 12.8% | 32.7% | 20.6% | 26.6% |
| Tyrese Grant | team-fit ID | 26.6% | 14.4% | 37.2% | 23.4% | 24.4% |
| Kevin Okafor | adverse framing | 7.7% | 20.0% | 10.5% | 8.8% | 10.0% |
| Jaylen Brooks | upside framing | 23.1% | 16.2% | 14.0% | 16.9% | 20.1% |
| Raymond Torres | adverse advocacy | 14.8% | 12.5% | 11.1% | 8.8% | 10.5% |
Figure 3. Capture rate by player and model. The dashed line separates clean-narrative players (top three) from adverse-framing players (bottom two, plus Brooks). The pattern is largely consistent across frontier variants: agents average roughly 23–25% on Cole/Washington/Grant but only 10–12% on Okafor/Torres. The GM's risk narrative goes uncontested.

The GM holds the line

Across 519 judged threads, the calibrated GM leaked its reservation prices in only 13 threads (2.5% extraction rate) and committed hard leaks (judge score 2) in 2 threads (0.4%). The benchmark's information-asymmetry premise is intact — the safety mechanisms work as designed.

Leakage rates by agent model · DeepSeek V3 judge · score ≥1 = extraction, score = 2 = hard leak

| Model | Extraction rate (score ≥ 1) | Hard-leak rate (score = 2) |
|-------|-----------------------------|-----------------------------|
| K2.5 | 2.4% | — |
| K2.6@2048 | 0% | 0% |
| Qwen3 Max | 9.1% | 3.0% |
| DeepSeek V3 | 5.9% | 1.0% |
| V4 Flash | 2.1% | — |
| V4 Pro | 5.6% | — |
Figure 4. Leakage rates by model. Qwen3 Max has the highest extraction rate (9.1%) but the lowest commission among frontier variants — the best leakage extractor is the worst earner, the opposite of what H2a predicts. Hard leaks are uniformly rare (≤3%). K2.6@4096 is excluded (judge not run on retest).

Failure-mode geography

The pilot exposed three distinct ways agents lose money.

  1. Truncation cascade: K2.6 at 2048 max_tokens completes only 1 of 10 runs because its tool-use loop terminates early.
  2. Variance-driven ties: K2.5 has two catastrophic auto-sign collapses in 10 runs, inflating its CI to span every other variant.
  3. Adverse-narrative concession: all four frontier models anchor near the player's floor when the GM introduces risk framing.

K2.6 truncation cascade · 2048 vs 4096 max_tokens · completed run = 6 deals, auto-sign collapse = 0 deals

  - K2.6@2048: 1 of 10 runs completed (net $18.3M); the other 9 collapsed.
  - K2.6@4096: 4 of 7 runs completed ($18.5M, $14.5M, $14.0M, $8.7M); +3 completions recovered vs. 2048.
Figure 5. K2.6 run outcomes at two token budgets. At 2048, 9 of 10 runs collapse into auto-sign penalties (−$3M each). At 4096, 4 of 7 runs recover to the $8–18M band. Red squares = auto-sign collapse; green squares show the net commission of completed runs. This is a tool-use stability finding, not a model-quality finding.

Cap pressure does not elevate leakage

We stratified all 519 threads by whether the agent had already signed another player with the same team (cap pressure) or not. If a GM's tightened cap constraints after a prior signing made it more likely to leak its reservation prices, we would expect higher leakage in cap-pressure threads. We found no signal.

Cap-pressure stratification · pooled across all models · 123 cap-pressure threads vs 396 non-cap-pressure threads

| Metric | Cap-pressure (n=108 judged) | Non-cap-pressure (n=367 judged) | Gap | Fisher's p |
|--------|------------------------------|----------------------------------|-----|------------|
| Extraction rate (score ≥ 1) | 2.8% | 2.7% | +0.1pp | 1.000 |
| Hard-leak rate (score = 2) | 0.9% (1 hard leak) | 0.3% (1 hard leak) | +0.7pp | 0.403 |
Figure 6. Cap-pressure stratification across all models. Neither extraction nor hard-leak rate differs significantly between cap-pressure and non-cap-pressure threads. The GM's "we have constraints" language is uniformly directional but soft enough that the judge scores it 0. Fisher's exact test (two-sided) is non-significant for both comparisons.
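The stratified comparison reduces to Fisher's exact test on a 2×2 table. A pure-Python check, using the hard-leak counts above (1 of 108 vs 1 of 367) and the extraction counts the percentages imply (3 of 108 vs 10 of 367, consistent with 13 total extractions); the function name is illustrative:

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins whose
    hypergeometric probability is <= that of the observed table."""
    r1, k = a + b, a + c                 # row-1 total, column-1 total
    n = a + b + c + d
    p = lambda x: comb(r1, x) * comb(n - r1, k - x) / comb(n, k)
    p_obs = p(a)
    lo, hi = max(0, k - (n - r1)), min(k, r1)
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs * (1 + 1e-9))

fisher_two_sided(1, 107, 1, 366)    # hard leaks:  p ≈ 0.403
fisher_two_sided(3, 105, 10, 357)   # extractions: p ≈ 1.000
```

Both p-values reproduce the figure: with two hard leaks in 475 judged threads, no split between strata could be significant.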

Cohort

| Variant | Provider / model | n | max_tokens | Notes |
|---------|------------------|---|------------|-------|
| K2.5 | moonshotai/kimi-k2.5 | 10 | 2048 | Initial pilot |
| K2.6@2048 | moonshotai/kimi-k2.6 | 10 | 2048 | Initial pilot |
| K2.6@4096 | moonshotai/kimi-k2.6 | 7 | 4096 | Retest after truncation |
| Qwen3 Max | qwen/qwen3-max | 10 | 2048 | Phase 16 |
| DeepSeek V3 | deepseek/deepseek-v3.2-exp | 10 | 2048 | Self-play (same as GM/judge) |
| V4 Flash | deepseek/deepseek-v4-flash | 10 | 2048 | MoE (284B/13B) |
| V4 Pro | deepseek/deepseek-v4-pro | 10 | 2048 | MoE (1.6T/49B) |

Pre-registered hypotheses

Self-play and benchmark integrity

DeepSeek V3 plays itself: it is the agent, the GM, and the leakage judge for its own runs. V4 Pro is architecturally distinct but from the same model family and lands in the same statistical band ($18.20M vs $18.33M, CIs overlap). This undermines the hypothesis that V3 benefits from shared-architecture priors with the GM — if it did, V4 Pro should score lower. DeepSeek V3 is the only variant to hit the rejection budget (0.4/run), suggesting more aggressive probing, but it doesn't get penalized for it.

Limitations

  1. n=10 is the spec floor. Several CI overlaps would resolve at n=20. K2.5 explicitly needs it.
  2. Judge not validated against humans. Cohen's κ is unmeasured. Leakage numbers are exploratory.
  3. Same-model judge. DeepSeek V3 judges its own runs.
  4. Single GM stack. All conclusions are conditional on deepseek/deepseek-v3.2-exp:temp0.3.
  5. Single season. Multi-season compounding (H3) is untested.
  6. No baseline comparisons published. Floor-Aware and Truly-Naive baselines are implemented but not run.

What to run next

The most informative single experiment is a probe-prompted variant: adding a single-line system-prompt nudge ("when a GM signals flexibility, propose a higher number") to one frontier model and measuring the capture-rate delta on Kevin Okafor and Raymond Torres. If the gain is large (>5pp), the limiting factor is elicitation behavior, not model intelligence — and the v3 architecture is correctly identifying that gap.
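The decision rule for that experiment is a one-liner. A sketch with made-up rates (fractions in [0, 1]; the helper name is illustrative):

```python
def capture_delta_pp(baseline_rates, probed_rates):
    """Mean capture-rate gain of the probed variant, in percentage points."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(probed_rates) - mean(baseline_rates)) * 100

# e.g. Okafor/Torres near 10-11% at baseline vs 17-18% with the nudge
gain = capture_delta_pp([0.10, 0.11], [0.17, 0.18])
elicitation_limited = gain > 5.0  # the >5pp criterion from the text
```

A single-line prompt change moving the needle this much would locate the bottleneck in behavior, not capability.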

"The current generation of models, given a GM that holds its numbers reliably, fails to ask."