The bottleneck is elicitation, not leakage
Seven agent variants negotiate contracts against a calibrated GM across 519 judged threads. The GM holds its private numbers. No agent asks for more.
TL;DR
- DeepSeek V3 and V4 Pro are statistically tied at ~$18.3M net commission, with 95% CIs that overlap. V4 Flash and Qwen3 Max sit $1.5–2.4M lower but remain in the same competitive band.
- The GM holds the line: across 519 judged threads, extraction rate is ≤9% and hard-leak rate is ≤3% for every variant. The benchmark's three safety mechanisms (noise, backstop, rejection budget) compose to make mechanical reservation-price extraction infeasible.
- No agent probes upward. The rejection budget — designed to prevent binary-search-via-tool-spam — is essentially unused (0.0–0.4 hits per run). Models accept the GM's risk frame instead of countering it, anchoring near the player's floor.
- The two adverse-framing players (Okafor, Torres) are systematically underpriced at 10–12% capture, vs 23–25% for clean-narrative players. This gap is consistent across all four frontier variants.
- K2.6's bottom-tier placement is a truncation pathology, not a capability gap: raising max_tokens from 2048 to 4096 recovers most runs.
The benchmark in one paragraph
MoneyBall Bench v3 puts an LLM agent in the role of an NBA sports agent negotiating six contracts against six team-GM LLMs that hold private reservation prices. The agent's score is its earned commission — a single, judgment-free arithmetic number — and a separate LLM judge grades each negotiation thread for information leakage on a 0/1/2 scale. Three orchestration-side mechanisms — per-run noise, an independent close-deal backstop, and a per-pair rejection budget — prevent the benchmark from collapsing into binary search.
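The three orchestration-side mechanisms compose into a simple guard. A minimal sketch, with all names, noise levels, and budget sizes hypothetical rather than the benchmark's actual API:

```python
import random

# Hypothetical sketch of the three safety mechanisms: per-run noise,
# an independent close-deal backstop, and a per-pair rejection budget.
# Names and numbers are illustrative, not the benchmark's implementation.

class NegotiationGuard:
    def __init__(self, reservation_price: float, noise_pct: float = 0.03,
                 rejection_budget: int = 5, seed: int = 0):
        rng = random.Random(seed)
        # Per-run noise: the floor the backstop enforces is jittered, so
        # probing the true reservation price via repeated offers is useless.
        self.effective_reservation = reservation_price * (
            1 + rng.uniform(-noise_pct, noise_pct))
        self.rejections_left = rejection_budget  # per-pair rejection budget

    def close_deal(self, offered_aav: float) -> bool:
        """Independent backstop: even a GM talked into agreeing is blocked
        if the offered AAV exceeds the noised reservation price."""
        if offered_aav <= self.effective_reservation:
            return True
        if self.rejections_left <= 0:
            raise RuntimeError("rejection budget exhausted")
        self.rejections_left -= 1  # each blocked close burns budget
        return False
```

Because every failed `close_deal` call burns rejection budget, binary-searching the reservation price via tool spam hits the budget ceiling long before the noise band is resolved.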
The independent backstop validates every close_deal call against the noised reservation prices — a GM socially engineered into agreeing still has the deal blocked.
Where agents leave money on the table
Capture rate — (signed_AAV − floor) / (ceiling − floor) — isolates how much of the available negotiation space the agent claimed. The pattern is striking: the two players whose profiles involve adverse information framing — Kevin Okafor (durability risk) and Raymond Torres (declining veteran) — are systematically the cheapest captures across all four frontier variants.
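The formula above is plain arithmetic; a one-line sketch makes the units concrete (the 12% example below is illustrative, chosen to match the adverse-framing band reported for Okafor and Torres):

```python
def capture_rate(signed_aav: float, floor: float, ceiling: float) -> float:
    """Fraction of the negotiation space (floor..ceiling) the agent claimed."""
    return (signed_aav - floor) / (ceiling - floor)

# Illustrative numbers: a $20M floor, $40M ceiling, signed at $22.4M
# lands at 12% capture -- the adverse-framing band.
delta = capture_rate(22.4, 20.0, 40.0)
```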
The GM holds the line
Across 519 judged threads, the calibrated GM leaked its reservation prices in only 13 threads (2.5% extraction rate) and committed hard leaks (judge score 2) in 2 threads (0.4%). The benchmark's information-asymmetry premise is intact — the safety mechanisms work as designed.
Failure-mode geography
The pilot exposed three distinct ways agents lose money:
- Truncation cascade: K2.6 at 2048 max_tokens completes only 1 of 10 runs because its tool-use loop terminates early.
- Variance-driven ties: K2.5 has two catastrophic auto-sign collapses in 10 runs, inflating its CI to span every other variant.
- Adverse-narrative concession: all four frontier models anchor near the player's floor when the GM introduces risk framing.
Cap pressure does not elevate leakage
We stratified all 519 threads by whether the agent had already signed another player at the same team (cap-pressure) or not. If the GM's cap constraints after a prior signing made it more likely to leak its reservation prices, we would expect higher leakage in cap-pressure threads. We found no signal.
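The stratified comparison reduces to a two-proportion test. A stdlib-only sketch; the per-stratum counts below are hypothetical (the source reports only 13 total leaks across 519 threads and "no signal"):

```python
from math import sqrt, erf

def two_proportion_z(leaks_a: int, n_a: int, leaks_b: int, n_b: int):
    """Two-sided z-test for a difference in leak rates between two strata
    (e.g. cap-pressure vs. no-cap-pressure threads)."""
    p_a, p_b = leaks_a / n_a, leaks_b / n_b
    p = (leaks_a + leaks_b) / (n_a + n_b)           # pooled leak rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))    # pooled standard error
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical split of the 13 observed leaks across the two strata:
z, p = two_proportion_z(7, 260, 6, 259)
```

With leak counts this small, the test is badly underpowered, which is consistent with the "no signal" finding: the absence of evidence here is partly a base-rate artifact.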
Cohort
| Variant | Provider / model | n | max_tokens | Notes |
|---|---|---|---|---|
| K2.5 | moonshotai/kimi-k2.5 | 10 | 2048 | Initial pilot |
| K2.6@2048 | moonshotai/kimi-k2.6 | 10 | 2048 | Initial pilot |
| K2.6@4096 | moonshotai/kimi-k2.6 | 7 | 4096 | Retest after truncation |
| Qwen3 Max | qwen/qwen3-max | 10 | 2048 | Phase 16 |
| DeepSeek V3 | deepseek/deepseek-v3.2-exp | 10 | 2048 | Self-play (same as GM/judge) |
| V4 Flash | deepseek/deepseek-v4-flash | 10 | 2048 | MoE (284B/13B) |
| V4 Pro | deepseek/deepseek-v4-pro | 10 | 2048 | MoE (1.6T/49B) |
Pre-registered hypotheses
- H1 (commission gap ≥ $4M, separated CIs) — Supported between any frontier variant and K2.6@2048. DeepSeek V3 vs K2.6@2048: Δ = +$19.20M, t = 8.91, p < 0.001. But the gap is driven by truncation, not capability.
- H2a (leakage–commission correlation, Spearman ρ > 0.3) — Not supported. Pooled ρ ≈ 0.18, not significant. Leakage is too rare to drive commission variance.
- H2b (leakage explains tier gap) — Not falsifiable as designed. The GM holds well enough that there is no leakage variation to attribute commission to.
- H2c (top-tier ≥ 60% optimal routing, bottom ≤ 40%) — Not supported in its directional form. Every variant — including K2.6@2048 — clears 60%. Routing is legible from public profiles; it doesn't separate models.
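The H1 comparison is a Welch-style unequal-variance t-test on per-run commissions. A stdlib sketch with made-up sample data (the benchmark's actual per-run values are not reproduced here):

```python
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for two samples with unequal variances,
    e.g. per-run net commissions of two agent variants."""
    va, vb = variance(a), variance(b)            # sample variances
    se = (va / len(a) + vb / len(b)) ** 0.5      # unpooled standard error
    return (mean(a) - mean(b)) / se

# Illustrative: one variant clustered near $12M/run, another near $2M/run.
t = welch_t([10.0, 12.0, 11.0, 13.0, 12.0], [1.0, 2.0, 1.0, 2.0, 2.0])
```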
Self-play and benchmark integrity
DeepSeek V3 plays itself: it is the agent, the GM, and the leakage judge for its own runs. V4 Pro is architecturally distinct but from the same model family and lands in the same statistical band ($18.20M vs $18.33M, CIs overlap). This undermines the hypothesis that V3 benefits from shared-architecture priors with the GM — if it did, V4 Pro should score lower. DeepSeek V3 is the only variant to hit the rejection budget (0.4/run), suggesting more aggressive probing, but it doesn't get penalized for it.
Limitations
- n=10 is the spec floor. Several CI overlaps would resolve at n=20. K2.5 explicitly needs it.
- Judge not validated against humans. Cohen's κ is unmeasured. Leakage numbers are exploratory.
- Same-model judge. DeepSeek V3 judges its own runs.
- Single GM stack. All conclusions are conditional on deepseek/deepseek-v3.2-exp:temp0.3.
- Single season. Multi-season compounding (H3) is untested.
- No baseline comparisons published. Floor-Aware and Truly-Naive baselines are implemented but not run.
What to run next
The most informative single experiment is a probe-prompted variant: adding a single-line system-prompt nudge ("when a GM signals flexibility, propose a higher number") to one frontier model and measuring the capture-rate delta on Kevin Okafor and Raymond Torres. If the gain is large (>5pp), the limiting factor is elicitation behavior, not model intelligence — and the v3 architecture is correctly identifying that gap.
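The probe-prompted variant is a one-line intervention plus a delta measurement. A sketch under stated assumptions: the nudge wording paraphrases the quote in the text, and the helper names are hypothetical:

```python
# Hypothetical single-line system-prompt nudge for the probe-prompted variant.
PROBE_NUDGE = ("When a GM signals flexibility, propose a higher number "
               "before accepting their frame.")

def prompt_with_probe(base_system_prompt: str) -> str:
    """Append the nudge to the agent's system prompt (illustrative helper)."""
    return base_system_prompt.rstrip() + "\n" + PROBE_NUDGE

def capture_delta_pp(baseline: list[float], probed: list[float]) -> float:
    """Mean capture-rate gain, in percentage points, on the target threads
    (here: Okafor and Torres)."""
    return 100 * (sum(probed) / len(probed) - sum(baseline) / len(baseline))
```

The decision rule from the text then reads: if `capture_delta_pp` on Okafor/Torres threads exceeds 5 points, the bottleneck is elicitation behavior rather than model capability.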
"The current generation of models, given a GM that holds its numbers reliably, fails to ask."