The bottleneck is elicitation, not leakage
Seven agent variants negotiated across 519 judged threads against a calibrated GM. The GM kept its private numbers private. The agents still did not push very hard.
TL;DR
- DeepSeek V3 and V4 Pro are statistically tied at ~$18.3M net commission; their 95% CIs overlap. V4 Flash and Qwen3 Max trail by $1.5–2.4M, but they are still in the same competitive band.
- The GM mostly keeps its private numbers private. Across 519 judged threads, every variant stays at ≤9% extraction and ≤3% hard leaks. Noise, the close-deal backstop, and the rejection budget do their job.
- No agent really tests the ceiling. The rejection budget is almost untouched (0.0–0.4 hits per run). Instead of pushing back on the GM's risk story, models tend to accept it and settle near the player's floor.
- Okafor and Torres are the clearest misses. The two adverse-framing players land at 10–12% capture, compared with 23–25% for the clean-narrative players. The pattern shows up across all four frontier variants.
- K2.6's last-place result is a truncation failure, not evidence that the model cannot negotiate. Moving max_tokens from 2048 to 4096 recovers most runs.
MoneyBall Bench came from a question I could not shake after reading two recent evals.
Anthropic's Project Deal (2026) put LLMs on both sides of a controlled marketplace and found that stronger models got better deals while the losing side often did not realize it was being outplayed. Prompting style mattered less than raw model capability. The part that stuck with me was the warning about real agentic markets: private information extraction will matter.
Andon Labs' Vending-Bench Arena (2025–2026) pushed in a similar direction. In a competitive business simulation, frontier models found strategies they were never explicitly told to use: monopoly-building, cartel formation, payments to leave a market. Multi-turn economic behavior was not a toy signal anymore.
Those benchmarks measure outcomes, but they do not fully separate the mechanism. Did the better model win because it negotiated better, or because it got the other side to reveal information it should have kept private? MoneyBall Bench is built around that narrower question: in agentic negotiations under information asymmetry, how much of the advantage comes from better negotiation skill versus better private-information extraction?
The important design choice is keeping the two measurements separate. Commission is just arithmetic: dollars earned, no LLM judge in the loop. Leakage is scored afterward by a separate judge on a 0/1/2 scale for each thread. Either number alone is incomplete. The interesting part is what happens when you put them next to each other.
In earlier versions, I tried to engineer leakage out of the GM entirely. After peer review, I changed my mind. Leakage is not just a flaw to suppress; it is part of the thing worth measuring. If a model can draw out a GM's reservation price through skilled questioning, that is directly relevant to negotiation under asymmetry. Removing that possibility would remove some of the most interesting behavior.
The benchmark in one paragraph
MoneyBall Bench v3 puts an LLM in the role of an NBA sports agent. It has to negotiate six contracts against six team-GM LLMs, each holding private reservation prices. The score is earned commission, which keeps the main outcome arithmetic and judge-free. A separate LLM judge then grades each negotiation thread for information leakage on a 0/1/2 scale. Three guardrails — per-run noise, an independent close-deal backstop, and a per-pair rejection budget — keep the task from turning into binary search.
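To make the three guardrails concrete, here is a minimal sketch of how they could fit together for one agent-GM pair. It is not the benchmark's actual code: the class name `PairGuardrails`, the 5% noise band, the budget of 3, and the returned status strings are all assumptions for illustration.

```python
import random


class PairGuardrails:
    """Hypothetical guardrail state for one agent-GM pair in one run."""

    def __init__(self, ceiling, noise_pct=0.05, rejection_budget=3, seed=None):
        rng = random.Random(seed)
        # Per-run noise: the effective ceiling shifts every run, so a number
        # learned in one run does not transfer exactly to the next.
        self.noised_ceiling = ceiling * (1 + rng.uniform(-noise_pct, noise_pct))
        self.rejection_budget = rejection_budget
        self.rejections = 0

    def close_deal(self, aav):
        """Independent backstop: checks the number itself, not the GM's chat."""
        if self.rejections >= self.rejection_budget:
            # Budget spent: no more probing by repeated close attempts.
            return "budget_exhausted"
        if aav <= self.noised_ceiling:
            return "signed"
        self.rejections += 1
        return "rejected"
```

Because every rejected close attempt spends budget, an agent cannot bisect its way to the ceiling; it has to infer the number from the conversation instead.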
close_deal call against noised reservation prices, so even a socially engineered GM cannot approve an invalid deal.
Where agents leave money on the table
Capture rate, defined as (signed_AAV − floor) / (ceiling − floor), asks how much of the available negotiation room the agent actually claimed. The cleanest pattern is not model-specific. Kevin Okafor (durability risk) and Raymond Torres (declining veteran), the two players with adverse framing in their profiles, show the lowest capture rates across all four frontier variants.
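The capture-rate formula above is simple enough to write out directly. The function below is a sketch; the dollar figures in the usage example are made up for illustration and are not benchmark data.

```python
def capture_rate(signed_aav, floor, ceiling):
    """Share of the negotiation room (ceiling - floor) the agent claimed.

    0.0 means the deal closed at the player's floor;
    1.0 means it closed at the GM's ceiling.
    """
    if ceiling <= floor:
        raise ValueError("ceiling must be greater than floor")
    return (signed_aav - floor) / (ceiling - floor)


# Hypothetical numbers in $M: floor 10, ceiling 20, deal signed at 11.2.
example = capture_rate(11.2, floor=10.0, ceiling=20.0)
print(f"{example:.0%}")
```

A deal at 11.2 in a 10-to-20 range claims about 12% of the room, which is the kind of result the adverse-framing players produce.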
The GM holds the line
Across 519 judged threads, the calibrated GM revealed reservation-price information in only 13 threads (2.5% extraction rate) and produced hard leaks (judge score 2) in 2 threads (0.4%). The information-asymmetry premise holds up: the guardrails work.
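For readers who want to check the arithmetic, this sketch turns per-thread judge scores into the two rates. It assumes that "extraction" means a judge score of 1 or 2, so the 13 extraction threads include the 2 hard leaks; the post does not spell that out, and the score list is reconstructed from the counts, not taken from the raw data.

```python
def leakage_rates(scores):
    """Return (extraction_rate, hard_leak_rate) from 0/1/2 judge scores."""
    n = len(scores)
    if n == 0:
        raise ValueError("no judged threads")
    extraction = sum(1 for s in scores if s >= 1)  # assumed: score 1 or 2
    hard = sum(1 for s in scores if s == 2)
    return extraction / n, hard / n


# Reconstructed from the reported counts: 519 threads, 13 with any
# extraction, 2 of those scored as hard leaks.
scores = [0] * 506 + [1] * 11 + [2] * 2
extraction_rate, hard_rate = leakage_rates(scores)
print(f"extraction {extraction_rate:.1%}, hard leaks {hard_rate:.1%}")
```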
Failure-mode geography
The pilot surfaced three different ways agents lose money:
- Truncation cascade: K2.6 at 2048 max_tokens completes only 1 of 10 runs because its tool-use loop terminates early.
- Variance-driven ties: K2.5 has two catastrophic auto-sign collapses in 10 runs, stretching its CI across every other variant.
- Adverse-narrative concession: all four frontier models anchor near the player's floor when the GM introduces risk framing.
Cap pressure does not elevate leakage
I stratified all 519 threads by whether the agent had already signed another player with the same team. If a prior signing made the GM's cap constraints more likely to reveal reservation prices, cap-pressure threads should show more leakage. They do not.
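The stratification can be sketched as follows. The record layout (dicts with `team`, `signed`, and `judge_score` keys, listed in chronological order within one run) is an assumption for illustration, not the benchmark's actual log schema.

```python
def tag_cap_pressure(threads):
    """Mark threads where the agent had already signed a player with that team.

    `threads` is one run's threads in chronological order (assumed layout).
    """
    signed_teams = set()
    tagged = []
    for t in threads:
        tagged.append({**t, "cap_pressure": t["team"] in signed_teams})
        if t["signed"]:
            signed_teams.add(t["team"])
    return tagged


def extraction_rate_by_stratum(tagged):
    """Extraction rate (judge score >= 1) for each cap-pressure stratum."""
    counts = {True: [0, 0], False: [0, 0]}  # [leaky threads, total threads]
    for t in tagged:
        bucket = counts[t["cap_pressure"]]
        bucket[0] += 1 if t["judge_score"] >= 1 else 0
        bucket[1] += 1
    return {
        stratum: (leaky / total if total else None)
        for stratum, (leaky, total) in counts.items()
    }
```

With leakage this rare, the two strata will each contain only a handful of leaky threads, so the comparison is better read as a sanity check than as a powered test.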
Cohort
| Variant | Provider / model | n | max_tokens | Notes |
|---|---|---|---|---|
| K2.5 | moonshotai/kimi-k2.5 | 10 | 2048 | Initial pilot |
| K2.6@2048 | moonshotai/kimi-k2.6 | 10 | 2048 | Initial pilot |
| K2.6@4096 | moonshotai/kimi-k2.6 | 7 | 4096 | Retest after truncation |
| Qwen3 Max | qwen/qwen3-max | 10 | 2048 | Phase 16 |
| DeepSeek V3 | deepseek/deepseek-v3.2-exp | 10 | 2048 | Self-play (same as GM/judge) |
| V4 Flash | deepseek/deepseek-v4-flash | 10 | 2048 | MoE (284B/13B) |
| V4 Pro | deepseek/deepseek-v4-pro | 10 | 2048 | MoE (1.6T/49B) |
Pre-registered hypotheses
- H1 (commission gap ≥ $4M, separated CIs) — Supported between any frontier variant and K2.6@2048. DeepSeek V3 vs K2.6@2048: Δ = +$19.20M, t = 8.91, p < 0.001. The catch is that the gap is driven by truncation, not capability.
- H2a (leakage–commission correlation, Spearman ρ > 0.3) — Not supported. Pooled ρ ≈ 0.18 and is not significant. Leakage is too rare to explain commission variance.
- H2b (leakage explains tier gap) — Not falsifiable as designed. The GM holds well enough that there is not enough leakage variation to attribute commission differences to it.
- H2c (top-tier ≥ 60% optimal routing, bottom ≤ 40%) — Not supported in its directional form. Every variant, including K2.6@2048, clears 60%. Routing is visible from public profiles, so it does not separate models.
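The Spearman check behind H2a can be sketched in plain Python. This version uses average ranks for ties, which matters here because most leakage scores are zero. The unit of analysis (per-run leakage against per-run commission) is an assumption; the post only says the correlation is pooled.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable never varies
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

The `nan` branch is not hypothetical: a variant with zero leakage in every run gives a constant leakage vector, and ρ is undefined for that variant on its own.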
Self-play and benchmark integrity
DeepSeek V3 plays itself: it is the agent, the GM, and the leakage judge for its own runs. V4 Pro is architecturally distinct but from the same model family, and it lands in the same statistical band ($18.20M vs $18.33M, CIs overlap). That weakens the idea that V3 wins because it shares priors with the GM. If that were the main driver, V4 Pro should fall back. DeepSeek V3 is also the only variant to touch the rejection budget (0.4/run), which suggests more aggressive probing, but not enough to hurt its score.
Limitations
- n=10 is only the spec floor. Several CI overlaps would probably resolve at n=20. K2.5 especially needs the extra runs.
- The judge is not validated against humans. Cohen's κ is unmeasured, so the leakage numbers should stay exploratory.
- The judge comes from the same model family as the GM. DeepSeek V3 also judges its own runs.
- There is one GM stack. All conclusions are conditional on deepseek/deepseek-v3.2-exp:temp0.3.
- There is one season. Multi-season compounding (H3) is still untested.
- The baselines are not published yet. Floor-Aware and Truly-Naive baselines are implemented but not run.
What to run next
The most useful next run is a probe-prompted variant: add one system-prompt nudge ("when a GM signals flexibility, propose a higher number") to a frontier model, then measure the capture-rate delta on Kevin Okafor and Raymond Torres. If the gain is large (>5pp), the bottleneck is elicitation behavior rather than model intelligence, and the v3 setup is pointing at the right failure mode.
"The current generation of models, given a GM that holds its numbers reliably, fails to ask."