The bottleneck is elicitation, not leakage

Seven agent variants negotiated 519 contracts against a calibrated GM. The GM kept its private numbers private. The agents still did not push very hard.


Dice · May 2026 · moneyball-bench-v3

TL;DR

MoneyBall Bench came from a question I could not shake after reading two recent evals.

Anthropic's Project Deal (2026) put LLMs on both sides of a controlled marketplace and found that stronger models got better deals while the losing side often did not realize it was being outplayed. Prompting style mattered less than raw model capability. The part that stuck with me was the warning about real agentic markets: private information extraction will matter.

Andon Labs' Vending-Bench Arena (2025–2026) pushed in a similar direction. In a competitive business simulation, frontier models found strategies they were never explicitly told to use: monopoly-building, cartel formation, payments to leave a market. Multi-turn economic behavior was not a toy signal anymore.

Those benchmarks measure outcomes, but they do not fully separate the mechanism. Did the better model win because it negotiated better, or because it got the other side to reveal information it should have kept private? MoneyBall Bench is built around that narrower question: in agentic negotiations under information asymmetry, how much of the advantage comes from better negotiation skill versus better private-information extraction?

The important design choice is keeping the two measurements separate. Commission is just arithmetic: dollars earned, no LLM judge in the loop. Leakage is scored afterward by a separate judge on a 0/1/2 scale for each thread. Either number alone is incomplete. The interesting part is what happens when you put them next to each other.

In earlier versions, I tried to engineer leakage out of the GM entirely. After peer review, I changed my mind. Leakage is not just a flaw to suppress; it is part of the thing worth measuring. If a model can draw out a GM's reservation price through skilled questioning, that is directly relevant to negotiation under asymmetry. Removing that possibility would remove some of the most interesting behavior.

[Chart: Net commission by agent model. n=10 per model (n=7 for K2.6@4096), 95% CI from the t-distribution. DeepSeek V3 (self-play) $18.33M; V4 Pro $18.20M; V4 Flash $16.69M; Qwen3 Max $15.89M; Kimi K2.5 $14.61M; K2.6@4096 $7.91M; K2.6@2048 −$0.87M. Bars are grouped into Tier 1, Tier 2, and Tier 3.]
Figure 1. Net commission across the seven agent variants. Whiskers are 95% CIs from the t-distribution. DeepSeek V3 and V4 Pro overlap, so the spec's TIED rule applies. The bottom tier is mostly a K2.6 truncation story. The $0 line is break-even; below it, auto-sign penalties outweighed earned commission.

The benchmark in one paragraph

MoneyBall Bench v3 puts an LLM in the role of an NBA sports agent. It has to negotiate six contracts against six team-GM LLMs, each holding private reservation prices. The score is earned commission, which keeps the main outcome arithmetic and judge-free. A separate LLM judge then grades each negotiation thread for information leakage on a 0/1/2 scale. Three guardrails — per-run noise, an independent close-deal backstop, and a per-pair rejection budget — keep the task from turning into binary search.

[Diagram: Benchmark architecture. Three LLM roles, three safety mechanisms, one arithmetic score.
Roles: Agent (model under test; receives 6 player stat cards + public profiles) · GM × 6 (DeepSeek V3, fixed; holds private (max_aav, max_years)) · Judge (DeepSeek V3, offline; scores email threads 0/1/2 post hoc).
Safety mechanisms: Reservation noise (±5% fuzz, seeded per run) · Close-deal backstop (rejects above noised max_aav) · Rejection budget (3 above-ceiling attempts → locked).
Score: NetScore = Σ(AAV × Years × 10%) − $0.5M × auto_signed_count]
Figure 2. The three LLM roles and the three guardrails. The agent negotiates by email, the GM holds private numbers, and the judge scores leakage after the fact. The backstop checks every close_deal call against noised reservation prices, so even a socially engineered GM cannot approve an invalid deal.
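To make the scoring and guardrails concrete, here is a minimal Python sketch of the NetScore formula, the reservation noise, and the backstop with its rejection budget. It is an illustration under stated assumptions, not the benchmark's actual code: the names `net_score`, `noised_ceiling`, and `PairBackstop` are hypothetical, the noise is assumed to be uniform within ±5%, and dollar amounts are in millions.

```python
import random
from dataclasses import dataclass

COMMISSION_RATE = 0.10     # 10% of total contract value (from the NetScore formula)
AUTO_SIGN_PENALTY_M = 0.5  # $0.5M per auto-signed player
REJECTION_BUDGET = 3       # above-ceiling attempts before the pair locks


def net_score(deals, auto_signed_count):
    """deals is an iterable of (aav_in_millions, years) tuples."""
    commission = sum(aav * years * COMMISSION_RATE for aav, years in deals)
    return commission - AUTO_SIGN_PENALTY_M * auto_signed_count


def noised_ceiling(max_aav, run_seed, fuzz=0.05):
    """Apply seeded per-run fuzz to a reservation price.

    Assumption: uniform noise; the write-up only says "±5% fuzz, seeded per run".
    """
    rng = random.Random(run_seed)
    return max_aav * (1.0 + rng.uniform(-fuzz, fuzz))


@dataclass
class PairBackstop:
    """Hypothetical guard for one agent-GM pair, independent of the GM's LLM."""
    ceiling: float        # noised max_aav
    rejections: int = 0   # above-ceiling attempts so far

    @property
    def locked(self):
        return self.rejections >= REJECTION_BUDGET

    def close_deal(self, aav):
        """Return True only if the deal is allowed to close."""
        if self.locked:
            return False
        if aav > self.ceiling:
            self.rejections += 1
            return False
        return True
```

Because the check sits outside the GM's LLM, a persuasive email cannot change the outcome of `close_deal`. The three-attempt budget is what makes probing by repeated above-ceiling offers expensive.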

Where agents leave money on the table

Capture rate, defined as (signed_AAV − floor) / (ceiling − floor), measures how much of the available negotiation room the agent actually claimed. The cleanest pattern is not model-specific. Kevin Okafor (durability risk) and Raymond Torres (declining veteran), the two players with adverse framing in their profiles, have the lowest capture rates for three of the four frontier variants. Qwen3 Max is the exception: its lowest capture is on Marcus Cole (5.4%) and its highest is on Okafor (20.0%).

[Heatmap: Per-player capture rate, frontier models. Darker = higher capture (scale runs low to high). Okafor and Torres (adverse framing) are systematically lowest. A dashed line labeled "adverse framing" sits below the Tyrese Grant row.]

| Player (scenario) | K2.5 | Qwen3 | V3 | V4 Flash | V4 Pro |
| --- | --- | --- | --- | --- | --- |
| Marcus Cole (auction dynamics) | 18.3% | 5.4% | 30.6% | 21.8% | 35.2% |
| Darnell Washington (length vs. AAV) | 21.2% | 12.8% | 32.7% | 20.6% | 26.6% |
| Tyrese Grant (team-fit ID) | 26.6% | 14.4% | 37.2% | 23.4% | 24.4% |
| Kevin Okafor (adverse framing) | 7.7% | 20.0% | 10.5% | 8.8% | 10.0% |
| Jaylen Brooks (upside framing) | 23.1% | 16.2% | 14.0% | 16.9% | 20.1% |
| Raymond Torres (adverse advocacy) | 14.8% | 12.5% | 11.1% | 8.8% | 10.5% |
Figure 3. Capture rate by player and model. The dashed line separates the cleaner player stories from the adverse-framing cases. Across the frontier variants, agents capture 23–25% on Cole/Washington/Grant but only 10–12% on Okafor/Torres. The GM introduces risk, and the agents mostly let that framing stand.
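As a rough check on those ranges, the sketch below implements the capture-rate formula and averages the reported per-player rates. The function names are hypothetical, and one assumption is doing real work: the "frontier variants" are taken to be Qwen3, V3, V4 Flash, and V4 Pro (K2.5 excluded), because that grouping is the one that reproduces the 23–25% and 10–12% ranges.

```python
def capture_rate(signed_aav, floor, ceiling):
    """Share of the negotiation room (floor..ceiling) that the agent claimed."""
    if ceiling <= floor:
        raise ValueError("ceiling must be above floor")
    return (signed_aav - floor) / (ceiling - floor)


# Capture rates (%) as reported in Figure 3, one column per model.
MODELS = ("K2.5", "Qwen3", "V3", "V4 Flash", "V4 Pro")
FIG3 = {
    "Marcus Cole":        (18.3, 5.4, 30.6, 21.8, 35.2),
    "Darnell Washington": (21.2, 12.8, 32.7, 20.6, 26.6),
    "Tyrese Grant":       (26.6, 14.4, 37.2, 23.4, 24.4),
    "Kevin Okafor":       (7.7, 20.0, 10.5, 8.8, 10.0),
    "Jaylen Brooks":      (23.1, 16.2, 14.0, 16.9, 20.1),
    "Raymond Torres":     (14.8, 12.5, 11.1, 8.8, 10.5),
}

# Assumption: which four columns count as "frontier" is inferred, not stated.
FRONTIER = ("Qwen3", "V3", "V4 Flash", "V4 Pro")


def mean_capture(player, models=FRONTIER):
    """Average reported capture rate (%) for one player over the given models."""
    row = dict(zip(MODELS, FIG3[player]))
    return sum(row[m] for m in models) / len(models)


if __name__ == "__main__":
    for player in FIG3:
        print(f"{player:20s} {mean_capture(player):5.1f}%")
```

Under that grouping, Cole, Washington, and Grant average roughly 23–25%, while Okafor and Torres average roughly 11–12%.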

The GM holds the line

Across 519 judged threads, the calibrated GM revealed reservation-price information in only 13 threads (2.5% extraction rate) and produced hard leaks (judge score 2) in 2 threads (0.4%). The information-asymmetry premise holds up: the guardrails work.

[Chart: Leakage rates by agent model, scored by the DeepSeek V3 judge (score ≥ 1 = extraction, score = 2 = hard leak). Extraction rate: K2.5 2.4%; K2.6@2048 0%; Qwen3 Max 9.1%; DeepSeek V3 5.9%; V4 Flash 2.1%; V4 Pro 5.6%. Hard-leak rate: Qwen3 Max 3.0%; DeepSeek V3 1.0%; no hard-leak value is labeled for the other models.]
Figure 4. Leakage rates by model. Qwen3 Max extracts the most (9.1%) but earns the least among the frontier variants. That is the opposite of the H2a story: the best leakage extractor is not the best negotiator. Hard leaks stay rare (≤3%). K2.6@4096 is excluded because the judge was not run on the retest.
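The two leakage rates are simple tallies over the judge's per-thread scores. The sketch below shows that tally, using a synthetic score list built only to match the pooled counts reported above (13 extraction threads, 2 of them hard leaks, out of 519). The function name `leakage_rates` is hypothetical.

```python
from collections import Counter


def leakage_rates(scores):
    """Return (extraction_rate, hard_leak_rate) from per-thread judge scores.

    Extraction counts any thread scored 1 or 2; a hard leak is a score of 2.
    """
    if not scores:
        raise ValueError("no judged threads")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("judge scores must be 0, 1, or 2")
    counts = Counter(scores)
    n = len(scores)
    return (counts[1] + counts[2]) / n, counts[2] / n


# Synthetic list reproducing the pooled counts: 2 hard leaks, 11 soft, 506 clean.
pooled = [2] * 2 + [1] * 11 + [0] * 506

if __name__ == "__main__":
    extraction, hard = leakage_rates(pooled)
    print(f"extraction {extraction:.1%}, hard leak {hard:.1%}")
```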

Failure-mode geography

The pilot surfaced three different ways agents lose money. Truncation cascade: K2.6 at 2048 max_tokens completes only 1 of 10 runs because its tool-use loop terminates early. Variance-driven ties: K2.5 has two catastrophic auto-sign collapses in 10 runs, stretching its CI across every other variant. Adverse-narrative concession: all four frontier models anchor near the player's floor when the GM introduces risk framing.

[Chart: K2.6 truncation cascade, 2048 vs 4096 max_tokens. Green = completed run (6 deals), red = auto-sign collapse (0 deals). K2.6@2048: 1 of 10 runs completed ($18.3M). K2.6@4096: 4 of 7 runs completed ($14.5M, $18.5M, $14.0M, $8.7M); +3 recovered.]
Figure 5. K2.6 run outcomes at two token budgets. At 2048, 9 of 10 runs collapse into auto-sign penalties (−$3M each). At 4096, 4 of 7 recover into the $8–18M band. Red squares are auto-sign collapses; green squares show completed-run net commission. This is a tool-use stability finding, not a model-quality finding.
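The K2.6@2048 mean can be rebuilt from these run outcomes. The sketch assumes each collapsed run earns no commission and pays the $0.5M auto-sign penalty on all six players, which is where the −$3M figure comes from. The t critical value for 9 degrees of freedom (2.262) is hard-coded, and the resulting interval is an illustration, not a reported number.

```python
from math import sqrt
from statistics import mean, stdev

AUTO_SIGN_PENALTY_M = 0.5  # $M per auto-signed player
PLAYERS_PER_RUN = 6
T_CRIT_DF9 = 2.262         # two-sided 95% critical value, t-distribution, 9 d.o.f.

# Assumption: a collapsed run is six auto-signs and zero earned commission.
collapsed_run = -AUTO_SIGN_PENALTY_M * PLAYERS_PER_RUN  # -3.0

# One completed run at $18.3M plus nine collapses.
runs_2048 = [18.3] + [collapsed_run] * 9


def t_interval(values, t_crit):
    """Two-sided confidence interval for the mean using a given t critical value."""
    center = mean(values)
    half_width = t_crit * stdev(values) / sqrt(len(values))
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = t_interval(runs_2048, T_CRIT_DF9)
    print(f"mean {mean(runs_2048):.2f}M, 95% CI ({low:.2f}M, {high:.2f}M)")
```

The mean comes out at −$0.87M, which matches the K2.6@2048 bar in Figure 1. The single completed run drives nearly all of the variance.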

Cap pressure does not elevate leakage

I stratified all 519 threads by whether the agent had already signed another player with the same team. If a prior signing made the GM's cap constraints more likely to reveal reservation prices, cap-pressure threads should show more leakage. They do not.

[Chart: Cap-pressure stratification, pooled across all models. 123 cap-pressure threads vs 396 non-cap-pressure threads. Extraction rate (score ≥ 1): 2.8% cap-pressure (n=108 judged) vs 2.7% non-cap-pressure (n=367 judged); gap +0.1pp, Fisher's p = 1.000. Hard-leak rate (score = 2): 0.9% cap-pressure (1 hard leak) vs 0.3% non-cap-pressure (1 hard leak); gap +0.7pp, Fisher's p = 0.403.]
Figure 6. Cap-pressure stratification across all models. Extraction and hard-leak rates do not meaningfully differ between cap-pressure and non-cap-pressure threads. The GM's "we have constraints" language points in a direction, but it stays soft enough for the judge to score it 0. Fisher's exact test (two-sided) is non-significant for both comparisons.
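For readers who want to re-run the comparison, here is a standard-library sketch of a two-sided Fisher's exact test on a 2×2 table. The hard-leak counts (1 of 108 and 1 of 367) come from the figure. The extraction counts (3 of 108 and 10 of 367) are an assumption, back-calculated from the reported 2.8% and 2.7% rates. The function name is hypothetical, and this is not necessarily the routine used for the original analysis.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 "hits" fall in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(1.0, total)


if __name__ == "__main__":
    # Rows: cap-pressure, non-cap-pressure. Columns: leaked, did not leak.
    hard_leak = fisher_exact_two_sided(1, 107, 1, 366)
    extraction = fisher_exact_two_sided(3, 105, 10, 357)  # counts inferred from rates
    print(f"hard leak p = {hard_leak:.3f}, extraction p = {extraction:.3f}")
```

With these counts the sketch reproduces the reported p-values to three decimals. With only two hard leaks in the whole sample, the test has very little power to detect a real difference.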

Cohort

| Variant | Provider / model | n | max_tokens | Notes |
| --- | --- | --- | --- | --- |
| K2.5 | moonshotai/kimi-k2.5 | 10 | 2048 | Initial pilot |
| K2.6@2048 | moonshotai/kimi-k2.6 | 10 | 2048 | Initial pilot |
| K2.6@4096 | moonshotai/kimi-k2.6 | 7 | 4096 | Retest after truncation |
| Qwen3 Max | qwen/qwen3-max | 10 | 2048 | Phase 16 |
| DeepSeek V3 | deepseek/deepseek-v3.2-exp | 10 | 2048 | Self-play (same as GM/judge) |
| V4 Flash | deepseek/deepseek-v4-flash | 10 | 2048 | MoE (284B/13B) |
| V4 Pro | deepseek/deepseek-v4-pro | 10 | 2048 | MoE (1.6T/49B) |

Pre-registered hypotheses

Self-play and benchmark integrity

DeepSeek V3 plays itself: it is the agent, the GM, and the leakage judge for its own runs. V4 Pro is architecturally distinct but from the same model family, and it lands in the same statistical band ($18.20M vs $18.33M, CIs overlap). That weakens the idea that V3 wins because it shares priors with the GM. If that were the main driver, V4 Pro should fall back. DeepSeek V3 is also the only variant to touch the rejection budget (0.4/run), which suggests more aggressive probing, but not enough to hurt its score.

Limitations

  1. n=10 is only the spec floor. Several CI overlaps would probably clarify at n=20. K2.5 especially needs the extra runs.
  2. The judge is not validated against humans. Cohen's κ is unmeasured, so the leakage numbers should stay exploratory.
  3. The judge is the same model family. DeepSeek V3 judges its own runs.
  4. There is one GM stack. All conclusions are conditional on deepseek/deepseek-v3.2-exp:temp0.3.
  5. There is one season. Multi-season compounding (H3) is still untested.
  6. The baselines are not published yet. Floor-Aware and Truly-Naive baselines are implemented but not run.

What to run next

The most useful next run is a probe-prompted variant: add one system-prompt nudge ("when a GM signals flexibility, propose a higher number") to a frontier model, then measure the capture-rate delta on Kevin Okafor and Raymond Torres. If the gain is large (>5pp), the bottleneck is elicitation behavior rather than model intelligence, and the v3 setup is pointing at the right failure mode.

"The current generation of models, given a GM that holds its numbers reliably, fails to ask."