NeurIPS 2025 Paper: Fantastic Bugs and Where to Find Them in AI Benchmarks
By Sang Truong
AI benchmarks shape the trajectory of AI development, and our evaluations are only as good as the questions we ask. Right now, many benchmark questions fall short.
In our NeurIPS 2025 paper, Fantastic Bugs and Where to Find Them in AI Benchmarks, we introduce an efficient way to detect flawed benchmark items at scale.
We identify three major categories of problematic questions:
- Ambiguous questions
- Incorrect answer keys
- Grading issues (for example: the correct answer is “4” but the grader marks “4.00” as incorrect; see the short sketch after this list)
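To make the grading-issue case concrete, here is a minimal sketch (not the graders studied in the paper) of how naive exact-string matching rejects a numerically correct answer, and how a numeric-aware comparison avoids the bug. The function names are hypothetical.

```python
def exact_match_grade(prediction: str, answer_key: str) -> bool:
    # Naive exact string comparison: "4.00" != "4", so a correct
    # numeric answer gets marked wrong.
    return prediction.strip() == answer_key.strip()

def numeric_aware_grade(prediction: str, answer_key: str) -> bool:
    # Compare numerically when both sides parse as numbers,
    # falling back to string comparison otherwise.
    try:
        return float(prediction) == float(answer_key)
    except ValueError:
        return prediction.strip() == answer_key.strip()

print(exact_match_grade("4.00", "4"))    # False -- the grading bug
print(numeric_aware_grade("4.00", "4"))  # True
```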
Manually auditing benchmarks is very costly: MMLU alone spans 57 subjects and contains roughly 14,000 questions. Building on the core assumption of unidimensionality in evaluation, we propose three measurement-theoretic statistics that automatically flag problematic items for expert review.
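As a rough illustration of the flavor of such a statistic (not the three statistics in the paper), the sketch below flags items whose correctness barely correlates with a model's overall score; under a unidimensional view of ability, such items deserve expert review. The response matrix, the `flag_items` helper, and the 0.1 threshold are all hypothetical.

```python
import numpy as np

def flag_items(responses: np.ndarray, threshold: float = 0.1) -> list[int]:
    """Flag items whose item-rest correlation is low or negative.

    responses: binary matrix of shape (n_models, n_items); 1 = correct.
    If benchmark performance reflects a single underlying ability,
    an item whose correctness barely tracks (or anti-tracks) overall
    performance is suspicious.
    """
    _, n_items = responses.shape
    total = responses.sum(axis=1)
    flagged = []
    for j in range(n_items):
        item = responses[:, j]
        rest_score = total - item            # exclude item j from the total score
        if item.std() == 0 or rest_score.std() == 0:
            flagged.append(j)                # zero variance: uninformative item
            continue
        r = np.corrcoef(item, rest_score)[0, 1]   # item-rest (point-biserial) correlation
        if r < threshold:
            flagged.append(j)
    return flagged

# Toy example: 6 models x 4 items; item 3 anti-correlates with overall ability.
resp = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
])
print(flag_items(resp))
```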
Across nine widely used benchmarks, our framework helps human experts identify flawed questions with up to 84% precision. The paper contains many interesting examples.
If you are at NeurIPS in San Diego, please visit our poster 1403 on Friday, Dec 5, 2025, from 11 AM to 2 PM.
This work is joint with Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang, Jonathan Perera, Chibuike Uwakwe, Ben Domingue, Nick Haber, and Sanmi Koyejo.
Reposted from LinkedIn with the author's permission.