From Benchmarks to Real-World Impact: Causal Science Conference Explores Modern Challenges in AI Evaluation
By Yifan Guo, Anushka Murthy, Wenqian Xing, Zhenghao Zeng (Tiger), Aditya Ghosh
On April 24, 2026, the Stanford Causal Science Conference: Frontiers in AI Evaluation brought together leading researchers from academia and industry to examine how causal reasoning, evaluation, and measurement are reshaping the future of AI systems. Across sessions spanning AI alignment, benchmarking, multimodal evaluation, education technology, software reliability, and enterprise AI agents, speakers explored a common challenge: how to build and evaluate AI systems at scale that remain trustworthy, interpretable, and useful as they become more capable and more deeply integrated into real-world workflows.
The discussions highlighted the growing importance of rigorous evaluation frameworks, causal inference methods, and human-centered system design. Speakers emphasized that benchmark performance alone is no longer sufficient; instead, the field must increasingly focus on how AI systems interact with people, adapt after deployment, and generate meaningful real-world outcomes. Together, the sessions offered a broad view of the technical, scientific, and societal questions shaping the next generation of AI research.
Session 1: World Models, Alignment, and Multimodal Evaluation
AI’s Models of the World, and Ours – Jon Kleinberg
Jon Kleinberg used chess as a grounded case study for thinking about domains where superhuman AI has already been present for decades. He described how chess engines have reshaped human expertise: spectators with engines can now see moves that even elite players miss, aesthetic principles no longer reliably predict correctness, and AI assistance can create an illusion of understanding. He discussed Maia, a chess model trained to predict human moves at different skill levels, and used it to explore how AI systems can better model human behavior. A central theme was “handoff”: powerful systems may make moves that are optimal for themselves but leave humans unable to continue effectively. Kleinberg argued that designing AI partners requires not only superhuman performance, but also an ability to keep humans in positions they can understand and act from. He closed by connecting this to broader questions about implicit world models in sequence-generating systems.
Why We Must Go Beyond Post-Training for Robust AI Alignment – Dylan Hadfield-Menell
Dylan Hadfield-Menell argued that current post-training approaches to alignment, including RLHF-style methods, are flexible but fundamentally brittle. He reviewed how post-training can select helpful and safe behaviors from broad pre-trained models, but emphasized that these safeguards are often easily bypassed through jailbreaks, simple activation interventions, benign fine-tuning, or model tampering. He presented evidence that safety behavior can drift unpredictably after domain-specific fine-tuning, especially in high-stakes areas such as medical AI, and that existing safety benchmarks often disagree or fail under small evaluation changes. As an alternative direction, he proposed reducing the attack surface through domain-specific scoping, inspired by computer security, and discussed early work using feature filtering to preserve in-domain performance while limiting out-of-domain capabilities. He concluded that more robust alignment may require moving part of the alignment process into pre-training through data filtering, recontextualization, and training-time steering.
Scalable Evaluation of Multimodal AI Systems for Creative Optimization – Bahareh Azarnoush
Bahareh Azarnoush described Netflix’s approach to evaluating multimodal AI systems used for creative optimization across promotional assets, localization, artwork, trailers, synopses, and dubbing. She emphasized a creative-first philosophy in which AI supports and scales expert creative judgment rather than replacing it. The talk framed evaluation as the control system for AI development: offline evaluation builds reusable rubrics, golden datasets, and calibrated model-based judges, while online evaluation connects outputs to real member behavior and long-term member value. A key theme was the need to combine intrinsic evaluation of media quality with extrinsic evaluation of causal impact on member outcomes. Through a case study, she showed how creative rubrics built on criteria such as precision, tone, factuality, and clarity can be scaled using LLM judges and then linked causally to short-term engagement signals and long-term retention. The broader takeaway was that rigorous, causally grounded evaluation is essential for turning AI R&D into reliable business and member value.
Session Discussion
Throughout the session, discussions focused on how to evaluate and govern AI systems once they become capable of shaping human workflows rather than simply assisting with isolated tasks. A recurring theme was that surface-level success can be misleading. For example, chess engines may give humans the illusion of understanding a position, aligned models may appear safe until they are lightly modified, and creative AI systems may seem impressive without actually improving outcomes.
Speakers emphasized the importance of evaluation methods that extend beyond initial performance and examine what happens after deployment, including whether humans can effectively take over from AI systems, whether safety properties remain stable after adaptation, and whether model outputs genuinely support the values and objectives they are intended to serve.
Session 2: AI in Human Workflows: Lessons from Clinical Scribing and Education
Evaluation Under Pressure: Lessons from Deploying Clinical AI at Scale – Zachary Lipton
Zachary Lipton described his experience turning doctor-patient audio into clinical documentation at scale. Most of the talk focused on the evaluation problems that emerge once a system like this is in production. The shift from a modular pipeline to end-to-end LLM generation removed any clean reference output, so the team now does reference-free hill climbing in a loosely defined space. Rubrics keep growing as customer complaints arrive. LLM-as-judge becomes essential for iteration speed, but it raises a who-judges-the-judges problem. Zachary also flagged a development-evaluation inversion, where models are easier to build than to evaluate. He raised a related tension between staying competitive and staying statistically rigorous as base models keep changing. He closed by calling frontier AI companies a rich and mostly untouched playground for the causal inference community.
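One common way to approach the who-judges-the-judges problem is to check an LLM judge against a small set of human-labeled examples and report chance-corrected agreement. The sketch below computes Cohen's kappa for that purpose. It is an illustration only, not the method described in the talk; the function name and the pass/fail labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight clinical notes (illustrative, not real data).
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
kappa = cohens_kappa(human, judge)
```

A kappa near 1 suggests the judge tracks human raters closely, while a value near 0 means agreement is no better than chance. Raw agreement alone can look high simply because one label dominates.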
AI and Human Learning – Emma Brunskill
Emma Brunskill’s talk asked whether EdTech actually improves student outcomes at the levels at which students really use it, beyond what randomized trials in controlled settings show. She described joint work on the MAP Accelerator product with Khan Academy and economists, using a large panel of students across multiple school years. They use peer usage in the same classroom as a proxy for a given student’s own usage, after adjusting for fixed effects. A placebo test on reading scores supports the assumption. Even at very low average usage, the analysis finds a positive effect on math scores. Effects are larger at recommended usage levels. Higher-performing students gain more per hour of usage. Teachers who focus on skills mastered, rather than just time spent, also see higher per-hour gains. Emma closed by arguing that the field should stop looking for an education silver bullet. AI should be treated as one piece of a combination lock of many small effects.
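To make the peer-usage idea concrete, the sketch below computes, for each student, the average usage of the other students in the same classroom (a leave-one-out mean). This is an assumption about how such a proxy might be constructed, not a description of the study's actual pipeline. The record layout and function name are hypothetical, and the fixed-effects adjustment mentioned above is not shown.

```python
from collections import defaultdict


def leave_one_out_peer_usage(records):
    """Map each student to the mean usage of their classmates.

    records: iterable of (student_id, classroom_id, usage_minutes).
    Students with no classmates are omitted from the result.
    """
    records = list(records)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, classroom, minutes in records:
        totals[classroom] += minutes
        counts[classroom] += 1

    peer_usage = {}
    for student, classroom, minutes in records:
        n_peers = counts[classroom] - 1
        if n_peers > 0:
            # Exclude the student's own usage so the proxy reflects peers only.
            peer_usage[student] = (totals[classroom] - minutes) / n_peers
    return peer_usage


# Hypothetical usage records (illustrative, not real data).
example = [("a", "c1", 10), ("b", "c1", 20), ("c", "c1", 30), ("d", "c2", 5)]
peers = leave_one_out_peer_usage(example)
```

Excluding a student's own minutes matters because including them would mechanically tie the proxy to the very quantity it stands in for.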
Session Discussion
This session centered on practical challenges in evaluating AI systems once they become part of human workflows. In clinical documentation, evaluation after deployment is no longer just a matter of comparing outputs to a fixed reference, and evaluation frameworks have yet to catch up with how quickly systems can now be changed. In education, the effect of a platform's use depends not only on the technology itself, but also on how teachers implement it, how much students use it, and whether usage is tied to meaningful learning goals. Broadly, both talks suggested that once AI systems are deployed at scale, how users adapt to the tool and how it is implemented become as central to evaluation as any benchmark performance.
Session 3: Evaluating the Frontier: Measuring Rapidly Improving AI Systems
The Benchmark Problem – Benjamin Recht
Ben Recht argued that the surprising linear relationship between ImageNetV1 and ImageNetV2 accuracy is not just a quirk of distribution shift, but evidence that benchmarks can behave like calibrated tests of a shared latent ability. Using item response theory, he framed models as having “aptitudes” and benchmark examples as having “hardness,” so that the full matrix of which models get which items right contains more measurement information than average accuracy alone. Under this view, lines on a probit scale arise naturally when two benchmarks measure the same underlying skill but differ in item difficulty, which suggests that benchmark design should focus on validity, calibration, and test construction rather than only iid generalization. The broader takeaway was that ML evaluation may need a measurement-theoretic foundation: instead of treating new test sets mainly as defenses against overfitting or distribution shift, we can use calibrated item parameters and adaptive testing ideas to build better, more reliable evaluations of model capability.
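The probit-scale claim can be illustrated with a small simulation. The sketch below assumes a one-parameter probit item response model, P(correct) = Phi(aptitude - hardness), with Gaussian item hardness. Under that assumption, expected accuracy is Phi((aptitude - mu) / sqrt(1 + sigma^2)), so the probit of accuracy is linear in aptitude on each benchmark, and therefore linear across two benchmarks. The benchmark parameters and aptitude grid are hypothetical and are not taken from the talk.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def expected_accuracy(aptitude, hardness):
    """Average P(correct) over items, with P(correct) = Phi(aptitude - hardness)."""
    return fmean(STD_NORMAL.cdf(aptitude - b) for b in hardness)


def pearson(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
# Two hypothetical benchmarks measuring the same skill; the second is harder.
hardness_v1 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
hardness_v2 = [rng.gauss(0.8, 1.0) for _ in range(2000)]
# Hypothetical model aptitudes from -1.0 to 2.0.
aptitudes = [-1.0 + 0.25 * k for k in range(13)]

acc_v1 = [expected_accuracy(a, hardness_v1) for a in aptitudes]
acc_v2 = [expected_accuracy(a, hardness_v2) for a in aptitudes]

# Map accuracies to the probit scale and measure how close to a line they fall.
probit_v1 = [STD_NORMAL.inv_cdf(p) for p in acc_v1]
probit_v2 = [STD_NORMAL.inv_cdf(p) for p in acc_v2]
probit_corr = pearson(probit_v1, probit_v2)
```

Every model scores lower on the harder benchmark, yet the probit-transformed accuracies line up almost perfectly, which is the pattern one would expect if both benchmarks measure a single shared ability.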
Benchmarking to Advance the AI Frontier – Ofir Press
Ofir Press described his approach to building benchmarks that drive AI progress. He drew on his work creating SWE-bench, as well as more recent benchmarks like Critical Point, AlgoTune, and CodeClash. He emphasized that benchmarks are perishable assets. A well-designed benchmark should start near zero accuracy. It typically gets solved within a few years of release. He laid out three rules for designing benchmarks. First, build benchmarks that correlate with real-world usefulness, rather than abstract IQ-style measures. Second, make them genuinely hard, ideally starting at zero or near-zero accuracy. Saturated benchmarks no longer carry signal about where the frontier is moving. Third, ensure the answers can be deterministically verified. LLM-as-judge approaches remain too unreliable to serve as ground truth, and most of the real work of benchmark building goes into constructing the verifier. Ofir also traced a rough history of the field. It moved from school-level math problems, through graduate-level exams, to real human-day tasks like SWE-bench. It is now moving toward longer team-scale tasks, and even verifiable tasks that no human has yet solved.
Keeping up with AI capabilities – David Rein
David Rein’s talk centered on how hard it is to build benchmarks that keep pace with rapidly improving AI systems. He illustrated this with his own track record. Benchmarks he released, like GPQA and METR’s time-horizon evaluation, saturated almost as soon as they were created. The METR time-horizon metric measures the length of a human task that a given model can complete with fifty percent probability. On a log scale, this quantity has grown roughly linearly over recent years. Current frontier models handle tasks that take humans on the order of a day. A central tension was that good benchmarks need cheap, deterministic verification, yet any task with an easy algorithmic reward signal can often be rapidly optimized against. He laid out three possible responses: keep extending time-horizon benchmarks, perhaps with more manual scoring; stitch many benchmarks together using item response theory; or shift toward measuring real-world impact, though these measures are often lagging and hard to interpret. His broader takeaway was that, given the recent pace of progress, we should expect benchmark saturation to keep accelerating and take seriously the possibility that models may soon handle tasks lasting weeks or even years of human work.
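A fifty-percent time horizon of this kind can be estimated by fitting a logistic curve of success probability against the logarithm of human task time and solving for the length at which predicted success is one half. The sketch below does this with Newton's method on made-up data. It is a simplified illustration under those assumptions, not METR's actual procedure, and the task lengths and success counts are hypothetical.

```python
import math


def fit_logistic(xs, successes, attempts, iterations=25):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method (binomial likelihood)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, s, n in zip(xs, successes, attempts):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            # Gradient of the log-likelihood.
            g_a += s - n * p
            g_b += (s - n * p) * x
            # Negative Hessian (Fisher information) entries.
            w = n * p * (1.0 - p)
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Hypothetical results: human task length in minutes, successes out of attempts.
task_minutes = [1, 4, 16, 64, 256, 1024]
successes = [10, 9, 7, 5, 2, 1]
attempts = [10, 10, 10, 10, 10, 10]

log_minutes = [math.log2(m) for m in task_minutes]
intercept, slope = fit_logistic(log_minutes, successes, attempts)

# Predicted success is 50% where intercept + slope * log2(minutes) = 0.
horizon_minutes = 2.0 ** (-intercept / slope)
```

Working in log task length is what makes the trend readable: if the horizon doubles at a steady rate, it traces a straight line on a log scale over time.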
Session Discussion
The discussion centered on what makes a benchmark genuinely useful as AI systems become more capable. One theme was verification: the speakers emphasized that benchmarks with deterministic rewards are desirable, but acknowledged that when clean verification is unavailable, one may need to decompose tasks into smaller, checkable subtasks. A related principle was that benchmarks should be designed so that optimizing for benchmark performance also makes the model more useful in the real world, rather than merely teaching it to exploit an artificial metric. The discussion then pivoted to a broader forecasting concern: if AI systems are already improving productivity inside AI labs, this could create a recursive acceleration in model progress, making benchmark saturation even harder to manage. The discussion also highlighted a limitation of current time-horizon evaluations like METR’s: many tasks are isolated from human interaction, so they may miss bottlenecks that arise in real workflows where models need to coordinate with, ask questions of, or adapt to people.
Session 4: Software Agents in Production: Reliability, Evaluation, and Security
Causal inference for software reliability – Anish Agarwal
Anish Agarwal argued that software reliability is a natural but underdeveloped area for causal inference: modern observability tools show many correlations during outages, but teams really need help identifying root causes across logs, metrics, traces, deploys, code, tickets, and organizational knowledge. He framed Traversal’s work as moving toward “self-driving production”, where agents can traverse complex system graphs, diagnose incidents, and eventually help mitigate them, while noting that evaluation is difficult because the incident ground truth is private, noisy, and often only partially correct.
Making coding agents trustworthy at Snowflake – Anupam Datta
Anupam Datta focused on making enterprise coding agents trustworthy, using Snowflake’s Cortex Code as an example. His talk emphasized three pillars: evaluating agents beyond the final answers through goals, plans, and actions; optimizing agents by giving them domain-specific tools like SQL execution and data lineage, so they can solve data-engineering tasks more efficiently; and securing agents against risks such as prompt injection, malicious MCP servers, credential leakage, and unsafe actions. Together, the talks argued that deploying agents in enterprise environments requires not just better models, but strong causal reasoning, process-level evaluation, tool design, and security guardrails.
Session Discussion
After Anish Agarwal’s talk, the discussion focused on how to evaluate long-horizon software-reliability agents. He said evaluation should combine outcome-based measures at different levels of granularity: did the agent find the right “haystack”, the right microservice, the right root cause, or the right PR/commit? He also noted that absolute scoring is hard, so relative comparisons, such as Bradley–Terry-style evaluations, may be more practical. A second question pushed on benchmarks: unlike coding, incident-response data is mostly private, company-specific, and hard for LLMs to handle because it involves time-series, log, and metric data. Synthetic chaos-engineering environments help somewhat, but they do not fully capture production-scale failures.
After Anupam Datta’s Snowflake/Cortex Code talk, the discussion centered on evaluating agents for data science and data analytics tasks. He said Snowflake uses a mix of public benchmarks such as DABstep, internally curated datasets created by data scientists, and data from external vendors building RL-style evaluation environments. His main point was that there is no magic bullet: high-quality data-analysis evals still require human experts, sometimes assisted by LLMs, to curate and manually verify tasks.
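The Bradley–Terry-style relative comparison raised in the discussion of Agarwal's talk can be sketched as follows. Given counts of how often one agent's diagnosis was preferred over another's, the model assigns each agent a strength such that P(i preferred over j) = s_i / (s_i + s_j). The fit below uses the standard minorization-maximization update. The win counts are hypothetical, and nothing here reflects how Traversal actually scores its agents.

```python
def bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of comparisons in which agent i was preferred
    over agent j. Every agent must appear in at least one comparison.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons involving i, weighted by the current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Hypothetical preference counts among three agents (illustrative only).
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
```

Pairwise preferences of this kind are often easier for reviewers to give consistently than absolute scores, which is the practical appeal of relative evaluation when ground truth is noisy.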