Big Questions, Bold Ideas: 2026 Winter Forum Recap

By Sergio Alvarez

The C4DU Winter Forum on February 12, 2026, convened researchers from across Stanford for an afternoon of talks and discussion on data-driven discovery in the physical sciences. Across two sessions—Forward Modeling the Universe and Agentically Discovering the Universe—speakers highlighted both the scientific opportunities created by increasingly complex datasets and the practical challenges of building reliable, reproducible workflows with rapidly improving AI systems. The invited talks are summarized below, followed by a photo gallery.

Center for Decoding the Universe Updates and Key Projects — Risa Wechsler, Susan Clark, Ben Nachman

In this edition of the C4DU Forum, Risa Wechsler offered a warm welcome and underscored the Center’s interdisciplinary scope, framing “decoding the universe” broadly and inviting participation well beyond astrophysics. Ben Nachman highlighted upcoming community events and described connections between C4DU and the U.S. Department of Energy’s Genesis mission, including efforts to build AI-ready datasets and foundation-model systems relevant to astrophysics and cosmology. Susan Clark previewed the two sessions and emphasized a recurring theme for the day: pairing ambitious data-driven science with careful evaluation and human-in-the-loop judgment.

Session 1: Forward Modeling the Universe

Symmetry Breaking in Transformers for Efficient and Interpretable Training — Eva Silverstein

Eva Silverstein framed modern AI models as complex systems and argued that importing tools from physics can expose the structure of their learning dynamics. Focusing on transformer attention, she highlighted continuous symmetries in the architecture and connected them, via Noether's theorem, to conserved quantities that can shape optimization. To break these symmetries, she presented an approach that injects bias-like terms during training (sampled rather than learned), aiming to improve training efficiency and encourage more interpretable internal organization. She also outlined a Hamiltonian perspective on optimization and discussed early empirical results and open questions, including how such interventions might interact with scaling and downstream adaptation.
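
As a rough illustration of the idea (not Silverstein's implementation), the sketch below adds a fixed, randomly sampled bias to the pre-softmax scores of a single-head attention layer; the bias is drawn once rather than learned, so it explicitly breaks a continuous symmetry of the vanilla layer. The placement of the bias, the shapes, and the bias scale are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_sampled_bias(X, Wq, Wk, Wv, bias):
    """Single-head attention with a fixed, sampled (not learned) additive
    bias on the pre-softmax scores, acting as a symmetry-breaking term."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores + bias) @ V

rng = np.random.default_rng(0)
n_tok, d_model = 5, 8
X = rng.standard_normal((n_tok, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
# the bias is drawn once and held fixed during training rather than optimized
bias = 0.1 * rng.standard_normal((n_tok, n_tok))
out = attention_with_sampled_bias(X, Wq, Wk, Wv, bias)
print(out.shape)  # (5, 8)
```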

Rubin Data Preparation for Multi-Modal Analysis — Adam Bolton

Adam Bolton (SLAC) gave an overview of the Vera C. Rubin Observatory and the Legacy Survey of Space and Time (LSST), emphasizing repeated multi-band imaging at a scale that makes automated analysis essential. He described Rubin as an integrated system spanning instrumentation and data management, and outlined how the Rubin U.S. Data Facility supports processing and access at the community scale. He then situated Rubin within a broader ecosystem of “cosmic frontier” surveys and modalities (images, catalogs, spectra, and more), noting that analysis pipelines are often siloed by experiment and typically only “meet” at the level of high-level constraints. He argued that multimodal infrastructure and foundation-model-ready data products could enable more integrated cross-survey science earlier in the analysis chain.

GraphGP: Scalable Gaussian Processes with Vecchia’s Approximation — Benjamin Dodge

Benjamin Dodge introduced GraphGP, a research software package aimed at making Gaussian-process priors practical for large 3D inference problems. Motivated by 3D mapping of Milky Way dust from measurements of hundreds of millions of stars, he described the need for smooth, correlated priors to regularize an ill-posed inverse problem and to avoid artifacts that can arise from purely local inference. GraphGP uses Vecchia's approximation to impose Gaussian-process structure at scale through sparse precision representations, paired with fast neighbor search and parallelization strategies. He also emphasized modern implementation choices, including automatic differentiation support and GPU-oriented kernels, to make these priors usable inside contemporary inference pipelines.
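
For readers unfamiliar with Vecchia's approximation, a minimal generic sketch follows (this is not the GraphGP API): each observation is conditioned only on a few nearest neighbors earlier in an ordering, which is what makes the implied precision matrix sparse. The kernel, the ordering, and the neighbor selection below are illustrative choices.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # squared-exponential kernel between point sets a (n,d) and b (m,d)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def vecchia_loglik(x, y, k=10, ell=1.0, jitter=1e-8):
    """Approximate zero-mean GP log-likelihood:
    sum_i log p(y_i | y at the k nearest points earlier in the ordering)."""
    ll = 0.0
    for i in range(len(x)):
        k_ii = rbf(x[i:i+1], x[i:i+1], ell)[0, 0] + jitter
        if i == 0:
            ll += -0.5 * (np.log(2 * np.pi * k_ii) + y[0] ** 2 / k_ii)
            continue
        # condition only on the k nearest earlier points; this truncation
        # is what yields a sparse precision (inverse covariance) structure
        d = np.linalg.norm(x[:i] - x[i], axis=1)
        nb = np.argsort(d)[:k]
        Knn = rbf(x[nb], x[nb], ell) + jitter * np.eye(len(nb))
        kin = rbf(x[i:i+1], x[nb], ell)[0]
        w = np.linalg.solve(Knn, kin)
        mu = w @ y[nb]
        var = k_ii - w @ kin
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll

rng = np.random.default_rng(0)
pts = rng.uniform(size=(500, 3))      # e.g. 3D positions
vals = rng.standard_normal(500)       # toy observations
print(vecchia_loglik(pts, vals, k=15))
```

The cost per point is set by the neighbor count k rather than by the full dataset size, which is what makes the approach viable at the hundreds-of-millions-of-stars scale described in the talk.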

Anomaly Detection in the Presence of Known Signals — Dennis Noll

Dennis Noll presented strategies for anomaly detection when the data contains a known, prominent signal and the goal is to find additional structure on top of it. Using a signal-region/sideband framing, he outlined how to learn a background model from data outside a signal region and transfer that expectation inside it, then extend the approach by incorporating simulations of the known signal. In a Higgs-to-two-photon example, he described how latent-space representations and generative modeling can support broad searches without training directly on a specific new-physics hypothesis. He emphasized a key practical constraint discussed throughout the session: sensitivity depends strongly on the accuracy of the known-signal modeling and associated simulations.
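
The sideband logic can be illustrated with a toy one-dimensional spectrum (a generic sketch, not Noll's pipeline): fit a smooth background model using only bins outside the signal window, interpolate it into the window, and compare the observed count to that expectation. The functional form, window edges, and toy data below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
mass = 100.0 + rng.exponential(scale=40.0, size=100_000)  # toy falling spectrum
lo, hi = 120.0, 130.0                                      # "known signal" window

edges = np.linspace(100.0, 180.0, 81)
counts, _ = np.histogram(mass, bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])

sideband = (centers < lo) | (centers > hi)
# smooth background model fitted on sideband bins only (quadratic in log-counts)
coeff = np.polyfit(centers[sideband], np.log(counts[sideband] + 1e-9), deg=2)
expected = np.exp(np.polyval(coeff, centers))

window = ~sideband
excess = counts[window].sum() - expected[window].sum()
sigma = np.sqrt(expected[window].sum())  # rough Poisson scale for the expectation
print(f"excess in window: {excess:.1f} events ({excess / sigma:.2f} sigma)")
```

The practical constraint noted in the talk shows up directly here: any mismodeling of the background (or of a known signal sitting inside the window) is indistinguishable from the anomaly one is trying to find.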

Discussion (Session 1)

Discussion focused on what makes these methods practically useful, not just conceptually appealing. For symmetry breaking, questions centered on what the injected bias terms buy in practice—especially interpretability and whether the same idea could transfer to scientific analysis settings beyond language modeling. For anomaly detection, the conversation sharpened around a concrete constraint: sensitivity hinges on whether simulations and background estimates are accurate to the level of the anomalies being sought, and how one might mitigate that dependence in real applications.

Session 2: Agentically Discovering the Universe

Infrastructure for Agentic Workflows — Tom Abel

Tom Abel described emerging “agentic” tools through the lens of research infrastructure, arguing that increasingly capable coding assistants make workflow design itself a first-class problem. He presented a vision for a personal “research operating system” that helps researchers organize artifacts, triage information, and connect projects over long horizons, while keeping users in control of their data and provenance. He described prototypes built around an inbox-like model for papers and notes, and discussed directions for agent-assisted writing and collaboration. Discussion emphasized practical questions about reliability, safety, and how to trace outputs back to their sources as agent involvement grows.

Progress in Agentic Workflows — Jo Ciucă & Marcelo Alvarez

Jo Ciucă and Marcelo Alvarez framed agentic systems as goal-driven models operating over extended horizons in real environments (terminals, codebases, and data products), where outcomes emerge from human–tool–data interaction. They argued that benchmarking must go beyond “working outputs” to evaluate multi-step behavior—error recovery, intent preservation, and scientific correctness rather than the appearance of correctness. Through examples spanning ideation-heavy workflows, pipeline development, and replication-style tasks, they emphasized the importance of careful task scoping and intermediate validation, especially in domains where feedback is slow or ambiguous. They raised broader questions about how to design benchmarks that reward iteration and learning, how to surface hidden assumptions, and how mentorship and team roles shift as agents become more capable collaborators.

Terminal-Bench-Science: Evaluating AI Agents on Complex, Real-World Scientific Workflows in Terminal Environments — Steven Dillmann

Steven Dillmann described Terminal-Bench as a benchmark for terminal-based agents built around objectively verifiable tasks, and argued that rigorous evaluation has been central to recent progress in coding-capable systems. He highlighted the pace of improvement on such benchmarks and outlined an effort to extend this approach to the natural sciences by collecting domain-expert-authored tasks grounded in real research workflows. The goal is to define tasks that are both scientifically meaningful and objectively testable, supported by standardized task packaging and evaluation infrastructure that enables reproducible comparisons across models and agents.
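
A hypothetical sketch of what an "objectively verifiable" task record might look like is shown below; the schema, field names, and the make_toy_spectrum.py setup script are illustrative assumptions, not the actual Terminal-Bench-Science packaging format. The key property is that verification is a deterministic check on artifacts the agent leaves behind, independent of how the agent got there.

```python
from dataclasses import dataclass
from typing import Callable
import json
import pathlib

@dataclass
class ScienceTask:
    task_id: str
    instruction: str                        # natural-language goal given to the agent
    setup_cmds: list[str]                   # shell commands that prepare the environment
    verify: Callable[[pathlib.Path], bool]  # deterministic pass/fail check on the workdir

def verify_fit_result(workdir: pathlib.Path) -> bool:
    """Pass only if the agent wrote fit.json with chi2/dof below 2."""
    result = json.loads((workdir / "fit.json").read_text())
    return result["chi2"] / result["dof"] < 2.0

task = ScienceTask(
    task_id="toy-spectral-fit",
    instruction="Fit the provided spectrum and write chi2 and dof to fit.json.",
    setup_cmds=["python make_toy_spectrum.py"],
    verify=verify_fit_result,
)
```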

Learning IRC-Safe Jet Clustering with Geometric Algebra Transformers — Gregor Krzmanc

Gregor Krzmanc discussed jet clustering at the Large Hadron Collider and why infrared-and-collinear (IRC) safety is a core requirement for physically meaningful analyses. He reviewed standard approaches such as anti-kT clustering and described scenarios, including semi-visible jets, where clustering performance can degrade. He presented a learning-based approach that maps particles into a learned clustering space and uses an object-condensation-style loss to encourage sensible grouping, alongside consistency checks under soft and collinear perturbations. In simulation-based studies, he reported improvements over standard baselines while maintaining IRC-safety-inspired robustness.
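
A generic consistency check of the kind mentioned above might look like the following (an illustrative sketch, not Krzmanc's code): given any clustering function that assigns a label to each particle, add an infinitesimally soft particle or split one particle collinearly, then verify that the summed jet four-momenta are essentially unchanged. The cluster_fn placeholder, input format, and tolerances are assumptions.

```python
import numpy as np

def jet_momenta(cluster_fn, particles):
    """Sum the four-momenta (rows of `particles`) assigned to each cluster label."""
    labels = cluster_fn(particles)
    return [particles[labels == l].sum(axis=0) for l in np.unique(labels)]

def _same_jets(jets_a, jets_b, tol=1e-3):
    if len(jets_a) != len(jets_b):
        return False
    key = lambda p: tuple(np.round(p, 6))
    return all(np.allclose(a, b, atol=tol)
               for a, b in zip(sorted(jets_a, key=key), sorted(jets_b, key=key)))

def irc_consistent(cluster_fn, particles, tol=1e-3):
    """True if a soft emission or a collinear split leaves the clustered
    jet four-momenta (approximately) unchanged."""
    ref = jet_momenta(cluster_fn, particles)

    soft = np.vstack([particles, 1e-6 * particles[0]])          # soft emission
    split = np.vstack([0.5 * particles[0], 0.5 * particles[0],  # collinear split
                       particles[1:]])

    return (_same_jets(ref, jet_momenta(cluster_fn, soft), tol) and
            _same_jets(ref, jet_momenta(cluster_fn, split), tol))

# toy usage: a clustering that puts everything in one jet is trivially IRC-safe
one_jet = lambda parts: np.zeros(len(parts), dtype=int)
toy = np.abs(np.random.default_rng(0).standard_normal((6, 4)))
print(irc_consistent(one_jet, toy))  # True
```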

Discussion (Session 2)

Discussion emphasized that the central bottleneck is increasingly evaluation and oversight, not “getting code to run.” Participants returned to how to benchmark multi-step behavior (error recovery, intent preservation, and iteration with domain experts), and how to surface hidden assumptions that can produce plausible-looking outputs with incorrect science. The conversation also highlighted downstream consequences for training and mentorship—how to help researchers build the judgment needed to audit fast-moving agent outputs—and raised broader questions about what it means to “trust” results when responsibility and verification are distributed differently than in traditional collaborations.