Center for Decoding the Universe Quarterly Forum Recap - Fall 2024
Thank you to all the presenters, volunteers, and attendees who joined us for the inaugural Center for Decoding the Universe Quarterly Forum! Enjoy the recorded sessions on our YouTube channel, some of the event photos, and the recaps, courtesy of Kate Storey-Fisher, Sydney Erickson, and Phil Mansfield.
Session 1
The Center for Decoding the Universe’s inaugural Quarterly Forum kicked off with an introduction from Risa Wechsler (Director, KIPAC) and Chris Mentzel (Executive Director, Stanford Data Science). The Center is a new joint initiative between these two institutes that aims to understand how the universe works, astrophysically speaking, by harnessing the growing abundance of observational data and tackling the data science challenges that come with it.
Risa Wechsler and Susan Clark, co-directors of CDU, situated the first session in our home galaxy, the Milky Way. There are big questions about our home and neighborhood that we are still working to answer: how did the Galaxy form and evolve, what fuels it, and what is the dark matter that makes up much of its mass? Researchers are piecing together this picture using multi-modal data, from spectra and images to polarization and time-domain data. We can infer the Milky Way’s formation from its satellite galaxies, its 3D structure from gas flows, and its dynamical history from stellar streams. With current and upcoming facilities including Gaia, the Rubin Observatory, and the Roman Space Telescope, we will have the data in hand to fill in our understanding of our own galaxy.
The talks in this session focused on active areas of machine learning research applicable to the data challenges facing Milky Way science, as well as diving deeper into what those challenges are. Surya Ganguli (Stanford Applied Physics) presented his group’s work on using physics to understand machine learning and vice versa. One key focus was interpreting how both brains and neural networks learn; Ganguli shared an example of a monkey learning to control a cursor with its brain, and then, just when the unsuspecting monkey had gotten the hang of it, the researchers rotated the coordinate system and made it re-learn the task. Unsupervised learning applied to the monkey’s neural data was able to uncover its motor learning process. Next, Phillip Frank (KIPAC) gave an overview of problems involving recovering physical fields, such as maps of the magnetic fields and dust in the Galaxy. The challenge is that our data are sparse and noisy, but forward modeling combined with approaches such as geometric variational inference allows us to reconstruct these fields. Finally, Henry Zheng (Stanford Physics) introduced a new approach to optimization for training machine learning models, Energy Conserving Descent. It uses a particular Hamiltonian, a physics-informed description of a system’s energy, to converge to the optimum independent of the initial weights. Together, these talks laid out tools that will help us decode the physics of our galaxy. Watch the session video!
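To give a flavor of the field-reconstruction problem Frank described, here is a minimal sketch of our own (not code from the talk): it recovers a smooth one-dimensional “field” from sparse, noisy samples by combining a simple forward model with a smoothness prior, using a point estimate as a stand-in for the much richer geometric variational inference used in practice.

    import torch

    torch.manual_seed(0)
    n_pix = 200
    grid = torch.linspace(0.0, 1.0, n_pix)
    true_field = torch.sin(6.0 * grid) + 0.3 * torch.cos(20.0 * grid)

    # Sparse, noisy observations of the field at random pixels (the "data").
    n_obs, noise_sigma = 25, 0.1
    obs_idx = torch.randperm(n_pix)[:n_obs]
    data = true_field[obs_idx] + noise_sigma * torch.randn(n_obs)

    # Forward model: the "instrument" simply reads the field at the observed pixels.
    def forward(field):
        return field[obs_idx]

    # Reconstruct the full field by balancing fit to the data against smoothness.
    field = torch.zeros(n_pix, requires_grad=True)
    opt = torch.optim.Adam([field], lr=0.05)
    for _ in range(2000):
        nll = 0.5 * ((forward(field) - data) / noise_sigma).pow(2).sum()  # Gaussian likelihood
        smoothness = 50.0 * (field[1:] - field[:-1]).pow(2).sum()         # smoothness prior
        loss = nll + smoothness
        opt.zero_grad()
        loss.backward()
        opt.step()

    print("mean reconstruction error:", (field.detach() - true_field).abs().mean().item())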
Session 2
We started Session 2 with an overview of the core data products in astrophysics from Dalya Baron and Adam Bolton. When studying astrophysical objects, “it’s the same physics out there … the length scales are just different from our normal experience,” Adam explained. Astrophysicists apply core physics principles, such as electromagnetism, statistical mechanics, and gravity, to understand objects ranging from exoplanets to supermassive black holes.
Information about an object’s underlying physics can be found in the light that the object emits. We collect light using telescopes, which can be designed to measure different properties of light. When producing images, we measure the spatial distribution of light. When producing spectra, we measure the wavelength distribution of light. When producing light-curves, we measure how the amount of light reaching our telescope varies over time. These three main data products, images, spectra, and light-curves, give insight into different physical properties, such as age, temperature, and density. With the advent of survey-based astronomy, we now have images, spectra, and light-curves for billions of objects. Astrophysics is well into the domain of data-intensive science, and we need help from our data science colleagues to make sense of all of this data!
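To make these data products concrete, here is a small illustrative sketch (all numbers invented) of how the three are typically represented in code: an image is flux over sky position, a spectrum is flux over wavelength, and a light-curve is flux over time.

    import numpy as np

    rng = np.random.default_rng(42)

    # Image: flux per pixel over a small patch of sky (the spatial distribution of light).
    image = rng.poisson(lam=5.0, size=(64, 64)).astype(float)

    # Spectrum: flux per wavelength bin (the wavelength distribution), here a flat
    # continuum with a single toy emission line near H-alpha.
    wavelength_angstrom = np.linspace(4000.0, 7000.0, 1000)
    spectrum = 1.0 + 3.0 * np.exp(-0.5 * ((wavelength_angstrom - 6563.0) / 5.0) ** 2)

    # Light-curve: flux per epoch (the time variation), here a toy periodic variable
    # star sampled at irregular times.
    time_days = np.sort(rng.uniform(0.0, 100.0, 300))
    light_curve = 1.0 + 0.2 * np.sin(2.0 * np.pi * time_days / 3.7) + 0.05 * rng.normal(size=300)

    print(image.shape, spectrum.shape, light_curve.shape)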
That’s where people like Eric Nguyen, Tijana Zrnic, and Gordon Wetzstein come in. We learned how they were able to apply machine learning methods to make advances in their fields.
Eric told us about a new DNA model, called Evo, which harnesses the power of large language models to predict new protein systems. The trick is “treating DNA as a language,” according to Eric, with chapters made of chromosomes and a vocabulary of just four letters. Evo was created through a partnership between teams in the computer science and biology departments, a collaboration model astrophysicists may adopt through the new center. Tijana presented a method to improve the reliability of machine learning predictions. She cautioned that machine learning predictions are increasingly replacing real data products, so there should be a protocol for drawing valid conclusions from them. Using prediction-powered inference, a correction scheme Tijana helped develop, predictions can be adjusted to account for the bias of the machine learning model. Finally, we heard from Gordon Wetzstein about efforts to solve inverse problems, which are prevalent in astrophysics. Gordon gave the example of the mantis shrimp, positing that a sensory system (the eyes) and a processing system (the brain) that co-evolve may be more powerful than systems designed independently. Gordon then presented a technique for improved computational imaging of molecules, where the unknown orientation of a 3D molecule affects the 2D projection observed in a micrograph. This problem is closely analogous to the unknown orientation of 3D galaxies, which likewise affects what we observe in 2D telescope images. Presentations from Eric, Tijana, and Gordon sparked great interest from the astrophysicists in the room, with hopes that some of these techniques may prove beneficial to ongoing research. Watch the session video!
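To illustrate the idea behind the prediction-powered inference Tijana described, here is a minimal sketch of our own, with synthetic numbers and a point estimate only (the full framework also delivers valid confidence intervals): a mean estimated from many machine learning predictions is corrected by the average prediction error measured on a small labeled set.

    import numpy as np

    rng = np.random.default_rng(0)

    # Small labeled set (truth plus prediction) and a large unlabeled set (prediction only).
    # The "model" here is just a biased, noisy copy of the truth, whose true mean is 2.0.
    y_labeled = rng.normal(loc=2.0, scale=1.0, size=200)                 # ground truth
    pred_labeled = y_labeled + 0.5 + rng.normal(0.0, 0.3, size=200)      # biased predictions
    pred_unlabeled = 2.0 + 0.5 + rng.normal(0.0, 1.05, size=20000)       # predictions only

    # Naive estimate of the population mean: trust the predictions directly.
    naive = pred_unlabeled.mean()

    # Prediction-powered estimate: subtract the average prediction error (the
    # "rectifier") measured on the labeled set.
    rectifier = (pred_labeled - y_labeled).mean()
    ppi = pred_unlabeled.mean() - rectifier

    print(f"naive: {naive:.2f}, prediction-powered: {ppi:.2f}, truth: 2.0")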
Session 3
The third session started with a summary of how astronomers derive information from large-scale astronomy measurements. The following talks focused on simulation-based inference (SBI) as a mechanism for improving our ability to extract this information, and on handling “distribution shifts,” where a method’s training data isn’t representative of the data the method will be run on. Distribution shift is an important issue for many forms of machine learning, but it is especially pressing for SBI, which by construction uses simulated data to extract information from real data.
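As a toy illustration of distribution shift (our own example, not taken from the talks), the sketch below fits a simple stand-in model on training data drawn from one range and shows how its error grows once the test data come from a shifted range.

    import numpy as np

    rng = np.random.default_rng(1)

    def observe(x):
        """The true process we are trying to learn, with a little noise."""
        return np.sin(x) + 0.1 * rng.normal(size=x.shape)

    # "Train" a simple stand-in model (a degree-5 polynomial) on x in [0, 3].
    x_train = rng.uniform(0.0, 3.0, 500)
    coeffs = np.polyfit(x_train, observe(x_train), deg=5)

    # Evaluate in-distribution and under a shift to x in [3, 6].
    x_in = rng.uniform(0.0, 3.0, 500)
    x_shift = rng.uniform(3.0, 6.0, 500)
    mse_in = np.mean((np.polyval(coeffs, x_in) - observe(x_in)) ** 2)
    mse_shift = np.mean((np.polyval(coeffs, x_shift) - observe(x_shift)) ** 2)

    print(f"in-distribution MSE: {mse_in:.3f}, shifted MSE: {mse_shift:.3f}")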
The first talk, “What is the Universe Made of? Cosmological Inference” by Emmanuel Schaan and Phil Marshall, was a high-level introduction to two topics about the large-scale structure of the universe. The first is the Cosmic Microwave Background (CMB), the afterglow of the Big Bang. From an observer’s perspective, the CMB looks like a distant, dim, and very red shell of opaque fire. The distribution of matter in the early universe is imprinted on temperature variations in the CMB, and the late-time growth of structure can also be extracted from the ways that structure distorts the CMB’s light. CMB analysis is hard for many reasons, but a key one is that you need to look through the whole universe to see the CMB, and the intervening objects can also distort it in ways that don’t teach you about fundamental physics. The second major topic was the expansion of the universe, described through the Hubble Function. The Hubble Function contains information on many important physical constants and can be measured by finding the distance to objects like supernovae and massive galaxies that heavily warp the light passing around them. A new telescope that was partially developed at SLAC and Stanford, the Vera Rubin Observatory, will expand the number of these objects substantially. However, the computational cost and modeling complexity associated with deriving distances to these objects are so substantial that new approaches will be needed.
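To connect the Hubble Function to distance measurements, here is an illustrative sketch of our own (the cosmological parameter values are assumed, not taken from the talk): in a flat ΛCDM model, the luminosity distance to an object at redshift z is an integral over 1/H(z).

    import numpy as np

    C_KM_S = 299792.458        # speed of light [km/s]
    H0 = 70.0                  # Hubble constant today [km/s/Mpc] (assumed)
    OMEGA_M = 0.3              # matter density (assumed)
    OMEGA_L = 1.0 - OMEGA_M    # dark energy density, assuming a flat universe

    def hubble(z):
        """The Hubble Function H(z) in km/s/Mpc for a flat LCDM model."""
        return H0 * np.sqrt(OMEGA_M * (1.0 + z) ** 3 + OMEGA_L)

    def luminosity_distance(z, n_steps=10_000):
        """D_L = (1 + z) * c * integral from 0 to z of dz' / H(z'), in Mpc."""
        zs = np.linspace(0.0, z, n_steps)
        integrand = 1.0 / hubble(zs)
        # Trapezoid rule for the comoving distance integral.
        comoving = C_KM_S * np.sum((integrand[:-1] + integrand[1:]) / 2.0) * (zs[1] - zs[0])
        return (1.0 + z) * comoving

    for z in (0.1, 0.5, 1.0):
        print(f"z = {z}: D_L ~ {luminosity_distance(z):.0f} Mpc")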
In the subsequent talks, Sanmi Koyejo (Is Distribution Shift Still An AI Problem?) sketched out the landscape of methods that can help machine learning models survive distribution shifts. He took a middle position between two extremes (with claims that distribution shifts are essentially fatal to ML methods at one extreme, and claims that LLMs and foundation models solve the problem out of the box at the other). Sanmi advocated for reframing the problem as measuring and optimizing the performance of ML techniques under worst-case distribution shifts, and provided a detailed decision tree suggesting different methods for achieving this. Kate Storey-Fisher (SBI for 3D Galaxy Clustering) talked about using SBI to study cosmology with a neural net that learns the distribution of simulated galaxies on large scales. Sydney Erickson (Modeling Strongly Lensed Quasars With Neural Posterior Estimation) gave a talk on a similar approach to measuring distances to massive galaxies. The two emphasized the benefits of this family of approaches compared to traditional analysis, the most important being that traditional analysis can destroy a lot of the information contained in the data and that SBI can be substantially faster. But both also stressed that SBI is not a magic wand and has its own limitations, with distribution shifts being a key one. Watch the session video!
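For readers curious what SBI with neural posterior estimation looks like in practice, here is a deliberately tiny sketch of our own, with a toy simulator and a Gaussian posterior head standing in for the cosmological simulations and normalizing flows used in the real analyses: a network trained on simulated (parameter, data) pairs turns inference on a new observation into a single forward pass.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    def simulator(theta):
        """Toy stand-in for an expensive forward simulation."""
        return 2.0 * theta + torch.sin(3.0 * theta) + 0.1 * torch.randn_like(theta)

    # 1. Draw parameters from the prior and run the simulator to get matching data.
    theta = torch.rand(5000, 1) * 4.0 - 2.0      # uniform prior on [-2, 2]
    x = simulator(theta)

    # 2. Train a network that maps data x to a (diagonal Gaussian) posterior over theta.
    net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(2000):
        mu, log_sigma = net(x).chunk(2, dim=1)
        # Negative Gaussian log-probability of the true theta given x (up to a constant).
        loss = (log_sigma + 0.5 * ((theta - mu) / log_sigma.exp()) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # 3. Inference on a new "observation" is now a single forward pass.
    with torch.no_grad():
        x_obs = simulator(torch.tensor([[0.7]]))
        mu, log_sigma = net(x_obs).chunk(2, dim=1)
    print(f"posterior for theta: mean {mu.item():.2f}, std {log_sigma.exp().item():.2f}")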