Shining a Light on Emerging Talent: Rising Stars in Data Science 2024 Recap
Coming to Stanford in 2025!
The Rising Stars in Data Science 2024 workshop, held November 14-15 at UC San Diego’s Halıcıoğlu Data Science Institute (HDSI), brought together a vibrant community of early-career researchers and mentors eager to shape the future of data science. Over two packed days, participants explored career pathways, cutting-edge research, and the many ways data science is reshaping the world.
The event kicked off on Thursday morning with inspiring opening remarks from Virginia de Sa (UCSD), David Uminsky (University of Chicago), and Chris Mentzel (Stanford Data Science). The energy was high as attendees moved into the first panel on career trajectories, where panelists shared insights on navigating the often winding paths into data science. The second panel offered practical job search strategies, giving attendees an insider’s view from recent candidates and search committee members alike. After a relaxed lunch in the courtyard, participants split into smaller groups for mentor meetings, soaking up advice from seasoned professionals.
The afternoon brought an exciting lineup of lightning talks, with concurrent sessions covering a variety of hot topics like Machine Learning, Trustworthy AI, and Health Data Science. Researchers from top institutions—including the University of Washington, Harvard, and Carnegie Mellon—delivered fast-paced presentations on their latest work, fueling discussions and sparking new ideas. The day wound down with a reception and poster session, where participants mingled, discussed posters, and enjoyed an evening of relaxed networking.
Friday began with two highly anticipated panels, each aimed at those embarking on academic careers. The first session focused on tips for making the most of that pivotal first year as an assistant professor, while the second explored the secrets of building a successful long-term career in academia. With expert panelists from UCSD, Stanford, and UChicago, attendees gained actionable insights on everything from teaching and research to finding balance and purpose.
In the afternoon, participants dove into more lightning talks, exploring AI’s societal impact, Applied Math in ML, and Human-Centric AI before gathering for the closing session. Rajesh Gupta wrapped up the event with a motivating send-off, and participants celebrated their new connections at a happy hour at UCSD’s Price Center.
With a mix of panels, lightning talks, mentorship, and social gatherings, Rising Stars in Data Science 2024 left attendees inspired, connected, and ready to take their careers to the next level in the ever-evolving world of data science. Stanford Data Science is eager to host the 2025 workshop on the Stanford campus! In the meantime, below is more information about the research from the Stanford scholars who were among the 2024 Rising Stars.
Shining a Light on Stanford’s Rising Stars
Advancing Medical Diagnostics Through AI: Maya Varma’s Breakthrough Research
Maya Varma, a fifth-year PhD student at Stanford University, is pioneering artificial intelligence (AI) methods to transform disease diagnostics, particularly in the medical imaging domain. Her research centers on three primary objectives:
- Accurate Diagnostics from Complex Data: Varma develops AI models to analyze intricate medical images such as chest X-rays and musculoskeletal X-rays.
- Equitable AI Models: She ensures these models operate reliably across diverse patient populations by addressing biases in machine learning systems before global deployment.
- Global Accessibility: Varma works to make AI tools accessible worldwide by minimizing computational demands and releasing open-source datasets and benchmarks.
Her talk focused on "RaVL," a method addressing spurious correlations in vision-language models. Such correlations, which arise when a model associates unrelated features with specific conditions, can lead to life-threatening diagnostic errors. For instance, prior models detected pneumothorax based on the presence of treatment devices rather than the actual clinical signs, risking false predictions in untreated patients.
RaVL identifies and disentangles spurious correlations through a two-stage process. First, it isolates problematic clusters of image-text features. Then, a novel loss function is applied during fine-tuning of the vision-language model, preventing reliance on irrelevant attributes during learning.
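As a rough mental model of this two-stage idea, the sketch below is purely illustrative and is not the published RaVL code; all names, thresholds, and the penalty formulation are assumptions. It clusters region-level image features, flags clusters that are over-represented among misclassified images, and computes a penalty term that a fine-tuning loss could use to discourage reliance on those clusters.

```python
# Illustrative sketch (not the published RaVL implementation) of the two-stage idea:
# 1) cluster region-level image features and flag clusters that correlate with errors,
# 2) compute a penalty that discourages relying on flagged clusters during fine-tuning.
import numpy as np
from sklearn.cluster import KMeans

def flag_spurious_clusters(region_feats, is_error, n_clusters=10, threshold=0.25):
    """Stage 1: cluster region features and flag clusters over-represented in errors.

    region_feats: (n_regions, d) pooled features for image regions
    is_error:     (n_regions,) 1 if the region comes from a misclassified image
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(region_feats)
    base_rate = is_error.mean()
    flagged = []
    for c in range(n_clusters):
        mask = labels == c
        # Flag clusters whose error rate is well above the overall error rate.
        if mask.any() and is_error[mask].mean() > base_rate + threshold:
            flagged.append(c)
    return labels, flagged

def spurious_penalty(similarities, labels, flagged):
    """Stage 2 (hypothetical penalty): downweight image-text similarity that is
    driven by regions belonging to flagged (likely spurious) clusters."""
    mask = np.isin(labels, flagged).astype(float)
    return float((similarities * mask).mean())

# Toy usage with random arrays standing in for model features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 32))
errors = rng.integers(0, 2, size=200)
labels, flagged = flag_spurious_clusters(feats, errors)
print(flagged, round(spurious_penalty(rng.random(200), labels, flagged), 3))
```

In the actual method, the flagged feature clusters would feed directly into the fine-tuning objective of the vision-language model rather than being computed post hoc on synthetic data as here.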
The efficacy of RaVL was demonstrated in both medical and general domains, where it significantly outperformed comparable methods in mitigating spurious correlations. For example, it identified biases in models that linked metallic clips in X-rays with certain heart conditions, or fast-food restaurant signs with particular scene classifications.
Looking ahead, Varma plans to enhance AI systems across their lifecycle—from development to post-deployment monitoring—ensuring robust, reliable, and scalable diagnostic tools for global health applications.
Tracking and Combating Misinformation: Hans Hanley’s Innovative Tools for the Digital Era
The rise of misinformation has created a pressing need for tools to track and analyze false narratives across the digital landscape. PhD student Hans Hanley’s recent presentation shed light on "Tall Tales," a system designed to address this challenge by utilizing artificial intelligence (AI) and large language models (LLMs). This initiative provides researchers, journalists, and policymakers with the means to understand how misinformation spreads and identify its sources.
The Real-World Impact of Misinformation
Misinformation can lead to severe real-world consequences. A striking example occurred in March 2024, when pro-Russian actors falsely claimed that King Charles III had died. This disinformation, propagated by outlets such as Sputnik News, aimed to undermine trust in the UK's ability to support Ukraine. The delay in countering such claims highlights the urgent need for faster and more reliable tools to debunk falsehoods.
The Challenge of Limited Data Access
The ability to study misinformation has been further hindered by social media platforms restricting API access for researchers. Without this access, understanding the online environment and tracking narratives becomes increasingly difficult.
The Tall Tales Solution
Tall Tales is a project that collects and analyzes news articles to identify cohesive narrative clusters. The system works through three stages, sketched in code after the list:
- Data Collection: Gathering news articles from a range of sources, categorized as reliable, mixed, or unreliable based on external ratings like Media Bias Fact Check.
- LLM Embeddings: Using a multilingual embedding model, Tall Tales differentiates between articles discussing similar or distinct stories. This model clusters articles into narratives while accounting for nuances in themes and topics.
- Stance Detection: Determining the attitudes of various articles toward specific topics, such as vaccines or geopolitical issues, to understand how narratives are framed.
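A stripped-down sketch of the first two stages might look like the following. This is an illustration of the pipeline described above, not the project's actual code; the embedding model and clustering threshold are assumed choices.

```python
# Simplified narrative-clustering sketch in the spirit of the pipeline described above.
# Articles are embedded with a multilingual sentence-embedding model and grouped
# into candidate narrative clusters by cosine distance.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

articles = [
    "Vaccine rollout expands to rural clinics this week.",
    "Officials announce wider availability of vaccines in rural areas.",
    "New report questions the safety data behind the vaccine program.",
]

# Multilingual embedding model (assumed choice; any sentence-embedding model would do).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(articles)

# Cluster articles into candidate narratives: small cosine distance -> same story.
dist = cosine_distances(embeddings)
clusters = AgglomerativeClustering(
    n_clusters=None, metric="precomputed", linkage="average", distance_threshold=0.4
).fit_predict(dist)

for text, cluster_id in zip(articles, clusters):
    print(cluster_id, text)
```

Stance detection toward a chosen topic (for example, vaccines) could then be applied per cluster, say with a zero-shot or fine-tuned classifier, to recover the pro/anti framing described above.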
Insights from the Research
Tall Tales revealed notable patterns in how misinformation spreads. The system also mapped the stances of websites, showing that reliable outlets lean toward pro-vaccine and pro-Ukraine positions, while unreliable ones skew in the opposite direction.
Additionally, the network analysis identified key players in spreading misinformation, such as Gateway Pundit and RT (Russia Today), and revealed how narratives travel through the digital ecosystem.
Future Directions
The team aims to expand their analysis to include global news ecosystems and social media platforms like Telegram and Weibo. By mapping stances and extremism scores across different countries, they hope to uncover deeper political divides and enhance the understanding of how misinformation operates internationally.
A Call to Action
The Tall Tales project underscores the need for robust tools to combat misinformation. As AI continues to evolve, it offers promising solutions for tracking false narratives, ensuring transparency, and protecting public trust in the digital age.
Jinzhou Li on Understanding the Root Cause in Monogenic Disorders Using Gene Expression Data
Background and Challenges
Postdoctoral scholar Jinzhou Li’s research analyzes gene expression data from patients suffering from monogenic disorders, diseases caused by mutations in a single gene. The mutation alters that gene’s expression level, which then differs significantly from the levels observed in healthy individuals. Identifying the root cause of these diseases presents several challenges:
- Propagation of Effects: Mutations in one gene can propagate their effects through gene networks, resulting in many aberrantly expressed genes.
- The Difficulty of Estimating Gene Networks: Estimating the gene network is known to be challenging, and it is generally not identifiable without strong assumptions.
- Personalized Root Cause Discovery: Different patients, even those with similar symptoms, typically have different root causes. Therefore, root cause discovery must be performed in a personalized manner.
Methodology
Jinzhou’s research adopts a linear structural equation model to analyze observational and interventional gene expression data. The observational data serve as a reference for comparing with the interventional data to determine aberrancy.
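In a standard linear SEM formulation consistent with this description (the notation below is an assumption for exposition, not taken from the talk), the two kinds of samples can be written as:

```latex
% Observational sample: each gene is a linear function of the other genes plus noise.
X = B X + \varepsilon
% Patient (interventional) sample: the mutation adds a shift \delta to the
% root-cause gene j, where e_j is the j-th standard basis vector.
\tilde{X} = B \tilde{X} + \varepsilon + \delta\, e_j
```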
- Model Assumptions: Each gene's expression is modeled as a linear function of other genes plus noise, with the mutation (or intervention) modeled as a shift (denoted δ) in a single gene.
- Data Structure: Observational data consists of normal gene expression levels, while interventional data consists of the gene expression levels from a single patient.
- Root Cause Discovery: A mathematical approach involving permutations of the variables and Cholesky decomposition is used to identify the root cause of observed effects. The focus is on identifying specific patterns even when the gene network itself is not identifiable; a toy numerical illustration follows this list.
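The toy simulation below illustrates one reading of that idea under strong simplifying assumptions (the causal ordering of the genes is taken as known, and the data are synthetic): whitening the patient's expression vector with the Cholesky factor of the observational covariance concentrates the mutation's shift in a single outlying coordinate, which is then read off as the root cause. It is a sketch of the intuition, not the presented algorithm.

```python
# Toy simulation of root-cause discovery under a *known* causal ordering
# (an illustrative reading of the approach, not the presented algorithm).
import numpy as np

rng = np.random.default_rng(1)
p, n, delta, root = 30, 2000, 8.0, 12

# Random strictly lower-triangular B gives a linear SEM X = B X + eps with order 0..p-1.
B = np.tril(rng.normal(scale=0.3, size=(p, p)), k=-1)
A = np.linalg.inv(np.eye(p) - B)          # so X = A @ eps

# Observational samples (healthy reference) and their covariance.
X_obs = rng.normal(size=(n, p)) @ A.T
Sigma = np.cov(X_obs, rowvar=False)

# One patient: same model, plus a shift delta at the (unknown) root-cause gene.
eps = rng.normal(size=p)
eps[root] += delta
x_patient = A @ eps

# Whiten the patient sample with the Cholesky factor of Sigma (ordering assumed known).
C = np.linalg.cholesky(Sigma)             # Sigma = C C^T, with C lower triangular
z = np.linalg.solve(C, x_patient)         # approximately standardized noise plus the shift

print("estimated root cause:", int(np.argmax(np.abs(z))), "| true root cause:", root)
```

In practice the causal ordering is unknown, and searching over permutations of the variables is part of what makes the method computationally demanding, as noted under the limitations below.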
Key Findings
- Identifiability: The study demonstrated that even if the full gene interaction network is unidentifiable, the root cause can still be identified through careful analysis of observational and interventional data.
- Performance of the Method: The proposed method outperformed simpler comparative approaches in identifying root causes for individual patients. The findings align with prior biological studies, validating the method’s accuracy.
- Application to Real Data: Using RNA sequencing data from real patients, the method successfully identified genes likely responsible for the observed disease. However, challenges remain in scaling the approach to larger datasets.
Limitations and Future Directions
- Computational Demands: The method requires significant computational resources, with analysis for one patient taking several hours. Parallel computing can help address this limitation.
- Single Mutation Assumption: The model assumes a single mutation per patient, which may not capture the complexity of disorders caused by multiple genetic alterations.
- Broader Applications: Future work could adapt the method to polygenic disorders and explore its application across diverse populations.
Conclusion
Jinzhou’s research provides a novel framework for identifying the root causes of monogenic disorders using gene expression data. By addressing key challenges in causal analysis, the method offers a promising tool for understanding genetic diseases and guiding precision medicine approaches. However, further work is needed to optimize its computational efficiency and extend its applicability to more complex genetic conditions.