2024 Data Science Undergraduate Research Pathways
The Stanford Data Science Undergraduate Research Pathways (DSURP) program is a new, unique opportunity giving qualifying students enrolled at non-research institutions – including state, liberal arts, and community colleges – the chance to conduct a research project within Stanford Data Science.
This 8-week, in-person experience connects students with both a Stanford mentor and faculty member based on their skill sets and interests. This year, students contributed to projects in genomics, statistical methodology, and AI methods for nephrology. Each student also received academic career guidance and participated in a number of social and professional development events throughout the summer.
The aim of the program is to provide students with opportunities and mentorship unavailable through their home institutions and equip them with the knowledge, experience, and skills to pursue postgraduate degrees and careers in academia.
The first summer of the Data Science Undergraduate Research Pathways (DSURP) program ran from June 24th to August 16th, 2024.
DSURP Leadership Team
Chiara Sabatti is an Associate Director for Stanford Data Science and Professor of Biomedical Data Science and of Statistics.
Daniel LeJeune is a postdoctoral scholar in the Stanford Department of Statistics, having previously completed his PhD in electrical engineering at Rice University. He is passionate about mentorship and making a research career possible for anyone. His own research centers on understanding and exploiting emergent properties of machine learning systems. In his personal life, he enjoys baking bread and cooking Cajun food for his friends and family as well as hunting for hidden gems at estate sales.
Projects
Using Distance to Detect Location of Positive Selection in the Genome
(Poster)
Understanding whether a gene is under selective pressure is a fundamental question in evolutionary biology. Mutation, recombination, and selection, among other forces, all affect molecular variation, so a statistic based on molecular data alone is not clearly informative about selection. Whole-genome molecular samples allow us to infer the samples' gene genealogies. Here, we investigate whether genealogies are informative about selection and can be used to statistically test selective hypotheses. By detecting where positive selection has occurred, we can understand more about the functions of specific parts of the human genome and find fundamentally important variants, which have many applications, such as creating more effective genetic therapies. Through realistic simulations of gene genealogies along the genome under neutral and positive selective pressures, we aim to assess whether a statistic based on distances between genealogies is effective in detecting regions under positive selection.
Faculty Mentor
Julia Palacios, Associate Professor of Statistics, of Biomedical Data Science and, by courtesy, of Biology
Dr. Palacios seeks to provide statistically rigorous answers to concrete, data-driven questions in evolutionary genetics and public health. Her research involves probabilistic modeling of evolutionary forces and the development of computationally tractable methods that are applicable to big data problems. Her past and current research relies heavily on the theory of stochastic processes, Bayesian nonparametrics, and recent developments in machine learning and statistical theory for big data.
Fellow
Ally Kwan is a rising second-year student at Foothill College majoring in mathematics and computer science. With broad interests ranging from game theory to statistical inference, she enjoys working on projects that blend statistics with real-world applications to make a difference. Outside of school, you’ll find her swimming, playing with her dog, and attempting to bake.
Survival Analysis Goes DNAMite: IML for Kidney Waitlist Mortality
(Poster)
Interpretable machine learning (IML) is increasingly vital in high-stakes domains like clinical decision-making and healthcare, where it promotes transparency, identifies biases, and enhances the understanding of influential features. However, in nephrology, machine learning has been underutilized for modeling patient outcomes on transplant waitlists, partly due to existing literature suggesting that Cox-based models outperform machine learning in predictive accuracy. This research addresses this gap by training and evaluating models across various data sizes. The team demonstrates that with sufficient training data and appropriate methodologies, the IML model, DNAMite, not only surpasses traditional and tree-based approaches in predictive accuracy but also maintains a high level of interpretability.
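Comparisons of predictive accuracy between survival models are often reported with a concordance index, although the description above does not say which metric the team used. The sketch below is a minimal, simplified version of Harrell's concordance index, offered only as an illustration under that assumption. The function name and the toy data are hypothetical, and tied event times are ignored for simplicity.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C-index (illustrative sketch, not the team's code).

    times:       observed follow-up time for each patient
    events:      1 if the event (e.g. death) was observed, 0 if censored
    risk_scores: model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only usable if the earlier time is an observed event
        for j in range(n):
            if times[i] < times[j]:  # patient i had the event before j's last follow-up
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # model ranked the earlier event as higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the third one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
good_scores = [0.9, 0.7, 0.4, 0.1]   # risk decreases as survival time increases
bad_scores = [0.1, 0.4, 0.7, 0.9]    # the reverse ordering

c_good = concordance_index(toy_times, toy_events, good_scores)
c_bad = concordance_index(toy_times, toy_events, bad_scores)
```

A value near 1 means the model ranks patients in the right order of risk, 0.5 corresponds to random ranking, and values near 0 indicate a reversed ranking.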
Faculty Mentor
Madeleine Udell, Assistant Professor of Management Science & Engineering and, by courtesy, of Electrical Engineering
Madeleine Udell is Assistant Professor of Management Science and Engineering at Stanford University, with an affiliation with the Institute for Computational and Mathematical Engineering (ICME) and courtesy appointment in Electrical Engineering, and Associate Professor with tenure (on leave) of Operations Research and Information Engineering and Richard and Sybil Smith Sesquicentennial Fellow at Cornell University. She completed her PhD at Stanford in Computational and Mathematical Engineering and a postdoctoral fellowship at the Center for the Mathematics of Information at Caltech. Her research aims to accelerate and simplify large-scale data analysis and optimization, with impact on challenges in healthcare, finance, marketing, operations, and engineering systems design, among others. Her work in optimization seeks to detect and exploit novel structures, leading to faster and more memory-efficient algorithms, automatic proofs of optimality, better complexity guarantees, and user-friendly optimization solvers and modeling languages. Her work in machine learning centers on challenges of data preprocessing, interpretability, and causality, which are critical to practical application in domains with messy data. Her awards include the Kavli Fellowship (2023), Alfred P. Sloan Research Fellowship (2021), an NSF CAREER award (2020), and an ONR Young Investigator Award (2020).
Fellow
Billy Block is a Statistics and Computer Science student entering his final year at Cal Poly San Luis Obispo, on track to graduate a year early. Billy has a passion for interdisciplinary fields like Operations Research and Computational Engineering. He is now preparing to pursue a PhD in these areas, aiming to contribute to the advancement of data-driven solutions in complex systems, with a particular interest in interpretable machine learning for healthcare.
Empirical Bayesian Methods with Binomial Random Variables
(Poster)
This project examined Empirical Bayesian methods for constructing confidence intervals for the population proportion of binomially distributed random variables. The problem seems almost intractable because these methods are given only a single observation, yet they can still produce good confidence intervals for the population proportion. The method developed dominates another, purely Bayesian method and performs better on average than other popular binomial confidence intervals. The team is now turning its focus to Stein's paradox, beginning with what happens for binomial random variables.
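For context on what a "popular binomial confidence interval" can look like, the sketch below computes two standard textbook intervals, the Wald and Wilson score intervals, and their exact coverage probability. This is not the team's Empirical Bayesian method, and the choice of these two intervals as baselines is an assumption made here for illustration. All function names are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wald_interval(successes, n, z=Z_95):
    """Textbook Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)


def wilson_interval(successes, n, z=Z_95):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def coverage(interval_fn, n, p):
    """Exact probability that the interval contains the true proportion p."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval_fn(k, n)
        if lo <= p <= hi:
            total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


# Compare the two intervals for a small sample and a small true proportion.
wald_cov = coverage(wald_interval, n=10, p=0.1)
wilson_cov = coverage(wilson_interval, n=10, p=0.1)
```

Comparing `wald_cov` and `wilson_cov` for small `n` shows why the choice of interval matters: the Wald interval collapses to a single point when zero successes are observed, which lowers its coverage.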
Faculty Mentors
Dennis Sun, Associate Professor of Statistics
Manuel Rivas, Assistant Professor of Biomedical Data Science
Fellow
Visruth Srimath Kandali is a sophomore studying statistics and data science at Cal Poly San Luis Obispo. Visruth is especially interested in statistical computing and Bayesian statistics. Outside of research, Visruth enjoys photography, reading, and writing.