In the 17th century, Johannes Kepler could draw the trajectories of planets on a piece of paper and use his powerful imagination to draw conclusions. Today, the Large Synoptic Survey Telescope will, within a few months, begin producing the deepest, widest image of the universe, at a rate of 15 terabytes of data every night! How does one even look at this?

In the 19th century, John Snow traced the London cholera outbreak to a contaminated water pump on Broad Street by drawing a dot map of the cluster of disease incidences around the pump. His heroic data-gathering effort was guided by the conviction that water was a key element in the transmission. Today, when we analyze an outbreak, we have wide-ranging and detailed information available to us: movements of individuals can be tracked by GPS-carrying mobile devices; embedded sensors measure the environment (particle counts, temperature, water quality); internet queries on symptoms and remedies give clues about the extent of the spread; and DNA information allows the study of the pathogens involved and their evolution. The data is rich, often agnostic (collected without a particular question in mind) and ripe both for learning and for driving decisions.
Large and complex data sets now drive nearly every aspect of science and discovery. Scholars from virtually every academic field and discipline are using data to advance the frontiers of knowledge in ways never before thought possible. At Stanford alone:
An Earth science researcher has teamed up with computer scientists to use data science to estimate crop yields from satellite imagery. This work has ramifications for the global food supply and land use.
A social scientist in our Communication department used data science tools to identify fake social media accounts created by the Chinese government and to analyze how hundreds of millions of fabricated social media posts are used to distract the public from controversial issues.
A recent mathematical breakthrough enables doctors to generate highly resolved images from far fewer data samples than previously required, cutting MRI scan times eightfold and thereby eliminating the need to deepen children's anesthesia to suppress respiration for prolonged periods.
Despite these success stories, many important scientific investigations are currently limited by researchers' ability to extract essential information from data to advance their disciplines. Harnessing the data revolution requires new tools and techniques for reasoning analytically from data and, even more crucially, for having confidence in the validity of the results. It is not reasonable to expect a political scientist or a marine biologist to also be a world expert in large-scale data management, large-scale computation or methods of inference and predictive modeling. Conversely, statisticians and computer scientists need context to imagine which questions one might ask of data in order to develop useful tools. Collaboration, communication and the sharing of knowledge will be key to breaking through current limitations.
The goal of Stanford Data Science is to weave data science research and methods into the fabric of the university — its faculty, its students, its research and its teaching — to advance discovery and the creation of knowledge and to provide insights that suggest solutions to the world’s most pressing problems. Our goal is to build a community that brings together the world’s very best data scientists with scholars from other fields who rely on accurate, dependable, large data sets and data science techniques to do their work at the highest level.
Stanford Data Science will contribute to research by giving faculty a place to find data science support and collaboration as they take on some of the world's most vexing research questions. There they will work with scholars who are not only advancing the field of data science itself but also studying the ethical issues surrounding data collection and use, how to use data properly to solve these problems, and how to interpret results in ways that account for bias, statistical variability and causation.
Stanford Data Science will also contribute to education, teaching and learning. As a university of the 21st century, we have an obligation to ensure that the next generation of citizens is literate in data science and that our students, regardless of their career or professional path, understand how to collect, manage, interpret and learn from data. One way we will do this is by creating a data science lab that trains students through hands-on projects with opportunities for real-world impact.
How do we get there? We know there is widespread interest in building a community of like-minded scholars who understand the power of data. After two separate campus-wide calls for seed-grant proposals earlier this year, we received 220 applications from 61 academic departments. The excitement and enthusiasm were palpable, yet we were able to fund only 10 of these extraordinary proposals. And this was just the beginning! The community will grow dramatically and evolve as we create a campus-wide resource that stimulates research collaboration; provides technical, computational and educational resources; and creates platforms for communicating our work.