Science of Data Science

Science is experiencing simultaneous challenges and opportunities at an unprecedented rate:

  • From new sources of data, especially in large quantity and unconventional structure, often from “non-scientific” sources, such as social media;
  • From new algorithmic techniques that could greatly expand the ability to reason from data, but whose interpretation, validity and fairness cannot be established by our current statistical and computational techniques;
  • From the crucial need for scientifically valid advice on questions of the greatest importance to the future of society, of life and of the earth itself, advice that must be effectively communicated to society.

In all of these, data science is clearly central. Recent computational, statistical and other research has been of great value. Much more needs to be done, however, and with a sense of urgency.

Validity of algorithmic inferences:

Algorithmic techniques to infer patterns and structure have had exceptional success recently in many areas of practical value. They can also be important, even revolutionary, for science in many areas. Data as divergent as social media interactions on one hand and satellite or drone images on the other may provide vital results through such algorithms.

However, the scientific validity of the results cannot be assumed. Conventional concepts such as random sampling of the intended population are rarely relevant. A deeper understanding of the data sources and the computations applied will be essential.

Fairness of algorithmic decisions:

Beyond the scientific validity of inferences, the use of algorithmic results to recommend practical actions raises important questions of fairness and equitable treatment. Data science needs to search for valid notions of fairness, to ensure that the results of analysis and the data-based algorithms using them are fair to all demographic and other cohorts.
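As one concrete illustration of what a "notion of fairness" can look like in practice, the sketch below computes a demographic parity gap: the difference in the rate of favorable decisions between cohorts. This is only one of several competing definitions, and the text above does not endorse it in particular. The function names and the toy data are hypothetical, chosen purely for illustration.

```python
from collections import defaultdict


def selection_rates(decisions, groups):
    """Fraction of favorable decisions (coded as 1) within each group."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        favorable[group] += decision
    return {group: favorable[group] / totals[group] for group in totals}


def demographic_parity_gap(decisions, groups):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: 1 = favorable decision, 0 = unfavorable.
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(decisions, groups)
gap = demographic_parity_gap(decisions, groups)
print(rates, gap)
```

A gap of zero means every cohort receives favorable decisions at the same rate. Whether that is the right criterion for a given application is exactly the kind of question the paragraph above leaves open.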

Privacy and the public interest:

Huge quantities of data exist about individuals, through social media, other internet activities and databases of medical, governmental, employment and commercial records. Computational and statistical techniques are needed that satisfy both the right to privacy and society’s need to deal with important questions. Progress has been made with new approaches such as differential privacy and distributed inference on private data. Much more needs to be done given the increasing attraction of mining such data sources, with the potential risks to individual rights.
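To make the idea of differential privacy more tangible, here is a minimal sketch of the Laplace mechanism applied to a counting query. It assumes the standard setup in which adding or removing one person changes the count by at most 1. The helper names, the `epsilon` value and the list of ages are hypothetical. The noise sampler is a plain floating-point one, so this is an illustration rather than a hardened implementation.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 52, 67, 29, 71, 38]  # hypothetical records
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(noisy)
```

A smaller `epsilon` gives stronger privacy but noisier answers. The tension between the two is a small-scale version of the trade-off between individual rights and the public interest described above.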


Causal inference from observational data:

Some of the richest sources of extensive data for scientific study are observational (“non-randomized”) databases made available by the explosion of technology (the internet and digital records in medicine, government and business). Naive application of inferential techniques to infer causal mechanisms will be seriously misleading on such data, potentially with disastrously mistaken conclusions. Research in new statistical and computational techniques to adjust for such data sources is needed.
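The way a naive analysis can mislead on non-randomized data can be seen in a small simulation. The sketch below is an assumption-laden toy, not a method proposed in the text. A single binary confounder `z` raises both the chance of treatment and the outcome, the true treatment effect is fixed at 1.0 by construction, and `z` is assumed to be fully observed. Real observational databases are rarely this clean. All names and parameter values are hypothetical.

```python
import random


def simulate(n, rng):
    """Generate (z, t, y) triples with a confounder z.

    z raises both the chance of treatment t and the outcome y.
    The true effect of t on y is 1.0 by construction.
    """
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        t = 1 if rng.random() < (0.8 if z else 0.2) else 0
        y = 1.0 * t + 3.0 * z + rng.gauss(0.0, 1.0)
        rows.append((z, t, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_effect(rows):
    """Difference in mean outcome, treated minus untreated, ignoring z."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def stratified_effect(rows):
    """Compare within each level of z, then average weighted by stratum size."""
    total = 0.0
    for level in (0, 1):
        stratum = [row for row in rows if row[0] == level]
        total += naive_effect(stratum) * len(stratum) / len(rows)
    return total


rows = simulate(20000, random.Random(0))
naive = naive_effect(rows)
adjusted = stratified_effect(rows)
print(naive, adjusted)
```

In expectation the naive comparison gives 1 + 3 × (0.8 − 0.2) = 2.8, nearly three times the true effect, while the stratified estimate targets 1.0. The adjustment works here only because the confounder is known and measured, which is the assumption that new techniques for observational data have to relax.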

The reproducibility crisis:

Repeated and often highly visible incidents have highlighted failures to reproduce “scientific” conclusions; for example, frequent editorials in prestigious journals such as Science and Nature have documented and apologized for many failures to reproduce published results.

Issues of scientific and academic culture are undoubtedly part of the problem. However, the radical changes in sources of data and algorithms applied mean that the practice of data analysis has changed enormously. Data science needs to find new inferential paradigms that allow data exploration prior to the formulation of hypotheses.