CORES Launch event summary (with videos)
On February 18, the Center for Open and REproducible Science (CORES) held its launch event to kick off our newly established center. It was a fantastic day filled with interdisciplinary lectures on the open science movement and panels on the status of open science at Stanford and the issues the movement faces. We were given a glimpse into the emerging universe of open science and how we can further its adoption at Stanford and beyond. Below, we share the videos along with summaries of each presentation.
Russell Poldrack, Professor of Psychology at Stanford and Director of CORES - “Introduction to CORES”
We started the launch event with introductory remarks from Russell Poldrack (Professor of Psychology at Stanford, Director of CORES). Dr. Poldrack began by drawing our attention to a 2020 Pew Research Center survey showing that scientists are among the most trusted groups in the United States. Transparency and reproducibility lie at the heart of scientific integrity and the public credibility of science, and Stanford is well positioned to make advances on these issues.
Open science practices can be challenging for colleagues to implement in their everyday research operations. Open science represents a set of values for how we decide to do our research: transparency and accessibility, diversity and inclusion, and community-mindedness. These objectives align well with a 2018 report from the National Academies of Sciences, Engineering, and Medicine, which laid out a set of open science recommendations for research institutions to adopt. As open science is further adopted, real value can be derived from greater transparency, and reproducibility is crucial for trusting published results. Yet there are barriers that currently make open science practices difficult to implement, and this is where the new CORES center (and more broadly Stanford Data Science) can help. CORES has three primary areas of focus: open science, evidence synthesis, and reproducibility.
Mercè Crosas, University Research Data Management Officer, Harvard University Information Technology and Chief Data Science and Technology Officer, The Institute for Quantitative Social Science - “Enhancing collaboration and access to data throughout the research lifecycle”
Mercè began her talk by conveying why data sharing is crucial, using the example of COVID. At the beginning of the outbreak, it was important to understand the virus from multiple perspectives and to track the rate of infection. After data were released, there was a sharp uptick in COVID preprints and publications coming online, an uptick partially attributable to open data sharing. Open data sharing allows scientists to reproduce, and therefore trust, each other's results.
Mercè then illustrated recent advances in data sharing and university computing resources. New data policies are being implemented by journals across many disciplines, by funders (e.g. the National Institutes of Health (NIH)), and by scientific communities. More domain-specific and domain-general repositories are coming online to support open data sharing efforts. As open data sharing grows, so does the need for university services to support it. These services can live within libraries and IT offices or in academic departments such as statistics or bioinformatics, with general consulting being the most common offering. In designing these offerings, universities will need to support researchers throughout the research lifecycle: planning, active research, and dissemination and preservation.
Mercè closed by conveying her vision for the path ahead and the items her group is currently working on. One objective is a data commons: an interoperable resource for a research community that integrates data with cloud computing infrastructure loaded with commonly used tools for managing, analyzing, and sharing data. Realizing a data commons poses several challenges. One is capturing the contextual information of a dataset, such as sufficient information to reuse the data, complete code to reproduce results, and data sources with their transformations to assess validity. Others include difficulties finding and accessing datasets, sharing and using large and complex datasets, and protecting sensitive or proprietary data. An ultimate goal is to connect different data commons so that they are interoperable and support federated search queries.
Panel: Open science initiatives at Stanford
This panel was moderated by Russell Poldrack (Professor of Psychology at Stanford and Director of CORES)
The panel participants: Kam Moler (Vice Provost and Dean of Research, Marvin Chodorow Professor and Professor of Applied Physics and of Physics)
Melissa Bondy (Stanford Medicine Discovery Professor and Professor of Epidemiology and Population Health, Co-Director, Stanford Center for Population Health Sciences, Associate Director, Population Sciences at the Stanford Cancer Institute)
Emmanuel Candes (Barnum-Simons Chair in Mathematics and Statistics, Professor of Statistics, Professor of Electrical Engineering (by courtesy), Faculty Director, Stanford Data Science Institute)
Steve Goodman (Associate Dean of Clinical and Translational Research, Professor of Epidemiology and Population Health and of Medicine (Primary care and Population Health), Co-Director, Meta-research Innovation Center at Stanford (METRICS))
Jon Krosnick (Frederic O. Glover Professor in Humanities and Social Sciences, Professor of Communication, Professor of Political Science, Professor of Psychology (by courtesy))
The panel discussion focused on the current state of open science initiatives at Stanford.
Steve Goodman began by introducing a new program on rigor and reproducibility in the School of Medicine. Rigor refers to the strength of an experiment's research design and analysis; reproducibility refers to making the data accessible both to the primary investigator and to the research community. The first step of the initiative is data gathering through surveys. It is not solely a technical venture: there are courses and tools to support data sharing and reproducibility, but what the group thinks is really important is culture change, which happens from the bottom up. They have working groups on modifying promotion criteria and enhancing résumés, and they want to make it as easy and simple as possible for researchers to implement open science practices in their workflows.
Kam Moler discussed the values and mission of the CORES center. Kam is excited by the aspirational message our mission conveys. The values the center holds have seen uptake at Stanford, with positive byproducts in the broader context. Transparency and openness are important values with clear connections to ongoing Stanford efforts, and they enjoy wide support from government agencies because they have demonstrated their worth. Truly innovative and ambitious research projects are strengthened by open sharing policies, and transparency and openness nurture public trust and strengthen research credibility.
Emmanuel Candes began by noting that we do open science to achieve replicability, so that results stand the test of time. Replicability has two aspects: the ability to reproduce each step of a data analysis and get the same results, and the ability of fellow scientists to confirm the findings. The second aspect can be framed as a statistical problem, allowing the community to evaluate a statistical finding. The statistics department has devoted energy to developing methods and tools to ensure that what we report stands up to scientific scrutiny. Culture change is another important aspect, and the hope is that a course can be developed to teach students about the challenges of reproducibility.
Jon Krosnick started by referring us to the Wikipedia entry on open science and highlighting the dissemination and accessibility of research products. We can stretch the definition of open science to also include learning and synthesizing the lessons of the past in a simple and easy way. Openness is challenging, but it forces scientists to do better housekeeping of their work, and the possibility that one's work will be closely scrutinized leads us to implement better research practices. There is fear of getting caught in error, but that probability is tiny if the scientific community is sharing its research products. One example of open science failing is polling: when a community shifts from good to suboptimal methodology, it affects the ultimate results and the conclusions reached. We were introduced to a causal diagram of potential pathways explaining why a scientist may not do optimal science.
Melissa Bondy started by describing the importance of rigor and reproducibility in science, including in training the next generation of scientists. Incorporating standardized best practices is important for strengthening research methodology. One funder, the National Institutes of Health, has implemented several checks and balances to keep scientists honest and to push further toward open data sharing. Particularly in human research, safeguards need to be put in place to protect research participants from potential re-identification. It is also important to ensure the data are cleaned and wrangled to address the scientific questions under investigation.
In the open panel section, the group discussed incentives. The key to incorporating open science is viewing it through the lens of value added. One potential change could be a CV format that provides space to highlight open science contributions, implemented not only at tenure review but also at salary review, so that it captures the full span of professorial career ranks. Senior professors can help junior professors by giving them opportunities on larger team science projects. The question of how to address the current incentive structure, centered on publications, may have to be answered to continue pushing the culture change we seek.
Fernando Pérez, Associate Professor of Statistics, UC Berkeley, Faculty Scientist, Data Science and Technology Division, Lawrence Berkeley National Laboratory - “Computing, Statistics, and Reproducibility; pedagogical reflections”
Fernando focused his talk on pedagogical reflections about data science instruction. We were introduced to the collaborative and reproducible data science course he teaches, catered primarily to undergraduates and with broad reach across campus. The goals of the course are the what, why, and how of performing reproducible science, with ideas drawn from a large and growing body of research on reproducibility. The practical backbone skills taught in the course cover version control, programming, process automation, data analysis, documentation, software testing, continuous integration, and reproducible containers. The students work through Jupyter notebooks, web-based documents that weave a computational narrative combining text, mathematics, code, and results. The working environment is hosted via JupyterLab, a next-generation interface for Jupyter notebooks that serves as an entire architecture for interactive data science and computation.
Diving deeper into the backbone skills for working in the open, one of the first things taught is version control, a computational hygiene that should be a daily habit. In practice, this is expressed through git and GitHub; the entire course is hosted and managed on GitHub. Another skill, automation, is taught through open source documentation and implemented through continuous integration. The students can share their work through Jupyter Book. It is important and valuable for the students to work on real-life data, which gives them a sense of how open science is performed and shared in the research community. Students are later introduced to Binder to create a shareable and reproducible product. For the final project, students perform their own original analyses in a reproducible environment.
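The backbone skills above (version control, automated testing, continuous integration) all presuppose that an analysis is deterministic: running it twice yields the same result. As a minimal, hypothetical sketch of that habit, not material from the course itself, a Python script can fix its random seed and assert its own reproducibility:

```python
import random

def simulate_mean(n=1000, seed=42):
    """Draw n pseudo-random values with a fixed seed so the
    'analysis' yields the same result on every run."""
    rng = random.Random(seed)  # seeded generator, so output is deterministic
    values = [rng.random() for _ in range(n)]
    return sum(values) / n

# Running twice gives identical results: the core of computational
# reproducibility that version control and CI can then verify.
assert simulate_mean() == simulate_mean()
```

Version control then tracks this script, and a continuous integration job can rerun the assertion on every commit, which is the kind of automation the course builds on GitHub.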
Fernando then turned to reproducibility as the foundation of collaboration. Tools such as those in the Jupyter universe are initially adopted for the content or services they provide, but the software is extensible and adaptable to the user's needs. It is important for such tools to be language-agnostic so that one can choose the language one would like to use. These tools are made possible by a community that supports each other and makes advancements as a team; an open and collaborative community powers the engine of science. Embracing open science practices has real-life impacts, one of which is the growth in students signing up for these data science courses to learn reproducible practices.
David Moher, Senior Scientist in the Clinical Epidemiology Program, Ottawa Hospital Research Institute, Director of the Centre for Journalology, Ottawa Hospital Research Institute, Associate Professor at the School of Epidemiology and Public Health, University of Ottawa, University Research Chair, University of Ottawa - “Gestational time for elephants is long, so too for open science and reproducibility”
David began his talk by sharing his perspective, as a faculty member in medicine, on research integrity and trustworthy research. He shared a study evaluating how primary outcomes were switched between trial registration and trial publication, and drew our attention to the notion that trustworthy research needs to be useful to readers. With respect to COVID data sharing, there was enthusiasm to share, but it fell short when it came to publication and actually releasing the data. This runs counter to what patients would like: they value the practice of open science and data sharing. Not sharing data begins to erode scientific integrity.
When evaluating academic criteria for promotion and tenure within the biomedical sciences, we see that some institutions are not transparent about their criteria, and those that do share them rely primarily on traditional methods of evaluation (e.g. publications). Academic rank is partially based on productivity, such as the quantity of publications and the number of grants. In David's view, institutions should move away from criteria that promote questionable research practices and instead incentivize and reward research integrity that promotes trustworthiness. The Hong Kong Principles, which David is involved with, foster research integrity in the researcher assessment process. One principle is valuing accurate and transparent reporting of all research, regardless of the results. One way this is implemented is through registered reports: study protocols that are peer-reviewed and granted in-principle acceptance for publication before the study is conducted.
One challenge of implementing open science practices is not knowing how well you, or your institution, are doing. One approach to this problem is automated digital dashboards for visualizing progress, which make it easier to benchmark yourself or to compare institutions against each other. Dashboards could also make it easier for journals to evaluate themselves and understand how to further enhance their policies.
Another way to value and highlight open science achievements is to reimagine our current implementation of the CV, which can then be used when evaluating a faculty member for hiring, promotion, and tenure. Change will come as incentive structures are reevaluated and enhanced, though potential changes do need to be supported with evidence. Another avenue is for funders to value and promote the incorporation of open science practices into proposals.
Panel: Examining issues in Open Science
This panel was moderated by Monica Bobra (Research Scientist, Hansen Experimental Physics Laboratory)
The panel participants: David Studdert (Senior Associate Vice Provost for data resources, Professor of Medicine (Primary care outcomes research), Professor of Law)
Sharad Goel (Assistant Professor of Management Science and Engineering, Assistant Professor of Sociology (by courtesy), Assistant Professor of Computer Science (by courtesy), and Assistant Professor of Law (by courtesy))
Ashley Jester (Assistant Director, Science and Engineering Libraries)
Quay (Ph.D. student in Civil and Environmental Engineering)
The panel discussion focused on the practical elements of a scientific study.
The panel began by discussing the pros and cons of open scientific data. "Open data" can be ambiguous; one perspective is that open data should be FAIR data (findable, accessible, interoperable, and reusable). There also needs to be clarity about the levels of access permitted for different types of data. Different domains sit at various points along the open data spectrum, and policy needs to acknowledge that there is no one-size-fits-all model; it must be tailored to each domain. Some domains are primarily users of data collected by various agencies, and stepping into the data sharing role is sometimes not feasible for them. Another wrinkle is data embargoes: some studies take a long time to collect data, and removing that incentive is challenging to reconcile with immediate reproducibility. Another case is holding back data that could legally be shared but carries ethical considerations. Finally, there is value in being the data creator and collector even if the dataset is immediately made available: the group that collected the data knows it better than anyone and controls which questions are incorporated into the design.
The next question posed concerned the accessibility and usability of open datasets. Here, too, there is high variance, and formats are not typically standardized even within a single source. The primary contributor of open data is typically the government, and modernizing data delivery methods can be quite costly and is likely underfunded.
The panel moved on to open source software, evaluating how feasible a fully reproducible environment is. Some aspects are easier to achieve than others; when external services are part of the research pipeline, full reproducibility typically cannot be achieved, and there are both technical and practical challenges to reaching it. One valuable piece of reproducibility is a how-to guide for running one's analysis and generating one's figures; unfortunately, that is not yet a common research product in the community. Standardization, or best practices for data organization and code sharing, is one way this can be addressed. An interesting discussion emerged over whether work built on proprietary software can still be considered open science; the challenge is that restrictive software licensing can significantly limit the global accessibility of a project, whereas open science should be lowering barriers to research products and reproducibility. Stanford could play a part in the open source community, whether by funding open source projects, helping maintain them, or incorporating open source software into the classroom. Another way Stanford could help is through its research software engineer community, making contributions back to open source projects part of their role.
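The how-to guide the panel describes is often as simple as a single entry-point script that reruns the analysis and regenerates the outputs. A minimal, hypothetical Python sketch of that pattern (the file name, output path, and function here are illustrative assumptions, not anything the panel prescribed):

```python
"""Hypothetical 'reproduce.py': a one-command entry point that reruns
an analysis and regenerates its outputs."""
from pathlib import Path

def run_analysis(out_dir="results"):
    # In a real project this would load data, fit models, and save figures;
    # here we only write a summary file to show the pattern.
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    summary = out / "summary.txt"
    summary.write_text("analysis complete\n")
    return summary

if __name__ == "__main__":
    print(f"Wrote {run_analysis()}")
```

Shipping one such script alongside the data and code lowers the barrier the panel raised: a reader reproduces the figures with a single command instead of reverse-engineering the pipeline.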
We would like to thank all of our speakers for their presentations, panelists for their thoughts, and our attendees for tuning in!
See you at another CORES event!