Changing the Culture of Data Management and Sharing: A Report on the NASEM Workshop


The National Academies of Sciences, Engineering, and Medicine (NASEM) hosted the two-day virtual workshop Changing the Culture of Data Management and Sharing on 28th-29th April 2021 to discuss the challenges and opportunities of establishing effective data management and sharing practices, and to explore the question of universal availability of scientific data. In advance of the new NIH Data Management and Sharing Policy that will be implemented in 2023, this virtual workshop was of great interest to researchers, funders, and publishers, and had over 1200 attendees. GigaScience Data Scientist Chris Armit attended the workshop and reports below on some of the major highlights. We have also collected together the video streams of the event now that they have gone live.

NIH Data Management and Sharing Policy
As GigaScience Editorial Board member Maryann Martone (University of California, San Diego) explained in the opening session, the NIH Data Management and Sharing Policy was released in October 2020, and a Data Management and Sharing Plan will be required for all NIH-funded research from 25th January 2023. The Policy goals are as follows:

  • Increase scientific transparency and public trust
  • Improve reliability / reproducibility
  • Enable reuse of valuable data
  • Accelerate discoveries

Richard Nakamura (NIH Center for Scientific Review) followed on from this and presented “Goals for the NIH Data Management and Sharing Policy” which highlighted the benefits of Data Sharing and FAIR Data. These benefits included reproducibility, and the possibility of meta-analyses that would enable new conclusions to be drawn from existing datasets. Richard further highlighted the need for a culture shift to ensure universal availability of scientific data, but was swift to point out that this requires cooperation across a diverse research environment that includes researchers, funders, and publishers.

What Has the COVID-19 Pandemic Taught Us About Data Sharing and Open Science?
So what infrastructure is needed to enable Data Sharing? Patricia Brennan (Director, National Library of Medicine) touched on this in her Keynote Presentation entitled “What Has the COVID-19 Pandemic Taught Us About Data Sharing and Open Science?” Patricia highlighted critical steps taken by the NIH – such as investing in data repositories, hub-and-spoke frameworks with a Common Cloud Infrastructure, and data life cycle plans that ensure consent and reuse – that have been particularly helpful for promoting Data Sharing during the pandemic. As Patricia explained, the two key projects that were most relevant for COVID-19 were: 1) the Post-Acute Sequelae of SARS-CoV-2 Infection (PASC) Initiative, and 2) the Rapid Acceleration of Diagnostics (RADx) initiative. Together these cost the NIH almost $2 billion, and both had core principles of Data Sharing – including patient consent and patient de-identification – integrated into their project plans from the very beginning. So what happens when COVID-19 data are shared and science is open? Vaccines entered Phase 3 Clinical Trials within 180 days. In addition, there was timely characterisation of variants, rapid evaluation of therapeutics, and real-time interventions; importantly, the public were reassured that steps were being taken to combat the pandemic. Are there any messages here for future studies? As Patricia explained, “consent is important”, and this leads to the question of whether there should be generalized consent for human patient data to enable clinical data to be shared more easily.

The Need to Incentivise Data Sharing
So what is this culture shift that Richard Nakamura feels is necessary? Additional perspectives on this topic were provided in the panel discussion at the end of the opening session. In particular, Alexander Ropelewski (Brain Image Library) hinted that there may be barriers at the institutional level that prove troublesome for labs wishing to submit their data to a centralized repository. From a clinical data perspective, Joshua Wallach (Yale School of Public Health) echoed this sentiment and highlighted the need to incentivise data sharing in the clinical community. As Joshua explained, in the clinical research environment it is more likely for data to be ‘kept within the group’. As a corollary to this, Atul Butte (UCSF) was swift to point out that DOIs are insufficient as an incentive, and that we should be thinking in terms of profit when proposing motivations for a Data Sharing culture shift.

On a related note, in the later panel discussion in the session entitled “Data Quality and Other Factors that Make Data More Likely to be Reused”, Jan Bjaalie (University of Oslo, EU Human Brain Project) further commented that researchers could benefit from “a business model” that incentivises data reuse.

Measuring Success in Data Sharing Strategies
So what does successful Data Sharing look like? In the session entitled, “Strategies for Managing and Sharing Data: Diverse Needs and Challenges”, David Haussler (UC Santa Cruz) explained how the Global Alliance for Genomics and Health (GA4GH) strategy changed from a Data Commons to a Data Federation. As a founding member of GA4GH, David explained that the original idea of a Data Commons, which would aggregate data from multi-national datasets, did not work due to a fundamental ‘lack of trust’ in the international community. The Data Federation approach – which would host data locally – ensures privacy preservation and this has proven to be a more successful strategy for Data Sharing. As David explains, “the most important issue in Data Sharing is respect. Respect for those who make the science possible by contributing data. But that respect has to be earned with an equal amount of generosity.” GigaScience are organisational members of GA4GH, and for more see the recent GigaBlog on the GA4GH 8th Plenary Meeting.

From a repository perspective, the talk by Rebecca Koskela (Research Data Alliance) was particularly informative. Rebecca highlighted a survey the RDA performed that asked, “If your data are not available to others, why or why not?” By conducting the same survey on a 4-year cycle, the RDA has uncovered some interesting changes in perceptions over this 12-year period of investigation, with insufficient time and a lack of funding being less commonly held viewpoints in the more recent surveys. As Rebecca explained, the advantages of Research Data Repositories include: avoidance of data generation costs; efficiency of data management; long-term usability and reuse of data; transparency of scientific results; and value-added data products. Rebecca additionally highlighted the perspective of journals and publishers, and explained that Journal Open Data Sharing Policies “grew organically”: there was an early understanding that a citable DOI was needed, and over time adopting the FAIR Principles was additionally deemed of immense value.

Jeremy Wolfe (Brigham & Women’s Hospital, Harvard Medical School) additionally gave his take on the Journal Editor perspective and pointed out that “there is a tension between full reporting and the desire to publish science that moves the field forward”. Jeremy noted that, from a researcher perspective, the structure of Basic Experimental Studies with Humans (BESH) “does not fit well” with standard publication formats, but also that, from a Journal Editor perspective, he does not want journals flooded with extraneous data that dilute the core message of the scientific findings in a manuscript. Jeremy offered one potential solution to this dilemma: splitting research findings between Grant Progress Reports and Peer-Reviewed Publications.

Value and Costs of Managing and Sharing Data
In the session entitled “Value and Costs of Managing and Sharing Data”, John Borghi (Lane Medical Library, Stanford University) explored data sharing in practice and offered insights into the perceived value of data management and data sharing. Key motivations for data management were the need “to ensure access for collaborators”, “to foster openness and reproducibility”, and “to prevent loss of data”. Ana Van Gulick (FigShare) further addressed this point and highlighted that “measuring data impact is important”, but was also swift to point out that “data citation is still an emerging practice without clear standards” and that it is “still early to see large-scale reuse of open data”. As Ana explained, the perception of data sharing is dependent on recognition of value from both funders and host institutions.

Is Data the New Oil…?
Daniel Goroff (Vice President and Program Director, Alfred P. Sloan Foundation) was undoubtedly the best-dressed speaker at the workshop and gave a very interesting talk from an economist’s perspective that highlighted the potential pitfalls of reusing sensitive data. In the session entitled “Data Quality and Other Factors that Make Data More Likely to be Reused”, Daniel questioned whether “Data is the New Oil” and highlighted that reference datasets, if open, can potentially represent a public good. As Daniel explains, “Oil is rival. Once you use up a barrel, no one else can use that same barrelful”. In contrast, a public good as a commodity is ‘non-rival’ and does not get used up. A public good has the additional property that it is non-exclusive. However, as Daniel further explains, “There are laws and expectations about people’s data. Data reuse threatens research validity if the results are not accurate. Data reuse threatens research legality if the results are not privacy preserving. Reusing data requires inevitable trade-offs between privacy and accuracy”.

In the panel discussion, Daniel highlighted the moral and legal considerations of data reuse as it relates to human data. Somewhat controversially, Daniel suggested that data reuse in this scenario may actually be cost-ineffective, and that generation of primary data tailored to a research question could be more beneficial. As Daniel explains, “If you tell me that we could spend 2 or 3% of the federal budget that normally goes to research on making data available for reuse, I think that sounds great… If you tell me that it’s 20% of the federal budget, I begin to wonder… If it’s 30 or 40%, then I’m not sure how much it’s worth it, as opposed to generating new research”. Daniel highlighted that what is needed here are statistical measures of reuse to ensure that data science “discoveries” are ‘true and not false’, and that these are sadly lacking in journals and data repositories. For more details on this type of question, see the following slides from one of Daniel Goroff’s previous talks.

The Importance of Consent – an Issue for Citizen Science?
Consent to reuse data was a major theme at the workshop, as was the issue of respect. Anita Allen (University of Pennsylvania Law School), in the session entitled “Implementing the NIH Data Management and Sharing Policy: The Evolving Ethics of Data Sharing”, echoed this sentiment in a very thought-provoking way and highlighted a poignant legal case at the University of Pennsylvania, where the human remains of a young girl who died in the 1985 MOVE bombing were used as an anthropological case study without consent from her parents. The Penn Museum has now apologized for keeping the remains, and according to The Philadelphia Inquirer, “the museum said the remains should have been returned, and pledged to reassess its practices”.

On a related note, Mark Rothstein (Herbert F. Boehl Chair of Law and Medicine, Founding Director of the Institute for Bioethics, Health Policy and Law at the University of Louisville School of Medicine) – in the session entitled “Shaping a Culture of Data Sharing – Reducing Barriers and Increasing Incentives” – highlighted potential issues with Citizen Scientists who may wish to reuse data but are not subject to the Common Rule or FDA research regulations. From a legal perspective, Mark questioned whether there should be restricted access for citizen scientists rather than the more familiar open access.

Data Availability in the Time of COVID-19: A Publisher’s Perspective
So, from a publishing perspective, has the COVID-19 pandemic incentivized the sharing of clinical data? As members of the data working group of the C19 Rapid Review Consortium, we found this a very interesting question. In the session entitled “Encouraging Data Sharing Outside of Mandates”, Ashley Farley (Bill & Melinda Gates Foundation) highlighted the lack of Data Availability Statements in COVID-19-related articles. Referring to a report by Georgina Humphreys, Clinical Data Sharing Manager at Wellcome, Ashley highlighted that only 9% of COVID-19 research articles in Europe PMC have a data availability statement indicating where, and under what conditions, the data can be accessed. It is additionally noteworthy that the original Wellcome report compares this low percentage to “22% of all research articles published in 2020”. This data-publishing perspective is an interesting counterpoint to the Keynote talk by Patricia Brennan, which emphasised only the positive aspects of Data Sharing in the time of COVID-19.

In summary, this workshop explored many different facets of Data Sharing, and raised more questions than answers. I agree with Richard Nakamura that a culture shift is needed and that there needs to be coordination between researchers, funders, and publishers, but there are invaluable insights from policy makers, economists, and legal teams that additionally need to be considered to ensure that this culture shift is ethical, and that due consideration is given to patient privacy.

The webcast was recorded, and the entire video playlist is publicly available at the following link.