We have a Q&A with author Tom Edinburgh from the University of Cambridge on his new GigaByte paper presenting Sepsis-3 criteria in AmsterdamUMCdb, which is one of the largest freely accessible intensive care databases in Europe. As a journal that promotes transparency and openness, we’ve previously written about the challenges of peer reviewing clinical data, and how our named open peer review provides opportunities for opening up access to controlled-access datasets. Here we provide another example of the peer reviewing of de-identified health data, with an additional layer of complexity that reviewers can overcome by completing a course on human subjects research. The authors were keen to share as many outputs of this project as they could, being big supporters of reproducible research (see contributor to this work and GigaByte Editorial Board Member Stephen Eglen’s previous Q&A on his early attempts at “push button papers” using Knitr and CodeCheck certified reproducibility certificates). In this post Tom explains this process and the key advantages of the new clinical definition of Sepsis-3 over previous definitions of sepsis.
Sepsis is a life-threatening condition and is characterised by organ dysfunction. The dysregulated host response to infection is a major factor in the aetiology of sepsis, and Sepsis-3 includes a revised definition of sepsis based on a scoring system that utilises Sequential Organ Failure Assessment (SOFA) across six domains including respiratory, neurological, cardiovascular, liver, coagulation and renal systems.
I was the Curator and Reviewer for this GigaByte Technical Release manuscript, and it was interesting to explore how one gains access to the de-identified health data used in this study. Amsterdam University Medical Centers Database (AmsterdamUMCdb) contains de-identified health data related to tens of thousands of intensive care unit admissions, including demographics, vital signs, laboratory tests and medications. The database, although de-identified, still contains detailed information regarding the clinical care of patients, so must be treated with appropriate care and respect and cannot be shared without permission.
To gain access to AmsterdamUMCdb, one must first complete the Data or Specimens Only Research (DSOR) course from CITI. This course introduces some of the core issues in human subjects research, including de-identification, informed consent, Institutional Review Board (IRB) policy, and conflict of interest (COI), and it additionally offers advice on when an IRB member should recuse themselves from voting on a study that is under review. The background on ethics is especially interesting and highlights the need for the publication of the 1974 Belmont Report for the Protection of Human Subjects of Biomedical and Behavioral Research.
Once I completed the DSOR course I was able to submit my Access Request Form and End User License Agreement for AmsterdamUMCdb. This application needs to be counter-signed by an intensivist, i.e. a health professional who specialises in the care of critically ill patients, and who is the named reference on the application form. I am grateful to Dr Ari Ercole who, as an intensivist and one of the co-authors of the GigaByte manuscript, very kindly provided this reference in support of my application. Access to AmsterdamUMCdb was relatively straightforward, and the Data Archiving and Networked Services (DANS) based in the Netherlands were very responsive and were able to send me a link where I could download the AmsterdamUMCdb archive.
Here, in one of our long-running author Q&As, we talk to first author Tom Edinburgh, a PhD student from the University of Cambridge, about this work.
Can you tell us a bit about the Amsterdam University Medical Centers Database (AmsterdamUMCdb)? What is it and how does it work? Did your analysis of this massive cohort uncover any surprising findings?
Intensive care is home to perhaps the most data-dense clinical environment, with continuous multimodality monitoring (which includes demographics, physiology, treatments and drugs, lab values and scans) an essential component in patient care. This creates a real challenge for data-sharing, because the level of detail raises privacy, ethical and legal concerns that need to be balanced against the potential research benefits, and databases must undergo a comprehensive risk-based de-identification process before becoming freely accessible. Until recently, the prominent large-scale intensive care unit (ICU) databases were limited to cohorts from the US, and differences in demographics, treatments and resources limit the ability to generalise knowledge to other populations. Amsterdam University Medical Centers Database (AmsterdamUMCdb) is a recent European ICU database, which is compliant with both the U.S. Health Insurance Portability and Accountability Act (HIPAA) and the European General Data Protection Regulation (GDPR) and is freely accessible. GDPR compliance and the demonstration of public acceptability for such projects were a landmark step forwards for this kind of work in the EU. The database contains approximately 1 billion data points from >20,000 critically ill patients admitted to Amsterdam UMC between 2003 and 2016, with the records having passed through a robust risk-based de-identification process. The different modalities of the data are combined in a data-lake structure, linked via anonymised identifiers, and the database is accompanied by a GitHub repository containing detailed resources for users.
The new paper in GigaByte presents the code for implementing the Sepsis-3 definition. What exactly does this do, and why did you decide to open source it?
Worldwide, sepsis is a huge killer and a key reason for ICU admission. As such, it has always been, and will likely always continue to be, an important area for ICU research. The huge heterogeneity in sepsis patients makes a Big Data observational database approach very attractive. However, this is always fundamentally going to depend on operationalising a consistent definition of sepsis, which isn’t an easy task. This implementation of the Sepsis-3 definition has several key components. First, we must calculate the Sequential Organ Failure Assessment (SOFA) scores, which are a marker of the severity of the patient’s condition. The total SOFA score is made up of six sub-scores, each graded on an integer scale from 0 to 4, relating to the cardiovascular, respiratory and central nervous systems, as well as coagulation, liver function and renal function. A new organ dysfunction is defined as an increase in total SOFA score of at least 2 points. For the Sepsis-3 criteria to be met, this must be accompanied by acute infection, which is operationalised by non-prophylactic antibiotic escalation (which may be an addition to the antibiotics administered or a switch to stronger antibiotics). We also restructured the previous ‘diagnosed at admission’ sepsis definition currently employed in the AmsterdamUMCdb repository, to allow direct comparison between the two definitions. We’re passionate about reproducibility and open science, and making our work on this open-source is a small effort in the right direction towards this goal, allowing users not only to use the Sepsis-3 definition within AmsterdamUMCdb but also to examine more closely the underlying computations.
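As a rough illustration of the logic Tom describes, here is a minimal Python sketch. It is not the authors' code: the function names, the dictionary layout and the choice of a single "baseline" score to compare against are assumptions made for this example, and it takes the six sub-scores as given rather than deriving them from raw measurements or time windows.

```python
from typing import Mapping

# The six SOFA sub-scores named in the text; the key names are illustrative.
SOFA_COMPONENTS = (
    "cardiovascular",
    "respiratory",
    "central_nervous_system",
    "coagulation",
    "liver",
    "renal",
)


def total_sofa(subscores: Mapping[str, int]) -> int:
    """Sum the six sub-scores, each an integer from 0 to 4 (total 0 to 24)."""
    total = 0
    for name in SOFA_COMPONENTS:
        score = subscores[name]
        if not 0 <= score <= 4:
            raise ValueError(f"{name} sub-score must be between 0 and 4, got {score}")
        total += score
    return total


def meets_sepsis3(
    baseline: Mapping[str, int],
    current: Mapping[str, int],
    antibiotic_escalation: bool,
) -> bool:
    """Apply the two conditions from the text.

    New organ dysfunction: total SOFA rises by at least 2 points.
    Acute infection: non-prophylactic antibiotic escalation (passed in as a
    flag here; identifying it from prescriptions is the hard part in practice).
    """
    new_organ_dysfunction = total_sofa(current) - total_sofa(baseline) >= 2
    return new_organ_dysfunction and antibiotic_escalation


# Hypothetical patient: no dysfunction at baseline, then respiratory and renal changes.
baseline = {name: 0 for name in SOFA_COMPONENTS}
current = dict(baseline, respiratory=2, renal=1)
print(total_sofa(current), meets_sepsis3(baseline, current, antibiotic_escalation=True))
```

Keeping the SOFA increase and the infection flag as separate inputs mirrors the structure of the definition: either condition on its own is not enough.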
You say in the manuscript that the incidence of Sepsis at admission is rarely documented consistently. What are the key advantages of Sepsis-3 clinical criteria over previous definitions of Sepsis?
Sepsis has a broad umbrella definition (‘a life-threatening organ dysfunction caused by a dysregulated host response to infection’) and it’s even harder to operationalise that definition in a clinical setting. It’s often not considered the ‘primary’ reason for admission even when identified during admission, and so may not always be recorded consistently. Whilst there is debate within the clinical care community about the shifting definition of sepsis, the Sepsis-3 definition addresses concerns that the previous ‘SIRS’ definition is not necessarily indicative of a dysregulated life-threatening response (‘poor discriminant validity’). Sepsis-3 was proposed as a more data-driven definition and is accompanied by a bedside clinical score for rapid identification that can be consistently applied across ICU centres.
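The bedside score is not named in the answer above; assuming it refers to the quick SOFA (qSOFA) score published alongside Sepsis-3, a minimal sketch might look like the following. The function and parameter names are invented for this example, and the thresholds are those of the published consensus definition rather than anything taken from the AmsterdamUMCdb code.

```python
def qsofa_score(respiratory_rate: float, systolic_bp: float, altered_mentation: bool) -> int:
    """Count how many of the three qSOFA criteria are met (0 to 3).

    Assumed criteria: respiratory rate >= 22 breaths/min,
    systolic blood pressure <= 100 mmHg, and altered mentation.
    """
    return (
        int(respiratory_rate >= 22)
        + int(systolic_bp <= 100)
        + int(altered_mentation)
    )


def qsofa_positive(respiratory_rate: float, systolic_bp: float, altered_mentation: bool) -> bool:
    """A score of 2 or more flags the patient for closer assessment."""
    return qsofa_score(respiratory_rate, systolic_bp, altered_mentation) >= 2


# Hypothetical observations: fast breathing and low blood pressure, normal mentation.
print(qsofa_score(24, 95, False), qsofa_positive(24, 95, False))
```

Because it needs only three routine observations and no lab values, a score like this can be applied in the same way at any bedside, which is the consistency Tom refers to.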
AmsterdamUMCdb archives clinical data from a 13-year period between 2003 and 2016. Are there plans to extend this resource to include more recent Intensive Care Unit patient data, for example, patient data from the current COVID pandemic?
Good question! A COVID-19 intensive care unit database is actually something that is already largely in place, and it’s a very impressive feat that Amsterdam UMC and partner company Pacmed have achieved with this, because they have integrated patient data from 35 intensive care units in the Netherlands. This nationwide data-sharing collaboration and processing pipeline has created the first EHR database with full-admission data from COVID-19 patients admitted to ICU, with >400 million data points on over 46,000 parameters from more than 1500 critically ill patients. More details about this can be found at https://amsterdammedicaldatascience.nl. Access is necessarily more tightly controlled than for AmsterdamUMCdb, as the smaller number of patients increases the risks of re-identification. However, the existence of this COVID-19 dataset was actually one of the key motivators for the work detailed in our manuscript, as we sought a respiratory sepsis cohort for comparison with a similar COVID-19 cohort from the DDW in an ongoing research project.
What challenges did you have with sharing this code? As we have named, open peer review, how did you find this process, and how did it assist review of the code and data?
We’re very grateful to the reviewers, Chris Armit and Tom Pollard, for their feedback, including comments about code readability and usability that we can take forward more generally in future projects. The suggestions they made helped us to go back over the code with an eye to improving its clarity. In particular, whilst it is non-trivial to avoid hardcoding, given the codebase is often reliant on line-by-line transformations of unique variables, we did introduce some capability for command-line argument parsing. Publishing this manuscript in an open science journal, where results are independently verified in peer review, would have been trickier without coordination with the editorial team and the open peer review, since there are necessarily some access requirements for AmsterdamUMCdb (these requirements and the database set-up are straightforward but potentially time-consuming for a one-off use of the database during review). The GigaByte review format is very structured, which is helpful for authors (and I imagine reviewers as well).
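To show the kind of change this refers to, here is a small sketch of command-line argument parsing with Python's standard argparse module. The flag names and defaults are hypothetical and are not taken from the authors' repository; the point is only that paths and thresholds can be supplied at run time instead of being hardcoded.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with illustrative (hypothetical) options."""
    parser = argparse.ArgumentParser(
        description="Sketch: pass data locations and thresholds as arguments."
    )
    parser.add_argument(
        "--data-dir",
        default="data",
        help="Folder containing the downloaded database tables (hypothetical flag).",
    )
    parser.add_argument(
        "--output-dir",
        default="output",
        help="Folder where derived cohort files are written (hypothetical flag).",
    )
    parser.add_argument(
        "--min-sofa-increase",
        type=int,
        default=2,
        help="Minimum rise in total SOFA score that counts as new organ dysfunction.",
    )
    return parser


# Parse an explicit argument list so the example behaves the same wherever it runs.
args = build_parser().parse_args(["--data-dir", "/path/to/tables"])
print(args.data_dir, args.output_dir, args.min_sofa_increase)
```

Reviewers with the database stored in a different location could then run the same script without editing its source.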
You have pointed out that you would welcome any contribution to this codebase from the wider user community. What sort of things would you like to see undertaken?
First and foremost, we hope this is a useful resource for researchers who are working with AmsterdamUMCdb, which, as mentioned, is one of the largest freely-available critical care databases. As an example, having identified ICU data sharing as a key priority and supported the development and release of AmsterdamUMCdb, the European Society of Intensive Care Medicine have already hosted multidisciplinary-team datathons (and planned future ones) to investigate clinical questions using AmsterdamUMCdb. Some of the technicalities of coding the definition of Sepsis-3 may be challenging for clinicians with an interest in data science, and having access to scripts such as this may reduce some of the technical barriers and allow researchers to more easily define patient cohorts for their investigations. At the same time, we definitely welcome feedback or improvements to the code, and to that end users can create an issue on the GitHub repository.
Apart from this, it’s perhaps not straightforward to extend the codebase directly, as ultimately it’s a single immutable transformation of the dataset into something clinically useful. Large-scale intensive care databases, at the current time, are distinct in terms of structure, parameters available, and even default language of any plain text, so it’s an incredibly challenging task to map EHRs from multiple hospitals to a single consistent data model, when these hospitals may have different data recording and storing systems. This is without factoring in differences in protocols and treatments, which are important to consider in this kind of work; for instance, we needed to distinguish antibiotic administration for infection from Amsterdam UMC’s routine practice of selective digestive decontamination. However, one benefit of making this code open-access is that it can potentially provide a starting point for other researchers to encode the Sepsis-3 definitions and identify patient cohorts in other similar databases.
Tom Edinburgh, Stephen J. Eglen, Patrick Thoral, Paul Elbers, Ari Ercole, Sepsis-3 criteria in AmsterdamUMCdb: open-source code implementation, GigaByte, 1, 2022 https://doi.org/10.46471/gigabyte.45