Coronavirus research needs to be a marathon, not a sprint

Epidemics are not the only things that come in waves: research into emerging infectious diseases also has its ups and downs, a GigaScience paper published today reports. The authors at Ben-Gurion University of the Negev analysed more than 35 million papers and explored research trends (scientometric trends) related to nine different infectious diseases. The scientific literature screened for this study spans 20 years and reveals major differences: some diseases, such as HIV, are studied at a fairly steady rate, whereas emerging diseases such as SARS and MERS show a much more uneven pattern. This is a major problem for pandemic preparedness, as the authors explain in their press release:

“There has been no sustained research into these types of infections, merely peaks following specific outbreaks. That pattern has left us woefully unprepared for the COVID-19 pandemic. If we want to be ready for the next pandemic, we must maintain a steady pace of research, even after the current pandemic subsides. The path to understanding is a marathon, not a sprint.”

For GigaBlog, the authors have kindly answered some questions to put this scientometric trends work into context:

Author Q&A with Dima Kagan, Jacob Moran-Gilad & Dr. Michael Fire (pictured) from Ben-Gurion University of the Negev.

What is scientometrics and how can it help to understand research trends?

Scientometrics (or the science of measuring science) is the field of study that centers on the analysis of scientific publications. It can be used to uncover research topics that are over- or under-studied, or to identify the central topics in a specific research field. Using such data, we can better understand how science changes over time and how research trends evolve.

Previous work (including your GigaScience paper on over-optimization of academic publishing metrics) has described the ever-upward trend in overall scientific output. For this study, you explored the subset of the scientific literature that deals with infectious diseases. What are some specific trends and characteristics of this body of literature?

In this study, we explored publications about nine infectious diseases. We observed that, in terms of publication trends, there are two types of infectious diseases. The first type is emerging infectious diseases, whose scientific output spikes after an outbreak and subsides shortly after the outbreak is contained. The second type is diseases with a high burden, which receive steady research attention without drastic changes over the years and commonly exhibit a constant increase in output.

What lessons do we need to learn from that observation? Why do different lines of research (e.g. HIV compared to coronaviruses) show different long-term patterns in terms of publication output?

Figure: Publications by quartile over time for different diseases. Source: Paper.

One of the main reasons for the differences we see here is that SARS and MERS were contained relatively quickly. Moreover, SARS and MERS had a relatively low number of confirmed cases in first world countries. We also found that SARS and MERS were studied in fewer countries than the other diseases we inspected. We believe that when an outbreak starts there is high interest in the academic community, since no one knows whether it will be the next Disease X. However, after the disease is contained, the resources are shifted back and everything returns to normal. It is also reasonable to assume that once a disease is contained, research on it will be considered less relevant and will receive fewer citations. On the other hand, there are diseases such as HIV that have a high burden and are very relevant in first world countries, where most of the research is done. We believe that dedicated funding should be allocated to sustain research on emerging infections globally, even for diseases that do not pose an immediate threat to Western countries.

What data sources did you look at for this study?

Today there is a variety of bibliometric datasets, each with its own advantages and disadvantages. As the main dataset for this study, we used the Microsoft Academic Graph, one of the most detailed academic publication datasets. It contains data not only on publications but also on journals, authors, citations, and more. Even though the Microsoft Academic Graph consists of hundreds of millions of publications, there are cases where the data is incomplete. Moreover, for this study we needed more details on publications, such as the authors' affiliations and the institutes' geo-locations. To fill in the missing data, we used additional datasets. To extract institutions' locations, we used Wikidata and Wikipedia. To get more data on journals, we used the SCImago Journal Rank (SJR) dataset (a measure of journal ranking considered a free alternative to the impact factor) and PubMed (a search engine for academic publications on medicine, nursing, dentistry, veterinary medicine, health care systems, and preclinical sciences).

The COVID-19 pandemic is accompanied by an enormous output of research articles. Do you have a hunch how this will compare to publishing trends during previous epidemics?

We can already observe what is probably a much bigger spike than in previous epidemics. According to data released by Semantic Scholar, there are currently more than 192,500 COVID-19 related papers, of which more than 148,000 were published in the last three months. The COVID-19 pandemic has struck almost the whole world, and it seems that it is going to stay with us for longer than certain past epidemics. It has resulted in a multidisciplinary, worldwide effort to fight the outbreak. Additionally, many grants for coronavirus research were published in the last six months. All these factors lead us to believe that COVID-19 will have a much larger research output than previous epidemics.

Even specialists find it difficult to get an overview of the scientific output during this pandemic. In the future, what kind of tools or methods could help to organize and explore the scientific literature?

Exploring scientific output is generally a challenging task. Today there are more than 214,000,000 publications, and this number keeps growing exponentially. The volume, velocity, and variety of academic data make it even more challenging to create a dataset. Moreover, during this study we stumbled upon a lot of missing data, such as missing ISSNs (International Standard Serial Number, a number used to uniquely identify publications), which makes it hard to fuse publication records with journal impact datasets. There are many additional problems. For instance, institution names are written inconsistently across papers, and small variations in a paper's title sometimes cause it to be treated as a different paper. Solving these types of issues through data curation will assist researchers in discovering bibliometric patterns, especially in fields with a relatively small number of papers, in which each piece of information can improve the analysis.
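To make the curation problems described here more concrete, the sketch below shows one possible way to normalize ISSNs and paper titles before matching records across datasets. It is an illustration only and not the authors' pipeline: the function names (`normalize_issn`, `normalize_title`, `deduplicate`) and the record layout (a dictionary with a `title` key) are assumptions made for this example.

```python
import re
import unicodedata


def normalize_issn(raw):
    """Return an ISSN in the form 'NNNN-NNNC', or None if it is missing
    or does not look like an ISSN. (Hypothetical helper; the check digit
    is not validated here.)"""
    if not raw:
        return None
    # Keep only digits and the letter X, which may appear as a check digit.
    compact = re.sub(r"[^0-9Xx]", "", raw).upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    return f"{compact[:4]}-{compact[4:]}"


def normalize_title(title):
    """Reduce a title to a lowercase, accent-free, punctuation-free key so
    that small variations map to the same string. (Hypothetical helper.)"""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def deduplicate(records):
    """Keep the first record seen for each normalized title.
    Assumes each record is a dict with a 'title' key."""
    seen = {}
    for record in records:
        seen.setdefault(normalize_title(record["title"]), record)
    return list(seen.values())


# Two spellings of the same title collapse into a single record.
papers = [
    {"title": "Scientometric Trends for Coronaviruses."},
    {"title": "Scientometric  trends: for coronaviruses"},
]
unique_papers = deduplicate(papers)
```

Exact matching on a normalized key is deliberately simple. Real bibliographic data would probably also need fuzzy matching and manual checks, since two different papers can share a title.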

How can publishers do better in disseminating coronavirus research?

GigaScience has been part of the C19 Rapid Review consortium, and in keeping with the aims of this effort to speed up access to crucial coronavirus research, this study has been available for early feedback on bioRxiv (see the preprint version here). In analyzing two decades of published research, the authors noted that very few papers had data and code available, and one of the recommendations of this study is that journals should make publishing code and data mandatory when possible. With that in mind, snapshots of the containerised analysis code and an archive of the data and results are available from the GigaScience GigaDB repository:

Read the GigaScience article here:

Kagan D, Moran-Gilad J, Fire M. Scientometric Trends for Coronaviruses and Other Emerging Viral Infections. GigaScience. 2020;9(8):giaa085. doi:10.1093/gigascience/giaa085