Promoting Data Citation in Nature (and Pushing Past Panda Problems)

December 21, 2012

Regular reader of this blog will be aware of our efforts to promote data citation using digital object identifiers (DOIs), and this week, alongside Rebecca Lawrence from F1000 Research and Kevin Ashley from the Digital Curation Centre, our Editor in Chief Laurie Goodman has a correspondence in Nature strongly making this case. The motivation to cite datasets comes from a recognition that data generated in the course of research are just as valuable to the ongoing academic discourse as papers, and should therefore be treated in the same manner. The correspondence makes this case, and is timely with the recent launch of Thomson-Reuters data citation index. While datasets can be linked to database accessions and other identifiers, DOIs are more stable and permanent than URLs, and have the crucial advantage over alternatives in that they are already familiar to researchers, publishers, and libraries. With the new ORCID system allowing DOIs to be imported and linked to an authors other research works, funders such as NSF allowing datasets to be listed in biosketches, and data citation indexes now allowing datasets to be tracked and credited to data producers, there are finally tangible benefits and incentives for data producers in creating and citing data DOIs.

We have previously written about this subject and making the point in high profile journal such as Nature, the authors hope this can promote this point to a much wider audience, and also help directly lobby publishers such as NPG. This is tempered slightly by having to make these arguments in a closed access forum, especially as this means we are not able to reproduce the content here, but we are currently double checking we can put the pre-print version up. The published letter “Data-set visibility: Cite links to data in reference lists” is very short, but in the limited word limit allowed the authors ask publishers, funders, researchers and institutions that Datasets should be more prominently linked to their associated research articles as standard practice, and link to the DCC best practice guidelines to give more detailed instructions on how to do this. They also make a further argument that this increased visibility and accessibility of datasets would also benefit peer-reviewers and readers by “raising standards of data analysis, promoting more detailed review, encouraging data curation and boosting reproducibility and data reuse”.

A brief history of data citation
Working with DataCite and the British Library, our GigaDB databases first DOI (the genome of the deadly outbreak strain E. coli) was issued in June 2011 and we and the growing number of data publishers (including our co-authors on this correspondence F1000 Research) have been working closely with a number of publishers to allow the citation of datasets. The Nature commentary makes the point that at present very few journals are currently doing this, but after our initially unsuccessful attempts at getting DOIs included in the NEJM E. coli paper and DOIs into a Nature Biotechnology paper, our experiences of working with publishers has generally been more positive. Of the journals polled by F1000, only Cell Press and Ann Oncol have said they would have an issue with the pre-publication release of data in this manner. After working closely with the editors of Genome Biology to include the DOI of the Sorghum genome in the references of a paper in November 2011, BioMed Central have used this example in their instructions for authors as how to cite datasets, and we published a commentary in the BMC Research Notes Data Sharing and Standardization series highlighting this best practice. Since the Sorghum example, PLoS, Springer, Science, and a number of publishers have now started properly integrating DOIs from Dryad, Figshare and a number of other data publication platforms into their references.

Two Steps Forward, One Step Back
Following some initial difficulties getting the DOIs from Macaque genomes into a Nature Biotechnology publication, in February 2012 the editors agreed to allow the first DOIs to be cited in the journal. In October this year two papers in the same issue of the main Nature journal included GigaDB datasets in the references, and this letter is a further welcome sign of support from Nature and NPG. That there is still much work to be done is still clear though, and we have a demonstration of the challenges that are still to be overcome in the very week that this letter has been published. Having the correct editorial policies is one thing, but these need made to be made clear in the instructions the authors, editors and production departments follow, and in a paper just published in Nature Genetics on Panda population genomics, the DOIs for the Panda and Polar Bear genomes were moved by the journal from the references list to the URLs section of the journal. Relegating the datasets in this way not only prevents the data producers receiving due credit for making their data publicly available, not having the data in the references means they have also lost the ability to count and track the citations in the data citation index.

As the DOIs in the article were changed into the URLs that they redirect to (e.g. http://dx.doi.org/10.5524/100004 was changed to the current GigaDB URL http://gigadb.org/giant-panda/), this loses the stability and persistence that DOIs bring, and there is much greater risk in the future that the URLs may change and will no longer become resolvable. This was one of the reasons we have bought DOIs from the British Library and are using them for our datasets, and as we are currently migrating GigaDB to a new location and platform, DOIs give us the flexibility to move domains and locations without downstream users losing access, and preventing the need for citations having to change. Unlike the data Accessions section of the journal that is publicly accessible, and the reference section that is mined by citation indexes, the URLs section is the Nature Genetics is totally hidden behind a paywall and so is not mineable or accessible in any way, totally losing the advantages of linking literature and data. Coming out a few days after this particular example occurred, this letter in Nature will hopefully increase awareness and prevent such incidents happening in the future, and speed up the acceptance and treatment of data as first-class records of research.

References
1. Goodman, L., Lawrence, R. & Ashley, K. Data-set visibility: Cite links to data in reference lists. Nature 492, 356 (2012).
2. Zhao, S. et al. Whole-genome sequencing of giant pandas provides insights into demographic history and local adaptation. Nature Genetics (2012).doi:10.1038/ng.2494
3. Edmunds, S. et al. Adventures in data citation: sorghum genome data exemplifies the new gold standard. BMC Research Notes 5, 223 (2012).

Promoting Data Citation in Nature (and Pushing Past Panda Problems)

Scott Edmunds

Blog post tags

Recent comments