Adventures in Data Citation: sorghum as a standard for data release

May 11, 2012

A correspondence we contributed to has just been published in the BMC Research Notes “Data standardization, sharing and publication” series, on the adventures in data-citation and data-release practices surrounding the Sorghum genome data that is available in our GigaDB database and was published last year in Genome Biology. We use Sorghum as an example to highlight the issues surrounding data release and use strong words, subtitling the paper “sorghum genome data exemplifies the new gold standard”. We feel this is justified by the considerable efforts the authors made to go beyond the standards of the field and follow the latest best practices.

Despite genomics having a reputation as being the field of biology with the best established data-release practices and policies, compliance is still mixed. The authors of the Sorghum study went far beyond the usual minimal raw data deposition and spent six months working with the curators of four public repositories (on top of GigaDB) to make sure that all six data types featured in the paper were in their most usable forms. Making all of the supporting data freely available to allow transparency and reproducibility of work is a key goal of GigaScience, and we felt that this demonstration of leadership in the sharing, standardization and publication of biomedical research data should be applauded and highlighted. We feel that the correspondence article fits the open-data related series scope and criteria well, and hope that it can be used to make the wider research community, on top of the usual digital curation experts, more aware of best practices and what is currently possible with data publication.

Sorghum Illustrating Data Citation
Data citation arises from a recognition that data generated in the course of research are just as valuable to the ongoing academic discourse as papers, and DataCite (formed in 2009) provides a technical infrastructure using data DOIs to aid this. To truly put data on a par with research publications, and to credit and track their impact the same way, data DOIs need to be treated like scientific articles and cited in the references section of papers. While this is not new in the environmental sciences (see this paper from 2005 for example), the biology community has not been citing data in this way despite published guidelines and recommendations by databases, other than in very sporadic cases (such as this article citing a PDB DOI). Having learned from our early hiccups getting our dataset DOIs into other journals, we and the authors worked very closely with the editors of Genome Biology, carefully following the guidelines of the DCC, to integrate data DOIs into the references of the research article – the first time that we are aware of that this has been accomplished in the field of genomics. Since this was originally highlighted in the BMC blog, there have been several more successes in this area: subsequent data DOIs have been referenced in Springer journals, one of our data DOIs made it into the references of a Nature series journal for the first time, PLoS journals are now referencing Figshare handles, and our publisher BioMed Central is using the Sorghum dataset as the example of how to cite data in their instructions for authors.
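
To make the idea of a citable dataset concrete, here is a minimal sketch in Python of how a reference-list entry for a data DOI could be assembled from a handful of metadata fields. It assumes the "Creators (Year): Title. Publisher. Identifier" pattern used for the Sorghum dataset entry in our references; the `Dataset` class and `format_data_citation` function are hypothetical names invented for this illustration and are not part of any DataCite or GigaDB tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    """Hypothetical container for the minimal metadata a data citation needs."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    doi: str


def format_data_citation(ds: Dataset) -> str:
    """Build 'Creators (Year): Title. Publisher. DOI-URL' from the metadata."""
    creators = "; ".join(ds.creators)
    return (
        f"{creators} ({ds.year}): {ds.title}. "
        f"{ds.publisher}. http://dx.doi.org/{ds.doi}"
    )


# Metadata copied from the Sorghum dataset entry in the reference list.
sorghum = Dataset(
    creators=[
        "Zheng, L-Y", "Guo, X-S", "He, B", "Sun, L-J", "Peng, Y", "Dong, S-S",
        "Liu, T-F", "Jiang, S", "Ramachandran, S", "Liu, C-M", "Jing, H-C",
    ],
    year=2011,
    title="Genome data from sweet and grain sorghum (Sorghum bicolor)",
    publisher="GigaScience",
    doi="10.5524/100012",
)

citation = format_data_citation(sorghum)
print(citation)
```

The point of the sketch is simply that a dataset citation carries the same elements as an article citation (authors, year, title, publisher, persistent identifier), which is what lets it sit in a reference list alongside papers.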

Sorghum Illustrating Data Deposition
The Sorghum study is also an excellent example for future data submitters of what can be done not only to comply with but also to go beyond minimal journal data policies. On top of all of the data in the Genome Biology paper being available from GigaDB, the raw data (SRA), genome assemblies (in GenBank here), and processed data such as SNPs, Structural Variations, Copy Number Variations and Indels were also deposited in their respective NCBI databases. In other words, the authors not only adhered to the standard journal editorial policies for genomics studies, which insist on raw data deposition (and if possible genome assemblies) in one of the three INSDC databases, but also deposited additional processed data in the dbSNP and dbVar databases. This additional effort is at best encouraged by journals but is not currently mandated. Detailed curation is a time-consuming process (particularly when having to get to grips with data produced by BGI’s new SV-tools), and the staggered build releases mean that full integration can take several months. Once the annotated data is fully integrated into these databases, it will be the first plant data in the relatively new dbVar database.

The advantages highlighted in the correspondence are that the GigaDB entry tied together all of these related datasets in one place and allowed them to be released rapidly in a stable and citable form before the associated analysis paper’s publication. In addition to complementing the data deposited in the NCBI databases, being available in GigaDB makes the data more discoverable through other channels, such as the DataCite metadata search engine and eventually through citation indexes. If future papers include additional data types that do not have established public repositories, that data could be made available in GigaDB, as GigaDB can provide a home for potentially any useful data type, supporting information, scripts or source code. In Sorghum’s case, depositing the data in GigaDB also allowed us to give it a clear CC0 public domain waiver under our data policies, maximizing its potential downstream use and liberating it from any potential legal wrangling.

The Rise of Data Citation
We are hoping this correspondence will further highlight and encourage these promising developments, and motivate the citation indexes to more quickly adopt and track these important research outputs. The correspondence is well timed, with a growing number of developments and announcements regarding data publishing being made in recent months. On top of new data publishing platforms such as F1000 Research, Figshare and Datasets International already highlighted in this blog, there have been further announcements of new data journals in the pipeline, including Geoscience Data Journal from the Royal Meteorological Society and Wiley-Blackwell, and a new series of meta-journals launched by Ubiquity Press (the publishers of which are co-authors of our commentary). These will all benefit from the increased awareness of data citation and the more standardized data practices that studies such as this can help encourage.

References

1. Edmunds, S., Pollard, T., Hole, B., & Basford, A. (2012). Adventures in data citation: sorghum genome data exemplifies the new gold standard. BMC Research Notes, 5 (1). DOI: 10.1186/1756-0500-5-223
2. Zheng, L., Guo, X., He, B., Sun, L., Peng, Y., Dong, S., Liu, T., Jiang, S., Ramachandran, S., Liu, C., & Jing, H. (2011). Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor). Genome Biology, 12 (11). DOI: 10.1186/gb-2011-12-11-r114
3. Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience. http://dx.doi.org/10.5524/100012
4. Hrynaszkiewicz, I. (2010). A call for BMC Research Notes contributions promoting best practice in data standardization, sharing and publication. BMC Research Notes, 3 (1). DOI: 10.1186/1756-0500-3-235
5. Ball A, Duke M: ‘How to Cite Datasets and Link to Publications’. DCC How-to Guides. Edinburgh: Digital Curation Centre; 2011: http://www.dcc.ac.uk/resources/how-guides

Scott Edmunds
