Data Citation Enters the (year of the) Dragon

Today marks the first day of the Chinese Lunar New Year, and as we enter the supposedly auspicious Year of the Dragon, now is a good opportunity to look towards developments in the nascent field of data publication over the upcoming year. This week marked important announcements of new and improved data publication platforms. Those lucky enough to attend Science Online (or filter through the nearly 30,000 tweets produced by the meeting’s end!) will have seen the new-look Figshare website promoted in the “Dealing with Data” session, and there has also been good coverage online of the platform’s launch, including in the Wellcome Trust blog. Since the launch of the original website roughly a year ago, recent support from Digital Science (a sister company of Nature Publishing Group) has allowed Figshare to release a much-improved front end, increased storage (currently 250MB, but potentially unlimited) and, importantly where data citation is concerned, citable DataCite DOIs.
Following on from the many developments in the last year (see our posting from last month’s IDCC meeting), another publisher has just thrown its hat into the data publishing ring, with Hindawi announcing the launch of “Datasets International”, a new platform for “archiving, documenting, and distributing scholarly research datasets”. Like Figshare, Dryad and the other platforms already announced (including our associated GigaDB), they follow best practice by asking authors to provide data under a Creative Commons CC0 license, although it is currently unclear how much (if any) data hosting is included in their $300 article processing charge.

As we’ve written previously in this blog, how you cite data is important in tracking and maximizing its use. Ultimately, the adoption of data publication will be greatly aided by publishers, authors and the indexing services correctly carrying out best practice for data citation, and citing the dataset DOIs in the references. CrossRef have just aided this with a supportive call for publishers to cite DataCite DOIs in the reference sections of articles, although the article they use as an example illustrates the problem by not following this practice. Our recent success in getting one of our GigaDB dataset DOIs integrated into the references of a Genome Biology article is a great example that this can be done. The utility of this example is further highlighted this month, as BioMed Central (our publisher alongside BGI) has now used this paper as an example in its reference style guide, found in any journal’s instructions for authors. The guide now explicitly mentions datasets and provides our dataset as an example of a dataset citation:
“Only articles, datasets and abstracts that have been published or are in press, or are available through public e-print/preprint servers, may be cited
Dataset with persistent identifier Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience.”

Springer, BioMed Central’s parent publisher, is also providing examples of correctly cited data, with this recent example correctly citing a Dryad dataset in its references.
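For those assembling reference lists programmatically, the “Dataset with persistent identifier” style quoted above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `DatasetReference` class and `format_dataset_reference` function are hypothetical names invented for this example, not part of any BioMed Central, DataCite or GigaDB tool, and the optional DOI is left unset because the quoted text above does not show one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetReference:
    """Hypothetical container for the fields seen in the quoted dataset citation."""
    authors: List[str]          # each author as "Surname, Initials"
    year: int
    title: str
    publisher: str
    doi: Optional[str] = None   # bare DOI such as "10.xxxx/yyyy"; None if unknown


def format_dataset_reference(ref: DatasetReference) -> str:
    """Render 'Authors (year): Title. Publisher.' plus a DOI link when one is set."""
    author_list = "; ".join(ref.authors)
    citation = f"{author_list} ({ref.year}): {ref.title}. {ref.publisher}."
    if ref.doi:
        # Assumption: the DOI is appended as a resolvable https://doi.org/ link.
        citation += f" https://doi.org/{ref.doi}"
    return citation


sorghum = DatasetReference(
    authors=[
        "Zheng, L-Y", "Guo, X-S", "He, B", "Sun, L-J", "Peng, Y", "Dong, S-S",
        "Liu, T-F", "Jiang, S", "Ramachandran, S", "Liu, C-M", "Jing, H-C",
    ],
    year=2011,
    title="Genome data from sweet and grain sorghum (Sorghum bicolor)",
    publisher="GigaScience",
)

print(format_dataset_reference(sorghum))
```

Keeping the DOI as a separate field rather than baking it into the title makes it easy for indexing services to pick out the identifier, which is the whole point of citing datasets in the reference list.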

Those interested in the GigaScience journal and its integrated GigaDB database should contact us on […]. Currently the database is being populated with datasets from BGI collaborations and projects, but upon launch of the journal we will be hosting and issuing DOIs to datasets associated with GigaScience articles, giving an extra form of credit and increasing the discoverability and impact of an author’s work. Please contact us via the above email or submit your big-data associated research articles through our submission page.
Gong Xi Fa Chai! Happy New Year of the Dragon!

1. Zheng LY, Guo XS, He B, Sun LJ, Peng Y, Dong SS, Liu TF, Jiang S, Ramachandran S, Liu CM, Jing HC. Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor). Genome Biol. 2011 Nov 21;12(11):R114.

2. Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience.

3. Hodkinson BP, Uehling JK, Smith ME (2012) Lepidostroma vilgalysii, a new basidiolichen from the New World. Mycological Progress, online in advance of print. doi:10.1007/s11557-011-0800-z

4. Hodkinson BP, Uehling JK, Smith ME (2012) Data from: Lepidostroma vilgalysii, a new basidiolichen from the New World. Dryad Digital Repository.


*UPDATE* 24/1/12: Hindawi have confirmed they will include data hosting (see comments).

Recent comments

  • I would like to let you all know that the $300 article processing charge, indicated by Hindawi, will include the hosting of all datasets.

  • Thanks for clarifying Kamal. Great you’ll be hosting the data, as there currently is a lack of repositories for so many data types. Can I just ask how much storage (MB/GB) you get for that? Good luck with the project, as the time seems right for data publishing!

  • Thank you for your wishes Scott. Regarding your question, we do not have any predefined limits on the size of datasets that can be submitted to Datasets International. If we encounter any cases where a dataset is too large for us to host, we will work with the submitting authors to try to find a solution. For example, if there are any centralized databases that are well-equipped to host the data we may deposit the dataset with them and then provide a link to their database. However, unless we encounter any cases of extremely large datasets, we will be happy to host any amount of data that a researcher would like to publish. Best regards.

  • […] publishing is currently very topical, and coming on top of the many other recent schemes we have written about in the past, Nature Publishing Group have recently announced their own foray into the field, with […]
