Data Citation December

Despite the approaching holidays, it's been another busy month in the GigaScience office, with Alexandra attending the InCoB/ISMB-Asia meeting in Kuala Lumpur (see her talk slides here) and the Human Variome Project meeting in Beijing, and Scott attending a number of meetings and workshops in the UK, including the International Digital Curation Conference (IDCC) in Bristol. The “Digital” in the meeting title was a bit of a giveaway of the attendees' level of technological savvy: the conference was heavily tweeted (see #idcc and this storify) and blogged (see here for a good example), and videos are available for many of the talks, so we will not repeat what is already well covered.

With additional workshops on data impact and reuse, Bristol was the center of the Data Citation universe in December, with representatives and talks from many data publishing projects, databases and issuing bodies such as our DataCite collaborators, so it was an excellent opportunity to assess where things currently stand. Mark Hahnel presented interesting new infrastructure, previewing the new design of the FigShare platform launching in the new year, which for the first time will use citable DOIs for its datasets. Brian Hole from Ubiquity Press presented on “Publication and Citation” and mentioned forthcoming data publishing platforms from the company, and the representatives of other publishers present showed that other commercial projects are clearly in the pipeline (for example this from F1000).

This being a curation conference, researcher-driven approaches were also on display. The Environmental Sciences community in particular has been publishing datasets with DOIs for many years, both from the well-established Pangaea database and from individual data centers (Sarah Callaghan’s talk representing NERC’s environmental data centers being a great example). Philip Bourne’s excellent talk imagined the possibilities that could come from combining open data stores with well-integrated widgets and tools to mash up data and produce new analyses. He also mentioned that the very well-established PDB (Protein Data Bank) uses DOIs as accessions, but that these are not integrated into, or cited in, the associated publications. This is a bit of a missed opportunity, and Mark Hahnel (video here) and Heather Piwowar (slides and video) both highlighted the need for proper attribution and impact tracking for datasets to incentivise sharing of data. Our recent examples of DOIs linked to datasets from our GigaDB database being integrated into articles in Nature Biotechnology (see more here) and Genome Biology (see here) demonstrate that it is feasible to link datasets into articles using global, resolvable identifiers.

Whilst Pangaea and the Environmental Science community have managed to do this for a number of years (including examples from as far back as 2005), the integration of data DOIs into the references of the Genome Biology article is, as far as we are aware, the first time this has been accomplished in the field of genomics. It is a great example of the practicalities of how data can be cited (following the best practice guidelines of the DCC), but until the bibliometric indices properly track such citations this is only a first step. With this important next step likely to finally happen in the new year, this meeting was a good opportunity for the data DOI producers and publishers to compare notes and ready themselves for the important year ahead. As December comes to a close, we at GigaScience would like to wish you all season's greetings, and we look forward to an exciting 2012 for the field of data publishing!