Last week marked two important milestones in the deadly 2011 European E. coli O104:H4 outbreak: the Robert Koch Institute announced the end of the outbreak, and several papers were published by the many groups sequencing the pathogen. These included a publication in the New England Journal of Medicine by groups from the BGI, UMC Hamburg-Eppendorf, and the University of Birmingham, acknowledging members of the crowdsourcing community and the work achieved using the genome sequence our colleagues at the BGI made available via our GigaScience database. This was our first dataset released with a DOI and under the freest CC0 public domain license, so now is a great opportunity to look back at the consequences of this novel form of data release.
Given the unusual severity of the outbreak (thousands severely ill and 50 deaths to date), it was clear that the usual scientific procedure of producing data, analyzing it slowly and then releasing it to the public after a potentially long peer-review process would have been unhelpful in this case. Because the first genomic data were announced via Twitter before they had even finished uploading to NCBI, and subsequent improved assemblies were promoted and released the same way, a huge community of microbial genomicists around the world took up the challenge to study the organism collaboratively (a process that some said made this E. coli the first “Tweenome”). Once a GitHub repository had been created (thanks to the efforts of the Era7 team in Spain) to provide a home for these analyses and data, groups around the world started producing their own annotations and assemblies within 24 hours. Within a couple of days a potential ancestral strain had been identified (further clearing Spanish farmers of the blame), and the many antibiotic resistance genes and pathogenic features were much more clearly understood. Releasing the data under a CC0 license allowed truly open-source analysis, and the UK HPA and GitHub members followed suit in releasing their work in this way.
Huge progress was achieved in record time, and from this incredibly speedy work the BGI distributed a free diagnostic protocol and free primers to help immediately in tracking the source of the outbreak. Beyond the good feeling and positive coverage this generated (despite some inevitable disagreement over credit and what exactly was achieved), these novel forms of pre-publication data release did not prevent the acquisition of more traditional forms of scientific credit: publication in prestigious scientific and medical journals.
On top of all of the scientific and public health lessons to be learned, from a journal perspective this is a very important example and test case of how new and faster methods of scientific communication and data dissemination can still complement and work alongside the traditional systems. This is particularly clear as the open-source analysis was published in the New England Journal of Medicine, a prestigious organ with a nearly 200-year history and the originator of the Ingelfinger rule, which causes issues in some (mainly medical) journals regarding certain forms of pre-publication data release. Maximizing the use of the data by putting it into the public domain did not override the scientific etiquette and convention that allow those producing the data to be attributed and take credit. This is a great argument in favour of open data, and an important lesson for all scientists worrying about setting their data free.
As (we think) the first ever citable data DOI issued for an unpublished genome, this new form of intermediate credit (similar to microattribution) did not hinder the eventual publication of the genome analysis paper. We’d like to thank our collaborators at DataCite and the British Library for their help issuing the DOIs, and hope it provides a good example for similar data producers and projects to follow. We have followed this example with the release of additional unpublished genomes, and large supplementary datasets associated with articles in GigaScience will be given DOIs to make them more trackable and findable, further showing their interoperability with traditional scientific articles and forms of data release. This particular disease outbreak was unusually pathogenic, and the sterling efforts of the medical community and the suffering of those affected should not be forgotten. Whilst there are still many unanswered questions and huge amounts of work still to be done, many lessons have hopefully been learned, and (as highlighted here) this project provides an excellent example for the future of how a more collaborative and open form of science can be carried out. As GigaScience would like to be a forum for the discussion of these issues, as well as to promote and work with the open-science movement, we strongly hope that this can continue and grow.
For an update on the results of this see our more recent postings http://finaloriginalblogs.dev/gigablog/?p=403&preview=true and http://finaloriginalblogs.dev/gigablog/2013/02/21/tweenome-on-film-excellent-video-on-crowdsourcing-killer-outbreaks/