Getting Techy With It: GigaScience Technology Update 2014


When it comes to technology, GigaScience has always been open to new ways of integrating it into its publishing processes, with the ultimate goal of working towards more reproducible, interactive and executable papers. So far, 2014 has been an extremely busy year for Team Giga with regard to technical developments, as the journal continues to push the boundaries of reproducible research and open data. Our team have been busy developing our technical platforms, working on case studies and exemplar articles, and presenting many of the early results at conferences around the world (see, for example, Peter Li’s talk at the Galaxy Community Conference this year). With 2015 rapidly approaching, we thought we would give you an end-of-year summary of recent and upcoming technical developments, including some published examples that showcase them.

Publishing Gets Technical
One of our biggest focuses since launch has been to integrate and improve our GigaDB database, as well as to implement new programs and platforms. Much of the work put into version 2 of GigaDB was covered in this paper in the journal Database, which our lead curator Chris Hunter presented at the Biocuration 2014 conference in April. We are currently working on v3, and hope to release it early in the New Year. One project our Data Scientist, Rob Davidson, has been working on is implementing OMERO. Commonly used in the imaging community, OMERO is an open source platform from the OME (Open Microscopy Environment) community that handles imaging data in a secure, central repository and supports over 130 different file formats, including all major microscopy formats. Being open source has allowed plugins for data types such as high content screens to be built, and overall the platform allows one to view, organize, analyse and share imaging data, helping to minimise incidents like the cherry picking of imaging data in the retracted Nature STAP papers. We are pleased to announce that our OMERO-based digital viewer (GigaDV) will be launching in the coming months. This is especially important as we are getting an increasing number of large-scale imaging datasets to showcase, including recently published “prickly pictures” of sea urchins (a 39 GB dataset comprising 141 high-resolution MRI scans from 98 species of sea urchins) and a similar volume of earthworm micro-CT data.

GigaGalaxy and “bring your own” data parties
Prior to joining the GigaScience team, Rob worked in Mark Viant’s group at the University of Birmingham, and through NERC-funded work with us they have developed the first end-to-end metabolomics analysis pipeline in Galaxy, which we will be making available for public use in the next few months. This will join other workflows in our GigaGalaxy server, including implemented SOAPdenovo 2 workflows from that paper (look out for the pre-print of the case study work we have been carrying out on this with the ISA-, Research Object and Nanopublication communities), a population genomics toolkit, and another in-press population genetics tool (Smilefinder). In addition to allowing more interactive and reproducible methods and analyses, using GigaGalaxy also allows much easier and clearer visualisation of results in our papers (see more on its functionality and our Galaxy series in GigaBlog). We’ve just published our third paper in the series and have another one in press.

Given our drive to engage the community and further lower the barriers for data release, in June we co-organised a metabolomics data hackathon at our BGI Hong Kong office with attendees from local universities and the UK metabolomics standards community. The fruitful interaction between the participants resulted in the delivery of an ISA-Tab viewing component for web browsers and the curation and release of a number of datasets that are currently being written up and reviewed as Data Note articles. Owing to the success of this first data-a-thon (and other similar ‘bring your own data’ events encouraged by ELIXIR), the same teams will be organizing a follow-up meeting. Check out the blog and please get in touch if you’d like to participate in future events.

Papers Showcasing the Giga Approach
Anonymization is almost impossible when working with human genome sequences in genomic epidemiological data, so new methods of sharing results whilst keeping anonymity are required. Recently published in GigaScience, GWATCH is a dynamic web-based platform that enables rapid search of large genome-wide datasets, visualisation of disease-gene associations, and 2D and 3D snapshots of gene regions along a moving and colourful chromosome highway, including real-time validation of candidate genes, whilst preserving anonymity behind firewalls. Check out the moving colourful chromosome highway in the accompanying video.

Something we had a lot of fun with, in addition to being worthwhile for the community, was our project (also using Amazon Cloud) aiding the speedy release of the first E. coli sequence dataset from Oxford Nanopore’s USB-powered MinION™ sequencer from Nick Loman’s group, which we recently published after rapid peer review. This rapid release of the 135 GB of data, before other databases were geared up to handle it, has already enabled the data to be integrated into teaching materials and to become reference data for new nanopore real-time analysis tools. Watch for forthcoming projects and reports using this new technology.

Coronary artery disease is the most common cause of heart attacks, and diagnosis is key to preventing such events. A useful tool in diagnosis is magnetic resonance imaging (MRI), which is used to directly examine blood flow to the myocardium of the heart. However, for MRI to be most effective, it requires compensation for the breathing motion of the patient, which is done using complex image processing methods. Thus, there is a need to improve these tools and algorithms, and a key to achieving this is the availability of large public MRI datasets to allow testing, optimization and development of new methods. We have published a virtual box that aids the fight against heart disease. Of course, in true GigaScience “pushing-the-boundaries” fashion, we and the authors have published and packaged the data alongside the tools, scripts and software required to run the experiments, to enable reproducible comparisons between new tools. This is available for download from GigaDB as a “virtual hard disk” that enables researchers to run the experiments directly themselves, as well as to add their own annotations to the data set. These experiments in reproducibility follow on from the electrophysiology paper we published earlier in the year that showcased dynamic ways of generating and publishing papers that utilise R and knitr (see GigaBlog).

Giga GitHub
GigaScience has been using its GitHub repository, aka Giga GitHub, to host code from our own open source projects and infrastructure, as well as helping authors of a number of our papers to archive dynamic and updatable versions of their code (on top of the archived “version of record” snapshots we host in GigaDB). In a new experiment, last month we used GitHub to host a supplementary table from a paper, to enable the contents to be dynamic and updatable. The paper by Lei Chen and colleagues is a review article summarising the latest resources for genome editing using ZFN, TALEN and CRISPR/Cas9 nuclease technologies. As this is such a rapidly developing field, to keep the table up to date we created a version (linked in the paper) that can be edited, forked and more, to enable others to crowdsource content. Feel free to have a play here: https://github.com/gigascience/paper-chen2014/wiki

More Open Peer Review
As you are all aware by now, improving the transparency and reproducibility of research is one of GigaScience’s main goals, and one way we achieve this is through open peer review. We have now taken this one step further by partnering with Publons, the world’s largest open peer-review platform. Our reviewers can now get further recognition for their efforts through Publons, as announced via the BMC Blog. We also got personal with Andrew Preston (co-founder of Publons) in a Q&A. So, if you have reviewed any of our papers you will have received tokens via email; please set up your Publons profile and claim the credit you deserve. We’ve also been carrying out a trial with AcademicKarma, which aims to speed up peer review and produce a more balanced process. Sign up with your ORCID if you are interested in trying it out too.

References

1. Sneddon TP, Zhe XS, Edmunds SC, Li P, Goodman L, Hunter CI: GigaDB: promoting data dissemination and reproducibility. Database 2014: bau018. doi:10.1093/database/bau018.

2. Wollny G, Kellman P: Free breathing myocardial perfusion data sets for performance analysis of motion compensation algorithms. GigaScience 2014, 3:23 https://doi.org/10.1186/2047-217X-3-23

3. Svitin et al. GWATCH: a web platform for automated gene association discovery analysis. GigaScience 2014, 3:18 https://doi.org/10.1186/2047-217X-3-18

4. Chen et al. Advances in genome editing technology and its promising application in evolutionary and ecological studies. GigaScience 2014, 3:24 https://doi.org/10.1186/2047-217X-3-24