Data Intensive Software Publishing & Sailing The Code Ocean. Q&A With Ruibang Luo.

GigaScience is always trying to push the boundaries of how we disseminate reproducible research, and to adapt to the challenges of dealing with experiments becoming more data-intensive. We now showcase a new reproducible research platform we’ve been testing called Code Ocean, and have a Q&A with our author Ruibang Luo on his experiences using it.

To ensure reproducibility, scientific journals are increasingly asking authors to make their code and data publicly available. On top of mandating that all software we publish is open under OSI-compliant licenses and that the code is in GitHub (helping authors with our own GigaGithub repository), we’ve experimented with taking “version of record” code snapshots in our GigaDB repository (e.g. this example). While scholarly communication has changed incrementally since the first scientific paper over 350 years ago, a huge amount is finally happening to update these processes and make them fit for purpose in the data-centric age we now live in. Integrating with the most state-of-the-art open science platforms has helped us adapt to these challenges without having to reinvent the wheel and build additional infrastructure.

After our very early integrations with Publons, a protocols repository, and the pre-print server bioRxiv, this week our latest experiment is with the cloud-based executable research platform Code Ocean. Through publishing computational tools and pipelines using workflow management systems like Galaxy (also via integration with our GigaGalaxy platform), virtual machines, and Docker, we’ve gained experience peer reviewing and publishing these more dynamic types of research objects. Code Ocean is a new addition to the reproducibility toolkit that allows researchers to upload code and data in 10 programming languages and link working code in a computational environment with the associated article. Code Ocean goes to the effort of wrapping and encapsulating the data, code, and computation environment in a “Compute Capsule” that can be interacted with through their platform. As promoters of data and software citation, we are happy to see that they also assign a Digital Object Identifier (DOI) to the algorithm, which allows it to be cited in the reference sections of papers. Using a handy plugin, we’ve also embedded this into our GigaDB entry to allow users to inspect and modify the code, tweak parameters, upload data, run it on AWS, and see what the results look like.

As we target the “big data” end of data-driven research, we wanted to see how Code Ocean could handle a really computationally intensive example. Fortunately, we had a perfect test case: a new variant calling algorithm for massive-scale datasets, submitted to us from the Center for Computational Biology at Johns Hopkins University. This new tool (16GT) has just been published, so you can now inspect the results. Despite this being a deliberately challenging example, Code Ocean passed the integration test with flying colours, and you can see the results yourself at the bottom of this blog and via this DOI. Problems encountered included minor bugs with the front-end, needing to tweak the code to support multi-threading, and making the Perl files executable, but the whole process took the authors 40 minutes to resolve.

In one of our series of GigaBlog Q&As, we talk to first author Ruibang Luo about his work and his thoughts on using the Code Ocean platform. Ruibang is a postdoc in the Salzberg and Schatz Labs at Johns Hopkins University, and before that was leading the HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory in Hong Kong. As a previous author with us, he was a good sport in letting us use him as a guinea pig and in answering the following questions.

Tell us a little about 16GT and why you developed it. Why do we need another variant caller?

GATK, developed by the Broad Institute, is becoming the standard for variant calling, and it provides two different algorithms. UnifiedGenotyper came earlier and runs faster. It uses a probabilistic model and a split-read based model for SNP and Indel detection, respectively. HaplotypeCaller came later and provides better sensitivity than its predecessor, especially on Indels. It assembles possible haplotypes locally and detects differences between the haplotypes and the reference, so SNPs and Indels can be detected simultaneously. But it runs much slower than its predecessor because assembly is computationally intensive. 16GT, by extending the traditional 10-genotype probabilistic model to 16 genotypes to represent both diploid SNP and Indel genotypes, reconciles the detection of SNPs and Indels in a single model, but runs much faster than local assembly methods. Rather than competing with other variant callers, 16GT introduces an update to a traditional variant calling model that worked decently. With a faster running speed and a sensitivity comparable to local assembly methods, 16GT fits well into clinical contexts, where turnaround time is critical and the sequencing depth needs to be especially high.
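To make the genotype counts concrete, here is a minimal sketch that enumerates them. The 10 traditional genotypes follow directly from pairing the four bases without regard to order. The breakdown of the six additional genotypes (a base paired with an indel allele written `X`, a homozygous indel `XX`, and two different indels `XX'`) is our assumption about how the model reaches 16 and should be checked against the paper. The names `BASES` and `diploid_genotypes` are illustrative only and are not taken from the 16GT code.

```python
from itertools import combinations_with_replacement

BASES = ["A", "C", "G", "T"]


def diploid_genotypes(alleles):
    """Return every unordered pair of alleles (homozygous and heterozygous)."""
    return ["".join(pair) for pair in combinations_with_replacement(alleles, 2)]


# Traditional SNP-only model: 4 homozygous + 6 heterozygous = 10 genotypes.
snp_genotypes = diploid_genotypes(BASES)

# Assumed extension (not stated in this post): "X" is an indel allele and
# "X'" is a second, different indel allele at the same site.
indel_genotypes = [base + "X" for base in BASES] + ["XX", "XX'"]

all_genotypes = snp_genotypes + indel_genotypes

print(len(snp_genotypes), snp_genotypes)
print(len(all_genotypes), all_genotypes)
```

A caller built on such a model would assign a probability to each of these genotypes at every site and report the most likely one, so SNPs and Indels are scored side by side rather than by two separate models.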

What sort of datasets are you using this on, and as you continue to extend its functionality, where is this research going?

I was using the Genome In A Bottle dataset, specifically sample NA12878, for evaluating 16GT. I used the published GIAB v2.19 dataset in my study, where each apparent false positive call of 16GT might be either a genuine false positive call by 16GT or a false negative call by GIAB. A later version of GIAB, reported to be published later this year, will have fewer false negative calls and will serve as an even better dataset for evaluating the performance of variant callers.
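The dependence of the false positive count on the truth set can be illustrated with a small sketch. It compares a set of calls against two versions of a truth set using plain set operations. All variants here are made-up toy data, and `compare_calls` is a hypothetical helper, not part of 16GT or any GIAB tooling. Real benchmarking also has to handle differing variant representations and restrict comparison to confident regions, which this sketch ignores.

```python
def compare_calls(called, truth):
    """Count true positives, false positives and false negatives.

    Variants are (chrom, pos, ref, alt) tuples and must match exactly.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / len(truth) if truth else 0.0,
        "precision": tp / len(called) if called else 0.0,
    }


# Toy data only: an older truth set that is missing one real variant.
truth_old = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr1", 400, "G", "GA"),
}
# A later truth set that has recovered the missing variant.
truth_new = truth_old | {("chr1", 900, "T", "C")}

called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 400, "G", "GA"),
    ("chr1", 900, "T", "C"),
}

result_old = compare_calls(called, truth_old)
result_new = compare_calls(called, truth_new)
print(result_old)
print(result_new)
```

In this toy case the same call set shows one false positive against the older truth set and none against the newer one, which is why a more complete truth set gives a fairer picture of a caller.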

Will you use Code Ocean again?

I’ve already built another project called LRSim in Code Ocean. I think Code Ocean is awesome, and I would like to use it for my future projects and recommend it to my colleagues.

You previously published SOAPdenovo2 with us in 2012, and we’ve used that paper as an example in promoting and testing reproducibility (our efforts implementing it in Galaxy and testing it with FAIR data models are covered in a PLOS One paper). As it is our most highly cited paper (over 1200 citations and counting), do you think these efforts have potentially aided this re-use, and how has the process of implementing your paper in Code Ocean compared?

The Galaxy pipeline, the FAIR data models and the errata to the SOAPdenovo2 paper have definitely improved its reproducibility and kept the paper alive. Most of the work and collaboration was done by GigaScience (special thanks to the Executive Editor, Scott Edmunds). Code Ocean, recently developed with the latest concepts and frameworks, provides a better user experience than Galaxy. Code Ocean imports code directly from GitHub and allows the code to be modified on the fly. It provides version control and DOIs, which are essential when publishing algorithms and tools nowadays.

Further Reading

1. Luo R, Schatz MC, Salzberg SL. 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model. GigaScience 2017. doi:10.1093/gigascience/gix045
2. Luo R. 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model [Source Code]. Code Ocean. 2017.
3. González-Beltrán A, Li P, Zhao J, Avila-Garcia MS, Roos M, Thompson M, van der Horst E, Kaliyaperumal R, Luo R, Lee TL, Lam TW, Edmunds SC, Sansone SA, Rocca-Serra P. From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles of Data Models and Workflows in Bioinformatics. PLoS One. 2015 Jul 8;10(7):e0127612. doi: 10.1371/journal.pone.0127612.