In this data-driven era, research faces new challenges in sharing, storing, and accessing data, as well as in integrating data to answer big questions in science. With so many data repositories available, it is hard to maintain them all – some repositories are forced to close – meaning loss of access to invaluable datasets.
Just published in GigaScience is iMicrobe, an open platform that enables community-driven science and data sharing for the greater metagenomics community. As a journal that follows the FAIR (Findable, Accessible, Interoperable, and Reusable) principles and promotes reproducibility, we are pleased to see platforms like iMicrobe meet the FAIR principles and promote reproducibility by using CyVerse to track data analysis and provenance, and by providing open protocols on protocols.io that can also be used for educational and training purposes. Here, in one of our Author Q&As, Bonnie Hurwitz (iMicrobe Project Lead) shares her thoughts on the challenges they have faced with the platform, the benefits for the greater metagenomics community, and future plans to further enable citizen science and educational endeavors.
How did the idea of iMicrobe take shape, and what were the challenges you encountered when putting this platform together?
This question alone could fill volumes! iMicrobe picked up where the CAMERA project left off. CAMERA was going to shut down its servers, meaning the community would lose access to valuable datasets. The Gordon and Betty Moore Foundation gave us a small ($250K) grant to create iMicrobe. That money allowed us to buy servers to store the data files and develop a new search interface to find the datasets.
When we started on the grant, the CAMERA project had let their Oracle DBA (database administrator) go and so was unable to provide an export of the data and metadata for the projects. That meant we had to write tools to “scrape” every page of the site and parse the HTML to reverse-engineer their data model and piece together how all the data fit together. Further, we had to find all the FTP links to the datasets in the HTML, given that their FTP server did not allow directory listings or recursive “get” commands. Given the unreliability of FTP transfers, we downloaded each file more than once to check that we’d received the entire contents, because there were no MD5 sums to help us verify that we had downloaded each file completely. This work gave us first-hand experience of the importance of open and accessible data, which we have put at the forefront of developing our platform.
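The verification workaround can be sketched in a few lines. This is a hypothetical illustration rather than the actual scraping code: with no server-side MD5 sums, the only check available is to fetch each file repeatedly and confirm that all copies hash identically.

```python
import hashlib


def md5sum(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def fetch_and_verify(fetch, retries=2):
    """Download the same file `retries` times and confirm every copy
    hashes identically -- a stand-in for server-side checksums when
    the server provides none.

    `fetch` is any zero-argument callable returning the file's bytes
    (e.g. a wrapper around an FTP retrieval).
    Returns (data, True) if all copies agree, (data, False) otherwise.
    """
    copies = [fetch() for _ in range(retries)]
    digests = {md5sum(c) for c in copies}
    return copies[0], len(digests) == 1
```

Note that this is a weak check: two downloads truncated at the same point would still agree, which is exactly why published MD5 sums alongside the data are so valuable.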
In addition to issues with downloading the data, the metadata from the CAMERA project didn’t use controlled vocabularies. This meant that the same measurement for something like “dissolved oxygen” might be called that or “oxygen” or “dissolved oxygen CTD” or possibly “DOC” which could be mistaken for “dissolved organic carbon.” We worked with an ontologist to help us normalize the data as best we could, but it was impossible to completely fix the fields given missing and inaccurate data. Worse, the units of measurement were often unstated or were included in the field name, so it’s been impossible to know if values are actually comparable. The community is in desperate need of semantics and controlled vocabularies to enable data interoperability and advanced analytics now possible with machine learning and AI.
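As a toy illustration of the normalization problem (the field names come from the examples above, but the mapping itself is hypothetical – in practice this required an ontologist and could not be fully automated):

```python
# Hypothetical synonym table mapping free-text field names to
# canonical terms. A real mapping would come from a controlled
# vocabulary or ontology, not a hand-written dict.
CANONICAL = {
    "oxygen": "dissolved_oxygen",
    "dissolved oxygen": "dissolved_oxygen",
    "dissolved oxygen ctd": "dissolved_oxygen",
    # "DOC" is genuinely ambiguous in the wild; a dict cannot
    # resolve it without extra context.
    "doc": "dissolved_organic_carbon",
}


def normalize_field(name: str) -> str:
    """Map a free-text metadata field name to a canonical term,
    falling back to a cleaned-up version of the original name."""
    key = name.strip().lower()
    return CANONICAL.get(key, key.replace(" ", "_"))
```

The fallback branch shows why this is only a partial fix: unrecognized fields survive, but nothing guarantees that their values or units are comparable across projects.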
Metagenomics data production is growing extremely rapidly, and remote accessibility to data and transferring these types of large datasets is a huge challenge – what problems have you and your users encountered so far and how does iMicrobe try to overcome these?
From the beginning, we’ve been fortunate to build off the CyVerse cyberinfrastructure, which allows us to store all the large datasets in their “Data Store”, making them accessible both to our users (via high-speed parallel iRODS transfers and the CyVerse “Discovery Environment”) and to the high-performance computing (HPC) resources at the Texas Advanced Computing Center (TACC) via their Agave application programming interface (API). The metagenomics community struggles not only with data access but also with finding adequate computing power to analyze large datasets. iMicrobe users have free, unlimited access to the Stampede and Stampede 2 HPC systems at TACC.
As a platform for data-driven discovery do you have any examples of interesting findings you or any of your users have found using it yet?
To test the scalability of the iMicrobe platform, we developed an app called Libra that allows users to perform an all-vs-all comparison of metagenomes from raw sequence data using a Hadoop framework. Using this big-data analytics app, we were able to reanalyze data from the Tara Oceans Viromes to show that the structure of viral populations in the ocean is primarily driven by temperature, irrespective of where in the ocean the sample was taken. These analyses were only possible because the data are connected through an integrated platform paired with powerful compute resources.
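Libra itself is a k-mer-based tool that distributes the comparison over a Hadoop cluster; the core idea can be sketched on a single machine. The sketch below is a simplified illustration (cosine distance over k-mer counts), not Libra's actual implementation:

```python
import math
from collections import Counter
from itertools import combinations


def kmer_profile(seq: str, k: int = 4) -> Counter:
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two k-mer count vectors
    (one of several distances such tools can compute)."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - (dot / norm if norm else 0.0)


def all_vs_all(samples: dict, k: int = 4) -> dict:
    """Pairwise distance over named samples (sequence strings)."""
    profiles = {name: kmer_profile(seq, k) for name, seq in samples.items()}
    return {(x, y): cosine_distance(profiles[x], profiles[y])
            for x, y in combinations(sorted(profiles), 2)}
```

Because the comparison works directly on k-mer counts from raw reads, no assembly or reference database is needed – the property that makes the all-vs-all design scale to whole metagenome collections when distributed across a cluster.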
We’re big fans of open science and anything that makes data sharing easier; what has the uptake been like with the greater metagenomics community? Have you found more groups willing to open their data and protocols? Has there been any negative feedback?
We have had many people request to deposit data in iMicrobe, but, unfortunately, we are not funded to be a data repository. We always request that primary sequence data go to NCBI or ENA. However, one reason people may want to use iMicrobe is that we have not limited ourselves to a subset of data types. From the CAMERA project, we inherited complex data products such as contigs, assemblies, and even assemblies of combined samples (e.g., the MMETSP project). The community seems to need to share other value-added data products, such as gene predictions and protein clusters. With support and effort, iMicrobe could become a platform to share data, analysis tools, and the products of analyzing data with those tools. We hope to link these data, analyses, and tools with many other partners, including protocols.io, to enhance reproducibility for the science community.
As a community-driven and open platform is there any scope for citizen science users? And have you seen any use as an educational/teaching tool?
Indeed, one of the goals of iMicrobe is to allow users with limited training in programming and computer science to run advanced bioinformatics and analysis tools. To this end, we have developed protocols at protocols.io to train students and citizen scientists in using the iMicrobe platform. We use these protocols to teach undergraduates and graduate students in a class called Metagenomics at the University of Arizona. By developing these instructional resources, we hope to enable the community to explore microbiome datasets and run their own original analyses. With community science projects like the American Gut, citizen scientists are becoming interested both in their own microbiome and in the relationship of the microbiome to health and disease in general. Resources like iMicrobe could help to enable these citizen science and educational endeavors.
What’s next for the platform?
Our motto at iMicrobe is to free the compute and the data! We feel strongly that making analyses available via containers like Docker and Singularity holds great promise for sharing bioinformatics tools. Containers are like time capsules that encode all of the complex dependencies and code needed for a tool, which are often difficult for users to install on a new platform. Without them, most tools require advanced system-administration skills or access to hardware that is beyond the reach of many labs and their students. Further, many university students don’t have access to local HPC resources, and high schools and community colleges have almost none at all. A platform like iMicrobe can be an agent to democratize science, serving as a bridge to both large datasets and the hardware needed to analyze them.
To this end, we hope to increase the number of apps available through iMicrobe, especially by involving the research community in contributing their most-used tools and pipelines. We plan to support these efforts through a tool we developed called The Appetizer, which allows developers to create JSON files that describe the inputs and outputs of containerized tools. In addition, we plan to develop capabilities for building complex workflows, where the output of one app can be connected to another app or apps.
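As an illustration, a minimal app descriptor of the kind The Appetizer helps generate might look like the following. All names and fields here are hypothetical, loosely modeled on the JSON app descriptions consumed by the Agave API:

```json
{
  "name": "example-qc-app",
  "version": "0.0.1",
  "executionSystem": "example-hpc-system",
  "inputs": [
    {
      "id": "reads",
      "details": { "label": "FASTQ input file" }
    }
  ],
  "outputs": [
    {
      "id": "report",
      "details": { "label": "Quality-control report" }
    }
  ]
}
```

The point of generating such descriptors from a form, rather than by hand, is that tool authors who are not Agave experts can still publish a containerized tool with well-defined inputs and outputs.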
We are also working closely with the Genomic Standards Consortium and the Open Knowledge Foundation to make data truly interoperable. In particular, we are working on a project called Planet Microbe that unifies controlled vocabularies and semantics across large-scale oceanographic cruise datasets. Importantly, we repackage links to the data, protocols, and structured metadata into Frictionless Data packages that, like containers, are modular and can be repurposed by other projects. Our goal is to create a cyberinfrastructure made of components that are useful in and of themselves, which we simply link to create a larger unified system. Ultimately, we want to enable scientists to be LEGO master builders who can repurpose our tools (as containers), data (as Frictionless Data packages), and computer systems (through APIs on NSF-supported computer systems) to build new communities and infrastructures we may never have imagined.
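A Frictionless Data package is essentially a `datapackage.json` file bundling data locations with a typed schema. A minimal hypothetical example for an oceanographic dataset (all names and paths illustrative) might look like:

```json
{
  "name": "example-cruise-samples",
  "resources": [
    {
      "name": "ctd-measurements",
      "path": "data/ctd.csv",
      "format": "csv",
      "schema": {
        "fields": [
          { "name": "latitude", "type": "number" },
          { "name": "longitude", "type": "number" },
          { "name": "dissolved_oxygen", "type": "number" }
        ]
      }
    }
  ]
}
```

Because the schema names and types each field explicitly, a package like this directly addresses the metadata problems described above: there is no ambiguity about what “dissolved_oxygen” means or what type its values take.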
1. Choi I, Ponsero AJ, Bomhoff M, Youens-Clark K, Hartman JH, Hurwitz BL. Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. Gigascience. 2019 Feb 1;8(2). doi: 10.1093/gigascience/giy165.
2. Youens-Clark K, Bomhoff M, Ponsero AJ, Wood-Charlson EM, Lynch J, Choi I, Hartman JH, Hurwitz BL. iMicrobe: Tools and data-driven discovery platform for the microbiome sciences. Gigascience. 2019 Jul 1;8(7). pii: giz083. doi: 10.1093/gigascience/giz083.
3. iMicrobe website: https://www.imicrobe.us/