Our Biggest Dataset Yet. Oh, Ruili?
A new Data Note provides genome sequencing data that effectively triples the number of plant species with available genome data. This mammoth amount of work comes on the back of the growing efforts of the scientific community to sequence more plant genomes to aid in understanding their complex evolution and provide practical information for improving agricultural yield. To date, around 350 land plant genomes have been sequenced. The desire for more plant genome sequences has recently been highlighted with the announcement of the 10KP project (see the commentary in GigaScience), which aims to ultimately sequence 10,000 plant genomes to resolve the evolution of all the major branches of the plant tree of life (see also our write up of the 1KP project). To scale up towards this target, this new work effectively provides sequence data from an entire digital botanical garden.
Researchers from the China National GeneBank, BGI, and the Forestry Bureau of Ruili, China have sampled and sequenced 761 samples, representing 689 vascular plant species from 137 families and 49 orders. The plant samples are all from in and around the 500-hectare Botanical Garden in Ruili, a subtropical part of Yunnan province in China bordering Myanmar. Being in a biologically rich part of China, the garden is committed to protecting endangered and Chinese-endemic plants, including the preservation and archiving of these germplasm resources to assist with their long-term conservation. This project is the world’s first scientific and systematic attempt to digitize a whole botanical garden based on genomic as well as voucher specimen information. Founded in 2002 and containing more than 1,200 species of native plants, Ruili is a brilliant example of where technology can take the centuries old botanical garden, with a virtual tour (in Chinese) also available online.
On the scientific potential of this resource, BGI’s CEO and author on the paper Xun Xu highlights that: “Current understanding of the evolution of plants and their diversity in a phylogenomic context is limited because of the lack of genome-scale information across phylogenetically diverse species. This innovative project integrates a new way of thinking about the digitization of all the plant species to augment evolutionary and ecological research in botanical gardens.”
High-Throughput Herbarium Building
In total, the researchers produced 54 terabytes of sequencing data, with an average sequencing depth of 60X per species. In addition to the basic challenge of carrying out DNA sequencing on this number of species, another major task was scaling up the species identification, digitizing images of the specimens, and building a new herbarium for their storage at a new China National GeneBank (CNGB) herbarium in Shenzhen. Visiting the site a few months ago we witnessed the processing and scanning of all the collected specimens (see pictures). So far, of the 761 specimens, sequence and chloroplast data has enabled the identification of 257 plants at the species level and 504 at the family level. Deep learning has also been successful applied to 181 species to enable them to be identified to the species level.
Author Ting Yang says that this was “the largest amount of data I have ever processed. During the data analyses, I think the biggest challenges was sequence checking and results examination.” This required researchers to individually check each of the 761 sample’s sequencing data, and compare the chloroplast gene sequences with herbarium specimens for species identification.
Another difficulty relating to simply getting to the point of being able to do the sequencing work was collecting all the samples. Author Jinpu Wei states: “We cooperated with experts from the Ruili Forestry Bureau to collect plant materials distributed in the area of Ruili for the establishment of a digital botanical garden. After 45-days of tiring effort, we collected 1,093 plant materials. Although it was challenging for us to transport the materials properly, we finally managed to ensure the high quality of these plant materials for future research.”
Corresponding author, Xin Liu, adds that the project “was a baseline project to fine tune and standardize the sampling, methodologies, and the data accumulation and analyses techniques for large-scale genome projects like the 10KP (10 thousand Plant Genome Project). From this project, we have gained considerable and useful experience for subsequent sample collection, sequencing, and assembly. At the same time, the data produced from this study can be effectively used in subsequent genome projects.”
The Data Reuse Potential of a Digital Botanical Garden
Despite having constructed only one sequencing library for each species, the authors were able to assemble preliminary genomes for 17 of them, reflecting the quality and reuse potential of the DNA. Outside of the data producing team, external researchers have already independently assembled the genomes of species of particular interest to them. The potential for the wider research community to study their species of interest, improve other genomes, develop tools and methods, and provide education opportunities for new generations of scientists is enormous. Prof Stephen Tsui (pictured), Director of the Hong Kong Bioinformatics Centre at the Chinese University of Hong Kong acknowledged the utility to his Students and PostDocs, saying: “We would like to appreciate the great efforts for the establishment of this project. Using the data deposited in the NCBI, we have assembled a Bauhinia variegata draft genome of 264 Mbp in size. This could serve as an excellent starting point and pilot study for our future full-scale genome analysis of this important plant species”.
Lead author Huan Liu added that “Genomic characterization will provide a large amount of basic data for plant genome assembly, which will be an excellent start for the 10KP project. At the same time, it lays a good foundation for the future research on the correlation mechanism from macroscopic ecology and biodiversity to microscopic molecular level.”
Downloading a Digitized Botanical Garden via GigaDB
To promote more extensive data sharing than just making sequence data available, the researchers are also making the digitized images available and providing access to the herbarium. The Herbarium (HCNGB) serves as a living plant database that records the position of species grown in the Ruili Botanical Garden and monitors the status of each species.
All the digital data generated (including images, raw sequencing data, assembled chloroplast genomes, and preliminary nuclear genome assemblies) are available via the NCBI SRA, China National GeneBank CNSA, and our GigaDB repository. With GigaDB being the most convenient way to browse and access the outputs of the project. To enable the data to be searched and genomes and species identification to be updated, metadata is indexed and linked via Datacite DOIs. To maximise reuse potential all resources are also released without restriction under a CC0 waiver. We have minted DataCite DOIs for all 761 specimens, and (inspired by Metadata2020) we’ve even been carrying out a randomised controlled trial (RCT) to use this large dataset to see if rich metadata makes it more discoverable (see the preregistration for this ongoing experiment here).
Author Sunil Kumar Sahu highlighted that this is the most important legacy of the project “This dataset is of great value to plant researchers, and more importantly, can serve as a reference for future planetary-scale genome sequencing projects including the Earth BioGenome Project and 10KP.”
Liu H et al. Molecular digitization of a botanical garden: high-depth whole genome sequencing of 689 vascular plant species from the Ruili Botanical Garden Gigascience. 2019 doi: 10.1093/gigascience/giz007