Apple 2.0. A delicious genome.

GigaScience this week published a high-quality genome of the apple, the “Golden Delicious” variety to be precise. Although a version of the Malus domestica genome was already available, Xuewei Li et al.’s de novo assembly significantly improves on that previous resource, using a healthy injection of long reads from third-generation, “single-molecule” sequencing technology.

The “old” apple reference genome covered about 89% of the non-repetitive parts of the genome, and it had quite a large number of holes: the N50 contig length, a length-weighted median of gap-free sequence (half of all assembled bases sit in contigs at least this long), was only about 17 kb. For many applications, such as large-scale RNA sequencing or re-sequencing projects, researchers need a higher-quality genome.
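For readers who want to see the metric spelled out, here is a minimal Python sketch of how an N50 can be computed from a list of contig lengths. The function name and the toy lengths are made up for illustration; they are not data from either apple assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are
    summed until the running total reaches half of the assembly size.
    The length of the contig that crosses that halfway mark is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy example (hypothetical lengths, in kb).
toy_contigs = [100, 80, 50, 30, 20, 10, 10]
toy_n50 = n50(toy_contigs)
toy_mean = sum(toy_contigs) / len(toy_contigs)
```

In this toy set the N50 is 80 while the plain mean is about 43, which shows why N50 is preferred for assemblies: many tiny contigs drag the mean down but barely affect the N50.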

Feeding the world with “precision agriculture” needs more precise references to work from. And that’s what Xuewei Li and colleagues delivered by combining 76 gigabases (Gb) of Illumina HiSeq reads, equivalent to 102x coverage of the genome, with around 22 Gb of long reads from a third-generation (PacBio) platform. In total, the newly assembled genome comes out at around 630 Mb, with approximately 54,000 protein-coding genes. Crucially, the N50 contig length of the hybrid assembly is a rather impressive 110 kb.
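The coverage figure is simple arithmetic: total sequenced bases divided by genome size. The sketch below works backwards from the numbers quoted above. The function names are illustrative, and the genome size it derives is an inference from those rounded figures, not a value taken from the paper.

```python
def fold_coverage(total_bases, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return total_bases / genome_size


def implied_genome_size(total_bases, coverage):
    """Genome size that a given yield and fold coverage imply."""
    return total_bases / coverage


ILLUMINA_BASES = 76e9   # 76 Gb of short reads, as quoted above
PACBIO_BASES = 22e9     # around 22 Gb of long reads, as quoted above

# Genome size implied by "76 Gb is 102x coverage" (an inference).
size_estimate = implied_genome_size(ILLUMINA_BASES, 102)

# Long-read coverage against that same implied size (also an inference).
pacbio_coverage = fold_coverage(PACBIO_BASES, size_estimate)
```

With these inputs the implied genome size comes out at roughly 745 Mb, somewhat more than the 630 Mb assembly, which suggests the coverage was calculated against an estimated genome size rather than the assembled length. On the same basis the long reads would amount to a little under 30x.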

Long reads for high quality

The new paper fits in with an emerging trend towards higher-quality genome assemblies, taking advantage of the extra-long reads churned out by PacBio’s and Oxford Nanopore’s technologies. A look at recent programs of the annual Plant and Animal Genome conference shows how this is increasingly becoming the standard for genome projects.

Sequencing entire organisms, big or small, is now business as usual. But many genome sequences based on next-generation sequencing, quickly produced as they are, come with a drawback: they all start with a big heap of tiny snippets. Bioinformaticians command powerful algorithms to assemble those bits into longer units, but quite a few of the draft genomes published over the last couple of years are of rather low quality, meaning they are full of gaps and the precise ordering of the assembled sequence chunks is not known.

High precision scaffolding. (image credit: Rasbak via Wikimedia, CC-BY)

“Scaffolding” is the technical term for the laborious task of connecting free-floating pieces of sequence data into a linear arrangement that represents biological reality. And that’s where third-generation sequencing can help out. Sequencing machines such as PacBio’s instruments and the tiny Oxford Nanopore MinION produce amazingly long, continuous reads, often spanning several thousand base pairs.

Connecting the dots

Crafty bioinformaticians can use those long stretches of DNA to massively improve existing genome assemblies, by using long reads as solid anchoring points and as bridges to connect the dots, so to speak.
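To make the “bridge” idea concrete, here is a deliberately simplified Python sketch. It assumes the long reads have already been aligned and that we know, for each read, which contigs it touches and in what order. Everything here (function names, read and contig names, the `min_links` threshold) is hypothetical, and it ignores strand orientation and gap sizes, which real scaffolders must handle. It is not the method used in any of the papers mentioned in this post.

```python
from collections import Counter


def count_links(read_hits):
    """Count how many long reads join each ordered pair of contigs."""
    links = Counter()
    for contigs in read_hits.values():
        for left, right in zip(contigs, contigs[1:]):
            if left != right:
                links[(left, right)] += 1
    return links


def build_scaffolds(contig_names, links, min_links=2):
    """Greedily chain contigs using the best-supported links first."""
    successor, predecessor = {}, {}
    ranked = sorted(links.items(), key=lambda item: (-item[1], item[0]))
    for (left, right), support in ranked:
        if support < min_links:
            continue  # too few reads span this junction
        if left in successor or right in predecessor:
            continue  # one of the two contig ends is already joined
        # Refuse links that would close a chain into a circle.
        tail = right
        while tail in successor:
            tail = successor[tail]
        if tail == left:
            continue
        successor[left] = right
        predecessor[right] = left

    scaffolds = []
    for name in contig_names:
        if name in predecessor:
            continue  # not the start of a chain
        chain = [name]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


# Hypothetical alignments: read name -> contigs hit, in read order.
toy_hits = {
    "read1": ["c1", "c2"],
    "read2": ["c1", "c2", "c3"],
    "read3": ["c2", "c3"],
    "read4": ["c4", "c1"],
    "read5": ["c5"],
}
toy_scaffolds = build_scaffolds(
    ["c1", "c2", "c3", "c4", "c5"], count_links(toy_hits)
)
```

In this toy case, contigs c1, c2 and c3 end up in one scaffold because each junction is spanned by two reads, while c4 stays on its own: a single read linking it to c1 falls below the threshold. Requiring several independent reads per junction is one simple way to tolerate noisy individual reads.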

Other genomes of agronomic importance that have already been improved by PacBio-enhanced scaffolding include goat and cod, but the competing Oxford Nanopore technology is also starting to prove its utility for this purpose. For example, Rene Warren and colleagues last year presented an algorithm in GigaScience, showing that nanopore long reads can indeed improve genome assemblies, despite the MinION’s currently still rather high error rate. Accuracy may not be all that important if the main purpose is to connect loose ends by injecting long reads into an assembly pipeline. Accuracy is also improving at a rapid rate, with another paper this month presenting a new INC-Seq method for nanopore data that uses rolling circle amplification of circularized templates to generate more accurate long reads.

Apart from long reads, optical mapping is another important method to aid genome assembly, and GigaScience is a hub for papers describing this useful tool (see GigaBlog and our collection page for the latest papers).

The new apple genome is a reminder that sequencing an organism is never an end in itself. The real test is in the usability and quality of the data. And in that respect, it seems the future of genomics looks healthy as an apple.