Biggest ever contest to put genome assemblers through their paces
If you haven’t caught it yet, the largest systematic assessment of the genome assembly process carried out to date was published this week in GigaScience. The second Assemblathon competition saw 21 teams submit 43 entries based on unassembled data from three genomes — a parrot, a cichlid fish, and a boa constrictor — sequenced using three different technologies. Ten key metrics, distilled from over 100 different measures calculated for each assembly, each capture a different aspect of an assembly’s quality.
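To give a flavour of what such metrics look like, one of the most widely used measures of assembly contiguity is the N50 statistic (the Assemblathon 2 paper itself favours related variants such as NG50). The following is a minimal illustrative sketch, not the paper’s actual evaluation code:

```python
def n50(contig_lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly

# Toy example: five contigs totalling 300 bp
print(n50([100, 80, 60, 40, 20]))  # -> 80
```

Larger N50 values indicate a more contiguous assembly, though — as the Assemblathon results emphasise — contiguity alone says nothing about correctness, which is why many complementary metrics are needed.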
The paper has already generated a lot of coverage, particularly around its unusual peer-review process. Assemblathon 2 was initially submitted to a preprint server, and the named reviewers blogged and commented on their reviews of the paper. Since the data was already in the public domain and the authors welcomed the discussion, GigaScience’s editors encouraged open discussion of the peer review of this article. You can read our Editor-in-Chief Laurie’s insight into this in her guest BMC blog posting and her video interview for Biome.
With a new species genome announced almost daily, genomics is getting faster and cheaper all the time. Yet piecing together raw sequencing data into a high-quality finished genome without the aid of a previously assembled reference remains technically challenging, demands huge computational power and resources, and is being attempted by more and more labs around the world. With new sequencing tools appearing every month and nearly limitless ways of carrying out this complex process, it is not clear which method of piecing a genome together is best. The Assemblathon is a series of periodic collaborative contests that aims to address this question and help improve how genomics is carried out.
The logistics of carrying out such a large competition were challenging, with large volumes of test and entry data hosted by supercomputing centers and mirrored in the cloud, and automated scripts calculating and presenting the many results. Reviewing the paper was equally challenging and novel; everyone embraced GigaScience’s open and transparent review process, with authors and reviewers tweeting and posting comments online and in blogs throughout the review. The results of this real-time, open peer review are available to view on the Assemblathon website, and the signed reviewer reports and review history are archived and viewable alongside the article. Titus Brown’s blogging and comments arguing that the Assemblathon is a damning indictment of computational biology have been highlighted in Nature News, and post-publication the discussion on Twitter has continued (see some of the #titusischucknorris comments); the paper already has our second-highest Altmetric rating.
To boost reproducibility, the supporting data and 27 GB of competition entries are also hosted in the GigaScience GigaDB database and the NCBI SRA. We’ve also just posted the parrot reference data and optical maps to GigaDB; the Budgerigar used by Assemblathon is our second parrot genome hosted, after the Puerto Rican Parrot. Following on from our Sequence Squeeze competition paper and the forthcoming AFP and CAFA challenge series, these kinds of crowdsourced, open-innovation challenges are necessary tools for dealing with the challenges of “big data”. These are efforts we are keen to promote in GigaScience, so if you have related work please let us know.
1. Bradnam KR et al. (2013): Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2:10. http://dx.doi.org/10.1186/2047-217X-2-10
2. Feedback and analysis of the Assemblathon 2 pre-print. http://tmblr.co/ZzXdssfOMJfy
3. Bradnam KR et al. (2013): Assemblathon 2 assemblies. GigaScience Database. http://dx.doi.org/10.5524/100060