Biocuration on the Bund

GigaScience are regular attendees of the International Biocuration Conference, and you may have read our write-ups going back to the 2012 edition. This year Biocuration is back behind the bamboo curtain, with the 11th conference held in the Crowne Plaza Hotel Shanghai from April 8th-11th and hosted by Fudan University.

With its spectacular Bund waterfront, Shanghai is the very symbol of modern China, and there was palpable excitement at this year’s focus on Big Data Innovation and the promise of translating Big Data into biomedical knowledge and understanding. The GigaScience curation team participated heavily, with Mary Ann Tuli presenting a talk on “What MODs can learn from Journals – a GigaDB curator’s perspective” (slides here), and Chris Armit participating in the Translational Data Curation & Harmonisation panel. Here the team provide a write-up of how the meeting went.

Biocuration conference stage

A Shanghai Biocuration Surprise
The Biocuration community work together to ensure data integrity and consistency in data content. Towards this end, a multitude of tools were presented that enable data cleaning, data standardisation, and the addition of metadata sample attributes to ensure that datasets are both findable and intelligible.

To illustrate with a few examples, EMBL-EBI’s Annotare – presented by Anja Füllgrabe – is a user-friendly web interface that uses forms enabled with controlled vocabularies and predictive text to help submitting authors add sample attributes to their ArrayExpress and Expression Atlas datasets. The sample metadata can be curated in a tabular MAGE-TAB format, ensuring that errors introduced by naïve submitters are stripped from the metadata prior to publication.
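To make this kind of controlled-vocabulary checking concrete, here is a minimal sketch of how off-vocabulary sample attributes in a tab-delimited, MAGE-TAB-style table might be flagged; the file name, column names and vocabulary terms are invented for illustration, and this is not how Annotare itself is implemented.

```python
# Minimal sketch (not Annotare): validate sample attributes in a tab-delimited,
# MAGE-TAB/SDRF-style table against a small controlled vocabulary.
# The file name, columns and vocabulary terms below are hypothetical examples.
import csv

CONTROLLED_VOCAB = {
    "organism": {"Homo sapiens", "Mus musculus", "Saccharomyces cerevisiae"},
    "sex": {"male", "female", "unknown"},
}

def validate_sample_table(path):
    """Return (line number, column, offending value) for terms outside the vocabulary."""
    errors = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for line_number, row in enumerate(reader, start=2):  # header is line 1
            for column, allowed in CONTROLLED_VOCAB.items():
                value = (row.get(column) or "").strip()
                if value and value not in allowed:
                    errors.append((line_number, column, value))
    return errors

if __name__ == "__main__":
    for line_number, column, value in validate_sample_table("samples.sdrf.tsv"):
        print(f"line {line_number}: '{value}' is not a recognised term for '{column}'")
```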

Edith Wong of Stanford University discussed the importance of data standardisation in the context of the Saccharomyces Genome Database (SGD) and their move towards presenting RNA-seq data as molecules per cell rather than as fluorescence units. The use of standard units in this way is an important means of ensuring that the vast amount of RNA-seq data that is available is intelligible, and enables integration and cross-comparison of sequence data derived from proprietary platforms.
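As a rough illustration of what such a unit conversion involves – this is not SGD’s actual pipeline – a relative measure such as TPM can be scaled to approximate molecules per cell once a total mRNA count per cell is assumed; the constant below is a placeholder, not a measured value.

```python
# Illustrative sketch only: convert relative RNA-seq abundance (TPM) into
# approximate molecules per cell, given an assumed total mRNA count per cell.
# TOTAL_MRNA_PER_CELL is a placeholder constant; substitute a measured estimate.

TOTAL_MRNA_PER_CELL = 36_000

def tpm_to_molecules_per_cell(tpm: float, total_mrna: int = TOTAL_MRNA_PER_CELL) -> float:
    """TPM is transcripts per million, so tpm / 1e6 is the fraction of the
    transcriptome; scaling by the cell's total mRNA gives molecules per cell."""
    return tpm / 1_000_000 * total_mrna

# Example: a gene at 250 TPM corresponds to ~9 molecules per cell
# under the assumed total above.
print(round(tpm_to_molecules_per_cell(250.0), 1))
```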

In the Clinical and Translational Data Curation and Harmonization Workshop, Reinhard Schneider of the Luxembourg Centre for Systems Biomedicine (LCSB) presented a fourfold framework that is relevant to the entire Biocuration community. This framework included: 1) Data format conversion, to a machine-readable format; 2) Data cleansing, to remove errors, inconsistencies, and ambiguities; 3) Data transformation, to impute missing data; and 4) Data standardisation, including the use of controlled vocabularies, ontologies, and standard units. Many, if not all, of the tools presented at this year’s meeting could be classified within this framework. An important addition by Suzi Lewis of Lawrence Berkeley National Laboratory was to highlight that data cleaning should not be thought of as limited to metadata, as there are annotation tools available – such as Apollo – that enable direct curation of sequence data through a graphical and shareable web-based interface.
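For readers who think in code, the toy sketch below walks a tiny, invented sample table through the four steps of the framework; the field names, synonym map and imputation rule are all hypothetical, chosen only to show one step flowing into the next.

```python
# A toy, end-to-end illustration of the four-step framework described above.
# All field names, synonym maps, and imputation rules are hypothetical.
import csv, io, statistics

RAW = "sample,tissue,age\nS1, Liver ,63\nS2,hepatic,\nS3,liver,58\n"

TISSUE_SYNONYMS = {"liver": "liver", "hepatic": "liver"}  # toy controlled vocabulary

# 1) Data format conversion: raw CSV text -> machine-readable records
records = list(csv.DictReader(io.StringIO(RAW)))

# 2) Data cleansing: trim whitespace and normalise case
for record in records:
    record["tissue"] = record["tissue"].strip().lower()
    record["age"] = record["age"].strip()

# 3) Data transformation: impute missing ages with the median of known values
known_ages = [int(r["age"]) for r in records if r["age"]]
median_age = statistics.median(known_ages)
for record in records:
    record["age"] = int(record["age"]) if record["age"] else median_age

# 4) Data standardisation: map free-text tissue names onto controlled terms
for record in records:
    record["tissue"] = TISSUE_SYNONYMS.get(record["tissue"], record["tissue"])

print(records)
```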

Of great interest to the Biocuration community was the development of automation tools to reduce human intervention, and of productivity tools that make human intervention as easy as possible. Many of the tools showcased at the two poster sessions addressed these concerns. In addition, many of the talks expressed the overarching need for large data centres to orchestrate, collect and archive large amounts of biological data. An important initiative in this respect is the European ELIXIR infrastructure, which promises 10-year sustainability for open data generated in Europe, and provides high availability of these data through the use of virtual servers. However, as Martin Romacker of the Roche Innovation Center Basel was quick to point out, this only applies to open data, and the view from the industrial sector is that there is a pressing need for web tools to analyse patient data that cannot be released due to privacy requirements for clinical data.

The Database Resources of the BIG Data Center Workshop was particularly insightful in this respect. The core resources in BIGD are the Genome Sequence Archive, Genome Warehouse, Gene Expression Nebulas, Genome Variation Map, Methylation Bank, and Science Wikis (wikis for community annotations). The BIGD Genome Warehouse is a repository for whole genome sequence (WGS) data and currently archives 157 human genomes (496 TB of data), although for privacy reasons these are not accessible from the Genome Warehouse web interface. BIGD propose to deliver a web-based BLAST sequence alignment tool capable of handling WGS data, and this represents an important means of assessing sequence similarity between human genomic datasets whilst respecting privacy agreements. An alternative solution comes from Genomics England – represented by this year’s Biocuration Career Award winner Eleanor Williams – whose PanelApp is used to assess gene variants likely to be associated with rare diseases and cancer. These represent important initiatives that aim to analyse human data whilst ensuring that the original genomic datasets remain private and confidential.

Our favourite topic of Open Data was a major focus at this conference. The FAIR principles – which ensure research data are Findable, Accessible, Interoperable and Reusable – were an overarching theme of many of the talks, indicating that these principles are well supported within the MODs and other resources. An intriguing study in this respect was provided by the (Re)usable Data Project – presented by Lilly Winfree of Oregon Health and Science University – which highlighted how complex licensing and data reuse restrictions are hindering the reuse of seemingly ‘open’ data. These complexities range from different Creative Commons licenses (CC-BY, CC-ND, CC-NC, etc.) being applied to data subsets within a single resource, to terms of use on the website that contradict the stated license agreement. The (Re)usable Data Project developed a five-part rubric to evaluate and rank over 50 biomedical and biological data resources, and a key finding of this study was that over 50% of these resources lacked a clear and easily findable license. This highlights a clear need to improve licensing practices so as to enable data reuse by the research community.
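To illustrate the general idea of scoring resources against such a rubric, here is a minimal sketch; the five criteria below are placeholders chosen for illustration and are not the (Re)usable Data Project’s actual rubric.

```python
# A hypothetical five-criterion licensing score, in the spirit of a reuse rubric.
# Criteria and weighting are placeholders, not the (Re)usable Data Project's rubric.
from dataclasses import dataclass

@dataclass
class ResourceLicensing:
    name: str
    license_findable: bool   # is a license stated and easy to locate?
    standard_license: bool   # is it a standard license (e.g. CC-BY) rather than bespoke text?
    covers_all_data: bool    # does one license cover the whole resource, not just subsets?
    permits_reuse: bool      # does it allow redistribution and derivative use?
    terms_consistent: bool   # do the site's terms of use agree with the stated license?

    def score(self) -> int:
        """One point per criterion met, out of five."""
        return sum([self.license_findable, self.standard_license, self.covers_all_data,
                    self.permits_reuse, self.terms_consistent])

example = ResourceLicensing("ExampleDB", license_findable=False, standard_license=True,
                            covers_all_data=False, permits_reuse=True, terms_consistent=True)
print(f"{example.name}: {example.score()}/5")
```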

GigaScience staff

Mary Ann Tuli, Chris Armit and “Jesse” Xiao Si Zhe from the GigaDB team.

The Careers in Biocuration Workshop was of great interest to this community. There was a general lack of support for remote working, although, as Mary Ann Tuli pointed out, this model has worked very well for our “Big Data” journal GigaScience. To understand the hopes, dreams, and skillsets of the biocurator community, the Biocuration Career Description Survey queried the background, computational skills, and career satisfaction of the workshop attendees.

The results of this survey are available on the International Society for Biocuration (ISB) website and will be used to better inform scientists, students, and newcomers considering Biocuration as a career path. A white paper, “Biocuration: Distilling data into knowledge”, was additionally discussed as a critical means of increasing public awareness of the key role biocuration plays in scientific discovery.

Training opportunities for biocurators were an additional topic of great interest. The Institute of Continuing Education at Cambridge University will be offering a one-year PgCert in Biocuration from September/October 2018. The course will be mostly online but will additionally include face-to-face sessions, and is deemed suitable for both new and experienced biocurators.