Faster, Dataset! Kill! Kill!

June 7, 2013

The speed of data
Last week was the Bio-IT World Asia meeting in Singapore, and while we didn’t attend this year (see last year’s conference report in Genome Biology), our editorial board member Tin-Lap Lee presented on the GigaGalaxy server on which we have been collaborating with him (see slides). Also timed for the meeting, Aspera issued a press release on our recent adoption of their suite of software products, which gives authors, reviewers, and users the tools to upload and download the extremely large data sets that accompany manuscripts at maximum speed. You can see additional coverage in GenomeWeb, but essentially users can download Aspera’s free Connect Web Browser Plug-in to submit and access our largest data sets. Aspera estimates data transfer 10-100X faster than FTP, and in our tests so far we have seen upload and download speeds more than 20X faster.
As a journal and database focusing particularly on large-scale data, with a number of our datasets pushing the terabyte range (particularly these 88 cancer genomes from the Asian Cancer Research Group), these faster transfers make such datasets usable without having to ship physical drives.

More variety of data in GigaDB
Alongside the new functionality and additions to our GigaDB database, we have continued to add new datasets and new types of data to our repository. In addition to our first neuroimaging data (see the blog posting) and two new disease model rat genomes, we have just added our first two proteomics datasets. We recently published a paper on metabolomics data-sharing from the MetaboLights repository at the EBI, and these proteomics datasets join the other mass-spectrometry data we are hosting. Many of these are multi-omics datasets. As well as all of the related datasets being available for download in a single location, our submitters have included links to the other repositories in which the data are held. As the proteomics community is already well served by the ProteomeXchange databases, the processed data from our first two datasets have also been submitted there. Unlike many data-type-specific repositories, ProteomeXchange, like GigaDB, uses citable DOIs as accessions, and using these we are able to link the datasets through their metadata and take a linked data approach to make them more discoverable. As we have also just introduced an RSS feed for our new datasets, please follow it or register if you are interested in seeing the latest additions to GigaDB.
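To illustrate why DOIs work well as shared accessions, here is a minimal Python sketch of matching records across repositories by DOI. It is not how GigaDB or any other repository implements linking: the function names (`normalize_doi`, `doi_url`, `group_by_doi`) and the `(source, doi)` record layout are assumptions made up for this example. The only idea it relies on is that the same DOI can be written in several ways (bare, with a `doi:` prefix, or as a resolver URL, sometimes percent-encoded) and needs to be reduced to one form before records can be compared.

```python
from urllib.parse import unquote, urlparse


def normalize_doi(value: str) -> str:
    """Reduce a DOI string or resolver URL to a bare, lower-cased DOI."""
    # Undo percent-encoding such as %2F for the slash in the DOI.
    text = unquote(value.strip())
    # Drop an optional "doi:" prefix.
    if text.lower().startswith("doi:"):
        text = text[4:]
    # For resolver URLs, keep only the path, which holds the DOI itself.
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https"):
        text = parsed.path.lstrip("/")
    # DOIs are case-insensitive, so compare them in lower case.
    return text.lower()


def doi_url(value: str) -> str:
    """Build a resolver URL from any of the accepted DOI spellings."""
    return "https://doi.org/" + normalize_doi(value)


def group_by_doi(records):
    """Group (source, doi) pairs by normalized DOI (hypothetical layout)."""
    groups = {}
    for source, doi in records:
        groups.setdefault(normalize_doi(doi), []).append(source)
    return groups


if __name__ == "__main__":
    # Hypothetical records: the same dataset DOI written two ways.
    records = [
        ("repository A", "http://dx.doi.org/10.5524/100047"),
        ("repository B", "doi:10.5524/100047"),
    ]
    print(group_by_doi(records))
```

Once both spellings reduce to the same key, the two records can be treated as describing the same dataset, which is the kind of metadata link described above.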

References

1. Salek et al.: Dissemination of metabolomics results: role of MetaboLights and COSMOS. GigaScience 2013 2:8. https://doi.org/10.1186/2047-217X-2-8
2. Zhang, S; Wen, B; Zhou, B; Yang, L; Cha, C; Xu, S; Qiu, X; Wang, Q; Sun, H; Lou, X; Zi, J; Zhang, Y; Lin, L; Liu, S (2013): Quantitative proteomics data using mTRAQ/MRM looking at human AKR family members in cancer cell lines. GigaScience Database. http://dx.doi.org/10.5524/100047
3. Chen, Z; Wen, B; Wang, Q; Tong, W; Guo, J; Bai, X; Zhao, J; Sun, Y; Tang, Q; Lin, Z; Lin, L; Liu (2013): Quantitative proteomics and transcriptomics data from the anaerobic thermophilic eubacterium Thermoanaerobacter tengcongensis. GigaScience Database. http://dx.doi.org/10.5524/100048

Scott Edmunds
