Image: A DNA double helix. Many cloud service providers offer subsidised rates for hosting data. Amazon Web Services, for instance, levies no charge for hosting 200 trillion bytes of data from the 1,000 Genomes Project. Credit: National Human Genome Research Institute

A group of genome scientists from Europe, the US and Canada are urging the NIH and other agencies to pay for the storage of major genomic data sets in the most popular cloud services.

This would let authorised scientists tap into a global commons easily and cheaply, as and when they need to, instead of wasting time and money independently transferring data to the cloud of their choice, they write in Nature.

The report highlights the low cost, flexibility, reliability and security of cloud computing, and points to the challenges researchers face in accessing big data sets.

They cite the free or heavily subsidised hosting offered by some providers, such as Amazon Web Services, which levies no charge for hosting 200 trillion bytes (200 terabytes) of data from the 1,000 Genomes Project.

Early this year, the US National Institutes of Health (NIH) lifted its ban on the use of cloud computing for the genetic information held in its repository, the database of Genotypes and Phenotypes (dbGaP).

With genomic data sets set to grow explosively in coming years as sequencing becomes cheaper, storage and analysis are poised to become major challenges.

The International Cancer Genome Consortium (ICGC) has amassed a data set in excess of two petabytes (1 petabyte is 10 to the power of 15 bytes) in five years.

Transferring a data set of that size from a repository to a local network over a university internet connection would take more than a year and be very costly.
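The scale of the problem can be checked with some rough arithmetic. The sketch below estimates how long a two-petabyte data set would take to move at a given sustained link speed. The link speeds are hypothetical values chosen for illustration, not figures from the report, and the function name is likewise an assumption.

```python
# Back-of-the-envelope transfer time for a large genomic data set.
DATASET_BYTES = 2 * 10**15  # two petabytes, the ICGC figure cited in the article
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: int, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes at a sustained link rate."""
    total_bits = size_bytes * 8
    return total_bits / link_bits_per_second / SECONDS_PER_DAY


# Hypothetical sustained rates; real university links vary widely.
for label, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    print(f"{label}: {transfer_days(DATASET_BYTES, rate):.0f} days")
```

On these assumptions, a transfer lasting more than a year corresponds to a sustained rate below roughly 500 megabits per second, and at a sustained 100 megabits per second the same data set would take about five years to move.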

Cloud services

By using cloud services, researchers can spread an analysis across many computers to complete it quickly, and pay only for the computing time used, the report says.

It would also allow several researchers to share data and work in tandem, cutting genome analysis from months to days.

As cloud services are available through the internet and multiple users share hardware, funding agencies are wary of privacy issues.

But the services offered by major players like Amazon, Google and Microsoft, as well as by some smaller companies, are as secure as most academic data centres, the authors note.

To reduce the time and cost of obtaining access, they suggest that the relevant funding agencies require every major genomic data set to be uploaded to the most popular academic and commercial clouds, and that the agencies pay for long-term storage of the data.

Researchers would then pay only for the computing time their analyses use, and the data would need to be copied only once. Present arrangements, by contrast, lead to inordinate delays and costs when two groups use the same data.

Funding agencies should also arrange for the same data sets to be deposited in multiple clouds, to avoid problems arising from monopoly practices, the authors add.

Reliable protocols for authorising access to sensitive data, as well as mechanisms to enable and revoke that access, will be needed, the researchers say.
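As a rough illustration of what enabling and revoking access involves, the sketch below keeps an in-memory record of which researchers may read which data sets. It is a minimal sketch and not a protocol proposed by the authors: the class, method and data set names are all hypothetical, and a real system would add authentication, auditing and persistent storage.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRegistry:
    """Hypothetical in-memory record of who may read which data set."""

    _grants: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, researcher: str, dataset: str) -> None:
        # Enable access: add the researcher to the data set's approved list.
        self._grants.setdefault(dataset, set()).add(researcher)

    def revoke(self, researcher: str, dataset: str) -> None:
        # Revoke access: a no-op if the researcher was never approved.
        self._grants.get(dataset, set()).discard(researcher)

    def is_authorised(self, researcher: str, dataset: str) -> bool:
        return researcher in self._grants.get(dataset, set())


registry = AccessRegistry()
registry.grant("alice", "example-genomes")  # hypothetical names
print(registry.is_authorised("alice", "example-genomes"))
registry.revoke("alice", "example-genomes")
print(registry.is_authorised("alice", "example-genomes"))
```

Checking the registry at the moment of each request, and not once at sign-up, is what would let a revocation take effect immediately.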