Genomic data is set to explode, posing storage challenges far beyond those presented by YouTube and Twitter, says a report.
By 2025, between 100 million and 2 billion human genomes are expected to be sequenced, requiring as much as two to 40 exabytes (one exabyte equals one quintillion bytes) of data storage, according to a team of scientists.
This outstrips YouTube's projected annual storage needs of one to two exabytes of video by 2025 and Twitter's projected one to 17 petabytes per year (one petabyte is 1,000 terabytes or 1,000,000 gigabytes).
It even exceeds the one exabyte per year projected for the world's largest radio astronomy project, the Square Kilometre Array, to be sited in South Africa and Australia.
As sequencing costs decline, more and more genomes will be analysed; the number of human genomes sequenced is alone poised to reach the millions soon.
The demands are "humongous" because the amount of data that must be stored for a single genome is about 30 times larger than the size of the genome itself, to make up for errors incurred during sequencing and preliminary analysis, Nature reports.
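A rough back-of-envelope calculation shows where such figures come from. The genome size (about 3 billion bases, stored at roughly one byte per base) is an illustrative assumption, not a figure from the paper; real projections also fold in compression and growth estimates, so the totals here are only indicative.

```python
# Back-of-envelope sketch of per-genome storage demand.
# Assumptions (illustrative): ~3 billion bases per human genome,
# ~1 byte per base uncompressed, and the article's ~30x overhead
# for error correction and preliminary analysis.
GENOME_SIZE_BYTES = 3e9
OVERHEAD_FACTOR = 30
BYTES_PER_GENOME = GENOME_SIZE_BYTES * OVERHEAD_FACTOR  # ~90 GB per genome

EXABYTE = 1e18
for n_genomes in (100e6, 2e9):  # the article's projected 2025 range
    total_eb = n_genomes * BYTES_PER_GENOME / EXABYTE
    print(f"{n_genomes:.0e} genomes -> about {total_eb:.0f} EB uncompressed")
```

Under these naive assumptions the uncompressed totals come out well above the paper's 2-to-40-exabyte range, which is one way to see how much the projections depend on compression and on how much raw data is retained.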
Storage, however, is not the only challenge: the computing requirements for acquiring, distributing and analysing genomic data are demanding, both in the volume of data involved and in the speed of analysis required.
"This serves as a clarion call that genomics is going to pose some severe challenges," says biologist Gene Robinson from the University of Illinois at Urbana-Champaign (UIUC), a co-author of the paper.
However, not many are impressed with the comparison with YouTube and Twitter.
Narayan Desai, a computer scientist at Ericsson in San Jose told Nature, "This isn't a particularly credible analysis."
He says the paper underestimates the processing and analysis aspects of the video and text data collected and distributed by Twitter and YouTube, such as advertisement targeting and serving videos to diverse formats.
The problem with genomic data, he believes, stems from the field's decentralised growth, unlike high-energy physics or astronomy.
These areas require coordination and consensus for instrument design, data collection and sampling strategies. But genomics data sets have remained "balkanized, despite the recent interest of cloud-computing companies in centrally storing large amounts of genomics data".
Unlike those fields, which quickly process raw data and then discard them, genomics does not yet have standards for converting raw sequence data into processed data.
The authors of the report contend that the variety of analyses biologists want to perform in genomics is also uniquely large, and that these analyses could scale poorly as the volume of data grows.
"If you have a million genomes, you're talking about a million-squared pairwise comparisons," says Saurabh Sinha, a computer scientist at the UIUC and a co-author of the paper. "The algorithms for doing that are going to scale badly."
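The combinatorics behind Sinha's remark can be sketched in a few lines. This is a minimal illustration of the counting, not anything from the paper: comparing every genome against every other requires n(n-1)/2 comparisons, which grows quadratically with n.

```python
# Quadratic blow-up of all-vs-all comparison among n genomes.
def n_pairs(n: int) -> int:
    """Number of distinct pairwise comparisons among n genomes."""
    return n * (n - 1) // 2

for n in (1_000, 1_000_000):
    print(f"{n:>9,} genomes -> {n_pairs(n):,} pairwise comparisons")
# A thousandfold increase in genomes means roughly a millionfold
# increase in comparisons, which is why such algorithms scale badly.
```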
Sharing of genetic data has been a sore point in the field, with most players reluctant to part with information.
Two years ago, a "global alliance" of 69 institutions announced that it would develop standards and policies to encourage sharing of a person's DNA sequence combined with clinical information. Not all major data holders joined the alliance.