Genome Researchers Have Too Much Data
An anonymous reader writes "The NY Times reports, 'The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law. The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data. Now, it costs more to analyze a genome than to sequence a genome. There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"
Re:Wrong problem (Score:0, Informative)
Only kind of correct - they also don't really have a clue what it means. It is kind of like reading a binary program and trying to say what the program does.
as a genome researcher (Score:5, Informative)
Re:Wrong problem (Score:5, Informative)
Genomes have *a lot* of redundant data across multiple genomes. It's not hard to do de-duplication and compression when you're storing multiple genomes in the same storage system.
Wikipedia seems to agree with me:
http://en.wikipedia.org/wiki/Human_genome#Information_content [wikipedia.org]
The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.
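To make the 725 megabyte figure concrete, here is a minimal Python sketch of 2-bit packing. The function names and the A/C/G/T-to-number mapping are my own choices for illustration, and the sketch assumes the sequence contains only A, C, G and T (no N or other ambiguity codes, which real data would need extra handling for).

```python
# Sketch only: pack four bases into each byte, 2 bits per base.
# Assumes the input contains only A, C, G, T (no N / ambiguity codes).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a base string into ceil(len/4) bytes."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def packed_size_megabytes(n_bases: int) -> float:
    """Size at 2 bits per base, in megabytes (10**6 bytes)."""
    return n_bases * 2 / 8 / 1e6


print(packed_size_megabytes(2_900_000_000))  # 725.0
```

The length has to be stored separately because the last byte may be only partly used.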
Disclaimer: I have worked on genome data storage and analysis projects.
Re:Wrong problem (Score:2, Informative)
Only kind of kind of correct - they also don't really have a clue as to the accuracy. With the short-read Illumina machines that dominate, they have problems with repeats, inversions and deletions, the base pairs with hydroxymethyl C or thiophosphate, the sequence of the centromeres and telomeres, and the ability to put contigs into phase with parental genomes... aside from that, it's all peachy.
Oh yeah, I bet the contamination rates are not real good either (there was a paper a few months ago on this, looking at public databases, kinda scary).
Re:Wrong problem (Score:5, Informative)
Drops in NGS Costs Outpacing Storage Costs (Score:4, Informative)
The big problem is that the dramatic decreases in sequencing costs driven by next-gen sequencing (in particular the Illumina HiSeq 2000, which produces in excess of 2TB of raw data per run) have outpaced the decreases in storage costs. We're getting to the point where storing the data is going to be more expensive than sequencing it. I'm a grad student working in a lab with 2 of the HiSeqs (thank you HHMI!) and our 300TB HP Extreme Storage array (not exactly "extreme" in our eyes) is barely keeping up (on top of the problems we're having with datacenter space, power, and cooling).
I'll reference an earlier /. post about this:
http://science.slashdot.org/story/11/03/06/1533249/graphs-show-costs-of-dna-sequencing-falling-fast
There are some solutions to the storage problems such as Goby (http://campagnelab.org/software/goby/) but those require additional compute time, and we're already stressing our compute cluster as is. Solutions like "the cloud(!)" don't help much when you have 10TB of data to transfer just to start the analysis - the connectivity just isn't there.
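For a rough sense of the connectivity problem, here is a back-of-the-envelope Python sketch. The link speeds are assumptions picked for illustration, not measurements from any lab, and the calculation assumes the link is fully saturated with no protocol overhead, so real transfers would be slower.

```python
# Sketch only: idealized transfer time for a dataset over a network link.
# Link speeds below are illustrative assumptions, not measured values.

def transfer_hours(terabytes: float, megabits_per_second: float) -> float:
    """Hours to move `terabytes` (10**12 bytes each) at a sustained rate."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600


for label, mbps in [("100 Mbit/s", 100), ("1 Gbit/s", 1_000), ("10 Gbit/s", 10_000)]:
    print(f"10 TB over {label}: {transfer_hours(10, mbps):.1f} hours")
```

Even at a sustained 1 Gbit/s, 10TB works out to roughly 22 hours before any analysis can start.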
Re:Bad... (Score:4, Informative)
Re:Wrong problem (Score:3, Informative)
It's not lossy compression.
You store the first human's genome exactly. Then you store the second as a bitmask of the first -- 1 if it matches, 0 if it doesn't. You'll have 99% 1's and 1% 0's. You then compress this.
Of course it's more complicated than this due to alignment issues, etc., but this need not be lossy compression.
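A minimal Python sketch of the bitmask idea, under heavy assumptions: both sequences are already aligned and have the same length (so no insertions or deletions), and `encode`/`decode` are hypothetical names. One detail the description above leaves implicit is that the mismatched bases themselves have to be stored alongside the mask, otherwise the second genome cannot be rebuilt. zlib stands in here for whatever compressor you would really use.

```python
import zlib

# Sketch only: assumes pre-aligned, equal-length sequences (no indels).


def encode(reference: str, sample: str) -> bytes:
    """Store `sample` as a match bitmask plus the mismatched bases."""
    if len(reference) != len(sample):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    mask = bytearray((len(sample) + 7) // 8)
    diffs = []
    for i, (r, s) in enumerate(zip(reference, sample)):
        if r == s:
            mask[i // 8] |= 1 << (i % 8)  # 1 = matches the reference
        else:
            diffs.append(s)  # 0 = mismatch, keep the actual base
    payload = bytes(mask) + "".join(diffs).encode("ascii")
    return zlib.compress(payload, 9)


def decode(reference: str, blob: bytes) -> str:
    """Rebuild the sample exactly from the reference and the blob."""
    payload = zlib.decompress(blob)
    mask_len = (len(reference) + 7) // 8
    mask = payload[:mask_len]
    diffs = iter(payload[mask_len:].decode("ascii"))
    return "".join(
        r if (mask[i // 8] >> (i % 8)) & 1 else next(diffs)
        for i, r in enumerate(reference)
    )


# Toy data: a 4000-base reference and a sample differing at every 100th base.
reference = "ACGTTGCA" * 500
sample_list = list(reference)
for i in range(0, len(sample_list), 100):
    sample_list[i] = "A" if sample_list[i] == "G" else "G"
sample = "".join(sample_list)

blob = encode(reference, sample)
assert decode(reference, blob) == sample  # round trip is lossless
print(len(sample), "bases stored in", len(blob), "bytes")
```

The round-trip assertion is the point: nothing is thrown away, the second genome is just expressed relative to the first.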
Re:ASCII storage? (Score:4, Informative)
Big biotechnology purchases are typically hundreds of thousands of dollars though, so most labs are used to shelling out for this kind of price bracket.
Re:Last post (Score:5, Informative)
Then again, it could be worse: you could use the single strand formulation. Error rates are far higher. This turns out to be a surprisingly effective strategy for organisms using it, although less so for the rest of us.
TEDx Talk on the Subject (Score:4, Informative)
I did a talk on this a few years back at TEDx Austin (shameless self promotion): http://www.youtube.com/watch?v=8C-8j4Zhxlc [youtube.com]
I still deal with this on a daily basis and it's a real challenge. Next-generation sequencing instruments are amazing tools and are truly transforming biology. However, the basic science of genomics will always be data intensive. Sequencing depth (the amount of data that needs to be collected) is driven primarily by the fact that genomes are large (E. coli has around 5 M bases in its genome, humans have around 3 billion) and biology is noisy. Genomes must be over-sampled to produce useful results. For example, detecting variants in a genome requires 15-30x coverage. For a human, this equates to 45-90 Gbases of raw sequence data, which is roughly 45-90 GB of stored data for a single experiment.
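The coverage arithmetic above can be sketched in a few lines of Python. The function name is hypothetical, and the storage estimate assumes one byte per base (plain ASCII sequence only), ignoring read headers and quality scores, which would add to the total.

```python
# Sketch only: raw sequence volume = genome size x coverage.
# Storage estimate assumes 1 byte per base (ASCII), no headers or qualities.

def raw_gigabases(genome_size_bases: int, coverage: float) -> float:
    """Gigabases of raw sequence needed for the given coverage."""
    return genome_size_bases * coverage / 1e9


HUMAN = 3_000_000_000  # approximate, from the text above
E_COLI = 5_000_000     # approximate, from the text above

for name, size in [("human", HUMAN), ("E. coli", E_COLI)]:
    low, high = raw_gigabases(size, 15), raw_gigabases(size, 30)
    print(f"{name}: {low:g}-{high:g} Gbases, about {low:g}-{high:g} GB at 1 byte/base")
```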
The two common solutions I've noticed mentioned often in this thread, compression and clouds, are promising, but not yet practical in all situations. Compression helps save on storage, but almost every tool works on ASCII data, so there's always a time penalty when accessing the data. The formats of record for genomic sequences are also all ASCII (FASTA, and more recently FASTQ), so it will be a while, if ever, before binary formats become standard.
The grid/cloud is a promising future solution, but there are still some barriers. Moving a few hundred gigs of data to the cloud is non-trivial over most networks (yes, those lucky enough to have Internet2 connections can do it better, assuming the bio building has a line running to it) and, despite the marketing hype, Amazon does not like it when you send disks. It's also cheaper to host your own hardware if you're generating tens or hundreds of terabytes. 40 TB on Amazon costs roughly $80k a year whereas 40 TB on an HPC storage system is roughly $60k total (assuming you're buying 200+ TB, which is not uncommon). Even adding an admin and using 3 years' depreciation, it's cheaper to have your own storage. The compute needs are rather modest as most sequencing applications are I/O bound - a few high memory (64 GB) nodes are all that's usually needed.
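Here is the cost comparison as a small Python sketch, using only the two figures quoted above ($80k a year for 40 TB on Amazon, $60k total for 40 TB of your own storage). The admin cost is not given in this post, so it is left as a parameter; the break-even function is just my way of framing the comparison, and it ignores power, cooling and space.

```python
# Sketch only: 40 TB cloud vs. owned storage, using the figures quoted above.
# Admin cost is an unknown here, so it is a parameter rather than a number.
CLOUD_PER_YEAR = 80_000   # 40 TB on Amazon, per year (from the text)
OWNED_HARDWARE = 60_000   # 40 TB on an HPC storage system, total (from the text)


def cloud_cost(years: int) -> int:
    return CLOUD_PER_YEAR * years


def owned_cost(years: int, admin_per_year: float = 0.0) -> float:
    """Hardware up front plus whatever admin cost is attributed to these 40 TB."""
    return OWNED_HARDWARE + admin_per_year * years


def break_even_admin_per_year(years: int) -> float:
    """Admin spend per year (for these 40 TB) at which both options cost the same."""
    return (cloud_cost(years) - OWNED_HARDWARE) / years


print("3-year cloud cost:", cloud_cost(3))
print("3-year owned cost, no admin:", owned_cost(3))
print("break-even admin cost per year:", break_even_admin_per_year(3))
```

Over 3 years that leaves $60k a year of admin time attributable to those 40 TB before the cloud comes out ahead, and with a 200+ TB system the admin is spread over five times as much storage.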
Keep in mind, too, that we're asking biologists to do this. Many biologists got into biology because they didn't like math and computers. Prior to next-generation sequencing, most biological computation happened in calculators and lab notebooks.
Needless to say, this is a very fun time to be a computer scientist working in the field.
-Chris