Biotech

The DNA Data Deluge

the_newsbeagle writes "Fast, cheap genetic sequencing machines have the potential to revolutionize science and medicine--but only if geneticists can figure out how to deal with the floods of data their machines are producing. That's where computer scientists can save the day. In this article from IEEE Spectrum, two computational biologists explain how they're borrowing big data solutions from companies like Google and Amazon to meet the challenge. An explanation of the scope of the problem, from the article: 'The roughly 2000 sequencing instruments in labs and hospitals around the world can collectively generate about 15 petabytes of compressed genetic data each year. To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station.'"
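For a rough sanity check of those figures, here is a back-of-the-envelope sketch in Python. The disc capacity and thickness are assumptions on my part (4.7 GB single-layer DVDs, 1.2 mm per disc), not numbers from the article:

# Back-of-the-envelope check of the DVD-stack figures in the summary.
# Assumed (not from the article): 4.7 GB single-layer DVDs, 1.2 mm per disc.

PETABYTE = 1e15            # bytes
DVD_CAPACITY = 4.7e9       # bytes per single-layer DVD
DVD_THICKNESS_MM = 1.2     # thickness of one disc, in millimetres
MM_PER_MILE = 1.609344e6

def dvd_stack_miles(data_bytes):
    """Height, in miles, of the stack of DVDs needed to hold data_bytes."""
    discs = data_bytes / DVD_CAPACITY
    return discs * DVD_THICKNESS_MM / MM_PER_MILE

yearly = 15 * PETABYTE
print(f"15 PB/year -> {dvd_stack_miles(yearly):.1f} miles of DVDs")
for growth in (3, 5):
    print(f"next year at {growth}x growth -> {dvd_stack_miles(yearly * growth):.1f} miles")

With those assumptions, 15 PB works out to roughly 2.4 miles of discs, consistent with the article's "more than 2 miles," and the 3x to 5x growth cases land in the same ballpark as its 6-to-10-mile estimate.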
  • by Gavin Scott ( 15916 ) on Thursday June 27, 2013 @10:42PM (#44128945)

    ...what a shitty storage medium DVDs are these days.

    A cheap 3TB disk drive burned to DVDs will produce a rather unwieldy tower of disks as well.

    G.

  • by esten ( 1024885 ) on Thursday June 27, 2013 @11:12PM (#44129089)

    Storage is not the problem. Computational power is.

    Each genetic sequence is ~3 GB, but since sequences between individuals are very similar, it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or about 0.05% of total worldwide data storage (295 exabytes).

    Now the real challenge is having enough computational power to read and process the 3-billion-letter genetic sequence, and designing effective algorithms to handle this data.

    More info on compression of genomic sequences
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074166/ [nih.gov]
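    To make the idea concrete, here is a toy Python sketch of reference-based compression: keep one shared reference and store each individual only as a list of differences. Real tools (such as those in the linked paper, or the CRAM format) handle indels, quality scores, and much more; this only shows the core idea and some rough storage arithmetic.

# Toy illustration of reference-based compression: store only the positions
# where an individual genome differs from a shared reference sequence.

def diff_against_reference(reference: str, genome: str):
    """Return a list of (position, base) substitutions relative to the reference."""
    assert len(reference) == len(genome)   # toy case: substitutions only, no indels
    return [(i, b) for i, (a, b) in enumerate(zip(reference, genome)) if a != b]

def reconstruct(reference: str, variants):
    """Rebuild an individual genome from the reference plus its variant list."""
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)

reference = "ACGTACGTACGT"
genome    = "ACGTACCTACGA"                 # differs at positions 6 and 11
variants = diff_against_reference(reference, genome)
print(variants)                             # [(6, 'C'), (11, 'A')]
assert reconstruct(reference, variants) == genome

# Rough storage math from the comment above: ~20 MB per genome for ~7 billion
# people is about 140 PB with these round numbers, close to the ~132 PB figure.
print(f"{7e9 * 20e6 / 1e15:.0f} PB")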

  • by Samantha Wright ( 1324923 ) on Thursday June 27, 2013 @11:59PM (#44129343) Homepage Journal

    I can't comment on the physics data, but in the case of the bio data the article discusses, we honestly have no idea what to do with it. Most sequencing projects collect an enormous amount of useless information, a little like saving an image of your hard drive every time you screw up GRUB's menu.lst. We keep it around on the off chance that some of it might eventually be useful in some other way, although there are ongoing concerns that much of the data just won't be of high enough quality for later reuse.

    On the other hand, a lot of the specialised datasets (like the ones being stored in the article) are meant as baselines, so researchers studying specific problems or populations don't have to go out and collect their own data. Researchers working with such data usually have access to various clusters or supercomputers through their institutions; for example, my university gives me access to SciNet [scinethpc.ca]. There's still some vying for access when someone wants to run a really big job, but there are practical alternatives in many cases (such as GPGPU computing).

    Also, I'm pretty sure the Utah data centre is kept pretty busy with its NSA business.

  • by B1ackDragon ( 543470 ) on Friday June 28, 2013 @12:27AM (#44129449)
    This is very much the case. I work as a bioinformatician at a sequencing center, and I would say we see around 50-100G of sequence data for the average run/experiment, which isn't really so bad, certainly not compared to the high-energy physics crowd, and given a decent network. The trick is what we want to do with the data: some of the processes are embarrassingly parallel (a toy example of that case is sketched below), but many algorithms don't lend themselves to that sort of thing. We have a few 1 TB RAM machines, and even those are limiting in some cases. Many of the problems are NP-hard, and even for the heuristics we'd ideally use superlinear algorithms, but we can't have that either; it's near-linear time (and memory) or bust, which sucks.

    I'm actually really looking forward to a vast reduction in dataset size and cost in the life sciences, so we can make use of and design better algorithmic methods and get back to answering questions. That's up to the engineers designing the sequencing machines, though.
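    As a concrete illustration of the embarrassingly parallel case mentioned above, here is a toy Python sketch that counts k-mers in independent chunks of reads and merges the results. It is not the poster's actual pipeline, just the kind of step that scales out easily, unlike, say, de novo assembly, which needs a global view of the data.

# Toy example of an embarrassingly parallel sequencing step:
# count k-mers in independent chunks of reads, then merge the counts.

from collections import Counter
from multiprocessing import Pool

K = 4  # k-mer length (chosen arbitrarily for the example)

def count_kmers(reads):
    """Count all k-mers of length K in one chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - K + 1):
            counts[read[i:i + K]] += 1
    return counts

if __name__ == "__main__":
    # Pretend each chunk is one lane/file of reads off the sequencer.
    chunks = [
        ["ACGTACGTAC", "TTTTACGTAA"],
        ["ACGTTTTTAC", "GGGGACGTAC"],
    ]
    with Pool() as pool:
        partial_counts = pool.map(count_kmers, chunks)   # independent workers
    total = sum(partial_counts, Counter())               # cheap merge step
    print(total.most_common(3))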
  • by the gnat ( 153162 ) on Friday June 28, 2013 @12:30AM (#44129461)

    why aren't they storing it in digital DNA format

    Because they need to be able to read it back quickly, and error-free. Add to that, it's actually quite expensive to synthesize that much DNA; hard drives are relatively cheap by comparison.
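    For what it's worth, the raw density argument behind "digital DNA" is easy to sketch: pack two bits per base, as in the toy Python encoder below (my own illustration, not any real DNA-storage scheme). The catch is exactly what's said above: synthesis (writing) and sequencing (reading back) are slow, costly, and error-prone compared to a hard drive.

# Minimal sketch of packing binary data into DNA bases at 2 bits per base.

BASES = "ACGT"

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

def dna_to_bytes(seq: str) -> bytes:
    """Invert the encoding: four bases back into one byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

msg = b"petabytes"
assert dna_to_bytes(bytes_to_dna(msg)) == msg
print(bytes_to_dna(msg))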