Biotech

The DNA Data Deluge

the_newsbeagle writes "Fast, cheap genetic sequencing machines have the potential to revolutionize science and medicine--but only if geneticists can figure out how to deal with the floods of data their machines are producing. That's where computer scientists can save the day. In this article from IEEE Spectrum, two computational biologists explain how they're borrowing big data solutions from companies like Google and Amazon to meet the challenge. An explanation of the scope of the problem, from the article: 'The roughly 2000 sequencing instruments in labs and hospitals around the world can collectively generate about 15 petabytes of compressed genetic data each year. To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station.'"
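For a rough sanity check of those figures, here is a back-of-the-envelope sketch in Python. The disc capacity and thickness are assumptions on my part (4.7 GB single-layer DVDs, 1.2 mm per disc), not numbers from the article:

# Back-of-the-envelope check of the DVD-stack figures in the summary.
# Assumed (not from the article): 4.7 GB single-layer DVDs, 1.2 mm per disc.

PETABYTE = 1e15            # bytes
DVD_CAPACITY = 4.7e9       # bytes per single-layer DVD
DVD_THICKNESS_MM = 1.2     # thickness of one disc, in millimetres
MM_PER_MILE = 1.609344e6

def dvd_stack_miles(data_bytes):
    """Height, in miles, of the stack of DVDs needed to hold data_bytes."""
    discs = data_bytes / DVD_CAPACITY
    return discs * DVD_THICKNESS_MM / MM_PER_MILE

yearly = 15 * PETABYTE
print(f"15 PB/year -> {dvd_stack_miles(yearly):.1f} miles of DVDs")
for growth in (3, 5):
    print(f"next year at {growth}x growth -> {dvd_stack_miles(yearly * growth):.1f} miles")

With those assumptions, 15 PB works out to roughly 2.4 miles of discs, consistent with the article's "more than 2 miles," and the 3x to 5x growth cases land in the same ballpark as its 6-to-10-mile estimate.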
  • by Gavin Scott ( 15916 ) on Thursday June 27, 2013 @10:42PM (#44128945)

    ...what a shitty storage medium DVDs are these days.

    A cheap 3TB disk drive burned to DVDs will produce a rather unwieldy tower of disks as well.

    G.

  • by esten ( 1024885 ) on Thursday June 27, 2013 @11:12PM (#44129089)

    Storage is not the problem. Computational power is.

    Each genetic sequence is ~3 GB, but since sequences between individuals are very similar, it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or about 0.05% of total worldwide data storage (295 exabytes).

    Now the real challenge is having enough computational power to read and process the 3-billion-letter genetic sequence, and designing effective algorithms to handle this data.

    More info on compression of genomic sequences
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074166/ [nih.gov]
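    To make the idea concrete, here is a toy Python sketch of reference-based compression: keep one shared reference and store each individual only as a list of differences. Real tools (such as those in the linked paper, or the CRAM format) handle indels, quality scores, and much more; this only shows the core idea and some rough storage arithmetic.

# Toy illustration of reference-based compression: store only the positions
# where an individual genome differs from a shared reference sequence.

def diff_against_reference(reference: str, genome: str):
    """Return a list of (position, base) substitutions relative to the reference."""
    assert len(reference) == len(genome)   # toy case: substitutions only, no indels
    return [(i, b) for i, (a, b) in enumerate(zip(reference, genome)) if a != b]

def reconstruct(reference: str, variants):
    """Rebuild an individual genome from the reference plus its variant list."""
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)

reference = "ACGTACGTACGT"
genome    = "ACGTACCTACGA"                 # differs at positions 6 and 11
variants = diff_against_reference(reference, genome)
print(variants)                             # [(6, 'C'), (11, 'A')]
assert reconstruct(reference, variants) == genome

# Rough storage math from the comment above: ~20 MB per genome for ~7 billion
# people is about 140 PB with these round numbers, close to the ~132 PB figure.
print(f"{7e9 * 20e6 / 1e15:.0f} PB")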

  • by Samantha Wright ( 1324923 ) on Thursday June 27, 2013 @11:59PM (#44129343) Homepage Journal

    I can't comment on the physics data, but in the case of the bio data the article discusses, we honestly have no idea what to do with it. Most sequencing projects collect an enormous amount of useless information, a little like saving an image of your hard drive every time you screw up GRUB's menu.lst. We keep it around on the off chance that some of it might eventually be useful in some other way, although there are ongoing concerns that much of the data just won't be of high enough quality for later reuse.

    On the other hand, a lot of the specialised datasets (like the ones being stored in the article) are meant as baselines, so researchers studying specific problems or populations don't have to go out and collect their own data. Researchers working with such data usually have access to various clusters or supercomputers through their institutions; for example, my university gives me access to SciNet [scinethpc.ca]. There's still some vying for access when someone wants to run a really big job, but there are practical alternatives in many cases (such as GPGPU computing).

    Also, I'm pretty sure the Utah data centre is kept pretty busy with its NSA business.

  • by B1ackDragon ( 543470 ) on Friday June 28, 2013 @12:27AM (#44129449)
    This is very much the case. I work as a bioinformatician at a sequencing center, and I would say we see around 50-100G of sequence data for the average run/experiment, which isn't really so bad, certainly not compared to the high-energy physics crowd, and given a decent network. The trick is what we want to do with the data: some of the processes are embarrassingly parallel (a toy example of that case is sketched below), but many algorithms don't lend themselves to that sort of thing. We have a few 1 TB RAM machines, and even those are limiting in some cases. Many of the problems are NP-hard, and even for the heuristics we'd ideally use superlinear algorithms, but we can't have that either; it's near-linear time (and memory) or bust, which sucks.

    I'm actually really looking forward to a vast reduction in dataset size and cost in the life sciences, so we can make use of and design better algorithmic methods and get back to answering questions. That's up to the engineers designing the sequencing machines, though.
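    As a concrete illustration of the embarrassingly parallel case mentioned above, here is a toy Python sketch that counts k-mers in independent chunks of reads and merges the results. It is not the poster's actual pipeline, just the kind of step that scales out easily, unlike, say, de novo assembly, which needs a global view of the data.

# Toy example of an embarrassingly parallel sequencing step:
# count k-mers in independent chunks of reads, then merge the counts.

from collections import Counter
from multiprocessing import Pool

K = 4  # k-mer length (chosen arbitrarily for the example)

def count_kmers(reads):
    """Count all k-mers of length K in one chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - K + 1):
            counts[read[i:i + K]] += 1
    return counts

if __name__ == "__main__":
    # Pretend each chunk is one lane/file of reads off the sequencer.
    chunks = [
        ["ACGTACGTAC", "TTTTACGTAA"],
        ["ACGTTTTTAC", "GGGGACGTAC"],
    ]
    with Pool() as pool:
        partial_counts = pool.map(count_kmers, chunks)   # independent workers
    total = sum(partial_counts, Counter())               # cheap merge step
    print(total.most_common(3))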
  • by the gnat ( 153162 ) on Friday June 28, 2013 @12:30AM (#44129461)

    why aren't they storing it in digital DNA format

    Because they need to be able to read it back quickly, and error-free. Add to that, it's actually quite expensive to synthesize that much DNA; hard drives are relatively cheap by comparison.
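    For what it's worth, the raw density argument behind "digital DNA" is easy to sketch: pack two bits per base, as in the toy Python encoder below (my own illustration, not any real DNA-storage scheme). The catch is exactly what's said above: synthesis (writing) and sequencing (reading back) are slow, costly, and error-prone compared to a hard drive.

# Minimal sketch of packing binary data into DNA bases at 2 bits per base.

BASES = "ACGT"

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

def dna_to_bytes(seq: str) -> bytes:
    """Invert the encoding: four bases back into one byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

msg = b"petabytes"
assert dna_to_bytes(bytes_to_dna(msg)) == msg
print(bytes_to_dna(msg))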