Biotech Science

Genome Researchers Have Too Much Data

Posted by Soulskill
from the we-should-try-storing-it-in-dna dept.
An anonymous reader writes "The NY Times reports, 'The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law. The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data. Now, it costs more to analyze a genome than to sequence a genome. There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"
  • Nope (Score:4, Insightful)

    by masternerdguy (2468142) on Friday December 02, 2011 @02:32PM (#38241640)
    No such thing as too much data on a scientific topic.
  • Bad... (Score:4, Insightful)

    by Ixne (599904) on Friday December 02, 2011 @02:33PM (#38241656)
    Throwing out data in order to be able to analyze other data, especially when it comes to genes and how they interact, sounds like one of the worst ideas I've heard.
  • by Hentes (2461350) on Friday December 02, 2011 @02:35PM (#38241692)

    Most scientific topics are like this: there is too much raw data to analyze it all. But a good scientist can spot the patterns and distinguish the important stuff from the noise.

  • by BagOBones (574735) on Friday December 02, 2011 @02:38PM (#38241736)

    Research team finds important role for junk DNA
    http://www.princeton.edu/main/news/archive/S24/28/32C04/ [princeton.edu]

    Except in the field of DNA, they still don't know what is and is not important.

  • Re:Nope (Score:3, Insightful)

    by blair1q (305137) on Friday December 02, 2011 @02:42PM (#38241782) Journal

    Sure there is.

    They're collecting data they can't analyze yet.

    But they don't have to collect it if they can't analyze it, because DNA isn't going away any time soon.

    It's like trying to fill your swimming pool before you've dug it. I hope you have a sealed foundation, because you've got too much water. You might as well wait, because it's stupid to think you'll lose your water connection before the pool is done.

    Same way they've got too much data. No reason for them to be filling up disk space now if they can just get the data again when they know what to do with it.

  • Re:Wrong problem (Score:5, Insightful)

    by jacoby (3149) on Friday December 02, 2011 @02:43PM (#38241804) Homepage Journal

    Yes and no. It isn't just storage. What we have comes off the sequencers as TIFFs first, and after the first analysis we toss the TIFFs to free up some big space. But that's just the first analysis, and we go to machines with kilo-cores and TBs of memory in multiple modes, and many of our tools are not yet written to be threaded.
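    The unthreaded-tools point is the crux: much per-read analysis is embarrassingly parallel, so even a crude process pool can use those kilo-cores. A minimal sketch, using a hypothetical GC-content calculation as a stand-in for real per-read work:

    ```python
    # Hypothetical sketch: per-read work (here, GC content) is independent
    # across reads, so it parallelizes trivially across processes.
    from multiprocessing import Pool

    def gc_content(read):
        """Fraction of G/C bases in one read."""
        return (read.count("G") + read.count("C")) / len(read)

    if __name__ == "__main__":
        # Toy stand-ins for sequencer reads.
        reads = ["GATTACA", "GCGCGC", "ATATAT", "GGCCAA"]
        with Pool(processes=4) as pool:  # scale processes to available cores
            results = pool.map(gc_content, reads)
        print(results)
    ```

    Real pipelines are messier (I/O-bound stages, shared indexes), but the point stands that the bottleneck is often the software, not the hardware.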

  • They should learn (Score:4, Insightful)

    by hbar squared (1324203) on Friday December 02, 2011 @02:43PM (#38241806)
    ...from CERN. Sure, the Grid [wikipedia.org] was massively expensive, but I doubt genome researchers are generating 27 TB of data per day.
  • by Overzeetop (214511) on Friday December 02, 2011 @03:01PM (#38242078) Journal

    It will come, but it doesn't make the wait less frustrating. I'm an aerospace engineer, and I remember building and preparing structural finite element models by hand on virtual "cards" (I'm not old enough to have used actual cards), and trying to plan my day around getting 2-3 alternate models complete so that I could run the simulations overnight. In the span of 5 years, I was building the models graphically on a PC, and runs were taking less than 30 minutes. Now, I can do models of foolish complexity and I fret when a run takes more than a minute, wondering if the computer has hung on a matrix inversion that isn't converging.

    You should, in some ways, feel lucky you weren't trying to do this twenty years ago. I understand your frustration, though.

    Just think - in twenty years, you'll be able to tell stories about hand coding optimizations and efficiencies to accommodate the computing power, as you describe to your intern why she's getting absolute garbage results from what looks like a very complete model of her project.

  • by sirlark (1676276) on Friday December 02, 2011 @03:18PM (#38242382)

    A good scientist will design the experiment before collecting the data. If he spots patterns, it's because something interesting happened to another experiment. Then he'll design a new experiment to collect data on the interesting thing.

    Flippant response: A good scientist doesn't delete his raw data...

    More sober response: Except to do an experiment said scientist might need a sequence. And that sequence needs to be stored somewhere, often in a publicly accessible database as per funding stipulations. And that sequence has literally gigabytes more information than he needs for his experiment, because he's only looking at part of the sequence. Consider also that sequencing a small genome may take a few days in the lab, but annotating can take weeks or even months of human time. And the sequence is just the tip of the iceberg, it doesn't tell us anything because we need to know how the genome is expressed, and how the expressed genes are regulated, and how they are modified after transcription, and how they are modified after translation, and how the proteins that translation forms interact with other proteins and sometimes with the DNA itself. Life is messy, and singling out stuff for targeted experimentation in the biosciences is a lot more difficult than in physics, and even chemistry.

    Seriously, this is a non-problem. Don't waste resources keeping and managing the data if you can make more. And I can't imagine how you can't make more data from DNA. The stuff is everywhere.

    Sequencing may be getting cheaper, but it's not so cheap that scientists facing funding cuts can afford to throw away data simply to recreate it. Also, DNA isn't the only thing that's sequenced or used. Proteins are notoriously hard to purify and sequence, and RNA can also be difficult to get in sufficient quantities. The only reason DNA is plentiful is because it's so easy to copy using PCR [wikipedia.org], but those copies are not necessarily perfect.
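    To see why the "gigabytes more information than he needs" claim is plausible, here's a back-of-envelope estimate. The numbers are rough assumptions, not measurements: ~3.2 billion bases per human genome, 30x coverage, and ~2 bytes per base in uncompressed FASTQ (one byte for the base call, one for its quality score):

    ```python
    # Rough, assumed numbers for a back-of-envelope estimate of raw read size.
    GENOME_BASES = 3_200_000_000  # approx. human genome length
    COVERAGE = 30                 # typical whole-genome sequencing depth
    BYTES_PER_BASE = 2            # base + quality score, ignoring headers/compression

    raw_bytes = GENOME_BASES * COVERAGE * BYTES_PER_BASE
    print(f"{raw_bytes / 1e9:.0f} GB of raw reads per genome")  # ~192 GB
    ```

    Compression and alignment shrink this considerably, but even an order-of-magnitude reduction leaves far more data than any single targeted experiment uses.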

  • It's not the data (Score:4, Insightful)

    by thisisauniqueid (825395) on Friday December 02, 2011 @04:04PM (#38243144)
    It's not that there's too much data to store. There's too much to analyze. Storing 1M genomes is tractable today. Doing a pairwise comparison of 1M genomes requires half a trillion whole-genome comparisons. Even Google doesn't compute on that scale yet. (Disclaimer: I'm a postdoc in computational biology.)
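    The "half a trillion" figure checks out: pairwise comparisons among n genomes is n choose 2, which for a million genomes is just under 5 × 10^11:

    ```python
    # Sanity check: pairwise comparisons among n genomes is "n choose 2".
    import math

    n = 1_000_000
    pairs = math.comb(n, 2)  # n * (n - 1) / 2
    print(pairs)  # 499999500000, roughly half a trillion
    ```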
