Biotech Science

Genome Researchers Have Too Much Data 239

An anonymous reader writes "The NY Times reports, 'The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law. The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data. Now, it costs more to analyze a genome than to sequence a genome. There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"
  • Last post (Score:2, Funny)

    by Anonymous Coward

    All previous posts have been purged due to too much data.

    • by NFN_NLN ( 633283 ) on Friday December 02, 2011 @04:18PM (#38242386)

      There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"

      Perhaps they can come up with a new type of storage mechanism modeled after nature. They could store this data in tight helical structures and instead of base 2 use base 4.

      • Re:Last post (Score:5, Informative)

        by edremy ( 36408 ) on Friday December 02, 2011 @05:32PM (#38243602) Journal
        The error rate is too high- data copying using that medium and the best available (naturally derived) technology makes an error roughly every 100,000 bases. There are existing correction routines, but far too much data is damaged on copy, even given the highly redundant coding tables.

        Then again, it could be worse: you could use the single strand formulation. Error rates are far higher. This turns out to be a surprisingly effective strategy for organisms using it, although less so for the rest of us.

        • by gorzek ( 647352 )

          "RNA does what you want, unless what you want is consistency." -- Larry Wall (sort of)

  • Wrong problem (Score:5, Interesting)

    by sunderland56 ( 621843 ) on Friday December 02, 2011 @03:30PM (#38241614)
    They don't have too much data, they have insufficient affordable storage.
    • "to the cloud!"
    • by bugs2squash ( 1132591 ) on Friday December 02, 2011 @03:40PM (#38241756)
      If only they had some kind of small living cell it could be stored in...
    • Re:Wrong problem (Score:5, Insightful)

      by jacoby ( 3149 ) on Friday December 02, 2011 @03:43PM (#38241804) Homepage Journal

      Yes and no. It isn't just storage. What we have comes off the sequencers as TIFFs first, and after the first analysis we toss the TIFFs to free up some big space. But that's just the first analysis, and we go to machines with kilo-cores and TBs of memory in multiple modes, and many of our tools are not yet written to be threaded.

      • surely storage and transmission can't be an issue, the capacity and bandwidth of a mini-van full of 2TB disks from RAID sets should be sufficient

        • Given recent events in Thailand, it might be wise to replace the mini-van with something that floats.

      • To be clear, the problem is this. The sequencing (cheap now) produces a huge number of short, overlapping strips of DNA, and it's unknown which positions they come from.

        So the difficulty is to arrange those strips to reproduce the original DNA sequence. It is an NP-hard problem; no wonder Moore's law doesn't outrun that!
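
        To make that concrete, here is a toy Python sketch of the greedy strategy: keep merging the pair of reads with the largest suffix/prefix overlap. The read strings and function names are invented for the example, and real assemblers use overlap or de Bruijn graphs rather than this brute-force loop; it is only meant to show why the problem blows up.

          # Toy greedy merge: repeatedly join the two reads with the largest
          # suffix/prefix overlap. Only an illustration of the problem -- real
          # assemblers use overlap graphs or de Bruijn graphs, not brute force.

          def overlap(a, b):
              """Length of the longest suffix of a that is also a prefix of b."""
              for k in range(min(len(a), len(b)), 0, -1):
                  if a.endswith(b[:k]):
                      return k
              return 0

          def greedy_assemble(reads):
              reads = list(reads)
              while len(reads) > 1:
                  best_k, best_i, best_j = 0, 0, 1
                  for i in range(len(reads)):
                      for j in range(len(reads)):
                          if i != j:
                              k = overlap(reads[i], reads[j])
                              if k > best_k:
                                  best_k, best_i, best_j = k, i, j
                  merged = reads[best_i] + reads[best_j][best_k:]
                  reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
                  reads.append(merged)
              return reads[0]

          print(greedy_assemble(["GATTAC", "TTACAG", "ACAGAT"]))  # GATTACAGAT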

    • Re:Wrong problem (Score:5, Informative)

      by TooMuchToDo ( 882796 ) on Friday December 02, 2011 @03:51PM (#38241914)

      Genomes have *a lot* of redundant data across multiple genomes. It's not hard to do de-duplication and compression when you're storing multiple genomes in the same storage system.

      Wikipedia seems to agree with me:

      http://en.wikipedia.org/wiki/Human_genome#Information_content [wikipedia.org]

      The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.

      Disclaimer: I have worked on genome data storage and analysis projects.

      • by tgd ( 2822 )

        Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.

        640K will always be enough!

        • 640K will always be enough!

          Yeah, back when Slashdot ran at 2400 bps, the comment limit was shorter than Twitter. But not to worry, like the Witnesses, the "great crowd" with seven-digit UIDs are relegated to a paradise on earth.

          I have to say in 1981 making those decisions I felt like I was providing enough freedom for ten years, that is the move from 64K to 640K felt like something that would last a great deal of time.

          The complaints, as Gates recalls, began within five years. He was off by a factor of two. I r

      • by StikyPad ( 445176 ) on Friday December 02, 2011 @04:05PM (#38242146) Homepage

        Warning: Monkeying with lossy compression for human genomic data may lead to monkeys.

        • Re: (Score:3, Informative)

          by Anonymous Coward

          It's not lossy compression.

          You store the first human's genome exactly. Then you store the second as a bitmask of the first -- 1 if it matches, 0 if it doesn't. You'll have 99% 1's and 1% 0's. You then compress this.

          Of course it's more complicated than this due to alignment issues, etc., but this need not be lossy compression.
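
          A minimal sketch of that bitmask-plus-compression idea, assuming random toy sequences, zlib as the compressor, and a naive one-to-one positional comparison (real pipelines must handle alignment and indels, as the poster notes):

            # Illustrative only: random toy "genomes", zlib as the compressor, and a
            # naive 1:1 positional comparison (real data needs alignment and indels).
            import random
            import zlib

            random.seed(1)
            BASES = "ACGT"
            reference = "".join(random.choice(BASES) for _ in range(1_000_000))

            # A second "individual" differing at roughly 0.1% of positions.
            other = list(reference)
            for _ in range(1000):
                i = random.randrange(len(other))
                other[i] = random.choice(BASES)
            other = "".join(other)

            # Bitmask of matches (1) vs mismatches (0), plus the mismatching bases.
            mask = bytes(int(a == b) for a, b in zip(reference, other))
            diffs = "".join(b for a, b in zip(reference, other) if a != b)

            packed = zlib.compress(mask, 9) + zlib.compress(diffs.encode(), 9)
            print(f"{len(other):,} bases -> {len(packed):,} bytes after diff+compress")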

          • I didn't say it was lossy compression, I was just warning against it... though judging by your response, it may already be too late!

      • So compressed, you have 4 megabytes of data...per individual. 7 billion individual human beings means you potentially need 28 petabytes of storage...

        That's just for human beings. If we look at the sequences of non-human species the storage needed expands exponentially. Why, even if we used efficient DNA storage to keep all this data, we'd need a whole planet just to house it.

        • 28 petabytes only if you save each file separately. If you store multiple files in one archive, it'll be much smaller.
        • So compressed, you have 4 megabytes of data...per individual. 7 billion individual human beings means you potentially need 28 petabytes of storage...

          I'm not sure why you'd want to store the genome of every human on the planet, but for that kind of project 28 petabytes is peanuts. The newest IBM storage array is 120-ish petabytes. We're talking about storing 4 megabytes per person. In the modern world, most people have at least a 4 gigabyte flash drive. I could store the genomic information of myself, all my relatives, and all my friends, and still have space left over.

      • In a previous post, people were saying that mixing biology and computer science was a stupid idea, here [slashdot.org]. However, this clearly shows that it is much needed, except that in this case, the computer geeks can help out the biology nerds.

      • I suggested to Google way back, regarding storage capacity for Gmail, that they implement a pointer system that marks identical emails across multiple accounts and keeps one copy of the file with multiple pointers to it, which is what many compression schemes do. I know that in GFS and GFS2 they also make many threaded calls that cross-check for duplication.

        Maybe someone ought to set up these tools to recognize similar patterns and set up compression.

    • Re:Wrong problem (Score:5, Informative)

      by GAATTC ( 870216 ) on Friday December 02, 2011 @04:00PM (#38242050)
      Nope - the bottleneck is largely analysis. While the volume of the data is sometimes annoying in terms of not being able to attach whole data files to emails (19GB for a single 100bp flow cell lane from a HiSeq2000), it is not an intellectually hard problem to solve and it really doesn't contribute significantly to the cost of doing these experiments (compared to people's salaries). The intellectually hard problem has nothing to do with data storage. As the article states, "The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data." We just finished generating and annotating a de novo transcriptome (sequences of all of the expressed genes in an organism without a reference genome). Sequencing took 5 days and cost ~$1600. Analysis has been going on for 4 months, has taken at least one man-year at this point, and there is still plenty of analysis to go.
      • out of curiosity, what's the analysis? are you looking for something specific? comparing to something else? poking around to see what looks interesting? all of the above? thx
    • by msauve ( 701917 )
      "they have insufficient affordable storage."

      I've got an idea to solve that, which I'm going to patent.

      You store the sequence as a chain of different types of molecules (I'll call them "base pairs") which can link together, that way the storage will take up really minimal space. You could even have a chemical process which replicated the original, to produce more of the original.
      • You store the sequence as a chain of different types of molecules (I'll call them "base pairs") which can link together, that way the storage will take up really minimal space. You could even have a chemical process which replicated the original, to produce more of the original.

        But the vessels that are used to perform those chemical processes take up a hell of a lot of space.

        Bizarrely, if the size of the vessels is measured in non-metric they're considerably bigger.

  • Nope (Score:4, Insightful)

    by masternerdguy ( 2468142 ) on Friday December 02, 2011 @03:32PM (#38241640)
    No such thing as too much data on a scientific topic.
    • Re: (Score:3, Insightful)

      by blair1q ( 305137 )

      Sure there is.

      They're collecting data they can't analyze yet.

      But they don't have to collect it if they can't analyze it, because DNA isn't going away any time soon.

      It's like trying to fill your swimming pool before you've dug it. I hope you have a sealed foundation, because you've got too much water. You might as well wait, because it's stupid to think you'll lose your water connection before the pool is done.

      Same way they've got too much data. No reason for them to be filling up disk space now if they c

  • Bad... (Score:4, Insightful)

    by Ixne ( 599904 ) on Friday December 02, 2011 @03:33PM (#38241656)
    Throwing out data in order to be able to analyze other data, especially when it comes to genes and how they interact, sounds like one of the worst ideas I've heard.
    • Re:Bad... (Score:4, Informative)

      by Samantha Wright ( 1324923 ) on Friday December 02, 2011 @04:06PM (#38242170) Homepage Journal
      Although that isn't quite what we're talking about here, reductionism in biology has been an ongoing problem for decades. Traditional biochemists often reduce the system they're examining to simple gene-pair interactions, or perhaps a few components at once, and focus only on the disorders that can be succinctly described by them. That's why very small-scale issues like haemophilia and sickle-cell anaemia were sorted out so early on. As diseases with larger and more complex origins become more important, research and money is being directed toward them. Cancer has been by far the most powerful driving force in the quest to understand biology from a broader viewpoint, primarily because it's integrally linked to a very important, complicated process (cell replication) that involves hundreds if not thousands of genes, miRNAs, and proteins.
  • by Hentes ( 2461350 ) on Friday December 02, 2011 @03:35PM (#38241692)

    Most scientific topics are like this: there is too much raw data to analyze it all. But a good scientist can spot the patterns and can distinguish between important stuff and noise.

    • by BagOBones ( 574735 ) on Friday December 02, 2011 @03:38PM (#38241736)

      Research team finds important role for junk DNA
      http://www.princeton.edu/main/news/archive/S24/28/32C04/ [princeton.edu]

      Except in the field of DNA, they still don't know what is and is not important.

      • by Hentes ( 2461350 )

        That's exactly what makes science interesting: when new, better models show that some of the data previously dismissed as junk can also be predicted. But making a perfect model would require infinite resources, so sometimes tradeoffs have to be made.

      • Transposons are interesting and complex, but they don't play much of a role in mammals. Intergenic DNA is still important in that it provides scaffolding (an active chromosome resembles a puff-ball with all of the important genes at the outside edges, where they're most accessible to incoming proteins) and flex room (sometimes proteins will actually bend DNA and pinch it to make sure the important genes stick out), but so far we believe that the actual sequence of most of the human genome isn't very important
    • by blair1q ( 305137 )

      A good scientist will design the experiment before collecting the data. If he spots patterns, it's because something interesting happened to another experiment. Then he'll design a new experiment to collect data on the interesting thing.

      Seriously, this is a non-problem. Don't waste resources keeping and managing the data if you can make more. And I can't imagine how you can't make more data from DNA. The stuff is everywhere.

      • by bberens ( 965711 )

        And I can't imagine how you can't make more data from DNA. The stuff is everywhere.

        I work in a cheap motel you insensitive clod!

      • by sirlark ( 1676276 ) on Friday December 02, 2011 @04:18PM (#38242382)

        A good scientist will design the experiment before collecting the data. If he spots patterns, it's because something interesting happened to another experiment. Then he'll design a new experiment to collect data on the interesting thing.

        Flippant response: A good scientist doesn't delete his raw data...

        More sober response: Except to do an experiment said scientist might need a sequence. And that sequence needs to be stored somewhere, often in a publicly accessible database as per funding stipulations. And that sequence has literally gigabytes more information than he needs for his experiment, because he's only looking at part of the sequence. Consider also that sequencing a small genome may take a few days in the lab, but annotating it can take weeks or even months of human time. And the sequence is just the tip of the iceberg: it doesn't tell us much on its own, because we need to know how the genome is expressed, how the expressed genes are regulated, how they are modified after transcription and after translation, and how the proteins that translation produces interact with other proteins and sometimes with the DNA itself. Life is messy, and singling out stuff for targeted experimentation in the biosciences is a lot more difficult than in physics, and even chemistry.

        Seriously, this is a non-problem. Don't waste resources keeping and managing the data if you can make more. And I can't imagine how you can't make more data from DNA. The stuff is everywhere.

        Sequencing may be getting cheaper, but it's not so cheap that scientists facing funding cuts can afford to throw away data simply to recreate it. Also, DNA isn't the only thing that's sequenced or used. Proteins are notoriously hard to purify and sequence, and RNA can also be difficult to get in sufficient quantities. The only reason DNA is plentiful is because it's so easy to copy using PCR [wikipedia.org], but those copies are not necessarily perfect.

        • Fortunately, most biologists are unimaginative, and the medical establishment's coffers are bottomless, so really only four genomes ever actually get much mileage: human, rat, mouse, and chimpanzee. Perhaps a parasite or virus here and there. I weep for plant biologists.
    • by Nyall ( 646782 )

      Time for scientists to get to work? What an elegantly simple solution.

      The next time I have to debug something, maybe my first step should be identifying the problem [taken from dilbert..]

  • They should learn (Score:4, Insightful)

    by hbar squared ( 1324203 ) on Friday December 02, 2011 @03:43PM (#38241806)
    ...from CERN. Sure, the Grid [wikipedia.org] was massively expensive, but I doubt genome researchers are generating 27 TB of data per day.
    • Plus, they could make a digital frontier to reshape the human condition

  • Is it .. (Score:4, Interesting)

    by ackthpt ( 218170 ) on Friday December 02, 2011 @03:43PM (#38241808) Homepage Journal

    Is it outpacing their ability to file patents on genome sequences?

  • by ecorona ( 953223 ) on Friday December 02, 2011 @03:44PM (#38241826) Homepage
    As a genome researcher, I'd like to point out that I, for one, do not have nearly enough genome data. I simply need about 512GB of RAM on a computer with a hard drive that is about 100x faster than my current SSD, and processing power about 1000x cheaper. Right now, I bite the bullet and carefully construct data structures and implement all sorts of tricks to make the most of the RAM I do have, minimize how much I have to use a hard drive, and extract every bit of performance available out of my 8 core machine. I wait around and eventually get things done, but my research would go way faster and be more sophisticated if I didn't have these hardware limitations.
    • by Overzeetop ( 214511 ) on Friday December 02, 2011 @04:01PM (#38242078) Journal

      It will come, but it doesn't make the wait less frustrating. I'm an aerospace engineer, and I remember building and preparing structural finite element models by hand on virtual "cards" (I'm not old enough to have used actual cards), and trying to plan my day around getting 2-3 alternate models complete so that I could run the simulations overnight. In the span of 5 years, I was building the models graphically on a PC, and runs were taking less than 30 minutes. Now, I can do models of foolish complexity and I fret when a run takes more than a minute, wondering if the computer has hung on a matrix inversion that isn't converging.

      You should, in some ways, feel lucky you weren't trying to do this twenty years ago. I understand your frustration, though.

      Just think - in twenty years, you'll be able to tell stories about hand coding optimizations and efficiencies to accommodate the computing power, as you describe to your intern why she's getting absolute garbage results from what looks like a very complete model of her project.

    • I simply need about 512GB of RAM

      What? Why can't you make do with 640k?

  • by BlueCoder ( 223005 ) on Friday December 02, 2011 @03:44PM (#38241828)

    I would figure most genomes are highly compressible. Especially if compressed against thousands of samples of a species and even across different species.

    I have half my mother's genome and half my father's. I couldn't have that many mutations. To store all three genomes couldn't take more than 2.0001 times the size of a human genome.

    • That is what I was thinking. Maybe they just need a more customized compression algorithm. The problem there, I suppose, is that figuring out matches can be an expensive operation in itself.
    • I would figure most genomes are highly compressible

      I know right? I can fit all of my DNA inside of a single cell! When will these people learn?

    • The finished product of the analysis can be stored reasonably efficiently. But I don't think that's what they are talking about. I believe this has more to do with the memory / disk / cpu load of putting all the pieces of the jigsaw together.
  • I think these researchers should look at outsourcing these efforts, and China now has bragging rights to the fastest computer [thedailybeast.com].

    After all, most of our electronics are imported. It's sad, but what do you do when "...the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data..." as the intro to this submission says?

    • I think these researchers should look at outsourcing these efforts, and China now has bragging rights to the fastest computer.

      Except they don't - the Japanese just brought a system online that is around 3x more powerful.

      But a more general issue is that you don't need a conventional supercomputer to analyze genomic data - you just need a lot of aggregate processing power. Supercomputers are good for serial numerical methods like molecular dynamics, climate modeling, or simulating nuclear explosions (the onl

  • But stand back! Steve Yegge is on the way to show them how to get things to scale:

        https://www.youtube.com/watch?v=vKmQW_Nkfk8 [youtube.com]

  • by WaffleMonster ( 969671 ) on Friday December 02, 2011 @03:51PM (#38241910)

    I was under the impression the complete DNA sequence for a human can be stored on an ordinary CD.

    Given the amount of data mentioned in TFA, it raises the question: what the hell are they sequencing? The genome of everyone on the planet?

    • by godrik ( 1287354 )

      No, the genome of every bacterium in a soil sample. We (I work as a computer scientist in a genome-related research lab) do not work only on human genomes.

    • Nope.

      Every living being on the planet. And as many of the dead ones as they can get their hands on.

    • This is true, but doesn't really capture the types of experiments that are being done in many cases. Yes, your genome can be stored on a CD. However, next gen sequencing is usually done with a high degree of overlapping coverage, to catch any mistakes in the sequencing, which is still basically a biochemical process despite getting large text files as the end result. So any genome is sequenced multiple times: say 8x coverage is fairly standard. That is if you are interested in sequencing a single genome. If
  • From the article: "three billion bases of DNA in a set of human chromosomes". A base may hold 1 of 4 values (A, C, G or T), so each base can be represented with 2 bits. 2 bits * 3 billion bases = 6 billion bits, or about 750MB.
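
    For illustration, a hypothetical Python packer that does exactly this 2-bits-per-base encoding (it ignores N's and other ambiguity codes, so it is a sketch rather than a practical format):

      # Hypothetical 2-bits-per-base packer (ignores N's and ambiguity codes).
      CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

      def pack(seq):
          """Pack four bases into each byte, two bits per base."""
          out = bytearray()
          for i in range(0, len(seq), 4):
              byte = 0
              for base in seq[i:i + 4]:
                  byte = (byte << 2) | CODE[base]
              out.append(byte)
          return bytes(out)

      print(len(pack("ACGT" * 1000)))      # 4,000 bases -> 1,000 bytes
      print(2 * 3_000_000_000 / 8 / 1e6)   # 3 billion bases -> 750.0 (MB)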
  • Someone needs to introduce these researchers to the 'diff' program.
  • by Anonymous Coward on Friday December 02, 2011 @04:02PM (#38242102)

    The big problem is that the dramatic decreases in sequencing costs driven by next-gen sequencing (in particular the Illumina HiSeq 2000, which produces in excess of 2TB of raw data per run) have outpaced the decreases in storage costs. We're getting to the point where storing the data is going to be more expensive than sequencing it. I'm a grad student working in a lab with 2 of the HiSeqs (thank you HHMI!) and our 300TB HP Extreme Storage array (not exactly "extreme" in our eyes) is barely keeping up (on top of the problems we're having with datacenter space, power, and cooling).

    I'll reference an earlier /. post about this:
    http://science.slashdot.org/story/11/03/06/1533249/graphs-show-costs-of-dna-sequencing-falling-fast

    There are some solutions to the storage problems such as Goby (http://campagnelab.org/software/goby/) but those require additional compute time, and we're already stressing our compute cluster as is. Solutions like "the cloud(!)" don't help much when you have 10TB of data to transfer just to start the analysis - the connectivity just isn't there.

  • Hehe - I mis-read this as "GNome researchers" have too much data.

    Probably along the lines of several thousand comments to the effect of "I can't stand GNOME 3", "I liked GNOME 2 better", etc, etc :)

  • I'm sure all the insurance companies would love to buy up all that data...
  • For individual research units, the cost of maintaining the processing power and storage space for these types of projects can be cost-prohibitive. Cloud-based options offer distributed computing power and low-cost storage that is often a more economical solution than paying for the equipment in house, especially when genomic projects can come in spurts rather than a continuous stream.

    Disclaimer: I work with large amounts of genomic data and use both in-house and cloud-based analysis tools.

  • 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'

    Have all your data open on a Windows share and an FTP server. Have them available on the full internet. Make some honest mistakes in setting up permissions. Copy and paste the "wrong link" into a hacker/gaming website. Wait a while... All your data has been replaced with illegal information, which makes it easy to clean out. Problem solved.
  • by slyrat ( 1143997 ) on Friday December 02, 2011 @04:15PM (#38242346)
    This seems like just the kind of problem that AI will help with, narrowing the field of 'interesting' things to look at. Either that, or better ways to search through the available data, along with better ways to store it, will probably work.
  • Way back in 1993, I visited an atomic laboratory in Pennsylvania. On the tour, they showed us the 30,000 core computing machine they had purchased several years before. "We still can't program it".

    30 seconds later he pointed to the next piece of metal.

    This is our 120,000 core computer.

    I raised my hand "Why did you buy a 120,000 core machine when you can't even program the 30,000 core machine!"

    "Well it's faster."

    one of my early lessons in big companies attacking the wrong problem.

  • A couple of researchers in Sydney think they've got a model for searching the genome much more efficiently. They're trying to fund their research and development with crowdfunding: http://rockethub.com/projects/4065-a-gps-for-the-genome [rockethub.com] : "The PASTE project [is] based on a new number system we call Permutahedral Indexing - P.I. for short, an N-dimensional map that efficiently locates and interrelates complex datasets in the space of all possible data. P.I. does this efficiently even when the data has hundr

  • They only save a tiny fraction of collision events, those deemed "interesting". Even so, that's petabytes a year. This keeps the researchers busy during shutdowns (such as now), analyzing these data for new particles or anomalies.
  • by wisebabo ( 638845 ) on Friday December 02, 2011 @04:36PM (#38242684) Journal

    So, why can't they compress the data at the level of proteins? I mean, it takes thousands of DNA base pairs to code for one protein, like hemoglobin, so instead of storing all that, just say "here is the DNA sequence for protein X". Any exceptions, like mutations, could then be indicated as "at position 758, the A is replaced by a G".

    Of course if there is something REALLY novel, like a bioengineered virus that used different (non-standard) 3 base pair codons to encode the same amino acid, this kind of data compression wouldn't work but for 99.9999% of "natural" cases it would. (I saw this idea in the tv series "regenesis"). So for these (hopefully rare, it was for a bio-weapon!) cases a different type of compression would be used. "My" compression algorithm would, of course, break which would be a good indication this wasn't a natural DNA sequence.

    I am neither a bio-expert nor a compression expert but this seems to me to be similar to the problem of compressing a vast library of books. Is it best to compress at the level of letters, words or even sentences? I'm only guessing what this entails because I'm not a linguist either! :(

    (Then there's the whole business of introns or exons which "seem" to be content/protein free but I understand contain lots of regulatory information despite their repetitive nature. I would imagine these could be handled by some sort of pattern RLE.)
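
    One reason the protein-level scheme can only ever be approximate is codon degeneracy: 64 codons map onto about 20 amino acids plus stop, so translation is roughly 3:1 but cannot be reversed exactly, which is why the "exceptions" list would still be needed. A small illustrative Python sketch using the standard codon table and made-up sequences:

      # Standard genetic code, built from the usual compact TCAG-ordered string.
      BASES = "TCAG"
      AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
      CODON_TABLE = {a + b + c: AA[16 * i + 4 * j + k]
                     for i, a in enumerate(BASES)
                     for j, b in enumerate(BASES)
                     for k, c in enumerate(BASES)}

      def translate(dna):
          """Translate a coding DNA sequence (no introns) to amino acids."""
          return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

      # Two different DNA sequences give the same protein, so storing only the
      # protein (or a protein ID) cannot recover the exact DNA without extras.
      print(translate("CTTCGAGGT"))  # LRG
      print(translate("CTGCGCGGA"))  # LRG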

    • If I may wax philosophical about my own posting, the advantage of using this "level" of encoding is that nature has, through ruthlessly efficient evolution, pruned out the almost-infinite number of non-useful proteins. Almost every DNA sequence that encodes a protein that is deleterious to the survival to the organism has been eliminated by the grim reaper. The few "bad" but non-lethal proteins that are still around in a living organism (like mis-folded hemoglobin to fight sickle cell disease) will stick

  • I am not a geneticist, so I might be way off base here. But isn't DNA data a grouping of A, T, C and G bases in various arrangements? It seems like the nature of the data itself would lend itself to effective compression and/or de-duplication.

  • It's not the data (Score:4, Insightful)

    by thisisauniqueid ( 825395 ) on Friday December 02, 2011 @05:04PM (#38243144)
    It's not that there's too much data to store. There's too much to analyze. Storing 1M genomes is tractable today. Doing a pairwise comparison of 1M genomes requires half a trillion whole-genome comparisons. Even Google doesn't compute on that scale yet. (Disclaimer: I'm a postdoc in computational biology.)
  • I'm not awake enough yet. I read the title as "Gnome Researchers Have Too Much Data."

  • by rockmuelle ( 575982 ) on Friday December 02, 2011 @06:00PM (#38244100)

    I did a talk on this a few years back at TEDx Austin (shameless self promotion): http://www.youtube.com/watch?v=8C-8j4Zhxlc [youtube.com]

    I still deal with this on a daily basis and it's a real challenge. Next-generation sequencing instruments are amazing tools and are truly transforming biology. However, the basic science of genomics will always be data intensive. Sequencing depth (the amount of data that needs to be collected) is driven primarily by the fact that genomes are large (E. coli has around 5 million bases in its genome, humans have around 3 billion) and biology is noisy. Genomes must be over-sampled to produce useful results. For example, detecting variants in a genome requires 15-30x coverage. For a human, this equates to 45-90 Gbases of raw sequence, which is roughly 45-90 GB of stored data for a single experiment (rough arithmetic is sketched just after this comment).

    The two common solutions I've noticed mentioned often in this thread, compression and clouds, are promising, but not yet practical in all situations. Compression helps save on storage, but almost every tool works on ASCII data, so there's always a time penalty when accessing the data. The formats of record for genomic sequences are also all ASCII (fasta, and more recently fastq), so it will be a while, if ever, before binary formats become standard.

    The grid/cloud is a promising future solution, but there are still some barriers. Moving a few hundred gigs of data to the cloud is non-trivial over most networks (yes, those lucky enough to have Internet2 connections can do it better, assuming the bio building has a line running to it) and, despite the marketing hype, Amazon does not like it when you send disks. It's also cheaper to host your own hardware if you're generating tens or hundreds of terabytes. 40 TB on Amazon costs roughly $80k a year whereas 40 TB on an HPC storage system is roughly $60k total (assuming you're buying 200+ TB, which is not uncommon). Even adding an admin and using 3 years' depreciation, it's cheaper to have your own storage. The compute needs are rather modest as most sequencing applications are I/O bound - a few high memory (64 GB) nodes are all that's usually needed.

    Keep in mind, too, that we're asking biologists to do this. Many biologists got into biology because they didn't like math and computers. Prior to next-generation sequencing, most biological computation happened in calculators and lab notebooks.

    Needless to say, this is a very fun time to be a computer scientist working in the field.

    -Chris
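
    The back-of-the-envelope version of the coverage and cost figures in the comment above, with the rough assumptions (about one byte per stored base, 2011 prices) spelled out; the numbers are the commenter's, the script is only illustrative:

      # Rough arithmetic behind the figures above (illustrative assumptions only).
      genome_bases = 3e9            # human genome, ~3 billion bases
      bytes_per_base = 1            # rough: ~1 byte per stored base of sequence

      for coverage in (15, 30):     # typical depth for variant detection
          gb = genome_bases * coverage * bytes_per_base / 1e9
          print(f"{coverage}x coverage -> ~{gb:.0f} GB per experiment")  # 45 / 90 GB

      # Storage cost comparison quoted above: ~$2k/TB/year on Amazon vs
      # ~$1.5k/TB one-off for owned HPC storage (200+ TB class pricing, 2011).
      tb = 40
      print("cloud, 3 years:  $", tb * 2000 * 3)   # ~$240,000
      print("owned, one-off:  $", tb * 1500)       # ~$60,000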

  • DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law.

    Moore's law, or rather Moore's observation, has absolutely nothing to do with performance and everything to do with the number of transistors. For the love of the deity of your choice, will they stop using it regarding performance? Simply mentioning something computer-related doesn't make the writer look smarter. Yes, an increase in the number of transistors can bring an increase in performance, but it isn't guaranteed. Eg. Bull
  • by Vornzog ( 409419 ) on Friday December 02, 2011 @06:36PM (#38244694)

    Though, there is quite a lot of that being generated these days.

    The problem is the *raw* data - the files that come directly off of the sequencing instruments.

    When we sequenced the human genome, everything came off the instrument as a 'trace file' - 4 different color traces, one representing a fluorescent dye for each base. These files are larger than plain text, but you could store the data on your local hard drive and do the base calling and assembly on what would be a desktop or beefy laptop by today's standards.

    2nd gen sequencers (Illumina, 454, etc) take images, and a lot of them, generating many GB of data for even small runs. The information is lower quality, but there is a lot more of it. You need a nice storage solution and a workstation grade computer to realistically analyze this data.

    3rd gen sequencers are just coming out, and they don't take pictures - they take movies with very high frame rates. Single molecule residence time frame rates. Typically, you don't store the rawest data - the instrument interprets it before the data gets saved out for analysis. You need high end network attached storage solutions to store even the 'interpreted' raw data, and you'd better start thinking about a cluster as an analysis platform.

    This is what the article is really about - do you keep your raw 2nd and 3rd gen data? If you are doing one genome, sure! why not? If you are a genome center running these machines all the time, you just can't afford to do that, though. No one can really - the monetary value of the raw data is pretty low, you aren't going to get much new out of it once you've analyzed it, and your lab techs are gearing up to run the instrument again overnight...

    The trick is that this puts you at odds with data retention policies that were written at a time when you could afford to retain all of your data...
