Biotech

The DNA Data Deluge

the_newsbeagle writes "Fast, cheap genetic sequencing machines have the potential to revolutionize science and medicine--but only if geneticists can figure out how to deal with the floods of data their machines are producing. That's where computer scientists can save the day. In this article from IEEE Spectrum, two computational biologists explain how they're borrowing big data solutions from companies like Google and Amazon to meet the challenge. An explanation of the scope of the problem, from the article: 'The roughly 2000 sequencing instruments in labs and hospitals around the world can collectively generate about 15 petabytes of compressed genetic data each year. To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station.'"
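A quick back-of-envelope check of the DVD figures quoted above. This is a sketch, assuming standard 4.7 GB single-layer discs about 1.2 mm thick and the ISS orbiting at roughly 250 miles; none of those values come from the article itself.

    # Rough sanity check of the summary's DVD-stack figures.
    # Assumed values (not from the article): 4.7 GB single-layer DVDs,
    # 1.2 mm per disc, ISS altitude ~250 miles.
    PETABYTE = 1e15          # bytes
    DVD_BYTES = 4.7e9        # single-layer disc capacity
    DVD_THICKNESS = 1.2e-3   # metres
    MILE = 1609.34           # metres

    discs = 15 * PETABYTE / DVD_BYTES
    stack_miles = discs * DVD_THICKNESS / MILE
    print(f"{discs:,.0f} DVDs, stack ~{stack_miles:.1f} miles")   # ~2.4 miles

    # Five years of 3x-5x annual growth vs. the ISS at ~250 miles up.
    for growth in (3, 5):
        print(f"{growth}x/year for 5 years: ~{stack_miles * growth**5:,.0f} miles")

Under those assumptions, both the "more than 2 miles" figure and the ISS comparison hold up.
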
  • by The_Wilschon ( 782534 ) on Thursday June 27, 2013 @10:10PM (#44128807) Homepage
    In high energy physics, we rolled our own big data solutions (mostly because there was no big data other than us when we did so). It turned out to be terrible.
  • by Anonymous Coward on Thursday June 27, 2013 @10:52PM (#44128987)

    Actually, ASCII files are the easiest to process. And since we generally use a handful of ambiguity codes, it's more like ATGCNX. Due to repetitive segments, gzip actually works out better than your proposed 2-bit scheme. We do a lot of UNIX piping through gzip, which is still faster than a magnetic hard drive can retrieve data.
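A rough sketch of that point, using a deliberately repetitive synthetic sequence (real reads are far less regular, so the gap between gzip and 2-bit packing will vary):

    # On repetitive sequence, a generic compressor over plain ASCII can beat a
    # fixed 2-bits-per-base packing. The input here is synthetic and highly
    # repetitive; real data will compress less dramatically.
    import gzip

    seq = ("ACGTACGTTTAGGGATTACA" * 50_000).encode("ascii")  # ~1 MB of ASCII bases

    two_bit_bytes = len(seq) / 4              # 4 bases per byte, ignoring N/X codes
    gzip_bytes = len(gzip.compress(seq, 6))

    print(f"ASCII: {len(seq):,} bytes")
    print(f"2-bit: {two_bit_bytes:,.0f} bytes")
    print(f"gzip:  {gzip_bytes:,} bytes")     # far smaller than 2-bit on this input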

  • by WaywardGeek ( 1480513 ) on Thursday June 27, 2013 @11:35PM (#44129211) Journal

    Please... entire DNA genomes are tiny... on the order of 1 GB with no compression. Taking into account the huge similarity to published reference genomes, we can compress that by at least 1000X. What they are talking about is the huge amount of data spit out by the sequencing machines in order to determine your genome. Once determined, it's tiny.

    That said, what I need is raw machine data. I'm having to do my own little exome research project. My family has a very rare form of X-linked color blindness that is most likely caused by a single gene defect on our X chromosome. It's no big deal, but now I'm losing central vision, with symptoms most similar to late-onset Stargardt's disease. My UNC ophthalmologist beat the experts at Johns Hopkins and Jacksonville's hospital and made the correct call, directly refuting the other doctors' diagnosis of Stargardt's. She thought I had something else and that my DNA would prove it. She gave me the opportunity to have my exome sequenced, and she was right.

    So, I've got something pretty horrible, and my ophthalmologist thinks it's most likely related to my unusual form of color blindness. My daughter carries this gene, as does my cousin and one of her sons. Gene research to the rescue?!? Unfortunately... no. There are simply too few people like us. So... being a slashdot sort of geek who refuses to give up, I'm running my own study. Actually, the UNC researchers wanted to work with me... all I'd have to do is bring my extended family from California to Chapel Hill a couple of times over a couple of years and have them see doctors at UNC. There's simply no way I could make that happen.

    Innovative companies to the rescue... This morning, Axeq, a company headquartered in MD, received my family's DNA for exome sequencing at their Korean lab. They ran an exome sequencing special in April: $600 per exome, with a minimum order size of six. They have been great to work with, and accepted my order for only four. Bioserve, also in MD, did the DNA extraction from whole blood, and they have been even more helpful. The blood draw labs were also incredibly helpful, once we found the right places (very emphatically not Labcorp or Quest Diagnostics). The Stanford clinic lab manager was unbelievably helpful, and in LA, the lab director at the San Antonio Hospital Lab went way overboard. So far, I have to give Axeq and Bioserve five stars out of five, and the blood draw labs deserve a six.

    Assuming I get what I'm expecting, I'll get a library of matched genes, and also all the raw machine output data, for four relatives. The output data is what I really need, since our particular mutation is not currently in the gene database. Once I get all the data, I'll need to do a bit of coding to see if I can identify the mutation. Unfortunately, there are several ways this could turn out to be impossible. For example, copy number variations (CNVs) that run on for more than a few hundred base pairs cannot be detected with current technology. Ah... the life of a geek. This is yet another field I have to get familiar with...
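For what it's worth, the "bit of coding" can start out very simple once each exome has been aligned and called into a standard VCF file. A minimal sketch under that assumption; the filenames and the all-carriers intersection rule are hypothetical, and a real analysis would also filter against population variant databases:

    # Intersect X-chromosome variant calls across the relatives who carry the trait.
    # Assumes one standard VCF per relative; the filenames are made up.
    import gzip

    def chrx_variants(path):
        """Return the set of (position, ref, alt) calls on chromosome X."""
        opener = gzip.open if path.endswith(".gz") else open
        calls = set()
        with opener(path, "rt") as vcf:
            for line in vcf:
                if line.startswith("#"):
                    continue                      # skip VCF header lines
                chrom, pos, _id, ref, alt = line.split("\t")[:5]
                if chrom in ("X", "chrX"):
                    calls.add((int(pos), ref, alt))
        return calls

    carriers = ["proband.vcf.gz", "daughter.vcf.gz",
                "cousin.vcf.gz", "cousin_son.vcf.gz"]

    shared = chrx_variants(carriers[0])
    for path in carriers[1:]:
        shared &= chrx_variants(path)             # keep variants present in everyone

    print(len(shared), "candidate X-linked variants shared by all four relatives")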

  • by wezelboy ( 521844 ) on Friday June 28, 2013 @12:43AM (#44129529)
    When I had to get the first draft of the human genome onto CD, I used 2-bit substitution and run-length encoding on repeats. gzip definitely did not cut it.
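One plausible sketch of that kind of scheme, offered as a guess at the general idea rather than a reconstruction of the encoding actually used: long homopolymer runs become (base, length) tokens and everything in between is packed four bases per byte.

    # Hypothetical 2-bit + run-length scheme; an illustration of the idea only.
    from itertools import groupby

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack(bases):
        """Pack A/C/G/T two bits each, four bases per byte (padded with A)."""
        out = bytearray()
        for i in range(0, len(bases), 4):
            chunk = bases[i:i + 4] + ["A"] * (4 - len(bases[i:i + 4]))
            out.append(CODE[chunk[0]] << 6 | CODE[chunk[1]] << 4
                       | CODE[chunk[2]] << 2 | CODE[chunk[3]])
        return bytes(out)

    def encode(seq, min_run=8):
        """Yield ('run', base, length) for long homopolymer runs and
        ('packed', bytes) for everything in between."""
        literal = []
        for base, group in groupby(seq):
            run = len(list(group))
            if run >= min_run:
                if literal:
                    yield ("packed", pack(literal))
                    literal = []
                yield ("run", base, run)
            else:
                literal.extend(base * run)
        if literal:
            yield ("packed", pack(literal))

    print(list(encode("ACGT" * 3 + "A" * 100 + "G" * 12 + "TTAACCGG")))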
  • by Anonymous Coward on Friday June 28, 2013 @01:20AM (#44129661)

    A single finished genome is not the problem. It is the raw data.

    The problem is that any time you sequence a new individual's genome for a species that already has a genome assembly, you need a minimum of 5x coverage across the genome to reliably find variation. Because of variation in coverage, that means you may have to shoot for >20x coverage to find all the variation. The problem is more complex when you are trying to de novo assemble a genome for a species that does NOT have a genome assembly. In this case, you often have to aim for at least 40x coverage (and something in the 100x range may be better).

    To get the data, we use next-gen sequencing. To give you an idea of the data output, a single Illumina HiSeq 2000 run produces 3 billion reads. Each "read" is a pair of genomic fragments 100 bases long. That means 600,000,000,000 bases are produced in a single run. The run is stored as a .fastq file, meaning that each base is stored as an ASCII character and has an associated quality score stored as another ASCII character. So that's 1.2 trillion ASCII characters for a single run, or about 1.09 terabytes uncompressed. This does not include the storage for the (incompressible) images taken by the sequencing machine in order to call the bases, which can be an order of magnitude larger. A single experiment may involve dozens of such runs.

    There is an expectation that these runs will be made available in a public repository when an analysis is published. That puts great stress on places like NIH, where 1.7 quadrillion raw bases have been uploaded in about the last four years:
    http://www.ncbi.nlm.nih.gov/Traces/sra/ [nih.gov]

    You are correct when you say that computational power is a bigger problem, but again, this is not related to the three billion bases of the genome, which is trivial in size. Once again, the problem is the raw data. When assembling a new species' genome from scratch, you somehow have to reassemble those 3 billion pairs of 100-base reads. The way that is done is by chopping every single read into overlapping pieces about 21 nucleotides long (k-mers), hashing and storing them all, building a de Bruijn graph, and navigating through it. The amount of RAM required for this is absolutely insane.
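The throughput figures in the parent comment check out with simple arithmetic. A sketch assuming a ~3 Gbp human genome, one quality byte per base, and ignoring FASTQ headers and newlines (those assumptions are mine, not the commenter's):

    # Back-of-envelope check of the run-size figures in the parent comment.
    GENOME = 3.0e9                 # assumed human genome size, ~3 Gbp
    READS_PER_RUN = 3.0e9          # one HiSeq 2000 run, per the parent comment
    BASES_PER_READ = 2 * 100       # each "read" is a 2 x 100 bp pair

    bases_per_run = READS_PER_RUN * BASES_PER_READ    # 6.0e11 bases
    fastq_bytes = bases_per_run * 2                   # sequence + quality characters

    print(f"bases per run:  {bases_per_run:.1e}")
    print(f"FASTQ per run:  {fastq_bytes / 2**40:.2f} TiB")   # ~1.09
    print(f"coverage/run:   {bases_per_run / GENOME:.0f}x of one human genome")
    print(f"read pairs for 40x de novo: {40 * GENOME / BASES_PER_READ:.1e}")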
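And a toy illustration of the k-mer/de Bruijn step described above, with k = 21 and two tiny overlapping "reads"; the read strings are invented for the example:

    # Toy de Bruijn construction: nodes are 20-mers, edges are the 21-mers observed
    # in the reads. Real assemblers use heavily engineered versions of this idea.
    from collections import defaultdict

    K = 21

    def build_de_bruijn(reads, k=K):
        """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes following it."""
        graph = defaultdict(set)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].add(kmer[1:])
        return graph

    reads = [                                  # two overlapping 30 bp fragments
        "ACGTACGGTTCAGGCTAACGTTAGCTAGGA",
        "GGTTCAGGCTAACGTTAGCTAGGATCCGAA",
    ]
    graph = build_de_bruijn(reads)
    print(len(graph), "nodes from just two 30 bp reads")

Holding a table like this for billions of real reads is what drives the memory requirements the parent describes.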

"What man has done, man can aspire to do." -- Jerry Pournelle, about space flight

Working...