The DNA Data Deluge 138
the_newsbeagle writes "Fast, cheap genetic sequencing machines have the potential to revolutionize science and medicine--but only if geneticists can figure out how to deal with the floods of data their machines are producing. That's where computer scientists can save the day. In this article from IEEE Spectrum, two computational biologists explain how they're borrowing big data solutions from companies like Google and Amazon to meet the challenge. An explanation of the scope of the problem, from the article: 'The roughly 2000 sequencing instruments in labs and hospitals around the world can collectively generate about 15 petabytes of compressed genetic data each year. To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station.'"
At least they're not rolling their own. (Score:5, Interesting)
Re: (Score:3)
http://en.wikipedia.org/wiki/IBM_1360 [wikipedia.org]
Re: (Score:1)
I rolled my own but forgot what the results were...
Re: (Score:3)
Being out ahead of the bleeding edge, HEP gets to learn all sorts of lessons before everyone else. As a result, you get to make all the mistakes that everyone else gets to learn from.
Re: (Score:3)
Re:At least they're not rolling their own. (Score:5, Informative)
I can't comment on the physics data, but in the case of the bio data that the article discusses, we honestly have no idea what to do with it. Most sequencing projects collect an enormous amount of useless information, a little like saving an image of your hard drive every time you screw up grub's menu.lst. We keep it around on the off chance that some of it might be useful in some other way eventually, although there are ongoing concerns that much of the data just won't be of high enough quality for later reuse.
On the other hand, a lot of the specialised datasets (like the ones being stored in the article) are meant as baselines, so researchers studying specific problems or populations don't have to go out and get their own information. Researchers working with such data usually have access to various clusters or supercomputers through their institutions; for example, my university gives me access to SciNet [scinethpc.ca]. There's still some jostling for access when someone wants to run a really big job, but there are practical alternatives in many cases (such as GPGPU computing).
Also, I'm pretty sure the Utah data centre is kept pretty busy with its NSA business.
Re: (Score:3)
It's a neat thought, but it would never beat the basics. While there are a lot of genes that have common ancestors (called paralogues [wikipedia.org]), the hierarchical history of these genes is often hard to determine or something that pre-dates human speciation; for example, there's only one species (a weird blob [wikipedia.org] a little like a multi-cellular amoeba) that has a single homeobox gene.
While building a complete evolutionary history of gene families is of great interest to science, it's pointless to try exploiting it for com
Re: (Score:2)
gzip can be faster than the read/write buffer on standard hard drives.
Gzip of what? Chromosome-at-once? Isn't that the wrong way of traversing the data set, if you're aiming for actual compression? More to the point, gzip, if I'm not mistaken, is good for data with 8-bit boundaries. What if the data gets stored in base-4, six bits per triplet/codon? Finally, talking about string algorithms, I'd have thought that the best way of compressing the stuff would involve mapping the extant alleles and storing only references to them in the individual genomes.
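For what it's worth, the 8-bit-boundary question is easy to poke at directly. Below is a small Python sketch (my own toy test, nothing from the article) that gzips the same sequence both as plain ASCII and packed two bits per base. A random sequence is a worst case for gzip; real reads have repeats and attached quality strings, which is presumably why plain gzip of ASCII text often does well enough in practice.

# Toy comparison of gzip on ASCII bases vs. the same bases packed 2 bits each.
# A random sequence is a worst case for gzip; real DNA has repeats.
import gzip
import random

random.seed(0)
seq = "".join(random.choices("ACGT", k=400_000))

# Plain ASCII: one byte per base.
ascii_bytes = seq.encode("ascii")

# Packed: four bases per byte, 2 bits each.
code = {"A": 0, "C": 1, "G": 2, "T": 3}
packed = bytearray()
for i in range(0, len(seq), 4):
    byte = 0
    for j, base in enumerate(seq[i:i + 4]):
        byte |= code[base] << (2 * j)
    packed.append(byte)

print("ascii, gzipped :", len(gzip.compress(ascii_bytes)))
print("packed, gzipped:", len(gzip.compress(bytes(packed))))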
Re: (Score:2)
Here's the lowdown [blogspot.ca] on how BGZF works, as one example. In this case, there are many short, distinct fragments of DNA being stored together, each with offset and quality information, many of which may be identical. The compression is localized to smaller blocks (I'm not sure if they're 4096-byte disk sectors or something else.) You're right that there's probably some performance lost due to the misalignment, but 6 and 8 line up every 24 bits, so at worst that means patterns of four codons or three bytes—and a ste
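To make the blocked-compression idea concrete, here is a minimal sketch of the general approach. This is my own simplification, not the real BGZF format (which wraps each block in a gzip member and records its size in an extra header field): compress fixed-size blocks independently and keep an offset index so you can seek to a block without decompressing everything before it.

# Simplified illustration of blocked compression (BGZF-like idea only):
# independent zlib blocks plus an offset index for random access.
import zlib

BLOCK_SIZE = 64 * 1024  # real BGZF caps blocks at 64 KiB

def compress_blocks(data: bytes):
    blocks, index, out_pos = [], [], 0
    for start in range(0, len(data), BLOCK_SIZE):
        comp = zlib.compress(data[start:start + BLOCK_SIZE])
        index.append((out_pos, start))   # (compressed offset, uncompressed offset)
        blocks.append(comp)
        out_pos += len(comp)
    return b"".join(blocks), index

def read_block(compressed: bytes, index, block_no: int) -> bytes:
    start = index[block_no][0]
    end = index[block_no + 1][0] if block_no + 1 < len(index) else len(compressed)
    return zlib.decompress(compressed[start:end])

payload = b"ACGTACGTTTGACCA" * 10000
comp, idx = compress_blocks(payload)
assert read_block(comp, idx, 1) == payload[BLOCK_SIZE:2 * BLOCK_SIZE]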
Re: (Score:2)
Re: (Score:3)
Re: (Score:2)
But it (mostly) works...
Re: (Score:2)
In high energy physics, we rolled our own big data solutions (mostly because there was no big data other than us when we did so). It turned out to be terrible.
But genetic data isn't particle physics data. It makes perfect sense to roll out a custom "big data" (whatever that crap means) solution because of the very nature of the data stored (at the very least, you will want DNA-specific compression algorithms because there's huge redundancy in the data spread horizontally across the sequenced individuals).
Re: (Score:2)
I shouldn't say "they"; I work with next-gen DNA data on a daily basis. The main problem is that everyone in biology uses awful flat ASCII files for so many things. And databases... well, most are so badly designed because they're literally built by someone reading "SQL for Dummies" as he goes.
The last but not least of the problems are experimental design. Too often things
Re: (Score:2)
Miles of DVDs? (Score:3)
Can't we have more meaningful units?
How many Libraries of Congress is that?
Re: (Score:2)
Re: (Score:3)
You should not write a C++ interpreter. You especially shouldn't write an interpreter of a language that looks almost just like C++, but is different from it in unpredictable ways, some of which contribute to bad coding habits and/or make normal C++ more difficult to learn.
Strictly sequential files are a bad model for data if most of your time is spent constructing more-and-more elaborate subsets of that data. When we want to examine a subset, we practically have to make a complete copy of all the data f
obvious solution (Score:1)
don't store it all on DVDs, then
Bogus units (Score:5, Insightful)
Re: (Score:2)
Digital DNA storage anyone ? (Score:2, Insightful)
Why aren't they storing it in digital DNA format? Seems like a pretty efficient data storage format to me! A couple of grams of the stuff should suffice.
Re: (Score:1)
Re: (Score:3, Interesting)
Actually, ASCII files are the easiest to process. And since we generally use a handful of ambiguity codes, it's more like ATGCNX. Due to repetitive segments, gzip actually works out better than your proposed 2-bit scheme. We do a lot of UNIX piping through gzip, which is still faster than a magnetic hard drive can retrieve data.
Re: (Score:3, Interesting)
Re: (Score:2)
I did a first draft of insulin sequence on punch cards.
Re:Digital DNA storage anyone ? (Score:5, Informative)
why aren't they storing it in digital DNA format
Because they need to be able to read it back quickly, and error-free. Add to that, it's actually quite expensive to synthesize that much DNA; hard drives are relatively cheap by comparison.
Re: (Score:2)
ASCII is fast and error correcting now?
Relative to genome sequencing, hell yes it is! For the original sequencing, a relatively high error rate isn't a huge deal because there is massive redundancy in the fragment reads, which is also required to actually assemble all those bits and pieces. But you can see why it's even more inefficient this way...
Re: (Score:2)
The comparison is between conventional storage in ASCII format versus storing information in DNA. Whether ASCII is optimally fast and error-free on a purely objective scale, compared with other forms of conventional digital electronic storage, is irrelevant to the question I was answering.
The problem will solve itself (Score:5, Funny)
To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.
Once that happens, they'll be able to stop storing it on DVDs and move it into the cloud.
Re: (Score:2)
To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.
Once that happens, they'll be able to stop storing it on DVDs and move it into the cloud.
And before anyone knows, we would have a space elevator within the next 5 years instead of the eternal +25 [xkcd.com].
Re: (Score:2)
Please, do continue measuring the massless sizeless thing in units of things with mass and size. It makes lots of sense.
This just goes to show... (Score:4, Informative)
...what a shitty storage medium DVDs are these days.
A cheap 3TB disk drive burned to DVDs will produce a rather unwieldy tower of disks as well.
G.
Re: (Score:1)
I also remember the feeling when I had to face the fact that I'd lost data beyond repair on a CD "backup".
Re: (Score:2)
Of course it would sound more impressive if they used a stack of punched cards....
Simple. Get the NSA to do it. (Score:5, Funny)
Publish a scientific paper stating that potential terrorists or other subversives can be identified via DNA sequencing. The NSA will then covertly collect DNA samples from the entire population, and store everyone's genetic profiles in massive databases. Government will spend the trillions of dollars necessary without question. After all, if you are against it, you want another 9/11 to happen.
Re: (Score:2)
You mean ask the NSA how they've already done it.
Re: (Score:1)
Database Replication (Score:5, Insightful)
Bit rot is also a big problem with data. So the data has to be replicated to keep entropy from destroying it, which means self-correcting metadata must be used. If only there were a highly compact, self-correcting, self-replicating data storage system with 1's and 0's the size of small molecules...
My greatest fear is that when we meet the aliens, they'll laugh, stick us in a holographic projector, and gather around to watch the vintage porn encoded in our DNA.
Re:Database Replication (Score:4, Funny)
I propose we call this new data method Data Neutral Assembly.
Re: (Score:1)
If only there were a highly compact, self-correcting, self-replicating data storage system with 1's and 0's the size of small molecules...
In the future, if sequencing becomes extremely fast and cheap, it might make sense to discard sequencing data after analysis and leave DNA in its original format for storage. That said, if the colony of (bacteria/yeast/whatever you are maintaining your library in) that you happen to pick when you grow up a new batch to maintain the cell line happened to pick up a mutation in your gene of interest, you won't know until you sequence it again. I'm a graduate student in a small academic lab and if I want to "
Re: (Score:3)
Bit rot is also a big problem with data.
Take a whiff of a piece of meat after 2 weeks at room temperature and compare it with how a DVD smells after the same time.
Complaints about bit rot will be accepted only after the experiment.
Re: (Score:3)
I know you were going for funny, but much of what you will be smelling in your experiment is from bacteria eating the protein and polysaccharides in the meat. The DNA is remarkably stable and even if some of it is fragmented, you have a massively redundant set in your pile of meat.
We've sequenced DNA from nearly a million years ago and I regularly store DNA dried out and stuck to a piece of paper. DVDs won't last nearly that long before the dyes start to break down. For a long term archival system, we could
2000 devices make a lot of data (Score:3)
It seems a little overly sensationalist to aggregate the devices together when determining the storage size, just to make such a dramatic 2-mile-high tower of DVDs... If you look at them individually, it's not that much data:
(15 x 10^15 bytes/device) / (2000 devices) / (1 x 10e9 bytes/gb) = 7500GB, or 7.5TB
That's a stack of 4TB hard drives 2 inches high. Or if you must use DVD's, that's a stack of 1600 DVD's 2 meters high.
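For anyone who wants to check it, here's the same per-instrument arithmetic written out (figures taken straight from the summary; the DVD capacity and thickness are the usual single-layer numbers):

# Per-instrument arithmetic, using the figures from the summary.
total_bytes_per_year = 15e15      # ~15 PB/year across all instruments
instruments = 2000
dvd_capacity_bytes = 4.7e9        # single-layer DVD
dvd_thickness_mm = 1.2

per_device = total_bytes_per_year / instruments
dvds = per_device / dvd_capacity_bytes
print(f"per instrument: {per_device / 1e12:.1f} TB/year")                  # ~7.5 TB
print(f"as DVDs: {dvds:.0f} discs, ~{dvds * dvd_thickness_mm / 1000:.1f} m tall")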
Re: (Score:2)
Let's look at your units.
bytes / device / devices / (bytes/gb) = gb/devices^2
Oops! So the math SHOULD be 15e15*2e3/1e9 = 30e9. GB, not Gb, incidentally. That's actually kind of a lot.
---
Thanks for the critique of the typos in my hastily typed out formula, but it would have meant more if you were correct.
I spent 10 minutes trying to type out a real formula that would pass Slashdot's "junk" filter, but it kept telling me I had too many junk characters, so here's the closest I could get to
LHC (Score:1)
Hmmm, shitton... (Score:1)
Re: (Score:2)
Storage Non-Problem - Sequences Compress to MBs (Score:5, Informative)
Storage is not the problem. Computational power is.
Each genetic sequence is ~3GB, but since sequences between individuals are very similar, it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or 0.05% of total worldwide data storage (295 exabytes).
Now the real challenge is more in having enough computational power to read and process the 3-billion-letter genetic sequence, and in designing effective algorithms to process this data.
More info on compression of genomic sequences
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074166/ [nih.gov]
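As a toy illustration of the idea (nothing like the actual methods in the linked paper, which handle indels, structural variants, and proper entropy coding), a diff against the reference plus a back-of-the-envelope size estimate might look like this:

# Toy illustration of storing a genome as differences from a reference.
# Only single-base substitutions are handled here.

def diff_against_reference(reference: str, individual: str):
    """Return (position, alternate base) for every single-base difference."""
    return [(i, b) for i, (r, b) in enumerate(zip(reference, individual)) if r != b]

print(diff_against_reference("ACGTACGTACGT", "ACGTACGAACGT"))   # [(7, 'A')]

# Rough size estimate: a typical genome differs from the reference at a few
# million single-nucleotide sites. At ~5 bytes per record (4-byte position
# plus 1-byte base), that's on the order of 20 MB before compressing the
# variant list itself.
snvs_per_genome = 4_000_000
bytes_per_record = 5
print(f"~{snvs_per_genome * bytes_per_record / 1e6:.0f} MB per genome")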
Re: (Score:1)
Each genetic sequence is ~3GB, but since sequences between individuals are very similar, it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB.
That's true, but the problem is that as a good scientist you are required by most journals and universities to keep the original sequence data that comes off these high-throughput sequencers (aka the FASTQ files), so that you can show your work, so to speak, if it ever comes into question. These files often contain 30-40x coverage of your 3 Gb reference sequence and even compressed are still several GB in size. Additionally, because these large-scale sequencing projects are costing millions of dollars, the NIH
Re: (Score:2)
Re:Storage Non-Problem - Sequences Compress to M (Score:4, Informative)
I'm actually really looking forward to a vast reduction in dataset size and cost in the life sciences, so we can make use of and design better algorithmic methods and get back to answering questions. That's up to the engineers designing the sequencing machines though..
Re: (Score:3, Interesting)
A single finished genome is not the problem. It is the raw data.
The problem is that any time you sequence a new individual's genome for a species that already has a genome assembly, you need a minimum of 5x coverage across the genome to reliably find variation. Because of variation in coverage, that means you may have to shoot for >20x coverage to find all the variation. The problem is more complex when you are trying to de novo assemble a genome for a species that does NOT have a genome assembly. In this
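The gap between the 5x minimum and the >20x target falls straight out of the randomness of coverage. A rough sketch, assuming idealized Poisson-distributed read placement (real libraries are lumpier, so reality is worse):

# Why ~20x mean coverage is needed when you want >=5x everywhere,
# under an idealized Poisson model of read placement.
from math import exp, factorial

def prob_depth_at_least(k: int, mean_cov: float) -> float:
    """P(depth >= k) at a site, under Poisson(mean_cov) coverage."""
    return 1.0 - sum(exp(-mean_cov) * mean_cov**i / factorial(i) for i in range(k))

GENOME_BASES = 3.1e9
for mean in (5, 10, 20, 30):
    p = prob_depth_at_least(5, mean)
    missed_mb = (1 - p) * GENOME_BASES / 1e6
    print(f"{mean:>2}x mean coverage: {p:.4f} of sites reach >=5x, ~{missed_mb:.1f} Mb fall short")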
Re: (Score:2)
Each genetic sequence is ~3GB, but since sequences between individuals are very similar, it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or 0.05% of total worldwide data storage (295 exabytes).
For a single delta to a reference, but there's probably lots of redundancy in the deltas. If you have a tree/set of variations (Base human + "typical" Asian + "typical" Japanese + "typical" Okinawa + encoding the diff) you can probably bring the world estimate down by a few orders of magnitude, depending on how much is systematic and how much is unique to the individual.
Re: (Score:2)
Does it say anything about whether this approach also handles internal repeats?
The answer is obvious! (Score:4, Funny)
They should use a NoSQL multi-shard vertically integrated stack with a RESTful Rails-driven in-memory virtual multi-parallel JPython-enabled solution.
Bingo!
Re: (Score:1)
Brog, that tech stack is like soooo month-ago
Re: (Score:2)
Re: (Score:2)
They should use a NoSQL multi-shard vertically integrated stack with a RESTful Rails-driven in-memory virtual multi-parallel JPython-enabled solution.
Sounds like the technological equivalent of the human body => sounds about right!
Re: (Score:2)
Come on, those are yesterday's buzzwords! You can do better, I'm sure.
AO-Hell metrics... (Score:2)
"...At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station."
And another 10 years after that, the number of DVDs used will have almost reached the number of AOL CDs sitting in landfills.
Sorry, couldn't help myself with the use of such an absurd metric. Not like we haven't moved on to other forms of storage the size of a human thumbnail that offer 15x the density of a DVD...
Re: (Score:2)
I have a solution, molecular storage (Score:2)
If there were only some way to store the information encoded in DNA in a molecular level storage device... oh wait, face palm.
Oddly... I have a clue about this stuff lately (Score:5, Interesting)
Please... entire DNA genomes are tiny... on the order of 1Gb, with no compression. Taking into account the huge similarities to published genomes, we can compress that by at least 1000X. What they are talking about is the huge amount of data spit out by the sequencing machines in order to determine your genome. Once determined, it's tiny.
That said, what I need is raw machine data. I'm having to do my own little exome research project. My family has a very rare form of X-linked color blindness that is most likely caused by a single gene defect on our X chromosome. It's no big deal, but now I'm losing central vision, with symptoms most similar to late-onset Stargardt's disease. My UNC ophthalmologist beat the experts at Johns Hopkins and Jacksonville's hospital, and made the correct call, directly refuting the other doctors' diagnosis of Stargardt's. She thought I had something else and that my DNA would prove it. She gave me the opportunity to have my exome sequenced, and she was right.
So, I've got something pretty horrible, and my ophthalmologist thinks it's most likely related to my unusual form of color blindness. My daughter carries this gene, as does my cousin and one of her sons. Gene research to the rescue?!? Unfortunately... no. There are simply too few people like us. So... being a slashdot sort of geek who refuses to give up, I'm running my own study. Actually, the UNC researchers wanted to work with me... all I'd have to do is bring my extended family from California to Chapel Hill a couple of times over a couple of years and have them see doctors at UNC. There's simply no way I could make that happen.
Innovative companies to the rescue... This morning, Axeq, a company headquartered in MD, received my family's DNA for exome sequencing at their Korean lab. They ran an exome sequencing special in April: $600 per exome, with an order size minimum of six. They have been great to work with, and accepted my order for only four. Bioserve, also in MD, did the DNA extraction from whole blood, and they have been even more helpful. The blood extraction labs were also incredibly helpful, once we found the right places (very emphatically not Labcorp or Quest Diagnostics). The Stanford clinic lab manager was unbelievably helpful, and in LA, the lab director at the San Antonio Hospital Lab went way overboard. So far, I have to give Axeq and Bioserve five stars out of five, and the blood draw labs deserve a six.
Assuming I get what I'm expecting, I'll get a library of matched genes, and also all the raw machine output data, for four relatives. The output data is what I really need, since our particular mutation is not currently in the gene database. Once I get all the data, I'll need to do a bit of coding to see if I can identify the mutation. Unfortunately, there are several ways that this could be impossible. For example, "copy number variations", or CNVs, can't be detected with current technology if they go on for more than a few hundred base pairs. Ah... the life of a geek. This is yet another field I have to get familiar with...
Re: (Score:2)
CNVs actually can be detected if you have enough read depth; it's just that most assemblers are too stupid (or, in computer science terms, "algorithmically beautiful") to account for them. SAMTools can generate a coverage/pileup graph without too much hassle, and it should be obvious where significant differences in copy number occur.
(Also, the human genome is about 3.1 gigabases, so about 3.1 GB in FASTA format. De novo assemblies will tend to be smaller because they can't deal with duplications.)
Re: (Score:2)
I agree, CNVs are really easy to detect if you have the read depth. I've been using the samtools pileup output to show CNVs in my study organism. However, to make the results mean anything to most people, I've got to do a few more steps of processing to get all that data in a nice visual format.
If you don't have the read depth, you lose the ability to discriminate small CNVs from noise. Large CNVs, such as for whole chromosomes, are readily observed even in datasets with minimal coverage.
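As a rough sketch of what read-depth CNV scanning looks like in practice (this is not anyone's actual pipeline; the window size, the 1.5x/0.5x thresholds, and the assumption that the input is `samtools depth` output, i.e. tab-separated chrom/position/depth lines, are all just illustrative choices):

# Rough read-depth CNV scan over `samtools depth` output.
# Note: samtools depth skips zero-depth positions by default, so the window
# means here are over covered positions only; treat this as a sketch.
import sys
from collections import defaultdict

WINDOW = 10_000

def window_means(depth_file):
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(depth_file) as fh:
        for line in fh:
            chrom, pos, depth = line.split("\t")
            key = (chrom, int(pos) // WINDOW)
            sums[key] += int(depth)
            counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

def flag_cnv_candidates(depth_file):
    means = window_means(depth_file)
    genome_mean = sum(means.values()) / len(means)
    for (chrom, win), m in sorted(means.items()):
        if m > 1.5 * genome_mean or m < 0.5 * genome_mean:
            print(f"{chrom}:{win * WINDOW}-{(win + 1) * WINDOW}  mean depth {m:.1f} "
                  f"(genome mean {genome_mean:.1f})")

if __name__ == "__main__":
    flag_cnv_candidates(sys.argv[1])   # e.g. the output of: samtools depth sample.bam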
Re: (Score:2)
Thanks, guys, for the CNV info. I'm doing only 30-deep sequencing, but I will get 3 exome sequences all probably having the same defect on the X chromosome. Combining the data should give me some reasonable CNV detection ability.
Re: (Score:2)
Re: (Score:2)
Thanks! I certainly will need some guidance, so if you don't mind, I'll ping you when I get the data. Same thing for the guy below who also offered to help.
Re: (Score:1)
I also do this for my day job, from the side of downstream variant analysis and population genetics, so please feel free to contact me as well.
Re: (Score:2)
Will do! I wasn't expecting so much generosity in reply to my post, but thanks! I'm no dummy, but all I have is a Wikipedia level of knowledge of genetics, so any help I can get will be very much appreciated.
Re: (Score:1)
Re: (Score:2)
Thanks! I will check out snpEff. I certainly will need some help, so if you don't mind, I will contact you when I get the data. Same thing for the guy above who offered to help.
Re: (Score:2)
>I need is raw machine data
Too bad genome centers disagree with you (I, au contraire, agree with you). We need raw NMR data for structures as well.
Re: (Score:2)
Well... I hope I can get all the read data. Since I'm doing exome sequencing, rather than genome sequencing, shouldn't the raw data be something like 100X less, or around 5Gb per exome?
Re: (Score:2)
Yeah... I know you're right. I'm a fast learner, and I jump into all sorts of fields and make waves. One thing I've learned is that being smart gets you only so far. There's no substitute for real-world experience.
Re: (Score:2)
Ok, I see how this is a real issue now. If it costs $50 to store a genome ($100 for a 1T drive, and 500 gigabytes per genome of machine data), and the lab wants a copy as well as the user, and a backup somewhere, that's $150, which is significant when we imagine the entire process dropping to $1,000. The guys drawing blood and extracting DNA need their money, too, which frankly should be $100 to $200. Even shipping isn't cheap. Dry ice all the way to Korea has to cost a ton. My package was 17 lbs! When
To put this into perspective (Score:2)
To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.
NO. This does not put anything "into perspective", except to say "a lot of data" to the average Joe.
To put it into useful perspective, we should compare it with the large datasets encountered in other sciences, such as the 25 PB per year from the LHC. And that's after aggressively discarding collisions that don't look promising in the first pass; it would be orders of magnitude bigger otherwise.
But now just 15PB per year doesn't look that newsworthy, eh?
Re: (Score:2)
Well, if you really need to have that kind of contest...
The data files being discussed are text files generated as summaries of the raw sensor data from the sequencing machine. In the case of Illumina systems, the raw data consists of a huge high-resolution image; different colours in the image are interpreted as different nucleotides, and each pixel is interpreted as the location of a short fragment of DNA. (Think embarrassingly parallel multithreading.)
If we were to keep and store all of this raw data, th
Re: (Score:2)
Not really trying to turn it into a contest, just "to put this into perspective". More or less, the point is that other science projects have been dealing with similar data volumes for a few years already; if there is anything newsworthy about this "DNA Data Deluge", it had better be something more than just the data volume.
Re: (Score:2)
a straightforward solution (Score:2)
"At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station."
Use more than one stack. You're welcome.
600 bytes per person (Score:2)
If you have the genomes of your parents and your own genome, yours amounts to about 70 new spot mutations, about 60 crossovers, plus a record of who your parents were: about 600 bytes of new information per person. You could store the genomes of the entire human race in a couple of terabytes if you knew the family trees well enough. I tried to nail down the statistics for that in http://burtleburtle.net/bob/future/geninfo.html [burtleburtle.net] .
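The back-of-the-envelope accounting behind a figure like that (the bit widths here are my own rough choices, not taken from the linked page) looks roughly like this:

# Rough accounting for the ~600 bytes per person claimed above.
# Bit widths are illustrative choices only.
new_mutations = 70          # de novo single-base changes
crossovers = 60             # recombination breakpoints
pos_bits = 32               # enough to index a ~3.1-gigabase genome
base_bits = 2               # which of A/C/G/T the mutation introduced
parent_id_bits = 2 * 34     # two pointers into a ~10-billion-person pedigree

total_bits = (new_mutations * (pos_bits + base_bits)
              + crossovers * pos_bits
              + parent_id_bits)
print(f"~{total_bits / 8:.0f} bytes per person")   # on the order of 600 bytes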
Yay, AdEnine & 1 click splicing (Score:4, Funny)
doubling doubling (Score:1)
I'll send my invoice later (Score:1)
Compression by Reference (Score:2)
The sequence read archives (such as the one hosted by NCBI), as repositories for this sequencing data, use "compression by reference," a highly efficient way to compress and store a lot of the data. The raw data that comes off these sequencers is often >99% identical to the reference genome (human, etc.), so the most efficient way to compress and store this data is to record only what differs between the sequence output and the reference genome.
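The decode side is just as simple in principle. A toy sketch (substitutions only; real archives also encode indels, read boundaries, and quality scores):

# Toy decode step for compression by reference: apply a list of
# (position, alternate base) substitutions back onto the reference.

def apply_variants(reference: str, variants) -> str:
    seq = list(reference)
    for pos, alt in variants:
        seq[pos] = alt
    return "".join(seq)

reference = "ACGTACGTACGTACGT"
variants = [(3, "A"), (10, "C")]            # hypothetical differences
print(apply_variants(reference, variants))  # ACGAACGTACCTACGT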
We need to find new approaches ... (Score:2)
I suggest storing it in molecular form, as pairs drawn from four different bases (guanine, thymine, adenine, cytosine) combined in an aesthetically pleasing, double-helical molecule!
Re: (Score:1)
Perspective? (Score:2)
To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.
This doesn't put it into perspective at all. What is a DVD and why would I put data on it?
Re: (Score:2)
For God's sake! Give it to me in a useful unit that is normally provided by journalists... Give it to me in Library of Congresses!
Why Bother (Score:2)
It's not like they can use the data; it all is, or soon will be, patented! Even the patent holders are SOL because anything their bit of patented gene interacts with is patented by someone else. What a lovely system we have!
In the future... (Score:1)
In the future, DVD's will be made much thinner and won't stack up as high.
We have the means! (Score:1)
Re: (Score:2)
Re: (Score:1)
Or AC penises.
We're talking about big size measurements not micro measurements.
Re:Who uses DVDs? (Score:5, Funny)
Re: (Score:3)
Yes, but that took millions of years to develop the simplest versions.
It's astonishing that it took humans only a few millennia to get to that point on our own.
Re: (Score:2)