

Sequencing a Human Genome In a Week
blackbearnh writes "The Human Genome Project took 13 years to sequence a single human's genetic information in full. At Washington University's Genome Center, they can now do one in a week. But when you're generating that much data, just keeping track of it can become a major challenge. David Dooling is in charge of managing the massive output of the Center's herd of gene sequencing machines, and making it available to researchers inside the Center and around the world. He'll be talking about his work at OSCON, and gave O'Reilly Radar a sense of where the state of the art in genome sequencing is heading. 'Now we can run these instruments. We can generate a lot of data. We can align it to the human reference. We can detect the variants. We can determine which variants exist in one genome versus another genome. Those variants that are cancerous, specific to the cancer genome, we can annotate those and say these are in genes. ... Now the difficulty is following up on all of those and figuring out what they mean for the cancer. ... We know that they exist in the cancer genome, but which ones are drivers and which ones are passengers? ... [F]inding which ones are actually causative is becoming more and more the challenge now.'"
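For the 'which variants exist in one genome versus another' step, the core comparison is conceptually just a set difference between the tumor and normal variant lists. A minimal sketch in Perl, assuming hypothetical tab-separated variant files (chromosome, position, base) rather than any format the Genome Center actually uses:

#!/usr/bin/perl
# Sketch of the tumor-vs-normal comparison: variants seen in the cancer
# genome but not in the matched normal genome are candidate somatic
# mutations.  The file names and the simple "chrom<TAB>pos<TAB>base"
# format are hypothetical placeholders, not the Center's actual format.
use strict;
use warnings;

my %normal;
open my $n, '<', 'normal_variants.txt' or die "normal_variants.txt: $!";
while (<$n>) {
    chomp;
    $normal{$_} = 1;            # remember every variant seen in the normal genome
}
close $n;

open my $t, '<', 'tumor_variants.txt' or die "tumor_variants.txt: $!";
while (<$t>) {
    chomp;
    print "$_\n" unless $normal{$_};   # tumor-only variant: somatic candidate
}
close $t;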
Re:DNA GATC (Score:5, Interesting)
'I say we fork and refactor the entire project.'
You mean like this?:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=16729053 [nih.gov]
Re:Here's what I want to know... (Score:3, Interesting)
"Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?"
Not [wikipedia.org] necessarily [wikipedia.org]. ;-)
Humans have ~810.6 MiB of DNA (Score:2, Interesting)
So, what's going on here? Are the file formats used to store this data *that* bloated? Or are they trying to include structural information beyond sequence? What am I missing that makes this an unwieldy amount of data?
(I have to laugh at how Vista is apparently 20 times more complex than the people that use it...)
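For what it's worth, the ~810.6 MiB figure presumably comes from packing each base into 2 bits (4 bases per byte): roughly 3.4 billion bases works out to about 850 MB, i.e. ~810.6 MiB. A minimal sketch of that packing in Perl; the short sequence string is just a stand-in for real data:

#!/usr/bin/perl
# 2-bits-per-base packing: 4 bases per byte, so ~3.4 billion bases
# come out to ~850 MB (~810.6 MiB).  The short string below is just a
# stand-in for a real sequence.
use strict;
use warnings;

my %code = (A => 0, C => 1, G => 2, T => 3);
my $seq  = 'ACGTACGTGATTACAG';

my $packed = '';
for (my $i = 0; $i < length $seq; $i += 4) {
    my $byte = 0;
    for my $j (0 .. 3) {
        last if $i + $j >= length $seq;                          # past the end of the sequence
        $byte |= $code{ substr($seq, $i + $j, 1) } << (2 * $j);  # 2 bits per base
    }
    $packed .= chr($byte);
}

printf "%d bases packed into %d bytes\n", length($seq), length($packed);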
Re:Passing this data back to the scientist (Score:3, Interesting)
A whole human genome will fit on a CD.
If you just transmit the diffs from the generic (reference) human, you could put it in an e-mail.
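A rough sanity check on the diff idea; the counts and per-record byte cost below are assumptions for illustration, not measurements. Two people differ at something like 0.1% of positions, so an uncompressed diff lands in the tens of megabytes: more a hefty attachment than a quick e-mail, but still tiny next to the raw instrument output.

#!/usr/bin/perl
# Back-of-the-envelope size of "diffs from the generic human".  The
# counts and per-record byte cost are rough assumptions, not
# measurements.
use strict;
use warnings;

my $snp_count        = 3_000_000;  # assume ~0.1% of ~3 billion positions differ
my $bytes_per_record = 10;         # assume chrom + 4-byte position + base + overhead

my $total_bytes = $snp_count * $bytes_per_record;
printf "~%.1f MiB of uncompressed diffs\n", $total_bytes / (1024 * 1024);   # ~28.6 MiB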
Re:Humans have ~810.6 MiB of DNA (Score:2, Interesting)
The new method sounds like they're doing a microarray or something and just storing high-resolution JPEGs. I can see why that would require oodles of image-processing power. It does seem like an odd storage format for what's essentially linear data.
I suppose my point is more that they're storing a lot of useless information. I could see storing a ton of info about a sequence back when graduate students were adding nucleotides and interpreting graphs by hand, but in this day and age you'd just sequence redundantly until you got to the desired accuracy. I can't imagine that it'd be cheaper to have technicians manually tweak the entire sequence.
BTW, I'm not arguing against you, more against some of the design decisions of automated sequencers. You clearly know a lot more about the subject than my undergrad degree allows me to even think about refuting.
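For what it's worth, "sequence redundantly until you reach the desired accuracy" boils down to taking a consensus of the reads covering each position. A minimal majority-vote sketch; the observed bases are made up:

#!/usr/bin/perl
# Majority-vote consensus: with enough redundant reads over a position,
# the most frequently observed base wins.  The observations are made-up
# placeholders.
use strict;
use warnings;

my @observed = qw(A A C A A A T A);   # bases seen at one position across reads

my %count;
$count{$_}++ for @observed;

my ($consensus) = sort { $count{$b} <=> $count{$a} } keys %count;
printf "consensus: %s (%d of %d reads)\n",
       $consensus, $count{$consensus}, scalar @observed;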
I also manage a Next-gen Sequencing Machine (Score:3, Interesting)
Next-gen sequencing eats up huge amounts of space. Every run on our Illumina Genome Analyzer II takes up 4 terabytes of intermediate data, most of which comes from something like 100,000+ 20 MB bitmap picture files taken from the flowcells. That much data is an assload of work to process. Just today I got a little lazy with my Perl programming and let the program go unsupervised... it ate up 32 GB of RAM and froze the server. It took Red Hat three full hours to decide it had had enough of the swapping and kill the process.
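That kind of memory blowup is usually the difference between slurping everything into memory and streaming it record by record. A generic sketch of the streaming pattern; the file name is a hypothetical placeholder:

#!/usr/bin/perl
# The usual fix for runaway memory in a job like this: process the data
# one record at a time instead of slurping it all into RAM, and keep
# only summary values.  'huge_export.txt' is a hypothetical placeholder.
use strict;
use warnings;

open my $in, '<', 'huge_export.txt' or die "huge_export.txt: $!";

my ($records, $total_len) = (0, 0);
while (my $line = <$in>) {          # only one line held in memory at a time
    chomp $line;
    $records++;
    $total_len += length $line;     # do the per-record work here
}
close $in;

printf "%d records, %.1f MB of text\n", $records, $total_len / 1e6;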
For people not familiar with current-generation sequencing machines: they produce reads of 30-80 bp and use alignment programs to match the reads up against species databases. The reaction/imaging takes 2 days, prep takes about a week, processing the images takes another 2 days, and alignment takes about 4. The Illumina machine achieves higher throughput than the ABI ones but gives shorter reads; we get about 4 billion nt per run if we do everything right. Keep in mind, though, that the 4 billion nt figure is misleading: the read coverage distribution is not uniform (i.e., you do not cover every nucleotide of the human's 3 billion nt genome). To ensure 95%+ coverage you'd have to use 20-40 runs on the Illumina machine... in other words, about 6-10 months of non-stop work to get a reasonable degree of coverage over the entire human genome (at which point you can use programs to "assemble" the reads into a contiguous genome). WashU is very wealthy, so they have quite a few of these machines working at any given time.
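Under an idealized uniform-coverage (Lander-Waterman style) model, the expected fraction of the genome covered at least once at c-fold mean coverage is 1 - exp(-c); the sketch below runs that arithmetic for a few run counts, using the per-run yield quoted above. Real experiments need far more sequencing than this model predicts, precisely because the coverage distribution is not uniform:

#!/usr/bin/perl
# Idealized (Poisson / Lander-Waterman style) coverage estimate: with
# c-fold mean coverage, the expected fraction of the genome covered at
# least once is 1 - exp(-c).  Real runs need far more than this model
# predicts because the coverage distribution is not uniform.
use strict;
use warnings;

my $genome_size = 3_000_000_000;   # ~3 billion nt
my $nt_per_run  = 4_000_000_000;   # ~4 billion nt per run, per the numbers above

for my $runs (1, 5, 10, 20, 40) {
    my $c       = $runs * $nt_per_run / $genome_size;   # mean fold-coverage
    my $covered = 1 - exp(-$c);
    printf "%2d runs: %5.1fx mean coverage, %.4f of genome covered (ideal model)\n",
           $runs, $c, $covered;
}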
The main problem these days is that processing that much data requires a huge amount of computer know-how (writing software and algorithms, installing software, using other people's poorly documented programs) and a good understanding of statistics and algorithmic efficiency. Another problem they never mention is artifacts from the chemical protocol; just the other day we found a very unusual anomaly indicating that the first 1/3 of all our reads was absolutely crap (usually only the last few bases are unreliable). It turned out that our slight modification of the Illumina protocol, made to tailor it to studying epigenomic effects, had quite large effects on the sequencing reactions later on. Even for good reads a lot of the bases can be suspect, so you have to do a huge amount of averaging, filtering, and statistical analysis to make sure your results/graphs are accurate.
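The kind of check that catches an artifact like a bad first third of every read is just averaging the base quality at each cycle across reads. A sketch over a FASTQ file; the file name is a placeholder, and the Phred+33 quality offset is an assumption (older Illumina pipelines used +64):

#!/usr/bin/perl
# Per-cycle quality check: average the base quality at each cycle
# (position in the read) across all reads; systematically bad cycles
# show up as low averages.  'reads.fastq' is a placeholder, and the
# Phred+33 offset is an assumption (older Illumina pipelines used +64).
use strict;
use warnings;

open my $fq, '<', 'reads.fastq' or die "reads.fastq: $!";

my (@sum, @n);
while (defined(my $header = <$fq>)) {
    my $seq  = <$fq>;    # sequence line
    my $plus = <$fq>;    # '+' separator line
    my $qual = <$fq>;    # quality line
    last unless defined $qual;
    chomp $qual;
    my @q = map { ord($_) - 33 } split //, $qual;   # Phred+33 -> numeric quality
    for my $cycle (0 .. $#q) {
        $sum[$cycle] += $q[$cycle];
        $n[$cycle]++;
    }
}
close $fq;

for my $cycle (0 .. $#sum) {
    printf "cycle %3d: mean quality %.1f\n", $cycle + 1, $sum[$cycle] / $n[$cycle];
}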