Biotech Science

Sequencing a Human Genome In a Week

Posted by kdawson
from the data-data-everywhere dept.
blackbearnh writes "The Human Genome Project took 13 years to sequence a single human's genetic information in full. At Washington University's Genome Center, they can now do one in a week. But when you're generating that much data, just keeping track of it can become a major challenge. David Dooling is in charge of managing the massive output of the Center's herd of gene sequencing machines, and making it available to researchers inside the Center and around the world. He'll be talking about his work at OSCON, and gave O'Reilly Radar a sense of where the state of the art in genome sequencing is heading. 'Now we can run these instruments. We can generate a lot of data. We can align it to the human reference. We can detect the variants. We can determine which variants exist in one genome versus another genome. Those variants that are cancerous, specific to the cancer genome, we can annotate those and say these are in genes. ... Now the difficulty is following up on all of those and figuring out what they mean for the cancer. ... We know that they exist in the cancer genome, but which ones are drivers and which ones are passengers? ... [F]inding which ones are actually causative is becoming more and more the challenge now.'"

  • Re:DNA GATC (Score:5, Interesting)

    by RDW (41497) on Monday July 13, 2009 @07:55PM (#28684541)

    'I say we fork and refactor the entire project.'

    You mean like this?:

    http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=16729053 [nih.gov]

  • by blackbearnh (637683) * on Monday July 13, 2009 @07:58PM (#28684585)
    The holdup wasn't the computing power; it was the sequencing throughput. Also, as noted in the article, they can do it in a week now partly because they have the completed human genome to use as a template to match things up against. As I put it in the interview, it's like the difference between putting together a jigsaw puzzle with the cover image available and doing one without.
  • Buttload of data (Score:2, Interesting)

    by virgil Lante (1382951) on Monday July 13, 2009 @08:01PM (#28684613)
    Illumina's Solexa sequencing produces around 7 TB of data per genome sequenced. It's a feat just to move the data around, let alone analyze it. It's amazing how far sequencing technology has come, but how little our knowledge of biology as a whole has advanced. 'The Cancer Genome' does not exist. No two tumors are the same, and in cancer, especially solid tumors, no two cells are the same. Sequencing a mishmash of cells from a tumor only gives you the average, which may or may not tell you anything pertinent about the tumor. Vogelstein's group has shown this quite convincingly, but hardly anyone truly looks at what the data really say.
  • by K. S. Kyosuke (729550) on Monday July 13, 2009 @08:02PM (#28684617)

    "Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?"

    Not [wikipedia.org] necessarily [wikipedia.org]. ;-)

  • by izomiac (815208) on Monday July 13, 2009 @09:27PM (#28685211) Homepage
    The human genome is approximately 3.4 billion base pairs long. There are four bases, so this would correspond to 2 bits of information per base. 2 * 3,400,000,000 /8 /1024 /1024 = 810.6 MiB of data per sequence. That doesn't seem like it'd be too difficult. With a little compression it'd fit on a CD. Now, I suppose each section is sequenced multiple times and you'd want some parity, but it still seems like something that'd easily fit on a DVD (especially if alternate sequences are all diff'd from the first). Perhaps throw in another disc for pre-computed analysis results and that ought to be it.
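
A minimal sketch of the arithmetic above, assuming the comment's figure of 3.4 billion base pairs and a plain 2-bits-per-base encoding. The names `packed_size_mib` and `pack` are made up for illustration; real sequence formats also carry quality scores and metadata, which this ignores.

```python
# Hypothetical 2-bit code for the four bases (the assignment is arbitrary).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def packed_size_mib(base_pairs: int, bits_per_base: int = 2) -> float:
    """Size in MiB of a sequence stored at a fixed number of bits per base."""
    total_bytes = base_pairs * bits_per_base / 8
    return total_bytes / 1024 / 1024


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Each base takes 2 bits; the first base sits in the lowest bits.
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


# 3.4 billion base pairs is the figure used in the comment, not a measured value.
print(round(packed_size_mib(3_400_000_000), 1))  # 810.6
print(len(pack("GATC")))  # 1 byte holds four bases
```

This only covers the finished consensus sequence; it says nothing about the intermediate image and read data discussed elsewhere in the thread.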

    So, what's going on here? Are the file formats used to store this data *that* bloated? Or are they trying to include structural information beyond sequence? What am I missing that makes this an unwieldy amount of data?

    (I have to laugh at how Vista is apparently 20 times more complex than the people that use it...)
  • by goombah99 (560566) on Monday July 13, 2009 @10:12PM (#28685447)

    A whole human genome will fit on a CD.

    If you just transmit the diffs from the generic human, you could put it in an e-mail.
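
The "diffs from the generic human" idea can be sketched roughly as follows. This is a toy, assuming equal-length sequences and single-base substitutions only; the function names are hypothetical, and real genomes also differ by insertions, deletions, and larger rearrangements that this does not handle.

```python
def diff_against_reference(reference: str, sample: str):
    """Return (position, reference_base, sample_base) for each substitution."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


def apply_diff(reference: str, diffs) -> str:
    """Rebuild the sample sequence from the reference plus its diffs."""
    seq = list(reference)
    for pos, _ref_base, sample_base in diffs:
        seq[pos] = sample_base
    return "".join(seq)


reference = "GATTACAGATTACA"  # made-up toy sequences
sample = "GATTTCAGATTACC"
diffs = diff_against_reference(reference, sample)
print(diffs)  # [(4, 'A', 'T'), (13, 'A', 'C')]
print(apply_diff(reference, diffs) == sample)  # True
```

Only the differing positions need to be stored or sent, which is why the result can be far smaller than the full sequence when two genomes are mostly identical.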

  • by izomiac (815208) on Monday July 13, 2009 @10:49PM (#28685773) Homepage
    Interesting, I was assuming that it was more of the former method since I hadn't studied the latter. Correct me if I'm wrong, but as I remember it that method involves supplying only one type of fluorescently labeled nucleotide at a time during in vitro DNA replication and measuring the intensity of flashes as nucleotides are added (e.g. brighter flash means two bases were added, even brighter if it's three, etc.). Keeping track of four sensors at 200 bytes per base would imply sensors that could detect 133 levels of brightness or 8 measurements per base at 16 levels of brightness. That seems like a lot higher resolution than the example data sheets I've seen, but maybe that's what current technology can do. Still though, most bases are fairly unambiguous so the bulk of the sequence could likely be stored as results only.

    The new method sounds like they're doing a microarray or something and just storing high resolution jpegs. I could see why that would require oodles of image processing power. It does seem like an odd storage format for what's essentially linear data.

    I suppose my point is more that they're storing a lot of useless information. I could see storing a ton of info about a sequence back when graduate students were adding nucleotides and interpreting graphs by hand, but in this day and age you'd just redundantly sequence until you got to the desired accuracy. I couldn't imagine that it'd be cheaper to have technicians manually tweak the entire sequence.

    BTW, I'm not arguing against you, more against some of the design decisions of automated sequencers. You clearly know a lot more about the subject than my undergrad degree allows me to even think about refuting.
  • by Anonymous Coward on Monday July 13, 2009 @11:22PM (#28686047)

    Next-gen sequencing eats up huge amounts of space. Every run on our Illumina Genome Analyzer II machine takes up 4 terabytes of intermediate data, most of which comes from something like 100,000+ 20 MB bitmap picture files taken from the flowcells. That much data is an ass load of work to process. Just today I got a little lazy with my Perl programming and let the program go unsupervised...and it ate up 32 GB of RAM and froze up the server. It took Red Hat 3 full hours to decide it had had enough of the swapping and kill the process.

    For people not familiar with current-generation sequencing machines: they produce reads of 30-80 bp and use alignment programs to match the reads against species databases. The reaction/imaging takes 2 days, prep takes about a week, processing images takes another 2 days, and alignment takes about 4. The Illumina machine achieves higher throughput than the ABI ones but gives shorter reads; we get about 4 billion nt per run if we do everything right. Keep in mind, though, that the 4 billion figure is misleading: the read coverage distribution is not uniform (i.e., you do not cover every nucleotide of the human's 3 billion nt genome). To ensure 95%+ coverage, you'd have to use 20-40 runs on the Illumina machine...in other words, about 6-10 months of non-stop work to get a reasonable degree of coverage over the entire human genome (at which point you can use programs to "assemble" the reads into a contiguous genome). WashU is very wealthy, so they have quite a few of these machines available to work at any given time.
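
For a rough feel for why one run's worth of bases does not cover the whole genome, here is a sketch using the idealized Lander-Waterman (Poisson) model, in which reads land uniformly at random and the expected fraction of bases covered at average depth c is 1 - e^(-c). That uniformity is an assumption the comment above explicitly says does not hold in practice, so treat these numbers as an optimistic lower bound, not as a check on the 20-40 run estimate. The function names are made up for illustration.

```python
import math


def expected_fraction_covered(total_bases_sequenced: float, genome_size: float) -> float:
    """Expected fraction of the genome hit by at least one read (idealized model)."""
    depth = total_bases_sequenced / genome_size  # average coverage depth c
    return 1.0 - math.exp(-depth)


def depth_for_fraction(target_fraction: float) -> float:
    """Average depth needed to cover the target fraction under the same model."""
    return -math.log(1.0 - target_fraction)


# Figures from the comment: about 4 billion nt per run, 3 billion nt genome.
one_run = expected_fraction_covered(4e9, 3e9)
print(f"one run covers about {one_run:.1%} of the genome in this model")
print(f"95% coverage needs an average depth of about {depth_for_fraction(0.95):.1f}x")
```

Even in this best case a single run leaves roughly a quarter of the genome untouched. Uneven coverage, unreliable bases, and the depth needed to call variants with confidence all push the real requirement well above the idealized figure.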

    The main problem these days is that processing that much data requires a huge amount of computer know-how (writing software, designing algorithms, installing software, using other people's poorly documented programs) and a good understanding of statistics and algorithms, especially when it comes to efficiency. Another problem they never mention is artifacts from the chemical protocol; just the other day we found a very unusual anomaly indicating that the first 1/3 of every one of our reads was absolute crap (usually only the last few bases are unreliable). It turned out that our slight modification of the Illumina protocol, made to tailor it to studying epigenomic effects, had quite large effects on the sequencing reactions later on. Even for good reads, a lot of the bases can be suspect, so you have to do a huge amount of averaging, filtering, and statistical analysis to make sure your results/graphs are accurate.
