Biotech Science

Sequencing a Human Genome In a Week

Posted by kdawson
from the data-data-everywhere dept.
blackbearnh writes "The Human Genome Project took 13 years to sequence a single human's genetic information in full. At Washington University's Genome Center, they can now do one in a week. But when you're generating that much data, just keeping track of it can become a major challenge. David Dooling is in charge of managing the massive output of the Center's herd of gene sequencing machines, and making it available to researchers inside the Center and around the world. He'll be talking about his work at OSCON, and gave O'Reilly Radar a sense of where the state of the art in genome sequencing is heading. 'Now we can run these instruments. We can generate a lot of data. We can align it to the human reference. We can detect the variance. We can determine which variance exists in one genome versus another genome. Those variances that are cancerous, specific to the cancer genome, we can annotate those and say these are in genes. ... Now the difficulty is following up on all of those and figuring out what they mean for the cancer. ... We know that they exist in the cancer genome, but which ones are drivers and which ones are passengers? ... [F]inding which ones are actually causative is becoming more and more the challenge now.'"
This discussion has been archived. No new comments can be posted.


  • DNA GATC (Score:5, Funny)

    by sakdoctor (1087155) on Monday July 13, 2009 @06:41PM (#28684403) Homepage

    Functions that don't do anything, no comments, worst piece of code ever!

    I say we fork and refactor the entire project.

    • Re:DNA GATC (Score:5, Interesting)

      by RDW (41497) on Monday July 13, 2009 @06:55PM (#28684541)

      'I say we fork and refactor the entire project.'

      You mean like this?:

      http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=16729053 [nih.gov]

    • Functions that don't do anything, no comments, worst piece of code ever!

      Most of it doesn't code proteins or any of the other things that have been reverse-engineered so far. How do you know it's NOT comments?

      (And if terrestrial life was engineered and it IS comments, do they qualify as "holy writ"?)

      • There was some SF book I read, where it was explained that the comments were "made by a demo version of creature editor" and that was the reason for humans to die after 100 years. Some hacker has then found a way to reset the demo counter and thus to make people live forever.

        • Like this girl [go.com]?

        • There was some SF book I read, where it was explained that the comments were "made by a demo version of creature editor" and that was the reason for humans to die after 100 years. Some hacker has then found a way to reset the demo counter and thus to make people live forever.

          Hey, what if that's how DNA works? Would that not be like awesome and stuff?

      • How do you know it's NOT comments?

        Come on, how many programmers do you know who write comments, meaningful or not? I personally have a massive descriptive dialogue running down the side. "Real" programmers have told me that is excessive. Looking at their code I find one comment every 20 to 50 lines, and "descriptive" identifiers like i, x or y. The genome will be just like that. (Also, any big project ends up with lots of dead code. Yes, I know the compiler identifies that, but...)

        • Come on, how many programmers do you know that write comments, meaningful or not?

          Plenty. And the ones that do tend to have more functional programs, too. B-)

          (My own code is heavily commented - to the point of providing a second full description of the design. And a colleague once said I'm the only person he'd trust to program his pacemaker. B-) )

    • I would like to announce publicly that my genome is released under the GPL
      • by ComaVN (325750)

        I would like to announce publicly that my genome is released under the GPL

          So you'll only allow your children to mate with other GPL'ed people?

        • by hakey (1227664)
          No, it's viral. When his children mate with non-GPL people, their code becomes GPL.
    • At least it's backed up well. 3 backups of almost everything ain't bad.

      Two strands on each chromosome... I'm probably in the wrong crowd of nerds...

    • Re: (Score:3, Funny)

      by K. S. Kyosuke (729550)
      You thought God can't spell "job security"? Mind you, he's omnipotent!
    • Actually it's just the arrogance of some scientists, who later found out that all those parts that seemingly did not do anything were in fact just as relevant. Just in a different way. Whoops!

    • Another problem they never mention is artifacts from the chemical protocol; just the other day we found a very unusual anomaly that indicated the first 1/3 of all our reads was absolutely crap (usually only the last few bases are unreliable); turned out our slight modification of the Illumina protocol to tailor it to studying epigenomic effects had quite large effects on the sequencing reactions later on.

      Do you have any more details about this? I'm working on solexa sequencing of ChIP DNA with (modified

    • Splice too much of that bad, useless, convoluted code into a "new" human and we might end up with a G-Gnome or GNOME (Gratuitous, Nascent, Ogreous, Mechanised Entity). Call it... "G-UNIT", and give it a uniform and a mission. Or, give it a script and a part and call it Smeagol/Smigel...

  • by HotNeedleOfInquiry (598897) on Monday July 13, 2009 @06:43PM (#28684423)
    Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?
    • by QuantumG (50515) * <qg@biodome.org> on Monday July 13, 2009 @06:45PM (#28684431) Homepage Journal

      Typically they sequence every base at least 30 times.

    • by blackbearnh (637683) * on Monday July 13, 2009 @06:46PM (#28684443)
      I wondered the same thing, so I asked. From the article: And between two cells, one cell right next to the other, they should be identical copies of each other. But sometimes mistakes are made in the process of copying the DNA, and so some differences may exist. However, we're not currently sequencing single cells. We'll collect a host of cells and isolate the DNA from that whole host of cells. So what you end up with when you read the sequence out on these things is, essentially, an average of this DNA sequence. Well, I mean it's digital in that eventually you get down to a single piece of DNA. But once you align these things back, if you see 30 reads that all align to the same region of the genome and only one of them has an A at a position while all of the others have a T there, you can't say whether that A was actually some small change between one cell and its 99 closest neighbors or whether it was just an error in the sequencing. So it's hard to say cell-to-cell how much difference there is. But, of course, that difference does exist; that's mutation, and that's what eventually leads to cancer and other diseases.
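The "average" described in the quote is, at its core, a pileup consensus call. A minimal sketch of the idea (illustrative only; real pipelines also weight each read by its base-quality score and model sequencing error):

```python
from collections import Counter

def consensus_base(reads):
    """Majority-vote consensus from the bases observed at one position.

    Returns (base, support_fraction). This toy version ignores base
    quality, so it cannot distinguish a rare variant from a read error.
    """
    counts = Counter(reads)
    base, n = counts.most_common(1)[0]
    return base, n / len(reads)

# 30 reads covering one position: 29 agree, 1 disagrees.
base, support = consensus_base(["T"] * 29 + ["A"])
```

As the quote says, the lone disagreeing read could be a real cell-to-cell difference or a sequencing error; the vote alone cannot tell them apart.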
      • It can get even more complicated too: if you have 10x coverage of a position, and 9 say T while 1 says G, it may be an allelic variation. There's a one in 16 chance this'll happen randomly instead of the 5 Ts and 5 Gs you expect.
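The headline odds depend on exactly which outcomes you count, but the underlying binomial arithmetic is easy to check directly, treating each of the 10 reads at a heterozygous site as an unbiased draw between the two alleles (an idealization that ignores sequencing error and allele bias):

```python
from math import comb

def split_probability(n, k):
    """P(exactly k reads show one named allele out of n), assuming each
    read samples either allele of a heterozygous site with probability 1/2."""
    return comb(n, k) / 2 ** n

p_5_5 = split_probability(10, 5)                             # the "expected" even split
p_9_1 = split_probability(10, 9) + split_probability(10, 1)  # a 9-1 split, either direction
```

Even at a true heterozygous site, a lopsided split is unlikely but far from impossible, which is exactly what makes it hard to separate from a sequencing error.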

      • by timeOday (582209)
        In other words, you don't have "a" genome. What you are is a big bunch of cells, closely enough related that their genomes are very similar.
    • Re: (Score:3, Interesting)

      by K. S. Kyosuke (729550)

      "Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?"

      Not [wikipedia.org] necessarily [wikipedia.org]. ;-)

    • Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?

      They should be. An individual's genome does not change over time. Gene expression can change, which can itself lead to significant problems such as cancer.

      • by Rakishi (759894)

        The genome sure as hell changes; lots of mutations happen all the time in probably every cell of our body. The usual cause of cancer is certain genetic mutations beyond the scope of the body's mechanisms for dealing with them. That in turn causes your gene expression to change, since it is also, to a large extent, controlled by the genome. Purely, or mostly, expression-caused cancer is apparently also possible; however, it's not the only cause of cancer.

        • The genome sure as hell changes

          Not necessarily. The genome refers specifically to the genes encoded by DNA; mutations can also occur in the non-coding regions. Indeed the non-coding regions are often the most critical for gene expression.

          Hence a non-genomic mutation can have a profound effect on gene expression.

          lots of mutations happening all the time in probably every cell of our body

          Also not necessarily true. For example, a non-dividing cell has no reason to duplicate its own genome, hence it has almost no chance to acquire mutations.

          That in turn causes your gene expressions to change since they're also, to a large extent, controlled by the genome.

          As I already described, much of gene expression is regulated by non c

          • by Rakishi (759894)

            The genome refers specifically to the genes encoded by DNA;

            Genome (gene + [chromos]ome) refers to all the genetic material, coding and non-coding, gene or not. It has meant that for the past 90 years, since before non-coding regions were even thought of, I'm guessing, and last I checked it hasn't been redefined. You have very nicely explained why having it refer to just the coding regions would be stupid. If I'm wrong then I'd love to know, but you'll need to provide me a reference.

            The definition of a gene on the other hand may be changing to include both coding and non-coding regions

            • by Rakishi (759894)

              Just to add so we're all clear on definitions, as I understand it we're talking about these regions of DNA here:
              a) rna/protein coding region
              b) transcribed non-coding regions (introns)
              c) regulatory region
              d) unknown/junk/other non-coding region

              I suspect there's some additional stuff I missed but it's been a while since I cared too much about this.

              In my molecular biology classes "gene" was used to refer to a, b and c as they relate to a protein. To be honest that someone would t

            • by bradbury (33372)

              The Non-Homologous End Joining (NHEJ) DNA double strand break repair process can produce mutagenic deletions (and sometimes insertions) in the DNA sequence. Both the Werner's Syndrome (WRN) and Artemis (DCLRE1) proteins involved in that process have exonuclease activity in order to process the DNA ends into forms which can be reunited. The Homologous Recombination (HR) pathway, which is more active during cell replication, is more likely to produce "gene conversion" which can involve copying of formerly m

          • by bradbury (33372)

            I would suggest that you spend some time studying the topic in more detail before you make comments on /.

            At all the genome conferences I've been to the "genome" includes everything -- the chromosome number and architecture, the coding, regulatory & non-coding regions (tRNA, rRNA, miRNA/siRNA, telomere length, etc.). But the non-coding, highly variable parts of the genome can be considered part of the "big picture" because the amount of *really* junk DNA may function as a free radical "sink" which prot

            • I would suggest that you spend some time studying the topic in more detail before you make comments

              Starting your response by insulting the other person? I ordinarily wouldn't bother responding, but since you took the time to provide a peer-reviewed article as a reference I'll give you a chance.

              The DNA composition and ultrastructure may also affect things like gene expression (DNA unwinding temperature, variable access to genes, etc.).

              Actually if you had read what I said earlier about gene regulation you would have seen that I already said that. Non-coding regions are where transcription factors for gene expression often bind.

              If you knew about the 5+ types of DNA repair (BER, NER, MMR, HR, NHEJ) involving 150+ proteins or had some knowledge

              You could be less arrogant and presumptuous in your statements. You seem to have taken what you wanted to see in my

      • by maxume (22995)

        Except cells do undergo the occasional survivable mutation, and then there are the people that integrated what would have been a twin, and so on.

    • Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same.

      You're talking about different individuals? There will be differences, yes, but most of that difference should be in non-coding regions. The actual regions making proteins should be nearly identical. I only work with a few DNA sequences that code for proteins, so that's all I'd be interested in, but there are other applications in medicine for which the variation in non-coding regions would be important.

    • by bogado (25959)

      Each person has a different sequence; the first time around, they sequenced just one of the billions of "human genomes". Sequencing different people could help find what makes one person different from another and, on the other hand, what makes us similar. :-)

  • by Anonymous Coward on Monday July 13, 2009 @06:43PM (#28684425)

    Just store all that data as a chemical compound. Maybe a nucleic acid of some kind? Using two long polymers made of sugars and phosphates? I bet the whole thing could be squeezed into something smaller than the head of a pin!

  • by SlashBugs (1339813) on Monday July 13, 2009 @07:00PM (#28684603)
    Data handling and analysis is becoming a big problem for biologists generally. Techniques like microarray (or exon array) analysis can tell you how strongly a set of genes (tens of thousands, with hundreds of thousands of splice variants) are being expressed under given conditions. But actually handling this data is a nightmare, especially as a lot of biologists ended up there because they love science but aren't great at maths. Given a list of thousands of genes, teasing out the statistically significantly different genes from the noise is only the first step. Then you have to decide what's biologically important (e.g. what's the prime mover and what's just a side-effect), and then you have a list of genes which might have known functions but more likely have just a name or even a tag like "hypothetical ORF #3261", for genes that are predicted by analysis of the genome but have never been proved to actually be expressed. After this, there's the further complication that these techniques only tell you what's going on at the DNA or RNA level. The vast majority of genes only have effects when translated into protein and, perhaps, further modified, meaning that you can't be sure that the levels you're detecting by the sequencing (DNA level) or expression analysis chips (RNA level) actually reflect what's going on in the cell.
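The "teasing out the statistically significantly different genes" step is, at bottom, a multiple-testing problem across thousands of genes tested at once. A minimal sketch of the Benjamini-Hochberg false-discovery-rate procedure (illustrative; real analyses use an established statistics package):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests passing the Benjamini-Hochberg FDR procedure:
    sort the p-values, find the largest rank k (1-based) such that
    p_(k) <= (k / m) * alpha, and reject all hypotheses up to that rank."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff])

# Five hypothetical genes: two with strong signal, three consistent with noise.
significant = benjamini_hochberg([0.001, 0.008, 0.40, 0.74, 0.92])
```

With thousands of genes on a chip, skipping a correction like this guarantees a long list of false positives at any naive p < 0.05 threshold.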

    One of the big problems studying expression patterns in cancer specifically is the paucity of samples. The genetic differences between individuals (and tissues within individuals) means there's a lot of noise underlying the "signal" of the putative cancer signatures. This is especially true because there are usually several genetic pathways that a given tissue can take to becoming cancerous: you might only need mutations in a small subset of a long list of genes, which is difficult to spot by sheer data mining. While cancer is very common, each type of cancer is much less so; therefore the paucity of available samples of a given cancer type in a given stage makes reaching statistical significance very difficult. There are some huge projects underway at the moment to collate all cancer labs' samples for meta-analysis, dramatically increasing the statistical power of the studies. A good example of this is the Pancreas Expression Database [pancreasexpression.org], which some pancreatic cancer researchers are getting very excited about.
    • You have to be very careful about what findings at different levels actually mean, and how the various levels correlate.

      For example, when looking at duplications/expansions in cancer, an expansion of a locus results in about a 50% correlation between DNA level change and expression level change. Protein and gene expression levels correlate 50 to 60% of the time (or less, depending on whose data you look at). So, being gracious and assuming a 60% correlation at the two levels, you are already belo
    • A good example of this is the Pancreas Expression Database [pancreasexpression.org], which some pancreatic cancer researchers are getting very excited about.

      Kim Jong-il [yahoo.com] will be ecstatic to hear that. Dear Leader can't very well put the Grim Reaper into political prison....

    • The vast majority of genes only have effects when translated into protein

      That depends on your definition. If you define a gene as "stretch of DNA that is translated into protein," which until fairly recently was the going definition, then of course your statement is tautologically true (replacing "the vast majority of" with "all.") But if you define it as "a stretch of DNA that does something biologically interesting," then it's no longer at all clear. Given the number of regulatory elements not directly

      • Very good points. I think I've been using a very sloppy definition of gene, just a vague idea that it's only DNA>RNA>protein>action or DNA>RNA>action. I've never really got deeply into thinking about regulatory elements, etc. It's compounded by the fact that, while I'm interested in cancer, most of my actual work is with a DNA-based virus that only produces a very few non-translated RNAs that we're aware of. I have a tough time convincing some people that even those are biologically relevant.
  • Buttload of data (Score:2, Interesting)

    by virgil Lante (1382951)
    Illumina's Solexa sequencing produces around 7 TB of data per genome sequencing. It's a feat just to move the data around, let alone analyze it. It's amazing how far sequencing technology has come, but how little our knowledge of biology as a whole has advanced. 'The Cancer Genome' does not exist. No tumor is the same and in cancer, especially solid tumors, no two cells are the same. Sequencing a mishmash of cells from a tumor only gives you the average, which may or may not give any pertinent information a
    • by LokiSteve (557281)
      Working with Illumina in a busy lab turns into a battle of platters faster than most people want to believe.

      The next thing to hit is supposed to be 3D DNA modeling, where the interactions within the genome are mapped. Meaning: if x and y are present, b will be produced, but it will only be active if x is at position #100 or #105; if x is at another position, c will be created instead, etc. It differs from normal mapping because the code AND the position within the code are taken into account, so there are conditions ad
  • The human genome is approximately 3.4 billion base pairs long. There are four bases, so this would correspond to 2 bits of information per base. 2 * 3,400,000,000 /8 /1024 /1024 = 810.6 MiB of data per sequence. That doesn't seem like it'd be too difficult. With a little compression it'd fit on a CD. Now, I suppose each section is sequenced multiple times and you'd want some parity, but it still seems like something that'd easily fit on a DVD (especially if alternate sequences are all diff'd from the f
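The 2-bits-per-base arithmetic above checks out, and the packing itself is only a few lines. A sketch (real formats also need ambiguity codes like N, which don't fit in 2 bits):

```python
GENOME_BASES = 3_400_000_000   # the figure used in the comment above
BITS_PER_BASE = 2              # four bases (A, C, G, T) -> 2 bits each

mib = GENOME_BASES * BITS_PER_BASE / 8 / 1024 / 1024   # ~810.6 MiB

ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack bases into bytes, 4 per byte. Purely illustrative: assumes
    len(seq) is a multiple of 4 and no ambiguity codes appear."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for base in seq[i:i + 4]:
            b = (b << 2) | ENCODE[base]
        out.append(b)
    return bytes(out)
```

So the finished consensus really is DVD-sized; as the replies below note, it's the raw instrument output behind it that is enormous.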
    • Right, but that's the finished product. You start with a ton of fragments that you need to sequence and fit together, and there's overlap, and multiple reads (like 30?) of each sequence, so you end up with much much more data that gets refined down into the end sequence.
    • by rnaiguy (1304181)
      You have to take into account that sequencing machines do not just spit out a pretty string of A, C, T, G. For the older sequencing method, the raw data from the sequencing machine consists of 4 intensity traces (one for each base), so you have to record 4 waves, which are then interpreted (sometimes imperfectly) by software to give you the sequence. The raw data does have to be stored and moved around for some period of time, and often needs to be stored for other analyses. This data is around 200 kilobyte
      • Re: (Score:2, Interesting)

        by izomiac (815208)
        Interesting, I was assuming that it was more of the former method since I hadn't studied the latter. Correct me if I'm wrong, but as I remember it that method involves supplying only one type of fluorescently labeled nucleotide at a time during in vitro DNA replication and measuring the intensity of flashes as nucleotides are added (e.g. brighter flash means two bases were added, even brighter if it's three, etc.). Keeping track of four sensors at 200 bytes per base would imply sensors that could detect 1
        • by RDW (41497)

          'The new method sounds like they're doing a microarray or something and just storing high resolution jpegs. I could see why that would require oodles of image processing power. It does seem like an odd storage format for what's essentially linear data.'

          There's a good summary of the technology here:

          http://seqanswers.com/forums/showthread.php?t=21 [seqanswers.com]

          Millions of short sequences are analysed in a massively parallel way, and you need to take a new high resolution image for every cycle to see which 'polonies' the ne

    • by johannesg (664142)

      So, what's going on here? Are the file formats used to store this data *that* bloated?

      <genome species="human">... ;-)

  • by Anonymous Coward on Monday July 13, 2009 @10:22PM (#28686047)

    Next gen sequencing eats up huge amounts of space. Every run on our Illumina Genome Analyzer II machine takes up 4 terabytes of intermediate data, most of which comes from the something like 100,000+ 20 MB bitmap picture files taken from the flowcells. That much data is an ass load of work to process. Just today I got a little lazy with my Perl programming and let the program go unsupervised... and it ate up 32 GB of RAM and froze up the server. It took Red Hat 3 full hours to decide it had had enough of the swapping and kill the process.

    For people not familiar with current generation sequencing machines, they can scan between 30-80 bp reads and use alignment programs to match up the reads to species databases. The reaction/imaging takes 2 days, prep takes about a week, processing images takes another 2 days, alignment takes about 4. The Illumina machine achieves higher throughput than the ABI ones but gives shorter reads; we get about 4 billion nt per run if we do everything right. Keep in mind, though, that the 4 billion figure is misleading: the read cover distribution is not uniform (i.e. you do not cover every nucleotide of the human's 3 billion nt genome). To ensure 95%+ coverage, you'd have to use 20-40 runs on the Illumina machine... in other words, about 6-10 months of non-stop work to get a reasonable degree of coverage over the entire human genome (at which point you can use programs to "assemble" the reads into a contiguous genome). WashU is very wealthy so they have quite a few of these machines available to work at any given time.
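The gap between per-run throughput and finished coverage can be sanity-checked against the idealized Lander-Waterman model, where reads land uniformly at random and the covered fraction at average depth c is 1 - e^(-c). Under that assumption a handful of runs would suffice; the 20-40 real runs quoted above are a measure of how non-uniform the read distribution actually is:

```python
from math import exp

GENOME_NT = 3_000_000_000   # ~3 billion nt human genome
RUN_YIELD = 4_000_000_000   # ~4 billion nt per run, per the post above

def covered_fraction(runs):
    """Expected fraction of bases sequenced at least once, assuming
    uniformly random read placement (Lander-Waterman idealization)."""
    depth = runs * RUN_YIELD / GENOME_NT
    return 1 - exp(-depth)
```

In this model covered_fraction(3) already exceeds 98%, so the extra runs needed in practice are paying for coverage bias, not raw throughput.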

    The main problem these days is that processing that much data requires a huge amount of computer knowhow (writing software, algorithms, installing software, using other people's poorly documented programs), and a good understanding of statistics and algorithms, especially when it comes to efficiency. Another problem they never mention is artifacts from the chemical protocol; just the other day we found a very unusual anomaly indicating that the first 1/3 of all our reads was absolutely crap (usually only the last few bases are unreliable); it turned out our slight modification of the Illumina protocol to tailor it to studying epigenomic effects had quite large effects on the sequencing reactions later on. Even for good reads, a lot of the bases can be suspect, so you have to do a huge amount of averaging, filtering, and statistical analysis to make sure your results/graphs are accurate.

  • Well, how about pollution, processed food, and all that trash being the main reason we get cancer?

    Cancer was not even a known disease a century ago, because nobody had it. (And if people get cancer now, way before the average age of death a century ago, then it can't be because we now get older.)

    But I guess there is no money in that. Right?

    • wow. read a book. (Score:3, Insightful)

      by CFD339 (795926)

      First, many kinds of cancer were known to exist a century ago. Tumors and growths were not unheard of. Most childhood cancers killed quickly and were undiagnosed as a specific disease other than "wasting away". When the average lifespan was 30-40 years, a great many other cancers were not present because people didn't live long enough to die from them.

      As we cure "other" diseases, cancers become more likely causes of death. Cells fail to divide perfectly, some may go cancerous others simply don't produce as

    • Cancer has been with us throughout recorded history. Ancient Egyptian, Greek, Roman and Chinese doctors described and drew tumours growing on their patients covering a span of about 2000-4000 years ago. There's also archeological evidence of cancers much older than that, e.g. in Bronze age fossils [answers.com].

      Cancer has become more common over the last hundred years or so. A huge part of that is simply the fact that we're living much longer, meaning that the odds of a given person developing cancer are much higher.
  • ...used against me for anything without violating the DMCA. The act of decoding it by some forensics lab paternity test or future insurance company medical cost profile would become unlawful and I'm sure the RIAA would help me with the cost of prosecuting the lawsuit.

  • Check the vending machines !

  • by bogado (25959)

    While a single human genome is a lot of information, storing thousands shouldn't add much to the requirements; one can simply store a diff against the first.
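A minimal sketch of that diff idea (hypothetical storage format; real projects record variants against a reference in standard formats such as VCF, and must also handle insertions and deletions, which this ignores):

```python
def diff_genome(reference, sample):
    """Record a same-length genome as (position, base) substitutions
    against a reference. Since two humans differ at roughly 0.1% of
    positions, the diff is millions of entries instead of billions."""
    assert len(reference) == len(sample)
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]

def rebuild(reference, substitutions):
    """Reapply a substitution list to recover the sampled sequence."""
    seq = list(reference)
    for i, base in substitutions:
        seq[i] = base
    return "".join(seq)
```

Usage: diff_genome("ACGT", "ACGA") yields a single-entry substitution list that rebuild() can apply to recover the sample exactly.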
