Genetic Database Hits One Billion Entries

ChocSnorfler writes to tell us that the Sanger Institute is reporting that their Genetic Record Database has hit one billion entries, making it the world's largest. From the announcement: "The Trace Archive is a store of all the sequence data produced and published by the world scientific community, including the Sanger Institute's own prodigious output as a world-leading genomics institution. To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. The Archive is 22 Terabytes in size and doubling every ten months."
  • genetic information of organisms - mice, fish, flies, bacteria and, of course, humans... All the data are freely available to the world scientific community (http://trace.ensembl.org/ [ensembl.org])

    Sweet, now I can finally build myself that fleet of flying super monkeys I've always wanted!
  • by BadAnalogyGuy ( 945258 ) <BadAnalogyGuy@gmail.com> on Tuesday January 17, 2006 @10:43PM (#14496353)
    Some dumbass is always printing 300 pages of documents and hogging the printer. Forchrissakes, just figure out what pages you need and print those! Asshole.

    The amount of data here is really enormous. To put it in perspective, if you lined up 7143 blondes, the number of strands of hair present would approximately equal the number of entries in this database.
  • by Anonymous Coward
    I could make this sentence wrap around the world a zillion times if I used 10^100 point text.
  • by JeanBaptiste ( 537955 ) on Tuesday January 17, 2006 @10:44PM (#14496358)
    "To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. "

    I have twice that much data on my 128k thumbdrive, if printed out in 72 point font size.

    Anyone care to translate this into Volkswagens, or Libraries of Congress?
  • 22TB is nothing. (Score:4, Insightful)

    by Duncan3 ( 10537 ) on Tuesday January 17, 2006 @10:45PM (#14496362) Homepage
    Wow, that's almost 12U of rack space. Oh my *yawn*

    Now, the fact that it's all genetic data is the amazing part, considering a human is only ~1GB -- so that's about 22,000 humans' worth.
    • Re:22TB is nothing. (Score:2, Interesting)

      by Endymion ( 12816 )
      seriously... I've personally added at least that much to NCBI's archive...

      I guess it depends on what they mean by "genetic data", exactly. If they are including the traces, that's not much.
    • by TheSpoom ( 715771 ) * <slashdot&uberm00,net> on Tuesday January 17, 2006 @11:36PM (#14496587) Homepage Journal
      I'm pretty sure storing humans on your hard drive is illegal.
    • Yeah, it's really nothing to be impressed about. I have well over 22TB of porn sitting on my computer.
    • by cerebis ( 560975 ) on Wednesday January 18, 2006 @06:42AM (#14498110)
      As this is a trace archive, it stores not just the DNA sequence (ACGT) but also the signal data produced by the machines used in these experiments, which is used to determine the DNA sequence (or basecall).

      The signal data is composed of peaks and troughs across 4 channels, corresponding to the 4 base types. A peak in a channel corresponds to a base of that type passing in front of the detector. In your typical sampling configuration, a peak is made up of about 12 data points.

      Now, since each sampled point in the signal is stored as a 4-byte int and the base for that peak is stored as a 1-byte char, you've got basically a 192:1 ratio of technically superfluous signal data to actual DNA sequence.

      Since there are yet other pieces of information in the file, this ratio is actually larger.

      Of course, there is a good reason for keeping trace data rather than just the DNA sequences, the notion being that you have more information with which to validate the integrity of what you've done. There have been cases where scientific databases have had their data integrity damaged over time by low-quality (i.e. mistaken) submissions.

      In this case, though, they're retaining the wrong file type, as it doesn't store the original unfiltered signal, only a heavily filtered and manipulated one. Most modern basecallers start from the original unfiltered data to gain an advantage through better processing; you cannot do this with the file type they are retaining.
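
      A quick Ruby sketch of that arithmetic, using the figures quoted above as assumptions rather than a file-format spec:

        # Rough ratio of stored signal data to called bases.
        samples_per_peak = 12   # ~12 sampled points per base peak
        channels         = 4    # one channel per base type (A, C, G, T)
        bytes_per_sample = 4    # each sample stored as a 4-byte int
        bytes_per_base   = 1    # the basecall itself: a 1-byte char

        signal_bytes = samples_per_peak * channels * bytes_per_sample
        puts "#{signal_bytes}:#{bytes_per_base}"   # => "192:1"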

    • It's probably a little bit more than that, as it's managed. Also, it's doubling in size every 10 months, which is a problem: hard drive capacity only doubles about every 18 months, so the cost of providing this storage will increase exponentially.

      Incidentally, it's not "genetic data" -- that is, sequence data. It's trace data, which is then interpreted to produce sequence data. So the storage requirement for each base is actually more than 2 bits. Moreover it's redundant (DNA is seq
  • Dubious claims (Score:4, Interesting)

    by Dr. Photo ( 640363 ) on Tuesday January 17, 2006 @10:45PM (#14496364) Journal
    if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest.

    Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.
    • Here's your standard (Score:3, Informative)

      by TubeSteak ( 669689 )
      I'm gonna assume 12 point, single spaced with inch (or inch and a half) margins is pretty standard fare.

      And by standard, I mean: whatever MS Office defaults to

      Diana Hacker's "A Writer's Reference" says the same thing.

      /I'm not a grammar Nazi, I was forced to purchase it many years ago and have kept it handy ever since.

    • Re:Dubious claims (Score:3, Informative)

      by RedWizzard ( 192002 )

      Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.

      It's just meaningless reporter-speak. A stupid attempt to provide context for readers who can't visualise that much data. Of course, I doubt many such readers have a good concept of the circumference of the world or the height of Mt Everest either.

      I actually have my masters thesis on a single sheet of A4. I had to use a 1.5 point font to make it fit. You could still read it though.

    • Re:Dubious claims (Score:5, Insightful)

      by timeOday ( 582209 ) on Wednesday January 18, 2006 @12:43AM (#14496892)
      Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.
      Let me interpret for you: it's a lot.

      What's even more lame is that 99% of the Slashdot comments on this article so far are stuck on units of measure. Clearly it's a lot. Instead of debating the length of a piece of string, how about some discussion of how to distribute and analyze so much data? At this point I'd almost welcome some grousing about patents or dumb Google DNA-related theories. We're barely scratching the surface of understanding genetic data. Even finding approximate substring matches within samples is fairly difficult. Here we have the world's biggest crossword puzzle, one which encodes the secrets of life itself, and most of you guys are stuck on the point size of the font.
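
      On the approximate-matching point: even the standard dynamic-programming approach costs pattern length times text length, which is exactly why it hurts at this scale. A toy Ruby sketch with made-up sequences, just to make that concrete:

        # Best edit distance of a short pattern against any substring of a
        # longer sequence (Sellers' algorithm). Work grows with
        # pattern.length * text.length -- painful across terabytes.
        def best_match_distance(pattern, text)
          m = pattern.length
          prev = (0..m).to_a          # column for an empty text prefix
          best = m
          text.each_char do |c|
            curr = [0]                # a match may start at any position
            (1..m).each do |i|
              cost = pattern[i - 1] == c ? 0 : 1
              curr << [prev[i - 1] + cost, prev[i] + 1, curr[i - 1] + 1].min
            end
            best = curr[m] if curr[m] < best
            prev = curr
          end
          best
        end

        puts best_match_distance("gattaca", "ccgattacata")   # => 0 (exact hit)
        puts best_match_distance("gattaca", "ccgataacata")   # => 1 (one mutation)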

    • Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.

      It's a moot point anyway. You're never going to be able to open the whole file in Word to begin with.
  • "...every ten MINUTES." Imagine we'd look like the Ferengi with loads of teeth and slick heads.
  • by MarkusQ ( 450076 ) on Tuesday January 17, 2006 @10:48PM (#14496375) Journal

    if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest

    Did anybody else think "Wow, I've got a great idea for a mural for the space elevator!"

    Anybody?

    Uh, well, it's late...

    --MarkusQ

    • Oh wow. I just realized...you could do a lot of cool things with that. How about a reproduction of all the major written works in civilization, e.g., from Epic of Gilgamesh and the Vedas through the Bible through the Divine Comedy through Paradise Lost to the Lord of the Rings and other modern texts? Print them in their entirety, in standard font, in order from the oldest at the surface to the most recent at the top. It would be impressively symbolic.
  • Torrent? (Score:5, Funny)

    by mendaliv ( 898932 ) on Tuesday January 17, 2006 @10:48PM (#14496379)
    Would somebody please torrent it?
  • The Archive is 22 Terabytes in size and doubling every ten months

    This enormous archive will devour us all.. ARGHH!
  • This is a real question...

    How do the scientists do that?

    Do they wiggle a gene and see what happens?
    How do they apply the "scientific method" of experimentation?

    • by AlanKilian ( 887556 ) on Tuesday January 17, 2006 @11:04PM (#14496453)
      From: http://www.learner.org/channel/courses/biology/textbook/genom/genom_7.html [learner.org]

      A biological approach to determining the function of a gene is to create a mutation and then observe the effect of the mutation on the organism. This is called a knockout study. While it is not ethical to create knockout mutants in humans, many such mutants are already known, especially those that cause disease. One advantage of having a genome sequence is that it greatly facilitates the identification of genes in which mutations lead to a particular disease.

      The mouse, where one can make and characterize knockout mutants, is an excellent model system for studying genetic diseases of humans; its genome is remarkably similar to a human's. Nearly all human genes have homologs in mice, and large regions of the chromosomes are very well conserved between the two species. In fact, human chromosomes can be (figuratively) cut into about 150 pieces, mixed and matched, and then reassembled into the 21 chromosomes of a mouse. Thus, it is possible to create mutants in mice to determine the probable function of the same genes in humans. Genetic stocks of mutant mice have been developed and maintained since the 1940s.

      One goal of the mouse genome project is to make and characterize mutations in order to determine the function of every mouse gene. After a particular gene mutation has been linked to a particular disorder, the normal function of the gene may be determined. An example of this approach is the mutated gene that resulted in cleft palates in mice. The researchers found that the gene's normal function is to close the embryo's palate. An understanding of the genetics behind cleft palate in mice may one day be used to help prevent this common birth defect in humans.
    • by Stachybotris ( 936861 ) on Tuesday January 17, 2006 @11:11PM (#14496477)
      In most cases they work backwards. You start with a known protein, determine its amino acid sequence, and then convert that into the most likely DNA sequence (accounting for codon bias). Primers/probes are then generated for the 3' and 5' ends of the probable DNA sequence. If you're working with a small genome like that of a bacterium, you can perform a restriction digest to get random hunks of chromosome. These are then amplified via PCR using your designer primers. The final product is then sequenced.

      In other cases you can create a gene knockout by splicing a random gene into your gene of interest. This causes your target gene to encode a non-functional protein. Then you watch and see what happens to the test subject. In some cases the creature dies because the gene turned out to be extremely important. In others it results in minor to significant impairment. But because of the complexity of most organisms, single-gene knockouts usually don't have too much effect - the creature has multiple pathways that can accomplish the same goal. This is especially true for critical functions like those in the immune system.
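
      To make the first step concrete, here's a toy Ruby sketch of reverse translation; the codon table is a tiny illustrative subset, and the "likely" picks stand in for real organism-specific codon-bias data:

        # Toy reverse-translation: one plausible codon per amino acid.
        # Illustrative subset only -- a real tool would use the full codon
        # table weighted by the target organism's codon usage.
        LIKELY_CODON = {
          "M" => "ATG",  # methionine (start)
          "K" => "AAA",  # lysine
          "G" => "GGC",  # glycine
          "W" => "TGG",  # tryptophan
        }

        def reverse_translate(peptide)
          peptide.chars.map { |aa| LIKELY_CODON.fetch(aa) }.join
        end

        puts reverse_translate("MKGW")   # => "ATGAAAGGCTGG"
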
      • Hang on, hang on: you're detailing how to find a gene in a genome by direct experiment (something you do when that's all you've got to work with), when this article is talking about genomic databases, where bioinformatics should consequently be used to a greater extent.

        Rather than go through the entire process you outline, one could avoid a great deal of the wet work by sequencing the protein and then jumping into computer space, searching the genome database for hits.

        This assumes your organism of stu

  • by ScentCone ( 795499 ) on Tuesday January 17, 2006 @10:54PM (#14496400)
    If we stacked up all of the useless length metaphors/comparisons from end to end, they'd still add up to a non-useful mental image of a billion genetic records.

    I mean, "printed out as a single line of text, it would stretch around the world more than 250 times" means what, in terms of helping us picture this? I take it that we're not supposed to be able to imagine a billion records, but we can all clearly picture some text wrapped around the planet 250 times? Ah, that's much more helpful!

    Now, I just got done re-indexing 10 million records in a database, and I can sort of picture 100 times that much work. This is slashdot! More nerdly examples, please.
    • If we stacked up all of the useless length metaphors/comparisons from end to end, they'd still add up to a non-useful mental image of a billion genetic records.

      Helpfully, that's precisely how meaningless a milestone one billion sequencer traces is.

    • by jmv ( 93421 ) on Tuesday January 17, 2006 @11:38PM (#14496597) Homepage
      More nerdly examples, please.

      - It would require 100,000 liters of ink to write down all the 1's and 0's
      - It would take 400 years to transmit it over a 14.4 kbps modem
          * Requiring about 10 Giga Joules
      - If each bit were encoded on a single hydrogen atom, the whole db would weigh about 0.1 mg
      - If ones are transmitted as a single (infrared) photon, it would take 0.01 Joules to transmit the whole db
          * You could transmit it 100 times with the energy of a mouse trap
      - It would require about one year for a million monkeys to type it in (without having to guess)
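
      The modem figure, at least, is easy to sanity-check in Ruby (assuming 22 TB and an ideal 14.4 kbps line):

        # 22 terabytes pushed through a 14.4 kbps modem
        bits    = 22e12 * 8
        seconds = bits / 14_400
        puts seconds / (3600 * 24 * 365.25)   # => ~387 years, i.e. "about 400"
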
    • This is slashdot! More nerdly examples, please.

      You do realize that announcements written by the Sanger Institute are not written for Slashdot readers, right?

      It's a quote. Deal with it.

      • You do realize that announcements written by the Sanger Institute are not written for Slashdot readers, right?

        I do. But in some ways, I think my point is even more appropriate for the lay audience. Meaning, again, how is someone supposed to picture text wrapping around the planet 250 times? Isn't that just another way of saying "more than you can really get your head around" anyway? Most analogies like that aren't really helpful to anyone. Is text going around the planet 100 times really a lot less in yo
  • by Stachybotris ( 936861 ) on Tuesday January 17, 2006 @11:00PM (#14496433)
    When we figure out what all of that does. For every organism as complex as or more complex than your average bacterium, there's a large amount of what amounts to filler DNA. Viruses don't have this problem; few of them are large enough to even get by without overlapping reading frames. If you shrink this dataset down to only sequences that encode functional proteins (read: genes), there's still an insane amount of information. If you then remove the introns, the dataset gets even smaller. But of course, we don't really know whether the introns and intergenic regions of DNA (the so-called 'junk DNA') have functions (or how many they have), although some do act as regulators of transcription.

    Given that a change of just 1 base in 500 of the 16S rRNA gene is sufficient to differentiate between two different species of bacteria, I have to wonder how many of these entries are quasi-redundant. When you consider how many species of bacteria are known to man, that means that there are literally thousands of potential entries for each gene. Unless, of course, they're storing only consensus sequences, which still vary widely between genera.

    Sadly, the trend here seems to be more of 'sequence it, upload it, and patent it' instead of 'sequence it, upload it, figure out what it does/makes, do something useful with it'. Knowing the sequence for the Ubiquitin gene is all well and good, but it's of little practical importance. Being able to construct designer proteins to treat illnesses based on that information, however, is a truly worthy goal. Unfortunately, that's also where the 'patent it' part comes into play...
    • When we can do a full quantum electromagnetic simulation of even a square micron of space at 1/100th of real time or better, then we'll have no trouble figuring out how this stuff works. Either that, or three-dimensional microscopic scanning technology (or a combination of both).
    • First of all, I want to point out that "junk DNA" has proven to be a very bad way of thinking about introns and other untranslated regions (like UTRs [untranslated regions around protein-coding regions]: DNA which is not used to create proteins [in the regular way] via mRNA (messenger RNA) and then translated to protein). Most scientists will agree nowadays that there is _a lot_ of information in these non-exonic regions, the most prominent example to date being microRNA - small RNA pieces from i
    • "I have to wonder how many of these entries are quasi-redundant."

      All of them, pretty much. This is a trace archive. It stores the traces as they come off the sequencing machine. Given that DNA is normally sequenced to 10x coverage (five times in both directions), most of the data in this database will be replicates.

      "Sadly, the trend here seems to be more of 'sequence it, upload it, and patent it' instead of 'sequence it, upload it, figure out what it does/makes, do something useful with it'."

      Data collection is the bed r
  • All well and good, but how many Libraries of Congress does 2.5 Mt Everest / A4 pages equal?

    My calculator has no Mt Everest button.

  • use my sequence generator:

    ruby -e 'while 1; print "c a t g".split[(rand 4)]; end'

    Just hit control-c when the sequence is long enough to suit you
  • Pfft. I would be more impressed if it was all running on MSDE.
  • by Vorondil28 ( 864578 ) on Tuesday January 17, 2006 @11:19PM (#14496513) Journal
    Something tells me a 22TB MS Access table just wouldn't cut it. :-P
  • Do the math (Score:3, Interesting)

    by Kickboy12 ( 913888 ) on Tuesday January 17, 2006 @11:28PM (#14496552) Homepage
    1 billion entries = ~22 Terabytes
    22 Terabytes / 1 billion entries = ~22 KB each

    Which means, on average, your genetic code can be stored in 22KB.

    Just an interesting thought.
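
    The division behind that, in Ruby (a rough figure that ignores per-entry overhead):

      # 22 TB spread across one billion entries
      puts 22e12 / 1e9   # => 22000.0 bytes, i.e. ~22 KB per entry
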
    • Re:Do the math (Score:3, Interesting)

      by Wabin ( 600045 )
      except that each entry is not an individual. It is a trace from a sequencing rig, usually. Which means that it is usually 500-1000 bases of sequence (with a bunch of other info there as well... it is not just the As Ts Gs and Cs, but also sequence quality and such). The human genome is roughly 3 billion bases. So they have the equivalent of say 200x the genome of an individual. Of course, the data they have is probably much more concentrated on some areas, where they have thousands of traces, and other a
  • by Mr_Tulip ( 639140 ) on Tuesday January 17, 2006 @11:30PM (#14496559) Homepage
    I mean, most of that data is just redundant pairs of A-G C-T T-G etc...

    I reckon you could zip it up and it'll fit on a couple of floppy disks.
  • As reported on /. the standard units of measurement are:

    Football Fields in Length
    Mt Everest in Height (even tho the avg person has no idea how tall it really is).
    Olympic Sized Swimming Pools in Volume (which again the avg person has no idea)
    Number of Chins in a Chinese phonebook (when talking about someone's momma).

  • by sehryan ( 412731 ) on Tuesday January 17, 2006 @11:47PM (#14496632)
    These people are obviously not aware that the standard unit of measurements for the press is Rhode Island and Texas. Without phrasing it in these units, I have no idea how much data that really is.
  • So what? (Score:4, Funny)

    by Anon.Pedant ( 892943 ) on Tuesday January 17, 2006 @11:54PM (#14496654)
    I'm not impressed. I already have genetic material all over my computer.

    (Oops, did I just admit something bad?)

  • by Anonymous Coward
    Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest.

    Tapping it out in Morse code would take 10,000 drummers 5 years!

    Expressing it in smoke signals would burn 100 amazon rain forests!

    Putting it in fortune cookies would require flour and sugar with the same approximate mass as the moon!

    And sending it in semaphore would require every man, woman and child on the planet to signal nonstop with every flag ever made until the year 2010!

    That's a lot of data.
  • by sbaker ( 47485 ) * on Wednesday January 18, 2006 @12:08AM (#14496716) Homepage
    All this hype about how vastly much paper you get if you print it all out misses the wonder of the thing.

    The wonder isn't how BIG the human genome is - the amazing thing is how *TINY* it is.

    The human genome is 3 billion base pairs...each base pair is one of only four possibilities - so two bits each. 750 Megabytes...that's one CD-ROM. There is a lot of redundancy in it too - many of those base pairs are never 'expressed' as proteins, many are replicated redundantly dozens of times. So with compression, or even just deleting the junk - you'd get it down to maybe 100 to 200 megs - tops.

    I find it utterly amazing that all that complexity is so amazingly compactly encoded.

    Yeah - that's a lot of bits of paper - or 600 floppy disks or some other bullshit - but by the standards of modern media, it's MICROSCOPIC.

    Announcements like this would do better to explain how LITTLE data this really is - that's the wonder of the thing.
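
    The arithmetic, in Ruby, for anyone who wants to check it:

      # 3 billion base pairs at 2 bits per base
      bytes = 3e9 * 2 / 8
      puts bytes / 2**20   # => ~715 MiB, i.e. the ~750 MB quoted above
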
    • by The Step Child ( 216708 ) on Wednesday January 18, 2006 @01:06AM (#14496998) Homepage
      Just as amazing is that there are only about 25,000 protein coding genes in the entire human genome (though obviously there are more proteins possible through splicing and post-translational modification, but I digress). Also amazing is the precision in which the chromosomes wind up all that DNA. Imagine taking a piece of yarn miles and miles long and compacting it into something that could fit into a paper bag - now imagine someone asking you to take out a VERY specific piece of that yarn and exposing it from your roll, disturbing the rest of the yarn as little as possible, then putting it back exactly as it was before when they're finished with it...that's basically what each chromosome has to do when genes are expressed. And it's all mediated by proteins coded in that very DNA.
  • On the other hand... (Score:4, Interesting)

    by Chris Snook ( 872473 ) on Wednesday January 18, 2006 @12:10AM (#14496729)
    ...the entire database would fit on just one sheet of A(-24) paper. (Yes, I actually did the math.)
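
    It does check out if you start from the Everest comparison above (assuming ~0.1 mm paper, A4 = 1/16 m^2, and the A-series doubling in area with each step up from A0 = 1 m^2):

      # Total area of the quoted A4 stack vs. one sheet of A(-24)
      sheets     = (2.5 * 8848) / 0.0001   # 2.5 x Everest tall, 0.1 mm/sheet
      stack_area = sheets * (1.0 / 16)     # A4 is 1/16 m^2  => ~1.4e7 m^2
      a_minus_24 = 2.0 ** 24               # A(-24) is 2^24 m^2 => ~1.7e7 m^2
      puts stack_area <= a_minus_24        # => true: it fits (barely)
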
  • by tinrobot ( 314936 ) on Wednesday January 18, 2006 @12:27AM (#14496793)
    I won't give away the ending, but my favorite part is:

    ctattggacttggaatcggatattggacacttggaatcggata

  • by musakko ( 739094 ) on Wednesday January 18, 2006 @12:27AM (#14496798)
    The Archive is 22 Terabytes in size and doubling every ten months.

    Go FoxPro!

  • The Archive is 22 Terabytes in size and doubling every ten months.

    Doubling every 10 months? I think hard drives are doing that as well, or damn close to it. A few years ago, 22 terabytes sounded like a lot, but these days, not so much. I've got half a terabyte in my server and another half in the other two computers in my home and if I didn't regularly burn stuff to DVD, I would have run out of space a long time ago. Terabytes just aren't what they used to be. Well, they are and they aren't.
  • by avi33 ( 116048 ) on Wednesday January 18, 2006 @01:48AM (#14497168) Homepage
    All your base (pairs) belong to us.
  • To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times.

    Not at a 100 million DPI it won't.
  • The Archive is 22 Terabytes in size and doubling every ten months.

    I doubt that. Surely that means by the end of the day it will be:

    22 * 2^144 Terabytes = 5*10^44 Terabytes

    in size.....I don't even know what you call that!

  • Now, if you want to do something really cool with that database, you'd BLAST it against itself using no repeat masking. Or just BLAST it against the repeats database :-)
  • If those are the full sequences, and biotechnology evolves enough that we can rebuild the full sequence from digital data...

    Whoa... Just imagine the possibilities.

    We won't have to feel guilty about extinct species anymore!

    P.S.: Anyone wanna join my safari party next weekend?
  • The word "The" printed out as a single line could strretch around the world two hundreds and fifty one times, given a sufficiently large font.

    While that is crazy, it begs the question, are they thinking in points? 10? 11? 12? 72? Why didn't that say 500 times? 1000 times? a million times?

    Is there an rfc for this specification of measurement? Can I order things in 'printed word lengths around the world'?

    Can I measure my penis with this?

    Does google calculator support this?

    I shot the sheriff but I sold the dep
  • 2 columns (Score:2, Interesting)

    by Narc ( 126887 )
    I can't confirm this, maybe someone can tho. I had an Oracle training course last year and the instructor told us she had someone from Sanger working on the human genome stuff, and their database was something daft like 2 columns wide. It was used as an example to explain the intricacies of hot backups and such...

    Interesting if it's true!

  • ...can you imagine how much it would cost to have it bound?

    Really, though, they should come up with a better comparison. "If burned to CD, it would take half as many CDs as AOL sends out in a year".
