Genetic Database Hits One Billion Entries
ChocSnorfler writes to tell us that the Sanger Institute is reporting that their Genetic Record Database has hit one billion entries, making it the world's largest. From the announcement: "The Trace Archive is a store of all the sequence data produced and published by the world scientific community, including the Sanger Institute's own prodigious output as a world-leading genomics institution. To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. The Archive is 22 Terabytes in size and doubling every ten months."
w00t! Opensource genetics! (Score:2, Funny)
For God's sake, don't print it! (Score:5, Funny)
The amount of data here is really enormous. To put it in perspective, if you lined up 7143 blondes, the number of strands of hair present would approximately equal the number of entries in this database.
Re:For God's sake, don't print it! (Score:2)
Now, I wonder when gene blending will happen outside of the morphing software and bedroom. Maybe we can bring tails back...
Re:For God's sake, don't print it! (Score:4, Funny)
At least your name is "BadAnalogyGuy", which gives you a better excuse than the story submitter.
Re:For God's sake, don't print it! (Score:3, Interesting)
By the time you successfully print the 22TB of data, you would no doubt pass the 10 month threshold for the double sized growth. Once you start printing, you'd never stop!
Then again, a new challenge for Epson/HP etc... develop a printer that is robust enough to print a paper Mount Everest!
Re:For God's sake, don't print it! (Score:2)
Hey, that could be an information disinformation campaign band... A modern spin on "Pontius Pilate and the Nail-Drivin' Five" from the 70's.
With enough gene therapy, though, Chaka and the Sleetaks can ALL be transformed into uubersekshoowalls (read: uubersexuals).
Re:For God's sake, don't print it! (Score:2)
I forgot the URL to the picture of Chaka and the Sleetak:
http://www.google.com/search?q=picture+of+Chaka&ie=UTF-8&oe=UTF-8 [google.com]
http://images.google.com/imgres?imgurl=http://www.landofthelost.com/chaka.jpg&imgrefurl=http://www.landofthelost.com/press.htm&h=212&w=214&sz=7&tbnid=HMKY0VGCSicJ:&tbnh=100&tbnw=101&prev=/images%3Fq%3Dchaka%26hl%3Den%26lr%3D&oi=imagesr&start=3 [google.com]
(Mix or Nix some genes and you might get
Re:For God's sake, don't print it! (Score:5, Funny)
Like when I was in grad school, I remember our IT guy was hopping mad because he had to come in on a Sunday to reboot the server because some dumbass decided to print the entire mouse chromosome 22 sequence. Something about a spool file and crashing his server...
Re:For God's sake, don't print it! (Score:4, Interesting)
2 cents,
Queen B
Re:For God's sake, don't print it! (Score:3, Insightful)
If I'm doing the math right, that would put the storage needed at about 25PB eight years out from now (about ten doublings is 1024 times the current needs), which is only 50,000 500GB drives. While certainly quite a lot, if the average hard drive space is even 10GB, times millions of computers just in the US, I think we're set. Seagate probably sells that much storage every week.
I'm
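The doubling arithmetic above is easy to check. A rough sketch, assuming the 22 TB starting size and ten-month doubling period from the announcement, decimal units (1 TB = 1000 GB), and exactly ten doublings (100 months, a little over eight years); the function names are made up for illustration:

```python
import math

# Back-of-the-envelope projection; constants come from the thread
# (22 TB today, doubling every ten months, 500 GB drives).
CURRENT_TB = 22
DOUBLING_MONTHS = 10
DRIVE_GB = 500


def projected_tb(months: float) -> float:
    """Archive size in terabytes after `months`, assuming steady doubling."""
    return CURRENT_TB * 2 ** (months / DOUBLING_MONTHS)


def drives_needed(size_tb: float) -> int:
    """Number of 500 GB drives needed to hold `size_tb`, rounded up."""
    return math.ceil(size_tb * 1000 / DRIVE_GB)


if __name__ == "__main__":
    size = projected_tb(100)  # ten doublings
    print(f"{size:,.0f} TB, about {size / 1000:.1f} PB")
    print(f"{drives_needed(size):,} drives of {DRIVE_GB} GB")
```

Ten doublings of 22 TB come to roughly 22.5 PB, or about 45,000 drives, which is in line with the 50,000-drive estimate.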
Re:For God's sake, don't print it! (Score:2)
Re:For God's sake, don't print it! (Score:2)
Re:For God's sake, don't print it! (Score:2)
Mmmmmm
=)
Re:For God's sake, don't print it! (Score:2)
But I still got a +5 funny
Re:For God's sake, don't print it! (Score:2)
Holy shit, Batman! Just imagine if instead we chiseled it on stone plates! It might go to the moon and back.
"The Archive is 22 Terabytes in size and doubling every ten months."
22 Terabytes, i.e. if written on holodisks (coming: 2007 - 2008) it'll be about 22 of 'em. That would produce a stack about the height of my home scanner.
How many LOCs is that? (Score:2, Funny)
i love meaningless data (Score:5, Funny)
I have twice that much data on my 128k thumbdrive, if printed out in 72 point font size.
Anyone care to translate this into volkswagens, or libraries of congress?
Re:i love meaningless data (Score:2, Funny)
Re:i love meaningless data (Score:4, Funny)
I keep forgetting, how many Volkswagens to the Ferrari?
Re:i love meaningless data (Score:3, Funny)
Re:i love meaningless data (Score:5, Funny)
Re:i love meaningless data (Score:3, Interesting)
So this is roughly the size of the TEXT in the library of congress.
Re:i love meaningless data (Score:5, Funny)
If we take a 1967 Volkswagen to be a measurement of length then it is 1606.01 times larger than a single letter so it would take 9823500.48 Volkswagi to tailgate around the earth. Multiply that by 250 and you get ~ 2.455875x10^9 Volkswagens.
Since it is quite easy to convert Volkswagens to Library of Congresses I won't go into further detail.
Re:i love meaningless data (Score:3, Funny)
I just threw up in my mouth.
Re:i love meaningless data (Score:2)
I just threw up in my mouth.
No, stupid, it's already plural, like Vaxen.
Get it right!
Re:i love meaningless data (Score:2)
Re:i love meaningless data (Score:2)
Hey, you have admitted the true power of the Imperial System here, it's 0.1 inches!
Disclaimer: I'm an EU citizen
Re:i love meaningless data (Score:2)
Re:i love meaningless data (Score:2)
Re:i love meaningless data (Score:2)
Re:i love meaningless data (Score:2)
22TB is nothing. (Score:4, Insightful)
Now the fact that it's all genetic data, that's amazing, considering a human is only ~1GB, so that's 22,000 humans' worth.
Re:22TB is nothing. (Score:2, Interesting)
I guess it depends on what they mean by "genetic data", exactly. If they are including the traces, that's not much.
Re:22TB is nothing. (Score:5, Funny)
Re:22TB is nothing. (Score:4, Funny)
In the meantime, you can still get the genetic layouts of other animals on eDonkey. (groan)
Re:22TB is nothing. (Score:2)
Re:22TB is nothing. (Score:2)
What two consenting adults do on hard disks is none of my concern...
Re:22TB is nothing. (Score:2)
DNA sequence makes up only a small proportion. (Score:4, Informative)
The signal data is composed of peaks and troughs across 4 channels, corresponding to the 4 base types. A peak in a channel corresponds to a base of that type passing in front of the detector. In your typical sampling configuration, a peak is made up of about 12 data pts.
Now, since each sampled point in the signal is stored as a 4 byte int and the base for that peak is stored as a 1 byte char, then you've got basically a 192:1 ratio of technically superfluous signal data to actual DNA sequence.
Since there are yet other pieces of information in the file, this ratio is actually larger.
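For anyone wondering where 192 comes from, here is the arithmetic spelled out. A sketch using only the figures quoted in this comment (12 points per peak, 4 channels, 4-byte samples, 1-byte base call); the constant and function names are illustrative and don't come from any real trace file format:

```python
SAMPLES_PER_PEAK = 12     # data points per peak, per the comment
CHANNELS = 4              # one channel per base type
BYTES_PER_SAMPLE = 4      # each sampled point stored as a 4-byte int
BYTES_PER_BASE_CALL = 1   # each called base stored as a 1-byte char


def signal_bytes_per_base() -> int:
    """Bytes of raw signal recorded for every called base."""
    return SAMPLES_PER_PEAK * CHANNELS * BYTES_PER_SAMPLE


def signal_to_sequence_ratio() -> float:
    """Ratio of signal bytes to sequence bytes, ignoring other file contents."""
    return signal_bytes_per_base() / BYTES_PER_BASE_CALL


if __name__ == "__main__":
    print(f"{signal_bytes_per_base()} signal bytes per base")
    print(f"ratio {signal_to_sequence_ratio():.0f}:1")
```

All four channels are sampled at every point, not just the channel with the peak, which is why the channel count multiplies in.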
Of course, there is a good reason for keeping trace data rather than just the DNA sequences, the notion being that you have more information with which to validate the integrity of what you've done. There have been cases where scientific databases have had their data integrity damaged over time by low-quality (i.e. mistaken) submissions.
In this case, they're retaining the wrong file type, as it doesn't store the original unfiltered data signal, only a heavily filtered and manipulated one. Most modern basecallers start from the original unfiltered data to gain more advantage through better processing; you cannot do this with the file type they are retaining.
Re:22TB is nothing. (Score:2)
size every 10 months; a problem as the rate of increase of hard drive size is
something like once every 18 months. This means the cost of providing this storage
will increase exponentially.
Incidentally, it's not "genetic data"--that is sequence data. It's trace data
which is then interpreted to produce sequence data. So actually, the data storage
requirements for each base take more than 2 bits. Moreover it's redundant (DNA is
seq
Dubious claims (Score:4, Interesting)
Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.
Here's your standard (Score:3, Informative)
And by standard, I mean: whatever MS Office defaults to
Diana Hacker's "A Writer's Reference" says the same thing.
/I'm not a grammar Nazi, I was forced to purchase it many years ago and have kept it handy ever since.
Re:Dubious claims (Score:3, Informative)
It's just meaningless reporter-speak. A stupid attempt to provide context for readers who can't visualise that much data. Of course, I doubt many such readers have a good concept of the circumference of the world or the height of Mt Everest either.
I actually have my master's thesis on a single sheet of A4. I had to use a 1.5 point font to make it fit. You could still read it though.
Re:Dubious claims (Score:5, Insightful)
What's incredibly more lame is that 99% of the Slashdot comments on this article so far are stuck on units of measure. Clearly it's a lot. Instead of debating the length of a piece of string, how about some discussion on how to distribute and analyze so much data. At this point I'd almost welcome some grousing about patents or dumb Google DNA-related theories. We're barely scratching the surface on understanding genetic data. Even finding approximate substring matches within samples is fairly difficult. Here we have the world's biggest crossword puzzle which encodes the secrets of life itself and most of you guys are stuck on the point size of the font.
Re:Dubious claims (Score:2)
It's a moot point anyway. You're never going to be able to open the whole file in Word to begin with.
Whoa!! I thought it said: (Score:1)
If printed out... (Score:5, Funny)
if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest
Did anybody else think "Wow, I've got a great idea for a mural for the space elevator!"
Anybody?
Uh, well, it's late...
--MarkusQ
Re:If printed out... (Score:2)
Torrent? (Score:5, Funny)
A4 paper wouldn't work. (Score:4, Funny)
You can't do that with ordinary A4 paper. You need to reinforce it on the sides at least so it won't tumble over. Plus, I doubt the paper would sit still with the high winds once it gets above a few thousand feet. Sheesh.
Re:A4 paper wouldn't work. (Score:2)
If we're not careful.. (Score:2, Funny)
This enormous archive will devour us all.. ARGHH!
How do they map their function? (Score:1, Interesting)
How do the scientists do that?
They wiggle this gene and see what happens?
How do they go for the "scientific method" of experimentation?
Re:How do they map their function? (Score:5, Informative)
A biological approach to determining the function of a gene is to create a mutation and then observe the effect of the mutation on the organism. This is called a knockout study. While it is not ethical to create knockout mutants in humans, many such mutants are already known, especially those that cause disease. One advantage of having a genome sequence is that it greatly facilitates the identification of genes in which mutations lead to a particular disease.
The mouse, where one can make and characterize knockout mutants, is an excellent model system for studying genetic diseases of humans; its genome is remarkably similar to a human's. Nearly all human genes have homologs in mice, and large regions of the chromosomes are very well conserved between the two species. In fact, human chromosomes can be (figuratively) cut into about 150 pieces, mixed and matched, and then reassembled into the 21 chromosomes of a mouse. Thus, it is possible to create mutants in mice to determine the probable function of the same genes in humans. Genetic stocks of mutant mice have been developed and maintained since the 1940s.
One goal of the mouse genome project is to make and characterize mutations in order to determine the function of every mouse gene. After a particular gene mutation has been linked to a particular disorder, the normal function of the gene may be determined. An example of this approach is the mutated gene that resulted in cleft palates in mice. The researchers found that the gene's normal function is to close the embryo's palate. An understanding of the genetics behind cleft palate in mice may one day be used to help prevent this common birth defect in humans.
Re:How do they map their function? (Score:5, Informative)
In other cases you can create a gene knockout by splicing a random gene into your gene of interest. This causes your target gene to encode a non-functional protein. Then you watch and see what happens to the test subject. In some cases the creature dies because the gene turned out to be extremely important. In others it results in minor to significant impairment. But because of the complexity of most organisms, single-gene knockouts usually don't have too much effect - the creature has multiple pathways that can accomplish the same goal. This is especially true for critical functions like those in the immune system.
Less work, faster results with the database (Score:2)
Rather than go through the entire process you outline, one could avoid a great deal of the wet work by sequencing the protein and then jumping into computer space; searching the genome database for hits.
This assumes your organism of stu
So tired. So very, very tired. Of that. (Score:5, Insightful)
I mean, "printed out as a single line of text, it would stretch around the world more than 250 times" means what, in terms of helping us picture this? I take it that we're not supposed to be able to imagine a billion records, but we can all clearly picture some text wrapped around the planet 250 times? Ah, that's much more helpful!
Now, I just got done re-indexing 10 million records in a database, and I can sort of picture 100 times that much work. This is slashdot! More nerdly examples, please.
Re:So tired. So very, very tired. Of that. (Score:2)
Helpfully, that's precisely how meaningless a milestone one billion sequencer traces is.
Re:So tired. So very, very tired. Of that. (Score:5, Funny)
- It would require 100,000 liters of ink to write down all the 1's and 0's
- It would take 400 years to transmit it over a 14.4 kbps modem
* Requiring about 10 Giga Joules
- If each bit was encoded on a single hydrogen atom, the whole db would weigh about 0.1 mg
- If ones are transmitted as a single (infrared) photon, it would take 0.01 Joules to transmit the whole db
* You could transmit it 100 times with the energy of a mouse trap
- It would require about one year for a million monkeys to type it in (without having to guess)
Re:So tired. So very, very tired. Of that. (Score:2)
See, now that's what I'm talking about. A proper, well-scaled, nerdly example. Except at least half the readers here will say 14.4-whatis-that-now?
Re:So tired. So very, very tired. Of that. (Score:2)
Re:So tired. So very, very tired. Of that. (Score:2)
You do realize that announcements written by the Sanger Institute are not written for Slashdot readers, right?
It's a quote. Deal with it.
Re:So tired. So very, very tired. Of that. (Score:2)
I do. But in some ways, I think my point is even more appropriate for the lay audience. Meaning, again, how is someone supposed to picture text wrapping around the planet 250 times? Isn't that just another way of saying "more than you can really get your head around" anyway? Most analogies like that aren't really helpful to anyone. Is text going around the planet 100 times really a lot less in yo
I will be more impressed... (Score:5, Informative)
Given that a change of just 1 base in 500 of the 16S rRNA gene is sufficient to differentiate between two different species of bacteria, I have to wonder how many of these entries are quasi-redundant. When you consider how many species of bacteria are known to man, that means that there are literally thousands of potential entries for each gene. Unless, of course, they're storing only consensus sequences, which still vary widely between genera.
Sadly, the trend here seems to be more of 'sequence it, upload it, and patent it' instead of 'sequence it, upload it, figure out what it does/makes, do something useful with it'. Knowing the sequence for the Ubiquitin gene is all well and good, but it's of little practical importance. Being able to construct designer proteins to treat illnesses based on that information, however, is a truly worthy goal. Unfortunately, that's also where the 'patent it' part comes into play...
Re:I will be more impressed... (Score:2)
Re:I will be more impressed... (Score:3, Informative)
Re:I will be more impressed... (Score:2)
All of them, pretty much. This is a trace archive. It stores the traces as they
come off the sequencing machine. Given that DNA is normally sequenced to 10x
(five times in both directions), most of the data in this database will be
replicates.
"Sadly, the trend here seems to be more of 'sequence it, upload it, and patent it' instead of 'sequence it, upload it, figure out what it does/makes, do something useful with it'."
Data collection is the bed r
a metric we can use, please? (Score:2)
All well and good, but how many Libraries of Congress does 2.5 Mt Everest / A4 pages equal?
My calculator has no Mt Everest button.
don't use their database (Score:2)
ruby -e 'while 1; print "c a t g".split[(rand 4)]; end'
Just hit control-c when the sequence is long enough to suit you
More Impressive.. (Score:2)
Anybody know what DB Software they're using? (Score:3, Funny)
Do the math (Score:3, Interesting)
1 billion x 1,000 Bytes = ~0.9 Terabytes
Which means, on average, your genetic code can be stored in 22KB.
Just an interesting thought.
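The two numbers in this comment can be reproduced in a few lines. A sketch assuming the 22 TB and one billion entries from the announcement; the "~0.9 Terabytes" only works out if a terabyte is taken as 2^40 bytes, which is an assumption about what the poster meant:

```python
ARCHIVE_BYTES = 22 * 10**12   # 22 TB, decimal units
ENTRIES = 10**9               # one billion entries
TEBIBYTE = 2**40              # binary "terabyte"


def average_entry_bytes() -> float:
    """Mean size of one archive entry in bytes."""
    return ARCHIVE_BYTES / ENTRIES


def billion_kilobyte_entries_in_tib() -> float:
    """Size of one billion 1,000-byte entries, in binary terabytes."""
    return ENTRIES * 1000 / TEBIBYTE


if __name__ == "__main__":
    print(f"average entry: {average_entry_bytes() / 1000:.0f} KB")
    print(f"1 billion x 1,000 bytes = {billion_kilobyte_entries_in_tib():.2f} TiB")
```

Note that the 22 KB is the average size of one trace entry, not of anyone's whole genome.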
Re:Do the math (Score:3, Interesting)
How big compressed? (Score:3, Insightful)
I reckon you could zip it up and it'll fit on a couple of floppy disks.
Re:How big compressed? (Score:2)
Google cache of PDF A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison [google.com].
Standard Units of Measurement... (Score:2)
Football Fields in Length
Mt Everest in Height (even tho the avg person has no idea how tall it really is).
Olympic Sized Swimming Pools in Volume (which again the avg person has no idea)
Number of Chins in a Chinese phonebook (when talking about someone's momma).
Wrong standards (Score:3, Funny)
So what? (Score:4, Funny)
(Oops, did I just admit something bad?)
A lot of data (Score:2, Funny)
Tapping it out in Morse code would take 10000 drummers 5 years!
Expressing it in smoke signals would burn 100 Amazon rain forests!
Putting it in fortune cookies would require flour and sugar with the same approximate mass as the moon!
And sending it in semaphore would require every man, woman and child on the planet to signal nonstop with every flag ever made until the year 2010!
That's a lot of data.
The amazing thing is how SMALL it is. (Score:5, Insightful)
The wonder isn't how BIG the human genome is - the amazing thing is how *TINY* it is.
The human genome is 3 billion base pairs...each base pair is one of only four possibilities - so two bits each. 750 Megabytes...that's one CD-ROM. There is a lot of redundancy in it too - many of those base pairs are never 'expressed' as proteins, many are replicated redundantly dozens of times. So with compression, or even just deleting the junk - you'd get it down to maybe 100 to 200 megs - tops.
I find it utterly amazing that all that complexity is so amazingly compactly encoded.
Yeah - that's a lot of bits of paper - or 600 floppy disks or some other bullshit - but by the standards of modern media, it's MICROSCOPIC.
Announcements like this would do better to explain how LITTLE data this really is - that's the wonder of the thing.
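The two-bits-per-base estimate above can be made concrete. A minimal sketch of that packing, assuming lowercase a/c/g/t input only (no N or other ambiguity codes); the mapping and the function names are arbitrary choices for illustration, not any standard format:

```python
BASE_TO_BITS = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}
BITS_TO_BASE = "acgt"  # index is the 2-bit code


def pack(sequence: str) -> bytes:
    """Pack a string of a/c/g/t into 2 bits per base, four bases per byte."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpack() reads every byte the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def packed_size_bytes(base_pairs: int) -> int:
    """Bytes needed for `base_pairs` bases at 2 bits each, rounded up."""
    return (base_pairs * 2 + 7) // 8


if __name__ == "__main__":
    seq = "ctattggacttggaatcggata"
    packed = pack(seq)
    print(len(seq), "bases ->", len(packed), "bytes")
    print(unpack(packed, len(seq)) == seq)
    print(packed_size_bytes(3_000_000_000), "bytes for 3 billion base pairs")
```

Three billion base pairs at two bits each is 750,000,000 bytes, which is where the 750 Megabytes comes from. The length has to be stored separately because the last byte may be padded.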
Re:The amazing thing is how SMALL it is. (Score:4, Interesting)
On the other hand... (Score:4, Interesting)
I've read the whole thing.. (Score:5, Funny)
ctattggacttggaatcggatattggacacttggaatcggata
Re:I've read the whole thing.. (Score:2)
Re:I've read the whole thing.. (Score:2, Funny)
Man that's disgusting. Please keep your fantasies to yourself.
This could only be.. (Score:3, Funny)
Go FoxPro!
22 Terabytes! Wow! (Score:2)
Doubling every 10 months? I think hard drives are doing that as well, or damn close to it. A few years ago, 22 terabytes sounded like a lot, but these days, not so much. I've got half a terabyte in my server and another half in the other two computers in my home and if I didn't regularly burn stuff to DVD, I would have run out of space a long time ago. Terabytes just aren't what they used to be. Well, they are and they aren't.
in other words... (Score:4, Funny)
Print resolution (Score:2)
Not at a 100 million DPI it won't.
Doubling? (Score:2)
I doubt that. Surely that means by the end of the day it will be:
22 * 2^144 Terabytes = 5*10^44 Terabytes
in size.....I don't even know what you call that!
Re:Doubling? (Score:2)
For bonus points (Score:2)
Animals on a flash stick (Score:2)
Woa.. Just imagine the possibilities.
We won't have to feel guilty for extinct species anymore!
PS.: Anyone wanna join my safari party next weekend?
The word "The" 251 times around the world (Score:2)
While that is crazy, it begs the question, are they thinking in points? 10? 11? 12? 72? Why didn't that say 500 times? 1000 times? a million times?
Is there an rfc for this specification of measurement? Can I order things in 'printed word lengths around the world'?
Can I measure my penis with this?
Does google calculator support this?
I shot the sheriff but I sold the dep
Re:The word "The" 251 times around the world (Score:2)
2 columns (Score:2, Interesting)
Interesting if it's true!
Forget about the printing... (Score:2)
...can you imagine how much it would cost to have it bound?
Really, though, they should come up with a better comparison. "If burned to CD, it would take half as many CDs as AOL sends out in a year".
Metric, not Imperial (Score:1)
Re:733t speak! (Score:2)
And by identifying, I mean name, SSN, Age, race, etc
Re:733t speak! (Score:2)