Science

Gene Mappers May Have Missed Half The Genes

Nepre writes: "Forbes.com is running a story about new research that suggests that the Human Genome Project may have missed tens of thousands of genes in the race to map the human genome. This is interesting given the intense competition between commercial and academic research. As my grandmother used to say, "The faster you go, the behinder you get!""
This discussion has been archived. No new comments can be posted.

  • I wonder if any of these researchers used information theory and entropy measurement to try and detect unknown gene sequences.

    One would imagine that the entropy of a DNA sequence coding for a protein would be lower than that of random sequences.
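As a rough illustration of the entropy idea in the comment above, here is a minimal Python sketch that measures the Shannon entropy of the k-mer distribution in sliding windows along a sequence. The function names, window size, and step are assumptions chosen for illustration, not anything the researchers are known to have used, and a low value does not by itself mark a coding region (simple repeats score low too).

```python
import math
from collections import Counter


def kmer_entropy(seq: str, k: int = 3) -> float:
    """Shannon entropy, in bits, of the k-mer distribution within seq."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def windowed_entropy(seq: str, window: int = 60, step: int = 30, k: int = 3):
    """Return a list of (window_start, entropy) pairs along seq."""
    last_start = max(len(seq) - window, 0)
    return [
        (start, kmer_entropy(seq[start:start + window], k))
        for start in range(0, last_start + 1, step)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequences: a single-base repeat versus a mixed sequence.
    repeat = "A" * 120
    mixed = "ACGT" * 30
    print(windowed_entropy(repeat, k=1))
    print(windowed_entropy(mixed, k=1))
```

Windows whose entropy stands out from the genomic background would only be candidates for a closer look, not gene calls.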

    • IANAMB (I'm not a molecular biologist) but I'm not sure that the introns (unused part of the DNA) are really random either. I thought they were leftovers from archaic genes and maybe genes transmitted from viruses a long time ago.
      • by jdiggans ( 61449 )
        IAB (I'm a bioinformaticist). You're partly correct. Introns (the 'junk' in between the exonic regions in DNA and freshly transcribed mRNA) do tend towards non-random sequence. You can use a variety of metrics to make guesses as to where introns and exons begin and end within a gene's coding region, based on sequence entropy, on GC/AT frequency, on neural nets or hidden Markov models trained on known examples, etc.

        These metrics, however, are only useful once one knows something about where a 'gene' starts and ends. The real problem here is that some of the assumptions we've made historically about gene structure have potentially led us astray. Yes, the chromosomes are full of junk DNA, but no, it's nothing near random for the most part and is full of 'repetitive' elements (short segments that repeat endlessly; query GenBank [nih.gov] for 'ALU Repeat' and see how many sequences you find) that make any sort of pattern matching a tough sell genome-wide. There are also plenty of 'pseudogenes' interspersed throughout the genome, leftovers from a bygone era. Which of these pseudogenes might actually still BE transcribed is a question that only mRNA expression analysis can answer. Hopkins is definitely on the right track w/ something like SAGE (though it's not exactly high-throughput, hence our man's need for extrapolation to genome-wide numbers).

        The paper should be an interesting read to say the least.
        -j
    • by danat ( 325541 )
      Of course they did.

      Gene prediction applications use many statistical and computational methods to predict where a gene might be hiding.

      Information content and entropy measurement are used widely in many of them.
      For example:
      The basic tool for comparing how much one sequence resembles another (by sequence alignment) uses substitution matrices: matrices that give a score to every possible substitution of one unit (a nucleotide for DNA/RNA, or an amino acid in the case of proteins) according to the likelihood that the substitution is meaningful rather than pure chance. The most commonly used matrix (actually a whole family of them) is called BLOSUM, and information content is used at one stage of its construction.

      However, this is not the news here.
      What the researchers did was combine the prediction from the DNA sequence itself with results from expression data, meaning the results of experiments that measure whether a subsequence of DNA is transcribed to RNA.
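To make the substitution-matrix idea above concrete, here is a minimal Python sketch of how log-odds scores can be derived from counts of aligned pairs, which is the general approach behind BLOSUM-style matrices. The two-letter alphabet (R, Y), the pair counts, and the function names are all made up for illustration; real BLOSUM matrices are built from curated blocks of aligned protein sequences, and every count is assumed to be positive here.

```python
import math


def log_odds_scores(pair_counts, scale=2.0):
    """Rounded log-odds scores from unordered aligned-pair counts.

    pair_counts maps a sorted tuple such as ("R", "Y") to how often that
    pair was seen in aligned columns. All counts are assumed positive.
    """
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each letter, derived from the pair frequencies.
    p = {}
    for (a, b), freq in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + freq
        else:
            p[a] = p.get(a, 0.0) + freq / 2
            p[b] = p.get(b, 0.0) + freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Frequency expected if the two letters were paired purely by chance.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores


def score_alignment(x, y, scores):
    """Sum column scores of two equal-length, ungapped aligned sequences."""
    return sum(scores[tuple(sorted((a, b)))] for a, b in zip(x, y))


if __name__ == "__main__":
    # Hypothetical counts: matches are seen more often than chance predicts.
    counts = {("R", "R"): 40, ("R", "Y"): 20, ("Y", "Y"): 40}
    scores = log_odds_scores(counts)
    print(scores)
    print(score_alignment("RRY", "RYY", scores))
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative ones, which is what lets an alignment score separate meaningful similarity from coincidence.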
  • by Ieshan ( 409693 ) <ieshan@@@gmail...com> on Monday February 25, 2002 @12:56PM (#3065475) Homepage Journal
    Before we put too much stock in his findings, I'm rather concerned by some of the statements this guy made.

    "If the mouse and human genomes were so similar, we would be mice," says Shoemaker.

    Well, Mr. Shoemaker, to be quite honest, we're not that far off, evolutionarily speaking. We share the same classification as mammals, have hundreds of bodily functions that are nothing short of the same, share very complex behavioral patterns, and study the guys in an attempt to find out how our own brains work (go ask a research neurologist). If the man who says this is the director of anything, we need to push him off of his pedestal and teach him some biology.

    "Before you count genes, you really need to define what a gene is," says Daniel Shoemaker.

    Basically, it seems like the guy is "trolling". "Nuh uh, Taco!", he's saying. "The Theory of Graviity must be wrong because you mis-spelled gravity!" Really, he's saying that people are wrong, and then saying he's right, and then saying that the criteria he's used to make this sort of judgement doesn't exist in the first place.

    No definition for a gene? "A unit of heredity. The unit of genetic function which carries the information for a single polypeptide."
    • How to Count Genes (Score:2, Informative)

      by danat ( 325541 )
      The question of what is the definition of a gene is not that trivial.

      for example in the definition you quote:

      " No definition for a gene? "A unit of heredity. The unit of genetic function which carries the information for a single polypeptide." "

      The two sentences are not equal definitions.
      There are RNA sequences that are units of heredity and play an important role in defining our genetic makeup but are not translated into a polypeptide. Does the DNA that serves as their template count as a gene? It depends on the exact definition.

      It is widely accepted now that those sequences, including transcription factors, do count, but there are other difficulties in counting.

      For example the issue of alternative splicing.
      When you take many of the human proteins, you will find that the DNA that codes for that protein is divided into more than one segment. It can be divided into many smaller segments (exons) separated by untranslated segments (introns).
      The introns are cut out and the exons joined together before the protein is made, a process called splicing.
      Now, to complicate things, different proteins often result from the same or almost the same set of segments through different editing. There is a question of how to count those variants.

      One of the reasons that the estimated number of genes was probably lower than the actual number is this phenomenon of alternative splicing, which is much more frequent than first assumed.
    • If you think a gene can be so simply defined, you don't know enough biology. I'm in the middle of writing a book about the sequencing of C. elegans, so I do know a little about these complexities.

      One point is that a single stretch of DNA can be transcribed and spliced in several different ways by the RNA machinery in the cell to specify several different proteins. How many genes do you then count?

      Another is that you can't (of course) simply read down the sequence and say "there's a gene" even if you can say "there's a repetitive sequence that looks pretty much like junk". There's no algorithm for looking at a sequence of DNA and saying "this is a gene and it produces that protein."

      Anyone seriously interested in checking out the real complexities of the various definitions of "gene" should check out Evelyn Fox Keller's little book "The century of the gene" [amazon.com]. There are lots of working scientists who'd say her distinctions don't matter too much in practice. But there are none who'd disagree with her facts.

      And I know at least one very, very smart and experienced geneticist (Sydney Brenner) who is convinced that there are at least 60,000 human genes in the genome. This just isn't a question that can be solved either from first principles or by any quick and simple form of counting.
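To illustrate why simply reading down the sequence falls short, here is a minimal Python sketch of the most naive approach: scanning the forward strand for open reading frames (an ATG followed in-frame by a stop codon). The function name, the `min_codons` threshold, and the toy sequence are assumptions for illustration only. A scan like this ignores introns, the reverse strand, and regulatory signals, so its hits are at best hypotheses and not gene calls.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end) spans of ATG...stop stretches on the forward strand.

    All three reading frames are scanned; end is exclusive and includes the
    stop codon. min_codons counts the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGCC"  # hypothetical sequence, not real genomic data
    for start, end in find_orfs(toy):
        print(start, end, toy[start:end])
```

In a genome where coding exons are short and separated by long introns, most real genes never show up as one uninterrupted reading frame, which is one reason prediction has to be combined with expression evidence.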
  • by Anonymous Coward
    I get the impression that this guy predicted there would be around 80,000 genes, and after the mappers showed there were many fewer, he decided to say that the mappers had their definition of gene wrong, or that they missed the genes. He's not exactly clear on that, except to reiterate that based on indirect evidence, he believes there are 80,000 genes, if you define them properly.

    The article says the 30,000 figure is close to a worm or a fruit fly. There's a better list here [indiana.edu], which lists the gene counts for humans, mice, worms, etc. With their disclaimer that these counts are not yet complete, it seems that humans have 46,000 genes in this database, compared with 22,300 for a worm, 24,900 for a fruit fly, and 39,156 for a mouse. Exactly why this should be so unreasonable is beyond me. Maybe if we define the gene so that humans have 80,000, then fruit flies will have 60,000! After all, how can you draw a comparison if you pretend not to know the definition, eh?
  • Your grandmother wasn't by any chance the Red Queen [everything2.com], was she?

  • by idgrad ( 137342 ) on Monday February 25, 2002 @02:57PM (#3066202)
    An earlier comment hit the nail on the head: I'm quite sure Mr. Shoemaker sold 80,000 genes to a biotech/pharmaceutical company, and now has to explain why he doesn't owe them half their money back (what a funny conversation that would be to listen to).

    What many people who are trying to sell genomics don't want to address is that the differences between a mouse and a human are likely not the result of there being more genes in humans, but rather a difference in regulation of (approximately) the same number of genes. That is to say, there are likely differences in the promoter (on switch) and repressor (off switch) portions of these genes that cause one to be active in a certain situation in the human but not in the mouse. A simple analogy demonstrates the difference: you can have two similar cars with similar horsepower, number of tires, gears, etc., but if you put an old grandmother in one and a Formula 1 driver in the other and watched them drive on the highway, you might make the mistake of thinking one car had more power, a larger engine (genes), than the other, when in fact the difference between the two is due to control of the same equipment (gas = promoter and brake = repressor elements of genes). Further analysis of the control regions of genes, as well as differences in protein-protein interactions (proteomics), will likely explain the differences between a human and a mouse, not 50,000 as-yet-undiscovered genes.
    • (* An earlier comment hit the nail on the head: I'm quite sure Mr. Shoemaker sold 80,000 genes to a biotech/pharmaceutical company, and now has to explain why he doesn't owe them half their money back .....what a funny conversation that would be to listen to *)

      Almost as much fun as listening to people bicker about the meaning of the word "integration". The gov was utterly stupid to let Microsoft slip in such a wiggle word into the consent decree without better defining/limiting it.

    • Agreed. Euchromatin/heterochromatin transcription regulation is another factor that could be responsible for morphological as well as metabolic differences. The availability of a gene then depends on the structure of the DNA itself.
  • Old news... (Score:2, Insightful)

    by Mercaptan ( 257186 )
    For those in the biological sciences who do this kind of stuff, this is pretty old news. It was pretty obvious when they "announced" the completion that they had barely finished anything. Annotating the sequence information to include meaningful protein expression and gene regulation data will take many more years. It has come to light that the disparity between the Human Genome Project's version of the genome and Celera's version of the genome is pretty wide. This means that both sides missed something. Unfortunately, the political push to announce overpowered fact.

    While by the present count, there seem to be too few genes, there are many other mechanisms to be explored, including alternative splicing, where the previously inviolable gene unit gets reorganized to generate different proteins, as well as ways to discover more genes. I have no doubt that the dynamics of genetic expression are far more complex than we know now.

    DNA may be a linear string of data, but the interactions it makes with enzymes, cellular structures, and other molecules are massively parallel.
  • My research team uses a high throughput mouse model system to identify mouse oncogenes and their human orthologues. We are fortunate enough to have access to both the public and Celera mouse/human genomes and have made detailed gene structure comparisons between hundreds of mouse and human genes and synteny comparisons between thousands of genes. Gene regulation and alternative splicing theories aside (both valid), there are certainly enough structural differences between mouse and human genes to obviate the need to invoke "missing genes". A better argument would be to focus on the known limitations of gene finding algorithms and the imperfections in EST and genome assemblies.

    Are there missing genes in the public and private databases? Certainly. Until there is a full length clone, each gene call is a hypothesis. Ongoing curation will continue to sort out both false negatives and false positives.

    CVBIG! [cvbig.org]
  • Original estimates of the number of human genes were around 80,000. As the genome was sequenced, this estimate kept dropping, eventually reaching the currently accepted number of 30,000.

    Genes contain introns and exons. The introns are discarded, but the exons can be discarded or included in the final protein. So one gene can code for two similar but functionally different proteins.

    e.g. (numbers correspond to exons)

    gene: 1 - 2 - 3 - 4 - 5
    protein 1: 1 - 2 - 4 - 5
    protein 2: 2 - 3 - 4 - 5

    So in this case exon 1 and exon 3 could be different domains that completely change the functionality of the protein.
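The exon example above can be sketched in a few lines of Python. The exon sequences below are made up, and the helper names (`splice`, `exon_subsets`) are hypothetical; the point is only that one set of exons can yield several different transcripts, and that the number of conceivable combinations grows quickly with the number of exons.

```python
from itertools import combinations

# Hypothetical exon sequences for one gene, keyed by exon number.
EXONS = {1: "ATGGCC", 2: "GATTAC", 3: "CCGGAA", 4: "TTTGGG", 5: "AAATGA"}


def splice(exons, order):
    """Join the chosen exons, which must be listed in genomic order."""
    order = list(order)
    if order != sorted(order):
        raise ValueError("exons must be joined in genomic order")
    return "".join(exons[i] for i in order)


def exon_subsets(exons):
    """Every non-empty, in-order choice of exons (an upper bound only;
    real splicing is far more constrained than this)."""
    ids = sorted(exons)
    return [c for r in range(1, len(ids) + 1) for c in combinations(ids, r)]


if __name__ == "__main__":
    print(splice(EXONS, [1, 2, 4, 5]))  # "protein 1" from the example above
    print(splice(EXONS, [2, 3, 4, 5]))  # "protein 2" from the example above
    print(len(exon_subsets(EXONS)))     # number of conceivable combinations
```

Whether each of those transcripts counts as its own gene or as a variant of one gene is exactly the counting question raised in this thread.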

    Another phenomenon that can happen from this is that the reading frame can be shifted from the splicing (if exon_length % 3 != 0) event and you get a completely different protein. This is usually found in organisms with smaller genomes.
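The frame-shift condition mentioned above (exon_length % 3 != 0) can be checked directly. This is a minimal Python sketch with made-up exon lengths and a hypothetical function name; it only looks at the total number of bases removed, since codons are read three bases at a time.

```python
def skipping_shifts_frame(exon_lengths, skipped):
    """Return True if leaving out the given exons shifts the reading frame
    of everything downstream, i.e. the removed length is not a multiple of 3."""
    removed = sum(exon_lengths[i] for i in skipped)
    return removed % 3 != 0


if __name__ == "__main__":
    # Hypothetical exon lengths in bases, keyed by exon number.
    lengths = {1: 90, 2: 121, 3: 60, 4: 122, 5: 150}
    print(skipping_shifts_frame(lengths, [3]))     # 60 bases removed
    print(skipping_shifts_frame(lengths, [2]))     # 121 bases removed
    print(skipping_shifts_frame(lengths, [2, 4]))  # 243 bases removed
```

Skipping two exons whose lengths together add up to a multiple of three leaves the downstream frame intact, even though skipping either one alone would shift it.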

    More info available here [arizona.edu].

    There is actually a database dedicated to this phenomenon here [lbl.gov].

    Note: I am a crystallographer and I admit to knowing very little about genetics, so take this for what it's worth.
