International Challenge To Computationally Interpret Protein Function
Shipud writes "We live in the post-genomic era, when DNA sequence data is growing exponentially. However, for most of the genes that we identify, we have no idea of their biological functions. They are like words in a foreign language, waiting to be deciphered. The Critical Assessment of Function Annotation, or CAFA, is a new experiment to assess the performance of the multitude of computational methods developed by research groups worldwide and help channel the flood of data from genome research to deduce the function of proteins. Thirty research groups participated in the first CAFA, presenting a total of 54 algorithms. The researchers participated in blind-test experiments in which they predicted the function of protein sequences for which the functions are already known but haven't yet been made publicly available. Independent assessors then judged their performance. The challenge organizers explain: 'The accurate annotation of protein function is key to understanding life at the molecular level and has great biochemical and pharmaceutical implications; however, with its inherent difficulty and expense, experimental characterization of function cannot scale up to accommodate the vast amount of sequence data already available. The computational annotation of protein function has therefore emerged as a problem at the forefront of computational and molecular biology.'"
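The summary does not say how the assessors scored the predictions. As a rough illustration only, one generic way to score a blind test like this is to compare the set of function terms a method predicts for a protein against the set later confirmed experimentally, using precision and recall. The function name and the term labels below are hypothetical, not CAFA's actual metric or data.

```python
def precision_recall_f1(predicted, actual):
    """Score one protein's predicted function terms against the confirmed ones.

    Both arguments are iterables of term labels. This is a generic
    set-overlap score, assumed here for illustration.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative labels only: three terms predicted, two of them confirmed.
predicted_terms = {"kinase activity", "ATP binding", "DNA binding"}
confirmed_terms = {"kinase activity", "ATP binding"}
precision, recall, f1 = precision_recall_f1(predicted_terms, confirmed_terms)
```

A method that predicts many terms tends to gain recall and lose precision, which is why a combined score is usually reported alongside the two.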
Dang. (Score:1, Offtopic)
Re: (Score:3)
It's possible that some protein out there will cure a lot of cancers. It could be in platypus, or in some fungus in a desert, some coral, or some other exotic species. We're never going to test all
Re: (Score:2)
My understanding was that folding at home was brute force taking these sequences, testing all possible conformations, and seeing what was the lowest energy conformation.
Incorrect. Folding@Home uses proteins whose structure (and usually function) is already exceptionally well characterized. That's how they can tell if their simulation actually worked. The point of the project isn't to predict the structure, because that's still extraordinarily difficult to do by purely physical simulation (as opposed to m
Re: (Score:1)
That's when my wife asks me to help out with the laundry.
Re:No idea... like words in a foreign language (Score:5, Insightful)
Re: (Score:2)
Computational power can scale infinitely, and scales geometrically with time.
No, it can't. There are fundamental limits to information storage and computation. Those limits are a lot better than we can achieve, but they exist.
Or make the problem more efficient.
A better algorithm always works. It's worth noting here that at worst, one can just make the protein physically and see what happens in real time. So it can't be that hard computationally.
Re: (Score:2)
"There are fundamental limits to information storage and computation."
Do those limits rely on assumptions about dimensions and time? Might dark energy and/or dark matter change some of those assumptions and thus make limits that feel so fundamental now evaporate?
640k ought to be enough for anybody.
Re: (Score:2)
Do those limits rely on assumptions about dimensions and time?
No. They derive from the second law of thermodynamics, which assumes very little...
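One well-known example of a thermodynamic bound of this kind is Landauer's principle: erasing one bit of information dissipates at least k_B * T * ln 2 of energy at temperature T. The comment above does not name it, so treat this as an illustrative assumption about which limit is meant. A minimal sketch of the arithmetic:

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact value in the 2019 SI definition


def landauer_limit_joules(temperature_kelvin):
    """Minimum energy to erase one bit at the given temperature (Landauer bound)."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


# Roughly room temperature.
energy_per_bit = landauer_limit_joules(300.0)
```

At 300 K this comes to about 2.87e-21 joules per bit, far below what current hardware spends per operation, which matches the remark that the limits are much better than anything we can achieve today.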
Re: (Score:2)
Second law of thermodynamics is statistical (Fluctuation Theorem). Can we exploit statistics to find ways to violate the Second Law consistently enough to expand the current "fundamental" limits?
Re: (Score:2)
Do those limits rely on assumptions about dimensions and time?
Yes. But these are assumptions borne out by our observations of our reality.
Might dark energy and/or dark matter change some of those assumptions and thus make limits that feel so fundamental now evaporate?
No. Dark energy is somewhat relevant in that an expanding universe does have an easier time of dissipating heat and a higher theoretical limit on information that can be packed into a cosmologically large space-time ball of given radius (the surface area (which is proportional to the maximum information a space can contain) of the ball becomes an exponential function of the radius rather than a fixed power). Against this, you have t
Re: (Score:2)
So, at the very least, our "fundamental limits" might not hold, if our observations about reality turn out to be like the flatlanders', and there are really more dimensions than we can sense?
Re: (Score:2)
No, it can't. There are fundamental limits to information storage and computation. Those limits are a lot better than we can achieve, but they exist.
What are these fundamental limits?
Re:A plan of action (Score:5, Insightful)
Without a good plan, we'll be at it for decades. Here's what I think genomic researchers should do.
Genes (and proteins) are obviously organized hierarchically. Which means there must be a control hierarchy in there somewhere. To unravel and properly classify the genome, researchers must first identify and understand the hierarchical control system. Only then can they begin to populate the branches with the correct genes.
After the tree is completely built and all the genes have found their correct locations on the tree, then it's a matter of going through the tree from the top down and switching the branches of the tree off/on one at a time to see what happens. It's hard but it can be done.
Unfortunately there doesn't have to be "a" control hierarchy: each subsystem can have its own hierarchy (or none) that uses its own unique control mechanisms, they don't have to operate by the same rules, they can mess with each other by lots of different ad hoc means. And that's just the genes: the proteins are much harder to model, at least as far as useful predictions go.
It's been ad hoc with no code review for over 3 billion years.
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
Nature doesn't design out of Knuth, and it is a big mistake to act or think like we will find nice analogs of human type design.
Re: (Score:2)
So what you're saying is http://xkcd.com/224/ [xkcd.com]
Code obfuscation (Score:3)
Don't discount that as stupid. Most of what he said is true. Evolution makes you write code that works, not good or clean code, just code that works. The only time evolution comes into play is when the code can't even compile.
Indeed there's even some selective pressure for code obfuscation. Viruses take advantage of compression, for example. New functions usually evolve from faulty events in old genes. There's no pressure to remove accidental calls to the wrong subroutine if they don't matter, hence a lot of messages go to the wrong place as well as the right place. Even in higher animals you see this: a dog's leg that scratches on its own when you scratch the dog's ribs is probably some back propagation on the nerve network that
Re:A plan of action (Score:5, Funny)
Stunning. Absolutely astounding. Yet another AC has taken Science by the balls and shaken the Universe to its core. Dizzying intellect, artistic prose. He's probably six feet tall, blond and with the chiseled features of a Grecian statue.
Oh. Wait.
Re: (Score:3)
Re: (Score:2)
Genes (and proteins) are obviously organized hierarchically. Which means there must be a control hierarchy in there somewhere.
Obvious nonsense. If not, then point to the "control hierarchy" in an ant colony (no, the Queen ant does not issue orders to the soldiers and workers).
How does it work? (Score:3)
I am not a biologist so forgive me my ignorance but when people say that DNA is the blueprint for an organism I never understand how a bunch of proteins can determine an organism's shape and behavior. Aren't there more factors that determine those things, like the surroundings in which the DNA is used, like chemicals that the growing organism is surrounded with, temperature, etc?
Re: (Score:2)
BTW I do know that DNA codes for proteins and that the proteins plus certain self-assembly mechanisms account for most of the work done in growing an organism. But there my knowledge ends.
Re: (Score:1)
I am not a biologist so forgive me my ignorance but when people say that DNA is the blueprint for an organism I never understand how a bunch of proteins can determine an organism's shape and behavior. Aren't there more factors that determine those things, like the surroundings in which the DNA is used, like chemicals that the growing organism is surrounded with, temperature, etc?
I think the details are not fully understood. However, in answer to your question I think nature and nurture both play a role. Lots of research has been done on identical twins who have the same DNA. Lots of research has also been done on dizygotic twins who do not share the same DNA. We know that identical twins look the same until the environment changes them. For example, if one of the twins works out to become a body builder, the pair will look quite different. We also know dizygotic twins look di
Re: (Score:2)
Re: (Score:3)
You're absolutely right. Microenvironment -- the cell's chemical, mechanical, and physical environment, determines which genes are
Re: (Score:2)
Re: (Score:1)
Biologist here.
Proteins do all the work. Here's the background:
DNA data is transcribed (think of DNA as a sequence of information, stream of bytes, if that helps) to mRNA (the m stands for 'messenger'). The DNA has twice the redundancy, if you will, as the mRNA. The DNA is for long-term storage, and the mRNA serves as a template for protein production. DNA is read to make mRNA, which is in turn read (executed, perhaps? I'm bad at analogies) to create proteins. There are molecular machines that perform
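The DNA to mRNA to protein flow described in this comment can be sketched in a few lines, taking the "stream of bytes" analogy literally. This is a deliberately simplified sketch: it assumes the input is the coding strand written with A, C, G and T only, uses the standard genetic code, and ignores splicing, RNA editing and every other complication raised elsewhere in this thread. The function names are made up for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """DNA -> mRNA: for the coding strand this is just T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


protein = translate(transcribe("ATGGCCTGGTAA"))  # codons: AUG GCC UGG UAA
```

Here the toy sequence yields the three-residue peptide "MAW" (methionine, alanine, tryptophan) before the UAA stop codon ends translation.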
Assumptions (Score:2)
That is all nice, but most of these prediction algorithms are based on one or more of the following assumptions, which are not always true:
Re:Assumptions (Score:4, Informative)
1. We have accurate mapping of the genes.
We have a pretty good idea on this one. Specific polymerases have specific sequences which they respond to, defining the start sequences of genes. It is possible we have missed some polymerase, but the likelihood is low given the extensive searches which have been done for them. As well, regions which are genes have a distinctively different character than regions which are not genes (at least in the general sense).
2. We can predict the protein sequence from the sequence of the gene.
We also have a pretty good idea about this, due to decades and decades of biologists trying to figure out the answer to this problem. The genetic code turns out to differ in some organisms from what we think of as the default. Sometimes multiple amino acids are coded for by the same sequence of bases, and so multiple proteins are produced from the identical coding region of DNA. Sometimes proteins are produced with modified amino acids, which are not explicitly coded for in the DNA of the gene, but rather by the activity of other proteins defined elsewhere by DNA. (This is a stochastic process and interference in the distribution of outcomes can sometimes result in pathological consequences.) In some organisms, the DNA is decompressed into RNA which is then translated into protein in a more typical way. (Extra bases are incorporated into the RNA in a repeatable way that results in amino acids added which were not defined in the sequence of DNA of the gene being added to proteins.) There's a whole bunch of stuff on alternate splicing, which we explicitly know that we don't know how to predict, that produces variations in protein sequence from a single gene sequence.
3. One protein can not be the product of two genes.
There are plenty of ways in which two separate genes can produce an identical protein. This actually happens ALL THE TIME in mammals, since we have two copies of every gene and most of these pairs have identical sequence. Even if the genes produce the identical protein through different mechanisms, if the protein is identical... then the protein is identical.
4. We have a good understanding of what the functions of the proteins in the training set are.
We do have a good idea of what the functions of the proteins in the training set are. See all of molecular biology for your citations.
5. If two proteins have similar sequence, they must have similar functions.
This is explicitly known to be false and is not expected under the evolutionary model. Look up the category of proteins known as 'crystallins' for a specific case counter to your assumption.
6. One protein has one function.
It is generally thought that there is a primary function for every protein. All things in biology are fuzzy, such that every protein probably has secondary side reactions or functions which may or may not be biologically relevant. (Arsenic is poisonous to us because our enzymes have a hard time distinguishing it from phosphorus, so the enzymes which incorporate phosphorus also 'function' to incorporate arsenic.)
7. A protein has a function.
Any protein synthesized by a cell costs energy. Under the evolutionary model of biology, proteins which don't have a function should have been discarded because their synthesis was wasting energy. That said, lots and lots of proteins are continuously created and then rapidly degraded because they were improperly folded or had other problems which brought them to the attention of intracellular systems with the 'function' of degrading such errant protein and returning their components to the cell for more productive use. Some genetic diseases are the consequence of the buildup of proteins which are otherwise non-symptomatic, but don't get degraded properly by the degradation systems.
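Assumption 5 above (similar sequence implies similar function) is what the simplest prediction methods lean on: find the most similar protein of known function and copy its annotation. A toy sketch of that idea follows. It uses shared 3-letter substrings (k-mers) and Jaccard similarity as a crude stand-in for a real sequence alignment, and the sequences, labels, threshold and function names are all invented for illustration.

```python
def kmers(sequence, k=3):
    """Return the set of overlapping length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transfer_annotation(query, annotated, k=3, threshold=0.5):
    """Copy the label of the most similar annotated sequence.

    Returns (label, score), or None when nothing reaches the threshold.
    """
    best_label, best_score = None, 0.0
    query_kmers = kmers(query, k)
    for sequence, label in annotated.values():
        score = jaccard(query_kmers, kmers(sequence, k))
        if score > best_score:
            best_label, best_score = label, score
    return (best_label, best_score) if best_score >= threshold else None


# Made-up sequences and labels, not real proteins.
TOY_DB = {
    "protA": ("MKTAYIAKQR", "kinase"),
    "protB": ("GGSLLWPPHE", "transporter"),
}
hit = transfer_annotation("MKTAYIAKQK", TOY_DB)
miss = transfer_annotation("WWWWWWW", TOY_DB)
```

The query differs from protA by one residue, so it inherits the "kinase" label, while the unrelated query gets no annotation at all. As the reply to assumption 5 points out, a high similarity score is evidence rather than proof: the crystallins are a case where this kind of transfer gives the wrong answer.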
Re: (Score:2)
In short, biologists are aware of the limitations of their assumptions and have some solid idea as to when their assumptions are valid or not.
Doing the bioinformatics will help a researcher sort through the dramatically large number of gene sequences to find a set which is likely enriched for the characteristic they are looking for. They know they will miss interesting cases which don't match the models used. Without these sorts of predictions, they would have to rely on random guessing as a strategy wi
Re: (Score:2)
---
As I understand it, the ribosome is responsible for taking the RNA and creating the protein. IIRC, it also folds the protein. A single protein can be folded in a few different ways to produce different building blocks.
There is also some sort of checking mechanism that checks for proper sequence and proper folding. If there is an error, the constructed/folded protein is broken down and the process is retried. Sometimes, the error ch
Re: (Score:2)
The ribosome is a complex of protein and ribosomal RNA (rRNA). The catalytic subunit of the ribosome, which adds new amino acids to the nascent protein, is the rRNA. A single protein can be folded an infinite number of ways, but only a small subset of that possibility is stable. Proteins which have failed to fold 'properly' will be bound by 'heat shock proteins' (HSPs) which assist the new protein in folding. These complexes provide some buffering against the problems of incorrectly manufactured or
Re: (Score:2)
1. We have accurate mapping of the genes.
We have a pretty good idea on this one. Specific polymerases have specific sequences which they respond to, defining the start sequences of genes. It is possible we have missed some polymerase, but the likelihood is low given the extensive searches which have been done for them. As well, regions which are genes have a distinctively different character than regions which are not genes (at least in the general sense).
You justify an assumption with an assumption. The core promoter sequences are so degenerate that they can be found pretty much anywhere. This has led to misannotation of long genes as multiple single genes. There are a number of other causes of annotation errors.
Re: (Score:3)
You justify an assumption with an assumption. The core promoter sequences are so degenerate that they can be found pretty much anywhere. This has led to misannotation of long genes as multiple single genes. There are a number of other causes of annotation errors.
There are also numerous examples of manually curated entries that are wrong because people studied non-existent proteins as a result of cloning artifacts or ignoring nonsense-mediated decay (NMD). Here is one example where transcripts containing unspliced introns, which are eliminated by NMD, have been studied and ascribed a function: Zhu J, Chen X. MCG10, a novel p53 target gene that encodes a KH domain RNA-binding protein, is capable of inducing apoptosis and cell cycle arrest in G(2)-M. Mol Cell Biol. 2000 Aug;20(15):5602-18. (accessions AF257770, AF257771)
Those long single genes which are sometimes misannotated as a series of smaller genes... are sometimes transcribed as a long single gene and sometimes as a series of smaller genes. You've primarily pointed out that biology is hard and that most published papers are full of crap.
Your pretty good idea is applicable to about 60% of the long reading frames and even less applicable to short ORFs: Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011 Nov 11;147(4):789-802. Mind you, this does not include processes like RNA editing, which can further complicate how we predict protein sequence based on gene sequence.
This counter-argument doesn't counter my argument.
I wasn't commenting on ploidy. I had in mind things like trans-splicing, where you assemble mature RNA from transcripts that belong to different genes sometimes located on different chromosomes, or the way protozoan genomes are rearranged prior to expression in the macronucleus.
I wasn't commenting on ploidy either. Protozoans do things in all sorts of ways, most of which we have no idea about... and don't care about for the most part. The knowledg
Wouldn't... (Score:1)
...a post-genomic world be one in which we had stopped fiddling with genes and DNA and such?
Aren't we more in the midst of a Genomics Revolution?
Or more accurately, we are in the infancy of the Genomics Revolution.
TED Talk: Understanding cancer through proteomics (Score:1)