Orphan Genes: A Guide for the Perplexed
As readers who have followed the exchange between Martin Poenie, Doug Axe, and Jonathan M. will know, we have a dispute going on between scientists over how to interpret experiments about proteins. The exchange, most recently here, originated in critical comments by Dr. Poenie about Stephen Meyer's argument in Darwin's Doubt. The dispute has to do with whether neo-Darwinian evolution, or any other form of unguided evolution for that matter, has the creative power attributed to it by many scientists.
Living things depend on a myriad of proteins to carry out cellular functions. These proteins are not globs of unstructured stuff. To carry out their function in the cell, most proteins have to fold into particular three-dimensional shapes. Interestingly, most proteins can be arranged into groups based on the similarity of their structures, or folds, as scientists call them. Several thousand distinct folds are now known among proteins whose structures have been determined.
Here's the surprising thing. As scientists sequence more genomes from different organisms, they are discovering that roughly 10-20% of each genome's protein-coding sequence is new, that is, unlike any other known protein-coding sequence. This was one of the biggest surprises to come out of the whole genome-sequencing project, though by no means the biggest.
Why was this a surprise? Given common descent, the fact that most housekeeping genes are shared among living things, and the assumption hitherto that evolution occurs by small incremental changes, the working expectation had been that orphan genes (protein-coding sequences without known protein-coding antecedents) should be rare if not non-existent.
At this point it is necessary to explain a little about how such orphan sequences come to be identified. Normally, in DNA that does not code for protein, roughly 1 in 20 triplet sequences (3 of the 64 possible triplets) will be TAA, TAG, or TGA. When DNA is copied into RNA, substituting U for T, the RNA is then interpreted by the protein-making machinery using the following code: AUG tells the protein-making machinery, "Start here," and UAA, UAG, and UGA say, "Stop here." Hence the names start and stop codons. Just statistically speaking, stop codons should be relatively common in a random DNA sequence. Traditionally, stretches of DNA that have a start codon and no in-frame stop codons for at least 100 nucleotides are called open reading frames, or ORFs (the length chosen depends on assumptions made about what constitutes a minimum length for function), and on that basis are identified as possible protein-coding genes. This is the case in bacteria; in eukaryotes it's more complex.
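The scanning rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how any particular gene-finding program works: it reads the forward strand only, uses the DNA spelling ATG for the start codon, counts the start and stop codons toward the ORF length, and ignores reading frames that run off the end of the sequence without a stop. The names find_orfs, stop_codon_fraction, and prob_no_stop are hypothetical, chosen for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA
START_CODON = "ATG"                  # DNA spelling of AUG


def find_orfs(dna, min_nt=100):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF here runs from a start codon through the next in-frame stop
    codon. min_nt is the minimum length in nucleotides, counting both
    the start and the stop codon (an assumption of this sketch).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i
            elif codon in STOP_CODONS:
                if i + 3 - start >= min_nt:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def stop_codon_fraction():
    """Fraction of the 64 possible triplets that are stop codons."""
    return len(STOP_CODONS) / 64


def prob_no_stop(n_codons):
    """Chance that n_codons random, equally likely triplets contain no stop."""
    return (1 - stop_codon_fraction()) ** n_codons


if __name__ == "__main__":
    example = "ATG" + "GCT" * 40 + "TAA"  # made-up sequence, 126 nucleotides
    print(find_orfs(example))
    print(stop_codon_fraction())
    print(prob_no_stop(33))  # roughly 100 nucleotides' worth of codons
```

The two helper functions make the statistical point concrete: if all triplets were equally likely, 3 in 64 would be stops, so long stop-free stretches become steadily less likely by chance as the length threshold grows. Real genomes do not have equal base frequencies, so the numbers are only a rough guide.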
Orphan genes (sometimes called ORFan genes in bacteria) are those open reading frames that lack identifiable sequence similarity to other protein-coding genes. Lack of similarity is hard to prove, given the size of the genomic universe. Methods vary from researcher to researcher, so each study needs to be evaluated carefully. There is also always the possibility that any given ORF has no function. No doubt some orphan genes will prove to be artifacts of incomplete evidence (see below). But orphan genes are a reality, nonetheless, based on numerous and substantial studies.
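To show why the orphan label depends on both the method and the database, here is a toy Python sketch. It is not the procedure used in the studies listed below; real analyses rely on sequence-alignment searches such as BLAST with statistical cutoffs. Everything here is an assumption made for illustration: the similarity measure (fraction of shared three-letter words), the 0.3 threshold, the made-up sequences, and the function names kmers, shared_kmer_fraction, and flag_orphans.

```python
def kmers(seq, k=3):
    """Set of all overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(query, subject, k=3):
    """Fraction of the query's k-mers that also occur in the subject."""
    q = kmers(query, k)
    if not q:
        return 0.0
    return len(q & kmers(subject, k)) / len(q)


def flag_orphans(queries, database, k=3, threshold=0.3):
    """Names of query sequences with no database hit at or above threshold.

    The threshold is an arbitrary illustrative cutoff, not a published value.
    """
    orphans = []
    for name, seq in queries.items():
        best = max(
            (shared_kmer_fraction(seq, s, k) for s in database.values()),
            default=0.0,
        )
        if best < threshold:
            orphans.append(name)
    return orphans


if __name__ == "__main__":
    # Made-up protein sequences, for illustration only.
    queries = {"known_like": "MKTAYIAKQR", "candidate": "WWWWWWWWWW"}
    database = {"reference": "MKTAYIAKQRQISFVK"}
    print(flag_orphans(queries, database))
```

Enlarging the database or loosening the threshold can turn an orphan into a non-orphan, which is why the methods of each study need to be evaluated carefully.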
Thus, the existence and prevalence of orphan genes raise a number of significant questions.
1. Do orphan genes encode functional proteins? In many cases there is evidence to suggest that they do. Some are highly conserved, even essential for the viability of the organism from which they come. Some are involved in important species-specific or group-specific functions.
2. Will similar sequences be found in other genomes, as we obtain more data? This could be the case if orphans are simply the result of our having sampled too little of worldwide genomic diversity. Orphan genes could be examples of once common genes now lost in most other species, or they could be far voyagers, come from other life forms and integrated into new contexts (this is especially possible among bacteria). This is unlikely to be the case for all orphan genes, however, because we keep discovering more as we sequence more genomes.
3. Will orphan proteins show structural similarity, if not sequence similarity, to known proteins? This would suggest that orphan genes started out with sequence similarity, but have lost it because of rapid adaptive evolution or, alternatively, long-term neutral evolution. The current evidence suggests that at least some orphan proteins have no known structural similarity. It is too soon to say whether that will always be the case.
4. Where do orphans come from? The existence of such surprising species- or clade-specific proteins makes this an interesting question. Some might have come from gene duplication followed by rapid adaptive evolution (see #3 above). If that is the case we should see traces left behind in the orphan protein's three-dimensional structure. Some propose recruitment from non-coding DNA by a combination of mechanisms, including insertion of transposable elements. This is possible, but it would require that the insertion or other mechanism(s) be lucky events in order to produce a stable, functional protein, that is, one that is of use to the organism. Exactly how lucky is one of the issues we are debating.
5. Then there is the elephant in the room that evolutionary biologists don't want to acknowledge. Perhaps we see so many species- and clade-specific orphan genes because they are uniquely designed for species- and clade-specific functions. Certainly, this runs contrary to the expectation of common descent.
Exciting times! Much more work has to be done before we can determine which of the possibilities above are true. It may well be that all of them are true, at least sometimes, though I am sure Dr. Poenie would rule out #5. If common descent is true, the apparent rate of generation of new proteins is astonishing by anyone's expectation. What now needs to be determined is whether or not naturalistic processes known to be operating are actually capable of generating so many new proteins.
Domazet-Loso and Tautz (2003) An Evolutionary Analysis of Orphan Genes in Drosophila. Genome Res. 13: 2213-2219.
Khalturin et al. (2008) A novel gene family controls species-specific morphological traits in Hydra. PLoS Biol 6(11): e278.
Jaroszewski et al. (2009) Exploration of Uncharted Regions of the Protein Universe. PLoS Biol 7: e1000205.
Xie C, Zhang YE, Chen J-Y, Liu C-J, Zhou W-Z, et al. (2012) Hominoid-Specific De Novo Protein-Coding Genes Originating from Long Non-Coding RNAs. PLoS Genet 8(9): e1002942.
Helen Pilcher (2013) All Alone. New Scientist, January 19, p. 38-41.