How The Junk DNA Hypothesis Has Changed Since 1980
As someone who has studied the concept of "junk DNA" for over twenty years, I am dismayed by two statements that appear repeatedly on various blog sites discussing evolution. No, I am not referring to arguments of the form "the onion has six times more DNA than do mammals; therefore, there is no deity," that are invariably followed by terms of disparagement hurled at anyone who even marginally departs from the Darwinian perspective. Rather, my consternation stems from a half-truth and a false fact that are recycled ad nauseam by those who apparently believe that, despite all the genomic and transcriptomic data that have been obtained only in this decade--data that have overturned a number of entrenched assumptions--a certain hypothesis published in 1980 is outside the purview of serious questioning.
The half-truth is the oft-read comment that goes something like this: "No one ever asserted that junk DNA is without function...it was long suspected that these sequences have important roles in the cells." Now, to be fair, it is correct to say that models for, say, repetitive DNA-based operations in metazoan development have been proposed since the 1960s.1 It is also true that the evolutionary process of exaptation--the accidental acquisition of a function--has been used to explain how the odd transposon here or there along a chromosome can regulate a locus. Nonspecific effects of "extra" DNA on the cell have also been suggested for around three decades, if not longer. That said, the junk DNA hypothesis that one commonly reads as being an unassailable observation, as an incontrovertible empirical conclusion, presents as a clear prediction that the vast majority of non-gene sequences are devoid of any precise specificational role in ontogeny. Allow me to explain.
Two papers appeared back to back in the journal Nature in 1980: "Selfish Genes, the Phenotype Paradigm and Genome Evolution" by W. Ford Doolittle and Carmen Sapienza2 and "Selfish DNA: The Ultimate Parasite" by Leslie Orgel and Francis Crick.3 These laid the framework for thinking about nonprotein-coding regions of chromosomes, judging from how they are cited. What these authors effectively did was advance Dawkins's 1976 selfish gene idea4 in such a way that all the genomic DNA evidence available up to that time could be accounted for by a plausible scenario. The thesis presented in both articles is that the only specific function of the vast bulk of "nonspecific" sequences, especially repetitive elements such as transposons, is to replicate themselves -- this is the consequence of natural selection operating within genomes, beneath the radar of the cell. These junk sequences, it was postulated, can duplicate and disperse throughout chromosomes because they have little or no effect on the phenotype, save for the occasional mutation that results from their mobility. On the positive side, the C-value paradox, the longstanding puzzle that genome sizes have no correlation with perceived organismal complexity -- a lily, for instance, can have twenty times more nuclear DNA than a mouse -- was satisfactorily explained by the hypothesis. Also, the problem of repetitive elements of which the "variety and patterns of their interspersion with unique sequence DNA make no particular phylogenetic or phenotypically functional sense" 3 was argued to have a simple solution. Likewise, the finding in the late 1970s that protein-coding regions in eukaryotes are interrupted by nonprotein-coding "introns" could be understood...as perhaps the degenerate remains of old transposable sequences.
A careful reading of these papers reveals, though, in what ways nonprotein-coding DNA function was thought by these authors to be likely. At the risk of being accused of quote-mining, let me first note the definitions of junk or selfish DNA:
A piece of selfish DNA, in its purest form, has two distinct properties:
(1) It arises when a DNA sequence spreads by forming additional copies of itself within the genome.
(2) It makes no specific contribution to the phenotype.
[W]e shall use the term selfish DNA in a wider sense, so that it can refer not only to obviously repetitive DNA but also to certain other DNA sequences which appear to have little or no function, such as much of the DNA in the introns of genes and parts of the DNA sequences between genes...The conviction has been growing that much of this extra DNA is 'junk', in other words, that it has little specificity and conveys little or no selective advantage to the organism...in the case of selfish DNA, the sequence which spreads makes no contribution to the phenotype of the organism, except insofar as it is a slight burden to the cell that contains it. Selfish DNA sequences may be transcribed in some cases and not in others. The spread of selfish DNA within the genome can be compared to the spread of a not-too-harmful parasite within its host.3
Natural selection operating within genomes will inevitably result in the appearance of DNAs with no phenotypic expression whose only 'function' is survival within genomes.2
Second, no prohibition was placed on relatively few selfish motifs modulating a gene in a way that they positively contributed to fitness, or on these elements en masse having nonspecific effects on the cell:
We do not deny that prokaryotic transposable elements or repetitive and unique-sequence DNAs not coding for protein in eukaryotes may have roles of immediate phenotypic benefit to the organism.2
It would be surprising if the host organism did not occasionally find some use for particular selfish DNA sequences, especially if there were many different sequences widely distributed over the chromosomes. One obvious use, as repeatedly stressed by Britten and Davidson, would be for control purposes at one level or another. This seems more than plausible.
A mechanism which scattered, more or less at random, many kinds of repeated sequences in many places in the genome would appear to be rather good for this purpose [of gene regulation]. Most sets of such sequences would be unlikely to find themselves in the right combination of places to be useful but, by chance, the members of one particular set might be located so that they could be used to turn on (or turn off) together a set of genes which had never been controlled before in a coordinated way. A next way of doing this would be to use as control sequences not the many identical copies distributed over the genome, but a small subset of these which had mutated away from the master sequence in the same manner.
On this picture, each set of repeated sequences might be 'tested' from time to time in evolution by the production of a control macromolecule...to recognize those sequences. If this produced a favorable result, natural selection would confirm and extend the mechanism. If not, it would be selected against and discarded. Such a process implies that most sets of repeated sequences will never be of use since, on statistical grounds, their members will usually be in unsuitable places.
It thus seems unlikely that all selfish DNA has acquired a special function...
In some circumstances, the sheer bulk of selfish DNA may be used by the organism for its own purpose. That is, the selfish DNA may acquire a nonspecific function which gives the organism a selective advantage.3
In other words, the opinion expressed in these two works is that "excess" DNA is junk in the sense that it is largely devoid of phenotype-specifying information. This perspective was being discussed in the 1970s and it quickly became the consensus after this pair of papers appeared. Don't take my word for it--follow the literature trail. Simply type in terms such as "junk DNA," "selfish DNA," "repetitive DNA," "noncoding," etc. using the PubMed search engine and read the articles. What should become obvious is that the view expounded by Orgel and Crick on the one hand, and Doolittle and Sapienza on the other, has been considered by many cellular and molecular biologists to be the correct explanation for much of genomic DNA until very recently.
So the oft-read claim on the web that the term "junk DNA" never implied developmentally "non-functional DNA" is one that is made either out of ignorance or disingenuousness.
That said, the success of the junk DNA proposal was based in part on the narrative it provided. But its acceptance was also due to definitions and presuppositions that remain with us today. Regarding the former, a gene was described in 1980 as a discrete section of the chromosome that encoded a protein or in some instances an RNA, with the "one gene, one enzyme" model exemplifying this concept. Sequences of DNA that do not specify a protein were labeled "noncoding" or, as we have seen, "nonspecific." By connotation, then, almost all genomic regions of any given eukaryote lack coding potential, which was understood then and now to mean being a part of the "genetic program." Linked to this definition of a gene was the assumption that cross-species conservation of a DNA string implies that it has been retained by natural selection, because it embodies some instructions that enhance the fitness of an organism. Since a large fraction of nonprotein-coding DNA is often restricted to members of a species or a genus or a family, it fails the conservation test and thus is said to be dispensable: the refuse of the duplication and transposition process. In short, the backdrop of the junk DNA hypothesis was the premise that sequences like repetitive elements are noncoding in the strictest way--encoding no proteins or RNAs other than those used in their own manipulative, lascivious, and licentious replication; and their evolutionary lability reflects this lack of coding potential.
This brings me to the false fact. It has been said that 90% of all genomic DNA (in eukaryotes) is junk. No taxon is mentioned; no reference is cited...the value is just repeated by those commenting on evo blogs. To be sure, tagging a percentage to such a claim is a lot better than simply saying that "most DNA is junk." In lieu of an actual piece of research that demonstrated support for this proclamation, let's critically examine the 90% junk figure by focusing on human genomic DNA. Only around 1.5% of our chromosomal sequences encode proteins, which entails that 98.5% of the genome is noncoding by the classical definition. If someone wanted to make the equation noncoding = junk, then lo and behold functional sequences in Homo sapiens drop far below the 10% value. But we know that this equation is not valid. A surprising finding of ENCODE and other transcriptome projects is that almost every nucleotide of human (and mouse) chromosomes is transcribed in a regulated way 5 6 7 8 9 10 11 12 13 14 15. Most of the RNAs produced are various nonprotein-coding transcripts that are copied from both strands in a cell type-, tissue type-, or developmental stage-specific manner 16 17 18. These RNAs belong to a number of different functional classes and new categories are being discovered all the time 19 20 21 22 23 24. Further, these nonprotein-coding transcriptional units extend into and arise from protein-coding segments. Many also map to the regions between protein-coding loci.25 The RNA map of the mammalian genome has moreover been demonstrated to be hierarchical and far from random. 13 15 26
Clearly, the "gene" definition that provided the framework for the junk DNA hypothesis is defunct27 28, and much discussion now centers on providing an operational description.29 30 31 32 That is to say, the coding/noncoding distinction is being rethought. And if one considers functional DNA to be equivalent to transcription units that are developmentally expressed together with their regulatory regions, the fraction that can be dismissed as junk becomes startlingly small--this is what the results of recent studies imply. 33
Indeed, if we accept the equation transcription units + control elements = developmentally functional DNA, then the number of loci in the human genome jumps from a paltry 20,000 to hundreds of thousands, and the percentage of non-junk DNA increases to well over 90%.
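The bookkeeping behind these two "equations" can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not an analysis of any dataset: the 1.5% protein-coding figure comes from the text above, the 0.90 value is an assumption that treats "well over 90%" as a floor, and the names `junk_fraction` and `definitions` are hypothetical.

```python
def junk_fraction(functional_fraction: float) -> float:
    """Return the share of the genome left over once functional DNA is set aside."""
    if not 0.0 <= functional_fraction <= 1.0:
        raise ValueError("functional_fraction must lie between 0 and 1")
    return 1.0 - functional_fraction


# Figures quoted in the text for the human genome.
PROTEIN_CODING = 0.015           # about 1.5% of sequence encodes protein
TRANSCRIBED_PLUS_CONTROL = 0.90  # assumption: "well over 90%" treated as a lower bound

# Each definition of "functional" maps to the fraction of the genome it covers.
definitions = {
    "noncoding = junk": PROTEIN_CODING,
    "transcription units + control elements = functional": TRANSCRIBED_PLUS_CONTROL,
}

for label, functional in definitions.items():
    print(f"{label}: functional {functional:.1%}, junk {junk_fraction(functional):.1%}")
```

The point of the sketch is simply that the "junk" percentage is not measured directly; it is whatever remains after a definition of functional DNA has been chosen.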
It could be argued that most of these RNA-encoding loci are really cellular "noise" due to transcription running amok, on the basis that so few are phylogenetically conserved--after all, didn't Orgel and Crick foresee such a possibility in their definition of selfish DNA? Well, this line of argumentation doesn't hold. Another counterintuitive result of the ENCODE project and other comparative genomic analyses is that known functional sections of the mammalian genome such as protein-coding segments appear to be diverging without constraint 5 34, whereas a host of "junk" sequences are under some type of selective pressure--including most human "noncoding" DNA stretches. 35 36 The same has been repeatedly detected for the fruit fly genome, where most nonprotein-coding sequences appear to be under functional constraint--with the species-specific differences having the statistical hallmarks of being "adaptive" 37 38 39 40. Even the Y chromosome of the fruit fly, long presented as "exhibit A" in the gallery of garbage DNA, has been shown to have diverse effects on the phenotype of this insect.41 Such results are exactly the opposite of what Orgel and Crick and Doolittle and Sapienza predicted.
Instead of 90% of the human or fly genome being junk, it seems that 90% or more of chromosomal DNA has some kind of specific developmental function, given the available data. Indeed, the emerging picture is that the species-specific nonprotein-coding regions encode numerous RNAs that help to shape the phenotype in ways that we are only beginning to understand.42 43 44 45 46 This is especially true for the transposable element fraction of human chromosomes--about 50% of our DNA--much of which is arranged and expressed in a taxon-specific manner. 33 47 48 49 Part of the reason why a human is not a chimp is not a cow is not a whale, then, is that each species has its own set of sui generis "genes"--genomic texts specifying unique RNAs or even proteins that are used in embryogenesis.
To put everything into perspective, I'll mine another quote from a paper worth reading:
We now know that more of the DNA in eukaryotic cells is copied into RNA than previously had been thought. Many of these transcripts serve regulatory instead of template functions in gene readout. Some of these newly recognized RNAs come from regions of the genome that had heretofore been deemed "junk DNA," yet no one could answer the obvious question: if "junk," then why still around? Before memory fades, we should note that there were some reasonably well articulated ideas 30-40 years ago that anticipated these recent discoveries.1
Indeed, those were the very same well-articulated ideas that the selfish DNA hypothesis was supposed to have dispensed with, once and for all.
How things have changed since 1980.
1 Pederson T. 2009. The discovery of eukaryotic genome design and its forgotten corollary--the postulate of gene regulation by nuclear RNA. FASEB J. 23(7): 2019-2021.
2 Doolittle WF, Sapienza C. 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature 284(5757): 601-603.
3 Orgel LE, Crick FHC. 1980. Selfish DNA: the ultimate parasite. Nature 284(5757): 604-607.
4 Dawkins R. 1976. The Selfish Gene. Oxford University Press, New York, New York.
5 ENCODE Project Consortium, Birney E, et al. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146): 799-816.
6 Frith MC, et al. 2006. Pseudo-messenger RNA: phantoms of the transcriptome. PLoS Genet. 2(4): e23.
7 Katayama S, et al. 2005. Antisense transcription in the mammalian transcriptome. Science 309(5740): 1564-1566.
8 Kapranov P, et al. 2007. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316(5830): 1484-1488.
9 Wu JQ, et al. 2008. Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing reveals extensive transcription in the human genome. Genome Biol. 9(1): R3.
10 Furuno M, et al. 2006. Clusters of internally primed transcripts reveal novel long noncoding RNAs. PLoS Genet. 2(4): e37.
12 Dinger ME, et al. In press. Pervasive transcription of the eukaryotic genome: functional indices and conceptual implications. Brief Funct Genomic Proteomic.
13 Kapranov P, et al. 2007. Genome-wide transcription and the implications for genomic organization. Nat Rev Genet. 8(6): 413-423.
14 Kapranov P, et al. 2005. Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res. 15(7): 987-997.
16 Rinn JL, et al. 2007. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell 129(7): 1311-1323.
18 Mercer TR, et al. 2008. Specific expression of long noncoding RNAs in the mouse brain. Proc Natl Acad Sci USA 105(2): 716-721.
20 Taft RJ, et al. 2009. Tiny RNAs associated with transcription start sites in animals. Nat Genet. 41(5): 572-578.
22 Wilusz JE, et al. 2009. Long noncoding RNAs: functional surprises from the RNA world. Genes Dev. 23(13): 1494-1504.
23 Affymetrix ENCODE Transcriptome Project; Cold Spring Harbor Laboratory ENCODE Transcriptome Project. 2009. Post-transcriptional processing generates a diversity of 5'-modified long and short RNAs. Nature 457(7232): 1028-1032.
25 Khalil AM, et al. 2009. Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc Natl Acad Sci USA 106(28): 11667-11672.
26 Thurman RE, et al. 2007. Identification of higher-order functional domains in the human ENCODE regions. Genome Res. 17(6): 917-927.
27 Gerstein MB, et al. 2007. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17(6): 669-681.
29 Scherrer K, Jost J. 2007. Gene and genon concept: coding versus regulation. A conceptual and information-theoretic analysis of genetic storage and expression in the light of modern molecular biology. Theory Biosci. 126(2-3): 65-113.
34 Margulies EH, et al. 2007. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 17(6): 760-774.
35 Asthana S, et al. 2007. Widely distributed noncoding purifying selection in the human genome. Proc Natl Acad Sci USA 104(30): 12410-12415.
36 Eory L, et al. In press. Distributions of selectively constrained sites and deleterious mutation rates in the hominid and murid genomes. Mol Biol Evol.
37 Andolfatto P. 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437(7062): 1149-1152.
39 Halligan DL, Keightley PD. 2006. Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison. Genome Res. 16(7): 875-884.
40 Haddrill PR, et al. 2008. Positive and negative selection on noncoding DNA in Drosophila simulans. Mol Biol Evol. 25(9): 1825-1834.
41 Lemos B, et al. 2008. Polymorphic Y chromosomes harbor cryptic variation with manifold functional consequences. Science 319(5859): 91-93.
42 Glinsky GV. 2008. Phenotype-defining functions of multiple non-coding RNA pathways. Cell Cycle 7(11): 1630-1639.
43 Bond CS, Fox AH. 2009. Paraspeckles: nuclear bodies built on long noncoding RNA. J Cell Biol. 186(5): 637-644.
44 Barak M, et al. In press. Evidence for large diversity in the human transcriptome created by Alu RNA editing. Nucleic Acids Res.
45 Lee JT. 2009. Lessons from X-chromosome inactivation: long ncRNA as guides and tethers to the epigenome. Genes Dev. 23(16): 1831-1842.
46 Guttman M, et al. 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458(7235): 223-227.
47 Faulkner GJ, et al. 2009. The regulated retrotransposon transcriptome of mammalian cells. Nat Genet. 41(5): 563-571.
48 Tay SK, et al. 2009. Global discovery of primate-specific genes in the human genome. Proc Natl Acad Sci USA 106(29): 12019-12024.
49 Walters RD, et al. 2009. InvAluable junk: the cellular impact and function of Alu and B2 RNAs. IUBMB Life. 61(8): 831-837.