The paradox of extremely fast evolution driven by genetic drift in multi-copy gene systems - A resolution

  1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Ziyue Gao
    University of Pennsylvania, Philadelphia, United States of America
  • Senior Editor
    George Perry
    Pennsylvania State University, University Park, United States of America

Reviewer #1 (Public Review):

The manuscript by Wang et al is, like its companion paper, very unusual in the opinion of this reviewer. It builds off of the companion theory paper's exploration of the "Wright-Fisher Haldane" model but applies it to the specific problem of diversity in ribosomal RNA arrays. The authors argue that polymorphism and divergence among rRNA arrays are inconsistent with neutral evolution, primarily stating that the amount of polymorphism suggests a high effective size and thus a slow fixation rate, while we, in fact, observe relatively fast fixation between species, even in putatively non-functional regions. They frame this as a paradox in need of solving, and invoke the WFH model.

The same critiques apply to this paper as to the presentation of the WFH model and the lack of engagement with the literature, particularly concerning Cannings models and non-diffusive limits. However, I have additional concerns about this manuscript, which I found particularly difficult to follow.

My first, and most major, concern is that I can never tell when the authors are referring to diversity in a single copy of an rRNA gene compared to when they are discussing diversity across the entire array of rRNA genes. I admit that I am not at all an expert in studies of rRNA diversity, so perhaps this is a standard understanding in the field, but in order for this manuscript to be read and understood by a larger number of people, these issues must be clarified.

The authors frame the number of rRNA genes as roughly equivalent to expanding the population size, but this seems to be wrong: the way that a mutation can spread among rRNA gene copies is fundamentally different than how mutations spread within a single copy gene. In particular, a mutation in a single copy gene can spread through vertical transmission, but a mutation spreading from one copy to another is fundamentally horizontal: it has to occur because some molecular mechanism, such as slippage, gene conversion, or recombination resulted in its spread to another copy. Moreover, by collapsing diversity across genes in an rRNA array, the authors are massively increasing the mutational target size.

For example, it's difficult for me to tell if the discussion of heterozygosity at rRNA genes in mice starting on line 277 is collapsed or not. The authors point out that Hs per kb is ~5x larger in rRNA than the rest of the genome, but I can't tell based on the authors' description if this is diversity per single copy locus or after collapsing loci together. If it's the first one, I have concerns about diversity estimation in highly repetitive regions that would need to be addressed, and if it's the second one, an elevated rate of polymorphism is not surprising, because the mutational target size is in fact significantly larger.

Even if these issues were sorted out, I'm not sure that the authors' framing, in terms of variance in reproductive success, is a useful way to understand what is going on in rRNA arrays. The authors explicitly highlight homogenizing forces such as gene conversion and replication slippage but then seem to just want to incorporate those as accounting for variance in reproductive success. However, don't we usually want to dissect these things in terms of their underlying mechanism? Why build a model based on variance in reproductive success when you could instead explicitly model these homogenizing processes? That seems more informative about the mechanism, and it would also serve significantly better as a null model, since the parameters could be related to in vitro or in vivo measurements of the rates of slippage, gene conversion, etc.

In the end, I find the paper in its current state somewhat difficult to review in more detail, because I have a hard time understanding some of the more technical aspects of the manuscript while so confused about high-level features of the manuscript. I think that a revision would need to be substantially clarified in the ways I highlighted above.

Reviewer #2 (Public Review):

Summary:

Multi-copy gene systems are expected to evolve slower than single-copy gene systems because it takes longer for genetic variants to fix in the large number of gene copies in the entire population. Paradoxically, their evolution is often observed to be surprisingly fast. To explain this paradox, the authors hypothesize that the rapid evolution of multi-copy gene systems arises from stronger genetic drift driven by homogenizing forces within individuals, such as gene conversion, unequal crossover, and replication slippage. They formulate this idea by combining the advantages of two classic population genetic models -- adding the V(k) term (which is the variance in reproductive success) in the Haldane model to the Wright-Fisher model. Using this model, the authors derived the strength of genetic drift (i.e., reciprocal of the effective population size, Ne) for the multi-copy gene system and compared it to that of the single-copy system. The theory was then applied to empirical genetic polymorphism and divergence data in rodents and great apes, relying on comparison between rRNA genes and genome-wide patterns (which mostly are single-copy genes). Based on this analysis, the authors concluded that neutral genetic drift could explain the rRNA diversity and evolution patterns in mice but not in humans and chimpanzees, pointing to a positive selection of rRNA variants in great apes.
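To make the role of V(K) in this summary concrete, here is a minimal simulation sketch. It is not the authors' code. It assumes a haploid Haldane-type branching step in which each gene copy independently leaves K offspring with E(K) = 1, and it uses a hypothetical two-point offspring distribution (K = m with probability 1/m, otherwise 0), so that V(K) = m - 1. Under these assumptions, the one-generation variance of the allele frequency change should scale roughly as V(K) p(1 - p)/N. All names and parameter values are illustrative.

```python
import random
import statistics

def next_frequency(n_copies, p, m, rng):
    """One Haldane-type generation: each copy leaves m offspring w.p. 1/m, else 0."""
    n_mutant = round(n_copies * p)
    mutant_offspring = 0
    total_offspring = 0
    for i in range(n_copies):
        k = m if rng.random() < 1.0 / m else 0
        total_offspring += k
        if i < n_mutant:  # the first n_mutant copies carry the variant
            mutant_offspring += k
    if total_offspring == 0:  # vanishingly rare for large n_copies
        return p
    return mutant_offspring / total_offspring

def drift_variance(n_copies, p, m, reps, seed):
    """Variance of the one-generation allele frequency change over many replicates."""
    rng = random.Random(seed)
    changes = [next_frequency(n_copies, p, m, rng) - p for _ in range(reps)]
    return statistics.pvariance(changes)

# Hypothetical settings: 200 gene copies, starting frequency 0.5.
N_COPIES, P0, REPS = 200, 0.5, 2000
var_low = drift_variance(N_COPIES, P0, m=2, reps=REPS, seed=1)   # V(K) = 1
var_high = drift_variance(N_COPIES, P0, m=5, reps=REPS, seed=2)  # V(K) = 4
print(var_low, var_high, var_high / var_low)
```

In this toy setting the ratio of the two variances should come out near 4, i.e. drift strengthens in proportion to V(K) with N held fixed.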

Strengths:

Overall, the new WFH model is an interesting idea. It is intuitive, efficient, and versatile in various scenarios, including the multi-copy gene system and other cases discussed in the companion paper by Ruan et al.

Weaknesses:

Despite being intuitive at a high level, the model is a little unclear, as several terms in the main text were not clearly defined and connections between model parameters and biological mechanisms are missing. Most importantly, the data analysis of rRNA genes is extremely over-simplified and does not adequately consider biological and technical factors that are not discussed in the model. Even if these factors are ignored, the authors' interpretation of several observations is unconvincing, as alternative scenarios can lead to similar patterns. Consequently, the conclusions regarding rRNA genes are poorly supported. Overall, I think this paper shines more in the model than the data analysis, and the modeling part would be better presented as a section of the companion theory paper rather than a stand-alone paper. My specific concerns are outlined below.

(1) Unclear definition of terms

Many of the terms in the model or the main text were not clearly defined the first time they occurred, which hindered understanding of the model and observations reported. To name a few:

(i) In Eq(1), although C* is defined as the "effective copy number", it is unclear what it means in an empirical sense. For example, Ne could be interpreted as "an ideal WF population with this size would have the same level of genetic diversity as the population of interest" or "the reciprocal of strength of allele frequency change in a unit of time". A few factors were provided that could affect C*, but specifically, how do these factors impact C*? For example, does increased replication slippage increase or decrease C*? How about gene conversion or unequal crossover? If we don't even have a qualitative understanding of how these processes influence C*, it is very hard to make interpretations based on inferred C*. How should the claim on lines 240-241 (if the homogenization is powerful enough, rRNA genes would have C*<1) be interpreted? Please also clarify what C* would be in a single-copy gene system in a diploid species.

(ii) In Eq(1), what exactly is V*(K)? Variance in reproductive success across all gene copies in the population? What factors affect V*(K)? For the same population, what is the possible range of V*(K)/V(K)? Is it somewhat bounded because of biological constraints? Are V*(K) and C*(K) independent parameters, or does one affect the other, or are both affected by an overlapping set of factors?

(iii) In the multi-copy gene system, how is fixation defined? A variant found at the same position in all copies of the rRNA genes in the entire population?

(iv) Lines 199-201, HI, Hs, and HT are not defined in the context of a multi-copy gene system. What are the empirical estimators?

(v) Line 392-393, f and g are not clearly defined. What does "the proportion of AT-to-GC conversion" mean? What are the numerator and denominator of the fraction, respectively?

(2) Technical concerns with rRNA gene data quality

Given the highly repetitive nature and rapid evolution of rRNA genes, myriads of things could go wrong with read alignment and variant calling, raising great concerns regarding the data quality. The data source and methods used for calling variants were insufficiently described in places, further exacerbating the concern.

(i) What are the accession numbers or sample IDs of the high-coverage WGS data of humans, chimpanzees, and gorillas from NCBI? How many individuals are in each species? These details are necessary to ensure reproducibility and correct interpretation of the results.

(ii) Sequencing reads from great apes and mice were mapped against the human and mouse rDNA reference sequences, respectively (lines 485-486). Given the rapid evolution of rRNA genes, even individuals within the same species differ in copy number and sequences of these genes. Alignment to a single reference genome would likely lead to incorrect and even failed alignment for some reads, resulting in genotyping errors. Differences in rDNA sequence, copy number, and structure are even greater between species, potentially leading to higher error rates in the called variants. Yet the authors provided no justification for the practice of aligning reads from multiple species to a single reference genome nor evidence that misalignment and incorrect variant calling are not major concerns for the downstream analysis.

(vi) It is unclear how variant frequency within an individual was defined conceptually or computed from data (lines 499-501). The population-level variant frequency was calculated by averaging across individuals, but why was the averaging not weighted by the copy number of rRNA genes each individual carries? How many individuals are sampled for each species? Are the sample sizes sufficient to provide an accurate estimate of population frequencies?
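The weighting question in point (vi) can be illustrated with a short sketch. The per-individual frequencies and copy numbers below are invented for illustration, and the function names are hypothetical. Nothing here reproduces the authors' pipeline.

```python
def unweighted_frequency(freqs):
    """Mean of per-individual variant frequencies (each individual counts equally)."""
    return sum(freqs) / len(freqs)

def copy_weighted_frequency(freqs, copy_numbers):
    """Frequency among all sampled gene copies (each gene copy counts equally)."""
    total_copies = sum(copy_numbers)
    return sum(f * c for f, c in zip(freqs, copy_numbers)) / total_copies

# Hypothetical data: within-individual variant frequency and rDNA copy number.
freqs = [0.10, 0.50, 0.90]
copy_numbers = [100, 200, 700]

print(unweighted_frequency(freqs))
print(copy_weighted_frequency(freqs, copy_numbers))
```

The two estimators agree when all individuals carry the same copy number and diverge whenever variant frequency covaries with copy number, which is why the choice matters for any frequency threshold applied downstream.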

(vii) Fixed variants are operationally defined as those with a frequency>0.8 in one species. What is the justification for this choice of threshold? Without knowing the exact sample size of the various species, it's difficult to assess whether this threshold is appropriate.

(viii) It is not explained exactly how FIS, FST, and divergence levels of rRNA genes were calculated from variant frequency at individual and species levels. Formulae need to be provided to explain the computation.
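As an indication of the kind of explicit formulae being requested, the sketch below uses the standard Wright-style definitions F_IS = (H_S - H_I)/H_S and F_ST = (H_T - H_S)/H_T. It assumes that H_I, H_S, and H_T denote heterozygosity within individuals, within species, and in total, that heterozygosity at a biallelic site is 2p(1 - p), and that species are weighted equally. These conventions and the toy frequencies are assumptions for illustration and may differ from what the manuscript actually computes.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic site with variant frequency p."""
    return 2.0 * p * (1.0 - p)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def f_statistics(freqs_by_species):
    """freqs_by_species: dict mapping species -> within-individual variant frequencies at one site.

    Returns (H_I, H_S, H_T, F_IS, F_ST). Undefined (division by zero) if H_S or H_T is 0.
    """
    species_means = [mean(freqs) for freqs in freqs_by_species.values()]
    h_i = mean(mean(heterozygosity(f) for f in freqs)
               for freqs in freqs_by_species.values())
    h_s = mean(heterozygosity(p) for p in species_means)
    h_t = heterozygosity(mean(species_means))
    f_is = (h_s - h_i) / h_s
    f_st = (h_t - h_s) / h_t
    return h_i, h_s, h_t, f_is, f_st

# Hypothetical within-individual frequencies for two species at a single site.
example = {"species_A": [0.1, 0.3], "species_B": [0.7, 0.9]}
h_i, h_s, h_t, f_is, f_st = f_statistics(example)
print(h_i, h_s, h_t, f_is, f_st)
```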

(3) Complete neglect of the difference in mutation rate between rRNA genes and the genome-wide average

Nearly all data analysis in this paper relied on comparison of rRNA genes with the rest (presumably the single-copy part) of the genome. However, mutation rate, a key parameter determining the diversity and divergence levels, was completely ignored in the comparison. It is well known that mutation rate differs tremendously along the genome, with both fine- and large-scale variation. If the mutation rate of rRNA genes differs substantially from the genome average, it would invalidate almost all of the analysis results. Yet no discussion or justification was provided.

Related to mutation rate: given the hypermutability of CpG sites, it is surprising that the evolution/fixation rate of rRNA estimated with or without CpG sites is so close (2.24% vs 2.27%). Given the 10- to 20-fold higher mutation rate at CpG sites in the human genome, and 2% CpG density (which is probably an underestimate for rDNA), we expect the former to be at least 20% higher than the latter.
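This expectation can be checked with back-of-the-envelope arithmetic. The sketch assumes a simple two-class mixture in which a fraction of sites are CpG and mutate a fixed number of times faster than all other sites. The function name and the mixture model are illustrative assumptions, not taken from the manuscript.

```python
def rate_ratio_with_cpg(cpg_density, fold_increase):
    """Ratio of the all-sites rate to the non-CpG rate under a two-class mixture."""
    return (1.0 - cpg_density) + cpg_density * fold_increase

# 2% CpG density and a 10- to 20-fold elevated CpG mutation rate, as quoted above.
for fold in (10, 20):
    ratio = rate_ratio_with_cpg(0.02, fold)
    print(f"{fold}-fold CpG rate: all-site rate is {100 * (ratio - 1):.0f}% higher")
```

Under these assumptions the expected excess is roughly 18% to 38%, broadly in line with the figure quoted above, whereas the two reported estimates (2.24% vs 2.27%) differ by only about 1%.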

Among the weaknesses above, concern (1) can be addressed with clarification, but concerns (2) and (3) invalidate almost all findings from the data analysis and cannot be easily alleviated without a complete revamp of the work.

Author response:

(1) First, we wish to point out that there has not been a model for quantifying genetic drift in multi-copy gene systems. Hence, the first attempt using the Haldane model is not expected to be familiar and readily acceptable. Nevertheless, the standard WF (Wright-Fisher) model cannot handle drift in multi-copy gene systems, such as viruses, due to the two levels of genetic drift – within individuals as well as between individuals of the population.

[Point 1 responds to the comments that we did not engage with the literature, in particular, publications on models such as the Cannings model, which are extensions of the WF model. As pointed out above, models based on WF sampling cannot handle the two levels of genetic drift.]

(2) A crucial aspect of the study is the nature of the rRNA gene cluster, which is also a multi-copy gene system. It is easy to see that some multi-copy gene systems, like viral particles or mtDNAs, have a sub-population of genes within each individual. It is less obvious that tandem arrays of gene copies like rRNA genes can be treated as sub-populations that are subject to drift. Nevertheless, rRNA gene copies frequently transfer mutations among copies in the same cell via the homogenization process. Hence, rRNA genes do not have the "locus" property of single-copy genes, as they move about as well (a bit like transposons but via different mechanisms). Indeed, the collection of rRNA genes in a cell is referred to as the “community of genes” as cited in Fig. 1. Over hundreds of generations, rRNA genes are effectively a small gene pool like mtDNAs within cells. Furthermore, the copy number of rRNA genes also changes rapidly among individuals. For these reasons, genetic drift is operative within cells and this study aims to determine its strength (see Response 3 below).

[Point 2 of the response addresses questions of Reviewer #1 such as "(whether) the authors are referring to diversity in a single copy of an rRNA gene (or) diversity across the entire array of rRNA genes" or "(whether) the discussion of heterozygosity at rRNA ... is diversity per single copy locus or after collapsing loci together". The answer should be "the genetic diversity of the population of rRNA genes in the cell", noting that the concept of a single gene locus does not apply here. Similarly, a comment like "Alignment to a single reference genome would likely lead to incorrect and even failed alignment for some reads" from Reviewer #2 appears to be based on the homology concept of an rRNA gene locus. All rRNA gene copies are aligned against the consensus of the population of genes of the species. The consensus nucleotide nearly always accounts for > 90% of the gene copies in the population.]

(3) We now clarify the meaning of C*, the effective copy number of rRNA genes. We apologize that the abstract is indeed unclear, and even misleading. In the abstract, we did not use different notations for the actual copy number (C) and the effective copy number (C*) of rRNA genes. Instead, we used the letter C to designate both. Furthermore, in the main text, the presentation of the effective number, C*, is overly complicated (in order to be realistic). Slight modifications of the abstract, as shown below, should remove these misunderstandings.

"On average, rDNAs have C ~ 150 - 300 copies per haploid in humans. While a neutral mutation of a single-copy gene would take 4N (N being the population size) generations to become fixed, the time should be 4NC* generations for rRNA genes where 1<< C* (C* being the effective copy number; C* > C or C* <C will depend on the strength of drift). However, the observed fixation time in mouse and human is < 4N, implying the paradox of C* < 1. Genetic drift that encompasses all random neutral evolutionary forces appears as much as 100 times stronger for rRNA genes as for single-copy genes, thus reducing C* to < 1."

[Point 3 responds to the key criticisms. From Reviewer #1: "The authors frame the number of rRNA genes as roughly equivalent to expanding the population size, ... a mutation can spread among rRNA gene copies is fundamentally different …". Indeed, the abstract can be very misleading when it uses CN interchangeably with C*N, essentially by allowing C to mean both.

From Reviewer #2: "In Eq (1), although C* is defined as the "effective copy number", it is unclear what it means in an empirical sense…". From the slightly revised text quoted above, it should be clear that the fixation time as well as the level of polymorphism represent the empirical measures of C*.]
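The arithmetic behind the revised abstract quoted above can be sketched as follows. Only the relations T = 4N for a single-copy gene and T = 4NC* for rRNA genes come from the text. The population size, the copy number, and the observed fixation time used here are hypothetical placeholders, as are the function names.

```python
def expected_fixation_time(pop_size, effective_copy_number=1.0):
    """Expected neutral fixation time in generations: 4 * N * C* (C* = 1 for a single-copy gene)."""
    return 4.0 * pop_size * effective_copy_number

def implied_effective_copy_number(observed_time, pop_size):
    """Solve T = 4 * N * C* for C*, given an observed fixation time T."""
    return observed_time / (4.0 * pop_size)

N = 10_000  # hypothetical population size

t_single = expected_fixation_time(N)               # single-copy gene
t_multi = expected_fixation_time(N, 150)           # if C* equalled a copy number of 150
c_star = implied_effective_copy_number(20_000, N)  # hypothetical observed time below 4N

print(t_single, t_multi, c_star)
```

Read this way, C* is an empirical quantity: any observed fixation time shorter than 4N generations implies C* < 1, whatever the physical copy number C.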

(4) Lastly, we shall address the misunderstood "reproductive success" of rRNA genes, which is the number of progeny, K, in the Haldane model. K should be more accurately referred to as the transmission speed. For single-copy genes, reproductive success and transmission both mean the same thing, K. But the term reproductive success is not appropriate for rRNA genes even though the formulae for K are the same for all gene systems.

[Point 4 responds to all criticisms using the term "reproductive success"]
