The paradox of extremely fast evolution driven by genetic drift in multi-copy gene systems - A resolution

Xiaopei Wang; Yongsen Ruan; Lingjie Zhang; Xiangnyu Chen; Zongkun Shi; Haiyu Wang; Bingjie Chen; Miles Tracy; Haijun Wen; Chung-I Wu

doi:10.7554/eLife.99992.1

eLife assessment

This study attempts to resolve an apparent paradox of rapid evolutionary rates of multi-copy gene systems by using a theoretical model that integrates two classic population models. While the conceptual framework is intuitive and thus useful, the specific model is perplexing and difficult to penetrate for non-specialists. The data analysis of rRNA genes provides inadequate support for the conclusions due to a lack of consideration of technical challenges, mutation rate variation, and the relationship between molecular processes and model parameters.

https://doi.org/10.7554/eLife.99992.1.sa3

Significance of findings

useful: Findings that have focused importance and scope

Strength of evidence

inadequate: Methods, data and analyses do not support the primary claims

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

Multi-copy gene systems that evolve within, as well as between, individuals are common. They include viruses, mitochondrial DNAs, transposons and multi-gene families. The paradox is that their evolution in two stages should be far slower than that of single-copy systems, but the opposite is often true. The paradox has been unresolvable because the standard Wright-Fisher (WF) model for molecular evolution cannot track multi-copy genes. We now apply the newly expanded WF-Haldane (WFH) model to such systems, first on ribosomal RNA (rRNA) genes. On average, rRNA genes have C ∼ 150 - 300 copies per haploid genome in mammals. While a neutral mutation would take 4N (N being the population size) generations to become fixed, the time should be 4NC generations for rRNA genes. However, the observed fixation time in mouse and human is < 4N; thus the paradox means, effectively, C < 1. Genetic drift (i.e., all random neutral forces driving molecular evolution by the WFH model) of rRNA genes appears 200-300 times stronger than that of single-copy genes, thus reducing C to < 1. The large increases in genetic drift are driven by the homogenizing forces of unbiased gene conversion, unequal crossover and replication slippage within individuals. This study is one of the first applications of the WFH model to track random neutral forces of evolution. Such random forces, outside of the standard WF model, are often misinterpreted as the working of natural selection.

Introduction

In this study, we focus on multi-copy gene systems, where evolution takes place both within and between individuals in two stages. Multi-copy gene systems include viruses, transposons, mitochondria and multi-gene families (Alexandrov, et al. 2001; Szitenberg, et al. 2016; Xu, et al. 2019; Ruan, Wen, et al. 2021). Given the large number of copies in the entire population and the two-stage evolution, the neutral evolutionary rate should be much slower than in the standard single-copy systems. However, the rapid evolution of multi-copy gene systems has been extensively documented (Charlesworth, et al. 1994; Eickbush and Eickbush 2007; Jurka, et al. 2007; Hou, et al. 2023). The main reason for this paradox is that the speed of neutral evolution of multi-copy gene systems is beyond the power of the standard WF (Wright-Fisher) model of molecular evolution.

The speed of neutral evolution is the basis for determining how fast or slow all types of molecular evolution take place. Neutral evolution is driven by random transmission, unbiased conversion, stochastic replication etc., which collectively constitute genetic drift. Hence, genetic drift is the fundamental force of molecular evolution. All other evolutionary forces, such as selection, mutation and migration, may be of greater biological interest, but their inferences are possible only when genetic drift is fully accounted for. Under-estimation of genetic drift has been the major cause of the over-estimation of other forces. In the companion study (Ruan, et al. 2024), we show that the conventional Wright-Fisher (WF) model may often under-account for genetic drift.

Ruan, et al. (2024) propose the integration of the WF model with the Haldane model of genetic drift, referred to as the WFH model. The Haldane model is based on the branching process. In haploids, each individual produces K progeny with the mean and variance of E(K) and V(K). Genetic drift is primarily V(K) as there would be no drift if V(K) = 0. Gene frequency change in the population is thus scaled by N (population size), expressed as V(K)/N. In diploids, K would be the number of progeny to whom the gene copy of interest is transmitted. (The adjustments between haploidy and multi-ploidy are straightforward; see Ruan, et al. 2024.)

In the WF model, gene frequency is governed by 1/N (or 1/2N in diploids) because K would follow the Poisson distribution whereby V(K) = E(K). Since E(K) is generally ∼1 in the long run, V(K) would be ∼ 1 as well. (We note that N_e = N/V(K) can be a solution for the WF model, but such a solution would be biologically nonsensical albeit mathematically feasible.) Instead, the WFH model is biologically based on the Haldane model, but with mathematical solutions approximated by the WF model when feasible.

The WF model has led to a number of paradoxes as discussed in (Ruan, et al. 2024). They include i) the paradox of changing N’s whereby the strength of drift may increase as N increases; ii) the paradox of sex chromosomes whereby genetic drift depends on the sex, even when N is fully accounted for; iii) the paradox of drift under selection whereby genetic drift depends on V(K) but not on N. These paradoxes are resolved by the integrated WFH model.

In addition to the many paradoxes, there exist a number of genetic systems that are incompatible with the WF model. The influence of stochastic forces is hence rarely defined in such systems. These may include somatic cell evolution (Ling, et al. 2015; Wu, et al. 2016; Chen, Wu, et al. 2022), domestication (Wang, et al. 2014; Ostrander, et al. 2019) and even interspecific competition, which has stochastic elements as well.

A common multi-copy system is viral evolution, evident in the recent pandemic of COVID-19 (Ruan, Luo, et al. 2021; Ruan, Wen, et al. 2021; Ruan, et al. 2022; Hou, et al. 2023). Another example is systems with a large number of nearly identical genes within the same genome, such as transposons (Szitenberg, et al. 2016), mitochondrial DNAs (Xu, et al. 2019), satellite DNAs (Wu, et al. 1989; Cabot, et al. 1993; Alexandrov, et al. 2001), and ribosomal RNA genes (van Sluis and McStay 2019; Hori, et al. 2021). In these examples, the copy number (designated C) is in the hundreds or thousands per individual. Nevertheless, C = 2 as in all diploids is also a multi-copy gene system as the two copies may often evolve interactively via gene conversion or segregation distortion (Wu, et al. 1988; Tao, et al. 2001; Wyckoff, et al. 2002; McDermott and Noor 2010). The WF model has been noted to be inadequate even in such diploid systems (Charlesworth 2009; Charlesworth and Charlesworth 2010; Chen, et al. 2017). After all, the WF model is essentially a haploid model with 2N copies in the population.

We now apply the Haldane model to track the genetic drift in multi-copy gene systems, using ribosomal RNA (rRNA) genes as an example. The absence of an adequate model has led to speculations of adaptation by natural selection (Dover 1982; Lu and Wu 2005; Nair, et al. 2008; Arkhipova 2018; Chen, Yang, et al. 2022). We will show that neutral stochasticity would be a simpler explanation if the more powerful Haldane model is adopted. The analysis also reveals a new paradox of genetic drift; namely, the “effective population size” within individuals can in fact be smaller than 1.

Results

PART I presents one of the best-known multi-gene systems, the rRNA genes. PART II (Theory) consolidates aspects of the Haldane model with respect to polymorphism within species as well as divergence between species. In PART III (Data analyses), we apply the theory to rDNA evolution in mice and apes (human and chimpanzee).

PART I - The biology of rRNA gene clusters

The ribosomal RNA genes (rDNAs) are multi-copy gene clusters (Bowman, et al. 2020) that are arrayed as tandem repeats on multiple chromosomes as shown in Fig. 1A (Guillén, et al. 2004; Cazaux, et al. 2011). In humans, the copy number can vary from 60 to 1600 per individual (mean, 315; SD, 104; median, 301) (Parks, et al. 2018). For each haploid genome, C ∼ 150 in humans and C ∼ 110 in mice (Parks, et al. 2018). In humans, the five rRNA clusters are located on the short arm of the five acrocentric chromosomes (Smirnov, et al. 2021). Such an arrangement permits crossovers between chromosomes without perturbing the rest of the genomes. In Mus, the rDNAs are all located in the pericentromeric or sub-telomeric region, on the long arms of telocentric chromosomes (Cazaux, et al. 2011; Potapova and Gerton 2019). Thus, unequal crossovers between non-homologous chromosomes may involve centromeres while other genic regions are also minimally perturbed.

Figure 1. The “chromosome community” of rDNAs on five acrocentric chromosomes.
(A) The genomic locations of rDNA tandem repeats in human (Gibbons, et al. 2015) and mouse (Cazaux, et al. 2011). rDNAs are located on the short arms (human), or the proximal end of the long arms (mouse), of the acrocentric chromosomes. Either way, inter-chromosomal exchanges are permissible. (B) The organization of the rDNA repeat unit. IGS (intergenic spacer) is not transcribed. Among the transcribed regions, the 18S, 5.8S and 28S segments are in the mature rRNA while ETS (external transcribed spacer) and ITS (internal transcribed spacer) are excluded. (C) The “chromosome community” map modified from Fig. 1 of (Guarracino, et al. 2023). The map shows the distance of divergence among chromosome segments. The large circle encompasses rDNAs from all 5 chromosomes with color dots indicating the chromosome origin. It is clear that rDNAs experience concerted evolution regardless of their genomic locations. The slightly smaller thin circle, from the analysis of this study, shows that the rDNA gene pool from each individual captures approximately 95% of the total diversity of the human population. (D) A simple illustration that shows the transmissions of two new mutations (#1 and #2 in red letters). Mutation 1 experiences replication slippage, gene conversion and unequal crossover and grows to 9 copies (K = 9) after transmission. Mutation 2 emerges and disappears (K = 0). This shows how V(K) may be augmented by the homogenization process.

Each copy of rRNA gene has functional and non-functional parts as shown in Fig. 1B. The “functional” regions of rDNA, 18S, 5.8S, and 28S rDNA, are believed to be under strong negative selection, resulting in a slow evolution rate since the divergence of animals and plants (Salim and Gerton 2019). In contrast, the transcribed spacer (ETS and ITS) and the intergenic spacer (IGS) show significant sequence variation even among closely related species (Eickbush and Eickbush 2007). This suggests that these “non-functional” sequences are less constrained by negative selection. In this study of genetic drift, we focus on the non-functional parts. Data on the evolution of the functional parts will be provided only for comparisons.

The “chromosome community” of ribosomal RNA genes

While a human individual with 300 rRNA genes may appear to have 300 loci, the concept of distinct “gene loci” cannot be applied to the rRNA gene clusters. This is because DNA sequences can spread from one copy to others on the same chromosome via replication slippage. They can also spread among all other copies on the same or different chromosomes via gene conversion and unequal crossovers (Nagylaki 1983; Ohta and Dover 1983; Stults, et al. 2008; Smirnov, et al. 2021). Replication slippage and unequal crossovers would also alter the copy number of rRNA genes, thus accounting for the copy number variation among individuals. These mechanisms will be referred to collectively as the homogenization process, resulting in a non-Mendelian pattern of inheritance.

Copies of the cluster on the same chromosome are widely known to be nearly identical in sequence (Hori, et al. 2021; Nurk, et al. 2022). Previous research has also provided extensive evidence that rDNA clusters among nonhomologous chromosomes share the same variants, thus confirming genetic exchange between chromosomes (Krystal, et al. 1981; Arnheim, et al. 1982; van Sluis, et al. 2019). For example, Nurk, et al. (2022) report a 5-kb segment that covers mostly non-functional sites showing a 98.7% identity among the short arms of the five acrocentric chromosomes. Additionally, variants are often found in all human rDNAs but not in other great apes, suggesting rapid fixation (Arnheim, et al. 1980). The rapid fixation is unexpected because, for a new rDNA mutation to be fixed, it has to be fixed among all ∼ 300 copies in all individuals in the population. In this process, both rapid drift and positive selection can contribute to these fixations. However, the main difference between them is that genetic drift only shortens the fixation time of mutations without changing their fixation probability, whereas selection can alter both. As a result, rapid drift can only accelerate the evolutionary rate up to the neutral molecular evolutionary rate (Eq. (3) below), while positive selection is not constrained by this limit.

With the extensive exchanges via the homogenization process, copies of rRNA genes in an individual can be treated as a sub-population of gene copies. Such sub-populations do not have to be in the sense of a Mendelian population when using the branching process to track genetic drift (see Discussion and Chen et al. 2017). Guarracino, et al. (2023) recently proposed the view of a “chromosome community” for the short arms of the 5 acrocentric chromosomes, where rRNA genes are located. As seen in Fig. 1C, these five short arms harbor a pool of rRNA genes that do not have strong associations with genomic locations. Fig. 1C further illustrates that each human individual’s rRNA gene pool can capture ∼ 95% of the total diversity of the human population (see PART III). Fig. 1D presents the molecular mechanisms of genetic drift within individuals (i.e., Stage I evolution) in relation to the multi-copy gene systems.

PART II - Theory

1. The Haldane model of genetic drift applied to multi-copy genetic systems

The Haldane model of genetic drift based on the branching process is intuitively more appealing. In the model, each copy of the gene leaves K copies in a time interval with the mean and variance of E(K) and V(K). If V(K) = 0, there is no gene frequency change and no genetic drift. In the WF model, V(K) = E(K) whereas V(K) is decoupled from E(K) in the Haldane model; the latter thus being more flexible.
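The dependence of drift on V(K) can be checked with a small simulation. The following is only a minimal sketch under stated assumptions, not the procedure used in this study: the offspring distribution (K = m with probability 1/m, otherwise 0, so that E(K) = 1 and V(K) = m - 1), the population of 200 copies, the starting frequency of 0.5 and the function names are all hypothetical choices made for illustration.

```python
import random

def draw_offspring(m, rng):
    """K = m with probability 1/m, else 0, so E(K) = 1 and V(K) = m - 1."""
    return m if rng.random() < 1.0 / m else 0

def next_frequency(n_mut, n_wt, m, rng):
    """One generation of the branching process; returns the new mutant frequency."""
    while True:
        mut = sum(draw_offspring(m, rng) for _ in range(n_mut))
        wt = sum(draw_offspring(m, rng) for _ in range(n_wt))
        if mut + wt > 0:  # resample in the very rare event that no copy survives
            return mut / (mut + wt)

def drift_variance(n_copies, p, m, reps, seed=1):
    """Mean squared one-generation change in mutant frequency over `reps` replicates."""
    rng = random.Random(seed)
    n_mut = round(n_copies * p)
    n_wt = n_copies - n_mut
    p0 = n_mut / n_copies
    freqs = [next_frequency(n_mut, n_wt, m, rng) for _ in range(reps)]
    return sum((f - p0) ** 2 for f in freqs) / reps

# Same copy number and starting frequency, two values of V(K) (hypothetical settings).
low = drift_variance(200, 0.5, m=2, reps=2000)    # V(K) = 1, as in the WF model
high = drift_variance(200, 0.5, m=11, reps=2000)  # V(K) = 10
print(f"V(K)=1: {low:.5f}   V(K)=10: {high:.5f}   ratio: {high / low:.1f}")
```

Both runs use the same number of copies, so any difference in the variance of frequency change comes from V(K) alone, which is the sense in which V(K) is decoupled from E(K) in the Haldane model.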

Below, we compare the strength of genetic drift in rRNA genes vs. that of single-copy genes using the branching process. E(K) is set to 1 so that we can focus on V(K). For single-copy genes, the effective population size that determines genetic drift would be N_e = N/V(K) (Kimura and Crow 1963). Details are presented in the companion paper (Ruan, et al. 2024). We shall use * to designate the equivalent symbols for rRNA genes. Hence,

N_e* = N*/V*(K) = C*N/V*(K)    (1)

where N* is the number of rRNA genes in the entire population and, therefore, N* = C*N. For rRNA genes, C* >> 1 but < 150, C* being the effective copy number of rRNA genes. (C = 1 for single-copy genes.) The value of C* is uncertain because each haploid in humans, on average, has ∼ 150 copies that do not segregate but assort among chromosomes (Parks, et al. 2018). V*(K) for rRNA genes is likely to be much larger than V(K) because K for rRNA mutations may be affected by a host of factors including replication slippage, unequal crossover, gene conversion and other related mechanisms not operating on single-copy genes. We thus designate Vh = V*(K)/V(K) as the homogenizing factor of rRNA genes.

Fig. 1D is a simple illustration that the homogenizing process may enhance V(K) substantially over the conventional WF model as each mechanism amplifies the variance in K. Importantly, the homogenization process is considered stochastic rather than deterministic because it affects both the mutants and wild type equally, as expected of genetic drift.

We then designate R* = (1/N_e*)/(1/N_e) as the strength of genetic drift of rRNA genes, relative to that of single-copy genes. Since N_e* = C*N/V*(K) and N_e = N/V(K),

R* = [V*(K)/V(K)]/C* = Vh/C*    (2)

R* is measurable, as will be done below.

In short, R* is the ratio of two factors. A larger Vh [=V*(K)/V(K)] speeds up the drift strength of rRNA genes whereas a larger C* slows down the process of fixation by drift. It would be most interesting if R* > 1 as it would mean that the effective population size of rRNA genes is smaller than single-copy genes. In this case, the homogenization force should be strong enough to overcome the large copy number of rRNA genes (∼ 150) during the fixation of new mutations.
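The trade-off between the two factors can be made concrete with a few lines of arithmetic. This is a sketch only: it assumes R* = Vh/C*, which follows from N_e = N/V(K), N* = C*N and Vh = V*(K)/V(K) as defined above, and the function name and the three Vh values are hypothetical.

```python
def relative_drift(v_h: float, c_star: float) -> float:
    """R* = Vh / C*: drift strength of rRNA genes relative to single-copy genes."""
    if v_h < 0 or c_star <= 0:
        raise ValueError("Vh must be non-negative and C* must be positive")
    return v_h / c_star

# Hypothetical inputs: C* = 150 copies per haploid and three homogenizing factors.
for v_h in (1.0, 150.0, 300.0):
    print(f"Vh = {v_h:6.1f}  ->  R* = {relative_drift(v_h, 150.0):.4f}")
```

Under this relation, R* exceeds 1 only when Vh exceeds C*; with ∼ 150 copies per haploid, the homogenizing factor would have to be larger than ∼ 150.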

2. rDNA polymorphism within species

For rDNAs, there are 3 levels of diversity: H_I, H_S and H_T. They represent, respectively, the heterozygosity within individuals (I), within species (S) and in the total data set (T, which often comprises two species or more). Briefly, H = 2p(1-p) if there are two variants with the frequencies of p and 1-p. In an equilibrium haploid population,

Given u, H_S is a measure of genetic drift within a population. For rRNA genes, H_S is expected to be larger than that of single-copy genes as N* > N. With the three levels of heterozygosity for multi-copy genes, there are two levels of differentiation. First, F_IS is the differentiation among individuals within the species, defined by

F_IS = (H_S - H_I)/H_S

Second, F_ST is the differentiation between populations (or species) in the context of total diversity, defined as

F_ST = (H_T - H_S)/H_T

The two F statistics will inform about the distribution of nucleotide diversity among individuals, or between species. H_T is the expected heterozygosity when there is no subdivision between species and H_S is the observed heterozygosity. Therefore, F_ST will be equal to 0 when there is no subdivision and will be positive as long as there is variation in allele frequency across subdivisions.
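The following sketch shows how the two F statistics could be computed from the three heterozygosity levels. It assumes F_IS = (H_S - H_I)/H_S, the form used for the mouse data in PART III, and the standard F_ST = (H_T - H_S)/H_T. The function names are hypothetical, and the within-individual value of 0.0063 is an illustrative number inside the range reported for mice, not a figure taken from Table 1.

```python
def heterozygosity(p: float) -> float:
    """H = 2p(1-p) for two variants at frequencies p and 1-p."""
    return 2.0 * p * (1.0 - p)

def f_is(h_i: float, h_s: float) -> float:
    """Differentiation among individuals within a species."""
    return (h_s - h_i) / h_s

def f_st(h_s: float, h_t: float) -> float:
    """Differentiation between populations (or species) relative to total diversity."""
    return (h_t - h_s) / h_t

h_s_mouse = 0.0073    # species-level heterozygosity reported for mouse rDNA
h_i_example = 0.0063  # illustrative within-individual value (assumption)
fis = f_is(h_i_example, h_s_mouse)
print(f"F_IS = {fis:.3f}; share of diversity within one individual = {1 - fis:.3f}")
```

A small F_IS means that H_I is close to H_S, i.e., that a single individual already carries most of the species-level diversity.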

3. rDNA divergence between species

Whereas the level of genetic diversity is a function of the effective population size, the rate of divergence between species, in its basic form, is not. The rate of neutral molecular evolution (λ), although driven by mutation and genetic drift, is generally given by Eq. (3) below (Crow and Kimura 1970; Hartl, et al. 1997; Li 1997):

λ = u    (3)

Hence, we might view the evolution of rRNA genes in the long run to be the same as single-copy genes. However, Eq. (3) is valid only in long-term evolution. For shorter-term evolution, it is necessary to factor in the fixation time (Fig. 2), T_f, which is the time between the emergence of a mutation and its fixation. If we study two species with a divergence time (T_d) equal to or smaller than T_f, then few mutations of recent emergence would have been fixed as species diverge.

Figure 2. Fixation of mutations at two levels of species divergence, T_d1 and T_d2.
T_f1 and T_f2 are mutations with a shorter and a longer fixation time, respectively, for single-copy and multi-copy genes. Note that mutations with a longer T_f would show a lower fixation rate in short-term evolution.

Note that T_d is about 6 million years (Myrs) between human and chimpanzee while T_f (as measured by coalescence) is 0.6 - 1 Myrs in humans. Mutations of single-copy genes that arose within the most recent T_f would not yet be fixed, as indicated in Figure 2. Thus, λ may be 1/10 to 1/6 lower than the theoretical value. In comparison, T_f* of rRNA mutations should be at least 4.8 - 8 Myrs based on the C* estimate (C* ∼ 8; see PART III). Thus, λ may be at least 80% lower than calculated for single-copy genes.
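The fractions quoted above follow from the ratio T_f/T_d. The sketch below reproduces that arithmetic under the simplifying assumption that the shortfall in λ equals T_f/T_d, capped at 1 when T_f exceeds T_d; the function name is hypothetical.

```python
def unfixed_fraction(t_f: float, t_d: float) -> float:
    """Fraction of the divergence time during which new mutations are not yet fixed."""
    if t_f < 0 or t_d <= 0:
        raise ValueError("T_f must be non-negative and T_d must be positive")
    return min(1.0, t_f / t_d)

T_D_HUMAN_CHIMP = 6.0  # Myrs, as stated in the text

cases = [
    ("single-copy, T_f = 0.6 Myrs", 0.6),
    ("single-copy, T_f = 1.0 Myrs", 1.0),
    ("rRNA, T_f* = 4.8 Myrs", 4.8),
    ("rRNA, T_f* = 8.0 Myrs", 8.0),
]
for label, t_f in cases:
    frac = unfixed_fraction(t_f, T_D_HUMAN_CHIMP)
    print(f"{label:30s} reduction in rate = {frac:.2f}")
```

When T_f* exceeds T_d, the cap at 1 corresponds to the case in which essentially no recent mutation is expected to have reached fixation.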

Our own theoretical derivation has shown that the fixation time for rRNA genes is close to 4N_e*, as is the case for single-copy genes at ∼ 4N_e. If V*(K) is sufficiently larger than V(K), it is in fact possible for N_e* < N_e such that genetic drift is stronger for rRNA genes than for single-copy genes and T_f* < T_f. Therefore, T_f1 can represent the T_f* of mutations in rRNA genes in Fig. 2. This is interesting because, if the homogenization is powerful enough, rRNA genes would have an effective copy number of C* < 1. A short T_f*, which can be obtained with a large V*(K), would lead to a small T_f*/T_d and thus a higher λ, particularly in short-term evolution. However, even if T_f* approaches 0, λ can exceed that of single-copy genes but is still limited by the fixation probability. As noted above, the rapidly fixed mutations may not be well represented in the polymorphic data but can accumulate in species divergence.

PART III - Data Analyses

Before presenting the main analyses, we provide some empirical observations on the rapid homogenization of rRNA genes within individuals (Stage I). These observations are needed for PART III that comprises i) the analyses of rDNA polymorphisms within species in mouse and human; and ii) the analysis of the divergence between species.

Empirical measurements of homogenization within cells

In an accompanying study (Wang, et al. unpublished data), the evolutionary rate of neutral rRNA variants within cells is measured. Here, genetic drift operates via the homogenizing mechanisms that include gene conversion, unequal crossover and replication slippage. In the literature, measurements of neutral rRNA evolution are usually based on comparisons among individuals. Therefore, the Mendelian mechanisms of chromosome segregation and assortment would also shuffle variants among individuals. Segregation and assortment would confound the measurements of homogenization effects within individuals.

In one experiment, the homogenization effects in rDNAs are measured in cultured cell lines over 6 months of evolution. For in vivo homogenization, we analyze the evolution of rRNA genes within solid tumors. We estimate the rate at which rRNA variants spread among copies within the same cells, which undergo an asexual process. The measurements suggest that, in the absence of recombination and chromosome assortment, the fixation of new rRNA mutations within cells would take only 1 - 3 kyrs (thousand years). Since a new mutation in single-copy genes would take 300 - 600 kyrs to be fixed in human populations, the speed of genetic drift in Stage I evolution is orders of magnitude faster than in Stage II. Therefore, despite having several hundred copies of rRNA genes per genome, the speed of genetic drift in rRNA genes may not be that much slower than in single-copy genes. This postulate will be tested below.

1. rDNA polymorphism within species

1) Polymorphism in mice

For rRNA genes, H_I of 10 individuals ranges from 0.0056 to 0.0067 while H_S is 0.0073 (Table 1). Thus, F_IS = [H_S - H_I]/H_S for mice is 0.14, which means 86% of the variation is within each individual. In other words, even a single randomly chosen individual would yield 86% of the diversity of the whole species. Hence, the estimated H_S should be robust as it is not affected much by the sampling.

Table 1. rRNA gene diversity in the 10 M. m. domesticus strains of a global collection.

H_S for M. m. domesticus single-copy genes is roughly 1.40 per kb genome-wide (Geraldes, et al. 2008) while H_S for rRNA genes is 7.25 per kb, 5.2 times larger. In other words, C* = N*/N ∼ 5.2. If we use the polymorphism data, it is as if the rDNA array has a population size 5.2 times larger than single-copy genes. Although the actual copy number on each haploid is ∼ 110, these copies do not segregate like single-copy genes and we should not expect N* to be 100 times larger than N. The H_S results confirm the prediction that rRNA genes should be more polymorphic than single-copy genes.
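The estimate of C* from polymorphism is a simple ratio of heterozygosities, sketched below with the mouse values quoted above. It rests on the assumption, implicit in the text, that the mutation rate is the same in rDNA and in single-copy sequences, so that the ratio of H_S values reflects the ratio N*/N. The function name is hypothetical.

```python
def effective_copy_number(h_s_multicopy: float, h_s_single: float) -> float:
    """C* = N*/N, estimated as the ratio of species-level heterozygosities."""
    if h_s_single <= 0:
        raise ValueError("single-copy heterozygosity must be positive")
    return h_s_multicopy / h_s_single

ACTUAL_COPIES_MOUSE = 110  # approximate copies per haploid genome in mice

c_star_mouse = effective_copy_number(7.25, 1.40)  # both values per kb
share = c_star_mouse / ACTUAL_COPIES_MOUSE
print(f"C* = {c_star_mouse:.2f}, about {share:.1%} of the actual copy number")
```

The gap between C* ∼ 5 and the ∼ 110 physical copies is the quantity that the homogenization process is invoked to explain.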

Based on the polymorphism result, one might infer slower drift for rDNAs than for single-copy genes. However, the results from the divergence data in later sections will reveal the under-estimation of drift strength from polymorphism data. Such data would miss variants that have a fast drift process driven by, for example, gene conversion. Strength of genetic drift should therefore be measured by the long-term fixation rate.

2) Polymorphism in human

F_IS for rDNA among 8 human individuals is 0.059, much smaller than 0.142 in M. m. domesticus mice, suggesting a major role of Stage I evolution in humans. Consistent with the low F_IS, Fig. 3 shows a strong correlation of the polymorphic site frequency of the rDNA transcribed region between each pair of individuals from three continents (2 Asians, 2 Europeans and 2 Africans). Correlation of polymorphic sites in the IGS region is shown in Supplementary Fig. 1. Like those in mice, the pattern indicates that intra-species polymorphism is mainly preserved within individuals. In addition, the 6 panels on the diagonal in Fig. 3 demonstrate that the allele frequency spectrum is highly concentrated at very low frequency within individuals, indicating rapid homogenization of rDNA mutations among different copies, even on different chromosomes. This homogenization can lead to the elimination of the genetic variation of this region. The observed H_I of humans for rDNAs is 0.0064 to 0.0077 and the H_S is 0.0072 (Table 2). Research has shown that heterozygosity for the human genome is about 0.00088 (Zhao, et al. 2000), meaning the effective copy number of rDNAs is roughly 8, or C* ∼ 8. This reduction in effective copy number from 150 to 8 indicates strong genetic drift due to the homogenization force.

Figure 3. Correlation of variant frequencies between human individuals.
The pairwise correlation of variant site frequency in the transcribed region of rDNAs among 6 individuals (2 Asians, 2 Europeans, and 2 Africans). The high correlations suggest that the diversity in each individual can well capture the population diversity. Each color represents a region of rDNA. The diagonal plots present the variant frequency distribution. The upper right section summarizes the Pearson correlation coefficients derived from the mirror symmetric plots in the bottom left. The analysis excluded the 18S, 3’ETS, and 5.8S regions due to the limited polymorphic sites. The result for IGS region is presented in Supplementary Figure S1.

Table 2. rRNA gene diversity in the 8 humans of a global collection.

2. rDNA divergence between species

We now consider the evolution of rRNA genes between species by analyzing the rate of fixation (or near fixation) of mutations. Polymorphic variants are filtered out in the calculation.

1) Between mouse species - Genetic drift as the sole driving force of the rapid divergence

We now use the F_ST statistic to delineate fixation and polymorphism. The polymorphism in M. m. domesticus is compared with two outgroup species, M. spretus and M. m. castaneus, respectively. There are hence two depths in the phylogeny with two T_d’s, as shown in Fig. 4A (Rudra, et al. 2016; Kumar, et al. 2022). There is a fourth species, M. m. musculus (shown in grey in Fig. 4A), which yields results very similar to those of M. m. domesticus in these two comparisons. These additional analyses are shown in Supplementary Tables S1-S2.

Figure 4. Levels of polymorphism and divergence in mice.
(A) Phylogeny of Mus musculus and Mus spretus mice. The divergence times are obtained from http://timetree.org/. The line segment labelled 0.5 represents 0.5 Myrs. (B) F_IS distribution within M. m. domesticus. The distribution of F_IS for polymorphic sites in 3 outbred mouse strains, or in the 10 mouse strains of Table 1 (including 7 inbred mice; inset), is shown. (C) F_ST distribution between M. m. domesticus and Mus spretus. Note that the F values rise above 0.8 only in (C).

The F_IS values of polymorphic sites in 3 outbred mice are primarily below 0.2 and rarely above 0.8 in Fig. 4B, indicating low genetic differentiation in rDNAs within these 3 M. m. domesticus. The F_IS distribution of the 10 mice from Table 1, including 7 inbred and 3 outbred mouse strains, exhibits a noticeable right skew but does not exceed 0.8. This suggests that inbreeding to a certain extent limits the process of homogenization and enhances population differentiation. In comparison, the distribution of the F_ST of variant sites between M. m. domesticus and Mus spretus has a large peak near F_ST = 1. This peak in Fig. 4C represents species divergence not seen within populations (i.e., F_IS). We use F_ST = 0.8 as a cutoff for divergence sites between the two species. Roughly, when a mutant is > 0.95 in frequency in one species and < 0.05 in the other, F_ST would be > 0.80.
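The correspondence between the 0.95/0.05 frequencies and the cutoff of 0.8 can be verified directly. The sketch assumes H = 2p(1-p), the standard F_ST = (H_T - H_S)/H_T, and equal weighting of the two species when averaging H_S and the pooled frequency; the equal weighting and the function names are assumptions made for illustration.

```python
def heterozygosity(p: float) -> float:
    """H = 2p(1-p) for two variants at frequencies p and 1-p."""
    return 2.0 * p * (1.0 - p)

def f_st_two_species(p1: float, p2: float) -> float:
    """F_ST for one biallelic site in two equally weighted species."""
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2.0  # mean within-species H
    h_t = heterozygosity((p1 + p2) / 2.0)                  # H at the pooled frequency
    if h_t == 0.0:
        return 0.0  # site is monomorphic in the total sample
    return (h_t - h_s) / h_t

print(f"p = 0.95 vs 0.05: F_ST = {f_st_two_species(0.95, 0.05):.2f}")
print(f"p = 0.90 vs 0.10: F_ST = {f_st_two_species(0.90, 0.10):.2f}")
```

With these assumptions, frequencies of 0.95 and 0.05 give F_ST = 0.81, just above the cutoff, while 0.90 and 0.10 give 0.64 and would be counted as polymorphic.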

We first compare the divergence between M. m. domesticus and M. m. castaneus whereby T_d has been estimated to be less than 0.5 Myrs (Fujiwara, et al. 2022). In comparison, between M. m. domesticus and Mus spretus, T_d is close to 3 Myrs (Rudra, et al. 2016). As noted above, the reduction in the divergence rate relative to that of Eq. (3) is proportional to T_f/T_d (for single-copy genes) or T_f*/T_d (for rRNA genes). As T_f and T_f* are both from M. m. domesticus and T_d is 6 times larger in the comparison with M. spretus, we expect the results to be quite different between the two sets of species comparisons.

Although T_f and T_f* estimates are less reliable than T_d estimates, both comparisons use the same T_f and T_f* from M. m. domesticus. Hence, the results should be qualitatively unbiased. For a demonstration, we shall use the estimates of T_f (i.e., the coalescence time) at 0.2 Myrs for single-copy genes by using an average nucleotide diversity of 0.0014 and the mutation rate of 5.7×10^-9 per base pair per generation (Geraldes, et al. 2008; Phifer-Rixey, et al. 2020). Based on the estimated C* above, we obtain T_f* for rDNA mutations at 5×0.2 Myrs, or 1 Myrs. While some have estimated T_f to be close to 0.4 Myrs (Fujiwara, et al. 2022), we aim to show that the pattern of reduction in rRNA divergence is true even with the conservative estimates.

Between M. m. domesticus and M. m. castaneus, the reduction in substitution rate for single-copy genes should be ∼ 40% (T_f/T_d = 0.2/0.5), and the reduction for rRNA genes should be 100% (T_f*/T_d = 1/0.5 > 1). Table 3 on their DNA sequence divergence shows that rRNA genes are indeed far less divergent than single-copy genes. In fact, only a small fraction of rRNA mutations is expected to be fixed as T_f* for rDNA at 1 Myrs is 2 times larger than the divergence time, T_d. We should note again that the non-negligible fixation of rRNA mutations suggests that C* at 5 is perhaps an over-estimate.

Table 3. Divergence in rRNA genes between M. m. domesticus and M. m. castaneus.

Between M. m. domesticus and Mus spretus, the reduction in the actual evolutionary rate from the theoretical limit should be 6.7% for single-copy genes (T_f/T_d = 0.2/3) and 33% for rRNA genes (T_f*/T_d = 1/3). The evolutionary rate (i.e. the fixation rate) of the IGS region is lower than that of single-copy genes, 0.01 in IGS vs. 0.021 genome-wide (Table 4), as one would expect. However, the ETS and ITS regions have evolved at a surprising rate that is 12% higher than single-copy genes. Note that the reduction in C*, even to the lowest limit of C* = 1, would only elevate the rate of fixation in rRNA genes to parity with single-copy genes. From Eq. (2), the explanation would be that the homogenization factor V*(K)/V(K) has to be very large, such that C* is < 1. With such rapid homogenization, the fixation time approaches 0 and the evolutionary rate in rRNA genes can indeed reach the theoretical limit of Eq. (3). In such a scenario, the substitution rate in ETS and ITS, compared to single-copy genes in mice, may increase by 7%, T_f/(T_d - T_f) = 0.2/(3 - 0.2). If we use T_f ∼ 0.4 Myrs in an alternative estimation, the increase can be up to 15%.
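The percentages in the two mouse comparisons can be reproduced as follows. This is a sketch of the arithmetic only: the reduction is taken as T_f/T_d (capped at 1) and the largest gain from near-instant fixation as T_f/(T_d - T_f), as stated in the text; the function names are hypothetical.

```python
def rate_reduction(t_f: float, t_d: float) -> float:
    """Shortfall of the observed rate from the limit of Eq. (3), as T_f/T_d (max 1)."""
    return min(1.0, t_f / t_d)

def max_gain_over_single_copy(t_f: float, t_d: float) -> float:
    """Largest rate gain over single-copy genes when rRNA fixation time approaches 0."""
    if t_d <= t_f:
        raise ValueError("requires T_d > T_f")
    return t_f / (t_d - t_f)

T_F, T_F_STAR = 0.2, 1.0  # Myrs, estimates for M. m. domesticus used in the text
comparisons = {"M. m. castaneus": 0.5, "Mus spretus": 3.0}  # T_d in Myrs

for species, t_d in comparisons.items():
    single = rate_reduction(T_F, t_d)
    rrna = rate_reduction(T_F_STAR, t_d)
    print(f"{species:16s} single-copy: -{single:.1%}   rRNA: -{rrna:.1%}")

for t_f in (0.2, 0.4):
    gain = max_gain_over_single_copy(t_f, 3.0)
    print(f"T_f = {t_f} Myrs: maximum gain vs single-copy = +{gain:.1%}")
```

The second loop gives the ceiling that drift alone can reach in the M. spretus comparison, against which the observed 12% excess in ETS and ITS is judged.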

Table 4. Divergence in rRNA genes between *M. m. domesticus* and *Mus spretus*.

In conclusion, the high rate of fixation in ETS and ITS may be due to very frequent gene conversions that reduce C* to less than 1. In contrast, IGS may have undergone fewer gene conversions, and its long-term C* is slightly larger than 1. Indeed, the heterozygosity in the IGS region, about 2-fold higher than that of the ETS and ITS regions (8‰ for IGS, 5‰ for ETS and 3‰ for ITS), supports this interpretation.

2) Between Human and Chimpanzee - Positive selection in addition to rapid drift in rDNA divergence

As in the mouse data, the polymorphism of rDNAs in humans would suggest a slower long-term evolutionary rate. The same caveat applies: C* estimated from the polymorphism data would have missed those rapidly fixed variants. Hence, the long-term C* obtained from species divergence might be much smaller than 8.

Our results show that the evolutionary rate of rRNA genes between human and chimpanzee is substantially higher than that of single-copy genes (Table 5). In particular, the 5’ETS region shows a 100% rate acceleration, at 22.7‰ vs. 11‰ genome-wide. Even after removing CpG sites, which are considered to have a ten-fold higher mutation rate, the fixation rate still reaches 22.4‰. In this case, even if C* << 1, the extremely rapid fixation would increase the substitution rate by only T_f/(T_d − T_f), or 11%, compared to single-copy genes. Thus, the much accelerated evolution of rRNA genes between humans and chimpanzees cannot be entirely attributed to genetic drift. Other forces, such as biased gene conversion, must have contributed to this acceleration as well.
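The size of the gap between the observed acceleration and the drift-only ceiling can be checked directly from the figures quoted above. The short Python sketch below uses only those numbers; the variable names are illustrative.

```python
# Divergence values quoted in the text, in per mille.
OBSERVED_5ETS = 22.7     # 5'ETS region, human vs. chimpanzee
GENOME_WIDE = 11.0       # single-copy genes, genome-wide
DRIFT_CEILING = 0.11     # maximum relative increase from drift alone, T_f / (T_d - T_f)

# Relative acceleration of the 5'ETS region over single-copy genes.
observed_acceleration = OBSERVED_5ETS / GENOME_WIDE - 1.0

# Part of the acceleration that rapid drift cannot explain.
unexplained = observed_acceleration - DRIFT_CEILING

print(f"observed: {observed_acceleration:.0%}, "
      f"drift ceiling: {DRIFT_CEILING:.0%}, unexplained: {unexplained:.0%}")
```

Because the observed acceleration is roughly ten times the ceiling, the conclusion does not hinge on the precise value of T_f.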

Table 5. Divergence in rRNA genes between Human and Chimpanzee.

3) Biased gene conversion - Positive selection for rRNA mutations in apes, but not in mice

As stated, gene conversion should increase V(K) and speed up genetic drift. Conversion is commonly observed in highly recombinogenic regions, especially in repetitive sequences. If gene conversion is unbiased, the effect on the rate of divergence between species would be quite modest unless T_d ≤ T_f. This is because unbiased conversion would only reduce the fixation time, T_f, as demonstrated in Fig. 2.

In this context, it is necessary to define unbiased gene conversion as it has different meanings at the site level vs. at the gene level. At the site level, when an AT site is paired with a GC site, it has been commonly reported that gene conversion would favor AT-to-GC over GC-to-AT conversion (Jeffreys and Neumann 2002; Meunier and Duret 2004). We first designate the proportion of AT-to-GC conversion as f and the reciprocal, GC-to-AT, as g. Given f ≠ g, this bias is true at the site level. However, at the genic level, the conversion may in fact be unbiased for the following reason. Let p be the proportion of AT sites and q be the proportion of GC sites in the gene. The flux of AT-to-GC would be pf and the flux in reverse, GC-to-AT, would be qg. At equilibrium, pf = qg. Given f and g, the ratio of p and q would eventually reach p/q = g/f. Thus, gene conversion at the genic level would be unbiased when the AT/GC ratio is near the equilibrium. In other words, there is no selection driving either AT-to-GC or GC-to-AT changes.
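The equilibrium argument can be illustrated numerically. The Python sketch below iterates the two fluxes, pf and qg, from an arbitrary starting composition and confirms that the AT/GC ratio settles at g/f even though f ≠ g. The values of f and g are hypothetical per-step conversion probabilities chosen only for illustration; they are not estimates from this study.

```python
def at_fraction_trajectory(p0: float, f: float, g: float, steps: int) -> list:
    """Iterate p' = p - p*f + q*g, where p is the AT fraction and q = 1 - p."""
    if not (0.0 <= p0 <= 1.0 and 0.0 <= f <= 1.0 and 0.0 <= g <= 1.0):
        raise ValueError("p0, f and g must lie in [0, 1]")
    p = p0
    trajectory = [p]
    for _ in range(steps):
        q = 1.0 - p
        p = p - p * f + q * g   # lose AT sites at rate f, gain them at rate g
        trajectory.append(p)
    return trajectory


F_AT_TO_GC = 0.02   # hypothetical site-level AT-to-GC conversion rate (f)
G_GC_TO_AT = 0.01   # hypothetical site-level GC-to-AT conversion rate (g)

trajectory = at_fraction_trajectory(0.5, F_AT_TO_GC, G_GC_TO_AT, steps=2000)
p_eq = trajectory[-1]
q_eq = 1.0 - p_eq
print(f"p/q at equilibrium: {p_eq / q_eq:.4f}, g/f: {G_GC_TO_AT / F_AT_TO_GC:.4f}")
print(f"fluxes at equilibrium: pf = {p_eq * F_AT_TO_GC:.6f}, qg = {q_eq * G_GC_TO_AT:.6f}")
```

At equilibrium the two fluxes are equal, so a site-level bias (f twice g in this example) produces no net change in base composition at the genic level.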

We now show that genic conversion is indeed unbiased (or nearly unbiased) between mouse species. In contrast, between human and chimpanzee, the fluxes are unequal. It appears that selection has favored the AT-to-GC changes in the last 6 million years that separate the two species.

In these genic analyses, we first determine whether there is biased conversion in the human lineage (Brown and Jiricny 1989; Galtier and Duret 2007). Using the same variant found in chimpanzees and gorillas as the ancestral state, we identified the derived variants that became nearly fixed in humans with frequency > 0.8 (Table 6). The chi-square test shows that the GC variants had a significantly higher fixation probability compared to AT. In addition, this pattern is also found in chimpanzees (p < 0.001). Between M. m. domesticus and M. m. castaneus (Table 6), the chi-square test reveals no difference in the fixation probability between GC and AT (p = 0.957). Further details can be found in Supplementary Figure 2. Overall, a higher fixation probability of the GC variants is found in human and chimpanzee, whereas this bias is not observed in mice.
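A minimal sketch of the kind of test described above is given below in Python. It assumes a 2×2 contingency table (AT-to-GC vs. GC-to-AT variants, nearly fixed vs. not fixed) and a Pearson chi-square statistic with one degree of freedom and no continuity correction; the text does not specify the exact layout of the test, so this is one plausible reading. The counts are invented for illustration and are not the values in Table 6.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test for the table [[a, b], [c, d]] (1 degree of freedom).

    Returns (statistic, p_value). No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one count")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: (nearly fixed, not fixed) for each class of variant.
AT_TO_GC = (60, 140)
GC_TO_AT = (30, 170)

chi2, p_value = chi_square_2x2(*AT_TO_GC, *GC_TO_AT)
print(f"chi-square = {chi2:.2f}, p = {p_value:.5f}")
```

With these made-up counts the AT-to-GC class fixes twice as often as the GC-to-AT class, and the test rejects equal fixation probabilities.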

Table 6. The A/T to G/C and G/C to A/T changes in apes and mouse.

Based on Table 6, we could calculate the value of p, q, f and g as shown in Table 7. In particular, the (pf)/(qg) ratio is much larger than 1 in both the human and chimpanzee lineages. In contrast, the ratio in mouse is not significantly different from 1. The ratio for total rDNAs in mouse is lower than 1, but that is likely due to the influence of selective constraints on the functional parts of these genes.

Table 7. The parameter values of p, q, f and g in the evolution between A/T and G/C.

Discussion

The Haldane model is an “individual-output” model of genetic drift (Chen, et al. 2017). Hence, it does not require the population to follow the rules of Mendelian populations. It is also sufficiently flexible for studying the various stochastic forces, other than sampling errors, that together drive genetic drift. In the companion study (Ruan, et al. 2024), we address the ecological forces of genetic drift and, in this study, we analyze the neutral evolution of rRNA genes. Both examples are amenable to analysis by the Haldane model, but not by the WF model.

In multi-copy systems, there are several mechanisms of homogenization within individuals. For rRNA genes, whether on the same or different chromosomes (Gonzalez and Sylvester 2001; van Sluis and McStay 2019), the predominant mechanism of homogenization is recombination that may lead to gene conversion or unequal crossover. In the process of exchanging DNA sections in meiosis, gene conversions are an order of magnitude more common than crossovers (Cole, et al. 2012; Williams, et al. 2015). It is not clear how large a role is played by replication slippage, which affects copies of the same cluster.

There have been many rigorous analyses that confront the homogenizing mechanisms directly. These studies (Smith 1974; Ohta 1976; Dover 1982; Nagylaki 1983; Ohta and Dover 1983) modeled gene conversion and unequal cross-over head on. Unfortunately, on top of the complexities of such models, the key parameter values are rarely obtainable. In the branching process, all these complexities are wrapped into V(K) for formulating the evolutionary rate. In such a formulation, the collective strength of these various forces may indeed be measurable, as shown in this study.

The branching process is a model for general processes. Hence, it can be used to track genetic drift in systems with two stages of evolution, within and between individuals, even though TEs, viruses and rRNA genes are very different biological entities. We use the rRNA genes to convey this point. Multi-copy genes, like rDNA, are under rapid genetic drift via the homogenization process. The drift is strong enough to reduce the copy number in the population from ∼ 150N to < N. A fraction of mutations in multi-copy genes may have been fixed by drift almost instantaneously on the evolutionary time scale. This acceleration is seen in mice but, by convention, would have been interpreted as the result of positive selection. Interestingly, while positive selection may not be necessary to explain the mouse data, it is indeed evident in human and chimpanzee, as the evolutionary rate of rRNA genes exceeds the limit set by strong drift.

In conclusion, the Haldane model is far more general than the WF model, as this and the companion study clearly demonstrate. Its E(K) parameter (which is usually set to 1) should be equivalent to the single parameter of the WF model (i.e., N). The other parameter of the Haldane model, i.e., V(K), is then free to track genetic drift, whereas the WF model, setting V(K) = E(K), is highly constrained. Nevertheless, the vast literature using the WF model has led to a substantial understanding of the neutral process via the diffusion process (Crow and Kimura 1970) or coalescence (Kingman 1982; Fu 2006). In this sense, the Haldane model should be built on the WF model by introducing a second parameter, thereby permitting the analysis of a broader range of stochastic ecological and evolutionary forces.

Materials and Methods

Data Collection

We collected high-coverage whole-genome sequencing data for our study. The genome sequences of human, chimpanzee and gorilla were sourced from the National Center for Biotechnology Information (NCBI). Human individuals were drawn from diverse geographical origins encompassing three continents (4 Asians, 2 Europeans and 2 Africans), and the Asia CRC sample was the normal tissue of the Case 1 patient from Chen, Wu, et al. (2022). Genomic sequences of mice were sourced from the Wellcome Sanger Institute’s Mouse Genomes Project (MGP). Although some artificial selection has been performed on laboratory mouse strains (Yang, et al. 2011), the WSB/EiJ, ZALENDE/Ei and LEWES/EiJ strains were derived from wild populations. Incorporating these wild-derived laboratory strains, along with other inbred strains, we used a cohort of 10 mice to approximate the population of M. m. domesticus. Furthermore, the low F_IS of 0.14 for rDNA in M. m. domesticus found in this study suggests that each mouse covers 86% of the population’s genetic diversity, thereby mitigating concerns about potential sampling biases. Accessions and detailed information for the samples used in this study are listed in Supplementary Table 4 and Table 5.

Data Processing

Following adapter trimming and the removal of low-quality sequences, the whole-genome sequencing data of apes and mice were mapped against the respective reference sequences: the human rDNA reference sequence (Human ribosomal DNA complete repeating unit, GenBank: U13369.1) and the mouse rDNA reference sequence (Mus musculus ribosomal DNA, complete repeating unit, GenBank: BK000964.3). Alignment was performed using the Burrows-Wheeler Alignment Tool (BWA) v0.7.17 (Li and Durbin 2009) with default parameters. Each individual was considered as a pool of rRNA genes; thus, there were three levels of diversity (within individuals, within species and between species).

To determine the variant frequency within individuals, bcftools (Danecek, et al. 2021) was employed to call variants from multiple alignment files. Per-sample allele depths were obtained using the following settings: ‘bcftools mpileup --redo-BAQ --max-depth 50000 --per-sample-mF --annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,INFO/AD,INFO/ADF,INFO/ADR’ and ‘bcftools call -mv’. Sites with depth < 10 were filtered out in subsequent analysis, and the average of the minimum depths for each remaining site across all samples is above 3000, thereby enhancing the robustness of variant frequency estimation. In this process, we can identify all the polymorphic sites present in these samples and the variant frequency within each individual. The population-level variant frequency was computed by averaging variant frequencies across all individuals. Our analysis specifically focused on single nucleotide variants (SNVs), while other mutation types were discarded for consistency.
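The frequency calculation described above can be sketched in Python as follows. The sketch assumes that the per-sample allele depths (the FORMAT/AD values) have already been extracted as (reference depth, variant depth) pairs for one site, and that the depth filter is applied to each sample at that site; the original pipeline may apply the filter differently. All function names and the example depths are hypothetical.

```python
from statistics import mean

MIN_DEPTH = 10   # sites with depth below 10 are filtered out, as in the text


def within_individual_frequency(ref_depth: int, alt_depth: int):
    """Variant frequency within one individual, or None if coverage is too low."""
    depth = ref_depth + alt_depth
    if depth < MIN_DEPTH:
        return None
    return alt_depth / depth


def population_frequency(allele_depths):
    """Average the within-individual frequencies across individuals at one site.

    allele_depths: iterable of (ref_depth, alt_depth) pairs, one per individual.
    Returns None if no individual passes the depth filter.
    """
    freqs = [within_individual_frequency(ref, alt) for ref, alt in allele_depths]
    usable = [x for x in freqs if x is not None]
    return mean(usable) if usable else None


# Hypothetical depths for one site in four individuals; the last one fails the filter.
site_depths = [(2700, 300), (3000, 0), (1500, 1500), (4, 2)]
site_frequency = population_frequency(site_depths)
print(f"population-level variant frequency: {site_frequency:.3f}")
```

Averaging per-individual frequencies, rather than pooling reads, gives each individual equal weight regardless of its sequencing depth.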

Identification of Divergence Sites

Polymorphism represents the transient phase preceding divergence. With variant frequencies within individuals, within species and between species, we obtained the F_IS and F_ST values for each site. Within the M. m. domesticus population, 779 polymorphic sites were identified. Among these, 744 sites (95.5%) exhibited an F_IS below 0.4, and F_IS rarely exceeded 0.8 (Fig. 4B). In the comparison between M. m. domesticus and Mus spretus, 1579 variant sites were found, among which 453 sites displayed an F_ST above 0.8, indicating swift fixation of mutations during species divergence.
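This section does not restate how F_IS and F_ST are computed per site, so the Python sketch below assumes the standard Wright-style definitions based on the expected heterozygosity 2x(1 − x): F_IS = (H_S − H_I)/H_S, with H_I averaged over individuals and H_S computed from the species mean frequency, and F_ST = (H_T − H_S)/H_T for a pair of species. These definitions, the function names and the example frequencies are assumptions for illustration and may differ from the authors' implementation.

```python
from statistics import mean


def heterozygosity(x: float) -> float:
    """Expected heterozygosity 2x(1 - x) at a biallelic site."""
    return 2.0 * x * (1.0 - x)


def f_is(individual_freqs):
    """F_IS at one site from the variant frequencies within each individual."""
    h_i = mean([heterozygosity(x) for x in individual_freqs])
    h_s = heterozygosity(mean(individual_freqs))
    if h_s == 0:
        return None   # site is monomorphic in the species
    return (h_s - h_i) / h_s


def f_st(freq_species_a: float, freq_species_b: float):
    """F_ST at one site from the mean variant frequency in each of two species."""
    h_s = mean([heterozygosity(freq_species_a), heterozygosity(freq_species_b)])
    h_t = heterozygosity(mean([freq_species_a, freq_species_b]))
    if h_t == 0:
        return None   # site is monomorphic in both species
    return (h_t - h_s) / h_t


# Hypothetical sites: identical individuals, fully differentiated individuals,
# and a site that is nearly fixed for different alleles in two species.
shared_everywhere = f_is([0.3, 0.3, 0.3, 0.3])
private_to_individuals = f_is([0.0, 0.0, 1.0, 1.0])
nearly_fixed = f_st(0.05, 0.95)
print(shared_everywhere, private_to_individuals, round(nearly_fixed, 2))
```

Under these definitions a low F_IS means each individual carries most of the species-level diversity, and an F_ST above 0.8 marks a site that is close to fixed for different variants in the two species.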

F_ST analysis between human and chimpanzee was conducted for each human individual, as summarized in Table 5. We identified between 672 and 705 sites with F_ST values above 0.8 across individuals, representing robustly diverged sites. Considering the high mutation rate at CpG sites (Ehrlich and Wang 1981; Sved and Bird 1990) and the predominance of GC content in rDNA (e.g., 58% GC in human), we further estimated the evolutionary rate at non-CpG sites during interspecies divergence. To achieve this, mutations at CpG sites were manually removed by excluding all sites containing CpG in one species and TpG or CpA in the other; the reverse was similarly discarded. Additionally, non-CpG sites within the mapped length, where site depth exceeded 10, were counted using Samtools (Danecek, et al. 2021) with the settings ‘samtools mpileup -Q 15 -q 20’. As a result, the evolutionary rate of rDNA at non-CpG sites was ascertained.
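The CpG exclusion rule can be expressed compactly in code. The Python sketch below assumes two ungapped, equal-length aligned sequences and flags a substitution as CpG-related when a dinucleotide overlapping the site reads CpG in one sequence and TpG or CpA in the other, in either direction. The original removal was done manually, so this is only an approximation of the rule as stated; the function names and example sequences are hypothetical.

```python
CPG_DECAY_PRODUCTS = {"TG", "CA"}   # dinucleotides produced by C>T at a CpG on either strand


def is_cpg_related(seq_a: str, seq_b: str, i: int) -> bool:
    """True if the difference at index i turns a CpG in one sequence into TpG/CpA in the other."""
    for start in (i - 1, i):            # the two dinucleotides that overlap position i
        if start < 0 or start + 2 > len(seq_a):
            continue
        di_a = seq_a[start:start + 2]
        di_b = seq_b[start:start + 2]
        if (di_a == "CG" and di_b in CPG_DECAY_PRODUCTS) or \
           (di_b == "CG" and di_a in CPG_DECAY_PRODUCTS):
            return True
    return False


def non_cpg_substitutions(seq_a: str, seq_b: str) -> list:
    """Indices of substitutions between two aligned sequences that are not CpG-related."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    a, b = seq_a.upper(), seq_b.upper()
    return [i for i in range(len(a))
            if a[i] != b[i] and not is_cpg_related(a, b, i)]


# Hypothetical aligned fragments: index 2 is a CpG-to-TpG change, index 7 is not CpG-related.
species_a = "AACGTTAGCA"
species_b = "AATGTTACCA"
kept = non_cpg_substitutions(species_a, species_b)
print(kept)
```

Checking both overlapping dinucleotides covers changes at either the C or the G of a CpG, which correspond to the TpG and CpA outcomes named in the text.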

For assessing diversity and divergence across gene segments, we used ‘samtools faidx’ to partition variants into a total of 8 regions within rRNA genes, including 5’ETS, ITS1, ITS2, 3’ETS, IGS, 18S, 5.8S, and 28S, aligning them with corresponding reference sequences for further analysis. The functional parts (18S, 5.8S, and 28S) were subject to strong negative selection, exhibiting minimal substitutions during species divergence as expected. This observation is primarily used for comparison with the non-functional parts and reflects that the non-functional parts are less constrained by negative selection.

Genome-wide Divergence Estimation

To assess the genome-wide divergence among the four mouse strains, we downloaded their toplevel reference genomes from the Ensembl genome browser (GenBank Assembly ID: GCA_001624865.1, GCA_001624775.1, GCA_001624445.1, and GCA_001624835.1). We then used Mash (Ondov, et al. 2016) to estimate divergence across the entire genome (mainly single-copy genes) with ‘mash sketch’ and ‘mash dist’. Additionally, the effective copy number of rRNA genes, denoted as C*, can be estimated by calculating the ratio of the population diversity observed in rDNA to that observed in single-copy genes.
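The C* estimate is a simple ratio, sketched below in Python. The single-copy diversity of 0.0014 is the value quoted in the Results; the rDNA diversity is a hypothetical number chosen so that the ratio reproduces the C* of about 5 used for mouse, and is not a measurement from this study.

```python
def effective_copy_number(diversity_rdna: float, diversity_single_copy: float) -> float:
    """C*: population diversity in rDNA divided by that in single-copy genes."""
    if diversity_single_copy <= 0:
        raise ValueError("single-copy diversity must be positive")
    return diversity_rdna / diversity_single_copy


PI_SINGLE_COPY = 0.0014   # average nucleotide diversity of single-copy genes (from the text)
PI_RDNA = 0.0070          # hypothetical rDNA diversity, for illustration only

c_star = effective_copy_number(PI_RDNA, PI_SINGLE_COPY)
print(f"C* = {c_star:.1f}")
```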

Estimation of Site Conversion

To estimate site conversions, variants shared between chimpanzees and gorillas, humans and gorillas, and M. m. castaneus and M. spretus were considered as the ancestral state for human, chimpanzee, and M. m. domesticus, respectively. We identified the derived variants that became nearly fixed (frequency > 0.8) in the lineages of humans, chimpanzees and M. m. domesticus. In this study, six types of mutations were tabulated, each written as ancestral-to-derived, as depicted in Supplementary Fig. 2. For example, A-to-G represented both A-to-G and T-to-C mutations. The C-to-G (or G-to-C) and A-to-T (or T-to-A) types of mutations were excluded from the subsequent analysis.
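The folding of twelve possible base changes into six strand-collapsed types, and then into the two conversion directions, can be sketched in Python as below. The convention of reporting the ancestral base as A or C is a choice made for this illustration (it reproduces the A-to-G example in the text), and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def strand_collapsed_type(ancestral: str, derived: str) -> str:
    """Fold complementary changes together, e.g. T-to-C is reported as A-to-G."""
    anc, der = ancestral.upper(), derived.upper()
    if anc not in COMPLEMENT or der not in COMPLEMENT or anc == der:
        raise ValueError("expected two different bases from A, C, G, T")
    if anc in ("G", "T"):
        # Read the change on the opposite strand so the ancestral base is A or C.
        anc, der = COMPLEMENT[anc], COMPLEMENT[der]
    return f"{anc}-to-{der}"


def conversion_direction(ancestral: str, derived: str):
    """Return 'AT-to-GC', 'GC-to-AT', or None for the excluded A-to-T and C-to-G types."""
    mutation_type = strand_collapsed_type(ancestral, derived)
    if mutation_type in ("A-to-G", "A-to-C"):
        return "AT-to-GC"
    if mutation_type in ("C-to-T", "C-to-A"):
        return "GC-to-AT"
    return None


all_types = sorted({strand_collapsed_type(a, d)
                    for a in "ACGT" for d in "ACGT" if a != d})
print(all_types)
print(conversion_direction("T", "C"), conversion_direction("G", "A"),
      conversion_direction("G", "C"))
```

The excluded types leave GC content unchanged, which is why they carry no information about the AT-to-GC vs. GC-to-AT balance.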

The alternative hypothesis of a biased mutation pattern (Wolfe, et al. 1989; Francino and Ochman 1999) alone can be rejected in this study. According to the prediction of the mutational mechanism hypothesis, AT and GC variants should have equal fixation probabilities. We counted the nearly fixed AT-to-GC and GC-to-AT mutations and conducted a chi-square test to compare their fixation probabilities. Notably, we found that GC variants had a significantly higher fixation probability than AT variants at the site level in apes, but not in mice. Since the fixed sites accounted for less than 1% of the non-functional length of rDNA, we calculated the AT-to-GC conversion rate, f, by dividing the number of fixed G or C sites by the number of AT base pairs, and the GC-to-AT conversion rate, g, was computed in a comparable way. To test whether there is an equilibrium state at the genic level in these three species, we computed the (pf)/(qg) ratio in Table 7; a significant deviation of the ratio from 1 would imply biased genic conversion.
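The calculation of p, q, f, g and the (pf)/(qg) ratio can be sketched in Python as follows, taking the definitions in this paragraph at face value. The site and fixation counts are invented for illustration and are not the values behind Table 7; the function name is hypothetical.

```python
def genic_conversion_summary(n_at_sites: int, n_gc_sites: int,
                             fixed_at_to_gc: int, fixed_gc_to_at: int) -> dict:
    """Compute p, q, f, g and the flux ratio (pf)/(qg) from site and fixation counts."""
    if min(n_at_sites, n_gc_sites) <= 0:
        raise ValueError("both AT and GC site counts must be positive")
    if fixed_gc_to_at <= 0:
        raise ValueError("the ratio is undefined without GC-to-AT fixations")
    total = n_at_sites + n_gc_sites
    p = n_at_sites / total             # proportion of AT sites
    q = n_gc_sites / total             # proportion of GC sites
    f = fixed_at_to_gc / n_at_sites    # AT-to-GC conversion rate
    g = fixed_gc_to_at / n_gc_sites    # GC-to-AT conversion rate
    return {"p": p, "q": q, "f": f, "g": g, "flux_ratio": (p * f) / (q * g)}


# Hypothetical counts for a non-functional rDNA region.
summary = genic_conversion_summary(n_at_sites=4000, n_gc_sites=6000,
                                   fixed_at_to_gc=40, fixed_gc_to_at=20)
print({key: round(value, 4) for key, value in summary.items()})
```

Under these definitions the site counts cancel, so the ratio equals the number of fixed AT-to-GC changes divided by the number of fixed GC-to-AT changes; a value near 1 indicates balanced fluxes at the genic level.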

Data Availability

No new data were generated in this study. The genomic data used in this study are available from National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/) and the Mouse Genomes Project (https://www.sanger.ac.uk/data/mouse-genomes-project/).

The specific accession numbers are recorded in Supplementary Table S3 and S4.

Acknowledgements

We are grateful for the helpful comments from many colleagues on the Chat Group “Cancer - The new evolving species”, in particular, Weiwei Zhai, Yong Zhang, GuoDong Wang of CAS and Jianrong Yang of SYSU.

Supporting information

supplementary information

References

1. Alexandrov I, Kazakov A, Tumeneva I, Shepelev V, Yurov Y (2001) Alpha-satellite DNA of primates: old and new families. Chromosoma 110:253–266.
2. Arkhipova IR (2018) Neutral Theory, Transposable Elements, and Eukaryotic Genome Evolution. Molecular Biology and Evolution 35:1332–1337.
3. Arnheim N, Krystal M, Schmickel R, Wilson G, Ryder O, Zimmer E (1980) Molecular evidence for genetic exchanges among ribosomal genes on nonhomologous chromosomes in man and apes. Proc Natl Acad Sci U S A 77:7323–7327.
4. Arnheim N, Treco D, Taylor B, Eicher EM (1982) Distribution of ribosomal gene length variants among mouse chromosomes. Proc Natl Acad Sci U S A 79:4677–4680.
5. Bowman JC, Petrov AS, Frenkel-Pinter M, Penev PI, Williams LD (2020) Root of the Tree: The Significance, Evolution, and Origins of the Ribosome. Chemical Reviews 120:4848–4878.
6. Brown T, Jiricny J (1989) Repair of base-base mismatches in simian and human cells. Genome 31:578–583.
7. Cabot EL, Doshi P, Wu ML, Wu CI (1993) Population genetics of tandem repeats in centromeric heterochromatin: unequal crossing over and chromosomal divergence at the Responder locus of Drosophila melanogaster. Genetics 135:477–487.
8. Cazaux B, Catalan J, Veyrunes F, Douzery EJP, Britton-Davidian J (2011) Are ribosomal DNA clusters rearrangement hotspots? A case study in the genus Mus (Rodentia, Muridae). BMC Evolutionary Biology 11:124.
9. Charlesworth B (2009) Effective population size and patterns of molecular evolution and variation. Nature Reviews Genetics 10:195–205.
10. Charlesworth B, Charlesworth D (2010) Elements of evolutionary genetics. Springer.
11. Charlesworth B, Sniegowski P, Stephan W (1994) The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 371:215–220.
12. Chen B, Wu X, Ruan Y, Zhang Y, Cai Q, Zapata L, Wu CI, Lan P, Wen H (2022) Very large hidden genetic diversity in one single tumor: evidence for tumors-in-tumor. Natl Sci Rev 9:nwac250.
13. Chen Q, Yang H, Feng X, Chen Q, Shi S, Wu CI, He Z (2022) Two decades of suspect evidence for adaptive molecular evolution-negative selection confounding positive-selection signals. Natl Sci Rev 9:nwab217.
14. Chen Y, Tong D, Wu CI (2017) A New Formulation of Random Genetic Drift and Its Application to the Evolution of Cell Populations. Mol Biol Evol 34:2057–2064.
15. Cole F, Kauppi L, Lange J, Roig I, Wang R, Keeney S, Jasin M (2012) Homeostatic control of recombination is implemented progressively in mouse meiosis. Nature Cell Biology 14:424–430.
16. Crow JF, Kimura M (1970) An Introduction to Population Genetics Theory. NY, USA: Harper & Row.
17. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. (2021) Twelve years of SAMtools and BCFtools. Gigascience 10.
18. Dover G (1982) Molecular drive: a cohesive mode of species evolution. Nature 299:111–117.
19. Ehrlich M, Wang RY (1981) 5-Methylcytosine in eukaryotic DNA. Science 212:1350–1357.
20. Eickbush TH, Eickbush DG (2007) Finely orchestrated movements: evolution of the ribosomal RNA genes. Genetics 175:477–485.
21. Francino MP, Ochman H (1999) Isochores result from mutation not selection. Nature 400:30–31.
22. Fu Y-X (2006) Exact coalescent for the Wright–Fisher model. Theoretical Population Biology 69:385–394.
23. Fujiwara K, Kawai Y, Takada T, Shiroishi T, Saitou N, Suzuki H, Osada N (2022) Insights into Mus musculus Population Structure across Eurasia Revealed by Whole-Genome Analysis. Genome Biol Evol 14.
24. Galtier N, Duret L (2007) Adaptation or biased gene conversion? Extending the null hypothesis of molecular evolution. Trends in Genetics 23:273–277.
25. Geraldes A, Basset P, Gibson B, Smith KL, Harr B, Yu HT, Bulatova N, Ziv Y, Nachman MW (2008) Inferring the history of speciation in house mice from autosomal, X-linked, Y-linked and mitochondrial genes. Mol Ecol 17:5349–5363.
26. Gibbons JG, Branco AT, Godinho SA, Yu S, Lemos B (2015) Concerted copy number variation balances ribosomal DNA dosage in human and mouse genomes. PNAS 112:2485–2490.
27. Gonzalez IL, Sylvester JE (2001) Human rDNA: Evolutionary Patterns within the Genes and Tandem Arrays Derived from Multiple Chromosomes. Genomics 73:255–263.
28. Guarracino A, Buonaiuto S, de Lima LG, Potapova T, Rhie A, Koren S, Rubinstein B, Fischer C, Abel HJ, Antonacci-Fulton LL, et al. (2023) Recombination between heterologous human acrocentric chromosomes. Nature 617:335–343.
29. Guillén AKZ, Hirai Y, Tanoue T, Hirai H (2004) Transcriptional repression mechanisms of nucleolus organizer regions (NORs) in humans and chimpanzees. Chromosome Research 12:225–237.
30. Hartl DL, Clark AG (1997) Principles of population genetics. Sunderland: Sinauer associates.
31. Hori Y, Shimamoto A, Kobayashi T (2021) The human ribosomal DNA array is composed of highly homogenized tandem clusters. Genome Res 31:1971–1982.
32. Hou M, Shi JR, Gong ZK, Wen HJ, Lan Y, Deng XZ, Fan QH, Li JJ, Jiang ML, Tang XP, et al. (2023) Intra- vs. Interhost Evolution of SARS-CoV-2 Driven by Uncorrelated Selection-The Evolution Thwarted. Molecular Biology and Evolution 40:msad204.
33. Jeffreys AJ, Neumann R (2002) Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nat Genet 31:267–271.
34. Jurka J, Kapitonov VV, Kohany O, Jurka MV (2007) Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet 8:241–259.
35. Kimura M, Crow JF (1963) The Measurement of Effective Population Number. Evolution 17:279–288.
36. Kingman JFC (1982) On the Genealogy of Large Populations. Journal of Applied Probability 19:27–43.
37. Krystal M, D’Eustachio P, Ruddle FH, Arnheim N (1981) Human nucleolus organizers on nonhomologous chromosomes can share the same ribosomal gene variants. Proceedings of the National Academy of Sciences of the United States of America 78:5744–5748.
38. Kumar S, Suleski M, Craig JM, Kasprowicz AE, Sanderford M, Li M, Stecher G, Hedges SB (2022) TimeTree 5: An Expanded Resource for Species Divergence Times. Mol Biol Evol 39.
39. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760.
40. Li W-H (1997) Molecular evolution. Sunderland, Mass.: Sinauer Associates.
41. Ling S, Hu Z, Yang Z, Yang F, Li Y, Lin P, Chen K, Dong L, Cao L, Tao Y, et al. (2015) Extremely high genetic diversity in a single tumor points to prevalence of non-Darwinian cell evolution. Proc Natl Acad Sci U S A 112:E6496–6505.
42. Lu J, Wu C-I (2005) Weak selection revealed by the whole-genome comparison of the X chromosome and autosomes of human and chimpanzee. Proceedings of the National Academy of Sciences 102:4063–4067.
43. McDermott SR, Noor MAF (2010) The role of meiotic drive in hybrid male sterility. Philosophical Transactions of the Royal Society B-Biological Sciences 365:1265–1272.
44. Meunier J, Duret L (2004) Recombination drives the evolution of GC-content in the human genome. Molecular Biology and Evolution 21:984–990.
45. Nagylaki T (1983) Evolution of a large population under gene conversion. Proc Natl Acad Sci U S A 80:5941–5945.
46. Nair S, Miller B, Barends M, Jaidee A, Patel J, Mayxay M, Newton P, Nosten F, Ferdig MT, Anderson TJC (2008) Adaptive Copy Number Evolution in Malaria Parasites. Plos Genetics 4:e1000243.
47. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, et al. (2022) The complete sequence of a human genome. Science 376:44–53.
48. Ohta T (1976) Simple model for treating evolution of multigene families. Nature 263:74–76.
49. Ohta T, Dover GA (1983) Population genetics of multigene families that are dispersed into two or more chromosomes. Proc Natl Acad Sci U S A 80:4079–4083.
50. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology 17:132.
51. Ostrander EA, Wang G-D, Larson G, vonHoldt BM, Davis BW, Jagannathan V, Hitte C, Wayne RK, Zhang Y-P, Dog KC (2019) Dog10K: an international sequencing effort to advance studies of canine domestication, phenotypes and health. National Science Review 6:810–824.
52. Parks MM, Kurylo CM, Dass RA, Bojmar L, Lyden D, Vincent CT, Blanchard SC (2018) Variant ribosomal RNA alleles are conserved and exhibit tissue-specific expression. Science Advances 4.
53. Phifer-Rixey M, Harr B, Hey J (2020) Further resolution of the house mouse (Mus musculus) phylogeny by integration over isolation-with-migration histories. BMC Evol Biol 20:120.
54. Potapova T, Gerton J (2019) Ribosomal DNA and the nucleolus in the context of genome organization. Chromosome Research 27.
55. Ruan Y, Hou M, Tang X, He X, Lu X, Lu J, Wu CI, Wen H (2022) The runaway evolution of SARS-CoV-2 leading to the highly evolved Delta strain. Mol Biol Evol.
56. Ruan Y, Luo Z, Tang X, Li G, Wen H, He X, Lu X, Lu J, Wu CI (2021) On the founder effect in COVID-19 outbreaks: how many infected travelers may have started them all? Natl Sci Rev 8:nwaa246.
57. Ruan Y, Wang X, Hou M, Diao W, Xu S, Wen H, Wu C-I (2024) Resolving Paradoxes in Molecular Evolution: The Integrated WF-Haldane (WFH) Model of Genetic Drift. bioRxiv.
58. Ruan Y, Wen H, He X, Wu CI (2021) A theoretical exploration of the origin and early evolution of a pandemic. Sci Bull (Beijing) 66:1022–1029.
59. Rudra M, Chatterjee B, Bahadur M (2016) Phylogenetic relationship and time of divergence of Mus terricolor with reference to other Mus species. J Genet 95:399–409.
60. Salim D, Gerton JL (2019) Ribosomal DNA instability and genome adaptability. Chromosome Research 27:73–87.
61. Smirnov E, Chmúrčiaková N, Liška F, Bažantová P, Cmarko D (2021) Variability of Human rDNA. Cells 10.
62. Smith GP (1974) Unequal Crossover and the Evolution of Multigene Families. Cold Spring Harb Symp Quant Biol 38:507–513.
63. Stults DM, Killen MW, Pierce HH, Pierce AJ (2008) Genomic architecture and inheritance of human ribosomal RNA gene clusters. Genome Res 18:13–18.
64. Sved J, Bird A (1990) The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc Natl Acad Sci U S A 87:4692–4696.
65. Szitenberg A, Cha S, Opperman CH, Bird DM, Blaxter ML, Lunt DH (2016) Genetic Drift, Not Life History or RNAi, Determine Long-Term Evolution of Transposable Elements. Genome Biology and Evolution 8:2964–2978.
66. Tao Y, Hartl DL, Laurie CC (2001) Sex-ratio segregation distortion associated with reproductive isolation in Drosophila. Proceedings of the National Academy of Sciences 98:13183–13188.
67. Tatsumoto S, Go Y, Fukuta K, Noguchi H, Hayakawa T, Tomonaga M, Hirai H, Matsuzawa T, Agata K, Fujiyama A (2017) Direct estimation of de novo mutation rates in a chimpanzee parent-offspring trio by ultra-deep whole genome sequencing. Sci Rep 7:13561.
68. van Sluis M, Gailín M, McCarter JGW, Mangan H, Grob A, McStay B (2019) Human NORs, comprising rDNA arrays and functionally conserved distal elements, are located within dynamic chromosomal regions. Genes Dev 33:1688–1701.
69. van Sluis M, McStay B (2019) Nucleolar DNA Double-Strand Break Responses Underpinning rDNA Genomic Stability. Trends in Genetics 35:743–753.
70. Varki A, Altheide TK (2005) Comparing the human and chimpanzee genomes: searching for needles in a haystack. Genome Res 15:1746–1758.
71. Wang G-D, Xie H-B, Peng M-S, Irwin D, Zhang Y-P (2014) Domestication Genomics: Evidence from Animals. Annual Review of Animal Biosciences 2:65–84.
72. Williams AL, Genovese G, Dyer T, Altemose N, Truax K, Jun G, Patterson N, Myers SR, Curran JE, Duggirala R, et al. (2015) Non-crossover gene conversions show strong GC bias and unexpected clustering in humans. Elife 4.
73. Wolfe KH, Sharp PM, Li WH (1989) Mutation rates differ among regions of the mammalian genome. Nature 337:283–285.
74. Wu C-I, Lyttle TW, Wu M-L, Lin G-F (1988) Association between a satellite DNA sequence and the responder of segregation distorter in D. melanogaster. Cell 54:179–189.
75. Wu C-I, True JR, Johnson N (1989) Fitness reduction associated with the deletion of a satellite DNA array. Nature 341:248–251.
76. Wu CI, Wang HY, Ling S, Lu X (2016) The Ecology and Evolution of Cancer: The Ultra-Microevolutionary Process. Annu Rev Genet 50:347–369.
77. Wyckoff GJ, Li J, Wu C-I (2002) Molecular Evolution of Functional Genes on the Mammalian Y Chromosome. Molecular Biology and Evolution 19:1633–1636.
78. Xu J, Nuno K, Litzenburger UM, Qi Y, Corces MR, Majeti R, Chang HY (2019) Single-cell lineage tracing by endogenous mutations enriched in transposase accessible mitochondrial DNA. Elife 8.
79. Yang H, Wang JR, Didion JP, Buus RJ, Bell TA, Welsh CE, Bonhomme F, Yu AH-T, Nachman MW, Pialek J, et al. (2011) Subspecific origin and haplotype diversity in the laboratory mouse. Nat Genet 43:648–655.
80. Zhao Z, Jin L, Fu Y-X, Ramsay M, Jenkins T, Leskinen E, Pamilo P, Trexler M, Patthy L, Jorde LB, et al. (2000) Worldwide DNA sequence variation in a 10-kilobase noncoding region on human chromosome 22. Proceedings of the National Academy of Sciences 97:11354–11358.

Article and author information

Author information

Xiaopei Wang
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
ORCID iD: 0009-0007-4587-4624
Yongsen Ruan
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
ORCID iD: 0000-0002-5573-4154
Lingjie Zhang
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Xiangnyu Chen
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Zongkun Shi
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Haiyu Wang
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Bingjie Chen
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Miles Tracy
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Haijun Wen
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Corresponding authors: wenhj5@mail.sysu.edu.cn (H. Wen); ciwu@uchicago.edu (C.I. Wu)
Chung-I Wu
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Corresponding authors: wenhj5@mail.sysu.edu.cn (H. Wen); ciwu@uchicago.edu (C.I. Wu)

Version history

Preprint posted: May 9, 2024
Sent for peer review: June 10, 2024
Reviewed Preprint version 1: August 20, 2024
Reviewed Preprint version 2: November 15, 2024
Reviewed Preprint version 3: March 6, 2025

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.99992. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Reviewing Editor
Ziyue Gao
University of Pennsylvania, Philadelphia, United States of America
Senior Editor
George Perry
Pennsylvania State University, University Park, United States of America

Reviewer #1 (Public Review):

The manuscript by Wang et al is, like its companion paper, very unusual in the opinion of this reviewer. It builds off of the companion theory paper's exploration of the "Wright-Fisher Haldane" model but applies it to the specific problem of diversity in ribosomal RNA arrays. The authors argue that polymorphism and divergence among rRNA arrays are inconsistent with neutral evolution, primarily stating that the amount of polymorphism suggests a high effective size and thus a slow fixation rate, while we, in fact, observe relatively fast fixation between species, even in putatively non-functional regions. They frame this as a paradox in need of solving, and invoke the WFH model.

The same critiques apply to this paper as to the presentation of the WFH model and the lack of engagement with the literature, particularly concerning Cannings models and non-diffusive limits. However, I have additional concerns about this manuscript, which I found particularly difficult to follow.

My first, and most major, concern is that I can never tell when the authors are referring to diversity in a single copy of an rRNA gene compared to when they are discussing diversity across the entire array of rRNA genes. I admit that I am not at all an expert in studies of rRNA diversity, so perhaps this is a standard understanding in the field, but in order for this manuscript to be read and understood by a larger number of people, these issues must be clarified.

The authors frame the number of rRNA genes as roughly equivalent to expanding the population size, but this seems to be wrong: the way that a mutation can spread among rRNA gene copies is fundamentally different than how mutations spread within a single copy gene. In particular, a mutation in a single copy gene can spread through vertical transmission, but a mutation spreading from one copy to another is fundamentally horizontal: it has to occur because some molecular mechanism, such as slippage, gene conversion, or recombination resulted in its spread to another copy. Moreover, by collapsing diversity across genes in an rRNA array, the authors are massively increasing the mutational target size.

For example, it's difficult for me to tell if the discussion of heterozygosity at rRNA genes in mice starting on line 277 is collapsed or not. The authors point out that Hs per kb is ~5x larger in rRNA than the rest of the genome, but I can't tell based on the authors' description if this is diversity per single copy locus or after collapsing loci together. If it's the first one, I have concerns about diversity estimation in highly repetitive regions that would need to be addressed, and if it's the second one, an elevated rate of polymorphism is not surprising, because the mutational target size is in fact significantly larger.

Even if these issues were sorted out, I'm not sure that the authors' framing, in terms of variance in reproductive success, is a useful way to understand what is going on in rRNA arrays. The authors explicitly highlight homogenizing forces such as gene conversion and replication slippage but then seem to just want to incorporate those as accounting for variance in reproductive success. However, don't we usually want to dissect these things in terms of their underlying mechanism? Why build a model based on variance in reproductive success when you could instead explicitly model these homogenizing processes? That seems more informative about the mechanism, and it would also serve significantly better as a null model, since the parameters could be related to in vitro or in vivo measurements of the rates of slippage, gene conversion, etc.

In the end, I find the paper in its current state somewhat difficult to review in more detail, because I have a hard time understanding some of the more technical aspects of the manuscript while so confused about high-level features of the manuscript. I think that a revision would need to be substantially clarified in the ways I highlighted above.

https://doi.org/10.7554/eLife.99992.1.sa2

Reviewer #2 (Public Review):

Summary:

Multi-copy gene systems are expected to evolve slower than single-copy gene systems because it takes longer for genetic variants to fix in the large number of gene copies in the entire population. Paradoxically, their evolution is often observed to be surprisingly fast. To explain this paradox, the authors hypothesize that the rapid evolution of multi-copy gene systems arises from stronger genetic drift driven by homogenizing forces within individuals, such as gene conversion, unequal crossover, and replication slippage. They formulate this idea by combining the advantages of two classic population genetic models -- adding the V(k) term (which is the variance in reproductive success) in the Haldane model to the Wright-Fisher model. Using this model, the authors derived the strength of genetic drift (i.e., reciprocal of the effective population size, Ne) for the multi-copy gene system and compared it to that of the single-copy system. The theory was then applied to empirical genetic polymorphism and divergence data in rodents and great apes, relying on comparison between rRNA genes and genome-wide patterns (which mostly are single-copy genes). Based on this analysis, the authors concluded that neutral genetic drift could explain the rRNA diversity and evolution patterns in mice but not in humans and chimpanzees, pointing to a positive selection of rRNA variants in great apes.

Strengths:

Overall, the new WFH model is an interesting idea. It is intuitive, efficient, and versatile in various scenarios, including the multi-copy gene system and other cases discussed in the companion paper by Ruan et al.

Weaknesses:

Despite being intuitive at a high level, the model is a little unclear, as several terms in the main text were not clearly defined and connections between model parameters and biological mechanisms are missing. Most importantly, the data analysis of rRNA genes is extremely over-simplified and does not adequately consider biological and technical factors that are not discussed in the model. Even if these factors are ignored, the authors' interpretation of several observations is unconvincing, as alternative scenarios can lead to similar patterns. Consequently, the conclusions regarding rRNA genes are poorly supported. Overall, I think this paper shines more in the model than the data analysis, and the modeling part would be better presented as a section of the companion theory paper rather than a stand-alone paper. My specific concerns are outlined below.

(1) Unclear definition of terms

Many of the terms in the model or the main text were not clearly defined the first time they occurred, which hindered understanding of the model and observations reported. To name a few:

(i) In Eq(1), although C* is defined as the "effective copy number", it is unclear what it means in an empirical sense. For example, Ne could be interpreted as "an ideal WF population with this size would have the same level of genetic diversity as the population of interest" or "the reciprocal of strength of allele frequency change in a unit of time". A few factors were provided that could affect C*, but specifically, how do these factors impact C*? For example, does increased replication slippage increase or decrease C*? How about gene conversion or unequal cross-over? If we don't even have a qualitative understanding of how these processes influence C*, it is very hard to make interpretations based on inferred C*. How to interpret the claim on lines 240-241 (If the homogenization is powerful enough, rRNA genes would have C*<1)? Please also clarify what C* would be, in a single-copy gene system in diploid species.

(ii) In Eq(1), what exactly is V*(K)? Variance in reproductive success across all gene copies in the population? What factors affect V*(K)? For the same population, what is the possible range of V*(K)/V(K)? Is it somewhat bounded because of biological constraints? Are V*(K) and C*(K) independent parameters, or does one affect the other, or are both affected by an overlapping set of factors?

(iii) In the multi-copy gene system, how is fixation defined? A variant found at the same position in all copies of the rRNA genes in the entire population?

(iv) Lines 199-201, HI, Hs, and HT are not defined in the context of a multi-copy gene system. What are the empirical estimators?

(v) Lines 392-393, f and g are not clearly defined. What does "the proportion of AT-to-GC conversion" mean? What are the numerator and denominator of the fraction, respectively?

(2) Technical concerns with rRNA gene data quality

Given the highly repetitive nature and rapid evolution of rRNA genes, myriad things could go wrong with read alignment and variant calling, raising great concerns regarding the data quality. The data source and methods used for calling variants were insufficiently described in places, further exacerbating the concern.

(i) What are the accession numbers or sample IDs of the high-coverage WGS data of humans, chimpanzees, and gorillas from NCBI? How many individuals are in each species? These details are necessary to ensure reproducibility and correct interpretation of the results.

(ii) Sequencing reads from great apes and mice were mapped against the human and mouse rDNA reference sequences, respectively (lines 485-486). Given the rapid evolution of rRNA genes, even individuals within the same species differ in copy number and sequences of these genes. Alignment to a single reference genome would likely lead to incorrect and even failed alignment for some reads, resulting in genotyping errors. Differences in rDNA sequence, copy number, and structure are even greater between species, potentially leading to higher error rates in the called variants. Yet the authors provided no justification for the practice of aligning reads from multiple species to a single reference genome nor evidence that misalignment and incorrect variant calling are not major concerns for the downstream analysis.

(vi) It is unclear how variant frequency within an individual was defined conceptually or computed from data (lines 499-501). The population-level variant frequency was calculated by averaging across individuals, but why was the averaging not weighted by the copy number of rRNA genes each individual carries? How many individuals are sampled for each species? Are the sample sizes sufficient to provide an accurate estimate of population frequencies?

(vii) Fixed variants are operationally defined as those with a frequency>0.8 in one species. What is the justification for this choice of threshold? Without knowing the exact sample size of the various species, it's difficult to assess whether this threshold is appropriate.

(viii) It is not explained exactly how FIS, FST, and divergence levels of rRNA genes were calculated from variant frequency at individual and species levels. Formulae need to be provided to explain the computation.
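The estimators requested in items (vi) to (viii) can be made concrete with a minimal sketch. It assumes each individual is summarized by one within-individual variant frequency per site (the fraction of that individual's rRNA gene copies carrying the variant) and uses the textbook Wright definitions F_IS = (H_S - H_I)/H_S and F_ST = (H_T - H_S)/H_T with H = 2p(1-p). These are not confirmed to be the authors' formulae. All function names are hypothetical, and the 0.8 default only mirrors the operational threshold quoted in (vii).

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1-p) at a biallelic site."""
    return 2.0 * p * (1.0 - p)


def population_frequency(freqs: Sequence[float],
                         copy_numbers: Optional[Sequence[float]] = None) -> float:
    """Population-level variant frequency from within-individual frequencies.

    Without copy_numbers this is the plain average across individuals;
    with copy_numbers each individual is weighted by its rRNA copy number.
    """
    if copy_numbers is None:
        return mean(freqs)
    weighted_sum = sum(f * c for f, c in zip(freqs, copy_numbers))
    return weighted_sum / sum(copy_numbers)


def is_fixed(freq: float, threshold: float = 0.8) -> bool:
    """Operational 'fixed' call: frequency strictly above the threshold."""
    return freq > threshold


def f_statistics(freqs_by_pop: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (F_IS, F_ST) for one site.

    freqs_by_pop holds one list per population (or species), each containing
    the within-individual variant frequencies of the sampled individuals.
    Assumption: populations are weighted equally when forming H_S and H_T.
    """
    # H_I: average diversity among gene copies within an individual.
    h_i = mean([heterozygosity(f) for pop in freqs_by_pop for f in pop])
    # H_S: average diversity within a population, from its mean frequency.
    pop_means = [mean(pop) for pop in freqs_by_pop]
    h_s = mean([heterozygosity(p) for p in pop_means])
    # H_T: diversity of the pooled set of populations.
    h_t = heterozygosity(mean(pop_means))
    f_is = (h_s - h_i) / h_s if h_s > 0 else float("nan")
    f_st = (h_t - h_s) / h_t if h_t > 0 else float("nan")
    return f_is, f_st
```

With frequencies 0.2 and 0.6 in two individuals carrying 100 and 300 copies, the unweighted and copy-number-weighted averages differ (0.4 versus 0.5), which is why the choice raised in (vi) matters whenever copy number covaries with variant frequency.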

(3) Complete disregard of the mutation rate difference between rRNA genes and the genome-wide average

Nearly all data analysis in this paper relied on comparison between rRNA genes with the rest (presumably single-copy part) of the genome. However, mutation rate, a key parameter determining the diversity and divergence levels, was completely ignored in the comparison. It is well known that mutation rate differs tremendously along the genome, with both fine and large-scale variation. If the mutation rate of rRNA genes differs substantially from the genome average, it would invalidate almost all of the analysis results. Yet no discussion or justification was provided.

Related to mutation rate: given the hypermutability of CpG sites, it is surprising that the evolution/fixation rate of rRNA estimated with or without CpG sites is so close (2.24% vs 2.27%). Given the 10 - 20-fold higher mutation rate at CpG sites in the human genome, and 2% CpG density (which is probably an under-estimate for rDNA), we expect the former to be at least 20% higher than the latter.
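The expectation in the preceding paragraph can be checked with a short calculation. The sketch below assumes that neutral divergence scales linearly with the average per-site mutation rate, that non-CpG sites mutate at a baseline rate of 1, and that CpG sites mutate `cpg_fold` times faster. The function name is hypothetical; the 2% density and the 10 to 20-fold range are the figures quoted above.

```python
def rate_ratio_with_cpg(cpg_density: float, cpg_fold: float) -> float:
    """Average mutation rate including CpG sites, relative to non-CpG sites only.

    cpg_density: fraction of sites that are CpG (e.g. 0.02).
    cpg_fold: CpG mutation rate relative to the non-CpG baseline (e.g. 10).
    """
    # Non-CpG sites contribute at rate 1, CpG sites at rate cpg_fold.
    return (1.0 - cpg_density) * 1.0 + cpg_density * cpg_fold


if __name__ == "__main__":
    for fold in (10.0, 20.0):
        ratio = rate_ratio_with_cpg(0.02, fold)
        print(f"{fold:.0f}-fold CpG rate: expected excess {100 * (ratio - 1):.0f}%")
```

Under these assumptions the rate including CpG sites exceeds the rate excluding them by about 18% at 10-fold and about 38% at 20-fold, whereas the two reported estimates (2.24% and 2.27%) differ by roughly 1%.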

Among the weaknesses above, concern (1) can be addressed with clarification, but concerns (2) and (3) invalidate almost all findings from the data analysis and cannot be easily alleviated without a complete revamp of the work.

https://doi.org/10.7554/eLife.99992.1.sa1

Author response:

(1) First, we wish to point out that there has not been a model for quantifying genetic drift in multi-copy gene systems. Hence, the first attempt using the Haldane model is not expected to be familiar and readily acceptable. Nevertheless, the standard WF (Wright-Fisher) model cannot handle drift in multi-copy gene systems, such as viruses, due to the two levels of genetic drift – within individuals as well as between individuals of the population.

[Point 1 responds to the comments that we did not engage with the literature, in particular publications on Cannings models, which are extensions of the WF model. As pointed out above, models based on WF sampling cannot handle the two levels of genetic drift.]

(2) A crucial aspect of the study is the nature of the rRNA gene cluster, which is also a multi-copy gene system. It is easy to see that some multi-copy gene systems, like viral particles or mtDNAs, have a sub-population of genes within each individual. It is less obvious that tandem arrays of gene copies like rRNA genes can be treated as sub-populations that are subjected to drift. Nevertheless, rRNA gene copies frequently transfer mutations among copies in the same cell via the homogenization process. Hence, rRNA genes do not have the property of "locus" of single-copy genes as they move about as well (a bit like transposons but via different mechanisms). Indeed, the collection of rRNA genes in a cell is referred to as the “community of genes” as cited in Fig. 1. Over hundreds of generations, rRNA genes are effectively a small gene pool like mtDNAs within cells. Furthermore, the copy number of rRNA genes also changes rapidly among individuals. For these reasons, genetic drift is operative within cells and this study aims to determine its strength (see Response 3 below).

[Point 2 of the response addresses questions of Review #1 such as "(whether) the authors are referring to diversity in a single copy of an rRNA gene (or) diversity across the entire array of rRNA genes" or "(whether) the discussion of heterozygosity at rRNA ... is diversity per single copy locus or after collapsing loci together". The answer should be "the genetic diversity of the population of rRNA genes in the cell", noting that the single gene locus does not apply here. Similarly, a question like "Alignment to a single reference genome would likely lead to incorrect and even failed alignment for some reads" from Review #2 appears to be based on the homology concept of an rRNA gene locus. All rRNA gene copies are aligned against the consensus of the population of genes of the species. The consensus nucleotide nearly always accounts for > 90% of the gene copies in the population.]

(3) We now clarify the meaning of C*, the effective copy number of rRNA genes. We apologize that the abstract is indeed unclear, and even misleading. In the abstract, we did not use different notations for the actual copy number (C) and the effective copy number (C*) of rRNA genes. Instead, we use the letter C to designate both. Furthermore, in the main text, the presentation of the effective number, C*, is overly complicated (in order to be realistic). We apologize. Slight modifications of the abstract should have removed all the mis-understandings, as shown below.

"On average, rDNAs have C ~ 150 - 300 copies per haploid in humans. While a neutral mutation of a single-copy gene would take 4N (N being the population size) generations to become fixed, the time should be 4NC* generations for rRNA genes where 1<< C* (C* being the effective copy number; C* > C or C* <C will depend on the strength of drift). However, the observed fixation time in mouse and human is < 4N, implying the paradox of C* < 1. Genetic drift that encompasses all random neutral evolutionary forces appears as much as 100 times stronger for rRNA genes as for single-copy genes, thus reducing C* to < 1."

[Point 3 responds to the key criticisms. From Review #1 " The authors frame the number of rRNA genes as roughly equivalent to expanding the population size, ... a mutation can spread among rRNA gene copies is fundamentally different …". Indeed, the abstract can be very misleading when it uses CN interchangeably with C*N, essentially by allowing C to mean both.

From Review #2 "In Eq (1), although C* is defined as the "effective copy number", it is unclear what it means in an empirical sense…". From the slightly revised text quoted above, it should be clear that the fixation time as well as the level of polymorphism represent the empirical measures of C*.]
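The relation between fixation time and C* in the revised abstract can be illustrated with a minimal sketch. It takes the expression 4NC* for the neutral fixation time at face value and inverts it; the function names and the numerical inputs in the usage note are hypothetical and are not estimates from the study.

```python
def expected_fixation_time(pop_size: float, copy_number: float) -> float:
    """Neutral fixation time in generations, taken as 4 * N * C."""
    return 4.0 * pop_size * copy_number


def implied_effective_copy_number(observed_fixation_time: float,
                                  pop_size: float) -> float:
    """Effective copy number C* implied by an observed fixation time.

    Inverts T = 4 * N * C*, so C* = T / (4 * N). A value below 1 means
    fixation is faster than for a single-copy gene in the same population.
    """
    return observed_fixation_time / (4.0 * pop_size)


if __name__ == "__main__":
    n = 10_000  # hypothetical population size
    print(expected_fixation_time(n, 150))            # actual copy number C = 150
    print(implied_effective_copy_number(20_000, n))  # hypothetical observed time
```

With a hypothetical N of 10,000, an actual copy number of 150 would predict 6,000,000 generations to fixation, while an observed fixation time of 20,000 generations would imply C* = 0.5, that is, C* < 1 as in the paradox described above.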

(4) Lastly, we shall address the misunderstood "reproductive success" of rRNA genes, which is the number of progeny, K, in the Haldane model. K should be more accurately referred to as the transmission speed. For single-copy genes, reproductive success and transmission both mean the same thing, K. But the term reproductive success is not appropriate for rRNA genes even though the formulae for K are the same for all gene systems.

[Point 4 responds to all criticisms using the term "reproductive success"]

https://doi.org/10.7554/eLife.99992.1.sa0
