Introduction

In this study, we focus on multi-copy gene systems, where the evolution takes place both within and between individuals in two stages. Multi-copy gene systems include viruses, transposons, mitochondria and multi-gene family (Alexandrov, et al. 2001; Szitenberg, et al. 2016; Xu, et al. 2019; Ruan, Wen, et al. 2021). Given the large number of copies in the entire population and the two-stage evolution, the neutral evolutionary rate should be much slower than in the standard single-copy systems. However, the rapid evolution of multi-copy gene systems has been extensively documented (Charlesworth, et al. 1994; Eickbush and Eickbush 2007; Jurka, et al. 2007; Hou, et al. 2023). The main reason for this paradox is that the speed of neutral evolution of multi-gene systems is beyond the power of the standard WF (Wright-Fisher) model of molecular evolution.

The speed of neutral evolution is the basis for determining how fast or slow all types of molecular evolution take place. Neutral evolution is driven by random transmission, unbiased conversion, stochastic replication etc., which collectively constitute genetic drift. Hence, genetic drift is the fundamental force of molecular evolution. All other evolutionary forces, such as selection, mutation and migration, may be of greater biological interest, but their inferences are possible only when genetic drift is fully accounted for. Under-estimation of genetic drift has been the major cause of the over-estimation of other forces. In the companion study(Ruan, et al. 2024), we show that the conventional Wright-Fisher (WF) model may often under-account genetic drift.

Ruan, et al. (Ruan, et al. 2024) propose the integration of the WF model with the Haldane model of genetic drift, referred to as the WFH model. The Haldane model is based on the branching process. In haploids, each individual produces K progeny with the mean and variance of E(K) and V(K). Genetic drift is primarily V(K) as there would be no drift if V(K) = 0. Gene frequency change in the population is thus scaled by N (population size), expressed as V(K)/N. In diploids, K would be the number of progeny to whom the gene copy of interest is transmitted. (The adjustments between haploidy and multi-ploidy are straightforward [see (Ruan, et al. 2024)])

In the WF model, gene frequency is governed by 1/N (or 1/2N in diploids) because K would follow the Poisson distribution whereby V(K) = E(K). Since E(K) is generally ∼1 in the long run, V(K) would be ∼ 1 as well. (We note that Ne = N/V(K) can be a solution for the WF model, but such a solution would be biologically nonsensical albeit mathematically feasible.) Instead, the WFH model is biologically based on the Haldane model, but with mathematical solutions approximated by the WF model when feasible.

The WF model has led to a number of paradoxes as discussed in (Ruan, et al. 2024). They include i) the paradox of changing N’s whereby the strength of drift may increase as N increases; ii) the paradox of sex chromosomes whereby genetic drift depends on the sex, even when N is fully accounted for; iii) the paradox of drift under selection whereby genetic drift depends on V(K) but not on N. These paradoxes are resolved by the integrated WFH model.

In addition to the many paradoxes, there exist a number of genetic systems that are incompatible with the WF model. The influence of stochastic forces are hence rarely defined in such systems. These may include somatic cell evolution (Ling, et al. 2015; Wu, et al. 2016; Chen, Wu, et al. 2022), domestication (Wang, et al. 2014; Ostrander, et al. 2019) and even interspecific competitions, which do have stochastic elements as well.

A common multi-copy system is viral evolution, evident in the recent pandemic of COVID-19 (Ruan, Luo, et al. 2021; Ruan, Wen, et al. 2021; Ruan, et al. 2022; Hou, et al. 2023). Another example is systems of large number of nearly-identical genes within the same genome such as transposons (Szitenberg, et al. 2016), mitochondrial DNAs (Xu, et al. 2019), satellite DNAs (Wu, et al. 1989; Cabot, et al. 1993; Alexandrov, et al. 2001), and ribosomal RNA genes (van Sluis and McStay 2019; Hori, et al. 2021). In these examples, the copy number (designated C) is in the hundreds or thousands per individual. Nevertheless, C = 2 as in all diploids is also a multi-copy gene system as the two copies may often evolve interactively via gene conversion or segregation distortion (Wu, et al. 1988; Tao, et al. 2001; Wyckoff, et al. 2002; McDermott and Noor 2010). The WF model has been noted to be inadequate even in such diploid systems (Charlesworth 2009; Charlesworth and Charlesworth 2010; Chen, et al. 2017). After all, the WF model is an essentially a haloid model with 2N copies in the population.

We now apply the Haldane model to track the genetic drift in multi-copy gene systems, using ribosomal RNA (rRNA) genes as an example. The absence of an adequate model has led to speculations of adaptation by natural selection (Dover 1982; Lu and Wu 2005; Nair, et al. 2008; Arkhipova 2018; Chen, Yang, et al. 2022). We will show that neutral stochasticity would be a simpler explanation if the more powerful Haldane model is adopted. The analysis also reveals a new paradox of genetic drift; namely, the “effective population size” within individuals can in fact be smaller than 1.

Results

PART I presents a best-known multi-gene system of the rRNA genes. PART II (Theory) consolidates aspects of the Haldane model with respect to polymorphism within species as well as divergence between species. In PART III (Data analyses), we apply the theory to rDNA evolution in mice and apes (human and chimpanzee).

PART I - The biology of rRNA gene clusters

The ribosomal RNA genes (rDNAs) are multi-copy gene clusters (Bowman, et al. 2020) that are arrayed as tandem repeats on multiple chromosomes as shown in Fig. 1A (Guillén, et al. 2004; Cazaux, et al. 2011). In humans, the copy number can vary from 60 to 1600 per individual (mean, 315; SD, 104; median, 301) (Parks, et al. 2018). For each haploid genome, C ∼ 150 in humans and C ∼ 110 in mice (Parks, et al. 2018). In humans, the five rRNA clusters are located on the short arm of the five acrocentric chromosomes (Smirnov, et al. 2021). Such an arrangement permits crossovers between chromosomes without perturbing the rest of the genomes. In Mus, the rDNAs are all located in the pericentromeric or sub-telomeric region, on the long arms of telocentric chromosomes (Cazaux, et al. 2011; Potapova and Gerton 2019). Thus, unequal crossovers between non-homologous chromosomes may involve centromeres while other genic regions are also minimally perturbed.

The “chromosome community of rDNAs on five acrocentric chromosomes.

(A) The genomic locations of rDNA tandem repeats in human (Gibbons, et al. 2015) and mouse (Cazaux, et al. 2011). rDNAs are located on the short arms (human), or the proximal end of the long arms (mouse), of the acrocentric chromosome. Either way, inter-chromosomal exchanges are permissible. (B) The organization of rDNA repeat unit. IGS (intergenic spacer) is not transcribed. Among the transcribed regions, 18S, 5.8S and 28S segments are in the mature rRNA while ETS (external transcribed spacer) and ITS (internal transcribed spacer) are excluded. (C) The “chromosomes community” map modified from Fig. 1 of (Guarracino, et al. 2023). The map shows the distance of divergence among chromosome segments. The large circle encompasses rDNAs from all 5 chromosomes with color dots indicating the chromosome origin. It is clear that rDNAs experience concerted evolution regardless of their genomic locations. The slightly smaller thin circle, from the analysis of this study, shows that the rDNA gene pool from each individual captures approximately 95% of the total diversity of human population. (D) A simple illustration that shows the transmissions of two new mutations (#1 and #2 in red letter). Mutation 1 experiences replication slippage, gene conversion and unequal crossover and grows to 9 copies (K = 9) after transmission. Mutation 2 emerges and disappears (K = 0). This shows how V(K) may be augmented by the homogenization process.

Each copy of rRNA gene has functional and non-functional parts as shown in Fig. 1B. The “functional” regions of rDNA, 18S, 5.8S, and 28S rDNA, are believed to be under strong negative selection, resulting in a slow evolution rate since the divergence of animals and plants (Salim and Gerton 2019). In contrast, the transcribed spacer (ETS and ITS) and the intergenic spacer (IGS) show significant sequence variation even among closely related species (Eickbush and Eickbush 2007). This suggests that these “non-functional” sequences are less constrained by negative selection. In this study of genetic drift, we focus on the non-functional parts. Data on the evolution of the functional parts will be provided only for comparisons.

The “chromosome community” of ribosomal RNA genes

While a human individual with 300 rRNA genes may appear to have 300 loci, the concept of distinct “gene loci” cannot be applied to the rRNA gene clusters. This is because DNA sequences can spread from one copy to others on the same chromosome via replication slippage. They can also spread among all other copies on the same or different chromosomes via gene conversion and unequal crossovers (Nagylaki 1983; Ohta and Dover 1983; Stults, et al. 2008; Smirnov, et al. 2021). Replication slippage and unequal crossovers would also alter the copy number of rRNA genes, thus accounting for the copy number variation among individuals. These mechanisms will be referred to collectively as the homogenization process, resulting in a non-Mendelian pattern of inheritance.

Copies of the cluster on the same chromosome are widely known to be nearly identical in sequences (Hori, et al. 2021; Nurk, et al. 2022). Previous research has also provided extensive evidence that rDNA clusters among nonhomologous chromosomes share the same variants, thus confirming genetic exchange between chromosomes (Krystal, et al. 1981; Arnheim, et al. 1982; van Sluis, et al. 2019). For example, Nurk, et al. (Nurk, et al. 2022) report a 5-kb segment that covers mostly non-functional sites showing a 98.7% identity among the short arms of the five acrocentric chromosomes. Additionally, variants are often found in all human rDNAs but not in other great apes, suggesting rapid fixation (Arnheim, et al. 1980). The rapid fixation is unexpected because, for a new rDNA mutation to be fixed, it has to be fixed among all ∼ 300 copies in all individuals in the population. In this process, both rapid drift and positive selection can contribute to these fixations. However, the main difference between them is that genetic drift only shorten the fixation time of mutations without changing their fixation probability, whereas selection can alter both. As a result, rapid drift can only accelerate the evolution rate up to neutral molecular evolutionary rate (Eq. (3) below), while positive selection is not constrained by this limit.

With the extensive exchanges via the homogenization process, copies of rRNA genes in an individual can be treated as a sub-population of gene copies. Such sub-populations do not have to be in the sense of a Mendelian population when using the branching process to track genetic drift (see Discussion and Chen et al. 2017). (Guarracino, et al. 2023) recently proposed the view of “chromosome community” for the short arms of the 5 acrocentric chromosomes, where rRNA genes are located. As seen in Fig. 1C, these five short arms harbor a pool of rRNA genes that do not have strong associations with genomic locations. Fig. 1C further illustrates that each human individual’s rRNA gene pool can capture ∼ 95% of the total diversity of the human populations (see PART III). Fig. 1D presents the molecular mechanisms of genetic drift within individuals (i.e., Stage I evolution) in relation to the multi-copy gene systems.

PART II - Theory

1. The Haldane model of genetics drift applied to multi-copy genetic systems

The Haldane model of genetic drift based on the branching process is intuitively more appealing. In the model, each copy of the gene leaves K copies in a time interval with the mean and variance of E(K) and V(K). If V(K) = 0, there is no gene frequency change and no genetic drift. In the WF model, V(K) = E(K) whereas V(K) is decoupled from E(K) in the Haldane model; the latter thus being more flexible.

Below, we compare the strength of genetic drift in rRNA genes vs. that of single-copy genes using the branching process. E(K) is set to 1 so that we can focus on V(K). For single-copy genes, the effective population size that determines genetic drift would be Ne = N/V(K) (Kimura and Crow 1963). Details are presented in the companion paper (Ruan, et al. 2024). We shall use * to designate the equivalent symbols for rRNA genes. Hence,

where N* is the number of rRNA genes in the entire population and, therefore, N* = C*N. For rRNA genes, C* >> 1 but < 150, C* being the effective copy number of rRNA genes. (C =1 for single copy genes.) The value of C* is uncertain because each haploid in humans, on average, has ∼ 150 copies that do not segregate but assort among chromosomes (Parks, et al. 2018). V*(K) for rRNA genes is likely to be much larger than V(K) because K for rRNA mutations may be affected by a host of factors including replication slippage, unequal cross-over, gene conversion and other related mechanisms not operating on single copy genes. We thus designate Vh = V*(K)/V(K), as the homogenizing factor of rRNA genes.

Fig. 1D is a simple illustration that the homogenizing process may enhance V(K) substantially over the conventional WF model as each mechanism amplifies the variance in K. Importantly, the homogenization process is considered stochastic rather than deterministic because it affects both the mutants and wild type equally, as expected of genetic drift.

We then designate R* = (1/Ne*)/(1/Ne) as the strength of genetic drift of rRNA genes, relative to that of single copy genes. R* is measurable, as will be done below.

In short, R* is the ratio of two factors. A larger Vh [=V*(K)/V(K)] speeds up the drift strength of rRNA genes whereas a larger C* slows down the process of fixation by drift. It would be most interesting if R* > 1 as it would mean that the effective population size of rRNA genes is smaller than single-copy genes. In this case, the homogenization force should be strong enough to overcome the large copy number of rRNA genes (∼ 150) during the fixation of new mutations.

2. rDNA polymorphism within species

For rDNAs, there are 3 levels of diversity: HI, HS and HT. They represent, respectively, the heterozygosity within individual (I), within species (S) and in the total data set (T, which often comprises two species or more). Briefly, H = 2p(1-p) if there are two variants with the frequency of p and 1-p. In an equilibrium haploid population,

Given u, HS is a measure of genetic drift within a population. For rRNA genes, HS is expected to be larger than that single-copy genes as N* > N. With the three levels of heterozygosity for multi-copy genes, there are two levels of differentiation. First, FIS is the differentiation among individuals within the species, defined by

Second, FSTis the differentiation between populations (or species) in the context of total diversity, defined as

The two F statistics will inform about the distribution of nucleotide diversity among individuals, or between species. HT is the expected heterozygosity when there is no subdivision between species and HSis the observed heterozygosity. Therefore, FST will be equal to 0 when there is no subdivision and will be positive as long as there is variation in allele frequency across subdivisions.

3. rDNA divergence between species

Whereas the level of genetic diversity is a function of the effective population size, the rate of divergence between species, in its basic form, is not. The rate of neutral molecular evolution is (λ), although driven by mutation and genetic drift, is generally shown by Eq. (3) below (Crow and Kimura 1970; Hartl, et al. 1997; Li 1997):

Hence, we might view the evolution of rRNA genes in the long run to be the same as single-copy genes. However, Eq. (3) is valid only in the long-term evolution. For shorter-term evolution, it is necessary to factor in the fixation time (Fig. 2), Tf, which is the time between the emergence of a mutation and its fixation. If we study two species with a divergent time (Td) equal to or smaller than Tf, then few mutations of recent emergence would have been fixed as species diverge.

Fixation of mutations at two levels of species divergence, (Td1) and (Td2).

(Tf1) and (Tf2) are mutations with a shorter and longer fixation time, respectively for single-copy and multi-copy genes. Note that mutations with a longer Tf would show a lower fixation rate in short-term evolution.

Note that Td is about 6 million years (Myrs) between human and chimpanzee while Tf (as measured by coalescence) is 0.6 - 1 Myrs in humans. Mutations of single-copy genes would not get fixed during the more recent Tf, as indicated in Figure 2. Thus, λ may be 1/6 to 1/10 lower than the theoretical value. In comparison, Tf of rRNA mutations should be at least 4.8 - 8 Myrs based on the C* estimate. Thus, λ may be at least 80% lower than calculated for single-copy genes.

Our own theoretical derivation has shown that the fixation time for rRNA genes is close to 4N * as is the case for single-copy genes at ∼ 4Ne*. If Ve*(K) is sufficiently larger than V(K), it is in fact possible for Ne* < Ne such that genetic drift is stronger for rRNA genes than for single-copy genes and Tf* < Tf. Therefore, Tf1 can represent the Tf*of mutations in rRNA genes in Fig. 2. This is interesting because, if the homogenization is powerful enough, rRNA genes would have an effective copy number of C* < 1. A short Tf*, which can be obtained by large V*(K), would lead to a small Tf*/Td and thus a higher λ, particularly in short-term evolution. However, even Tf* approaches 0, λ can exceed that of single-copy genes but is still limited by fixation probability. As noted above, the rapidly fixed mutations may not be well represented in the polymorphic data but can accumulate in species divergence.

PART III - Data Analyses

Before presenting the main analyses, we provide some empirical observations on the rapid homogenization of rRNA genes within individuals (Stage I). These observations are needed for PART III that comprises i) the analyses of rDNA polymorphisms within species in mouse and human; and ii) the analysis of the divergence between species.

Empirical measurements of homogenization within cells

In an accompanying study (Wang, et al. unpublished data), the evolutionary rate of neutral rRNA variants within cells is measured. Here, genetic drift operates via the homogenizing mechanisms that include gene conversion, unequal crossover and replication slippage. In the literature, measurements of neutral rRNA evolution are usually based on comparisons among individuals. Therefore, the Mendelian mechanisms of chromosome segregation and assortment would also shuffle variants among individuals. Segregation and assortment would confound the measurements of homogenization effects within individuals.

In one experiment, the homogenization effects in rDNAs are measured in cultured cell lines over 6 months of evolution. For in vivo homogenization, we analyze the evolution of rRNA genes within solid tumors. We estimate the rate at which rRNA variants spread among copies within the same cells, which undergo an asexual process. The measurements suggest that, in the absence of recombination and chromosome assortment, the fixation time of new rRNA mutations within cells would take only 1 - 3 kyrs (thousand years). Since a new mutation in single-copy genes would take 300 - 600 kyrs to be fixed in human populations, the speed of genetic drift in Stage I evolution is orders faster than in Stage II. Therefore, despite having several hundred copies of rRNA genes per genome, the speed of genetic drift in rRNA genes may not be that much slower than in single-copy genes. This postulate will be tested below.

1. rDNA polymorphism within species

1) Polymorphism in mice

For rRNA genes, HI of 10 individuals ranges from 0.0056 to 0.0067 while HS is 0.0073 (Table 1). Thus, FIS = [HS - HI]/HSfor mice is 0.14, which means 86% of variation is within each individual. In other words, even one single randomly chosen individual would yield 85% of the diversity of the whole species. Hence, the estimated HS should be robust as it is not affected much by the sampling.

rRNA gene diversity in the 10 M. m. domesticus strains of a global collection.

HS for M. m. domesticus single-copy genes is roughly 1.40 per kb genome-wide (Geraldes, et al. 2008) while HS for rRNA genes is 7.25 per kb, 5.2 times larger. In other words, C* = N*/N ∼ 5.2. If we use the polymorphism data, it is as if rDNA array has a population size 5.2 times larger than single-copy genes. Although the actual copy number on each haploid is ∼ 110, these copies do not segregate like single-copy genes and we should not expect N* to be 100 times larger than N. The HS results confirm the prediction that rRNA genes should be more polymorphic than single-copy genes.

Based on the polymorphism result, one might infer slower drift for rDNAs than for single-copy genes. However, the results from the divergence data in later sections will reveal the under-estimation of drift strength from polymorphism data. Such data would miss variants that have a fast drift process driven by, for example, gene conversion. Strength of genetic drift should therefore be measured by the long-term fixation rate.

2) Polymorphism in human

FIS for rDNA among 8 human individuals is 0.059, much smaller than 0.142 in M. m. domesticus mice, suggesting a major role of stage I evolution in humans. Consisted with low FIS, Fig. 3 shows strong correlation of the polymorphic site frequency of rDNA transcribed region among each pair of individuals from three continents (2 Asians, 2 Europeans and 2 Africans). Correlation of polymorphic sites in IGS region is shown in Supplementary Fig. 1. Like those in mice, the pattern indicates that intra-species polymorphism is mainly preserved within individuals. In addition, the 6 panels on the diagonal line given in Fig. 3 demonstrated that the allele frequency spectrum is highly concentrated at very low frequency within individuals, indicating rapid homogenization of rDNA mutations among different copies even different chromosomes. This homogenization can lead to the elimination of the genetic variation of this region. The observed HI of humans for rDNAs is 0.0064 to 0.0077 and the HS is 0.0072 (Table 2). Research has shown that heterozygosity for the human genome is about 0.00088 (Zhao, et al. 2000), meaning the effective copy number of rDNAs is roughly, or C* ∼ 8. This reduction in effective copy number from 150 to 8 indicates strong genetic drift due to homogenization force.

Correlation of variant frequencies between human individuals.

The pairwise correlation of variant site frequency in the transcribed region of rDNAs among 6 individuals (2 Asians, 2 Europeans, and 2 Africans). The high correlations suggest that the diversity in each individual can well capture the population diversity. Each color represents a region of rDNA. The diagonal plots present the variant frequency distribution. The upper right section summarizes the Pearson correlation coefficients derived from the mirror symmetric plots in the bottom left. The analysis excluded the 18S, 3’ETS, and 5.8S regions due to the limited polymorphic sites. The result for IGS region is presented in Supplementary Figure S1.

rRNA gene diversity in the 8 humans of a global collection.

2. rDNA divergence between species

We now consider the evolution of rRNA genes between species by analyzing the rate of fixation (or near fixation) of mutations. Polymorphic variants are filtered out in the calculation.

1) Between mouse species - Genetic drift as the sole driving force of the rapid divergence

We now use the FST statistic to delineate fixation and polymorphism. The polymorphism in M. m. domesticus is compared with two outgroup species, M. spretus and M. m. castaneus, respectively. There are hence two depths in the phylogeny with two Td’s, as shown in Fig. 4A (Rudra, et al. 2016; Kumar, et al. 2022). There is a fourth species, M. m. musculus (shown in grey in Fig. 4A), which yields very similar results as M. m. domesticus in these two comparisons. These additional analyses are shown in Supplement Table S1-S2.

Levels of polymorphism and divergence in mice.

(A) Phylogeny of Mus musculus and Mus spretus mice. The divergence times are obtained from http://timetree.org/. The line segment labelled 0.5 represents 0.5 Myrs. (B) FIS distribution within M. m. domesticus. The distribution of FIS for polymorphic sites in 3 outbred mouse strains or 10 mouse strains (including 7 inbred mice) in Table 1 (Inset) as shown. (C) FST distribution between M. m. domesticus and Mus spretus. Note that the F values rise above 0.8 only in (C).

The FIS values of polymorphic sites in 3 outbred mice are primarily below 0.2 and rarely above 0.8 in Fig. 4B, indicating the low genetic differentiation in rDNAs within these 3 M. m. domesticus. While the FIS distribution of 10 mice from Table 1, including 7 inbred and 3 outbred mouse strains, exhibits a noticeable right skewness, but does not exceed 0.8. This suggests that inbreeding to a certain extent limits the process of homogenization and enhances population differentiation. In comparison, the distribution of the FST of variant sites between M. m. domesticus and Mus spretus has a large peak near FST = 1. This peak in Fig. 4C represents species divergence not seen within populations (i.e., FIS). We use FST = 0.8 as a cutoff for divergence sites between the two species. Roughly, when a mutant is > 0.95 in frequency in one species and < 0.05 in the other, FST would be > 0.80.

We first compare the divergence between M. m. domesticus and M. m. castaneus whereby Td has been estimated to be less than 0.5 Myrs (Fujiwara, et al. 2022). In comparison, between Mus m. domesticus and Mus spretus, Td is close to 3 Myrs (Rudra, et al. 2016). As noted above, the reduction in the divergence rate relative to that of Eq. (3) is proportional to Tf/Td(for single copy genes) or Tf*/Td (for rRNA genes). As Tf and Tf* are both from M. m. domesticus and Td is 6 times larger in comparison with M. spretus, we expect the results to be quite different between the two sets of species comparisons.

Although Tf and Tf* estimates are less reliable than Td estimates, both comparisons use the same Tf and Tf* from M. m. domesticus. Hence, the results should be qualitatively unbiased. For a demonstration, we shall use the estimates of Tf (i.e., the coalescence time) at 0.2 Myrs for single-copy genes by using an average nucleotide diversity of 0.0014 and the mutation rate of 5.7×10-9 per base pair per generation (Geraldes, et al. 2008; Phifer-Rixey, et al. 2020). Based on the estimated C* above, we obtain Tf* for rDNA mutations at 5×0.2 Myrs, or 1 Myrs. While some have estimated Tf to be close to 0.4 Myrs (Fujiwara, et al. 2022), we aim to show that the pattern of reduction in rRNA divergence is true even with the conservative estimates.

Between M. m. domesticus and M. m. castaneus, the reduction in substitution rate for single copy gene should be ∼ 40% (Tf/Td = 0.2/0.5), and the reduction for rRNA genes should be 100% (Tf*/Td = 1/0.5 > 1). Table 3 on their DNA sequence divergence shows that rRNA genes are indeed far less divergent than single-copy genes. In fact, only a small fraction of rRNA mutations is expected to be fixed as Tf*for rDNA at 1 Myrs is 2 times larger than the divergence time, Td. We should note again that the non-negligible fixation of rRNA mutations suggests that C* at 5 is perhaps an over-estimate.

Divergence in rRNA genes between M. m. domesticus and M. m. castaneus.

Between Mus m. domesticus and Mus spretus, the reduction in actual evolution rate from theoretical limit for single-copy genes should be 6.7% (Tf/Td = 0.2/3) and, for rRNA genes, should be 33% (Tf*/Td= 1/3). The evolutionary rate (i.e. the fixation rate) of IGS region is lower than single-copy genes, 0.01 in IGS and 0.021 in genome-wide (Table 4), as one would expect. However, ETS and ITS regions have evolved at a surprising rate that is 12% higher than single-copy genes. Note that the reduction in C*, even to the lowest limit of C* =1, would only elevate the rate of fixation in rRNA genes to a parity with single-copy genes. From Eq. 2, the explanation would be that the homogenization factor V*(K)/V(K) has to be very large, such that C* is < 1. With such rapid homogenization, the fixation time approaches 0 and the evolutionary rate in rRNA genes can indeed reach the theoretical limit of Eq. (3). In such a scenario, the substitution rate in ETS and ITS, compared to single-copy genes in mice, may increase by 7%, Tf /(TdTf) = 0.2/(3-0.2). If we use Tf ∼ 0.4 Myrs in an alternative estimation, the increase can be up to 15%.

Divergence in rRNA genes between M. m. domesticus and Mus spretus.

In conclusion, the high rate of fixation in ETS and ITS may be due to very frequent gene conversions that reduce C* to be less than 1. In contrast, IGS may have undergone fewer gene conversions and its long-term C*is slightly larger than 1. Indeed, the heterozygosity in IGS region, at about 2-fold higher than that of ETS and ITS regions (8‰ for IGS, 5‰ for ETS and 3‰ for ITS), supports this interpretation.

2) Between Human and Chimpanzee - Positive selection in addition to rapid drift in rDNA divergence

Like the data of mouse studies, the polymorphism of rDNAs in humans would suggest a slower long-term evolution rate. The same caveat is that C* estimated from the polymorphism data would have missed those rapidly fixed variants. Hence, the long-term C* obtained from species divergence might be much smaller than 8.

Our results show that the evolutionary rate of rRNA genes between human and chimpanzee is substantially higher than that of other single-copy genes (Table 5). Especially, 5’ETS region shows a 100% rate acceleration, at 22.7‰ vs. 11‰ genome-wide. Even after removing CpG sites that are considered to have a ten-fold higher mutation rate, their fixation rate still reaches 22.4‰. In this case, even if C* <<1, the extremely rapid fixation will only increase the substitution rate by Tf /(TdTf) by 11%, compared to single-copy genes. Thus, the much accelerated evolution of rRNA genes between humans and chimpanzees cannot be entirely attributed to genetic drift. Other forces, such as biased gene conversion, must have contributed to this acceleration as well.

Divergence in rRNA genes between Human and Chimpanzee.
3) Biased gene conversion - Positive selection for rRNA mutations in apes, but not in mice

As stated, gene conversion should increase V(K) and speed up genetic drift. Conversion is commonly observed in highly recombinogenic regions, especially in repetitive sequences. If gene conversion is unbiased, the effect on the rate of divergence between species would be quite modest unless TdTf. This is because unbiased conversion would only reduce the fixation time, Tf, as demonstrated in Fig. 2.

In this context, it is necessary to define unbiased gene conversion as it has different meanings at the site level vs. at the gene level. At the site level, when an AT site is paired with a GC site, it has been commonly reported that gene conversion would favor AT-to-GC over GC-to-AT conversion (Jeffreys and Neumann 2002; Meunier and Duret 2004). We first designate the proportion of AT-to-GC conversion as f and the reciprocal, GC-to-AT, as g. Given fg, this bias is true at the site level. However, at the genic level, the conversion may in fact be unbiased for the following reason. Let p be the proportion of AT sites and q be the proportion of GC sites in the gene. The flux of AT-to-GC would be pf and the flux in reverse, GC-to-AT, would be qg. At equilibrium, pf = qg. Given f and g, the ratio of p and q would eventually reach p/q = g/f. Thus, gene conversion at the genic level would be unbiased when the AT/GC ratio is near the equilibrium. In other words, there is no selection driving either AT-to-GC or GC-to-AT changes.

We now show that genic conversion is indeed unbiased (or nearly unbiased) between mouse species. In contrast, between human and chimpanzee, the fluxes are unequal. It appears that selection has favored the AT-to-GC changes in the last 6 million years that separate the two species.

In these genic analyses, we first determine whether there is biased conversion in the human lineage (Brown and Jiricny 1989; Galtier and Duret 2007). Using the same variant found in chimpanzees and gorillas as the ancestral state, we identified the derived variants that became nearly fixed in humans with frequency > 0.8 (Table 6). The chi-square test shows that the GC variants had a significantly higher fixation probability compared to AT. In addition, this pattern is also found in chimpanzees (p < 0.001). Between M. m. domesticus and M. m. castaneus (Table 6), the chi-square test reveals no difference in the fixation probability between GC and AT (p = 0.957). Further details can be found in Supplementary Figure 2. Overall, a higher fixation probability of the GC variants is found in human and chimpanzee, whereas this bias is not observed in mice.

The A/T to G/C and G/C to A/T changes in apes and mouse.

Based on Table 6, we could calculate the value of p, q, f and g as shown in Table 7. In particular, the (pf)/(qg) ratio is much larger than 1 in both the human and chimpanzee lineages. In contrast, the ratio in mouse is not significantly different from 1. The ratio for total rDNAs in mouse is lower than 1, but that is likely due to the influence of selective constraints on the functional parts of these genes.

The parameter values of p, q, f and g in the evolution between A/T and G/C.

Discussion

The Haldane model is an “individual-output” model of genetic drift (Chen, et al. 2017). Hence, it does not demand the population to follow the rules of Mendelian populations. It is also sufficiently flexible for studying various stochastic forces other than the sampling errors that together drive genetic drift. In the companion study(Ruan, et al. 2024), we address the ecological forces of genetic drift and, in this study, we analyze the neutral evolution of rRNA genes. Both examples are amenable to the analysis by the Haldane model, but not by the WF model.

In multi-copy systems, there are several mechanisms of homogenization within individuals. For rRNA genes, whether on the same or different chromosomes (Gonzalez and Sylvester 2001; van Sluis and McStay 2019), the predominant mechanism of homogenization is recombination that may lead to gene conversion or unequal crossover. In the process of exchanging DNA sections in meiosis, gene conversions are an order of magnitude more common than crossover (Cole, et al. 2012; Williams, et al. 2015). It is not clear how large a role is played by replication slippage which affects copes of the same cluster.

There have been many rigorous analyses that confront the homogenizing mechanisms directly. These studies (Smith 1974; Ohta 1976; Dover 1982; Nagylaki 1983; Ohta and Dover 1983) modeled gene conversion and unequal cross-over head on. Unfortunately, on top of the complexities of such models, the key parameter values are rarely obtainable. In the branching process, all these complexities are wrapped into V(K) for formulating the evolutionary rate. In such a formulation, the collective strength of these various forces may indeed be measurable, as shown in this study.

The branching process is a model for general processes. Hence, it can be used to track genetic drift in systems with two stages of evolution - within- and between-individuals, even though TEs, viruses or rRNA genes are very different biological entities. We use the rRNA genes to convey this point. Multi-copy genes, like rDNA, are under rapid genetic drift via the homogenization process. The drift is strong enough to reduce the copy number in the population from ∼ 150N to < N. A fraction of mutations in multi-copy genes may have been fixed by drift almost instantaneously in the evolutionary time scale. This acceleration is seen in mice but would have been interpreted to be due to positive selection by the convention. Interestingly, while positive selection may not be necessary to explain the mice data, it is indeed evident in human and chimpanzee, as the evolutionary rate of rRNA genes exceeds the limit of the strong drift.

In conclusion, the Haldane model is far more general than the WF model as this and the companion study clearly demonstrate. Its E(K) parameter (which is usually set to 1) should be equivalent to the single parameter of the WF model (i.e., N). The other parameter of the Haldane model, i.e., V(K), is then free to track genetic drift, whereas the WF model, setting V(K) = E(K), is highly constrained. Nevertheless, the vast literature using the WF model has led to substantial understandings of the neutral process via the diffusion process (Crow and Kimura 1970) or coalescence (Kingman 1982; Fu 2006). In this sense, the Haldane model should be built on the WF model by introducing a second parameter and permit the analyses of a broader range of stochastic ecological and evolutionary forces.

Materials and Methods

Data Collection

We collected high-coverage whole-genome sequencing data for our study. The genome sequences of human, chimpanzee and gorilla were sourced from National Center for Biotechnology Information (NCBI). Human individuals were drawn from diverse geographical origins encompassing three continents (4 Asians, 2 Europeans and 2 Africans) and Asia CRC was the normal tissue of Case 1 patient from (Chen, Wu, et al. 2022). Genomic sequences of mice were sourced from the Wellcome Sanger Institute’s Mouse Genome Project (MGP). Although some artificial selection have been performed on laboratory mouse strains (Yang, et al. 2011), the WSB/EiJ, ZALENDE/Ei and LEWES/EiJ strains were derived from wild populations. Incorporating these wild-derived laboratory strains, along with other inbred strains, a cohort of 10 mice was utilized to approximate the population representation of M. m. domesticus. Furthermore, the low FIS of 0.14 for rDNA in M. m. domesticus found in this study suggests that each mouse covers 86% of population’s genetic diversity, thereby mitigating concerns about potential sampling biases. Accessions and the detailed information of samples used in this study are list in Supplementary Table 4 and Table 5.

Data Processing

Following adapter trimming and the removal of low-quality sequences, these whole-genome sequencing data of apes and mice were mapped against respective reference sequences: the human rDNA reference sequence (Human ribosomal DNA complete repeating unit, GenBank: U13369.1) and the mouse rDNA reference sequence (Mus musculus ribosomal DNA, complete repeating unit, GenBank: BK000964.3). Alignment was performed using Burrows-Wheeler-Alignment Tool v0.7.17 (Li and Durbin 2009) with default parameters. Each individual was considered as a pool of rRNA genes, thus there were three levels of diversity.

To determine the variant frequency within individual, bcftools (Danecek, et al. 2021) was employed to call variants from multiple alignment files. Per-sample allele depths were obtained using following settings ‘bcftools mpileup --redo-BAQ --max-depth 50000 --per-sample-mF --annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,INFO/AD,INFO/ADF,INFO/A DR’ and ‘bcftools call -mv’. The sites with depth <10 were filtered out in subsequent analysis, and the average of the minimum depths for each remaining site across all samples is above 3000, thereby enhancing the robustness of variant frequency estimation. In this process, we can identity all the polymorphic sites present in these samples and the variant frequency within each individual. The population-level variant frequency was computed by averaging variant frequencies across all individuals. Our analysis specifically focused on single nucleotide variants (SNVs), while other mutation types were discarded for consistency.

Identification of Divergence Sites

Polymorphism represents the transient phase preceding divergence. With variant frequency within individual, within species, and between species, we obtained the FIS and FST values for each site. Within the M. m. domesticus population, 779 polymorphic sites were identified. Among these, 744 sites (95.5%) exhibited an FIS below 0.4 and rarely above 0.8 in Fig. 4B. In the comparison between M. m. domesticus and Mus spretus, 1579 variant sites were found, among which 453 sites displayed an FST above 0.8, indicating swift fixation of mutations during species divergence.

FST analysis between human and chimpanzee was conducted for each human individual, summarized in Table 5. We identified a range from 672 to 705 sites with FST values above 0.8 across individuals, depicting robust divergence sites. Considering the high mutation rate in CpG sites (Ehrlich and Wang 1981; Sved and Bird 1990) and predominantly GC content in rDNA (e.g. 58% GC in human), we further estimated the evolutionary rate at non-CpG sites during the interspecies divergence. To achieve this, mutations in CpG sites were manually removed by excluding all sites contained CpG in one species and TpG or CpA in the other; the reverse was similarly discarded. Additionally, the count of non-CpG sites within the mapping length, where site depth exceeded 10, was performed by Samtools (Danecek, et al. 2021) with the settings ‘samtools mpileup -Q 15 -q 20’. As a result, the evolutionary rate of rDNA in non-CpG sites was ascertained.

For assessing diversity and divergence across gene segments, we used ‘samtools faidx’ to partition variants into a total of 8 regions within rRNA genes, including 5’ETS, ITS1, ITS2, 3’ETS, IGS, 18S, 5.8S, and 28S, aligning them with corresponding reference sequences for further analysis. The functional parts (18S, 5.8S, and 28S) were subject to strong negative selection, exhibiting minimal substitutions during species divergence as expected. This observation is primarily used for comparison with the non-functional parts and reflects that the non-functional parts are less constrained by negative selection.

Genome-wide Divergence Estimation

To assess the genome-wide divergence between 4 mouse strain species, we downloaded their toplevel reference genomes from Ensembl genome browser (GenBank Assembly ID: GCA_001624865.1, GCA_001624775.1, GCA_001624445.1, and GCA_001624835.1). Then we used Mash tools (Ondov, et al. 2016) to estimate divergence across the entire genome (mainly single-copy genes) with ‘mash sketch’ and ‘mash dist’. Additionally, the effective copy number of rRNA genes, denoted as C*, can be estimated by calculating the ratio of population diversity observed in rDNA to that observed in single-copy genes.

Estimation of Site Conversion

To estimate site conversions, variants shared between chimpanzees and gorillas, humans and gorillas, M. m. castaneus and M. spretus were considered as the ancestral state for human, chimpanzee, and M. m. domesticus, respectively. We identified the derived variants that became nearly fixed (frequency > 0.8) in the lineages of humans, chimpanzees and M. m. domesticus. In this study, six types of mutations were tabulated, representing ancestral-to-derived as depicted in Supplementary Fig. 2. For example, A-to-G represented the both A-to-G and T-to-C types of mutations. The C-to-G (or G-to-C) and A-to-T (or T-to-A) types of mutations were excluded in the subsequent analysis.

The alternative hypotheses of biased mutation pattern (Wolfe, et al. 1989; Francino and Ochman 1999) alone can be rejected in this study. According to the prediction of the mutational mechanism hypotheses, AT or GC variants should have equal fixation probabilities. We quantified the nearly fixed number of AT-to-GC and GC-to-AT types of mutations and conducted a chi-square test to assess their fixation probabilities. Notably, we found the GC variant had significant higher fixation probability than AT at site level in apes, but not in mice. Since the fixed sites accounted for less than 1% of non-functional length of rDNA, we calculated the AT-to-GC conversion rate, f, by dividing the number of fixed G or C sites by the number of AT base pairs, and the GC-to-AT conversion rate, g, was computed using a comparable approach. In order to test whether there is an equilibrium state at the genic level in these three species, we computed the (pf)/(qg) ratio in Table 7, and a significant deviation of the ratio from 1 would imply biased genic conversion.

Data Availability

No new data were generated in this study. The genomic data used in this study are available from National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/) and the Mouse Genomes Project (https://www.sanger.ac.uk/data/mouse-genomes-project/).

The specific accession numbers recorded in Supplementary Table S3 and S4.

Acknowledgements

We are grateful for the helpful comments from many colleagues on the Chat Group “Cancer - The new evolving species”, in particular, Weiwei Zhai, Yong Zhang, GuoDong Wang of CAS and Jianrong Yang of SYSU.