Introduction

In this study, we focus on multi-copy gene systems, where the evolution takes place in two stages: both within (stage I) and between individuals (stage II). Multi-copy gene systems include viruses, transposons, mitochondria and multi-gene families (Alexandrov, et al. 2001; Szitenberg, et al. 2016; Xu, et al. 2019; Ruan, et al. 2021). Given the extra stage of within-host fixation, the neutral evolutionary rate of multi-copy systems should be much slower than in single-copy systems. However, the rapid evolution of multi-copy systems has been extensively documented (Charlesworth, et al. 1994; Eickbush and Eickbush 2007; Jurka, et al. 2007; Hou, et al. 2023). A reason for this paradox, as well as many others (Ruan, et al. 2024), is that the speed of neutral evolution of multi-gene systems is not known.

The speed of neutral evolution is the basis for determining how fast or slow all types of molecular evolution take place. Neutral evolution is driven by random transmission, gene conversion, stochastic replication etc., which collectively constitute genetic drift. Hence, genetic drift is the fundamental force of molecular evolution. All other evolutionary forces, such as selection, mutation and migration, may be of greater biological interest, but inferences are possible only when genetic drift is fully accounted for. In the companion study (Ruan, et al. 2024), we show that the standard Wright-Fisher (WF) model may often under-account genetic drift, thus leading to the over-estimation of selection.

We propose the integration of the WF model with the Haldane model, referred to as the WFH model of genetic drift (Ruan, et al. 2024). The Haldane model is based on the branching process. In haploids, each individual produces K progeny with the mean and variance of E(K) and V(K). Genetic drift is primarily V(K) as there would be no drift if V(K) = 0. Gene frequency change in the population is then scaled by N (population size), expressed as V(K)/N. In diploids, K would be the number of progeny to whom the gene copy of interest is transmitted. (The adjustments between haploidy and multi-ploidy are straightforward).

In the WF model, gene frequency is governed by 1/N (or 1/2N in diploids) because K would follow the Poisson distribution whereby V(K) = E(K). As E(K) is generally ∼1, V(K) would also be ∼ 1. In this backdrop, many “modified WF” models have been developed(Der, et al. 2011), most of them permitting V(K) ≠ E(K) (Karlin and McGregor 1964; Chia and Watterson 1969; Cannings 1974). Nevertheless, paradoxes encountered by the standard WF model apply to these modified WF models as well because all WF models share the key feature of gene sampling (see below and (Ruan, et al. 2024)). One of the paradoxes, first noted in (Chen, et al. 2017) is genetic drift during tumor growth whereby drift appears to become stronger as N increases (Wu, et al. 2016; Chen, Wu, et al. 2022; Zhai, et al. 2022). This trend is in stark opposite to the central tenet of the WF models.

A paradox requiring dedicated efforts to analyze concerns multi-copy gene systems, which are as diverse as viral epidemics (Huang, et al. 2021), transposons (Szitenberg, et al. 2016), mitochondrial DNAs (Xu, et al. 2019), satellite DNAs (Cabot, et al. 1993; Alexandrov, et al. 2001), and ribosomal RNA genes (van Sluis and McStay 2019; Hori, et al. 2021). In COVID-19, the inability of the WF models to track both within- and between-host evolution simultaneously is a main reason for much confusion about the origin, spread and driving forces of SARS-CoV-2 (Ruan, et al. 2021; Deng, et al. 2022; Guan and Zhong 2022; Pan, Liu, et al. 2022; Ruan, et al. 2022; Zhou, et al. 2022; Hou, et al. 2023).

In multi-copy systems, the copy number (designated C) is in the hundreds or thousands per individual. Nevertheless, C = 2 as in all diploids is also a multi-copy gene system as the two copies may often evolve interactively via gene conversion or segregation distortion (Wu, et al. 1988; McDermott and Noor 2010). The WF models have been noted to be inadequate even in diploid systems (Charlesworth 2009; Chen, et al.2017). After all, the WF model is essentially a haloid model with 2N copies in the population.

We now apply the Haldane model to multi-copy gene systems, using ribosomal RNA (rRNA) genes as an example. While the WF models have led to speculations of pervasive natural selection (Dover 1982; Arkhipova 2018; Chen, Yang, et al. 2022; Pan, Zhang, et al. 2022; Wang, et al. 2022), neutral stochasticity would be a simpler explanation if the more powerful Haldane model is adopted. Since the WF model can yield results that are often good approximations for the Haldane model, the integration is referred to as the WFH model.

Results

PART I presents a best-known multi-gene system of the rRNA genes. PART II (Theory) consolidates aspects of the Haldane model applied to polymorphism within species as well as divergence between species. In PART III (Data analyses), we apply the theory to rDNA evolution in mice and apes (human and chimpanzee).

PART I - The biology of rRNA gene clusters

The ribosomal RNA genes (rDNAs) are multi-copy gene clusters (Bowman, et al. 2020) that are arrayed as tandem repeats on multiple chromosomes as shown in Fig. 1A (Guillén, et al. 2004; Cazaux, et al. 2011). In humans, the copy number can vary from 60 to 1600 per individual (mean, 315; SD, 104; median, 301) (Parks, et al. 2018). For each haploid genome, C ∼ 150 on average in humans and C ∼ 110 in mice (Parks, et al. 2018). In humans, the five rRNA clusters are located on the short arm of the five acrocentric chromosomes (Smirnov, et al. 2021). Such an arrangement permits crossovers between chromosomes without perturbing the rest of the genomes. In Mus, the rDNAs are all located in the pericentromeric or sub-telomeric region, on the long arms of telocentric chromosomes (Cazaux, et al. 2011; Potapova and Gerton 2019). Thus, unequal crossovers between non-homologous chromosomes may involve centromeres while other genic regions are also minimally perturbed.

The “chromosome community of rDNAs on five acrocentric chromosomes.

(A) The genomic locations of rDNA tandem repeats in human (Gibbons, et al. 2015) and mouse (Cazaux, et al. 2011). rDNAs are located on the short arms (human), or the proximal end of the long arms (mouse), of the chromosome. Either way, inter-chromosomal exchanges are permissible. (B) The organization of rDNA repeat unit. IGS (intergenic spacer) is not transcribed. Among the transcribed regions, 18S, 5.8S and 28S segments are in the mature rRNA while ETS (external transcribed spacer) and ITS (internal transcribed spacer) are excluded. (C) The pseudo-population of rRNA genes is shown by the “chromosomes community” map (Guarracino, et al. 2023), which indicates the divergence distance among chromosome segments. The large circle encompasses rDNAs from all 5 chromosomes. It shows the concerted evolution among rRNA genes from all chromosomes, which thus resemble members of a (pseudo-)population. The slightly smaller thin circle, from the analysis of this study, shows that the rDNA gene pool from each individual captures approximately 95% of the total diversity of human population. (D) A simple illustration that shows the transmissions of two new mutations (#1 and #2 in red letter). Mutation 1 experiences replication slippage, gene conversion and unequal crossover and grows to 9 copies (K = 9) after transmission. Mutation 2 emerges and disappears (K = 0). This shows how V(K) may be augmented by the homogenization process.

Each copy of rRNA gene has a functional and non-functional part as shown in Fig. 1B. The “functional” regions of rDNA, 18S, 5.8S, and 28S rDNA, are believed to be under strong negative selection, resulting in a slow evolution rate in animals (Salim and Gerton 2019). In contrast, the transcribed spacer (ETS and ITS) and the intergenic spacer (IGS) show significant sequence variation even among closely related species (Eickbush and Eickbush 2007). Clearly, these “non-functional” sequences are less constrained by negative selection. In this study of genetic drift, we focus on the non-functional parts. Data on the evolution of the functional parts will be provided only for comparisons.

The pseudo-population of ribosomal DNA copies within each individual

While a human haploid with 200 rRNA genes may appear to have 200 loci, the concept of “gene loci” cannot be applied to the rRNA gene clusters. This is because DNA sequences can spread from one copy to others on the same chromosome via replication slippage. They can also spread among copies on different chromosomes via gene conversion and unequal crossovers (Nagylaki 1983; Ohta and Dover 1983; Stults, et al. 2008; Smirnov, et al. 2021). Replication slippage and unequal crossovers would also alter the copy number of rRNA genes. These mechanisms will be referred to collectively as the homogenization process. Copies of the cluster on the same chromosome are known to be nearly identical in sequences (Hori, et al. 2021; Nurk, et al. 2022). Previous research has also provided extensive evidence for genetic exchanges between chromosomes (Krystal, et al. 1981; Arnheim, et al. 1982; van Sluis, et al. 2019).

In short, rRNA gene copies in an individual can be treated as a pseudo-population of gene copies. Such a pseudo-population is not Mendelian but its genetic drift can be analyzed using the branching process (see below). The pseudo-population corresponds to the “chromosome community” proposed recently (Guarracino, et al. 2023). As seen in Fig. 1C, the five short arms harbor a shared pool of rRNA genes that can be exchanged among them. Fig. 1D presents the possible molecular mechanisms of genetic drift within individuals whereby mutations may spread, segregate or disappear among copies. Hence, rRNA gene diversity or polymorphism refers to the variation across all rRNA copies, as these genes exist as paralogs rather than orthologs. This diversity can be assessed at both individual and population levels according to the multi-copy nature of rRNA genes.

PART II - Theory

1. The Haldane model of genetics drift applied to multi-copy gene systems

The Haldane model of genetic drift based on the branching process is intuitively appealing. In the model, each copy of the gene leaves K copies in a time interval with the mean and variance of E(K) and V(K). If V(K) = 0, there is no gene frequency change and no genetic drift. In the standard WF model, V(K) = E(K) whereas V(K) is decoupled from E(K) in the Haldane model; the latter thus being more flexible. [In the companion paper, we discuss the modified WF models with V(K)E(K) which, nevertheless, do not resolve the paradoxes.]

Below, we compare the strength of genetic drift in rRNA genes vs. that of single-copy genes using the Haldane model (Ruan, et al. 2024). We shall use * to designate the equivalent symbols for rRNA genes; for example, E(K) vs. E*(K). Both are set to 1, such that the total number of copies in the long run remains constant.

For simplicity, we let V(K) = 1 for single-copy genes. (If we permit V(K) ≠ 1, the analyses will involve the ratio of V*(K) and V(K) to reach the same conclusion but with unnecessary complexities.) For rRNA genes, V*(K) ≥ 1 may generally be true because K for rDNA mutations are affected by a host of homogenization factors including replication slippage, unequal cross-over, gene conversion and other related mechanisms not operating on single-copy genes. Hence,

where C is the average number of rRNA genes in an individual and V*(K) reflects the homogenization process on rRNA genes (Fig. 1D). Thus,

represents the effective copy number of rRNA genes in the population, determining the level of genetic diversity relative to single-copy genes. Since C is in the hundreds and V*(K) is expected to be > 1, the relationship of 1 << C*C is hypothesized. Fig. 1D is a simple illustration that the homogenizing process may enhance V*(K) substantially over the WF model.

In short, genetic drift of rRNA genes would be equivalent to single-copy genes in a population of size NC* (or N*). Since C* >> 1 is hypothesized, genetic drift for rRNA genes is expected to be slower than for single-copy genes.

2. rDNA polymorphism within species

A standard measure of genetic drift is the level of heterozygosity (H). At the mutation-selection equilibrium

where μ is the mutation rate of the entire gene and Ne is the effective population size. In this study, Ne = N for single-copy gene and Ne = C*N for rRNA genes. The empirical measure of nucleotide diversity H is given by

where L is the gene length (for each copy of rRNA gene, L ∼ 43kb) and pi is the variant frequency at the i-th site.

We calculate H of rRNA genes at three levels – within-individual, within-species and then, within total samples (HI, HS and HT, respectively). HS and HT are standard population genetic measures (Hartl, et al. 1997; Crow and Kimura 2009). In calculating HS, all sequences in the species are used, regardless of the source individuals. A similar procedure is applied to HT. The HI statistic is adopted for multi-copy gene systems for measuring within-individual polymorphism. Note that copies within each individual are treated as a pseudo-population (see Fig. 1 and text above). With multiple individuals, HI is averaged over them.

Given the three levels of heterozygosity, there are two levels of differentiation. First, FIS is the differentiation among individuals within the species, defined by

FIS is hence the proportion of genetic diversity in the species that is found only between individuals. We will later show FIS 0.05 in human rDNA (Table 2), meaning 95% of rDNA diversity is found within individuals.

Second, FST is the differentiation between species within the total species complex, defined as

FST is the proportion of genetic diversity in the total data that is found only between species. Between mouse species, FST distribution is close to 1 (Fig. 4C), indicating a large genetic distance between species relative to within-species polymorphisms.

rRNA gene nucleotide diversity in the 10 M. m. domesticus strains of a global collection.

3. rDNA divergence between species

Whereas the level of genetic diversity is a function of the effective population size, the rate of divergence between species, in its basic form, is not. The rate of neutral molecular evolution (λ), although driven by mutation and genetic drift, is generally shown by Eq. (3) (Crow and Kimura 1970; Hartl, et al. 1997; Li 1997):

Note that the factor of 1/N in Eq. (3) indicates the fixation probability of a new mutation. For rDNA mutations, fixation must occur in two stages – fixation within individuals and then among individuals in the population. (We note again that new mutations can be fixed via homogenization in an individual, effectively forming a pseudo-population for rRNA genes.) Due to the cancelation of N in Eq. (3), the evolutionary rate of rRNA genes in the long run should be the same as single-copy genes.

Eq. (3) is valid in the long-term evolution. For shorter-term evolution, it is necessary to factor in the fixation time (Fig. 2), Tf, which is the time between the emergence of a mutation and its fixation. If we study two species with a divergent time (Td) equal to or smaller than Tf, then few mutations of recent emergence would have been fixed as species diverge.

Fixation of mutations at two levels of species divergence, (Td1) and (Td2).

(Tf1) and (Tf2) are mutations with a shorter and longer fixation time, respectively for single-copy and multi-copy genes. Note that mutations with a longer Tf would show a lower fixation rate in short-term evolution.

Note that Td is about 6 million years (Myrs) between human and chimpanzee while Tf (as measured by coalescence) is 0.6 - 1 Myrs in humans. Mutations of single-copy genes would not get fixed during the more recent Tf, as indicated in Figure 2. Thus, the realized substitution rate may be 1/6 to 1/10 lower than the theoretical value. In comparison, Tf of rRNA mutations should be at least 4.8 - 8 Myrs based on the C* estimate (see PATR III). Thus, the substitution rate could be at least 80% lower than calculated for single-copy genes.

Our own theoretical derivation has shown that the fixation time for rRNA genes is close to 4N* as is the case for single-copy genes at ∼ 4N. If V*(K) is sufficiently larger than V(K), it is in fact possible for N*< N such that genetic drift is stronger for rRNA genes than for single-copy genes and T * < T. Therefore, T can represent the Tf* of mutations in rRNA genes in Fig. 2. This is interesting because, if the homogenization is powerful enough, rRNA genes would have an effective copy number of C* < 1. A short Tf*, which can be obtained by large V*(K), would lead to a small Tf*/Td and thus a higher substitution rate, particularly in short-term evolution. However, even Tf* approaches 0, the substitution rate can exceed that of single-copy genes but is still limited by fixation probability. As noted above, the rapidly fixed mutations may not be well represented in the polymorphic data but can accumulate in species divergence.

PART III - Data Analyses

Before presenting the main analyses, we provide some empirical observations on the rapid homogenization of rRNA genes within individuals (Stage I). These observations are needed for PART III that comprises i) the analyses of rDNA polymorphisms within species in mouse and human; and ii) the analysis of the divergence between species.

Empirical measurements of homogenization within cells

In an accompanying study (Wang, et al. unpublished data), the evolutionary rate of neutral rRNA variants within cells is measured. Here, genetic drift operates via the homogenizing mechanisms that include gene conversion, unequal crossover and replication slippage. In the literature, measurements of neutral rRNA evolution are usually based on comparisons among individuals. Therefore, the Mendelian mechanisms of chromosome segregation and assortment would also shuffle variants among individuals. Segregation and assortment would confound the measurements of homogenization effects within individuals.

In one experiment, the homogenization effects in rDNAs are measured in cultured cell lines over 6 months of evolution. For in vivo homogenization, we analyze the evolution of rRNA genes within solid tumors. We estimate the rate at which rRNA variants spread among copies within the same cells, which undergo an asexual process. The measurements suggest that, in the absence of recombination and chromosome assortment, the fixation time of new rRNA mutations within cells would take only 1 - 3 kyrs (thousand years). Since a new mutation in single-copy genes would take 300 - 600 kyrs to be fixed in human populations, the speed of genetic drift in Stage I evolution is orders faster than in Stage II. Therefore, despite having several hundred copies of rRNA genes per genome, the speed of genetic drift in rRNA genes may not be that much slower than in single-copy genes. This postulate will be tested below.

1. rDNA polymorphism within species

1) Polymorphism in mice

For rRNA genes, HI of 10 individuals ranges from 0.0056 to 0.0067 while HS is 0.0073 (Table 1). Thus, FIS = [HS - HI]/HS for mice is 0.14, which means 86% of variation is within each individual. In other words, even one single randomly chosen individual would yield 85% of the diversity of the whole species. Hence, the estimated HS should be robust as it is not affected much by the sampling.

HS for M. m. domesticus single-copy genes is roughly 1.40 per kb genome-wide (Geraldes, et al. 2008) while HS for rRNA genes is 7.25 per kb (Table 1), 5.2 times larger. In other words, C* = N*/N ∼ 5.2. If we use the polymorphism data, it is as if rDNA array has a population size 5.2 times larger than single-copy genes. Although the actual copy number on each haploid is ∼ 110, these copies do not segregate like single-copy genes and we should not expect N* to be 100 times larger than N. The HS results confirm the prediction that rRNA genes should be more polymorphic than single-copy genes.

Based on the polymorphism result, one might infer slower drift for rDNAs than for single-copy genes. However, the results from the divergence data in later sections will reveal the under-estimation of drift strength from polymorphism data. Such data would miss variants that have a fast drift process driven by, for example, gene conversion. Strength of genetic drift should therefore be measured by the long-term fixation rate.

2) Polymorphism in human

FIS for rDNA among 8 human individuals is 0.059 (Table 2), much smaller than 0.142 in M. m. domesticus mice, indicating minimal genetic differences across human individuals and high level of genetic identity in rDNAs between homologous chromosomes among human population. Consisted with low FIS, Fig. 3 shows strong correlation of the polymorphic site frequency of rDNA transcribed region among each pair of individuals from three continents (2 Asians, 2 Europeans and 2 Africans). Correlation of polymorphic sites in IGS region is shown in Supplementary Fig. 1. The results suggest that the genetic drift due to the sampling of chromosomes during sexual reproduction (e.g., segregation and assortment) is augmented substantially by the effects of homogenization process within individual. Like those in mice, the pattern indicates that intra-species polymorphism is mainly preserved within individuals. The observed HI of humans for rDNAs is 0.0064 to 0.0077 and the HS is 0.0072 (Table 2). Research has shown that heterozygosity for the human genome is about 0.00088 (Zhao, et al. 2000), meaning the effective copy number of rDNAs is roughly, or C* ∼ 8. This reduction in effective copy number from 150 to 8 indicates strong genetic drift due to homogenization force.

Correlation of variant frequencies between human individuals.

The pairwise correlation of variant site frequency in the transcribed region of rDNAs among 6 individuals (2 Asians, 2 Europeans, and 2 Africans). The high correlations suggest that the diversity in each individual can well capture the population diversity. Each color represents a region of rDNA. The diagonal plots present the variant frequency distribution. The upper right section summarizes the Pearson correlation coefficients derived from the mirror symmetric plots in the bottom left. The analysis excluded the 18S, 3’ETS, and 5.8S regions due to the limited polymorphic sites. The result for IGS region is presented in Supplementary Figure S1.

rRNA gene nucleotide diversity in the 8 humans of a global collection.

2. rDNA divergence between species

We now consider the evolution of rRNA genes between species by analyzing the rate of fixation (or near fixation) of mutations. Polymorphic variants are filtered out in the calculation. Note that Eq. (3) shows that the mutation rate, μ, determines the long-term evolutionary rate, λ. Since we will compare the λ values between rRNA and single-copy genes, we have to compare their mutation rates first by analyzing their long-term evolution. As shown in Table S1, λ falls in the range of 50-60 (differences per Kb) for single-copy genes and 40 – 70 for the non-functional parts of rRNA genes. The data thus suggest that rRNA and single-copy genes are comparable in mutation rate. Differences between their λ values will have to be explained by other means.

1) Between mouse species - Genetic drift as the sole driving force of the rapid divergence

We now use the FST statistic to delineate fixation and polymorphism. The polymorphism in M. m. domesticus is compared with two outgroup species, M. spretus and M. m. castaneus, respectively. There are hence two depths in the phylogeny with two Td’s, as shown in Fig. 4A (Rudra, et al. 2016; Kumar, et al. 2022). There is a fourth species, M. m. musculus (shown in grey in Fig. 4A), which yields very similar results as M. m. domesticus in these two comparisons. These additional analyses are shown in Supplement Table S2-S3.

Levels of polymorphism and divergence in mice.

(A) Phylogeny of Mus musculus and Mus spretus mice. The divergence times are obtained from http://timetree.org/. The line segment labeled 0.5 represents 0.5 Myrs. (B) FIS distribution within M. m. domesticus. The distribution of FIS for polymorphic sites in 3 outbred mouse strains or 10 mouse strains (including 7 inbred mice) in Table 1 (Inset) is shown. (C) FST distribution between M. m. domesticus and Mus spretus. Note that the F values rise above 0.8 only in (C).

The FIS values of polymorphic sites in 3 outbred mice are primarily below 0.2 and rarely above 0.8 in Fig. 4B, indicating the low genetic differentiation in rDNAs within these 3 M. m. domesticus. While the FIS distribution of 10 mice from Table 1, including 7 inbred and 3 outbred mouse strains, exhibits a noticeable right skewness, but does not exceed 0.8. This suggests that inbreeding to a certain extent limits the process of homogenization and enhances population differentiation. In comparison, the distribution of the FST of variant sites between M. m. domesticus and Mus spretus has a large peak near FST = 1. This peak in Fig. 4C represents species divergence not seen within populations (i.e., FIS). We use FST = 0.8 as a cutoff for divergence sites between the two species. Roughly, when a mutant is > 0.95 in frequency in one species and < 0.05 in the other, FST would be > 0.80.

We first compare the divergence between M. m. domesticus and M. m. castaneus whereby Td has been estimated to be less than 0.5 Myrs (Fujiwara, et al. 2022). In comparison, between Mus m. domesticus and Mus spretus, Td is close to 3 Myrs (Rudra, et al. 2016). As noted above, the reduction in the divergence rate relative to that of Eq. (3) is proportional to Tf/Td (for single copy genes) or T */Td (for rRNA genes). As Tf and T * are both from M. m. domesticus and T is 6 times larger in comparison with M. spretus, we expect the results to be quite different between the two sets of species comparisons.

Although Tf and Tf* estimates are less reliable than Td estimates, both comparisons use the same Tf and T * from M. m. domesticus. Hence, the results should be qualitatively unbiased. For a demonstration, we shall use the estimates of Tf (i.e., the coalescence time) at 0.2 Myrs for single-copy genes by using an average nucleotide diversity of 0.0014 and the mutation rate of 5.7×10−9 per base pair per generation (Geraldes, et al. 2008; Phifer-Rixey, et al. 2020). Based on the estimated C* above, we obtain Tf* for rDNA mutations at 5×0.2 Myrs, or 1 Myrs. While some have estimated Tf to be close to 0.4 Myrs (Fujiwara, et al. 2022), we aim to show that the pattern of reduction in rRNA divergence is true even with the conservative estimates.

Between M. m. domesticus and M. m. castaneus, the reduction in substitution rate for single copy gene should be ∼ 40% (Tf/Td = 0.2/0.5), and the reduction for rRNA genes should be 100% (Tf*/Td = 1/0.5 > 1). Table 3 on their DNA sequence divergence shows that rRNA genes are indeed far less divergent than single-copy genes. In fact, only a small fraction of rDNA mutations is expected to be fixed as T * for rDNA at 1 Myrs is 2 times larger than the divergence time, Td. We should note again that the non-negligible fixation of rRNA mutations suggests that C* at 5 is perhaps an over-estimate.

Divergence in rRNA genes between M. m. domesticus and M. m. castaneus.

Between Mus m. domesticus and Mus spretus, the reduction in actual substitution rate from theoretical limit for single-copy genes should be 6.7% (Tf/Td = 0.2/3) and, for rRNA genes, should be 33% (T */Td = 1/3). The evolutionary rate (i.e. the fixation rate) of IGS region is lower than single-copy genes, 0.01 in IGS and 0.021 in genome-wide (Table 4), as one would expect. However, ETS and ITS regions have evolved at a surprising rate that is 12% higher than single-copy genes. Note that the reduction in C*, even to the lowest limit of C* =1, would only elevate the rate of fixation in rRNA genes to a parity with single-copy genes. From Eq. (1), the explanation would be that V*(K) has to be very large, such that C* is < 1. With such rapid homogenization, the fixation time approaches 0 and the substitution rate in rRNA genes can indeed reach the theoretical limit of Eq. (3). In such a scenario, the substitution rate in ETS and ITS, compared to single-copy genes in mice, may increase by 7%, Tf /(TdTf) = 0.2/(3-0.2). If we use Tf ∼ 0.4 Myrs in an alternative estimation, the increase can be up to 15%.

Divergence in rRNA genes between M. m. domesticus and Mus spretus.

In conclusion, the high rate of fixation in ETS and ITS may be due to very frequent gene conversions that reduce C* to be less than 1. In contrast, IGS may have undergone fewer gene conversions and its long-term C* is slightly larger than 1. Indeed, the heterozygosity in IGS region, at about 2-fold higher than that of ETS and ITS regions (8‰ for IGS, 5‰ for ETS and 3‰ for ITS), supports this interpretation.

2) Between Human and Chimpanzee - Positive selection in addition to rapid drift in rDNA divergence

Like the data of mouse studies, the polymorphism of rDNAs in humans would suggest a slower short-term evolution rate. The same caveat is that C* estimated from the polymorphism data would have missed those rapidly fixed variants. Hence, the long-term C* obtained from species divergence might be much smaller than 8.

Our results show that the evolutionary rate of rRNA genes between human and chimpanzee is substantially higher than that of other single-copy genes (Table 5). Especially, 5’ETS region shows a 100% rate acceleration, at 22.7‰ vs. 11‰ genome-wide. Even after removing CpG sites, their fixation rate still reaches 22.4‰. In this case, even if C*<<1, the extremely rapid fixation will only increase the substitution rate by Tf /(TdTf) by 11%, compared to single-copy genes. Thus, the much accelerated evolution of rRNA genes between humans and chimpanzees cannot be entirely attributed to genetic drift. In the next and last section, we will test if selection is operating on rRNA genes by examining the pattern of gene conversion.

Divergence in rRNA genes between Human and Chimpanzee.

3) Positive selection for rRNA mutations in apes, but not in mice – Evidence from gene conversion patterns

For gene conversion, we examine the patterns of AT-to-GC vs. GC-to-AT changes. While it has been reported that gene conversion would favor AT-to-GC over GC-to-AT conversion (Jeffreys and Neumann 2002; Meunier and Duret 2004) at the site level, we are interested at the gene level by summing up all conversions across sites. We designate the proportion of AT-to-GC conversion as f and the reciprocal, GC-to-AT, as g. Both f and g represent the proportion of fixed mutations between species (see Methods). So defined, f and g are influenced by the molecular mechanisms as well as natural selection. The latter may favor a higher or lower GC ratio at the genic level between species. As the selective pressure is distributed over the length of the gene, each site may experience rather weak pressure.

Let p be the proportion of AT sites and q be the proportion of GC sites in the gene. The flux of AT-to-GC would be pf and the flux in reverse, GC-to-AT, would be qg. At equilibrium, pf = qg. Given f and g, the ratio of p and q would eventually reach p/q = g/f. We now determine if the fluxes are in equilibrium (pf =qg). If they are not, the genic GC ratio is likely under selection and is moving to a different equilibrium.

In these genic analyses, we first analyze the human lineage (Brown and Jiricny 1989; Galtier and Duret 2007). Using chimpanzees and gorillas as the outgroups, we identified the derived variants that became nearly fixed in humans with frequency > 0.8 (Table 6). The chi-square test shows that the GC variants had a significantly higher fixation probability compared to AT. In addition, this pattern is also found in chimpanzees (p < 0.001). In M. m. domesticus (Table 6), the chi-square test reveals no difference in the fixation probability between GC and AT (p = 0.957). Further details can be found in Supplementary Figure 2. Overall, a higher fixation probability of the GC variants is found in human and chimpanzee, whereas this bias is not observed in mice.

The A/T to G/C and G/C to A/T changes in apes and mouse.

Based on Table 6, we could calculate the value of p, q, f and g (see Table 7). Shown in the last row of Table 7, the (pf)/(qg) ratio is much larger than 1 in both the human and chimpanzee lineages. Notably, the ratio in mouse is not significantly different from 1. Combining Tables 4 and 7, we conclude that the slight acceleration of fixation in mice can be accounted for by genetic drift, due to gene conversion among rRNA gene copies. In contrast, the different fluxes corroborate the interpretations of Table 5 that selection is operating in both humans and chimpanzees.

The parameter values of p, q, f and g in the evolution between A/T and G/C.

Discussion

The Haldane model is an “individual-output” model of genetic drift (Chen, et al. 2017). Hence, it does not demand the population to follow the rules of Mendelian populations. It is also sufficiently flexible for studying various stochastic forces other than the sampling errors that together drive genetic drift. In the companion study(Ruan, et al. 2024), we address the ecological forces of genetic drift and, in this study, we analyze the neutral evolution of rRNA genes. Both examples are amenable to the analysis by the Haldane model, but not by the WF model.

In multi-copy systems, there are several mechanisms of homogenization within individuals. For rRNA genes, whether on the same or different chromosomes (Gonzalez and Sylvester 2001; van Sluis and McStay 2019), the predominant mechanism of homogenization mechanism are gene conversion and unequal crossover. In the process of exchanging DNA sections in meiosis, gene conversions are an order of magnitude more common than crossover (Cole, et al. 2012; Williams, et al. 2015). It is not clear how large a role is played by replication slippage which affects copes of the same cluster.

There have been many rigorous analyses that confront the homogenizing mechanisms directly. These studies (Smith 1974; Ohta 1976; Dover 1982; Nagylaki 1983; Ohta and Dover 1983) modeled gene conversion and unequal cross-over head on. Unfortunately, on top of the complexities of such models, the key parameter values are rarely obtainable. In the branching process, all these complexities are wrapped into V*(K) for formulating the evolutionary rate. In such a formulation, the collective strength of these various forces may indeed be measurable, as shown in this study.

The branching process is a model for general processes. Hence, it can be used to track genetic drift in systems with two stages of evolution - within- and between-individuals, even though TEs, viruses or rRNA genes are very different biological entities. We use the rRNA genes to convey this point. Multi-copy genes, like rDNA, are under rapid genetic drift via the homogenization process. The drift is strong enough to reduce the copy number in the population from ∼ 150N to < N. A fraction of mutations in multi-copy genes may have been fixed by drift almost instantaneously in the evolutionary time scale. This acceleration is seen in mice but would have been interpreted to be due to positive selection by the convention. Interestingly, while positive selection may not be necessary to explain the mice data, it is indeed evident in human and chimpanzee, as the evolutionary rate of rRNA genes exceeds the limit of the strong drift.

In conclusion, the Haldane model is far more general than the WF model as this and the companion study clearly demonstrate. Its E(K) parameter (which is usually set to 1) should be equivalent to the single parameter of the WF model (i.e., N). The other parameter of the Haldane model, i.e., V(K), is then free to track genetic drift, whereas the WF model, setting V(K) = E(K), is highly constrained. Nevertheless, the vast literature using the WF model has led to substantial understandings of the neutral process via the diffusion process (Crow and Kimura 1970) or coalescence (Kingman 1982; Fu 2006). In this sense, the Haldane model should be built on the WF model by introducing a second parameter and permit the analyses of a broader range of stochastic ecological and evolutionary forces.

Materials and methods

Data Collection

We collected high-coverage whole-genome sequencing data for our study. The genome sequences of human (n = 8), chimpanzee (n = 1) and gorilla (n = 1) were sourced from National Center for Biotechnology Information (NCBI) (Supplementary Table 3). Human individuals were drawn from diverse geographical origins encompassing three continents (4 Asians, 2 Europeans and 2 Africans) and Asia CRC was the normal tissue of Case 1 patient from (Chen, Wu, et al. 2022). Genomic sequences of mice (n = 13) were sourced from the Wellcome Sanger Institute’s Mouse Genome Project (MGP) (Keane, et al. 2011). Although some artificial selection has been performed on laboratory mouse strains (Yang, et al. 2011), the WSB/EiJ, ZALENDE/Ei and LEWES/EiJ strains were derived from wild populations. Incorporating these wild-derived laboratory strains, along with other inbred strains, a cohort of 10 mice was utilized to approximate the population representation of M. m. domesticus. Furthermore, the low FIS of 0.14 for rDNA in M. m. domesticus found in this study suggests that each mouse covers 86% of the population’s genetic diversity, thereby mitigating concerns about potential sampling biases. Accessions and the detailed information of samples used in this study are listed in the Supplementary Table 3 and Table 4.

Variant allele frequency

Following adapter trimming and the removal of low-quality sequences, these whole-genome sequencing data of apes and mice were mapped against respective reference sequences: the human rDNA reference sequence (Human ribosomal DNA complete repeating unit, GenBank: U13369.1) and the mouse rDNA reference sequence (Mus musculus ribosomal DNA, complete repeating unit, GenBank: BK000964.3). Alignment was performed using Burrows-Wheeler-Alignment Tool v0.7.17 (Li and Durbin 2009) with default parameters. All mapping and analysis are performed among individual copies of rRNA genes. The pipelines in variant calling are similar to the ones used in the literature (Ma, et al. 2022; Sun, et al. 2022). Each individual was considered as a psedo-population of rRNA genes and the diversity of rRNA genes was calculated using this psedo-population of rRNA genes. To determine variant frequency within individual, variants were called from multiple alignment files using bcftools (Danecek, et al. 2021) to ensure the inclusion of all polymorphic sites that appeared in at least one sample and to maintain consistent processing steps. Per-sample variant calling results were generated in a VCF file using the following settings: ‘bcftools mpileup --redo-BAQ --max-depth 50000 --per-sample-mF --annotate’ and ‘bcftools call -mv’. Our analysis specifically focused on single nucleotide variants (SNVs) with only two alleles, while other mutation types were discarded. Variant information of interest per sample was extracted from the VCF files using the command ‘bcftools view -V indels’ and ‘bcftools query –f ‘%CHROM\t%POS\t%REF\t%ALT\t[\t%GT]\t[\t%DP]\t[\t%AD]\n’\’. The AD (total high-quality bases of allelic depths) was used to obtain the number of reference-supporting reads (n.ref) and alternative-supporting reads (n.alt) in each sample. Sites with a depth < 10 in any sample were filtered out, resulting in an average of the minimum depths above 3000 for each remaining site across all samples. This filtering step enhances the robustness of variant frequency estimation, where the variant frequency was approximated by the ratio n.alt/(n.ref + n.alt). This process allowed us to identify all polymorphic sites present in the samples and the variant frequency within each individual. Then the population-level variant frequency was computed by averaging variant frequencies across all individuals.

Identification of Divergence Sites

Polymorphism represents the transient phase preceding divergence. With variant frequency within individual, within species, and between species, we obtained the FIS and FST values for each site. Within the M. m. domesticus population, 779 polymorphic sites were identified. Among these, 744 sites (95.5%) exhibited an FIS below 0.4 and rarely above 0.8 in Fig. 4B. In the comparison between M. m. domesticus and Mus spretus, 1579 variant sites were found, among which 453 sites displayed an FST above 0.8, indicating swift fixation of mutations during species divergence.

FST analysis between human and chimpanzee was conducted for each human individual, summarized in Table 5. We identified a range from 672 to 705 sites with FST values above 0.8 across individuals, depicting robust divergence sites. Considering the high mutation rate in CpG sites (Ehrlich and Wang 1981; Sved and Bird 1990) and predominantly GC content in rDNA (e.g. 58% GC in human), we further estimated the evolutionary rate at non-CpG sites during the interspecies divergence. To achieve this, mutations in CpG sites were manually removed by excluding all sites containing CpG in one species and TpG or CpA in the other; the reverse was similarly discarded. Additionally, the count of non-CpG sites within the mapping length, where site depth exceeded 10, was performed by Samtools (Danecek, et al. 2021) with the settings ‘samtools mpileup -Q 15 -q 20’. As a result, the evolutionary rate of rDNA in non-CpG sites was ascertained.

For assessing diversity and divergence across gene segments, we used ‘samtools faidx’ to partition variants into a total of 8 regions within rRNA genes, including 5’ETS, ITS1, ITS2, 3’ETS, IGS, 18S, 5.8S, and 28S, aligning them with corresponding reference sequences for further analysis. The functional parts (18S, 5.8S, and 28S) were subject to strong negative selection, exhibiting minimal substitutions during species divergence as expected. This observation is primarily used for comparison with the non-functional parts and reflects that the non-functional parts are less constrained by negative selection.

Genome-wide Divergence Estimation

To assess the genome-wide divergence between 4 mouse strain species, we downloaded their toplevel reference genomes from Ensembl genome browser (GenBank Assembly ID: GCA_001624865.1, GCA_001624775.1, GCA_001624445.1, and GCA_001624835.1). Then we used Mash tools (Ondov, et al. 2016) to estimate divergence across the entire genome (mainly single-copy genes) with ‘mash sketch’ and ‘mash dist’. Additionally, the effective copy number of rRNA genes, denoted as C*, can be estimated by calculating the ratio of population diversity observed in rDNA to that observed in single-copy genes.

Estimation of Site Conversion

To estimate site conversions, whether accidental or directional, it was essential to identify new mutant alleles in each lineage after divergence. New mutations were defined as derived alleles that differed from ancestral alleles shared by two outgroup species, where the ancestral state also exhibited high identity among copies. By focusing on new mutations with low initial frequency, we minimized the influence of their initial frequency on fixation probability and fixation time. Specifically, variants shared between chimpanzee and gorilla, humans and gorilla, M. m. castaneus and M. spretus, each with a frequency greater than 0.8, were considered as the ancestral for human, chimpanzee, and M. m. domesticus, respectively.

Derived variants were then categorized into two groups: (nearly) fixed (frequency > 0.8) and not fixed mutations in the lineages of humans, chimpanzees and M. m. domesticus (Table 6). The frequency threshold of >0.8 was chosen to balance the need for a sufficient number of sites to calculate of (f/g), and to ensure their reliability. We also applied a more stringent threshold of >0.9, which yielded similar results.

In this study, six types of mutations were tabulated, representing ancestral-to-derived as depicted in Supplementary Fig. 2. For example, A-to-G represented the both A-to-G and T-to-C types of mutations. The C-to-G (or G-to-C) and A-to-T (or T-to-A) types of mutations were excluded in the subsequent analysis.

Specifically, f represents the proportion of fixed mutations where an A or T nucleotide has been converted to a G or C nucleotide. The numerator for f is the number of fixed mutations from A-to-G, T-to-C, T-to-G, or A-to-C. Since the fixed sites accounted for less than 1% of the non-functional length of rDNA, the denominator is the total number of A or T sites in the rDNA sequence of the species lineage.

Similarly, g is defined as the proportion of fixed mutations where a G or C nucleotide has been converted to an A or T nucleotide. The numerator for g is the number of fixed mutations from G-to-A, C-to-T, C-to-A, or G-to-T. The denominator is the total number of G or C sites in the rDNA sequence of the species lineage.

The consensus rDNA sequences for the species lineage were generated by Samtools consensus (Danecek, et al. 2021) from the bam file after alignment. The following command was used: ‘samtools consensus -@ 20 -a -d 10 --show-ins no --show-del yes input_sorted.bam output.fa’.

The alternative hypotheses of GC-biased mutation process (Wolfe, et al. 1989; Francino and Ochman 1999) alone can be rejected in this study. According to the prediction of the mutational mechanism hypotheses, AT or GC variants should have equal fixation probabilities. We quantified the nearly fixed number of AT-to-GC and GC-to-AT types of mutations and conducted a chi-square test to assess their fixation probabilities.

Notably, we found the G (or C) variant had a significant higher fixation probability than A (or T) at site level in apes, but not in mice. In order to test whether there is an equilibrium state at the genic level in these three species, we computed the (pf)/(qg) ratio in Table 7, and a significant deviation of the ratio from 1 would imply biased genic conversion.

Data Availability

No new data were generated in this study. The genomic data used in this study are available from National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/) and the Mouse Genomes Project (https://www.sanger.ac.uk/data/mouse-genomes-project/). The specific accession numbers recorded in Supplementary Table S4 and S5.

Acknowledgements

We are grateful for the helpful comments from many colleagues on the Chat Group “Cancer - The new evolving species”, in particular, Weiwei Zhai, Yong Zhang, GuoDong Wang of CAS and Jianrong Yang of SYSU. This work was supported by the National Natural Science Foundation of China (32150006, 32293193/32293190 to C.I.W., 32200493 to Y.R., and 82341092 to HJ Wen.), the National Key Research and Development Projects of the Ministry of Science and Technology of China (2021YFC2301300, 2021YFC0863400), and Guangdong Key Research and Development Program (No. 2022B1111030001).