The paradox of extremely fast evolution driven by genetic drift in multi-copy gene systems

Xiaopei Wang; Yongsen Ruan; Lingjie Zhang; Xiangnyu Chen; Zongkun Shi; Haiyu Wang; Bingjie Chen; Miles E Tracy; Chung-I Wu; Haijun Wen

doi:10.7554/eLife.99992.2

eLife Assessment

This study presents a useful theoretical model of molecular evolution and attempts to use it to resolve the paradox of rapid evolution of ribosomal RNA genes. While intuitive, the model's underlying issue is grouping many factors under "variance in reproductive success" without explicitly modeling the molecular processes. This limitation, along with insufficient consideration of technical challenges in alignment and variants calling, provides incomplete support for the authors' claim that the observed paradoxical patterns in rRNA genes can largely be explained by homogenizing processes, such as gene conversion, unequal crossover and replication slippage.

https://doi.org/10.7554/eLife.99992.2.sa3

Significance of findings

useful: Findings that have focused importance and scope

Strength of evidence

incomplete: Main claims are only partially supported

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

Multi-copy gene systems that evolve within, as well as between, individuals are common. They include viruses, mitochondrial DNAs, transposons and multi-gene families. The paradox is that their (neutral) evolution in two stages should be far slower than that of single-copy systems, but the opposite is often true. As the paradox cannot be resolved by the standard Wright-Fisher (WF) model, we now apply the newly expanded WF-Haldane model (WFH; Ruan, et al. 2024) to mammalian ribosomal RNA (rRNA) genes. On average, rDNAs have C ∼ 150 - 300 copies per haploid in humans. While a neutral mutation of a single-copy gene would take 4N generations (N being the population size of an ideal population) to become fixed, the time should be 4NC* generations for rRNA genes (C* being the effective copy number). Note that C* >> 1 is expected, but whether C* < (or >) C would depend on the drift strength. Surprisingly, the observed fixation time in mouse and human is < 4N, implying the paradox of C* < 1. Genetic drift that encompasses all random neutral evolutionary forces appears to be as much as 100 times stronger for rRNA genes than for single-copy genes, thus reducing C* to < 1. The large increases in genetic drift are driven by the homogenizing forces of gene conversion, unequal crossover and replication slippage within individuals. This study is one of the first applications of the WFH model to track random genetic drift in multi-copy gene systems. Many random forces, often stronger than the WF model prediction, could be mis-interpreted as the working of natural selection.

Introduction

In this study, we focus on multi-copy gene systems, where the evolution takes place in two stages: both within (stage I) and between individuals (stage II). Multi-copy gene systems include viruses, transposons, mitochondria and multi-gene families (Alexandrov, et al. 2001; Szitenberg, et al. 2016; Xu, et al. 2019; Ruan, et al. 2021). Given the extra stage of within-host fixation, the neutral evolutionary rate of multi-copy systems should be much slower than in single-copy systems. However, the rapid evolution of multi-copy systems has been extensively documented (Charlesworth, et al. 1994; Eickbush and Eickbush 2007; Jurka, et al. 2007; Hou, et al. 2023). A reason for this paradox, as well as many others (Ruan, et al. 2024), is that the speed of neutral evolution of multi-gene systems is not known.

The speed of neutral evolution is the basis for determining how fast or slow all types of molecular evolution take place. Neutral evolution is driven by random transmission, gene conversion, stochastic replication etc., which collectively constitute genetic drift. Hence, genetic drift is the fundamental force of molecular evolution. All other evolutionary forces, such as selection, mutation and migration, may be of greater biological interest, but inferences about them are possible only when genetic drift is fully accounted for. In the companion study (Ruan, et al. 2024), we show that the standard Wright-Fisher (WF) model may often under-account for genetic drift, thus leading to the over-estimation of selection.

We propose the integration of the WF model with the Haldane model, referred to as the WFH model of genetic drift (Ruan, et al. 2024). The Haldane model is based on the branching process. In haploids, each individual produces K progeny with the mean and variance of E(K) and V(K). Genetic drift is primarily V(K) as there would be no drift if V(K) = 0. Gene frequency change in the population is then scaled by N (population size), expressed as V(K)/N. In diploids, K would be the number of progeny to whom the gene copy of interest is transmitted. (The adjustments between haploidy and multi-ploidy are straightforward).
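The scaling of drift with V(K)/N can be checked with a small simulation. The sketch below is illustrative only and is not taken from the study: it assumes a two-point offspring distribution (K = m with probability 1/m, otherwise K = 0), chosen simply because it gives E(K) = 1 and V(K) = m - 1 with standard-library tools, and the function names are hypothetical.

```python
import random
import statistics


def next_frequency(n_copies, p, m, rng):
    """One generation of a haploid branching process.

    Assumed offspring distribution (illustrative): each copy leaves
    K = m offspring with probability 1/m and K = 0 otherwise,
    so that E(K) = 1 and V(K) = m - 1.
    """
    n_mut = round(n_copies * p)
    n_wt = n_copies - n_mut
    mut_offspring = m * sum(rng.random() < 1.0 / m for _ in range(n_mut))
    wt_offspring = m * sum(rng.random() < 1.0 / m for _ in range(n_wt))
    total = mut_offspring + wt_offspring
    # If every lineage happens to die out, report no change in frequency.
    return mut_offspring / total if total else p


def drift_variance(n_copies, p, m, replicates, seed=0):
    """Empirical variance of the one-generation change in gene frequency."""
    rng = random.Random(seed)
    changes = [next_frequency(n_copies, p, m, rng) - p
               for _ in range(replicates)]
    return statistics.pvariance(changes)


# Compare V(K) = 1 (WF-like) with V(K) = 9 at the same N.
N, P0 = 500, 0.5
for m in (2, 10):
    v_k = m - 1
    simulated = drift_variance(N, P0, m, replicates=1000)
    expected = v_k * P0 * (1 - P0) / N  # approximately V(K) p (1 - p) / N
    print(f"V(K) = {v_k}: simulated {simulated:.5f}, expected ~{expected:.5f}")
```

Under these assumptions the variance of the frequency change should grow roughly in proportion to V(K) while N is held fixed, which is the sense in which drift is "primarily V(K)".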

In the WF model, gene frequency is governed by 1/N (or 1/2N in diploids) because K would follow the Poisson distribution whereby V(K) = E(K). As E(K) is generally ∼ 1, V(K) would also be ∼ 1. Against this backdrop, many “modified WF” models have been developed (Der, et al. 2011), most of them permitting V(K) ≠ E(K) (Karlin and McGregor 1964; Chia and Watterson 1969; Cannings 1974). Nevertheless, paradoxes encountered by the standard WF model apply to these modified WF models as well because all WF models share the key feature of gene sampling (see below and Ruan, et al. 2024). One of the paradoxes, first noted by Chen, et al. (2017), is genetic drift during tumor growth, whereby drift appears to become stronger as N increases (Wu, et al. 2016; Chen, Wu, et al. 2022; Zhai, et al. 2022). This trend is in stark opposition to the central tenet of the WF models.

A paradox requiring dedicated efforts to analyze concerns multi-copy gene systems, which are as diverse as viral epidemics (Huang, et al. 2021), transposons (Szitenberg, et al. 2016), mitochondrial DNAs (Xu, et al. 2019), satellite DNAs (Cabot, et al. 1993; Alexandrov, et al. 2001), and ribosomal RNA genes (van Sluis and McStay 2019; Hori, et al. 2021). In COVID-19, the inability of the WF models to track both within- and between-host evolution simultaneously is a main reason for much confusion about the origin, spread and driving forces of SARS-CoV-2 (Ruan, et al. 2021; Deng, et al. 2022; Guan and Zhong 2022; Pan, Liu, et al. 2022; Ruan, et al. 2022; Zhou, et al. 2022; Hou, et al. 2023).

In multi-copy systems, the copy number (designated C) is in the hundreds or thousands per individual. Nevertheless, C = 2, as in all diploids, is also a multi-copy gene system as the two copies may often evolve interactively via gene conversion or segregation distortion (Wu, et al. 1988; McDermott and Noor 2010). The WF models have been noted to be inadequate even in diploid systems (Charlesworth 2009; Chen, et al. 2017). After all, the WF model is essentially a haploid model with 2N copies in the population.

We now apply the Haldane model to multi-copy gene systems, using ribosomal RNA (rRNA) genes as an example. While the WF models have led to speculations of pervasive natural selection (Dover 1982; Arkhipova 2018; Chen, Yang, et al. 2022; Pan, Zhang, et al. 2022; Wang, et al. 2022), neutral stochasticity would be a simpler explanation if the more powerful Haldane model is adopted. Since the WF model can yield results that are often good approximations for the Haldane model, the integration is referred to as the WFH model.

Results

PART I presents one of the best-known multi-gene systems, the rRNA genes. PART II (Theory) consolidates aspects of the Haldane model applied to polymorphism within species as well as divergence between species. In PART III (Data analyses), we apply the theory to rDNA evolution in mice and apes (human and chimpanzee).

PART I - The biology of rRNA gene clusters

The ribosomal RNA genes (rDNAs) are multi-copy gene clusters (Bowman, et al. 2020) that are arrayed as tandem repeats on multiple chromosomes as shown in Fig. 1A (Guillén, et al. 2004; Cazaux, et al. 2011). In humans, the copy number can vary from 60 to 1600 per individual (mean, 315; SD, 104; median, 301) (Parks, et al. 2018). For each haploid genome, C ∼ 150 on average in humans and C ∼ 110 in mice (Parks, et al. 2018). In humans, the five rRNA clusters are located on the short arm of the five acrocentric chromosomes (Smirnov, et al. 2021). Such an arrangement permits crossovers between chromosomes without perturbing the rest of the genomes. In Mus, the rDNAs are all located in the pericentromeric or sub-telomeric region, on the long arms of telocentric chromosomes (Cazaux, et al. 2011; Potapova and Gerton 2019). Thus, unequal crossovers between non-homologous chromosomes may involve centromeres while other genic regions are also minimally perturbed.

Figure 1. The “chromosome community” of rDNAs on five acrocentric chromosomes.
(A) The genomic locations of rDNA tandem repeats in human (Gibbons, et al. 2015) and mouse (Cazaux, et al. 2011). rDNAs are located on the short arms (human), or the proximal end of the long arms (mouse), of the chromosome. Either way, inter-chromosomal exchanges are permissible. (B) The organization of rDNA repeat unit. IGS (intergenic spacer) is not transcribed. Among the transcribed regions, 18S, 5.8S and 28S segments are in the mature rRNA while ETS (external transcribed spacer) and ITS (internal transcribed spacer) are excluded. (C) The pseudo-population of rRNA genes is shown by the “chromosome community” map (Guarracino, et al. 2023), which indicates the divergence distance among chromosome segments. The large circle encompasses rDNAs from all 5 chromosomes. It shows the concerted evolution among rRNA genes from all chromosomes, which thus resemble members of a (pseudo-)population. The slightly smaller thin circle, from the analysis of this study, shows that the rDNA gene pool from each individual captures approximately 95% of the total diversity of the human population. (D) A simple illustration that shows the transmissions of two new mutations (#1 and #2 in red letters). Mutation 1 experiences replication slippage, gene conversion and unequal crossover and grows to 9 copies (K = 9) after transmission. Mutation 2 emerges and disappears (K = 0). This shows how *V(K)* may be augmented by the homogenization process.

Each copy of rRNA gene has a functional and non-functional part as shown in Fig. 1B. The “functional” regions of rDNA, 18S, 5.8S, and 28S rDNA, are believed to be under strong negative selection, resulting in a slow evolution rate in animals (Salim and Gerton 2019). In contrast, the transcribed spacer (ETS and ITS) and the intergenic spacer (IGS) show significant sequence variation even among closely related species (Eickbush and Eickbush 2007). Clearly, these “non-functional” sequences are less constrained by negative selection. In this study of genetic drift, we focus on the non-functional parts. Data on the evolution of the functional parts will be provided only for comparisons.

The pseudo-population of ribosomal DNA copies within each individual

While a human haploid with 200 rRNA genes may appear to have 200 loci, the concept of “gene loci” cannot be applied to the rRNA gene clusters. This is because DNA sequences can spread from one copy to others on the same chromosome via replication slippage. They can also spread among copies on different chromosomes via gene conversion and unequal crossovers (Nagylaki 1983; Ohta and Dover 1983; Stults, et al. 2008; Smirnov, et al. 2021). Replication slippage and unequal crossovers would also alter the copy number of rRNA genes. These mechanisms will be referred to collectively as the homogenization process. Copies of the cluster on the same chromosome are known to be nearly identical in sequences (Hori, et al. 2021; Nurk, et al. 2022). Previous research has also provided extensive evidence for genetic exchanges between chromosomes (Krystal, et al. 1981; Arnheim, et al. 1982; van Sluis, et al. 2019).

In short, rRNA gene copies in an individual can be treated as a pseudo-population of gene copies. Such a pseudo-population is not Mendelian but its genetic drift can be analyzed using the branching process (see below). The pseudo-population corresponds to the “chromosome community” proposed recently (Guarracino, et al. 2023). As seen in Fig. 1C, the five short arms harbor a shared pool of rRNA genes that can be exchanged among them. Fig. 1D presents the possible molecular mechanisms of genetic drift within individuals whereby mutations may spread, segregate or disappear among copies. Hence, rRNA gene diversity or polymorphism refers to the variation across all rRNA copies, as these genes exist as paralogs rather than orthologs. This diversity can be assessed at both individual and population levels according to the multi-copy nature of rRNA genes.

PART II - Theory

1. The Haldane model of genetic drift applied to multi-copy gene systems

The Haldane model of genetic drift based on the branching process is intuitively appealing. In the model, each copy of the gene leaves K copies in a time interval with the mean and variance of E(K) and V(K). If V(K) = 0, there is no gene frequency change and no genetic drift. In the standard WF model, V(K) = E(K) whereas V(K) is decoupled from E(K) in the Haldane model; the latter thus being more flexible. [In the companion paper, we discuss the modified WF models with V(K) ≠ E(K) which, nevertheless, do not resolve the paradoxes.]

Below, we compare the strength of genetic drift in rRNA genes vs. that of single-copy genes using the Haldane model (Ruan, et al. 2024). We shall use * to designate the equivalent symbols for rRNA genes; for example, E(K) vs. E*(K). Both are set to 1, such that the total number of copies in the long run remains constant.

For simplicity, we let V(K) = 1 for single-copy genes. (If we permit V(K) ≠ 1, the analyses will involve the ratio of V*(K) and V(K) to reach the same conclusion but with unnecessary complexities.) For rRNA genes, V*(K) ≥ 1 may generally be true because K for rDNA mutations is affected by a host of homogenization factors including replication slippage, unequal crossover, gene conversion and other related mechanisms not operating on single-copy genes. Hence,

$$N^* = NC^* = \frac{NC}{V^*(K)} \tag{1}$$

where C is the average number of rRNA genes in an individual and V*(K) reflects the homogenization process on rRNA genes (Fig. 1D). Thus,

$$C^* = \frac{C}{V^*(K)}$$

represents the effective copy number of rRNA genes in the population, determining the level of genetic diversity relative to single-copy genes. Since C is in the hundreds and V*(K) is expected to be > 1, the relationship of 1 << C* ≤ C is hypothesized. Fig. 1D is a simple illustration that the homogenizing process may enhance V*(K) substantially over the WF model.

In short, genetic drift of rRNA genes would be equivalent to single-copy genes in a population of size NC* (or N*). Since C* >> 1 is hypothesized, genetic drift for rRNA genes is expected to be slower than for single-copy genes.
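The dependence of C* on the homogenization strength can be made concrete with a few numbers. The following is a minimal sketch that assumes the relations C* = C/V*(K) and N* = NC* as we read Eq. (1); the function names and the values of N and V*(K) are hypothetical and chosen only for illustration.

```python
def effective_copy_number(c, v_star):
    """C* = C / V*(K): copy number discounted by the variance in copy output."""
    if c <= 0 or v_star <= 0:
        raise ValueError("C and V*(K) must be positive")
    return c / v_star


def effective_population_size(n, c, v_star):
    """N* = N * C*: the single-copy-equivalent population size."""
    return n * effective_copy_number(c, v_star)


# Illustrative values only: C = 150 copies per haploid genome, N = 10,000.
C, N = 150, 10_000
for v_star in (1, 30, 300):
    c_star = effective_copy_number(C, v_star)
    n_star = effective_population_size(N, C, v_star)
    print(f"V*(K) = {v_star}: C* = {c_star}, N* = {n_star}, "
          f"fixation time ~ 4N* = {4 * n_star} generations")
```

With V*(K) = 1 the sketch recovers C* = C; a moderate V*(K) gives 1 < C* < C; and only when V*(K) exceeds C does C* fall below 1, which is the regime in which rRNA genes would drift faster than single-copy genes.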

2. rDNA polymorphism within species

A standard measure of genetic drift is the level of heterozygosity (H). At the mutation-drift equilibrium

$$H = 4N_e\mu \tag{2}$$

where μ is the mutation rate of the entire gene and N_e is the effective population size. In this study, N_e = N for single-copy genes and N_e = C*N for rRNA genes. The empirical measure of nucleotide diversity H is given by

$$H = \frac{1}{L}\sum_{i=1}^{L} 2p_i(1-p_i)$$

where L is the gene length (for each copy of rRNA gene, L ∼ 43 kb) and p_i is the variant frequency at the i-th site.
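The empirical measure can be computed directly from per-site variant frequencies. The sketch below assumes biallelic sites and the common per-site form 2p_i(1 - p_i) averaged over the gene length; both the exact estimator and the function name are assumptions made here for illustration, not a description of the pipeline used in the study.

```python
def nucleotide_diversity(variant_freqs, gene_length):
    """Mean per-site heterozygosity, H = sum_i 2 * p_i * (1 - p_i) / L.

    variant_freqs holds p_i for the variable sites only. Invariant sites
    contribute zero, so they enter only through gene_length (L).
    """
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    if len(variant_freqs) > gene_length:
        raise ValueError("more variable sites than sites in the gene")
    total = 0.0
    for p in variant_freqs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"frequency out of range: {p}")
        total += 2.0 * p * (1.0 - p)
    return total / gene_length


# Toy example: two variable sites in a 100-bp region.
print(nucleotide_diversity([0.5, 0.1], 100))
```

The same function applies at each level: frequencies taken among copies within one individual give H_I, frequencies pooled over a species give H_S, and frequencies pooled over all samples give H_T.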

We calculate H of rRNA genes at three levels – within-individual, within-species and then, within total samples (H_I, H_S and H_T, respectively). H_S and H_T are standard population genetic measures (Hartl, et al. 1997; Crow and Kimura 2009). In calculating H_S, all sequences in the species are used, regardless of the source individuals. A similar procedure is applied to H_T. The H_I statistic is adopted for multi-copy gene systems for measuring within-individual polymorphism. Note that copies within each individual are treated as a pseudo-population (see Fig. 1 and text above). With multiple individuals, H_I is averaged over them.

Given the three levels of heterozygosity, there are two levels of differentiation. First, F_IS is the differentiation among individuals within the species, defined by

$$F_{IS} = \frac{H_S - H_I}{H_S}$$

F_IS is hence the proportion of genetic diversity in the species that is found only between individuals. We will later show F_IS ∼ 0.05 in human rDNA (Table 2), meaning 95% of rDNA diversity is found within individuals.

Second, F_ST is the differentiation between species within the total species complex, defined as

$$F_{ST} = \frac{H_T - H_S}{H_T}$$

F_ST is the proportion of genetic diversity in the total data that is found only between species. Between mouse species, F_ST distribution is close to 1 (Fig. 4C), indicating a large genetic distance between species relative to within-species polymorphisms.
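Both differentiation statistics share the same form, the fraction of diversity at the upper level that is absent at the lower level. A minimal sketch follows, assuming F_IS = (H_S - H_I)/H_S with H_I averaged over individuals and F_ST = (H_T - H_S)/H_T; the function names and the numerical inputs are hypothetical.

```python
from statistics import mean


def fixation_index(h_upper, h_lower):
    """Generic F = (H_upper - H_lower) / H_upper."""
    if h_upper <= 0:
        raise ValueError("upper-level heterozygosity must be positive")
    return (h_upper - h_lower) / h_upper


def f_is(h_s, h_i_values):
    """Differentiation among individuals; H_I is averaged over individuals."""
    return fixation_index(h_s, mean(h_i_values))


def f_st(h_t, h_s):
    """Differentiation between species within the total sample."""
    return fixation_index(h_t, h_s)


# Hypothetical inputs, for illustration only.
print(f_is(0.010, [0.008, 0.009]))  # share of diversity found only between individuals
print(f_st(0.050, 0.005))           # share of diversity found only between species
```

Read this way, 1 - F_IS is the share of the species' diversity that a single individual's rDNA pool is expected to capture.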

Table 1. rRNA gene nucleotide diversity in the 10 *M. m. domesticus* strains of a global collection.

3. rDNA divergence between species

Whereas the level of genetic diversity is a function of the effective population size, the rate of divergence between species, in its basic form, is not. The rate of neutral molecular evolution (λ), although driven by mutation and genetic drift, is generally shown by Eq. (3) (Crow and Kimura 1970; Hartl, et al. 1997; Li 1997):

$$\lambda = N\mu \times \frac{1}{N} = \mu \tag{3}$$

Note that the factor of 1/N in Eq. (3) indicates the fixation probability of a new mutation. For rDNA mutations, fixation must occur in two stages – fixation within individuals and then among individuals in the population. (We note again that new mutations can be fixed via homogenization in an individual, effectively forming a pseudo-population for rRNA genes.) Due to the cancelation of N in Eq. (3), the evolutionary rate of rRNA genes in the long run should be the same as single-copy genes.
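The cancelation can be written out for both kinds of genes. The lines below are a sketch under an assumption made here for illustration: a new neutral rDNA mutation arises in one of the NC copies in the population, becomes fixed within its individual with probability 1/C, and then becomes fixed among individuals with probability 1/N.

```latex
\begin{align*}
\lambda_{\text{single-copy}} &= N\mu \times \frac{1}{N} = \mu, \\
\lambda_{\text{rRNA}} &= NC\mu \times \frac{1}{C} \times \frac{1}{N} = \mu .
\end{align*}
```

Under this reading, the extra copies raise the mutational input by a factor of C and lower the fixation probability by the same factor, so the long-term rate is unchanged and only the fixation time differs.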

Eq. (3) is valid in long-term evolution. For shorter-term evolution, it is necessary to factor in the fixation time (Fig. 2), T_f, which is the time between the emergence of a mutation and its fixation. If we study two species with a divergence time (T_d) equal to or smaller than T_f, then few mutations of recent emergence would have been fixed as species diverge.

Figure 2. Fixation of mutations at two levels of species divergence, (*T_d1*) and (*T_d2*).
(*T_f1*) and (*T_f2*) denote the shorter and longer fixation times of mutations in single-copy and multi-copy genes, respectively. Note that mutations with a longer *T_f* would show a lower fixation rate in short-term evolution.

Note that T_d is about 6 million years (Myrs) between human and chimpanzee while T_f (as measured by coalescence) is 0.6 - 1 Myrs in humans. Mutations of single-copy genes that emerged during the most recent T_f would not have become fixed, as indicated in Figure 2. Thus, the realized substitution rate may be 1/10 to 1/6 lower than the theoretical value. In comparison, T_f of rRNA mutations should be at least 4.8 - 8 Myrs based on the C* estimate (see PART III). Thus, the substitution rate could be at least 80% lower than calculated for single-copy genes.
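The arithmetic behind these reductions is simply the ratio T_f/T_d, capped at 1 when the fixation time exceeds the divergence time. A minimal sketch using the human and chimpanzee values quoted above (the function name is hypothetical):

```python
def rate_reduction(t_f, t_d):
    """Fraction of the long-term rate lost because mutations arising in the
    last t_f of the divergence time t_d have not had time to become fixed.
    Capped at 1 when t_f >= t_d."""
    if t_f < 0 or t_d <= 0:
        raise ValueError("t_f must be non-negative and t_d positive")
    return min(t_f / t_d, 1.0)


T_D = 6.0  # Myrs, human vs. chimpanzee
cases = [
    ("single-copy, T_f = 0.6", 0.6),
    ("single-copy, T_f = 1.0", 1.0),
    ("rRNA, T_f* = 4.8", 4.8),
    ("rRNA, T_f* = 8.0", 8.0),
]
for label, t_f in cases:
    print(f"{label} Myrs: reduction = {rate_reduction(t_f, T_D):.0%}")
```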

Our own theoretical derivation has shown that the fixation time for rRNA genes is close to 4N* generations, just as it is ∼ 4N for single-copy genes. If V*(K) is sufficiently larger than V(K), it is in fact possible for N* < N, such that genetic drift is stronger for rRNA genes than for single-copy genes and T_f* < T_f. In that case, the shorter fixation time in Fig. 2 can represent the T_f* of mutations in rRNA genes. This is interesting because, if the homogenization is powerful enough, rRNA genes would have an effective copy number of C* < 1. A short T_f*, which can be obtained by large V*(K), would lead to a small T_f*/T_d and thus a higher substitution rate, particularly in short-term evolution. However, even as T_f* approaches 0, the substitution rate can exceed that of single-copy genes but is still limited by the fixation probability. The rapidly fixed mutations may not be well represented in the polymorphic data but can accumulate in species divergence.

PART III - Data Analyses

Before presenting the main analyses, we provide some empirical observations on the rapid homogenization of rRNA genes within individuals (Stage I). These observations are needed for PART III that comprises i) the analyses of rDNA polymorphisms within species in mouse and human; and ii) the analysis of the divergence between species.

Empirical measurements of homogenization within cells

In an accompanying study (Wang, et al. unpublished data), the evolutionary rate of neutral rRNA variants within cells is measured. Here, genetic drift operates via the homogenizing mechanisms that include gene conversion, unequal crossover and replication slippage. In the literature, measurements of neutral rRNA evolution are usually based on comparisons among individuals. Therefore, the Mendelian mechanisms of chromosome segregation and assortment would also shuffle variants among individuals. Segregation and assortment would confound the measurements of homogenization effects within individuals.

In one experiment, the homogenization effects in rDNAs are measured in cultured cell lines over 6 months of evolution. For in vivo homogenization, we analyze the evolution of rRNA genes within solid tumors. We estimate the rate at which rRNA variants spread among copies within the same cells, which undergo an asexual process. The measurements suggest that, in the absence of recombination and chromosome assortment, the fixation of new rRNA mutations within cells would take only 1 - 3 kyrs (thousand years). Since a new mutation in single-copy genes would take 300 - 600 kyrs to be fixed in human populations, the speed of genetic drift in Stage I evolution is orders of magnitude faster than in Stage II. Therefore, despite having several hundred copies of rRNA genes per genome, the speed of genetic drift in rRNA genes may not be that much slower than in single-copy genes. This postulate will be tested below.

1. rDNA polymorphism within species

1) Polymorphism in mice

For rRNA genes, H_I of 10 individuals ranges from 0.0056 to 0.0067 while H_S is 0.0073 (Table 1). Thus, F_IS = [H_S - H_I]/H_S for mice is 0.14, which means 86% of variation is within each individual. In other words, even a single randomly chosen individual would yield about 86% of the diversity of the whole species. Hence, the estimated H_S should be robust as it is not affected much by the sampling.

H_S for M. m. domesticus single-copy genes is roughly 1.40 per kb genome-wide (Geraldes, et al. 2008) while H_S for rRNA genes is 7.25 per kb (Table 1), 5.2 times larger. In other words, C* = N*/N ∼ 5.2. If we use the polymorphism data, it is as if rDNA array has a population size 5.2 times larger than single-copy genes. Although the actual copy number on each haploid is ∼ 110, these copies do not segregate like single-copy genes and we should not expect N* to be 100 times larger than N. The H_S results confirm the prediction that rRNA genes should be more polymorphic than single-copy genes.

Based on the polymorphism result, one might infer slower drift for rDNAs than for single-copy genes. However, the results from the divergence data in later sections will reveal the under-estimation of drift strength from polymorphism data. Such data would miss variants that have a fast drift process driven by, for example, gene conversion. Strength of genetic drift should therefore be measured by the long-term fixation rate.

2) Polymorphism in human

F_IS for rDNA among 8 human individuals is 0.059 (Table 2), much smaller than 0.142 in M. m. domesticus mice, indicating minimal genetic differences across human individuals and a high level of genetic identity in rDNAs between homologous chromosomes in the human population. Consistent with the low F_IS, Fig. 3 shows a strong correlation of the polymorphic site frequency of the rDNA transcribed region between each pair of individuals from three continents (2 Asians, 2 Europeans and 2 Africans). The correlation of polymorphic sites in the IGS region is shown in Supplementary Fig. 1. The results suggest that the genetic drift due to the sampling of chromosomes during sexual reproduction (e.g., segregation and assortment) is augmented substantially by the effects of the homogenization process within individuals. Like those in mice, the pattern indicates that intra-species polymorphism is mainly preserved within individuals. The observed H_I of humans for rDNAs is 0.0064 to 0.0077 and the H_S is 0.0072 (Table 2). Research has shown that heterozygosity for the human genome is about 0.00088 (Zhao, et al. 2000), meaning the effective copy number of rDNAs is roughly 0.0072/0.00088, or C* ∼ 8. This reduction in effective copy number from 150 to 8 indicates strong genetic drift due to the homogenization force.
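The polymorphism-based C* estimates for mouse and human follow from the same ratio of heterozygosities. The sketch below reproduces that arithmetic from the values quoted in the text. The implied V*(K) in the last column is not reported in the study; it follows only if Eq. (1) is read as C* = C/V*(K), and the function names are hypothetical.

```python
def c_star_from_diversity(h_rdna, h_single_copy):
    """C* = N*/N estimated as the ratio of rDNA to single-copy heterozygosity."""
    if h_single_copy <= 0:
        raise ValueError("single-copy heterozygosity must be positive")
    return h_rdna / h_single_copy


def implied_v_star(c, c_star):
    """V*(K) implied by C* = C / V*(K) (an assumed reading of Eq. 1)."""
    return c / c_star


# H_S values and haploid copy numbers as quoted in the text.
datasets = {
    "mouse": {"h_rdna": 7.25e-3, "h_single": 1.40e-3, "c": 110},
    "human": {"h_rdna": 7.2e-3, "h_single": 0.88e-3, "c": 150},
}
for species, d in datasets.items():
    c_star = c_star_from_diversity(d["h_rdna"], d["h_single"])
    v_star = implied_v_star(d["c"], c_star)
    print(f"{species}: C* ~ {c_star:.1f}, implied V*(K) ~ {v_star:.0f}")
```

As the text cautions, these polymorphism-based values may over-estimate the long-term C* because rapidly fixed variants are missing from polymorphism data.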

Figure 3. Correlation of variant frequencies between human individuals.
The pairwise correlation of variant site frequency in the transcribed region of rDNAs among 6 individuals (2 Asians, 2 Europeans, and 2 Africans). The high correlations suggest that the diversity in each individual can well capture the population diversity. Each color represents a region of rDNA. The diagonal plots present the variant frequency distribution. The upper right section summarizes the Pearson correlation coefficients derived from the mirror symmetric plots in the bottom left. The analysis excluded the 18S, 3’ETS, and 5.8S regions due to the limited polymorphic sites. The result for IGS region is presented in Supplementary Figure S1.

Table 2. rRNA gene nucleotide diversity in the 8 humans of a global collection.

2. rDNA divergence between species

We now consider the evolution of rRNA genes between species by analyzing the rate of fixation (or near fixation) of mutations. Polymorphic variants are filtered out in the calculation. Note that Eq. (3) shows that the mutation rate, μ, determines the long-term evolutionary rate, λ. Since we will compare the λ values between rRNA and single-copy genes, we have to compare their mutation rates first by analyzing their long-term evolution. As shown in Table S1, λ falls in the range of 50-60 (differences per Kb) for single-copy genes and 40 – 70 for the non-functional parts of rRNA genes. The data thus suggest that rRNA and single-copy genes are comparable in mutation rate. Differences between their λ values will have to be explained by other means.

1) Between mouse species - Genetic drift as the sole driving force of the rapid divergence

We now use the F_ST statistic to delineate fixation and polymorphism. The polymorphism in M. m. domesticus is compared with two outgroup species, M. spretus and M. m. castaneus, respectively. There are hence two depths in the phylogeny with two T_d’s, as shown in Fig. 4A (Rudra, et al. 2016; Kumar, et al. 2022). There is a fourth species, M. m. musculus (shown in grey in Fig. 4A), which yields results very similar to those of M. m. domesticus in these two comparisons. These additional analyses are shown in Supplementary Tables S2-S3.

Figure 4. Levels of polymorphism and divergence in mice.
(A) Phylogeny of *Mus musculus* and *Mus spretus* mice. The divergence times are obtained from http://timetree.org/. The line segment labeled 0.5 represents 0.5 Myrs. (B) *F_IS* distribution within *M. m. domesticus*. The distribution of *F_IS* for polymorphic sites in 3 outbred mouse strains or 10 mouse strains (including 7 inbred mice) in Table 1 (Inset) is shown. (C) *F_ST* distribution between *M. m. domesticus* and *Mus spretus*. Note that the F values rise above 0.8 only in (C).

The F_IS values of polymorphic sites in 3 outbred mice are primarily below 0.2 and rarely above 0.8 in Fig. 4B, indicating low genetic differentiation in rDNAs among these 3 M. m. domesticus mice. In contrast, the F_IS distribution of the 10 mice from Table 1, including 7 inbred and 3 outbred mouse strains, exhibits a noticeable right skew but does not exceed 0.8. This suggests that inbreeding to a certain extent limits the process of homogenization and enhances population differentiation. In comparison, the distribution of the F_ST of variant sites between M. m. domesticus and Mus spretus has a large peak near F_ST = 1. This peak in Fig. 4C represents species divergence not seen within populations (i.e., F_IS). We use F_ST = 0.8 as a cutoff for divergence sites between the two species. Roughly, when a mutant is > 0.95 in frequency in one species and < 0.05 in the other, F_ST would be > 0.80.
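The correspondence between the 0.95/0.05 frequencies and the F_ST cutoff of 0.8 can be verified at a single site. The sketch below assumes a biallelic site, per-site heterozygosity 2p(1 - p), and equal weighting of the two species; these are illustrative assumptions and the function name is hypothetical.

```python
def site_f_st(p1, p2):
    """F_ST at one biallelic site from the mutant frequencies in two species,
    weighting the two species equally (an assumption of this sketch)."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"frequency out of range: {p}")
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-species H
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                      # H in the pooled sample
    if h_t == 0:
        return 0.0  # site is invariant in the pooled sample
    return (h_t - h_s) / h_t


print(site_f_st(0.95, 0.05))  # boundary case described in the text
print(site_f_st(1.0, 0.0))    # complete fixation of alternative variants
```

At exactly 0.95 and 0.05 the value is 0.81, so more extreme frequencies on either side push the site above the cutoff.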

We first compare the divergence between M. m. domesticus and M. m. castaneus, for which T_d has been estimated to be less than 0.5 Myrs (Fujiwara, et al. 2022). In comparison, between Mus m. domesticus and Mus spretus, T_d is close to 3 Myrs (Rudra, et al. 2016). As noted above, the reduction in the divergence rate relative to that of Eq. (3) is proportional to T_f/T_d (for single-copy genes) or T_f*/T_d (for rRNA genes). As T_f and T_f* are both from M. m. domesticus and T_d is 6 times larger in the comparison with M. spretus, we expect the results to be quite different between the two sets of species comparisons.

Although T_f and T_f* estimates are less reliable than T_d estimates, both comparisons use the same T_f and T_f* from M. m. domesticus. Hence, the results should be qualitatively unbiased. For a demonstration, we shall use the estimates of T_f (i.e., the coalescence time) at 0.2 Myrs for single-copy genes by using an average nucleotide diversity of 0.0014 and the mutation rate of 5.7×10⁻⁹ per base pair per generation (Geraldes, et al. 2008; Phifer-Rixey, et al. 2020). Based on the estimated C* above, we obtain T_f* for rDNA mutations at 5×0.2 Myrs, or 1 Myrs. While some have estimated T_f to be close to 0.4 Myrs (Fujiwara, et al. 2022), we aim to show that the pattern of reduction in rRNA divergence is true even with the conservative estimates.

Between M. m. domesticus and M. m. castaneus, the reduction in substitution rate for single-copy genes should be ∼ 40% (T_f/T_d = 0.2/0.5), and the reduction for rRNA genes should be 100% (T_f*/T_d = 1/0.5 > 1). Table 3 on their DNA sequence divergence shows that rRNA genes are indeed far less divergent than single-copy genes. In fact, only a small fraction of rDNA mutations is expected to be fixed as T_f* for rDNA at 1 Myrs is 2 times larger than the divergence time, T_d. We should note again that the non-negligible fixation of rRNA mutations suggests that C* at 5 is perhaps an over-estimate.

Table 3. Divergence in rRNA genes between *M. m. domesticus* and *M. m. castaneus*.

Between Mus m. domesticus and Mus spretus, the reduction in the actual substitution rate from the theoretical limit for single-copy genes should be 6.7% (T_f/T_d = 0.2/3) and, for rRNA genes, should be 33% (T_f*/T_d = 1/3). The evolutionary rate (i.e. the fixation rate) of the IGS region is lower than that of single-copy genes, 0.01 in IGS vs. 0.021 genome-wide (Table 4), as one would expect. However, ETS and ITS regions have evolved at a surprising rate that is 12% higher than that of single-copy genes. Note that the reduction in C*, even to the lowest limit of C* = 1, would only elevate the rate of fixation in rRNA genes to parity with single-copy genes. From Eq. (1), the explanation would be that V*(K) has to be very large, such that C* is < 1. With such rapid homogenization, the fixation time approaches 0 and the substitution rate in rRNA genes can indeed reach the theoretical limit of Eq. (3). In such a scenario, the substitution rate in ETS and ITS, compared to single-copy genes in mice, may increase by 7%, T_f/(T_d – T_f) = 0.2/(3 - 0.2). If we use T_f ∼ 0.4 Myrs in an alternative estimation, the increase can be up to 15%.
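The upper bound on the rate excess follows from comparing the two realized rates. As we read the argument, single-copy genes realize a fraction (T_d - T_f)/T_d of the rate in Eq. (3) while a system whose fixation time approaches 0 realizes all of it, so the relative gain is T_d/(T_d - T_f) - 1 = T_f/(T_d - T_f). A minimal sketch with the mouse values (the function name is hypothetical):

```python
def max_rate_gain(t_f, t_d):
    """Largest relative excess in substitution rate that instant fixation
    (T_f* -> 0) can give over single-copy genes: T_f / (T_d - T_f)."""
    if not 0 <= t_f < t_d:
        raise ValueError("requires 0 <= t_f < t_d")
    return t_f / (t_d - t_f)


T_D = 3.0  # Myrs, M. m. domesticus vs. Mus spretus
for t_f in (0.2, 0.4):
    print(f"T_f = {t_f} Myrs: maximal gain = {max_rate_gain(t_f, T_D):.1%}")
```

Under these inputs the bound sits between roughly 7% and 15%, which brackets the 12% excess reported for ETS and ITS.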

Table 4. Divergence in rRNA genes between *M. m. domesticus* and *Mus spretus*.

In conclusion, the high rate of fixation in ETS and ITS may be due to very frequent gene conversions that reduce C* to be less than 1. In contrast, IGS may have undergone fewer gene conversions and its long-term C* is slightly larger than 1. Indeed, the heterozygosity in IGS region, at about 2-fold higher than that of ETS and ITS regions (8‰ for IGS, 5‰ for ETS and 3‰ for ITS), supports this interpretation.

2) Between Human and Chimpanzee - Positive selection in addition to rapid drift in rDNA divergence

As with the mouse data, the polymorphism of rDNAs in humans would suggest a slower short-term evolutionary rate. The same caveat applies: C* estimated from the polymorphism data would have missed the rapidly fixed variants. Hence, the long-term C* obtained from species divergence might be much smaller than 8.

Our results show that the evolutionary rate of rRNA genes between human and chimpanzee is substantially higher than that of single-copy genes (Table 5). In particular, the 5’ETS region shows a 100% rate acceleration, at 22.7‰ vs. 11‰ genome-wide. Even after removing CpG sites, the fixation rate still reaches 22.4‰. In this case, even if C* << 1, the extremely rapid fixation would only increase the substitution rate by T_f/(T_d − T_f), or 11%, compared to single-copy genes. Thus, the much accelerated evolution of rRNA genes between humans and chimpanzees cannot be entirely attributed to genetic drift. In the next and last section, we will test if selection is operating on rRNA genes by examining the pattern of gene conversion.

Table 5. Divergence in rRNA genes between Human and Chimpanzee.

3) Positive selection for rRNA mutations in apes, but not in mice – Evidence from gene conversion patterns

For gene conversion, we examine the patterns of AT-to-GC vs. GC-to-AT changes. While it has been reported that gene conversion would favor AT-to-GC over GC-to-AT conversion (Jeffreys and Neumann 2002; Meunier and Duret 2004) at the site level, we are interested in the gene level, obtained by summing up all conversions across sites. We designate the proportion of AT-to-GC conversion as f and the reciprocal, GC-to-AT, as g. Both f and g represent the proportion of fixed mutations between species (see Methods). So defined, f and g are influenced by the molecular mechanisms as well as natural selection. The latter may favor a higher or lower GC ratio at the genic level between species. As the selective pressure is distributed over the length of the gene, each site may experience rather weak pressure.

Let p be the proportion of AT sites and q be the proportion of GC sites in the gene. The flux of AT-to-GC would be pf and the flux in reverse, GC-to-AT, would be qg. At equilibrium, pf = qg. Given f and g, the ratio of p and q would eventually reach p/q = g/f. We now determine if the fluxes are in equilibrium (pf =qg). If they are not, the genic GC ratio is likely under selection and is moving to a different equilibrium.
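The equilibrium condition pf = qg can be illustrated numerically. The sketch below is a minimal deterministic iteration, assuming constant f and g per time step; the function names and the example values of f and g are hypothetical and are not estimates from this study.

```python
def iterate_at_fraction(p0, f, g, steps):
    """Iterate the AT fraction p under constant fluxes (q = 1 - p is the GC fraction).

    Each step, a fraction f of AT sites becomes GC and a fraction g of GC sites becomes AT.
    """
    p = p0
    for _ in range(steps):
        q = 1.0 - p
        p = p - p * f + q * g
    return p


def equilibrium_at_fraction(f, g):
    """AT fraction at which the two fluxes balance (p*f == q*g), i.e. p/q = g/f."""
    return g / (f + g)


f_example, g_example = 0.02, 0.01  # hypothetical per-step conversion proportions
p_final = iterate_at_fraction(p0=0.42, f=f_example, g=g_example, steps=2000)
p_eq = equilibrium_at_fraction(f_example, g_example)
print(round(p_final, 6), round(p_eq, 6), round(p_eq / (1.0 - p_eq), 6))
```

Whatever the starting value of p, the iteration settles at p/q = g/f. An observed (pf)/(qg) that departs from 1 therefore indicates a gene that has not reached, or is moving away from, this balance.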

In these genic analyses, we first analyze the human lineage (Brown and Jiricny 1989; Galtier and Duret 2007). Using chimpanzees and gorillas as the outgroups, we identified the derived variants that became nearly fixed in humans with frequency > 0.8 (Table 6). The chi-square test shows that the GC variants had a significantly higher fixation probability compared to AT. In addition, this pattern is also found in chimpanzees (p < 0.001). In M. m. domesticus (Table 6), the chi-square test reveals no difference in the fixation probability between GC and AT (p = 0.957). Further details can be found in Supplementary Figure 2. Overall, a higher fixation probability of the GC variants is found in human and chimpanzee, whereas this bias is not observed in mice.

Table 6. The A/T to G/C and G/C to A/T changes in apes and mouse.

Based on Table 6, we calculate the values of p, q, f and g (see Table 7). As shown in the last row of Table 7, the (pf)/(qg) ratio is much larger than 1 in both the human and chimpanzee lineages. Notably, the ratio in mouse is not significantly different from 1. Combining Tables 4 and 7, we conclude that the slight acceleration of fixation in mice can be accounted for by genetic drift, due to gene conversion among rRNA gene copies. In contrast, the different fluxes corroborate the interpretation of Table 5 that selection is operating in both humans and chimpanzees.

Table 7. The parameter values of p, q, f and g in the evolution between A/T and G/C.

Discussion

The Haldane model is an “individual-output” model of genetic drift (Chen, et al. 2017). Hence, it does not require the population to follow the rules of Mendelian populations. It is also sufficiently flexible for studying the various stochastic forces, other than sampling errors, that together drive genetic drift. In the companion study (Ruan, et al. 2024), we address the ecological forces of genetic drift and, in this study, we analyze the neutral evolution of rRNA genes. Both examples are amenable to analysis by the Haldane model, but not by the WF model.

In multi-copy systems, there are several mechanisms of homogenization within individuals. For rRNA genes, whether on the same or different chromosomes (Gonzalez and Sylvester 2001; van Sluis and McStay 2019), the predominant mechanisms of homogenization are gene conversion and unequal crossover. In the process of exchanging DNA sections in meiosis, gene conversions are an order of magnitude more common than crossovers (Cole, et al. 2012; Williams, et al. 2015). It is not clear how large a role is played by replication slippage, which affects copies of the same cluster.

There have been many rigorous analyses that confront the homogenizing mechanisms directly. These studies (Smith 1974; Ohta 1976; Dover 1982; Nagylaki 1983; Ohta and Dover 1983) modeled gene conversion and unequal crossover head-on. Unfortunately, on top of the complexities of such models, the key parameter values are rarely obtainable. In the branching process, all these complexities are wrapped into V*(K) for formulating the evolutionary rate. In such a formulation, the collective strength of these various forces may indeed be measurable, as shown in this study.

The branching process is a model for general processes. Hence, it can be used to track genetic drift in systems with two stages of evolution, within and between individuals, even though TEs, viruses and rRNA genes are very different biological entities. We use the rRNA genes to convey this point. Multi-copy genes, like rDNA, are under rapid genetic drift via the homogenization process. The drift is strong enough to reduce the copy number in the population from ∼ 150N to < N. A fraction of mutations in multi-copy genes may have been fixed by drift almost instantaneously on the evolutionary time scale. This acceleration is seen in mice but would, by convention, have been interpreted as the result of positive selection. Interestingly, while positive selection may not be necessary to explain the mouse data, it is indeed evident in human and chimpanzee, as the evolutionary rate of rRNA genes exceeds the limit set by strong drift.

In conclusion, the Haldane model is far more general than the WF model as this and the companion study clearly demonstrate. Its E(K) parameter (which is usually set to 1) should be equivalent to the single parameter of the WF model (i.e., N). The other parameter of the Haldane model, i.e., V(K), is then free to track genetic drift, whereas the WF model, setting V(K) = E(K), is highly constrained. Nevertheless, the vast literature using the WF model has led to substantial understandings of the neutral process via the diffusion process (Crow and Kimura 1970) or coalescence (Kingman 1982; Fu 2006). In this sense, the Haldane model should be built on the WF model by introducing a second parameter and permit the analyses of a broader range of stochastic ecological and evolutionary forces.

Materials and methods

Data Collection

We collected high-coverage whole-genome sequencing data for our study. The genome sequences of human (n = 8), chimpanzee (n = 1) and gorilla (n = 1) were sourced from the National Center for Biotechnology Information (NCBI) (Supplementary Table 3). Human individuals were drawn from diverse geographical origins encompassing three continents (4 Asians, 2 Europeans and 2 Africans); the sample labeled Asia CRC was the normal tissue of the Case 1 patient in Chen, Wu, et al. (2022). Genomic sequences of mice (n = 13) were sourced from the Wellcome Sanger Institute’s Mouse Genome Project (MGP) (Keane, et al. 2011). Although some artificial selection has been performed on laboratory mouse strains (Yang, et al. 2011), the WSB/EiJ, ZALENDE/Ei and LEWES/EiJ strains were derived from wild populations. Incorporating these wild-derived laboratory strains, along with other inbred strains, a cohort of 10 mice was utilized to approximate the population representation of M. m. domesticus. Furthermore, the low F_IS of 0.14 for rDNA in M. m. domesticus found in this study suggests that each mouse covers 86% of the population’s genetic diversity, thereby mitigating concerns about potential sampling biases. Accessions and detailed information on the samples used in this study are listed in Supplementary Table 3 and Table 4.

Variant allele frequency

Following adapter trimming and the removal of low-quality sequences, the whole-genome sequencing data of apes and mice were mapped against the respective reference sequences: the human rDNA reference sequence (Human ribosomal DNA complete repeating unit, GenBank: U13369.1) and the mouse rDNA reference sequence (Mus musculus ribosomal DNA, complete repeating unit, GenBank: BK000964.3). Alignment was performed using the Burrows-Wheeler Alignment Tool v0.7.17 (Li and Durbin 2009) with default parameters. All mapping and analyses were performed among individual copies of rRNA genes. The pipelines for variant calling are similar to those used in the literature (Ma, et al. 2022; Sun, et al. 2022). Each individual was considered a pseudo-population of rRNA genes, and the diversity of rRNA genes was calculated using this pseudo-population. To determine variant frequency within individuals, variants were called from multiple alignment files using bcftools (Danecek, et al. 2021) to ensure the inclusion of all polymorphic sites that appeared in at least one sample and to maintain consistent processing steps. Per-sample variant calling results were generated in a VCF file using the following settings: ‘bcftools mpileup --redo-BAQ --max-depth 50000 --per-sample-mF --annotate’ and ‘bcftools call -mv’. Our analysis specifically focused on single nucleotide variants (SNVs) with only two alleles, while other mutation types were discarded. Variant information of interest per sample was extracted from the VCF files using the commands ‘bcftools view -V indels’ and ‘bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[\t%GT]\t[\t%DP]\t[\t%AD]\n'’. The AD field (allelic depths, i.e., the total high-quality bases supporting each allele) was used to obtain the number of reference-supporting reads (n.ref) and alternative-supporting reads (n.alt) in each sample. Sites with a depth < 10 in any sample were filtered out, resulting in an average of the minimum depths above 3000 for each remaining site across all samples. This filtering step enhances the robustness of variant frequency estimation, where the variant frequency was approximated by the ratio n.alt/(n.ref + n.alt). This process allowed us to identify all polymorphic sites present in the samples and the variant frequency within each individual. The population-level variant frequency was then computed by averaging variant frequencies across all individuals.
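The frequency calculation described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function names and read counts are hypothetical, and the depth filter is applied to n.ref + n.alt as a stand-in for the site depth.

```python
def variant_frequency(n_ref, n_alt):
    """Within-individual variant frequency, n.alt / (n.ref + n.alt)."""
    depth = n_ref + n_alt
    if depth == 0:
        raise ValueError("no reads support either allele")
    return n_alt / depth


def population_frequency(counts_per_individual, min_depth=10):
    """Average the within-individual frequencies at one site.

    counts_per_individual: list of (n_ref, n_alt) tuples, one per individual.
    Returns None when any individual falls below min_depth, mirroring the rule
    that a site is discarded if its depth is < 10 in any sample.
    """
    frequencies = []
    for n_ref, n_alt in counts_per_individual:
        if n_ref + n_alt < min_depth:
            return None  # site filtered out
        frequencies.append(variant_frequency(n_ref, n_alt))
    return sum(frequencies) / len(frequencies)


# Hypothetical read counts for one site in three individuals.
site_counts = [(2700, 300), (3600, 400), (1500, 500)]
site_frequency = population_frequency(site_counts)
low_depth_site = population_frequency([(2700, 300), (4, 3)])
print(site_frequency, low_depth_site)
```

Averaging per-individual frequencies, rather than pooling reads, gives each individual equal weight regardless of its sequencing depth.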

Identification of Divergence Sites

Polymorphism represents the transient phase preceding divergence. With variant frequencies within individuals, within species, and between species, we obtained the F_IS and F_ST values for each site. Within the M. m. domesticus population, 779 polymorphic sites were identified. Among these, 744 sites (95.5%) exhibited an F_IS below 0.4, and F_IS rarely exceeded 0.8 (Fig. 4B). In the comparison between M. m. domesticus and Mus spretus, 1579 variant sites were found, among which 453 sites displayed an F_ST above 0.8, indicating swift fixation of mutations during species divergence.
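For orientation, per-site F statistics of this kind can be sketched from variant frequencies. The estimator is not spelled out in this section, so the code below assumes the standard heterozygosity-based form, F_IS = (H_S − H_I)/H_S and F_ST = (H_T − H_S)/H_T with H = 2x(1 − x); the function names are hypothetical.

```python
def heterozygosity(x):
    """Expected heterozygosity at a biallelic site with variant frequency x."""
    return 2.0 * x * (1.0 - x)


def f_is(individual_freqs):
    """F_IS = (H_S - H_I) / H_S for one site.

    H_I: mean heterozygosity among rRNA gene copies within individuals.
    H_S: heterozygosity computed from the species-level mean frequency.
    Returns None for a site that is monomorphic in the species.
    """
    n = len(individual_freqs)
    h_i = sum(heterozygosity(x) for x in individual_freqs) / n
    h_s = heterozygosity(sum(individual_freqs) / n)
    if h_s == 0.0:
        return None
    return (h_s - h_i) / h_s


def f_st(freq_a, freq_b):
    """F_ST = (H_T - H_S) / H_T between two species at one site."""
    h_s = (heterozygosity(freq_a) + heterozygosity(freq_b)) / 2.0
    h_t = heterozygosity((freq_a + freq_b) / 2.0)
    if h_t == 0.0:
        return None
    return (h_t - h_s) / h_t


print(f_is([0.5, 0.5]), f_is([0.0, 1.0]), f_st(1.0, 0.0), f_st(0.3, 0.3))
```

Under this form, 1 − F_IS equals H_I/H_S, which is the sense in which an F_IS of 0.14 corresponds to each individual carrying 86% of the species-level diversity.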

F_ST analysis between human and chimpanzee was conducted for each human individual and is summarized in Table 5. We identified between 672 and 705 sites with F_ST values above 0.8 across individuals, which we treat as robust divergence sites. Considering the high mutation rate at CpG sites (Ehrlich and Wang 1981; Sved and Bird 1990) and the predominance of GC content in rDNA (e.g. 58% GC in human), we further estimated the evolutionary rate at non-CpG sites during the interspecies divergence. To achieve this, mutations at CpG sites were manually removed by excluding all sites containing CpG in one species and TpG or CpA in the other; the reverse was similarly discarded. Additionally, the count of non-CpG sites within the mapping length, where site depth exceeded 10, was obtained with Samtools (Danecek, et al. 2021) using the settings ‘samtools mpileup -Q 15 -q 20’. As a result, the evolutionary rate of rDNA at non-CpG sites was ascertained.
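The CpG exclusion rule can be expressed as a small predicate. The removal was done manually in this study, so the sketch below is only one possible encoding of the stated rule; it assumes two aligned, equal-length sequences, and the function name is hypothetical.

```python
CPG = "CG"
DEAMINATION_PRODUCTS = {"TG", "CA"}


def is_cpg_related(seq_a, seq_b, pos):
    """Return True if the substitution at 0-based position pos involves a CpG.

    seq_a and seq_b are aligned sequences of equal length from the two species.
    The site is flagged when a dinucleotide spanning it reads CpG in one species
    and TpG or CpA in the other (in either direction).
    """
    for start in (pos - 1, pos):  # the two dinucleotide windows containing pos
        if start < 0 or start + 2 > len(seq_a):
            continue
        a = seq_a[start:start + 2].upper()
        b = seq_b[start:start + 2].upper()
        if (a == CPG and b in DEAMINATION_PRODUCTS) or \
           (b == CPG and a in DEAMINATION_PRODUCTS):
            return True
    return False


print(is_cpg_related("ACGT", "ATGT", 1))  # C-to-T within a CpG
print(is_cpg_related("AATT", "AGTT", 1))  # A-to-G outside any CpG
```

Checking both windows covers substitutions at either the C or the G of a CpG dinucleotide.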

For assessing diversity and divergence across gene segments, we used ‘samtools faidx’ to partition variants into a total of 8 regions within rRNA genes, including 5’ETS, ITS1, ITS2, 3’ETS, IGS, 18S, 5.8S, and 28S, aligning them with corresponding reference sequences for further analysis. The functional parts (18S, 5.8S, and 28S) were subject to strong negative selection, exhibiting minimal substitutions during species divergence as expected. This observation is primarily used for comparison with the non-functional parts and reflects that the non-functional parts are less constrained by negative selection.

Genome-wide Divergence Estimation

To assess the genome-wide divergence among the 4 mouse strains and species, we downloaded their toplevel reference genomes from the Ensembl genome browser (GenBank Assembly ID: GCA_001624865.1, GCA_001624775.1, GCA_001624445.1, and GCA_001624835.1). We then used Mash (Ondov, et al. 2016) to estimate divergence across the entire genome (mainly single-copy genes) with ‘mash sketch’ and ‘mash dist’. Additionally, the effective copy number of rRNA genes, denoted as C*, can be estimated as the ratio of the population diversity observed in rDNA to that observed in single-copy genes.
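The C* estimate is a simple ratio, sketched below for concreteness. The function name is hypothetical, and the rDNA diversity value is a placeholder chosen only to illustrate the calculation; the single-copy diversity of 0.0014 is the value quoted for M. m. domesticus in the Results.

```python
def effective_copy_number(diversity_rdna, diversity_single_copy):
    """Estimate C* as the ratio of rDNA diversity to single-copy diversity."""
    if diversity_single_copy <= 0:
        raise ValueError("single-copy diversity must be positive")
    return diversity_rdna / diversity_single_copy


single_copy_diversity = 0.0014  # average nucleotide diversity quoted in the text
rdna_diversity = 0.007          # hypothetical placeholder, not a measured value
c_star = effective_copy_number(rdna_diversity, single_copy_diversity)
print(round(c_star, 2))
```

Because this ratio is based on polymorphism, it can miss variants that were fixed rapidly, which is why the text treats it as a possible over-estimate of the long-term C*.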

Estimation of Site Conversion

To estimate site conversions, whether accidental or directional, it was essential to identify new mutant alleles in each lineage after divergence. New mutations were defined as derived alleles that differed from ancestral alleles shared by two outgroup species, where the ancestral state also exhibited high identity among copies. By focusing on new mutations with low initial frequency, we minimized the influence of their initial frequency on fixation probability and fixation time. Specifically, variants shared between chimpanzee and gorilla, human and gorilla, and M. m. castaneus and M. spretus, each with a frequency greater than 0.8, were considered ancestral for human, chimpanzee, and M. m. domesticus, respectively.

Derived variants were then categorized into two groups: (nearly) fixed (frequency > 0.8) and not fixed mutations in the lineages of humans, chimpanzees and M. m. domesticus (Table 6). The frequency threshold of > 0.8 was chosen to balance the need for a sufficient number of sites to calculate f/g against the need to ensure their reliability. We also applied a more stringent threshold of > 0.9, which yielded similar results.

In this study, six types of mutations were tabulated, each representing an ancestral-to-derived change as depicted in Supplementary Fig. 2. For example, A-to-G represented both the A-to-G and T-to-C types of mutations. The C-to-G (or G-to-C) and A-to-T (or T-to-A) types of mutations were excluded from the subsequent analysis.

Specifically, f represents the proportion of fixed mutations where an A or T nucleotide has been converted to a G or C nucleotide. The numerator for f is the number of fixed mutations from A-to-G, T-to-C, T-to-G, or A-to-C. Since the fixed sites accounted for less than 1% of the non-functional length of rDNA, the denominator is the total number of A or T sites in the rDNA sequence of the species lineage.

Similarly, g is defined as the proportion of fixed mutations where a G or C nucleotide has been converted to an A or T nucleotide. The numerator for g is the number of fixed mutations from G-to-A, C-to-T, C-to-A, or G-to-T. The denominator is the total number of G or C sites in the rDNA sequence of the species lineage.
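The definitions of f and g, together with p and q from the Results, can be put into a short calculation. This is a sketch with hypothetical counts (chosen so that GC content is 58%, the human value quoted above), not the values of Table 6 or Table 7; the function name is also hypothetical.

```python
def conversion_parameters(n_at_to_gc_fixed, n_gc_to_at_fixed, n_at_sites, n_gc_sites):
    """Return p, q, f, g and the flux ratio (p*f)/(q*g).

    n_at_to_gc_fixed: fixed A-to-G, T-to-C, T-to-G and A-to-C mutations
    n_gc_to_at_fixed: fixed G-to-A, C-to-T, C-to-A and G-to-T mutations
    n_at_sites, n_gc_sites: A/T and G/C sites in the lineage's rDNA sequence
    """
    total_sites = n_at_sites + n_gc_sites
    p = n_at_sites / total_sites        # proportion of AT sites
    q = n_gc_sites / total_sites        # proportion of GC sites
    f = n_at_to_gc_fixed / n_at_sites   # proportion of AT sites fixed as GC
    g = n_gc_to_at_fixed / n_gc_sites   # proportion of GC sites fixed as AT
    return p, q, f, g, (p * f) / (q * g)


# Hypothetical counts for illustration only.
p, q, f, g, flux_ratio = conversion_parameters(
    n_at_to_gc_fixed=60, n_gc_to_at_fixed=30, n_at_sites=4200, n_gc_sites=5800
)
print(round(p, 2), round(q, 2), round(f, 4), round(g, 4), round(flux_ratio, 2))
```

With these definitions, (pf)/(qg) reduces algebraically to the ratio of the two fixed-mutation counts, so a ratio of 1 means that equal numbers of AT-to-GC and GC-to-AT changes were fixed.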

The consensus rDNA sequences for the species lineage were generated by Samtools consensus (Danecek, et al. 2021) from the bam file after alignment. The following command was used: ‘samtools consensus -@ 20 -a -d 10 --show-ins no --show-del yes input_sorted.bam output.fa’.

The alternative hypothesis of a GC-biased mutation process alone (Wolfe, et al. 1989; Francino and Ochman 1999) can be rejected in this study. Under a purely mutational mechanism, AT and GC variants should have equal fixation probabilities. We counted the nearly fixed AT-to-GC and GC-to-AT types of mutations and conducted a chi-square test to assess their fixation probabilities.
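A chi-square test of this kind, on a 2×2 table of fixed vs. not-fixed counts for GC and AT variants, can be sketched with the standard library alone. The counts below are hypothetical, no continuity correction is applied, and the exact test configuration used in the study is not specified here, so this should be read as an illustration only.

```python
import math


def chi_square_2x2(gc_fixed, gc_not_fixed, at_fixed, at_not_fixed):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    Rows: derived GC variants and derived AT variants.
    Columns: (nearly) fixed and not fixed.
    Returns (chi-square statistic, p-value).
    """
    a, b, c, d = gc_fixed, gc_not_fixed, at_fixed, at_not_fixed
    n = a + b + c + d
    row_gc, row_at = a + b, c + d
    col_fixed, col_not_fixed = a + c, b + d
    if min(row_gc, row_at, col_fixed, col_not_fixed) == 0:
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / (row_gc * row_at * col_fixed * col_not_fixed)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: GC variants fixed more often than AT variants.
chi2_biased, p_biased = chi_square_2x2(30, 10, 10, 30)
# Hypothetical counts: no difference between GC and AT variants.
chi2_equal, p_equal = chi_square_2x2(10, 10, 10, 10)
print(chi2_biased, p_biased < 0.001, chi2_equal, p_equal)
```

A small p-value rejects equal fixation probabilities for GC and AT variants, which is the prediction of a purely mutational explanation.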

Notably, we found that the G (or C) variant had a significantly higher fixation probability than A (or T) at the site level in apes, but not in mice. To test whether there is an equilibrium state at the genic level in these three species, we computed the (pf)/(qg) ratio in Table 7; a significant deviation of the ratio from 1 would imply biased genic conversion.

Data Availability

No new data were generated in this study. The genomic data used in this study are available from the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/) and the Mouse Genomes Project (https://www.sanger.ac.uk/data/mouse-genomes-project/). The specific accession numbers are recorded in Supplementary Table S4 and S5.

Acknowledgements

We are grateful for the helpful comments from many colleagues on the Chat Group “Cancer - The new evolving species”, in particular, Weiwei Zhai, Yong Zhang, GuoDong Wang of CAS and Jianrong Yang of SYSU. This work was supported by the National Natural Science Foundation of China (32150006, 32293193/32293190 to C.I.W., 32200493 to Y.R., and 82341092 to HJ Wen.), the National Key Research and Development Projects of the Ministry of Science and Technology of China (2021YFC2301300, 2021YFC0863400), and Guangdong Key Research and Development Program (No. 2022B1111030001).

References

Alexandrov I, Kazakov A, Tumeneva I, Shepelev V, Yurov Y (2001). Alpha-satellite DNA of primates: old and new families. Chromosoma 110:253–266.
Arkhipova IR (2018). Neutral Theory, Transposable Elements, and Eukaryotic Genome Evolution. Molecular Biology and Evolution 35:1332–1337.
Arnheim N, Treco D, Taylor B, Eicher EM (1982). Distribution of ribosomal gene length variants among mouse chromosomes. Proc Natl Acad Sci U S A 79:4677–4680.
Bowman JC, Petrov AS, Frenkel-Pinter M, Penev PI, Williams LD (2020). Root of the Tree: The Significance, Evolution, and Origins of the Ribosome. Chemical Reviews 120:4848–4878.
Brown T, Jiricny J (1989). Repair of base-base mismatches in simian and human cells. Genome 31:578–583.
Cabot EL, Doshi P, Wu ML, Wu CI (1993). Population genetics of tandem repeats in centromeric heterochromatin: unequal crossing over and chromosomal divergence at the Responder locus of Drosophila melanogaster. Genetics 135:477–487.
Cannings C (1974). The latent roots of certain Markov chains arising in genetics: A new approach, I. Haploid models. Advances in Applied Probability 6:260–290.
Cazaux B, Catalan J, Veyrunes F, Douzery EJP, Britton-Davidian J (2011). Are ribosomal DNA clusters rearrangement hotspots? A case study in the genus Mus (Rodentia, Muridae). BMC Evolutionary Biology 11:124.
Charlesworth B (2009). Effective population size and patterns of molecular evolution and variation. Nature Reviews Genetics 10:195–205.
Charlesworth B, Sniegowski P, Stephan W (1994). The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 371:215–220.
Chen B, Wu X, Ruan Y, Zhang Y, Cai Q, Zapata L, Wu CI, Lan P, Wen H (2022). Very large hidden genetic diversity in one single tumor: evidence for tumors-in-tumor. Natl Sci Rev 9:nwac250.
Chen QP, Yang H, Feng X, Chen QJ, Shi SH, Wu CI, He ZW (2022). Two decades of suspect evidence for adaptive molecular evolution-negative selection confounding positive-selection signals. National Science Review 9:nwab217.
Chen Y, Tong D, Wu CI (2017). A New Formulation of Random Genetic Drift and Its Application to the Evolution of Cell Populations. Mol Biol Evol 34:2057–2064.
Chia AB, Watterson GA (1969). Demographic effects on the rate of genetic evolution I. constant size populations with two genotypes. Journal of Applied Probability 6:231–248.
Cole F, Kauppi L, Lange J, Roig I, Wang R, Keeney S, Jasin M (2012). Homeostatic control of recombination is implemented progressively in mouse meiosis. Nature Cell Biology 14:424–430.
Crow J, Kimura M (1970). An Introduction to Population Genetics Theory. Harper & Row.
Crow JF, Kimura M (2009). An Introduction to Population Genetics Theory. Blackburn Press.
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. (2021). Twelve years of SAMtools and BCFtools. Gigascience 10.
Deng S, Xing K, He X (2022). Mutation signatures inform the natural host of SARS-CoV-2. National Science Review 9:nwab220.
Der R, Epstein CL, Plotkin JB (2011). Generalized population models and the nature of genetic drift. Theoretical Population Biology 80:80–99.
Dover G (1982). Molecular drive: a cohesive mode of species evolution. Nature 299:111–117.
Ehrlich M, Wang RY (1981). 5-Methylcytosine in eukaryotic DNA. Science 212:1350–1357.
Eickbush TH, Eickbush DG (2007). Finely orchestrated movements: evolution of the ribosomal RNA genes. Genetics 175:477–485.
Francino MP, Ochman H (1999). Isochores result from mutation not selection. Nature 400:30–31.
Fu Y-X (2006). Exact coalescent for the Wright–Fisher model. Theoretical Population Biology 69:385–394.
Fujiwara K, Kawai Y, Takada T, Shiroishi T, Saitou N, Suzuki H, Osada N (2022). Insights into Mus musculus Population Structure across Eurasia Revealed by Whole-Genome Analysis. Genome Biol Evol 14.
Galtier N, Duret L (2007). Adaptation or biased gene conversion? Extending the null hypothesis of molecular evolution. Trends in Genetics 23:273–277.
Geraldes A, Basset P, Gibson B, Smith KL, Harr B, Yu HT, Bulatova N, Ziv Y, Nachman MW (2008). Inferring the history of speciation in house mice from autosomal, X-linked, Y-linked and mitochondrial genes. Mol Ecol 17:5349–5363.
Gibbons JG, Branco AT, Godinho SA, Yu S, Lemos B (2015). Concerted copy number variation balances ribosomal DNA dosage in human and mouse genomes. Proc Natl Acad Sci U S A 112:2485–2490.
Gonzalez IL, Sylvester JE (2001). Human rDNA: Evolutionary Patterns within the Genes and Tandem Arrays Derived from Multiple Chromosomes. Genomics 73:255–263.
Guan W-j, Zhong N-s (2022). Strategies for reopening in the forthcoming COVID-19 era in China. National Science Review 9:nwac054.
Guarracino A, Buonaiuto S, de Lima LG, Potapova T, Rhie A, Koren S, Rubinstein B, Fischer C, Abel HJ, Antonacci-Fulton LL, et al. (2023). Recombination between heterologous human acrocentric chromosomes. Nature 617:335–343.
Guillén AKZ, Hirai Y, Tanoue T, Hirai H (2004). Transcriptional repression mechanisms of nucleolus organizer regions (NORs) in humans and chimpanzees. Chromosome Research 12:225–237.
Hartl DL, Clark AG (1997). Principles of population genetics. Sunderland: Sinauer Associates.
Hori Y, Shimamoto A, Kobayashi T (2021). The human ribosomal DNA array is composed of highly homogenized tandem clusters. Genome Res 31:1971–1982.
Hou M, Shi JR, Gong ZK, Wen HJ, Lan Y, Deng XZ, Fan QH, Li JJ, Jiang ML, Tang XP, et al. (2023). Intra- vs. Interhost Evolution of SARS-CoV-2 Driven by Uncorrelated Selection-The Evolution Thwarted. Molecular Biology and Evolution 40:msad204.
Huang J, Liu X, Zhang L, Zhao Y, Wang D, Gao J, Lian X, Liu C (2021). The oscillation-outbreaks characteristic of the COVID-19 pandemic. National Science Review 8:nwab100.
Jeffreys AJ, Neumann R (2002). Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nat Genet 31:267–271.
Jurka J, Kapitonov VV, Kohany O, Jurka MV (2007). Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet 8:241–259.
Karlin S, McGregor J (1964). Direct Product Branching Processes and Related Markov Chains. Proceedings of the National Academy of Sciences 51:598–602.
Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, Heger A, Agam A, Slater G, Goodson M, et al. (2011). Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477:289–294.
Kingman JFC (1982). On the Genealogy of Large Populations. Journal of Applied Probability 19:27–43.
Krystal M, D’Eustachio P, Ruddle FH, Arnheim N (1981). Human nucleolus organizers on nonhomologous chromosomes can share the same ribosomal gene variants. Proceedings of the National Academy of Sciences of the United States of America 78:5744–5748.
Kumar S, Suleski M, Craig JM, Kasprowicz AE, Sanderford M, Li M, Stecher G, Hedges SB (2022). TimeTree 5: An Expanded Resource for Species Divergence Times. Mol Biol Evol 39.
Li H, Durbin R (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760.
Li W-H (1997). Molecular evolution. Sunderland, Mass: Sinauer Associates.
Ma Y, Mao X, Wang J, Zhang L, Jiang Y, Geng Y, Ma T, Cai L, Huang S, Hollingsworth P, et al. (2022). Pervasive hybridization during evolutionary radiation of Rhododendron subgenus Hymenanthes in mountains of southwest China. National Science Review 9:nwac276.
McDermott SR, Noor MAF (2010). The role of meiotic drive in hybrid male sterility. Philosophical Transactions of the Royal Society B-Biological Sciences 365:1265–1272.
Meunier J, Duret L (2004). Recombination drives the evolution of GC-content in the human genome. Molecular Biology and Evolution 21:984–990.
Nagylaki T (1983). Evolution of a large population under gene conversion. Proc Natl Acad Sci U S A 80:5941–5945.
Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, et al. (2022). The complete sequence of a human genome. Science 376:44–53.
Ohta T (1976). Simple model for treating evolution of multigene families. Nature 263:74–76.
Ohta T, Dover GA (1983). Population genetics of multigene families that are dispersed into two or more chromosomes. Proc Natl Acad Sci U S A 80:4079–4083.
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology 17:132.
Pan Y, Liu P, Wang F, Wu P, Cheng F, Jin X, Xu S (2022). Lineage-specific positive selection on ACE2 contributes to the genetic susceptibility of COVID-19. National Science Review 9:nwac118.
Pan Y, Zhang C, Lu Y, Ning Z, Lu D, Gao Y, Zhao X, Yang Y, Guan Y, Mamatyusupu D, et al. (2022). Genomic diversity and post-admixture adaptation in the Uyghurs. National Science Review 9:nwab124.
Parks MM, Kurylo CM, Dass RA, Bojmar L, Lyden D, Vincent CT, Blanchard SC (2018). Variant ribosomal RNA alleles are conserved and exhibit tissue-specific expression. Science Advances 4.
Phifer-Rixey M, Harr B, Hey J (2020). Further resolution of the house mouse (Mus musculus) phylogeny by integration over isolation-with-migration histories. BMC Evol Biol 20:120.
Potapova T, Gerton J (2019). Ribosomal DNA and the nucleolus in the context of genome organization. Chromosome Research 27.
Ruan Y, Wang X, Hou M, Diao W, Xu S, Wen H, Wu C-I (2024). Resolving Paradoxes in Molecular Evolution: The Integrated WF-Haldane (WFH) Model of Genetic Drift. bioRxiv 2024.02.19.581083.
Ruan Y, Wen H, He X, Wu CI (2021). A theoretical exploration of the origin and early evolution of a pandemic. Sci Bull (Beijing) 66:1022–1029.
Ruan Y, Wen H, Hou M, He Z, Lu X, Xue Y, He X, Zhang YP, Wu CI (2022). The twin-beginnings of COVID-19 in Asia and Europe-one prevails quickly. Natl Sci Rev 9:nwab223.
Rudra M, Chatterjee B, Bahadur M (2016). Phylogenetic relationship and time of divergence of Mus terricolor with reference to other Mus species. J Genet 95:399–409.
Salim D, Gerton JL (2019). Ribosomal DNA instability and genome adaptability. Chromosome Research 27:73–87.
Smirnov E, Chmúrčiaková N, Liška F, Bažantová P, Cmarko D (2021). Variability of Human rDNA. Cells 10.
Smith GP (1974). Unequal Crossover and the Evolution of Multigene Families. Cold Spring Harb Symp Quant Biol 38:507–513.
Stults DM, Killen MW, Pierce HH, Pierce AJ (2008). Genomic architecture and inheritance of human ribosomal RNA gene clusters. Genome Res 18:13–18.
Sun N, Yang L, Tian F, Zeng H, He Z, Zhao K, Wang C, Meng M, Feng C, Fang C, et al. (2022). Sympatric or micro-allopatric speciation in a glacial lake? Genomic islands support neither. National Science Review 9:nwac291.
Sved J, Bird A (1990). The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc Natl Acad Sci U S A 87:4692–4696.
Szitenberg A, Cha S, Opperman CH, Bird DM, Blaxter ML, Lunt DH (2016). Genetic Drift, Not Life History or RNAi, Determine Long-Term Evolution of Transposable Elements. Genome Biology and Evolution 8:2964–2978.
Tatsumoto S, Go Y, Fukuta K, Noguchi H, Hayakawa T, Tomonaga M, Hirai H, Matsuzawa T, Agata K, Fujiyama A (2017). Direct estimation of de novo mutation rates in a chimpanzee parent-offspring trio by ultra-deep whole genome sequencing. Sci Rep 7:13561.
van Sluis M, Gailín M, McCarter JGW, Mangan H, Grob A, McStay B (2019). Human NORs, comprising rDNA arrays and functionally conserved distal elements, are located within dynamic chromosomal regions. Genes Dev 33:1688–1701.
van Sluis M, McStay B (2019). Nucleolar DNA Double-Strand Break Responses Underpinning rDNA Genomic Stability. Trends in Genetics 35:743–753.
Varki A, Altheide TK (2005). Comparing the human and chimpanzee genomes: searching for needles in a haystack. Genome Res 15:1746–1758.
Wang X, He Z, Guo Z, Yang M, Xu S, Chen Q, Shao S, Li S, Zhong C, Duke NC, et al. (2022). Extensive gene flow in secondary sympatry after allopatric speciation. National Science Review 9:nwac280.
Williams AL, Genovese G, Dyer T, Altemose N, Truax K, Jun G, Patterson N, Myers SR, Curran JE, Duggirala R, et al. (2015). Non-crossover gene conversions show strong GC bias and unexpected clustering in humans. Elife 4.
Wolfe KH, Sharp PM, Li WH (1989). Mutation rates differ among regions of the mammalian genome. Nature 337:283–285.
Wu C-I, Lyttle TW, Wu M-L, Lin G-F (1988). Association between a satellite DNA sequence and the responder of segregation distorter in D. melanogaster. Cell 54:179–189.
Wu CI, Wang HY, Ling S, Lu X (2016). The Ecology and Evolution of Cancer: The Ultra-Microevolutionary Process. Annu Rev Genet 50:347–369.
Xu J, Nuno K, Litzenburger UM, Qi YY, Corces MR, Majeti R, Chang HY (2019). Single-cell lineage tracing by endogenous mutations enriched in transposase accessible mitochondrial DNA. Elife 8.
Yang H, Wang JR, Didion JP, Buus RJ, Bell TA, Welsh CE, Bonhomme F, Yu AH-T, Nachman MW, Pialek J, et al. (2011). Subspecific origin and haplotype diversity in the laboratory mouse. Nat Genet 43:648–655.
Zhai W, Lai H, Kaya NA, Chen J, Yang H, Lu B, Lim JQ, Ma S, Chew SC, Chua KP, et al. (2022). Dynamic phenotypic heterogeneity and the evolution of multiple RNA subtypes in hepatocellular carcinoma: the PLANET study. National Science Review 9:nwab192.
Zhao Z, Jin L, Fu Y-X, Ramsay M, Jenkins T, Leskinen E, Pamilo P, Trexler M, Patthy L, Jorde LB, et al. (2000). Worldwide DNA sequence variation in a 10-kilobase noncoding region on human chromosome 22. Proceedings of the National Academy of Sciences 97:11354–11358.
Zhou T, Shi T, Li A, Zhu L, Zhao X, Mao N, Qin W, Bi H, Yang M, Dai M, et al. (2022). A third dose of inactivated SARS-CoV-2 vaccine induces robust antibody responses in people with inadequate response to two-dose vaccination. National Science Review 9:nwac066.

Article and author information

Author information

Xiaopei Wang
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
ORCID iD: 0009-0007-4587-4624
Yongsen Ruan
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
ORCID iD: 0000-0002-5573-4154
Lingjie Zhang
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Xiangnyu Chen
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Zongkun Shi
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Haiyu Wang
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Bingjie Chen
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Miles E Tracy
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
Chung-I Wu
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- For correspondence: ciwu@uchicago.edu
Haijun Wen
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- For correspondence: wenhj5@mail.sysu.edu.cn

Version history

Preprint posted: May 9, 2024
Sent for peer review: June 10, 2024
Reviewed Preprint version 1: August 20, 2024
Reviewed Preprint version 2: November 15, 2024
Reviewed Preprint version 3: March 6, 2025

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.99992. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
