Admixture of evolutionary rates across a butterfly hybrid zone
Abstract
Hybridization is a major evolutionary force that can erode genetic differentiation between species, whereas reproductive isolation maintains such differentiation. In studying a hybrid zone between the swallowtail butterflies Papilio syfanius and Papilio maackii (Lepidoptera: Papilionidae), we made the unexpected discovery that genomic substitution rates are unequal between the parental species. This phenomenon creates a novel process in hybridization, where genomic regions most affected by gene flow evolve at similar rates between species, while genomic regions with strong reproductive isolation evolve at speciesspecific rates. Thus, hybridization mixes evolutionary rates in a way similar to its effect on genetic ancestry. Using coalescent theory, we show that the ratemixing process provides distinct information about levels of gene flow across different parts of genomes, and the degree of ratemixing can be predicted quantitatively from relative sequence divergence ($F}_{ST$) between the hybridizing species at equilibrium. Overall, we demonstrate that reproductive isolation maintains not only genomic differentiation, but also the rate at which differentiation accumulates. Thus, asymmetric rates of evolution provide an additional signature of loci involved in reproductive isolation.
Editor's evaluation
The authors leverage theory, simulations, and empirical population genomics to evaluate what are the consequences of differences in substitution rates in hybridizing species. This is a largely overlooked pheonomenon. This study highlights the issue and demonstrates that two hybridizing species of Papilio have differences in thir substitution rates. The work will be of interest to a large group of evolutionary biologists, especially those studying evolution at the wholegenome level.
https://doi.org/10.7554/eLife.78135.sa0Introduction
DNA substitution, the process whereby singlenucleotide mutations accumulate over time, is a critical process in molecular evolution. Both molecular phylogenetics and coalescent theory rely on observed mutations (Yang and Rannala, 2012; Wakeley, 2016), and so the rate of substitution/mutation is the predominant link from molecular data to information about the timing of past events (Bromham and Penny, 2003). Substitution rates often vary among lineages: generation time, spontaneous mutation rate, and fixation probabilities of new mutations could all contribute to the variation of substitution rates (Ohta, 1993; Lynch, 2010). Recent evidence suggests mutation rates are even variable among human populations (Harris, 2015; DeWitt et al., 2021). As such variation affects how fast the molecular clock ticks, reconstructing gene genealogies among different species sometimes accounts for speciesspecific rates of evolution (Lepage et al., 2007). However, under the standard coalescent framework, empirical studies of within and betweenspecies variation tend to ignore rate variation among populations (Costa and WilkinsonHerbots, 2017; Wolf and Ellegren, 2017; Kautt et al., 2020). The latter is partly based on a popular assumption in coalescent theory that neutral mutation rate is constant for a given locus across the whole genealogy (Hudson, 1990). Hybridization and speciation lie in the gray zone of these extremes, and have their own problem: molecular clocks from different lineages could be mixed by crossspecies gene flow — a gene could evolve under one clock before it flows into another species and switches to evolve according to a different clock (Figure 1). This mixing process is largely outside the scope of existing theories and has received little attention from empirical studies.
However, if mixing between molecular clocks exists and can be measured, its strength could carry information about gene flow, which is important for studying reproductive isolation between incipient species. Genomic regions responsible for reproductive isolation lead to locally elevated genomic divergence (“genomic islands”), often caused by linked genomic regions experiencing less gene flow (“barrier loci”; Nosil et al., 2009; Michel et al., 2010; Renaut et al., 2013; Payseur and Rieseberg, 2016). In studying a hybrid zone between two butterfly species, Papilio syfanius and Papilio maackii, we discovered evidence for unequal genomewide substitution rates between the two species. Using this system, we investigate the interaction between unequal substitution rates and gene flow, and whether this interaction reveals new information on reproductive isolation.
As these butterflies are rare species, occurring in a remote region of China, and are hard to collect, we employed methods based on analysis of wholegenome sequences of a few specimens. We hope that these methods may prove of use in studying other rare or perhaps endangered species where few individuals can be sacrificed. Results will follow two parallel lines: first, we provide evidence that genomic islands are associated with barrier loci. Then we infer the existence of unequal substitution rates. Finally, using a coalescent model, we calculate the relationship between the magnitude of genomic islands and the degree of mixing between substitution rates in linked regions. Throughout the analysis, we assume reverse mutations are rare, so that higher substitution rates always lead to higher numbers of observed substitutions.
Results
Divergent sister species with ongoing hybridization
We sampled 11 males of P. syfanius and P. maackii across a geographic transect (Figure 2A, dashed lines) covering both pure populations (in the sense of being geographically far from the hybrid zone) and hybrid populations. We also include four outgroup species, two of which have chromosomelevel genome assemblies (P. bianor, P. xuthus; Lu et al., 2019; Li et al., 2015), while the other two (P. arcturus, P. dialis) are new to this study. All samples were resequenced to at least 20× coverage across the genome and mapped to the genome assembly of P. bianor. Among sampled local populations, P. syfanius inhabits the highlands of Southwest China (Figure 2A, red region), whereas P. maackii dominates at lower elevations (Figure 2A, blue region) (see Figure 2—figure supplement 3 and Figure 2—source data 4 for the complete distribution). The two lineages form a spatially contiguous hybrid zone at the edge of the Hengduan Mountains (Figure 2B) with individuals exhibiting intermediate wing patterns (Figure 2A: purple dot, corresponding to population WN in Figure 2B). Consistent with previous results (Condamine et al., 2013), assembled whole mitochondrial genomes are not distinct between the two lineages (Figure 2C, Figure 2—source data 3), suggesting either that divergence was recent, or that gene flow has homogenized the mitochondrial genomes. However, the two species are likely adapted to different environments associated with altitude, as several ecological traits are strongly divergent (Kashiwabara, 1991; Figure 2—figure supplement 1). Similarly, between pure populations (KM and XY in Figure 2B), relative divergence (${F}_{ST}$) is also high across the entire nuclear genome (Figure 2D, Figure 2—source data 1). The ${F}_{ST}$ on autosomes averages between 0.2–0.4, and on the sex chromosome (Zchromosome) it reaches 0.78. A highly heterogeneous landscape of ${F}_{ST}$ is accompanied by numerous islands of elevated sequence divergence (${D}_{XY}$) and reduced genetic diversity ($\pi $) scattered across the genome (Figure 2—figure supplement 4, Figure 2—source data 2). Since animal mitochondrial DNA generally has higher mutation rates than the nuclear genome (HaagLiautard et al., 2008), its low divergence between the two species are likely the result of gene flow. Overall, despite ongoing hybridization, genomes of the pure populations of P. syfanius and P. maackii are strongly differentiated.

Figure 2—source data 1
 https://cdn.elifesciences.org/articles/78135/elife78135fig2data1v2.zip

Figure 2—source data 2
 https://cdn.elifesciences.org/articles/78135/elife78135fig2data2v2.zip

Figure 2—source data 3
 https://cdn.elifesciences.org/articles/78135/elife78135fig2data3v2.zip

Figure 2—source data 4
 https://cdn.elifesciences.org/articles/78135/elife78135fig2data4v2.zip

Figure 2—source data 5
 https://cdn.elifesciences.org/articles/78135/elife78135fig2data5v2.zip
Genomic islands are associated with barrier loci
A natural question is whether genomic differentiation is associated with barrier loci and reproductive isolation. In other words, can ${F}_{ST}$ variation be attributed to gene flow variation between sister species? We suspect that barrier loci likely exist, because sequence variation between pure populations suggests that elevated ${F}_{ST}$ is associated with reduced $\pi $ and elevated ${D}_{XY}$ across autosomes (Figure 3A), as expected for hybridizing species (Irwin et al., 2018). The Z chromosome (sex chromosome) has the highest ${D}_{XY}$ and the lowest $\pi $ among all chromosomes (Figure 2—figure supplement 4), another characteristic of hybridizing species with barriers to gene flow (Kronforst et al., 2013).
To test for the presence of barrier loci, we augment the analysis with the sequences of four individuals from the population closest to the center of the hybrid zone (Population WN). We investigate whether differences in ancestry variation provide additional evidence for barrier loci in this hybrid zone. The underlying logic is that barrier loci will simultaneously:
Reduce linked $\pi $ in pure populations (Ravinet et al., 2017);
Elevate linked ${D}_{XY}$ between pure populations (Ravinet et al., 2017);
Elevate pairwise linkage disequilibrium in hybrid zones (Barton, 1983);
Enrich linked ancestry from one lineage in hybrid zones (Sedghifar et al., 2016).
Effects 3 and 4 can be bundled together as “reduced ancestry randomness” around barrier loci because both are expected if intermixing of segments of different ancestries within a genomic interval is prevented. For effects 3 and 4, because of small sample sizes, estimating sitespecific statistics such as pairwise linkage disequilibrium is untenable. However, our highquality chromosomelevel reference genome enabled accurate estimation of local ancestry. As a remedy for small sample sizes, we employ two entropy metrics borrowed from signalprocessing theory to quantify ancestry randomness in local regions along chromosomes (see Materials and methods). By dividing chromosomes into segments, we can extract indirect information about effects 3 and 4 at the expense of reduced genomic resolution. The proposed metrics, ${S}_{b}$ and ${S}_{w}$, correspond to the randomness of ancestry between and within individual diploid genomes from a local population (Figure 3B). For a cohort of ideal chromosomes with uniform recombination and marker density, if ancestry is independent between homologous chromosomes, both ${S}_{b}$ and ${S}_{w}$ increase with reduced local ancestry correlation and more balanced parental contribution (Figure 3C).
For a given autosomal segment, we then calculate $\pi $, ${D}_{XY}$, and ${F}_{ST}$ between pure populations, as well as entropy metrics ${S}_{b}$ and ${S}_{w}$ in hybrid individuals for the same segment. To investigate whether effects 1–4 are all present in our system, the joint range among entropy, $\pi $, and ${D}_{XY}$ is shown in Figure 3D, which suggests that low ancestry randomness (low entropy) is likely associated with reduced $\pi $ within species and elevated ${D}_{XY}$ between species. To further quantify such association, we estimated Pearson’s correlation coefficients ($\rho $) between entropy and the latter statistics (Figure 3E). These associations are strongly significant ($Z$scores > 3). Consequently, reduced ancestry randomness in hybrids (effects 3 & 4) coincides with classical patterns of barrier loci between pure populations (effects 1 and 2). This analysis is not sufficient to exclude all alternative hypotheses. For instance, we cannot entirely exclude the possibility that patterns are driven by lowrecombination regions (${S}_{w},{S}_{b}\downarrow $) that experience linked selection ($\pi \downarrow $) also have elevated mutation rates (${D}_{XY}\uparrow $). Nonetheless, this alternative seems most unlikely, as lowrecombination regions should typically be less rather than more mutable (Lercher and Hurst, 2002; JensenSeaman et al., 2004; Yang et al., 2015; Arbeithuber et al., 2015; Liu et al., 2017). Overall, adding information from hybrid populations strengthens the evidence for barrier loci acting across autosomes.
The Z chromosome was excluded from this analysis as it likely differs in mutation rate or effective population size (Presgraves, 2018), but its ancestry in hybrid individuals either retains purity or resembles very recent hybridization (long blocks of heterozygous ancestry, Figure 3—figure supplement 1). The Z chromosome has low ancestry randomness, and it also has the highest level of divergence (Figure 2D), both of which suggest strong barriers to gene flow between P. syfanius and P. maackii on this chromosome.
Asymmetric site patterns
A hint that substitution rates differ between the two species comes from sitepattern asymmetry (but this asymmetry alone is insufficient to establish the existence of divergent rates). We focus on two kinds of biallelic site patterns. In the first kind, choose three taxa (P_{1},P_{2},O_{1}), with P_{1}=syfanius, P_{2}=maackii, while O_{1} is an outgroup. Assuming no other factors, if substitution rates are equal between P_{1} and P_{2}, then site patterns (P_{1},P_{2},O_{1})=(A,B,B) and (B,A,B) occur with equal frequencies, where “A” and “B” represent distinct alleles. This leads to a $D$ statistic describing the asymmetry between threetaxon site patterns:
where $n$ is the count of a particular site pattern across designated genomic regions. A significantly nonzero $D}_{3$ indicates strongly asymmetric distributions of alleles between P_{1} and P_{2}.
In a second kind of site pattern test, we compare four taxa (P_{1},P_{2},O_{1},O_{2}) and calculate a similar statistic between site patterns (A,B,B,A) vs (B,A,B,A):
Classically, a significantly nonzero $D}_{4$ suggests that gene flow occurs between an outgroup and either P_{1} or P_{2}, thus it is widely used to detect hybridization (the ABBABABA test) (Durand et al., 2011; Hibbins and Hahn, 2022). Here, $D}_{4$ is used more generally as an additional metric of site pattern asymmetry.
We compute both $D}_{3$ and $D}_{4$ on synonymous, nonsynonymous, and intronic sites, with P. syfanius and P. maackii samples taken from pure populations (KM and XY). For each type of site, we progressively exclude regions with local ${F}_{ST}$ below a certain threshold, and report $D$ statistics on the remaining sites in order to show the increasing sitepattern asymmetry in more divergent regions (Figure 4). $D}_{3$ is significantly negative regardless of outgroup or site type for most ${F}_{ST}$ thresholds, and $D}_{4$ is also significantly negative for most outgroup combinations when computed across the entire genome ($Z$scores are shown in Figure 4—figure supplement 1), proving that sitepatterns are strongly asymmetric between P. syfanius and P. maackii. Importantly, the direction of asymmetry is nearly identical across all outgroup comparisons. This asymmetry cannot be attributed to batchspecific variation as all samples were processed and sequenced in a single run, and variants were always called on all individuals of P. syfanius and P. maackii. Sequencing coverage is normal for most annotated genes used in the analysis (Table 1), suggesting that asymmetry is not due to systematic copynumber variation that could affect variant calls. Nonetheless, two independent processes of evolution could explain observed asymmetric site patterns. In hypothesis I (Figure 5A, left), asymmetry is generated via stronger gene flow between P. syfanius and outgroups, leading to biased allelesharing. In hypothesis II, site pattern asymmetry is due to unequal substitution rates between P. syfanius and P. maackii, which is further modified by recurrent mutations in all four outgroups (Figure 5A, right). We test each hypothesis below.
Hybridization with outgroups does not explain sitepattern asymmetry
We first test hypothesis I by phylogenetic reconstruction using SNPs in annotated regions. We construct local gene trees for each 50 kb nonoverlapping window for all samples, including the four outgroup species. As biased gene flow with outgroups should rupture the monophyletic relationship among all P. syfanius +P. maackii individuals, the fraction of windows producing paraphyletic gene trees can be used to assess the potential impact of gene flow. However, almost all gene trees show the expected monophyletic relationship (Figure 5B). This conclusion is independent of the level of support used to filter out genomic windows with ambiguous topologies (see Materials and methods). Consequently, hardly any window shows a phylogenetic signal of hybridization with outgroups. Nonetheless, branch lengths in reconstructed gene trees suggest possible substitution rate divergence between the two species: Among highly supported monophyletic trees (bootstrap support ≥ 95), P. maackii (in populations YA, BJ, XY) is always significantly more distant than P. syfanius from the most recent common ancestor of P. maackii +P. syfanius (Figure 5C, significance levels reported via a Wilcoxon signedrank test for each pair of individuals, see Figure 5—source data 2). Second, the direction of allele sharing in the $D$ statistics is unanimously biased towards P. syfanius. If hypothesis I were true, hybridization with outgroups is required to take place mainly in the highland lineage. There is no evidence to support why the highland lineage should receive more gene flow based on current geographic distributions, as outgroups P. xuthus and P. bianor overlap broadly with both P. maackii and P. syfanius, while outgroup P. dialis is sympatric only with P. maackii (Condamine et al., 2013). Although it is possible that historical and modern geographic distributions differ, and archaic gene flow might have occurred preferentially with P. syfanius, it should still leave some phylogenetic signal of introgression. Overall, we find little evidence for biased hybridization required by hypothesis I.
One might worry that by rejecting hypothesis I, we also throw doubt on widely accepted conclusions of the ABBABABA test for gene flow in other systems (that a significantly nonzero $D}_{4$ implies hybridization with outgroups) (Durand et al., 2011). However, in the next section, we show why $D}_{3$ and $D}_{4$ are fully consistent with hypothesis II, and so this is just a special case where the ABBABABA test produces a false positive for gene flow.
Evidence for divergent substitution rates
In hypothesis II, divergent substitution rates between P. maackii and P. syfanius interact with recurrent mutations in outgroups to produce asymmetric site patterns. To understand its effect on $D$ statistics, consider a simplified model of recurrent mutation (Figure 5D,E), where a site in outgroup $i$ mutates with probability $p}_{i$, producing the same derived allele with probability $c$. When averaged across the genome, $c$ can be treated as a constant, and $p}_{i$ increases with distance to the outgroup. In the absence of gene flow, for threetaxon patterns, recurrent mutations modify $D}_{3$ by a factor of approximately $(12c{p}_{1})$ (Figure 5D, see Materials and methods), and observed $D}_{3$ will thus be positively correlated with $p}_{1$. For fourtaxon patterns, it can be shown that observed $D}_{4$ is always negative due to larger probabilities of recurrent mutation in more distant outgroups (Figure 5E, see Materials and methods). Assuming no significant contribution of incomplete lineage sorting (Maddison and Knowles, 2006), the value of $D}_{4$ becomes more negative with increasing $\mathrm{\Delta}{p}_{i}/\mathrm{\Sigma}{p}_{i}=({p}_{2}{p}_{1})/({p}_{2}+{p}_{1})$.
To test these signatures, we employ estimated node heights of outgroups in the mitochondrial tree (Figure 2C) as proxies for outgroup distance, and hence for the relative probability of recurrent mutation ($\rho}_{i$). In line with expected signatures, we find that observed $D}_{3$ indeed increases with node height (Figure 5F, left), and observed $D}_{4$ decreases with (Δ Node height/∑ Node height; Figure 5F, right). Thus, the directions and magnitudes of both $D$ statistics are congruent with hypothesis II. As hypothesis II naturally predicts unanimously negative $D}_{3$ and $D}_{4$ as well as their relative magnitudes among different outgroup combinations, it is more parsimonious than hypothesis I. Hence, divergent substitution rates likely exist between P. syfanius and P. maackii.
Ratemixing at migrationdrift equilibrium
Having tested for the existence of divergent substitution rates, we now explore how they become mixed by gene flow using a coalescent framework. In particular, we will assess whether the conceptual picture in Figure 1 can be recovered quantitatively in the butterfly system. As gene flow is ongoing between the two lineages, consider two haploid populations of size $N$ exchanging genes at rate $m$ (Figure 6A). This simple isolationwithmigration (IM) model at equilibrium has relative divergence (Notohara, 1990)
To quantify the signature of mixed rates, consider the asymmetry in observed numbers of substitutions (circles in Figure 6A). Take a pair of sequences from two populations. Let $\u27e8{n}_{1}\u27e9$ be the expected number of derived alleles exclusive to the sequence in population 1, and let $\u27e8{n}_{2}\u27e9$ be the same expected number in population 2. Their ratio is defined as observed rate ratio:
Evidently, $r=1$ is the symmetric point where both sequences have the same number of derived alleles. Further, let the actual substitution rate in population 1 be ${\mu}_{1}$, and the actual substitution rate in population 2 be ${\mu}_{2}$. The ratio between the two actual rates are defined as the true rate ratio:
At migrationdrift equilibrium, observed rate ratio ($r$) and observed divergence (${F}_{ST}$) are related by the following formula parameterized singly by $r}_{0$ (Figure 6B, see Materials and methods):
which translates into
These formulae indicate that unequal substitution rates are more mixed in regions with lower genomic divergence. Equation 7 is surprising, because it reveals that the remaining fraction of substitution rate divergence ($(r1)/({r}_{0}1)$) is almost the same as ${F}_{ST}$, which corresponds to the fraction of genetic variance explained by population structure (Wright, 1949). In Figure 6B, the relationship between $r$ and ${F}_{ST}$ is still largely linear when one species evolves three times as fast as its sister species (${r}_{0}=3$), and so Equation 7 might be robust under biologically realistic rates of substitutions between incipient species. Using extensive simulations (Figure 6—figure supplement 1 to Figure 6—figure supplement 3), we show that the full formula (Equation 6) is also robust in a number of equilibrium population structures.
To test whether such predictions are met, we calculate $r$ on synonymous, nonsynonymous, and intronic sites partitioned by their local ${F}_{ST}$ values between pure populations (KM & XY), and recover a similar monotonic relationship across most ${F}_{ST}$ partitions (Figure 6C). For introns, the observed relationship between $r$ and ${F}_{ST}$ is a near perfect fit to Equation 6 (Figure 6D, squares), with an estimated $r}_{0$ = 1.837. For synonymous (Figure 6D, circles) and nonsynonymous sites (Figure 6D, triangles), Equation 6 also provides an excellent fit in regions with low to intermediate ${F}_{ST}$. Estimated $r}_{0$ for synonymous sites is 1.813, close to that of introns, while nonsynonymous sites have a considerably lower $r}_{0$ of 1.633. If introns and synonymous substitutions are approximately neutral, we infer that neutral substitution rates are about 80% greater in the lowland species.
Discussion
Entropy as a useful measure of ancestry randomness
A critical step in our analysis is to associate genomic islands with barrier loci. Conventionally, information on increased association between barrier sites in hybrid populations comes from twolocus linkage disequilibrium (Slatkin, 1975). Empirical studies on hybrid populations frequently use such statistics to strengthen the evidence for barriers to gene flow (Knief et al., 2019; Wang et al., 2022). Alternatively, if phased haplotypes can be sequenced, the length of ancestry tracts (Sedghifar et al., 2016) or the density of ancestry junctions (Janzen et al., 2018; Wang et al., 2022) also carry information on barrier loci. All these methods break down with small numbers of unphased samples, as forced phasing can produce a large number of false ancestry junctions. However, conventional notions such as “ancestry tract length” are not definable for unphased local ancestry. The dilemma forces us to consider more robust statistics that carry information on ancestry association even in unphased data.
The development of both entropy metrics follows three intuitions:
Vectorization: ancestry is a categorical concept, and it should map to a signal containing the contribution of each source population.
Conservation law: the total ancestry from all sources at a particular locus is conserved.
Highly (auto)correlated signals have a concentrated spectrum (low spectral entropy).
Since entropy carries information about the degree of ancestry correlation, it will decrease in regions of low recombination and high genetic relatedness. If genetic relatedness is produced by inbreeding, it should affect the entire genome in a similar way, and betweenindividual entropy (${S}_{b}$) will be similar across different parts of the genome. However, if inbreeding is so severe that ${S}_{b}$ is globally zero, it will not be an informative metric. It is worth noting that withinindividual entropy ${S}_{w}$ shares a conceptual similarity to the wavelet transform of ancestry signals (Pugach et al., 2011; Sanderson et al., 2015). These entropy metrics will be particularly useful for windowbased genomic analysis of ancestry correlation with limited sampling, and they are also compatible with larger sample sizes and more than two parental species. (The formulae of entropy with two parental species are presented in Materials and methods, and mathematical details are discussed in Appendix 1.) Nonetheless, this entropy approach assumes local ancestry can be accurately inferred, which will be challenging for studies with lowcoverage sequencing, nonchromosomal genome assembly, or lacking reference populations.
Potential mechanisms of divergent substitution rates
Interestingly, a higher substitution rate in the lowland lineage P. maackii is congruent with the evolutionary speed hypothesis (Rensch, 1959), where evolution accelerates in warmer climates. Our finding echoes the results of many previous studies (Gillooly et al., 2005; Wright et al., 2006; Lin et al., 2019; Ivan et al., 2022). Without measuring the pergeneration mutation rates in both species, it is unclear what mechanisms cause increased substitution rates, but the lowland lineage typically has an additional autumn brood that is absent in P. syfanius (Takasaki et al., 2007; Figure 2—figure supplement 2). Warmer temperatures in lowland habitats might also increase spontaneous mutation rates in ectothermic insects (Waldvogel and Pfenninger, 2021). Both mechanisms could produce increased substitution rates in the lowland species. It is surprising that estimated $r}_{0$ between synonymous sites and introns agree with each other under such a coarse framework, and estimated $r}_{0$ for nonsynonymous sites is considerably lower. Less asymmetric rates for nonsynonymous substitutions could perhaps be explained by the nearly neutral theory (Ohta, 1993), which argues that many nonsynonymous mutations are mildly deleterious, and selection is more efficient in suppressing them in larger populations and slowing down substitutions. In the field, the lowland species often appears in larger numbers with wellconnected habitats, while the highland species faces a highly heterogeneous landscape of the Hengduan Mountains, which could lead to differences in effective population sizes required by our expectation under the nearly neutral theory.
Are introgression tests robust to substitution rate variation?
An additional result from our study is that divergent substitution rates might produce spuriously nonzero $D}_{4$ statistics when combined with recurrent mutations, which could increase the false positive rate of the ABBABABA test (Durand et al., 2011). This phenomenon has been suspected in humans (Amos, 2020), and is certainly a theoretical possibility (Hibbins and Hahn, 2022), but has not been tested in most empirical studies.
A wide class of introgression tests targeting gene flow between outgroups and a pair of taxa are based on site pattern information. Hahn’s $D}_{3$ computes absolute sequence divergence between groups in a triplet of species (Hahn and Hibbins, 2019), and will be affected by unequal substitution rates in a similar way to our $D}_{3$. Martin’s $f}_{d$ computes the same numerator as our $D}_{4$ (Martin et al., 2015), so it could also produce false positives under similar situations. Related statistics include ${D}_{p}$ (Hamlin et al., 2020), ${D}_{f}$ (Pfeifer and Kapan, 2019). A general guideline for sitepattern based statistics is that the focal pair of taxa are closely related such that substitution rates do not differ, while outgroups should not be too distant to minimize recurrent mutations (Hibbins and Hahn, 2022). However, whether these assumptions are met in empirical studies is worthy of investigation, and our system provides a counterexample even between sister species with ongoing hybridization.
A separate pitfall might occur if introgression tests based on sitepatterns are applied to genomic windows to locate regions introgressed from outgroups. In our case, ${F}_{ST}$ peaks have the most asymmetric substitution rates between P. maackii and P. syfanius, thus they will most likely be associated with falsepositive $D$ statistics. This could lead to the incorrect interpretation that some barrier loci (“speciation genes”) are introgressed from outgroups – a popular hypothesis in adaptive introgression and hybrid speciation, see Edelman and Mallet, 2021.
To this end, we speculate that using appropriate substitution models to infer gene tree topology will perform better in assessing the impact of introgression with outgroups.
The conceptual picture of ratemixing
In the gray zone of incomplete speciation, interspecific hybrids bridge between gene pools of divergent lineages (Mallet, 2005). We here demonstrate a similar role of hybridization in coupling and mixing differing substitution rates. Divergent rates of substitution carry information about outgroups, while divergence based on allele frequency differences does not. Preserving divergent substitution rates is a stronger effect than maintenance of allele frequency differences, because divergence of allele frequencies is a prerequisite for rate preservation. This dependency can be coarsely quantified across the genome by the relationship between observed rate ratio $r$ and relative divergence ${F}_{ST}$ in an equilibrium system of hybridizing populations (Equation 6). At migrationdrift equilibrium, it is not surprising that divergent substitution rates are associated with relative divergence. In Figure 6A, when coalescence occurs rapidly compared to gene flow, most substitutions separating individuals are speciesspecific. However, when gene flow is faster than coalescence, individuals will carry substitutions that occurred in both species. This could have important implications, because preserving lineagespecific substitution rates as measured by $r$ might not require low absolute rates of gene flow. Instead, reducing effective population sizes via recurrent linked selection might achieve a similar result in populations at equilibrium ($N\downarrow \Rightarrow {F}_{ST}\uparrow \Rightarrow r\uparrow $).
The theory in its present form has several limitations. First, mutations follow the infinitesite model (Kimura, 1969), so that reverse mutations, double mutations, and substitution types are not accurately reflected in the estimated ratio between speciesspecific substitution rates. Second, population structure is assumed at equilibrium, whereas real data could carry footprints from nonequilibrium demographic processes (e.g. secondary contact) (Hey, 2010). Third, there could be considerable population structure within each species contributing to elevated ${F}_{ST}$ but not necessarily to ratedivergence. This effect could be seen in simulated steppingstone models (Figure 6—figure supplement 3) and will underestimate the level of rate divergence. Fourth, substitution is stochastic across the genome, and accurately estimating observed rate ratio $r$ relies on averaging substitution numbers across a large number of sites. This poses a problem if sites linked to high ${F}_{ST}$ regions are rare (i.e. few genomic islands). Lastly, as the theory is built upon the neutral coalescent, it is best suited for studying behaviors of neutral sites.
Nonetheless, the monotonic relationship between $r$ and ${F}_{ST}$ (i.e. larger sequence divergence is associated with more asymmetric substitution rates) might be qualitatively robust regardless of the aforementioned caveats. For instance, even for hybrid zones formed by recent secondary contact, reducing the absolute rate of gene flow by barrier loci in principle also keeps divergent rates from mixing, simply because it prevents substitutions accumulated in the allopatric phase from flowing between species.
In conclusion, our study characterizes several genomic consequences of the ratemixing process when molecular clocks tick at different speeds between hybridizing lineages. This process provides new information on reproductive isolation but also leads to pitfalls in interpreting results of popular introgression tests. As this phenomenon is neglected in most studies of hybridization and speciation, its full scope awaits further investigation in both theories and empirical systems.
Materials and methods
Museum specimens and climate data
Request a detailed protocolMuseum specimens with verifiable locality data of all species were gathered from The University Museum of The University of Tokyo (Harada et al., 2012; Yago et al., 2021), Global Biodiversity Information Facility (The Global Biodiversity Information Facility, 2021b; The Global Biodiversity Information Facility, 2021a), and individual collectors (Figure 2—source data 5). Records of P. maackii from Japan, Korea and NE China were excluded from the analysis, so that most P. maackii individuals correspond to ssp. han, the subspecies that hybridizes with P. syfanius. Spatial principal component analysis was performed on elevation, maximum temperature of warmest month, minimum temperature of coldest month, and annual precipitation, all with 30s resolution from WorldClim2 (Fick and Hijmans, 2017). The first two PCAs, combined with tree cover (Hansen et al., 2013), were used in MaxEnt3.4.1 to produce species distribution models that use known localities to predict occurrence probabilities across the entire landscape (Phillips et al., 2017). Outputs were trimmed near known boundaries of each species. See Figure 2—figure supplement 3 for the final result.
Sampling, resequencing, and mitochondrial phylogeny
Request a detailed protocolEleven males of P. syfanius and P. maackii, with one male of P. arcturus and one male of P. dialis were collected in the field between July and August in 2018 (Figure 2—source data 5), and were stored in RNAlater at –20 °C prior to DNA extraction. E.Z.N.A Tissue DNA kit was used to extract genomic DNA, and KAPA DNA HyperPlus 1/4 was used for library preparation, with an insert size of 350 bp and 2 PCR cycles. The library is sequenced on a Illumina NovaSeq machine with pairedend reads of 150 bp. Adaptors were trimmed using Cutadapt1.8.1, and subsequently the reads were mapped to the reference genome of P. bianor with BWA0.7.15, then deduplicated and sorted via PicardTools2.9.0. The average coverage among 13 individuals in nonrepetitive regions varies between 20× and 30×. Variants were called twice using BCFtools1.9 – the first including all samples, used in analyses involving outgroups, and the second excluding P. arcturus and P. dialis, used in all other analyses. The following thresholds were used to filter variants: $10N<$ DP $<50N$, where $N$ is the sample size; QUAL > 30; MQ > 40; MQ0F < 0.2. As a comparison, we also called variants with GATK4 and followed its best practices, and 93% of postfiltered SNPs called by GATK4 overlapped with those called by BCFtools. We used SNPs called by BCFtools throughout the analysis. Mitochondrial genomes were assembled from trimmed reads with NOVOPlasty4.3.1 (Dierckxsens et al., 2017), using a published mitochondrial ND5 gene sequence of P. maackii as a bait (NCBI accession number: AB239823.1). We also used the following published mitochondrial genomes (NCBI accession numbers): KR822739.1 (Papilio glaucus), NC_029244.1 (Papilio xuthus), JN019809.1 (Papilio bianor). The neighborjoining mitochondrial phylogeny was built with Geneious Prime2021.2.2 (genetic distance model: TamuraNei), and we used 10^{4} replicates for bootstrapping. The reference genome of P. xuthus was previously aligned to the genome of P. bianor and we used this alignment directly in all analysis (Lu et al., 2019).
Calculating sitepattern asymmetry
Request a detailed protocolGiven a species tree {{P_{1},P_{2}},O}, where P_{1} and P_{2} are sister species and O is an outgroup, if mutation rates are equal between {P_{1},P_{2}}, and no gene flow with O, then on average the number of derived alleles in P_{1} should equal the number of derived alleles in P_{2}. Let $\mathcal{S}$ be a collection of sites, $f}_{s$ be the frequency of a particular site pattern at site $s\in \mathcal{S}$. “ABB” be the pattern where only P_{2} and O share the same allele, and “BAB” be the pattern where only P_{1} and O share the same allele, then the threespecies $D}_{3$ statistic is calculated as
where $\mathcal{S}$ is always limited to sites without polymorphism in the outgroup O. This statistic is in principle capturing the same source of asymmetry as the statistic proposed by Hahn and Hibbins, 2019, although their version uses divergence to the outgroup instead of frequencies of sitecounts. Similarly, the fourspecies $D}_{4$ statistic, which considers species tree {{{P_{1},P_{2}},O_{1}},O_{2}} and site patterns ABBA versus BABA (Durand et al., 2011) is calculated as
where $\mathcal{S}$ is always limited to sites without polymorphism in the second outgroup O_{2}. The significance of both tests was computed using blockjackknife over 1 Mb blocks across the genome. Additionally, we estimated rate ratio as follows. First we restricted to sites where all outgroups are fixed for the same ancestral allele to dampen the influence of recurrent mutation. Then, for each site, sample one allele at random from each focal lineage. Calculate the probability of observing a derived allele in P_{1} but not in P_{2}, and the probability of observing a derived allele in P_{2} but not in P_{1}. The rate ratio is computed as the ratio between the two probabilities. Explicitly, let $I(\cdot )$ be the identity function, and $f}_{s$ be the frequency of the derived allele, then:
Its standard error was estimated using 1 Mb blockjackknifing. We excluded P. xuthus from the outgroups to increase the number of informative sites when using this formula.
D_{3} and D_{4} under unequal substitution rates and recurrent mutations
Request a detailed protocolIn this section, we calculate observed $D}_{3$ and $D}_{4$ assuming that incomplete lineage sorting contributes insignificantly to both statistics. If incomplete lineage sorting is present, it will not create new bias (numerators are on average unchanged), but will likely dampen existing bias (inflating denominators).
As substitutions are independent along each lineage, we can mute recurrent mutations in outgroups and generate them afterwards. For three taxa with gene tree {{P_{1},P_{2}},O_{1}}, before recurrent mutations, there are $n}_{1$ sites with pattern (B,A,A), and $n}_{2$ sites with pattern (A,B,A). If substitution rate is higher in P_{2}, we have $n}_{2}>{n}_{1$, so the true value of $D}_{3$, written as ${\widehat{D}}_{3}$, is always negative:
Next, recurrent mutations in O_{1} occur at each site with an average probability $p}_{1$, and with an average probability $c$, ancestral alleles from affected sites in O_{1} are converted to the same derived allele in {P_{1},P_{2}}. $c$ will be independent of $n}_{1$ and $n}_{2$, as long as substitutions between {P_{1},P_{2}} are only different in rates, but not mutation types. Hence, two possible mutation paths exist:
The expected site counts, after recurrent mutations, become
Using the new expected site counts in $D}_{3$ statistics produce the following value:
Since ${\widehat{D}}_{3}$ is negative, it grows approximately linearly near small values of $p}_{1$. (The full equation is still monotonic in $p}_{1$.)
Similarly, for fourtaxon statistics, before recurrent mutation, there are two types of sites: (A,B,B,B) $n}_{1$; (B,A,B,B) $n}_{2$. Suppose the average probability of recurrent mutation is $p}_{1$ in O_{1}, and $p}_{2$ in O_{2}, and the conversion probability of each recurrent mutation into derived alleles of {P_{1}, P_{2}} is $c$ for both outgroups. Using the same procedure, one can show that
Since recurrent mutations occur more frequently in distant outgroups, $p}_{2}>{p}_{1$. Because ${\widehat{D}}_{3}$ is negative, we have ${D}_{4}<0$.
Local gene trees
Request a detailed protocolLocal gene trees were estimated using iqtree2.0 (Minh et al., 2020) on 50 kb nonoverlapping genomic windows with options m MFP B 5000. Only SNPs from annotated regions (synonymous sites +nonsynonymous sites +introns) across all individuals were used. For diploid individuals, heterozygous sites were assigned IUPAC ambiguity codes and iqtree assigned equal likelihood for each underlying character, thus information from heterozygous sites is largely retained. This is crucial as we are interested in the branch length of inferred trees. Option m MFP implements iqtree’s ModelFinder that tests the FreeRate model to accommodate maximum flexibility with ratevariation among sites (Kalyaanamoorthy et al., 2017). We also used UltraFast Bootstrap to calculate the support for different types of splits in each window (the B 5000 option; Hoang et al., 2018). In each window, we extracted the support for monophyly among P. maackii +P.syfanius directly from the output of UltraFast Bootstrap, and we define the support for paraphyly among P. maackii +P.syfanius as (100  the support for monophyly). For each level of support, we filtered out genomic windows where both the support for monophyly and the support for paraphyly drop below the given level. The remaining windows were considered informative.
Ratemixing under the equilibrium IM model
Request a detailed protocolWe construct a continuoustime coalescent model as follows. Both populations have $N$ haploid individuals, gene flow rate is $m$, and coalescent rate is ${N}^{1}$ in each population. In the equilibrium system, as we track both haploid individuals backward in time, there are six distinct states: $(12),\phantom{\rule{thinmathspace}{0ex}}(21),\phantom{\rule{thinmathspace}{0ex}}(1,2),\phantom{\rule{thinmathspace}{0ex}}(1,2),\phantom{\rule{thinmathspace}{0ex}}(0),\phantom{\rule{thinmathspace}{0ex}}(0)$, where 1 and 2 represent two individuals prior to coalescent, 0 is the state of coalescent, and $(\cdot \cdot )$ shows the location of each lineage. Its transition density $\mathbf{B}(t)$ satisfies ${\partial}_{t}\mathbf{B}=\mathbf{AB}$, where $\mathbf{A}$ is given as
Let ${S}_{ij}(T)$ be the mean sojourn time of an uncoalesced individual inside population $i$ during $0\le t\le T$, conditioning on the individual being taken from population $j$ at $t=0$. Assuming the infinitesite mutation model, let ${\mu}_{i}$ be the substitution rate in population $i$, observed rate ratio $r$ is thus
where (due to symmetry)
Let ${r}_{0}={\mu}_{2}/{\mu}_{1}$, and since ${F}_{ST}={(1+4Nm)}^{1}$, we have
Local ancestry estimation
Request a detailed protocolSoftware ELAI (Guan, 2014) with a doublelayer HMM model was used to estimate diploid local ancestries across chromosomes. An example command is as follows: elailin g genotype.maackii.txt p 10 g genotype.syfanius.txt p 11 g genotype.admixed.txt p 1 pos position.txt s 30 C 2 c 10 mg 5000 excludenopos.
Note that mg specifies the resolution of ancestry blocks, thus increasing its value will increase the stochastic error of incorrectly inferring very short blocks of ancestry. To control for uncertainty, we estimated repeatedly for 50 times. All replicates were used simultaneously in finding the correlation coefficients between entropy and other variables. Results from an example run is in Figure 3—figure supplement 1.
Ancestry and entropy
Here we introduce concisely the data transformation framework for calculating the entropy of local ancestry. The mathematical detail of this approach is presented in the appendix.
Ancestry representation
Request a detailed protocolThe space of all ancestry signals is highdimensional, and directly calculating the entropy in this space is not feasible with just a few individuals. So we propose to measure only the pairwise correlation of ancestries among sites, which captures only the secondorder randomness, but is sufficient for practical purposes. Consider a hybrid individual with two parental populations indexed by $k=1,2$. Assuming a continuous genome, let ${p}_{k}(l)=0,\frac{1}{2},1$ be the diploid ancestry of locus $l$ within genomic interval $[0,L]$. By definition, we have ${p}_{1}(l)+{p}_{2}(l)=1$, that is the total ancestry is conserved everywhere in the genome. The biancestry signal at locus $l$ is defined as the following complex variable
where $i=\sqrt{1}$ is the imaginary number. An advantage of using a complex representation for the biancestry signal is that we can model different ancestries along the genome as different phases of a complex unit phasor (${e}^{i\theta}$), such that the power of the signal at any given locus is simply the sum of both ancestries, which is conserved (${z(l)}^{2}=1$). It ensures that we do not bias the analysis to any particular region or any particular individual when decomposing the signal into its spectral components.
Withinindividual spectral entropy (${S}_{w}$)
Request a detailed protocolTo characterize the average autocorrelation along an ancestry signal at a given scale $l$, define the following scaledependent autocorrelation function
where $z(l)$ is understood as a periodic function such that $z(\xi +l)=z(\xi +lL)$ whenever the position goes outside of $[0,L]$. The WienerKhinchin theorem guarantees that $z(l)$’s power spectrum $\zeta (f)$, which is discrete, and the autocorrelation function $A(l)$ form a Fouriertransform pair. Due to the uncertainty principle of Fourier transform, $A(l)$ that vanishes quickly at short distances (smallscale autocorrelation) will produce a wide $\zeta (f)$, and vice versa. So the entropy ${S}_{w}$ of $\zeta (f)$, which measures the spread of the total ancestry into each spectral component, also measures the scale of autocorrelation. In practice, $\zeta (f)$ is the square modulus of the Fourier series coefficients of $z(l)$, and we fold the spectrum around $f=0$ before calculating the withinindividual entropy ${S}_{w}$. The formula used in the manuscript is
where ${Z}_{n}$ are the Fourier coefficients from the expansion $z(l)={\sum}_{n=\mathrm{\infty}}^{+\mathrm{\infty}}{Z}_{n}{e}^{i2\pi nl/L}$. To speed up the Fourier expansion, we could densely pack equallyspaced markers that sample a continuous ancestry signal into a discrete signal, which then undergoes Fast Fourier Transform (FFT). The spectrum of FFT (discrete and finite) approximates the continuoustime Fourier spectrum (discrete and infinite), and entropy also converges as marker density increases.
Betweenindividual spectral entropy (${S}_{b}$)
Request a detailed protocolAs ancestry configuration is far from random around barrier loci, it will also influence the correlation of ancestry between different individuals at the same locus. For a genomic region experiencing strong barrier effects, two individuals could either be very similar in ancestry, or very different. This effect can be quantified by first calculating the crosscorrelation ${C}_{j,{j}^{\prime}}(l)={z}_{j}(l)\overline{{z}_{{j}^{\prime}}(l)}$ at position $l$ between individuals $j$ and ${j}^{\prime}$, and then averaging across a genome interval: ${c}_{j,{j}^{\prime}}=\frac{1}{L}{\int}_{0}^{L}{C}_{j,{j}^{\prime}}(l)dl$. The $J\times J$ dimensional matrix $\mathbf{C}$ with entries ${c}_{j,{j}^{\prime}}$ describes the pairwise crosscorrelation within the cohort of $J$ individuals. We also have ${c}_{j,j}\equiv 1$ as each individual is perfectly correlated with itself. The matrix $\mathbf{C}$ is Hermitian, so it has a real spectral decomposition with eigenvalues ${\lambda}_{j}$ that satisfy ${\sum}_{j}{\lambda}_{j}/J=1$. This process is very similar to performing a principal component analysis on the entire cohort of individuals, and ${\lambda}_{j}/J$ describes the fraction of the total ancestry projected onto principal component $j$. If many loci covary in ancestry, the spectrum $\{{\lambda}_{j}\}$ will be concentrated near the first few components. Similarly, we use entropy to measure the spread of the spectrum, and hence the betweenindividual spectral entropy is defined as
Appendix 1
Representation of ancestry on a hybrid chromosome
The ancestry of a hybrid depends on the pure reference populations. The following assumptions are used throughout this section:
There are a finite number of pure reference populations indexed by $k\in \{1,2,\mathrm{\cdots},K\}$. In most cases, we are only interested in $K=2$, for instance, a hybrid zone between two lineages.
The chromosome has so many sites so that it can be treated as a contiguous rod with length $L$. $l\in [0,L]$ is the index of positions on a chromosome.
The species’ ploidy is $n}_{p$ (completely phased data ⇔ ${n}_{p}=1$). When not specified, we assume the data is unphased, so that the ancestry on a particular chromosome always refers to the ancestry on a collection of $n}_{p$ homologous chromosomes.
Ancestry can be inferred. In practice, regions with low density of informative SNPs will make inference difficult.
As "ancestry" is a categorical variable (i.e., there is no intrinsic order among the reference populations), to quantify the correlation of ancestry along a chromosome, we need to map the contribution of each ancestry category to some numeric value that can be used to calculate correlation.
The first choice is to use the probability of each category. For unphased data, let ${p}_{k}(l)$ be the fraction of the $n}_{p$ploid chromosome at position $l$ coming from reference population $k$. The vectorvalued function $\mathbf{B}(l)={({p}_{1}(l),{p}_{2}(l),\mathrm{\cdots},{p}_{K}(l))}^{\top}$ completely describes the ancestry on a single chromosome. By definition, ${\sum}_{k}{p}_{k}(l)\equiv 1$. This is the conservation of total ancestry at any position. However, by this representation, the correlation ${\mathbf{B}}^{\top}({l}_{1})\mathbf{B}({l}_{2})$, which is the product of ancestries at different positions, will have units of the square of probabilities. If a locus is to be compared with itself, then the correlation will be
Ideally, we want the correlation of ancestry of a locus with itself to be the same regardless of its ancestry configuration. Equation (24) does not meet this criteria as it depends on $\mathbf{B}$. Here we borrow some concepts from signalprocessing theory. The conservation of total ancestry means that ancestry is more analogous to the "energy" of a signal, which is in units of [squaredsignal], rather than the signal itself. Therefore, we could instead define a second type of ancestry representation by taking the square roots of probabilities.
Definition 1 (Spherical representation of ancestry)
If the probability representation of ancestry is $\mathbf{B}\mathit{}\mathrm{(}l\mathrm{)}$, then the spherical representation $\mathbf{y}\mathit{}\mathrm{(}l\mathrm{)}$ takes the squareroot of each element of $\mathbf{B}\mathit{}\mathrm{(}l\mathrm{)}$:
Following this representation, the autocorrelation between two spherical ancestries becomes a natural measure of the similarity between two probability distributions:
Definition 2 (Correlation between two spherical ancestries)
The autocorrelation function between $\mathbf{y}\mathit{}\mathrm{(}{l}_{\mathrm{1}}\mathrm{)}$ and $\mathbf{y}\mathit{}\mathrm{(}{l}_{\mathrm{2}}\mathrm{)}$ is the following quantity known as the Bhattacharyya coefficient, also known as the fidelity measure in information theory:
Further, selfcorrelation is always 1:
The conservation of selfcorrelation will be important when we decompose ancestry into its spectral components, as it will not bias our analysis to any particular region of the chromosome. There is also a geometric meaning associated with this representation. Since $\mathbf{y}(l)$ has a ${L}^{2}$norm of 1, each ancestry configuration corresponds to a point on the unit sphere in ${\mathbb{R}}^{K}$, and different ancestries are represented by the orientation of this vector in ${\mathbb{R}}^{K}$.
In many studies we are dealing only with two reference populations, such as hybrid zones between a pair of divergent lineages. This situation allows a more compact representation of ancestry using complex numbers:
Definition 3 (Complex representation of ancestry)
If $K\mathrm{=}\mathrm{2}$, define the following complex variable $z\mathit{}\mathrm{(}l\mathrm{)}$ as the representation of ancestries:
where $i\mathrm{=}\sqrt{\mathrm{}\mathrm{1}}$ is the imaginary number.
This seemingly artificial definition is not the first time that a complex number is used to model a physical phenomenon. In quantumn mechanics, the quantumn wave function of a particle is represented by a complex wave with the probability of occurrence measured by the squaremodulus of the wave. Here, we are also using the squaremodulus of $z(l)$ to represent the total ancestry of a given locus.
A complexvalued signal $z(l)$, like any real signal, has the definition of autocorrelation:
Definition 4 (Correlation between two complex ancestries)
The correlation between two complex ancestries $z\mathit{}\mathrm{(}{l}_{\mathrm{1}}\mathrm{)}$ and $z\mathit{}\mathrm{(}{l}_{\mathrm{2}}\mathrm{)}$ is the following product:
By this definition, the real part $\text{Re}[A({l}_{1},{l}_{2})]$ of the autocorrelation is just the Bhattacharyya coefficient between two ancestry configurations on l_{1} and l_{2}, which is a measure of similarity. The absolute value of the imaginary part of the autocorrelation $\text{Im}[A({l}_{1},{l}_{2})]$ measures the volume of the parallelogram spanned by vectors $\mathbf{y}({l}_{1})$ and $\mathbf{y}({l}_{2})$, thus it is a measure of dissimilarity. Further, the selfcorrelation $A(l,l)$ is also conserved for all $l$, as $\text{Im}[A(l,l)]\equiv 0$ and $\text{Re}[A(l,l)]\equiv 1$.
The complex representation possesses some useful properties not shared with the spherical representation, but it also limits our analysis to cases with only two reference populations. This will not pose a problem for the study of hybrid zones between a pair of parapatric lineages.
Mean autocorrelation vs. hybrid index
The hybrid index is defined as the average ancestry in a hybrid from a given reference population over a set of loci. It is usually calculated for each individual or for each chromosome. In our notation:
Definition 5 (Hybrid index)
The hybrid index over a given genomic interval $\mathrm{[}\mathrm{0}\mathrm{,}L\mathrm{]}$ for reference population $k$ is
In contrast, the average of the spherical or the complex representation provides information about autocorrelation within the chromosome instead of the mean ancestry.
Theorem 1 (Mean autocorrelation)
With the spherical or the complex representation, the squared ${L}^{\mathrm{2}}$norm (or the squared modulus, when a complex representation is available) of the average signal represents the mean autocorrelation of the ancestry on the genomic interval $\mathrm{[}\mathrm{0}\mathrm{,}L\mathrm{]}$.
Proof. For a general spherical representation $\mathbf{y}(l)$, we have
For a complex representation $z(l)$, notice that $A({l}_{1},{l}_{2})=\overline{A({l}_{2},{l}_{1})}$, so
This guarantees that the mean autocorrelation when using a complex representation is the same as the mean autocorrelation when using a spherical representation with $K=2$. Additionally,
This completes the proof.
The quantity $a$ is a measure of the average similarity of the ancestry configurations. It does not consider the genomic position of different ancestry configurations, so it does not contain information about whether similar ancestry configurations are clustered together.
The entropy of ancestry
To characterize the scale of correlation, the distance between loci is important because correlation often drops while distance between loci increases. The information about the spatial scale of correlation is retained when the full spectrum of correlation is considered within a single individual.
A second source of correlation arises when we consider the relationship between individuals. To convey the main idea, let’s compare between a set of ancestry signals from an inversion and a set of ancestry signals from a regular region subject to normal recombination. For the inversion, since recombination between chromosomes of different ancestry is often completely suppressed, the ancestry signal along the inversion will be close to constant. When two haploid individuals are compared at the inverted region, they are either different everywhere in terms of ancestry, or completely the same. Comparing between diploid individuals is similar, although the difference is more finegrained due to the presence of the heterozygotes. However, in a collinear region subject to a regular rate of recombination, any ancestry signal will switch stochastically between states. In the latter case, the similarity between two ancestry signals will be similar across multiple pairwise comparisons. It will be helpful to think in the principal component (PC) space spanned by all individuals’ ancestry signals. Ancestry signals from an inversion will form several tight clusters in the PC space, while those from a regular region will form a single cloud with a larger dispersion.
Both sources of correlation, and hence both types of randomness, can be measured using entropy in information theory.
Definition 6 (Shannon entropy)
The Shannon entropy for a discrete probability distribution $\mathrm{\{}{p}_{j}\mathrm{\}}$, $j\mathrm{\in}\mathbb{Z}$, is defined as
For a continuous distribution with probability density function $p\mathit{}\mathrm{(}x\mathrm{)}$, $x\mathrm{\in}\mathbb{R}$, the Shannon differential entropy is defined as
For any nonnegative series $\mathrm{\{}{p}_{j}\mathrm{\}}$ or nonnegative function $p\mathit{}\mathrm{(}x\mathrm{)}$, suppose they converge upon summation/integration, we use the same notations $S\mathit{}\mathrm{(}\mathrm{\{}{p}_{j}\mathrm{\}}\mathrm{)}$ and $S\mathit{}\mathrm{(}p\mathit{}\mathrm{(}x\mathrm{)}\mathrm{)}$ to denote the entropy after they are appropriately normalized to probability distributions.
Shannon entropy (or any other entropy measure) is a useful measure of the spread of a distribution over its entire configuration space. When the distribution is concentrated (low randomness, high certainty), $S$ will be low. $S=0$ if and only if ${p}_{j}=1$ for some $j$.
Entropy associated with the correlation within a single individual
Definition 7 (Fourier spectrum of the spherical representation)
For the spherical representation of ancestry on genomic interval $[0,L]$
expand each component into its Fourier series:
The folded Fourier spectrum is defined as
Definition 8 (Fourier spectrum of the complex representation)
For the complex representation of ancestry on genomic interval$\mathrm{[}\mathrm{0}\mathrm{,}L\mathrm{]}$
expand $z\mathit{}\mathrm{(}l\mathrm{)}$ into its Fourier series:
The folded Fourier spectrum is defined as
The following theorem guarantees that the folded spectrum ${\zeta}_{n}$ is the same for biancestry signals using either representation.
Theorem 2
When $K\mathrm{=}\mathrm{2}$, the folded Fourier spectrum ${\zeta}_{n}$ is the same for $\mathbf{y}\mathit{}\mathrm{(}l\mathrm{)}\mathrm{=}{\mathrm{(}\sqrt{{p}_{\mathrm{1}}\mathit{}\mathrm{(}l\mathrm{)}}\mathrm{,}\sqrt{{p}_{\mathrm{2}}\mathit{}\mathrm{(}l\mathrm{)}}\mathrm{)}}^{\mathrm{\top}}$ and $z\mathit{}\mathrm{(}l\mathrm{)}\mathrm{=}\sqrt{{p}_{\mathrm{1}}\mathit{}\mathrm{(}l\mathrm{)}}\mathrm{+}i\mathit{}\sqrt{{p}_{\mathrm{2}}\mathit{}\mathrm{(}l\mathrm{)}}$.
Proof. Expand each component into its Fourier series:
The folded spectrum using the spherical representation is
Using the linearity of Fourier expansion, we have
Thus,
Since the Fourier coefficients of a real function at opposite frequencies are complex conjugates to each other, we have
Finally,
Since $\hat{p}}_{2,n}\overline{{\hat{p}}_{1,n}}+\overline{{\hat{p}}_{2,n}}{\hat{p}}_{1,n$ is real, the last term of the previous equation becomes zero. For the zeroth component ${\zeta}_{0}$, as both ${\widehat{p}}_{1,0}$ and ${\widehat{p}}_{2,0}$ are real, there is no difference between the two representations. This completes the proof.
A nice property of the Fourier spectrum ${\zeta}_{n}$ is that it can be interpreted as a probability distribution of ancestry among different frequency components. By Parseval’s theorem, it is easy to verify that ${\sum}_{n}{\zeta}_{n}=\frac{1}{L}{\int}_{0}^{L}({\sum}_{k}{p}_{k}(l))dl=1$. From the WienerKhinchin theorem, the unfolded spectrum forms a Fourier transform pair with the autocorrelation function of the original signal. The significance of the WienerKhinchin theorem is that information about the autocorrelation of the original signal can now be extracted from the Fourier spectrum ${\zeta}_{n}$.
Definition 9 (Withinindividual entropy)
Let ${S}_{w}\mathrm{=}\mathrm{}{\mathrm{\sum}}_{n}{\zeta}_{n}\mathit{}\mathrm{ln}\mathit{}{\zeta}_{n}$. ${S}_{w}$ is the Shannon entropy of the folded Fourier spectrum ${\zeta}_{n}$. As ${S}_{w}$ captures the correlation structure within each individual, we also call it the withinindividual entropy.
Formally, we have the following uncertainty principle relating the withinindividual entropy to the scale of autocorrelation.
Theorem 3 (Entropic uncertainty)
Let $A\mathit{}\mathrm{(}l\mathrm{)}\mathrm{=}\frac{\mathrm{1}}{L}\mathit{}{\mathrm{\int}}_{\mathrm{0}}^{L}{\mathbf{y}}^{\mathrm{\top}}\mathit{}\mathrm{(}x\mathrm{)}\mathit{}\mathbf{y}\mathit{}\mathrm{(}x\mathrm{+}l\mathrm{)}\mathit{}\mathrm{d}x$ be the average autocorrelation at scale $l$ ($\mathrm{0}\mathrm{\le}l\mathrm{\le}L$) for the spherical ancestry. Here, $\mathbf{y}$ is understood as a periodic function of period $L$. The following inequality holds:
where $Q\mathrm{=}{\mathrm{\int}}_{\mathrm{0}}^{L}{A}^{\mathrm{2}}\mathit{}\mathrm{(}l\mathrm{)}\mathit{}\mathrm{d}l$ is a normalization factor. Note that the righthandside of the inequality is invariant under linearly rescaling $A\mathit{}\mathrm{(}l\mathrm{)}$ to a different interval $\mathrm{[}\mathrm{0}\mathrm{,}{L}^{\mathrm{\prime}}\mathrm{]}$. Thus, we can also write the inequality compactly, supposing $A\mathit{}\mathrm{(}l\mathrm{)}$ has been rescaled to $\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{2}\mathrm{]}$, as
where $a$ is the average autocorrelation defined in Equation 31.
Proof. (i) In this part, we establish the WienerKhinchin relation that $A(l)$ expands into a Fourier series with coefficients ${\eta}_{n}={\sum}_{k=1}^{K}{{\widehat{p}}_{k,n}}^{2}$:
The derivation is as follows:
(ii) Let ${\int}_{0}^{L}{A}^{2}(l)dl=Q$. It is obvious that $\{{\eta}_{n}/\sqrt{Q}\}$ are Fourier series coefficients of $A(l)/\sqrt{Q}$, they obey the HausdorffYoung inequality
where $q\in (1,2)$ and $1/{q}^{\prime}+1/q=1$. Let $\varphi (q)={\left({\int}_{0}^{1}{\leftA(xL)/\sqrt{Q/L}\right}^{q}\phantom{\rule{negativethinmathspace}{0ex}}\mathrm{d}x\right)}^{\frac{1}{q}}{\left(\sum _{n=\mathrm{\infty}}^{+\mathrm{\infty}}{\left{\eta}_{n}/\sqrt{Q}\right}^{{q}^{\mathrm{\prime}}}\right)}^{\frac{1}{{q}^{\mathrm{\prime}}}}$. Since Fourier series preserve the 2norm, $\varphi (2)=0$, and since $\varphi (q)\ge 0$ for $q\in (1,2)$, we have ${\varphi}^{\prime}(2)\le 0$. This translates into
which yields
Note that the quantity ${\int}_{0}^{L}\frac{{A}^{2}(l)}{Q}\mathrm{ln}\frac{{A}^{2}(l)}{Q}\mathrm{d}l+\mathrm{ln}L$ is actually independent of $L$ upon rescaling.
(iii) Next, we show that $S(\{{\eta}_{n}\})\ge S(\{{\eta}_{n}^{2}\})$. Since ${\sum}_{n}{\eta}_{n}=1$ and ${\eta}_{n}$ is nonnegative, we can always reorder them into a descending series ${\widehat{\eta}}_{n}$ ($n\ge 0$) with the same entropy as $S(\{{\eta}_{n}\})$ (because entropy is permutationinvariant). The result follows as long as $S(\{{\widehat{\eta}}_{n}\})\ge S(\{{\widehat{\eta}}_{n}^{2}\})$. Let ${\beta}_{n}={\widehat{\eta}}_{n+1}/{\widehat{\eta}}_{n}\le 1$. The two series, after normalization, can be written as
Consequently,
This implies that ${\widehat{\eta}}_{0}\le {\widehat{\eta}}_{0}^{2}/Q$. Suppose there exists $j\ge 0$ such that
Then there exists ${j}^{\prime}\le j$ such that $\hat{\eta}}_{0}{\beta}_{0}{\beta}_{1}\cdots {\beta}_{{j}^{\mathrm{\prime}}}>\frac{{\hat{\eta}}_{0}^{2}}{Q}{\beta}_{0}^{2}{\beta}_{1}^{2}\cdots {\beta}_{{j}^{\mathrm{\prime}}}^{2$. So for any $j\ge {j}^{\prime}$, we have $\hat{\eta}}_{0}{\beta}_{0}{\beta}_{1}\cdots {\beta}_{j}>\frac{{\hat{\eta}}_{0}^{2}}{Q}{\beta}_{0}^{2}{\beta}_{1}^{2}\cdots {\beta}_{j}^{2$. The difference
will always be positive and monotonically increases for any $j\ge {j}^{\prime}$. This is not consistent with the fact that ${lim}_{j\to \mathrm{\infty}}{\mathrm{\Delta}}_{j}=0$. Thus, we conclude that for any $j$, the partial sum follows the inequality
This is to say that infinite sequence $\{{\widehat{\eta}}_{0}^{2}/Q\}$ majorizes $\{{\widehat{\eta}}_{n}\}$. By Theorem 2.2 of Li and Busch, 2013, $S(\{{\widehat{\eta}}_{n}\})\ge S(\{{\widehat{\eta}}_{n}^{2}\})$, and so $S(\{{\eta}_{n}\})\ge S(\{{\eta}_{n}^{2}\})$
(iv) Finally, since ${\zeta}_{n}={\eta}_{n}+{\eta}_{n}=2{\eta}_{n}$ for $n\ge 1$, we have
which is equivalent to
Combining previous four steps yields the result of the theorem.
Entropy associated with the correlation between individuals
Definition 10 (Entropy of a linear operator)
Let $\mathcal{L}$ be a compact selfadjoint linear operator on a Hilbert space $\mathscr{H}$. If $\mathcal{L}$ has a countable set of eigenvalues $\mathrm{\{}{\nu}_{i}\mathrm{\}}$, and is positive semidefinite, then define the entropy of $\mathcal{L}$ as
where $\mathrm{\text{Tr}}\mathit{}\mathcal{L}\mathrm{=}{\mathrm{\sum}}_{i}{\nu}_{i}$ is the trace of the linear operator $\mathcal{L}$
Definition 11 (Mercer’s spectrum)
Let ${A}_{j}\mathit{}\mathrm{(}{l}_{\mathrm{1}}\mathrm{,}{l}_{\mathrm{2}}\mathrm{)}$ be the autocorrelation function for individual $j\mathit{}\mathrm{(}\mathrm{1}\mathrm{\le}j\mathrm{\le}J\mathrm{)}$, and define the average autocorrelation function as
Since ${A}_{j}$ is Hermitian, the average $\mathrm{\u27e8}A\mathrm{\u27e9}$ is also Hermitian, thus the integral operator ${I}_{\mathrm{\u27e8}A\mathrm{\u27e9}}$ defined by ${I}_{\mathrm{\u27e8}A\mathrm{\u27e9}}\mathit{}\varphi \mathit{}\mathrm{(}l\mathrm{)}\mathrm{=}{\mathrm{\int}}_{\mathrm{0}}^{L}\mathrm{\u27e8}A\mathrm{\u27e9}\mathit{}\mathrm{(}l\mathrm{,}s\mathrm{)}\mathit{}\varphi \mathit{}\mathrm{(}s\mathrm{)}\mathit{}\mathrm{d}s$ has a series of real eigenvalues $\mathrm{\{}{\nu}_{j}\mathrm{\}}$ satisfying ${\mathrm{\sum}}_{j}{\nu}_{j}\mathrm{=}L$. ${\nu}_{j}$ is the solution to the eigenvalue problem:
The spectrum defined by $\mathrm{\{}{\nu}_{j}\mathrm{/}L\mathrm{\}}$ is the Mercer’s spectrum.
Definition 12 (Crosscorrelation spectrum)
Let the crosscorrelation matrix be $\mathbf{C}\mathrm{=}{\mathrm{\{}{c}_{j\mathrm{,}{j}^{\mathrm{\prime}}}\mathrm{\}}}_{J\mathrm{\times}J}$, where ${c}_{j\mathrm{,}{j}^{\mathrm{\prime}}}$ is the average crosscorrelation between individual $j$ and ${j}^{\mathrm{\prime}}$:
As $\mathbf{C}$ is Hermitian, it has a series of real eigenvalues ${\lambda}_{j}$ satisfying ${\mathrm{\sum}}_{j}{\lambda}_{j}\mathrm{=}J$. Let the normalized spectrum of $\mathbf{C}$ be $\mathrm{\{}{\lambda}_{j}\mathrm{/}J\mathrm{\}}$, then $\mathrm{\{}{\lambda}_{j}\mathrm{/}J\mathrm{\}}$ is defined as the crosscorrelation spectrum.
The following theorem states that when the complex representation is adopted for biancestry signals, the Mercer’s spectrum coincides with the crosscorrelation spectrum, so that even if the Mercer’s spectrum is calculated using correlation within individuals, both captures the correlation between individuals.
Theorem 4
If ${\mathrm{\{}{z}_{j}\mathit{}\mathrm{(}l\mathrm{)}\mathrm{\}}}_{j\mathrm{=}\mathrm{1}}^{J}$ is a collection of complex biancestry signals, then its Mercers’ spectrum is the same as its crosscorrelation spectrum.
Proof. The average autocorrelation $\u27e8A\u27e9$ is computed as
So the integral equation (51) becomes
where $(\nu ,\varphi )$ forms the solution to the above equation. Rearranging the terms, we write
where ${\alpha}_{j}$ stands for the constant inside the bracket. Substituting into the original integral equation, we have
which is equivalent to
Notice that the integral in the bracket contains the crosscorrelation between ${j}^{\prime}$ and $j$, so if the following relationship holds
then the system has a solution of $\nu $. The above equation is equivalent to
Which means that $J\nu /L$ is the eigenvalue of the crosscorrelation matrix $\mathbf{C}$, and since $\mathbf{C}$ is a Hermitian matrix, the existence of its real eigenvalues is guaranteed. So for each ${\nu}_{i}$, we have ${\lambda}_{j}=J{\nu}_{j}/L$, which is to say
As both spectra coincide for the complex representation, we can define the betweenindividual entropy using either of them.
Definition 13 (Betweenindividual entropy of the complex representation)
For complex biancestry signals, let ${S}_{b}\mathrm{:=}S\mathit{}\mathrm{(}{I}_{\mathrm{\u27e8}A\mathrm{\u27e9}}\mathrm{)}\mathrm{=}S\mathit{}\mathrm{(}\mathbf{C}\mathrm{)}$. ${S}_{b}$ is the Shannon entropy characterizing the betweenindividual correlation.
Corollary 1 (Maximum ${S}_{b}$).
For $J$ diploid individuals with the complex representation of ancestry, the largest attainable ${S}_{b}$ is given by
where $c\mathrm{=}\mathrm{(}\mathrm{3}\mathrm{+}\mathrm{2}\mathit{}\sqrt{\mathrm{2}}\mathrm{)}\mathrm{/}\mathrm{8}$.
Proof. For entropy to be large, autocorrelation within each individual must be very weak. This will occur if a sufficient amount of recombination has taken place such that the ancestry across the genome is largely independent between loci. The maximum value will occur when both parental lineages contribute equally to the hybrid ancestry, so that the hybrid index $h=0.5$. The heterozygosity $H$ is then 0.5 following a random distribution of ancestries along the two genome copies. The offdiagonal elements in the crosscorrelation matrix $\mathbf{C}$ thus take the value
For a $J\times J$ matrix whose diagonal elements are all 1, and whose offdiagonal elements are all $c$, its normalized eigenvalues are given by $\{\frac{1+(J1)c}{J},\frac{1c}{J},\mathrm{\cdots},\frac{1c}{J}\}$, which leads to the above result.
Corollary 2 (Minimum ${S}_{b}$). ${S}_{b}=0$ is the minimum betweenindividual entropy, and is attainable if recombination is completely suppressed in the genomic interval of interest.
Proof. If recombination is completely suppressed along $[0,L]$, the ancestry signal is constant along the interval in any individual with any ploidy. Hence, ${A}_{j}\equiv 1$, and $\u27e8A\u27e9\equiv 1$. The only nonzero eigenvalue associated with a constant integral kernel is $\nu =L$, so that we have ${S}_{b}=0$ using the Mercer’s spectrum. Note that the converse is not true, because the diploid ancestry signal is unaware of phase. A region completely heterozygous may in fact has a recombination break point, and the ancestry can flip phases when crossing the break point. However, this extreme situation is unlikely as it requires two separate recombination events to have occurred at precisely the same point and in opposite directions. If the hybrid zone is old, a long track of heterozygous ancestry is therefore usually good evidence for the presence of barrier loci.
Data availability
Source code is available at: https://github.com/tzxiong/2021_Maackii_Syfanius_HybridZone, (copy archived at swh:1:rev:41b220f489e17ff9795e5a0666e9579a00b2b3b8) Wholegenome sequences are deposited in the National Center for Biotechnology Information, Sequence Read Archive (BioProject Accession Number: PRJNA765117).

NCBI Sequence Read ArchiveID PRJNA765117. Hybridization between Papilio syfanius and Papilio maackii.

GitHubID 2021_Maackii_Syfanius_HybridZone. Database for the code used in the 2021 manuscript on the hybrid zone between Papilio syfanius and Papilio maackii.

GigaDB DatasetsSupporting data for "Chromosomallevel reference genome of Chinese peacock butterfly (Papilio bianor) based on thirdgeneration DNA sequencing and HiC analysis".https://doi.org/10.5524/100653
References

Signals interpreted as archaic introgression appear to be driven primarily by faster evolution in AfricaRoyal Society Open Science 7:191900.https://doi.org/10.1098/rsos.191900

Multilocus clinesEvolution; International Journal of Organic Evolution 37:454–471.https://doi.org/10.1111/j.15585646.1983.tb05563.x

NOVOPlasty: de novo assembly of organelle genomes from whole genome dataNucleic Acids Research 45:e18.https://doi.org/10.1093/nar/gkw955

Testing for ancient admixture between closely related populationsMolecular Biology and Evolution 28:2239–2252.https://doi.org/10.1093/molbev/msr048

Prevalence and adaptive impact of introgressionAnnual Review of Genetics 55:265–283.https://doi.org/10.1146/annurevgenet021821020805

WorldClim 2: new 1‐km spatial resolution climate surfaces for global land areasInternational Journal of Climatology 37:4302–4315.https://doi.org/10.1002/joc.5086

A threesample test for introgressionMolecular Biology and Evolution 36:2878–2882.https://doi.org/10.1093/molbev/msz178

Assessing biological factors affecting postspeciation introgressionEvolution Letters 4:137–154.https://doi.org/10.1002/evl3.159

Highresolution global maps of 21stcentury forest cover changeScience (New York, N.Y.) 342:850–853.https://doi.org/10.1126/science.1244693

Catalogue of the Suguru Igarashi insect collection, Part I. Lepidoptera, PapilionidaeThe University Museum, The University of Tokyo, Material Reports 94:1–390.

Isolation with migration models for more than two populationsMolecular Biology and Evolution 27:905–920.https://doi.org/10.1093/molbev/msp296

UFBoot2: Improving the Ultrafast Bootstrap ApproximationMolecular Biology and Evolution 35:518–522.https://doi.org/10.1093/molbev/msx281

Gene genealogies and the coalescent processOxford Surveys in Evolutionary Biology 44:31.

Temperature predicts the rate of molecular evolution in Australian Eugongylinae skinksEvolution; International Journal of Organic Evolution 76:252–261.https://doi.org/10.1111/evo.14342

The breakdown of genomic ancestry blocks in hybrid lineages given a finite number of recombination sitesEvolution; International Journal of Organic Evolution 72:735–750.https://doi.org/10.1111/evo.13436

Comparative recombination rates in the rat, mouse, and human genomesGenome Research 14:528–538.https://doi.org/10.1101/gr.1970304

Epistatic mutations under divergent selection govern phenotypic variation in the crow hybrid zoneNature Ecology & Evolution 3:570–576.https://doi.org/10.1038/s4155901908479

A general comparison of relaxed molecular clock modelsMolecular Biology and Evolution 24:2669–2680.https://doi.org/10.1093/molbev/msm193

Von Neumann entropy and majorizationJournal of Mathematical Analysis and Applications 408:384–393.https://doi.org/10.1016/j.jmaa.2013.06.019

Outbred genome sequencing and CRISPR/Cas9 gene editing in butterfliesNature Communications 6:1–10.https://doi.org/10.1038/ncomms9212

Evolutionary Rates of Bumblebee Genomes Are Faster at Lower ElevationsMolecular Biology and Evolution 36:1215–1219.https://doi.org/10.1093/molbev/msz057

Evolution of the mutation rateTrends in Genetics 26:345–352.https://doi.org/10.1016/j.tig.2010.05.003

Inferring phylogeny despite incomplete lineage sortingSystematic Biology 55:21–30.https://doi.org/10.1080/10635150500354928

Hybridization as an invasion of the genomeTrends in Ecology & Evolution 20:229–237.https://doi.org/10.1016/j.tree.2005.02.010

Evaluating the use of ABBABABA statistics to locate introgressed lociMolecular Biology and Evolution 32:244–257.https://doi.org/10.1093/molbev/msu269

IQTREE 2: New models and efficient methods for phylogenetic inference in the genomic eraMolecular Biology and Evolution 37:1530–1534.https://doi.org/10.1093/molbev/msaa015

Divergent selection and heterogeneous genomic divergenceMolecular Ecology 18:375–402.https://doi.org/10.1111/j.1365294X.2008.03946.x

The coalescent and the genealogical process in geographically structured populationJournal of Mathematical Biology 29:59–75.https://doi.org/10.1007/BF00173909

A genomic perspective on hybridization and speciationMolecular Ecology 25:2337–2360.https://doi.org/10.1111/mec.13557

Estimates of introgression as a function of pairwise distancesBMC Bioinformatics 20:1–11.https://doi.org/10.1186/s128590192747z

Evaluating genomic signatures of “the large Xeffect” during complex speciationMolecular Ecology 27:3822–3830.https://doi.org/10.1111/mec.14777

Interpreting the genomic landscape of speciation: a road map for finding barriers to gene flowJournal of Evolutionary Biology 30:1450–1477.https://doi.org/10.1111/jeb.13047

BookEvolution Above the Species LevelNew York, United States: Columbia University Press.https://doi.org/10.7312/rens91062

Beyond clines: lineages and haplotype blocks in hybrid zonesMolecular Ecology 25:2559–2576.https://doi.org/10.1111/mec.13677

Unusual successive occurrence of the Maackii Peacock (Papilio maackii; Papilionidae) at Okayama University of Science, southwestern Honshu lowland, in 2006 summer and autumnNaturalistae 11:11–14.

Temperature dependence of spontaneous mutation ratesGenome Research 31:1582–1589.https://doi.org/10.1101/gr.275168.120

Making sense of genomic islands of differentiation in light of speciationNature Reviews Genetics 18:87–100.https://doi.org/10.1038/nrg.2016.133

The genetical structure of populationsAnnals of Eugenics 15:323–354.https://doi.org/10.1111/j.14691809.1949.tb02451.x

Catalogue of the Keiichi Omoto butterfly collection, Part I (Papilionidae: Baroniinae and ParnassiinaeThe University Museum, The University of Tokyo, Material Reports 127:1–162.

Molecular phylogenetics: principles and practiceNature Reviews Genetics 13:303–314.https://doi.org/10.1038/nrg3186
Decision letter

Daniel R MatuteReviewing Editor; University of North Carolina, Chapel Hill, United States

George H PerrySenior Editor; Pennsylvania State University, United States

Daniel L PowellReviewer; Stanford University, United States
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Admixture of evolutionary rates across a hybrid zone" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and George Perry as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Daniel L Powell (Reviewer #1).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
Overall, the manuscript does a good job of presenting results in a clear and concise manner. Before making a final decision I have three requests:
1. The reviewers have requested some clarifications to make the entropy derivations more accessible to the reader, soften some claims (i.e., archaic gene flow), and add a section regarding the caveats to the discussion. I strongly agree with the reviewers and a potential resubmission should include a response to these requests.
I have two more things to add.
2. Please add the study organisms to the title of the manuscript. As it is currently written, the title is overly ambitious and the reader might be misled to think that the differences in substitution rates is a general phenomenon.
3. The scholarship of the piece needs to be improved. In the current form, the discussion has seven references which is rather surprising given that the study of introgression is one of the most vibrant aspects of current speciation research. The discussion needs to be more encompassing of our current understanding of the factors that determine the fate of an introgressed allele. The current form of the discussion should elaborate on what are the primary hypothesis that could lead to differences in substitution rates and assess what are the potential implications for the interaction between selection and recombination (in the face of differences in substitution rates).
Reviewer #1 (Recommendations for the authors):
I have nothing in the way ion further suggestions for the authors that isn't included in the public review except the following.
Lines 187188: In the previous paragraph you claim that divergent rates likely exist, but here you state that you've established that they do. This strengthening of the statement seems unwarranted.
Reviewer #2 (Recommendations for the authors):
In their manuscript ("Admixture of evolutionary rates across a hybrid zone"), Xiong et al. use whole genome resequencing data to assess rates of genome evolution between two species of butterflies and determine whether loci that evade introgression between the species are also those that evolve at asymmetric rates. They show the landscape of divergence is heterogeneous among these sister species and the genomes of the hybrids exhibit correlated ancestry blocks (both within an individual's genome, and between individuals). Interestingly, sites that are more strongly diverged appear to be more tightly correlated, indicative of potential barrier loci. The authors also find significantly skewed D statistics using four outgroups. While these statistics might normally be used to infer introgression, the authors further use topological weights, and node heights to show that asymmetric rates of genome evolution might account for these skewed D statistics, rather than introgression. Using both coalescent theory and empirical observations, they show that these asymmetrically evolving sites are correlated with variation in relative divergence among species. Overall, I congratulate the authors on a clever and well written manuscript. I think this work is incredibly rigorous and I appreciate the clear communication of fairly complex results. I do have a few suggestions, which I think the authors might find useful.
Namely, while the authors nicely show potential asymmetric evolutionary rates between species, and also show that highly divergent sites are more likely to show patterns of lower entropy (i.e. less randomly associated with other loci), which I agree may be indicative of barrier loci, I feel there might be a missing link between these two processes in the way the manuscript is currently written. For example, the authors show these differences in evolutionary rates (i.e. the ratio number of substitutions between lineages) is correlated with divergence, but I'm curious if they are also correlated with ancestry entropy. In my mind, I agree with the authors that barrier loci should be highly diverged (and exhibit reduced diversity), but these signatures can be caused by a number of evolutionary processes. Instead, the entropy data is quite striking to me and more convincing of barrier loci. Therefore, I might expect the entropy measurements to also be correlated with the D statistics or the ratio of substitution rates.
I find Figures 4 and 6 really convincing of differential evolutionary rates over differences in introgression with the four outgroups. One additionally might predict that if two species were evolving differentially (and this in turn was reflected in the relative divergence between these sister species), that one might also see asymmetric differences in something like DXY between each focal species and each outgroup, and this might also be correlated with FST between the two sister lineages. This might be nice as a supplement to bolster your argument.
Lastly, I think the authors could clarify two important aspects of their differential evolutionary rate hypothesis.
First: it's unclear to me whether the authors are arguing that barrier loci evolve at different rates than the rest of the genome (and therefore show both increased divergence and increased asymmetric rates of evolution), or whether loci that evolve at different rates may have a harder time introgressing between species (by nature of the fact that a given locus' evolutionary rate is at odds with the evolutionary rate of the rest of the recipient genome). Perhaps this is a difference without much distinction, but a bit more detail from the authors on this might be useful. Additionally, if the authors are arguing the former, I think there's a nice amount of literature that they could tap into to talk about why different barrier loci might evolve rapidly (for example, if they are involved in a genetic arms race).
Second: The finding that these two sister species have such drastically different evolutionary rates is really interesting, and I think the authors could elaborate a bit more on why they think that is. There is some explanation on lines 232234, but I'm curious whether one of these species has undergone a massive bottle neck or partakes in more inbreeding in ways that are also known to affect D statistics and such.
https://doi.org/10.7554/eLife.78135.sa1Author response
1. The reviewers have requested some clarifications to make the entropy derivations more accessible to the reader, soften some claims (i.e., archaic gene flow), and add a section regarding the caveats to the discussion. I strongly agree with the reviewers and a potential resubmission should include a response to these requests.
We thank both reviewers and the reviewing editor for their encouraging assessments of this manuscript. All their suggestions have helped us state the results in a more accurate and meaningful way.
2. Please add the study organisms to the title of the manuscript. As it is currently written, the title is overly ambitious and the reader might be misled to think that the differences in substitution rates is a general phenomenon.
“Butterfly” is now added to the title. (“Admixture of evolutionary rates across a butterfly hybrid zone”)
3. The scholarship of the piece needs to be improved. In the current form, the discussion has seven references which is rather surprising given that the study of introgression is one of the most vibrant aspects of current speciation research. The discussion needs to be more encompassing of our current understanding of the factors that determine the fate of an introgressed allele. The current form of the discussion should elaborate on what are the primary hypothesis that could lead to differences in substitution rates and assess what are the potential implications for the interaction between selection and recombination (in the face of differences in substitution rates).
To improve the scholarship and discuss these results in light of existing and historical work, we have added new references and carried out a major revision of the Discussion, with the following subsections:
1) Entropy as a useful measure of ancestry randomness. We discuss the incentives behind developing these metrics, and the intuition for choosing particular forms of ancestry representation. We also explain the impact of other processes such as inbreeding and recombination on entropy, and show how they are useful for windowbased genomic analysis.
2) Potential mechanisms of divergent substitution rates. Here, we speculate on several ways that different substitution rates might result. Ohta’s nearly neutral theory is now included as a possible explanation for smaller asymmetric rates on nonsynonymous substitutions than synonymous and intronic ones.
3) Are introgression tests robust to substitution rate variation? This part summarizes the finding that introgression tests based on site patterns can produce falsepositives when substitution rates diverge between species and recurrent mutations exist in outgroups. In particular, we stress that, since barrier loci have the most asymmetric rates of evolution during hybridization (given the existence of divergent rates), they might be falsely identified as having been introgressed from an outgroup when introgression tests are used on genomic windows. This result might therefore mislead researchers to the conclusion that some “speciation genes” are borrowed from outgroups.
4) The conceptual picture of ratemixing. In this final subsection, we recapitulate the main conclusion of this study, that locally asymmetric substitution rates reflect loci responsible for reproductive isolation. The caveats of the theory we develop in this manuscript are discussed in detail, and we conjecture that the monotonic relationship between rate asymmetry and sequence divergence might be a robust signature irrespective of model details.
Reviewer #1 (Recommendations for the authors):
I have nothing in the way ion further suggestions for the authors that isn't included in the public review except the following.
Lines 187188: In the previous paragraph you claim that divergent rates likely exist, but here you state that you've established that they do. This strengthening of the statement seems unwarranted.
Thanks! We changed the first sentence of this subsection to “Having tested for …” rather than “Having established the existence of …”
Reviewer #2 (Recommendations for the authors):
In their manuscript ("Admixture of evolutionary rates across a hybrid zone"), Xiong et al. use whole genome resequencing data to assess rates of genome evolution between two species of butterflies and determine whether loci that evade introgression between the species are also those that evolve at asymmetric rates. They show the landscape of divergence is heterogeneous among these sister species and the genomes of the hybrids exhibit correlated ancestry blocks (both within an individual's genome, and between individuals). Interestingly, sites that are more strongly diverged appear to be more tightly correlated, indicative of potential barrier loci. The authors also find significantly skewed D statistics using four outgroups. While these statistics might normally be used to infer introgression, the authors further use topological weights, and node heights to show that asymmetric rates of genome evolution might account for these skewed D statistics, rather than introgression. Using both coalescent theory and empirical observations, they show that these asymmetrically evolving sites are correlated with variation in relative divergence among species. Overall, I congratulate the authors on a clever and well written manuscript. I think this work is incredibly rigorous and I appreciate the clear communication of fairly complex results. I do have a few suggestions, which I think the authors might find useful.
Namely, while the authors nicely show potential asymmetric evolutionary rates between species, and also show that highly divergent sites are more likely to show patterns of lower entropy (i.e. less randomly associated with other loci), which I agree may be indicative of barrier loci, I feel there might be a missing link between these two processes in the way the manuscript is currently written. For example, the authors show these differences in evolutionary rates (i.e. the ratio number of substitutions between lineages) is correlated with divergence, but I'm curious if they are also correlated with ancestry entropy. In my mind, I agree with the authors that barrier loci should be highly diverged (and exhibit reduced diversity), but these signatures can be caused by a number of evolutionary processes. Instead, the entropy data is quite striking to me and more convincing of barrier loci. Therefore, I might expect the entropy measurements to also be correlated with the D statistics or the ratio of substitution rates.
We agree with the intuition of the reviewer here that entropy could also be correlated with rate asymmetry. In fact, since the measure of "divergence" we use, F_{ST}. is negatively correlated with entropy (Figure 3E, third panel), and also positively correlated with rate ratio r, it is likely that entropy will be negatively correlated with r. The reason we use F_{ST} instead of entropy in studying the process of ratemixing between populations is as follows:
First, F_{ST} is widely used in comparing populations, and has wellunderstood properties in population genetics theory. Yes, when F_{ST} is used to study gene flow, results can be misled by a number of other evolutionary processes (e.g., linked selection not caused by barrier loci), entropy is not exempt from complications either. As stated in the response to Reviewer #1, inbreeding reduces betweenindividual entropy (S_{b}), and low recombination rates reduce withinindividual entropy (S_{w}), and both processes are not directly related to barrier loci per se.
Second, as entropy is defined on estimated local ancestry rather than DNA sequences, its performance will depend on the statistical power of ancestry estimation, which might introduce an additional layer of uncertainty to the results. However, F_{ST} can be directly estimated from sequences.
Third, we here use the entropy measure only within the hybrid population, while F_{ST} could be calculated between any populations (in this case, the “pure” populations), and so is potentially applicable to a wider set of data in which hybrid populations have not been sampled.
I find Figures 4 and 6 really convincing of differential evolutionary rates over differences in introgression with the four outgroups. One additionally might predict that if two species were evolving differentially (and this in turn was reflected in the relative divergence between these sister species), that one might also see asymmetric differences in something like DXY between each focal species and each outgroup, and this might also be correlated with FST between the two sister lineages. This might be nice as a supplement to bolster your argument.
Yes! Interestingly, D_{XY}based statistics were the very first clues we had that showed some kind of asymmetry between these two species, and we were then genuinely puzzled by this asymmetry, which initially was thought to show introgression with an outgroup (we only included one outgroup P. bianor, at the early stage of the analysis).
The statistic we used originally was “Population Branch Statistic” (PBS). In one version of the PBS, choose a triplet (P1, P2, O). P1 and P2 are two focal taxa, while O is an outgroup. PBS describes how distant each lineage is from the center of the triplet, and is calculated as follows:
(see: https://github.com/thamala/PBScan/blob/master/PBScan_manual.pdf)
Many people use PBS to find regions with unusually fast evolution in one lineage (indicating selection). In our case, among 50kb sliding windows, PBS(maackii) is usually larger than PBS(syfanius), and this asymmetry increases with F_{ST}. For instance, chromosome 15 has one of the lowest F_{ST}, while chromosome 30 (Z) has the highest F_{ST}, and PBS is highly skewed towards P. maackii on chromosome 30.
We then thought that such a large skew in PBS couldn’t all be attributed to natural selection in P. maackii, and that there must be other processes generating such patterns. However, because PBS is not a normalized measure, it’s difficult to make a fair comparison across genomic windows. For this reason, we switched to D_{3} and D_{4} statistics, both are normalized, and are specifically designed to capture asymmetry information.In short, it is absolutely true that information based on D_{XY} will show similar asymmetry between the focal taxa. But we argue that the normalized measures using D statistics make the point more clearly.
Lastly, I think the authors could clarify two important aspects of their differential evolutionary rate hypothesis.
First: it's unclear to me whether the authors are arguing that barrier loci evolve at different rates than the rest of the genome (and therefore show both increased divergence and increased asymmetric rates of evolution), or whether loci that evolve at different rates may have a harder time introgressing between species (by nature of the fact that a given locus' evolutionary rate is at odds with the evolutionary rate of the rest of the recipient genome). Perhaps this is a difference without much distinction, but a bit more detail from the authors on this might be useful. Additionally, if the authors are arguing the former, I think there's a nice amount of literature that they could tap into to talk about why different barrier loci might evolve rapidly (for example, if they are involved in a genetic arms race).
Thank you for point it out. We apologize for not being clear on this result. The main idea throughout the manuscript is simply that:
1. DNA accumulates substitutions at different rates between the two species
2. Hybridization mixes these rates for any specific locus, but at different rates of mixture depending on proximity to barrier loci.
To improve the wording, we now always use “mix” instead of “couple” whenever we are talking about rates being mixed between the two species. In the abstract, we also emphasize this mixing process is “between species”, and so we are not comparing “between loci”.
Second: The finding that these two sister species have such drastically different evolutionary rates is really interesting, and I think the authors could elaborate a bit more on why they think that is. There is some explanation on lines 232234, but I'm curious whether one of these species has undergone a massive bottle neck or partakes in more inbreeding in ways that are also known to affect D statistics and such.
There is now a subsection in Discussion to include possible mechanisms of substitution rate differences. Although it is somewhat speculative, we believe it may be to do with the different numbers of generations per year between the two species. The highland lineage has 12 generations a year while the lowland lineage has 23. If everything else is equal, the ratio between the generation numbers could well be about ~1.8. We also speculate that Ohta’s nearly neutral theory might explain the reduced rate asymmetry of ~1.6 for nonsynonymous sites (the lowland lineage has larger populations more efficient selection against mildly deleterious mutations slowing down nonsynonymous substitutions more in the lowland).
https://doi.org/10.7554/eLife.78135.sa2Article and author information
Author details
Funding
American Philosophical Society (The Lewis and Clark Fund for Exploration and Field Research (20172018))
 Tianzhu Xiong
The NSFSimons Center for Mathematical and Statistical Analysis of Biology (Award Number #1764269), and The Quantitative Biology Initiative (Harvard University) (Graduate Student Fellowship)
 Tianzhu Xiong
Department of Organismic and Evolutionary Biology, Harvard University (Graduate Student Fellowship)
 Tianzhu Xiong
Department of Organismic and Evolutionary Biology, Harvard University (Faculty Startup Fund)
 James Mallet
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by the Lewis and Clark Fund for Exploration and Field Research (American Philosophical Society, 2017–2018) granted to TX.; TX is also funded by a studentship from Harvard Department of Organismic and Evolutionary Biology, the NSFSimons Center for Mathematical and Statistical Analysis of Biology at Harvard (award number #1764269) and the Harvard Quantitative Biology Initiative during the project. We thank Harvard FAS Research Computing for providing computation resources. We thank Fernando Seixas for discussion on hybrid ancestry inference. We thank Kaifeng Bu for his comments on informationtheoretic measures. We thank Nathaniel Edelman, Neil Rosser, Miriam Miyagi, Yuttapong Thawornwattana, Sarah Dendy, Liang Qiao, Adam Cotton, John Wakeley, Robin Hopkins, and Naomi Pierce for their valuable input. We also thank Yuchen Zheng, Zhuoheng Jiang, Shaoji Hu, Feng Cao, and Shui Xu for support in the field. Feng Cao provided specimen photos from the hybrid zone.
Senior Editor
 George H Perry, Pennsylvania State University, United States
Reviewing Editor
 Daniel R Matute, University of North Carolina, Chapel Hill, United States
Reviewer
 Daniel L Powell, Stanford University, United States
Publication history
 Preprint posted: September 28, 2021 (view preprint)
 Received: February 23, 2022
 Accepted: June 14, 2022
 Accepted Manuscript published: June 15, 2022 (version 1)
 Version of Record published: June 30, 2022 (version 2)
Copyright
© 2022, Xiong et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,309
 Page views

 346
 Downloads

 1
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Evolutionary Biology
 Genetics and Genomics
Ciliates are microbial eukaryotes that undergo extensive programmed genome rearrangement, a natural genome editing process that converts long germline chromosomes into smaller generich somatic chromosomes. Three wellstudied ciliates include Oxytricha trifallax, Tetrahymena thermophila and Paramecium tetraurelia, but only the Oxytricha lineage has a massively scrambled genome, whose assembly during development requires hundreds of thousands of precise programmed DNA joining events, representing the most complex genome dynamics of any known organism. Here we study the emergence of such complex genomes by examining the origin and evolution of discontinuous and scrambled genes in the Oxytricha lineage. This study compares six genomes from three species, the germline and somatic genomes for Euplotes woodruffi, Tetmemena sp., and the model ciliate Oxytricha trifallax. To complement existing data, we sequenced, assembled and annotated the germline and somatic genomes of Euplotes woodruffi, which provides an outgroup, and the germline genome of Tetmemena sp.. We find that the germline genome of Tetmemena is as massively scrambled and interrupted as Oxytricha's : 13.6% of its gene loci require programmed translocations and/or inversions, with some genes requiring hundreds of precise gene editing events during development. This study revealed that the earlierdiverged spirotrich, E. woodruffi, also has a scrambled genome, but only roughly half as many loci (7.3%) are scrambled. Furthermore, its scrambled genes are less complex, together supporting the position of Euplotes as a possible evolutionary intermediate in this lineage, in the process of accumulating complex evolutionary genome rearrangements, all of which require extensive repair to assemble functional coding regions. Comparative analysis also reveals that scrambled loci are often associated with local duplications, supporting a gradual model for the origin of complex, scrambled genomes via many small events of DNA duplication and decay.

 Evolutionary Biology
 Microbiology and Infectious Disease
A morbidostat is a bioreactor that uses antibiotics to control the growth of bacteria, making it wellsuited for studying the evolution of antibiotic resistance. However, morbidostats are often too expensive to be used in educational settings. Here we present a lowcost morbidostat called the EVolutionary biorEactor (EVE) that can be built by students with minimal engineering and programming experience. We describe how we validated EVE in a real classroom setting by evolving replicate Escherichia coli populations under chloramphenicol challenge, thereby enabling students to learn about bacterial growth and antibiotic resistance.