Admixture of evolutionary rates across a butterfly hybrid zone

  1. Tianzhu Xiong  Is a corresponding author
  2. Xueyan Li
  3. Masaya Yago
  4. James Mallet
  1. Department of Organismic and Evolutionary Biology, Harvard University, United States
  2. Kunming Institute of Zoology, Chinese Academy of Sciences, China
  3. The University Museum, The University of Tokyo, Japan

Abstract

Hybridization is a major evolutionary force that can erode genetic differentiation between species, whereas reproductive isolation maintains such differentiation. In studying a hybrid zone between the swallowtail butterflies Papilio syfanius and Papilio maackii (Lepidoptera: Papilionidae), we made the unexpected discovery that genomic substitution rates are unequal between the parental species. This phenomenon creates a novel process in hybridization, where genomic regions most affected by gene flow evolve at similar rates between species, while genomic regions with strong reproductive isolation evolve at species-specific rates. Thus, hybridization mixes evolutionary rates in a way similar to its effect on genetic ancestry. Using coalescent theory, we show that the rate-mixing process provides distinct information about levels of gene flow across different parts of genomes, and the degree of rate-mixing can be predicted quantitatively from relative sequence divergence (FST) between the hybridizing species at equilibrium. Overall, we demonstrate that reproductive isolation maintains not only genomic differentiation, but also the rate at which differentiation accumulates. Thus, asymmetric rates of evolution provide an additional signature of loci involved in reproductive isolation.

Editor's evaluation

The authors leverage theory, simulations, and empirical population genomics to evaluate what are the consequences of differences in substitution rates in hybridizing species. This is a largely overlooked pheonomenon. This study highlights the issue and demonstrates that two hybridizing species of Papilio have differences in thir substitution rates. The work will be of interest to a large group of evolutionary biologists, especially those studying evolution at the whole-genome level.

https://doi.org/10.7554/eLife.78135.sa0

Introduction

DNA substitution, the process whereby single-nucleotide mutations accumulate over time, is a critical process in molecular evolution. Both molecular phylogenetics and coalescent theory rely on observed mutations (Yang and Rannala, 2012; Wakeley, 2016), and so the rate of substitution/mutation is the predominant link from molecular data to information about the timing of past events (Bromham and Penny, 2003). Substitution rates often vary among lineages: generation time, spontaneous mutation rate, and fixation probabilities of new mutations could all contribute to the variation of substitution rates (Ohta, 1993; Lynch, 2010). Recent evidence suggests mutation rates are even variable among human populations (Harris, 2015; DeWitt et al., 2021). As such variation affects how fast the molecular clock ticks, reconstructing gene genealogies among different species sometimes accounts for species-specific rates of evolution (Lepage et al., 2007). However, under the standard coalescent framework, empirical studies of within- and between-species variation tend to ignore rate variation among populations (Costa and Wilkinson-Herbots, 2017; Wolf and Ellegren, 2017; Kautt et al., 2020). The latter is partly based on a popular assumption in coalescent theory that neutral mutation rate is constant for a given locus across the whole genealogy (Hudson, 1990). Hybridization and speciation lie in the gray zone of these extremes, and have their own problem: molecular clocks from different lineages could be mixed by cross-species gene flow — a gene could evolve under one clock before it flows into another species and switches to evolve according to a different clock (Figure 1). This mixing process is largely outside the scope of existing theories and has received little attention from empirical studies.

Gene flow interacts with divergent substitution rates and affects observed numbers of substitutions.

Each gray block represents a species with its species-specific substitution rate. Solid lines represent gene genealogies prior to coalescence, and horizontal jumps between species represent inter-specific gene flow. Dots are substitutions. When a gene sequence inherits mutations derived under multiple rates, the number of substitutions it carries will reflect a mixture of substitution rates among different species. If gene flow is strong, each lineage carries a similar number of substitutions; If gene flow is weak, genes evolve independently with species-specific rates, and the distribution of substitutions in each lineage will likely be skewed towards the distribution of species-specific substitution rates.

However, if mixing between molecular clocks exists and can be measured, its strength could carry information about gene flow, which is important for studying reproductive isolation between incipient species. Genomic regions responsible for reproductive isolation lead to locally elevated genomic divergence (“genomic islands”), often caused by linked genomic regions experiencing less gene flow (“barrier loci”; Nosil et al., 2009; Michel et al., 2010; Renaut et al., 2013; Payseur and Rieseberg, 2016). In studying a hybrid zone between two butterfly species, Papilio syfanius and Papilio maackii, we discovered evidence for unequal genome-wide substitution rates between the two species. Using this system, we investigate the interaction between unequal substitution rates and gene flow, and whether this interaction reveals new information on reproductive isolation.

As these butterflies are rare species, occurring in a remote region of China, and are hard to collect, we employed methods based on analysis of whole-genome sequences of a few specimens. We hope that these methods may prove of use in studying other rare or perhaps endangered species where few individuals can be sacrificed. Results will follow two parallel lines: first, we provide evidence that genomic islands are associated with barrier loci. Then we infer the existence of unequal substitution rates. Finally, using a coalescent model, we calculate the relationship between the magnitude of genomic islands and the degree of mixing between substitution rates in linked regions. Throughout the analysis, we assume reverse mutations are rare, so that higher substitution rates always lead to higher numbers of observed substitutions.

Results

Divergent sister species with ongoing hybridization

We sampled 11 males of P. syfanius and P. maackii across a geographic transect (Figure 2A, dashed lines) covering both pure populations (in the sense of being geographically far from the hybrid zone) and hybrid populations. We also include four outgroup species, two of which have chromosome-level genome assemblies (P. bianor, P. xuthus; Lu et al., 2019; Li et al., 2015), while the other two (P. arcturus, P. dialis) are new to this study. All samples were re-sequenced to at least 20× coverage across the genome and mapped to the genome assembly of P. bianor. Among sampled local populations, P. syfanius inhabits the highlands of Southwest China (Figure 2A, red region), whereas P. maackii dominates at lower elevations (Figure 2A, blue region) (see Figure 2—figure supplement 3 and Figure 2—source data 4 for the complete distribution). The two lineages form a spatially contiguous hybrid zone at the edge of the Hengduan Mountains (Figure 2B) with individuals exhibiting intermediate wing patterns (Figure 2A: purple dot, corresponding to population WN in Figure 2B). Consistent with previous results (Condamine et al., 2013), assembled whole mitochondrial genomes are not distinct between the two lineages (Figure 2C, Figure 2—source data 3), suggesting either that divergence was recent, or that gene flow has homogenized the mitochondrial genomes. However, the two species are likely adapted to different environments associated with altitude, as several ecological traits are strongly divergent (Kashiwabara, 1991; Figure 2—figure supplement 1). Similarly, between pure populations (KM and XY in Figure 2B), relative divergence (FST) is also high across the entire nuclear genome (Figure 2D, Figure 2—source data 1). The FST on autosomes averages between 0.2–0.4, and on the sex chromosome (Z-chromosome) it reaches 0.78. A highly heterogeneous landscape of FST is accompanied by numerous islands of elevated sequence divergence (DXY) and reduced genetic diversity (π) scattered across the genome (Figure 2—figure supplement 4, Figure 2—source data 2). Since animal mitochondrial DNA generally has higher mutation rates than the nuclear genome (Haag-Liautard et al., 2008), its low divergence between the two species are likely the result of gene flow. Overall, despite ongoing hybridization, genomes of the pure populations of P. syfanius and P. maackii are strongly differentiated.

Figure 2 with 4 supplements see all
Overview of the study system.

(A) The geographic distribution of P. syfanius (red) and P. maackii (blue). Scale-bar: 100 km. Dashed line is the sampling transect covering five populations (colored circles). (B) Elevation, climate, and sample sizes along the transect. (C) Mitochondrial tree with four outgroups. (D)FST across chromosomes (50 kb windows with 10 kb increments). The inset shows the estimated density of FST on each chromosome.

Figure 2—source data 1

Estimated FST between populations KM and XY (50 kb windows with 10 kb increments).

https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data1-v2.zip
Figure 2—source data 2

DXY and π estimated for populations KM and XY (non-overlapping 10 kb windows).

https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data2-v2.zip
Figure 2—source data 3

The phylogeny and alignment among mitochondrial genomes.

https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data3-v2.zip
Figure 2—source data 4

Results and data for MaxEnt species distribution models.

https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data4-v2.zip
Figure 2—source data 5

Sample information.

https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data5-v2.zip

Genomic islands are associated with barrier loci

A natural question is whether genomic differentiation is associated with barrier loci and reproductive isolation. In other words, can FST variation be attributed to gene flow variation between sister species? We suspect that barrier loci likely exist, because sequence variation between pure populations suggests that elevated FST is associated with reduced π and elevated DXY across autosomes (Figure 3A), as expected for hybridizing species (Irwin et al., 2018). The Z chromosome (sex chromosome) has the highest DXY and the lowest π among all chromosomes (Figure 2—figure supplement 4), another characteristic of hybridizing species with barriers to gene flow (Kronforst et al., 2013).

Figure 3 with 1 supplement see all
Evidence for barrier loci on autosomes.

For all plots, pure populations refer to XY & KM, and hybrid population refers to WN. (A) Between pure populations, reduced diversity (π, showing values averaged between pure populations) is associated with increased divergence (DXY) across autosomes (30 segments per chromosome). Error bars are standard errors. (B) The conceptual picture of entropy metrics on diploid, unphased ancestry signals. In a genomic window, between-individual entropy (Sb) measures local ancestry randomness among individuals, while within-individual entropy (Sw) measures ancestry randomness along a chromosomal interval. (C) Simulated behaviors of entropy in a simplified model of biparental local ancestry. Chromosomes are assumed to be spatially homogeneous, thus recombination rate is uniform among 1000 equally spaced SNPs, and adjacent SNPs have a single probability of ancestry disassociation. For each haplotype block with linked ancestry, its ancestry is randomly assigned according to the average contribution from each species. Each pair of haploid chromosomes are combined into an unphased ancestry signal before calculating entropy. The top plot assumes equal contribution from both species, and the bottom plot assumes ancestry disassociation probability = 0.001. Solid lines are average entropies across 1000 repeated simulations, and shaded areas represent averaged upper and lower deviations from the mean. (D) The joint range among entropy, π, and DXY across autosomes (20 segments per chromosome). Color range is normalized by the range of entropy in each plot. Gray represents higher entropy, and colored regions are associated with lower entropy. Heatmaps represent linear fits to the ensembles of points. (E) The correlation ρ on autosomes between entropy in hybrid populations and {π, DXY, FST} within and between pure populations. ρ is shown above each regression line. Error bars are standard errors of entropy from 50 repeated estimates of local ancestry using software ELAI (parameters are in Materials and methods). The significance of ρ is estimated using block-jackknifing among all segments: Z-scores are shown in parentheses.

Figure 3—source data 1

Estimated entropy (Sw, Sb), DXY, π, FST on 20 segments per chromosome (including the Z chromosome).

https://cdn.elifesciences.org/articles/78135/elife-78135-fig3-data1-v2.zip

To test for the presence of barrier loci, we augment the analysis with the sequences of four individuals from the population closest to the center of the hybrid zone (Population WN). We investigate whether differences in ancestry variation provide additional evidence for barrier loci in this hybrid zone. The underlying logic is that barrier loci will simultaneously:

  1. Reduce linked π in pure populations (Ravinet et al., 2017);

  2. Elevate linked DXY between pure populations (Ravinet et al., 2017);

  3. Elevate pairwise linkage disequilibrium in hybrid zones (Barton, 1983);

  4. Enrich linked ancestry from one lineage in hybrid zones (Sedghifar et al., 2016).

Effects 3 and 4 can be bundled together as “reduced ancestry randomness” around barrier loci because both are expected if intermixing of segments of different ancestries within a genomic interval is prevented. For effects 3 and 4, because of small sample sizes, estimating site-specific statistics such as pairwise linkage disequilibrium is untenable. However, our high-quality chromosome-level reference genome enabled accurate estimation of local ancestry. As a remedy for small sample sizes, we employ two entropy metrics borrowed from signal-processing theory to quantify ancestry randomness in local regions along chromosomes (see Materials and methods). By dividing chromosomes into segments, we can extract indirect information about effects 3 and 4 at the expense of reduced genomic resolution. The proposed metrics, Sb and Sw, correspond to the randomness of ancestry between and within individual diploid genomes from a local population (Figure 3B). For a cohort of ideal chromosomes with uniform recombination and marker density, if ancestry is independent between homologous chromosomes, both Sb and Sw increase with reduced local ancestry correlation and more balanced parental contribution (Figure 3C).

For a given autosomal segment, we then calculate π, DXY, and FST between pure populations, as well as entropy metrics Sb and Sw in hybrid individuals for the same segment. To investigate whether effects 1–4 are all present in our system, the joint range among entropy, π, and DXY is shown in Figure 3D, which suggests that low ancestry randomness (low entropy) is likely associated with reduced π within species and elevated DXY between species. To further quantify such association, we estimated Pearson’s correlation coefficients (ρ) between entropy and the latter statistics (Figure 3E). These associations are strongly significant (Z-scores > 3). Consequently, reduced ancestry randomness in hybrids (effects 3 & 4) coincides with classical patterns of barrier loci between pure populations (effects 1 and 2). This analysis is not sufficient to exclude all alternative hypotheses. For instance, we cannot entirely exclude the possibility that patterns are driven by low-recombination regions (Sw,Sb) that experience linked selection (π) also have elevated mutation rates (DXY). Nonetheless, this alternative seems most unlikely, as low-recombination regions should typically be less rather than more mutable (Lercher and Hurst, 2002; Jensen-Seaman et al., 2004; Yang et al., 2015; Arbeithuber et al., 2015; Liu et al., 2017). Overall, adding information from hybrid populations strengthens the evidence for barrier loci acting across autosomes.

The Z chromosome was excluded from this analysis as it likely differs in mutation rate or effective population size (Presgraves, 2018), but its ancestry in hybrid individuals either retains purity or resembles very recent hybridization (long blocks of heterozygous ancestry, Figure 3—figure supplement 1). The Z chromosome has low ancestry randomness, and it also has the highest level of divergence (Figure 2D), both of which suggest strong barriers to gene flow between P. syfanius and P. maackii on this chromosome.

Asymmetric site patterns

A hint that substitution rates differ between the two species comes from site-pattern asymmetry (but this asymmetry alone is insufficient to establish the existence of divergent rates). We focus on two kinds of biallelic site patterns. In the first kind, choose three taxa (P1,P2,O1), with P1=syfanius, P2=maackii, while O1 is an outgroup. Assuming no other factors, if substitution rates are equal between P1 and P2, then site patterns (P1,P2,O1)=(A,B,B) and (B,A,B) occur with equal frequencies, where “A” and “B” represent distinct alleles. This leads to a D statistic describing the asymmetry between three-taxon site patterns:

(1) D3=DP1,P2,O1=nABBnBABnABB+nBAB,

where n is the count of a particular site pattern across designated genomic regions. A significantly non-zero D3 indicates strongly asymmetric distributions of alleles between P1 and P2.

In a second kind of site pattern test, we compare four taxa (P1,P2,O1,O2) and calculate a similar statistic between site patterns (A,B,B,A) vs (B,A,B,A):

(2) D4=DP1,P2,O1,O2=nABBAnBABAnABBA+nBABA

Classically, a significantly non-zero D4 suggests that gene flow occurs between an outgroup and either P1 or P2, thus it is widely used to detect hybridization (the ABBA-BABA test) (Durand et al., 2011; Hibbins and Hahn, 2022). Here, D4 is used more generally as an additional metric of site pattern asymmetry.

We compute both D3 and D4 on synonymous, nonsynonymous, and intronic sites, with P. syfanius and P. maackii samples taken from pure populations (KM and XY). For each type of site, we progressively exclude regions with local FST below a certain threshold, and report D statistics on the remaining sites in order to show the increasing site-pattern asymmetry in more divergent regions (Figure 4). D3 is significantly negative regardless of outgroup or site type for most FST thresholds, and D4 is also significantly negative for most outgroup combinations when computed across the entire genome (Z-scores are shown in Figure 4—figure supplement 1), proving that site-patterns are strongly asymmetric between P. syfanius and P. maackii. Importantly, the direction of asymmetry is nearly identical across all outgroup comparisons. This asymmetry cannot be attributed to batch-specific variation as all samples were processed and sequenced in a single run, and variants were always called on all individuals of P. syfanius and P. maackii. Sequencing coverage is normal for most annotated genes used in the analysis (Table 1), suggesting that asymmetry is not due to systematic copy-number variation that could affect variant calls. Nonetheless, two independent processes of evolution could explain observed asymmetric site patterns. In hypothesis I (Figure 5A, left), asymmetry is generated via stronger gene flow between P. syfanius and outgroups, leading to biased allele-sharing. In hypothesis II, site pattern asymmetry is due to unequal substitution rates between P. syfanius and P. maackii, which is further modified by recurrent mutations in all four outgroups (Figure 5A, right). We test each hypothesis below.

Figure 4 with 1 supplement see all
D statistics are unanimously negative.

For each data point, we choose an FST threshold (x-axis) and report D statistics on SNPs with a background FST (50 kb windows and 10 kb increments) no less than the given threshold. Error bars are standard errors estimated using block-jackknife with 1 Mb blocks. The aberrant behavior for the highest FST bins is we believe mostly due to low sample sizes for these bins.

Table 1
Coverage abnormality in SNPs from coding sequences (CDSs).

Abnormal coverage is inferred if the average coverage of a CDS exceeds twice of the median coverage of all CDSs in the genome.

Number of SNPs from CDSs with abnormal coverage1993
Total number of SNPs from all CDSs1060525
Unequal substitution rates between pure population of P. syfanius and P. maackii.

(A) Two hypotheses to explain negative D statistics. (B) The percentage of local gene trees (50 kb non-overlapping windows) where P. syfanius and P. maackii are together monophyletic. For each level of support, we filter out windows with Support(monophyly) and Support(paraphyly) below a given level, and report both the percentage of windows passing the filter (informative windows) and the percentage of monophyletic windows. (C) The distribution of P. syfanius branch lengths (Ls) relative to those of P. maackii (Lm) among highly supported monophyletic trees (Support ≥95). Each curve corresponds to a pairwise comparison between a syfanius individual and a maackii individual. Branch lengths are distances from tips to the most-recent common ancestor of all syfanius +maackii individuals. (D) Behavior of D3 under divergent substitution rates and recurrent mutations in outgroups (O1). (E) Behavior of D4 under divergent substitution rates and recurrent mutations in outgroups (O1 & O2). (F) Left: Median D3 is positively correlated with node height; Right: Median D4 is negatively correlated with (ΔNode height)/(ΣNode height). Node height is used as a proxy for the probability of recurrent mutation (pi).

Figure 5—source data 1

Concatenated gene trees based on 50 kb windows and the underlying support for each binary split.

https://cdn.elifesciences.org/articles/78135/elife-78135-fig5-data1-v2.zip
Figure 5—source data 2

Results of the signed-rank test on branch elongation in P. maackii.

https://cdn.elifesciences.org/articles/78135/elife-78135-fig5-data2-v2.zip

Hybridization with outgroups does not explain site-pattern asymmetry

We first test hypothesis I by phylogenetic reconstruction using SNPs in annotated regions. We construct local gene trees for each 50 kb non-overlapping window for all samples, including the four outgroup species. As biased gene flow with outgroups should rupture the monophyletic relationship among all P. syfanius +P. maackii individuals, the fraction of windows producing paraphyletic gene trees can be used to assess the potential impact of gene flow. However, almost all gene trees show the expected monophyletic relationship (Figure 5B). This conclusion is independent of the level of support used to filter out genomic windows with ambiguous topologies (see Materials and methods). Consequently, hardly any window shows a phylogenetic signal of hybridization with outgroups. Nonetheless, branch lengths in reconstructed gene trees suggest possible substitution rate divergence between the two species: Among highly supported monophyletic trees (bootstrap support ≥ 95), P. maackii (in populations YA, BJ, XY) is always significantly more distant than P. syfanius from the most recent common ancestor of P. maackii +P. syfanius (Figure 5C, significance levels reported via a Wilcoxon signed-rank test for each pair of individuals, see Figure 5—source data 2). Second, the direction of allele sharing in the D statistics is unanimously biased towards P. syfanius. If hypothesis I were true, hybridization with outgroups is required to take place mainly in the highland lineage. There is no evidence to support why the highland lineage should receive more gene flow based on current geographic distributions, as outgroups P. xuthus and P. bianor overlap broadly with both P. maackii and P. syfanius, while outgroup P. dialis is sympatric only with P. maackii (Condamine et al., 2013). Although it is possible that historical and modern geographic distributions differ, and archaic gene flow might have occurred preferentially with P. syfanius, it should still leave some phylogenetic signal of introgression. Overall, we find little evidence for biased hybridization required by hypothesis I.

One might worry that by rejecting hypothesis I, we also throw doubt on widely accepted conclusions of the ABBA-BABA test for gene flow in other systems (that a significantly nonzero D4 implies hybridization with outgroups) (Durand et al., 2011). However, in the next section, we show why D3 and D4 are fully consistent with hypothesis II, and so this is just a special case where the ABBA-BABA test produces a false positive for gene flow.

Evidence for divergent substitution rates

In hypothesis II, divergent substitution rates between P. maackii and P. syfanius interact with recurrent mutations in outgroups to produce asymmetric site patterns. To understand its effect on D statistics, consider a simplified model of recurrent mutation (Figure 5D,E), where a site in outgroup i mutates with probability pi, producing the same derived allele with probability c. When averaged across the genome, c can be treated as a constant, and pi increases with distance to the outgroup. In the absence of gene flow, for three-taxon patterns, recurrent mutations modify D3 by a factor of approximately (1-2cp1) (Figure 5D, see Materials and methods), and observed D3 will thus be positively correlated with p1. For four-taxon patterns, it can be shown that observed D4 is always negative due to larger probabilities of recurrent mutation in more distant outgroups (Figure 5E, see Materials and methods). Assuming no significant contribution of incomplete lineage sorting (Maddison and Knowles, 2006), the value of D4 becomes more negative with increasing Δpi/Σpi=(p2p1)/(p2+p1).

To test these signatures, we employ estimated node heights of outgroups in the mitochondrial tree (Figure 2C) as proxies for outgroup distance, and hence for the relative probability of recurrent mutation (ρi). In line with expected signatures, we find that observed D3 indeed increases with node height (Figure 5F, left), and observed D4 decreases with (Δ Node height/∑ Node height; Figure 5F, right). Thus, the directions and magnitudes of both D statistics are congruent with hypothesis II. As hypothesis II naturally predicts unanimously negative D3 and D4 as well as their relative magnitudes among different outgroup combinations, it is more parsimonious than hypothesis I. Hence, divergent substitution rates likely exist between P. syfanius and P. maackii.

Rate-mixing at migration-drift equilibrium

Having tested for the existence of divergent substitution rates, we now explore how they become mixed by gene flow using a coalescent framework. In particular, we will assess whether the conceptual picture in Figure 1 can be recovered quantitatively in the butterfly system. As gene flow is ongoing between the two lineages, consider two haploid populations of size N exchanging genes at rate m (Figure 6A). This simple isolation-with-migration (IM) model at equilibrium has relative divergence (Notohara, 1990)

(3) FST=11+4Nm
Figure 6 with 3 supplements see all
Divergence is correlated with increased differences in the relative number of substitutions.

(A) Behavior of the equilibrium isolation-with-migration model with divergent substitution rates. If coalescence is weaker than gene flow, each lineage has a similar number of derived alleles. If coalescence is stronger than gene flow, lineages sampled from the population with a faster substitution rate will also inherit more derived alleles. (B) Theoretical relationship between observed rate ratio (r) and relative divergence (FST), parameterized by the true ratio r0) of substitution rates. (C) Observed rate ratios between P. maackii (population XY) and P. syfanius (Population KM), partitioned by ten FST intervals. Error bars are standard errors calculated using 1 Mb block-jackknifing. (D) The theoretical relationship between r and FST is a good fit to observation.

To quantify the signature of mixed rates, consider the asymmetry in observed numbers of substitutions (circles in Figure 6A). Take a pair of sequences from two populations. Let n1 be the expected number of derived alleles exclusive to the sequence in population 1, and let n2 be the same expected number in population 2. Their ratio is defined as observed rate ratio:

(4) r=n2n1

Evidently, r=1 is the symmetric point where both sequences have the same number of derived alleles. Further, let the actual substitution rate in population 1 be μ1, and the actual substitution rate in population 2 be μ2. The ratio between the two actual rates are defined as the true rate ratio:

(5) r0=μ2μ1

At migration-drift equilibrium, observed rate ratio (r) and observed divergence (FST) are related by the following formula parameterized singly by r0 (Figure 6B, see Materials and methods):

(6) r=1+r0+FST(r01)1+r0FST(r01),

which translates into

(7) FSTr1r01ifr01

These formulae indicate that unequal substitution rates are more mixed in regions with lower genomic divergence. Equation 7 is surprising, because it reveals that the remaining fraction of substitution rate divergence ((r-1)/(r0-1)) is almost the same as FST, which corresponds to the fraction of genetic variance explained by population structure (Wright, 1949). In Figure 6B, the relationship between r and FST is still largely linear when one species evolves three times as fast as its sister species (r0=3), and so Equation 7 might be robust under biologically realistic rates of substitutions between incipient species. Using extensive simulations (Figure 6—figure supplement 1 to Figure 6—figure supplement 3), we show that the full formula (Equation 6) is also robust in a number of equilibrium population structures.

To test whether such predictions are met, we calculate r on synonymous, nonsynonymous, and intronic sites partitioned by their local FST values between pure populations (KM & XY), and recover a similar monotonic relationship across most FST partitions (Figure 6C). For introns, the observed relationship between r and FST is a near perfect fit to Equation 6 (Figure 6D, squares), with an estimated r0 = 1.837. For synonymous (Figure 6D, circles) and nonsynonymous sites (Figure 6D, triangles), Equation 6 also provides an excellent fit in regions with low to intermediate FST. Estimated r0 for synonymous sites is 1.813, close to that of introns, while nonsynonymous sites have a considerably lower r0 of 1.633. If introns and synonymous substitutions are approximately neutral, we infer that neutral substitution rates are about 80% greater in the lowland species.

Discussion

Entropy as a useful measure of ancestry randomness

A critical step in our analysis is to associate genomic islands with barrier loci. Conventionally, information on increased association between barrier sites in hybrid populations comes from two-locus linkage disequilibrium (Slatkin, 1975). Empirical studies on hybrid populations frequently use such statistics to strengthen the evidence for barriers to gene flow (Knief et al., 2019; Wang et al., 2022). Alternatively, if phased haplotypes can be sequenced, the length of ancestry tracts (Sedghifar et al., 2016) or the density of ancestry junctions (Janzen et al., 2018; Wang et al., 2022) also carry information on barrier loci. All these methods break down with small numbers of unphased samples, as forced phasing can produce a large number of false ancestry junctions. However, conventional notions such as “ancestry tract length” are not definable for unphased local ancestry. The dilemma forces us to consider more robust statistics that carry information on ancestry association even in unphased data.

The development of both entropy metrics follows three intuitions:

  • Vectorization: ancestry is a categorical concept, and it should map to a signal containing the contribution of each source population.

  • Conservation law: the total ancestry from all sources at a particular locus is conserved.

  • Highly (auto)correlated signals have a concentrated spectrum (low spectral entropy).

Since entropy carries information about the degree of ancestry correlation, it will decrease in regions of low recombination and high genetic relatedness. If genetic relatedness is produced by inbreeding, it should affect the entire genome in a similar way, and between-individual entropy (Sb) will be similar across different parts of the genome. However, if inbreeding is so severe that Sb is globally zero, it will not be an informative metric. It is worth noting that within-individual entropy Sw shares a conceptual similarity to the wavelet transform of ancestry signals (Pugach et al., 2011; Sanderson et al., 2015). These entropy metrics will be particularly useful for window-based genomic analysis of ancestry correlation with limited sampling, and they are also compatible with larger sample sizes and more than two parental species. (The formulae of entropy with two parental species are presented in Materials and methods, and mathematical details are discussed in Appendix 1.) Nonetheless, this entropy approach assumes local ancestry can be accurately inferred, which will be challenging for studies with low-coverage sequencing, non-chromosomal genome assembly, or lacking reference populations.

Potential mechanisms of divergent substitution rates

Interestingly, a higher substitution rate in the lowland lineage P. maackii is congruent with the evolutionary speed hypothesis (Rensch, 1959), where evolution accelerates in warmer climates. Our finding echoes the results of many previous studies (Gillooly et al., 2005; Wright et al., 2006; Lin et al., 2019; Ivan et al., 2022). Without measuring the per-generation mutation rates in both species, it is unclear what mechanisms cause increased substitution rates, but the lowland lineage typically has an additional autumn brood that is absent in P. syfanius (Takasaki et al., 2007; Figure 2—figure supplement 2). Warmer temperatures in lowland habitats might also increase spontaneous mutation rates in ectothermic insects (Waldvogel and Pfenninger, 2021). Both mechanisms could produce increased substitution rates in the lowland species. It is surprising that estimated r0 between synonymous sites and introns agree with each other under such a coarse framework, and estimated r0 for nonsynonymous sites is considerably lower. Less asymmetric rates for nonsynonymous substitutions could perhaps be explained by the nearly neutral theory (Ohta, 1993), which argues that many nonsynonymous mutations are mildly deleterious, and selection is more efficient in suppressing them in larger populations and slowing down substitutions. In the field, the lowland species often appears in larger numbers with well-connected habitats, while the highland species faces a highly heterogeneous landscape of the Hengduan Mountains, which could lead to differences in effective population sizes required by our expectation under the nearly neutral theory.

Are introgression tests robust to substitution rate variation?

An additional result from our study is that divergent substitution rates might produce spuriously non-zero D4 statistics when combined with recurrent mutations, which could increase the false positive rate of the ABBA-BABA test (Durand et al., 2011). This phenomenon has been suspected in humans (Amos, 2020), and is certainly a theoretical possibility (Hibbins and Hahn, 2022), but has not been tested in most empirical studies.

A wide class of introgression tests targeting gene flow between outgroups and a pair of taxa are based on site pattern information. Hahn’s D3 computes absolute sequence divergence between groups in a triplet of species (Hahn and Hibbins, 2019), and will be affected by unequal substitution rates in a similar way to our D3. Martin’s fd computes the same numerator as our D4 (Martin et al., 2015), so it could also produce false positives under similar situations. Related statistics include Dp (Hamlin et al., 2020), Df (Pfeifer and Kapan, 2019). A general guideline for site-pattern based statistics is that the focal pair of taxa are closely related such that substitution rates do not differ, while outgroups should not be too distant to minimize recurrent mutations (Hibbins and Hahn, 2022). However, whether these assumptions are met in empirical studies is worthy of investigation, and our system provides a counterexample even between sister species with ongoing hybridization.

A separate pitfall might occur if introgression tests based on site-patterns are applied to genomic windows to locate regions introgressed from outgroups. In our case, FST peaks have the most asymmetric substitution rates between P. maackii and P. syfanius, thus they will most likely be associated with false-positive D statistics. This could lead to the incorrect interpretation that some barrier loci (“speciation genes”) are introgressed from outgroups – a popular hypothesis in adaptive introgression and hybrid speciation, see Edelman and Mallet, 2021.

To this end, we speculate that using appropriate substitution models to infer gene tree topology will perform better in assessing the impact of introgression with outgroups.

The conceptual picture of rate-mixing

In the gray zone of incomplete speciation, interspecific hybrids bridge between gene pools of divergent lineages (Mallet, 2005). We here demonstrate a similar role of hybridization in coupling and mixing differing substitution rates. Divergent rates of substitution carry information about outgroups, while divergence based on allele frequency differences does not. Preserving divergent substitution rates is a stronger effect than maintenance of allele frequency differences, because divergence of allele frequencies is a prerequisite for rate preservation. This dependency can be coarsely quantified across the genome by the relationship between observed rate ratio r and relative divergence FST in an equilibrium system of hybridizing populations (Equation 6). At migration-drift equilibrium, it is not surprising that divergent substitution rates are associated with relative divergence. In Figure 6A, when coalescence occurs rapidly compared to gene flow, most substitutions separating individuals are species-specific. However, when gene flow is faster than coalescence, individuals will carry substitutions that occurred in both species. This could have important implications, because preserving lineage-specific substitution rates as measured by r might not require low absolute rates of gene flow. Instead, reducing effective population sizes via recurrent linked selection might achieve a similar result in populations at equilibrium (NFSTr).

The theory in its present form has several limitations. First, mutations follow the infinite-site model (Kimura, 1969), so that reverse mutations, double mutations, and substitution types are not accurately reflected in the estimated ratio between species-specific substitution rates. Second, population structure is assumed at equilibrium, whereas real data could carry footprints from non-equilibrium demographic processes (e.g. secondary contact) (Hey, 2010). Third, there could be considerable population structure within each species contributing to elevated FST but not necessarily to rate-divergence. This effect could be seen in simulated stepping-stone models (Figure 6—figure supplement 3) and will underestimate the level of rate divergence. Fourth, substitution is stochastic across the genome, and accurately estimating observed rate ratio r relies on averaging substitution numbers across a large number of sites. This poses a problem if sites linked to high FST regions are rare (i.e. few genomic islands). Lastly, as the theory is built upon the neutral coalescent, it is best suited for studying behaviors of neutral sites.

Nonetheless, the monotonic relationship between r and FST (i.e. larger sequence divergence is associated with more asymmetric substitution rates) might be qualitatively robust regardless of the aforementioned caveats. For instance, even for hybrid zones formed by recent secondary contact, reducing the absolute rate of gene flow by barrier loci in principle also keeps divergent rates from mixing, simply because it prevents substitutions accumulated in the allopatric phase from flowing between species.

In conclusion, our study characterizes several genomic consequences of the rate-mixing process when molecular clocks tick at different speeds between hybridizing lineages. This process provides new information on reproductive isolation but also leads to pitfalls in interpreting results of popular introgression tests. As this phenomenon is neglected in most studies of hybridization and speciation, its full scope awaits further investigation in both theories and empirical systems.

Materials and methods

Museum specimens and climate data

Request a detailed protocol

Museum specimens with verifiable locality data of all species were gathered from The University Museum of The University of Tokyo (Harada et al., 2012; Yago et al., 2021), Global Biodiversity Information Facility (The Global Biodiversity Information Facility, 2021b; The Global Biodiversity Information Facility, 2021a), and individual collectors (Figure 2—source data 5). Records of P. maackii from Japan, Korea and NE China were excluded from the analysis, so that most P. maackii individuals correspond to ssp. han, the subspecies that hybridizes with P. syfanius. Spatial principal component analysis was performed on elevation, maximum temperature of warmest month, minimum temperature of coldest month, and annual precipitation, all with 30s resolution from WorldClim-2 (Fick and Hijmans, 2017). The first two PCAs, combined with tree cover (Hansen et al., 2013), were used in MaxEnt-3.4.1 to produce species distribution models that use known localities to predict occurrence probabilities across the entire landscape (Phillips et al., 2017). Outputs were trimmed near known boundaries of each species. See Figure 2—figure supplement 3 for the final result.

Sampling, re-sequencing, and mitochondrial phylogeny

Request a detailed protocol

Eleven males of P. syfanius and P. maackii, with one male of P. arcturus and one male of P. dialis were collected in the field between July and August in 2018 (Figure 2—source data 5), and were stored in RNAlater at –20 °C prior to DNA extraction. E.Z.N.A Tissue DNA kit was used to extract genomic DNA, and KAPA DNA HyperPlus 1/4 was used for library preparation, with an insert size of 350 bp and 2 PCR cycles. The library is sequenced on a Illumina NovaSeq machine with paired-end reads of 150 bp. Adaptors were trimmed using Cutadapt-1.8.1, and subsequently the reads were mapped to the reference genome of P. bianor with BWA-0.7.15, then deduplicated and sorted via PicardTools-2.9.0. The average coverage among 13 individuals in non-repetitive regions varies between 20× and 30×. Variants were called twice using BCFtools-1.9 – the first including all samples, used in analyses involving outgroups, and the second excluding P. arcturus and P. dialis, used in all other analyses. The following thresholds were used to filter variants: 10N< DP <50N, where N is the sample size; QUAL > 30; MQ > 40; MQ0F < 0.2. As a comparison, we also called variants with GATK4 and followed its best practices, and 93% of post-filtered SNPs called by GATK4 overlapped with those called by BCFtools. We used SNPs called by BCFtools throughout the analysis. Mitochondrial genomes were assembled from trimmed reads with NOVOPlasty-4.3.1 (Dierckxsens et al., 2017), using a published mitochondrial ND5 gene sequence of P. maackii as a bait (NCBI accession number: AB239823.1). We also used the following published mitochondrial genomes (NCBI accession numbers): KR822739.1 (Papilio glaucus), NC_029244.1 (Papilio xuthus), JN019809.1 (Papilio bianor). The neighbor-joining mitochondrial phylogeny was built with Geneious Prime-2021.2.2 (genetic distance model: Tamura-Nei), and we used 104 replicates for bootstrapping. The reference genome of P. xuthus was previously aligned to the genome of P. bianor and we used this alignment directly in all analysis (Lu et al., 2019).

Calculating site-pattern asymmetry

Request a detailed protocol

Given a species tree {{P1,P2},O}, where P1 and P2 are sister species and O is an outgroup, if mutation rates are equal between {P1,P2}, and no gene flow with O, then on average the number of derived alleles in P1 should equal the number of derived alleles in P2. Let S be a collection of sites, fs be the frequency of a particular site pattern at site sS. “ABB” be the pattern where only P2 and O share the same allele, and “BAB” be the pattern where only P1 and O share the same allele, then the three-species D3 statistic is calculated as

(8) DP1,P2,O=sS(fs,ABBfs,BAB)sS(fs,ABB+fs,BAB),

where S is always limited to sites without polymorphism in the outgroup O. This statistic is in principle capturing the same source of asymmetry as the statistic proposed by Hahn and Hibbins, 2019, although their version uses divergence to the outgroup instead of frequencies of site-counts. Similarly, the four-species D4 statistic, which considers species tree {{{P1,P2},O1},O2} and site patterns ABBA versus BABA (Durand et al., 2011) is calculated as

(9) DP1,P2,O1,O2=sS(fs,ABBAfs,BABA)sS(fs,ABBA+fs,BABA),

where S is always limited to sites without polymorphism in the second outgroup O2. The significance of both tests was computed using block-jackknife over 1 Mb blocks across the genome. Additionally, we estimated rate ratio as follows. First we restricted to sites where all outgroups are fixed for the same ancestral allele to dampen the influence of recurrent mutation. Then, for each site, sample one allele at random from each focal lineage. Calculate the probability of observing a derived allele in P1 but not in P2, and the probability of observing a derived allele in P2 but not in P1. The rate ratio is computed as the ratio between the two probabilities. Explicitly, let I() be the identity function, and fs be the frequency of the derived allele, then:

(10) Rate ratio r=sSfs,P1(1fs,P2)ΠioutgroupsI(fs,i=0)sS(1fs,P1)fs,P2ΠioutgroupsI(fs,i=0)

Its standard error was estimated using 1 Mb block-jackknifing. We excluded P. xuthus from the outgroups to increase the number of informative sites when using this formula.

D3 and D4 under unequal substitution rates and recurrent mutations

Request a detailed protocol

In this section, we calculate observed D3 and D4 assuming that incomplete lineage sorting contributes insignificantly to both statistics. If incomplete lineage sorting is present, it will not create new bias (numerators are on average unchanged), but will likely dampen existing bias (inflating denominators).

As substitutions are independent along each lineage, we can mute recurrent mutations in outgroups and generate them afterwards. For three taxa with gene tree {{P1,P2},O1}, before recurrent mutations, there are n1 sites with pattern (B,A,A), and n2 sites with pattern (A,B,A). If substitution rate is higher in P2, we have n2>n1, so the true value of D3, written as D^3, is always negative:

(11) D^3=n1n2n1+n2<0

Next, recurrent mutations in O1 occur at each site with an average probability p1, and with an average probability c, ancestral alleles from affected sites in O1 are converted to the same derived allele in {P1,P2}. c will be independent of n1 and n2, as long as substitutions between {P1,P2} are only different in rates, but not mutation types. Hence, two possible mutation paths exist:

(12) (A,B,B)p1(A,B,)c(A,B,A)(B,A,B)(B,A,B)p1(B,A,)c(B,A,A)(A,B,B)

The expected site counts, after recurrent mutations, become

(13) nABB=(1p1)n1+cp1n2nBAB=(1p1)n2+cp1n1

Using the new expected site counts in D3 statistics produce the following value:

(14) D3=1(c+1)p11+(c1)p1D^3(12cp1)D^3

Since D^3 is negative, it grows approximately linearly near small values of p1. (The full equation is still monotonic in p1.)

Similarly, for four-taxon statistics, before recurrent mutation, there are two types of sites: (A,B,B,B)- n1; (B,A,B,B)- n2. Suppose the average probability of recurrent mutation is p1 in O1, and p2 in O2, and the conversion probability of each recurrent mutation into derived alleles of {P1, P2} is c for both outgroups. Using the same procedure, one can show that

(15) D4=p2p1p2+p1D^3

Since recurrent mutations occur more frequently in distant outgroups, p2>p1. Because D^3 is negative, we have D4<0.

Local gene trees

Request a detailed protocol

Local gene trees were estimated using iqtree-2.0 (Minh et al., 2020) on 50 kb non-overlapping genomic windows with options -m MFP -B 5000. Only SNPs from annotated regions (synonymous sites +nonsynonymous sites +introns) across all individuals were used. For diploid individuals, heterozygous sites were assigned IUPAC ambiguity codes and iqtree assigned equal likelihood for each underlying character, thus information from heterozygous sites is largely retained. This is crucial as we are interested in the branch length of inferred trees. Option -m MFP implements iqtree’s ModelFinder that tests the FreeRate model to accommodate maximum flexibility with rate-variation among sites (Kalyaanamoorthy et al., 2017). We also used UltraFast Bootstrap to calculate the support for different types of splits in each window (the -B 5000 option; Hoang et al., 2018). In each window, we extracted the support for monophyly among P. maackii +P.syfanius directly from the output of UltraFast Bootstrap, and we define the support for paraphyly among P. maackii +P.syfanius as (100 - the support for monophyly). For each level of support, we filtered out genomic windows where both the support for monophyly and the support for paraphyly drop below the given level. The remaining windows were considered informative.

Rate-mixing under the equilibrium IM model

Request a detailed protocol

We construct a continuous-time coalescent model as follows. Both populations have N haploid individuals, gene flow rate is m, and coalescent rate is N-1 in each population. In the equilibrium system, as we track both haploid individuals backward in time, there are six distinct states: (1|2),(2|1),(1,2|),(|1,2),(0|),(|0), where 1 and 2 represent two individuals prior to coalescent, 0 is the state of coalescent, and (|) shows the location of each lineage. Its transition density B(t) satisfies tB=AB, where A is given as

(16) (-2m0mm000-2mmm00mm-2m-1N000mm0-2m-1N00001N0-mm0001Nm-m)

Let Si|j(T) be the mean sojourn time of an uncoalesced individual inside population i during 0tT, conditioning on the individual being taken from population j at t=0. Assuming the infinite-site mutation model, let μi be the substitution rate in population i, observed rate ratio r is thus

(17) r=μ2S2|2()+μ1S1|2()μ1S1|1()+μ2S2|1()

where (due to symmetry)

(18) S1|1()=S2|2()=0+(1,0,0,1,0,0)eAt(1,0,0,0,0,0)Tdt=1+2Nm2mS2|1()=S1|2()=0+(0,1,1,0,0,0)eAt(1,0,0,0,0,0)Tdt=N

Let r0=μ2/μ1, and since FST=(1+4Nm)-1, we have

(19) r=1+r0+FST(r01)1+r0FST(r01)

Local ancestry estimation

Request a detailed protocol

Software ELAI (Guan, 2014) with a double-layer HMM model was used to estimate diploid local ancestries across chromosomes. An example command is as follows: elai-lin -g genotype.maackii.txt -p 10 -g genotype.syfanius.txt -p 11 -g genotype.admixed.txt -p 1 -pos position.txt -s 30 -C 2 -c 10 -mg 5000 --exclude-nopos.

Note that -mg specifies the resolution of ancestry blocks, thus increasing its value will increase the stochastic error of incorrectly inferring very short blocks of ancestry. To control for uncertainty, we estimated repeatedly for 50 times. All replicates were used simultaneously in finding the correlation coefficients between entropy and other variables. Results from an example run is in Figure 3—figure supplement 1.

Ancestry and entropy

Here we introduce concisely the data transformation framework for calculating the entropy of local ancestry. The mathematical detail of this approach is presented in the appendix.

Ancestry representation

Request a detailed protocol

The space of all ancestry signals is high-dimensional, and directly calculating the entropy in this space is not feasible with just a few individuals. So we propose to measure only the pairwise correlation of ancestries among sites, which captures only the second-order randomness, but is sufficient for practical purposes. Consider a hybrid individual with two parental populations indexed by k=1,2. Assuming a continuous genome, let pk(l)=0,12,1 be the diploid ancestry of locus l within genomic interval [0,L]. By definition, we have p1(l)+p2(l)=1, that is the total ancestry is conserved everywhere in the genome. The bi-ancestry signal at locus l is defined as the following complex variable

(20) z(l)=p1(l)+ip2(l)=eiarccosp1(l),

where i=-1 is the imaginary number. An advantage of using a complex representation for the bi-ancestry signal is that we can model different ancestries along the genome as different phases of a complex unit phasor (eiθ), such that the power of the signal at any given locus is simply the sum of both ancestries, which is conserved (|z(l)|2=1). It ensures that we do not bias the analysis to any particular region or any particular individual when decomposing the signal into its spectral components.

Within-individual spectral entropy (Sw)

Request a detailed protocol

To characterize the average autocorrelation along an ancestry signal at a given scale l, define the following scale-dependent autocorrelation function

(21) A(l)=Re[1L0Lz(ξ)z(ξ+l)¯dξ],

where z(l) is understood as a periodic function such that z(ξ+l)=z(ξ+l-L) whenever the position goes outside of [0,L]. The Wiener-Khinchin theorem guarantees that z(l)’s power spectrum ζ(f), which is discrete, and the autocorrelation function A(l) form a Fourier-transform pair. Due to the uncertainty principle of Fourier transform, A(l) that vanishes quickly at short distances (small-scale autocorrelation) will produce a wide ζ(f), and vice versa. So the entropy Sw of ζ(f), which measures the spread of the total ancestry into each spectral component, also measures the scale of autocorrelation. In practice, ζ(f) is the square modulus of the Fourier series coefficients of z(l), and we fold the spectrum around f=0 before calculating the within-individual entropy Sw. The formula used in the manuscript is

(22) Sw=n=0+ζnlnζnζn={|Zn|2+|Zn|2(n>0)|Z0|2(n=0)

where Zn are the Fourier coefficients from the expansion z(l)=n=-+Znei2πnl/L. To speed up the Fourier expansion, we could densely pack equally-spaced markers that sample a continuous ancestry signal into a discrete signal, which then undergoes Fast Fourier Transform (FFT). The spectrum of FFT (discrete and finite) approximates the continuous-time Fourier spectrum (discrete and infinite), and entropy also converges as marker density increases.

Between-individual spectral entropy (Sb)

Request a detailed protocol

As ancestry configuration is far from random around barrier loci, it will also influence the correlation of ancestry between different individuals at the same locus. For a genomic region experiencing strong barrier effects, two individuals could either be very similar in ancestry, or very different. This effect can be quantified by first calculating the cross-correlation Cj,j(l)=zj(l)zj(l)¯ at position l between individuals j and j, and then averaging across a genome interval: cj,j=1L0LCj,j(l)dl. The J×J dimensional matrix C with entries cj,j describes the pairwise cross-correlation within the cohort of J individuals. We also have cj,j1 as each individual is perfectly correlated with itself. The matrix C is Hermitian, so it has a real spectral decomposition with eigenvalues λj that satisfy jλj/J=1. This process is very similar to performing a principal component analysis on the entire cohort of individuals, and λj/J describes the fraction of the total ancestry projected onto principal component j. If many loci co-vary in ancestry, the spectrum {λj} will be concentrated near the first few components. Similarly, we use entropy to measure the spread of the spectrum, and hence the between-individual spectral entropy is defined as

(23) Sb=-jλjJlnλjJ

Appendix 1

Representation of ancestry on a hybrid chromosome

The ancestry of a hybrid depends on the pure reference populations. The following assumptions are used throughout this section:

  • There are a finite number of pure reference populations indexed by k{1,2,,K}. In most cases, we are only interested in K=2, for instance, a hybrid zone between two lineages.

  • The chromosome has so many sites so that it can be treated as a contiguous rod with length L. l[0,L] is the index of positions on a chromosome.

  • The species’ ploidy is np (completely phased data ⇔ np=1). When not specified, we assume the data is unphased, so that the ancestry on a particular chromosome always refers to the ancestry on a collection of np homologous chromosomes.

  • Ancestry can be inferred. In practice, regions with low density of informative SNPs will make inference difficult.

As "ancestry" is a categorical variable (i.e., there is no intrinsic order among the reference populations), to quantify the correlation of ancestry along a chromosome, we need to map the contribution of each ancestry category to some numeric value that can be used to calculate correlation.

The first choice is to use the probability of each category. For unphased data, let pk(l) be the fraction of the np-ploid chromosome at position l coming from reference population k. The vector-valued function B(l)=(p1(l),p2(l),,pK(l)) completely describes the ancestry on a single chromosome. By definition, kpk(l)1. This is the conservation of total ancestry at any position. However, by this representation, the correlation B(l1)B(l2), which is the product of ancestries at different positions, will have units of the square of probabilities. If a locus is to be compared with itself, then the correlation will be

(24) B(l)B(l)=kpk2(l)1

Ideally, we want the correlation of ancestry of a locus with itself to be the same regardless of its ancestry configuration. Equation (24) does not meet this criteria as it depends on B. Here we borrow some concepts from signal-processing theory. The conservation of total ancestry means that ancestry is more analogous to the "energy" of a signal, which is in units of [squared-signal], rather than the signal itself. Therefore, we could instead define a second type of ancestry representation by taking the square roots of probabilities.

Definition 1 (Spherical representation of ancestry)

If the probability representation of ancestry is B(l), then the spherical representation y(l) takes the square-root of each element of B(l):

(25) y(l)=(p1(l),p2(l),,pK(l))

Following this representation, the autocorrelation between two spherical ancestries becomes a natural measure of the similarity between two probability distributions:

Definition 2 (Correlation between two spherical ancestries)

The autocorrelation function between y(l1) and y(l2) is the following quantity known as the Bhattacharyya coefficient, also known as the fidelity measure in information theory:

(26) A(l1,l2)=y(l1)y(l2)=kyk(l1)yk(l2)[0,1]

Further, self-correlation is always 1:

(27) A(l,l)=y(l)y(l)=kpk(l)1

The conservation of self-correlation will be important when we decompose ancestry into its spectral components, as it will not bias our analysis to any particular region of the chromosome. There is also a geometric meaning associated with this representation. Since y(l) has a L2-norm of 1, each ancestry configuration corresponds to a point on the unit sphere in K, and different ancestries are represented by the orientation of this vector in K.

In many studies we are dealing only with two reference populations, such as hybrid zones between a pair of divergent lineages. This situation allows a more compact representation of ancestry using complex numbers:

Definition 3 (Complex representation of ancestry)

If K=2, define the following complex variable z(l) as the representation of ancestries:

(28) z(l)=p1(l)+ip2(l)

where i=-1 is the imaginary number.

This seemingly artificial definition is not the first time that a complex number is used to model a physical phenomenon. In quantumn mechanics, the quantumn wave function of a particle is represented by a complex wave with the probability of occurrence measured by the square-modulus of the wave. Here, we are also using the square-modulus of z(l) to represent the total ancestry of a given locus.

A complex-valued signal z(l), like any real signal, has the definition of autocorrelation:

Definition 4 (Correlation between two complex ancestries)

The correlation between two complex ancestries z(l1) and z(l2) is the following product:

(29) A(l1,l2)=z(l1)z(l2)¯=p1(l1)p1(l2)+p2(l1)p2(l2)+i(p2(l1)p1(l2)p1(l1)p2(l2))

By this definition, the real part Re[A(l1,l2)] of the autocorrelation is just the Bhattacharyya coefficient between two ancestry configurations on l1 and l2, which is a measure of similarity. The absolute value of the imaginary part of the autocorrelation |Im[A(l1,l2)]| measures the volume of the parallelogram spanned by vectors y(l1) and y(l2), thus it is a measure of dissimilarity. Further, the self-correlation A(l,l) is also conserved for all l, as Im[A(l,l)]0 and Re[A(l,l)]1.

The complex representation possesses some useful properties not shared with the spherical representation, but it also limits our analysis to cases with only two reference populations. This will not pose a problem for the study of hybrid zones between a pair of parapatric lineages.

Mean autocorrelation vs. hybrid index

The hybrid index is defined as the average ancestry in a hybrid from a given reference population over a set of loci. It is usually calculated for each individual or for each chromosome. In our notation:

Definition 5 (Hybrid index)

The hybrid index over a given genomic interval [0,L] for reference population k is

(30) hk=1L0Lpk(l)dl

In contrast, the average of the spherical or the complex representation provides information about autocorrelation within the chromosome instead of the mean ancestry.

Theorem 1 (Mean autocorrelation)

With the spherical or the complex representation, the squared L2-norm (or the squared modulus, when a complex representation is available) of the average signal represents the mean autocorrelation of the ancestry on the genomic interval [0,L].

(31) a:=1L2[0,L]2A(l1,l2)dl1dl2=1L0Ly(l)dl2=|1L0Lz(l)dl|2

Proof. For a general spherical representation y(l), we have

1L0Ly(l)dl2=1L2k(0Lpk(l)dl)=1L2k0Lpk(l1)dl10Lpk(l2)dl2=1L2[0,1]2kpk(l1)pk(l2)dl1dl2=1L2[0,L]2A(l1,l2)dl1dl2

For a complex representation z(l), notice that A(l1,l2)=A(l2,l1)¯, so

1L2[0,L]2A(l1,l2)dl1dl2=1L2[0,L]2ReA(l1,l2)dl1dl2=1L2[0,1]2k=1,2pk(l1)pk(l2)dl1dl2

This guarantees that the mean autocorrelation when using a complex representation is the same as the mean autocorrelation when using a spherical representation with K=2. Additionally,

|1L0Lz(l)dl|2=1L20Lz(l1)dl10Lz(l2)dl2¯=1L2[0,L]2z(l1)z(l2)¯dl1dl2=1L2[0,L]2A(l1,l2)dl1dl2

This completes the proof.

The quantity a is a measure of the average similarity of the ancestry configurations. It does not consider the genomic position of different ancestry configurations, so it does not contain information about whether similar ancestry configurations are clustered together.

The entropy of ancestry

To characterize the scale of correlation, the distance between loci is important because correlation often drops while distance between loci increases. The information about the spatial scale of correlation is retained when the full spectrum of correlation is considered within a single individual.

A second source of correlation arises when we consider the relationship between individuals. To convey the main idea, let’s compare between a set of ancestry signals from an inversion and a set of ancestry signals from a regular region subject to normal recombination. For the inversion, since recombination between chromosomes of different ancestry is often completely suppressed, the ancestry signal along the inversion will be close to constant. When two haploid individuals are compared at the inverted region, they are either different everywhere in terms of ancestry, or completely the same. Comparing between diploid individuals is similar, although the difference is more fine-grained due to the presence of the heterozygotes. However, in a collinear region subject to a regular rate of recombination, any ancestry signal will switch stochastically between states. In the latter case, the similarity between two ancestry signals will be similar across multiple pairwise comparisons. It will be helpful to think in the principal component (PC) space spanned by all individuals’ ancestry signals. Ancestry signals from an inversion will form several tight clusters in the PC space, while those from a regular region will form a single cloud with a larger dispersion.

Both sources of correlation, and hence both types of randomness, can be measured using entropy in information theory.

Definition 6 (Shannon entropy)

The Shannon entropy for a discrete probability distribution {pj}, j, is defined as

(32) S({pj})=-jpjlnpj

For a continuous distribution with probability density function p(x), x, the Shannon differential entropy is defined as

(33) S(p(x))=-p(x)lnp(x)dx

For any nonnegative series {pj} or nonnegative function p(x), suppose they converge upon summation/integration, we use the same notations S({pj}) and S(p(x)) to denote the entropy after they are appropriately normalized to probability distributions.

Shannon entropy (or any other entropy measure) is a useful measure of the spread of a distribution over its entire configuration space. When the distribution is concentrated (low randomness, high certainty), S will be low. S=0 if and only if pj=1 for some j.

Entropy associated with the correlation within a single individual

Definition 7 (Fourier spectrum of the spherical representation)

For the spherical representation of ancestry on genomic interval [0,L]

y(l)=(p1(l),p2(l),,pK(l)),

expand each component into its Fourier series:

pk(l)=n=+p^k,nei2πLnl

The folded Fourier spectrum is defined as

(34) ζn={2k=1K|p^k,n|2(n>0)k=1K|p^k,0|2(n=0)

Definition 8 (Fourier spectrum of the complex representation)

For the complex representation of ancestry on genomic interval[0,L]

z(l)=p1(l)+ip2(l),

expand z(l) into its Fourier series:

z(l)=n=+Znei2πLnl

The folded Fourier spectrum is defined as

(35) ζn={|Zn|2+|Zn|2(n>0)|Z0|2(n=0)

The following theorem guarantees that the folded spectrum ζn is the same for bi-ancestry signals using either representation.

Theorem 2

When K=2, the folded Fourier spectrum ζn is the same for y(l)=(p1(l),p2(l)) and z(l)=p1(l)+ip2(l).

Proof. Expand each component into its Fourier series:

p1(l)=n=+p^1,nei2πLnlp2(l)=n=+p^2,nei2πLnl

The folded spectrum using the spherical representation is

ζn={2|p^1,n|2+2|p^2,n|2(n>0)|p^1,0|2+|p^2,0|2(n=0)

Using the linearity of Fourier expansion, we have

Zn=p^1,n+ip^2,n

Thus,

ZnZn¯=(p^1,n+ip^2,n)(p^1,n¯ip^2,n¯)=p^1,np^1,n¯+p^2,np^2,n¯+ip^2,np^1,n¯ip^1,np^2,n¯=p^1,np^1,n¯+p^2,np^2,n¯2Im(p^2,np^1,n¯)

Since the Fourier coefficients of a real function at opposite frequencies are complex conjugates to each other, we have

ZnZn¯=p^1,np^1,n¯+p^2,np^2,n¯2Im(p^2,np^1,n¯)=p^1,n¯p^1,n+p^2,n¯p^2,n2Im(p^2,n¯p^1,n)

Finally,

|Zn|2+|Zn|2=ZnZn¯+ZnZn¯=2p^1,np^1,n¯+2p^2,np^2,n¯2Im(p^2,np^1,n¯+p^2,n¯p^1,n)

Since p^2,np^1,n¯+p^2,n¯p^1,n is real, the last term of the previous equation becomes zero. For the zero-th component ζ0, as both p^1,0 and p^2,0 are real, there is no difference between the two representations. This completes the proof.

A nice property of the Fourier spectrum ζn is that it can be interpreted as a probability distribution of ancestry among different frequency components. By Parseval’s theorem, it is easy to verify that nζn=1L0L(kpk(l))dl=1. From the Wiener-Khinchin theorem, the unfolded spectrum forms a Fourier transform pair with the autocorrelation function of the original signal. The significance of the Wiener-Khinchin theorem is that information about the autocorrelation of the original signal can now be extracted from the Fourier spectrum ζn.

Definition 9 (Within-individual entropy)

Let Sw=-nζnlnζn. Sw is the Shannon entropy of the folded Fourier spectrum ζn. As Sw captures the correlation structure within each individual, we also call it the within-individual entropy.

Formally, we have the following uncertainty principle relating the within-individual entropy to the scale of autocorrelation.

Theorem 3 (Entropic uncertainty)

Let A(l)=1L0Ly(x)y(x+l)dx be the average autocorrelation at scale l (0lL) for the spherical ancestry. Here, y is understood as a periodic function of period L. The following inequality holds:

(36) Sw=S({ζn})0LA2(l)QlnA2(l)Qdl+(1L0LA(l)dl-1)ln2+lnL

where Q=0LA2(l)dl is a normalization factor. Note that the right-hand-side of the inequality is invariant under linearly re-scaling A(l) to a different interval [0,L]. Thus, we can also write the inequality compactly, supposing A(l) has been rescaled to [0,2], as

(37) Sw-S(A2(l))+aln2,

where a is the average autocorrelation defined in Equation 31.

Proof. (i) In this part, we establish the Wiener-Khinchin relation that A(l) expands into a Fourier series with coefficients ηn=k=1K|p^k,n|2:

(38) A(l)=n=+k=1K|p^k,n|2ei2πLnl

The derivation is as follows:

(39) A(l)=1L0LK=1Kyk(x)yk(x+1)dx=K=1K1L0L(n=+P^k,nei2πLnx)(n=+P^k,nei2πLn(x+1))dx=K=1Kn=+ei2πLnl1L0LP^k,nP^k,ndx=n=+K=1K|P^k,n|2ei2πLnl=n=+ηnei2πLnl

(ii) Let 0LA2(l)dl=Q. It is obvious that {ηn/Q} are Fourier series coefficients of A(l)/Q, they obey the Hausdorff-Young inequality

(40) (n=+|ηn/Q|q)1q(01|A(xL)/Q/L|qdx)1q,

where q(1,2) and 1/q+1/q=1. Let ϕ(q)=(01|A(xL)/Q/L|qdx)1q(n=+|ηn/Q|q)1q. Since Fourier series preserve the 2-norm, ϕ(2)=0, and since ϕ(q)0 for q(1,2), we have ϕ(2)0. This translates into

(41) (01|A(xL)/Q/L|2dx)12{14ln01|A(xL)/Q/L|2dx+1201|A(xL)/Q/L|2ln|A(xL)/Q/L|dx01|A(xL)/Q/L|2dx}(n=+|ηn/Q|2)12{14ln+|ηn/Q|212n=+|ηn/Q|2ln|ηn/Q|n=+|ηn/Q|2}0,

which yields

(42) S({ηn2})=-n=-+ηn2Qlnηn2Q0LA2(l)QlnA2(l)Qdl+lnL

Note that the quantity 0LA2(l)QlnA2(l)Qdl+lnL is actually independent of L upon re-scaling.

(iii) Next, we show that S({ηn})S({ηn2}). Since nηn=1 and ηn is nonnegative, we can always re-order them into a descending series η^n (n0) with the same entropy as S({ηn}) (because entropy is permutation-invariant). The result follows as long as S({η^n})S({η^n2}). Let βn=η^n+1/η^n1. The two series, after normalization, can be written as

(43) {η^n}=η^0,η^0β0,η^0β0β1,η^0β0β1β2,{η^n2/Q}=η^02Q,η^02Qβ02,η^02Qβ02β12,η^02Qβ02β12β22,

Consequently,

(44) η^0(1+β0+β0β1+)=η^02Q(1+β02+β02β12+)=1

This implies that η^0η^02/Q. Suppose there exists j0 such that

(45) η^0(1+β0+β0β1++β0β1βj)>η^02Q(1+β02+β02β12++β02β12βj2)

Then there exists jj such that η^0β0β1βj>η^02Qβ02β12βj2. So for any jj, we have η^0β0β1βj>η^02Qβ02β12βj2. The difference

(46) Δj=η^0(1+β0+β0β1++β0β1βj)-η^02Q(1+β02+β02β12++β02β12βj2)

will always be positive and monotonically increases for any jj. This is not consistent with the fact that limjΔj=0. Thus, we conclude that for any j, the partial sum follows the inequality

(47) η^0(1+β0+β0β1++β0β1βj)η^02Q(1+β02+β02β12++β02β12βj2)

This is to say that infinite sequence {η^02/Q} majorizes {η^n}. By Theorem 2.2 of Li and Busch, 2013, S({η^n})S({η^n2}), and so S({ηn})S({ηn2})

(iv) Finally, since ζn=ηn+η-n=2ηn for n1, we have

(48) S({ζn})=ζ0lnζ0n12ηnln(2ηn)=ζ0lnζ0n1(ηn+ηn)(lnηn+ln2)=S({ηn})(1η0)ln2,

which is equivalent to

(49) S({ζn})+(1-1L0LA(l)dl)ln2=S({ηn})

Combining previous four steps yields the result of the theorem.

Entropy associated with the correlation between individuals

Definition 10 (Entropy of a linear operator)

Let be a compact self-adjoint linear operator on a Hilbert space . If has a countable set of eigenvalues {νi}, and is positive semidefinite, then define the entropy of as

(50) S():=-iνiTrlnνiTr,

where Tr=iνi is the trace of the linear operator 

Definition 11 (Mercer’s spectrum)

Let Aj(l1,l2) be the autocorrelation function for individual j(1jJ), and define the average autocorrelation function as

A(l1,l2)=1JjAj(l1,l2).

Since Aj is Hermitian, the average A is also Hermitian, thus the integral operator IA defined by IAϕ(l)=0LA(l,s)ϕ(s)ds has a series of real eigenvalues {νj} satisfying jνj=L. νj is the solution to the eigenvalue problem:

(51) 0LA(l,s)ϕj(s)ds=νjϕj(l)

The spectrum defined by {νj/L} is the Mercer’s spectrum.

Definition 12 (Cross-correlation spectrum)

Let the cross-correlation matrix be C={cj,j}J×J, where cj,j is the average cross-correlation between individual j and j:

(52) cj,j={1L0Lyj(l)yj(l)dl(for the spherical representation)1L0Lzj(l)zj(l)¯dl(for the complex representation)

As C is Hermitian, it has a series of real eigenvalues λj satisfying jλj=J. Let the normalized spectrum of C be {λj/J}, then {λj/J} is defined as the cross-correlation spectrum.

The following theorem states that when the complex representation is adopted for bi-ancestry signals, the Mercer’s spectrum coincides with the cross-correlation spectrum, so that even if the Mercer’s spectrum is calculated using correlation within individuals, both captures the correlation between individuals.

Theorem 4

If {zj(l)}j=1J is a collection of complex bi-ancestry signals, then its Mercers’ spectrum is the same as its cross-correlation spectrum.

Proof. The average autocorrelation A is computed as

A(l1,l2)=1Jjzj(l1)zj(l2)¯

So the integral equation (51) becomes

1Jjzj(l)0Lzj(s)¯ϕ(s)ds=νϕ(l),

where (ν,ϕ) forms the solution to the above equation. Rearranging the terms, we write

ϕ(l)=j[ν-1J0Lzj(s)¯ϕ(s)ds]zj(l)=jαjzj(l),

where αj stands for the constant inside the bracket. Substituting into the original integral equation, we have

1Jjzj(l)0Lzj(s)¯jαjzj(s)ds=νjαjzj(l),

which is equivalent to

j[jαj0Lzj(s)zj(s)¯ds]zj(l)=Jνjαjzj(l)

Notice that the integral in the bracket contains the cross-correlation between j and j, so if the following relationship holds

jαj1L0Lzj(s)zj(s)¯ds=jαjcj,j¯=JνLαj

then the system has a solution of ν. The above equation is equivalent to

C¯α=JνLαCα¯=JνLα¯

Which means that Jν/L is the eigenvalue of the cross-correlation matrix C, and since C is a Hermitian matrix, the existence of its real eigenvalues is guaranteed. So for each νi, we have λj=Jνj/L, which is to say

λjJ=νjL,for i

As both spectra coincide for the complex representation, we can define the between-individual entropy using either of them.

Definition 13 (Between-individual entropy of the complex representation)

For complex bi-ancestry signals, let Sb:=S(IA)=S(C). Sb is the Shannon entropy characterizing the between-individual correlation.

Corollary 1 (Maximum Sb).

For J diploid individuals with the complex representation of ancestry, the largest attainable Sb is given by

(53) Sb,max=(J1)(1c)Jln1cJ1+(J1)cJln1+(J1)cJ,(1c)lnJ(1c)ln(1c)clnc(J1)

where c=(3+22)/8.

Proof. For entropy to be large, autocorrelation within each individual must be very weak. This will occur if a sufficient amount of recombination has taken place such that the ancestry across the genome is largely independent between loci. The maximum value will occur when both parental lineages contribute equally to the hybrid ancestry, so that the hybrid index h=0.5. The heterozygosity H is then 0.5 following a random distribution of ancestries along the two genome copies. The off-diagonal elements in the cross-correlation matrix C thus take the value

c=2×(12+2-1212)2=3+228

For a J×J matrix whose diagonal elements are all 1, and whose off-diagonal elements are all c, its normalized eigenvalues are given by {1+(J-1)cJ,1-cJ,,1-cJ}, which leads to the above result.

Corollary 2 (Minimum Sb). Sb=0 is the minimum between-individual entropy, and is attainable if recombination is completely suppressed in the genomic interval of interest.

Proof. If recombination is completely suppressed along [0,L], the ancestry signal is constant along the interval in any individual with any ploidy. Hence, Aj1, and A1. The only non-zero eigenvalue associated with a constant integral kernel is ν=L, so that we have Sb=0 using the Mercer’s spectrum. Note that the converse is not true, because the diploid ancestry signal is unaware of phase. A region completely heterozygous may in fact has a recombination break point, and the ancestry can flip phases when crossing the break point. However, this extreme situation is unlikely as it requires two separate recombination events to have occurred at precisely the same point and in opposite directions. If the hybrid zone is old, a long track of heterozygous ancestry is therefore usually good evidence for the presence of barrier loci.

Data availability

Source code is available at: https://github.com/tzxiong/2021_Maackii_Syfanius_HybridZone, (copy archived at swh:1:rev:41b220f489e17ff9795e5a0666e9579a00b2b3b8) Whole-genome sequences are deposited in the National Center for Biotechnology Information, Sequence Read Archive (BioProject Accession Number: PRJNA765117).

The following data sets were generated
    1. Xiong T
    2. Mallet J
    (2022) NCBI Sequence Read Archive
    ID PRJNA765117. Hybridization between Papilio syfanius and Papilio maackii.
    1. Xiong T
    (2022) GitHub
    ID 2021_Maackii_Syfanius_HybridZone. Database for the code used in the 2021 manuscript on the hybrid zone between Papilio syfanius and Papilio maackii.
The following previously published data sets were used
    1. Lu S
    2. Yang J
    3. Dai X
    4. Xie F
    5. He J
    6. Dong Z
    7. Mao J
    8. Liu G
    9. Chang Z
    10. Zhao R
    11. Wan W
    12. Zhang R
    13. Li Y
    14. Wang W LX
    (2019) GigaDB Datasets
    Supporting data for "Chromosomal-level reference genome of Chinese peacock butterfly (Papilio bianor) based on third-generation DNA sequencing and Hi-C analysis".
    https://doi.org/10.5524/100653

References

    1. Barton NH
    (1983) Multilocus clines
    Evolution; International Journal of Organic Evolution 37:454–471.
    https://doi.org/10.1111/j.1558-5646.1983.tb05563.x
    1. Harada M
    2. Teshirogi T
    3. Ozawa H
    4. Yago M
    (2012)
    Catalogue of the Suguru Igarashi insect collection, Part I. Lepidoptera, Papilionidae
    The University Museum, The University of Tokyo, Material Reports 94:1–390.
    1. Hudson RR
    (1990)
    Gene genealogies and the coalescent process
    Oxford Surveys in Evolutionary Biology 44:31.
    1. Kashiwabara S
    (1991)
    Why are Papilio dehaanii from Tokara Is And Izu Is. Beautiful?
    Choken-Field 6:1–16.
    1. Li Y
    2. Busch P
    (2013) Von Neumann entropy and majorization
    Journal of Mathematical Analysis and Applications 408:384–393.
    https://doi.org/10.1016/j.jmaa.2013.06.019
  1. Book
    1. Rensch B
    (1959) Evolution Above the Species Level
    New York, United States: Columbia University Press.
    https://doi.org/10.7312/rens91062
    1. Takasaki H
    2. Kawaguchi N
    3. Kuribayashi T
    4. Hasuo R
    5. Kobayashi S
    (2007)
    Unusual successive occurrence of the Maackii Peacock (Papilio maackii; Papilionidae) at Okayama University of Science, southwestern Honshu lowland, in 2006 summer and autumn
    Naturalistae 11:11–14.
  2. Book
    1. Wakeley J
    (2016)
    Coalescent Theory: An Introduction
    Macmillan Learning.
    1. Yago M
    2. Katsuyama R
    3. Harada M
    4. Teshirogi M
    5. Ogawa Y
    6. Ishizuka S
    7. Omoto K
    (2021)
    Catalogue of the Keiichi Omoto butterfly collection, Part I (Papilionidae: Baroniinae and Parnassiinae
    The University Museum, The University of Tokyo, Material Reports 127:1–162.

Decision letter

  1. Daniel R Matute
    Reviewing Editor; University of North Carolina, Chapel Hill, United States
  2. George H Perry
    Senior Editor; Pennsylvania State University, United States
  3. Daniel L Powell
    Reviewer; Stanford University, United States

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Admixture of evolutionary rates across a hybrid zone" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and George Perry as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Daniel L Powell (Reviewer #1).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

Overall, the manuscript does a good job of presenting results in a clear and concise manner. Before making a final decision I have three requests:

1. The reviewers have requested some clarifications to make the entropy derivations more accessible to the reader, soften some claims (i.e., archaic gene flow), and add a section regarding the caveats to the discussion. I strongly agree with the reviewers and a potential resubmission should include a response to these requests.

I have two more things to add.

2. Please add the study organisms to the title of the manuscript. As it is currently written, the title is overly ambitious and the reader might be misled to think that the differences in substitution rates is a general phenomenon.

3. The scholarship of the piece needs to be improved. In the current form, the discussion has seven references which is rather surprising given that the study of introgression is one of the most vibrant aspects of current speciation research. The discussion needs to be more encompassing of our current understanding of the factors that determine the fate of an introgressed allele. The current form of the discussion should elaborate on what are the primary hypothesis that could lead to differences in substitution rates and assess what are the potential implications for the interaction between selection and recombination (in the face of differences in substitution rates).

Reviewer #1 (Recommendations for the authors):

I have nothing in the way ion further suggestions for the authors that isn't included in the public review except the following.

Lines 187-188: In the previous paragraph you claim that divergent rates likely exist, but here you state that you've established that they do. This strengthening of the statement seems unwarranted.

Reviewer #2 (Recommendations for the authors):

In their manuscript ("Admixture of evolutionary rates across a hybrid zone"), Xiong et al. use whole genome resequencing data to assess rates of genome evolution between two species of butterflies and determine whether loci that evade introgression between the species are also those that evolve at asymmetric rates. They show the landscape of divergence is heterogeneous among these sister species and the genomes of the hybrids exhibit correlated ancestry blocks (both within an individual's genome, and between individuals). Interestingly, sites that are more strongly diverged appear to be more tightly correlated, indicative of potential barrier loci. The authors also find significantly skewed D statistics using four outgroups. While these statistics might normally be used to infer introgression, the authors further use topological weights, and node heights to show that asymmetric rates of genome evolution might account for these skewed D statistics, rather than introgression. Using both coalescent theory and empirical observations, they show that these asymmetrically evolving sites are correlated with variation in relative divergence among species. Overall, I congratulate the authors on a clever and well written manuscript. I think this work is incredibly rigorous and I appreciate the clear communication of fairly complex results. I do have a few suggestions, which I think the authors might find useful.

Namely, while the authors nicely show potential asymmetric evolutionary rates between species, and also show that highly divergent sites are more likely to show patterns of lower entropy (i.e. less randomly associated with other loci), which I agree may be indicative of barrier loci, I feel there might be a missing link between these two processes in the way the manuscript is currently written. For example, the authors show these differences in evolutionary rates (i.e. the ratio number of substitutions between lineages) is correlated with divergence, but I'm curious if they are also correlated with ancestry entropy. In my mind, I agree with the authors that barrier loci should be highly diverged (and exhibit reduced diversity), but these signatures can be caused by a number of evolutionary processes. Instead, the entropy data is quite striking to me and more convincing of barrier loci. Therefore, I might expect the entropy measurements to also be correlated with the D statistics or the ratio of substitution rates.

I find Figures 4 and 6 really convincing of differential evolutionary rates over differences in introgression with the four outgroups. One additionally might predict that if two species were evolving differentially (and this in turn was reflected in the relative divergence between these sister species), that one might also see asymmetric differences in something like DXY between each focal species and each outgroup, and this might also be correlated with FST between the two sister lineages. This might be nice as a supplement to bolster your argument.

Lastly, I think the authors could clarify two important aspects of their differential evolutionary rate hypothesis.

First: it's unclear to me whether the authors are arguing that barrier loci evolve at different rates than the rest of the genome (and therefore show both increased divergence and increased asymmetric rates of evolution), or whether loci that evolve at different rates may have a harder time introgressing between species (by nature of the fact that a given locus' evolutionary rate is at odds with the evolutionary rate of the rest of the recipient genome). Perhaps this is a difference without much distinction, but a bit more detail from the authors on this might be useful. Additionally, if the authors are arguing the former, I think there's a nice amount of literature that they could tap into to talk about why different barrier loci might evolve rapidly (for example, if they are involved in a genetic arms race).

Second: The finding that these two sister species have such drastically different evolutionary rates is really interesting, and I think the authors could elaborate a bit more on why they think that is. There is some explanation on lines 232-234, but I'm curious whether one of these species has undergone a massive bottle neck or partakes in more inbreeding in ways that are also known to affect D statistics and such.

https://doi.org/10.7554/eLife.78135.sa1

Author response

1. The reviewers have requested some clarifications to make the entropy derivations more accessible to the reader, soften some claims (i.e., archaic gene flow), and add a section regarding the caveats to the discussion. I strongly agree with the reviewers and a potential resubmission should include a response to these requests.

We thank both reviewers and the reviewing editor for their encouraging assessments of this manuscript. All their suggestions have helped us state the results in a more accurate and meaningful way.

2. Please add the study organisms to the title of the manuscript. As it is currently written, the title is overly ambitious and the reader might be misled to think that the differences in substitution rates is a general phenomenon.

“Butterfly” is now added to the title. (“Admixture of evolutionary rates across a butterfly hybrid zone”)

3. The scholarship of the piece needs to be improved. In the current form, the discussion has seven references which is rather surprising given that the study of introgression is one of the most vibrant aspects of current speciation research. The discussion needs to be more encompassing of our current understanding of the factors that determine the fate of an introgressed allele. The current form of the discussion should elaborate on what are the primary hypothesis that could lead to differences in substitution rates and assess what are the potential implications for the interaction between selection and recombination (in the face of differences in substitution rates).

To improve the scholarship and discuss these results in light of existing and historical work, we have added new references and carried out a major revision of the Discussion, with the following subsections:

1) Entropy as a useful measure of ancestry randomness. We discuss the incentives behind developing these metrics, and the intuition for choosing particular forms of ancestry representation. We also explain the impact of other processes such as inbreeding and recombination on entropy, and show how they are useful for window-based genomic analysis.

2) Potential mechanisms of divergent substitution rates. Here, we speculate on several ways that different substitution rates might result. Ohta’s nearly neutral theory is now included as a possible explanation for smaller asymmetric rates on nonsynonymous substitutions than synonymous and intronic ones.

3) Are introgression tests robust to substitution rate variation? This part summarizes the finding that introgression tests based on site patterns can produce false-positives when substitution rates diverge between species and recurrent mutations exist in outgroups. In particular, we stress that, since barrier loci have the most asymmetric rates of evolution during hybridization (given the existence of divergent rates), they might be falsely identified as having been introgressed from an outgroup when introgression tests are used on genomic windows. This result might therefore mislead researchers to the conclusion that some “speciation genes” are borrowed from outgroups.

4) The conceptual picture of rate-mixing. In this final subsection, we recapitulate the main conclusion of this study, that locally asymmetric substitution rates reflect loci responsible for reproductive isolation. The caveats of the theory we develop in this manuscript are discussed in detail, and we conjecture that the monotonic relationship between rate asymmetry and sequence divergence might be a robust signature irrespective of model details.

Reviewer #1 (Recommendations for the authors):

I have nothing in the way ion further suggestions for the authors that isn't included in the public review except the following.

Lines 187-188: In the previous paragraph you claim that divergent rates likely exist, but here you state that you've established that they do. This strengthening of the statement seems unwarranted.

Thanks! We changed the first sentence of this subsection to “Having tested for …” rather than “Having established the existence of …”

Reviewer #2 (Recommendations for the authors):

In their manuscript ("Admixture of evolutionary rates across a hybrid zone"), Xiong et al. use whole genome resequencing data to assess rates of genome evolution between two species of butterflies and determine whether loci that evade introgression between the species are also those that evolve at asymmetric rates. They show the landscape of divergence is heterogeneous among these sister species and the genomes of the hybrids exhibit correlated ancestry blocks (both within an individual's genome, and between individuals). Interestingly, sites that are more strongly diverged appear to be more tightly correlated, indicative of potential barrier loci. The authors also find significantly skewed D statistics using four outgroups. While these statistics might normally be used to infer introgression, the authors further use topological weights, and node heights to show that asymmetric rates of genome evolution might account for these skewed D statistics, rather than introgression. Using both coalescent theory and empirical observations, they show that these asymmetrically evolving sites are correlated with variation in relative divergence among species. Overall, I congratulate the authors on a clever and well written manuscript. I think this work is incredibly rigorous and I appreciate the clear communication of fairly complex results. I do have a few suggestions, which I think the authors might find useful.

Namely, while the authors nicely show potential asymmetric evolutionary rates between species, and also show that highly divergent sites are more likely to show patterns of lower entropy (i.e. less randomly associated with other loci), which I agree may be indicative of barrier loci, I feel there might be a missing link between these two processes in the way the manuscript is currently written. For example, the authors show these differences in evolutionary rates (i.e. the ratio number of substitutions between lineages) is correlated with divergence, but I'm curious if they are also correlated with ancestry entropy. In my mind, I agree with the authors that barrier loci should be highly diverged (and exhibit reduced diversity), but these signatures can be caused by a number of evolutionary processes. Instead, the entropy data is quite striking to me and more convincing of barrier loci. Therefore, I might expect the entropy measurements to also be correlated with the D statistics or the ratio of substitution rates.

We agree with the intuition of the reviewer here that entropy could also be correlated with rate asymmetry. In fact, since the measure of "divergence" we use, FST. is negatively correlated with entropy (Figure 3E, third panel), and also positively correlated with rate ratio r, it is likely that entropy will be negatively correlated with r. The reason we use FST instead of entropy in studying the process of rate-mixing between populations is as follows:

First, FST is widely used in comparing populations, and has well-understood properties in population genetics theory. Yes, when FST is used to study gene flow, results can be misled by a number of other evolutionary processes (e.g., linked selection not caused by barrier loci), entropy is not exempt from complications either. As stated in the response to Reviewer #1, inbreeding reduces between-individual entropy (Sb), and low recombination rates reduce within-individual entropy (Sw), and both processes are not directly related to barrier loci per se.

Second, as entropy is defined on estimated local ancestry rather than DNA sequences, its performance will depend on the statistical power of ancestry estimation, which might introduce an additional layer of uncertainty to the results. However, FST can be directly estimated from sequences.

Third, we here use the entropy measure only within the hybrid population, while FST could be calculated between any populations (in this case, the “pure” populations), and so is potentially applicable to a wider set of data in which hybrid populations have not been sampled.

I find Figures 4 and 6 really convincing of differential evolutionary rates over differences in introgression with the four outgroups. One additionally might predict that if two species were evolving differentially (and this in turn was reflected in the relative divergence between these sister species), that one might also see asymmetric differences in something like DXY between each focal species and each outgroup, and this might also be correlated with FST between the two sister lineages. This might be nice as a supplement to bolster your argument.

Yes! Interestingly, DXY-based statistics were the very first clues we had that showed some kind of asymmetry between these two species, and we were then genuinely puzzled by this asymmetry, which initially was thought to show introgression with an outgroup (we only included one outgroup P. bianor, at the early stage of the analysis).

The statistic we used originally was “Population Branch Statistic” (PBS). In one version of the PBS, choose a triplet (P1, P2, O). P1 and P2 are two focal taxa, while O is an outgroup. PBS describes how distant each lineage is from the center of the triplet, and is calculated as follows:

PBS(P1)=log11DXY(P1,O)+log11DXY(P1,P2)log11DXY(P2,O)2
PBS(P2)=log11DXY(P2,O)+log11DXY(P1,P2)log11DXY(P1,O)2

(see: https://github.com/thamala/PBScan/blob/master/PBScan_manual.pdf)

Many people use PBS to find regions with unusually fast evolution in one lineage (indicating selection). In our case, among 50kb sliding windows, PBS(maackii) is usually larger than PBS(syfanius), and this asymmetry increases with FST. For instance, chromosome 15 has one of the lowest FST, while chromosome 30 (Z) has the highest FST, and PBS is highly skewed towards P. maackii on chromosome 30.

Author response image 1

We then thought that such a large skew in PBS couldn’t all be attributed to natural selection in P. maackii, and that there must be other processes generating such patterns. However, because PBS is not a normalized measure, it’s difficult to make a fair comparison across genomic windows. For this reason, we switched to D3 and D4 statistics, both are normalized, and are specifically designed to capture asymmetry information.In short, it is absolutely true that information based on DXY will show similar asymmetry between the focal taxa. But we argue that the normalized measures using D statistics make the point more clearly.

Lastly, I think the authors could clarify two important aspects of their differential evolutionary rate hypothesis.

First: it's unclear to me whether the authors are arguing that barrier loci evolve at different rates than the rest of the genome (and therefore show both increased divergence and increased asymmetric rates of evolution), or whether loci that evolve at different rates may have a harder time introgressing between species (by nature of the fact that a given locus' evolutionary rate is at odds with the evolutionary rate of the rest of the recipient genome). Perhaps this is a difference without much distinction, but a bit more detail from the authors on this might be useful. Additionally, if the authors are arguing the former, I think there's a nice amount of literature that they could tap into to talk about why different barrier loci might evolve rapidly (for example, if they are involved in a genetic arms race).

Thank you for point it out. We apologize for not being clear on this result. The main idea throughout the manuscript is simply that:

1. DNA accumulates substitutions at different rates between the two species

2. Hybridization mixes these rates for any specific locus, but at different rates of mixture depending on proximity to barrier loci.

To improve the wording, we now always use “mix” instead of “couple” whenever we are talking about rates being mixed between the two species. In the abstract, we also emphasize this mixing process is “between species”, and so we are not comparing “between loci”.

Second: The finding that these two sister species have such drastically different evolutionary rates is really interesting, and I think the authors could elaborate a bit more on why they think that is. There is some explanation on lines 232-234, but I'm curious whether one of these species has undergone a massive bottle neck or partakes in more inbreeding in ways that are also known to affect D statistics and such.

There is now a subsection in Discussion to include possible mechanisms of substitution rate differences. Although it is somewhat speculative, we believe it may be to do with the different numbers of generations per year between the two species. The highland lineage has 1-2 generations a year while the lowland lineage has 2-3. If everything else is equal, the ratio between the generation numbers could well be about ~1.8. We also speculate that Ohta’s nearly neutral theory might explain the reduced rate asymmetry of ~1.6 for nonsynonymous sites (the lowland lineage has larger populations more efficient selection against mildly deleterious mutations slowing down nonsynonymous substitutions more in the lowland).

https://doi.org/10.7554/eLife.78135.sa2

Article and author information

Author details

  1. Tianzhu Xiong

    Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Visualization, Writing - original draft, Writing – review and editing
    For correspondence
    txiong@g.harvard.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4576-8764
  2. Xueyan Li

    Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
    Contribution
    Resources, The reference genomes for P. bianor and P. xuthus. Fieldworks, Writing – review and editing
    Competing interests
    No competing interests declared
  3. Masaya Yago

    The University Museum, The University of Tokyo, Tokyo, Japan
    Contribution
    Museum specimens of P. syfanius and P. maackii, Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  4. James Mallet

    Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, United States
    Contribution
    Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing - original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3370-0367

Funding

American Philosophical Society (The Lewis and Clark Fund for Exploration and Field Research (2017-2018))

  • Tianzhu Xiong

The NSF-Simons Center for Mathematical and Statistical Analysis of Biology (Award Number #1764269), and The Quantitative Biology Initiative (Harvard University) (Graduate Student Fellowship)

  • Tianzhu Xiong

Department of Organismic and Evolutionary Biology, Harvard University (Graduate Student Fellowship)

  • Tianzhu Xiong

Department of Organismic and Evolutionary Biology, Harvard University (Faculty Start-up Fund)

  • James Mallet

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was supported by the Lewis and Clark Fund for Exploration and Field Research (American Philosophical Society, 2017–2018) granted to TX.; TX is also funded by a studentship from Harvard Department of Organismic and Evolutionary Biology, the NSF-Simons Center for Mathematical and Statistical Analysis of Biology at Harvard (award number #1764269) and the Harvard Quantitative Biology Initiative during the project. We thank Harvard FAS Research Computing for providing computation resources. We thank Fernando Seixas for discussion on hybrid ancestry inference. We thank Kaifeng Bu for his comments on information-theoretic measures. We thank Nathaniel Edelman, Neil Rosser, Miriam Miyagi, Yuttapong Thawornwattana, Sarah Dendy, Liang Qiao, Adam Cotton, John Wakeley, Robin Hopkins, and Naomi Pierce for their valuable input. We also thank Yuchen Zheng, Zhuoheng Jiang, Shaoji Hu, Feng Cao, and Shui Xu for support in the field. Feng Cao provided specimen photos from the hybrid zone.

Senior Editor

  1. George H Perry, Pennsylvania State University, United States

Reviewing Editor

  1. Daniel R Matute, University of North Carolina, Chapel Hill, United States

Reviewer

  1. Daniel L Powell, Stanford University, United States

Publication history

  1. Preprint posted: September 28, 2021 (view preprint)
  2. Received: February 23, 2022
  3. Accepted: June 14, 2022
  4. Accepted Manuscript published: June 15, 2022 (version 1)
  5. Version of Record published: June 30, 2022 (version 2)

Copyright

© 2022, Xiong et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,309
    Page views
  • 346
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Tianzhu Xiong
  2. Xueyan Li
  3. Masaya Yago
  4. James Mallet
(2022)
Admixture of evolutionary rates across a butterfly hybrid zone
eLife 11:e78135.
https://doi.org/10.7554/eLife.78135
  1. Further reading

Further reading

    1. Evolutionary Biology
    2. Genetics and Genomics
    Yi Feng, Rafik Neme ... Laura F Landweber
    Research Article

    Ciliates are microbial eukaryotes that undergo extensive programmed genome rearrangement, a natural genome editing process that converts long germline chromosomes into smaller gene-rich somatic chromosomes. Three well-studied ciliates include Oxytricha trifallax, Tetrahymena thermophila and Paramecium tetraurelia, but only the Oxytricha lineage has a massively scrambled genome, whose assembly during development requires hundreds of thousands of precise programmed DNA joining events, representing the most complex genome dynamics of any known organism. Here we study the emergence of such complex genomes by examining the origin and evolution of discontinuous and scrambled genes in the Oxytricha lineage. This study compares six genomes from three species, the germline and somatic genomes for Euplotes woodruffi, Tetmemena sp., and the model ciliate Oxytricha trifallax. To complement existing data, we sequenced, assembled and annotated the germline and somatic genomes of Euplotes woodruffi, which provides an outgroup, and the germline genome of Tetmemena sp.. We find that the germline genome of Tetmemena is as massively scrambled and interrupted as Oxytricha's : 13.6% of its gene loci require programmed translocations and/or inversions, with some genes requiring hundreds of precise gene editing events during development. This study revealed that the earlier-diverged spirotrich, E. woodruffi, also has a scrambled genome, but only roughly half as many loci (7.3%) are scrambled. Furthermore, its scrambled genes are less complex, together supporting the position of Euplotes as a possible evolutionary intermediate in this lineage, in the process of accumulating complex evolutionary genome rearrangements, all of which require extensive repair to assemble functional coding regions. Comparative analysis also reveals that scrambled loci are often associated with local duplications, supporting a gradual model for the origin of complex, scrambled genomes via many small events of DNA duplication and decay.

    1. Evolutionary Biology
    2. Microbiology and Infectious Disease
    Vishhvaan Gopalakrishnan, Dena Crozier ... Jacob G Scott
    Feature Article Updated

    A morbidostat is a bioreactor that uses antibiotics to control the growth of bacteria, making it well-suited for studying the evolution of antibiotic resistance. However, morbidostats are often too expensive to be used in educational settings. Here we present a low-cost morbidostat called the EVolutionary biorEactor (EVE) that can be built by students with minimal engineering and programming experience. We describe how we validated EVE in a real classroom setting by evolving replicate Escherichia coli populations under chloramphenicol challenge, thereby enabling students to learn about bacterial growth and antibiotic resistance.