Admixture of evolutionary rates across a butterfly hybrid zone
Abstract
Hybridization is a major evolutionary force that can erode genetic differentiation between species, whereas reproductive isolation maintains such differentiation. In studying a hybrid zone between the swallowtail butterflies Papilio syfanius and Papilio maackii (Lepidoptera: Papilionidae), we made the unexpected discovery that genomic substitution rates are unequal between the parental species. This phenomenon creates a novel process in hybridization, where genomic regions most affected by gene flow evolve at similar rates between species, while genomic regions with strong reproductive isolation evolve at species-specific rates. Thus, hybridization mixes evolutionary rates in a way similar to its effect on genetic ancestry. Using coalescent theory, we show that the rate-mixing process provides distinct information about levels of gene flow across different parts of genomes, and the degree of rate-mixing can be predicted quantitatively from relative sequence divergence () between the hybridizing species at equilibrium. Overall, we demonstrate that reproductive isolation maintains not only genomic differentiation, but also the rate at which differentiation accumulates. Thus, asymmetric rates of evolution provide an additional signature of loci involved in reproductive isolation.
Editor's evaluation
The authors leverage theory, simulations, and empirical population genomics to evaluate what are the consequences of differences in substitution rates in hybridizing species. This is a largely overlooked pheonomenon. This study highlights the issue and demonstrates that two hybridizing species of Papilio have differences in thir substitution rates. The work will be of interest to a large group of evolutionary biologists, especially those studying evolution at the whole-genome level.
https://doi.org/10.7554/eLife.78135.sa0Introduction
DNA substitution, the process whereby single-nucleotide mutations accumulate over time, is a critical process in molecular evolution. Both molecular phylogenetics and coalescent theory rely on observed mutations (Yang and Rannala, 2012; Wakeley, 2016), and so the rate of substitution/mutation is the predominant link from molecular data to information about the timing of past events (Bromham and Penny, 2003). Substitution rates often vary among lineages: generation time, spontaneous mutation rate, and fixation probabilities of new mutations could all contribute to the variation of substitution rates (Ohta, 1993; Lynch, 2010). Recent evidence suggests mutation rates are even variable among human populations (Harris, 2015; DeWitt et al., 2021). As such variation affects how fast the molecular clock ticks, reconstructing gene genealogies among different species sometimes accounts for species-specific rates of evolution (Lepage et al., 2007). However, under the standard coalescent framework, empirical studies of within- and between-species variation tend to ignore rate variation among populations (Costa and Wilkinson-Herbots, 2017; Wolf and Ellegren, 2017; Kautt et al., 2020). The latter is partly based on a popular assumption in coalescent theory that neutral mutation rate is constant for a given locus across the whole genealogy (Hudson, 1990). Hybridization and speciation lie in the gray zone of these extremes, and have their own problem: molecular clocks from different lineages could be mixed by cross-species gene flow — a gene could evolve under one clock before it flows into another species and switches to evolve according to a different clock (Figure 1). This mixing process is largely outside the scope of existing theories and has received little attention from empirical studies.
However, if mixing between molecular clocks exists and can be measured, its strength could carry information about gene flow, which is important for studying reproductive isolation between incipient species. Genomic regions responsible for reproductive isolation lead to locally elevated genomic divergence (“genomic islands”), often caused by linked genomic regions experiencing less gene flow (“barrier loci”; Nosil et al., 2009; Michel et al., 2010; Renaut et al., 2013; Payseur and Rieseberg, 2016). In studying a hybrid zone between two butterfly species, Papilio syfanius and Papilio maackii, we discovered evidence for unequal genome-wide substitution rates between the two species. Using this system, we investigate the interaction between unequal substitution rates and gene flow, and whether this interaction reveals new information on reproductive isolation.
As these butterflies are rare species, occurring in a remote region of China, and are hard to collect, we employed methods based on analysis of whole-genome sequences of a few specimens. We hope that these methods may prove of use in studying other rare or perhaps endangered species where few individuals can be sacrificed. Results will follow two parallel lines: first, we provide evidence that genomic islands are associated with barrier loci. Then we infer the existence of unequal substitution rates. Finally, using a coalescent model, we calculate the relationship between the magnitude of genomic islands and the degree of mixing between substitution rates in linked regions. Throughout the analysis, we assume reverse mutations are rare, so that higher substitution rates always lead to higher numbers of observed substitutions.
Results
Divergent sister species with ongoing hybridization
We sampled 11 males of P. syfanius and P. maackii across a geographic transect (Figure 2A, dashed lines) covering both pure populations (in the sense of being geographically far from the hybrid zone) and hybrid populations. We also include four outgroup species, two of which have chromosome-level genome assemblies (P. bianor, P. xuthus; Lu et al., 2019; Li et al., 2015), while the other two (P. arcturus, P. dialis) are new to this study. All samples were re-sequenced to at least 20× coverage across the genome and mapped to the genome assembly of P. bianor. Among sampled local populations, P. syfanius inhabits the highlands of Southwest China (Figure 2A, red region), whereas P. maackii dominates at lower elevations (Figure 2A, blue region) (see Figure 2—figure supplement 3 and Figure 2—source data 4 for the complete distribution). The two lineages form a spatially contiguous hybrid zone at the edge of the Hengduan Mountains (Figure 2B) with individuals exhibiting intermediate wing patterns (Figure 2A: purple dot, corresponding to population WN in Figure 2B). Consistent with previous results (Condamine et al., 2013), assembled whole mitochondrial genomes are not distinct between the two lineages (Figure 2C, Figure 2—source data 3), suggesting either that divergence was recent, or that gene flow has homogenized the mitochondrial genomes. However, the two species are likely adapted to different environments associated with altitude, as several ecological traits are strongly divergent (Kashiwabara, 1991; Figure 2—figure supplement 1). Similarly, between pure populations (KM and XY in Figure 2B), relative divergence () is also high across the entire nuclear genome (Figure 2D, Figure 2—source data 1). The on autosomes averages between 0.2–0.4, and on the sex chromosome (Z-chromosome) it reaches 0.78. A highly heterogeneous landscape of is accompanied by numerous islands of elevated sequence divergence () and reduced genetic diversity () scattered across the genome (Figure 2—figure supplement 4, Figure 2—source data 2). Since animal mitochondrial DNA generally has higher mutation rates than the nuclear genome (Haag-Liautard et al., 2008), its low divergence between the two species are likely the result of gene flow. Overall, despite ongoing hybridization, genomes of the pure populations of P. syfanius and P. maackii are strongly differentiated.
-
Figure 2—source data 1
- https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data1-v3.zip
-
Figure 2—source data 2
- https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data2-v3.zip
-
Figure 2—source data 3
- https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data3-v3.zip
-
Figure 2—source data 4
- https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data4-v3.zip
-
Figure 2—source data 5
- https://cdn.elifesciences.org/articles/78135/elife-78135-fig2-data5-v3.zip
Genomic islands are associated with barrier loci
A natural question is whether genomic differentiation is associated with barrier loci and reproductive isolation. In other words, can variation be attributed to gene flow variation between sister species? We suspect that barrier loci likely exist, because sequence variation between pure populations suggests that elevated is associated with reduced and elevated across autosomes (Figure 3A), as expected for hybridizing species (Irwin et al., 2018). The Z chromosome (sex chromosome) has the highest and the lowest among all chromosomes (Figure 2—figure supplement 4), another characteristic of hybridizing species with barriers to gene flow (Kronforst et al., 2013).
To test for the presence of barrier loci, we augment the analysis with the sequences of four individuals from the population closest to the center of the hybrid zone (Population WN). We investigate whether differences in ancestry variation provide additional evidence for barrier loci in this hybrid zone. The underlying logic is that barrier loci will simultaneously:
Reduce linked in pure populations (Ravinet et al., 2017);
Elevate linked between pure populations (Ravinet et al., 2017);
Elevate pairwise linkage disequilibrium in hybrid zones (Barton, 1983);
Enrich linked ancestry from one lineage in hybrid zones (Sedghifar et al., 2016).
Effects 3 and 4 can be bundled together as “reduced ancestry randomness” around barrier loci because both are expected if intermixing of segments of different ancestries within a genomic interval is prevented. For effects 3 and 4, because of small sample sizes, estimating site-specific statistics such as pairwise linkage disequilibrium is untenable. However, our high-quality chromosome-level reference genome enabled accurate estimation of local ancestry. As a remedy for small sample sizes, we employ two entropy metrics borrowed from signal-processing theory to quantify ancestry randomness in local regions along chromosomes (see Materials and methods). By dividing chromosomes into segments, we can extract indirect information about effects 3 and 4 at the expense of reduced genomic resolution. The proposed metrics, and , correspond to the randomness of ancestry between and within individual diploid genomes from a local population (Figure 3B). For a cohort of ideal chromosomes with uniform recombination and marker density, if ancestry is independent between homologous chromosomes, both and increase with reduced local ancestry correlation and more balanced parental contribution (Figure 3C).
For a given autosomal segment, we then calculate , , and between pure populations, as well as entropy metrics and in hybrid individuals for the same segment. To investigate whether effects 1–4 are all present in our system, the joint range among entropy, , and is shown in Figure 3D, which suggests that low ancestry randomness (low entropy) is likely associated with reduced within species and elevated between species. To further quantify such association, we estimated Pearson’s correlation coefficients () between entropy and the latter statistics (Figure 3E). These associations are strongly significant (-scores > 3). Consequently, reduced ancestry randomness in hybrids (effects 3 & 4) coincides with classical patterns of barrier loci between pure populations (effects 1 and 2). This analysis is not sufficient to exclude all alternative hypotheses. For instance, we cannot entirely exclude the possibility that patterns are driven by low-recombination regions () that experience linked selection () also have elevated mutation rates (). Nonetheless, this alternative seems most unlikely, as low-recombination regions should typically be less rather than more mutable (Lercher and Hurst, 2002; Jensen-Seaman et al., 2004; Yang et al., 2015; Arbeithuber et al., 2015; Liu et al., 2017). Overall, adding information from hybrid populations strengthens the evidence for barrier loci acting across autosomes.
The Z chromosome was excluded from this analysis as it likely differs in mutation rate or effective population size (Presgraves, 2018), but its ancestry in hybrid individuals either retains purity or resembles very recent hybridization (long blocks of heterozygous ancestry, Figure 3—figure supplement 1). The Z chromosome has low ancestry randomness, and it also has the highest level of divergence (Figure 2D), both of which suggest strong barriers to gene flow between P. syfanius and P. maackii on this chromosome.
Asymmetric site patterns
A hint that substitution rates differ between the two species comes from site-pattern asymmetry (but this asymmetry alone is insufficient to establish the existence of divergent rates). We focus on two kinds of biallelic site patterns. In the first kind, choose three taxa (P1,P2,O1), with P1=syfanius, P2=maackii, while O1 is an outgroup. Assuming no other factors, if substitution rates are equal between P1 and P2, then site patterns (P1,P2,O1)=(A,B,B) and (B,A,B) occur with equal frequencies, where “A” and “B” represent distinct alleles. This leads to a statistic describing the asymmetry between three-taxon site patterns:
where is the count of a particular site pattern across designated genomic regions. A significantly non-zero indicates strongly asymmetric distributions of alleles between P1 and P2.
In a second kind of site pattern test, we compare four taxa (P1,P2,O1,O2) and calculate a similar statistic between site patterns (A,B,B,A) vs (B,A,B,A):
Classically, a significantly non-zero suggests that gene flow occurs between an outgroup and either P1 or P2, thus it is widely used to detect hybridization (the ABBA-BABA test) (Durand et al., 2011; Hibbins and Hahn, 2022). Here, is used more generally as an additional metric of site pattern asymmetry.
We compute both and on synonymous, nonsynonymous, and intronic sites, with P. syfanius and P. maackii samples taken from pure populations (KM and XY). For each type of site, we progressively exclude regions with local below a certain threshold, and report statistics on the remaining sites in order to show the increasing site-pattern asymmetry in more divergent regions (Figure 4). is significantly negative regardless of outgroup or site type for most thresholds, and is also significantly negative for most outgroup combinations when computed across the entire genome (-scores are shown in Figure 4—figure supplement 1), proving that site-patterns are strongly asymmetric between P. syfanius and P. maackii. Importantly, the direction of asymmetry is nearly identical across all outgroup comparisons. This asymmetry cannot be attributed to batch-specific variation as all samples were processed and sequenced in a single run, and variants were always called on all individuals of P. syfanius and P. maackii. Sequencing coverage is normal for most annotated genes used in the analysis (Table 1), suggesting that asymmetry is not due to systematic copy-number variation that could affect variant calls. Nonetheless, two independent processes of evolution could explain observed asymmetric site patterns. In hypothesis I (Figure 5A, left), asymmetry is generated via stronger gene flow between P. syfanius and outgroups, leading to biased allele-sharing. In hypothesis II, site pattern asymmetry is due to unequal substitution rates between P. syfanius and P. maackii, which is further modified by recurrent mutations in all four outgroups (Figure 5A, right). We test each hypothesis below.
Hybridization with outgroups does not explain site-pattern asymmetry
We first test hypothesis I by phylogenetic reconstruction using SNPs in annotated regions. We construct local gene trees for each 50 kb non-overlapping window for all samples, including the four outgroup species. As biased gene flow with outgroups should rupture the monophyletic relationship among all P. syfanius +P. maackii individuals, the fraction of windows producing paraphyletic gene trees can be used to assess the potential impact of gene flow. However, almost all gene trees show the expected monophyletic relationship (Figure 5B). This conclusion is independent of the level of support used to filter out genomic windows with ambiguous topologies (see Materials and methods). Consequently, hardly any window shows a phylogenetic signal of hybridization with outgroups. Nonetheless, branch lengths in reconstructed gene trees suggest possible substitution rate divergence between the two species: Among highly supported monophyletic trees (bootstrap support ≥ 95), P. maackii (in populations YA, BJ, XY) is always significantly more distant than P. syfanius from the most recent common ancestor of P. maackii +P. syfanius (Figure 5C, significance levels reported via a Wilcoxon signed-rank test for each pair of individuals, see Figure 5—source data 2). Second, the direction of allele sharing in the statistics is unanimously biased towards P. syfanius. If hypothesis I were true, hybridization with outgroups is required to take place mainly in the highland lineage. There is no evidence to support why the highland lineage should receive more gene flow based on current geographic distributions, as outgroups P. xuthus and P. bianor overlap broadly with both P. maackii and P. syfanius, while outgroup P. dialis is sympatric only with P. maackii (Condamine et al., 2013). Although it is possible that historical and modern geographic distributions differ, and archaic gene flow might have occurred preferentially with P. syfanius, it should still leave some phylogenetic signal of introgression. Overall, we find little evidence for biased hybridization required by hypothesis I.
One might worry that by rejecting hypothesis I, we also throw doubt on widely accepted conclusions of the ABBA-BABA test for gene flow in other systems (that a significantly nonzero implies hybridization with outgroups) (Durand et al., 2011). However, in the next section, we show why and are fully consistent with hypothesis II, and so this is just a special case where the ABBA-BABA test produces a false positive for gene flow.
Evidence for divergent substitution rates
In hypothesis II, divergent substitution rates between P. maackii and P. syfanius interact with recurrent mutations in outgroups to produce asymmetric site patterns. To understand its effect on statistics, consider a simplified model of recurrent mutation (Figure 5D,E), where a site in outgroup mutates with probability , producing the same derived allele with probability . When averaged across the genome, can be treated as a constant, and increases with distance to the outgroup. In the absence of gene flow, for three-taxon patterns, recurrent mutations modify by a factor of approximately (Figure 5D, see Materials and methods), and observed will thus be positively correlated with . For four-taxon patterns, it can be shown that observed is always negative due to larger probabilities of recurrent mutation in more distant outgroups (Figure 5E, see Materials and methods). Assuming no significant contribution of incomplete lineage sorting (Maddison and Knowles, 2006), the value of becomes more negative with increasing .
To test these signatures, we employ estimated node heights of outgroups in the mitochondrial tree (Figure 2C) as proxies for outgroup distance, and hence for the relative probability of recurrent mutation (). In line with expected signatures, we find that observed indeed increases with node height (Figure 5F, left), and observed decreases with (Δ Node height/∑ Node height; Figure 5F, right). Thus, the directions and magnitudes of both statistics are congruent with hypothesis II. As hypothesis II naturally predicts unanimously negative and as well as their relative magnitudes among different outgroup combinations, it is more parsimonious than hypothesis I. Hence, divergent substitution rates likely exist between P. syfanius and P. maackii.
Rate-mixing at migration-drift equilibrium
Having tested for the existence of divergent substitution rates, we now explore how they become mixed by gene flow using a coalescent framework. In particular, we will assess whether the conceptual picture in Figure 1 can be recovered quantitatively in the butterfly system. As gene flow is ongoing between the two lineages, consider two haploid populations of size exchanging genes at rate (Figure 6A). This simple isolation-with-migration (IM) model at equilibrium has relative divergence (Notohara, 1990)
To quantify the signature of mixed rates, consider the asymmetry in observed numbers of substitutions (circles in Figure 6A). Take a pair of sequences from two populations. Let be the expected number of derived alleles exclusive to the sequence in population 1, and let be the same expected number in population 2. Their ratio is defined as observed rate ratio:
Evidently, is the symmetric point where both sequences have the same number of derived alleles. Further, let the actual substitution rate in population 1 be , and the actual substitution rate in population 2 be . The ratio between the two actual rates are defined as the true rate ratio:
At migration-drift equilibrium, observed rate ratio () and observed divergence () are related by the following formula parameterized singly by (Figure 6B, see Materials and methods):
which translates into
These formulae indicate that unequal substitution rates are more mixed in regions with lower genomic divergence. Equation 7 is surprising, because it reveals that the remaining fraction of substitution rate divergence () is almost the same as , which corresponds to the fraction of genetic variance explained by population structure (Wright, 1949). In Figure 6B, the relationship between and is still largely linear when one species evolves three times as fast as its sister species (), and so Equation 7 might be robust under biologically realistic rates of substitutions between incipient species. Using extensive simulations (Figure 6—figure supplement 1 to Figure 6—figure supplement 3), we show that the full formula (Equation 6) is also robust in a number of equilibrium population structures.
To test whether such predictions are met, we calculate on synonymous, nonsynonymous, and intronic sites partitioned by their local values between pure populations (KM & XY), and recover a similar monotonic relationship across most partitions (Figure 6C). For introns, the observed relationship between and is a near perfect fit to Equation 6 (Figure 6D, squares), with an estimated = 1.837. For synonymous (Figure 6D, circles) and nonsynonymous sites (Figure 6D, triangles), Equation 6 also provides an excellent fit in regions with low to intermediate . Estimated for synonymous sites is 1.813, close to that of introns, while nonsynonymous sites have a considerably lower of 1.633. If introns and synonymous substitutions are approximately neutral, we infer that neutral substitution rates are about 80% greater in the lowland species.
Discussion
Entropy as a useful measure of ancestry randomness
A critical step in our analysis is to associate genomic islands with barrier loci. Conventionally, information on increased association between barrier sites in hybrid populations comes from two-locus linkage disequilibrium (Slatkin, 1975). Empirical studies on hybrid populations frequently use such statistics to strengthen the evidence for barriers to gene flow (Knief et al., 2019; Wang et al., 2022). Alternatively, if phased haplotypes can be sequenced, the length of ancestry tracts (Sedghifar et al., 2016) or the density of ancestry junctions (Janzen et al., 2018; Wang et al., 2022) also carry information on barrier loci. All these methods break down with small numbers of unphased samples, as forced phasing can produce a large number of false ancestry junctions. However, conventional notions such as “ancestry tract length” are not definable for unphased local ancestry. The dilemma forces us to consider more robust statistics that carry information on ancestry association even in unphased data.
The development of both entropy metrics follows three intuitions:
Vectorization: ancestry is a categorical concept, and it should map to a signal containing the contribution of each source population.
Conservation law: the total ancestry from all sources at a particular locus is conserved.
Highly (auto)correlated signals have a concentrated spectrum (low spectral entropy).
Since entropy carries information about the degree of ancestry correlation, it will decrease in regions of low recombination and high genetic relatedness. If genetic relatedness is produced by inbreeding, it should affect the entire genome in a similar way, and between-individual entropy () will be similar across different parts of the genome. However, if inbreeding is so severe that is globally zero, it will not be an informative metric. It is worth noting that within-individual entropy shares a conceptual similarity to the wavelet transform of ancestry signals (Pugach et al., 2011; Sanderson et al., 2015). These entropy metrics will be particularly useful for window-based genomic analysis of ancestry correlation with limited sampling, and they are also compatible with larger sample sizes and more than two parental species. (The formulae of entropy with two parental species are presented in Materials and methods, and mathematical details are discussed in Appendix 1.) Nonetheless, this entropy approach assumes local ancestry can be accurately inferred, which will be challenging for studies with low-coverage sequencing, non-chromosomal genome assembly, or lacking reference populations.
Potential mechanisms of divergent substitution rates
Interestingly, a higher substitution rate in the lowland lineage P. maackii is congruent with the evolutionary speed hypothesis (Rensch, 1959), where evolution accelerates in warmer climates. Our finding echoes the results of many previous studies (Gillooly et al., 2005; Wright et al., 2006; Lin et al., 2019; Ivan et al., 2022). Without measuring the per-generation mutation rates in both species, it is unclear what mechanisms cause increased substitution rates, but the lowland lineage typically has an additional autumn brood that is absent in P. syfanius (Takasaki et al., 2007; Figure 2—figure supplement 2). Warmer temperatures in lowland habitats might also increase spontaneous mutation rates in ectothermic insects (Waldvogel and Pfenninger, 2021). Both mechanisms could produce increased substitution rates in the lowland species. It is surprising that estimated between synonymous sites and introns agree with each other under such a coarse framework, and estimated for nonsynonymous sites is considerably lower. Less asymmetric rates for nonsynonymous substitutions could perhaps be explained by the nearly neutral theory (Ohta, 1993), which argues that many nonsynonymous mutations are mildly deleterious, and selection is more efficient in suppressing them in larger populations and slowing down substitutions. In the field, the lowland species often appears in larger numbers with well-connected habitats, while the highland species faces a highly heterogeneous landscape of the Hengduan Mountains, which could lead to differences in effective population sizes required by our expectation under the nearly neutral theory.
Are introgression tests robust to substitution rate variation?
An additional result from our study is that divergent substitution rates might produce spuriously non-zero statistics when combined with recurrent mutations, which could increase the false positive rate of the ABBA-BABA test (Durand et al., 2011). This phenomenon has been suspected in humans (Amos, 2020), and is certainly a theoretical possibility (Hibbins and Hahn, 2022), but has not been tested in most empirical studies.
A wide class of introgression tests targeting gene flow between outgroups and a pair of taxa are based on site pattern information. Hahn’s computes absolute sequence divergence between groups in a triplet of species (Hahn and Hibbins, 2019), and will be affected by unequal substitution rates in a similar way to our . Martin’s computes the same numerator as our (Martin et al., 2015), so it could also produce false positives under similar situations. Related statistics include (Hamlin et al., 2020), (Pfeifer and Kapan, 2019). A general guideline for site-pattern based statistics is that the focal pair of taxa are closely related such that substitution rates do not differ, while outgroups should not be too distant to minimize recurrent mutations (Hibbins and Hahn, 2022). However, whether these assumptions are met in empirical studies is worthy of investigation, and our system provides a counterexample even between sister species with ongoing hybridization.
A separate pitfall might occur if introgression tests based on site-patterns are applied to genomic windows to locate regions introgressed from outgroups. In our case, peaks have the most asymmetric substitution rates between P. maackii and P. syfanius, thus they will most likely be associated with false-positive statistics. This could lead to the incorrect interpretation that some barrier loci (“speciation genes”) are introgressed from outgroups – a popular hypothesis in adaptive introgression and hybrid speciation, see Edelman and Mallet, 2021.
To this end, we speculate that using appropriate substitution models to infer gene tree topology will perform better in assessing the impact of introgression with outgroups.
The conceptual picture of rate-mixing
In the gray zone of incomplete speciation, interspecific hybrids bridge between gene pools of divergent lineages (Mallet, 2005). We here demonstrate a similar role of hybridization in coupling and mixing differing substitution rates. Divergent rates of substitution carry information about outgroups, while divergence based on allele frequency differences does not. Preserving divergent substitution rates is a stronger effect than maintenance of allele frequency differences, because divergence of allele frequencies is a prerequisite for rate preservation. This dependency can be coarsely quantified across the genome by the relationship between observed rate ratio and relative divergence in an equilibrium system of hybridizing populations (Equation 6). At migration-drift equilibrium, it is not surprising that divergent substitution rates are associated with relative divergence. In Figure 6A, when coalescence occurs rapidly compared to gene flow, most substitutions separating individuals are species-specific. However, when gene flow is faster than coalescence, individuals will carry substitutions that occurred in both species. This could have important implications, because preserving lineage-specific substitution rates as measured by might not require low absolute rates of gene flow. Instead, reducing effective population sizes via recurrent linked selection might achieve a similar result in populations at equilibrium ().
The theory in its present form has several limitations. First, mutations follow the infinite-site model (Kimura, 1969), so that reverse mutations, double mutations, and substitution types are not accurately reflected in the estimated ratio between species-specific substitution rates. Second, population structure is assumed at equilibrium, whereas real data could carry footprints from non-equilibrium demographic processes (e.g. secondary contact) (Hey, 2010). Third, there could be considerable population structure within each species contributing to elevated but not necessarily to rate-divergence. This effect could be seen in simulated stepping-stone models (Figure 6—figure supplement 3) and will underestimate the level of rate divergence. Fourth, substitution is stochastic across the genome, and accurately estimating observed rate ratio relies on averaging substitution numbers across a large number of sites. This poses a problem if sites linked to high regions are rare (i.e. few genomic islands). Lastly, as the theory is built upon the neutral coalescent, it is best suited for studying behaviors of neutral sites.
Nonetheless, the monotonic relationship between and (i.e. larger sequence divergence is associated with more asymmetric substitution rates) might be qualitatively robust regardless of the aforementioned caveats. For instance, even for hybrid zones formed by recent secondary contact, reducing the absolute rate of gene flow by barrier loci in principle also keeps divergent rates from mixing, simply because it prevents substitutions accumulated in the allopatric phase from flowing between species.
In conclusion, our study characterizes several genomic consequences of the rate-mixing process when molecular clocks tick at different speeds between hybridizing lineages. This process provides new information on reproductive isolation but also leads to pitfalls in interpreting results of popular introgression tests. As this phenomenon is neglected in most studies of hybridization and speciation, its full scope awaits further investigation in both theories and empirical systems.
Materials and methods
Museum specimens and climate data
Request a detailed protocolMuseum specimens with verifiable locality data of all species were gathered from The University Museum of The University of Tokyo (Harada et al., 2012; Yago et al., 2021), Global Biodiversity Information Facility (The Global Biodiversity Information Facility, 2021b; The Global Biodiversity Information Facility, 2021a), and individual collectors (Figure 2—source data 5). Records of P. maackii from Japan, Korea and NE China were excluded from the analysis, so that most P. maackii individuals correspond to ssp. han, the subspecies that hybridizes with P. syfanius. Spatial principal component analysis was performed on elevation, maximum temperature of warmest month, minimum temperature of coldest month, and annual precipitation, all with 30s resolution from WorldClim-2 (Fick and Hijmans, 2017). The first two PCAs, combined with tree cover (Hansen et al., 2013), were used in MaxEnt-3.4.1 to produce species distribution models that use known localities to predict occurrence probabilities across the entire landscape (Phillips et al., 2017). Outputs were trimmed near known boundaries of each species. See Figure 2—figure supplement 3 for the final result.
Sampling, re-sequencing, and mitochondrial phylogeny
Request a detailed protocolEleven males of P. syfanius and P. maackii, with one male of P. arcturus and one male of P. dialis were collected in the field between July and August in 2018 (Figure 2—source data 5), and were stored in RNAlater at –20 °C prior to DNA extraction. E.Z.N.A Tissue DNA kit was used to extract genomic DNA, and KAPA DNA HyperPlus 1/4 was used for library preparation, with an insert size of 350 bp and 2 PCR cycles. The library is sequenced on a Illumina NovaSeq machine with paired-end reads of 150 bp. Adaptors were trimmed using Cutadapt-1.8.1, and subsequently the reads were mapped to the reference genome of P. bianor with BWA-0.7.15, then deduplicated and sorted via PicardTools-2.9.0. The average coverage among 13 individuals in non-repetitive regions varies between 20× and 30×. Variants were called twice using BCFtools-1.9 – the first including all samples, used in analyses involving outgroups, and the second excluding P. arcturus and P. dialis, used in all other analyses. The following thresholds were used to filter variants: DP , where is the sample size; QUAL > 30; MQ > 40; MQ0F < 0.2. As a comparison, we also called variants with GATK4 and followed its best practices, and 93% of post-filtered SNPs called by GATK4 overlapped with those called by BCFtools. We used SNPs called by BCFtools throughout the analysis. Mitochondrial genomes were assembled from trimmed reads with NOVOPlasty-4.3.1 (Dierckxsens et al., 2017), using a published mitochondrial ND5 gene sequence of P. maackii as a bait (NCBI accession number: AB239823.1). We also used the following published mitochondrial genomes (NCBI accession numbers): KR822739.1 (Papilio glaucus), NC_029244.1 (Papilio xuthus), JN019809.1 (Papilio bianor). The neighbor-joining mitochondrial phylogeny was built with Geneious Prime-2021.2.2 (genetic distance model: Tamura-Nei), and we used 104 replicates for bootstrapping. The reference genome of P. xuthus was previously aligned to the genome of P. bianor and we used this alignment directly in all analysis (Lu et al., 2019).
Calculating site-pattern asymmetry
Request a detailed protocolGiven a species tree {{P1,P2},O}, where P1 and P2 are sister species and O is an outgroup, if mutation rates are equal between {P1,P2}, and no gene flow with O, then on average the number of derived alleles in P1 should equal the number of derived alleles in P2. Let be a collection of sites, be the frequency of a particular site pattern at site . “ABB” be the pattern where only P2 and O share the same allele, and “BAB” be the pattern where only P1 and O share the same allele, then the three-species statistic is calculated as
where is always limited to sites without polymorphism in the outgroup O. This statistic is in principle capturing the same source of asymmetry as the statistic proposed by Hahn and Hibbins, 2019, although their version uses divergence to the outgroup instead of frequencies of site-counts. Similarly, the four-species statistic, which considers species tree {{{P1,P2},O1},O2} and site patterns ABBA versus BABA (Durand et al., 2011) is calculated as
where is always limited to sites without polymorphism in the second outgroup O2. The significance of both tests was computed using block-jackknife over 1 Mb blocks across the genome. Additionally, we estimated rate ratio as follows. First we restricted to sites where all outgroups are fixed for the same ancestral allele to dampen the influence of recurrent mutation. Then, for each site, sample one allele at random from each focal lineage. Calculate the probability of observing a derived allele in P1 but not in P2, and the probability of observing a derived allele in P2 but not in P1. The rate ratio is computed as the ratio between the two probabilities. Explicitly, let be the identity function, and be the frequency of the derived allele, then:
Its standard error was estimated using 1 Mb block-jackknifing. We excluded P. xuthus from the outgroups to increase the number of informative sites when using this formula.
D3 and D4 under unequal substitution rates and recurrent mutations
Request a detailed protocolIn this section, we calculate observed and assuming that incomplete lineage sorting contributes insignificantly to both statistics. If incomplete lineage sorting is present, it will not create new bias (numerators are on average unchanged), but will likely dampen existing bias (inflating denominators).
As substitutions are independent along each lineage, we can mute recurrent mutations in outgroups and generate them afterwards. For three taxa with gene tree {{P1,P2},O1}, before recurrent mutations, there are sites with pattern (B,A,A), and sites with pattern (A,B,A). If substitution rate is higher in P2, we have , so the true value of , written as , is always negative:
Next, recurrent mutations in O1 occur at each site with an average probability , and with an average probability , ancestral alleles from affected sites in O1 are converted to the same derived allele in {P1,P2}. will be independent of and , as long as substitutions between {P1,P2} are only different in rates, but not mutation types. Hence, two possible mutation paths exist:
The expected site counts, after recurrent mutations, become
Using the new expected site counts in statistics produce the following value:
Since is negative, it grows approximately linearly near small values of . (The full equation is still monotonic in .)
Similarly, for four-taxon statistics, before recurrent mutation, there are two types of sites: (A,B,B,B)- ; (B,A,B,B)- . Suppose the average probability of recurrent mutation is in O1, and in O2, and the conversion probability of each recurrent mutation into derived alleles of {P1, P2} is for both outgroups. Using the same procedure, one can show that
Since recurrent mutations occur more frequently in distant outgroups, . Because is negative, we have .
Local gene trees
Request a detailed protocolLocal gene trees were estimated using iqtree-2.0 (Minh et al., 2020) on 50 kb non-overlapping genomic windows with options -m MFP -B 5000. Only SNPs from annotated regions (synonymous sites +nonsynonymous sites +introns) across all individuals were used. For diploid individuals, heterozygous sites were assigned IUPAC ambiguity codes and iqtree assigned equal likelihood for each underlying character, thus information from heterozygous sites is largely retained. This is crucial as we are interested in the branch length of inferred trees. Option -m MFP implements iqtree’s ModelFinder that tests the FreeRate model to accommodate maximum flexibility with rate-variation among sites (Kalyaanamoorthy et al., 2017). We also used UltraFast Bootstrap to calculate the support for different types of splits in each window (the -B 5000 option; Hoang et al., 2018). In each window, we extracted the support for monophyly among P. maackii +P.syfanius directly from the output of UltraFast Bootstrap, and we define the support for paraphyly among P. maackii +P.syfanius as (100 - the support for monophyly). For each level of support, we filtered out genomic windows where both the support for monophyly and the support for paraphyly drop below the given level. The remaining windows were considered informative.
Rate-mixing under the equilibrium IM model
Request a detailed protocolWe construct a continuous-time coalescent model as follows. Both populations have haploid individuals, gene flow rate is , and coalescent rate is in each population. In the equilibrium system, as we track both haploid individuals backward in time, there are six distinct states: , where 1 and 2 represent two individuals prior to coalescent, 0 is the state of coalescent, and shows the location of each lineage. Its transition density satisfies , where is given as
Let be the mean sojourn time of an uncoalesced individual inside population during , conditioning on the individual being taken from population at . Assuming the infinite-site mutation model, let be the substitution rate in population , observed rate ratio is thus
where (due to symmetry)
Let , and since , we have
Local ancestry estimation
Request a detailed protocolSoftware ELAI (Guan, 2014) with a double-layer HMM model was used to estimate diploid local ancestries across chromosomes. An example command is as follows: elai-lin -g genotype.maackii.txt -p 10 -g genotype.syfanius.txt -p 11 -g genotype.admixed.txt -p 1 -pos position.txt -s 30 -C 2 -c 10 -mg 5000 --exclude-nopos.
Note that -mg specifies the resolution of ancestry blocks, thus increasing its value will increase the stochastic error of incorrectly inferring very short blocks of ancestry. To control for uncertainty, we estimated repeatedly for 50 times. All replicates were used simultaneously in finding the correlation coefficients between entropy and other variables. Results from an example run is in Figure 3—figure supplement 1.
Ancestry and entropy
Here we introduce concisely the data transformation framework for calculating the entropy of local ancestry. The mathematical detail of this approach is presented in the appendix.
Ancestry representation
Request a detailed protocolThe space of all ancestry signals is high-dimensional, and directly calculating the entropy in this space is not feasible with just a few individuals. So we propose to measure only the pairwise correlation of ancestries among sites, which captures only the second-order randomness, but is sufficient for practical purposes. Consider a hybrid individual with two parental populations indexed by . Assuming a continuous genome, let be the diploid ancestry of locus within genomic interval . By definition, we have , that is the total ancestry is conserved everywhere in the genome. The bi-ancestry signal at locus is defined as the following complex variable
where is the imaginary number. An advantage of using a complex representation for the bi-ancestry signal is that we can model different ancestries along the genome as different phases of a complex unit phasor (), such that the power of the signal at any given locus is simply the sum of both ancestries, which is conserved (). It ensures that we do not bias the analysis to any particular region or any particular individual when decomposing the signal into its spectral components.
Within-individual spectral entropy ()
Request a detailed protocolTo characterize the average autocorrelation along an ancestry signal at a given scale , define the following scale-dependent autocorrelation function
where is understood as a periodic function such that whenever the position goes outside of . The Wiener-Khinchin theorem guarantees that ’s power spectrum , which is discrete, and the autocorrelation function form a Fourier-transform pair. Due to the uncertainty principle of Fourier transform, that vanishes quickly at short distances (small-scale autocorrelation) will produce a wide , and vice versa. So the entropy of , which measures the spread of the total ancestry into each spectral component, also measures the scale of autocorrelation. In practice, is the square modulus of the Fourier series coefficients of , and we fold the spectrum around before calculating the within-individual entropy . The formula used in the manuscript is
where are the Fourier coefficients from the expansion . To speed up the Fourier expansion, we could densely pack equally-spaced markers that sample a continuous ancestry signal into a discrete signal, which then undergoes Fast Fourier Transform (FFT). The spectrum of FFT (discrete and finite) approximates the continuous-time Fourier spectrum (discrete and infinite), and entropy also converges as marker density increases.
Between-individual spectral entropy ()
Request a detailed protocolAs ancestry configuration is far from random around barrier loci, it will also influence the correlation of ancestry between different individuals at the same locus. For a genomic region experiencing strong barrier effects, two individuals could either be very similar in ancestry, or very different. This effect can be quantified by first calculating the cross-correlation at position between individuals and , and then averaging across a genome interval: . The dimensional matrix with entries describes the pairwise cross-correlation within the cohort of individuals. We also have as each individual is perfectly correlated with itself. The matrix is Hermitian, so it has a real spectral decomposition with eigenvalues that satisfy . This process is very similar to performing a principal component analysis on the entire cohort of individuals, and describes the fraction of the total ancestry projected onto principal component . If many loci co-vary in ancestry, the spectrum will be concentrated near the first few components. Similarly, we use entropy to measure the spread of the spectrum, and hence the between-individual spectral entropy is defined as
Appendix 1
Representation of ancestry on a hybrid chromosome
The ancestry of a hybrid depends on the pure reference populations. The following assumptions are used throughout this section:
There are a finite number of pure reference populations indexed by . In most cases, we are only interested in , for instance, a hybrid zone between two lineages.
The chromosome has so many sites so that it can be treated as a contiguous rod with length . is the index of positions on a chromosome.
The species’ ploidy is (completely phased data ⇔ ). When not specified, we assume the data is unphased, so that the ancestry on a particular chromosome always refers to the ancestry on a collection of homologous chromosomes.
Ancestry can be inferred. In practice, regions with low density of informative SNPs will make inference difficult.
As "ancestry" is a categorical variable (i.e., there is no intrinsic order among the reference populations), to quantify the correlation of ancestry along a chromosome, we need to map the contribution of each ancestry category to some numeric value that can be used to calculate correlation.
The first choice is to use the probability of each category. For unphased data, let be the fraction of the -ploid chromosome at position coming from reference population . The vector-valued function completely describes the ancestry on a single chromosome. By definition, . This is the conservation of total ancestry at any position. However, by this representation, the correlation , which is the product of ancestries at different positions, will have units of the square of probabilities. If a locus is to be compared with itself, then the correlation will be
Ideally, we want the correlation of ancestry of a locus with itself to be the same regardless of its ancestry configuration. Equation (24) does not meet this criteria as it depends on . Here we borrow some concepts from signal-processing theory. The conservation of total ancestry means that ancestry is more analogous to the "energy" of a signal, which is in units of [squared-signal], rather than the signal itself. Therefore, we could instead define a second type of ancestry representation by taking the square roots of probabilities.
Definition 1 (Spherical representation of ancestry)
If the probability representation of ancestry is , then the spherical representation takes the square-root of each element of :
Following this representation, the autocorrelation between two spherical ancestries becomes a natural measure of the similarity between two probability distributions:
Definition 2 (Correlation between two spherical ancestries)
The autocorrelation function between and is the following quantity known as the Bhattacharyya coefficient, also known as the fidelity measure in information theory:
Further, self-correlation is always 1:
The conservation of self-correlation will be important when we decompose ancestry into its spectral components, as it will not bias our analysis to any particular region of the chromosome. There is also a geometric meaning associated with this representation. Since has a -norm of 1, each ancestry configuration corresponds to a point on the unit sphere in , and different ancestries are represented by the orientation of this vector in .
In many studies we are dealing only with two reference populations, such as hybrid zones between a pair of divergent lineages. This situation allows a more compact representation of ancestry using complex numbers:
Definition 3 (Complex representation of ancestry)
If , define the following complex variable as the representation of ancestries:
where is the imaginary number.
This seemingly artificial definition is not the first time that a complex number is used to model a physical phenomenon. In quantumn mechanics, the quantumn wave function of a particle is represented by a complex wave with the probability of occurrence measured by the square-modulus of the wave. Here, we are also using the square-modulus of to represent the total ancestry of a given locus.
A complex-valued signal , like any real signal, has the definition of autocorrelation:
Definition 4 (Correlation between two complex ancestries)
The correlation between two complex ancestries and is the following product:
By this definition, the real part of the autocorrelation is just the Bhattacharyya coefficient between two ancestry configurations on l1 and l2, which is a measure of similarity. The absolute value of the imaginary part of the autocorrelation measures the volume of the parallelogram spanned by vectors and , thus it is a measure of dissimilarity. Further, the self-correlation is also conserved for all , as and .
The complex representation possesses some useful properties not shared with the spherical representation, but it also limits our analysis to cases with only two reference populations. This will not pose a problem for the study of hybrid zones between a pair of parapatric lineages.
Mean autocorrelation vs. hybrid index
The hybrid index is defined as the average ancestry in a hybrid from a given reference population over a set of loci. It is usually calculated for each individual or for each chromosome. In our notation:
Definition 5 (Hybrid index)
The hybrid index over a given genomic interval for reference population is
In contrast, the average of the spherical or the complex representation provides information about autocorrelation within the chromosome instead of the mean ancestry.
Theorem 1 (Mean autocorrelation)
With the spherical or the complex representation, the squared -norm (or the squared modulus, when a complex representation is available) of the average signal represents the mean autocorrelation of the ancestry on the genomic interval .
Proof. For a general spherical representation , we have
For a complex representation , notice that , so
This guarantees that the mean autocorrelation when using a complex representation is the same as the mean autocorrelation when using a spherical representation with . Additionally,
This completes the proof.
The quantity is a measure of the average similarity of the ancestry configurations. It does not consider the genomic position of different ancestry configurations, so it does not contain information about whether similar ancestry configurations are clustered together.
The entropy of ancestry
To characterize the scale of correlation, the distance between loci is important because correlation often drops while distance between loci increases. The information about the spatial scale of correlation is retained when the full spectrum of correlation is considered within a single individual.
A second source of correlation arises when we consider the relationship between individuals. To convey the main idea, let’s compare between a set of ancestry signals from an inversion and a set of ancestry signals from a regular region subject to normal recombination. For the inversion, since recombination between chromosomes of different ancestry is often completely suppressed, the ancestry signal along the inversion will be close to constant. When two haploid individuals are compared at the inverted region, they are either different everywhere in terms of ancestry, or completely the same. Comparing between diploid individuals is similar, although the difference is more fine-grained due to the presence of the heterozygotes. However, in a collinear region subject to a regular rate of recombination, any ancestry signal will switch stochastically between states. In the latter case, the similarity between two ancestry signals will be similar across multiple pairwise comparisons. It will be helpful to think in the principal component (PC) space spanned by all individuals’ ancestry signals. Ancestry signals from an inversion will form several tight clusters in the PC space, while those from a regular region will form a single cloud with a larger dispersion.
Both sources of correlation, and hence both types of randomness, can be measured using entropy in information theory.
Definition 6 (Shannon entropy)
The Shannon entropy for a discrete probability distribution , , is defined as
For a continuous distribution with probability density function , , the Shannon differential entropy is defined as
For any nonnegative series or nonnegative function , suppose they converge upon summation/integration, we use the same notations and to denote the entropy after they are appropriately normalized to probability distributions.
Shannon entropy (or any other entropy measure) is a useful measure of the spread of a distribution over its entire configuration space. When the distribution is concentrated (low randomness, high certainty), will be low. if and only if for some .
Entropy associated with the correlation within a single individual
Definition 7 (Fourier spectrum of the spherical representation)
For the spherical representation of ancestry on genomic interval
expand each component into its Fourier series:
The folded Fourier spectrum is defined as
Definition 8 (Fourier spectrum of the complex representation)
For the complex representation of ancestry on genomic interval
expand into its Fourier series:
The folded Fourier spectrum is defined as
The following theorem guarantees that the folded spectrum is the same for bi-ancestry signals using either representation.
Theorem 2
When , the folded Fourier spectrum is the same for and .
Proof. Expand each component into its Fourier series:
The folded spectrum using the spherical representation is
Using the linearity of Fourier expansion, we have
Thus,
Since the Fourier coefficients of a real function at opposite frequencies are complex conjugates to each other, we have
Finally,
Since is real, the last term of the previous equation becomes zero. For the zero-th component , as both and are real, there is no difference between the two representations. This completes the proof.
A nice property of the Fourier spectrum is that it can be interpreted as a probability distribution of ancestry among different frequency components. By Parseval’s theorem, it is easy to verify that . From the Wiener-Khinchin theorem, the unfolded spectrum forms a Fourier transform pair with the autocorrelation function of the original signal. The significance of the Wiener-Khinchin theorem is that information about the autocorrelation of the original signal can now be extracted from the Fourier spectrum .
Definition 9 (Within-individual entropy)
Let . is the Shannon entropy of the folded Fourier spectrum . As captures the correlation structure within each individual, we also call it the within-individual entropy.
Formally, we have the following uncertainty principle relating the within-individual entropy to the scale of autocorrelation.
Theorem 3 (Entropic uncertainty)
Let be the average autocorrelation at scale () for the spherical ancestry. Here, is understood as a periodic function of period . The following inequality holds:
where is a normalization factor. Note that the right-hand-side of the inequality is invariant under linearly re-scaling to a different interval . Thus, we can also write the inequality compactly, supposing has been rescaled to , as
where is the average autocorrelation defined in Equation 31.
Proof. (i) In this part, we establish the Wiener-Khinchin relation that expands into a Fourier series with coefficients :
The derivation is as follows:
(ii) Let . It is obvious that are Fourier series coefficients of , they obey the Hausdorff-Young inequality
where and . Let . Since Fourier series preserve the 2-norm, , and since for , we have . This translates into
which yields
Note that the quantity is actually independent of upon re-scaling.
(iii) Next, we show that . Since and is nonnegative, we can always re-order them into a descending series () with the same entropy as (because entropy is permutation-invariant). The result follows as long as . Let . The two series, after normalization, can be written as
Consequently,
This implies that . Suppose there exists such that
Then there exists such that . So for any , we have . The difference
will always be positive and monotonically increases for any . This is not consistent with the fact that . Thus, we conclude that for any , the partial sum follows the inequality
This is to say that infinite sequence majorizes . By Theorem 2.2 of Li and Busch, 2013, , and so
(iv) Finally, since for , we have
which is equivalent to
Combining previous four steps yields the result of the theorem.
Entropy associated with the correlation between individuals
Definition 10 (Entropy of a linear operator)
Let be a compact self-adjoint linear operator on a Hilbert space . If has a countable set of eigenvalues , and is positive semidefinite, then define the entropy of as
where is the trace of the linear operator
Definition 11 (Mercer’s spectrum)
Let be the autocorrelation function for individual , and define the average autocorrelation function as
Since is Hermitian, the average is also Hermitian, thus the integral operator defined by has a series of real eigenvalues satisfying . is the solution to the eigenvalue problem:
The spectrum defined by is the Mercer’s spectrum.
Definition 12 (Cross-correlation spectrum)
Let the cross-correlation matrix be , where is the average cross-correlation between individual and :
As is Hermitian, it has a series of real eigenvalues satisfying . Let the normalized spectrum of be , then is defined as the cross-correlation spectrum.
The following theorem states that when the complex representation is adopted for bi-ancestry signals, the Mercer’s spectrum coincides with the cross-correlation spectrum, so that even if the Mercer’s spectrum is calculated using correlation within individuals, both captures the correlation between individuals.
Theorem 4
If is a collection of complex bi-ancestry signals, then its Mercers’ spectrum is the same as its cross-correlation spectrum.
Proof. The average autocorrelation is computed as
So the integral equation (51) becomes
where forms the solution to the above equation. Rearranging the terms, we write
where stands for the constant inside the bracket. Substituting into the original integral equation, we have
which is equivalent to
Notice that the integral in the bracket contains the cross-correlation between and , so if the following relationship holds
then the system has a solution of . The above equation is equivalent to
Which means that is the eigenvalue of the cross-correlation matrix , and since is a Hermitian matrix, the existence of its real eigenvalues is guaranteed. So for each , we have , which is to say
As both spectra coincide for the complex representation, we can define the between-individual entropy using either of them.
Definition 13 (Between-individual entropy of the complex representation)
For complex bi-ancestry signals, let . is the Shannon entropy characterizing the between-individual correlation.
Corollary 1 (Maximum ).
For diploid individuals with the complex representation of ancestry, the largest attainable is given by
where .
Proof. For entropy to be large, autocorrelation within each individual must be very weak. This will occur if a sufficient amount of recombination has taken place such that the ancestry across the genome is largely independent between loci. The maximum value will occur when both parental lineages contribute equally to the hybrid ancestry, so that the hybrid index . The heterozygosity is then 0.5 following a random distribution of ancestries along the two genome copies. The off-diagonal elements in the cross-correlation matrix thus take the value
For a matrix whose diagonal elements are all 1, and whose off-diagonal elements are all , its normalized eigenvalues are given by , which leads to the above result.
Corollary 2 (Minimum ). is the minimum between-individual entropy, and is attainable if recombination is completely suppressed in the genomic interval of interest.
Proof. If recombination is completely suppressed along , the ancestry signal is constant along the interval in any individual with any ploidy. Hence, , and . The only non-zero eigenvalue associated with a constant integral kernel is , so that we have using the Mercer’s spectrum. Note that the converse is not true, because the diploid ancestry signal is unaware of phase. A region completely heterozygous may in fact has a recombination break point, and the ancestry can flip phases when crossing the break point. However, this extreme situation is unlikely as it requires two separate recombination events to have occurred at precisely the same point and in opposite directions. If the hybrid zone is old, a long track of heterozygous ancestry is therefore usually good evidence for the presence of barrier loci.
Data availability
Source code is available at: https://github.com/tzxiong/2021_Maackii_Syfanius_HybridZone, (copy archived at swh:1:rev:41b220f489e17ff9795e5a0666e9579a00b2b3b8) Whole-genome sequences are deposited in the National Center for Biotechnology Information, Sequence Read Archive (BioProject Accession Number: PRJNA765117).
-
NCBI Sequence Read ArchiveID PRJNA765117. Hybridization between Papilio syfanius and Papilio maackii.
-
GitHubID 2021_Maackii_Syfanius_HybridZone. Database for the code used in the 2021 manuscript on the hybrid zone between Papilio syfanius and Papilio maackii.
-
GigaDB DatasetsSupporting data for "Chromosomal-level reference genome of Chinese peacock butterfly (Papilio bianor) based on third-generation DNA sequencing and Hi-C analysis".https://doi.org/10.5524/100653
References
-
Signals interpreted as archaic introgression appear to be driven primarily by faster evolution in AfricaRoyal Society Open Science 7:191900.https://doi.org/10.1098/rsos.191900
-
Multilocus clinesEvolution; International Journal of Organic Evolution 37:454–471.https://doi.org/10.1111/j.1558-5646.1983.tb05563.x
-
NOVOPlasty: de novo assembly of organelle genomes from whole genome dataNucleic Acids Research 45:e18.https://doi.org/10.1093/nar/gkw955
-
Testing for ancient admixture between closely related populationsMolecular Biology and Evolution 28:2239–2252.https://doi.org/10.1093/molbev/msr048
-
Prevalence and adaptive impact of introgressionAnnual Review of Genetics 55:265–283.https://doi.org/10.1146/annurev-genet-021821-020805
-
WorldClim 2: new 1‐km spatial resolution climate surfaces for global land areasInternational Journal of Climatology 37:4302–4315.https://doi.org/10.1002/joc.5086
-
A three-sample test for introgressionMolecular Biology and Evolution 36:2878–2882.https://doi.org/10.1093/molbev/msz178
-
Assessing biological factors affecting postspeciation introgressionEvolution Letters 4:137–154.https://doi.org/10.1002/evl3.159
-
High-resolution global maps of 21st-century forest cover changeScience (New York, N.Y.) 342:850–853.https://doi.org/10.1126/science.1244693
-
Catalogue of the Suguru Igarashi insect collection, Part I. Lepidoptera, PapilionidaeThe University Museum, The University of Tokyo, Material Reports 94:1–390.
-
Isolation with migration models for more than two populationsMolecular Biology and Evolution 27:905–920.https://doi.org/10.1093/molbev/msp296
-
UFBoot2: Improving the Ultrafast Bootstrap ApproximationMolecular Biology and Evolution 35:518–522.https://doi.org/10.1093/molbev/msx281
-
Gene genealogies and the coalescent processOxford Surveys in Evolutionary Biology 44:31.
-
Temperature predicts the rate of molecular evolution in Australian Eugongylinae skinksEvolution; International Journal of Organic Evolution 76:252–261.https://doi.org/10.1111/evo.14342
-
The breakdown of genomic ancestry blocks in hybrid lineages given a finite number of recombination sitesEvolution; International Journal of Organic Evolution 72:735–750.https://doi.org/10.1111/evo.13436
-
Comparative recombination rates in the rat, mouse, and human genomesGenome Research 14:528–538.https://doi.org/10.1101/gr.1970304
-
Epistatic mutations under divergent selection govern phenotypic variation in the crow hybrid zoneNature Ecology & Evolution 3:570–576.https://doi.org/10.1038/s41559-019-0847-9
-
A general comparison of relaxed molecular clock modelsMolecular Biology and Evolution 24:2669–2680.https://doi.org/10.1093/molbev/msm193
-
Von Neumann entropy and majorizationJournal of Mathematical Analysis and Applications 408:384–393.https://doi.org/10.1016/j.jmaa.2013.06.019
-
Outbred genome sequencing and CRISPR/Cas9 gene editing in butterfliesNature Communications 6:1–10.https://doi.org/10.1038/ncomms9212
-
Evolutionary Rates of Bumblebee Genomes Are Faster at Lower ElevationsMolecular Biology and Evolution 36:1215–1219.https://doi.org/10.1093/molbev/msz057
-
Evolution of the mutation rateTrends in Genetics 26:345–352.https://doi.org/10.1016/j.tig.2010.05.003
-
Inferring phylogeny despite incomplete lineage sortingSystematic Biology 55:21–30.https://doi.org/10.1080/10635150500354928
-
Hybridization as an invasion of the genomeTrends in Ecology & Evolution 20:229–237.https://doi.org/10.1016/j.tree.2005.02.010
-
Evaluating the use of ABBA-BABA statistics to locate introgressed lociMolecular Biology and Evolution 32:244–257.https://doi.org/10.1093/molbev/msu269
-
IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic eraMolecular Biology and Evolution 37:1530–1534.https://doi.org/10.1093/molbev/msaa015
-
Divergent selection and heterogeneous genomic divergenceMolecular Ecology 18:375–402.https://doi.org/10.1111/j.1365-294X.2008.03946.x
-
The coalescent and the genealogical process in geographically structured populationJournal of Mathematical Biology 29:59–75.https://doi.org/10.1007/BF00173909
-
A genomic perspective on hybridization and speciationMolecular Ecology 25:2337–2360.https://doi.org/10.1111/mec.13557
-
Estimates of introgression as a function of pairwise distancesBMC Bioinformatics 20:1–11.https://doi.org/10.1186/s12859-019-2747-z
-
Evaluating genomic signatures of “the large X-effect” during complex speciationMolecular Ecology 27:3822–3830.https://doi.org/10.1111/mec.14777
-
Interpreting the genomic landscape of speciation: a road map for finding barriers to gene flowJournal of Evolutionary Biology 30:1450–1477.https://doi.org/10.1111/jeb.13047
-
BookEvolution Above the Species LevelNew York, United States: Columbia University Press.https://doi.org/10.7312/rens91062
-
Beyond clines: lineages and haplotype blocks in hybrid zonesMolecular Ecology 25:2559–2576.https://doi.org/10.1111/mec.13677
-
Unusual successive occurrence of the Maackii Peacock (Papilio maackii; Papilionidae) at Okayama University of Science, southwestern Honshu lowland, in 2006 summer and autumnNaturalistae 11:11–14.
-
Temperature dependence of spontaneous mutation ratesGenome Research 31:1582–1589.https://doi.org/10.1101/gr.275168.120
-
Making sense of genomic islands of differentiation in light of speciationNature Reviews Genetics 18:87–100.https://doi.org/10.1038/nrg.2016.133
-
The genetical structure of populationsAnnals of Eugenics 15:323–354.https://doi.org/10.1111/j.1469-1809.1949.tb02451.x
-
Catalogue of the Keiichi Omoto butterfly collection, Part I (Papilionidae: Baroniinae and ParnassiinaeThe University Museum, The University of Tokyo, Material Reports 127:1–162.
-
Molecular phylogenetics: principles and practiceNature Reviews Genetics 13:303–314.https://doi.org/10.1038/nrg3186
Article and author information
Author details
Funding
American Philosophical Society (The Lewis and Clark Fund for Exploration and Field Research (2017-2018))
- Tianzhu Xiong
The NSF-Simons Center for Mathematical and Statistical Analysis of Biology (Award Number #1764269), and The Quantitative Biology Initiative (Harvard University) (Graduate Student Fellowship)
- Tianzhu Xiong
Department of Organismic and Evolutionary Biology, Harvard University (Graduate Student Fellowship)
- Tianzhu Xiong
Department of Organismic and Evolutionary Biology, Harvard University (Faculty Start-up Fund)
- James Mallet
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by the Lewis and Clark Fund for Exploration and Field Research (American Philosophical Society, 2017–2018) granted to TX.; TX is also funded by a studentship from Harvard Department of Organismic and Evolutionary Biology, the NSF-Simons Center for Mathematical and Statistical Analysis of Biology at Harvard (award number #1764269) and the Harvard Quantitative Biology Initiative during the project. We thank Harvard FAS Research Computing for providing computation resources. We thank Fernando Seixas for discussion on hybrid ancestry inference. We thank Kaifeng Bu for his comments on information-theoretic measures. We thank Nathaniel Edelman, Neil Rosser, Miriam Miyagi, Yuttapong Thawornwattana, Sarah Dendy, Liang Qiao, Adam Cotton, John Wakeley, Robin Hopkins, and Naomi Pierce for their valuable input. We also thank Yuchen Zheng, Zhuoheng Jiang, Shaoji Hu, Feng Cao, and Shui Xu for support in the field. Feng Cao provided specimen photos from the hybrid zone.
Copyright
© 2022, Xiong et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 1,781
- views
-
- 424
- downloads
-
- 9
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Cancer Biology
- Evolutionary Biology
In growing cell populations such as tumours, mutations can serve as markers that allow tracking the past evolution from current samples. The genomic analyses of bulk samples and samples from multiple regions have shed light on the evolutionary forces acting on tumours. However, little is known empirically on the spatio-temporal dynamics of tumour evolution. Here, we leverage published data from resected hepatocellular carcinomas, each with several hundred samples taken in two and three dimensions. Using spatial metrics of evolution, we find that tumour cells grow predominantly uniformly within the tumour volume instead of at the surface. We determine how mutations and cells are dispersed throughout the tumour and how cell death contributes to the overall tumour growth. Our methods shed light on the early evolution of tumours in vivo and can be applied to high-resolution data in the emerging field of spatial biology.
-
- Evolutionary Biology
Spatial patterns in genetic diversity are shaped by individuals dispersing from their parents and larger-scale population movements. It has long been appreciated that these patterns of movement shape the underlying genealogies along the genome leading to geographic patterns of isolation by distance in contemporary population genetic data. However, extracting the enormous amount of information contained in genealogies along recombining sequences has, until recently, not been computationally feasible. Here we capitalize on important recent advances in genome-wide gene-genealogy reconstruction and develop methods to use thousands of trees to estimate per-generation dispersal rates and to locate the genetic ancestors of a sample back through time. We take a likelihood approach in continuous space using a simple approximate model (branching Brownian motion) as our prior distribution of spatial genealogies. After testing our method with simulations we apply it to Arabidopsis thaliana. We estimate a dispersal rate of roughly 60km2 per generation, slightly higher across latitude than across longitude, potentially reflecting a northward post-glacial expansion. Locating ancestors allows us to visualize major geographic movements, alternative geographic histories, and admixture. Our method highlights the huge amount of information about past dispersal events and population movements contained in genome-wide genealogies.