Determining the probability of hemiplasy in the presence of incomplete lineage sorting and introgression

Abstract
Introduction
Results
Discussion
Materials and methods
Appendix 1
Data availability
References
Article and author information
Metrics

Abstract

The incongruence of character states with phylogenetic relationships is often interpreted as evidence of convergent evolution. However, trait evolution along discordant gene trees can also generate these incongruences – a phenomenon known as hemiplasy. Classic comparative methods do not account for discordance, resulting in incorrect inferences about the number, timing, and direction of trait transitions. Biological sources of discordance include incomplete lineage sorting (ILS) and introgression, but only ILS has received theoretical consideration in the context of hemiplasy. Here, we present a model that shows introgression makes hemiplasy more likely, such that methods that account for ILS alone will be conservative. We also present a method and software (HeIST) for making statistical inferences about the probability of hemiplasy and homoplasy in large datasets that contain both ILS and introgression. We apply our methods to two empirical datasets, finding that hemiplasy is likely to contribute to the observed trait incongruences in both.

Introduction

Convergent evolution of the same phenotype in distantly related species provides some of the most compelling evidence for natural selection. Comparative inferences of convergence require that the species history is known (Felsenstein, 1985). Comparative methods applied to such histories implicitly assume that the loci underlying convergent traits also follow the species tree. However, gene trees at individual loci can disagree with each other and with the species tree, a phenomenon known as gene tree discordance. While genomic data allow us to overcome many technical sources of discordance (Delsuc et al., 2005; Dunn et al., 2008; Misof et al., 2014), discordance also has biological causes (Degnan and Rosenberg, 2009), and remains a common feature of phylogenomic datasets (Pollard et al., 2006; Fontaine et al., 2015; Pease et al., 2016; Novikova et al., 2016; Wu et al., 2018; Vanderpool et al., 2020).

Gene tree discordance can have multiple sources, including biological causes such as incomplete lineage sorting (ILS), introgression, and horizontal gene transfer, and technical causes such as hidden paralogy or errors in gene tree inference (Schrempf and Szöllősi, 2020). Here, we focus primarily on the first two biological causes: ILS and introgression. Looking backwards in time, ILS is the failure of lineages to coalesce within a population before reaching the next most recent ancestral population. The probability of discordance due to ILS is a classic result of the multispecies coalescent, and depends on the population size and the length of time in which coalescence can occur (Hudson, 1983; Pamilo and Nei, 1988). More recently, the classic multispecies coalescent model has been extended to include introgression (a term we use to encompass hybridization and subsequent gene flow), in a framework called the ‘multispecies network coalescent’ (Yu et al., 2012; Yu et al., 2014; Wen et al., 2016). In this model, species relationships are modeled as a network, with introgression represented by horizontal reticulation edges. Individual loci probabilistically follow or do not follow the reticulation edge, after which they sort according to the multispecies coalescent process (i.e. with ILS). A major advantage of this approach is that ILS and introgression can be modeled simultaneously (reviewed in Degnan, 2018), allowing for more detailed study of the consequences of discordance.

Importantly, discordant gene trees can lead to the appearance of apparently convergent traits. This is because discordant gene trees have internal branches that do not exist in the species tree. If a mutation occurs along such a branch at a locus controlling trait variation, it may produce a pattern of character states that is incongruent with the species tree. Incongruent trait patterns are the basis for inferences of convergent evolution (‘homoplasy’), and thus this phenomenon has become known as hemiplasy (Avise and Robinson, 2008). Since hemiplasy can produce the same kinds of trait incongruence as homoplasy, failing to account for gene tree discordance can generate misleading inferences about convergence (Mendes and Hahn, 2016; Mendes et al., 2016). Studies in systems with widespread discordance have found that hemiplasy is a likely explanation for many patterns of incongruence (Copetti et al., 2017; Wu et al., 2018; Guerrero and Hahn, 2018).

The problem of hemiplasy makes it clear that robust inferences about the evolution of traits must account for gene tree discordance (Hahn and Nakhleh, 2016). Recent work has provided expressions for the probabilities of hemiplasy and homoplasy (Guerrero and Hahn, 2018), allowing for an assessment of whether a single transition (hemiplasy) or two transitions (homoplasy) is more likely to explain trait incongruence. This model shows that the most important factors contributing to a high risk of hemiplasy relative to homoplasy are a short internal branch on the species tree (which increases the rate of gene tree discordance), and a low mutation rate (which reduces the probability of the multiple independent transitions needed for homoplasy). However, applying this model in present form to empirical phylogenetic data faces two major limitations. First, incomplete lineage sorting is the only source of gene tree discordance considered, excluding scenarios with gene flow. Second, the model is limited to evolution along a three-taxon tree, restricting calculations for the exact probability of hemiplasy in larger clades.

With genomic data now available for many species, it has become clear that introgression is a common phenomenon (Mallet et al., 2016). Introgression leads to different patterns of gene tree discordance than expected under ILS alone – specifically, minority gene tree topologies supporting a history of introgression are expected to become more common than those produced via ILS alone. These differences form the conceptual basis for common tests of introgression using genomic data (Reich et al., 2009; Green et al., 2010; Durand et al., 2011; Patterson et al., 2012; Pease and Hahn, 2015). Introgression also affects the expected coalescence times between pairs of species (Joly et al., 2009; Brandvain et al., 2014; Hibbins and Hahn, 2019; Hahn and Hibbins, 2019). Pairs of species that have exchanged genes will have lower levels of sequence divergence, and therefore longer shared internal branches, at introgressed loci than expected under ILS alone. These differences in the frequency and branch lengths of genealogies produced by introgression should meaningfully affect the probability of hemiplasy. Therefore, it is important that both sources of gene tree discordance be accounted for in models of trait evolution.

For trees with more than three taxa, the number of possible gene trees and mutational configurations that could explain a particular pattern of trait incongruence increases dramatically. To illustrate this problem, we consider two cases of empirical incongruence of a binary trait. First, consider the case of New Guinea lizards that have evolved green blood from a red-blooded ancestor (Figure 1A; Rodriguez et al., 2018). A clade of 15 taxa contains both the green-blooded species and red-blooded species (the ancestral state). Given the phylogenetic distribution of the six green-blooded species—and no consideration of gene tree discordance—four independent transitions are necessary to explain this incongruence (Figure 1). However, the internal branches on this tree are short and discordance is likely. Individual loci could therefore group the green-blooded taxa into as few as one and as many as six separate clades. Depending on the history at loci affecting blood color, the distribution of green-blooded taxa could therefore be explained by anywhere from one to six mutations, and even more if we consider back-mutations. The one-mutation case represents a single transition due to hemiplasy along a branch that does not exist in the species tree, while the two- and three-mutation cases represent a combination of hemiplasy and homoplasy. The problem becomes even more complex when introgression occurs in the phylogeny, because each reticulation event introduces a new set of gene trees formed from the coalescent process at introgressed loci (Hibbins and Hahn, 2019). One such example is the origin of a chromosomal inversion spanning a gene involved in wing coloration in the Heliconius erato/sara clade of butterflies (Figure 1B; Edelman et al., 2019). Overall, the huge number of possible gene trees (>213 trillion for 15 lizard species; Felsenstein, 2004) and the large number of possible mutational events on these trees makes it infeasible to derive an explicit mathematical solution to address questions about hemiplasy in many empirical systems.

Figure 1 with 2 supplements see all

Download asset Open asset

Two empirical examples of apparent convergence in character states that could potentially be explained by hemiplasy.

(A) Maximum-likelihood species tree of the clade including green-blooded lizards and an outgroup, constructed from the concatenation of 3220 ultra-conserved elements (data from Rodriguez et al., 2018). Branch lengths in substitutions per site; nodes labeled with site concordance factors. (B) Coalescent network of *Heliconius erato/sara* clade, processed from the network constructed for the clade in Edelman et al., 2019. Branch lengths in units of 2N generations; rate, direction, and approximate timing of introgression events indicated by vertical arrows. In both trees, taxa with derived characters are colored, and the most parsimonious transitions from ancestral to derived states are labeled with circles.

Figure 1—source data 1 Input file given to HeIST for the lizard analysis, using the tree and character states shown in Figure 1A.: https://cdn.elifesciences.org/articles/63753/elife-63753-fig1-data1-v2.zip
Download elife-63753-fig1-data1-v2.zip
Figure 1—source data 2 Input file given to HeIST for the butterfly analysis, using the tree, character states, and introgression events shown in Figure 1B.: https://cdn.elifesciences.org/articles/63753/elife-63753-fig1-data2-v2.zip
Download elife-63753-fig1-data2-v2.zip

Here, we make two steps toward addressing these problems. First, we derive expressions for the probabilities of hemiplasy and homoplasy under the multispecies network coalescent for three taxa. Our results show that hemiplasy becomes increasingly likely relative to homoplasy as introgression occurs at a higher rate and at a more recent time relative to speciation. We also show how this pattern is influenced by the direction of introgression. These results highlight the need to account for both ILS and introgression in order to understand the origins of a trait incongruence. Second, we present a tool called HeIST (Hemiplasy Inference Simulation Tool) that uses coalescent simulation to dissect patterns of hemiplasy and homoplasy in larger phylogenies. This tool provides an estimate of the most likely number of transitions giving rise to observed incongruence of binary traits, and accounts for both ILS and introgression. Lastly, we apply HeIST to two empirical cases of apparent convergence in a binary trait, finding that hemiplasy is likely to contribute to the observed trait incongruences.

Results

A model for the probability of hemiplasy under the multispecies network coalescent

To study the effects of introgression on the probability of hemiplasy, we combine concepts from two previously published models: the ‘parent tree’ framework of Hibbins and Hahn, 2019, and the model of binary-trait evolution presented in Guerrero and Hahn, 2018 (see Wang et al., 2020 for an alternative way to extend the model to incorporate introgression). Consider a rooted three-taxon tree with the topology ((A,B), C). We define t₁ as the time of speciation between lineages A and B in units of 2N generations, and t₂ as the time of speciation between C and the ancestor of A and B. We also imagine an instantaneous introgression event between species B and C at time t_m, which can be in either direction (C → B or B → C). We define the total probability of a locus following an introgressed history as δ, with δ₂ denoting the probability of C → B introgression, and δ₃ the probability of B → C introgression. Introgression in both directions at an individual locus is not allowed in our model. However, a single introgression event in both directions can be modeled by allowing different directions at different loci. The history described here is represented by the phylogenetic network shown in Figure 2 (top). Other introgression scenarios can be accommodated by our model (see Discussion), but will not be considered here.

Figure 2 with 1 supplement see all

Download asset Open asset

A phylogenetic network (top) can be split into a set of parent trees (bottom) representing the possible histories at individual loci.

The probability that a locus is described by a particular parent tree depends on the probability of introgression (arrow labels). The horizontal ‘tube’ shown in the phylogenetic network does not depict introgression over a continuous time interval, but rather shows the timing of introgression (t_m) in an instantaneous pulse, while allowing for coalescence to be visualized for loci that follow a history of introgression.

To make it easier to track the history of different gene trees, we imagine that a phylogenetic network can be split into a set of ‘parent trees’ which describe the history at individual loci (Meng and Kubatko, 2009; Liu et al., 2014; Hibbins and Hahn, 2019; Figure 2, bottom). Within each of these parent trees, which describe either the species history or the history of introgression, gene trees sort under the multispecies coalescent process. Loci follow the species history, referred to as parent tree 1, with probability 1 – (δ₂ + δ₃). With C → B introgression, some loci will follow the alternative history within parent tree 2, with probability δ₂. In parent tree 2, species B and C are sister and share a ‘speciation’ time of t_m. B → C introgression causes loci to follow parent tree 3 with probability δ₃; in this history, lineages B and C are sister and split at time t_m, while the split time of A and the ancestor of B/C is reduced to t₁. This reduction in the second split time in parent tree 3 occurs because the presence of loci from lineage B in lineage C allows C to trace its ancestry through B going back in time. Since B is more closely related to A than C, this allows C to coalesce with A at an earlier time (Figure 2). Each introgression event is modeled as a discrete and instantaneous ‘pulse’ that generates its own parent tree, and in our model we consider a single introgression event for simplicity. However, multiple events or introgression over a continuous time interval can be modeled by introducing multiple pulses with different directions, timings, or probabilities. Each such event introduces its own parent tree and set of gene trees.

Each parent tree can produce four gene trees under the multispecies coalescent process: one tree from lineage sorting, and three equally probable trees from incomplete lineage sorting (Figure 2—figure supplement 1). In other words, introgression always involves ILS, as these are not mutually exclusive histories. Each of these possible gene trees has five branches along which mutations can occur: three tip branches, an internal branch, and an ancestral branch. A subset of these possible gene trees within each parent tree can lead to hemiplasy, while homoplasy can happen in any gene tree (Figure 3). Guerrero and Hahn, 2018 provide exact expectations for the probability of a mutation on each branch of each genealogy in an ILS-only model. Before extending this framework to incorporate introgression, the ILS-only model will be briefly described here, using a slightly updated notation that will make it easier to include the effects of introgression.

Figure 3

Download asset Open asset

The possible paths to homoplasy and hemiplasy under the multispecies network coalescent.

Homoplasy can happen on any gene tree, as long as there are two independent mutations on tip branches (panel A). Homoplasy can also happen via a mutation in the ancestor of all three species, followed by a reversal (not shown). All cases of hemiplasy require a transition on the internal branch of a gene tree with the topology ((B,C),A). In parent tree 1 (panel B), only one such possible gene tree exists (shown in gray; BC₁). In both parent trees 2 and 3 (panels C and D respectively), there are two possible gene trees with this topology. These gene trees differ in internal branch lengths, depending on the parent tree of origin and whether the tree is the result of lineage sorting (BC1₂ and BC1₃) or incomplete lineage sorting (BC2₂ and BC2₃) within introgressed histories.

Consider a binary trait that is incongruent with the described species tree, where species B and C have the derived state and A has the ancestral state. We denote λ₁, λ₂, and λ₃ as the tip branches in any topology leading to species A, B, and C respectively; λ₄ denotes the internal branch of any topology, and λ₅ the branch subtending the root. The notation ν(λ, τ) represents the probability of a mutation on branch λ_i in genealogy τ, where τ represents any of the four gene trees from any of the three parent trees. The rates of 0 → 1 and 1 → 0 mutations are assumed to be equal, and the rate among lineages is assumed to be constant. Finally, to describe individual genealogies, we use the notation XY_i=1,2,3, where X and Y denote the sister taxa, and the subscript i denotes the parent-tree of origin. In cases where a tree topology can be produced by either lineage sorting or ILS, a non-subscripted 1 or 2 is used, respectively. Under the ILS-only model, hemiplasy can only occur through a substitution on branch λ₄ of genealogy BC₁ (Figure 2—figure supplement 1C, Figure 3B). This occurs with the following probability:

P_{e} [B C_{1}] = (\frac{1}{3} e^{- (t_{2} - t_{1})}) ν (λ_{4}, B C_{1}) \prod_{i \neq 4}^{} (1 - ν (λ_{i}, B C_{1}))

(Guerrero and Hahn, 2018). Equation 1 has three components: the probability of observing genealogy BC₁, the probability that a mutation happens on the internal branch of that genealogy, and the probability that no other mutations occur. See section 1 of the Appendix for the full expressions for each mutation probability.

Now consider the phylogenetic network described earlier and shown in Figure 2. At an introgressed locus, the parent tree topology is ((B,C), A), but could be either parent tree 2 or 3. Within each of these parent trees, there are two possible gene trees that share this topology: one produced by lineage sorting (Figure 2—figure supplement 1E, Figure 3C) and one produced by ILS where B and C are still the first to coalesce (Figure 2—figure supplement 1F, Figure 3C). While these trees have the same topology, their expected frequencies and internal branch lengths differ. These quantities also differ depending on the direction of introgression at the locus, that is whether the history follows parent tree 2 or 3.

We first consider the C → B direction of introgression, and genealogy BC1₂, which is the result of lineage sorting within parent tree 2. This gives:

P_{e} [B C 1_{2}] = (1 - e^{- (t_{2} - t_{m})}) ν (λ_{4}, B C 1_{2}) \prod_{i \neq 4}^{} (1 - ν (λ_{i}, B C 1_{2}))

While Equation 2 has the same three core components as Equation 1, there are several important differences. First, the gene tree probability is the probability of lineage sorting within parent tree 2, which differs from the probability of ILS within parent tree 1. Second, the lower bound of coalescence is t_m rather than t₁, resulting in a higher probability of lineage sorting in parent tree 2 as compared to parent tree 1. Third, because B and C coalesce more quickly in this tree, they share a longer internal branch, which means the probability of mutation on that branch is higher (see section 1 of the Appendix).

ILS within parent tree two produces gene tree BC2₂, in which B and C are the first to coalesce in the common ancestor of all three species. The probability of hemiplasy in this case is:

P_{e} [B C 2_{2}] = (\frac{1}{3} e^{- (t_{2} - t_{m})}) ν (λ_{4}, B C 2_{2}) \prod_{i \neq 4}^{} (1 - ν (λ_{i}, B C 2_{2}))

In Equation 3, the gene tree probability represents ILS in parent tree 2. This probability is lower than its equivalent in parent tree 1, again because t_m is the lower bound for coalescence. Since the upper bound to coalescence is the same (t₂), the probability of a mutation on the internal branch of this gene tree is the same as for BC₁ (the ILS topology within parent tree 1). To get the overall probability of hemiplasy due to both ILS and introgression when there is gene flow from C → B, we weight the probability from each gene tree (Equations 1-3) by the admixture proportion, giving the following:

P_{e} [I L S, C \to B] = (1 - δ_{2}) P_{e} [B C_{1}] + δ_{2} (P_{e} [B C 1_{2}] + P_{e} [B C 2_{2}])

From Equation 4, we can see that introgression will increase the probability of hemiplasy over ILS alone (Equation 1) whenever the probability of hemiplasy from parent tree two is higher than from parent tree 1 (i.e. $P_{e} [B C 1_{2}] + P_{e} [B C 2_{2}] > P_{e} [B C_{1}]$ ). This is true whenever t₂ >t_m (see section 2 of the Appendix), which is by definition always true in this model.

Finally, we consider the probability of hemiplasy when introgression is in the direction B → C (represented by admixture fraction δ₃). As mentioned previously, this direction of introgression results in an upper bound to coalescence of t₁ rather than t₂. This is the primary difference between the directions of introgression, affecting both the expected gene tree frequencies and mutation probabilities (compare to Equations 2 and 3):

P_{e} [B C 1_{3}] = (1 - e^{- (t_{1} - t_{m})}) ν (λ_{4}, B C 1_{3}) \prod_{i \neq 4}^{} (1 - ν (λ_{i}, B C 1_{3}))

and

P_{e} [B C 2_{3}] = (\frac{1}{3} e^{- (t_{1} - t_{m})}) ν (λ_{4}, B C 2_{3}) \prod_{i \neq 4}^{} (1 - ν (λ_{i}, B C 2_{3}))

For the general probability of hemiplasy, including both directions of introgression, we now have:

\begin{array}{ll} P_{e} [I L S, C \to B, B \to C] = & (1 - (δ_{2} + δ_{3})) P_{e} [B C_{1}] + δ_{2} (P_{e} [B C 1_{2}] + P_{e} [B C 2_{2}]) + \\ δ_{3} (P_{e} [B C 1_{3}] + P_{e} [B C 2_{3}]) \end{array}

Finally, we consider the probability of homoplasy. As described in Guerrero and Hahn, 2018, there are two possible paths to homoplasy for a three-taxon tree where taxa B and C carry the derived state. The first is parallel 0 → one mutations on branches λ₂ and λ₃ (Figure 3A), and the second is a 0 → one mutation on branch λ₅ followed by a 1 → 0 reversal on branch λ₁. Both these paths to homoplasy can happen on any possible genealogy, because every topology contains independent tip branches leading to species B and C, as well as an internal branch ancestral to all three species. This gives the following:

P_{o} = \sum_{τ}^{} p (τ) [ν (λ_{2}, τ) ν (λ_{3}, τ) \prod_{i \neq 2, 3}^{} (1 - ν (λ_{i}, τ)) + ν (λ_{5}, τ) \prod_{i \neq 1, 5}^{} (1 - ν (λ_{i}, τ))]

where τ denotes the set of all possible gene trees. (Note that the sum inside Equation 8 is multiplied by $\frac{1}{p (τ)}$ in the main text of Guerrero and Hahn, 2018. This is a typo in that paper, but the results presented from their model use the correct expression, $p (τ)$ .) This formulation can also be applied to the extended model with introgression, with the understanding that τ now also includes the gene trees produced by parent trees 2 and 3. Each gene tree used in this summation will have a different set of mutation probabilities, which are detailed in section 1 of the Appendix.

To understand the analytical effect of introgression on the relative risks of hemiplasy and homoplasy, we plotted the ratio $P_{e} / P_{o}$ over a realistic range of admixture proportions, timings, and directions (Figure 4). The values of t₁ and t₂ were held constant at 1 and 3.5 coalescent units, respectively, with a population-scaled mutation rate of θ = 0.002. These settings ensured a constant contribution of incomplete lineage sorting to the risk of hemiplasy, leading to a baseline ratio of hemiplasy to homoplasy, $P_{e} / P_{o}$ , of 0.818 with no introgression. We varied the admixture proportion from 0 to 10%, and the value of t_m from 0.99 (just after the most recent speciation) to 0.01, for three different direction conditions: C → B only, B → C only, and equal rates in both directions.

Figure 4

Download asset Open asset

The probability of hemiplasy relative to homoplasy (contours) as a function of the admixture proportion (x-axis), the time between speciation and introgression (y-axis), and the direction of introgression (panels).

The contours delineate the factor difference between hemiplasy and homoplasy; for instance, a contour value of 2.0 means hemiplasy is twice as probable as homoplasy in that area of parameter space. At x = 0 in each panel, P_e/P_o = 0.818. (A) Equal rates of introgression in both directions. (B) Introgression in only the C → B direction. (C) Introgression in only the B → C direction.

Figure 4—source data 1 Data used to generate the contour plots in Figure 4. The value of P_e/P_o is reported for each parameter combination. Data for each panel is in separate tabs.: https://cdn.elifesciences.org/articles/63753/elife-63753-fig4-data1-v2.xlsx
Download elife-63753-fig4-data1-v2.xlsx

Introgression makes hemiplasy more likely than incomplete lineage sorting alone

Using our model for the probability of hemiplasy and of homoplasy, we examined the ratio P_e/P_o over a range of different introgression scenarios. This ratio summarizes how much more probable hemiplasy is than homoplasy for a given area of parameter space; for example, a value of P_e/P_o = 2 means hemiplasy is twice as likely as homoplasy. We find that the probability of hemiplasy relative to homoplasy increases as a function of the admixture proportion and how recently introgression occurs relative to speciation (Figure 4). As mentioned in the Introduction, there are several possible reasons for these observed trends. The strongest effect on P_e/P_o comes from the admixture proportion: a higher proportion means more loci evolving under parent trees 2 and 3, which means higher frequencies of the genealogies where hemiplasy is possible (i.e. BC1₂, BC2₂, BC1_3,BC2₃). The range of simulated admixture proportions from 0 to 10% was meant to capture a biologically plausible range of values, although rates of introgression can sometimes be much higher than this (e.g. Fontaine et al., 2015). Even in this modest range, the effect on the probability of hemiplasy can be substantial. We found that an admixture proportion of 5% results in hemiplasy being anywhere from 1.5 to 4 times more likely than homoplasy (depending on the timing and direction of introgression; Figure 4). Given the baseline value of P_e/P_o with no introgression for our chosen parameters (0.818), this represents at minimum a doubling of the probability of hemiplasy relative to homoplasy.

The effect of the timing of introgression is more complicated, as it manifests in multiple ways. First, more recent introgression increases the values of t₂ – t_m and t₁ – t_m, which in turn increases the degree of lineage sorting in parent trees 2 and 3, respectively. This leads to a higher frequency of gene trees where hemiplasy is possible. Second, the expected length of the internal branches in these two genealogies increases as introgression becomes more recent, which leads to a higher probability of mutations occurring on these branches. Third, since the total height of each tree is being held constant, more recent introgression reduces the lengths of the tip branches leading to species B and C. This reduces the probability of homoplasy due to parallel substitutions, again making hemiplasy relatively more likely. Finally, the strength of the effect of the timing of introgression increases with the admixture proportion, since it is a property of introgressed loci; in other words, the values of of t₂ – t_m and t₁ – t_m do not matter unless loci follow a history of introgression.

The direction of introgression affects the relationship between the admixture proportion, the timing of introgression, and hemiplasy risk (Figure 4B and C). While hemiplasy becomes more likely than homoplasy with increased admixture in either direction, P_e/P_o is lower in any given part of parameter space for B → C introgression (Figure 4C). This is because the bounds of coalescence for parent tree 3 are t₁ and t_m, which are always closer in time than t₂ and t_m (Figure 2). The smaller internal branch in parent tree 3 leads to a higher rate of ILS, in addition to a shorter internal gene tree branch (and lower mutation probability) on genealogies that undergo lineage sorting in these histories. Finally, the timing of introgression has a stronger effect on P_e/P_o in the B → C direction (Figure 4C). This is likely because parent tree 3 is truncated relative to parent tree 2 (see Figure 2), and so the difference t₁ – t_m makes up a proportionally larger part of the tree height.

HeIST: Hemiplasy Inference Simulation Tool

As described above, it is possible to infer the most likely number of transitions for an incongruent trait while accounting for discordance in a rooted tree with three taxa. However, similar calculations are computationally difficult for larger numbers of taxa. Here, we present a tool built on top of the coalescent simulator ms (Hudson, 2002) and sequence simulator Seq-Gen (Rambaut and Grassly, 1997) that provides an intuitive way to interrogate the parameter space of larger trees. Our tool, dubbed HeIST, takes a phylogenetic tree (including an option to specify introgression events) with observed character states as input and returns a simulated distribution of the number of transitions necessary to explain those character states. Introgression events must be specified as an instantaneous ‘pulse’ from one lineage to another, but we allow flexibility with respect to the timing of that pulse, as well as the rate, direction, and the lineages involved. The input phylogeny must be in coalescent units, but we also include a tool for converting trees given in units of substitutions per site to coalescent units, as long as branches are also associated with concordance factors (see section entitled ‘Inferring the tip branch lengths of a phylogeny in coalescent units’ below).

HeIST uses ms to simulate a large number of gene trees from the specified species tree or species network, and then simulates the evolution of a single nucleotide site along each of these gene trees using Seq-Gen. Loci where the simulated nucleotide states (transformed into 0/1 characters representing ancestral and derived states) match the character states observed on the species tree are taken as replicate simulations of the evolution of the trait being studied. In these ‘focal’ cases, HeIST counts the number of mutations that occurred along the gene tree in each simulation. It also returns information on the frequency of tip vs. internal branch mutations, transition vs. reversal mutations, the distribution of gene tree topologies, and whether gene trees originate from the species branching history or introgression history. Finally, it returns a summary of how much hemiplasy is likely to contribute to observed character states, using Fitch parsimony (Fitch, 1971) to obtain a homoplasy-only baseline for comparison. HeIST is implemented in Python 3 and the package/source code are freely available from https://github.com/mhibbins/HeIST.

HeIST effectively captures the effects of ILS and introgression on hemiplasy risk

To evaluate the performance of HeIST, we simulated across nine conditions with increasing expected probabilities of hemiplasy, across five different trait mutation rates. The results, shown in Figure 5, confirm the theoretical predictions shown in Figure 4: the probability of hemiplasy increases as a function of decreasing internal branch length (ILS1-ILS3), increasing probability of introgression (INT1-INT3), and more recent introgression (INT4-INT6). The effect of the timing of introgression is weaker than the effect of the introgression rate, also in line with theoretical expectations. These results held true for both the probability conditional on observing the specified trait pattern (Figure 5A) and the raw probability (Figure 5B).

Figure 5 with 6 supplements see all

Download asset Open asset

Probabilities of hemiplasy estimated from *HeIST* across nine simulated conditions.

ILS1-ILS3 decrease the internal branch length of the species tree; INT1-INT3 introduce introgression between derived taxa with increasing probability; INT4-INT6 make introgression more recent while holding the probability constant. See Figure 5—figure supplement 1 for the exact parameters used in each condition. Panel A shows the probability conditional on observing the trait pattern, whereas panel B shows the raw probability out of 100,000 simulations.

Figure 5—source data 1 Data used to generate Figure 5. condition: the simulation condition. hemi_count: the count of trees where a single mutation was observed. homo_count: the count of trees where two mutations were observed. cond_hemi: hemi_count divided by the sum of hemi_count and homo_count. raw_hemi: hemi_count divided by the total number of simulated trees. cond_homo: homo_count divided by the sum of hemi_count and homo_count. raw_homo: homo_count divided by the total number of simulated trees. pepo: cond_hemi divided by cond_homo.: https://cdn.elifesciences.org/articles/63753/elife-63753-fig5-data1-v2.xlsx
Download elife-63753-fig5-data1-v2.xlsx

While the change in the probability of hemiplasy is broadly consistent with theoretical expectations, the probabilities estimated from HeIST consistently underestimated the exact values predicted from theory by a small amount (Figure 5—figure supplement 2). We suspect this is due to the occurrence of multiple hits on the same branch of a gene tree, which are not accounted for in our theoretical model. Reversals on branches where hemiplasy can occur would slightly reduce the number of observed hemiplasy cases, leading to the observed underestimation. Consistent with our hypothesis, the mean-normalized mean squared error between simulated and expected values is lower for both lower mutation rates and simulated conditions with a shorter internal branch (Figure 5—figure supplement 3). Overall, the mismatch between simulations and theory appears to be negligible for lower, more realistic trait mutation rates, so we do not believe this will be a concern for most empirical applications.

When the parameters of a phylogenetic network are estimated from empirical data, it is possible that many different parameter combinations may be equally likely, especially when only a subset of features are used to fit the model. However, these combinations may differentially affect the probability of hemiplasy: for instance, if the frequency of gene trees is used to fit the network model, but the length of gene tree branches is ignored. To investigate this, we applied HeIST to five simulated conditions in which the probability and recency of introgression were increased, while the frequency of the discordant gene tree that could cause hemiplasy was held constant (Figure 5—figure supplement 4). We found that, despite a constant gene tree probability, the conditional probability of hemiplasy increased in each successive condition as introgression became more recent and frequent (Figure 5—figure supplement 5). These results to some extent merely serve to reinforce the notion that introgression has an effect on hemiplasy above and beyond the effect of ILS alone: by lengthening the branch on the discordant tree that hemiplastic mutations can occur on, introgression has a larger effect than ILS alone. But even when network models that include introgression are used, the estimated effects on hemiplasy will be conservative if parameters are estimated using gene tree frequencies alone.

To evaluate the effects of using HeIST on real data, we compared results using the ‘true’ species tree (Figure 5—figure supplement 6A) to those obtained from estimating the species tree with branch lengths using simulated DNA data. This data was run through a pipeline involving estimating a phylogeny using RAxML, converting branch lengths to coalescent units, and smoothing (Figure 5—figure supplement 6B and C). In all cases, the tree was comprised of eight taxa with no introgression, with three incongruent taxa sharing a hypothetical derived character with a mutation rate of 0.05 per 2N generations. Regardless of whether the ‘extend’ or ‘redistribute’ method was used for smoothing, the overall effect of estimating the tree from sequence data was to lengthen both internal and tip branch lengths, reducing the conditional probability of hemiplasy relative to when the true tree was used (see Figure 5—figure supplement 6 for exact probabilities). These results suggest that when our unit-conversion approach and smoothing are applied to empirical datasets, the resulting probability estimates will be conservative with respect to the hypothesis of hemiplasy.

The distribution of green-blooded New Guinea lizards is likely to have arisen from fewer than four transitions

We investigated the most likely number of transitions to green blood from a red-blooded ancestor in New Guinea lizards of the genus Prasinohaema (Rodriguez et al., 2018). Phylogenies constructed using RAxML (Figure 1, Figure 1—figure supplement 1) and ASTRAL (Figure 1—figure supplement 2) recover the phylogeny published by Rodriguez et al., 2018, including the placement of green-blooded species, and also confirm the existence of very short internal branches. In line with this observation, site concordance factors estimated from UCEs indicate very high rates of discordance in this clade, with some approaching a star tree (i.e. all topologies having frequencies of 33%) (Figure 1, Supplementary file 1). This strongly suggests that the apparent convergent evolution of the green blood phenotype has been affected by hemiplasy.

We used HeIST with the 15-taxon subclade containing six green-blooded species to determine the most likely number of trait transitions. Using our branch length unit conversion tool subs2coal, we obtained a best fit line of $y = 0.3038 + 157.03 x$ with an adjusted R² of 0.554 (Figure 6—figure supplement 1A). This formula was used to predict the tip branch lengths of the lizard phylogeny in coalescent units, for input to HeIST (Figure 6—figure supplement 2). This analysis was repeated using two different outgroups, which differed in their distance from the focal subclade. The results were essentially the same using both outgroups; here, we present probabilities using the closer outgroup, Scincella lateralis. After simulating 10¹⁰ loci from the lizard phylogeny using HeIST, we obtained 2042 loci with a distribution of derived states that matched the empirical distribution of green-blooded species. It is important to note that this number is expected to be a very small proportion of the total number of simulated loci. This occurs because it is necessary to simulate trait histories randomly, but we use only the ones that match the observed distribution. Due to the enormous space of possibilities, the probability of any single trait distribution will be very low, especially with large numbers of taxa and high rates of trait incongruence.

With four independent transitions required without discordance, there are three possible scenarios that involve at least one hemiplastic transition (Figure 6A). The first is a hemiplasy-only scenario, in which all green-blooded species are grouped into a single monophyletic clade in a gene tree, and a single transition in the ancestor of this clade explains the observed distribution (Figure 6A, left). Out of 2042 focal loci, 726 (35.5%) correspond to this hemiplasy-only case. In the second case, the green-blooded species may be grouped into one or two clades in a gene tree, and there are two independent transitions—at least one of which must involve a discordant ancestral branch (Figure 6A, mid-left). Since there are still multiple independent transitions, this case represents a combination of hemiplasy and homoplasy, but exactly which mutations on which branches are hemiplasy vs. homoplasy will depend on the gene tree topology. Of 2042 focal cases, 1316 (64.5%), correspond to this scenario. In the third case, the green-blooded species are grouped into as many as three clades, with three independent transitions, at least one of which must be hemiplastic (Figure 6A, mid-right). Finally, the green-blooded species may be grouped into four clades, with four independent transitions, as in the species tree (Figure 6A, right). We observed no instances of the latter two cases out of 2042 focal loci. These results strongly support the conclusion that, due to hemiplasy, the green-blooded phenotype arose from one or two independent transitions, rather than four.

Figure 6 with 2 supplements see all

Download asset Open asset

Probable histories for (A) the origin of green blood in New Guinea lizards and (B) the chromosomal inversion spanning the gene *cortex* in *Heliconius*, calculated using *HeIST*.

Trees depict the maximum number of clades expected for gene tree topologies under each scenario, with green-blooded clades in green and inversion clades in blue. Branches with proposed ancestral-to-derived transitions are labeled with stars. Exactly which species are sorted into these clades can vary, meaning many possible gene trees exist for each of the depicted scenarios. Correspondingly, any of the labeled hypothetical mutations could represent hemiplasy or homoplasy (except in the case of a single mutation, which must be hemiplasy), depending on the gene tree topology. Reported probabilities are based on 10¹⁰ simulated trees for New Guinea lizards and 10⁷ trees for *Heliconius*, with probabilities conditional on matching the empirical trait distributions. Panel C shows what proportion of gene trees originate from a history of introgression vs. the species tree for the results summarized in panel B (blue) as compared to what would be expected based on the inferred network in Figure 1 (black).

Figure 6—source data 1 Output file from the HeIST lizard analysis, from which the probabilities are reported in Figure 6A.: https://cdn.elifesciences.org/articles/63753/elife-63753-fig6-data1-v2.zip
Download elife-63753-fig6-data1-v2.zip
Figure 6—source data 2 Output file from the HeIST butterfly analysis, from which probabilities are reported in Figure 6B and C.: https://cdn.elifesciences.org/articles/63753/elife-63753-fig6-data2-v2.zip
Download elife-63753-fig6-data2-v2.zip

In all 2042 simulated focal cases, the gene trees on which mutations arose grouped the green-blooded species as monophyletic, regardless of the number of mutations that occurred on the tree. In addition, almost all these monophyletic clades share the same structure, containing two subclades: one containing P. semoni, P. prehensicauda, P. flavipes, and P. sp nov 1; another containing P. virens and P. sp nov 2. It is important to note that the frequency of monophyletic groupings is not expected to reflect the overall distribution of gene trees, but rather the distribution conditional on observing the trait incongruence of interest. These observations make our estimated probabilities easy to interpret: if there was one mutation, it happened in the ancestor of the green-blooded clade; if there were two mutations, they most likely occurred in the ancestors of the two subclades.

Following the logic of ‘phyloGWAS’ (Pease et al., 2016), we checked biallelic sites in the UCE alignment and topologies from the UCE gene trees for a monophyletic clade of green-blooded lizards in order to identify regions potentially associated with variation in blood color. However, both the gene tree and UCE datasets contained missing samples, which made it difficult to confidently identify truly monophyletic clades. On average, approximately nine taxa were unrepresented in the tips of individual gene trees, and approximately 10 were not assigned a base at individual sites in UCEs. The identity of the missing taxa varied across sites and trees, but often included species inside the 15-taxon subclade containing the green-blooded species, which made it more difficult to consistently polarize and compare patterns of monophyly. In the small proportion of gene trees and UCE sites where information was available for all taxa, we did not find any monophyletic groupings of green-blooded species.

A chromosomal inversion in the Heliconius erato/sara clade likely has a single origin

In addition to the analysis of green-blooded lizards, we also investigated the origins of a chromosomal inversion in the Heliconius erato/sara clade (Edelman et al., 2019). This inversion spans the gene cortex, which is known to influence wing patterning and coloration across butterflies (Joron et al., 2006; Nadeau et al., 2016). While parsimony applied to the species phylogeny would suggest two independent origins of the inversion (Figure 1B), there is clear evidence in Edelman et al., 2019 of both incomplete lineage sorting and introgression among the clades sharing the inversion, implicating a role for hemiplasy.

We inferred branch lengths in coalescent units for the phylogenetic network of these species given in Edelman et al., 2019. Using our unit conversion tool, we obtained a best-fit line of $y = - 1.815 + 302.49 x$ with an adjusted R² of 0.98 (though as a note of caution with the R², this regression contained only five data points; see Figure 6—figure supplement 1B). The predicted branch leading to the outgroup was extremely long (~40N generations), so the tree was smoothed using the ‘extend’ method without the outgroup, and the outgroup was re-added post-smoothing at a length proportional to the original network. The two most highly supported introgression events in the inferred phylogenetic network were then added to the coalescent tree with their previously inferred direction, rate, and approximate timing, before being given to HeIST as input (Figure 1B).

Using HeIST, we found that a single origin of the inversion was most likely, representing 660 of 923 (71.5%) focal cases (Figure 6B, left). The scenario involving two mutations was less likely, but was still found in 253 of 923 cases (27.4%) (Figure 6B, right). We also observed a small number (3/923, 0.32%) of focal cases with three independent transitions. Overall, our results support the original findings of Edelman et al., 2019 that the inversion likely arose once and then was shared between lineages via introgression.

Out of 923 simulated loci matching the trait pattern, we found that 413 originated from an introgressed history. This proportion (0.447) is substantially higher than the sum of introgression probabilities specified in the input (0.201), which suggests that introgression contributes more to the probability of observing the trait incongruence than would be expected by chance. In addition, as in the lizard simulations, we found that almost every simulated focal tree (913/923, 98.9%) grouped the Heliconius species that share the inversion as monophyletic. However, there is more variation in the structure of the subclades than there was in the lizards. Nevertheless, we can infer from this that two-mutation cases are most likely to arise as independent mutations in the ancestors of two subclades that are part of a larger monophyletic group.

Discussion

Phenotypic convergence among species can provide important evidence for natural selection. The molecular variation underlying this convergence can arise through independent mutations at the molecular level (Storz, 2016). However, it has recently become clear that such cases of ‘true’ convergence need to be distinguished from cases of apparent convergence due to hemiplasy (Hahn and Nakhleh, 2016). Some effort has been made in this regard, through the use of coalescent simulation, summary statistics, and updated comparative approaches (Pease et al., 2016; Copetti et al., 2017; Guerrero and Hahn, 2018; Wu et al., 2018). However, these approaches often assume incomplete lineage sorting as the only source of discordance, and cannot explicitly resolve the number of transitions required to explain a trait distribution while accounting for discordance. More recently, Bastide et al., 2018 and Karimi et al., 2020 developed extensions to comparative methods that allow quantitative trait likelihoods to be calculated on phylogenetic networks. However, while phylogenetic network inference methods are often robust to the effects of ILS (Solís-Lemus and Ané, 2016; Wen et al., 2018), the estimated networks themselves do not contain the necessary information to simultaneously capture the effects of ILS and introgression on trait probabilities (Mendes et al., 2018).

Here, we take two important steps toward addressing these problems by: (1) studying the effect of introgression on the risk of hemiplasy under the multispecies network coalescent model and (2) providing a tool that can infer the most probable number of transitions given a phylogenetic distribution of binary traits. We find that introgression increases the risk of hemiplasy over ILS alone, and uncover likely hemiplastic origins for the evolution of green blood from a red-blooded ancestor in New Guinea lizards, and a chromosomal inversion spanning a gene important for wing coloration in Heliconius. While our work has important implications for studies of trait evolution, it also carries numerous limitations and simplifying assumptions, which suggest logical next steps for further work. Below we discuss these implications, assumptions, and future directions.

The probability of hemiplasy due to introgression

A multitude of studies have revealed the potential role of introgression in shaping phenotypic convergence and adaptation (e.g. Heliconius Genome Consortium, 2012; Huerta-Sánchez et al., 2014; Jones et al., 2018; Mullen et al., 2020). However, such studies rarely consider how introgression could lead to false inferences of convergence, due to hemiplasy at both the molecular and phenotypic levels, if left unaccounted for. Our model results show that both ILS and introgression must be accounted for in order to make robust inferences of convergent evolution.

Our model for the probability of hemiplasy with introgression, combining concepts from two previously published models, also shares most of their assumptions. First, we have assumed the simplest possible introgression scenario, involving a single pair of species and with introgression occurring instantaneously at some point in the past. However, much more complex introgression scenarios are possible, including introgression between multiple species pairs, involving ancestral populations (and internal branches), at multiple time points in the past, or continuously over a period of time. Horizontal gene transfer, which is more common in prokaryotes, would also require networks that contain reticulation edges spanning very long periods of time. It is not always clear how the probability of hemiplasy would be affected under these alternative introgression scenarios. For example, we assume that the taxa sharing the derived state are also the ones involved in introgression, but introgression between other species pairs could alter patterns of discordance and therefore affect the hemiplasy risk, albeit less directly. Many of these scenarios could be incorporated into the general MSNC framework as additional parent trees, but with more complex histories this may become mathematically intractable even in the three-taxon case; our hemiplasy inference tool, HeIST, is designed to ameliorate this issue. Despite these limitations, we can generally expect that introgression will increase the overall risk of hemiplasy whenever rates of introgression are higher between pairs of species that also share the derived state for an incongruent trait. This is because what truly matters is the generation of gene tree topologies with internal branches where hemiplastic transitions can occur; the increased variance in coalescence times under more complex introgression scenarios, while affecting mutation probabilities, should have a comparatively minor effect (Figure 4).

We also assume that the coalescence times and gene tree frequencies of loci underlying trait variation follow neutral expectations, even though alleles controlling trait variation are often under some form of selection. Directional selection on such variation will reduce N_e relative to neutral expectations, which will decrease the rate of incomplete lineage sorting and consequently hemiplasy due to ILS. Of course, the amount of ILS used in our simulations is not taken directly from neutral expectations, but rather is estimated from real data. Therefore, the effects of selection on traits of interest will only be manifest if they are greater than the general effects of linked selection across the regions used to estimate discordance (Kern and Hahn, 2018). On the other hand, introgressed alleles can lead to hemiplasy even in cases where there is no ILS. In fact, directional selection would also make it more likely that introgressed loci have a discordant topology, as it reduces ILS within parent trees 2 and 3. Alternatively, balancing selection can maintain ancestral polymorphism and increase rates of discordance due to ILS. This will also increase the risk of hemiplasy (e.g. Fontaine et al., 2015; Lamichhaney et al., 2016; Palesch et al., 2018).

Considerations for the inference tool HeIST

While the software we introduce here allows for multiple novel types of inferences, it also has several limitations that are important to address. Errors common to all phylogenetic methods can be introduced into the user-specified species tree/network at several steps, including errors in ortholog identification, tree topology, concordance factors, and branch lengths (via both the conversion to coalescent units and tree smoothing). The process of smoothing the coalescent tree should introduce predictable biases in branch length estimates. When using ete3’s method for redistributing branch lengths, internal branches that are very short may have their length increased; conversely, long external branches may be shortened. The lengthening of internal branches decreases the overall rate of discordance, and makes inferences about hemiplasy from HeIST conservative. Similarly, when smoothing is done using our function for extending tip branch lengths, the probability of independent mutations on those tip branches (i.e. homoplasy) is increased, again making hemiplasy inferences conservative. The results presented in Figure 5—figure supplement 6 capture the overall effects of errors in phylogeny estimation, branch length prediction, and smoothing.

Errors in inference may affect our approach to branch length unit conversion in several ways. If concordance factors are underestimates—for instance, due to errors in gene tree reconstruction—then the branch lengths in coalescent units will also be underestimates of their true values. The result would be simulations with more ILS and discordance than actually occurred. In cases where there are concerns about branch length estimates, we suggest running HeIST across multiple values; for tip branches, the option exists within HeIST to use the lower and upper bounds of the prediction interval in addition to the predictions themselves. In addition, if there are tip branch lengths in the original tree that fall outside the range of internal branch length values, the predicted value of those tip branches in coalescent units may be less reliable, since it requires extrapolation beyond the range of datapoints used to fit the regression. Lastly, we note that Bastide et al., 2018 propose an approach to estimating coalescent tip branch lengths on a network using the method of least-squares between pairwise genetic distances and network pairwise distances. We expect this approach to have very similar performance to ours, since linear regression is done using least-squares and pairwise genetic distances should be highly correlated with concordance factors.

There are also several practical points to consider when applying HeIST to empirical data. When researchers have questions about hemiplasy involving either very large phylogenies or very low mutation rates, only a small number of simulated trees may match the incongruent pattern found in real data. The large number of simulations required may not be computationally feasible, although careful pruning of species that do not affect inferences of hemiplasy may greatly reduce this limitation. By default, HeIST will prune the input phylogeny to include the smallest subclade that contains all the taxa with the derived state, plus a specified outgroup. In addition, while HeIST can simulate phylogenies with introgression, it requires that the timing, direction, and rate of each introgression event is provided. To obtain this information, we recommend using a phylogenetic network-based approach such as PhyloNet (Wen et al., 2018), SNaQ (Solís-Lemus and Ané, 2016), or the SpeciesNetwork (Zhang et al., 2018a) package within BEAST2 (Bouckaert et al., 2019).

Finally, an issue that concerns both our theoretical work and HeIST is the specification of the mutation rate. In both cases, we assume that the rates of 0 → 1 and 1 → 0 transitions are equivalent, and that these rates are constant across the tree under study. Violations of these assumptions will certainly influence the probabilities of hemiplasy and homoplasy, although it is unlikely that underlying mutation rates will vary substantially among closely related lineages (Lynch, 2010). More importantly, these rates represent the mutation rate among character states, and may not always be the same as nucleotide mutation rates. We have assumed in the results presented here that transitions between character states are controlled by a single site, and therefore that the nucleotide mutation rate is a good approximation of the trait mutation rate. However, the degree to which this is true will depend on the genetic architecture underlying a trait. For example, transitions in floral color are often underlain by loss-of-function mutations, and many mutational targets can potentially lead to the same phenotypic changes (Rausher, 2008; Smith and Rausher, 2011). In such cases, the rate of trait transitions can potentially be many times higher than the nucleotide mutation rate, with homoplasy becoming more probable as a result. In contrast, trait transitions can also require multiple molecular changes, the order of which may be constrained by pleiotropy and epistasis. Such changes underlie, for instance, high-altitude adaptation of hemoglobin in mammals (Storz et al., 2009; Tufts et al., 2015). In these cases, the rate of trait transitions may be many times lower than the nucleotide mutation rate, with hemiplasy becoming more probable as a result.

Evolution of green-blooded lizards and the Heliconious inversion

In our analysis of lizards in the genus Prasinohaema, we found strong support for one or two independent origins of green blood from a red-blooded ancestor, with two origins being the most likely. This contrasts with analyses that do not account for gene tree discordance, in which four transitions is the best explanation. In Heliconius, we found support for a single origin of a chromosomal inversion, in contrast to methods that do not account for discordance. Both these results strongly suggest that hemiplasy has played a role in the evolution of these traits.

Applications of HeIST to these clades involves some system-specific assumptions, the first of which relates to the genetic architecture of the traits under study. For the lizard analysis, it invokes the potentially strong assumption that the green-blooded phenotype is achievable by a single mutation. While the physiological mechanism for this phenotype is well-understood (Austin and Jessing, 1994), the genetic architecture underlying the transition from a red-blooded ancestor is not. As discussed in the previous section, this architecture will affect the choice of θ used as the trait evolutionary rate in our simulations. Since the genetic architecture is unknown, our choice of θ was based on what is typically observed for nucleotide mutations in vertebrate systems (Lynch, 2010). For the Heliconius inversion, the architecture is more clear-cut, since chromosomal inversions are a single mutational event by definition. While the per-generation rate of de novo chromosomal inversions is not known for many systems, it is certain to be lower than the rate for nucleotide mutations per-site. Nucleotide θ is estimated at 0.02–0.03 for H. melpomene (Martin et al., 2016), and averages around 0.01 in invertebrates (Lynch, 2010). Our choice of θ for the inversion was one order of magnitude lower than these estimates.

Another key assumption is that the estimated gene trees and concordance factors are accurate, as is the regression approach for converting branch length units. The observed R² of 0.554 for the unit-conversion in the lizard dataset might be interpreted as surprisingly low given that it is a regression of the same quantity measured in two different units. This value likely reflects uncertainty generated in several steps of our analysis, including the estimation of branch lengths in the maximum-likelihood species tree, and the procedure of randomly sampling quartets to estimate sCFs used by IQ-TREE. In Heliconius, the R² was much higher at 0.98, but with only five data points there was limited information about the true relationship. Nonetheless, we observed the expected positive correlation in both cases, and a sufficient amount of variation is explained to ensure that tip branches estimated in coalescent units are proportionally similar to those in the maximum-likelihood tree, suggesting that the regression approach works well as an approximation. In addition, the regression line on the lizard data appears to slightly over-estimate very short branch lengths in coalescent units, making our inferences of hemiplasy conservative.

Conclusions

A major question in the study of convergent evolution is whether phenotypic convergence is underlain by convergent changes at the molecular level (Storz, 2016). The work presented here is concerned primarily with such molecular changes, and the results of our empirical analyses highlight how apparently convergent phenotypes can arise from a single molecular change. Such shared changes come about as a result of gene tree discordance due to ILS, introgression, or some combination of the two. Given that these phenomena are common in phylogenomic datasets (Pollard et al., 2006; Fontaine et al., 2015; Pease et al., 2016; Novikova et al., 2016; Wu et al., 2018), perhaps it should be less surprising that phylogenetically incongruent traits often have a common genetic basis.

Finally, while the tools presented here may help to rule out cases of molecular convergence, the observation of a single molecular origin for a trait does not rule out the occurrence of convergent adaptation in general. Parallel selective pressures from the environment on the same molecular variation may be regarded as one of many possible modes of convergent evolution (Lee and Coop, 2017). In studying novel phenotypes such as green blood or wing patterning and coloration, there is still tremendous interest in understanding the ecological pressures that may have led to the independent fixation of single, ancestral changes along multiple lineages. In general, integrative approaches combining modern phylogenomics with an ecological context will pave the way toward an improved understanding of the nature of convergent evolution.

Materials and methods

Accuracy of HeIST

Request a detailed protocol

To confirm that HeIST accurately counts mutation events, and is consistent with our theoretical findings, we evaluated its performance under nine simulated conditions with increasing levels of expected hemiplasy. All simulated conditions involve a four-taxon tree with the topology (((4,3),2),1). Species 4 and 2 carry the derived state for a hypothetical binary character. The split of species one from the ancestor of 4, 3, and 2 occurs at 8N generations in the past. The first three simulated conditions contain no introgression, and progressively decrease the length of the internal branch subtending species 4 and 3. The total tree height was held constant. The simulated internal branch lengths were 2N, 1.5N, and N generations for conditions ILS1, ILS2, and ILS3 respectively. The subsequent six conditions maintain the ILS3 condition for branch lengths, with the addition of an introgression event from species 2 into species 4. For conditions INT1, INT2, and INT3, the timing of introgression was held constant at 0.6N generations, while the introgression probability was set to 0.01, 0.05, and 0.1, respectively. For conditions INT4, INT5, and INT6, the introgression probability was held constant at 0.1, while the timing of introgression was reduced to 0.4N, 0.2N, and 0.1N, respectively. The parameters used for each condition are summarized in Figure 5—figure supplement 1.

We performed two sets of simulations: (1) 100 replicates of each condition, consisting of 100,000 gene trees each, with a constant mutation rate of 0.05 per 2N generations; (2) 20 replicates of each condition, for each of five different mutation rates per 2N generations (0.0005, 0.0025, 0.005, 0.025, 0.05), each consisting of 1,000,000 gene trees. For each combination of parameters, we estimated the probability of hemiplasy conditional on observing the specified trait pattern, and the raw probability of hemiplasy out of the total number of replicates. For the latter simulation set, we estimated the mean-squared error (MSE) using the simulated values as observations and the expected value from theory as the true mean. These MSE values were divided by the simulated mean to compare error across conditions with different ranges of expected values.

Trade-offs among parameters mean that many combinations of estimated network parameters may be equally consistent with patterns in subsets of the observed data. To investigate possible effects on the probability of hemiplasy, we evaluated the performance of HeIST under five additional simulation conditions (Figure 5—figure supplement 4). In each successive condition, the probability of introgression was increased, while the timing of introgression was made more recent. The length of the internal branch in the species tree was also increased such that the expected frequency of the discordant gene tree that causes hemiplasy remained approximately constant (Figure 5—figure supplement 5A). These simulations used the same tree topology, derived taxa, split time of the ancestral population, and mutation rate as the first set of benchmarking simulations. Condition 1 used the same parameters as ILS1. Conditions 2–5 used the following sets of parameters, respectively: 2.08N, 2.32N, 2.6N, 2.8N generations for the length of the internal branch; 0.01, 0.025, 0.04, 0.05 for the probability of introgression; 0.4N, 0.3N, 0.2N, 0.1N generations for the timing of introgression (Figure 5—figure supplement 4). For each condition, we performed 100 replicate simulations of 100,000 gene trees each in HeIST, and estimated the probability of hemiplasy conditional on observing the trait pattern.

Inferring the tip branch lengths of a phylogeny in coalescent units

Request a detailed protocol

Inferences made under the multispecies coalescent require branch lengths specified in coalescent units. However, most standard methods for building phylogenies infer branches in units of substitutions per site. Units of absolute time inferred from substitution rates using molecular clock approaches can be converted into coalescent units, provided that the generation time and effective population size are known. However, these parameters are sometimes not available or accurate for a given system. As an alternative, estimates of gene tree discordance can be used to estimate internal branch lengths in coalescent units, but these provide no information about the lengths of tip branches. For example, the species tree inference software ASTRAL (Zhang et al., 2018b) does not infer tip branch lengths, while the software MP-EST (Liu et al., 2010) adds branches of length nine for every tip. These tip lengths are necessary to make accurate inferences about hemiplasy and homoplasy from empirical data, since they affect the probability of mutation on tip branches.

To ameliorate this problem, we have applied a simple regression approach for inferring tip lengths in coalescent units (see Bastide et al., 2018 for an alternative method). Our approach makes use of concordance factors: estimates of the fraction of concordant loci with respect to a particular branch in a species tree. Concordance factors come in two flavors: gene concordance factors (gCFs) (Gadagkar et al., 2005; Ané et al., 2007), which estimate the concordance of gene tree topologies, and site concordance factors (sCFs) (Minh et al., 2020a), which do the same for parsimony-informative sites. In general, concordance factors estimated from quartets provide an estimate of $1 - \frac{2}{3} e^{- T}$ , where T is the length of the internal branch in coalescent units. With concordance factors given on the internal branches of a tree that has lengths in substitutions per site, the aforementioned formula can be used to obtain estimates of those same branch lengths in units of 2N generations. A regression of the internal branch length estimates in both units can then be used to obtain a formula for unit conversion between them. HeIST uses this formula to predict the tip branch lengths of the tree in coalescent units. To partially account for uncertainty introduced during tip branch length prediction, HeIST can also be run using the lower or upper bounds of the prediction 95% confidence interval as the inferred tip lengths, in addition to the predictions themselves. As a final step in this process, the tree in coalescent units is smoothed, as ms requires the input tree to be ultrametric. HeIST has two options for how to perform this smoothing. The first redistributes the tree branch lengths so that the distance from the root to each tip is the same; this is done using the convert_to_ultrametric() function in the Python library ete3 (Huerta-Cepas et al., 2016). The second extends the lengths of tip branches while preserving internal branch lengths; this function is coded within HeIST, but was borrowed from a commented block in ete3’s source code.

To investigate the potential bias introduced to results from HeIST by either phylogenetic inference, the branch regression approach, or subsequent smoothing, we compared the outputs of HeIST run from an eight-taxon test tree with known branch lengths. To generate realistic datasets, we first simulated 3000 gene trees from the known species tree using ms, and then simulated 1 Kb of sequence from each locus with θ = 0.001 using Seq-Gen. These loci were concatenated into a single 3 Mb alignment, which was given to RAxML version 8.2.12 (Stamatakis, 2014) using the GTR substitution model with rate heterogeneity to infer a species tree in units of substitutions per site. This inferred tree and the concatenated alignment were given to IQ-TREE version 2.0 (Minh et al., 2020b) to infer site concordance factors. This substitution tree with nodes labeled with concordance factors was given as input to HeIST, where our branch regression approach was applied. We tested both methods for tree smoothing from this input. We estimated the probability of each number of mutations conditional on observing the incongruent site pattern, and results from this analysis were compared to those obtained using the ‘true’ test tree (Figure 5—figure supplement 6).

Our regression approach is implemented in HeIST and can be run as part of the overall hemiplasy analysis, or separately using the module ‘subs2coal’.

Empirical applications of HeIST

Request a detailed protocol

We applied HeIST to two empirical case studies where hemiplasy appeared to be a plausible explanation for observed trait incongruences. The first is a dataset of New Guinea lizards (Rodriguez et al., 2018). As described in the Introduction, the genus Prasinohaema contains six species that have evolved green blood from a red-blooded ancestor (Figure 1A). Previous analyses of the species tree built from thousands of loci inferred that four independent transitions are necessary to explain the phylogenetic distribution of green-blooded species (Rodriguez et al., 2018). This conclusion is the same using any standard phylogenetic comparative method, whether ancestral state reconstruction is carried out using maximum likelihood (Rodriguez et al., 2018) or Fitch parsimony (this study). However, the phylogeny for this clade contains many short branches (Figure 1), suggesting that a scenario involving at least some hemiplasy (in this case, 1–3 mutations) may be preferred over homoplasy-only scenarios when discordance is accounted for.

To address this question, we used the original dataset of Rodriguez et al., 2018, consisting of 3220 ultra-conserved elements (UCEs) totalling approximately 1.3 Mb for 43 species. We then down-sampled these species to 15 taxa in the clade including the green-blooded species and an outgroup (Figure 1A). We constructed a concatenated maximum likelihood species tree, in addition to gene trees for each UCE, using RAxML version 8.2.12 (Stamatakis, 2014). To verify the species tree topology for the 15-taxon subclade, we also constructed a tree with ASTRAL-III version 5.6.3 (Zhang et al., 2018a). Site and gene concordance factors were calculated for this tree using IQ-TREE version 2.0 (Minh et al., 2020a; Minh et al., 2020b). To obtain the phylogeny in coalescent units, we employed the regression approach described above for unit conversion as implemented in HeIST. The ‘extend’ method was used for tree smoothing. We then used HeIST to simulate 10¹⁰ loci from the lizard subclade containing green-blooded species, with a population-scaled mutation rate (θ) of 0.0005 per 2N generations. While specific parameter estimates are not available for this system, our choice of θ reflects broad estimates of N_e and µ on the order of 10⁵–10⁶ (Lynch, 2006) and 10⁻⁸ – 10⁻⁹ per-base per-generation (Lynch, 2010), respectively, in vertebrates (see Discussion). This analysis was performed for each of two outgroups: Lygosoma sp, which is sister to 40 species in the 43-species phylogeny, and Scincella lateralis, which is sister to the 15-taxon clade containing the green-blooded species. We also calculated D-statistics (Green et al., 2010) for 12 trios involving green-blooded taxa, finding no strong evidence of introgression (block bootstrap significance tests, Supplementary file 1-Tables 2 and 3). Therefore, our simulations did not include any introgression events.

The second empirical case study involves the origins of a chromosomal inversion spanning a gene important for wing coloration in Heliconius butterflies (Edelman et al., 2019). The derived inversion arrangement is shared by four taxa, grouped into two subclades in the erato/sara group of Heliconius. Fitch parsimony suggests two independent origins, but a combination of short internal branches and introgression between the ancestral populations sharing the inversion (Edelman et al., 2019) strongly suggests a role for hemiplasy. We obtained the phylogenetic network—that is, the species tree with reticulation edges—inferred in units of substitutions per site, in addition to gene concordance factors, from the authors. As our regression approach for conversion to coalescent units cannot be used on phylogenetic networks directly, we used the species tree embedded in the network with concordance factors as input to subs2coal. The two most strongly supported introgression events were then added back onto the smoothed network in coalescent units, using the inferred rates and directions, with approximate timings based on the location of the events in the original network and our requirement that these events be instantaneous ‘pulses.’ From this input (shown in Figure 1B), we simulated 10⁷ gene trees in HeIST using a mutation rate of 0.0005 per 2N generations. That our choice of θ for this system is the same as in our lizard analysis is just a coincidence: it reflects a trade-off between the generally higher effective population size for insects (Lynch, 2006) and the lower mutation rate expected for chromosomal inversions; see Discussion. We also performed the same simulations without specifying the introgression events to obtain an ILS-only estimate of the probability of hemiplasy.

Appendix 1

1 Mutation probabilities on genealogies

Each of the twelve possible genealogies under our parent tree model has a set of five branch lengths along which mutations can occur. $λ_{1}$ , $λ_{2}$ , and $λ_{3}$ denote the tip branches leading to species A, B, and C respectively; $λ_{4}$ denotes the internal branch, and $λ_{5}$ denotes the ancestral branch. As described in the supplement of Guerrero and Hahn, 2018, the mutation probability on each of these branches has the general form $\int 1 - e^{- μ x} f (x) d x$ , where µ is the mutation probability per $2 N$ generations, x is the random variable for the branch length, and f(x) is the probability density function for x. We begin with the mutation probabilities for parent tree 1, which are found in the supplement of Guerrero and Hahn, and will be re-written here to be consistent with notation. In the following notation, parent tree 1 will be denoted as "pt1". Since many of the genealogies are identical in length, the mutation probabilities on their branches can be written with general expressions. We first consider the genealogies $A B 2_{1}$ , $B C_{1}$ , and $A C_{1}$ , which are all produced via incomplete lineage sorting in parent tree 1, and share the following set of mutation probabilities:

ν_{1} [I L S, p t 1] = \frac{1}{Λ} \int_{0}^{t_{3} - t_{2}} (1 - e^{- μ (t_{1} + (t_{2} - t_{1}) + x)}) \frac{3}{2} (e^{- x} - e^{- 3 x}) d x

ν_{2} [I L S, p t 1] = \frac{1}{Λ} \int_{0}^{t_{3} - t_{2}} (1 - e^{- μ (t_{1} + (t_{2} - t_{1}) + x)}) 3 e^{- 3 x} (1 - e^{- ((t_{3} - t_{2}) - x)}) d x

ν_{4} [I L S, p t 1] = {\frac{1}{Λ}}_{0}^{t_{3} - t_{2}} \int_{0}^{t_{3} - t_{2}} 3 e^{- 3 y} (\int_{0}^{(t_{3} - t_{2}) - y} e^{- x} (1 - e^{- μ x}) d x) d y

ν_{5} [I L S, p t 1] = \frac{1}{Λ} \int_{0}^{t_{3} - t_{2}} (1 - e^{- μ ((t_{3} - t_{2}) - x)}) 3 e^{- 3 x} d x

In each of the above, $Λ = 1 + \frac{1}{2} e^{- 3 (t_{3} - t_{2})} - \frac{3}{2} e^{- (t_{3} - t_{2})}$ is the probability of coalescence of A, B, and C in their ancestral population. t₃ denotes the total height of the tree, i.e. the time at the base of the tree. The difference between t₃ and t₂ determines the duration of the ancestral population of all three taxa, before speciation occurs. Equations 1 through four each represent the mutation probabilities for multiple branches, which are as follows:

ν_{1} [I L S, p t 1] = ν (λ_{3}, A B 2_{1}) = ν (λ_{1}, B C_{1}) = ν (λ_{2}, A C_{1})

\begin{array}{ll} ν_{2} [I L S, p t 1] & = ν (λ_{1}, A B 2_{1}) = ν (λ_{2}, A B 2_{1}) = \\ ν (λ_{2}, B C_{1}) & = ν (λ_{3}, B C_{1}) = ν (λ_{1}, A C_{1}) = ν (λ_{3}, A C_{1}) \end{array}

ν_{4} [I L S, p t 1] = ν (λ_{4}, A B 2_{1}) = ν (λ_{4}, B C_{1}) = ν (λ_{4}, A C_{1})

ν_{5} [I L S, p t 1] = ν (λ_{5}, A B 2_{1}) = ν (λ_{5}, B C_{1}) = ν (λ_{5}, A C_{1})

The gene tree produced by lineage sorting in parent tree 1, $A B 1_{1}$ , has a different set of mutation probabilities, since the branches have different expected lengths. These are:

ν (λ_{1}, A B 1_{1}) = ν (λ_{2}, A B 1_{1}) = \frac{1}{1 - e^{- (t_{2} - t_{1})}} \int_{0}^{t_{2} - t_{1}} (1 - e^{- μ (t_{1} + x)}) e^{- x} d x

ν (λ_{3}, A B 1_{1}) = \frac{1}{1 - e^{- (t_{3} - t_{2})}} \int_{0}^{t_{3} - t_{2}} (1 - e^{- μ (t_{1} + (t_{2} - t_{1}) + x)}) e^{- x} d x

ν (λ_{4}, A B 1_{1}) = \int_{0}^{t_{2} - t_{1}} \frac{e^{- y}}{1 - e^{- (t_{2} - t_{1})}} (\int_{0}^{t_{3} - t_{2}} (1 - e^{- μ ((t_{2} - t_{1}) - y + x)}) \frac{e^{- x}}{1 - e^{- (t_{3} - t_{2})}} d x) d y

ν (λ_{5}, A B 1_{1}) = \int_{0}^{t_{2} - t_{1}} \frac{1}{1 - e^{- (t_{3} - t_{2})}} \int_{0}^{t_{3} - t_{2}} (1 - e^{- μ ((t_{3} - t_{2}) - x)}) e^{- x} d x

Now we consider introgression, starting with parent tree 2. Many of the mutation probabilities are symmetrical with parent tree 1 and therefore remain the same, and the remainder have the same general form with different parameters. For the ILS genealogies $B C 2_{2}$ , $A B_{2}$ , and $A C_{2}$ , Equations 1 and 2 have the time of A-B speciation (t₁) replaced with the timing of B-C introgression (t_m). This gives:

ν_{1} [I L S, p t 2] = \frac{1}{Λ} \int_{0}^{t_{3} - t_{2}} (1 - e^{- μ (t_{m} + (t_{2} - t_{m}) + x)}) \frac{3}{2} (e^{- x} - e^{- 3 x}) d x

ν_{2} [I L S, p t 2] = \frac{1}{Λ} \int_{0}^{t_{3} - t_{2}} (1 - e^{- μ (t_{m} + (t_{2} - t_{m}) + x)}) 3 e^{- 3 x} (1 - e^{- ((t_{3} - t_{2}) - x)}) d x

ν_{4} [I L S, p t 2] = ν_{4} [I L S, p t 1]

ν_{5} [I L S, p t 2] = ν_{5} [I L S, p t 1]

These correspond to the following branch mutation probabilities:

ν_{1} [I L S, p t 2] = ν (λ_{1}, B C 2_{2}) = ν (λ_{3}, A B_{2}) = ν (λ_{2}, A C_{2})

\begin{array}{ll} ν_{2} [I L S, p t 2] & = ν (λ_{2}, B C 2_{2}) = ν (λ_{3}, B C 2_{2}) = \\ ν (λ_{1}, A B_{2}) & = ν (λ_{2}, A B_{2}) = ν (λ_{1}, A C_{2}) = ν (λ_{3}, A C_{2}) \end{array}

ν_{4} [I L S, p t 2] = ν (λ_{4}, B C 2_{2}) = ν (λ_{4}, A B_{2}) = ν (λ_{4}, A C_{2})

ν_{5} [I L S, p t 2] = ν (λ_{5}, B C 2_{2}) = ν (λ_{5}, A B_{2}) = ν (λ_{5}, A C_{2})

For the genealogy produced by lineage sorting in parent tree 2, $B C 1_{2}$ , we have:

ν (λ_{2}, B C 1_{2}) = ν (λ_{3}, B C 1_{2}) = \frac{1}{1 - e^{- (t_{2} - t_{m})}} \int_{0}^{t_{2} - t_{m}} (1 - e^{- μ (t_{m} + x)}) e^{- x} d x

ν (λ_{1}, B C 1_{2}) = \frac{1}{1 - e^{- (t_{3} - t_{2})}} \int_{0}^{t_{3} - t_{2}} (1 - e^{- μ (t_{m} + (t_{2} - t_{m}) + x)}) e^{- x} d x

ν (λ_{4}, B C 1_{2}) = \int_{0}^{t_{2} - t_{m}} \frac{e^{- y}}{1 - e^{- (t_{2} - t_{m})}} (\int_{0}^{t_{3} - t_{2}} (1 - e^{- μ ((t_{2} - t_{m}) - y + x)}) \frac{e^{- x}}{1 - e^{- (t_{3} - t_{2})}} d x) d y

ν (λ_{5}, B C 1_{2}) = ν (λ_{5}, A B 1_{1})

Finally, we consider parent tree 3. The mutation probabilities have the same formulation as parent tree 2, with two key changes: since parent tree 3 is shorter (Figure 2 of main text), t₂ is replaced by t₁. This also applies to the value of $Λ$ , which we will denote for parent tree 3 as $Λ_{3} = 1 + \frac{1}{2} e^{- 3 (t_{3} - t_{1})} - \frac{3}{2} e^{- (t_{3} - t_{1})}$ . For the ILS genealogies $B C 2_{3}$ , $A B_{3}$ , and $A C_{3}$ , this gives:

ν_{1} [I L S, p t 3] = \frac{1}{Λ_{3}} \int_{0}^{t_{3} - t_{1}} (1 - e^{- μ (t_{m} + (t_{1} - t_{m}) + x)}) \frac{3}{2} (e^{- x} - e^{- 3 x}) d x

ν_{2} [I L S, p t 3] = \frac{1}{Λ_{3}} \int_{0}^{t_{3} - t_{1}} (1 - e^{- μ (t_{m} + (t_{1} - t_{m}) + x)}) 3 e^{- 3 x} (1 - e^{- ((t_{3} - t_{1}) - x)}) d x

ν_{4} [I L S, p t 3] = \frac{1}{Λ_{3}} \int_{0}^{t_{3} - t_{1}} 3 e^{- 3 y} (\int_{0}^{(t_{3} - t_{1}) - y} e^{- x} (1 - e^{- μ x}) d x) d y

ν_{5} [I L S, p t 3] = \frac{1}{Λ_{3}} \int_{0}^{t_{3} - t_{1}} (1 - e^{- μ ((t_{3} - t_{1}) - x)}) 3 e^{- 3 x} d x

Where:

ν_{1} [I L S, p t 3] = ν (λ_{1}, B C 2_{3}) = ν (λ_{3}, A B_{3}) = ν (λ_{2}, B C_{3})

\begin{array}{ll} ν_{2} [I L S, p t 3] & = ν (λ_{2}, B C 2_{3}) = ν (λ_{3}, B C 2_{3}) = \\ ν (λ_{1}, A B_{3}) & = ν (λ_{2}, A B_{3}) = ν (λ_{1}, A C_{3}) = ν (λ_{3}, A C_{3}) \end{array}

ν_{4} [I L S, p t 3] = ν (λ_{4}, B C 2_{3}) = ν (λ_{4}, A B_{3}) = ν (λ_{4}, A C_{3})

ν_{5} [I L S, p t 3] = ν (λ_{5}, B C 2_{3}) = ν (λ_{5}, A B_{3}) = ν (λ_{5}, A C_{3})

Finally, for the genealogy $B C 1_{3}$ , the mutation probabilities are as follows:

ν (λ_{2}, B C 1_{3}) = ν (λ_{3}, B C 1_{3}) = \frac{1}{1 - e^{- (t_{1} - t_{m})}} \int_{0}^{t_{1} - t_{m}} (1 - e^{- μ (t_{m} + x)}) e^{- x} d x

ν (λ_{1}, B C 1_{3}) = \frac{1}{1 - e^{- (t_{3} - t_{1})}} \int_{0}^{t_{3} - t_{1}} (1 - e^{- μ (t_{m} + (t_{1} - t_{m}) + x)}) e^{- x} d x

ν (λ_{4}, B C 1_{3}) = \int_{0}^{t_{1} - t_{m}} \frac{e^{- y}}{1 - e^{- (t_{1} - t_{m})}} (\int_{0}^{t_{3} - t_{1}} (1 - e^{- μ ((t_{1} - t_{m}) - y + x)}) \frac{e^{- x}}{1 - e^{- (t_{3} - t_{1})}} d x) d y

ν (λ_{5}, B C 1_{3}) = \frac{1}{1 - e^{- (t_{3} - t_{1})}} \int_{0}^{t_{3} - t_{1}} (1 - e^{- μ ((t_{3} - t_{1}) - x)}) e^{- x} d x

2 When does introgression makes hemiplasy more likely than ILS alone?

The probability of hemiplasy with $C \to B$ introgression is

P_{e} = (1 - δ) P_{e} [B C_{1}] + δ (P_{e} [B C 1_{2}] + P_{e} [B C 2_{2}])

From this, it can be seen that introgression makes hemiplasy more likely than ILS alone when:

P_{e} [B C 1_{2}] + P_{e} [B C 2_{2}] > P_{e} [B C_{1}]

When is this true? Substituting the relevant expressions from the main text gives:

\begin{array}{ll} (1 - e^{- (t_{2} - t_{m})}) v (λ_{4}, B C 1_{2}) \prod_{i \neq 4} (1 - v (λ_{i}, B C 1_{2})) + \\ (\frac{1}{3} e^{- (t_{2} - t_{m})}) v (λ_{4}, B C 2_{2}) \prod_{i \neq} (1 - v (λ_{i}, B C 2_{2})) \\ > (\frac{1}{3} e^{- (t_{2} - t_{1})}) v (λ_{4}, B C_{1}) \prod_{i \neq} (1 - v (λ_{i}, B C_{1})) \end{array}

\begin{array}{ll} (1 - e^{- (t_{2} - t_{m})}) v (λ_{4}, B C 1_{2}) \prod_{i \neq 4} (1 - v (λ_{i}, B C 1_{2})) > \\ (\frac{1}{3} e^{- (t_{2} - t_{1})}) v (λ_{4}, B C_{1}) \prod_{i \neq} (1 - v (λ_{i}, B C_{1})) - \\ (\frac{1}{3} e^{- (t_{2} - t_{m})}) v (λ_{4}, B C 2_{2}) \prod_{i \neq} (1 - v (λ_{i}, B C 2_{2})) \end{array}

The mutation probabilities on the right side of the inequality are equal since they are on the same topology with the same branch lengths. Therefore, Equation 40 can simplified:

\begin{array}{ll} (1 - e^{- (t_{2} - t_{m})}) v (λ_{4}, B C 1_{2}) \prod_{i \neq 4} (1 - v (λ_{i}, B C 1_{2})) > \\ v (λ_{4}, B C_{1}) \prod_{i \neq 4} (1 - v (λ_{i}, B C_{1})) (\frac{1}{3} e^{- (t_{2} - t_{1})} - \frac{1}{3} e^{- (t_{2} - t_{m})}) \end{array}

As the most conservative case, let us assume a hybrid speciation scenario where t₁ = t_m. This represents the most conservative introgression scenario, since Figure 4 in the main text shows that more recent introgression makes hemiplasy more likely. In this case, the right side of the inequality simplifies to 0, leaving

(1 - e^{- (t_{2} - t_{m})}) v (λ_{4}, B C 1_{2}) \prod_{i \neq 4} (1 - v (λ_{i}, B C 1_{2})) > 0

This is true whenever $t_{2} > t_{m}$ , which is true by definition in this model. Therefore, introgression always makes hemiplasy more likely than ILS alone.

Data availability

Availability of the lizard genomic data and Heliconius phylogenetic network is detailed in the Acknowledgements section of the source manuscript. Source code and test cases for our software HeIST are freely available from the GitHub repository. Source data files have been provided for Figures 1, 4, 5, and 6. The Appendix details all the mutation rate parameters of our mathematical model.

The following previously published data sets were used

(2018) NCBI BioProject
ID PRJNA420910. Raw sequencing reads from ultraconserved elements of Australasian skinks.

https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA420910
1. Edelman NB
2. Frandsen PF
3. Miyagi M
4. Clavijo B
5. Davey J
6. Dikow RB
7. Van Belleghem SM
8. Patterson N
9. Neafsey DE
10. Richard C
11. Kumar S
12. Moreira GRP
13. Salazar C
14. Chouteau M
15. Counterman B
16. Papa R
17. Blaxter M
18. Reed RD
19. Dasmahapatra KK
20. Kronforst M
21. Joron M
22. Jiggins CD
23. McMillan WO
24. Di Palma F
25. Blumberg AJ
26. Wakeley J
27. Jaffe D
28. Mallet J
(2019) Dryad Digital Repository
Data from: Genomic architecture and introgression shape a butterfly radiation.

https://doi.org/10.5061/dryad.b7bj832

References

1. Ané C
2. Larget B
3. Baum DA
4. Smith SD
5. Rokas A
(2007) Bayesian estimation of concordance among gene trees
Molecular Biology and Evolution 24:412–426.

https://doi.org/10.1093/molbev/msl170
- PubMed
- Google Scholar
1. Austin CC
2. Jessing KW
(1994) Green-blood pigmentation in lizards
Comparative Biochemistry and Physiology Part A: Physiology 109:619–626.

https://doi.org/10.1016/0300-9629(94)90201-1
- Google Scholar
1. Avise JC
2. Robinson TJ
(2008) Hemiplasy: a new term in the lexicon of phylogenetics
Systematic Biology 57:503–507.

https://doi.org/10.1080/10635150802164587
- PubMed
- Google Scholar
(2018) Phylogenetic comparative methods on phylogenetic networks with reticulations
Systematic Biology 67:800–820.

https://doi.org/10.1093/sysbio/syy033
- PubMed
- Google Scholar
1. Bouckaert R
2. Vaughan TG
3. Barido-Sottani J
4. Duchêne S
5. Fourment M
6. Gavryushkina A
7. Heled J
8. Jones G
9. Kühnert D
10. De Maio N
11. Matschiner M
12. Mendes FK
13. Müller NF
14. Ogilvie HA
15. du Plessis L
16. Popinga A
17. Rambaut A
18. Rasmussen D
19. Siveroni I
20. Suchard MA
21. Wu CH
22. Xie D
23. Zhang C
24. Stadler T
25. Drummond AJ
(2019) BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis
PLOS Computational Biology 15:e1006650.

https://doi.org/10.1371/journal.pcbi.1006650
- PubMed
- Google Scholar
1. Brandvain Y
2. Kenney AM
3. Flagel L
4. Coop G
5. Sweigart AL
(2014) Speciation and introgression between Mimulus nasutus and Mimulus guttatus
PLOS Genetics 10:e1004410.

https://doi.org/10.1371/journal.pgen.1004410
- PubMed
- Google Scholar
1. Copetti D
2. Búrquez A
3. Bustamante E
4. Charboneau JLM
5. Childs KL
6. Eguiarte LE
7. Lee S
8. Liu TL
9. McMahon MM
10. Whiteman NK
11. Wing RA
12. Wojciechowski MF
13. Sanderson MJ
(2017) Extensive gene tree discordance and hemiplasy shaped the genomes of North American columnar cacti
PNAS 114:12003–12008.

https://doi.org/10.1073/pnas.1706367114
- PubMed
- Google Scholar
1. Degnan JH
(2018) Modeling hybridization under the network multispecies coalescent
Systematic Biology 67:786–799.

https://doi.org/10.1093/sysbio/syy040
- PubMed
- Google Scholar
1. Degnan JH
2. Rosenberg NA
(2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent
Trends in Ecology & Evolution 24:332–340.

https://doi.org/10.1016/j.tree.2009.01.009
- PubMed
- Google Scholar
(2005) Phylogenomics and the reconstruction of the tree of life
Nature Reviews Genetics 6:361–375.

https://doi.org/10.1038/nrg1603
- PubMed
- Google Scholar
1. Dunn CW
2. Hejnol A
3. Matus DQ
4. Pang K
5. Browne WE
6. Smith SA
7. Seaver E
8. Rouse GW
9. Obst M
10. Edgecombe GD
11. Sørensen MV
12. Haddock SH
13. Schmidt-Rhaesa A
14. Okusu A
15. Kristensen RM
16. Wheeler WC
17. Martindale MQ
18. Giribet G
(2008) Broad phylogenomic sampling improves resolution of the animal tree of life
Nature 452:745–749.

https://doi.org/10.1038/nature06614
- PubMed
- Google Scholar
(2011) Testing for ancient admixture between closely related populations
Molecular Biology and Evolution 28:2239–2252.

https://doi.org/10.1093/molbev/msr048
- PubMed
- Google Scholar
1. Edelman NB
2. Frandsen PB
3. Miyagi M
4. Clavijo B
5. Davey J
6. Dikow RB
7. García-Accinelli G
8. Van Belleghem SM
9. Patterson N
10. Neafsey DE
11. Challis R
12. Kumar S
13. Moreira GRP
14. Salazar C
15. Chouteau M
16. Counterman BA
17. Papa R
18. Blaxter M
19. Reed RD
20. Dasmahapatra KK
21. Kronforst M
22. Joron M
23. Jiggins CD
24. McMillan WO
25. Di Palma F
26. Blumberg AJ
27. Wakeley J
28. Jaffe D
29. Mallet J
(2019) Genomic architecture and introgression shape a butterfly radiation
Science 366:594–599.

https://doi.org/10.1126/science.aaw2090
- PubMed
- Google Scholar
1. Felsenstein J
(1985) Phylogenies and the comparative method
The American Naturalist 125:1–15.

https://doi.org/10.1086/284325
- Google Scholar
Book
1. Felsenstein J
(2004)
Inferring Phylogenies

Sinauer Associates.
- Google Scholar
1. Fitch WM
(1971) Toward defining the course of evolution: minimum change for a specific tree topology
Systematic Zoology 20:406–416.

https://doi.org/10.2307/2412116
- Google Scholar
1. Fontaine MC
2. Pease JB
3. Steele A
4. Waterhouse RM
5. Neafsey DE
6. Sharakhov IV
7. Jiang X
8. Hall AB
9. Catteruccia F
10. Kakani E
11. Mitchell SN
12. Wu YC
13. Smith HA
14. Love RR
15. Lawniczak MK
16. Slotman MA
17. Emrich SJ
18. Hahn MW
19. Besansky NJ
(2015) Extensive introgression in a malaria vector species complex revealed by phylogenomics
Science 347:1258524.

https://doi.org/10.1126/science.1258524
- PubMed
- Google Scholar
(2005) Inferring species phylogenies from multiple genes: concatenated sequence tree versus consensus gene tree
Journal of Experimental Zoology Part B: Molecular and Developmental Evolution 304B:64–74.

https://doi.org/10.1002/jez.b.21026
- Google Scholar
1. Green RE
2. Krause J
3. Briggs AW
4. Maricic T
5. Stenzel U
6. Kircher M
7. Patterson N
8. Li H
9. Zhai W
10. Fritz MH
11. Hansen NF
12. Durand EY
13. Malaspinas AS
14. Jensen JD
15. Marques-Bonet T
16. Alkan C
17. Prüfer K
18. Meyer M
19. Burbano HA
20. Good JM
21. Schultz R
22. Aximu-Petri A
23. Butthof A
24. Höber B
25. Höffner B
26. Siegemund M
27. Weihmann A
28. Nusbaum C
29. Lander ES
30. Russ C
31. Novod N
32. Affourtit J
33. Egholm M
34. Verna C
35. Rudan P
36. Brajkovic D
37. Kucan Ž
38. Gušic I
39. Doronichev VB
40. Golovanova LV
41. Lalueza-Fox C
42. de la Rasilla M
43. Fortea J
44. Rosas A
45. Schmitz RW
46. Johnson PLF
47. Eichler EE
48. Falush D
49. Birney E
50. Mullikin JC
51. Slatkin M
52. Nielsen R
53. Kelso J
54. Lachmann M
55. Reich D
56. Pääbo S
(2010) A draft sequence of the Neandertal genome
Science 328:710–722.

https://doi.org/10.1126/science.1188021
- PubMed
- Google Scholar
1. Guerrero RF
2. Hahn MW
(2018) Quantifying the risk of hemiplasy in phylogenetic inference
PNAS 115:12787–12792.

https://doi.org/10.1073/pnas.1811268115
- PubMed
- Google Scholar
1. Hahn MW
2. Hibbins MS
(2019) A three-sample test for introgression
Molecular Biology and Evolution 36:2878–2882.

https://doi.org/10.1093/molbev/msz178
- PubMed
- Google Scholar
1. Hahn MW
2. Nakhleh L
(2016) Irrational exuberance for resolved species trees
Evolution 70:7–17.

https://doi.org/10.1111/evo.12832
- Google Scholar
1. Heliconius Genome Consortium
(2012) Butterfly genome reveals promiscuous exchange of mimicry adaptations among species
Nature 487:94–98.

https://doi.org/10.1038/nature11041
- PubMed
- Google Scholar
1. Hibbins MS
2. Hahn MW
(2019) The timing and direction of introgression under the multispecies network coalescent
Genetics 211:1059–1073.

https://doi.org/10.1534/genetics.118.301831
- PubMed
- Google Scholar
1. Hudson RR
(1983) Testing the constant-rate neutral allele model with protein sequence data
Evolution 37:203–217.

https://doi.org/10.1111/j.1558-5646.1983.tb05528.x
- Google Scholar
1. Hudson RR
(2002) Generating samples under a Wright-Fisher neutral model of genetic variation
Bioinformatics 18:337–338.

https://doi.org/10.1093/bioinformatics/18.2.337
- PubMed
- Google Scholar
(2016) ETE 3: reconstruction, analysis, and visualization of phylogenomic data
Molecular Biology and Evolution 33:1635–1638.

https://doi.org/10.1093/molbev/msw046
- PubMed
- Google Scholar
1. Huerta-Sánchez E
2. Jin X
3. Asan
4. Bianba Z
5. Peter BM
6. Vinckenbosch N
7. Liang Y
8. Yi X
9. He M
10. Somel M
11. Ni P
12. Wang B
13. Ou X
14. Huasang
15. Luosang J
16. Cuo ZX
17. Li K
18. Gao G
19. Yin Y
20. Wang W
21. Zhang X
22. Xu X
23. Yang H
24. Li Y
25. Wang J
26. Wang J
27. Nielsen R
(2014) Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA
Nature 512:194–197.

https://doi.org/10.1038/nature13408
- PubMed
- Google Scholar
(2009) A statistical approach for distinguishing hybridization and incomplete lineage sorting
The American Naturalist 174:E54–E70.

https://doi.org/10.1086/600082
- PubMed
- Google Scholar
1. Jones MR
2. Mills LS
3. Alves PC
4. Callahan CM
5. Alves JM
6. Lafferty DJR
7. Jiggins FM
8. Jensen JD
9. Melo-Ferreira J
10. Good JM
(2018) Adaptive introgression underlies polymorphic seasonal camouflage in snowshoe hares
Science 360:1355–1358.

https://doi.org/10.1126/science.aar5273
- PubMed
- Google Scholar
1. Joron M
2. Papa R
3. Beltrán M
4. Chamberlain N
5. Mavárez J
6. Baxter S
7. Abanto M
8. Bermingham E
9. Humphray SJ
10. Rogers J
11. Beasley H
12. Barlow K
13. ffrench-Constant RH
14. Mallet J
15. McMillan WO
16. Jiggins CD
(2006) A conserved supergene locus controls colour pattern diversity in Heliconius butterflies
PLOS Biology 4:e303.

https://doi.org/10.1371/journal.pbio.0040303
- PubMed
- Google Scholar
1. Karimi N
2. Grover CE
3. Gallagher JP
4. Wendel JF
5. Ané C
6. Baum DA
(2020) Reticulate evolution helps explain apparent homoplasy in floral biology and pollination in baobabs (Adansonia; Bombacoideae; Malvaceae)
Systematic Biology 69:462–478.

https://doi.org/10.1093/sysbio/syz073
- PubMed
- Google Scholar
1. Kern AD
2. Hahn MW
(2018) The neutral theory in light of natural selection
Molecular Biology and Evolution 35:1366–1371.

https://doi.org/10.1093/molbev/msy092
- PubMed
- Google Scholar
1. Lamichhaney S
2. Han F
3. Berglund J
4. Wang C
5. Almén MS
6. Webster MT
7. Grant BR
8. Grant PR
9. Andersson L
(2016) A beak size locus in Darwin's finches facilitated character displacement during a drought
Science 352:470–474.

https://doi.org/10.1126/science.aad8786
- PubMed
- Google Scholar
1. Lee KM
2. Coop G
(2017) Distinguishing among modes of convergent adaptation using population genomic data
Genetics 207(4):1591–1619.

https://doi.org/10.1534/genetics.117.300417
- Google Scholar
1. Liu L
2. Yu L
3. Edwards SV
(2010) A maximum pseudo-likelihood approach for estimating species trees under the coalescent model
BMC Evolutionary Biology 10:302.

https://doi.org/10.1186/1471-2148-10-302
- PubMed
- Google Scholar
1. Liu KJ
2. Dai J
3. Truong K
4. Song Y
5. Kohn MH
6. Nakhleh L
(2014) An HMM-based comparative genomic framework for detecting introgression in eukaryotes
PLOS Computational Biology 10:e1003649.

https://doi.org/10.1371/journal.pcbi.1003649
- PubMed
- Google Scholar
1. Lynch M
(2006) The origins of eukaryotic gene structure
Molecular Biology and Evolution 23:450–468.

https://doi.org/10.1093/molbev/msj050
- PubMed
- Google Scholar
1. Lynch M
(2010) Evolution of the mutation rate
Trends in Genetics 26:345–352.

https://doi.org/10.1016/j.tig.2010.05.003
- PubMed
- Google Scholar
(2016) How reticulated are species?
BioEssays 38:140–149.

https://doi.org/10.1002/bies.201500149
- PubMed
- Google Scholar
1. Martin SH
2. Möst M
3. Palmer WJ
4. Salazar C
5. McMillan WO
6. Jiggins FM
7. Jiggins CD
(2016) Natural selection and genetic diversity in the butterfly Heliconius melpomene
Genetics 203:525–541.

https://doi.org/10.1534/genetics.115.183285
- PubMed
- Google Scholar
1. Mendes FK
2. Hahn Y
3. Hahn MW
(2016) Gene tree discordance can generate patterns of diminishing convergence over time
Molecular Biology and Evolution 33:3299–3307.

https://doi.org/10.1093/molbev/msw197
- PubMed
- Google Scholar
(2018) A multispecies coalescent model for quantitative traits
eLife 7:e36482.

https://doi.org/10.7554/eLife.36482
- PubMed
- Google Scholar
1. Mendes FK
2. Hahn MW
(2016) Gene tree discordance causes apparent substitution rate variation
Systematic Biology 65:711–721.

https://doi.org/10.1093/sysbio/syw018
- PubMed
- Google Scholar
1. Meng C
2. Kubatko LS
(2009) Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model
Theoretical Population Biology 75:35–45.

https://doi.org/10.1016/j.tpb.2008.10.004
- PubMed
- Google Scholar
(2020a) New methods to calculate concordance factors for phylogenomic datasets
Molecular Biology and Evolution 37:2727–2733.

https://doi.org/10.1093/molbev/msaa106
- PubMed
- Google Scholar
(2020b) IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era
Molecular Biology and Evolution 37:1530–1534.

https://doi.org/10.1093/molbev/msaa015
- PubMed
- Google Scholar
1. Misof B
2. Liu S
3. Meusemann K
4. Peters RS
5. Donath A
6. Mayer C
7. Frandsen PB
8. Ware J
9. Flouri T
10. Beutel RG
11. Niehuis O
12. Petersen M
13. Izquierdo-Carrasco F
14. Wappler T
15. Rust J
16. Aberer AJ
17. Aspöck U
18. Aspöck H
19. Bartel D
20. Blanke A
21. Berger S
22. Böhm A
23. Buckley TR
24. Calcott B
25. Chen J
26. Friedrich F
27. Fukui M
28. Fujita M
29. Greve C
30. Grobe P
31. Gu S
32. Huang Y
33. Jermiin LS
34. Kawahara AY
35. Krogmann L
36. Kubiak M
37. Lanfear R
38. Letsch H
39. Li Y
40. Li Z
41. Li J
42. Lu H
43. Machida R
44. Mashimo Y
45. Kapli P
46. McKenna DD
47. Meng G
48. Nakagaki Y
49. Navarrete-Heredia JL
50. Ott M
51. Ou Y
52. Pass G
53. Podsiadlowski L
54. Pohl H
55. von Reumont BM
56. Schütte K
57. Sekiya K
58. Shimizu S
59. Slipinski A
60. Stamatakis A
61. Song W
62. Su X
63. Szucsich NU
64. Tan M
65. Tan X
66. Tang M
67. Tang J
68. Timelthaler G
69. Tomizuka S
70. Trautwein M
71. Tong X
72. Uchifune T
73. Walzl MG
74. Wiegmann BM
75. Wilbrandt J
76. Wipfler B
77. Wong TK
78. Wu Q
79. Wu G
80. Xie Y
81. Yang S
82. Yang Q
83. Yeates DK
84. Yoshizawa K
85. Zhang Q
86. Zhang R
87. Zhang W
88. Zhang Y
89. Zhao J
90. Zhou C
91. Zhou L
92. Ziesmann T
93. Zou S
94. Li Y
95. Xu X
96. Zhang Y
97. Yang H
98. Wang J
99. Wang J
100. Kjer KM
101. Zhou X
(2014) Phylogenomics resolves the timing and pattern of insect evolution
Science 346:763–767.

https://doi.org/10.1126/science.1257570
- PubMed
- Google Scholar
1. Mullen SP
2. VanKuren NW
3. Zhang W
4. Nallu S
5. Kristiansen EB
6. Wuyun Q
7. Liu K
8. Hill RI
9. Briscoe AD
10. Kronforst MR
(2020) Disentangling population history and character evolution among hybridizing lineages
Molecular Biology and Evolution 37:1295–1305.

https://doi.org/10.1093/molbev/msaa004
- PubMed
- Google Scholar
1. Nadeau NJ
2. Pardo-Diaz C
3. Whibley A
4. Supple MA
5. Saenko SV
6. Wallbank RW
7. Wu GC
8. Maroja L
9. Ferguson L
10. Hanly JJ
11. Hines H
12. Salazar C
13. Merrill RM
14. Dowling AJ
15. ffrench-Constant RH
16. Llaurens V
17. Joron M
18. McMillan WO
19. Jiggins CD
(2016) The gene cortex controls mimicry and crypsis in butterflies and moths
Nature 534:106–110.

https://doi.org/10.1038/nature17961
- PubMed
- Google Scholar
1. Novikova PY
2. Hohmann N
3. Nizhynska V
4. Tsuchimatsu T
5. Ali J
6. Muir G
7. Guggisberg A
8. Paape T
9. Schmid K
10. Fedorenko OM
11. Holm S
12. Säll T
13. Schlötterer C
14. Marhold K
15. Widmer A
16. Sese J
17. Shimizu KK
18. Weigel D
19. Krämer U
20. Koch MA
21. Nordborg M
(2016) Sequencing of the genus Arabidopsis identifies a complex history of nonbifurcating speciation and abundant trans-specific polymorphism
Nature Genetics 48:1077–1082.

https://doi.org/10.1038/ng.3617
- PubMed
- Google Scholar
1. Palesch D
2. Bosinger SE
3. Tharp GK
4. Vanderford TH
5. Paiardini M
6. Chahroudi A
7. Johnson ZP
8. Kirchhoff F
9. Hahn BH
10. Norgren RB
11. Patel NB
12. Sodora DL
13. Dawoud RA
14. Stewart CB
15. Seepo SM
16. Harris RA
17. Liu Y
18. Raveendran M
19. Han Y
20. English A
21. Thomas GWC
22. Hahn MW
23. Pipes L
24. Mason CE
25. Muzny DM
26. Gibbs RA
27. Sauter D
28. Worley K
29. Rogers J
30. Silvestri G
(2018) Sooty mangabey genome sequence provides insight into AIDS resistance in a natural SIV host
Nature 553:77–81.

https://doi.org/10.1038/nature25140
- PubMed
- Google Scholar
1. Pamilo P
2. Nei M
(1988) Relationships between gene trees and species trees
Molecular Biology and Evolution 5:568–583.

https://doi.org/10.1093/oxfordjournals.molbev.a040517
- PubMed
- Google Scholar
1. Patterson N
2. Moorjani P
3. Luo Y
4. Mallick S
5. Rohland N
6. Zhan Y
7. Genschoreck T
8. Webster T
9. Reich D
(2012) Ancient admixture in human history
Genetics 192:1065–1093.

https://doi.org/10.1534/genetics.112.145037
- PubMed
- Google Scholar
1. Pease JB
2. Haak DC
3. Hahn MW
4. Moyle LC
(2016) Phylogenomics reveals three sources of adaptive variation during a rapid radiation
PLOS Biology 14:e1002379.

https://doi.org/10.1371/journal.pbio.1002379
- PubMed
- Google Scholar
1. Pease JB
2. Hahn MW
(2015) Detection and polarization of introgression in a five-taxon phylogeny
Systematic Biology 64:651–662.

https://doi.org/10.1093/sysbio/syv023
- PubMed
- Google Scholar
1. Pollard DA
2. Iyer VN
3. Moses AM
4. Eisen MB
(2006) Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting
PLOS Genetics 2:e173.

https://doi.org/10.1371/journal.pgen.0020173
- PubMed
- Google Scholar
1. Rambaut A
2. Grassly NC
(1997) Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees
Bioinformatics 13:235–238.

https://doi.org/10.1093/bioinformatics/13.3.235
- Google Scholar
1. Rausher MD
(2008) Evolutionary transitions in floral color
International Journal of Plant Sciences 169:7–21.

https://doi.org/10.1086/523358
- Google Scholar
1. Reich D
2. Thangaraj K
3. Patterson N
4. Price AL
5. Singh L
(2009) Reconstructing indian population history
Nature 461:489–494.

https://doi.org/10.1038/nature08365
- PubMed
- Google Scholar
(2018) Multiple origins of green blood in New Guinea lizards
Science Advances 4:eaao5017.

https://doi.org/10.1126/sciadv.aao5017
- PubMed
- Google Scholar
Book
1. Schrempf D
2. Szöllősi G
(2020)
The sources of phylogenetic conflicts

In: Celine S, Delsuc F, Nicolas G, editors. Phylogenetics in the Genomic Era. Self published. 3.1:1–3.1:23.
- Google Scholar
1. Smith SD
2. Rausher MD
(2011) Gene loss and parallel evolution contribute to species difference in flower color
Molecular Biology and Evolution 28:2799–2810.

https://doi.org/10.1093/molbev/msr109
- PubMed
- Google Scholar
1. Solís-Lemus C
2. Ané C
(2016) Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting
PLOS Genetics 12:e1005896.

https://doi.org/10.1371/journal.pgen.1005896
- PubMed
- Google Scholar
1. Stamatakis A
(2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies
Bioinformatics 30:1312–1313.

https://doi.org/10.1093/bioinformatics/btu033
- PubMed
- Google Scholar
1. Storz JF
2. Runck AM
3. Sabatino SJ
4. Kelly JK
5. Ferrand N
6. Moriyama H
7. Weber RE
8. Fago A
(2009) Evolutionary and functional insights into the mechanism underlying high-altitude adaptation of deer mouse hemoglobin
PNAS 106:14450–14455.

https://doi.org/10.1073/pnas.0905224106
- PubMed
- Google Scholar
1. Storz JF
(2016) Causes of molecular convergence and parallelism in protein evolution
Nature Reviews Genetics 17:239–250.

https://doi.org/10.1038/nrg.2016.11
- PubMed
- Google Scholar
1. Tufts DM
2. Natarajan C
3. Revsbech IG
4. Projecto-Garcia J
5. Hoffmann FG
6. Weber RE
7. Fago A
8. Moriyama H
9. Storz JF
(2015) Epistasis constrains mutational pathways of hemoglobin adaptation in high-altitude pikas
Molecular Biology and Evolution 32:287–298.

https://doi.org/10.1093/molbev/msu311
- PubMed
- Google Scholar
1. Vanderpool D
2. Minh BQ
3. Lanfear R
4. Hughes D
5. Murali S
6. Harris RA
7. Raveendran M
8. Muzny DM
9. Hibbins MS
10. Williamson RJ
11. Gibbs RA
12. Worley KC
13. Rogers J
14. Hahn MW
(2020) Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression
PLOS Biology 18:e3000954.

https://doi.org/10.1371/journal.pbio.3000954
- PubMed
- Google Scholar
Preprint
1. Wang Y
2. Cao Z
3. Ogilvie HW
4. Nakhleh L
(2020) Phylogenomic assessment of the role of hybridization and introgression in trait evolution
bioRxiv.

https://doi.org/10.1101/2020.09.16.300343
- Google Scholar
1. Wen D
2. Yu Y
3. Nakhleh L
(2016) Bayesian inference of reticulate phylogenies under the multispecies network coalescent
PLOS Genetics 12:e1006006.

https://doi.org/10.1371/journal.pgen.1006006
- PubMed
- Google Scholar
1. Wen D
2. Yu Y
3. Zhu J
4. Nakhleh L
(2018) Inferring phylogenetic networks using PhyloNet
Systematic Biology 67:735–740.

https://doi.org/10.1093/sysbio/syy015
- PubMed
- Google Scholar
1. Wu M
2. Kostyun JL
3. Hahn MW
4. Moyle LC
(2018) Dissecting the basis of novel trait evolution in a radiation with widespread phylogenetic discordance
Molecular Ecology 27:3301–3316.

https://doi.org/10.1111/mec.14780
- Google Scholar
1. Yu Y
2. Degnan JH
3. Nakhleh L
(2012) The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection
PLOS Genetics 8:e1002660.

https://doi.org/10.1371/journal.pgen.1002660
- PubMed
- Google Scholar
1. Yu Y
2. Dong J
3. Liu KJ
4. Nakhleh L
(2014) Maximum likelihood inference of reticulate evolutionary histories
PNAS 111:16448–16453.

https://doi.org/10.1073/pnas.1407950111
- PubMed
- Google Scholar
(2018a) Bayesian inference of species networks from multilocus sequence data
Molecular Biology and Evolution 35:504–517.

https://doi.org/10.1093/molbev/msx307
- PubMed
- Google Scholar
1. Zhang C
2. Rabiee M
3. Sayyari E
4. Mirarab S
(2018b) ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees
BMC Bioinformatics 19:153.

https://doi.org/10.1186/s12859-018-2129-y
- PubMed
- Google Scholar

Article and author information

Author details

Mark S Hibbins

Department of Biology, Indiana University, Bloomington, United States

Contribution
Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing

For correspondence
mhibbins@iu.edu

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0002-4651-3704
Matthew JS Gibson

Department of Biology, Indiana University, Bloomington, United States

Contribution
Data curation, Software, Validation, Methodology, Writing - original draft, Writing - review and editing

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0001-7855-1628
Matthew W Hahn
1. Department of Biology, Indiana University, Bloomington, United States
2. Department of Computer Science, Indiana University, Bloomington, United States
Contribution
Conceptualization, Supervision, Funding acquisition, Validation, Writing - original draft, Project administration, Writing - review and editing

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0002-5731-8808

Funding

National Science Foundation (DEB-1936187)

Matthew W Hahn

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Rafael Guerrero, Leonie Moyle, and Ben Fulton for helpful comments and advice, as well as three anonymous reviewers, the associate editor, and Chris Austin for suggestions that helped improve this work. We also thank Zachary Rodriguez for sharing the lizard data and Nate Edelman for sharing the Heliconius phylogenetic network. This work was supported by National Science Foundation grant DEB-1936187.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

2,276

views
232

downloads
44

citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Citations by DOI

44

citations for umbrella DOI https://doi.org/10.7554/eLife.63753

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Mendeley

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Mark S Hibbins
Matthew JS Gibson
Matthew W Hahn

(2020)

Determining the probability of hemiplasy in the presence of incomplete lineage sorting and introgression

eLife 9:e63753.

https://doi.org/10.7554/eLife.63753

Share this article

Cite this article

Two empirical examples of apparent convergence in character states that could potentially be explained by hemiplasy.

Figure 1—source data 1

Figure 1—source data 2

A phylogenetic network (top) can be split into a set of parent trees (bottom) representing the possible histories at individual loci.

The possible paths to homoplasy and hemiplasy under the multispecies network coalescent.

The probability of hemiplasy relative to homoplasy (contours) as a function of the admixture proportion (x-axis), the time between speciation and introgression (y-axis), and the direction of introgression (panels).

Figure 4—source data 1

Probabilities of hemiplasy estimated from HeIST across nine simulated conditions.

Figure 5—source data 1

Probable histories for (A) the origin of green blood in New Guinea lizards and (B) the chromosomal inversion spanning the gene cortex in Heliconius, calculated using HeIST.

Figure 6—source data 1

Figure 6—source data 2

Author details

Mark S Hibbins

Contribution

For correspondence

Competing interests

Matthew JS Gibson

Contribution

Competing interests

Matthew W Hahn

Contribution

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags