1. Evolutionary Biology
  2. Genetics and Genomics
Download icon

Introgression shapes fruit color convergence in invasive Galápagos tomato

  1. Matthew JS Gibson  Is a corresponding author
  2. María de Lourdes Torres
  3. Yaniv Brandvain
  4. Leonie C Moyle
  1. Department of Biology, Indiana University, United States
  2. Universidad San Francisco de Quito (USFQ). Colegio de Ciencias Biológicas y Ambientales, Laboratorio de Biotecnología Vegetal. Campus Cumbayá, Ecuador
  3. Galapagos Science Center, Universidad San Francisco de Quito and University of North Carolina at Chapel Hill, Ecuador
  4. Department of Plant Biology, University of Minnesota-Twin Cities, United States
Research Article
  • Cited 0
  • Views 864
  • Annotations
Cite this article as: eLife 2021;10:e64165 doi: 10.7554/eLife.64165

Abstract

Invasive species represent one of the foremost risks to global biodiversity. Here, we use population genomics to evaluate the history and consequences of an invasion of wild tomato—Solanum pimpinellifolium—onto the Galápagos Islands from continental South America. Using >300 archipelago and mainland collections, we infer this invasion was recent and largely the result of a single event from central Ecuador. Patterns of ancestry within the genomes of invasive plants also reveal post-colonization hybridization and introgression between S. pimpinellifolium and the closely related Galápagos endemic Solanum cheesmaniae. Of admixed invasive individuals, those that carry endemic alleles at one of two different carotenoid biosynthesis loci also have orange fruits—characteristic of the endemic species—instead of typical red S. pimpinellifolium fruits. We infer that introgression of two independent fruit color loci explains this observed trait convergence, suggesting that selection has favored repeated transitions of red to orange fruits on the Galápagos.

Introduction

The success of colonizing species depends on complex interactions between local environments and the availability of relevant genetic variation. Introduction events are often associated with strong genetic bottlenecks (Kolbe et al., 2004; Colautti et al., 2005; Golani et al., 2007) and reduced effective population sizes, features which may constrain the ability of colonizers to adapt to novel environments and compete with native biota (Lande, 1988; Lee, 2002). This suggests that biological invasions should rarely follow from introductions (Queller, 2000; Kolbe et al., 2004), yet successful invasions are nonetheless pervasive (Kolbe et al., 2004; Allendorf and Lundquist, 2003; Estoup et al., 2016; Comeault et al., 2020).

Several factors could be involved in this success. Despite intense bottlenecks, diversity could be maintained by other means, including multiple independent introductions (Facon et al., 2008; Kolbe et al., 2004) or via hybridization with congenerics present in the new habitat (Ellstrand and Schierenbeck, 2000; Lavergne and Molofsky, 2007; Reatini and Vision, 2020; Stepien et al., 2005). Of these mechanisms, hybridization might be particularly important for facilitating invasion into island habitats. Hybridization among native and introduced taxa is common on islands (Carlquist, 1974), potentially because of limitations on geographic extent, the abundance of generalist pollinators (Olesen et al., 2002), and/or frequent anthropogenic disturbance (Bertolo et al., 2012; Lin et al., 2013; Long et al., 2014). In addition, while the geographic isolation of insular habitats makes them hot spots for species endemism, only a small subset of continental taxa are successful in colonizing remote islands. The resulting incomplete trophic networks provide abundant ecological opportunities for invaders, including reproductive interactions between closely related species. Given these potentially complex contributing factors, describing the occurrence and consequences of invasion is critical for understanding both the dynamics of colonizing populations and for predicting conservation outcomes.

In this study, we investigate the contributions of demographic bottlenecks, single versus multiple introductions, and post-invasion hybridization, to patterns of genomic variation in populations of invasive and endemic tomato species on the Galápagos Islands. Two yellow/orange-fruited tomato species are considered endemic to the islands: Solanum cheesmaniae (L. Riley) Fosberg [CHS] and Solanum galapagense S.C. Darwin and Peralta [GAL] (Appendix 1, section S1). Two red-fruited invasive species from continental Ecuador and Peru are now also documented on the archipelago: Solanum pimpinellifolium L. [PIM] and Solanum lycopersicum L. [LYC]—the domesticated tomato. Domesticated LYC was almost certainly introduced for agriculture (Rick, 1963). Wild species PIM was also likely introduced by early human colonizers (Darwin et al., 2003; Darwin, 2009), however the timing and source of this introduction is not known. Recent field surveys indicate substantially increased abundance of the invasive species while abundance of the two endemic species has markedly declined over the past two decades (Darwin, 2009; Nuez et al., 2004; Gibson et al., 2020), suggesting that recent demographic shifts may pose an extinction threat to the endemic species. Several factors also indicate a high potential for hybridization between native and invasive species, including overlapping habitats (Darwin, 2009; Nuez et al., 2004), similar flower morphologies (Darwin et al., 2003; Darwin, 2009; Rick et al., 1977; Vosters et al., 2014), and shared pollinators (Darwin, 2009). Moreover, all four species are closely related—having diverged less than 500 kya (Pease et al., 2016)—and all can be crossed to produce hybrids in the greenhouse (Rick, 1956; Rick and Bowman, 1961; Rick and Fobes, 1975).

Using genomic sequencing data from 174 plants (representing all four species) from the largest islands of San Cristobal, Santa Cruz, and Isabela, and a panel of 132 mainland PIM accessions from across the entire native range on continental South America, we (i) infer the timing, source, and number of invasions by PIM onto the Galápagos and (ii) evaluate evidence for post-colonization gene flow between the four tomato taxa, and its evolutionary consequences. We find that the majority of PIM originated in central Ecuador, are the product of a recent invasion, and are actively hybridizing with an endemic relative. By characterizing fine-scale local ancestry, we find that the emergence of novel orange-fruited plants—which resemble the endemic species in color—in two invasive populations can be explained by endemic introgression at distinct carotenoid loci with known phenotypic effects specifically on fruit color. Our findings reconstruct a recent path of invasion via Ecuador, provide evidence for ongoing interspecific gene flow, and suggest a history of natural selection favoring orange fruits in the island habitat.

Results

Sequencing and collections

Sequence data were drawn from 306 individual samples. We performed double-digest RAD (ddRAD) sequencing (using PstI and EcoRI enzymes) of 174 wild collected individuals from 13 populations of endemic and invasive tomatoes from three islands in the Galápagos archipelago: San Cristobal, Santa Cruz, and Isabela (Figure 1; Table 1; Figure 1—figure supplement 1 and Supplementary file 1a). We complemented these data with ddRAD reads from 132 mainland PIM (Figure 2—figure supplement 1 and Supplementary file 1b), previously sequenced in Gibson and Moyle, 2020 using the same enzymes. We recovered 18,573 high-quality RAD loci, each sequenced to an average of 61.4× (s.d. = 35×) in 80% of all 306 samples (Supplementary file 1c and 1d). Average insert size was 192 bp (s.d. = 51.7) after adapter and quality trimming. After filtering for depth (>8 reads), 11,297 SNPs were retained. After filtering for LD (r2 < 0.7), 5767 SNPs were retained. Refer to Supplementary file 1e for a summary of each filtering step and the analyses for which each dataset was used.

Figure 1 with 1 supplement see all
Geographic distribution of focal sampling sites on the Galápagos Islands.

Inset: Photograph of polymorphic (red/orange) PIM fruits representative of populations MG114 and MG117. For simplicity, LYC populations as well as sampling sites with <8 individuals are not included here. Refer to Supplementary file 1a for a full list of collection localities and sample sizes.

Table 1
Diversity statistics for focal population samples (S = number of segregating sites; θW = Watterson’s theta; H = observed heterozygosity; π = genome-wide nucleotide diversity).
TaxaPopulationIslandEndemicSθWHπ
PIMPeruNN328206302.960.000250.00094
EcuadorNN217735544.200.000330.00130
MG105YN1520586.230.000110.00009
MG107YN44341567.360.00020.00034
MG111YN2562787.930.000120.00010
MG113YN37631330.170.000150.00029
MG114YN47301454.690.000160.00023
MG116YN49291632.1890.000240.00024
MG117YN47761581.520.000160.00028
CHSMG115YY85402407.170.000230.00045
MG120CYY854314.220.00010.00008
GALMG120GYY32821032.030.000130.00014
GAL×CHSMG120GCYY28571166.120.000230.00026
LYCMG125YN72192551.810.000350.00063
MG126YN45672192.160.000180.00052

Genetic data support an Ecuadorian origin for most invasive populations

Using our ddRAD sequencing data for Galápagos and continental PIM, we analyzed population genetic signatures of colonization and characterized the origin and path of invasion into the archipelago. Nucleotide diversity (π in 100 kb overlapping windows; Figure 2C) was reduced on average 6.6-fold in island populations relative to mainland accessions (Table 1), a pattern consistent with population genetic expectations following colonization.

Figure 2 with 6 supplements see all
Galápagos PIM is the result of a recent invasion from Ecuador.

(A) Map: average genetic distance between Galápagos PIM collections and each of the 132 mainland accessions. Plot: multi-locus principal components analysis (PCA). Squares, diamonds, and circles indicate Peruvian, Ecuadorian, and Galápagos collections, respectively. Inset: Predicted continental origins for Galápagos PIM collections. Colors are same as shown in the multi-locus PCA (Exact locations vary substantially between runs. Results from a single run are shown). (B) Maximum likelihood relationships among focal populations calculated with Treemix (allowing no migration). Left: inferred trees of 1000 resampled datasets (500 SNPs, with replacement). Right: consensus topology. All trees were rate-smoothed (λ = 1). (C) Diversity and divergence metrics. Left: nucleotide diversity (π) calculated for Galápagos PIM, Ecuador-PIM, and Peru PIM in overlapping 100 kb windows. Invariant windows (π = 0) are truncated and are instead shown in the inset bar plot. Right: average pairwise sequence divergence for three PIM comparisons: Gal×Gal, Gal×Ecu, and Gal×Peru. Each point represents a comparison between individuals, averaged over all loci.

Genetic variation in the native (mainland) range of PIM is highly geographically structured (Gibson and Moyle, 2020; Figure 2—figure supplement 1), allowing us to infer a putative origin of PIM lineages invasive on the Galápagos. To do so, we estimated genome-wide patterns of relatedness between invasive and mainland individuals using several methods. A rate-smoothed maximum likelihood tree constructed in Treemix (Pickrell and Pritchard, 2012) identified Galápagos PIM as monophyletic, and clearly separated island and non-island clades (Figure 2B; Figure 2—figure supplement 2). In general, pairwise sequence divergence was lower in Galápagos-Ecuador comparisons (average dxy = 2.3×103) than between Galápagos-Peru comparisons (average dxy = 3.6×103; Figure 2C), and samples showing low genome-wide divergence were clustered in central Ecuador (similar patterns were observed using FST; Supplementary file 1f).

To investigate potential source localities for invasive populations at a finer scale, we implemented the software Locator (Battey et al., 2020) which uses a machine learning algorithm to predict sample origins from genotype data. Locator predictions indicated two to three source regions for Galápagos PIM, although the exact locations varied across runs and depended on which island PIM collections were considered (Figure 2—figure supplement 1). Santa Cruz and San Cristobal PIM collections were predicted to have originated in central Ecuador; this result was generally consistent across runs, with the consensus being an origin near Los Rios and Guayas provinces in southcentral Ecuador (Figure 2—figure supplement 3). Interestingly, we also infer that one mainland accession represents a back migration from the Galápagos to Los Rios (LA0411; Appendix 1, section S2), further highlighting the high degree of connectivity between this region and the islands. In contrast, the remaining two samples, LA3123 (a historical collection from Santa Cruz sampled in 1991) and MG128-1 (newly sampled on Isabela), were predicted to have originated in alternative locations, with most runs supporting a Peruvian origin for LA3123 and an Ecuadorian origin for MG128-1 (Figure 2—figure supplement 3). The exact origin locations for these samples varied substantially across runs. In general, Locator predictions were consistent with the pattern of low pairwise sequence divergence between Galápagos PIM and central Ecuadorian samples, pointing to Ecuador, and perhaps central Ecuador in particular, as the source of the majority of invasive PIM populations on the Galápagos.

Together our data support two to three independent introductions of PIM onto the archipelago, each with variable consequences for current invasive populations: (i) a minor event from Peru [LA3123], (ii) a minor event from Ecuador [MG128-1], and (iii) a major event from central Ecuador that is responsible for nearly all sampled populations.

Demographic reconstruction supports a recent colonization by PIM on the Galápagos

We used the allele frequency spectrum to model the demographic history of invasive populations. In particular, we evaluated two demographic models using δaδi (Gutenkunst et al., 2010): (i) a neutral model of constant population size and (ii) a two-epoch instantaneous size change model. Since this species is self-fertile (i.e., it lacks genetic self-incompatibility that is present in some wild tomato species; Rick et al., 1977), we simultaneously inferred the inbreeding coefficient (F; Blischak et al., 2020). The two-epoch model thus included five parameters: epoch one population size (NB), epoch two population size (NF), timing of the first size change (TB), timing of second size change (TF), and the inbreeding coefficient. To limit potential confounding effects due to population structure within PIM, we estimated the folded site frequency spectrum (SFS) of a single population (MG114, which was the most deeply sampled of our PIM populations). We also masked regions of inferred introgression (as detected by our hidden Markov model [HMM], see below) as these can spuriously inflate rare variants and thus bias the inference of a bottleneck and subsequent parameter estimation (Appendix 1, section S3). In our masked dataset, we observed a large excess of rare variants (genome-wide Tajima’s D = −0.49 ±0.12; Figure 2—figure supplement 5) more consistent with a bottleneck model (RSSBottle = 0.23; ln(L)=−18.46) than a neutral model (RSSNeutral = 16.61; ln(L)=−26.63).

Table 2
Demographic model estimates for PIM population MG114 inferred using BFGS optimization in δaδi.

95% CI values were obtained from 1000 bootstrap replicates of the site frequency spectrum (SFS). Each estimate is shown in rescaled units (rescaled by NRef for NB and NF; and by 2NRef for TB and TF).

ParameterOptimumBootstrap
median
95% CI
NB408.32551.5256.32–9325.16
NF4041.632044.06442.09–26614.5
TB96.5277.152.37–492.68
TF105.14123.0229.95–401.15
F0.230.170.01–0.41

We used the best-fit bottleneck model to estimate the timing of the introduction by performing a two-step optimization procedure. We inferred a recent bottleneck occurring 201.66 (TB +TF) generations in the past (Table 2). Bootstrapped CIs for TB and TF were relatively small (2–493 gen for TB; 30–401 gen for TF), and medians were in close agreement with optimized estimates (Table 2). The bootstrapped median estimate for the time to bottleneck was 200.17, which is less than two generations away from the optimized estimate. CIs for population size parameters as estimated by a nonparametric bootstrap were larger (Figure 2—figure supplement 6) and median estimates deviated slightly from the optimized estimates (Table 2), although in all cases the optimized size estimates fell within the bootstrapped 95% CIs.

For comparison, we also model the history of an endemic CHS population (MG115) using the same framework as above. The two-epoch model again fit the data better (ln(L)=−44.61) than a neutral model (ln(L)=−74.18). In addition, we infer a large expansion phase occurring between 1114 and 1845 generations ago followed by a very recent and strong contraction (Supplementary file 1g). Compared to our estimate of the timing of the bottleneck in PIM, the inferred expansion in CHS is nearly eight times older. Tajima’s D was also higher in MG115 (D = −0.18 ± 0.08) compared to invasive PIM MG114, which is more consistent with neutral expectations and a larger fraction of intermediate frequency alleles.

Admixture analyses support the occurrence of inter- and intraspecific gene flow

The close evolutionary relationship of PIM, CHS, and GAL, their similar floral morphologies, and the presence of only a single major pollinator on the islands (Xylocarpii darwini; McMullen, 1999), indicate the potential for interspecific gene flow between tomato species may be high. Key morphological observations also suggest that these species may be exchanging genes (Darwin, 2009). In particular, we have previously described a novel fruit color polymorphism in two Santa Cruz PIM populations (MG114 and MG117; Gibson et al., 2020), where approximately 40% of individuals have orange instead of their ancestrally red fruits. Orange fruits are very rare in mainland PIM (TGRC passport data; http://www.tgrc.ucdavis.edu) but are diagnostic of the two endemic Galápagos species. Accordingly, we used multiple population genomic methods to investigate evidence of hybridization and introgression in the genomes of island plants, paying special attention to patterns of admixture in the polymorphic PIM populations.

We first examined evidence for recent (early generation) hybrids by evaluating genome-wide signatures in fastStructure (Raj et al., 2014) and NewHybrids (Anderson and Thompson, 2002). Interestingly, we find no evidence of early generation CHS×PIM hybrids in either of the polymorphic PIM populations MG114 and MG117 (Figure 3A). However, these analyses did detect variable levels of CHS×PIM admixture at the nearby site MG115 (Figure 3A), a pattern which is also reflected in principal component space (Figure 3—figure supplement 1). Using NewHybrids, we classified 4/6 of these admixed plants as first- or second-generation hybrids (Supplementary file 1h, i and j).

Figure 3 with 2 supplements see all
Patterns of population genetic structure and admixture on Santa Cruz and Isabela.

(A) fastStructure inference for all Santa Cruz samples (N = 74). K = 3. (B) fastStructure inference for all Isabela samples (N = 57). K = 3. (C) Treemix analysis summary (m = 6; ln[L]=395.08). Solid lines indicate interspecific events and dashed lines indicate intraspecific events. (D) Principal components analysis for samples at site MG120, a hybrid zone between CHS and GAL.

To complement the above analyses, we employed Treemix (Pickrell and Pritchard, 2012). The most likely topology inferred by Treemix implies three separate admixture events between PIM and CHS: two cases of CHS → PIM admixture and one case of PIM → CHS admixture on Santa Cruz (Figure 3C; Figure 3—figure supplement 2; Supplementary file 1k). This analysis therefore indicates repeated gene flow between PIM and CHS, although how many distinct events were involved is difficult to infer given the high genomic similarity of island PIM populations and recency of the invasion. As independent support for a history of gene flow, we calculated the four taxa D-statistic of Durand et al., 2011 using Solanum pennellii (LA3778) as an outgroup and treating PIM population MG114 and CHS (MG115) as P2 and P3, respectively. We found that D was 0.818 (s.d. = 0.028; bootstrapped p<0.02), indicating an excess of ABBA sites and strong evidence for admixture between island PIM and CHS. D was also significant when other invasive PIM populations—Santa Cruz population MG117 or San Cristobal populations MG107 and MG105—were used as P2, indicating that the detected admixture likely predates the dispersal and differentiation between Santa Cruz and San Cristobal invasive PIM. This is consistent with inferences in the Treemix graph, in which admixture events between PIM and CHS involve internal branches that subtend current San Cristobal and Santa Cruz PIM populations (Figure 3C).

Interestingly, fastStructure and Treemix produced conflicting results regarding evidence for admixture in the polymorphic PIM populations; Treemix appears to support this while fastStructure does not. To evaluate whether this was due to differences in the detection of more subtle—and potentially older—signals of introgression, we implemented a local ancestry assignment algorithm using a Hidden Markov Model (HMM) to probe for evidence of introgression at a finer scale. Doing so, we found evidence for bidirectional gene flow between CHS and PIM (Figure 4; Figure 4—figure supplement 1), with inferred introgression being more common in the CHS → PIM direction and in PIM populations that were polymorphic for fruit color. Our HMM detected clear evidence for CHS ancestry within the polymorphic PIM populations MG114 and MG117, reflecting admixture from CHS → PIM. In contrast, inferred admixture in nonpolymorphic PIM (e.g., MG116) was much more restricted. For MG114, CHS ancestry blocks were large (average = 16,235 kb), of varying size (sd = 139.43 kb), and composed on average 3.64% of the genomes of any given MG114 plant (Figure 4C,D). Shared ancestry between MG114 and MG115 was dominated by two large CHS haplotypes segregating at moderate to high frequencies on chromosome three (40%; mean size = 51.35 Mb) and chromosome six (20%; mean size = 35.3 Mb; Figure 4B). The genomic distribution of CHS ancestry blocks in different individuals indicates they are not independent. For example, on chromosome 3 all but one individual (5/6) carrying the CHS haplotype had identical breakpoints, consistent with them being derived from the same hybridization (and subsequent recombination) event and/or the individuals being closely related (Supplementary file 1l). In MG117, CHS ancestry made up 4.36% (sd = 2.56%) of any given MG117 genome and average block size was 9,687.5 kb (Figure 4C,D; Supplementary file 1m). As with MG114, a large CHS haplotype on chromosome six occurs at a frequency of 0.42. This block varied substantially in size in each individual, and all were discontinuous across the chromosome (i.e., there is an implied double crossover event). Further, two individuals were heterozygous for ancestry at the downstream portion of the haplotype (Figure 4—figure supplement 2, Figure 5—figure supplement 1, Figure 5—figure supplement 2). In comparison to MG114 and MG117, MG116 showed little evidence for shared ancestry. While 3/7 individuals had inferred signals of CHS ancestry, these blocks were generally small compared to MG114 and MG117 (average = 4,450 kb; Figure 5) and made up a substantially smaller fraction of the total genome (average admixture proportion = 0.55%; Figure 4D). No large CHS haplotypes were segregating in MG116, unlike those observed in MG114 and MG117 (Supplementary file 1n).

Figure 4 with 14 supplements see all
Local ancestry assignment using hidden Markov model (HMM) characterizes a history of endemic × invasive introgression.

(A) Patterns of diversity and divergence along chromosome 6 for an MG114 individual. The region of recent coalescence (low divergence; high diversity) with CHS is annotated in gray. This 20.2 kb block segregates at 20% in MG114. (B) Genome-wide HMM predictions for all individuals in MG114. The x-axis is ordered by chromosome and y-axis is ordered by individual. Two large CHS haplotypes segregate at high frequency on chromosomes 3 (40%) and 6 (20%). (C) Visual summary of admixture proportions from CHS into three PIM populations. (D) Summary of HMM assignment for each PIM population. Populations displaying variation in fruit color (MG114 and MG117) have more CHS ancestry than those which are fixed for the ancestral red state (MG116).

Figure 5 with 3 supplements see all
Patterns of local ancestry across focal chromosome regions of MG114 and MG117, enlarged to show variation in introgression block break points at color pathway genes.

(A) CHS ancestry at carotenoid biosynthesis gene PSY1 on chromosome 3 correlates with observed fruit color variation in MG114. (B) CHS ancestry at carotenoid biosynthesis gene CYC-B on chromosome 6 correlates with fruit color variation in MG117. Each cell represents 100 kb. Empty cells indicate windows with no sequence data. Empty cells are ghost shaded with each ancestry color based on neighboring assignments.

The size of detected ancestry blocks contains information regarding the timing of gene flow, because it depends on the number of recombination events (generations) that have occurred since an initial hybridization event. We can broadly estimate the age of these haplotypes using a simple logarithmic relationship (Appendix 1, section S4; Lynch and Walsh, 1998). In MG114 and MG117, age estimates are within the range of 4–12 generations (e.g., the large chromosome 3 and 6 CHS haplotypes in MG114 are estimated at 4.23 and 4.74 generations, respectively). In addition to placing boundaries on when the initial hybridization took place in the past (>4 and up to 12 generations), there is close agreement between age estimates in MG114 and MG117, suggesting that these instances of CHS introgression might have been derived from the same admixture event (Appendix 1, section S4).

Relative to the patterns of gene flow from CHS into polymorphic PIM described above, gene flow in the opposite direction (PIM → CHS) was more restricted. The average proportion of PIM ancestry within CHS individuals (at MG115) was 1.62% (s.d. = 2.72%). These results point to a potential bias in the direction of gene flow, with more exchange occurring from CHS into PIM than from PIM into CHS.

In addition to inferred introgression between PIM and CHS, we also found evidence for hybridization/introgression involving the other two taxa: GAL and LYC. In particular, we uncovered a recent history of hybridization between CHS and GAL on Isabela—at la Laguna de Manzanilla (Figure 3B and D). These two species have been reported as co-occurring at this site since 2000, and hybridization has previously been hypothesized based on allozyme and morphological analyses (Darwin, 2009; Gibson et al., 2020; Figure 3D; inset images). Our fastStructure and principal components analyses (PCA) of individuals at this site clearly identify nine samples as admixed (Figure 3B), mostly corresponding with morphological classifications of intermediacy (Figure 3D). Of the nine admixed plants, four appear to be first generation backcrosses, whereas four are F2 (Supplementary file 1i). We also identified putative cases of CHS/GAL×LYC admixture in two populations on Isabela (Figure 3B), although some of these signals might be the product of unmodeled genetic substructure (Appendix 1, section S5). On Santa Cruz, low levels of LYC (domesticated tomato) ancestry were also detected within some PIM populations, including a potential hybrid PIM×LYC population MG118 that was predicted to be entirely first-generation hybrids.

An introgressed origin for orange fruits in PIM

Because introgression from CHS into PIM was most evident in PIM populations that were polymorphic for fruit color (MG114 and MG117), we took an admixture mapping approach to investigate whether introgression influences fruit color variation in these populations. Specifically, we examined the association between local ancestry across the genome and observed fruit color phenotypes, paying special attention to genomic locations of eight known genes involved in carotenoid biosynthesis (Supplementary file 1o and 1p; Paran and van der Knaap, 2007). For MG117, we found that the presence of orange fruits correlated perfectly with CHS ancestry at one carotenoid synthesis gene: CYC-B on chromosome 6 (Figure 5B; Supplementary file 1o). This association is significant based on a χ2 test of independence (χ2 = 8.33; df = 1; p=0.0039). CYC-B is a lycopene beta cyclase and the specific locus known to underlie the lighter (orange) colored fruits observed in the endemic species (Stommel and Haynes, 1994).

In contrast, the similarly sized chromosome six haplotype in MG114 (Figure 4B) does not include the CHS allele at the CYC-B locus. Instead, in MG114 we find that the presence of orange fruits was solely predicted by CHS ancestry at PSY1 on chromosome 3, the first enzyme in the carotenoid fruit color pathway (Supplementary file 1p; χ2 = 11.12; df = 1; p=0.0009). Although the role of PSY1 in the coloration of endemic fruits has not yet been studied, loss of function mutations in PSY1 have been described in LYC and these produce orange fruit color (Fray and Grierson, 1993). To further investigate the association of PSY1 with endemic fruit pigmentation, we used previously published RNAseq data (Pease et al., 2016) to examine variation in PSY1 across the entire wild tomato clade. PSY1 is expressed at detectable levels in both endemic species and has a highly conserved coding sequence (1203/1239 shared sites), whose exceptions include a single non-synonymous substitution (62R → W) unique to the endemic clade (Figure 5—figure supplement 3). This substitution lies outside of the trans-isoprenyl diphosphate synthase protein domain associated with the enzyme’s function, but within the transit peptide signal sequence (residues 1–70), although the specific functional importance of this variant remains to be assessed. Regardless, based on our observed associations between carotenoid biosynthesis loci and fruit color variation, our data support two separate mechanisms underlying the emergence of orange fruits in Galápagos PIM, both of which were likely derived via introgression from CHS.

Discussion

Biological invasions are one of the foremost threats to global biodiversity, yet we still have a poor understanding of the processes that contribute to invasive colonization success. Here, we studied patterns of genome-wide ancestry and relatedness between endemic, invasive, and continental wild tomato populations in order to reconstruct the history and consequences of a recent biological invasion on the Galápagos. Invasive populations of S. pimpinellifolium (PIM) had low levels of genetic diversity and an excess of rare alleles, and we inferred two to three recent introductions onto the archipelago from Ecuador and Peru. As a consequence of this invasion, we uncovered evidence for recent and ongoing gene flow between PIM and the congeneric endemic species S. cheesmaniae (CHS). Local ancestry at two key carotenoid loci further supported an introgressed (CHS) origin for orange fruits in at least two invasive populations. Together, our results reconstruct the history of invasion, and infer that the source of convergent phenotypic evolution in the invasive populations is introgression of important functional alleles from endemic relatives.

Genomic data reconstruct the demographic history of invasion onto the Galápagos Islands

Our analyses identify three independent introduction events, yet only a single event from Ecuador comprises 98% of sampled invasives on the archipelago. The other two introductions—each represented by single plant collections—either did not produce large invasive populations or they result from much more recent introductions which have not yet established broadly. Indeed, our PCA (Figure 2A) supports the idea that they are the product of more recent introduction events, as these two collections are more closely related to the mainland PIM populations than plants derived from the primary introduction.

For the primary introduction, although we observed variance in source region predictions (Figure 2—figure supplement 3), the consensus prediction supports a southcentral origin near Guayas or Los Rios provinces. Intriguingly, historical data on human migration and trade on the islands also point to this as a likely region for the source of invasive PIM. First, although Ecuadorian colonizers of the Galápagos originate from across the country, one of the earliest and largest bursts of migration coincided with the Tungurahua province earthquake in 1949 (Toral-Granda et al., 2017). This province is geographically central and close to Los Rios. Second, the vast majority of all trade between Galápagos and the continent occurred—and continues to occur—from Guayaquil, the second largest city in Ecuador (Lundh, 2004; Toral-Granda et al., 2017). The surrounding agricultural regions, which include Los Rios, would be the most proximate sources for raw product shipments to the islands. This historical context provides additional support for our genetic inference of a majority southcentral origin for invasive populations.

Our inferences also clearly implicate humans as the source of PIM introduction. Our demographic reconstruction points to a recent bottleneck and expansion of PIM on the archipelago (Table 2), much more recent than our estimate for CHS. Similarly, our inference that LA0411 (a mainland Ecuador accession) is the product of back migration from the Galápagos underscores the recent and likely substantial human influence on the movement of PIM. We conclude that PIM is most likely the result of a recent, human-mediated expansion on the archipelago. Human introduced species represent upward of 70% of all alien plant species on the Galápagos (Quiroga, 2018), and PIM has similarly been hypothesized to be the product of a human introduction; however the timing and mode of its introduction—including the role of humans—was not previously known.

Hybridization as a consequence of invasion onto the Galápagos

One key evolutionary consequence of PIM’s introduction onto the Galápagos that emerges from our analyses is its hybridization with endemic congeneric species—primarily CHS. Hybridization has been hypothesized as a mechanism for promoting invasive colonization success, as it could help overcome the adaptive limits that might otherwise be imposed by genetic bottlenecks during the colonization process (the so-called ‘genetic paradox’ of invasion; Allendorf and Lundquist, 2003; Sakai et al., 2001). These bottlenecks can be especially severe during introductions onto islands (Kolbe et al., 2004; Colautti et al., 2005; Golani et al., 2007). In addition, several factors indicate the high potential for gene flow specifically between the four studied species (CHS, GAL, PIM, LYC), including their very close evolutionary relationships (all are members of the red-fruited Esculentum subclade within the wild tomatoes; Pease et al., 2016), and their incomplete reproductive barriers (Rick, 1956; Rick and Bowman, 1961; Rick and Fobes, 1975). Nonetheless, previous analyses based on handfuls of loci provided conflicting evidence for and against the occurrence of gene flow between species presents on the island (Nuez et al., 2004; Darwin, 2009).

Our data provide clear evidence for recent hybridization and introgression between all four tomato taxa on the archipelago. Although our focus here is primarily on CHS and PIM, we also find evidence for recent hybridization and/or introgression between CHS and PIM (Santa Cruz), PIM and LYC (Santa Cruz), CHS and GAL (Isabela), and, to a lesser extent, CHS and LYC (Isabela). These patterns suggest that hybridization—both with congeneric endemics (CHS and GAL) and invasives (LYC)—could serve as a source of adaptive genetic variation in invasive PIM.

The most prominent signal of gene flow is between PIM and CHS (Figure 3; Figure 4; Figure 5), including clear evidence for both early generation (F1 and F2) hybrid offspring and older introgression 4–12 generations in the past. Our results indicate (i) that CHS ancestry is maintained in some PIM populations beyond initial hybridization and (ii) that gene flow is ongoing.

The potential consequences of secondary genetic contact are numerous (Wolf et al., 2001; Todesco et al., 2016). While we do not have direct data on relative fitness of hybrids, the persistence of later generation CHS×PIM hybrids indicates they are not immediately selected against. Indeed, the genomes of most admixed PIM (MG114 and MG117) are consistent with a history of secondary contact and gene flow characterized not by strong hybrid incompatibility, but a less restricted exchange of alleles between species. Furthermore, the nonrandom distribution of CHS ancestry throughout admixed PIM suggests that it may be selectively maintained in certain regions of the genome. Instead of observing a heterogeneous set of CHS alleles in the backcrossed genome of PIM, we find that CHS ancestry is enriched on chromosomes 3 and 6, and absent in much of the rest of the genome, in both MG114 and MG117 (Figure 4B; Figure 4—figure supplement 2). Moreover, our local ancestry predictions provide evidence that these introgressed regions may contain key genes responsible for the emergence of orange fruit color in MG114 and MG117.

Orange fruit color in island PIM was derived via introgression from CHS

A key finding of our analyses is that introgression is likely the source of phenotypic convergence on orange fruits that is observed in invasive Santa Cruz PIM. Orange/yellow fruit color is diagnostic for the endemic species (CHS is typically pale yellow; GAL is typically orange) but extremely rare in PIM. At the genetic level, convergence could be based on three potential sources of variation: ancestrally segregating variation, introgression, or via a de novo transition. Of these, ancestral variation is the least likely: The very few described examples of orange fruits among continental PIM are all located in Peru (e.g., Sifres et al., 2007), and none have been reported in the inferred geographic region of origin of this invasion (TGRC passport data; http://www.tgrc.ucdavis.edu). One goal here was therefore to distinguish between introgressed and novel mutation as the source of phenotypic convergence. We did so by mapping the landscape of introgression throughout the genomes of invasive PIM plants, and evaluating its association with observed fruit color variation, and with loci known to underlie this trait in Solanum (Paran and van der Knaap, 2007). With these data we inferred a unique scenario in which phenotypic transitions to orange fruits in two different invasive PIM populations were each derived from introgression at a distinct carotenoid locus: CYC-B or PSY1.

Our data in conjunction with existing experimental evidence indicate that CYC-B is the causative locus for orange fruits in MG117. Interestingly, CYC-B mutants were first identified as natural allelic variation in the endemic species CHS (Rick, 1956). Introgression of the CHS beta allele at CYC-B into LYC causes the accumulation of β-carotene in ripening tissues and the production of orange fruits (Stommel and Haynes, 1994). Orange fruits segregate as a single dominant gene, and genotypic variation at this locus explains a large fraction of fruit color variation in experimental crosses (Rick, 1956; Stommel and Haynes, 1994). Our data show a clear association between CHS ancestry and orange fruit color at this locus (Figure 5)—including the observation that individuals heterozygous for ancestry display the dominant phenotype—so we infer that introgression of CHS CYC-B into PIM has the same large, dominant effect on fruit color in admixed individuals of this wild species.

Unlike CYC-B, PSY1 was first identified in the spontaneous fruit color mutant yellow-flesh in LYC (accession LA2997; Fray and Grierson, 1993), and its role has not been directly evaluated in CHS or GAL. Recessive r mutants at yellow-flesh carry a truncated version of PSY1 that is unable to convert precursor into phytoene. The resulting fruits accumulate almost no carotenoids and the yellow skin pigmentation is driven primarily by the accumulation of the flavonoid chaloconaringenin (Fray and Grierson, 1993). Using previously published RNA-seq data (Pease et al., 2016), we confirmed the expression of PSY1 in both endemic species and did not detect any truncation or premature stop mutations. Rather, we identified a single non-synonymous substitution (62R→W) within the transit peptide signal domain, found in both endemic species. Disruption of the transit signal sequence may prevent localization to the chloroplast and thus result in a nonfunctional enzyme, although determining the exact role PSY1 has in endemic—and by extension PIM—fruit coloration would require future functional confirmation. Regardless, CYC-B and PSY1 in invasive orange-fruited PIM are unequivocally derived from CHS, and current functional knowledge of both loci indicate their effects on fruit color could entirely explain observed phenotypic variation in orange-fruited PIM.

Finally, the ubiquitous lighter (orange and yellow) fruits of the two endemic species, the appearance of convergence toward endemic-like fruit colors in invasive PIM, and the likely independent recruitment of endemic fruit color alleles at PSY1 and CYC-B in MG114 and MG117, together suggest intriguing evidence that lighter fruits may have a specific selective advantage on the islands. The potential environmental basis of this selection is unknown, however differences in fruit dispersal—including disperser color preference(s) and/or fruit color apparency—on the islands versus the continental mainland could be a likely mechanism. Alternatively, at least in the case of PSY1 which likely involves either a full or partial loss of function, orange pigmentation could arise due to relaxed selection, if it is more costly to produce red fruits and they have no specific advantage in island environments. Future field experiments and fitness measurements will help to distinguish among these selective hypotheses.

Conclusions

Our results reconstruct a complex and recent history of invasion by wild tomato onto the Galápagos Islands, and highlight the potential importance of gene flow during colonization. Our results also add to an emerging phenotypic convergence literature by highlighting how admixture brought on by anthropogenic change can drive convergence over very short time scales. While the adaptive benefit of orange fruits remains to be evaluated, our finding of two separate molecular mechanisms underlying orange coloration each derived from CHS is highly suggestive that lighter fruit pigmentation is favored in the island environment. This study underscores how the long history of research on the Galápagos Islands continues to enrich our understanding of evolutionary processes in the natural world.

Materials and methods

Population sampling and genotyping

Request a detailed protocol

We sampled leaf tissue from 13 wild populations of invasive and endemic tomato taxa on the three largest islands of the Galápagos archipelago: San Cristobal, Santa Cruz, and Isabela (Figure 1; Table S1). Leaf tissue was dried in silica and DNA was extracted using Qiagen Plant Mini Kits (Qiagen, Valencia, CA). Two double-digest restriction site associated DNA sequencing (ddRAD) libraries were prepared using PstI and EcoRI enzymes by the Indiana University Center for Genomics and Bioinformatics. Libraries were sequenced across two Illumina NextSeq flowcells (150 bp, paired-end, mind-output). Raw reads were filtered for quality, trimmed of adapter sequence and low-quality bases using fastp (Chen et al., 2018), and demultiplexed by individual using the process_radtags program in Stacks (version 2; Catchen et al., 2013). Reads were mapped to the S. lycopersicum reference genome version SL3.0 using BWA (Li and Durbin, 2009). Bam files of 132 continental accessions representing the full species range of PIM (Figure 2—figure supplement 1; Supplementary file 1b) were jointly reanalyzed with the new samples in Stacks. Mapped reads were assembled and variants were called with the Stacks ref_map pipeline. Genotype calls made with fewer than eight reads were removed and subsequently we retained only sites having data for at least 80% of all 306 individuals. For all analyses except diversity/divergence calculations (π, Tajima’s D, dXY, FST) and Treemix, we pruned sites in high LD (r2 >0.7) using bcftools. Lastly, we also evaluated our dataset for two possible sources of bias—the potential effects of allele dropout (ADO) on genotype calls (Cariou et al., 2016) and mapping bias arising from using a single reference genome—and confirmed that there is little evidence for either source in our dataset (Appendix 1, section S6). All scripts are available at https://github.com/gibsonMatt/galtom (Gibson, 2021b; copy archived at swh:1:rev:1647969c397c5b13d15ab9b5d408bbbab2f6b4a8).

Nucleotide diversity and divergence estimates

Request a detailed protocol

Within-population diversity and divergence estimates across the genome were calculated using the Stacks program populations. Windowed π (Figure 1C) was extracted from Stacks output. For pairwise comparisons between the islands, Ecuador and Peru (right panel Figure 1C), we calculate genome-wide pairwise divergence directly from the assembled RAD loci (samples.fa file) using a custom Python script (http://github.com/gibsonmatt/galtom). For each pairwise comparison between samples, we count the total number of sequence differences and total number of sites for which both samples have data. This choice allows us to conveniently model patterns of diversity between diploid samples in our introgression HMM using a binomial. We also calculated the average genetic distance from each accession to all Galápagos PIM across polymorphic sites in the R package adegenet (Jombart and Ahmed, 2011; Figure 1A).

Phylogenetic reconstruction

Request a detailed protocol

We inferred a maximum likelihood tree of population relationships (Figure 1B) using Treemix with no specified migration. The Treemix input file was generated from a VCF using a custom Python script (http://github.com/gibsonmatt/galtom). We expected abundant phylogenetic discordance both within Galápagos PIM (given its recent divergence from Ecuador) and within the red-fruited tomato clade in general (Pease et al., 2016). To this end, we generated 1000 replicate datasets by sampling (with replacement) 500 SNPs from the full dataset. These trees were visualized using densitree as implemented in the R package phangorn (Schliep, 2011). Both the set of replicate trees and the consensus tree were rate-smoothed using r8s (λ = 1), using the chronopl function available in the R package ape (Paradis and Schliep, 2019). In addition to Treemix, we also inferred a maximum likelihood phylogeny of individual sample relationships using RA×ML (Figure 2—figure supplement 2). We subset our dataset to one individual per population for Galápagos collections and to 15–20 samples per geographic region (Peru and Ecuador) for mainland accessions. We ran RA×ML using the GTRCAT approximation of the general time reversible model of substitution allowing for rate heterogeneity. Twenty-five alternative runs from distinct maximum parsimony trees were performed, from which we selected the best single tree.

Demographic inference

Request a detailed protocol

We modeled bottleneck demographic histories from the SFS using δaδi (Gutenkunst et al., 2009), calculating the folded SFS using easySFS (https://github.com/isaacovercast/easySFS, copy archived at swh:1:rev:b866269f5813ef7cd8a12c7727048f993da8e9ffGibson, 2021a), 30–14 chromosomes, striking a balance between the number of segregating sites and levels of missing data in frequency bins. We chose this population because the larger number of samples (2N = 30) relative to other PIM populations afforded more statistical power. Prior to calculating the SFS for MG114, we removed regions inferred as introgressed by our HMM, as we found that a large fraction of singleton and doubleton sites in MG114 were shared with CHS (Appendix 1, section S3). These sites would be spuriously interpreted as de novo mutations derived in PIM post-colonization, thereby biasing our parameter estimates. See Appendix 1, section S3 for further discussion of our filtering scheme and its effects on δaδi estimates. For population MG115, we down-sampled to 24 chromosomes. For each population, two models were evaluated: (i) a neutral model of no population size change and (ii) an instantaneous bottleneck model. The parameters of the bottleneck model are described in the results. We used the BFGS algorithm and a two-step optimization procedure to explore the demographic parameter space of the PIM bottleneck model. Step i consisted of 200 replicate runs with twofold parameter perturbation. Step ii then consisted of 100 refinement runs of the optimizer with no perturbation and initial parameter values equal to the maximum likelihood estimates from step i. We used multiple diffusion grid sizes in δaδi (30, 40, and 60) as recommended in the manual to facilitate extrapolation. To estimate confidence intervals for our parameters, we performed a nonparametric bootstrapping procedure (assuming independence among sites) by sampling randomly from the observed SFS 1000 times. Since we filter for LD, we consider this method to be appropriate.

Time in δaδi is represented in units of 2Nref generations. To convert from these coalescent time values to numbers of generations, we estimate Nref as θ/4 μL, where θ is the population scaled mutation parameter (estimated by δaδi; 68.40 for MG114), μ is the per-generation mutation rate (assumed here to be 1 × 10−8 muts/bp/generation), and L is the length of queried sequence (5,806,952 bp; estimated from the data as the total number of bases where at least one sample in the focal population had data). Bottleneck and final effective population sizes were similarly converted from relative to absolute values using Nref.

Inferring gene flow between contemporary island populations

Request a detailed protocol

We used several independent methods to characterize genetic structure in our dataset. First, we applied Treemix (Pickrell and Pritchard, 2012) to assess evidence for broad signatures of admixture among populations. Because we were interested in understanding the history of PIM, we subset our full SNP dataset to exclude LYC populations since the exact taxonomic status of these samples is unclear and does not help address questions of gene flow between PIM and endemic species. Furthermore, we remove PIM populations with fewer than eight samples. We ran Treemix with several values for m (0–8), the migration parameter which determines how many reticulation branches are allowed. Based on the likelihoods provided by Treemix, we determined that a migration parameter of 6 was most appropriate. Increasing m led us to infer more intraspecific migration within PIM, but no additional interspecific events could be inferred and the increase in data likelihood was marginal (Figure 3—figure supplement 2; Supplementary file 1j).

We also employed model-based (fastStructure; Raj et al., 2014) and non-model-based (multi-locus PCA) methods to evaluate genetic structure. We ran both methods using the LD-filtered dataset of 5767 SNPs and on each island separately. For each island, we ran fastStructure for values of k between 1 and 7. We then chose an appropriate model complexity using the chooseK.py script supplied with fastStructure. After a value for k was chosen, we evaluated the stability of the ancestry assignments across 5–10 separate runs. The multi-locus PCA was implemented in the R package adegenet.

For select populations with evidence for admixture, we ran NewHybrids (Anderson and Thompson, 2002) to identify any early generation hybrid individuals (F1, F2, and backcrosses). NewHybrids was run using the same LD-filtered dataset as in fastStructure and PCA (script for converting from vcf to NewHybrids format is provided at http://github.com/gibsonmatt/galtom). For each NewHybrids run, we specify three groups: a parent population A group, a parent population B group, and an admixed group. For the CHS×PIM interaction on Santa Cruz, parent populations were defined as non-admixed MG115 and MG113 individuals. For the CHS×GAL interaction on Isabela, parent populations were MG120C and MG120G. For each population, we ran the Markov chain for at least 6000 iterations. Individual assignments were not sensitive to the choice of priors. Lastly, four-sample D-statistics (Durand et al., 2011) were calculated using the compD software package (https://github.com/stevemussmann/Comp-D; Mussmann, 2020), using S. pennellii (LA3778) as an outgroup species.

Introgression analysis

Request a detailed protocol

We implemented an HMM to identify fine-scale genomic signatures of introgression. Although RAD sequencing is not optimal for genome scanning due to lower marker density, recent signatures of introgression should be large and detectable. Nonetheless, we acknowledge the fact that our sequencing methods may not allow for full characterization of the landscape of introgression. In our analysis we leveraged the fact that, in regions of recent introgression, genetic diversity between samples within the destination population (π) should be elevated relative to diversity between samples in the source and destination populations (dXY). In other words, regions of recent interspecific coalescence should resemble individuals from the introgressing population more than individuals in the destination population. For each pairwise comparison between samples, we thus have an estimate of π (if the samples are from the same population) or dXY (if they are from different populations) as well as the raw number of sequence differences and total sites with data.

Our HMM featured three hidden states: (i) CHS ancestry, (ii) PIM ancestry, and (iii) heterozygous CHS/PIM ancestry, for which we used π, dXY, and the mean of π and dXY (m) in nonoverlapping 100 kb windows, respectively, to calculate emission probabilities. The HMM was applied to each individual separately, using pairwise sequence comparisons between it and all other focal samples to calculate probabilities. We used three binomial models to obtain these probabilities (see Appendix 1, section S7 for a detailed description of the model). For each chromosome and for each focal comparison, we found the most likely hidden state path for a given sequence of π and dXY using the Viterbi algorithm, controlling for underflow by operating in log-space. Because of the coarse scale (100 kb windows) used in ancestry assignment, our HMM struggled to identify smaller genetic signals potentially consistent with introgression. Figure 4—figure supplement 3 provides a per-window look at how ancestry assignment correlated with patterns of diversity and divergence in MG114. Patterns of diversity/divergence vary substantially even between adjacent windows and thus it is likely that we are not capturing sub-100kb signals of introgression. Nonetheless, the large size of the blocks we do identify allows us to be confident they are the result of introgression rather than ILS.

Appendix 1

S1. Field collections and taxonomic treatments

S1.1. Collections and DNA extraction

We visited San Cristóbal, Santa Cruz, and Isabela—the three most populated islands in the archipelago. These islands contain 40 documented TGRC locations (TGRC passport data) and these sites, as well as those described in Darwin, 2009 and Nuez et al., 2004, were searched during our expeditions. We looked for plants within a 200 m radius of each previously documented collection site, and we also searched for additional undocumented populations. Populations were sampled across linear transects and detailed photographs of leaf, fruit, and flower morphology were taken of all sampled individuals against color and length standards. The latitude and longitude of each collection location was logged using a smartphone. Each site and the identity of species found within are described in Table S1. For extracting DNA, three to five leaves were sampled from each plant and immediately dried in silica gel. Tissue disruption was performed with mini-pestles and DNA for each sample was extracted using Qiagen Plant Mini kits (Qiagen, Valencia, CA) at the Galápagos Science Center (San Cristobal, Galápagos, Ecuador). Our sequencing pipeline is described in the main text.

S1.2. Taxonomy

Population identity was determined based on the taxonomic treatments described in Darwin et al., 2003, with the exception of LYC, which we separated into two forms—LYC (domesticated tomato) and LYC var. cerasiforme (CER; cherry tomato; Rick, 1956)—to maintain consistency with Nuez et al., 2004. Ripe fruit color was the primary trait used in species identification, with the exception of individuals designated as orange-fruited S. pimpinellifolium, based on all other characters. We additionally measured nine leaf traits—leaf length, leaf width, terminal leaflet length, leaftlet count, interjected leaflet count, petiole length, internode length, leaflet length, and leaflet width previously identified as diagnostic by Darwin et al., 2003.

S2. LA0411 and our inference of back migration

S2.1. Original collection notes for LA0411

Charles Rick, 1956: ‘Pichilingue. 200 m. Lycopersicon pimpinellifolium or Lycopersicon esculentum var. cerasiforme? Weedy vigorous form growing as weed in corn field N of station buildings. Population of 15 plants found here; five had ripe fruit, long internodes, leaves more elaborate than pimp. Few long hairs at growing point like Santa Cruz pimpinellifolium. Alternation all = 3 or ±3. Flowers small, stigmas all exserted 1 mm or more. Population seems to be entirely uniform. Inflorescence all first order, racemose. Fruits large for pimp = 1.5 cm, red. No vectors seen. Long tailed thick billed black birds seen near patch. These said to eat wild tomatoes near Guayaquil cement factory. Odor = esculentum. Anthocyanin = dark’.

S2.2. Evidence for back migration

Based on patterns of genomic relatedness to Galápagos and surrounding PIM accessions, we infer that mainland Ecuadorian accession LA0411 (collected in 1956 by Charles Rick) is the product of a back migration from the Galápagos to mainland Ecuador. This accession showed a particularly strong resemblance to Galápagos PIM (average dxy = 0.0006) and was also divergent from neighboring mainland Los Rios accessions (Figure 2—figure supplement 4). This accession was indeed noted as morphologically similar to Galápagos PIM tomato collections when sampled on mainland Ecuador in 1956 (TGRC passport data; http://tgrc.ucdavis.edu; S2.1, above). This inference has two implications for understanding the history of PIM and other invasive species on the islands. First, it sets an upper bound on the timing of the initial introduction of PIM to the Galápagos as no later than 1956, as it must have already been established there prior to a back migration event. Second, it highlights the substantial connectivity between this region (the putative source of invasive PIM) and the Galápagos in general.

This finding does not affect our analyses of invasive population origins, including our inference that most invasive PIM have an Ecuadorian source. Invasive PIM have high genetic similarity with many accessions in this region of Ecuador (Figure 2A). Furthermore, running Locator without accession LA0411 produces results identical to those described in the main text.

S3. The impact of recent introgression on demographic inferences with δaδi

A recent history of introgression from CHS/GAL into invasive PIM populations could bias δaδi parameter estimates toward older dates by introducing low-frequency alleles that would incorrectly be interpreted as de novo mutations derived post-expansion. To examine this possibility, we determined whether rare alleles in MG114 were private (i.e., informative of the time since bottleneck) or shared with CHS (e.g., from introgression). We found that a large portion (52%) of singleton and doubleton SNPs in MG114 were shared with CHS (MG115), suggesting that introgressed variants could substantially affect our parameter estimates of the occurrence and timing of demographic changes within invasive PIM. For this reason, prior to model fitting we removed SNPs within all genomic regions where at least one MG114 individual had evidence for CHS ancestry based on our HMM. Any region inferred to be of CHS ancestry in any individual was removed from the dataset for all individuals. This resulted in 365 sites being filtered from the SFS relative to the original dataset.

S4. On the timing of introgression

The size of introgression blocks contains information about their age. Over time, recombination will break up contiguous blocks into smaller tracts, resulting in a mosaic of ancestry throughout the genome. We can roughly estimate the age of any given block if we assume that an initial hybridization event was followed by subsequent backcrossing to non-admixed parental individuals. Using a simple logarithmic relationship, the expected time t required to arrive at a proportion p of the donor genome after repeated backcrossing is estimated as:

log(1p)/log(2)

via Lynch and Walsh, 1998. As block size gets smaller, the expected time since hybridization increases.

The distribution of introgression tract sizes and the concordance of break points between blocks within and between populations are indicative of non-independence among blocks. Such a pattern makes any detailed assessment of the timing of these events difficult, and our broad estimate makes several simplifying assumptions. The occurrence of selfing and/or small population sizes may also affect the relationship between block length and age of introgression. Nonetheless, it is clear that many of the observed patterns of CHS ancestry throughout the genomes of MG114 and MG117 plants were derived from very recent (and likely shared) events. We further investigated the degree to which these histories of introgression were shared by comparing the ancestry of each population window-by-window. Specifically, we called windows of shared CHS ancestry in MG114 and MG117 as those where at least one individual in each population was assigned as CHS by our HMM. Across the genome, 250 windows were inferred to be of CHS ancestry in both populations, representing 26.8% of all CHS ancestry in MG114% and 35.4% of all CHS ancestry in MG117. These patterns imply a complex and partially shared history of admixture in MG114 and MG117. This inference is consistent with several of our other analyses, including Treemix (Figure 4) and D-statistics which also pointed to a shared basis to detected patterns of admixture. At the resolution which we are able to assign local ancestry, our data also firmly indicate that endemic variation is maintained within PIM beyond the first or second generation of hybridization.

S5. Further details on evidence for gene flow

We use several statistical methods for detecting gene flow between invasive and endemic populations. Our primary focus was on characterizing admixture between CHS and PIM on Santa Cruz (described in text), however several additional patterns are also worth presenting. These are discussed below.

S5.1. Treemix

In addition to the two cases of interspecific PIM×CHS admixture, three cases of intraspecific admixture within PIM were inferred using Treemix: one from Peru PIM into Ecuador PIM, one from the ancestral Galápagos branch to MG107, and one from Ecuador PIM to MG111. It is difficult to interpret the factors responsible for these inferred events, although they may not all reflect true admixture cases. The edge between Ecuador and Peru PIM is most likely a byproduct of our distinction between these groups; in reality the relatedness among mainland samples reflects a pattern of IBD across latitude. The edges leading to MG107 and MG111 could suggest the occurrence of additional minor introduction events nested within the major Ecuador event. Such a scenario might be plausible given the known substantial trade between the Galápagos and central mainland Ecuador, and the occurrence of at least one reintroduction event. However, we have no independent support for such additional events. Increasing m in Treemix had the effect of inferring additional intraspecific migration within PIM, but did not infer any additional between species admixture (see Figure 3—figure supplement 2 for additional Treemix run summaries) and the increase in data likelihood was marginal (Supplementary file 1k). This indicates that the precise parameter choices in these analyses do not change the number of introgression events inferred between invasive PIM and the endemic island species, beyond those that described in the main text.

S5.2. S. lycopersicum

In addition to MG118 (population of F1 PIM×LYC plants) described in text, other weaker signals of LYC ancestry in PIM (e.g., in MG113, M114, M116, and M117; Figure 4A) may also be the result of post-colonization admixture, however they may also represent latent (unmodeled) population structure within PIM and/or reflect the hybrid ancestry of LYC var. cerasiforme (Ranc et al., 2008). Increasing K from 3 to 4 in fastStructure swapped the minor LYC ancestry fractions seen in MG113, MG116, and MG117 for a fourth ancestry class. However, the minor LYC component seen in MG114 as well as the 50/50 LYC×PIM MG118 population retained their original LYC classifications. It is difficult to interpret this given the complex and unresolved history of LYC var. cerasiforme (Ranc et al., 2008), however it may indicate (i) the presence of two or more distinct LYC lineages on the Galápagos or (ii) a contribution by S. galapagense (GAL; not sampled on Santa Cruz but historically present). The former scenario may be expected if multiple varieties of domesticated seed have been used for cultivation on the archipelago.

Separate from the clear case of CHS×GAL admixture on Isabela at la laguna de manzanilla (discussed in the main text), additional signals of admixture between CHS/GAL (MG120)×LYC (MG122 and MG126) were also detected (Figure 4B). In MG122, increasing K from 3 to 4 clearly separates the three inferred GAL×LYC individuals into a separate non-admixed subcluster, suggesting that the inferred GAL ancestry in these samples likely reflects latent substructure within LYC. In contrast, even when increasing K from 4 to 5, the three inferred CHS×LYC samples in MG126 retain their minority CHS component, consistent with a real signal of shared ancestry between CHS and LYC for these individuals. Whether this is the result of gene flow post-colonization of LYC on the islands or a reflection of breeding history (CHS germplasm has been used as a source for certain beneficial crop traits; Rick, 1967; Stommel and Haynes, 1994) is unclear.

S6. Evaluating the potential impact of allele dropout and reference mapping bias

Allele drop out (ADO) is a potential source of bias in restriction enzyme-based genotyping protocols. When polymorphisms occur in restriction cut sites, the RAD tag can ‘drop out’, potentially masking true SNPs present in linked haplotypes. ADO can convert true heterozygous genotypes into homozygous calls. It has been well documented in simulations and real data (Cariou et al., 2016), yet methods for dealing with it are lacking. The general observation is that, when polymorphism is high, the probability of an SNP occurring in a cut site increases, resulting in fewer true heterozygotes being identified and an overall bias toward observing lower levels of diversity.

To assess whether ADO is having a substantial effect in our dataset, we examined the relationship between sequencing depth and observed locus heterozygosity, since loci experiencing dropout are expected to be sequenced to lower depths. We observe no significant association between depth and heterozygosity (linear regression p = 0.191; Appendix 1—figure 1), indicating that ADO is unlikely to be having a pervasive effect on our data or findings.

Appendix 1—figure 1
Sequencing depth (average number of reads per individual) across loci.

Panel (A) histogram of depth values. Panel (B) relationship between sequencing depth and observed heterozygosity.

To investigate any possible effects of mapping bias, we compared how data presence/absence, numbers of recovered RAD loci, and total number of mapped reads varied between species/populations. Data missingness (the fraction of individuals at a locus without data) never exceeded 20%. We found that CHS/GAL individuals had a higher proportion of sites with 0% missing data than did PIM, although PIM overall never exceeded 10% missing data whereas CHS/GAL did for some markers (Appendix 1—figure 2). Regardless, both observations confirmed that the final marker set has low levels of missing data, which should minimize any mapping bias. We also directly examined whether read mapping was biased toward any particular species or population. We did find variation by population (but not by species) in the total number of reads mapped with BWA (Appendix 1—figure 3, right panel), but since each population had higher than 95% of reads retained in Stacks (Appendix 1—figure 3, center panel), this variation in total reads mapped actually reflects variation among populations in total sequencing output. As a result, there is little variation among populations in the total number of assembled RAD loci (Appendix 1—figure 3, left panel), and no consistent difference between species.

Appendix 1—figure 2
Clade-specific distributions of missing data fractions across loci.

Left panel: histogram of missing data fractions for all PIM populations. Right panel: histogram of missing data fractions for all endemic (CHS/GAL) populations.

Appendix 1—figure 3
Population-specific estimates of the total number of assembled RAD loci (left panel), fractions of total reads retained after assembly (center panel), and total number of reads mapped with BWA (right panel).

The total number of mapped reads was on average lower in PIM, yet the average fractions of mapped reads assembled into loci (or ‘stacks’) and the total number of final RAD loci were roughly equal across populations and species. The differences in numbers of mapped reads point to substantial variation in total sequencing output rather than an inability to map in PIM.

S7. Local ancestry assignment with HMM

Our HMM was implemented in R and Python (code available at http://github.com/gibsonmatt/galtom). We used binned pairwise sequence divergence in 100 kb nonoverlapping windows to define emission probabilities and defined three hidden states (homozygous CHS ancestry, homozygous PIM ancestry, and heterozygous CHS/PIM). We use our HMM to identify regions of recent coalescence between each population.

For a focal individual and in each window, we calculated emission probabilities using three binomial models:

PCHS=(d1s)πs(1π)d1s
PPIM=(d2s)dXYs(1dXY)d2s
PHET=(d3s)ms(1m)d3s

where s is the number of sites in the window and d1, d2, and d3 are the median number of differences to MG114, MG115, and the mean of d1 and d2, respectively. Transition probabilities (for the CHS into PIM model) were defined as follows:

P[CHSCHS]=([1r]+r)+a
P[PIMPIM]=([1r]+r)+(1a)
P[CHSPIM]=P[PIMCHS]=P[HETCHS]=P[HETPIM]=r×a

where r and a can be interpreted as proportional to the per-window recombination rate and admixture proportion, respectively. We scaled r by a factor t, which can be interpreted as proportional to the time since admixture. Increasing t will cause the HMM to be more likely to switch between states since recombination will have had more time to break up introgressed blocks. We found that our HMM is relatively insensitive to chosen transition probabilities. For all populations, we chose to use 0.002, 0.48, and 10 for r, a, and t, respectively, as these produced consistent annotations that agreed with patterns of observed nucleotide diversity. Each focal individual in a population was analyzed separately against each of the individuals in the potential donor population. For example, to analyze introgression from CHS → PIM, divergence to all MG114 and MG115 individuals is calculated and used to define emissions.

Data availability

Raw, demultiplexed ddRAD reads have been deposited under NCBI BioProject PRJNA661300 and will be available once processed by NCBI. Genotype files, associated datasets, and analysis scripts have been deposited on Dryad (https://doi.org/10.5061/dryad.2v6wwpzkm). Additionally, data posted to Dryad can also be accessed at https://github.com/gibsonMatt/galtom (copy archived at https://archive.softwareheritage.org/swh:1:rev:1647969c397c5b13d15ab9b5d408bbbab2f6b4a8).

The following data sets were generated
    1. Gibson MJ
    2. Torres MdL
    3. Brandvain Y
    4. Moyle LC
    (2020) Dryad Digital Repository
    Data from: Reconstructing the history and biological consequences of a plant invasion on the Galápagos islands.
    https://doi.org/10.5061/dryad.2v6wwpzkm
The following previously published data sets were used
    1. Pease JB
    2. Haak DC
    3. Hahn MW
    4. Moyle LC
    (2016) Dryad Digital Repository
    Phylogenomics reveals three sources of adaptive variation during a rapid radiation.
    https://doi.org/10.5061/dryad.182dv
    1. Gibson MJS
    2. Moyle LC
    (2020) Dryad Digital Repository
    Regional differences in the abiotic environment contribute to genomic divergence within a wild tomato species.
    https://doi.org/10.5061/dryad.8gtht76k4

References

  1. Book
    1. Carlquist SJ
    (1974)
    Island Biology
    New York: Columbia University Press.
  2. Book
    1. Darwin SC
    (2009)
    The Systematics and Genetics of Tomatoes on the Galápagos Islands (Solanum, Solanaceae)
    Doctoral, UCL: University College London Press.
  3. Software
    1. Gibson MJS
    (2021a)
    easySFS, version swh:1:rev:b866269f5813ef7cd8a12c7727048f993da8e9ff https://archive.softwareheritage.org/swh:1:rev:b866269f5813ef7cd8a12c7727048f993da8e9ff
    Software Heritage.
  4. Conference
    1. Gutenkunst R
    2. Hernandez R
    3. Williamson S
    4. Bustamante C
    (2010)
    Diffusion approximations for demographic inference: dadi
    Nature Precedings.
  5. Book
    1. Lundh JP
    (2004)
    Galápagos: A Brief History Unpublished Manuscript
    Oslo, Norway: Human and Cartographic History of the Galápagos Islands.
  6. Book
    1. Lynch M
    2. Walsh B
    (1998)
    Genetics and Analysis of Quantitative Traits
    Massachusetts: Sinauer Sunderland.
  7. Book
    1. McMullen CK
    (1999)
    Flowering Plants of the Galápagos
     New York: Cornell University Press.
  8. Book
    1. Quiroga D
    (2018) Introduced species and the threats to the Galapagos
    In: de Lourdes Torres M, Mena C. F, editors. Understanding Invasive Species on the Galapagos (1st edn). Berlin: Springer. pp. 2–13.
    https://doi.org/10.1007/978-3-319-67177-2
  9. Conference
    1. Rick CM
    (1963)
    Biosystematic studies on galapagos tomatoes
    Occasional Papers of the California Academy of Sciences. pp. 59–77.
    1. Rick CM
    2. Bowman RI
    (1961)
    Galapagos tomatoes and tortoises. Evolution
    International Journal of Organic Evolution 1:407–417.

Decision letter

  1. Detlef Weigel
    Senior Editor; Max Planck Institute for Developmental Biology, Germany
  2. Hernán A Burbano
    Reviewing Editor; University College London, United Kingdom
  3. Gregory Owens
    Reviewer; University of Victoria, Canada

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This paper reconstructs the history of native and invasive tomatoes in the Galápagos Islands-including species that were first collected by Darwin himself. This is a careful and thoughtful study that describes a highly interesting case of phenotypic convergence for fruit color, driven by the exchange of carotenoid loci between endemic and invasive populations. The work provides a beautiful example of natural experiments that advance our understanding of evolution.

Decision letter after peer review:

Thank you for submitting your article "Reconstructing the history and biological consequences of a plant invasion on the Galápagos islands" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Senior Editor and a Reviewing Editor. The following individual involved in review of your submission has agreed to reveal their identity: Gregory Owens (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

As the editors have judged that your manuscript is of interest, but as described below that additional experiments are required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)

Summary:

Tomatoes of the Galapagos Islands are a fascinating system for studying the introduction of alien species as well as their hybridization and competition with native congeners. The current work is based on genotype information from native and introduced tomatoes on the islands as well as mainland individuals. The authors infer multiple introductions using demographic modelling and admixture/introgression between introduced and native populations, nicely highlighting that introduction histories are rarely simple.

Essential revisions:

A major concern are the potential biases inherent in the use of RAD-seq data in combination with a single reference genome. If the authors want to include the dadi analysis (which is based on the joint SFS), they must not only clearly discuss the caveats, but additional analyses must be conducted with other models. Currently, only a very limited set of models is used, with a focus on the median results in the main text – with very large CIs. One would expect the optima (reported in the text) as well as the bootstrap medians to be similar. That they are not suggests a poor fit between the data and the model. As the authors are very well aware, in these types of analyses there is always a 'best fit model' but the 'best' model is not particularly informative if it does not fit the data well."

I am including the full reviews, which provide several suggestions for how to improve the analyses / moderate the claims.

Reviewer #1:

Gibson and colleagues use RAD-seq genotyping data from continental and island populations of closely related Solanum species to examine population history and evidence for admixture in the Galapagos islands. This paper combines results from those previously published (Gibson et al., 2020, Molecular Ecology, Gibson et al., 2020, Evolutionary Ecology) and adds in new genotyping and population genetic analyses. The conclusions here are that (i) the wild tomato species S. pimpinellifolium was introduced to the Galapagos in the recent past and has caused a decline in the endemic S. cheesmaniae species and S. galapagense species (which appear not to actually be separate species based on the results here) and that (ii) 'borrowed' alleles from the S. cheesmaniae (CHS) and S. galapagense (GAL) have benefitted S. pimpinellifolium in some way, enhancing its reproductive success.

The results are broadly consistent with the previous findings published earlier this year both in terms of overall population structure (Gibson et al., 2020, Molecular Ecology) and the introgression of the orange fruit locus (Gibson et al., 2020, Evolutionary Ecology) and represent a clear next-step in this series of papers.

For the population genetic analyses presented, I am concerned about the potential biases introduced from RAD-seq data and in particular the strong conclusions and especially the specific time estimates regarding the migration of the S. pimpinellifolium populations. Inherent biases from this type of data and their impacts on basic population genetic parameters have been well-characterized. These were described in Gautier et al., 2013, Arnold et al., 2013, and more recently detailed in Cariou et al., BMC 2016. Arnold et al., 2013 (Mol Ecol) conducted analyses based on simulations and empirical data and found severe biases in genealogical inferences as well as population genetic summary statistics (pi, ThetaW, Tajima's D, FST). Similarly, Gautier et al., (Mol Ecol 2013) showed that allelic dropout in RAD-seq studies biases the inference of genetic variation within and between populations, which was further detailed by Cariou et al., BMC 2016.

Compounded with the issues known for RAD-seq, a single Solanum species is used for alignment of reads, which likely results in a further bias toward apparent lower variation in the diverged island species.

Here, the authors calculate the summary statistics that were already shown to be biased when calculated from RAD-seq data and also go further, using the joint site frequency spectrum for inference via analysis with dadi. Tajima's D is of course a summary statistic based on aspects of the SFS and if it is shown to be biased, I would expect that the numbers of variants assigned to the bins of the JSFS (used in dadi) are problematic.

Even in the best case (i.e., full sequence data), with dadi it is easily possible to find multiple very different demographic models that fit the data equally well (for example earlier migration of fewer individuals vs. more recent migration of a larger number of individuals). Also, in this case, the CIs from the dadi analysis are extremely large, which suggests that the models examined are not very close to reality (or that there are deeper issues with bias due to the nature of the genotype data). For example, in PIM, the bootstrap median estimates for the bottleneck and recovery times are 847 generations ago and 840 generations ago, with confidence intervals of 22-13,591 generations ago for the time to recovery and 0-8000 generations ago for the time to the bottleneck. For the species considered to be island endemics, some estimates were provided but no confidence intervals from that dadi analysis were reported. Even if we are to trust there is no major bias in the RAD-seq genotyping, the dadi do not exlude an ancient natural migration of the S. pimpinellifolium species and even seem to support this possibility.

There seems to be pretty strong evidence presented that the orange locus in PIM is due to introgression from CHS.

Gibson and colleagues use RAD-seq genotyping data from continental and island populations of closely related Solanum species to examine population history and evidence for admixture in the Galapagos islands. This paper combines results from those previously published (Gibson et al., 2020, Molecular Ecology, Gibson et al., 2020, Evolutionary Ecology) and adds in new genotyping and population genetic analyses. The conclusions here are that (i) the wild tomato species S. pimpinellifolium was introduced to the Galapagos in the recent past and has caused a decline in the endemic S. cheesmaniae species and S. galapagense species (which appear not to actually be separate species based on the results here) and that (ii) 'borrowed' alleles from the S. cheesmaniae (CHS) and S. galapagense (GAL) have benefitted S. pimpinellifolium in some way, enhancing its reproductive success.

The results are broadly consistent with the previous findings published earlier this year both in terms of overall population structure (Gibson et al., 2020, Molecular Ecology) and the introgression of the orange fruit locus (Gibson et al., 2020, Evolutionary Ecology) and represent a clear next-step in this series of papers.

For the population genetic analyses presented, I am concerned about the potential biases introduced from RAD-seq data and in particular the strong conclusions and especially the specific time estimates regarding the migration of the S. pimpinellifolium populations. Inherent biases from this type of data and their impacts on basic population genetic parameters have been well-characterized. These were described in Gautier et al., 2013, Arnold et al., 2013, and more recently detailed in Cariou et al., BMC 2016. Arnold et al., 2013 (Mol Ecol) conducted analyses based on simulations and empirical data and found severe biases in genealogical inferences as well as population genetic summary statistics (pi, ThetaW, Tajima's D, FST). Similarly, Gautier et al., (Mol Ecol 2013) showed that allelic dropout in RAD-seq studies biases the inference of genetic variation within and between populations, which was further detailed by Cariou et al., BMC 2016.

Compounded with the issues known for RAD-seq, a single Solanum species is used for alignment of reads, which likely results in a further bias toward apparent lower variation in the diverged island species.

Here, the authors calculate the summary statistics that were already shown to be biased when calculated from RAD-seq data and also go further, using the joint site frequency spectrum for inference via analysis with dadi. Tajima's D is of course a summary statistic based on aspects of the SFS and if it is shown to be biased, I would expect that the numbers of variants assigned to the bins of the JSFS (used in dadi) are problematic.

Even in the best case (i.e., full sequence data), with dadi it is easily possible to find multiple very different demographic models that fit the data equally well (for example earlier migration of fewer individuals vs. more recent migration of a larger number of individuals). Also, in this case, the CIs from the dadi analysis are extremely large, which suggests that the models examined are not very close to reality (or that there are deeper issues with bias due to the nature of the genotype data). For example, in PIM, the bootstrap median estimates for the bottleneck and recovery times are 847 generations ago and 840 generations ago, with confidence intervals of 22-13,591 generations ago for the time to recovery and 0-8000 generations ago for the time to the bottleneck. For the species considered to be island endemics, some estimates were provided but no confidence intervals from that dadi analysis were reported. Even if we are to trust there is no major bias in the RAD-seq genotyping, the dadi do not exlude an ancient natural migration of the S. pimpinellifolium species and even seem to support this possibility.

There seems to be pretty strong evidence presented that the orange locus in S. pimpinellifolium is due to introgression from CHS.

Reviewer #2:

This paper reconstructs the invasion history of wild tomatoes in the Galapagos. The authors genotype multiple samples from four species, including native and invasive tomatoes, and their putative mainland progenitors. They find support for multiple Ecuadorian origins of the invasion using demographic modelling. They also find that there has been introgression in multiple instances between native and invasive species and that a fruit color polymorphism in PIM is likely due to introgression from CHS at fruit color loci.

I really enjoyed this paper. It does a very nice and clean job of testing its hypotheses using modern techniques, like dadi or Locator, but also backing them up with basic pop-gen statistics, e.g., π and dxy. The figures are well-made and thoughtful, and the methods are very well documented. I think the results are worth publishing in eLife, in particular the finding that fruit color has introgressed is an interesting story that will catch people's interests. With this in mind, I wholeheartedly endorse the paper.

https://doi.org/10.7554/eLife.64165.sa1

Author response

Essential revisions:

A major concern are the potential biases inherent in the use of RAD-seq data in combination with a single reference genome. If the authors want to include the dadi analysis (which is based on the joint SFS), they must not only clearly discuss the caveats, but additional analyses must be conducted with other models. Currently, only a very limited set of models is used, with a focus on the median results in the main text -- with very large CIs. One would expect the optima (reported in the text) as well as the bootstrap medians to be similar. That they are not suggests a poor fit between the data and the model. As the authors are very well aware, in these types of analyses there is always a 'best fit model' but the 'best' model is not particularly informative if it does not fit the data well."

I am including the full reviews, which provide several suggestions for how to improve the analyses / moderate the claims.

Thank you for handing our manuscript and for the positive feedback. Below we have addressed each reviewer comment point-by-point, including detailing the text changes as well as additional analyses and figures that have been integrated in the revised submission. Broadly, these revised changes include (i) a thorough quantitative assessment of the impact of allele drop out (ADO) in our RAD seq data; (ii) a reassessment of sequence metadata to evaluate any evidence for mapping bias introduced by our use of a single reference; and (iii) correction of several errors in the dadi inference pipeline that led to exaggerated CIs. We summarize these changes below:

(i) RADseq bias: We agree that the previous manuscript could have better addressed the potential bias in RADseq data caused by allele drop out (ADO). In the revised submission, we now show evidence that genotyping errors caused by polymorphisms in restriction sites are likely not having systematic effects on our results (as detailed further in responses to reviewers below). Briefly, first we note that for species with genome-wide diversity estimates (i.e., π) below 2% (which includes our study species), previous studies (Cariou et al. 2016) indicate that the bias introduced by ADO is negligible. Second, we find no relationship between diversity metrics and sequencing depth (Appendix 1—figure 1); this would be expected if sites sequenced to lower depths experienced drop-out in our dataset. We have amended our methods section to highlight the lack of evidence for ADO (lines 598-602), added an additional supplementary text section (S6), and added Appendix 1—figures 1-3.

(ii) Mapping bias: We agree that the choice of reference genome can have a large influence on diversity metrics when studied species are distantly related. However the four species included in this manuscript are < 0.5My diverged, and prior data indicate that GAL/CHS and PIM are approximately equally related to LYC (the reference species; Pease et al., 2016). Our removal of sites with greater than 20% missing data should also reduce any potential effects of mapping bias if they exist. Nonetheless, to show this more directly, we now investigate data presence/absence, numbers of recovered RAD loci, and total number of mapped reads between species/populations, each of which indicates that mapping to the domesticated tomato reference genome does not generate any consistent or substantial bias between the three wild species. We have added additional text to our methods section to highlight the absence of mapping bias (lines 598-602), provided additional supplementary figures (Appendix 1—figures 2 and 3), and a supplementary text section (S6).

(iii) Demographic models: The largest concern you bring up regards inferences from our demographic models. We agree that the large reported CIs, and the disagreement between optimized estimates and bootstrapped medians, was an unexpected element of our previous analysis, given the natural history and collection data that indirectly support our two main inferences from this analysis (recent, human-mediated introduction of PIM, and older/pre-human introduction/evolution of endemic species). In revisiting our models in response to these concerns, we found three coding errors (described in detail below) that appear to have been responsible for the unexpected difference between optimization and bootstrap runs as well as inflated variance in bootstrapped sampling distributions. After correcting these errors, our revised analyses of the timing and strength of the bottleneck in PIM now have CIs that are more narrow (see revised Table 2), and estimates from optimization and bootstrapping that are in close agreement. For MG115 (CHS), our reanalysis with corrected code indicates that CHS most likely experienced a large expansion between 1114 and 1845 generations ago (rather than a bottleneck, which was a local minimum in our optimization runs), coupled with a very recent and strong contraction. Compared to our estimate of the timing of the bottleneck in PIM, the expansion in CHS is nearly 8 times older. These corrections address the major concerns about our prior poor model fit, while still using only single population SFSs for each species, but also acknowledge that the true history of these populations is likely much more complicated. More complex models which leverage the joint SFS might be possible, but we believe these would be less appropriate for these particular populations/species. We previously constructed a joint SFS from MG114 (PIM) and MG115 (CHS) but found that most variation is actually private and not shared between the two populations (species). We believe this is a realistic consequence of inbreeding coupled with the occurrence of a strong/recent bottleneck in PIM, and indicates that models built around single population SFSs for each species are a preferable alternative to more complex model inferences, that likely go beyond the inferential power of our dataset. Instead, in the revisions we now also provide model-free support for the inferred histories in the form of bootstrapped estimates for genome-wide Tajima’s D (Table 2—figure supplement 1). These also clearly support our inference of a recent bottleneck in PIM. We have amended our Results section to reflect the updates to our demographic model fits as well as to include the updated estimates and error intervals for Tajima’s D (lines 220-223).

Reviewer #1:

Gibson and colleagues use RAD-seq genotyping data from continental and island populations of closely related Solanum species to examine population history and evidence for admixture in the Galapagos islands. This paper combines results from those previously published (Gibson et al., 2020, Molecular Ecology, Gibson et al., 2020, Evolutionary Ecology) and adds in new genotyping and population genetic analyses. The conclusions here are that (i) the wild tomato species S. pimpinellifolium was introduced to the Galapagos in the recent past and has caused a decline in the endemic S. cheesmaniae species and S. galapagense species (which appear not to actually be separate species based on the results here) and that (ii) 'borrowed' alleles from the S. cheesmaniae (CHS) and S. galapagense (GAL) have benefitted S. pimpinellifolium in some way, enhancing its reproductive success.

Thank you for providing a thorough review of our manuscript. You are correct that there has been historical disagreement about the taxonomic status of the two Galapagos endemic species GAL and CHS. While the most current taxonomic treatments consider them to be separate species, past classifications have not. The distinction does not influence the major inferences from our study – with regard to invasion and introgression involving S. pimpinellifolium – so we reference the most recent and widely accepted classifications presented in Darwin et al. (2003).

We have addressed each of your comments below, which includes additional analyses, figures, and text changes where applicable.

The results are broadly consistent with the previous findings published earlier this year both in terms of overall population structure (Gibson et al., 2020, Molecular Ecology) and the introgression of the orange fruit locus (Gibson et al., 2020, Evolutionary Ecology) and represent a clear next-step in this series of papers.

Thank you for taking the time to become familiar with the previous literature supporting this manuscript. We agree that data on non-invasive/mainland PIM population structure, and field observations of invasive and endemic species on the Galapagos, both help to provide a clear context for posing and answering the central questions in the current study.

For the population genetic analyses presented, I am concerned about the potential biases introduced from RAD-seq data and in particular the strong conclusions and especially the specific time estimates regarding the migration of the S. pimpinellifolium populations. Inherent biases from this type of data and their impacts on basic population genetic parameters have been well-characterized. These were described in Gautier et al., 2013, Arnold et al., 2013, and more recently detailed in Cariou et al., BMC 2016. Arnold et al., 2013 (Mol Ecol) conducted analyses based on simulations and empirical data and found severe biases in genealogical inferences as well as population genetic summary statistics (pi, ThetaW, Tajima's D, FST). Similarly, Gautier et al., (Mol Ecol 2013) showed that allelic dropout in RAD-seq studies biases the inference of genetic variation within and between populations, which was further detailed by Cariou et al., BMC 2016.

Thank you for raising this concern. We agree that the manuscript could have more thoroughly addressed potential concerns about biases that can arise from RADseq data, and we have now modified this in the revised text. To clarify, the main source of bias is the potential for polymorphisms in restriction sites leading to allele drop out (ADO). As shown in the papers you cite, genotyping errors caused by ADO can lead to underestimates of SNP-based estimates of polymorphism and, as a result, can bias summary statistics generated from these data (as well as the SFS, but see response to your additional comments below). While we acknowledge this potential concern, two lines of evidence indicate that this effect does not systematically influence the results we present.

First, since ADO is dependent on the occurrence of polymorphisms in restriction cut sites, the impact of ADO is dependent on overall levels of diversity, and any potential bias should roughly scale with levels of sequence diversity. As a result, when polymorphism is low, the potential for bias is also expected to be low. Consistent with this, in their simulation and in silico digestion experiments, Cariou et al. (2016) show that, for species with genome-wide diversity estimates (i.e., π) below 2%, the bias introduced by ADO is “negligible” (pg 5). The average levels of genetic diversity in our species are substantially (an order of magnitude or more) lower than this threshold. In particular, the three wild species in our analysis have estimated average levels of heterozygosity between 0.03% and 0.04% for genome-wide coding sequences and between 0.0% and 0.01% for ultra-conserved orthologs, using high-depth (~30x or more) transcriptome data (Pease et al., 2016). Our estimates from RADseq data in the current study similarly point to extremely low levels of diversity, both on the continent (PIM heterozygosity = 0.025% – 0.033%) and on the Galapagos (PIM heterozygosity = 0.011% – 0.024%). These estimates are generally consistent with previous estimates using other non-RADseq datasets (Caicedo and Shall, 2004). These low levels of polymorphism indicate that the bias introduced by ADO is likely to be negligible in our species. We also note that ADO does not impact estimates of between population divergence (Cariou et al., 2016)

Second, we have also directly reexamined our data for evidence of a potential ADO bias by evaluating conditions under which we would expect to see this effect to be pronounced, if present (see Appendix 1—figure 1). Because high variability in depth is common in RAD datasets, it precludes any “depth-based” methods aimed at detecting ADO (e.g., the approximate Bayesian computation method presented in Cariou et al., 2016). (Indeed, depth across our filtered SNP loci is variable across the genome (Appendix 1—figure 1A) despite being very high on average (average read depth per sample = 114). To reduce this effect of variable depth, we had already filtered sites with fewer than 8 reads supporting either allele, and those where <80% of samples had data.) Instead, we performed a regression of observed heterozygosity (at polymorphic sites; N = 5767) on average depth per sample (Appendix 1—figure 1B) and observe no significant association (P = 0.191), a finding that is not due to lack of power given our large sample size. If ADO was leading to significant bias across the genome, we would expect to observe an association between estimates of genetic diversity (heterozygosity) and read depth, since sites experiencing drop out will also be sequenced to lower depth. This lack of association indicates little evidence for an effect of ADO, similarly indicating that this potential bias is unlikely to impact our inferences from this data.

To address this potential concern in the revised manuscript itself, we have amended our methods section (lines 598-602) to more directly state the possible caveats of using RADseq data. We also added Appendix 1—figure 1 and a brief section (SI section S6) to our supplementary file.

Compounded with the issues known for RAD-seq, a single Solanum species is used for alignment of reads, which likely results in a further bias toward apparent lower variation in the diverged island species.

We agree that choice of reference genome can potentially be a source of bias when compared species are distantly diverged and/or when the species being mapped are differentially related to the reference species. However the data indicate this is not the case for the tomato species presented here. First, all four species (GAL, CHS, PIM, and LYC) are members of the same sub-clade and are less than 0.5 My diverged. Because of this very recent, rapid divergence, species relationships between the individual taxa show substantial gene tree discordance (due to ILS; Pease et al., 2016). However there is no evidence that GAL/CHS and PIM are strongly differentially related to LYC (the reference species); if anything, the data suggests a polytomy among PIM, GAL/CHS, and LYC. Second, our filtering criteria for retaining data was very strict; we removed any sites with higher than 20% missing data, which should also substantially reduce any potential effects of mapping bias.

Third, to more directly investigate any possible effects of mapping bias, we have now also compared how data presence/absence, numbers of recovered RAD loci, and total number of mapped reads varies between species/populations. Appendix 1—figure 2 indicates that data missingness (the fraction of individuals at a locus without data, indicative of failure to map and/or low quality mapping) never exceeds 20% (as expected from our filtering scheme). In these data, we find that CHS/GAL individuals had a higher proportion of sites with 0% missing data than did PIM, although PIM overall never exceeded 10% missing data whereas CHS/GAL did for some markers (Appendix 1—figure 2). Regardless, both observations confirm that final marker set has low levels of missing data, which should minimize mapping bias. We also directly examined whether read mapping was biased towards any particular species or population (Appendix 1—figure 3). We did find variation by population (but not by species) in the total number of reads mapped with BWA (Appendix 1—figure 3, right panel), but since each population had higher than 95% of reads retained in Stacks (Appendix 1—figure 3, center panel), this variation in total reads mapped actually reflects variation among populations in total sequencing output. As a net result, there is little variation among populations in the total number of assembled RAD loci (Appendix 1—figure 3, left panel), and no consistent difference between species.

From the data in Appendix 1—figures 2 and 3, we conclude that reference-based mapping to the domesticated tomato genome bias is likely not responsible for any substantial mapping bias between our three wild species, and does not explain lower variation detected in the endemic species. Instead, the latter is more consistent with the known high selfing rates in these taxa, as well as their historical endemism.

Here, the authors calculate the summary statistics that were already shown to be biased when calculated from RAD-seq data and also go further, using the joint site frequency spectrum for inference via analysis with dadi. Tajima's D is of course a summary statistic based on aspects of the SFS and if it is shown to be biased, I would expect that the numbers of variants assigned to the bins of the JSFS (used in dadi) are problematic.

We jointly address this comment in our response to the related comment below.

Even in the best case (i.e., full sequence data), with dadi it is easily possible to find multiple very different demographic models that fit the data equally well (for example earlier migration of fewer individuals vs. more recent migration of a larger number of individuals). Also, in this case, the CIs from the dadi analysis are extremely large, which suggests that the models examined are not very close to reality (or that there are deeper issues with bias due to the nature of the genotype data). For example, in PIM, the bootstrap median estimates for the bottleneck and recovery times are 847 generations ago and 840 generations ago, with confidence intervals of 22-13,591 generations ago for the time to recovery and 0-8000 generations ago for the time to the bottleneck. For the species considered to be island endemics, some estimates were provided but no confidence intervals from that dadi analysis were reported. Even if we are to trust there is no major bias in the RAD-seq genotyping, the dadi do not exlude an ancient natural migration of the S. pimpinellifolium species and even seem to support this possibility.

Thank you for clearly stating your concerns. We agree that demographic models are significantly prone to bias (given that they are idealized representations of much more complicated true histories they are always incorrect to some degree, even with perfect knowledge of the SFS) and confidence intervals are typically large. This is especially true for very recent demographic events. Uncertainty in parameter estimates is expected to increase closer to the present (Li and Durbin, 2011). Our purpose for including them was not to place hard estimates on the timing of introduction; it was instead to show (1) that we observe a clear excess of rare alleles consistent with a bottleneck model, and (2) that the relative timing of the endemic dispersal event (GAL/CHS) is likely older than the invasive dispersal event (PIM).

As stated in our responses above, our data indicate that ADO is not leading to a systematic bias in our estimates of diversity. This should also extend to the SFS. In their simulations, Arnold et al. (2013) show that ADO affects the SFS by generating an artificial deficit of rare alleles and abundance of intermediate frequency alleles. This pattern would work against the detection of past bottlenecks, so it is unlikely that our inference of bottleneck events in PIM are spuriously generated only by ADO. Genotyping errors brought on by variance in allele sequencing depth are likely a bigger concern, as are the effects of introgression. (Introgression can inject low frequency alleles into the destination population, potentially inflating the observed abundances of rare alleles.) We have mitigated the effects of erroneous genotype calls from low depth sites by using strict filtering schemes (e.g., at least 8X coverage; see methods). Similarly, before estimating the SFS for population MG114 (the population used for invasive bottleneck inference), we removed all sites with evidence for introgression as predicted by our HMM.

As you mention, potential biases introduced by genotyping errors/ADO aside, large confidence intervals might also simply be due to non-identifiability in the limited set of models evaluated. We do not disagree with this statement and have spent a substantial amount of time investigating this during our revisions, especially as the previous –comparatively poor – model fit was puzzling, given other natural history and collections data from the endemic and invasive wild species. Upon reinspection and refitting of our models, we uncovered three programming errors that had substantial effects on our parameter estimates (both in the optimization procedure and in bootstrapping). We thank you for raising questions about these results so we were prompted to catch these errors! The errors were:

1) Only a single grid size was used in dadi for approximation of the diffusion system. The dadi manual suggests that at least two grid sizes are used for proper extrapolation and this we mistakenly did not do this previously. Furthermore, our choice of grid size (30) was likely too small. We have corrected this by now using three grid sizes (30, 40, and 60) and this has substantially reduced variance between optimization runs.

2) A for loop error that resulted in improper bootstrapping. Rather than fitting each replicate using the same starting parameter values, our code mistakenly caused each replicate run to use the previous bootstrap’s model fit as its starting values. This resulted in abnormally long runs (since it quickly approached parameter bounds) and heavily inflated levels of variation between replicates.

3) An error in the dadi optimization pipeline in Portik et al. (2017) that mis-reported the optimal parameter values. It appears that there was an indexing issue in the code provided with their manuscript that led to the mismatching of parameters with their likelihoods after fitting. The reported parameters deviated slightly from the true maximum likelihood estimates. We have addressed this by implementing our own similar optimization procedure (see methods, lines 652-657).

We have now refitted all models using the corrected code. We have revised table 2 (showing parameter estimates and CIs for MG114 – the invasive PIM population)and figure S6 showing the distributions of bootstrapped estimates. During the revisions, we also compared two optimization procedures (BFGS and Nelder-Mead) to ensure that choice of algorithm was not affecting our estimates. In both cases, optimized estimates matched closely with the bootstrapped median estimates (Figure S6) and CIs for TB and TF were smaller by 15-fold or more at the upper bound compared to the previous draft. The estimate for F remained approximately unchanged.

We similarly refitted the same two-epoch model to MG115 (CHS – the endemic species) using the corrected code and performed bootstrapping (Table 2—figure supplement 2). This reanalysis suggested that CHS likely experienced a large expansion between 1114 and 1845 generations ago, coupled with a very recent and strong contraction. The extreme recency of this inferred contraction led us to evaluate additional models as well. However, optimized estimates for single epoch (one size change) as well as a three-epoch (three sizes changes) models were less likely than the two-epoch fit, and so we continue to report the two-epoch model results.

Compared to our estimate of the timing of the bottleneck in PIM, the expansion in CHS is nearly 8 times older, consistent with our understanding of the comparative introduction history of these two species based on historical and contemporary collections and natural history. We have amended our Results section to reflect the change in maximum likelihood estimates for CHS and their associated implications (lines 234-232).

We believe that these corrections address concerns of non-identifiability, especially as it appears that the large CIs were partially a technical error that arose from our prior implementation of these models. We nonetheless acknowledge that the simple models we fit are more or less rough approximations. Unfortunately, fitting more complex models (i.e., to include an ancient speciation event (between CHS and PIM), followed by separate bottlenecks + migration) would require 5-6 additional parameters, as well as estimation of the joint SFS for these two species. We have previously constructed a joint SFS but found that most variation is actually private and not shared between the two populations, suggesting that little additional information is actually contained in the 2D spectrum. We think this is a realistic consequence of inbreeding coupled with the occurrence of strong/recent bottlenecks; thus models built around single population SFSs for each species are a preferable alternative to more complex model inferences that would rely on the joint SFS. We agree with the reviewer that we do not want to go beyond the limits of our data by using it to fit models for which we have insufficient power and resolution, such as these much more complex models.

Instead, as secondary support for our broad conclusions regarding the timing and occurrence of bottlenecks, we now provide additional results (Tajima’s D) which are model-free and do not assume any particular population history. We previously provided a Tajima’s D estimate for MG114 (invasive PIM), and we now provide the same estimate for MG115 (endemic CHS) as well as bootstrapped error intervals (Table 2—figure supplement 1). In accordance with our inference of a more recent bottleneck in PIM, the median value of Tajima’s D (estimated genome-wide) in PIM (-0.49 +/- 0.24) was lower than that of CHS (-0.18 +/- 0.16) indicating a larger excess of rare alleles in PIM consistent with expansion after a recent bottleneck. We have updated text in the Results section to reflect this (lines 220-223 and 240-242).

https://doi.org/10.7554/eLife.64165.sa2

Article and author information

Author details

  1. Matthew JS Gibson

    Department of Biology, Indiana University, Bloomington, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    gibsomat@indiana.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7855-1628
  2. María de Lourdes Torres

    1. Universidad San Francisco de Quito (USFQ). Colegio de Ciencias Biológicas y Ambientales, Laboratorio de Biotecnología Vegetal. Campus Cumbayá, Quito, Ecuador
    2. Galapagos Science Center, Universidad San Francisco de Quito and University of North Carolina at Chapel Hill, Galapagos, Ecuador
    Contribution
    Conceptualization, Supervision, Investigation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7207-4568
  3. Yaniv Brandvain

    Department of Plant Biology, University of Minnesota-Twin Cities, St. Paul, United States
    Contribution
    Software, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Leonie C Moyle

    Department of Biology, Indiana University, Bloomington, United States
    Contribution
    Conceptualization, Funding acquisition, Investigation, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4960-8001

Funding

National Science Foundation (IOS 1127059)

  • Leonie C Moyle

Indiana University Bloomington (Brackenridge)

  • Matthew JS Gibson

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank the Galápagos Science Center staff on San Cristóbal for logistic and permitting support, and the Galápagos National Park for assistance locating and sampling endemic populations. On-site field support was provided by Marcelo Loyola and Genaro Garcia. We also thank two anonymous reviewers for valuable comments that greatly improved the manuscript. This work was supported by a US National Science Foundation award (IOS 1127059) to LCM and the Indiana University Brackenridge award to MJSG. All field collections were made with appropriate permits and prior authorization by the Galápagos National Park and Ecuadorian Ministry of Environment (Permits PC-40–18 and PC-72–19).

Senior Editor

  1. Detlef Weigel, Max Planck Institute for Developmental Biology, Germany

Reviewing Editor

  1. Hernán A Burbano, University College London, United Kingdom

Reviewer

  1. Gregory Owens, University of Victoria, Canada

Publication history

  1. Received: October 20, 2020
  2. Accepted: June 23, 2021
  3. Accepted Manuscript published: June 24, 2021 (version 1)
  4. Version of Record published: July 21, 2021 (version 2)

Copyright

© 2021, Gibson et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 864
    Page views
  • 138
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Evolutionary Biology
    Chenlu Di et al.
    Research Article Updated

    Advances in genome sequencing have improved our understanding of the genetic basis of human diseases, and thousands of human genes have been associated with different diseases. Recent genomic adaptation at disease genes has not been well characterized. Here, we compare the rate of strong recent adaptation in the form of selective sweeps between mendelian, non-infectious disease genes and non-disease genes across distinct human populations from the 1000 Genomes Project. We find that mendelian disease genes have experienced far less selective sweeps compared to non-disease genes especially in Africa. Investigating further the possible causes of the sweep deficit at disease genes, we find that this deficit is very strong at disease genes with both low recombination rates and with high numbers of associated disease variants, but is almost non-existent at disease genes with higher recombination rates or lower numbers of associated disease variants. Because segregating recessive deleterious variants have the ability to interfere with adaptive ones, these observations strongly suggest that adaptation has been slowed down by the presence of interfering recessive deleterious variants at disease genes. These results suggest that disease genes suffer from a transient inability to adapt as fast as the rest of the genome.

    1. Evolutionary Biology
    2. Microbiology and Infectious Disease
    Jennifer E Jones et al.
    Research Article Updated

    The influenza A virus (IAV) genome consists of eight negative-sense viral RNA (vRNA) segments that are selectively assembled into progeny virus particles through RNA-RNA interactions. To explore putative intersegmental RNA-RNA relationships, we quantified similarity between phylogenetic trees comprising each vRNA segment from seasonal human IAV. Intersegmental tree similarity differed between subtype and lineage. While intersegmental relationships were largely conserved over time in H3N2 viruses, they diverged in H1N1 strains isolated before and after the 2009 pandemic. Surprisingly, intersegmental relationships were not driven solely by protein sequence, suggesting that IAV evolution could also be driven by RNA-RNA interactions. Finally, we used confocal microscopy to determine that colocalization of highly coevolved vRNA segments is enriched over other assembly intermediates at the nuclear periphery during productive viral infection. This study illustrates how putative RNA interactions underlying selective assembly of IAV can be interrogated with phylogenetics.