Repeatability of adaptation in sunflowers reveals that genomic regions harbouring inversions also drive adaptation in species lacking an inversion

  1. Shaghayegh Soudi
  2. Mojtaba Jahani
  3. Marco Todesco
  4. Gregory L Owens
  5. Natalia Bercovich
  6. Loren H Rieseberg
  7. Sam Yeaman  Is a corresponding author
  1. Department of Biological Sciences, University of Calgary, Canada
  2. Department of Botany, University of British Columbia, Canada
  3. Michael Smith Laboratories, University of British Columbia, Canada
  4. Irving K. Barber Faculty of Science, University of British Columbia Okanagan, Canada
  5. Department of Biology, University of Victoria, Canada

Abstract

Local adaptation commonly involves alleles of large effect, which experience fitness advantages when in positive linkage disequilibrium (LD). Because segregating inversions suppress recombination and facilitate the maintenance of LD between locally adapted loci, they are also commonly found to be associated with adaptive divergence. However, it is unclear what fraction of an adaptive response can be attributed to inversions and alleles of large effect, and whether the loci within an inversion could still drive adaptation in the absence of its recombination-suppressing effect. Here, we use genome-wide association studies to explore patterns of local adaptation in three species of sunflower: Helianthus annuus, Helianthus argophyllus, and Helianthus petiolaris, which each harbour a large number of species-specific inversions. We find evidence of significant genome-wide repeatability in signatures of association to phenotypes and environments, which are particularly enriched within regions of the genome harbouring an inversion in one species. This shows that while inversions may facilitate local adaptation, at least some of the loci can still harbour mutations that make substantial contributions without the benefit of recombination suppression in species lacking a segregating inversion. While a large number of genomic regions show evidence of repeated adaptation, most of the strongest signatures of association still tend to be species-specific, indicating substantial genotypic redundancy for local adaptation in these species.

eLife assessment

This is a valuable comparative study of adaptation across multiple species. The results provide a solid example of the application of genotype–environment associations to demonstrate that local adaptation is repeatable.

https://doi.org/10.7554/eLife.88604.3.sa0

eLife digest

In plants, like in humans, DNA is arranged into sections known as genes that are in turn organised into structures called chromosomes. Mutations that modify the activity of these genes can help plant species to adapt to a new environment or to extreme conditions such as drought. However, successful adaptation often requires changes in many different genes. If these sets of genes are located close to each other on the same chromosome, any mutations will likely be passed onto the next generation together. If the genes are located further away, or even on different chromosomes, they may instead be inherited separately so that the next generation does not benefit as much from the adaptation.

A chromosome inversion – when a segment of chromosome breaks off and reattaches the other way around – can increase the likelihood that sets of mutations on the same chromosome will be inherited together. Many previous studies have found that chromosome inversions tend to drive the ability of species to adapt to different environments by keeping together mutations that affect the same characteristics. However, it is not clear how inversions affect the repeatability of the adaptation, that is, if another group of closely related plants faced the same challenge in their environment would they evolve in the same way, or would they evolve a new response?

To address this question, Soudi, Jahani et al. used a genetics approach known as a genome wide association study to explore how three closely related species of sunflower have adapted to their respective environments. Two of the species grow in various environments across the centre and west of the USA that are often hot and dry, whereas the third species is restricted to the more humid coastal plain of Texas, USA.

The experiments found that a few key genes had changed in all three sunflower species. However, each species also had mutations in a larger set of unique genes that were not changed in the other species. Regions of chromosomes harbouring inversions in one of the species tended to have more of the key genes within them, compared to other genomic regions. This was also true for species that did not have inversions in those regions. This demonstrates that genes in regions affected by chromosome inversions can still help plants adapt to changes in the environment even in the absence of inversions.

Sunflowers are widely grown for their edible oily seeds. In the future, some of the key genes identified in this work may be useful candidates for plant breeding to improve the resilience of sunflowers to drought, high temperatures and other environmental challenges.

Introduction

The genetic basis of local adaptation is sometimes highly repeatable, with examples of large effect genes driving responses in multiple species, such as FT affecting flowering time in numerous plants (Izawa, 2007; Auge et al., 2019) or Mc1r driving colour polymorphisms in vertebrates (Manceau et al., 2010; Rosenblum et al., 2014). Local adaptation can also be repeated for polygenic traits, with significant patterns of similar association found across many loci for comparisons of conifers (Yeaman et al., 2016), maize and its wild relative teosinte (Tittes et al., 2021; Wang et al., 2021), and Brassicaceae (Bohutínská et al., 2021), to name a few. While we have a growing number of examples of repeatability in the basis of adaptation, it is also interesting to know if species use different genes to adapt to the same selection pressure. Genotypic redundancy – the potential for many genotypes to yield a given phenotype – is one critical factor affecting the repeatability of adaptation, as high redundancy would be expected to result in lower repeatability (Yeaman et al., 2018). Another critical factor affecting repeatability is shared standing variation, whether present due to introgression or incomplete lineage sorting. In either case, variants shared among lineages are much more likely to contribute to a repeated response than new mutations (MacPherson and Nuismer 2018; Ralph and Coop 2015). Consistent with this, Bohutínská et al., 2021 found that repeatability was negatively related to phylogenetic distance. While we are now accumulating more studies about the repeatability of adaptation, we still have very few examples and much remains unknown about the relative importance of these factors (Yeaman, 2022).

Inversions have been implicated in local adaptation in many species (Wellenreuther and Bernatchez, 2018), likely due to their effect to suppress recombination among inverted and non-inverted haplotypes, and thereby maintain linkage disequilibrium (LD) among beneficial combinations of locally adapted alleles (Rieseberg, 2001; Noor et al. 2001; Kirkpatrick and Barton, 2006). This has been approached by models studying the establishment of inversions that capture combinations of locally adapted alleles present as standing variation (e.g. Kirkpatrick and Barton, 2006), as well as models examining the accumulation of locally adapted mutations within inversions (e.g. Schaal et al., 2022). If there is variation in the density of loci that can potentially contribute to local adaptation, inversions would be expected to preferentially establish and be retained in regions harbouring a high density of such loci (and this expectation would hold for both the capture and accumulation models). We would also expect to see stronger signatures of repeated local adaptation in such high-density regions. Despite mounting evidence of their importance in adaptation, it is unclear how inversions may covary with repeatability of adaptation among species. A fundamental parameter of importance in these models is the relationship between migration rate and strength of selection on individual alleles, which may not make persistent contributions to local adaptation without the suppressing effects of recombination if selection is too weak (Yeaman and Whitlock, 2011; Bürger and Akerman, 2011). If most alleles have small effects relative to migration rate and can only contribute to local adaptation via the benefit of the recombination-suppressing effect of an inversion, then we would expect little repeatability at the site of an inversion – other species lacking the inversion would not tend to use that same region for adaptation because selection would be too weak for alleles to persist. On the other hand, if some loci are particularly important for local adaptation and regularly yield mutations of large effect, with these patterns being conserved among species, repeatability within regions harbouring inversions may be substantial. Thus, studying whether adaptation at the same genomic region harbouring an inversion is observed in other species lacking the inversion can give insights about the underlying architecture of adaptation, and the evolution and maintenance of inversions.

Here, we explore the repeatability of local adaptation in three species of sunflowers, Helianthus annuus, Helianthus argophyllus, and Helianthus petiolaris (Figure 1), which harbour large regions of suppressed recombination (‘haploblocks’), most of which are inversions, and are often associated with adaptive traits (Todesco et al., 2020). H. annuus, the common sunflower, is the closest wild relative of cultivated sunflower, which was domesticated from it around 4000 years ago (Blackman et al., 2011). Populations of H. annuus are distributed throughout the central and western United States and generally found on mesic soils, but can grow in a variety of disturbed or extreme habitats, such as semi-desertic or frequently flooded areas, as well as salt marshes. H. petiolaris, the prairie sunflower, prefers sandier soils, and ecotypes of this species are adapted to sand sheets and sand dunes (Ostevik et al., 2016). Here we include samples from two subspecies: H. petiolaris ssp. petiolaris, which is commonly found in the southern Great Plains, and H. petiolaris ssp. fallax, which is limited to more arid regions in Colorado, Utah, New Mexico, and Arizona (Heiser et al., 1969). H. petiolaris and H. annuus have broad and overlapping distributions throughout the central and western United States and appear to have adapted to similar changes in temperature, moisture, and photoperiod regimes. There is also evidence indicating H. annuus and H. petiolaris have likely been exchanging genes during much of their history of divergence (Strasburg and Rieseberg, 2008), although partially isolated by strong pre- and post-zygotic barriers (Sambatti et al., 2012). The third species, H. argophyllus, the silverleaf sunflower, is found exclusively in the southeast coast of Texas and includes both an early flowering ecotype on the coastal barrier islands and a late flowering ecotype inland (Moyers and Rieseberg, 2016; Todesco et al., 2020). H. argophyllus is thought to have undergone cycles of sympatry and allopatry with H. annuus and H. petiolaris over time, but currently only overlaps with H. annuus, with thus overlap likely being a recent event (Heiser, 1951).

Sampling sites and phylogenetic relationship among surveyed species.

(A) Sampling locations of wild sunflower populations studied in this study, and (B) phylogenetic relationship of the four (sub-)species. Numbered brackets represent the six pairwise comparisons performed in this study.

Using broad sampling across the ranges of these species (Figure 1) and sequencing data first published by Todesco et al., 2020, we study the genomic basis of local adaptation by conducting genome-wide scans for associations with climatic and soil environmental variables and phenotypes measured in a common garden. An inherent problem in studying the basis of local adaptation is accounting for the covariance between genome-wide population genetic structure and environment. When the selection pressure driving local adaptation tends to parallel a main axis of demographic expansion or isolation by distance, neutral alleles will have spatial patterns resembling those of causal alleles. Methods that do not correct for population structure have large numbers of false positives because allele frequencies at neutral loci tend to correlate with the environment more than expected by chance (Lotterhos and Whitlock, 2014). By contrast, methods that use structure correction tend to suffer from false negatives because the causal loci have similar patterns of allele frequency variation as the genomic background, and so their statistical signatures are decreased (DeRaad et al., 2021; Booker et al., 2023a). When a non-corrected approach is applied to multiple species, the false-positive problem can be mitigated when testing for loci that contribute to repeated adaptation as the same gene should not tend to be a false positive in multiple species more often than expected by chance (Yeaman et al., 2016). Here we use a conventional genome-wide association study (GWAS) approach with structure correction to study the basis of phenotypes, which vary both within and among populations, and use an uncorrected approach to test association to environment, which only varies among populations (as all individuals within the same population have the same value of environmental variable). For the environmental association portions of the study, we also explore the effect of structure correction in some analyses to contrast the false-negative vs. false-positive problem inherent in each approach. We apply these approaches individually to H. annuus, H. argophyllus, and the two H. petiolaris subspecies, and make pairwise and higher-order comparisons among combinations of these four lineages to study repeatability.

Our overall aim is to characterize the extent of repeatability of local adaptation for different traits and environments and explore the importance of inversions and other factors that may drive differences in repeatability. This analysis builds upon the work of Todesco et al., 2020, which focused on the evolution of haploblocks within each species, but did not systematically study patterns of repeatability in signatures of association among species. We begin by using the strength of correlations between environment and phenotypes measured in a common garden to identify which environmental variables are likely driving local adaptation within each species, and use an index to quantify the relative similarity in these patterns among pairs of species (see Appendix 1—figure 1 for a schematic overview of methods and research questions). We then conduct genome-wide association tests to identify regions most strongly associated with phenotypes and environments within each species. To assess the similarity in these association statistics among species, we use methods similar to Yeaman et al., 2016 to identify regions of the genome that exhibit greater similarity in signatures of association among pairs of lineages than expected by chance, which are likely driven by repeated adaptation using the same genes (but not necessarily the same alleles). We then explore how the number and total size of these regions of repeated association covary with the index of similarity in environmental importance. We also explore whether these regions of repeated association tend to covary with patterns of shared standing variation and test if they are enriched within previously identified haploblocks to explore the role of inversions in adaptation. We then identify candidate genes that show particularly strong signatures of repeated association across multiple lineages. Finally, based on the number of signatures of association that are shared vs. non-shared among lineages, we estimate the number of regions genome-wide that could potentially contribute to local adaptation for each variable and phenotype, which is related to the genotypic redundancy. Taken together, our results show that local adaptation in sunflowers tends to involve both strong repeatability at a small number of genes, often associated with inversions, coupled with high redundancy and non-repeated responses across a much larger number of loci.

Results

Local adaptation at the phenotypic level

The four taxa vary considerably in the breadth of environmental variation spanned by their respective ranges, with H. annuus spanning the widest niche and H. argophyllus spanning the narrowest (Figure 2A). Using the correlation between phenotype and environment as a metric of the strength of local adaptation on individual traits, we observe strong correlations for many combinations (Appendix 1—figure 2), suggesting that local adaptation is quite pronounced for some traits. To quantify the similarity among pairs of species in the strength of local adaptation at the phenotypic level (and to later compare this to results at the genomic level), we calculate an index we refer to as the similarity in phenotype–environment correlation (SIPEC), which is maximized when both species have strong correlations (in either direction) between a phenotype and environmental variable (see ‘Materials and methods’). For each environmental variable, we take the maximum value of SIPEC across all phenotypes as a relative measure of the importance of the variable for driving local adaptation in both species (such that when SIPEC is low, the variable is not a strong driver of phenotypic adaptation in at least one of the two species). We find the largest values of SIPEC for temperature variables and the smallest values for soil types, and generally find higher values for comparisons between the petiolaris subspecies (Appendix 1—figure 3).

The range of environmental and phenotypic variation in the studied species.

Variation in environment (A) and phenotype (B) for the studied species, along the two largest axes of a principal component analysis (PCA). Violin plots show two examples of variation in environment (Hargreaves reference evapotranspiration index; Eref) and phenotype (Days to Flower; DTF) within and among the taxa (C, D).

Genome-wide analysis of repeated local adaptation

To search for regions of the genome driving repeated adaptation in multiple taxa, we first identified windows of the genome within each species that showed strong signatures of association to either phenotype (GWAS; with structure correction) or environment (GEA; without structure correction), referred to as ‘top candidate’ windows. To identify windows of repeated association (WRAs) between pairs of species, we then assessed whether top candidate windows identified in one focal taxon also tended to be enriched for strong signatures of association to the same environment or phenotype in each of the other taxa (using the null-W test, which tends to be more sensitive than just finding the overlap between the top candidate windows; see ‘Materials and methods’). As recombination rate can affect the sensitivity of these kinds of window-based genome scans (Booker et al., 2020), the identification of WRAs was conducted after binning windows by recombination rate, although in many cases we did not observe substantial differences in the null distributions among these bins (e.g. Appendix 1—figure 4). This analysis revealed many windows with signatures of repeated association, with the strongest repeated signature found for number of frost-free days (NFFD; Figure 3A; Appendix 1—figure 4), but also showed that many of the windows with the strongest association in one species did not have strong signatures in other species (Figure 3B).

Signatures of association for number of frost-free days (NFFD) in the four taxa on chromosome 15 (A) and genome-wide (B).

Panel (A) shows windows of repeated association (WRAs; coloured bars) for comparisons between the focal species, H. annuus, and each of the other three taxa, with the haploblocks in H. annuus shaded in violet and the regions with significant PicMin hits as vertical orange lines. Panel (B) shows the value of the top candidate index for each of the 1000 windows with the strongest signatures of association in at least one species (approximately top 2% of genome-wide windows). Rows are ordered using hierarchical clustering to group windows with similar patterns across multiple species, illustrating the extent of overlap/non-overlap in the windows with strongest signatures in each species (i.e. position in the figure does not reflect chromosomal position).

Under the null hypothesis that all regions of the genome evolve neutrally due to drift and independently in each species, ~5% of the top candidate windows identified in one species would be expected to fall into the tail of the null distribution of another species (i.e. be classified as WRAs). While we find significantly more WRAs than the 5% expected by chance, ranging from 6.3 to 44.1% (mean = 14.5%; Appendix 1—figure 5), multiple factors can violate the assumptions of this test and increase the proportion of windows classified as WRAs. Most importantly, similarity in signatures of association can be driven by shared ancestral variation or ongoing introgression, and this must always be considered as an alternative explanation. It is clear that most regions of the genome harbour some shared standing variation between all six pairwise comparisons of the taxa, whether due to segregating ancestral variation or introgression, based on an index quantifying the proportion of shared to non-shared SNPs in each window (Appendix 1—figure 6). However, we do not find consistent differences in the amount of shared standing variation within windows when comparing WRAs with the rest of the genome (Appendix 1—figure 6). Thus, while introgression may make subtle contributions to WRAs, a lack of increased shared standing variation in the significant windows suggests that it is not a primary driver of these patterns.

A second source of inflation in the number of WRAs is due to hitchhiking: if a given locus is a true positive driving adaptation, LD with other windows in tight physical linkage will result in spurious correlations causing them to also be classified as WRAs increasing the genome-wide proportion above 5%, and this effect will be particularly exaggerated in haploblocks. To control for this, we binned neighbouring WRAs together that exhibited high LD (>95th percentile) over short spans of the genome (<1 cM) in either species, yielding regions we refer to as clusters of repeated association (CRAs). While this clustering method should provide a partial control for the effect of hitchhiking, we note that the number of CRAs cannot be taken as a reliable estimate of the number of independent targets of natural selection. Instead, we treat the number and size of these CRAs as proxies for the relative genome-wide similarity in association between each pair of species, and conduct downstream analyses to test hypotheses about the factors driving local adaptation.

CRAs varied in number and size across the six pairwise-species comparisons, with particularly large CRAs identified at the sites of known haploblocks (Todesco et al., 2020). For CRAs identified for phenotypic associations, their extent ranged from 168 clusters spanning 48,524,903 bp (covering only 1.6% of the genome) in H. argophyllus-H. pet. petiolaris to a maximum of 1667 clusters spanning 426,409,647 bp (14.2% of the genome) in H. pet. petiolaris-H. pet. fallax (Appendix 1—figure 7). CRAs identified for environmental variables tended to be more numerous and cover a larger region of the genome, varying from 1154 clusters comprising 15% of the total genome for H. argophyllus-H. pet. fallax up to 2260 distinct clusters covering 29% of the genome between H. petiolaris subspecies (Appendix 1—figure 8). The large extent of the genome covered by CRAs is mainly driven by their occurrence within many different haploblocks, which tend to be unique to each species and cover substantial portions of their genomes (Todesco et al., 2020). Genes within CRAs tend to be enriched for a large number of GO terms spanning many different categories (Appendix 1—figures 9 and 10). When we use an approach that corrects for population structure in the genotype–environment association tests (BayPass), we find substantially reduced numbers of CRAs (Appendix 1—figure 11). This reduction is likely due to reduced power to detect true positive causal loci when population structure covaries with environment, as true positives do not ‘stand out’ relative to the genomic background (Booker et al., 2023a). While single-species association tests without correction for population structure typically yield large numbers of false positives, the null-W test comparing association results across species accounts for the chance that the same region is a false positive in both species (Yeaman et al., 2016).

Repeated association in the genome reflects patterns at the phenotypic level

If natural selection is driving repeated patterns of local adaptation in the genomes of two species, we should see greater similarity for associations with environmental variables that are also strongly associated with phenotypic variation in both species. Consistent with this prediction, we found that environmental variables with a high maximum SIPEC index (i.e. high correlation between environment and some axis of phenotypic variation in both species; Appendix 1—figure 3) also tended to have a larger number of CRAs (Figure 4), with higher repeatability at both the phenotype and genome level for temperature-related variables and much lower repeatability for soil-related variables (with similar patterns for mean SIPEC index; Appendix 1—figure 12). Unfortunately, given the non-independence of environmental variables there is pseudo-replication in these data so it is not possible to conduct a formal significance test of these patterns. In all cases, the number of CRAs exhibited a weak negative relationship with the index of standing variation, with environmental variables that had the greatest number of CRAs having the lowest relative amounts of shared standing variation (Appendix 1—figure 13). This suggests that the observed similarity in patterns of association among species is not being strongly driven by incomplete lineage sorting or introgression, as we would expect a positive relationship between the index of shared standing variation and the number of CRAs. We found similar but weaker patterns in the association between SIPEC and size of CRAs (Appendix 1—figure 14). The weakening of these patterns relative to those for number of CRAs may reflect comparatively greater contribution of haploblocks driving patterns with size rather than number (i.e. if a particularly large haploblock is a CRA, then that variable will have a particularly large value relative to its actual importance for local adaptation).

Relationship between maximum similarity in phenotype–environment correlation (SIPEC) and number of clusters of repeated association (CRAs).

SIPEC was calculated for each phenotypic principal component analysis (PCA) axis, with the maximum taken across the axes that cumulatively explain 95% of the phenotypic variance for each environment. Each panel (A-F) shows a comparison between a pair of species indicated above, and includes both a linear model fit to the data within the panel (coloured lines), and a linear model fit to all data simultaneously (black lines) for comparison. Note that because environmental variables are correlated, these points are not independent and therefore represent a source of pseudoreplication, preventing formal statistical tests of this relationship.

Overlap of signatures of repeated association with low recombination haploblocks

Inversions can facilitate local adaptation by suppressing recombination between locally adapted alleles within the inverted vs. ancestral haplotypes (Kirkpatrick and Barton, 2006). If a given species exhibits a signature of association to environment at a segregating inversion, it is interesting to test whether similar signatures of association are found in other species that lack a segregating inversion at the same region of the genome. This would suggest that particularly strong selection is acting on loci in this region, as signatures of association can still evolve even without the recombination suppressing effect of the inversion. While not all of the low-recombination haploblocks identified by Todesco et al., 2020 have been validated as inversions, for simplicity we treat each haploblock as representative of a segregating inversion. The majority of these haploblocks are present as segregating variation in only one of the three species, but when they occur in H. petiolaris, they tend to be found as segregating variation within both subspecies. Thus, if we find significant enrichment of the regions of repeated association (WRAs and CRAs) within haploblocks, this suggests that both the species with a segregating inversion and the species lacking an inversion are using this region of the genome to drive local adaptation. As a first test of this relationship, we assessed two-way contingency tables for whether top candidate windows within a species were also significant WRA (WRA/non-WRA) and whether they fell within haploblocks (yes/no). This approach controls for the potential enrichment of top candidate windows within haploblocks relative to the rest of the genome and asks if the proportion of WRAs within haploblocks is even higher than non-WRA top candidate windows. For most phenotypes and environments, we found that the proportion of WRAs occurring within haploblocks was much higher than the proportion of non-WRAs occurring within haploblocks (Figure 5). Across all species comparisons and environmental/phenotypic variables, a total of 310 contrasts showed significant enrichment for WRAs in haploblocks compared to only 23 contrasts showing a significantly higher proportion of non-WRAs in haploblocks (note that because of non-independence of the phenotypes and environmental variables, there is pseudo-replication inherent in these estimates). In H. argophyllus, while non-WRA top candidates tended to fall within haploblocks less commonly than in the other taxa, WRAs tended to be very strongly enriched within haploblocks (Figure 5B). As a follow-up, we also tested whether CRAs were enriched within haploblocks (without controlling for whether top candidate windows also tended to be enriched), also finding strongly significant enrichment of CRAs within haploblocks for many traits and environments (Appendix 1—figure 7C and D). Broad patterns revealed by these analyses are similar as both show that genomic regions harbouring haploblocks tend to be enriched for signatures of association to environment in species lacking the haploblock. It is noteworthy that few significant signatures of enrichment in haploblocks were found for H. petiolaris petiolaris vs. H. petiolaris fallax (Appendix 1—figure 7C and D), perhaps because extensive introgression across non-locally adapted regions of the genome obscures true signal.

Enrichment of signatures of repeated association within genomic regions harbouring a haploblock in one of the two compared lineages.

Each panel shows the proportion of top candidate windows that fall within haploblocks for windows with significant signatures of repeated association by the null-W test (windows of repeated association [WRAs]) vs. those with non-significant signatures (non-WRAs), with a different focal species plotted in each panel (A-D). Comparisons of H. petiolaris petiolaris vs. H. petiolaris fallax are omitted as they share segregating haploblocks. Each point corresponds to the results for a single phenotype or environment, with dark shading used for cases where the deviation from random for the contingency table is significant by a permutation test (p<0.05), and lighter shading indicating a non-significant result. Note that because many environmental variables and phenotypes are correlated with each other, these points are not independent and therefore represent a source of pseudoreplication, preventing formal statistical tests of the overall relationship within each panel.

Multi-species signatures of repeated association

As a complement to the pairwise analyses described above, we also conducted an analysis using PicMin, which simultaneously considers any number of lineages to identify genes with particularly strong and repeated signatures of association (Booker et al., 2023b). After identifying significant windows clustering together those <1 Mbp apart, we found a total of 145 regions that were significant for at least one environmental variable or phenotype (at False Discovery Rate (FDR) < 0.1 applied within each variable). The largest number of significant regions for environment was found with the NFFD variable (20 with FDR < 0.1; 44 with FDR < 0.2) and for phenotypes with total leaf number, a measure of developmental timing of flowering (9 with FDR < 0.1; 19 with FDR < 0.2). Considering the 720 individual 5000 bp windows with FDR < 0.2 for at least one variable, 387 (53.8%) are within 500 bp of a genic region, representing a 2.3× enrichment relative to the rate for windows that were not significant hits using PicMin (23.3% of other windows fall within 500 bp of a genic region; Χ2 test p<10–15).

Estimating the number of potentially adaptive loci

Even for the environment with the highest repeatability (NFFD), most of the strongest signals of association are found in only one lineage (Figure 3B). Unfortunately, while the null-W test provides strong support and controls false positives when identifying repeated association, regions of the genome with strong associations in only a single species may include a large (and unknown) number of false positives. Thus, while it is biologically interesting to know how much adaptation is non-repeated among species, it is difficult to quantify with certainty (Booker et al., 2023b; Booker et al., 2023a). Under the assumption that at least some of the non-repeated signatures of association are true positives, this implies that there is considerable genotypic redundancy, with many different ways for these species to adaptively respond to variation in the same environment. To estimate the number of windows that could potentially contribute to local adaptation for each variable (Leff), we modified the method of Yeaman et al., 2018 to partially account for the effect of linkage among nearby windows (see ‘Materials and methods’). This method assumes that the windows with signatures of association in each species are a random draw from a larger number of potentially contributing windows (Leff), which can be inferred based on the ratio of shared vs. non-shared windows with strong association signatures (see ‘Materials and methods’). We find that estimates of Leff tend to be large, always well over 1000 windows regardless of the trait or environment (Appendix 1—figure 15A). There are numerous sources of error that affect the estimation of Leff, which will tend to be overestimated due to linkage (Appendix 1—figure 15A) or when many signatures of association are false positives (see Appendix 1), and will be underestimated when some repeatability is due to shared standing variation (Yeaman et al., 2018). Even after controls for linkage and assuming a high false-positive rate of 80%, estimates of Leff remain in the hundreds even for the variable with the lowest value of Leff (Eref; Appendix 1—figure 15C and D).

Discussion

Despite its critical importance in shaping the architecture of adaptation, little is known about the extent of genotypic redundancy underlying different traits (Barghi et al., 2020; Láruson et al., 2020; Yeaman, 2022). Here, we have shown evidence of significant repeatability in the basis of local adaptation (Figures 4 and 5), but also an abundance of species-specific, non-repeated signatures (Figure 3; Appendix 1—figure 14). In particular, we find that regions of the genome that harbour inversions in one species also tend to be strongly enriched for signatures of association in other species lacking the inversion (Figure 5). Taken together, this suggests that local adaptation in these species is highly flexible – different species apparently use quite different sets of loci to adapt to the same environment – yet still involves some component that has minimal redundancy, with inversions playing a particularly important role. Some of the ‘usual suspects’ show up in the set of significantly repeated loci identified by PicMin: in addition to the homologs of the FT (FLOWERING LOCUS T) gene reported by Todesco et al., 2020, we also found hits for several other genes involved in circadian regulation and flowering time, including PRR3 (Para et al., 2007), TOE1 (Aukerman and Sakai, 2003), and PHYC (Takano et al., 2005; Chen et al., 2014), which is known to regulate photoperiodic responses in Arabidopsis accessions (Balasubramanian et al., 2006). Other top hits included genes involved in plant development and auxin transport (PIN3 [Keuskamp et al., 2010], ARF4 [Pekker et al., 2005], and MN [Bhatia et al., 2016]), and plant immunity (CRK13 [Acharya et al., 2007] and PRR2 [Cheval et al., 2017]).

The increased repeatability found in regions of the genome that harbour inversions in only one species is particularly interesting. Inversions are commonly associated with local adaptation (Wellenreuther and Bernatchez, 2018), likely because they reduce the rate at which recombination breaks up combinations of co-selected alleles (Kirkpatrick and Barton, 2006), which perhaps facilitates contributions by alleles that would be individually too weakly selected to overcome swamping by migration (Bürger and Akerman, 2011; Yeaman and Whitlock, 2011; Schaal et al., 2022). Evidence here suggests these regions still tend to contribute to adaptation in the species lacking the recombination-suppressing effect of an inversion, consistent with a strong effect of selection relative to migration on at least one locus in the region (rather than adaptation exclusively via many alleles of small effect; Yeaman, 2022). As an example, chromosome 15 harbours a large (72 Mbp) haploblock in H. annuus that is strongly associated with NFFD, and also shows some signatures of association in the other taxa, with particularly strong signatures of association on either side of the haploblock in H. argophyllus (Figure 3A). Interestingly, loci of repeated association identified by PicMin within this region include two genes whose homologs are known to regulate responses to cold: COLD-RESPONSIVE PROTEIN KINASE 1,CRPK1 (Liu et al., 2017) and LATE ELONGATED HYPOCOTYL, LHY (Mizoguchi et al., 2002; Dong et al., 2011). It seems likely that strong selection relative to migration is therefore acting upon several loci in this region, and in many others harbouring inversions.

The observed repeatability associated with inversions further supports the local adaptation model as an explanation for the long-term persistence of segregating inversions (at least in sunflower), rather than mechanisms based on dominance or meiotic drive (Rieseberg, 2001). If there is variation across the genome in the density of loci with the potential to be involved in local adaptation, then the establishment and maintenance of inversions would be biased towards regions harbouring a high density of such loci under this model. If the genomic basis for local adaptation is conserved amongst species, then these same regions are more likely to have high repeatability. Thus, our observation of genomic regions harbouring inversions also being enriched for WRAs is consistent with this general model for inversion evolution. Unfortunately, our observations do not provide much insight into whether inversions evolve through the capture (e.g. Kirkpatrick and Barton, 2006) or accumulation (e.g. Schaal et al., 2022) type of model as either model would be consistent with our results. Most of the sunflower inversions are >1 My old, and therefore predate any current local adaptation patterns, but likely do not predate the genes underlying local adaptation (which appear to be shared among the species we studied). As for the alleles underlying local adaptation, they may be younger than the inversions, but as our work suggests, these regions are prone to harbouring locally adaptive alleles, so it is possible that they also harboured other ancestral locally adaptive alleles. While many studies have demonstrated the importance of inversions for adaptation (Wellenreuther and Bernatchez, 2018; Hager et al., 2022), to our knowledge only two other studies have documented the involvement of the same loci making contributions in the absence of the recombination-suppressing effect of the inversion (Lee et al., 2017; Coughlan and Willis, 2019). This also highlights how comparative studies of a species lacking an inversion may help identify which genes are driving adaptation in another species with an inversion, as segregating inversions tend to have extensive LD that prevents identification of any potential targets of natural selection within them.

While our results suggest a large number of loci can potentially contribute to adaptation, implying high redundancy (Figure 3B; Appendix 1—figure 14), there are several factors that complicate inference. Separating the effects of drift and selection to detect signatures of local adaptation is notoriously difficult because population structure often covaries with features of the environment that drive adaptation (Lotterhos and Whitlock, 2014; Hoban et al., 2016; DeRaad et al., 2021). As found by other analyses (Yeaman et al., 2016; DeRaad et al., 2021), when we use structure correction in our genotype–environment association tests, we find many fewer signatures of repeated association (Appendix 1—figure 7 vs. Appendix 1—figure 11), likely due to the reduced power of BayPass to detect true positives when the environment covaries with population structure (Booker et al., 2023a). Here, we have side-stepped the issues involved in correction for population structure by instead relying on comparisons among species to identify loci with associations more extreme than expected by chance in multiple species, using the null-W and PicMin tests. This assumes that most loci in the genome are not involved in local adaptation, so that the relatively small proportion that are driving adaptation can therefore be picked out due to their tendency to fall into the tail of the distribution of association statistics in at least one other species (which would only occur at the rate of random chance under a purely neutral model). While this approach should be relatively robust for identifying loci with repeated patterns of adaptation, there is no way to formally estimate significance of associations found in only a single species, many of which may be false positives due to covariation of population structure and environment. However, even if we assume that 80% of the observed non-repeated loci are false positives, we still find that estimates of the effective number of loci contributing to adaptation (Leff) are in the hundreds (Appendix 1—figure 15C). As we are unable to detect loci of small effect due to the limited power, but these loci still likely make important contributions to heritability, our estimates of Leff will also be biased downwards by excluding these potentially important drivers. Finally, it is also difficult to exclude the contribution of introgression or incomplete lineage sorting to the observed signatures of repeated association. If a locus tended to be highly introgressed between two species in a restricted region of their range, it is possible it could also covary with environment and therefore result in a signature of repeated association that would be mistakenly interpreted as adaptation. While it is difficult to preclude this from our analysis, regions of the genome with signatures of repeated association do not tend to have higher levels of shared standing variation than background regions (Appendix 1—figure 6), suggesting that this is not a broad explanation for our observations. If anything, shared standing variation would be expected to increase repeatability of adaptation (MacPherson and Nuismer, 2017), adding further weight to the inference that there is high genotypic redundancy in these species.

In general, we find that temperature is the strongest driver of repeated adaptation at both the phenotypic and genomic levels. We quantified local adaptation at the phenotypic level using correlations between traits measured in common gardens and the home environment of the population they were sampled from. Across all phenotypes, these correlations tended to be strongest and most similar among species for temperature variables, particularly in comparisons among H. annuus and H. petiolaris subspecies (Appendix 1—figure 3). We see similar patterns of repeatability reflected in the genome, where temperature variables also tend to have the greatest repeatability (Figure 4). The similarity in these phenotypic and genomic signatures is consistent with an effect of strong selection as other artefactual or drift-based explanations for repeatability would not be expected to reflect patterns found at the phenotypic level. It should be noted that the reduced importance of soil variables in the SIPEC index might be partly driven by the fact that all traits were measured on above-ground features due to the difficulty of getting non-disruptive phenotypic measurements for roots.

Taken together, these results suggest that some fraction of the genome contributes to adaptation with low redundancy and high repeatability (which tends to be enriched within genomic regions where there exist inversions in at least one species), while the remainder of the adaptive response is driven by loci with high redundancy and species-specific contributions. Theoretical models of adaptive evolution necessarily involve simplified representations of genetics: population genetic approaches explore cases where strong selection deterministically drives a change in allele frequency at one or a few loci (usually without epistasis), whereas quantitative genetic approaches make the infinitesimal assumption that phenotypic change can be realized through small frequency changes at many loci (Barton et al., 2017; Barghi et al., 2020; Yeaman, 2022). Such models implicitly make quite different assumptions about redundancy: if a population genetic model of directional selection includes no epistasis, then it is implicitly assuming there is no redundancy as every locus can increase fitness and no locus can affect the potential of another locus to increase fitness. On the other hand, quantitative genetic models implicitly assume complete redundancy as the loci driving trait variation have interchangeable effects. Given that differences in redundancy result in quite different evolutionary dynamics (Hollinger et al., 2019; Yeaman, 2022), it is important to re-evaluate the behaviour of theoretical models in light of what we can learn about redundancy from empirical data. Our results suggest that adaptation is a complex process that does not map cleanly onto the assumptions of either approach. At least some component of trait variation experiences sufficiently strong selection and has low enough redundancy to drive the significant signatures of repeatability above the random null that we observe here. On the other hand, most of the strongest signatures of association are specific to a single species, implying high redundancy for most of the genome, even if there are still some detectable signatures of repeatability. Such observations may be consistent with classifying loci into ‘core’ and ‘peripheral’ sets similar to the omnigenic model, but would involve classification based on redundancy rather than gene expression, as advocated by Boyle et al. (2017). Theoretical models adopting a two-class representation of the genetic basis of trait variation may provide more realism than models assuming a simpler distribution, and perhaps yield new insights about qualitatively different dynamical behaviour. Given the difficulty of rigorously quantifying the contribution of alleles of small effect and loci contributing to local adaptation in only one species, it remains an open question what proportion of adaptive response is governed by loci with dynamical behaviour that follow the infinitesimal assumption. This could perhaps be quantified explicitly by decomposing heritabilities for traits driving local adaptation, as has been done for standing variation in humans (Visscher et al., 2017), but this would require a considerable increase in sample size as methods such as GCTA (Yang et al., 2011) do not yield accurate estimates at the sample sizes used here. Future work could perhaps resolve this question using multi-generational crossing designs to break up the LD that tends to accompany local adaptation, allowing a more accurate parsing of the effect sizes of genomic regions and the contribution by repeated vs. redundant components of adaptive trait variation.

Materials and methods

Data collection, common garden, and phenotyping

Request a detailed protocol

Seed samples were collected from 151 wild sunflower populations covering most of their native distributions during the summer of 2015 (H. annuus: 61 populations for GWAS and 71 populations for GEA; H. petiolaris fallax: 23 populations; H. petiolaris petiolaris: 18 populations; and H. argophyllus: 30 populations, Figure 1 and Supplementary file 1). Seeds from 10 additional populations of H. annuus for the GEA analysis had been previously collected in the summer of 2011. Sample seeds were obtained from randomly chosen mothers and were first germinated in a greenhouse for 2 wk, later moved to an open-sided greenhouse for acclimation. Phenotypic data were collected throughout the growing season, as detailed in Supplementary file 1. Extensive records of developmental and morphological traits throughout the growth of the plants including leaves, stem, and seeds were collected and digitally imaged to extract relevant phenotypic data.

Similarity in phenotype–environment correlation (SIPEC)

Request a detailed protocol

Locally adapted traits tend to exhibit strong correlations between environment and common-garden phenotype. To estimate which environmental variables are driving local adaptation for the same phenotype in pairs of species, we calculate an index of the similarity in phenotype–environment correlation, SIPEC = r1+r2r1r2 where r1 and r2 are the Pearson correlations between the environment and phenotype in the first and second species, respectively. This SIPEC index is maximized when the correlation between a phenotype and an environmental variable is large in both species regardless of the direction, so it does not differentiate phenotypically convergent vs. divergent patterns of local adaptation (e.g. increasing temperature causing an increase in flowering time in one species and a decrease in the other), and provides a means of estimating the relative importance of an environment driving local adaptation in both species in a similar way across all measured phenotypes. To account for non-independence among traits, for each pair of species we fit a PCA to all measured phenotypes and use the principal components that collectively explain 95% of the variance, and calculate SIPEC on the correlations of each of these axes with environment. For comparisons including H. argophyllus, 95% of the variance was typically explained by 8–10 PC axes (out of 28 or 29 phenotypes), whereas for comparisons among other taxa this included 21 or 22 PC axes (out of 65 or 66 phenotypes). We then report the maximum and mean value of SIPEC for each environmental variable across these phenotypic PCA axes.

Tests of SNP association with environment (GEA) and phenotype (GWAS)

Request a detailed protocol

A total of 39 environmental variables (21 climatic variables, 3 geographic variables, 15 soil variables; Supplementary file 1) were used for the genotype–environment association analysis (GEA). We refer to the climatic, soil, and geographic variables collectively as the ‘environmental variables’ for simplicity. Soil data were collected by taking 3–5 soil samples collected at each population from across the area in which seeds were collected and submitted to Midwest Laboratories Inc (Omaha, NE) for analysis. Climate data for each population were collected over a 30-year period (1961–1990) from geographic coordinates of the locations where the samples were collected using the software package Climate NA (Wang et al., 2012). We used the package BayPass version 2.1 (Gautier, 2015), which provides a re-implementation of the Bayesian hierarchical model, and explicitly accounts for the covariance structure among the population allele frequencies that originate from the shared history of the populations through the estimation of the population covariance matrix. This renders the identification of SNPs subjected to selection less sensitive to the confounding effect of population demography (Gunther and Coop, 2013). Population structure was estimated by choosing a random and unlinked set of 10,000 SNPs and running BayPass under the core model (i.e. no covariates). Then the Bayes factors (BF) were calculated running BayPass under the STD covariate model to evaluate association of SNPs with environmental variables (i.e. adjusting for population structure). For each SNP, the Bayes factor (denoted BFis as in Gautier, 2015) was presented in deciban units (db) via the transformation 10 log10 (BF). BFis relies on the importance sampling algorithm proposed by Gunther and Coop, 2013 and uses MCMC samples obtained under the core model. To produce a narrower set of outlier loci, we followed the popular Jeffreys’ rule (Jeffreys, 1961) that identified outlier loci with BF ≥ 10. As genome scan methods that correct for population structure can remove some potential signals of local adaption when there is covariation between the demographic history of the species and the environmental variables or phenotypic traits of interest, we also calculated Spearman’s rank correlation (ρ, uncorrected GEA) between population allele frequencies for each SNP and each environment variable.

GWAS analysis was performed on 86, 30, and 69 developmental and morphological traits in H. annuus, H. argophyllus, and H. petiolaris, respectively (Supplementary file 1). In total, 29 variables were measured in all three focal species and 39 were measured only in H. annuus and in H. petiolaris subspecies (Supplementary file 1). Seed and flower traits could not be collected for H. argophyllus since most plants of this species flowered very late in our common garden and failed to form fully developed inflorescences and set seeds before temperatures became too low for their survival. Population structure was controlled for in GWAS by including the first three principal components as covariates, as well as an identity-by-state (IBS) kinship matrix calculated by EMMAX (Kang et al., 2010). We ran each trait GWAS using EMMAX (v07Mar2010), as well as the EMMAX module in EasyGWAS (Grimm et al., 2017). For every SNP/peak above the Bonferroni significance threshold, candidate genes were selected within a 100 kb interval centred in the SNP with the lowest p-value, or within the boundaries of the GWAS peak, whichever was larger. All variants used for association were initially filtered for VQSR 90% tranche, and then further filtered to only include bi-allelic SNPs genotyped in ≥90% of samples and with a minor allele frequency ≥3%.

Identification of top candidate windows

Request a detailed protocol

We calculated the bottom 0.05 quantiles for the p-values from association tests, Spearman’s rank correlation (uncorrected GEA), and GWAS (corrected and uncorrected), yielding two 5% cutoffs. For each environmental and phenotypic variable, we identified all outlier SNPs as those that fell below the respective 5% cutoff. For BayPass, we considered SNPs with BF ≥ 10 as outlier SNPs. For each 5 kb window that we defined across the whole genome, we counted the number of outlier SNPs (a) and the total number of SNPs (n). To identify top candidate windows for each variable, we compared the number of outlier SNPs per each 5 kb window to the 0.9999 quantile of the binomial expectation where the expected frequency of outlier SNPs per window is calculated as ρ=ai/ni (summation over all 5 kb windows), calculating ρ independently for each environmental and phenotype variable and excluding windows with 0 outliers from the calculation of ρ (as per Yeaman et al., 2016). We also calculated a top candidate index using the same approach to categorize outliers, obtaining a p-value for a binomial test for the number of SNPs per window given an expected proportion of outliers (ρ; this p-value is not exact due to non-independence of SNPs, so we refer to this as an index).

Identifying outlier SNPs detected by genome scans from genome-wide distribution without accounting for local recombination rate variation can promote false-positive signals in recombination cold spots (i.e. low recombination regions) and be overly conservative in recombination hot spots (i.e. high recombination regions) (Booker et al., 2020). Therefore, to account for local recombination rate variation, all genomic windows were binned by their estimated recombination rates into five equally sized bins (bin1: 0–20% quantile; bin2: 20–40% quantile; bin3: 40–60% quantile; bin4: 60–80% quantile; bin5: 80–100% quantile). For each recombination bin, we estimated expected frequency of SNPs per window (ρ) and calculated cutoff separately. Windows falling above the threshold were identified as top candidate windows.

Genome-wide survey of repeatability (null-W test)

Request a detailed protocol

To explore repeatable genomic signatures of adaptation for each of the six pairwise contrasts among the four taxa (Figure 1), we used the method developed by Yeaman et al., 2016 with some modifications. A common approach is to identify candidates for adaptation independently in each species and then examine the overlap between these lists; however, this approach is quite stringent and may miss many interesting signals. The null-W test is more sensitive as it takes the list of top candidates from one species and tests whether they tend to show more extreme signatures of association than expected by chance. The null-W test is especially favourable when LD increases divergence of SNPs in tight linkage with causal SNPs but does not raise the test values enough for a window to be classified as an outlier according to the binomial test. For each top candidate window that we identified for each focal species in a pair, we refer to the same window in the other species as ‘top candidate ortholog’. The null distribution for each focal species and variable was constructed by randomly sampling 10,000 background SNPs from non-top candidate ortholog windows. For each non-top candidate window, we then estimated the test statistic (W) for the Wilcoxon signed-rank test vs. the 10,000 background SNPs. This resulted in a null distribution representing the differences between the 10,000 background SNPs and the non-top candidate ortholog windows. These were then standardized to Z-scores using the method in Whitlock and Schluter, 2020:

(1) Z=2W n1n2n1n2(n1+n2+1)/3

where n1 and n2 are the sample sizes being compared. In order to control for heterogeneity in recombination rate and its possible effects on the null distribution, we estimated null distribution for each recombination bin separately (five in total, see above). We then compared the p-values and BFs for each focal top candidate window to the 10,000 background SNPs, calculating the W statistics and converting into a Z-score. Empirical p-values were then calculated by comparing the Z-score for each top candidate window to the null distribution. When individual windows had values of W that exceeded the bounds of the null distribution, their empirical p-value was set to the reciprocal of the number of genes in the null distribution. For each species pair, we refer to the windows identified as significant by this test as ‘windows of repeated association’ or WRAs.

Linkage disequilibrium and detection of clusters of repeated association

Request a detailed protocol

LD among adjacent genomic windows can result in statistical non-independence and similar GEA/GWAS signatures across many windows. To identify the most significant WRAs and group neighbouring windows with similar signatures of repeatability into a single CRA, we used the following approach in each pairwise contrast, and for each environmental variable and phenotype: for each variable, empirical p-values for all WRAs were converted to q-values to adjust for false discovery, then beginning with the first significant WRA along the chromosome (i.e., with q < 0.05), we compared it to the next closest significant WRA by calculating the squared Pearson correlation coefficient (r2) on the allele frequencies across all pairs of SNPs and compared this estimate to a null distribution. To construct the null distribution, we generated a distribution of LD measurements between 10,000 randomly chosen windows with the same physical distance as between two significant WRAs (excluding all WRAs from the null distribution). If the r2 between the two neighbouring significant WRAs was greater than the 95th percentile of the null distribution, we clustered these two windows together. This process was repeated successively with the next closest neighbouring significant WRA, walking out along the chromosome until one of two stopping criteria was reached: (1) the LD between the last two windows did not exceed the 95% of the tail distribution or (2) the distance between the initial window and the current window next to it was larger than 1 cM, based on the linkage map from Todesco et al., 2020. When the first round of clustering stopped due to either of these two criteria, all the clustered windows were removed from the dataset and the process started with the second smallest empirical p-value. By doing this way, each significant WRA will only appear in a single CRA.

Estimating an index of shared standing variation

Request a detailed protocol

As genotype calling was conducted separately for each species (due to computational concerns), we estimated the amount of shared standing variation based on counts of shared vs. non-shared SNPs. If two species are evolving independently, the number of shared SNPs should follow a hypergeometric distribution, so we used an approach similar to the C-scores (Yeaman et al., 2018) to calculate the difference between the observed number of shared SNPs and the expectation, scaled by the standard deviation of the hypergeometric. Because of noise and a relatively small number of SNPs per window, we applied this approach on a sliding-window basis, including the five flanking windows upstream and downstream of each focal window the calculation of its index of shared standing variation.

Correspondence between regions of repeated association and chromosomal rearrangements

Request a detailed protocol

To assess the extent of overlap between regions of the genome with repeated signatures of association and previously identified low-recombination haploblocks, we used two approaches. Firstly, for each pairwise species contrast and variable, we constructed a contingency table for the number of top candidates that were significant WRAs vs. non-significant WRAs (by the null-W test), and that did vs. did not fall within a haploblock. We calculated the Pearson’s χ2 statistic on this table and then permuted the location of haploblocks throughout the genome to construct a null distribution of χ2 statistics, and calculated the p-value as the proportion of the null that exceeded the observed χ2 statistic (to account for non-independence of nearby WRAs), which is presented in Figure 5. Secondly, we compared observations of the length and number of overlapping regions to expectations based on a randomization approach. For each pair of species, we kept the position of each CRA constant and randomized the position of haploblocks 10,000 times to build a null distribution. By keeping the position of the CRAs constant, we maintained the architecture of adaptation independent from chromosomal rearrangements. We assessed significance by testing whether the observed overlaps between CRAs and haploblocks were more extreme than the 95th percentile of the tail of null distribution, which is presented in Appendix 1—figure 7. Details about identifying chromosomal rearrangements can be found in Todesco et al., 2020.

Identifying repeated signatures of association across all taxa

Request a detailed protocol

As a complement to the pairwise analysis, we used PicMin (Booker et al. 2022a) to identify windows of the genome with strong signatures of association in multiple (sub-) species. For each environmental variable, association signatures for each window are ranked genome-wide and PicMin identifies windows with extreme ranks in multiple species. We ran the analysis once with each of the two 3-way comparisons involving one petiolaris subspecies (i.e. H. annuus, H. argophyllus, and either H. pet. petiolaris or H. pet. fallax) to control for repeatability arising from similar patterns in the two petiolaris subspecies (due to high introgression/shared standing variation).

Evaluating genotypic redundancy using C-scores

Request a detailed protocol

Genome-wide quantification of the repeatability of association statistics provides insights into the amount of genotypic redundancy underlying a trait or environmental adaptation, and can be assessed using the C-score approach (Yeaman et al., 2018). Briefly, for a given trait or environment, the set of association test scores within each species is classified into ‘associated’ or ‘non-associated’ using the binomial top candidate approach. For a given pair of species (i = 1, 2), the observed number of associated windows in each species (a1, a2) can be compared to the number of windows that are associated in both species (ab) and the total number of windows being analysed (ax). Under a null hypothesis where all windows in the genome have equal potential to be involved in adaptation (i.e. associated with the trait or environment), the expected number of windows associated in both species will be described by a hypergeometric distribution, with the expectation ab-= a1a2/ax . The difference between the observed and expected amount of overlap in association scores can be quantified as a C-score by scaling the difference by the standard deviation of the hypergeometric (i.e. a C-score of 2 means that the observed amount of overlap is 2 standard deviations above the expectation under randomness).

We assess the C-scores for each phenotype and environment trait by classifying the top 0.5% of all 5 kb windows within each species as ‘associated’ (ai; based on the binomial top candidate index), and begin by calculating the C-score obtained when ax is set to the union of ai across all four focal species/sub-species (i.e. only those windows associated in at least one species are included in ax, a number much lower than the total number of windows in the genome). This yields a negative C-score as a random draw from such a small number of windows tends to yield many more overlapping associations by chance than the observation. When the C-score = 0, it means that the observed overlap between the pair of taxa being considered is consistent with a random draw of their respective ‘associated’ windows from an overall pool of ax windows. Thus, by adding rows to the matrix with ‘non-associated’ scores for all four species/sub-species until finding the matrix that yields C-score = 0, we can estimate the effective number of loci that contributed to variation for the trait or environment being considered (Leff), under the assumption that all such ax = Leff windows had an equal chance of contributing to the associations.

To run the C-score analysis on LD-clustered windows, we ran the following algorithm for each trait/environmental variable: for each species, we identified all CRAs that also had a top candidate index in top 0.5%. In most cases, a large cluster with associated windows in one species corresponds to either no clusters or a small cluster in another species. To harmonize cluster boundaries across species, we bin any overlapping clusters together to their maximum extent and take the average top candidate index for each cluster in each species as either (a) the mean across all windows that were ‘associated’ in that species or (b) the average across all windows associated in any species, if no windows were associated in that species. This yields a matrix that can be submitted to the hypergeometric C-score analysis.

Appendix 1

Additional methods

Whole-genome re-sequence data alignment and variant calling

DNA was extracted from leaf tissues (these are the same individuals examined in the previous study, see Todesco et al., 2020 for details). Briefly, all libraries were sequenced at the Genome Quebéc Innovation Center on HiSeq2500, HiSeq4000, and HiSeqX instruments to produce paired end, 150 bp reads (Illumina, San Diego, CA). Libraries with a low number of reads were re-sequenced to increase genome coverage. After quality filtering (see below), a total of 60.7 billion read pairs were obtained. Illumina adapters and poor quality reads were hard-clipped using Trimmomatic (v0.36) (Bolger et al., 2014). Reads were then aligned to the H. annuus XRQv1 genome (HanXRQr1.0-20151230; Badouin et al., 2017) using NextGenMap (v0.5.3; Sedlazeck et al., 2013). PCR duplicates were marked and removed using picard MarkDuplicates 2.9.3. Genomic regions containing transposable elements (~3/4 of the sunflower genome) were excluded to reduce computational time and improve variant quality. Genotyping for each species was performed independently as joint genotyping on the whole ensemble of samples was computational impractical. GATK’s VariantRecalibrator (v4.0.1.2; Van der Auwera and O’Connor, 2000), which filters variants in the call set according to a machine learning model inferred from a small set of ‘true’ variants, was used to remove low-quality calls and produce a dataset of a more manageable size. In the absence of an externally validated set of known sunflower variants to use as calibration, we computed a stringently filtered set from top-N samples with highest sequencing coverage for each species (N = 67 for cultivated sunflower and N = 20 for wild sunflower species). The stringency of the algorithm in classifying true/false variants was adjusted by comparing variant sets produced for different parameter values (tranche 100.0, 99.0, 90.0, 70.0, and 50.0). For each cohort, results for tranche = 90.0 were chosen for downstream analysis, based on heuristics: the number of novel SNPs identified, and improvements to the transition/transversion ratio (towards GATK’s default target of 2.15).

Remapping sites to the HA412-HO reference genome

As described with details by Todesco et al., 2020, haploblock analysis highlighted contig ordering issues with the XRQv1 reference assembly (see below). To overcome this, all sites were transferred to a new reference, HA412-HOv2, which used Hi-C for contig and scaffold ordering (Belton et al., 2012; Marie-Nelly et al., 2014). To do this, the 200 bp of reference sequence flanking each site in XRQv1 were extracted and aligned to HA412-HOv2 using BWA (Li, 2013). These alignments were filtered for mapping quality >40 and the HA412-HOv2 position for the variant site was extracted. Since all remapped sites were not in repetitive regions and had passed VQSR filtering, remapping success rate was high (96–98%). Whenever mapping suggested two different variants on the XRQv1 genome were in the same position on the HA412-HOv2 genome, likely due indels and imprecise alignment, one site was shifted by 1 bp so that they did not overlap. Remapping was preferred to de novo read alignment and variant calling against the HA412-HOv2 assembly because of the prohibitive amount of computational time that would have required.

Gene Ontology enrichment analysis of regions with repeated association

Genes that overlapped with CRAs associated with environmental and phenotypic variables were screened for enrichment of Gene Ontology (GO) terms. GO annotations for Arabidopsis thaliana genes from the TAIR database were mapped onto their sunflower homologs and a custom database of sunflower GO annotations was constructed. The R package TOPGO (Alexa and Rahnenfuhrer, 2023) was used to analyse the set of candidate genes to determine which categories were most overrepresented. Significance for each individual GO identifier was computed with Fisher’s exact test and significant GO terms were identified at an FDR of 1%. GO functional enrichment analysis was performed in the categories biological process (BP), cellular component (CC), and molecular function (MF).

Additional results

GO-enrichment analysis

To investigate the functional associations of genes overlapping with the WRA, we performed GO analysis using TopGO package in R. GO terms with corrected p-values<0.05 were considered significantly enriched. Appendix 1—figures 9 and 10 provide a list of GO terms that are over-represented in our gene set. GO components and processes associated with membrane assembly, transport through the endomembrane system, mRNA, and growth are significantly enriched in this analysis.

Appendix 1—figure 1
Schematic overview of methods and primary research questions.
Appendix 1—figure 2
Strength of phenotype–environment correlations across all traits for four different types of environmental variables in each of the sunflower species and subspecies.

Black points show individual values, and grey points show binned density.

Appendix 1—figure 3
Maximum similarity in phenotype–environment correlation (SIPEC) for pairs of taxa, across soil-, temperature-, and precipitation-related environmental variables.

The maximum value of SIPEC for each environment is calculated across all phenotypic principal component analysis (PCA) axes that cumulatively explain 95% of the variance.

Appendix 1—figure 4
The effect of recombination rate on width of the null-W distribution for the number of frost-free days (NFFD) variable for Helianthus annuus and H. argophyllus.

Recombination bins represent the 0th–20th percentile, 20th–40th percentile, etc.

Appendix 1—figure 5
Proportion of top candidate windows that are significant hits under the null-W test (windows of repeated association), for pairs of taxa, across precipitation (A), soil (B), and temperature (C) environmental variables.
Appendix 1—figure 6
Index of shared standing variation for windows of repeated association (WRAs) vs. top candidates that were not significant under the null-W test.

The index of shared standing variation reflects the proportion of SNPs that are shared vs. non-shared among species and provides an indicator of the likely extent of introgression, which does not appear to differ substantially among the windows of the genome with significant (p<0.05; red lines) vs. non-significant (p>0.05; black lines) null-W test results. Panels A-F show the values for each of the six pairwise comparisons among species, as indicated above each panel.

Appendix 1—figure 7
Number and size of clusters of repeated association (CRAs) and their overlap with haploblocks.

Total size and total number of CRAs detected among six studied pairs on each linkage group across all phenotypes by Genome-Wide Association (GWAS) (A), and Genotype-Environment Association (GEA) (B). Hatching areas indicate the total size and number of clusters residing within chromosomal rearrangements. Heat maps present proportion of CRAs by number (C) and size (D) per each phenotype variable and environment variable overlapping with chromosomal rearrangements. Stars in indicate overlaps between CRAs and haploblocks happen significantly different from chance (p-value ≤ 0.05). Grey cells in the heat maps indicate no data is available for that comparison and variable.

Appendix 1—figure 8
Total size and number of convergent clusters in different pairs for each analysis type.

The total size of convergent clusters (A) and total number of convergent clusters (B) identified among different pairs surveyed in the present study using association genetic approaches that corrected population structure versus those that did not correct across all environmental variables (GEA) and corrected GWAS.

Appendix 1—figure 9
Bar graph of Gene Ontology (GO) enrichment analysis for phenotype (A) and precipitation-related variables (B).

Bar plot depicts the significant enriched GO terms within categories: biological process, cellular component, and molecular function. Y-axis represents the GO term, and the X-axis represents the enrichment significance, respectively.

Appendix 1—figure 10
Bar graph of Gene Ontology (GO) enrichment analysis for soil (A) and temperature-related variables (B).

Bar plot depicts the significant enriched GO terms within categories: biological process, cellular component, and molecular function. Y-axis represents the GO term, and the X-axis represents the enrichment significance.

Appendix 1—figure 11
Effect of structure correction on number and size of clusters of repeated association (CRAs) and their overlaps with inversions.

Total size and total number of CRAs detected among six studied pairs on each linkage group across all environmental variables by corrected GEA (A). Hatching areas indicate the total size and number of clusters residing within haploblocks. Heat maps present proportion of CRAs overlapping with haploblocks by number (B) and size (C) for each phenotype and environmental variable (climate and soil). Stars indicate cases where observed overlaps between CRAs and haploblocks happen significantly more than expected by chance (p-value ≤ 0.05). Grey cells in the heat maps indicate no data is available for that comparison and variable.

Appendix 1—figure 12
Relationship between mean similarity in phenotype-environment correlation (SIPEC) and number of clusters of repeated association (CRAs).

SIPEC was calculated for each phenotypic principal component analysis (PCA) axis with the mean calculated across the axes that cumulatively explain 95% of the phenotypic variance for each environment. Each panel (A-F) shows a comparison between a pair of species indicated above, and includes both a linear model fit to the data within the panel (coloured lines), and a linear model fit to all data simultaneously (black lines) for comparison. Note that because environmental variables are correlated, these points are not independent and therefore represent a source of pseudoreplication, preventing formal statistical tests of this relationship.

Appendix 1—figure 13
Relationship between index of shared standing variation and number of clusters of repeated association (CRAs).

Panels A-F show the values for each of the six pairwise comparisons among species, as indicated above each panel, with lines showing linear model fits for data within each panel.

Appendix 1—figure 14
Relationship between mean similarity in phenotype–environment correlation (SIPEC) and size of clusters of repeated association (CRAs).

SIPEC was calculated for each phenotypic principal component analysis (PCA) axis with the mean calculated across the axes that cumulatively explain 95% of the phenotypic variance for each environment. Each panel (A-F) shows a comparison between a pair of species indicated above, and includes both a linear model fit to the data within the panel (coloured lines), and a linear model fit to all data simultaneously (black lines) for comparison. Note that because environmental variables are correlated, these points are not independent and therefore represent a source of pseudoreplication, preventing formal statistical tests of this relationship.

Appendix 1—figure 15
Estimates of effective number of loci in pairwise contrasts among species.

Panel (A) shows a comparison of estimates of the effective number of loci (Leff) when calculated with vs. without linkage disequilibrium (LD)-clustering for the environmental variables from the six pairwise contrasts among lineages. Panel (B) shows the effect of structure correction using BayPass on Leff. Panels (C) and (D) show the estimation of Leff for the variable with the lowest average value (Hargreaves reference evapotranspiration; Eref) under different false-positive rates for just the windows with non-repeated signatures (C) or for all windows (D).

Data availability

All scripts used for analysis and figures are deposited in a Dryad Digital Repository. Raw sequence data are deposited under NCBI BioProject accessions PRJNA532579, PRJNA398560, and PRJNA564337, as described in Todesco et al., 2020.

The following data sets were generated
    1. Sam Y
    2. Shaghayegh S
    3. Mojtaba J
    (2023) Dryad Digital Repository
    Data from: Repeatability of adaptation in sunflowers reveals that genomic regions harbouring inversions also drive adaptation in species lacking an inversion.
    https://doi.org/10.5061/dryad.wpzgmsbtd
The following previously published data sets were used
    1. Todesco M
    (2020) NCBI BioProject
    ID PRJNA532579. Wild Helianthus GWAS and GEA.
    1. Todesco M
    (2020) NCBI BioProject
    ID PRJNA398560. Wild and Weedy Helianthus annuus whole genome resequencing.
    1. Todesco M
    (2020) NCBI BioProject
    ID PRJNA564337. Hi-C sequencing of wild sunflowers.

References

    1. Heiser C
    2. Smith D
    3. Clevenger S
    4. Martin W
    (1969)
    The North American sunflowers (Helianthus)
    Memoirs of the Torrey Botanical Club 22:1–218.
  1. Book
    1. Jeffreys H
    (1961)
    Theory of Probability
    Oxford University Press.
  2. Book
    1. Van der Auwera GA
    2. O’Connor BD
    (2000)
    Genomics in the Cloud
    O’Reilly Media, Inc.
  3. Book
    1. Whitlock M
    2. Schluter D
    (2020)
    The Analysis of Biological Data
    Macmillan.

Peer review

Reviewer #1 (Public Review):

Soudi, Jahani et al. provide a valuable comparative study of local adaptation in four species of sunflowers, and investigate the repeatability of observed genomic signals of adaptation and their link to haploblocks, known to be numerous and important in this system. The study builds on previous work in sunflowers that have investigated haploblocks in those species and on methodologies developed to look at repeated signals of local adaptations. The authors provide solid evidence of both genotype-environment associations (GEA) and genome-wide association study (GWAS), as well as phenotypic correlations with the environment, to show that part of the local adaptation signal is repeatable and significantly co-occur in regions harboring haploblocks. Results also show that part of the signal is species specific and points to high genetic redundancy. This work will be of interest to evolutionary biologists in general and population geneticists in particular, and constitutes a good example of comparative local adaptation. Importantly, this study helps in advancing our understanding of the genetic architecture implicated in the adaptation process.

Strenghts: The authors take great care in acknowledging and investigating the multiple biases inherent to the used methods (GEA and GWAS) and use conservative and well thought statistical approaches to draw their conclusions. Additionally, I appreciated the nuanced discussion and can only agree with the authors that the adaptation process is complex and does not fully fit the classic simplified genetics models of either few large effect genes or only infinitesimal quantitative traits. I find the added Summary figure of this revised version (S1) extremely helpful in better understanding the different analysis steps and how they relate to the different questions.

Weaknesses: After those revisions, I did not find any major weakness and am satisfied with the authors responses.

https://doi.org/10.7554/eLife.88604.3.sa1

Reviewer #2 (Public Review):

In this study the authors sought to understand the extent of similarity among species in intraspecific adaptation to environmental heterogeneity at the phenotypic and genetic levels. A particular focus was to evaluate if regions that were associated with adaptation within putative inversions in one species were also candidates for adaptation in another species that lacked those inversions. This study is timely for the field of evolutionary genomics, due to recent interest surrounding how inversions arise and become established in adaptation.

Major strengths-

Their study system was well suited to addressing the aims, given that the different species of sunflower all had GWAS data on the same phenotypes from common garden experiments as well as landscape genomic data, and orthologous SNPs could be identified. Organizing a dataset of this magnitude is no small feat. The authors integrate many state-of-the-art statistical methods that they have developed in previous research into a framework for correlating genomic Windows of Repeated Association (WRA, also amalgamated into Clusters of Repeated Association based on LD among windows) with Similarity In Phenotype-Environment Correlation (SIPEC). The WRA/CRA methods are very useful and the authors do an excellent job at outlining the rationale for these methods.

Weaknesses-

The authors did an excellent job responding to the first set of reviews and overall I found the manuscript more streamlined and easier to read. The main weakness in the manuscript is that correlations among environmental variables were not controlled for in their results, and is a source of potential pseudoreplication. The authors are clear about the results that are affected by pseudoreplication.

The manuscript shows how to integrate many recent methods to study the repeatability of adaptation, and the methods and data are likely to be used in similar studies.

https://doi.org/10.7554/eLife.88604.3.sa2

Author response

The following is the authors’ response to the original reviews.

Reviewer #1 (Public Review):

Soudi, Jahani et al. provide a valuable comparative study of local adaptation in four species of sunflowers and investigate the repeatability of observed genomic signals of adaptation and their link to haploblocks, known to be numerous and important in this system. The study builds on previous work in sunflowers that have investigated haploblocks in those species and on methodologies developed to look at repeated signals of local adaptations. The authors provide solid evidence of both genotype-environment associations (GEA) and genome-wide association study (GWAS), as well as phenotypic correlations with the environment, to show that part of the local adaptation signal is repeatable and significantly co-occur in regions harboring haploblocks. Results also show that part of the signal is species specific and points to high genetic redundancy. The authors rightfully point out the complexities of the adaptation process and that the truth must lie somewhere between two extreme models of evolutionary genetics, i.e. a population genetics view of large effect loci and a quantitative genetics model. The authors take great care in acknowledging and investigating the multiple biases inherent to the used methods (GEA and GWAS) and use a conservative approach to draw their conclusions. The multiplicity of analyses and their interdependence make them slightly hard to understand and the manuscript would benefit from more careful explanations of concepts and logical links throughout. This work will be of interest to evolutionary biologists and population geneticists in particular, and constitutes an additional applied example to the comparative local adaptation literature.

Some thoughts on the last paragraph of the discussion (L481-497): I think it would be fine to have some more thoughts here on the processes that could contribute to the presence/absence of inversions, maybe in an "Ideas and Speculation" subsection. To me, your results point to the fact that though inversions are often presented as important for local adaptation, they seem to be highly contingent on the context of adaptation in each species. First, repeatability results are only at the window/gene level in your results, the specific mutations are not under scrutiny. Is it possible that inversions are only necessary when sets of small effect mutations are used, opposite to a large effect mutation in other species? Additionally, in a model with epistasis, fitness effects of mutations are dependent on the genomic background and it is possible that inversions were necessary in only certain contexts, even for the same mutations, i.e. some adaptive path contingency. Finally, do you have specific demographic history knowledge in this system that maps to the observations of the presence of inversions or not? For example, have the species "using" inversions been subject to more gene flow compared to others?

Thank you for the great suggestions and helpful comments. Regarding the question of demography, each of the species actually harbours quite a large number of haploblocks (13 in H. annuus spanning 326Mb, 6 in H. argophyllus spanning 114 Mb, and 18 in H. petiolaris spanning 467 Mb; see Todesco et al. 2020 for more details) so there does not seem to be any clear association with demography. We agree about the complexities that might underly the evolution of inversions that you outline above, and have refined some of the text where we discuss their evolution in the Discussion.

Reviewer #2 (Public Review):

In this study the authors sought to understand the extent of similarity among species in intraspecific adaptation to environmental heterogeneity at the phenotypic and genetic levels. A particular focus was to evaluate if regions that were associated with adaptation within putative inversions in one species were also candidates for adaptation in another species that lacked those inversions. This study is timely for the field of evolutionary genomics, due to recent interest surrounding how inversions arise and become established in adaptation.

Major strengths

Their study system was well suited to addressing the aims, given that the different species of sunflower all had GWAS data on the same phenotypes from common garden experiments as well as landscape genomic data, and orthologous SNPs could be identified. Organizing a dataset of this magnitude is no small feat. The authors integrate many state-of-the-art statistical methods that they have developed in previous research into a framework for correlating genomic Windows of Repeated Association (WRA, also amalgamated into Clusters of Repeated Association based on LD among windows) with Similarity In Phenotype-Environment Correlation (SIPEC). The WRA/CRA methods are very useful and the authors do an excellent job at outlining the rationale for these methods.

Thank you!

Major weaknesses

The study results rely heavily on the SIPEC measure, but I found the values reported difficult to interpret biologically. For example, in Figure 4 there is a range of SIPEC from 0 to 0.03 for most species pairs, with some pairs only as high as ~0.01. This does not appear to be a high degree of similarity in phenotype-environment correlation. For example, given the equation on line 517 for a single phenotype, if one species has a phenotype-environment correlation of 1.0 and the other has a correlation of 0.02, I would postulate that these two species do not have similar evolutionary responses, but the equation would give a value of (1+0.02)10.02/1 = 0.02 which is pretty typical "higher" value in Figure 4. I also question the logic behind using absolute values of the correlations for the SIPEC, because if a trait increases with an environment in one species but decreases with the environment in another species, I would not predict that the genetic basis of adaptation would be similar (as a side note, I would not question the logic behind using absolute correlations for associations with alleles, due to the arbitrary nature of signing alleles). I might be missing something here, so I look forward to reading the author's responses on these thoughts.

The reviewer makes a very good point about the range of SIPEC, and we have changed our analysis to reflect this, now reporting the maximum value of SIPEC for each environment (across the axes of the PCA on phenotypes that cumulatively explain 95% of the variance), in Figure 4 and Supplementary Figures S2 and S13. For consistency among manuscript versions and to illustrate the effect of this change, we retain the mean SIPEC value in one figure in the supplementary materials (S12), which shows the small effect of this change on the qualitative patterns. Figure 4 now shows that the maximum SIPEC value is regularly quite strong, which should address the reviewer’s concern that this is not being driven by anomalous and small values. We appreciate this point and think this change now more closely reflects how we are trying to estimate the biological feature of interest – that some axis of phenotypic space is strongly (or not) responding to selection from the environmental variable.

With respect to the logic behind using absolute value, we still feel this is justified for traits, because if a trait evolves to be bigger or smaller, it may still use the same genes. For example, flowering time may change to be later or earlier, which would result in opposite correlations with a given environment, but might use the same gene (e.g. FT) for this. As such, we think keeping absolute value is more representative as otherwise species with strong but opposite patterns of adaptation would look like they were very different. We have added a statement on line 584 in the methods section to further clarify the reason for this choice.

An additional potential problem with the analysis is that from the way the analysis is presented, it appears that the 33 environmental variables were essentially treated as independent data points (e.g. in Figure 4, Figure 5). It's not appropriate to treat the environmental variables independently because many of them are highly correlated. For example in Figure 4, many of the high similarity/CRA values tend to be categorized as temperature variables, which are likely to be highly correlated with each other. This seems like a type of pseudo replication and is a major weakness of the framework.

This is a good point and we fully agree. It is for this reason that we didn’t present any p-values or statistical tests of the overall patterns that are shown in these figures (i.e. the linear relationship between SIPEC and number of CRAs in figure 4 and the tendency for most points to fall above the 1:1 line in figure 5). But to make sure this is even more clear, we have added statements to the captions of these figures to remind readers that points are non-independent. We still feel that in the absence of a formal test, the overall patterns are strongly consistent with this interpretation. A smaller number of non-pseudo-replicated points in Figure 4 would still likely show linear patterns. Similarly, there are almost no significant points falling below the 1:1 line in Figure 5, and it seems unlikely that pseudoreplication would generate this pattern.

Below I highlight the main claims from the study and evaluate how well the results support the conclusions.

"We find evidence of significant genome-wide repeatability in signatures of association to phenotypes and environments" (abstract)

Given the questions above about SIPEC, I did not find this conclusion well supported with the way the data are presented in the manuscript.

We have changed the reporting of the SIPEC metric so that it more clearly reflects whichever axis of phenotypic space is most strongly correlated with environment in both species (using max instead of mean). This shows similar qualitative patterns but illustrates that this happens across much higher values of SIPEC, showing that it is in fact driven by high correlations in each species (or non-similar correlations resulting in low values of SIPEC). While we agree about the pseudo-replication problem preventing formal statistical test of this hypothesis, the visual pattern is striking and seems unlikely to be an artefact, so we think this does still support this conclusion.

"We find evidence of significant genome-wide repeatability in signatures of association to phenotypes and environments, which are particularly enriched within regions of the genome harbouring an inversion in one species. " (Abstract) And "increased repeatability found in regions of the genome that harbour inversions" (Discussion)WRAs, not whether haploblocks. The wording of the abstract is claiming the latter, but I think what they tested was the former. Let me know if I'm missing something here.

These claims are supported by the data shown in Figure 4, which shows that haploblocks are enriched for WRAs. I want to clarify a point about the wording here, as my understanding of the analysis is that the authors test if haploblocks are enriched with WRAs are enriched for

We are actually not interested in whether WRAs are enriched for haploblocks; we want to know if WRAs tend to occur more commonly within haploblocks than outside of them. We have tried to clarify that this is our aim in various places in the manuscript. Our analysis for Figure 5 is the one supporting these claims, and it uses the Chi-square test statistic to assess the number of WRAs and non-WRAs that fall within vs. outside of inversions, and a permutation test to assess the significance of this observation, for each environmental variable and phenotype. We don’t think that this test has any direction to it – it’s simply testing if there is non-random association between the levels of the two factors. Thus, we think the wording we have used is consistent with the test result and our aims. Perhaps the confusion arose from the two methods that we present in the Methods (one is used for Figure 5, the other for Figure S6C & D), so we have added clarifications there.

Notwithstanding the concerns about highly correlated environments potentially inflating some of the patterns in the manuscript, to my knowledge this is the first attempt in the literature to try this kind of comparison, and the results does generally suggest that inversions are more likely capturing, rather than accumulating adaptive variation. However, I don't think the authors can claim that repeated signatures are enriched with haploblock regions, and the authors should take care to refrain from stating the relative importance of different regions of the genome to adaptation without an analysis.

Actually, we don’t have a strong feeling about whether inversions are capturing vs. accumulating adaptive variation, as these results could be consistent with either. As described above, we do not understand why we can’t claim that repeated signatures are enriched within haploblocks. We thought the reviewer is perhaps referring to the fact that the points are pseudo-replicated in the figures due to environment? We note that a very large number of points are significantly different from random in terms of the distribution of WRAs within vs. outside of haploblocks (light- vs. dark-shaded symbols), and that almost all of them fall above the 1:1 line. While there may be pseudo-replication preventing a test of the bigger multi-environment/multi-species hypothesis across all phenotypes and environments, there is almost a complete lack of significant results in the other direction. This seems like quite strong evidence about enrichment of WRAs within haploblocks, across many environments/species contrasts. We have added some text to the description of patterns in figure 5 to try to clarify this.

"While a large number of genomic regions show evidence of repeated adaptation, most of the strongest signatures of association still tend to be species-specific, indicating substantial genotypic redundancy for local adaptation in these species." (Abstract)

Figure 3B certainly makes it look like there is very little similarity among species in the genetic basis of adaptation, which leaves the question as to how important the repeated signatures really are for adaptation if there are very few of them. (Is 3B for the whole genome or only that region?). This result seems to be at odds with the large number of CRAs and the claims about the importance of haploblock regions to adaptation, which extend from my previous point.

Figure 3B is for the whole genome, we have added text to the figure caption to clarify this. We think that both interpretations are possible: that most of the regions of the genome that are driving adaptation are non-repeated, but that a small but significant proportion of regions driving adaptation are repeated above what would be expected at random. Thus, it seems that there is high redundancy, coupled with adaptation via some genes that seem particularly functionally important and non-redundant, and therefore repeated. We added clarifying text on lines 541-548.

"we have shown evidence of significant repeatability in the basis of local adaptation (Figure 4,5), but also an abundance of species-specific, non-repeated signatures (Figure 3)"

While the claim is a solid one, I am left wondering how much of these genomes show repeated vs. non-repeated signatures, how much of these genomes have haploblocks, and how much overlap there really is. Finding a way to intuitively represent these unknowns would greatly strengthen the manuscript.

We agree, and really struggled to find the best way to communicate both the repeated patterns and the large amount of non-repeated signatures. Unfortunately, we have more confidence in the validity of repeated patterns because for the non-repeated patterns, a strong signature of association to environment in only one species could just be the product of structureenvironment correlation, as we didn’t control for population structure. Thus, trying to quantify the proportion of non-repeated signatures is difficult to do with any accuracy and we preferred to avoid putting too much emphasis on the simple calculation of the proportion of top candidate windows that were also WRAs.

Overall, I think the main claims from the study, the statistical framework, and the results could be revised to better support each other.

Although the current version of the manuscript has some potential shortcomings with regards to the statistical approaches, and the impact of this paper in its present form could be stifled because the biology tended to get lost in the statistics, these shortcomings may be addressed by the authors.

With some revisions, the framework and data could have a high impact and be of high utility to the community.

Thank you for your very helpful comments and suggestions on our paper, we really appreciate it.

Recommendations for the authors: please note that you control which revisions to undertake from the public reviews and recommendations for the authors

Editor's comments:

The reviewers make a series of reasonable suggestions that I echo. I found the paper quite hard to follow, and got fairly lost in the various layers of analyses done. Partially, this represents the complexity of empirical genomic data, which rarely deliver simple stories of convergence at a few genes. However, the properties of the various statistics used to detail local adaptation and convergence are not particularly clear and the figures presented were not intuitive representations of the data. This leaves the reader with an incomplete view of how much weight to put in the various lines of evidence marshaled. I would suggest simplifying the presentation of the results considerably. I add a few additional comments below.

Great suggestion, we’ve added a schematic overview of the methods and main research questions to Figure S1 in the supplementary materials.

A figure would help showing some of the signals of SNPs with putative signals of convergent environmental correlations across species, e.g. frequencies plotted against climate variables. This would help readers get a sense of how strong these signals were. These could be accompanied by the statistics calculated for these SNPs, that would allow the reader to start to get some intuitive sense of what the numbers mean.

Great suggestion, we have added a schematic overview of the methods to Figure S1 that shows some of the values and illustrates how the methods work using visual examples from our data.

In general, the introduction and some of the discussion of the inversion results feel oddly framed:

Abstract line 36: "This shows that while inversions may facilitate local adaptation, at least some of the loci involved can still make substantial contributions without the benefit of recombination suppression."

We have changed “some of the loci involved can still make substantial contributions without the benefit of recombination suppression” here to “some of the loci involved can still harbour mutations that make substantial contributions without the benefit of recombination suppression in species lacking a segregating inversion” as it hopefully clarifies that we’re not talking about individual alleles that are present in both species.

Models of the role of local adaptation in the establishment of inversions (Kirkpatrick & Barton) assume that there are multiple locally adapted alleles already present. It is the load created by these alleles being constantly maintained in the face of migration and subsequent recombination that allow an inversion to be selected for because it keeps together locally adapted alleles. Thus these models predict that there could well be standing local adaptation at these loci in the absence of the inversion in other species, and that these locally adapted alleles while not fixed may be at high frequency. (After establishment, inversions housing locally adapted alleles, can shield more weakly, locally beneficial alleles from migration allow other alleles to build up.) Empirically it's interesting to find signals of local adaptation in other species that don't contain putative inversions. But the logic of the different predictions is not particularly clear from the introduction, and only becomes somewhat clearer in the discussion.

Thank you for pointing out this murkiness, we have re-written portions of both the Introduction and Discussion to clarify this aspect.

From the introduction:Inversions have been implicated in local adaptation in many species (Wellenreuther and Bernatchez 2018), likely due to their effect to suppress recombination among inverted and noninverted haplotypes, and thereby maintain LD among beneficial combinations of locally adapted alleles (Rieseberg 2001; Noor et al. 2001; Kirkpatrick and Barton 2006). This has been approached by models studying the establishment of inversions that capture combinations of locally adapted alleles present as standing variation (e.g., Kirkpatrick and Barton 2006), as well as models examining the accumulation of locally adapted mutations within inversions (e.g., Schaal et al. 2022). If there is variation in the density of loci that can potentially contribute to local adaptation, inversions would be expected to preferentially establish and be retained in regions harbouring a high density of such loci (and this expectation would hold for both the capture and accumulation models). We would also expect to see stronger signatures of repeated local adaptation in such high density regions. Despite mounting evidence of their importance in adaptation, it is unclear how inversions may covary with repeatability of adaptation among species. A fundamental parameter of importance in these models is the relationship between migration rate and strength of selection on individual alleles, which may not make persistent contributions to local adaptation without the suppressing effects of recombination if selection is too weak (Yeaman and Whitlock 2011; Bürger and Akerman 2011). If most alleles have small effects relative to migration rate and can only contribute to local adaptation via the benefit of the recombination-suppressing effect of an inversion, then we would expect little repeatability at the site of an inversion – other species lacking the inversion would not tend to use that same region for adaptation because selection would be too weak for alleles to persist. On the other hand, if some loci are particularly important for local adaptation and regularly yield mutations of large effect, with these patterns being conserved among species, repeatability within regions harbouring inversions may be substantial. Thus, studying whether adaptation at the same genomic region harbouring an inversion is observed in other species lacking the inversion can give insights about the underlying architecture of adaptation, and the evolution and maintenance of inversions.

From the Discussion:The observed repeatability associated with inversions further supports the local adaptation model as an explanation for the long-term persistence of segregating inversions at least in sunflowers, rather than mechanisms based on dominance or meiotic drive (Rieseberg 2001). If there is variation across the genome in the density of loci with the potential to be involved in local adaptation, then the establishment and maintenance of inversions would be biased towards regions harbouring a high density such loci under this model. If the genomic basis for local adaptation is conserved amongst species, then these same regions are more likely to have high repeatability. Thus, our observation of genomic regions harbouring inversions also being enriched for WRAs is consistent with this general model for inversion evolution. Unfortunately, our observations do not provide much insight into whether inversions evolve through the capture (e.g. Kirkpatrick and Barton 2006) or accumulation (e.g. Schaal et al. 2022) type of model, as either model would be consistent with our results. Most of the sunflower inversions are >1 My old, and therefore predate any current local adaptation patterns, but likely do not predate the genes underlying local adaptation (which appear to be shared among the species we studied). As for the alleles underlying local adaptation, they may be younger than the inversions, but as our work suggests, these regions are prone to harbouring locally adaptive alleles so it is possible that they also harboured other ancestral locally adaptive alleles.

As a minor comment, there's a fair number of places where a more nuanced view of the field is needed, e.g.:

"Models in evolutionary genetics tend to focus on extremes: population genetic approaches explore cases where strong selection deterministically drives a change in allele frequency" --This seems like a strange strawman. Population genetic models span a huge parameter range. The empirical approaches of looking for sweeps by detecting genome-wide statistical outliers is predicated on strong selection, but there are numerous papers that have looked for signals of weak selection genome-wide.

Good point, we have changed our wording here.

Reviewer #1 (Recommendations For The Authors):

Comments

My main comment on the manuscript is that the different levels and diversity of analyses are slightly hard to follow on the first, and even second, read. As there are several layers of correlations and comparisons, as well as some independent analyses, I wonder if it might be helpful to have a summary schematic figure of how all analyses fit together.

Great idea, we have added Figure S1 that summarizes the main flow of the methods and research questions.

  • L169-171: Would it be more accurate to say that SIPEC is maximized when both species have strong correlations for an environmental variable across the same phenotypes? But maybe I misunderstood the index.

Good point, we have now simplified SIPEC, reporting the max instead of the mean, which we think better reflects when similar patterns are happening in both species for some phenotype.

  • L191: Given the discussion in the introduction and elsewhere about the correction for population structure, which version is used here? Same for Figure 3.

We have added clarification there.

  • L348: One [environmental] variable?

Added

  • L353: Maybe add a percentage indication for 387 so that it is comparable to the following23.3%.

Good point, added

-> L388 and paragraph: You mention "significant repeatability" but it is hard from the results at this point to have a broad idea of the amount of signal that is repeatable. Would it be possible to add here some quantitative measure of the proportion of signal repeatable or not, even if approximated?

I wish we could, but I think the precision implied by such an approximation would involve a huge amount of uncertainty and likely inaccuracy. Because it is so hard to conclusively identify how many loci are significant but non-repeated, we really don’t have a good handle on the denominator here. We are pretty confident that the repeated loci are strongly enriched for true positives, but the non-repeated loci are also almost certainly strongly enriched for false positives. While we really want to be able to quantify this explicitly, we don’t think it’s possible given our data.

-L415-418: "If there is variation [...] involved in local adaptation", I do not follow this argument, could you rephrase?

Changed

-L447-450: As you say in the supplementary methods, your analyses exclude 3/4 of the genome.Do you think this choice has a large impact on the number of outliers observed here as the genome-wide baseline would change?

This is a very good question, but one that is quite complex and without a clear answer – we chose not to delve into it in the paper to keep the discussion streamlined. My (SY) feeling is that it is unlikely that regions harbouring transposable elements would contribute much to adaptation, but I think we really don’t know if that is true. Even excluding ¾ of the genome harbouring TEs, ¼ of the genome still constitutes a huge amount of sequence and a very large number of genes and it seems plausible that most genes and genic regions would not contribute to adaptation for a given trait, so I don’t think this would change the results too much in a qualitative way – but would almost certainly change the number of windows that are significant, etc.

  • L455-457: "As we are unable [...] potentially important drivers" Could you provide the logical link here between loci of small effect and them being important drivers. I presume you mean that the large effect loci found here only account for a small proportion of the heritability?

Yes that’s what we meant here, so we’ve added some clarification.

  • L482: "enriched within inversions" should that be 'in genomic regions where there exist inversions in at least one species'? Thanks for catching that, yes. Changed.

  • Methods/SIPEC L512: Compared to the Results section it is unclear here what is referred to as an "environment" Is it a variable or a set of environment variables?

This is done per environmental variable.

I find the presence of the PCA for environment variables in Figure 2 misleading as my first interpretation was that PCs for environment were also used.

Good point, we have clarified this on line 190-193.

Maybe one potential addition to the formula would be to add an environment variable $j$ notation such that it reads "$SIPEC_j = \sum_i (|r_{ij,1}| + ...) ...$ where ... between environment variable $j$". I had initial difficulties to understand how this SIPEC was computed relating to environmental variables and this might help.

Given the other changes we made to SIPEC, we felt it was simpler to just present it as a single calculation on a given combination of phenotype and environment for a pair of species, and then discuss taking the mean and maximum of this later.

Finally, PCA axes explaining 95% of the variance are used, I would find it interesting to see how many PCs are used in comparison to the number of traits being measured.

We have added the following sentence to the methods describing this:

"For comparisons including H. argophyllus, 95% of the variance was typically explained by 8-10 PC axes (out of 28 or 29 phenotypes), whereas for comparisons among other taxa this included 21 or 22 PC axes out of 65 or 66 phenotypes."

Typos

L52: --

Changed

L254: portions [of] their

Changed

L399: additional closing parenthesis

Changed

L458: signatures [of] repeated association

Changed

L554: performed [on]

Changed

L578: 5 kp/kb windows

Changed

L601: casual/causal SNPs

Changed

L615: widow/window

Changed

L732: Banding/Banting Postdoctoral Fellowship

Changed

L1002 & L960: [Supplementary] Figure

Changed

Supplementary: Some figure titles are in bold and others are not.

Changed

Reviewer #2 (Recommendations For The Authors):

Overall I found the writing to be very clear and easy to follow. Despite my comments, it was clear that a lot of thought went into how to conduct the tests and visualize the results. I recommend ending the Discussion on a positive note, rather than an impossible test.

Thanks for the positive suggestion, we have done this.

In Figure 5, is the temperature variable missing in the legend and in the plot?

No, for this plot we just combined the temperature/precipitation variables into one variable called “climate”.

https://doi.org/10.7554/eLife.88604.3.sa3

Article and author information

Author details

  1. Shaghayegh Soudi

    Department of Biological Sciences, University of Calgary, Calgary, Canada
    Contribution
    Resources, Formal analysis, Investigation, Visualization, Writing – original draft, Writing – review and editing
    Contributed equally with
    Mojtaba Jahani
    Competing interests
    No competing interests declared
  2. Mojtaba Jahani

    1. Department of Biological Sciences, University of Calgary, Calgary, Canada
    2. Department of Botany, University of British Columbia, Vancouver, Canada
    Contribution
    Resources, Formal analysis, Investigation, Visualization, Writing – original draft, Writing – review and editing
    Contributed equally with
    Shaghayegh Soudi
    Competing interests
    No competing interests declared
  3. Marco Todesco

    1. Department of Botany, University of British Columbia, Vancouver, Canada
    2. Michael Smith Laboratories, University of British Columbia, Vancouver, Canada
    3. Irving K. Barber Faculty of Science, University of British Columbia Okanagan, Kelowna, Canada
    Contribution
    Visualization, Formal analysis, Investigation, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  4. Gregory L Owens

    Department of Biology, University of Victoria, Victoria, Canada
    Contribution
    Visualization, Formal analysis, Investigation, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  5. Natalia Bercovich

    Department of Botany, University of British Columbia, Vancouver, Canada
    Contribution
    Resources, Visualization, Methodology
    Competing interests
    No competing interests declared
  6. Loren H Rieseberg

    Department of Botany, University of British Columbia, Vancouver, Canada
    Contribution
    Conceptualization, Supervision, Investigation, Project administration, Writing – review and editing
    Competing interests
    No competing interests declared
  7. Sam Yeaman

    Department of Biological Sciences, University of Calgary, Calgary, Canada
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    samuel.yeaman@ucalgary.ca
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1706-8699

Funding

Genome Canada (LSARP2014-223SUN)

  • Loren H Rieseberg
  • Sam Yeaman

Genome British Columbia (LSARP2014-223SUN)

  • Loren H Rieseberg
  • Sam Yeaman

International Consortium for Sunflower Genomic Resources

  • Loren H Rieseberg

Alberta Innovates (212201729)

  • Sam Yeaman

Natural Sciences and Engineering Research Council of Canada (RGPIN/03950-2017)

  • Sam Yeaman

Canadian Institutes of Health Research (Banting Postdoctoral Fellowship)

  • Gregory L Owens

Natural Sciences and Engineering Research Council of Canada (Banting Postdoctoral Fellowship)

  • Gregory L Owens

Social Sciences and Humanities Research Council of Canada (Banting Postdoctoral Fellowship)

  • Gregory L Owens

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors thank the Yeaman and Peichel labs for comments. Computational support was provided by the Digital Resources Alliance of Canada, and funding was provided by Genome Canada and Genome BC (LSARP2014-223SUN), the International Consortium for Sunflower Genomic Resources, Alberta Innovates, and NSERC Discovery. GLO was funded by a Banting Postdoctoral Fellowship.

Senior Editor

  1. Christian R Landry, Université Laval, Canada

Reviewing Editor

  1. Graham Coop, University of California, Davis, United States

Version history

  1. Preprint posted: April 14, 2023 (view preprint)
  2. Sent for peer review: May 3, 2023
  3. Preprint posted: July 3, 2023 (view preprint)
  4. Preprint posted: November 13, 2023 (view preprint)
  5. Version of Record published: December 14, 2023 (version 1)

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.88604. This DOI represents all versions, and will always resolve to the latest one.

Copyright

© 2023, Soudi, Jahani et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 431
    Page views
  • 50
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Shaghayegh Soudi
  2. Mojtaba Jahani
  3. Marco Todesco
  4. Gregory L Owens
  5. Natalia Bercovich
  6. Loren H Rieseberg
  7. Sam Yeaman
(2023)
Repeatability of adaptation in sunflowers reveals that genomic regions harbouring inversions also drive adaptation in species lacking an inversion
eLife 12:RP88604.
https://doi.org/10.7554/eLife.88604.3

Share this article

https://doi.org/10.7554/eLife.88604

Further reading

    1. Evolutionary Biology
    2. Genetics and Genomics
    Thomas A Sasani, Aaron R Quinlan, Kelley Harris
    Research Article

    Maintaining germline genome integrity is essential and enormously complex. Although many proteins are involved in DNA replication, proofreading, and repair, mutator alleles have largely eluded detection in mammals. DNA replication and repair proteins often recognize sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of de novo mutations – the frequencies of C>T, A>G, etc. – will differ between genomes that harbor either a mutator or wild-type allele. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene Mutyh that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs (Sasani et al., 2022, Ashbrook et al., 2021). In this study we developed a new method to detect alleles associated with mutation spectrum variation and applied it to mutation data from the BXDs. We discovered an additional C>A mutator locus on chromosome 6 that overlaps Ogg1, a DNA glycosylase involved in the same base-excision repair network as Mutyh (David et al., 2007). Its effect depends on the presence of a mutator allele near Mutyh, and BXDs with mutator alleles at both loci have greater numbers of C>A mutations than those with mutator alleles at either locus alone. Our new methods for analyzing mutation spectra reveal evidence of epistasis between germline mutator alleles and may be applicable to mutation data from humans and other model organisms.

    1. Chromosomes and Gene Expression
    2. Evolutionary Biology
    Katherine Rickelton, Trisha M Zintel ... Courtney C Babbitt
    Research Article Updated

    Primate evolution has led to a remarkable diversity of behavioral specializations and pronounced brain size variation among species (Barton, 2012; DeCasien and Higham, 2019; Powell et al., 2017). Gene expression provides a promising opportunity for studying the molecular basis of brain evolution, but it has been explored in very few primate species to date (e.g. Khaitovich et al., 2005; Khrameeva et al., 2020; Ma et al., 2022; Somel et al., 2009). To understand the landscape of gene expression evolution across the primate lineage, we generated and analyzed RNA-seq data from four brain regions in an unprecedented eighteen species. Here, we show a remarkable level of variation in gene expression among hominid species, including humans and chimpanzees, despite their relatively recent divergence time from other primates. We found that individual genes display a wide range of expression dynamics across evolutionary time reflective of the diverse selection pressures acting on genes within primate brain tissue. Using our samples that represent a 190-fold difference in primate brain size, we identified genes with variation in expression most correlated with brain size. Our study extensively broadens the phylogenetic context of what is known about the molecular evolution of the brain across primates and identifies novel candidate genes for the study of genetic regulation of brain evolution.