Polygenic adaptation from standing genetic variation allows rapid ecotype formation

  1. Nico Fuhrmann
  2. Celine Prakash
  3. Tobias S Kaiser  Is a corresponding author
  1. Max Planck Institute for Evolutionary Biology, Germany

Abstract

Adaptive ecotype formation can be the first step to speciation, but the genetic underpinnings of this process are poorly understood. Marine midges of the genus Clunio (Diptera) have recolonized Northern European shore areas after the last glaciation. In response to local tide conditions they have formed different ecotypes with respect to timing of adult emergence, oviposition behavior and larval habitat. Genomic analysis confirms the recent establishment of these ecotypes, reflected in massive haplotype sharing between ecotypes, irrespective of whether there is ongoing gene flow or geographic isolation. QTL mapping and genome screens reveal patterns of polygenic adaptation from standing genetic variation. Ecotype-associated loci prominently include circadian clock genes, as well as genes affecting sensory perception and nervous system development, hinting to a central role of these processes in ecotype formation. Our data show that adaptive ecotype formation can occur rapidly, with ongoing gene flow and largely based on a re-assortment of existing alleles.

Editor's evaluation

This valuable study combines phenotypic analysis, quantitative genetics and population genomics to provide solid evidence for multiple genes underlying adaptive divergence in a marine midge system linked to tidal rhythm. Genes with a plausible role in perceiving and responding to lunar information are among the loci that most highly differentiate populations with distinct behaviors.

https://doi.org/10.7554/eLife.82824.sa0

Introduction

The genetic foundations for rapid ecological adaptation and speciation remain poorly understood. Decades of research have focused on identifying single or few major genetic loci in such contexts, with some notable successes (Wu and Ting, 2004; Phadnis and Orr, 2009; Nosil and Schluter, 2011; Van’t Hof et al., 2016). However, many genome-wide association studies have now shown that quantitative trait phenotypes have usually a polygenic basis (Sella and Barton, 2019; Boyle et al., 2017). This implies that also rapid natural adaptation should be expected to occur in a polygenic context, based on subtle allele frequency changes at many loci with small effects (Barton, 2022; Barghi et al., 2020; Le Corre and Kremer, 2012). A classic example for polygenic adaptation is human height (Turchin et al., 2012; Robinson et al., 2015), and other examples from humans may include birth weight and female hip size (Field et al., 2016). However, powerful datasets to detect polygenic adaptations are still mostly limited to humans and agriculturally relevant species (Barghi et al., 2020), and examples of polygenic adaptation in natural populations are rare. To show that polygenic adaptation occurs in natural populations requires an exceptional biogeographic setting, as well as deep genomic and genetic analysis. One such example may be found in Midas cichlids, where a polygenic trait architecture was found to promote rapid and stable sympatric speciation (Kautt et al., 2020). This finding is particularly interesting, since theoretical work suggested that under scenarios with gene flow the genetic architecture of adaptation may – compared to adaptation without gene flow – rather consist of fewer, tightly linked loci with larger effects (Yeaman and Whitlock, 2011). Notably, the same study also suggested that the resulting large-effect QTL may often be composed of many tightly linked loci of smaller effects (Yeaman and Whitlock, 2011). Without migration we expect an exponential distribution of allele effects, with few large effect loci and many loci of small effects (Barghi et al., 2020; Orr, 1998).

In this study, we examine local adaptation and population divergence in another exceptional biogeographic setting, namely the postglacial evolution of new ecotypes in marine midges of the genus Clunio (Diptera: Chironomidae). In adaptation to their habitat, Clunio midges differ in oviposition behavior and reproductive timing, involving both circadian and circalunar clocks.

Circalunar clocks are biological time-keeping mechanisms that allow organisms to anticipate lunar phase (Neumann, 2014). Their molecular basis is unknown (Andreatta and Tessmar-Raible, 2020), making identification of adaptive loci for lunar timing both particularly challenging and interesting. In many marine organisms, circalunar clocks synchronize reproduction within a population. In Clunio marinus they have additional ecological relevance (Kaiser, 2014). Living in the intertidal zone of the European Atlantic coasts, C. marinus requires the habitat to be exposed by the tide for successful oviposition. The habitat is maximally exposed during the low waters of spring tide days around full moon and new moon. Adult emergence is restricted to these occasions by a circalunar clock, which tightly regulates development and maturation. Additionally, a circadian clock ensures emergence during only one of the two daily low tides. The adults reproduce immediately after emergence and die few hours later. As tidal regimes vary dramatically along the coastline, populations of this Atlantic ecotype of C. marinus show local timing adaptations (Kaiser, 2014; Neumann, 1967; Kaiser et al., 2011) for which the genetic underpinnings are partially known (Kaiser et al., 2016; Kaiser and Heckel, 2012).

When Clunio expanded into the North after the regression of the ice shield, new ecotypes evolved in the Baltic Sea (Remmert, 1955; Palmén and Lindeberg, 1959) and in the high Arctic (Neumann and Honegger, 1969; Pflüger and Neumann, 1971), which differ from the Atlantic ecotype of C. marinus in various ways (see Figure 1 for a summary of defining characteristics of the three ecotypes). In the Baltic Sea the tides are negligible and the Baltic ecotype oviposits on the open water, from where the eggs quickly sink to the submerged larval habitat at water depths of up to 20 m (Remmert, 1955; Endraß, 1976a). Reproduction of the Baltic ecotype happens every day precisely at dusk under control of a circadian clock (Heimbach, 1978), without a detectable circalunar rhythm (Endraß, 1976a). Near Bergen (Norway) the Baltic and Atlantic ecotypes were reported to co-occur in sympatry, but in temporal reproductive isolation. The Baltic ecotype reproduces at dusk, the Atlantic ecotype reproduces during the afternoon low tide (Heimbach, 1978). Therefore, the Baltic ecotype is currently considered a separate species – C. balticus. However, C. balticus and C. marinus can be successfully interbred in the laboratory (Heimbach, 1978).

Figure 1 with 1 supplement see all
Northern European ecotypes of Clunio and their circalunar rhythms.

The Atlantic, Arctic and Baltic ecotypes of Clunio differ mainly in their circalunar rhythms (B–H), circadian rhythms (Figure 1—figure supplement 1), as well as their habitat and the resulting oviposition behavior (see ‘Assessment of oviposition behavior’ in Methods section). (A) Sampling sites for this study. (B–H) Circalunar rhythms of adult emergence in corresponding laboratory strains under common garden conditions, with 16 hours of daylight and simulated tidal turbulence cycles to synchronize the lunar rhythm. In Arctic and Baltic ecotypes the lunar rhythm is absent (E,G,H) or very weak (F). Por-1SL: n=1,263; He-1SL: n=2,075; Ber-1SL: n=230; Tro-tAR: n=209; Ber-2AR: n=399; Seh-2AR: n=380; Ar-2AR: n=765.

In the high Arctic there are normal tides and the Arctic ecotype of C. marinus is found in intertidal habitats (Neumann and Honegger, 1969). During its reproductive season, the permanent light of polar day precludes synchronization of the circadian and circalunar clocks with the environment. Thus, the Arctic ecotype relies on a so-called tidal hourglass timer, which allows it to emerge and reproduce during every low tide (Pflüger and Neumann, 1971). It does not show circalunar or circadian rhythms (Pflüger and Neumann, 1971).

The geological history of Northern Europe (Patton et al., 2017) implies the Baltic Sea and the high Arctic could only be colonized by Clunio after the last glacial ice shield regressed about 10,000 years ago. This inference is also supported by subfossil Clunio head capsules in Baltic Sea sediment cores (Hofmann and Winn, 2000). The new adaptations to the absence of tides in the Baltic Sea and to polar day in the high Arctic must therefore have occurred within this time frame. One strength of our model system is that we know the major adaptations specific to each ecotype, that is we know the major axes of recent selection. This makes Clunio a particularly interesting model system for studying the genetics of rapid adaptation. At the same time it promises to yield insights into the genetic pathways underlying the ecotype characteristics, including circadian and circalunar timing.

Results

Clunio ecotypes

Starting from field work in Northern Europe (Figure 1A), we established one laboratory strain of the Arctic ecotype from Tromsø (Norway, Tro-tAR; see methods for strain nomenclature) and three laboratory strains of the Baltic ecotype, from Bergen (Norway, Ber-2AR), Sehlendorf (Germany; Seh-2AR) and Ar (Sweden; Ar-2AR). We also established a strain of the Atlantic ecotype from Bergen (Ber-1SL, sympatric with Ber-2AR) and used two existing Atlantic ecotype laboratory strains from Helgoland (Germany; He-1SL) and Port-en-Bessin (France; Por-1SL). We confirmed the identity of the ecotypes in the laboratory by the absence of a lunar rhythm in the Baltic and Arctic ecotypes (Figure 1B–H), their circadian rhythm (Figure 1—figure supplement 1B–H) and their oviposition behavior (for details see methods section). The Baltic ecotype from Bergen (Ber-2AR, Figure 1F) was found weakly lunar-rhythmic.

Evolutionary history and species status

We sequenced the full nuclear and mitochondrial genomes of 168 field-caught individuals from six geographic sites, including the two sympatric population ecotypes from Bergen (Figure 1). Based on a set of 792,032 single nucleotide polymorphisms (SNPs), we first investigated population structure and evolutionary history by performing a principal component analysis (PCA; Figure 2A–B) and testing for genetic admixture (Figure 2C). We also constructed a haplotype network of complete mitochondrial genomes (Figure 2D). There are several major observations that can be derived from these data.

Figure 2 with 7 supplements see all
Genetic structure and evolutionary history of Northern European Clunio ecotypes.

The analysis is based on 24 individuals from each population (23 for Por-1SL, 25 for He-1SL). (A, B) Principal Component Analysis (PCA) based on 792,032 SNPs separates populations by geographic location rather than ecotype. (C) ADMIXTURE analysis supports strong differentiation by geographic site (best K=6), but a notable genetic component from the Baltic Sea in the Bergen populations (see K=2 and 3). The Bergen populations are only separated at K=7 and then show a number of admixed individuals. (D) Haplotype network of full mitochondrial genomes reveals highly divergent clusters according to geographic site, but haplotype sharing between Ber-1SL and Ber-2AR. (E) Correlated allele frequencies indicate introgression from Seh-2AR into Ber-2AR.

First, there is strong geographic isolation between populations from different sites. In PCA, clusters are formed according to geography (Figure 2A–B, Figure 2—figure supplement 1). Mitochondrial haplotypes are not shared and are diagnostically different between geographic sites (Figure 2D). In ADMIXTURE, the optimal number of genetic groups is six (Figure 2C, Figure 2—figure supplement 2), corresponding to the number of geographic sites, and there is basically no mixing between the six clusters (Figure 2C; K=6). Finally, there is isolation by distance (IBD; Figure 2—figure supplement 3).

Second, the sympatric ecotypes in Bergen are genetically very similar. In PCA they are not separated in the first four principal components (Figure 2A–B) and they are the only populations that share mitochondrial haplotypes (Figure 2D). In the ADMIXTURE analysis, they are only distinguished at K=7, a value larger than the optimal K. As soon as the two populations are distinguished, some individuals show signals of admixed origin (Figure 2C; K=7), compatible with the assumption of ongoing gene flow and incomplete reproductive isolation. These observations question the species status of C. balticus, which was based on the assumption of temporal isolation between these two populations (Heimbach, 1978).

Third, the data suggest that after the ice age Clunio colonized northern Europe from a single source and expanded along two fronts into the Baltic Sea and into the high Arctic. The mitochondrial haplotype network expands from a single center, which implies a quick radiation from a colonizing haplotype (Figure 2D; note that all sampled populations occur in areas under a former ice shield cover). There is also a large degree of shared nuclear polymorphisms. Thirty-four percent of polymorphic SNPs are polymorphic in all seven populations and 93% are polymorphic in at least two populations (Figure 2—figure supplement 4). Separation of the Baltic Sea populations along PC1 and the Arctic population along PC2 (Figure 2A), suggests that Clunio expanded into the high Arctic and into the Baltic Sea independently. Congruently, nucleotide diversity significantly decreases toward both expansion fronts (Figure 2—figure supplement 5, Supplementary file 1). Postglacial establishment from a common source indicates that the Baltic and Arctic ecotypes have evolved their new local adaptations during this expansion time, that is, in less than 10,000 years.

Fourth, ADMIXTURE analysis reveals that sympatric co-existence of the Atlantic and Baltic ecotypes in Bergen likely results from introgression of Baltic ecotype individuals into an existing Atlantic ecotype population. At K=2 and K=3 the two Baltic Sea populations Seh-2AR and Ar-2AR are separated from all other populations and the two Bergen populations Ber-2AR and Ber-1SL show a marked genetic contribution coming from these Baltic Sea populations (Figure 2C). The Baltic genetic component is slightly larger for the Baltic ecotype Ber-2AR population than for the Atlantic ecotype Ber-1SL population. A statistical analysis via TreeMix detects predominantly an introgression from Seh-2AR into Ber-2AR (Figure 2E, Figure 2—figure supplements 6 and 7), suggesting that haplotypes that convey the Baltic ecotype have introgressed into a population of Atlantic ecotype, resulting in the now observed sympatric co-existence of Baltic and Atlantic ecotypes in Bergen. So far Bergen is the only site for which sympatric co-existence of the ecotypes is known. Further work will be required to see if there is a larger area of overlap, for example in the Skagerrak, and if we see similar signals of introgression or possibly secondary contact in other sympatric populations.

While our observation adds to the evidence for a role of introgression in adaptation and early speciation processes (Edelman and Mallet, 2021; Baack and Rieseberg, 2007), it also constitutes an excellent starting point for mapping loci that could be involved in the respective ecotype adaptations. Under the assumption that fast adaptation and ecotype formation are mostly based on frequency changes in standing variation, one expects that the respective haplotypes should be shared between ecotypes, especially when transferred via introgression. The following analysis is focussed on the Atlantic and Baltic ecotypes, represented by three populations each.

Nuclear haplotype sharing

First, we reconstructed the genealogical relationship between 36 individuals (six from each population) in 50 kb windows (n=1,607) along the genome, followed by topology weighting. There are 105 possible unrooted tree topologies for six populations, and 46,656 possibilities to pick one individual from each population out of the set of 36. For each window along the genome, we assessed the relative support of each of the 105 population tree topologies by all 46,656 combinations of six individuals. We found that tree topologies change rapidly along the chromosomes (Figure 3A; Figure 3—figure supplement 1; Supplementary file 2). The tree topology obtained for the entire genome (Figure 3—figure supplement 2) dominates only in few genomic windows (Figure 3A, black bars ‘Orig.’), while usually one or several other topologies account for more than 75% of the tree topologies (Figure 3A, gray bars ‘Misc.’). Hardly ever do all combinations of six individuals follow a single population tree topology (Figure 3A, stars), which implies that in most genomic windows some individuals do not group with their population. Taken together, this indicates a massive sharing of haplotypes across populations.

Figure 3 with 12 supplements see all
Genome screens for haplotype sharing and genotype-ecotype associations.

(A) Topology weighting of phylogenetic trees for 36 individuals from the Baltic and Atlantic ecotypes, as obtained from 50 kb non-overlapping genomic windows. Windows were marked by a bar if they were dominated by one kind of topology (wτ>75%). Most windows are not dominated by the consensus population topology (‘Orig.’; Figure 3—figure supplement 2), but by combinations of other topologies (‘Misc.’). Windows dominated by topologies that separate the Baltic and Atlantic ecotypes (‘Ecol.’) are mostly on chromosome 1. Windows consistent with introgression are found all over the genome (‘Intr.’). (B) Distribution of outlier variants (SNPs and Indels) between the six Baltic and Atlantic ecotype populations, after global correction for population structure (XtX statistic). Values below the significance threshold (as obtained by subsampling) are plotted in gray. (C) Association of variant frequencies with Baltic vs. Atlantic ecotype (eBPmc). Values below the threshold of 3 (corresponding to p=10–3) are given in gray, values above 10 are given in red. (D) Genetic differentiation (FST) between the sympatric ecotypes in Bergen. Values above 0.5 are given in black, values above 0.75 in red. (E) The distribution of SNPs with FST ≥ 0.75 in the Baltic vs. Atlantic ecotypes. Circled numbers mark the location of the eight most differentiated loci (see Figure 6). Centromeres of the chromosomes are marked by a red ‘C’.

In order to test whether this is only incomplete lineage sorting or if there is additional introgression beyond the detected case in the Bergen populations, we calculated the f4 ratio test for the geographically distant Por-1SL and He-1SL, as well as Ar-2AR and Seh-2AR populations. The genome-wide f4 ratio was –0.00062 (standard error 0.00008) and by comparison to coalescent simulations was found to be significantly different from 0, indicating some degree of introgression. However, the signature of introgression was driven by only 775 SNPs, which is only 0.17% of the total dataset. Thus, the largest part of haplotype sharing can be explained by incomplete lineage sorting, typical for such radiation situations.

Finally, we highlighted genomic windows that are consistent with the one clearly detected introgression event from the Baltic ecotype into both Bergen populations (Figure 3A, yellow bars ‘Intr.’; all topologies grouping Por-1SL and He-1SL vs Ber-1SL, Ber-2AR, Seh-2AR, and Ar-2AR; n=357). Regions consistent with this introgression event are scattered over the entire genome.

Crosses and QTL mapping suggest that lunar rhythmicity is affected by more than one locus (Fig. 4)

As a first attempt to get hold of ecologically relevant loci, we assessed the genetic basis to lunar rhythmicity by crossing the lunar-arrhythmic Ber-2AR strain with the lunar-rhythmic Ber-1SL strain. Interestingly, the degree of lunar rhythmicity segregates within and between crossing families (Figure 4—figure supplement 1), suggesting a heterogeneous multi-genic basis of lunar arrhythmicity. Genetic segregation in the cross implies that the weak rhythm observed in Ber-2AR is due to genetic polymorphism. The Ber-2AR strain seems to carry some lunar-rhythmic alleles, likely due to gene flow from the sympatric Atlantic ecotype (see Figure 1C and results below).

For QTL mapping, we then picked a set of F2 families, which all go back to a single parental pair (BsxBa-F2-34; n=272; Figure 4—figure supplement 1) and designed a set of 23 SNP markers for QTL mapping (Figure 4—figure supplement 2). As a phenotype, we scored the degree of lunar rhythmicity for each individual as the number of days between the individual’s emergence and day 9 of the artificial turbulence cycle, which is the peak emergence day of the lunar-rhythmic strain (Figure 4—figure supplement 3). However, QTL analysis with various algorithms did not reveal any significant QTL for lunar rhythmicity (Figure 4). This absence of evidence may partially be due to the limited number of markers and individuals, as well as a specific shortcoming in the phenotyping. Arrhythmic individuals, which by chance emerge during the peak, are recorded as rhythmic. This leads to an inevitable phenotyping error for a fraction of individuals. However, in another study, we could show that this phenotyping error does not impede the detection of an oligogenic genetic architecture in QTL mapping for a comparable arrhythmic phenotype (Briševac et al., 2022). As a control that in principle we had sufficient power for QTL mapping in this crossing family, we also mapped the sex determining locus, which was clearly detectable (Figure 4). We can conclude that lunar rhythmicity is not controlled by a single large-effect locus, but rather has an oligogenic or possibly even polygenic basis.

Figure 4 with 3 supplements see all
Quantitative Trait Locus (QTL) mapping for lunar rhythmicity (blue) and the sex determining locus (red).

The sex-determining locus is clearly detectable (red). No significant QTLs can be detected for lunar rhythmicity, suggesting it is controlled by more than one locus. Vertical black lines are genetic markers.

Genomic regions associated with ecotype formation

Given that QTL mapping for lunar rhythmicity suggested a multi-genic basis to ecotype formation, we next applied three approaches to identify genomic regions associated with divergence between Atlantic and Baltic ecotypes.

First, genomic windows which are dominated by tree topologies that group populations according to ecotype were highlighted (Figure 3A, red bars ‘Ecol.’).

Second, we screened all genetic variants (SNPs and indels; n=948,128) for those that are overly differentiated between the six populations after correcting for the neutral covariance structure across population allele frequencies (see Ω matrix, Figure 3—figure supplement 3A–B). Such variants can indicate local adaptation. At the same time, we tested for association of these variants with ecotype, as implemented in BayPass Gautier, 2015. Overly differentiated variants (XtX statistic; Figure 3B) and ecotype-associated variants ecotype (eBPmc; Figure 3C) were detected all over the genome, but many were concentrated in the middle of the telocentric chromosome 1. Tests for association of variants with environmental variables such as sea surface temperature or salinity find fewer associated SNPs and no concentration on chromosome 1 (Figure 3—figure supplement 3D–E), confirming that the detected signals are not due to general genome properties, but are specific to the ecotypes.

Third, we expected that gene flow between the sympatric Ber-1SL and Ber-2AR populations would largely homogenize their genomes except for regions involved in ecological adaptation, which would be highlighted as peaks of genetic differentiation. While genetic differentiation may not only be due to adaptation, but also due to demographic processes, the ecologically relevant loci should be among the differentiated loci, especially if these loci coincide with peaks in ecotype-association (eBPmc). The distributions of FST values in all pairwise population comparisons confirmed that genetic differentiation was particularly low in the Ber-1SL vs. Ber-2AR comparison (Figure 3—figure supplements 4 and 5). Pairwise differentiation between Ber-1SL and Ber-2AR (Figure 3D) shows marked peaks on chromosome 1, most of which coincide with peaks in XtX and eBPmc. Notably, when assessing genetic differentiation of Baltic vs Atlantic ecotype (72 vs 72 individuals; Figure 3E; Figure 3—figure supplement 6), there is not a single diagnostic variant (FST = 1), and even variants with FST ≥ 0.75 are very rare (n=63; Figure 3E).

Genetic divergence (dxy), nucleotide diversity (π) and local linkage disequilibrium (r2) of the two Bergen populations do not show marked differences along or between chromosomes (Figure 3—figure supplement 7). The cluster of ecotype-associated variants on chromosome 1 overlaps with three large blocks of long-range linkage disequilibrium (LD; Figure 3—figure supplement 8). This is also reflected in reduced linkage map length in our crossing mapping family (Figure 4—figure supplement 2). Reduced recombination may in part explain the clustering of divergent and ecotype-associated alleles in the middle of chromosome 1. However, the boundaries of the LD blocks do not correspond to the ecotype-associated region and differ between populations. LD blocks are not ecotype-specific. Local PCA of the strongly ecotype associated region does not reveal patterns consistent with a chromosomal inversion or another segregating structural variant (Figure 3—figure supplement 9). Thus, there is no obvious link between the clustering of ecotype-associated loci and structural variation. Notably, genetic differentiation is not generally elevated in the ecotype-associated cluster on chromosome 1, as would be expected for a segregating structural variant, but drops to baseline levels in between ecotype-associated loci (Figure 3D).

Taken together, these observations imply that numerous genomic loci – inside and outside the cluster on chromosome 1 – are associated with ecological adaptation and none of these are differentially fixed between ecotypes. In line with QTL mapping, this virtually excludes the possibility that only one or few loci have driven the new ecotype formation, suggesting instead that ecotype formation is based on a complex polygenic architecture.

Support for adaptation from standing genetic variation

If, as assumed above, adaptation in these populations was indeed based mostly on standing genetic variation, one should find evidence of this in the data structure. We selected highly ecotype-associated SNPs (XtX >1.152094, threshold obtained from randomized subsampling; eBPmc >3; n=3,976; Figure 3—figure supplement 10A) and assessed to which degree these alleles are shared between the studied populations and other populations across Europe. Allele sharing between the Bergen populations is likely due to ongoing gene flow, and hence Bergen populations were excluded from the analysis. In turn, allele sharing between the geographically isolated Seh-2AR, Ar-2AR, Por-1SL, and He-1SL populations likely represents shared ancient polymorphism. Based on this comparison, we found that 82% of the ecotype-associated SNPs are polymorphic in both Atlantic and Baltic ecotypes, suggesting that the largest part of ecotype-associated alleles originates from standing genetic variation. We then retrieved the same genomic positions from published population resequencing data for Atlantic ecotype populations from Vigo (Spain) and St. Jean-de-Luz (Jean, southern France) Kaiser et al., 2016, an area that is potentially the source of postglacial colonization of all locations in this study. We found that 90% of the alleles associated with the Northern European ecotypes are also segregating in at least one of these southern populations, supporting the notion that adaptation in the North involves a re-assortment of existing standing genetic variation.

Ecotypes differ mainly in the circadian clock and nervous system development

In the light of a polygenic architecture for ecotype formation, we can expect that usually several genes of relevant physiological pathways are affected. This provides an ideal setting to identify the physiological processes underlying ecotype formation by enrichment analysis. Accordingly, we assessed how all ecotype-associated variants (SNPs and indels; XtX >1.148764; eBPmc >3, n=4,741; Figure 3—figure supplement 10B) relate to C. marinus’ genes. In a first step, we filtered the existing gene models in the CLUMA1.0 reference genome to those that are supported by transcript or protein evidence, have known homologues or PFAM domains, or were manually curated (filtered annotations provided in Supplementary file 3; 15, 193 gene models). Based on this confidence gene set, we then assessed the location of variants relative to genes, as well as the resulting mutational effects (SNPeff; Cingolani et al., 2012; Figure 3—figure supplement 11; statistics in Supplementary file 4). The vast majority of ecotype-specific variants are classified as intergenic modifier variants, suggesting that ecotype formation might primarily rely on regulatory mutations.

The ecotype-specific SNPs are found in and around 1,400 genes (Supplementary files 5 and 6). We transferred GO terms from other species to the Clunio confidence annotations based on gene orthology (5,393 genes; see Methods and Supplementary file 7). GO term enrichment analysis suggests that ecological adaptation prominently involves the circadian clock, supported by three of the top four GO terms (Figure 5A). In order to identify which genes drive GO term enrichment in the top 40 GO terms, we extracted the genes that harbour ecotype-associated SNPs (168 genes; Figure 5B; Supplementary file 8). We individually confirmed their gene annotations and associated GO terms. Clustering the resulting table by genes and GO terms reveals two dominant signatures (Figure 5B). Many GO terms are associated with circadian timing and are driven by a small number of genes, which include almost all core circadian clock genes (Figure 5B and C). As a second strong signal, almost half of the genes are annotated with biological processes involved in nervous system development (Figure 5B and C). GO term enrichment is also found for ecdysteroid metabolism, imaginal disc development, and gonad development (Figure 5). These processes of pre-pupal development are expected to be under circalunar clock control. The fact that circalunar clocks are responsive to moonlight and water turbulence (Neumann, 2014) renders the finding of GO term enrichment for ‘auditory behavior’ and ‘phototaxis’ interesting. Furthermore, many of the genes involved in nervous system development and sodium ion transport, also have GO terms that implicate them in light- and mechanoreceptor development, wiring or sensitivity (Supplementary file 7). With the exception of ‘response to hypoxia’ and possibly ‘sodium ion transmembrane transport’, there are very few GO terms that can be linked to the submerged larval habitat of the Baltic ecotype, which is usually low in salinity and can turn hypoxic in summer. There is a striking absence of GO terms involved in metabolic processes or immune system processes, despite these GO terms being annotated to 2634 genes or 185 genes, respectively.

GO term analysis of ecotype associated SNPs (A) The top 40 enriched GO terms are listed for the 1,400 genes that are found to be affected by ecotype-associated genetic variants (eBPmc >3).

For each GO term, the significance level (black line, top y-axis) and the observed-expected ratio of genes annotated to the respective GO term (blue bars, bottom y-axis) are given. (B) The top 40 GO terms are driven by 168 genes. Hierarchical clustering of genes and GO terms reveals major signals in the circadian clock and nervous system development (more details in Supplementary file 8). (C) Most GO terms are consistent with the known ecotype differences and selected genes are highlighted for all of them. Notably, basically all core circadian clock genes are affected.

Taken together, the detected GO terms are highly consistent with the known ecotype differences and suggest that ecotypes are mainly defined by changes in the circadian clock and nervous system development. A previously unknown aspect of Clunio ecotype formation is highlighted by the GO terms ‘male courtship behavior’, ‘inter-male aggression’ and ‘copulation’ (Figure 5). These processes are subject to sexual selection and considered to evolve rapidly. They could in the long term entail assortative mating between ecotypes.

Strongly differentiated loci correspond to GO-term enriched biological processes

While GO term analysis gives a broad picture of which processes have many genes affected by ecotype-associated SNPs, this does not necessarily imply that these genes and processes also show the strongest association with ecotype. Additionally, major genes might be missed because they were not assigned GO terms. As a second line of evidence, we therefore selected variants with the highest ecotype-association by increasing the eBPmc cut-off to 10. This reduced the set of affected genes from 1,400 to 69 (Supplementary files 9 and 10). Additionally, we only considered genes with variants that are strongly differentiated between the ecotypes (FST ≥ 0.75, compare Figure 3E), leaving 13 genes in 8 distinct genomic regions (Figure 6A; numbered in Figure 3E). Two of these regions contain two genes each with no homology outside C. marinus (indicated by ‘NA’, Figure 6A), confirming that GO term analysis missed major loci because of a lack of annotation. Three other regions contain the – likely non-visual – photoreceptor ciliary Opsin 1 (cOps1) (Velarde et al., 2005), the transcription factor longitudinals lacking (lola; in fruit fly involved in axon guidance Crowner et al., 2002 and photoreceptor determination Zheng and Carthew, 2008) and the nuclear receptor tailless (tll; in fruit fly involved in development of brain and eye Suzuki and Saigo, 2000), underscoring that ecotype characteristics might involve differential light sensitivity. Interestingly, tll also affects development of the neuroendocrine centres involved in ecdysteroid production and adult emergence de Velasco et al., 2007. Even more, re-annotation of this genomic locus revealed that the neighbouring gene, which is also affected by ecotype-specific variants, is the short neuropeptide F receptor (sNPF-R) gene. Among other functions, sNPF-R is involved in coupling adult emergence to the circadian clock Selcho et al., 2017. Similarly, only 100 kb from cOps1 there is the differentiated locus of matrix metalloprotease 1 (Mmp1), which is known to regulate circadian clock outputs via processing of the neuropeptide pigment dispersing factor (PDF) Depetris-Chauvin et al., 2014. In both cases, the close genetic linkage could possibly form pre-adapted haplotypes and entail a concerted alteration of sensory and circadian functions in the formation of ecotypes. In the remaining two loci, sox100B is known to affect male gonad development Nanda et al., 2009 and the ecdysone-induced protein 93 F is involved in response to hypoxia in flies Lee et al., 2008, but was recently found to also affect reproductive cycles in mosquitoes Wang et al., 2021. In summary, only two out of the top 13 ecotype-associated genes were included in the top 40 GO terms (Figure 6A). Nevertheless, all major biological processes detected in GO term analysis (Figure 5) are also reflected in the strongly ecotype-associated loci (Figure 6), giving a robust signal that circadian timing, sensory perception and nervous system development are underlying ecotype formation in C. marinus.

The 13 most differentiated ecotype-associated genes (A) Loci with highly ecotype-associated variants were selected based on eBPmc >10 and FST(Baltic-Atlantic)>0.75.

There are 13 genes in eight distinct genomic loci. (B–G) An overview is given for the six loci with identified genes. In each panel, from top to bottom the sub-panels show the gene models, FST values of genetic variants in the region, local linkage disequilibrium (LD) and genetic diversity (π). FST values are colored by ecotype association of the variant (red: eBPmc >10; black: 10>eBPmc > 3; gray: eBPmc <3). LD and genetic diversity are shown for the six populations independently, colored-coded as in Figures 1 and 2 (10 kb overlapping sliding-windows with 1 kb steps). The are no strong signatures of selection.

Figure 6—source data 1

Table of loci with highly ecotype-associated variants.

https://cdn.elifesciences.org/articles/82824/elife-82824-fig6-data1-v3.xlsx

Finally, we assessed the top 13 strongly ecotype-associated loci for signatures of selective sweeps in genetic diversity and LD (Figure 6B–G). Despite these loci being the most differentiated between ecotypes in the entire genome, there is at best a mild reduction in genetic diversity and a mild increase in LD (Figure 6B–G). If selection acted on these loci, it must have been very soft, underscoring a history of polygenic adaptation from standing genetic variation and continued recombination.

Discussion

The deep genomic analysis of Clunio populations has allowed us to reconstruct a detailed biogeographic history of their different ecotypes. Most surprisingly, the two sympatric ecotypes that were considered different species turned out to be the most closely related populations in the sample, testifying the high power of ecology in generating differential adaptation in the face of gene flow. The data allowed further to infer a polygenic architecture for ecotype formation. We found that while ecotype-associated alleles differ in allele frequency, they are largely shared between ecotypes, which suggests that adaptation primarily involves standing genetic variation from many different loci. A similar re-use of existing regulatory variation has been found in ecotype formation in sticklebacks Chan et al., 2010; Kingman, 2020; Verta and Jones, 2019 or mimicry in Heliconius butterflies Edelman et al., 2019. However, while in Heliconius alleles are shared over large evolutionary distances via introgression, Clunio ecotypes diverged recently from a common source, as is illustrated by massive and genome-wide shared polymorphism. Combined with the observation that many genes from the same biological processes have ecotype-associated alleles, this draws a picture of polygenic adaptation, involving many pre-existing alleles with probably small phenotypic effects. Particularly for adaptation in circadian timing this scenario is highly plausible. The ancestral Atlantic ecotype comprises many genetically determined circadian timing types that are adapted to the local tides (Kaiser, 2014; Neumann, 1967; Kaiser et al., 2016; Kaiser et al., 2021). Existing genetic variants conveying emergence at dusk were likely selected or re-assorted to form the Baltic ecotype’s highly concentrated emergence at dusk.

Besides circadian timing, the ecotypes differ in circalunar timing and oviposition behavior. In our study, the vast majority of GO terms and candidate genes is consistent with these functions, leaving little risk for evolutionary ‘story-telling’ based on individual genes or GO terms (Pavlidis et al., 2012). We propose that good congruence between known phenotypic differences and detected biological processes could be a hallmark of polygenic adaptation, as only polygenic adaptation is expected to leave a footprint in many genes of the same ecologically relevant biological process.

Based on our genomic comparison of lunar-rhythmic and lunar-arrhythmic ecotypes, we propose three not mutually exclusive hypotheses on molecular pathways involved in the unknown circalunar clock. Firstly, Clunio’s circalunar clock is known to tightly regulate ecdysteroid-dependent development and maturation just prior to pupation (Neumann and Spindler, 1991). Congruently, our screen identified ecotype-associated genes in the development of imaginal discs and genital discs, and in ecdysteroid metabolism. Lunar arrhythmicity may rely on an escape of these processes from circalunar clock control. Secondly, it has been hypothesized that circalunar clocks involve a circadian clock (Bünning and Müller, 1961) and such a mechanism has been experimentally confirmed in the midge Pontomyia oceana (Soong and Chang, 2012). Thus, the overwhelming circadian signal in our data might be responsible for both circadian timing adaptations and the loss of circalunar rhythms. Thirdly, Clunio’s circalunar clock is synchronized with the external lunar cycle via moonlight, as well as tidal cycles of water turbulence and temperature (Neumann, 2014). Our data suggests that sensory receptor development, wiring or sensitivity might differ between ecotypes. Interestingly, some Atlantic ecotype populations are insensitive to specific lunar time cues, either moonlight or mechanical stimulation by the tides (Kaiser, 2014). These pre-existing insensitivities may have been combined to form completely insensitive and hence lunar-arrhythmic ecotypes. This scenario would fit the general pattern of polygenic adaptation through a re-assortment of standing genetic variation, which emerges from our study.

In several species, genes involved in complex behavioral or ecological syndromes were found to be locked into supergenes by chromosomal inversions, for example in Heliconius butterfly mimicry (Joron et al., 2011) or reproductive morphs of the ruff (Küpper et al., 2016). While we observe a clustering of ecotype-associated alleles in Clunio, there is no obvious connection to an underlying structural variant (SV). Possibly, the SV is so complex that it did not leave an interpretable genomic signal. Alternatively, Clunio’s long history of genome rearrangements (Kaiser et al., 2016) may have resulted in a clustering of ecologically relevant loci without locking them into a single SV. Clustering could be stabilized by low recombination, consistent with the observed three LD blocks, which – while not ecotype-specific – all overlap with the differentiated region. Epistatic interactions between the clustered loci and co-adaptation of alleles might further reduce the fitness of recombinants and lead to a concerted response to selection. Such an interconnected adaptive cluster might allow for more flexible evolutionary responses than a single, completely linked supergene. Further studies will have to show whether such a genome architecture exists, whether it facilitates adaptation and whether it might itself be selected for. Notably, a similar architecture with clustered, tightly linked loci was predicted to result from adaptation with gene flow (Yeaman and Whitlock, 2011), a scenario which we found in the sympatric Bergen populations.

Methods

Nomenclature of ecotypes

We expanded the existing naming convention of C. marinus timing types Kaiser et al., 2021 to also include Baltic and Arctic ecotypes. Names of populations and corresponding laboratory strains consist of an abbreviation for geographic origin followed by a code for the daily and lunar timing phenotypes. Daily phenotypes in this study are emergence during the first 12 hr after sunrise (‘1’) or, emergence during the second 12 hr after sunrise (‘2’) or emergence during every low tide (‘t’ for tidal rhythm). Lunar phenotypes in this study are either emergence during full moon and new moon low tides (‘SL’ for semi-lunar) or arrhythmic emergence (‘AR’). As a consequence, the Arctic ecotype is ‘tAR’, the Baltic ecotype is ‘2AR’ and the Atlantic ecotype populations in this study are all of timing type ‘1SL’ (while other timing types exist within the Atlantic ecotype Kaiser et al., 2021).

Fieldwork and sample collection

Field samples for genetic analysis and establishment of laboratory strains were collected in Sehlendorf (Seh, Germany), Ar (Sweden), Tromsø (Tro, Norway), and Bergen (Ber, Norway) during eight field trips in 2017 and 2018 (Supplementary file 11). Field caught adult males for DNA extraction were directly collected in 99.98% ethanol and stored at –20 °C. Females are immobile and basically invisible in the field, unless found in copulation. Laboratory strains were established by catching copulating pairs in the field and transferring multiple fertilized egg clutches to the laboratory (Supplementary file 11). Samples and laboratory strains of the sympatric ecotypes in Bergen were collected at the same location but at different daytime. Additional samples and laboratory strains from Helgoland (He, Germany) and Port-en-Bessin (Por, France) were collected and described earlier Kaiser et al., 2016; Kaiser et al., 2021; Kaiser et al., 2010, but had previously not been subject to whole genome sequencing of individuals.

Laboratory culture and phenotyping of ecotypes

Laboratory strains were reared under standard conditions Neumann, 1966 at 20 °C with 16 hr of light and 8 hr of darkness. Atlantic and Arctic ecotype strains were kept in natural seawater diluted 1:1 with deionized water and fed with diatoms (Phaeodactylum tricornutum) and powdered nettles (Urtica sp.). The Baltic ecotype was kept in natural Sea water diluted 1:2 and fed with diatoms and powdered red algae (90%, Delesseria spp., 10% Ceramium spp., obtained from F. Weinberger and N. Stärck, GEOMAR, Kiel). For entrainment of the lunar rhythm all strains were provided with 12.4 hr tidal cycles of water turbulence (mechanically induced vibrations produced by an unbalanced motor, 50 Hz, roughly 30 dB above background noise, 6hr 10 min on, 6 hr 15 min off) (Neumann and Heimbach, 1979; Neumann, 1978).

Assignment of strains to ecotypes was confirmed based on their phenotypes as recorded in laboratory culture. Oviposition behavior was assessed during standard culture maintenance: Baltic ecotype eggs are generally found submerged at the bottom of the culture vessel, Atlantic and Arctic ecotype eggs are always found floating on the water surface or on the walls of the culture vessel (for details see next paragraph). Daily emergence times were recorded in 1 hr intervals by direct observation (Seh-2AR, Ar-2AR) or with the help of a fraction collector (Honegger, 1977) (Ber-1SL, Ber-2AR, Tro-tAR, Por-1SL, He-1SL; Figure 1—figure supplement 1). Lunar emergence times were recorded by counting the number of emerged midges in the laboratory cultures every day over several months and summing them up over several tidal turbulence cycles. Emergence data for He-1SL was taken from Neumann, 1983, emergence data for Por-1SL was taken from Briševac et al., 2022.

Assessment of oviposition behavior

Because of the lack of tides in the Baltic Sea, the Baltic ecotype cannot rely on the tides to expose the larval substrates for egg deposition. Therefore, in contrast to the Atlantic and Arctic ecotypes, the Baltic ecotype does not oviposit on exposed larval substrates, but on the water surface, from where the eggs sink to the bottom of the sea. This change in oviposition preference is accompanied by a specific behavioral change, namely the bending of the female’s abdomen down through the water surface Endraß, 1976b, which ensures that the egg masses are not caught in the water’s surface tension. In our laboratory culture, the animals are kept in plastic boxes with a constant water level, so that larval substrates are never exposed. Under these conditions, most of the Baltic ecotype’s egg masses will be found submerged at the bottom of the culture box, as expected. Very rarely egg masses are deposited on the walls or the lid of the box. The Atlantic and Arctic ecotypes will – in the absence of exposed larval substrates – also oviposit on the water surface. However, as they lack the characteristic banding of the female abdomen during oviposition, their egg masses will be trapped in the water’s surface tension and cannot sink to the bottom of the culture box. Their egg masses will always float on the water surface, usually with the female still sticking to the egg jelly. These floating egg masses often form aggregations on the water surface. We used these differences in where the egg clutches are found in our laboratory cultures to additionally confirm ecotype identity of our laboratory strains.

Crosses and quantitative trait locus (QTL) mapping

The Ber-1SL and Ber-2AR laboratory strains were subject to single-pair crossing after their emergence rhythms were synchronized by keeping them in separate, time-shifted LD regimes. Several F1 families were raised independently and the F1 siblings were allowed to mate freely in order to obtain bulk F2 families. Emergence distributions were recorded by collecting all emerged adults every day.

Bulk crossing family BsxBa-F2-34 (n=272) was selected for QTL mapping. DNA of the two parents and the F2 progeny was extracted with the QuickExtract DNA Extraction Solution (Lucigen, QE0905T) according to manufacturer’s protocol with modifications. Instead of vortexing before incubation, a pestle was used to grind the sample. In order to identify scorable markers for the genetically very similar strains, the two parents of the cross were subject to whole-genome sequencing. Raw reads were processed through our genotyping pipeline (described below), resulting in 656,368 detected variants, 53,097 of which were diagnostic for the grandparents. Based on the list of diagnostic markers, we picked eight evenly spaced regions (physical distance) on each chromosome for amplicon sequencing (Figure 4—figure supplement 2, Supplementary file 12). Multiplex PCR was performed with the QIAGEN Multiplex PCR Kit (206143) for each chromosome primer set separately (15 min at 95 °C; 40 cycles of 94 °C for 30 s, 57 °C for 90 s and 72 °C for 90 s; 72 °C for 10 min). Sequencing libraries were prepared with standard Illumina protocols and sequenced with 150 bp paired-end on the Illumina HiSeq3000 sequencer by the Max Planck Genome Centre (Cologne, Germany). Sequencing reads from independent runs were first merged to one read file for forward and reverse reads. Those files were again subject to our genotyping pipeline (described below) with the exception that the mapped reads in SAM format undergone a coverage capping of 40 using a custom python script and the VCF was filtered for a minimum minor allele frequency of 0.25 ('--maf 0.25') and a maximum proportion of missing data per locus of 30% ('--max-missing 0.7'). All markers that were not diagnostic for the parents were removed. In case genotypes differed within an amplicon, a consensus amplicon genotype was called by majority.

For QTL mapping, the degree of rhythmicity of an individual was recorded as the number of days between its emergence day and day 9, which is the peak of emergence in the rhythmic strain (Ber-1SL). The phenotyping has a specific, inevitable shortcoming: Arrhythmic individuals can also emerge during the peak days by chance, so that this small fraction of individuals is phenotyped as rhythmic while actually being arrhythmic. As the phenotype is not normally distributed (Kolmogorov-Smirnov test and the Shapiro-Wilk test of normality) it was treated as non-parametric. QTL scans were performed in the R package ‘qtl’ (Broman et al., 2003). The genome scans with different QTL models followed the calculation of conditional genotype probabilities in 5 cM distances and simulated genotypes from 64 imputations with the same distance. Both phenotypes (lunar rhythmicity, sex) were scanned for a single QTL model in the ‘scanone()’ function (method = ‘em’). Significance thresholds were set at top 5% of the logarithm of the odds (LOD) scores for 1000 permutations. The ‘stepwise.qtl()’ function for identification of multiple and interacting QTLs returned a null QTL model for lunar rhythmicity.

DNA extraction and whole genome sequencing

For each of the seven populations, 24 field caught males (23 for Por-1SL, 25 for He-1SL) were subject to whole genome sequencing. DNA was extracted from entire individuals with a salting out method (Reineke et al., 1998) and amplified using the REPLI-g Mini Kit (QIAGEN) according to the manufacturer’s protocol with volume modifications (Supplementary file 13). All samples were subject to whole genome shotgun sequencing at 15–20 x target coverage on an Illumina HiSeq3000 sequencer with 150 bp paired-end reads. Library preparation and sequencing were performed by the Max Planck Genome Centre (Cologne, Germany) according to standard protocols. Raw sequence reads are deposited at ENA under Accession PRJEB43766.

Sequence data processing, genotyping, and SNP filtering

Raw sequence reads were trimmed for adapters and base quality using Trimmomatic v.0.38 Bolger et al., 2014 with parameters 'ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:true', 'LEADING:20', 'TRAILING:20', 'MINLEN:75'. Overlapping paired end reads were merged with PEAR v.0.9.10 Zhang et al., 2014, setting the minimum assembled sequence length to 75 bp and a capping quality score of 20. Assembled and unassembled reads were mapped with BWA-MEM Li, 2013 to the nuclear reference genome Kaiser et al., 2016 (ENA accession GCA_900005825.1) and the mitochondrial reference genome (ENA accession CVRI01023763.1) of C. marinus. Mapped reads were sorted, indexed, filtered for mapping quality (-q 20) and transformed to BAM format with SAMtools v.1.9 Li et al., 2009. Read group information was added with the AddOrReplaceReadGroups.jar v.1.74 script from the Picard toolkit (http://picard.sourceforge.net/) DePristo et al., 2011.

For the nuclear genome, SNP and insertion-deletion (indel) genotypes were called using GATK v.3.8–0-ge9d806836 McKenna et al., 2010. After initial genotype calling with the GATK HaplotypeCaller and the parameter ‘-stand_call_conf 30’, base qualities were recalibrated with the GATK BaseRecalibrator with ‘-knownSites’ and genotype calling was repeated on the recalibrated BAM files to obtain the final individual VCF files. Individual VCF files were combined using GATK GenotypeGVCFs. SNP and indel genotypes were filtered with VCFtools v.0.1.14 Danecek et al., 2011 to keep only biallelic polymorphisms (--max-alleles 2), with a minimum minor allele frequency of 0.02 (--maf 0.02), a minimum genotype quality of 20 (--minQ 20) and a maximum proportion of missing data per locus of 40% (--max-missing 0.6), resulting in 792,032 SNPs and 156,096 indels over the entire set of 168 individuals. For certain analyses indels were excluded with VCFtools (‘--remove-indels’).

Reads mapped to the mitochondrial genome were transformed into mitochondrial haplotypes as described in Fuhrmann and Kaiser, 2020.

Population genomic analyses

Mitochondrial haplotype networks were calculated using the Median-Joining algorithm Bandelt et al., 1999 with Network v.10.1.0.0 (fluxus-engineering.com).

Nuclear SNP genotypes were converted to PLINK format with VCFtools. SNPs were LD pruned with PLINK v.1.90b4 Chang et al., 2015 and parameters ‘--indep-pairwise 50 10 0.5’ as well as ‘--chr-set 3 no-xy no-mt --nonfounders’ (resulting in 445,385 SNPs). Principal Component Analysis (PCA) was performed in PLINK using the option ‘--pca’ with the options default settings. The pruned BED file from PLINK was used as input to ADMIXTURE v.1.3.0 Alexander and Lange, 2011, with which we assessed a series of models for K=1 to K=10 genetic components, as well as the corresponding cross-validation error (‘--cv’). Migration was further tested by converting the SNP data to TreeMix format with the vcf2treemix.sh script Ravinet, 2021 and running TreeMix 1.13 Pickrell and Pritchard, 2012 with default parameters and the southernmost Por-1SL as root population. Finally, in order to distinguish incomplete lineage sorting from introgression on a broader geographic scale, we calculated the f4 ratio test Reich et al., 2009 as implemented in the f4.py script Meyer et al., 2017 for the He-1SL, Por-1SL, Ar-2AR and Seh-2AR populations. Standard error was obtained by jack-knifing in blocks of 100 SNPs (-k 100). We tested if the observed f4 ratio was significantly different from zero by comparison to 1,000 coalescent simulations on a random subset of 1% of the pruned SNPs (n=4,453; -s 1000).

Population estimates along the chromosomes were calculated in 100 kb overlapping sliding-windows with 10 kb steps (Figure 3—figure supplement 7) or 10 kb overlapping sliding-windows with 1 kb steps (Figure 6). Nucleotide diversity (π) was calculated for SNPs with VCFtools ‘--window-pi’. For the genome-wide average, calculations were repeated with 200 kb non-overlapping windows. Linkage disequilibrium (LD; as r2) was calculated in VCFtools with ‘--geno-r2’. Local LD was calculated with ‘--ld-window-bp 500’. Preliminary tests showed that local LD decays within a few hundred base pairs (Figure 3—figure supplement 12). For long range LD, minor allele frequency was filtered to 0.2 (‘--maf 0.2‘, resulting in 335,800 SNPs), only values larger 0.5 were allowed with ‘--min-r2 0.5’ and the ‘--ld-window-bp 500’ filter was removed. Pairwise FST was calculated with VCFtools ‘--weir-fst-pop’ option per SNP and in sliding windows. For calculation of genetic divergence (dxy), allele frequencies were extracted with VCFtools ‘--freq’ and dxy was estimated from allele frequencies according to Delmore et al., 2015.

Phylogenomics and topology weighting

Nuclear genome phylogeny was calculated for a random set of six individuals from each population, without Tro-tAR (n=36). We down-sampled to six individuals per population because the following topology weighting algorithm assesses all possible combinations of six individuals (i.e. one per population) in the sample. The number of combinations grows exponentially with the number of individuals and thus leads to immense run-times of the algorithm. For windowed phylogenies, the VCF file was subset into non-overlapping 50 kb windows using VCFtools ‘--from-bp --to-bp’. SNP genotypes were transformed into FASTA alignments of only informative sites with the vcf2phylip.py v.2.3 script Ortiz, 2019 and parameters ‘-m 1 p -f’. Heterozygous genotypes were represented by the respective IUPAC code for both bases. Whole genome and windowed phylogenies were calculated with IQ-TREE v.1.6.12 Nguyen et al., 2015 using the parameters ‘-st DNA -m MFP -keep-ident -redo’ for the windowed and ‘-st DNA -m MFP -keep-ident -bb 1000 -bnni -nt 10 -redo’ for the whole genome phylogenies. Topology weighting was performed on the windowed phylogenies with TWISST Martin and Van Belleghem, 2017 and the parameter ‘--method complete’.

Association analysis

Population-based association between genetic variants (SNPs and Indels) and ecotype, as well as environmental variables (Supplementary file 14) was assessed in BayPass v.2.2 Gautier, 2015. Allele counts were obtained with VCFtools option ‘--counts’. Analyzed covariates were ecotype, sea surface salinity (obtained from Hordoir et al., 2019) and average water temperature of the year 2020 (obtained from weather-atlas.com, accessed 27.04.2020; 16:38), as given in Supplementary file 14. BayPass was run with the MCMC covariate model. BayPass corrects for population structure via Ω dissimilarity matrices, then calculates the XtX statistics and finally assesses the approximate Bayesian p value of association (eBPmc). To obtain a significance threshold for XtX values, the data was randomly subsampled (100,000 genetic variants) and re-analyzed with the standard covariate model, as implemented in baypass_utils.R. All analyses we performed in three replicates (starting seeds 5,001, 24,306, and 1,855) and the median is shown.

SNP effects and GO term enrichment analysis

Gene annotations to the CLUMA1.0 reference genome Kaiser et al., 2016 were considered reliable if they fulfilled one of three criteria: (1) Identified ortholog in UniProtKB/Swiss-Prot or non-redundant protein sequences (nr) at NCBI or PFAM domain, as reported in Kaiser et al., 2016. (2) Overlap of either at least 20% with mapped transcript data or 40% with mapped protein data, as reported in Kaiser et al., 2016. (3) Manually annotated. This resulted in a 15,193 confidence genes models. The location and putative effects of the SNPs and indels relative to these confidence gene models were annotated using SnpEff 4.5 Cingolani et al., 2012 (build 2020-04-15 22:26, non-default parameter `-ud 0’). Gene Ontology (GO) terms were annotated with emapper-2.0.1. Huerta-Cepas et al., 2017 from the eggNOG 5.0 database Huerta-Cepas et al., 2019, using DIAMOND Buchfink et al., 2015, BLASTP e-value <1e−10 and subject-query alignment coverage of >60%. Conservatively, we only transferred GO terms with ‘non-electronic’ GO evidence from best-hit orthologs restricted to an automatically adjusted per-query taxonomic scope, resulting in 5,393 C. marinus gene models with GO term annotations. Enrichment of ‘Biological Process’ GO terms in the genes associated with ecotype-specific polymorphisms was assessed with the weight01 Fisher’s exact test implemented in topGO Alexa and Rahnenfuhrer, 2010 (version 2.42.0, R version 4.0.3). When constructing the topGOdata object, the 5393 gene models with GO terms were specified as the custom background universe by providing the gene list to the parameter ‘allGenes’ and the custom gene-to-GO mappings were supplied by providing eggNOG GO annotations of these genes to the parameter ‘gene2GO’. Out of the 1,400 ecotype-associated gene models 463 were annotated with GO terms and only those went into GO term enrichment analysis. The fractions of genes with missing GO terms among the ecotype-associated genes and the total genes do not differ significantly (chi square test, p=0.12), suggesting both gene sets are equally affected by missing GO terms. Any potential biases in the reduced gene universe are accounted for by calculating the probability of observing at least the same degree of overrepresentation of a GO term when 463 genes are randomly selected out of the specified gene universe of 5393 genes (as implemented with the Fisher’s exact test in topGO).

Figure preparation

Figures were prepared in R Crawley, 2007. Data were handled with the ‘data.table’ Dowle, 2020 and ‘plyr’ Wickham, 2011 packages. The map of Europe was generated using the packages ‘ggplot2’ Wickham, 2016 and ‘ggrepel’ Slowikowski, 2018, ‘maps’ Brownrigg et al., 2018 and ‘mapdata’ Becker et al., 2018. The map was taken from the CIA World DataBank II (http://www.evl.uic.edu/pape/data/WDB/). Circular plots were prepared using the R package ‘circlize’ Gu et al., 2014. Multiple plots were combined in R using the package ‘Rmisc’ Hope, 2013. The graphical editing of the whole genome phylogeny was done in Archeopteryx (http://www.phylosoft.org/archaeopteryx) Han and Zmasek, 2009. Final figure combination and graphical editing of the raw plot files was done in Inkscape. Neighbor Joining trees of the omega statistic distances from BayPass were created with the R package ‘ape’ Paradis and Schliep, 2019. In all plots, the order and orientation of scaffolds within the chromosomes follows the published genetic linkage map Kaiser et al., 2016.

Data availability

Sequencing data for field samples and mapping families are deposited at ENA under Accession PRJEB43766. Filtered genome annotations are given in Supplementary File 3. All other data are included in the manuscript as supplementary data files.

The following data sets were generated
    1. Fuhrmann N
    2. Prakash C
    3. Kaiser TS
    (2022) European Nucleotide Archive
    ID PRJEB43766. Baltic and Arctic Clunio, whole genome resequencing.

References

  1. Book
    1. Crawley MJ
    (2007) The R Book
    Chichester, UK: John Wiley & Sons Ltd.
    https://doi.org/10.1002/9780470515075
    1. Honegger HW
    (1977)
    An automatic device for the investigation of the rhythmic emergence pattern of clunio marinus
    International Journal of Chronobiology 4:217–221.
  2. Book
    1. Kaiser T
    (2014) Patterns and mechanisms of nature’s enigmatic rhythms
    In: Numata B, editors. Annual, Lunar, and Tidal Clocks. Japan: Springer. pp. 121–141.
    https://doi.org/10.1007/978-4-431-55261-1
    1. Neumann D
    (1966)
    Die lunare und tägliche schlüpfperiodik der Mücke clunio - steuerung und abstimmung auf die gezeitenperiodik
    Zeitschrift Für Vergleichende Physiologie 53:1–61.
  3. Book
    1. Neumann D
    2. Heimbach F
    (1979)
    Cyclic Phenomena in Marine Plants and Animals
    Elsevier.
  4. Book
    1. Neumann D
    (1983) Die zeitliche programmierung von tieren auf periodische umweltbedingungen
    In: Vorträge G, editors. Rheinisch-Westfälische Akademie Der Wissenschaften. Springer. pp. 31–62.
    https://doi.org/10.1007/978-3-322-85974-7_2
  5. Book
    1. Neumann D
    (2014) Circannual rhythms: history, present challenges, future directions
    In: Numata H, Helm B, editors. Annual, Lunar, and Tidal Clocks. Japan: Springer. pp. 3–24.
    https://doi.org/10.1007/978-4-431-55261-1_11
    1. Remmert H
    (1955)
    Ökologische untersuchungen über die dipteren der nord- und ostsee
    Archiv Für Hydrobiologie 51:1–53.
    1. Wickham H
    (2016)
    Ggplot2: Elegant Graphics for Data Analysis
    Ggplot2, Ggplot2: Elegant Graphics for Data Analysis, Cham, springer, 10.1007/978-3-319-24277-4.

Decision letter

  1. Detlef Weigel
    Senior and Reviewing Editor; Max Planck Institute for Biology Tübingen, Germany

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Polygenic adaptation from standing genetic variation allows rapid ecotype formation" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Detlef Weigel as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. As you will see, both expert reviewers like the topic and the system you have established, but have also significant concerns, mostly about interpretation, but some also regarding technical issues. The companion paper is clearly the stronger paper, but we would like nevertheless invite revision for the current paper as well, as the two papers complement each other.

Essential revisions:

1) Describe the power of the QTL analysis and clearly describe its limitation. State that it remains unclear how complex the genetic basis of rhythmic versus arrhythmic behavior is, and that a moderately complex genetic basis (vulgo oligogenic) cannot be excluded.

2) Be more open to the possibility that some of the apparent differentiation is due to demographic processes instead of adaptation only.

3) Be more conservative with the GO enrichment analyses.

Reviewer #1 (Recommendations for the authors):

The Introduction should set the scene better before moving on to Clunio. Other cases of polygenic adaptation should be mentioned. There should be some discussion of theoretical expectations for the distribution of effect sizes for adaptive traits (with or without gene flow – see Yeaman and Whitlock 2011, for example).

The interpretations are generally strong and the core points are interesting and significant for the field – the recent, polygenic adaption from standing variation, the genomic clustering and the identification of candidate pathways are all strong results. Evidence from the QTL analysis is least convincing – see specific comments below. The absence of evidence here is consistent with the results of the genome scan but that is about as far as one can go.

Absence of evidence for structural variants is also quite hard to interpret. I agree that the patterns do not obviously fit with a simple interpretation (such as a single polymorphic inversion) but there is remarkably strong LD over long distances. This might be explained by gene flow for the sympatric population pair but it appears to extend to other populations too. That said, I think the authors handle this well, leaving open the possibility that further work will reveal structural variants.

Argument for standing variation does seem good – question about maintenance – plausibly suggest LA within the ancestral Atlantic ecotype.

The GO enrichment results are very neat. A key strength is knowing the dominant axis of selection and being able to link that to genetic pathways likely to influence the relevant phenotypes (and so avoid 'story-telling'). These points are made in the Discussion but I think some of this could usefully be moved to a revised introduction, perhaps giving more emphasis to insights other than the polygenic nature of adaptation that can come from the unique features of this system.

Below I give further comments by line number:

15 – Not 'is' but 'can be' since speciation by other routes is possible and also because ecotype formation need not lead to speciation.

36 – In addition to these recent references, it would be good to recognise long-standing awareness of the role of polygenic adaptation. See, for example, Le Corre and Kremer 2012 – doi: 10.1111/j.1365-294X.2012.05479.x

38 – A polygenic basis can be demonstrated by quantitative genetics methods such as 'regional heritability'. It is currently unclear whether this is included under 'deep genomic and genetic analysis'.

Figure 1 – The meaning of zero on the x-axis needs to be explained in the figure caption. The y-axis label (% emergence) seems to conflicts with the legend which says that these are plots of oviposition timings. Bergen is very far from the entrance to the Baltic – somewhere, the ecotype present in the Skagerrak needs to be mentioned.

118 – I am not sure why this is described as a contrast. The previous paragraph describes geographic rather than ecotype structure. That is surely confirmed by the observation that sympatric populations of different ecotypes are similar genetically.

154-164 – This inference should not be pushed too far since it is based on a single point in what is, presumably, a large area of overlap, perhaps with varying past and present mixes of the two ecotypes. So, the inferences apply to the specific sampled populations but may not generalise to the Atlantic-Baltic interaction at other locations. It also begs the question of whether this is a secondary contact, after an allopatric origin of the Baltic ecotype.

190 – How many topologies fit this pattern?

210 – Comparison to the SD locus only says that the number of loci for rhythmicity is likely to be >1, and even this conclusion is questionable since sex is a binary phenotype with no environmental variance. Therefore, to conclude from the lack of detectable QTL that 'many' loci underlie rhythmicity is going too far. It may be possible to use simulation to get a better idea of the power of the QTL analysis and so the likely maximum effect size. If the vertical lines in Figure 4 represent marker loci (not specified in the legend) then the number is rather low and distribution rather uneven, perhaps making QTL detection in some parts of the genome difficult (including the important region in the centre of Chr1). I also wonder how good a measure of rhythmicity is provided by the emergence day (as an absolute deviate, if I understand correctly) for an individual. This seems like a difficult trait to handle since rhythmicity cannot easily be measured for any one individual.

280 – It is surprising that the cluster of outliers on Chr1 does not appear as a QTL, especially given the long-range LD in this area

Methods

543 – 8 markers per chromosome was on the low side. 'Evenly spaced' here was presumably in physical distance since the marker spacing on the genetic map looks far from even. This also tells us something about the distribution of recombination along the chromosomes, which may be relevant to the cluster of divergent loci on Chr1 but this connection is not made in the manuscript.

620 – The rapid decline in LD suggests that window sizes for some analyses (e.g. 100kb for population genetic statistics) was rather large. This could have influenced sweep detection, for example.

627 – It is not clear why it was necessary to down-sample to 6 individuals per population here. Using 50kb windows are, again, rather large (and adjusting windows to a number of SNPs is preferable generally for this type of analysis), but I doubt either will have impacted the results strongly.

Figure S4 – Too many decimal places are used for both distance and Fst. The comma separator in the distance should be replaced with a point (or probably the fractions of km simply ignored).

Figure S7 – TreeMix also provides a plot of %variance explained, which helps to determine the appropriate number of migration events. This could usefully be given here.

Figure S21 – Are numbers above marker lines just their IDs? It might be helpful to indicate their linkage map positions (cf comments above about even spacing on the physical vs the genetic map).

Reviewer #2 (Recommendations for the authors):

Although I am enthusiastic about the promise of the system more generally and the importance of the questions posed, I have significant concerns about the manuscript and its conclusions. I am embedding my suggestions for new analyses with the descriptions of the issues above so that they are readable:

1) My primary concern in is in the design and interpretation of the QTL analysis. The QTL approach used here has low power, both due to the sample size and the number of markers used (it looks like ~8 per chromosome). The authors use an analysis of the sex determining locus as a "control" but because of the complete heritability of this trait in most systems it is more of a straw man to me. The authors conclude that the architecture of the trait is polygenic based on this, but we are missing key information to evaluate this. For example, what is the estimated heritability of the trait in this family and under laboratory conditions? I was also wondering if with the number of markers whether the authors might miss certain loci if the recombination rate is high.

2) There are some issues with the presentation and interpretation of the population genetic analyses. Many assumptions are made about whether introgression or ILS occurred and there are statements that are not accurate about it being "impossible" to distinguish between these scenarios. There are quite a lot of methods devoted to this problem, see e.g. (see Liu et al. PLoS Comp Bio 2014). Simpler tests like D-statistic and F4 ratio tests would add a lot of interpretability to these results.

3) Some of the analyses associated with ecological adaptation that follow on the QTL results struck me as ad hoc and with the potential to lead to spurious results. I am not familiar with the BayPass approach but since it is the approach that explicitly accounts for population structure it seems the one that would be most appropriate for the authors to focus on in a revised manuscript. The use of phylogenetic windows that associate with ecotype is concerning to me as given the level of ILS and gene flow that appears to be present in this system is would be very challenging to distinguish signal from noise. Related to this, in the companion manuscript the authors note that the chromosome 1 region has several inversions. I realize that the samples used in this manuscript are different but was unclear on how they excluded the possibility inversions were present on the chromosome 1 region that is highly differentiated in this analysis. This seems to be the primary region that distinguishes ecotypes in the BayPass analysis and thus is foundational to their conclusion that this signal is polygenic (since few other regions are significant in the BayPass analysis). I also noticed that there were not many QTL markers in this region of chromosome 1 regarding their previous analysis.

4) There were issues with the GO analysis that should be addressed. Because the gene universe used for GO enrichment is a subset of the full gene set, GO enrichment results will be biased. This will mostly lead to false positives (i.e. overrepresentation of a GO category due to evaluating a subset of genes that fall in that category). The authors should pair this with simulations or other analyses, or temper the strength of their claims.

https://doi.org/10.7554/eLife.82824.sa1

Author response

Essential revisions:

1) Describe the power of the QTL analysis and clearly describe its limitation. State that it remains unclear how complex the genetic basis of rhythmic versus arrhythmic behavior is, and that a moderately complex genetic basis (vulgo oligogenic) cannot be excluded.

We completely rewrote the section on QTL analysis. We now clearly describe its limitations (lines 203-229) and conclude that the phenotype is controlled by “more than one“ locus.

2) Be more open to the possibility that some of the apparent differentiation is due to demographic processes instead of adaptation only.

We have inserted a statement on the possible role of demographic processes in the Results section (lines 251-254), and also inserted a statement that part of the clustering of divergent alleles may be due to reduced recombination (lines 266-269).

3) Be more conservative with the GO enrichment analyses.

We maintain that we have been adequately cautious in the GO enrichment analysis. First, we conservatively only annotated orthologous genes with GO terms, avoiding false positives due to annotating duplicated genes (of essentially unknown function) with GO terms from their paralogs. Second, we think that the reviewer who is sceptic about our GO enrichment analysis has not fully appreciated the sophistication of our analysis. While details are given in the response below, briefly, we accounted for the GO term distribution of our reduced gene set when we specified the set of genes with GO terms as the custom background universe and only analyzed ecotype-associated genes with GO terms. The two subsets with GO terms are the same fraction of the respective full gene set, suggesting there are no biases in sub-setting. In our GO enrichment analysis, the Fisher’s exact test gives the exact hypergeometric probability under the null hypothesis for observing the degree of enrichment of a GO term given the custom gene universe, accounting for the potential differences in the representation of GO terms within the reduced gene set in comparison to the set of all genes. If there was any bias in the reduced gene universe, it would thereby have been accounted for in our analysis.

If the reviewer is concerned with the general stability of GO enrichment results between different subsets of a gene universe, we would like to highlight that Groß et al. have demonstrated that GO enrichment is surprisingly robust to changes in the GO term ontology and annotations (https://academic.oup.com/bioinformatics/article/28/20/2671/204694), further suggesting that using a reduced gene universe should not pose major problems.

Reviewer #1 (Recommendations for the authors):

The Introduction should set the scene better before moving on to Clunio. Other cases of polygenic adaptation should be mentioned. There should be some discussion of theoretical expectations for the distribution of effect sizes for adaptive traits (with or without gene flow – see Yeaman and Whitlock 2011, for example).

We have re-written the introduction and now mention other cases of polygenic adaptation, as well as a discussion of the expected distribution of effect sizes for adaptive traits (lines 35-51).

The interpretations are generally strong and the core points are interesting and significant for the field – the recent, polygenic adaption from standing variation, the genomic clustering and the identification of candidate pathways are all strong results.

We thank the reviewer fort this positive evaluation of our work.

Evidence from the QTL analysis is least convincing – see specific comments below. The absence of evidence here is consistent with the results of the genome scan but that is about as far as one can go.

We have re-written this paragraph, and now conclude that we can exclude a single or few strong-effect loci. We also explicitly included a discussion of the weaknesses of the QTL data (lines 203-229). However, we still consider them as relevant support for the overall conclusions.

Absence of evidence for structural variants is also quite hard to interpret. I agree that the patterns do not obviously fit with a simple interpretation (such as a single polymorphic inversion) but there is remarkably strong LD over long distances. This might be explained by gene flow for the sympatric population pair but it appears to extend to other populations too. That said, I think the authors handle this well, leaving open the possibility that further work will reveal structural variants.

We agree that the evidence for SVs is hard to interpret. As the reviewer states we handled this well, we did not make any changes to the respective paragraphs.

Argument for standing variation does seem good – question about maintenance – plausibly suggest LA within the ancestral Atlantic ecotype.

We are happy the reviewer appreciates our argument for adaptation from standing genetic variation. We are not quite sure what she/he refers to with “LA”, but assume it is local adaptation. We make the point that standing variation for circadian timing is likely due to local adaptation in the discussion.

The GO enrichment results are very neat. A key strength is knowing the dominant axis of selection and being able to link that to genetic pathways likely to influence the relevant phenotypes (and so avoid 'story-telling'). These points are made in the Discussion but I think some of this could usefully be moved to a revised introduction, perhaps giving more emphasis to insights other than the polygenic nature of adaptation that can come from the unique features of this system.

We agree with the reviewer that the GO term enrichment results are neat, and hope that this helps to convince the other reviewer and the editor of the strength of the dataset (see detailed argumentation above and below). Indeed, knowing the dominant axes of selection is a particular strength in our system and we now mention this in the introduction (lines 92-94).

Below I give further comments by line number:

15 – Not 'is' but 'can be' since speciation by other routes is possible and also because ecotype formation need not lead to speciation.

We have made the requested change.

36 – In addition to these recent references, it would be good to recognise long-standing awareness of the role of polygenic adaptation. See, for example, Le Corre and Kremer 2012 – doi: 10.1111/j.1365-294X.2012.05479.x.

We have added the respective citation.

38 – A polygenic basis can be demonstrated by quantitative genetics methods such as 'regional heritability'. It is currently unclear whether this is included under 'deep genomic and genetic analysis'.

Deep genomic and genetic analysis can also include Regional Heritability Mapping (RHM). However, as the method requires a pedigree and is thus not applicable to our data, we did not explicitly mention it here.

Figure 1 – The meaning of zero on the x-axis needs to be explained in the figure caption. The y-axis label (% emergence) seems to conflicts with the legend which says that these are plots of oviposition timings. Bergen is very far from the entrance to the Baltic – somewhere, the ecotype present in the Skagerrak needs to be mentioned.

There is no zero label on the x-axis. The zero label is on the y-axis (Emergence %).

The y-axis label does not conflict with the legend. The legend mentions “oviposition behavior” as one aspect in which the ecotypes differ. This was not listed in the figure, and may therefore have been misleading. We included “oviposition on exposed substrate” and “oviposition on water surface” in the figure. Otherwise, the legend clearly states that plots B-H are “lunar rhythms of adult emergence”.

Unfortunately, we have not sampled and we do not know any Clunio population from the Skagerrak. We have therefore included a comment that the wider overlap of the distribution Baltic and Arctic ecotypes must be subject to further investigation (lines 161-164).

118 – I am not sure why this is described as a contrast. The previous paragraph describes geographic rather than ecotype structure. That is surely confirmed by the observation that sympatric populations of different ecotypes are similar genetically.

The contrast referred to differentiated vs not differentiated populations. But this was indeed not clear from the context. So we removed the statement on this being a contrast (line 129).

154-164 – This inference should not be pushed too far since it is based on a single point in what is, presumably, a large area of overlap, perhaps with varying past and present mixes of the two ecotypes. So, the inferences apply to the specific sampled populations but may not generalise to the Atlantic-Baltic interaction at other locations. It also begs the question of whether this is a secondary contact, after an allopatric origin of the Baltic ecotype.

We agree with the reviewer that our conclusions are drawn based on a single point, and we believe we have stated this clearly that we are only referring to Bergen. In order to further clarify this, we have added the statement that further work will be required to see if there is a larger area of overlap, and if so which evolutionary scenarios (introgression, secondary contact) apply there (lines 161-164).

190 – How many topologies fit this pattern?

It is 357 topologies that fit the pattern. We have inserted the information in the text (line 200).

210 – Comparison to the SD locus only says that the number of loci for rhythmicity is likely to be >1, and even this conclusion is questionable since sex is a binary phenotype with no environmental variance. Therefore, to conclude from the lack of detectable QTL that 'many' loci underlie rhythmicity is going too far.

We agree with the reviewer and have rewritten the entire paragraph, delineating the shortcomings of the QTL mapping approach and concluding that the phenotype is not controlled by a single large-effect locus.

It may be possible to use simulation to get a better idea of the power of the QTL analysis and so the likely maximum effect size.

While it might be possible to estimate the maximum effect size from simulations, we think that this effort goes beyond the scope of our manuscript and would actually be distracting. We have instead followed the editor’s suggestion and have clearly spelled out the limitations of the QTL analysis and have tempered our claims to say the trait is controlled by “more than one” locus (lines 203-229).

If the vertical lines in Figure 4 represent marker loci (not specified in the legend) then the number is rather low and distribution rather uneven, perhaps making QTL detection in some parts of the genome difficult (including the important region in the centre of Chr1).

The vertical lines in Figure 4 represent marker loci, and we have added this information to the figure legend. We have also added the statement that the low marker number limits QTL mapping in the respective paragraph (lines 219-220). The uneven distribution of markers may make QTL detection difficult, but not for the important region in the center of Chr1 one, which is covered by the markers with low recombination at the left end of the linkage map. This in turn is important in the light of the reviewers comment below on lines 543-8, and we have therefore added this information to the manuscript (lines 264-269) and the “Figure 4—figure supplement 2”.

I also wonder how good a measure of rhythmicity is provided by the emergence day (as an absolute deviate, if I understand correctly) for an individual. This seems like a difficult trait to handle since rhythmicity cannot easily be measured for any one individual.

Rhythmicity is indeed a difficult phenotype and we have now clearly stated its specific problems in both the methods (lines 548-550) and the Results sections (lines 219-224). We can safely assume that individuals emerging outside the peak are arrhythmic. However, arrhythmic individuals which happen to emerge during the peak are scored as rhythmic, leading to a fraction of individuals with false phenotypes. We have come up with various phenotyping procedures, which are not reported here. This included the exclusion of days at the edges of the peak, which are particularly prone to phenotyping error. It also included an expectation-maximization (EM) algorithm that would assign phenotypes to the individuals in the peak based on the probability of being rhythmic and in a way so that the LOD score of the resulting phenotype panel, i.e. the probability of finding a significant QTL, is maximized. This algorithm successfully allowed QTL mapping based on rhythmicity in another case – see our preprint at https://doi.org/10.1101/2022.10.12.511720. In this preprint, the different scoring schemes and the EM algorithm converged to 1-3 detectable QTL for rhythmicity, notably based on smaller mapping families (but more markers) than used here. For the cross reported here, no scoring scheme led to detectable QTL and the EM algorithm did not converge to reproducible QTL. This may underscore that the genetic basis of arrhythmicity in the case we report here may be based on more than a few loci. But it may in part also be attributable to the difficult phenotype. Hence, in order not to over-analyze and over-interpret, we decided to only report the simplest phenotyping scheme and considerably temper our claims to say “more than one locus” contributes to the arrhythmic phenotype.

280 – It is surprising that the cluster of outliers on Chr1 does not appear as a QTL, especially given the long-range LD in this area

It is surprising that the region does not show up as a QTL. This may underscore that other loci are likely to contribute.

Methods

543 – 8 markers per chromosome was on the low side. 'Evenly spaced' here was presumably in physical distance since the marker spacing on the genetic map looks far from even. This also tells us something about the distribution of recombination along the chromosomes, which may be relevant to the cluster of divergent loci on Chr1 but this connection is not made in the manuscript.

We have added the information that ‘evenly spaced’ is with respect to physical distance (lines 531-533). We have also revised “Figure 4—figure supplement 2” (previously Supplementary Figure 21) to include the comparison between physical distance and genetic map distance. We have included the information that the divergent region on chr1 has low recombination, which may aid the clustering of divergent loci (lines 266-269).

620 – The rapid decline in LD suggests that window sizes for some analyses (e.g. 100kb for population genetic statistics) was rather large. This could have influenced sweep detection, for example.

In this case our methods section was unfortunately incomplete. Figure 6 (sweep detection) reports 10 kb windows with 1kb steps. We have added the information to the methods (lines 613-615) and the figure legend. Additionally, we have tried 1kb windows for LD and 5 kb windows for pi, but these plots were very noisy, likely due to the limited number of SNPs in such small windows.

627 – It is not clear why it was necessary to down-sample to 6 individuals per population here. Using 50kb windows are, again, rather large (and adjusting windows to a number of SNPs is preferable generally for this type of analysis), but I doubt either will have impacted the results strongly.

We included the explanation that running-times of the topology weighting algorithm increases exponentially with the number of individuals in the methods section (lines 628-632). With six individuals it is 66 = 46,656 combinations, with the full dataset it would have been 624 = 4.7*1018 combinations. That is why we had to down-sample.

While 50kb windows are indeed large, we also tried 5kb windows and did not observe any notable differences. See Author response image 1.

Author response image 1

Figure S4 – Too many decimal places are used for both distance and Fst. The comma separator in the distance should be replaced with a point (or probably the fractions of km simply ignored).

We have adjusted the figure accordingly.

Figure S7 – TreeMix also provides a plot of %variance explained, which helps to determine the appropriate number of migration events. This could usefully be given here.

We couldn’t find the respective plotting function in TreeMix. Instead we prepared a plot of the model likelihood for zero to five migration events. The sharp flattening after one migration event suggests one migration event produces the only significant improvement of the model (new “Figure 2—figure supplement 7”).

Figure S21 – Are numbers above marker lines just their IDs? It might be helpful to indicate their linkage map positions (cf comments above about even spacing on the physical vs the genetic map).

Yes, the numbers above markers are their IDs. We have adjusted the figure to show the position of the markers both on the linkage map and the physical map, as well as relative to peaks in FST (now “Figure 4—figure supplement 2”). This also allowed us to insert statements about reduced recombination in certain areas of the genome (lines 266-269).

Reviewer #2 (Recommendations for the authors):

Although I am enthusiastic about the promise of the system more generally and the importance of the questions posed, I have significant concerns about the manuscript and its conclusions. I am embedding my suggestions for new analyses with the descriptions of the issues above so that they are readable:

1) My primary concern in is in the design and interpretation of the QTL analysis. The QTL approach used here has low power, both due to the sample size and the number of markers used (it looks like ~8 per chromosome). The authors use an analysis of the sex determining locus as a "control" but because of the complete heritability of this trait in most systems it is more of a straw man to me. The authors conclude that the architecture of the trait is polygenic based on this, but we are missing key information to evaluate this. For example, what is the estimated heritability of the trait in this family and under laboratory conditions? I was also wondering if with the number of markers whether the authors might miss certain loci if the recombination rate is high.

We agree that our QTL analysis has limitations in terms of sample size, as well as number and distribution of markers. To fully acknowledge these limitations, we have now spelled them out in the results and methods section. We have also tempered our claims significantly, and now only state that QTL analysis suggests that the trait is controlled by more than one gene (lines 203-229). We have not included an estimate of heritability, as it is not straightforward to calculate it with the data we have at hand. Also, we do not over-emphasize the results of the QTL analysis – we rather acknowledge that it has limitations. Still, its results complement the whole analysis.

2) There are some issues with the presentation and interpretation of the population genetic analyses. Many assumptions are made about whether introgression or ILS occurred and there are statements that are not accurate about it being "impossible" to distinguish between these scenarios. There are quite a lot of methods devoted to this problem, see e.g. (see Liu et al. PLoS Comp Bio 2014). Simpler tests like D-statistic and F4 ratio tests would add a lot of interpretability to these results.

We added the f4 ratio test to our analysis and found that the genome-wide f4 ratio is significantly different from 0, i.e. there is some introgression. However, the signature of introgression is driven by only 775 SNPs, a small fraction of the total dataset. We have added this information in the manuscript and re-phrased our interpretation accordingly (lines 187-196).

3) Some of the analyses associated with ecological adaptation that follow on the QTL results struck me as ad hoc and with the potential to lead to spurious results.

It would have been helpful if the reviewer had explicitly stated which analyses struck her/him as ad hoc. We assume it is the analyses referred to below.

I am not familiar with the BayPass approach but since it is the approach that explicitly accounts for population structure it seems the one that would be most appropriate for the authors to focus on in a revised manuscript. The use of phylogenetic windows that associate with ecotype is concerning to me as given the level of ILS and gene flow that appears to be present in this system is would be very challenging to distinguish signal from noise.

We present three different approaches to detect genomic regions associated with ecotype: phylogenetic windows, BayPass and FST outliers. One of the strengths of our analysis is that the three approaches lead to very similar results, e.g. the clustering of ecotype-associated variants on chr1. Hence, we would not like to drop any of the analyses.

We agree with the reviewer that BayPass is indeed the strongest approach. That is why in the manuscript we already focus on the BayPass results: After reporting on all three approaches (phylogenetic windows, BayPass and FST outliers) in one paragraph each, the entirety of the downstream analyses (adaptation from standing variation, GO term analyses, highly associated loci) is based solely on the BayPass results.

While the phylogenetic approach has its weaknesses due to ILS and gene flow, we would still like to keep the single sentence reporting on it (lines 235-236), as the results are fully consistent with the other analyses.

Related to this, in the companion manuscript the authors note that the chromosome 1 region has several inversions. I realize that the samples used in this manuscript are different but was unclear on how they excluded the possibility inversions were present on the chromosome 1 region that is highly differentiated in this analysis. This seems to be the primary region that distinguishes ecotypes in the BayPass analysis and thus is foundational to their conclusion that this signal is polygenic (since few other regions are significant in the BayPass analysis). I also noticed that there were not many QTL markers in this region of chromosome 1 regarding their previous analysis.

We also report on putative inversions on chromosome 1 in this manuscript, represented by blocks of observed long-range LD (lines 264-266; “Figure 3—figure supplement 8”, previously Supplementary Figure 17). These putative inversions overlap with the divergent region on chromosome 1 and reduce recombination in that region, as can be seen from long-range LD, as well as the combination of physical markers distances with map length in our revised “Figure 4—figure supplement 2” (previously Supplementary Figure 21). However, we cannot find a 1:1 association of these putative inversions with the divergent region (the boundaries differ markedly), nor with ecotype (putative inversions are present in both ecotypes and rather population-specific than ecotype-specific). In a PCA of the divergent region, the individuals do not segregate in any pattern that would be typical for inversion haplotyopes (as is observed in the companion manuscript).

We would also like to point out that it is not the case that few other regions outside the chromosome 1 cluster are significant in BayPass. In our figure 3 the red dots are only the extreme outliers (p < 10-10), but black dots are already significant at p < 0.001 (as is stated in the figure legend). Thus, ecotype associated SNPs are all over the genome, and this is foundational to the conclusion that the signal is polygenic.

The divergent region on chromosome 1 is actually covered by four out of eight markers on that chromosome. This may not have been clear in the previous version of our manuscript, and we have therefore revised “Figure 4—figure supplement 2” (previously Supplementary Figure 21) to make the connection between the linkage map and physical marker locations.

4) There were issues with the GO analysis that should be addressed. Because the gene universe used for GO enrichment is a subset of the full gene set, GO enrichment results will be biased. This will mostly lead to false positives (i.e. overrepresentation of a GO category due to evaluating a subset of genes that fall in that category). The authors should pair this with simulations or other analyses, or temper the strength of their claims.

We see that our conservative approach to annotate only the subset of orthologous genes with GO terms produces an issue, namely that the genes without GO terms will be missing from the analysis, i.e. there are false negatives. In order to address this problem we have performed a second set of analysis which takes into account all genes with highly divergent SNPs irrespective of whether or not they are annotated with GO terms (lines 350-388). While we found that indeed many divergent genes are not annotated with GO terms, the functions we can assign to these genes by investigation of the relevant literature match the results from GO term analysis very well. This is the best we can do about false negatives.

The reviewer suggests there might be false positives in our analysis. We disagree with that for several reasons:

First, we think there might be a misunderstanding of the analysis due to insufficient information provided in the methods and we apologize for it. In our topGO analysis, the reduced dataset of 5,393 genes with GO terms was specified as the custom background gene universe. By specifying this reduced gene universe to topGO, with the parameter ‘allGenes’, any potential differences in the background frequencies of certain GO terms between our reduced gene set and the set of all genes will be statistically accounted for by the tool, through incorporation into the null hypothesis. 463 out of 1,400 ecotype-associated gene models were annotated with GO terms and only those were fed into the topGO analysis. The p-value returned by the Fisher’s exact test implemented in topGO calculates exactly the probability of observing at least the same amount of enrichment of a GO term when 463 genes are randomly selected out of the specified gene universe of 5,393 genes. Thus, if there is any bias in GO terms of our custom gene universe, this should have been accounted for. We also note that the fraction of genes with GO terms among the ecotype-associated genes and among the total genes is not significantly different (chi-squared test, p=0.12), indicating that both sets of genes are affected in the same way by the lack of GO term annotations.

Second, the reviewer asks for simulations to support our claims. If the reviewer is concerned with the ‘Open World Assumption’ that is inherent in Gene Ontology, namely that we have made statements about whether or not enrichment is true without knowing all observations (i.e. all GO terms for all genes) and would like us to perform simulations to evaluate the impact of using varying subsets of the gene universe on the stability of enrichment results -- Here we can say that Groß et al. have investigated the difference in results of functional analyses using GO annotations in both primates and rodents from different GO versions (https://academic.oup.com/bioinformatics/article/28/20/2671/204694), i.e. different Gene universes and levels of completeness. Their conclusions are that enrichment results are surprisingly stable to changes in ontology and annotation. Hence, we conclude that our reduced gene universe will very likely not cause major issues to the reliability of the conducted enrichment analysis.

In order to avoid these possible misunderstandings, we have inserted the respective information in the methods section (lines 670-681).

We also want to point out that the other reviewer considers our GO term enrichment results particularly strong.

https://doi.org/10.7554/eLife.82824.sa2

Article and author information

Author details

  1. Nico Fuhrmann

    Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Data curation, Formal analysis, Investigation, Visualization, Writing - review and editing, Field work
    Competing interests
    No competing interests declared
  2. Celine Prakash

    Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Formal analysis, Investigation, Visualization, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Tobias S Kaiser

    Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Writing – original draft, Project administration
    For correspondence
    kaiser@evolbio.mpg.de
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4126-0533

Funding

European Research Council (802923)

  • Tobias S Kaiser

Max-Planck-Gesellschaft

  • Tobias S Kaiser

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. Open access funding provided by Max Planck Society.

Acknowledgements

For field work we obtained logistic support from the Ar Research Station (Uppsala University), the Marine Biological Station Espegrend (University of Bergen), Even Jørgensen (The Arctic University of Norway, Tromsø) and Florian Weinberger and Nadja Stärck (GEOMAR Helmholtz Centre for Ocean Research, Kiel). We thank Jürgen Reunert, Kerstin Schäfer and Susanne Mentz for technical assistance, as well as all members of the MPRG “Biological Clocks” for discussion and support. Diethard Tautz and Julien Dutheil critically read the manuscript. Whole genome sequencing was performed at the Max Planck Genome Center (Cologne) with financial support from the Max Planck Society. This work was funded by the Max Planck Society through the Max Planck Research Group “Biological Clocks” and a sequencing grant. The work was further funded by the European Research Council (ERC) under the Horizon 2020 research and innovation program with an ERC Starting Grant (Grant agreement 802923) awarded to TSK.

Senior and Reviewing Editor

  1. Detlef Weigel, Max Planck Institute for Biology Tübingen, Germany

Version history

  1. Preprint posted: April 18, 2021 (view preprint)
  2. Received: August 18, 2022
  3. Accepted: February 7, 2023
  4. Version of Record published: February 28, 2023 (version 1)
  5. Version of Record updated: February 28, 2023 (version 2)
  6. Version of Record updated: March 31, 2023 (version 3)

Copyright

© 2023, Fuhrmann et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 827
    Page views
  • 100
    Downloads
  • 3
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Nico Fuhrmann
  2. Celine Prakash
  3. Tobias S Kaiser
(2023)
Polygenic adaptation from standing genetic variation allows rapid ecotype formation
eLife 12:e82824.
https://doi.org/10.7554/eLife.82824

Share this article

https://doi.org/10.7554/eLife.82824

Further reading

    1. Computational and Systems Biology
    2. Evolutionary Biology
    Roee Ben Nissan, Eliya Milshtein ... Ron Milo
    Research Article

    Synthetic autotrophy is a promising avenue to sustainable bioproduction from CO2. Here, we use iterative laboratory evolution to generate several distinct autotrophic strains. Utilising this genetic diversity, we identify that just three mutations are sufficient for Escherichia coli to grow autotrophically, when introduced alongside non-native energy (formate dehydrogenase) and carbon-fixing (RuBisCO, phosphoribulokinase, carbonic anhydrase) modules. The mutated genes are involved in glycolysis (pgi), central-carbon regulation (crp), and RNA transcription (rpoB). The pgi mutation reduces the enzyme’s activity, thereby stabilising the carbon-fixing cycle by capping a major branching flux. For the other two mutations, we observe down-regulation of several metabolic pathways and increased expression of native genes associated with the carbon-fixing module (rpiB) and the energy module (fdoGH), as well as an increased ratio of NADH/NAD+ - the cycle’s electron-donor. This study demonstrates the malleability of metabolism and its capacity to switch trophic modes using only a small number of genetic changes and could facilitate transforming other heterotrophic organisms into autotrophs.

    1. Evolutionary Biology
    2. Genetics and Genomics
    Thomas A Sasani, Aaron R Quinlan, Kelley Harris
    Research Article

    Maintaining germline genome integrity is essential and enormously complex. Although many proteins are involved in DNA replication, proofreading, and repair, mutator alleles have largely eluded detection in mammals. DNA replication and repair proteins often recognize sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of de novo mutations – the frequencies of C>T, A>G, etc. – will differ between genomes that harbor either a mutator or wild-type allele. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene Mutyh that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs (Sasani et al., 2022, Ashbrook et al., 2021). In this study we developed a new method to detect alleles associated with mutation spectrum variation and applied it to mutation data from the BXDs. We discovered an additional C>A mutator locus on chromosome 6 that overlaps Ogg1, a DNA glycosylase involved in the same base-excision repair network as Mutyh (David et al., 2007). Its effect depends on the presence of a mutator allele near Mutyh, and BXDs with mutator alleles at both loci have greater numbers of C>A mutations than those with mutator alleles at either locus alone. Our new methods for analyzing mutation spectra reveal evidence of epistasis between germline mutator alleles and may be applicable to mutation data from humans and other model organisms.