Sweepstakes reproductive success via pervasive and recurrent selective sweeps
Abstract
Highly fecund natural populations characterized by high early mortality abound, yet our knowledge about their recruitment dynamics is somewhat rudimentary. This knowledge gap has implications for our understanding of genetic variation, population connectivity, local adaptation, and the resilience of highly fecund populations. The concept of sweepstakes reproductive success, which posits a considerable variance and skew in individual reproductive output, is key to understanding the distribution of individual reproductive success. However, it still needs to be determined whether highly fecund organisms reproduce through sweepstakes and, if they do, the relative roles of neutral and selective sweepstakes. Here, we use coalescent-based statistical analysis of population genomic data to show that selective sweepstakes likely explain recruitment dynamics in the highly fecund Atlantic cod. We show that the Kingman coalescent (modelling no sweepstakes) and the Xi-Beta coalescent (modelling random sweepstakes), including complex demography and background selection, do not provide an adequate fit for the data. The Durrett–Schweinsberg coalescent, in which selective sweepstakes result from recurrent and pervasive selective sweeps of new mutations, offers greater explanatory power. Our results show that models of sweepstakes reproduction and multiple-merger coalescents are relevant and necessary for understanding genetic diversity in highly fecund natural populations. These findings have fundamental implications for understanding the recruitment variation of fish stocks and general evolutionary genomics of high-fecundity organisms.
Editor's evaluation
This fundamental work significantly advances our understanding of genetic diversity in highly fecund organisms by showing that the Atlantic cod genome is prone to recurrent selective sweeps. The evidence supporting these conclusions is compelling, with rigorous analysis of the site frequency spectrum providing support for models of selective sweepstakes reproduction and multi-merger coalescents. This work will be of broad interest to evolutionary geneticists and should stimulate future work to determine whether random sweepstakes interact with selective sweepstakes.
https://doi.org/10.7554/eLife.80781.sa0Introduction
Individual reproductive success, the number of reproducing offspring, is a fundamental demographic parameter in ecology and evolution. The distribution of individual reproductive success affects the distribution and abundance of organisms (the subject of ecology) and the genotypic and phenotypic changes resulting from the major processes of evolution. Individual reproductive success determines individual fitness, the currency of natural selection. Many marine organisms are highly fecund, producing vast numbers of juvenile offspring that experience high mortality (type III survivorship) as they go through several developmental stages, fertilization, zygote, larvae, fry, etc. until finally recruiting as adults into the next generation. Sweepstakes reproductive success (Hedgecock, 1994), suggested having ‘a major role in shaping marine biodiversity’ (Hedgecock and Pudovkin, 2011, p. 971), is a key to understanding the mechanism behind individual reproductive success. Sweepstakes reproduction has few winners and many losers leading to very high variance and skew in individual reproductive output. High fecundity alone does not lead to sweepstakes absent a mechanism for generating high variance in and highly skewed distribution of offspring numbers.
Two main ecological mechanisms can turn high fecundity into sweepstakes reproduction that produces ‘reproductive skew’: a random and a selective mechanism. The first is the chance matching of reproduction to a jackpot of temporally and spatially favourable conditions, a case of random sweepstakes (Hedgecock and Pudovkin, 2011). The match/mismatch hypothesis (Cushing, 1969) often explains the dynamics of recruitment variation and year-class strength by the timing of reproduction with favourable but erratic environmental conditions. For example, climatic variability leads to random temporal and spatial shifts in planktonic blooms that are food for developing fish larvae, a match means reproductive success, a mismatch means reproductive failure (Cushing, 1969). By chance, a random individual hits the jackpot of favorable environmental conditions that result in a very large reproductive output of reproducing offspring (Schweinsberg, 2003; Eldon and Wakeley, 2006).
The second mechanism is selective sweepstakes in which the genetic constitution of survivors differs from that of non-survivors (Williams, 1975). Under the second scenario, an organism’s developmental stages pass through numerous independently acting selective filters with the cumulative effect of producing a high-variance high-skew offspring distribution. Here, the winning genotypes are ephemeral and must be continuously reassembled; they are the Sisyphean genotypes in a race that Williams, 1975 argued could pay the selective cost of sexual reproduction (after Sisyphus from Greek mythology, punished with forever pushing a boulder up a hill). By analogy, the population climbs a local selective peak by positive selection, but the environment changes continuously because the sequence of selective filters changes. Only a new or recombined genotype can climb the selective peak the next time around (Williams, 1975). The population forever tracks an elusive optimum by climbing an ephemeral adaptive peak. The selective filters can arise from abiotic factors, and biotic density- and frequency-dependent effects arising from inter- and intraspecific competition and from predation and predator avoidance (Reznick, 2016).
The prevailing view in evolutionary ecology is that highly fecund populations evolve without sweepstakes reproduction. Random mortality is seen as hitting every family, the offspring of every pair, to the same degree. High fecundity simply compensates for high mortality and there is no mechanism for turning high fecundity into high-variance high-skew offspring distribution. Juvenile mortality might even be compensatory and reduce the variance in offspring number via density-dependent competition or predation. In this scenario reproduction does not match favourable conditions by chance, no individual hits the jackpot, nor does selective filtering happen. The resulting offspring distribution has a much smaller variance than in the sweepstakes models, with the same low and unchanged coefficient of variation in the distribution of zygotes and the distribution of adult offspring (Nunney, 1996). This mode of reproduction is expected to result in a similar distribution of reproducing offspring as in the assumed mode of reproduction of low fecundity and model organisms (Wright, 1931; Fisher, 1930). A low variance in individual reproductive success modelled through the Wright–Fisher model (or similar models) is nearly universally assumed in population genetics (Wakeley, 2007). This is the hypothesis of no sweepstakes.
Genomics and coalescent theory offer powerful tools to test three hypotheses: non-sweepstakes versus sweepstakes reproduction and two sweepstakes hypotheses, random and selective sweepstakes. Conducting similar tests with ecological methods would be a daunting task, requiring one to follow the fate of the offspring of different individuals (Grant and Grant, 2014). The first hypothesis can be tested by identifying the footprint of non-sweepstakes versus sweepstakes reproduction in population genomic data. The second and third hypothesis tests for random versus the selective sweepstakes, given evidence of sweepstakes reproduction in the data.
The classical Kingman coalescent (Kingman, 1982) models the reproduction of low-fecundity organisms (Appendices 1 and 2). Multiple-merger coalescents (Donnelly and Kurtz, 1999; Pitman, 1999; Sagitov, 1999; Schweinsberg, 2000; Schweinsberg, 2003) describe the genealogies for the random and the selective sweepstakes reproduction. The Xi-Beta coalescent (Schweinsberg, 2000; Birkner et al., 2018) models the genealogy of a population with large reproductive events in which a random individual has enormous reproductive success and well approximates the random or jackpot sweepstakes hypothesis (Hedgecock and Pudovkin, 2011; see Appendix 3). The Durrett–Schweinsberg model of recurrent selective sweeps (Durrett and Schweinsberg, 2005), implying an ever-changing environment that continuously favors new mutations, well approximates selective sweepstakes (Williams, 1975) (see Appendix 4). The multiple-merger Durrett–Schweinsberg coalescent describes the genealogy of a neutral site linked to a site hit by a favorable mutation that rapidly sweeps to fixation. The neutral site can escape via recombination (see Appendix 5).
The empirical evidence for reproductive sweepstakes leading to reproductive skew is limited (e.g. Árnason, 2004; Árnason and Halldórsdóttir, 2015; Niwa et al., 2016). Empirical evidence for variance in reproductive success due to life-table characteristics has been found using genome-wide polymorphism data in marine fishes (Barry et al., 2021). Reproductive skew needs to be studied using gene genealogies on a genome-wide scale. Multiple-merger coalescents occur in models of rapidly adapting populations (Neher and Hallatschek, 2013; Schweinsberg, 2017), under both directional selection (Neher, 2013; Sackman et al., 2019) and possibly strong purifying (background) selection (Irwin et al., 2016; Cvijović et al., 2018). However, background selection is not, in general, expected to mimic selective sweeps (e.g. Durrett and Schweinsberg, 2005; Schrider, 2020). Sweepstakes reproduction may apply to many organisms and may be more prevalent than previously thought. There is, therefore, a need for a critical examination of the contrasting hypotheses.
Here, we compare genomic sequences for the highly fecund Atlantic cod (Gadus morhua) to predictions of three coalescent models: the Kingman, 1982 with arbitrary demographic histories, the neutral Xi-Beta or formally the -Beta() coalescent (Schweinsberg, 2000; Schweinsberg, 2003; Birkner et al., 2018) modelling random jackpot sweepstakes in diploid, highly fecund organisms, and the Durrett–Schweinsberg coalescent derived from a population model with recurrent selective sweeps (Durrett and Schweinsberg, 2005; Appendix 1). We analyze whole-genome sequences (at 16× and 12× coverage, respectively) of Atlantic cod sampled from two localities in Iceland, with the localities serving as statistical replicates (Appendix 6—figure 1). We also consider whether other mechanisms can explain the observed patterns by examining the effects of population expansion, cryptic population structure, balancing and background selection, and the joint action of several processes.
Results
Neutrality under no sweepstakes?
Genomic scans of Tajima’s and Fay and Wu’s neutrality test statistics (for GL1 and GL2 genotype likelihoods in both populations; Figure 1a, Figure 1—figure supplement 1, and Appendix 6—figure 2a, b, and Appendix 7—table 1 and Appendix 7—table 2) showed extensive and genome-wide deviations from expectations of neutral equilibrium under the classical theory, including indications consistent with selective sweeps occurring throughout the genome (Fay and Wu, 2000; Zeng et al., 2006; Przeworski, 2002; see Appendix 8). The McDonald–Kreitman test (McDonald and Kreitman, 1991) and the neutrality index derived from it (Rand and Kann, 1996), also indicated positive selection. The neutrality index under neutrality, negative values of indicate negative or purifying selection, and positive values indicate positive selection. Our estimates showed both negative and positive selection effects distributed throughout each chromosome (Figure 1—figure supplement 2). On a local genomic scale, the distribution of the neutrality index was heavier on the side of positive selection, although only a minority of individual tests reached nominal significance and none was significant after taking multiple testing into account (Figure 1b). On a genome-wide scale, the mean and the median of were 0.27 and 0.21, respectively, and the estimated proportion of adaptive non-synonymous substitutions (Smith and Eyre-Walker, 2002) was 19–24%.
The classic no-sweepstakes model with population growth (such as post-Pleistocene population expansion, Hewitt, 2004) is known to affect primarily the singleton class and left tail of the site-frequency spectrum. Fitting a no-sweepstakes with population growth model to the data under the Kingman coalescent provided indication of historical expansion (see Figure 2 and Figure 2—figure supplement 1). However, the plausible demographic growth scenarios did not markedly improve the fit of neutral models without sweepstakes. We, therefore, reject the no-sweepstakes hypothesis.
Random versus selective sweepstakes?
The life table of cod (Appendix 9 and Appendix 7—table 3), showing an exponential decay of the number of survivors with age and an exponential increase in fecundity with age, implies that fewer and fewer individuals produce a larger and larger number of eggs. A few females may live 25 years or more and still increase fecundity with age. Thus, the life-table results in a large variance and skew in offspring number. Old surviving females may be the lucky few to be alive or they may be the very fit that have passed all selective filters.
We next compared our observations to predictions of the -Beta coalescent, which models random jackpot sweepstakes reproduction in a diploid highly fecund population. Here, the parameter determines the skewness of the offspring distribution, in essence, the jackpot size. A smaller essentially means a larger jackpot. We used a range of approximate Bayesian computation (ABC) posterior estimates of the parameter (Appendix 3). The observed site-frequency spectra were overall more V-shaped than the U-shape of the expected normalized site-frequency spectrum predicted by this model (Appendix 3; Appendix 6—figure 3a, b). Singletons and low-frequency variants were the closest to expectations of an (Appendix 6—figure 3). However, as the derived allele frequency increases, the observations were closer to a smaller and smaller (as small as ) predictions. The expected site-frequency spectrum of this model shows local peaks at intermediate allele frequencies, which represent the expected simultaneous multiple mergers of two, three, and four groups, corresponding to the four parental chromosomes involved in each large reproduction event. In diploid highly fecund populations, a single pair of diploid parents may occasionally produce huge numbers of juvenile offspring (Möhle and Sagitov, 2003; Birkner et al., 2013a; Birkner et al., 2018). The observations did not show these peaks (Appendix 6—figure 3a, b). The expectations of this model were also mainly outside the bootstrap error bars of the observations (Figure 3). However, comparing the observed site-frequency spectra to expectations of the haploid -Beta() coalescent, a haploid version of random sweepstakes (Appendix 6—figure 4), showed a better fit. Low-frequency variants fit reasonably well to an . However, as the derived allele frequency increased, a smaller and smaller (as small as , the Bolthausen–Sznitman coalescent) gave a good fit. This is likely a signal of either positive or negative natural selection. Rare alleles (less than 10–12%) contribute little to the variance in fitness. Once an allele (a site) reaches an appreciable and intermediate frequency it can contribute significantly to the variance in fitness such that selection quickly moves it out of intermediate frequency ranges. Negative selection moves it to a low frequency, and positive selection moves it to a high frequency so alleles spend a short time at intermediate frequencies (sojourn times are short). The fact that a haploid -coalescent model fits a diploid organism better than the corresponding diploid -coalescent is suggestive of natural selection, where fitter offspring descend from one particular parental chromosome out of the available four. The parameter determines the skewness of the offspring distribution in the -Beta-coalescent. But that model has no known interpretation for an explicitly selection-driven skewness (except in the particular case , the Bolthausen–Sznitman coalescent (Neher, 2013; Neher and Hallatschek, 2013), which did not adequately fit our data). Hence, the -Beta-coalescent is not an appropriate model for a diploid organism and remains difficult to intepret. Furthermore, we used ABC to estimate jointly the parameters and , where denotes a population size rescaled rate of exponential growth of the population forward in time, using the -Beta() coalescent (Appendix 3). The processes generating reproductive skew and population growth, can account for some features of the site-frequency spectrum. Thus, by jointly estimating the skew parameter and the growth parameter we hope to obtain a more accurate understanding of the observed data. The resulting posterior distribution showed small values of both parameters (Figure 2 and Figure 2—figure supplement 1) implying strong reproductive skew and little population growth. That the distribution of the growth parameter spread more with greater reproductive skew (as ) is not surprising, as population size is known to affect the model only weakly when the reproductive skew is pronounced. Furthermore, the impact of variable population size vanishes entirely when reproductive skew is maximum () (Freund, 2020; Koskela and Wilke Berenguer, 2019). Earlier work (Matuszewski et al., 2017), using a model in which a single individual reproduces each time and occasionally wins the jackpot whose size is constant over time, also found reproductive skew over demographic expansion in Japanese sardines. We used a more realistic model (Schweinsberg, 2003), in which the whole population reproduces simultaneously, however, a single random female occasionally hits a jackpot, whose size will vary over time.
The -Beta() model of random sweepstakes showed that reproductive skew is a more likely explanation than demographic expansion under the classical Kingman model and the model predicts an upswing, as observed at the right tail of the site-frequency spectrum. It nevertheless cannot adequately explain our data. There were systematic deviations from expectations of the model (see residuals in Figure 4a, b and Figure 4—figure supplement 1a, b). The deviations were nearly symmetrical around a derived allele frequency of 50% (logit of 0), and rare (less than 12%, logit of −2) and common alleles (greater than 88%, logit of 2) were too frequent. In contrast, intermediate alleles were too few compared to model expectations. The deviations immediately suggest the action of positive natural selection by selective sweeps.
Therefore, we investigated the hypothesis of selective sweepstakes by comparing our observations to predictions of the Durrett–Schweinsberg coalescent derived from the Durrett–Schweinsberg model (Appendix 4). In the Durrett–Schweinsberg model, a random site on a chromosome is hit by a beneficial mutation that, with a certain probability, goes to fixation in a time measured in coalescent time units, where is the population size. The beneficial mutation sweeps with its neutral sites that are some recombinational distance from the selected site (Durrett and Schweinsberg, 2005; Nielsen, 2005). Distant sites are more likely to escape this hitchhiking effect than neighbouring sites because of larger recombination rates. Even though the model is built from a whole chromosome undergoing recurrent selective mutations, the resulting coalescent only describes a single site under the joint effect of hitchhiking and recombination (Nielsen, 2005). Thus, the model cannot make joint predictions about several sites, such as measures of linkage disequilibrium. In Appendix 4, we propose a two-site extension of the Durrett–Schweinsberg model in the restricted case of two sampled sequences, facilitating predictions of linkage disequilibrium. This model of recurrent selective sweeps explained our results for all subsets of the data (and GL1 and GL2 in both populations Figure 3, Figure 3—figure supplement 1, and Appendix 6—figure 5). We also considered the potential effects of SNP misorientation and low-level ancestral introgression (Baudry and Depaulis, 2003; Hernandez et al., 2007; Schumer et al., 2018) (Appendix 10). Polarizing the site-frequency spectra with a 100% consensus of several outgroup sequences (Figure 3—figure supplement 2), did not change the overall pattern. Considering transversions only (Figure 3—figure supplement 4) (avoiding mutational saturation of transitions, Agarwal and Przeworski, 2021) also did not change the overall pattern. Finally, truncating the site-frequency spectra (by removing the singleton and doubleton and and class of sites most affected by SNP misorientation) also did not change the overall results (Figure 3—figure supplement 4). Linkage disequilibrium decays rapidly to background values (Appendix 6—figure 6a) as the Durrett–Schweinsberg model requires. The decay of linkage disequilibrium observed in the data was also consistent with predicted results from our two-site, two-sample extension of the Durrett–Schweinsberg model (Appendix 4 and Appendix 6—figure 6b), provided that small fractions of sweeps (on the order of 10%) are taken to affect the whole chromosome regardless of recombination. Such ‘sweeps’ are characteristic of e.g. population bottlenecks. The compound parameter of the Durrett–Schweinsberg model measures the rate of selective sweeps (δ) times the squared selection coefficient () of the beneficial mutation over the recombination rate (γ) between the selected site and the site of interest. The compound parameter is essentially the density of selection per map unit along the chromosome (Aeschbacher et al., 2017). ABC estimates yielded similar values across all replicated data sets, an average of about 10 that is 10 times more frequent than the coalescence rate of a sample with a low variance mode of reproduction described by the classical Kingman coalescence (Figure 5 and Appendix 6—figure 7). The estimated compound parameter was correlated with functional constraints and importance of sites indicating a higher selection density per generation per genetic map unit in exons and UTRs (Figure 5). The residuals of the fit to the Durrett–Schweinsberg coalescent (Figure 4 and Figure 4—figure supplement 1c, d) showed deviations that were both smaller and opposite the deviations of those of the neutral -Beta() model (Figure 4a, b and Figure 4—figure supplement 1a, b) with intermediate frequency classes being too frequent. The Durrett–Schweinsberg model is essentially haploid. We suggest that a diploid model, where dominance generates two phenotypes such that selection acts on pairs of chromosomes jointly rather than single chromosomes as in the Durrett–Schweinsberg model would provide an even better fit. However, developing a diploid multi-locus version of the Durrett–Schweinsberg model is outside the scope of the present work. Nevertheless, a comparison of our data with predictions of the Durrett–Schweinsberg model, in particular in comparison with our additional analysis, is perfectly valid. Overall, the selective sweepstakes hypothesis embodied in the Durrett-Schweinsberg coalescent (Durrett and Schweinsberg, 2005) modelling recurrent selective sweeps, in essence, explained our data, whereas the hypothesis of low-variance reproduction and one of random sweepstakes did not.
We took several steps to investigate and consider the effects of selection and recombination on the observed patterns of allele frequencies. We did a principal component (PC)-based genome-wide scan of selection (using PCangsd; Meisner and Albrechtsen, 2018) and detected several peaks (Appendix 6—figure 8). We used sites that are at least 500 kb away from selective peaks. We refer to these as non-selection sites. We extracted sites from the genome that are likely under different selective constraints. We thus extracted fourfold degenerate sites (referred to as 4Dsites), intron sites, intergenic sites, promoter sites, UTR sites, UTR sites, and exon sites. The less constrained sites are not necessarily neutral to selection. For example, although silent at the protein level, mutations at fourfold degenerate sites could affect transcriptional and translational efficiency and mRNA stability, thus giving rise to selection for or against such sites. However, the first three sets of sites are generally considered less constrained and the other sets are more constrained by selection. The resulting site-frequency spectra and parameter estimates ranked according to selective constraints (Figures 3 and 5, and Appendix 6—figure 7).
Furthermore, we used OmegaPlus (Alachiotis et al., 2012) and RAiSD (Alachiotis and Pavlidis, 2018) to detect selective sweeps genome-wide. Both methods use local linkage disequilibrium to detect sweeps (Nielsen, 2005). In addition, RAiSD uses a local reduction in levels of polymorphism and shifts in the frequencies of low- and high-frequency derived alleles affecting, respectively, the left and right tails of the site-frequency spectrum. Both methods indicated pervasive selective sweeps on all chromosomes (Figure 6). We also used SLiM (Haller and Messer, 2019) to simulate positive selection under the no-sweepstakes Wright–Fisher model and a random sweepstakes model in the domain of attraction of a Xi-Beta coalescent (Appendix 6—figure 9). We tried various forms of dominance of selection among diploid genotypes (semidominance, and full dominance, ) with different strengths of selection (selection coefficient ). The model of the successive selective pass or fail filters suggests that lacking a function (a derived allele) is a failing genotype while having a function (derived allele) is a passing genotype as modelled by full dominance. The observation of the heavy mortality of immatures (type III survivorship, Appendix 7—table 3) therefore suggests a model of selection against a recessive lethal and for a dominant. This is a two-phenotype model for a diploid organism. The results of the SLiM simulations of positive selection (Appendix 6—figure 9) gave site-frequency spectra that were qualitatively similar to the observed spectra. Selection for a semidominant phenotype produced more U-shaped spectra while selection for a dominant produced more V-shaped spectra similar to those observed. Recurrent hard sweeps interrupting the standard Kingman coalescent (simulated using msprime; Baumdicker et al., 2021) produced U-shaped site-frequency spectra (Appendix 6—figure 10) that are qualitatively similar to our data from the South/south-east coast.
Synopsis of results
We have shown that the Durrett–Schweinsberg coalescent modelling recurrent selective sweeps affecting linked sites gives the best fit for our observations (Figure 3). By extension, the hypothesis of reproduction by selective sweepstakes is best supported by our data. The Kingman coalescent and Wright–Fisher model of reproduction, without a strong positive selection of recurrent strongly beneficial mutations (Appendix 6—figure 9 and Appendix 6—figure 10), cannot explain our data. Similarly, the model of random sweepstakes, the Xi-Beta coalescent, in which a random individual has windfall reproductive success, although fairing better than the Kingman coalescent nevertheless cannot explain the observations. In Appendix 11, we ask if other processes can better explain the observed patterns and provide a detailed analysis of alternatives. Through analysis and forward and backward simulations, we consider historical demography under low-variance reproduction (see Appendix 12), the potential confounding due to cryptic population structure (see Appendix 13), the effects of balancing selection of large inversions (see Appendices 14 and 15), the effects of negative and background selection (see Appendix 15), the joint action of several evolutionary mechanisms (see Appendix 16), and the potential effects of SNP misorientation from parallel mutation and low-level ancestral introgression (see Appendix 16). Although some alternative mechanisms can come close to observations under some parameter values, they did not provide a satisfactory fit overall (see Appendix 11). Historical demographic expansions or contractions do not explain our data (Appendix 6—figure 11 and Appendix 6—figure 12). Analysis of potential cryptic population structure does not provide answers to our patterns (Appendix 6—figure 13). Similarly, modelling sampling from divergent populations, a combination of extreme parameter values can produce patterns similar to the observed patterns (Appendix 6—figure 14 and Appendix 6—figure 15). However, a leave-one-out analysis of our sample shows that our sample was not produced under such extreme parameter values (Appendix 6—figure 16). There are clear signals of balancing selection of large inversions at four chromosomes (see Appendix 14, Appendix 6—figure 17, and Appendix 6—figure 18). However, balancing selection does not change the overall shape of the site-frequency spectrum of these chromosomes, which is the summary statistic we use for our analysis. Simulations of background selection show that a narrow window of parameter space can resemble observed patterns. Still, in general, background selection does not fit our results (Appendix 6—figure 9d and Appendix 6—figure 19). Finally, simulations of the joint action of several evolutionary mechanisms, notably of demography and background selection with or without selective sweeps, do not produce qualitatively accurate U-shaped site-frequency spectra similar to the observed except in simulations that included selective sweeps (Appendix 6—figure 19).
Discussion
Understanding recruitment dynamics and what shapes the distribution of individual reproductive success is a fundamental challenge in evolutionary genomics of high-fecundity organisms. It is key to further understanding metapopulation and community dynamics, predicting response to anthropogenic challenges, for conservation and management, and further development of the ecological and evolutionary theory (Eldon, 2020). We show that selective sweepstakes, modelled by a particular example of the Durrett–Schweinsberg multiple-merger coalescent derived from a population model of recurrent selective sweeps (Durrett and Schweinsberg, 2005), essentially explain our data. Even a model of recurrent but incomplete selective sweeps (Coop and Ralph, 2012) similarly leads to U-shaped site-frequency spectra generated by a multiple-merger coalescent model similar to the Durrett–Schweinsberg model. We further show that neither non-sweepstakes reproduction nor random-sweepstakes reproduction can explain our data. Other biologically plausible scenarios (e.g. historical demographic changes, cryptic breeding structure, and background selection) show a much poorer fit. Our results indicate that strong pervasive positive natural selection is pivotal in reproductive sweepstakes, more so than in windfall sweepstakes (Hedgecock and Pudovkin, 2011).
The random sweepstakes -Beta() model assumes a single-pair mating with enormous reproductive output. However, cod is a batch spawner in which a female may pair with a different male for each batch, with potential sneaker males participating in fertilization as well (Hutchings et al., 1999; Nordeide, 2000). This mating system may result in larger fertilization success than in monogamous broadcast spawning. The -Beta() models the simultaneous coalescence of the four parental chromosomes involved in a large reproductive event, the random sweepstakes. The cod mating system implies that the two maternal chromosomes of a female combine with many pairs of paternal chromosomes with more genetic diversity than in a high-fecundity monogamous system. However, how such a mating system affects the coalescent and the shape of the site-frequency spectrum is unclear.
We have considered models based on haploid reproduction, or diploid reproduction with monogamous pairs. It is possible that a two-sex model which more accurately reflects the mating traits of cod, in which many different males can fertilize the eggs of one female (Hutchings et al., 1999; Nordeide, 2000), may further improve the fits we have obtained. Birkner et al., 2018 provide a framework for such modelling. We have chosen to use simpler, monogamous models as a starting point for our analysis and leave the development of a parsimonious mating structure model for future work.
By the time an advantageous mutation reaches the exponential phase of the Durrett–Schweinsberg process, recombination during the lag phase will have broken up the initial linkage disequilibrium of a new mutation to a haplotype composed of a chromosomal segment. The evolution of that mutation is haplotype specific. In contrast, random sweepstakes would increase the frequency of genomes of the reproducing pair. The Durrett–Schweinsberg model assumes a Kingman coalescent interrupted by a selective sweep. However, the Durrett–Schweinsberg model needs to better capture the low-frequency singleton and doubleton class of sites. In contrast, the random sweepstakes -Beta() very well captures the low-frequency singleton and doubleton class of sites. It is possible that mutations are entering the population under random sweepstakes, many being lost but some drifting to a high enough frequency that they contribute sufficiently to the variance in fitness to be grabbed by the selective process and swept to fixation. Can random sweepstakes possibly also increase the frequency of variants and thus facilitate selective sweepstakes? There is a larger variance of allele frequencies under random sweepstakes, so that many variants will be lost. We leave for future work the question of whether random sweepstakes interact with selective sweepstakes in this way. Interpreting the Durrett–Schweinsberg model as approximating selective sweepstakes, we conclude that our findings are strong evidence for selective sweepstakes (Williams, 1975) characterizing the distribution of individual reproductive success of the highly fecund Atlantic cod. Under the Durrett–Schweinsberg coalescent of recurrent selective sweeps, the rise in frequency of new mutations each time, happens rapidly compared to the coalescent timescale. The continuous input of new beneficial mutations represents the Sisyphean genotypes that forever climb a selective peak under Williams’ concept of selective sweepstakes (Williams, 1975). By extension, selective sweepstakes are the life history of highly fecund organisms with skewed offspring distribution.
Recurrent bottlenecks may mimic the effects of recurrent selective sweeps (Galtier et al., 2000). The duration, depth, and rate of recovery of a bottleneck (Nei et al., 1975) relative to the timescale of recurrent sweeps under the Durrett–Schweinsberg model is an important issue. A small number of individuals having large numbers of descendants due to a bottleneck and rapid recovery or due to a selective sweep will in both cases lead to multiple mergers in the genealogy. Our simulations of random sweepstakes with recurrent bottlenecks yield roughly a U-shaped site-frequency spectrum, but the fit is not as good as for the selective sweepstakes model. In the Durrett–Schweinsberg model, interpreting a small fraction of sweeps (on the order of 10%) as chromosome-wide sweeps or population bottlenecks resulted in a model which was able to explain the decay of linkage disequilibrium observed in Atlantic cod, without affecting the good fit of the site-frequency spectrum. Overall, therefore, the Durrett–Schweinsberg model explains our data although it is formally only applicable to single-locus data from a haploid species (the resulting coalescent process traces the genealogy of a single site), assumes a constant population size, disallows competing, simultaneous sweeps (Kim and Stephan, 2003), and only models hard sweeps. Both soft and incomplete sweeps and recombinational breakups likely occur in cod. Our estimator will pick up the effects of such sweeps. Incomplete sweeps and LD-based measures (Nielsen, 2005; Sabeti et al., 2006; Sabeti et al., 2007) may be neccessary, particularly in connection with the further extensions of the Durrett–Schweinsberg model that we present in Appendix 6—figure 6. Further extending the model is an avenue for future work.
High-fecundity matters in two ways in this process. First, each round of replication results in many new mutations in the genome of a new gamete. Even though the probability of a positive mutation is very small, the millions of gametes produced by each female multiplied by the billions of individuals in a population ensure a steady input to the population of new positive mutations to each generation. Second, high fecundity makes available a large reproductive excess which permits substitutions to occur at high rates by natural selection without the population going extinct (Felsenstein, 1971). The reproduction of a high-fecundity organism compares with the replication of a virus in an epidemic. Each infected individual produces hundreds of billions of viral particles. Even with a tiny proportion of positive mutations, the numbers of new mutations are so enormous that it is all but certain that an epidemic produces a steady stream of more contagious and fitter viral variants that sweep to fixation by selection. If the population crashes (Hutchings, 2000) the mutational input of adaptive variation diminishes. The population may run out of fuel for responding to environmental challenges via selective sweeps and go extinct (Felsenstein, 1971). Kimura’s neutral theory of molecular evolution and polymorphisms (Kimura, 1968) relied on excessive genetic load based on Haldane’s dilemma (Haldane, 1957) that the cost of adaptive substitution would limit the rate of evolution lest the population go extinct (Felsenstein, 1971). Truncation selection of continuously distributed characters, where genetic and nongenetic factors independently affect the probability of survival and act cumulatively in each individual (Williams, 1975), mitigates the genetic load (King, 1967; Sved et al., 1967). Our considerations of full dominance with selection against a lethal homozygote would entail a large genetic load. However, there can be strong selection in one habitat patch and near neutrality in another due to differences in competition and predation. The marginal fitness differences would then be less but such soft selection (Wallace, 1975; Reznick, 2016) would not drive the population to extinction (Charlesworth, 2013). Marginal fitness would still preserve full dominance and a two phenotype selection scheme and thus behave similar to the haploid Durrett–Schweinsberg model. The high fecundity and consequent excessive reproductive capacity in our study organism may also alleviate the genetic load problem. However, both loss of mutational input and genetic load (a case of selective extinction) may nevertheless be a factor in the non-recovery of a population following a crash (Hutchings, 2000). Does cod have the reproductive capacity to substitute adaptive alleles at a high rate without going extinct from the substitution load (Kimura and Maruyama, 1969)? If each high-fitness fish has offspring, which survive long enough to reproduce, and the selective sweep starts from a single individual, then after one generation, there are fit fish; after 2 there are ; after 3 there are , and so on. At sweep’s completion, there are fit descendants after generations. For the sweep to complete, we thus need , or , the base of the natural logarithm. As a numerical example, it is immaterial whether we assume a population size of a billion () and duration of a sweep 20.7 generations or a population size of a trillion () fish and a sweep 27.6 generations. The reproductive excess required is 2.71 or approximately three fit offspring that make it to reproduction. In practice, the number will have to be larger because not all fit offspring will survive to reproduce and because our estimated frequency of sweeps was large enough that 20–30 generations might be a bit too long. However, there is no reason to think that the cod population would not have the reproductive capacity to support selective substitution at the estimated rate.
Our estimate of the rate of selective sweeps (Appendix 17) amounts to mergers of ancestral lineages of our sample happening because sweeps occur at 5–18 times larger rates than mergers due to ordinary low-variance reproduction (Figure 5). In the classical model, the coalescence rate is on the order of the population size, or generations, but the duration of selective sweeps is on the order of generations. If we assume that there is a billion cod in the Icelandic population, this is some 20 generations or about 100 years from when a beneficial mutation arises until fixation. With the sigmoid nature of the positive selection curve, with a lag phase followed by an exponential phase and ending in a stationary phase, the main action of selection bringing an allele from a low frequency to a high frequency during the exponential phase may only take a few generations, perhaps 15–20 years. Erratic climatic variability, such as the great salinity anomalies (Cushing, 1969; Dickson et al., 1988) in the North Atlantic, which can greatly affect cod reproduction and ecology, is detectable over decadal timescales, similar time span as the exponential phase of our estimated selective sweeps.
We estimate that each chromosome in Atlantic cod is affected by a selective sweep every 23–50 years on average (Appendix 17). Since we also see evidence of rapid recombination (Appendix 6—figure 6), we expect that any single sweep will not strongly affect a large region of a chromosome. The rapid recombination will modulate the genomic footprints of sweeps. There is clear evidence that sweeps occur everywhere along the genome (in chromosomal fragments of different sizes, different functional groups, and on all chromosomes Figure 5 and Appendix 6—figures 3 and 17 and Appendix 6—figure 18). It is, therefore, likely that the true rate of sweeps is even faster than our estimate. For example, if an average sweep were to affect 10% of a chromosome, we would expect to see sweeps every 3–4 years or roughly once a generation to explain our results. Building a fully quantitative, data-informed picture of the rate of sweeps requires developing a diploid, genomic version of the Durrett–Schweinsberg model, which is currently absent from the literature, and for which task our results provide strong applied motivation.
The higher positive than negative selection rate is similar to findings in Drosophila and different from humans and yeast, where negative selection predominates (Li et al., 2008). Similarly, the proportion of adaptive non-synonymous substitutions is lower but in the direction of the results of Drosophila (Bierne and Eyre-Walker, 2004; Sella et al., 2009). Our study is of a locally circumscribed population compared to a more geographically diverse sampling of the fly. A global sample of cod would likely show an even higher proportion of advantageous mutations.
Is the large substitution rate of one sweep per year even possible? If we accept a 3.5 Mya split of Atlantic cod and walleye pollock (Vermeij, 1991; Vermeij and Roopnarine, 2008; Coulson et al., 2006; Carr and Marshall, 2008) the rate of one substitution per year (Appendix 7—table 4) would translate into 3.5 M site difference between the taxa. Appendix 7—table 4 also shows that the distance (proportion of nucleotide differences per nucleotide) is 0.005, and with a 685-Mb genome, yields a 3.4-M site difference between the taxa, a fair agreement. But it is unlikely that all substitution is by selection or hitchhiking. Although the proportion of adaptive substitutions () is substantial, there is also a role for random genetic drift in substitution. Our findings highlight genetic hitchhiking as a key driver of substitutions in cod. The fitted value can be thought of as a rate with which hitchhiking drives a given (neutral) mutation towards fixation, in contrast to a rate of 1 for genetic drift as modelled by the Kingman coalescent. However, as a compound parameter, does not carry direct information about the abundance of neutral versus selective mutations. This is comparable to Drosophila, for example, millions of differences between melanogaster and simulans, in which many adaptive substitutions occur (e.g. Fay et al., 2002; Smith and Eyre-Walker, 2002; Andolfatto, 2005; Eyre-Walker, 2006). We can ask (Eyre-Walker, 2006) what for is all this adaptive variation? Where are the camel’s hump and elephant’s trunk of cod? We answer that the optimal phenotype is mostly ephemeral (although balanced polymorphic inversions may tie up some long-duration adaptive variations). The population is not going anywhere in particular. This is evolution by selective sweepstakes, metaphorically a Red Queen race (Van Valen, 1973; Strotz et al., 2018) to stay in the game against nature (Lewontin, 1961).
Our findings provide a new perspective on coalescent models in population genetics and genomics. For the first time, a test involving genomic data, that is, using copies of chromosomes from several pairs of homologous chromosomes, was made on the contrasting hypotheses of reproduction using multiple-merger coalescents in a diploid organism. It is also the first time multiple-merger coalescent models based on neutral evolution and selection are contrasted. Previously, two neutral -coalescents have been compared to data of outbreaks of the tuberculosis bacterium and the Bolthausen–Sznitman coalescent () used to model rapid selection (Menardo et al., 2019). Our findings have repercussions for and give impetus to further theoretical development of multiple-merger coalescents, particularly for multiple-merger coalescent models of strong selection. Our work motivates the construction of joint models featuring neutral and selective sweepstakes. As a starting point, we expect selective sweeps akin to the Durrett–Schweinsberg model could be incorporated into the Schweinsberg (Schweinsberg, 2003) pre-limiting population models giving rise to the Beta-coalescent. To affect the infinite-population limit, selective sweeps would have to occur on the same fast timescale of generations as neutral multiple mergers. Even on this timescale, selective sweeps lasting generations will be instantaneous resulting in multiple mergers in the coalescent limit. An intriguing possibility is that the Durrett–Schweinsberg selective sweeps could account for some of the observed deviation from the Kingman coalescent, the combined model, might yield substantially higher best-fit estimates of than those obtained from the more restrictive Beta-coalescent. Low values of result in implausibly short timescales for evolution, and a combined neutral-and-selective sweepstakes model has the potential to avoid this defect.
We suggest that sweepstakes reproduction is much more common than previously thought. It is essential to understand sweepstakes and the natural and anthropogenic ecological processes conducive to sweepstakes (Hedgecock and Pudovkin, 2011; Williams, 1975). Are selective sweepstakes (Williams, 1975) the rule, or is there a role for random sweepstakes (Hedgecock and Pudovkin, 2011; Vendrami et al., 2021)? It is possible that big-bang, the semelparous reproductive strategy of reproducing once before dying, is a sweepstakes reproduction if ecological mechanisms generate a high-variance, highly skewed offspring distribution. This mode of reproduction characterizes many annual plants, a myriad of insects, and vertebrates such as Pacific salmon (Oncorhynchus) and Arctic cod (Boreogadus saida), a close relative of Atlantic cod. We further posit that sweepstakes may be the mode of reproduction of viruses (Timm and Yin, 2012) as inferred from the overdispersion of offspring distribution from superspreader individuals and events (Endo et al., 2020), some cancer cells (Kato et al., 2017), and various bacteria (Wright and Vetsigian, 2019; Menardo et al., 2019; Ypma et al., 2013). Fungi and plant pathogens, which cause extensive crop losses of great economic importance (Pimentel et al., 2000), may also reproduce by sweepstakes. Similarly, many repeat reproducers, the iteroparous reproductive strategy, produce vast numbers of tiny eggs in each reproductive season. It applies to many marine organisms such as oysters (Hedgecock and Pudovkin, 2011), and Atlantic cod and its Pacific relatives (Árnason and Halldórsdóttir, 2019) that support large fisheries of great economic importance. The dynamics of all these systems can be profitably studied using multiple-merger coalescents (Freund et al., 2022), be they generated by random or selective sweepstakes.
Materials and methods
Sampling
Request a detailed protocolWe randomly sampled adults from our extensive tissue collection (Árnason and Halldórsdóttir, 2015; Halldórsdóttir and Árnason, 2015) from two localities in Iceland, the South/south-east coast () and Þistilfjörður on the north-east coast () (Appendix 6—figure 1). The Icelandic Marine Research collected the fish during spring spawning surveys (Árnason and Halldórsdóttir, 2015). All fish selected here had running gonads (eggs and milt with maturity index 3), indicating they were spawning at the capture locality.
Ethics statement
Request a detailed protocolThe Icelandic Committee for Welfare of Experimental Animals, Chief Veterinary Officer at the Ministry of Agriculture, Reykjavik, Iceland has determined that the research conducted here is not subject to the laws concerning the Welfare of Experimental Animals (The Icelandic Law on Animal Protection, Law 15/1994, last updated with Law 157/2012). DNA was isolated from tissue taken from dead fish on board research vessels. Fish were collected during the yearly surveys of the Icelandic Marine Research Institute (and other such institutes as already described Árnason and Halldórsdóttir, 2019). All research plans and sampling of fish, including the ones for the current project, have been evaluated and approved by the Marine Research Institute Board of Directors, which serves as an ethics board. The Board comprises the Director-General, Deputy Directors for Science and Finance and heads of the Marine Environment Section, the Marine Resources Section, and the Fisheries Advisory Section.
Molecular analysis
Request a detailed protocolWe shipped tissue samples of cod from the South/south-east coast population of Iceland to Omega Bioservices. Omega Bioservices isolated genomic DNA using the E-Z 96 Tissue DNA Kit (Omega Biotek), made picogreen DNA sample quality checks, made sequencing libraries using Kapa Hyper Prep WGS (Kapa Biosystems), used Tapestation (Agilent Technologies) for sizing libraries, and sequenced libraries on a 4000/X Ten Illumina platform with a 2 × 150-bp read format, and returned demultiplexed fastq files.
Genomic DNA was isolated from the tissue samples of Þistilfjörður population using the E-Z 96 Tissue DNA Kit (Omega Biotek) according to the manufacturer’s recommendation. The DNA was normalized with elution buffer to 10 ng/µl. The normalized DNA was analyzed at the Bauer Core of Harvard University. According to the manufacturer’s recommendation, the Bauer Core used the Kapa HyperPrep Plus kit (Kapa Biosystems) for enzymatic DNA fragmentation and adapter ligation, except that the reaction volume was 1/4 of the recommended volume. The target insert size was 350 base pairs (bp) with a resulting average of 487 bp. The libraries were indexed using IDT (Integrated DNA Technologies) unique dual 8 bp indexes for Illumina. The Core uses Tapestation (Agilent Technologies) and Picogreen qPCR for sizing and quality checks. Multiplexed libraries were sequenced on NovaSeq (Illumina) S4 lanes at the Broad Institute with a bp read format, and demultiplexed fastq files were returned.
Bioinformatic analysis
Request a detailed protocolThe sequencing centres returned de-multiplexed fastq files for different runs of each individual. Data processing followed the Genome Analysis Toolkit (GATK) best practices (Van der Auwera et al., 2013) as implemented in the fastq_to_vcf pipeline of Alison Shultz (https://github.com/ajshultz/comp-pop-gen; Shultz, 2020). Using the pipeline the raw reads were adapter trimmed using NGmerge (Gaspar, 2018), the trimmed fastq files aligned to the gadMor3.0 chromosome-level reference genome assembly (NCBIaccessionID:GCF_902167405.1) using bwa mem (Li and Durbin, 2009), and the resulting bam files deduplicated, sorted, and indexed with gatk (Van der Auwera et al., 2013).
The deduplicated bam files were used for population genetic analysis with ANGSD (Korneliussen et al., 2014). We have sequenced a large sample of cod from various localities in the North Atlantic and performed both principal component (PCA) and admixture analysis using PCangsd (Meisner and Albrechtsen, 2018) revealing some population substructure and possible admixture (unpublished results). To minimize the effects of potential population substructure we screened the individuals of the two samples in this study and ensured that they are members of the same cluster detected by PCA and assigned to the same population detected with admixture. This filtering also addresses the issue of potential SNP misorientation and ancestral admixture discussed below. In order to polarize sites for estimation of site-frequency spectra outgroup fasta sequences were generated with -dofasta 3, which chooses a base using an effective depth algorithm (Wang et al., 2013). A high coverage specimen (Árnason and Halldórsdóttir, 2019) from each of Pacific cod Gadus macrocephalus (labelled Gma), walleye pollock, also from the Pacific, G. chalcogrammus (labelled Gch), Greenland cod G. ogac (labelled Gog), and Arctic cod Boreogadus saida (labelled Bsa) were each taken individually as an outgroup providing independent replicate estimation of site-frequency spectra. We used biallelic sites only with -skipTriallelic 1 filtering in ANGSD, which will leave only sites that have the same exact ancestral state in the outgroup as one of the two alleles in the ingroup. In conjunction with multiple outgroups this filtering addresses some issues with SNP misorientation. If a particular site can be polarized by outgroup A (e.g. Gma) it means that the state of the site in taxon A is the same as one of the alleles segregating in the ingroup population. If outgroup B (say Gch) has a different state for that site, the site would would be tri-allelic in that comparison and removed by the tri-allelic filtering. We did not use parsimony or consensus to infer the state of ancestral nodes (Keightley and Jackson, 2018). Therefore, this filtering will not remove sites which have parallel changes simultaneously in two or three outgroup taxa. To address the potential effects of SNP misorientation from parallel mutation (Baudry and Depaulis, 2003; Hernandez et al., 2007) or from ancestral introgression (Schumer et al., 2018) we generated a 100% consensus sequence (with perl script available from https://github.com/josephhughes/Sequence-manipulation/blob/master/Consensus.pl; Hughes, 2011) from walleye pollock (Gch), Pacific cod (Gma), and Arctic cod (Bsa) sequences and used the consensus sequence to polarize sites. There is potentially mutational saturation of transition sites (Agarwal and Przeworski, 2021) that complicates polarization of sites. We used the -rmTrans flag to remove transitions and study variation at transversion sites only. The effects of SNP misorientation from parallel mutation (Baudry and Depaulis, 2003; Hernandez et al., 2007) or from low-level ancestral introgression (Schumer et al., 2018) will primarily affect the singleton and doubleton as well as the anti-singletons () and anti-doubletons () site-frequency classes. We, therefore, also removed these classes of sites and used truncated site-frequency spectra to minimize the effects SNP misorientation and ancestral introgression.
To estimate site-frequency spectra the site allele frequency likelihoods based on genotype likelihoods were estimated using ANGSD and polarized with the respective outgroup using the -anc flag with -doSaf 1 and -doMajorMinor 1 for both genotype likelihoods 1 and 2 (GL1 the SAMtools genotype likelihood, -GL 1 and GL2 the GATK genotype likelihood, -GL 2). Filtering was done on sequence and mapping quality -minMapQ 30 -minQ 20, indel realignment -baq 1-C 50, quality checks -remove_bads 1 -uniqueOnly 1 -only_proper_pairs 1 -skipTriallelic 1, and finally the minimum number of individuals was set to the sample size (e.g. -minInd 68) so that only sites present in all individuals are selected. Errors at very low-coverage sites maybe called as heterozygotes. Similarly, sites with very high-coverage (more than twice or three times the average) may represent alignment issues of duplicated regions such that paralogous sites will be called as heterozygous. We addressed the issues of coverage with two steps. First, we screened out individuals with an average genome-wide coverage less than 10× giving samples sizes of and for the South/south-east and the Þistilfjörður populations, respectively. This resulted in an average coverage of 16× and 12× for the South/south-east and the Þistilfjörður populations, respectively. Second, we determined the overall coverage of all sites in the genome that passed the quality filtering. We then used the minimum and maximum of the boxplot statistics ( and , which represent roughly for a normal distribution) to filter sites using the ANGSD flags -setMinDepth and -setMaxDepth thus removing sites with a boxplot outlier coverage. We did this filtering separately for each chromosome. All our site-frequency spectra are estimated using these flags. The site-frequency spectra of the full data for each chromosome were then generated with realSFS using default flags. Site-frequency spectra for genomic regions used the -sites flag of realSFS with the sample allele frequency files (saf) files estimated with the above filtering and was thus based on the same filtering.
We use the logit transformation, the log of the odds ratio , to analyze the site-frequency spectra. We transform both the derived allele frequency and the normalized site frequency. Under this transformation, the overall shape of the site-frequency spectrum (L-shape, U-shape, V-shape) is invariant. We used the kernel density estimator and functions of the eks R package (Duong, 2022) to estimate and plot density contours of parameter estimates.
To investigate divergence among gadid taxa we used ANGSD to generate beagle likelihoods (-GL 1, -doGlf 2) and the quality filtering above. We then used ngsDist (Vieira et al., 2015) to estimate the -distance as nucleotide substitutions per nucleotide site between Atlantic cod and walleye pollock. The number of sites (--n_sites) was set to the number of variable sites and the total number of sites (--tot_sites) was set equal to the number of sites that passed the quality filtering in the estimation of the site-frequency spectra above (Appendix 7—table 4). A tree (Appendix 6—figure 20) was generated with fastME (Lefort et al., 2015) and displayed using ggtree (Yu et al., 2016).
To evaluate deviations from neutrality, we used ANGSD to estimate the neutrality test statistics Tajima’s D (Tajima, 1989), Fu and Li's D (Fu and Li, 1993), Fay and Wu's H (Fay and Wu, 2000), and Zeng’s E (Zeng et al., 2006) in sliding windows (window size 100 kb with 20 kb step size).
We generated vcf files for the South/south-east population using GATK (Van der Auwera et al., 2013). We used the genomic features files (gtf) of the Gadmor3 assembly to extract sites belonging to different functional groups. We used ReSeqTools (He et al., 2013) to extract fourfold degenerate sites, bedtools (Quinlan and Hall, 2010) to extract exon and intron sites using genomic feature files (gtf), and we used the GenomicFeatures Bioconductor package (Lawrence et al., 2013) for extracting other functional regions. We then used the -sites flag of realSFS to estimate site-frequency spectra from the sample allele frequency (saf) files of the entire data for each chromosome, thus keeping the quality and coverage filtering applied to the full data (Bioinformatic analysis). We used PopLDdecay (Zhang et al., 2018) to estimate the decay of linkage disequilibrium. To perform the McDonald–Kreitman test of selection (McDonald and Kreitman, 1991) we used SnpEff (Cingolani et al., 2012) to estimate the number of polymorphic non-synonymous and synonymous ( and ) sites of protein-coding genes. To estimate the number of fixed non-synonymous and synonymous ( and ) sites, we used a single individual with the highest coverage (32×) from the South/south-east population and a single high coverage () Pacific cod individual and counted sites that are homozygous within species while exhibiting different allelic states between species. We used the neutrality index (Rand and Kann, 1996) transformed as as an index of selection with negative values implying negative (purifying and background) selection and positive values implying positive selection (selective sweeps).
We did a principal components (PC) based scan of selection using PCangsd (Meisner and Albrechtsen, 2018) (python pcangsd.py -selection), which implements the fastPCA method of Galinsky et al., 2016. We then removed regions of 500 kb on either side of selective peaks that exceeded (Appendix 6—figure 8) to define regions of no selection that we compared with other genomic regions (e.g. Figure 5).
We used OmegaPlus (Alachiotis et al., 2012) and RAiSD (Alachiotis and Pavlidis, 2018) scanning for selective sweeps genome wide. Both methods use local linkage disequilibrium to detect sweeps (Nielsen, 2005) and in addition RAiSD uses a local reduction in levels of polymorphism and shifts in the frequencies of low- and high-frequency-derived alleles affecting, respectively, the left and righ tails of the site-frequency spectrum.
Methods for analyzing coalescent models
This section describes the model-fitting procedure we used for each family of models discussed in Appendix 1. Where possible, we have resorted to documented state-of-the-art simulators and inference packages, though that was not possible in all cases, particularly for the Durrett–Schweinsberg model. A description of various terms is given in Appendix 7—table 5. All custom code has been made available via GitHub, with links below.
Kingman coalescent
Request a detailed protocolThere are numerous, well-documented packages for inferring population size profiles from whole-genome data under the Kingman coalescent, typically relying on the sequentially Markovian coalescent approximation (McVean and Cardin, 2005). We used smc++ (https://github.com/popgenmethods/smcpp; Terhorst et al., 2016) to produce best-fit profiles. We also used the stairway plot (https://github.com/xiaoming-liu/stairway-plot-v2; Liu and Fu, 2015; Liu and Fu, 2020; ) that use the site-frequency spectra for a model-flexible demographic inference. Both packages were installed according to their respective documentations, and run using default settings. To treat runs of homozygosity, which may represent centromeric regions, as missing, we set the flag --missing-cutoff 10 in smc++ runs.
-Beta() coalescent
Request a detailed protocolAt the time of writing there are no off-the-shelf inference packages capable of estimating or a population size profile from whole-genome data under the -Beta() coalescent. However, synthetic data from the model can be simulated using msprime (Kelleher et al., 2016). Hence, we fit our model using ABC, in which model fitting is accomplished by comparing summary statistics of simulated and observed data under various parameters. We used uniform priors adjusted for different situations (Appendix 7—table 6).
We used the logit transform of the normalized site-frequency spectrum (SFS) as our summary statistic. The msprime package is not well optimized for simulating multiple chromosomes, so we used chromosome 4 as our observed data. To simulate observations, we set the chromosome length to 3.5 Mb, and used respective per-site per-generation mutation and recombination probabilities of 10−7 and 10−8, respectively.
A proposed parameter combination was accepted whenever the simulated statistic was within a specified tolerance of the observed statistic. To avoid tuning the tolerance and other hyperparameters, and to focus computational effort on regions of of good model fit automatically, we used the adaptive ABC-MCMC (Approximate Bayesian Computation Markov Chain Monter Carlo) method of Vihola and Franks, 2020 with a target acceptance rate of 10%, which the authors recommend.
Durrett–Schweinsberg coalescent
Request a detailed protocolTo our knowledge, there are no off-the-shelf inference packages for the Durrett–Schweinsberg model, and also no packages for simulating it. Hence we implemented a basic, single locus simulator in C++, based on the exact rejection sampling mechanism which is used in both the msprime and Beta-Xi-Sim simulation packages (see the Appendix in Koskela, 2018). Since the Durrett–Schweinsberg coalescent is a single locus model, we computed an observed site-frequency spectra separately for 25 kb lengths of genome separated by 500 kb gaps. This was done across all 19 non-inversion chromosomes, and the mean of the resulting ensemble was taken to be the observed SFS. Simulated values were calculated as the mean of 10,000 independent, single-locus replicates. This number was found to be high enough in trial runs to avoid zero entries in the averaged SFS, and hence infinite values in the logit transform.
Then we used the same ABC-MCMC pipeline outlined above for the -Beta() coalescent to infer an approximate posterior distribution of values for the compound parameter of the Durrett–Schweinsberg model.
Computations
Request a detailed protocolThe computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University. Some computations were run on the Mimir bioinformatics server and the Assa bioinformatics computer at the University of Iceland.
Code availability
Request a detailed protocolSimulations of background selection were done with SLiM 3 (Haller and Messer, 2019) available at https://messerlab.org/slim/. Estimates of population size histories for the Kingman coalescent were produced using the stairwayplot (Liu and Fu, 2015; Liu, 2020) and smc++ (Terhorst et al., 2016) available via Github at https://github.com/xiaoming-liu/stairway-plot-v2 and https://github.com/popgenmethods/smcpp, respectively. Based on the estimated population size histories site-frequency spectra under the Kingman and the -Beta() coalescents were simulated using msprime, available via GitHub at https://github.com/tskit-dev/msprime, (copy archived at swh:1:rev:becc7b948123f8683c49ed41480ca2682d979a7f; Wong, 2022), with documentation at https://tskit.dev/msprime/docs/stable/. Our msprime simulations also make use of the tskit library, available via GitHub at https://github.com/tskit-dev/tskit, (copy archived at swh:1:rev:575daea4bcd535df7bc328a7387876eb986daebb; Jeffery, 2023), with documentation at https://tskit.dev/tskit/docs/. To our knowledge, no prior implementation of the Durrett–Schweinsberg coalescent is available. Hence, we wrote a simulator, which is available via GitHub at https://github.com/JereKoskela/ds-tree, (copy archived at swh:1:rev:7ee7d9c473278aaf618af7a539fd3cba2735d1e1; Koskela, 2022). This repository also contains documentation of the Durrett–Schweinsberg implementation, as well as Python and shell scripts for the (1) ABC pipelines we used to conduct model fitting for both the -Beta() and Durrett–Schweinsberg coalescents, and (2) the simulation pipelines for sampling site-frequency spectra under the best-fit Kingman, -Beta(), and Durrett–Schweinsberg coalescents. C++code and python scripts implementing the sampling schemes described in https://github.com/eldonb/coalescents; Eldon, 2021a. C code using recursions Birkner et al., 2013b for computing the exact expected branch length spectrum for Examples 2.3 and 2.4 of the Durrett–Schweinsberg model (Durrett and Schweinsberg, 2005) is available at https://github.com/eldonb/Durrett_Schweinsberg_Expected_SFS, (copy archived at swh:1:rev:07a534d2d6b5870762bfe6dd3c79f860eb82494a; Eldon, 2021b).
Appendix 1
Coalescent models
Our aim is to distinguish between three possible explanations of the pattern of genetic diversity observed in our samples:
Reproduction with a low variance offspring distribution is consistent with the classic Wright–Fisher model of genetic evolution, albeit in a population of varying size (the Kingman coalescent with variable demography).
Neutral and random sweepstakes reproduction arising from a skewed family size distribution (the neutral -Beta() coalescent) in a population undergoing exponential growth.
Recurrent selective sweeps (the Durrett–Schweinsberg coalescent of recurrent selective sweeps) as models of selective sweepstakes.
All three scenarios predict an excess of singletons in the site-frequency spectrum relative to a standard Kingman coalescent (at least as long as the population size decreases backward in time in scenario 1) and are hence a priori plausible explanation of our data. However, the models differ in their predictions for the rest of the site-frequency spectra given a desired singleton class (Appendix 6—figure 21).
In this section, we introduce a class of models to describe each scenario. In the methods section for analyzing coalescent models (Methods for analyzing coalescent models) we fit each family to our data and use the discrepancy between the data and the corresponding prediction of the best-fit model within each class to assess the degree to which each model class can explain the data.
Appendix 2
The Kingman coalescent
The Kingman coalescent (Kingman, 1982) is a seminal model in population genetics and the default null model for genetic evolution. In brief, the distribution of the genealogical tree connecting sampled lineages is obtained by merging each pair of lineages at a unit rate, until only one lineage remains.
The Kingman coalescent is derived from individual-based models with a finite population size by suitably rescaling time and sending . To connect a merger time in the Kingman coalescent back to generations, it is necessary to undo the rescaling. The most common choice of individual-based models is the Wright–Fisher model (Durrett, 2008), under which one unit of coalescent time corresponds to generations.
The original formulation of the coalescent assumed a constant population size, but extensions to variable population sizes, resulting in variable pair-merger rates, have subsequently been developed (see e.g. Donnelly and Tavaré, 1995). There are many robust and well-established methods for flexibly inferring historical population size profiles from DNA sequence data. Hence, rather than specifying an explicit model for population size under our Kingman coalescent scenario, we use the model specifications to coincide with the implementation of our model-fitting pipeline defined in the methods section for analyzing coalescent models (Methods for analyzing coalescent models).
Appendix 3
The -Beta() coalescent
The Beta-coalescent (or Beta()-coalescent) is a genealogical model arising from a haploid population of fixed size, in which individuals can have family sizes comparable to the total population size with non-negligible probability (Schweinsberg, 2003). Large families looking forward in time result in multiple branches merging simultaneously looking backward in time. Specifically, when there are lineages, each subset of size lineages merges at a rate
where is the beta function.
The extension to diploid population results in a coalescent with up to four simultaneous merger events, arising out of the merging lineages separating uniformly at random into the four available parental chromosomes (Birkner et al., 2013a; Koskela and Wilke Berenguer, 2019). The resulting coalescent with simultaneous multiple mergers is known as the -Beta() coalescent (Schweinsberg, 2000). Extensions to variable population sizes have also been constructed (Freund, 2020; Koskela and Wilke Berenguer, 2019).
To balance flexibility with computational constraints, we focus on the two-dimensional family of models
where determines the skewness of the offspring distribution in (1), and denotes a rate of exponential growth of the population forward in time.
Beta and -Beta() coalescents are also obtained as scaling limits of individual-based models described in Schweinsberg, 2003, but unlike the Kingman coalescent, one unit of time in the limiting coalescent model corresponds to generations. This results in a very short timescale when is close to 1, and can lead to implausibly large predicted population sizes (or, equivalently, mutation rates) in order to match observed levels of diversity. We will sidestep this issue by resorting only to statistics that are insensitive to the total number of mutations in our model-fitting pipeline.
Appendix 4
The Durrett–Schweinsberg coalescent
We model the impact of recurrent selective sweeps with the Durrett and Schweinsberg, 2005 coalescent. The model describes genetic ancestry at a single locus in a haploid population subject to hard selective sweeps due to beneficial mutations arising along the genome. The neutral locus of interest that is linked to a selected site can either merge due to a sweep or escape via recombination. The pattern of mergers and escapes in each of the sweeps results in multiple mergers, where any of lineages merge at rate
where if the event happens and zero otherwise, is the population-rescaled rate at which beneficial mutations arise, is a measure of the advantage they provide (a larger value corresponds to a greater fitness advantage), and is the population-rescaled rate of recombination per link between sites. Specifically, this is the infinite-chromosome model with uniformly distributed locations of both mutations and recombinations described in Example 2.4 of Durrett and Schweinsberg, 2005.
Extensions of the Durrett–Schweinsberg coalescent to diploid populations or nonconstant population sizes have not been derived. Because the Durrett–Schweinsberg model features hard selective sweeps and selective advantage manifests on each chromosome individually, we do not expect that a diploid extension would give rise to lineages separating into four groups in each merger, as it did in the -Beta() coalescent. However, both diploidy and a nonconstant population size likely would affect the merger rates in (2). Development of these generalizations is beyond the scope of this article, and in their absence, we resort to the haploid model with constant population size as a practical, heuristic model.
The timescale of the Durrett–Schweinsberg model coincides with that of the Kingman coalescent. One unit of coalescent time corresponds to generations regardless of the value of parameters. The duration of a selective sweep is much shorter at generations, which is why they become instantaneous in the coalescent limit, and hence also cannot overlap. However, the Durrett–Schweinsberg model requires very rapid recombination, with a per-generation recombination probability for a constant , unlike the more familiar under a Kingman coalescent.
Note that (2) coincides for any two parameter combinations with . It is also possible to draw a random selective advantage independently at each sweep from a fixed distribution, and obtain the same coalescent model as long as . Hence, we define the compound parameter and focus on the one-dimensional family of models given by
The derivation of the model in Durrett and Schweinsberg, 2005 is only valid for a single locus. Genomic generalization is likely to result in a more complex model depending on separations between sites of interest, as well as potential mutation or recombination points. We expect the methodology we develop in the methods section for analyzing coalescent models (Methods for analyzing coalescent models) to be robust to the inconsistency of applying a single-locus model to whole-genome data as it depends only on expected frequencies of segregating sites, rather than their higher moments or joint distributions. To facilitate a comparison of observed and predicted linkage patterns (Appendix 6—figure 6), we present the following two-locus generalization of the Durrett–Schweinsberg coalescent, applicable to samples of two haplotypes on which the first site is at position 0, and the second at position .
Because the recombination in the Durrett–Schweinsberg model acts on a timescale of generations while the timescale of evolution is generations, we assume that the two sites are unlinked between sweeps. When a sweep happens, four things can happen: (1) there are no mergers, (2) there is a merger at site 1 only, (3) there is a merger at site 2 only, or (4) there is a simultaneous pair merger at each site. We obtain the rate of simultaneous pair mergers by adapting the calculation in Durrett and Schweinsberg, 2005, Example 2.4 to consider the rate with which four out of four lineages take part in a merger:
where η is defined in Durrett and Schweinsberg, 2005, Theorem 2.2. The integrand is positive on the interval
whereupon (Equation 3) is positive whenever . Hence
so that sweeps which cause a simultaneous pair merger at both sites, in which all four lineages participate, rise at rate
Since the marginal rate of mergers at each site must be , the rate of sweeps which cause a merger at only one particular site is
Ancestral trees with these rates of single and double pair-mergers were simulated, and corresponding linkage disequilibrium patterns were estimated from replicates via
where and are the estimated TMRCAs at sites 1 and 2 from the ith replicate, and is a population-rescaled rate of neutral mutation. The right-hand side of (4) is obtained by computing the correlation in the events in which no mutation separates the two samples at each site, and substituting ensemble estimates for the resulting mean TMRCAs.
The model constructed above has three free parameters: , , and , but predicts an exponential decay to zero of LD as the separation grows, regardless of the values of parameters as long as they are nonzero. This contradicts the observed positive background level in Appendix 6—figure 6. To explain a positive level of background LD, we assumed that a fraction of sweeps were not localized to a mutation at a given position along the chromosome, but rather would cause a sweep along the full length of the chromosome regardless of recombination. A possible interpretation for such a ‘sweep’ is a population bottleneck. The resulting rate of a simultaneous double pair-merger is
with a corresponding change in the rate of single-site pair mergers. Appendix 6—figure 6 shows that the observed linkage data are informative about , but not the other parameters and that the resulting model predictions are not inconsistent with the observed LD profile.
Appendix 5
Limits of the models
This section highlights some key ways in which the Durrett–Schweinsberg and -Beta() coalescent families differ from the classical and well-known Kingman coalescent, as well as from each other.
The Kingman and -Beta() coalescent families we have considered are models for genome-scale, diploid organisms. In contrast, the Durrett–Schweinsberg coalescent models a haploid (since the chromosomes are treated as separate ‘individuals’) organism undergoing selective sweeps at unobserved, linked sites some recombinational distance away from the neutral site for which we are tracing the genealogy.
The fact that the model provides such a good fit to our data with only one free parameter despite this discrepancy is a strong motivation for the construction of an analogous model for multi-locus data from a diploid organism, which is not presently available in the literature. We expect such an extension to further improve the model fit, and also very likely render identifiable the various components of the compound parameter which cannot be inferred separately using the single-locus Durrett–Schweinsberg coalescent. Because multiple mergers in the Durrett–Schweinsberg coalescent arise from hard selective sweeps which encompass the whole population, we do not anticipate that a diploid extension will simply split multiple mergers into four groups (corresponding to the four ancestral chromosomes available in a reproduction event) as it does in the case of the -Beta() coalescent (Birkner et al., 2013a). In a similar vein, the impact of varying population size on measures of genetic diversity under the Durrett–Schweinsberg coalescent is unknown.
Typical constructions of the Kingman coalescent from a Wright–Fisher population predict that the number of generations between merger events is proportional to the (census) population size . In contrast, the -Beta() coalescent predicts a timescale proportional to generations per coalescent time unit (Schweinsberg, 2003). With an under the -Beta coalescent (as observed), the only way to match the best-fit regime of the -Beta coalescent to our data is to postulate a census population size many orders of magnitude larger than the effective size (which under the model is proportional to ).
In contrast, the timescale of evolution under the Durrett–Schweinsberg coalescent coincides with that of the Kingman coalescent (i.e. on the order of discrete time units per coalescent time unit). Selective sweeps under the Durrett–Schweinsberg coalescent govern the rate with which a mutation spreads in a population (proportional to time units on average), but sweeps are independent of the timescale on which new favourable mutations arise relative to the Kingman coalescent.
In the -Beta() coalescent derived for a diploid population, a multiple-merger is due to a single pair of diploid individuals (Birkner et al., 2018), picked at random, giving rise to a large family, and the single pair becomes ancestral to a non-negligible fraction of the population in one generation. The offspring of the highly fecund successful pair carry no fitness advantage. In the Durrett–Schweinsberg coalescent, the positive mutation, which initiates a selective sweep, hits a random individual that passes on a fitness advantage to its offspring, and with a certain probability the beneficial mutation sweeps to fixation in generation on average. The probability of a second sweep beginning before the first is complete is proportional to , and hence vanishes as . After a sweep ends, a new individual carrying the beneficial type can initiate a new, independent sweep. The recurrent sweeps imply that the environment is forever changing with each sweep climbing a new selective peak in the adaptive landscape. To the best of our knowledge, our work is the first attempt to use whole-genome data to distinguish between families of multiple-merger coalescents to assess the plausibility of their underlying assumptions as explanations for observed data in diploid taxa.
Finally, the Durrett–Schweinsberg coalescent assumes that recombination occurs at approximately time unit intervals, much faster than mutations that arise on a timescale of time units. Thus, the Durrett–Schweinsberg coalescent predicts a rapid decay of linkage disequilibrium as observed (Appendix 6—figure 6). Under the Kingman coalescent, both mutation and recombination act on a timescale of units, while both act on a timescale of units under the -Beta() coalescent. Hence, the Durrett–Schweinsberg model predicts widespread recombination relative to mutations, which is qualitatively consistent with the very rapid decay of linkage disequilibrium (Appendix 6—figure 6).
The Durrett–Schweinsberg model is based on the Moran, 1958 model of reproduction in which a single individual replicates and another dies instead. The reproducing individual remains active in the population. In the Durrett–Schweinsberg addition, the reproducing individual is fitter. On a coalescent timescale, there is a burst of reproductive activity by the individual and his descendants such that their fit, derived lineage quickly takes over the whole population. The Moran model does not model high fecundity by itself. Still, with this addition, it is as if the Moran process and its associated Kingman coalescent are interrupted by a burst of high fecundity. Thus the Durrett–Schweinsberg model approximates selective sweepstakes by adding recurrent sweeps of a new selectively advantageous mutation each time to the Moran model. The multiple-merger Durrett–Schweinsberg coalescent (Appendix 4) describes the genealogy at a single site, the ‘neutral’ site, that is linked at some recombinational distance to a site hit by a favorable mutation. The population experiences recurrent, strongly beneficial mutations at sites linked to the neutral site, and it is assumed that the neutral site never experiences mutation. A beneficial mutation sweeps to fixation in time units, where is the population size, and the probability of fixation does not depend on the population size. However, a vital component of the Durrett–Schweinsberg model is the assumption of an elevated rate of recombination between the neutral and the mutated site, giving ancestral lineages at the neutral site a chance to escape a sweep via recombination.
Appendix 6
Supplementary figures
Appendix 7
Supplementary tables
Appendix 8
Classic tests of neutrality
The classic Kingman coalescent, derived from the Wright–Fisher (or similar) model of low-variance reproduction, is the no-sweepstakes model. Several tests of a neutral equilibrium under the Wright–Fisher model of reproduction and the Kingman coalescent use a standardized difference of different estimators of the mutation-rate scaled by population size (Tajima, 1989; Fu and Li, 1993; Fay and Wu, 2000; Zeng et al., 2006; Przeworski, 2002). These tests are sensitive to mutations on different parts of the genealogy and thus of different frequency classes of the site-frequency spectrum that also may be influenced by demography, background selection, and selective sweeps (Tajima, 1989; Fu and Li, 1993; Fay and Wu, 2000; Zeng et al., 2006; Przeworski, 2002). A negative Tajima’s indicates an excess of low frequency over intermediate frequency alleles, and a negative Fu and Li’s , which contrasts mutations on internal and external branches of a genealogy, indicates an excess of singletons. Thus, these statistics are sensitive to deviations from neutrality affecting the left tail of the site-frequency spectrum, such as population expansion and background selection (Nielsen, 2005). In contrast, negative values of Fay and Wu's (Fay and Wu, 2000) and Zeng’s E (Zeng et al., 2006) statistics, which weigh the frequency of high-frequency derived alleles, are sensitive to deviations from neutrality affecting the right tail of the site-frequency spectrum such as positive selection and selective sweeps (Fay and Wu, 2000; Przeworski, 2002; Nielsen, 2005).
Appendix 9
Lifetable and generation time for Atlantic cod
Demographic life-table statistics for the female segment of the Atlantic cod in Iceland are presented in Appendix 7—table 3. and are, respectively, the correction factors for the effects of overlapping generations and generation time based on demographic estimation (Jorde and Ryman, 1995; Jorde and Ryman, 1996; Laikre et al., 1998). Age-specific survival rate, li, was based, respectively, on the average and the 1948–1952 and the 1963–1967 instantaneous mortality estimated from tagging experiments of Icelandic cod (Jónsson, 1996). These periods showed differences in estimated instantaneous survival and represented variation in demographic statistics. Mortality may be underestimated, particularly for Age class 1. The method assumes that the probability of survival from one year to the next is the same for all age classes (Jónsson, 1996). Age-specific fecundity is based on the average age-specific weight in the catch (Anonymous, 2001) and fecundity by weight relationships (Marteinsdottir and Begg, 2002) and similar relationships for Newfoundland cod for comparison (May, 1967). The demographic statistics were used to estimate generation time and the correction factor for the effect of overlapping generation by iteration of Equations 5–9 (Jorde and Ryman, 1996).
Few data are available on individuals older than 15 years so they are not included in the table. However, large fish up to 180 cm long, weighing up to 50 kg, and as old as 25 years are regularly caught. The annual fecundity of a 50-kg female is predicted to be between 3 and 4 × 108 eggs. Coupled with the low (type III) survivorship, large older females may contribute disproportionally to the variance in offspring numbers.
Appendix 10
SNP misorientation from parallel mutation or low-level ancestral admixture
In estimating site-frequency spectra as in this paper, potential effects of SNP misorientation from parallel mutation (Baudry and Depaulis, 2003; Hernandez et al., 2007) or low-frequency ancestral introgression (Schumer et al., 2018) can cause similar problems in the data, except that introgression is more genome-wide. We deal with these issues together. A parallel mutation in the outgroup to the same state as a derived biallelic polymorphic state in the ingroup will flip the orientation of a site with the ancestral state being considered derived. Singleton and doubleton sites are the most common sites and such sites would flip to the and class thus increasing the right tail of the SFS. Under low-level ancestral introgression sites that were formerly monomorphic in the ingroup would flip at sites that have a derived state in the outgroup. Introgression would not affect the right tail of the SFS at polymorphic ingroup sites where the outgroup carries the ancestral allele. Instead, such sites would have a higher frequency of ancestral allele and push derived alleles in the ingroup to a lower frequency and pull up the left side of the site-frequency spectrum. Ancestral introgression will affect ingroup sites that were fixed for the derived allele before introgression by making such sites polymorphic and contributing to the right tail of the site-frequency spectrum. To minimize the potential effects of introgression, we also screened the individuals sampled to ensure that they belong to the same groups revealed by principal component (PCA) and admixture analysis.
The tri-allelic filtering of sites that we apply in conjunction with multiple independent outgroups goes some way towards addressing these issues. This will only leave sites that have the same exact ancestral state in the outgroup as one of the two alleles in the ingroup. If a particular site can be polarized by outgroup A (e.g. Gma) it means that the state of the site in taxon A is the same as one of the alleles segregating in the ingroup population. If outgroup B (say Gch) has a different state for that site, the site would be tri-allelic in that comparison and removed by the tri-allelic filtering. We did not use parsimony or consensus to infer the state of ancestral nodes (Keightley and Jackson, 2018). Therefore, this filtering will not remove sites that have parallel changes simultaneously in two or three outgroup taxa.
To further address these issues, we reasoned that SNP misorientation and low-level ancestral introgression will mostly affect singletons and doubletons as well as the anti-singletons () and anti-doubletons () classes. The singletons and doubletons together comprise 62–66% of sites (depending on which genotype likelihood was used) whereas the anti-singletons () and anti-doubletons () classes comprise less than 1% of sites (see e.g. Appendix 6—figure 3). The right tail of the site-frequency spectrum is, therefore, sensitive to low levels of misorientation among singletons and doubletons. However, the truncated site-frequency spectrum compared to the full site-frequency spectrum and to the respective expectations of the Durrett–Schweinsberg model (Figure 3—figure supplement 4) does not change the overall pattern.
Second, we reasoned that sites with transition variation are more likely than transversions to be saturated with mutations (Agarwal and Przeworski, 2021) complicating the polarization of sites. Parallel changes are more likely at such sites leading to SNP misorientation. To address this, we removed transitions and fitted the Durrett–Schweinsberg model to transversions only. This had minuscule effects compared to full data (Figure 3—figure supplement 4).
Third, we studied the effect of outgroups. Instead of maximum parsimony across the cod species tree, we used a 100% consensus sequence of walleye pollock (Gch), Pacific cod (Gma), and Arctic cod (Bsa) as an outgroup. Thus only sites at which the three outgroup taxa agree are used. Under parsimony, an agreement among two out of three would be used, but here three out of three are required (Figure 3—figure supplement 1). It is worth noting that the very right tail of the site-frequency spectrum is lower and more in line with the Durrett–Schweinsberg model for both the transversions and the 100% consensus data compared to the full data (Figure 3). This probably indicates that SNP misorientation had some effect on the original full analysis. However, the effect does not change the results qualitatively.
Appendix 11
Can processes other than selective sweeps better explain the patterns?
The effects of demography (changes in population size, population structure, and migration) can be hard to distinguish from various forms of selection (Nielsen, 2005). Different forms of selection can affect the various parts of the site-frequency spectrum in similar ways. We now consider whether processes other than selective sweeps can provide a better explanation for the observed patterns.
Appendix 12
Historical demography and low variance reproduction
Our estimated demographic history (Appendix 6—figure 11 and Appendix 6—figure 12) shows population expansion in the distant past leading to the relative stability of population size in the recent past to modern times. In some cases, an apparent population crash in recent times (Appendix 6—figure 11c), which is chromosome specific, is an exception to this. Demography produces genome-wide effects and, thus, this is likely a peculiarity of runs of homozygosity of some chromosomes (such as centromeric regions, for example) and does not reflect historical size changes of the population. Based on these population growth curves (Appendix 6—figure 11 and Appendix 6—figure 12) we generated population size change scenarios for simulating site-frequency spectra using msprime (Kelleher et al., 2016; Baumdicker et al., 2021). The results (Appendix 6—figure 19) show monotonically decreasing frequency with the size of the mutation or L-shaped site-frequency spectra that neither capture the singleton class nor the upswing of the right tail of the observed site-frequency spectra (Figure 3, Appendix 6—figure 3, and Appendix 6—figure 5). Thus, there is no evidence in our results for a low-variance no-sweepstakes mode of reproduction modelled by the Kingman coalescent, even taking demographic histories of population expansion or collapse into account. Our simulations are in line with the theoretical proof (see Appendix 6 of Sargsyan and Wakeley, 2008), showing that the normalized expected site-frequency spectrum of a Kingman coalescent under arbitrary population size history is L-shaped.
Appendix 13
Potential confounding due to cryptic population structure
Here we examine alternative explanations for our observations. In particular, are the site-frequency spectra influenced by cryptic population structure?
The effect of hidden population structure on the site-frequency spectra is expected to look similar to the patterns seen for the inversion chromosomes. These are chromosomes Chr01, Chr02, Chr07, and Chr12 known to carry large inversion (Kirubakaran et al., 2016; Berg et al., 2016). They show two peaks in the site-frequency spectrum (Appendix 6—figure 18 and Appendix 6—figure 18) at the frequency of the variants’ haplotype frequency and show a block of values for neutrality statistics (Figure 1 and Appendix 6—figure 2). If a sample of size diploid organisms is composed of two cryptic reproductively isolated populations (sample sizes n1 and n2) we expect to see peaks in the site-frequency spectra at the relative frequencies of the two groups. If we expect a sharp peak at . This peak would include all fixed sites in both populations ( and ) and spread over neighbouring frequency classes ( and so on). If the frequencies of the two groups differ () two peaks will appear, but are expected to be narrow. They will always include all sites fixed in either population (because fixed sites in either population will appear to be segregating in the sample as a whole).
To study the potential effects of population structure, we used msprime (Kelleher et al., 2016; Baumdicker et al., 2021) to simulate the Kingman coalescent with two isolated populations exchanging a varying number of migrants under population growth as determined by the growth parameter . Thus, we examined the effects of cryptic structure on the site-frequency spectrum by varying the growth rate and the effective number of migrants between subpopulations (), and varied the number of individuals sampled from the population with fewer individuals represented (referred to as the minor population). Parameters of the simulations were the number of individuals from the minor population (), the migration rate (, and the growth rate (). The effective size was set at and thus the effective number of migrants per subpopulation per generation was .
We use a two-island model with exponential growth under the Kingman coalescent as a simple tool for assessing the qualitative, joint effect of demography and substructure on the site-frequency spectrum (Appendix 6—figure 14 and Appendix 6—figure 15). Two narrow peaks at opposite allele frequencies are evident (much like the two narrow peaks for the inversion chromosomes, Appendix 6—figure 18 and Appendix 6—figure 18) becoming smaller with increasing migration. If the sample contained only a single individual from the minor population, is there a remote resemblance to the observed data (Appendix 6—figure 14g, h, j). Nevertheless, even in this case, doublets are more common than singletons, and it is hard to find combinations of growth and migration rates to mimic the observed data. We used the Xi-Beta coalescent for similar simulations (Appendix 6—figure 15) and got the same results qualitatively. Therefore, population structure in a population evolving according to the Wright–Fisher (or a similar) low-fecundity model or in a population evolving under a neutral sweepstakes model is an improbable explanation for our results. Both simulations (Appendix 6—figure 14 and Appendix 6—figure 15) show that only for a minor sample size of one diploid individual do the models show a remote resemblance to our data. To further address this issue, we, therefore, estimated the site-frequency spectra with a leave-one-out approach (Appendix 6—figure 16). The leave-one-out approach is model free: whichever model is correct, one of the leave-one-out samples should behave differently if a cryptic population structure with a minor sample size of one is present in our data. None of them do. There is no indication that our sample from the South/south-east coast is composed of 67 individuals from one population and a single individual from a divergent population.
To further study the potential effects of cryptic population structure, we note that PCA of variation at each of the four chromosomes harboring large inversions reveals two or three groups that likely represent genotypes of the inversion alleles. There are three groups for Chr01 (which we refer to as Chr01-AA, Chr01-AB, and Chr01-BB), Chr02 (Chr02-CC, Chr02-CD, and Chr02-DD), and Chr07 (Chr07-EE, Chr07-EF, and Chr07-FF), and two groups for Chr12 (Chr12-GG and Chr12-GH), which has a low frequency of one inversion allele (Appendix 6—figure 13, Appendix 6—figure 18, and Appendix 6—figure 18). If we take these groups as representing the haplotypes of the inversions, the genotypic frequencies at each chromosome do not deviate from Hardy–Weinberg equilibrium, and there is thus no evidence for breeding structure (no Wahlund effect, Appendix 7—table 9). However, as the inversions effectively suppress recombination between the inversion alleles, we can also look at the chromosomes of the inversion genotypes as effectively isolated populations with no recombination (migration) between them and estimate the site-frequency spectra within genotypes for the inversion chromosomes. Furthermore, we conjecture that the PCA groups observed at inversion chromosomes represent reproductively isolated but cryptic populations. Because demography has genome-wide effects, the cryptic structure should be evident in the rest of the genome. We, therefore, estimate the site-frequency spectra for the 19 non-inversion chromosomes (chromosomes 3–6, 8–11, and 13–23) for these groups.
PCA did not show any structure for the non-inversion chromosomes. However, the four inversion chromosomes each showed two narrow peaks at intermediate allele frequencies (Appendix 6—figure 18 and Appendix 6—figure 18) indicative of either balancing selection or cryptic population breeding structure. If this is a breeding structure it should affect the whole genome. To disentangle the effects of balancing selection and potential breeding structure, we used the groups defined by PCA at the inversion chromosomes to investigate the inversion chromosomes themselves and the non-inversion chromosomes. We thus conjecture that the PCA groups represent cryptic breeding units.
PCA revealed three (or two) groups on the first principal axis that explains 4–36% of the variation in the inversion chromosomes (Appendix 6—figure 13a, d and g, j). The PCA groups most likely represent genotypes of inversion haplotypes. Taking membership in PCA groups to represent inversion genotype, their frequencies fit the Hardy–Weinberg equilibrium (Appendix 7—table 9) and thus there is no evidence of heterozygote deficiency or Wahlund effect (Wahlund, 1928) indicative of breeding structure. The only exception is chromosome 7 in the Þistilfjörður population, which shows a slight heterozygote excess (Appendix 7—table 9). Furthermore, the site-frequency spectra of the PCA groups (Appendix 6—figure 13b, e and h, k) show the same overall V-shape pattern as the site-frequency spectra for the overall data (Figure 3). Additionally, the intermediate PCA group shows a sharp peak at a derived allele frequency of (an equal frequency of two types or 0 on the logit scale) as expected for a group composed of heterozygotes only. Similarly, the site-frequency spectra of these PCA groups for the 19 non-inversion chromosomes combined (Appendix 6—figure 13c, f, i, l) show a pattern characteristic of the site-frequency spectra for the overall data. There is not the slightest hint of a Kingman coalescent-like behaviour for any of these PCA groups. Similarly, the expectations of the -Beta() coalescent do not explain the data.
Overall, the shape of the site-frequency spectra for each of the inversion chromosomes (Appendix 6—figure 18 and Appendix 6—figure 18) and the PCA groups of each of the inversion chromosomes (Appendix 6—figure 13) is the same as the shape of the site-frequency spectra of the non-inversion chromosomes (Figure 3). This shape is well explained for all PCA groups by the Durrett–Schweinsberg coalescent (Durrett and Schweinsberg, 2005), for which we estimated the parameter using ABC for the PCA group of the respective inversion chromosomes (Appendix 6—figure 13).
The observed V-shaped site-frequency spectra are inconsistent with an amalgamation of cryptic units reproducing under a Wright–Fisher model. The PCA groups are not cryptic breeding units, and we reject the above conjecture. Instead, we consider them to represent polymorphic inversion genotypes maintained by some form of balancing selection, such as frequency-dependent fitnesses arising from the accumulation of deleterious recessives on homokaryotypes (Jay et al., 2021) or other mechanisms of balancing selection (Faria et al., 2019). The good fit of the Durrett–Schweinsberg model of selective sweeps to the overall site-frequency spectra of the inversion chromosomes (Appendix 6—figure 18 and Appendix 6—figure 18) and to the PCA groups representing the alternative haplotypes of each of the inversion chromosomes (Appendix 6—figure 13) likely indicate recurrent selective sweeps within the alternative haplotypes of chromosomal inversions. It is known that both haplotypes of the PanI locus, which is located close to a breakpoint of the chromosome 01 inversion (Kirubakaran et al., 2016), are subject to selective sweeps in action (Pogson, 2001).
Appendix 14
Inversion polymorphisms
Four chromosomes, Chr01, Chr02, Chr07, and Chr12, are known to carry large inversions (Kirubakaran et al., 2016; Berg et al., 2016). The two inversions on Chr01 are connected to ecotypic variation, defining a deep-water migratory ecotype and a shallow-water stationary ecotype (Pampoulie et al., 2007), and inversions on the other three chromosomes may also be involved (Berg et al., 2016). The polymorphic Chr01 inversions likely originated in Iceland to Barents Sea populations as revealed by graph-aware retrieval of selective sweeps (Refoyo-Martínez et al., 2019). Similarly, the Chr02 and Chr07 inversions likely originated in Iceland, Faroe Islands, and North Sea populations, while the Chr12 inversion polymorphism originated in the Celtic Sea population (Refoyo-Martínez et al., 2019).
The polymorphic inversions on Chr01, Chr02, and Chr07 are segregating at intermediate frequencies (Appendix 6—figure 18 and Appendix 6—figure 18) (and see Hemmer-Hansen et al., 2013) in our two sample populations in Iceland (Appendix 6—figure 1) while the Chr12 inversion polymorphism is at about 5% versus 95% in the Þistilfjörður population and rarer still in the South/south-east coast population (Appendix 6—figure 17 and Appendix 6—figure 18). The inversion polymorphisms in these chromosomes are likely to be maintained by some form of balancing selection or they are examples of cryptic breeding structure (Hemmer-Hansen et al., 2013). To avoid the effects of balancing selection or cryptic breeding structure on our analysis, we exclude these four chromosomes from analysis or analyze them separately. The other 19 chromosomes (Chr03–Chr06, Chr08–Chr11, and Chr13–Chr23) do not seem to harbor large inversions or other significant chromosomal structural variations. We refer to them as non-inversion chromosomes.
Appendix 15
Balancing and background selection and functional constraints
Besides the Durrett–Schweinsberg model, various mechanisms of selection may influence the results. Here, we examine the effects of balancing selection, different selective constraints, and background selection.
There are several signs that natural selection affects observed patterns. Balancing selection retains linked neutral or nearly neutral variants at intermediate frequencies. The tighter the linkage and less the recombination, the longer the coalescent time of the neutral variants (Charlesworth, 2006). The observed site-frequency of intermediate frequency alleles is higher among the four inversion chromosomes than the 19 non-inversion chromosomes. All comparisons of the four inversion chromosomes and the 19 non-inversion chromosomes show this effect (Appendix 6—figure 17, Appendix 6—figure 18 and Appendix 6—figure 3). However, balancing selection does not affect the overall V-shape of the site-frequency spectrum of the inversion chromosomes (Appendix 6—figure 17 and Appendix 6—figure 18).
The PC-based selection scan (Meisner and Albrechtsen, 2018) is model-free and is based on finding genes or genomic regions that are outliers relative to the overall genome-wide allele frequencies and taking potential population structure into account. A principal component-based genomic scan of selection (Appendix 6—figure 8) showed many peaks that are likely indicative of recent and strong positive selection. Few peaks were population specific, but the two populations share most peaks. Region under a peak ranged from a single site to about 2 Mb. We extracted sites 500 kb or more away from the peaks (referred to as no-selection) and included with genomic regions under different selective constraints. We extracted fourfold degenerate sites, introns, and intergenic sites as less constrained regions, promoter regions, exons, -UTR, and -UTR as more selectively constrained regions. The mean, median, and mode of the estimated compound parameter of the Durrett–Schweinsberg model for the different genomic regions ranked from least constrained to most constrained sites (Figure 5). The ABC-MCMC was well mixed in all cases. There are two possible explanations for the rank order of the compound parameter with functional genomic regions. First, the more functionally important a region of the genome is, the stronger the selection coefficient of a new advantageous mutation will be as observed for UTRs in Drosophila (e.g. Andolfatto, 2005; Sella et al., 2009). Such mutations will sweep through the population and carry with them tightly linked neutral mutations in these same regions ( being inversely proportional to the recombination rate γ). Alternatively, different functional regions are preserved and constrained by purifying (negative) selection. If the sites are tightly linked, a positively selected mutation sweeping through will affect neutral, nearly neutral, and even deleterious sites. A tug-of-war between the effects of the sweep and purifying selection at a site results in a net effective selection coefficient for that site. The compound parameter of the Durrett–Schweinsberg model estimates the net effective selection coefficient squared over the recombination rate or density of selection per map unit (Aeschbacher et al., 2017), which may generate the observed rank order. Of course, both explanations may apply to different positive mutations. Thus selective sweeps permeate the genome affecting most if not all sites (Pouyet et al., 2018).
To study the effects of background selection, we carried out forward-in-time simulations of the Wright–Fisher model (using SLiM Haller and Messer, 2019). Simulations that ran for a relatively short number of generations (on the order of population size) produced V-shaped site-frequency spectra (Appendix 6—figure 19d). However, when simulations of the same parameter values ran for a large number of generations (up to 10 times the population size of 105 chromosomes) they accumulated more variation (Appendix 7—table 10) and produced monotone L-shaped site-frequency spectra. Thus, only in a narrow window of non-equilibrium between the input of mutation and its removal by purifying selection or loss by drift can background selection site-frequency spectra resemble our observed spectra. In general, however, background selection does not fit our data.
Appendix 16
The joint action of several evolutionary mechanisms
The analysis thus has shown that, considered singly, the various factors such as demography and background selection do not provide a good fit, particularly not involving the derived alleles at the right tail of the site-frequency spectrum. Studying the site-frequency spectrum under recurrent selective sweeps, Kim, 2006 stated that ‘the excess of high-frequency derived alleles, previously shown to be a signature of single selective sweeps, disappears with recurrent sweeps.’ This effect is sometimes—incorrectly—taken to mean that the site-frequency spectrum is no longer U-shaped under recurrent selective sweeps. However, the excess or deficiency of high-frequency derived alleles is in reference to expectations of the Kingman coalescent (Kim, 2006) and how that affects Fay and Wu's statistic (Fay and Wu, 2000). The site frequencies of alleles at intermediate allele frequencies (the alleles contributing most to the variance in fitness) are still reduced under recurrent sweeps (Kim, 2006) preserving the U-shaped site-frequency spectrum observed under a single selective sweep. Analyzing the joint action of demography, purifying and background selection with or without random sweepstakes on the genome level is computationally prohibitive. We, therefore, resorted to simulations using SLiM (Haller and Messer, 2019) of a sizeable fragment of a chromosome evolving under the joint action of several mechanisms of evolution (Appendix 6—figure 19). As is common in complex, multi-component simulations, it may be possible to tweak parameters to obtain results matching the observed data. Nevertheless, a comprehensive model-fitting search is infeasible in our setting. However, the combined effect of negative background selection without selective sweeps did not produce qualitatively accurate, U-shaped site-frequency spectra for any parameter combination we tested. Furthermore, a combination of random sweepstakes, randomly occurring bottlenecks, and background selection (Appendix 6—figure 19e, f) did not produce a qualitatively similar U-shaped pattern as the data. Hence, even if best-fit parameters could match the data, we expect the fit would not be robust to small changes in either parameter values or observed data, thus having low predictive and explanatory power. In contrast, scenarios involving selective sweeps routinely produced the right qualitative shape of the site-frequency spectra. Hence, we expect a (hypothetical) best-fit analysis to be far more robust.
Appendix 17
Rates of selective sweeps and genomic footprints
The average nucleotide distance between the Atlantic cod and walleye pollock sister taxa was 0.00508 nucleotide substitutions per nucleotide site for the average chromosome and the genome as a whole (Appendix 6—figure 20 and Appendix 7—table 4). On the assumption that Atlantic cod and walleye pollock split at the opening of the Bering Strait 3.5 million years ago (Vermeij, 1991; Vermeij and Roopnarine, 2008; Coulson et al., 2006; Carr and Marshall, 2008) the mutation rate is 7.26 × 10−10 substitutions per nucleotide site per year, about an order of magnitude lower than the mtDNA rate of 1 × 10−8 (Carr and Marshall, 2008). This translates into a mutation rate of 3.63 × 10−9 nucleotides per nucleotide site per generation. Thus, there are substitutions (where 28,406,758 is the average length of a chromosome Appendix 7—table 4) per chromosome per year since the divergence of the two taxa or roughly one substitution per chromosome every 50 years (Appendix 7—table 4). For the genome as a whole, there is roughly one-half substitution per year or a substitution every 2 years on average. With a 5-year generation time (Appendix 9 and Appendix 7—table 3), the average chromosome has a substitution every 10 generations and the Atlantic cod genome has 2.5 substitutions per generation.
The analysis of site-frequency spectra using ANGSD (Korneliussen et al., 2014) yields an estimate of the number of invariant sites, the number of segregating sites, and the number of fixed sites between the outgroup used to polarize the spectrum and the focal taxon (Appendix 7—table 7 and Appendix 7—table 8). Based on the number of fixed sites using walley pollock (Gch) as an outgroup we estimate that within Atlantic cod there have been 0.043 substitutions per average chromosome per year and one substitution per genome per year. This translates to one substitution every 23 years or four to five generations for the average chromosome and one substitution in the whole genome per year or five substitutions per generation (Appendix 7—table 7 and Appendix 7—table 8). The rates of evolution obtained with this approach are similar to the rates above using the divergence of the taxa.
We estimated the diversity statistics Watterson’s and average pairwise nucleotide diversity , which is an estimate of , as well as the neutrality statistics Tajima’s and Fu and Li’s for the South/south-east and Þistilfjörður populations respectively (Appendix 7—table 1 and Appendix 7—table 2). The neutrality statistics were significantly negative as expected under the Durrett–Schweinsberg coalescent model of recurrent selective sweeps (Durrett and Schweinsberg, 2005). The average pairwise nucleotide diversity is per nucleotide site in the South/south-east populations and slightly higher for the Þistilfjörður population (Appendix 7—table 1 and Appendix 7—table 2).
The average pairwise nucleotide diversity in the South/south-east population (Appendix 7—table 1), which is the average number of differences between pairs of sequences looking forward, is a natural proxy for the mean time until a pair of lineages coalesces when looking backwards under both the Kingman and the Durrett–Schweinsberg model. Under the Durrett–Schweinsberg model with the range of parameters that describes our data (c ≈ 6–19), most of these pairwise mergers are caused by sweeps, and hence the mean time between sweeps will be commensurate to the mean time until a pair coalescence. This is likely to be lower than the rate of all sweeps per chromosome because not every sweep happens close enough to the given site to be likely to cause a merger. In one extreme, the selective advantage might be extreme in comparison to recombination, so that the sweep typically catches most or all of the chromosome, in which case the calculated rate of sweeps is about right. Alternatively, it could be that recombination is more potent than selection, in which case only a short region of genome hitchhikes with each sweep. In this case, the actual rate of sweeps per chromosome would be higher. Identifying a more detailed rate would be equivalent to teasing apart the components of the parameter of the Durrett–Schweinsberg model.
The Durrett–Schweinsberg model is essentially a model of a single locus on a single chromosome. We estimate that each chromosome in Atlantic cod is affected by a sweep every 23–50 years on average or every 4–10 generations. Since we also see evidence of rapid recombination (Appendix 6—figure 6), we expect that one sweep will not have a substantial effect on a large region of a chromosome. Recombination is high enough that linkage disequilibrium decays to the background over 25–100 kb (Appendix 6—figure 6). We see similar pattern evidence for selective sweeps for different-sized chromosomal fragments, for different functional regions, and for all chromosomes (Figure 5 and Appendix 6—figure 13 Appendix 6—figure 17 and Appendix 6—figure 18). Thus, genomic footprints of all but powerful selective sweeps will be relatively short. There is thus clear evidence that sweeps happen everywhere along the genome, and it is therefore likely that the actual rate of sweeps is even faster than our estimate. For example, if an average sweep were to affect only 10% of a chromosome, then we expect sweeps every year or so on the average chromosome to explain our results. Building a fully quantitative, data-informed picture of the rate of sweeps requires the development of a diploid, genomic version of the Durrett–Schweinsberg model, which is currently absent from the literature, and for which task our results provide strong applied motivation.
The census population size of Atlantic cod in Iceland may be a billion to a trillion (109 to 1012) fish. However, the effective population size is much smaller. Suppose a molecular clock dates the most recent common ancestor (MRCA) of two sequences at years, which results in a Kingman effective population size of . The mutation scaled population size is estimated with pairwise nucleotide diversity and with Watterson’s estimator (Appendix 7—table 1 and Appendix 7—table 2). Assuming a mutation rate of per generation and the Kingman coalescent the effective size of, for example, the South/south-east population ranges from to depending on which parameter estimate we use (Appendix 7—table 1). The corresponding Durrett–Schweinsberg population size is , because selective sweeps increase the pair-merger rate from 1 to . In our case, is ≈6–19 times larger than of the classical Kingman coalescent, so the discrepancy between the census population size and the population size we need to plug into the model has been reduced (though of course will still be much less than the census size of a billion). Moreover, if a pair of lineages merges after years on average, and of mergers are caused by selective sweeps, then the rate of effective sweeps (sweeps that cause at least one merger at a given site) is one per years.
We can also approach the question of whether one can plug the parameter estimates obtained using the coalescent model back into the model and recover the genetic variation used to estimate those very parameters. In this vein, we ask what population size is required to recover the number of segregating sites on an average chromosome under the Durrett–Schweinsberg model. Standard coalescent theory arguments assuming the infinitely many sites mutation model given that the expected number of segregating sites on an average chromosome is . Here, μ is the mutation rate per site per generation and is the number of sites of an average chromosome. In the Kingman coalescent, time is rescaled using generations as one coalescent time unit where is the probability that two individuals picked at random in a generation are descended from the same parent in the previous generation. Thus for the Kingman coalescent, where is the effective population size. For the Durrett–Schweinsberg model, we have the actual population size, and we write , where is a function of the sample size and the compound parameter of the Durrett–Schweinsberg model. Thus, the minimum population size required to account for our results is .
The numbers to plug into the equation are per nucleotide site per generation (Appendix 7—table 4), the number of segregating sites , and the total number of sites (Appendix 7—table 7 and see Appendix 7—table 8), and the function using the sample size of chromosomes of the South/south-east population and the average based on a recursion adapted for the Durrett–Schweinsberg model from a recursion for a general coalescent (Birkner et al., 2013b). Thus, the population size required to account for our observations is or roughly . This estimate is well within reasonable limits (1 billion cod) and leaves ample room for an even more rapid rate of evolution by selective sweeps than we have observed.
The compound parameter can be thought of as a density of selection along the chromosome (Aeschbacher et al., 2017). The (numerical) derivative of with respect to the recombination unit would yield an effective local selection coefficient. However, estimating this in windows along the chromosome is too noisy. However, since the number of SNPs is proportional to branch length, which is proportional to , it is legitimate to regard these as factors by which to multiply the single already estimated in each 25-kb window. Appendix 6—figure 22 shows the relative diversity as an indication of the density of selection along chromosome 4 (the largest chromosome). The selective effects are distributed thoughout the chromosome similar to the results of genome scans for selection (Appendix 6—figure 6).
Data availability
All data needed to evaluate the conclusions of the paper are presented in the paper, and/or the supplementary materials. The bam files of the whole-genome sequencing of each individual aligned to the Gadmor3 reference genome (NCBI accession ID: GCF_902167405.1) are available from the NCBI SRA Sequence Read Archive under accession number BioProject ID: PRJNA663624 at the time of publication.
-
Dryad Digital RepositorySweepstakes reproductive success via pervasive and recurrent selective sweeps.https://doi.org/10.5061/dryad.bcc2fqzgx
-
NCBI Sequence Read ArchiveID PRJNA663624. Genomic coalescent-based evidence of sweepstakes reproduction in Atlantic cod.
References
-
ReportState of marine stocks in Icelandic waters 2000/2001. Prospects for the quota year 2001/2002 Technical reportMarine Research Institute.
-
The genomic rate of adaptive amino acid substitution in DrosophilaMolecular Biology and Evolution 21:1350–1360.https://doi.org/10.1093/molbev/msh134
-
Coalescent results for diploid exchangeable population modelsElectronic Journal of Probability 23:EJP175.https://doi.org/10.1214/18-EJP175
-
Why we are not dead one hundred times overEvolution; International Journal of Organic Evolution 67:3354–3361.https://doi.org/10.1111/evo.12195
-
Ancestral inference on gene trees under selectionTheoretical Population Biology 66:219–232.https://doi.org/10.1016/j.tpb.2004.06.006
-
The regularity of the spawning season of some fishesICES Journal of Marine Science 33:81–92.https://doi.org/10.1093/icesjms/33.1.81
-
The “great salinity anomaly” in the Northern North Atlantic 1968–1982Progress in Oceanography 20:103–151.https://doi.org/10.1016/0079-6611(88)90049-3
-
Coalescents and genealogical structure under neutralityAnnual Review of Genetics 29:401–421.https://doi.org/10.1146/annurev.ge.29.120195.002153
-
Particle representations for measure-valued population modelsThe Annals of Probability 27:166–205.https://doi.org/10.1214/aop/1022677258
-
A coalescent model for the effect of advantageous mutations on the genealogy of a populationStochastic Processes and Their Applications 115:1628–1657.https://doi.org/10.1016/j.spa.2005.04.009
-
BookProbability Models for DNA Sequence EvolutionNew York: Springer-Verlag.https://doi.org/10.1007/978-0-387-78168-6
-
Evolutionary genomics of high fecundityAnnual Review of Genetics 54:213–236.https://doi.org/10.1146/annurev-genet-021920-095932
-
SoftwareDurrett_Schweinsberg_Expected_SFS, version swh:1:rev:07a534d2d6b5870762bfe6dd3c79f860eb82494aSoftware Heritage.
-
SoftwareSelective-sweepstakes, version swh:1:rev:3235fd1a87f2741b486cb9fe17a15ae85f605d26Software Heritage.
-
The genomic rate of adaptive evolutionTrends in Ecology & Evolution 21:569–575.https://doi.org/10.1016/j.tree.2006.06.015
-
Evolving inversionsTrends in Ecology & Evolution 34:239–248.https://doi.org/10.1016/j.tree.2018.12.005
-
On the biological significance of the cost of gene substitutionThe American Naturalist 105:1–11.https://doi.org/10.1086/282698
-
BookThe Genetical Theory of Natural SelectionOxford: Clarendon Press.https://doi.org/10.5962/bhl.title.27468
-
Cannings models, population size changes and multiple-merger coalescentsJournal of Mathematical Biology 80:1497–1521.https://doi.org/10.1007/s00285-020-01470-5
-
Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East AsiaAmerican Journal of Human Genetics 98:456–472.https://doi.org/10.1016/j.ajhg.2015.12.022
-
Book40 Years of EvolutionPrinceton, NJ: Princeton University Press.https://doi.org/10.1515/9781400851300
-
Slim 3: forward genetic simulations beyond the Wright-Fisher modelMolecular Biology and Evolution 36:632–637.https://doi.org/10.1093/molbev/msy228
-
ReSeqTools: an integrated toolkit for large-scale next-generation sequencing based resequencing analysisGenetics and Molecular Research 12:6275–6283.https://doi.org/10.4238/2013.December.4.15
-
BookDoes variance in reproductive success limit effective population size of marine organismsIn: Beaumont A, editors. Genetics and Evolution of Aquatic Organisms. London: Chapman & Hall. pp. 122–134.
-
Sweepstakes reproductive success in highly fecund marine fish and shellfish: a review and commentaryBulletin of Marine Science 87:971–1002.https://doi.org/10.5343/bms.2010.1051
-
A genomic island linked to ecotype divergence in Atlantic codMolecular Ecology 22:2653–2667.https://doi.org/10.1111/mec.12284
-
Context dependence, ancestral misidentification, and spurious signatures of natural selectionMolecular Biology and Evolution 24:1792–1800.https://doi.org/10.1093/molbev/msm108
-
Genetic consequences of climatic oscillations in the QuaternaryPhilosophical Transactions of the Royal Society of London. Series B, Biological Sciences 359:183–195.https://doi.org/10.1098/rstb.2003.1388
-
Spawning behaviour of Atlantic cod,Gadus morhua: evidence of mate competition and mate choice in a broadcast spawnerCanadian Journal of Fisheries and Aquatic Sciences 56:97–104.https://doi.org/10.1139/f98-216
-
Tagging of cod (Gadus morhua) in Icelandic waters, 1948–1986 and tagging of haddock (Gadus aeglefinus) in Icelandic waters, 1953–1965Rit Fiskideildar 14:1–82.
-
Efficient coalescent simulation and genealogical analysis for large sample sizesPLOS Computational Biology 12:e1004842.https://doi.org/10.1371/journal.pcbi.1004842
-
The coalescentStochastic Processes and Their Applications 13:235–248.https://doi.org/10.1016/0304-4149(82)90011-4
-
ANGSD: Analysis of next generation sequencing dataBMC Bioinformatics 15:356.https://doi.org/10.1186/s12859-014-0356-4
-
Multi-locus data distinguishes between population growth and multiple merger coalescentsStatistical Applications in Genetics and Molecular Biology 17:2017-0011.https://doi.org/10.1515/sagmb-2017-0011
-
Robust model selection between population growth and multiple merger coalescentsMathematical Biosciences 311:1–12.https://doi.org/10.1016/j.mbs.2019.03.004
-
Temporal change of mitochondrial DNA haplotype frequencies and female effective size in a brown trout (Salmo trutta) populationEvolution; International Journal of Organic Evolution 52:910–915.https://doi.org/10.1111/j.1558-5646.1998.tb03716.x
-
Software for computing and annotating genomic rangesPLOS Computational Biology 9:e1003118.https://doi.org/10.1371/journal.pcbi.1003118
-
FastME 2.0: A comprehensive, accurate, and fast distance-based phylogeny inference program: Table 1Molecular Biology and Evolution 32:2798–2800.https://doi.org/10.1093/molbev/msv150
-
Evolution and the theory of gamesJournal of Theoretical Biology 1:382–403.https://doi.org/10.1016/0022-5193(61)90038-8
-
“Reverse Ecology” and the power of population genomicsEvolution; International Journal of Organic Evolution 62:2984–2994.https://doi.org/10.1111/j.1558-5646.2008.00486.x
-
Exploring population size changes using SNP frequency spectraNature Genetics 47:555–559.https://doi.org/10.1038/ng.3254
-
Human prehistoric demography revealed by the polymorphic pattern of CpG transitionsMolecular Biology and Evolution 37:2691–2698.https://doi.org/10.1093/molbev/msaa112
-
Fecundity of Atlantic codJournal of the Fisheries Research Board of Canada 24:1531–1551.https://doi.org/10.1139/f67-127
-
Approximating the coalescent with recombinationPhilosophical Transactions of the Royal Society of London. Series B, Biological Sciences 360:1387–1393.https://doi.org/10.1098/rstb.2005.1673
-
Coalescent patterns in diploid exchangeable population modelsJournal of Mathematical Biology 47:337–352.https://doi.org/10.1007/s00285-003-0218-6
-
Random processes in geneticsMathematical Proceedings of the Cambridge Philosophical Society 54:60–71.https://doi.org/10.1017/S0305004100033193
-
Genetic draft, selective interference, and population genetics of rapid adaptationAnnual Review of Ecology, Evolution, and Systematics 44:195–215.https://doi.org/10.1146/annurev-ecolsys-110512-135920
-
The bottleneck effect and genetic variability in populationsEvolution; International Journal of Organic Evolution 29:1–10.https://doi.org/10.1111/j.1558-5646.1975.tb00807.x
-
Molecular signatures of natural selectionAnnual Review of Genetics 39:197–218.https://doi.org/10.1146/annurev.genet.39.073003.112420
-
Reproductive skew in Japanese sardine inferred from DNA sequencesICES Journal of Marine Science 73:2181–2189.https://doi.org/10.1093/icesjms/fsw070
-
Is cod lekking or a promiscuous group spawner?Fish and Fisheries 1:90–93.https://doi.org/10.1046/j.1467-2979.2000.00005.x
-
The influence of variation in female fecundity on effective population sizeBiological Journal of the Linnean Society 59:411–425.https://doi.org/10.1111/j.1095-8312.1996.tb01474.x
-
Coalescents with multiple collisionsThe Annals of Probability 27:1870–1902.https://doi.org/10.1214/aop/1022874819
-
Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice, and humansMolecular Biology and Evolution 13:735–748.https://doi.org/10.1093/oxfordjournals.molbev.a025634
-
Identifying loci under positive selection in complex population historiesGenome Research 29:1506–1520.https://doi.org/10.1101/gr.246777.118
-
The general coalescent with asynchronous mergers of ancestral linesJournal of Applied Probability 36:1116–1125.https://doi.org/10.1239/jap/1032374759
-
A coalescent process with simultaneous multiple mergers for approximating the gene genealogies of many marine organismsTheoretical Population Biology 74:104–114.https://doi.org/10.1016/j.tpb.2008.04.009
-
Coalescents with simultaneous multiple collisionsElectronic Journal of Probability 5:1–50.https://doi.org/10.1214/EJP.v5-68
-
Coalescent processes obtained from supercritical Galton–Watson processesStochastic Processes and Their Applications 106:107–139.https://doi.org/10.1016/S0304-4149(03)00028-0
-
Rigorous results for a population model with selection II: Genealogy of the populationElectronic Journal of Probability 22:EJP58.https://doi.org/10.1214/17-EJP58
-
SoftwareComp-pop-gen, version swh:1:rev:f916dc01e114a420b3d89e082f81710106fcbd0bSoftware Heritage.
-
From Fastq data to high confidence variant calls: the genome analysis toolkit best practices pipelineCurrent Protocols in Bioinformatics 43:11.https://doi.org/10.1002/0471250953.bi1110s43
-
Anatomy of an invasion: the trans-Arctic interchangePaleobiology 17:281–307.https://doi.org/10.1017/S0094837300010617
-
Improving the estimation of genetic distances from next-generation sequencing dataBiological Journal of the Linnean Society 117:139–149.https://doi.org/10.1111/bij.12511
-
BookCoalescent Theory: An IntroductionGreenwood Village, CO: Roberts & Company Publishers.
-
Hard and soft selection revisitedEvolution; International Journal of Organic Evolution 29:465–473.https://doi.org/10.1111/j.1558-5646.1975.tb00836.x
-
A sign of superspreading in tuberculosisEpidemiology 24:395–400.https://doi.org/10.1097/EDE.0b013e3182878e19
-
Ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated dataMethods in Ecology and Evolution 8:28–36.https://doi.org/10.1111/2041-210X.12628
Article and author information
Author details
Funding
Rannsóknasjóður. Rannís (Rannsóknamiðstöð Íslands) (185151-051)
- Einar Árnason
- Katrín Halldórsdóttir
- Bjarki Eldon
Deutsche Forschungsgemeinschaft (273887127 STE 325/17)
- Bjarki Eldon
Engineering and Physical Sciences Research Council (EP/R044732/1 and EP/V049208/1)
- Jere Koskela
Deutsche Forschungsgemeinschaft (DFG Priority Program (SPP) 1819: Rapid evolutionary adaptation and DFG SPP1819 start-up module grant)
- Bjarki Eldon
The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank John Wakeley, Fabian Freund, Pierre-Alexandre Gagnaire, and anonymous reviewers for critical comments on the manuscript and Kristján Kristinsson and the Icelandic Marine Research Institute for help in sampling. We acknowledge technical help with the molecular analysis by The Bauer Core Facility at Harvard University. Funding: The work was supported by an Icelandic Research Fund Grant of Excellence no. 185151-051 to EÁ, KH, Alison Etheridge, Wolfgang Stephan, and BE. BE also acknowledges financial support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) Project number 273887127 through DFG grant STE 325/17 to Wolfgang Stephan through DFG Priority Program (SPP) 1819: Rapid evolutionary adaptation, a DFG SPP1819 start-up module grant to JK, Maite Wilke Berenguer, and BE, and JK acknowledges financial support from Engineering and Physical Sciences Research Council (EPSRC) grants EP/R044732/1 and EP/V049208/1.
Copyright
© 2023, Árnason et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 928
- views
-
- 101
- downloads
-
- 17
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Evolutionary Biology
- Genetics and Genomics
It is well established that several Homo sapiens populations experienced admixture with extinct human species during their evolutionary history. Sometimes, such a gene flow could have played a role in modulating their capability to cope with a variety of selective pressures, thus resulting in archaic adaptive introgression events. A paradigmatic example of this evolutionary mechanism is offered by the EPAS1 gene, whose most frequent haplotype in Himalayan highlanders was proved to reduce their susceptibility to chronic mountain sickness and to be introduced in the gene pool of their ancestors by admixture with Denisovans. In this study, we aimed at further expanding the investigation of the impact of archaic introgression on more complex adaptive responses to hypobaric hypoxia evolved by populations of Tibetan/Sherpa ancestry, which have been plausibly mediated by soft selective sweeps and/or polygenic adaptations rather than by hard selective sweeps. For this purpose, we used a combination of composite-likelihood and gene network-based methods to detect adaptive loci in introgressed chromosomal segments from Tibetan WGS data and to shortlist those presenting Denisovan-like derived alleles that participate to the same functional pathways and are absent in populations of African ancestry, which are supposed to do not have experienced Denisovan admixture. According to this approach, we identified multiple genes putatively involved in archaic introgression events and that, especially as regards TBC1D1, RASGRF2, PRKAG2, and KRAS, have plausibly contributed to shape the adaptive modulation of angiogenesis and of certain cardiovascular traits in high-altitude Himalayan peoples. These findings provided unprecedented evidence about the complexity of the adaptive phenotype evolved by these human groups to cope with challenges imposed by hypobaric hypoxia, offering new insights into the tangled interplay of genetic determinants that mediates the physiological adjustments crucial for human adaptation to the high-altitude environment.
-
- Evolutionary Biology
Signs of ageing become apparent only late in life, after organismal development is finalized. Ageing, most notably, decreases an individual’s fitness. As such, it is most commonly perceived as a non-adaptive force of evolution and considered a by-product of natural selection. Building upon the evolutionarily conserved age-related Smurf phenotype, we propose a simple mathematical life-history trait model in which an organism is characterized by two core abilities: reproduction and homeostasis. Through the simulation of this model, we observe (1) the convergence of fertility’s end with the onset of senescence, (2) the relative success of ageing populations, as compared to non-ageing populations, and (3) the enhanced evolvability (i.e. the generation of genetic variability) of ageing populations. In addition, we formally demonstrate the mathematical convergence observed in (1). We thus theorize that mechanisms that link the timing of fertility and ageing have been selected and fixed over evolutionary history, which, in turn, explains why ageing populations are more evolvable and therefore more successful. Broadly speaking, our work suggests that ageing is an adaptive force of evolution.