Mutation saturation for fitness effects at human CpG sites

  1. Ipsita Agarwal  Is a corresponding author
  2. Molly Przeworski  Is a corresponding author
  1. Department of Biological Sciences, Columbia University, United States
  2. Department of Systems Biology, Columbia University, United States

Abstract

Whole exome sequences have now been collected for millions of humans, with the related goals of identifying pathogenic mutations in patients and establishing reference repositories of data from unaffected individuals. As a result, we are approaching an important limit, in which datasets are large enough that, in the absence of natural selection, every highly mutable site will have experienced at least one mutation in the genealogical history of the sample. Here, we focus on CpG sites that are methylated in the germline and experience mutations to T at an elevated rate of ~10-7 per site per generation; considering synonymous mutations in a sample of 390,000 individuals, ~ 99 % of such CpG sites harbor a C/T polymorphism. Methylated CpG sites provide a natural mutation saturation experiment for fitness effects: as we show, at current sample sizes, not seeing a non-synonymous polymorphism is indicative of strong selection against that mutation. We rely on this idea in order to directly identify a subset of CpG transitions that are likely to be highly deleterious, including ~27 % of possible loss-of-function mutations, and up to 20 % of possible missense mutations, depending on the type of functional site in which they occur. Unlike methylated CpGs, most mutation types, with rates on the order of 10-8 or 10-9, remain very far from saturation. We discuss what these findings imply for interpreting the potential clinical relevance of mutations from their presence or absence in reference databases and for inferences about the fitness effects of new mutations.

Introduction

A central goal of human genetics is to identify pathogenic mutations and predict how likely they are to cause disease. To this end, exome sequencing in cases and controls is often used to help identify variants with potentially large effects on disease risk (Rauch et al., 2012; Sanders et al., 2012; Need et al., 2012; Akbari et al., 2021). Even where this approach yields an enrichment of variants in cases, however, the specific subset of mutations that contributes to disease often remains unknown; similarly, in individual patients, sequencing habitually yields candidate mutations of which the significance is unclear (Richards et al., 2015; Harrison et al., 2021).

Numerous scores have therefore been developed to help prioritize among candidate mutations, based on protein structure, functional annotations, evolutionary patterns, or other features (Cooper et al., 2005; Adzhubei et al., 2010; Pollard et al., 2010; McLaren et al., 2016; Ioannidis et al., 2016; Rentzsch et al., 2019). In particular, a common approach to pinpoint sites at which mutations are likely to be pathogenic is to examine whether they appear to be under purifying selection. For instance, comparisons of sequences across species have been widely used to identify highly conserved genomic regions maintained by selection over millions of years, presumably because of their functional importance (Cooper et al., 2005; Pollard et al., 2010; Boffelli et al., 2003; Siepel et al., 2005).

The same general approach is also useful when applied within humans, where information about purifying selection is contained in whether or not a site is segregating a mutation and at what frequency (Sawyer and Hartl, 1992; Eyre-Walker and Keightley, 2007; Boyko et al., 2008; Williamson et al., 2005; Lek et al., 2016; Karczewski et al., 2020; Yi et al., 2010). For this application, however, the low diversity levels in the genome pose a major difficulty, as a site may be monomorphic simply by chance, that is when mutations at that site have no fitness consequences at all or, at the other extreme, because the mutations are embryonically lethal. In particular, because most sites are monomorphic in samples of hundreds or even thousands of humans, there is little information to distinguish sites under strong selection from those at which mutations are only weakly deleterious.

With a view to capturing natural variation at a larger number of sites in the genome and identifying more mutations with large effects on disease risk, there have been extensive efforts to collate available exome sequences from hundreds of thousands of individuals (Lek et al., 2016; Karczewski et al., 2020; Dewey et al., 2016; Szustakowski, 2020; Van Hout et al., 2020; Taliun et al., 2021). These efforts were also motivated by the idea that public repositories composed of relatively healthy adults not ascertained for a specific severe disease can serve as reference datasets, such that seeing a variant of unknown function in these datasets is indicative of it being benign (Lek et al., 2016; Claussnitzer et al., 2020; Ghouse et al., 2018). The validity of that assumption remains to be evaluated, however, especially as the repositories grow in size.

Beyond their utility in human genetics, these datasets provide an unprecedented opportunity to learn about the fitness effects of new mutations. Modeling the distribution of fitness effects (DFE) has a long history in population genetics (Sawyer and Hartl, 1992; Eyre-Walker and Keightley, 2007; Otto, 2000), but until recently, inferences were based on genetic variation in samples of at most a couple of thousand chromosomes (Eyre-Walker and Keightley, 2007; Boyko et al., 2008; Williamson et al., 2005; Eyre-Walker et al., 2006; Kim et al., 2017). As is well appreciated, the fitness effects at the few sites segregating at such sample sizes are a small and biased draw from the DFE and thus the inferred distribution of fitness effects is unlikely to recapitulate the true DFE in the genome. Moreover, for lack of sufficient information with which to distinguish weakly from strongly selected mutations, a number of approaches have relied on a specific and arbitrary parametric form for the distribution of fitness effects across sites. In that regard, not only do inferences based on small samples result in relatively noisy parameter estimates, the results can be misleading, especially about the fraction of sites under strong selection (Kim et al., 2017). Current samples in humans may allow for these limitations to start to be overcome.

Motivated by these considerations, we focus on a class of mutations known to experience mutations an order of magnitude more frequently than other types of sites in the human genome: CpG sites that are methylated in the germline (Duncan and Miller, 1980; Nachman and Crowell, 2000; Kong et al., 2012; Figure 1—figure supplement 1). We use these sites as a test case for what can be learned about selection when neutral sites are saturated, i.e., have all experienced at least one mutation in the history of the sample, and draw out implications for the interpretation of mutations as pathogenic and for inferences about fitness effects.

Results

Mutation saturation at CpGs

An attractive feature of methylated CpG (mCpG) sites is that a single mechanism, the spontaneous deamination of methyl-cytosine, is believed to underlie the uniquely high rate of C > T mutations at these sites (Duncan and Miller, 1980); thus, germline methylation at CpG sites is strongly predictive of their mutability (Kong et al., 2012; Jónsson et al., 2017; Gao et al., 2019; Figure 1—figure supplement 2). Here, we define ‘methylated’ CpG sites in exons as those that are methylated ≥65 % of the time in both testes and ovaries. For these ~1.1 million sites (of 1.8 million total CpG sites in sequenced exons), we calculate a mean haploid, autosomal C > T mutation rate of 1.17 × 10–7 per generation using de novo mutations (DNMs) in a sample of ~2900 sequenced parent-offspring trios (Materials and methods, Figure 1—figure supplements 12, Halldorsson et al., 2019).

Although methylation levels are the dominant predictor of mutation rates at CpG sites, they are not the only influence. Notably, CpG transitions differ somewhat in their mutation rates based on their trinucleotide context (Figure 1—figure supplement 3a; Aggarwala and Voight, 2016); even so, they are consistently an order of magnitude higher than the genome average (Kong et al., 2012). Broader scale features, such as replication timing, have also been reported to shape mutation rates (Stamatoyannopoulos et al., 2009; Smith et al., 2018). Nonetheless, considering methylated CpGs inside and outside exons, which differ in a number of these features, there is no appreciable difference in average DNM rates (Fisher Exact Test (FET) p-value = 0.1, Figure 1—figure supplement 4a). Similarly, the rate at which two DNMs occur at the same site, a summary statistic that reflects the variance in mutation rates, is not significantly different for methylated CpGs inside versus outside exons (FET p-value = 0.35; Figure 1—figure supplement 4b). Thus, while there is some variation in mutability per site among methylated CpGs, it appears to be small relative to the mean mutation rate across all methylated CpGs.

Considering all such CpG sites therefore, we ask what fraction are segregating at existing sample sizes. To this end, we collate polymorphism data made public by gnomAD (Karczewski et al., 2020), the UK Biobank (Szustakowski, 2020), and the DiscovEHR collaboration between the Regeneron Genetics Center and Geisinger Health System (Dewey et al., 2016) in order to ascertain whether both C and T alleles are present in a sample of ~390K individuals (Materials and methods).

To focus on the subset of genic changes most likely to be neutrally-evolving, we consider the ~350,000 methylated CpG sites at which C > T mutations do not change the amino acid. At these sites, 94.7 % of all possible synonymous CpG transitions are observed in the gnomAD data alone, and 98.8 % in the combined sample including all three datasets (Figure 1). In other words, nearly every methylated CpG site where a mutation to T is putatively neutral has experienced at least one such mutation in the history of the sample of 390K individuals. Even in the least mutable CpG trinucleotide context, 98 % of putatively neutral sites are segregating in current samples (Figure 1—figure supplement 3b). These observations imply that in the absence of selection, almost every methylated CpG site would be segregating a T at current sample sizes--and further that not seeing a T provides strong evidence it was removed by selection.

Figure 1 with 4 supplements see all
Fraction of methylated CpG sites that are polymorphic for a transition, by sample size.

The combined dataset encompasses three non-overlapping data sources: gnomAD (v2.1), the UK Biobank (UKB), and the DiscovEHR cohort. ‘European’ samples include the populations designated as ‘EUR’ in 1000 Genomes, ‘Non-Finnish European’ subsets of exome and whole genome datasets in gnomAD, as well as the UK Biobank and DiscovEHR, which have >90% samples labeled as of European ancestry.

Testing a neutral model for individual sites

The mutation saturation at methylated CpG sites provides a robust approach to identify individual sites that are not neutrally-evolving. One way to view it is in terms of a p-value: under a null model with no selection, from which we assume that synonymous sites are drawn, all but 1.2 % of neutral sites are segregating in a sample of 390K individuals. Therefore, if a given non-synonymous site, say, is invariant in a sample of ≥390K individuals, we can reject the neutral null model for this site at a significance level of 0.012. Similarly, we can ask about the probability that an invariant non-synonymous site is neutral, using a false discovery rate (FDR) approach: given that 1.2 % of neutral sites are invariant, whereas 7.4 % of non-synonymous sites are, the FDR is 1.2/7.4 = 16%. Thus, at current sample sizes, there is a substantial amount of information about whether individual CpG transitions are deleterious. By contrast, in a smaller sample with only 10 % of putatively neutral sites segregating, there is almost no information about selection in observing individual sites to be invariant (p ≤ 0.9).

This approach implicitly assumes that synonymous and non-synonymous sites do not differ in their distributions of mutation rates and that their distributions of genealogical histories are also the same, i.e.,, that the two types of sites are subject to comparable effects of linked selection. While we cannot examine whether the distributions of mutation rates are identical for lack of data, we verify that the mean de novo mutation rates do not differ for synonymous sites and for various non-synonymous annotations (Figure 2a); we also check that the distributions of methylation levels (conditional on ≥65%), an important determinant of mutation rates, are similar for synonymous and non-synonymous sites (with a significant but small shift towards higher methylation and thus presumably higher mutation rates for non-synonymous sites; Figure 2—figure supplement 1). In turn, the standard assumption of similar distributions of genealogical histories seems sensible, given that the sites are interdigitated within genic regions (McDonald and Kreitman, 1991). Under these few and at least somewhat testable assumptions, the approach based on mutation saturation at methylated CpG sites then enables us to directly pinpoint individual sites that are not neutrally evolving. We note further that if synonymous sites are not all neutral and instead some fraction are under selection, the same idea would apply, but the null model would have to be modified accordingly.

Figure 2 with 5 supplements see all
Comparing de novo mutation rates and the fraction of segregating sites across annotations.

(a) DNM rates for CpG transitions at highly methylated sites by annotation class, rescaled by the total DNM rate in exons. Fisher exact tests (FETs) of the proportion of sites with DNMs in each annotation compared to all other annotations yield p-values > 0.1 in all cases. (b) Fraction of highly methylated CpG sites that are segregating as a C/T polymorphism in an annotation class, relative to the fraction of synonymous sites segregating. Error bars are 95 % confidence intervals assuming the number of segregating sites is binomially distributed (FET p-values << 10–5 for comparisons of all annotations with synonymous sites). LOF variants are defined as stop-gained and splice donor/acceptor variants that do not fall near the end of the transcript, and meet the other criteria to be classified as ‘high-confidence’ loss-of-function in the gnomAD data (see Materials and Methods). (c) The amount of data for synonymous and missense changes involving highly methylated CpG transitions by the type of functional protein site. (d) The proportion of synonymous and missense segregating C/T polymorphisms in different classes of functional sites. Error bars are 95 % confidence intervals assuming the number of segregating sites is binomially distributed (FET p-values << 10–5 for comparisons of all missense annotations with synonymous sites; Materials and methods). All annotations are obtained using the canonical transcripts of protein coding genes (see Materials and methods).

Comparing the fraction of segregating sites across annotations

Under these same weak assumptions, it is also possible to compare the proportion of methylated CpG sites polymorphic for a transition across annotations. Here, we consider the fraction of sites segregating a transition in each annotation class in a sample of 780K chromosomes, rescaled by the fraction segregating at synonymous sites. All categories of missense, loss-of-function, and regulatory variants show a significant depletion in the fraction of segregating sites compared to synonymous variants (Figure 2b). The deficit for a given annotation is an indicator of the deleteriousness of de novo mutations in that annotation. Specifically, in our sample of 780K, the deficit for each annotation reflects sites for which we can reject neutrality at a significance level of 0.012.

These data therefore suggest that there are ~27 % fewer loss-of-function variants than would be expected under neutrality; at invariant sites within this annotation, neutrality can be rejected at an FDR of only 4.4 % ( = 1.2/27). A 27 % deficit of loss-of-function variants is again seen if we match the sites to synonymous mutational opportunities with the same predicted level of linked selection, i.e., with similar genealogical histories (McVicker et al., 2009; Figure 2—figure supplement 2a). Supporting the widely used assumption that LOF mutations within a gene are equivalent (after filtering for those at the end of transcripts; Karczewski et al., 2020; Cassa et al., 2017), when we compare the set of CpG sites at which mutations are annotated as leading to protein-truncation in the first versus the second half of transcripts, approximately the same number are missing mutations relative to synonymous sites in both subsets (Figure 2—figure supplement 3; FET p-value = 0.9). By comparison, the fraction of missense mutations and splice region variants not observed in current samples is only about 5.3%, and the FDR 22.6 % ( = 1.2/5.3) (whether or not we match for the effects of linked selection; see Figure 2—figure supplement 2a).

While LOF and missense annotation classes are most commonly used in determinations of variant pathogenicity, any two sets of methylated CpGs with similarly-distributed mutation rates can be ranked in this manner. As one example, we stratify missense mutations by the type of functional site in which they occur. For the subset of sites at which missense mutations may disrupt or alter binding, particularly DNA-binding, there is a ~ 12–20% deficit in segregating sites relative to what is seen at synonymous sites, in contrast, say, to the much smaller deficit at missense changes within trans-membrane regions (Figure 2c–d, Figure 2—figure supplement 4; Figure 2—figure supplement 2b). In other words, observing a DNA-binding missense site that is invariant provides stronger statistical evidence that it is deleterious than observing an invariant missense site with no additional functional information (e.g. the FDR is ~1/20 vs. ~1/5).

We can also check that the fraction of sites segregating is inversely proportional to the predicted functional importance of the sites using CADD scores (Rentzsch et al., 2019), widely used measures of constraint that incorporate functional annotations and measures of conservation. Across deciles, mean de novo transition rates at methylated CpGs are similar (Figure 2—figure supplement 5a) and, as expected, the fraction of segregating sites decreases with increasing CADD scores (Figure 2—figure supplement 5b). We note, however, that mutation rates may not always be similar across comparison groups: considering all CpG sites in exons (i.e. not only highly methylated ones), for example, de novo mutation rates are much more variable across CADD deciles (Figure 2—figure supplement 5c). Consequently, the depletion of segregating sites no longer has a simple interpretation (Figure 2—figure supplement 5d), instead reflecting a combination of differences in mutation rates and fitness effects. By implication, while CADD scores are meant to isolate the effects of selection, they will in some cases classify sites that have high mutation rates as less constrained, and vice versa.

What can be learned about other mutation types?

Given that current exome samples are informative about selection on transitions at methylated CpGs, a natural question is to ask to what extent there is also information for less mutable types, with mutation rates on the order of 10–8 or 10–9 per site per generation. For sites with mutation rate on the order of 10–9, which is the case for the vast majority of non-CpGs, the fraction of possible synonymous sites that segregate in a sample of 780K chromosomes is very low: for instance, it is 5 % for T > A mutations, which occur at an average rate of 1.2 × 10–9 (Figure 1—figure supplement 1) and 27 % even for other C > T mutations, which occur at a rate of 0.9 × 10–8 per site (Figure 1—figure supplement 1), compared to ~99 % for C > T mutations at methylated CpGs (Figure 3a). For invariant sites of these less mutable types, there is little information with which to evaluate the fit to the neutral null in current samples. Reflecting this lack of information, in the p-value formulation, monomorphic sites would be assigned p ≤ 0.95 for T > A sites and p ≤ 0.73 for C > T sites.

Figure 3 with 2 supplements see all
Comparing the fraction of sites observed and expected to be segregating under neutrality, by mutation type and sample size.

(a) Fraction of possible synonymous C > T mutations at CpG sites methylated in the germline and at all other C sites, and the fraction of possible synonymous T > A mutations that are observed in a sample of given size. (b) Fraction of sites segregating in simulations, assuming neutrality, a specific demographic model and a given mutation rate (see Materials and methods).

How large samples have to be for other mutation types to reach saturation depends on the length of the genealogy that relates sampled individuals, i.e., the sum of the branch lengths, which corresponds to the number of generations over which mutations could have arisen at the site. For a mutation that occurs at rate 1.17 × 10–7 per generation, the average length of the genealogy would have to be greater than 8.5 million (1/1.17 × 10–7) generations for at least one such mutation to be expected at a site. That synonymous CpG sites are close to saturation when they experience mutations to T at this rate suggests that this is in fact the case. Indeed, given that more than one mutation has occurred at a substantial fraction of sites (Karczewski et al., 2020; Harpak et al., 2016), the average length of the genealogy relating the 390K individuals is expected to be substantially longer: about 39 million generations (calculated from the probability of at least one mutation under a Poisson distribution; see Materials and methods). The observation that mutation types with rates on the order of 10–8 are far from saturation further indicates that the average length of the genealogy for these 390K individuals is substantially shorter than 100 million generations. These rough calculations thus provide a sense of the length of the genealogical history represented by these 390K individuals.

To explicitly examine the relationship between sample size, mutation rate and the amount of variation at a locus, we simulate neutral evolution at a single site with the three different mutation rates above, under a variant of the widely-used Schiffels-Durbin demographic model for population growth in Europe (Schiffels and Durbin, 2014), in which we set the effective population size Ne equal to 10 million for the past 50 generations (Methods). While this model is clearly an oversimplification, it recapitulates observed diversity levels for synonymous mutations reasonably well (Figure 3). Consistent with the rough estimate above, under our choice of demographic model, a sample of 780K chromosomes has a genealogy spanning an average of 34 million generations (Figure 3—figure supplement 1a,b).

From first principles, the length of the genealogy is expected to increase much more slowly than linearly with the number of samples (Hudson, 1990; Nelson et al., 2012). Indeed, increasing the number of samples by a factor of 12 only increases the average tree length ~3.3 x (Figure 3b, Figure 3—figure supplement 1a,b); thus, a site that mutates at rate 10–9 per generation is expected to have experienced ~0.04 mutations in the genealogical history of a sample of ~1 million, and 0.1 mutations in a sample of 10 million. The implication is that saturation for mutation rates of 10–8 or 10–9 per site per generation may not be achievable any time soon.

Quantitative predictions of our model are subject to the considerable uncertainty about the demographic history and in particular about the recent effective population size in humans (Figure 3—figure supplement 1b). Moreover, for simplicity, we model one or at most two populations, when samples that combine individuals from more diverse genetic ancestries have longer genealogical histories (Figure 3—figure supplement 1c; see Figure 1) and thus capture more variation. Perhaps most importantly, for the very large sample sizes considered here, the multiple merger coalescent is a more appropriate model (Nelson et al., 2012; Bhaskar et al., 2014). Nonetheless, the qualitative statement that less mutable types will remain very far from saturation in the foreseeable future should hold.

In the absence of information about single sites for most mutational types in the genome, it is still possible to learn to a limited degree about selection using bins of sites. If we construct a bin of K synonymous sites with the same average mutation rate per bin as a single methylated CpG, then at least one site per bin is polymorphic in ~99 % of bins (see Figure 3—figure supplement 2 for an example with T > A mutations and K~100), just as ~99 % of individual methylated CpG sites are segregating. Thus, if a bin of K non-synonymous sites with the same average mutation rate is invariant, the p-value associated with the bin is 0.01, indicating that one or more sites in the bin is likely to be under selection.

How strong is the selection that leads to invariant methylated CpG sites?

Leveraging saturation to identify a subset of sites that are not neutrally-evolving makes appealingly few assumptions, but provides no information about how strong selection is at those sites. To learn about the strength of selection consistent with methylated CpG sites being monomorphic, a series of strong assumptions are needed: we require a demographic model, a prior distribution on hs and a mutation rate distribution across sites. Here, we assume a relatively uninformative log-uniform prior on the selection coefficient s ranging from 10–7 to 1 and fix the dominance coefficient h = 0.5 (as for autosomal mutations with fitness effects in heterozygotes, we only need to specify the compound parameter hs; reviewed in Fuller et al., 2019), as well as a fixed mutation rate of 1.2 × 10–7 per site per generation. We rely on the demographic model for population growth in Europe described above (Schiffels and Durbin, 2014); as is standard (Sawyer and Hartl, 1992; Boyko et al., 2008; Williamson et al., 2005; Eyre-Walker et al., 2006; Kim et al., 2017; Cassa et al., 2017; Simons et al., 2014), we also assume that hs is fixed over time, even as the effective population size changes dramatically. Under these assumptions, we estimate the posterior distribution of hs at a site, given that the site is monomorphic, segregating with 10 or fewer derived copies of the T allele, or segregating with more than 10 copies (Figure 4a and b, Methods). These posterior distributions are estimates of the DFE at an individual mCpG site conditional on seeing 0 copies, 1–10 copies or >10 copies of the T allele.

Figure 4 with 5 supplements see all
Quantifying the strength of selection associated with invariant and segregating sites.

(a) Prior and Posterior log densities of hs for a C > T mutation at a methylated CpG site observed at 0, 1–10, or >10 copies at various sample sizes. (b) Bayes odds (i.e. posterior odds divided by prior odds) of s > 0.001 for a C > T mutation at a methylated CpG site observed at 0, 1–10, or >10 copies, at various sample sizes. (c) Probability of a methylated CpG site segregating a T allele in simulations, if the mutation has no fitness effects (hs = 0) and if it is deleterious (with a heterozygote selection coefficient hs = 0.05%) or highly deleterious (with a heterozygote selection coefficient hs = 5%).

Because Bayes odds provide a natural way to summarize the strength of statistical evidence that comes from the observation at a single site, we consider the Bayes odds that a mutation is subject to hs > 0.5 x 10–3, i.e., is under strong selection (see Materials and methods). In small samples, in which most sites are monomorphic, being monomorphic is consistent with both neutrality and very strong selection (Figure 4a) and the Bayes odds are close to 1, reflecting the fact that the observation barely shifts our prior assumptions (Figure 4b). In contrast, with larger sample sizes, in which putatively neutral CpG sites reach saturation, the posterior distribution for invariant sites is highly peaked–what is not segregating is likely strongly deleterious–and accordingly the Bayes odds become substantially greater than 1. Notably, at current sample sizes of 390K individuals, there is still some dependence on the prior (Figure 4—figure supplement 1), but the Bayes odds of hs > 0.5 x 10–3 at an invariant methylated CpG are large. Given our choice of prior, the odds are 15:1 (Figure 4b), which suggests that most (~15/16) of the ~27 % of LOF mutations and ~6 % of missense mutations not seen in current samples are subject to this degree of selection.

While the relationship of selection strengths to clinical pathogenicity is not straight-forward, selection coefficients on that order are likely to be of relevance to determinations of pathogenicity in clinical settings (Cassa et al., 2017; Kaplanis et al., 2020). Indeed, mutations with hs > 0.5 × 10–3 may be highly deleterious to some individuals that carry them, enough to produce clinically visible effects, but vary substantially in their penetrance. Accordingly, mutations classified as pathogenic in ClinVar (Landrum et al., 2018) or identified as underlying severe developmental disabilities in the Deciphering Developmental Disorders (DDD) cohort (Kaplanis et al., 2020) are 6-fold to 51-fold enriched at sites invariant in 390K individuals compared to those classified as benign (Figure 4—figure supplement 2). This analysis comes with important caveats–notably that the classifications of pathogenicity rely in part on the presence or absence of mutations in reference databases–but it suggests an enrichment on par with estimated Bayes odds of strong selection.

Discussion

Interpreting polymorphic sites in current reference databases

In a sufficiently large sample, even a segregating site can be subject to strong selection (Figure 4a and b). For instance, in current exome sample sizes, a C > T mutation at a methylated CpG site with hs = 0.5 × 10–3 is almost always observed segregating (Figure 4c). This follows from the expectation under mutation-selection-drift balance (Gillespie, 1998): in a constant population size, a mutation that arises at rate 1.2 × 10–7 per generation and is removed by selection at rate hs = 0.05% per generation has an expected population frequency of 2.4 × 10–4; in a sample of 780K, the mean number of copies is 187. Even with substantial variation due to genetic drift and sampling error, such a site should almost always be segregating at that sample size. In fact, even a mutation with hs of 5 % would quite often be observed. Thus, segregating sites in large samples are a mixture of neutral, weakly selected and strongly selected sites. An implication is that, although large reference repositories such as gnomAD were partly motivated by the possibility of excluding deleterious variants, as samples grow in size, it cannot simply be assumed that clinically relevant variants are absent from reference datasets. In principle, the only mutations never seen as samples grow in size would be the ones that are embryonically lethal.

More generally, any interpretation of variants of unknown function by reference to repositories such as gnomAD or disease cohorts enriched for deleterious variation (Taliun et al., 2021), whether the goal is to exclude benign variants or identify likely pathogenic ones, is implicitly reliant on assumptions that change with sample size and dramatically differ by mutation type. At current sample sizes, invariant methylated CpGs are likely highly deleterious; for less mutable types, the information content at invariant sites is very limited at even the largest sample sizes considered (Figure 4—figure supplement 3). Similarly, learning about the fitness consequences of segregating mutations from their observed frequencies is contingent on assumptions about the mutation rate, and the demographic history of the sample.

The distribution of fitness effects in human genes

As we show, a typical site in the genome, with a mutation rate of 10–8 per generation, does not provide much information about selection (Figure 4—figure supplement 3), because the average length of the genealogy is likely substantially less than 108 generations. One exception, which is a special case, is gene loss: each gene can be conceived of as a single locus at which many possible LOF mutations have the same fitness impact (Karczewski et al., 2020; Cassa et al., 2017; Fuller et al., 2019; Weghorn et al., 2019; Agarwal, Fuller, Przeworski, in prep.). The mutation rate to LOF, calculated by summing rates of individual LOF mutations, is ~10–6 per gene per generation on average (Karczewski et al., 2020), such that in the absence of selection, many LOF mutations are expected in most genes. At this special subset of sites, the distribution of fitness effects can be inferred by binning loss-of-function variants within genes (Karczewski et al., 2020; Cassa et al., 2017; Weghorn et al., 2019; Agarwal, Fuller, Przeworski, in prep.).

An analogous strategy to overcome sample size limitations at other types of sites is to infer selection in bins of sites (Dukler et al., 2021); however, if sites within a bin vary in their fitness effects, inferences based on these bins are not straight-forward. Indeed, the mutation frequency in a bin reflects the harmonic mean of hs across sites in the bin weighted by (unknown) mutation rates across sites (see Materials and methods).

Given these limitations, individual methylated CpG sites can provide a useful point of entry to understanding the DFE in humans. Although methylated CpG sites appear under somewhat less constraint than other sites, the differences are subtle (Figure 4—figure supplement 4), and what we learn at these sites can tell us what to expect more generally. As a first step, we can obtain a DFE across non-synonymous mCpG sites by weighting the densities for segregating and invariant sites (Figure 4a) by the proportion of sites in each category (an example for possible LOF mutations is shown in Figure 4—figure supplement 5, for sample sizes of 15K and 780K) . In current samples, the posterior odds for invariant methylated CpGs having hs ≥ 0.5 × 10–3 are 92 % under our model, whereas they are 37 % for segregating methylated CpGs. Considering possible LOF mutations at methylated CpGs, of which 27 % are not observed in current samples, these odds imply that the fraction of de novo LOF mutations with hs ≥ 0.5 × 10–3 is roughly 52 % ( = 0.27 × 0.92 + 0.73 × 0.37).

We can use a similar approach to estimate the minimum fraction of de novo mutations that lead to a deleterious non-synonymous change. For missense sites, given the same uninformative prior on hs as for LOF mutational opportunities, the fraction estimated to be highly deleterious is 40 % ( = 0.05 × 0.92 + 0.95 × 0.37). Since ~0.97 % of all de novo point mutations are missense and ~0.07 % lead to a LOF (see Methods), these estimates translate into roughly a 1 in 236 chance ( = 40%x0.97% + 52% x0.07%) that a de novo mutation has an effect hs ≥0.5 × 10–3. Assuming, finally, that each individual inherits 70 new mutations (Kong et al., 2012; Jónsson et al., 2017), these estimates imply that one out of every 3.4 individuals is born with a new and potentially highly deleterious, non-synonymous mutation. This calculation is based on only two frequency categories, however, discarding the information contained in allele frequencies at segregating sites, and only point mutations are taken into account. Thus, the true fraction is likely substantially higher.

Outlook

Moving forward, we should soon have substantial information not only about the DFE but the strength of selection at individual CpG sites (Figure 4). Inferences based on them, or indeed any sites, will need to rely on an accurate demographic model, particularly for the recent past of most relevance for deleterious mutations; this problem should be surmountable, given the tremendous recent progress in our reconstruction of population structure and changes in humans (Schiffels and Durbin, 2014; Kelleher et al., 2019; Speidel et al., 2019). Inferences will also require a good characterization of mutation rate variation across CpG sites, as is emerging from human pedigree studies and other sources (Jónsson et al., 2017; Poulos et al., 2017; Vöhringer et al., 2020; Seplyarskiy and Sunyaev, 2021), and careful consideration of the effects of multiple hits (Harpak et al., 2016) and biased gene conversion (Glémin et al., 2015). It will also be of interest to revisit the standard assumptions that go into inferring a DFE, including that all mutations are at least partially dominant in their fitness effects; that the DFE remains fixed even as the effective population size changes by orders of magnitude; and that the distribution is bounded above at 0, when some of the mutations segregating in large samples are likely to be weakly beneficial. Putting these elements together, robust inference of the fitness effects of mutations in human genes should finally be within reach, through the lens of CpG sites.

Materials and methods

Processing de novo mutation data

Request a detailed protocol

We focused on ~190,000 published de novo mutations in a sample of 2976 parent-offspring trios that were whole genome sequenced (Halldorsson et al., 2019). To date, this is the largest publicly available set of trios that, to our knowledge, have not been sampled on the basis of a disease phenotype. Unless otherwise specified, we used these DNMs to calculate mutation rates, as described in later sections. We converted hg38 coordinates to hg19 coordinates using UCSC Liftover. We excluded indels, and all DNMs that occur outside the ~2.8 billion sites covered by gnomAD v2.1.1 whole genome sequences. We obtained the immediately adjacent 5’ and 3’ bases at each position from the hg19 reference genome, so that we had each de novo mutation within its trinucleotide context; we used this information to identify CpG sites. Where such data were available (for 89 % of CpG de novo mutations), we also annotated each site with its methylation status measured by bisulfite sequencing in testis sperm cells and ovaries (see Appendix 1—table 1).

We annotated DNMs with their variant consequences using Variant effect predictor (v87, Gencode V19) and the hg19 LOFTEE tool (Karczewski et al., 2020) to flag high-confidence (‘HC’) loss-of-function variants. We obtained the fraction of DNMs in the genome that occured at sites annotated as missense or LOF in the ‘canonical’ protein-coding transcript for each gene provided by Gencode.

Processing polymorphism data

Request a detailed protocol

We downloaded publicly available polymorphism data from gnomAD (Karczewski et al., 2020), the UK Biobank (Szustakowski, 2020), the DiscovEHR collaboration between the Regeneron Genetics Center and Geisinger Health System (Dewey et al., 2016), and 1000 Genomes Phase 3 (Auton et al., 2015). Where needed, we lifted over coordinates to the hg19 reference assembly using the UCSC LiftOver tool. Salient characteristics of these samples are as follows:

DatasetRegionsSequencedIndividualsVariantsPopulations sampledOriginal alignment
1000 genomes Phase 3(also included in gnomAD)Genomes250484 millionmixturehg19-b37
gnomAD v2.1.1Exomes125,74815 millionmixturehg19
gnomAD v2.1.1Genomes15,708230 millionmixturehg19
UK BiobankExomes199,93216 million~93 % European ancestryhg38
DiscovEHRExomes50,7268 million~98 % European ancestryhg19-b37

For the gnomAD data, we obtained the allele frequency for each variant in the full exome and genome samples, as well as their Non-Finnish European (‘NFE’) subsets from the VCF files (in hg19 coordinates) provided. For each sample, we obtained the set of segregating sites (i.e. the set of variants that pass gnomAD quality filters and have an allele frequency >0 in the sample). For the 1000 Genomes Phase-3 data, we obtained the set of variant positions similarly. Note that the 1000 Genomes samples are also contained within the gnomAD sample. For the DiscovEHR sample, allele frequencies are available where MAF >0.001 (and set equal to 0.001 for lower values > 0); we can thus determine the set of sites segregating in this sample, but we do not have access to any other information about individual variants.

For the UK Biobank exome sequencing data, additional processing was required. We downloaded the population-level plink files with exome-wide genotype information for ~200,000 individuals. We excluded exome samples that did not pass variant or sample quality control criteria in the previously released genotyping array data. Specifically, we excluded samples that have a discrepancy between reported sex and inferred sex from genotype data, a large number of close relatives in the database, or are outliers based on heterozygosity and missing rate, as detailed in Bycroft et al., 2018. Finally, we excluded individuals who withdrew from the UK Biobank by the end of 2020. This left us with 199,932 exome samples that overlap with the high-quality subset of the genotyped samples. We additionally limited our analysis to the list of ~39 million exonic sites with an average of 20X sequence coverage provided by UK Biobank (Szustakowski, 2020). We transformed the processed plink files into the standard variant call format, polarized variants to the hg38 reference assembly, and obtained the frequency of the non-reference allele in the sample. We then lifted over the coordinates from hg38 to hg19 using the UCSC LiftOver tool. We excluded the few positions where the reference alleles were mismatched or swapped between the two assemblies.

All but 12 % of segregating mCpG transitions were shared across at least two non-overlapping datasets. Of segregating variants seen in one of the gnomAD or UK Biobank datasets, all but two variants had at least 5 % of individuals (and typically on the order of ~100 K) sequenced at that position. Thus, we think it highly unlikely that we misclassified invariant sites as segregating, or vice versa. For ~9000 variants that are seen only once in the GHS data, we unfortunately did not have access to variant quality metrics. Excluding these sites only very slightly affects our results and does not change any qualitative conclusions.

Identifying and annotating mutational opportunities in the exome

Request a detailed protocol

For all possible mutational opportunities in sequenced exons, we collated a variety of functional annotations. To this end, we first generated a list of all possible SNV mutational opportunities in the exome. We obtained the list of sites that fall in exons or within 50 base pairs (bp) of exons in Gencode v19 genes and that are among the ~2.8 billion sites covered by gnomAD v2.1.1 whole genome sequences. For each position, we extracted the reference allele from the hg19 assembly and generated the three possible single-nucleotide derived alleles. We also obtained the immediately adjacent 5’ and 3’ bases at each position from the hg19 reference genome, so that we had each mutational opportunity within its trinucleotide context; we used this information to identify CpG sites. Where such data were available, we also annotated each site with its methylation status in testis sperm cells and ovaries.

To identify sites at which variants or de novo mutations could be confidently assayed by whole-exome sequencing methods, we obtained regions targeted in whole exome sequencing from gnomAD and the UK Biobank. We limited our analysis to sites that were covered at 20X or more in the exome sequencing subsets of both gnomAD and UK Biobank (that lifted over correctly to the hg19 assembly), which we refer to as ‘accessible sites’.

We then annotated the ~90 million mutational opportunities (at 30 million sites) with CADD scores and variant consequences using Variant effect predictor (v87, Gencode V19) and the hg19 LOFTEE tool (Karczewski et al., 2020) to flag high-confidence (‘HC’) loss-of-function variants. For loss-of-function variants, we also noted their location in the gene by exon number (e.g. in exon 10 of 12 exons in the gene). We used a published database of curated protein features derived from Refseq proteins (Stanek et al., 2020) to annotate all sites in protein coding genes that were associated with a particular type of functional activity (detailed functional annotations were available for about 60,000 of 1.1 million methylated CpG sites). At each site, we used either the primary ‘site-type’ annotation, or when that was missing or listed as ‘other’, we extracted the annotation from the more detailed ‘notes’ field where this information was provided.

Because there are multiple transcripts for each variant, we limited our analysis to the ‘canonical’ protein-coding transcript for each gene provided by Gencode to obtain a single annotation for each variant. For 10–20% of variants, this approach still yielded multiple possible consequences per variant, for instance, where there are multiple canonical transcripts due to overlapping genes. For these cases, we assigned one of the ‘canonical’ transcripts to the variant at random, to avoid making assumptions about their relative importance. Further overlaps within the same gene, for example, a missense variant that is also a splice variant in the same transcript, or a DNA-binding site that also undergoes a particular post-translational modification, were resolved in the same manner.

As an alternative approach, we obtained the worst consequence in all protein-coding transcripts for each variant, using the ranks of variant consequences by severity provided by Ensembl (Appendix 1—table 1). In the absence of systematic ranking criteria for the protein function annotations, we used the following order: sites that were designated as having catalytic activity (‘active’ sites) were given highest priority in overlaps, followed by DNA-binding sites, followed by other types of binding (to metal, polypeptides, ions), and finally by sites that are known to undergo post-translational or other regulatory modifications, and trans-membrane sites. Thus, a transmembrane site with regulatory activity is classified as a regulatory site, while a regulatory site with DNA-binding activity is classified as DNA-binding. Using these alternate criteria to group sites does not affect our conclusions (Figure 2—figure supplement 4).

All sources of annotation data are listed in Appendix 1—table 1. A list of CpG sites and annotations is provided as additional data.

Comparing fitness effects across sets of mutational opportunities

Request a detailed protocol

To assess whether the set of 1.1 million C > T mutational opportunities at methylated CpG sites are systematically different from the other ~90 million exonic mutational opportunities in their potential fitness effects, we compared the distribution of CADD scores in the two groups using a Kolmogorov-Smirnov test. We note that this comparison is likely to be somewhat confounded by differences in mutation rates, given our finding that CADD scores do not perfectly isolate the effects of selection from those of variability in mutation rates (Figure 2—figure supplement 5c). Since the mutation rate for methylated CpG sites is higher than for other types, this may lead them to appear somewhat less constrained than they actually are.

We further compared the fraction of C > T mutational opportunities at methylated CpGs in an annotation class vs. the fraction of other mutational opportunities in that class. We used a Fisher exact test (with a Bonferroni correction for four tests) to determine whether the two sets of mutational opportunities were differently distributed across synonymous, missense, regulatory, and LOF variant classes.

Obtaining mean de novo mutation rates by mutation type and annotation

Request a detailed protocol

We counted the total number of de novo mutations in sequenced exons (~91 million mutational opportunities) for eight classes of mutations: two transitions and a transversion each at C and T sites, transitions at CpG sites with relatively low levels of methylation (defined here as <65%), and transitions at CpG sites with high levels of methylation ( ≥ 65%). To obtain the mutation rate per site per generation, we divided the counts by the haploid sample size (2 × 2976 individuals; see section 1) and the number of mutational opportunities of each type. We report 95 % confidence intervals assuming a Poisson distribution for mutation counts. The rates obtained (Figure 1—figure supplement 1) are similar to previous ones (Kong et al., 2012; Jónsson et al., 2017; Gao et al., 2019) and roughly consistent with the rates predicted by the gnomAD mutation model (Karczewski et al., 2020).

To evaluate the impact of methylation status on the mutation rate at CpG sites, we obtained the mean mutation rate for C > T mutations at CpG sites in each methylation bin as described above, separately for methylation levels in ovaries and testes. While there is a limited amount of data, especially for some low-methylation bins, our choice of cutoff for ‘methylated’ seems sensible (Figure 1—figure supplement 2).

We then calculated the mean mutation rate for methylated CpG transitions, for different compartments in the genome, namely in (a) exons and non-exons, (b) four variant consequence categories: synonymous, missense, regulatory, and LOF variants, (c) CADD score deciles, and (d) in exons that constitute the first half vs the second half of genes. We also calculated the mean mutation rate for methylated CpG transitions in four trinucleotide contexts (ACG, CCG, GCG, and TCG). In each case, we obtained the total number of de novo mutations and the Poisson 95 % confidence interval around mutation counts in each group, and divided by the number of mutational opportunities in the group. We tested if the proportion of methylated CpG sites with de novo C > T mutations in each non-synonymous compartment was different from the proportion of synonymous methylated CpGs with de novo C > T mutations, accounting for multiple tests.

Variance in mutation rate at methylated CpGs

Request a detailed protocol

Although current samples of DNM data are large enough to compare the mean mutation rate at methylated CpGs across the annotation classes examined here, there is not enough data to directly compare variances in mutation rates. To learn how much broad scale features beyond methylation and the immediate trinucleotide context shape variation in mutation rates at methylated CpGs, we therefore relied on a broader set of regions for example those that fall inside and outside exons. Exonic and non-exonic regions differ considerably in epigenetic features and replication timing (Stamatoyannopoulos et al., 2009); yet, there is no discernable difference in average de novo mutation rates at methylated CpGs inside and outside sequenced exons (FET p-value = 0.10, Figure 1—figure supplement 4a). We also compared the number of double and single de novo hits in exons and non-exons using a Fisher exact test (p-value = 0.35, Figure 1—figure supplement 4b). Since the number of double hits reflects the variance in mutation rates across sites, these results lend some support to there being limited variation due to broad scale genomic features in transition rates at methylated CpGs.

Calculating the fraction of sites segregating by annotation

Request a detailed protocol

For each methylated CpG site in the exome, there are three mutational opportunities (C > A, C > G, C > T); we focused only on the opportunities for C > T mutations. For each methylated CpG site then, we noted whether or not it was segregating, or in other words if there was a C > T variant in samples of individuals from gnomAD (Karczewski et al., 2020), the UK Biobank (Szustakowski, 2020), the DiscovEHR dataset (Dewey et al., 2016), and 1000 Genomes Phase 3 (Auton et al., 2015), processed as described above, or a combined sample of 390 K non-overlapping individuals.

Within the set of methylated CpG sites where C > T mutations are synonymous, we calculated the fraction segregating in each sample of interest. Similarly, for different subsets of methylated CpGs, namely those in (a) four variant consequence categories: synonymous, missense, regulatory, and LOF variants, (c) CADD score deciles, (d) functional site categories (e.g. trans-membrane vs catalytic sites in proteins), and (e) the first half vs the second half of genes, we calculated the fraction segregating in the combined sample of 390 K individuals. We rescaled the fraction of sites segregating in each annotation by the fraction of synonymous sites segregating in the sample.

We verified that the differences in the fraction of sites segregating across annotations are not due to variable impacts of linked selection across annotations. To do so, we calculated the fraction of sites segregating with sites in different annotations matched for B-statistics McVicker et al., 2009; we obtained very similar results with this approach (Figure 2—figure supplement 2).

We assumed that conditional on the number of mutational opportunities and a fixed probability of segregating for each site in a compartment, the number of sites segregating is binomially distributed, and obtained 95 % confidence intervals on that basis. We tested if the proportion of sites segregating in each compartment is different from the proportion segregating at putatively neutral (here, synonymous) sites using a Fisher exact test, accounting for multiple tests.

We also calculated the fraction of other types of synonymous sites segregating in each sample size of interest (specifically, for T > A variants, and C > Ts not at methylated CpG sites).

Frequency of mutant alleles in bins of K sites

Request a detailed protocol

Within each annotation of interest, with an average mutation rate of u per site, we construct bins of k sites, such that k = U/u, where U is the mean mutation rate of a transition at methylated CpG site in that annotation class. The mean mutation rates are calculated for each mutation type within each annotation, as described in Section five above. We then count the fraction of bins in which no such mutations are observed. As an example, for T > A mutations, k is on the order of 100 (Figure 3—figure supplement 2a).

Since each bin can be treated as being comparable to a single neutral methylated CpG site, bins that contain only neutral sites are expected to contain at least one mutation in 99 % of bins; this is indeed the case for bins of synonymous sites (Figure 3—figure supplement 2b).

When considering sites that contain a mixture of neutral and selected sites, bins of k sites are no longer as readily comparable to methylated CpG sites, however (Figure 3—figure supplement 2c). If sites within a bin are under varying degrees of selection, then the mutation count reflects the harmonic mean of the strength of selection acting on individual sites. Specifically, under a deterministic model of mutation-selection balance, if qi is the allele frequency at the ith site in a bin of k sites:

qbin=i=1kqi

then

qbin=i=1kuihsi

Assuming ui = u = U/k,

qbin=i=1k(U/k)hsi=(Uk).i=1k1hsi

that is, qbin is a function of the harmonic mean of hs at the k sites.

Forward simulations

Request a detailed protocol

We used a forward simulation framework initially described in Simons et al., 2014, and modified in Fuller et al., 2019. Briefly, we modeled evolution at a single non-recombining bi-allelic site, which undergoes mutations each generation at rate 2Neu in a panmictic diploid population of effective population size Ne. Each generation is formed by Wright-Fisher sampling with selection, where fitness is reduced by hs in heterozygotes and s in homozygotes for the T allele. We fixed the dominance coefficient h as 0.5, and we chose one value of the selection coefficient s from a log-uniform prior ranging from 10–7 to 1 for each simulation (for autosomal mutations with fitness effects in heterozygotes, we only need to specify the compound parameter hs; reviewed in Fuller et al., 2019). Given a mutation rate and a demographic model that specifies Ne in each generation, we simulated the evolution of a site forward in time to determine whether the site is segregating in a sample of size n at present.

We used u = 1.2 x 10–7 per site per generation to model CpG> TpG mutation at a methylated CpG site. The simulation framework allows for recurrent mutations, which are expected to arise often at this mutation rate. We also allowed for TpG> CpG back mutations at the rate of 5 × 10–9 (calculated from de novo mutation data, as CpG> TpG mutations). To model T > A mutations, we used u = 1.2 x 10–9 per site per generation, with a back mutation rate of 1.2 × 10–9 per site per generation; for C > T mutations not at methylated CpG sites, we used u = 0.9 x 10–8 per site per generation, with a back mutation rate of 5 × 10–9 per site per generation (Figure 1—figure supplement 1). We note that, since the mutation rate increases with paternal and maternal ages, an implicit assumption is that the distribution of parental ages in the trio data is representative of the parental ages over the evolutionary history of exome samples.

For the demographic model, we relied on the Schiffels-Durbin model for population size changes in Europe over the past ~55,000 generations, preceded by a ~ 10 Ne generation burn-in period of neutral evolution at an initial population size Ne of 14,448 following ref (Simons et al., 2014). In the last generation, that is at present, we sampled n individuals from the simulated population, to match the size of the sample of interest.

We calculated the probability that a site with the fixed mutation rate u is segregating for a given value of hs (with s = 0 under neutrality) as the proportion of simulations with those parameters in which the site is segregating for different sample sizes at present.

In comparing the output of these simulations to data, we considered two scenarios where we may either undercount or overcount segregating CpG sites in the data relative to the simulations. First, because we conditioned on the human reference allele being a CpG in data, we did not count sites where the CpG is the ancestral but not the reference allele. To check how often this is expected to occur, we mimicked this scenario in simulations, sampling a single chromosome at the end of the simulation as the mock haploid reference genome. The proportion of simulations in which CpG is the ancestral but not the reference allele is ~0.1%, that is, approximately the heterozygosity levels in humans. The second case is that for a subset of the CpG> TpG variants observed at present, the CpG mutation is the reference allele but is not ancestral. To mimic this scenario in our simulations, we simulated a site that starts as TpG (with a mutation rate of 5 × 10–9 to CpG, and a back mutation rate ~1.2 x 10–7 to TpG) forward in time. Then, as above, we drew a single chromosome from the sample at the end of the simulation and set it as the reference. We obtained the proportion of simulations in which the C allele is the reference, starting from a TpG background. Reassuringly, this occurs in only 0.0014 % of simulations. We note that there is in principle a third scenario to consider, in which ApG or GpG sites is ancestral and a C/T polymorphism is found in the sample at present as a result of two mutations, one to T and one to C. Given the various mutation rates involved (all less than 5 × 10–9), this double mutation case will be even less likely than the one in which TpG was ancestral. These rare scenarios should not have any substantive effect on our comparison of data to simulations, particularly when we only used such comparisons to examine qualitative trends.

Inferring the strength of selection in simulations

Request a detailed protocol

We proposed s from a prior distribution (with h fixed at 0.5) and inferred the posterior distribution of hs for a site with a T allele at 0 copies using a simple Approximate Bayesian Computation (ABC) approach. Specifically, we proposed s such that log10(s)~Uniform(–7,0); we simulated expected T allele counts under our model for 10 million proposals from the prior. We accepted the subset of the proposed values of s where simulations yield 0 copies of the T allele in the sample at present; this set of s values is a sample from the posterior distribution of s given that the site is monomorphic. We calculated the Bayes odds of s > 10–3 as the ratio of the posterior odds of s > 10–3 and the prior odds of s > 10–3:

p(hs>0.5×103copiesofT=0)/p(hs0.5×103copiesofT=0)p(hs>0.5×103)/p(hs0.5×103)

We similarly obtained posterior distributions of hs for sites that are segregating at 0, 1–10 copies, or >10 copies, in samples of different sizes, and for three different choices of priors on s, namely: s ~ Beta(α = 0.001, β = 0.1); log(s)~ N(–6,2); and Nes ~ Gamma(k = 0.23, θ = 425/0.23), with Ne = 10,000, based on the parameters inferred in Eyre-Walker et al., 2006. These are shown in Figure 4—figure supplement 1.

Calculating odds of being pathogenic in ClinVar and DDD

Request a detailed protocol

We downloaded de novo mutation data for ~35 K individuals with developmental disorders (Kaplanis et al., 2020). We also obtained a list of 380 ‘consensus’ genes from the same study; for these genes, there is evidence from multiple sources that LOF or missense mutations are causal in developmental disorders, such that they are used as part of diagnostic criteria in the clinic.

We downloaded ClinVar variants and excluded those that were not associated with at least one disease. We obtained the ‘CLNSIG’ annotation, which classifies each variant as benign or likely benign, pathogenic or likely pathogenic, or as having uncertain status or conflicting evidence.

We limited both DDD and ClinVar variants to non-synonymous C > T mutations at the subset of methylated CpG sites considered. Using variants in ClinVar and DDD at sites that are invariant in our sample of 780K, we calculated the odds that an invariant site is pathogenic (vs. benign) as follows:

p(pathogeniccopiesofT=0)/p(benigncopiesofT=0)p(pathogenic)/p(benign)

where p(pathogenic) refers to the proportion of sites classified as such, and p(benign) is defined analogously.

In DDD, we considered mutations that fall in 380 consensus genes ‘pathogenic’, and mutations in all other genes benign; thus our ‘benign’ category likely contains some genes in which mutations are in fact pathogenic. In ClinVar, variants classified as ‘pathogenic’ or ‘likely pathogenic’ are assumed to be pathogenic; these are compared to two sets of benign variants, one limited strictly to variants classified in ClinVar as ‘benign’ or ‘likely benign’, and the other inclusive of variants for which the evidence is uncertain or inconclusive. The results are shown in Figure 4—figure supplement 2.

We note that since both ClinVar classifications and the identification of consensus genes in DDD rely in part on whether a site is segregating in datasets like ExAC, the degree of enrichment in Figure 4—figure supplement 2 is hard to interpret.

Calculating the average length of the genealogy of a sample in which methylated CpGs are saturated

Request a detailed protocol

Methylated CpG sites experience mutations to T at the rate of 1.17 × 10–7 per generation; 99 % of such sites are segregating in a sample of 390 K individuals. Then the average length (L) of the genealogy relating the 390 K individuals can be calculated from the probability under a Poisson distribution of at least one mutation at 99 % of sites as 1-exp(–1.17 × 10–7 x L) = 0.99, which gives L = 39 million generations.

Coalescent simulations to obtain the length of genealogy of large samples

Request a detailed protocol

We simulated the genealogy of a sample of varying sizes using msprime (Kelleher et al., 2016) under different demographic histories, modifying the standard Schiffels-Durbin model (Schiffels and Durbin, 2014) as follows:

  1. Demographic history for a sample of Utah residents with Northern and Western European ancestry (CEU) over 55,000 generations, with a recent Ne of 10 million for the past 50 generations, described above.

  2. CEU demographic history for 55,000 generations with a recent Ne of 100 million for the past 50 generations.

  3. CEU demographic history for 55,000 generations with 4.5 % exponential growth for the past 196 generations, with a current Ne of ~100 million.

  4. Demographic history for a sample of Yoruba sampled in Nigeria (YRI) from Schiffels and Durbin, 2014, modified with a recent Ne of 10 million for the last 50 generations.

  5. A structured sample from two populations that derived from an ancestral population with YRI demographic history 2000 generations ago, with YRI and CEU demographic histories respectively since, and a recent Ne of 10 million for the last 50 generations in each.

In each case, we recorded the mean genealogy length over 20 iterations.

The code for implementing these different demographic models in msprime is available on the project github repository.

Appendix 1

Appendix 1—table 1
List of data sources.

Data availability

All source data are freely available to researchers, with sources provided in the manuscript and summarized in Appendix 1 - Table 1. Source files and code to generate the figures, and additional files containing the annotated set of CpG sites analysed in this manuscript, are available at https://github.com/agarwal-i/cpg_saturation (copy archived at https://archive.softwareheritage.org/swh:1:rev:a36cb87c0af373e81eae1935ba710c4417c46f69).

The following previously published data sets were used

References

  1. Book
    1. Gillespie JH
    (1998)
    Population genetics: a concise guide / John H. Gillespie
    The Johns Hopkins University Press.
  2. Book
    1. Harrison SM
    2. Pesaran TF
    3. Mester J
    (2021)
    Clinical DNA Variant Interpretation
    Academic Press.
  3. Book
    1. Hudson RR
    (1990)
    Gene Genealogies and the Coalescent Process
    Oxford Surveys in Evolutionary Biology.
    1. Taliun D
    2. Harris DN
    3. Kessler MD
    4. Carlson J
    5. Szpiech ZA
    6. Torres R
    7. Taliun SAG
    8. Corvelo A
    9. Gogarten SM
    10. Kang HM
    11. Pitsillides AN
    12. LeFaive J
    13. Lee S-B
    14. Tian X
    15. Browning BL
    16. Das S
    17. Emde A-K
    18. Clarke WE
    19. Loesch DP
    20. Shetty AC
    21. Blackwell TW
    22. Smith AV
    23. Wong Q
    24. Liu X
    25. Conomos MP
    26. Bobo DM
    27. Aguet F
    28. Albert C
    29. Alonso A
    30. Ardlie KG
    31. Arking DE
    32. Aslibekyan S
    33. Auer PL
    34. Barnard J
    35. Barr RG
    36. Barwick L
    37. Becker LC
    38. Beer RL
    39. Benjamin EJ
    40. Bielak LF
    41. Blangero J
    42. Boehnke M
    43. Bowden DW
    44. Brody JA
    45. Burchard EG
    46. Cade BE
    47. Casella JF
    48. Chalazan B
    49. Chasman DI
    50. Chen Y-DI
    51. Cho MH
    52. Choi SH
    53. Chung MK
    54. Clish CB
    55. Correa A
    56. Curran JE
    57. Custer B
    58. Darbar D
    59. Daya M
    60. de Andrade M
    61. DeMeo DL
    62. Dutcher SK
    63. Ellinor PT
    64. Emery LS
    65. Eng C
    66. Fatkin D
    67. Fingerlin T
    68. Forer L
    69. Fornage M
    70. Franceschini N
    71. Fuchsberger C
    72. Fullerton SM
    73. Germer S
    74. Gladwin MT
    75. Gottlieb DJ
    76. Guo X
    77. Hall ME
    78. He J
    79. Heard-Costa NL
    80. Heckbert SR
    81. Irvin MR
    82. Johnsen JM
    83. Johnson AD
    84. Kaplan R
    85. Kardia SLR
    86. Kelly T
    87. Kelly S
    88. Kenny EE
    89. Kiel DP
    90. Klemmer R
    91. Konkle BA
    92. Kooperberg C
    93. Köttgen A
    94. Lange LA
    95. Lasky-Su J
    96. Levy D
    97. Lin X
    98. Lin K-H
    99. Liu C
    100. Loos RJF
    101. Garman L
    102. Gerszten R
    103. Lubitz SA
    104. Lunetta KL
    105. Mak ACY
    106. Manichaikul A
    107. Manning AK
    108. Mathias RA
    109. McManus DD
    110. McGarvey ST
    111. Meigs JB
    112. Meyers DA
    113. Mikulla JL
    114. Minear MA
    115. Mitchell BD
    116. Mohanty S
    117. Montasser ME
    118. Montgomery C
    119. Morrison AC
    120. Murabito JM
    121. Natale A
    122. Natarajan P
    123. Nelson SC
    124. North KE
    125. O’Connell JR
    126. Palmer ND
    127. Pankratz N
    128. Peloso GM
    129. Peyser PA
    130. Pleiness J
    131. Post WS
    132. Psaty BM
    133. Rao DC
    134. Redline S
    135. Reiner AP
    136. Roden D
    137. Rotter JI
    138. Ruczinski I
    139. Sarnowski C
    140. Schoenherr S
    141. Schwartz DA
    142. Seo J-S
    143. Seshadri S
    144. Sheehan VA
    145. Sheu WH
    146. Shoemaker MB
    147. Smith NL
    148. Smith JA
    149. Sotoodehnia N
    150. Stilp AM
    151. Tang W
    152. Taylor KD
    153. Telen M
    154. Thornton TA
    155. Tracy RP
    156. Van Den Berg DJ
    157. Vasan RS
    158. Viaud-Martinez KA
    159. Vrieze S
    160. Weeks DE
    161. Weir BS
    162. Weiss ST
    163. Weng L-C
    164. Willer CJ
    165. Zhang Y
    166. Zhao X
    167. Arnett DK
    168. Ashley-Koch AE
    169. Barnes KC
    170. Boerwinkle E
    171. Gabriel S
    172. Gibbs R
    173. Rice KM
    174. Rich SS
    175. Silverman EK
    176. Qasba P
    177. Gan W
    178. NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium
    179. Papanicolaou GJ
    180. Nickerson DA
    181. Browning SR
    182. Zody MC
    183. Zöllner S
    184. Wilson JG
    185. Cupples LA
    186. Laurie CC
    187. Jaquish CE
    188. Hernandez RD
    189. O’Connor TD
    190. Abecasis GR
    (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program
    Nature 590:290–299.
    https://doi.org/10.1038/s41586-021-03205-y

Decision letter

  1. Jeffrey Ross-Ibarra
    Reviewing Editor; University of California, Davis, United States
  2. Patricia J Wittkopp
    Senior Editor; University of Michigan, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article "Mutation saturation for fitness effects at human CpG sites" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Patricia Wittkopp as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. All three reviewers were very enthusiastic about the manuscript, and with some revision it will clearly be suitable for publication in eLife.

The main revision reviewers agreed would be important to incorporate is the issue of multiple testing (highlighted by Reviewer #2).

While all the reviewers were enthusiastic about the manuscript as written, and none felt that additional analyses were necessary, there were a number of additional analyses suggested that reviewers felt could strengthen the paper if they were relatively straightforward to incorporate. These included:

1) Comparison of the nonsynonymous genome-wide DFE from smaller samples to the 780K, perhaps only focusing on methylated nonsynomous sites.

2) Comparison of overlap between these invariant sites and Clinvar or databases of patients with known developmental disorders.

3) Comparison of results under a different parameterization of human demography.

Reviewers had a number of additional suggestions that we felt would improve the manuscript. Perhaps first among these was a general feeling of "to be continued" in the discussion with multiple citations to another manuscript in prep. While reviewers were not against this strategy by any means, efforts to address the novelty of the current paper in the discussion (rather than simply hinting at exciting results in the forthcoming work) would be worthwhile.

Please do review and respond to the individual reviewer comments as well.

Reviewer #1:

Agarwal and Przeworski have performed a very timely and interesting study of the distribution of fitness effects (DFE) of new mutations. This study is timely because modern human population genetic datasets have finally achieved sample sizes for which a certain class of nucleotide sites, i.e. methylated CpG sites, when neutrally evolving, should approach near complete polymorphism saturation. If every neutral site is expected to carry at least one variant, the classic problem of distinguishing sites that are monomorphic due to chance (no mutation) versus sites that are monomorphic due to selective constraint (removed mutations) is greatly simplified. The point at which population genomic datasets are saturated with polymorphisms should represent a major advance in understanding the DFE at individual sites and is what immediately piqued my interest.

Overall, this manuscript is a thorough and thoughtful examination of this topic; to my enjoyment there were several times where a question came to mind that was addressed shortly later in the paper. I believe the authors have made a compelling case for why methylated CpG sites provide an entry point for understanding the site-specific DFE. I found the section "Interpreting monomorphic and polymorphic sites in current reference databases" particularly insightful as a guide to thinking about future datasets; similarly, I thought the comparison with CADD scores (Figure S9) provided important food for thought regarding confounders to maps of constraint generated from vast numbers of species using modern genomic datasets.

While the study is addressing an interesting topic, I also felt this manuscript was limited in novel findings to take away. Certainly the study clearly shows that substitution saturation is achieved at synonymous CpG sites. However, subsequent main analyses do not really show anything new: the depletion of segregating sites in functional versus neutral categories (Figure 2) has been extensively shown in the literature and polymorphism saturation is not a necessary condition for observing this pattern. Similarly, the diminishing returns on sampling new variable sites has been shown in previous studies, for example the first "large" human datasets ca. 2012 (e.g. Figure 2 in Nelson et al., 2012, Science) have similar depictions as Figure 3B although with smaller sample sizes and different approaches (projection vs simulation in this study). There are some simulations presented in Figure 4, but this is more of a hypothetical representation of the site-specific DFE under simulation conditions roughly approximating human demography than formal inference on single sites. Again, these all describe the state of the field quite well, but I was disappointed by the lack of a novel finding derived from exploiting the mutation saturation properties at methylated CpG sites.

Similarly, I felt the authors posed a very important point about limitations of DFE inference methods in the Introduction but ended up not really providing any new insights into this problem. The authors argue (rightly so) that currently available DFE estimates are limited by both the sparsity of polymorphisms and limited flexibility in parametric forms of the DFE. However, the nonsynonymous human DFE estimates in the literature appear to be surprisingly robust to sample size: older estimates (Eyre-Walker et al., 2006 Genetics, Boyko et al., 2008 PLOS Genetics) seem to at least be somewhat consistent with newer estimates (assuming the same mutation rate) from samples that are orders of magnitude larger (Kim et al., 2017 Genetics). Whether a DFE inferred under polymorphism saturation conditions with different methods is different, and how it is different, is an issue of broad and immediate relevance to all those conducting population genomic simulations involving purifying selection. The analyses presented as Figure 4A and 4B kind of show this, but they are more a demonstration of what information one might have at 1M+ sample sizes rather than an analysis of whether genome-wide nonsynonymous DFE estimates are accurate. In other words, this manuscript makes it clear that a problem exists, that it is a fundamental and important problem in population genetics, and that with modern datasets we are now poised to start addressing this problem with some types of sites, but all of this is already very well-appreciated except for perhaps the last point.

At least a crude analysis to directly compare the nonsynonymous genome-wide DFE from smaller samples to the 780K sample would be helpful, but it should be noted that these kinds of analyses could be well beyond the scope of the current manuscript. For example, if methylated nonsynonymous CpG sites are under a different level of constraint than other nonsynonymous sites (Figure S14) then comparing results to a genome-wide nonsynonymous DFE might not make sense and any new analysis would have to try and infer a DFE independently from synonymous/nonsynonymous methylated CpG sites.

Abstract: where it says "Here, we focus on putatively-neutral, synonymous CpG sites…" I thought the phrase "putatively-neutral, synonymous" could be clearer to the reader if moved to "… not seeing a polymorphism [at putatively-neutral, synonymous sites] is indicative of strong…".

Page 3 – "DNM" and "FET" were not defined before the first usage of the acronyms.

Page 7 – "That synonymous sites are close to saturation…": Here, wouldn't the expected length of the genealogy such that 1 mutation is expected per synonymous CpG site be a pretty drastic underestimate of the length of the genealogy such that saturation is observed (99% of synonymous CpG sites w/mutation)? Wouldn't a more precise estimate be something like 39 million generations, [1-Pois(0|1.17e-7*39e6)] ~ 99% of sites?

Reviewer #2:

This manuscript presents a simple and elegant argument that neutrally evolving CpG sites are now mutationally saturated, with each having a 99% probability of containing variation in modern datasets containing hundreds of thousands of exomes. The authors make a compelling argument that for CpG sites where mutations would create genic stop codons or impair DNA binding, about 20% of such mutations are strongly deleterious (likely impairing fitness by 5% or more). Although it is not especially novel to make such statements about the selective constraint acting on large classes of sites, the more novel aspect of this work is the strong site-by-site prediction it makes that most individual sites without variation in UK Biobank are likely to be under strong selection.

The authors rightly point out that since 99% of neutrally evolving CpG sites contain variation in the data they are looking at, a CpG site without variation is likely evolving under constraint with a p value significance of 0.01. However, a weakness of their argument is that they do not discuss the associated multiple testing problem-in other words, how likely is it that a given non synonymous CpG site is devoid of variation but actually not under strong selection? Since one of the most novel and useful deliverables of this paper is single-base-pair-resolution predictions about which sites are under selection, such a multiple testing correction would provide important "error bars" for evaluating how likely it is that an individual CpG site is actually constrained, not just the proportion of constrained sites within a particular functional category.

The paper provides a comparison of their functional predictions to CADD scores, an older machine-learning-based attempt at identifying site by site constraint at single base pair resolution. While this section is useful and informative, I would have liked to see a discussion of the degree to which the comparison might be circular due to CADD's reliance on information about which sites are and are not variable. I had trouble assessing this for myself given that CADD appears to have used genetic variation data available a few years ago, but obviously did not use the biobank scale datasets that were not available when that work was published.

Reading this paper left me excited about the possibility of examining individual invariant CpG sites and deducing how many of them are already associated with known disease phenotypes. I believe the paper does not mention how many of these invariant sites appear in Clinvar or in databases of patients with known developmental disorders, and I wondered how close to saturation disease gene databases might be given that individuals with developmental disorders are much more likely to have their exomes sequenced compared to healthy individuals. One could imagine some such analyses being relatively low hanging fruit that could strengthen the current paper, but the authors also make several reference to a companion paper in preparation that deals more directly with the problem of assessing clinical variant significance. This is a reasonable strategy, but it does give the Discussion section of the paper somewhat of a "to be continued" feel.

I think the paper could be strengthened by calculating the proportion of non-variable CpG sites in teach category are likely to be truly under constraint, making use of some kind of multiple testing correction. This would build upon the intuition that a non-variable CpG is likely functional with a non-corrected p value of 0.01.

My point about the possible circularity of comparison to CADD could be addressed with further discussion of the degree to which CADD is informed by patterns of human genetic variation and how incorporation of genetic variation into CADD scores might affect the conclusions of this section. As an additional point in the CADD section, it's not totally clear whether the statement "Mean transition rates at methylated CpGs are similar across CADD deciles" is based on de novo mutation data or some other data source.

Another addition that would add a lot to the paper, though is not strictly necessary, would be to comment on the overlap between sites identified as under selection by the current paper and sites where mutations are already annotated as clinically relevant or suspected to be so based on their occurrence in a disease cohort.

Reviewer #3:

Agarwal et al., combine a few well-known ideas in population genetics – diminishing returns in sampling new alleles with increasing sample size and the enrichment of invariant sites for sites under strong purifying selection – and point out the exciting result that sample sizes of modern human data sets are sufficiently large that, for highly mutable sites, saturation mutation has been reached. This is my favorite kind of result – one that is strikingly obvious in retrospect but that I had never considered (and probably wouldn't have). The manuscript is well written, and a number of my concerns or questions while reading were resolved directly by the authors later on. I have no major concerns, but a few potential suggestions that might strengthen the presentation.

The authors emphasize several times how important an accurate demographic model is. While we may be close to a solid demographic model for humans, this is certainly not the case for many other organisms. Yet we are not far off from sufficient sample sizes in a number of species to begin to reach saturation. I found myself wondering how different the results/inference would be under a different model of human demographic history. Though likely the results would be supplemental, it would be nice in the main text to be able to say something about whether results are qualitatively different under a somewhat different published model.

On a similar note, while a fixed hs simplifies much of the analysis, I wondered how results would differ for (1) completely recessive mutations and (2) under a distribution of dominance coefficients, especially one in which the most deleterious alleles were more recessive. Again, though I think it would strengthen the manuscript by no means do I feel this is a necessary addition, though some discussion of variation in dominance would be an easy and helpful add.

There's some discussion of population structure, but I also found myself wondering about GxE. That is, another reason a variant might be segregating is that it's conditionally neutral in some populations and only deleterious in a subset. I think no analysis to be done here, but perhaps some discussion?

Maybe I missed it, but I don't think the acronym DNM is explained anywhere. While it was fairly self-explanatory, I did have a moment of wondering whether it was methylation or mutation and can't hurt to be explicit.

https://doi.org/10.7554/eLife.71513.sa1

Author response

All three reviewers were very enthusiastic about the manuscript, and with some revision it will clearly be suitable for publication in eLife.

The main revision reviewers agreed would be important to incorporate is the issue of multiple testing (highlighted by Reviewer #2).

While all the reviewers were enthusiastic about the manuscript as written, and none felt that additional analyses were necessary, there were a number of additional analyses suggested that reviewers felt could strengthen the paper if they were relatively straightforward to incorporate. These included:

1) Comparison of the nonsynonymous genome-wide DFE from smaller samples to the 780K, perhaps only focusing on methylated nonsynomous sites.

2) Comparison of overlap between these invariant sites and Clinvar or databases of patients with known developmental disorders.

3) Comparison of results under a different parameterization of human demography.

Reviewers had a number of additional suggestions that we felt would improve the manuscript. Perhaps first among these was a general feeling of "to be continued" in the discussion with multiple citations to another manuscript in prep. While reviewers were not against this strategy by any means, efforts to address the novelty of the current paper in the discussion (rather than simply hinting at exciting results in the forthcoming work) would be worthwhile.

Please do review and respond to the individual reviewer comments as well.

We greatly appreciate the reviewers’ enthusiasm for the manuscript and thank them for their helpful comments.

We agree with the reviewers that it is important to address the question of how likely it is that a given non-synonymous mCpG site is invariant in current samples but nonetheless neutral. We had intended for this question to be addressed by our Bayesian analysis for an individual site in Figure 4, which examines how one’s beliefs about selection should be updated after observing a site to be invariant (i.e., using Bayes odds), with the p-value analogy serving simply to point out what makes mutation saturation special. We apologize for our failure to make this explicit, and have clarified the following points in the text:

A) First, we have now added a discussion of false discovery rates (FDR) in our mutation saturation based test for whether a non-syn site is neutral. As we now state, given that 1.2% of neutral sites are invariant, and 7.4% of all non-syn sites are invariant, the FDR across all non-synonymous invariant sites is ~16% (1.2/7.4). Within just LOFs it is ~4% (1.2/27).

B) With our demographic model and given our choice of prior, the Bayes odds for an invariant mCpG site at the current sample size are 15:1 in favor of hs > 0.5x10-3; thus there is a 1/16 chance an invariant site is not under “strong selection”.

We note that these odds depend on our prior and demographic model while the FDR calculation is based on the assumption that the null model is given by synonymous sites.

With regard to the three other points raised above, we have addressed (3) by showing results for the widely used model of Tennessen et al., (Tennessen et al., 2012), which was inferred from a relatively small sample size and thus did not detect the rapid exponential growth in population size towards the present; as expected, this model does not predict observed levels of variation as well and leads to different expectations about allele frequencies of deleterious alleles. In other words, and as expected, the results in Figure 4 depend on the demographic model choice.

We have also taken up suggestion (2) and now include an analysis of ClinVar and Deciphering Developmental Disorders (DDD) datasets. While both data sets are far from saturation, as expected, variants in those data sets, which are likely strongly deleterious, are enriched among invariant sites relative to segregating sites.

Regarding the analysis suggested in (1), we would argue we already know what will happen: given that small sample sizes carry very little information (see Figure 4a,b), the answer will depend greatly on the choice of parametric form for the DFE. That a priori choice is necessarily arbitrary, given that it is precisely what we are trying to learn about. Indeed, a recent preprint by Dukler et al., (Dukler et al., 2021) reports estimates of sh against amino-acid mutations inconsistent with those of Kim et al., (see p. 11 of their Discussion) and those of Kim et al., are in poor agreement with those of Boyko et al., and others (see Figure 4 in Kim et al., 2017).

Finally, we apologize for the impression that we postponed interesting analyses to a different paper, which is not a continuation of the current manuscript, but instead presents a complementary analysis of loss-of-function variation in human exomes and includes inference of a site-level DFE for such variants. We have now tried to make this clearer in the text, as well as adding a Figure on DDD and ClinVar and other analyses, as suggested.

Reviewer #1:

Agarwal and Przeworski have performed a very timely and interesting study of the distribution of fitness effects (DFE) of new mutations. This study is timely because modern human population genetic datasets have finally achieved sample sizes for which a certain class of nucleotide sites, i.e. methylated CpG sites, when neutrally evolving, should approach near complete polymorphism saturation. If every neutral site is expected to carry at least one variant, the classic problem of distinguishing sites that are monomorphic due to chance (no mutation) versus sites that are monomorphic due to selective constraint (removed mutations) is greatly simplified. The point at which population genomic datasets are saturated with polymorphisms should represent a major advance in understanding the DFE at individual sites and is what immediately piqued my interest.

Overall, this manuscript is a thorough and thoughtful examination of this topic; to my enjoyment there were several times where a question came to mind that was addressed shortly later in the paper. I believe the authors have made a compelling case for why methylated CpG sites provide an entry point for understanding the site-specific DFE. I found the section "Interpreting monomorphic and polymorphic sites in current reference databases" particularly insightful as a guide to thinking about future datasets; similarly, I thought the comparison with CADD scores (Figure S9) provided important food for thought regarding confounders to maps of constraint generated from vast numbers of species using modern genomic datasets.

While the study is addressing an interesting topic, I also felt this manuscript was limited in novel findings to take away. Certainly the study clearly shows that substitution saturation is achieved at synonymous CpG sites. However, subsequent main analyses do not really show anything new: the depletion of segregating sites in functional versus neutral categories (Figure 2) has been extensively shown in the literature and polymorphism saturation is not a necessary condition for observing this pattern.

We agree with the reviewer that many of the points raised were appreciated previously and did not mean to convey another impression. Our aim was instead to highlight some unique opportunities provided by being at or very near saturation for mCpG transitions. In that regard, we note that although depletion of variation in functional categories is to be expected at any sample size, the selection strength that this depletion reflects is very different in samples that are far from saturated, where invariant sites span the entire spectrum from neutral to lethal. Consider the depletion per functional category relative to synonymous sites in Author response image 1 in a sample of 100k: ~40% of mCpG LOF sites do not have T mutations. From Figure 4, it can be seen that these sites are associated with a much broader range of hs values than sites invariant at 780K, so that information about selection at an individual site is quite limited (indeed, in our p-value formulation, these sites would be assigned p≤0.35, see Figure 1). Thus, only now can we really start to tease apart weakly deleterious mutations from strongly deleterious or even embryonic lethal mutations. This allows us to identify individual sites that are most likely to underlie pathogenic mutations and functional categories that harbor deleterious variation at the extreme end of the spectrum of possible selection coefficients. More generally, saturation is useful because it allows one to learn about selection with many fewer untested assumptions than previously feasible.

Author response image 1

Similarly, the diminishing returns on sampling new variable sites has been shown in previous studies, for example the first "large" human datasets ca. 2012 (e.g. Figure 2 in Nelson et al., 2012, Science) have similar depictions as Figure 3B although with smaller sample sizes and different approaches (projection vs simulation in this study).

We agree completely: diminishing returns is expected on first principles from coalescent theory, which is why we cited a classic theory paper when making that point in the previous version of the manuscript. Nonetheless, the degree of saturation is an empirical question, since it depends on the unknown underlying demography of the recent past. In that regard, we note that Nelson et al., predict that at sample sizes of 400K chromosomes in Europeans, approximately 20% of all synonymous sites will be segregating at least one of three possible alleles, when the observed number is 29%. Regardless, not citing Nelson et al., 2012 was a clear oversight on our part, for which we apologize; we now cite it in that context and in mentioning the multiple merger coalescent.

There are some simulations presented in Figure 4, but this is more of a hypothetical representation of the site-specific DFE under simulation conditions roughly approximating human demography than formal inference on single sites. Again, these all describe the state of the field quite well, but I was disappointed by the lack of a novel finding derived from exploiting the mutation saturation properties at methylated CpG sites.

As noted above, in our view, the novelty of our results lies in their leveraging saturation in order to identify sites under extremely strong selection and make inferences about selection without the need to rely on strong, untested assumptions.

However, we note that Figure 4 is not simply a hypothetical representation, in that it shows the inferred DFE for single mCpG sites for a fixed mutation rate and given a plausible demographic model, given data summarized in terms of three ranges of allele frequency (i.e., = 0, between 1 and 10 copies, or above 10 copies). One could estimate a DFE across all sites from those summaries of the data (i.e., from the proportion of mCpG sites in each of the three frequency categories), by weighting the three densities in Figure 4 by those proportions. That is, in fact, what is done in a recent preprint by Dukler et al., (2021, BioRxiv): they infer the DFE from two summaries of the allele frequency spectrum (in bins of sites), the proportion of invariant sites and the proportion of alleles at 1-70 copies, in a sample of 70K chromosomes.

To illustrate how something similar could be done with Figure 4 based on individual sites, we obtain an estimate of the DFE for LOF mutations (shown in Author response image 2 Panel B and D for two different prior distributions on hs) by weighting the posterior densities in Panel A by the fraction of LOF mutations that are segregating (73% at 780K; 9% at 15K) and invariant (27% and 91% respectively); in panel C, we show the same for a different choice of prior. For the smaller sample size considered, the posterior distribution recapitulates the prior, because there is little information about selection in whether a site is observed to be segregating or invariant, and particularly about strong selection. In the sample of 780K, there is much more information about selection in a site being invariant and therefore, there is a shift towards stronger selection coefficients for LOF mutations regardless of the prior.

Author response image 2

Our goal was to highlight these points rather than infer a DFE using these two summaries, which throw out much of the information in the data (i.e., the allele frequency differences among segregating sites). In that regard, we note that the DFE inference would be improved by using the allele frequency at each of 1.1 million individual mCpG sites in the exome. We outline this next step in the Discussion but believe it is beyond the scope of our paper, as it is a project in itself--in particular it would require careful attention to robustness with regard to both the demographic model (and its impact on multiple hits), biased gene conversion and variability in mutation rates among mCpG sites. We now make these points explicitly in the Outlook.

Similarly, I felt the authors posed a very important point about limitations of DFE inference methods in the Introduction but ended up not really providing any new insights into this problem. The authors argue (rightly so) that currently available DFE estimates are limited by both the sparsity of polymorphisms and limited flexibility in parametric forms of the DFE. However, the nonsynonymous human DFE estimates in the literature appear to be surprisingly robust to sample size: older estimates (Eyre-Walker et al., 2006 Genetics, Boyko et al., 2008 PLOS Genetics) seem to at least be somewhat consistent with newer estimates (assuming the same mutation rate) from samples that are orders of magnitude larger (Kim et al., 2017 Genetics).

We are not quite sure what the reviewer has in mind by “somewhat consistent,” as Boyko et al., estimate that 35% of non-synonymous mutations have s>10-2 while Kim et al., find that proportion to be “0.38–0.84 fold lower” than the Boyko et al., estimate (see, e.g., Figure 4 in Kim et al., 2017). Moreover, the preprint by Dukler et al., mentioned above, which infers the DFE based on ~70K chromosomes, finds estimates inconsistent with those of Kim et al., (see SOM Table 2 and SOM Figure S5 in Dukler et al., 2021).

More generally, given that even 70K chromosomes carry little information about much of the distribution of selection coefficients (see our Figure 4), we expect that studies based on relatively sample sizes will basically recover something close to their prior; therefore, they should agree when they use the same or similar parametric forms for the distribution of selection coefficients and disagree otherwise. The dependence on that choice is nicely illustrated in Kim et al., who consider different choices and then perform inference on the same data set and with the same fixed mutation rate for exomes; depending on their choice anywhere between 5%-28% of non-synonymous changes are inferred to be under strong selection with s>=10-2 (see their Table S4).

Whether a DFE inferred under polymorphism saturation conditions with different methods is different, and how it is different, is an issue of broad and immediate relevance to all those conducting population genomic simulations involving purifying selection. The analyses presented as Figure 4A and 4B kind of show this, but they are more a demonstration of what information one might have at 1M+ sample sizes rather than an analysis of whether genome-wide nonsynonymous DFE estimates are accurate. In other words, this manuscript makes it clear that a problem exists, that it is a fundamental and important problem in population genetics, and that with modern datasets we are now poised to start addressing this problem with some types of sites, but all of this is already very well-appreciated except for perhaps the last point.

At least a crude analysis to directly compare the nonsynonymous genome-wide DFE from smaller samples to the 780K sample would be helpful, but it should be noted that these kinds of analyses could be well beyond the scope of the current manuscript. For example, if methylated nonsynonymous CpG sites are under a different level of constraint than other nonsynonymous sites (Figure S14) then comparing results to a genome-wide nonsynonymous DFE might not make sense and any new analysis would have to try and infer a DFE independently from synonymous/nonsynonymous methylated CpG sites.

We are not sure what would be learned from this comparison, given that Figure 4 shows that, at least with an uninformative prior, there is little information about the true DFE in samples, even of tens of thousands of individuals. Thus, if some of the genome-wide nonsynonymous DFE estimates based on small sample sizes turn out to be accurate, it will be because the guess about the parametric shape of the DFE was an inspired one. In our view, that is certainly possible but not likely, given that the shape of the DFE is precisely what the field has been aiming to learn and, we would argue, what we are now finally in a position to do for CpG mutations in humans.

Abstract: where it says "Here, we focus on putatively-neutral, synonymous CpG sites…" I thought the phrase "putatively-neutral, synonymous" could be clearer to the reader if moved to "… not seeing a polymorphism [at putatively-neutral, synonymous sites] is indicative of strong…".

We apologize for the unclear phrasing; we meant to say that given mutation saturation at putatively neutral sites, not seeing a non-synonymous polymorphism is indicative of strong selection against that mutation.We have now amended the text to make this clear.

Page 3 – "DNM" and "FET" were not defined before the first usage of the acronyms.

We have fixed this issue in the text.

Page 7 – "That synonymous sites are close to saturation…": Here, wouldn't the expected length of the genealogy such that 1 mutation is expected per synonymous CpG site be a pretty drastic underestimate of the length of the genealogy such that saturation is observed (99% of synonymous CpG sites w/mutation)? Wouldn't a more precise estimate be something like 39 million generations, [1-Pois(0|1.17e-7*39e6)] ~ 99% of sites?

We thank the reviewer for pointing out that we were being imprecise and have now updated the text to follow the suggestion.

Reviewer #2:

This manuscript presents a simple and elegant argument that neutrally evolving CpG sites are now mutationally saturated, with each having a 99% probability of containing variation in modern datasets containing hundreds of thousands of exomes. The authors make a compelling argument that for CpG sites where mutations would create genic stop codons or impair DNA binding, about 20% of such mutations are strongly deleterious (likely impairing fitness by 5% or more). Although it is not especially novel to make such statements about the selective constraint acting on large classes of sites, the more novel aspect of this work is the strong site-by-site prediction it makes that most individual sites without variation in UK Biobank are likely to be under strong selection.

The authors rightly point out that since 99% of neutrally evolving CpG sites contain variation in the data they are looking at, a CpG site without variation is likely evolving under constraint with a p value significance of 0.01. However, a weakness of their argument is that they do not discuss the associated multiple testing problem-in other words, how likely is it that a given non synonymous CpG site is devoid of variation but actually not under strong selection? Since one of the most novel and useful deliverables of this paper is single-base-pair-resolution predictions about which sites are under selection, such a multiple testing correction would provide important "error bars" for evaluating how likely it is that an individual CpG site is actually constrained, not just the proportion of constrained sites within a particular functional category.

We thank the reviewer for pointing this out. As we outline in response to the editorial comments, one way to think about this problem might be in terms of false discovery rates, in which case the FDR would be 16% across all non-synonymous mCpG sites that are invariant in current samples, and ~4% for the subset of those sites where mutations lead to loss-of-function of genes.

Another way to address this issue, which we had included but not emphasized previously, is by examining how one’s beliefs about selection should be updated after observing a site to be invariant (i.e., using Bayes odds). At current sample sizes and assuming our uninformative prior, for a non-synonymous mCpG site that does not have a C>T mutation, the Bayes odds are 15:1 in favor of hs>0.5x10-3; thus the chance that such a site is not under strong selection is 1/16, given our prior and demographic model.

As noted in response to the editor, these two approaches (FDR and Bayes odds) are based on somewhat distinct assumptions.

We have now added and/or emphasized these two points in the main text.

The paper provides a comparison of their functional predictions to CADD scores, an older machine-learning-based attempt at identifying site by site constraint at single base pair resolution. While this section is useful and informative, I would have liked to see a discussion of the degree to which the comparison might be circular due to CADD's reliance on information about which sites are and are not variable. I had trouble assessing this for myself given that CADD appears to have used genetic variation data available a few years ago, but obviously did not use the biobank scale datasets that were not available when that work was published.

We apologize for the lack of clarity in the presentation. We meant to emphasize that de novo mutation rates vary across CADD deciles when considering all CpG sites (Figure 2—figure supplement 5c), which confounds CADD precisely because it is based in part on which sites are variable. We now write:

“We can also check that the fraction of sites segregating is inversely proportional to the predicted functional importance of the sites using CADD scores (12), widely used measures of constraint that incorporate functional annotations and measures of conservation. Across deciles, mean de novo transition rates at methylated CpGs are similar (Figure 2—figure supplement 5a) and, as expected, the fraction of segregating sites decreases with increasing CADD scores (Figure 2—figure supplement 5b). We note, however, that mutation rates may not always be similar across comparison groups: considering all CpG sites in exons (i.e., not only highly methylated ones), for example, de novo mutation rates are much more variable across CADD deciles (Figure 2—figure supplement 5c). Consequently the depletion of segregating sites no longer has a simple interpretation (Figure 2—figure supplement 5d), instead reflecting a combination of differences in mutation rates and fitness effects. By implication, while CADD scores are meant to isolate the effects of selection, they will in some cases classify sites that have high mutation rates as less constrained, and vice versa.”

Reading this paper left me excited about the possibility of examining individual invariant CpG sites and deducing how many of them are already associated with known disease phenotypes. I believe the paper does not mention how many of these invariant sites appear in Clinvar or in databases of patients with known developmental disorders, and I wondered how close to saturation disease gene databases might be given that individuals with developmental disorders are much more likely to have their exomes sequenced compared to healthy individuals. One could imagine some such analyses being relatively low hanging fruit that could strengthen the current paper, but the authors also make several reference to a companion paper in preparation that deals more directly with the problem of assessing clinical variant significance. This is a reasonable strategy, but it does give the Discussion section of the paper somewhat of a "to be continued" feel.

We apologize for the confusion that arose from our references to a second manuscript in prep. The companion paper is not a continuation of the current manuscript: it contains an analysis of fitness and pathogenic effects of loss-of-function variation in human exomes.

Following the reviewer’s suggestion to address the clinical significance of our results, we have now examined the relationship of mCpG sites invariant in current samples with ClinVar variants. We find that of the approximately 59,000 non-synonymous mCpG sites that are invariant, only ~3.6% overlap with C>T variants associated with at least one disease and classified as likely pathogenic in ClinVar (~5.8% if we include those classified as uncertain or with conflicting evidence as pathogenic). Approximately 2% of invariant mCpGs have C>T mutations in what is, to our knowledge, the largest collection of de novo variants ascertained in ~35,000 individuals with developmental disorders (DDD, Kaplanis et al., 2020). At the level of genes, of the 10k genes that have at least one invariant non-synonymous mCpG, only 8% (11% including uncertain variants) have any non-synonymous hits in ClinVar, and ~8% in DDD. We think it highly unlikely that the large number of remaining invariant sites are not seen with mutations in these databases because such mutations are lethal; rather it seems to us to be the case that these disease databases are far from saturation as they contain variants from a relatively small number of individuals, are subject to various ascertainment biases both at the variant level and at the individual level, and only contain data for a small subset of existing severe diseases.

With a view to assessing clinical relevance however, we can ask a related question, namely how informative being invariant in a sample of 780K is about pathogenicity in ClinVar. Although the relationship between selection and pathogenicity is far from straightforward, being an invariant non-synonymous mCpG in current samples not only substantially increases (15-fold) the odds of hs > 0.5x10-3 (see Figure 4b), it also increases the odds of being classified as pathogenic vs. benign in ClinVar 8-51 fold. In the DDD sample, we don’t know which variants are pathogenic; however, if we consider non-synonymous mutations that occur in consensus DDD genes as pathogenic (a standard diagnostic criterion), being invariant increases the odds of being classified as pathogenic 6-fold. We caution that both ClinVar classifications and the identification of consensus genes in DDD relies in part on whether a site is segregating in datasets like ExAC, so this exercise is somewhat circular. Nonetheless it illustrates that there is some information about clinical importance in mCpG sites that are invariant in current samples, and that the degree of enrichment (6 to 51-fold) is very roughly on par with the Bayes odds that we estimate of strong selection conditional on a site being invariant. We have added these findings to the main text and added the plot as Figure 4—figure supplement 2.

I think the paper could be strengthened by calculating the proportion of non-variable CpG sites in teach category are likely to be truly under constraint, making use of some kind of multiple testing correction. This would build upon the intuition that a non-variable CpG is likely functional with a non-corrected p value of 0.01.

As we write above, we now address this concern in the paper in two ways: using false discovery rates and Bayes odds of hs>0.5x10-3.

My point about the possible circularity of comparison to CADD could be addressed with further discussion of the degree to which CADD is informed by patterns of human genetic variation and how incorporation of genetic variation into CADD scores might affect the conclusions of this section. As an additional point in the CADD section, it's not totally clear whether the statement "Mean transition rates at methylated CpGs are similar across CADD deciles" is based on de novo mutation data or some other data source.

We apologize for the lack of clarity in the presentation but meant to emphasize that de novo mutation rates vary across all CpG sites by CADD score, which confounds CADD precisely because it is based in part on which sites are variable. We have now revised the text to hopefully clarify this point (see response to reviewer 1).

Another addition that would add a lot to the paper, though is not strictly necessary, would be to comment on the overlap between sites identified as under selection by the current paper and sites where mutations are already annotated as clinically relevant or suspected to be so based on their occurrence in a disease cohort.

We thank the reviewer for this suggestion. As detailed in our response above, we now show that CpG sites invariant in population cohorts that overlap with ClinVar and among de novo variants in the DDD cohort are informative about pathogenicity of variants. We have added Figure 4—figure supplement 2 to illustrate this point.

Reviewer #3:

Agarwal et al., combine a few well-known ideas in population genetics – diminishing returns in sampling new alleles with increasing sample size and the enrichment of invariant sites for sites under strong purifying selection – and point out the exciting result that sample sizes of modern human data sets are sufficiently large that, for highly mutable sites, saturation mutation has been reached. This is my favorite kind of result – one that is strikingly obvious in retrospect but that I had never considered (and probably wouldn't have). The manuscript is well written, and a number of my concerns or questions while reading were resolved directly by the authors later on. I have no major concerns, but a few potential suggestions that might strengthen the presentation.

The authors emphasize several times how important an accurate demographic model is. While we may be close to a solid demographic model for humans, this is certainly not the case for many other organisms. Yet we are not far off from sufficient sample sizes in a number of species to begin to reach saturation. I found myself wondering how different the results/inference would be under a different model of human demographic history. Though likely the results would be supplemental, it would be nice in the main text to be able to say something about whether results are qualitatively different under a somewhat different published model.

We had previously examined the effect of a few demographic scenarios with large increases in population size towards the present on the average length of the genealogy of a sample (and hence the expected number of mutations at a site) in Figure 3—figure supplement 1b, but without quantifying the effect on our selection inference. Following this suggestion, we now consider a widely used model of human demography inferred from a relatively small sample, and therefore not powered to detect the huge increase in population size towards the present (Tennessen et al., 2012). Using this model, we find a poor fit to the proportion of segregating CpG sites (the observed fraction is 99% in 780K exomes, when the model predicts 49%). Also, as expected, inferences about selection depend on the accuracy of the demographic model (as can be seen by comparing Author response image 3 panel B to Figure 4B in the main text).

Author response image 3

On a similar note, while a fixed hs simplifies much of the analysis, I wondered how results would differ for (1) completely recessive mutations and (2) under a distribution of dominance coefficients, especially one in which the most deleterious alleles were more recessive. Again, though I think it would strengthen the manuscript by no means do I feel this is a necessary addition, though some discussion of variation in dominance would be an easy and helpful add.

There's some discussion of population structure, but I also found myself wondering about GxE. That is, another reason a variant might be segregating is that it's conditionally neutral in some populations and only deleterious in a subset. I think no analysis to be done here, but perhaps some discussion?

We agree that our analysis ignores the possibilities of complete recessivity in fitness (h=0) as well as more complicated selection scenarios, such as spatially-varying selection (of the type that might be induced by GxE). We note however that so long as there are any fitness effects in heterozygotes, the allele dynamics will be primarily governed by hs; one might also imagine that under some conditions, the mean selection effect across environments would predict allele dynamics reasonably well even in the presence of GxE. Also worth exploring in our view is the standard assumption that hs remains fixed even as Ne changes dramatically. We now mention these points in the Outlook.

Maybe I missed it, but I don't think the acronym DNM is explained anywhere. While it was fairly self-explanatory, I did have a moment of wondering whether it was methylation or mutation and can't hurt to be explicit.

We apologize for the oversight and have updated the text accordingly.

https://doi.org/10.7554/eLife.71513.sa2

Article and author information

Author details

  1. Ipsita Agarwal

    Department of Biological Sciences, Columbia University, New York, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing - original draft, Writing - review and editing
    For correspondence
    ia2337@columbia.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-8537-0008
  2. Molly Przeworski

    1. Department of Biological Sciences, Columbia University, New York, United States
    2. Department of Systems Biology, Columbia University, New York, United States
    Contribution
    Conceptualization, Investigation, Methodology, Project administration, Resources, Supervision, Writing - original draft, Writing - review and editing
    For correspondence
    mp3284@columbia.edu
    Competing interests
    Senior editor, eLife
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5369-9009

Funding

National Institutes of Health (GM122975)

  • Molly Przeworski

National Institutes of Health (GM121372)

  • Molly Przeworski

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Peter Andolfatto, Kelley Harris, Hakhamanesh Mostafavi, Magnus Nordborg, Itsik Pe’er, Jonathan Pritchard, Guy Sella, as well as Arbel Harpak, Zach Fuller, and other members of the Andolfatto, Przeworski and Sella labs for helpful discussions. This work was supported by NIH grants GM121372 and GM122975 to MP.

Senior Editor

  1. Patricia J Wittkopp, University of Michigan, United States

Reviewing Editor

  1. Jeffrey Ross-Ibarra, University of California, Davis, United States

Publication history

  1. Received: June 22, 2021
  2. Accepted: November 21, 2021
  3. Accepted Manuscript published: November 22, 2021 (version 1)
  4. Version of Record published: December 17, 2021 (version 2)
  5. Version of Record updated: December 21, 2021 (version 3)

Copyright

© 2021, Agarwal and Przeworski

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,403
    Page views
  • 183
    Downloads
  • 2
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Ipsita Agarwal
  2. Molly Przeworski
(2021)
Mutation saturation for fitness effects at human CpG sites
eLife 10:e71513.
https://doi.org/10.7554/eLife.71513

Further reading

    1. Biochemistry and Chemical Biology
    2. Evolutionary Biology
    Audrey A Burnim, Matthew A Spence ... Nozomi Ando
    Research Article Updated

    Ribonucleotide reductases (RNRs) are used by all free-living organisms and many viruses to catalyze an essential step in the de novo biosynthesis of DNA precursors. RNRs are remarkably diverse by primary sequence and cofactor requirement, while sharing a conserved fold and radical-based mechanism for nucleotide reduction. Here, we structurally aligned the diverse RNR family by the conserved catalytic barrel to reconstruct the first large-scale phylogeny consisting of 6779 sequences that unites all extant classes of the RNR family and performed evo-velocity analysis to independently validate our evolutionary model. With a robust phylogeny in-hand, we uncovered a novel, phylogenetically distinct clade that is placed as ancestral to the classes I and II RNRs, which we have termed clade Ø. We employed small-angle X-ray scattering (SAXS), cryogenic-electron microscopy (cryo-EM), and AlphaFold2 to investigate a member of this clade from Synechococcus phage S-CBP4 and report the most minimal RNR architecture to-date. Based on our analyses, we propose an evolutionary model of diversification in the RNR family and delineate how our phylogeny can be used as a roadmap for targeted future study.

    1. Ecology
    2. Evolutionary Biology
    Julian Melgar, Mads F Schou ... Charlie K Cornwallis
    Research Article

    Cooperative breeding allows the costs of parental care to be shared, but as groups become larger, such benefits often decline as competition increases and group cohesion breaks down. The counteracting forces of cooperation and competition are predicted to select for an optimal group size, but variation in groups is ubiquitous across cooperative breeding animals. Here, we experimentally test if group sizes vary because of sex differences in the costs and benefits of cooperative breeding in captive ostriches, Struthio camelus, and compare this to the distribution of group sizes in the wild. We established 96 groups with different numbers of males (1 or 3) and females (1, 3, 4, or 6) and manipulated opportunities for cooperation over incubation. There was a clear optimal group size for males (one male with four or more females) that was explained by high costs of competition and negligible benefits of cooperation. Conversely, female reproductive success was maximised across a range of group sizes due to the benefits of cooperation with male and female group members. Reproductive success in intermediate sized groups was low for both males and females due to sexual conflict over the timing of mating and incubation. Our experiments show that sex differences in cooperation and competition can explain group size variation in cooperative breeders.