Limited role of generation time changes in driving the evolution of the mutation spectrum in humans
Figures
 
              Changes in the mutation spectrum of polymorphisms in CEU over evolutionary time.
- Figure 1—source data 1. BED file for the commonly accessible region, excluding exons and phylogenetically conserved elements.
- https://cdn.elifesciences.org/articles/81188/elife-81188-fig1-data1-v3.zip
- Figure 1—source data 2. Text files with (pseudo-)counts of different types of mutations in YRI, LWK, CEU, TSI, CHB, and JPT in each time window.
- https://cdn.elifesciences.org/articles/81188/elife-81188-fig1-data2-v3.zip
 
              Mutation spectra of polymorphisms stratified by allele age in six human populations (YRI, LWK, CEU, TSI, CHB, and JPT).
The ancestral allele of each single-nucleotide polymorphism (SNP) is determined based on the 6-EPO human ancestral allele, and the allele age is inferred using Relate (Speidel et al., 2019). Populations are coded the same way as in the 1000 Genomes Project (same below): YRI, Yoruba in Ibadan, Nigeria; LWK, Luhya in Webuye, Kenya; CEU, Utah Residents (CEPH) with Northern and Western European ancestry; TSI, Toscani in Italia; CHB, Han Chinese in Beijing, China; and JPT, Japanese in Tokyo, Japan.
 
              Mutation spectra of CEU polymorphisms and de novo mutations (DNMs) from Icelandic trios (Halldorsson et al., 2019).
‘DNM 2019’ denotes DNM data from 2879 Icelandic trios (Halldorsson et al., 2019), from which 5 trios with exceedingly large numbers of DNMs and 92 trios with maternal age above 40 were excluded. ‘DNM 2017’ denotes DNM data from 1475 Icelandic trios from Jónsson et al., 2017, excluding 73 trios with maternal age above 40. All polymorphisms and DNMs were filtered to the commonly accessible region.
 
              Mutation spectra of CEU polymorphisms and de novo mutations (DNMs) from Icelandic trios, excluding C>T transitions at CpG sites.
 
              Mean and distribution of B-scores of different mutation types.
We show the mean and the distribution of B-scores, which measure the reduction in nucleotide diversity levels compared to the neutral expectation (McVicker et al., 2009), for different mutation types (0: high constraint; 1000: low constraint). The box represents the interquartile range, with the centerline showing the median value.
 
              Mutation spectra of human polymorphisms in genomic regions with weak background selection (B-score > 800).
 
              Fractions of S>S, S>W, W>S, and W>W mutations stratified by (A) derived allele frequency and (B) allele age.
The enrichment of weak (W) > strong (S) mutations and depletion of S>W mutations in variants with higher derived allele frequencies and old allele ages support profound effects of GC-biased gene conversion (gBGC) on human polymorphisms and biases in allele age dating by Relate, which ignores gBGC.
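The four classes in this figure can be illustrated with a minimal sketch, assuming the usual convention that G and C are strong (S) bases and A and T are weak (W) bases. The function names `gbgc_class` and `class_fractions` are hypothetical and are not taken from the authors' code.

```python
from collections import Counter

# Assumed convention: G/C are "strong" (S) bases, A/T are "weak" (W) bases.
STRONG = {"G", "C"}
WEAK = {"A", "T"}


def gbgc_class(ancestral: str, derived: str) -> str:
    """Return 'S>S', 'S>W', 'W>S', or 'W>W' for an ancestral>derived change."""
    anc, der = ancestral.upper(), derived.upper()
    if anc == der or not {anc, der} <= (STRONG | WEAK):
        raise ValueError(f"not a valid single-nucleotide change: {anc}>{der}")

    def label(base: str) -> str:
        return "S" if base in STRONG else "W"

    return f"{label(anc)}>{label(der)}"


def class_fractions(variants):
    """Fractions of the four classes among (ancestral, derived) pairs."""
    counts = Counter(gbgc_class(a, d) for a, d in variants)
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("S>S", "S>W", "W>S", "W>W")}
```

Applying `class_fractions` separately to the variants in each derived allele frequency or allele age bin would give the kind of stratified fractions plotted in panels (A) and (B).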
 
              Distribution of variants subject to biased gene conversion stratified by allele age and recombination rate.
(A) Fraction of S>S, S>W, W>S, and W>W mutation types and (B) ratio of W>S to S>W mutations.
 
              Mutation spectra of human polymorphisms in six populations (YRI, CEU, CHB, LWK, TSI, and JPT) stratified by allele age based on alternative binning strategies.
The age bin boundaries were determined based on allele age distribution of variants observed in YRI (A) and CHB (B) samples, respectively.
 
              Comparison of pairwise mutation ratios for polymorphisms arising in different time windows.
(A) Four pairwise mutation ratios are shown, each of which compares two mutation types that are matched for mutational opportunity and effects of GC-biased gene conversion (gBGC). The black arrow indicates the window coinciding with the out-of-Africa (OOA) migration. The points represent the observed polymorphism ratios, while the whiskers denote the 95% CI assuming a binomial distribution of polymorphism counts. Highlighted in boxes are three ratios that show significant interpopulation differences, with in-depth investigation into each shown in the lower panels. Asterisks refer to the p-value obtained from a chi-square test after a Bonferroni correction for 60 tests: *p<0.01, **p<0.0001, and ***p<10⁻⁸ (the same indicators of significance levels are used in the figure supplements). (B) Elevation in C>T/C>A ratio in CEU at non-CpG sites, after excluding the four trinucleotide contexts (TCC, TCT, CCC, and ACC) previously identified to be associated with the TCC pulse in Europeans (denoted by TCC*; Harris and Pritchard, 2017), as well as contexts affected by the Catalog of Somatic Mutations in Cancer (COSMIC) mutational signatures SBS7 and SBS11 (Harris, 2015; Mathieson and Reich, 2017). (C) Post-OOA divergence in C>G/T>A ratio among three population groups. (D) Higher T>C/T>G ratio in YRI than in CEU and CHB samples among extremely old variants, driven by TpG variants.
- Figure 2—source data 1. Text files with (pseudo-)counts of mutations classified into eight types in genomic regions including, excluding, and within the maternal C>G mutation hotspots, in YRI, LWK, CEU, TSI, CHB, and JPT in each time window.
- https://cdn.elifesciences.org/articles/81188/elife-81188-fig2-data1-v3.zip
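The whiskers and asterisks described for Figure 2A can be illustrated with a minimal sketch. Several choices here are assumptions rather than details given in the caption: the binomial interval is computed as a Wilson score interval on the proportion num/(num + den) and then converted to a ratio, the chi-square test is applied without continuity correction to a 2×k table of counts (k = 2 or 3 populations), and the asterisk thresholds are applied to the Bonferroni-adjusted p-value. All function names are hypothetical.

```python
import math


def ratio_with_ci(num: int, den: int, z: float = 1.96):
    """Ratio num/den with an approximate 95% CI from a binomial model.

    Assumption: Wilson score interval on p = num / (num + den),
    mapped to the ratio scale via p / (1 - p). Requires num, den > 0.
    """
    n = num + den
    p = num / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    lo, hi = centre - half, centre + half
    return num / den, lo / (1 - lo), hi / (1 - hi)


def chi_square_2xk(table):
    """Pearson chi-square test on [(num, den), ...], one pair per population.

    Only k = 2 (df = 1) and k = 3 (df = 2) are supported, because the
    chi-square tail probability has a simple closed form in those cases.
    """
    k = len(table)
    if k not in (2, 3):
        raise ValueError("this sketch supports 2 or 3 populations only")
    col_num = sum(a for a, _ in table)
    col_den = sum(b for _, b in table)
    total = col_num + col_den
    chi2 = 0.0
    for a, b in table:
        row = a + b
        for obs, col in ((a, col_num), (b, col_den)):
            expected = row * col / total
            chi2 += (obs - expected) ** 2 / expected
    p = math.erfc(math.sqrt(chi2 / 2)) if k == 2 else math.exp(-chi2 / 2)
    return chi2, p


def significance_stars(p: float, n_tests: int = 60) -> str:
    """Asterisks after Bonferroni correction (thresholds from the caption)."""
    p_adj = min(1.0, p * n_tests)
    if p_adj < 1e-8:
        return "***"
    if p_adj < 1e-4:
        return "**"
    if p_adj < 1e-2:
        return "*"
    return ""
```

For example, `chi_square_2xk([(150, 50), (100, 100)])` compares a made-up pair of count vectors for two populations in one time window, and `significance_stars` converts the resulting p-value to the caption's notation.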
 
              Pairwise polymorphism ratios in YRI, CEU and CHB, as well as three additional populations of African (LWK), European (TSI), and East Asian (JPT) ancestry in the 1000 Genomes Project.
Panel (A) shows the results based on all variants dated by Relate, whereas (B) excludes singleton variants. The three signals of differences between YRI, CEU, and CHB are also observed in LWK, TSI, and JPT, with slight differences in timing and magnitude. Additionally, TSI shows a slightly elevated T>C/T>G ratio in recent time windows, which is absent in CEU and possibly represents another signal or a technical artifact.
 
              Pairwise polymorphism ratios in genomic regions with weak background selection (B-score >800).
 
              Pairwise polymorphism ratios in 33% of the genome with the lowest (top) and highest (bottom) regional recombination rates.
 
              Pairwise ratios of human polymorphisms stratified by allele frequency.
Unlike Figure 2 and other figure supplements, this analysis was performed on all non-singleton variants, including unphased single-nucleotide polymorphisms (SNPs) and those unmapped by Relate. Singletons were removed because they usually have a higher false positive rate. We note that the derived allele frequency (DAF) is a poor proxy for allele age, and variants at the same (sample) frequency can have drastically different ages within and across populations, which renders the interpopulation comparisons difficult to interpret and not directly comparable with Figure 2. However, for recent changes in the mutation spectrum, we expect the mutation ratios of low-frequency variants to differ across populations as those variants are likely to be young. Consistent with this expectation, the two post-OOA signals in non-CpG C>T/C>A and C>G/T>A ratios are replicated in variants at low and intermediate frequencies. The signal of T>C/T>G in ancient variants is also discernible in high-frequency variants with DAF > 90%.
 
              Pairwise polymorphism ratios in YRI, CEU, and CHB in commonly accessible regions based on alternative binning strategies.
The age bin boundaries were determined based on allele age distribution of variants observed in YRI (A) and CHB (B) samples, respectively.
 
              Alternative pairwise polymorphism ratios (T>G/T>A and C>G/C>A) to investigate the cause of interpopulation differences in C>G/T>A ratio.
 
              Comparison of allele age estimates inferred by Relate.
Relate infers the allele age of a mutation in two rounds: an initial estimate based on the entire dataset and then a more refined estimate for each population after branch length re-estimation (as recommended by Relate). Panels (A–C) show the difference between the initial and refined population-specific mutation ages for each population (CEU, YRI, and CHB). Panels (D–F) show the comparison of population-specific estimates for the same variant found in pairs of populations. Variants with overlapping age estimates across the two populations are shown in gray, and those with nonoverlapping ranges are shown in red. Numbers in the legend indicate the total number and proportion of variants in each class.
 
              The T>C/T>G signal in variants shared across continental groups (left) and nonshared variants (right).
Considering only six populations (YRI, LWK, CEU, TSI, CHB, and JPT), we operationally defined shared variants as single-nucleotide polymorphisms (SNPs) with both alleles observed in samples from at least one population from each of the three continental groups (i.e., variants segregating in [YRI or LWK] and [CEU or TSI] and [CHB or JPT]). For shared variants, we observed no significant interpopulation differences in the T>C/T>G ratio (p>0.05), despite some elevation in old variants compared to younger variants in all populations (left); in contrast, the interpopulation differences for nonshared variants are highly significant (right), suggesting that the T>C/T>G signal is driven mostly by nonshared variants instead of bias or inaccuracy in allele age estimation.
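The operational definition of shared variants can be written as a short predicate. This is a minimal sketch, assuming each variant is summarized by its derived allele frequency in each population and that "both alleles observed" means a sample frequency strictly between 0 and 1. The group labels (`AFR`, `EUR`, `EAS`) and function names are hypothetical.

```python
# Continental groups as described in the caption; the labels are assumptions.
GROUPS = {
    "AFR": ("YRI", "LWK"),
    "EUR": ("CEU", "TSI"),
    "EAS": ("CHB", "JPT"),
}


def is_segregating(daf: float) -> bool:
    """Both alleles are observed when the sample frequency is in (0, 1)."""
    return 0.0 < daf < 1.0


def is_shared(daf_by_pop: dict) -> bool:
    """True if the variant segregates in at least one population per group.

    Populations missing from `daf_by_pop` are treated as not segregating.
    """
    return all(
        any(is_segregating(daf_by_pop.get(pop, 0.0)) for pop in pops)
        for pops in GROUPS.values()
    )
```

A variant segregating in YRI, CEU, and JPT counts as shared, whereas one that is fixed or absent in both CEU and TSI is nonshared regardless of its frequencies elsewhere.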
 
              Alternative pairwise polymorphism ratios (T>C/T>A and T>A/T>G) to investigate the cause of interpopulation differences in T>C/T>G ratio.
The significant differences in recent time windows (<3130 generations) in the T>C/T>A and T>A/T>G comparisons are driven by differences in T>A mutation rate (CHB > CEU > YRI), which is consistent with findings in Figure 2—figure supplement 6.
 
              Analysis of mutational signatures across allele age bins using non-negative matrix factorization (NMF): we ran NMF on the normalized allele counts for 96 mutation types (defined by the 5’ and 3’ nucleotides flanking each single-nucleotide polymorphism [SNP]) for 15 mutation age bins in the three populations (CEU, CHB, and YRI).
(A) shows the fractions of each mutation context for each of the three significant signatures. (B) shows the contributions of each of the three signatures (Signatures 1–3) in each population (YRI, CEU, and CHB) across allele age bins. (C) shows the percent variance explained by the NMF analysis for each factorization rank K (ranging from K = 2 to 15), and (D) shows the cophenetic correlation coefficients calculated over 200 independent NMF runs at each factorization rank K (ranging from K = 2 to 15).
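The 96 mutation types used as NMF input can be enumerated as follows. This is a minimal sketch, assuming the common convention of folding each substitution onto the pyrimidine (C or T) strand, which yields 6 substitution classes × 4 possible 5’ bases × 4 possible 3’ bases. The caption does not state which folding convention was used, and the function names and the `5'[ref>alt]3'` label format are hypothetical.

```python
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def mutation_type(trinucleotide: str, alt: str) -> str:
    """Label a mutation by its trinucleotide context, e.g. 'T[C>T]C'.

    `trinucleotide` is the ancestral context (5' base, mutated base, 3' base);
    purine-centered contexts are folded onto the opposite strand.
    """
    tri, alt = trinucleotide.upper(), alt.upper()
    if tri[1] in "AG":
        tri, alt = revcomp(tri), revcomp(alt)
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


# All 96 pyrimidine-centered types: 2 reference bases x 3 alternatives x 16 flanks.
ALL_TYPES = [
    f"{five}[{ref}>{alt}]{three}"
    for ref in "CT"
    for alt in BASES
    if alt != ref
    for five, three in product(BASES, repeat=2)
]
```

Counting variants per label in each age bin and population, then normalizing, would give a 96-row count matrix of the kind NMF decomposes.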
 
              Enrichment of TpG>CpG mutations in all T>C mutations in ancient time bins.
T>C mutations at CpG sites are greatly enriched in ancient variants relative to T>C mutations at non-CpG sites (right), consistent with the hypothesis that some of the ancient TpG>CpG mutations are mis-polarized CpG>TpG mutations. Moreover, the temporal trend of (T>C at CpG)/(T>C at non-CpG) ratio mirrors that of T>C/T>G ratio closely.
 
              Persistence of the T>C/T>G signal with alternative methods for derived allele polarization and stratification of the reference genome by local ancestry.
Panels (A) and (B) show the results when the ancestral alleles were polarized to the chimpanzee reference allele or to high-confidence sites in EPO, respectively. Panel (C) shows the results in subsets of genomic regions stratified by the local ancestry of the human reference genome (African or European; results not shown for the small fraction of the reference genome of Asian ancestry). The T>C/T>G signal is no longer significant in the two subsets of the genome due to the reduction in sample size, but YRI still shows a higher ratio than CEU and CHB in both subsets, with a similar magnitude of elevation. The numbers above each panel indicate the total number of T>C and T>G SNPs across all age bins in each population.
 
              Comparison of pairwise mutation ratios for polymorphisms arising in different time windows after excluding singletons.
Unlike Figure 2 and other figure supplements, this analysis was performed after excluding singletons because they usually have a higher false positive rate.
 
              Effects of parental ages on three pairwise mutation ratios estimated from de novo mutation (DNM) data in 2879 Icelandic trios (Halldorsson et al., 2019).
The three panels show the parental age effects (left) on (A) nonCpG C>T/C>A, (B) C>G/T>A, and (C) nonTpG T>C/T>G ratios, respectively. On the left, the different colored curves reflect expected mutation ratios for different ratios of paternal (Gp) to maternal (Gm) mean generation times. Each light gray curve represents the expected ratio for Gp/Gm = 1 from one bootstrap resampling replicate (see ‘Materials and methods’), with the lighter blue area denoting the 90% confidence interval (CI) assessed from 500 replicates. For ease of comparison, ratios for polymorphisms of different ages identified in CEU are shown on the right of each panel. The points represent the observed polymorphism ratios, while the whiskers denote the 95% CI assuming a binomial distribution of polymorphism counts.
- Figure 3—source data 1. Mutation parameters inferred from de novo mutation (DNM) data in 2879 Icelandic trios with estimated uncertainty based on bootstrap resampling (one file for each mutation type for commonly accessible regions; n = 500 replicates).
- https://cdn.elifesciences.org/articles/81188/elife-81188-fig3-data1-v3.zip
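Bootstrap resampling of trios, as used for the uncertainty estimates in Figure 3, can be sketched as follows. This is a deliberately simplified illustration: the statistic here is a pooled ratio of DNM counts across trios, whereas the study refits parental age effects in each replicate (see ‘Materials and methods’). The record layout (one dict of counts per trio), the percentile interval, and the function names are all assumptions.

```python
import random


def pooled_ratio(trios, num_key, den_key):
    """Ratio of total DNM counts of two mutation types across trios."""
    num = sum(t[num_key] for t in trios)
    den = sum(t[den_key] for t in trios)
    return num / den


def bootstrap_ci(trios, num_key, den_key, n_rep=500, level=0.90, seed=0):
    """Percentile CI from resampling trios with replacement."""
    rng = random.Random(seed)
    reps = sorted(
        pooled_ratio(rng.choices(trios, k=len(trios)), num_key, den_key)
        for _ in range(n_rep)
    )
    alpha = (1.0 - level) / 2.0
    lo = reps[int(alpha * n_rep)]
    hi = reps[min(n_rep - 1, int((1.0 - alpha) * n_rep))]
    return lo, hi
```

Resampling whole trios, rather than individual mutations, keeps the mutations of each family together, so that between-family variation is reflected in the interval.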
 
              Effects of parental ages on three pairwise mutation ratios estimated from an earlier de novo mutation (DNM) dataset (Jónsson et al., 2017).
The three panels show the parental age effects (left) on (A) nonCpG C>T/C>A, (B) C>G/T>A, and (C) nonTpG T>C/T>G ratios, respectively. Shown on the right are the observed ratios of polymorphisms in CEU stratified by allele age.
 
              Discrepancies in the mutation spectrum between de novo mutation (DNM) datasets and between DNMs and young polymorphisms.
Panel (A) shows the fractions of seven mutation types in two DNM datasets (Halldorsson et al., 2019; Jónsson et al., 2017) as well as in CEU single-nucleotide polymorphisms (SNPs) in the three most recent time windows. Young polymorphisms are depleted of C>T transitions at CpG sites compared to DNMs, consistent with the expectation that recurrent mutations are undated and ignored by Relate. As we previously noted (Gao et al., 2019), the fraction of C>A mutations in the 2017 DNM dataset is substantially lower than that in polymorphisms, which indicates under-detection and is somewhat ameliorated in the 2019 DNM dataset. Panel (B) shows that, in addition to differences in C>T/C>A ratios at both CpG and non-CpG sites, the two DNM datasets also differ significantly in the T>C/T>G ratio, suggesting additional technical differences in mutation identification (note that the 2017 dataset is a subset of the 2019 dataset). Furthermore, the C>G/T>A ratios of both DNM datasets are significantly higher than that in young polymorphisms, highlighting technical differences in variant detection between DNM studies and population datasets.
 
              Sex-specific parental age effects on three pairwise mutation ratios.
In the upper panel, the background color represents the expected de novo mutation (DNM) ratio given the paternal (x-axis) and maternal (y-axis) age, with darker colors representing greater values. Each colored line represents the linear combinations of paternal and maternal ages corresponding to a certain mutation ratio observed in polymorphisms in an age bin. The vertical patterns for non-CpG C>T/C>A and T>C/T>G ratios suggest that these two ratios are insensitive to the maternal age. The lower panel shows the observed polymorphism ratios ordered by allele age (same data as in Figure 2—figure supplement 10), with the colors matching those of lines in the upper panel.
 
              Power for detecting interpopulation differences in C>G/T>A ratio driven by differences in generation time outside (A) and within (B) the maternal C>G enriched regions.
We performed simulations based on the parental age effects on the mutation spectrum estimated from the de novo mutation (DNM) dataset and the observed number of single-nucleotide polymorphisms (SNPs) (see ‘Materials and methods’). We performed 10,000 simulation replicates each for non-hotspot regions (A) and maternal C>G mutation hotspots (B). The simulated C>G/T>A ratios were compared between two populations with different generation times (G = 20, 25, 30, 35, or 40), and the power was estimated as the percentage of replicates with p-value<0.001 based on a chi-square test.
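The power calculation can be illustrated with a minimal simulation sketch. Here the expected C>G/T>A ratio of each population is passed in directly, whereas in the study it follows from the parental age effects at a given generation time. The binomial sampling of counts, the 2×2 chi-square test without continuity correction, and all names and default values are assumptions made for illustration.

```python
import math
import random


def chi2_p_2x2(a: int, b: int, c: int, d: int) -> float:
    """p-value of a Pearson chi-square test (df = 1) on [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2))


def simulate_power(ratio1, ratio2, n1, n2, n_rep=1000, alpha=0.001, seed=1):
    """Fraction of replicates in which the two ratios differ at level alpha.

    ratio1, ratio2: expected numerator/denominator mutation ratios.
    n1, n2: total SNP counts (numerator + denominator types) per population.
    """
    rng = random.Random(seed)
    p1 = ratio1 / (1.0 + ratio1)  # probability that a SNP is the numerator type
    p2 = ratio2 / (1.0 + ratio2)
    hits = 0
    for _ in range(n_rep):
        a = sum(rng.random() < p1 for _ in range(n1))
        c = sum(rng.random() < p2 for _ in range(n2))
        if chi2_p_2x2(a, n1 - a, c, n2 - c) < alpha:
            hits += 1
    return hits / n_rep
```

With equal expected ratios the estimated power approximates the false positive rate, and it rises toward 1 as the ratios diverge or the SNP counts grow, which is why hotspot regions with fewer SNPs would be expected to give lower power for the same difference in generation time.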
 
              Past generation times corresponding to the observed polymorphism ratios in CEU, given parental age effects estimated from de novo mutation (DNM) data.
Red points represent the point estimates based on maximum likelihood estimators of mutation parameters from the DNM data; gray dots show estimates from 500 bootstrap replicates obtained by resampling trios with replacement. We assumed equal male and female generation times (Gp = Gm) for all time windows. Similar trends were obtained for other fixed values of Gp/Gm (between 0.8 and 1.2) or when Gp and Gm varied independently (Figure 4—figure supplement 1, Figure 4—figure supplement 3).
- Figure 4—source data 1. Past generation times inferred from each polymorphism ratio, assuming fixed ratios of male to female generation times (Gp/Gm = 0.8, 1, 1.1, 1.2), with confidence intervals estimated using bootstrap resampling (n = 500 replicates; one file for each mutation type).
- https://cdn.elifesciences.org/articles/81188/elife-81188-fig4-data1-v3.zip
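Converting an observed polymorphism ratio into a generation time can be sketched under a simple linear model. The parametrization below (an intercept plus paternal and maternal slopes per mutation type) is an assumption chosen for illustration and may differ from the model fitted in the study, and the numbers in the usage example are made up, not estimates from the DNM data. `AgeEffect` and `generation_time_from_ratio` are hypothetical names.

```python
from typing import NamedTuple


class AgeEffect(NamedTuple):
    """Assumed linear model: mutations per generation = a + b*Gp + c*Gm."""
    intercept: float
    paternal: float
    maternal: float

    def rate(self, gp: float, gm: float) -> float:
        return self.intercept + self.paternal * gp + self.maternal * gm


def generation_time_from_ratio(r, num: AgeEffect, den: AgeEffect, k: float = 1.0):
    """Solve num.rate / den.rate = r for (Gp, Gm) under the constraint Gp = k * Gm."""
    slope = (num.paternal * k + num.maternal) - r * (den.paternal * k + den.maternal)
    if abs(slope) < 1e-12:
        raise ValueError("ratio does not depend on generation time under this model")
    gm = (r * den.intercept - num.intercept) / slope
    return k * gm, gm


# Usage with made-up parameters (illustrative only):
num = AgeEffect(intercept=2.0, paternal=0.5, maternal=0.1)
den = AgeEffect(intercept=1.0, paternal=0.2, maternal=0.05)
r_obs = num.rate(30, 30) / den.rate(30, 30)
gp, gm = generation_time_from_ratio(r_obs, num, den, k=1.0)
```

When the slope term is small, small changes in the observed ratio map to large changes in the inferred generation time, so the estimate is only as reliable as the fitted age effects.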
 
              Past generation times corresponding to the observed pairwise polymorphism ratios, assuming fixed ratio of male to female generation times of 0.8 (A), 1.1 (B), and 1.2 (C).
The parental age effects were inferred from de novo mutation (DNM) data of 2879 Icelandic trios from Halldorsson et al., 2019.
 
              Past generation times corresponding to the observed pairwise polymorphism ratios, based on parental age effects estimated from an earlier de novo mutation (DNM) dataset (Jónsson et al., 2017).
In each panel, the ratio of male to female generation times is assumed to be fixed throughout time at 0.8 (A), 1 (B) and 1.2 (C), respectively.
 
              Combinations of paternal and maternal reproductive ages corresponding to the observed pairwise polymorphism ratios.
Under the assumption of linear parental age effects on the mutation rate of each mutation type, a specific value of a pairwise polymorphism ratio places a linear constraint on the values of paternal and maternal reproductive ages (denoted by Gp and Gm), which is represented by a line in each plot. Blue lines represent predicted constraints based on the maximum likelihood estimators of mutation parameters estimated from de novo mutation (DNM) data (Halldorsson et al., 2019); gray lines show constraints from 500 bootstrap replicates obtained by resampling trios with replacement. Panels (A) and (B) show the same results but with Gm and Gp shown on the x- and y-axes, respectively, in order to illustrate their temporal trends. Consistent with results shown in Figure 3—figure supplement 3, the nonTpG T>C/T>G and non-CpG C>T/C>A ratios are relatively insensitive to Gm and largely determined by Gp, so the temporal trends of these two mutation ratios in panel (B) mirror those of Figure 4. The C>G/T>A ratio depends on both Gp and Gm, so its temporal trend in both panels (A) and (B) mirrors that shown in Figure 4. Note that for each time window the gray areas predicted by the three polymorphism ratios barely, if at all, overlap, suggesting that no combination of (Gp, Gm) values can explain the three observed polymorphism ratios simultaneously. In addition, the temporal trends predicted by the three polymorphism ratios disagree with each other.
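The linear constraint can be made explicit with a short derivation. The symbols α, β, and γ are introduced here for illustration only and are assumptions: they denote the intercept, paternal slope, and maternal slope of an assumed linear age-effect model for mutation type i, which may be parametrized differently in the study.

```latex
% Assumed linear model for the per-generation mutation count of type i:
\mu_i(G_p, G_m) = \alpha_i + \beta_i G_p + \gamma_i G_m

% An observed ratio r between types 1 and 2 requires
r = \frac{\mu_1(G_p, G_m)}{\mu_2(G_p, G_m)}
  = \frac{\alpha_1 + \beta_1 G_p + \gamma_1 G_m}{\alpha_2 + \beta_2 G_p + \gamma_2 G_m}

% Multiplying out and collecting terms gives a line in the (G_p, G_m) plane:
(\beta_1 - r\beta_2)\, G_p + (\gamma_1 - r\gamma_2)\, G_m = r\alpha_2 - \alpha_1

% When the maternal coefficient is negligible, \gamma_1 - r\gamma_2 \approx 0,
% the constraint reduces to a fixed paternal age:
G_p \approx \frac{r\alpha_2 - \alpha_1}{\beta_1 - r\beta_2}
```

Each observed ratio therefore fixes one line, and three ratios are jointly explainable only if their three lines pass near a common (Gp, Gm) point, which is the overlap that the caption reports as largely absent.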
 
                 
         
         
        