Overview of aggregate mutation spectrum distance method for discovering mutator alleles.

a) A population of four haplotypes has been genotyped at three informative markers (g1 through g3); each haplotype also harbors unique de novo germline mutations. In practice, de novo mutations are partitioned by k-mer context; for simplicity, this toy example classifies them into just two mutation types (grey squares represent C>(A/T/G) mutations, while grey triangles represent A>(C/T/G) mutations). b) At each informative marker gn, we use all genome-wide de novo mutations to calculate the total number of each mutation type observed on haplotypes that carry each parental allele (i.e., the aggregate mutation spectrum for that allele). For example, haplotypes with A (orange) genotypes at g1 carry a total of three “triangle” mutations and five “square” mutations, and haplotypes with B (green) genotypes carry a total of six triangle and two square mutations. We then calculate the cosine distance between the two aggregate mutation spectra, which we call the “aggregate mutation spectrum distance.” Cosine distance is defined as 1 − cos(θ), where θ is the angle between two vectors; in this case, the two vectors are the two aggregate spectra. We repeat this process for every informative marker gn. c) To assess the significance of any distance peaks in b), we perform permutation tests. In each of N permutations, we shuffle the haplotype labels associated with the de novo mutation data, run a genome-wide distance scan, and record the maximum cosine distance encountered at any locus in the scan. Finally, we take the 100 × (1 − p)th percentile of the distribution of those maximum distances as the genome-wide cosine distance threshold at the specified value of p.
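The scan and permutation test described in panels b) and c) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the per-haplotype counts, and the genotypes at g2 and g3 are hypothetical, chosen only so that the aggregate spectra at g1 match the toy example (three triangle and five square mutations on A haplotypes; six triangle and two square on B haplotypes). The percentile is taken with a simple nearest-rank rule, which is also an assumption.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cos(theta), where theta is the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def aggregate_spectrum(spectra, genotypes, allele):
    """Sum mutation counts over all haplotypes carrying `allele` at one marker."""
    total = [0] * len(spectra[0])
    for counts, genotype in zip(spectra, genotypes):
        if genotype == allele:
            total = [t + c for t, c in zip(total, counts)]
    return total


def distance_scan(spectra, marker_genotypes):
    """Cosine distance between the A and B aggregate spectra at every marker.

    Assumes every marker has at least one mutated haplotype per allele,
    so neither aggregate spectrum is the zero vector.
    """
    return [
        cosine_distance(
            aggregate_spectrum(spectra, genotypes, "A"),
            aggregate_spectrum(spectra, genotypes, "B"),
        )
        for genotypes in marker_genotypes
    ]


def permutation_threshold(spectra, marker_genotypes, n_permutations, p, seed=0):
    """Nearest-rank 100 * (1 - p)th percentile of per-permutation maximum distances."""
    rng = random.Random(seed)
    shuffled = list(spectra)
    max_distances = []
    for _ in range(n_permutations):
        # Shuffling the spectra breaks the link between haplotype labels and mutations.
        rng.shuffle(shuffled)
        max_distances.append(max(distance_scan(shuffled, marker_genotypes)))
    max_distances.sort()
    rank = min(n_permutations, max(1, math.ceil((1 - p) * n_permutations)))
    return max_distances[rank - 1]


# Hypothetical per-haplotype counts, ordered as [triangle, square].
toy_spectra = [[1, 3], [2, 2], [4, 1], [2, 1]]
# Hypothetical genotypes of the four haplotypes at g1, g2, and g3.
toy_genotypes = [
    ["A", "A", "B", "B"],  # g1: A total = [3, 5], B total = [6, 2]
    ["A", "B", "A", "B"],  # g2
    ["A", "B", "B", "A"],  # g3
]

toy_distances = distance_scan(toy_spectra, toy_genotypes)
toy_threshold = permutation_threshold(toy_spectra, toy_genotypes, 100, 0.05)
print(round(toy_distances[0], 3))  # distance at g1: 0.241
```

With only four haplotypes there are just 24 distinct label orderings, so the threshold from this toy run is illustrative rather than meaningful.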

Results of aggregate mutation spectrum distance scans in the BXDs.

a) Adjusted cosine distances between aggregate 1-mer de novo mutation spectra on BXD haplotypes (n = 117 haplotypes; 65,552 total mutations) with either D or B alleles at 7,128 informative markers. b) Adjusted cosine distances between aggregate 1-mer de novo mutation spectra on BXD haplotypes with D alleles at rs27509845 (n = 66 haplotypes; 42,171 total mutations) and either D or B alleles at 6,957 informative markers. c) Adjusted cosine distances between aggregate 1-mer de novo mutation spectra on BXD haplotypes with B alleles at rs27509845 (n = 44 haplotypes; 22,645 total mutations) and either D or B alleles at 6,957 informative markers. In each panel, the cosine distance threshold at p = 0.05 was calculated by performing 10,000 permutations of the BXD mutation data, and is shown as a dotted grey line.

BXD mutation spectra are affected by alleles at both mutator loci.

a) C>A de novo germline mutation fractions in BXDs with either D or B genotypes at markers rs27509845 (chr4 peak) and rs46276051 (chr6 peak). Distributions of C>A mutation fractions were compared with two-sided Mann-Whitney U-tests; annotated p-values are uncorrected. B-B vs. B-D comparison: U-statistic = 149.0, p = 7.58e-2; B-D vs. D-D comparison: U-statistic = 21.0, p = 2.61e-8; D-B vs. D-D comparison: U-statistic = 232.5, p = 6.99e-5. b) The count of C>A de novo germline mutations in each BXD was plotted against the number of generations for which it was inbred. Lines represent predicted C>A counts in each haplotype group from a generalized linear model (Poisson family, identity link), and shading around each line represents the 95% confidence interval. c) Germline mutations in each BXD were assigned to COSMIC SBS mutation signatures using SigProfilerExtractor [29]. After grouping BXDs by their genotypes at rs27509845 and rs46276051, we calculated the fraction of mutations in each group that was attributed to each signature. The proposed etiologies of the mutation signatures are: SBS1 (spontaneous deamination of methylated cytosine nucleotides at CpG contexts), SBS5 (unknown, clock-like signature), SBS18 (damage by reactive oxygen species, related to SBS36 and defective base-excision repair due to loss-of-function mutations in MUTYH), and SBS30 (defective base-excision repair due to NTHL1 mutations).
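The U-statistics reported in panel a) include a half-integer value (232.5), which is what tied observations produce. The sketch below shows one standard way to compute a Mann-Whitney U-statistic by direct pair counting. It does not reproduce the authors' analysis: the function name and the C>A fractions are hypothetical, and the p-value calculation is omitted.

```python
def mann_whitney_u(x, y):
    """U-statistic for sample x: pairs with x_i > y_j count 1, ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical C>A mutation fractions for two genotype groups.
group_x = [0.30, 0.35, 0.40]
group_y = [0.20, 0.30, 0.25]

u_x = mann_whitney_u(group_x, group_y)
u_y = mann_whitney_u(group_y, group_x)
print(u_x, u_y)  # 8.5 0.5
```

The two statistics always sum to the number of pairs (here 3 × 3 = 9), so reporting either one is sufficient.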

Nonsynonymous mutations in DNA repair genes near the chr6 peak.

Names of gene expression datasets used for each tissue type on GeneNetwork.