Overview of the aggregate mutation spectrum distance method for discovering mutator alleles.
a) A population of four haplotypes has been genotyped at three informative markers (g1 through g3); each haplotype also harbors unique de novo germline mutations. In practice, de novo mutations are partitioned by k-mer context; for simplicity, this toy example classifies them into just two mutation types (grey squares represent C>(A/T/G) mutations, while grey triangles represent A>(C/T/G) mutations). b) At each informative marker gn, we use all genome-wide de novo mutations to calculate the total number of each mutation type observed on the haplotypes that carry each parental allele (i.e., the aggregate mutation spectrum for that allele). For example, haplotypes with A (orange) genotypes at g1 carry a total of three “triangle” mutations and five “square” mutations, and haplotypes with B (green) genotypes carry a total of six triangle and two square mutations. We then calculate the cosine distance between the two aggregate mutation spectra, which we call the “aggregate mutation spectrum distance.” Cosine distance can be defined as 1 − cos(θ), where θ is the angle between two vectors; in this case, the two vectors are the two aggregate spectra. We repeat this process for every informative marker gn. c) To assess the significance of any distance peaks in b), we perform permutation tests. In each of N permutations, we shuffle the haplotype labels associated with the de novo mutation data, run a genome-wide distance scan, and record the maximum cosine distance encountered at any locus in the scan. Finally, we take the (1 − p) quantile (i.e., the 100(1 − p)th percentile) of the distribution of those maximum distances as the genome-wide cosine distance threshold at the specified value of p.
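The procedure in panels b) and c) can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 0/1 encoding of the A and B alleles, the default values of `n_permutations` and `p`, and all toy data are hypothetical. The per-haplotype counts and the genotypes at g2 and g3 are invented; only the aggregate totals at g1 (three triangle and five square for A, six triangle and two square for B) come from the caption.

```python
import numpy as np


def aggregate_spectra(genotypes, spectra):
    """Sum per-haplotype mutation counts by allele at a single marker.

    genotypes: shape (n_haplotypes,), 0 for allele A and 1 for allele B
               (an assumed encoding).
    spectra:   shape (n_haplotypes, n_mutation_types), mutation counts.
    """
    spectrum_a = spectra[genotypes == 0].sum(axis=0)
    spectrum_b = spectra[genotypes == 1].sum(axis=0)
    return spectrum_a, spectrum_b


def cosine_distance(u, v):
    """Return 1 - cos(theta), where theta is the angle between u and v.

    Yields nan if either vector is all zeros; this sketch assumes every
    allele group carries at least one mutation.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def distance_scan(genotype_matrix, spectra):
    """Aggregate mutation spectrum distance at every marker.

    genotype_matrix: shape (n_markers, n_haplotypes).
    """
    return np.array(
        [cosine_distance(*aggregate_spectra(g, spectra)) for g in genotype_matrix]
    )


def permutation_threshold(genotype_matrix, spectra, n_permutations=1000, p=0.05, seed=0):
    """Genome-wide threshold: the (1 - p) quantile of per-permutation maxima."""
    rng = np.random.default_rng(seed)
    max_distances = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffle which haplotype each mutation spectrum is attached to.
        shuffled = spectra[rng.permutation(len(spectra))]
        max_distances[i] = distance_scan(genotype_matrix, shuffled).max()
    return np.quantile(max_distances, 1.0 - p)


# Hypothetical toy data. Columns are (triangle, square) counts per haplotype.
spectra = np.array([[1, 3], [2, 2], [4, 1], [2, 1]])
# Rows are markers g1..g3; columns are the four haplotypes.
genotype_matrix = np.array([
    [0, 0, 1, 1],  # g1: A totals (3, 5), B totals (6, 2), as in the caption
    [0, 1, 0, 1],  # g2: invented
    [0, 1, 1, 0],  # g3: invented
])

scan = distance_scan(genotype_matrix, spectra)
threshold = permutation_threshold(genotype_matrix, spectra, n_permutations=200)
print("distances per marker:", scan)
print("genome-wide threshold:", threshold)
```

With these toy counts, the distance at g1 is 1 − 28/√(34 · 40) ≈ 0.24. Recording only the maximum distance from each permuted scan is what makes the resulting threshold genome-wide: it accounts for the fact that every marker is tested. With only four haplotypes there are very few distinct label permutations, so the threshold here is purely illustrative.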