Mutation saturation for fitness effects at human CpG sites

Abstract

Whole exome sequences have now been collected for millions of humans, with the related goals of identifying pathogenic mutations in patients and establishing reference repositories of data from unaffected individuals. As a result, we are approaching an important limit, in which datasets are large enough that, in the absence of natural selection, every highly mutable site will have experienced at least one mutation in the genealogical history of the sample. Here, we focus on CpG sites that are methylated in the germline and experience mutations to T at an elevated rate of ~10^-7 per site per generation; considering synonymous mutations in a sample of 390,000 individuals, ~99% of such CpG sites harbor a C/T polymorphism. Methylated CpG sites provide a natural mutation saturation experiment for fitness effects: as we show, at current sample sizes, not seeing a non-synonymous polymorphism is indicative of strong selection against that mutation. We rely on this idea in order to directly identify a subset of CpG transitions that are likely to be highly deleterious, including ~27% of possible loss-of-function mutations, and up to 20% of possible missense mutations, depending on the type of functional site in which they occur. Unlike methylated CpGs, most mutation types, with rates on the order of 10^-8 or 10^-9, remain very far from saturation. We discuss what these findings imply for interpreting the potential clinical relevance of mutations from their presence or absence in reference databases and for inferences about the fitness effects of new mutations.
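The saturation argument can be illustrated with a back-of-the-envelope calculation. The sketch below is not the authors' analysis. It assumes that mutations at a neutral site arise as a Poisson process along the genealogy of the sample, and that all sites share one total genealogy length. It backs out that length from the two figures quoted in the abstract (a rate of ~10^-7 and ~99% of synonymous methylated CpG sites segregating), then asks what fraction of sites with lower mutation rates would have been hit at least once. The function names are hypothetical, chosen for illustration only.

```python
import math


def prob_at_least_one_mutation(mu, tree_length):
    """P(at least one mutation) at a neutral site, assuming a Poisson
    process with rate `mu` per generation along a genealogy whose
    branches sum to `tree_length` generations (an assumption)."""
    return 1.0 - math.exp(-mu * tree_length)


def implied_tree_length(mu, fraction_segregating):
    """Invert the Poisson model: the total genealogy length that would
    produce `fraction_segregating` at mutation rate `mu`."""
    return -math.log(1.0 - fraction_segregating) / mu


# Figures quoted in the abstract; everything derived from them is illustrative.
MU_METHYLATED_CPG = 1e-7      # C>T rate per site per generation
FRACTION_SEGREGATING = 0.99   # synonymous methylated CpGs with a C/T polymorphism

tree_length = implied_tree_length(MU_METHYLATED_CPG, FRACTION_SEGREGATING)

for mu in (1e-7, 1e-8, 1e-9):
    p = prob_at_least_one_mutation(mu, tree_length)
    print(f"mu = {mu:.0e}: P(at least one mutation) = {p:.3f}")
```

Under these simplifying assumptions, a genealogy long enough to saturate sites mutating at 10^-7 would leave only about 37% of sites at 10^-8, and about 4.5% at 10^-9, with at least one mutation. This is in line with the statement that most mutation types remain far from saturation.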

Data availability

All source data are freely available to researchers, with sources provided in the manuscript. Data and code to generate the figures are available at https://github.com/agarwal-i/cpg_saturation.

The following previously published data sets were used:

Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD, Brand H, Solomonson M, Watts NA, Rhodes D, Singer-Berk M, England EM, Seaby EG, Kosmicki JA, MacArthur DG (2020) gnomAD v2.1. https://gnomad.broadinstitute.org/downloads

(2020) UK Biobank. https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/genetic-data

Dewey FE, et al. (2016) DiscovEHR. http://www.discovehrshare.com/

The 1000 Genomes Project Consortium (2015) 1000 Genomes Phase 3. https://www.internationalgenome.org/category/phase-3/

Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, Karapetyan K, Katz K, Liu C, Maddipatla Z, Malheiro A, McDaniel K, Ovetsky M, Riley G, Zhou G, Holmes JB, Kattman BL, Maglott DR (2018) ClinVar. https://www.ncbi.nlm.nih.gov/clinvar/

Article and author information

Author details

Ipsita Agarwal

Department of Biological Sciences, Columbia University, New York, United States

For correspondence
ia2337@columbia.edu

Competing interests
No competing interests declared.

ORCID iD: 0000-0001-8537-0008

Molly Przeworski

Department of Systems Biology, Columbia University, New York, United States

For correspondence
mp3284@columbia.edu

Competing interests
Molly Przeworski, Senior editor, eLife.

ORCID iD: 0000-0002-5369-9009

Funding

National Institutes of Health (GM122975)

Molly Przeworski

National Institutes of Health (GM121372)

Molly Przeworski

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.