A method for low-coverage single-gamete sequence analysis demonstrates adherence to Mendel’s first law across a large sample of human sperm

  1. Sara A Carioscia
  2. Kathryn J Weaver
  3. Andrew N Bortvin
  4. Hao Pan
  5. Daniel Ariad
  6. Avery Davis Bell
  7. Rajiv C McCoy  Is a corresponding author
  1. Department of Biology, Johns Hopkins University, United States
  2. School of Biological Sciences, Georgia Institute of Technology, United States
6 figures and 1 additional file

Figures

Schematic of the methods underlying rhapsodi (R haploid sperm/oocyte data imputation).

Low coverage data from individual gametes (A) is clustered to phase the diploid donor haplotypes (B). A Hidden Markov Model, with tunable rates of genotyping error and meiotic crossover, is applied …

Figure 2 with 10 supplements
Benchmarking performance across a range of study designs.

Values represent the average of three independent trials. FDR: False Discovery Rate; TPR: True Positive Rate. For phasing and imputation, gray indicates that no hetSNPs remained after downsampling. …

Figure 2—figure supplement 1
Generative model.

(I) The first step of the generative model builds the phased haplotypes of the diploid donor, with n hetSNPs. (II) In the second step, gamete genotypes are derived from the diploid donor haplotypes …

Figure 2—figure supplement 2
Benchmarking performance across a range of study designs - Additional Metrics.

Input data were created from the generative model and analyzed with rhapsodi. For all data in this figure, the genotyping error and recombination rates were matched between models. Each value is the …

Figure 2—figure supplement 3
Illustration of switch error rate (SER) and accuracy.

The top row shows the true haplotypes of a given chromosome and the bottom row shows the inferred haplotypes. Purple denotes alleles assigned to one haplotype and teal alleles to the other. The …

Figure 2—figure supplement 4
Automatic phasing window size calculation.

A beta regression model relating optimal phasing window size (represented as window size / of SNPs) to number of gametes, coverage, genotyping error rate, and recombination rate was fit on the …

Figure 2—figure supplement 5
Discovery of meiotic recombination events in simulated data reflecting characteristics of the Sperm-seq data.

(A) Breakpoint resolution stratified by depth of coverage. Resolution is scaled to base pairs assuming pairwise nucleotide diversity of 0.001 (i.e., one hetSNP per 1000 bp). The minimum possible …

Figure 2—figure supplement 6
Model robustness when genotyping error is underestimated.

Plotted values are the difference between the mean performance (across three independent trials) when the generative model and rhapsodi use the same parameters (Figure 2) and the mean performance …

Figure 2—figure supplement 7
Model robustness when recombination rate is underestimated.

Plotted values are the difference between the mean performance (across 3 independent trials) when the generative model and rhapsodi use the same parameters (Figure 2) and the mean performance …

Figure 2—figure supplement 8
Model robustness when genotyping error is overestimated.

Plotted values are the difference between the mean performance (across three independent trials) when the generative model and rhapsodi use the same parameters (Figure 2) and the mean performance …

Figure 2—figure supplement 9
Model robustness when recombination rate is overestimated.

Plotted values are the difference between the mean performance (across three independent trials) when the generative model and rhapsodi use the same parameters (Figure 2) and the mean performance …

Figure 2—figure supplement 10
Benchmarking rhapsodi runtime across a range of simulated data profiles.

We generated sperm-seq datasets with 100,000 SNPs and varied coverages and numbers of gametes. We measured runtime of rhapsodi analysis using the rhapsodi_autorun function. Reported time is in CPU …

Figure 3 with 4 supplements
Comparison of the performance of rhapsodi to Hapi, an existing gamete genotype imputation tool.

For each panel, we depict the difference in performance between the tools (rhapsodi minus Hapi). Each point represents a simulated dataset, and only datasets successfully analyzed by both tools are …

Figure 3—figure supplement 1
Comparison of the percentage of simulated datasets successfully analyzed by rhapsodi and Hapi across: (1) data import (gmt_import), (2) donor haplotype phasing (phasing), (3) gamete genotype imputation (imputation), and (4) meiotic recombination discovery (CO) stages.

The datasets were stratified into groups according to the number of gametes used in construction, signified by color. The run completion trend for each tool is signified by line-type.

Figure 3—figure supplement 2
Performance of rhapsodi on each simulated dataset, colored based on Hapi’s ability to successfully analyze each given data set.

False = Not successful; True = Successful. Each column is defined by the number of gametes used in data construction. (A) The relationship between phasing accuracy and completeness. (B) The …

Figure 3—figure supplement 3
Comparison of accuracy and completeness for rhapsodi phasing methods windowWardD2 and linkedSNPHapCUT2.

Datasets have varied coverages and gamete numbers, and all have 100,000 SNPs. Each panel shows the difference in performance between phasing methods (windowWardD2 minus linkedSNPHapCUT2) at a given …

Figure 3—figure supplement 4
Difference in runtime between windowWardD2 and linkedSNPHapCUT2 algorithms for phasing across a range of simulated data.

Datasets have varied coverages, gamete numbers, and numbers of SNPs. Figure shows both CPU time and wall time. Points above the y=0 line represent datasets for which the windowWardD2 method performs …

Figure 4 with 4 supplements
Simulation demonstrating power to detect deviations from binomial expectations across sample sizes of sperm, without (A) and with (B) multiple hypothesis testing correction.

For each combination of transmission rate and number of gametes, power was calculated based on 1000 independent simulations and assuming full knowledge of gamete genotypes. Panel A uses the standard …

Figure 4—figure supplement 1
Simulation demonstrating power to detect deviations from binomial expectations across sample sizes of sperm cells, without (A) and with (B) multiple hypothesis testing correction.

For each combination of transmission rate and number of gametes, power was calculated based on 1000 independent simulations and assuming full knowledge of gamete genotypes. Panel A uses the standard …

Figure 4—figure supplement 2
Simulated signature of transmission distortion.

We simulated 1000 gametes with 10,000 SNPs. We generated TD by choosing an allele at random and removing 30% of gametes which carried that allele. We simulated coverage of 0.01× and genotyping error …

Figure 4—figure supplement 3
Simulated signature of strong (k=0.99) transmission distortion.

We simulated data with 974 gametes, 79,630 SNPs, and 0.0075× coverage, corresponding to the data profile of donor NC26 chromosome 8. Zooming in to region surrounding the causal SNP (denoted with the …

Figure 4—figure supplement 4
Simulation demonstrating power to detect deviations from expectations using the transmission disequilibrium test (TDT), as applied to human pedigree studies, without (A) and with (B) multiple testing correction (alpha = 0.05 and 10-7, respectively).

The latter threshold is the standard for genome-wide significance (10-7) used in past studies (Meyer et al., 2012). The sample size refers to the number of informative transmissions, where each trio …

Figure 5 with 6 supplements
Manhattan plot displaying genome-wide TD scan results for all 25 sperm donors across the 22 autosomes.

P-values are correlated across large genomic intervals due to the high degree of linkage disequilibrium among sperm cells from a single donor. Colors distinguish results from different donors. No …

Figure 5—figure supplement 1
Workflow for application of rhapsodi to data from human sperm and downstream investigation of transmission distortion.

Raw data containing hetSNPs from each chromosome are filtered to limit spurious signal caused by sequencing error. The autorun function from rhapsodi is applied, which generates and exports data …

Figure 5—figure supplement 2
Inference of genetic similarity to reference samples from the 1000 Genomes Project.

We use a principal component analysis approach to investigate the genetic similarity of each donor with individuals from the 1000 Genomes Project (Ma and Amos, 2012). Donor individuals are shown in …

Figure 5—figure supplement 3
Example of evidence for a segmental duplication in donor NC17, chromosome 6.

(A) For a segment of the chromosome enriched for low uncorrected p-values (p-value < 1 × 10-8), we plot each hetSNP based on the number of sperm in which it was observed as the reference allele or …

Figure 5—figure supplement 4
Recombination map of inferred crossovers in the Sperm-seq data.

This map displays counts of the inferred number of crossovers within each 1 Mbp bin for each chromosome, pooling inference across the 25 donors. We used the midpoint of each crossover’s predicted …

Figure 5—figure supplement 5
Comparison of inferred crossover recombination map to published deCODE male-specific recombination map.

By binning the deCODE recombination map within each 1 Mbp bin for each chromosome, a weighted average recombination rate was calculated and compared to the number of Sperm-seq inferred crossovers …

Figure 5—figure supplement 6
Visualization of transmission rate of each allele within the pool of sperm, by donor.

Points are distributed around a transmission rate of 0.5 (shown as red line).

Quantifying global evidence of TD.

(A) Quantile-quantile plot comparing the distribution of p-values from our genome-wide scan for TD to that expected under the null hypothesis. The dashed line corresponds to x = y. (B) Mean allele …

Additional files

Download links