Association mapping from sequencing reads using k-mers

  1. Atif Rahman  Is a corresponding author
  2. Ingileif Hallgrímsdóttir
  3. Michael Eisen
  4. Lior Pachter  Is a corresponding author
  1. University of California, Berkeley, United States
  2. Howard Hughes Medical Institute, University of California, Berkeley, United States
19 figures, 4 tables and 1 additional file

Figures

Workflow for association mapping using k-mers.

The Hawk pipeline starts with sequencing reads from two sets of samples. The first step is to count k-mers in reads from each sample. Then k-mers with significantly different counts in two sets …

https://doi.org/10.7554/eLife.32920.002
Intersection analysis and comparison of powers of tests.

(a) Venn diagrams showing intersections among sequences obtained using Hawk and significant sites found by genotype calling. The percentage values shown are fractions of the sites found using one …

https://doi.org/10.7554/eLife.32920.003
Breakdown of types of variations in comparison of YRI-TSI.

(a) Bars showing breakdown of 2,970,929 and 1,865,285 sequences enriched for in YRI and TSI samples respectively. The ‘Multiple SNPs/Structural’ entries correspond to sequences of length greater …

https://doi.org/10.7554/eLife.32920.005
Detection and correction for population stratification in YRI-TSI dataset.

(a) Plots of first two principal components for YRI and TSI individuals from the 1000 genomes project. The PCA was run on a binary matrix indicating presence or absence of 3,483,820 randomly chosen k-mers present in between 1% and 99% of the samples. The colors indicate population and sizes of circles are proportional to sequencing depth. (b) log10(adjusted p-values) are plotted against log10(unadjusted p-values) where adjusted p-values are calculated by fitting logistic regression models to predict population identity from k-mer counts adjusting for population stratification, total number of k-mers per sample and gender of individuals whereas unadjusted p-values are the p-values obtained using likelihood ratio test of k-mer counts assuming Poisson distributions.

https://doi.org/10.7554/eLife.32920.008
Manhattan plots for association mapping of ampicillin resistance in E.

coli using k-mers. Manhattan plots showing log10(adjusted p-values) of k-mers found significantly associated with ampicillin resistance and their start positions in (a) Escherichia coli strain DTU-1 …

https://doi.org/10.7554/eLife.32920.009
Figure 5—source data 1

Source data for Figure 5a.

Source file contains k-mer sequences, corresponding p-values and their positions in Escherichia coli strain DTU-1 genome in SAM format.

https://doi.org/10.7554/eLife.32920.010
Figure 5—source data 2

Source data for Figure 5b.

Source file contains k-mer sequences, corresponding p-values and their positions in plasmid pKBN10P04869A sequence in SAM format.

https://doi.org/10.7554/eLife.32920.011
Appendix 1—figure 1
QQ plots of p-values.

QQ plots of p-values calculated using likelihood ratio test with Poisson and negative binomial distributions for (a) simulated data and (b) real data from the 1000 genomes project. The simulation …

https://doi.org/10.7554/eLife.32920.014
Appendix 1—figure 2
Power for different k-mer coverages.

The figure shows power to detect a k-mer present in all case samples and no control sample against total k-mer coverage of cases using Bonferroni correction for different number of total tests for …

https://doi.org/10.7554/eLife.32920.015
Appendix 1—figure 3
Power vs number of samples.

Figures show power to detect a k-mer at MAF = 0.16 and different odds ratios for per sample k-mer coverage of (a) five and (b) two after Bonferroni correction for 1 million tests.

https://doi.org/10.7554/eLife.32920.016
Appendix 1—figure 4
Comparison of powers of Poisson LRT and logistic regression based tests.

Figures show power to detect a k-mer by Poisson based likelihood ratio test, logistic regression based tests using k-mer counts and presence and absence of k-mers with number of allele copies in …

https://doi.org/10.7554/eLife.32920.017
Appendix 1—figure 5
Sensitivity with simulated E. coli data.

The figure shows sensitivity for varying number of case and control samples for different types of mutations. Sensitivity is defined as the percentage of differing nucleotides that are covered by a …

https://doi.org/10.7554/eLife.32920.018
Appendix 1—figure 6
QQ plot and cumulative distributions of p-values.

The figures show (a) QQ plot and (b) cumulative distributions of log10(p-values) for observed p-values in the comparison of YRI and TSI populations from the 1000 genomes project and p-values expected …

https://doi.org/10.7554/eLife.32920.019
Appendix 1—figure 7
Breakdown of types of variations in BEB-TSI comparison.

(a) Bar plots showing breakdown of 529,287 and 462,122 sequences associated with BEB and TSI samples respectively. The ‘Multiple SNPs/Structural’ entries correspond to sequences of length greater …

https://doi.org/10.7554/eLife.32920.020
Appendix 1—figure 8
Histograms of sequence lengths in YRI-TSI comparison.

Figures show sections of histograms of lengths of sequences associated with (ac) YRI and (bd) TSI in comparison of YRI and TSI samples. Figures (a), (b) show peaks at 61, the maximum length …

https://doi.org/10.7554/eLife.32920.021
Appendix 1—figure 9
Histograms of sequence lengths in BEB-TSI comparison.

Figures show sections of histograms of lengths of sequences associated with (a, c) BEB and (b, d) TSI in comparison of BEB and TSI samples. Figures (a), (b) show peaks at 61, the maximum length …

https://doi.org/10.7554/eLife.32920.022
Appendix 1—figure 10
Distribution of the SNP rs1042034 alleles in 1000 genomes populations.

Plot obtained from the Geography of Genetic Variants Browser showing distribution of alleles at the SNP site rs1042034 revealing the ‘C’ allele is more prevalent in South, Southeast and East Asian …

https://doi.org/10.7554/eLife.32920.023
Appendix 1—figure 11
Detection and correction for population stratification in YRI-TSI dataset.

(a) Plots of first two principal components for YRI and TSI individuals where the PCA was run on variants called in the 1000 genomes project. QQ plots of ANOVA χ2 p-values of k-mer counts computed …

https://doi.org/10.7554/eLife.32920.024
Appendix 1—figure 12
Detection of association after correcting for confounders.

A k-mer (allele) is considered to be present in a YRI individual with probability p and in a TSI individual with probability 1p and counts are simulated using total numbers of k-mers in the …

https://doi.org/10.7554/eLife.32920.025
Appendix 1—figure 13
QQ plot and cumulative distributions of p-values.

The figures show (a) QQ plot and (b) cumulative distributions of log10(p-values) for observed p-values in association mapping of ampicillin resistance in E. coli and p-values expected if the null is …

https://doi.org/10.7554/eLife.32920.026
Appendix 1—figure 14
Manhattan plots for association mapping of ampicillin resistance in E. coli using conventional approach.

Manhattan plots showing log10(p-values) of SNPs and their positions in (a) Escherichia coli strain CFT073 genome and (b) plasmid pKBN10P04869A sequence. The vertical lines denote start positions of β-l…

https://doi.org/10.7554/eLife.32920.027

Tables

Table 1
Known variants in YRI-TSI comparison.

Table 1 shows p-values of sequences computed using likelihood ratio test at some well known sites of variation between populations. The (%) values denote fraction of individuals in the sample with …

https://doi.org/10.7554/eLife.32920.004
GeneSNP idDescriptionAllelep-value%YRI%TSI
ACKR1rs2814778Duffy antigenC9.72×1011484.39%1.78%
SLC24A5rs1426654Skin pigmentationG8.45×1014487.39%1.02%
SLC45A2rs16891982Skin/hair colorC1.89×1012292.18%4.67%
G6PDrs1050829G6PD deficiencyC1.53×102924.92%1.02%
G6PDrs1050828G6PD deficiencyT5.83×102518.32%0.00%
Table 2
Summary of sequences not in the human reference genome.

Table 2 shows summary of sequences associated with different populations that did not map to the human reference genome (hg38) or to the Epstein-Barr virus genome.

https://doi.org/10.7554/eLife.32920.006
PopulationPopulation compared toTotal no. sequencesNo. sequences with length1000bpTotal length in
sequences with
length1000 bp
No. sequences with length200bpTotal length in sequences with length200bp
YRITSI94,7954159,956478225,426
TSIYRI66,0511013,89618477,383
BEBTSI19,584338357533,954
TSIBEB18,508221058128,134
Table 3
Variants in genes linked to cardiovascular diseases.

Variants in genes linked to cardiovascular diseases found to be significantly more common in BEB samples compared to TSI samples. The (%) values denote fraction of individuals in the sample with the …

https://doi.org/10.7554/eLife.32920.007
GeneSN idVariant typeAllelep-value%BEB%TSI
APOBrs2302515MissenseC1.30×101229.29%8.37%
APOBrs676210MissenseA7.73×102572.93%33.08%
APOBrs1042034MissenseC2.28×102368.67%31.91%
CYP11B2rs4545MissenseT1.31×102831.33%0.91%
CYP11B1rs4534MissenseT9.36×103633.00%0.91%
WNK4rs2290041MissenseT1.53×101413.24%0.47%
WNK4rs55781437MissenseT1.30×101215.21%0.91%
SLC12A3rs2289113MissenseT7.40×10138.14%0.00%
SCNN1Ars10849447MissenseC8.67×101262.88%39.92%
ABO-4 bp (CTGT) deletion-1.17×101329.15%10.55%
ABOrs8176741MissenseA2.06×101627.70%8.45%
SH2B3rs3184504MissenseC8.22×102392.88%63.87%
RAI1rs3803763MissenseC1.32×101275.86%51.17%
RAI1rs11649804MissenseA1.95×101981.57%52.79%
Appendix 1—table 1
Variants in Titin of differential prevalence in BEB-TSI comparison.

Variants in Titin, a gene linked to cardiovascular diseases, that were found to be significantly more common in BEB samples compared to TSI samples. The (%) values denote fraction of individuals in …

https://doi.org/10.7554/eLife.32920.028
GeneSNP idVariant typeAllelep-value%BEB%TSI
TTNrs9808377MissenseG1.70×101566.44%41.91%
TTNrs62621236MissenseG2.33×101627.70%5.72%
TTNrs2291311MissenseC1.06×101125.77%7.77%
TTNrs16866425MissenseC8.19×101221.73%2.73%
TTNrs4894048MissenseT2.00×102322.65%2.26%
TTNrs13398235Intron/missenseA2.04×101341.00%17.40%
TTNrs11888217Intron/missenseT4.18×101327.25%4.55%
TTNrs10164753MissenseT3.69×101328.48%6.19%
TTNrs10497520MissenseT1.66×102354.76%18.86%
TTNrs2627037MissenseA6.99×101325.06%4.72%
TTNrs1001238MissenseC1.66×101764.66%38.21%
TTNrs3731746MissenseA1.26×101450.72%30.21%
TTNrs17355446Intron/missenseA3.31×101115.44%1.11%
TTNrs2042996MissenseA1.03×101771.41%35.87%
TTNrs747122MissenseT1.59×101128.57%7.24%
TTNrs1560221SynonymousG1.11×102270.71%34.66%
TTNrs16866406MissenseA2.17×101235.60%17.51%
TTNrs4894028MissenseT2.58×101327.54%6.89%
TTN-InsertionT1.09×101234.11%8.59%
TTNrs3829747MissenseT4.72×101237.55%20.30%
TTNrs2291310MissenseC2.18×102036.63%8.04%
TTNrs2042995Intron/MissenseC7.83×101256.32%31.64%
TTNrs3829746MissenseC5.30×102975.94%37.60%
TTNrs744426MissenseA1.36×101337.15%18.92%

Additional files

Download links