Association mapping from sequencing reads using k-mers

  1. Atif Rahman  Is a corresponding author
  2. Ingileif Hallgrímsdóttir
  3. Michael Eisen
  4. Lior Pachter  Is a corresponding author
  1. University of California, Berkeley, United States
  2. Howard Hughes Medical Institute, University of California, Berkeley, United States
19 figures, 4 tables and 1 additional file

Figures

Workflow for association mapping using k-mers.

The Hawk pipeline starts with sequencing reads from two sets of samples. The first step is to count k-mers in reads from each sample. Then k-mers with significantly different counts in two sets are detected. Finally, overlapping k-mers are assembled into sequences to get a sequence, shown side by side, for each associated locus. The sequences may correspond to a SNP (underlined) in which case corresponding sequence may be detected in the other group. This may not be the case for other kinds of variations such as copy number variation.

https://doi.org/10.7554/eLife.32920.002
Intersection analysis and comparison of powers of tests.

(a) Venn diagrams showing intersections among sequences obtained using Hawk and significant sites found by genotype calling. The percentage values shown are fractions of the sites found using one method not covered by those found by the other method. 80.3% of the sites overlapped with some sequence. Around 42% of sequences do not overlap with any such site which can be explained by more types of variants found by Hawk as well as more power of the test using Poisson compared to Multinomial distribution. (b) Fraction of runs found significant (after Bonferroni correction) by tests against minor allele frequency of the case samples (with that of the controls fixed at 0) are shown. The curves labeled multinomial and Poisson correspond to likelihood ratio test using multinomial distribution and Poisson distributions with different k-mer coverage.

https://doi.org/10.7554/eLife.32920.003
Breakdown of types of variations in comparison of YRI-TSI.

(a) Bars showing breakdown of 2,970,929 and 1,865,285 sequences enriched for in YRI and TSI samples respectively. The ‘Multiple SNPs/Structural’ entries correspond to sequences of length greater than 61, the maximum length of a sequence due to a single SNP with k-mer size of 31 and ‘SNPs’ correspond to sequences of maximum length of 61. (b) Numbers of sequences with alignments to hg38, RefSeq mRNAs and Ensembl exons and coding regions.

https://doi.org/10.7554/eLife.32920.005
Detection and correction for population stratification in YRI-TSI dataset.

(a) Plots of first two principal components for YRI and TSI individuals from the 1000 genomes project. The PCA was run on a binary matrix indicating presence or absence of 3,483,820 randomly chosen k-mers present in between 1% and 99% of the samples. The colors indicate population and sizes of circles are proportional to sequencing depth. (b) log10(adjusted p-values) are plotted against log10(unadjusted p-values) where adjusted p-values are calculated by fitting logistic regression models to predict population identity from k-mer counts adjusting for population stratification, total number of k-mers per sample and gender of individuals whereas unadjusted p-values are the p-values obtained using likelihood ratio test of k-mer counts assuming Poisson distributions.

https://doi.org/10.7554/eLife.32920.008
Manhattan plots for association mapping of ampicillin resistance in E.

coli using k-mers. Manhattan plots showing log10(adjusted p-values) of k-mers found significantly associated with ampicillin resistance and their start positions in (a) Escherichia coli strain DTU-1 genome and (b) plasmid pKBN10P04869A sequence. The vertical lines denote start positions of β-lactamase TEM-1 gene, the presence of which is known to confer resistance to ampicillin.

https://doi.org/10.7554/eLife.32920.009
Figure 5—source data 1

Source data for Figure 5a.

Source file contains k-mer sequences, corresponding p-values and their positions in Escherichia coli strain DTU-1 genome in SAM format.

https://doi.org/10.7554/eLife.32920.010
Figure 5—source data 2

Source data for Figure 5b.

Source file contains k-mer sequences, corresponding p-values and their positions in plasmid pKBN10P04869A sequence in SAM format.

https://doi.org/10.7554/eLife.32920.011
Appendix 1—figure 1
QQ plots of p-values.

QQ plots of p-values calculated using likelihood ratio test with Poisson and negative binomial distributions for (a) simulated data and (b) real data from the 1000 genomes project. The simulation was performed by sampling 200 values from a negative binomial distribution with number of failures fixed at 1 million and a uniformly chosen success probability between 0 and 0.01. Half of the samples were assigned to cases and others were assigned to controls. Then p-values were computed using likelihood ratio tests with Poisson and negative binomial distributions. The process was repeated 100,000 times and QQ plot was generated. Similarly p-values were computed for counts of 56,119 randomly chosen k-mers from YRI and TSI populations from the 1000 genomes project.

https://doi.org/10.7554/eLife.32920.014
Appendix 1—figure 2
Power for different k-mer coverages.

The figure shows power to detect a k-mer present in all case samples and no control sample against total k-mer coverage of cases using Bonferroni correction for different number of total tests for p-value=0.05.

https://doi.org/10.7554/eLife.32920.015
Appendix 1—figure 3
Power vs number of samples.

Figures show power to detect a k-mer at MAF = 0.16 and different odds ratios for per sample k-mer coverage of (a) five and (b) two after Bonferroni correction for 1 million tests.

https://doi.org/10.7554/eLife.32920.016
Appendix 1—figure 4
Comparison of powers of Poisson LRT and logistic regression based tests.

Figures show power to detect a k-mer by Poisson based likelihood ratio test, logistic regression based tests using k-mer counts and presence and absence of k-mers with number of allele copies in controls fixed at (a), (b) 0 and (c), (d) 1. Simulation was performed by randomly choosing case individuals with specified number of copies of an allele according to varying probability. k-mer counts were then generated according to Poisson distribution and detection of association was checked after Bonferroni correction for 1 million tests. The process was repeated 100 times for each set of parameters. Number of cases and number of controls were fixed at 100.

https://doi.org/10.7554/eLife.32920.017
Appendix 1—figure 5
Sensitivity with simulated E. coli data.

The figure shows sensitivity for varying number of case and control samples for different types of mutations. Sensitivity is defined as the percentage of differing nucleotides that are covered by a sequence. All of the sequences covered some location of mutation.

https://doi.org/10.7554/eLife.32920.018
Appendix 1—figure 6
QQ plot and cumulative distributions of p-values.

The figures show (a) QQ plot and (b) cumulative distributions of log10(p-values) for observed p-values in the comparison of YRI and TSI populations from the 1000 genomes project and p-values expected if the null is true.

https://doi.org/10.7554/eLife.32920.019
Appendix 1—figure 7
Breakdown of types of variations in BEB-TSI comparison.

(a) Bar plots showing breakdown of 529,287 and 462,122 sequences associated with BEB and TSI samples respectively. The ‘Multiple SNPs/Structural’ entries correspond to sequences of length greater than 61, the maximum length of a sequence due to a single SNP with k-mer size of 31 and ‘SNPs’ correspond to sequences of maximum length of 61. (b) Numbers of sequences with alignments to hg38, RefSeq mRNAs and Ensembl exons and coding regions.

https://doi.org/10.7554/eLife.32920.020
Appendix 1—figure 8
Histograms of sequence lengths in YRI-TSI comparison.

Figures show sections of histograms of lengths of sequences associated with (ac) YRI and (bd) TSI in comparison of YRI and TSI samples. Figures (a), (b) show peaks at 61, the maximum length corresponding to a single SNP with k-mer size of 31. Figures (c), (d) show drop off after 98 which is the maximum length corresponding to two close-by SNPs as 31-mers were assembled using a minimum overlap of 24.

https://doi.org/10.7554/eLife.32920.021
Appendix 1—figure 9
Histograms of sequence lengths in BEB-TSI comparison.

Figures show sections of histograms of lengths of sequences associated with (a, c) BEB and (b, d) TSI in comparison of BEB and TSI samples. Figures (a), (b) show peaks at 61, the maximum length corresponding to a single SNP with k-mer size of 31. Figures (c), (d) show drop off after 98 which is the maximum length corresponding to two close-by SNPs as 31-mers were assembled using a minimum overlap of 24.

https://doi.org/10.7554/eLife.32920.022
Appendix 1—figure 10
Distribution of the SNP rs1042034 alleles in 1000 genomes populations.

Plot obtained from the Geography of Genetic Variants Browser showing distribution of alleles at the SNP site rs1042034 revealing the ‘C’ allele is more prevalent in South, Southeast and East Asian populations compared to populations from other parts of the world.

https://doi.org/10.7554/eLife.32920.023
Appendix 1—figure 11
Detection and correction for population stratification in YRI-TSI dataset.

(a) Plots of first two principal components for YRI and TSI individuals where the PCA was run on variants called in the 1000 genomes project. QQ plots of ANOVA χ2 p-values of k-mer counts computed by running logistic regression (b) without adjustment, (c) adjusted using first two principal components obtained from k-mer counts, total number of k-mers and sex of individuals, (d) adjusted using only first two principal components.

https://doi.org/10.7554/eLife.32920.024
Appendix 1—figure 12
Detection of association after correcting for confounders.

A k-mer (allele) is considered to be present in a YRI individual with probability p and in a TSI individual with probability 1p and counts are simulated using total numbers of k-mers in the samples assuming Poisson distribution. The individuals with the k-mer are randomly assigned to cases according to a penetrance value and the rest are assigned to controls. A p-value is then computed using logistic regression to predict phenotype from the counts correcting for population stratification, total number of k-mers per sample and gender of individuals from the 1000 genomes data. The process is repeated 1000 times for a particular p and penetrance and fraction of runs where the p-value passed the Bonferroni threshold is plotted. The process is repeated for various probabilities and penetrance.

https://doi.org/10.7554/eLife.32920.025
Appendix 1—figure 13
QQ plot and cumulative distributions of p-values.

The figures show (a) QQ plot and (b) cumulative distributions of log10(p-values) for observed p-values in association mapping of ampicillin resistance in E. coli and p-values expected if the null is true.

https://doi.org/10.7554/eLife.32920.026
Appendix 1—figure 14
Manhattan plots for association mapping of ampicillin resistance in E. coli using conventional approach.

Manhattan plots showing log10(p-values) of SNPs and their positions in (a) Escherichia coli strain CFT073 genome and (b) plasmid pKBN10P04869A sequence. The vertical lines denote start positions of β-lactamase TEM-1 gene, the presence of which is known to confer resistance to ampicillin. The horizontal lines denote Bonferroni threshold of 0.05/361293.

https://doi.org/10.7554/eLife.32920.027

Tables

Table 1
Known variants in YRI-TSI comparison.

Table 1 shows p-values of sequences computed using likelihood ratio test at some well known sites of variation between populations. The (%) values denote fraction of individuals in the sample with the allele present. The p-values and % values are averaged over k-mers constituting the associated sequences.

https://doi.org/10.7554/eLife.32920.004
GeneSNP idDescriptionAllelep-value%YRI%TSI
ACKR1rs2814778Duffy antigenC9.72×1011484.39%1.78%
SLC24A5rs1426654Skin pigmentationG8.45×1014487.39%1.02%
SLC45A2rs16891982Skin/hair colorC1.89×1012292.18%4.67%
G6PDrs1050829G6PD deficiencyC1.53×102924.92%1.02%
G6PDrs1050828G6PD deficiencyT5.83×102518.32%0.00%
Table 2
Summary of sequences not in the human reference genome.

Table 2 shows summary of sequences associated with different populations that did not map to the human reference genome (hg38) or to the Epstein-Barr virus genome.

https://doi.org/10.7554/eLife.32920.006
PopulationPopulation compared toTotal no. sequencesNo. sequences with length1000bpTotal length in
sequences with
length1000 bp
No. sequences with length200bpTotal length in sequences with length200bp
YRITSI94,7954159,956478225,426
TSIYRI66,0511013,89618477,383
BEBTSI19,584338357533,954
TSIBEB18,508221058128,134
Table 3
Variants in genes linked to cardiovascular diseases.

Variants in genes linked to cardiovascular diseases found to be significantly more common in BEB samples compared to TSI samples. The (%) values denote fraction of individuals in the sample with the allele present. The p-values and % values are averaged over k-mers constituting the associated sequences.

https://doi.org/10.7554/eLife.32920.007
GeneSN idVariant typeAllelep-value%BEB%TSI
APOBrs2302515MissenseC1.30×101229.29%8.37%
APOBrs676210MissenseA7.73×102572.93%33.08%
APOBrs1042034MissenseC2.28×102368.67%31.91%
CYP11B2rs4545MissenseT1.31×102831.33%0.91%
CYP11B1rs4534MissenseT9.36×103633.00%0.91%
WNK4rs2290041MissenseT1.53×101413.24%0.47%
WNK4rs55781437MissenseT1.30×101215.21%0.91%
SLC12A3rs2289113MissenseT7.40×10138.14%0.00%
SCNN1Ars10849447MissenseC8.67×101262.88%39.92%
ABO-4 bp (CTGT) deletion-1.17×101329.15%10.55%
ABOrs8176741MissenseA2.06×101627.70%8.45%
SH2B3rs3184504MissenseC8.22×102392.88%63.87%
RAI1rs3803763MissenseC1.32×101275.86%51.17%
RAI1rs11649804MissenseA1.95×101981.57%52.79%
Appendix 1—table 1
Variants in Titin of differential prevalence in BEB-TSI comparison.

Variants in Titin, a gene linked to cardiovascular diseases, that were found to be significantly more common in BEB samples compared to TSI samples. The (%) values denote fraction of individuals in the sample with the allele present. The p-values and % values are averaged over k-mers constituting the associated sequences.

https://doi.org/10.7554/eLife.32920.028
GeneSNP idVariant typeAllelep-value%BEB%TSI
TTNrs9808377MissenseG1.70×101566.44%41.91%
TTNrs62621236MissenseG2.33×101627.70%5.72%
TTNrs2291311MissenseC1.06×101125.77%7.77%
TTNrs16866425MissenseC8.19×101221.73%2.73%
TTNrs4894048MissenseT2.00×102322.65%2.26%
TTNrs13398235Intron/missenseA2.04×101341.00%17.40%
TTNrs11888217Intron/missenseT4.18×101327.25%4.55%
TTNrs10164753MissenseT3.69×101328.48%6.19%
TTNrs10497520MissenseT1.66×102354.76%18.86%
TTNrs2627037MissenseA6.99×101325.06%4.72%
TTNrs1001238MissenseC1.66×101764.66%38.21%
TTNrs3731746MissenseA1.26×101450.72%30.21%
TTNrs17355446Intron/missenseA3.31×101115.44%1.11%
TTNrs2042996MissenseA1.03×101771.41%35.87%
TTNrs747122MissenseT1.59×101128.57%7.24%
TTNrs1560221SynonymousG1.11×102270.71%34.66%
TTNrs16866406MissenseA2.17×101235.60%17.51%
TTNrs4894028MissenseT2.58×101327.54%6.89%
TTN-InsertionT1.09×101234.11%8.59%
TTNrs3829747MissenseT4.72×101237.55%20.30%
TTNrs2291310MissenseC2.18×102036.63%8.04%
TTNrs2042995Intron/MissenseC7.83×101256.32%31.64%
TTNrs3829746MissenseC5.30×102975.94%37.60%
TTNrs744426MissenseA1.36×101337.15%18.92%

Additional files

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Atif Rahman
  2. Ingileif Hallgrímsdóttir
  3. Michael Eisen
  4. Lior Pachter
(2018)
Association mapping from sequencing reads using k-mers
eLife 7:e32920.
https://doi.org/10.7554/eLife.32920