Figures and data in Association mapping from sequencing reads using k-mers

19 figures, 4 tables and 1 additional file

Figures

Figure 1

Workflow for association mapping using $k$-mers.

The Hawk pipeline starts with sequencing reads from two sets of samples. The first step is to count $k$-mers in the reads from each sample. Then $k$-mers with significantly different counts in the two sets are detected. Finally, overlapping $k$-mers are assembled into sequences, shown side by side, giving one sequence for each associated locus. A sequence may correspond to a SNP (underlined), in which case a corresponding sequence may be detected in the other group. This may not be the case for other kinds of variation, such as copy number variation.

https://doi.org/10.7554/eLife.32920.002
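
The counting step of the Figure 1 workflow can be illustrated with a minimal sketch. The function name `count_kmers` and the toy reads are hypothetical, and the sketch ignores details a real pipeline would handle (reverse complements, ambiguous bases, memory-efficient counting), so it is not the pipeline's implementation.

```python
from collections import Counter


def count_kmers(reads, k=31):
    """Count every length-k substring across a collection of reads.

    Illustrative only: strands are not merged and reads shorter than k
    contribute nothing.
    """
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy usage with a small k so the result is easy to inspect.
toy_counts = count_kmers(["ACGTAC", "CGTA"], k=3)
```

In this toy case `toy_counts["CGT"]` is 2, because the 3-mer occurs once in each read.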

Figure 2

Intersection analysis and comparison of powers of tests.

(a) Venn diagrams showing intersections among sequences obtained using Hawk and significant sites found by genotype calling. The percentage values shown are the fractions of the sites found using one method that are not covered by those found by the other method. $80.3\%$ of the sites overlapped with some sequence. Around $42\%$ of sequences do not overlap with any such site, which can be explained by the wider range of variant types found by Hawk as well as the greater power of the test using the Poisson distribution compared to the multinomial distribution. (b) The fraction of runs found significant (after Bonferroni correction) by each test is plotted against the minor allele frequency of the case samples (with that of the controls fixed at 0). The curves labeled multinomial and Poisson correspond to likelihood ratio tests using the multinomial distribution and Poisson distributions with different $k$-mer coverage.

https://doi.org/10.7554/eLife.32920.003
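
A Poisson likelihood ratio test of the kind named in the Figure 2 caption could be sketched as follows. This is one possible formulation, not necessarily the authors' exact one. It assumes the null hypothesis is a single per-$k$-mer rate shared by cases and controls, the alternative is a separate rate for each group, and the statistic is referred to a chi-squared distribution with one degree of freedom. The function and parameter names are hypothetical.

```python
import math


def poisson_lrt_pvalue(k_case, k_ctrl, n_case, n_ctrl):
    """P-value for a difference in k-mer rate between two groups.

    k_case, k_ctrl: total counts of one k-mer in cases and in controls.
    n_case, n_ctrl: total k-mer counts in each group (must be positive).
    Assumed model: counts are Poisson with mean rate * n.
    """
    # Maximum likelihood rate under the null of a shared rate.
    theta = (k_case + k_ctrl) / (n_case + n_ctrl)

    def term(k, n):
        # Contribution k * ln(k / (theta * n)), using 0 * ln(0) = 0.
        return 0.0 if k == 0 else k * math.log(k / (theta * n))

    stat = max(2.0 * (term(k_case, n_case) + term(k_ctrl, n_ctrl)), 0.0)
    # Survival function of chi-squared with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))
```

Equal rates in both groups give a p-value of 1, while a $k$-mer seen only in one group gives a p-value that shrinks as its count grows.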

Figure 3

Breakdown of types of variations in comparison of YRI-TSI.

(a) Bars showing the breakdown of 2,970,929 and 1,865,285 sequences enriched in YRI and TSI samples, respectively. The ‘Multiple SNPs/Structural’ entries correspond to sequences of length greater than 61, the maximum length of a sequence due to a single SNP with a $k$-mer size of 31, and ‘SNPs’ correspond to sequences with a maximum length of 61. (b) Numbers of sequences with alignments to hg38, RefSeq mRNAs and Ensembl exons and coding regions.

https://doi.org/10.7554/eLife.32920.005

Figure 4

Detection and correction for population stratification in YRI-TSI dataset.

(a) Plots of the first two principal components for YRI and TSI individuals from the 1000 genomes project. The PCA was run on a binary matrix indicating presence or absence of 3,483,820 randomly chosen $k$-mers present in between 1% and 99% of the samples. The colors indicate population and the sizes of the circles are proportional to sequencing depth. (b) $-\log_{10}$ (adjusted p-values) are plotted against $-\log_{10}$ (unadjusted p-values). Adjusted p-values are calculated by fitting logistic regression models to predict population identity from $k$-mer counts, adjusting for population stratification, total number of $k$-mers per sample and gender of individuals. Unadjusted p-values are obtained using the likelihood ratio test of $k$-mer counts assuming Poisson distributions.

https://doi.org/10.7554/eLife.32920.008
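
The binary presence/absence matrix described for Figure 4a could be built along the following lines. This is a sketch under assumptions: the function name `presence_matrix` and the input layout (one dictionary of $k$-mer counts per sample) are hypothetical, the 1% and 99% bounds are treated as inclusive, and the random subsampling of $k$-mers and the PCA itself are not shown.

```python
def presence_matrix(counts_by_sample, min_frac=0.01, max_frac=0.99):
    """Build a samples-by-k-mers 0/1 matrix, keeping informative k-mers.

    counts_by_sample: list of dicts mapping k-mer -> count, one per sample.
    A k-mer is kept when the fraction of samples containing it lies in
    [min_frac, max_frac] (inclusive bounds are an assumption).
    """
    n_samples = len(counts_by_sample)
    all_kmers = sorted(set().union(*counts_by_sample))

    kept = []
    for kmer in all_kmers:
        # A count of zero is treated the same as a missing k-mer.
        present = sum(1 for c in counts_by_sample if c.get(kmer, 0) > 0)
        if min_frac <= present / n_samples <= max_frac:
            kept.append(kmer)

    matrix = [
        [1 if c.get(kmer, 0) > 0 else 0 for kmer in kept]
        for c in counts_by_sample
    ]
    return kept, matrix
```

The returned matrix has one row per sample and one column per retained $k$-mer, which is the layout a PCA routine would typically expect.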

Figure 5

Manhattan plots for association mapping of ampicillin resistance in *E. coli* using $k$-mers.

Manhattan plots showing $-\log_{10}$ (adjusted p-values) of $k$-mers found significantly associated with ampicillin resistance and their start positions in (a) the *Escherichia coli* strain DTU-1 genome and (b) the plasmid pKBN10P04869A sequence. The vertical lines denote start positions of the $\beta$-lactamase TEM-1 gene, the presence of which is known to confer resistance to ampicillin.

https://doi.org/10.7554/eLife.32920.009

Figure 5—source data 1

Source data for Figure 5a. Source file contains $k$-mer sequences, corresponding p-values and their positions in the *Escherichia coli* strain DTU-1 genome in SAM format.

https://doi.org/10.7554/eLife.32920.010

Download elife-32920-fig5-data1-v2.txt

Figure 5—source data 2

Source data for Figure 5b. Source file contains $k$-mer sequences, corresponding p-values and their positions in the plasmid pKBN10P04869A sequence in SAM format.

https://doi.org/10.7554/eLife.32920.011

Download elife-32920-fig5-data2-v2.txt

Appendix 1—figure 1

QQ plots of p-values.

QQ plots of p-values calculated using the likelihood ratio test with Poisson and negative binomial distributions for (a) simulated data and (b) real data from the 1000 genomes project. The simulation was performed by sampling $200$ values from a negative binomial distribution with the number of failures fixed at $1$ million and a success probability chosen uniformly between $0$ and $0.01$. Half of the samples were assigned to cases and the others to controls. Then p-values were computed using likelihood ratio tests with Poisson and negative binomial distributions. The process was repeated $100{,}000$ times and a QQ plot was generated. Similarly, p-values were computed for counts of $56{,}119$ randomly chosen $k$-mers from YRI and TSI populations from the 1000 genomes project.

https://doi.org/10.7554/eLife.32920.014

Appendix 1—figure 2

Power for different $k$-mer coverages.

The figure shows the power to detect a $k$-mer present in all case samples and in no control sample, plotted against the total $k$-mer coverage of the cases, using Bonferroni correction for different total numbers of tests at a p-value threshold of 0.05.

https://doi.org/10.7554/eLife.32920.015
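
A rough version of the power calculation behind Appendix 1—figure 2 can be sketched under simplifying assumptions that are not taken from the source: cases and controls have equal total $k$-mer counts, the control count is exactly 0, the case count is Poisson with mean equal to the total case coverage, and the test is a Poisson likelihood ratio test whose statistic then reduces to $2K\ln 2$ for a case count $K$. The function names are hypothetical.

```python
import math


def min_significant_count(num_tests, alpha=0.05):
    """Smallest case count K whose p-value beats the Bonferroni threshold.

    With a control count of 0 and equal group totals, the assumed test
    gives p = erfc(sqrt(K * ln 2)).
    """
    threshold = alpha / num_tests
    k = 1
    while math.erfc(math.sqrt(k * math.log(2))) >= threshold:
        k += 1
    return k


def power(total_case_coverage, num_tests, alpha=0.05):
    """P(K >= K_min) for K ~ Poisson(total_case_coverage), coverage > 0."""
    k_min = min_significant_count(num_tests, alpha)
    c = total_case_coverage
    # Poisson CDF up to k_min - 1, computed in log space for stability.
    cdf = sum(
        math.exp(-c + k * math.log(c) - math.lgamma(k + 1))
        for k in range(k_min)
    )
    return max(0.0, 1.0 - cdf)
```

Under these assumptions the power rises from near 0 to near 1 as the total case coverage grows, and a larger number of tests shifts the curve toward higher coverage.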

Appendix 1—figure 3

Power vs number of samples.

Figures show power to detect a $k$-mer at MAF = 0.16 and different odds ratios for per-sample $k$-mer coverage of (a) five and (b) two after Bonferroni correction for 1 million tests.

https://doi.org/10.7554/eLife.32920.016

Appendix 1—figure 4

Comparison of powers of Poisson LRT and logistic regression based tests.

Figures show power to detect a $k$-mer by the Poisson-based likelihood ratio test and by logistic regression-based tests using $k$-mer counts and presence or absence of $k$-mers, with the number of allele copies in controls fixed at (a), (b) 0 and (c), (d) 1. The simulation was performed by randomly choosing case individuals with a specified number of copies of an allele according to varying probability. $k$-mer counts were then generated according to a Poisson distribution and detection of association was checked after Bonferroni correction for 1 million tests. The process was repeated 100 times for each set of parameters. The number of cases and the number of controls were fixed at 100.

https://doi.org/10.7554/eLife.32920.017

Appendix 1—figure 5

Sensitivity with simulated *E. coli* data.

The figure shows sensitivity for varying number of case and control samples for different types of mutations. Sensitivity is defined as the percentage of differing nucleotides that are covered by a sequence. All of the sequences covered some location of mutation.

https://doi.org/10.7554/eLife.32920.018

Appendix 1—figure 6

QQ plot and cumulative distributions of p-values.

The figures show (a) QQ plot and (b) cumulative distributions of $-\log_{10}$ (p-values) for observed p-values in the comparison of YRI and TSI populations from the 1000 genomes project and p-values expected if the null is true.

https://doi.org/10.7554/eLife.32920.019

Appendix 1—figure 7

Breakdown of types of variations in BEB-TSI comparison.

(a) Bar plots showing the breakdown of 529,287 and 462,122 sequences associated with BEB and TSI samples, respectively. The ‘Multiple SNPs/Structural’ entries correspond to sequences of length greater than 61, the maximum length of a sequence due to a single SNP with a $k$-mer size of 31, and ‘SNPs’ correspond to sequences with a maximum length of 61. (b) Numbers of sequences with alignments to hg38, RefSeq mRNAs and Ensembl exons and coding regions.

https://doi.org/10.7554/eLife.32920.020

Appendix 1—figure 8

Histograms of sequence lengths in YRI-TSI comparison.

Figures show sections of histograms of lengths of sequences associated with (a, c) YRI and (b, d) TSI in the comparison of YRI and TSI samples. Figures (a), (b) show peaks at 61, the maximum length corresponding to a single SNP with a $k$-mer size of 31. Figures (c), (d) show a drop-off after 98, which is the maximum length corresponding to two close-by SNPs, as 31-mers were assembled using a minimum overlap of 24.

https://doi.org/10.7554/eLife.32920.021
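
The lengths 61 and 98 quoted for Appendix 1—figure 8 can be recovered with simple arithmetic. The reasoning below is a reconstruction, not taken from the source: the $k$-mers covering one SNP span $2k - 1$ bases, and two SNPs end up in one assembled sequence only if the last $k$-mer covering the first SNP and the first $k$-mer covering the second SNP overlap by at least the minimum overlap. The function names are hypothetical.

```python
def max_len_single_snp(k):
    """Span of all k-mers that cover one variant position: 2k - 1 bases."""
    return 2 * k - 1


def max_len_two_snps(k, min_overlap):
    """Longest sequence joining two SNPs under the assumed overlap rule.

    The k-mer starting at the first SNP and the k-mer ending at the second
    SNP must share at least min_overlap bases, so the SNPs can be at most
    (k - 1) + (k - min_overlap) positions apart.
    """
    max_distance = (k - 1) + (k - min_overlap)
    return max_len_single_snp(k) + max_distance


single = max_len_single_snp(31)
double = max_len_two_snps(31, 24)
```

With $k = 31$ and a minimum overlap of 24 this gives 61 and 98, matching the peak and the drop-off described in the caption.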

Appendix 1—figure 9

Histograms of sequence lengths in BEB-TSI comparison.

Figures show sections of histograms of lengths of sequences associated with (a, c) BEB and (b, d) TSI in the comparison of BEB and TSI samples. Figures (a), (b) show peaks at 61, the maximum length corresponding to a single SNP with a $k$-mer size of 31. Figures (c), (d) show a drop-off after 98, which is the maximum length corresponding to two close-by SNPs, as 31-mers were assembled using a minimum overlap of 24.

https://doi.org/10.7554/eLife.32920.022

Appendix 1—figure 10

Distribution of the SNP rs1042034 alleles in 1000 genomes populations.

Plot obtained from the Geography of Genetic Variants Browser showing distribution of alleles at the SNP site rs1042034 revealing the ‘C’ allele is more prevalent in South, Southeast and East Asian populations compared to populations from other parts of the world.

https://doi.org/10.7554/eLife.32920.023

Appendix 1—figure 11

Appendix 1—figure 12

Detection of association after correcting for confounders.

A $k$-mer (allele) is considered to be present in a YRI individual with probability $p$ and in a TSI individual with probability $1 - p$, and counts are simulated from the total numbers of $k$-mers in the samples assuming a Poisson distribution. The individuals with the $k$-mer are randomly assigned to cases according to a penetrance value and the rest are assigned to controls. A p-value is then computed using logistic regression to predict phenotype from the counts, correcting for population stratification, total number of $k$-mers per sample and gender of individuals from the 1000 genomes data. The process is repeated 1000 times for a particular $p$ and penetrance, and the fraction of runs where the p-value passed the Bonferroni threshold is plotted. The process is repeated for various probabilities and penetrance values.

https://doi.org/10.7554/eLife.32920.025

Appendix 1—figure 13

Appendix 1—figure 14

Manhattan plots for association mapping of ampicillin resistance in *E. coli* using conventional approach.

Manhattan plots showing $-\log_{10}$ (p-values) of SNPs and their positions in (a) the *Escherichia coli* strain CFT073 genome and (b) the plasmid pKBN10P04869A sequence. The vertical lines denote start positions of the $\beta$-lactamase TEM-1 gene, the presence of which is known to confer resistance to ampicillin. The horizontal lines denote the Bonferroni threshold of $0.05 / 361293$.

https://doi.org/10.7554/eLife.32920.027
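
The height of the horizontal Bonferroni line in Appendix 1—figure 14 follows from the stated threshold. The short sketch below only evaluates that arithmetic. The function name is hypothetical, and reading 361293 as the number of tests is an assumption based on the form of the threshold.

```python
import math


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / num_tests


threshold = bonferroni_threshold(0.05, 361293)
# Position of the threshold on a -log10(p-value) axis.
line_height = -math.log10(threshold)
```

Any SNP whose $-\log_{10}$ (p-value) exceeds `line_height` lies above the horizontal line in the plot.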

Tables

Table 1

Known variants in YRI-TSI comparison.

Table 1 shows p-values of sequences computed using the likelihood ratio test at some well-known sites of variation between populations. The (%) values denote the fraction of individuals in the sample with the allele present. The p-values and % values are averaged over the $k$-mers constituting the associated sequences.

https://doi.org/10.7554/eLife.32920.004

Gene	SNP id	Description	Allele	p-value	%YRI	%TSI
ACKR1	rs2814778	Duffy antigen	C	$9.72 \times 10^{-114}$	84.39%	1.78%
SLC24A5	rs1426654	Skin pigmentation	G	$8.45 \times 10^{-144}$	87.39%	1.02%
SLC45A2	rs16891982	Skin/hair color	C	$1.89 \times 10^{-122}$	92.18%	4.67%
G6PD	rs1050829	G6PD deficiency	C	$1.53 \times 10^{-29}$	24.92%	1.02%
G6PD	rs1050828	G6PD deficiency	T	$5.83 \times 10^{-25}$	18.32%	0.00%

Table 2

Summary of sequences not in the human reference genome.

Table 2 shows summary of sequences associated with different populations that did not map to the human reference genome (hg38) or to the Epstein-Barr virus genome.

https://doi.org/10.7554/eLife.32920.006

Population	Population compared to	Total no. sequences	No. sequences with length $\geq$ 1000 bp	Total length in sequences with length $\geq$ 1000 bp	No. sequences with length $\geq$ 200 bp	Total length in sequences with length $\geq$ 200 bp
YRI	TSI	94,795	41	59,956	478	225,426
TSI	YRI	66,051	10	13,896	184	77,383
BEB	TSI	19,584	3	3,835	75	33,954
TSI	BEB	18,508	2	2,105	81	28,134

Table 3

Variants in genes linked to cardiovascular diseases.

Variants in genes linked to cardiovascular diseases found to be significantly more common in BEB samples compared to TSI samples. The (%) values denote the fraction of individuals in the sample with the allele present. The p-values and % values are averaged over the $k$-mers constituting the associated sequences.

https://doi.org/10.7554/eLife.32920.007

Gene	SNP id	Variant type	Allele	p-value	%BEB	%TSI
APOB	rs2302515	Missense	C	$1.30 \times 10^{-12}$	29.29%	8.37%
APOB	rs676210	Missense	A	$7.73 \times 10^{-25}$	72.93%	33.08%
APOB	rs1042034	Missense	C	$2.28 \times 10^{-23}$	68.67%	31.91%
CYP11B2	rs4545	Missense	T	$1.31 \times 10^{-28}$	31.33%	0.91%
CYP11B1	rs4534	Missense	T	$9.36 \times 10^{-36}$	33.00%	0.91%
WNK4	rs2290041	Missense	T	$1.53 \times 10^{-14}$	13.24%	0.47%
WNK4	rs55781437	Missense	T	$1.30 \times 10^{-12}$	15.21%	0.91%
SLC12A3	rs2289113	Missense	T	$7.40 \times 10^{-13}$	8.14%	0.00%
SCNN1A	rs10849447	Missense	C	$8.67 \times 10^{-12}$	62.88%	39.92%
ABO	-	4 bp (CTGT) deletion	-	$1.17 \times 10^{-13}$	29.15%	10.55%
ABO	rs8176741	Missense	A	$2.06 \times 10^{-16}$	27.70%	8.45%
SH2B3	rs3184504	Missense	C	$8.22 \times 10^{-23}$	92.88%	63.87%
RAI1	rs3803763	Missense	C	$1.32 \times 10^{-12}$	75.86%	51.17%
RAI1	rs11649804	Missense	A	$1.95 \times 10^{-19}$	81.57%	52.79%

Appendix 1—table 1

Variants in Titin of differential prevalence in BEB-TSI comparison.

Variants in Titin, a gene linked to cardiovascular diseases, that were found to be significantly more common in BEB samples compared to TSI samples. The (%) values denote the fraction of individuals in the sample with the allele present. The p-values and % values are averaged over the $k$-mers constituting the associated sequences.

https://doi.org/10.7554/eLife.32920.028

Gene	SNP id	Variant type	Allele	p-value	%BEB	%TSI
TTN	rs9808377	Missense	G	$1.70 \times 10^{-15}$	66.44%	41.91%
TTN	rs62621236	Missense	G	$2.33 \times 10^{-16}$	27.70%	5.72%
TTN	rs2291311	Missense	C	$1.06 \times 10^{-11}$	25.77%	7.77%
TTN	rs16866425	Missense	C	$8.19 \times 10^{-12}$	21.73%	2.73%
TTN	rs4894048	Missense	T	$2.00 \times 10^{-23}$	22.65%	2.26%
TTN	rs13398235	Intron/missense	A	$2.04 \times 10^{-13}$	41.00%	17.40%
TTN	rs11888217	Intron/missense	T	$4.18 \times 10^{-13}$	27.25%	4.55%
TTN	rs10164753	Missense	T	$3.69 \times 10^{-13}$	28.48%	6.19%
TTN	rs10497520	Missense	T	$1.66 \times 10^{-23}$	54.76%	18.86%
TTN	rs2627037	Missense	A	$6.99 \times 10^{-13}$	25.06%	4.72%
TTN	rs1001238	Missense	C	$1.66 \times 10^{-17}$	64.66%	38.21%
TTN	rs3731746	Missense	A	$1.26 \times 10^{-14}$	50.72%	30.21%
TTN	rs17355446	Intron/missense	A	$3.31 \times 10^{-11}$	15.44%	1.11%
TTN	rs2042996	Missense	A	$1.03 \times 10^{-17}$	71.41%	35.87%
TTN	rs747122	Missense	T	$1.59 \times 10^{-11}$	28.57%	7.24%
TTN	rs1560221$^{*}$	Synonymous	G	$1.11 \times 10^{-22}$	70.71%	34.66%
TTN	rs16866406	Missense	A	$2.17 \times 10^{-12}$	35.60%	17.51%
TTN	rs4894028	Missense	T	$2.58 \times 10^{-13}$	27.54%	6.89%
TTN	-	Insertion	T	$1.09 \times 10^{-12}$	34.11%	8.59%
TTN	rs3829747	Missense	T	$4.72 \times 10^{-12}$	37.55%	20.30%
TTN	rs2291310	Missense	C	$2.18 \times 10^{-20}$	36.63%	8.04%
TTN	rs2042995	Intron/missense	C	$7.83 \times 10^{-12}$	56.32%	31.64%
TTN	rs3829746	Missense	C	$5.30 \times 10^{-29}$	75.94%	37.60%
TTN	rs744426	Missense	A	$1.36 \times 10^{-13}$	37.15%	18.92%