The Hawk pipeline starts with sequencing reads from two sets of samples. The first step is to count -mers in reads from each sample. Then -mers with significantly different counts in two sets …
(a) Venn diagrams showing intersections among sequences obtained using Hawk and significant sites found by genotype calling. The percentage values shown are fractions of the sites found using one …
(a) Bars showing breakdown of 2,970,929 and 1,865,285 sequences enriched for in YRI and TSI samples respectively. The ‘Multiple SNPs/Structural’ entries correspond to sequences of length greater …
(a) Plots of first two principal components for YRI and TSI individuals from the 1000 genomes project. The PCA was run on a binary matrix indicating presence or absence of 3,483,820 randomly chosen -mers present in between 1% and 99% of the samples. The colors indicate population and sizes of circles are proportional to sequencing depth. (b) (adjusted p-values) are plotted against (unadjusted p-values) where adjusted p-values are calculated by fitting logistic regression models to predict population identity from -mer counts adjusting for population stratification, total number of -mers per sample and gender of individuals whereas unadjusted p-values are the p-values obtained using likelihood ratio test of -mer counts assuming Poisson distributions.
coli using -mers. Manhattan plots showing (adjusted p-values) of -mers found significantly associated with ampicillin resistance and their start positions in (a) Escherichia coli strain DTU-1 …
Source data for Figure 5a.
Source file contains -mer sequences, corresponding p-values and their positions in Escherichia coli strain DTU-1 genome in SAM format.
Source data for Figure 5b.
Source file contains -mer sequences, corresponding p-values and their positions in plasmid pKBN10P04869A sequence in SAM format.
QQ plots of p-values calculated using likelihood ratio test with Poisson and negative binomial distributions for (a) simulated data and (b) real data from the 1000 genomes project. The simulation …
The figure shows power to detect a -mer present in all case samples and no control sample against total -mer coverage of cases using Bonferroni correction for different number of total tests for …
Figures show power to detect a -mer at MAF = 0.16 and different odds ratios for per sample -mer coverage of (a) five and (b) two after Bonferroni correction for 1 million tests.
Figures show power to detect a -mer by Poisson based likelihood ratio test, logistic regression based tests using -mer counts and presence and absence of -mers with number of allele copies in …
The figure shows sensitivity for varying number of case and control samples for different types of mutations. Sensitivity is defined as the percentage of differing nucleotides that are covered by a …
The figures show (a) QQ plot and (b) cumulative distributions of (p-values) for observed p-values in the comparison of YRI and TSI populations from the 1000 genomes project and p-values expected …
(a) Bar plots showing breakdown of 529,287 and 462,122 sequences associated with BEB and TSI samples respectively. The ‘Multiple SNPs/Structural’ entries correspond to sequences of length greater …
Figures show sections of histograms of lengths of sequences associated with (a, c) YRI and (b, d) TSI in comparison of YRI and TSI samples. Figures (a), (b) show peaks at 61, the maximum length …
Figures show sections of histograms of lengths of sequences associated with (a, c) BEB and (b, d) TSI in comparison of BEB and TSI samples. Figures (a), (b) show peaks at 61, the maximum length …
Plot obtained from the Geography of Genetic Variants Browser showing distribution of alleles at the SNP site rs1042034 revealing the ‘C’ allele is more prevalent in South, Southeast and East Asian …
(a) Plots of first two principal components for YRI and TSI individuals where the PCA was run on variants called in the 1000 genomes project. QQ plots of ANOVA p-values of -mer counts computed …
A -mer (allele) is considered to be present in a YRI individual with probability and in a TSI individual with probability and counts are simulated using total numbers of -mers in the …
The figures show (a) QQ plot and (b) cumulative distributions of (p-values) for observed p-values in association mapping of ampicillin resistance in E. coli and p-values expected if the null is …
Manhattan plots showing (p-values) of SNPs and their positions in (a) Escherichia coli strain CFT073 genome and (b) plasmid pKBN10P04869A sequence. The vertical lines denote start positions of -l…
Table 1 shows p-values of sequences computed using likelihood ratio test at some well known sites of variation between populations. The (%) values denote fraction of individuals in the sample with …
Gene | SNP id | Description | Allele | p-value | %YRI | %TSI |
---|---|---|---|---|---|---|
ACKR1 | rs2814778 | Duffy antigen | C | 9.72 | 84.39% | 1.78% |
SLC24A5 | rs1426654 | Skin pigmentation | G | 8.45 | 87.39% | 1.02% |
SLC45A2 | rs16891982 | Skin/hair color | C | 1.89 | 92.18% | 4.67% |
G6PD | rs1050829 | G6PD deficiency | C | 1.53 | 24.92% | 1.02% |
G6PD | rs1050828 | G6PD deficiency | T | 5.83 | 18.32% | 0.00% |
Table 2 shows summary of sequences associated with different populations that did not map to the human reference genome (hg38) or to the Epstein-Barr virus genome.
Population | Population compared to | Total no. sequences | No. sequences with length1000bp | Total length in sequences with length1000 bp | No. sequences with length200bp | Total length in sequences with length200bp |
---|---|---|---|---|---|---|
YRI | TSI | 94,795 | 41 | 59,956 | 478 | 225,426 |
TSI | YRI | 66,051 | 10 | 13,896 | 184 | 77,383 |
BEB | TSI | 19,584 | 3 | 3835 | 75 | 33,954 |
TSI | BEB | 18,508 | 2 | 2105 | 81 | 28,134 |
Variants in genes linked to cardiovascular diseases found to be significantly more common in BEB samples compared to TSI samples. The (%) values denote fraction of individuals in the sample with the …
Gene | SN id | Variant type | Allele | p-value | %BEB | %TSI |
---|---|---|---|---|---|---|
APOB | rs2302515 | Missense | C | 1.30 | 29.29% | 8.37% |
APOB | rs676210 | Missense | A | 7.73 | 72.93% | 33.08% |
APOB | rs1042034 | Missense | C | 2.28 | 68.67% | 31.91% |
CYP11B2 | rs4545 | Missense | T | 1.31 | 31.33% | 0.91% |
CYP11B1 | rs4534 | Missense | T | 9.36 | 33.00% | 0.91% |
WNK4 | rs2290041 | Missense | T | 1.53 | 13.24% | 0.47% |
WNK4 | rs55781437 | Missense | T | 1.30 | 15.21% | 0.91% |
SLC12A3 | rs2289113 | Missense | T | 7.40 | 8.14% | 0.00% |
SCNN1A | rs10849447 | Missense | C | 8.67 | 62.88% | 39.92% |
ABO | - | 4 bp (CTGT) deletion | - | 1.17 | 29.15% | 10.55% |
ABO | rs8176741 | Missense | A | 2.06 | 27.70% | 8.45% |
SH2B3 | rs3184504 | Missense | C | 8.22 | 92.88% | 63.87% |
RAI1 | rs3803763 | Missense | C | 1.32 | 75.86% | 51.17% |
RAI1 | rs11649804 | Missense | A | 1.95 | 81.57% | 52.79% |
Variants in Titin, a gene linked to cardiovascular diseases, that were found to be significantly more common in BEB samples compared to TSI samples. The (%) values denote fraction of individuals in …
Gene | SNP id | Variant type | Allele | p-value | %BEB | %TSI |
---|---|---|---|---|---|---|
TTN | rs9808377 | Missense | G | 1.70 | 66.44% | 41.91% |
TTN | rs62621236 | Missense | G | 2.33 | 27.70% | 5.72% |
TTN | rs2291311 | Missense | C | 1.06 | 25.77% | 7.77% |
TTN | rs16866425 | Missense | C | 8.19 | 21.73% | 2.73% |
TTN | rs4894048 | Missense | T | 2.00 | 22.65% | 2.26% |
TTN | rs13398235 | Intron/missense | A | 2.04 | 41.00% | 17.40% |
TTN | rs11888217 | Intron/missense | T | 4.18 | 27.25% | 4.55% |
TTN | rs10164753 | Missense | T | 3.69 | 28.48% | 6.19% |
TTN | rs10497520 | Missense | T | 1.66 | 54.76% | 18.86% |
TTN | rs2627037 | Missense | A | 6.99 | 25.06% | 4.72% |
TTN | rs1001238 | Missense | C | 1.66 | 64.66% | 38.21% |
TTN | rs3731746 | Missense | A | 1.26 | 50.72% | 30.21% |
TTN | rs17355446 | Intron/missense | A | 3.31 | 15.44% | 1.11% |
TTN | rs2042996 | Missense | A | 1.03 | 71.41% | 35.87% |
TTN | rs747122 | Missense | T | 1.59 | 28.57% | 7.24% |
TTN | rs1560221 | Synonymous | G | 1.11 | 70.71% | 34.66% |
TTN | rs16866406 | Missense | A | 2.17 | 35.60% | 17.51% |
TTN | rs4894028 | Missense | T | 2.58 | 27.54% | 6.89% |
TTN | - | Insertion | T | 1.09 | 34.11% | 8.59% |
TTN | rs3829747 | Missense | T | 4.72 | 37.55% | 20.30% |
TTN | rs2291310 | Missense | C | 2.18 | 36.63% | 8.04% |
TTN | rs2042995 | Intron/Missense | C | 7.83 | 56.32% | 31.64% |
TTN | rs3829746 | Missense | C | 5.30 | 75.94% | 37.60% |
TTN | rs744426 | Missense | A | 1.36 | 37.15% | 18.92% |