Human genetic ancestry, Mycobacterium tuberculosis diversity, and tuberculosis disease severity in Dar es Salaam, Tanzania
Figures
Genetic ancestry analyses of Tanzanian TB patients.
(A) Genetic ancestry proportions of the 1444 Tanzanian TB patients and representative human populations who shared at least 1% of their most common genetic ancestry with the Tanzanians for K = 24 (ESN: Esan from Nigeria (1000G), LWK: Luhya from Kenya (1000G)). For all populations included in our study, see Figure 1—figure supplement 1 for their geographic distribution and Figure 1—figure supplement 5 for the ancestry composition of all African populations included in this study. (B) The geographical location of the representative populations shown in A is depicted with black circles, and the corresponding country is highlighted. The remaining African populations included in the analysis are represented by blue circles.
Populations included in the admixture analysis.
Populations are colored according to the study from which the data was obtained (Patin et al., 2017; Semo et al., 2020; Selwyn et al., 1989; Stucki et al., 2016; Steiner et al., 2014; Coll et al., 2014; Obenchain et al., 2014; Wejse et al., 2008; Ralph et al., 2010; Michel et al., 2006).
The first two principal components from the principal component analysis (PCA) of all African populations included in this study (n = 116).
Boxplots of cross-validation errors for values of K between 2 and 29 resulting from 15 runs.
The lowest cross-validation error was obtained for K = 24.
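As a minimal sketch of how a best K can be read off from repeated runs, the example below picks the K with the lowest median cross-validation error. The `cv_errors` values are invented placeholders, not results from this study, and summarizing the runs by their median is an assumption; the caption only states that K = 24 gave the lowest error.

```python
from statistics import median

# Hypothetical cross-validation errors per K, one value per run.
# Placeholder numbers for illustration only, not values from the study.
cv_errors = {
    22: [0.5312, 0.5309, 0.5315],
    23: [0.5304, 0.5301, 0.5307],
    24: [0.5298, 0.5296, 0.5301],
    25: [0.5303, 0.5300, 0.5306],
}


def best_k(errors_by_k):
    """Return the K whose runs have the lowest median cross-validation error."""
    return min(errors_by_k, key=lambda k: median(errors_by_k[k]))


print(best_k(cv_errors))
```

Using the median rather than the minimum keeps a single unusually good run from deciding the choice of K.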
Boxplots of the proportions of the 24 genetic ancestries among the Tanzanian TB patients.
The ancestries were named after the population(s) where they were most abundant.
Spatial visualizations of the Bantu-speaking (BS) genetic ancestries and the genetic ancestries of the different self-identified ethnic groups among the TB patients in Tanzania.
Genetic ancestry was inferred by admixture analysis with K = 24, and the ancestries were interpolated using the pykrige module in Python (see Methods). (A) eBS genetic ancestry, (B) seBS genetic ancestry, and (C) wBS genetic ancestry. The populations included in the spatial interpolations are marked with black dots on the maps. The maps were created using the basemap module in Python. (D) Geographical origin of the ethnic groups in our TB patient cohort. The Temeke District hospital in Dar es Salaam, where the patients were recruited, is marked with a red point. Note that for some ethnic groups, no geographical origin could be identified (Supplementary file 1). (E) Ancestry plots for the ethnic groups with at least 10 patients in our TB patient cohort.
Heatmap showing the correlations between the genetic ancestries and geographical location.
(A) Correlations at the country level of Tanzania between the genetic ancestry proportions and the latitude and longitude of the ethnic group to which the patients belong. (B) Correlations at the continental level, including all African populations. For latitude, a negative correlation indicates higher genetic ancestry proportions toward the South; for longitude, a negative correlation indicates lower genetic ancestry proportions toward the East.
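The sign convention can be illustrated with a small sketch. It assumes a Pearson correlation (the caption does not name the coefficient used) and coordinates in decimal degrees with southern latitudes negative; the locations and ancestry proportions are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical ethnic-group locations (decimal degrees, south is negative)
# and hypothetical mean proportions of one ancestry in each group.
latitude = [-2.0, -4.5, -6.8, -9.0, -10.7]
ancestry = [0.20, 0.28, 0.35, 0.44, 0.52]  # proportion rises toward the South

r = pearson(latitude, ancestry)
print(f"r = {r:.2f}")  # negative: more of this ancestry toward the South
```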
Manhattan plots for the genome-wide association studies (GWAS) of (A) TB-score, (B) X-ray score, and (C) Ct-value.
The red line indicates the genome-wide significance threshold of 5e−8.
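The 5e−8 threshold is conventionally motivated as a Bonferroni correction of a 0.05 significance level for roughly one million independent common variants. The caption does not state how the threshold was derived, so the number of tests below is an assumption. On a Manhattan plot, which shows −log10(p) on the vertical axis, the threshold line sits at about 7.3.

```python
from math import log10

ALPHA = 0.05
# Conventional assumption of about one million independent tests;
# this figure is not stated in the source.
N_INDEPENDENT_TESTS = 1_000_000

threshold = ALPHA / N_INDEPENDENT_TESTS  # Bonferroni-corrected p-value threshold
line_height = -log10(threshold)          # position of the line on a -log10(p) axis

print(f"threshold = {threshold:.0e}, -log10(threshold) = {line_height:.2f}")
```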
QQ plots and genomic inflation factors for the genome-wide association studies (GWAS) of (A) TB-score, (B) X-ray score, and (C) Ct-value.
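For readers unfamiliar with the genomic inflation factor, the sketch below shows one common way to compute it from GWAS p-values: the median observed 1-degree-of-freedom chi-square statistic divided by its expected median under the null (about 0.455). This is a generic illustration under that definition, not the authors' pipeline, and the p-values are synthetic.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Median observed 1-df chi-square statistic over its expected null median."""
    std_normal = NormalDist()
    # A two-sided p-value maps to chi2 = z**2, where z is the normal quantile at p/2.
    observed = [std_normal.inv_cdf(p / 2) ** 2 for p in p_values]
    # Null median of a 1-df chi-square: squared 75th percentile of the normal.
    expected_median = std_normal.inv_cdf(0.75) ** 2  # about 0.455
    return median(observed) / expected_median


# Synthetic, evenly spaced p-values that mimic a null distribution.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(null_p)
print(f"lambda = {lam:.3f}")
```

Values close to 1 suggest little inflation of the test statistics; values well above 1 can point to population stratification or other confounding.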
Association between the Bantu genetic ancestries and TB-score and X-ray score for each of the most successful introductions.
(A, C, E) Southeastern, Eastern, and Western Bantu genetic ancestry by mild or severe TB-score; (B, D, F) Southeastern, Eastern, and Western Bantu genetic ancestry by lung damage assessed by lung X-ray.
Tables
Human and bacterial genotypes by the severity measures.
| | Levels | TB-score: Total N (%) | TB-score: Missing N (%) | TB-score: Severe N (%) | TB-score: Mild N (%) | Lung damage (X-ray score): Total N | Lung damage (X-ray score): Missing N | Lung damage (X-ray score): Severe | Lung damage (X-ray score): Mild | Lung damage (X-ray score): Not available | Bacterial load (Ct-value): Total N | Bacterial load (Ct-value): Missing N | Bacterial load (Ct-value): Mean (SD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total N (%) | | | | 624 (33) | 1280 (67) | | | 177 (9) | 849 (45) | 878 (46) | | | |
| MTBC genotype | Other | 1471 (77.3) | 433 | 207 (42) | 406 (41) | 764 (74.5) | 262 | 51 (39) | 269 (43) | 293 (41) | 863 (78.3) | 239 | 19.4 (4.9) |
| | L2.2.1 – Intro 1 | | | 22 (4) | 49 (5) | | | 7 (5) | 28 (4) | 36 (5) | | | 20.5 (5.1) |
| | L3.1.1 – Intro 10 | | | 184 (38) | 388 (40) | | | 59 (45) | 239 (38) | 274 (39) | | | 19.1 (4.7) |
| | L4.3.4 – Intro 5 | | | 26 (5) | 60 (6) | | | 4 (3) | 38 (6) | 44 (6) | | | 19.5 (4.1) |
| | L1.1.2 – Intro 9 | | | 51 (10) | 78 (8) | | | 11 (8) | 58 (9) | 60 (8) | | | 19.1 (4.7) |
| seBS Bantu | Mean (SD) | 1442 (75.7) | 462 | 0.44 (0.2) | 0.44 (0.2) | 840 (81.9) | 186 | 0.45 (0.1) | 0.43 (0.2) | 0.45 (0.2) | 810 (73.5) | 292 | 19.9 (5.2) |
| eBS | Mean (SD) | 1442 (75.7) | 462 | 0.23 (0.1) | 0.22 (0.1) | 840 (81.9) | 186 | 0.22 (0.1) | 0.23 (0.1) | 0.22 (0.1) | 810 (73.5) | 292 | 19.9 (5.2) |
| wBS | Mean (SD) | 1442 (75.7) | 462 | 0.08 (0.1) | 0.09 (0.1) | 840 (81.9) | 186 | 0.08 (0.1) | 0.09 (0.1) | 0.08 (0.1) | 810 (73.5) | 292 | 19.9 (5.2) |
| Other ancestry | Mean (SD) | 1442 (75.7) | 462 | 0.25 (0.1) | 0.25 (0.1) | 840 (81.9) | 186 | 0.25 (0.1) | 0.25 (0.1) | 0.25 (0.1) | 810 (73.5) | 292 | 19.9 (5.2) |
Characteristics of MTBC genotypes for all patients with either human or bacterial genetic data available.
| | N (%)* | Missing N | Levels | Other genotype | L2.2.1 – Intro 1 | L3.1.1 – Intro 10 | L4.3.4 – Intro 5 | L1.1.2 – Intro 9 | No bacterial data available |
|---|---|---|---|---|---|---|---|---|---|
| Total N (%) | | | | 613 (32) | 71 (4) | 572 (30) | 86 (5) | 129 (7) | 433 (23) |
| Sex† | 1471 (100.0) | 0 | Male (%) | 424 (69) | 52 (73) | 425 (74) | 55 (64) | 87 (67) | 302 (70) |
| | | | Female (%) | 189 (31) | 19 (27) | 147 (26) | 31 (36) | 42 (33) | 131 (30) |
| Age in years | 1471 (100.0) | 0 | Median (IQR) | 33.0 (28.0–40.0) | 31.0 (24.5–38.0) | 33.0 (26.0–41.0) | 31.5 (27.0–38.8) | 35.0 (26.0–45.0) | 35.0 (27.0–43.0) |
| HIV status† | 1452 (98.7) | 19 | Infected (%) | 90 (15) | 10 (14) | 103 (18) | 19 (22) | 31 (25) | 97 (23) |
| Smoker† | 1443 (98.1) | 28 | Yes (%) | 127 (21) | 18 (26) | 149 (27) | 11 (13) | 26 (20) | 97 (23) |
| Cough duration (weeks) | 1454 (98.8) | 17 | Median (IQR) | 4.0 (3.0–4.0) | 4.0 (3.0–4.0) | 4.0 (3.0–4.0) | 3.5 (2.2–4.0) | 4.0 (2.0–5.0) | 3.0 (2.0–4.0) |
| Socioeconomic status | 1452 (98.7) | 19 | Median (IQR) | 80.0 (50.0–125.0) | 75.0 (50.0–125.0) | 87.5 (50.0–126.2) | 85.4 (50.0–122.9) | 75.0 (50.0–125.0) | 83.3 (50.0–133.3) |
| Education† | 1471 (100.0) | 0 | No education (%) | 90 (14.7) | 5 (7.0) | 76 (13.3) | 10 (11.6) | 15 (11.6) | 76 (17.6) |
| | | | Primary (%) | 391 (63.8) | 54 (76.1) | 390 (68.2) | 61 (70.9) | 92 (71.3) | 273 (63.0) |
| | | | Secondary (%) | 105 (17.1) | 12 (16.9) | 90 (15.7) | 13 (15.1) | 18 (14.0) | 64 (14.8) |
| | | | University (%) | 27 (4.4) | 0 (0.0) | 16 (2.8) | 2 (2.3) | 4 (3.1) | 20 (4.6) |
| Malnutrition† | 1471 (100.0) | 0 | Yes (%) | 83 (13.5) | 9 (12.7) | 71 (12.4) | 11 (12.8) | 13 (10.1) | 56 (12.9) |
| Relapse/reinfection† | 1447 (98.4) | 24 | Yes (%) | 9 (1.5) | 4 (5.6) | 18 (3.2) | 2 (2.4) | 3 (2.3) | 12 (2.8) |
| Drug resistance status† | 1471 (100.0) | 0 | Susceptible (%) | 574 (93.6) | 70 (98.6) | 556 (97.2) | 80 (93.0) | 119 (92.2) | 0 |
| | | | INH-Mono (%) | 37 (6.0) | 1 (1.4) | 16 (2.8) | 6 (7.0) | 10 (7.8) | 0 |
| | | | MDR (%) | 2 (0.3) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 |
| seBS | 1009 (68.6) | 462 | Median (IQR) | 0.5 (0.3–0.6) | 0.5 (0.4–0.5) | 0.5 (0.4–0.6) | 0.5 (0.4–0.6) | 0.5 (0.4–0.6) | 0.5 (0.3–0.6) |
| eBS | 1009 (68.6) | 462 | Median (IQR) | 0.2 (0.2–0.3) | 0.2 (0.2–0.3) | 0.2 (0.2–0.3) | 0.2 (0.2–0.3) | 0.2 (0.2–0.3) | 0.2 (0.2–0.3) |
| wBS | 1009 (68.6) | 462 | Median (IQR) | 0.1 (0.0–0.1) | 0.1 (0.1–0.1) | 0.1 (0.0–0.1) | 0.1 (0.0–0.1) | 0.1 (0.0–0.1) | 0.1 (0.0–0.1) |
| Other ancestry | 1009 (68.6) | 462 | Median (IQR) | 0.2 (0.2–0.3) | 0.2 (0.2–0.3) | 0.2 (0.2–0.3) | 0.2 (0.2–0.2) | 0.2 (0.2–0.2) | 0.2 (0.2–0.3) |
- \* The column N (%) indicates the total number of patients with bacterial genetic data available who had a value for the respective variable.
- † The percentage of patients infected with an MTBC genotype who have the respective characteristic (e.g. percentage of males among patients infected with an Intro 1 MTBC genotype).
Estimated associations between disease severity, human genetic ancestry, and bacterial genotype.
Three variables were included as proxies for disease severity: lung damage (mild versus severe), TB-score (mild versus severe), and bacterial load (continuous, log10 transformed). Binomial logistic regressions were performed on data from HIV-negative patients and adjusted for age, sex, smoking, socioeconomic status, level of education, malnutrition, TB type (relapse or new infection), and drug resistance status by including these variables in the model. For the ancestries and the interactions, p-values were obtained from a likelihood ratio test comparing a model that included the ancestries and interactions with a model that did not. This table combines the results of two logistic regressions per disease severity measure, one with an interaction and one without. The ancestries were transformed and categorized (see Methods), with category 1 comprising the lowest proportion of the respective ancestry and category 3 (4 in the case of wBS) the highest.
| | | Lung damage: OR adjusted | Lung damage: p-value adjusted | TB-score: OR adjusted | TB-score: p-value adjusted | Bacterial load (Ct-value): OR adjusted | Bacterial load (Ct-value): p-value adjusted |
|---|---|---|---|---|---|---|---|
| MTBC genotype* | L3.1.1 – Introduction 10 | 1.60 (0.95–2.67) | 0.07 | 1.00 (0.72–1.40) | 0.98 | 0.99 (0.97–1.00) | 0.13 |
| | Other genotypes | 1 | | 1 | | 1.00 | |
| Human ancestry† | seBS category 3 | 2.29 (1.04–5.42) | 0.19 | 1.06 (0.66–1.69) | 0.17 | 1.00 (0.98–1.03) | 0.13 |
| | seBS category 2 | 2.40 (1.15–5.41) | | 0.62 (0.40–0.94) | | 1.02 (0.99–1.04) | |
| | seBS category 1 | 1 | | 1 | | 1.00 | |
| | eBS category 3 | 1.30 (0.51–3.79) | | 1.09 (0.64–1.84) | | 1.02 (0.99–1.05) | |
| | eBS category 2 | 1.50 (0.61–4.28) | | 1.17 (0.69–1.94) | | 1.00 (0.97–1.03) | |
| | eBS category 1 | 1 | | 1 | | 1.00 | |
| | wBS category 4 | 0.58 (0.23–1.53) | | 0.94 (0.49–1.76) | | 1.02 (0.98–1.06) | |
| | wBS category 3 | 0.54 (0.21–1.46) | | 1.09 (0.57–2.04) | | 1.03 (0.99–1.07) | |
| | wBS category 2 | 0.73 (0.26–2.15) | | 1.03 (0.50–2.09) | | 1.03 (0.99–1.08) | |
| | wBS category 1 | 1 | | 1 | | 1.00 | |
| Interaction† | seBS category 3 × L3.1.1 – Intro 10 | 1.05 (0.20–5.43) | 0.06(1) | 1.05 (0.40–2.71) | 0.92 | 0.99 (0.94–1.04) | 0.83 |
| | seBS category 2 × L3.1.1 – Intro 10 | 0.39 (0.08–1.94) | | 1.29 (0.52–3.15) | | 0.98 (0.93–1.03) | |
| | seBS category 1 × L3.1.1 – Intro 10 | 1 | | 1 | | 1.00 | |
| | eBS category 3 × L3.1.1 – Intro 10 | 8.32 (0.91–193.14) | | 1.45 (0.48–4.34) | | 1.01 (0.94–1.07) | |
| | eBS category 2 × L3.1.1 – Intro 10 | 6.38 (0.71–146.86) | | 1.25 (0.42–3.70) | | 1.02 (0.96–1.09) | |
| | eBS category 1 × L3.1.1 – Intro 10 | 1 | | 1 | | 1.00 | |
| | wBS category 4 × L3.1.1 – Intro 10 | 0.17 (0.02–1.23) | | 1.79 (0.49–6.56) | | 0.97 (0.90–1.05) | |
| | wBS category 3 × L3.1.1 – Intro 10 | 0.33 (0.04–2.51) | | 1.48 (0.40–5.51) | | 0.99 (0.92–1.07) | |
| | wBS category 2 × L3.1.1 – Intro 10 | 0.87 (0.09–8.06) | | 2.40 (0.53–11.05) | | 0.98 (0.90–1.07) | |
| | wBS category 1 × L3.1.1 – Intro 10 | 1 | | 1 | | 1.00 | |
- \* The odds ratio for genotype represents the odds of severe disease for Introduction 10 relative to the odds for other genotypes.
- † The odds ratios represent the estimated multiplicative change in the odds of severe disease for a one-unit increase in the additive log-transformed ancestry variable.
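As a reading aid for the odds ratios in this table, the sketch below converts a logistic-regression coefficient and its standard error into an odds ratio with a 95% interval. It assumes Wald-type intervals, which the source does not confirm (profile-likelihood intervals are another common choice), and the coefficient and standard error are hypothetical values, not estimates from this study.

```python
from math import exp


def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio and Wald interval from a log-odds coefficient and its standard error."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)


# Hypothetical coefficient (log-odds scale) and standard error.
odds_ratio, lower, upper = odds_ratio_with_ci(beta=0.47, se=0.26)
print(f"OR = {odds_ratio:.2f} ({lower:.2f}-{upper:.2f})")
```

An interval that spans 1, as in this hypothetical case, means the data are compatible with no difference in the odds of severe disease.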
Additional files
- Supplementary file 1
  The different ethnic groups with at least 10 members in our cohort and the region and broad geographic location of the original area of the ethnic group. The latitude and longitude are given in decimal degrees. In the case of two associated regions, a location close to the border of the two regions was selected.
  - https://cdn.elifesciences.org/articles/103533/elife-103533-supp1-v1.docx
- Supplementary file 2
  Phylogenetic markers selected to identify the introductions. The position is based on the reconstructed reference of the ancestor (Chang et al., 2015), and the derived base indicates the base present in the respective introduction. Intro 1 refers to Introduction 1 within L2.2.1, Intro 5 to Introduction 5 within L4.3.4, Intro 9 to Introduction 9 within L1.1.2, and Intro 10 to Introduction 10 within L3.1.1.
  - https://cdn.elifesciences.org/articles/103533/elife-103533-supp2-v1.docx
- MDAR checklist
  - https://cdn.elifesciences.org/articles/103533/elife-103533-mdarchecklist1-v1.pdf