Introduction

Africa harbors the largest human genetic diversity worldwide (1). This continent is also inhabited by numerous ethnic and linguistic groups (2). While the long evolutionary history of modern humans in Africa and their large effective population sizes (3) explain this high genetic diversity, more recent migration events within and from outside Africa during the last 5,000 years, as well as admixture between historically separated populations, have resulted in some degree of homogenization (4). Hence, African populations nowadays are often composed of a mixture of different genetic ancestries (5, 6). One human migration that had a major influence on the population structure of present-day Africans is the so-called ‘Bantu expansion’, where Bantu-speaking groups migrated from central Western Africa southwards and eastwards, spreading farming technologies across sub-Saharan Africa, and admixing with local groups of hunter-gatherers and pastoralists (4, 7). Recent human genetic studies have identified moderate population structuring among Bantu-speaking populations (8, 9), yet admixture with local populations has impacted immune responses to infectious diseases (10). Since pathogens have been one of the strongest selective forces driving human evolution (11), disease susceptibility and clinical outcomes can differ markedly between human populations. Genome-wide association studies (GWAS) have identified mutations altering the susceptibility to various infectious diseases (12–15), including tuberculosis (TB), which remains the main cause of human death due to a single infectious agent (16).

The bacteria that cause TB belong to the Mycobacterium tuberculosis complex (MTBC) and can be classified into ten human-adapted phylogenetic lineages: Lineage 1 (L1) to L10, plus several lineages adapted to different wild and domestic animal host species (17, 18). While TB is a global problem, the human-adapted MTBC lineages differ in their geographical distribution. L2 and L4 occur worldwide and other lineages are restricted to specific regions.

Specifically, L1 and L3 mainly occur around the Indian Ocean (19), L5 and L6 are limited to West Africa (20), and L7 only occurs in Ethiopia (21). This phylogeographic population structure of the human-adapted MTBC led to the hypothesis that certain MTBC genotypes are locally adapted to their sympatric human populations (19). This hypothesis is supported by findings from cosmopolitan settings, where sympatric associations between the geographical origin of TB patients and their infecting MTBC strains were observed (22–24). Moreover, in immune-compromised individuals with HIV co-infection, these sympatric associations were disrupted (25). The various MTBC genotypes also differ phenotypically (26), regarding disease progression (27), transmission (28, 29), and disease presentation (30).

Human genetic diversity has also been linked to differences in TB susceptibility. While, for example, TYK2 has been associated with TB disease worldwide (31), several human genetic loci were not consistently associated with TB in populations from different geographical regions but were specific to certain populations (32–36). Particular human genetic ancestries have also been found to play a role in the context of TB. People with higher proportions of native Peruvian genetic ancestry showed a higher risk of progressing to active TB (37), and a higher proportion of San genetic ancestry was associated with an increased risk for TB among South African Coloured individuals (38). In addition to the effects of human and bacterial genetic diversity on TB, many social and environmental factors, as well as co-morbidities, are known drivers of TB. These include malnutrition (39), HIV infection, diabetes (40, 41), poverty (42), smoking (43), and alcohol consumption (44). While associations between TB and individual host, pathogen or environmental factors have been found (32–37, 39), studies considering all these components simultaneously remain scarce (45–47).

Here, we characterized the genetic ancestry of a cohort of TB patients from Dar es Salaam, Tanzania, and the phylogenetic lineage of their corresponding MTBC isolates, and investigated the association of both with TB disease severity.

Results

Genetic ancestry of Tanzanian TB patients

Genetic ancestries were estimated for 7,479 individuals from 249 populations (Supplementary Figure S2) including 1,444 Tanzanian TB patients, using the software Admixture (48) with 53,255 SNPs (Figure 1, Supplementary Figure S1).

Genetic ancestry analyses of Tanzanian TB patients.

A) Genetic ancestry proportions of the 1,444 Tanzanian TB patients and representative human populations who shared at least 1% of their most common genetic ancestry with the Tanzanians for K = 24. (ESN: Esan from Nigeria (1000G), LWK: Luhya from Kenya (1000G)). For all populations included in our study, see Supplementary Figure S2 for their geographic distribution and S5 for the ancestry composition of all African populations included in this study. B) The geographical location of the representative populations shown in A are depicted with black circles, and the corresponding country is highlighted. The remaining African populations included in the analysis are represented by blue circles.

The optimal number of source populations to describe our dataset was 24, based on the lowest cross-validation error (49). For the purpose of this study, we named the genetic ancestries according to the geographical distribution and/or ethnicity of the reference populations in which they are most prevalent. The genetic ancestry with the highest contribution among Tanzanians, with a mean of 44% (maximum 68%, minimum 0%), was also the most abundant in Bantu-speaking people from Southern and Southeastern Africa (e.g. the Ronga population in Figure 1), and hence hereafter, we will refer to this ancestry as “Southeastern Bantu” (Figure 1). The second most common genetic ancestry in the Tanzanian TB patients, with a mean of 22% (maximum 42%, minimum 6%), was most common among Kenyans (e.g. the Luhya population in Figure 1, 1000G and HGDP, see Methods), and will be referred to as the “Eastern Bantu” genetic ancestry. Additionally, the Tanzanian TB patients had a mean of 9% (maximum 53%, minimum 0%) of a genetic ancestry that was most common among Bantu-speaking populations from western Central Africa (e.g. the Eviya population in Figure 1); we will refer to it as the “Western Bantu” genetic ancestry. Furthermore, the Tanzanian TB patients contained on average 4% of a genetic ancestry most abundant among Nigerians represented by the Esan population (“Nigerian” genetic ancestry in Figure 1), and 4% of a genetic ancestry most abundant in people from Chad and Sudan (represented by the Nuba population in Figure 1). In addition, the genetic ancestry of the Tanzanian TB patients was composed of 3% of a genetic ancestry most prevalent in people from Western Africa (Gambia and Senegal, represented by the Senegal Bedik population in Figure 1), as well as 3% of a genetic ancestry most prevalent in individuals from Western African rainforest hunter-gatherer populations (Bezan population in Figure 1).
A mean of 2% belonged to a genetic ancestry most common among Bedouin individuals (represented by the Yemenite Jew population in Figure 1). The proportions of the remaining genetic ancestries were all smaller than 2% (admixture plots for all African populations included can be found in Supplementary Figure S5). Finally, most Tanzanian TB patients had little non-African genetic ancestry (mean 5%, maximum 65%), with only 14 patients (∼1%) showing more than 30% non-African genetic ancestry. In summary, the ancestry of Tanzanian TB patients was, for the most part, a mixture of three different Bantu components. Thus, for the remaining sections, we will focus on the ancestries termed Eastern Bantu, Southeastern Bantu, and Western Bantu.

Insights into the Bantu genetic ancestries in Africa and Tanzania

Compiling several datasets, including many different Bantu populations, allowed for a closer look at the distribution of Bantu ancestries across African populations. Like Tanzanians, the populations from neighboring Kenya and Mozambique showed strong contributions of Bantu ancestries resulting from different admixture events (Figure 1). While the Eastern Bantu genetic ancestry was highest in Kenya and Tanzania, decreasing from there to the south and to the west of the continent (Figure 2A), the Southeastern Bantu genetic ancestry generally increased towards the south and decreased towards the west, as observed by others (Figure 2B) (8). The Western Bantu genetic ancestry was mainly seen in Bantu-speaking populations from Gabon and Cameroon, as well as in South African populations (Figure 2C). At a continent-wide level, the geographical distribution and the genetic distances of the human populations analyzed were significantly correlated (Mantel test: veridical correlation = 0.18, p-value < 0.001, Supplementary Figure S6).
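To make the Mantel statistic reported above concrete, the following is a minimal sketch of a permutation-based Mantel test between two distance matrices. It is an illustration written in plain Python, not necessarily the implementation used in this study; the function names and the default of 999 permutations are assumptions.

```python
import random
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Two-sided Mantel test (hypothetical helper, not the study's code).

    Returns the observed correlation between the two distance matrices and
    a permutation p-value obtained by jointly shuffling the rows and
    columns of dist_b.
    """
    rng = random.Random(seed)
    n = len(dist_a)
    flat_a = upper_triangle(dist_a)
    observed = pearson(flat_a, upper_triangle(dist_b))
    as_extreme = 0
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        permuted = [[dist_b[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if abs(pearson(flat_a, upper_triangle(permuted))) >= abs(observed):
            as_extreme += 1
    # The observed arrangement counts as one of the permutations.
    p_value = (as_extreme + 1) / (permutations + 1)
    return observed, p_value
```

In this setting, `dist_a` would hold pairwise geographic distances between populations and `dist_b` the corresponding pairwise genetic distances, in the same population order.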

Spatial visualizations of the Bantu genetic ancestries and the genetic ancestries of the different self-identified ethnic groups among the TB patients in Tanzania. The genetic ancestry was inferred with Admixture with K = 24, and the interpolation of the ancestries was performed using the pykrige module in Python (see Methods). A) Eastern Bantu genetic ancestry, B) Southeastern Bantu genetic ancestry, and C) Western Bantu genetic ancestry. The populations included for spatial interpolations are marked with a black dot on the maps. The maps were created using the basemap module in Python. D) Geographical origin of the ethnic groups among our TB patient cohort. The Temeke District Hospital in Dar es Salaam, where the patients were recruited, is marked with a red point. Note that for some ethnic groups, no geographical origin could be identified (Supplementary Table 1). E) Ancestry plots for the different ethnic groups with at least 10 patients from our TB patient cohort.

Even though all our TB patients were recruited in Dar es Salaam, we found them to belong to a variety of self-defined ethnic groups linked to different geographical regions within Tanzania (Figure 2D). Even at this smaller scale, we found a significant correlation between the geographic distance of the self-defined ethnic groups of our TB patients and their genetic distances (Mantel test: veridical correlation = 0.12, p-value < 0.001). The geographical structuring of human genetic diversity within Tanzania was further supported by our finding that the Southeastern Bantu genetic ancestry increased from West to East, and from North to South. The Eastern Bantu genetic ancestry increased from South to North and decreased from West to East. The Western Bantu genetic ancestry increased to the North and decreased to the East.

The MTBC genotypes circulating in Tanzania and their association with TB disease severity

In previous work (50), we investigated the MTBC genotypes causing TB in the patient cohort analyzed here. We found a high diversity of MTBC genotypes, with approximately half of the TB cases caused by four main MTBC genotypes estimated to have been introduced into Tanzania starting 320 years ago. After adding the genomic information of an additional 389 MTBC isolates to our dataset (total N=1,471), the prevalence of the four dominant MTBC genotypes was very similar to our previous findings (50). The most successful genotype within L3.1.1 (referred to as “Introduction 10”) contributed 39% of the current TB cases, followed by a genotype within L1.1.2 (“Introduction 9”) with 9%, a genotype within L4.3.4 (“Introduction 5”) with 6%, and a genotype within L2.2.1 (“Introduction 1”) with 5%. The remaining TB cases were caused by a variety of other genotypes within L1-L4, each occurring at a frequency of 1% or less (50). Despite a 36% increase in sample size compared to our previous analysis (50), we still found no evidence of an association between the different MTBC genotypes circulating in Dar es Salaam and TB disease severity, using X-ray scores (mild or severe), bacterial load (inferred from GeneXpert cycle threshold) and TB-score (mild or severe) as proxies (Table 1, Table 3; see Methods).

Human and bacterial genotypes by the severity measures.

Relationship between the human and MTBC population structures

We previously found that the four dominant MTBC genotypes in Dar es Salaam differed in their transmission rate and in the duration of the infectious period (50). Here, we assessed whether there might be an additional host genetic contribution to these differences. We first compared the genetic ancestry proportions between patients infected with the four dominant genotypes, and then tested whether patients who were genetically more closely related were infected with MTBC genotypes that were also more closely related as would be expected from a co-evolutionary process (51). However, the human genetic ancestry proportions differed only marginally between the TB patients infected by the four main MTBC genotypes (Table 2). Moreover, there was no correlation between the human and bacterial genetic distances (Mantel test: veridical correlation = −0.02, p-value = 0.85). Taken together, we found no evidence for any statistically significant relationship between the human and bacterial genetic population structure in Dar es Salaam. These results also suggest that the genetic composition of this human population is unlikely to have a measurable effect on the differences in bacterial transmission rate and duration of the infectious period reported previously (50).

Summary of ancestry proportions, sociodemographic and clinical data stratified by MTBC genotype.

Association of human genetic ancestry with TB disease severity

We next investigated whether human genetic ancestry could have contributed to the differences in disease severity observed between our TB patients. We fitted logistic regression models using the three proxies of TB disease severity mentioned previously as outcome variables. We included the three human genetic ancestries with the highest proportions among the Tanzanian TB patients (Southeastern Bantu, Eastern Bantu, Western Bantu) as covariates, together with age, sex, HIV status, smoking and cough duration to control for potential confounding. We found no evidence of an association between human genetic ancestry and any of these three proxies of TB disease severity (Table 1).

The combined effect of human and MTBC genetic diversity on TB disease severity

For a subset of 1,000 TB patients, we had both an MTBC genome and a human genome or genotype available. Genetic interactions between the host and the pathogen have been shown to affect TB severity in other settings (37, 52). To test for potential interactions between human and bacterial diversity on TB severity, we added the most common MTBC genotype (L3.1.1-Introduction 10) as an additional explanatory variable to the logistic regression models described in the previous section, as well as the interaction between human ancestry and MTBC genotype. We only tested L3.1.1-Introduction 10, since the numbers of patients infected with the other genotypes were too small for meaningful testing. However, we found no evidence for any interaction between the MTBC genotype and human ancestry influencing TB disease severity in this patient population (Table 3).
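As a schematic reading of the interaction model described above, the regression for patient i can be written as follows. The notation is introduced here for illustration only, since the exact parameterisation is not stated in the text: A_{ik} denotes the proportion of ancestry k (Southeastern, Eastern or Western Bantu), G_i indicates infection with L3.1.1-Introduction 10, and z_i collects the adjustment covariates.

```latex
\operatorname{logit}\,\Pr(\text{severe}_i)
  = \beta_0
  + \sum_{k} \beta_k A_{ik}
  + \gamma\, G_i
  + \sum_{k} \delta_k A_{ik} G_i
  + \boldsymbol{\theta}^{\top} \mathbf{z}_i
```

Under this reading, the test for a host-pathogen interaction asks whether the coefficients \delta_k differ from zero.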

Associations between the different proxies for disease severity, human genetic ancestry, bacterial genotype and the interaction between human genetic ancestry and bacterial genotype. Binomial logistic regressions were performed on the data of HIV-negative patients, adjusting for age, sex, smoking and cough duration.

Discussion

In this study, we analyzed the genetic ancestry of TB patients, the MTBC diversity underlying their TB infection, and estimated the associations of both with disease severity in Dar es Salaam, Tanzania. We found a strong component of Bantu genetic ancestries among the Tanzanian TB patients, similar to those of neighboring populations from Mozambique and Kenya, and little non-African genetic ancestry. Genetic ancestry proportions did not differ between patients infected with different MTBC genotypes. There was no evidence that the patient genetic ancestry or the MTBC genotype on their own, nor their interaction, had any effect on TB disease severity.

Despite the fact that Tanzania is one of the few countries in sub-Saharan Africa where all four main African linguistic groups co-exist (53), and that its largest city and economic capital, Dar es Salaam, is strongly influenced by different human populations from within and outside Africa, our cohort of TB patients mostly comprised Bantu-speaking ethnicities. Comparing this TB patient cohort to a large number of modern human populations revealed major components of Eastern and Southeastern Bantu genetic ancestries. This genetic population structure probably resulted from several admixture events estimated to have happened between 1,500 and 150 years ago, between local populations and Bantu-speaking populations who migrated from West Africa to the East and South of the continent (4). The TB patients investigated here were recruited in one district hospital of Dar es Salaam. Yet, we found the genetic distances between the patients to be correlated with the original geographic range of their self-identified ethnicities, suggesting that the corresponding human populations are not fully admixed. The population of Dar es Salaam has increased by several million in the last 40 years, mainly as a result of immigration from rural parts of Tanzania (54). Thus, our findings suggest that our TB patient population mostly represents recent migrants to Dar es Salaam from other regions of Tanzania. Moreover, we found little evidence of Eurasian genomic influence in the TB patient population (on average 5% genetic ancestry). This is in strong contrast to coastal Swahili populations, as recent findings comparing modern and medieval Swahili people revealed large components of genetic ancestry derived from exchanges between local East African Bantu populations and people from India, Persia, and Arabia, starting as early as 1000 AD (55). We conclude that our TB patient population does not represent the full spectrum of human genetic diversity in Tanzania.

In contrast to the genetic ancestry of the TB patients, we found that the MTBC genotypes infecting these patients descend from multiple historical introductions, which mainly resulted from the human exchanges that took place across the Indian Ocean during the last few centuries (50). Some of these recently introduced MTBC genotypes became dominant, in particular the MTBC genotype L3.1.1-Introduction 10, which caused TB in approximately 40% of our patients. These strains descended from an introduction that occurred approximately 300 years ago from South or Central Asia to East Africa (50). Hence, while our TB patient population reflects little historical gene flow from non-African populations, the underlying MTBC diversity indicates that the MTBC genotypes introduced from outside successfully spread in this newly encountered host population, eventually outcompeting native MTBC genotypes (56).

We previously reported that in this TB patient cohort, some of the dominant MTBC genotypes had a higher transmission rate than others, while some other MTBC genotypes induced patients to remain infectious for longer (50). Based on the similar proportions of MTBC genotypes among self-reported ethnic groups we observed at the time, we had already hypothesized that human genetic heterogeneity of the host population is unlikely to be responsible for those differences (50). Here we formally addressed this hypothesis, and found that there was no evidence that the human genetic ancestry proportions differed between patients infected with different MTBC genotypes in our cohort. This finding further supports the notion that the differences in epidemiological parameters we reported previously are mainly determined by the pathogen genotype.

Disease severity is one aspect of the clinical presentation of TB with a direct impact on patient mortality and morbidity, as well as on pathogen transmission, as it influences patient infectiousness. It is thus likely that both host and pathogen genetic characteristics can modulate TB disease severity (51). Experimental infections in various animal models suggest that different MTBC strains vary in virulence (57–59). However, clinical evidence for differences in disease severity caused by different MTBC genotypes is inconsistent (60, 61). We found no evidence of differences in disease severity at the point of care caused by the different MTBC genotypes in our study. Moreover, we did not observe any association between human genetic ancestry and disease severity, which is in contrast to a recent study from Peru, where human genetic ancestry was found to influence progression to active TB (37).

Our previous work found evidence of host-pathogen genetic interactions when considering the complete genomes of the TB patients and their infecting MTBC strain (46), and identified associations between human and pathogen variants. Such associations reflect host-pathogen genetic interactions that determine susceptibility to symptomatic TB or intra-host selection during mycobacterial replication. Here, in the context of co-evolutionary history between humans and MTBC, we specifically tested whether an interaction between the main human ancestry components and being infected with the most dominant MTBC strain in Dar es Salaam could explain the variability in TB disease severity. However, we did not find evidence of such an effect. Others have reported an association between TB disease severity, a particular bacterial genotype and a particular human SNP in Uganda, but did not explicitly link this human SNP to a particular human genetic ancestry (52, 62). Several factors could account for the lack of effect we observed. First, our patient population was relatively genetically homogeneous, given that the different Bantu components represent populations with only moderate levels of genetic differentiation (63). Second, there is likely to be selection bias in our cohort since only patients presenting at the clinic were recruited. The disease severity measures included in this study mainly reflect disease stages at which patients felt ill enough to go to the hospital; i.e. we did not consider intermediate, more contrasting disease states that are known to occur between infection and the development of symptoms (64). To at least partially account for that, we included the number of weeks a patient was coughing as a covariate in our analyses. Third, the lack of a measurable interaction between host genetic ancestry and MTBC genotype could reflect the relatively recent presence of these MTBC genotypes in Tanzania, and the distinct (i.e. allopatric) geographical origins of the host and pathogen populations. This indicates that none of the ancestral human populations that compose modern Tanzanians has lived in sympatry with the ancestors of the modern MTBC genotypes that circulate in Dar es Salaam today, thus rendering human susceptibility alleles linked to genetic ancestry unlikely. With the exception of West Africa, where the geographically restricted West-African MTBC lineages L5 and L6 remain an important cause of human TB (65), the situation in Tanzania might be representative of the TB epidemics in many African countries, as evidence suggests that many MTBC genotypes dominating the continent today have been introduced from outside Africa during recent centuries (56, 66–68).

In conclusion, our study shows that the TB patients from Dar es Salaam were mainly of Bantu genetic ancestry, reflecting limited Eurasian genetic influx. Neither human genetic ancestry nor MTBC genotype alone, nor their interaction, was associated with TB disease severity. Our results highlight the dominant role of social and environmental factors in human TB in Tanzania.

Methods

Study population

This study is based on a previously described dataset (46, 50). Briefly, adult active TB patients (sputum smear-positive and GeneXpert-positive) were recruited between November 2013 and June 2022 at the Temeke District Hospital in Dar es Salaam, Tanzania, when they first presented for care. Sputum and blood samples were collected from each patient to extract DNA for sequencing of the MTBC strain, and for genotyping or whole-genome sequencing (WGS) of the patient. Additionally, clinical and sociodemographic information was obtained from every patient. In total, either human or bacterial data were available for 1,906 patients. Of those, 1,444 patients had human genetic data and 1,471 had bacterial genetic data available. A total of 1,000 patients had both types of data available after quality-based filtering. The geographical location of the self-indicated ethnic group of each patient was retrieved by searching for the original region of the respective ethnic group; if the group originated from a single region, the geographic coordinates of that region according to Wikipedia were taken. If two neighboring regions were among the origins, then a random location between the two regions was taken as a surrogate (Supplementary Table S1).

Bacterial and human sample processing

The MTBC bacteria were cultured on solid Löwenstein-Jensen media at the TB laboratory of the Ifakara Health Institute in Bagamoyo, Tanzania. Before 2018, the MTBC isolates were transferred to Switzerland for DNA extraction; later, DNA extraction was carried out in Bagamoyo. Bacterial WGS was done using the Illumina short-read technology at the Department of Biosystems Science and Engineering of ETH Zurich in Basel (DBSSE). Human WGS was done at the Health 2030 Genome Center in Geneva, Switzerland, using an Illumina NovaSeq 6000 sequencer. The human genotyping was done at the iGE3 Genomics platform at the University of Geneva in Switzerland using the Illumina Infinium H3Africa genotyping microarray (Version 2; https://chipinfo.h3abionet.org) plus custom Tanzanian-specific SNP add-ons (69).

Human genetic data

The processing of the human genetic data has been described in detail by Xu et al. (46). Briefly, we used GRCh38 as the human reference genome to map the WGS reads of 118 patients using the BWA aligner (v0.7.17) (70). Duplicate reads were then marked with the MarkDuplicates module of Picard (v2.8.14) (71). Variant calling was first done for each sample individually following GATK best practices for germline short variant discovery. Samples with a coverage below 5 were excluded, followed by joint calling of the variants. A Variant Quality Score Recalibration (VQSR) based filter was applied (truth sensitivity of 99.7, excess heterozygosity of 54.69) and samples with more than 50% missing genotype calls were removed.

For the genotyping data, we used the Illumina GenomeStudio software (v2.0.5, https://support.illumina.com/array/array_software/genomestudio/downloads.html) to analyze the raw microarray data. Samples of low quality, that were badly clustered, or that had a call rate below 0.97 were excluded. The PLINK format was then converted to VCF format using PLINK (v1.9) (72). The first round of imputation was performed with the African Genome Resources (AFGR, https://www.apcdr.org/) reference panel on the Sanger Imputation Server with EAGLE (73) for phasing and the Positional Burrows-Wheeler Transform (PBWT) (74) for imputing. The second round was performed with a reference panel created in-house that was based on 118 patients with available whole-genome sequences (69), with SHAPEIT4 for phasing and Minimac3 for imputing. For each SNP, the reference panel with the highest imputation quality score was used to determine the final genotype call. SNPs with an INFO score below 0.8 were discarded.
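The per-SNP choice between the two imputation rounds described above can be sketched as follows. This is a minimal illustration under assumed data structures, not the pipeline used in this study: the function name, the dictionary layout, and the panel labels are hypothetical.

```python
def choose_genotype_calls(calls_by_panel, min_info=0.8):
    """Pick, for each SNP, the call from the panel with the highest INFO score.

    calls_by_panel maps a SNP identifier to a dictionary of
    {panel_name: (genotype, info_score)}. SNPs whose best INFO score is
    below min_info are discarded, mirroring the threshold in the text.
    """
    final_calls = {}
    for snp_id, panels in calls_by_panel.items():
        # Select the panel whose imputation quality (INFO) is highest.
        best_panel, (genotype, info) = max(
            panels.items(), key=lambda item: item[1][1]
        )
        if info >= min_info:
            final_calls[snp_id] = (best_panel, genotype, info)
    return final_calls
```

The two panel names would correspond to the AFGR panel and the in-house panel; any SNP for which neither round reaches an INFO score of 0.8 drops out of the result.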

Bcftools (v1.15) was used to merge the WGS and genotyping samples after identifying SNPs shared between the two methods that were missing in fewer than 10 samples.

Whole-genome sequence analysis of the MTBC bacteria

We analyzed all FASTQ files using the WGS analysis pipeline described previously (75). In summary, Trimmomatic (76) v. 0.33 (SLIDINGWINDOW:5:20) was used to remove the Illumina adaptors and to trim low-quality reads. Only reads with a length of at least 20 bp were kept for further analysis. Overlapping paired-end reads were merged using SeqPrep v. 1.2 (77) (overlap size = 15). We then mapped the resulting reads to a reconstructed ancestral sequence of the MTBC (78) with BWA v. 0.7.13 (mem algorithm) (70). Picard v. 2.9.1 (79) was then applied to mark and exclude duplicated reads. Furthermore, the RealignerTargetCreator and IndelRealigner modules of GATK v. 3.4.0 (80) were used to perform local realignment of reads around INDELs. Reads with an alignment score lower than (0.93 × read length) − (read length × 4 × 0.07), corresponding to more than 7 mismatches per 100 bp, were excluded using Pysam v. 0.9.0 (81). SNP calling was then performed with SAMtools v. 1.2 mpileup (82) and VarScan v. 2.4.1 (83) with the following thresholds: a minimum mapping quality of 20, a minimum base quality at a position of 20, a minimum read depth at a position of 7, and no strand bias. Positions in repetitive regions such as PE, PPE, and PGRS genes or phages were excluded, as described in (84). A whole-genome FASTA file was created from the resulting VCF file. We applied some additional filters: genomes were excluded from downstream analysis if they had a sequencing coverage lower than 15 or if they contained SNPs suggestive of different MTBC lineages (i.e. mixed infections). We identified lineages and sublineages using the SNP-based classifications by Steiner et al. (85) and Coll et al. (86), respectively.
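The alignment-score cutoff above can be worked through with a short sketch. It assumes the formula reflects a scoring scheme of +1 per matching base and −4 per mismatch (the BWA-MEM defaults), so that a read with 93% matches and 7% mismatches sits exactly on the threshold; the function names are hypothetical, and integer arithmetic is used to avoid floating-point rounding.

```python
def min_alignment_score(read_length):
    """Minimum alignment score for a read of the given length.

    Implements (0.93 x read length) - (read length x 4 x 0.07) using
    integer arithmetic: 93% of bases scored +1, 7% penalised by 4.
    """
    return (93 * read_length - 4 * 7 * read_length) / 100


def keep_read(alignment_score, read_length):
    """Keep a read unless its score is lower than the threshold."""
    return alignment_score >= min_alignment_score(read_length)
```

For a 100 bp read the threshold is 93 − 28 = 65. A read with 7 mismatches scores 93 − 28 = 65 and is kept, whereas a read with 8 mismatches scores 92 − 32 = 60 and is excluded.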

Identification of bacterial SNPs diagnostic for the successful MTBC introductions

We previously identified several successful MTBC introductions into Dar es Salaam (50). For these, we aimed to obtain a set of diagnostic SNPs that would allow assigning MTBC strains not included in our previous study to these genotypes. For that, we merged the VCF files from the 1,082 MTBC genomes included in that previous dataset using BCFtools (v1.9). We then used VCFtools (v0.1.16) to remove indels and positions that were variable in fewer than 12 genomes (12 was the minimal threshold selected when identifying the successful introductions in our previous publication (50)). Using the R package VariantAnnotation (87) and a customized Python script, SNPs specific to one of the most successful introductions were extracted. To ensure that the SNPs identified as markers for the introductions were specific, we also identified phylogenetic SNPs in a larger, global dataset representing the human-adapted MTBC diversity (75), and tested whether any of the phylogenetic SNPs identified for any of the successful introductions was present in any of the MTBC lineages or sublineages. We compiled a subset of 25 SNPs that we used as phylogenetic markers for the different MTBC introductions and, based on these SNPs, identified strains belonging to one of the four most successful MTBC introductions (Introduction 1, Introduction 5, Introduction 9, and Introduction 10) in the expanded MTBC dataset using a customized script.
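The idea of a diagnostic SNP can be illustrated with a small set-based sketch: a position qualifies if it is variable in every genome of the target introduction and in no genome outside it. This is a simplified illustration and not the customized script used in this study; the function name and input layout are assumptions, and the real analysis additionally checked specificity against a global MTBC dataset.

```python
def diagnostic_snps(snps_by_genome, group_by_genome, target_group):
    """Return SNPs shared by all genomes of target_group and absent elsewhere.

    snps_by_genome maps a genome identifier to the set of its variable
    positions; group_by_genome maps the same identifiers to a group label
    (for example an introduction name, or any other label for the rest).
    """
    inside = [snps for genome, snps in snps_by_genome.items()
              if group_by_genome[genome] == target_group]
    outside = [snps for genome, snps in snps_by_genome.items()
               if group_by_genome[genome] != target_group]
    if not inside:
        return set()
    # SNPs carried by every member of the target group ...
    shared = set.intersection(*inside)
    # ... minus anything also seen in a genome outside the group.
    seen_elsewhere = set().union(*outside)
    return shared - seen_elsewhere
```

A new isolate could then be assigned to an introduction if it carries the marker SNPs returned for that group.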

Measures of TB disease severity

We used three different proxies for TB disease severity. The first one was the TB-score, which is a clinical score adapted from Wejse et al. (88) that consists of several signs and symptoms, including the presence of fever and the Body Mass Index (BMI). A point was given for each of the following symptoms or clinical measures if present: BMI below 18, BMI below 16, mid upper arm circumference (MUAC) below 220, MUAC below 200, body temperature higher than 37°C, cough, hemoptysis, dyspnea, chest pain, night sweat, abnormal auscultation, anaemia. A maximum of 12 points could be achieved, and a TB-score below 6 was considered as mild and everything above as severe. As a second proxy, we assessed the amount of lung damage. Two independent radiologists assessed chest X-ray pictures of the patients and gave a Ralph score (89). X-ray scores above 71 were considered as severe and everything below as mild. As a third proxy, we used the bacterial load in the sputum represented by the difference between the first (early cycle) and the last (late cycle) threshold during quantitative PCR (Ct-value). For each sputum sample, we took five probes, ran a quantitative PCR on each, and reported the lowest Ct-value.
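The scoring rules above can be summarised in a brief sketch. The item names are hypothetical labels for the twelve criteria listed in the text, and the handling of boundary values is an assumption: the text does not state how a TB-score of exactly 6 or an X-ray score of exactly 71 was classified, so here a TB-score of 6 counts as severe and an X-ray score of 71 as mild.

```python
# Hypothetical labels for the twelve TB-score criteria listed in the text.
TB_SCORE_ITEMS = (
    "bmi_below_18", "bmi_below_16", "muac_below_220", "muac_below_200",
    "temperature_above_37", "cough", "hemoptysis", "dyspnea",
    "chest_pain", "night_sweat", "abnormal_auscultation", "anaemia",
)


def tb_score(findings):
    """Count one point per criterion that is present (maximum 12)."""
    return sum(1 for item in TB_SCORE_ITEMS if findings.get(item, False))


def classify_tb_score(score):
    """Below 6 is mild; 6 or more is treated as severe (assumption for 6)."""
    return "mild" if score < 6 else "severe"


def classify_xray(ralph_score):
    """Above 71 is severe; 71 or less is treated as mild (assumption for 71)."""
    return "severe" if ralph_score > 71 else "mild"
```

Note that a patient with a BMI below 16 also satisfies the BMI-below-18 criterion and therefore receives two points, and likewise for the two MUAC thresholds.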

Genetic ancestry analysis of TB patients

To investigate the genetic ancestry of the TB patients, we combined our dataset with the data from ten other projects: the Gambian Genome Variation Project (GGVP) (90), the 1000 Genomes Project (1000G) (91), the Human Genome Diversity Project (HGDP) (92), the Simons Genome Diversity Project (SGDP) (93), as well as data generated by Patin et al. 2014 and 2017 (4, 94), Hollfelder et al. 2017 (95), Semo et al. 2019 (8), Schlebusch et al. 2012 (96), and Fortes-Lima et al. 2022 (97). We used the GRCh37 version of all datasets. The HGDP dataset was available in GRCh38, so we lifted it over to GRCh37 using the picard (v2.26.10) (71) tool LiftoverVcf. For all datasets including populations from a single continent only, we used PLINK (version 1.9b, www.cog-genomics.org/plink/1.9/) (72) to exclude variants with a missingness of more than 10% and variants that deviated from Hardy-Weinberg equilibrium (p < 1e-5). For the 1000G, SGDP, and HGDP data, we first identified variants that deviated from Hardy-Weinberg equilibrium (p < 1e-5) in each superpopulation using PLINK (version 2.0a, www.cog-genomics.org/plink/2.0/) and removed them from the whole dataset. We additionally removed variants with a high missingness (> 10%) from the full datasets using PLINK (version 1.9b). After extraction of 103,262 nucleotide positions common to all datasets, we merged the datasets using PLINK (version 1.9b). From the merged dataset, we removed second-degree relatives using PLINK (version 2.0a, king cutoff of 0.088) (N=369). We also removed patients from our cohort whose sex according to the genetic data did not correspond to the sex indicated in the clinical data, who were genetic outliers based on principal component analysis (PCA), or who did not cluster with any other African samples (N=83).
In addition, we removed regions of high linkage disequilibrium (https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD)) and applied additional filters to the merged dataset (removal of variants with missingness > 10%, minimum minor allele frequency of 5%, removal of sex chromosomes, variant pruning with --indep-pairwise 50 10 0.1, only biallelic positions), ending up with 53,255 positions and 7,479 individuals from 249 populations.
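To make the missingness and allele frequency thresholds concrete, the sketch below applies them to a single variant in plain Python. It is only an illustration of the two criteria and does not reproduce PLINK; the function name passes_filters and the encoding of genotypes as alternate-allele counts (0, 1, 2, with None for missing calls) are assumptions made for this example.

```python
def passes_filters(genotypes, max_missing=0.10, min_maf=0.05):
    """Check one variant against missingness and minor allele frequency.

    genotypes: list of alternate-allele counts (0, 1, 2) or None if missing
    """
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n == 0 or not called:
        return False
    missingness = (n - len(called)) / n
    if missingness > max_missing:          # drop variants missing in > 10%
        return False
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)      # frequency of the rarer allele
    return maf >= min_maf


# Hypothetical variants across ten individuals
common = [0, 1, 2, 0, None, 0, 0, 0, 0, 0]
monomorphic = [0] * 10
sparse = [None, None, 1, 1, 0, 0, 2, 0, 0, 0]
print(passes_filters(common), passes_filters(monomorphic), passes_filters(sparse))
```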

To infer the ancestry proportions of the Tanzanian TB patients, we used ADMIXTURE (version 1.3.0) (48). We estimated the number of ancestral populations (K) by running ADMIXTURE 15 times for each value of K from K=2 to K=29 with the option --cv. The --cv option performs 5-fold cross-validation and allows identification of the value of K resulting in the lowest cross-validation error. The cross-validation error was lowest for K = 24. From the 15 runs performed with K = 24, we selected a representative of the output that was supported by most of the 15 runs (6/15) to extract the ancestry proportions of each individual. A PCA was performed using PLINK (version 1.9b) on all African populations included.
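The selection of K can be sketched in Python as follows. ADMIXTURE writes a line of the form "CV error (K=24): 0.51234" to its log when run with --cv; the sketch collects these lines from several log texts and picks the K with the lowest mean error. Averaging over replicate runs is an assumption made here, since the text does not state how the 15 runs per K were combined, and the function name best_k and the toy values are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Pattern of the cross-validation line in an ADMIXTURE log
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_texts):
    """Return (K with lowest mean CV error, dict of mean CV error per K)."""
    errors = defaultdict(list)
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            errors[int(k)].append(float(err))
    mean_err = {k: mean(v) for k, v in errors.items()}
    return min(mean_err, key=mean_err.get), mean_err


# Toy log excerpts (hypothetical values)
logs = [
    "CV error (K=2): 0.60",
    "CV error (K=2): 0.62",
    "CV error (K=3): 0.55",
    "CV error (K=3): 0.57",
    "CV error (K=4): 0.58",
]
k, table = best_k(logs)
print(k, table)
```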

Spatial interpolation of human genetic ancestry proportions

To visualize the distributions of the different patient ancestries, we performed spatial interpolation using the OrdinaryKriging function from the pykrige module in Python (variogram_model = “linear”, grid space of 500). For the Eastern and Southern Bantu ancestries, we included all African populations for the interpolation. For the Western Bantu ancestry, the interpolation failed with an insufficient slope when using all African populations, suggesting little spatial variability. Because many populations were sampled in the region with the highest proportions of Western Bantu genetic ancestry, among them many hunter-gatherer populations with little to no Western Bantu genetic ancestry, we repeated the interpolation using only a subset of the non-hunter-gatherer populations.

Correlation between distance matrices

To assess whether MTBC genotypes that are more closely related tend to infect people who are also more closely related genetically, which would be compatible with co-evolution, we investigated the correlation between the human and bacterial distance matrices. To calculate the pairwise bacterial genetic distances, alignments of variable positions where data were missing in less than 10% of the genomes were generated and used to create SNP distance matrices according to the Hamming distance (https://git.scicore.unibas.ch/TBRU/tacos). Insertions and deletions were considered as missing data. To obtain the human pairwise genetic distances, we calculated the Euclidean distance based on the first two principal components. When looking only at the Tanzanian TB patients, we calculated the principal components for the Tanzanians only, whereas for the continental dataset we included all available African populations.
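The two distance measures can be sketched in a few lines of Python. The sketch assumes that a position is skipped in a pairwise comparison when either genome has missing data there (encoded here as "N" or "-"); the text does not specify this detail, and the function names are hypothetical rather than taken from the tool linked above.

```python
from math import hypot

# Characters treated as missing data (assumption for this sketch)
MISSING = {"N", "-"}


def hamming(seq_a, seq_b):
    """Count positions that differ, skipping any position with missing data."""
    return sum(1 for a, b in zip(seq_a, seq_b)
               if a not in MISSING and b not in MISSING and a != b)


def pc_distance(pc_a, pc_b):
    """Euclidean distance between two individuals on the first two PCs."""
    return hypot(pc_a[0] - pc_b[0], pc_a[1] - pc_b[1])


# Toy alignment of variable positions and toy PC coordinates
print(hamming("ACGT-", "ACTTA"))
print(pc_distance((0.0, 0.0), (3.0, 4.0)))
```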

To investigate whether human populations that are geographically more distant are also genetically more distant, we calculated the correlation between the geographical and the genetic distance at the African level as well as at the level of Tanzania. For the geographical distance matrix, we calculated the Euclidean distance based on the latitude and the longitude. At the level of Tanzania, the broad geographic location of the original area of the ethnic group was considered (Supplementary Table S1). At the continental level, the coordinates of the hospital in Temeke were used for the Tanzanian TB patients, considering that only the sampling locations were known for the other studies. The human and bacterial genetic distance matrices, as well as the human genetic and geographic distance matrices, were tested for correlations using the mantel.test() function from the mantel module in Python (options: perms = 10000, method = “pearson”, tail = “upper”).
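For readers unfamiliar with the Mantel test, the following self-contained Python sketch shows the underlying idea: the Pearson correlation between the off-diagonal entries of two distance matrices is compared against correlations obtained after jointly permuting the rows and columns of one matrix. This is an illustrative reimplementation under simple assumptions, not the mantel module used in the analysis; the function name mantel_upper and the toy matrix are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def _pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel_upper(d1, d2, perms=10000, seed=0):
    """Return (observed r, upper-tail permutation p-value)."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))   # upper-triangle index pairs
    x = [d1[i][j] for i, j in pairs]
    r_obs = _pearson(x, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 1                                  # observed arrangement counts once
    for _ in range(perms):
        rng.shuffle(order)                    # relabel individuals in d2
        y = [d2[order[i]][order[j]] for i, j in pairs]
        if _pearson(x, y) >= r_obs:
            hits += 1
    return r_obs, hits / (perms + 1)


# Toy example: five points on a line, compared with themselves
pts = [0, 1, 2, 3, 4]
d = [[abs(a - b) for b in pts] for a in pts]
r, p = mantel_upper(d, d, perms=2000)
print(round(r, 3), p)
```

Permuting whole rows and columns together, rather than individual entries, preserves the dependence among distances that share an individual, which is the reason a permutation test is used instead of a standard correlation test.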

Statistical analyses

Clinical and sociodemographic characteristics, MTBC genotype as well as human genetic ancestry were summarized by the different disease severity measures using proportions or means and standard deviations. Similarly, the human genetic ancestries were summarized by MTBC genotype.

To estimate the associations between disease severity and the explanatory variables bacterial genotype (binary, belonging to Introduction 10 or not) and human ancestries (Southeastern Bantu, Eastern Bantu, Western Bantu), we used a logistic regression model. We included three variables as proxies for disease severity: lung damage based on X-ray score (mild versus severe), TB-score (mild versus severe), and bacterial load (continuous, log10-transformed). We tested only for Introduction 10 because there were few observations of the other MTBC genotypes. To account for the compositional nature of the human ancestries (i.e. that they sum to 1), we used the additive log-ratio transformation from the R package ‘compositions’ (98). The ancestry proportions were transformed and categorized, with category 1 comprising the lowest amount of the respective ancestry and category 3 (4 in the case of Western Bantu) the highest. The categories were chosen to have roughly equal numbers of patients in each. We used categories to allow for a non-linear relationship without specifying polynomials, requiring a large sample size, or complicating interpretation, but we recognize that a small amount of information is lost. Similar results were also obtained with other parameterizations. We assessed whether there was an interaction between ancestry and genotype on TB severity using the likelihood ratio test. For that, we compared a model including the interaction between the MTBC genotype and human genetic ancestry to a model without the interaction using the ‘lmtest’ package (99). The estimates were adjusted for age, sex, smoking, and the number of weeks with cough. Only HIV-negative patients were included in the analysis. All statistical analyses were carried out in R (version 4.1.2). Code for the statistical analysis can be found at https://github.com/mzwyer/TB-Dar_Mtb.
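The compositional transformation and the subsequent categorization can be illustrated with a short Python sketch. It is not the R code used for the analysis: the choice of the last component as the reference, the rank-based binning into equally sized groups, and the function names are all assumptions made for this example.

```python
from math import log


def alr(proportions):
    """Additive log-ratio: log of each part relative to the last part."""
    ref = proportions[-1]
    return [log(p / ref) for p in proportions[:-1]]


def equal_size_categories(values, n_categories=3):
    """Assign categories 1..n_categories with roughly equal group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    cats = [0] * len(values)
    for rank, i in enumerate(order):
        cats[i] = rank * n_categories // len(values) + 1
    return cats


# Hypothetical ancestry proportions of one patient (summing to 1)
print(alr([0.2, 0.3, 0.5]))
# Hypothetical transformed values for six patients
print(equal_size_categories([0.1, 0.5, 0.3, 0.9, 0.7, 0.2]))
```

The transformation requires strictly positive proportions, because the logarithm of zero is undefined.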

Acknowledgements

Calculations were performed at the sciCORE (http://scicore.unibas.ch/) scientific computing core facility at the University of Basel. We thank all TB-DAR staff and study participants.

Data availability

The bacterial WGS data can be found under the bioprojects PRJEB49562 and PRJNA670836 on ENA, and the human WGS and genotyping data under EGAS00001005850 and EGAS00001007216, respectively. Clinical data are available upon request.

Funding statement

This work was supported by the Swiss National Science Foundation (Grant No: CRSII5_177163, CRSII5_213514 and 320030-227432) and the European Research Council (Grant: 883582). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.