Research Article

In-host population dynamics of Mycobacterium tuberculosis complex during active disease

Department of Systems Biology, Harvard Medical School, United States
Department of Biomedical Informatics, Harvard Medical School, United States
Center for Genes, Environment and Health, National Jewish Health, United States
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, United States
Icahn Institute of Data Sciences and Genomics Technology, United States
Mycobacteriology Reference Laboratory, Advanced Diagnostic Laboratories, National Jewish Health, United States
College of Public Health, University of South Florida, United States
Morsani College of Medicine, University of South Florida, United States
Pulmonary and Critical Care Medicine, Massachusetts General Hospital, United States

Feb 1, 2021

Open access

Abstract

Tuberculosis (TB) is a leading cause of death globally. Understanding the population dynamics of TB’s causative agent Mycobacterium tuberculosis complex (Mtbc) in-host is vital for understanding the efficacy of antibiotic treatment. We use longitudinally collected clinical Mtbc isolates that underwent Whole-Genome Sequencing from the sputa of 200 patients to investigate Mtbc diversity during the course of active TB disease after excluding 107 cases suspected of reinfection, mixed infection or contamination. Of the 178/200 patients with persistent clonal infection >2 months, 27 developed new resistance mutations between sampling with 20/27 occurring in patients with pre-existing resistance. Low abundance resistance variants at a purity of ≥19% in the first isolate predict fixation in the subsequent sample. We identify significant in-host variation in 27 genes, including antibiotic resistance genes, metabolic genes and genes known to modulate host innate immunity and confirm several to be under positive selection by assessing phylogenetic convergence across a genetically diverse sample of 20,352 isolates.

Introduction

Tuberculosis (TB) and its causative pathogen Mycobacterium tuberculosis complex (Mtbc) remain a major public health threat (World Health Organization, 2018). Yet the majority of individuals exposed to Mtbc clear or contain the infection, and only 5–10% of those infected develop active TB disease at some point in their lifetime (Pai et al., 2016). While basic human immune mechanisms to Mtbc have been identified, attempts at effective vaccine development guided by these mechanisms have repeatedly failed (Ernst, 2018). Hence, global efforts in disease control currently focus on scale up of directly observed therapy but achieving a universal and sustained cure remains a challenge. Mtbc is an obligate human pathogen (Gagneux, 2018). Infection and disease involve a complex human host-pathogen interaction that is both physically and temporally heterogeneous (Lin et al., 2014). Consequently, all selective forces acting on Mtbc will originate within the host, and the study of temporal dynamics of this is likely to inform antibiotic treatment (Sun et al., 2012) and rational vaccine design (Ernst, 2018).

Little is known about selection at short timescales, such as within single infections. Drug pressure may select for resistance-conferring mutations, thus an understanding of how the frequency of minor alleles changes longitudinally can inform optimal drug treatment (Didelot et al., 2016; Sun et al., 2012; Zhang et al., 2016). Mtbc’s interaction with host immunity or metabolic pressures imposed by persistent active human infection may also exert selective pressures, the detection of which can inform vaccine design or host directed therapeutics. To elucidate these temporal dynamics, we aimed to study how genomic diversity arises in-host in Mtbc populations, employing a longitudinal sampling scheme from patients with active TB disease enriched for treatment failure and relapse.

The application of genome sequencing technologies to Mtbc isolates cultured from clinical samples has highlighted that infection consists of populations of Mtbc bacteria rather than single clones devoid of diversity (Copin et al., 2016; Didelot et al., 2016; Lieberman et al., 2014; Lieberman et al., 2011; Marvig et al., 2015). Differences in observed allele frequencies captured using genome sequencing (Figure 1A) may represent a difference in the genetic composition of the infecting population, commonly referred to as heterogeneity. Mtbc population heterogeneity might be present within a host because (1) the host is infected with multiple strains or is re-infected by a new strain (consistent with mixed infection or re-infection) or (2) genetic diversity arises within the Mtbc population during infection due to selection or drift (Ford et al., 2012; Guerra-Assunção et al., 2015; Lieberman et al., 2016). However, non-uniform sampling (Trauner et al., 2017), selection during the in vitro culture process (Trauner et al., 2017), laboratory contamination (Goig et al., 2020; Wyllie et al., 2018), sequencing error and mapping error all represent examples of experimental error that give rise to heterogeneity of low significance to host-pathogen interactions.

Figure 1

Selection of patients with longitudinal clonal infection.

(A) Allele frequency change between paired isolates: $\Delta AF = |AF_{1}^{A} - AF_{2}^{A}| = |AF_{1}^{B} - AF_{2}^{B}|$. (B) The F2 measure >0.04 (Materials and methods) was used to identify and exclude isolate pairs with evidence for mixed strain growth at any time point. (C) Replicate and longitudinal pairs with fixed SNP (fSNP) distance of >7 were excluded. For longitudinal isolates, fSNP >7 was assessed as consistent with Mtbc reinfection with a different strain.

Here, we present a framework to overcome these barriers and demonstrate the use of longitudinally collected isolates, pooled sweeps of colonies cultured from sputa, to investigate true in-host diversity with implications for Mtbc treatment. We analyzed 614 paired longitudinal isolates representing 307 patients from eight studies (Bryant et al., 2013; Casali et al., 2016; Guerra-Assunção et al., 2015; Trauner et al., 2017; Walker et al., 2013; Witney et al., 2017; Xu et al., 2018). Many patients, despite undergoing treatment, remained culture positive at intervals of 2 months or longer, meeting microbiological criteria for delayed culture conversion, treatment failure or relapse (Supplementary file 1–2). Our sample consisted of 178 patients fulfilling these criteria, which allowed us to overcome the small sample size problem present in prior studies. We provide a proof of concept that whole-genome sequencing (WGS) can aid in predicting resistance amplification and demonstrate that, in addition to loci involved in the acquisition of antibiotic resistance, loci implicated in modulation of innate host immunity appear to be under positive selection.

Results

Identifying clonal Mtbc populations in-host

Of the 307 patients with longitudinal samples collected (Supplementary file 1–2), 32 patients had evidence for isolate microbiological contamination at any time point (Goig et al., 2020) and were excluded. We found evidence for mixed infection with two or more Mtbc lineages (Wyllie et al., 2018) for 31 patients (Figure 1B and Figure 1—figure supplement 1); 44 patients had evidence for re-infection with a different Mtbc strain between the first and second time points, using a pairwise genetic distance >7 fixed SNPs (fSNPs) (Materials and methods, Figure 1C and Figure 1—figure supplement 1). Median fSNP distance for the 44 patients identified as reinfection was 708 (IQR 250–1086). The remaining 200 patients were accordingly identified as having persistent or relapsed clonal infection (Supplementary file 3). Isolates from these infections spanned five of the eight known Mtbc lineages (Figure 5A). We implemented WGS SNP calling filters to minimize the likelihood of false positive SNP calls and validated calls with simulation and PacBio long-read data. We required that no indels be present in any of the reads supporting any SNP call, dropped SNP calls in repetitive regions and enforced a read depth ≥25x and alternate allele depth of ≥5 reads. We estimated the false error rate of our analysis pipeline for detecting allele frequency changes between sampling times at ≤0.053 using a control dataset of 82 isolate pairs (162 total) that were in vitro technical or biological replicates (Materials and methods, Figure 4 and Figure 1—figure supplement 1).
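The SNP calling filters described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `SnpCall` record and its field names are hypothetical, and only the numeric cutoffs (read depth ≥25x, alternate allele depth ≥5 reads) and the two exclusion rules (no indels in supporting reads, no calls in repetitive regions) are taken from the text.

```python
from dataclasses import dataclass

MIN_DEPTH = 25      # minimum total read depth at the site (from the text)
MIN_ALT_READS = 5   # minimum reads supporting the alternate allele (from the text)


@dataclass
class SnpCall:
    """Hypothetical record for one candidate SNP call in one isolate."""
    position: int           # reference genome coordinate
    depth: int              # total reads covering the site
    alt_reads: int          # reads supporting the alternate allele
    indel_in_support: bool  # True if any read supporting the call carries an indel
    in_repeat: bool         # True if the site lies in a repetitive region


def passes_filters(call: SnpCall) -> bool:
    """Return True only if the call meets every criterion listed in the text."""
    return (
        call.depth >= MIN_DEPTH
        and call.alt_reads >= MIN_ALT_READS
        and not call.indel_in_support
        and not call.in_repeat
    )


def alt_allele_frequency(call: SnpCall) -> float:
    """Alternate allele frequency as the fraction of reads supporting the allele."""
    return call.alt_reads / call.depth


# Toy calls (invented values) to show the filter in use.
calls = [
    SnpCall(position=1000, depth=60, alt_reads=12, indel_in_support=False, in_repeat=False),
    SnpCall(position=2000, depth=20, alt_reads=10, indel_in_support=False, in_repeat=False),
    SnpCall(position=3000, depth=80, alt_reads=3, indel_in_support=False, in_repeat=False),
    SnpCall(position=4000, depth=90, alt_reads=45, indel_in_support=True, in_repeat=False),
]
kept = [c for c in calls if passes_filters(c)]
```

In this toy input only the call at position 1000 survives; the others fail on depth, alternate allele support and indel-bearing reads, respectively.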

In-host pathogen dynamics in antibiotic resistance loci

Of the 200 patients with clonal infection, we had complete treatment data on 127 patients. Six of the 127 patients had isolates sampled <2 months apart, and the remaining 121 had an outcome at the second sampling consistent with delayed culture conversion, failure or relapse of their clonal infection, hereafter referred to as treatment failure for brevity. Treatment regimen details are provided in Supplementary file 2. Of the other 73 patients, 49 patients were sampled ≥2 months apart during treatment but regimen details or interruptions were not available; for these patients the outcome may have been either default or failure. The remaining 24 patients had inadequate treatment data to confirm treatment outcome. We conducted all analyses focused on antibiotic resistance loci on the 200 patients with isolate date data and separately on the 121-patient subset with confirmed failure (the latter detailed in Appendix 3). For all 200 cases, the order of sampling was available, but for 195/200 (119/121 confirmed failure patients) we also had the exact dates of sampling, which were required for some analyses.

Resistance mutations found at low frequencies in-host may indicate the impending development of clinical resistance (Sun et al., 2012; Trauner et al., 2017; Zhang et al., 2016). To investigate temporal dynamics related to antibiotic pressure, we identified non-synonymous and intergenic SNPs within a set of 36 predetermined loci associated with antibiotic resistance (Farhat et al., 2016; Farhat et al., 2013; Supplementary file 5) that changed in allele frequency by ≥5% (Sun et al., 2012), while requiring that support of the alternate allele was ≥5 reads at each time point (Materials and methods). We detected 1939 such SNPs across our sample of 200 patients (Figure 2B): 1774 were non-synonymous, 91 were intergenic, and 74 occurred within the rrs region (Supplementary file 6).

Figure 2

Allele frequency dynamics within antibiotic resistance loci.

(A) The antibiotic resistance genes *embB*, *katG*, *gyrA*, *ethA*, and *pncA* demonstrate evidence for competing clones during infection (other examples found are displayed in Figure 2—figure supplement 1). Each mutant allele is labeled with amino acid encoded by the reference allele, H37Rv codon position, and amino acid encoded by the mutant allele. (B) The allele frequency trajectories for SNPs that occur in patients over the course of infection can be used to study the prediction of further antibiotic resistance using the frequency of alternate alleles detected in the longitudinal isolates collected from patients. (C) Plot of true positive rate (TPR) and false positive rate (FPR) for detecting eventual fixation of a resistance allele ($AF_{2} \geq 75\%$) as a function of initial allele frequency ($AF_{1}$).

We searched for signs of selection by identifying clonal interference, or evidence of competition between strains with different drug resistance mutations (Sun et al., 2012; Trauner et al., 2017; Zhang et al., 2016). We characterized this in longitudinal isolates fulfilling three criteria: (i) isolates containing multiple resistance SNPs in the same gene within the same patient, (ii) at alternate allele frequencies that change in opposing directions over time, and (iii) the alternate (mutant) allele frequency was intermediate to high at ≥40% in at least one isolate (Farhat et al., 2016) for at least one of the co-occurring SNPs. This identified 11 cases of clonal interference (Figure 2A and Figure 2—figure supplement 1), demonstrating most often the fixation of a single allele in the second isolate from a mixture of multiple alleles at lower frequencies in the first isolate collected.
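The three criteria above translate naturally into a small check per patient. The sketch below is illustrative only: the input layout (a list of `(gene, af1, af2)` tuples per patient) and the function name are hypothetical, and applying criterion (iii) to each pair of co-occurring SNPs is one reading of the text rather than a confirmed detail of the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

HIGH_AF = 0.40  # intermediate-to-high allele frequency cutoff from criterion (iii)


def clonal_interference_genes(snps):
    """Return the set of genes showing clonal interference in one patient.

    snps: iterable of (gene, af1, af2) tuples, where af1 and af2 are the
    alternate allele frequencies (0-1) in the first and second isolate.
    """
    by_gene = defaultdict(list)
    for gene, af1, af2 in snps:
        by_gene[gene].append((af1, af2))

    flagged = set()
    for gene, trajectories in by_gene.items():
        # Criterion (i): at least two resistance SNPs in the same gene.
        for (a1, a2), (b1, b2) in combinations(trajectories, 2):
            # Criterion (ii): frequencies move in opposing directions over time.
            opposing = (a2 - a1) * (b2 - b1) < 0
            # Criterion (iii): at least one SNP reaches >= 40% in at least one isolate.
            high = max(a1, a2, b1, b2) >= HIGH_AF
            if opposing and high:
                flagged.add(gene)
                break
    return flagged


# Toy patient (invented values): two competing katG alleles and a single gyrA allele.
patient_snps = [
    ("katG", 0.30, 0.95),
    ("katG", 0.45, 0.00),
    ("gyrA", 0.10, 0.20),
]
interfering = clonal_interference_genes(patient_snps)
```

Here only `katG` is flagged: one allele rises while the other falls, and at least one of them exceeds 40% in an isolate. The lone `gyrA` SNP cannot satisfy criterion (i).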

Allele frequency >19% predicts subsequent fixation of resistance variants

We aimed to measure the lowest AR allele frequency that can accurately predict the fixation of resistance alleles later in time (Dreyer et al., 2020; Sun et al., 2012; Zhang et al., 2016). We examined all 1939 SNPs that varied by at least 5% in allele frequency (AF), and discarded 20 SNPs that were fixed at AF >75% in both isolates. We calculated the true positive rate (TPR) and false positive rate (FPR) for varying values of AF at the first time point, $AF_{1} \in \{0, 1, 2, \dots, 99, 100\}\%$ (Figure 2C, Materials and methods), allowing a maximum FPR of 5%. We found the optimal classification threshold to be $AF_{1}^{*} = 19\%$ with an associated sensitivity of 27.0% and a specificity of 95.8%. Of the total 37 alleles that became fixed at the second time point, 10 (from seven patients) had a frequency between 19% and 75% at the first time point, two were detected at the first time point but had AF <19%, and the remaining majority, or 25, were undetectable (i.e. had support of <5 reads) at the first time point. Taken together, we find a high turnover of low-frequency alleles in loci associated with antibiotic resistance but that mutant alleles in these loci that rise to a frequency of 19% are predicted to fix in-host with a sensitivity of 27.0% and specificity of 95.8%.
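The threshold search described above can be illustrated with a short sweep over integer AF cutoffs. This is a sketch under stated assumptions: the function names and toy data are invented, fixation is taken as a second-isolate AF of at least 75% (as in the Figure 2C caption), and the selection rule (the lowest cutoff whose FPR stays within the 5% cap, which also maximizes TPR because TPR cannot increase as the cutoff rises) is an assumed reading of "optimal"; the exact procedure is in Materials and methods.

```python
FIXATION_AF = 75  # percent; an allele counts as fixed when AF at time 2 is >= 75%


def tpr_fpr(alleles, threshold):
    """Compute (TPR, FPR) when alleles with af1 >= threshold are predicted to fix.

    alleles: list of (af1, af2) pairs, both in percent.
    """
    tp = fp = fn = tn = 0
    for af1, af2 in alleles:
        predicted = af1 >= threshold
        fixed = af2 >= FIXATION_AF
        if predicted and fixed:
            tp += 1
        elif predicted:
            fp += 1
        elif fixed:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def pick_threshold(alleles, max_fpr=0.05):
    """Return (threshold, TPR, FPR) for the lowest cutoff with FPR <= max_fpr."""
    for threshold in range(0, 101):
        tpr, fpr = tpr_fpr(alleles, threshold)
        if fpr <= max_fpr:
            return threshold, tpr, fpr
    return None  # no cutoff satisfies the FPR cap


# Toy data (invented): three alleles that fix, four that do not.
toy_alleles = [(0, 80), (30, 95), (50, 100), (5, 0), (8, 10), (10, 0), (12, 20)]
result = pick_threshold(toy_alleles)
```

For the reported data, the sensitivity follows directly from the counts in the text: 10 of the 37 alleles that fixed were at or above the 19% cutoff at the first time point, and 10/37 ≈ 27.0%.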

Determinants of antibiotic resistance acquisition and microbiological treatment failure

We identified overall rates of resistance acquisition by focusing on AR SNPs with moderate to high ΔAF ≥ 40% given prior evidence of association between such SNPs and phenotypic resistance (Farhat et al., 2016).

Twenty-seven AR SNPs were acquired in the 178 patients with persistent or relapsed clonal infection ≥2 months (Figure 3B). Among the set of 119 patients with confirmed failure and known isolate sampling date, 9% (11/119) of these patients acquired ≥1 AR SNP. Of the 11, 9 patients received fewer than four effective drugs. We examined the relationship between pre-existing resistance and new AR acquisition. Pre-existing resistance was defined as ≥1 fixed AR SNPs in the first isolate (Farhat et al., 2016) (Materials and methods). Two hundred fifty-nine pre-existing AR SNPs were identified, with 41% (73/178) of failure patients harboring resistance to any drug at the first sampling (Figure 3B, Supplementary file 7). The majority of this resistance was MDR (multidrug resistance to at least isoniazid and a rifamycin), 64% (47/73) (Figure 3C). New resistance acquisition occurred mostly in patients with pre-existing resistance, 20/27 (74%) ($OR = 5.28$, $P = 2.2 \times 10^{-4}$, Fisher’s exact test), or pre-existing MDR ($OR = 3.85$, $P = 3.4 \times 10^{-3}$, Fisher’s exact test). Among the set of 195/200 patients with clonal infection and sampling date, AR acquisition was more likely as the time between sampling increased, with the OR of AR acquisition being 1.023 per 30-day increment (95% CI 1.002–1.045, $P = 0.035$, logistic regression).
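The association between pre-existing resistance and new resistance acquisition can be checked with a small, dependency-free Fisher's exact test. The 2×2 table below is inferred from counts in the text (73 of 178 patients with pre-existing resistance; 27 with new acquisition, 20 of them with pre-existing resistance) and is an assumption about how the authors tabulated the data. The two-sided p-value sums the probabilities of all tables at least as extreme as the observed one, which is a common convention but not necessarily the one the authors used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums hypergeometric
    probabilities of all tables no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x in the top-left cell with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value


# Rows: pre-existing resistance yes / no. Columns: acquired new AR SNP yes / no.
# Counts inferred from the text: 20 + 53 = 73 with pre-existing resistance,
# 7 + 98 = 105 without, 20 + 7 = 27 with new acquisition.
odds_ratio, p_value = fisher_exact_two_sided(20, 53, 7, 98)
```

With this table the sample odds ratio is (20 × 98) / (53 × 7) ≈ 5.28, matching the value reported above; the computed p-value can be compared against the reported one.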

Figure 3

Pre-existing resistance is associated with resistance amplification.

We called heterozygous SNPs (hSNP) in each isolate from a patient with clonal infection classified as failing treatment (N = 178). We defined hSNPs as a SNP called in an isolate with an alternate allele frequency between 25% and 75% (Materials and methods). (A) The number of hSNPs called in the second sample isolated vs the number of hSNPs called in the first sample isolated from each of 178 patients (median T1 = 13.5 hSNPs, median T2 = 13.5 hSNPs). The dashed line is y = x. Red denotes 27/178 patients who had an antibiotic resistance in-host SNP arise between sampling (median T1 = 15.0 hSNPs, median T2 = 11.0 hSNPs), blue denotes 5/178 patients who had a putative host-adaptive in-host SNP (Rv1944c, Rv0095c, *PPE18*, *PPE54*, *PPE60*) arise between sampling (median T1 = 19.0 hSNPs, median T2 = 6.0 hSNPs). (**B–C**) Among patients who fail treatment, (B) patients with pre-existing mutations that confer antibiotic resistance and (C) those that have pre-existing MDR are more likely to acquire antibiotic resistance mutations throughout the course of infection.

We also quantified genome-wide Mtbc diversity in-host among the patients with persistent or relapsed infection for ≥2 months. We reasoned that if these patients are not on or not adherent to effective antibiotic treatment, their effective pathogen population size may be large and prone to more genetic drift or turnover of minority variants with and without selection (Trauner et al., 2017). We counted the number of SNPs with an alternate allele frequency between 25% and 75% (Materials and methods) at each time point as a conservative estimate of the number of segregating sites in each population. We found this count to correlate strongly between the first and second time points (Figure 3A), suggesting that minor allele diversity is maintained in-host in patients without effective therapy (median T1 = 13.5 hSNPs, median T2 = 13.5 hSNPs, $r^{2} = 0.426$, $P = 5.97 \times 10^{-23}$, linear regression).

Genome-wide in-host diversity

Beyond antibiotic pressure, selective forces acting on the infecting Mtbc population in-host are largely unknown. To investigate this reliably across the entire Mtbc genome, we first examined the genome-wide allele frequency distribution for both replicate pairs (in vitro technical or biological replicates, sample size m = 62 after exclusions) and in-host longitudinal pairs (Figure 4 and Figure 1—figure supplement 1). We detected five SNPs in glpK (with ΔAF ≥ 25%) among five replicate pairs (mean ΔAF = 45%), consistent with an adaptive role for glpK mutations in vitro (Pethe et al., 2010; Vargas and Farhat, 2020), and accordingly excluded this gene from further analysis (Materials and methods). The genome-wide AF distribution in both replicate and longitudinal pairs demonstrated an abundance of SNPs with low ΔAF, likely resulting from noise or technical factors. To clearly distinguish signal related to in-host factors from noise, we determined the ΔAF threshold above which SNPs/isolate-pair were rare among technical replicates, that is, constituted 5% or less of total SNPs (Figure 4). We determined this ΔAF threshold to be 70% and selected 174 SNPs that developed in-host (in-host SNPs) among the 200 TB cases (Figure 4C, Supplementary file 8). Using archived Mtbc isolates, we observe that changes in allele frequency are common among replicate isolates and that changes in frequency of ≥70% are indicative of in-host evolution.
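The logic of choosing a ΔAF cutoff from replicate pairs can be sketched as follows. This is a simplified reading of the procedure, not the authors' code: function names and toy data are invented, and the weighting (SNPs per pair in each group, then the replicate share of the combined per-pair rate) is an interpretation of the Figure 4 description. The acceptable replicate share is left as a parameter.

```python
def replicate_share(rep_dafs, long_dafs, n_rep_pairs, n_long_pairs, threshold):
    """Share of the per-pair SNP rate at or above `threshold` that comes from replicates.

    rep_dafs, long_dafs: ΔAF values (percent) for every SNP detected in
    replicate pairs and longitudinal pairs, respectively.
    """
    rep_rate = sum(d >= threshold for d in rep_dafs) / n_rep_pairs
    long_rate = sum(d >= threshold for d in long_dafs) / n_long_pairs
    total = rep_rate + long_rate
    return rep_rate / total if total else 0.0


def noise_threshold(rep_dafs, long_dafs, n_rep_pairs, n_long_pairs,
                    max_share=0.05, grid=range(25, 101)):
    """Lowest ΔAF cutoff on the grid where replicates contribute <= max_share."""
    for threshold in grid:
        share = replicate_share(rep_dafs, long_dafs, n_rep_pairs, n_long_pairs, threshold)
        if share <= max_share:
            return threshold
    return None  # replicate noise never drops below the allowed share


# Toy data (invented): replicate SNPs only reach modest ΔAF values,
# whereas longitudinal SNPs extend up to 100%.
toy_rep = [26, 30, 35, 45]
toy_long = [30, 50, 72, 80, 95, 100]
cutoff = noise_threshold(toy_rep, toy_long, n_rep_pairs=2, n_long_pairs=3)
```

In the toy input the cutoff lands just above the largest replicate ΔAF. With real data the same sweep would be run over all SNPs from the replicate and longitudinal pairs.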

Figure 4

Replicate pairs reveal levels of biological noise associated with repeated sampling.

(**A, B**) We analyzed the distribution of ΔAF for all SNPs detected across all replicate pairs $(m = 62)$ and longitudinal pairs $(n = 200)$ for SNPs where ΔAF ≥25%. (B) SNPs were detectable at lower levels of ΔAF for both types of isolate pairs, but SNPs with higher values of ΔAF were only found in longitudinal pairs. (C) To determine a ΔAF threshold for calling SNPs representative of changes in bacterial population composition in-host, we calculated the average number of SNPs per pair of isolates at different ΔAF thresholds for both replicate and longitudinal pairs. At a ΔAF threshold of 70% the number of SNPs between replicate pairs represents 5.27% of the SNPs detected amongst all replicate and longitudinal pairs, weighted by the number of pairs in each group.

Characteristics of mutations in-host

Of the 174 SNPs, 112 were non-synonymous, 42 synonymous, and 21 were intergenic (Figure 5C). The 153/174 coding SNPs were distributed across 127/3,886 genes and were observed in 71/200 patients (Figure 5B and Figure 5D). We analyzed the spectrum of mutations and found the GC > AT nucleotide transition to be the most common. The GC > AT transition is putatively due to oxidative damage including the deamination of cytosine/5-methyl-cytosine or the formation of 8-oxoguanine (Dillon et al., 2015; Ford et al., 2011). The transversion AT > TA was the least common substitution (Figure 6A). We expected the number of SNPs detected between longitudinal isolates to increase with time between isolate collection. Regressing the number of SNPs per patient on the timing between isolate collection (for 195 patients with isolate collection dates) (Figure 6B), we found SNPs to accumulate at an average rate of 0.56 SNPs per genome per year ( $P = 7 \times 10^{- 12}$ ) consistent with prior in vivo estimates (Ford et al., 2011; Walker et al., 2013).

Figure 5

Genome-wide diversity in 200 clonal Mtbc infections.

(A) Distribution of five major Mtbc lineages among the 200 clonal Mtbc infections. (B) Distribution of 153 in-host SNPs within coding regions among the 200 longitudinal isolate pairs across the 4.41 Mbp Mtbc genome (blue circles: synonymous, red circles: non-synonymous). Blue and red circles on the innermost black ring indicate the locations of SNPs detected in one patient; circles on the next ring represent SNPs detected in two patients. The $-\log_{10}$(p-value) of the mutational density test (Materials and methods) by gene is plotted in the outermost, red and green, regions. Labeled yellow circles represent genes significant at the Bonferroni-corrected cutoff ($\alpha = 0.05/3{,}886$). (C) Distribution of ΔAF by SNP type: sSNP: synonymous, nSNP: non-synonymous, iSNP: intergenic. (D) Heat-map of SNPs per gene (rows) and patient (columns). Colored circles across columns indicate the strain phylogenetic lineage (as represented in A). Gene names are colored according to gene category (Figure 6D), with parentheses indicating the number of patients with a SNP in a given gene. *Indicates genes in which SNPs are detected within multiple patients.

Figure 6

PE/PPE genes vary considerably within host while putative antigens remain conserved.

(A) Mutational spectrum of in-host SNPs. (B) In-host SNP counts vs. time between isolate collection (195/200 patients with dates shown, *W [Walker et al., 2013] isolates only had year of collection). (C) Boxplots of nucleotide diversity by gene within each of five non-redundant categories (see text; $n =$ number of genes). (D) Average nucleotide diversity across genes by category. Nucleotide diversity in epitope and non-epitope regions (Materials and methods) of each gene in the Antigen (**E, F**) and PE/PPE (**G, H**) gene categories. (**I, J**) PE/PPE genes separated into three non-redundant categories: PE, PE-PGRS, and PPE. (I) Box plot of nucleotide diversity by gene. (J) The average nucleotide diversity by category.

Simulations and PacBio sequencing demonstrate a low false-positive rate in repetitive regions

Several of the SNPs detected were in the GC-rich repetitive PE/PPE gene family (Brennan and Delogu, 2002). Variants called in these genes are commonly excluded from comparative genomic analyses (Casali et al., 2016; Comas et al., 2010; Copin et al., 2016; Coscolla et al., 2015) due to the limitations of short-read sequencing data and the possibility of making spurious variant calls; however, the rate at which these false calls occur has not been evaluated. We reasoned that our stringent filtering criteria, quality of sequencing data and depth of coverage allowed us to reliably detect variants in these understudied regions of the genome.

We took several approaches to test the rate of false-positives for the single base-pair mutations observed in our analysis (Materials and methods). First, we introduced the mutant alleles observed in-host (Supplementary file 10) into a set of Mtbc reference genomes belonging to different lineages and simulated short read sequencing data from these modified genomes (Appendix 1—figure 1). We then used our variant calling pipeline to call bases from this simulated data. We observed a high recall rate of the introduced mutant alleles and a very low number of false positive base calls (zero in most cases) within the loci containing modified alleles (Appendix 1—figure 2). Second, we assessed the congruence in variant calls between short-read Illumina data and long-read PacBio data for a set of isolates that underwent sequencing with both technologies (Materials and methods). Unlike Illumina generated reads, PacBio reads are much longer and have randomly distributed error profiles (Rhoads and Au, 2015). With high coverage, PacBio sequencing can reliably reconstruct full microbial genomes and identify SNPs in repetitive regions. The comparison with PacBio assemblies confirmed empirically a low rate of false positive base calls in genomic regions where we observed in-host SNPs (Materials and methods). Third, we confirmed the five phylogenetically convergent in-host SNPs in PPE genes PPE18, PPE54, and PPE60 (see below) through manual inspection of the read alignment (Figure 5—figure supplements 1–3).

Antibiotic resistance and PE/PPE genes vary while antigens remain conserved

To understand how different classes of proteins evolve in-host, we separated Mtbc genes into five non-redundant categories (Materials and methods). The vast majority of genes in each category did not vary within patients (Figure 6C). Antibiotic resistance genes were on average the most diverse category while Essential genes varied the least (Figure 6D). Antigen genes appeared to be as conserved as were both Essential ( $P = 0.49$ Mann-Whitney U-test) and Non-Essential genes ( $P = 0.45$ Mann-Whitney U-test) while PE/PPE genes showed higher levels of nucleotide diversity than both Essential ( $P = 0.022$ Mann-Whitney U-test) and Non-Essential genes ( $P = 0.013$ Mann-Whitney U-test) (Figure 6D).

PE/PPE variation is independent of T-cell recognition

To test whether variation in Antigen or PE/PPE genes occurred in response to T-cell recognition, we separated each gene in these categories into (CD4⁺ and CD8⁺ T-cell) epitope and non-epitope concatenates and recalculated nucleotide diversity for these concatenates (Figure 6E–H). For both Antigen and PE/PPE genes (Figure 6F and Figure 6H), epitope concatenates were less diverse than non-epitope concatenates ( $P = 0.018$ and $P = 0.059$ , respectively, Mann-Whitney U-test). Only one in-host SNP was detected within an epitope-encoding region in the gene PPE18 (Figure 6G and Figure 7—figure supplement 2, Supplementary file 12). This suggests that T-cell recognition does not drive diversity in these regions. Looking within the three PE/PPE subfamilies (Figure 6I–J; Brennan, 2017), the PPE genes appeared more diverse in-host than PE genes and PE-PGRS genes ( $P = 0.019$ and $P = 0.033$ respectively, Mann-Whitney U-test).

Identifying candidate pathoadaptive loci from genome-wide variation

To identify genes involved in pathogen adaptation (Lieberman et al., 2011; Marvig et al., 2015), we applied a test of mutational density (Farhat et al., 2014; Materials and methods) by pooling variation across all 200 pairs of genomes and identifying those genes with more mutations than expected under a neutral model of evolution where variants are Poisson distributed across the genome (Farhat et al., 2014; Figure 5B, Supplementary file 13, Materials and methods). We also searched for evidence of convergent evolution, that is, genes or pathways where in-host SNPs developed in ≥2 patients (Supplementary file 14, Supplementary file 17). Seven known antibiotic resistance genes (Didelot et al., 2016; Farhat et al., 2013) had significant mutational density ($\alpha = 0.05$, Bonferroni correction) or were convergent across patients: rpoB, gyrA, katG, rpoC, embB, ethA and pncA (mutated in six, four, four, three, three, two, and one patient, respectively) (Figure 5B and Figure 5D). Single in-host SNPs occurred in eight additional known resistance loci including three intergenic regions, and in prpR, a gene recently implicated in drug tolerance (Hicks et al., 2018; Supplementary file 8).
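The idea behind a Poisson mutational density test can be illustrated with a simplified sketch. It assumes that, under neutrality, the expected SNP count in a gene is proportional to its length; the authors' exact null model is described in Materials and methods and may differ. The genome length (4.41 Mbp) and gene count (3,886) are taken from the Figure 5 caption, while the function names and the 1,500 bp example gene are hypothetical.

```python
from math import exp, lgamma, log

GENOME_LENGTH = 4_410_000  # bp, approximately 4.41 Mbp as stated for Figure 5
N_GENES = 3886             # number of genes tested
ALPHA = 0.05 / N_GENES     # Bonferroni-corrected significance cutoff


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summed directly from the upper tail."""
    if k <= 0:
        return 1.0
    # First term P(X = k), computed in log space for numerical stability.
    term = exp(-lam + k * log(lam) - lgamma(k + 1))
    total = 0.0
    i = k
    while True:
        total += term
        i += 1
        term *= lam / i
        # Stop once terms are decreasing and negligible relative to the sum.
        if i > lam and term <= total * 1e-16:
            break
    return min(total, 1.0)


def gene_pvalue(snps_in_gene, gene_length, total_snps):
    """P-value for seeing at least `snps_in_gene` SNPs in a gene of `gene_length` bp."""
    expected = total_snps * gene_length / GENOME_LENGTH
    return poisson_upper_tail(snps_in_gene, expected)


# Hypothetical 1,500 bp gene, with 153 coding in-host SNPs genome-wide.
p_three = gene_pvalue(3, 1500, 153)
p_four = gene_pvalue(4, 1500, 153)
significant = p_four < ALPHA
```

Under these assumptions, four SNPs in the hypothetical gene would pass the Bonferroni cutoff while three would not, which shows how few in-host SNPs are needed for a gene to stand out when the genome-wide count is small.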

Three genes of unknown function, Rv0139, Rv0895, and Rv1543, were convergent in two patients each, and two of these (Rv0139, Rv1543) had significant mutational density (p<2×10⁻⁵); three additional genes, including PPE60, also displayed significant mutational density (p<2×10⁻⁵) (Figure 5B, Supplementary file 13). We found evidence for convergence in six pathways not known to result in antibiotic resistance. These pathways are involved with biotin biosynthesis (fadD23, fadD29, and fadD30), ribosomal large subunit proteins (rpmB1, rplE, and rplY), glycerolipid and glycerophospholipid metabolism (aldA and Rv2974c), ESAT-6 protein secretion (eccCa1 and eccD1), coenzyme B12/cobalamin synthesis (cobH and cobK) and the uncharacterized pathway CBSS-164757.7.peg.5020 (fdxB and PPE18) (Supplementary file 17).

In-host mutations display phylogenetic convergence across multiple global lineages

We reasoned that pathoadaptive mutations observed to sweep to fixation in-host and not compromise pathogen transmissibility are likely to arise independently within other patients and in separate geographic regions in a convergent manner (Farhat et al., 2013). We screened a geographically diverse set of 20,352 sequenced clinical isolates belonging to global lineages 1–6 for mutations observed within host in which the alternate (mutant) allele swept over the course of sampling (141/174 in-host SNPs, Materials and methods, Supplementary file 8, Supplementary file 18). Conservatively, a mutation was characterized as phylogenetically convergent if it was present in isolates from three or more global lineages but not fixed in any lineage (Materials and methods). We identified 26/141 in-host SNPs as phylogenetically convergent in our global sample of isolates (Supplementary file 19). Figure 7 and Figure 7—figure supplements 1–2 display the distribution of convergent alleles across the 20,352 isolates using t-Distributed Stochastic Neighbor Embedding (t-SNE) of the pairwise genetic distance matrix (Materials and methods). The convergent alleles included the PPE genes PPE18 (1 site), PPE54 (1 site) and PPE60 (3 sites), as well as Rv0095c (2 sites) and Rv1944c (1 site), both conserved proteins of unknown function. They also included several SNPs in loci associated with antibiotic resistance: gyrB (1 site), gyrA (2 sites), rpoB (4 sites), rpoC (3 sites), inhA (1 site), embB (3 sites), and gid (1 site).
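The convergence criterion stated above (present in three or more global lineages, but not fixed in any) can be expressed as a short predicate. The sketch below is illustrative: the input dictionaries and lineage labels are hypothetical, and treating "fixed" as "carried by every screened isolate of that lineage" is an assumed interpretation; the precise definition is in Materials and methods.

```python
MIN_LINEAGES = 3  # a mutation must appear in at least three global lineages


def is_phylogenetically_convergent(carriers_by_lineage, isolates_by_lineage):
    """Apply the convergence rule to one mutation.

    carriers_by_lineage: lineage -> number of isolates carrying the mutant allele
    isolates_by_lineage: lineage -> total number of isolates screened
    """
    lineages_with_allele = [
        lineage for lineage, n in carriers_by_lineage.items() if n > 0
    ]
    # Assumed definition: fixed means every screened isolate in the lineage carries it.
    fixed_somewhere = any(
        total > 0 and carriers_by_lineage.get(lineage, 0) == total
        for lineage, total in isolates_by_lineage.items()
    )
    return len(lineages_with_allele) >= MIN_LINEAGES and not fixed_somewhere


# Toy screening totals (invented) for four lineages.
totals = {"L1": 100, "L2": 200, "L3": 50, "L4": 300}

scattered = {"L1": 2, "L2": 5, "L3": 0, "L4": 1}     # three lineages, never fixed
lineage_marker = {"L1": 100, "L2": 1, "L4": 1}       # fixed in L1
two_lineages = {"L2": 4, "L4": 9}                    # too few lineages

flags = [
    is_phylogenetically_convergent(m, totals)
    for m in (scattered, lineage_marker, two_lineages)
]
```

Excluding mutations fixed in a lineage helps avoid mistaking ancestral, lineage-defining variants for independently recurring ones.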

Figure 7

Mutations acquired in-host are phylogenetically convergent.

We constructed t-SNE plots from a pairwise SNP distance matrix for our global sample of 20,352 clinical isolates and 128,898 SNP sites (Materials and methods). (A) Labeling isolates by global lineage revealed that isolates cluster according to genetic similarity. Next, we labeled isolates by whether they carried a mutant allele that was also detected in-host. (**B–F**) Mutations in *gyrA*, *rpoB*, *PPE18*, *PPE54*, and *PPE60* were detected in-host (Supplementary file 8), occur in a global collection of isolates (Supplementary file 18) and are scattered across the t-SNE plots, indicating that they belong to genetically different clusters of isolates (Supplementary file 19). Furthermore, all mutations with a signal of phylogenetic convergence were detected in isolates belonging to different clusters, confirming that these mutations must have arisen independently in different genetic backgrounds (Figure 7—figure supplements 1–2). Each plot is labeled with the name of the gene in which each mutation occurs, amino acid encoded by the reference allele, H37Rv codon position, and amino acid encoded by the mutant allele. N = number of isolates with mutant allele.

Discussion

In our Mtbc populations sequenced from active TB patients enriched for negative treatment outcomes, we find a wealth of dynamics in genetic loci associated with antibiotic resistance, including a high turnover of minor variants. Known factors that determine treatment outcome are complex and include severity of lung disease, cavitation and adherence to treatment, among others (Imperial et al., 2018). Additionally, resistance acquisition in the course of one infection is comparatively rare in most pathogenic bacteria (Llewelyn et al., 2017). Here, we observe that 9% of patients with confirmed delayed culture conversion, failure and relapse amplify resistance over time. Our findings of a higher rate of resistance acquisition in patients with MDR at the outset and with longer time between sampling emphasize the importance of appropriately tailoring treatment regimens, as well as of close surveillance for microbiological clearance and resistance acquisition by phenotypic or genotypic means. The observed high rate of resistance acquisition also emphasizes Mtbc’s biological adaptability and the long duration of drug pressure in vivo. In addition to clonal acquisition of resistance, sequencing revealed a substantial proportion of mixed infection or reinfection (28% of samples collected ≥2 months apart). This high percentage suggests that patient treatment and control of disease transmission can be better guided if pathogen sequencing is routinely performed for cases with persistently positive cultures, especially in high TB prevalence settings where reinfection is more likely. Reinfection can also introduce strains with a different antibiotic susceptibility profile, requiring adjustment of the treatment regimen.

While prior studies have investigated the lowest resistance allele frequencies that can be detected in clinical sputum samples (Dreyer et al., 2020; Trauner et al., 2017), there is little information on the clinical relevance of these low frequency variants. We provide a proof-of-concept analysis showing that minor AR alleles occurring at a frequency ≥19% can predict subsequent fixation of the variant in-host with a specificity >95%, although we find the sensitivity of this threshold to be low. The sensitivity is low because the majority of alleles that sweep to fixation are not detectable at all at the first time point, suggesting that more frequent sampling may be needed. In the future, higher depth and more frequent sequencing can elucidate more clearly the role of minor AR allele detection in the clinical management of TB treatment.
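To make the trade-off between specificity and sensitivity concrete, the following minimal sketch evaluates a frequency threshold as a predictor of later fixation. The function name `threshold_performance` and the toy observations are hypothetical and are not taken from the study data; only the 19% cutoff comes from the text above.

```python
def threshold_performance(observations, threshold=0.19):
    """Sensitivity and specificity of 'first-sample frequency >= threshold'
    as a predictor that the allele is fixed in the subsequent sample.

    observations: iterable of (first_sample_frequency, fixed_later) pairs
    """
    tp = fp = tn = fn = 0
    for freq, fixed_later in observations:
        predicted_fixed = freq >= threshold
        if predicted_fixed and fixed_later:
            tp += 1
        elif predicted_fixed and not fixed_later:
            fp += 1
        elif not predicted_fixed and fixed_later:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: two alleles that later fix are undetectable
# (frequency 0.0) at the first time point, which lowers sensitivity.
toy = [
    (0.00, True), (0.00, True), (0.25, True),
    (0.05, False), (0.10, False), (0.30, False), (0.02, False),
]
sensitivity, specificity = threshold_performance(toy)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

In this toy example the alleles absent from the first sample count as false negatives regardless of the threshold chosen, which mirrors the explanation given above for the low sensitivity.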

Various sources of noise contribute to allele frequency changes over time and challenge inference on bacterial composition in vivo. Here, we determined an appropriate threshold for identifying mutations in-host using average depth Mtbc WGS from cultured isolates and demonstrate the importance of including technical replicate WGS. While culturing sputa in vitro enriches Mtbc DNA for WGS, it also creates experimental noise (Vargas and Farhat, 2020) and can purge some of the genetic diversity present in the sputum sample (Nimmo et al., 2019). The refinement of methods for DNA extraction directly from sputum (Votintseva et al., 2017) may allow the calling of relevant changes in allele frequencies at lower thresholds in future work. This would permit the unbiased study of loci that may be under frequency-dependent selection, where allele frequencies are unlikely to change by as much as the 70% threshold we used here.
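A threshold of this kind can be illustrated with a short sketch that flags sites whose alternate allele frequency changes by at least 70% between two samples from the same patient. The function name `call_in_host_snps`, the site labels and the frequencies are hypothetical, and the sketch omits the depth, quality and replicate-based filters described in Materials and methods.

```python
def call_in_host_snps(af_first, af_second, min_delta=0.70):
    """Return sites whose alternate allele frequency changes by >= min_delta.

    af_first, af_second: site label -> alternate allele frequency (0 to 1)
    Assumption: only sites with a frequency estimate in both samples are compared.
    """
    calls = {}
    for site in sorted(af_first.keys() & af_second.keys()):
        delta = af_second[site] - af_first[site]
        if abs(delta) >= min_delta:
            calls[site] = delta
    return calls


# Hypothetical allele frequencies for one patient's two isolates
first = {"site_a": 0.02, "site_b": 0.40, "site_c": 0.95, "site_d": 0.10}
second = {"site_a": 0.98, "site_b": 0.60, "site_c": 0.10}

calls = call_in_host_snps(first, second)
for site, delta in calls.items():
    print(site, round(delta, 2))
```

A locus under frequency-dependent selection might behave like `site_b` here, shifting by only 20 percentage points, and would therefore go uncalled at this threshold.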

We detected 174 alleles rising to near fixation in-host across our sample of 200 patients. The observed distribution of variants, including the high rate of non-synonymous substitutions and the predominance of GC > AT variants, is consistent with the hypotheses of purifying pressure on synonymous variants and oxidative DNA damage in Mtbc, respectively (Ford et al., 2011; Namouchi et al., 2012). This consistency adds validity to our variant calling approach. Overall, the observed diversity spared the CD4⁺ and CD8⁺ T cell epitope encoding regions of the genome, providing further evidence, now at short time scales, that host adaptive immunity does not drive directional selection in Mtbc genomes (Comas et al., 2010; Copin et al., 2014; Coscolla et al., 2015). Diversity was concentrated in antibiotic resistance regions and, strikingly, also in PE/PPE genes (Figure 6D; Phelan et al., 2016). Although previous studies have generally avoided reporting short-read variant calls in PE/PPE regions, we demonstrate the accuracy of the SNPs captured in our study using read simulation, visualization of Illumina read alignments and comparison with long-read sequencing data. We found PPE genes to be more diverse in-host than PE genes and detected a signal of positive selection acting on three genes belonging to the PPE sub-family (Figure 7). This indicates that PPE genes may play an important role in the process of host-adaptation.

In addition to identifying in-host variation in 12 loci known to be involved in the acquisition of antibiotic resistance, we identified six genes and six pathways that display diversity in-host and are not known to be associated with antibiotic resistance (Supplementary file 8, Supplementary file 13–14, Supplementary file 17). For a subset, we demonstrate that similar diversity has arisen independently in separate hosts and in strains with different genetic backgrounds, suggesting positive selection (Figure 7). Evidence of directional selection in Mtbc genomes has thus far been largely restricted to adaptation to antibiotic treatment (Brites and Gagneux, 2015; Didelot et al., 2016; Trauner et al., 2017). The novel pathways showing in-host convergence may be important for interactions between host and pathogen arising from either metabolic or immune pressure. Mtbc is one of a few types of bacteria that possess the capacity for de novo coenzyme B12/cobalamin synthesis, and this pathway has been implicated in Mtbc survival in-host and Mtbc growth (Rowley and Kendall, 2019). We identified four genetic variants that developed in three separate patients and in three consecutive genes from the same locus: cobG, intergenic cobG-cobH, cobH and cobK (Rv2064-Rv2067). This observation contributes to mounting evidence on the importance of this pathway for in vivo Mtbc survival and may have implications for drug development (Gopinath et al., 2013; Minias et al., 2018). Biotin biosynthesis is also relatively unique to mycobacteria and plays an important role in Mtbc growth, infection and host survival during latency (Salaemae et al., 2011). The other identified pathways include ESAT-6 protein secretion, known to play a role in the modulation of host immune response by disrupting the phagosomal membrane (Clemmensen et al., 2017).

The loci found to be phylogenetically convergent and not known to be associated with antibiotic resistance include the genes Rv0095c, PPE18, PPE54, and PPE60. Consistent with the idea that positive selection is acting on alleles within these loci, we observe a reduction in diversity at the second time point for the patients in which drug-resistant alleles sweep to fixation and in which putative host-pathogen alleles sweep to fixation (Figure 3A). Although of unknown function, Rv0095c (SNP A85V) was recently associated with transmission success of an Mtbc cluster in Peru (Dixit et al., 2019). Both PPE18 and PPE60 have been shown to interact with toll-like receptor 2 (TLR2) (Nair et al., 2009; Su et al., 2018). PPE18 was the only gene to encode an epitope containing a SNP in-host; mutations in the epitope-encoding regions of this gene have previously been described in a set of geographically separated clinical isolates (Hebert et al., 2007). Furthermore, PPE18 codes for one of the antigens used in the construction of the M72/AS01E vaccine candidate (Tait et al., 2019). Our results demonstrating that PPE18 is under positive selection in Mtbc may have implications for the efficacy of this vaccine against genetically diverse Mtbc strains. PPE54 has been implicated in Mtbc’s ability to arrest macrophage phagosomal maturation (phagosome-lysosome fusion) and is thought to be vital for intracellular persistence (Brodin et al., 2010). The mechanism by which PPE54 accomplishes this is unknown, but Mtbc modification of phagosomal function is thought to be TLR2/TLR4-dependent (Podinovskaia et al., 2013).

Mtbc is known to disrupt numerous innate immune mechanisms, including phagosome maturation, apoptosis and autophagy, as well as to inhibit MHC II expression through prolonged engagement with the innate sensor toll-like receptor 2 (TLR2), among others (Ernst, 2018). SNPs in human genes involved in innate immune pathways have been implicated in host susceptibility to TB (Azad et al., 2012; Kleinnijenhuis et al., 2011; Tientcheu et al., 2017). Specifically, SNPs in TLR2 (thought to be the most important TLR in Mtbc recognition) (Tientcheu et al., 2017) and TLR4 have been associated with susceptibility to TB disease (Azad et al., 2012; Kleinnijenhuis et al., 2011). Taken together, these observations and our results are consistent with ongoing co-evolution between humans and Mtbc with evidence for reciprocal adaptive changes, leaving a signature of selection in both human and Mtbc populations (Brites and Gagneux, 2015). Most co-evolution between Mtbc and humans, including the main reciprocal adaptations between host and pathogen, is thought to have occurred long ago as a result of long-term host-pathogen interactions (Azad et al., 2012; Brites and Gagneux, 2015). Here, we observe these dynamics over the short evolutionary timescale of a single infection, which has important implications for vaccine development (Brennan, 2017).


Author details

Roger Vargas (for correspondence)
Luca Freschi
Maximillian Marin
L Elaine Epperson
Melissa Smith
Irina Oussenko
David Durbin
Michael Strong
Max Salfinger
Maha Reda Farhat (for correspondence)
