The Mycobacterium tuberculosis complex pangenome is small and driven by sub-lineage-specific regions of difference

  1. Department of Biosciences, Nottingham Trent University, Nottingham, UK
  2. Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
  3. Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
  4. Pulmonary and Critical Care Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
  5. Unit of Mycobacteriology, Institute of Tropical Medicine, Antwerp, Belgium

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Read more about eLife’s peer review process.

Editors

  • Reviewing Editor
    Bavesh Kana
    University of the Witwatersrand, Johannesburg, South Africa
  • Senior Editor
    Bavesh Kana
    University of the Witwatersrand, Johannesburg, South Africa

Reviewer #1 (Public Review):

Summary:

In this paper, Behruznia and colleagues use long-read sequencing data for 335 strains of the Mycobacterium tuberculosis complex to study genome evolution in this clonal bacterial pathogen. They use both a "classical" pangenome approach that looks at the presence and absence of genes, and a more general pangenome graph approach to investigate structural variants also in non-coding regions. The two main results of the study are that (1) the MTBC has a small pangenome with few accessory genes, and that (2) pangenome evolution is driven by deletions in sublineage-specific regions of difference. Combining the gene-based approach with a pangenome graph is innovative, and the former analysis is largely sound apart from a lack of information about the data set used. The graph part, however, requires more work and currently fails to support the second main result. Problems include the omission of important information and the confusing analysis of structural variants in terms of "regions of difference", which unnecessarily introduces reference bias. Overall, I very much like the direction taken in this article, but think that it needs more work: on the one hand by simply telling the reader what exactly was done, on the other by taking advantage of the information contained in the pangenome graph.

Strengths:

The authors put together a large data set of long-read assemblies representing most lineages of the Mycobacterium tuberculosis context, covering a large geographic area. State-of-the-art methods are used to analyze gene presence-absence polymorphisms (Panaroo) and to construct a pangenome graph (PanGraph). Additional analysis steps are performed to address known problems with misannotated or misassembled genes in pangenome analysis.

Weaknesses:

The study does not quite live up to the expectations raised in the introduction. Firstly, while the importance of using a curated data set is emphasized, little information is given about the data set apart from the geographic origin of the samples (Figure 1). A BUSCO analysis is conducted to filter for assembly quality, but no results are reported. It is also not clear whether the authors assembled genomes themselves in the cases where, according to Supplementary Table 1, only the reads were published but not the assemblies. In the end, we simply have to trust that single-contig assemblies based on long-reads are reliable.

One issue with long read assemblies could be that high rates of sequencing errors result in artificial indels when coverage is low, which in turn could affect gene annotation and pangenome inference (e.g. Watson & Warr 2019, https://doi.org/10.1038/s41587-018-0004-z). Some of the older long-read data used by the authors could well be problematic (PacBio RSII), but also their own Nanopore assemblies, six of which have a mean coverage below 50 (Wick et al. 2023 recommend 200x for ONT, https://doi.org/ 10.1371/journal.pcbi.1010905). Could the results be affected by such assembly errors? Are there lineages, for example, for which there is an increased proportion of RSII data? Given the large heterogeneity in data quality on the NCBI, I think more information about the reads and the assemblies should be provided.

The part of the paper I struggled most with is the pangenome graph analysis and the interpretation of structural variants in terms of "regions of difference". To start with, the method section states that "multiple whole genomes were aligned into a graph using PanGraph" (l.159/160), without stating which genomes were for what reason. From Figure 5 I understand that you included all genomes, and that Figure 6 summarizes the information at the sublineage level. This should be stated clearly, at present the reader has to figure out what was done. It was also not clear to me why the authors focus on the sublineage level: a minority of accessory genes (107 of 506) are "specific to certain lineages or sublineages" (l. 240), so why conclude that the pangenome is "driven by sublineage-specific regions of difference", as the title states? What does "driven by" mean? Instead of cutting the phylogeny arbitrarily at the sublineage level, polymorphisms could be described more generally by their frequencies.

I fully agree that pangenome graphs are the way to go and that the non-coding part of the genome deserves as much attention as the coding part, as stated in the introduction. Here, however, the analysis of the pangenome graph consists of extracting variants from the graph and blasting them against the reference genome H37Rv in order to identify genes and "regions of difference" (RDs) that are variable. It is not clear what the authors do with structural variants that yield no blast hit against H37Rv. Are they ignored? Are they included as new "regions of difference"? How many of them are there? etc. The key advantage of pangenome graphs is that they allow a reference-free, full representation of genetic variation in a sample. Here reference bias is reintroduced in the first analysis step.

Along similar lines, I find the interpretation of structural variants in terms of "regions of difference" confusing, and probably many people outside the TB field will do so. For one thing, it is not clear where these RDs and their names come from. Did the authors use an annotation of RDs in the reference genome H37Rv from previously published work (e.g. Bespiatykh et al. 2021)? This is important basic information, its lack makes it difficult to judge the validity of the results. The Bespiatykh et al. study uses a large short-read data (721 strains) set to characterize diversity in RDs and specifically focuses on the sublineage-specific variants. While the authors cite the paper, it would be relevant to compare the results of the two studies in more detail.

As far as I understand, "regions of difference" have been used in the tuberculosis field to describe structural variants relative to the reference genome H37Rv. Colloquially, regions present in H37Rv but absent in another strain have been called "deletions". Whether these polymorphisms have indeed originated through deletion or through insertion in H37Rv or its ancestors requires a comparison with additional strains. While the pangenome graph does contain this information, the authors do not attempt to categorize structural variants into insertions and deletions but simply seem to assume that "regions of difference" are deletions. This, as well as the neglect of paralogs in the "classical" pangenome analysis, puts a question mark behind their conclusion that deletion drives pangenome evolution in the MTBC.

Reviewer #2 (Public Review):

Summary:

The authors attempted to investigate the pangenome of MTBC by using a selection of state-of-the-art bioinformatic tools to analyse 324 complete and 11 new genomes representing all known lineages and sublineages. The aim of their work was to describe the total diversity of the MTBC and to investigate the driving evolutionary force. By using long read and hybrid approaches for genome assembly, an important attempt was made to understand why the MTBC pangenome size was reported to vary in size by previous reports.

Strengths:

A stand-out feature of this work is the inclusion of non-coding regions as opposed to only coding regions which was a focus of previous papers and analyses which investigated the MTBC pangenome. A unique feature of this work is that it highlights sublineage-specific regions of difference (RDs) that were previously unknown. Another major strength is the utilisation of long-read whole genomes sequences, in combination with short-read sequences when available. It is known that using only short reads for genome assembly has several pitfalls. The parallel approach of utilizing both Panaroo and Pangraph for pangenomic reconstruction illuminated the limitations of both tools while highlighting genomic features identified by both. This is important for any future work and perhaps alludes to the need for more MTBC-specific tools to be developed.

Weaknesses:

The only major weakness was the limited number of isolates from certain lineages and the over-representation others, which was also acknowledged by the authors. However, since the case is made that the MTBC has a closed pangenome, the inclusion of additional genomes would not result in the identification of any new genes. This is a strong statement without an illustration/statistical analysis to support this.

Author response:

Reviewer #1 (Public Review):

Summary:

In this paper, Behruznia and colleagues use long-read sequencing data for 335 strains of the Mycobacterium tuberculosis complex to study genome evolution in this clonal bacterial pathogen. They use both a "classical" pangenome approach that looks at the presence and absence of genes, and a more general pangenome graph approach to investigate structural variants also in non-coding regions. The two main results of the study are that (1) the MTBC has a small pangenome with few accessory genes, and that (2) pangenome evolution is driven by deletions in sublineage-specific regions of difference. Combining the gene-based approach with a pangenome graph is innovative, and the former analysis is largely sound apart from a lack of information about the data set used. The graph part, however, requires more work and currently fails to support the second main result. Problems include the omission of important information and the confusing analysis of structural variants in terms of "regions of difference", which unnecessarily introduces reference bias. Overall, I very much like the direction taken in this article, but think that it needs more work: on the one hand by simply telling the reader what exactly was done, on the other by taking advantage of the information contained in the pangenome graph.

Thank you for your constructive feedback. We have hopefully positively addressed all your concerns. Please see our detailed responses below.

Strengths:

The authors put together a large data set of long-read assemblies representing most lineages of the Mycobacterium tuberculosis context, covering a large geographic area. State-of-the-art methods are used to analyze gene presence-absence polymorphisms (Panaroo) and to construct a pangenome graph (PanGraph). Additional analysis steps are performed to address known problems with misannotated or misassembled genes in pangenome analysis.

Thank you for your positive feedback. We are pleased that you found these aspects of our work noteworthy and valuable.

Weaknesses:

The study does not quite live up to the expectations raised in the introduction. Firstly, while the importance of using a curated data set is emphasized, little information is given about the data set apart from the geographic origin of the samples (Figure 1). A BUSCO analysis is conducted to filter for assembly quality, but no results are reported. It is also not clear whether the authors assembled genomes themselves in the cases where, according to Supplementary Table 1, only the reads were published but not the assemblies. In the end, we simply have to trust that single-contig assemblies based on long-reads are reliable.

The BUSCO results are present for all the genomes in Supplementary Table S1. Genome assemblies were obtained from public databases and other studies that performed the assemblies. We did not perform assemblies for any of the public datasets except the 11 genomes sequenced by ourselves, for which we included the assembly statistics. The public genomes from NCBI were marked as closed based on the NCBI pipelines so there are additional checks on quality undertaken there before we included in our analysis. Marin et al (2024; BioRxiv) also performed additional checks on the vast majority of the genomes before they were included here. We are confident that these genomes represent the highest quality M. tuberculosis dataset possible, but we will check that all genomes are present in the GTDB list, which performs additional tests including CheckM, to add another layer of confidence. Some of the accessions to the final genomes were not included as these papers were not released yet but will be in the next version. Supplementary Table S1 will be updated to include the assembly information for each genome.

One issue with long read assemblies could be that high rates of sequencing errors result in artificial indels when coverage is low, which in turn could affect gene annotation and pangenome inference (e.g. Watson & Warr 2019, https://doi.org/10.1038/s41587-018-0004-z). Some of the older long-read data used by the authors could well be problematic (PacBio RSII), but also their own Nanopore assemblies, six of which have a mean coverage below 50 (Wick et al. 2023 recommend 200x for ONT, https://doi.org/ 10.1371/journal.pcbi.1010905). Could the results be affected by such assembly errors? Are there lineages, for example, for which there is an increased proportion of RSII data? Given the large heterogeneity in data quality on the NCBI, I think more information about the reads and the assemblies should be provided.

We have shown elsewhere (Marin et al (2024; BioRxiv)) that short read sequencing is significantly worse for these types of problems. For this reason, we have included only closed genomes which we believe will reduce the potential for such errors. However, we agree that older sequencing technologies, such as PacBio RSII, can introduce errors in the assemblies and subsequent downstream analyses. We will look for correlation between platform and accessory genome presence/absence to see if the type of sequencing influences the results.

Wick et al. (2023) recommend a coverage of 200x for ONT sequencing; however, newer analyses from Wick have shown that with modern basecalling and sequencing very low error rates can be achieved with much lower coverage (see https://rrwick.github.io/2023/10/24/ont-only-accuracy-update.html). We are quite confident that gene presence/absence patterns should be robust to this in our analysis but will confirm with some additional analysis on our sequenced genomes.

The part of the paper I struggled most with is the pangenome graph analysis and the interpretation of structural variants in terms of "regions of difference". To start with, the method section states that "multiple whole genomes were aligned into a graph using PanGraph" (l.159/160), without stating which genomes were for what reason. From Figure 5 I understand that you included all genomes, and that Figure 6 summarizes the information at the sublineage level. This should be stated clearly, at present the reader has to figure out what was done.

All genomes were included in the pangenome graph construction and to look for regions of differences. We then grouped genomes into sub-lineages to undertake the additional analyses as there is not enough genomes per sub-sub-lineages and lower for robust analyses. We will make this clearer in the next version, likely with a flowchart of analyses.

It was also not clear to me why the authors focus on the sublineage level: a minority of accessory genes (107 of 506) are "specific to certain lineages or sublineages" (l. 240), so why conclude that the pangenome is "driven by sublineage-specific regions of difference", as the title states? What does "driven by" mean? Instead of cutting the phylogeny arbitrarily at the sublineage level, polymorphisms could be described more generally by their frequencies.

We acknowledge the importance of polymorphisms, but our study primarily aimed to investigate the presence and absence of genes/genomic regions, as highlighted in our focus on structural differences rather than SNPs (L67-69). We attempted to clarify our goal of exploring gene content variation both between and within lineages (L69) to avoid confusion.

Our focus on the sub-lineage level addresses the gap in understanding gene content distribution beyond the broad lineage level, where previous pangenome studies have concentrated. The decision to focus on sub-lineages allows for a more detailed exploration of genetic diversity. Due to the limited number of genomes available to represent all sub-sub-lineages and lower levels of classification, we aimed to investigate gene content differences at the sub-lineage level. This decision allows for a more detailed and comprehensive exploration of gene content differences within the MTBC.

I fully agree that pangenome graphs are the way to go and that the non-coding part of the genome deserves as much attention as the coding part, as stated in the introduction. Here, however, the analysis of the pangenome graph consists of extracting variants from the graph and blasting them against the reference genome H37Rv in order to identify genes and "regions of difference" (RDs) that are variable. It is not clear what the authors do with structural variants that yield no blast hit against H37Rv. Are they ignored? Are they included as new "regions of difference"? How many of them are there? etc. The key advantage of pangenome graphs is that they allow a reference-free, full representation of genetic variation in a sample. Here reference bias is reintroduced in the first analysis step.

Genomic analysis of Mycobacterium tuberculosis is H37Rv reference-centric, meaning that RDs are typically defined based on their presence or absence relative to the reference strain. Our approach comparing variants to the H37Rv reference was primarily to identify and name the known regions of differences (RDs). For structural variants that did not yield a BLAST hit against H37Rv, we assigned them as new RDs in Supplementary Table S4 to provide a reference-free approach for investigating gene content differences. Further clarifications on the definition and identification of RDs will be added.

Along similar lines, I find the interpretation of structural variants in terms of "regions of difference" confusing, and probably many people outside the TB field will do so. For one thing, it is not clear where these RDs and their names come from. Did the authors use an annotation of RDs in the reference genome H37Rv from previously published work (e.g. Bespiatykh et al. 2021)? This is important basic information, its lack makes it difficult to judge the validity of the results. The Bespiatykh et al. study uses a large short-read data (721 strains) set to characterize diversity in RDs and specifically focuses on the sublineage-specific variants. While the authors cite the paper, it would be relevant to compare the results of the two studies in more detail.

Indeed the term regions of difference (RDs) is somewhat M. tuberculosis specific. These are large polymorphisms which are differentially present in clades (primarily lineages) of M. tuberculosis. Annotations and naming of these is based on Bespiatykh et al. (2021) and RDscan tool which identify RD regions based on the H37Rv genomic coordinates. We obtained the corresponding Rv locus for RD regions by matching their genomic coordinates on the H37Rv genome and confirmed the RDs using the bed file from RDscan. We have used their names where our findings overlap and any new RDs we report are not found in their data. We will ensure this is clearer in the next version.

As far as I understand, "regions of difference" have been used in the tuberculosis field to describe structural variants relative to the reference genome H37Rv. Colloquially, regions present in H37Rv but absent in another strain have been called "deletions". Whether these polymorphisms have indeed originated through deletion or through insertion in H37Rv or its ancestors requires a comparison with additional strains. While the pangenome graph does contain this information, the authors do not attempt to categorize structural variants into insertions and deletions but simply seem to assume that "regions of difference" are deletions. This, as well as the neglect of paralogs in the "classical" pangenome analysis, puts a question mark behind their conclusion that deletion drives pangenome evolution in the MTBC.

The term regions of difference or RDs has traditionally been used to describe structural variants relative to the H37Rv genome, often interpreted as deletions. Consistent with our study, Bespiatykh et al. (2021) observed two types of deletions: those associated with repeat sequences or mobile genetic elements, and conserved RDs that are phylogenetically informative deletions inherited by all descendants of a strain.

In our study, we employed a phylogenetic approach to identify deletions. If RDs are present in genomes both upstream and downstream of a phylogenetic branch but are absent in one specific branch, we interpret this as evidence of gene deletion (Figure 5B). This method was systematically applied to all RDs identified as deletions in our study; we will clarify this better in the next version.

We acknowledge the importance of considering paralogs in pangenome analysis. While the evolution of genomes is driven by duplication, loss and transfer, we know that transfer is not a mechanism in modern MTBC evolution and we have focussed here on loss. Duplication (paralog) analysis from annotations continues to be difficult to quantify due to the difficult of reliably confirming paralogy. We have addressed the effect of different Panaroo options, including merge paralogs, on the genomic diversity and pangenome estimation of MTBC in our associated paper (Marin et al 2024). This study showed that most structural variation in Mycobacterium tuberculosis is attributed to rearrangements of existing sequences rather than novel sequence content. For example, the transposable element IS6110 accounts for a significant portion of sequence variation. This hints that paralogs are not very important in terms of gene content differences in MTBC.

However, we will attempt to build on this by looking at Panaroo outputs without merged paralogs and looking for potentially duplicated genomic stretches in the Pangraph analyses. This will hopefully show more robustly that the MTBC diversity is primarily deletion driven.

Reviewer #2 (Public Review):

Summary:

The authors attempted to investigate the pangenome of MTBC by using a selection of state-of-the-art bioinformatic tools to analyse 324 complete and 11 new genomes representing all known lineages and sublineages. The aim of their work was to describe the total diversity of the MTBC and to investigate the driving evolutionary force. By using long read and hybrid approaches for genome assembly, an important attempt was made to understand why the MTBC pangenome size was reported to vary in size by previous reports.

Strengths:

A stand-out feature of this work is the inclusion of non-coding regions as opposed to only coding regions which was a focus of previous papers and analyses which investigated the MTBC pangenome. A unique feature of this work is that it highlights sublineage-specific regions of difference (RDs) that were previously unknown. Another major strength is the utilisation of long-read whole genomes sequences, in combination with short-read sequences when available. It is known that using only short reads for genome assembly has several pitfalls. The parallel approach of utilizing both Panaroo and Pangraph for pangenomic reconstruction illuminated the limitations of both tools while highlighting genomic features identified by both. This is important for any future work and perhaps alludes to the need for more MTBC-specific tools to be developed.

Thank you for recognising the strengths of our work.

Weaknesses:

The only major weakness was the limited number of isolates from certain lineages and the over-representation others, which was also acknowledged by the authors. However, since the case is made that the MTBC has a closed pangenome, the inclusion of additional genomes would not result in the identification of any new genes. This is a strong statement without an illustration/statistical analysis to support this.

The language around open and closed pangenomes is difficult to convey and indeed we will improve this for the next version. We aimed to show that with a set of highly curated genomes that span the breadth of known diversity within the MTBC, we see no evidence for a large, open pangenome as has been previously suggested. We instead suggest that adding new genomes is unlikely to bring large additions to the accessory genome, therefore showing that the MTBC pangenome tends towards being closed. We will add additional visualisations such as gene accumulation plots to better support this argument.

  1. Howard Hughes Medical Institute
  2. Wellcome Trust
  3. Max-Planck-Gesellschaft
  4. Knut and Alice Wallenberg Foundation