The Mycobacterium tuberculosis complex pangenome is small and shaped by sub-lineage-specific regions of difference

  1. Mahboobeh Behruznia
  2. Maximillian Marin
  3. Daniel J Whiley
  4. Maha Reda Farhat
  5. Jonathan C Thomas
  6. Maria Rosa Domingo-Sananes
  7. Conor J Meehan (corresponding author)

  1. Department of Biosciences, Nottingham Trent University, United Kingdom
  2. Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, United Kingdom
  3. Department of Biomedical Informatics, Harvard Medical School, United States
  4. Medical Technologies Innovation Facility, Nottingham Trent University, United Kingdom
  5. Pulmonary and Critical Care Medicine, Massachusetts General Hospital, United States
  6. Unit of Mycobacteriology, Institute of Tropical Medicine, Belgium
4 figures, 2 tables and 9 additional files

Figures

Figure 1
Boxplot showing the distribution of accessory genes within each lineage.

Lineages with a small number of genomes were excluded from the statistical analysis. L6, La1, and M. microti have significantly smaller accessory genomes than the other lineages.

Figure 2 with 1 supplement
Analysis of the functional components in the core (A) and accessory (B) genome using EggNOG mapper and InterProScan.

Core and accessory genes are derived from Panaroo with merged paralogs. For the results without merged paralogs, see Figure 2—figure supplement 1.

Figure 2—figure supplement 1
Analysis of the functional components in the core (A) and accessory (B) genome using EggNOG mapper and InterProScan.

Core and accessory genes are derived from Panaroo without merged paralogs.

Figure 3
Population structure of the MTBC derived from phylogenetic and gene presence/absence approaches.

(A) Phylogenetic tree based on MTBC core genome. PCA based on the accessory genome data from (B) Panaroo (merged paralogs), (C) Panaroo (no merged paralogs) and (D) accessory regions data from Pangraph.

Figure 4
Sub-lineage-specific regions of difference (RDs) identified using pangenome-based approaches.

Sub-lineages are shown on the Y-axis coloured as per the legend. H37Rv sits within L4.9 at the top of the Y-axis. RDs (structural variants present in all members of one or more sub-lineages and absent in all members of one or more other sub-lineages) are listed on the X-axis, grouped by their pattern of presence/absence. Only RDs detected by both the Panaroo and Pangraph-based approaches are shown. Grey boxes indicate that the region is absent from that sub-lineage. Only RDs that are due to divergent evolution and found present or absent in 2 or more genomes are shown here.
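The RD definition in this caption (present in all members of one or more sub-lineages and absent in all members of one or more others) can be illustrated with a minimal sketch. The function name, the input dictionaries and the genome labels are hypothetical and are not taken from the authors' pipeline; the additional "2 or more genomes" filter mentioned in the caption is not implemented here.

```python
from typing import Dict, List


def is_sublineage_rd(presence: Dict[str, bool], sublineage: Dict[str, str]) -> bool:
    """Check whether a region matches the caption's RD presence/absence pattern.

    presence   maps genome name -> True if the region is present (assumed input)
    sublineage maps genome name -> sub-lineage label (assumed input)
    """
    # Collect the presence calls of every genome, grouped by sub-lineage.
    calls_by_sublineage: Dict[str, List[bool]] = {}
    for genome, present in presence.items():
        calls_by_sublineage.setdefault(sublineage[genome], []).append(present)

    # Sub-lineages in which every member carries the region.
    fixed_present = [s for s, calls in calls_by_sublineage.items() if all(calls)]
    # Sub-lineages in which no member carries the region.
    fixed_absent = [s for s, calls in calls_by_sublineage.items() if not any(calls)]

    # An RD needs at least one sub-lineage of each kind.
    return bool(fixed_present) and bool(fixed_absent)


# Toy example with made-up genomes g1..g4.
sublineage = {"g1": "L4.9", "g2": "L4.9", "g3": "L1.1", "g4": "L1.1"}
clean_rd = {"g1": True, "g2": True, "g3": False, "g4": False}
mixed = {"g1": True, "g2": True, "g3": True, "g4": False}
```

In this toy input, `clean_rd` fits the pattern because L4.9 is uniformly present and L1.1 uniformly absent, whereas `mixed` does not because no sub-lineage lacks the region in all of its members.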

Tables

Table 1
Pangenome (Panaroo) size estimations with and without corrections and merged paralogs.

Corrected estimates are derived from filtering out false accessory genes through merging of genes with strong hits to the same gene in H37Rv (≥90% identity and ≥75% gene length coverage).

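| Method | Pangenome statistics | Raw estimates | Corrected estimates |
| --- | --- | --- | --- |
| Panaroo (merged paralogs) | Total genes | 4118 | 4032 |
| | Core genes | 3638 | 3627 |
| | Accessory genes | 480 | 394 |
| Panaroo (including paralogs) | Total genes | 4427 | 4321 |
| | Core genes | 3635 | 3627 |
| | Accessory genes | 792 | 694 |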
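The correction described for Table 1 merges gene groups that have strong hits to the same H37Rv gene (≥90% identity and ≥75% gene length coverage). The following is only a sketch of that filtering step under stated assumptions: the `Hit` record, its field names and the group and gene labels are hypothetical, and whether coverage is measured against the query or the H37Rv gene is not specified in the caption.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Hit(NamedTuple):
    group: str       # pangenome gene group name (hypothetical label)
    ref_gene: str    # H37Rv gene that the group matches
    identity: float  # percent identity of the hit
    coverage: float  # percent of gene length covered (assumed definition)


def groups_to_merge(
    hits: List[Hit], min_identity: float = 90.0, min_coverage: float = 75.0
) -> Dict[str, Set[str]]:
    """Return H37Rv genes that are strongly hit by more than one gene group."""
    strong_groups: Dict[str, Set[str]] = defaultdict(set)
    for hit in hits:
        # Keep only hits that meet both thresholds from the caption.
        if hit.identity >= min_identity and hit.coverage >= min_coverage:
            strong_groups[hit.ref_gene].add(hit.group)
    # Only reference genes shared by two or more groups imply a merge.
    return {ref: groups for ref, groups in strong_groups.items() if len(groups) > 1}


# Made-up hits: two groups match Rv0001 strongly; group_3 fails the coverage cut.
hits = [
    Hit("group_1", "Rv0001", 99.0, 100.0),
    Hit("group_2", "Rv0001", 92.0, 80.0),
    Hit("group_3", "Rv0002", 95.0, 60.0),
    Hit("group_4", "Rv0002", 91.0, 90.0),
]
merge_plan = groups_to_merge(hits)
```

Groups merged this way would then be counted as a single gene when the core and accessory totals are recomputed.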
Table 2
Pangenome graph (Pangraph) size estimations with and without corrections.

Corrected estimates are derived from filtering out false labelling of absent regions using BLASTn.

| Pangenome partition | Raw estimates | Corrected estimates |
| --- | --- | --- |
| Total regions | 1,338 | 1,338 |
| Core regions | 1,015 | 1,040 |
| Soft core | 129 | 124 |
| Shell | 140 | 143 |
| Cloud | 54 | 31 |

Additional files

Supplementary file 1

An overview of the genome dataset, covering both publicly acquired data and strains sequenced within this study, including assembly accessions and online locations where relevant.

BUSCO completeness information for each genome is also provided, as is the CDS and pseudogene count as predicted by PGAP.

https://cdn.elifesciences.org/articles/97870/elife-97870-supp1-v1.xlsx
Supplementary file 2

The geography of the dataset.

(A) sample collection distribution by country; (B) the number of genomes of each lineage included from each continent.

https://cdn.elifesciences.org/articles/97870/elife-97870-supp2-v1.zip
Supplementary file 3

A list of all gene groups that were combined to reduce over-splitting of the pangenome due to annotation errors.

Each gene group's name is listed alongside its original classification (Core, Soft-core, Cloud, Shell) and its new classification after annotation correction and merging with other groups. This is shown for the pangenomes both with and without the merged paralog setting enabled.

https://cdn.elifesciences.org/articles/97870/elife-97870-supp3-v1.xlsx
Supplementary file 4

Pangenome openness assessment using Heaps' law analysis, genome fluidity and rarefaction curves for (A) Panaroo with merged paralogs; (B) Panaroo with unmerged paralogs and (C) Pangraph blocks.

https://cdn.elifesciences.org/articles/97870/elife-97870-supp4-v1.pdf
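As a rough illustration of what a Heaps' law analysis involves, the sketch below fits a power law n = k * N^gamma to pangenome size n as a function of the number of genomes N, using least squares on log-transformed values. This is one common formulation and is an assumption here; the supplementary file may use a different parameterisation (for example, one based on the number of new genes per added genome), and the function name and input values are made up.

```python
import math
from typing import List, Tuple


def fit_heaps_law(genome_counts: List[int], pangenome_sizes: List[float]) -> Tuple[float, float]:
    """Fit n = k * N**gamma by ordinary least squares in log-log space.

    Returns (k, gamma). Inputs are assumed to be points from a rarefaction curve.
    """
    xs = [math.log(n) for n in genome_counts]
    ys = [math.log(s) for s in pangenome_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the regression line of log(size) on log(genome count).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    # The intercept gives log(k).
    k = math.exp(mean_y - gamma * mean_x)
    return k, gamma


# Synthetic points generated from n = 100 * N**0.5, not real MTBC data.
k, gamma = fit_heaps_law([1, 4, 9, 16], [100.0, 200.0, 300.0, 400.0])
```

Under this formulation, an exponent close to zero means the pangenome has almost stopped growing as further genomes are added, which is the behaviour expected of a small, closed pangenome.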
Supplementary file 5

Pairwise comparisons of accessory genome size between lineages.

A * indicates a significant difference between these lineages. Comparisons on both merged and unmerged paralog datasets are shown.

https://cdn.elifesciences.org/articles/97870/elife-97870-supp5-v1.xlsx
Supplementary file 6

MTBC accessory genome distribution based on (A) Panaroo (merged paralogs) and (B) Pangraph.

The MTBC phylogenetic tree is shown on the left beside a coloured bar indicating the sub-lineage of each tip genome. The accessory genes/regions are indicated by columns in the heatmap with a blue box if present in that strain’s genome.

https://cdn.elifesciences.org/articles/97870/elife-97870-supp6-v1.zip
Supplementary file 7

Coinfinder gene association heatmap.

Coinfinder reveals gene association patterns within the accessory genome of the MTBC. Accessory genes, listed on the X-axis, cluster into groups (shown here by varying colours of the blocks).

https://cdn.elifesciences.org/articles/97870/elife-97870-supp7-v1.pdf
Supplementary file 8

A list of all RDs (known and new) along with the genes contained in each and the lineages the regions are absent from.

https://cdn.elifesciences.org/articles/97870/elife-97870-supp8-v1.xlsx
MDAR checklist
https://cdn.elifesciences.org/articles/97870/elife-97870-mdarchecklist1-v1.docx

Cite this article

Mahboobeh Behruznia, Maximillian Marin, Daniel J Whiley, Maha Reda Farhat, Jonathan C Thomas, Maria Rosa Domingo-Sananes, Conor J Meehan (2025) The Mycobacterium tuberculosis complex pangenome is small and shaped by sub-lineage-specific regions of difference. eLife 13:RP97870. https://doi.org/10.7554/eLife.97870.4