Research Article

The Mycobacterium tuberculosis complex pangenome is small and shaped by sub-lineage-specific regions of difference

Department of Biosciences, Nottingham Trent University, United Kingdom
Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, United Kingdom
Department of Biomedical Informatics, Harvard Medical School, United States
Medical Technologies Innovation Facility, Nottingham Trent University, United Kingdom
Pulmonary and Critical Care Medicine, Massachusetts General Hospital, United States
Unit of Mycobacteriology, Institute of Tropical Medicine, Belgium

Sep 5, 2025

https://doi.org/10.7554/eLife.97870.4

Open access

4 figures, 2 tables and 9 additional files

Figures

Figure 1

Boxplot showing the distribution of accessory genes within each lineage.

Lineages with a small number of genomes were excluded from the statistical analysis. L6, La1, and *M. microti* have significantly smaller accessory genomes compared to other lineages.

Figure 2 with 1 supplement

Analysis of the functional components in the core (A) and accessory (B) genome using EggNOG mapper and InterProScan.

Core and accessory genes are derived from Panaroo with merged paralogs. For the same analysis without merged paralogs, see Figure 2—figure supplement 1.

Figure 2—figure supplement 1

Figure 3

Population structure of the MTBC derived from phylogenetic and gene presence/absence approaches.

(A) Phylogenetic tree based on MTBC core genome. PCA based on the accessory genome data from (B) Panaroo (merged paralogs), (C) Panaroo (no merged paralogs) and (D) accessory regions data from Pangraph.

Figure 4

Sub-lineage-specific regions of difference (RDs) identified using pangenome-based approaches.

Sub-lineages are shown on the Y-axis, coloured as per the legend. H37Rv sits within L4.9 at the top of the Y-axis. RDs (structural variants present in all members of one or more sub-lineages and absent in all members of one or more other sub-lineages) are listed on the X-axis, grouped by their pattern of presence/absence. Only RDs detected by both the Panaroo-based and Pangraph-based approaches are shown. Grey boxes indicate that the region is absent from that sub-lineage. Only RDs that are due to divergent evolution and are present or absent in two or more genomes are shown here.
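The RD definition in this caption (present in every member of at least one sub-lineage, absent from every member of at least one other) can be illustrated with a small sketch. This is not the authors' pipeline; the function name, the input layout, and the region and genome identifiers are all hypothetical, and the extra filters mentioned in the caption (agreement between Panaroo and Pangraph, divergent evolution, two or more genomes) are not implemented.

```python
from collections import defaultdict


def sublineage_specific_regions(presence, sublineage_of):
    """Return regions that satisfy the RD presence/absence definition.

    presence: dict mapping region name -> set of genome IDs carrying it.
    sublineage_of: dict mapping genome ID -> sub-lineage label.
    Result: dict mapping region -> (fully present sub-lineages,
                                    fully absent sub-lineages).
    """
    # Group genome IDs by sub-lineage.
    members = defaultdict(set)
    for genome, sub in sublineage_of.items():
        members[sub].add(genome)

    result = {}
    for region, carriers in presence.items():
        # Sub-lineages in which every genome carries the region.
        fully_present = sorted(s for s, g in members.items() if g <= carriers)
        # Sub-lineages in which no genome carries the region.
        fully_absent = sorted(s for s, g in members.items() if not (g & carriers))
        if fully_present and fully_absent:
            result[region] = (fully_present, fully_absent)
    return result


# Hypothetical toy data: three sub-lineages with two genomes each.
sublineage_of = {
    "g1": "L4.9", "g2": "L4.9",
    "g3": "L2.2", "g4": "L2.2",
    "g5": "L1.1", "g6": "L1.1",
}
presence = {
    "regionA": {"g1", "g2", "g5"},                    # all L4.9, no L2.2
    "regionB": {"g1", "g2", "g3", "g4", "g5", "g6"},  # present everywhere
    "regionC": {"g1", "g3"},                          # never fixed in a sub-lineage
}
rds = sublineage_specific_regions(presence, sublineage_of)
```

In this toy input only regionA qualifies: it is fixed in L4.9 and missing from L2.2, while L1.1 is mixed and so counts as neither.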

Tables

Table 1

Pangenome (Panaroo) size estimations with and without corrections and merged paralogs.

Corrected estimates are derived from filtering out false accessory genes through merging of genes with strong hits to the same gene in H37Rv (≥90% identity and ≥75% gene length coverage).

Method	Pangenome statistic	Raw estimate	Corrected estimate
Panaroo, merged paralogs	Total genes	4118	4032
Panaroo, merged paralogs	Core genes	3638	3627
Panaroo, merged paralogs	Accessory genes	480	394
Panaroo, including paralogs	Total genes	4427	4321
Panaroo, including paralogs	Core genes	3635	3627
Panaroo, including paralogs	Accessory genes	792	694
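The correction described in the caption, merging gene groups that have strong hits to the same H37Rv gene, can be sketched as follows. This is an illustration only, not the authors' code. The two thresholds come from the caption. The function name, the hit tuple layout, the rule of using the first qualifying hit per group, and the example identifiers are assumptions.

```python
from collections import defaultdict

MIN_IDENTITY = 90.0  # percent identity threshold stated in the caption
MIN_COVERAGE = 75.0  # percent gene-length coverage threshold stated in the caption


def merge_by_reference_hit(gene_groups, hits):
    """Merge gene groups whose strong hit points to the same reference gene.

    gene_groups: iterable of gene group names.
    hits: iterable of (group, ref_gene, identity_pct, coverage_pct) tuples.
    Assumption: only the first qualifying hit per group is used.
    Returns a list of sets; each set is one merged cluster of groups.
    """
    assigned = {}
    for group, ref_gene, identity, coverage in hits:
        if group in assigned:
            continue
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            assigned[group] = ref_gene

    # Collect groups that share a reference gene.
    by_ref = defaultdict(set)
    for group, ref_gene in assigned.items():
        by_ref[ref_gene].add(group)

    merged = list(by_ref.values())
    # Groups without a strong hit remain as singleton clusters.
    merged.extend({g} for g in gene_groups if g not in assigned)
    return merged


# Hypothetical example: group_1 and group_2 both hit the same reference
# gene strongly; group_3 fails the identity threshold; group_4 has no hit.
groups = ["group_1", "group_2", "group_3", "group_4"]
hits = [
    ("group_1", "Rv0001", 99.0, 80.0),
    ("group_2", "Rv0001", 95.0, 76.0),
    ("group_3", "Rv0002", 85.0, 100.0),
]
clusters = merge_by_reference_hit(groups, hits)
```

Here four input groups collapse into three clusters, which mirrors how merging over-split groups lowers the total and accessory gene counts in the corrected column.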

Table 2

Pangenome graph (Pangraph) size estimations with and without corrections.

Corrected estimates are derived from filtering out false labelling of absent regions using BLASTn.

Pangenome partition	Raw estimates	Corrected estimates
Total Regions	1,338	1,338
Core Regions	1,015	1,040
Soft core	129	124
Shell	140	143
Cloud	54	31
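For readers unfamiliar with the partition names in Table 2, the sketch below shows one common way to assign a region to a partition from the fraction of genomes that carry it. The cut-offs (99%, 95%, 15%) are a widely used convention in pangenome summaries and are an assumption here; this page does not state which thresholds were applied to the Pangraph regions. The function name is hypothetical.

```python
def classify_region(n_present, n_genomes, core=0.99, soft_core=0.95, shell=0.15):
    """Assign a pangenome partition from the fraction of genomes with the region.

    The default thresholds are assumed conventions, not values from the article.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    freq = n_present / n_genomes
    if freq >= core:
        return "core"
    if freq >= soft_core:
        return "soft core"
    if freq >= shell:
        return "shell"
    return "cloud"


# Hypothetical counts out of 100 genomes.
labels = [classify_region(n, 100) for n in (100, 96, 50, 3)]
```

Under a scheme like this, correcting false absences raises the frequency of affected regions, which would move some of them from cloud towards shell or core, the direction of change seen between the raw and corrected columns.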

Additional files

Supplementary file 1 An overview of the genome dataset, covering both publicly available genomes and strains sequenced in this study, including assembly accessions and online locations where relevant. BUSCO completeness for each genome is also provided, as are the CDS and pseudogene counts predicted by PGAP.: https://cdn.elifesciences.org/articles/97870/elife-97870-supp1-v1.xlsx
Supplementary file 2 The geography of the dataset. (A) sample collection distribution by country; (B) the number of genomes of each lineage included from each continent.: https://cdn.elifesciences.org/articles/97870/elife-97870-supp2-v1.zip
Supplementary file 3 A list of all gene groups that were combined to reduce over-splitting of the pangenome caused by annotation errors. Each gene group's name is listed alongside its original classification (Core, Soft-core, Shell, Cloud) and its new classification after annotation correction and merging. This is shown for the pangenome both with and without the merged paralogs setting enabled.: https://cdn.elifesciences.org/articles/97870/elife-97870-supp3-v1.xlsx
Supplementary file 4 Pangenome openness assessment using Heaps' law analysis, genome fluidity and rarefaction curves for (A) Panaroo with merged paralogs; (B) Panaroo with unmerged paralogs and (C) Pangraph blocks.: https://cdn.elifesciences.org/articles/97870/elife-97870-supp4-v1.pdf
Supplementary file 5 Pairwise comparisons of accessory genome size between lineages. A * indicates a significant difference between these lineages. Comparisons on both merged and unmerged paralog datasets are shown.: https://cdn.elifesciences.org/articles/97870/elife-97870-supp5-v1.xlsx
Supplementary file 6 MTBC accessory genome distribution based on (A) Panaroo (merged paralogs) and (B) Pangraph. The MTBC phylogenetic tree is shown on the left beside a coloured bar indicating the sub-lineage of each tip genome. The accessory genes/regions are indicated by columns in the heatmap with a blue box if present in that strain’s genome.: https://cdn.elifesciences.org/articles/97870/elife-97870-supp6-v1.zip
Supplementary file 7 Coinfinder gene association heatmap. Coinfinder reveals gene association patterns within the accessory genome of the MTBC. Accessory genes, listed on the X-axis, cluster into groups (shown here by varying colours of the blocks).: https://cdn.elifesciences.org/articles/97870/elife-97870-supp7-v1.pdf
Supplementary file 8 A list of all RDs (known and new) along with the genes contained in each and the lineages the regions are absent from.: https://cdn.elifesciences.org/articles/97870/elife-97870-supp8-v1.xlsx
MDAR checklist: https://cdn.elifesciences.org/articles/97870/elife-97870-mdarchecklist1-v1.docx

Mahboobeh Behruznia, Maximillian Marin, Daniel J Whiley, Maha Reda Farhat, Jonathan C Thomas, Maria Rosa Domingo-Sananes, Conor J Meehan (2025) The Mycobacterium tuberculosis complex pangenome is small and shaped by sub-lineage-specific regions of difference. eLife 13:RP97870. https://doi.org/10.7554/eLife.97870.4
