The Mycobacterium tuberculosis complex pangenome is small and shaped by sub-lineage-specific regions of difference
Figures

Boxplot showing the distribution of accessory genes within each lineage.
Lineages with a small number of genomes were excluded from the statistical analysis. L6, La1, and M. microti have significantly smaller accessory genomes than the other lineages.

Analysis of the functional components in the core (A) and accessory (B) genome using EggNOG mapper and InterProScan.
Core and accessory genes are derived from Panaroo with merged paralogs. For the results without merged paralogs, see Figure 2—figure supplement 1.

Analysis of the functional components in the core (A) and accessory (B) genome using EggNOG mapper and InterProScan.
Core and accessory genes are derived from Panaroo without merged paralogs.

Population structure of the MTBC derived from phylogenetic and gene presence/absence approaches.
(A) Phylogenetic tree based on MTBC core genome. PCA based on the accessory genome data from (B) Panaroo (merged paralogs), (C) Panaroo (no merged paralogs) and (D) accessory regions data from Pangraph.

Sub-lineage-specific regions of difference (RDs) identified using pangenome-based approaches.
Sub-lineages are shown on the Y-axis coloured as per the legend. H37Rv sits within L4.9 at the top of the Y-axis. RDs (structural variants present in all members of one or more sub-lineages and absent in all members of one or more other sub-lineages) are listed on the X-axis, grouped by their pattern of presence/absence. Only RDs detected by both the Panaroo and Pangraph-based approaches are shown. Grey boxes indicate that the region is absent from that sub-lineage. Only RDs that are due to divergent evolution and found present or absent in 2 or more genomes are shown here.
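
The RD definition in this legend (present in all members of one or more sub-lineages and absent from all members of one or more others) can be illustrated with a minimal sketch. The function name, the input layout, and the toy genome and region labels below are assumptions made for illustration. The sketch does not apply the additional filters described in the legend (detection by both Panaroo and Pangraph, divergent evolution, two or more genomes).

```python
from collections import defaultdict


def sublineage_specific_regions(presence, sublineage_of):
    """Return regions that are fixed-present in >=1 sub-lineage and
    fixed-absent in >=1 other sub-lineage.

    presence:      dict mapping region name -> set of genome ids carrying it
    sublineage_of: dict mapping genome id -> sub-lineage label
    (Both layouts are hypothetical, chosen for this illustration.)
    """
    # Group genome ids by their sub-lineage.
    members = defaultdict(set)
    for genome, sub in sublineage_of.items():
        members[sub].add(genome)

    result = {}
    for region, genomes_with in presence.items():
        # Sub-lineages in which every member carries the region.
        fully_present = sorted(s for s, m in members.items() if m <= genomes_with)
        # Sub-lineages in which no member carries the region.
        fully_absent = sorted(s for s, m in members.items() if not (m & genomes_with))
        if fully_present and fully_absent:
            result[region] = {"present_in": fully_present, "absent_from": fully_absent}
    return result


# Toy data: six genomes in three sub-lineages (labels are illustrative only).
toy_sublineages = {
    "g1": "L4.9", "g2": "L4.9",
    "g3": "L2.2", "g4": "L2.2",
    "g5": "L1.1", "g6": "L1.1",
}
toy_presence = {
    "region_a": {"g1", "g2", "g5", "g6"},              # missing from all of L2.2
    "region_b": {"g1", "g3"},                          # never fixed-present, so not reported
    "region_c": {"g1", "g2", "g3", "g4", "g5", "g6"},  # core, so not reported
}

toy_rds = sublineage_specific_regions(toy_presence, toy_sublineages)
```

With this toy input only `region_a` qualifies, because it is fixed in two sub-lineages and missing from every member of the third.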
Tables
Pangenome (Panaroo) size estimations with and without corrections and merged paralogs.
Corrected estimates are derived from filtering out false accessory genes through merging of genes with strong hits to the same gene in H37Rv (≥90% identity and ≥75% gene length coverage).
Method | Pangenome statistics | Raw estimates | Corrected estimates |
---|---|---|---|
Panaroo (merged paralogs) | Total genes | 4118 | 4032 |
Panaroo (merged paralogs) | Core genes | 3638 | 3627 |
Panaroo (merged paralogs) | Accessory genes | 480 | 394 |
Panaroo (including paralogs) | Total genes | 4427 | 4321 |
Panaroo (including paralogs) | Core genes | 3635 | 3627 |
Panaroo (including paralogs) | Accessory genes | 792 | 694 |
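
The correction rule stated in the caption of this table (merge gene groups with strong hits to the same H37Rv gene, at ≥90% identity and ≥75% gene length coverage) can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the tuple layout, the function name, and the toy hit values are assumptions, and the definition of coverage is taken at face value as a percentage.

```python
from collections import defaultdict

# Thresholds as stated in the table caption.
MIN_IDENTITY = 90.0   # percent identity
MIN_COVERAGE = 75.0   # percent of gene length covered


def groups_to_merge(hits):
    """Find gene groups that share a strong hit to the same reference gene.

    hits: iterable of (gene_group, reference_gene, percent_identity,
          percent_coverage) tuples; this layout is hypothetical.
    Returns a dict mapping reference gene -> sorted list of gene groups,
    keeping only reference genes hit by more than one group.
    """
    by_reference = defaultdict(set)
    for group, ref_gene, identity, coverage in hits:
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            by_reference[ref_gene].add(group)
    return {
        ref: sorted(groups)
        for ref, groups in by_reference.items()
        if len(groups) > 1
    }


# Toy hits (names and values are made up for illustration).
toy_hits = [
    ("group_1", "ref_gene_A", 99.0, 100.0),
    ("group_2", "ref_gene_A", 95.0, 80.0),   # strong hit, same reference gene
    ("group_3", "ref_gene_B", 92.0, 60.0),   # coverage below threshold
    ("group_4", "ref_gene_B", 98.0, 90.0),   # only one strong hit remains
]

merge_plan = groups_to_merge(toy_hits)
```

Here `group_1` and `group_2` would be merged into one gene, while the `ref_gene_B` groups are left alone because only one of them passes both thresholds.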
Pangenome graph (Pangraph) size estimations with and without corrections.
Corrected estimates are derived from filtering out false labelling of absent regions using BLASTn.
Pangenome partition | Raw estimates | Corrected estimates |
---|---|---|
Total Regions | 1,338 | 1,338 |
Core Regions | 1,015 | 1,040 |
Soft core | 129 | 124 |
Shell | 140 | 143 |
Cloud | 54 | 31 |
Additional files
-
Supplementary file 1
An overview of the genome dataset, covering both publicly available genomes and strains sequenced in this study, including assembly accessions and online locations where relevant.
BUSCO completeness information for each genome is also provided, as is the CDS and pseudogene count as predicted by PGAP.
- https://cdn.elifesciences.org/articles/97870/elife-97870-supp1-v1.xlsx
-
Supplementary file 2
The geography of the dataset.
(A) sample collection distribution by country; (B) the number of genomes of each lineage included from each continent.
- https://cdn.elifesciences.org/articles/97870/elife-97870-supp2-v1.zip
-
Supplementary file 3
A list of all gene groups which were combined to reduce over-splitting of pangenome due to annotation errors.
Each gene group's name is listed alongside its original classification (Core, Soft-core, Shell, Cloud) and its new classification after annotation correction and merging with other groups. This is shown for the pangenome both with and without the merged paralog setting enabled.
- https://cdn.elifesciences.org/articles/97870/elife-97870-supp3-v1.xlsx
-
Supplementary file 4
Pangenome openness assessment using Heaps' law analysis, genome fluidity and rarefaction curves for (A) Panaroo with merged paralogs; (B) Panaroo with unmerged paralogs and (C) Pangraph blocks.
- https://cdn.elifesciences.org/articles/97870/elife-97870-supp4-v1.pdf
-
Supplementary file 5
Pairwise comparisons of accessory genome size between lineages.
A * indicates a significant difference between these lineages. Comparisons on both merged and unmerged paralog datasets are shown.
- https://cdn.elifesciences.org/articles/97870/elife-97870-supp5-v1.xlsx
-
Supplementary file 6
MTBC accessory genome distribution based on (A) Panaroo (merged paralogs) and (B) Pangraph.
The MTBC phylogenetic tree is shown on the left beside a coloured bar indicating the sub-lineage of each tip genome. The accessory genes/regions are indicated by columns in the heatmap with a blue box if present in that strain’s genome.
- https://cdn.elifesciences.org/articles/97870/elife-97870-supp6-v1.zip
-
Supplementary file 7
Coinfinder gene association heatmap.
Coinfinder reveals gene association patterns within the accessory genome of the MTBC. Accessory genes, listed on the X-axis, cluster into groups (shown here by varying colours of the blocks).
- https://cdn.elifesciences.org/articles/97870/elife-97870-supp7-v1.pdf
-
Supplementary file 8
A list of all RDs (known and new) along with the genes contained in each and the lineages the regions are absent from.
- https://cdn.elifesciences.org/articles/97870/elife-97870-supp8-v1.xlsx
-
MDAR checklist
- https://cdn.elifesciences.org/articles/97870/elife-97870-mdarchecklist1-v1.docx