Figures and data

Pangenome (Panaroo) size estimations with and without corrections and merged paralogs.
Corrected estimates are derived from filtering out false accessory genes through merging of genes with strong hits to the same gene in H37Rv (≥90% identity and ≥75% gene length coverage).
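
The correction rule can be illustrated with a minimal sketch. It assumes the hits are already available as (gene, H37Rv gene, percent identity, percent coverage) tuples. The function name, the input layout and the tie-breaking rule (keep the best strong hit per gene) are illustrative assumptions, not the pipeline actually used. Only the two thresholds come from the caption.

```python
from collections import defaultdict

MIN_IDENTITY = 90.0  # percent identity threshold stated in the caption
MIN_COVERAGE = 75.0  # percent gene-length coverage threshold stated in the caption


def merge_by_reference_hit(hits):
    """Group pangenome genes whose strong hit points to the same H37Rv gene.

    hits: iterable of (gene_id, h37rv_gene, identity, coverage) tuples
          (hypothetical layout chosen for this sketch).
    Returns a dict mapping each H37Rv gene to the set of genes merged into it.
    """
    best = {}
    for gene, ref, identity, coverage in hits:
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            score = (identity, coverage)
            # Assumption: a gene with several strong hits keeps its best one.
            if gene not in best or score > best[gene][1]:
                best[gene] = (ref, score)

    groups = defaultdict(set)
    for gene, (ref, _) in best.items():
        groups[ref].add(gene)
    return dict(groups)


hits = [
    ("group_1", "Rv0001", 99.1, 100.0),
    ("group_2", "Rv0001", 95.0, 80.0),
    ("group_3", "Rv0002", 85.0, 90.0),  # fails the identity threshold
    ("group_4", "Rv0003", 92.0, 60.0),  # fails the coverage threshold
]
merged = merge_by_reference_hit(hits)
```

Here `group_1` and `group_2` collapse into a single gene because both hit Rv0001 strongly, while `group_3` and `group_4` are left unmerged.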

Boxplot showing the distribution of accessory genes within each lineage.
Lineages with a small number of genomes were excluded from the statistical analysis. L6, La1 and M. microti have significantly smaller accessory genomes than the other lineages.

Analysis of the functional components in the core (A) and accessory (B) genome using EggNOG mapper and InterProScan.
Core and accessory genes are derived from Panaroo with merged paralogs. For results without merged paralogs, see Supplementary Figure S3.

Pangenome graph (Pangraph) size estimations with and without corrections.
Corrected estimates are derived from filtering out false labelling of absent regions using BLASTn.

(A) Phylogenetic tree based on the MTBC core genome. PCA based on accessory genome data from (B) Panaroo (merged paralogs) and (C) Panaroo (no merged paralogs), and (D) on accessory region data from Pangraph.

Sub-lineage-specific regions of difference (RDs) identified using pangenome-based approaches.
Sub-lineages are shown on the Y-axis, coloured as per the legend. H37Rv sits within L4.9 at the top of the Y-axis. RDs (structural variants present in all members of one or more sub-lineages and absent in all members of one or more other sub-lineages) are listed on the X-axis, grouped by their pattern of presence/absence. Only RDs detected by both the Panaroo and Pangraph-based approaches are shown. Grey boxes indicate that the region is absent from that sub-lineage. Only RDs that are lineage-specific and present or absent in two or more genomes are shown here.
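
The RD definition used in this caption can be expressed as a short check. This is a sketch under assumptions: the function name and the two input dictionaries are hypothetical, and it does not reproduce the Panaroo or Pangraph workflows, the requirement of agreement between both approaches, or the minimum-genome filter.

```python
def classify_region(presence, sublineage_of):
    """Apply the caption's RD definition to one region.

    presence:      dict genome -> bool (True if the region is present)
    sublineage_of: dict genome -> sub-lineage label
    Returns (is_rd, fully_present, fully_absent), where the two sets hold
    the sub-lineages in which every member has, or lacks, the region.
    """
    calls_by_sublineage = {}
    for genome, present in presence.items():
        calls_by_sublineage.setdefault(sublineage_of[genome], []).append(present)

    fully_present = {s for s, calls in calls_by_sublineage.items() if all(calls)}
    fully_absent = {s for s, calls in calls_by_sublineage.items() if not any(calls)}

    # An RD needs at least one sub-lineage of each kind.
    is_rd = bool(fully_present) and bool(fully_absent)
    return is_rd, fully_present, fully_absent


# Toy data: labels and calls are made up for illustration only.
sublineage_of = {
    "g1": "L4.9", "g2": "L4.9",
    "g3": "L2.2", "g4": "L2.2",
    "g5": "L1.1", "g6": "L1.1",
}
presence = {"g1": True, "g2": True, "g3": False, "g4": False, "g5": True, "g6": False}

is_rd, fully_present, fully_absent = classify_region(presence, sublineage_of)
```

In this toy case the region qualifies as an RD: it is present in every L4.9 genome and absent from every L2.2 genome, while the mixed L1.1 calls do not contribute to either set.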