CC8 pangenome and phylogeny.

(a) Pangenomic analysis of CC8 genomes shows the distribution of genes and mutations in ORFs and regulatory regions. (b) Prevalence of the USA300-specific genetic markers PVL and SCCmec IVa as the phylogenetic tree is traversed upward from the TCH1516 leaf. The gray dashed line marks the node where the USA300 root is placed. (c) Phylogenetic tree of CC8 genomes classified into USA300 and non-USA300 strains.
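The traversal in panel (b) can be illustrated with a minimal sketch. It assumes the tree is available as a simple parent/child mapping and that marker presence is known per genome. The tree, the names `strainC` and `strainD`, and the marker calls are toy values for illustration only, and the function names are hypothetical rather than taken from the study's code.

```python
def leaves_under(node, children):
    """Return the names of all leaves in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(leaves_under(kid, children))
    return found


def marker_prevalence_to_root(leaf, parent, children, has_marker):
    """Walk from `leaf` up to the root. At each ancestor, report the node,
    the number of descendant leaves, and the fraction carrying the marker."""
    profile = []
    node = parent.get(leaf)
    while node is not None:
        leaves = leaves_under(node, children)
        fraction = sum(has_marker[name] for name in leaves) / len(leaves)
        profile.append((node, len(leaves), fraction))
        node = parent.get(node)
    return profile


# Toy tree (hypothetical topology and marker calls).
children = {
    "root": ["n1", "strainD"],
    "n1": ["n2", "strainC"],
    "n2": ["TCH1516", "FPR3757"],
}
parent = {kid: node for node, kids in children.items() for kid in kids}
has_pvl = {"TCH1516": True, "FPR3757": True, "strainC": False, "strainD": False}

profile = marker_prevalence_to_root("TCH1516", parent, children, has_pvl)
```

In such a profile, a sharp drop in marker prevalence between consecutive ancestors is the kind of signal that could be used to place the USA300 root; the exact criterion used in the study is not stated in the caption.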

Mutations associated with USA300 strains.

(a) DBGWAS recovers components associated with previously described markers of USA300 strains, including mecA (SCCmec IVa), arcA (ACME), the cap5e mutation, seq, sek and Phi-PVL. Components carrying many other mutations scattered throughout the genome (NC_010079) are also enriched. Each ‘significant node’ represents a k-mer sequence (minimum length of 31 nucleotides) that is associated with USA300 strains (adjusted p-value < 0.05).

Linkage Disequilibrium and de novo mutations in USA300 strains.

(a) Enriched k-mers showed high linkage disequilibrium, with some k-mers separated by 1.4 Mbp still having r² greater than 0.98. (b) Schematic of the position-specific entropy analysis. Positions with heterogeneous sequences have higher calculated entropy than more conserved positions with fewer mutations. (c) Using position-specific entropy, we found only one example of an enriched mutation in ORFs that is shared between USA300 and non-USA300 strains. (d) Distance (in base pairs) between the position of each enriched mutation in USA300 strains and the position of the nearest entropy peak in other non-CC8 strains.
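Both quantities in this caption can be sketched in a few lines. The example below assumes position-specific entropy is the Shannon entropy of each alignment column and that r² is the squared Pearson correlation between two presence/absence vectors across genomes. These are common definitions, but the caption does not confirm the exact formulas used. The sequences and function names are illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def position_entropy(aligned_seqs):
    """Entropy at every position of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_seqs)]


def r_squared(x, y):
    """Squared Pearson correlation between two 0/1 presence vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a k-mer is fixed or absent everywhere
    return cov * cov / (var_x * var_y)


# Toy alignment: position 0 is conserved, position 1 is heterogeneous.
seqs = ["ACGT", "ACGT", "ATGT", "AGGT"]
entropy = position_entropy(seqs)

# Two k-mers present in exactly the same genomes are in complete linkage.
linked = r_squared([1, 1, 0, 0], [1, 1, 0, 0])
unlinked = r_squared([1, 1, 0, 0], [1, 0, 1, 0])
```

Conserved columns give an entropy of zero, while columns with several alleles give higher values, which is what makes entropy peaks usable as markers of recurrently mutated positions.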

Strain-specific regulatory changes in the CC8 clade.

(a) ICA of USA300 and non-USA300 RNA-sequencing data identified an iModulon with strain-specific activity. (b) The strain-specific iModulon contained various horizontally acquired elements (e.g. ACME, PhiPVL) that are prevalent in the USA300 lineage, as well as conserved genes with strain-specific expression patterns. (c) Comparing the 5’ regulatory region of the gene isdH across various S. aureus strains revealed a unique deletion containing a Fur binding site in the USA300 reference strain TCH1516.

Pangenome analysis and strain classification.

(a) Cumulative distribution of unique genes used to fit the pangenomic parameters. The core and unique gene thresholds were calculated at 90% of the distance from the inflection point (black dot) of the curve. (b) Analysis with Roary confirmed that adding new genomes to the analysis collection was unlikely to introduce many new genes, indicating good gene-level coverage of the CC8 clade. (c) SCCmec and PVL distribution in the CC8 tree as it is traversed upward from the FPR3757 leaf towards the root. Starting from FPR3757 gives the same delineation between USA300 and non-USA300 genomes as the search that starts from TCH1516.

S. aureus MLST distribution of genomes from PATRIC used in this study.

SCCmec/ACME iModulon gene weighting and strain-specific activity.

(a) The activity of the SCCmec/ACME iModulon shows clear strain-specific separation. (b) Gene weighting for the iModulon primarily containing SCCmec and ACME. Genes encoding SarY and AraC family proteins were also enriched.

The isdH gene shows strain-specific expression.

The increased expression level in USA300 is consistent with the deletion of the Fur repressor binding site. Expression levels are log-TPM values centered on the expression profile of the TCH1516 strain grown in RPMI + 10% LB.
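The centering described here can be sketched as follows. The example assumes a log2 transform with a pseudocount of 1, which is a common convention but is not stated in the caption. The condition names, TPM values, and function names are hypothetical.

```python
import math


def log_tpm(tpm, pseudocount=1.0):
    """log2-transformed TPM; the pseudocount of 1 is an assumption."""
    return math.log2(tpm + pseudocount)


def center_on_reference(expr, reference):
    """Subtract the reference condition's log-TPM from every condition.

    `expr` maps condition name -> {gene: TPM}.
    """
    ref = {gene: log_tpm(tpm) for gene, tpm in expr[reference].items()}
    return {
        condition: {gene: log_tpm(tpm) - ref[gene] for gene, tpm in genes.items()}
        for condition, genes in expr.items()
    }


# Hypothetical TPM values, for illustration only.
expr = {
    "reference_condition": {"isdH": 7.0},
    "other_condition": {"isdH": 31.0},
}
centered = center_on_reference(expr, "reference_condition")
```

After centering, the reference condition sits at zero for every gene, so a positive value in another sample reads directly as a log2 fold increase relative to the reference.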

Interpreting DBGWAS output.

(a) Example of components associated with mobile genetic elements (MGEs); these components have a series of nodes that are enriched in one group (blue circles). (b) Example of a component associated with a SNP. The component graph contains a cycle around the mutation location, with each path through the cycle forming a sequence unique to either the case or the control group. Aligning the two sequences reveals the enriched mutation.
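The final step in panel (b), aligning the two path sequences to locate the mutation, can be sketched for the simple case of a substitution, where both paths have the same length and no gaps are needed. The sequences and the function name are hypothetical. An indel would require a true pairwise alignment instead of this position-by-position comparison.

```python
def find_mismatches(case_seq, control_seq):
    """Return (position, case_base, control_base) for every differing site.

    Assumes the two path sequences are already the same length, as
    expected when they differ only by substitutions.
    """
    if len(case_seq) != len(control_seq):
        raise ValueError("sequences differ in length; a gapped alignment is needed")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(case_seq, control_seq))
        if a != b
    ]


# Hypothetical path sequences reconstructed from the two sides of a cycle.
case_path = "ACGTTGCAAGT"
control_path = "ACGTTACAAGT"
snps = find_mismatches(case_path, control_path)
```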