Abstract
The Staphylococcus aureus clonal complex 8 (CC8) is made up of several subtypes with varying levels of clinical burden; from community-associated methicillin resistant S. aureus (CA-MRSA) USA300 strains to hospital-associated (HA-MRSA) USA500 strains and ancestral methicillin susceptible (MSSA) strains. This phenotypic distribution within a single clonal complex makes CC8 an ideal clade to study the emergence of mutations important for antibiotic resistance and community spread. Gene level analysis comparing USA300 against MSSA and HA-MRSA strains have revealed key horizontally acquired genes important for its rapid spread in the community. However, efforts to define the contributions of point mutations and indels have been confounded by strong linkage disequilibrium resulting from clonal propagation. To break down this confounding effect, we combined genetic association testing with a model of the transcriptional regulatory network (TRN) to find candidate mutations that may have led to changes in gene regulation. First, we used a De Bruijn graph genome-wide association study (DBGWAS) to enrich mutations unique to the USA300 lineages within CC8. Next, we reconstructed the TRN by using Independent Component Analysis on 670 RNA sequencing samples from USA300 and non-USA300 CC8 strains which predicted several genes with strain-specific altered expression patterns. Examination of the regulatory region of one of the genes enriched by both approaches, isdH, revealed a 38 base pair deletion containing a Fur binding site and a conserved Single Nucleotide Polymorphism (SNP) which likely led to the altered expression levels in USA300 strains. Taken together, our results demonstrate the utility of reconstructed TRNs to address the limits of genetic approaches when studying emerging pathogenic strains.
Introduction
Comparative genomic methods represent an important approach to understand the emergence and evolution of new strains of pathogens. In S. aureus alone, whole genome comparisons have enabled rapid characterization of genetic basis for antibiotic resistance, increased virulence, host specificity and altered metabolic capabilities 1–5. However, genome-wide linkage disequilibrium and strong population structure currently limits the differentiation of causative alleles from genetically linked ones. By calculating lineage level associations, methods like bugwas address these issues for single, recurring phenotypes like antibiotic resistance6.
Emerging clonal complexes, on the other hand, exhibit multiple complex phenotypes that may contribute to their emergence and proliferation. For example, USA300 strains carry antibiotic resistance cassettes, Panton Valentine Leukocidin (PVL) associated with pyogenic skin infections, increased ability to colonize locations outside of the nasopharynx etc1,7,8. As these strains often emerge clonally from closely related ‘ancestral strains,’ efforts to discern causal mutations that lead to their increased clinical burden is hampered by strong population-stratification and genome-wide linkage disequilibrium 9–11. Though recombination at species level is common in S. aureus, within clade recombination rates tend to be lower, thus preserving the linkage between mutations 10,12–14. Due to this limitation, studies of emerging strains often focus on gene level analysis such as acquisition of mobile genetic elements or loss of gene function as their effect on phenotype is easier to determine than that of point mutations15,16. Computational modeling methods can help tackle these challenges by predicting phenotype differences between strains, thus acting as a sieve to filter enriched mutations with potential phenotypic effects and therefore find candidate causal mutations 17–19. Even if experimentally intractable, the large possible phenotypic space of an organism can be explored quickly with computational models. Taking advantage of these modeling techniques, we use a reconstruction of the transcriptional regulatory network (TRN) of CC8 strains to find USA300 specific mutations that are associated with changes in gene regulation.
First, we used De Bruijn graph GWAS (DBGWAS) to discover enriched mutations associated with the USA300 strain within clonal complex 8 (CC8) 20. Due to clonal expansion of USA300 strains from their progenitors within CC8, the enriched USA300 specific mutations were in high linkage disequilibrium. Further complicating the matter, we found that almost all mutations enriched within open reading frames (ORFs) were unique to USA300 lineage and not found in any other clonal complexes, precluding identification of potential causative mutations by homoplasy. Instead, we turned to reconstruction of a TRN to identify genes that were both associated with an enriched mutation and had altered regulation in USA300 strains. We built an Independent Component Analysis (ICA)-based reconstruction of the TRN using 670 publicly available RNA sequencing samples from both USA300 and non-USA300 CC8 strains. By factoring the RNA sequencing data into a series of signals and their activities, the ICA-based reconstruction of the TRN shows both the static gene-regulator interaction and the dynamic activity of these interactions in a sample specific manner21. However, ICA is a generalized signal extraction algorithm and therefore does not distinguish between biological sources of signals like regulatory elements and ‘artificial’ sources that can be created by sourcing data from multiple strains. Therefore, in addition to signals associated with gene regulators, ICA also outputs signals associated with strain-specific changes in gene regulation. Furthermore, by utilizing RNA sequencing data from hundreds of samples to identify genes with strain-specific expression patterns, this approach is more likely to find strain-specific differences than previous approaches that focus on specific conditions 22,23.
This analysis revealed several genes with distinct expression patterns in USA300 strains that were also associated with DBGWAS enriched mutations. One of these genes, isdH, which encodes a haptoglobin binding protein, showed several enriched mutations in the regulatory region of USA300 strains including the deletion of the Fur repressor binding site. Additionally, the isdH gene generally had higher expression levels in samples from USA300 strains, connecting the mutation enriched by DBGWAS with differences in TRN enriched by ICA. Overall, our analysis shows how the reconstruction of TRN can be used to extend the limits of current GWAS approaches when studying emerging populations of bacterial pathogens.
Results
Classifying USA300 and non-USA300 genomes based on genetic markers
We sought to compare the genetic differences between USA300 CA-MRSA strains and other subtypes within CC8 that have lower clinical and community burden. Given that both subtypes exist within the same clonal complex, this comparison allowed us to probe the genetic basis for the success of USA300 strains with limited confounding effects of different genetic backgrounds. We downloaded 2033 S. aureus genomes for analysis and excluded six of them with genome length of less than 2.5 million base pairs. The CC8 pangenome consisted of 19176 gene clusters with 2291 core genes that were present in at least ∼95% of the genomes analyzed. Among the remainder of the genes, 931 were categorized as accessory genes and 15954 were found in less than 5% of the genomes (Figure S1a). The collection formed a closed pangenome, as adding new genomes did not introduce many new genes (Figure 1a), suggesting that our collection had a good gene level coverage of the CC8 pangenome. We confirmed the pangenome coverage with Roary (Figure S1b). To get a higher resolution view of these genomes, we surveyed unique alleles within the ORFs and in the 300 base-pair 5’ upstream and 3’ downstream sequences. We found a larger number of mutations within the ORFs, indicating the presence of greater genetic variation in the ORFs than in the neighboring regulatory regions. This is reflective of the fact that most of S. aureus genome sequence comprises of ORFs e.g. ∼84% of TCH1516 genome is part of an ORF.
Next, we classified the CC8 genomes into USA300 and non-USA300 strains using Genetic Marker Inference (GMI). GMI was previously developed to rapidly and systematically identify different subclades within inner-CC8 strictly based on genetic markers 24. In this scheme, USA300 genomes can be differentiated from non-USA300 CC8 genomes by the presence of either SCCmec IVa or the presence of Panton-Valentine Leukocidin (PVL) in case of methicillin sensitive S. aureus (MSSA).
To identify the root of the USA300 clades, we first traversed up nodes of the phylogenetic tree starting from known USA300 strain TCH1516 and determined the number of strains, fraction PVL positive and fraction SCCmec IVa positive for each node during traversal. The root was placed at the last node where >90% of the strains within the subclade represented by the daughter nodes were SCCmec IVa and PVL positive (Figure 1b). As phylogenetic trees are nested, root finding with this procedure is not dependent on the starting USA300 strain.
Same root was identified when the procedure was initialized with another well-known USA300 reference strain FPR3757 (Figure S1c). Combining the genetic markers with phylogenetic grouping led to the classification of 1449 genomes as USA300 and 589 genomes as non-USA300 (Figure 1c, Supplementary Table 1). Strains previously identified as ‘early USA300’ were not part of our USA300 classification24. While many of these strains are PVL positive, they have variable SCCmec types and therefore are likely to be clinically distinct from the USA300 strains.
Enriching USA300 specific genes and mutations using DBGWAS
After classifying the genomes into USA300 and non-USA300 strains, we identified genes and mutations associated with each subtype by using the De Bruijn graph Genome Wide Association Study (DBGWAS)20. We used 2030 genomes for this analysis; the 2027 genomes in pangenomics analysis above were “spiked” with three well known CC8 genomes-TCH1516, COL, and Newman-to help annotate the DBGWAS unitigs. DBGWAS provides a reference genome-free method for conducting GWAS analysis in prokaryotes by building a compacted De Brujin Graph to represent the pan genome of input sequences. The nodes of the graph represent unique compacted k-mers that are joined by edges to other nodes with k-mers that appear adjacent to it in genomes. The procedure enriches unique k-mers that appear with different frequencies in each classification and outputs the enriched k-mer as well as its genetic neighborhood (called ‘components’) from the De Bruijn graph. Visualizing the components associated with the enriched k-mers makes it easier to interpret the k-mers and makes it easy to identify large structural variations (e.g. cassette acquisition) which are often represented by multiple enriched k-mers that fall within the same component. We took the output component graphs and automatically extracted the enriched genetic changes e.g. indels, SNPs, phage insertions etc (see Supplementary Note 1).
Many of the components were associated with genes and genetic elements expected to be enriched with USA300 strains-SCCmec IVA (the GMI marker), Arginine Catabolite Repressor Element (ACME), cap5E point mutation, multiple prophages etc (Supplementary Table 2). In total, we found k-mers in 149 components associated with 137 unique TCH1516 genes that were enriched in this analysis, pointing to a large array of mutations that are unique to the USA300 lineage (Figure 2a). Significant k-mers in some components did not uniquely match to TCH1516 genes or only matched to genes in non-USA300 reference genome, NCTC8325.
Genome-wide linkage and de novo mutations obfuscate identification of causal mutations
Though these mutations were enriched in USA300 strains with DBGWAS, we could not attribute the prevalence of any mutation to selection due to strong genome-wide linkage. We quantified the linkage disequilibrium by calculating the square of the correlation coefficient (r2) for each of the enriched k-mer not associated with mobile genetic elements. High correlation coefficient indicates tight co-occurrence of k-mers in the genomes and therefore high linkage disequilibrium between the sequences. There was a strong linkage between the k-mers that were enriched in USA300 strains. Surprisingly, even k-mers that were 1.4 million base pairs away (the maximum distance between two sites in the circular 2.8 million base pairs long S. aureus genome) still had r2 over 0.9 (Figure 3a).
To differentiate potential causal mutations from genetically linked alleles, we searched for mutation hotspots by comparing the positions of USA300 mutations in open reading frames (ORFs) to mutations in other clonal complexes. Barring recombination events, presence of mutation hotspots in the same position in multiple clades could point to selection acting on the sequence. Therefore, we searched for prevalence of enriched mutations in other non-CC8 clades. We identified 61 SNPs within open reading frames (ORFs) that were enriched in USA300 strains. To identify mutational hotspots in other clades, we downloaded all the amino acid sequences belonging to the PATRIC genus protein family of each of the gene products encoded by the selected ORFs 25. The PATRIC local protein family consists of sequences of homologous proteins within the same genus which were further filtered down to S. aureus species specific sequences. After filtering, each protein family comprised 2,000 to 16,000 unique sequences and the strains from which the amino acid sequences were derived spanned dozens of clades allowing for broad comparisons (Figure S2). Lastly, we removed sequences associated with ST239 as it is thought to have emerged from large-scale recombination of ST8 and ST30 strains 26.
We determined mutation hotspots by calculating position specific allelic entropy. Allelic entropy at a given amino acid position is a function of the number of unique amino acids found in that position and the frequency of the mutation 27. Positions where all queried sequences have the same amino acid have low entropy, while positions that have frequent amino acid substitutions (hotspots) have high entropy (Figure 3b). This measure allows us to quickly determine the positions of mutation hotspots while accounting for multiple possible amino acid substitutions and rare mutations. Before calculating the position-specific entropy, all sequences within each of the PATRIC local protein families were aligned with Multiple Sequence Alignment (MSA). This alignment ensures proper comparison of amino acids even when there are deletions or insertions in some of the genes in the family.
Of the 36 enriched ORF mutations only the Asp75Tyr mutation in the cap5E gene, which was previously shown to ablate capsule production in USA300 strains, was found in other strains (Figure 3c) 16. Peaks in entropy corresponding to this mutation position were present in both the CC8 and non-CC8 strains while all other mutation positions were unique to CC8. Despite not having any perfect matches outside of the cap5E mutation, we found that for 28 of the mutations, a peak was present in sequences from other clades within 71 MSA positions. Together, our data suggests that mutations within ORFs in USA300 strains are likely de novo mutations and are not acquired through horizontal gene transfer though many of these mutations have occurred in hotspot regions (Figure 3d).
iModulon in the CC8 TRN points to mutations associated with differential regulation
The presence of genome-wide linkage and de novo mutations in ORFs severely limited the ability to distinguish causal SNPs contributing to increased pathogenesis in USA300 strains. The effect of some mutations, especially in ORFs, has been successfully linked to distinct phenotypes such as the absence of a capsule in USA300 and USA500 strains 16. However, the effect of mutations associated with changes in gene regulations can be much more difficult to assess 15. To look for mutations that may be associated with changes in transcriptional regulation, we used ICA to model gene-regulation in CC8 strains which can predict strain specific differences in expression patterns.
We collected CC8 strains associated RNA-sequencing data from Sequence Read Archive (SRA). After stringent QC/QA and curation, 291 non-USA300 strains (e.g. Newman, NCTC8325) and 379 USA300 (e.g. LAC, TCH1516) samples were used to create a single reconstruction of TRN across CC8 using ICA 21,28 (Supplementary Table 3) ICA calculates independently modulated sets of genes, iModulons, and the activities of those gene sets in each sample. iModulons calculated by ICA represent distinct sources of signals in the RNA-sequencing data. While most of the signals can be associated with different regulatory elements, iModulons associated with other biological features such as mobile genetic elements, genomic backgrounds are also enriched. In Escherichia coli and Salmonella enterica Typhimurium, multi-strain ICA has been used to calculate strain-specific iModulons that represent differences in gene expression 21,29.
In our reconstruction, two iModulons captured a large number of genes with different expression levels in the non-USA300 and USA300 strains (Figure 4a, Figure S3a). Most of the genes in the strain-specific iModulons belonged to mobile elements associated with USA300 strains such as ACME, SCCmec, Phi-PVL etc. However, the iModulons also contained core genes that are present in both strains, pointing to possible differences in gene regulation (Figure 4b, Figure S3b).
We mapped the enriched mutations from DBGWAS onto the core genes enriched in the strain-specific iModulon. 10 core genes-including isdH, argR1 and araC family regulator-with mutations in the ORF or in the regulatory region were also enriched in the strain-specific iModulon (Supplementary Table 4). Of these genes, gene isdH, encoding a heme scavenger molecule showed distinct strain-specific expression levels and had enriched k-mers that are mapped to the upstream regulatory region. Therefore, we compared the upstream regulatory region of several reference strains including TCH1516 (USA300), NCTC8325 (CC8b), 2395 (USA500). Additionally, we included MW2 (CC1 CA-MRSA) as the transcription start site (TSS) in the region has been experimentally confirmed in this strain30. Comparisons showed a 38 base-pair deletion in the 5’ untranslated region containing a transcription factor Fur binding site (q-val=0.033e-4) (Figure 4c).
This deletion was detected in all of the 1385 USA300 genomes, but only present in 95 of the 589 non-USA300 genomes. As Fur is a repressor that blocks expression in presence of iron concentration, this deletion in the Fur binding site may be responsible for the general increase in isdH expression observed in USA300 samples (Figure S4). We also found a second mutation upstream of the predicted −35 binding site that was also enriched in USA300. Interestingly, while the MW2 strain did not have the 38 bp deletion, it contained the exact upstream A->T mutation. All other base-pairs in the region were perfectly matched in between all the reference genomes. The combination of evidence from genetic and transcriptomic analysis suggests that regulation of isdH is altered in USA300 strains compared to its non-USA300 progenitors.
As with many point mutations detected in our analysis, the absence of Fur binding site upstream of isdH gene is prevalent only in the USA300 lineage. We searched for Fur binding motif in the 100 base-pairs upstream regions from 3515 non-CC8 strains spanning multiple clonal complexes (Figure S2). We detected the binding motif in all but 21 strains. Of the 21 strains with no detectable Fur binding sites, 6 belonged to ST72 (out of 28 total from this type) and 6 had uncharacterized MLST type. The rest were distributed among types 121, 1750, 375, 1, 3317, 15, 7, 398, and 4803, with one positive strain per type.
Discussion
Emergence of CA-MRSA USA300 strains from HA-MRSA USA500 progenitors presents a natural experiment to probe the genetic basis for the establishment of the USA300 lineage. However, in studying these groups, genetic methods like GWAS were limited in finding causal mutations due to genome-wide linkage disequilibrium and presence of an unexpectedly large number of de novo mutations unique to the USA300 lineage. Here, we demonstrated how a model of transcriptional regulation with iModulons can be used to make a headway through the impasse created by the high linkage disequilibrium and identify GWAS-enriched mutations that are also associated with measurable phenotypic changes in the TRN. From the combined RNA sequencing dataset of USA300 and non-USA300 strains, ICA calculated two iModulons that captured strain specific variation in gene expression. As expected, most genes in the iModulons were part of mobile genetic elements such as ACME and SCCmec because they have zero expression level in non-USA300 samples. However, the iModulon also contained several core genes that are present in both groups but are differentially regulated. A deeper analysis of the regulatory region of one of these genes with enriched mutation, isdH, revealed a deletion of a DNA segment containing the binding site of the Fur repressor. In congruence with this observation, we also found that USA300 strains with the deleted Fur binding site showed general increase in isdH expression level. This gene encodes IsdH, a surface receptor that binds to human hemoglobin, causing it to release the heme31. It is part of an arsenal of S. aureus iron sequestration proteins including Staphyloferrins and ferrous iron transporters that it uses to compete with the host for essential iron32. Despite having many different pathways for obtaining iron, it has been observed that S. aureus prefers heme as its iron source over transferrin-bound iron33. Our analysis shows that preference for heme is reflected in the genomic and transcriptomic signature of USA300 as a deletion of Fur binding region upstream of isdH and subsequent increase in its expression. Combining GWAS with large-scale transcriptomic modeling was therefore able to predict potential causal mutations contributing to the increased clinical burden of the USA300 lineage.
The current analysis utilized the available DNA and RNA sequencing data and the methods used here are scalable to the rapidly growing number of data in the public repositories. Indeed, with the greater scale, we can get more granular insight into subclade specific differences. The transcriptomic analysis consisted of samples primarily from the USA300 (CC8e and CC8f) clades, the CC8a clade represented by Newman and the CC8b clade represented by NCTC8325 and its derivatives. However, the CC8b and CC8a clades are currently undersampled due to its minimal clinical burden compared to USA300. We therefore combined strains from all non-USA300 clades into a single group for GWAS. The misalignment of RNA sequencing samples from GWAS samples may explain the low number of hits that were enriched by both methods when many other unique gene expression patterns have been observed in USA300 strains. This misalignment points to the limit of our approach. Most other phenotypes of clinical interest such as antibiotic resistance may not separate cleanly into distinct clades. In those cases, it is not obvious which strains should be chosen as the reference strain for RNA-sequencing and subsequent TRN reconstruction. The choice of reference strain as well as the choices in the RNA-sequencing sample conditions will impact which association between mutations and changes in gene regulations are uncovered.
With time, the scaling of databases may be able to resolve the issue of imbalanced sampling. On the other hand, resolving the confounding effect of linkage disequilibrium inherent in emerging and clonal strains will require a new generation of modeling methods11. Our current approach focuses on modeling the changes in gene regulation at the transcriptional level, but causal mutations can have any number of effects on the phenotype of the organism. New modeling methods that can systematically predict these other phenotypes are now rapidly emerging. Our recent work with Mycobacterium tuberculosis utilized a metabolic allele classifier (MACs) which combines genome scale metabolic models with machine learning to estimate biochemical effects of alleles thus mapping mutations to changes in metabolic fluxes 19.
Similarly, advances in protein structure prediction with AlphaFold2 and RosettaFold puts us at the cusp of predicting the effects of mutations on protein folding 34,35. Combination of these modeling techniques may therefore prove to be the breakthrough required to advance solutions to the current challenges in population genetics of emerging pathogens.
Materials and Methods
Pangenomic analysis
The pangenome analysis was run as described in detail before 27. Briefly, “complete” or “WGS” samples from CC8/ST8 were downloaded from the PATRIC database25. Sequences with lengths that were not within 3 standard deviations of the mean length or those with more than 100 contigs were filtered out. A non-redundant list of CDSs from all genomes was created and clustered by protein sequence using CD-HIT (v4.6) with minimum identity (-T) and minimum alignment length (-aL) of 80% and word size of 5 (-w 5)36. To get the 5’ and 3’ sequences, non-redundant 300 nucleotide upstream and downstream sequences from the CDS were extracted for each gene.
The CDSs were divided into core, accessory and unique genes based on the frequency of genes as previously described27. To calculate the frequency thresholds for each category, P(x), the number of genes with frequency x and its integral F(x), the cumulative frequency less than or equal to x were calculated. The multimodal gene distribution can be estimated by sum of two power laws as:
where N is the total number of genomes, x is the gene frequency and (c1, c2, -α1, -α2) are parameters fit based on the data. The cumulative distribution is then the integral of P(x) with additional parameter k:
The parameters (c1, c2, -α1, -α2, and k) were fitted based on the data using nonlinear least squares regression from scipy37. The frequency threshold of core genomes was defined as greater than 0.9 N + 0.1 x* and the threshold for unique genome was defined as 0.1x*, where x* represents the inflection point of the fitted cumulative distribution.
Roary (v3.13.0) was used with -i 95 flag to confirm the output of our pangenome analysis38.
Reconstructing CC8 phylogenetic tree
The phylogenetic tree was reconstructed using the standardized PHaME pipeline on the PATRIC sequences that passed the QC/QA39. Using the pipeline, the contigs and sequences were aligned to the reference TCH1516 genome NC_010079 and plasmids NC_012417, NC_01006340 and 24881 core SNPs at were calculated. The core SNPs were then used to estimate the phylogenetic tree using IQ-TREE(v1.6.7) run with 1000 bootstraps and utilizing the ultrafast bootstrap41,42. The tree was built using the “TVMe+ASC+G4’’ model as suggested by the IQ-TREE ModelFinder43. Finally, iTOL was used to visualize, annotate and root the tree with the USA100 D592 (NZ_CP035791) from CC5 as the outgroup 44.
Classification of USA300 and non-USA300 strains
The USA300 and non-USA300 strains were classified based on a previously proposed and validated CC8 subtyping scheme24. In this scheme, USA300 strains can be identified from the whole genome if they are PVL positive MSSA or MRSA with SCCmec IVa cassette. We detected SCCmec types using SCCmecFinder (v1.2), and only those genomes where the cassette could be identified by both BLASTn and k-mer based methods were marked as positive45. PVL was detected using nucleotide BLAST(v2.2.31). We added additional criteria that all genomes identified as USA300 by GMI form a distinct subclade before they are labeled as USA300 i.e. PVL or SCCmec IVa positive genomes that grouped separately from other USA300 strains in the phylogenetic tree were not labeled as USA300. To find the root of the USA300 strains in the phylogenetic tree, the genomes in the tree were first annotated by their PVL and SCCmec status. Then the tree traversed from leaf to root starting from known USA300 strains – TCH1516 and FPR3757-while keeping track of the number of descendant genomes from the current root that contained known markers SCCmec IVa and PVL. The node where the number of genomes with the markers started flatlining was marked as the root of USA300.
We detected SCCmec cassettes in 1588 genomes of which 1358 were SCCmec IVa positive. We also found 1431 PVL positive genomes using BLASTn search with PVL encoding genes from USA300 TCH1516 (USA300HOU_RS07645, USA300HOU_RS07650) as reference (Supplementary Table 5). Lastly, we reconstructed the CC8 phylogenetic tree based on core SNPs and rooted the tree using strain D592 (CC5) as an outgroup. The tree was then traversed from reference strain TCH1516 to the CC8 root using ete3, while tracking the total number of genomes, the total number of SCCmec IVa positive genomes and the number of PVL positive genomes in each root 46. The root of USA300 was placed manually where the number of total genomes kept increasing while the number of PVL and SCCmec positive genomes plateaued. All strains in the clade represented by the USA300 root were classified as USA300 regardless of their SCCmec or PVL status.
DBGWAS and k-mer linkage calculations
DBGWAS (v0.5.4) was used to enrich mutations unique to USA300 strains using default k-mer size of 31 (-k 31) and neighborhood size of 5 (-nh 5). Alleles with frequency less than 0.1 were filtered (-maf 0.1) and all components enriched with q-values less than 0.05 were documented (-SFF q0.05). Genome-wide linkage was estimated by Pearson correlation (calculated with built-in Pandas function) of the presence/ absence of enriched k-mers and distance was measured based on the k-mer alignment to the reference TCH1516 genome as determined by BLASTn.
To determine the enriched ‘genetic event’ (e.g. SNP, indel, mobile genetic element etc), the graph output from DBGWAS was first loaded onto a networkX model47. All nodes in the graph with frequency lower than 0.05 were discarded. MGEs were identified if all significant nodes from DBGWAS had higher frequencies in one strain, e.g. all nodes associated with SCCmec had higher frequencies in USA300 strains. To find SNPs and smaller indel events, the networkx was used to find cycles in the graph, which results from bifurcation and eventual re-collapse of Debruijn graphs around mutations. For each cycle, the ‘end nodes’ representing the start and end of the bifurcation were identified by finding the nodes in the cycle with highest frequency across all samples. As ‘end nodes’ are present in both case and control samples, they will have higher frequency than other nodes in the cycle which are specific to either case or control. Once the end nodes are identified, the two paths around the bifurcations representing the case and control specific sequences were identified using the shortest path algorithm in networkx. The sequences from nodes of each path were concatenated, changing the sequences to reverse complements and removing overlaps in sequences when required. The concatenated sequences from each path were then compared using BioPython (v1.83) pairwise global alignments to find the SNPs or indels that differentiate the sequences from case and control48. If reference sequences are passed, the concatenated sequences are aligned to the reference sequences using nucleotide BLAST and mutation positions were converted from k-mer positions to positions in the reference genomes. The code used for this analysis can be found in https://github.com/SBRG/dbgwas_network_analysis.
Mapping mutation hotspots with position specific Shannon entropy
For each of the CDS with enriched mutations, the PATRIC local protein family (PLfam) was identified based on the reference TCH1516 genome. All available protein sequences for each CDS PLfam were downloaded and filtered for S. aureus sequences. The multilocus sequence type (MLST) of the source genome of each downloaded sequence was mapped using the PATRIC database. The online PATRIC website was used to find and filter the target sequences. The sequences were divided into ST8 and non ST8 and ST239 sequences were filtered. MAFFT was used for multiple alignment and position-specific Shannon entropy was calculated on the aligned file49. The entropy is calculated as:
where n is the total number of unique amino acids in the position and P(xi) is the probability of finding the given amino acid.
Calculating strain-specific iModulons with independent component analysis
A detailed version of the methods for RNA-sequencing and ICA analysis is available as Supplementary Note 2. ICA of RNA sequencing data was performed using the pymodulon package 28. Using the package, all available RNA sequencing data for non-USA300 and USA300 strains were downloaded, run through the QC/QA pipeline, manually curated for metadata and aligned to the TCH1516 genome (NC_010079, NC_012417, NC_010063). The combined data was then transformed into log-TPM (Transcripts per Million) and normalized to a single reference condition (SRX3760886, SRX3760891). This contrasts with other ICA models that normalize the data to project specific reference conditions to reduce batch effects.
However, normalizing to project specific control conditions also erases the strain specific information as almost all BioProjects contain data from only one isolate (e.g. NCTC8325, TCH1516, LAC etc). ICA was then run as previously described (see Supplementary Note 2)21. The activities of the output iModulons were manually parsed to look for iModulons with the largest strain specific differences.
Fur box motif search
isdH genes in all the genomes were first clustered using CD-HIT with identity and coverage minimum of 0.836. All annotated isdH genes fell within a single cluster. For each genome, the 100 bp upstream region was then extracted and used for motif search. Motif search for the Fur box was conducted using the FIMO package from the MEME suite (v5.1.0) with default settings50. The S. aureus strain NCTC8325 Fur motif from collecTF was used as a reference51.
Code and Data Availability
The genome sequences and the RNA-sequencing data used in this study are publicly available (See Supplementary Table 1 and 3 respectively). The code used for analysis, the intermediate files and models are available on github (https://github.com/sapoudel/USA300GWASPUB).
Supplementary Materials
Supplementary Note 1. Converting DBGWAS enriched component graphs to interpretable mutations
Currently, DBGWAS outputs the graph consisting of the nodes with the enriched k-mers and its genetic neighborhood, but does not automatically yield the exact mutation associated with each of the significant nodes. By analyzing the structure of the component graphs with networkX, we were able to extract the exact genetic changes represented by these components 1. Mobile genetic elements (MGEs) and large indels can be identified by a series of nodes that are all enriched in either USA300 or non-USA300 genomes (Figure S5a). The enrichment of multiple sequential k-mers in only one of the groups implies deletion of the sequence (or conversely insertion) in the other group. SNPs and indels smaller than the k-mer-size on the other hand form ‘cycles’ containing significant nodes (Figure S5b). Consequently, the k-mers in the nodes of each of the ‘paths’ around the cycle represent sequences unique to either case or control group. The enriched mutation can therefore be extracted by comparing the sequences with global alignment. Lastly, the unique sequences from each path can also be mapped to reference genomes if needed. The exact mutations used in the subsequent sections were extracted from the components using this method (Supplementary Table 2). The code used to automatically analyze the network has been uploaded to github (https://github.com/SBRG/dbgwas-network).
Supplementary Note 2. Creating iModulons for CC8 Clade Staphylococcus aureus
We followed our previously established pipeline to generate iModulons for CC8 clade Staphylococcus aureus2. We began by collecting all available RNA-sequencing data and metadata for S. aureus strains from Sequence Read Archives (SRA) that belonged to the CC8 clade. Most of the sequences were from well-studied CC8 strains-TCH1516, FPR3757, LAC, Newman, NCTC8325. Others had less specific labels e.g. “USA300” but still belonged to CC8. The fastq files for the samples were trimmed with TrimGalore (v0.6.5) and aligned to reference TCH1516 genome (NC_010079, NC_012417, NC_010063) with bowtie2 (v1.2.3) 3,4. The gene-specific read counts were calculated with HTSeqCount (v2.0.1) using the intersection-strict criteria. The number of mapped reads were then normalized to transcripts per million (TPM) and log-transformed (log-TPM).
Before using the data, the quality of the reads and alignment were assessed using FastQC and MultiQC (v 1.11)5,6. Any samples failing ‘per base sequence quality’, ‘per sequence quality score,’ ‘per base n content’, or ‘adapter content’ were dropped. Additionally, we also removed samples with less than 500,000 reads aligned to the reference. Lastly, samples that did not contain replicates or those with replicates with Pearson Correlation coefficients less than 0.9 were also excluded. We then collected additional metadata for the remaining 670 RNA-sequencing samples including growth conditions, genetic changes, associated experiment etc. The log-TPM were then centered to the reference condition of S.aureus TCH1516 grown in RPMI + 10 % LB. By centering data from non-USA300 strains on USA300 reference, ICA is able to pick up strain-specific regulatory changes e.g. ICA captures the activity of Fur transcription factor as a linear combination of Fur iModulon containing gene regulated by Fur and a second ‘strain-specific’ iModulon that captures differences between USA300 and non-USA300 strains (Figure S3).
We applied FastICA to the centered log-TPM to calculate the M and the A matrix which respectively describe the iModulon structure and their activities7,8. To find the best possible model, we first had to compute an optimal number of stable components. As FastICA is non-deterministic, each iteration yields a slightly different component weightings and activity levels. It may also yield “spurious” components that are only present in a subset of runs. To find stable components, we ran ICA 100 times with a random seed. Similar components (e.g. same component containing Fur associated iModulon) from different iterations which may have slightly different weightings were detected by clustering with DBSCAN. Only components that appear in every run were accepted. When running ICA, users must also provide the number of desired components that the data will be decomposed into. Decomposition into too few components could lead to signals from several transcription factors being combined into a single components while over decomposition leads to many unstable and “single-gene” iModulons that likely capture noise in the data. To find the optimal number of components we used a heuristic method, OptICA, which runs ICA with different numbers of input components from 10 to 340 and suggests an optimal component that minimizes single-gene iModulons while maximizing robust components. Based on this heuristic, the final model was built with 270 components as input, 148 of which were determined to be robust components.
In each component, we labeled a gene as being part of an iModulon if their weighting in that component did not fall within a Gaussian distribution as determined by D’Agostino’s test. The genes in each iModulon was then compared to genomic features (e.g. regulons, phage, mobile cassettes etc; see ‘TRN’ object in the model), and was determined to be associated with the feature if there was significant overlap between the two groups (hypergeometric test; adjusted p-value < 0.05, precision >= 0.5 and coverage >= 0.2). We also manually curated other iModulons associated with other features e.g. iModulon where all member genes associated with translation were labeled ‘Translation iModulon.’
References
- 1.Panton-Valentine leucocidin is the key determinant of Staphylococcus aureus pyomyositis in a bacterial GWASElife 8
- 2.Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicityProc. Natl. Acad. Sci. U. S. A 113:E3801–9
- 3.The Staphylococcus aureus Two-Component System AgrAC Displays Four Distinct Genomic Arrangements That Delineate Genomic Virulence Factor SignaturesFront. Microbiol 9
- 4.Sequential evolution of virulence and resistance during clonal spread of community-acquired methicillin-resistant Staphylococcus aureusProc. Natl. Acad. Sci. U. S. A 116
- 5.Naturally occurring polymorphisms in the virulence regulator Rsp modulate Staphylococcus aureus survival in blood and antibiotic susceptibilityMicrobiology 164:1189–1195
- 6.Identifying lineage effects when controlling for population structure improves power in bacterial association studiesNat Microbiol 1
- 7.The arginine catabolic mobile element and staphylococcal chromosomal cassette mec linkage: convergence of virulence and resistance in the USA300 clone of methicillin-resistant Staphylococcus aureusJ. Infect. Dis 197:1523–1530
- 8.Importance of colonization site in the current epidemic of staphylococcal skin abscessesPediatrics 125:e618–24
- 9.Evolution and Global Transmission of a Multidrug-Resistant, Community-Associated Methicillin-Resistant Staphylococcus aureus Lineage from the Indian SubcontinentMBio 10
- 10.Phylogenomic Classification and the Evolution of Clonal Complex 5 Methicillin-Resistant Staphylococcus aureus in the Western HemisphereFront. Microbiol 9
- 11.Genomic insights into the emergence and spread of international clones of healthcare-, community-and livestock-associated meticillin-resistant Staphylococcus aureus: blurring of the traditional definitionsJournal of Global Antimicrobial Resistance 6:95–101
- 12.Molecular tracing of the emergence, diversification, and transmission of S. aureus sequence type 8 in a New York communityProc. Natl. Acad. Sci. U. S. A 111:6738–6743
- 13.Range Expansion and the Origin of USA300 North American Epidemic Methicillin-Resistant Staphylococcus aureusMBio 9
- 14.Mobile elements drive recombination hotspots in the core genome of Staphylococcus aureusNat. Commun 5
- 15.Virulence strategies of the dominant USA300 lineage of community-associated methicillin-resistant Staphylococcus aureus (CA-MRSA)FEMS Immunol. Med. Microbiol 65:5–22
- 16.USA300 and USA500 clonal lineages of Staphylococcus aureus do not produce a capsular polysaccharide due to conserved mutations in the cap5 locusMBio 6
- 17.Predicting the effects of SNPs on transcription factor binding affinityBioinformatics 36:364–372
- 18.Predicting the functional effect of amino acid substitutions and indelsPLoS One 7
- 19.A biochemically-interpretable machine learning classifier for microbial GWASNat. Commun 11
- 20.A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic eventsPLoS Genet 14
- 21.The Escherichia coli transcriptome mostly consists of independently regulated modulesNat. Commun 10
- 22.Genomic and transcriptomic differences in community acquired methicillin resistant Staphylococcus aureus USA300 and USA400 strainsBMC Genomics 15
- 23.Comparative virulence studies and transcriptome analysis of Staphylococcus aureus strains isolated from animalsSci. Rep 6
- 24.Improved Subtyping of Staphylococcus aureus Clonal Complex 8 Strains Based on Whole-Genome Phylogenetic AnalysismSphere 3
- 25.PATRIC, the bacterial bioinformatics database and analysis resourceNucleic Acids Res 42:D581–91
- 26.Evolution of Staphylococcus aureus by large chromosomal replacementsJ. Bacteriol 186:1060–1064
- 27.Comparative pangenomics: analysis of 12 microbial pathogen pangenomes reveals conserved global structures of genetic and functional diversityBMC Genomics 23
- 28.Mining all publicly available expression data to compute dynamic microbial transcriptional regulatory networksbioRxiv https://doi.org/10.1101/2021.07.01.450581
- 29.Pan-genomic analysis of transcriptional modules across Salmonella Typhimurium reveals the regulatory landscape of different strainsbioRxiv https://doi.org/10.1101/2022.01.11.475931
- 30.TSS-EMOTE, a refined protocol for a more complete and less biased global mapping of transcription start sites in bacterial pathogensBMC Genomics 17
- 31.The Staphylococcus aureus IsdH Receptor Forms a Dynamic Complex with Human Hemoglobin that Triggers Heme Release via Two Distinct Hot SpotsJ. Mol. Biol 432:1064–1082
- 32.The Role of Iron in Staphylococcus aureus Infection and Human Disease: A Metal Tug of War at the Host-Microbe InterfaceFront Cell Dev Biol 10
- 33.Iron-source preference of Staphylococcus aureus infectionsScience 305:1626–1628
- 34.Accurate prediction of protein structures and interactions using a three-track neural networkScience 373:871–876
- 35.Highly accurate protein structure prediction with AlphaFoldNature 596:583–589
- 36.CD-HIT: accelerated for clustering the next-generation sequencing dataBioinformatics 28:3150–3152
- 37.SciPy: Open Source Scientific Tools for Python
- 38.Roary: rapid large-scale prokaryote pan genome analysisBioinformatics 31:3691–3693
- 39.Standardized phylogenetic and molecular evolutionary analysis applied to species across the microbial tree of lifeSci. Rep 10
- 40.Subtle genetic changes enhance virulence of methicillin resistant and sensitive Staphylococcus aureusBMC Microbiol 7
- 41.IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic EraMol. Biol. Evol 37:1530–1534
- 42.UFBoot2: Improving the Ultrafast Bootstrap ApproximationMol. Biol. Evol 35:518–522
- 43.ModelFinder: fast model selection for accurate phylogenetic estimatesNat. Methods 14:587–589
- 44.Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotationNucleic Acids Res 49:W293–W296
- 45.SCCmecFinder, a Web-Based Tool for Typing of Staphylococcal Cassette Chromosome mec in Staphylococcus aureus Using Whole-Genome Sequence DatamSphere 3
- 46.ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic DataMol. Biol. Evol 33:1635–1638
- 47.Hagberg, A., Swart, P. & S Chult, D. Exploring Network Structure, Dynamics, and Function Using Networkx. https://www.osti.gov/biblio/960616 (2008).
- 48.Biopython: freely available Python tools for computational molecular biology and bioinformaticsBioinformatics 25:1422–1423
- 49.MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transformNucleic Acids Res 30:3059–3066
- 50.MEME SUITE: tools for motif discovery and searchingNucleic Acids Res 37:W202–8
- 51.CollecTF: a database of experimentally validated transcription factor-binding sites in BacteriaNucleic Acids Res 42:D156–D160
- 1.Hagberg, A., Swart, P. & S Chult, D. Exploring network structure, dynamics, and function using networkx. https://www.osti.gov/biblio/960616 (2008).
- 2.Mining all publicly available expression data to compute dynamic microbial transcriptional regulatory networksbioRxiv 2021.07.01.450581 https://doi.org/10.1101/2021.07.01.450581
- 3.Trim galore: A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files
- 4.Fast gapped-read alignment with Bowtie 2Nat. Methods 9:357–359
- 5.FastQC: a quality control tool for high throughput sequence data
- 6.MultiQC: summarize analysis results for multiple tools and samples in a single reportBioinformatics 32:3047–3048
- 7.Scikit-learn: Machine Learning in PythonJ. Mach. Learn. Res 12:2825–2830
- 8.Efficient Variant of Algorithm FastICA for Independent Component Analysis Attaining the CramÉr-Rao Lower BoundIEEE Trans. Neural Netw 17:1265–1277
Article and author information
Author information
Version history
- Sent for peer review:
- Preprint posted:
- Reviewed Preprint version 1:
- Reviewed Preprint version 2:
Copyright
© 2023, Poudel et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views
- 359
- downloads
- 4
- citations
- 0
Views, downloads and citations are aggregated across all versions of this paper published by eLife.