Major genetic discontinuity and novel toxigenic species in Clostridioides difficile taxonomy

  1. Daniel R Knight (corresponding author)
  2. Korakrit Imwattana
  3. Brian Kullin
  4. Enzo Guerrero-Araya
  5. Daniel Paredes-Sabja
  6. Xavier Didelot
  7. Kate E Dingle
  8. David W Eyre
  9. César Rodríguez
  10. Thomas V Riley (corresponding author)
  1. Medical, Molecular and Forensic Sciences, Murdoch University, Australia
  2. School of Biomedical Sciences, the University of Western Australia, Australia
  3. Department of Microbiology, Faculty of Medicine Siriraj Hospital, Mahidol University, Thailand
  4. Department of Pathology, University of Cape Town, South Africa
  5. Microbiota-Host Interactions and Clostridia Research Group, Facultad de Ciencias de la Vida, Universidad Andrés Bello, Chile
  6. Millennium Nucleus in the Biology of Intestinal Microbiota, Chile
  7. Department of Biology, Texas A&M University, United States
  8. School of Life Sciences and Department of Statistics, University of Warwick, United Kingdom
  9. Nuffield Department of Clinical Medicine, University of Oxford, National Institute for Health Research (NIHR) Oxford Biomedical Research Centre, John Radcliffe Hospital, United Kingdom
  10. Big Data Institute, Nuffield Department of Population Health, University of Oxford, National Institute for Health Research (NIHR) Oxford Biomedical Research Centre, John Radcliffe Hospital, United Kingdom
  11. Facultad de Microbiología & Centro de Investigación en Enfermedades Tropicales (CIET), Universidad de Costa Rica, Costa Rica
  12. Department of Microbiology, PathWest Laboratory Medicine, Queen Elizabeth II Medical Centre, Australia
  13. School of Medical and Health Sciences, Edith Cowan University, Australia
7 figures, 3 tables and 5 additional files


Composition of C. difficile genomes in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA).

Snapshot obtained 1 January 2020; 12,304 strains (NCBI taxid 1496). (A) The 40 most prevalent sequence types (STs) in the NCBI SRA, coloured by clade. (B) The proportion of genomes in the SRA by clade. (C) Number and proportion of STs per clade found in the SRA relative to those present in the PubMLST database. (D) Annual and cumulative deposition of C. difficile genome data in the SRA.

C. difficile population structure.

(A) Neighbour-joining phylogeny of 659 aligned, concatenated, multilocus sequence-type (MLST) allele combinations coloured by current PubMLST clade assignment. Black bars indicate whole-genome sequencing (WGS) available for average nucleotide identity (ANI) analysis (n = 260). (B) A subset of the tree showing cryptic clades C-I, C-II, and C-III. Again, black bars indicate WGS available for ANI analysis (n = 17).

Species-wide average nucleotide identity (ANI) analysis.

Panels (A–C) show ANI plots for sequence type (ST)3 (C1) vs. all clades (260 STs) using FastANI, ANIm, and ANIb algorithms, respectively. Panels (D–G) show ANI plots for ST11 (C5), ST181 (C-I), ST200 (C-II), and ST369 (C-III) vs. all clades (260 STs), respectively. The National Center for Biotechnology Information (NCBI) species demarcation threshold of 96% is indicated by a red dashed line (Ciufo et al., 2018).

Bayesian analysis of species and clade divergence.

BactDating and BEAST estimates of the age of major C. difficile clades. Node dating ranges for both Bayesian approaches are transposed onto a maximum-likelihood phylogeny built from concatenated multilocus sequence type (MLST) alleles of a dozen sequence types (STs) from each clade. Archetypal STs in each evolutionary clade are indicated. The tree is midpoint rooted, and bootstrap values are shown (all bootstrap values of the cryptic clade branches are 100%). Scale bar indicates the number of substitutions per site. BactDating estimates the median time of the most recent common ancestor of C1–5 at 3.89 million years ago (mya) (95% credible interval [CI], 1.11–6.71 mya). Of the cryptic clades, C-II shared the most recent common ancestor with C1–5 (13.05 mya, 95% CI 3.72–22.44 mya), followed by C-I (22.02 mya, 95% CI 6.28–37.83 mya) and C-III (47.61 mya, 95% CI 13.58–81.73 mya). Comparative temporal estimates from BEAST show the same order of magnitude and support the same branching order (clades C1–5 [12.01 mya, 95% CI 6.80–33.47 mya]; C-II [37.12 mya, 95% CI 20.95–103.48 mya]; C-I [65.93 mya, 95% CI 37.32–183.84 mya]; C-III [142.13 mya, 95% CI 79.77–397.18 mya]).

Revised taxonomy for the Peptostreptococcaceae.

(A) Average nucleotide identity (ANI)-based minimum evolution tree showing the evolutionary relationship between eight C. difficile ‘clades’ along with 17 members of the Peptostreptococcaceae (from Lawson et al., 2016) as well as Clostridium butyricum as the outgroup and type strain of the Clostridium genus sensu stricto. To convert the ANI into a distance, its complement to 1 was taken. (B) Matrices showing pairwise ANI and 16S rRNA values for the eight C. difficile clades and C. mangenotii (Cm), the only other known member of Clostridioides.
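The ANI-to-distance conversion described in panel (A) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes ANI is supplied as a percentage and rescaled to a fraction before taking the complement to 1, and the function name `ani_to_distance` is hypothetical. The three input values are the C. difficile ST3 entries from Table 1.

```python
import numpy as np


def ani_to_distance(ani_percent):
    """Return 1 - ANI, with ANI given in percent (assumed convention)."""
    ani_fraction = np.asarray(ani_percent, dtype=float) / 100.0
    return 1.0 - ani_fraction


# ANI (%) of C. difficile ST3 against the three cryptic-clade genomes (Table 1)
labels = ["ST181 (C-I)", "ST200 (C-II)", "ST369 (C-III)"]
ani_vs_st3 = [91.11, 93.54, 89.30]

distances = ani_to_distance(ani_vs_st3)
for label, d in zip(labels, distances):
    print(f"{label}: distance = {d:.4f}")
```

A full pairwise ANI matrix passed to the same function would give a symmetric distance matrix suitable as input for a distance-based tree method.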

Clostridioides difficile species pangenome.

(A) Pan and core genome estimates for all 260 sequence types (STs), clades C1–4 (n = 242 STs) and clades C1–5 (n = 225 STs). (B) The difference in % core genome and pangenome sizes with Panaroo and Roary algorithms. * indicates χ2 p<0.00001 and ** indicates χ2 p=0.0008. (C) The proportion of retained genes per genome after polishing Prokka-annotated genomes with Panaroo. (D) The total number of genes in the pan (grey) and core (black) genomes is plotted as a function of the number of genomes sequentially added (n = 260). Following the definition of Tettelin et al., 2005, the C. difficile species pangenome showed characteristics of an ‘open’ pangenome. First, the pangenome increased in size exponentially with sampling of new genomes. At n = 260, the pangenome was more than double the average number of genes found in a single C. difficile genome (~3700) and the curve was yet to reach a plateau or exponentially decay, indicating that more sequenced strains are needed to capture the complete species gene repertoire. Second, the number of new ‘strain-specific’ genes did not converge to zero upon sequencing of additional strains; at n = 260, an average of 27 new genes were contributed to the gene pool. Finally, according to Heaps’ law, α values of ≤1 are representative of an open pangenome. Rarefaction analysis of our pangenome curve using a power-law regression model based on Heaps’ law (Tettelin et al., 2005) showed the pangenome was predicted to be open (Bpan [≈ α; Tettelin et al., 2005] = 0.47; curve fit r2 = 0.999). (E) Presence-absence variation (PAV) matrix for 260 C. difficile genomes is shown alongside a maximum-likelihood phylogeny built from a recombination-adjusted alignment of core genes from Panaroo (2232 genes, 2,606,142 sites).
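The power-law regression mentioned in panel (D) can be illustrated with a short Python sketch. It rests on stated assumptions and is not the authors' analysis: the model is a pure power law y = A·x^B fitted by least squares on log-transformed values (the software used in the paper may fit a different functional form), the function name `fit_power_law` is hypothetical, and the accumulation curve is synthetic, generated only so that the fit has a known answer.

```python
import numpy as np


def fit_power_law(n_genomes, pan_sizes):
    """Fit y = A * x**B by linear regression on log-log values.

    Returns (A, B, r2), where r2 is the coefficient of determination
    of the fit in log space.
    """
    x = np.log(np.asarray(n_genomes, dtype=float))
    y = np.log(np.asarray(pan_sizes, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(np.exp(intercept)), float(slope), float(r2)


# Synthetic accumulation curve (NOT the study data): 260 genomes added in turn
n = np.arange(1, 261)
pan = 3700.0 * n ** 0.47

A, B, r2 = fit_power_law(n, pan)
print(f"A = {A:.1f}, B = {B:.3f}, r2 = {r2:.4f}")
```

On real data the input would be the mean pangenome size at each sampling depth, averaged over many random genome orderings, and the fitted exponent would then be compared against the openness criterion given in the caption.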

Toxin gene analysis.

(A) Distribution of toxin genes across C. difficile clades (n = 260 sequence types [STs]). Presence is indicated by black bars and absence by light blue bars. (B) Comparison of PaLoc architecture in the chromosome of strain R20291 (C2, ST1) and cognate chromosomal regions in genomes of cryptic STs 649 (C-I), 637 (C-II), and 369 (C-III). All three cryptic STs show atypical ‘monotoxin’ PaLoc structures, with the presence of syntenic tcdR, tcdB, and tcdE, and the absence of tcdA, tcdC, cdd1, and cdd2. ST369 genome ERR2215981 shows colocalisation of the PaLoc and CdtLoc (see below). (C) Comparison of CdtLoc architecture in the chromosome of strain R20291 (C2, ST1) and cognate chromosomal regions in genomes of cryptic STs 649/644 (C-I) and 343/369 (C-III). Several atypical CdtLoc features are observed: cdtR is absent in ST649, and an additional copy of cdtA is present in ST369, the latter comprising part of a CdtLoc colocated with the PaLoc. (D) Amino acid differences in TcdB among cryptic STs 649, 637, and 369 and reference strains from clades C1–5. Variations are shown as black lines relative to CD630 (C1, ST54). (E) Phylogenies constructed from the catalytic and protease domains (in blue) and translocation and receptor-binding domains (in orange) of TcdB for the same eight STs included in (D). Scale bar shows the number of amino acid substitutions per site. Trees are midpoint rooted and supported by 500 bootstrap replicates.


Table 1
Whole-genome ANI analysis of cryptic clades vs. 25 Peptostreptococcaceae species from Lawson et al., 2016.
| Species | NCBI accession | ANI % vs. ST181 (C-I) | ANI % vs. ST200 (C-II) | ANI % vs. ST369 (C-III) |
|---|---|---|---|---|
| Clostridioides difficile (ST3) | AQWV00000000.1 | 91.11 | 93.54 | 89.30 |
| Asaccharospora irregularis | NZ_FQWX00000000 | 78.94 | 78.87 | 78.91 |
| Romboutsia lituseburensis | NZ_FNGW00000000.1 | 78.51 | 78.36 | 78.66 |
| Romboutsia ilealis | LN555523.1 | 78.45 | 78.54 | 78.44 |
| Paraclostridium benzoelyticum | NZ_LBBT00000000.1 | 77.92 | 77.71 | 78.14 |
| Paraclostridium bifermentans | NZ_AVNC00000000.1 | 77.89 | 77.89 | 78.06 |
| Clostridioides mangenotii | GCA_000687955.1 | 77.82 | 77.84 | 78.15 |
| Paeniclostridium sordellii | NZ_APWR00000000.1 | 77.73 | 77.59 | 77.86 |
| Clostridium hiranonis | NZ_ABWP01000000 | 77.52 | 77.42 | 77.59 |
| Terrisporobacter glycolicus | NZ_AUUB00000000.1 | 77.47 | 77.53 | 77.53 |
| Intestinibacter bartlettii | NZ_ABEZ00000000.2 | 77.29 | 77.52 | 77.48 |
| Clostridium paradoxum | NZ_LSFY00000000.1 | 76.60 | 76.65 | 76.93 |
| Clostridium thermoalcaliphilum | NZ_MZGW00000000.1 | 76.49 | 76.61 | 76.85 |
| Tepidibacter formicigenes | NZ_FRAE00000000.1 | 76.41 | 76.47 | 76.38 |
| Tepidibacter mesophilus | NZ_BDQY00000000.1 | 76.38 | 76.44 | 76.22 |
| Tepidibacter thalassicus | NZ_FQXH00000000.1 | 76.34 | 76.31 | 76.46 |
| Peptostreptococcus russellii | NZ_JYGE00000000.1 | 76.30 | 76.08 | 76.38 |
| Clostridium formicaceticum | NZ_CP020559.1 | 75.18 | 75.26 | 75.62 |
| Clostridium caminithermale | FRAG00000000 | 74.97 | 75.07 | 75.03 |
| Clostridium aceticum | NZ_JYHU00000000.1 | ≤70.00 | ≤70.00 | ≤70.00 |
| Clostridium litorale | FSRH01000000 | ≤70.00 | ≤70.00 | ≤70.00 |
| Eubacterium acidaminophilum | NZ_CP007452.1 | ≤70.00 | ≤70.00 | ≤70.00 |
| Filifactor alocis | NC_016630.1 | ≤70.00 | ≤70.00 | ≤70.00 |
| Peptostreptococcus anaerobius | ARMA01000000 | ≤70.00 | ≤70.00 | ≤70.00 |
| Peptostreptococcus stomatis | NZ_ADGQ00000000.1 | ≤70.00 | ≤70.00 | ≤70.00 |
NCBI: National Center for Biotechnology Information; ANI: average nucleotide identity; ST: sequence type.
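As a worked reading of the first row of Table 1, the short Python sketch below compares the three cryptic-clade ANI values against the 96% NCBI species demarcation cited in the ANI figure caption (Ciufo et al., 2018). It is illustrative only: treating "at or above the threshold" as the same-species call is an assumption, and the helper name `meets_species_threshold` is hypothetical.

```python
# NCBI species demarcation (%), as cited in the ANI figure caption
NCBI_SPECIES_ANI_THRESHOLD = 96.0

# ANI (%) vs. C. difficile ST3, copied from the first row of Table 1
ani_vs_st3 = {
    "ST181 (C-I)": 91.11,
    "ST200 (C-II)": 93.54,
    "ST369 (C-III)": 89.30,
}


def meets_species_threshold(ani_percent, threshold=NCBI_SPECIES_ANI_THRESHOLD):
    """True if the ANI value is at or above the threshold (assumed rule)."""
    return ani_percent >= threshold


calls = {st: meets_species_threshold(ani) for st, ani in ani_vs_st3.items()}
for st, ani in ani_vs_st3.items():
    shortfall = NCBI_SPECIES_ANI_THRESHOLD - ani
    print(f"{st}: ANI {ani:.2f}% is {shortfall:.2f} points below the threshold")
```

All three values fall below 96%, which is the pattern the table is presented to show.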
Table 2
Major clade-specific gene clusters identified by Pangenome-Wide Association Study (pan-GWAS).
| Protein | Gene | Clade specificity | Functional insights |
|---|---|---|---|
| Ethanolamine kinase | ETNK, EKI | Unique to C-III and is in addition to the highly conserved eut cluster found in all lineages. Has a unique composition and includes six additional genes that are not present in the traditional CD630 eut operon or any other non-C-III strains. | An alternative process for the breakdown of ethanolamine and its utilisation as a source of reduced nitrogen and carbon. |
| 1-propanol dehydrogenase | pduQ | | |
| Ethanolamine utilisation protein EutS | eutS | | |
| Ethanolamine utilisation protein EutP | eutP | | |
| Ethanolamine ammonia-lyase large subunit | eutB | | |
| Ethanolamine ammonia-lyase small subunit | eutC | | |
| Ethanolamine utilisation protein EutL | eutL | | |
| Ethanolamine utilisation protein EutM | eutM | | |
| Acetaldehyde dehydrogenase | E1.2.1.10 | | |
| Putative phosphotransacetylase | K15024 | | |
| Ethanolamine utilisation protein EutN | eutN | | |
| Ethanolamine utilisation protein EutQ | eutQ | | |
| TfoX/Sxy family protein | - | | |
| Iron complex transport system permease protein | ABC.FEV.P | Unique to C-III. | Multicomponent transport system with specificity for chelating heavy metal ions. |
| Iron complex transport system ATP-binding protein | ABC.FEV.A | | |
| Iron complex transport system substrate-binding protein | ABC.FEV.S | | |
| Hydrogenase nickel incorporation protein HypB | hypB | | |
| Putative ABC transport system ATP-binding protein | yxdL | | |
| Class I SAM-dependent methyltransferase | - | | |
| Peptide/nickel transport system substrate-binding protein | ABC.PE.S | | |
| Peptide/nickel transport system permease protein | ABC.PE.P | | |
| Peptide/nickel transport system permease protein | ABC.PE.P1 | | |
| Peptide/nickel transport system ATP-binding protein | ddpD | | |
| Oligopeptide transport system ATP-binding protein | oppF | | |
| Class I SAM-dependent methyltransferase | - | | |
| Heterodisulfide reductase subunit D (EC: …) | … | [Unique] to C-III and is in addition to the highly conserved spermidine uptake cluster found in all other lineages. | Alternative spermidine uptake processes that may play a role in stress response to nutrient limitation. The additional cluster has homologs in Romboutsia, Paraclostridium, and Paeniclostridium spp. |
| CDP-L-myo-inositol myo-inositolphosphotransferase | dipps | | |
| Spermidine/putrescine transport system substrate-binding protein | ABC.SP.S | | |
| Spermidine/putrescine transport system permease protein | ABC.SP.P1 | | |
| Spermidine/putrescine transport system permease protein | ABC.SP.P | | |
| Spermidine/putrescine transport system ATP-binding protein | potA | | |
| Sigma-54-dependent transcriptional regulator | gfrR | Present in all lineages except C-I. Cluster found in a different genomic position in C-III. | Mannose-type PTS system essential for utilisation of fructosamines such as fructoselysine and glucoselysine, abundant components of rotting fruit and vegetable matter. |
| Fructoselysine/glucoselysine PTS system EIIB component | gfrB | | |
| Mannose PTS system EIIA component | manXa | | |
| Fructoselysine/glucoselysine PTS system EIIC component | gfrC | | |
| Fructoselysine/glucoselysine PTS system EIID component | gfrD | | |
| SIS domain-containing protein | - | | |
| Fur family transcriptional regulator, ferric uptake regulator | furB | Unique to C-II and C5. | Associated with EDTA resistance in E. coli, helping the bacteria survive in a Zn-depleted environment. |
| Zinc transport system substrate-binding protein | znuA | | |
| Fe-S-binding protein | yeiR | | |
| Rrf2 family transcriptional regulator | - | | |
| Putative signalling protein | - | Unique to C-I and C5 STs 163, 280, and 386. | In E. coli, AbgAB proteins enable uptake and cleavage of the folate catabolite p-aminobenzoyl-glutamate, allowing the bacterium to survive on exogenous sources of folic acid. |
| Aminobenzoyl-glutamate utilisation protein B | abgB | | |
| MarR family transcriptional regulator | - | | |
Key resources table
| Reagent type or resource | Designation |
|---|---|
| Software, algorithm | ABRicate |
| Software, algorithm | ACT: Artemis Comparison Tool |
| Software, algorithm | BactDating |
| Software, algorithm | BEAST |
| Software, algorithm | Clustal Omega |
| Software, algorithm | Easyfig |
| Software, algorithm | FastANI |
| Software, algorithm | Geneious |
| Software, algorithm | Gubbins |
| Software, algorithm | iToL |
| Other | KEGG database |
| Software, algorithm | Kraken2 |
| Software, algorithm | MAFFT |
| Software, algorithm | MEGA |
| Software, algorithm | MUSCLE |
| Other | NCBI RefSeq database |
| Other | NCBI Sequence Read Archive database |
| Software, algorithm | Panaroo |
| Software, algorithm | PanGP |
| Software, algorithm | Phandango |
| Software, algorithm | Prokka |
| Other | PubMLST database |
| Software, algorithm | pyani |
| Software, algorithm | QUAST |
| Software, algorithm | RAxML |
| Software, algorithm | Roary |
| Software, algorithm | Scoary |
| Software, algorithm | SPAdes |
| Software, algorithm | SPSS |
| Software, algorithm | SRST2 |
| Software, algorithm | TrimGalore |

Daniel R Knight, Korakrit Imwattana, Brian Kullin, Enzo Guerrero-Araya, Daniel Paredes-Sabja, Xavier Didelot, Kate E Dingle, David W Eyre, César Rodríguez, Thomas V Riley. Major genetic discontinuity and novel toxigenic species in Clostridioides difficile taxonomy. eLife 10:e64325.