Introduction

Genomic enzymology has been proven to help understand protein (super) families since the mid-1990s, helping to connect enzyme sequences to function through comparative genomics and neighborhood analysis (Babbitt et al., 1996; Knox and Allen, 2023). In this study, we are employing a genome neighborhood and co-occurrence analysis to help understand reactivity and functionality of the family of nickel containing carbon monoxide dehydrogenases ([NiFe]-CODHs) and their relationship to hybrid cluster proteins (HCPs).

[NiFe]-CODHs are ancient and diverse enzymes that catalyze the interconversion between carbon monoxide (CO) and carbon dioxide (CO2), a reaction of high interest for biotechnological applications, including CO2 capture and conversion. Research on this enzyme spans over 60 years, and recent studies have provided important biochemical insights, such as their turnover frequency and oxygen tolerance (Can et al., 2014). These properties vary greatly between enzymes, not only between separate phylogenetic clades, but also within clades, making functional prediction from sequence alone challenging. Phylogenetic analyses of available [NiFe]-CODH (hereafter referred to as CODH) sequences with different focus such as gene transfer (Techtmann et al., 2012), primary structure (Inoue et al., 2018), biodistribution (Inoue et al., 2022) and human gut microbiome (Katayama et al., 2024) have enriched our understanding of this old and diverse enzyme family. From initially small data sets of 17 sequences (Lindahl and Chang, 2001) to datasets well above 5000 sequences (this study). It has been shown that up to eight distinct phylogenetic clades (Figure 1) can be distinguished with all of them having sequence variations while preserving the overall fold, as seen with cryo-electron microscopy (Biester et al., 2024) and x-ray crystallography (Basak et al., 2025; Domnik et al., 2017; Gong et al., 2008; Jeoung et al., 2022; Jeoung and Dobbek, 2007; Wittenborn et al., 2020).

Schematic phylogenetic tree of [NiFe]-CODH, with selected CODHs marked with their respective position.

Tree was build using iqtree2, 1000 ultrafast bootstrap, containing 5508 putative CODH sequences and one outgroup (MBE6442607.1 hydroxylamine reductase [Desulfovibrio desulfuricans]) for rooting. Detailed searchable tree with bootstrap values can be found in Supplementary File 9_Tree5.

The biochemical characterization of this enzyme family is still ongoing, and it shows a wide range of turnover frequencies as well as different degrees of O2 tolerance. For example, looking at two CODHs from Carboxydothermus hydrogenoformans: ChCODH-II, a benchmark CODH known for its high CO oxidation activity but low O2 tolerance, contrasts with ChCODH-IV — another enzyme from the same clade and organism — which retains 20% of activity after 1 h of O2 exposure but displays reduced CO oxidation capacity with an increased activation barrier and much lower KM (Domnik et al., 2017), both belonging to clade F (Fig. 1). Similarly, a less active CODH from clade E, NvCODH (formerly known as DvCODH) from Nitratidesulfovibrio vulgaris, has been reported to fully reactivate after initial inactivation by O2 exposure(Hadj-Said et al., 2015). Also, the two CODHs from Thermococcus sp. AM4, TcCODH-I and TcCODH-II belonging to clade E react slower with O2 compared to ChCODH-II, but have overall equal O2 sensitivity (Benvenuti et al., 2020).

In addition to the previously mentioned diversity within CODH’s clades or organisms, with regards to activity and oxygen tolerance, it is known that some CODHs rely on maturases for full activation while others do not. For example, RrCODH, from the phototroph Rhodospirillum rubrum, needs to be expressed together with three maturases (CooC, CooJ and CooT) in order to be isolated in an active form (Kerby et al., 1997). A similar situation arises for NvCODH, however, its genomic neighborhood (Fig. 2) only contains one maturase (CooC) which is required for active production (Hadj-Said et al., 2015). On the contrary, ChCODH-II can be heterologously expressed without co-expression of any maturases (Merrouch et al., 2018). Also, ChCODH-I needs to be co-expressed with CooC in order to reach high activity but it can also be expressed without it, albeit with reduced activity (Inoue et al., 2014). Interestingly, much of the diversity with regards to activity, O2 tolerance and maturase dependence does not only occur between the different clades but also within them.

Operons of selected [NiFe]-CODH.

* NtCODH formally known as MtCODH formerly known as CtCODH due to renaming of host organism (Gtari and Ventura, 2025). ** NvCODH formerly known as DvCODH due to renaming of host organism (Waite et al., 2020).

Due to the homology between the CODHs in this study and the fact that active CODHs have been demonstrated from several of the clades, it is reasonable to assume that CODHs from all clades are able to interconvert CO2/CO. However, a recent study by Dobbek and co-workers showed that Carboxydothermus hydrogenoformans CODH-V (ChCODH-V) from clade D was not able to perform this reaction (Jeoung et al., 2022). They showed that this enzyme has a closer similarity to the family of hybrid cluster proteins (HCPs), due to its morphing active site, composed if iron, sulphur and oxygen, responding with structural and stoichiometric changes to changes in its redox state. A connection between HCPs and CODHs has been pointed out previously by Inoue et al. (Inoue et al., 2018) due to their close phylogenetic relationship, and was further discussed by Fujishiro et al. (Fujishiro and Takaoka, 2023). HCPs can be divided in three phylogenetic classes, of which class III exhibits a homodimeric structure like CODH. Generally, the two enzyme families share a similar overall fold while their active sites differ greatly, both in terms of amino acid and metallocofactor composition. Similar to ChCODH-V, HCPs do not catalyze CO2/CO interconversion but they do display a range of activities at low rate, such as hydroxylamine reductase-, peroxidase-, nitric oxide reductase- and S-nitrosylase activity. The main natural function of HCPs is debated but it was recently established that it is most likely a nitric oxide reductase involved in nitric oxide detoxification (Hagen, 2022).

In this study, we contribute to paint a wholistic phylogenetic picture of CODHs by focusing on the analysis of their genetic environment as well as harnessing the concept of synteny in which we use a semi quantitative approach to predict characteristics of CODH, clade and subclade specific. We are presenting certain clade specific trends in the operon composition in CODH. Since it is known that many organisms have multiple isoforms of CODHs coded in their genome, we analyze the co-occurrence of CODH of different clades in an organism, as well as the co-occurrences of CODH and HCP. With our findings we want to propose a systematic approach in the analysis of new CODH, with the focus on identifying promising CO2 reduction catalysts, suitable for biotechnological application.

Results

Co-occurrence and Correlation

After evaluating the assemblies in regard to their count of CODH, it was seen that around 30% of all assemblies encode for more than one CODH. For HCP this number is much smaller, around 6%. As can be seen from Fig. 3 A, the occurrence of multiple isoforms from specific clades within organisms varies. Clades B, C and D almost exclusively occur only once within a genome, while clades A, E and F are more likely to co-occur with another isoform from the same clade. The overall trend that we observe is most likely underrepresenting the number of genomes encoding multiple CODHs, since incomplete genomes are also included in these analyses. When calculating the correlation of the co-occurrence of CODH from two different clades in one organism, a pattern evolves (Fig. 3 B). Most obvious is the lack of co-occurrence of clades A, E and F with each other. Furthermore, clade B CODHs seem to have little co-occurrence with other clades as well. However, clades C and D more often co-occur with CODH from other clades, especially A, E and F. Clades C and D also have a higher probability to co-occur with each other. As outlined in the introduction, from biochemical studies it is known, that CODHs from clade A, E, and F are active whereas CO2/CO interconversion activity is missing in CODH from clade D. This co-occurrence might suggest that the redox sensing properties of CODH from clade D (and potentially clade C) are useful for organisms already containing a functional CODH. Interestingly, a high co-occurrence was also seen for CODH and HCP, with an exception for clade A CODH.

(A) Count of organisms that contain one ore multiple [NiFe]-CODH. (B) Probability matrix of co-occurrence of [NiFe]-CODH from different clades in one organism. Raw data can be found in Supplementary File 1_Table S1.

Neighbor Analysis

We semi-quantitively evaluated the operon composition of 1351 CODHs (121 A, 130 B, 168 C, 253 D, 434 E, 245 F, Supplementary File 5_Tree1 and Supplementary File 6_Tree2) with proteins which function we could predict using eggNOG(Huerta-Cepas et al., 2019), the NCBI product prediction and manual curation. The results are summarized in Fig. 4. In the following we only report on neighbors that are encoded in the same operon as more than 10% of CODHs per clade (see Table S1). Starting with CODHs from clade A, 93% contain a one carbon pool related gene in their operon, followed by CooC (62%) and iron-sulfur (FeS) cluster containing protein (31%). As one carbon pool related gene, we defined genes associated either with direct conversion of one carbon compounds such as formate dehydrogenases, or with the Wood-Ljungdahl pathway. Clade B CODH operons mainly encode ABC transporter associated genes (64%). Furthermore, almost a quarter (24%) of all CODHs from clade B could not be associated with any neighbor, and 12% are coded close to transcriptional regulators. For CODH from clade C, the three main neighbors are proteins associated with FeS cluster containing proteins (such as CooF) (72%), NAD(P) or FAD dependent oxidoreductases (71%) and transcription (58%) or other (10%) regulation. The overall diversity of neighboring proteins from clade D, and the fact that a major part of those CODHs seemingly do not encode close to any other genes (64%) made it challenging to sum up their different codons, and no clear pattern could be observed. Only transcription regulation proteins (9.9%) and general regulatory proteins (9.5%) are worth mentioning in this context. Clade E and F both have a larger set of proteins frequently observed in their associated operons. Operons encoding either clade E or F CODH contain CooC like genes (59% E, 68% F), one carbon pool associated genes (49% E, 37% F) and FeS genes (29% E, 53% F). Transcription regulators (17% E, 35% F) and NAD(P)/FAD-dependent oxidoreductase (22% E, 42% F) have also been found. The maturation protein CooT was exclusively found in operons from clade E (16%) and F (6.1%). The same holds true for CooJ but in clade F, CooJ was seen in even fewer operons (12% E, <5.0% F). Additional Hydrogenases (25%) and their maturation machinery (17%) are coded primarily for clade F CODH, as well as different types of transporter proteins (11%).

(A) Phylogenetic tree of putative [NiFe]-CODH unrooted with 1376 sequences, color-coded if operon contains one or more of a certain type of protein. Detailed searchable tree with bootstrap values can be found in Supplementary File 5_Tree1 and Supplementary File 6_Tree2. (B) Distribution of operon size for CODH and HCP genes. (C) Proportion of [NiFe]-CODH from one clade being coded near a certain type of protein. Raw data can be found in Supplementary File 1_Table S1.

A similar analysis of the operons encoding for HCP was performed and a total of 1476 HCP genes were analyzed (class I: 1049, class II: 23, class III: 404, Supplementary File 7_Tree3 and Supplementary File 8_Tree4) showing a low frequency of isoforms within organisms (Fig. S1). HCP exhibited a large variety of neighbors, leading to difficulties in extracting meaningful information from their operon composition (Fig. S2). Furthermore, class I and III had a high proportion of entries without any neighbors (49% and 75%, respectively) which is reflected in their tendency to have fewer proteins coded in their operons (see Fig. 4 B). However, we observed a high frequency of FeS cluster proteins (18%) and transcription regulators (17%) for class I, as well as NAD(P) or FAD oxidoreductases (96%) and transport proteins (78%) for class II HCP. It needs to be noted, however, that our sample set for class II HCP is very small, so its information value is considerably lower compared to the other classes/clades.

Discussion

Our analysis reveals substantial diversity in the occurrence, co-occurrence, and genomic context of CODH and HCP genes, suggesting complex evolutionary and functional relationships within and across microbial lineages. The observed differences in the frequency of multiple isoforms per genome (∼30% for CODH versus ∼6% for HCP) indicate that CODHs are more often retained in multiple copies, potentially pointing to functional diversification among its isoforms. Similar values have been reported by Techtmann et al., where they found that a striking 43% of organisms coded for more than one CODH (Techtmann et al., 2012). On the other hand, Katayama et al., investigating only the human gut microbiome found a number as low as 5.5% (Katayama et al., 2024). We suspect that this number underrepresents the amount of organisms carrying multiple isoforms of CODH in the human gut, since data refinements that exclude potential CODHs were performed (such as the strict requirement for a [4Fe-4S] D-cluster, even though Inoue et al. reported on the diversity of the D-cluster(Inoue et al., 2018)).

There are many examples of organisms coding for multiple CODH isoforms, as outlined in Fig. 3 (Supplementary File 10_Tree6 and Supplementary File 11_Tree7). Many of them have been known to literature for a long time (even though their CODH abundance has only been discussed sporadically) with the most famous example being C. hydrogenoformans encoding five different CODHs (Wu et al., 2005). Another interesting example is Clostridium formicoaceticum, since this organism has a total of six CODH isoforms encoded in its genome. It needs to be noted, that in our analysis, this organism did not show up as an organism with six CODHs, see Fig. 3 A. This is due to our analysis only counting CODH stemming from the same organism when their genes are associated with the same genome assembly. We therefore rather underestimate counts of organism with multiple CODHs, such as the aforementioned. The only CODH from C. formicoaceticum that has so far been isolated, characterized and discussed is one of its clade E CODHs, that is associated with acetyl-CoA synthase (ACS) (Bao et al., 2019; Diekert and Thauer, 1978). Other examples from literature attempted to investigate the influence of CODH isoforms on the metabolism. Archaeglobus fulgidus contains three CODH genes, two from clade A and one from clade D. Its clade D CODH seems to have no role in the CO metabolism of this organism (Hocking et al., 2015). There also has been a report for the organism Methanosarcina acetivorans, which harbors three CODHs, all of them belonging to clade A, where only two of them are associated with ACS and are believed to be involved in the CO metabolism, the other one being a lone gene and is seemingly not involved in carbon metabolism (Matschiavelli et al., 2012). Another interesting example is Thermoanaerobacter kivui, formerly known as Acetogenium kivui (COLLINS et al., 1994). When TkCODH-I (clade C) is deleted from the organism, the strain loses its ability to grow slowly on CO, however, if grown on H2+CO2 the overall acetate production is greatly increased (Jain et al., 2021). Similar effects have been shown for Clostridium autoethanogenum, which contains three isoforms of CODH, and if its clade C CODH is deleted, it’s lag phase is reduced and its growth rate is greatly increased (Liew et al., 2016). The other two CODH isoforms from this organism are from clade E, and D. Deletion of clade D CODH showed no immediate effect on the organism, except moderately lower overall biomass yield (Liew et al., 2016). In our analysis we saw an increased frequency of co-occurrence of clades C and D with A, E, or F, which together with biological data, may indicate a complementary role, possibly linked to redox sensing or regulatory functions. This is especially evident with the examples for clade C CODHs from T. kivui and C. autoethanogenum. For the case of clade D, which lacks catalytic activity towards CO/CO2 interconversion (as reported by Jeoung et al. (Jeoung et al., 2022) through their recombinant production of ChCODH-V) and until now has unknown influence in the metabolism that might only manifest in harsher environments, since its believed to be involved in stress response (Jeoung et al., 2022) (similar to HCPs(Hagen, 2022), see below). However, experimental proof for this claim is still missing.

Furthermore, clades A, E, and F rarely co-occur. Interestingly, many organisms do however contain multiple copies of CODHs from one of these clades, such as M. acetivorans (clade A), C. hydrogenofromans (clade F), and C. formicoaceticum (clade E). We suspect an evolutionary reason behind this, as is also outlined by others (Adama et al., 2018; Lindahl and Chang, 2001). Biochemical data indicate that CODHs from these clades possess CO/CO2 interconversion capability, as previously mentioned. This is also in line with the genetic context of these CODHs, which is most often tuned for this CO/CO2 interconversion chemistry (Fig. 4 C), see below.

The high rate of co-occurrence between CODH and HCP genes (except for clade A) suggests functional integration, a shared metabolic niche, or involvement in a coordinated response to redox stress, given that HCPs are thought to regulate nitric or oxidative stress(Hagen, 2022). The lack of co-localization for clade A CODHs might point to distinct metabolic roles or evolutionary constraints, or on the high rate of archaeal genes in clade A CODHs, even though HCP genes are known to also be found in archaea (Hagen, 2022). Interestingly, the co-occurrence between clade D CODH and HCP seems the highest for our data set, the reason for that is unclear. Similar to clade D CODH, empirical proof that HCP influences the activity or expression of CODH is missing.

The genomic context analysis adds another layer of functional inference, as has been done before by others with different foci (Inoue et al., 2018; Katayama et al., 2024; Matson et al., 2011; Techtmann et al., 2012). We could show again that operons containing clade A CODHs are highly conserved with one carbon pool-related genes and CooC, as they are almost exclusively found as part of the Wood-Ljungdahl pathway, which has a typical arrangement similar to Methanosarcina barkeris CODH (MbCODH, Fig 2, Supplementary File 6_Tree2). Recently, another representative from this group from M. thermophila (MetCODH) has been resolved (Biester et al., 2024).

In contrast, clade B CODHs appear largely alone or associated with transport-related genes, raising the possibility of a non-canonical or even degenerated function. Its operon composition is also rather consistent and its arrangements only varies to a small extent, as ABC transporters are either coded upstream (as for Ruminococcus flavefaciens’ CODH, RfCODH, see Fig. 2) or downstream of the CODH gene. Almost all operons analyzed do not contain any maturases, expect for a small cluster that branches off rather early in the tree (Supplementary File 6_Tree2). This might indicate that the need for a maturase was lost due to re-purposing of the CODH. We yet await biochemical characterization of any clade B CODH.

Clade C CODHs are associated with FeS cluster proteins, regulators and redox enzymes, pointing towards more regulatory or redox-modulatory roles, which could also be indicated in knock-out studies (Liew et al., 2016). The only isolated example from this clade is TkCODH-I. Its operon exhibits a composition only partially representable for clade C CODHs, containing only one other gene coding for a FeS protein (Fig. 2). Furthermore, TkCODH-I’s sequence branches off early and seems to be rather distinct (Supplementary File 9_Tree5), only having one other close relative from Aceticella autotrophica (Frolov et al., 2023). Furthermore, the reported isolated CODH from Jain and co-workers (Jain et al., 2021) stems from a CO adapted strain (Weghoff and Müller, 2016), which might harbor mutations in the protein sequences that are not accessible to us at the moment. Drawn together, we conclude that right now TkCODH-I might not be an optimal representative for clade C CODHs and more clade C CODHs should be isolated to help us understand their biochemical properties better.

The high operonic variability and frequency of solitary coding regions in clade D might reflect either evolutionary drift or multifunctionality not restricted to operonic structure. Clade D will therefore not be discussed further.

Operons from clades E and F are more functionally complex, including components from the Wood-Ljungdahl pathway, hydrogenases, and additional redox partners, consistent with a diverse metabolic role. That being said, their operon compositions and arrangements showcased some interesting clustering. Starting with clade F, which has the highest proportion of CODHs that might be associated with hydrogenases, including the aforementioned ChCODH-I and RrCODH, both interacting with hydrogenases to produce hydrogen in vivo (Fox et al., 1996; Soboh et al., 2002), however with greatly differing operon compositions (Fig. 2). ChCODH-I like operons (see Supplemetary File 6_Tree2) contain their hydrogenase modules directly within the CODH operon, whereas RrCODH-like operons do not include the hydrogenase module which is coded upstream of the CODH gene, with an intergenic space of > 400 bp(Fox et al., 1996). RrCODH-like operons are the only clade F operons that include two additional maturation enzymes, CooT and CooJ. Clade F also contain many ACS associated CODHs, such as Neomoorella thermoacetica CODH, NtCODH, formerly known as Moorella thermoacetica (Gtari and Ventura, 2025). Those NtCODH-like operons all have the same arrangement. This arrangement is distinct from Clade A and E ACS associated CODH operons.

In our dataset most of the hydrogenases and their maturation genes are associated with clade F, suggesting active hydrogen metabolism, coupling CO oxidation to H2 production or consumption as has been suggested earlier for a wider range of CODH clades (Inoue et al., 2018; Techtmann et al., 2012). We believe that our data grossly underestimates this relationship overall, since operon examples such as RrCODH and TcCODH-II showcase that hydrogenases associated with a CODH are not necessarily encoded in the same operon. There has also been a report of a clade A CODH from the methanogen M. thermophila (Terlesky and Ferry, 1988) being associated with a hydrogenase. However, in later studies it was shown that the genome of M. thermophila does not contain a hydrogenase (Smith and Ingram-Smith, 2007). Investigation of the electron transport chain of its membrane could not find a hydrogen oxidizing complex (Welte and Deppenmeier, 2011). Together with our analysis we conclude that hydrogenase association is a trait almost exclusive to clade E and F CODHs. ChCODH-II’s operon seems to be rather uniquely constructed as a similar operon composition only can be found for other Carboxydothermus species. A similar situation can be seen for ChCODH-IV, where its operon containing FeS and NAD/FAD-dependent oxidoreductases is closer in similarity to some clade E CODH.

Regarding the biggest clade, clade E, its diversity is striking. NvCODH, formerly known as DvCODH (Waite et al., 2020), from the organism Nitratidesulfovibrio vulgaris, has a very small operon with only two genes in its close proximity, a transcriptional regulator (Zhou et al., 2012) and a maturation enzyme (CooC), see Fig. 2. This is seen for a huge number of both clade E and F CODHs. The occurrence of neither CooJ nor CooT is striking, both only appear in two very distinct parts of clade E, all of them being associated with one carbon pool metabolism, with one exception from Clostridium pasteurianum BC1, which more resembles the clade F RrCODH-like operon. The previously introduced archaeal CODHs, TcCODH-I and TcCODH-II from clade E, are contained in operons (Fig. 2), that are rather specific and only found for a few other Thermococcus or Pyrococcus species (Benvenuti et al., 2020; Kim et al., 2015). It needs to be noted, that TcCODH-I’s CooC gene is coded outside of its operon, and on the opposite strand. The CooC gene is therefore not included in our analysis. We are only aware of examples from this type of operon, and don’t expect this to be a common trait of the CODH maturation machinery. However, it needs to be noted that the CooC proportion might be slightly underestimated. Interestingly, TcCODH-II like CODH all contain a CooT like protein in their operon, forming the only cluster of CODH that contain only CooT like proteins without CooJ. Within clade E, another unique genomic neighborhood from TkCODH-II must be pointed out. From experimental data it is known that this CODH is associated with ACS (Jain et al., 2021), however, in our analysis we did not see this ACS complex in TkCODH-II’s operon. This is due to the ACS subunit being coded further downstream of the CODH gene, not being taken into account due to our initial parameters.

For HCPs, the high variability and low operon density — especially in classes I and III — point towards more modular or conditionally expressed roles, similar to clade D CODH. The clear patterning in class II operons, though based on a limited sample, may reflect specialized functions, perhaps in niche-specific oxidoreductase activities.

Conclusion

As previously mentioned, the aim of this study is to identify which CODH clades harbor the most promising enzymes for future application in CO2 reduction. The operon composition of CODHs from different clades show distinct differences and what we could gather from this information is that clade A, E and F are the most likely clades to harbor CODHs able to efficiently convert CO2 to CO. These clades are therefore the most interesting for CO2 reducing biotechnological applications, or as inspiration for new synthetic catalysts. Also, literature has shown that the activity of many CODHs depend on co-expression with maturation proteins such as CooC, and in some cases, CooJ and CooT are also required for full activation. Although some CODHs (most notably CODH-II from C. hydrogenoformans) can function independently of maturases, our neighborhood analysis indicates that maturase-coding genes are predominantly found in operons from clades A, E, and F. This pattern implies yet again that these clades may represent more biochemically active or catalytically optimized CODHs, making them promising targets for future functional studies and biotechnological applications.

The function of Clade B could not be deduced based on its genomic environment but it seems to have a remarkable self-standing function, that is not shared with any other CODHs. Its low co-occurrence with other CODH clades within organisms also supports a unique role for Clade B. Clades C and D are more likely to show low or even no activity towards CO2/CO interconversion, as was deduced from literature and the lack of C1 metabolism related genes in their operons. However, Jain et. al. recently showed low CO oxidation activity in a clade C CODH from T. kivui, but this enzyme originated from a strain that had acquired the ability to grow on CO through laboratory evolution (Jain et al., 2021). Sequence data used to classify the CODH into clade C was from the original strain (incapable of growing on CO) and data on the engineered strain is not available. It is therefore not known whether the active CODH is the wild-type or an engineered enzyme and we cannot draw any conclusions regarding the activity of clade C CODHs. Taken together, this makes clades B, C and D less promising in the hunt for CO2 reduction catalysts. However, much is still unknown about these enzymes such as their cellular function.

Future work should focus on experimental validation of the functional differences between CODH isoforms, particularly in organisms where multiple clades co-occur. Additionally, transcriptomic and proteomic studies could illuminate condition-dependent expression patterns and confirm proposed regulatory functions. Finally, deeper phylogenomic analyses may reveal the evolutionary drivers behind the observed distribution and diversification of these ancient redox enzymes.

Methods

Data collection and refinement

Multiple pBLAST searches (BLOSUM62, E < 0.05) in the NCBI database were carried out using NCBI accession numbers provided by Inoue et al. (Inoue et al., 2018) (A-1, WP_011305243; A-2, WP_010878596; A-3, OGW06734; A-4, OIP92259; A-5, ODS42986; A-6, OIP30420; B-1, WP_026514536; B-2, WP_015485077; B-3,WP_012645460; B-4, WP_011393470; C-1, WP_039226206; C-2, WP_013237576; C-3, WP_010870233; C-4, WP_044921150; D-1, WP_011342982; D-2, WP_015926279; D-3, WP_079933214; D-4, WP_096205957; E-1, WP_012571978; E-2, WP_010939375; E-3, WP_088535808; F-1, WP_011343033; F-2, WP_011389181; G-1, OGP75751) and Techtmann et al. (Techtmann et al., 2012) (mini CooS, WP_007288589.1). CODH from clade H(Inoue et al., 2022) was omitted due to limited host information. Duplicates were removed using seqkit’s(Shen et al., 2016) rmdup. Sequences of length below 400 amino acids (aa) were removed. Clustering was performed to further reduce data size, by using cd-hit(Li et al., 2002, 2001; Li and Godzik, 2006) and a global sequence identity of 99% or 90%, the later only used for tree generation. It was necessary to have high sequence similarity in the clustering within organisms, since it was known that some organisms have multiple CODH with striking sequence similarities in their genome such as Clostridium pasteurianum BC1(taxid: 86416) that contains WP_015614757.1 and WP_015615315.1 with 93.27% similarity. For the dataset involved in neighbor analysis, taxonomic information for each sequence was retrieved using R-packages taxize (Chamberlain et al., 2020; Chamberlain and Szocs, 2013) and taxizedb (Chamberlain et al., 2025), and only sequences that could be related to a recorded organism were kept (Supplementary File 3_Table S3 and Supplementary File 4_Table S4). Sequences were aligned using E-INS-I from mafft (Katoh and Standley, 2013) and sequences that had gaps in important positions related to D, B or C cluster or acid base active site residues were sorted out. The alignment was trimmed using trimAl’s (Capella-Gutiérrez et al., 2009) automated1 option and a tree was generated using FastTree (Price et al., 2010). Via visual inspection further sequences that were not CODH sequences were removed. The final list of CODH sequences used in the neighbor and correlation analysis counted 1376. A similar approach was done for HCP (class I, Q01770.2; class II, WP_000458809.1; class III, WP_013294878.1) and a final count of 1545 sequences was collected for neighbor and correlation analysis. Another set of CODH genes was curated using the 90% cd-hit cut-off. It was aligned using mafft’s FFT-NS-1, sequences that had gaps in important positions related to D, B or C cluster or acid base active site residues were sorted out. The alignment was trimmed using trimAl. An initial tree was built using FastTree. Further sequences were removed after visual inspection, which yielded a final dataset containing 5508 sequences. See Fig. S3 for detailed flowchart. Custom code can be found and retrieved for github (Böhm, 2025a, 2025b, 2025c).

Neighbor analysis

Genome information was downloaded for the accession number lists generated for CODH and HCP, which lead to the download of 955 and 1425 genomes, respectively. As neighboring gene we defined a maximum of 15 genes upstream and downstream of the target gene, that had a maximum intergenic distance of 300 base pairs (bp), as was done previously by Inoue et al. (Inoue et al., 2018). We decided to use this rather large intergenic distance to include as many neighbors as possible, and we expect that unrelated genes will disappear in the noise. For the same reason, we included an overlap region of 50 bp for genes in the same operon, which is rather high, as genes for example in E. coli usually overlap 4 to 1 bp(Johnson and Chisholm, 2004). Aa sequences for those genes were retrieved from the NCBI database using entrez(Sayers, 2022), and their function was predicted using eggnog (Huerta-Cepas et al., 2019), the results from eggNOG as well as the product prediction from NCBI were taken into account in the manual placing of selected functional groups. The data was plotted using R (R Core Team, 2023), tidyverse (Wickham et al., 2019), patchwork(Pedersen, 2025), ggnewscale(Elio Campitelli et al., 2025), ggtree (Yu et al., 2018, 2017), ggtreeExtra(Xu et al., 2021), and treeio(Wang et al., 2020) and for gene maps gggenes (Wilkins, 2023) was used. Since CooJ determination was neither possible with the NCBI prediction nor via eggNOG, we selected operons from clade E and F that contained CooS and CooT, and manually extracted some accession numbers of potential CooJs which were used to PSI-BLAST (BLOSUM45, E < 0.001) for further accession numbers, summary can be found in Supplementary File 1_Table S1. These numbers were used to help annotate potential CooJs in our analysis, 68 potential CooJ genes could be identified.

Correlation analysis

Correlation of CODH and HCP from different clades/classes was calculated according the formula

where NY is the total number of assemblies containing protein from clade/class Y, NXY is the total number of assemblies containing proteins from both clade/class X and Y, and P(X|Y) is the probability that a genome coding for a protein from clade/class Y also codes for a protein from clade/class X.

Tree generation

In total, five trees were generated. Trees carrying phylogenetic information were generated via iqtree2 (Minh et al., 2020) with the LG+I+R10 model and ultrafast bootstrapping with 1000 resampling for a dataset of 5508 CODH sequences, a dataset of 1351 CODH sequences, and a dataset of 1476 HCP sequences (see above for details on their generation). For the 5508 sequence CODH dataset an outgroup was introduced to root the tree (MBE6442607.1). Sequences were aligned within their dataset using mafft’s FFT-NS-2. The alignment was again trimmed using trimAl and built using iqtree2 with the above parameters. For tree inspection and plotting ggtree (Yu et al., 2017) was used. The two other trees generated are taxonomic trees, either only on taxid using a custom python script and ete3 (Huerta-Cepas et al., 2016), or from WoL: Reference Phylogeny for Microbes (Zhu, 2023; Zhu et al., 2019).

Acknowledgements

The Novo Nordisk Foundation (Grant reference number NNF21OC0066716) is gratefully acknowledged for funding.

Additional files

Supplementary file 0

Supplementary file 1

Supplementary file 2

Supplementary file 3

Supplementary file 4

Supplementary file 5

Supplementary file 6

Supplementary file 7

Supplementary file 8

Supplementary file 9

Supplementary file 10

Supplementary file 11