Introduction

DNA methylation, particularly 5-methylcytosine (5mC) at CpG sequences, is widely conserved in eukaryotes. Along with its role in silencing transposable elements and suppressing aberrant intragenic transcription (Choi et al., 2020; Deniz et al., 2019; Neri et al., 2017), DNA methylation plays critical roles in developmental control, genome stability and the development of diseases such as cancers and immunodeficiencies (Greenberg and Bourc’his, 2019; Lyko, 2018; Nishiyama and Nakanishi, 2021; Robertson, 2005). Despite the versatility of DNA methylation as an epigenetic control mechanism, DNA methyltransferases (DNMTs) are lost in multiple evolutionary lineages (Bewick et al., 2017; Huff and Zilberman, 2014; Kyger et al., 2021; Zemach et al., 2010). DNMTs are largely subdivided into maintenance DNMTs and de novo DNMTs (Lyko, 2018).

Maintenance DNMTs recognize hemimethylated CpGs and restore symmetric methylation at these sites to prevent the passive loss of 5mC upon DNA replication. Conversely, methylation by de novo DNMTs does not require methylated DNA templates. In animals, 5mC is maintained during DNA replication by DNMT1 together with UHRF1, which directly recognizes hemimethylated cytosine via the SRA domain and stimulates activity of DNMT1 in a manner dependent on its ubiquitin-ligase activity (Nishiyama and Nakanishi, 2021). De novo DNA methylation is carried out by DNMT3 in animals. Some species, such as the fungus Cryptococcus neoformans, lack de novo DNA methylation activity (Catania et al., 2020; Dumesic et al., 2020; Huff and Zilberman, 2014). In C. neoformans, DNA methylation is maintained by DNMT5, which has a SNF2-family ATPase domain that is critical for its methyltransferase activity (Dumesic et al., 2020; Huff and Zilberman, 2014). Although DNMT5 orthologs cannot be found in land plants and animals, their broad existence in Stramenopiles, Chlorophyta and Fungi suggests that DNMT5, perhaps together with DNMT1, coexisted in the last eukaryotic common ancestor (LECA) (Huff and Zilberman, 2014). While the evolutionary preservation and loss of DNMTs and other proteins involved in 5mC metabolism have been studied (Bewick et al., 2017; de Mendoza et al., 2018; Dumesic et al., 2020; Engelhardt et al., 2022; Huff and Zilberman, 2014; Iyer et al., 2011; Lewis et al., 2020; Mondo et al., 2017; Mulholland et al., 2020; Nai et al., 2020; Tirot et al., 2021; Zemach et al., 2010), it remains unclear if there is any common process or event that leads to the loss of DNA methylation systems from certain evolutionary lineages.

The loss of DNA methylation is a hallmark of immunodeficiency–centromeric instability–facial anomalies (ICF) syndrome, a rare genetic disorder which causes severe immune defects (Ehrlich, 2003; Ehrlich et al., 2006; Vukic and Daxinger, 2019). Activated lymphocytes of ICF patients display a characteristic cytogenetic abnormality at the juxtacentromeric heterochromatin of chromosomes 1 and 16, where the satellite II repetitive element is highly enriched. In about 50% of ICF patients, classified as ICF1, the disease is caused by mutations in the “de novo” DNA methyltransferase DNMT3B (Hansen et al., 1999; Okano et al., 1999). The rarer genotypes, known as ICF2, ICF3 and ICF4, are caused by mutations in ZBTB24, CDCA7, and HELLS (Helicase, Lymphoid Specific, also known as LSH, Lymphoid-Specific Helicase), respectively (de Greef et al., 2011; Thijssen et al., 2015; Unoki, 2021). While the loss of DNA methylation at satellite II repeats is common to all ICF genotypes, ICF2-4 patient-derived cells, but not ICF1 patient cells, additionally exhibit hypomethylation at centromeric alpha-satellite repeats (Jiang et al., 2005; Thijssen et al., 2015; Velasco et al., 2018). Knockout of ICF genes in human HEK293 cells reproduces the DNA methylation profile observed in patient cells (Unoki et al., 2019). In mice, ZBTB24, HELLS and CDCA7, but not DNMT3B, are required for methylation at centromeric minor satellite repeats (Dennis et al., 2001; Hardikar et al., 2020; Ren et al., 2015; Thijssen et al., 2015). Therefore, although all ICF proteins promote DNA methylation, the ZBTB24-CDCA7-HELLS axis may target additional loci such as alpha satellites for DNA methylation in a DNMT3-independent manner. Indeed, the importance of HELLS/CDCA7 in DNA methylation maintenance by DNMT1 has been reported (Han et al., 2020; Ming et al., 2021; Unoki, 2021; Unoki et al., 2020).

HELLS belongs to one of ∼25 subclasses of the SNF2-like ATPase family (Flaus et al., 2006). Among these diverse SNF2 family proteins, HELLS appears to have a specialized role in DNA methylation. Reduced genomic DNA methylation was observed in HELLS (LSH) knockout mice (Dennis et al., 2001) and mouse embryonic fibroblasts (Myant et al., 2011; Yu et al., 2014). In fact, this function of HELLS in DNA methylation was originally inferred from studies in Arabidopsis, where mutations in the HELLS ortholog DDM1 (decrease in DNA methylation) cause drastic reduction of 5mC in transposable and repetitive elements (Vongs et al., 1993). Like HELLS, DDM1 is a SNF2 ATPase with demonstrable in vitro nucleosome remodeling activity (Brzeski et al., 2003; Jenness et al., 2018). Since DNA methylation defects in ddm1 mutants can be rescued by the loss of histone H1, it has been proposed that DDM1-mediated remodeling of H1-bound nucleosomes is important for DNA methylation (Zemach et al., 2013).

CDCA7 (also known as JPO1) was originally identified as one of eight CDCA genes that exhibited cell division cycle-associated gene expression profiles (Walker, 2001). A putative 4CXXC zinc finger binding domain (zf-4CXXC_R1) is conserved among CDCA7 homologs, including its paralog CDCA7L (also known as JPO2) (Chen et al., 2005; Ou et al., 2006). Multiple lines of evidence support the idea that CDCA7 functions as a direct activator of the nucleosome remodeling enzyme HELLS. First, in Xenopus egg extracts, HELLS and CDCA7e (the sole CDCA7 paralog present in Xenopus eggs) both preferentially interact with nucleosomes rather than nucleosome-free DNA, and binding of HELLS to chromatin depends on CDCA7e (Jenness et al., 2018). Second, HELLS alone exhibits little nucleosome sliding activity, but CDCA7e greatly stimulates it (Jenness et al., 2018). Third, HELLS directly binds to CDCA7e (as well as CDCA7 and CDCA7L), even in the absence of DNA (Jenness et al., 2018). Fourth, HELLS and CDCA7 interact in human cells (Unoki et al., 2019), where chromatin binding of HELLS also depends on CDCA7 (Jenness et al., 2018). ICF disease mutations located in the conserved zf-4CXXC_R1 domain (R274C, R274H and R304H in human CDCA7) inhibited chromatin binding of CDCA7 and HELLS without interfering with CDCA7-HELLS interaction (Jenness et al., 2018), such that the zf-4CXXC_R1 domain likely serves as a chromatin binding module. Therefore, we proposed that CDCA7 and HELLS form a bipartite nucleosome remodeling complex, termed CHIRRC (CDCA7-HELLS ICF-related nucleosome remodeling complex) (Jenness et al., 2018).

We previously suggested that ZBTB24, CDCA7, and HELLS form a linear pathway to support DNA methylation (Jenness et al., 2018). ZBTB24 is a transcription factor which binds the promoter region of CDCA7 and is required for its expression (Wu et al., 2016). As CDCA7 binds HELLS to form the CHIRRC, we proposed that its ATP-dependent nucleosome sliding activity exposes DNA that was previously wrapped around the histone octamer and makes it accessible for DNA methylation (Jenness et al., 2018).

Indeed, it has been shown that DNMT3A and DNMT3B cannot methylate DNA within a nucleosome (Felle et al., 2011), and the importance of HELLS and DDM1 for DNA methylation at nucleosomal DNA has been reported in mouse embryonic fibroblasts and Arabidopsis, respectively (Lyons and Zilberman, 2017). Given the frequent loss of 5mC as an epigenetic mark in multiple evolutionary lineages, it is striking that the role for HELLS/DDM1 in DNA methylation appears to be conserved in evolutionarily distant mammals and plants. Importantly, this would suggest that the promotion of DNA methylation through nucleosome sliding is specific to HELLS and cannot be substituted by other SNF2 family nucleosome remodelers, such as SNF2 (SMARCA2/4), INO80, and ISWI (SMARCA1/5). If the specific function of HELLS and CDCA7 in DNA methylation is indeed derived from the last eukaryotic common ancestor (LECA), we hypothesize that HELLS and CDCA7 coevolved with other DNA methylation machineries. We here test this hypothesis and discuss the potential sequence of events that led to the loss of DNA methylation in some species.

Results

CDCA7 is absent from the classic model organisms that lack genomic 5mC

CDCA7 is characterized by the unique zf-4CXXC_R1 motif (Pfam PF10497) (Mistry et al., 2021). Conducting a BLAST search using human CDCA7 as a query sequence against the Genbank protein database, we found that no zf-4CXXC_R1-containing proteins are identified in the classic model organisms Drosophila melanogaster, Caenorhabditis elegans, Schizosaccharomyces pombe and Saccharomyces cerevisiae, which are all also known to lack any DNMTs and genomic 5mC (Zemach et al., 2010). However, we identified a protein with a zf-4CXXC_R1 motif in the bumblebee Bombus terrestris and the thale cress A. thaliana, which have both maintenance and de novo DNMTs (Bewick et al., 2017; Li et al., 2018) (Figure 1). Based on the reciprocal best hits (RBH) criterion using human HELLS as a query sequence (see Methods) (Ward and Moreno-Hagelsieb, 2014), HELLS orthologs were identified in A. thaliana (DDM1) and S. cerevisiae (Irc5), which is in line with previous reports (Litwin et al., 2017), as well as a putative HELLS ortholog in B. terrestris. No clear HELLS orthologs were identified in D. melanogaster, C. elegans, or S. pombe. This pilot-scale analysis led us to hypothesize that the evolutionary maintenance of CDCA7 is linked to that of HELLS and DNMTs. In order to statistically validate this hypothesis, we set out to systematically define orthologs of CDCA7, HELLS and DNMTs in a broad range of evolutionary lineages.

CDCA7 is absent from model organisms with undetectable genomic 5mC

CDCA7 homologs are absent from model organisms where 5mC is largely undetectable in genomic DNA and DNMT1/DNMT3 are lost

CDCA7 family proteins in vertebrates

We first characterized evolutionary conservation of CDCA7 family proteins in vertebrates, where 5mC, DNMT1 and DNMT3 are highly conserved. A BLAST search against the Genbank protein database identified two zf-4CXXC_R1 motif-containing proteins, CDCA7/JPO1 and CDCA7L/R1/JPO2, throughout Gnathostomata (jawed vertebrates) (Fig. 2A, B). In frogs (such as Xenopus, but not all amphibians), and some fishes (such as Astyanax mexicanus and Takifugu rubripes), a third paralog CDCA7e exists (Fig. 2A, B, Fig. S1). CDCA7e is the only CDCA7-like protein that can be detected in Xenopus eggs (Jenness et al., 2018), and thus likely represents a form specific to oocytes and early embryos in these species. Among twelve conserved cysteine residues originally reported in the zf-4CXXC_R1 domain (Ou et al., 2006), the 12th cysteine residue is not conserved in Rhincodon typus (whale shark) CDCA7 and in Xenopus laevis CDCA7e. In general, the 12th cysteine residue of the zf-4CXXC_R1 domain is least conserved among CDCA7 homologs within and outside of vertebrates, such that we do not consider it a key component of the zf-4CXXC_R1 motif (see below). Considering that the jawless fish Petromyzon marinus (sea lamprey) and invertebrates commonly possess only one CDCA7 family gene (Fig. 2, and see below), CDCA7L and CDCA7e may have emerged in jawed vertebrates. Overall, the zf-4CXXC_R1 sequence is highly conserved among vertebrate CDCA7 homologs, including three residues that are mutated in ICF3 patients (R274, G294 and R304 in human CDCA7) (Fig. 2B) (Thijssen et al., 2015). Note that these amino acid positions of the ICF3 mutations are based on the previously reported sequence NP_665809 (Thijssen et al., 2015), which is annotated as isoform 2 (371 amino acids), whereas we list isoform 1 (NP_114148) with 450 amino acids in this study (Fig. 2A).

CDCA7 paralogs in vertebrates

A. Schematics of vertebrate CDCA7 primary sequence composition, based on NP_114148. Yellow lines and light blue lines indicate positions of evolutionarily conserved cysteine residues and residues that are mutated in ICF patients, respectively. B. Sequence alignment of the zf-4CXXC_R1 domain of vertebrate CDCA7-family proteins. White arrowheads; amino acid residues unique in fish CDCA7L. Black arrowheads; residues that distinguish CDCA7L and CDCA7e from CDCA7.

While the presence of four CXXC motifs in the zf-4CXXC_R1 domain is reminiscent of a classic zinc finger-CXXC domain (zf-CXXC, Pfam PF02008), their cysteine arrangement is distinctly different (Long et al., 2013). In vertebrate CDCA7 paralogs, eleven conserved cysteines are arranged as CXXCX10CX4CX7CXXCX19CXXCX3CXCXXC. In contrast, in the classic zf-CXXC motif eight cysteines are arranged as CXXCXXCX4-5CXXCXXCX8-14CX4C (Long et al., 2013). Apart from the zf-4CXXC_R1 domain, vertebrate CDCA7-like proteins often, but not always, contain one or two Lens epithelium-derived growth factor (LEDGF)-binding motif(s), defined as ([E/D]XEXFXGF) (Tesina et al., 2015). It has been reported that human CDCA7L and CDCA7 both interact with c-Myc but apparently via different regions, a leucine zipper and a less-defined adjacent segment that overlaps with a bipartite nuclear localization signal, respectively (Gill et al., 2013; Huang et al., 2005).

This leucine zipper sequence is highly conserved among vertebrate CDCA7 family proteins. In contrast to zf-CXXC motif-containing proteins such as KDM2A/B, DNMT1, MLL1/2, and TET1/3, the vertebrate CDCA7 proteins do not contain any predicted enzymatic domains (Huang et al., 2005; Maertens et al., 2006; Tesina et al., 2015).
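The two cysteine arrangements quoted above can be written as regular expressions, which makes the difference in spacing explicit. The following is only a minimal sketch: the pattern names, the helper functions and the synthetic test sequence are illustrative assumptions and are not part of the analysis reported here. A strict pattern of this kind would also miss homologs whose spacing deviates from the vertebrate arrangement, such as those described in later sections.

```python
import re

# Vertebrate CDCA7 spacing quoted in the text (eleven cysteines):
# CXXCX10CX4CX7CXXCX19CXXCX3CXCXXC
ZF_4CXXC_R1 = re.compile(
    r"C.{2}C.{10}C.{4}C.{7}C.{2}C.{19}C.{2}C.{3}C.C.{2}C"
)

# Classic zf-CXXC spacing quoted in the text (eight cysteines):
# CXXCXXCX4-5CXXCXXCX8-14CX4C
ZF_CXXC = re.compile(
    r"C.{2}C.{2}C.{4,5}C.{2}C.{2}C.{8,14}C.{4}C"
)


def build_toy_domain() -> str:
    """Synthetic sequence (not a real protein) that follows the zf-4CXXC_R1 spacing."""
    gaps = [2, 10, 4, 7, 2, 19, 2, 3, 1, 2]  # residues between consecutive cysteines
    return "C" + "".join("A" * gap + "C" for gap in gaps)


def find_motif(sequence: str, pattern: re.Pattern):
    """Return (start, end) of the first match, or None if the pattern is absent."""
    match = pattern.search(sequence)
    return None if match is None else (match.start(), match.end())


toy = build_toy_domain()
print(find_motif("MK" + toy + "GG", ZF_4CXXC_R1))  # span of the synthetic domain
print(find_motif(toy, ZF_CXXC))                    # classic zf-CXXC spacing does not fit
```
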

Plant homologs of CDCA7

A BLAST sequence homology search identified three classes of zf-4CXXC_R1 motif-containing proteins in Arabidopsis (Figure 3). Class I proteins conserve all three ICF-associated residues (R274, G294 and R304 in human CDCA7) as well as all eleven characteristic cysteines, with the exception that the position of the fourth cysteine is shifted two residues toward the N terminus. The protein size of class I proteins is comparable to vertebrate CDCA7 (400 – 550 aa), and no other Pfam motifs can be identified. Based on the striking conservation of the eleven cysteine and three ICF-associated residues, we predict that the class I proteins are prototypical CDCA7 orthologs.

CDCA7 homologs and other zf-4CXXC_R1-containing proteins in Arabidopsis

Top; alignments of the zf-4CXXC_R1 domain found in Arabidopsis thaliana. Bottom; domain structure of the three classes of zf-4CXXC_R1-containing proteins in Arabidopsis.

Class II proteins contain a zf-4CXXC_R1 domain, a DDT domain and a WHIM1 domain. These proteins were previously identified as DDR1-3 (Dong et al., 2013). DDT and WHIM1 domains are commonly found in proteins that interact with SNF2h/ISWI (Aravind and Iyer, 2012; Li et al., 2017; Yamada et al., 2011). Indeed, it was reported that Arabidopsis DDR1 and DDR3 interact with the ISWI orthologs CHR11 and CHR17 (Tan et al., 2020). Among the eleven cysteine residues in the zf-4CXXC_R1 motif of these proteins, the position of the fourth cysteine is shifted towards the C-terminus. The ICF-associated glycine residue (G294 in human CDCA7, mutated to valine in ICF3 patients) is replaced by isoleucine.

Class III proteins are longer (∼1000 amino acids) and contain an N-terminal zf-4CXXC_R1 domain and a C-terminal JmjC domain (Pfam, PF02373), which is predicted to possess demethylase activity against histone H3K9me2/3 (Saze et al., 2008). While all eleven cysteine residues can be identified, there are deletions between the 4th and 5th cysteine and between the 6th and 7th cysteine residues. None of the ICF-associated residues are conserved in class III proteins. One of these class III proteins is IBM1 (increase in bonsai mutation 1), whose mutation causes the dwarf “bonsai” phenotype (Saze et al., 2008), which is accompanied by increased H3K9me2 and DNA methylation levels at the BONSAI (APC13) locus. Double mutants of ddm1 and ibm1 exacerbate the bonsai phenotype, indicating that DDM1 and IBM1 act independently to regulate DNA methylation (Saze et al., 2008). Another class III protein is JMJ24, which harbors a RING finger domain in addition to the 4CXXC and JmjC domains. This RING finger domain promotes ubiquitin-mediated degradation of the DNA methyltransferase CMT3, and thus opposes DNA methylation (Deng et al., 2016).

Orthologs of these three classes of CDCA7 proteins found in Arabidopsis are widely identified in green plants (Viridiplantae), including Streptophyta (e.g., rice, maize, moss, fern) and Chlorophyta (green algae). Other variants of zf-4CXXC_R1 are also found in Viridiplantae. In contrast to green plants, in which the combined presence of HELLS/DDM1-, CDCA7- and DNMT-orthologs is broadly conserved, no zf-4CXXC_R1-containing proteins can be identified in red algae (Rhodophyta) (Table S1, see Fig. 5, Fig. S2).

Evolutionary conservation of CDCA7, HELLS and DNMTs in fungi

A. Sequence alignment of class IIf zf-4CXXC_R1 sequences found in fungi. B. Domain architectures of zf-4CXXC_R1-containing proteins in fungi. The class II zf-4CXXC_R1 domain is indicated with purple circles. Squares with dotted lines indicate preliminary genome assemblies.

Evolutionary conservation of CDCA7, HELLS and DNMTs

The phylogenetic tree was generated based on timetree.org. Filled squares indicate presence of an orthologous protein(s). Squares with dotted lines imply preliminary-level genome assemblies. Squares with a diagonal line; Paramecium EED was functionally identified (Miro-Pina et al., 2022), but not by the sequence-based search in this study; homologs of EZH1/2 and EED were identified in Symbiodinium sp. KB8 but not in Symbiodinium microadriaticum (Table S1). An opaque box of DNMT5 in Symbiodinium indicates a homolog that does not contain the ATPase domain, which is commonly found in DNMT5 family proteins. Opaque boxes of UHRF1 indicate homologs that harbor the SRA domain but not the RING-finger domain. The full analysis of the panel of 180 eukaryote species is shown in Figure S1 and Table S1. Genbank accession numbers of each protein and PMID numbers of published papers that report presence or absence of 5mC are reported in Table S1.

zf-4CXXC_R1-containing proteins in fungi

Although S. pombe and S. cerevisiae genomes do not encode any CDCA7-like proteins, a BLAST search identified various fungal protein(s) with a zf-4CXXC_R1 motif. Among the zf-4CXXC_R1-containing proteins in fungi, 10 species (Kwoniella mangroviensis, Coprinopsis cinerea, Agaricus bisporus, Taphrina deformans, Gonapodya prolifera, Basidiobolus meristosporus, Coemansia reversa, Linderina pennispora, Rhizophagus irregularis, Podila verticillata) harbor a zf-4CXXC_R1 motif highly similar to the prototypical (Class I) CDCA7, albeit with three notable deviations (Fig. 4A). First, the space between the third and fourth cysteine residues is variable. Second, the fifth cysteine is replaced by aspartate in Zoopagomycota. Third, the second ICF-associated residue (G294 in human CDCA7) is not conserved. As this zf-4CXXC_R1 signature is similar to the plant class II zf-4CXXC_R1 motif, we define this fungal protein family as class IIf CDCA7, which forms a distinct clade in our phylogenetic analysis of zf-4CXXC_R1 sequence alignment (Fig. S1).

Besides the class IIf CDCA7-like proteins, several fungal species encode a protein with a diverged zf-4CXXC_R1 motif, including those with a JmjC domain at the N-terminus (Figure 4B, Table S1), unlike the plant class III proteins for which the JmjC domain is located at the C-terminus. Among these proteins, it was suggested that Neurospora crassa DMM-1 does not directly regulate DNA methylation or demethylation but rather controls deposition of histone H2A.Z and/or H3K56 acetylation, which inhibit spreading of heterochromatin segments with methylated DNA and H3K9me3 (Honda et al., 2010; Zhang et al., 2022).

Systematic identification of CDCA7, HELLS and DNMT homologs in eukaryotes

To systematically identify CDCA7 and HELLS orthologs in the major eukaryotic supergroups, we conducted a BLAST search against the NCBI protein database using human CDCA7 and HELLS protein sequences. To omit species with a high risk of false negative identification, we selected species containing at least 6 distinct proteins with compelling homology to the SNF2 ATPase domain of HELLS, based on the assumption that each eukaryotic species is expected to have 6-20 SNF2 family ATPases (Flaus et al., 2006). Indeed, even the microsporidial pathogen Encephalitozoon cuniculi, whose genome size is a mere 2.9 Mb, contains six SNF2 family ATPases (Flaus et al., 2006). As such, we generated a panel of 180 species encompassing all major eukaryote supergroups (5 Excavata, 18 SAR [2 Rhizaria, 6 Alveolata, 10 Stramenopiles], 1 Haptista, 1 Cryptista, 15 Archaeplastida [3 Rhodophyta and 12 Viridiplantae], 4 Amoebozoa, 136 Opisthokonts [34 Fungi, 3 Holozoa, and 99 Metazoa]) (Fig. S2, Table S1).
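The species-selection rule described above (retain a species only if it has at least six distinct proteins with homology to the SNF2 ATPase domain) can be sketched as a simple filter. This is an illustration under stated assumptions rather than the pipeline used in this study: the input format, the function name and the toy species and accession identifiers are all hypothetical. The supergroup counts are copied from the text and are summed only as a consistency check.

```python
from collections import defaultdict

MIN_SNF2_PROTEINS = 6  # threshold stated in the text


def species_passing_filter(hits, min_proteins=MIN_SNF2_PROTEINS):
    """Keep species with at least `min_proteins` distinct SNF2-domain proteins.

    `hits` is an iterable of (species, protein_accession) pairs, one per
    BLAST hit; this input format is an assumption made for illustration.
    """
    proteins = defaultdict(set)
    for species, accession in hits:
        proteins[species].add(accession)  # a set collapses repeated hits to one protein
    return sorted(s for s, accs in proteins.items() if len(accs) >= min_proteins)


# Supergroup composition of the panel as listed in the text.
PANEL = {
    "Excavata": 5, "SAR": 18, "Haptista": 1, "Cryptista": 1,
    "Archaeplastida": 15, "Amoebozoa": 4, "Opisthokonta": 136,
}

# Hypothetical toy input: species_x has six distinct proteins (one hit repeated),
# species_y has only three and is therefore dropped.
toy_hits = (
    [("species_x", f"X{i}") for i in range(6)]
    + [("species_x", "X0")]
    + [("species_y", f"Y{i}") for i in range(3)]
)

print(species_passing_filter(toy_hits))
print(sum(PANEL.values()))
```
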

HELLS orthologs were initially identified if they satisfied the RBH criterion. To further validate the annotation of HELLS orthologs, a phylogenetic tree was constructed from a multiple sequence alignment of the putative HELLS orthologs alongside other SNF2-family proteins of H. sapiens, D. melanogaster, S. cerevisiae, and A. thaliana. If HELLS orthologs are correctly identified (i.e. without erroneously including orthologs of another SNF2-subfamily) they should cluster together in a single clade. However, the sequence alignment using the full-length protein sequence failed to cluster HELLS and DDM1 in the same clade (Fig. S3). Since HELLS and other SNF2-family proteins have variable insertions within the SNF2 ATPase domain, multiple sequence alignment of the SNF2 domains was then conducted after removing the insertion regions, as previously reported (Flaus et al., 2006). By this SNF2 domain-only alignment method, all HELLS orthologs formed a clade, separated from CHD1, ISWI, SMARCA2/4, SRCAP, and INO80 (Fig. S4).
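For readers unfamiliar with the RBH criterion, the logic can be sketched as follows: a candidate is accepted only if it is the top-scoring hit of the query in the target proteome and the query is, in turn, the top-scoring hit of that candidate in the query proteome. The sketch below assumes that BLAST bit scores have already been parsed into nested dictionaries; the function names, the data layout and the protein identifiers are hypothetical, and ties are resolved arbitrarily (first maximum encountered), which a production analysis would need to handle explicitly.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` has the form {query: {subject: bit_score}}.
    Queries without any hit are skipped.
    """
    return {q: max(subjects, key=subjects.get) for q, subjects in scores.items() if subjects}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward)   # query proteome -> target proteome
    rev = best_hits(reverse)   # target proteome -> query proteome
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical toy scores: sp_A1 and human_HELLS are mutual best hits, whereas
# sp_A2 is the best hit of human_INO80 but itself hits human_SMARCA5 best.
forward = {
    "human_HELLS": {"sp_A1": 500.0, "sp_A2": 300.0},
    "human_INO80": {"sp_A2": 400.0},
}
reverse = {
    "sp_A1": {"human_HELLS": 480.0, "human_INO80": 100.0},
    "sp_A2": {"human_SMARCA5": 350.0, "human_INO80": 300.0},
}

print(reciprocal_best_hits(forward, reverse))
```

As the text notes, RBH alone was treated as an initial filter; the SNF2 domain-only phylogeny provides the independent check that accepted candidates fall in a single HELLS clade.
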

A BLAST search with the human CDCA7 sequence across the panel of 180 species identified a variety of proteins containing the zf-4CXXC_R1 motif, which are prevalent in all major supergroups (Fig. 5, Fig. S1, Table S1). Each of these identified proteins contains only one zf-4CXXC_R1 motif. The resulting list of CDCA7 BLAST hits was further classified as prototypical (Class I) CDCA7-like proteins (i.e. CDCA7 orthologs) if the first eleven cysteine residues and the three ICF-associated residues are conserved. A phylogenetic tree analysis of zf-4CXXC_R1 domains from diverse species confirmed that Class I CDCA7 proteins are clustered under the same clade (Fig. S1). These CDCA7 orthologs are broadly found in the three supergroup lineages (Archaeplastida, Amoebozoa, Opisthokonta) (Fig. 5, Fig. S1, S2, S5, Table S1). In Excavata, the amoeboflagellate Naegleria gruberi encodes a protein which is a likely ortholog of CDCA7 with an apparent C-terminal truncation (XP_002678720), possibly due to a sequencing error (Fig. S5). In SAR, Class I CDCA7 proteins are absent from all available genomes, except stramenopile Tribonema minus, which encodes a distantly related likely ortholog of CDCA7 (KAG5177154). These conservations suggest that the Class I zf-4CXXC_R1 domain in CDCA7 was inherited from the LECA.
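The Class I criterion stated above (conservation of the first eleven cysteines and of the three ICF-associated residues) amounts to a column-wise check on an alignment. The sketch below illustrates that rule on a synthetic sequence. The cysteine columns are derived from the vertebrate spacing quoted earlier, whereas the three ICF-residue columns, the function names and the toy sequences are hypothetical placeholders; real column positions would have to be read from the actual alignment.

```python
from itertools import accumulate

# Residues between consecutive cysteines in the vertebrate arrangement
# CXXCX10CX4CX7CXXCX19CXXCX3CXCXXC, converted to 0-based cysteine columns.
GAPS = [2, 10, 4, 7, 2, 19, 2, 3, 1, 2]
CYS_COLS = [0] + list(accumulate(gap + 1 for gap in GAPS))

# Hypothetical alignment columns for the three ICF-associated residues
# (R274, G294 and R304 in human CDCA7); placeholders for illustration only.
ICF_COLS = {20: "R", 35: "G", 45: "R"}


def classify_zf_domain(aligned, cys_cols=CYS_COLS, icf_cols=ICF_COLS):
    """Return 'class I' if every checked column carries the expected residue."""
    cys_ok = all(aligned[col] == "C" for col in cys_cols)
    icf_ok = all(aligned[col] == residue for col, residue in icf_cols.items())
    return "class I" if cys_ok and icf_ok else "other"


def make_toy(icf_residues):
    """Build a synthetic aligned domain (not a real protein sequence)."""
    seq = ["A"] * (CYS_COLS[-1] + 1)
    for col in CYS_COLS:
        seq[col] = "C"
    for col, residue in icf_residues.items():
        seq[col] = residue
    return "".join(seq)


prototypical = make_toy(ICF_COLS)
glycine_lost = make_toy({20: "R", 35: "I", 45: "R"})  # glycine position substituted

print(classify_zf_domain(prototypical))
print(classify_zf_domain(glycine_lost))
```
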

Classification of DNMTs in eukaryotes

A simple RBH approach is not practical to classify eukaryotic DNMT proteins due to the presence of diverse lineage-specific DNMTs (Huff and Zilberman, 2014). Therefore, we collected proteins with a DNMT domain within the panel of 180 eukaryote species, and then the DNMT domains were extracted from each sequence (based on an NCBI conserved domains search). Generating a phylogenetic tree based on the multisequence alignment of the DNMT domains, we were able to classify the majority of all identified DNMTs as previously characterized DNMT subtypes according to their sequence similarity. These DNMT subtypes include the functionally well-characterized DNMT1 and DNMT3, the fungi-specific maintenance methyltransferase Dim-2 and de novo methyltransferase DNMT4 (Nai et al., 2020), the SNF2 domain-containing maintenance methyltransferase DNMT5 (Dumesic et al., 2020; Huff and Zilberman, 2014), the plant-specific DNMTs (such as DRM and CMT) (Yaari et al., 2019), DNMT6 (a poorly characterized DNMT identified in Stramenopiles, Haptista and Chlorophyta) (Huff and Zilberman, 2014), and the tRNA methyltransferase TRDMT1 (also known as DNMT2) (Fig. S6). We also identified other DNMTs, which did not cluster into these classes. For example, although it has been reported that DNMT6 is identified in Micromonas but not in other Chlorophyta species, such as Bathycoccus and Ostreococcus (Huff and Zilberman, 2014), we identified Chlorophyta-specific DNMTs that form a distinct clade, which seems to be diverged from DNMT6 and DNMT3. We temporarily called this class Chlorophyta DNMT6-like (Fig. S6). Other orphan DNMTs include the de novo DNA methyltransferase DNMTX in fungus Kwoniella mangroviensis (Catania et al., 2020), which may be most related to CMT, and an uncharacterized DNMT in N. gruberi (XP_002682263), which seems to diverge from DNMT1 (Fig. S6).

Coevolution of CDCA7, HELLS and DNMTs

The identification of protein orthologs across the panel of 180 eukaryotic species reveals that homologs of CDCA7, HELLS and DNMTs are conserved across the major eukaryote supergroups, but they are also dynamically lost (Fig. 5, Fig. S2). We found 40 species encompassing Excavata, SAR, Amoebozoa, and Opisthokonta that lack CDCA7, HELLS and DNMT1. The concurrent presence of DNMT1, UHRF1 and CDCA7 outside of Viridiplantae and Opisthokonta is rare (Fig. 5, Fig. S2, Table S1). Interestingly, Acanthamoeba castellanii, whose genome is reported to have methylated cytosines (Moon et al., 2017), encodes homologs of DNMT1, DNMT3, UHRF1, CDCA7 and HELLS, while none of these genes are present in other Amoebozoa species (Entamoeba histolytica, Dictyostelium purpureum, Heterostelium album) (Fig. 5, Table S1, Fig. S2). N. gruberi is an exceptional example among Excavata species, encoding a suspected CDCA7 ortholog (XP_002678720) as well as HELLS, an orphan DNMT1-like protein (XP_002682263), and a UHRF1-like protein (Table S1). DNMT1, DNMT3, HELLS and CDCA7 seem to be absent in other Excavata lineages, although Euglenozoa variants of DNMT6 and cytosine methylation are identified in Trypanosoma brucei and Leishmania major (Huff and Zilberman, 2014; Militello et al., 2008).

To quantitatively assess evolutionary coselection of DNMTs, CDCA7 and HELLS, we performed CoPAP analysis on the panel of 180 eukaryote species (Fig. S7) (Cohen et al., 2013). The analysis was complicated by the presence of clade-specific DNMTs (e.g., Dim-2, DNMT5, DNMT6 and other plant-specific DNMT variants) and diverse variants of zf-4CXXC_R1-containing proteins (class II, class III). Considering this caveat, we conducted CoPAP analysis of four DNMTs (DNMT1, Dim-2, DNMT3, DNMT5), UHRF1, CDCA7 (class I and class II), and HELLS. We included Dim-2 and DNMT5 since they play a critical role in DNA methylation in fungi lacking the prototypical DNMT1 (Catania et al., 2020; Selker et al., 2002) (Fig. 5, S2, Table S1). As positive and negative controls for the CoPAP analysis, we also included subunits of the PRC2 complex (EZH1/2, EED and Suz12), and other SNF2 family proteins SMARCA2/SMARCA4, INO80 and RAD54L, which have no clear direct role related to DNA methylation, respectively. As expected for proteins that act in concert within the same biological pathway, the CoPAP analysis showed significant coevolution between DNMT1 and UHRF1, as well as between the PRC2 subunits EZH1 and EED. Suz12 did not show a significant linkage to other PRC2 subunits by this analysis, most likely due to a failure in identifying diverged Suz12 orthologs, such as those in Neurospora and Paramecium (Jamieson et al., 2013; Miro-Pina et al., 2022). The class I CDCA7 exhibits significant linkage to DNMT1, UHRF1, HELLS and DNMT3. Notably, none of these proteins show an evolutionary association with the PRC2 proteins or other SNF2 family proteins, supporting the specific coevolution of CDCA7-HELLS and the DNA methylation proteins.

We next conducted the CoPAP analysis against a panel of 50 Ecdysozoa species, where the DNA methylation system is dynamically lost in multiple lineages (Bewick et al., 2017; Engelhardt et al., 2022), yet the annotation of DNMTs, UHRF1, CDCA7 and HELLS is unambiguous. As a negative control, we included INO80, which is dynamically lost in several Ecdysozoa lineages, such as C. elegans (Fig. 5, Fig. S2, Table S1). As expected, CoPAP analysis showed a highly significant coevolutionary interaction between DNMT1 and UHRF1 (Fig. 6). In addition, HELLS interacts with DNMT1, UHRF1 and CDCA7. In contrast, no linkage from INO80 or DNMT3 was seen.

CoPAP analysis of CDCA7, HELLS and DNMTs in Ecdysozoa species

CoPAP analysis of 50 Ecdysozoa species. Presence and absence patterns of indicated proteins during evolution were analyzed. The list of species is shown in Table S1. The phylogenetic tree was generated from amino acid sequences of all proteins shown in Table S1. The numbers indicate p-values.

Among the panel of 180 eukaryote species, we found 82 species encompassing Excavata, Viridiplantae, Amoebozoa and Opisthokonts that have CDCA7-like proteins (including the prototypical CDCA7 and fungal class IIf CDCA7) (Fig. S2 and Table S1). Strikingly, all 82 species containing CDCA7 (or class IIf CDCA7) also harbor HELLS. Almost all CDCA7-encoding species have DNMT1, with the exception of the Mamiellophyceae lineage, which lost DNMT1 but possesses DNMT5, and the fungus T. deformans, which encodes DNMT4 but not DNMT1 or DNMT5. In contrast, 20 species (e.g., S. cerevisiae) possess only HELLS, while 12 species (e.g., Bombyx mori) retain only DNMT1 among these proteins. These observations indicate that the function of CDCA7-like proteins is strongly linked to HELLS and DNMT1, such that their co-selection is maintained during eukaryote evolution, while CDCA7 is easier to lose than DNMT1 and HELLS.
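The co-occurrence counts reported above are, in essence, a tally over per-species presence/absence profiles. As a rough illustration, such a tally and a naive association test can be sketched as below. This is explicitly not the CoPAP method: CoPAP models gain and loss events along a phylogeny, whereas a one-sided Fisher's exact test treats species as independent observations and will therefore tend to overstate significance when related species share a pattern by descent. The species profiles and function names are hypothetical.

```python
from math import comb


def contingency(profiles, a, b):
    """Count species with (both, a only, b only, neither) of two proteins."""
    both = a_only = b_only = neither = 0
    for present in profiles.values():
        has_a, has_b = a in present, b in present
        if has_a and has_b:
            both += 1
        elif has_a:
            a_only += 1
        elif has_b:
            b_only += 1
        else:
            neither += 1
    return both, a_only, b_only, neither


def fisher_one_sided(both, a_only, b_only, neither):
    """P(co-occurrence >= observed) under the hypergeometric null (no phylogeny)."""
    n = both + a_only + b_only + neither
    row_a = both + a_only          # species that have protein a
    col_b = both + b_only          # species that have protein b
    upper = min(row_a, col_b)
    tail = sum(
        comb(col_b, k) * comb(n - col_b, row_a - k)
        for k in range(both, upper + 1)
    )
    return tail / comb(n, row_a)


# Hypothetical toy profiles: CDCA7 never occurs without HELLS.
profiles = {
    "sp1": {"CDCA7", "HELLS", "DNMT1"},
    "sp2": {"CDCA7", "HELLS", "DNMT1"},
    "sp3": {"HELLS"},
    "sp4": {"DNMT1"},
    "sp5": set(),
    "sp6": set(),
}

table = contingency(profiles, "CDCA7", "HELLS")
print(table, fisher_one_sided(*table))
```
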

Loss of CDCA7 in braconid wasps together with DNMT1 or DNMT3

CoPAP analysis detected a coevolutionary linkage between CDCA7 and DNMT1, rather than DNMT3, in Ecdysozoa. We were therefore intrigued by the fact that two species, Tribolium castaneum and Microplitis demolitor, whose genomic DNA does not have any detectable 5mC despite the presence of DNMT1 (Bewick et al., 2017; Schulz et al., 2018; Zemach et al., 2010), lost CDCA7 and DNMT3. To further validate the apparent co-loss of CDCA7 and DNMT1/DNMT3 in Ecdysozoa, we focused on the Hymenoptera clade (including M. demolitor), for which genome synteny has been reported (Table S2) (Li et al., 2021). Indeed, a striking synteny is observed in the genome region surrounding CDCA7 among the parasitic wood wasp (Orussus abietinus) and Aculeata species (bees Bombus terrestris and Habropoda laboriosa, and the eusocial wasp Polistes canadensis), which diverged ∼250 MYA (Li et al., 2021; Peters et al., 2017) (Fig. 7, Table S3). In these species, CDCA7 is located between Methyltransferase-like protein 25 homolog (MET25, E) and Ornithine aminotransferase homolog (OAT, F). In fact, the gene cluster containing LTO1 homolog (D), MET25 (E), OAT (F), and Zinc finger protein 808 (ZN808, G) is highly conserved in all the analyzed Hymenoptera species, but not outside of Hymenoptera (e.g., Drosophila).

Synteny of Hymenoptera genomes adjacent to CDCA7 genes

Genome compositions around CDCA7 genes in Hymenoptera insects are shown. For genomes with annotated chromosomes, chromosome numbers (Chr) or linkage group numbers (LG) are indicated at each gene cluster. Gene clusters without chromosome annotation indicate that they are within the same scaffold or contig. Dashed lines indicate long linkages that are not proportionally scaled in the figure. Due to their extraordinarily long sizes, DE-cadherin genes (L) are not scaled proportionally. Presence and absence of CDCA7, HELLS, DNMT1, DNMT3, and UHRF1 in each genome is indicated by filled and open boxes, respectively. The phylogenetic tree is drawn based on published analysis (Li et al., 2021; Peters et al., 2017) and TimeTree.

However, CDCA7 is lost from this gene cluster in parasitoid wasps, including Ichneumonoidea wasps (M. demolitor, Cotesia glomerata, Aphidius gifuensis, Fopius arisanus, Venturia canescens) and chalcid wasps (Copidosoma floridanum, Nasonia vitripennis) (Figure 7, Table S1 and S2). Among Ichneumonoidea wasps, CDCA7 appears to be lost in the Braconidae clade, while CDCA7 appears to have translocated to a different chromosome in Venturia canescens. In chalcid wasps, CDCA7 translocated to a genome segment between Artemis (I) and Chromatin accessibility complex protein 1 (CHRC1, J). Together with the chromosome-level genome assembly for Cotesia glomerata and Aphidius gifuensis (Feng et al., 2020; Pinto et al., 2021), the synteny analysis strongly indicates that CDCA7 is lost from braconid wasps. Intriguingly, among those braconid wasps that lost CDCA7, DNMT3 is co-lost in the Microgastrinae lineage (M. demolitor, Cotesia glomerata, Cotesia typhae) and Chelonus insularis, while DNMT1 and UHRF1 are co-lost in the Opiinae lineage (Fopius arisanus, Diachasma alloeum) and Aphidius gifuensis (Table S2). This co-loss of CDCA7 with either DNMT1-UHRF1 or DNMT3 in the braconid wasp clade suggests that evolutionary preservation of CDCA7 is more sensitive to DNA methylation status per se than to the presence or absence of a particular DNMT subtype.

Discussion

Although DNA methylation is prevalent across eukaryotes, DNA methyltransferases are frequently lost in a variety of lineages. This study reveals that the nucleosome remodeling complex CHIRRC, composed of CDCA7 and HELLS, is frequently lost in conjunction with DNA methylation status. More specifically, evolutionary preservation of CDCA7 is tightly coupled to the preservation of HELLS and DNMT1. The conservation of CDCA7’s signature cysteine residues alongside three ICF-associated residues across diverse eukaryote lineages suggests a unique, evolutionarily conserved role in DNA methylation. Our co-evolution analysis suggests that DNA methylation-related functionalities of CDCA7 and HELLS are inherited from LECA.

The evolutionary coupling of CDCA7, HELLS and DNMT1 is consistent with a proposed role of HELLS in replication-uncoupled DNA methylation maintenance (Ming et al., 2021). Commonly, DNA methylation maintenance occurs directly behind the DNA replication fork. Replication-uncoupled DNA methylation maintenance is distinct from this process (Nishiyama et al., 2020), and HELLS and CDCA7 may be important for the maintenance of DNA methylation long after the completion of DNA replication, particularly at heterochromatin, where chromatin has restricted accessibility (Ming et al., 2020).

The loss of CDCA7 is not always coupled to the loss of DNMT1 or HELLS, however. In the Hymenoptera clade, CDCA7 loss in the braconid wasps is accompanied by the loss of DNMT1/UHRF1 or the loss of DNMT3. Among these species, it was reported that 5mC DNA methylation is undetectable in M. demolitor, which harbors DNMT1, UHRF1 and HELLS but has lost DNMT3 and CDCA7 (Bewick et al., 2017). Similarly, in the Coleoptera clade, the red flour beetle Tribolium castaneum possesses DNMT1 and HELLS, but has lost DNMT3 and CDCA7. Since DNMT1 is essential for the embryonic development of T. castaneum and CpG DNA methylation is undetectable in this organism (Schulz et al., 2018), it has been predicted that DNMT1 has a function independent of DNA methylation in this species. Indeed, species that preserve DNMTs and/or HELLS in the absence of CDCA7 emerge repeatedly during eukaryote evolution, whereas CDCA7 appears to be immediately dispensable in species that have a dampened requirement for DNA methylation. In other words, there is no evolutionary advantage to retaining CDCA7 in the absence of DNA methylation, and CDCA7 is almost never maintained in the absence of any DNMTs. Alternatively, could the loss of CDCA7 precede the loss of DNA methylation? If CDCA7 is important to promote DNA methylation maintenance, the loss of CDCA7 may exacerbate an impaired epigenetic environment and stimulate the adaptation of the organism towards DNA methylation-independent epigenetic mechanisms, thereby decreasing the necessity to maintain the DNA methylation system. In this way, the loss of CDCA7 could trigger the subsequent loss of DNMTs (and HELLS), unless these proteins acquired important DNA methylation-independent roles. In line with this scenario, insects possess robust DNA methylation-independent mechanisms (such as piwi-RNA and H3K9me3) to silence transposons, and DNA methylation is largely limited to gene bodies in insects (Bonasio et al., 2012; Feng et al., 2010; Libbrecht et al., 2016).
The loss of CDCA7 is more frequently observed in insects than in plants or vertebrates, which rely heavily on DNA methylation to silence transposons (Czech et al., 2018; Onishi et al., 2021). Thus, lowering the demand for DNA methylation at transposable elements might reduce the essentiality of CDCA7 (and then perhaps that of DNMTs), though it is difficult to deduce which evolutionary change occurs first.

The observation that some species retain HELLS but lose CDCA7 (while the reverse is never true) suggests that HELLS can evolve a CDCA7-independent function. Indeed, it has been suggested that the sequence-specific DNA-binding protein PRDM9 recruits HELLS to meiotic chromatin to promote DNA double-strand breaks and recombination (Imai et al., 2020; Spruce et al., 2020). Unlike CDCA7, PRDM proteins are found only in metazoans, and are even lost in some vertebrates such as Xenopus laevis and Gallus gallus (Birtle and Ponting, 2006). Thus, it is plausible that HELLS acquires a species-specific CDCA7-independent role during evolution through acquiring a new interaction partner. Another example of this may be found in S. cerevisiae, where DNA methylation and CDCA7 are absent and the HELLS homolog Irc5 interacts with cohesin to facilitate its chromatin loading (Litwin et al., 2017).

Recently, the role of HELLS in the deposition of the histone variant macroH2A, which compacts chromatin, has been reported in mice (Ni and Muegge, 2021; Ni et al., 2020; Xu et al., 2021). Similarly, in Arabidopsis, DDM1 is critical for deposition of H2A.W, which is enriched on heterochromatin, in a manner independent of DNA methylation (Osakabe et al., 2021). The role of CDCA7 in the deposition of these H2A variants remains to be tested. HELLS and DDM1 can directly interact with macroH2A and H2A.W, respectively, even in the absence of CDCA7 (Ni and Muegge, 2021; Osakabe et al., 2021). It is thus possible that HELLS/DDM1 family proteins have an evolutionarily conserved function in H2A variant deposition independent of CDCA7 and DNA methylation. However, there is no clear indication of evolutionary co-selection of HELLS and macroH2A, as macroH2A is largely missing from insects and the chelicerate Centruroides sculpturatus has macroH2A but lost HELLS and CDCA7 (XP_023217082, XP_023212717).

Whereas the class I CDCA7-like proteins are evolutionarily coupled to HELLS and DNMT1-UHRF1, other variants of zf-4CXXC_R1 are found in many eukaryote clades including Excavata, SAR, Viridiplantae, Amoebozoa, and Fungi. Interestingly, proteins such as IBM1 in Arabidopsis and DMM-1 in Neurospora, which contain both a variant of zf-4CXXC_R1 and the JmjC domain, appeared convergently in several clades (Honda et al., 2010; Saze et al., 2008; Zhang et al., 2022). Functional studies of IBM1 and DMM-1 suggested that they contribute to DNA methylation regulation via indirect mechanisms.

As IBM1 and DMM-1 do not preserve ICF-associated residues, which are critical for nucleosome binding in CDCA7 (Jenness et al., 2018), it is likely that these variants of zf-4CXXC_R1 are adapted to recognize different structural features of the genome and no longer preserve the DNA methylation function of CDCA7 orthologs.

Changes in DNA methylation patterns are highly correlated with aging in mammals (Lowe et al., 2018; Wang et al., 2020), and may cause diseases (Greenberg and Bourc’his, 2019; Nishiyama and Nakanishi, 2021; Robertson, 2005). Mutations in CDCA7, HELLS or DNMT3B cause ICF syndrome (Thijssen et al., 2015; Xu et al., 1999), which leads to poor life expectancy due to severe infections (Gossling et al., 2017; Hagleitner et al., 2008). Considering the importance of DNA methylation in the immune system in vertebrates (Hemmi et al., 2000; Kondilis-Mangum and Wade, 2013), plants (Deleris et al., 2016), and bacteria (Vasu and Nagaraja, 2013), the evolutionary arrival of HELLS-CDCA7 in eukaryotes might have been required to transmit the original immunity-related role of DNA methylation from prokaryotes to nucleosome-containing (eukaryotic) genomes.

Materials and Methods

Building a curated list of 180 species for analysis of evolutionary co-selection

A list of 180 eukaryote species was manually generated to encompass broad eukaryote evolutionary clades (Table S1). Species were included in this list based on two criteria: (i) the identification of UBA1 and PCNA homologs, two highly conserved proteins essential for cell proliferation; and (ii) the identification of more than 6 distinct SNF2 family sequences. Homologs of CDCA7, HELLS, UBA1 and PCNA were identified by BLAST search against the Genbank eukaryote protein database available at the National Center for Biotechnology Information (NCBI), using the human protein sequence as a query. Homologs of human UHRF1, ZBTB24, SMARCA2/SMARCA4, INO80, RAD54L, EZH2, EED or SUZ12 were also identified based on the reciprocal best hit (RBH) criterion. To get a sense of the assembly level of each genome sequence, we divided “Total Sequence Length” by “Contig N50” (“length such that sequence contigs of this length or longer include half the bases of the assembly”; https://www.ncbi.nlm.nih.gov/assembly/help/). In species whose genome assembly level is labeled as “complete”, this value is close to the total number of chromosomes or linkage groups. As a rule of thumb, we therefore defined a genome assembly as “preliminary” if this value is larger than 100. In Figure 5, species with preliminary-level genome assembly are shown as boxes with dotted outlines.
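
The rule of thumb for assembly level described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in this study: the function names, the label "non-preliminary" and the example numbers are hypothetical, and only the ratio (total sequence length divided by contig N50) and the threshold of 100 are taken from the text.

```python
def fragmentation_ratio(total_sequence_length: int, contig_n50: int) -> float:
    """Return total sequence length divided by contig N50 (both in bases)."""
    if contig_n50 <= 0:
        raise ValueError("contig N50 must be a positive number of bases")
    return total_sequence_length / contig_n50


def assembly_label(total_sequence_length: int, contig_n50: int,
                   threshold: float = 100.0) -> str:
    """Label an assembly 'preliminary' when the ratio is larger than the threshold."""
    ratio = fragmentation_ratio(total_sequence_length, contig_n50)
    # "non-preliminary" is a hypothetical label for everything at or below the threshold.
    return "preliminary" if ratio > threshold else "non-preliminary"


# Hypothetical assemblies (illustrative values, not taken from Table S1).
print(assembly_label(120_000_000, 10_000_000))  # ratio of 12
print(assembly_label(300_000_000, 50_000))      # ratio of 6000
```

Because the comparison is strict, an assembly whose ratio is exactly 100 is not flagged as preliminary in this sketch, matching the "larger than 100" wording.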

CDCA7 homolog identification and annotation

The obtained list of CDCA7 homologs was further classified into Class I, Class II or Class III homologs based on the distinct characteristics of the zf-4CXXC_R1 motif, as described in Results, where Class I homologs are considered to be orthologs of human CDCA7. The classification of CDCA7 homologs was further validated based on their clustering in a phylogenetic tree built from the CLUSTALW alignment of the zf-4CXXC_R1 motif (Higgins and Sharp, 1988; Thompson et al., 1994) (Fig. S1), using MacVector (MacVector, Inc.).

HELLS homolog identification and annotation

Putative HELLS orthologs were identified according to the RBH criterion. Briefly, a BLAST search was conducted using human HELLS as the query sequence, after which the protein sequences of the top hits (or secondary hits, if necessary) in each search were used as query sequences for a reciprocal BLAST search against the Homo sapiens protein database. If the top hit in the reciprocal search was the original input sequence (i.e. human HELLS), the protein was annotated as an orthologous protein. If HELLS showed up as the next best hit, the protein was provisionally listed as a “potential HELLS ortholog”. To further validate the identified HELLS orthologs, full-length amino acid sequences of these proteins were aligned using CLUSTALW in MacVector with homologs of SMARCA2/SMARCA4, CHD1, ISWI, RAD54L, ATRX, HLTF, TTF2, SHPRH, INO80, SMARCAD1, SWR1, MOT1, ERCC6, and SMARCAL1, which were also identified and provisionally annotated with a similar reciprocal BLAST search method (Fig. S3). This alignment was used to define the conserved SNF2 domain and variable linker regions in the putative HELLS orthologs, which were then removed to conduct the secondary CLUSTALW alignment in MacVector, from which a phylogenetic tree was generated. Distinct clusters of HELLS and other SNF2 family proteins can be identified from the phylogenetic tree, confirming that the annotation of HELLS orthologs based on the reciprocal BLAST search method is reasonable (Fig. S4). Exceptions are Leucosporidium creatinivorum ORY88017 and ORY88018, which classified within the HELLS clade of the phylogenetic tree. However, we decided to annotate L. creatinivorum ORY88017 and ORY88018 as HELLS orthologs since, among the SNF2-family proteins in this species, these two proteins are most similar to human HELLS, while clear orthologs of other SNF2 family proteins, CHD1 (ORY55731), ISWI (ORY89162), SMARCA2/4 (ORY76015), SRCAP/SWR1 (ORY90750) and INO80 (ORY91599), can be identified in L. creatinivorum.
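
The decision logic of the reciprocal BLAST search described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure as actually run (the searches were performed against NCBI databases): BLAST results are represented as hypothetical in-memory tables of ranked hit identifiers, all identifiers and function names are invented for illustration, and "next best hit" is interpreted here as the second-ranked reciprocal hit.

```python
from typing import Dict, List, Optional, Tuple

# Ranked hit lists keyed by query ID, best hit first (hypothetical data structure).
HitTable = Dict[str, List[str]]


def classify_candidate(query_id: str, candidate_id: str,
                       reciprocal_hits: HitTable) -> str:
    """Classify one candidate by where the original query ranks in its reciprocal search."""
    hits = reciprocal_hits.get(candidate_id, [])
    if hits[:1] == [query_id]:
        return "ortholog"
    if len(hits) > 1 and hits[1] == query_id:
        return "potential ortholog"
    return "not an ortholog"


def find_ortholog(query_id: str, forward_hits: HitTable,
                  reciprocal_hits: HitTable,
                  max_candidates: int = 2) -> Tuple[Optional[str], str]:
    """Test the top forward hits (top hit first, then secondary hits) reciprocally."""
    best: Tuple[Optional[str], str] = (None, "not an ortholog")
    for candidate in forward_hits.get(query_id, [])[:max_candidates]:
        label = classify_candidate(query_id, candidate, reciprocal_hits)
        if label == "ortholog":
            return candidate, label
        if label == "potential ortholog" and best[0] is None:
            best = (candidate, label)
    return best


# Hypothetical example: the secondary forward hit is the reciprocal best hit.
forward = {"human_HELLS": ["sp1_protA", "sp1_protB"]}
reciprocal = {
    "sp1_protA": ["human_SMARCA5", "human_HELLS"],
    "sp1_protB": ["human_HELLS", "human_CHD1"],
}
print(find_ortholog("human_HELLS", forward, reciprocal))
```

In this sketch a candidate that returns the query as its top reciprocal hit always takes precedence over a "potential ortholog", regardless of their order in the forward search.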

DNMT homolog identification and annotation

Proteins with a DNA methyltransferase domain were identified with BLAST searches using human DNMT1 and DNMT3A. If necessary, additional BLAST searches were conducted using human DNMT2, C. neoformans DNMT5, N. crassa Dim-2 and DNMT4. DNMT domains were extracted based on the NCBI conserved domains search, and were aligned with CLUSTALW to build a phylogenetic tree in MacVector.

CoPAP

The published method was used (Cohen et al., 2013). The curated list of orthologous proteins listed in Table S1 was first used to generate a presence-absence FASTA file. Next, a phylogenetic species tree was generated from all orthologous protein sequences listed in Table S1 using the ETE3 toolkit. For this, protein sequences were retrieved using the rentrez R package and exported to a FASTA file alongside a COG file containing gene to orthologous group mappings. ETE3 was used with the parameters -w clustalo_default-trimal01-none-none and -m cog_all-alg_concat_default-fasttree_default, and the resulting tree was exported in Newick format. CoPAP was run using default parameters and results were visualised using Cytoscape. Code and files required for CoPAP input generation, as well as CoPAP parameters and output results, can be found in our Github repository (https://github.com/RockefellerUniversity/Copap_Analysis).
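
As an illustration of the first step, a presence-absence FASTA file can be produced from a species-by-protein table roughly as follows. This is a hedged sketch rather than the code in the repository above: it assumes that each species is encoded as one FASTA record whose sequence is a string of 1 (present) and 0 (absent) in a fixed protein order, and the function name, species labels and presence values are hypothetical.

```python
from typing import Dict, List


def presence_absence_fasta(table: Dict[str, Dict[str, bool]],
                           proteins: List[str]) -> str:
    """Encode each species as a FASTA record of 1/0 characters, one per protein."""
    records = []
    for species, present in table.items():
        # Proteins missing from a species entry are treated as absent.
        pattern = "".join("1" if present.get(p, False) else "0" for p in proteins)
        # Replace spaces so that the species label is a single token.
        records.append(f">{species.replace(' ', '_')}\n{pattern}")
    return "\n".join(records) + "\n"


# Hypothetical presence/absence calls (not taken from Table S1).
protein_order = ["CDCA7", "HELLS", "DNMT1", "DNMT3"]
calls = {
    "species A": {"CDCA7": True, "HELLS": True, "DNMT1": True, "DNMT3": True},
    "species B": {"HELLS": True, "DNMT1": True},
}
print(presence_absence_fasta(calls, protein_order), end="")
```

Keeping the protein order in an explicit list, rather than relying on dictionary order, makes it easier to ensure that every column of the pattern refers to the same protein in all species.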

As negative and positive controls for the CoPAP analysis, we identified several well-conserved protein orthologs across the panel of 180 eukaryotic species, including Snf2-like proteins SMARCA2/SMARCA4, INO80, and RAD54L (Flaus et al., 2006), as well as subunits of the polycomb repressive complex 2 (PRC2), which plays an evolutionarily conserved role in gene repression via deposition of the H3K27me3 mark. PRC2 is conserved in species where DNMTs are absent (including D. melanogaster and C. elegans) but is frequently lost, particularly in several lineages of SAR and Fungi (Sharaf et al., 2022). Among the four core subunits of PRC2, we focused on the catalytic subunit EZH1/2, EED, and SUZ12, since the fourth subunit RbAp46/48 has a PRC2-independent role (Margueron and Reinberg, 2011). We are aware that the reciprocal BLAST search missed previously reported highly divergent functional orthologs of SUZ12 in Neurospora (Jamieson et al., 2013), and of EED and SUZ12 in Paramecium (Miro-Pina et al., 2022). However, to apply our homology-based definition of orthologs consistently, we did not attempt to use these divergent homologs of EED and SUZ12 as baits to expand our search.

Hymenoptera synteny analysis

The mapping of gene loci is based on the information available on the Genome Data Viewer (https://www.ncbi.nlm.nih.gov/genome/gdv). Genome positions of listed genes are summarized in Table S3.

Artworks

Artworks of species images were obtained from PhyloPic.com, of which images of Daphnia, Platyhelminthes, Tribolium and Volvox were generated by Mathilde Cordellier, Christopher Laumer/T. Michael Keesey, Gregor Bucher/Max Farnworth and Matt-Crook, respectively.

Acknowledgements

We thank Daniel Kronauer and Rochelle Shih for critical reading of the manuscript, and D. Kronauer, Li Zhao, Junhui Peng and Erick Jarvis for helpful discussion. The research by Q. J. was in part executed through the Chemers Neustein Summer Undergraduate Research Fellowship (SURF) Program at the Rockefeller University.

Funding

This work was supported by National Institutes of Health Grant R35GM132111 to H.F., and the Women & Science Postdoctoral Fellowship Program at The Rockefeller University to I.E.W.

Competing interests

H.F. is affiliated with Graduate School of Medical Sciences, Weill Cornell Medicine, and Cell Biology Program, the Sloan Kettering Institute. The authors declare that no competing financial interests exist.

Authors’ Contribution

H. F., Conceptualization, Data curation, Funding acquisition, Investigation, Project administration, Supervision, Visualization, Writing – original draft, Writing – review & editing; I.E.W., Funding acquisition, Investigation, Supervision, Writing – review & editing; Q.J., Data curation, Investigation, Visualization; J.L., Data curation, Methodology, Software; T.C., Data curation, Methodology, Software.

Supplemental Figure Legends

Evolutionary conservation of CDCA7-family proteins and other zf-4CXXC_R1-containing proteins

Amino acid sequences of zf-4CXXC_R1 domain from indicated species were aligned with CLUSTALW. A phylogenetic tree of this alignment is shown. Genbank accession numbers of analyzed sequences are indicated.

Evolutionary conservation of CDCA7, HELLS and DNMTs

Presence and absence of each annotated protein in the panel of 180 eukaryote species are marked as filled and blank boxes, respectively. The phylogenetic tree was generated based on NCBI taxonomy by phyloT. Bottom right: summary of combinatory presence or absence of CDCA7 (class I or II), HELLS, and maintenance DNA methyltransferases DNMT1/Dim-2/DNMT5.

Phylogenetic tree of HELLS and other SNF2 family proteins

Amino acid sequences of full-length HELLS proteins from the panel of 180 eukaryote species listed in Table S1 were aligned with full length sequences of other SNF2 family proteins with CLUSTALW. A phylogenetic tree of this alignment is shown. Genbank accession numbers of analyzed sequences are indicated.

Phylogenetic tree of the SNF2-domain

Amino acid sequences of the SNF2 domain without variable insertions from representative HELLS and DDM1-like proteins from Figure S3 were aligned with the corresponding domain of other SNF2 family proteins with CLUSTALW. A phylogenetic tree of this alignment is shown. Genbank accession numbers of analyzed sequences are indicated.

Sequence alignment and classification of zf-4CXXC_R1 domains across eukaryotes

CDCA7 orthologs are characterized by the class I zf-4CXXC_R1 motif, where eleven cysteine residues and three residues mutated in ICF patients are conserved. One or more of these amino acids are changed in other variants of zf-4CXXC_R1 motifs. Note that the codon frame after the stop codon (an asterisk in a magenta box) of Naegleria XP_002678720 encodes a peptide sequence that aligns well with human CDCA7, indicating that the apparent premature termination of XP_002678720 is likely caused by a sequencing or annotation error.

Phylogenetic tree of DNMT proteins

DNA methyltransferase domains of DNMT proteins across eukaryotes (Table S1, excluding the majority of those from Metazoa), the Escherichia coli DNA methylases DCM and Dam, and Homo sapiens PCNA as an outlier sequence were aligned with CLUSTALW. A phylogenetic tree of this alignment is shown.

CoPAP analysis of CDCA7, HELLS and DNMTs in eukaryotes

CoPAP analysis of 180 eukaryote species. Presence and absence patterns of the indicated proteins during evolution were analyzed. The list of species is shown in Table S1. The phylogenetic tree was generated from the amino acid sequences of all proteins shown in Table S1. Numbers indicate p-values.

Table S1. Lists of proteins and species used in this study

Tab1, Full list. The list contains species names, their taxonomies, Genbank accession numbers of proteins, PMID of references supporting the 5mC status, and genome sequence assembly statistics. ND; not detected. DNMT5 proteins shown in red lack the Snf2-like ATPase domain. UHRF1 proteins shown in red lack the Ring-finger E3 ubiquitin-ligase domain. CDCA7 proteins shown in red indicate ambiguous annotation as described in the main text.

Tab2, Full list 2. The list is used to make presence (1) or absence (0) list.

Tab3 Ecdysozoa CO-PAP. List of Ecdysozoa species used for CO-PAP analysis

Tab4 Full CO-PAP. List of all species used for CO-PAP analysis

Tab5 Full clustering. Table used for clustering analysis

Tab6 Metazoan invertebrates. Table used for clustering analysis for metazoan invertebrates.

Tab7 Fungi clustering. Table used for clustering analysis for fungi.

Table S2. Lists of proteins in Hymenoptera species supporting Figure 5

Table S3. Summary of Hymenoptera genome location supporting Figure 5