Figures and data

CDCA7 is absent from model organisms with undetectable genomic 5mC
CDCA7 homologs are absent from model organisms where 5mC is largely undetectable in genomic DNA and DNMT1/DNMT3 are lost

CDCA7 paralogs in vertebrates
A. Schematic of vertebrate CDCA7 primary sequence composition, based on NP_114148. Yellow and light blue lines indicate the positions of evolutionarily conserved cysteine residues and of residues that are mutated in ICF patients, respectively. B. Sequence alignment of the zf-4CXXC_R1 domain of vertebrate CDCA7-family proteins. White arrowheads: amino acid residues unique to fish CDCA7L. Black arrowheads: residues that distinguish CDCA7L and CDCA7e from CDCA7.

CDCA7 homologs and other zf-4CXXC_R1-containing proteins in Arabidopsis
Top: alignments of the zf-4CXXC_R1 domains found in Arabidopsis thaliana. Bottom: domain structures of the three classes of zf-4CXXC_R1-containing proteins in Arabidopsis.

Evolutionary conservation of CDCA7, HELLS and DNMTs in fungi
A. Sequence alignment of class IIf zf-4CXXC_R1 sequences found in fungi. B. Domain architectures of zf-4CXXC_R1-containing proteins in fungi. The class II zf-4CXXC_R1 domain is indicated with purple circles. Squares with dotted lines indicate preliminary genome assemblies.

Evolutionary conservation of CDCA7, HELLS and DNMTs
The phylogenetic tree was generated based on timetree.org. Filled squares indicate the presence of one or more orthologous proteins. Squares with dotted lines indicate preliminary genome assemblies. Squares with a diagonal line mark two special cases: Paramecium EED was functionally identified (Miro-Pina et al., 2022) but was not detected by the sequence-based search in this study, and homologs of EZH1/2 and EED were identified in Symbiodinium sp. KB8 but not in Symbiodinium microadriaticum (Table S1). The opaque box for DNMT5 in Symbiodinium indicates a homolog that lacks the ATPase domain commonly found in DNMT5 family proteins. Opaque boxes for UHRF1 indicate homologs that harbor the SRA domain but not the RING-finger domain. The full analysis of the panel of 180 eukaryote species is shown in Figure S1 and Table S1. GenBank accession numbers of each protein and PMID numbers of published papers that report the presence or absence of 5mC are listed in Table S1.

CoPAP analysis of CDCA7, HELLS and DNMTs in Ecdysozoa species
CoPAP analysis of 50 Ecdysozoa species. Presence and absence patterns of the indicated proteins during evolution were analyzed. The list of species is shown in Table S1. The phylogenetic tree was generated from the amino acid sequences of all proteins shown in Table S1. The numbers indicate p-values.

Synteny of Hymenoptera genomes adjacent to CDCA7 genes
Genome compositions around CDCA7 genes in Hymenoptera insects are shown. For genomes with annotated chromosomes, chromosome numbers (Chr) or linkage group numbers (LG) are indicated at each gene cluster. Gene clusters without chromosome annotation are located within the same scaffold or contig. Dashed lines indicate long linkages that are not drawn to scale. Because of their extraordinary length, DE-cadherin genes (L) are also not drawn to scale. Presence and absence of CDCA7, HELLS, DNMT1, DNMT3, and UHRF1 in each genome are indicated by filled and open boxes, respectively. The phylogenetic tree is drawn based on published analyses (Li et al., 2021; Peters et al., 2017) and TimeTree.

Evolutionary conservation of CDCA7-family proteins and other zf-4CXXC_R1-containing proteins
Amino acid sequences of the zf-4CXXC_R1 domain from the indicated species were aligned with CLUSTALW. A phylogenetic tree of this alignment is shown. GenBank accession numbers of the analyzed sequences are indicated.

Evolutionary conservation of CDCA7, HELLS and DNMTs
Presence and absence of each annotated protein in the panel of 180 eukaryote species are marked as filled and blank boxes, respectively. The phylogenetic tree was generated based on NCBI taxonomy by phyloT. Bottom right: summary of the combinatorial presence or absence of CDCA7 (class I or II), HELLS, and the maintenance DNA methyltransferases DNMT1/Dim-2/DNMT5.
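As an illustration of how a combinatorial summary of this kind can be tabulated, the following minimal Python sketch counts the species that share each presence/absence combination. The species names and calls are hypothetical toy values, not data from Table S1, and the function name `tabulate_combinations` is introduced here only for illustration; it is not the procedure used to produce the figure.

```python
from collections import Counter

# Hypothetical toy presence/absence calls (True = present).
# These are NOT taken from Table S1; they only illustrate the tabulation.
PROTEINS = ("CDCA7", "HELLS", "DNMT1/Dim-2/DNMT5")
calls = {
    "species_A": {"CDCA7": True, "HELLS": True, "DNMT1/Dim-2/DNMT5": True},
    "species_B": {"CDCA7": False, "HELLS": True, "DNMT1/Dim-2/DNMT5": True},
    "species_C": {"CDCA7": False, "HELLS": False, "DNMT1/Dim-2/DNMT5": False},
    "species_D": {"CDCA7": True, "HELLS": True, "DNMT1/Dim-2/DNMT5": True},
}


def tabulate_combinations(calls, proteins=PROTEINS):
    """Count how many species share each presence/absence combination."""
    return Counter(tuple(row[p] for p in proteins) for row in calls.values())


counts = tabulate_combinations(calls)

# Print one line per observed combination, e.g. "CDCA7+, HELLS+, ...: n".
for combo, n in sorted(counts.items(), reverse=True):
    label = ", ".join(
        f"{p}{'+' if present else '-'}" for p, present in zip(PROTEINS, combo)
    )
    print(f"{label}: {n}")
```

With real data, each row would correspond to one of the 180 species and the counts would give the number of species in each category of the summary.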

Phylogenetic tree of HELLS and other SNF2 family proteins
Amino acid sequences of full-length HELLS proteins from the panel of 180 eukaryote species listed in Table S1 were aligned with full-length sequences of other SNF2 family proteins with CLUSTALW. A phylogenetic tree of this alignment is shown. GenBank accession numbers of the analyzed sequences are indicated.

Phylogenetic tree of the SNF2-domain
Amino acid sequences of the SNF2 domain, without variable insertions, from representative HELLS and DDM1-like proteins from Figure S3 were aligned with the corresponding domain of other SNF2 family proteins with CLUSTALW. A phylogenetic tree of this alignment is shown. GenBank accession numbers of the analyzed sequences are indicated.

Sequence alignment and classification of zf-4CXXC_R1 domains across eukaryotes
CDCA7 orthologs are characterized by the class I zf-4CXXC_R1 motif, in which eleven cysteine residues and three residues mutated in ICF patients are conserved. One or more of these amino acids are changed in other variants of the zf-4CXXC_R1 motif. Note that the reading frame after the stop codon (an asterisk in a magenta box) of Naegleria XP_002678720 encodes a peptide sequence that aligns well with human CDCA7, indicating that the apparent premature termination of XP_002678720 is likely caused by a sequencing or annotation error.

Phylogenetic tree of DNMT proteins
The DNA methyltransferase domains of DNMT proteins across eukaryotes (Table S1, excluding the majority of those from Metazoa), the Escherichia coli DNA methylases DCM and Dam, and Homo sapiens PCNA as an outlier sequence were aligned with CLUSTALW. A phylogenetic tree classifying the aligned sequences is shown.

CoPAP analysis of CDCA7, HELLS and DNMTs in eukaryotes
CoPAP analysis of 180 eukaryote species. Presence and absence patterns of the indicated proteins during evolution were analyzed. The list of species is shown in Table S1. The phylogenetic tree was generated from the amino acid sequences of all proteins shown in Table S1. The numbers indicate p-values.