The genetic factors of bilaterian evolution

Abstract

The Cambrian explosion was a unique animal radiation ~540 million years ago that produced the full range of body plans across bilaterians. The genetic mechanisms underlying these events are unknown, leaving a fundamental question in evolutionary biology unanswered. Using large-scale comparative genomics and advanced orthology evaluation techniques, we identified 157 bilaterian-specific genes. They include the entire Nodal pathway, a key regulator of mesoderm development and left-right axis specification; components for nervous system development, including a suite of G-protein-coupled receptors that control physiology and behaviour, the Robo-Slit midline repulsion system, and the neurotrophin signalling system; a high number of zinc finger transcription factors; and novel factors that previously escaped attention. Contradicting the current view, our study reveals that genes with bilaterian origin are robustly associated with key features in extant bilaterians, suggesting a causal relationship.

Introduction

The taxon Bilateria consists of multicellular animals with bilateral body symmetry and constitutes a major and ancient radiation of animals. There is compelling morphological and molecular evidence for the monophyly of bilaterians (Hejnol et al., 2009; Dunn et al., 2014; Cannon et al., 2016), for their subdivision into protostomes and deuterostomes (Aguinaldo et al., 1997; Philippe et al., 2005; Dunn et al., 2008; Simakov et al., 2013; Cannon et al., 2016), and for the overall relationships of ∼25 phyla that make up this group (Dunn et al., 2008; Hejnol et al., 2009; Dunn et al., 2014). In contrast, the evolutionary relationships of non-bilaterian metazoans are still a matter of debate, in particular the relative positions of placozoans, ctenophores, and sponges (Brooke and Holland, 2003; Ryan et al., 2013; Pisani et al., 2015; Feuda et al., 2017; Simion et al., 2017; Whelan et al., 2017).

The first unambiguously bilaterian fossils appear in Cambrian sediments with an age of ∼540 million years (Marshall, 2006; Erwin and Valentine, 2013). By the end of Cambrian stage 3 (∼514 Mya), stem groups of all major bilaterian phyla inhabited Earth. This abrupt appearance of most bilaterian body plans, the sets of morphological features common to a phylum, already puzzled Darwin (Darwin, 2009). It is considered one of the most important evolutionary events after the origin of life (Conway Morris, 2006; Budd, 2008) and still awaits an explanation today. Importantly, no new body plans evolved in the 500 My since the initial radiation.

Abiotic, ecological, and genetic factors have been proposed to explain the Cambrian radiation. While deep-ocean oxygenation (Canfield et al., 2007), the availability of calcium (Jackson et al., 2010), or ecological interactions (Budd and Jensen, 2017) likely played a role, genetic changes in the bilaterian ancestor must ultimately have constituted its molecular basis. However, evidence for such genetic changes is scarce. Genomic sequencing of non-bilaterian animals revealed that the major signalling pathways and many developmentally important genes of bilaterians are also present in non-bilaterians, indicating that these genes evolved before the advent of bilaterians (Technau et al., 2005; Putnam et al., 2007; Srivastava et al., 2008; Srivastava et al., 2010; Ryan et al., 2013; Babonis and Martindale, 2017). Similarly, epigenetic mechanisms to regulate gene expression, such as DNA methylation and histone modifications, seem to be conserved between bilaterians and non-bilaterian metazoans (Zemach et al., 2010; Schwaiger et al., 2014). Therefore, the common view is that modification of existing gene regulatory networks rather than the invention of new genes determined the evolution of complex body plans (Davidson and Erwin, 2006; Su and Yu, 2017).

Nevertheless, a number of studies identified genes that emerged in the ancestor of bilaterians. One example is a major expansion of miRNA families that likely triggered an increase in miRNA-mediated gene regulation (Prochnik et al., 2007; Wheeler et al., 2009). However, the significance of this event at the base of the Bilateria is unclear because frequent miRNA expansions are seen in various lineages over time (Peterson et al., 2009). Similarly, a link between the genome organiser CTCF and Hox genes presumably emerged in the bilaterian ancestor and might have contributed to the organisation of bilaterian body plans (Heger et al., 2012). The importance of CTCF for Hox gene expression has been shown repeatedly (Mohan et al., 2007; Kim et al., 2011; Rousseau et al., 2014; Narendra et al., 2015), yet direct evidence for the involvement of a Hox-CTCF link in body patterning is lacking. Another study implicated the TATA-box-binding protein-related factor 2 (TRF2) in the evolution of bilaterians. This factor may have founded new, TATA box-independent transcriptional programs involved in body plan development (Duttke et al., 2014), but the consequences of this hypothesis have not been tested.

A comprehensive screen for bilaterian-specific genes and an assessment of their evolutionary impact are, however, still missing. A major obstacle for such a screen is the uneven coverage of the animal tree with sequence data. While some lineages, particularly those including model organisms (e.g., nematodes, flies, or mammals), are well represented, other areas of the metazoan tree are remarkably under-represented, for example lophotrochozoans and non-bilaterian metazoans. For instance, the leading orthology databases OrthoDB (Kriventseva et al., 2015; Kriventseva et al., 2019), eggNOG (Huerta-Cepas et al., 2016), and OrthoMCL (Li et al., 2003) each contain fewer than five non-bilaterian species, and two of these databases do not contain lophotrochozoans at all (Figure 1, Table 1). It is therefore difficult to deduce from such databases the genes that are widespread in bilaterians and absent in non-bilaterians. In addition to the bias in coverage, sequence databases suffer from annotation errors, which particularly affect non-model organisms and under-represented parts of the tree, such as non-bilaterian metazoans and lophotrochozoans. Annotation errors, in turn, have been identified as the largest single source of errors in orthology benchmark testing and, together with uneven phylogenetic coverage, accounted for up to 40% of incorrect assignments (Trachana et al., 2011).

Figure 1

Table 1

Comparison of three major orthology databases with the BigWenDB.

The number of species of a given taxon (left column) in four different orthology databases is shown. In contrast to other databases, the BigWenDB has substantially more sequence information from non-bilaterian metazoans and therefore a better resolution at the divergence of bilaterians and non-bilaterians. D = Deuterostomia, E = Ecdysozoa. Note the bias of other databases towards insects and vertebrates, which continues in the latest database versions (e.g., OrthoDB v10.2; Kriventseva et al., 2019).

Taxon	OrthoDB V8	eggNOG V4.5	OrthoMCL V5	BigWenDB
Cellular organisms	3,027	2,031	150	273
Metazoa	173	88	29	175
Bilateria	169	85	27	142
non-Bilateria	4	3	2	33
Ecdysozoa (E)	97	29	12	54
E w/o insects	17	9	4	29
Lophotrochozoa	5	0	0	23
Deuterostomia (D)	66	55	14	65
D w/o vertebrates	5	4	1	12

To address these biases and to infer bilaterian-specific genes in a reliable and robust way, we (i) assembled a dataset covering the animal tree in the most comprehensive and representative way so far; (ii) particularly strengthened resolution at the base of the Bilateria; (iii) reduced annotation errors by incorporating newly generated ORF (open reading frame) data sets; and (iv) evaluated the composition of the generated orthologous groups in a phylogenetic context. Using this strategy we extracted, from an initial set of 124 million sequences from 273 species, 157 high-confidence bilaterian-specific genes, with many functions connected to key bilaterian features.

Results

Dataset generation and orthogroup evaluation

Non-bilaterian metazoans are severely under-represented in existing sequence collections, but sufficient coverage is critical to illuminate bilaterian evolution. To maximise phylogenetic resolution at the origin of Bilateria, we assembled a new database specifically tailored to this purpose, the BigWenDB (Figure 1, Figure 1—figure supplement 1; Table 1). This database combines sequence data of 273 species from three sources. The backbone of our analysis is the opisthokont sequence space (primarily fungi, vertebrates, and insects): 204 species, each with >8000 available sequences at GenBank, totalling 2.7 million sequences (Table 2; NCBI GenBank release 203 from 15 August 2014). The second part derives from transcriptome sequences of 64 species from various sources (Supplementary file 1–Supplementary Table 1, Supplementary file 1–Supplementary Table 2, Supplementary file 2). Among others, non-bilaterian metazoans (30 species) and lophotrochozoans (12 species) contribute 11.7 million sequences to this group, complementing their poor GenBank representation (Figure 1—figure supplement 1). The third and largest sequence set contains ∼109 million open reading frames (ORFs) obtained by translating 25 metazoan genomes (Supplementary file 1–Supplementary Table 3). All non-bilaterian and lophotrochozoan whole genome sequences available at the time, as well as genomes from additional phyla, were included to compile a comprehensive and representative dataset (Figure 1—figure supplement 1). As this strategy caused a large increase in sequence number, we limited the third set to 25 species to maintain technical feasibility. The final dataset combines 124 million sequences from 21 metazoan and three outgroup phyla, including several taxa absent from other databases, for example tardigrades, a priapulid, bryozoans, a nemertean, a rotifer, a brachiopod, and choanoflagellates (Figure 1, Figure 1—figure supplement 1).

Table 2

Composition of the BigWenDB.

The number of sequences (overall: 124,031,501) collected from three different sources (NCBI, Transcriptome, ORFs) is indicated for major taxonomic groups of the BigWenDB. "Others" comprises the filasterean Capsaspora owczarzaki and the choanoflagellates Monosiga brevicollis and Salpingoeca rosetta.

Group	(Super)Phylum	# Species	NCBI	Transcriptome	ORFs
Bilateria	Deuterostomia	65	895,084	2,292,541	51,922,654
	Ecdysozoa	54	511,663	2,150,424	17,338,026
	Lophotrochozoa	23	170,379	2,618,518	9,805,405
Non-Bilat.	Ctenophora	7	0	1,468,372	2,458,546
	Placozoa	1	11,215	0	590,820
	Porifera	6	8,836	539,299	1,008,535
	Cnidaria	19	36,873	2,361,032	26,443,358
Fungi		93	1,032,299	0	0
others		3	29,292	0	0
total		273	2,695,641	11,768,516	109,567,344

To be able to generate clusters of orthologous proteins from this large dataset, we adapted the OrthoMCL pipeline (Li et al., 2003) and improved its scalability (see Appendix 1: Orthology pipeline and clustering; Supplementary file 1–Supplementary Table 4). As a large proportion of the resulting 824,605 orthogroups was small and had phylogenetically inconsistent composition (Appendix 1—figure 1; Supplementary file 1–Supplementary Table 5), we focused our analysis on 75,744 orthogroups (OGs) with at least 10 species. They provide a rich repertoire for the identification of lineage-specific protein sets.
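The size filter described above can be illustrated with a minimal Python sketch. The data layout (a mapping from orthogroup ID to a list of species/sequence pairs) and the function name are assumptions made for illustration; they do not reflect the actual file formats of the adapted OrthoMCL pipeline.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: orthogroup ID -> list of (species, sequence ID) members.
Orthogroups = Dict[str, List[Tuple[str, str]]]


def filter_by_species_count(ogs: Orthogroups, min_species: int = 10) -> Orthogroups:
    """Keep orthogroups whose members come from at least `min_species` distinct species."""
    kept: Orthogroups = {}
    for og_id, members in ogs.items():
        # Count distinct species, not sequences: many paralogues or ORFs
        # from a single genome must not inflate the species count.
        n_species = len({species for species, _ in members})
        if n_species >= min_species:
            kept[og_id] = members
    return kept
```

Counting distinct species rather than sequences matters here, because a single genome can contribute many ORFs to one orthogroup.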

Hundreds to thousands of novel translated ORFs that are missed by traditional annotation methods exist in humans and other animals (Ladoukakis et al., 2011; Mackowiak et al., 2015; Raj et al., 2016). A key aspect of our analysis is therefore the inclusion of genomic ORFs. To estimate their contribution to the clustering process, we examined the composition of all orthogroups. Genomic ORFs constitute a substantial fraction of the majority of orthogroups, comprising >90% of all sequences in 50% of orthogroups. This demonstrates that a high percentage of orthogroups is either dependent on or substantially affected by the inclusion of ORFs. Although most ORFs are short (mean length of 60 AA; Figure 1—figure supplement 2, Figure 1—figure supplement 3), nearly 2.3 million ORFs (on average 90,443 per species) are >132 AA, the mean size of domains in the PFAM database, ensuring the possibility of annotating ORF-dominated orthogroups (Figure 1—figure supplement 2).
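A composition statistic of this kind (the share of orthogroups in which genomic ORFs exceed a given fraction of all member sequences) could be computed along the following lines. This is only a sketch: the source labels ("NCBI", "Transcriptome", "ORF") mirror the three data sources named in the text, but the per-sequence labelling and the function names are assumptions.

```python
from typing import Dict, List


def orf_fraction(sources: List[str]) -> float:
    """Fraction of sequences in one orthogroup that derive from genomic ORFs."""
    return sum(s == "ORF" for s in sources) / len(sources)


def share_orf_dominated(og_sources: Dict[str, List[str]], threshold: float = 0.9) -> float:
    """Share of (non-empty) orthogroups whose ORF fraction exceeds `threshold`."""
    non_empty = [src for src in og_sources.values() if src]
    dominated = sum(1 for src in non_empty if orf_fraction(src) > threshold)
    return dominated / len(non_empty)
```

With the default threshold of 0.9, a return value of 0.5 would correspond to the situation reported above.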

We next assessed the accuracy and biological validity of our orthogroup dataset via several approaches. First, we compared our clustering results with an external benchmark set of 70 manually curated orthogroups (Trachana et al., 2011; see Appendix 1: Cluster evaluation and quality control; Supplementary file 3). Second, we specifically examined the clustering results of a highly conserved and difficult-to-assess class of proteins, the Nkx homeodomain proteins (Supplementary file 1–Supplementary Table 6). Third, we evaluated potential sources of error with respect to the phylogenetic composition of a given orthogroup (see Appendix 1: Identification of bilaterian-specific genes). For this purpose, we developed a new reciprocal HMM-HMM comparison step. It performs sensitive, BLAST-independent searches for orthogroups with similar sequence profiles to validate orthogroup completeness. We demonstrated the value of this step by using two proteins as test cases, the FGF signalling pathway component Sprouty and the insulator protein GAGA factor (see Appendix 1: Identification of bilaterian-specific genes; Supplementary file 1–Supplementary Table 7). After these quality control steps, we finally identified 157 orthogroups as a minimal set of high-confidence, bilaterian-specific orthogroups (Supplementary file 4).

The domain repertoire of bilaterian-specific proteins is enriched for DNA-binding

To reveal the putative function of the 157 identified bilaterian-specific genes, we first determined their protein domain repertoire and the gene ontology terms for molecular function associated with these domains. We then compared the results to analyses carried out for the vertebrate and arthropod nodes, as these nodes represent major radiations that are well-supported by genome sequence data. The obtained terms indicate that membrane processes, including cell adhesion, G-protein-coupled receptor signalling, and Ca²⁺-binding, as well as protein interactions and metal ion binding, are prominent molecular functions of bilaterian-specific proteins (Figure 2 left, top and middle row). In contrast, terms derived from the arthropod and vertebrate nodes are markedly different. While the vertebrate repertoire comprises G-protein-coupled receptors, cadherins, and extracellular domains required for protein-protein or protein-ligand interactions, arthropod-specific genes are characterised by a broad spectrum of similarly prominent functions, from expected roles in cuticle and chitin biology to a plenitude of conserved domains of unknown function (Figure 2 middle and right, top and middle row). These results indicate that proteins with distinct functions characterise the evolution of each of the three nodes.

Figure 2

Inventory of protein domains and associated GO terms for three animal lineages.

Further, our comparative analysis implied that a large number of transcription factors emerged in the bilaterian ancestor. While 3.58% of vertebrate-specific orthogroups and 9.30% of arthropod-specific orthogroups had transcription factor-associated domains such as zinc fingers or homeodomains, the corresponding fraction was 26.06% in bilaterian-specific orthogroups (Figure 2 middle row). To substantiate this result, we randomly selected 10 times 157 proteins from a curated set of 20,205 human proteins. The average number of transcription factors in these control sets was 12.8 ± 4.44 as opposed to 37 transcription factors in the set of 157 bilaterian-specific genes. This is a highly significant result under a number of assumptions for data distribution (see Materials and methods), lending statistical support to an unexpectedly high number of transcription factors in the bilaterian-specific dataset.
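The logic of this control can be sketched in Python. The example below draws random sets from a reference list and then relates the observed count to the control distribution with a simple z-score. The boolean encoding of the reference set and the function names are hypothetical, and the z-score assumes approximately normally distributed control counts; it is one possible reading of the comparison, not necessarily the test described in Materials and methods.

```python
import random
import statistics


def control_tf_counts(is_tf, n_sets=10, set_size=157, seed=0):
    """Count transcription factors in `n_sets` random draws of `set_size` proteins.

    `is_tf` is a list of booleans, one per protein in the reference set
    (a hypothetical encoding of the curated annotation).
    """
    rng = random.Random(seed)
    # Each draw samples without replacement; summing booleans counts the TFs.
    return [sum(rng.sample(is_tf, set_size)) for _ in range(n_sets)]


def z_score(observed, mean, sd):
    """Distance of the observed count from the control mean, in standard deviations."""
    return (observed - mean) / sd


# Values reported in the text: 37 observed vs. 12.8 +/- 4.44 in the control sets.
z = z_score(37, 12.8, 4.44)
print(round(z, 2))  # 5.45

# Usage with a synthetic reference set (placeholder data, not the human proteome).
synthetic = [True] * 100 + [False] * 900
counts = control_tf_counts(synthetic)
print(statistics.mean(counts), statistics.stdev(counts))
```

Under the normality assumption, the observed count lies roughly 5.5 standard deviations above the control mean. With only ten control sets, the estimated standard deviation is itself uncertain, which is one reason why the text qualifies the significance as depending on distributional assumptions.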

Importantly, many of the transcription factors contained tandem C₂H₂ zinc finger domains and already originated with multiple zinc fingers, as their extant Drosophila and human orthologues suggest (Supplementary file 1–Supplementary Table 8). With the addition of at least 13 members, the modest poly-ZF repertoire at the dawn of metazoans thus almost doubled in the bilaterian ancestor (Figure 2—figure supplement 1) in line with previous evidence that poly-ZF proteins emerged from a small group of eukaryotic zinc finger transcription factors (Emerson and Thomas, 2009). Considering that several factors with this domain configuration are involved in regulating chromatin architecture, including CTCF (Phillips-Cremins et al., 2013), YY1 (Weintraub et al., 2017), Pita (Kyrchanova et al., 2017), SuHw (Van Bortle et al., 2012), and Casz1 (Mattar et al., 2018), these findings open the possibility that multiple poly-ZF factors participated in modifying higher-order chromatin structure during the emergence of bilaterians, as proposed for CTCF (Heger et al., 2012; Vietri Rudan and Hadjur, 2015; Acemel et al., 2017). With the exception of YY1 (OG_3966: metazoan origin or earlier), all known chromatin architectural proteins emerged in the ancestor of bilaterians or later (Heger et al., 2013; Heger and Wiehe, 2014), suggesting that a more sophisticated regulation of gene expression by influencing chromatin architecture contributed to bilaterian evolution. More generally, we note that poly-ZF proteins often comprise the most abundant transcription factor superfamily in bilaterians, with many lineage-specific expansions even within orders and families (Panfilio et al., 2019). Below, we also comment both on similar patterns in other protein classes and on potential other roles of a bilaterian expansion in poly-ZF proteins.

Bilaterian-specific proteins contain novel protein domains

Using domain scans, we could not identify known protein domains or other functional annotation for 5 of the 157 bilaterian-specific orthogroups. Nevertheless, the corresponding alignments displayed extended regions of sequence conservation (Figure 2—figure supplement 2, Figure 2—figure supplement 3, Figure 2—figure supplement 4), arguing that these regions may constitute so far undetected protein domains. To explore whether the putative domains are bilaterian novelties, we converted them to hidden Markov models and used these to search our database of 824,605 orthogroup HMMs. In these searches, only one of the five domains showed weak evidence for homology outside the Bilateria, suggesting that a protein with a similar domain may exist in non-bilaterians. The other four domains were restricted to bilaterians, like the proteins they belong to (Supplementary file 1–Supplementary Table 9), a finding compatible with the de novo birth of these four genes. Similarly, sequences without known protein domains were also detectable in arthropod- and vertebrate-specific orthogroups (Figure 2) and, more generally, in approximately 40% of the 69,114 orthogroups with more than ten species. These findings open the possibility that, across opisthokonts, many lineage-specific genes are uncharacterised and may contain previously undescribed protein domains and novel lineage-specific domains, emphasising the involvement of gene birth in lineage evolution on a broad scale.

Changes in the transcription factor repertoire and in membrane processes accompany bilaterian evolution

Nuclear factors include key developmental regulators

To reveal the putative function of the identified bilaterian-specific genes, we determined the subcellular location of their human orthologues according to the information at www.uniprot.org (Figure 3). Almost two-thirds of the 157 genes belonged to either of two cellular compartments, the nucleus or the plasma membrane. The majority of nuclear proteins (40/51 orthogroups) had transcription factor activity, with various domains for DNA binding (Figure 3B). Although C₂H₂ poly-ZF proteins are particularly enriched (Figure 2—figure supplement 1, Supplementary file 1–Supplementary Table 8), we also found several transcription factors with homeobox and basic helix-loop-helix (bHLH) domains (Figure 3B; Figure 2). The latter factors are important for regulatory processes during embryogenesis such as neurogenesis, myogenesis, and positional specification along the body axis (Supplementary file 1–Supplementary Table 10). For example, we found the bHLH domain-containing transcription factor MyoD, the master regulator for muscle cell specification in vertebrates, D. melanogaster, and C. elegans (Tapscott et al., 1988; Michelson et al., 1990; Chen et al., 1994), consistent with the bilaterian origin of mesoderm (Supplementary file 1–Supplementary Table 10, Supplementary file 4). Likewise, at least three conserved regulators of nervous system development and neurotransmission, the Neuronal PAS domain-containing protein 4, the Prospero homeobox protein 2, and the Achaete-scute homologue 2 (Stergiopoulos et al., 2014; Sun and Lin, 2016), emerged in the ancestor of bilaterians (Supplementary file 1–Supplementary Table 10, Supplementary file 4). Finally, two orthogroups with homeobox domain proteins, OG_8634 and OG_4203, contained the central Hox genes Antennapedia and Ultrabithorax (Balavoine et al., 2002; Chourrout et al., 2006). Central Hox genes are absent from non-bilaterian Metazoa despite the existence of anterior and posterior homologues (Ryan et al., 2007). Our screen thus correctly identified central Hox genes as a bilaterian novelty even though homeodomain-containing proteins are difficult to assign (Thomas-Chollier et al., 2010; Hueber et al., 2013).

Figure 3

Subcellular location and molecular function of 157 bilaterian-specific genes.

(A) Graphic representation of a eukaryotic cell with its typical organelles. Numbers in parentheses denote the number of bilaterian-specific orthogroups associated predominantly with a given cellular structure. Graphic drawn after the subcellular location section at uniprot.org. (B) Upper chart: Subcellular location of 157 bilaterian-specific genes. Location data is based on the corresponding human orthologues and colour-matched with the graphics in A. Lower chart: Number and name of transcription factor-associated domains present in the set of 157 bilaterian-specific genes. The 40 orthogroups are a subset of 51 orthogroups associated with the nuclear compartment. In most cases, domain names follow Pfam standards (http://pfam.xfam.org/). (C) Distribution of 84 domains found in 51 orthogroups associated with the nucleus. (D) Distribution of 77 domains found in 49 orthogroups associated with the plasma membrane. (E) Distribution of 39 domains found in 28 orthogroups associated with the cytoplasm. "Other" represents domains found only once in the respective category.

Membrane factors include neural transducers and novel proteins

A heterogeneous set of proteins was mapped to the membrane compartment (Figure 3D). While most of the domains found in 49 orthogroups of this category occurred once or twice, several domains were seen more often, in particular the seven transmembrane receptor domain (7tm; 13×), the leucine-rich repeat (LRR; 5×), the Bestrophin chloride channel (Bestrophin; 3×), and the hormone receptor domain (HRM; 3×). The 7tm domain is characteristic of G-protein-coupled receptors, which will be discussed further below. The LRR domain is a protein binding motif (Kobe and Kajava, 2001) and present in several factors connected to the plasma membrane (Figure 3D) such as LINGO1, SLIT2, or SEMA6C. These LRR domain-containing molecules are crucial for organising neural connectivity and are employed for axon guidance, myelination, and synapse formation (de Wit et al., 2011). Although LRR domain-containing molecules exist in non-bilaterians (Ocampo et al., 2015), it is currently unknown whether they fulfil, in these organisms, a role in nervous system development as observed in flies and vertebrates. Further, several bilaterian-specific orthogroups contained ion channel proteins. For both nervous system function and embryonic development (Moody et al., 1991; Pai et al., 2017), ion channels play important roles as they provide the basis of currents and action potentials across the plasma membrane and are involved in morphogenetic movements and cell shape changes during development (Moody et al., 1991). However, most ion channel proteins seem to predate the origin of metazoans (Jegla et al., 2009), and therefore it is unclear how the identified channel proteins affected bilaterian evolution.

Three orthogroups contained transmembrane proteins for which currently no functional description is available, although expression data for two of these exist: OG_13067 (TM169_HUMAN), OG_26661 (TM74B_HUMAN), and OG_28197 (TM160_HUMAN). Genome-wide studies revealed that CG4596, the Drosophila orthologue of TM169_HUMAN, is expressed in the ventral nerve cord, ventral midline, and in the brain during embryogenesis (Tomancak et al., 2002), similar to a central nervous system-based expression of the mouse orthologue (Supplementary file 1–Supplementary Table 11; Petryszak et al., 2016). Mouse expression data for the transmembrane protein TM160_HUMAN largely overlap with TM169_HUMAN (Supplementary file 1–Supplementary Table 11), but corresponding data from Drosophila are not available, as TM160 is absent from ecdysozoans (Figure 2—figure supplement 2, Supplementary file 1–Supplementary Table 12). Multiple sequence alignments and HMM-HMM searches demonstrate further that these two transmembrane proteins are well conserved across bilaterians (Figure 2—figure supplement 2) and possess a unique sequence profile without similarity to other orthogroups within the opisthokont search space (Supplementary file 1–Supplementary Table 12). Together, these observations suggest that so far uncharacterised proteins with predicted transmembrane domains and distinct structures might have had a function in the nervous system since the Cambrian.

Lineage-specific genes are ubiquitous and contain lineage-specific protein domains

The dataset for this study was designed to capture genes with bilaterian-specific distribution. To explore whether it allows the identification of genes specific for other evolutionary nodes, we determined the number of lineage-specific orthogroups for five successive nodes in two lineages: in the protostome lineage leading to Diptera and in the deuterostome lineage leading to Mammalia. For every node, we counted lineage-specific orthogroups as a function of increasing species coverage. Extending coverage reduced the number of lineage-specific orthogroups, as expected (Figure 4). However, tens to hundreds of lineage-specific orthogroups were still obtained at each individual node under the strict condition of 50% coverage (i.e. at least 50% of the species that belong to the respective node need to be present in orthogroups; Figure 4). HMM-HMM searches and domain scans further suggested that lineage-specific orthogroups for the 10 nodes contain novel domains unique to the respective lineage (for examples, see Figure 4—figure supplement 1 and Supplementary file 1–Supplementary Table 13), as is the case for bilaterian-specific proteins (Figure 2—figure supplement 2, Figure 2—figure supplement 3, Figure 2—figure supplement 4). These findings suggest that the origin of genes and novel protein domains is a robust component of evolution at every examined node and that the faithful identification of these genes is a critical aspect in reconstructing evolutionary history, as exemplified by the recent detection of lineage-specific genes in mammals, mollusks, cnidarians, or arthropods (Milde et al., 2009; Aguilera et al., 2017; Dunwell et al., 2017; Thomas et al., 2020).
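The coverage criterion can be made concrete with a short Python sketch. It uses a deliberately simple definition of lineage specificity (every member species belongs to the lineage) and counts qualifying orthogroups at several coverage thresholds. The function names and the set-based data layout are assumptions for illustration; the actual screen additionally relies on HMM-HMM validation and phylogenetic evaluation, which are not modelled here.

```python
from typing import Dict, Iterable, Set


def is_lineage_specific(og_species: Set[str], lineage: Set[str], min_coverage: float) -> bool:
    """True if all member species belong to `lineage` and the orthogroup
    covers at least `min_coverage` (0..1) of the lineage's species."""
    if not og_species or not og_species <= lineage:
        return False
    return len(og_species) / len(lineage) >= min_coverage


def count_by_coverage(
    ogs: Dict[str, Set[str]], lineage: Set[str], thresholds: Iterable[float]
) -> Dict[float, int]:
    """Number of lineage-specific orthogroups at each coverage threshold."""
    return {
        t: sum(is_lineage_specific(species, lineage, t) for species in ogs.values())
        for t in thresholds
    }
```

Because every orthogroup that passes a strict threshold also passes any looser one, the counts can only stay equal or fall as the required coverage rises, in line with the trend described for Figure 4.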

Figure 4

Distinct lineage-specific genes at subsequent nodes of insect and vertebrate evolution.

Starting from Bilateria (left), a protostome lineage leading to dipterans (upper) and a deuterostome lineage leading to mammals (lower) are shown as a schematic phylogenetic tree. Sister clades to the selected taxa are denoted on short branches in the centre. Each barplot displays the number of lineage-specific orthogroups (y axis) as a function of orthogroup size (x axis) for the selected taxonomic group (Protostomia, Ecdysozoa, Arthropoda etc.). The total species count (within BigWenDB) for each of the eleven taxonomic groups is indicated on top of the corresponding barplots (# Species). The count of lineage-specific genes decreases with growing orthogroup size. A red line denotes the number of orthogroups in which at least 50% of the species of a selected lineage are present. The corresponding number of lineage-specific orthogroups is highlighted in red next to the line.

The Nodal pathway is a bilaterian-specific addition to the TGF-β superfamily and linked to left-right determination and mesoderm formation

Three orthogonal axes—the anterior-posterior, the dorsal-ventral, and the left-right axis—determine body layout in bilaterian animals. One of the signalling systems active in these processes is the Nodal pathway. It belongs to the transforming growth factor β (TGF-β) pathway and is essential for the specification of left-right asymmetry and the induction of mesoderm and endoderm in vertebrates (Shen, 2007). The TGF-β ligands Nodal and Lefty, the co-receptor EGF-CFC, and the transcription factor FoxH1 are components specific to the Nodal pathway (Figure 5—figure supplement 1). In addition, the T-box transcription factor TBR-2/Eomes (T-box brain protein 2/Eomesodermin) is a target of Nodal signalling and critical for mesoderm formation and neural development (Ryan et al., 1996; Arnold et al., 2008).

Distinct phylogenetic distributions have been reported for the Nodal-signalling components. The presence and functional conservation of Nodal itself are well established across deuterostomes (Duboc et al., 2004; Hudson and Yasuo, 2005; Shen, 2007; Röttinger et al., 2015) and lophotrochozoans (Grande et al., 2014; Kenny et al., 2014). In contrast, searches for Lefty orthologues have so far been positive only in deuterostomes (Chen and Schier, 2002; Mita and Fujiwara, 2007; Duboc et al., 2008; Li et al., 2017), but not in Lophotrochozoa (Grande et al., 2014). Similarly, the Nodal coreceptor EGF-CFC has been identified only in deuterostomes (Yan et al., 1999; Ravisankar et al., 2011), and FoxH1 orthologues have been characterised in vertebrates and cephalochordates only (Weisberg et al., 1998; Zhou et al., 1998; Yu et al., 2008; Figure 5A). Nodal-signalling components have not been identified in the protostome model organisms D. melanogaster and C. elegans. Likewise, the T-box factor eomesodermin is absent from these animals but has been described in lophotrochozoans, deuterostomes, and sponges (Maruyama, 2000; Tagawa et al., 2000; Arenas-Mena, 2008; Arnold et al., 2008; Sebé-Pedrós et al., 2013). These findings imply a successive gain of Nodal signalling components along the lineage from the metazoan to the vertebrate ancestor (Figure 5A).

Figure 5

Evolution of the Nodal signalling pathway.

Two consensus phylogenetic trees showing the relationship of major metazoan lineages. The five factors of the Nodal signalling pathway (Nodal, Lefty, EGF-CFC, FoxH1, and Eomes) are displayed as coloured boxes. Their phylogenetic distribution and inferred evolutionary origin are mapped onto the tree. Gene births are indicated as coloured boxes above the respective branch. Inferred losses are represented by crosses. Bold labels to the left of a branch indicate branch ancestors: B = Bilateria, Eu = Eumetazoa, M = Metazoa. (A) Previous results regarding the evolution of Nodal pathway genes, as known from the literature. (B) Revised evolutionary history of the Nodal pathway genes according to our results. Note that none of the five factors has been found in arthropods and nematodes. The ecdysozoan boxes for Eomes and FoxH1 are derived from the presence of the genes in a single priapulid species. Grey shading: Hypothetical emergence of a putative kernel for mesoderm specification and neural patterning.

In line with previous findings (Hudson and Yasuo, 2005; Shen, 2007; Grande et al., 2014; Kenny et al., 2014), our analysis revealed that the TGF-β ligand Nodal belongs to a robust bilaterian-specific orthogroup (OG_12210; Figure 5—figure supplement 2, Supplementary file 1–Supplementary Table 14). However, orthogroups of the other Nodal pathway members (Lefty, EGF-CFC, FoxH1, and Eomes) were also bilaterian-specific, and HMM-HMM-based searches against all orthogroups (Supplementary file 1–Supplementary Table 14) as well as phylogenetic analyses supported this result (Figure 5—figure supplement 2, Figure 5—figure supplement 3).

Our clustering results suggested further that the T-box transcription factor Eomes is in fact restricted to bilaterians, contradicting a study that identified Eomes candidates in two poriferan species (Sebé-Pedrós et al., 2013). In BLAST searches, the two poriferan sequences displayed highest similarity to the canonical T-box transcription factors TBX3/4, but not to the T-box containing protein Eomes (Supplementary file 1–Supplementary Table 15). Likewise, phylogenetic analyses failed to confidently assign the poriferan sequences to the Eomes clade (Figure 5—figure supplement 4), and HMM-HMM searches could not detect Eomes-related orthogroups with proteins from sponges or other non-bilaterian animals (Supplementary file 1–Supplementary Table 14). These results consistently argue for a bilaterian origin of the factor, matching the distribution of the other Nodal pathway members (Figure 5B). While our phylogenetic analyses supported orthology clustering results and the monophyly of the Eomes clade, they unexpectedly argued for a metazoan origin of the gene (Figure 5—figure supplement 4). This interpretation would imply independent loss events in the ancestors of three phyla (Cnidaria, Placozoa, and Ctenophora) and in two sponge lineages (see Figure 5A and discussion), while a posited bilaterian-specific origin would be more parsimonious. To finally resolve this issue, more detailed analyses are needed.

Recently, a Nodal-related gene has been identified in the cnidarian Hydra magnipapillata and found to be essential for specifying axial asymmetry along the polyp’s main body axis (Watanabe et al., 2014). In our dataset, H. magnipapillata Nodal-related belongs to a different orthogroup (OG_9136), together with sequences from nine other cnidarians and many deuterostomes. This orthogroup contains, among others, vertebrate GDF-6/7, but no Nodal orthologues. Furthermore, we did not obtain an HMM-HMM reciprocal best hit relationship with the Nodal orthogroup using as query either the entire orthogroup OG_9136 or a subset of cnidarian sequences (Supplementary file 1–Supplementary Table 16), suggesting that Nodal indeed emerged in the bilaterian ancestor as a new member among pre-existing Nodal-related genes.

Taken together, orthology clustering, HMM-HMM comparison, and phylogenetic evidence suggest that all four Nodal-specific pathway components and Eomes are present only in bilaterians (Figure 5B). It is thus possible that these factors co-evolved as an extension of the more ancient TGF-β signalling pathway (Huminiecki et al., 2009; Hinck et al., 2016) and acquired the potential for mesoderm formation and left-right axis determination, two characteristic bilaterian traits. Due to the conservation of this hypothetical gene regulatory network (GRN) since the Cambrian, it could represent an ancient kernel for mesoderm specification and neural patterning. The identification of only a subset of the five factors in non-chordate species (Figure 5B) indicates that Nodal signalling experienced substantial evolutionary turnover, but it does not exclude initial assembly of the pathway in the bilaterian ancestor and subsequent lineage-specific changes.

One consequence of these considerations is that large parts of the Nodal GRN must have been lost early in ecdysozoan evolution, implying the evolution of alternative upstream signalling pathway inputs for axial specification in this group. A second consequence is that genes that originated in the bilaterian ancestor may have been lost in a particular daughter lineage. The widespread loss of genes across metazoans (Richter et al., 2018; Sharma et al., 2018) and the loss of Nodal pathway members (this study) show that such scenarios are conceivable and might impact the exhaustive description of lineage-specific genes, that is, the reconstruction of the "true" evolutionary history of a taxon.

G-protein-coupled receptors and the control of physiological state through circulatory flow

Among the identified bilaterian-specific genes is a set of eight G-protein-coupled receptors (GPCRs), members of a large family of seven-transmembrane domain receptors. While GPCRs are ancient and were already present in the ancestor of bilaterians and fungi (Krishnan et al., 2012), our results indicate that new members of the GPCR family appeared at the bilaterian base. Specifically, robust clustering results and HMM-HMM comparisons place the origin of monoamine neurotransmitter receptors for serotonin, adrenaline, and dopamine at the bilaterian root (Supplementary file 1–Supplementary Table 17, Supplementary file 1–Supplementary Table 18), in line with a recent publication that traced the evolutionary history of adrenergic signalling back to the bilaterian ancestor (Bauknecht and Jékely, 2017). Histochemical, biochemical, and functional data are in conflict with this finding and argue for the presence of serotonin, dopamine, and other small molecule neurotransmitters in cnidarians, the bilaterian sister group (Carlberg and Anctil, 1993; Kass-Simon and Pierobon, 2007; Mayorova and Kosevich, 2013). However, receptors for these molecules could not be identified unambiguously in cnidarians (Anctil, 2009; Bosch et al., 2017), maintaining the possibility that they indeed constitute bilaterian innovations.

There is evidence across several bilaterian phyla (arthropods, nematodes, mollusks, platyhelminthes, vertebrates) that adrenaline, dopamine, and serotonin signalling regulates many important processes such as behaviour, feeding, learning, locomotion, memory, reproduction, reward, or sleep (Ségalat et al., 1995; Berridge, 2004; Suo et al., 2004; Berger et al., 2009; Vidal-Gadea et al., 2011; Burke et al., 2012; El-Shehabi et al., 2012; Ueno et al., 2012). In addition to these "post-embryonic" functions, serotonin is recognised as an important regulator of embryonic development and neuronal circuitry in vertebrates and invertebrates (Brown and Shaver, 1989; Buznikov et al., 2001; Daubert and Condron, 2010). The proposed origin of monoamine neurotransmitter receptors in the bilaterian ancestor (Supplementary file 1–Supplementary Table 17, Supplementary file 1–Supplementary Table 18) and the related functions of monoamine neurotransmitter signalling across phyla suggest that diverse functions of monoamine neurotransmitter signalling already existed in the bilaterian ancestor and could have played a role in the evolution of complex development, brain function, and behaviour. Preliminary evidence indicates that cnidarians, as the bilaterian sister group, do not respond to rewarding or punishing stimuli as do bilaterians (Barron et al., 2010). A link between this behavioural difference and the evolution of monoamine neurotransmitter receptors would comply with the previous notion that the evolution of dopamine-based brain reward systems in bilaterians started from dopamine’s ancient role as a signalling molecule for motor circuits (Barron et al., 2010).

In addition to monoamine neurotransmitter receptors, we detected several peptide hormone receptors in the set of bilaterian-specific GPCRs and could support their bilaterian origin using HMM-HMM searches: the receptors for secretin, corticotropin-releasing factor, neuromedin-U, calcitonin, and somatostatin (Supplementary file 4, Supplementary file 1–Supplementary Table 17, Supplementary file 1–Supplementary Table 18). In vertebrates, these GPCRs and their hormone ligands are part of the endocrine system and regulate basal physiological activities such as feeding, energy homoeostasis, or stress (Budhiraja and Chugh, 2009; Afroze et al., 2013). Homologues of the five receptors and their ligands have also been described in C. elegans and D. melanogaster (Johnson et al., 2005; Cardoso et al., 2006; Melcher et al., 2006; Lindemans et al., 2009; Cardoso et al., 2014; Kunst et al., 2014; Ketchesin et al., 2017), and the putative bilaterian ancestry of some of these signalling systems has been recognised by others, in agreement with our results (Johnson et al., 2005; Lindemans et al., 2009; Mirabeau and Joly, 2013). In contrast to vertebrates or insects, cnidarians and other non-bilaterian Metazoa do not contain specialised endocrine organs and circulatory systems. Thus, our finding of highly conserved peptide hormone receptors supports the view that major physiological regulators evolved in parallel with the emergence of circulatory systems. Moreover, recent evidence indicates that these hormone receptors also act during development and participate in neuronal migration and nervous system formation (Afroze et al., 2013; Liguz-Lecznar et al., 2016; Galas et al., 2017), suggesting an ancient link between the generation of complex nervous systems and the ability to control body functions through circulatory fluid.

Changes in axon guidance accompany bilaterian evolution

Axon guidance, the guided outgrowth of axons and dendrites, is essential for the development of neuronal connections and mediated by two major pathways, the Netrin-DCC and the Slit-Robo (Round-About) pathway (Lowery and Van Vactor, 2009; Evans, 2016). To reveal whether changes in these processes accompanied the evolution of bilaterians, we studied the respective orthogroups. With one exception, all human Netrin paralogues were assigned to a single orthogroup. Its composition and the composition of its HMM-HMM best hit orthogroups support the emergence of Netrins in the ancestor of eumetazoans or earlier (Supplementary file 1–Supplementary Table 19), in line with a description of Netrins in the sea anemone N. vectensis (Putnam et al., 2007). We found a corresponding (eu)metazoan origin for the Netrin receptor DCC (Supplementary file 1–Supplementary Table 19). These results indicate that cnidarians, but not ctenophores, might regulate axon outgrowth at least in part by Netrin-DCC based interactions, consistent with an independent origin of the nervous system in ctenophores (Moroz et al., 2014).

Although orthogroup composition of Slit and its receptor Robo suggested a bilaterian origin of this system, reciprocal HMM-HMM searches indicated the existence of cnidarian Robo orthologues that were assigned to a separate orthogroup, OG_51853 (Supplementary file 1–Supplementary Table 19). Like their bilaterian counterparts, the cnidarian Robo candidates had highly disordered cytoplasmic domains, as revealed by structure predictions of the extracellular and intracellular part of representative sequences (Figure 6). On the other hand, sequence comparisons revealed that the conserved cytoplasmic motif CC1, which is required for binding the Ena/VASP protein Enabled and for transducing signals to the actin cytoskeleton (Bashaw et al., 2000), is altered in cnidarian Robos (Figure 6—figure supplement 1), and that cnidarian Robos displayed several insertions and deletions in the cytoplasmic part when compared with bilaterian Robos (Figure 6—figure supplement 2). It is therefore an open question whether the structural differences in cnidarian Robo-like proteins involve interactions with different downstream partners and whether cnidarian Robos regulate axon growth. Known downstream effectors of Robo signalling, such as Enabled and Son of sevenless, originated early in metazoan evolution (Supplementary file 1–Supplementary Table 20) and could provide in principle the functionality for Robo-based axon guidance, although mediated by a different ligand.

Figure 6 with 3 supplements

Structural predictions of cnidarian and bilaterian Robo proteins.

Top (ex): Predicted structure of the extracellular domain plus transmembrane region of seven selected Robo proteins. Bottom (cp): Predicted structure of the transmembrane region plus cytoplasmic part of seven selected Robo proteins. Robo1 orthologues of two deuterostomes (Hsap = *Homo sapiens*; Spur = *Strongylocentrotus purpuratus*), one lophotrochozoan (Lana = *Lingula anatina*), two ecdysozoans (Dmel = *Drosophila melanogaster*; Tpse = *Trichinella pseudospiralis*), and two cnidarians (Hvul = *Hydra vulgaris*; Spis = *Stylophora pistillata*) were analysed. "% conf" indicates the percentage of residues modelled at >90% confidence. "% dis" indicates the predicted percentage of disordered regions. Bottom right: Schematic outline of the Robo domain structure with five immunoglobulin domains (IG1–IG5) and three fibronectin type III domains (FN3) in the extracellular part and four conserved cytoplasmic motifs (CC0–CC3) in the intracellular part. Like their bilaterian counterparts, cnidarian Robo candidates display a disorganised protein structure in the cytoplasmic part despite differences in structural features (Figure 6—figure supplement 1, Figure 6—figure supplement 2). The extracellular part (top row), on the other hand, is similarly organised across metazoans.

In both Drosophila melanogaster and vertebrates, midline glia cells secrete the Slit protein to prevent Robo expressing axons from crossing the body midline (Rothberg et al., 1990; Brose et al., 1999; Kidd et al., 1999), indicating that a key component in the establishment of bilaterally symmetric nervous systems is shared between protostomes and deuterostomes. However, in our dataset, a single placozoan sequence was assigned to Slit’s otherwise bilaterian-specific orthogroup, shifting its origin back in time. BLAST searches at NCBI verified a reciprocal best hit relationship of the putative placozoan Slit to known Slit proteins, in agreement with our clustering results (Supplementary file 1–Supplementary Table 15). Likewise, placement of the placozoan sequence in phylogenetic analyses is compatible with its orthology to the Slit protein (Figure 6—figure supplement 3). Unexpectedly, HMM-HMM comparisons could not reveal the existence of Slit in other non-bilaterian species such as cnidarians or ctenophores (Supplementary file 1–Supplementary Table 21). From these results, we conclude that Slit and Robo probably originated in the common ancestor of placozoans, cnidarians, and bilaterians. However, the Slit-Robo-based mechanism for midline repulsion during nervous system development appears to be restricted to bilaterians, as placozoans lack a nervous system and cnidarians lack the Slit ligand.

Neurotrophin receptor signalling is a bilaterian innovation

Neurotrophin signalling plays a fundamental role in nervous system generation by regulating many aspects of neuronal development and function, such as neuronal survival, synapse formation, or axon guidance (Huang and Reichardt, 2001; Lu et al., 2005). Vertebrates possess four related neurotrophin ligands and three corresponding transmembrane receptors of the Trk family that each originated from a single ancestral gene in chordates (Benito-Gutiérrez et al., 2005; Hallböök et al., 2006). Once considered a vertebrate innovation, neurotrophins and their receptors have now been found in diverse invertebrates (Wilson, 2009; Kassabov et al., 2013; Lauri et al., 2016). In particular, studies in the mollusk Aplysia californica suggest that neurotrophin signalling and neurotrophin-mediated synaptic plasticity are conserved in protostomes and deuterostomes (Kassabov et al., 2013).

To elucidate the evolutionary origin of neurotrophin signalling, we analysed the orthogroups containing neurotrophins and their receptors. The four vertebrate neurotrophin ligands clustered into two bilaterian-specific orthogroups (OG_14798 and OG_21801) that are each other’s reciprocal best hit. We could not detect orthogroups similar to neurotrophins in non-bilaterian metazoans or additional, so far unidentified neurotrophins in bilaterians (Supplementary file 1–Supplementary Table 22), supporting the emergence of a single neurotrophin gene in the ancestor of bilaterians and its subsequent diversification in vertebrates. When we analysed the evolutionary origin of other neurotrophic factors, we recognised that they also arose in the ancestor of bilaterians or even later (Figure 7; Supplementary file 1–Supplementary Table 22, Supplementary file 1–Supplementary Table 23). The evolutionary age of these additional neurotrophic factors is thus consistent with a bilaterian origin of neurotrophic ligands per se. The same evolutionary scenario is supported by detailed analysis of the Trk receptor family. Although our initial dataset conflated Trk and Wnt pathway receptors due to a shared receptor tyrosine kinase domain, adjustment of the MCL inflation parameter successfully rendered a Trk-only orthogroup, whose taxonomic composition is restricted to bilaterians (Figure 7—figure supplement 1; Supplementary file 1–Supplementary Table 24).

Figure 7 with 1 supplement

The bilaterian-wide distribution of neurotrophic factors.

The NTRK receptor and 14 major neurotrophic factors are displayed as coloured boxes. Their phylogenetic distribution and inferred evolutionary origin are mapped onto the tree (see Supplementary file 1–Supplementary Table 22 and Supplementary file 1–Supplementary Table 23). Gene births are indicated as coloured boxes above the respective branch of the tree (left). Inferred losses are shown as coloured crosses in the matrix. Bold labels to the left of a branch indicate branch ancestors: Ac = Actinopterygii, B = Bilateria, Ch = Chordata, Eu = Eumetazoa, Gn = Gnathostomata, M = Metazoa, Sa = Sarcopterygii. The neurotrophic factors of Cladistia, the sister group of Actinopteri, are inferred and distinguished by a question mark as the dataset lacks species from this lineage.

These results indicate that neurotrophins and their receptors are present across bilaterians and might fulfill conserved functions in neuronal development in these animals. If long-term potentiation and memory formation are regulated by serotonin and its receptors across bilaterians (see, for example, Teixeira et al., 2018), a link between serotonin action and neurotrophin signalling may have emerged in the bilaterian ancestor that contributed to nervous system evolution and the learning-dependent synaptic plasticity characteristic of this group.

Bilaterian-specific factors and the evolution of excretory systems

Protostomes and deuterostomes comprise the taxon Nephrozoa, animals with a dedicated excretory system (sensu Jondelius et al., 2002). Together with their sister group Xenacoelomorpha, Nephrozoa form the taxon Bilateria (Cannon et al., 2016). When we started our study, sequences from Xenacoelomorpha were not available, and therefore our bilaterian-specific gene set is in fact specific for nephrozoans and might contain factors related to kidney and/or nephron development. Indeed, we identified in the 157 bilaterian-specific orthogroups two relevant zinc finger transcription factors. The poly-zinc finger transcription factor Evi1/MECOM was assigned to a large orthogroup with protein members from 108 of 142 bilaterian species (OG_5543). Evi1 is expressed in pronephric tissue of Xenopus and zebrafish embryos and involved in nephron patterning in these species (Mead et al., 2005; Li et al., 2014; Desgrange and Cereghini, 2015), although this might only be a part of its function (Goyama et al., 2008). Secondly, after BLAST searches, maximum likelihood phylogenetic analysis, and HMM-HMM searches focusing on orthogroup OG_5226, we found evidence for a bilaterian-wide distribution of odd-skipped related 1, a zinc finger transcription factor required for heart and urogenital development in vertebrates (Wang et al., 2005; Dressler, 2006; Tena et al., 2007; Supplementary file 1–Supplementary Table 15, Supplementary file 1–Supplementary Table 26; Supplementary file 1–Supplementary Figure 1). Thus, the observed expansion of the zinc finger transcription factor repertoire may also have been important for the evolution and development of excretory organs, a key nephrozoan innovation.

Bilaterian-specific genes form a rich interaction network with interconnected subnetworks

To reveal potential interactions among the 157 bilaterian-specific proteins, we analysed the interaction network of the corresponding human orthologues using the STRING protein-protein interaction (PPI) database. The obtained PPI network contained significantly more interactions than expected by chance (PPI enrichment p-value: $5.93 \times 10^{-14}$), revealing that bilaterian-specific genes form a dense network in which about 50% of the factors (83 distinct factors) are connected to one another (Figure 8A). These interactions form several subnetworks involved in regulating key aspects of bilaterian development, such as chromatin organisation and transcriptional regulation (subnetwork A), myogenesis (subnetwork B), mesoderm formation and left-right asymmetry (the Nodal pathway, subnetwork C: see also Figure 8B), neurogenesis (subnetwork D), and physiology (subnetwork E). Connections between different subnetworks further suggest that crosstalk between the newly established regulatory subnetworks was an important aspect of bilaterian evolution.
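The filtering step behind such a network can be illustrated with a minimal sketch. It is not the STRING implementation: the edge list and its scores are hypothetical, and only the minimum interaction score of 0.4 is taken from the study (see the legend of Figure 8). Connected components serve here as a crude stand-in for subnetworks, which the study delineates by function rather than by topology alone.

```python
from collections import defaultdict

MIN_SCORE = 0.4  # minimum required interaction score reported for the network


def connected_proteins(edges, min_score=MIN_SCORE):
    """Build an adjacency map from edges that reach the score threshold."""
    adjacency = defaultdict(set)
    for protein_a, protein_b, score in edges:
        if score >= min_score and protein_a != protein_b:
            adjacency[protein_a].add(protein_b)
            adjacency[protein_b].add(protein_a)
    return dict(adjacency)


def subnetworks(adjacency):
    """Split the filtered network into connected components."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical edges: pairings and scores are illustrative only.
example_edges = [
    ("NODAL", "LEFTY1", 0.95),
    ("NODAL", "FOXH1", 0.90),
    ("ROBO1", "SLIT2", 0.99),
    ("NODAL", "SLIT2", 0.15),  # below threshold, edge dropped
    ("CTCF", "MECOM", 0.30),   # below threshold, both proteins stay hidden
]
adjacency = connected_proteins(example_edges)
components = subnetworks(adjacency)
```

Proteins without any edge above the threshold never enter the adjacency map, which mirrors the hiding of non-interacting proteins in the displayed network.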

Figure 8

Protein-protein interaction network of bilaterian-specific proteins.

(A) Uniprot identifiers corresponding to the human orthologues of 150 bilaterian-specific genes (seven OGs had no human orthologues) were uploaded to the STRING database, and their mutual interactions were visualised as a network. Parameters for the displayed PPI network were: minimum required interaction score = 0.4; maximum number of interactors to display in 1st and 2nd shell = 0. Thus, only known and predicted interactions between 83 distinct bilaterian-specific proteins are shown (non-interacting proteins are hidden). Evidence for displayed interactions is colour-coded (see legend). Edge length and node placement are arbitrary. Five subnetworks between bilaterian-specific genes are highlighted in red (A-E, see Results). (B) Bilaterian-specific Nodal subnetwork in the context of metazoan genes. The five members of the Nodal pathway are highlighted by shading. (C, D) Boxplots comparing bilaterian- (B) and metazoan-specific (M) proteins in the full network and Nodal subnetwork for the total number of interactions per protein (C), and for the relative fraction of bilaterian interactions per protein (D).

Previous work found that protein network connectivity (number of interactions) increases with gene age (Kim and Marcotte, 2008). To analyse the degree of connectivity of our bilaterian network, we compared it to a PPI network generated from metazoan-specific proteins that is expected to show higher connectivity due to the proteins’ more ancient origin. Our orthology clustering data identified 797 metazoan-specific proteins (>5× as many proteins as in the bilaterian dataset), and the combined bilaterian-metazoan PPI network comprised 2,531 interactions among 823 proteins (16% bilaterian-specific proteins, 84% metazoan-specific proteins). Contrary to this expectation, we obtained a slightly higher level of connectivity for the younger, bilaterian proteins (Figure 8C: total number of interactions per protein, median ± median absolute deviation (MAD): 5 ± 4.62 for Bilateria, 4 ± 4.16 for Metazoa; Mann-Whitney U test: U = 39792, p = 0.0135). Furthermore, bilaterian-specific proteins preferentially interacted with one another, with over twice as many bilaterian-bilaterian interactions as would be expected by chance ($\chi^{2}$ statistic = 24.814, p = 0.000001), primarily due to fewer bilaterian-metazoan interactions than would be expected. This is also evident at the level of individual proteins: bilaterian-specific proteins have significantly more bilaterian interaction partners (Figure 8D: percent of bilaterian interactions, median ± MAD: $19.5 \pm 23.2$ for Bilateria, $0.0 \pm 16.1$ for Metazoa; Mann-Whitney $U = 32231$, $p < 0.00001$).
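The chance expectation behind this comparison can be made explicit with a small sketch. It assumes, as a simplification that the study does not state, that "expected by chance" means drawing protein pairs uniformly at random from the network. The bilaterian node count of 132 is derived here from the reported 16% of 823 proteins and is therefore approximate, and no observed edge counts from the study are reproduced; the function names are illustrative.

```python
from math import comb


def expected_edge_fractions(n_bilaterian, n_total):
    """Fractions of B-B, B-M and M-M pairs among all possible protein pairs."""
    n_metazoan = n_total - n_bilaterian
    all_pairs = comb(n_total, 2)
    return {
        "B-B": comb(n_bilaterian, 2) / all_pairs,
        "B-M": n_bilaterian * n_metazoan / all_pairs,
        "M-M": comb(n_metazoan, 2) / all_pairs,
    }


def chi_square_statistic(observed, expected_fractions):
    """Pearson chi-square statistic for observed edge counts per class."""
    total = sum(observed.values())
    statistic = 0.0
    for edge_class, fraction in expected_fractions.items():
        expected = total * fraction
        statistic += (observed[edge_class] - expected) ** 2 / expected
    return statistic


# Assumed node counts: 132 bilaterian-specific proteins out of 823 (about 16%).
fractions = expected_edge_fractions(n_bilaterian=132, n_total=823)

# Expected number of bilaterian-bilaterian edges among 2,531 interactions.
expected_bb_edges = fractions["B-B"] * 2531
```

Under these assumptions, only about 65 of the 2,531 interactions would be expected to link two bilaterian-specific proteins, which gives a sense of scale for the reported enrichment. The test used in the study may be set up differently, so the sketch is not meant to reproduce the reported statistic.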

As we identify the Nodal pathway as a key bilaterian innovation (Figure 5, Figure 8A: subnetwork C), we focused on this subnetwork as a case study for further analysis of molecular interactions. Within the full bilaterian-metazoan PPI network, we indeed recovered the Nodal pathway as a bilaterian-specific subnetwork, embedded among connections to additional bilaterian and metazoan proteins (Figure 8B). As with the full network, for this subnetwork we found a significant number of bilaterian-specific protein interactions (Figure 8D; Kruskal-Wallis $\chi^{2}$ = 62.855, degrees of freedom = 3, p = $1.44 \times 10^{-13}$). Furthermore, for this subnetwork, we found support for the hypothesis that older (metazoan) genes have higher connectivity (Figure 8C; Kim and Marcotte, 2008). Notably, metazoan-specific proteins that participate in the Nodal subnetwork are a non-representative subset, showing significantly higher overall connectivity and bilaterian-specific connectivity than metazoan proteins in the full bilaterian-metazoan PPI network. Thus, it may be that older genes have higher connectivity if they exceed a minimum threshold of connectivity (number of interactions). For example, the Nodal subnetwork includes Smad4, a metazoan-specific protein with the highest connectivity (46 interactions) of any protein in our combined network. This multifunctional BMP pathway component likely exemplifies two evolutionary trends: that highly connected genes are most likely to acquire new interaction partners, and that bilaterian-specific PPI innovations build on more ancient, preexisting PPI networks by co-option.

Extrapolating these findings to interactions with additional factors of more ancient origin implies that the evolution of new genes in the bilaterian ancestor affected a large number of processes in animal biology.

Discussion

An R-based OrthoMCL pipeline for processing large datasets

Explaining the sudden emergence of bilaterally symmetric animals during the Cambrian is a central problem in evolutionary biology. Complicated by the uneven coverage of the metazoan tree with sequence information, a systematic approach to identify the genetic basis for the evolution of bilaterians was missing. In this study, we present a comparative genomics approach, designed to provide maximum resolution at the bilaterian/non-bilaterian divergence and therefore uniquely suited to discover bilaterian-specific genes.

Although sequence data for individual species in our study might be incomplete (Supplementary file 1–Supplementary Table 1, Supplementary file 1–Supplementary Table 2), each important taxonomic group (Deuterostomia, Ecdysozoa, Lophotrochozoa, and "non-Bilateria") is represented with several well-annotated genomes and/or proteomes (Figure 1—figure supplement 1, Supplementary file 1–Supplementary Table 3). Importantly, sequence data from 19 cnidarian species, including four sequenced genomes and five transcriptomes with CEGMA scores above 70% (Supplementary file 1–Supplementary Table 2), allow the crucial distinction of orthogroups with cnidarian participation from bilaterian-specific orthogroups without cnidarian contribution, a serious problem of existing databases (Table 1).

While other orthology databases might surpass the BigWenDB in species number, this is often due to the integration of many non-metazoan and prokaryotic species (Table 1). Still, the total sequence content of other databases is small enough to be handled by a MySQL engine (see http://www.orthodb.org/v9.1/download/README.MySQL.txt; www.orthomcl.org) because it is restricted to predicted and annotated protein sequences. To accomplish processing of the large amount of sequence data from 25 genomic ORF sets, we developed an R-based version of the OrthoMCL pipeline (Li et al., 2003). It reproduces the results of the original pipeline meticulously (Supplementary file 1–Supplementary Table 4) and is capable of processing at least 125 million sequences with current computer hardware, considerably extending the limit imposed by conventional MySQL usage. In view of the ongoing increase in sequence data, the R-based version of OrthoMCL may prove valuable for generating large and comprehensive orthology datasets in the future.

Importantly, scaling up the orthology engine to handle larger datasets did not come at the expense of clustering quality. Rather, the combination of a comprehensive dataset and a scalable orthology prediction tool proved beneficial, challenging an early study that found a high false-positive rate when testing OrthoMCL on a small and taxonomically restricted dataset (Chen et al., 2007). This advantage of our approach is further demonstrated by correct orthology inference rates that surpass those previously obtained in the orthobench comparisons (Trachana et al., 2011; Supplementary file 3).

Reciprocal HMM-HMM comparisons for improving orthogroup completeness

Despite the existence of many orthology detection methods (Tekaia, 2016), current tools do not evaluate orthogroup composition after clustering. In contrast, we implemented filtering steps to first identify widely distributed bilaterian-specific orthogroups. We then applied to the resulting orthogroups extensive procedures for quality control and error correction, taking into account the taxonomic composition of orthogroups and their best hits in HMM-HMM searches. In this context, we developed a new reciprocal HMM-HMM comparison step to evaluate orthogroup completeness because reliable orthogroups are a prerequisite for inferring the evolutionary age of the corresponding gene (Supplementary file 1–Supplementary Table 7). Although HMMs generated from orthogroup alignments can be uninformative outside conserved regions, they capture important amino acid positions and their spacing and variability, and therefore the individual profile of an orthogroup even within common functional domains such as zinc fingers (Supplementary file 1–Supplementary Figure 2). Indeed, we observed several instances where HMM-HMM comparisons improved results and affected conclusions, demonstrating the value of this novel step (Supplementary file 1–Supplementary Table 13, Supplementary file 1–Supplementary Table 14, Supplementary file 1–Supplementary Table 16, Supplementary file 1–Supplementary Table 19, Supplementary file 1–Supplementary Table 21, Supplementary file 1–Supplementary Table 22, Supplementary file 1–Supplementary Table 23, Supplementary file 1–Supplementary Table 24).
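The taxonomic side of such a filter can be sketched as follows. This is an illustration under stated assumptions, not the filtering rules used in the study: the requirement of at least two major bilaterian clades (`min_clades=2`) and the function name are hypothetical, and clade labels are assumed to be already attached to each sequence.

```python
# The three major bilaterian groups named in the text.
BILATERIAN_CLADES = {"Deuterostomia", "Ecdysozoa", "Lophotrochozoa"}


def is_candidate_bilaterian_specific(member_clades, min_clades=2):
    """Check the taxonomic composition of one orthogroup.

    member_clades: iterable with the clade label of every member sequence.
    min_clades: hypothetical threshold for "widely distributed".
    """
    clades = set(member_clades)
    if not clades:
        return False
    if clades - BILATERIAN_CLADES:
        # Any member outside Bilateria rules out a bilaterian-specific origin.
        return False
    return len(clades) >= min_clades


wide = is_candidate_bilaterian_specific(["Deuterostomia", "Lophotrochozoa"])
mixed = is_candidate_bilaterian_specific(["Deuterostomia", "Cnidaria"])
narrow = is_candidate_bilaterian_specific(["Deuterostomia", "Deuterostomia"])
```

A composition check of this kind only yields candidates. Whether a candidate is complete still depends on its best hits in HMM-HMM searches, as described in the text.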

In particular, we employed highly sensitive HMM-HMM comparisons to minimise errors caused by low protein traceability, the limitation of the BLAST algorithm to detect orthologous genes in distantly related organisms (Jain et al., 2019; Weisman et al., 2020). This strategy led to the removal of 68 false-positive orthogroups from an initial set of 431 bilaterian-specific orthogroups because they displayed reciprocal best-hit relationships to non-bilaterian orthogroups, indicating a more ancient origin (see Appendix 1: Identification of bilaterian-specific genes). In addition, the broad coverage of bilaterians and non-bilaterians and the evaluation of orthogroup composition by filtering rules minimise errors that may be caused by the low traceability of specific genes or by single taxa with particularly high evolutionary rates.
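The reciprocal best-hit criterion described above can be sketched in a few lines. The sketch assumes that pairwise HMM-HMM scores are already available as a nested dictionary in which higher scores mean greater similarity; the orthogroup identifiers, the scores, and the function names are hypothetical and do not correspond to the output format of any particular tool.

```python
def best_hit(hits):
    """Return the highest-scoring target of one query, or None without hits."""
    return max(hits, key=hits.get) if hits else None


def flag_ancient_origin(candidates, scores, non_bilaterian_groups):
    """Flag candidates whose reciprocal best hit is a non-bilaterian orthogroup.

    scores: dict mapping each query orthogroup to {target orthogroup: score}.
    """
    flagged = set()
    for group in candidates:
        partner = best_hit(scores.get(group, {}))
        if partner not in non_bilaterian_groups:
            continue
        # Reciprocity: the partner's own best hit must point back to the query.
        if best_hit(scores.get(partner, {})) == group:
            flagged.add(group)
    return flagged


# Hypothetical scores between three orthogroups.
example_scores = {
    "OG_bilat_1": {"OG_nonbilat_1": 120.0, "OG_bilat_2": 80.0},
    "OG_bilat_2": {"OG_nonbilat_1": 90.0},
    "OG_nonbilat_1": {"OG_bilat_1": 115.0, "OG_bilat_2": 85.0},
}
flagged = flag_ancient_origin(
    candidates={"OG_bilat_1", "OG_bilat_2"},
    scores=example_scores,
    non_bilaterian_groups={"OG_nonbilat_1"},
)
```

In this toy example only `OG_bilat_1` is flagged: `OG_bilat_2` also hits the non-bilaterian group best, but that hit is not returned, so the relationship is not reciprocal.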

Limitations of our orthology clustering pipeline

Our methods for error correction facilitate the detection of reliable lineage-specific gene sets and may serve as a future standard. However, developing software that automatically detects erroneous orthogroup assignments and merges or splits orthogroups in light of the underlying phylogeny would further improve orthogroup assignments. That lineage-specific genes exist and can directly change an animal’s phenotype to gain access to new ecological niches has been shown recently, illustrating the importance of these genes and the need for their identification (Dunwell et al., 2017; Santos et al., 2017; Luis Villanueva-Cañas et al., 2017).

Although we obtained a robust set of 157 genes that evolved in the bilaterian ancestor or, more specifically, in the ancestor of protostomes and deuterostomes (Nephrozoa) (Jondelius et al., 2002), by design our study is limited to protein coding sequences. It will therefore miss the possible involvement of RNA genes in bilaterian evolution, including miRNAs (micro RNAs) and lncRNAs (long non-coding RNAs), as suggested by Prochnik et al., 2007. It will further fail to detect changes in cis-regulatory regions and structural alterations or epigenetic changes, additional factors that affect evolutionary processes (Carroll, 1995; Prud'homme et al., 2006; Klironomos et al., 2013; Feulner and De-Kayne, 2017). Despite these limitations, our study successfully corroborated the bilaterian origin of several previously known bilaterian-specific genes, such as the chromatin organiser CTCF (Heger et al., 2012), the left-right determination factor Nodal (Grande et al., 2014), and central Hox genes (Finnerty and Martindale, 1999; Hueber et al., 2013).

Challenges in reconciling orthogroups and phylogenetic trees

Orthology clustering is a distinct method from phylogenetic tree building, and when we used phylogenetic analyses to validate orthogroup composition, we experienced difficulties in reconciling the two approaches.

First, we consistently obtained high branch support for bilaterian-specific orthogroups as discrete clades. Yet within orthogroups, phylogenetic resolution was often weak, with low branch support and gene tree–species tree discordance. However, tree discordance in itself does not argue against orthology because phylogenies suffer from various problems, such as the inclusion of problematic sequences, little phylogenetic information, or, in our case, the presence of short ORF fragments (Aguileta et al., 2008; Som, 2015). While our ORF data aid the recognition of distinct orthogroups by avoiding systematic annotation errors from external databases and by providing essential taxonomic coverage, these sequences do not represent full-length proteins and may curtail within-orthogroup resolving power.

In addition, in several cases we obtained tree topologies that could imply orthogroup origin in the metazoan ancestor rather than a later, bilaterian origin (Figure 5—figure supplement 3, Figure 5—figure supplement 4, Figure 7—figure supplement 1). One major confounding factor for correct tree reconstruction is heterotachy: a non-constant rate of evolution among different lineages (Lopez et al., 2002; Wu and Susko, 2011; Jayaswal et al., 2014). Importantly, heterotachy is often observed along the branches originating from a gene duplication event (Kondrashov et al., 2002; Conant and Wagner, 2003; He and Zhang, 2005; Steinke et al., 2006). Accelerated evolution in bilaterian-specific duplicates could therefore explain the observed tree topologies and the discrepancy between trees and clustering results. In contrast, the alternative interpretation of metazoan orthogroup origins would require that one of the two duplicates was secondarily lost in the stem lineages of sponges, ctenophores, placozoans, and cnidarians because it is absent from all available samples from these phyla. Gene loss is increasingly recognised as a widespread and important evolutionary mechanism (Sharma et al., 2018; Hecker et al., 2019; Thomas et al., 2020). However, the loss of a number of genes in the stem lineages of four independent phyla would imply strong selective pressure against their presence in non-bilaterian lineages, creating an aspect of deep evolution worthy of future exploration.

A robust association between bilaterian-specific genes and key morphological features

Several morphological features are widely considered key bilaterian innovations: (i) a third germ layer, the mesoderm; (ii) a complex bilateral nervous system; (iii) a Hox gene cluster with at least seven anterior, posterior, and central Hox genes; (iv) a through gut; (v) an excretory system; (vi) the possession of many different cell types; and (vii) bilateral symmetry (Baguñà et al., 2008 and references therein). It was unknown so far whether, and if so which, genetic factors contributed to the emergence of these innovations. From the results presented here, we conclude that a considerable fraction of the identified 157 bilaterian-specific genes is associated with the origin of characteristic bilaterian traits. Although correlations cannot prove a causal relationship, in the absence of ancestral genetic information our inferences from extant animals offer a fruitful approach. Here, we elaborate on several instances where the origin of proteins and bilaterian traits appear to coincide.

For example, a large portion of the 157 genes is involved in nervous system development and/or maintenance (Supplementary file 4). Several factors in this category provide functionalities absent from non-bilaterian metazoans, such as the long-range control of behaviour and physiological state through an expanded repertoire of GPCRs (Supplementary file 1–Supplementary Table 17, Supplementary file 1–Supplementary Table 18), a midline repulsion mechanism for the establishment of a bilateral nervous system (Robo-Slit; Figure 6—figure supplement 3; Supplementary file 1–Supplementary Table 19, Supplementary file 1–Supplementary Table 21), or mechanisms for sophisticated axon guidance and synaptic plasticity (neurotrophin signalling system; Figure 7; Supplementary file 1–Supplementary Table 22, Supplementary file 1–Supplementary Table 23, Supplementary file 1–Supplementary Table 24). These findings are consistent with the convergent evolution of muscle and nerve cells in ctenophores (Moroz et al., 2014) and suggest that bilaterians have a common genetic basis for nervous system patterning despite the recently proposed scenario of convergent evolution of bilaterian nerve cords (Martín-Durán et al., 2018). The importance of the nervous-system-related category of bilaterian-specific genes is further underscored by the identification of various transcription factors with a well supported role in nervous system development across phyla, for example the Prospero homeobox protein, the Achaete-scute homologue 2, or the neuronal PAS domain-containing protein 4 (Supplementary file 1–Supplementary Table 10, Supplementary file 4). Further, three transmembrane proteins with expression in the nervous system, but unknown function, provide the opportunity to characterise novel factors with nervous system-related function (Supplementary file 1–Supplementary Table 11). 
Together, the factors we found in this category provide fundamental features of bilaterian nervous systems, and their evolutionary origin in the bilaterian ancestor is compatible with observable changes in nervous system development and architecture.

An unexpectedly high number of bilaterian-specific genes have transcription factor activity (Figure 3B; Figure 2). As noted above, these factors are often equipped with multiple C₂H₂ zinc finger domains (Figure 2—figure supplement 1; Supplementary file 1–Supplementary Table 8). Apart from so far uncharacterised proteins, which include ZF64B_HUMAN and ZN236_HUMAN, the expression and developmental roles of bilaterian-specific zinc finger proteins are compatible with prominent functions during early development, such as imaginal disc development (Rotund; St Pierre et al., 2002), modulation of TGF-β signalling (Schnurri; Yao et al., 2006), nephron patterning (Evi1, odd-skipped related 1; Mead et al., 2005; Dressler, 2006; Tena et al., 2007; Li et al., 2014), or the differentiation of cardiac precursor cells at the ventral midline (Castor; Christine and Conlon, 2008). Importantly, the identified transcription factors with homeobox or bHLH domain are involved in the specification of several bilaterian tissues: the mesoderm (MyoD, PRRX1_HUMAN, BHE22_HUMAN), the nervous system (Prospero homeobox protein 2, Achaete-scute homologue 2, FER3L_HUMAN, NPAS4_HUMAN, BHE22_HUMAN, BUN1_DROME), or the intestine (ISX_HUMAN), consistent with a role in the evolution of these characteristic bilaterian traits.

A contiguous cluster of at least seven Hox genes is an ancestral bilaterian feature (Baguñà et al., 2008). A prerequisite for its formation is the existence of anterior, central, and posterior Hox genes. Our results confirm previous findings that placed the origin of central Hox genes to the bilaterian ancestor (Supplementary file 1–Supplementary Table 10), in contrast to evolutionarily older anterior and posterior Hox genes (Finnerty and Martindale, 1999; Hueber et al., 2013). Importantly, Hox gene expression is regulated in part by the chromatin organiser CTCF (Rousseau et al., 2014; Narendra et al., 2015), another bilaterian-specific protein (Heger et al., 2012; Supplementary file 1–Supplementary Table 8; Supplementary file 4). As outlined elsewhere, the evolution of CTCF—and other poly-zinc finger proteins—could have provided a mechanism for the creation and regulation of bilaterian Hox gene clusters, once central Hox genes had been added to the repertoire (Heger et al., 2012).

The emergence of the mesoderm as a third germ layer is one of the most characteristic morphological innovations of bilaterian animals. In contrast to previous work, our findings suggest that several genes and gene networks which provide regulatory inputs to mesodermal patterning arose in the bilaterian ancestor. Specifically, we identified orthologues of all Nodal pathway members across bilaterians, but not outside this clade (Figure 5—figure supplement 1, Figure 5—figure supplement 2, Figure 5—figure supplement 3, Figure 5—figure supplement 4; Supplementary file 1–Supplementary Table 14, Supplementary file 1–Supplementary Table 16). The robust bilaterian-specific distribution of these genes, derived from orthology clustering and HMM-HMM searches, implies that the entire Nodal pathway—and its roles in mesoderm specification and left-right asymmetry—is a bilaterian novelty (Figure 5). Although a reasonable speculation, this is currently not supported for all pathway members by phylogenetic analyses and needs to be tested more thoroughly in the future. Together with the bilaterian specificity of additional modulators and effectors of Nodal and/or TGF-β signalling (BAMBI_HUMAN, VWC2_HUMAN, MECOM_HUMAN, Q24605_DROME; Supplementary file 4), these findings suggest that significant changes in TGF-β signalling occurred in the bilaterian ancestor. In addition to the Nodal pathway, several other genes with key roles in mesoderm formation also originated in the bilaterian ancestor, among them the master regulator of muscle cell specification, MyoD, and the Paired mesoderm homeobox protein 1 (PRRX1_HUMAN; Supplementary file 1–Supplementary Table 10) which regulates the formation of preskeletal condensations from undifferentiated mesenchyme during mouse skeletogenesis (Martin et al., 1995). Taken together, we identified multiple genetic factors essential for the differentiation of mesoderm and mesodermal tissues in bilaterians.

In conclusion, we demonstrate that a considerable number of genes has a bilaterian-specific distribution and probably originated in the bilaterian ancestor. While the function of some of these genes is unknown, many of them participate in the formation of key morphological innovations in extant bilaterians, implying that the evolution of specific genes contributed to the formation of bilaterian body plans.

Materials and methods

Sequence collection and database construction

The sequence repertory for this study was assembled from three parts. Genomic and transcriptomic sequences were collected from the sources listed in Supplementary file 1–Supplementary Table 1, Supplementary file 1–Supplementary Table 3, Supplementary file 2. As third component, selected sequences were downloaded from the NCBI non-redundant protein database.

The 25 genomic sequences were first screened for repetitive sequence content using RepeatMasker V4.0.5 (http://repeatmasker.org) with default parameters. The resulting contigs/scaffolds were translated in all six reading frames into ORFs using the Emboss tool "getorf" (Rice et al., 2000), with a minimum ORF length of 25 AA. Sequences containing strings of "X" characters, a result of translating sequencing gaps and masked repeats, were treated differentially to retain as much information as possible. Sequences with ≥9 "X" in a row were split. After removing the Xs, each flanking region ≥35 valid amino acids was kept and given a new identifier while smaller flanking regions were discarded. These measures decreased sequence count by 46.8%, from 324,788,561 to 172,606,165 ORFs. To further reduce the number of ORFs, we blasted them against a custom database of opisthokont sequences. This database contained all sequences of opisthokont origin as extracted from the non-redundant protein database at GenBank, release 198 from 21 October 2013 (2,695,641 sequences). We kept ORFs with a BLAST expectation value <10 against this database and thus rejected ORFs that have no detectable similarity to the protein repertoire of opisthokonts. In a final step, we used CD-HIT (Li and Godzik, 2006) with default parameters and 90% identity threshold to remove redundancy. These steps reduced the number of sequences significantly, from initially 324,788,561 to 109,567,344 genomic ORFs.
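The splitting rule for masked ORFs can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name, the "_partN" identifier suffix, the pass-through of sequences without a long X run, and the deletion of shorter X runs inside a flank are all assumptions made for illustration. Only the two thresholds (≥9 X in a row, ≥35 valid amino acids) are taken from the text.

```python
import re

MIN_X_RUN = 9       # runs of >= 9 "X" trigger a split (value from the text)
MIN_FLANK_LEN = 35  # flanks with fewer valid residues are discarded (value from the text)


def split_masked_orf(seq_id, seq):
    """Split an ORF at long runs of X and keep sufficiently long flanks.

    Returns a list of (identifier, sequence) tuples. The "_partN" suffix
    is a hypothetical naming scheme, not the one used in the study.
    """
    fragments = re.split("X{%d,}" % MIN_X_RUN, seq)
    if len(fragments) == 1:
        # Assumption: sequences without a long X run pass through unchanged.
        return [(seq_id, seq)]
    kept = []
    for index, fragment in enumerate(fragments, start=1):
        # Assumption: shorter X runs inside a flank are simply deleted.
        valid = fragment.replace("X", "")
        if len(valid) >= MIN_FLANK_LEN:
            kept.append((f"{seq_id}_part{index}", valid))
    return kept


# Toy ORF: a 40-residue flank, a masked repeat, a short 10-residue flank,
# a second masked stretch, and a 36-residue flank.
example = "A" * 40 + "X" * 12 + "C" * 10 + "X" * 9 + "D" * 36
for new_id, fragment in split_masked_orf("orf1", example):
    print(new_id, len(fragment))
```

In this toy case the middle flank is dropped because it is shorter than 35 residues, while the two outer flanks are retained under new identifiers.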

To fill in the gaps of public sequence repositories and extend coverage, we collected transcriptome data of poorly represented animal groups (Supplementary file 1–Supplementary Table 1, Supplementary file 2). Downloaded transcriptomes were first assayed for completeness using the CEGMA (Core Eukaryotic Genes Mapping Approach) pipeline which reports the coverage of 248 ultra-conserved core eukaryotic genes present in a dataset (Parra et al., 2007). On the basis of CEGMA completeness and phylogenetic placement, we selected transcriptomes of 64 species for the dataset. Their average transcriptome completeness according to CEGMA was 60.8%, with several bilaterian and non-bilaterian species exceeding 90% (Supplementary file 1–Supplementary Table 2). As described for genomes, transcriptomes were then translated in all six reading frames into ORFs. We kept the three longest ORFs for each transcriptome contig, removed Xs, and obtained 11,768,516 transcriptome protein sequences in total (Table 2).

To provide a backbone of published and annotated protein sequences for the genomic and transcriptomic ORFs, we filtered the NCBI non-redundant protein database and kept 2.9 million protein sequences from 204 opisthokont species that had >8000 sequence entries each. Extraction of opisthokont sequences was guided by NCBI taxonomy.

As the combination of sequences from three sources again introduced redundancy, we clustered the final dataset with 90% identity threshold. In a last pre-processing step, we changed the headers of all sequences to obey a consistent naming scheme. It includes the NCBI taxon identifier and a unique sequence ID that makes it possible to distinguish between NCBI-, ORF-, and transcriptome-derived sequences. The final dataset used for this analysis contained 124,031,501 sequences.

Orthology pipeline and clustering

For orthology clustering, we employed the OrthoMCL pipeline (Li et al., 2003). It utilises a graph-based clustering approach for the generation of orthologous groups on the basis of normalised BLAST similarity measurements between sequence pairs. To enable the processing of our large dataset, we ported to the statistical programming environment R (https://www.r-project.org/) all steps of the original OrthoMCL pipeline that require interaction with a MySQL database. In this way, loading of the database and inference of orthology tables is limited only by the size of the computer’s main memory, not by the speed and additional memory requirements of the underlying MySQL engine, as in the original implementation. By dividing the computation of orthology tables into an appropriate number of steps, our entire dataset could be processed on a compute server with 250 GB memory. Importantly, the R version of OrthoMCL accurately reproduces all steps of the original pipeline (Supplementary file 1–Supplementary Table 4). The collection of scripts for the R version of OrthoMCL is available at https://github.com/prheger/BigWenDB (Heger, 2020; copy archived at https://github.com/elifesciences-publications/BigWenDB).

HMM-HMM searches and database

We extracted from the BigWenDB sequence collection the individual sequences belonging to each of the 824,605 orthologous groups and calculated 824,605 corresponding multiple sequence alignments using default parameters of the MAFFT v7.304b "einsi" algorithm (Katoh et al., 2005). After converting the alignments into hhm format (hhsearch format for hidden Markov models) with the command "hhmake" and default parameters, we concatenated them to a database that can be searched by hhsearch (parameters in addition to default: "-nodssp -nopred -dbstrlen 100"), according to Söding, 2005. We precomputed HMM-HMM search results for about 20% of orthogroups and issued missing searches on demand. Reciprocal best hit relationships were analysed using custom scripts.
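The logic of a reciprocal best hit analysis can be sketched in a few lines of Python. This is an illustration only and not the custom scripts mentioned above: the input layout (query orthogroup, target orthogroup, score), the orthogroup names, and the use of a single numeric score per hit are assumptions. Two orthogroups form a reciprocal best hit pair when each is the top-scoring non-self hit of the other.

```python
def best_hits(hits):
    """Return the top-scoring non-self target for every query.

    hits is an iterable of (query, target, score) tuples; this layout
    is a hypothetical simplification of parsed search output.
    """
    best = {}
    for query, target, score in hits:
        if query == target:
            continue  # a profile always matches itself best
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(hits):
    """Return sorted pairs (a, b) where a and b are each other's best hit."""
    top = best_hits(hits)
    pairs = {
        tuple(sorted((query, target)))
        for query, target in top.items()
        if top.get(target) == query
    }
    return sorted(pairs)


# Toy input: OG1 and OG2 point at each other, OG3 points at OG1 only.
toy_hits = [
    ("OG1", "OG2", 95.0),
    ("OG1", "OG3", 80.0),
    ("OG2", "OG1", 93.0),
    ("OG3", "OG1", 70.0),
    ("OG3", "OG3", 100.0),
]
print(reciprocal_best_hits(toy_hits))
```

In the context of this study, a candidate bilaterian-specific orthogroup that pairs in this way with a non-bilaterian orthogroup would be flagged as a likely false positive.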

Quality control of clustering and the bilaterian-specific gene set

Quality control of clustering results and the bilaterian-specific gene set was carried out as described in Appendix 1, sections "Cluster evaluation and quality control" and "Identification of bilaterian-specific genes".

Statistical tests for the enrichment of transcription factors

To test whether the bilaterian-specific gene set of 157 orthogroups is enriched for transcription factors, we downloaded as control the human proteome with 20,205 protein sequences from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/ and predicted transcription factors in this dataset using the PfamScan software (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/) with E-value cutoff = 5x10^-05. We then determined the abundance of 10 prevalent DNA-binding domains in the dataset: "Basic; bZIP_2; HLH; HNF-1_N; Homeobox; Hox9_act; HPD; SOBP; THAP; zf-". Corresponding domains were identified in 1,756 of the 20,205 human reference proteins. We then randomly selected 10 sets of 157 genes from the reference set and determined the number of transcription factors (proteins with the above-mentioned domains) in the obtained subsets. While the average number of transcription factors in the 10 control sets was 12.8 ± 4.44, the equally sized bilaterian-specific gene set (157 orthogroups) had 37 transcription factors. Modelling a normal distribution from the obtained mean and standard deviation yielded a p-value of 2.512e-08 for the transcription factor content in bilaterian-specific genes (using the R function "pnorm"). Likewise, a Pearson’s $\chi^{2}$ test with the corresponding data matrix (1,756:20,205; 37:157), using the R function "chisq.test", yielded a p-value of 3.805e-08. Finally, under the assumption of a binomial distribution (R function "pbinom") and given that there are 1,756 transcription factors in 20,205 human proteins, the probability that we obtain 37 or more transcription factors when drawing 157 random proteins is p < 1.841e-08.
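The normal and binomial tail probabilities can be recomputed from the numbers given above with a short Python sketch that uses only the standard library in place of the R functions "pnorm" and "pbinom". The function names are ours, and the binomial tail is evaluated here at the observed count of 37 transcription factors. Under these inputs both tails come out on the order of 10^-8. The Pearson test is not reproduced because its exact result depends on how the data matrix was passed to "chisq.test".

```python
import math


def normal_upper_tail(x, mean, sd):
    """P(X >= x) for a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


# Values taken from the text.
observed_tfs = 37                # transcription factors among 157 orthogroups
set_size = 157
control_mean, control_sd = 12.8, 4.44
background_rate = 1756 / 20205   # transcription factors in the human proteome

p_norm = normal_upper_tail(observed_tfs, control_mean, control_sd)
p_binom = binomial_upper_tail(observed_tfs, set_size, background_rate)

print(f"normal tail:   {p_norm:.3e}")
print(f"binomial tail: {p_binom:.3e}")
```

The exact summation is affordable here because the set contains only 157 proteins; for much larger sets a library routine would be the more practical choice.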

Poly-Zinc finger scan across Opisthokonta

We downloaded the proteomes of 7 ecdysozoan, 5 lophotrochozoan, 12 deuterostomian, and 4 non-bilaterian species from http://www.uniprot.org/proteomes. On average, each proteome consisted of 28,772 sequences. We scanned all protein sequences for the presence of protein domains using the PfamScan software (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/) with E-value cutoff = 5x10^-05 and Pfam database version 31.0. Using command line tools, we identified C₂H₂ zinc finger proteins in the PfamScan output and counted for every proteome the number of proteins with six or more zinc finger domains. The resulting numbers were used to plot Figure 2—figure supplement 1A,B.
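The counting step can be sketched as follows in Python, although the original analysis used command line tools. The sketch assumes that the PfamScan output has already been reduced to (protein identifier, Pfam family name) pairs, one per domain hit, and that C₂H₂ zinc finger families can be recognised by the name prefixes listed in the code. Both the input layout and the prefix list are assumptions; only the threshold of six domains comes from the text.

```python
from collections import Counter

# Assumption: family names starting with these prefixes are C2H2 zinc fingers.
C2H2_PREFIXES = ("zf-C2H2", "zf-H2C2")
MIN_DOMAINS = 6  # "six or more zinc finger domains" (value from the text)


def count_poly_zf(domain_hits, min_domains=MIN_DOMAINS):
    """Count proteins carrying at least min_domains C2H2 zinc finger hits.

    domain_hits is an iterable of (protein_id, pfam_family_name) pairs,
    a hypothetical reduction of the PfamScan output table.
    """
    per_protein = Counter(
        protein_id
        for protein_id, family in domain_hits
        if family.startswith(C2H2_PREFIXES)
    )
    return sum(1 for n in per_protein.values() if n >= min_domains)


# Toy proteome: P1 has six fingers, P2 has five plus a homeobox, P3 has seven.
toy_proteome = (
    [("P1", "zf-C2H2")] * 6
    + [("P2", "zf-C2H2")] * 5
    + [("P2", "Homeobox")]
    + [("P3", "zf-H2C2_2")] * 7
)
print(count_poly_zf(toy_proteome))
```

Running the function once per proteome yields the per-species numbers of poly-ZF proteins that would enter a plot of the kind described above.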

To determine the number of poly-ZF proteins that originated in the ancestor of opisthokonts, metazoans, and eumetazoans, we first extracted from the clustering results orthogroups specific for these lineages. The filtering criteria for selecting opisthokont-specific orthogroups were: Fungi ≥ 20 species, Metazoa ≥ 40 species, Bilateria ≥ 30 species and yielded 2,928 orthogroups of ancient origin. The filtering criteria for selecting metazoan-specific orthogroups were identical, except that no fungi were allowed, and yielded 2,615 metazoan-specific orthogroups. For eumetazoan-specific orthogroups we required the presence of at least 30 bilaterian and 3 cnidarian species, with not more than 2 ctenophore species allowed (according to NCBI taxonomy, both ctenophores and cnidarians misleadingly belong to eumetazoans). Applying these conditions, we obtained 283 eumetazoan-specific orthogroups. Next, we extracted the longest sequence of each opisthokont-, metazoan-, and eumetazoan-specific orthogroup and scanned it with PfamScan (E-value cutoff = 5x10^-05). Finally, we counted the number of poly-ZF sequences with at least six domains for each node and mapped the numbers to a phylogeny. Note that this "simple" filtering strategy (Bilateria: ≥ 30 species) would return 662 bilaterian-specific orthogroups, considerably more than the 157 error-corrected orthogroups in the final dataset. The strategy therefore possibly overestimates the number of poly-ZF proteins at the three ancient nodes.
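The species-count criteria above can be written down as a small Python function. The sketch below is hypothetical: the dictionary layout and the rule labels are ours, and it encodes only the thresholds stated in the text. As written, the rules are not mutually exclusive, so a real pipeline would presumably combine them with the inferred orthogroup ancestor.

```python
def matching_rules(counts):
    """Return the set of lineage rules an orthogroup satisfies.

    counts maps a taxon name to the number of species of that taxon
    present in the orthogroup; missing taxa count as zero.
    """
    fungi = counts.get("Fungi", 0)
    metazoa = counts.get("Metazoa", 0)
    bilateria = counts.get("Bilateria", 0)
    cnidaria = counts.get("Cnidaria", 0)
    ctenophora = counts.get("Ctenophora", 0)

    # Shared requirement of the opisthokont and metazoan rules.
    broad_animal_sampling = metazoa >= 40 and bilateria >= 30

    rules = set()
    if fungi >= 20 and broad_animal_sampling:
        rules.add("opisthokont")
    if fungi == 0 and broad_animal_sampling:
        rules.add("metazoan")
    if bilateria >= 30 and cnidaria >= 3 and ctenophora <= 2:
        rules.add("eumetazoan")
    return rules


print(matching_rules({"Fungi": 25, "Metazoa": 60, "Bilateria": 45}))
print(matching_rules({"Metazoa": 35, "Bilateria": 31, "Cnidaria": 3, "Ctenophora": 3}))
```

The second call returns an empty set because three ctenophore species exceed the allowed maximum of two and the orthogroup covers fewer than 40 metazoan species.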

Determining orthogroup ancestors

To determine the ancestor of the species combined in a given orthogroup, we wrote a custom Perl script that extracts the taxonomic identifiers of each sequence and then determines the last common ancestor of all represented species on the basis of NCBI taxonomy and lineage information (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/). The script generates output that can be parsed and filtered using command line utilities. It is part of the collection of R scripts at https://github.com/prheger/BigWenDB.
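The principle behind the ancestor inference can be shown with a minimal Python sketch. It is not a translation of the Perl script. It assumes that each species has already been resolved to a root-to-tip list of taxon names, and the lineages in the example are hand-written and shortened. The last common ancestor is the deepest taxon at which all lineages still agree.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-tip lineages.

    lineages is a list of lists of taxon names ordered from the root
    towards the species. Returns None when nothing is shared.
    """
    ancestor = None
    if not lineages:
        return ancestor
    # zip stops at the shortest lineage, which is the deepest level
    # at which all lineages can still be compared.
    for taxa_at_level in zip(*lineages):
        if all(taxon == taxa_at_level[0] for taxon in taxa_at_level):
            ancestor = taxa_at_level[0]
        else:
            break
    return ancestor


# Hand-written, shortened lineages used for illustration only.
shared = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Eumetazoa"]
human = shared + ["Bilateria", "Deuterostomia", "Chordata"]
fruit_fly = shared + ["Bilateria", "Protostomia", "Ecdysozoa", "Arthropoda"]
hydra = shared + ["Cnidaria", "Hydrozoa"]

print(last_common_ancestor([human, fruit_fly]))
print(last_common_ancestor([human, fruit_fly, hydra]))
```

With these toy lineages, an orthogroup containing only the two bilaterian species is assigned to Bilateria, whereas adding the cnidarian moves the ancestor back to Eumetazoa.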

Protein domain scans and gene ontology analysis

We applied strict filtering rules to extract bilaterian-, vertebrate-, and arthropod-specific genes from the Markov clustering results (rule for bilaterian-specific orthogroups: deuterostomes ≥ 7, lophotrochozoans ≥ 4 or 0, ecdysozoans ≥ 4 or 0; for arthropod-specific orthogroups: chelicerates ≥ 2, crustaceans ≥ 0, myriapods ≥ 1, insects ≥ 5; for vertebrate-specific orthogroups: ≥40 of 53 gnathostome species). From each lineage-specific orthogroup obtained by filtering, we extracted the longest sequence and scanned it with PfamScan Version 1.5 (Punta et al., 2012) (available at ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/) at an E-value cutoff of e^-05 for the presence of protein domains as classified in PFAM database release 30.0 (release date: 06/16).

To associate the identified protein domains with gene ontology (GO) terms, we utilised the Pfam2GO list at http://geneontology.org/external2go/pfam2go and extracted relevant terms using command line tools. Typically, only a subset of domains was linked to GO terms. We finally created a list with the relative number of identified protein domains and associated gene ontology terms and visualised this list as a word cloud at www.wortwolken.com.
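How domains can be linked to GO terms and tallied is sketched below in Python, although the original analysis relied on command line tools. The parser assumes that each mapping line has the layout "Pfam:ACCESSION NAME > GO:term name ; GO:identifier" and that lines starting with "!" are comments. This layout is our assumption about the mapping file, and the example lines use placeholder accessions and terms rather than real entries.

```python
from collections import Counter


def parse_pfam2go(lines):
    """Map Pfam domain names to lists of (GO identifier, GO term name)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # "!" marks comment lines
            continue
        left, separator, right = line.partition(" > ")
        if not separator:
            continue  # skip lines that do not follow the assumed layout
        domain_name = left.split()[-1]
        term, _, go_id = right.rpartition(" ; ")
        if term.startswith("GO:"):
            term = term[3:]
        mapping.setdefault(domain_name, []).append((go_id, term))
    return mapping


def go_term_counts(domain_names, mapping):
    """Count GO term names over a list of identified domain names."""
    counts = Counter()
    for name in domain_names:
        for _, term in mapping.get(name, []):  # unmapped domains add nothing
            counts[term] += 1
    return counts


# Placeholder lines in the assumed layout, not real Pfam2GO entries.
toy_lines = [
    "! example header",
    "Pfam:PF00000 DomainA > GO:example binding ; GO:0000001",
    "Pfam:PF00000 DomainA > GO:example process ; GO:0000002",
    "Pfam:PF99999 DomainB > GO:example binding ; GO:0000001",
]
toy_mapping = parse_pfam2go(toy_lines)
print(go_term_counts(["DomainA", "DomainA", "DomainB", "DomainC"], toy_mapping))
```

Domains without an entry in the mapping, such as "DomainC" here, contribute no terms, which mirrors the observation that only a subset of domains was linked to GO terms.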

Multiple sequence alignment and phylogenetic analysis

Multiple sequence alignments required for the HMM-HMM database and phylogenetic analyses were carried out using the MAFFT v7.304b "einsi" algorithm with default parameters (Katoh et al., 2005). Large alignments (>200 sequences) were computed using MAFFT v7.304b with high-speed parameters. For phylogeny, we added ingroup and outgroup sequences from the clustered orthogroup sets or from public repositories, as appropriate, and manually removed indels and unalignable regions from the data prior to analysis. In some cases, for example for Lefty, we generated a hidden Markov model of an orthogroup alignment and searched additional transcriptomic datasets not represented in the BigWenDB for potential orthologues. Phylogenetic trees were computed under the maximum likelihood criterion, using IQ-TREE v1.6.10 (Nguyen et al., 2015) with ModelFinder for fast and accurate model selection (Kalyaanamoorthy et al., 2017), ultrafast bootstrap approximation and optimisation (1000 replicates) (Minh et al., 2013), and Shimodaira-Hasegawa-like approximate likelihood ratio test (SH-aLRT) (command line parameters: "-bb 1000 -alrt 1000 -bnni"). Resulting trees were edited with FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/) and Affinity Designer Version 1.72 (https://affinity.serif.com).

Prediction of protein structure

After constructing multiple sequence alignments from cnidarian and bilaterian Robo proteins, we identified the transmembrane region (corresponding to sequence "AFIAGIGAACWIILMVFSIWL" in ROBO1_HUMAN) and generated two subsequences overlapping at this feature. One subsequence spanned the extracellular part of the protein plus the transmembrane domain, the other spanned the transmembrane domain plus the cytoplasmic part. We generated the two fragments for seven exemplary Robo proteins, for the deuterostomes Homo sapiens and Strongylocentrotus purpuratus, the lophotrochozoan Lingula anatina, the ecdysozoans Drosophila melanogaster and Trichinella pseudospiralis, and the two cnidarians Hydra vulgaris and Stylophora pistillata. All fragments were uploaded to the Phyre2 web interface (http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index; Kelley et al., 2015) and analysed with modelling mode "intensive" (complete modelling using multiple templates and ab initio techniques).

Identification of metazoan-specific genes

To obtain a list of genes with metazoan origin, we first blasted 20,205 human genes obtained from uniprot.org against the BigWen database and obtained BLAST hits for 19,322 genes. To reliably map the UniProt queries to orthogroups, we selected queries that had a BLAST hit with high identity (>95%) over at least 100 amino acids. For proteins fulfilling these criteria, we extracted the corresponding orthogroup ID and ancestor, taking into account only orthogroups with at least 75 species to ensure broad sampling. After removing redundancy, we obtained 797 distinct orthogroups of metazoan origin whose human orthologues were used for the stringDB PPI network analysis. A conceptually similar study obtained 1,189 novel metazoan-specific homology groups, which is in reasonable agreement with our result when considering the differences in methodology and datasets (Paps and Holland, 2018).
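The mapping of queries to orthogroups can be sketched in Python as a simple filter. The thresholds (identity >95%, at least 100 aligned amino acids, at least 75 species per orthogroup) are taken from the text, whereas the function name, the tuple layout of the hits, and the choice to keep the first qualifying hit per query are assumptions.

```python
def map_queries_to_orthogroups(hits, species_per_orthogroup,
                               min_identity=95.0, min_length=100,
                               min_species=75):
    """Assign each query to the orthogroup of its first qualifying hit.

    hits is an iterable of (query, orthogroup_id, percent_identity,
    alignment_length) tuples, a hypothetical reduction of BLAST output.
    species_per_orthogroup maps orthogroup IDs to their species counts.
    """
    mapped = {}
    for query, orthogroup, identity, length in hits:
        if (identity > min_identity
                and length >= min_length
                and species_per_orthogroup.get(orthogroup, 0) >= min_species):
            mapped.setdefault(query, orthogroup)  # keep the first qualifying hit
    return mapped


# Toy data with placeholder identifiers.
toy_species = {"OG10": 120, "OG11": 40, "OG12": 90}
toy_hits = [
    ("Q1", "OG10", 99.0, 250),  # passes all criteria
    ("Q2", "OG11", 98.0, 300),  # orthogroup covers too few species
    ("Q3", "OG12", 94.0, 400),  # identity too low
    ("Q4", "OG12", 97.5, 80),   # alignment too short
    ("Q5", "OG10", 96.0, 150),  # passes, same orthogroup as Q1
]
mapped = map_queries_to_orthogroups(toy_hits, toy_species)
distinct_orthogroups = sorted(set(mapped.values()))
print(mapped, distinct_orthogroups)
```

Collapsing the mapped queries to the set of distinct orthogroup IDs corresponds to the redundancy removal step mentioned above.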

Protein-protein interaction network analyses

Protein interaction data were obtained from the STRING database v11.0 of known and predicted protein-protein interactions (PPI; https://string-db.org; Szklarczyk et al., 2017). To construct PPI networks, we first identified the appropriate human orthologues of bilaterian-specific and metazoan-specific orthogroups. We obtained 150 human orthologue IDs for the 157 bilaterian-specific orthogroups and 797 human orthologue IDs for 797 metazoan-specific orthogroups (collected as described above). We uploaded these protein IDs to the STRING browser interface and generated three separate PPI networks, one for bilaterian-specific proteins (B), one for metazoan-specific proteins (M), and a combined network for both taxonomic groups (B + M). The average local clustering coefficients and PPI enrichment p-values we report are based on analyses with default settings, where all evidence types were considered. Further statistical analyses were conducted for the B + M full network and the B + M Nodal-Lefty subnetwork, the latter being defined by the core five bilaterian-specific proteins (Nodal, Lefty, FoxH1, Eomes, and EGF-CFC) and their interaction partners. From the complete list of pairwise protein-protein interactions in the B + M network, data were extracted for the numbers of B – B, M – M, and B – M interactions and assessed by a $\chi^{2}$ test. Additional calculations were made per protein for the total number of interactions and for the proportion of interactions that involve a bilaterian-specific partner. Boxplots for these values display the median, and whiskers represent 1.5× the interquartile range beyond the Q3 (upper) or Q1 (lower) quartile, with outliers omitted for clarity.
Statistical tests involved $\chi^{2}$ tests (https://www.socscistatistics.com/tests/chisquare/default2.aspx, accessed 26 August 2019) and non-parametric comparisons in multigroup (Kruskal-Wallis) and pairwise (Mann-Whitney U) assessments as reported, calculated in R version 3.4.0 and from the Python library scipy.stats (function: mannwhitneyu).
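The extraction of interaction classes from a list of pairwise interactions can be sketched in Python as follows. The function names and the toy protein identifiers are placeholders, and the sketch assumes an undirected edge list in which every interaction appears once and every protein outside the bilaterian-specific set is metazoan-specific.

```python
from collections import Counter


def classify_interactions(edges, bilaterian_ids):
    """Count B-B, B-M, and M-M interactions in an undirected edge list."""
    labels = ("M-M", "B-M", "B-B")  # indexed by number of bilaterian partners
    counts = {label: 0 for label in labels}
    for protein_a, protein_b in edges:
        n_bilaterian = (protein_a in bilaterian_ids) + (protein_b in bilaterian_ids)
        counts[labels[n_bilaterian]] += 1
    return counts


def bilaterian_partner_fraction(edges, bilaterian_ids):
    """Per protein: fraction of its interactions with a bilaterian-specific partner."""
    total = Counter()
    with_bilaterian = Counter()
    for protein_a, protein_b in edges:
        for protein, partner in ((protein_a, protein_b), (protein_b, protein_a)):
            total[protein] += 1
            if partner in bilaterian_ids:
                with_bilaterian[protein] += 1
    return {protein: with_bilaterian[protein] / total[protein] for protein in total}


# Toy network with placeholder identifiers (B = bilaterian-specific, M = metazoan-specific).
toy_bilaterian = {"B1", "B2"}
toy_edges = [("B1", "B2"), ("B1", "M1"), ("M1", "M2"), ("M1", "M3"), ("B2", "M3")]

print(classify_interactions(toy_edges, toy_bilaterian))
print(bilaterian_partner_fraction(toy_edges, toy_bilaterian))
```

The three class counts are the kind of input a $\chi^{2}$ test would take, and the per-protein fractions correspond to the values summarised in the boxplots.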

Data access

The R version of OrthoMCL and a script for inferring orthogroup ancestors are available at https://github.com/prheger/BigWenDB. The sequence dataset used to build the BigWenDB and the final clustering results are available at https://doi.org/10.5061/dryad.4qf7168. Several Supplementary Files with original data and Supplementary Tables are linked to this paper at elifesciences.org.

Appendix 1

Orthology pipeline and clustering

To generate clusters of orthologous proteins from the collected sequence data, we used the OrthoMCL pipeline (Li et al., 2003). OrthoMCL is a graph-based method for orthologue group identification that represents sequences as nodes and their similarities as weighted edges. A normalization step adjusts initial similarity scores to reflect species distance and ensures that edge weights for sequence pairs are comparable between different genomes. Finally, the Markov cluster algorithm (van Dongen, 2000) performs random walks on the normalised graph by simulating transition probabilities of sequences to other nodes, thereby revealing an underlying cluster structure. To create the BLAST similarity table required by OrthoMCL, we performed all-vs-all BLAST searches with 124 million sequences (with default BLAST parameters, except "-outfmt 6"; BLAST version 2.2.28). Roughly one million CPU hours were necessary for this task, running hundreds of jobs in parallel on a high performance computing platform. Merging the individual output files, we obtained a similarity score table of ∼500 GB, containing roughly 6 billion BLAST hits (see Supplementary file 1–Supplementary Table 5). In the original implementation, OrthoMCL loads the BLAST output table into a MySQL database and performs subsequent computations within the relational database. Because of its size, we could not load the BLAST output table into a physical MySQL database. We therefore ported all MySQL processes to the statistical computing environment R to execute them in computer memory. Test experiments, carried out in parallel with our R implementation and the original software, produced identical results, demonstrating that the R version of OrthoMCL accurately reproduces the outcome of the standard pipeline (Supplementary file 1–Supplementary Table 4). 
After obtaining the final table with adjusted pairwise distance information in R, we used Markov clustering (van Dongen, 2000), as in the original protocol, to combine sequences to orthologous groups.

Depending on the origin of compared sequences, OrthoMCL creates three orthologue tables: a table with reciprocal relationships of sequences between different species (orthologue table), a table of within-species relationships (in-paralogues), and a table of co-orthologues with protein pairs that are connected through orthology and in-paralogy. Of 124 million gathered sequences, 122 million had at least one BLAST hit in the database, giving rise to a collection of 6 billion BLAST pairs as raw material for orthology clustering. The OrthoMCL pipeline retained 35 million of these sequences in 806 million pairs of the three orthology tables. Thus, 28.8% of the sequences had enough similarity with other sequences to participate in orthology group construction whereas the majority of input sequences were so remotely related to other sequences that our pipeline could not merge them with a cluster. As expected, artificially generated ORFs represented by far the largest portion of the non-clustered sequences (91.3%).

As we observed that a large in-paralogue table (5.8× larger than the orthologue plus co-orthologue tables for the final dataset) negatively affected the accuracy of the clustering process, we omitted this table in subsequent trials. In the final MCL run, we obtained 824,605 orthologous groups with 6,743,519 distinct sequences derived from 118,499,524 protein pairs (BLAST hits) of the orthologue and co-orthologue tables. Discarding the large in-paralogue table reduced the percentage of clustered sequences from 28.8% to 5.5%, indicating that a considerable number of orthogroups in the larger dataset consisted of paralogues (Supplementary file 1–Supplementary Table 5).

To investigate the properties of orthogroups as old as bilaterians or older, we plotted for these orthogroups the number of species against their proportion of bilaterians (Appendix 1—figure 1). The position and abundance of many data points in the resulting plot are a consequence of dataset composition. For example, (i) the majority of orthogroups are small, leading to an abundance of solid (because of overlap) data points for small orthogroups (Appendix 1—figure 1, left part; Supplementary file 1–Supplementary Table 5); (ii) bilaterians and non-bilaterians including fungi are groups of roughly equal size (142 vs. 131 species), which prevents bilaterian sequences from exceeding a coverage of ∼50% in large orthogroups. Similarly, bilaterian content can hardly fall below 40% to 50% in large orthogroups with more than ∼175 species, giving rise to an arrowhead shape at the right side of the plot (Appendix 1—figure 1); (iii) orthogroups with a bilaterian ancestor have, by definition, a bilaterian content of 100% and are therefore spread as a dotted red line along the top of the plot that fades away in orthogroups with more than 100 species; (iv) orthogroups with a metazoan or eumetazoan ancestor (green and blue) concentrate on the left part of the plot because no more than 33 non-bilaterian metazoans are present in the dataset, restricting orthogroup size. In addition, the low orthogroup density in sectors B2, B3, and C3 suggests that ancient genes that evolved in the ancestor of eumetazoans or earlier and survived in bilaterians are not lost randomly at multiple nodes in the bilaterian tree. Instead, they tend to be maintained across most bilaterian species. It remains to be seen whether this behaviour is specific to bilaterians in this dataset or a general evolutionary pattern.
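The influence of dataset composition on the plot can be made explicit with a short calculation. The sketch below computes the hard limits on the bilaterian fraction of an orthogroup that follow from the species counts given above (142 bilaterians, 131 non-bilaterians); the function name is hypothetical, and the limits are purely combinatorial, so the observed distribution may be considerably narrower.

```python
N_BILATERIA = 142  # bilaterian species in the dataset (from the text)
N_OTHER = 131      # non-bilaterian species, including fungi (from the text)


def bilaterian_fraction_bounds(n_species):
    """Return the (lowest, highest) possible bilaterian fraction
    of an orthogroup that contains n_species different species."""
    if not 1 <= n_species <= N_BILATERIA + N_OTHER:
        raise ValueError("species number outside the range of the dataset")
    # Highest fraction: fill the group with bilaterians first.
    highest = min(n_species, N_BILATERIA) / n_species
    # Lowest fraction: fill the group with non-bilaterians first.
    lowest = max(0, n_species - N_OTHER) / n_species
    return lowest, highest


for n in (50, 142, 175, 273):
    low, high = bilaterian_fraction_bounds(n)
    print(f"{n:3d} species: {low:.1%} to {high:.1%}")
```

Both limits converge on 142/273 (about 52%) as orthogroups approach full species coverage, which is one way to read the arrowhead shape at the right side of the plot.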

Appendix 1—figure 1

General properties of sequence clusters from a bilaterian viewpoint.

(A) The proportion of bilaterians per orthogroup is shown as a function of orthogroup size (in terms of species number) for 207,285 orthogroups that trace back to the four ancestors Bilateria, Eumetazoa, Metazoa, and Opisthokonta. Dot colours indicate the orthogroup ancestor and are printed with 85% transparency to reveal overlaps. (B) Orthogroup count (how often orthogroups of a given size are observed) is displayed as a function of orthogroup size (number of sequences present in an orthogroup). Thirty-four orthogroups with more than 1,000 sequences were omitted. Almost all of these sizes occurred only once.

Cluster evaluation and quality control

In a first approach to verify the accuracy of our clustering results, we employed as an external benchmark a manually curated set of 70 orthologous groups (Trachana et al., 2011), the orthobench dataset (http://eggnog.embl.de/orthobench). For the members of every orthobench family, we determined the corresponding BigWenDB sequence ID and the cluster ID (orthogroup ID) to which this sequence was assigned during clustering. We then analysed how the members of a given orthobench family were distributed among orthogroups in the BigWenDB. We performed such comparisons for two MCL inflation parameters ($I = 1.3$ and $I = 1.4$) and two database sizes (full database and database without paralogue table). The clustering that agreed best with the expected orthobench outcome was the dataset with inflation parameter $I = 1.3$ and without paralogue table (mcl_ortho-coortho_1.3.7; see Supplementary file 3). In this dataset, 46 of 70 protein families were assigned correctly, i.e. in 65.7% of the cases our pipeline combined all members of an orthobench family, as expected, in a single orthogroup. However, BLAST hits that allow correct mapping were not found for all orthobench family members, and some members were mapped to erroneously predicted proteins. In such cases, orthobench members may be linked to an orthogroup different from the rest of the family, giving the impression that several orthogroups exist for this family. According to our estimates, such mapping errors reduce accuracy by at least 5%, suggesting a correct orthology inference rate above 70% for our dataset. In contrast, only 10% to 48% of reference orthogroups were predicted correctly by the methods assessed in the original orthobench comparison (Trachana et al., 2011), indicating that the representative coverage of our dataset considerably improves orthogroup inference quality.
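The benchmark comparison described above can be sketched in a few lines. The example is illustrative only: the family names, sequence IDs, and orthogroup IDs are hypothetical, and the choice to skip members without a mapped sequence is an assumption of this sketch rather than the documented procedure.

```python
def score_families(families, orthogroup_of):
    """Return the reference families whose mapped members all share
    a single orthogroup.

    families:      dict, family name -> list of sequence IDs
    orthogroup_of: dict, sequence ID -> orthogroup ID
                   (sequences that could not be mapped are absent)
    """
    correct = []
    for family, members in families.items():
        # Assumption: members without a mapping are ignored.
        groups = {orthogroup_of[m] for m in members if m in orthogroup_of}
        if len(groups) == 1:
            correct.append(family)
    return correct


# Hypothetical toy data.
families = {
    "famA": ["a1", "a2", "a3"],
    "famB": ["b1", "b2"],
    "famC": ["c1", "c2"],
}
orthogroup_of = {
    "a1": "OG_1", "a2": "OG_1", "a3": "OG_1",  # kept together
    "b1": "OG_2", "b2": "OG_3",                # split over two orthogroups
    "c1": "OG_4",                              # c2 could not be mapped
}

correct = score_families(families, orthogroup_of)
print(correct, f"{len(correct) / len(families):.1%}")

# The figure reported in the text: 46 of 70 families.
print(f"{46 / 70:.1%}")
```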

Evolutionary relationships of homeodomain-containing genes are difficult to trace because of the strong conservation and shortness of the homeodomain (60 AA) (Irvine et al., 1997; Kourakis and Martindale, 2000). To understand how our study deals with these difficulties, we analysed the composition of orthogroups containing NK (Nirenberg-Kim) homeobox genes. Like Hox and ParaHox gene clusters, the NK cluster is a close association of homeobox genes with crucial roles in animal development. It consists of the six genes tinman, bagpipe, ladybird (early and late), C15, and slouch in D. melanogaster. They are all involved in mesodermal patterning (Kim and Nirenberg, 1989; Jagla et al., 2001). Genomic data from vertebrates and the cephalochordate Branchiostoma indicate that the NK cluster is an ancient feature of bilaterians, but has been duplicated and split repeatedly in chordate history, leading to the presence of four dispersed clusters and multiple paralogues of each gene in humans (Luke et al., 2003). Several rearrangements have also been observed in the NK cluster of arthropods (Chan et al., 2015). In addition, studies of the homeodomain gene complement of sponges and cnidarians revealed that NK cluster genes predate the evolution of bilaterians (Ryan et al., 2006; Larroux et al., 2007). Given these findings, we can expect that NK homeobox genes from diverse metazoans (sponges, cnidarians, vertebrates, and insects) are each represented in a single orthogroup. Analysing the orthogroups of all Drosophila and human NK cluster genes revealed that, indeed, bilaterian and non-bilaterian orthologues of the five NK genes were joined in five corresponding groups (Supplementary file 1–Supplementary Table 6). These five orthogroups contained sequences from 81 to 128 (of 142) bilaterian species, including the known Drosophila and human NK genes, as well as sponge, cnidarian, and ctenophore sequences. 
We found placozoan sequences in a single orthogroup, OG_613 (NKX2), consistent with the previously reported existence of NK class homeobox genes in Placozoa (Monteiro et al., 2006). In contrast to other NK genes, Drosophila tinman is not located in the group of its vertebrate counterparts NKX2.3/2.5/2.6 (OG_613; Supplementary file 1–Supplementary Table 6). It has been shown previously that orthology relationships between tinman and vertebrate NKX2 genes are difficult to establish because of the fast-evolving insect tinman genes (Harvey, 1996; Saudemont et al., 2008). In line with these observations, tinman was assigned to a small orthogroup restricted to endopterygote insects (OG_92160), while other putative NKX2 orthologues from a wide range of arthropods (32/37 species) were combined with vertebrate NKX2 genes in orthogroup OG_613.

Consistency between our method and an independent method would further underline the reliability of the inferred orthogroups. We therefore prepared our data for a control run with the orthogroup inference algorithm OrthoFinder that, in contrast to OrthoMCL, takes into account a previously unrecognised gene length bias (Emms and Kelly, 2015). However, the number of pairwise BLAST similarity tables that serve as OrthoFinder’s input increases quadratically with the number of species, and so does the amount of required main memory. With 80 species and 6,320 corresponding BLAST tables, approximately 250 GB of memory are occupied, precluding a run with the full dataset (273 species; 74,256 BLAST tables) on current computers. OrthoFinder thus cannot be used to confirm our data until it is adapted to large datasets, which in turn illustrates the power of our modified version of the OrthoMCL pipeline.
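The table counts quoted above correspond to one table per ordered pair of distinct species, i.e. n × (n − 1). The short sketch below reproduces them; the extrapolation of memory use assumes that memory grows linearly with the number of tables, which is a simplification made here and not a measured value.

```python
def pairwise_table_count(n_species):
    """Number of ordered pairs of distinct species: n * (n - 1)."""
    return n_species * (n_species - 1)


tables_subset = pairwise_table_count(80)   # 6,320 tables
tables_full = pairwise_table_count(273)    # 74,256 tables

# Assumption: memory scales linearly with the number of tables.
ratio = tables_full / tables_subset
estimated_gb = 250 * ratio
print(f"{tables_subset:,} vs. {tables_full:,} tables "
      f"({ratio:.1f}x), roughly {estimated_gb:,.0f} GB")
```

Under this assumption the full dataset would need close to 3 TB of main memory, in line with the conclusion that such a run is not feasible on current computers.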

Taken together, the assessment of clustering quality using a benchmark and a homeobox gene set indicates that orthology prediction in the BigWenDB accurately captures known evolutionary relationships of difficult target genes over large evolutionary distances. We conclude therefore that our cluster results are well suited as raw material for the search of bilaterian-specific genes.

Identification of bilaterian-specific genes

To infer lineage-specific genes, we determined on the basis of NCBI taxonomy (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz) the last common ancestor of the species present in each of the 824,605 orthologous groups of the final clustering. Together with other ancient groups such as Metazoa, Eumetazoa, or Opisthokonta, the taxon Bilateria is among the ten taxa with the highest counts (42,946 bilaterian-specific orthogroups; Supplementary file 1–Supplementary Table 25). While these counts include all orthologue groups that trace back to a given ancestor, the majority of groups contain only a few species (see Figure 4, Appendix 1—figure 1, Supplementary file 1–Supplementary Table 5). To obtain meaningful groups with a broad representation across bilaterians, we required that at least 10% of the species of each bilaterian super-phylum be present (Ecdysozoa ≥ 6, Lophotrochozoa ≥ 4, and Deuterostomia ≥ 7 species). We included orthogroups with zero ecdysozoans or lophotrochozoans if the counts for the two other super-phyla met the 10% threshold, thereby allowing for the loss of bilaterian-specific genes in ecdysozoans or lophotrochozoans. Applying these criteria, we obtained 345 bilaterian-specific groups.
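The two steps described above, ancestor inference and the representation filter, can be sketched as follows. This is a simplified illustration and not the code used in the study: lineages are abbreviated toy versions of NCBI taxonomy paths, the function names are hypothetical, and the filter implements a literal reading of the rule (a super-phylum may be missing only if it has exactly zero species in the group).

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-species lineages."""
    ancestor = None
    for taxa in zip(*lineages):
        if len(set(taxa)) != 1:
            break
        ancestor = taxa[0]
    return ancestor


# Minimum species numbers per super-phylum (10% thresholds from the text).
MIN_SPECIES = {"Ecdysozoa": 6, "Lophotrochozoa": 4, "Deuterostomia": 7}


def broadly_represented(counts):
    """Check the representation rule for a dict super-phylum -> species count."""
    met = {clade: counts.get(clade, 0) >= minimum
           for clade, minimum in MIN_SPECIES.items()}
    if all(met.values()):
        return True
    # Ecdysozoa or Lophotrochozoa may be absent altogether (gene loss),
    # provided the two other super-phyla meet their thresholds.
    for optional in ("Ecdysozoa", "Lophotrochozoa"):
        others_ok = all(ok for clade, ok in met.items() if clade != optional)
        if counts.get(optional, 0) == 0 and others_ok:
            return True
    return False


# Abbreviated toy lineages (assumption: real NCBI paths contain more ranks).
human = ["Eukaryota", "Opisthokonta", "Metazoa", "Eumetazoa", "Bilateria",
         "Deuterostomia", "Homo sapiens"]
fly = ["Eukaryota", "Opisthokonta", "Metazoa", "Eumetazoa", "Bilateria",
       "Protostomia", "Drosophila melanogaster"]
anemone = ["Eukaryota", "Opisthokonta", "Metazoa", "Eumetazoa", "Cnidaria",
           "Nematostella vectensis"]

print(last_common_ancestor([human, fly]))
print(last_common_ancestor([human, fly, anemone]))
print(broadly_represented({"Ecdysozoa": 0, "Lophotrochozoa": 5,
                           "Deuterostomia": 9}))
```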

At least four types of error might impair our set of bilaterian-specific orthologous groups: (1) an orthogroup is judged older than bilaterians, but is in fact bilaterian-specific (orthogroup too large); (2) an orthogroup is inferred to be bilaterian-specific, but is in fact older (orthogroup too small); (3) an orthogroup is found to be bilaterian-specific, but is in fact younger (orthogroup too large); (4) an orthogroup is considered younger than bilaterians, but is in fact bilaterian-specific (orthogroup too small).

The presence of several bilaterian sequences and a single sequence from an earlier branching eukaryote would conceal the potential bilaterian ancestry of an orthogroup (type 1 error). We therefore searched for orthologue groups with broad bilaterian representation, according to the rules described above, and up to two outgroup sequences. Of 349 orthogroups satisfying these criteria, the majority (263 or 75.3%) contained outlier sequences from cnidarians, the sister group of bilaterians. To maximise the likelihood of detecting true outliers, we considered only organisms without direct sister group relationship for further analysis and obtained 86 additional bilaterian-specific candidate groups with one or two non-bilaterian/non-cnidarian sequences. As the probability is high that these orthogroups contain phylogenetically unrelated outliers and actually originated in the bilaterian ancestor, we grouped them, together with the 345 previous orthogroups, in a set of 431 bilaterian-specific orthogroups.
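The selection of candidate groups with tolerated outliers can be expressed as a simple predicate. The sketch below is a hypothetical illustration that assumes each sequence has already been labelled with a clade and that broad bilaterian representation has been checked separately; the labels and the function name are not taken from the original pipeline.

```python
def is_rescue_candidate(sequence_clades, max_outliers=2):
    """Return True if a group contains one or two non-bilaterian sequences,
    none of which comes from Cnidaria, the bilaterian sister group.

    sequence_clades: one clade label per sequence, e.g. "Bilateria",
                     "Cnidaria", "Porifera", "Fungi".
    """
    outliers = [clade for clade in sequence_clades if clade != "Bilateria"]
    if not 1 <= len(outliers) <= max_outliers:
        return False
    return "Cnidaria" not in outliers


print(is_rescue_candidate(["Bilateria"] * 40 + ["Fungi"]))
print(is_rescue_candidate(["Bilateria"] * 40 + ["Cnidaria"]))

# Counts reported in the text.
candidates_with_outliers = 349
with_cnidarian_outliers = 263
rescued = candidates_with_outliers - with_cnidarian_outliers
print(rescued, 345 + rescued)
```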

Type 2 errors can arise if the MCL algorithm does not combine a group with bilaterian ancestry and a group with related sequences from non-bilaterian species, although both groups might represent a single natural orthology group. To identify such errors, we computed multiple sequence alignments for all 824,605 orthogroups and turned them into profile hidden Markov models (HMMs) that describe alignment consensus sequences in a probabilistic way (Eddy, 1998). We then assembled a database from the HMMs and searched for the two most similar profiles for every bilaterian-specific group using sensitive HMM-HMM alignments (Söding, 2005). We devised a new reciprocal HMM-HMM alignment comparison step, analogous to the strategy of reciprocal best BLAST hits (Tatusov et al., 1997; Ward and Moreno-Hagelsieb, 2014), to discover bidirectional best hit orthogroup pairs indicative of common descent. To demonstrate the power of this method, we analysed the orthogroup distribution of two example proteins, Sprouty, an inhibitor of FGF signalling, and the insulator protein GAGA factor. We found that the orthogroups of both D. melanogaster Sprouty and D. melanogaster GAGA factor were smaller than anticipated considering their reported phylogenetic distribution (Matus et al., 2007; Heger et al., 2013). In both cases, the reciprocal best hit strategy allowed us to detect highly similar orthogroups with known Sprouty and GAGA factor orthologues that complemented the original orthogroup. After fusion of query and reciprocal best hit orthogroups, the resulting sequence collections matched the expected phylogenetic coverage (Supplementary file 1–Supplementary Table 7).
Encouraged by these findings, we examined the 431 bilaterian-specific orthogroups accordingly and excluded orthogroups from the list if they satisfied three criteria: (i) their best or second best HMM-HMM hit modifies the ancestor of the resulting fusion group, (ii) their best or second best hit orthogroup is a reciprocal best hit, and (iii) their best or second best hit orthogroup does not contain more than three bilaterian species. The last criterion avoids eliminating orthogroups whose reciprocal best hit is an ancient orthogroup with wide bilaterian representation, an indicator of homology rather than of orthology. The majority of bilaterian-specific orthogroups (84.2% or 363/431 orthogroups) were not affected by this procedure, and we therefore considered them high-confidence bilaterian-specific orthogroups. In contrast, 68 bilaterian-specific orthogroups (15.8%) were possibly false positives and may have originated in pre-bilaterian times.
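The three exclusion criteria can be combined in a small decision function. The sketch below is one possible reading and not the original implementation: it assumes that all three criteria must hold for the same hit, that a hit counts as reciprocal when the query is the top-ranked hit of the target, and that ranked hit lists, ancestor-changing pairs, and bilaterian species counts are available as plain Python containers. All identifiers are hypothetical.

```python
def should_exclude(query, ranked_hits, ancestor_changing, bilaterian_species,
                   max_bilaterians=3):
    """Decide whether a bilaterian-specific orthogroup is a likely type 2 error.

    ranked_hits:        dict, orthogroup -> list of hit orthogroups, best first
    ancestor_changing:  set of (query, hit) pairs whose fusion would change
                        the inferred ancestor
    bilaterian_species: dict, orthogroup -> number of bilaterian species
    """
    for target in ranked_hits.get(query, [])[:2]:  # best and second best hit
        target_hits = ranked_hits.get(target, [])
        reciprocal = bool(target_hits) and target_hits[0] == query
        changes_ancestor = (query, target) in ancestor_changing
        few_bilaterians = bilaterian_species.get(target, 0) <= max_bilaterians
        if reciprocal and changes_ancestor and few_bilaterians:
            return True
    return False


# Hypothetical toy data.
ranked_hits = {
    "OG_A": ["OG_X", "OG_Y"],
    "OG_X": ["OG_A", "OG_Z"],
    "OG_B": ["OG_W", "OG_Y"],
    "OG_W": ["OG_B"],
    "OG_Y": ["OG_Q"],
}
ancestor_changing = {("OG_A", "OG_X"), ("OG_B", "OG_W")}
bilaterian_species = {"OG_X": 0, "OG_W": 95, "OG_Y": 2}

for query in ("OG_A", "OG_B"):
    print(query, should_exclude(query, ranked_hits, ancestor_changing,
                                bilaterian_species))
```

In the toy data, OG_A would be excluded because its reciprocal best hit OG_X is a non-bilaterian group, whereas OG_B would be retained because its reciprocal best hit OG_W has wide bilaterian representation.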

Type 3 errors arise when sequences from a distantly related lineage are erroneously added to an orthogroup that is in fact younger than bilaterians. If, for example, several insect sequences and a single vertebrate sequence populate an orthogroup, a bilaterian ancestor would be computed for this group although, from a phylogenetic point of view, the single vertebrate sequence is more likely an outlier added to the group erroneously. The filtering rules mentioned above require that at least 10% of the species in each super-phylum be present in a group for it to qualify as bilaterian-specific. They effectively prevent type 3 errors in our list of bilaterian-specific orthogroups that are caused by the addition of fewer than four sequences. In contrast, we cannot currently prevent potentially wrong orthology inference if four or more sequences of an ancestor-changing lineage were added erroneously (for example, four ecdysozoan sequences added to an otherwise mammalian-specific orthogroup). However, this error mainly affects small bilaterian-specific orthogroups with only a few sequences from deuterostomes, lophotrochozoans, and/or ecdysozoans, which are poorly representative. Detailed phylogenetic analysis as well as improved taxon sampling would be necessary to discover such false-positive assignments.

Type 4 errors occur if an orthogroup is estimated to be younger than bilaterians but has, by chance, not been joined with another, similar orthogroup that would convert the ancestor to Bilateria if combined with the original group (for example, a vertebrate-specific orthogroup and a highly similar insect-specific orthogroup would create a bilaterian-specific orthogroup). To detect such errors, it is necessary to perform all-vs-all profile comparisons of the orthogroups younger than bilaterians. Next, combinations of similar groups need to be determined that would shift the former individual ancestors to a new common bilaterian ancestor and that are each other’s bidirectional best hit. Because of the high computational cost, we refrained from investigating this error source further in this manuscript.

To further probe the accuracy of the 363 bilaterian-specific orthogroups, we mapped human and D. melanogaster sequences contained in these orthogroups to the respective genome (versions hg38 and dm6) using BLAT (Kent, 2002). Such mapping was possible for 348/363 orthogroups (95.87%). We then checked whether the target gene to which these sequences were assigned belonged to the initial orthogroup. This was not true in a considerable number of cases. For example, bilaterian-specific orthogroups often contained short ORFs from H. sapiens or D. melanogaster that mapped to a particular gene. The corresponding full-length protein, however, was assigned to a different orthogroup with a different ancestor, indicating that the separation of genes into two or more orthogroups affected the integrity of the set of 363 orthogroups. We therefore excluded all orthogroups with potential mapping inconsistencies and arrived at a set of 204 bilaterian-specific genes. As a final validation step, we used BLAST at NCBI (non-redundant GenBank version from May 24, 2017) to search all human or D. melanogaster orthologues present in the 204 bilaterian-specific orthogroups against non-bilaterian metazoans (Metazoa excluding Bilateria and Mesozoa). A reciprocal best hit analysis of the BLAST results indicated that 47 genes, corresponding to 47 orthogroups, might contain orthologues in non-bilaterian species although our orthology prediction pipeline did not detect them. As substantial work is required to confirm or reject these potentially false-positive orthogroups, we removed them from the list and arrived at a final number of 157 orthogroups. These 157 orthogroups represent a minimal set of high-confidence bilaterian-specific orthogroups that is free of most errors present in other orthology databases.
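For orientation, the successive filtering steps of this section can be tallied as follows. The sketch only restates counts given in the text; the step labels are paraphrases, and the number removed in the mapping step is derived by subtraction rather than reported directly.

```python
# Counts taken from the text; each entry is (step description, groups left).
funnel = [
    ("broad representation across super-phyla", 345),
    ("plus groups with one or two non-cnidarian outliers", 345 + 86),
    ("minus groups flagged by reciprocal HMM-HMM hits", 345 + 86 - 68),
    ("after removing mapping inconsistencies", 204),
    ("minus groups with possible non-bilaterian orthologues", 204 - 47),
]

for label, count in funnel:
    print(f"{count:4d}  {label}")

# Derived value: groups removed between the 363 and 204 stages.
removed_by_mapping = funnel[2][1] - funnel[3][1]
print(removed_by_mapping)
```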

Data availability

Accession numbers and/or URLs for previously published transcriptome datasets are listed in Supplementary File 3. Download links for previously published genomic sequences are listed in Supplementary File 1-Supplementary Table S7. Orthology datasets generated in this study have been deposited in Dryad under the URL https://doi.org/10.5061/dryad.4qf7168.

The following data sets were generated

1. Heger P
2. Zheng W
3. Rottmann A
4. Panfilio K
5. Wiehe T
(2019) Dryad Digital Repository
Data from: The genetic factors of bilaterian evolution.

https://doi.org/10.5061/dryad.4qf7168

References

(2017) Topologically associated domains: a successful scaffold for the evolution of gene regulation in animals
Wiley Interdisciplinary Reviews: Developmental Biology 6:e265.

https://doi.org/10.1002/wdev.265
1. Afroze S
2. Meng F
3. Jensen K
4. McDaniel K
5. Rahal K
6. Onori P
7. Gaudio E
8. Alpini G
9. Glaser SS
(2013) The physiological roles of secretin and its receptor
Annals of Translational Medicine 1:29.

https://doi.org/10.3978/j.issn.2305-5839.2012.12.01
(2017) Co-option and de novo gene evolution underlie molluscan shell diversity
Molecular Biology and Evolution 34:779–792.

https://doi.org/10.1093/molbev/msw294
(2008) Assessing the performance of single-copy genes for recovering robust phylogenies
Systematic Biology 57:613–627.

https://doi.org/10.1080/10635150802306527
1. Aguinaldo AM
2. Turbeville JM
3. Linford LS
4. Rivera MC
5. Garey JR
6. Raff RA
7. Lake JA
(1997) Evidence for a clade of nematodes, arthropods and other moulting animals
Nature 387:489–493.

https://doi.org/10.1038/387489a0
1. Anctil M
(2009) Chemical transmission in the sea anemone Nematostella vectensis: a genomic perspective
Comparative Biochemistry and Physiology Part D: Genomics and Proteomics 4:268–289.

https://doi.org/10.1016/j.cbd.2009.07.001
1. Arenas-Mena C
(2008) The transcription factors HeBlimp and HeT-brain of an indirectly developing polychaete suggest ancestral endodermal, gastrulation, and sensory cell-type specification roles
Journal of Experimental Zoology. Part B, Molecular and Developmental Evolution 310:567–576.

https://doi.org/10.1002/jez.b.21225
(2008) Pivotal roles for eomesodermin during axis formation, epithelium-to-mesenchyme transition and endoderm specification in the mouse
Development 135:501–511.

https://doi.org/10.1242/dev.014357
1. Babonis LS
2. Martindale MQ
(2017) Phylogenetic evidence for the modular evolution of metazoan signalling pathways
Philosophical Transactions of the Royal Society B: Biological Sciences 372:20150477.

https://doi.org/10.1098/rstb.2015.0477
(2008) Back in time: a new systematic proposal for the Bilateria
Philosophical Transactions of the Royal Society B: Biological Sciences 363:1481–1491.

https://doi.org/10.1098/rstb.2007.2238
(2002) Hox clusters and bilaterian phylogeny
Molecular Phylogenetics and Evolution 24:366–373.

https://doi.org/10.1016/S1055-7903(02)00237-3
(2010) The roles of dopamine and related compounds in reward-seeking behavior across animal phyla
Frontiers in Behavioral Neuroscience 4:163.

https://doi.org/10.3389/fnbeh.2010.00163
1. Bashaw GJ
2. Kidd T
3. Murray D
4. Pawson T
5. Goodman CS
(2000) Repulsive axon guidance: abelson and enabled play opposing roles downstream of the roundabout receptor
Cell 101:703–715.

https://doi.org/10.1016/s0092-8674(00)80883-1
1. Bauknecht P
2. Jékely G
(2017) Ancient coexistence of norepinephrine, tyramine, and octopamine signaling in bilaterians
BMC Biology 15:6.

https://doi.org/10.1186/s12915-016-0341-7
(2005) The single AmphiTrk receptor highlights increased complexity of neurotrophin signalling in vertebrates and suggests an early role in developing sensory neuroepidermal cells
Development 132:2191–2202.

https://doi.org/10.1242/dev.01803
1. Berger M
2. Gray JA
3. Roth BL
(2009) The expanded biology of serotonin
Annual Review of Medicine 60:355–366.

https://doi.org/10.1146/annurev.med.60.042307.110802
1. Berridge KC
(2004) Motivation concepts in behavioral neuroscience
Physiology & Behavior 81:179–209.

https://doi.org/10.1016/j.physbeh.2004.02.004
(2017) Back to the basics: cnidarians start to fire
Trends in Neurosciences 40:92–105.

https://doi.org/10.1016/j.tins.2016.11.005
1. Brooke NM
2. Holland PW
(2003) The evolution of multicellularity and early animal genomes
Current Opinion in Genetics & Development 13:599–603.

https://doi.org/10.1016/j.gde.2003.09.002
1. Brose K
2. Bland KS
3. Wang KH
4. Arnott D
5. Henzel W
6. Goodman CS
7. Tessier-Lavigne M
8. Kidd T
(1999) Slit proteins bind robo receptors and have an evolutionarily conserved role in repulsive axon guidance
Cell 96:795–806.

https://doi.org/10.1016/S0092-8674(00)80590-5
1. Brown KM
2. Shaver JR
(1989) [3H]Serotonin binding to blastula, gastrula, prism, and pluteus sea urchin embryo cells
Comparative Biochemistry and Physiology Part C: Comparative Pharmacology 93:281–285.

https://doi.org/10.1016/0742-8413(89)90234-X
1. Budd GE
(2008) The earliest fossil record of the animals and its significance
Philosophical Transactions of the Royal Society B: Biological Sciences 363:1425–1434.

https://doi.org/10.1098/rstb.2007.2232
1. Budd GE
2. Jensen S
(2017) The origin of the animals and a 'Savannah' hypothesis for early bilaterian evolution
Biological Reviews 92:446–473.

https://doi.org/10.1111/brv.12239
1. Budhiraja S
2. Chugh A
(2009) Neuromedin U: physiology, pharmacology and therapeutic potential
Fundamental & Clinical Pharmacology 23:149–157.

https://doi.org/10.1111/j.1472-8206.2009.00667.x
1. Burke CJ
2. Huetteroth W
3. Owald D
4. Perisse E
5. Krashes MJ
6. Das G
7. Gohl D
8. Silies M
9. Certel S
10. Waddell S
(2012) Layered reward signalling through octopamine and dopamine in Drosophila
Nature 492:433–437.

https://doi.org/10.1038/nature11614
(2001) Serotonin and serotonin-like substances as regulators of early embryogenesis and morphogenesis
Cell and Tissue Research 305:177–186.

https://doi.org/10.1007/s004410100408
(2007) Late-neoproterozoic deep-ocean oxygenation and the rise of animal life
Science 315:92–95.

https://doi.org/10.1126/science.1135013
(2016) Xenacoelomorpha is the sister group to Nephrozoa
Nature 530:89–93.

https://doi.org/10.1038/nature16520
1. Cardoso JC
2. Pinto VC
3. Vieira FA
4. Clark MS
5. Power DM
(2006) Evolution of secretin family GPCR members in the Metazoa
BMC Evolutionary Biology 6:108.

https://doi.org/10.1186/1471-2148-6-108
(2014) New insights into the evolution of vertebrate CRH (corticotropin-releasing hormone) and invertebrate DH44 (diuretic hormone 44) receptors in metazoans
General and Comparative Endocrinology 209:162–170.

https://doi.org/10.1016/j.ygcen.2014.09.004
1. Carlberg M
2. Anctil M
(1993) Biogenic amines in coelenterates
Comparative Biochemistry and Physiology Part C: Pharmacology, Toxicology and Endocrinology 106:1–9.

https://doi.org/10.1016/0742-8413(93)90250-O
1. Carroll SB
(1995) Homeotic genes and the evolution of arthropods and chordates
Nature 376:479–485.

https://doi.org/10.1038/376479a0
1. Chan C
2. Jayasekera S
3. Kao B
4. Páramo M
5. von Grotthuss M
6. Ranz JM
(2015) Remodelling of a homeobox gene cluster by multiple independent gene reunions in Drosophila
Nature Communications 6:6509.

https://doi.org/10.1038/ncomms7509
1. Chen L
2. Krause M
3. Sepanski M
4. Fire A
(1994) The Caenorhabditis elegans MYOD homologue HLH-1 is essential for proper muscle function and complete morphogenesis
Development 120:1631–1641.
1. Chen F
2. Mackey AJ
3. Vermunt JK
4. Roos DS
(2007) Assessing performance of orthology detection strategies applied to eukaryotic genomes
PLOS ONE 2:e383.

https://doi.org/10.1371/journal.pone.0000383
1. Chen Y
2. Schier AF
(2002) Lefty proteins are long-range inhibitors of squint-mediated nodal signaling
Current Biology 12:2124–2128.

https://doi.org/10.1016/S0960-9822(02)01362-3
1. Chourrout D
2. Delsuc F
3. Chourrout P
4. Edvardsen RB
5. Rentzsch F
6. Renfer E
7. Jensen MF
8. Zhu B
9. de Jong P
10. Steele RE
11. Technau U
(2006) Minimal ProtoHox cluster inferred from bilaterian and cnidarian Hox complements
Nature 442:684–687.

https://doi.org/10.1038/nature04863
1. Christine KS
2. Conlon FL
(2008) Vertebrate CASTOR is required for differentiation of cardiac precursor cells at the ventral midline
Developmental Cell 14:616–623.

https://doi.org/10.1016/j.devcel.2008.01.009
1. Conant GC
2. Wagner A
(2003) Asymmetric sequence divergence of duplicate genes
Genome Research 13:2052–2058.

https://doi.org/10.1101/gr.1252603
1. Conway Morris S
(2006) Darwin's dilemma: the realities of the Cambrian ‘explosion’
Philosophical Transactions of the Royal Society B: Biological Sciences 361:1069–1083.

https://doi.org/10.1098/rstb.2006.1846
Book
1. Darwin C
(2009) The Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life (6th edition)
Cambridge University Press.

https://doi.org/10.1017/CBO9780511694295
1. Daubert EA
2. Condron BG
(2010) Serotonin: a regulator of neuronal morphology and circuitry
Trends in Neurosciences 33:424–434.

https://doi.org/10.1016/j.tins.2010.05.005
1. Davidson EH
2. Erwin DH
(2006) Gene regulatory networks and the evolution of animal body plans
Science 311:796–800.

https://doi.org/10.1126/science.1113832
1. de Wit J
2. Hong W
3. Luo L
4. Ghosh A
(2011) Role of leucine-rich repeat proteins in the development and function of neural circuits
Annual Review of Cell and Developmental Biology 27:697–729.

https://doi.org/10.1146/annurev-cellbio-092910-154111
1. Desgrange A
2. Cereghini S
(2015) Nephron patterning: lessons from Xenopus, zebrafish, and mouse studies
Cells 4:483–499.

https://doi.org/10.3390/cells4030483
1. Dressler GR
(2006) The cellular basis of kidney development
Annual Review of Cell and Developmental Biology 22:509–529.

https://doi.org/10.1146/annurev.cellbio.22.010305.104340
(2004) Nodal and BMP2/4 signaling organizes the oral-aboral axis of the sea urchin embryo
Developmental Cell 6:397–410.

https://doi.org/10.1016/S1534-5807(04)00056-5
(2008) Lefty acts as an essential modulator of nodal activity during sea urchin oral-aboral axis formation
Developmental Biology 320:49–59.

https://doi.org/10.1016/j.ydbio.2008.04.012
1. Dunn CW
2. Hejnol A
3. Matus DQ
4. Pang K
5. Browne WE
6. Smith SA
7. Seaver E
8. Rouse GW
9. Obst M
10. Edgecombe GD
11. Sørensen MV
12. Haddock SH
13. Schmidt-Rhaesa A
14. Okusu A
15. Kristensen RM
16. Wheeler WC
17. Martindale MQ
18. Giribet G
(2008) Broad phylogenomic sampling improves resolution of the animal tree of life
Nature 452:745–749.

https://doi.org/10.1038/nature06614
(2014) Animal phylogeny and its evolutionary implications
Annual Review of Ecology, Evolution, and Systematics 45:371–395.

https://doi.org/10.1146/annurev-ecolsys-120213-091627
(2017) Novel and divergent genes in the evolution of placental mammals
Proceedings of the Royal Society B: Biological Sciences 284:20171357.

https://doi.org/10.1098/rspb.2017.1357
(2014) TRF2 and the evolution of the Bilateria
Genes & Development 28:2071–2076.

https://doi.org/10.1101/gad.250563.114
1. Eddy SR
(1998) Profile hidden Markov models
Bioinformatics 14:755–763.

https://doi.org/10.1093/bioinformatics/14.9.755
(2012) A novel G protein-coupled receptor of Schistosoma mansoni (SmGPR-3) is activated by dopamine and is widely expressed in the nervous system
PLOS Neglected Tropical Diseases 6:e1523.

https://doi.org/10.1371/journal.pntd.0001523
1. Emerson RO
2. Thomas JH
(2009) Adaptive evolution in zinc finger transcription factors
PLOS Genetics 5:e1000325.

https://doi.org/10.1371/journal.pgen.1000325
1. Emms DM
2. Kelly S
(2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy
Genome Biology 16:157.

https://doi.org/10.1186/s13059-015-0721-2
Book
1. Erwin DH
2. Valentine JW
(2013) The Cambrian Explosion: The Construction of Animal Biodiversity
Greenwood Village, United States: Roberts and Company Publishers, Inc.
1. Evans TA
(2016) Embryonic axon guidance: insights from Drosophila and other insects
Current Opinion in Insect Science 18:11–16.

https://doi.org/10.1016/j.cois.2016.08.007
1. Fan X
2. Dougan ST
(2007) The evolutionary origin of nodal-related genes in teleosts
Development Genes and Evolution 217:807–813.

https://doi.org/10.1007/s00427-007-0191-y
1. Feuda R
2. Dohrmann M
3. Pett W
4. Philippe H
5. Rota-Stabelli O
6. Lartillot N
7. Wörheide G
8. Pisani D
(2017) Improved modeling of compositional heterogeneity supports sponges as sister to all other animals
Current Biology 27:3864–3870.

https://doi.org/10.1016/j.cub.2017.11.008
1. Feulner PGD
2. De-Kayne R
(2017) Genome evolution, structural rearrangements and speciation
Journal of Evolutionary Biology 30:1488–1490.

https://doi.org/10.1111/jeb.13101
1. Finnerty JR
2. Martindale MQ
(1999) Ancient origins of axial patterning genes: Hox genes and ParaHox genes in the Cnidaria
Evolution and Development 1:16–23.

https://doi.org/10.1046/j.1525-142x.1999.99010.x
(2014) The Fox/Forkhead transcription factor family of the hemichordate Saccoglossus kowalevskii
EvoDevo 5:17.

https://doi.org/10.1186/2041-9139-5-17
1. Galas L
2. Bénard M
3. Lebon A
4. Komuro Y
5. Schapman D
6. Vaudry H
7. Vaudry D
8. Komuro H
(2017) Postnatal migration of cerebellar interneurons
Brain Sciences 7:62.

https://doi.org/10.3390/brainsci7060062
1. Goyama S
2. Yamamoto G
3. Shimabe M
4. Sato T
5. Ichikawa M
6. Ogawa S
7. Chiba S
8. Kurokawa M
(2008) Evi-1 is a critical regulator for hematopoietic stem cells and transformed leukemic cells
Cell Stem Cell 3:207–220.

https://doi.org/10.1016/j.stem.2008.06.002
(2014) Evolution, divergence and loss of the Nodal signalling pathway: new data and a synthesis across the Bilateria
The International Journal of Developmental Biology 58:521–532.

https://doi.org/10.1387/ijdb.140133cg
(2006) Formation and evolution of the chordate neurotrophin and Trk receptor genes
Brain, Behavior and Evolution 68:133–144.

https://doi.org/10.1159/000094083
1. Harvey RP
(1996) NK-2 homeobox genes and heart development
Developmental Biology 178:203–216.

https://doi.org/10.1006/dbio.1996.0212
1. He X
2. Zhang J
(2005) Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution
Genetics 169:1157–1164.

https://doi.org/10.1534/genetics.104.037051
(2019) Convergent gene losses illuminate metabolic and physiological changes in herbivores and carnivores
PNAS 116:3036–3041.

https://doi.org/10.1073/pnas.1818504116
1. Heger P
2. Marin B
3. Bartkuhn M
4. Schierenberg E
5. Wiehe T
(2012) The chromatin insulator CTCF and the emergence of metazoan diversity
PNAS 109:17507–17512.

https://doi.org/10.1073/pnas.1111941109
1. Heger P
2. George R
3. Wiehe T
(2013) Successive gain of insulator proteins in arthropod evolution
Evolution 67:2945–2956.

https://doi.org/10.1111/evo.12155
Software
1. Heger P
(2020) BigWenDB, version 8be4115
GitHub.

https://github.com/prheger/BigWenDB
1. Heger P
2. Wiehe T
(2014) New tools in the box: an evolutionary synopsis of chromatin insulators
Trends in Genetics 30:161–171.

https://doi.org/10.1016/j.tig.2014.03.004
1. Hejnol A
2. Obst M
3. Stamatakis A
4. Ott M
5. Rouse GW
6. Edgecombe GD
7. Martinez P
8. Baguñà J
9. Bailly X
10. Jondelius U
11. Wiens M
12. Müller WEG
13. Seaver E
14. Wheeler WC
15. Martindale MQ
16. Giribet G
17. Dunn CW
(2009) Assessing the root of bilaterian animals with scalable phylogenomic methods
Proceedings of the Royal Society B: Biological Sciences 276:4261–4270.

https://doi.org/10.1098/rspb.2009.0896
(2016) Structural biology and evolution of the TGF-β family
Cold Spring Harbor Perspectives in Biology 8:a022103.

https://doi.org/10.1101/cshperspect.a022103
1. Huang EJ
2. Reichardt LF
(2001) Neurotrophins: roles in neuronal development and function
Annual Review of Neuroscience 24:677–736.

https://doi.org/10.1146/annurev.neuro.24.1.677
1. Hudson C
2. Yasuo H
(2005) Patterning across the ascidian neural plate by lateral Nodal signalling sources
Development 132:1199–1210.

https://doi.org/10.1242/dev.01688
1. Hueber SD
2. Rauch J
3. Djordjevic MA
4. Gunter H
5. Weiller GF
6. Frickey T
(2013) Analysis of central Hox protein types across bilaterian clades: on the diversification of central Hox proteins from an Antennapedia/Hox7-like protein
Developmental Biology 383:175–185.

https://doi.org/10.1016/j.ydbio.2013.09.009
1. Huerta-Cepas J
2. Szklarczyk D
3. Forslund K
4. Cook H
5. Heller D
6. Walter MC
7. Rattei T
8. Mende DR
9. Sunagawa S
10. Kuhn M
11. Jensen LJ
12. von Mering C
13. Bork P
(2016) eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences
Nucleic Acids Research 44:D286–D293.

https://doi.org/10.1093/nar/gkv1248
(2009) Emergence, development and diversification of the TGF-β signalling pathway within the animal kingdom
BMC Evolutionary Biology 9:28.

https://doi.org/10.1186/1471-2148-9-28
(1997) A survey of homeobox genes in Chaetopterus variopedatus and analysis of polychaete homeodomains
Molecular Phylogenetics and Evolution 7:331–345.

https://doi.org/10.1006/mpev.1997.0407
(2010) An evolutionary fast-track to biocalcification
Geobiology 8:191–196.

https://doi.org/10.1111/j.1472-4669.2010.00236.x
(2001) A cluster of Drosophila homeobox genes involved in mesoderm differentiation programs
BioEssays 23:125–133.

https://doi.org/10.1002/1521-1878(200102)23:2<125::AID-BIES1019>3.0.CO;2-C
(2019) The evolutionary traceability of a protein
Genome Biology and Evolution 11:531–545.

https://doi.org/10.1093/gbe/evz008
(2014) Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages
Systematic Biology 63:726–742.

https://doi.org/10.1093/sysbio/syu036
1. Jegla TJ
2. Zmasek CM
3. Batalov S
4. Nayak SK
(2009) Evolution of the human ion channel set
Combinatorial Chemistry & High Throughput Screening 12:2–23.

https://doi.org/10.2174/138620709787047957
1. Johnson EC
2. Shafer OT
3. Trigg JS
4. Park J
5. Schooley DA
6. Dow JA
7. Taghert PH
(2005) A novel diuretic hormone receptor in Drosophila: evidence for conservation of CGRP signaling
Journal of Experimental Biology 208:1239–1246.

https://doi.org/10.1242/jeb.01529
(2002) The Nemertodermatida are basal bilaterians and not members of the Platyhelminthes
Zoologica Scripta 31:201–215.

https://doi.org/10.1046/j.1463-6409.2002.00090.x
(2017) ModelFinder: fast model selection for accurate phylogenetic estimates
Nature Methods 14:587–589.

https://doi.org/10.1038/nmeth.4285
1. Kass-Simon G
2. Pierobon P
(2007) Cnidarian chemical neurotransmission, an updated overview
Comparative Biochemistry and Physiology Part A: Molecular & Integrative Physiology 146:9–25.

https://doi.org/10.1016/j.cbpa.2006.09.008
1. Kassabov SR
2. Choi YB
3. Karl KA
4. Vishwasrao HD
5. Bailey CH
6. Kandel ER
(2013) A single Aplysia neurotrophin mediates synaptic facilitation via differentially processed isoforms
Cell Reports 3:1213–1227.

https://doi.org/10.1016/j.celrep.2013.03.008
1. Katoh K
2. Kuma K
3. Toh H
4. Miyata T
(2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment
Nucleic Acids Research 33:511–518.

https://doi.org/10.1093/nar/gki198
1. Kelley LA
2. Mezulis S
3. Yates CM
4. Wass MN
5. Sternberg MJ
(2015) The Phyre2 web portal for protein modeling, prediction and analysis
Nature Protocols 10:845–858.

https://doi.org/10.1038/nprot.2015.053
1. Kenny NJ
2. Namigai EK
3. Dearden PK
4. Hui JH
5. Grande C
6. Shimeld SM
(2014) The lophotrochozoan TGF-β signalling cassette - diversification and conservation in a key signalling pathway
The International Journal of Developmental Biology 58:533–549.

https://doi.org/10.1387/ijdb.140080nk
1. Kent WJ
(2002) BLAT--the BLAST-like alignment tool
Genome Research 12:656–664.

https://doi.org/10.1101/gr.229202
(2017) Corticotropin-releasing hormone-binding protein and stress: from invertebrates to humans
Stress 20:449–464.

https://doi.org/10.1080/10253890.2017.1322575
(1999) Slit is the midline repellent for the Robo receptor in Drosophila
Cell 96:785–794.

https://doi.org/10.1016/S0092-8674(00)80589-9
(2011) Conserved, developmentally regulated mechanism couples chromosomal looping and heterochromatin barrier activity at the homeobox gene A locus
PNAS 108:7391–7396.

https://doi.org/10.1073/pnas.1018279108
1. Kim WK
2. Marcotte EM
(2008) Age-dependent evolution of the yeast protein interaction network suggests a limited role of gene duplication and divergence
PLOS Computational Biology 4:e1000232.

https://doi.org/10.1371/journal.pcbi.1000232
1. Kim Y
2. Nirenberg M
(1989) Drosophila NK-homeobox genes
PNAS 86:7716–7720.

https://doi.org/10.1073/pnas.86.20.7716
(2013) How epigenetic mutations can affect genetic evolution: model and mechanism
BioEssays 35:571–578.

https://doi.org/10.1002/bies.201200169
1. Kobe B
2. Kajava AV
(2001) The leucine-rich repeat as a protein recognition motif
Current Opinion in Structural Biology 11:725–732.

https://doi.org/10.1016/S0959-440X(01)00266-4
(2002) Selection in the evolution of gene duplications
Genome Biology 3:RESEARCH0008.

https://doi.org/10.1186/gb-2002-3-2-research0008
1. Kourakis MJ
2. Martindale MQ
(2000) Combined-method phylogenetic analysis of Hox and ParaHox genes of the Metazoa
Journal of Experimental Zoology 288:175–191.

https://doi.org/10.1002/1097-010X(20000815)288:2<175::AID-JEZ8>3.0.CO;2-N
(2012) The origin of GPCRs: identification of mammalian like rhodopsin, adhesion, glutamate and frizzled GPCRs in fungi
PLOS ONE 7:e29817.

https://doi.org/10.1371/journal.pone.0029817
(2015) OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software
Nucleic Acids Research 43:D250–D256.

https://doi.org/10.1093/nar/gku1220
(2019) OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs
Nucleic Acids Research 47:D807–D811.

https://doi.org/10.1093/nar/gky1053
1. Kunst M
2. Hughes ME
3. Raccuglia D
4. Felix M
5. Li M
6. Barnett G
7. Duah J
8. Nitabach MN
(2014) Calcitonin gene-related peptide neurons mediate sleep-specific circadian output in Drosophila
Current Biology 24:2652–2664.

https://doi.org/10.1016/j.cub.2014.09.077
(2017) Architectural protein Pita cooperates with dCTCF in organization of functional boundaries in Bithorax complex
Development 144:2663–2672.

https://doi.org/10.1242/dev.149815
(2011) Hundreds of putatively functional small open reading frames in Drosophila
Genome Biology 12:R118.

https://doi.org/10.1186/gb-2011-12-11-r118
1. Larroux C
2. Fahey B
3. Degnan SM
4. Adamski M
5. Rokhsar DS
6. Degnan BM
(2007) The NK homeobox gene cluster predates the origin of Hox genes
Current Biology 17:706–710.

https://doi.org/10.1016/j.cub.2007.03.008
1. Laumer CE
2. Bekkouche N
3. Kerbl A
4. Goetz F
5. Neves RC
6. Sørensen MV
7. Kristensen RM
8. Hejnol A
9. Dunn CW
10. Giribet G
11. Worsaae K
(2015) Spiralian phylogeny informs the evolution of microscopic lineages
Current Biology 25:2000–2006.

https://doi.org/10.1016/j.cub.2015.06.068
(2016) Neurotrophin, p75, and Trk signaling module in the developing nervous system of the marine annelid Platynereis dumerilii
BioMed Research International 2016:2456062.

https://doi.org/10.1155/2016/2456062
(2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes
Genome Research 13:2178–2189.

https://doi.org/10.1101/gr.1224503
1. Li Y
2. Cheng CN
3. Verdun VA
4. Wingert RA
(2014) Zebrafish nephrogenesis is regulated by interactions between retinoic acid, mecom, and Notch signaling
Developmental Biology 386:111–122.

https://doi.org/10.1016/j.ydbio.2013.11.021
1. Li G
2. Liu X
3. Xing C
4. Zhang H
5. Shimeld SM
6. Wang Y
(2017) Cerberus-Nodal-Lefty-Pitx signaling cascade controls left-right asymmetry in amphioxus
PNAS 114:3684–3689.

https://doi.org/10.1073/pnas.1620519114
1. Li W
2. Godzik A
(2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Bioinformatics 22:1658–1659.

https://doi.org/10.1093/bioinformatics/btl158
(2016) Somatostatin and somatostatin-containing neurons in shaping neuronal activity and plasticity
Frontiers in Neural Circuits 10:48.

https://doi.org/10.3389/fncir.2016.00048
1. Lindemans M
2. Janssen T
3. Husson SJ
4. Meelkop E
5. Temmerman L
6. Clynen E
7. Mertens I
8. Schoofs L
(2009) A neuromedin-pyrokinin-like neuropeptide signaling system in Caenorhabditis elegans
Biochemical and Biophysical Research Communications 379:760–764.

https://doi.org/10.1016/j.bbrc.2008.12.121
(2002) Heterotachy, an important process of protein evolution
Molecular Biology and Evolution 19:1–7.

https://doi.org/10.1093/oxfordjournals.molbev.a003973
1. Lowery LA
2. Van Vactor D
(2009) The trip of the tip: understanding the growth cone machinery
Nature Reviews Molecular Cell Biology 10:332–343.

https://doi.org/10.1038/nrm2679
1. Lu B
2. Pang PT
3. Woo NH
(2005) The yin and yang of neurotrophin action
Nature Reviews Neuroscience 6:603–614.

https://doi.org/10.1038/nrn1726
(2017) New genes and functional innovation in mammals
Genome Biology and Evolution 9:1886–1900.

https://doi.org/10.1093/gbe/evx136
1. Luke GN
2. Castro LFC
3. McLay K
4. Bird C
5. Coulson A
6. Holland PWH
(2003) Dispersal of NK homeobox gene clusters in amphioxus and humans
PNAS 100:5292–5295.

https://doi.org/10.1073/pnas.0836141100
1. Mackowiak SD
2. Zauber H
3. Bielow C
4. Thiel D
5. Kutz K
6. Calviello L
7. Mastrobuoni G
8. Rajewsky N
9. Kempa S
10. Selbach M
11. Obermayer B
(2015) Extensive identification and analysis of conserved small ORFs in animals
Genome Biology 16:179.

https://doi.org/10.1186/s13059-015-0742-x
1. Marshall CR
(2006) Explaining the Cambrian “explosion” of animals
Annual Review of Earth and Planetary Sciences 34:355–384.

https://doi.org/10.1146/annurev.earth.33.031504.103001
(1995) The paired-like homeo box gene MHox is required for early events of skeletogenesis in multiple lineages
Genes & Development 9:1237–1249.

https://doi.org/10.1101/gad.9.10.1237
1. Martín-Durán JM
2. Pang K
3. Børve A
4. Lê HS
5. Furu A
6. Cannon JT
7. Jondelius U
8. Hejnol A
(2018) Convergent evolution of bilaterian nerve cords
Nature 553:45–50.

https://doi.org/10.1038/nature25030
1. Maruyama YK
(2000) A sea cucumber homolog of the mouse T-Brain-1 is expressed in the invaginated cells of the early gastrula in Holothuria leucospilota
Zoological Science 17:383–387.

https://doi.org/10.2108/zsj.17.383
(2018) Casz1 controls higher-order nuclear organization in rod photoreceptors
PNAS 115:E7987–E7996.

https://doi.org/10.1073/pnas.1803069115
(2007) FGF signaling in gastrulation and neural development in Nematostella vectensis, an anthozoan cnidarian
Development Genes and Evolution 217:137–148.

https://doi.org/10.1007/s00427-006-0122-3
1. Mayorova TD
2. Kosevich IA
(2013) Serotonin-immunoreactive neural system and contractile system in the hydroid Cladonema (Cnidaria, Hydrozoa)
Invertebrate Neuroscience 13:99–106.

https://doi.org/10.1007/s10158-013-0152-2
1. Mead PE
2. Parganas E
3. Ohtsuka S
4. Morishita K
5. Gamer L
6. Kuliyev E
7. Wright CV
8. Ihle JN
(2005) Evi-1 expression in Xenopus
Gene Expression Patterns 5:601–608.

https://doi.org/10.1016/j.modgep.2005.03.007
1. Melcher C
2. Bader R
3. Walther S
4. Simakov O
5. Pankratz MJ
(2006) Neuromedin U and its putative Drosophila homolog hugin
PLOS Biology 4:e68.

https://doi.org/10.1371/journal.pbio.0040068
1. Michelson AM
2. Abmayr SM
3. Bate M
4. Arias AM
5. Maniatis T
(1990) Expression of a MyoD family member prefigures muscle pattern in Drosophila embryos
Genes & Development 4:2086–2097.

https://doi.org/10.1101/gad.4.12a.2086
(2009) Characterization of taxonomically restricted genes in a phylum-restricted cell type
Genome Biology 10:R8.

https://doi.org/10.1186/gb-2009-10-1-r8
(2013) Ultrafast approximation for phylogenetic bootstrap
Molecular Biology and Evolution 30:1188–1195.

https://doi.org/10.1093/molbev/mst024
1. Mirabeau O
2. Joly JS
(2013) Molecular evolution of peptidergic signaling systems in bilaterians
PNAS 110:E2028–E2037.

https://doi.org/10.1073/pnas.1219956110
1. Mita K
2. Fujiwara S
(2007) Nodal regulates neural tube formation in the Ciona intestinalis embryo
Development Genes and Evolution 217:593–601.

https://doi.org/10.1007/s00427-007-0168-x
1. Mohan M
2. Bartkuhn M
3. Herold M
4. Philippen A
5. Heinl N
6. Bardenhagen I
7. Leers J
8. White RA
9. Renkawitz-Pohl R
10. Saumweber H
11. Renkawitz R
(2007) The Drosophila insulator proteins CTCF and CP190 link enhancer blocking to body patterning
The EMBO Journal 26:4203–4214.

https://doi.org/10.1038/sj.emboj.7601851
(2006) A low diversity of ANTP class homeobox genes in Placozoa
Evolution and Development 8:174–182.

https://doi.org/10.1111/j.1525-142X.2006.00087.x
1. Moody WJ
2. Simoncini L
3. Coombs JL
4. Spruce AE
5. Villaz M
(1991) Development of ion channels in early embryos
Journal of Neurobiology 22:674–684.

https://doi.org/10.1002/neu.480220703
1. Moroz LL
2. Kocot KM
3. Citarella MR
4. Dosung S
5. Norekian TP
6. Povolotskaya IS
7. Grigorenko AP
8. Dailey C
9. Berezikov E
10. Buckley KM
11. Ptitsyn A
12. Reshetov D
13. Mukherjee K
14. Moroz TP
15. Bobkova Y
16. Yu F
17. Kapitonov VV
18. Jurka J
19. Bobkov YV
20. Swore JJ
21. Girardo DO
22. Fodor A
23. Gusev F
24. Sanford R
25. Bruders R
26. Kittler E
27. Mills CE
28. Rast JP
29. Derelle R
30. Solovyev VV
31. Kondrashov FA
32. Swalla BJ
33. Sweedler JV
34. Rogaev EI
35. Halanych KM
36. Kohn AB
(2014) The ctenophore genome and the evolutionary origins of neural systems
Nature 510:109–114.

https://doi.org/10.1038/nature13400
1. Narendra V
2. Rocha PP
3. An D
4. Raviram R
5. Skok JA
6. Mazzoni EO
7. Reinberg D
(2015) CTCF establishes discrete functional chromatin domains at the Hox clusters during differentiation
Science 347:1017–1021.

https://doi.org/10.1126/science.1262088
(2015) IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
Molecular Biology and Evolution 32:268–274.

https://doi.org/10.1093/molbev/msu300
(2015) The immunotranscriptome of the Caribbean reef-building coral Pseudodiploria strigosa
Immunogenetics 67:515–530.

https://doi.org/10.1007/s00251-015-0854-1
(2012) Unipro UGENE: a unified bioinformatics toolkit
Bioinformatics 28:1166–1167.

https://doi.org/10.1093/bioinformatics/bts091
1. Pai VP
2. Willocq V
3. Pitcairn EJ
4. Lemire JM
5. Paré JF
6. Shi NQ
7. McLaughlin KA
8. Levin M
(2017) HCN4 ion channel function is required for early events that regulate anatomical left-right patterning in a nodal and lefty asymmetric gene expression-independent manner
Biology Open 6:1445–1457.

https://doi.org/10.1242/bio.025957
1. Panfilio KA
2. Vargas Jentzsch IM
3. Benoit JB
4. Erezyilmaz D
5. Suzuki Y
6. Colella S
7. Robertson HM
8. Poelchau MF
9. Waterhouse RM
10. Ioannidis P
11. Weirauch MT
12. Hughes DST
13. Murali SC
14. Werren JH
15. Jacobs CGC
16. Duncan EJ
17. Armisén D
18. Vreede BMI
19. Baa-Puyoulet P
20. Berger CS
21. Chang CC
22. Chao H
23. Chen MM
24. Chen YT
25. Childers CP
26. Chipman AD
27. Cridge AG
28. Crumière AJJ
29. Dearden PK
30. Didion EM
31. Dinh H
32. Doddapaneni HV
33. Dolan A
34. Dugan S
35. Extavour CG
36. Febvay G
37. Friedrich M
38. Ginzburg N
39. Han Y
40. Heger P
41. Holmes CJ
42. Horn T
43. Hsiao YM
44. Jennings EC
45. Johnston JS
46. Jones TE
47. Jones JW
48. Khila A
49. Koelzer S
50. Kovacova V
51. Leask M
52. Lee SL
53. Lee CY
54. Lovegrove MR
55. Lu HL
56. Lu Y
57. Moore PJ
58. Munoz-Torres MC
59. Muzny DM
60. Palli SR
61. Parisot N
62. Pick L
63. Porter ML
64. Qu J
65. Refki PN
66. Richter R
67. Rivera-Pomar R
68. Rosendale AJ
69. Roth S
70. Sachs L
71. Santos ME
72. Seibert J
73. Sghaier E
74. Shukla JN
75. Stancliffe RJ
76. Tidswell O
77. Traverso L
78. van der Zee M
79. Viala S
80. Worley KC
81. Zdobnov EM
82. Gibbs RA
83. Richards S
(2019) Molecular evolutionary trends and feeding ecology diversification in the Hemiptera, anchored by the milkweed bug genome
Genome Biology 20:64.

https://doi.org/10.1186/s13059-019-1660-0
1. Paps J
2. Holland PWH
(2018) Reconstruction of the ancestral metazoan genome reveals an increase in genomic novelty
Nature Communications 9:1730.

https://doi.org/10.1038/s41467-018-04136-5
1. Parra G
2. Bradnam K
3. Korf I
(2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes
Bioinformatics 23:1061–1067.

https://doi.org/10.1093/bioinformatics/btm071
(2009) MicroRNAs and metazoan macroevolution: insights into canalization, complexity, and the Cambrian explosion
BioEssays 31:736–747.

https://doi.org/10.1002/bies.200900033
1. Petryszak R
2. Keays M
3. Tang YA
4. Fonseca NA
5. Barrera E
6. Burdett T
7. Füllgrabe A
8. Fuentes AM
9. Jupp S
10. Koskinen S
11. Mannion O
12. Huerta L
13. Megy K
14. Snow C
15. Williams E
16. Barzine M
17. Hastings E
18. Weisser H
19. Wright J
20. Jaiswal P
21. Huber W
22. Choudhary J
23. Parkinson HE
24. Brazma A
(2016) Expression Atlas update--an integrated database of gene and protein expression in humans, animals and plants
Nucleic Acids Research 44:D746–D752.

https://doi.org/10.1093/nar/gkv1045
(2005) Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia
Molecular Biology and Evolution 22:1246–1253.

https://doi.org/10.1093/molbev/msi111
1. Phillips-Cremins JE
2. Sauria ME
3. Sanyal A
4. Gerasimova TI
5. Lajoie BR
6. Bell JS
7. Ong CT
8. Hookway TA
9. Guo C
10. Sun Y
11. Bland MJ
12. Wagstaff W
13. Dalton S
14. McDevitt TC
15. Sen R
16. Dekker J
17. Taylor J
18. Corces VG
(2013) Architectural protein subclasses shape 3D organization of genomes during lineage commitment
Cell 153:1281–1295.

https://doi.org/10.1016/j.cell.2013.04.053
1. Pisani D
2. Pett W
3. Dohrmann M
4. Feuda R
5. Rota-Stabelli O
6. Philippe H
7. Lartillot N
8. Wörheide G
(2015) Genomic data do not support comb jellies as the sister group to all other animals
PNAS 112:15402–15407.

https://doi.org/10.1073/pnas.1518127112
(2007) Evidence for a microRNA expansion in the bilaterian ancestor
Development Genes and Evolution 217:73–77.

https://doi.org/10.1007/s00427-006-0116-1
1. Prud'homme B
2. Gompel N
3. Rokas A
4. Kassner VA
5. Williams TM
6. Yeh SD
7. True JR
8. Carroll SB
(2006) Repeated morphological evolution through cis-regulatory changes in a pleiotropic gene
Nature 440:1050–1053.

https://doi.org/10.1038/nature04597
1. Punta M
2. Coggill PC
3. Eberhardt RY
4. Mistry J
5. Tate J
6. Boursnell C
7. Pang N
8. Forslund K
9. Ceric G
10. Clements J
11. Heger A
12. Holm L
13. Sonnhammer EL
14. Eddy SR
15. Bateman A
16. Finn RD
(2012) The Pfam protein families database
Nucleic Acids Research 40:D290–D301.

https://doi.org/10.1093/nar/gkr1065
1. Putnam NH
2. Srivastava M
3. Hellsten U
4. Dirks B
5. Chapman J
6. Salamov A
7. Terry A
8. Shapiro H
9. Lindquist E
10. Kapitonov VV
11. Jurka J
12. Genikhovich G
13. Grigoriev IV
14. Lucas SM
15. Steele RE
16. Finnerty JR
17. Technau U
18. Martindale MQ
19. Rokhsar DS
(2007) Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization
Science 317:86–94.

https://doi.org/10.1126/science.1139158
1. Raj A
2. Wang SH
3. Shim H
4. Harpak A
5. Li YI
6. Engelmann B
7. Stephens M
8. Gilad Y
9. Pritchard JK
(2016) Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling
eLife 5:e13328.

https://doi.org/10.7554/eLife.13328
(2011) Molecular evolution of the EGF-CFC protein family
Gene 482:43–50.

https://doi.org/10.1016/j.gene.2011.05.007
(2000) EMBOSS: The European Molecular Biology Open Software Suite
Trends in Genetics 16:276–277.

https://doi.org/10.1016/S0168-9525(00)02024-2
1. Richter DJ
2. Fozouni P
3. Eisen MB
4. King N
(2018) Gene family innovation, conservation and loss on the animal stem lineage
eLife 7:e34226.

https://doi.org/10.7554/eLife.34226
(1990) Slit: an extracellular protein necessary for development of midline glia and commissural axon pathways contains both EGF and LRR domains
Genes & Development 4:2169–2187.

https://doi.org/10.1101/gad.4.12a.2169
(2015) Nodal signaling is required for mesodermal and ventral but not for dorsal fates in the indirect developing hemichordate, Ptychodera flava
Biology Open 4:830–842.

https://doi.org/10.1242/bio.011809
(2014) Hox in motion: tracking HoxA cluster conformation during differentiation
Nucleic Acids Research 42:1524–1540.

https://doi.org/10.1093/nar/gkt998
1. Ryan K
2. Garrett N
3. Mitchell A
4. Gurdon JB
(1996) Eomesodermin, a key early gene in Xenopus mesoderm differentiation
Cell 87:989–1000.

https://doi.org/10.1016/S0092-8674(00)81794-8
1. Ryan JF
2. Burton PM
3. Mazza ME
4. Kwong GK
5. Mullikin JC
6. Finnerty JR
(2006) The cnidarian-bilaterian ancestor possessed at least 56 homeoboxes: evidence from the starlet sea anemone, Nematostella vectensis
Genome Biology 7:R64.

https://doi.org/10.1186/gb-2006-7-7-R64
1. Ryan JF
2. Mazza ME
3. Pang K
4. Matus DQ
5. Baxevanis AD
6. Martindale MQ
7. Finnerty JR
(2007) Pre-bilaterian origins of the Hox cluster and the Hox code: evidence from the sea anemone, Nematostella vectensis
PLOS ONE 2:e153.

https://doi.org/10.1371/journal.pone.0000153
(2013) The genome of the ctenophore Mnemiopsis leidyi and its implications for cell type evolution
Science 342:1242592.

https://doi.org/10.1126/science.1242592
(2017) Taxon-restricted genes at the origin of a novel trait allowing access to a new environment
Science 358:386–390.

https://doi.org/10.1126/science.aan2748
1. Saudemont A
2. Dray N
3. Hudry B
4. Le Gouar M
5. Vervoort M
6. Balavoine G
(2008) Complementary striped expression patterns of NK homeobox genes during segment formation in the annelid Platynereis
Developmental Biology 317:430–443.

https://doi.org/10.1016/j.ydbio.2008.02.013
1. Schwaiger M
2. Schönauer A
3. Rendeiro AF
4. Pribitzer C
5. Schauer A
6. Gilles AF
7. Schinko JB
8. Renfer E
9. Fredman D
10. Technau U
(2014) Evolutionary conservation of the eumetazoan gene regulatory landscape
Genome Research 24:639–650.

https://doi.org/10.1101/gr.162529.113
(2013) Early evolution of the T-box transcription factor family
PNAS 110:16050–16055.

https://doi.org/10.1073/pnas.1309748110
(1995) Modulation of serotonin-controlled behaviors by Go in Caenorhabditis elegans
Science 267:1648–1651.

https://doi.org/10.1126/science.7886454
1. Sharma V
2. Hecker N
3. Roscito JG
4. Foerster L
5. Langer BE
6. Hiller M
(2018) A genomics approach reveals insights into the importance of gene losses for mammalian adaptations
Nature Communications 9:1215.

https://doi.org/10.1038/s41467-018-03667-1
1. Shen MM
(2007) Nodal signaling: developmental roles and regulation
Development 134:1023–1034.

https://doi.org/10.1242/dev.000166
1. Simakov O
2. Marletaz F
3. Cho SJ
4. Edsinger-Gonzales E
5. Havlak P
6. Hellsten U
7. Kuo DH
8. Larsson T
9. Lv J
10. Arendt D
11. Savage R
12. Osoegawa K
13. de Jong P
14. Grimwood J
15. Chapman JA
16. Shapiro H
17. Aerts A
18. Otillar RP
19. Terry AY
20. Boore JL
21. Grigoriev IV
22. Lindberg DR
23. Seaver EC
24. Weisblat DA
25. Putnam NH
26. Rokhsar DS
(2013) Insights into bilaterian evolution from three spiralian genomes
Nature 493:526–531.

https://doi.org/10.1038/nature11696
1. Simion P
2. Philippe H
3. Baurain D
4. Jager M
5. Richter DJ
6. Di Franco A
7. Roure B
8. Satoh N
9. Quéinnec É
10. Ereskovsky A
11. Lapébie P
12. Corre E
13. Delsuc F
14. King N
15. Wörheide G
16. Manuel M
(2017) A large and consistent phylogenomic dataset supports sponges as the sister group to all other animals
Current Biology 27:958–967.

https://doi.org/10.1016/j.cub.2017.02.031
1. Söding J
(2005) Protein homology detection by HMM-HMM comparison
Bioinformatics 21:951–960.

https://doi.org/10.1093/bioinformatics/bti125
1. Som A
(2015) Causes, consequences and solutions of phylogenetic incongruence
Briefings in Bioinformatics 16:536–548.

https://doi.org/10.1093/bib/bbu015
1. Srivastava M
2. Begovic E
3. Chapman J
4. Putnam NH
5. Hellsten U
6. Kawashima T
7. Kuo A
8. Mitros T
9. Salamov A
10. Carpenter ML
11. Signorovitch AY
12. Moreno MA
13. Kamm K
14. Grimwood J
15. Schmutz J
16. Shapiro H
17. Grigoriev IV
18. Buss LW
19. Schierwater B
20. Dellaporta SL
21. Rokhsar DS
(2008) The Trichoplax genome and the nature of placozoans
Nature 454:955–960.

https://doi.org/10.1038/nature07191
1. Srivastava M
2. Simakov O
3. Chapman J
4. Fahey B
5. Gauthier ME
6. Mitros T
7. Richards GS
8. Conaco C
9. Dacre M
10. Hellsten U
11. Larroux C
12. Putnam NH
13. Stanke M
14. Adamska M
15. Darling A
16. Degnan SM
17. Oakley TH
18. Plachetzki DC
19. Zhai Y
20. Adamski M
21. Calcino A
22. Cummins SF
23. Goodstein DM
24. Harris C
25. Jackson DJ
26. Leys SP
27. Shu S
28. Woodcroft BJ
29. Vervoort M
30. Kosik KS
31. Manning G
32. Degnan BM
33. Rokhsar DS
(2010) The Amphimedon queenslandica genome and the evolution of animal complexity
Nature 466:720–726.

https://doi.org/10.1038/nature09201
(2002) Control of Drosophila imaginal disc development by rotund and roughened eye: differentially expressed transcripts of the same gene encoding functionally distinct zinc finger proteins
Development 129:1273–1281.
(2006) Many genes in fish have species-specific asymmetric rates of molecular evolution
BMC Genomics 7:20.

https://doi.org/10.1186/1471-2164-7-20
(2014) Prospero-related homeobox 1 (Prox1) at the crossroads of diverse pathways during adult neural fate specification
Frontiers in Cellular Neuroscience 8:454.

https://doi.org/10.3389/fncel.2014.00454
1. Su YH
2. Yu JK
(2017) EvoDevo: changes in developmental controls underlying the evolution of animal body plans
Developmental Biology 427:177–178.

https://doi.org/10.1016/j.ydbio.2017.05.023
1. Sun X
2. Lin Y
(2016) Npas4: linking neuronal activity to memory
Trends in Neurosciences 39:264–275.

https://doi.org/10.1016/j.tins.2016.02.003
(2004) Dopamine receptors in C. elegans
European Journal of Pharmacology 500:159–166.

https://doi.org/10.1016/j.ejphar.2004.07.021
1. Szklarczyk D
2. Morris JH
3. Cook H
4. Kuhn M
5. Wyder S
6. Simonovic M
7. Santos A
8. Doncheva NT
9. Roth A
10. Bork P
11. Jensen LJ
12. von Mering C
(2017) The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible
Nucleic Acids Research 45:D362–D368.

https://doi.org/10.1093/nar/gkw937
(2000) T-brain expression in the apical organ of hemichordate tornaria larvae suggests its evolutionary link to the vertebrate forebrain
Journal of Experimental Zoology 288:23–31.

https://doi.org/10.1002/(SICI)1097-010X(20000415)288:1<23::AID-JEZ3>3.0.CO;2-H
1. Tapscott SJ
2. Davis RL
3. Thayer MJ
4. Cheng PF
5. Weintraub H
6. Lassar AB
(1988) MyoD1: a nuclear phosphoprotein requiring a Myc homology region to convert fibroblasts to myoblasts
Science 242:405–411.

https://doi.org/10.1126/science.3175662
(1997) A genomic perspective on protein families
Science 278:631–637.

https://doi.org/10.1126/science.278.5338.631
1. Technau U
2. Rudd S
3. Maxwell P
4. Gordon PM
5. Saina M
6. Grasso LC
7. Hayward DC
8. Sensen CW
9. Saint R
10. Holstein TW
11. Ball EE
12. Miller DJ
(2005) Maintenance of ancestral complexity and non-metazoan genes in two basal cnidarians
Trends in Genetics 21:633–639.

https://doi.org/10.1016/j.tig.2005.09.007
1. Teixeira CM
2. Rosen ZB
3. Suri D
4. Sun Q
5. Hersh M
6. Sargin D
7. Dincheva I
8. Morgan AA
9. Spivack S
10. Krok AC
11. Hirschfeld-Stoler T
12. Lambe EK
13. Siegelbaum SA
14. Ansorge MS
(2018) Hippocampal 5-HT input regulates memory formation and Schaffer collateral excitation
Neuron 98:992–1004.

https://doi.org/10.1016/j.neuron.2018.04.030
1. Tekaia F
(2016) Inferring orthologs: open questions and perspectives
Genomics Insights 9:GEI.S37925–GEI.S37928.

https://doi.org/10.4137/GEI.S37925
(2015) Phylogenomic insights into animal evolution
Current Biology 25:R876–R887.

https://doi.org/10.1016/j.cub.2015.07.060
(2007) Odd-skipped genes encode repressors that control kidney development
Developmental Biology 301:518–531.

https://doi.org/10.1016/j.ydbio.2006.08.063
1. Thomas GWC
2. Dohmen E
3. Hughes DST
4. Murali SC
5. Poelchau M
6. Glastad K
7. Anstead CA
8. Ayoub NA
9. Batterham P
10. Bellair M
11. Binford GJ
12. Chao H
13. Chen YH
14. Childers C
15. Dinh H
16. Doddapaneni HV
17. Duan JJ
18. Dugan S
19. Esposito LA
20. Friedrich M
21. Garb J
22. Gasser RB
23. Goodisman MAD
24. Gundersen-Rindal DE
25. Han Y
26. Handler AM
27. Hatakeyama M
28. Hering L
29. Hunter WB
30. Ioannidis P
31. Jayaseelan JC
32. Kalra D
33. Khila A
34. Korhonen PK
35. Lee CE
36. Lee SL
37. Li Y
38. Lindsey ARI
39. Mayer G
40. McGregor AP
41. McKenna DD
42. Misof B
43. Munidasa M
44. Munoz-Torres M
45. Muzny DM
46. Niehuis O
47. Osuji-Lacy N
48. Palli SR
49. Panfilio KA
50. Pechmann M
51. Perry T
52. Peters RS
53. Poynton HC
54. Prpic NM
55. Qu J
56. Rotenberg D
57. Schal C
58. Schoville SD
59. Scully ED
60. Skinner E
61. Sloan DB
62. Stouthamer R
63. Strand MR
64. Szucsich NU
65. Wijeratne A
66. Young ND
67. Zattara EE
68. Benoit JB
69. Zdobnov EM
70. Pfrender ME
71. Hackett KJ
72. Werren JH
73. Worley KC
74. Gibbs RA
75. Chipman AD
76. Waterhouse RM
77. Bornberg-Bauer E
78. Hahn MW
79. Richards S
(2020) Gene content evolution in the arthropods
Genome Biology 21:15.

https://doi.org/10.1186/s13059-019-1925-7
(2010) A non-tree-based comprehensive study of metazoan Hox and ParaHox genes prompts new insights into their origin and evolution
BMC Evolutionary Biology 10:73.

https://doi.org/10.1186/1471-2148-10-73
1. Tomancak P
2. Beaton A
3. Weiszmann R
4. Kwan E
5. Shu S
6. Lewis SE
7. Richards S
8. Ashburner M
9. Hartenstein V
10. Celniker SE
11. Rubin GM
(2002) Systematic determination of patterns of gene expression during Drosophila embryogenesis
Genome Biology 3:research0088.1–research0088.14.

https://doi.org/10.1186/gb-2002-3-12-research0088
(2015) Phylogenomics reveals convergent evolution of lifestyles in close relatives of animals and fungi
Current Biology 25:2404–2410.

https://doi.org/10.1016/j.cub.2015.07.053
1. Trachana K
2. Larsson TA
3. Powell S
4. Chen WH
5. Doerks T
6. Muller J
7. Bork P
(2011) Orthology prediction methods: A quality assessment using curated protein families
BioEssays 33:769–780.

https://doi.org/10.1002/bies.201100062
1. Ueno T
2. Tomita J
3. Tanimoto H
4. Endo K
5. Ito K
6. Kume S
7. Kume K
(2012) Identification of a dopamine pathway that regulates sleep and arousal in Drosophila
Nature Neuroscience 15:1516–1523.

https://doi.org/10.1038/nn.3238
1. Van Bortle K
2. Ramos E
3. Takenaka N
4. Yang J
5. Wahi JE
6. Corces VG
(2012) Drosophila CTCF tandemly aligns with other insulator proteins at the borders of H3K27me3 domains
Genome Research 22:2176–2187.

https://doi.org/10.1101/gr.136788.111
Thesis
1. van Dongen S
(2000) Graph clustering by flow simulation
University of Utrecht, PhD thesis.
1. Vidal-Gadea A
2. Topper S
3. Young L
4. Crisp A
5. Kressin L
6. Elbel E
7. Maples T
8. Brauner M
9. Erbguth K
10. Axelrod A
11. Gottschalk A
12. Siegel D
13. Pierce-Shimomura JT
(2011) Caenorhabditis elegans selects distinct crawling and swimming gaits via dopamine and serotonin
PNAS 108:17504–17509.

https://doi.org/10.1073/pnas.1108673108
1. Vietri Rudan M
2. Hadjur S
(2015) Genetic tailors: CTCF and cohesin shape the genome during evolution
Trends in Genetics 31:651–660.

https://doi.org/10.1016/j.tig.2015.09.004
1. Wang Q
2. Lan Y
3. Cho ES
4. Maltby KM
5. Jiang R
(2005) Odd-skipped related 1 (Odd 1) is an essential regulator of heart and urogenital development
Developmental Biology 288:582–594.

https://doi.org/10.1016/j.ydbio.2005.09.024
1. Ward N
2. Moreno-Hagelsieb G
(2014) Quickly finding orthologs as reciprocal best hits with BLAT, LAST, and UBLAST: how much do we miss?
PLOS ONE 9:e101850.

https://doi.org/10.1371/journal.pone.0101850
1. Watanabe H
2. Schmidt HA
3. Kuhn A
4. Höger SK
5. Kocagöz Y
6. Laumann-Lipp N
7. Ozbek S
8. Holstein TW
(2014) Nodal signalling determines biradial asymmetry in Hydra
Nature 515:112–115.

https://doi.org/10.1038/nature13666
1. Weintraub AS
2. Li CH
3. Zamudio AV
4. Sigova AA
5. Hannett NM
6. Day DS
7. Abraham BJ
8. Cohen MA
9. Nabet B
10. Buckley DL
11. Guo YE
12. Hnisz D
13. Jaenisch R
14. Bradner JE
15. Gray NS
16. Young RA
(2017) YY1 is a structural regulator of enhancer-promoter loops
Cell 171:1573–1588.

https://doi.org/10.1016/j.cell.2017.11.008
1. Weisberg E
2. Winnier GE
3. Chen X
4. Farnsworth CL
5. Hogan BL
6. Whitman M
(1998) A mouse homologue of FAST-1 transduces TGF β superfamily signals and is expressed during early embryogenesis
Mechanisms of Development 79:17–27.

https://doi.org/10.1016/S0925-4773(98)00160-9
Preprint
(2020) Many but not all lineage-specific genes can be explained by homology detection failure
bioRxiv.

https://doi.org/10.1101/2020.02.27.968420
1. Wheeler BM
2. Heimberg AM
3. Moy VN
4. Sperling EA
5. Holstein TW
6. Heber S
7. Peterson KJ
(2009) The deep evolution of metazoan microRNAs
Evolution & Development 11:50–68.

https://doi.org/10.1111/j.1525-142X.2008.00302.x
1. Whelan NV
2. Kocot KM
3. Moroz TP
4. Mukherjee K
5. Williams P
6. Paulay G
7. Moroz LL
8. Halanych KM
(2017) Ctenophore relationships and their placement as the sister group to all other animals
Nature Ecology & Evolution 1:1737–1746.

https://doi.org/10.1038/s41559-017-0331-3
1. Wilson KH
(2009) The genome sequence of the protostome Daphnia pulex encodes respective orthologues of a neurotrophin, a Trk and a p75NTR: Evolution of neurotrophin signaling components and related proteins in the Bilateria
BMC Evolutionary Biology 9:243.

https://doi.org/10.1186/1471-2148-9-243
1. Wu J
2. Susko E
(2011) A test for heterotachy using multiple pairs of sequences
Molecular Biology and Evolution 28:1661–1673.

https://doi.org/10.1093/molbev/msq346
1. Yan YT
2. Gritsman K
3. Ding J
4. Burdine RD
5. Corrales JD
6. Price SM
7. Talbot WS
8. Schier AF
9. Shen MM
(1999) Conserved requirement for EGF-CFC genes in vertebrate left-right axis formation
Genes & Development 13:2527–2537.

https://doi.org/10.1101/gad.13.19.2527
1. Yao LC
2. Blitz IL
3. Peiffer DA
4. Phin S
5. Wang Y
6. Ogata S
7. Cho KW
8. Arora K
9. Warrior R
(2006) Schnurri transcription factors from Drosophila and vertebrates can mediate Bmp signaling through a phylogenetically conserved mechanism
Development 133:4025–4034.

https://doi.org/10.1242/dev.02561
1. Yu JK
2. Mazet F
3. Chen YT
4. Huang SW
5. Jung KC
6. Shimeld SM
(2008) The Fox genes of Branchiostoma floridae
Development Genes and Evolution 218:629–638.

https://doi.org/10.1007/s00427-008-0229-9
(2010) Genome-wide evolutionary analysis of eukaryotic DNA methylation
Science 328:916–919.

https://doi.org/10.1126/science.1186366
1. Zhou S
2. Zawel L
3. Lengauer C
4. Kinzler KW
5. Vogelstein B
(1998) Characterization of human FAST-1, a TGF β and activin signal transducer
Molecular Cell 2:121–127.

https://doi.org/10.1016/S1097-2765(00)80120-3

Article and author information

Author details

Peter Heger

Institute for Genetics, Cologne Biocenter, University of Cologne, Cologne, Germany

Contribution
Conceptualization, Data curation, Software, Supervision, Validation, Investigation, Visualization, Methodology, Project administration

For correspondence
peter.heger@uni-koeln.de

Competing interests
No competing interests declared

ORCID iD: 0000-0003-2583-2981

Wen Zheng

Institute for Genetics, Cologne Biocenter, University of Cologne, Cologne, Germany

Present address
West China-Washington Mitochondria and Metabolism Research Center, West China Hospital, Sichuan University, Chengdu, China

Contribution
Data curation, Software, Validation, Investigation, Methodology

Competing interests
No competing interests declared

Anna Rottmann

Institute for Genetics, Cologne Biocenter, University of Cologne, Cologne, Germany

Contribution
Software, Investigation

Competing interests
No competing interests declared

Kristen A Panfilio

1. Institute for Zoology: Developmental Biology, Cologne Biocenter, University of Cologne, Cologne, Germany
2. School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, United Kingdom

Contribution
Resources, Supervision, Funding acquisition, Validation, Investigation

Competing interests
No competing interests declared

ORCID iD: 0000-0002-6417-251X

Thomas Wiehe

Institute for Genetics, Cologne Biocenter, University of Cologne, Cologne, Germany

Contribution
Conceptualization, Resources, Supervision, Funding acquisition

Competing interests
No competing interests declared

ORCID iD: 0000-0002-8932-2772

Funding

Deutsche Forschungsgemeinschaft (CRC 680)

Kristen A Panfilio
Thomas Wiehe

Deutsche Forschungsgemeinschaft (CRC 1211)

Thomas Wiehe

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This research was supported by grants from the German Research Foundation to TW (CRC 680 and CRC 1211) and to KAP (CRC 680). BLAST searches were computed on CHEOPS, the Cologne High Efficiency Operating Platform for Science of the University of Cologne, and on JuRoPA (Jülich Research on Petaflop Architectures), a High Performance Computing Platform of the Jülich Supercomputing Centre, Germany. We thank Robert Fürst for programming help, Kay Hofmann for help with protein structure analysis, Richard Stancliffe for scripting and statistical support, Maria Thieser for help with transcriptome processing, and Olav Zimmermann for the cooperation with the Jülich Supercomputing Centre. Special thanks to countless researchers and institutions for sharing sequence data.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.