1. Evolutionary Biology
  2. Genetics and Genomics
Download icon

The genetic factors of bilaterian evolution

  1. Peter Heger  Is a corresponding author
  2. Wen Zheng
  3. Anna Rottmann
  4. Kristen A Panfilio
  5. Thomas Wiehe
  1. Institute for Genetics, Cologne Biocenter, University of Cologne, Germany
  2. Institute for Zoology: Developmental Biology, Cologne Biocenter, University of Cologne, Germany
  3. School of Life Sciences, University of Warwick, Gibbet Hill Campus, United Kingdom
Research Article
  • Cited 0
  • Views 1,926
  • Annotations
Cite this article as: eLife 2020;9:e45530 doi: 10.7554/eLife.45530

Abstract

The Cambrian explosion was a unique animal radiation ~540 million years ago that produced the full range of body plans across bilaterians. The genetic mechanisms underlying these events are unknown, leaving a fundamental question in evolutionary biology unanswered. Using large-scale comparative genomics and advanced orthology evaluation techniques, we identified 157 bilaterian-specific genes. They include the entire Nodal pathway, a key regulator of mesoderm development and left-right axis specification; components for nervous system development, including a suite of G-protein-coupled receptors that control physiology and behaviour, the Robo-Slit midline repulsion system, and the neurotrophin signalling system; a high number of zinc finger transcription factors; and novel factors that previously escaped attention. Contradicting the current view, our study reveals that genes with bilaterian origin are robustly associated with key features in extant bilaterians, suggesting a causal relationship.

Introduction

The taxon Bilateria consists of multicellular animals with bilateral body symmetry and constitutes a major and ancient radiation of animals. There is compelling morphological and molecular evidence for the monophyly of bilaterians (Hejnol et al., 2009; Dunn et al., 2014; Cannon et al., 2016), for their subdivision into protostomes and deuterostomes (Aguinaldo et al., 1997; Philippe et al., 2005; Dunn et al., 2008; Simakov et al., 2013; Cannon et al., 2016), and for the overall relationships of ∼25 phyla that make up this group (Dunn et al., 2008; Hejnol et al., 2009; Dunn et al., 2014). In contrast, the evolutionary relationships of non-bilaterian metazoans are still a matter of debate, in particular the relative positions of placozoans, ctenophores, and sponges (Brooke and Holland, 2003; Ryan et al., 2013; Pisani et al., 2015; Feuda et al., 2017; Simion et al., 2017; Whelan et al., 2017).

The first unambiguously bilaterian fossils appear in Cambrian sediments with an age of ∼540 million years (Marshall, 2006; Erwin and Valentine, 2013). By the end of Cambrian stage 3 (499 Mya), stem groups of all major bilaterian phyla inhabited Earth. This abrupt appearance of most bilaterian body plans, the sets of morphological features common to a phylum, already puzzled Darwin (Darwin, 2009). It is considered one of the most important evolutionary events after the origin of life (Conway Morris, 2006; Budd, 2008) and still awaits an explanation today. Importantly, no new body plans evolved in the 500 My since the initial radiation.

Abiotic, ecological, and genetic factors have been proposed to explain the Cambrian radiation. While deep-ocean oxygenation (Canfield et al., 2007), the availability of calcium (Jackson et al., 2010), or ecological interactions (Budd and Jensen, 2017) likely played a role, genetic changes in the bilaterian ancestor must ultimately have constituted its molecular basis. However, evidence for such genetic changes is scarce. Genomic sequencing of non-bilaterian animals revealed that the major signalling pathways and many developmentally important genes of bilaterians are also present in non-bilaterians, indicating that these genes evolved before the advent of bilaterians (Technau et al., 2005; Putnam et al., 2007; Srivastava et al., 2008; Srivastava et al., 2010; Ryan et al., 2013; Babonis and Martindale, 2017). Similarly, epigenetic mechanisms to regulate gene expression, such as DNA methylation and histone modifications, seem to be conserved between bilaterians and non-bilaterian metazoans (Zemach et al., 2010; Schwaiger et al., 2014). Therefore, the common view is that modification of existing gene regulatory networks rather than the invention of new genes determined the evolution of complex body plans (Davidson and Erwin, 2006; Su and Yu, 2017).

Nevertheless, a number of studies identified genes that emerged in the ancestor of bilaterians. One example is a major expansion of miRNA families that likely triggered an increase in miRNA-mediated gene regulation (Prochnik et al., 2007; Wheeler et al., 2009). However, the significance of this event at the base of the Bilateria is unclear because frequent miRNA expansions are seen in various lineages over time (Peterson et al., 2009). Similarly, a link between the genome organiser CTCF and Hox genes presumably emerged in the bilaterian ancestor and might have contributed to the organisation of bilaterian body plans (Heger et al., 2012). The importance of CTCF for Hox gene expression has been shown repeatedly (Mohan et al., 2007; Kim et al., 2011; Rousseau et al., 2014; Narendra et al., 2015), yet direct evidence for the involvement of a Hox-CTCF link in body patterning is lacking. Another study implicated the TATA-box-binding protein-related factor 2 (TRF2) in the evolution of bilaterians. This factor may have founded new, TATA box-independent transcriptional programs involved in body plan development (Duttke et al., 2014), but the consequences of this hypothesis have not been tested.

Therefore, a comprehensive screen for bilaterian-specific genes and an assessment of their evolutionary impact is missing. A major obstacle for such a screen is the uneven coverage of the animal tree with sequence data. While some lineages, particularly those including model organisms (e.g., nematodes, flies, or mammals), are well represented, other areas of the metazoan tree are remarkably under-represented, for example lophotrochozoans and non-bilaterian metazoans. For instance, the leading orthology databases OrthoDB (Kriventseva et al., 2015Kriventseva et al., 2019), eggNOG (Huerta-Cepas et al., 2016), and OrthoMCL (Li et al., 2003) each contain fewer than ten non-bilaterian species, and two of these databases do not contain lophotrochozoans at all (Figure 1, Table 1). It is therefore difficult to deduce from such databases the genes that are widespread in bilaterians and absent in non-bilaterians. In addition to the bias in coverage, sequence databases suffer from annotation errors, which particularly affect non-model organisms and under-represented parts of the tree, such as non-bilaterian metazoans and lophotrochozoans. Annotation errors, in turn, have been found as the largest single source for errors in orthology benchmark testing and, together with uneven phylogenetic coverage, accounted for up to 40% of incorrect assignments (Trachana et al., 2011).

Figure 1 with 3 supplements see all
Properties of the BigWenDB data collection.

(A) Comparison of three major orthology databases with the BigWenDB. The relative contribution of four metazoan clades (Deuterostomia, Ecdysozoa, Lophotrochozoa, and the paraphyletic group "non-Bilateria") is shown as stacked bar graph. The count of metazoans in our database (175 species) is set to 100%. In comparison to other databases, the BigWenDB has a larger repertoire of critical lophotrochozoans and non-bilaterian Metazoa. (B) Consensus phylogeny describing the relationships of 21 metazoan phyla covered in our database, after Laumer et al., 2015; Telford et al., 2015; Torruella et al., 2015; Cannon et al., 2016. Bold labels to the left or above branches indicate its ancestor (A: Arthropoda, B: Bilateria, D: Deuterostomia, E: Ecdysozoa, Eu: Eumetazoa, L: Lophotrochozoa, M: Metazoa, O: Opisthokonta, P: Protostomia). Numbers in parentheses (after the phylum name) indicate the number of species present from this phylum. Horizontal bars visualise the number of database sequences that belong to a given phylum (logarithmic scale; transcriptomic, ORF, and NCBI sequences summed up). Species silhouettes were downloaded from www.phylopic.org. Morphological innovations of Bilateria according to Baguñà et al., 2008 are highlighted in a shaded box.

Table 1
Comparison of three major orthology databases with the BigWenDB.

The number of species of a given taxon (left column) in four different orthology databases is shown. In contrast to other databases, the BigWenDB has substantially more sequence information from non-bilaterian metazoans and therefore a better resolution at the divergence of bilaterians and non-bilaterians. D = Deuterostomia, E = Ecdysozoa. Note the bias of other databases towards insects and vertebrates, which continues in the latest database versions (e.g., OrthoDB v10.2; Kriventseva et al., 2019).

TaxonOrthoDB V8eggNOG V4.5OrthoMCL V5BigWenDB
Cellular organisms3,0272,031150273
Metazoa1738829175
Bilateria1698527142
non-Bilateria43233
Ecdysozoa (E)97291254
E w/o insects179429
Lophotrochozoa50018
Deuterostomia (D)66551465
D w/o vertebrates54112

To address these biases and to infer bilaterian-specific genes in a reliable and robust way, we (i) assembled a dataset covering the animal tree in the most comprehensive and representative way so far; (ii) particularly strengthened resolution at the base of the Bilateria; (iii) reduced annotation errors by incorporating newly generated ORF (open reading frame) data sets; and (iv) evaluated the composition of the generated orthologous groups in a phylogenetic context. Using this strategy we extracted, from an initial set of 124 million sequences from 273 species, 157 high-confidence bilaterian-specific genes, with many functions connected to key bilaterian features.

Results

Dataset generation and orthogroup evaluation

Non-bilaterian metazoans are severely under-represented in existing sequence collections, but sufficient coverage is critical to illuminate bilaterian evolution. To maximise phylogenetic resolution at the origin of Bilateria, we assembled a new database specifically tailored to this purpose, the BigWenDB (Figure 1, Figure 1—figure supplement 1; Table 1). This database combines sequence data of 273 species from three sources. The backbone of our analysis is the opisthokont sequence space (primarily fungi, vertebrates, and insects): 204 species, each with >8000 available sequences at GenBank, totalling 2.7 million sequences (Table 2; NCBI GenBank release 203 from 15 August 2014). The second part derives from transcriptome sequences of 64 species from various sources (Supplementary file 1–Supplementary Table 1, Supplementary file 1–Supplementary Table 2, Supplementary file 2). Among others, non-bilaterian metazoans (30 species) and lophotrochozoans (12 species) contribute 11.7 million sequences to this group, complementing their poor GenBank representation (Figure 1—figure supplement 1). The third and largest sequence set contains ∼109 million open reading frames (ORFs) obtained by translating 25 metazoan genomes (Supplementary file 1–Supplementary Table 3). All non-bilaterian and lophotrochozoan whole genome sequences available at the time, as well as genomes from additional phyla, were included to compile a comprehensive and representative dataset (Figure 1—figure supplement 1). As this strategy caused a large increase in sequence number, we limited the third set to 25 species to maintain technical feasibility. The final dataset combines 124 million sequences from 21 metazoan and three outgroup phyla, including several taxa absent from other databases, for example tardigrades, a priapulid, bryozoans, a nemertean, a rotifer, a brachiopod, and choanoflagellates (Figure 1, Figure 1—figure supplement 1).

Table 2
Composition of the BigWenDB.

The number of sequences (overall: 124,031,501) collected from three different sources (NCBI, Transcriptome, ORFs) is indicated for major taxonomic groups of the BigWenDB. "Others" comprises the ichthyosporean Capsaspora owczarzaki and the choanoflagellates Monosiga brevicollis and Salpingoeca rosetta.

Group(Super)Phylum# SpeciesNCBITranscriptomeORFs
BilateriaDeuterostomia65895,0842,292,54151,922,654
Ecdysozoa54511,6632,150,42417,338,026
Lophotrochozoa23170,3792,618,5189,805,405
Non-Bilat.Ctenophora701,468,3722,458,546
Placozoa111,2150590,820
Porifera68,836539,2991,008,535
Cnidaria1936,8732,361,03226,443,358
Fungi931,032,29900
others329,29200
total2732,695,64111,768,516109,567,344

To be able to generate clusters of orthologous proteins from this large dataset, we adapted the OrthoMCL pipeline (Li et al., 2003) and improved its scalability (see Appendix 1: Orthology pipeline and clustering; Supplementary file 1–Supplementary Table 4). As a large proportion of the resulting 824,605 orthogroups was small and had phylogenetically inconsistent composition (Appendix 1—figure 1; Supplementary file 1–Supplementary Table 5), we focused our analysis on 75,744 orthogroups (OGs) with at least 10 species. They provide a rich repertoire for the identification of lineage-specific protein sets.

Hundreds to thousands of novel translated ORFs exist in humans and other animals, that are missed by traditional annotation methods (Ladoukakis et al., 2011; Mackowiak et al., 2015; Raj et al., 2016). A key aspect of our analysis is therefore the inclusion of genomic ORFs. To estimate their contribution to the clustering process, we examined the composition of all orthogroups. Genomic ORFs constitute a substantial fraction of the majority of orthogroups, comprising >90% of all sequences in 50% of orthogroups. This demonstrates that a high percentage of orthogroups is either dependent on or substantially affected by the inclusion of ORFs. Although most ORFs are short (mean length of 60 AA; Figure 1—figure supplement 2, Figure 1—figure supplement 3), nearly 2.3 million ORFs (on average 90,443 per species) are >132 AA, the mean size of domains in the PFAM database, ensuring the possibility of annotating ORF-dominated orthogroups (Figure 1—figure supplement 2).

We next assessed the accuracy and biological validity of our orthogroup dataset via several approaches. First, we compared our clustering results with an external benchmark set of 70 manually curated orthogroups (Trachana et al., 2011; see Appendix 1: Cluster evaluation and quality control; Supplementary file 3). We then specifically examined the clustering results of a highly conserved and difficult to assess class of proteins, the Nkx homeodomain proteins (Supplementary file 1–Supplementary Table 6). Third, we evaluated potential sources of error with respect to the phylogenetic composition of a given orthogroup (see Appendix 1: Identification of bilaterian-specific genes). For this purpose, we developed a new reciprocal HMM-HMM comparison step. It performs sensitive, BLAST-independent searches for orthogroups with similar sequence profiles to validate orthogroup completeness. We demonstrated the value of this step by using two proteins as test cases, the FGF signalling pathway component Sprouty and the insulator protein GAGA factor (see Appendix 1: Identification of bilaterian-specific genes; Supplementary file 1–Supplementary Table 7). After these quality control steps, we finally identified 157 orthogroups as a minimal set of high confidence, bilaterian-specific orthogroups (Supplementary file 4).

The domain repertoire of bilaterian-specific proteins is enriched for DNA-binding

To reveal the putative function of the 157 identified bilaterian-specific genes, we first determined their protein domain repertoire and the gene ontology terms for molecular function associated with these domains. We then compared the results to analyses carried out for the vertebrate and arthropod nodes, as these nodes represent major radiations that are well-supported by genome sequence data. The obtained terms indicate that membrane processes, including cell adhesion, G-protein-coupled receptor signalling, and Ca2+-binding, as well as protein interactions and metal ion binding, are prominent molecular functions of bilaterian-specific proteins (Figure 2 left, top and middle row). In contrast, terms derived from the arthropod and vertebrate nodes are markedly different. While the vertebrate repertoire comprises G-protein-coupled receptors, cadherins, and extracellular domains required for protein-protein or protein-ligand interactions, arthropod-specific genes are characterised by a broad spectrum of similarly prominent functions, from expected roles in cuticle and chitin biology to a plenitude of conserved domains of unknown function (Figure 2 middle and right, top and middle row). These results indicate that proteins with distinct functions characterise the evolution of each of the three nodes.

Figure 2 with 4 supplements see all
Inventory of protein domains and associated GO terms for three animal lineages.

Further, our comparative analysis implied that a large number of transcription factors emerged in the bilaterian ancestor. While 3.58% of vertebrate-specific orthogroups and 9.30% of arthropod-specific orthogroups had transcription factor-associated domains such as zinc fingers or homeodomains, the corresponding fraction was 26.06% in bilaterian-specific orthogroups (Figure 2 middle row). To substantiate this result, we randomly selected 10 times 157 proteins from a curated set of 20,205 human proteins. The average number of transcription factors in these control sets was 12.8 ± 4.44 as opposed to 37 transcription factors in the set of 157 bilaterian-specific genes. This is a highly significant result under a number of assumptions for data distribution (see Materials and methods), lending statistical support to an unexpectedly high number of transcription factors in the bilaterian-specific dataset.

Importantly, many of the transcription factors contained tandem C2H2 zinc finger domains and already originated with multiple zinc fingers, as their extant Drosophila and human orthologues suggest (Supplementary file 1–Supplementary Table 8). With the addition of at least 13 members, the modest poly-ZF repertoire at the dawn of metazoans thus almost doubled in the bilaterian ancestor (Figure 2—figure supplement 1) in line with previous evidence that poly-ZF proteins emerged from a small group of eukaryotic zinc finger transcription factors (Emerson and Thomas, 2009). Considering that several factors with this domain configuration are involved in regulating chromatin architecture, including CTCF (Phillips-Cremins et al., 2013), YY1 (Weintraub et al., 2017), Pita (Kyrchanova et al., 2017), SuHw (Van Bortle et al., 2012), and Casz1 (Mattar et al., 2018), these findings open the possibility that multiple poly-ZF factors participated in modifying higher-order chromatin structure during the emergence of bilaterians, as proposed for CTCF (Heger et al., 2012; Vietri Rudan and Hadjur, 2015; Acemel et al., 2017). With the exception of YY1 (OG_3966: metazoan origin or earlier), all known chromatin architectural proteins emerged in the ancestor of bilaterians or later (Heger et al., 2013; Heger and Wiehe, 2014), suggesting that a more sophisticated regulation of gene expression by influencing chromatin architecture contributed to bilaterian evolution. More generally, we note that poly-ZF proteins often comprise the most abundant transcription factor superfamily in bilaterians, with many lineage-specific expansions even within orders and families (Panfilio et al., 2019). Below, we also comment both on similar patterns in other protein classes and on potential other roles of a bilaterian expansion in poly-ZF proteins.

Bilaterian-specific proteins contain novel protein domains

Using domain scans, we could not identify known protein domains or other functional annotation for 5 of the 157 bilaterian-specific orthogroups. Nevertheless, the corresponding alignments displayed extended regions of sequence conservation (Figure 2—figure supplement 2, Figure 2—figure supplement 3, Figure 2—figure supplement 4) arguing that these regions may constitute so far undetected protein domains. To explore whether the putative domains are bilaterian novelties, we converted them to hidden Markov models and used these to search our database of 824,605 orthogroup HMMs. In these searches, only one of the five domains showed weak evidence for homology outside the Bilateria, indicating that a protein with a similar domain exists in non-bilaterians. The other four domains were restricted to bilaterians, like the proteins they belong to (Supplementary file 1–Supplementary Table 9), a finding compatible with the de novo birth of these five genes. Similarly, sequences without known protein domains were also detectable in arthropod- and vertebrate-specific orthogroups (Figure 2) and, more generally, in approximately 40% of the 69,114 orthogroups with more than ten species. These findings open the possibility that, across opisthokonts, many lineage-specific genes are uncharacterised and may contain previously undescribed protein domains and novel lineage-specific domains, emphasising the involvement of gene birth in lineage evolution on a broad scale.

Changes in the transcription factor repertoire and in membrane processes accompany bilaterian evolution

Nuclear factors include key developmental regulators

To reveal the putative function of the identified bilaterian-specific genes, we determined the subcellular location of their human orthologues according to the information at www.uniprot.org (Figure 3). Almost two-thirds of the 157 genes belonged to either of two cellular compartments, the nucleus or the plasma membrane. The majority of nuclear proteins (40/57 orthogroups) had transcription factor activity, with various domains for DNA binding (Figure 3B). Although C2H2 poly-ZF proteins are particularly enriched (Figure 2—figure supplement 1, Supplementary file 1–Supplementary Table 8), we also found several transcription factors with homeobox and basic helix-loop-helix (bHLH) domains (Figure 3B; Figure 2). The latter factors are important for regulatory processes during embryogenesis such as neurogenesis, myogenesis, and positional specification along the body axis (Supplementary file 1–Supplementary Table 10). For example, we found the bHLH domain-containing transcription factor MyoD, the master regulator for muscle cell specification in vertebrates, D. melanogaster, and C. elegans (Tapscott et al., 1988; Michelson et al., 1990; Chen et al., 1994), consistent with the bilaterian origin of mesoderm (Supplementary file 1–Supplementary Table 10, Supplementary file 4). Likewise, at least three conserved regulators of nervous system development and neurotransmission, the Neuronal PAS domain-containing protein 4, the Prospero homeobox protein 2, and the Achaete-scute homologue 2 (Stergiopoulos et al., 2014; Sun and Lin, 2016), emerged in the ancestor of bilaterians (Supplementary file 1–Supplementary Table 10, Supplementary file 4). Finally, two orthogroups with homeobox domain proteins, OG_8634 and OG_4203, contained the central Hox genes Antennapedia and Ultrabithorax (Balavoine et al., 2002; Chourrout et al., 2006). Central Hox genes are absent from non-bilaterian Metazoa despite the existence of anterior and posterior homologues (Ryan et al., 2007). Our screen thus correctly identified central Hox genes as a bilaterian novelty even though homeodomain-containing proteins are difficult to assign (Thomas-Chollier et al., 2010; Hueber et al., 2013).

Subcellular location and molecular function of 157 bilaterian-specific genes.

(A) Graphic representation of a eukaryotic cell with its typical organelles. Numbers in parentheses denote the number of bilaterian-specific orthogroups associated predominantly with a given cellular structure. Graphic drawn after the subcellular location section at uniprot.org. (B) Upper chart: Subcellular location of 157 bilaterian-specific genes. Location data is based on the corresponding human orthologues and colour-matched with the graphics in A. Lower chart: Number and name of transcription factor-associated domains present in the set of 157 bilaterian-specific genes. The 40 orthogroups are a subset of 51 orthogroups associated with the nuclear compartment. In most cases, domains names follow Pfam standards (http://pfam.xfam.org/). (C) Distribution of 84 domains found in 51 orthogroups associated with the nucleus. (D) Distribution of 77 domains found in 49 orthogroups associated with the plasma membrane. (E) Distribution of 39 domains found in 28 orthogroups associated with the cytoplasm. "Other" represents domains found only once in the respective category.

Membrane factors include neural transducers and novel proteins

A heterogeneous set of proteins was mapped to the membrane compartment (Figure 3D). While most of the domains found in 49 orthogroups of this category occurred once or twice, several domains were seen more often, in particular the seven transmembrane receptor domain (7tm; 13×), the leucine-rich repeat (LRR; 5×), the Bestrophin chloride channel (Bestrophin; 3×), and the hormone receptor domain (HRM; 3×). The 7tm domain is characteristic of G-protein-coupled receptors, which will be discussed further below. The LRR domain is a protein binding motif (Kobe and Kajava, 2001) and present in several factors connected to the plasma membrane (Figure 3D) such as LINGO1, SLIT2, or SEMA6C. These LRR domain-containing molecules are crucial for organising neural connectivity and are employed for axon guidance, myelination, and synapse formation (de Wit et al., 2011). Although LRR domain-containing molecules exist in non-bilaterians (Ocampo et al., 2015), it is currently unknown whether they fulfil, in these organisms, a role in nervous system development as observed in flies and vertebrates. Further, several bilaterian-specific orthogroups contained ion channel proteins. For both nervous system function and embryonic development (Moody et al., 1991; Pai et al., 2017), ion channels play important roles as they provide the basis of currents and action potentials across the plasma membrane and are involved in morphogenetic movements and cell shape changes during development (Moody et al., 1991). However, most ion channel proteins seem to predate the origin of metazoans (Jegla et al., 2009), and therefore it is unclear how the identified channel proteins affected bilaterian evolution.

Three orthogroups contained transmembrane proteins for which currently no functional description is available, although expression data for two of these exist: OG_13067 (TM169_HUMAN), OG_26661 (TM74B_HUMAN), and OG_28197 (TM160_HUMAN). Genome-wide studies revealed that CG4596, the Drosophila orthologue of TM169_HUMAN, is expressed in the ventral nerve cord, ventral midline, and in the brain during embryogenesis (Tomancak et al., 2002), similar to a central nervous system-based expression of the mouse orthologue (Supplementary file 1–Supplementary Table 11; Petryszak et al., 2016). Mouse expression data for the transmembrane protein TM160_HUMAN largely overlap with TM169_HUMAN (Supplementary file 1–Supplementary Table 11), but corresponding data from Drosophila are not available, as TM160 is absent from ecdysozoans (Figure 2—figure supplement 2, Supplementary file 1–Supplementary Table 12). Multiple sequence alignments and HMM-HMM searches demonstrate further that these two transmembrane proteins are well conserved across bilaterians (Figure 2—figure supplement 2) and possess a unique sequence profile without similarity to other orthogroups within the opisthokont search space (Supplementary file 1–Supplementary Table 12). Together, these observations establish that so far uncharacterised proteins with predicted transmembrane domains and distinct structures might have a function in the nervous system since the Cambrian.

Lineage-specific genes are ubiquitous and contain lineage-specific protein domains

The dataset for this study was designed to capture genes with bilaterian-specific distribution. To explore whether it allows the identification of genes specific for other evolutionary nodes, we determined the number of lineage-specific orthogroups for five successive nodes in two lineages: in the protostome lineage leading to Diptera and in the deuterostome lineage leading to Mammalia. We counted for every node lineage-specific orthogroups as a function of increasing species coverage. Extending coverage reduced the number of lineage-specific orthogroups, as expected (Figure 4). However, tens to hundreds of lineage-specific orthogroups were still obtained at each individual node under the strict condition of 50% coverage (i.e. at least 50% of the species that belong to the respective node need to be present in orthogroups; Figure 4). HMM-HMM searches and domain scans further suggested that lineage-specific orthogroups for the 10 nodes contain novel domains unique to the respective lineage (for examples, see Figure 4—figure supplement 1 and Supplementary file 1–Supplementary Table 13), as it is the case for bilaterian-specific proteins (Figure 2—figure supplement 2, Figure 2—figure supplement 3, Figure 2—figure supplement 4). These findings suggest that the origin of genes and novel protein domains is a robust component of evolution at every examined node and that the faithful identification of these genes is a critical aspect in reconstructing evolutionary history, as exemplified by the recent detection of lineage-specific genes in mammals, mollusks, cnidarians, or arthropods (Milde et al., 2009; Aguilera et al., 2017; Dunwell et al., 2017; Thomas et al., 2020).

Figure 4 with 1 supplement see all
Distinct lineage-specific genes at subsequent nodes of insect and vertebrate evolution.

Starting from Bilateria (left), a protostome lineage leading to dipterans (upper) and a deuterostome lineage leading to mammals (lower) are shown as schematic phylogenetic tree. Sister clades to the selected taxa are denoted on short branches in the center. Each barplot displays the number of lineage-specific orthogroups (y axis) as a function of orthogroup size (x axis) for the selected taxonomic group (Protostomia, Ecdysozoa, Arthropoda etc.). The total species count (within BigWenDB) for each of the eleven taxonomic groups is indicated on top of the corresponding barplots (# Species). The count of lineage-specific genes decreases with growing orthogroup size. A red line denotes the number of orthogroups in which at least 50% of the species of a selected lineage are present. The corresponding number of lineage-specific orthogroups is highlighted in red next to the line.

The Nodal pathway is a bilaterian-specific addition to the TGF-β superfamily and linked to left-right determination and mesoderm formation

Three orthogonal axes—the anterior-posterior, the dorsal-ventral, and the left-right axis—determine body layout in bilaterian animals. One of the signalling systems active in these processes is the Nodal pathway. It belongs to the transforming growth factor β (TGF-β) pathway and is essential for the specification of left-right asymmetry and the induction of mesoderm and endoderm in vertebrates (Shen, 2007). The TGF-β ligands Nodal and Lefty, the co-receptor EGF-CFC, and the transcription factor FoxH1 are components specific to the Nodal pathway (Figure 5—figure supplement 1). In addition, the T-box transcription factor TBR-2/Eomes (T-box brain protein 2/Eomesodermin) is a target of Nodal signalling and critical for mesoderm formation and neural development (Ryan et al., 1996; Arnold et al., 2008).

Distinct phylogenetic distributions have been reported for the Nodal-signalling components. The presence and functional conservation of Nodal itself is well established across deuterostomes (Duboc et al., 2004; Hudson and Yasuo, 2005; Shen, 2007; Röttinger et al., 2015) and lophotrochozoans (Grande et al., 2014; Kenny et al., 2014). In contrast, searches for Lefty orthologues were so far positive only in deuterostomes (Chen and Schier, 2002; Mita and Fujiwara, 2007; Duboc et al., 2008; Li et al., 2017), but not in Lophotrochozoa (Grande et al., 2014). Similarly, the Nodal coreceptor EGF-CFC has been identified only in deuterostomes (Yan et al., 1999; Ravisankar et al., 2011), and FoxH1 orthologues have been characterised in vertebrates and cephalochordates only (Weisberg et al., 1998; Zhou et al., 1998; Yu et al., 2008; Figure 5A). Nodal-signalling components have not been identified in the protostome model organisms D. melanogaster and C. elegans. Likewise, the T-box factor eomesodermin is absent from these animals but has been described in lophotrochozoans, deuterostomes, and sponges (Maruyama, 2000; Tagawa et al., 2000; Arenas-Mena, 2008; Arnold et al., 2008; Sebé-Pedrós et al., 2013). These findings imply a successive gain of Nodal signalling components along the lineage from the metazoan to the vertebrate ancestor (Figure 5A).

Figure 5 with 4 supplements see all
Evolution of the Nodal signaling pathway.

Two consensus phylogenetic trees showing the relationship of major metazoan lineages. The five factors of the Nodal signalling pathway (Nodal, Lefty, EGF-CFC, FoxH1, and Eomes) are displayed as coloured boxes. Their phylogenetic distribution and inferred evolutionary origin are mapped onto the tree. Gene births are indicated as coloured boxes above the respective branch. Inferred losses are represented by crosses. Bold labels to the left of a branch indicate branch ancestors: B = Bilateria, Eu = Eumetazoa, M = Metazoa. (A) Previous results regarding the evolution of Nodal pathway genes, as known from the literature. (B) Revised evolutionary history of the Nodal pathway genes according to our results. Note that none of the five factors has been found in arthropods and nematodes. The ecdysozoan boxes for Eomes and FoxH1 are derived from the presence of the genes in a single priapulid species. Grey shading: Hypothetical emergence of a putative kernel for mesoderm specification and neural patterning.

In line with previous findings (Hudson and Yasuo, 2005; Shen, 2007; Grande et al., 2014; Kenny et al., 2014), our analysis revealed that the TGF-β ligand Nodal belongs to a robust bilaterian-specific orthogroup (OG_12210; Figure 5—figure supplement 2, Supplementary file 1–Supplementary Table 14). However, orthogroups of the other Nodal pathway members (Lefty, EGF-CFC, FoxH1, and Eomes) were also bilaterian-specific, and HMM-HMM-based searches against all orthogroups (Supplementary file 1–Supplementary Table 14) as well as phylogenetic analyses supported this result (Figure 5—figure supplement 2, Figure 5—figure supplement 3).

Our clustering results suggested further that the T-box transcription factor Eomes is in fact restricted to bilaterians, contradicting a study that identified Eomes candidates in two poriferan species (Sebé-Pedrós et al., 2013). In BLAST searches, the two poriferan sequences displayed highest similarity to the canonical T-box transcription factors TBX3/4, but not to the T-box containing protein Eomes (Supplementary file 1–Supplementary Table 15). Likewise, phylogenetic analyses failed to confidently assign the poriferan sequences to the Eomes clade (Figure 5—figure supplement 4), and HMM-HMM searches could not detect Eomes-related orthogroups with proteins from sponges or other non-bilaterian animals (Supplementary file 1–Supplementary Table 14). These results consistently argue for a bilaterian origin of the factor, matching the distribution of the other Nodal pathway members (Figure 5B). While our phylogenetic analyses supported orthology clustering results and the monophyly of the Eomes clade, they unexpectedly argued for a metazoan origin of the gene (Figure 5—figure supplement 4). This interpretation would imply independent loss events in the ancestors of three phyla (Cnidaria, Placozoa, and Ctenophora) and in two sponge lineages (see Figure 5A and discussion), while a posited bilaterian-specific origin would be more parsimonious. To finally resolve this issue, more detailed analyses are needed.

Recently, a Nodal-related gene has been identified in the cnidarian Hydra magnipapillata and found to be essential for specifying axial asymmetry along the polyp’s main body axis (Watanabe et al., 2014). In our dataset, H. magnipapillata Nodal-related belongs to a different orthogroup (OG_9136), together with sequences from nine other cnidarians and many deuterostomes. This orthogroup contains, among others, vertebrate GDF-6/7, but no Nodal orthologues. Furthermore, we did not obtain an HMM-HMM reciprocal best hit relationship with the Nodal orthogroup using as query either the entire orthogroup OG_9136 or a subset of cnidarian sequences (Supplementary file 1–Supplementary Table 16), suggesting that Nodal indeed emerged in the bilaterian ancestor as a new member among pre-existing Nodal-related genes.

Taken together, orthology clustering, HMM-HMM comparison, and phylogenetic evidence establish that all four Nodal-specific pathway components and Eomes are present only in bilaterians (Figure 5B). It is thus possible that these factors co-evolved as extension of the more ancient TGF-β signalling pathway (Huminiecki et al., 2009; Hinck et al., 2016) and acquired the potential for mesoderm formation and left-right axis determination, two characteristic bilaterian traits. Due to the conservation of this hypothetical gene regulatory network (GRN) since the Cambrian, it could represent an ancient kernel for mesoderm specification and neural patterning. The identification of only a subset of the five factors in non-chordate species (Figure 5B) indicates that Nodal signalling experienced substantial evolutionary turnover, but it does not exclude initial assembly of the pathway in the bilaterian ancestor and subsequent lineage-specific changes.

One consequence of these considerations is that large parts of the Nodal GRN must have been lost early in ecdysozoan evolution, implying the evolution of alternative upstream signalling pathway inputs for axial specification in this group. Secondly, genes that originated in the bilaterian ancestor may have been lost in a particular daughter lineage. The widespread loss of genes across metazoans (Richter et al., 2018; Sharma et al., 2018) and the loss of Nodal pathway members (this study) shows that such scenarios are conceivable and might impact the exhaustive description of lineage-specific genes, that is, the reconstruction of the "true" evolutionary history of a taxon.

G-protein-coupled receptors and the control of physiological state through circulatory flow

Among the identified bilaterian-specific genes is a set of eight G-protein-coupled receptors (GPCRs), members of a large family of seven-transmembrane domain receptors. While GPCRs are ancient and were already present in the ancestor of bilaterians and fungi (Krishnan et al., 2012), our results indicate that new members of the GPCR family appeared at the bilaterian base. Specifically, robust clustering results and HMM-HMM comparisons place the origin of monoamine neurotransmitter receptors for serotonin, adrenaline, and dopamine to the bilaterian root (Supplementary file 1–Supplementary Table 17, Supplementary file 1–Supplementary Table 18), in line with a recent publication that dated back the evolutionary history of adrenergic signalling to the bilaterian ancestor (Bauknecht and Jékely, 2017). Histochemical, biochemical, and functional data are in conflict with this finding and argue for the presence of serotonin, dopamine, and other small molecule neurotransmitters in cnidarians, the bilaterian sister group (Carlberg and Anctil, 1993; Kass-Simon and Pierobon, 2007; Mayorova and Kosevich, 2013). However, receptors for these molecules could not be identified unambiguously in cnidarians (Anctil, 2009; Bosch et al., 2017), maintaining the possibility that they indeed constitute bilaterian innovations.

There is evidence across several bilaterian phyla (arthropods, nematodes, mollusks, platyhelminthes, vertebrates) that adrenaline, dopamine, and serotonin signalling regulates many important processes such as behaviour, feeding, learning, locomotion, memory, reproduction, reward, or sleep (Ségalat et al., 1995; Berridge, 2004; Suo et al., 2004; Berger et al., 2009; Vidal-Gadea et al., 2011; Burke et al., 2012; El-Shehabi et al., 2012; Ueno et al., 2012). In addition to these "post-embryonic" functions, serotonin is recognised as an important regulator of embryonic development and neuronal circuitry in vertebrates and invertebrates (Brown and Shaver, 1989; Buznikov et al., 2001; Daubert and Condron, 2010). The proposed origin of monoamine neurotransmitter receptors in the bilaterian ancestor (Supplementary file 1–Supplementary Table 17, Supplementary file 1–Supplementary Table 18) and the related functions of monoamine neurotransmitter signalling across phyla suggest that diverse functions of monoamine neurotransmitter signalling already existed in the bilaterian ancestor and could have played a role in the evolution of complex development, brain function, and behaviour. Preliminary evidence indicates that cnidarians, as the bilaterian sister group, do not respond to rewarding or punishing stimuli as do bilaterians (Barron et al., 2010). A link between this behavioural difference and the evolution of monoamine neurotransmitter receptors would comply with the previous notion that the evolution of dopamine-based brain reward systems in bilaterians started from dopamine’s ancient role as a signalling molecule for motor circuits (Barron et al., 2010).

In addition to monoamine neurotransmitter receptors, we detected several peptide hormone receptors in the set of bilaterian-specific GPCRs and could support their bilaterian origin using HMM-HMM searches: the receptors for secretin, corticotropin-releasing factor, neuromedin-U, calcitonin, and somatostatin (Supplementary file 4, Supplementary file 1–Supplementary Table 17, Supplementary file 1–Supplementary Table 18). In vertebrates, these GPCRs and their hormone ligands are part of the endocrine system and regulate basal physiological activities such as feeding, energy homoeostasis, or stress (Budhiraja and Chugh, 2009; Afroze et al., 2013). homologues of the five receptors and their ligands have also been described in C. elegans and D. melanogaster (Johnson et al., 2005; Cardoso et al., 2006; Melcher et al., 2006; Lindemans et al., 2009; Cardoso et al., 2014; Kunst et al., 2014; Ketchesin et al., 2017), and the putative bilaterian ancestry of some of these signalling systems has been recognised by others, in agreement with our results (Johnson et al., 2005; Lindemans et al., 2009; Mirabeau and Joly, 2013). In contrast to vertebrates or insects, cnidarians and other non-bilaterian Metazoa do not contain specialised endocrine organs and circulatory systems. Thus, our finding of highly conserved peptide hormone receptors supports the view that major physiological regulators evolved in parallel with the emergence of circulatory systems. Moreover, recent evidence indicates that these hormone receptors also act during development and participate in neuronal migration and nervous system formation (Afroze et al., 2013; Liguz-Lecznar et al., 2016; Galas et al., 2017), suggesting an ancient link between the generation of complex nervous systems and the ability to control body functions through circulatory fluid.

Changes in axon guidance accompany bilaterian evolution

Axon guidance, the guided outgrowth of axons and dendrites, is essential for the development of neuronal connections and mediated by two major pathways, the Netrin-DCC and the Slit-Robo (Round-About) pathway (Lowery and Van Vactor, 2009; Evans, 2016). To reveal whether changes in these processes accompanied the evolution of bilaterians, we studied the respective orthogroups. Except one, all human Netrin paralogues were assigned to a single orthogroup. Its composition and the composition of its HMM-HMM best hit orthogroups support the emergence of Netrins in the ancestor of eumetazoans or earlier (Supplementary file 1–Supplementary Table 19), in line with a description of Netrins in the sea anemone N. vectensis (Putnam et al., 2007). We found a corresponding (eu)metazoan origin for the Netrin receptor DCC (Supplementary file 1–Supplementary Table 19). These results indicate that cnidarians, but not ctenophores, might regulate axon outgrowth at least in part by Netrin-DCC based interactions, consistent with an independent origin of the nervous system in ctenophores (Moroz et al., 2014).

Although orthogroup composition of Slit and its receptor Robo suggested a bilaterian origin of this system, reciprocal HMM-HMM searches indicated the existence of cnidarian Robo orthologues that were assigned to a separate orthogroup, OG_51853 (Supplementary file 1–Supplementary Table 19). Like their bilaterian counterparts, the cnidarian Robo candidates had highly disordered cytoplasmic domains, as revealed by structure predictions of the extracellular and intracellular part of representative sequences (Figure 6). On the other hand, sequence comparisons revealed that the conserved cytoplasmic motif CC1, which is required for binding the Ena/VASP protein Enabled and for transducing signals to the actin cytoskeleton (Bashaw et al., 2000), is altered in cnidarian Robos (Figure 6—figure supplement 1), and that cnidarian Robos displayed several insertions and deletions in the cytoplasmic part when compared with bilaterian Robos (Figure 6—figure supplement 2). It is therefore an open question whether the structural differences in cnidarian Robo-like proteins involve interactions with different downstream partners and whether cnidarian Robos regulate axon growth. Known downstream effectors of Robo signalling, such as Enabled and Son of sevenless, originated early in metazoan evolution (Supplementary file 1–Supplementary Table 20) and could provide in principle the functionality for Robo-based axon guidance, although mediated by a different ligand.

Figure 6 with 3 supplements see all
Structural predictions of cnidarian and bilaterian Robo proteins.

Top (ex): Predicted structure of the extracellular domain plus transmembrane region of seven selected Robo proteins. Bottom (cp): Predicted structure of the transmembrane region plus cytoplasmic part of seven selected Robo proteins. Robo1 orthologues of two deuterostomes (Hsap = Homo sapiens; Spur = Strongylocentrotus purpuratus), one lophotrochozoan (Lana = Lingula anatina), two ecdysozoans (Dmel = Drosophila melanogaster; Tpse = Trichinella pseudospiralis), and two cnidarians (Hvul = Hydra vulgaris; Spis = Stylophora pistillata) were analysed. "% conf" indicates the percentage of residues modelled at >90% confidence. "% dis" indicates the predicted percentage of disordered regions. Bottom right: Schematic outline of the Robo domain structure with five immunoglobulin domains (IG1–IG5) and three fibronectin type III domains (FN3) in the extracellular part and four conserved cytoplasmic motifs (CC0–CC3) in the intracellular part. Like their bilaterian counterparts, cnidarian Robo candidates display a disorganised protein structure in the cytoplasmic part despite differences in structural features (Figure 6—figure supplement 1, Figure 6—figure supplement 2). The extracellular part (top row), on the other hand, is similarly organised across metazoans.

In both Drosophila melanogaster and vertebrates, midline glia cells secrete the Slit protein to prevent Robo expressing axons from crossing the body midline (Rothberg et al., 1990; Brose et al., 1999; Kidd et al., 1999), indicating that a key component in the establishment of bilaterally symmetric nervous systems is shared between protostomes and deuterostomes. However, in our dataset, a single placozoan sequence was assigned to Slit’s otherwise bilaterian-specific orthogroup, shifting its origin back in time. BLAST searches at NCBI verified a reciprocal best hit relationship of the putative placozoan Slit to known Slit proteins, in agreement with our clustering results (Supplementary file 1–Supplementary Table 15). Likewise, placement of the placozoan sequence in phylogenetic analyses is compatible with its orthology to the Slit protein (Figure 6—figure supplement 3). Unexpectedly, HMM-HMM comparisons could not reveal the existence of Slit in other non-bilaterian species such as cnidarians or ctenophores (Supplementary file 1–Supplementary Table 21). From these results, we conclude that Slit and Robo probably originated in the common ancestor of placozoans, cndiarians, and bilaterians. However, the Slit-Robo-based mechanism for midline repulsion during nervous system development appears to be restricted to bilaterians, as placozoans lack a nervous system and cnidarians lack the Slit ligand.

Neurotrophin receptor signalling is a bilaterian innovation

Neurotrophin signalling plays a fundamental role in nervous system generation by regulating many aspects of neuronal development and function, such as neuronal survival, synapse formation, or axon guidance (Huang and Reichardt, 2001; Lu et al., 2005). Vertebrates possess four related neurotrophin ligands and three corresponding transmembrane receptors of the Trk family that each originated from a single ancestral gene in chordates (Benito-Gutiérrez et al., 2005; Hallböök et al., 2006). Once considered a vertebrate innovation, neurotrophins and their receptors have now been found in diverse invertebrates (Wilson, 2009; Kassabov et al., 2013; Lauri et al., 2016). In particular, studies in the mollusk Aplysia californica suggest that neurotrophin signalling and neurotrophin-mediated synaptic plasticity are conserved in protostomes and deuterostomes (Kassabov et al., 2013).

To elucidate the evolutionary origin of neurotrophin signalling, we analysed the orthogroups containing neurotrophins and their receptors. The four vertebrate neurotrophin ligands clustered into two bilaterian-specific orthogroups (OG_14798 and OG_21801) that are each other’s reciprocal best hit. We could not detect orthogroups similar to neurotrophins in non-bilaterian metazoans or additional, so far unidentified neurotrophins in bilaterians (Supplementary file 1–Supplementary Table 22), supporting the emergence of a single neurotrophin gene in the ancestor of bilaterians and its subsequent diversification in vertebrates. When we analysed the evolutionary origin of other neurotrophic factors, we recognised that they also arose in the ancestor of bilaterians or even later (Figure 7; Supplementary file 1–Supplementary Table 22, Supplementary file 1–Supplementary Table 23). The evolutionary age of these additional neurotrophic factors is thus consistent with a bilaterian origin of neurotrophic ligands per se. The same evolutionary scenario is supported by detailed analysis of the Trk receptor family. Although our initial dataset conflated Trk and Wnt pathway receptors due to a shared receptor tyrosine kinase domain, adjustment of the MCL inflation parameter successfully rendered a Trk-only orthogroup, whose taxonomic composition is restricted to bilaterians (Figure 7—figure supplement 1; Supplementary file 1–Supplementary Table 24).

Figure 7 with 1 supplement see all
The bilaterian-wide distribution of neurotrophic factors.

The NTRK receptor and 14 major neurotrophic factors are displayed as coloured boxes. Their phylogenetic distribution and inferred evolutionary origin are mapped onto the tree (see Supplementary file 1–Supplementary Table 22 and Supplementary file 1–Supplementary Table 23). Gene births are indicated as coloured boxes above the respective branch of the tree (left). Inferred losses are shown as coloured crosses in the matrix. Bold labels to the left of a branch indicate branch ancestors: Ac = Actinopterygii, B = Bilateria, Ch = Chordata, Eu = Eumetazoa, Gn = Gnathostomata, M = Metazoa, Sa = Sarcopterygii. The neurotrophic factors of Cladistia, the sister group of Actinopteri, are inferred and distinguished by a question mark as the dataset lacks species from this lineage.

These results indicate that neurotrophins and their receptors are present across bilaterians and might fulfill conserved functions in neuronal development in these animals. If long-term potentiation and memory formation is regulated by serotonin and its receptors across bilaterians (see, for example, Teixeira et al., 2018), a link between serotonin action and neurotrophin signalling may have emerged in the bilaterian ancestor that contributed to nervous system evolution and the learning-dependent synaptic plasticity characteristic for this group.

Bilaterian-specific factors and the evolution of excretory systems

Protostomes and deuterostomes comprise the taxon Nephrozoa, animals with a dedicated excretory system (sensu Jondelius et al., 2002). Together with their sister group Xenacoelomorpha, Nephrozoa form the taxon Bilateria (Cannon et al., 2016). When we started with our study, sequences from Xenacoelomorpha were not available, and therefore our bilaterian-specific gene set is in fact specific for nephrozoans and might contain factors related to kidney and/or nephron development. Indeed, we identified in the 157 bilaterian-specific orthogroups two relevant zinc finger transcription factors. The poly-zinc finger transcription factor Evi1/MECOM was assigned to a large orthogroup with protein members from 108 of 142 bilaterian species (OG_5543). Evi1 is expressed in pronephric tissue of Xenopus and zebrafish embryos and involved in nephron patterning in these species (Mead et al., 2005; Li et al., 2014; Desgrange and Cereghini, 2015), although this might only be a part of its function (Goyama et al., 2008). Secondly, after BLAST searches, maximum likelihood phylogenetic analysis, and HMM-HMM searches focusing on orthogroup OG_5226, we found evidence for a bilaterian-wide distribution of odd-skipped related 1, a zinc finger transcription factor required for heart and urogenital development in vertebrates (Wang et al., 2005; Dressler, 2006; Tena et al., 2007; Supplementary file 1–Supplementary Table 15, Supplementary file 1–Supplementary Table 26; Supplementary file 1–Supplementary Figure 1). Thus, the observed expansion of the zinc finger transcription factor repertoire may also have been important for the evolution and development of excretory organs, a key nephrozoan innovation.

Bilaterian-specific genes form a rich interaction network with interconnected subnetworks

To reveal potential interactions among the 157 bilaterian-specific proteins, we analysed the interaction network of the corresponding human orthologues using the STRING protein-protein interaction (PPI) database. The obtained PPI network contained significantly more interactions than expected by chance (PPI enrichment p-value: 5.93e-14), revealing that bilaterian-specific genes form a dense network in which about 50% of the factors (83 distinct factors) are connected to one another (Figure 8A). These interactions form several subnetworks involved in regulating key aspects of bilaterian development, such as chromatin organisation and transcriptional regulation (subnetwork A), myogenesis (subnetwork B), mesoderm formation and left-right asymmetry (the Nodal pathway, subnetwork C: see also Figure 8B), neurogenesis (subnetwork D), and physiology (subnetwork E). Connections between different subnetworks further suggest that crosstalk between the newly established regulatory subnetworks was an important aspect of bilaterian evolution.

Protein-protein interaction network of bilaterian-specific proteins.

(A) Uniprot identifiers corresponding to the human orthologues of 150 bilaterian-specific genes (seven OGs had no human orthologues) were uploaded to the STRING database, and their mutual interactions were visualised as a network. Parameters for the displayed PPI network were: minimum required interaction score = 0.4; maximum number of interactors to display in 1st and 2nd shell = 0. Thus, only known and predicted interactions between 83 distinct bilaterian-specific proteins are shown (non-interacting proteins are hidden). Evidence for displayed interactions is colour-coded (see legend). Edge length and node placement are arbitrary. Five subnetworks between bilaterian-specific genes are highlighted in red (A-E, see Results). (B) Bilaterian-specific Nodal subnetwork in the context of metazoan genes. The five members of the Nodal pathway are highlighted by shading. (C, D) Boxplots comparing bilaterian- (B) and metazoan-specific (M) proteins in the full network and Nodal subnetwork for the total number of interactions per protein (C), and for the relative fraction of bilaterian interactions per protein (D).

Previous work found that protein network connectivity (number of interactions) increases with gene age (Kim and Marcotte, 2008). To analyse the degree of connectivity of our bilaterian network, we compared it to a PPI network generated from metazoan-specific proteins that is expected to show higher connectivity due to the proteins’ more ancient origin. Our orthology clustering data identified 797 metazoan-specific proteins (>5× as many proteins as in the bilaterian dataset), and the combined bilaterian-metazoan PPI network comprised 2,531 interactions among 823 proteins (16% bilaterian-specific proteins, 84% metazoan-specific proteins). In fact, we obtained a slightly higher level of connectivity for the younger, bilaterian proteins (Figure 8C: total number of interactions per protein, median ± median absolute deviation (MAD): 5 ± 4.62 for Bilateria, 4 ± 4.16 for Metazoa; Mann-Whitney U test: U = 39792, p = 0.0135). Furthermore, bilaterian-specific proteins preferentially interacted with one another, with over twice as many bilaterian-bilaterian interactions as would be expected by chance (χ~2 statistic = 24.814, p = 0.000001), primarily due to fewer bilaterian-metazoan interactions than would be expected. This is also evident at the level of individual proteins: bilaterian-specific proteins have significantly more bilaterian interaction partners (Figure 8D: percent of bilaterian interactions, median ± MAD: 19.5±23.2 for Bilateria, 0.0±16.1 for Metazoa; Mann-Whitney U=32231, p=0.00000).

As we identify the Nodal pathway as a key bilaterian innovation (Figure 5, Figure 8A: subnetwork C), we focused on this subnetwork as a case study for further analysis of molecular interactions. Within the full bilaterian-metazoan PPI network, we indeed recovered the Nodal pathway as a bilaterian-specific subnetwork, embedded among connections to additional bilaterian and metazoan proteins (Figure 8B). As with the full network, for this subnetwork we found a significant number of bilaterian-specific protein interactions (Figure 8D; Kruskal-Wallis χ~2 = 62.855, degrees of freedom = 3, p = 1.44e-13). Furthermore, for this subnetwork, we found support for the hypothesis that older (metazoan) genes have higher connectivity (Figure 8C; Kim and Marcotte, 2008). Notably, metazoan-specific proteins that participate in the Nodal subnetwork are a non-representative subset, showing significantly higher overall connectivity and bilaterian-specific connectivity than metazoan proteins in the full bilaterian-metazoan PPI network. Thus, it may be that older genes have higher connectivity if they exceed a minimum threshold of connectivity (number of interactions). For example, the Nodal subnetwork includes Smad4, a metazoan-specific protein with the highest connectivity (46 interactions) of any protein in our combined network. This multifunctional BMP pathway component likely exemplifies two evolutionary trends: that highly connected genes are most likely to acquire new interaction partners, and that bilaterian-specific PPI innovations build on more ancient, preexisting PPI networks by co-option.

Extrapolating these findings to interactions with additional factors of more ancient origin implies that the evolution of new genes in the bilaterian ancestor affected a large number of processes in animal biology.

Discussion

An R-based OrthoMCL pipeline for processing large datasets

Explaining the sudden emergence of bilaterally symmetric animals during the Cambrian is a central problem in evolutionary biology. Complicated by the uneven coverage of the metazoan tree with sequence information, a systematic approach to identify the genetic basis for the evolution of bilaterians was missing. In this study, we present a comparative genomics approach, designed to provide maximum resolution at the bilaterian/non-bilaterian divergence and therefore uniquely suited to discover bilaterian-specific genes.

Although sequence data for individual species in our study might be incomplete (Supplementary file 1–Supplementary Table 1, Supplementary file 1–Supplementary Table 2), each important taxonomic group (Deuterostomia, Ecdysozoa, Lophotrochozoa, and "non-Bilateria") is represented with several well-annotated genomes and/or proteomes (Figure 1—figure supplement 1, Supplementary file 1–Supplementary Table 3). Importantly, sequence data from 19 cnidarian species, including four sequenced genomes and five transcriptomes with CEGMA scores above 70% (Supplementary file 1–Supplementary Table 2), allow the crucial distinction of orthogroups with cnidarian participation from bilaterian-specific orthogroups without cnidarian contribution, a serious problem of existing databases (Table 1).

While other orthology databases might surpass the BigWenDB in species number, this is often due to the integration of many non-metazoan and prokaryotic species (Table 1). Still, the total sequence content of other databases is small enough to be handled by a MySQL engine (see http://www.orthodb.org/v9.1/download/README.MySQL.txt; www.orthomcl.org) because it is restricted to predicted and annotated protein sequences. To accomplish processing of the large amount of sequence data from 25 genomic ORF sets, we developed an R-based version of the OrthoMCL pipeline (Li et al., 2003). It reproduces the results of the original pipeline meticulously (Supplementary file 1–Supplementary Table 4) and is capable of processing at least 125 million sequences with current computer hardware, considerably extending the limit imposed by conventional MySQL usage. In view of the ongoing increase in sequence data, the R-based version of OrthoMCL may prove valuable for generating large and comprehensive orthology datasets in the future.

Importantly, scaling up the orthology engine to handle larger datasets did not come at the expense of clustering quality. Rather, the combination of a comprehensive dataset and a scalable orthology prediction tool turned out as beneficial, challenging an early study that found a high false-positive rate when testing OrthoMCL on a small and taxonomically restricted dataset (Chen et al., 2007). This advance of our approach is further demonstrated by correct orthology inference rates that surpass those previously obtained in the orthobench comparisons (Trachana et al., 2011; Supplementary file 3).

Reciprocal HMM-HMM comparisons for improving orthogroup completeness

Despite the existence of many orthology detection methods (Tekaia, 2016), current tools do not evaluate orthogroup composition after clustering. In contrast, we implemented filtering steps to first identify widely distributed bilaterian-specific orthogroups. We then applied to the resulting orthogroups extensive procedures for quality control and error correction, taking into account the taxonomic composition of orthogroups and their best hits in HMM-HMM searches. In this context, we developed a new reciprocal HMM-HMM comparison step to evaluate orthogroup completeness because reliable orthogroups are a prerequisite for inferring the evolutionary age of the corresponding gene (Supplementary file 1–Supplementary Table 7). Although HMMs generated from orthogroup alignments can be uninformative outside conserved regions, they capture important amino acid positions and their spacing and variability, and therefore the individual profile of an orthogroup even within common functional domains such as zinc fingers (Supplementary file 1–Supplementary Figure 2). Indeed, we observed several instances where HMM-HMM comparisons improved results and affected conclusions, demonstrating the value of this novel step (Supplementary file 1–Supplementary Table 13, Supplementary file 1–Supplementary Table 14, Supplementary file 1–Supplementary Table 16, Supplementary file 1–Supplementary Table 19, Supplementary file 1–Supplementary Table 21, Supplementary file 1–Supplementary Table 22, Supplementary file 1–Supplementary Table 23, Supplementary file 1–Supplementary Table 24).

In particular, we employed highly sensitive HMM-HMM comparisons to minimise errors caused by low protein traceability, the limitation of the BLAST algorithm to detect orthologous genes in distantly related organisms (Jain et al., 2019; Weisman et al., 2020). This strategy led to the removal of 68 false-positive orthogroups from an initial set of 431 bilaterian-specific orthogroups because they displayed reciprocal best-hit relationships to non-bilaterian orthogroups, indicating a more ancient origin (see Appendix 1: Identification of bilaterian-specific genes). In addition, the broad coverage of bilaterians and non-bilaterians and the evaluation of orthogroup composition by filtering rules minimises errors that may be caused by the low traceability of specific genes or by single taxa with particularly high evolutionary rates.

Limitations of our orthology clustering pipeline

Our methods for error correction facilitate the detection of reliable lineage-specific gene sets and may serve as a future standard. However, developing software that can automatically detect such patterns and combine/split orthogroups in awareness of the underlying phylogeny would further improve orthogroup assignments. That lineage-specific genes exist and can directly change an animal’s phenotype to gain access to new ecological niches has been shown recently, illustrating the importance of these genes and the need for their identification (Dunwell et al., 2017; Santos et al., 2017; Luis Villanueva-Cañas et al., 2017).

Although we obtained a robust set of 157 genes that evolved in the bilaterian ancestor or, more specifically, in the ancestor of protostomes and deuterostomes (Nephrozoa) (Jondelius et al., 2002), by design our study is limited to protein coding sequences. It will therefore miss the possible involvement of RNA genes in bilaterian evolution, including miRNAs (micro RNAs) and lncRNAs (long non-coding RNAs), as suggested by Prochnik et al., 2007. It will further fail to detect changes in cis-regulatory regions and structural alterations or epigenetic changes, additional factors that affect evolutionary processes (Carroll, 1995; Prud'homme et al., 2006; Klironomos et al., 2013; Feulner and De-Kayne, 2017). Despite these limitations, our study successfully corroborated the bilaterian origin of several previously known bilaterian-specific genes, such as the chromatin organiser CTCF (Heger et al., 2012), the left-right determination factor Nodal (Grande et al., 2014), and central Hox genes (Finnerty and Martindale, 1999; Hueber et al., 2013).

Challenges in reconciling orthogroups and phylogenetic trees

Orthology clustering is a distinct method from phylogenetic tree building, and when we used phylogenetic analyses to validate orthogroup composition, we experienced difficulties in reconciling the two approaches.

Firstly, we do consistently obtain high branch support for bilaterian-specific orthogroups as discrete clades. Yet within orthogroups, phylogenetic resolution was often weak, with low branch support and gene tree–species tree discordance. However, tree discordance in itself does not argue against orthology because phylogenies suffer from various problems, such as the inclusion of problematic sequences, little phylogenetic information, or—in our case—the presence of short ORF fragments (Aguileta et al., 2008; Som, 2015). While our ORF data aid the recognition of distinct orthogroups by avoiding systemic annotation errors from external databases and by providing essential taxonomic coverage, these sequences do not represent full-length proteins and may curtail within-orthogroup resolving power.

In addition, in several cases we obtained tree topologies that could imply orthogroup origin in the metazoan ancestor rather than a later, bilaterian origin (Figure 5—figure supplement 3, Figure 5—figure supplement 4, Figure 7—figure supplement 1). One major confounding factor for correct tree reconstruction is heterotachy: a non-constant rate of evolution among different lineages (Lopez et al., 2002; Wu and Susko, 2011; Jayaswal et al., 2014). Importantly, heterotachy is often observed along the branches originating from a gene duplication event (Kondrashov et al., 2002; Conant and Wagner, 2003; He and Zhang, 2005; Steinke et al., 2006). Accelerated evolution in bilaterian-specific duplicates could therefore explain the observed tree topologies and the discrepancy between trees and clustering results. In contrast, the alternative interpretation of metazoan orthogroup origins would require that one of the two duplicates was secondarily lost in the stem lineage of sponges, ctenophores, placozoans, and cnidarians because of its absence in all available samples from these phyla. Gene loss is increasingly recognised as a widespread and important evolutionary mechanism (Sharma et al., 2018; Hecker et al., 2019; Thomas et al., 2020). However, the loss of a number of genes in the stem lineages of four independent phyla would imply strong selective pressure against their presence in non-bilaterian lineages, creating an aspect of deep evolution worthwhile of future exploration.

A robust associaton between bilaterian-specific genes and key morphological features

Several morphological features are widely considered key bilaterian innovations: (i) a third germ layer, the mesoderm; (ii) a complex bilateral nervous system; (iii) a Hox gene cluster with at least seven anterior, posterior, and central Hox genes; (iv) a through gut; (v) an excretory system; (vi) the possession of many different cell types; and (vii) bilateral symmetry (Baguñà et al., 2008 and references therein). It was unknown so far whether, and if so which, genetic factors contributed to the emergence of these innovations. From the results presented here, we conclude that a considerable fraction of the identified 157 bilaterian-specific genes is associated with the origin of characteristic bilaterian traits. Although correlations cannot prove a causal relationship, in the absence of ancestral genetic information our inferences from extant animals offer a fruitful approach. Here, we elaborate on several instances where the origin of proteins and bilaterian traits appear to coincide.

For example, a large portion of the 157 genes is involved in nervous system development and/or maintenance (Supplementary file 4). Several factors in this category provide functionalities absent from non-bilaterian metazoans, such as the long-range control of behaviour and physiological state through an expanded repertoire of GPCRs (Supplementary file 1–Supplementary Table 17, Supplementary file 1–Supplementary Table 18), a midline repulsion mechanism for the establishment of a bilateral nervous system (Robo-Slit; Figure 6—figure supplement 3; Supplementary file 1–Supplementary Table 19, Supplementary file 1–Supplementary Table 21), or mechanisms for sophisticated axon guidance and synaptic plasticity (neurotrophin signalling system; Figure 7; Supplementary file 1–Supplementary Table 22, Supplementary file 1–Supplementary Table 23, Supplementary file 1–Supplementary Table 24). These findings are consistent with the convergent evolution of muscle and nerve cells in ctenophores (Moroz et al., 2014) and suggest that bilaterians have a common genetic basis for nervous system patterning despite the recently proposed scenario of convergent evolution of bilaterian nerve cords (Martín-Durán et al., 2018). The importance of the nervous-system-related category of bilaterian-specific genes is further underscored by the identification of various transcription factors with a well supported role in nervous system development across phyla, for example the Prospero homeobox protein, the Achaete-scute homologue 2, or the neuronal PAS domain-containing protein 4 (Supplementary file 1–Supplementary Table 10, Supplementary file 4). Further, three transmembrane proteins with expression in the nervous system, but unknown function, provide the opportunity to characterise novel factors with nervous system-related function (Supplementary file 1–Supplementary Table 11). Together, the factors we found in this category provide fundamental features of bilaterian nervous systems, and their evolutionary origin in the bilaterian ancestor is compatible with observable changes in nervous system development and architecture.

An unexpectedly high number of bilaterian-specific genes has transcription factor activity (Figure 3B; Figure 2). As noted above, these factors are often equipped with multiple C2H2 zinc finger domains (Figure 2—figure supplement 1; Supplementary file 1–Supplementary Table 8). Apart from so far uncharacterised proteins, which include ZF64B_HUMAN or ZN236_HUMAN, the expression and developmental role of bilaterian-specific zinc finger proteins is compatible with prominent functions during early development, such as imaginal disc development (Rotund; St Pierre et al., 2002), modulation of TGF-β signalling (Schnurri; Yao et al., 2006), nephron patterning (Evi1, odd-skipped related 1; Mead et al., 2005; Dressler, 2006; Tena et al., 2007; Li et al., 2014), or the differentiation of cardiac precursor cells at the ventral midline (Castor; Christine and Conlon, 2008). Importantly, the identified transcription factors with homeobox or bHLH domain are involved in the specification of several bilaterian tissues, the mesoderm (MyoD, PRRX1_HUMAN, BHE22_HUMAN), the nervous system (Prospero homeobox protein 2, Achaete-scute homologues 2, FER3L_HUMAN, NPAS4_HUMAN, BHE22_HUMAN, BUN1_DROME), or the intestine (ISX_HUMAN), consistent with a role in the evolution of these characteristic bilaterian traits .

A contiguous cluster of at least seven Hox genes is an ancestral bilaterian feature (Baguñà et al., 2008). A prerequisite for its formation is the existence of anterior, central, and posterior Hox genes. Our results confirm previous findings that placed the origin of central Hox genes to the bilaterian ancestor (Supplementary file 1–Supplementary Table 10), in contrast to evolutionarily older anterior and posterior Hox genes (Finnerty and Martindale, 1999; Hueber et al., 2013). Importantly, Hox gene expression is regulated in part by the chromatin organiser CTCF (Rousseau et al., 2014; Narendra et al., 2015), another bilaterian-specific protein (Heger et al., 2012; Supplementary file 1–Supplementary Table 8; Supplementary file 4). As outlined elsewhere, the evolution of CTCF—and other poly-zinc finger proteins—could have provided a mechanism for the creation and regulation of bilaterian Hox gene clusters, once central Hox genes had been added to the repertoire (Heger et al., 2012).

The emergence of the mesoderm as a third germ layer is one of the most characteristic morphological innovations of bilaterian animals. In contrast to previous work, our findings suggest that several genes and gene networks which provide regulatory inputs to mesodermal patterning arose in the bilaterian ancestor. Specifically, we identified orthologues of all Nodal pathway members across bilaterians, but not outside this clade (Figure 5—figure supplement 1, Figure 5—figure supplement 2, Figure 5—figure supplement 3, Figure 5—figure supplement 4; Supplementary file 1–Supplementary Table 14, Supplementary file 1–Supplementary Table 16). The robust bilaterian-specific distribution of these genes, derived from orthology clustering and HMM-HMM searches, implies that the entire Nodal pathway—and its roles in mesoderm specification and left-right asymmetry—is a bilaterian novelty (Figure 5). Although a reasonable speculation, this is currently not supported for all pathway members by phylogenetic analyses and needs to be tested more thoroughly in the future. Together with the bilaterian specificity of additional modulators and effectors of Nodal and/or TGF-β signalling (BAMBI_HUMAN, VWC2_HUMAN, MECOM_HUMAN, Q24605_DROME; Supplementary file 4), these findings suggest that significant changes in TGF-β signalling occurred in the bilaterian ancestor. In addition to the Nodal pathway, several other genes with key roles in mesoderm formation also originated in the bilaterian ancestor, among them the master regulator of muscle cell specification, MyoD, and the Paired mesoderm homeobox protein 1 (PRRX1_HUMAN; Supplementary file 1–Supplementary Table 10) which regulates the formation of preskeletal condensations from undifferentiated mesenchyme during mouse skeletogenesis (Martin et al., 1995). Taken together, we identified multiple genetic factors essential for the differentiation of mesoderm and mesodermal tissues in bilaterians.

In conclusion, we demonstrate that a considerable number of genes has a bilaterian-specific distribution and probably originated in the bilaterian ancestor. While the function of some of these genes is unknown, many of them participate in the formation of key morphological innovations in extant bilaterians, implying that the evolution of specific genes contributed to the formation of bilaterian body plans.

Materials and methods

Sequence collection and database construction

Request a detailed protocol

The sequence repertory for this study was assembled from three parts. Genomic and transcriptomic sequences were collected from the sources listed in Supplementary file 1–Supplementary Table 1, Supplementary file 1–Supplementary Table 3, Supplementary file 2. As third component, selected sequences were downloaded from the NCBI non-redundant protein database.

The 25 genomic sequences were first screened for repetitive sequence content using RepeatMasker V4.0.5 (http://repeatmasker.org) with default parameters. The resulting contigs/scaffolds were translated into six ORFs using the Emboss tool "getorf" (Rice et al., 2000), with a minimum ORF length of 25 AA. Sequences containing strings of "X" characters, a result of translating sequencing gaps and masked repeats, were treated differentially to retain as much information as possible. Sequences with ≥9 "X" in a row were split. After removing the Xs, each flanking region ≥35 valid amino acids was kept and given a new identifier while smaller flanking regions were discarded. These measures decreased sequence count by 46.8%, from 324,788,561 to 172,606,165 ORFs. To further reduce the amount of ORFs, we blasted them against a custom database of opisthokont sequences. This database contained all sequences of opisthokont origin as extracted from the non-redundant protein database at GenBank, release 198 from 21 October 2013 (2,695,641 sequences). We kept ORFs with a BLAST expectation value <10 against this database and thus rejected ORFs that have no detectable similarity to the protein repertoire of opisthokonts. In a final step, we used CD-HIT (Li and Godzik, 2006) with default parameters and 90% identity threshold to remove redundancy. These steps reduced the number of sequences significantly, from initially 324,788,561 to 109,567,344 genomic ORFs.

To fill in the gaps of public sequence repositories and extend coverage, we collected transcriptome data of poorly represented animal groups (Supplementary file 1–Supplementary Table 1, Supplementary file 2). Downloaded transcriptomes were first assayed for completeness using the CEGMA (Core Eukaryotic Genes Mapping Approach) pipeline which reports the coverage of 248 ultra-conserved core eukaryotic genes present in a dataset (Parra et al., 2007). On the basis of CEGMA completeness and phylogenetic placement, we selected transcriptomes of 64 species for the dataset. Their average transcriptome completeness according to CEGMA was 60.8%, with several bilaterian and non-bilaterian species exceeding 90% (Supplementary file 1–Supplementary Table 2). As described for genomes, transcriptomes were then translated into six ORFs. We kept the three longest ORFs for each transcriptome contig, removed Xs, and obtained 11,768,516 transcriptome protein sequences in total (Table 2).

To provide a backbone of published and annotated protein sequences for the genomic and transcriptomic ORFs, we filtered the NCBI non-redundant protein database and kept 2.9 million protein sequences from 204 opisthokont species that had >8000 sequence entries each. Extraction of opisthokont sequences was guided by NCBI taxonomy.

As the combination of sequences from three sources again introduced redundancy, we clustered the final dataset with 90% identity threshold. In a last pre-processing step, we changed the headers of all sequences to obey a consistent naming scheme. It includes the NCBI taxon identifier and a unique sequence ID that allows to distinguish between NCBI-, ORF-, and transcriptome-derived sequences. The final dataset used for this analysis contained 124,031,501 sequences.

Orthology pipeline and clustering

Request a detailed protocol

For orthology clustering, we employed the OrthoMCL pipeline (Li et al., 2003). It utilises a graph-based clustering approach for the generation of orthologous groups on the basis of normalised BLAST similarity measurements between sequence pairs. To enable the processing of our large dataset, we ported to the statistical programming environment R (https://www.r-project.org/) all steps of the original OrthoMCL pipeline that require interaction with a MySQL database. In this way, loading of the database and inference of orthology tables is limited only by the size of the computer’s main memory, not by the speed and additional memory requirements of the underlying MySQL engine, as in the original implementation. By dividing the computation of orthology tables into an appropriate number of steps, our entire dataset could be processed on a compute server with 250 GB memory. Importantly, the R version of OrthoMCL accurately reproduces all steps of the original pipeline (Supplementary file 1–Supplementary Table 4). The collection of scripts for the R version of OrthoMCL is available at https://github.com/prheger/BigWenDB (Heger, 2020; copy archived at https://github.com/elifesciences-publications/BigWenDB).

HMM-HMM searches and database

Request a detailed protocol

We extracted from the BigWenDB sequence collection the individual sequences belonging to each of the 824,605 orthologues groups and calculated 824,605 corresponding multiple sequence alignments using default parameters of the MAFFT v7.304b "einsi" algorithm (Katoh et al., 2005). After converting the alignments into hhm format (hhsearch format for hidden Markov models) with the command "hhmake" and default parameters, we concatenated them to a database that can be searched by hhsearch (parameters in addition to default: "-nodssp -nopred -dbstrlen 100"), according to Söding, 2005. We precomputed HMM-HMM search results for about 20% of orthogroups and issued missing searches on demand. Reciprocal best hit relationships were analysed using custom scripts.

Quality control of clustering and the bilaterian-specific gene set

Request a detailed protocol

Quality control of clustering results and the bilaterian-specific gene set was carried out as described in Appendix 1, sections "Cluster evaluation and quality control" and "Identification of bilaterian-specific genes".

Statistical tests for the enrichment of transcription factors

Request a detailed protocol

To test whether the bilaterian-specific gene set of 157 orthogroups is enriched for transcription factors, we downloaded as control the human proteome with 20,205 protein sequences from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/ and predicted transcription factors in this dataset using the PfamScan software (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/) with E-value cutoff = 5x10-05. We then determined the abundance of 10 prevalent DNA-binding domains in the dataset: "Basic; bZIP_2; HLH; HNF-1_N; Homeobox; Hox9_act; HPD; SOBP; THAP; zf-". Corresponding domains were identified in 1,756 of the 20,205 human reference proteins. We then randomly selected 10× 157 genes from the reference set and specified the number of transcription factors (proteins with the above mentioned domains) in the obtained subsets. While the average number of transcription factors in the 10 control sets was 12.8 ± 4.44, the equally sized bilaterian-specific gene set (157 orthogroups) had 37 transcription factors. Modelling a normal distribution from the obtained mean and standard deviation yielded a p-value of 2.512e-08 for the transcription factor content in bilaterian-specific genes (using the R function "pnorm"). Likewise, a Pearson’s χ~2 test with the corresponding data matrix (1,765:20,205; 37:157), using the R function "chisq.test", yielded a p-value of 3.805e-08. Finally, under the assumption of a binomial distribution (R function "pbinom") and given that there are 1,756 transcription factors in 20,205 human proteins, the probability that we obtain 36 or more transcription factors when drawing 157 random proteins is p < 1.841e-08.

Poly-Zinc finger scan across Opisthokonta

Request a detailed protocol

We downloaded the proteomes of 7 ecdysozoan, 5 lophotrochozoan, 12 deuterostomian, and 4 non-bilaterian species from http://www.uniprot.org/proteomes. On average, each proteome consisted of 28,772 sequences. We scanned all protein sequences for the presence of protein domains using the PfamScan software (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/) with E-value cutoff = 5x10-05 and Pfam database version 31.0. Using command line tools, we identified C2H2 zinc finger proteins in the PfamScan output and counted for every proteome the number of proteins with six or more zinc finger domains. The resulting numbers were used to plot Figure 2—figure supplement 1A,B.

To determine the number of poly-ZF proteins that originated in the ancestor of opisthokonts, metazoans, and eumetazoans, we first extracted from the clustering results orthogroups specific for these lineages. The filtering criteria for selecting opisthokont-specific orthogroups were: Fungi ≥ 20 species, Metazoa ≥ 40 species, Bilateria ≥ 30 species and yielded 2,928 orthogroups of ancient origin. The filtering criteria for selecting metazoan-specific orthogroups were identical, except that no fungi were allowed, and yielded 2,615 metazoan-specific orthogroups. For eumetazoan-specific orthogroups we required the presence of at least 30 bilaterian and 3 cnidarian species, with not more than 2 ctenophore species allowed (according to NCBI taxonomy, both ctenophores and cnidarians misleadingly belong to eumetazoans). Applying these conditions, we obtained 283 eumetazoan-specific orthogroups. Next, we extracted the longest sequence of each opisthokont-, metazoan-, and eumetazoan-specific orthogroup and scanned it with PfamScan (E-value cutoff = 5x10-05). Finally, we counted the number of poly-ZF sequences with at least six domains for each node and mapped the numbers to a phylogeny. Note that this "simple" filtering strategy (Bilateria: ≥ 30 species) would return 662 bilaterian-specific orthogroups, considerably more than the 157 error-corrected orthogroups in the final dataset. The strategy therefore possibly overestimates the number of poly-ZF proteins at the three ancient nodes.

Determining orthogroup ancestors

Request a detailed protocol

To determine the ancestor of the species combined in a given orthogroup, we wrote a custom Perl script that extracts the taxonomic identifiers of each sequence and then determines the last common ancestor of all represented species on the basis of NCBI taxonomy and lineage information (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/). The script generates output that can be parsed and filtered using command line utilities. It is part of the collection of R scripts at https://github.com/prheger/BigWenDB.

Protein domain scans and gene ontology analysis

Request a detailed protocol

We applied strict filtering rules to extract bilaterian-, vertebrate-, and arthropod-specific genes from the Markov clustering results (rule for bilaterian-specific orthogroups: deuterostomes ≥ 7, lophotrochozoans ≥ 4 or 0, ecdysozoans ≥ 4 or 0; for arthropod-specific orthogroups: chelicerates ≥ 2, crustaceans ≥ 0, myriapods ≥ 1, insects ≥ 5; for vertebrate-specific orthogroups: ≥40 of 53 gnathostome species). From each lineage-specific orthogroup obtained by filtering, we extracted the longest sequence and scanned it with PfamScan Version 1.5 (Punta et al., 2012) (available at ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/) at an E-value cutoff of e-05 for the presence of protein domains as classified in PFAM database release 30.0 (release date: 06/16).

To associate the identified protein domains with gene ontology (GO) terms, we utilised the Pfam2GO list at http://geneontology.org/external2go/pfam2go and extracted relevant terms using command line tools. Typically, only a subset of domains was linked to GO terms. We finally created a list with the relative number of identified protein domains and associated gene ontology terms and visualised this list as a word cloud at www.wortwolken.com.

Multiple sequence alignment and phylogenetic analysis

Request a detailed protocol

Multiple sequence alignments required for the HMM-HMM database and phylogenetic analyses were carried out using the MAFFT v7.304b "einsi" algorithm with default parameters (Katoh et al., 2005). Large alignments (>200 sequences) were computed using MAFFT v7.304b with high-speed parameters. For phylogeny, we added ingroup and outgroup sequences from the clustered orthogroup sets or from public repositories, as appropriate, and manually removed indels and unalignable regions from the data prior to analysis. In some cases, for example for Lefty, we generated a hidden Markov model of an orthogroup alignment and searched additional transcriptomic datasets not represented in the BigWenDB for potential orthologues. Phylogenetic trees were computed under the maximum likelihood criterion, using IQ-TREE v1.6.10 (Nguyen et al., 2015) with ModelFinder for fast and accurate model selection (Kalyaanamoorthy et al., 2017), ultrafast bootstrap approximation and optimisation (1000 replicates) (Minh et al., 2013), and Shimodaira-Hasegawa-like approximate likelihood ratio test (SH-aLRT) (command line parameters: "-bb 1000 -alrt 1000 -bnni"). Resulting trees were edited with FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/) and Affinity Designer Version 1.72 (https://affinity.serif.com).

Prediction of protein structure

Request a detailed protocol

After constructing multiple sequence alignments from cnidarian and bilaterian Robo proteins, we identified the transmembrane region (corresponding to sequence "AFIAGIGAACWIILMVFSIWL" in ROBO1_HUMAN) and generated two subsequences overlapping at this feature. One subsequence spanned the extracellular part of the protein plus the transmembrane domain, the other spanned the transmembrane domain plus the cytoplasmic part. We generated the two fragments for seven exemplary Robo proteins, for the deuterostomes Homo sapiens and Strongylocentrotus purpuratus, the lophotrochozoan Lingula anatina, the ecdysozoans Drosophila melanogaster and Trichinella pseudospiralis, and the two cnidarians Hydra vulgaris and Stylophora pistillata. All fragments were uploaded to the Phyre2 web interface (http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index; Kelley et al., 2015) and analysed with modelling mode "intensive" (complete modelling using multiple templates and ab initio techniques).

Identification of metazoan-specific genes

Request a detailed protocol

To obtain a list of genes with metazoan origin, we first blasted 20,205 human genes obtained from uniprot.org against the BigWen database and obtained BLAST hits for 19,322 genes. To reliably map the UniProt queries to orthogroups, we selected queries that had a BLAST hit with high identity (>95%) over at least 100 amino acids. For proteins fulfilling these criteria, we extracted the corresponding orthogroup ID and ancestor, taking into account only orthogroups with at least 75 species to ensure broad sampling. After removing redundancy, we obtained 797 distinct orthogroups of metazoan origin whose human orthologues were used for the stringDB PPI network analysis. A conceptually similar study obtained 1,189 novel metazoan-specific homology groups, which is in reasonable agreement with our result when considering the differences in methodology and datasets (Paps and Holland, 2018).

Protein-protein interaction network analyses

Request a detailed protocol

Protein interaction data were obtained from the STRING database v11.0 of known and predicted protein-protein interactions (PPI; https://string-db.org; Szklarczyk et al., 2017). To construct PPI networks, we first identified the appropriate human orthologues of bilaterian-specific and metazoan-specific orthogroups. We obtained 150 human orthologue IDs for the 157 bilaterian-specific orthogroups and 797 human orthologue IDs for 797 metazoan-specific orthogroups (collected as described above). We uploaded these protein IDs to the STRING browser interface and generated three separate PPI networks, one for bilaterian-specific proteins (B), one for metazoan-specific proteins (M), and a combined network for both taxonomic groups (B + M). The average local clustering coefficients and PPI enrichment p-values we report are based on analyses with default settings, where all evidence types were considered. Further statistical analyses were conducted for the B + M full network and the B + M Nodal-Lefty subnetwork, the latter being defined by the core five bilaterian-specific proteins (Nodal, Lefty, FoxH1, Eomes, and EGF-CFC) and their interaction partners. From the complete list of pairwise protein-protein interactions in the B + M network, data were extracted for the numbers of B – B, M – M, and B – M interactions and assessed by a χ~2 test. Additional calculations were made per protein for the total number of interactions and for the proportion of interactions that involve a bilaterian-specific partner. Boxplots for these values display the median, and whiskers represent 1.5× the value of the Q3 (upper) or Q2 (lower) quartile range, with outliers omitted for clarity. Statistical tests involved χ~2 tests (https://www.socscistatistics.com/tests/chisquare/default2.aspx, accessed 26 August 2019) and non-parametric comparisons in multigroup (Kruskal-Wallis) and pairwise (Mann-Whitney U) assessments as reported, calculated in R version 3.4.0 and from the Python library scipy.stats (function: mannwhitneyu).

Data access

Request a detailed protocol

The R version of OrthoMCL and a script for inferring orthogroup ancestors are available at https://github.com/prheger/BigWenDB. The sequence dataset used to build the BigWenDB and the final clustering results are available at https://doi.org/10.5061/dryad.4qf7168. Several Supplementary Files with original data and Supplementary Tables are linked to this paper at elifesciences.org.

Appendix 1

Orthology pipeline and clustering

To generate clusters of orthologous proteins from the collected sequence data, we used the OrthoMCL pipeline (Li et al., 2003). OrthoMCL is a graph-based method for orthologue group identification that represents sequences as nodes and their similarities as weighted edges. A normalization step adjusts initial similarity scores to reflect species distance and ensures that edge weights for sequence pairs are comparable between different genomes. Finally, the Markov cluster algorithm (van Dongen, 2000) performs random walks on the normalised graph by simulating transition probabilities of sequences to other nodes, thereby revealing an underlying cluster structure. To create the BLAST similarity table required by OrthoMCL, we performed all-vs-all BLAST searches with 124 million sequences (with default BLAST parameters, except "-outfmt 6"; BLAST version 2.2.28). Roughly one million CPU hours were necessary for this task, running hundreds of jobs in parallel on a high performance computing platform. Merging the individual output files, we obtained a similarity score table of ∼500 GB, containing roughly 6 billion BLAST hits (see Supplementary file 1–Supplementary Table 5). In the original implementation, OrthoMCL loads the BLAST output table into a MySQL database and performs subsequent computations within the relational database. Because of its size, we could not load the BLAST output table into a physical MySQL database. We therefore ported all MySQL processes to the statistical computing environment R to execute them in computer memory. Test experiments, carried out in parallel with our R implementation and the original software, produced identical results, demonstrating that the R version of OrthoMCL accurately reproduces the outcome of the standard pipeline (Supplementary file 1–Supplementary Table 4). After obtaining the final table with adjusted pairwise distance information in R, we used Markov clustering (van Dongen, 2000), as in the original protocol, to combine sequences to orthologous groups.

Depending on the origin of compared sequences, OrthoMCL creates three orthologue tables: a table with reciprocal relationships of sequences between different species (orthologue table), a table of within-species relationships (in-paralogues), and a table of co-orthologues with protein pairs that are connected through orthology and in-paralogy. Of 124 million gathered sequences, 122 million had at least one BLAST hit in the database, giving rise to a collection of 6 billion BLAST pairs as raw material for orthology clustering. The OrthoMCL pipeline retained 35 million of these sequences in 806 million pairs of the three orthology tables. Thus, 28.8% of the sequences had enough similarity with other sequences to participate in orthology group construction whereas the majority of input sequences were so remotely related to other sequences that our pipeline could not merge them with a cluster. As expected, artificially generated ORFs represented by far the largest portion of the non-clustered sequences (91.3%).

As we observed that a large in-paralogue table (5.8× larger than the orthologue plus co-orthologue tables for the final dataset) negatively affected the accuracy of the clustering process, we omitted this table in subsequent trials. In the final MCL run, we obtained 824,605 orthologous groups with 6,743,519 distinct sequences derived from 118,499,524 protein pairs (BLAST hits) of the orthologue and co-orthologue tables. Discarding the large in-paralogue table led to a drop in the percentage of clustered sequences from 28.8% to 5.5%, indicating that a considerable amount of orthogroups in the larger dataset consisted of paralogues (Supplementary file 1–Supplementary Table 5).

To investigate the properties of orthogroups as old as bilaterians or older, we plotted for these orthogroups the number of species against their proportion of bilaterians (Appendix 1—figure 1). Position and abundance of many data points in the resulting plot are a consequence of dataset composition. For example, (i) the majority of orthogroups is small, leading to an abundance of solid (because of overlap) data points for small orthogroups (Appendix 1—figure 1, left part; Supplementary file 1–Supplementary Table 5); (ii) bilaterians and non-bilaterians including fungi are groups roughly equal in size (142 vs. 131 species), preventing that bilaterian sequences exceed a coverage of ∼50% in large orthogroups. Similarly, bilaterian content can hardly fall below 40% to 50% in large orthogroups with more than ∼175 species, giving rise to an arrowhead shape at the right side of the plot (Appendix 1—figure 1). (iii) orthogroups with a bilaterian ancestor have, by definition, a bilaterian content of 100% and are therefore spread as dotted red line on top of the plot that is fading away in orthogroups with more than 100 species; (iv) orthogroups with metazoan and eumetazoan ancestor (green and blue) concentrate on the left part of the plot because not more than 33 non-bilaterian metazoans are present in the dataset, restricting orthogroup size. In addition, the low orthogroup density in sectors B2, B3, and C3 suggests that ancient genes, that evolved in the ancestor of eumetazoans or earlier and survived in bilaterians, do not get lost randomly at multiple nodes in the bilaterian tree. Instead, they tend to be maintained across most bilaterian species. It remains to be seen whether this behaviour is specific for bilaterians in this dataset or a general evolutionary pattern.

Appendix 1—figure 1
General properties of sequence clusters from a bilaterian viewpoint.

(A) The proportion of bilaterians per orthogroup is shown as a function of orthogroup size (in terms of species number) for 207,285 orthogroups that trace back to the four ancestors Bilateria, Eumetazoa, Metazoa, and Opisthokonta. Dot colours indicate the orthogroup ancestor and are printed with 85% transparency to reveal overlaps. (B) Orthogroup count (how often orthogroups of a given size are observed) is displayed as function of orthogroup size (number of sequences present in an orthogroup). 34 orthogroups with more than 1,000 sequences were omitted. Almost all of these sizes occurred only once.

Cluster evaluation and quality control

In a first approach to verify the accuracy of our clustering results, we employed as an external benchmark a manually curated set of 70 orthologous groups (Trachana et al., 2011), the orthobench dataset (http://eggnog.embl.de/orthobench). For the members of every orthobench family, we determined the corresponding BigWenDB sequence ID and the cluster ID (orthogroup ID) to which this sequence was assigned during clustering. We then analysed how the members of a given orthobench family were distributed among orthogroups in the BigWenDB. We performed such comparisons for two MCL inflation parameters (I=1.3 and I=1.4) and two database sizes (full database and database without paralogue table). The clustering with the highest agreement to the expected orthobench outcome was the dataset with inflation parameter I=1.3 and without paralogue table (mcl_ortho-coortho_1.3.7; see Supplementary file 3). In this dataset, 46 of 70 protein families were assigned correctly, i.e. in 65.7% of the cases our pipeline combined all members of an orthobench family, as expected, in a single orthogroup. However, BLAST hits that allow correct mapping were not found for all orthobench family members, and some members were mapped to erroneously predicted proteins. In such cases, orthobench members may be linked to an orthogroup different from the rest of the family, leading to the impression that several orthogroups exist for this family. According to our estimates, such mapping errors reduce accuracy by at least 5%, suggesting a correct orthology inference rate above 70% for our dataset. In contrast, only 10% to 48% of reference orthogroups were predicted correctly in the orthobench comparison (Trachana et al., 2011), indicating that the representative coverage of our dataset considerably improves orthogroup inference quality.

Evolutionary relationships of homeodomain-containing genes are difficult to trace because of the strong conservation and shortness of the homeodomain (60 AA) (Irvine et al., 1997; Kourakis and Martindale, 2000). To understand how our study deals with these difficulties, we analysed the composition of orthogroups containing NK (Nirenberg-Kim) homeobox genes. Like Hox and ParaHox gene clusters, the NK cluster is a close association of homeobox genes with crucial roles in animal development. It consists of the six genes tinman, bagpipe, ladybird (early and late), C15, and slouch in D. melanogaster. They are all involved in mesodermal patterning (Kim and Nirenberg, 1989; Jagla et al., 2001). Genomic data from vertebrates and the cephalochordate Branchiostoma indicate that the NK cluster is an ancient feature of bilaterians, but has been duplicated and split repeatedly in chordate history, leading to the presence of four dispersed clusters and multiple paralogues of each gene in humans (Luke et al., 2003). Several rearrangements have also been observed in the NK cluster of arthropods (Chan et al., 2015). In addition, studies of the homeodomain gene complement of sponges and cnidarians revealed that NK cluster genes predate the evolution of bilaterians (Ryan et al., 2006; Larroux et al., 2007). Given these findings, we can expect that NK homeobox genes from diverse metazoans (sponges, cnidarians, vertebrates, and insects) are each represented in a single orthogroup. Analysing the orthogroups of all Drosophila and human NK cluster genes revealed that, indeed, bilaterian and non-bilaterian orthologues of the five NK genes were joined in five corresponding groups (Supplementary file 1–Supplementary Table 6). These five orthogroups contained sequences from 81 to 128 (of 142) bilaterian species, including the known Drosophila and human NK genes, as well as sponge, cnidarian, and ctenophore sequences. We found placozoan sequences in a single orthogroup, OG_613 (NKX2), suggesting the previously unknown existence of NK class homeobox genes in Placozoa (Monteiro et al., 2006). In contrast to other NK genes, Drosophila tinman is not located in the group of its vertebrate counterparts NKX2.3/2.5/2.6 (OG_613; Supplementary file 1–Supplementary Table 6). It has been shown previously that orthology relationships between tinman and vertebrate NKX2 genes are difficult to establish because of the fast evolving insect tinman genes (Harvey, 1996; Saudemont et al., 2008). In line with these observations, tinman was assigned to a small orthogroup restricted to endopterygote insects (OG_92160) while other putative NKX2 orthologues from a wide range of arthropods (32/37 species) were combined with vertebrate NKX2 genes in orthogroup OG_613.

Consistency between our method and an independent method would further underline the reliability of inferred orthogroups. We therefore prepared our data for a control run with the orthogroup inference algorithm OrthoFinder that, in contrast to OrthoMCL, takes into account a so far unrecognised gene length bias (Emms and Kelly, 2015). However, the number of pairwise BLAST similarity tables, resembling OrthoFinder’s input, increases quadratically with the number of species, and so does the amount of required main memory. With 80 species and 6,320 corresponding BLAST tables, approximately 250 GB of memory are occupied, precluding a run with the full dataset (273 species; 74,256 BLAST tables) on current computers. OrthoFinder thus cannot be used to confirm our data until it is adapted to large data sets, in turn illustrating the power of our modified version of the OrthoMCL pipeline.

Taken together, the assessment of clustering quality using a benchmark and a homeobox gene set indicates that orthology prediction in the BigWenDB accurately captures known evolutionary relationships of difficult target genes over large evolutionary distances. We conclude therefore that our cluster results are well suited as raw material for the search of bilaterian-specific genes.

Identification of bilaterian-specific genes

To infer lineage-specific genes, we determined on the basis of NCBI taxonomy (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz) the last common ancestor of the species present in all 824,605 orthologous groups of the final clustering. Together with other ancient groups such as Metazoa, Eumetazoa, or Opisthokonta, the taxon Bilateria is among the top ten of taxa with the highest counts (42,946 bilaterian-specific orthogroups; Supplementary file 1–Supplementary Table 25). While these counts include all orthologue groups that trace back to a given ancestor, the majority of groups contains only few species (see Figure 4, Appendix 1—figure 1, Supplementary file 1–Supplementary Table 5). To obtain meaningful groups with a broad representation across bilaterians, we required that at least 10% of the species of each bilaterian super-phylum must be present (Ecdysozoa ≥ 6, Lophotrochozoa ≥ 4, and Deuterostomia ≥ 7 species). We included orthogroups with zero ecdysozoans or lophotrochozoans if the count for the two other super-phyla met the 10% threshold, thereby allowing for the loss of bilaterian-specific genes in ecdysozoans or lophotrochozoans. Following these rationales, we obtained 345 bilaterian-specific groups.

At least four types of error might impair our set of bilaterian-specific orthologous groups: (1) An orthogroup is judged older than bilaterians, but is in fact bilaterian-specific (orthogroup too large), (2) an orthogroup is inferred to be bilaterian-specific, but is in fact older (orthogroup too small), (3) an orthogroup is found to be bilaterian-specific, but is in fact younger (orthogroup too large), (4) an orthogroup is considered younger than bilaterians, but is in fact bilaterian-specific (orthogroup too small).

The presence of several bilaterian sequences and a single sequence from an earlier branching eukaryote would conceal the potential bilaterian ancestry of an orthogroup (type 1 error). We therefore searched for orthologue groups with broad bilaterian representation, according to our above mentioned rules, and up to two outgroup sequences. Of 349 orthogroups satisfying these criteria, the majority (263 or 75.3%) contained as outliers sequences of cnidarian origin, the sister group of bilaterians. To maximise the likelihood of detecting true outliers, we considered only organisms without direct sister group relationship for further analysis and obtained 86 additional bilaterian-specific candidate groups with one or two non-bilaterian/non-cnidarian sequences. As the probability is high that these orthogroups contain phylogenetically unrelated outliers and actually originated in the bilaterian ancestor, we ranked them, together with the 345 previous orthogroups, in a set of 431 bilaterian-specific orthogroups.

Type 2 errors can arise if the MCL algorithm does not combine a group with bilaterian ancestry and a group with related sequences from non-bilaterian species although both groups might represent a single natural orthology group. To identify such errors, we computed for all 824,605 orthogroups multiple sequence alignments and turned them into profile hidden Markov models (HMMs) that describe alignment consensus sequences in a probabilistic way (Eddy, 1998). We then assembled a database from the HMMs and searched the two next similar profiles for every bilaterian-specific group using sensitive HMM-HMM alignments (Söding, 2005). We devised a new reciprocal HMM-HMM alignment comparison step, analogous to the strategy of reciprocal best BLAST hits (Tatusov et al., 1997; Ward and Moreno-Hagelsieb, 2014), to discover bidirectional best hit orthogroup pairs prognostic for common descent. To demonstrate the power of this method, we analysed the orthogroup distribution of two example proteins, Sprouty, an inhibitor of FGF signalling, and the insulator protein GAGA factor. We found that the orthogroups of both, D. melanogaster Sprouty and D. melanogaster GAGA factor, were smaller than anticipated considering their reported phylogenetic distribution (Matus et al., 2007; Heger et al., 2013). In both cases, the reciprocal best hit strategy allowed us to detect highly similar orthogroups with known Sprouty and GAGA factor orthologues that complemented the original orthogroup. After fusion of query and reciprocal best hit orthogroups, the resulting sequence collections matched the expected phylogenetic coverage (Supplementary file 1–Supplementary Table 7). Encouraged by these findings, we examined the 431 bilaterian-specific orthogroups accordingly and excluded orthogroups from the list if they satisfied three criteria: (i) their best or second best HMM-HMM hit modifies the ancestor of the resulting fusion group, (ii) their best or second best hit orthogroup is a reciprocal best hit, and (iii) their best or second best hit orthogroup does not contain more than three bilaterian species. With the last criterion we avoid to eliminate orthogroups whose reciprocal best hit is an ancient orthogroup with wide bilaterian representation, an indicator of homology rather than of orthology. The majority of bilaterian-specific orthogroups (84.2% or 363/431 orthogroups) were not affected by this procedure. Therefore we considered them high-confidence bilaterian-specific orthogroups. On the other hand, 68 bilaterian-specific orthogroups (15.8%) were possibly false positives and may have originated in pre-bilaterian time.

If, for example, several insects and a single sequence from a vertebrate populate an orthogroup, a bilaterian ancestor would be computed for this group although, from a phylogenetic point of view, the single vertebrate sequence is more likely an outlier added to the group erroneously. The filtering rules mentioned above require that at least 10% of the species in each super-phylum are present in a group to qualify as bilaterian-specific. They effectively prevent type 3 errors in our list of bilaterian-specific orthogroups that were caused by the addition of <4 sequences. In contrast, we cannot currently prevent potentially wrong orthology inference if four or more sequences of an ancestor-changing lineage were added erroneously (for example, four ecdysozoan sequences added to an otherwise mammalian-specific orthogroup). However, this error mainly affects small bilaterian-specific orthogroups with only few sequences from deuterostomes, lophotrochozoans, and/or ecdysozoans because of their lack in representativeness. Detailed phylogenetic analysis as well as improved taxon sampling would be necessary to discover such false-positive assignments.

Type 4 errors occur if an orthogroup is estimated younger than bilaterians, but is—accidentally—not joined with another, similar orthogroup that would convert the ancestor to Bilateria if combined with the original group (for example, a vertebrate-specific orthogroup and a highly similar insect-specific orthogroup would create a bilaterian-specific orthogroup). To detect such errors, it is necessary to perform all-vs-all profile comparisons of the orthogroups younger than bilaterians. Next, combinations of similar groups need to be determined that would shift the former individual ancestors to a new common bilaterian ancestor and that are each other’s bidirectional best hit. Due to the high computational investment we refrained from further investigating this error source in this manuscript.

To further probe accuracy of the 363 bilaterian-specific orthogroups, we mapped human and D. melanogaster sequences contained in these orthogroups to the respective genome (versions hg38 and dm6) using BLAT (Kent, 2002). Such mapping was possible for 348/363 orthogroups (95.87%). We then checked whether the target gene to which these sequences were assigned, belonged to the initial orthogroup. This was not true in a considerable number of cases. For example, often bilaterian-specific orthogroups contained short ORFs from H. sapiens or D. melanogaster that mapped to a particular gene. The corresponding full length protein, however, was assigned to a different orthogroup with a different ancestor, indicating that separation of genes into two or more orthogroups affected integrity of the 363 orthogroups set. We therefore excluded all orthogroups with potential mapping inconsistencies and arrived at a set of 204 bilaterian-specific genes. As a final validation step, we blasted at NCBI (non-redundant GenBank version from May 24, 2017) all human or D. melanogaster orthologues, which are present in the 204 bilaterian-specific orthogroups, against non-bilaterian metazoans (Metazoa excluding Bilateria and Mesozoa). A reciprocal best hit analysis of the BLAST results indicated that 47 genes, corresponding to 47 orthogroups, might contain orthologues in non-bilaterian species although our orthology prediction pipeline did not detect them. As substantial work is required to confirm or reject these potentially false-positive orthogroups, we removed them from the list and arrived at a final number of 157 orthogroups. These 157 orthogroups represent a minimal set of high-confidence bilaterian-specific orthogroups which is free of most errors present in other orthology databases.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
    Back in time: a new systematic proposal for the Bilateria
    1. J Baguñà
    2. P Martinez
    3. J Paps
    4. M Riutort
    (2008)
    Philosophical Transactions of the Royal Society B: Biological Sciences 363:1481–1491.
    https://doi.org/10.1098/rstb.2007.2238
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
    The earliest fossil record of the animals and its significance
    1. GE Budd
    (2008)
    Philosophical Transactions of the Royal Society B: Biological Sciences 363:1425–1434.
    https://doi.org/10.1098/rstb.2007.2232
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
    Biogenic amines in coelenterates
    1. M Carlberg
    2. M Anctil
    (1993)
    Comparative Biochemistry and Physiology Part C: Pharmacology, Toxicology and Endocrinology 106:1–9.
    https://doi.org/10.1016/0742-8413(93)90250-O
  32. 32
  33. 33
  34. 34
    The Caenorhabditis elegans MYOD homologue HLH-1 is essential for proper muscle function and complete morphogenesis
    1. L Chen
    2. M Krause
    3. M Sepanski
    4. A Fire
    (1994)
    Development 120:1631–1641.
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
    Darwin's dilemma: the realities of the Cambrian ‘explosion’
    1. S Conway Morris
    (2006)
    Philosophical Transactions of the Royal Society B: Biological Sciences 361:1069–1083.
    https://doi.org/10.1098/rstb.2006.1846
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
    The Cambrian Explosion: The Construction of Animal Biodiversity
    1. DH Erwin
    2. JW Valentine
    (2013)
    Greenwood Village, United States: Roberts and Company Publishers, Inc.
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
    Cnidarian chemical neurotransmission, an updated overview
    1. G Kass-Simon
    2. P Pierobon
    (2007)
    Comparative Biochemistry and Physiology Part A: Molecular & Integrative Physiology 146:9–25.
    https://doi.org/10.1016/j.cbpa.2006.09.008
  92. 92
  93. 93
  94. 94
  95. 95
  96. 96
  97. 97
  98. 98
  99. 99
  100. 100
  101. 101
  102. 102
  103. 103
  104. 104
  105. 105
  106. 106
  107. 107
  108. 108
  109. 109
  110. 110
  111. 111
  112. 112
  113. 113
  114. 114
  115. 115
  116. 116
  117. 117
  118. 118
  119. 119
  120. 120
  121. 121
  122. 122
  123. 123
  124. 124
  125. 125
  126. 126
  127. 127
  128. 128
  129. 129
  130. 130
  131. 131
  132. 132
  133. 133
  134. 134
  135. 135
  136. 136
  137. 137
  138. 138
  139. 139
  140. 140
  141. 141
  142. 142
  143. 143
  144. 144
  145. 145
  146. 146
  147. 147
  148. 148
  149. 149
  150. 150
  151. 151
  152. 152
  153. 153
  154. 154
  155. 155
  156. 156
  157. 157
  158. 158
  159. 159
  160. 160
  161. 161
  162. 162
  163. 163
  164. 164
  165. 165
  166. 166
  167. 167
  168. 168
  169. 169
  170. 170
  171. 171
  172. 172
  173. 173
  174. 174
  175. 175
  176. 176
  177. 177
  178. 178
  179. 179
  180. 180
  181. 181
  182. 182
  183. 183
  184. 184
  185. 185
  186. 186
    Control of Drosophila imaginal disc development by rotund and roughened eye: differentially expressed transcripts of the same gene encoding functionally distinct zinc finger proteins
    1. SE St Pierre
    2. MI Galindo
    3. JP Couso
    4. S Thor
    (2002)
    Development 129:1273–1281.
  187. 187
  188. 188
  189. 189
  190. 190
  191. 191
  192. 192
  193. 193
  194. 194
  195. 195
  196. 196
  197. 197
  198. 198
  199. 199
  200. 200
  201. 201
  202. 202
  203. 203
  204. 204
  205. 205
  206. 206
  207. 207
  208. 208
    Graph clustering by flow simulation
    1. S van Dongen
    (2000)
    University of Utrecht, PhD thesis.
  209. 209
  210. 210
  211. 211
  212. 212
  213. 213
  214. 214
  215. 215
  216. 216
  217. 217
  218. 218
  219. 219
  220. 220
  221. 221
  222. 222
  223. 223
  224. 224
  225. 225

Decision letter

  1. Patricia J Wittkopp
    Senior and Reviewing Editor; University of Michigan, United States
  2. Ingo Ebersberger
    Reviewer

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This paper provides a comprehensive search for proteins whose evolutionary emergence can be dated to the last common ancestor of the Bilateria. The main aim of this study is to shed light on the genetic changes that underlie account of key innovations in the bilaterian body plan. For this purpose, the authors have compiled an impressive data collection comprising protein coding genes from pre-annotated data, a filtered collection of genomic ORFs stemming from a six-frame translation of entire genome sequences, and eventually transcriptome data for taxa lacking a genome sequence. The key asset of this data collection, when compared to existing ortholog databases, is the-according to contemporary standards comprehensive-sampling of non-bilaterian metazoan species. The much more comprehensive inclusion of non-bilaterian datasets in this study allows a more accurate orthogroup clustering than previous studies. Moreover, this database of sequences presented is a rich resource for reconstructing gene content at any metazoan node.

Decision letter after peer review:

Thank you for submitting your article "The genetic factors of bilaterian evolution" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by Patricia Wittkopp as the Senior Editor.

The following individuals involved in review of your submission have agreed to reveal their identity: Ingo Ebersberger (Reviewer #2). The reviewers have discussed the reviews with one another and the Senior Editor has drafted this decision to help you prepare a revised submission.

Summary:

In their manuscript, Heger et al. present a comprehensive search for proteins whose evolutionary emergence can be dated to the last common ancestor of the Bilateria. The main aim of this study is to shed light on the genetic changes that underlie account of key innovations in the bilaterian body plan. For this purpose, the authors have compiled an impressive data collection comprising protein coding genes from pre-annotated data, a filtered collection of genomic ORFs stemming from a six-frame translation of entire genome sequences, and eventually transcriptome data for taxa lacking a genome sequence. The key asset of this data collection, when compared to existing ortholog databases, is the-according to contemporary standards comprehensive-sampling of non-bilaterian metazoan species. For the ortholog search, they present a modified version of the orthoMCL algorithm. The resulting orthologous groups are then subjected to a number of intuitive, yet largely ad hoc, filters to identify a set of 157 orthologous groups the authors date to the last common ancestor of the Bilateria. The remainder of the manuscript is then dedicated to discuss these 157 orthologous in the context of bilaterian evolution.

BigWenDB, a new database maximising resolution at the bilaterian origin, building on opisthokont sequences from GenBank, as well as transcriptomes available for non-bilaterian animals, is an important contribution for the field. Orthogroups are identified using a modified the OrthoMCL pipeline, and validated against an external benchmark set of 70 manually curated orthogroups. An HMM-HMM comparison step evaluates orthogroup completeness. The main value of this studies lies in the much more comprehensive inclusion of non-bilaterian datasets in BigWenDB, allowing a more accurate orthogroup clustering than with previous efforts. This database is a rich resource for reconstructing gene content at any metazoan node. Clustering reveals a clear pattern of enrichment among the 157 bilaterian-specific genes, with transcription factors and membrane proteins of various kinds dominating – in contrast to other nodes of the bilaterian tree. Also, the identification of monoamine neurotransmitter reception and nodal signalling as bilaterian-specific innovation is a valuable step forward in our understanding of bilaterian evolution.

Essential revisions:

1) The authors present a way of compiling orthologous groups from their data, which deviates from currently available methods. In particular, the hmm-hmm comparison is, to my knowledge, unprecedented. This pipeline is presented as a main methodological innovation of this manuscript. Although I appreciate that the authors are concerned about the performance of their method, the data set used for method validation is way too small to thoroughly benchmark the method. I strongly suggest that the authors make use of the public benchmark service for ortholog search tools at https://orthology.benchmarkservice.org (Altenhoff et al., 2016) for this purpose.

2) The authors do not consider the limited sensitivity of Blast based ortholog search tools, which leads to an underestimation of gene age (e.g. Elhaik et al., 2006; Luz et al., 2006; Moyers and Zhang, 2015, 2016, 2017; Jain et al., 2019). In this context, it is a bit worrying, that proteins involved in regulatory functions (transcription factors), and proteins located in the membrane appear enriched in the set of Bilateria-specific proteins. Evidence exists for both kinds of proteins that their particular rates and mode of evolution increases the risk of missing distantly related orthologues (Jain et al., 2019). Although the reciprocal hmm-based search could ameliorate this problem, it is likely to be not helpful. This is, because the high evolutionary rate of such proteins paired with the presence of very common functional domains, e.g. a Zn-Finger, probably results in rather uninformative hidden Markov models.

3) I miss the definition of clear-cut evolutionary scenarios. For example, it is unclear what it means when a gene emerges in the LCA of the Bilateria. Is it a de-novo gene birth? Or do the authors also take into account lineage specific gene duplications, probably in combination with domain gain or loss? From what I see in the data I have the impression that the authors consider both. I will exemplify my concern using Figure S14. It shows a phylogeny comprising both FoxH1-an alleged bilaterian invention-and the older FoxD. The way how the tree is drawn implies that it is rooted. If so, it can only be interpreted in the following way: A gene duplication at the root of the metazoa gave rise to FoxH1 and to FoxD. While FoxD is still seen in metazoa, which branched off prior to bilaterian diversification, FoxH1 is nowadays retained only in the Bilateria, and the corresponding orthologs were lost in earlier branching taxa. If this were true, then FoxD would be as old as the animals. Alternatively, one could consider an outgroup rooting. Then FoxD and FoxH1 diversified only at the root of the Bilateria. This diversification would then be the true innovation, but then FoxD and FoxH1 would be co-orthologous to the proteins in the early branching animals. I think the authors have to be way more specific here.

4) The authors use phylogenetic trees to support, or more often to reject an orthology relationship, e.g. subsection “Changes in axon guidance accompany bilaterian evolution”. I am missing topology tests, e.g. the Shimodaira Hasegawa test, to show that a gene tree indeed differs significantly from the species tree. This is necessary, because the phylogenetic information within one orthologous group is, in many cases, not sufficient for accurately resolving the evolutionary history of the corresponding proteins.

5) The resampling approach to show that bilaterian genes are significantly more connected in PPI networks than it is expected by chance is likely to be confounded with gene age. There is evidence that evolutionarily older genes are more central in PPI networks than younger ones (Kim and Marcotte, 2008). Any resampling method that also selects young genes is therefore likely to result in less connected networks.

6) The reciprocal best hit approach, which is used in the final filtering of the candidates is problematic. Reason is, that the initial Blast against non-redundant genbank can have any paralog as a best hit, since the assumption that the entire proteome of each species is represented in the data base is probably not valid.

7) Based on what evidence do the authors consider five independent losses unlikely? And what is an “undetermined orthology”. Can the authors reject that the poriferan sequence is an ortholog of Eomes, and if so again based on what evidence?

8) To consider: one reviewer found the data set overly complex. Because the story is pretty complex, they thought it would do very little harm to omit the genomic ORFs from the analysis.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "The genetic factors of bilaterian evolution" for further consideration by eLife. Your revised article has been evaluated by Patricia Wittkopp (Senior Editor), and two reviewers, including the guest Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined in the comments from the re-review below.

Reviewer #1:

This is an interesting and important manuscript dealing with the identification and reconstruction of the ancestral pool of bilaterian genes and bilaterian innovations.

Different authors used different approaches, and the story is ongoing. I consider the manuscript as a next step to decipher the complex story of the amazing diversification among bilaterians.

In my opinion, it is also a valuable reference paper, and this study would be used by the biological research community for future efforts to reconstruct the set of genes underlying bilaterians innovations.

The paper is supported by an extensive supplement, which is an additional and beneficial reference source of specific information about hundreds of genes.

The authors performed a very careful revision of the manuscript and clarified the previously raised questions. The authors identified a subset of about 150 genes as bilaterian-specific innovations and provided a solid argument to support their gene selection criteria.

In general, the paper advances the field and will be an important web resource of comparative data for future investigations.

One of the major critical points: The authors mainly provide the description of identified genes and relatively little critical analyses. I understand that the selected design of the manuscript might reflect the nature of the complex bioinformatic study and reflect a limited knowledge about the functions of many genes in different bilaterian lineages. Nevertheless, I would recommend to critically reread the manuscript and carefully think about generalized statements, when they use references to the vertebrate/arthropod data vs. other lineages.

There are some moderate and minor comments, which might improve this manuscript.

1) Introduction: "In contrast, the evolutionary relationships of non-bilaterian metazoans are still a matter of debate, in particular the relative positions of placozoans, ctenophores, and sponges (Brooke and Holland, 2003; Ryan et al., 2013; Pisani et al., 2015; Simion et al., 2017; Feuda et al., 2017). "

– It is a biased reference list with three references stressing one point; the alternative reconstruction of the animal phylogeny should also be presented (e.g., Whelan et al., 2015 – PNAS; Whelan et al., 2017)

2) Introduction: Genomic sequencing of non-bilaterian animals revealed that essentially all signalling pathways and developmentally important genes from bilaterians are also present in non-bilaterians, indicating that these genes evolved before the advent of bilaterians.….

– "…all…important genes" is an incorrect and biased statement. Frankly, we do not know all important genes for most of the bilaterian lineages. Regarding all signaling pathways, many genes related to neuronal signaling and immune systems, for example, are absent in some non-bilaterian metazoans (see for examples in Moroz and Kohn, 2016)

3) Result subsection “Bilaterian G protein-coupled receptors and the control of physiological state through circulatory flow”, paragraphs one and two.

– Discussion about the monoaminergic reward system, as a generalized statement for bilaterians, is highly speculative. I would recommend using more careful wordings since the functions of many GPCRs are putative and speculative.

-

4) Result subsection “Bilaterian G protein-coupled receptors and the control of physiological state through circulatory flow”, paragraph three .

– Speculations about the peptidergic reception. There are many uncertainties about GPCRs in the majority of bilaterians, and it is difficult to say about functions and ligands of putative (neuro)peptide receptors. I suggest using more careful wording.

Reviewer #2:

In their revised version, the authors have addressed many of my initial concerns, and I trust that this is a relevant study, in principle.

However, some issues remain, which I think can be addressed by carefully re-working the argumentation.

1) I asked in my initial review about a benchmark of the new ortholog assignment method making use of the QfO reference proteome set. The authors chose to not follow this up and justified this with two arguments.

a) First, they stated that their core method is basically a re-implementation of OrthoMCL, and use the times the OrthoMCL paper has been cited as kind of a quality criterion. Needless to say that I am not happy with this argument. OrthoMCL has been shown to have, compared to other methods, a very high false positive rate of 16% (Chen et al., 2007). Thus, it is probably not the number of citations for OrthoMCL, but rather the performance of this tool, the authors should use in their argumentation. I strongly encourage to adapt the discussion in this direction. If they have the feeling that a high false positive rate does not matter, which of course is the case when a high sensitivity is desired, and false positives are either sorted out at a later curation step, or false positive make the analysis more conservative, and thus are ok, then they should explain this to the reader.

b) As a second argument, the authors state that the key advantage „(...) is not the core method (OrthoMCL), but the inclusion of several steps for error correction after orthology clustering, including a new HMM-HMM reciprocal best hit test for orthogroup completion (see, e.g., Supplementary file 1—supplementary table 7). These additional steps (together with careful sampling) distinguish our pipeline from the original OrthoMCL and from other orthology pipelines and influence the interpretation of cluster/orthogroup origin. The additional steps are computationally demanding and require, at present, extensive manual processing. We performed them, as proof-of-principle, for the 157 orthogroups with assumed bilaterian ancestry, but this approach is not yet scalable. Applying the steps for error correction to all clustered outputs for reference proteomes from a public database is therefore not feasible."

I am a bit puzzled reading this response. If I understand the answer correctly then the authors state that they perform a semi-automated curation of orthologous groups predicted by orthoMCL, which is absolutely fair. In their manuscript, however, the authors advertise their method as "A unique orthology pipeline for the identification of bilaterian-specific genes", which creates the (untested) impression that the pipeline outperforms existing ortholog search tools. I see two possibilities. Either, the authors present their method as a curation step without the claim of having developed a new ortholog search pipeline. Or, alternatively, they perform a comprehensive benchmark comparing it to existing and state of the art “orthology pipelines”. As a last aspect to consider, part of the QfO benchmark analyses are independent of the QfO reference proteomes. For example, tree discordance tests can be, in principle performed with any underlying set of proteomes.

To sum it up, my main concern is not necessarily the method itself, except the orthologous group completion step using hmm-hmm searches, of course. Rather, I am very much worried about the fact that the authors generate the impression of having developed a new pipeline for orthologous group compilation, which apparently performs better than existing software.

2) My second concern was the limited sensitivity of Blast in ortholog searches. I very much appreciate that the authors have considered this now in the Discussion. In my opinion, some aspects require further attention. Moyers and Zhang, as well as Jain et al. have shown that the sensitivity of Blast can become limiting even when searching for orthologs in the same kingdom. I think it would be a good idea to openly discuss the issue of limited sensitivity ignoring the fact that you “only” search within the same kingdom. It is then straightforward to propose extensive taxon sampling together with the very sensitive hmm-hmm comparisons as a way out of this problem. The shift to a different search method comes then at the cost of a substantially decreased specificity in the ortholog assignment (see my point above). The authors can then explain why they think that an elevated rate of false positive either do not harm, or how they get rid of them in a downstream curation.

3) The authors write in their response "Our primary intention was to generate molecular phylogenies as a means of validating orthogroup clustering, not for inferring specific evolutionary scenarios directly from the gene trees." I cannot follow this argumentation, and I am convinced that I misinterpret the answer. Phylogenetic trees are the basis for deriving evolutionary scenarios on the sequence level, and I cannot see how this cannot be the case. In this context, I regret to say that I find the sentence "The relationship between orthology clusters is imposed by the method and does not necessarily reflect the timing of evolutionary events" impossible to understand. A phylogenetic tree that does not reflect the timing of evolutionary events is either wrong or unrooted. If the latter is the case, then there has to exist the possibility to place the root such that the evolutionary scenario emerges that is proposed in the text. If this is not the case, then again either the tree or the proposed scenario is wrong.

Concerning the precise example of FoxH1: Given the tree shown in Figure 5—figure supplement 3, there is no way that FoxH1 is a bilaterian invention. The FoxH1 lineage split from its sister group (either FoxQ1 or FoxD) via a gene duplication prior to the diversification of the bilateria. If no FoxH1 sequences outside the Bilateria were found, then they were either lost-which still does not make FoxH1 a bilaterian invention-or the taxon sampling was not sufficient, and non-Bilaterian FoxH1 sequences await their detection. Alternatively, the tree is wrong, but then it should not be shown. I consider this a major issue.

4) As a response to my concern about missing topology tests, the authors have recomputed the trees and have added approximate SH-LRT support values. They write in their answer "Although the new phylogenies fully support the conclusions derived from the original analysis using RAxML, we acknowledge that some trees show high gene/species topology discordance within a given protein clade(orthogroup). We accept that this is, in part, the consequence of including mixed data types, such as the shorter ORF sequences, and sequence sets not optimized for phylogeny because they are often derived directly from the orthogroups obtained in our clustering." I do not get the point about why it is a problem of using sequences directly derived from orthogroups. Is this again an issue of missing specificity in the ortholog assignments?

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "The genetic factors of bilaterian evolution" for consideration by eLife. Your article has been considered again by one of the original reviewers, and the evaluation has been overseen by Patricia Wittkopp as the Senior Editor. While we found most of the reviewers’ prior concerns addressed satisfactorily, there was also one major concern not fully addressed. This concern must be addressed in an additional revision before we can consider this manuscript further.

This is a highly complex manuscript, and all conclusions essentially depend on the accuracy of the orthology assignment and the correct phylogenetic interpretation. This was pointed out in the previous round of reviews using the FoxH1 tree as an example. This tree suggests that FoxH1 is as old as the metazoa, and the same is true for EOMES (Figure 5—figure supplement 3 and 4). In the response, the authors state that there are several reasons why the trees shown might not accurately reflect the evolutionary history, and mainly mention heterotachy following gene duplication, which makes the gene look older than it is. This may or may not be correct. To further corroborate their argumentation, the authors state that independent gene loss, which would be required to explain the present day distribution of the respective genes is unlikely, and there is no report for such parallel losses in the literature. Previously published work, including https://www.biorxiv.org/content/10.1101/2020.04.23.058008v1 and https://www.nature.com/articles/s41467-018-03667-1 are at odds with this statement.

We recognize that reconstructing evolutionary relationships with individual sequences is hard. But if the tree is at odds with the conclusion, then this has to be addressed with analyses, and not just by saying “probably the tree is wrong”.

We see two ways out of this dilemma:

One is for you to perform additional analyses, which will take time. This would likely have to include the results of the ortholog searches in a well-defined set of species, such that presence/absence of orthologs can be interpreted along a (known species) tree. For example, at the moment, it is unclear why the tree of Nodal features about 30 vertebrate sequences, that of FoxH1 only about 20 sequences, and that of EOMES only 2. Although these numbers are not relevant when it comes to the age of the genes, they tell us something about the precision and rigor of the analysis.

The other possibility is that the authors could modify the manuscript to avoid the term “innovation” where it is not backed up by the data. If the tree argues for an evolutionarily old gene, which nowadays – and according to the analyses of the authors – is confined to the bilateria, why not simply call it a “bilaterian-specific” gene, leaving it for the moment open whether it was indeed an “innovation” or a selective retention of an older gene. In essence, the authors would then argue with their phylogenetic profiles, which seem to be comprehensive, and not with trees, which they indicate in the response might be wrong anyways. If the authors then wish, they could speculate that the gene is indeed a bilateral innovation, but it has to be clear that this is a speculation that would require more thorough analyses to be tested.

https://doi.org/10.7554/eLife.45530.sa1

Author response

Essential revisions:

1) The authors present a way of compiling orthologous groups from their data, which deviates from currently available methods. In particular, the hmm-hmm comparison is, to my knowledge, unprecedented. This pipeline is presented as a main methodological innovation of this manuscript. Although I appreciate that the authors are concerned about the performance of their method, the data set used for method validation is way too small to thoroughly benchmark the method. I strongly suggest that the authors make use of the public benchmark service for ortholog search tools at https://orthology.benchmarkservice.org (Altenhoff et al., 2016) for this purpose.

The reviewers address an important question concerning the reliability of our orthology groupings and suggest that we benchmark our method using a public benchmark service for orthology search tools (https://orthology.benchmarkservice.org). This service provides reference proteomes of 78 archaea, bacteria, and eukaryotes (~1.3 million sequences) for processing in newly developed orthology pipelines, and the resulting ortholog clusters are uploaded to the benchmark service website for comparison with other pipelines and statistical analyses of method accuracy. We have used the reviewers’ suggestion as an opportunity to explore several aspects of dataset and method validation. However, the result is that we refrained from using this particular service for several reasons:

First, our core method for orthology inference is identical to OrthoMCL (Li et al., 2003), a well-established and widely used method (>3700 citations in Google Scholar). In our work the only difference to the original OrthoMCL method is porting the code (Perl and MySQL) to the statistical programming language R to be able to process a larger number of sequences. We show in the manuscript that the original and our Rbased pipeline produce identical results (Supplementary file 1—supplementary table 4) and therefore we cannot expect clustering differences between the two versions.

Importantly, the key advantage of our workflow is not the core method (OrthoMCL), but the inclusion of several steps for error correction after orthology clustering, including a new HMM-HMM reciprocal best hit test for orthogroup completion (see, e.g., Supplementary file 1—supplementary table 7). These additional steps (together with careful sampling) distinguish our pipeline from the original OrthoMCL and from other orthology pipelines and influence the interpretation of cluster/orthogroup origin.

The additional steps are computationally demanding and require, at present, extensive manual processing. We performed them, as proof-of-principle, for the 157 orthogroups with assumed bilaterian ancestry, but this approach is not yet scalable. Applying the steps for error correction to all clustered outputs for reference proteomes from a public database is therefore not feasible. We would be happy to help develop automatic solutions for these steps in the future to make the method accessible for a wider community.

Moreover, any clustering pipeline heavily depends on the input dataset, and as the reviewers note a key innovation is the content of our BigWenDB. What we really want, to increase confidence in our data, is to validate, by using other methods, the clustering results OrthoMCL produced with OUR dataset. To accomplish this, we would have to process our dataset in parallel with other orthology pipelines and compare the results. This, however, is a major project that is beyond the scope of the current study, as evidenced by calls for dedicated benchmarking studies (e.g., https://www.biomedcentral.com/collections/benchmarkingstudies) and joint efforts to benchmark, improve and standardize orthology predictions (https://questfororthologs.org/).

Nevertheless, we agree that additional validation is desirable and searched for public datasets that may also contain bilaterian-specific gene predictions. We first considered the well-established orthology database OrthoDB version 10 (orthodb.org; Kriventseva et al., 2018). Although this database provides orthogroup analyses for different taxonomic levels (nodes) of the phylogenetic tree, in fact each independent clustering analysis is based on input data of the given taxon only, whereas the identification of taxon-specific orthogroups requires simultaneous analysis of all taxa followed by assessment of taxon membership per orthogroup. It is therefore not possible to obtain lineage-specific orthogroup sets from this database (e.g. metazoan-specific genes) or from BUSCO sets (lineage-specific sets of Benchmarking Universal Single-Copy Orthologs; Waterhouse et al., MBE, 2017) derived from OrthoDB.

The same is true for TreeFam (http://www.treefam.org) which is used by the above mentioned orthology benchmark service. While a fraction of our bilaterian-specific genes also appear as bilaterian-specific in TreeFam (20 of 50 tested), the power of TreeFam to distinguish bilaterian from eumetazoan or metazoan origin is low, as it contains only a single cnidarian outgroup species (BigWenDB: 19 species) and two other non-bilaterians (BigWenDB: 14), and it might have difficulties to distinguish vertebrate/chordate-specific orthogroups from bilaterian-specific groups due to the shortage in lophotrochozoan species (3; BigWenDB: 18). While we do not question the quality of the TreeFam data, they are not ideal for validating bilaterianspecific or metazoan-specific genes.

We then sought to compare our data with a dataset compiled by Paps and Holland, 2018. In this paper, the authors reconstructed the ancestral metazoan genome by orthology clustering of 62 metazoan and outgroup genomes, conceptually similar to our study. They provide in their Supplementary Material lineage-specific genes not only for the metazoan node, but also for the bilaterian and other nodes (https://www.nature.com/articles/s41467-018-04136-5#Sec17; file: 41467_2018_4136_MOESM13_ESM.zip). There are 1,580 bilaterian-specific orthogroups in the file "Bilateria Novel all genes IDs.out" of the Paps and Holland dataset. As many of the orthogroups are small and contained only a few species, we filtered these away and retained OGs with better coverage (at least 15 different species), including a human ortholog. Of the 92 OGs fulfilling these criteria, 15 had overlap with our 157 bilaterian-specific genes. However, as this study did not consider orthogroup composition (phylogenetic coverage) and did not carry out error correction steps, their results cannot be taken as an established reference for the validation of our data.

In summary, we think that the suggested benchmark test does not provide the desired validation of our approach. Rather, with the BigWenDB and our analysis approach we offer a rich new resource, including our orthology clustering predictions, as a baseline for future testing of alternative methods. To support this, we deposited our results and sequence dataset at datadryad.org (temporary review link: https://datadryad.org/review?doi=doi:10.5061/dryad.4qf7168) and the R code for OrthoMCL at https://github.com/BigWenDB (still private, will be made public after acceptance of the paper). We also note that we have done due diligence in comparing our results against 70 curated orthogroups and in conducting a number of phylogenetic tests with molecular outgroups to validate our clustering results (see also response #3-4 below).

2) The authors do not consider the limited sensitivity of Blast based ortholog search tools, which leads to an underestimation of gene age (e.g. Elhaik et al., 2006; Luz et al., 2006; Moyers and Zhang, 2015, 2016, 2017; Jain et al., 2019). In this context, it is a bit worrying, that proteins involved in regulatory functions (transcription factors), and proteins located in the membrane appear enriched in the set of Bilateria-specific proteins. Evidence exists for both kinds of proteins that their particular rates and mode of evolution increases the risk of missing distantly related orthologs (Jain et al., 2019). Although the reciprocal hmm-based search could ameliorate this problem, it is likely to be not helpful. This is, because the high evolutionary rate of such proteins paired with the presence of very common functional domains, e.g. a Zn-Finger, probably results in rather uninformative hidden Markov models.

The articles mentioned by the reviewers show that Blast-based orthology search tools tend to underestimate gene age if the most distant (earliest branching) true homolog is missed because of Blast's limited sensitivity or because of a gene's fast evolution. The reviewers specifically mention the study of Jain et al., 2019, in which the authors investigated the concept of protein traceability, i.e. the "evolutionary distance beyond which sequences are too diverged to be detected with BlastP-based ortholog search tools". Using yeast genes and ortholog searches across different kingdoms (Bacteria, Archaea, protists, plants, green algae, Metazoa), the authors show that protein traceability is high when performing intra-kingdom comparisons (their Figure 2) and considerably lower when performing inter-kingdom comparisons. As the focus of our study is the border between bilaterian and non-bilaterian metazoans, the critical comparisons take place within the same kingdom (Metazoa), and issues caused by low protein traceability should therefore be less prominent than in the scenarios outlined by Jain et al., 2019.

In addition, Jain et al. demonstrate that protein traceability changes with the underlying training data: with phylogenetically diverse training data, better recovery of distant homologs is possible. Again, this principle is met in our study as we used ortholog alignments across the entire bilaterian lineage for the recovery of distant homologs in HMM-HMM comparison steps. With the inclusion of many different bilaterian and nonbilaterian species and the evaluation of orthogroup composition in a phylogenetic context, our dataset further minimizes errors in inferring orthogroup origin that may be caused by the low traceability of specific genes or by taxa with particularly high evolutionary rates. We added a new paragraph discussing this.

As for the reviewers’ concerns for specific gene classes, comparative studies in non-bilaterian metazoans revealed that much of the developmental signaling and transcription factor repertoire was already present in the metazoan stem (see, e.g., Lanna, Evo-devo of non-bilaterian animals, Genet Mol Biol, 2015, and references therein). The successful identification of these factors in non-bilaterian metazoans suggests that traceability within Metazoa does not seem to be a general problem of a metazoan-centered dataset like ours. In the revised version, we briefly discuss these issues in the Discussion of the manuscript.

Nevertheless, it is possible that in particular cases low traceability causes wrong inferences, and the reviewers acknowledged that our newly developed reciprocal best-hit HMM-HMM search step could improve orthogroup assignments in such cases. However, the reviewers also expressed concern that the "high evolutionary rate of such proteins paired with the presence of very common functional domains, e.g. a Zn-Finger, probably results in rather uninformative hidden Markov models."

To address this, we have now generated sequence logo visualisations of two example orthogroups with a common functional domain, the poly-zinc finger domain (new figure: Supplementary file 1—supplementary figure 2) and mention these in the text. While it is true that these HMMs (or sequence logo visualisations) are uninformative in areas outside the poly-ZF region, their central conserved poly-ZF domain is highly informative. The sequence logos not only demonstrate the high conservation of Cys and His residues (Zn2+ complexing) common to both proteins, but also show that individual Zinc fingers of each protein have highly conserved and unique profiles distinguishing them from other ZF proteins. We therefore think that the HMMs generated from such alignments are informative in the sense that they capture the essence (the important positions and their spacing and variability) of an alignment/orthogroup. Indeed, we demonstrated in a number of examples that the approach is capable of linking previously separate (and sometimes taxonomically complementary) orthogroups on the basis of their HMM profiles (Supplementary file 1—supplementary tables 7, 14, 16, 18, 24). For example, lophotrochozoan orthologs of FoxH1 fall into two orthogroups of FoxH1 proteins (one with mammals and the other with fishes and ecdysozoans), such that our HMM-HMM analysis and taxonomic breadth accurately recovers and unifies these orthogroups for a bilaterian-wide protein.

The HMM-HMM comparison step also allowed us to identify potential orthologs of bilaterian-specific proteins in non-bilaterian species, leading to the exclusion of these candidates and demonstrating the rigor of this approach for determining meaningful sequence similarity.

3) I miss the definition of clear-cut evolutionary scenarios. For example, it is unclear what it means when a gene emerges in the LCA of the Bilateria. Is it a de-novo gene birth? Or do the authors also take into account lineage specific gene duplications, probably in combination with domain gain or loss? From what I see in the data I have the impression that the authors consider both. I will exemplify my concern using Figure S14. It shows a phylogeny comprising both FoxH1-an alleged bilaterian invention-and the older FoxD. The way how the tree is drawn implies that it is rooted. If so, it can only be interpreted in the following way: A gene duplication at the root of the metazoa gave rise to FoxH1 and to FoxD. While FoxD is still seen in metazoa, which branched off prior to bilaterian diversification, FoxH1 is nowadays retained only in the Bilateria, and the corresponding orthologs were lost in earlier branching taxa. If this were true, then FoxD would be as old as the animals. Alternatively, one could consider an outgroup rooting. Then FoxD and FoxH1 diversified only at the root of the Bilateria. This diversification would then be the true innovation, but then FoxD and FoxH1 would be co-orthologous to the proteins in the early branching animals. I think the authors have to be way more specific here.

We apologize to the reviewers for not explicitly stating our assumptions and interpretations of the phylogenetic trees (and see also our response to the next point #4, below). Our primary intention was to generate molecular phylogenies as a means of validating orthogroup clustering, not for inferring specific evolutionary scenarios directly from the gene trees. The Fox gene phylogeny example highlighted by the reviewer (original Figure S14) includes all members of the two orthogroups that contain putative FoxH1 proteins, and we included a taxonomically broad subset of proteins from the nearest molecular outgroup in our dataset (the orthogroup with FoxD proteins). The phylogenetic tree recovers 100% support for FoxH1 and FoxD proteins as forming protein-specific, distinct clades. Thus, we are confident in identifying all members of the two FoxH1 orthogroups as genuine FoxH1 orthologs. Furthermore, in re-evaluating this particular example in order to reply to the reviewers, we revisited the primary literature on Fox genes, which – with smallerscale phylogenetic sampling than in our dataset – implicated FoxQ1 proteins as being the nearest outgroup to FoxH1. We have thus repeated our phylogenetic analyses with both FoxD and FoxQ1 outgroups, and again strongly recover a single FoxH1 clade (new Figure 5—figure supplement 3). The evolutionary scenario of FoxH1 being ascribed to the bilaterian ancestor then derives directly from the taxonomic membership within the FoxH1 orthogroups.

To avoid potential misunderstandings, we now modified the legends of phylogenetic trees throughout the manuscript by adding the sentence: "The relationship between orthology clusters is imposed by the method and does not necessarily reflect the timing of evolutionary events".

As for the precise nature of evolutionary novelty, the reviewer is correct that the 157 bilaterian-specific proteins span genuinely de novo proteins as well as – predominantly – those that originated by major evolutionary changes such as gene duplication and protein domain changes. In fact, we only identify 5 of the 157 bilaterian-specific proteins as being likely de novo gene births and mention this now in the text (supported by alignments in Figure 2—figure supplement 2, 3, 4). HMMs corresponding to these 5 orthogroups did not have hits outside Bilateria (Supplementary file 1—supplementary table 9). In contrast, the majority of bilaterian-specific proteins contained one or more known protein domains.

4) The authors use phylogenetic trees to support, or more often to reject an orthology relationship, e.g. subsection “Changes in axon guidance accompany bilaterian evolution”. I am missing topology tests, e.g. the Shimodaira Hasegawa test, to show that a gene tree indeed differs significantly from the species tree. This is necessary, because the phylogenetic information within one orthologous group is, in many cases, not sufficient for accurately resolving the evolutionary history of the corresponding proteins.

Following on from point #3 above, we agree that our original phylogenies were not optimized for directly deducing evolutionary scenarios. Instead, we use phylogenies for testing the recovery of orthogroups as well-supported molecular clades and for investigating whether or not outlier sequences from phylogenetically distant species were erroneously included into an orthogroup. For these purposes, formal topology tests for gene-tree species-tree concordance within orthogroups are, in our opinion, not strictly necessary.

With respect to the reviewers' concern, we recalculated all phylogenetic trees of this article with a new method (IQ-Tree) and included Shimodaira–Hasegawa-like approximate likelihood ratio tests (SH-aLRT) whose results are now displayed on all branch labels of the trees, in addition to bootstrap values. According to the IQ-Tree documentation, one would typically rely on a clade if its SH-aLRT value is > = 80% and its UFboot value is > = 95%, conditions that are met for the relevant nodes in all our analyses.

Although the new phylogenies fully support the conclusions derived from the original analysis using RAxML, we acknowledge that some trees show high gene/species topology discordance within a given protein clade (orthogroup). We accept that this is, in part, the consequence of including mixed data types, such as the shorter ORF sequences, and sequence sets not optimized for phylogeny because they are often derived directly from the orthogroups obtained in our clustering.

5) The resampling approach to show that bilaterian genes are significantly more connected in PPI networks than it is expected by chance is likely to be confounded with gene age. There is evidence that evolutionarily older genes are more central in PPI networks than younger ones (Kim and Marcotte, 2008). Any resampling method that also selects young genes is therefore likely to result in less connected networks.

We thank the reviewers for identifying this important point. To investigate whether the high connectedness of bilaterian-specific genes is indeed biased by age, we carried out additional analyses.

First, we obtained from our clustering results a list of 5,429 (of 20,205) human genes/orthogroups with broad phylogenetic distribution (present in >75 species and unique mapping) and reliable birth in the ancestor of bilaterians or older. We then selected 50x 150 out of the 5,429 genes and analysed their PPI network at stringDB. The obtained PPI score distribution showed that the value for bilaterian-specific genes fell into a broad peak of PPI scores of these older genes and was thus not indicative of a particularly high connectedness in bilaterian genes.

In this first approach, most of the 5,428 considered genes/orthogroups originated in the ancestor of opisthokonts, the oldest possible phylostratum in our database. To avoid this bias, we selected from the corresponding list a set of 797 genes/orthogroups with metazoan ancestor and compared this age-stratified gene set with the bilaterian-specific gene set. We selected 100x 150 genes/orthogroups from the 797 metazoan-specific gene set and analysed their PPI score at stringDB. This time, the PPI score of our bilaterianspecific gene set was much lower than the peak of the metazoan-specific PPI distribution, indicating a clear difference in connectedness. However, we were not satisfied with this result because in this case we compared a "full" gene set (bilaterian-specific OGs) with a distribution of subsets (100x 150 out of 797), in which existing PPI networks might have been destroyed by subsampling.

We then analysed in detail the interactions of bilaterian-specific and metazoan-specific genes in a combined network of the two sets and found support for a higher level of connectivity for bilaterian-specific proteins, in comparison to metazoan-specific proteins. These analyses are described in a new paragraph on PPI networks and are accompanied by a new Figure 8 (previous Figure 6) and new corresponding method paragraphs.

The above described results were obtained with stringDB version 11, which is currently running on the stringDB webserver. As our original network graph was obtained with an earlier version of stringDB (version 10.5) and a slightly incomplete set of bilaterian-specific genes (we missed to add OSR1, NTRK1, EOMES), we re-analysed interactions using the new stringDB version and a complete gene set.

This reanalysis, however, led to the new conclusion that the previously proposed hub gene PRDM10 was a false positive result. In the new database version, all interactions of PDRM10 with other proteins were removed, even for experimentally validated interactions. We contacted the stringDB support team with this issue and they explained that "textmining links in version 10 for PRDM10 are false positives due to PRDM10 synonym Tris, which is also part of the name of the buffer. In version 11 we, most likely, blacklisted that synonym".

We therefore removed the paragraph on PDRM10, including the wet bench expression, RNAi data, the PPI control figure (original Figure S21), and corresponding methods, from the manuscript, as it is no longer supported by stringDB interaction data.

6) The reciprocal best hit approach, which is used in the final filtering of the candidates is problematic. Reason is, that the initial Blast against non-redundant genbank can have any paralog as a best hit, since the assumption that the entire proteome of each species is represented in the data base is probably not valid.

As final validation step for the bilaterian-specific genes, we blasted all human or D. melanogaster orthologs, which were present in a precursor list of 204 bilaterian-specific orthogroups, against non-bilaterian metazoans at NCBI. If a reciprocal best-hit analysis was positive, it suggested that potential orthologs of the bilaterian-specific gene query might exist in non-bilaterians. To avoid the risk of including false positives, we therefore decided to remove such candidates from the set of bilaterian genes, irrespective of their true relationship. We accepted these losses because final proof would require more reliable sequence data from non-bilaterians and detailed additional analyses, or wet lab experiments.

7) Based on what evidence do the authors consider five independent losses unlikely? And what is an “undetermined orthology”. Can the authors reject that the poriferan sequence is an ortholog of Eomes, and if so again based on what evidence?

To clarify our arguments for a bilaterian origin of the Nodal pathway target Eomes, we rewrote the corresponding paragraph and included as further support for our initial conclusion an additional table with reciprocal best hit Blast analysis (new Supplementary file 1—supplementary table 15) of the two putative poriferan Eomes sequences.

8) To consider: one reviewer found the data set overly complex. Because the story is pretty complex, they thought it would do very little harm to omit the genomic ORFs from the analysis.

We explained in the manuscript why we have chosen to include a metazoan-wide set of genomic ORFs (missing, unreliable and/or erroneous genome annotation, especially in non-model organisms). The decision to include ORFs led to the current complexity of the dataset. All analyses are based on the ORF-containing dataset, and we do not see a productive way to remove the ORFs without having to rerun all analyses.

Instead, we think that the addition of ORFs is valuable. For example, the bilaterian origin of FoxH1, and consequently of the entire Nodal signalling pathway, would not have been detected without the inclusion of genomic ORFs. ORFs represent a substantial fraction of sequences in many orthogroups, and they contributed significantly to the generation of ortholog clusters during the clustering steps.

Furthermore, the genomic ORFs gave rise to ORF-only orthogroups (not mentioned in the manuscript), with the possibility to detect new lineage-specific protein-coding factors that so far escaped attention. The analysis of these orthogroups is currently under way.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Reviewer #1:

[…] There are some moderate and minor comments, which might improve this manuscript.

1) Introduction: "In contrast, the evolutionary relationships of non-bilaterian metazoans are still a matter of debate, in particular the relative positions of placozoans, ctenophores, and sponges (Brooke and Holland, 2003; Ryan et al., 2013; Pisani et al., 2015; Simion et al., 2017; Feuda et al., 2017). "

– It is a biased reference list with three references stressing one point; the alternative reconstruction of the animal phylogeny should also be presented (e.g., Whelan et al., 2015 – PNAS; Whelan et al., 2017)

Thank you for your comment. As suggested, we now include Whelan et al., 2017.

2) Introduction: Genomic sequencing of non-bilaterian animals revealed that essentially all signaling pathways and developmentally important genes from bilaterians are also present in non-bilaterians, indicating that these genes evolved before the advent of bilaterians.….

– "…all…important genes" is an incorrect and biased statement. Frankly, we do not know all important genes for most of the bilaterian lineages. Regarding all signaling pathways, many genes related to neuronal signaling and immune systems, for example, are absent in some non-bilaterian metazoans (see for examples in Moroz and Kohn, 2016)

We agree that our original sentence contained an inappropriate generalization. The seven major signalling pathways present in metazoans are: TGF-β, canonical Wnt, nuclear receptors, Notch-Δ, Hedgehog, RTK/growth factors, JAK/STAT, and extracellular matrix, and essentially all of these exist also in non-bilaterians (see Babonis and Martindale, Phylogenetic evidence for the modular evolution of metazoan signalling pathways, 2016, Phil. Trans. R. Soc. B 372: 20150477. http://dx.doi.org/10.1098/rstb.2015.0477). To avoid misunderstandings, we changed "essentially all" to "the major" and inserted "many" before "developmentally important genes from bilaterians".

3) Result subsection “Bilaterian G protein-coupled receptors and the control of physiological state through circulatory flow”, paragraphs one and two.

– Discussion about the monoaminergic reward system, as a generalized statement for bilaterians, is highly speculative. I would recommend using more careful wordings since the functions of many GPCRs are putative and speculative.

Indeed, our initial references (Berridge, 2004; Berger et al.,2009) only referred to the function of monaminergic signaling in humans/mammals. However, there is a large body of literature supporting the view that dopamine, adrenaline, and serotonin signalling indeed regulate many important processes (sleep, arousal, wakefulness, aggression, locomotion, feeding and courtship behaviour, memory formation and learning) in several bilaterian phyla, e.g. in arthropods, nematodes, mollusks, and vertebrates. Although these neurotransmitters have not been studied in all bilaterian phyla, they are involved in similar physiological processes in a range of different phyla. To take this into account, we modified the text and included additional references from non-vertebrate model systems.

We agree that also reference Budhiraja, 2009, in the following paragraph, does not fully document what we wanted to express as this article only refers to the Neuromedin-U receptor in humans. However, we surveyed the literature again (Feb 2020) and we are still confident that our initial statements were justified as we found additional supporting examples in various bilaterian phyla. We reworded the paragraph and included several new references to monoamine signalling across bilaterians.

4) Result subsection “Bilaterian G protein-coupled receptors and the control of physiological state through circulatory flow”, paragraph three.

– Speculations about the peptidergic reception. There are many uncertainties about GPCRs in the majority of bilaterians, and it is difficult to say about functions and ligands of putative (neuro)peptide receptors. I suggest using more careful wording.

We re-edited this paragraph and incorporated references that better support our initial statements.

Reviewer #2:

In their revised version, the authors have addressed many of my initial concerns, and I trust that this is a relevant study, in principle.

However, some issues remain, which I think can be addressed by carefully re-working the argumentation.

1) I asked in my initial review about a benchmark of the new ortholog assignment method making use of the QfO reference proteome set. The authors chose to not follow this up and justified this with two arguments.

a) First, they stated that their core method is basically a re-implementation of OrthoMCL, and use the times the OrthoMCL paper has been cited as kind of a quality criterion. Needless to say that I am not happy with this argument. OrthoMCL has been shown to have, compared to other methods, a very high false positive rate of 16% (Chen et al., 2007). Thus, it is probably not the number of citations for OrthoMCL, but rather the performance of this tool, the authors should use in their argumentation. I strongly encourage to adapt the discussion in this direction. If they have the feeling that a high false positive rate does not matter, which of course is the case when a high sensitivity is desired, and false positives are either sorted out at a later curation step, or false positive make the analysis more conservative, and thus are ok, then they should explain this to the reader.

b) As a second argument, the authors state that the key advantage „(...) is not the core method (OrthoMCL), but the inclusion of several steps for error correction after orthology clustering, including a new HMM-HMM reciprocal best hit test for orthogroup completion (see, e.g., Supplementary file 1—supplementary table 7). These additional steps (together with careful sampling) distinguish our pipeline from the original OrthoMCL and from other orthology pipelines and influence the interpretation of cluster/orthogroup origin. The additional steps are computationally demanding and require, at present, extensive manual processing. We performed them, as proof-of-principle, for the 157 orthogroups with assumed bilaterian ancestry, but this approach is not yet scalable. Applying the steps for error correction to all clustered outputs for reference proteomes from a public database is therefore not feasible."

I am a bit puzzled reading this response. If I understand the answer correctly then the authors state that they perform a semi-automated curation of orthologous groups predicted by orthoMCL, which is absolutely fair. In their manuscript, however, the authors advertise their method as "A unique orthology pipeline for the identification of bilaterian-specific genes", which creates the (untested) impression that the pipeline outperforms existing ortholog search tools. I see two possibilities. Either, the authors present their method as a curation step without the claim of having developed a new ortholog search pipeline. Or, alternatively, they perform a comprehensive benchmark comparing it to existing and state of the art “orthology pipelines”. As a last aspect to consider, part of the QfO benchmark analyses are independent of the QfO reference proteomes. For example, tree discordance tests can be, in principle performed with any underlying set of proteomes.

To sum it up, my main concern is not necessarily the method itself, except the orthologous group completion step using hmm-hmm searches, of course. Rather, I am very much worried about the fact that the authors generate the impression of having developed a new pipeline for orthologous group compilation, which apparently performs better than existing software.

The reviewer criticizes that we "generate the impression of having developed a new pipeline for orthology prediction that performs better than existing software" and/or that we "claim of having developed a new ortholog search pipeline". As it has never been our intention to mislead the reader with respect to the nature of our orthology pipeline, we carefully examined the manuscript for the presence of equivocal text passages. We therefore changed the wording of the sentence: from "improved its performance" (the performance of OrthoMCL) to "improved its scalability". We further re-organised the first paragraph of the original version of the Discussion: We split it into several paragraphs with more informative headings (and additions shown in red) and included an entirely new paragraph that addresses various problems mentioned by the reviewer (see also below). In this process, we changed the old header "A unique orthology pipeline for the identification of bilaterian-specific genes" because it also might have fueled misunderstandings. The corresponding new header is "An R-based OrthoMCL pipeline for processing large datasets".

In her/his first comment, reviewer 2 also addresses the high false-positive rate of OrthoMCL and the possibility of tree discordance tests. In the following, we respond to these concerns in detail:

Mentioning the paper by Chen et al., 2007, the reviewer is worried that OrthoMCL's "very high false positive rate of 16%" might compromise our results. Although this rate is indeed a result of the Chen paper, it is important to consider its context: The Chen paper assessed the performance of 10 different orthology detection methods using a -- by today's standard -- small dataset of "27,562 protein sequences from six eukaryotic genomes", which span a large evolutionary interval of 1.3 billion years ( = divergence time between Arabidopsis and Drosophila according to timetree.org). When interpreting the Chen et al. orthology clustering results (performance, FP and FN rate etc.), one has to keep in mind that they have been obtained with this comparatively modest dataset. Importantly, we made an effort to compile a more comprehensive and representative dataset (273 opisthokont species with members of 21 metazoan phyla) because uneven phylogenetic coverage (as in Chen et al.) has been identified as a major confounding factor for orthology prediction (p2 of manuscript; Trachana et al., 2011). As the performance of OrthoMCL is dependent on the dataset (and the settings, e.g. of the MCL inflation parameter), and not an inherent property of the algorithm, performance measures of OrthoMCL in our much more comprehensive dataset will be different from the results obtained in the Chen et al. comparison. In fact, OrthoMCL was -- despite the high FP rate -- the best performing algorithm of the three methods generating orthologous clusters in the Chen comparison. To underscore the dependency of orthology prediction on the dataset, we now mention in the Discussion the high false-positive rate of OrthoMCL and the benefits of a comprehensive dataset.

The potentially high false-positive rate of OrthoMCL further indicates that orthogroups might be inferred as bilaterian-specific, but are in fact not, and only appear so because false positive sequences were erroneously assigned to them (discussed in Appendix, error type 3). According to our filtering strategy, this can happen only if two conditions are met: First, the false positive sequences need to belong to a different lineage than the "truly" assigned sequences (e.g., to Ecdysozoa while the other sequences are from Deuterostomia) to induce a change in the orthogroup ancestor to Bilateria. Second, four or more sequences of an ancestor-changing lineage need to be added erroneously (e.g., four ecdysozoans added to an otherwise mammalian-specific orthogroup). While our filtering conditions reject orthogroups with less than four false-positive sequences, we cannot currently detect false-positive bilaterian-specific orthogroups generated by the addition of four or more ancestor-changing sequences. However, this error will only affect a minority of small bilaterian-specific orthogroups with uneven representation. Detailed phylogenetic analysis as well as improved taxon sampling would be necessary to discover such false-positive assignments. To complete error discussion, we included the described effect of false positives in the revised version of the manuscript (Appendix).

The reviewer states that "tree discordance tests can be, in principle performed with any underlying set of proteomes".

We are not sure whether this statement is a request to perform discordance tests. According to Altenhoff et al., 2016, a "species tree discordance test exploits the [orthology] relationship by assessing the accuracy of orthologs in terms of the accuracy of the species tree that can be reconstructed from them". Although it would support orthology if the gene tree of orthologous sequences matches the species tree, the reverse conclusion is not true: We can not necessarily infer non-orthology if species tree and gene tree do not match because of various problems in phylogenetic tree reconstruction (see, for example, Aguileta et al., 2008). Even in their own comparison of orthology pipelines, Altenhoff et al. came to the conclusion that "there is no obvious systematic difference in performance between […] methods relying on species tree […] and methods that do not". We therefore do not see that including discordance tests would add value, or insight, to the conclusions of our article. However, we now mention gene tree-species tree discordance in a new paragraph of the Discussion.

2) My second concern was the limited sensitivity of Blast in ortholog searches. I very much appreciate that the authors have considered this now in the Discussion. In my opinion, some aspects require further attention. Moyers and Zhang, as well as Jain et al. have shown that the sensitivity of Blast can become limiting even when searching for orthologs in the same kingdom. I think it would be a good idea to openly discuss the issue of limited sensitivity ignoring the fact that you “only” search within the same kingdom. It is then straightforward to propose extensive taxon sampling together with the very sensitive hmm-hmm comparisons as a way out of this problem. The shift to a different search method comes then at the cost of a substantially decreased specificity in the ortholog assignment (see my point above). The authors can then explain why they think that an elevated rate of false positive either do not harm, or how they get rid of them in a downstream curation.

As response to the reviewer’s suggestion, we replaced the corresponding paragraph (starting with "Low protein traceability, the failure to detect…") of the Discussion by a new paragraph. We now generally acknowledge the possibility of homology detection failure and describe the benefit of HMM-HMM comparisons for minimizing errors caused by the low traceability of our candidate genes.

3) The authors write in their response "Our primary intention was to generate molecular phylogenies as a means of validating orthogroup clustering, not for inferring specific evolutionary scenarios directly from the gene trees." I cannot follow this argumentation, and I am convinced that I misinterpret the answer. Phylogenetic trees are the basis for deriving evolutionary scenarios on the sequence level, and I cannot see how this cannot be the case. In this context, I regret to say that I find the sentence "The relationship between orthology clusters is imposed by the method and does not necessarily reflect the timing of evolutionary events" impossible to understand. A phylogenetic tree that does not reflect the timing of evolutionary events is either wrong or unrooted. If the latter is the case, then there has to exist the possibility to place the root such that the evolutionary scenario emerges that is proposed in the text. If this is not the case, then again either the tree or the proposed scenario is wrong.

Concerning the precise example of FoxH1: Given the tree shown in Figure 5—figure supplement 3, there is no way that FoxH1 is a bilaterian invention. The FoxH1 lineage split from its sister group (either FoxQ1 or FoxD) via a gene duplication prior to the diversification of the bilateria. If no FoxH1 sequences outside the Bilateria were found, then they were either lost-which still does not make FoxH1 a bilaterian invention-or the taxon sampling was not sufficient, and non-Bilaterian FoxH1 sequences await their detection. Alternatively, the tree is wrong, but then it should not be shown. I consider this a major issue.

The reviewer is concerned that "either the tree or the proposed scenario is wrong". We agree that this statement is correct, if evolution proceeded with a strictly constant molecular clock. Unfortunately (for tree reconstruction), this is not the case. Hence, the structure of inferred evolutionary/phylogenetic trees need not always reflect the historically true branching order. In particular, heterotachy of evolutionary rates has been widely documented to accompany, and to last after, events of gene duplication. The cases which we discuss in our manuscript can all be explained by this phenomenon. We made an effort to clarify our explanations in the text by adding a new paragraph to the Discussion ("Challenges in reconciling orthogroups and phylogenetic trees").

We further removed all instances of the unfortunate/unreasonable statement "The relationship between orthology clusters is imposed by the method and does not necessarily reflect the timing of evolutionary events", but we kept the original trees according to the considerations above and the changes in the Discussion.

4) As a response to my concern about missing topology tests, the authors have recomputed the trees and have added approximate SH-LRT support values. They write in their answer "Although the new phylogenies fully support the conclusions derived from the original analysis using RAxML, we acknowledge that some trees show high gene/species topology discordance within a given protein clade(orthogroup). We accept that this is, in part, the consequence of including mixed data types, such as the shorter ORF sequences, and sequence sets not optimized for phylogeny because they are often derived directly from the orthogroups obtained in our clustering." I do not get the point about why it is a problem of using sequences directly derived from orthogroups. Is this again an issue of missing specificity in the ortholog assignments?

Perhaps our previous answer led to misunderstandings. The branching order of phylogenetic trees can be affected by the inclusion of problematic sequences, little phylogenetic information, inappropriate evolutionary models, sampling depth, or heterotachy, to name a few. The presence of (often) rather short ORF-derived sequences is certainly not optimal (in comparison to full length sequences) for phylogenetic analysis and may reinforce gene-tree species-tree discordance. We do not think that such discordance is caused by missing specificity in the orthogroup assignment. If orthogroup assignments were unspecific, we should see low support in the nodes defining an orthogroup/clade. This is not the case for the phylogenetic trees shown in our article. We are thus confident that the sequences assigned to a specific orthogroup do belong to this orthogroup. In contrast, the branching order of sequences within orthogroups does not necessarily match the expected species relationships (gene tree/species tree discordance) for the various reasons outlined above. As expected, support values for branching events within orthogroups are therefore often low. We now discuss these considerations in the new paragraph of the Discussion ("Challenges in reconciling orthogroups and phylogenetic trees").

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

[…] We see two ways out of this dilemma:

One is for you to perform additional analyses, which will take time. This would likely have to include the results of the ortholog searches in a well-defined set of species, such that presence/absence of orthologs can be interpreted along a (known species) tree. For example, at the moment, it is unclear why the tree of Nodal features about 30 vertebrate sequences, that of FoxH1 only about 20 sequences, and that of EOMES only 2. Although these numbers are not relevant when it comes to the age of the genes, they tell us something about the precision and rigor of the analysis.

The other possibility is that the authors could modify the manuscript to avoid the term “innovation” where it is not backed up by the data. If the tree argues for an evolutionarily old gene, which nowadays – and according to the analyses of the authors – is confined to the bilateria, why not simply call it a “bilaterian-specific” gene, leaving it for the moment open whether it was indeed an “innovation” or a selective retention of an older gene. In essence, the authors would then argue with their phylogenetic profiles, which seem to be comprehensive, and not with trees, which they indicate in the response might be wrong anyways. If the authors then wish, they could speculate that the gene is indeed a bilateral innovation, but it has to be clear that this is a speculation that would require more thorough analyses to be tested.

We very much appreciate that you mentioned the complexity of our study and the effort needed to perform additional analyses, such as relative rate tests on the branches of the phylogenies shown in Figure 5—figure supplement 3, Figure 5—figure supplement 4, or Figure 7—figure supplement 1. While we agree that such analyses may yield additional insight, we think that they are beyond the scope of the present manuscript and should be deferred to a future study. We therefore followed your second suggestion and main comment: to modify the manuscript and avoid the term "innovation" where it is not clearly supported by the data.

Throughout the manuscript, we therefore changed the text when our conclusions (bilaterian origin/innovation) were in conflict with phylogenetic trees. As suggested, we now describe the respective genes/features as bilaterian-specific or as having a bilaterian-wide distribution, leaving their evolutionary origin open.

We now also acknowledge in the discussion the existence of gene loss as an important evolutionary mechanism and cite appropriate studies, as requested by the reviewer. Despite the increasing recognition of gene loss and the studies mentioned by the reviewer, there is, to our knowledge, still no support in the literature for a scenario comparable to the one some of our trees imply: the parallel loss of several genes in several entire sister phyla (such as the loss of FoxH1 in cnidarians, poriferans, placozoans, and ctenophores).

To seek further support for our interpretation of trees, a rate acceleration of the FoxH branch after duplication (and therefore origin in the bilaterian ancestor), we analysed in a preliminary way the CDSs of human FoxD/FoxH/FoxQ sequences using Tajima's 1- and 2-parameter relative rate tests (Tajima, Genetics, 1993 Oct;135(2):599-607). Although the results in this special case confirmed our claim, a thorough analysis on the basis of several genes from several organisms should be the focus of a different study.

https://doi.org/10.7554/eLife.45530.sa2

Article and author information

Author details

  1. Peter Heger

    Institute for Genetics, Cologne Biocenter, University of Cologne, Cologne, Germany
    Contribution
    Conceptualization, Data curation, Software, Supervision, Validation, Investigation, Visualization, Methodology, Project administration
    For correspondence
    peter.heger@uni-koeln.de
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2583-2981
  2. Wen Zheng

    Institute for Genetics, Cologne Biocenter, University of Cologne, Cologne, Germany
    Present address
    West China-Washington Mitochondria and Metabolism Research Center, West China Hospital, Sichuan University, Chengdu, China
    Contribution
    Data curation, Software, Validation, Investigation, Methodology
    Competing interests
    No competing interests declared
  3. Anna Rottmann

    Institute for Genetics, Cologne Biocenter, University of Cologne, Cologne, Germany
    Contribution
    Software, Investigation
    Competing interests
    No competing interests declared
  4. Kristen A Panfilio

    1. Institute for Zoology: Developmental Biology, Cologne Biocenter, University of Cologne, Cologne, Germany
    2. School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, United Kingdom
    Contribution
    Resources, Supervision, Funding acquisition, Validation, Investigation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6417-251X
  5. Thomas Wiehe

    Institute for Genetics, Cologne Biocenter, University of Cologne, Cologne, Germany
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8932-2772

Funding

Deutsche Forschungsgemeinschaft (CRC 680)

  • Kristen A Panfilio
  • Thomas Wiehe

Deutsche Forschungsgemeinschaft (CRC 1211)

  • Thomas Wiehe

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This research was supported by grants from the German Research Foundation to TW (CRC 680 and CRC 1211) and to KAP (CRC 680). BLAST searches were computed on CHEOPS, the Cologne High Efficiency Operating Platform for Science of the University of Cologne, and on JuRoPA (Jülich Research on Petaflop Architectures), a High Performance Computing Platform of the Jülich Supercomputing Centre, Germany. We thank Robert Fürst for programming help, Kay Hofmann for help with protein structure analysis, Richard Stancliffe for scripting and statistical support, Maria Thieser for help with transcriptome processing, and Olav Zimmermann for the cooperation with the Jülich Supercomputing Centre. Special thanks to countless researchers and institutions for sharing sequence data.

Senior and Reviewing Editor

  1. Patricia J Wittkopp, University of Michigan, United States

Reviewer

  1. Ingo Ebersberger

Publication history

  1. Received: January 31, 2019
  2. Accepted: July 3, 2020
  3. Accepted Manuscript published: July 16, 2020 (version 1)
  4. Version of Record published: October 5, 2020 (version 2)

Copyright

© 2020, Heger et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,926
    Page views
  • 373
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Evolutionary Biology
    2. Immunology and Inflammation
    Kazi Rahman et al.
    Research Article

    The interferon-inducible transmembrane (IFITM) proteins belong to the Dispanin/CD225 family and inhibit diverse virus infections. IFITM3 reduces membrane fusion between cells and virions through a poorly characterized mechanism. Mutation of proline rich transmembrane protein 2 (PRRT2), a regulator of neurotransmitter release, at glycine-305 was previously linked to paroxysmal neurological disorders in humans. Here, we show that glycine-305 and the homologous site in IFITM3, glycine-95, drive protein oligomerization from within a GxxxG motif. Mutation of glycine-95 (and to a lesser extent, glycine-91) disrupted IFITM3 oligomerization and reduced its antiviral activity against Influenza A virus. An oligomerization-defective variant was used to reveal that IFITM3 promotes membrane rigidity in a glycine-95-dependent and amphipathic helix-dependent manner. Furthermore, a compound which counteracts virus inhibition by IFITM3, amphotericin B, prevented the IFITM3-mediated rigidification of membranes. Overall, these data suggest that IFITM3 oligomers inhibit virus-cell fusion by promoting membrane rigidity.

    1. Cell Biology
    2. Evolutionary Biology
    Nicole L Nuckolls et al.
    Research Article

    Meiotic drivers are parasitic loci that force their own transmission into greater than half of the offspring of a heterozygote. Many drivers have been identified, but their molecular mechanisms are largely unknown. The wtf4 gene is a meiotic driver in Schizosaccharomyces pombe that uses a poison-antidote mechanism to selectively kill meiotic products (spores) that do not inherit wtf4. Here, we show that the Wtf4 proteins can function outside of gametogenesis and in a distantly related species, Saccharomyces cerevisiae. The Wtf4poison protein forms dispersed, toxic aggregates. The Wtf4antidote can co-assemble with the Wtf4poison and promote its trafficking to vacuoles. We show that neutralization of the Wtf4poison requires both co-assembly with the Wtf4antidote and aggregate trafficking, as mutations that disrupt either of these processes result in cell death in the presence of the Wtf4 proteins. This work reveals that wtf parasites can exploit protein aggregate management pathways to selectively destroy spores.