A putative origin of the insect chemosensory receptor superfamily in the last common eukaryotic ancestor

  1. Richard Benton (corresponding author)
  2. Christophe Dessimoz
  3. David Moi
  1. Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, Switzerland
  2. Department of Computational Biology, Faculty of Biology and Medicine, University of Lausanne, Switzerland
  3. Swiss Institute of Bioinformatics, Switzerland
  4. Department of Genetics, Evolution and Environment, University College London, United Kingdom
  5. Department of Computer Science, University College London, United Kingdom
Short Report
Cite this article as: eLife 2020;9:e62507 doi: 10.7554/eLife.62507

Abstract

The insect chemosensory repertoires of Odorant Receptors (ORs) and Gustatory Receptors (GRs) together represent one of the largest families of ligand-gated ion channels. Previous analyses have identified homologous ‘Gustatory Receptor-Like’ (GRL) proteins across Animalia, but the evolutionary origin of this novel class of ion channels is unknown. We describe a survey of unicellular eukaryotic genomes for GRLs, identifying several candidates in fungi, protists and algae that contain many structural features characteristic of animal GRLs. The existence of these proteins in unicellular eukaryotes, together with ab initio protein structure predictions, provides evidence for homology between GRLs and a family of uncharacterized plant proteins containing the DUF3537 domain. Together, our analyses suggest an origin of this protein superfamily in the last common eukaryotic ancestor.

Introduction

The insect chemosensory receptor superfamily, comprising Odorant Receptors (ORs) and Gustatory Receptors (GRs), forms a critical molecular interface between diverse chemical signals in the environment and neural activity patterns that evoke behavioral responses (Benton, 2015; Joseph and Carlson, 2015; Robertson, 2019; Rytz et al., 2013; van Giesen and Garrity, 2017). Most insect genomes encode dozens to hundreds of different, often species-specific, ORs and/or GRs. Detailed analyses, in particular in Drosophila melanogaster, indicate that the vast majority of these are likely to be expressed in, and define the chemical response properties of, distinct subpopulations of peripheral sensory neurons (Chen and Dahanukar, 2020; Joseph and Carlson, 2015; Scott, 2018; Vosshall and Stocker, 2007).

Insect ORs and GRs – the former having derived from an ancestral GR (Robertson, 2019; Robertson et al., 2003) – contain seven transmembrane (TM) domains (Clyne et al., 2000; Clyne et al., 1999; Scott et al., 2001; Vosshall et al., 1999). In contrast to vertebrate olfactory and taste receptors, which belong to the G protein-coupled receptor (GPCR) superfamily of seven TM domain proteins (Glezer and Malnic, 2019; Yarmolinsky et al., 2009), insect ORs and GRs have the opposite topology, with an intracellular N-terminus (Benton et al., 2006; Lundin et al., 2007). Functional analyses of these proteins in heterologous expression systems indicate that they form ligand-gated ion channels (Butterwick et al., 2018; Sato et al., 2008; Sato et al., 2011; Wicher et al., 2008). Insect ORs assemble into heteromeric (probably tetrameric) complexes likely composed of two subunits each of a tuning OR, which recognizes odor ligands, and a universal co-receptor, ORCO, which is critical for complex assembly, subcellular trafficking, and – together with the tuning OR – forms the ion conduction pore (Benton et al., 2006; Butterwick et al., 2018; Larsson et al., 2004; Sato et al., 2008; Wicher et al., 2008). GRs are less well-characterized but are also likely to function in multimeric complexes of one or more different subunits (Joseph and Carlson, 2015; Scott, 2018). A cryogenic electron microscopy (cryo-EM) structure of an ORCO homotetramer (Butterwick et al., 2018) – which can conduct ions itself upon stimulation with artificial ligands (Jones et al., 2011) – demonstrated that this receptor adopts a novel fold unrelated to any known family of ion channels. Analysis of amino acid conservation across the OR repertoire (Butterwick et al., 2018) and de novo structure predictions of tuning ORs, guided by patterns of amino acid co-evolution (Hopf et al., 2015), suggest that this fold is globally similar for all ORs (and potentially GRs).

The unusual nature of these proteins has prompted significant interest in understanding their evolution. The most extensive comparative genomic analyses have focused on Insecta, tracing the origins of the OR family from the ancestral insect GRs (Brand et al., 2018; Missbach et al., 2014). Surveys for GR-like (GRL) proteins beyond insects have identified members of this family across Protostomia (including in Annelida, Nematoda, and Mollusca) as well as in a limited number of Deuterostomia (including Echinodermata and Hemichordata, but not Chordata) (Eyun et al., 2017; Robertson, 2015; Saina et al., 2015). Several non-bilaterian animals have recognizable GRLs, including Cnidaria and Placozoa (Eyun et al., 2017; Nordstrom et al., 2011; Robertson, 2015; Saina et al., 2015). Although only very sparse expression and functional information exists outside Insecta, several lines of evidence indicate that members of this superfamily have roles beyond chemosensation. For example, two homologs in Caenorhabditis elegans (LITE-1 and GUR-3) function in light detection (Edwards et al., 2008), either as a photoreceptor (Gong et al., 2016), or indirectly through recognition of cellular chemical products produced upon light exposure (Bhatla and Horvitz, 2015). (Isoforms of the D. melanogaster LITE-1 ortholog, GR28b, also have other sensory roles, notably in thermosensation and light-sensing (Ni et al., 2013; Xiang et al., 2010).) Another C. elegans homolog functions in motoneurons to control egg-laying (Moresco and Koelle, 2004). GRLs in the sea anemone Nematostella vectensis and the purple sea urchin, Strongylocentrotus purpuratus, are expressed early during development (Saina et al., 2015), and one of the N. vectensis proteins may have a role in apical body patterning (Saina et al., 2015).

Although this family of established (or presumed) ion channels represents one of the largest and most functionally diverse in nature, its evolutionary origin remains unknown. We previously suggested potential homology between GRLs and a family of uncharacterized plant proteins containing the Domain of Unknown Function 3537 (DUF3537), based upon their predicted seven TM domains and intracellular N-termini (Benton, 2015). However, this proposal was questioned (Robertson, 2019) because DUF3537 proteins lack other features that characterize animal GRLs, such as a motif in TM7 and conserved introns near the 3’ end of the corresponding gene (Robertson, 2015; Robertson, 2019; Saina et al., 2015). Moreover, if insect ORs/GRs and plant DUF3537 proteins were derived from a common ancestor, we might expect to find related proteins encoded in the genomes of unicellular eukaryotes. This study aimed to profit from the wealth of genomic information now available to investigate the potential existence of GRL homologs in such species to further trace the birth of this remarkable protein family.

Results

Screening and assessment of candidate GRLs in unicellular eukaryotes

We used diverse insect ORs and GRs, other animal GRLs and plant DUF3537 families as sequence queries in BLAST searches of protein and genomic sequence databases of unicellular organisms (see Materials and methods). Significant hits were subjected to further assessment to exclude spurious similarities, retaining those that fulfilled most or all of the following criteria: (i) reciprocal BLAST using a candidate sequence as query identified a known GRL or DUF3537 member as a top hit (or no significant similarity to other protein families); (ii) ~350–500 amino acids long, similar to GRLs/DUF3537 proteins; (iii) predicted seven TM domains; (iv) intracellular N-terminus; (v) longer intracellular loops than extracellular loops (a feature of the insect receptors [Otaki and Yamamoto, 2003; Robertson, 2015]). These analyses identified 17 sequences from Fungi, Protista and unicellular Plantae (Table 1 and Figure 1), described in more detail below.

Figure 1
Transmembrane topology predictions of GRLs.

(A) Top: cryo-EM structure of Apocrypta bakeri ORCO (AbakORCO) (PDB 6C70 [Butterwick et al., 2018]); only two subunits of the homotetrameric structure are visualized. Bottom: Schematic of the membrane topology of AbakORCO (adapted from Butterwick et al., 2018), colored as in the cryo-EM structure. The white asterisk marks a helical segment that forms part of a membrane re-entrant loop in the N-terminal region. TM domain seven is divided into a cytoplasmic segment (7a) and a membrane-spanning segment (7b). (B) TM domain and topology predictions of the previously described and newly-recognized GRLs and DUF3537 proteins (Dmel, Drosophila melanogaster; Skow, Saccoglossus kowalevskii; Spur, Strongylocentrotus purpuratus; Nvec, Nematostella vectensis; Atha, Arabidopsis thaliana; see Table 1 for other species abbreviations and sequence accessions). Each plot represents the posterior probabilities of transmembrane helix and inside/outside cellular location along the protein sequence, adapted from the output of TMHMM Server v2 (Krogh et al., 2001). In several sequences an extra transmembrane segment near the N-terminus is predicted (marked by a white asterisk in the N-best prediction above the plot); this may represent the re-entrant loop helical region observed in ORCO, rather than a transmembrane region; in at least one case (SpurGRL1) the designation of this region as a TM domain leads to an atypical (and presumably incorrect) prediction of an extracellular N-terminus. Conversely, in a subset of proteins individual TM domains are not predicted (notably TM7, black asterisks above the N-best plot), which is likely due to subthreshold predictions for TM domains in these regions. In NvecGRL1, the long TM4 helix (which projects into the intracellular space in ORCO [Butterwick et al., 2018]) is mis-predicted as two TM domains (dashed red line). Independent membrane topology predictions for unicellular species’ GRLs were obtained using TOPCONS (Supplementary file 2), with largely consistent results.

Table 1
Candidate GRLs in unicellular eukaryotes.

Protein sequences are provided in Supplementary file 1. Protein nomenclature is provisional and does not imply orthology between species.

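Kingdom | Phylum | Species | Isolate | Alternative name | Common name | Provisional protein name | Accession/version
Fungi | Chytridiomycota | Spizellomyces punctatus | DAOM BR117 | | chytrid fungus | SpunGRL1 | XP_016607089.1
Fungi | Chytridiomycota | Spizellomyces palustris | CBS 455.65 | Phlyctochytrium palustre | chytrid fungus | SpalGRL1 | TPX68946.1
Protista | Amoebozoa | Protostelium aurantium var. fungivorum | - | Planoprotostelium fungivorum | - | PfunGRL1 | PRP89608.1
Protista | Apusozoa | Thecamonas trahens | ATCC 50062 | Amastigomonas trahens | zooflagellate | TtraGRL1 | XP_013761079.1
Protista | Apusozoa | Thecamonas trahens | ATCC 50062 | Amastigomonas trahens | zooflagellate | TtraGRL2 | XP_013753662.1
Protista | Apusozoa | Thecamonas trahens | ATCC 50062 | Amastigomonas trahens | zooflagellate | TtraGRL3 | XP_013759733.1 (trimmed)
Protista | Apusozoa | Thecamonas trahens | ATCC 50062 | Amastigomonas trahens | zooflagellate | TtraGRL4 | XP_013759396.1
Protista | Apusozoa | Thecamonas trahens | ATCC 50062 | Amastigomonas trahens | zooflagellate | TtraGRL5 | XP_013757274.1
Protista | Apusozoa | Thecamonas trahens | ATCC 50062 | Amastigomonas trahens | zooflagellate | TtraGRL6 | XP_013755387.1
Protista | Incertae sedis/Chromerida (superphylum: Alveolata) | Vitrella brassicaformis | CCMP3315 | - | chromerid | VbraGRL1 | CEM13019.1
Protista | Incertae sedis/Chromerida (superphylum: Alveolata) | Vitrella brassicaformis | CCMP3315 | - | chromerid | VbraGRL2 | CEL93132.1
Protista | Incertae sedis/Chromerida (superphylum: Alveolata) | Vitrella brassicaformis | CCMP3315 | - | chromerid | VbraGRL3 | CEM19221.1
Protista | Incertae sedis/Chromerida (superphylum: Alveolata) | Vitrella brassicaformis | CCMP3315 | - | chromerid | VbraGRL4 | CEM01650.1
Protista | Incertae sedis/Chromerida (superphylum: Alveolata) | Vitrella brassicaformis | CCMP3315 | - | chromerid | VbraGRL5 | CEM10760.1
Protista | Incertae sedis/Chromerida (superphylum: Alveolata) | Vitrella brassicaformis | CCMP3315 | - | chromerid | VbraGRL6 | CEM25255.1
Plantae | Chlorophyta | Chloropicon primus | - | | | CpriGRL1 | QDZ19318.1
Plantae | Chlorophyta | Micromonas pusilla | CCMP1545 | Chromulina pusilla | | MpusGRL1 | XP_003054778.1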

To further assess these candidate homologs (which we refer to as GRLs hereafter), we used their sequences to build and compare Hidden Markov Models (HMMs) using HHblits (Remmert et al., 2012), a remote homology detection tool that is more sensitive than BLAST (Steinegger et al., 2019) (see Materials and methods). We constructed HMMs for candidate GRLs, as well as a representative set of animal GRLs and DUF3537 proteins. Each HMM was used as a query to perform all-versus-all alignments. A similarity matrix comparing these alignments was compiled by parsing the probabilities of each alignment (Figure 1—figure supplement 1). As expected, alignments of HMMs seeded by animal and plant proteins each form clusters of high probability similarity, although we also detected similarity between these clusters, indicative of homology. Importantly, HMMs of all new candidate sequences display significant similarity to those of multiple animal and/or plant proteins. Some candidates clustered more closely with the animal sequences, while others displayed similar probabilities with both animal and plant proteins, an observation consistent with phylogenetic analyses presented below.

We also examined the candidate homologs for the only known primary sequence feature of animal GRLs, a short motif located in the C-terminal half of TM7: (T/S)Yhhhhh(Q/K/E)(F/L/M), where h denotes a hydrophobic amino acid (Robertson, 2015). This motif is diagnostic, but not definitive: many insect tuning ORs (as well as some GRs/GRLs) have divergent amino acids at some or all four positions (Robertson, 2019; Scott et al., 2001). Structural and functional analyses of a subset of residues of this motif in ORCO indicate that the TY residues form part of the interaction interface between subunits (Figure 2A), and that their mutation is detrimental to function in some, but not all, combinations of subunits (Butterwick et al., 2018; Nakagawa et al., 2012). The terminal L residue of the motif is part of the channel gate (Figure 2A), and its mutation alters ion permeation selectivity (Butterwick et al., 2018). These observations suggest that divergence from the GRL motif in a given protein could be compatible with a conserved function as an ion channel, albeit with different complex assembly and biophysical properties. Bearing these observations in mind, candidate GRLs were inspected for this motif, as described below (Figure 2B). Finally, the corresponding genes were examined for the existence of the 3’ introns characteristic of the animal genes (Robertson, 2015; Saina et al., 2015); this analysis was ultimately uninformative because of the scarcity of introns in most of these organisms (Roy and Gilbert, 2006; Figure 2B).
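As an illustration of how the TM7 motif described above can be checked in a sequence, the following is a minimal Python sketch rather than the procedure used in this study. The set of residues treated as hydrophobic (`HYDROPHOBIC`) is an assumption, since the text does not enumerate them, and the example sequences are made up for demonstration.

```python
import re

# Assumption: the text only says "h denotes a hydrophobic amino acid";
# this particular residue set is an illustrative choice.
HYDROPHOBIC = "AVILMFWYC"

# (T/S)Yhhhhh(Q/K/E)(F/L/M) written as a regular expression.
TM7_MOTIF = re.compile(rf"[TS]Y[{HYDROPHOBIC}]{{5}}[QKE][FLM]")

# The four diagnostic positions of the motif, as 0-based offsets
# within a 9-residue window, with the residues accepted at each.
KEY_POSITIONS = {0: "TS", 1: "Y", 7: "QKE", 8: "FLM"}


def has_full_motif(sequence: str) -> bool:
    """Return True if the complete motif occurs anywhere in the sequence."""
    return TM7_MOTIF.search(sequence.upper()) is not None


def conserved_key_positions(window: str) -> int:
    """Count how many of the four diagnostic positions match in a 9-residue window."""
    if len(window) != 9:
        raise ValueError("window must be exactly 9 residues long")
    window = window.upper()
    return sum(window[offset] in allowed for offset, allowed in KEY_POSITIONS.items())


# Made-up sequences for demonstration only.
print(has_full_motif("AAATYLLLLLQFAAA"))
print(conserved_key_positions("AFLLLLLRF"))
```

Counting matches at the four diagnostic positions, rather than requiring a full regular-expression match, reflects the point made above that many receptors diverge at some of these positions.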

Figure 2
Conservation and divergence in TM7 features and GRL phylogeny.

(A) Side and top views of the cryo-EM structure of the ORCO homotetramer (Butterwick et al., 2018), in which the TM7 motif amino acid side chains are shown in stick format and colored red or orange. The region in the dashed blue box, representing the extracellular entrance to the ion channel pore, is shown in a magnified view on the far right. (B) Multiple sequence alignment of the C-terminal region (encompassing TM7) of unicellular eukaryotic GRLs and selected animal GRLs and plant DUF3537 proteins. Tadh, Trichoplax adhaerens; other species abbreviations are defined in Figure 1 and Table 1. The TM7 motif consensus amino acids (and conservative substitutions) are indicated below the alignment; h indicates a hydrophobic amino acid. Red dashed lines on the alignment indicate positions of predicted introns within the corresponding transcripts. Intron locations are generally conserved within sequences from different Kingdoms, but not between Kingdoms; many Protista sequences do not have introns in this region. (C) Maximum likelihood phylogenetic tree of unicellular eukaryotic GRLs, and selected animal GRLs and plant DUF3537 proteins, with aBayes branch support values. Although the tree is represented as rooted, the rooting is highly uncertain. Protein labels are in black for animals, orange for fungi, blue for protists and green for plants. The scale bar represents one substitution per site.

Candidate GRLs from unicellular eukaryotes

The fungal kingdom is thought to be the closest relative to Animalia (Baldauf and Palmer, 1993). A single candidate GRL was identified in each of two fungi, Spizellomyces punctatus (Russ et al., 2016) and Spizellomyces palustris (van de Vossenberg et al., 2019). The fungal proteins exhibit the secondary structural features of GRLs (Figure 1), but only one of the four positions of the TM7 motif is conserved (Figure 2B). A 3’ intron is present, but not in the same position as the characteristic last intron of animal GRL genes (Figure 2B). Both of these species are chytrids, an early diverging lineage of fungi that retains some features of the last common opisthokont ancestor of animals and fungi (Medina and Buchler, 2020). Chytrids have diverse lifestyles, but are notable for their reproduction via zoospores, which use a motile cilium to swim or crawl.

Taxonomic classification of many single-celled eukaryotes remains unresolved, and we use the term Protista to cover all unicellular species that are neither Fungi nor Plantae. Three such species were found to encode GRLs. The marine gliding zooflagellate Thecamonas trahens (Cavalier-Smith and Chao, 2010; Howe et al., 2020) has six candidate proteins. Beyond secondary structural similarity (Figure 1), the proteins have up to three conserved residues of the TM7 motif (if counting a Y→F substitution in the second position as conservative, as observed in some animal GRLs (e.g., SpurGRL1)) (Figure 2B). The chromerid Vitrella brassicaformis (Woo et al., 2015), a free-living, non-parasitic photosynthetic protist, also has six GRLs, most of which have two conserved positions within the TM7 motif. Finally, the amoebozoan Protostelium aurantium var. fungivorum (Hillmann et al., 2018), has a single GRL. Protosteloid amoebae differ from dictyostelids by producing simple fruiting bodies with only one or few single stalked spores.

DUF3537 domain proteins are widely (and possibly universally) encoded in higher plant genomes, typically comprising small families of 4–12 members (Benton, 2015). Single proteins were also found in unicellular plants (‘green algae’), including the marine microalgae Chloropicon primus (Lemieux et al., 2019) and Micromonas pusilla (van Baren et al., 2016; Worden et al., 2009). As for higher plant sequences, the TM7 motif is largely unrecognizable. The relationship between DUF3537 proteins and the GRL superfamily may have been overlooked earlier because protein alignments are impeded by the longer length of several intracellular loops (IL) of the plant proteins, notably IL3 in all DUF3537 proteins, and IL2 in the green algal proteins (~200 residues in the C. primus homolog) (Figure 1B). We note that ORCO is also distinguished from other insect ORs and GRs by an additional ~60–70 amino acids in IL2 (Benton et al., 2006), a region that contributes to channel regulation (Bahk and Jones, 2016; Mukunda et al., 2014).

Phylogenetic analysis of candidate GRLs

The candidate GRLs are extremely divergent in primary sequence: pairwise alignment of the new proposed family members, together with representative animal and plant proteins, reveals as little as 10% amino acid identity. While this divergence does not preclude their definition as homologs – insect OR families themselves have an average of only ~20% amino acid identity (Butterwick et al., 2018) – it makes it difficult to infer homology from sequence alone, and hinders confident multiprotein alignment. Indeed, an alignment of these 17 sequences, together with selected animal GRLs and plant DUF3537 proteins, highlights the absence of any universally conserved residues, beyond the hydrophobic regions predicted to be TM domains (Figure 2—figure supplement 1). These TM regions are most confidently aligned in the C-terminal halves of the proteins, with substantial variation in loop lengths fragmenting the N-terminal halves (Figure 2—figure supplement 1). This pattern of conservation along the protein length is characteristic of insect OR/GR families (Robertson, 2015; Robertson, 2019; Saina et al., 2015), which might reflect the role of the N-terminal half in ligand-recognition (and commensurate higher divergence between proteins) and the C-terminal half in mediating subunit interactions and forming the ion channel pore (Butterwick et al., 2018).
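For readers unfamiliar with how a pairwise identity figure such as those quoted above is obtained, the following Python sketch computes percent identity from two already-aligned sequences. It is an illustration only. The choice of denominator (columns in which neither sequence has a gap) is an assumption, as several conventions exist and the text does not state which was used, and the example strings are made up.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: identity is normalized by the number of gap-free columns;
    other conventions (e.g. dividing by the shorter sequence length) give
    different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned fragments: 4 identical residues out of 5 gap-free columns.
print(percent_identity("ACDE-F", "ACDQGF"))
```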

To gain an initial idea of the phylogeny of these proteins, we inferred a maximum likelihood tree (Figure 2C). As expected, the animal and plant/algal proteins form distinct clades. The sets of GRLs of V. brassicaformis and T. trahens segregate into two lineages, one of which (comprising VbraGRL1/2 and TtraGRL1/2/3) is more closely related to animal GRLs, while the others are more distantly related; this distinction matches observations from the alignments of HMMs derived from these sequences (Figure 1—figure supplement 1). Low branch support does not allow for a confident placement of the fungal GRLs; they could group with either of the two protist lineages. All of these observations hold if the tree is inferred from trimmed alignments (Figure 2—figure supplement 2).

Common three-dimensional structural predictions of GRLs and DUF3537 proteins

To obtain further evidence supporting the homology of these proteins, we performed ab initio structure predictions of animal GRLs (including AbakORCO as control), plant DUF3537 proteins and unicellular eukaryotic GRLs, using the transform-restrained Rosetta (trRosetta) algorithm (Yang et al., 2020). Most query sequences successfully seeded a multisequence alignment to permit extraction of co-evolutionary couplings, generation of inter-residue contact maps and prediction of three-dimensional models (Figure 3A–B, Supplementary file 7; full outputs of modeling are provided in the Dryad repository doi:10.5061/dryad.s7h44j15f). The contact maps predicted consistent patterns of anti-parallel packing of TM helices (Figure 3A). Concordantly, the top-predicted models were qualitatively similar to the AbakORCO cryo-EM structure (Figure 3B), particularly in the transmembrane core of these models (Figure 3B). Importantly, the consistent helical packing of GRLs and DUF3537 proteins is fundamentally different to an unrelated seven TM protein, the Homo sapiens Adiponectin Receptor 1 (HsapAdipoR1), which – despite sharing the same membrane orientation as GRLs – displays an arrangement of helices that is convergent with that of GPCRs (Hopf et al., 2015; Vasiliauskaité-Brooks et al., 2017; Figure 3B). We confirmed these observations first by heuristic searches of the Protein Data Bank (PDB) with Dali (Holm and Rosenström, 2010): strikingly, essentially all models identified the AbakORCO cryo-EM structure as the top hit (Supplementary file 7; outputs of Dali searches are provided in the Dryad repository doi:10.5061/dryad.s7h44j15f). Second, quantitative pairwise comparisons of selected structures using both Dali and TM-align (Zhang and Skolnick, 2005) indicated that all family members are likely to adopt the same global protein fold, while displaying only random structural similarity to our negative control HsapAdipoR1 (Figure 3C).

Figure 3
Ab initio structural predictions of GRLs and DUF3537 proteins.

(A) Inter-residue contact maps from trRosetta analysis of the indicated proteins. The axes represent the indices along the primary sequence; the positions of the predicted TM domains are shown in the schematics. The representation is mirror-symmetric along the diagonal; in one half ‘lines’ of contacts perpendicular to the diagonal of the map support the existence of anti-parallel alpha-helical transmembrane packing arrangements. Most pairs of predicted anti-parallel TMs are conserved across the proteins, despite variation in the length of loops between TM domains, supporting a globally similar packing of TM helices. The output of trRosetta analyses for these and other proteins is summarized in Supplementary file 7 and complete datasets are provided in the Dryad repository (doi:10.5061/dryad.s7h44j15f). (B) Side and top views of the experimentally determined structures of AbakORCO (PDB 6C70 chain A) and Homo sapiens Adiponectin Receptor 1 (HsapAdipoR1; PDB 5LXG chain A [Vasiliauskaité-Brooks et al., 2017]), and of the top trRosetta protein models of GRL and DUF3537 proteins. All GRL/DUF3537 proteins have a similar predicted global packing of TM domains (which is particularly evident in the top view in which the seven TM domains are labelled), despite variation in lengths of the loops and N-terminal regions (colored in dark blue). By contrast, HsapAdipoR1 has a fundamentally different arrangement of TM domains. The dashed ovals on the AbakORCO model highlight the extracellular loop 2 (EL2) and intracellular loop 2 (IL2) regions that were not visualized in the ORCO cryo-EM structure (Butterwick et al., 2018). (C) Quantitative pairwise comparisons of the structures shown in (B) using TM-align (Zhang and Skolnick, 2005) and Dali (Holm and Rosenström, 2010). TM-scores of 0.0–0.30 indicate random structural similarity; TM-scores of 0.5–1.00 indicate that the two proteins adopt generally the same fold (1.00 represents a perfect match). Dali Z-scores of <2 indicate spurious similarity. In both cases, these quantitative cut-offs are not stringent, and must be used as a guide in combination with other criteria (e.g., evidence for homology based upon primary sequence comparisons). The two half-matrices are colored using different scales.
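The cut-offs quoted in this legend can be written as a small helper, shown here as a hedged Python sketch. The function names are hypothetical, and the label used for TM-scores between 0.30 and 0.5 ("indeterminate") is an assumption, because the legend does not classify that range.

```python
def interpret_tm_score(score: float) -> str:
    """Classify a TM-score using the guide values quoted in the legend."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("TM-scores lie between 0 and 1")
    if score <= 0.30:
        return "random structural similarity"
    if score >= 0.5:
        return "same general fold"
    # Assumption: the legend gives no interpretation for 0.30 < score < 0.5.
    return "indeterminate"


def dali_is_spurious(z_score: float) -> bool:
    """Dali Z-scores below 2 are treated as spurious similarity."""
    return z_score < 2


# Made-up values for demonstration only.
print(interpret_tm_score(0.75))
print(dali_is_spurious(1.5))
```

As the legend notes, these thresholds are guides and would normally be weighed together with other evidence, so a helper like this is a first-pass filter at most.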

We extended this analysis by generating protein models with an independent algorithm, RaptorX (Källberg et al., 2012), and assessing these using Dali PDB searches. Where reliable models were generated (excluding those resulting from multisequence alignments consisting of only a few proteins), AbakORCO was again repeatedly identified as the top hit for animal and unicellular eukaryotic GRLs and plant DUF3537 proteins (Supplementary file 7; full outputs of modeling are provided in the Dryad repository doi:10.5061/dryad.s7h44j15f).

It is important to recognize that the trRosetta and RaptorX models of unicellular eukaryotic GRLs derive from multisequence alignments containing large numbers of animal proteins. As such, construction of these models necessarily depends upon amino acid covariation within the animal GRL family. At the level of the global fold, this is only problematic if the query sequences are not homologous to the other sequences in the alignment, a possibility that is inconsistent with our primary and secondary sequence similarities, HMM alignments and phylogenetic analysis.

Importantly, the models of the plant proteins used information extracted from alignments of only other DUF3537 family members, presumably because the primary sequence similarity of members of the more divergent plant and animal proteins is below the threshold of both trRosetta and RaptorX algorithms. The observation that the independently-generated plant protein models are also similar to AbakORCO argues that, despite their high sequence divergence, animal and protist GRLs and plant DUF3537 proteins all adopt a common three-dimensional architecture.

Discussion

Claims of evolutionary relationships between proteins whose sequence identity resides in the ‘twilight zone’ must be made with caution (Rost, 1999). Nevertheless, our primary, secondary and tertiary structural analyses together support the hypothesis that animal ORs/GRs/GRLs, newly-recognized unicellular eukaryotic GRLs, and plant DUF3537 proteins are homologous. The extremely sparse phylogenetic distribution of these genes in unicellular eukaryotes, despite our efforts to perform exhaustive searches, has multiple potential explanations. These genes might have been independently lost in many lineages and/or have diverged beyond sequence-based homology detection levels. It is also possible that some homologs were acquired by lateral gene transfer, as has been proposed to explain the patchy phylogenetic distribution of microbial rhodopsins (Gavelis et al., 2017).

The global conservation of the structural features of unicellular eukaryotic GRLs and DUF3537 proteins with insect chemosensory receptors suggests that they are also ligand-gated ion channels. There is currently insufficient knowledge of the biology of the unicellular eukaryotes in which GRLs have been found – let alone what might be common to these species – to predict ligands or physiological functions. It is possible, if not likely, that the proteins fulfill distinct roles in different phyla, as has been suggested in non-Bilateria, where some GRLs appear to act during development (Saina et al., 2015). In A. thaliana, transcriptomic analysis indicates that DUF3537 genes have various tissue-specific expression patterns, including in leaf guard cells, roots and pollen grains (Schmid et al., 2005). For one broadly expressed plant protein (AT4G22270), GFP-tagging revealed localization in the plasma membrane (Guan et al., 2009), and transgenic overexpression appeared to promote organ growth (Guan et al., 2009). However, the significance of this phenotype will require validation by loss-of-function genetic analysis. Together, the available evidence indicates that even if members of this superfamily have a common function as ligand-gated ion channels, they are likely to recognize very diverse chemicals that are potentially of environmental and/or internal origin.

While functional studies remain a future challenge, the recognition of these proteins across Animalia, Plantae, Fungi and Protista provides evidence that this protein superfamily originated in the last common eukaryotic ancestor, 1.5–2 billion years ago (Hedges et al., 2006). No sequences resembling GRLs have so far been found in Bacteria or Archaea. Future analysis of additional genomic data will help to update the present survey and refine (or refute) the current evolutionary model.

Materials and methods

Identification and assessment of candidate GRL homologs

Candidate GRL sequences from unicellular eukaryotes were initially identified by searches of the GenBank RefSeq non-redundant protein sequence database (which includes sequences from 573 protozoan and 13,970 fungal species). The iterative search algorithms PSI-BLAST (Altschul et al., 1997) and DELTA-BLAST (Boratyn et al., 2012) were used, with a range of divergent animal GRLs (Pfam 7tm_6 (PF02949) and 7tm_7 domain (PF08395)) and plant DUF3537 proteins (PF12056) as queries. To avoid convergence onto hits from Animalia or Plantae, sequences from Metazoa (taxid:33208) or Viridiplantae (taxid:33090) were generally excluded from the search set. We retrieved sequences that had a query coverage of >50% and an E-value of <0.05. Sequences were subject to initial assessment based on their fulfillment of most or all of the following properties: (i) reciprocal BLASTP (or PSI-BLAST) with the candidate as query of Metazoan and Plantae datasets identified a known GRL or DUF3537 sequence as a top hit and/or no significant similarity to other protein families (e.g., distinct types of ion channels or transporters); (ii) the candidate sequence is ~350–500 amino acids long, similar to the vast majority of GRLs and DUF3537 proteins (the TtraGRL3 accession (XP_013759733.1) is 1014 amino acids but this was reannotated to 440 amino acids by trimming a large C-terminal region, which is encoded by the exon of a 3’ gene); (iii) the candidate sequence is predicted to contain seven TM domains, an intracellular N-terminus, and generally longer intracellular loops than extracellular loops (as described for insect ORs [Otaki and Yamamoto, 2003]; note this analysis was published before recognition of the inverted topology of the insect proteins). Membrane topology predictions were made with both the TMHMM Server v2.0 (Krogh et al., 2001; Figure 1B) and TOPCONS (Bernsel et al., 2009; Supplementary file 2). 
As described previously for Arthropoda ORs/GRs and animal GRLs (Saina et al., 2015), TM domains are not always reliably predicted, so visual inspection of the output plots was essential to recognize hydrophobic regions that fell below the threshold for TM domain assignment. Conversely, an N-terminal helical region that forms a re-entrant loop in ORCO (Butterwick et al., 2018) was often mispredicted to be a TM domain (see Figure 1B legend).
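The retrieval thresholds and structural criteria described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the `Hit` record and its field names are hypothetical, the approximate length range is treated as a hard 350–500 window, and the manual steps (reciprocal BLAST assessment and visual inspection of topology plots) are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one search hit; field names are illustrative."""
    accession: str
    query_coverage: float   # percent of the query covered by the hit
    evalue: float
    length: int             # amino acids
    predicted_tm: int       # number of predicted TM domains
    n_terminus_inside: bool


def passes_retrieval(hit: Hit) -> bool:
    """Thresholds from the text: query coverage >50% and E-value <0.05."""
    return hit.query_coverage > 50 and hit.evalue < 0.05


def structural_criteria_met(hit: Hit) -> int:
    """Count how many of three automated structural criteria a hit fulfills."""
    checks = [
        350 <= hit.length <= 500,   # assumption: approximate range used as a hard cut
        hit.predicted_tm == 7,
        hit.n_terminus_inside,
    ]
    return sum(checks)


# Made-up hits for demonstration only.
hits = [
    Hit("hypothetical_1", 72.0, 1e-4, 440, 7, True),
    Hit("hypothetical_2", 35.0, 1e-6, 410, 7, True),
    Hit("hypothetical_3", 80.0, 0.01, 1014, 6, True),
]
for hit in hits:
    if passes_retrieval(hit):
        print(hit.accession, structural_criteria_met(hit))
```

Returning a count rather than a pass/fail flag mirrors the "most or all" wording of the criteria, and leaves room for the manual reassessment described above, such as a sequence that only falls in the length range after trimming or a TM domain that is predicted below threshold.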

Retained hits from unicellular eukaryotes were used as queries in further PSI-BLAST/DELTA-BLAST searches, as well as in TBLASTN searches (Gertz et al., 2006) of the corresponding species’ genomes to identify unannotated protein coding sequences (the latter approach ultimately found none in the present analysis). New candidate sequences were subject to the same assessment as described above. Although our searches were very broad phylogenetically, the extreme divergence in the primary sequence of these proteins and the relatively stringent criteria for retaining hits – to avoid excessive numbers of spurious matches with other polytopic membrane proteins – make it highly likely that additional members of the family exist.

To obtain further evidence for homology between the identified sequences, HMMs were constructed with HHblits with default parameters (Remmert et al., 2012), using three iterations over the Uniclust30 database (Mirdita et al., 2017; Steinegger and Söding, 2017). The results of these iterative searches were examined to verify that no additional candidate GRL sequences had been identified (Supplementary file 3). The HMMs resulting from the iterative searches were aligned pairwise using HHblits to obtain a matrix of homology probabilities (code provided in Supplementary file 4).
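As a rough illustration of how pairwise HMM-HMM comparison results could be assembled into a matrix of homology probabilities, the sketch below uses plain Python. It is not the code in Supplementary file 4; the sequence names, the probability values, the choice to record missing pairs as 0.0, and the averaging of the two search directions are all assumptions made for this example.

```python
def probability_matrix(names, pairwise):
    """Build an n x n matrix from {(query, target): probability} entries.

    Pairs with no reported alignment are left at 0.0 (an assumption).
    """
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for (query, target), probability in pairwise.items():
        matrix[index[query]][index[target]] = probability
    return matrix


def symmetrize(matrix):
    """Average the query->target and target->query values for each pair."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)] for i in range(n)]


# Hypothetical names and probabilities (in percent), for illustration only.
names = ["GRL_a", "GRL_b", "DUF3537_c"]
pairwise = {
    ("GRL_a", "GRL_a"): 100.0,
    ("GRL_b", "GRL_b"): 100.0,
    ("DUF3537_c", "DUF3537_c"): 100.0,
    ("GRL_a", "GRL_b"): 98.0,
    ("GRL_b", "GRL_a"): 96.0,
    ("GRL_a", "DUF3537_c"): 80.0,  # found in one direction only
}

homology = symmetrize(probability_matrix(names, pairwise))
```

Averaging penalizes pairs detected in only one direction; taking the maximum of the two directions would be an equally defensible choice.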

Intron positions were identified by analysis of the predicted gene structure of the coding sequence of each GRL, which was obtained from the corresponding GeneID page for each GenBank Accession.
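For readers who wish to place intron positions on a protein sequence, the conversion from a gene structure can be sketched as follows. This is an illustrative calculation rather than the procedure used here (positions were read from the GenBank gene structures); the function name and the exon lengths are hypothetical. Each intron is reported as the number of complete codons preceding it and its phase (0, between codons; 1 or 2, after the first or second nucleotide of a codon).

```python
def intron_positions(cds_exon_lengths):
    """Return (complete_codons_before_intron, phase) for each intron.

    cds_exon_lengths: coding lengths of consecutive exons, in nucleotides.
    """
    positions = []
    cumulative = 0
    # There is one intron after every exon except the last.
    for length in cds_exon_lengths[:-1]:
        cumulative += length
        positions.append((cumulative // 3, cumulative % 3))
    return positions


# Hypothetical three-exon coding sequence.
example = intron_positions([300, 151, 449])
```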

Phylogenetic analysis

The multiprotein alignment was built with MAFFT v7.310 (option ‘linsi’) (Katoh and Standley, 2013), and a maximum likelihood tree was inferred with IQTree v.2.0.6 (Minh et al., 2020) with aBayes branch support values (Anisimova et al., 2011). Alignment trimming for Figure 2—figure supplement 2 was performed with trimAl (option ‘gappyout’) (Capella-Gutierrez et al., 2009). Raw untrimmed and trimmed sequence alignments are provided in Supplementary files 5 and 6. Alignments were visualized in Jalview 2.9.0b2 (Waterhouse et al., 2009) and trees in phylo.io (Robinson et al., 2016).
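One way such a workflow might be scripted is sketched below. The exact command lines used in this study are not stated, so the following are assumptions: the executable names (`mafft`, `trimal`, `iqtree2`) depend on the local installation, the file names are placeholders, and no substitution model is set because none is specified in the text. The options shown are the standard ones for L-INS-i alignment, ‘gappyout’ trimming and aBayes support.

```python
import subprocess


def mafft_linsi_cmd(fasta_in):
    # L-INS-i corresponds to local pairwise alignment with iterative refinement.
    # MAFFT writes the alignment to standard output.
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_in]


def trimal_gappyout_cmd(aln_in, aln_out):
    return ["trimal", "-in", aln_in, "-out", aln_out, "-gappyout"]


def iqtree_abayes_cmd(aln_in):
    # No substitution model option is given, as none is stated in the text.
    return ["iqtree2", "-s", aln_in, "--abayes"]


def run_mafft(fasta_in, aln_out):
    """Run the alignment and capture standard output in aln_out."""
    with open(aln_out, "w") as handle:
        subprocess.run(mafft_linsi_cmd(fasta_in), stdout=handle, check=True)


# Placeholder file names, for illustration only.
commands = [
    mafft_linsi_cmd("sequences.fasta"),
    trimal_gappyout_cmd("alignment.fasta", "alignment.trimmed.fasta"),
    iqtree_abayes_cmd("alignment.fasta"),
]
```

Building the argument lists separately from running them makes the intended options easy to inspect and record before any program is executed.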

Protein structure prediction

Ab initio protein structure prediction was performed using trRosetta (https://yanglab.nankai.edu.cn/trRosetta/) (Yang et al., 2020) and RaptorX (http://raptorx.uchicago.edu/ContactMap/) (Källberg et al., 2012). For both algorithms, individual sequences of unicellular eukaryotic GRLs, plant DUF3537 proteins and selected animal GRLs were provided as queries. The complete outputs of the analyses are provided in the Dryad repository (doi:10.5061/dryad.s7h44j15f), and summarized in Supplementary file 7. Models were not analyzed further if the query sequence seeded a multiprotein sequence alignment containing very few (<15) sequences or, for trRosetta models, if the ‘estimated TM-score’ was <0.17, which indicates a spurious structural model (Yang et al., 2020; Zhang and Skolnick, 2004). The top predicted models from trRosetta (model 1) and RaptorX (the model with the lowest estimated RMSD) were used in heuristic Protein Data Bank (PDB) searches with Dali (Holm and Rosenström, 2010). The top hits of these searches are shown in Supplementary file 7 and full search results are provided in the Dryad repository (doi:10.5061/dryad.s7h44j15f).
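The two exclusion criteria for predicted models (alignments with fewer than 15 sequences and, for trRosetta, an estimated TM-score below 0.17) can be expressed as a small helper. This is a sketch under stated assumptions, not code from the study: the function name and the example values are hypothetical, and passing `None` for the estimated TM-score is simply this example's way of representing a model for which that quantity is not reported (as for RaptorX).

```python
def keep_model(msa_depth, estimated_tm_score=None, min_depth=15, min_tm_score=0.17):
    """Return True if a predicted model would be retained for further analysis.

    msa_depth: number of sequences in the alignment seeded by the query.
    estimated_tm_score: trRosetta's estimated TM-score, or None if not applicable.
    """
    if msa_depth < min_depth:
        return False
    if estimated_tm_score is not None and estimated_tm_score < min_tm_score:
        return False
    return True


# Hypothetical examples.
decisions = {
    "deep_alignment_good_score": keep_model(1200, 0.45),
    "shallow_alignment": keep_model(9, 0.30),
    "low_estimated_score": keep_model(200, 0.10),
    "no_score_reported": keep_model(200),
}
```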

Pairwise structural similarities of selected trRosetta-predicted models were assessed with TM-align (Zhang and Skolnick, 2005) and Dali (Holm and Rosenström, 2010), including the experimentally-determined ORCO structure (PDB 6C70) (Butterwick et al., 2018) and, as ‘negative’ control, the Homo sapiens Adiponectin Receptor 1 structure (HsapAdipoR1; PDB 5LXG chain A [Vasiliauskaité-Brooks et al., 2017]). Models were visualized in PyMol v2.4.0.
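For orientation, the TM-score (Zhang and Skolnick, 2004) for a given superposition is the sum over aligned residue pairs of 1/(1 + (d/d0)^2), divided by the length of the reference protein, where d is the distance between the paired residues and d0 = 1.24(L - 15)^(1/3) - 1.8. The sketch below evaluates this expression for a fixed set of distances; it does not perform the search over superpositions that TM-align carries out. The lower bound of 0.5 on d0 is a guard added for this sketch, and the interpretation bands follow the figure-legend guidance (scores up to 0.30 indicate random similarity; scores of 0.5 or above indicate generally the same fold).

```python
def d0(length):
    """Length-dependent distance scale; the 0.5 floor is this sketch's guard."""
    if length <= 15:
        return 0.5
    return max(1.24 * (length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances, length):
    """TM-score for one superposition.

    distances: distances (in angstroms) between aligned residue pairs.
    length: number of residues in the reference protein.
    """
    scale = d0(length)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / length


def interpret(score):
    """Coarse interpretation bands for a TM-score."""
    if score >= 0.5:
        return "same fold"
    if score <= 0.30:
        return "random-like"
    return "intermediate"


perfect = tm_score([0.0] * 100, 100)        # every residue superimposed exactly
half_aligned = tm_score([0.0] * 50, 100)    # only half of the residues aligned
```

Because the sum is normalized by the reference length, unaligned residues lower the score, which is why the choice of reference protein matters when two structures differ in size.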

Data availability

All data generated or analyzed during this study are included in the manuscript and supporting files. The outputs of the trRosetta and RaptorX analyses and of the subsequent Dali searches are available on Dryad (https://doi.org/10.5061/dryad.s7h44j15f).

The following data sets were generated
    1. Benton R
    2. Dessimoz C
    3. Moi D
    (2020) Dryad Digital Repository
    A putative origin of the insect chemosensory receptor superfamily in the last common eukaryotic ancestor.
    https://doi.org/10.5061/dryad.s7h44j15f

Decision letter

  1. Claude Desplan
    Reviewing Editor; New York University, United States
  2. Piali Sengupta
    Senior Editor; Brandeis University, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

The paper provides significant insights into the superfamily of olfactory receptors and how it evolved: Your discovery of GRLs in multiple unicellular organisms supports the claim that this is a very old family, even if the sequence conservation is pretty low. However, a major advance results from your analysis of the tertiary structure of these proteins that takes advantage of the power of Rosetta to provide evidence that the GRL proteins are distant members of the same superfamily. This represents a significant advance in our understanding of the origins of this superfamily of proteins.

Decision letter after peer review:

Thank you for submitting your article "A putative origin of insect chemosensory receptors in the last common eukaryotic ancestor" for consideration by eLife. Your article has been reviewed by four peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Piali Sengupta as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.

Summary:

The reviewers found that the paper provides significant insights into this family of receptors: First, your discovery of GRLs in multiple unicellular organisms supports the claim that you are dealing with a large family with plant homologs, although the analyses of sequence conservation remains speculative. However, the major advance results from the tertiary structures of these proteins that take advantage of the power of trRosetta to provide evidence that the GRL proteins are distant members of the same superfamily. This represents a significant advance in our understanding of the origins of this superfamily of proteins.

However, the reviewers had also two major concerns: One is the serious lack of technical details and you must provide more information about how many genomes were used in your initial search and discuss whether it was exhaustive or so stringent that more members of the family likely exist: Providing more technical details will help make the work more accessible. The second point is that functional data would be very useful, e.g. showing biochemically that distant members behave similarly to the fly proteins, or that they serve (or not!) as ligand-gated channels. If you have already acquired this type of data, they would strengthen your paper. However, a discussion of possible molecular functions would be sufficient in the absence of such data.

Reviewer #1:

Vertebrate and nematode odorant receptors (ORs) function as GPCRs, while insect ORs were derived from gustatory receptors (GRs) and function as ligand gated ion channels. However, the evolutionary origin of insect GRs is not clear. The manuscript of Benton, Dessimoz and Moi titled "A putative origin of insect chemosensory receptors in the last common eukaryotic ancestor" answered this key question. Following the previous studies that identified GR-like proteins (GRLs) in animals, and GR homologs, known as the DUF3537 domain-containing proteins in plants, they further identified and performed phylogenetic analysis on GRL proteins in unicellular eukaryotic organisms, including fungi, protists, and algae, the common ancestor of plants and animals.

Overall, the topic of this manuscript is very interesting and well written. The data are solid. Several key points have been addressed, including role of TM7, consistent predicted orientation of TM domains, presence of intracellular loops (like ORCO), conserved vs diverse regions on GRL proteins, and same origin for plant and animal GRLs. Therefore, I strongly recommend for publication, after the authors properly address the following concerns:

1) The major weakness is that there is no functional analysis. If any of GRL proteins is predicted to be a canonical chemical sensor, would it be possible to utilize Xenopus or another system to test the hypothesis?

2) If functional study is currently a big challenge, could the authors perhaps add some validation on GRL protein localization in a unicellular eukaryote? I wonder if antibody could be made and used to test membrane localization of GRL, or a tagged protein could be ectopically expressed in a cell line (or yeast).

3) "heteromeric (probably tetrameric) complexes composed of a tuning OR, which recognises odour ligands, and a universal co-receptor, ORCO" This describes a dimeric complex with one OR and one ORCO. It seems not consistent with "probably tetrameric"

4) Introduction paragraph three provides examples of non-chemosensation functions of GRL proteins. I suggest to expand and add a table or a supplemental table, which should include currently known expression patterns and functions of GR and GRL proteins in animals and plants.

Reviewer #2:

In this work, Benton and colleagues consider the evolutionary origin of the immense insect chemoreceptor family, which includes odorant receptors (ORs) and gustatory receptors (GRs). Past sequence mining from the Benton lab and others has suggested that distant members of the GRL family were found in diverse Protostomia and also homologous to a family of uncharacterized plant proteins containing the Domain of Unknown Function 3537. However, despite multiple GRL lineages being present in early branching deuterostomes, GRLs have been completely lost from the chordate lineage suggesting recurrent independent losses, obscuring their exact evolutionary trajectory. Here Benton and colleagues extend their genome mining analyses to identify 17 sequences from fungi, protista and unicellular plants that share the same overall topology and some of the poorly conserved sequence features of this family. Finally, they use the extraordinary power of trRosetta to predict candidate GRL structures from the diverse lineages de novo and demonstrate that they share the same distinct architecture as an experimental structure of an OR. By far the most impressive part of the manuscript is the structure prediction since it would argue that these distantly related members, even bearing little sequence conservation, fold into the same distinct helical arrangement. If correct, this would argue that the GRL family is incredibly ancient, originating in the last eukaryotic ancestor, 1.5-2 Billion years ago, which has important implications for thinking about how this immense family arose.

Overall, I have a few concerns that should be addressed:

1) The Materials and methods are quite sparse and require a lot of effort by the reader to appreciate how well controlled and vetted their results are. Only 17 members of the family were found across the genomes of fungi, protista and unicellular plants, derived from an even smaller subset of species, which the authors acknowledge is extremely sparse and implies either that they propagated by lateral gene transfer or were independently lost many times, making their evolutionary origin still a bit uncertain. The authors should provide more information about how many genomes were used in their initial search and discuss whether it was exhaustive or so stringent that more members of the family likely exist.

2) One complication of the limited number of sequences from unicellular eukaryotes is that the structure prediction relies on multiple sequence alignments largely built from GRs. This was not obvious from the Materials and methods. I only know this because I took one of their putative GRL sequences and submitted it to the trRosetta website and three hours later got the same structure prediction as in Figure 3 and the MSA the trRosetta algorithm used for prediction. While the algorithm for trRosetta has been previously published, for a general audience the paper would benefit from more detail about how it was used-both what was required as input (apparently just a single sequence plugged into the trRosetta website) and how to evaluate the output, beyond physical inspection. For example, in Figure 3C the assignment of proteins to their groups seems like an arbitrary delimitation without further explanation, since the score/distances between proteins are marginally different. Only in the figure legend it states: TM-scores of 0.0-0.30 indicate random structural similarity; TM-scores of 0.5-1.00 indicate that the two proteins adopt generally the same fold. The authors thus suggest a TM score of 0.27 as meaning Orco and HsapAdipoR1 are unrelated but a score of 0.53 as being indicative that VbraGRL2 and AthaAT3G20300 are part of the same structural family, but provide insufficient information to the reader to understand whether this is a stringent cutoff or not.

3) One important caveat that the authors should discuss and address is that given that the de novo structure prediction relies heavily on GR sequence covariation, is there any possibility that tertiary structural similarity is imposed onto these more distant members of the GRL family? Ideally the de novo structure prediction would be truly independent and based on similar numbers of GRL sequences from single-celled eukaryotes but this does not seem possible.

4) The central advance of this study over past work from the Benton lab (Benton, 2015; Hopf et al., 2015) is the dramatic improvement in structure prediction algorithms, which provide tantalizing information about structural similarity (barring the caveat in the point directly above.) I appreciate that the authors don't overstate their claims, suggesting that these GRL proteins may not serve the same function in different organisms but likely form ligand gated channels. To really move into novel territory, I wish the authors could probe the functional or biochemical properties of these ancient GRLs a bit further. For example, for these proteins to serve as ion channels likely requires a multimeric organization. Native gels could biochemically demonstrate this, providing powerful additional evidence that these are part of the same family. Alternatively, could sequence covariation provide evidence for this (e.g. Hopf, 2014). Either way, it would be valuable to discuss this additional feature that does not immediately fall out of the trRosetta predictions.

Reviewer #4:

Benton et al. is a well written study on the evolution of insect chemosensory receptors that uses bioinformatics-based approaches to identify putative GRL homologs in several species of unicellular eukaryotes. Both sequence and structure-based approaches are utilized to buttress the authors arguments that fungal and protista GRL homologs are an evolutionary link to DUF3537 proteins they have previously identified in plants and algae thereby extending this evolutionary relationship to "the last common eukaryotic ancestor"

While I am generally supportive of the authors rationale and recognize they have been careful to appropriately qualify their hypothesis throughout this work, I am somewhat disinclined to place a high degree of definitive value on the ab initio structural predictions which underscores much of this analysis. Even so, and despite the fact these evolutionary relationships between animal and plant GRLs are unlikely to ever be definitively tested, this hypothesis seems to me to be reasonable. That said, I remain underwhelmed by their significance.

Reviewer #5:

The insect chemoreceptor superfamily of ligand-gated ion channels is one of the largest and most diverse protein families known. Partly as a result of their extreme divergence, the evolutionary origins of the superfamily have been obscure. Following up on a previous proposal of relationship to a protein family that is widespread in plants, the authors discovered several convincingly related proteins encoded by fungal, protist, and algal genomes. While the relationship with the plant protein family remains remarkably distant, their three-dimensional modeling of these diverse proteins reveals convincing similarity and hence suggests the superfamily originated at or before the eukaryotic origin.

I have no substantive concerns. Previous objections to the distant relationship to the plant protein family on the basis of lack of three shared introns are not decisive given the rampant loss of introns in the unicellular genomes examined here. The details of the three-dimensional modeling are not my expertise, however these authors previously employed a related technique to generate a remarkably good model for the insect odorant receptors that was mostly confirmed by subsequently generated experimental structure.

https://doi.org/10.7554/eLife.62507.sa1

Author response

Summary:

The reviewers found that the paper provides significant insights into this family of receptors: First, your discovery of GRLs in multiple unicellular organisms supports the claim that you are dealing with a large family with plant homologs, although the analyses of sequence conservation remains speculative. However, the major advance results from the tertiary structures of these proteins that take advantage of the power of trRosetta to provide evidence that the GRL proteins are distant members of the same superfamily. This represents a significant advance in our understanding of the origins of this superfamily of proteins.

However, the reviewers had also two major concerns: One is the serious lack of technical details and you must provide more information about how many genomes were used in your initial search and discuss whether it was exhaustive or so stringent that more members of the family likely exist: Providing more technical details will help make the work more accessible.

We acknowledge this concern and have now provided additional technical details on the initial searches and other analyses in the Materials and methods. We further note that all code and sequence files are provided as Supplementary files, and outputs of the ab initio protein modelling are available on the Dryad repository (doi:10.5061/dryad.s7h44j15f).

We hope these efforts will clarify the search strategies taken and aid in the reproduction and extension of this work by others. Although our searches have been very broad phylogenetically, the extreme divergence in the primary sequences of these proteins and the relatively stringent criteria for retaining hits – to avoid excessive numbers of spurious matches with other polytopic membrane proteins – make it highly likely that additional members of the family exist (as we now stress in the Discussion and Materials and methods sections). In this work, we have preferred to be relatively conservative by including proteins for which several lines of evidence support their homology to insect chemosensory receptors (i.e., from amino acid sequence similarity and predicted secondary and tertiary structural analyses). Although finer scale details of the evolution of this superfamily will likely emerge in the future, we believe the current data support the central conclusion of our work (i.e., the origin of the insect chemosensory receptor superfamily in the last common eukaryotic ancestor).

The second point is that functional data would be very useful, e.g. showing biochemically that distant members behave similarly to the fly proteins, or that they serve (or not!) as ligand-gated channels. If you have already acquired this type of data, they would strengthen your paper. However, a discussion of possible molecular functions would be sufficient in the absence of such data.

We also would very much like to have functional data on these phylogenetically distant homologs, but do not have anything to add to the current manuscript. Functional characterization is far from trivial: if they are ion channels, it is unknown what ligands might gate them; if they are not channels, it is not obvious how to determine what biochemical function(s) they do possess. Our planned initial approach would be reverse genetic; while this is certainly conceivable for the plant proteins (using Arabidopsis thaliana as a model), for the fungal and protist species possessing GRL homologs, none are yet genetically accessible. Transgenesis was very recently reported in Spizellomyces punctatus (Medina et al., eLife 2020), raising hope that genome-editing approaches will soon be available in this species.

We have expanded the Discussion to discuss possible molecular functions of family members. While we feel that consideration of roles of unicellular eukaryotic GRLs would be pure speculation at this stage (little is known about the biology of these species), we do incorporate some further information on the plant homologs.

Reviewer #1:

Vertebrate and nematode odorant receptors (ORs) function as GPCRs, while insect ORs were derived from gustatory receptors (GRs) and function as ligand gated ion channels. However, the evolutionary origin of insect GRs is not clear. The manuscript of Benton, Dessimoz and Moi titled "A putative origin of insect chemosensory receptors in the last common eukaryotic ancestor" answered this key question. Following the previous studies that identified GR-like proteins (GRLs) in animals, and GR homologs, known as the DUF3537 domain-containing proteins in plants, they further identified and performed phylogenetic analysis on GRL proteins in unicellular eukaryotic organisms, including fungi, protists, and algae, the common ancestor of plants and animals.

Overall, the topic of this manuscript is very interesting and well written. The data are solid. Several key points have been addressed, including role of TM7, consistent predicted orientation of TM domains, presence of intracellular loops (like ORCO), conserved vs diverse regions on GRL proteins, and same origin for plant and animal GRLs. Therefore, I strongly recommend for publication, after the authors properly address the following concerns:

1) The major weakness is that there is no functional analysis. If any of GRL proteins is predicted to be a canonical chemical sensor, would it be possible to utilize Xenopus or another system to test the hypothesis?

As described above in response to the general comments, we also would very much like to have functional data on these phylogenetically distant homologs, but do not have anything to add to the current manuscript. Experimental characterization is far from trivial: if they are ion channels, it is unknown what ligands might gate them (necessitating large-scale chemical screening). If they are not channels, it is unclear how best to determine what biochemical function(s) they do possess. Our planned initial approach would be reverse genetic; while this is certainly conceivable for the plant proteins (using Arabidopsis thaliana as a model), for the fungal and protist species possessing GRL homologs, none are yet genetically accessible. Transgenesis was very recently reported in Spizellomyces punctatus (Medina et al., eLife 2020), raising hope that genome-editing approaches will soon be available in this species.

2) If functional study is currently a big challenge, could the authors perhaps add some validation on GRL protein localization in a unicellular eukaryote? I wonder if antibody could be made and used to test membrane localization of GRL, or a tagged protein could be ectopically expressed in a cell line (or yeast).

While it certainly would be possible to tag these proteins with GFP and express them in a heterologous cell type, we do not think such results alone would be particularly informative. It is almost certain – based upon the secondary structure predictions – that these are integral membrane proteins, but they could potentially localize anywhere within the endomembrane system. Without validation in the endogenous cell types, it would be hard to interpret whether localization patterns are real or artefactual (due to, for example, protein over-expression, an impact of the protein tag or an influence of the heterologous cellular environment). Antibodies might be an alternative tool to assess endogenous protein localization, although there has only been very limited success for generation of effective antibodies against insect receptors; moreover, this approach would require development of immunofluorescence protocols for the fungal or protist species of interest and ideally a means of validating antibody specificity (e.g., by parallel staining of genetic knock-outs of the corresponding GRL).

An early study of one of the plant proteins, A. thaliana AT4G22270, revealed that an overexpressed GFP-tagged version displayed membrane localization (Guan et al., 2009). Curiously, this study (mis)predicted the family as having four transmembrane domains and did not recognize the similarity with insect chemosensory receptors. This work also found that overexpression of AT4G22270 led to increases in the size of various plant organs, although the relevance of this phenotype (if any) remains to be confirmed by loss-of-function analysis. Nevertheless, the cellular localization may be real and we cite this work in the revised Discussion.

3) "heteromeric (probably tetrameric) complexes composed of a tuning OR, which recognises odour ligands, and a universal co-receptor, ORCO" This describes a dimeric complex with one OR and one ORCO. It seems not consistent with "probably tetrameric"

We have clarified this sentence to indicate that the tetrameric complex probably comprises two tuning OR subunits and two ORCO subunits.

4) Introduction paragraph three provides examples of non-chemosensation functions of GRL proteins. I suggest to expand and add a table or a supplemental table, which should include currently known expression patterns and functions of GR and GRL proteins in animals and plants.

To our knowledge, the work cited in this paragraph, and the revised Discussion (which incorporates further information on the plant proteins – see the comment above) encompasses all known “non-chemosensory” roles of this family. For completeness, we have now added a sentence to this paragraph on the thermosensory and light-sensing functions of D. melanogaster GR28b isoforms. At this stage, we feel that information on non-chemosensory function of members of this repertoire is simply too sparse – and the evidence for certain functions too limited – to warrant a table, which would ultimately be redundant with the information in the text.

Reviewer #2:

In this work, Benton and colleagues consider the evolutionary origin of the immense insect chemoreceptor family, which includes odorant receptors (ORs) and gustatory receptors (GRs). Past sequence mining from the Benton lab and others has suggested that distant members of the GRL family were found in diverse Protostomia and also homologous to a family of uncharacterized plant proteins containing the Domain of Unknown Function 3537. However, despite multiple GRL lineages being present in early branching deuterostomes, GRLs have been completely lost from the chordate lineage suggesting recurrent independent losses, obscuring their exact evolutionary trajectory. Here Benton and colleagues extend their genome mining analyses to identify 17 sequences from fungi, protista and unicellular plants that share the same overall topology and some of the poorly conserved sequence features of this family. Finally, they use the extraordinary power of trRosetta to predict candidate GRL structures from the diverse lineages de novo and demonstrate that they share the same distinct architecture as an experimental structure of an OR. By far the most impressive part of the manuscript is the structure prediction since it would argue that these distantly related members, even bearing little sequence conservation, fold into the same distinct helical arrangement. If correct, this would argue that the GRL family is incredibly ancient, originating in the last eukaryotic ancestor, 1.5-2 Billion years ago, which has important implications for thinking about how this immense family arose.

Overall, I have a few concerns that should be addressed:

1) The Materials and methods are quite sparse and require a lot of effort by the reader to appreciate how well controlled and vetted their results are. Only 17 members of the family were found across the genomes of fungi, protista and unicellular plants, derived from an even smaller subset of species, which the authors acknowledge is extremely sparse and implies either that they propagated by lateral gene transfer or were independently lost many times, making their evolutionary origin still a bit uncertain. The authors should provide more information about how many genomes were used in their initial search and discuss whether it was exhaustive or so stringent that more members of the family likely exist.

As described above in response to the general comments, we acknowledge this concern and have now provided additional technical details on the initial searches and other analyses in the Materials and methods. We further note that all code and sequence files are provided as Supplementary files, and outputs of the ab initio protein modelling are available on the Dryad repository (doi:10.5061/dryad.s7h44j15f).

We hope these efforts will clarify the search strategies taken and aid in the reproduction and extension of this work by others. Although our searches have been very broad phylogenetically, the extreme divergence in the primary sequence of these proteins and the relatively stringent criteria for retaining hits – to avoid excessive numbers of spurious hits with other polytopic membrane proteins – make it highly likely that additional members of the family exist (as we now stress in the Discussion and Materials and methods sections). In this work, we have preferred to be relatively conservative by including proteins for which several lines of evidence support their homology to insect chemosensory receptors (i.e., from amino acid sequence similarity and predicted secondary and tertiary structural analyses). Although finer scale details of the evolution of this superfamily will likely emerge in the future, we believe the current data support the central conclusion of our work (i.e., the origin of the insect chemosensory receptor superfamily in the last common eukaryotic ancestor).

2) One complication of the limited number of sequences from unicellular eukaryotes is that the structure prediction relies on multiple sequence alignments largely built from GRs. This was not obvious from the Materials and methods. I only know this because I took one of their putative GRL sequences and submitted it to the trRosetta website and three hours later got the same structure prediction as in Figure 3 and the MSA the trRosetta algorithm used for prediction. While the algorithm for trRosetta has been previously published, for a general audience the paper would benefit from more detail about how it was used-both what was required as input (apparently just a single sequence plugged into the trRosetta website) and how to evaluate the output, beyond physical inspection. For example, in Figure 3C the assignment of proteins to their groups seems like an arbitrary delimitation without further explanation, since the score/distances between proteins are marginally different. Only in the figure legend it states: TM-scores of 0.0-0.30 indicate random structural similarity; TM-scores of 0.5-1.00 indicate that the two proteins adopt generally the same fold. The authors thus suggest a TM score of 0.27 as meaning Orco and HsapAdipoR1 are unrelated but a score of 0.53 as being indicative that VbraGRL2 and AthaAT3G20300 are part of the same structural family, but provide insufficient information to the reader to understand whether this is a stringent cutoff or not.

The reviewer raises a number of important points, which we address individually below:

- structure predictions from multiple sequence alignments (MSAs) largely built from GRs: this reviewer reiterates this issue in the comment below, where we provide a full response.

- use of trRosetta algorithm: we provide additional details on the use and evaluation of this server in the Materials and methods. In brief, the user interface is indeed extremely simple, requiring just entry of an individual sequence, as MSAs are built automatically.

- evaluation of trRosetta output: we describe the pertinent information in Supplementary file 7 and the associated legend. The key parameter to judge the quality of the top model from trRosetta is the “estimated TM-score”. As described in the cited trRosetta paper (Yang et al., 2020), this is calculated based upon a combination of the probability of the predicted top distances and the average pairwise TM-score between the top ten models under no restraints. In test proteins of known structure, the estimated TM-score had a high correlation with the true TM-score (which is calculated based upon comparison of the model and the experimentally-determined protein structure). For proteins for which no experimental information is available (such as GRLs or DUF3537 proteins), the estimated TM-score provides a measure of predicted resemblance of the model to the real structure. While there is no firm cut-off, scores <0.17 are likely to reflect spurious protein structural models (Yang et al., 2020). In our work, as shown in Supplementary file 7, sequences that yielded MSAs with very few proteins gave commensurately extremely low estimated TM-scores (typically around 0.1); these models were not examined further. All trRosetta output files are provided in the Dryad repository (doi:10.5061/dryad.s7h44j15f).
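The quality-control step described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the procedure actually used: the record type, the model names and the numerical values are hypothetical placeholders, and only the 0.17 threshold is taken from the text.

```python
from dataclasses import dataclass

# Threshold quoted in the response above (attributed there to Yang et al., 2020).
MIN_ESTIMATED_TM = 0.17


@dataclass
class ModelSummary:
    name: str            # hypothetical label, not a real protein identifier
    estimated_tm: float  # "estimated TM-score" reported for the top model
    msa_depth: int       # number of sequences in the automatically built MSA


def keep_for_comparison(models, min_tm=MIN_ESTIMATED_TM):
    """Return only the models whose estimated TM-score exceeds the threshold."""
    return [m for m in models if m.estimated_tm > min_tm]


# Placeholder values chosen purely for illustration.
models = [
    ModelSummary("query_A", 0.62, 1500),
    ModelSummary("query_B", 0.10, 12),    # shallow MSA, very low score
    ModelSummary("query_C", 0.17, 200),   # exactly at the threshold: excluded
]

kept = keep_for_comparison(models)
print([m.name for m in kept])
```

Because there is no firm cut-off, the threshold is exposed as a parameter rather than hard-coded into the comparison.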

- evaluation of trRosetta models by structure comparisons with Dali and TM-align: for all trRosetta models that had an estimated TM-score >0.17, we assessed whether these had similarity to proteins of known structure in the Protein Data Bank using the Dali server. In all but two cases (TtraGRL4 and TtraGRL5), the ORCO cryo-EM structure was identified as the top hit, usually with a Dali Z-score (a measure of structural similarity) that is substantially higher than that of the next most similar protein fold. The results of these Dali searches are provided inside the corresponding subfolder of the trRosetta output in the Dryad repository. The consistent retrieval of ORCO by other models of animal GRs/GRLs, protist GRLs and plant DUF3537 proteins is striking and argues these proteins all adopt a similar fold. Regarding the two exceptions: the best TtraGRL4 and TtraGRL5 models identified Diablo (a HECT-type E3 ligase) and Plectin (a cytoskeletal protein) as top hits, respectively. Although these GRL models have estimated TM scores >0.17 and the Dali Z-scores are indicative of “significant similarity” (>2 (Holm et al., 2010)), these are clearly spurious matches. We note that in both cases the number of sequences used in the MSA is very low (<230) compared to models of TtraGRL1-3 (>1200).
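The check applied to each Dali search (is ORCO the top hit, is the Z-score above the significance level, and how far ahead is it of the next fold?) might be summarized as follows. This is a minimal sketch under stated assumptions: the input is a hand-built list of `(name, z_score)` pairs rather than parsed Dali output, and the hit names and Z-scores in the example are hypothetical. Only the Z > 2 significance level comes from the text.

```python
# Dali Z-scores above 2 are described in the text as "significant similarity".
SIGNIFICANT_Z = 2.0


def top_hit_summary(hits, expected="ORCO"):
    """Summarize a list of (name, z_score) hits from one structure search.

    Returns None for an empty list; otherwise a dict describing the top hit
    and its Z-score margin over the second-best hit (None if there is only one).
    """
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    top_name, top_z = ranked[0]
    margin = top_z - ranked[1][1] if len(ranked) > 1 else None
    return {
        "top_hit": top_name,
        "top_z": top_z,
        "is_expected": top_name == expected,
        "significant": top_z > SIGNIFICANT_Z,
        "margin_to_next": margin,
    }


# Hypothetical values for illustration only.
example_hits = [("other_fold", 8.5), ("ORCO", 20.0)]
print(top_hit_summary(example_hits))
```

As the two exceptions discussed above show, a significant Z-score is not sufficient on its own, so a summary like this is best read alongside the depth of the MSA behind each model.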

We further assessed structural similarity by pairwise comparisons of selected proteins (with the highest estimated TM-score) together with a negative control (AdipoR1, which has the same membrane topology as the OR/GR/GRL/DUF3537 superfamily). For Dali pairwise comparisons (top-right of Figure 3C), the Z-score is substantially higher for all comparisons within the OR/GR/GRL/DUF3537 set than with AdipoR1. Similarly, for TM-align pairwise comparisons, the OR/GR/GRL/DUF3537 comparisons all fall within the range of 0.5-1, which indicates – as described in Zhang and Skolnick NAR 2005 – that the proteins are expected to adopt the same fold (1 would be a perfect match). By contrast, comparisons with AdipoR1 fall within the range (0-0.3) indicative of spurious similarity. We tried to add these numerical ranges on the figure itself but found that it cluttered the panel and would prefer to have the full description of their meaning in the legend.
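For readers unfamiliar with the TM-score, the following sketch shows how it is computed for a fixed superposition and how the ranges quoted above would be applied. It uses the standard definition (a length-normalized sum over aligned residue pairs with a length-dependent distance scale d0). TM-align itself additionally searches for the alignment and superposition that maximize this value, which is not reproduced here. The function names and the floor applied to d0 for short chains are assumptions made for illustration.

```python
def tm_d0(l_target):
    """Length-dependent distance scale d0 (in angstroms) of the TM-score."""
    if l_target <= 21:
        # Assumed floor for very short chains; illustrative only.
        return 0.5
    return 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, l_target):
    """TM-score for one fixed superposition.

    distances: distances (angstroms) between aligned residue pairs.
    l_target:  length of the protein used for normalization.
    """
    d0 = tm_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target


def interpret_tm(score):
    """Apply the ranges quoted in the text (0-0.3 and 0.5-1)."""
    if score <= 0.30:
        return "random similarity"
    if score >= 0.50:
        return "same fold"
    return "intermediate"


# The two values highlighted by the reviewer.
print(interpret_tm(0.27))  # random similarity
print(interpret_tm(0.53))  # same fold
```

Scores between 0.3 and 0.5 fall outside both quoted ranges, which is why the sketch returns a separate "intermediate" label rather than forcing a binary call.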

We emphasize that the cut-offs of trRosetta, Dali and TM-align are defined by the developers of these algorithms based upon analysis of many test cases of proteins of known structure. To our knowledge, these cut-offs are not stringent, and must be viewed in the context of the proteins being analyzed, as many factors could impact these scores (e.g., quality of the model, quality of the experimentally-determined structure, primary sequence similarity between target and query, and domain organization of the protein (in our experience, individual proteins with large inserts in the loops were often problematic)). In our work, the tertiary structural similarity provides additional support for the homology between various proteins that were initially identified based upon primary and secondary structural similarities.

To strengthen our claims, we now provide analyses of the same set of query sequences with an independent ab initio protein folding algorithm, RaptorX, which uses distance-based protein folding driven by deep learning (Kallberg et al., 2012). While this algorithm failed to build sufficiently large MSAs for slightly more queries than trRosetta, several sequences from both protists and plants successfully yielded models that, via Dali searches, retrieved the ORCO structure as the top hit. The results of this new analysis are summarized in Supplementary file 7, and the complete output files from RaptorX, together with the results of the subsequent Dali searches, are provided in the Dryad repository (doi:10.5061/dryad.s7h44j15f).

We hope to have explained the logic of software use, our steps for quality control at each stage and the availability of the source data to allow readers to view and reproduce our results. As we are users, not testers, of the software packages, we felt it out-of-place to have a detailed description of these published algorithms in our work, but we have added additional technical details in this revision to enable a reader to appreciate our procedures for assessing the structure prediction results.

3) One important caveat that the authors should discuss and address is that given that the de novo structure prediction relies heavily on GR sequence covariation, is there any possibility that tertiary structural similarity is imposed onto these more distant members of the GRL family? Ideally the de novo structure prediction would be truly independent and based on similar numbers of GRL sequences from single-celled eukaryotes but this does not seem possible.

This is a very good point: at present, there are indeed insufficient numbers of GRL sequences from unicellular eukaryotes alone to be able to analyze amino acid co-evolution and use this information for modelling. The current models therefore necessarily depend in part upon covariation within the larger animal GR/GRL family. At the level of the global fold, this is only problematic if the query sequence is not homologous to the sequences in the alignment. We believe that the primary and secondary sequence analyses and phylogenetic analysis (in Figure 1, Figure 1—figure supplement 1, Figure 2) do support such homology, notably for the protist GRLs for which we have obtained structural models.

Importantly, the models of the plant proteins used information extracted from alignments of only other DUF3537 family members, because these are more divergent from the animal sequences than those of unicellular eukaryotes. It is therefore striking that the plant structural models are also similar to ORCO, and we infer that the entire family is likely to share the same global fold. We briefly mentioned these issues in our original manuscript but have now expanded our comments on these points in the text.

4) The central advance of this study over past work from the Benton lab (Benton, 2015; Hopf et al., 2015) is the dramatic improvement in structure prediction algorithms, which provide tantalizing information about structural similarity (barring the caveat in the point directly above.) I appreciate that the authors don't overstate their claims, suggesting that these GRL proteins may not serve the same function in different organisms but likely form ligand gated channels. To really move into novel territory, I wish the authors could probe the functional or biochemical properties of these ancient GRLs a bit further. For example, for these proteins to serve as ion channels likely requires a multimeric organization. Native gels could biochemically demonstrate this, providing powerful additional evidence that these are part of the same family. Alternatively, could sequence covariation provide evidence for this (e.g. Hopf, 2014). Either way, it would be valuable to discuss this additional feature that does not immediately fall out of the trRosetta predictions.

As described above in response to the general comments, we feel it is premature to begin to assess biochemical properties of these proteins without first some hint of their in vivo role, which in turn requires genetic analysis. It is also currently hard to extract further insights from patterns of amino acid covariation for the protist and fungal GRLs alone because there are too few sequences available.

We have made some preliminary analysis of the plant proteins, by overlaying the degree of amino acid conservation on the predicted structure, but this was not particularly informative: in contrast to the animal proteins, the plant family has quite high amino acid identity throughout its length and this analysis did not highlight particularly conserved regions (in 3D space) that might indicate functional domains. Moreover, in contrast to ORs, for which there is good (albeit mostly indirect) evidence of heteromeric complex assembly between tuning ORs and ORCOs, we currently do not know if and how DUF3537 proteins may form multimeric complexes. As it is not trivial to distinguish contacts that may be involved in monomer folding versus those involved in potential intersubunit contacts (as described in Hopf et al., eLife 2014), we feel it is premature to attempt to draw conclusions about complex formation from sequence analysis alone at this stage. If such intersubunit interactions exist, we suspect they are slightly different from those reported in ORCO. The cryo-EM ORCO structure revealed that the major interaction interface was within a cytoplasmic domain (the “anchor” domain (Butterwick et al., 2018)) comprising cytosolic regions of TM4, TM5, TM6 and TM7a; notably, all of the plant proteins have a cytoplasmic insertion of ∼50 amino acids in this region in IC3 (between TM6 and TM7a).

Reviewer #4:

Benton et al. is a well written study on the evolution of insect chemosensory receptors that uses bioinformatics-based approaches to identify putative GRL homologs in several species of unicellular eukaryotes. Both sequence and structure-based approaches are utilized to buttress the authors' arguments that fungal and protista GRL homologs are an evolutionary link to DUF3537 proteins they have previously identified in plants and algae, thereby extending this evolutionary relationship to "the last common eukaryotic ancestor".

While I am generally supportive of the authors' rationale and recognize they have been careful to appropriately qualify their hypothesis throughout this work, I am somewhat disinclined to place a high degree of definitive value on the ab initio structural predictions which underpin much of this analysis. Even so, and despite the fact these evolutionary relationships between animal and plant GRLs are unlikely to ever be definitively tested, this hypothesis seems to me to be reasonable. That said, I remain underwhelmed by their significance.

We fully acknowledge the caveats associated with ab initio structural predictions and hope to have been suitably cautious in our claims throughout the manuscript. We do find the very strong similarity between the predicted and experimentally-determined ORCO structures striking, which supports the relevance of the predictions for other members of this repertoire. We refer the reviewer to our detailed comments in response to reviewer 2 concerning the procedure, assessment and caveats of ab initio modelling, as well as our additional analyses in this revision using the RaptorX algorithm. We provide all the outputs of these analyses in the Dryad repository (doi:10.5061/dryad.s7h44j15f), to permit independent assessment and reproduction by others.

https://doi.org/10.7554/eLife.62507.sa2

Article and author information

Author details

  1. Richard Benton

    Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
    Contribution
    Conceptualization, Data curation, Supervision, Funding acquisition, Validation, Investigation, Visualization, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    Richard.Benton@unil.ch
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-4305-8301
  2. Christophe Dessimoz

    1. Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
    2. Department of Computational Biology, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
    3. Swiss Institute of Bioinformatics, Lausanne, Switzerland
    4. Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
    5. Department of Computer Science, University College London, London, United Kingdom
    Contribution
    Data curation, Supervision, Funding acquisition, Validation, Investigation, Visualization, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-2170-853X
  3. David Moi

    1. Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
    2. Department of Computational Biology, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
    3. Swiss Institute of Bioinformatics, Lausanne, Switzerland
    Contribution
    Data curation, Software, Validation, Investigation, Visualization, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-2664-7385

Funding

H2020 European Research Council (833548)

  • Richard Benton

FP7 Ideas: European Research Council (615094)

  • Richard Benton

Novartis Stiftung für Medizinisch-Biologische Forschung

  • Richard Benton

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (31003A_166646)

  • Richard Benton

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (183723)

  • Christophe Dessimoz
  • David Moi

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank members of the Benton lab for discussions. Research in RB’s laboratory is supported by the University of Lausanne, European Research Council Consolidator and Advanced Grants (615094 and 833548), the Swiss National Science Foundation (31003A_166646) and the Novartis Foundation for medical-biological Research. DM and CD were supported by Swiss National Science Foundation Grant 183723.

Senior Editor

  1. Piali Sengupta, Brandeis University, United States

Reviewing Editor

  1. Claude Desplan, New York University, United States

Publication history

  1. Received: August 26, 2020
  2. Accepted: December 3, 2020
  3. Accepted Manuscript published: December 4, 2020 (version 1)
  4. Version of Record published: December 17, 2020 (version 2)

Copyright

© 2020, Benton et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,426
    Page views
  • 256
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

