1. Developmental Biology
  2. Evolutionary Biology
Download icon

Bacterial contribution to genesis of the novel germ line determinant oskar

  1. Leo Blondel
  2. Tamsin EM Jones
  3. Cassandra G Extavour  Is a corresponding author
  1. Department of Molecular and Cellular Biology, Harvard University, United States
  2. Department of Organismic and Evolutionary Biology, Harvard University, United States
Short Report
  • Cited 0
  • Views 1,855
  • Annotations
Cite this article as: eLife 2020;9:e45539 doi: 10.7554/eLife.45539

Abstract

New cellular functions and developmental processes can evolve by modifying existing genes or creating novel genes. Novel genes can arise not only via duplication or mutation but also by acquiring foreign DNA, also called horizontal gene transfer (HGT). Here we show that HGT likely contributed to the creation of a novel gene indispensable for reproduction in some insects. Long considered a novel gene with unknown origin, oskar has evolved to fulfil a crucial role in insect germ cell formation. Our analysis of over 100 insect Oskar sequences suggests that oskar arose de novo via fusion of eukaryotic and prokaryotic sequences. This work shows that highly unusual gene origin processes can give rise to novel genes that may facilitate evolution of novel developmental mechanisms.

Introduction

Heritable variation is the raw material of evolutionary change. Genetic variation can arise from mutation and gene duplication of existing genes (Taylor and Raes, 2004), or through de novo processes (Tautz and Domazet-Lošo, 2011), but the extent to which such novel, or 'orphan' genes participate significantly in the evolutionary process is unclear. Mutation of existing cis-regulatory (Wittkopp and Kalay, 2012) or protein coding regions (Hoekstra and Coyne, 2007) can drive evolutionary change in developmental processes. However, recent studies in animals and fungi suggest that novel genes can also drive phenotypic change (Chen et al., 2013). Although counterintuitive, novel genes may be integrating continuously into otherwise conserved gene networks, with a higher rate of partner acquisition than subtler variations on preexisting genes (Zhang et al., 2015). Moreover, in humans and fruit flies, a large proportion of novel genes are expressed in the brain, suggesting their participation in the evolution of major organ systems (Zhang et al., 2012; Chen et al., 2012). However, while next generation sequencing has improved their discovery, the developmental and evolutionary significance of novel genes remains understudied.

The mechanism of formation of a novel gene may have implications for its function. Novel genes that arise by duplication, thus possessing the same biophysical properties as their parent genes, have innate potential to participate in preexisting cellular and molecular mechanisms (Taylor and Raes, 2004). However, orphan genes lacking sequence similarity to existing genes must form novel functional molecular relationships with extant genes, in order to persist in the genome. When such genes arise by introduction of foreign DNA into a host genome through horizontal gene transfer (HGT), they may introduce novel, already functional sequence information into a genome. Whether genes created by HGT show a greater propensity to contribute to or enable novel processes is unclear. Endosymbionts in the host germ line cytoplasm (germ line symbionts) could increase the occurrence of evolutionarily relevant HGT events, as foreign DNA integrated into the germ line genome is transferred to the next generation. HGT from bacterial endosymbionts into insect genomes appears widespread, involving transfer of metabolic genes or even larger genomic fragments to the host genome (see for example Dunning Hotopp et al., 2007; Acuna et al., 2012; Sloan et al., 2014; Husnik et al., 2013).

Here we examined the evolutionary origins of the oskar (osk) gene, long considered a novel gene that evolved to be indispensable for insect reproduction (Lehmann, 2016). First discovered in Drosophila melanogaster (Lehmann and Nüsslein-Volhard, 1986), osk is necessary and sufficient for assembly of germ plasm, a cytoplasmic determinant that specifies the germ line in the embryo. Germ plasm-based germ line specification appears derived within insects, confined to insects that undergo metamorphosis (Holometabola) (Ewen-Campen et al., 2012; Extavour and Akam, 2003). Initially thought exclusive to Diptera (flies and mosquitoes), its discovery in a wasp, another holometabolous insect with germ plasm (Lynch et al., 2011), led to the hypothesis that oskar originated as a novel gene at the base of the Holometabola approximately 300 Mya, facilitating the evolution of insect germ plasm as a novel developmental mechanism (Lynch et al., 2011). However, its subsequent discovery in a cricket (Ewen-Campen et al., 2012), a hemimetabolous insect without germ plasm (Ewen-Campen et al., 2013), implied that osk was instead at least 50 My older, and that its germ plasm role was derived rather than ancestral (Abouheif, 2013). Despite its orphan gene status, osk plays major developmental roles, interacting with the products of many genes highly conserved across animals (Lehmann, 2016; Jeske et al., 2015; Jeske et al., 2017). osk thus represents an example of a novel gene that not only functions within pre-existing gene networks in the nervous system (Ewen-Campen et al., 2012), but has also evolved into the only animal gene that has been experimentally demonstrated to be both necessary and sufficient to specify functional primordial germ line cells (Ephrussi and Lehmann, 1992; Kim-Ha et al., 1991).

The evolutionary origins of this remarkable gene are unknown. Osk contains two biophysically conserved domains, an N-terminal LOTUS domain and a C-terminal hydrolase-like domain called OSK (Jeske et al., 2015; Yang et al., 2015; Figure 1a). An initial BLASTp search using the full-length D. melanogaster osk sequence as a query yielded either other holometabolous insect osk genes, or partial hits for the LOTUS or OSK domains (E-value < 0.01; Source data 1: BLAST search results). This suggested that full length osk was unlikely to be a duplication of any other known gene. This prompted us to perform two more BLASTp searches, one using each of the two conserved Osk protein domains individually as query sequences. Strikingly, in this BLASTp search, although we recovered several eukaryotic hits for the LOTUS domain, we recovered no eukaryotic sequences that resembled the OSK domain, even with very low E-value stringency (E-value < 10; see Materials and methods section “BLAST searches of oskar” for an explanation of E-value threshold choices; Source data 1: BLAST search results).

Sequence analysis of the Oskar gene.

(a) Schematic representation of the Oskar gene. The LOTUS and OSK hydrolase-like domains are separated by a poorly conserved region of predicted high disorder and variable length between species. In some dipterans, a region 5’ to the LOTUS domain is translated to yield a second isoform, called Long Oskar. Residue numbers correspond to the D. melanogaster Osk sequence. (b) Stackplot of domain of life identity of HMMER hits across the protein sequence. For a sliding window of 60 Amino Acids across the protein sequence (X axis), the number of hits in the Trembl (UniProt) database (Y axis) is represented and color coded by domain of life origin (see Materials and methods: Iterative HMMER search of OSK and LOTUS domains), stacked on top of each other. (c, d) EFI-EST-generated graphs of the sequence similarity network of the LOTUS (c) and OSK (d) domains of Oskar (Gerlt et al., 2015). Sequences were obtained using HMMER against the UniProtKB database. Most Oskar LOTUS sequences cluster within eukaryotes and arthropods. In contrast, Oskar OSK sequences cluster most strongly with a small subset of bacterial sequences.

To understand this anomaly, we built an alignment of 95 Oskar sequences (Source data 1 Alignments>OSKAR_MUSCLE_FINAL.fasta; Supplementary file 1A and B) and used a custom iterative HMMER sliding window search tool to compare each domain with protein sequences from all domains of life. Sequences most similar to the LOTUS domain were almost exclusively eukaryotic sequences (Supplementary file 1C). In contrast, those most similar to the OSK domain were bacterial, specifically sequences similar to SGNH-like hydrolases (Jeske et al., 2015; Yang et al., 2015) (Pfam Clan: SGNH_hydrolase - CL0264; Supplementary file 1D; Figure 1b). To visualize their relationships, we graphed the sequence similarity network for the sequences of these domains and their closest hits. We observed that the majority of LOTUS domain sequences clustered within eukaryotic sequences (Figure 1c). In contrast, OSK domain sequences formed an isolated cluster, a small subset of which formed a connection to bacterial sequences (Figure 1d). These data are consistent with a previous suggestion, based on BLAST results (Lynch et al., 2011), that HGT from a bacterium into an ancestral insect genome may have contributed to the evolution of osk. However, this possibility was not formally addressed by previous analyses, which were based on alignments of full length Osk containing only eukaryotic sequences as outgroups (Ewen-Campen et al., 2012). To rigorously test this hypothesis, we therefore performed phylogenetic analyses of the two domains independently. A finding that LOTUS sequences were nested within eukaryotes, while OSK sequences were nested within bacteria, would provide support for the HGT hypothesis.

Both Maximum likelihood and Bayesian approaches confirmed this prediction (Figure 2a, Figure 2—figure supplements 1 and 2), and these results were robust to changes in the methods of sequence alignment (Figure 2—figure supplements 6, 7, 8, 9, 10). As expected, LOTUS sequences from Osk proteins were related to other eukaryotic LOTUS domains, to the exclusion of the only three bacterial sequences that met our E-value cutoff for inclusion in the analyses (Figure 2a, Figure 2—figure supplements 1 and 2; see Materials and methods and Supplemental Text). LOTUS sequences from non-Oskar proteins were almost exclusively eukaryotic. (Supplementary file 1); only three bacterial sequences matched the LOTUS domain with an E-value < 0.01. Osk LOTUS domains clustered into two distinct clades, one comprising all Dipteran sequences, and the other comprising all other Osk LOTUS domains examined from both holometabolous and hemimetabolous orders (Figure 2a). Dipteran Osk LOTUS sequences formed a monophyletic group that branched sister to a clade of LOTUS domains from Tud5 family proteins of non-arthropod animals (NAA). NAA LOTUS domains from Tud7 family members were polyphyletic, but most of them formed a clade branching sister to (Osk LOTUS + NAA Tud5 LOTUS). Non-Dipteran Osk LOTUS domains formed a monophyletic group that was related in a polytomy to the aforementioned (NAA Tud7 LOTUS + (Dipteran Osk LOTUS + NAA Tud5 LOTUS)) clade, and to various arthropod Tud7 family LOTUS domains.

Figure 2 with 13 supplements see all
Phylogenetic analysis of the LOTUS and OSK domains.

(a) Bayesian consensus tree for the LOTUS domain. Three major LOTUS-containing protein families are represented within the tree: Tudor 5, Tudor 7, and Oskar. Oskar LOTUS domains form two clades, one containing only dipterans and one containing all other represented insects (hymenopterans and orthopterans). The tree was rooted to the three bacterial sequences added in the dataset. (b) Bayesian consensus tree for the OSK domain. The OSK domain is nested within GDSL-like domains of bacterial species from phyla known to contain germ line symbionts in insects. The ten non-Oskar eukaryotic sequences in the analysis form a single clade comprising fungal Carbohydrate Active Enzyme 3 (CAZ3) proteins. For Bayesian and RaxML trees with all accession numbers and node support values see Figure 2—figure supplements 14.

The fact that Tud7 LOTUS domains are polyphyletic suggests that arthropod domains in this family may have evolved differently than their homologues in other animals. The relationships of Dipteran LOTUS sequences were consistent with the current hypothesis for interrelationships between Dipteran species (Kirk-Spriggs and Sinclair, 2017). Similarly, among the non-Dipteran Osk LOTUS sequences, the hymenopteran sequences form a clade to the exclusion of the single hemimetabolous sequence (from the cricket Gryllus bimaculatus), consistent with the monophyly of Hymenoptera (Peters et al., 2017). It is unclear why Dipteran Osk LOTUS domains cluster separately from those of other insect Osk proteins. We speculate that the evolution of the Long Oskar domain (Vanzo and Ephrussi, 2002; Hurd et al., 2016), which appears to be a novelty within Diptera (Source data 1: Alignments>OSKAR_MUSCLE_FINAL.fasta), may have influenced the evolution of the Osk LOTUS domain in at least some of these insects. Consistent with this hypothesis, of the 17 Dipteran oskar genes we examined, the seven oskar genes possessing a Long Osk domain clustered into two clades based on the sequences of their LOTUS domain. One of these clades comprised five Drosophila species (D. willistoni, D. mojavensis, D. virilis, D. grimshawi and D. immigrans), and the second was composed of two calyptrate flies from different superfamilies, Musca domestica (Muscoidea) and Lucilia cuprina (Oestroidea).

In summary, the LOTUS domain of Osk proteins is most closely related to a number of other LOTUS domains found in eukaryotic proteins, as would be expected for a gene of animal origin, and the phylogenetic interrelationships of these sequences are largely consistent with the current species or family level trees for the corresponding insects.

In contrast, OSK domain sequences were nested within bacterial sequences (Figure 2b, Figure 2—figure supplements 3 and 4). This bacterial, rather than eukaryotic, affinity of the OSK domain was recovered even when different sequence alignment methods were used (Figure 2—figure supplements 7, 8, 9, 10 and 11). The only eukaryotic proteins emerging from the iterative HMMER search for OSK domain sequences that had an E-value < 0.01 were all from fungi. All five of these sequences were annotated as Carbohydrate Active Enzyme 3 (CAZ3), and all CAZ3 sequences formed a clade that was sister to a clade of primarily Firmicutes. Most bacterial sequences used in this analysis were annotated as lipases and hydrolases, with a high representation of GDSL-like hydrolases (Supplementary file 1D). OSK sequences formed a monophyletic group but did not branch sister to the other eukaryotic sequences in the analysis. Within this OSK clade, the topology of sequence relationships was largely concordant with the species tree for insects (Misof et al., 2014), as we recovered monophyletic Diptera to the exclusion of other insect species. However, the single orthopteran OSK sequence (from the cricket G. bimaculatus) grouped within the Hymenoptera, rather than branching as sister to all other insect sequences in the tree, as would be expected for this hemimetabolous sequence (Misof et al., 2014).

Importantly, OSK sequences did not simply form an outgroup to bacterial sequences. To formally reject the possibility that the eukaryotic OSK clade has a sister group relationship to all bacterial sequences in the analysis, we performed topology constraint analyses using the Swofford–Olsen–Waddell–Hillis (SOWH) test, which assigns statistical support to alternative phylogenetic topologies (Swofford et al., 1996). We used the SOWHAT tool (Church et al., 2015) to compare the HGT-supporting topology to two alternative topologies with constraints more consistent with vertical inheritance. The first was constrained by domain of life, disallowing paraphyletic relationships between sequences from the same domain of life (Figure 2—figure supplement 5a). The second required monophyly of Eukaryota but allowed paraphyletic relationships between bacterial and archaeal sequences (Figure 2—figure supplement 5b). We found that the topologies of both of these constrained trees were significantly worse than the result we had recovered with our phylogenetic analysis (Figure 2—figure supplement 5), namely that the closest relatives of the OSK domain were bacterial rather than eukaryotic sequences Figure 2b, Figure 2—figure supplements 3 and 4).

OSK sequences formed a well-supported clade nested within bacterial GDSL-like lipase sequences. The majority of these bacterial sequences were from the Firmicutes, a bacterial phylum known to include insect germ line symbionts (Wheeler et al., 2013; Chepkemoi et al., 2017). All other sequences from classified bacterial species, including a clade branching as sister to all other sequences, belonged either to the Bacteroidetes or to the Proteobacteria. Members of both of these phyla are also known germ line symbionts of insects (Dunning Hotopp et al., 2007; Zchori-Fein et al., 2004) and other arthropods (Zchori-Fein and Perlman, 2004). In sum, the distinct phylogenetic relationships of the two domains of Oskar are consistent with a bacterial origin for the OSK domain. Further, the specific bacterial clades close to OSK suggest that an ancient arthropod germ line endosymbiont could have been the source of a GDSL-like sequence that was transferred into an ancestral insect genome, and ultimately gave rise to the OSK domain of oskar (Figure 3).

Hypothesis for the origin of oskar.

Integration of the OSK domain close to a LOTUS domain in an ancestral insect genome. (a) DNA containing a GDSL-like domain from an endosymbiotic germ line bacterium is transferred to the nucleus of a germ cell in an insect common ancestor. (b) DNA damage or transposable element activity induces an integration event in the host genome, close to a pre-existing LOTUS-like domain. (c) The region between the two domains undergoes de novo coding evolution, creating an open reading frame with a unique, chimeric domain structure. (d) In some Diptera, including D. melanogaster, part of the 5’ UTR of oskar has undergone de novo coding evolution to form the Long Oskar domain.

While multiple mechanisms can give rise to novel genes, HGT is arguably among the least well understood, as it involves multiple genomes and ancient biotic interactions between donor and host organisms that are often difficult to reconstruct. In the case of oskar, however, the fact that both germ line symbionts (Bourtzis and Miller, 2006) and HGT events (Dunning Hotopp et al., 2007) are widespread in insects, provides a plausible biological mechanism consistent with our hypothesis that fusion of eukaryotic and bacterial domain sequences led to the birth of this novel gene. Under this hypothesis, this fusion would have taken place before the major diversification of insects, nearly 500 million years ago (Misof et al., 2014).

Once arisen, novel genes might be expected to disappear rapidly, given that pre-existing gene regulatory networks operated successfully without them (Taylor and Raes, 2004). However, it is clear that novel genes can evolve functional connections with existing networks, become essential (Chen et al., 2010), and in some cases lead to new functions (Cornelis et al., 2012) and contribute to phenotypic diversity (Chen et al., 2013). Even given the growing number of convincing examples of HGT from both prokaryotic and eukaryotic origins (see for example Husnik and McCutcheon, 2018; Di Lelio et al., 2019; Wybouw et al., 2016; Quispe-Huamanquispe et al., 2017), some authors suspect that the contribution of horizontal gene transfer to the acquisition of novel traits has been underestimated across animals (Boto, 2014). Moreover, the functional contribution of genes horizontally transferred specifically from bacteria to insects has been documented for a range of adaptive phenotypes (see for example Wilson and Duncan, 2015; López-Madrigal and Gil, 2017; Provorov and Onishchuk, 2018), including digestive metabolism (Acuna et al., 2012; Sloan et al., 2014; Shelomi et al., 2016), glycolysis (Zeng et al., 2018) complex symbiosis (Husnik et al., 2013) and endosymbiont cell wall construction (Bublitz et al., 2019). oskar plays multiple critical roles in insect development, from neural patterning (Ewen-Campen et al., 2012; Xu et al., 2013) to oogenesis (Jenny et al., 2006). In the Holometabola, a clade of nearly one million extant species (Rees and Cranston, 2017), oskar’s co-option to become necessary and sufficient for germ plasm assembly is likely the cell biological mechanism underlying the evolution of this derived mode of insect germ line specification (Ewen-Campen et al., 2012; Lynch et al., 2011; Abouheif, 2013). Our study thus provides evidence that HGT can not only introduce functional genes into a host genome, but also, by contributing sequences of individual domains, generate genes with entirely novel domain structures that may facilitate the evolution of novel developmental mechanisms.

Materials and methods

BLAST searches of Oskar

Request a detailed protocol

All BLAST (Altschul et al., 1990) searches were performed using the NCBI BLASTp tool suite on the non-redundant (nr) database. Amino Acid (AA) sequences of D. melanogaster full length Oskar (EMBL ID AAF54306.1), as well as the AA sequences for the D. melanogaster Oskar LOTUS (AA 139–238) and OSK (AA 414–606) domains were used for the BLAST searches. We used the default NCBI cut-off parameters (E-value cut-off of 10) for searches using OSK and LOTUS as queries, and a more stringent E-value threshhold of 0.01 for the search using full length D. melanogaster Oskar as a query. We chose an E-value threshold of 10 for LOTUS and OSK to capture potentially highly divergent homologs of the two domains, especially for the OSK domain, where we were looking for any viable candidate for a homologous eukaryotic domain. All BLAST searches results are included in the Source data 1: BLAST search results.

Hidden Markov Model (HMM) generation and alignments of the OSK and LOTUS domains

Request a detailed protocol

101 1KITE transcriptomes (Misof et al., 2014; Supplementary file 1A) were downloaded and searched using the local BLAST program (BLAST+) using the tblastn algorithm with default parameters, with Oskar protein sequences of Drosophila melanogaster, Aedes aegypti, Nasonia vitripennis and Gryllus bimaculatus as queries (EntrezIDs: NP_731295.1, ABC41128.1, NP_001234884.1 and AFV31610.1 respectively). For all of these 1KITE transcriptome searches, predicted protein sequences from transcript data were obtained by in silico translation using the online ExPASy translate tool (https://web.expasy.org/translate/), taking the longest open reading frame. Publicly available sequences in the non-redundant (nr), TSA databases at NCBI, and a then-unpublished transcriptome (Benton et al., 2016) (kind gift of Matthew Benton and Siegfried Roth, University of Cologne) were subsequently searched using the web-based BLAST tool hosted at NCBI, using the tblastn algorithm with default parameters. Sequences used for queries were the four Oskar proteins described above, and newfound oskar sequences from the 1KITE transcriptomes of Baetis pumilis, Cryptocercus wright, and Frankliniella cephalica. For both searches, oskar orthologs were identified by the presence of BLAST hits on the same transcript to both the LOTUS (N-terminal) and OSK (C-terminal) regions of any of the query oskar sequences, regardless of E-values. The sequences found were aligned using MUSCLE (eight iterations) (Edgar, 2004) into a 46-sequence alignment (Source data 1: Alignments > OSKAR_MUSCLE_INITIAL.fasta). From this alignment, the LOTUS and OSK domains were extracted (Source data 1: Alignments > LOTUS_MUSCLE_INITIAL.fasta and Alignments > OSK_MUSCLE_INITIAL.fasta) to define the initial Hidden Markov Models (HMM) using the hmmbuild tool from the HMMER tool suite with default parameters (http://hmmer.org/Eddy, 2011). 126 insect genomes and 128 insect transcriptomes (from the Transcriptome Shotgun Assembly TSA database: https://www.ncbi.nlm.nih.gov/Traces/wgs/?view=TSA) were subsequently downloaded from NCBI (download date September 29, 2015; Supplementary file 1A). Genomes were submitted to Augustus v2.5.5 (Stanke et al., 2004) (using the D. melanogaster exon HMM predictor) and SNAP v2006-07-28 (Korf, 2004) (using the default ‘fly’ HMM) for gene discovery. The resulting nucleotide sequence database comprising all 309 downloaded and annotated genomes and transcriptomes, was then translated in six frames to generate a non-redundant amino acid database (where all sequences with the same amino acid content are merged into one). This process was automated using a series of custom scripts available here: https://github.com/Xqua/Genomes. The non-redundant amino acid database was searched using the HMMER v3.1 tool suite (Eddy, 2011) and the HMM for the LOTUS and OSK domains described above. A hit was considered positive if it consisted of a contiguous sequence containing both a LOTUS domain and an OSK domain, with the two domains separated by an inter-domain sequence. We imposed no length, alignment or conservation criteria on the inter-domain sequence, as this is a rapidly-evolving region of Oskar protein with predicted high disorder (Jeske et al., 2015; Yang et al., 2015; Ahuja and Extavour, 2014). Positive hits were manually curated and added to the main alignment, and the search was performed iteratively until no more new sequences meeting the above criteria were discovered. This resulted in a total of 95 Oskar protein sequences, (see Supplementary file 1B for the complete list). Using the final resulting alignment (Source data 1: Alignments > OSKAR_MUSCLE_FINAL.fasta), the LOTUS and OSK domains were extracted from these sequences (Source data 1: Alignments > LOTUS_MUSCLE_FINAL.fasta and Alignments > OSK_MUSCLE_FINAL.fasta), and the final three HMM (for full-length Oskar, OSK, and LOTUS domains) used in subsequent analyses were created using hmmbuild with default parameters (Source data 1: HMM >OSK.hmm, HMM >LOTUS.hmm and HMM >OSKAR.hmm).

Iterative HMMER search of OSK and LOTUS domains

Request a detailed protocol

A reduced version of TrEMBL (U Consortium, 2005) (v2016-06) was created by concatenating all hits (regardless of E-value) for sequences of the LOTUS domain, the OSK domain and full-length Oskar, using hmmsearch with default parameters and the HMM models created above from the final alignment. This reduced database was created to reduce potential false positive results that might result from the limited size of the sliding window used in the search approach described here. The full-length Oskar alignment of 1133 amino acids (Source data 1: Alignments > OSKAR_MUSCLE_FINAL.fasta) was split into 934 sub-alignments of 60 amino acids each using a sliding window of one amino acid. Each alignment was converted into a HMM using hmmbuild, and searched against the reduced TrEMBL database using hmmsearch using default parameters. Domain of life origin of every hit sequence at each position was recorded. Eukaryotic sequences were further classified as Oskar/Non-Oskar and Arthropod/Non-Arthropod. Finally, for the whole alignment, the counts for each category were saved and plotted in a stack plot representing the proportion of sequences from each category to create Figure 1b. The python code used for this search is available at https://github.com/Xqua/Iterative-HMMER.

Sequence similarity networks

Request a detailed protocol

LOTUS and OSK domain sequences from the final alignment obtained as described above (see ‘Hidden Markov Model (HMM) generation and alignments of the OSK and LOTUS domains’; Source data 1: Alignments > LOTUS_MUSCLE_FINAL.fasta and Alignments > OSK_MUSCLE_FINAL.fasta) were searched against TrEMBL (U Consortium, 2005) (v2016-06) using HMMER. All hits with E-value <0.01 were consolidated into a fasta file that was then entered into the EFI-EST tool (Gerlt et al., 2015) using default parameters to generate a sequence similarity network. An alignment score corresponding to 30% sequence identity was chosen for the generation of the final sequence similarity network. Finally, the network was graphed using Cytoscape 3 (Shannon et al., 2003).

Phylogenetic analysis based on MUSCLE alignment

Request a detailed protocol

For both the LOTUS and OSK domains, in cases where more than one sequence from the same organism was retrieved by the search described above in ‘Iterative HMMER Search of OSK and LOTUS domains’, only the sequence with the lowest E-value was used for phylogenetic analysis. For the LOTUS domain, the first 97 best hits (lowest E-value) were selected, and the only three bacterial sequences that satisfied an E-value <0.01 were manually added. For oskar sequences, if more than one sequence per species was obtained by the search, only the single sequence per species with the lowest E-value was kept for analysis, generating a set of 100 sequences for the LOTUS domain, and 87 sequences for the OSK domain. Unique identifiers for all sequences used to generate alignments for phylogenetic analysis are available in Supplementary files 1C, 1D. For both datasets, the sequences were then aligned using MUSCLE (Edgar, 2004) (eight iterations) and trimmed using trimAl (Capella-Gutiérrez et al., 2009) with 70% occupancy. The resulting alignments that were subject to phylogenetic analysis are available in Source data 1: Alignments > LOTUS_MUSCLE_TREE.fasta and Alignments > OSK_MUSCLE_TREE.fasta. For the maximum likelihood tree, we used RaxML v8.2.4 (Stamatakis, 2014) with 1000 bootstraps, and the models were selected using the automatic RaxML model selection tool. The substitution model chosen for both domains was LGF. For the Bayesian tree inference, we used MrBayes V3.2.6 (Huelsenbeck and Ronquist, 2001) with a Mixed model (prset aamodel = Mixed) and a gamma distribution (lset rates = Gamma). We ran the MonteCarlo for 4 million generations (std <0.01) for the OSK domain, and for 3 million generations (std <0.01) for the LOTUS domain. For the tree comparisons (Figure 2—figure supplements 8, 9), the RaxML best tree output from the MUSCLE and PRANK alignments were compared using the tool Phylo.io (Robinson et al., 2016).

Phylogenetic analysis based on PRANK alignment

Request a detailed protocol

For the OSK domain, the raw full length sequences obtained from the HMMER search were aligned to each other using the HMMER HMM-based alignment tool: hmmalign, with the same HMM used to do the search, namely OSK.hmm (supplementary data: Data/HMM/OSK.hmm). Starting from this base alignment, we used the default alignment method option offered by PRANK (version: v.170427) (Löytynoja, 2014). We then used PRANK to realign those sequences, which in turn led to a usable alignment for phylogenetic analysis. This alignment was trimmed using the same parameters as described in Hidden Markov Model (HMM) generation and alignments of the OSK and LOTUS domains above. The final alignment is available in supplementary data: Alignment/OSK_prank_aligned.fasta. We then performed a phylogenetic analysis of this alignment using RAXML with the same parameters described in Phylogenetic Analysis Based on MUSCLE Alignment above. The resulting tree is presented in Figure 2—figure supplements 7 and 8.

For the LOTUS domain, the raw full length sequences obtained from the HMMER search were aligned to each other using the HMMER HMM-based alignment tool: hmmalign, with the same HMM used to do the search, namely LOTUS.hmm (Supplementary data: Data/HMM/LOTUS.hmm). Starting from this base alignment, we then used PRANK with default options to realign those sequences. This alignment was trimmed using the same parameters as described in the Hidden Markov Model (HMM) generation and alignments of the OSK and LOTUS domains. The final alignment is available in supplementary data: Alignments/LOTUS_prank_aligned.fasta. We then performed a phylogenetic analysis using RAXML with the same parameters described above in Phylogenetic Analysis Based on MUSCLE alignment. The resulting trees are presented in Figure 2—figure supplements 6 and 9.

Phylogenetic analysis based on T coffee alignment

Request a detailed protocol

For the LOTUS and OSK domains, the raw full length sequences obtained from the HMMER search were aligned to each other using T-Coffee with its default parameters (Notredame et al., 2000). This alignment was trimmed using the same parameters as described in Hidden Markov Model (HMM) generation and alignments of the OSK and LOTUS domains above. The final alignment is available in supplementary data: Alignment/LOTUS_tcoffee_aligned.fasta Alignment/OSK_tcoffee_aligned.fasta. We then performed a phylogenetic analysis of this alignment using RAXML with the same parameters described in Phylogenetic Analysis Based on MUSCLE Alignment above. The resulting trees are presented in Figure 2—figure supplements 10 and 11.

Visual comparison of phylogenetic trees

Request a detailed protocol

To compare the trees obtained with different alignment tools, we used Phylo.io (Robinson et al., 2016). The trees were imported in Newick format, and the Phylo.io tool generated the mirrored and aligned versions of the trees represented in Figure 2—figure supplements 8, 9, 12 and 13. The color of the branches is the tree similarity score, where lighter colors represent a higher number of topological differences. It is a custom implementation of the Jacard Index by Phylo.io.

Statistical analysis of tree topology

Request a detailed protocol

To statistically evaluate our best-supported topology of the OSK and LOTUS trees, we compared constrained topologies to the highest likelihood trees using the SOWHAT tool (Church et al., 2015). SOWHAT automates the stringent SOWH phylogenetic topology test (Swofford et al., 1996), and compares the log likelihood between generated trees. We defined three constrained trees to test our results, one requiring monophyly of all domains of life, a second requiring only eukaryotic monophyly, and the last one requiring monophyly of the Oskar LOTUS domain (Source data 1: Data > Trees > constrained_kingdom_tree.tre, constrained_eukmono_tree.tre and constrained_lotus_mono_tree.tre). We then ran SOWHAT using its default parameters, 1000 bootstraps, and the two constrained trees against the OSK or LOTUS alignment used to generate the phylogenetic trees (Source data 1: Alignments > OSK_MUSCLE_TREE.fasta and LOTUS_MUSCLE_TREE.fasta). All best trees generated by SOWHAT are available in (Source data 1: Data > Trees > SOWHAT_*_test.tre).

Code availability

Request a detailed protocol

All custom code generated for this study is available in the GitHub repository https://github.com/extavourlab/Oskar_HGT, commit ID 6f6c4c50dfb9391567d70f9eea922f3876a4e153 (Blondel et al., 2020; copy archived at https://github.com/elifesciences-publications/Oskar_HGT).

Scripts

Request a detailed protocol

All scripts used herein are hosted on GitHub at https://github.com/extavourlab/Oskar_HGT.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
    Manual of Afrotropical Diptera; Ptychopteridae
    1. AH Kirk-Spriggs
    2. BJ Sinclair
    (2017)
    Pretoria, South Africa: South African National Biodiversity Institute.
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
    Phylogenetic inference
    1. DL Swofford
    2. GJ Olsen
    3. PJ Waddell
    (1996)
    In: C Moritz, DM Hillis, BK Mable, editors. Molecular Systematics (2nd Edition). Sinauer, MA: Sinauer Associates, Inc. pp. 407–453.
  55. 55
  56. 56
  57. 57
  58. 58
    Oskar anchoring restricts pole plasm formation to the posterior of the Drosophila oocyte
    1. NF Vanzo
    2. A Ephrussi
    (2002)
    Development 129:3705–3714.
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69

Decision letter

  1. Antonis Rokas
    Reviewing Editor; Vanderbilt University, United States
  2. Patricia J Wittkopp
    Senior Editor; University of Michigan, United States
  3. Eve Gazave
    Reviewer; Institut Jacques Monod, France

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

oskar is a master regulatory gene involved in insect early development that is only found in holometabolan insects. The gene's functions are well studied but its origins remain unknown. This study provides convincing phylogenetic evidence that the OSK domain of the protein encoded by the oskar gene appears to have been originally acquired from a bacterium. The study sheds significant light on the origin of a gene that is uniquely present in holometabolan insects and that plays a major role in insect early development.

Decision letter after peer review:

Thank you for submitting your article "Bacterial contribution to genesis of the novel germ line determinant oskar" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Patricia Wittkopp as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Eve Gazave (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

In this short report submitted to ELife, Blondel and collaborators aim at understanding the evolutionary history of a crucial gene for reproduction in some insects, oskar. This gene, not found outside this group of organisms, plays a sufficient and mandatory role for germ line determination. So far, the evolutionary history of this enigmatic gene was unknown. In this study, the authors conduct a suite of molecular evolutionary and bioinformatic analyses to unravel the origin of this important gene. They notably performed parametric and phylogenetic methods to detect potential HGT that has been suggested to be at the origin of Oskar. Based on these analyses and findings, they argue:

i) That one of the two domains of the Oskar protein, OSK, is more related to bacterial sequences while the other one, LOTUS, is clustering with eukaryotic sequences.

ii) That LOTUS is related to eukaryotes and that OSK clusters within bacterial sequences, more specifically with GDSL-like hydrolases, thanks to phylogenetic analyses of the two domains separately, as well as topology-constraint tests.

iii) That sequence characteristics, notably the GC3 content and the codon use, through the cosine distance analysis, are very different from the LOTUS and OSK domains which is consistent with the hypothesis of distinct origins for the two domains composing the Oskar gene.

Essential revisions:

The essential revisions fall into two major categories: concerns about the phylogenetic analyses and concerns about the codon usage analyses. The concerns about the codon usage analyses are quite substantial. Here, the authors have two options – one would be to remove completely this part from the manuscript (given that the HGT inference can stand on its own). The other would be to address the reviewers' concerns, which may reveal additional interesting biology.

Concerns about codon usage analyses:

Main text, twelfth paragraph: The reviewers were surprised that you were able to detect codon usage differences for HGT events that occurred that far back in time (e.g., third position synonymous sites are saturated between D. melanogaster and D. pseudoobscura). Thus, explanations of any differences between LOTUS and OSK parts of the oskar gene should be carefully scrutinized. In particular:

1) If we understand the methods correctly, the process used to calculate codon usage is somewhat unorthodox. By calculating the frequency of all 64 codons, rather than usage of alternate synonymous codons, this metric of similarity will be influenced by amino acid composition. Thus, genes where the 5' and 3' ends look quite different in their codon usage could result from very different amino acid composition between the two halves, rather than differences in synonymous codon usage. This raises the possibility that the large differences in codon usage between the LOTUS and OSK domains could be driven by differences in amino acid composition. Do either of these domains (especially OSK) have a particularly unusual or distinctive amino acid composition, relative to other arthropod genes?

2) We recommend that you conduct this analysis using a metric of codon usage that is independent of amino acid composition, such as the codon adaptation index or relative synonymous codon usage.

3) There are many other possible explanations for why the coding sequences of two domains in a protein may exhibit different codon usage that may have to do with domain folding, translational selection, etc. that would be worth considering as alternative explanations.

4) If the oskar gene is cut in half following the procedure used for the genes in the 17 genomes, rather than comparing just the LOTUS and OSK domains, are the 5' and 3' halves still significantly discordant for AT3/GC3/codon usage?

5) The manuscript states: "if evolutionary time had not completely erased the original GC3 content and codon use signatures from the putative bacterially donated sequence (OSK), we might detect some differences in these parameters both from the LOTUS domain, and from the host genome." All of the analyses presented appear to address the first issue (OSK vs. LOTUS), but I did not see any analyses addressing the second question, whether these parameters systematically differ between the OSK domain and the rest of the host genome. If some of the analyses in the paper do bear on this question, this connection should be made more explicit. This seems like an important question, and would help to inform whether the observed differences between the LOTUS and OSK domains are in fact due to unusual features of the OSK domain specifically.

6) In Figure 3—figure supplement 1, why is the correlation for AT3 between the 5' and 3' ends of genes positive, but negative for GC3? This seems counter-intuitive to me.

7) We don't understand this sentence: "Thus, we sampled each codon at least twice, preserving the coding frame." Why does cutting each gene in half at a randomly chosen location sample each codon twice (or more)?

8) Barplots in Figure 3A and B: it would be more informative to see the full distribution of correlations for all non-oskar genes. It would also be useful to see the points for the LOTUS and OSK domains highlighted in Figure 3—figure supplement 1. In general, Figure 3—figure supplement 1 appears to be more informative than Figure 3.

9) Figure 3C: we expect there are many points within the "intra-gene" set that are not displayed? If this is true, it would be more transparent to allow the boxplot whiskers to extend to the full range of the data.

10) Figure 3D is hard to understand – those two boxplots show two null distributions, but we don't see where the differences between the LOTUS and OSK domains are indicated.

Concerns about phylogenetic analyses/results:

11) Although the reviewers appreciate that the MUSCLE sequence alignment program is one of the state-of-the-art software for this type of analyses, it is fair to say that the use of different alignment programs can sometimes generate considerable variation in phylogenetic inference (e.g., https://academic.oup.com/mbe/article/30/3/642/1038709). We request that the authors use at least a couple of additional programs for sequence alignment (e.g., PRANK and T-Coffee) and rebuild the trees for the two domains. This should provide a sanity check that the key conclusions of their work are not sensitive to method/alignment software.

12) Main text, seventh paragraph: to really claim that the dipteran LOTUS domain sequences do not group together with the LOTUS domain sequences from other insects, you should do a topology constraint analysis (like you've done for the HGT). Is the ML topology where these two groups of sequences are forced into monophyly significantly different from the unconstrained ML topology? The reviewers think you need to show that these two topologies are significantly different to be able to claim that this is not simply an artifact of doing phylogenetic analyses using a short sequence alignment. Also note that your statement, "the phylogenetic interrelationships of these [LOTUS] sequences is largely consistent with the current species or family level trees for the corresponding insects", contradicts your findings/arguments in the aforementioned paragraph.

13) It is not clear how the two domains fused and when (approximately) they did so. I think it would be useful for the readers if the authors speculated in a small paragraph their hypothesis (based on their knowledge of the distribution of these two domains in insects) when the fused oskar gene originated.

Other broad comments:

14) Main text, fourth paragraph: why are you using two different thresholds for your blast searches of the protein versus the two domains?

15) The discussion and cited studies on HGT were too narrow and well known examples of bacterial to insect HGT not cited (e.g., Husnik et al., 2013; Sloan et al., 2014; Acuna et al., 2012, to mention just a few).

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Bacterial contribution to genesis of the novel germ line determinant oskar" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by a Patricia Wittkopp as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Eve Gazave (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

The study by Blondel et al. provides phylogenetic evidence that the OSK domain of the key developmental gene oskar appears to have been originally acquired from a bacterium. oskar is a key developmental gene and its functions are well studied but its origins remain unknown. This study sheds significant light on its origin and believe will be of interest not just to evo-devo afficionados, but more broadly to developmental biologists and evolutionary biologists. The manuscript is overall clearly written and straightforward to understand.

Essential revisions:

While the reviewers appreciated the authors' additional analyses on codon usage between the two domains of the oskar gene, they raised numerous concerns about the statistical analyses performed as well as for the inference that the observed patterns of codon usage may stem from the HGT event (see detailed comments below). The reviewers thought that the argument that the observed patterns are remnants of the HGT event is highly unlikely and hard to reconcile with everything that we know about the (fast) rate with which codon bias evolves. The reviewers also raised several remaining issues in the calculations of the codon usage statistics, which are substantial and would require additional, extensive analyses.

Given that the manuscript is a short communication and that the codon bias analyses are not mentioned in the title or Abstract (so their removal does not impact the manuscript's key result and discovery), our recommendation is that the authors remove all analyses associated with codon usage. If the authors can do so, there is broad agreement that the rest of this manuscript would be suitable for publication and would be accepted without the need for additional revisions (everyone concurred that the phylogenetic aspects of the manuscript are well done and will be interesting to a broad audience). Alternatively, if the authors feel strongly that they wish to keep the codon usage analyses in their manuscript, our recommendation is that the manuscript should be rejected. We sincerely hope that you decide to implement the reviewers' suggestions so that your manuscript can be accepted for publication at eLife.

Specific comments about codon usage analyses:

The reviewers appreciate the additional analyses that the authors have included regarding the differences in codon useage between the LOTUS and OSK domains of oskar. However, some of the changes are not presented clearly, and the rest appear to undermine the authors' claim that codon useage in oskar reflects its HGT hybrid origin.

First, there are some ambiguities and contradictions within the manuscript about the new analyses. The legend to Figure 3A states "Scatter plot representing the full distribution of correlations from Pearson correlation analysis for AT3 use in 3' and 5' halves". However, what is shown in Figure 3A is a bivariate plot – unless we misunderstand, calculating the correlation coefficient between the AT3 content for the 5' and 3' halves of each gene should yield a univariate distribution (one correlation coefficient for each gene) – and according to the axis labels, what is shown are the residuals (?) or Z-score (are these the same thing?) of AT3 for each 5' and 3' half of each gene. We are similarly confused about this statement: "The Z score ranges of correlations for the oskar domains are within the distribution for all genes in the genome, although correlations of use across the gene are lower for oskar than for non-oskar genes".

It is stated in the main text and in the Materials and methods that the residuals/Z-scores are calculated for each insect genome separately: "Pooling the residuals together revealed that the GC3 content was significantly different between the LOTUS and OSK domains, compared to what would be expected within an average gene in that genome"; "Then, Z scores for each sequence from the Intra-Gene, OSK or LOTUS domain sequences were calculated against the corresponding genome frequency distribution." But if this were the case, wouldn't the plots in Figure 3—figure supplements 1 and 2 be centered on (0,0)? At least Figure 3—figure supplement 1 looks as if the Z-scores were calculated from the combined distributions of AT3 pooled across all 17 genomes, which we do not think is appropriate, since genomes differ in AT/GC content. For this reason, we also think Figure 3A should be changed to Z-scores calculated for each genome separately.

Second, comparing the correlation in AT3/GC3 between the LOTUS and OSKAR domains with the average intra-gene correlation (either within a genome or across all genomes) does not seem to be a good indication of whether the correlation for oskar is unusual – as suggested by Figure 3A, many genes have correlations well above or below the mean for that genome, and likely have a lower correlation than observed for oskar. If even some genes in these genomes have AT3/GC3 correlations between their 5' and 3' halves that are as low or lower than oskar's (are some even negative?), and presumably this is not due to the same hybrid evolutionary origin as hypothesized for oskar , then how confident can we be that this low correlation reflects the HGT origin for oskar? We apologize for not raising this issue in the first round of review, but we did not quite understand what was represented in Figure 3C, D, E.

Third, we still think that the methods used to compare codon usage between the 5' and 3' ends of non-oskar genes are not fully comparable to comparing codon usage between the LOTUS and OSK domains. The OSK domain comprises the 3' 33% of oskar while the LOTUS domain comprises 16% of the protein, from 23-39% of the length from the 5' end. So the LOTUS domain makes up a smaller fraction of oskar than the methods used to sample the 5' halves of other genes, which could generate a more biased usage in the LOTUS domain. Probably more importantly, assuming that the LOTUS and OSKAR domains are functional, coherent units within the oskar protein, we might expect their amino acid composition to be unrepresentative of simply the 5' or 3' halves of genes, if most of those halves include multiple protein domains and/or truncate protein domains. It's hard to know what the perfect control set is here, but it would be informative to at least compare codon usage across two different discrete protein domains within a number of genes as was done for oskar.

Fourth, as the revised manuscript states, and as is clearly shown in Figures 3A, B (our questions about those figures notwithstanding), Figure 3—figure supplement 2, and Figure 3—figure supplement 6: codon useage in oskar genes is actually not unusual relative to their host genomes, or when it is unusual, this is true for both OSKAR and LOTUS domains (Figure 3—figure supplement 6A). Drawing the alternate conclusion in the face of these results confuses the manuscript.

Finally, we reiterate that we have difficulty believing, a priori, that "evolutionary time had not completely erased the original GC3 content and codon use signatures from the putative bacterially donated sequence (OSK)". Codon usage evolves over very rapid evolutionary timescales, (for example Akashi 1996, Genetics 144:1297), and can diverge genome-wide within a genus (see the 12 Drosophila genomes paper: Clark et al. 2007, Nature 450: 203-218, Figure 5). There is still much we do not understand about the evolution of synonymous codon usage (for example https://doi.org/10.1534/genetics.119.302542). Nonetheless, everything we do know suggests that either neutral or selective pressures should have long ago erased any codon usage signatures retained from a bacterial origin of the OSKAR domain. If the difference in AT3/GC3/codon useage is indeed unusual between the OSKAR and LOTUS domains of oskar (a conclusion that the manuscript has yet to convincingly demonstrate), then we insist that this is much more likely to be due to some unknown feature of oskar function that constrains its codon usage, than it is to reflect its evolutionary origin. As the authors mention in the revised manuscript, given that oskar mRNA is localized in the oocyte, this provides an entirely plausible constraint on the mRNA sequence independent of its coding role. We note that FlyBase also has annotated a lncRNA that almost entirely overlaps with the oskar mRNA in D. melanogaster; we don't know what evidence supports this annotation, but if true, this could indicate additional constraints that could shape codon usage. But in the end, we think such speculation is of limited value to the manuscript if the reader is not convinced that the codon usage is indeed unusually different between the two domains in oskar.

https://doi.org/10.7554/eLife.45539.sa1

Author response

Essential revisions:

The essential revisions fall into two major categories: concerns about the phylogenetic analyses and concerns about the codon usage analyses. The concerns about the codon usage analyses are quite substantial. Here, the authors have two options – one would be to remove completely this part from the manuscript (given that the HGT inference can stand on its own). The other would be to address the reviewers' concerns, which may reveal additional interesting biology.

We thank the reviewers for these suggestions. We chose to take the second of these two suggested options, and to address the reviewer’s concerns, as described for each point below, rather than completely removing the codon use analyses from the manuscript. Taken together, the results of the multiple new analyses that we performed supported our major conclusions. In some cases, notably for the phylogenetic analyses, they provided even stronger support for our hypothesis than in our original submission (Comments 11 through 13 below). In others, in particular for the codon use analyses, they indeed revealed additional nuances to the likely evolutionary history of the oskar gene (Comments 1 through 10 below). Overall we feel that the revised manuscript is stronger as a result of these revisions.

Concerns about codon usage analyses:

Main text, twelfth paragraph: The reviewers were surprised that you were able to detect codon usage differences for HGT events that occurred that far back in time (e.g., third position synonymous sites are saturated between D. melanogaster and D. pseudoobscura). Thus, explanations of any differences between LOTUS and OSK parts of the oskar gene should be carefully scrutinized. In particular:

Summary response to concerns about codon use analysis: In response to these points about our codon use analyses, we have performed a number of new analyses of codon use, modified relevant figures (Figure 3), added new relevant supplementary figures (Figure 3—figure supplements 1-6), and have rewritten the section describing these analyses within the main text.

The codon use analysis in the revised manuscript now includes an assessment of the following four aspects of codon use:

1) GC3 and AT3 use: this analysis was included in the original manuscript, but we have added an new analysis as per the suggestion in Comment 4 below; results are shown in revised Figure 3, new Figure 3—figure supplement 1 and new Figure 3—figure supplement 2.

2) Comparison of codon adaptation index (CAI) between full length oskar, OSK and LOTUS domains, and all full length genes in a genome, as per Comment 2 below; results are shown within this document in Author response image 2.

3) Comparison of the difference in CAI between OSK and LOTUS domains (”δ CAI”), compared with the difference in CAI between the 5’ and 3’ ends of an average gene in the genome; results are shown within this document in Author response image 3.

4) Analysis of the CAI values for OSK and LOTUS domains in the context of the range of CAI values for all 3’ and 5’ halves of all genes in the genome; results are shown within this document in Author response image 4.

In the original manuscript, we had performed only analysis of parameter (1) above. In response to reviewer suggestions (Comment 2), we performed new analyses of parameters (2) through (4). We also refined our analysis of (1) relative to the original manuscript, by adding a second GC3/AT3 analysis where we changed the way we cut genes in half (see Comment 4). This analysis yielded the same result as our original analysis; indeed, we observed a slightly larger difference between the OSK and LOTUS domains with this method, than we had with our original method.

Taken together, analysis of these four parameters showed that the biggest difference in codon use between the OSK domain and the LOTUS domain, and also between the OSK domain and the genome as a whole, was that the difference in GC3 use between the OSK and the LOTUS domains was greater than the difference in GC3 use between the 5’ and 3’ halves of the average gene in the genome. Thus, in the revised manuscript, we retained original Figures 3A-C (with presentation modifications for Figure 3A as per Comment 8), which present the results of analysis (1) above. The results of new CAI analyses (done in response to Comments 2 and 5) are presented in new Figure 3—figure supplement 5.

Please see point by point responses to each of the ten comments regarding codon use analyses, in the following section:

1) If we understand the methods correctly, the process used to calculate codon usage is somewhat unorthodox. By calculating the frequency of all 64 codons, rather than usage of alternate synonymous codons, this metric of similarity will be influenced by amino acid composition. Thus, genes where the 5' and 3' ends look quite different in their codon usage could result from very different amino acid composition between the two halves, rather than differences in synonymous codon usage. This raises the possibility that the large differences in codon usage between the LOTUS and OSK domains could be driven by differences in amino acid composition. Do either of these domains (especially OSK) have a particularly unusual or distinctive amino acid composition, relative to other arthropod genes?

We take this excellent point by the reviewer. In response to this concern, we examined the relationship between the cosine difference in codon use between the two domains, and amino acid frequency cosine difference between the two domains. The results of this analysis are shown in Author response image 1. We found that, as per the reviewer’s suspicion, these two parameters were indeed highly correlated, such that we could not tell if this apparent different codon use result (based on cosine difference in codon use) was a consequence of different amino acid use or a difference in synonymous codon use. We therefore removed original Figure 3D from the revised manuscript.

Author response image 1

2) We recommend that you conduct this analysis using a metric of codon usage that is independent of amino acid composition, such as the codon adaptation index or relative synonymous codon usage.

We appreciate this suggestion by the reviewer and agree that it would be ideal to be able to assess codon use with a metric that is independent of amino acid composition. For example, to calculate the codon adaptation index (CAI) for a gene, we would wish to use a well annotated set of highly expressed genes that display optimal codon use, based on genome and transcriptome analysis. However, to our knowledge, no such set of genes exists for most of the species studied here. We therefore took the following approach: for the 17 species that we used for the codon use analyses (listed in Supplementary file 1E), we calculated the frequency of codon use for every gene in the available transcriptomes, and calculated a “global CAI” for all genes in each of these species (rather than only for highly expressed genes, since we cannot be confident which ones those are based on the available data for these species). We also calculated the codon use frequencies for the full length oskar gene, and for the two domains of oskar, LOTUS and OSK.

Using these frequencies, we performed two analyses. First, for each oskar ortholog in each of these 17 species, we compared the codon use in the full length gene, and in each of its two domains, to the codon use of all genes in the genome. We found that codon use for full length oskar, and for each of its two domains LOTUS and OSK, was well within the distribution of codon use for all genes in the genome (Author response image 2).

Author response image 2

Second, we compared the difference in codon use between the LOTUS and OSK, to the difference in codon use between the 3’ and 3’ halves of all genes in the genome, and again found that the values were well within the distribution of 5’ vs. 3’ codon use for all genes in the genome (Author response image 3).

Author response image 3

Thus, this analysis does not reveal any significant difference in codon use measured in this way, between the OSK and LOTUS domains of these insect oskar orthologs. The revised manuscript does include some new CAI analyses examining possible differences between OSK domain codon use and codon use in the rest of the genome (discussed below in Comment 5 Response). However, we include Author response image 2 and 3, as it is not a useful way to compare codon use between the OSK and LOTUS domains, given the likely confounding effect of distinct amino acid use between domains that the reviewer helpfully suggested we consider (Author response image 1).

3) There are many other possible explanations for why the coding sequences of two domains in a protein may exhibit different codon usage that may have to do with domain folding, translational selection, etc. that would be worth considering as alternative explanations.

We take this point by the reviewer. In response, we have added new text to the revised manuscript that discusses these alternative explanations for the observed differences in codon usage between the two domains (main text).

4) If the oskar gene is cut in half following the procedure used for the genes in the 17 genomes, rather than comparing just the LOTUS and OSK domains, are the 5' and 3' halves still significantly discordant for AT3/GC3/codon usage?

This is an interesting suggestion by the reviewer. In response to this query, we performed a new analysis in order to assess whether cutting the oskar gene following the same procedure as devised for the Intra-Gene null distribution would change the overall results in codon use. We cut each oskar mRNA transcript in half at a random location while preserving the coding frame, in the same manner as that described for the Intra-Gene distribution analysis in the original manuscript. We then ran the same GC3 AT3 and wobble positions analyses on those “half sequences” of oskar orthologues. We found that the overall difference in codon usage between the two halves was the same as in our original comparison of just the LOTUS and OSK domains. In other words, the two domains were uncorrelated in their codon use, in contrast to most genes in the genome, which did show a correlation in codon use between the two halves of a gene. The only difference we noted relative to our original analysis was in the T3 wobble position which showed some correlation between OSK and LOTUS domains in our original analysis, but showed no correlation at all in the new random cut analysis. Thus, overall the main finding of discordant AT3/GC3 codon use between the two domains of oskar remains unchanged by this new analysis. We have included the results of the new random cut analysis in new Figure 3—figure supplement 4.

5) The manuscript states: "if evolutionary time had not completely erased the original GC3 content and codon use signatures from the putative bacterially donated sequence (OSK), we might detect some differences in these parameters both from the LOTUS domain, and from the host genome." All of the analyses presented appear to address the first issue (OSK vs. LOTUS), but I did not see any analyses addressing the second question, whether these parameters systematically differ between the OSK domain and the rest of the host genome. If some of the analyses in the paper do bear on this question, this connection should be made more explicit. This seems like an important question, and would help to inform whether the observed differences between the LOTUS and OSK domains are in fact due to unusual features of the OSK domain specifically.

We thank the reviewer for this important point. The reviewer is correct in that we did not explicitly test differences in codon use parameters between the OSK domain and the rest of the genome in the original manuscript. To address this issue, we made use of the CAI approach as explained in the response to Comment 2. We calculated the CAI for each transcript in each of these 17 genomes, for the full length oskar mRNA, and for the LOTUS and OSK domains. We found that the latter three CAI values fell well within the ranges of the genome-wide CAI distribution, for all 17 genomes studied (new Figure 3—figure supplement 6A).We then asked if the difference in CAI between the LOTUS and OSK domain was larger than what would be expected than the differences between the 3’ and 5’ end (cut in ”half” according to the procedure described in response to Comment 4) of an “average gene” in each genome. Using the Intra-Gene Distribution as a null model, we computed the CAI difference between the 3’ and 5’ end of randomly cut transcripts, and did the same between the LOTUS and OSK sequences. We found no significant difference between the CAI difference for the LOTUS and OSK domains, and the distribution of CAI differences for other genes in the genome (new Figure 3—figure supplement 6B).

Finally, we asked how the LOTUS and OSK domain CAI values compared to the median genome-wide CAI values for the 5’ and 3’ ends of each of the other genes in a genome. To address this, we computed a CAI Z score for all genes as follows: for each genome we calculated the mean CAI and the standard deviation of the CAI for each of the 5’ and 3’ parts of the transcripts of all predicted genes in each of the 17 genomes under study. Using these values, we computed (a) the Z score relative to the global CAI for that genome, for each of the 5’ and 3’ halves, and (B) the Z score for each of the LOTUS and OSK domains of the oskar ortholog of each genome. We found that the Z score for the LOTUS and OSK domains fell within the distribution of Z scores other genes in the genome (new Figure 3—figure supplement 6C).

Thus, based on these CAI analyses, we did not detect any unusual codon use features of the OSK domain relative to the remainder of the genome. We include and discuss the results of these new analyses in the revised manuscripts as Figure 3—figure supplement 6.

6) In Figure 3—figure supplement 1, why is the correlation for AT3 between the 5' and 3' ends of genes positive, but negative for GC3? This seems counter-intuitive to me.

We found that the aggregation of the data from all 17 genomes analyzed here generated a distribution that appeared negatively correlated. Indeed, we agree with the reviewer that this seems counterintuitive. However, when each genome is observed individually, we can see that correlations are positive. To clarify this for the reader, we have added two new supplementary figures showing the individual correlations for GC3 and AT3 for each genome analyzed (Figure 3—figure supplements 1 and 2).

7) We don't understand this sentence: "Thus, we sampled each codon at least twice, preserving the coding frame." Why does cutting each gene in half at a randomly chosen location sample each codon twice (or more)?

This comment from the reviewer made us realize that we should clarify our description of this analysis. To this end, we have added a new schematic to diagram our analysis method (Figure 3—figure supplement 5), and have rewritten the corresponding text (subsection “Generation of Intra-Gene Distribution of Codon Use”).

8) Barplots in Figure 3A and 3B: it would be more informative to see the full distribution of correlations for all non-oskar genes. It would also be useful to see the points for the LOTUS and OSK domains highlighted in Figure 3—figure supplement 1. In general, Figure 3—figure supplement 1 appears to be more informative than Figure 3.

We agree with this point by the reviewer. In response, we have made the following changes:

a) We have added new representations of the plots shown in the original Figure 3—figure supplement 1 (now called Figure 3—figure supplement 2 in the revised manuscript). These new representations now appear as revised Figure 3, new panels A and B. As suggested, these plots identify the 17 oskar genes in the analysis (red dots) so that the reader can see how their Z scores compare with those of non-oskar genes.

b) Because the new panels Figure 3A and B show the correlations for every gene analyzed in every one of the 17 genomes included in the analysis, they are very densely populated. We therefore wished to include an easier way for readers to visualize how the GC3 correlations for the LOTUS and OSK domains of oskar, compare with the GC3 correlations for the 5’ and 3’ parts of non-oskar genes. To this end, we added two new supplementary figures (Figure 3—figure supplements 1 and 2 in the revised manuscript) that provide a contour map for the scatter plot of the GC3 and AT3 correlations for the 5’ and 3’ parts of each of the 17 genomes included in this analysis, and shows where the LOTUS/OSK domain G3 correlation for each oskar gene falls relative to all non-oskar genes in each genome (red dot).

c) Because we believe that it is easier to compare the residuals in the original bar plots that we showed in the original Figure 3A and B, we have retained these in the revised Figure 3, as panels C and D.

9) Figure 3C: we expect there are many points within the "intra-gene" set that are not displayed? If this is true, it would be more transparent to allow the boxplot whiskers to extend to the full range of the data.

We take this point by the reviewer, and in response, have modified Figure 3C to show the full range of the data.

10) Figure 3D is hard to understand – those two boxplots show two null distributions, but we don't see where the differences between the LOTUS and OSK domains are indicated.

As Comment #1 above helped us realize, our choice of distance metric in this analysis could not exclude the possibility that differences in codon use between the two domains of oskar were due to differences in amino acid use. We have therefore eliminated this panel from the revised Figure 3. Please see our responses to Comments #1 through 9 for further explanations of the rationale behind this revision.

Concerns about phylogenetic analyses / results:

11) Although the reviewers appreciate that the MUSCLE sequence alignment program is one of the state-of-the-art software for this type of analyses, it is fair to say that the use of different alignment programs can sometimes generate considerable variation in phylogenetic inference (e.g., https://academic.oup.com/mbe/article/30/3/642/1038709). We request that the authors use at least a couple of additional programs for sequence alignment (e.g., PRANK and T-Coffee) and rebuild the trees for the two domains. This should provide a sanity check that the key conclusions of their work are not sensitive to method/alignment software.

This is an excellent suggestion by the reviewer. In response to this comment, we created new alignments using PRANK and T-Coffee for both domains, and trimmed these alignments using the same parameters as described in the original manuscript for the MUSCLE alignments. The new PRANK and T-Coffee alignments are provided in the revised supplementary files (OSK_prank_aligned.fasta and OSK_tcoffee_aligned.fasta). We then used these new PRANK and T-Coffee alignments to rebuild both trees using RAXML with the same parameters described in the original manuscript.

For the OSK domain, as did the original MUSCLE-based tree, the topology of the PRANK-based and T-Coffee-based trees revealed the OSK domain as more closely related to bacterial than to eukaryotic sequences. The following minor differences in topology between the trees did not affect this bacterial (rather than eukaryotic) affinity relationship of the OSK domain:

PRANK Tree

1) The closest bacterial nodes to the OSK domain formed a polytomy in the original MUSCLE-based tree, but are resolved in the PRANK-based tree. Specifically, OSK now branches within the group of bacteria that it was previously sister to.

2) The support values are globally higher in the PRANK-based tree in comparison to the previous MUSCLE-based tree.

3) Small differences in the topology of the bacterial sequences are seen throughout the PRANK-based tree. Specifically, a monophyletic group of Bacteroidetes changed its branch position from the base of the tree to within the main Firmicutes monophyletic group, as a sister group to the CAZ3 fungal sequences. Two Bacteroidetes sequences (R7ADB4) and one from an environmental sample branch at the base of a large clade of Firmicutes, rather than within that clade as they did in the MUSCLE-based tree. Finally, the Proteobacteria sequences now branch outside of the larger Firmicutes clade.

T coffee Tree

1) Two Streptococcus sequences now branch with the bacteria clade sister to OSK sequences.

2) Similar to the PRANK-based tree, the CAZ 3 sequences are now more closely related to the Bacteroidetes clade, though they are not sister to this clade unlike in the PRANK-based tree. Instead, the CAZ3 sequences branch as an outgroup of the main clade containing OSK and the Firmicutes.

3) Similar to the PRANK tree, The Proteobacteria clade branches outside the main group of Firmicutes bacterial sequences.For the LOTUS domain, the PRANK-based and T-Coffee-based trees showed that, as for the MUSCLE-based tree presented in the original manuscript, the closest relatives of the LOTUS domain of Osk proteins are other LOTUS domains found in eukaryotic proteins, as would be expected for a gene of animal origin. In trees made with all of these alignment methods, the phylogenetic interrelationships of these sequences is largely consistent with the current species or family level trees for the corresponding insects.

The following specific minor differences, none of which affect these major conclusions, were observed between trees:

PRANK Tree

1) In the PRANK-based tree, the dipteran oskar LOTUS domains grouped within a common node with the TUDOR 7 LOTUS domain and the other insect oskar LOTUS domains, and a single arthropod Tudor 5 LOTUS domain (A0A0J7KVQ7) branched sister to the dipteran oskar LOTUS domains

2) All the arthropod LOTUS domains are grouped under one common node (with the MUSCLE tree they were split between the TUD5 and TUD7 groups).

3) Other differences in the topology of the non-oskar sequences are seen in the PRANK-based tree. Notably, the internal branching of mammalian TUD5 sequences changed to form two main clades, one containing mostly primate sequences, the other containing non-primate TUD5 sequences. In addition, the internal branching of arthropod TUD7 sequences changed, such that arthropod sequence relationships followed order relationships, including with three hymenopteran, two hemipteran and two coleopteran sequences.

T Coffee tree

1) The Oskar LOTUS domain sequences formed a monophyletic clade that is sister to the TUD7 sequences.

2) TUD5 sequences showed similar relationships as those recovered in the MUSCLE Tree. Similar to the PRANK-based tree, the primate TUD5 sequences formed a monophyletic group.

3) TUD7 sequences formed two main clade, one composed of sequences from Chondrichthyes, and one composed of mammalian sequences.

In the revised manuscript, we present the new PRANK-based trees in new Figure 2—figure supplement 6 (LOTUS) and Figure 2—figure supplement 7 (OSK), and the new T-Coffee-based trees in new Figure 2—figure supplement 10 (LOTUS) and Figure 2—figure supplement 11 (OSK). We also present graphic visualizations of the similarities and differences between the trees generated with the different alignment methods (Figure 2—figure supplement 8 (MUSCLE vs. PRANK: OSK), Figure 2—figure supplement 9 (MUSCLE vs. PRANK :LOTUS), Figure 2—figure supplement 12 (MUSCLE vs. T-Coffee: OSK) and Figure 2—figure supplement 13 (MUSCLE vs. T-Coffee: LOTUS)).

12) Main text, seventh paragraph: to really claim that the dipteran LOTUS domain sequences do not group together with the LOTUS domain sequences from other insects, you should do a topology constraint analysis (like you've done for the HGT). Is the ML topology where these two groups of sequences are forced into monophyly significantly different from the unconstrained ML topology? The reviewers think you need to show that these two topologies are significantly different to be able to claim that this is not simply an artifact of doing phylogenetic analyses using a short sequence alignment. Also note that your statement, "the phylogenetic interrelationships of these [LOTUS] sequences is largely consistent with the current species or family level trees for the corresponding insects", contradicts your findings/arguments in the aforementioned paragraph.

This is an excellent suggestion by the reviewers. In response to this suggestion, we used SOWHAT to compare the probabilities of trees with various topology constraints. We found that the ML topology requiring monophyly of all insect LOTUS domains was significantly less likely than the unconstrained topology, with a p-value of 0.037. We present the results of this analysis in new Figure 2—figure supplement 5.

13) It is not clear how the two domains fused and when (approximately) they did so. I think it would be useful for the readers if the authors speculated in a small paragraph their hypothesis (based on their knowledge of the distribution of these two domains in insects) when the fused oskar gene originated.

In response to this comment, we have added new text of the revised manuscript to explain why that we believe that this fusion event can be dated most precisely to at least the time before major insect divergence (main text).

Other broad comments:

14) Main text, fourth paragraph: why are you using two different thresholds for your blast searches of the protein versus the two domains?

In the original manuscript we had done searches in November of 2015 using the full length protein as a query, applying an E-value threshold of 0.01 for matches to the whole sequence, and an E-value threshold of 10 for matches to the OSK and LOTUS domains. We chose these different thresholds because we wished to cast a wide net in our initial search for potential domains that might be related to the domains of oskar. To determine whether having the less stringent cutoff for the full length protein search would have changed our results, we re-did the search in May of 2019 using the D. melanogaster osk full length sequence as a query and imposing an E-value threshold of 10. This search yielded a similar result to the one described in the original manuscript: that is, all sequences were from an insect species, with the exception of three TUDOR 5 Pocillopora damicornis (coral) sequences with E-values of 1.8, 1.8 and 1.9. Those sequences had been added to the nr database in 2018 (https://www.ncbi.nlm.nih.gov/assembly/GCF_003704095.1) and thus were not present during our first search in 2015. Searches with hmmsearch and the OSK.hmm model revealed that none of these sequences contained a OSK domain, so they were not included in any further analyses. In response to this comment, we have clarified these points in the revised Materials and methods section. Subsection “BLAST searches of Oskar” is an updated BLAST section where we explain the choices described above. In addition, we have added a reference to the specific relevant Materials and methods section, with a comment about E-value choices, to the revised main text.

15) The discussion and cited studies on HGT were too narrow and well known examples of bacterial to insect HGT not cited (e.g., Husnik et al., 2013; Sloan et al., 2014; Acuna et al., 2012, to mention just a few).

In response to this comment, we discuss and include a broader range of examples of HGT studies from the literature in the revised manuscript, including the ones specifically suggested by the reviewer (main text).

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Essential revisions:

While the reviewers appreciated the authors' additional analyses on codon usage between the two domains of the oskar gene, they raised numerous concerns about the statistical analyses performed as well as for the inference that the observed patterns of codon usage may stem from the HGT event (see detailed comments below). The reviewers thought that the argument that the observed patterns are remnants of the HGT event is highly unlikely and hard to reconcile with everything that we know about the (fast) rate with which codon bias evolves. The reviewers also raised several remaining issues in the calculations of the codon usage statistics, which are substantial and would require additional, extensive analyses.

Given that the manuscript is a short communication and that the codon bias analyses are not mentioned in the title or Abstract (so their removal does not impact the manuscript's key result and discovery), our recommendation is that the authors remove all analyses associated with codon usage. If the authors can do so, there is broad agreement that the rest of this manuscript would be suitable for publication and would be accepted without the need for additional revisions (everyone concurred that the phylogenetic aspects of the manuscript are well done and will be interesting to a broad audience). Alternatively, if the authors feel strongly that they wish to keep the codon usage analyses in their manuscript, our recommendation is that the manuscript should be rejected. We sincerely hope that you decide to implement the reviewers' suggestions so that your manuscript can be accepted for publication at eLife.

We have removed all codon use analysis from the manuscript as suggested. Accordingly, we have not responded in a point by point fashion to each of the specific comments about codon use analyses.

https://doi.org/10.7554/eLife.45539.sa2

Article and author information

Author details

  1. Leo Blondel

    Department of Molecular and Cellular Biology, Harvard University, Cambridge, United States
    Contribution
    Data curation, Formal analysis, Validation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2276-4821
  2. Tamsin EM Jones

    Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, United States
    Present address
    European Bioinformatics Institute, EMBL-EBI, Wellcome Genome Campus, Hinxton, United Kingdom
    Contribution
    Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0027-0858
  3. Cassandra G Extavour

    1. Department of Molecular and Cellular Biology, Harvard University, Cambridge, United States
    2. Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    extavour@oeb.harvard.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2922-5855

Funding

Harvard University

  • Leo Blondel
  • Cassandra G Extavour
  • Tamsin E M Jones

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Sean Eddy, Chuck Davis, and Extavour lab members for discussion.

Senior Editor

  1. Patricia J Wittkopp, University of Michigan, United States

Reviewing Editor

  1. Antonis Rokas, Vanderbilt University, United States

Reviewer

  1. Eve Gazave, Institut Jacques Monod, France

Publication history

  1. Received: January 29, 2019
  2. Accepted: February 23, 2020
  3. Accepted Manuscript published: February 24, 2020 (version 1)
  4. Version of Record published: May 26, 2020 (version 2)

Copyright

© 2020, Blondel et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,855
    Page views
  • 307
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Developmental Biology
    2. Immunology and Inflammation
    Samuel Liegeois, Dominique Ferrandon
    Insight
    1. Developmental Biology
    2. Stem Cells and Regenerative Medicine
    Louise Menendez et al.
    Research Article