Analysis of yeast, fly and human genomes suggests that sequence divergence is not the main source of orphan genes.
For half a century, most scientists believed that new protein-coding genes arise as a result of mutations in existing protein-coding genes. It was considered impossible for anything as complex as a functional new protein to arise from scratch. However, every species has certain genes, known as 'orphan genes', which code for proteins that are not homologous to proteins found in any other species. What do these orphan genes do, and how are they formed?
To date, the roles of hundreds of orphan genes have been characterized. Although this is just a tiny fraction of the total, it is known that most of them code for proteins that bind to conserved proteins such as transcription factors or receptors. Some of these proteins are toxins, some are involved in reproduction, some integrate into existing metabolic and regulatory networks, and some confer resistance to stress (Carvunis et al., 2012; Li et al., 2009; Xiao et al., 2009; Arendsee et al., 2014; Belcaid et al., 2019). However, none of them are enzymes (Arendsee et al., 2014). Orphan genes arise quickly, so they may provide a disruptive mechanism that allows a given species to survive changes to its environment. Thus, the study of how orphan genes arise (and fall) is central to understanding the forces that drive evolution (Figure 1).
One possible mechanism is the 'de novo' appearance of a gene from an intergenic region or a completely new reading frame within an existing gene (Tautz and Domazet-Lošo, 2011). An alternative mechanism is that the coding sequence of the orphan gene arises by rapid divergence from the coding sequence of a preexisting gene: this would mean that an entire set of regulatory and structural elements would be available to the gene as it evolves. Now, in eLife, Nikolaos Vakirlis and Aoife McLysaght (both from Trinity College Dublin) and Anne-Ruxandra Carvunis (University of Pittsburgh) report how they have studied yeast, fly and human genes to compare the contributions of these two mechanisms to the emergence of orphan genes (Vakirlis et al., 2020).
Previous studies have used simulations to estimate the number of orphan genes that appear by divergence; until now, no one had relied on actual genomics data to study this phenomenon. Vakirlis et al. use a new approach to analyze orphan genes that have originated through divergence. They examine regions of the genome that correspond to each other (so-called syntenic regions) in related species to determine whether a gene exists in both regions and, if so, whether the proteins are non-homologous. If a gene is present in both regions but the two proteins show no detectable homology, the orphan gene may have originated by rapid divergence from the coding sequence of a preexisting gene.
Using this method, Vakirlis et al. infer that at most 45% of S. cerevisiae (yeast) orphan genes, 25% of D. melanogaster (fruit fly) orphan genes, and 18% of human orphan genes arose by rapid divergence. These figures are upper estimates because a gene counted as diverged may have another origin: for example, a new coding sequence might have arisen de novo within an existing gene, rather than the existing coding sequence having been modified beyond recognition.
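The logic of the synteny-based comparison described above can be illustrated with a toy sketch. This is not the pipeline used by Vakirlis et al.; the class `SyntenicPair`, the function names and the example records are all hypothetical, and the presence of a counterpart gene and the outcome of the similarity search are assumed to have been computed beforehand by other tools.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SyntenicPair:
    """One orphan gene and what was found in the syntenic region of a related species.

    All fields are hypothetical inputs, assumed to come from upstream analyses.
    """
    orphan_id: str
    counterpart_id: Optional[str]  # None if the syntenic region contains no gene
    similarity_detected: bool      # True if the two proteins show detectable homology


def may_have_diverged(pair: SyntenicPair) -> bool:
    """A gene sits in the syntenic region, yet the proteins look unrelated."""
    return pair.counterpart_id is not None and not pair.similarity_detected


def upper_bound_divergence_fraction(pairs: Sequence[SyntenicPair]) -> float:
    """Fraction of orphans that are candidates for origin by divergence.

    This is an upper bound: a candidate may instead carry a coding sequence
    that arose de novo within the existing gene.
    """
    if not pairs:
        raise ValueError("need at least one orphan gene")
    return sum(may_have_diverged(p) for p in pairs) / len(pairs)


# Made-up records for illustration only (not data from the study).
example_pairs = [
    SyntenicPair("orphanA", "geneX", similarity_detected=False),  # candidate
    SyntenicPair("orphanB", None, similarity_detected=False),     # no gene in region
    SyntenicPair("orphanC", "geneY", similarity_detected=True),   # homology detected
    SyntenicPair("orphanD", None, similarity_detected=False),     # no gene in region
]

if __name__ == "__main__":
    print(upper_bound_divergence_fraction(example_pairs))
```

In this sketch, only an orphan whose syntenic region contains a gene with no detectable similarity counts toward the estimate, which is why the resulting fraction should be read as a ceiling rather than a measurement.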
But how can a protein sequence continue to be selected for as it rapidly diverges? Vakirlis et al. suggest that divergence might occur by a process of partial pseudogenization: the existing gene becomes non-functional, and then, with no selection pressure to retain the old protein, it diverges to form an orphan gene.
Many orphan genes may not have been identified yet, because they do not have homologs in other species, and have few recognizable sequence features. This means that up to 80% of orphan genes can be missed when a new genome is annotated (Seetharam et al., 2019). The approach detailed by Vakirlis, Carvunis and McLysaght evaluates specifically those annotated orphan genes for which a similar gene exists in a related species (which is ~50% of them; Arendsee et al., 2019). As high-quality genomes from more species become available, and as more orphan genes are annotated, the approach will provide yet deeper insights into the origin of these genes.
One of the many open questions in this field deals with genes of ‘mixed age’. Some such genes have incorporated ‘chunks’ of orphans into their coding sequences. A gene that has done this is (somewhat arbitrarily) considered to be the age of its most ancient segment, but we know little about the mechanism of this process or its significance. Another question involves the unique strategies and rates of evolution of each gene (Revell et al., 2018). How might the abundance and mechanisms of orphan gene origin vary among species? And how do different environments affect the emergence of orphan genes?
Revell et al. (2018) Comparing evolutionary rates between trees, clades and traits. Methods in Ecology and Evolution 9:994–1005. https://doi.org/10.1111/2041-210X.12977
- Version of Record published: February 19, 2020 (version 1)
© 2020, Singh and Syrkin Wurtele
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.