Protein evidence of unannotated ORFs in Drosophila reveals diversity in the evolution and properties of young proteins

  1. Eric B Zheng
  2. Li Zhao (corresponding author)
  1. Rockefeller University, United States


De novo gene origination, where a previously non-genic genomic sequence becomes genic through evolution, has been increasingly recognized as an important source of evolutionary novelty across diverse taxa. Many de novo genes have been proposed to be protein-coding, and in several cases have been experimentally shown to yield protein products. However, the systematic study of de novo proteins has been hampered by doubts regarding the translation of their transcripts without the experimental observation of protein products. Using a systematic, ORF-focused mass-spectrometry-first computational approach, we identify almost 1000 unannotated open reading frames with evidence of translation (utORFs) in the model organism Drosophila melanogaster, 371 of which have canonical start codons. To quantify the comparative genomic similarity of these utORFs across Drosophila and to infer phylostratigraphic age, we further develop a synteny-based protein similarity approach. Combining these results with reference datasets on tissue- and life-stage-specific transcription and conservation, we identify different properties amongst these utORFs. Contrary to expectations, the fastest-evolving utORFs are not the youngest evolutionarily. We observed more utORFs in the brain than in the testis. Most of the identified utORFs may be of de novo origin, even accounting for the possibility of false-negative similarity detection. Finally, sequence divergence after an inferred de novo origin event remains substantial, raising the possibility that de novo proteins turn over frequently. Our results suggest that there is substantial unappreciated diversity in de novo protein evolution: many more may exist than have been previously appreciated; there may be divergent evolutionary trajectories; and de novo proteins may be gained and lost frequently. 
Overall, there may be no single characteristic model of de novo protein evolution; instead, de novo proteins may follow diverse evolutionary trajectories.
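
To make the term "open reading frame with a canonical start codon" concrete, here is a minimal sketch of how candidate ORFs beginning with ATG might be enumerated on the forward strand of a nucleotide sequence. It is not the authors' pipeline: the function name `find_orfs`, the `min_codons` threshold, and the toy sequence are all hypothetical, and the mass-spectrometry evidence that the study relies on is not modeled here.

```python
# Illustrative sketch only; names and thresholds are assumptions, not from the article.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ATG-initiated ORFs on the forward strand.

    start is the 0-based index of the A in ATG; end is the exclusive index just
    past the stop codon. min_codons counts codons from the start codon up to,
    but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first in-frame ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume scanning for the next ATG in this frame
    return orfs


# Toy sequence (hypothetical): one ORF, ATG AAA TTT GGG TAA, in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 2)]
```

Covering the reverse strand would mean running the same scan on the reverse complement. ORFs with non-canonical start codons, which the abstract implies make up the majority of the utORFs, would need a broader start-codon set.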

Data availability

Raw MS data are deposited in PRIDE under accession number PXD032197. Relevant scripts and intermediate files can be found in our GitHub repository.

Article and author information

Author details

  1. Eric B Zheng

    Laboratory of Evolutionary Genetics and Genomics, Rockefeller University, New York, United States
    Competing interests
    The authors declare that no competing interests exist.
  2. Li Zhao

    Laboratory of Evolutionary Genetics and Genomics, Rockefeller University, New York, United States
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0001-6776-1996


Funding

National Institute of General Medical Sciences (R35GM133780)

  • Li Zhao

National Institute of General Medical Sciences (T32GM007739)

  • Eric B Zheng

Robertson Foundation

  • Li Zhao

Rita Allen Foundation (Rita Allen Foundation Scholar)

  • Li Zhao

Vallee Foundation (Vallee Scholar)

  • Li Zhao

Monique Weill-Caulier Trust

  • Li Zhao

Alfred P. Sloan Foundation (Alfred P. Sloan Research Fellowship)

  • Li Zhao

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Reviewing Editor

  1. Mia T Levine, University of Pennsylvania, United States

Publication history

  1. Received: March 18, 2022
  2. Preprint posted: April 5, 2022
  3. Accepted: September 26, 2022
  4. Accepted Manuscript published: September 30, 2022 (version 1)
  5. Version of Record published: October 13, 2022 (version 2)


© 2022, Zheng & Zhao

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

Eric B Zheng, Li Zhao (2022) Protein evidence of unannotated ORFs in Drosophila reveals diversity in the evolution and properties of young proteins. eLife 11:e78772.