Accurate annotation of protein coding regions is essential for understanding how genetic information is translated into function. We describe riboHMM, a new method that uses ribosome footprint data to accurately infer translated sequences. Applying riboHMM to human lymphoblastoid cell lines, we identified 7,273 novel coding sequences, including 2,442 translated upstream open reading frames. We observed an enrichment of footprints at inferred initiation sites after drug-induced arrest of translation initiation, validating many of the novel coding sequences. The novel proteins exhibit significant selective constraint in the inferred reading frames, suggesting that many are functional. Moreover, ~40% of bicistronic transcripts showed negative correlation in the translation levels of their two coding sequences, suggesting a potential regulatory role for these novel regions. Despite known limitations of mass spectrometry to detect protein expressed at low level, we estimated a 14% validation rate. Our work significantly expands the set of known coding regions in humans.
- Nicholas T Ingolia, University of California, Berkeley, United States
© 2016, Raj et al.
This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.
Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This results in computational resource and privacy-risk barriers to access to cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium. Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly used reference panel, and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least 4-fold faster inference run time relative to standard imputation tools.
How cells control gene expression is a fundamental question. The relative contribution of protein-level and RNA-level regulation to this process remains unclear. Here, we perform a proteogenomic analysis of tumors and untransformed cells containing somatic copy number alterations (SCNAs). By revealing how cells regulate RNA and protein abundances of genes with SCNAs, we provide insights into the rules of gene regulation. Protein complex genes have a strong protein-level regulation while non-complex genes have a strong RNA-level regulation. Notable exceptions are plasma membrane protein complex genes, which show a weak protein-level regulation and a stronger RNA-level regulation. Strikingly, we find a strong negative association between the degree of RNA-level and protein-level regulation across genes and cellular pathways. Moreover, genes participating in the same pathway show a similar degree of RNA- and protein-level regulation. Pathways including translation, splicing, RNA processing, and mitochondrial function show a stronger protein-level regulation while cell adhesion and migration pathways show a stronger RNA-level regulation. These results suggest that the evolution of gene regulation is shaped by functional constraints and that many cellular pathways tend to evolve one predominant mechanism of gene regulation at the protein level or at the RNA level.