Figures and data in Enrichment of rare codons at 5' ends of genes is a spandrel caused by evolutionary sequence turnover and does not improve translation

8 figures and 7 additional files

Figures

Figure 1 with 1 supplement

Calculation of translation speed confirms slow initial translation (SIT).

Translation speeds were calculated using ribosome residence time (RRT) (Gardin et al., 2014; Supplementary file 1 for RRT values) as a measure of codon-specific translation speed over *S. cerevisiae* open reading frames (ORFs). The horizontal line indicates average inverse RRT across all ORFs. The average speed in the first 40 amino acids is about 1.1% slower than in the rest of the gene (p < 0.001).
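The comparison described in this legend can be sketched per gene as follows. This is a minimal illustration, not the authors' code (their R code is in Source code 1): the RRT values, the toy ORF, and the function names `split_codons`, `mean_speed`, and `initial_vs_rest` are all hypothetical, and speed is assumed to be the mean of 1/RRT over codons.

```python
from statistics import mean

# HYPOTHETICAL RRT values for a few codons (illustration only;
# the real values are listed in Supplementary file 1).
EXAMPLE_RRT = {"ATG": 1.0, "GCT": 0.9, "CGA": 1.6,
               "TTG": 0.95, "CCG": 1.4, "GAA": 0.85}


def split_codons(orf):
    """Split an ORF into codons, dropping any trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def mean_speed(codons, rrt):
    """Mean inverse RRT (assumed speed measure) over codons with an RRT value."""
    return mean(1.0 / rrt[c] for c in codons if c in rrt)


def initial_vs_rest(orf, rrt, window=40):
    """Return (mean speed of the first `window` codons, mean speed of the rest)."""
    codons = split_codons(orf)
    return mean_speed(codons[:window], rrt), mean_speed(codons[window:], rrt)


# Toy ORF: two slow codons up front, faster codons afterwards.
# A window of 2 is used only so the toy example stays short.
toy_orf = "CGACCG" + "GCTGAA" * 3
head_speed, rest_speed = initial_vs_rest(toy_orf, EXAMPLE_RRT, window=2)
percent_slower = 100 * (1 - head_speed / rest_speed)
```

With real data, `window=40` and the full RRT table would be used, and the per-gene values would then be averaged across ORFs.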

Figure 1—figure supplement 1

Distribution of translation speeds at 5’ and 3’ ends.

The distribution of relative translation speeds over 5694 genes is shown for the first and last 40 amino acids. For the 5’ end, 57.2% of genes have relatively slow initial translation, while for the 3’ end, 50.14% of genes have a slow terminal translation.

Figure 2 with 1 supplement

Codon usage in the slow initial translation (SIT) region.

(A) Relative codon usage in the SIT versus the rest of the gene. The Y-axis shows codon usage in the first 40 amino acids (omitting ATG) divided by its usage in the rest of the gene. The 61 sense codons are grouped by amino acid. Within each group, codons are ordered from least to most frequent, left to right. Red arrows show the seven slowest codons by ribosome residence time (RRT), and purple arrows show the seven rarest codons by total usage. Figure 2—figure supplement 1 shows the correlation between codon usage and translation speed. Blue shows Start and alternative Start codons (ATG, TTG, ATT, ATA). Ratios above 1 show enrichment in the first 40 amino acids. Typically, the rarest codons are enriched. (B) Absolute usage of each leucine codon in the SIT. The absolute usage frequency of each leucine codon is shown globally, and for the first 40 amino acids. Rare codons are still rare in the SIT, just not as rare as elsewhere. The same pattern holds for the other amino acids.
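The ratio plotted in panel A can be sketched as below. This is an illustrative reading of the legend, not the authors' implementation: the function names and the toy ORF are hypothetical, "usage" is assumed to mean a codon's fraction of all codons in the region, and "omitting ATG" is assumed to mean skipping the initiating codon.

```python
from collections import Counter


def split_codons(orf):
    """Split an ORF into codons, dropping any trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def relative_usage(orfs, window=40):
    """Usage frequency in the first `window` codons divided by usage
    frequency in the rest of the gene, for every codon seen in the rest."""
    head_counts, rest_counts = Counter(), Counter()
    for orf in orfs:
        codons = split_codons(orf)
        head_counts.update(codons[1:window])  # assumption: skip the start codon
        rest_counts.update(codons[window:])
    head_total = sum(head_counts.values())
    rest_total = sum(rest_counts.values())
    return {
        codon: (head_counts[codon] / head_total) / (n_rest / rest_total)
        for codon, n_rest in rest_counts.items()
    }


# Toy ORF with a 4-codon "initial region": ATG CGA CGA GCT | GCT x5, CGA
toy_orfs = ["ATG" + "CGA" * 2 + "GCT" + "GCT" * 5 + "CGA"]
ratios = relative_usage(toy_orfs, window=4)
# In this toy case CGA is enriched at the start (ratio above 1)
# and GCT is depleted (ratio below 1).
```

A ratio above 1 corresponds to a bar above the enrichment line in panel A.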

Figure 2—figure supplement 1

Codon speed and codon usage are correlated.

(A) Rare codons are translated slowly. Each dot represents a sense codon. The x-axis displays the translation speed of each codon (modified from Gardin et al., 2014); the y-axis displays the global frequency of usage of each codon. The correlation is 0.64, p<0.001. (B) The first 40 codons are enriched for rare codons. Each dot represents a sense codon. The relative usage of each type of codon in the first 40 codons of genes (i.e. in the slow initial translation, SIT) (y-axis) is displayed against global codon usage (x-axis). The correlation is –0.61, p<0.001. (C) The first 40 codons are enriched for slow codons. Each dot represents a sense codon. The relative usage of each type of codon in the first 40 codons of genes (i.e. in the SIT) is displayed against codon translation speed (i.e. 1/ribosome residence time (RRT), Gardin et al., 2014). The correlation is –0.45, p<0.001.

Figure 3

The N-termini of proteins can vary in evolution.

BLAST of four example *S. cerevisiae* proteins against proteins in the subphylum ‘*Saccharomycotina*’ (taxid: 147537) (excluding *Saccharomyces*, taxid: 4930) was performed. Top hits are shown. Red regions indicate homology with an alignment score >200, while white indicates no detected homology (BLAST default parameters). Even though all hits have high to moderate homology towards the center of the protein, many have little or no homology at the N-terminus.

Figure 4 with 1 supplement

Conservation of *S. cerevisiae* proteins over the N-terminal, Middle, and C-terminal 40 amino acids.

*S. cerevisiae* proteins were searched by BLAST against proteins of *Saccharomycotina* (excluding *S. cerevisiae*). ‘Conservation Scores’ (Methods and materials) were calculated for the N-terminal, Middle, and C-terminal 40 amino acids of the *S. cerevisiae* proteins. Scores range from 0 (no conservation) to 40 (perfect conservation). The frequency of each conservation score (3964 *S. cerevisiae* proteins) was plotted.
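Supplementary file 4 describes the conservation score as a sum, over BLAST query-start positions, of a weight multiplied by the proportion of qualifying hits whose alignment begins at that position. The sketch below illustrates that weighted sum only. The actual weights are given in Supplementary file 4 and Methods and materials; `linear_weight` here is a hypothetical stand-in chosen so that scores fall between 0 and 40, and the example proportions are invented.

```python
def linear_weight(start, window=40):
    """HYPOTHETICAL weight: `window` for an alignment starting at residue 1,
    decreasing by 1 per residue, and 0 for starts beyond the window."""
    return max(0, window - (start - 1))


def conservation_score(proportion_by_start, weight_for_start=linear_weight):
    """Sum of weight * proportion over query-start positions.

    `proportion_by_start` maps a query-start position (1-based) to the
    proportion of qualifying hits whose alignment begins there.
    """
    return sum(weight_for_start(start) * proportion
               for start, proportion in proportion_by_start.items())


# Invented example: half the hits align from residue 1, a quarter from
# residue 21, and a quarter only from residue 300 onwards.
example = {1: 0.5, 21: 0.25, 300: 0.25}
score = conservation_score(example)

# Limiting cases under the hypothetical weights.
perfect = conservation_score({1: 1.0})    # every hit aligns from residue 1
absent = conservation_score({211: 1.0})   # no hit aligns within the window
```

Under these assumed weights, a protein whose hits all align from the first residue scores 40, and one with no alignment starting in the first 40 residues scores 0, matching the range stated in the legend.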

Figure 4—figure supplement 1

Comparison of conservation scores at the N- and C-termini.

Gray, N-terminal conservation scores. Red, C-terminal conservation scores.

Figure 5 with 2 supplements

Translation speed at 3’ ends.

Translation speeds at the 3’ ends of genes were calculated using ribosome residence time (RRT) (Gardin et al., 2014; Supplementary file 1 for RRT values). The average speed over the last 40 amino acids is about 0.1% slower than in the rest of the gene, a difference that is not statistically significant. The average speed over the last 100 amino acids is about 0.19% slower, which is significantly different (p=0.028).

Figure 5—figure supplement 1

Codon characteristics at the beginnings and ends of yeast genes.

Red arrows identify the seven slowest codons (CCG, CGA, CGG, GGG, CGC, CCC, and TGG), purple arrows identify the seven rarest codons (CGG, CGC, CGA, TGC, CCG, CTC, and GGG), and blue arrows identify four Start codons (ATA, ATG, ATT, and TTG). Codons for each amino acid are arranged from least to most used, left to right. (A) Each bar is a ratio of the codon usage among the last 125 codons relative to codon usage for the remainder of genes. (B) Codon usage among the first 40 codons relative to the last 125 codons.

Figure 5—figure supplement 2

Mitochondrial and ER signal sequences.

467 mitochondrial genes were defined according to Williams et al. (14), and 222 proteins with a predicted ER signal peptide were defined according to Jan et al. (15). (A) The relative initial translation speeds (SITs) of mitochondrial, ER, and all other proteins were characterized. (B) The N-terminal protein conservation scores were characterized for the three sets of proteins.

Figure 6 with 1 supplement

Slow initial translation is correlated with poor N-terminal conservation.

(A) Proteins were grouped by their N-terminal conservation scores (top, middle, and bottom thirds), and then the relative initial translation rate was plotted for each group. More conserved N-termini have a faster initial translation. (B) Proteins were grouped by their initial translation rate (Slow, SIT; Medium, MIT; or Fast, FIT), and then the N-terminal conservation scores were plotted for each group. Genes with faster initial translation have more conserved N-termini. Relative Initial Translation Speed is the log2 of the ratio of the average ribosome residence time (RRT) of the first 40 amino acids to the average RRT of the rest of the same gene (Methods and materials).
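Written out, the definition in the last sentence of this legend corresponds to the expression below. The symbols are introduced here for illustration and are not the authors' notation: $\overline{\mathrm{RRT}}_{1\text{–}40}$ is the average RRT over the first 40 amino acids of a gene, and $\overline{\mathrm{RRT}}_{\mathrm{rest}}$ is the average RRT over the rest of the same gene.

```latex
\text{Relative Initial Translation Speed}
  = \log_2\!\left(
      \frac{\overline{\mathrm{RRT}}_{1\text{–}40}}
           {\overline{\mathrm{RRT}}_{\mathrm{rest}}}
    \right)
```

A value of 0 means the first 40 amino acids and the rest of the gene have the same average RRT; see Methods and materials for the exact convention used in the plots.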

Figure 6—figure supplement 1

Slow 3’ translation is correlated with poor C-terminal conservation.

(A) Proteins were grouped by their C-terminal conservation scores (top, middle, and bottom thirds), and then the terminal translation rate was plotted for each group. More conserved C-termini have faster terminal translation. (B) Proteins were grouped by their C-terminal translation rate (top, middle, and bottom thirds), and then the C-terminal conservation scores were plotted for each group. Genes with faster terminal translation have more conserved C-termini.

Figure 7

Genes with high levels of expression and high ribosome densities generally have rapidly translated N-termini and high N-terminal conservation scores.

(A and B) Genes were grouped by expression level (bottom, middle, and top thirds) (Lipson et al., 2009); genes with fewer than 10 read-counts were omitted to reduce noise. In A, the initial translation rate is shown; in B, the conservation scores are shown. The correlation between speed and transcript abundance fails for the bottom third of genes; possibly these are genes expressed at high levels under other conditions (e.g. meiosis and sporulation). (C and D) Genes were grouped by ribosome density (Arava et al., 2003) as a measure of intensity of translation. In C, the initial translation rate is shown; in D, the conservation scores are shown. High ribosome density correlates with high initial translation speed and high conservation score.

Figure 8 with 1 supplement

Slow initial translation inhibits gene expression.

Left three bars. A synthetic GFP was constructed with a leader amino acid sequence that had little effect on GFP. The leader sequence was recoded to give slow (SIT), medium (MIT), or fast (FIT) translation speed over the first 41 amino acids without changing the amino acid sequence; that is, the SIT, MIT, and FIT constructs had identical amino acid sequences but different average ribosome residence times (RRTs). Each construct (SIT, MIT, FIT) was integrated in a single copy at the *ADE2* locus, and 25 independently transformed strains were picked. GFP fluorescence was measured for each strain, and the RFP-normalized mean was plotted. Numerical values were: SIT, 1.66; MIT, 1.80; FIT, 2.29. GFP was normalized to RFP expressed from the same reporter molecule, but RFP fluorescence hardly changed among the transformants, and non-normalized GFP would have given very similar results. Slower initial translation reduced gene expression. Right three bars. As above, except that a putative ribosome collision site (PCS) (CGA-CGG) was inserted between the leader and the GFP. Again, slower initial translation reduced gene expression. Values were: SIT:PCS, 0.69; MIT:PCS, 0.74; FIT:PCS, 0.99.

Figure 8—figure supplement 1

Structure of the GFP reporters.

These reporters were adapted from Brule et al., 2016; Gamble et al., 2016. A leader sequence (purple), originally from *HIS3*, is appended upstream of GFP. (A) For the first three constructs, recoding of some residues within the first 41 codons with synonymous codons gave either a slow (SIT), medium (MIT), or fast (FIT) initial translation rate. Protein sequences were preserved. (B) Three analogous reporters were made with a putative ribosome collision site (PCS) at codon positions 68 and 69 (still upstream of GFP sequences). The PCS was the codon pair CGA-CGG, two rare Arg codons.

Additional files

Supplementary file 1 Ribosome residence time (RRT) values. See attached Excel spreadsheet.: https://cdn.elifesciences.org/articles/89656/elife-89656-supp1-v1.xlsx
Supplementary file 2 RRT statistics of various 5’ ends. See attached Excel Spreadsheet.: https://cdn.elifesciences.org/articles/89656/elife-89656-supp2-v1.xlsx
Supplementary file 3 Example conservation scores. Scores were calculated as described in Materials and methods, but in this example, only for a subset of Saccharomycotina. ‘Total Hits’ is the number of different proteins from the sub-phylum Saccharomycotina subset giving a BLAST bit-score of at least 50. ‘Hits in the first 40 amino acids’ is the number of proteins (out of the proteins in the ‘Total Hits’ columns) that had a BLAST alignment with an alignment score >200 matching any part of the first 40 amino acids of the query sequence (i.e. of PCA1, NSR1, etc.). ‘Query Start’ is the range of amino acid positions in the Query protein where the BLAST alignments started. For instance, for BUD5, the 125 Saccharomycotina homologs had BLAST alignments that started at positions between amino acid 211 and amino acid 420 on S. cerevisiae BUD5; none had an alignment starting within the first 40 amino acids. For SNX41, 65 of the 121 hits had an alignment beginning within the first 40 amino acids of S. cerevisiae SNX41. For RPL12B, all 121 of the Saccharomycotina homologs had BLAST alignments starting at amino acid 1 of S. cerevisiae RPL12B. The ‘Conservation Score’ is the score calculated as described in Materials and methods. Note that the number of hits varies in part because the genomes of the Saccharomycotina species were not all fully sequenced. Thus, BNA2 likely has fifteen fewer hits than TRP3 because the BNA2 locus was not sequenced in some species. However, the number of hits does not affect the conservation score, as long as the number meets the qualifying minimum.: https://cdn.elifesciences.org/articles/89656/elife-89656-supp3-v1.docx
Supplementary file 4 Calculation of a conservation score. ‘Query start position in BLAST alignment’ is the amino acid residue of the S. cerevisiae query protein where a BLAST alignment (alignment score >200) begins with a protein of Saccharomycotina. ‘Proportion of hits with this Q-Start Position’ is the proportion of qualifying Saccharomycotina hits (i.e. bit-score >50) that have their alignment begin at this position. ‘Weight’ is multiplied by ‘Proportion,’ and the sum is the conservation score.: https://cdn.elifesciences.org/articles/89656/elife-89656-supp4-v1.docx
Supplementary file 5 Sequences of Ramp genes in Figure 8.: https://cdn.elifesciences.org/articles/89656/elife-89656-supp5-v1.docx
MDAR checklist: https://cdn.elifesciences.org/articles/89656/elife-89656-mdarchecklist1-v1.docx
Source code 1 We provide the custom R code written for this project as a text file, Source Code File 1.: https://cdn.elifesciences.org/articles/89656/elife-89656-code1-v1.zip