A new codon adaptation metric predicts vertebrate body size and tendency to protein disorder

  1. Department of Mathematics, University of Arizona, Tucson, Arizona 85721, USA
  2. Department of Physics, University of Arizona, Tucson, Arizona 85721, USA
  3. Department of Applied Physics, Stanford University, California, USA
  4. Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, Arizona 85721, USA
  5. Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona 85721, USA
  6. Department of Ecology and Genetics, Evolutionary Biology Center, Uppsala University, Sweden
  7. University Information Technology Services, University of Arizona, Tucson, Arizona 85721, USA

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    C Brandon Ogbunugafor
    Yale University, New Haven, United States of America
  • Senior Editor
    Detlef Weigel
    Max Planck Institute for Biology Tübingen, Tübingen, Germany

Reviewer #1 (Public Review):

In this manuscript, the authors propose a new codon adaptation metric, Codon Adaptation Index of Species (CAIS), which they present as an easily obtainable proxy for effective population size. To permit between-species comparisons, they control for both amino acid frequencies and genomic GC content, which distinguishes their approach from existing ones. Having confirmed that CAIS negatively correlates with vertebrate body mass, as would be expected if small-bodied species with larger effective populations experience more efficient selection on codon usage, they then examine the relationship between CAIS and intrinsic structural disorder in proteins.

The idea of a robust species-level measure of codon adaptation is interesting. If CAIS is indeed a reliable proxy for the effectiveness of selection, it could be useful for analyzing species without reliable life-history or mutation-rate data (which will apply to many of the genomes becoming available in the near future).

A key question is whether CAIS, in fact, measures adaptation at the codon level. Unfortunately, CAIS is only validated indirectly by confirming a negative correlation with body mass. As a result, the observations about structural disorder are difficult to evaluate.

A potential problem is that differences in GC between species are not independent of life history. Effective population size can drive compositional differences due to the effects of GC-biased gene conversion (gBGC). As noted by Galtier et al. (2018), genomic GC correlates negatively with body mass in mammals and birds. It would therefore be important to examine how gBGC might affect CAIS, and to what extent it could explain the relationship between CAIS and body mass.

Suppose that gBGC drives an increase in GC that is most pronounced at 3rd codon positions in high-recombination regions in small-bodied species. In this case, could observed codon usage depart more strongly from expectations calculated from overall genomic GC in small vertebrates compared to large ones? The authors also report that correcting for local intergenic GC was unsuccessful, based on the lack of a significant negative relationship with body mass (Figure 3D). In principle, this could also be consistent with local GC providing a relatively more appropriate baseline in regions with high recombination rates. Considering these scenarios would clarify what exactly CAIS is capturing.
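To make concrete what "expectations calculated from overall genomic GC" could look like, here is a minimal sketch. It assumes a simple null in which each codon position is drawn independently with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2, renormalized within one amino acid's synonymous codons. The function names are hypothetical, and this is an illustration of the idea rather than the authors' implementation.

```python
from itertools import product


def gc_expected_codon_probs(gc: float) -> dict[str, float]:
    """Null probability of each of the 64 codons under a GC-only model.

    Assumption: positions are independent, with P(G) = P(C) = gc / 2
    and P(A) = P(T) = (1 - gc) / 2.
    """
    base_p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    return {
        "".join(c): base_p[c[0]] * base_p[c[1]] * base_p[c[2]]
        for c in product("ACGT", repeat=3)
    }


def observed_over_expected(counts: dict[str, int], gc: float) -> dict[str, float]:
    """Observed / expected frequency for the synonymous codons of ONE amino acid.

    `counts` maps each synonymous codon to its observed count. Both observed
    and expected frequencies are normalized within this codon family.
    """
    null = gc_expected_codon_probs(gc)
    total_obs = sum(counts.values())
    total_null = sum(null[c] for c in counts)
    return {
        c: (counts[c] / total_obs) / (null[c] / total_null)
        for c in counts
    }
```

For example, lysine counts of 30 AAA and 70 AAG depart from this null when GC is 0.5, but match it exactly (both ratios equal 1) when GC is 0.7. The same observed usage can therefore look adapted or unadapted depending on which GC baseline is used, which is why the choice between genome-wide and local GC matters in the scenarios above.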

Given claims about "exquisitely adapted species", the case for using CAIS as a measure of codon adaptation would also be stronger if a relationship with gene expression could be demonstrated. RSCU is expected to be higher in highly expressed genes. Is there any evidence that the equivalent GC-controlled measure behaves similarly?

The manuscript is overall easy to follow, though some additional context may be helpful for the general reader. A more detailed discussion of how this work compares to the approach taken by Galtier et al. (2018), which accounted for GC content and gBGC when examining codon preferences, would be appropriate, for example. In addition, it would have been useful to mention past work that has attempted to explicitly quantify selection on codon usage.

Reviewer #2 (Public Review):

Summary:

The goal of the authors in this study is to develop a more reliable approach for quantifying codon usage such that it is more comparable across species. Specifically, the authors wish to estimate the degree of adaptive codon usage, which is potentially a general proxy for the strength of selection at the molecular level. To this end, the authors created the Codon Adaptation Index of Species (CAIS), which controls for differences in amino acid usage and GC% across species. Using their new metric, the authors find a previously unobserved negative correlation between the overall adaptiveness of codon usage and body size across 118 vertebrates. As body size is negatively correlated with effective population size and thus the general strength of natural selection, the negative correlation between CAIS and body size is expected. The authors argue this was previously unobserved due to failures of other popular metrics such as Codon Adaptation Index (CAI) and the Effective Number of Codons (ENC) to adequately control for differences in amino acid usage and GC content across species. Most surprisingly, the authors also find a positive relationship between CAIS and the overall "disorderedness" of a species' protein domains. As some of these results are unexpected, which is acknowledged by the authors, I think it would be particularly beneficial to work with some simulated datasets. I think CAIS has the potential to be a valuable tool for those interested in comparing codon adaptation across species in certain situations. However, I have certain theoretical concerns about CAIS as a direct proxy for the efficiency of selection when the mutation bias changes across species.

Strengths:

(1) I appreciate that the authors recognize the potential issues of comparing CAI when amino acid usage varies and correct for this in CAIS. I think this is sometimes an under-appreciated point in the codon usage literature, as CAI is a relative measure of codon usage bias (i.e. only considers synonyms). However, the strength of natural selection on codon usage can potentially vary across amino acids, such that comparing mean CAI between protein regions with different amino acid biases may result in spurious signals of statistical significance (see Cope et al. Biochimica et Biophysica Acta - Biomembranes 2018 for a clear example of this).

(2) The authors present numerous analyses using both ENC and mean CAI as a comparison to CAIS, helping give a sense of how CAIS corrects for some of the issues with these other metrics. I also enjoyed that they examined the previously unobserved relationship between codon usage bias and body size, which has bugged me ever since I saw Kessler and Dean 2014. The result comparing protein disorder to CAIS was particularly interesting and unexpected.

(3) The CAIS metric presented here is generally applicable to any species that has an annotated genome with protein-coding sequences.

Weaknesses:

(1) The main weakness of this work is that it lacks simulated data to confirm that it works as expected. This would be particularly useful for assessing the relationship between CAIS and the overall effect of protein structure disorder, which the authors acknowledge is an unexpected result. I think simulations could also allow the authors to assess how their metric performs in situations where mutation bias and natural selection act in the same direction vs. opposite directions. Additionally, although I appreciate their comparisons to ENC and mean CAI, a comparison to other popular codon metrics for calculating the overall adaptiveness of a genome (e.g. dos Reis et al.'s statistic, which is a function of tRNA Adaptation Index (tAI) and ENC) may be more appropriate. Even if results are similar to those obtained with dos Reis et al.'s statistic, CAIS has a noted advantage that it doesn't require identifying tRNA gene copy numbers or abundances, which I think are generally less readily available than genomic GC% and protein-coding sequences.

The authors mention the selection-mutation-drift equilibrium model, which underlies the basic ideas of this work (e.g. higher Ne results in stronger selection on codon usage), but a more in-depth framing of CAIS in terms of this model is not given. I think this could be valuable, particularly in addressing the question "are we really estimating what we think we're estimating?"

Let's take a closer look at the formulation for RSCUS. From here on out, subscripts will only be used to denote the codon $i$, and it will be assumed that we are only considering the case of genome-wide GC content for some species. With $O_i$ the observed frequency of codon $i$ and $E_i$ its expected frequency based on GC content,

$$\mathrm{RSCUS}_i = \frac{O_i}{E_i}$$

I think what the authors are attempting to do is "divide out" the effects of mutation bias (as given by $E_i$), such that only the effects of natural selection remain, i.e. deviations from the expected frequency based on mutation bias alone represent adaptive codon usage. Consider Gilchrist et al. MBE 2015, which says that the expected frequency of codon $i$ at selection-mutation-drift equilibrium in gene $g$ for an amino acid with $N_a$ synonymous codons is

$$f_{i,g} = \frac{e^{-\Delta M_i - \Delta\eta_i \phi_g}}{\sum_{j=1}^{N_a} e^{-\Delta M_j - \Delta\eta_j \phi_g}} \tag{1}$$

where $\Delta M$ is the mutation bias, $\Delta\eta$ is the strength of selection scaled by the strength of drift, and $\phi_g$ is the gene expression level of gene $g$. In this case, $\Delta M$ and $\Delta\eta$ reflect the strength and direction of mutation bias and natural selection relative to a reference codon, for which $\Delta M = \Delta\eta = 0$. Assuming the selection-mutation-drift equilibrium model is generally adequate to model the true codon usage patterns in a genome (as I do and I think the authors do, too), $f_{i,g}$ could be considered the expected observed frequency of codon $i$ in gene $g$, $E[O_i]$.

Let's re-write $E_i$ in the form of Gilchrist et al., such that it is a function of mutation bias $\Delta M$. For simplicity, we will consider just the two-codon case (codons NNA and NNG) and assume the amino acid sequence is fixed. Assuming GC% is at equilibrium, the terms $\mathrm{GC}$ and $1 - \mathrm{GC}$ can be written as

$$\mathrm{GC} = \frac{\mu_{AT \to GC}}{\mu_{AT \to GC} + \mu_{GC \to AT}}, \qquad 1 - \mathrm{GC} = \frac{\mu_{GC \to AT}}{\mu_{AT \to GC} + \mu_{GC \to AT}}$$

where $\mu_{x \to y}$ is the mutation rate from nucleotides $x$ to $y$. As described in Gilchrist et al. MBE 2015 and Shah and Gilchrist PNAS 2011, the mutation bias is $\Delta M_{NNA} = \log\left(\mu_{AT \to GC} / \mu_{GC \to AT}\right)$. This can be expressed in terms of the equilibrium GC content by recognizing that

$$\frac{\mu_{AT \to GC}}{\mu_{GC \to AT}} = \frac{\mathrm{GC}}{1 - \mathrm{GC}}, \qquad \text{so} \qquad \Delta M_{NNA} = \log\frac{\mathrm{GC}}{1 - \mathrm{GC}}$$

As we are assuming the amino acid sequence is fixed, the probability of observing a synonymous codon at an amino acid becomes just a Bernoulli process.

If we do this, then

$$E_{NNG} = \mathrm{GC} = \frac{1}{1 + e^{-\Delta M_{NNA}}}, \qquad E_{NNA} = 1 - \mathrm{GC} = \frac{e^{-\Delta M_{NNA}}}{1 + e^{-\Delta M_{NNA}}} \tag{6}$$

Recall that in the Gilchrist et al. framework, the reference codon has $\Delta M_{NNG} = 0$. Thus, we have recovered the Gilchrist et al. model from the formulation of $E_i$ under the assumption that natural selection has no impact on codon usage and codon NNG is the pre-defined reference codon. To see this, plug in 0 for $\Delta\eta$ in equation (1).

We can then calculate the expected RSCUS using equation (1) (using the notation $E[O_i]$) and equation (6) for the two-codon case. For simplicity, assume we are only considering a gene of average expression (defined as $\phi_g = 1$). Assume in this case that NNG is the reference codon ($\Delta M_{NNG} = \Delta\eta_{NNG} = 0$).

$$E[\mathrm{RSCUS}_{NNA}] = \frac{E[O_{NNA}]}{E_{NNA}} = \frac{e^{-\Delta\eta_{NNA}}\left(1 + e^{-\Delta M_{NNA}}\right)}{1 + e^{-\Delta M_{NNA} - \Delta\eta_{NNA}}}$$

This shows that the expected value of RSCUS for a two-codon amino acid is expected to increase as the strength of selection increases, which is desired. Note that $\Delta\eta$ in Gilchrist et al. is formulated in terms of selection against a codon relative to the reference, such that a negative value represents that a codon is favored relative to the reference. If $\Delta\eta_{NNA} = 0$ (i.e. selection does not favor either codon), then $E[\mathrm{RSCUS}_{NNA}] = 1$. Also note that the expected RSCUS does not remain independent of the mutation bias. This means that even if $\Delta\eta$ (i.e. the strength of natural selection) does not change between species, changes to the strength and direction of mutation bias across species could impact RSCUS. Assuming my math is right, I think one needs to be cautious when interpreting CAIS as representative of the differences in the efficiency of selection across species except under very particular circumstances. One such case could be when it is known that mutation bias varies little across the species of interest. Looking at the species used in this manuscript, most of them have a GC content ranging around 0.41, so I suspect their results are okay.
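The dependence of expected RSCUS on mutation bias can be checked numerically. The sketch below assumes the two-codon setup described here: NNG is the reference codon, the GC-based null expectation for NNA is exp(-dM) / (1 + exp(-dM)), and the selection-mutation-drift expectation is exp(-dM - dEta*phi) / (1 + exp(-dM - dEta*phi)). The function names and the chosen parameter values are illustrative assumptions, not part of the manuscript or its code.

```python
import math


def delta_m_from_gc(gc: float) -> float:
    """Mutation bias of NNA relative to NNG implied by equilibrium GC content."""
    return math.log(gc / (1 - gc))


def expected_rscus_two_codon(delta_m: float, delta_eta: float,
                             phi: float = 1.0) -> tuple[float, float]:
    """Return (E[RSCUS_NNA], E[RSCUS_NNG]) for a two-codon amino acid.

    delta_m and delta_eta are for codon NNA relative to the reference NNG;
    phi is the expression level (1.0 = gene of average expression).
    """
    # Null expectation from mutation bias alone (no selection).
    null_nna = math.exp(-delta_m) / (1 + math.exp(-delta_m))
    null_nng = 1 - null_nna
    # Expected observed frequencies at selection-mutation-drift equilibrium.
    w = math.exp(-delta_m - delta_eta * phi)
    obs_nna = w / (1 + w)
    obs_nng = 1 - obs_nna
    return obs_nna / null_nna, obs_nng / null_nng


if __name__ == "__main__":
    # Same selection (NNA favored, delta_eta = -1), different GC content.
    for gc in (0.35, 0.41, 0.50, 0.60):
        nna, nng = expected_rscus_two_codon(delta_m_from_gc(gc), -1.0)
        print(f"GC={gc:.2f}  E[RSCUS_NNA]={nna:.3f}  E[RSCUS_NNG]={nng:.3f}")
```

With selection held fixed, the printed values change with GC content, whereas setting `delta_eta` to 0 returns 1 for both codons at any GC. This is the behavior the caution above is about: a species comparison is cleanest when GC content varies little.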

Although I have not done so, I am sure this could be extended to the 4 and 6 codon amino acids.

Another minor weakness of this work is that although the method is generally applicable to any species with an annotated genome and the code is publicly available, the code itself contains hard-coded values for GC% and amino acid frequencies across the 118 vertebrates. The lack of a more flexible tool may make it difficult for less computationally-experienced researchers to take advantage of this method.
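As an illustration of what a more flexible entry point might look like, the sketch below computes GC content and amino acid frequencies directly from sequences supplied by the user instead of reading hard-coded values. It is a minimal sketch under stated assumptions: sequences are plain DNA strings, the standard genetic code applies, and coding sequences are in frame. The function names are hypothetical and are not taken from the authors' repository.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def gc_fraction(seqs):
    """Fraction of G or C among unambiguous bases (A, C, G, T) in the sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(seq.upper())
    gc = counts["G"] + counts["C"]
    acgt = gc + counts["A"] + counts["T"]
    return gc / acgt


def amino_acid_frequencies(cds_seqs):
    """Relative amino acid frequencies across in-frame coding sequences.

    Stop codons and codons containing ambiguous bases are skipped.
    """
    counts = Counter()
    for seq in cds_seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3])
            if aa is not None and aa != "*":
                counts[aa] += 1
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}
```

A caller would pass genomic sequences to `gc_fraction` and coding sequences to `amino_acid_frequencies`. Both functions assume non-empty input, and real use would also need FASTA parsing and handling of alternative genetic codes.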

Author Response

The primary concern of Reviewer 1 is that Ne might affect gBGC and hence GC, and this might act as a confounding effect. The reviewer suggests that we should investigate how gBGC (with GC presumably as its proxy) might affect CAIS, and to what extent any relationship here could explain the relationship between CAIS and body mass. We believe that we have already dealt with this both in Supplementary Figure S5A (where we regret having inserted the wrong figure panel, a mistake we will correct), and its PIC-corrected counterpart in S5B. These two panels show (or will show) that CAIS is not correlated with GC. Note that we expect our genomic-GC-based codon usage expectations to reflect unchecked gBGC in an average genomic region, independently of whether that species has high or low Ne. Our working model is that mutation biases, including but not limited to the strength of gBGC, vary among species, and that they rather than selection determine each species’ genome-wide %GC. By correcting for genome-wide %GC, our CAIS thus corrects for mutation bias, in order to isolate the effects of selection.

Reviewer 1 also suggests that we examine the relationship between gene expression and GC corrected RSCU, as we would expect codon adaptation to be stronger in more highly expressed genes, as was previously shown in the non-GC corrected CAI metric (Sharp et al 1987). Correlations with gene expression are outside the scope of the current work, which is focused on producing a single value of codon adaptation per species. It is indeed possible that our general approach could be useful in future work investigating differences among genes.

One key difference between our work and that of Galtier et al. 2018 is that our approach does not rely on identifying specific codon preferences per species. Our approach thus remains appropriate even for scenarios e.g. where different cell types, different environmental conditions, and/or different genes have different codon preferences (Gingold et al. 2014 https://doi.org/10.1016/j.cell.2014.08.011). At a high level, our results are in broad agreement with those of Galtier et al., 2018, who found that gBGC affected all animal species, regardless of Ne, and who like us, found that the degree of selection on codon usage depended on Ne. Through use of a more sensitive methodology, we believe we have expanded our ability to detect codon adaptation into animals of somewhat higher Ne than in previous work.

We thank Reviewer 2 for explicitly laying out the math that was implicit in our Figures 1 and 2. In our revisions, we will more clearly acknowledge that the per-site codon adaptation bias depicted in Figure 1 has limited sensitivity to s*Ne. We believe our approach worked despite this because the phenomenon is driven by what is shown in Figure 2. I.e., where Ne makes a difference is by determining the proteome-wide fraction of codons subject to significant codon adaptation, rather than by determining the strength of codon adaptation at any particular site or gene.

Simulated datasets would be great, but we think it a nice addition rather than must-have, in particular because we are skeptical about whether our understanding of all relevant processes is good enough such that simulations would add much to our more heuristic argument along the lines of Figure 2. E.g. we believe the complications documented by Gingold et al. 2014 cited above are pertinent, but incorporating them into simulations would require a complex set of assumptions.

In response to the final comment of reviewer 2, the reason that we hard-coded genome-wide %GC values is that we took them from the previous study of James et al. (2023) https://doi.org/10.1093/molbev/msad073. As summarized in the manuscript, genome-wide %GC was a byproduct of a scan conducted in that work, of all six reading frames across genic and intergenic sequences available from NCBI with access dates between May and July 2019. The code used in the current work to calculate the intergenic %GC, as well as that used to calculate amino acid frequencies, is located at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species. We agree that more user-friendly tools would be useful, but producing robust tools falls outside the scope of the current manuscript.
