A new codon adaptation metric predicts vertebrate body size and tendency to protein disorder

Catherine A. Weibel; Andrew L. Wheeler; Jennifer E. James; Sara M. Willis; Joanna Masel

doi:10.7554/eLife.87335.1

eLife assessment

In this intriguing study, the authors offer a new metric, Codon Adaptation Index of Species (CAIS), which allows one to determine the strength of selection and effective population sizes using data on GC content and amino acid composition. This could be of broad use in molecular evolution, as it could be applied across species. The study offers important findings that may aid in our ability to compute key population genomic parameters from genomic data, and the conclusions are based on solid evidence.

https://doi.org/10.7554/eLife.87335.1.sa2

Significance of findings

important: Findings that have theoretical or practical implications beyond a single subfield

Strength of evidence

solid: Methods, data and analyses broadly support the claims with only minor weaknesses

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

The nearly neutral theory of molecular evolution posits variation among species in the effectiveness of selection. In an idealized model, the census population size determines both the minimum magnitude of the selection coefficient required for deleterious variants to be reliably purged, and the amount of neutral diversity. Empirically, an “effective population size” is often estimated from the amount of putatively neutral genetic diversity, and is assumed to also capture a species’ effectiveness of selection. The degree to which selection maintains preferred codons has the potential to more directly quantify the effectiveness of selection. However, past metrics that compare codon bias across species are confounded by among-species variation in %GC content and/or amino acid composition. Here we propose a new Codon Adaptation Index of Species (CAIS) that corrects for both confounders. Unlike previous metrics of codon bias, CAIS yields the expected relationship with adult vertebrate body mass. We demonstrate the use of CAIS correlations to show that the protein domains of more highly adapted vertebrate species evolve higher intrinsic structural disorder.

Introduction

Species differ from each other in many ways, including mating system, ploidy, spatial distribution, life history, size, lifespan, and population size. These differences make the process of purifying selection more efficient in some species than others. Our understanding of both the causes and consequences of these differences is limited in part by the lack of a reliable metric with which to measure them. In the long term, the probability that a gene is fixed for one allele rather than another allele is given by the ratio of fixation and counter-fixation probabilities (Bulmer 1991). In an idealized population of constant population size and no selection at linked sites, a mutation-selection-drift model describes how this ratio of fixation probabilities depends on the census population size N (Kimura 1962), and hence gives the fraction of sites expected to be found in preferred vs. non-preferred states (Figure 1).

The effectiveness of selection, calculated as the long-term ratio of time spent in fixed deleterious: fixed beneficial allele states given symmetric mutation rates, is a function of the product Ns. Assuming a diploid Wright-Fisher population with s ≪ 1, the probability of fixation of a new mutation is written π(N, s), and the y-axis is calculated as π(N, −s)/(π(N, −s) + π(N, s)). s is held constant at a value of 0.001 and N is varied. Results for other small-magnitude values of s are superimposable.
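The y-axis quantity in this caption can be sketched numerically. The fixation probability π is not spelled out here, so the sketch assumes Kimura's (1962) diffusion approximation for a new mutation at initial frequency 1/(2N) in a diploid population, π(N, s) = (1 − e^(−2s))/(1 − e^(−4Ns)). That form, and the function names, are assumptions made for illustration rather than the authors' code.

```python
import math

def fixation_probability(n, s):
    """Assumed form: Kimura's diffusion approximation for a new mutation
    at initial frequency 1/(2N) in a diploid population of size n."""
    if s == 0.0:
        return 1.0 / (2.0 * n)  # neutral limit
    x = 4.0 * n * s
    if x < -700.0:
        return 0.0  # strongly deleterious: the probability underflows to zero
    # (1 - exp(-2s)) / (1 - exp(-4Ns)), using expm1 for accuracy at small s
    return math.expm1(-2.0 * s) / math.expm1(-x)

def fraction_deleterious(n, s):
    """pi(N, -s) / (pi(N, -s) + pi(N, s)): the long-term fraction of sites
    fixed for the non-preferred allele under symmetric mutation."""
    p_del = fixation_probability(n, -s)
    p_ben = fixation_probability(n, s)
    return p_del / (p_del + p_ben)
```

Under this assumed form the fraction reduces to 1/(1 + e^((4N − 2)s)), so it depends on N and s almost entirely through the product Ns: it stays near 0.5 while Ns is much smaller than 1 and falls toward zero once Ns is well above 1.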

This reasoning has been extended to real populations by positing that species have an “effective” population size, N_e (Ohta 1973). N_e is the census population size of an idealized population that reproduces a property of interest in the focal population. N_e is therefore not a single quantity per population, but instead depends on which property is of interest.

The amount of neutral polymorphism is the usual property used to empirically estimate N_e (Charlesworth 2009; Doyle et al. 2015; Lynch et al. 2016). However, the property of most relevance to nearly neutral theory is instead the inflection point s at which non-preferred alleles become common enough to matter (Figure 1), and hence the degree to which highly exquisite adaptation can be maintained in the face of ongoing mutation and genetic drift (Kimura 1962; Ohta 1972; Ohta 1992). While genetic diversity has been found to reflect some aspects of life history strategy (Romiguier et al. 2014), there remain concerns about whether neutral genetic diversity and the limits to weak selection always remain closely coupled. One source of concern is that the classic one-locus models that underpin N_e calculations assume that neutral diversity is shaped by genetic drift, but it can instead be shaped by some combination of background selection and hitchhiking (Masel 2011; Kern & Hahn 2018).

As a practical matter, N_e is usually calculated by taking some measure of the amount of putatively neutral (often synonymous) polymorphism segregating in a population, and dividing by that species’ mutation rate (Charlesworth 2009). As a result, N_e values are only available for species that have both polymorphism data and accurate mutation rate estimates, limiting their use. Worse, N_e is not a robust statistic. In the absence of a clear species definition, polymorphism is sometimes calculated across too broad a range of genomes, substantially inflating N_e (Daubin & Moran 2004). Similarly, a recent bottleneck or a poor sampling scheme will deflate estimated values of N_e, even if the degree of fine-tuned adaptation remains high. Transient hypermutation (Plotkin et al. 2006), which is common in microbes, causes further short-term inconsistencies in polymorphism levels.

An alternative approach to measure the efficiency of selection exploits codon usage bias, which is influenced by weak selection for factors such as translational speed and accuracy (Hershberg & Petrov 2008; Plotkin & Kudla 2011; Hunt et al. 2014). The degree of bias in synonymous codon usage that is driven by selective preference offers a more direct way to assess how effective selection is at the molecular level in a given species (Li 1987; Bulmer 1991; Akashi 1996; Subramanian 2008). Conveniently, it can be estimated from only a single genome, i.e., without polymorphism or mutation rate data for that species.

One commonly used metric, the Codon Adaptation Index (CAI) (Sharp & Li 1987; Sharp et al. 2010), takes the average of Relative Synonymous Codon Usage (RSCU) scores, which quantify how often a codon is used, relative to the codon that is most frequently used to encode that amino acid in that species. While this works well for comparing genes within the same species, it unfortunately means that the species-wide strength of codon bias appears in the normalizing denominator (see equation 4 below and Supplementary Figure 1A). Paradoxically, this can make more exquisitely adapted species have lower rather than higher species-averaged CAI scores (as we will later find in Figure 3B). A handful of publications (Jansen et al. 2003; Botzman & Margalit 2011; LaBella et al. 2019) have inappropriately used mean CAI as a metric of codon adaptation across species, presumably generating inverse results.

To compare species using CAI, it has been suggested that instead of taking a genome-wide average, one should consider a set of highly expressed reference genes (Sharp et al. 2005; Vicario et al. 2007; dos Reis & Wernisch 2008; Subramanian 2008). This approach assumes that the relative strength of selection on those reference genes (often a function of gene expression) remains approximately constant across the set of species considered (red boxes in Figure 2 indicating same quartiles). Its use also requires careful attention to the length of reference genes (Urrutia & Hurst 2001; Doherty & McInerney 2013).

More highly adapted species (bottom) have a higher proportion of their genes subject to effective selection on codon bias (blue area). The CAI, when used correctly, compares the intensity of selection in comparable regions of the right tail (red boxes). Our new metric captures both this, and differences in the proportion of the genome subject to substantial selection.

CAIS reflects the expected relationship between effectiveness of selection and body size, while CAI, ENC, and CAIS calculated using local GC content do not. Note that a high ENC value means more codons are being used in the genome of the given species, so that the given species is less codon adapted, while a high CAIS value means that a species is more codon adapted. Body size data are from the PanTHERIA database, originally in log₁₀(mass) in grams prior to PIC correction. Data are shown for 62 species in common between PanTHERIA and our own dataset of 118 vertebrate species that have both “Complete” genome sequence available for calculating %GC, and TimeTree divergence dates available for use in PIC correction. P-values shown are for Pearson’s correlation. Red line shows unweighted lm(y~x), with grey region indicating the 95% confidence interval.

If more highly adapted species have a higher proportion of their genes subject to effective selection on codon bias (Galtier et al. 2018), this is best detected using a proteome-wide approach (Figure 2). Averaging across the entire proteome provides robustness to shifts in the expression level of or strength of selection on particular genes. The proteome-wide average depends on the fraction of sites whose selection coefficients exceed the “drift barrier” for that particular species (Figure 2, blue threshold).

However, any whole-proteome comparisons must be corrected for differences in GC content among species. Genomic GC content is a major driver of codon usage differences among species. Species differ in their mutational bias with respect to GC and in their frequency of GC-biased gene conversion (Urrutia & Hurst 2001; Duret & Galtier 2009; Doherty & McInerney 2013; Figuet et al. 2014). Here, we control for both by simply replacing RSCU scores with the ratio of how frequently a codon is used to encode a particular amino acid relative to how frequently we expect it to be used given the %GC content. There are also GC-related differences in which codons are preferred (Hershberg & Petrov 2009); we expect our method to be more robust to these than CAI is.

An alternative metric, the Effective Number of Codons (ENC), originally quantified how far the codon usage of a sequence departs from equal usage of synonymous codons (Wright 1990), creating a complex relationship with GC content (Fuglsang 2008). The ENC was later modified to correct for GC content, in a similar manner to the new method we present here (Novembre 2002). However, a remaining issue with this modified ENC is that differences among species in amino acid composition might act as a confounding factor, even after controlling for GC content. Specifically, species that make more use of an amino acid for which there is stronger selection among codons (which is sometimes the case (Vicario et al. 2007)) would have higher codon bias, even if each amino acid, considered on its own, had identical codon bias irrespective of which species it is in. Neither ENC (Fuglsang 2004; Fuglsang 2008) nor the CAI (Sharp & Li 1987) adequately controls for differences in amino acid composition when applied across species. Despite early claims to the contrary (Wright 1990), this problem is not easy to fix for ENC (Fuglsang 2004; Fuglsang 2008).

Here, we extend the CAI so that it corrects for both GC and amino acid composition (see Methods) to create a new Codon Adaptation Index of Species (CAIS), and compare its performance to that of the ENC. The availability of complete genomes allows both metrics to be readily calculated without data on polymorphism or mutation rate, without selecting reference genes, and without concerns about demographic history. Our purpose is to find an accessible metric that can quantify the limits to weak selection important to nearly neutral theory; this differs from past evaluations focused on comparing different genes of the same species and recapitulating “ground truth” simulations thereof (Sun et al. 2012; Zhang et al. 2012; Liu et al. 2018). Our first measure of success is to produce the expected correlation with body size (used here as a proxy for census population size and hence the expected exquisiteness of adaptation). We use vertebrates as a challenging test case, given relatively weak selection on codon usage (Doherty & McInerney 2013; Galtier et al. 2018). Correlations between codon adaptation and body size across vertebrates have previously been elusive, which is surprising given that the expected differences in the efficiency of selection between X-chromosomes vs. autosomes and between high-recombination vs. low recombination regions are found (Kessler & Dean 2014). Our second measure of success is a demonstration of how our metric can be used to find novel correlations pointing to what else might be subject to weak selective preferences at the molecular level.

Results

Species with larger body mass have smaller CAIS

CAI scores averaged across the proteome of each species would paradoxically suggest that larger species have more effective selection (Figure 3B; note that we control for phylogenetic non-independence using Phylogenetic Independent Contrasts (PIC) (Felsenstein 1985)). This is because the behavior of the CAI is driven by the normalization term on its denominator, rather than by its numerator (Supplementary Figure 1A). This problem is removed in our improved CAIS measure (Supplementary Figure 1B), which yields the expected result that species with more codon adaptation according to the CAIS tend to be smaller (Figure 3A).

In contrast, the ENC is independent of adult vertebrate body mass (Figure 3C), consistent with past reports that ENC does not predict body mass or generation time in mammals (Kessler & Dean 2014) and that genome-wide ENC does not predict generation time in eukaryotes more broadly (Subramanian 2008). This difference between ENC and CAIS is surprising because the two methods are conceptually similar (see Methods). One difference is that CAIS corrects for amino acid composition differences across the dataset while ENC does not (see Methods). However, when we remove the amino acid composition correction from CAIS, we retain the relationship that smaller-bodied species tend to have more codon adaptation (Supplementary Figure 2), ruling this out as the cause of their different behaviors.

Another difference between ENC and CAIS is that CAIS is linear in observed codon frequencies (equation 7), while ENC has a quadratic term (equation 14). Relative to CAIS, the quadratic term in ENC magnifies the more extreme deviations of observed frequencies from the GC-informed expected frequencies. We speculate that this might swamp more informative differences. Other ENC-like approaches that use a quadratic term have also struggled to reliably detect selection (Urrutia & Hurst 2001; Kessler & Dean 2014).

Surprisingly, we achieved a successful version of CAIS by using genome-wide GC content to calculate expected codon use. Different parts of the genome have different GC contents (Bernardi 2000; Eyre-Walker & Hurst 2001; Lander et al. 2001), primarily because the extent to which GC-biased gene conversion increases GC content depends on the local rate of recombination (Galtier et al. 2001; Meunier & Duret 2004; Duret et al. 2006; Duret & Galtier 2009). We therefore also calculated a version of CAIS whose codon frequency expectations are based on local intergenic GC content. This performed worse (Figure 3D) than our simple use of genome-wide GC content (Figure 3A) with respect to the strength of correlation between CAIS and body size. We therefore use the simpler, genome-wide metric from here on. It is possible that local GC content evolves more rapidly than codon usage, and that genome-wide GC serves as an appropriately time-averaged proxy. It is also possible that the local non-coding sequences we used were too short (at 3000 bp or more), creating excessive noise that obscured the signal. Given the body size results, we advocate for the use of CAIS as a metric of species’ effectiveness of selection.

Proteins in better adapted species evolve more structural disorder

As an example of how correlations with CAIS can be used to identify weak selective preferences, we investigate protein intrinsic structural disorder (ISD). Disordered proteins are more likely to be harmful when overexpressed (Vavouri et al. 2009), and ISD is more abundant in eukaryotic than prokaryotic proteins (Schad et al. 2011; Xue et al. 2012; Basile et al. 2019), suggesting that low ISD might be favored by more effective selection.

However, compositional differences among proteomes might be driven by the recent birth of ISD-rich proteins in animals (James et al. 2021), and/or by differences among sequences in their subsequent tendency to proliferate into many different genes (James et al. 2022), rather than by differences in how a given protein sequence evolves as a function of the effectiveness of selection. To focus only on the latter, we use a linear mixed model, with each species having a fixed effect on ISD, while controlling for Pfam domain identity as a random effect. We note that once GC is controlled for, codon adaptation can be assessed similarly in intrinsically disordered vs. ordered proteins (Gossmann et al. 2012). Controlling for Pfam identity is supported, with standard deviation in ISD among Pfams estimated as 0.178 in comparison to residual standard deviation of 0.058 and a p-value on the significance of the Pfam random effect term of 3×10⁻¹³. We then ask whether the fixed species effects on ISD are correlated with CAIS.

Surprisingly, more exquisitely adapted species have more disordered protein domains (Figure 4). Results are similar using ENC instead of CAIS (Supplementary Figure 3), despite the weakness of ENC in predicting body size; the correlation coefficient drops only slightly from 0.54 to 0.5, and the p-value rises by less than 2 orders of magnitude.

Protein domains have higher ISD when found in more exquisitely adapted species. Each datapoint is one of 118 vertebrate species with “complete” intergenic genomic sequence available (allowing for %GC correction) and TimeTree divergence dates (allowing for PIC correction). “Effects” on ISD shown on the y-axis are fixed effects of species identity in our linear mixed model, after PIC correction. Red line shows unweighted lm(y~x), with grey region as 95% confidence interval.

Younger animal-specific protein domains have higher ISD (James et al. 2021). We therefore hypothesize that selection in favor of high ISD might be strongest in young domains, which might use more primitive methods to avoid aggregation (Foy et al. 2019; Bertram & Masel 2020). To test this, we analyze two subsets of our data: those that emerged prior to the last eukaryotic common ancestor (LECA), here referred to as “old” protein domains, and “young” protein domains that emerged after the divergence of animals and fungi from plants. Young and old domains both show a trend of increasing disorder with species’ adaptedness (Figure 5). Because PIC analysis makes the units incommensurable, we quantitatively compare the slopes of non-PIC-corrected weighted regressions. As predicted, the slope is stronger among young protein domains than among old domains (0.223 ± 0.021 versus 0.161 ± 0.014, respectively; units of proportion ISD/CAIS), although we are struck by the degree to which even ancient domains prefer higher ISD. We find similar relationships of modestly lower strength and statistical significance using ENC (Supplementary Figure 4).

More exquisitely adapted species have higher ISD in both ancient and recent protein domains. Age assignments are taken from James et al. (2021), with vertebrate protein domains that emerged prior to LECA classified as “old”, and vertebrate protein domains that emerged after the divergence of animals and fungi from plants as “young”. “Effects” on ISD shown on the y-axis are fixed effects of species identity in our linear mixed model. The same n=118 datapoints are shown as in Figures 3 and 4. P-values shown are for Pearson’s correlation. Red line shows lm(y~x), with grey region as 95% confidence interval, using a weighted model for non-PIC-corrected figures and unweighted for PIC-corrected figures. Weighted models make the quantitative estimation of slopes more accurate. Dot size on left indicates weight.

Discussion

Here we propose CAIS as a new metric to quantify how species differ in the effectiveness of selection. CAIS corrects codon bias both for total genomic GC content and for amino acid composition, to extract a measure of codon adaptation. Surprisingly, we obtained better results correcting for total GC content than for local GC content in the vicinity of each coding region. Unlike the ENC, CAIS is sensitive enough to show the expected relationship with adult vertebrate body mass.

CAIS can be used to detect which properties are under selection. To accomplish this, we note that when different properties are each causally affected by a species’ exquisiteness of adaptation, this will create a correlation between the properties. We use codon adaptation as a reference property, such that correlations with codon adaptation indicate selection. We operationalize this using a linear mixed model approach that controls for Pfam identity as a random effect, and then apply phylogenetic correction to the resulting fixed effects of species on ISD. This approach shows that the same Pfam domain tends to be more disordered when found in a well-adapted species (i.e. a species with a higher CAIS). This is true for both ancient and recently emerged protein domains, albeit of slightly larger magnitude for the latter.

It is important that no additional variable such as GC content creates a spurious correlation by affecting both CAIS and our property of interest. For this reason, we control for GC content during the construction of CAIS. GC content is the product of many different processes, most notably the genome-wide forces of mutation bias and gene conversion (Romiguier & Roux 2017), but potentially also selection on individual nucleotide substitutions that is hypothesized to favor higher %GC (Long et al. 2018). By controlling for %GC, we exclude all these forces from influencing CAIS. CAIS thus captures the extent of adaptation in codon bias, including translational speed, accuracy, and any intrinsic preference for GC over AT that is specific to coding regions. These remaining codon-adaptive factors do not create a statistically significant correlation between CAIS and GC (Supplementary Figures 5A, 5B), nor between ENC and GC (Supplementary Figures 5C, 5D), although CAI is correlated with GC (Supplementary Figures 5E, F).

A direct effect of ISD on fitness agrees with studies of random ORFs in Escherichia coli, where fitness was driven more by amino acid composition than %GC content, after controlling for the intrinsic correlation between the two (Kosinski et al. 2022). However, we have not ruled out a role for selection for higher %GC in ways that are general rather than restricted to coding regions, whether in shaping mutational biases (Smith & Eyre-Walker 2001; Hershberg & Petrov 2009; Hildebrand et al. 2010; Novoa et al. 2019; Forcelloni & Giansanti 2020) or the extent of gene conversion, or even at the single nucleotide level in a manner shared between coding regions and intergenic regions (Long et al. 2018).

A more complex metric could control for more than just GC content and amino acid frequencies. First vs. second vs. third codon positions have different nucleotide usage on average, but while correcting for this might be useful for comparing genes (Zhang et al. 2012), the issue in correcting for it while comparing species is that one might be removing the effect of interest. Similarly, while it might be useful to control for dinucleotide and trinucleotide frequencies (Brbić et al. 2015), to avoid circularity these would need to be taken from intergenic sequences, with care needed to avoid influence from unannotated protein-coding genes or even highly decayed pseudogenes.

Note that if a species were to experience a sudden reduction in population size, e.g. due to habitat loss, leading to less effective selection, it would take some time for CAIS to adjust. CAIS thus represents a relatively long-term historical pattern of adaptation. The timescales setting neutral polymorphism based N_e estimates are likely shorter (Gossmann et al. 2012). It is possible that the reason that we obtained better results when we control for genome-wide GC content than when we control for local GC content is also that codon adaptation adjusts slowly relative to the timescale of fluctuations in local GC content.

Here we developed a new metric of species adaptedness at the codon level, capable of quantifying degrees of codon adaptation even among vertebrates. We chose vertebrates partly due to the abundance of suitable data, and partly as a stringent test case, given past studies finding limited evidence for codon adaptation (Kessler & Dean 2014). We restricted our analysis to only the best annotated genomes, in part to ensure the quality of intergenic %GC estimates, and in part limited by the feasibility of running linear mixed models with six million data points. The phylogenetic tree is well resolved for vertebrate species, with an overrepresentation of mammalian species. Despite the focus on vertebrates, we see a remarkably strong signal for a subtle codon adaptation effect across closely related species, to the point where we can comfortably detect the ISD signal across subsets of domains.

An alternative approach to measuring effective population size focuses on d_N/d_S, the ratio of nonsynonymous to synonymous substitutions (Weber et al. 2014; Lefébure et al. 2017; Payne & Alvarez-Ponce 2018; Lin et al. 2019; Luzuriaga-Neira & Alvarez-Ponce 2022). A potential issue with this approach is that it assumes that differences among species are driven by differences in d_N and not by differences in d_S. The success of our codon adaptation metric suggests that this assumption does not hold across animals, because the proportion of synonymous substitutions that is effectively neutral varies across species. This may explain why d_N/d_S is not consistently related to life history traits across different taxonomic groups (Figuet et al. 2016).

Our finding that weak selection prefers higher ISD was unexpected, given that lower-N_e eukaryotes have more disordered proteins than higher-N_e prokaryotes (Ahrens et al. 2017; Basile et al. 2019). However, this can be reconciled in a model in which highly disordered sequences are less likely to be found in high-N_e species, but the sequences that are present tend to have slightly higher disorder than their low-N_e homologs. High ISD might help mitigate the trade-off between affinity and specificity (Dunker et al. 1998; Huang & Liu 2013; Lazar et al. 2022); non-specific interactions might be short-lived due to the high entropy associated with disorder, which specific interactions are robust to.

Our new CAIS metric more directly quantifies how species vary in their exquisiteness of adaptation than estimates of effective population size that are based on neutral polymorphism. It can also be estimated for far more species because it does not require polymorphism or mutation rate data, but only a single complete genome. We therefore expect CAIS to have many uses as a new tool for exploring nearly neutral theory, of which our worked example of ISD could be just the start.

Methods

Species

Pfam sequences and IUPRED2 predictions of Intrinsic Structural Disorder (ISD) were taken from James et al. (2021), who studied species marked as “Complete” in the GOLD database, with divergence dates available in TimeTree (Kumar et al. 2017). James et al. (2021) applied a variety of quality controls to exclude contaminants from the set of Pfams and assign accurate dates of Pfam emergence. Pfams that emerged prior to LECA are classified here as “old”, and Pfams that emerged after the divergence of animals and fungi from plants are classified as “young”, following annotation by James et al. (2021).

Codon Adaptation Index

Sharp and Li (1987) quantified codon bias through the Codon Adaptation Index (CAI), a normalized geometric mean of synonymous codon usage bias across sites, excluding stop and start codons. We modify this to calculate CAI including stop and start codons, because of documented preferences among stop codons in mammals (Wangen & Green 2020). While the CAI is usually used to compare genes within a species, among-species comparisons can be made using a reference set of genes that are highly expressed in the yeast genome (Sharp & Li 1987). Each codon i is assigned a Relative Synonymous Codon Usage value:

$$\mathrm{RSCU}_i = \frac{N_i}{\sum_{j=1}^{n_a} N_j} \tag{1}$$

where N_i denotes the number of times that codon i is used, and the denominator sums over all n_a codons that code for that specific amino acid. RSCU values are normalized to produce a relative adaptiveness value w_i for each codon, relative to the best adapted codon for that amino acid:

$$w_i = \frac{\mathrm{RSCU}_i}{\max_{j \in \{1, \dots, n_a\}} \mathrm{RSCU}_j} \tag{2}$$

Let L be the number of codons across all protein-coding sequences considered. Then

$$\mathrm{CAI} = \left( \prod_{i=1}^{L} w_i \right)^{1/L} \tag{3}$$

To understand the effects of normalization, it is useful to rewrite this as:

$$\mathrm{CAI} = \frac{\left( \prod_{i=1}^{L} \mathrm{RSCU}_i \right)^{1/L}}{\left( \prod_{i=1}^{L} \max_{j} \mathrm{RSCU}_j \right)^{1/L}} = \frac{\mathrm{CAI}_{\mathrm{raw}}}{\mathrm{CAI}_{\max}} \tag{4}$$

where CAI_raw is the geometric mean of the “unnormalized” or observed synonymous codon usages, and CAI_max is the maximum possible CAI given the observed codon frequencies.
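The CAI definitions above can be illustrated with a toy calculation. This is a minimal sketch, not the authors' implementation: the two-amino-acid codon table and the codon counts are hypothetical, and RSCU is taken as each codon's share among its synonyms, which is an assumption about the exact normalization (the relative adaptiveness w does not depend on that choice).

```python
import math
from collections import Counter

# Hypothetical two-amino-acid codon table, for illustration only.
SYNONYMS = {"K": ["AAA", "AAG"], "D": ["GAT", "GAC"]}

def rscu(counts):
    """Share of each codon among the synonymous codons of its amino acid."""
    values = {}
    for codons in SYNONYMS.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the sequences considered
        for c in codons:
            values[c] = counts[c] / total
    return values

def cai(counts):
    """Return (CAI, CAI_raw, CAI_max) as geometric means over all codon sites."""
    r = rscu(counts)
    best = {}
    for codons in SYNONYMS.values():
        present = [r[c] for c in codons if c in r]
        for c in codons:
            if c in r:
                best[c] = max(present)  # RSCU of the most used synonym
    n_sites = sum(counts[c] for c in r)
    # Codons with zero counts contribute no sites, so they are skipped.
    log_raw = sum(counts[c] * math.log(r[c]) for c in r if counts[c]) / n_sites
    log_max = sum(counts[c] * math.log(best[c]) for c in r if counts[c]) / n_sites
    cai_raw, cai_max = math.exp(log_raw), math.exp(log_max)
    return cai_raw / cai_max, cai_raw, cai_max

# Two hypothetical genomes that differ only in the strength of bias for lysine.
weak = Counter({"AAA": 45, "AAG": 55, "GAT": 50, "GAC": 50})
strong = Counter({"AAA": 10, "AAG": 90, "GAT": 50, "GAC": 50})
```

In this toy case, `cai(strong)` has a larger CAI_max than `cai(weak)` and a smaller overall CAI, which is the kind of inversion that can arise when the species-wide strength of bias sits in the normalizing denominator.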

GC Content

We calculated total %GC content (intergenic and genic) during a scan of all six reading frames across genic and intergenic sequences available from NCBI with access dates between May and July 2019 (described in James et al. (2022)). Of the 170 vertebrates meeting the quality criteria of James et al. (2021), 118 had annotated intergenic sequences within NCBI, so we restricted the dataset further to keep only the 118 species for which total GC content was available.

Codon Adaptation Index of Species (CAIS)

Controlling for GC bias in Synonymous Codon Usage

Consider a sequence region r within species s where each nucleotide has an expected probability of being G or C = g_r. For our main analysis, we consider just one region r encompassing the entire genome of a species s. In a secondary analysis, we break the genome up and use local values of g_r in the non-coding regions within and surrounding a gene or set of overlapping genes. To annotate the boundaries of these local regions, we first selected 1500 base pairs flanking each side of every coding sequence identified by NCBI annotations. Coding sequence annotations are broken up according to exon by NCBI. When coding sequences of the same gene did not fall within 3000 base pairs of each other, they were treated as different regions. When two coding sequences, whether from the same gene or from different genes, had overlapping 1500 bp catchment areas, we merged them together. g_r was then calculated based on the non-coding sites within each region, including both genic regions such as promoters and non-genic regions such as introns and intergenic sequences.
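The construction of local regions described above amounts to padding each coding interval and merging overlapping catchment areas. The following is a minimal sketch under stated assumptions: coordinates are 0-based and half-open on a single chromosome or scaffold, catchments that merely touch are merged (the text does not say how that boundary case was handled), and the function names are hypothetical.

```python
FLANK = 1500  # bp of flanking sequence on each side of a coding sequence

def merge_catchments(cds_intervals, flank=FLANK):
    """Pad each (start, end) coding interval by `flank` and merge overlaps."""
    padded = sorted((max(0, start - flank), end + flank)
                    for start, end in cds_intervals)
    regions = []
    for start, end in padded:
        if regions and start <= regions[-1][1]:
            # Overlapping (or touching, by assumption) catchments: extend.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]

def noncoding_gc(sequence, region, cds_intervals):
    """Fraction of G or C among the non-coding A/C/G/T sites of one region."""
    start, end = region
    end = min(end, len(sequence))
    coding = set()
    for cds_start, cds_end in cds_intervals:
        coding.update(range(max(cds_start, start), min(cds_end, end)))
    gc = at = 0
    for pos in range(start, end):
        if pos in coding:
            continue  # coding sites do not contribute to g_r
        base = sequence[pos].upper()
        if base in "GC":
            gc += 1
        elif base in "AT":
            at += 1  # ambiguous bases such as N are ignored
    return gc / (gc + at) if gc + at else float("nan")
```

For example, coding intervals at (2000, 2500) and (4000, 4500) lie within 3000 bp of each other and collapse into one region, while a third at (10000, 10500) forms its own region.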

With no bias between C vs. G, nor between A vs. T, nor patterns beyond the overall composition taken one nucleotide at a time, the expected probability of seeing codon i in a triplet within r is

where k_GC + k_AT = 3 total positions in codon i. The expected probability that amino acid a in region r is encoded by codon i is

The Relative Synonymous Codon Usage (RSCU) value used by the CAI measures the degree to which a codon’s relative frequency differs from the null expectation that all synonymous codons are used equally. Using equations 5 and 6, we replace this by one or a set of Relative Synonymous Codon Usage of Species (RSCUS) values for each region:

where O_i is the observed frequency with which amino acid a is encoded by codon i within region r of species s.
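The null model in this subsection can be sketched in Python as follows. This is an illustration rather than the authors' script. The function names are hypothetical. The null probability of a codon is written as (g/2)^k_GC × ((1 − g)/2)^k_AT, on the assumption that G and C are equally likely given a G/C site, and likewise A and T. RSCUS is assumed to be the ratio of a codon's observed share within its amino acid to its expected share under that null.

```python
def codon_null_probability(codon, g):
    """Probability of a specific codon when each site is G or C with probability g."""
    k_gc = sum(base in "GC" for base in codon)
    k_at = 3 - k_gc
    return (g / 2) ** k_gc * ((1 - g) / 2) ** k_at

def expected_synonymous_share(synonymous_codons, g):
    """Expected share of each codon among the codons encoding one amino acid."""
    probs = {c: codon_null_probability(c, g) for c in synonymous_codons}
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

def rscus(observed_counts, g):
    """Observed / expected share for the codons of ONE amino acid.

    observed_counts: dict mapping each synonymous codon to its count.
    """
    n = sum(observed_counts.values())
    if n == 0:
        raise ValueError("amino acid not observed")
    expected = expected_synonymous_share(list(observed_counts), g)
    return {c: (observed_counts[c] / n) / expected[c] for c in observed_counts}
```

For a two-codon amino acid such as lysine (AAA, AAG), the expected share of the G-ending codon under this null is simply g, so values of RSCUS above 1 for AAG indicate more use of AAG than GC content alone would predict.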

Controlling for Amino Acid Composition

Some amino acids may be more intrinsically prone to codon bias. We want a metric that quantifies effectiveness of selection (not amino acid frequency), so we re-weight CAIS on the basis of a standardized amino acid composition, to remove the effect of variation among species in amino acid frequencies.
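One way to picture such a re-weighting is sketched below. It assumes that each species' codon frequencies are rescaled, one amino acid at a time, so that the total for every amino acid matches a standard composition while the ratios among synonymous codons are left unchanged. The function and argument names are hypothetical, and the sketch is not taken from the authors' implementation.

```python
def reweight_codon_frequencies(codon_freqs, codon_to_aa, standard_aa_freqs):
    """Rescale codon frequencies so amino acid totals match a standard composition.

    codon_freqs: dict codon -> frequency in one species.
    codon_to_aa: dict codon -> amino acid label.
    standard_aa_freqs: dict amino acid -> standardized frequency.
    """
    # Total frequency of each amino acid in this species.
    aa_totals = {}
    for codon, f in codon_freqs.items():
        aa = codon_to_aa[codon]
        aa_totals[aa] = aa_totals.get(aa, 0.0) + f

    reweighted = {}
    for codon, f in codon_freqs.items():
        aa = codon_to_aa[codon]
        total = aa_totals[aa]
        # One scale factor per amino acid keeps synonymous ratios intact.
        scale = standard_aa_freqs[aa] / total if total > 0 else 0.0
        reweighted[codon] = f * scale
    return reweighted
```

After rescaling, two species with identical synonymous codon preferences but different amino acid usage yield identical re-weighted codon frequencies.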

Let F_a be the frequency of amino acid a across the entire dataset of 118 vertebrate genomes, and f_i,s the frequency of codon i in a given species s. A candidate CAIS that controls for global GC content but not for amino acid composition can be written as

We want to re-weight f_i,s on the basis of F_a to ensure that differences in amino acid frequencies among species do not affect CAIS, while preserving relative codon frequencies for the same amino acid. We do this by solving for a_a,s so that

We then define to obtain an amino acid frequency adjusted CAIS:

For convenient implementation in code, we used the following form:

The F_a values for our species set are at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species/blob/main/CAIS_ENC_calculation/Total_amino_acid_frequency_vertebrates.txt. Similarly, CAIS corrected for local intergenic GC content but not species-wide amino acid composition is

where n_i,r is the number of times codon i appears in region r of species s, G is the number of regions, and is the total number of codons in the genome. Rewritten for greater computational ease:

Novembre’s Effective Number of Codons (ENC) controlled for total GC Content

Following from equations 5 and 6, the value representing the deviation of the frequencies of the codons for amino acid a from null expectations is

where N_a is the total number of times that amino acid a appears. Novembre (2002) defines the corrected “F value” of amino acid a as

and the Effective Number of Codons as

where each is the average of the “F values” for amino acids with n_a synonymous codons. Past measures of ENC do not contain stop or start codons (Wright 1990; Novembre 2002; Fuglsang 2004), but as we did for CAI and CAIS above, we include stop codons as an “amino acid” and therefore amend (19) to

Statistical Analysis

All statistical modelling was done in R 3.5.1. Scripts for calculating CAI and CAIS were written in Python 3.7.

Phylogenetic Independent Contrasts

Spurious phylogenetically confounded correlations can occur when closely related species share similar values of both metrics. One danger of such pseudoreplication is Simpson’s paradox, where negative slopes within taxonomic groups and a positive slope among them might combine to yield an overall positive slope. We avoid pseudoreplication by using Phylogenetic Independent Contrasts (PIC) (Felsenstein 1985) to assess correlation. PIC analysis was done using the R package “ape” (Paradis & Schliep 2019).
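The analysis itself used the R package "ape". Purely to illustrate what a contrast is, the following Python sketch implements Felsenstein's (1985) recursion for a fully bifurcating tree. The nested-tuple tree encoding and the function name `pic` are assumptions of this sketch and do not reflect ape's interface.

```python
import math

def pic(node, trait):
    """Compute phylogenetic independent contrasts for one trait.

    node: (label, branch_length), where label is either a tip name (str)
    or a pair (left_node, right_node). trait: dict tip name -> value.
    Returns (node_value, adjusted_branch_length, list_of_contrasts).
    """
    label, length = node
    if isinstance(label, str):  # tip: use the observed value directly
        return trait[label], length, []

    left, right = label
    x1, v1, c1 = pic(left, trait)
    x2, v2, c2 = pic(right, trait)

    # Standardized contrast between the two daughter lineages.
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Ancestral value: average weighted by inverse branch length.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the branch below this node to reflect estimation error.
    adjusted = length + v1 * v2 / (v1 + v2)
    return value, adjusted, c1 + c2 + [contrast]
```

A tree with n tips yields n − 1 contrasts per trait. Contrasts computed for two traits on the same tree can then be correlated with each other in place of the raw species values.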

Data Availability

All data and code underlying this article are available in the public repository at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species.

Acknowledgements

This work was supported by the National Institutes of Health (R01 GM104040 and T32 GM132008), the John Templeton Foundation (60814), the Arnold and Mabel Beckman Foundation Scholars Program, the Western Alliance to Expand Student Opportunities (WAESO) Louis Stokes Alliance for Minority Participation (LSAMP) National Science Foundation (NSF) Cooperative Agreement (HRD-1101728), and UA/NASA Space Grant Undergraduate Research Internship program. We thank Luke Kosinski, Hanon McShea, David Liberles, and Sawsan Wehbi for helpful discussions, Paul Nelson for providing the genome-wide GC contents, and the University of Arizona Undergraduate Biology Research Program for training.

Codon Adaptation Index is not appropriate for species-wide effectiveness of selection measurements. Each CAI value shown is averaged over an entire species’ proteome. A) The value of CAI is driven by its normalizing denominator term, CAI_max. B) As a result, CAI is inversely proportional to CAIS. Each datapoint is one of 118 vertebrate species with “Complete” intergenic genomic sequence available (allowing for %GC correction) and TimeTree divergence dates (allowing for PIC correction). P-values shown are for Pearson’s correlation.

CAIS without correction for amino acid composition still reflects the expected relationship between effectiveness of selection and body size. Body size data are from the PanTHERIA database, originally in log₁₀(mass) in grams; data shown for 62 species in common between PanTHERIA and our own dataset of 118 vertebrate species with a “Complete” intergenic genomic sequence and TimeTree divergence dates. P-value is shown for Pearson’s correlation. Red line shows unweighted lm(y∼x) with grey region as 95% confidence interval.

Our Figure 4 finding that more exquisitely adapted species have protein domains with higher ISD is confirmed by ENC. The same n=118 datapoints are shown as in Figure 4. P-values shown are for Pearson’s correlation. Red line shows unweighted lm(y∼x) with grey region as 95% confidence interval.

More exquisitely adapted species have higher ISD in both ancient and recent protein domains, as confirmed by ENC. Protein domains that emerged prior to LECA are identified as “old”, and protein domains that emerged after the divergence of animals and fungi from plants and found in vertebrates are identified as “young”. Age assignments are taken from James et al. (2021). The same n=118 datapoints are shown as in Figure 3. P-values shown are for Pearson’s correlation. Red line shows lm(y∼x), with grey region as 95% confidence interval, using a weighted model for non-PIC-corrected figures and unweighted for PIC-corrected figures.

Effective Number of Codons (ENC) and Codon Adaptation Index of Species (CAIS) are uncorrelated with total genomic GC Content. Each datapoint is one of 118 vertebrate species with “Complete” intergenic genomic sequence and TimeTree divergence dates. Statistical significance is only supported if it survives controlling for phylogenetic confounding via Phylogenetic Independent Contrasts (PIC). P-values shown are for Pearson’s correlation.

References

1. Ahrens JB
2. Nunez-Castilla J
3. Siltberg-Liberles J.
2017. Evolution of intrinsic disorder in eukaryotic proteins. Cellular and Molecular Life Sciences 74:3163–3174
1. Akashi H.
1996. Molecular Evolution Between Drosophila melanogaster and D. simulans: Reduced Codon Bias, Faster Rates of Amino Acid Substitution, and Larger Proteins in D. melanogaster. Genetics 144:1297–1307
1. Basile W
2. Salvatore M
3. Bassot C
4. Elofsson A.
2019. Why do eukaryotic proteins contain more intrinsically disordered regions? PLoS Computational Biology 15:e1007186
1. Bernardi G.
2000. Isochores and the evolutionary genomics of vertebrates. Gene 241:3–17
1. Bertram J
2. Masel J.
2020. Evolution rapidly optimizes stability and aggregation in lattice proteins despite pervasive landscape valleys and mazes. Genetics 214:1047–1057
1. Botzman M
2. Margalit H.
2011. Variation in global codon usage bias among prokaryotic organisms is associated with their lifestyles. Genome Biology 12:R109
1. Brbić M
2. Warnecke T
3. Kriško A
4. Supek F.
2015. Global Shifts in Genome and Proteome Composition Are Very Tightly Coupled. Genome Biology and Evolution 7:1519–1532
1. Bulmer M.
1991. The selection-mutation-drift theory of synonymous codon usage. Genetics 129:897–907
1. Charlesworth B.
2009. Effective population size and patterns of molecular evolution and variation. Nature Reviews Genetics 10:195–205
1. Daubin V
2. Moran NA
2004. Comment on “The Origins of Genome Complexity”. Science 306:978
1. Doherty A
2. McInerney JO
2013. Translational Selection Frequently Overcomes Genetic Drift in Shaping Synonymous Codon Usage Patterns in Vertebrates. Molecular Biology and Evolution 30:2263–2267
1. dos Reis M.
2. Wernisch L.
2008. Estimating Translational Selection in Eukaryotic Genomes. Molecular Biology and Evolution 26:451–461
1. Doyle JM
2. Hacking CC
3. Willoughby JR
4. Sundaram M
5. DeWoody JA
2015. Mammalian genetic diversity as a function of habitat, body size, trophic class, and conservation status. Journal of Mammalogy 96:564–572
1. Dunker AK
2. Garner E
3. Guilliot S
4. Romero P
5. Albrecht K
6. Hart J
7. Obradovic Z
8. Kissinger C
9. Villafranca JE
1998. Protein disorder and the evolution of molecular recognition: theory, predictions and observations. Pac Symp Biocomput :473–484
1. Duret L
2. Eyre-Walker A
3. Galtier N.
2006. A new perspective on isochore evolution. Gene 385:71–74
1. Duret L
2. Galtier N.
2009. Biased Gene Conversion and the Evolution of Mammalian Genomic Landscapes. Annual Review of Genomics and Human Genetics 10:285–311
1. Eyre-Walker A
2. Hurst LD
2001. The evolution of isochores. Nature Reviews Genetics 2:549–555
1. Felsenstein J.
1985. Phylogenies and the Comparative Method. The American Naturalist 125:1–15
1. Figuet E
2. Ballenghien M
3. Romiguier J
4. Galtier N.
2014. Biased Gene Conversion and GC-Content Evolution in the Coding Sequences of Reptiles and Vertebrates. Genome Biology and Evolution 7:240–250
1. Figuet E
2. Nabholz B
3. Bonneau M
4. Mas Carrio E.
5. Nadachowska-Brzyska K
6. Ellegren H
7. Galtier N.
2016. Life History Traits, Protein Evolution, and the Nearly Neutral Theory in Amniotes. Molecular Biology and Evolution 33:1517–1527
1. Forcelloni S
2. Giansanti A.
2020. Evolutionary Forces and Codon Bias in Different Flavors of Intrinsic Disorder in the Human Proteome. Journal of Molecular Evolution 88:164–178
1. Foy SG
2. Wilson BA
3. Bertram J
4. Cordes MHJ
5. Masel J.
2019. A Shift in Aggregation Avoidance Strategy Marks a Long-Term Direction to Protein Evolution. Genetics 211:1345–1355
1. Fuglsang A.
2004. The ‘effective number of codons’ revisited. Biochemical and Biophysical Research Communications 317:957–964
1. Fuglsang A.
2008. Impact of bias discrepancy and amino acid usage on estimates of the effective number of codons used in a gene, and a test for selection on codon usage. Gene 410:82–88
1. Galtier N
2. Piganeau G
3. Mouchiroud D
4. Duret L.
2001. GC-Content Evolution in Mammalian Genomes: The Biased Gene Conversion Hypothesis. Genetics 159:907–911
1. Galtier N
2. Roux C
3. Rousselle M
4. Romiguier J
5. Figuet E
6. Glémin S
7. Bierne N
8. Duret L.
2018. Codon Usage Bias in Animals: Disentangling the Effects of Natural Selection, Effective Population Size, and GC-Biased Gene Conversion. Molecular Biology and Evolution 35:1092–1103
1. Gossmann TI
2. Keightley PD
3. Eyre-Walker A.
2012. The Effect of Variation in the Effective Population Size on the Rate of Adaptive Molecular Evolution in Eukaryotes. Genome Biology and Evolution 4:658–667
1. Hershberg R
2. Petrov DA
2008. Selection on Codon Bias. Annual Review of Genetics 42:287–299
1. Hershberg R
2. Petrov DA
2009. General Rules for Optimal Codon Choice. PLoS Genetics 5:e1000556
1. Hildebrand F
2. Meyer A
3. Eyre-Walker A.
2010. Evidence of Selection upon Genomic GC-Content in Bacteria. PLoS Genetics 6:e1001107
1. Huang Y
2. Liu Z.
2013. Do Intrinsically Disordered Proteins Possess High Specificity in Protein–Protein Interactions? Chemistry-A European Journal 19:4462–4467
1. Hunt RC
2. Simhadri VL
3. Iandoli M
4. Sauna ZE
5. Kimchi-Sarfaty C.
2014. Exposing synonymous mutations. Trends In Genetics 30:308–321
1. James J
2. Nelson P
3. Masel J.
2022. Differential retention of Pfam domains creates long-term evolutionary trends. bioRxiv:2022.10.27.514087
1. James JE
2. Willis SM
3. Nelson PG
4. Weibel C
5. Kosinski LJ
6. Masel J.
2021. Universal and taxon-specific trends in protein sequences as a function of age. eLife 10:e57347
1. Jansen R
2. Bussemaker HJ
3. Gerstein M.
2003. Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Research 31:2242–2251
1. Kern AD
2. Hahn MW
2018. The Neutral Theory in Light of Natural Selection. Molecular Biology and Evolution 35:1366–1371
1. Kessler MD
2. Dean MD
2014. Effective population size does not predict codon usage bias in mammals. Ecology and Evolution 4:3887–3900
1. Kimura M.
1962. On the probability of fixation of mutant genes in a population. Genetics 47:713–719
1. Kosinski LJ
2. Aviles NR
3. Gomez K
4. Masel J.
2022. Random Peptides Rich in Small and Disorder-Promoting Amino Acids Are Less Likely to Be Harmful. Genome Biology and Evolution 14:evac085
1. Kumar S
2. Stecher G
3. Suleski M
4. Hedges SB
2017. TimeTree: A Resource for timelines, timetrees, and divergence times. Molecular Biology and Evolution 34:1812–1819
1. LaBella AL
2. Opulente DA
3. Steenwyk JL
4. Hittinger CT
5. Rokas A.
2019. Variation and selection on codon usage bias across an entire subphylum. PLoS Genetics 15:e1008304
1. Lander ES
2. Linton LM
3. Birren B
4. Nusbaum C
5. Zody MC
6. Baldwin J
7. Devon K
8. Dewar K
9. Doyle M
10. FitzHugh W
11. Funke R
12. Gage D
13. Harris K
14. Heaford A
15. Howland J
16. Kann L
17. Lehoczky J
18. LeVine R
19. McEwan P
20. McKernan K
21. Meldrim J
22. Mesirov JP
23. Miranda C
24. Morris W
25. Naylor J
26. Raymond C
27. Rosetti M
28. Santos R
29. Sheridan A
30. Sougnez C
31. Stange-Thomann N
32. Stojanovic N
33. Subramanian A
34. Wyman D
35. Rogers J
36. Sulston J
37. Ainscough R
38. Beck S
39. Bentley D
40. Burton J
41. Clee C
42. Carter N
43. Coulson A
44. Deadman R
45. Deloukas P
46. Dunham A
47. Dunham I
48. Durbin R
49. French L
50. Grafham D
51. Gregory S
52. Hubbard T
53. Humphray S
54. Hunt A
55. Jones M
56. Lloyd C
57. McMurray A
58. Matthews L
59. Mercer S
60. Milne S
61. Mullikin JC
62. Mungall A
63. Plumb R
64. Ross M
65. Shownkeen R
66. Sims S
67. Waterston RH
68. Wilson RK
69. Hillier LW
70. McPherson JD
71. Marra MA
72. Mardis ER
73. Fulton LA
74. Chinwalla AT
75. Pepin KH
76. Gish WR
77. Chissoe SL
78. Wendl MC
79. Delehaunty KD
80. Miner TL
81. Delehaunty A
82. Kramer JB
83. Cook LL
84. Fulton RS
85. Johnson DL
86. Minx PJ
87. Clifton SW
88. Hawkins T
89. Branscomb E
90. Predki P
91. Richardson P
92. Wenning S
93. Slezak T
94. Doggett N
95. Cheng J-F
96. Olsen A
97. Lucas S
98. Elkin C
99. Uberbacher E
100. Frazier M
101. Gibbs RA
102. Muzny DM
103. Scherer SE
104. Bouck JB
105. Sodergren EJ
106. Worley KC
107. Rives CM
108. Gorrell JH
109. Metzker ML
110. Naylor SL
111. Kucherlapati RS
112. Nelson DL
113. Weinstock GM
114. Sakaki Y
115. Fujiyama A
116. Hattori M
117. Yada T
118. Toyoda A
119. Itoh T
120. Kawagoe C
121. Watanabe H
122. Totoki Y
123. Taylor T
124. Weissenbach J
125. Heilig R
126. Saurin W
127. Artiguenave F
128. Brottier P
129. Bruls T
130. Pelletier E
131. Robert C
132. Wincker P
133. Rosenthal A
134. Platzer M
135. Nyakatura G
136. Taudien S
137. Rump A
138. Smith DR
139. Doucette-Stamm L
140. Rubenfield M
141. Weinstock K
142. Lee HM
143. Dubois J
144. Yang H
145. Yu J
146. Wang J
147. Huang G
148. Gu J
149. Hood L
150. Rowen L
151. Madan A
152. Qin S
153. Davis RW
154. Federspiel NA
155. Abola AP
156. Proctor MJ
157. Roe BA
158. Chen F
159. Pan H
160. Ramser J
161. Lehrach H
162. Reinhardt R
163. McCombie WR
164. de la Bastide M
165. Dedhia N
166. Blocker H
167. Hornischer K
168. Nordsiek G
169. Agarwala R
170. Aravind L
171. Bailey JA
172. Bateman A
173. Batzoglou S
174. Birney E
175. Bork P
176. Brown DG
177. Burge CB
178. Cerutti L
179. Chen H-C
180. Church D
181. Clamp M
182. Copley RR
183. Doerks T
184. Eddy SR
185. Eichler EE
186. Furey TS
187. Galagan J
188. Gilbert JGR
189. Harmon C
190. Hayashizaki Y
191. Haussler D
192. Hermjakob H
193. Hokamp K
194. Jang W
195. Johnson LS
196. Jones TA
197. Kasif S
198. Kaspryzk A
199. Kennedy S
200. Kent WJ
201. Kitts P
202. Koonin EV
203. Korf I
204. Kulp D
205. Lancet D
206. Lowe TM
207. McLysaght A
208. Mikkelsen T
209. Moran JV
210. Mulder N
211. Pollara VJ
212. Ponting CP
213. Schuler G
214. Schultz J
215. Slater G
216. Smit AFA
217. Stupka E
218. Szustakowki J
219. Thierry-Mieg D
220. Thierry-Mieg J
221. Wagner L
222. Wallis J
223. Wheeler R
224. Williams A
225. Wolf YI
226. Wolfe KH
227. Yang S-P
228. Yeh R-F
229. Collins F
230. Guyer MS
231. Peterson J
232. Felsenfeld A
233. Wetterstrand KA
234. Myers RM
235. Schmutz J
236. Dickson M
237. Grimwood J
238. Cox DR
239. Olson MV
240. Kaul R
241. Raymond C
242. Shimizu N
243. Kawasaki K
244. Minoshima S
245. Evans GA
246. Athanasiou M
247. Schultz R
248. Patrinos A
249. Morgan MJ
250. International Human Genome Sequencing, C, Whitehead Institute for Biomedical Research, CfGR, The Sanger, C. Washington University Genome Sequencing, C, Institute, UDJG, Baylor College of Medicine Human Genome Sequencing, C, Center, RGS, Genoscope, Cnrs, UMR, Department of Genome Analysis, loMB, Center, GTCS, Beijing Genomics Institute/Human Genome, C. Multimegabase Sequencing Center, TlfSB, Stanford Genome Technology, C, University of Oklahoma’s Advanced Center for Genome, T, Max Planck Institute for Molecular, G, Cold Spring Harbor Laboratory, LAHGC, Biotechnology, GBGRCf, *Genome Analysis, G, Scientific management: National Human Genome Research Institute, USNIoH, Stanford Human Genome, C, University of Washington Genome, C, Department of Molecular Biology, KUSoM, University of Texas Southwestern Medical Center at, D, Office of Science, USDoE, and The Wellcome, T
2001. Initial sequencing and analysis of the human genome. Nature 409:860–921
1. Lazar T
2. Tantos A
3. Tompa P
4. Schad E.
2022. Intrinsic protein disorder uncouples affinity from binding specificity. Protein Science 31:e4455
1. Lefébure T
2. Morvan C
3. Malard F
4. François C
5. Konecny-Dupré L
6. Guéguen L
7. Weiss-Gayet M
8. Seguin-Orlando A
9. Ermini L
10. Sarkissian CD
11. Charrier NP
12. Erne D
13. Mermillod-Blondin F
14. Duret L
15. Vieira C
16. Orlando L
17. Douady CJ.
2017. Less effective selection leads to larger genomes. Genome Research 27:1016–1028
1. Li W-H.
1987. Models of nearly neutral mutations with particular implications for nonrandom usage of synonymous codons. Journal of Molecular Evolution 24:337–345
1. Lin J-J
2. Bhattacharjee MJ
3. Yu C-P
4. Tseng YY
5. Li W-H.
2019. Many human RNA viruses show extraordinarily stringent selective constraints on protein evolution. Proceedings of the National Academy of Sciences 116:19009–19018
1. Liu SS
2. Hockenberry AJ
3. Jewett MC
4. Amaral LAN.
2018. A novel framework for evaluating the performance of codon usage bias metrics. Journal of The Royal Society Interface 15:20170667
1. Long H
2. Sung W
3. Kucukyildirim S
4. Williams E
5. Miller SF
6. Guo W
7. Patterson C
8. Gregory C
9. Strauss C
10. Stone C
11. Berne C
12. Kysela D
13. Shoemaker WR
14. Muscarella ME
15. Luo H
16. Lennon JT
17. Brun YV
18. Lynch M.
2018. Evolutionary determinants of genome-wide nucleotide composition. Nature Ecology & Evolution 2:237–240
1. Luzuriaga-Neira AR
2. Alvarez-Ponce D.
2022. Rates of Protein Evolution across the Marsupial Phylogeny: Heterogeneity and Link to Life-History Traits. Genome Biology and Evolution 14:evab277
1. Lynch M
2. Ackerman MS
3. Gout J-F
4. Long H
5. Sung W
6. Thomas WK
7. Foster PL.
2016. Genetic drift, selection and the evolution of the mutation rate. Nature Reviews Genetics 17:704–714
1. Masel J.
2011. Genetic drift. Current Biology 21:R837–R838
1. Meunier J
2. Duret L.
2004. Recombination Drives the Evolution of GC-Content in the Human Genome. Molecular Biology and Evolution 21:984–990
1. Novembre JA.
2002. Accounting for Background Nucleotide Composition When Measuring Codon Usage Bias. Molecular Biology and Evolution 19:1390–1394
1. Novoa EM
2. Jungreis I
3. Jaillon O.
4. Kellis M.
2019. Elucidation of Codon Usage Signatures across the Domains of Life. Molecular Biology and Evolution 36:2328–2339
1. Ohta T.
1972. Population size and rate of evolution. Journal of Molecular Evolution 1:305–314
1. Ohta T.
1973. Slightly Deleterious Mutant Substitutions in Evolution. Nature 246:96–98
1. Ohta T.
1992. The Nearly Neutral Theory of Molecular Evolution. Annual Review of Ecology and Systematics 23:263–286
1. Paradis E
2. Schliep K.
2019. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35:526–528
1. Payne BL
2. Alvarez-Ponce D.
2018. Higher Rates of Protein Evolution in the Self-Fertilizing Plant Arabidopsis thaliana than in the Out-Crossers Arabidopsis lyrata and Arabidopsis halleri. Genome Biology and Evolution 10:895–900
1. Plotkin JB
2. Dushoff J
3. Desai MM
4. Fraser HB.
2006. Codon Usage and Selection on Proteins. Journal of Molecular Evolution 63:635–653
1. Plotkin JB
2. Kudla G.
2011. Synonymous but not the same: the causes and consequences of codon bias. Nature Reviews Genetics 12:32–42
1. Romiguier J
2. Gayral P
3. Ballenghien M
4. Bernard A
5. Cahais V
6. Chenuil A
7. Chiari Y
8. Dernat R
9. Duret L
10. Faivre N
11. Loire E
12. Lourenco JM
13. Nabholz B
14. Roux C
15. Tsagkogeorga G
16. Weber AAT
17. Weinert LA
18. Belkhir K
19. Bierne N
20. Glémin S
21. Galtier N.
2014. Comparative population genomics in animals uncovers the determinants of genetic diversity. Nature 515:261–263
1. Romiguier J
2. Roux C.
2017. Analytical Biases Associated with GC-Content in Molecular Evolution. Frontiers in Genetics 8:16
1. Schad E
2. Tompa P
3. Hegyi H.
2011. The relationship between proteome size, structural disorder and organism complexity. Genome Biology 12:R120
1. Sharp PM
2. Bailes E
3. Grocock RJ
4. Peden JF
5. Sockett RE.
2005. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Research 33:1141–1153
1. Sharp PM
2. Emery LR
3. Zeng K.
2010. Forces that influence the evolution of codon bias. Philosophical Transactions of the Royal Society B: Biological Sciences 365:1203–1212
1. Sharp PM
2. Li W-H.
1987. The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research 15:1281–1295
1. Smith NGC
2. Eyre-Walker A.
2001. Synonymous Codon Bias Is Not Caused by Mutation Bias in G+C-Rich Genes in Humans. Molecular Biology and Evolution 18:982–986
1. Subramanian S.
2008. Nearly Neutrality and the Evolution of Codon Usage Bias in Eukaryotic Genomes. Genetics 178:2429–2432
1. Sun X
2. Yang Q
3. Xia X.
2012. An Improved Implementation of Effective Number of Codons (Nc). Molecular Biology and Evolution 30:191–196
1. Urrutia AO
2. Hurst LD.
2001. Codon Usage Bias Covaries With Expression Breadth and the Rate of Synonymous Evolution in Humans, but This Is Not Evidence for Selection. Genetics 159:1191–1199
1. Vavouri T
2. Semple JI
3. Garcia-Verdugo R
4. Lehner B.
2009. Intrinsic Protein Disorder and Interaction Promiscuity Are Widely Associated with Dosage Sensitivity. Cell 138:198–208
1. Vicario S
2. Moriyama EN
3. Powell JR.
2007. Codon usage in twelve species of Drosophila. BMC Evolutionary Biology 7:226
1. Wangen JR
2. Green R.
2020. Stop codon context influences genome-wide stimulation of termination codon readthrough by aminoglycosides. eLife 9:e52611
1. Weber CC
2. Nabholz B
3. Romiguier J
4. Ellegren H.
2014. Kr/Kc but not dN/dS correlates positively with body mass in birds, raising implications for inferring lineage-specific selection. Genome Biology 15:542
1. Wright F.
1990. The ‘effective number of codons’ used in a gene. Gene 87:23–29
1. Xue B
2. Dunker AK
3. Uversky VN.
2012. Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. Journal of Biomolecular Structure and Dynamics 30:137–149
1. Zhang Z
2. Li J
3. Cui P
4. Ding F
5. Li A
6. Townsend JP
7. Yu J.
2012. Codon Deviation Coefficient: a novel measure for estimating codon usage bias and its statistical significance. BMC Bioinformatics :43

Article and author information

Author information

Catherine A. Weibel
Department of Mathematics, University of Arizona, Tucson, Arizona 85721, USA, Department of Physics, University of Arizona, Tucson, Arizona 85721, USA, Department of Applied Physics, Stanford University, California, USA
ORCID iD: 0000-0003-1837-5209
Andrew L. Wheeler
Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, Arizona 85721, USA
ORCID iD: 0000-0002-5347-5419
Jennifer E. James
Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona 85721, USA, Department of Ecology and Genetics, Evolutionary Biology Center, Uppsala University, Sweden
ORCID iD: 0000-0003-0518-6783
Sara M. Willis
Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona 85721, USA, University Information Technology Services, University of Arizona, Tucson, Arizona 85721, USA
ORCID iD: 0000-0002-1605-6426
Joanna Masel
Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona 85721, USA
ORCID iD: 0000-0002-7398-2127
- Corresponding author: Joanna Masel, Email: masel@arizona.edu

Version history

Preprint posted: March 3, 2023
Sent for peer review: March 16, 2023
Reviewed Preprint version 1: July 26, 2023
Reviewed Preprint version 2: July 16, 2024
Version of Record published: September 6, 2024

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.87335. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
