The protein domains of vertebrate species in which selection is more effective have greater intrinsic structural disorder
eLife assessment
This study develops a useful metric for quantifying codon usage adaptation - the Codon Adaptation Index of Species (CAIS). This metric permits direct comparisons of the strength of selection at the molecular level across species. The study is based on solid evidence, and the authors identify relationships between CAIS and the presence of disordered protein domains. Other correlations, such as the one between CAIS and body size, are weak and non-significant. In summary, the study introduces an interesting new approach to quantifying codon usage across species, which may be helpful in attempts to measure selection at the molecular level.
https://doi.org/10.7554/eLife.87335.3.sa0
Significance: Useful (findings that have focused importance and scope)
Strength of evidence: Solid (methods, data and analyses broadly support the claims with only minor weaknesses)
Abstract
The nearly neutral theory of molecular evolution posits variation among species in the effectiveness of selection. In an idealized model, the census population size determines both the minimum magnitude of the selection coefficient required for deleterious variants to be reliably purged, and the amount of neutral diversity. Empirically, an ‘effective population size’ is often estimated from the amount of putatively neutral genetic diversity and is assumed to also capture a species’ effectiveness of selection. A potentially more direct measure of the effectiveness of selection is the degree to which selection maintains preferred codons. However, past metrics that compare codon bias across species are confounded by among-species variation in %GC content and/or amino acid composition. Here, we propose a new Codon Adaptation Index of Species (CAIS), based on Kullback–Leibler divergence, that corrects for both confounders. We demonstrate the use of CAIS correlations, as well as the Effective Number of Codons, to show that the protein domains of more highly adapted vertebrate species evolve higher intrinsic structural disorder.
eLife digest
Evolution is the process through which populations change over time, starting with mutations in the genetic sequence of an organism. Many of these mutations harm the survival and reproduction of an organism, but only by a very small amount.
Some species, especially those with large populations, can purge these slightly harmful mutations more effectively than other species. This fact has been used by the ‘drift barrier theory’ to explain various profound differences amongst species, including differences in biological complexity. In this theory, the effectiveness of eliminating slightly harmful mutations is specified by an ‘effective’ population size, which depends on factors beyond just the number of individuals in the population.
Effective population size is normally calculated from the amount of time a ‘neutral’ mutation (one with no effect at all) stays in the population before becoming lost or taking over. Estimating this time requires both representative data for genetic diversity and knowledge of the mutation rate. A major limitation is that these data are unavailable for most species. A second limitation is that a brief, temporary reduction in the number of individuals has an oversized impact on the metric, relative to its impact on the number of slightly harmful mutations accumulated.
Weibel, Wheeler et al. developed a new metric to more directly determine how effectively a species purges slightly harmful mutations. Their approach is based on the fact that the genetic code has ‘synonymous’ sequences. These sequences code for the same amino acid building block, with one of these sequences being only slightly preferred over others.
The metric by Weibel, Wheeler et al. quantifies the proportion of the genome from which less preferred synonymous sequences have been effectively purged. It judges a population to have a higher effective population size when the usage of synonymous sequences departs further from the usage predicted from mutational processes.
The researchers expected that natural selection would favour ‘ordered’ proteins with robust three-dimensional structures, i.e., that species with a higher effective population size would tend to have more ordered versions of a protein. Instead, they found the opposite: species with a higher effective population size tend to have more disordered versions of the same protein. This changes our view of how natural selection acts on proteins.
Why species are so different remains a fundamental question in biology. Weibel, Wheeler et al. provide a useful tool for future applications of drift barrier theory to a broad range of ways that species differ.
Introduction
Species differ from each other in many ways, including mating system, ploidy, spatial distribution, life history, size, lifespan, genome size, mutation rate, selective pressure, and population size. These differences make the process of purifying selection more efficient in some species than others. Our understanding of both the causes and consequences of these differences is limited in part by the lack of a reliable metric with which to measure them. In the long term, the probability that a gene is fixed for one allele rather than another allele is given by the ratio of fixation and counter-fixation probabilities (Bulmer, 1991). In an idealized population of constant population size and no selection at linked sites, a mutation–selection–drift model describes how this ratio of fixation probabilities depends on the census population size N (Kimura, 1962), and hence gives the fraction of sites expected to be found in preferred vs. non-preferred states (Figure 1).
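As a hedged illustration of this reasoning, the relationship can be sketched in a standard textbook form that is not reproduced from this article. The symbols are assumptions introduced here: $u$ is the mutation rate from preferred to non-preferred states, $v$ is the reverse rate, and $S$ is the population-scaled selection coefficient, proportional to $Ns$ with a constant that depends on ploidy and model conventions.

```latex
\frac{P_{\mathrm{fix}}(+s)}{P_{\mathrm{fix}}(-s)} \approx e^{S},
\qquad
P_{\mathrm{preferred}} = \frac{1}{1 + (u/v)\,e^{-S}}
```

With $u = v$, the expected fraction of preferred sites is 0.5 at $S = 0$ and approaches 1 over a narrow range of $S$, which is the switch-like behavior the mutation–selection–drift model describes.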
This reasoning has been extended to real populations by positing that species have an ‘effective’ population size, Ne (Ohta, 1973). Ne is the census population size of an idealized population that reproduces a property of interest in the focal population. Ne is therefore not a single quantity per population, but instead depends on which property is of interest.
The amount of neutral polymorphism is the usual property used to empirically estimate Ne (Charlesworth, 2009; Doyle et al., 2015; Lynch et al., 2016). However, the property of most relevance to nearly neutral theory is instead the inflection point s at which non-preferred alleles become common enough to matter (Figure 1), and hence the degree to which highly exquisite adaptation can be maintained in the face of ongoing mutation and genetic drift (Kimura, 1962; Ohta, 1972; Ohta, 1992). While genetic diversity has been found to reflect some aspects of life history strategy (Romiguier et al., 2014), there remain concerns about whether neutral genetic diversity and the limits to weak selection always remain closely coupled in non-equilibrium settings.
As a practical matter, Ne is usually calculated by dividing some measure of the amount of putatively neutral (often synonymous) polymorphism segregating in a population by that species’ mutation rate (Charlesworth, 2009). As a result, Ne values are only available for species that have both polymorphism data and accurate mutation rate estimates, limiting their use. Worse, Ne is not a robust statistic. In the absence of a clear species definition, polymorphism is sometimes calculated across too broad a range of genomes, substantially inflating Ne (Daubin and Moran, 2004); a poor sampling scheme can have the converse effect of deflating genetic diversity. Transient hypermutation (Plotkin et al., 2006), which is common in microbes, causes further short-term inconsistencies in polymorphism levels. Perhaps most importantly, a recent bottleneck will deflate Ne based on the coalescence time, even if too brief to lead to significant erosion of fine-tuned adaptations. But drift barrier theory concerns the level with which adaptation is fine-tuned, and so a better metric would capture that directly, rather than indirectly rely on neutral diversity.
An alternative approach to measure the efficiency of selection exploits codon usage bias, which is influenced by weak selection for factors such as translational speed and accuracy (Hershberg and Petrov, 2008; Plotkin and Kudla, 2011; Hunt et al., 2014). The degree of bias in synonymous codon usage that is driven by selective preference offers a more direct way to assess how effective selection is at the molecular level in a given species (Li, 1987; Bulmer, 1991; Akashi, 1996; Subramanian, 2008). Conveniently, it can be estimated from only a single genome, that is, without polymorphism or mutation rate data for that species.
One commonly used metric, the Codon Adaptation Index (CAI) (Sharp and Li, 1987; Sharp et al., 2010) takes the average of Relative Synonymous Codon Usage (RSCU) scores, which quantify how often a codon is used, relative to the codon that is most frequently used to encode that amino acid in that species. While this works well for comparing genes within the same species, it unfortunately means that the species-wide strength of codon bias appears in the normalizing denominator (see Equation 4 and Figure 3—figure supplement 1A). Paradoxically, this can make more exquisitely adapted species have lower rather than higher species-averaged CAI scores (Figure 3—figure supplement 1B; Rocha, 2004; Botzman and Margalit, 2011).
To compare species using CAI, it has been suggested that instead of taking a genome-wide average, one should consider a set of highly expressed reference genes (Sharp et al., 2005; Vicario et al., 2007; Subramanian, 2008; dos Reis and Wernisch, 2009). This approach assumes that the relative strength of selection on those reference genes (often a function of gene expression) remains approximately constant across the set of species considered (red distributions in Figure 2). Its use also requires careful attention to the length of reference genes (Urrutia and Hurst, 2001; Doherty and McInerney, 2013), and some approaches also require information about tRNA gene copy numbers and abundances (dos Reis and Wernisch, 2009).
Since codon bias varies quantitatively within only a small range of s (Figure 1), a promising approach is to measure the proportion of sites at which codon adaptation is effective. We posit that more highly adapted species have a higher proportion of both genes and sites subject to effective selection on codon bias (Figure 2; Galtier et al., 2018). Indeed, CAI might also rely in part on variation in the fraction of sites within the reference genes that is subject to effective selection as a function of species (Figure 2, red). Here we take this logic further, considering all sites in a proteome-wide approach. Averaging across the entire proteome provides robustness to shifts in the expression level of or strength of selection on particular genes. The proteome-wide average depends on the fraction of sites whose selection coefficients exceed the ‘drift barrier’ for that particular species (Figure 2, blue threshold).
In estimating the effects of selection, it is critical to control for other causes of codon bias. In particular, species differ in their mutational bias with respect to the proportion of the genome that consists of guanine-cytosine base pairs (GC), and in the frequency of GC-biased gene conversion (Urrutia and Hurst, 2001; Duret and Galtier, 2009; Doherty and McInerney, 2013; Figuet et al., 2014). Here, we control for %GC, capturing species differences both in mutation and in gene conversion, by calculating the Kullback–Leibler divergence of the observed codon frequencies away from the codon frequencies that we would expect to see given the genomic %GC content of the species. Kullback–Leibler divergence measures the distance of an observed probability distribution from an expected reference distribution, capturing a measure of surprise (Kullback and Leibler, 1951). This method does not require us to specify preferred vs. non-preferred codons, and can thus also accommodate situations in which different genes have different codon preferences (Gingold et al., 2014; Cope et al., 2018).
An alternative metric, the Effective Number of Codons (ENC), originally quantified how far the codon usage of a sequence departs from equal usage of synonymous codons (Wright, 1990), with lower ENC values indicating greater departure. This approach creates a complex relationship with GC content (Fuglsang, 2008), and so ENC was later modified to correct for GC content (Novembre, 2002). However, a remaining issue with this modified ENC is that differences among species in amino acid composition might act as a confounding factor, even after controlling for GC content. Specifically, species that make more use of an amino acid for which there is stronger selection among codons (which is sometimes the case; Vicario et al., 2007) would have higher codon bias, even if each amino acid considered on its own had identical codon bias irrespective of which species it is in. Confounding with amino acid frequencies has been shown to be a problem at the individual protein level (Cope et al., 2018). Neither ENC (Fuglsang, 2004; Fuglsang, 2008) nor the CAI (Sharp and Li, 1987) adequately controls for differences in amino acid composition when applied across species. Despite early claims to the contrary (Wright, 1990), this problem is not easy to fix for ENC (Fuglsang, 2004; Fuglsang, 2008).
Here, we extend the CAI, using the information-theory-based Kullback–Leibler divergence, so that it corrects for both GC and amino acid composition (see Methods) to create a new Codon Adaptation Index of Species (CAIS). The availability of a complete genome allows both metrics to be readily calculated without data on polymorphism or mutation rate, without selecting reference genes, and without concerns about demographic history. Our purpose is to find an accessible metric that can quantify the limits to weak selection important to nearly neutral theory; this differs from past evaluations focused on comparing different genes of the same species and recapitulating ‘ground truth’ simulations thereof (Sun et al., 2013; Zhang et al., 2012; Liu et al., 2018). To demonstrate the usefulness of our method, we identify a novel correlation with intrinsic structural disorder (ISD), pointing to what else might be subject to weak selective preferences at the molecular level. While ENC can also identify subtle selection on ISD, CAIS can do so without the risk of confounding with amino acid frequencies.
Results
Both ENC and CAIS solve the GC confounding problem that plagues CAI
Proteins in better adapted species evolve more structural disorder
As an example of how correlations with codon adaptation metrics can be used to identify weak selective preferences, we investigate protein ISD. Disordered proteins are more likely to be harmful when overexpressed (Vavouri et al., 2009), and ISD is more abundant in eukaryotic than prokaryotic proteins (Schad et al., 2011; Xue et al., 2012; Basile et al., 2019), suggesting that low ISD might be favored by more effective selection.
However, compositional differences among proteomes might not be driven by differences in how a given protein sequence evolves as a function of the effectiveness of selection. Instead, they might be driven by the recent birth of ISD-rich proteins in animals (James et al., 2021), and/or by differences among sequences in their subsequent tendency to proliferate into many different genes (James et al., 2023). To focus only on the effects of descent with modification, we use a linear mixed model, with each species having a fixed effect on ISD, while controlling for Pfam domain identity as a random effect. We note that once GC is controlled for, codon adaptation can be assessed similarly in intrinsically disordered vs. ordered proteins (Gossmann et al., 2012). Controlling for Pfam identity is supported, with standard deviation in ISD of 0.178 among Pfams compared to residual standard deviation of 0.058, and a p-value on the significance of the Pfam random effect term of 3 × 10−13. Controlling in this way for Pfam identity, we then ask whether the fixed species effects on ISD are correlated with CAIS and with ENC.
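One way to write a model of this form is sketched below. The notation is introduced here for illustration and is not taken from the analysis scripts: $\mathrm{ISD}_{p,s}$ is the disorder of Pfam $p$ in species $s$, $\beta_s$ is the fixed species effect, $u_p$ is the Pfam random effect, and the Gaussian forms are the usual linear mixed model assumptions.

```latex
\mathrm{ISD}_{p,s} = \mu + \beta_s + u_p + \varepsilon_{p,s},
\qquad u_p \sim \mathcal{N}\!\left(0, \sigma_{\mathrm{Pfam}}^2\right),
\qquad \varepsilon_{p,s} \sim \mathcal{N}\!\left(0, \sigma^2\right)
```

Under this reading, the standard deviations reported above correspond to $\sigma_{\mathrm{Pfam}} \approx 0.178$ and $\sigma \approx 0.058$, and the estimated $\beta_s$ are the species effects that are then compared with CAIS and ENC.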
Surprisingly, more exquisitely adapted species have more disordered protein domains (Figure 4). Results using ENC and CAIS are similar, with ENC having higher power; the correlation coefficient is 0.36 for CAIS compared to 0.50 for ENC, and the p-value for ENC is 3 orders of magnitude lower. We note, however, that amino acid frequencies strongly influence ISD (Theillet et al., 2013). The CAIS correlation is more reliable than the ENC correlation because by construction, CAIS controls for differences in amino acid frequencies among species.
Different parts of the genome have different GC contents (Bernardi, 2000; Eyre-Walker and Hurst, 2001; Lander et al., 2001), primarily because the extent to which GC-biased gene conversion increases GC content depends on the local rate of recombination (Galtier et al., 2001; Meunier and Duret, 2004; Duret et al., 2006; Duret and Galtier, 2009). We therefore also calculated a version of CAIS whose codon frequency expectations are based on local intergenic GC content. This performed worse (Figure 4C) than our simple use of genome-wide GC content (Figure 4A) with respect to the strength of correlation between CAIS and ISD. If GC-biased gene conversion is a more powerful force than weak selective preferences among codons, then local GC content will evolve more rapidly than codon usage (Kondrashov et al., 2010). In this case, genome-wide GC may serve as an appropriately time-averaged proxy. It is also possible that the local non-coding sequences we used were too short (at 3000 bp or more), creating excessive noise that obscured the signal.
Many vertebrates have higher recombination rates and hence GC-biased gene conversion near genes; in this case genome-wide GC content would misestimate the codon usage expected from the combination of mutation bias and GC-biased gene conversion in the vicinity of genes. If GC-biased gene conversion drove CAIS, we expect high to predict high CAIS. We do not see this relationship (Figure 5), suggesting that gene conversion strength is not a confounding factor impacting CAIS.
Younger animal-specific protein domains have higher ISD (James et al., 2021). It is possible that selection in favor of high ISD is strongest in young domains, which might use more primitive methods to avoid aggregation (Foy et al., 2019; Bertram and Masel, 2020). To test this, we analyze two subsets of our data: those that emerged prior to the last eukaryotic common ancestor (LECA), here referred to as ‘old’ protein domains, and ‘young’ protein domains that emerged after the divergence of animals and fungi from plants. Young and old domains show equally strong trends of increasing disorder with species’ adaptedness (Figure 6).
Discussion
When different properties are each causally affected by a species’ exquisiteness of adaptation, this will create a correlation between the properties. We use codon adaptation as a reference property, such that correlations with codon adaptation indicate selection. To detect ISD as a novel property under selection, we used a linear mixed model approach that controls for Pfam identity as a random effect. This approach shows that the same Pfam domain tends to be more disordered when found in a well-adapted species (i.e. a species with a higher CAIS or ENC). This is true for both ancient and recently emerged protein domains.
It is important that no additional variable such as GC content or amino acid frequencies creates a spurious correlation by affecting both CAIS and our property of interest. For this reason, we define CAIS as the observed Kullback–Leibler divergence (Kullback and Leibler, 1951) from the codon usage expected given the GC content. The GC content pertinent to this expectation depends primarily on mutation bias and GC-biased gene conversion (Romiguier and Roux, 2017), but potentially also on selection on individual nucleotide substitutions that is hypothesized to favor higher %GC (Long et al., 2018). By controlling for %GC, we exclude all these forces from influencing CAIS or ENC. We thus capture the extent of adaptation in codon bias, including translational speed, accuracy, and any intrinsic preference for GC over AT that is specific to coding regions. These remaining codon-adaptive factors do not create a statistically convincing correlation between CAIS and GC (Figure 3C), nor between ENC and GC (Figure 3B), although CAI is strongly correlated with GC (Figure 3A). Notably, our new CAIS metric of codon adaptation controls for amino acid frequencies, rather than, like ENC, only GC content.
A direct effect of ISD on fitness agrees with studies of random Open Reading Frames (ORFs) in Escherichia coli, where fitness was driven more by amino acid composition than %GC content, after controlling for the intrinsic correlation between the two (Kosinski et al., 2022). However, we have not ruled out a role for selection for higher %GC in ways that are general rather than restricted to coding regions, whether in shaping mutational biases (Smith and Eyre-Walker, 2001; Hershberg and Petrov, 2009; Hildebrand et al., 2010; Novoa et al., 2019; Forcelloni and Giansanti, 2020) or the extent of gene conversion, or even at the single-nucleotide level in a manner shared between coding regions and intergenic regions (Long et al., 2018).
A more complex metric could control for more than just GC content and amino acid frequencies. First vs. second vs. third codon positions have different nucleotide usage on average, but while correcting for this might be useful for comparing genes (Zhang et al., 2012), correcting for it while comparing species might remove the effect of interest. Similarly, while it might be useful to control for dinucleotide and trinucleotide frequencies (Brbić et al., 2015), to avoid circularity these would need to be taken from intergenic sequences, with care needed to avoid influence from unannotated protein-coding genes or even pseudogenes.
Note that if a species were to experience a sudden reduction in census population size, for example due to habitat loss, leading to less effective selection, it would take some multiple of the neutral coalescent time for CAIS to fully adjust. CAIS thus represents a relatively long-term historical pattern of adaptation. The timescales setting neutral polymorphism-based estimates are likely shorter, based on a single round of coalescence. It is possible that the reason that we obtained correlations when we controlled for genome-wide GC content, but not when we controlled for local GC content, is also that codon adaptation adjusts slowly relative to the timescale of fluctuations in local GC content.
Here, we developed a new metric of species adaptedness at the codon level, capable of quantifying degrees of codon adaptation even among vertebrates. We chose vertebrates partly due to the abundance of suitable data, and partly as a stringent test case, given past studies finding limited evidence for codon adaptation (Kessler and Dean, 2014). It remains to be seen how CAIS behaves among species with stronger codon adaptation. We restricted our analysis to only the best annotated genomes, in part to ensure the quality of intergenic %GC estimates, and in part limited by the feasibility of running linear mixed models with 6 million datapoints. The phylogenetic tree is well resolved for vertebrate species, with an overrepresentation of mammalian species. Despite the focus on vertebrates, we were able to discover new results regarding selection on ISD.
Our finding that more effective selection prefers higher ISD was unexpected, given that lower-Ne eukaryotes have more disordered proteins than higher-Ne prokaryotes (Ahrens et al., 2017; Basile et al., 2019). However, this can be reconciled in a model in which highly disordered sequences are less likely to be found in high-Ne species, but the sequences that are present tend to have slightly higher disorder than their low-Ne homologs. High ISD might help mitigate the trade-off between affinity and specificity in protein–protein interactions (Dunker et al., 1998; Huang and Liu, 2013; Lazar et al., 2022); non-specific interactions might be short-lived due to the high entropy associated with disorder, which specific interactions are robust to.
Codon adaptation metrics more directly quantify how species vary in their exquisiteness of adaptation, than do estimates of effective population size that are based on neutral polymorphism. Both CAIS and ENC can also be estimated for far more species because they do not require polymorphism or mutation rate data, nor tRNA gene copy numbers and abundances, but only a single complete genome. CAIS has the additional advantage of not being confounded with amino acid frequencies. This makes CAIS a useful tool for applying nearly neutral theory to protein evolution, as shown by our worked example of ISD.
Methods
Species
Pfam sequences and IUPRED2 estimates of ISD predictions were taken from James et al., 2021, who studied species marked as ‘Complete’ in the GOLD database, with divergence dates available in TimeTree (Kumar et al., 2017). James et al., 2021 applied a variety of quality controls to exclude contaminants from the set of Pfams and assign accurate dates of Pfam emergence. Pfams that emerged prior to LECA are classified here as ‘old’, and Pfams that emerged after the divergence of animals and fungi from plants are classified as ‘young’, following annotation by James et al., 2021. Species list and other information can be found at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species (copy archived at MaselLab, 2024).
Codon Adaptation Index
Sharp and Li, 1987 quantified codon bias through the CAI, a normalized geometric mean of synonymous codon usage bias across sites, excluding stop and start codons. We modify this to calculate CAI including stop and start codons, because of documented preferences among stop codons in mammals (Wangen and Green, 2020). While usually used to compare genes within a species, among-species comparisons can be made using a reference set of genes that are highly expressed (Sharp and Li, 1987). Each codon $i$ is assigned an RSCU value:
$$RSCU_i = \frac{N_i}{\sum_{j \in a} N_j} \tag{1}$$
where $N_i$ denotes the number of times that codon $i$ is used, and the denominator sums over all codons $j$ that code for that specific amino acid $a$. RSCU values are normalized to produce a relative adaptiveness value $w_i$ for each codon, relative to the best adapted codon for that amino acid:
$$w_i = \frac{RSCU_i}{\max_{j \in a} RSCU_j} \tag{2}$$
Let $L$ be the number of codons across all protein-coding sequences considered. Then
$$CAI = \left(\prod_{i=1}^{L} w_i\right)^{1/L} \tag{3}$$
To understand the effects of normalization, it is useful to rewrite this as:
$$CAI = \frac{\left(\prod_{i=1}^{L} RSCU_i\right)^{1/L}}{\left(\prod_{i=1}^{L} \max_{j \in a} RSCU_j\right)^{1/L}} = \frac{CAI_{raw}}{CAI_{max}} \tag{4}$$
where $CAI_{raw}$ is the geometric mean of the ‘unnormalized’ or observed synonymous codon usages, and $CAI_{max}$ is the maximum possible CAI given the observed codon frequencies.
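The normalization can be illustrated with a minimal Python sketch. It assumes RSCU is a codon's count divided by the summed counts of its synonymous codons, and it uses a hypothetical two-amino-acid codon table (`SYNONYMS`) purely to keep the example short. The function names are illustrative and are not taken from the published scripts.

```python
import math
from collections import Counter

# Hypothetical toy codon table (two amino acids only), for illustration.
SYNONYMS = {
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
}


def rscu_table(counts):
    """Return per-codon RSCU and, for each codon, the maximum RSCU
    among its synonymous codons."""
    rscu, best = {}, {}
    for codons in SYNONYMS.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the sequences
        family_max = max(counts[c] for c in codons) / total
        for c in codons:
            rscu[c] = counts[c] / total
            best[c] = family_max
    return rscu, best


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def cai(codons):
    """Return (CAI, CAI_raw, CAI_max) for a list of codons,
    with CAI = CAI_raw / CAI_max."""
    counts = Counter(codons)
    rscu, best = rscu_table(counts)
    cai_raw = geometric_mean([rscu[c] for c in codons])
    cai_max = geometric_mean([best[c] for c in codons])
    return cai_raw / cai_max, cai_raw, cai_max


value, raw, maximum = cai(["AAA", "AAA", "AAG", "TTT"])
print(round(value, 4), round(raw, 4), round(maximum, 4))
```

Because the species-wide maximum appears in the denominator, a genome with stronger overall bias has a larger `cai_max`, which is how stronger adaptation can lower the species-averaged CAI.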
GC content
We calculated total %GC content (intergenic and genic) during a scan of all six reading frames across genic and intergenic sequences available from NCBI with access dates between May and July 2019 (described in James et al., 2023). Of the 170 vertebrates meeting the quality criteria of James et al., 2021, 118 had annotated intergenic sequences within NCBI, so we restricted the dataset further to keep only the 118 species for which total GC content was available.
Codon Adaptation Index of Species
Controlling for GC bias in synonymous codon usage
Consider a sequence region $r$ within species $s$ where each nucleotide has an expected probability of being G or C = $g_r$. For our main analysis, we consider just one region $r$ encompassing the entire genome of a species $s$. In a secondary analysis, we break the genome up and use local values of $g_r$ in the non-coding regions within and surrounding a gene or set of overlapping genes. To annotate the boundaries of these local regions, we first selected 1500 base pairs flanking each side of every coding sequence identified by NCBI annotations. Coding sequence annotations are broken up according to exon by NCBI. When coding sequences of the same gene did not fall within 3000 base pairs of each other, they were treated as different regions. When two coding sequences, whether from the same gene or from different genes, had overlapping 1500 bp catchment areas, we merged them together. $g_r$ was then calculated based on the non-coding sites within each region, including both genic regions such as promoters and non-genic regions such as introns and intergenic sequences.
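The region-building step can be sketched in Python as an interval merge. This is a simplified illustration under stated assumptions: coordinates are half-open `(start, end)` pairs on a single chromosome, catchment areas merge only when they strictly overlap, and the masking of coding sites before computing GC content is omitted. The names `local_regions` and `gc_fraction` are hypothetical.

```python
def local_regions(cds_intervals, flank=1500):
    """Pad each coding interval by `flank` bp on both sides and merge
    intervals whose padded catchment areas overlap."""
    padded = sorted((max(0, start - flank), end + flank)
                    for start, end in cds_intervals)
    merged = []
    for start, end in padded:
        if merged and start < merged[-1][1]:
            # Overlaps the previous catchment area: extend that region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / called


regions = local_regions([(10000, 10300), (12000, 12500), (20000, 20100)])
print(regions)
print(gc_fraction("ACGTNN"))
```

In this toy input the first two coding sequences lie less than 3000 bp apart, so their catchment areas overlap and form one region, while the third remains separate.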
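With no bias between C vs. G, nor between A vs. T, nor patterns beyond the overall composition taken one nucleotide at a time, the expected probability of seeing codon $i$ in a triplet within region $r$, whose GC probability is $g_r$, is
$$E_{i,r} = \left(\frac{g_r}{2}\right)^{n_{GC,i}} \left(\frac{1-g_r}{2}\right)^{n_{AT,i}} \tag{5}$$
where $n_{GC,i} + n_{AT,i} = 3$ total positions in codon $i$. The expected probability that amino acid $a$ in region $r$ is encoded by codon $i$ is
$$E_{i|a,r} = \frac{E_{i,r}}{\sum_{j \in a} E_{j,r}} \tag{6}$$
We can then measure the degree to which the observed codon frequencies diverge from these expected probabilities using the Kullback–Leibler divergence. This gives a CAIS metric for a species $s$, where $O_i$ is the observed frequency of codon $i$ and $O_{i|a}$ is its observed frequency among the codons for amino acid $a$:
$$CAIS = \sum_{i=1}^{64} O_i \log\frac{O_{i|a}}{E_{i|a}} \tag{7}$$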
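A minimal Python sketch of this GC-controlled divergence follows. It assumes each codon's log ratio of observed to expected synonymous share is weighted by that codon's overall observed frequency, that G and C (and A and T) are equally likely given the GC probability `g`, and that stop codons are treated as one more 'amino acid'. Function names are illustrative and are not those of the published scripts.

```python
import math
from itertools import product

# Standard genetic code with bases ordered T, C, A, G; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_conditional(g):
    """Expected share of each codon among its synonyms, given GC probability g."""
    raw = {}
    for codon in CODON_TO_AA:
        n_gc = sum(base in "GC" for base in codon)
        raw[codon] = (g / 2) ** n_gc * ((1 - g) / 2) ** (3 - n_gc)
    totals = {}
    for codon, aa in CODON_TO_AA.items():
        totals[aa] = totals.get(aa, 0.0) + raw[codon]
    return {codon: raw[codon] / totals[CODON_TO_AA[codon]] for codon in raw}


def cais(codon_counts, g):
    """Kullback-Leibler style divergence of observed synonymous codon usage
    from the usage expected under GC probability g."""
    expected = expected_conditional(g)
    total = sum(codon_counts.values())
    aa_counts = {}
    for codon, n in codon_counts.items():
        aa = CODON_TO_AA[codon]
        aa_counts[aa] = aa_counts.get(aa, 0) + n
    score = 0.0
    for codon, n in codon_counts.items():
        if n == 0:
            continue  # a zero-frequency codon contributes nothing
        o_i = n / total
        o_cond = n / aa_counts[CODON_TO_AA[codon]]
        score += o_i * math.log(o_cond / expected[codon])
    return score


print(cais({codon: 1 for codon in CODON_TO_AA}, 0.5))
print(cais({"AAA": 9, "AAG": 1}, 0.5))
```

Uniform synonymous usage at `g = 0.5` gives a score of zero, and any departure from the GC-based expectation gives a positive score, without the need to label particular codons as preferred.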
Controlling for amino acid composition
Some amino acids may be more intrinsically prone to codon bias. We want a metric that quantifies effectiveness of selection (not amino acid frequency), so we re-weight CAIS on the basis of a standardized amino acid composition, to remove the effect of variation among species in amino acid frequencies.
Let $F_a$ be the frequency of amino acid $a$ across the entire dataset of 118 vertebrate genomes. We want to re-weight the observed codon frequencies $O_i$ on the basis of $F_a$ to ensure that differences in amino acid frequencies among species do not affect CAIS, while preserving relative codon frequencies for the same amino acid. We do this by solving for $\alpha_a$ so that
$$\alpha_a \sum_{i \in a} O_i = F_a \tag{8}$$
We then define $O'_i = \alpha_a O_i$ to obtain an amino acid frequency adjusted CAIS, where $O_{i|a}$ and $E_{i|a}$ are the observed and expected frequencies of codon $i$ among the codons for amino acid $a$:
$$CAIS = \sum_{i=1}^{64} O'_i \log\frac{O_{i|a}}{E_{i|a}} \tag{9}$$
The $F_a$ values for our species set are at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species/blob/main/CAIS_ENC_calculation/Total_amino_acid_frequency_vertebrates.txt. Use of the standardized set of amino acid frequencies has only a small effect on computed CAIS values relative to using each vertebrate species’ own amino acid frequencies (Figure 3—figure supplement 3).
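The re-weighting step can be sketched in Python as follows. It assumes each amino acid's codon frequencies are rescaled by a single factor so that they sum to the standardized amino acid frequency. The toy frequencies and the name `reweight` are hypothetical.

```python
def reweight(codon_freqs, codon_to_aa, standard_aa_freqs):
    """Rescale codon frequencies so each amino acid matches a standard
    frequency while keeping codon ratios within an amino acid unchanged."""
    aa_totals = {}
    for codon, freq in codon_freqs.items():
        aa = codon_to_aa[codon]
        aa_totals[aa] = aa_totals.get(aa, 0.0) + freq
    # One scaling factor per amino acid; amino acids never observed get 0.
    alpha = {aa: standard_aa_freqs[aa] / total
             for aa, total in aa_totals.items() if total > 0}
    return {codon: alpha.get(codon_to_aa[codon], 0.0) * freq
            for codon, freq in codon_freqs.items()}


# Hypothetical toy data: two amino acids with two codons each.
codon_to_aa = {"AAA": "K", "AAG": "K", "TTT": "F", "TTC": "F"}
observed = {"AAA": 0.3, "AAG": 0.1, "TTT": 0.45, "TTC": 0.15}
adjusted = reweight(observed, codon_to_aa, {"K": 0.5, "F": 0.5})
print(adjusted)
```

After re-weighting, both amino acids carry equal total weight, yet the 3:1 ratio between synonymous codons is unchanged, so only the weights in the divergence sum are affected.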
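CAIS corrected for local intergenic GC content but not species-wide amino acid composition is
$$CAIS = \sum_{r=1}^{R} \sum_{i=1}^{64} \frac{O_{i,r}}{L} \log\frac{O_{i,r}}{\tilde{E}_{i,r}} \tag{10}$$
where $O_{i,r}$ is the number of times codon $i$ appears in region $r$ of species $s$, $\tilde{E}_{i,r}$ is the expected number of times codon $i$ would appear in region $r$ of species $s$ given the local intergenic GC content, $R$ is the number of regions, and $L$ is the total number of codons in the genome. Rewritten for greater computational ease:
$$CAIS = \frac{1}{L} \sum_{r=1}^{R} \sum_{i=1}^{64} O_{i,r} \left(\log O_{i,r} - \log \tilde{E}_{i,r}\right) \tag{11}$$
Given the limited impact of amino acid frequency correction, we used Equation 11 for the local GC results, but we could correct for amino acid composition by replacing the $O_{i,r}$ prefactor with $\alpha_a O_{i,r}$, or even $\alpha_{a,r} O_{i,r}$.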
Novembre’s ENC controlled for total GC content
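The effective number of codons is based on the squared deviations $X_a^2$ of the frequencies of the codons for each amino acid $a$ from null expectations:
$$X_a^2 = \sum_{i \in a} \frac{n_a \left(O_{i|a} - E_{i|a}\right)^2}{E_{i|a}} \tag{12}$$
where $n_a$ is the total number of times that amino acid $a$ appears, and $O_{i|a}$ and $E_{i|a}$ are the observed and expected frequencies of codon $i$ among the codons for amino acid $a$. Novembre, 2002 defines the corrected ‘F value’ of amino acid $a$, which has $k_a$ synonymous codons, as
$$F'_a = \frac{X_a^2 + n_a - k_a}{k_a \left(n_a - 1\right)} \tag{13}$$
and
$$ENC = 2 + \frac{9}{\bar{F}'_2} + \frac{1}{\bar{F}'_3} + \frac{5}{\bar{F}'_4} + \frac{3}{\bar{F}'_6} \tag{14}$$
where each $\bar{F}'_k$ is the average of the ‘F values’ for amino acids with $k$ synonymous codons. Past measures of ENC do not contain stop or start codons (Wright, 1990; Novembre, 2002; Fuglsang, 2004), but as we did for CAI and CAIS above, we include stop codons as an ‘amino acid’ and therefore amend Equation 14 to
$$ENC = 2 + \frac{9}{\bar{F}'_2} + \frac{2}{\bar{F}'_3} + \frac{5}{\bar{F}'_4} + \frac{3}{\bar{F}'_6} \tag{15}$$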
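A Python sketch of a GC-corrected effective number of codons in the style of Novembre (2002) is given below, with stop codons counted as an additional three-codon 'amino acid' as described above. It is a simplified illustration: amino acids seen fewer than two times are skipped, a degeneracy class with no usable amino acids contributes nothing, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code with bases ordered T, C, A, G; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
FAMILIES = {}
for _codon, _aa in CODON_TO_AA.items():
    FAMILIES.setdefault(_aa, []).append(_codon)


def expected_conditional(g):
    """Expected share of each codon among its synonyms, given GC probability g."""
    raw = {}
    for codon in CODON_TO_AA:
        n_gc = sum(base in "GC" for base in codon)
        raw[codon] = (g / 2) ** n_gc * ((1 - g) / 2) ** (3 - n_gc)
    return {codon: raw[codon] / sum(raw[c] for c in FAMILIES[aa])
            for codon, aa in CODON_TO_AA.items()}


def enc_gc_corrected(codon_counts, g):
    expected = expected_conditional(g)
    class_sizes = {}      # number of amino acids with k synonymous codons
    f_by_degeneracy = {}  # corrected F values grouped by k
    for aa, codons in FAMILIES.items():
        k = len(codons)
        class_sizes[k] = class_sizes.get(k, 0) + 1
        n = sum(codon_counts.get(c, 0) for c in codons)
        if k == 1 or n < 2:
            continue
        chi2 = sum(n * (codon_counts.get(c, 0) / n - expected[c]) ** 2 / expected[c]
                   for c in codons)
        f_by_degeneracy.setdefault(k, []).append((chi2 + n - k) / (k * (n - 1)))
    enc = float(class_sizes[1])  # Met and Trp each contribute exactly one codon
    for k, f_values in f_by_degeneracy.items():
        enc += class_sizes[k] / (sum(f_values) / len(f_values))
    return enc


uniform = {codon: 1000 for codon in CODON_TO_AA}
one_codon_each = {codons[0]: 1000 for codons in FAMILIES.values()}
print(enc_gc_corrected(uniform, 0.5), enc_gc_corrected(one_codon_each, 0.5))
```

With stop codons included, uniform usage at `g = 0.5` gives a value close to 64, while exclusive use of a single codon per amino acid gives 21, so lower values indicate stronger departure from the GC-based expectation.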
Statistical analysis
All statistical modeling was done in R 3.5.1. Scripts for calculating CAI and CAIS were written in Python 3.7.
Phylogenetic Independent Contrasts
Spurious phylogenetically confounded correlations can occur when closely related species share similar values of both metrics. One danger of such pseudoreplication is Simpson’s paradox, in which negative slopes within taxonomic groups and a positive slope among groups can combine to yield an overall positive slope. We avoid pseudoreplication by using Phylogenetic Independent Contrasts (PIC) (Felsenstein, 1985) to assess correlation. PIC analysis was done using the R package ‘ape’ (Paradis and Schliep, 2019).
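The analysis itself used the R package ‘ape’. Purely to illustrate the algorithm of Felsenstein (1985), the Python sketch below computes independent contrasts on a small bifurcating tree and correlates the contrasts of two traits through the origin. The nested-tuple tree encoding and the function names `pic` and `contrast_correlation` are hypothetical conventions chosen for this example.

```python
import math


def pic(node, trait):
    """Felsenstein's independent contrasts on a bifurcating tree.

    node:  a leaf is (name, branch_length); an internal node is
           ((left_child, right_child), branch_length).
    trait: {leaf name: trait value}
    Returns (node value, adjusted branch length, list of contrasts).
    """
    label, length = node
    if isinstance(label, str):
        return trait[label], length, []
    left, right = label
    x1, v1, c1 = pic(left, trait)
    x2, v2, c2 = pic(right, trait)
    # Standardized contrast between the two daughter lineages.
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Ancestral value: average weighted by inverse branch lengths.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the branch to reflect uncertainty in the ancestral value.
    adjusted = length + v1 * v2 / (v1 + v2)
    return value, adjusted, c1 + c2 + [contrast]


def contrast_correlation(cx, cy):
    """Correlation of two sets of contrasts, forced through the origin."""
    sxy = sum(a * b for a, b in zip(cx, cy))
    return sxy / math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
```

Because each contrast is computed between sister lineages, a tree with n tips yields n - 1 contrasts that are treated as independent data points in place of the n species values.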
Data availability
There are no new data. Processed data and code underlying this article are available in the public repository at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species (copy archived at MaselLab, 2024).
- figshare. Data from: Universal and taxon-specific trends in protein sequences as a function of age. https://doi.org/10.6084/m9.figshare.12037281.v1
References
- Evolution of intrinsic disorder in eukaryotic proteins. Cellular and Molecular Life Sciences 74:3163–3174. https://doi.org/10.1007/s00018-017-2559-0
- Why do eukaryotic proteins contain more intrinsically disordered regions? PLOS Computational Biology 15:e1007186. https://doi.org/10.1371/journal.pcbi.1007186
- Global shifts in genome and proteome composition are very tightly coupled. Genome Biology and Evolution 7:1519–1532. https://doi.org/10.1093/gbe/evv088
- Fundamental concepts in genetics: effective population size and patterns of molecular evolution and variation. Nature Reviews. Genetics 10:195–205. https://doi.org/10.1038/nrg2526
- Quantifying codon usage in signal peptides: Gene expression and amino acid usage explain apparent selection for inefficient codons. Biochimica et Biophysica Acta. Biomembranes 1860:2479–2485. https://doi.org/10.1016/j.bbamem.2018.09.010
- Translational selection frequently overcomes genetic drift in shaping synonymous codon usage patterns in vertebrates. Molecular Biology and Evolution 30:2263–2267. https://doi.org/10.1093/molbev/mst128
- Estimating translational selection in eukaryotic genomes. Molecular Biology and Evolution 26:451–461. https://doi.org/10.1093/molbev/msn272
- Protein Disorder and the Evolution of Molecular Recognition: Theory, Predictions and Observations. Pac Symp Biocomput pp. 473–484.
- Biased gene conversion and the evolution of mammalian genomic landscapes. Annual Review of Genomics and Human Genetics 10:285–311. https://doi.org/10.1146/annurev-genom-082908-150001
- Phylogenies and the comparative method. The American Naturalist 125:1–15. https://doi.org/10.1086/284325
- Biased gene conversion and GC-content evolution in the coding sequences of reptiles and vertebrates. Genome Biology and Evolution 7:240–250. https://doi.org/10.1093/gbe/evu277
- Evolutionary forces and codon bias in different flavors of intrinsic disorder in the human proteome. Journal of Molecular Evolution 88:164–178. https://doi.org/10.1007/s00239-019-09921-4
- The ‘effective number of codons’ revisited. Biochemical and Biophysical Research Communications 317:957–964. https://doi.org/10.1016/j.bbrc.2004.03.138
- Codon usage bias in animals: Disentangling the effects of natural selection, effective population size, and GC-biased gene conversion. Molecular Biology and Evolution 35:1092–1103. https://doi.org/10.1093/molbev/msy015
- The effect of variation in the effective population size on the rate of adaptive molecular evolution in eukaryotes. Genome Biology and Evolution 4:658–667. https://doi.org/10.1093/gbe/evs027
- Selection on codon bias. Annual Review of Genetics 42:287–299. https://doi.org/10.1146/annurev.genet.42.110807.091442
- General rules for optimal codon choice. PLOS Genetics 5:e1000556. https://doi.org/10.1371/journal.pgen.1000556
- Do intrinsically disordered proteins possess high specificity in protein–protein interactions? Chemistry – A European Journal 19:4462–4467. https://doi.org/10.1002/chem.201203100
- Exposing synonymous mutations. Trends in Genetics 30:308–321. https://doi.org/10.1016/j.tig.2014.04.006
- Differential retention of pfam domains contributes to long-term evolutionary trends. Molecular Biology and Evolution 40:msad073. https://doi.org/10.1093/molbev/msad073
- Effective population size does not predict codon usage bias in mammals. Ecology and Evolution 4:3887–3900. https://doi.org/10.1002/ece3.1249
- Random peptides rich in small and disorder-promoting amino acids are less likely to be harmful. Genome Biology and Evolution 14:evac085. https://doi.org/10.1093/gbe/evac085
- On information and sufficiency. The Annals of Mathematical Statistics 22:79–86. https://doi.org/10.1214/aoms/1177729694
- TimeTree: A resource for timelines, timetrees, and divergence times. Molecular Biology and Evolution 34:1812–1819. https://doi.org/10.1093/molbev/msx116
- Models of nearly neutral mutations with particular implications for nonrandom usage of synonymous codons. Journal of Molecular Evolution 24:337–345. https://doi.org/10.1007/BF02134132
- A novel framework for evaluating the performance of codon usage bias metrics. Journal of the Royal Society, Interface 15:20170667. https://doi.org/10.1098/rsif.2017.0667
- Evolutionary determinants of genome-wide nucleotide composition. Nature Ecology & Evolution 2:237–240. https://doi.org/10.1038/s41559-017-0425-y
- Genetic drift, selection and the evolution of the mutation rate. Nature Reviews. Genetics 17:704–714. https://doi.org/10.1038/nrg.2016.104
- Software: Codon-adaptation-index-of-species, version swh:1:rev:408af3d150311c4732219abae67c6929421908df. Software Heritage.
- Recombination drives the evolution of GC-content in the human genome. Molecular Biology and Evolution 21:984–990. https://doi.org/10.1093/molbev/msh070
- Accounting for background nucleotide composition when measuring codon usage bias. Molecular Biology and Evolution 19:1390–1394. https://doi.org/10.1093/oxfordjournals.molbev.a004201
- Elucidation of codon usage signatures across the domains of life. Molecular Biology and Evolution 36:2328–2339. https://doi.org/10.1093/molbev/msz124
- Population size and rate of evolution. Journal of Molecular Evolution 1:305–314. https://doi.org/10.1007/BF01653959
- The nearly neutral theory of molecular evolution. Annual Review of Ecology and Systematics 23:263–286. https://doi.org/10.1146/annurev.ecolsys.23.1.263
- Codon usage and selection on proteins. Journal of Molecular Evolution 63:635–653. https://doi.org/10.1007/s00239-005-0233-x
- Synonymous but not the same: the causes and consequences of codon bias. Nature Reviews. Genetics 12:32–42. https://doi.org/10.1038/nrg2899
- A comment on phylogenetic correction. Evolution; International Journal of Organic Evolution 60:1509–1515. https://doi.org/10.1554/05-550.1
- Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Research 33:1141–1153. https://doi.org/10.1093/nar/gki242
- Forces that influence the evolution of codon bias. Philosophical Transactions of the Royal Society B 365:1203–1212. https://doi.org/10.1098/rstb.2009.0305
- Synonymous codon bias is not caused by mutation bias in G+C-rich genes in humans. Molecular Biology and Evolution 18:982–986. https://doi.org/10.1093/oxfordjournals.molbev.a003899
- An improved implementation of effective number of codons (nc). Molecular Biology and Evolution 30:191–196. https://doi.org/10.1093/molbev/mss201
- The alphabet of intrinsic disorder. Intrinsically Disordered Proteins 1:e24360. https://doi.org/10.4161/idp.24360
- Codon usage in twelve species of Drosophila. BMC Evolutionary Biology 7:226. https://doi.org/10.1186/1471-2148-7-226
- Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. Journal of Biomolecular Structure & Dynamics 30:137–149. https://doi.org/10.1080/07391102.2012.675145
Article and author information
Funding
National Institutes of Health (GM104040)
- Catherine A Weibel
- Jennifer E James
- Sara M Willis
- Joanna Masel
National Institutes of Health (GM132008)
- Andrew L Wheeler
John Templeton Foundation (60814)
- Catherine A Weibel
- Jennifer E James
- Sara M Willis
- Joanna Masel
Arnold and Mabel Beckman Foundation (Scholars Program)
- Catherine A Weibel
National Science Foundation (WAESO/LSAMP Cooperative Agreement HRD-1101728)
- Catherine A Weibel
National Aeronautics and Space Administration (Arizona NASA Space Grant Consortium, Cooperative Agreement 80NSSC20M0041)
- Catherine A Weibel
National Science Foundation (Graduate Research Fellowship Program)
- Hanon McShea
The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Luke Kosinski, David Liberles, and Sawsan Wehbi for helpful discussions, Paul Nelson for providing the genome-wide GC contents, and the University of Arizona Undergraduate Biology Research Program for training. We thank Gavin Douglas for writing a convenient end-to-end implementation of CAIS on the basis of our preprint, which can be found at https://github.com/gavinmdouglas/handy_pop_gen/blob/main/CAIS.py, and for catching a minor bug in our code in time for us to correct it in the version of record. We thank the anonymous reviewers for constructive feedback, and Laurent Duret for helpful elaboration on the concerns of reviewer 1.
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.87335. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2023, Weibel, Wheeler et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.