The effectiveness of selection, calculated as the long-term ratio of time spent in fixed deleterious : fixed beneficial allele states given symmetric mutation rates, is a function of the product sN. Assuming a diploid Wright-Fisher population with s ≪ 1, the probability of fixation of a new mutation is denoted π(N, s), and the y-axis is calculated as π(N, -s)/(π(N, -s) + π(N, s)). s is held constant at a value of 0.001 and N is varied. Results for other small-magnitude values of s are superimposable. For small sN, selection is ineffective at producing codon bias. For large sN, selection is highly effective. For only a relatively narrow range of intermediate values of sN, the degree of codon bias depends quantitatively on sN.
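The shape of this curve can be reproduced with a short numerical sketch. The expression for the fixation probability is not given in this caption, so the code below assumes Kimura's diffusion approximation for a new mutation at initial frequency 1/(2N) with effective size equal to N, π(N, s) = (1 − e^(−2s))/(1 − e^(−4Ns)); the authors' exact expression may differ by constant factors in the exponents. The function names are hypothetical. Under this assumption the plotted fraction reduces to a logistic function of sN, which is why curves for different small s are superimposable.

```python
import math

def fixation_probability(N, s):
    """Assumed form (Kimura's diffusion approximation, new mutation at
    frequency 1/(2N), Ne = N); not taken from the caption itself."""
    if s == 0:
        return 1.0 / (2 * N)  # neutral limit
    # expm1(x) = e^x - 1 keeps precision when |s| is small
    return math.expm1(-2 * s) / math.expm1(-4 * N * s)

def deleterious_fraction(N, s):
    """pi(N, -s) / (pi(N, -s) + pi(N, s)) under the assumed form.
    Algebraically this equals 1 / (1 + exp((4N - 2) s)); the logistic is
    evaluated in a numerically stable way to avoid overflow at large sN."""
    x = (4 * N - 2) * s
    if x >= 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))

if __name__ == "__main__":
    s = 0.001  # held constant, as in the caption
    for N in (10, 100, 1_000, 10_000, 100_000):
        print(f"sN = {s * N:g}  fraction deleterious = {deleterious_fraction(N, s):.3g}")
```

The fraction stays near 0.5 for small sN, falls steeply over an intermediate range, and is effectively zero for large sN, consistent with the narrow range of quantitative dependence described above.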

More highly adapted species (bottom) have a higher proportion of their sites subject to effective selection on codon bias (blue area). The CAI attempts to compare the intensity of selection (Figure 1, x-axis) in a subset of genes under strong selection (red areas). Given the narrow range of quantitative dependence of codon bias on sN shown in Figure 1, our new metric is intended to capture differences in the proportion of the proteome subject to substantial selection (blue areas).

CAI is seriously confounded with GC content, while ENC and CAIS are not. We control for phylogenetic confounding via Phylogenetic Independent Contrasts (PIC) (Felsenstein 1985); this yields an unbiased R² estimate (Rohlf 2006). Each datapoint is one of 118 vertebrate species with “Complete” intergenic genomic sequence (allowing for %GC correction) and TimeTree divergence dates (allowing for PIC correction). Red line shows unweighted lm(y∼x) with grey region as 95% confidence interval. Plots without PIC correction are shown in Supplementary Figure 2.

Protein domains have higher ISD when found in more exquisitely adapted species. We show -ENC instead of ENC to more easily compare with CAIS. Each datapoint is one of 118 vertebrate species with “complete” intergenic genomic sequence available (allowing for %GC correction), and TimeTree divergence dates (allowing for PIC correction). “Effects” on ISD shown on the y-axis are fixed effects of species identity in our linear mixed model, after PIC correction. Red line shows unweighted lm(y∼x) with grey region as 95% confidence interval. Panels without PIC correction are presented in Supplementary Figure 3.

CAIS is not correlated with the degree to which local genomic regions differ in their GC content from global GC content. If CAIS were driven by GC-biased gene conversion, genomes with more heterogeneous %GC distributions should have higher CAIS scores.

More exquisitely adapted species have higher ISD in both ancient and recent protein domains. Age assignments are taken from James et al. (2021), with vertebrate protein domains that emerged prior to LECA classified as “old”, and vertebrate protein domains that emerged after the divergence of animals and fungi from plants as “young”. “Effects” on ISD shown on the y-axis are fixed effects of species identity in our linear mixed model. The same n=118 datapoints are shown as in Figures 3 and 4. Red line shows lm(y∼x), with grey region as 95% confidence interval. Panels without PIC correction are shown in Supplementary Figure 4. We show -ENC instead of ENC to more easily compare with CAIS.

Codon Adaptation Index is not appropriate for species-wide effectiveness of selection measurements. Each CAI value shown is averaged over an entire species’ proteome. A) The value of CAI is driven by its normalizing denominator term, CAImax. B) As a result, CAI is inversely proportional to CAIS. Each datapoint is one of 118 vertebrate species with “Complete” intergenic genomic sequence available (allowing for %GC correction) and TimeTree divergence dates (allowing for PIC correction). P-values shown are for Pearson’s correlation.

The same relationships are shown as in Figure 3, here without correction for phylogenetic confounding. CAI and ENC both correlate with genomic GC, but CAIS does not. Red line shows lm(y∼x), with grey region as 95% confidence interval. We use PIC-corrected results rather than these results because PIC correction removes non-independent errors to produce an unbiased R² estimate (Rohlf 2006).

The same relationships are shown as in Figure 4, here without correction for phylogenetic confounding. As in Figure 4, ISD of protein domains is higher in more highly adapted species, as measured by CAIS and ENC, but not by CAIS calculated with local GC% rather than genome-wide GC%. ISD is calculated as in Figure 4. Red line shows lm(y∼x), with grey region as 95% confidence interval.

Without correction for phylogenetic confounding, more highly adapted species have higher ISD in both young and old protein domains. Age assignments and ISD effects are calculated as in Figure 6. The same n=118 datapoints are shown as in Figures 3-5. Red line shows lm(y∼x), with grey region as 95% confidence interval.

Vertebrate CAIS values are not greatly affected by computation for a standardized amino acid composition vs. computation for the amino acid frequencies in the species in question.