(a) A conceptual representation of the GFP fitness landscape following the visualization proposed by Wright, 1932. The black dotted lines represent the unknown regions of the fitness landscape and …
Fluorescence level distributions of individual barcodes linked to wild-type nucleotide sequences (colour) versus sequences containing only synonymous mutations (dotted line). The minimum number of …
(a) Median effects of single mutations according to sequence position. Amino acid residues on one single strand of the beta barrel of GFP monomers. The chromophore sites are hatched (positions …
(a) Observed frequencies of nucleotide mutation types in the four landscapes. Libraries of amacGFP, cgreGFP, and ppluGFP2 were generated with the Mutazyme II kit under the same conditions while the …
Entire mutant libraries are shown in grey, non-fluorescent negative controls are shown in black. Sorted cells, falling within the selected gate in the mKate2 channel and corresponding to around 10% …
(a) Fluorescence level distributions of genotypes at varying distances from the wildtype and the logistic curves fitted to the median fluorescence for each category (black line). (b), Distribution …
(a) Epistasis in genotypes with two mutations, highlighting negative epistatic interactions between individually neutral or slightly deleterious mutations. The values of impact on fluorescence have …
(a) Fraction of wildtype states in extant green fluorescent proteins which become deleterious in our data, as a function of the sequence divergence (from 0 to 100) between the two proteins. (b) …
(a) Thermal unfolding measured by differential scanning fluorimetry (DSF) showing the first derivative of the ratio of 350/330 nm emission. Shaded areas indicate standard deviation of triplicates. (b…
(a) Absorbance (grey) and fluorescence (green) spectra of purified protein in 9 M urea, or (b), 1× PBS, measured at 42 °C every 30 min for 60 hr; darker lines correspond to later time points. …
(a) Coomassie-stained gels (top) of full lysate, pellet, and supernatant of pooled functional (bright) or non-functional (dark) genome-integrated library variants for amacGFP, cgreGFP, and ppluGFP2. …
(a) Distribution of ddG predictions for single mutations observed to either maintain wildtype-level fluorescence (white) or render a genotype non-functional (color). Differences between the two …
(a) Correlation between effects of specific mutations in amacGFP backgrounds with and without the V12L (position 14 in our alignment) mutation (Pearson’s r=0.96). (b), Median (solid lines) and mean …
Heatmaps show the minimal distance in Angstroms between two residues, with pairs showing epistatic interactions >0.3 (representing twofold change in fluorescence compared to the additive …
(a) The proportion of single amino acid mutations which were observed to be neutral (maintaining fluorescence within two standard deviations of the wildtype level) in one GFP sequence and …
(a) 1. Each genotype in the dataset was denoted by the mutations it contained relative to its parental wildtype sequence. 2. Genotypes were one-hot encoded. For each position in the sequence, a …
(a) With a linear model, (b), with a linear model and a sigmoid output node, (c), with an output subnetwork, and (d), non-trivial sigmoidal functions transform fitness potentials into predicted …
Violin plots show the distribution of fluorescence of all genotypes (black) and combinations of only individually neutral mutations (color). Experimental measurements of the level of fluorescence in …
(a) For all genotypes generated by the neural network model (see Figure 5), we show the measured fluorescence as a function of the strongest deleterious effect of any mutation in that genotype that …
The avGFP data is from Sarkisyan et al., 2016.
Gene | amacGFP | cgreGFP | ppluGFP2 | avGFP |
---|---|---|---|---|
Number of protein genotypes surveyed | 35,500 | 26,165 | 32,260 | 51,715 |
Average (median) number of AA substitutions per genotype | 4.37 (3) | 4.23 (3) | 3.7 (2) | 3.93 (4) |
Average (median) number of barcode replicates per protein genotype | 8.7 (5) | 6.8 (5) | 12 (7) | 1.2 (1) |
Amino acid identity | avGFP: 82%; cgreGFP: 43%; ppluGFP2: 17% | avGFP: 41%; amacGFP: 43%; ppluGFP2: 19% | avGFP: 18%; amacGFP: 17%; cgreGFP: 19% | amacGFP: 82%; cgreGFP: 41%; ppluGFP2: 18% |
False positive rate* | 0.55% (9 of 1635) | 0.75% (14 of 1860) | 0.49% (11 of 2242) | 0.24% (2 of 839) |
False negative rate* | 0% (0 of 1084) | 0% (0 of 1583) | 0% (0 of 2744) | 0.08% (2 of 2444) |
Mean wildtype log10 fluorescence level ± standard deviation | 3.97±0.031 (3.96±0.030 for amacGFP:V12L) | 4.50±0.028 | 4.23±0.027 | 3.72±0.082 |
Fraction of genotypes in which epistasis cannot be ascertained† | 7.4% | 15.9% | 4.5% | 16.5% |
Fraction of genotypes displaying absolute epistasis >0.3 (>1) ‡ | 5.3% (0.2%) | 14.4% (5.6%) | 6.8% (0.9%) | 21.4% (11.6%) |
Mutational LD50, loss of function § | 5.8 (5.7 for amacGFP:V12L) | 3.2 | 6.2 | 4.1 |
Mutational LD50, loss of wildtype-level fluorescence § | 1.7 (1.8 for amacGFP:V12L) | 0.9 | 1.7 | 2.2 |
Proportion of machine-learning predicted genotypes displaying epistasis <–0.3 (<–1) | 78% (46%) | 57% (21%) | 81% (64%) | NA |
*False positive rates refer to the fraction of genotypes which are expected to be dark or dim due to chromophore mutations but which were assigned a bright fitness; false negative rates refer to genotypes encoding wildtype protein which were assigned dim or dark fitnesses.
†Calculation of epistasis requires knowledge of a genotype’s expected fluorescence, i.e. the sum of contributions of individual mutations. For genotypes with multiple mutations, all individual mutations comprising the genotype must have been measured in isolation.
‡An absolute epistasis value of 0.3 or 1 implies a two-fold or ten-fold difference between the observed and expected fluorescence levels, respectively.
§“Mutational LD50, loss of function” refers to the number of mutations at which 50% of genotypes are rendered non-functional (i.e. assigned to the darkest FACS gate). It was obtained by fitting a logistic curve to the fraction of non-functional genotypes at each mutational step (see values in Supplementary file 1) and solving for f(x)=0.5. “Mutational LD50, loss of wildtype-level fluorescence” refers instead to the number of mutations at which 50% of genotypes maintain a fluorescence level within two standard deviations of the wildtype level.
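The epistasis and LD50 definitions in these footnotes can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes that fluorescence and mutation effects are expressed in log10 units (consistent with 0.3 and 1 corresponding to two-fold and ten-fold differences), that epistasis is taken as observed minus expected, and that the logistic curve has the two-parameter form f(x) = 1/(1 + exp(-(a + b·x))). The fit shown is a simple least-squares line on logit-transformed fractions, swapped in for brevity and not necessarily the fitting procedure used in the study. All function names are hypothetical.

```python
import math


def expected_log_fluorescence(wt_log, single_effects):
    """Additive expectation in log10 units: wildtype level plus the sum of
    the effects of each mutation measured in isolation."""
    return wt_log + sum(single_effects)


def epistasis(observed_log, wt_log, single_effects):
    """Observed minus expected log10 fluorescence (sign convention assumed)."""
    return observed_log - expected_log_fluorescence(wt_log, single_effects)


def fold_difference(epistasis_value):
    """Fold difference between observed and expected fluorescence."""
    return 10 ** abs(epistasis_value)


def logistic_ld50(steps, fractions):
    """Fit f(x) = 1 / (1 + exp(-(a + b*x))) by least squares on the logit
    scale and return the x at which f(x) = 0.5, i.e. x = -a / b."""
    # Fractions of exactly 0 or 1 have no finite logit and are skipped.
    pts = [(x, math.log(f / (1 - f)))
           for x, f in zip(steps, fractions) if 0 < f < 1]
    if len(pts) < 2:
        raise ValueError("need at least two fractions strictly between 0 and 1")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    intercept = mean_y - slope * mean_x
    return -intercept / slope


# Illustrative values only (not taken from the dataset).
example_epistasis = epistasis(3.0, 4.0, [-0.2, -0.3])
example_steps = list(range(1, 9))
example_fractions = [1 / (1 + math.exp(-1.2 * (x - 4))) for x in example_steps]
example_ld50 = logistic_ld50(example_steps, example_fractions)
```

In this toy example, the double mutant is observed at 3.0 against an additive expectation of 3.5, so its epistasis is negative. The synthetic fractions are generated from a logistic curve with its midpoint at four mutations, and the fit recovers that midpoint.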
Gene | amacGFP:V12L | amacGFP | cgreGFP | ppluGFP2 | avGFP |
---|---|---|---|---|---|
Unfolding Tm (DSF) | 80.8 °C | 82.6 °C | 74.1 °C | 91.8 °C | 86.8 °C |
Aggregation Tm (DSF) | 79.5 °C | 82.0 °C | 73.9 °C | 90.2 °C | 86.6 °C |
Tm (CD) | 80.4 °C | 82.6 °C | 71.2 °C | 86.4 °C | 83.7 °C |
Transition slope (CD) | 0.86 | 0.72 | 1.27 | 0.63 | 0.67 |
Tm (DSC) | 80.2 °C | 82.4 °C | 72.9 °C | 90.3 °C | 86.3 °C |
Enthalpy of denaturation (DSC) | 744 kJ/mol | 768 kJ/mol | 755 kJ/mol | 515 kJ/mol | 1012 kJ/mol |
Fluorescence loss Tm (qPCR) | 81.1 °C | 82.6 °C | 72.9 °C | - | 87.5 °C |
Urea denaturation: initial rate* | –0.87 | –0.35 | –0.18 | –0.02 | –0.009 |
Kinetic parameters for urea denaturation curves* | a1=0.71, k1=0.96 h⁻¹; a2=0.28, k2=0.25 h⁻¹ | a1=0.52, k1=0.54 h⁻¹; a2=0.43, k2=0.12 h⁻¹ | - | a1=0.92, k1=0.02 h⁻¹ | a1=0.92, k1=0.01 h⁻¹ |
Refolding: initial rate† | 0.01 | 0.01 | 0.000014 | 0.05 | 0.007 |
Kinetic parameters for refolding curves† | a1=–0.35, k1=0.025 s⁻¹; a2=–0.36, k2=0.005 s⁻¹; a3=–0.38, k3=0.001 s⁻¹ | a1=–0.057, k1=0.057 s⁻¹; a2=–0.39, k2=0.013 s⁻¹; a3=–0.63, k3=0.002 s⁻¹ | a1=0.16, k1=0.036 s⁻¹; a2=–0.45, k2=0.01 s⁻¹; a3=–0.87, k3=0.001 s⁻¹ | a1=–0.32, k1=0.14 s⁻¹; a2=–0.45, k2=0.02 s⁻¹; a3=–0.21, k3=0.003 s⁻¹ | a1=–0.4, k1=0.016 s⁻¹; a2=–0.36, k2=0.001 s⁻¹; a3=–0.31, k3=0.001 s⁻¹ |
Expected monomer size | 28.1 kDa | 28.1 kDa | 27.4 kDa | 25.7 kDa | 27.9 kDa |
Primary oligomeric state (SEC-MALS) | Monomer (67%), dimer (31%) | Monomer (51%), dimer (46%) | Dimer (>99%) | Tetramer (>97%) | Monomer (>99%) |
*Curves monitoring loss of fluorescence in 9 M urea were fitted with two exponential functions in the case of amacGFP and amacGFP:V12L and one exponential function for avGFP and ppluGFP2, while cgreGFP fluorescence loss could not be well modeled using only exponential functions (see Figure 4—figure supplement 1). Initial rates were estimated by calculating the derivative at time t=0.
†Curves monitoring the recovery of fluorescence after urea denaturation over the course of 20 minutes were fitted with three exponential functions (see Figure 4—figure supplement 1). Initial rates were estimated by calculating the derivative at time t=0.
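As a worked illustration of how an initial rate follows from the fitted parameters, the sketch below assumes that each curve has the form f(t) = c + Σ aᵢ·exp(−kᵢ·t). This functional form is an assumption, since the exact parameterization is not stated here. Under it, the derivative at t=0 is −Σ aᵢ·kᵢ, independent of the offset c. The parameter values are copied from the table above; the function and variable names are hypothetical.

```python
import math


def exp_sum(t, terms, offset=0.0):
    """Evaluate f(t) = offset + sum_i a_i * exp(-k_i * t) (assumed model form)."""
    return offset + sum(a * math.exp(-k * t) for a, k in terms)


def initial_rate(terms):
    """Analytical derivative of the assumed model at t = 0: -sum_i a_i * k_i."""
    return -sum(a * k for a, k in terms)


# (a_i, k_i) pairs from the table: avGFP refolding (k in s^-1) and
# ppluGFP2 urea denaturation (k in h^-1).
avgfp_refolding = [(-0.4, 0.016), (-0.36, 0.001), (-0.31, 0.001)]
pplugfp2_urea = [(0.92, 0.02)]

avgfp_refolding_rate = initial_rate(avgfp_refolding)
pplugfp2_urea_rate = initial_rate(pplugfp2_urea)

# Cross-check the analytical derivative with a small finite difference.
step = 1e-6
finite_difference = (exp_sum(step, avgfp_refolding) - exp_sum(0.0, avgfp_refolding)) / step
```

With these rounded parameters the avGFP refolding rate evaluates to about 0.0071 and the ppluGFP2 urea denaturation rate to about −0.018, close to the tabulated 0.007 and −0.02. Small discrepancies for other proteins are to be expected, because the tabulated parameters are rounded.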
Selected statistics of genotypes at different divergence from five GFP sequences.
Data Collection and Refinement Statistics.
Absolute values for the borders between gates in the green channel during sorting, for all genes and machines, and the corrections applied to match values between the machines.
Dataframes containing the distribution across gates of all primary-secondary barcode combinations, along with their fitted fitness values (see Materials and methods).
Data are not filtered according to cell count, number of replicates, etc. One dataframe per gene and machine.
Dataframes linking nucleotide or protein genotypes to their measured fluorescence level (see Materials and methods).
Mutations in genotypes are labeled in the format AiB, where A is the original wildtype state, B is the mutated state, and i is the position (counting starts from the initial methionine = 0). In the nucleotide dataset, 'n_replicates' refers to the combined number of distinct barcodes representing a genotype and machines it was measured on. In the amino acid dataset, 'n_replicates' refers to the number of synonymous nucleotide sequences measured for each protein sequence. Nucleotide genotypes and amino acid genotypes are on separate tabs in the file.
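For readers working with these dataframes, the AiB labels can be parsed along the following lines. This is a minimal sketch. The separator between mutations within a genotype (`:` here) and the use of `*` for stop codons are assumptions and should be checked against the files. The names `Mutation`, `parse_mutation`, and `parse_genotype` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Mutation(NamedTuple):
    wildtype: str   # original state A
    position: int   # zero-based position i (initial methionine = 0)
    mutant: str     # mutated state B


# One letter (or '*', assumed to denote a stop), digits, one letter (or '*').
_LABEL = re.compile(r"^([A-Za-z*])(\d+)([A-Za-z*])$")


def parse_mutation(label: str) -> Mutation:
    """Parse a single AiB label such as 'V12L'."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not an AiB mutation label: {label!r}")
    wildtype, position, mutant = match.groups()
    return Mutation(wildtype, int(position), mutant)


def parse_genotype(genotype: str, sep: str = ":") -> List[Mutation]:
    """Parse a genotype string into mutations; an empty string is wildtype.
    The separator is an assumption about the file format."""
    if not genotype.strip():
        return []
    return [parse_mutation(part) for part in genotype.split(sep)]


def one_based(mutation: Mutation) -> int:
    """Convert the zero-based position to conventional one-based numbering."""
    return mutation.position + 1


example = parse_genotype("V12L:A109T")
```

Keeping the position as an integer makes it straightforward to index into the zero-based distance matrices and ddG tables described below, which use the same numbering.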
Table containing ddG predictions for single mutations in avGFP, amacGFP, amacGFP:V12L, cgreGFP, and ppluGFP2.
Residue positions are labeled starting from 0 (methionine).
Dataframes containing the minimum physical distance between pairs of residues inside the 3D GFP structures, in Angstroms.
Row and column indices represent the residue position within the protein, starting from 0 for the initial methionine. Matrices for different proteins are included in different tabs in the file.
Table containing absorbance values (from 300 to 700 nm) and fluorescence emission values (from 450 to 700 nm, upon 420 nm excitation) for all genes, in 9 M urea and PBS, measured on a plate reader at multiple consecutive time points.
Blank control values are already subtracted. Absorbance and fluorescence data are listed on separate tabs.
Raw data from differential scanning fluorimetry and calorimetry, circular dichroism, and qPCR melting curves.
Coding sequences for neural network-generated genotypes, and their predicted and observed levels of fluorescence.
Table of over 70 documented natural fluorescent proteins used during analyses, including name, species, sequence, original reference and, where possible, accession numbers and measured excitation/emission peaks.
Estimated rates of evolution of amino acid states used in prediction of novel GFP sequences on each branch of the phylogeny of extant GFPs.