Heterogeneity of the GFP fitness landscape and data-driven protein design

  1. Louisa Gonzalez Somermeyer
  2. Aubin Fleiss
  3. Alexander S Mishin
  4. Nina G Bozhanova
  5. Anna A Igolkina
  6. Jens Meiler
  7. Maria-Elisenda Alaball Pujol
  8. Ekaterina V Putintseva
  9. Karen S Sarkisyan (corresponding author)
  10. Fyodor A Kondrashov (corresponding author)
  1. Institute of Science and Technology Austria, Austria
  2. Synthetic Biology Group, MRC London Institute of Medical Sciences, United Kingdom
  3. Institute of Clinical Sciences, Faculty of Medicine and Imperial College Centre for Synthetic Biology, Imperial College London, United Kingdom
  4. Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Russian Federation
  5. Department of Chemistry, Center for Structural Biology, Vanderbilt University, United States
  6. Gregor Mendel Institute, Austrian Academy of Sciences, Vienna BioCenter, Austria
  7. Institute for Drug Discovery, Medical School, Leipzig University, Germany
  8. LabGenius, United Kingdom
  9. Evolutionary and Synthetic Biology Unit, Okinawa Institute of Science and Technology Graduate University, Japan
7 figures, 2 tables and 13 additional files

Figures

Figure 1 with 3 supplements
Comparison of four GFP fitness peaks.

(a) A conceptual representation of the GFP fitness landscape following the visualization proposed by Wright, 1932. The black dotted lines represent the unknown regions of the fitness landscape and …

Figure 1—figure supplement 1
Distributions of wild-type protein genotypes with and without synonymous mutations.

Fluorescence level distributions of individual barcodes linked to wild-type nucleotide sequences (colour) versus sequences containing only synonymous mutations (dotted line). The minimum number of …

Figure 1—figure supplement 2
Effects of mutations across the GFP sequences.

(a) Median effects of single mutations according to sequence position. Amino acid residues on one single strand of the beta barrel of GFP monomers. The chromophore sites are hatched (positions …

Figure 1—figure supplement 3
Mutational bias in datasets generated from different mutagenesis strategies.

(a) Observed frequencies of nucleotide mutation types in the four landscapes. Libraries of amacGFP, cgreGFP, and ppluGFP2 were generated with the Mutazyme II kit under the same conditions while the …

Figure 2 with 1 supplement
Flowthrough of the experimental methodology.

Figure 2—figure supplement 1
Distribution of cells during FACS sorting.

Entire mutant libraries are shown in grey, non-fluorescent negative controls are shown in black. Sorted cells, falling within the selected gate in the mKate2 channel and corresponding to around 10% …

Figure 3 with 2 supplements
Distributions of fluorescence.

(a) Fluorescence level distributions of genotypes at varying distances from the wildtype and the logistic curves fitted to the median fluorescence for each category (black line). (b) Distribution …

Figure 3—figure supplement 1
Epistatic interactions of mutations in GFP.

(a) Epistasis in genotypes with two mutations, highlighting negative epistatic interactions between individually neutral or slightly deleterious mutations. The values of impact on fluorescence have …

Figure 3—figure supplement 2
Effect of extant and non-extant mutations.

(a) Fraction of wildtype states in extant green fluorescent proteins which become deleterious in our data, as a function of the sequence divergence (from 0 to 100) between the two proteins. (b) …

Figure 4 with 5 supplements
Thermal sensitivity of GFP orthologs.

(a) Thermal unfolding measured by differential scanning fluorimetry (DSF) showing the first derivative of the ratio of 350/330 nm emission. Shaded areas indicate standard deviation of triplicates. (b

Figure 4—figure supplement 1
Urea denaturation and refolding of orthologues.

(a) Absorbance (grey) and fluorescence (green) spectra of purified protein in 9 M urea, or (b) 1× PBS, measured at 42 °C every 30 min for 60 hr; darker lines correspond to later time points. …

Figure 4—figure supplement 2
Aggregation and oligomeric states in GFP orthologues.

(a) Coomassie-stained gels (top) of full lysate, pellet, and supernatant of pooled functional (bright) or non-functional (dark) genome-integrated library variants for amacGFP, cgreGFP, and ppluGFP2. …

Figure 4—figure supplement 3
Correlation between fluorescence and ddG predicted by Rosetta.

(a) Distribution of ddG predictions for single mutations observed to either maintain wildtype-level fluorescence (white) or render a genotype non-functional (color). Differences between the two …

Figure 4—figure supplement 4
Effects of mutations in amacGFP and amacGFP:V12L.

(a) Correlation between effects of specific mutations in amacGFP backgrounds with and without the V12L (position 14 in our alignment) mutation (Pearson’s r=0.96). (b) Median (solid lines) and mean …

Figure 4—figure supplement 5
Spatial proximity of amino acid residues and detected pairwise epistasis.

Heatmaps show the minimal distance in Angstroms between two residues, with pairs showing epistatic interactions >0.3 (representing twofold change in fluorescence compared to the additive …

Figure 5
Differences in mutational effects in GFP orthologues.

(a) The proportion of single amino acid mutations which were observed to be neutral (maintaining fluorescence within two standard deviations of the wildtype level) in one GFP sequence and …

Figure 6 with 1 supplement
Neural network structure.

(a) 1. Each genotype in the dataset was denoted by the mutations it contained relative to its parental wildtype sequence. 2. Genotypes were one-hot encoded. For each position in the sequence, a …

Figure 6—figure supplement 1
Correlations between observed and predicted levels of fluorescence.

(a) With a linear model, (b) with a linear model and a sigmoid output node, (c) with an output subnetwork, and (d) non-trivial sigmoidal functions transform fitness potentials into predicted …

Figure 7 with 1 supplement
Predicting functional GFP mutants.

Violin plots show the distribution of fluorescence of all genotypes (black) and combinations of only individually neutral mutations (color). Experimental measurements of the level of fluorescence in …

Figure 7—figure supplement 1
Mutations used in machine learning-generated genotypes.

(a) For all genotypes generated by the neural network model (see Figure 6), we show the measured fluorescence as a function of the strongest deleterious effect of any mutation in that genotype that …

Tables

Table 1
The dataset in numbers.

The avGFP data is from Sarkisyan et al., 2016.

| Gene | amacGFP | cgreGFP | ppluGFP2 | avGFP |
|---|---|---|---|---|
| Number of protein genotypes surveyed | 35,500 | 26,165 | 32,260 | 51,715 |
| Average (median) number of AA substitutions per genotype | 4.37 (3) | 4.23 (3) | 3.7 (2) | 3.93 (4) |
| Average (median) number of barcode replicates per protein genotype | 8.7 (5) | 6.8 (5) | 12 (7) | 1.2 (1) |
| Amino acid identity | avGFP: 82%; cgreGFP: 43%; ppluGFP2: 17% | avGFP: 41%; amacGFP: 43%; ppluGFP2: 19% | avGFP: 18%; amacGFP: 17%; cgreGFP: 19% | amacGFP: 82%; cgreGFP: 41%; ppluGFP2: 18% |
| False positive rate* | 0.55% (9 of 1635) | 0.75% (14 of 1860) | 0.49% (11 of 2242) | 0.24% (2 of 839) |
| False negative rate* | 0% (0 of 1084) | 0% (0 of 1583) | 0% (0 of 2744) | 0.08% (2 of 2444) |
| Mean wildtype log10 fluorescence level ± standard deviation | 3.97±0.031 (3.96±0.030 for amacGFP:V12L) | 4.50±0.028 | 4.23±0.027 | 3.72±0.082 |
| Fraction of genotypes in which epistasis cannot be ascertained | 7.4% | 15.9% | 4.5% | 16.5% |
| Fraction of genotypes displaying absolute epistasis >0.3 (>1) | 5.3% (0.2%) | 14.4% (5.6%) | 6.8% (0.9%) | 21.4% (11.6%) |
| Mutational LD50, loss of function § | 5.8 (5.7 for amacGFP:V12L) | 3.2 | 6.2 | 4.1 |
| Mutational LD50, loss of wildtype-level fluorescence level § | 1.7 (1.8 for amacGFP:V12L) | 0.9 | 1.7 | 2.2 |
| Proportion of machine-learning predicted genotypes displaying epistasis <–0.3 (<–1) | 78% (46%) | 57% (21%) | 81% (64%) | NA |
  1. * False positive rates refer to the fraction of genotypes which are expected to be dark or dim due to chromophore mutations but which were assigned a bright fitness; false negative rates refer to genotypes encoding wildtype protein which were assigned dim or dark fitnesses.

  2. Calculation of epistasis requires knowledge of a genotype’s expected fluorescence, i.e. the sum of contributions of individual mutations. For genotypes with multiple mutations, all individual mutations comprising the genotype must have been measured in isolation.

  3. An absolute epistasis value of 0.3 or 1 implies a two-fold or ten-fold difference between the observed and expected fluorescence levels, respectively.

  4. § “Mutational LD50, loss of function” refers to the number of mutations at which 50% of genotypes are rendered non-functional (i.e. assigned to the darkest FACS gate), obtained by fitting a logistic curve to the fraction of non-functional genotypes at each mutational step (see values in Supplementary file 1) and solving for f(x)=0.5; “Mutational LD50, loss of wildtype-level fluorescence level” refers instead to the number of mutations at which 50% of genotypes maintain a fluorescence level within two standard deviations of the WT level.
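
The epistasis and LD50 definitions in the notes above can be illustrated with a short Python sketch. It is not the authors' code: the function names, the treatment of single-mutation effects as log10 differences from wildtype, the least-squares fit on the logit scale (a simplified stand-in for a nonlinear logistic fit), and all numbers are assumptions made for illustration only.

```python
import math


def epistasis(observed_log10, wildtype_log10, single_effects_log10):
    """Observed minus expected log10 fluorescence.

    Assumption: each single-mutation effect is a log10 difference from
    wildtype, and the expected value is wildtype plus the sum of effects.
    """
    expected = wildtype_log10 + sum(single_effects_log10)
    return observed_log10 - expected


def fold_difference(epistasis_value):
    """Convert a log10-scale epistasis value into a fold difference."""
    return 10 ** abs(epistasis_value)


def mutational_ld50(n_mutations, fraction_nonfunctional):
    """Number of mutations at which a fitted logistic curve equals 0.5.

    Simplification: fit a straight line to logit(fraction) by least
    squares; every fraction must lie strictly between 0 and 1.
    """
    logits = [math.log(f / (1.0 - f)) for f in fraction_nonfunctional]
    n = len(n_mutations)
    mean_x = sum(n_mutations) / n
    mean_y = sum(logits) / n
    sxx = sum((x - mean_x) ** 2 for x in n_mutations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(n_mutations, logits))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # logit(0.5) = 0, so f(x) = 0.5 where slope * x + intercept = 0
    return -intercept / slope


# Hypothetical double mutant (made-up values, not taken from the dataset)
e = epistasis(observed_log10=3.2, wildtype_log10=4.0,
              single_effects_log10=[-0.1, -0.2])

# Hypothetical fractions generated from a logistic with midpoint 4, slope 1
steps = [1, 2, 3, 4, 5, 6, 7]
fractions = [1.0 / (1.0 + math.exp(-(x - 4.0))) for x in steps]
ld50 = mutational_ld50(steps, fractions)
```

On a log10 scale, an absolute epistasis of 0.3 corresponds to a fold difference of 10^0.3, which is close to 2, and a value of 1 corresponds to exactly tenfold, in line with note 3.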

Table 2
Biophysical and biochemical characterisation of wildtype GFPs.
| Parameter | amacGFP:V12L | amacGFP | cgreGFP | ppluGFP2 | avGFP |
|---|---|---|---|---|---|
| Unfolding Tm (DSF) | 80.8 °C | 82.6 °C | 74.1 °C | 91.8 °C | 86.8 °C |
| Aggregation Tm (DSF) | 79.5 °C | 82.0 °C | 73.9 °C | 90.2 °C | 86.6 °C |
| Tm (CD) | 80.4 °C | 82.6 °C | 71.2 °C | 86.4 °C | 83.7 °C |
| Transition slope (CD) | 0.86 | 0.72 | 1.27 | 0.63 | 0.67 |
| Tm (DSC) | 80.2 °C | 82.4 °C | 72.9 °C | 90.3 °C | 86.3 °C |
| Enthalpy of denaturation (DSC) | 744 kJ/mol | 768 kJ/mol | 755 kJ/mol | 515 kJ/mol | 1012 kJ/mol |
| Fluorescence loss Tm (qPCR) | 81.1 °C | 82.6 °C | 72.9 °C | - | 87.5 °C |
| Urea denaturation: initial rate* | –0.87 | –0.35 | –0.18 | –0.02 | –0.009 |
| Kinetic parameters for urea denaturation curves* | a1=0.71, k1=0.96 h⁻¹; a2=0.28, k2=0.25 h⁻¹ | a1=0.52, k1=0.54 h⁻¹; a2=0.43, k2=0.12 h⁻¹ | - | a1=0.92, k1=0.02 h⁻¹ | a1=0.92, k1=0.01 h⁻¹ |
| Refolding: initial rate | 0.01 | 0.01 | 0.000014 | 0.05 | 0.007 |
| Kinetic parameters for refolding curves | a1=–0.35, k1=0.025 s⁻¹; a2=–0.36, k2=0.005 s⁻¹; a3=–0.38, k3=0.001 s⁻¹ | a1=–0.057, k1=0.057 s⁻¹; a2=–0.39, k2=0.013 s⁻¹; a3=–0.63, k3=0.002 s⁻¹ | a1=0.16, k1=0.036 s⁻¹; a2=–0.45, k2=0.01 s⁻¹; a3=–0.87, k3=0.001 s⁻¹ | a1=–0.32, k1=0.14 s⁻¹; a2=–0.45, k2=0.02 s⁻¹; a3=–0.21, k3=0.003 s⁻¹ | a1=–0.4, k1=0.016 s⁻¹; a2=–0.36, k2=0.001 s⁻¹; a3=–0.31, k3=0.001 s⁻¹ |
| Expected monomer size | 28.1 kDa | 28.1 kDa | 27.4 kDa | 25.7 kDa | 27.9 kDa |
| Primary oligomeric state (SEC-MALS) | Monomer (67%), dimer (31%) | Monomer (51%), dimer (46%) | Dimer (>99%) | Tetramer (>97%) | Monomer (>99%) |
  1. * Curves monitoring loss of fluorescence in 9 M urea were fitted with two exponential functions in the case of amacGFP and amacGFP:V12L and one exponential function for avGFP and ppluGFP2, while cgreGFP fluorescence loss could not be well modeled using only exponential functions (see Figure 4—figure supplement 1). Initial rates were estimated by calculating the derivative at time t=0.

  2. Curves monitoring the recovery of fluorescence after urea denaturation over the course of 20 minutes were fitted with three exponential functions (see Figure 4—figure supplement 1). Initial rates were estimated by calculating the derivative at time t=0.
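
The initial rates described in the two notes above follow from the fitted parameters if each curve is modelled as a sum of decaying exponentials, f(t) = c + Σ a_i·exp(−k_i·t), whose derivative at t=0 is −Σ a_i·k_i. That functional form, including its sign convention and the constant offset c, is an assumption here, since the notes do not spell it out. A minimal Python sketch with a hypothetical function name:

```python
def initial_rate(terms):
    """Derivative at t = 0 of f(t) = c + sum(a * exp(-k * t)).

    `terms` is a list of (a, k) pairs; the constant offset c drops out
    when differentiating. The model form is an assumption, not taken
    from the source.
    """
    return -sum(a * k for a, k in terms)


# Rounded avGFP parameters as listed in Table 2
urea_avgfp = initial_rate([(0.92, 0.01)])  # k given in h^-1
refold_avgfp = initial_rate(
    [(-0.4, 0.016), (-0.36, 0.001), (-0.31, 0.001)]  # k given in s^-1
)
```

With these rounded inputs the sketch gives about –0.009 for urea denaturation and about 0.007 for refolding of avGFP, in line with the tabulated initial rates. Agreement for other columns may be looser, because the published parameters are rounded and the exact fitted form is not stated here.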

Additional files

Transparent reporting form
https://cdn.elifesciences.org/articles/75842/elife-75842-transrepform1-v3.docx
Supplementary file 1

Selected statistics of genotypes at different divergence from five GFP sequences.

https://cdn.elifesciences.org/articles/75842/elife-75842-supp1-v3.docx
Supplementary file 2

Data Collection and Refinement Statistics.

https://cdn.elifesciences.org/articles/75842/elife-75842-supp2-v3.docx
Source data 1

Absolute values for the borders between gates in the green channel during sorting, for all genes and machines, and the corrections applied to match values between the machines.

https://cdn.elifesciences.org/articles/75842/elife-75842-data1-v3.xlsx
Source data 2

Dataframes containing the distribution across gates of all primary-secondary barcode combinations, along with their fitted fitness values (see Materials and methods).

Data are not filtered according to cell count, number of replicates, etc. One dataframe per gene and machine.

https://cdn.elifesciences.org/articles/75842/elife-75842-data2-v3.zip
Source data 3

Dataframes linking nucleotide or protein genotypes to their measured fluorescence level (see Materials and methods).

Mutations in genotypes are labeled in the format AiB, where A is the original wildtype state, B is the mutated state, and i is the position (counting starts from Methionine = 0). In the nucleotide dataset, 'n_replicates' refers to the combined number of distinct barcodes representing a genotype and machines it was measured on. In the amino acid dataset, 'n_replicates' refers to the number of synonymous nucleotide sequences measured for each protein sequence. Nucleotide genotypes and amino acid genotypes are on separate tabs in the file.

https://cdn.elifesciences.org/articles/75842/elife-75842-data3-v3.xlsx
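
The AiB mutation labels described for Source data 3 can be read with a few lines of Python. This is a sketch under stated assumptions: the function names and the toy sequence are hypothetical, each state is assumed to be a single uppercase letter, and the way several mutations are joined within one genotype is not covered because it is not described above.

```python
import re

# One letter (wildtype state), an integer position, one letter (mutated state)
MUTATION_PATTERN = re.compile(r"([A-Z])(\d+)([A-Z])")


def parse_mutation(label):
    """Split a label such as 'K2R' into (original, position, mutated)."""
    match = MUTATION_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"not in AiB format: {label!r}")
    original, position, mutated = match.groups()
    return original, int(position), mutated


def apply_mutation(sequence, label):
    """Apply one mutation to a sequence, counting from 0 at the first residue."""
    original, position, mutated = parse_mutation(label)
    if position >= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position] != original:
        raise ValueError(
            f"expected {original} at position {position}, "
            f"found {sequence[position]}"
        )
    return sequence[:position] + mutated + sequence[position + 1:]


toy_wildtype = "MSKGE"  # hypothetical five-residue sequence, not a real GFP
mutant = apply_mutation(toy_wildtype, "K2R")
```

Because counting starts at 0 for the initial methionine, the label K2R in this toy example changes the third residue. Checking the stated wildtype state against the sequence catches off-by-one errors when mixing 0-based and 1-based numbering.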
Source data 4

Table containing ddG predictions for single mutations in avGFP, amacGFP, amacGFP:V12L, cgreGFP, and ppluGFP2.

Residue positions are labeled starting from 0 (methionine).

https://cdn.elifesciences.org/articles/75842/elife-75842-data4-v3.csv
Source data 5

Dataframes containing the minimum physical distance between pairs of residues inside the 3D GFP structures, in Angstroms.

Row and column indices represent the residue position within the protein, starting from 0 for the initial methionine. Matrices for different proteins are included in different tabs in the file.

https://cdn.elifesciences.org/articles/75842/elife-75842-data5-v3.xlsx
Source data 6

Table containing absorbance values (from 300 to 700 nm) and fluorescence emission values (from 450 to 700 nm, upon 420 nm excitation) for all genes, in 9 M urea and PBS, measured on a plate reader at multiple consecutive time points.

Blank control values are already subtracted. Absorbance and fluorescence data are listed on separate tabs.

https://cdn.elifesciences.org/articles/75842/elife-75842-data6-v3.xlsx
Source data 7

Raw data from differential scanning fluorimetry and calorimetry, circular dichroism, and qPCR melting curves.

https://cdn.elifesciences.org/articles/75842/elife-75842-data7-v3.xlsx
Source data 8

Coding sequences for neural network-generated genotypes, and their predicted and observed levels of fluorescence.

https://cdn.elifesciences.org/articles/75842/elife-75842-data8-v3.csv
Source data 9

Table of over 70 documented natural fluorescent proteins used during analyses, including name, species, sequence, original reference and, where possible, accession numbers and measured excitation/emission peaks.

https://cdn.elifesciences.org/articles/75842/elife-75842-data9-v3.csv
Source data 10

Estimated rates of evolution of amino acid states used in prediction of novel GFP sequences on each branch of the phylogeny of extant GFPs.

https://cdn.elifesciences.org/articles/75842/elife-75842-data10-v3.xlsx
