Computational and Systems Biology

Learning sequence-function relationships with scalable, interpretable Gaussian processes

Juannan Zhou
Carlos Martí-Gómez
Samantha Petti
David M McCandlish

Department of Biology, University of Florida, Gainesville, United States
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, United States
Department of Mathematics, Tufts University, Medford, United States

https://doi.org/10.7554/eLife.108964.1

Figures and data

Application to the genotype-phenotype dataset for protein GB1.
(A) Model predictive performance, measured by the R² on held-out data, as a function of the fraction of data used for training. Error bars represent one standard deviation across three independent subsets of sequences used for training at each proportion. (B) Inferred position- and allele-specific decay factors under the connectedness model (top) and Jenga model (bottom), respectively. Black squares highlight the allele of the wild-type sequence at each position. (C) The Jenga prior is able to capture information about the three main fitness peaks of the GB1 landscape [58]. In each panel, dots represent genotypes and lines denote single point mutations between genotypes. Squared distances between dots optimally approximate the commute times between genotypes under a weak-mutation evolutionary model under selection for high phenotypic values [74, 93] (see Methods). Genotypes are colored by their correlations with three genotypes VDGV, WWLG and LICA, each representing a local fitness peak. Correlations were calculated using the Jenga kernel with hyperparameters inferred using evidence maximization. The complete GB1 fitness landscape for the evolutionary model was constructed using the maximum a posteriori estimate under the Jenga model. (D, E) Inferred mutation-specific decay factors under the general product model for positions 41 and 54. These matrices show great heterogeneity in the influence of mutations at a given site on the predictability of mutational effects at other sites both in terms of the degree of influence conferred by mutations between different pairs of alleles and in the effect of mutations between the same pair of alleles at different sites.
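The evaluation protocol in panel A (held-out R² as a function of training fraction, with error bars from three random training subsets) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: `fit` is a hypothetical callable that takes training pairs and returns a prediction function, and R² is assumed here to be the coefficient of determination, which the caption does not specify.

```python
import random
from statistics import mean, stdev


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    y_bar = mean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def learning_curve(data, fit, fractions, n_repeats=3, seed=0):
    """Return {fraction: (mean R^2, sd of R^2)} over random train/test splits.

    data: list of (sequence, phenotype) pairs.
    fit:  hypothetical callable; fit(train) returns a function sequence -> prediction.
    """
    rng = random.Random(seed)
    curve = {}
    for frac in fractions:
        scores = []
        for _ in range(n_repeats):
            shuffled = data[:]
            rng.shuffle(shuffled)
            n_train = max(1, int(frac * len(shuffled)))
            train, test = shuffled[:n_train], shuffled[n_train:]
            predict = fit(train)
            y_true = [y for _, y in test]
            y_pred = [predict(x) for x, _ in test]
            scores.append(r_squared(y_true, y_pred))
        # Mean and sample standard deviation across the repeated splits.
        curve[frac] = (mean(scores), stdev(scores))
    return curve
```

Any regression model can be plugged in through `fit`; the fractions must be below 1 so that the held-out set is non-empty.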

Application to the high-throughput dataset for the 5′ splice site of SMN1 exon 7.
(A) Model predictive performance, measured by the R² on held-out data, as a function of the fraction of data used for training. Error bars represent one standard deviation across three independent subsets of sequences used for training at each proportion. (B) Empirical correlations between pairs of sequences. Black dots show the average correlation for all pairs of sequences at a given Hamming distance. Gray dots show the correlation between genotypes segregating at specific sets of positions for each distance class. Values on the x-axis were jittered to facilitate visualization. (C) Inferred position- and allele-specific decay factors under the connectedness model (top) and Jenga model (bottom), respectively. Black squares highlight the canonical 5′ splice site nucleotides complementary to the U1 snRNA template. (D) Two-dimensional histogram comparing the observed Percent Spliced In (PSI) values against predictions of the Jenga model on test 5′ splice sites, comprising 10% of the measured genotypes.
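The per-distance averages in panel B can be illustrated with a small sketch. It assumes one common estimator, the mean product of centered phenotypes over all pairs at Hamming distance d divided by the overall variance; the caption does not state which estimator the authors used, and the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def correlation_by_distance(phenotypes):
    """Map each Hamming distance d to the empirical phenotypic correlation
    between sequence pairs at that distance.

    phenotypes: dict mapping sequence -> measured phenotype.
    """
    seqs = list(phenotypes)
    m = sum(phenotypes.values()) / len(seqs)
    var = sum((phenotypes[s] - m) ** 2 for s in seqs) / len(seqs)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, b in combinations(seqs, 2):
        d = hamming(a, b)
        sums[d] += (phenotypes[a] - m) * (phenotypes[b] - m)
        counts[d] += 1
    return {d: sums[d] / counts[d] / var for d in sorted(sums)}


# Toy additive landscape on two biallelic sites (made-up values).
toy = {"AA": 0.0, "AC": 1.0, "CA": 1.0, "CC": 2.0}
print(correlation_by_distance(toy))
```

The loop visits every pair of sequences, so the cost grows quadratically with the number of measured genotypes; for large libraries a subsample of pairs would be needed.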

Application to the AAV2 capsid protein.
(A) Model predictive performance measured by test R² on held-out data, shown as a function of the proportion of training data. Error bars represent one standard deviation across three random subsets of training sequences at each proportion. (B) Inferred decay factors for each of the 28 positions under the connectedness model (top) and for 20 amino acids at each position under the Jenga model (bottom). Black squares indicate the wild-type allele at each position. (C) Distribution of raw DMS scores for sequences with Asn at position 569 (dark gray) compared to sequences with other amino acids at that site (light gray). (D) Raw DMS scores as a function of net charge across amino acids 579–588, indicating that intermediate charge levels are associated with higher viral production. (E) Inferred mutation-specific decay factors from the general product model for position 576. (F) Distribution of raw DMS scores for sequences with an aromatic residue (Tyr, Phe, or Trp) at position 576 (dark gray) compared to sequences with other amino acids at that site (light gray).
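The charge summary in panel D can be approximated with a short sketch that groups scores by the net charge of a segment. The charge assignment is an assumption (Lys and Arg as +1, Asp and Glu as −1, all other residues including His as neutral); the caption does not state the convention used, and the function names are illustrative.

```python
from collections import defaultdict

# Assumed side-chain charges; every other residue counts as 0.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def net_charge(segment):
    """Net charge of an amino acid segment given in one-letter codes."""
    return sum(CHARGE.get(aa, 0) for aa in segment)


def mean_score_by_charge(scores):
    """Average score for each net-charge class.

    scores: dict mapping segment sequence -> score (e.g. a DMS score).
    """
    grouped = defaultdict(list)
    for segment, score in scores.items():
        grouped[net_charge(segment)].append(score)
    return {q: sum(v) / len(v) for q, v in sorted(grouped.items())}
```

For the figure, `segment` would be the residues at positions 579–588 of each variant.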

Connectedness regression applied to genome-wide genotype-phenotype data measuring relative fitness of Saccharomyces cerevisiae under lithium exposure.
(A) Model predictive performance, measured by the R² on held-out data, as a function of the fraction of data used for training. Error bars represent one standard deviation over three independent subsets of sequences used for training at each proportion. (B) Inferred decay factors for the 83 QTLs across the genome under the connectedness model. QTLs are named by the gene at which they are located or their closest gene in the reference genome S288C. (C) Fitness distribution of segregants stratified by their allele at the ENA1 locus. (D) Representation of the complete genotype-phenotype map across the 16 loci with the highest inferred decay factors (mapped to the closest genes: ENA1, HAL9, MKT1, PHO84, HAP1, HAL5, TAO3, BUL2, PTR2, NRT1, SUP45, DPH5, MLF3, SUS1, IRA2, VIP1) and an additional pseudo-locus representing the genetic background across all other loci (RM vs. BY). The posterior mean fitnesses of the 2¹⁷ = 131,072 possible genotypes are plotted against their Hamming distances to the genotype with the highest predicted fitness when combined with the ENA1^RM variant. Nodes represent genotypic combinations and are colored according to the allele at the ENA1 locus. Edges connect genotypes separated by single point mutations. Values on the x-axis were jittered to facilitate visualization. (E) Comparison of the estimated mutational effects in the RM background in the presence of RM vs. BY alleles at the ENA1 locus. Labeled QTLs highlighted in black correspond to loci with the largest decay factors after ENA1, as shown in (B). Error bars represent one standard deviation of the posterior distribution.
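The genotype space in panel D (16 loci plus one pseudo-locus, each with two states) can be enumerated directly. The sketch below encodes each genotype as a 17-bit integer, which is an illustrative choice rather than the authors' representation, and tabulates Hamming distances to a reference genotype along with the single-mutation neighbors that define the edges.

```python
from collections import Counter
from math import comb

N_LOCI = 17  # 16 QTLs plus one pseudo-locus for the genetic background


def hamming(g, h):
    """Hamming distance between two genotypes encoded as bit strings."""
    return bin(g ^ h).count("1")


def neighbors(g, n_loci=N_LOCI):
    """Genotypes that differ from g at exactly one locus."""
    return [g ^ (1 << i) for i in range(n_loci)]


def distance_classes(reference, n_loci=N_LOCI):
    """Count genotypes at each Hamming distance from the reference."""
    return Counter(hamming(g, reference) for g in range(2 ** n_loci))


classes = distance_classes(reference=0)
# All 2^17 genotypes are accounted for, and each distance class d
# contains C(17, d) genotypes.
assert sum(classes.values()) == 2 ** N_LOCI
assert all(classes[d] == comb(N_LOCI, d) for d in range(N_LOCI + 1))
```

Here the reference is an arbitrary placeholder; in the figure it would be the genotype with the highest predicted fitness.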
