Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife’s peer review process.

Editors
- Reviewing Editor: Anne-Florence Bitbol, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
- Senior Editor: Alan Moses, University of Toronto, Toronto, Canada
Reviewer #1 (Public review):
Summary:
Zhou and colleagues introduce a series of generalized Gaussian process models for genotype-phenotype mapping. The goal was to develop models more powerful than standard linear models while retaining explanatory power, in contrast to neural network approaches. The novelty stems from choices of prior distributions (and, presumably, fitted posteriors) that model epistasis through some form of site/allele-specific modifier effect and genotype distance. The authors then apply their models to three empirical datasets (the GB1 antibody-binding dataset, the human 5' splice site dataset, and a yeast meiotic cross dataset) and find substantially improved variance explained, while retaining strong explanatory power, when compared to linear models.
Strengths:
The main strength of the manuscript lies in the development of the modeling approaches, as well as the evidence from the empirical dataset that the variance explained is improved.
Weaknesses:
The main weakness of the paper is that none of the models were tested on an in silico dataset where the ground truth is known. Therefore, it is unclear if their model actually retains any explanatory power.
Impact:
Genotype-phenotype mapping is a central problem of genetics. However, the underlying function is complex and unknown. Simple linear models can uncover some functional links between genes and their effects, but do so through severe oversimplification of the system. Neural networks, on the other hand, can in principle model the function perfectly, but they do so without easy interpretation. Gaussian process regression is another approach that improves on linear regression, allowing a better fit to the data while keeping the underlying alleles and their effects interpretable. This approach, now computable with state-of-the-art algorithms, will advance the field of genotype-to-phenotype association.
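The contrast drawn above (linear models underfit epistasis, Gaussian processes can capture it) can be illustrated with a toy example. The sketch below is not the authors' models: it uses scikit-learn's generic `GaussianProcessRegressor` with an RBF kernel on simulated binary genotypes whose phenotype includes one pairwise epistatic term, purely to show why a nonlinear kernel method can explain variance that a linear model cannot.

```python
# Illustrative sketch (not the paper's models): linear regression vs.
# generic Gaussian process regression on a simulated genotype-phenotype
# dataset containing one pairwise epistatic interaction.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
L = 8                                          # sites, binary alleles
X = rng.integers(0, 2, size=(500, L)).astype(float)

# Phenotype = additive effects + one epistatic term + small noise
beta = rng.normal(size=L)
y = X @ beta + 2.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

lin = LinearRegression().fit(X_tr, y_tr)
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                              normalize_y=True).fit(X_tr, y_tr)

print(f"linear R^2: {lin.score(X_te, y_te):.3f}")
print(f"GP     R^2: {gp.score(X_te, y_te):.3f}")
```

The linear model captures the additive effects but misprices the interaction term, while the GP can absorb it into the kernel, which is the qualitative gap the review describes.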
Reviewer #2 (Public review):
This paper builds on prior work by some of the same authors on how to model fitness landscapes in the presence of epistasis. They have previously shown how simply writing general expansions of fitness in terms of one-body plus two-body plus three-body, etc., terms often fails to generalize to good predictions. They have also previously introduced a Gaussian process regression approach built around a prior on how much epistasis there should be at each order.
This paper contains several main advances:
(1) They implement a more efficient form of the Gaussian process model fitting that uses GPUs and related algorithmic advances to enable better fitting of these models to datasets for larger sequences.
(2) They provide a software package implementing the above.
(3) They generalize the models to allow the extent of epistasis associated with changes in sequence to depend on specific sites, alleles, and mutations.
(4) They show modest improvements in prediction and substantial improvements in interpretability with the more generalized models above.
Overall, while this paper is quite technical, my assessment is that it represents a substantial conceptual and algorithmic advance for the above reasons, and I would recommend only modest revisions. The paper seems well-written and clear, given the inherent complexity of this topic.
Reviewer #3 (Public review):
Summary:
The authors propose three types of Gaussian process kernels that extend and generalize standard kernels used for sequence-function prediction tasks, giving rise to the connectedness, Jenga, and general product models. The associated hyperparameters are interpretable and represent epistatic effects of varying complexity. The proposed models significantly outperform the simpler baselines, including the additive model, pairwise interaction model, and Gaussian process with a geometric kernel, in terms of R^2.
Strengths:
(1) The demonstrated performance boost and improved scaling with increasing training data are compelling.
(2) The hyperparameter selection step using the marginal likelihood, as implemented by the authors, seems to yield a reasonable hyperparameter combination that lends itself to biologically plausible interpretations.
(3) The proposed kernels generalize existing kernels in domain-interpretable ways, and can correspond to cases that would not be "physical" in the original models (e.g., $\mu_p>1$ in the original connectedness model that allows modeling of anticorrelated phenotypes).
Weaknesses:
(1) While enabling uncertainty quantification is a key advantage of Gaussian processes, the authors do not present metrics specific to the predicted uncertainties; all metrics seem to concern the mean predictions only. It would be helpful to evaluate coverage metrics and maybe include an application of the uncertainties, such as in active learning or Bayesian optimization.
(2) The more complex models, like the general product model, place a heavier burden on the hyperparameter selection step. Explicitly discussing the optimization routine used here would be helpful to potential users of the method and code.
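The coverage evaluation suggested in point (1) is straightforward to implement. The sketch below is a hedged illustration with simulated predictive means and standard deviations standing in for actual model output: it computes the fraction of held-out observations falling inside the nominal ~95% Gaussian predictive interval, which should be close to 0.95 for well-calibrated uncertainties.

```python
import numpy as np

def empirical_coverage(y_true, mu, sigma, z=1.96):
    """Fraction of observations inside the nominal ~95% interval
    mu +/- z * sigma; near 0.95 indicates calibrated uncertainties."""
    inside = np.abs(y_true - mu) <= z * sigma
    return inside.mean()

# Simulated stand-in for a model's predictive output (illustrative only)
rng = np.random.default_rng(1)
mu = rng.normal(size=10_000)
sigma = np.full(10_000, 0.5)
y_true = mu + sigma * rng.normal(size=10_000)   # well-calibrated case

print(f"coverage: {empirical_coverage(y_true, mu, sigma):.3f}")
```

Reporting this alongside R^2 would address the reviewer's concern that all current metrics evaluate only the mean predictions.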