Benchmarking of signaling networks generated by large language models

  1. Department of Biomedical Engineering, University of Virginia, Charlottesville, United States

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Martin Graña
    Institut Pasteur de Montevideo, Montevideo, Uruguay
  • Senior Editor
    Aleksandra Walczak
    CNRS, Paris, France

Reviewer #1 (Public review):

Summary:

Large language models (LLMs) have developed rapidly in recent years and are already contributing to progress across scientific fields. The manuscript addresses a specific question: whether LLMs can accurately infer signaling networks from gene lists. However, the evaluation is inadequate due to the four major weaknesses described below. Despite these limitations, the authors conclude that current general-purpose LLMs lack adequate accuracy, which is already widely recognized. The manuscript's key contribution should instead be concrete recommendations for developing specialized LLMs for this task, yet such recommendations are entirely absent. Developing such specialized LLMs would be highly valuable, as they could substantially reduce the time researchers spend analyzing signaling networks.

Strengths:

The manuscript raises a good question: whether current LLMs can accurately generate signaling networks from gene lists.

Weaknesses:

(1) The authors evaluate LLM performance using only three signaling networks: "hypertrophy", "fibroblast", and "mechanosignaling". Given the large number of well-established signaling pathways available, this is not a comprehensive assessment. Moreover, the analysis need not be restricted to signaling networks. Other network types, including metabolic and transcriptional regulatory networks, are already accessible in well-known databases such as KEGG, Reactome, BioCyc, WikiPathways, and Pathway Commons. Including these additional networks would substantially strengthen the evaluation.

(2) In the LLM evaluation, the authors use gene lists that exactly match those in their "ground truth" networks, thereby fixing the set of nodes and evaluating only the predicted edges. In practical research, however, the relevant genes or nodes are not fully known. A more realistic assessment would therefore use gene lists containing both genes present in the ground-truth network and additional genes absent from it, to evaluate the LLM's ability to exclude irrelevant genes.

(3) The authors report only the recall/sensitivity of the LLM, without assessing specificity. In practical applications, if an LLM generates a large number of incorrect interactions that greatly exceed the correct ones, researchers may be misled or may lose confidence in the LLM output. Therefore, a comprehensive evaluation must include both sensitivity and specificity. Furthermore, it would be informative to check whether some of the "false positives" might in fact represent biologically plausible interactions that are absent from the manually curated "ground truth". Manually generated "ground truth" can overlook genuine interactions, and the ability of LLMs to recover such missing edges could be particularly valuable. This may even represent one of the most important potential contributions of LLMs.
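The distinction the reviewer draws can be made concrete. The sketch below (not the authors' code; the edge sets and gene names are invented for illustration) computes edge-level sensitivity, specificity, and precision for a predicted network against a curated reference, treating edges as ordered (source, target) pairs over a fixed node set:

```python
# Hypothetical illustration: edge-level metrics for a predicted signaling
# network versus a manually curated ground truth. Gene names are examples only.

def edge_metrics(predicted, ground_truth, all_possible):
    """Sensitivity, specificity, and precision over a fixed node set."""
    tp = len(predicted & ground_truth)                 # correctly recovered edges
    fp = len(predicted - ground_truth)                 # spurious edges
    fn = len(ground_truth - predicted)                 # missed edges
    tn = len(all_possible - predicted - ground_truth)  # correctly absent edges
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity, precision

nodes = ["TGFB", "TGFBR", "SMAD3", "CTGF"]
all_possible = {(a, b) for a in nodes for b in nodes if a != b}
ground_truth = {("TGFB", "TGFBR"), ("TGFBR", "SMAD3"), ("SMAD3", "CTGF")}
predicted = {("TGFB", "TGFBR"), ("TGFBR", "SMAD3"), ("TGFB", "CTGF")}

sens, spec, prec = edge_metrics(predicted, ground_truth, all_possible)
```

Reporting sensitivity alone, as the manuscript does, would hide the false-positive edge here; precision (or specificity over the set of possible edges) is what captures it.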

(4) It is widely known that applying differential equation models to highly complex biological networks, such as the three networks in the manuscript, is meaningless, because these systems involve a large number of parameters whose values can drastically alter the results. As John von Neumann reportedly said: "with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." Thus, the evaluation of LLMs on "logic-based differential equation models" does not make much sense.
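For context, logic-based differential equation models of the kind referenced here are commonly written in a normalized-Hill form; the notation below is a generic sketch of that formalism, not taken from the manuscript:

```latex
\frac{dy_i}{dt} = \frac{1}{\tau_i}\left( w \, f_{\mathrm{act}}(x) \, y_i^{\max} - y_i \right),
\qquad
f_{\mathrm{act}}(x) = \frac{B \, x^{n}}{K^{n} + x^{n}},
```

where $B$ and $K$ are constrained so that $f_{\mathrm{act}}(0)=0$, $f_{\mathrm{act}}(1)=1$, and $f_{\mathrm{act}}(\mathrm{EC}_{50})=0.5$. Each reaction thus carries a time constant $\tau$, a Hill coefficient $n$, a half-maximal activation $\mathrm{EC}_{50}$, and a weight $w$, which are precisely the parameters whose number and sensitivity the reviewer's criticism targets.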

Reviewer #2 (Public review):

Summary:

The authors evaluate whether commonly used LLMs (ChatGPT, Claude, and Gemini) can reconstruct signalling networks and predict the effects of network perturbations, and they propose a pipeline for benchmarking future models aimed at reconstructing signalling networks. Across three phenotypes (hypertrophy, fibroblast signalling, and mechanosignalling), the LLMs capture upstream ligand-receptor interactions and conserved crosstalk but fail to recover downstream transcriptional programmes. Logic-based simulations show that LLM-derived networks underperform manually curated models.

Strengths:

The authors compare the outcomes from three LLMs with three manually curated and validated models. Additionally, they have investigated gene network reconstruction in the context of three distinct phenotypes. Using logic-based modelling, the authors assessed how LLM-derived networks predict perturbation effects, providing functional validation beyond network overlap.

Weaknesses:

The authors used legacy versions of all three LLMs; the study would benefit from testing the current versions (ChatGPT 5.2, Claude 4.5, and Gemini 2.5). Additional metrics such as node coverage, node invention, direction accuracy, and sign accuracy would enable more robust comparisons across models.
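The additional metrics suggested above could be computed along the following lines. This is a hedged sketch, not the authors' pipeline: edges are represented as (source, target, sign) triples with sign in {+1, -1}, and all gene names are invented for illustration.

```python
# Hypothetical sketch of the reviewer's suggested metrics for comparing an
# LLM-derived signed, directed network against a curated reference.

def network_metrics(predicted, reference):
    ref_nodes = {n for s, t, _ in reference for n in (s, t)}
    pred_nodes = {n for s, t, _ in predicted for n in (s, t)}
    node_coverage = len(pred_nodes & ref_nodes) / len(ref_nodes)   # reference nodes recovered
    node_invention = len(pred_nodes - ref_nodes) / len(pred_nodes) # nodes absent from reference

    ref_dir = {(s, t) for s, t, _ in reference}
    ref_sign = {(s, t): sg for s, t, sg in reference}
    # Predicted edges whose node pair appears in the reference in either orientation.
    pairs = [(s, t, sg) for s, t, sg in predicted
             if (s, t) in ref_dir or (t, s) in ref_dir]
    # Of those, the edges with the correct orientation.
    shared = [(s, t, sg) for s, t, sg in predicted if (s, t) in ref_dir]
    direction_acc = len(shared) / len(pairs) if pairs else 0.0
    # Of correctly directed edges, the fraction with the correct sign.
    sign_acc = (sum(1 for s, t, sg in shared if ref_sign[(s, t)] == sg)
                / len(shared)) if shared else 0.0
    return node_coverage, node_invention, direction_acc, sign_acc

reference = {("TGFB", "TGFBR", +1), ("TGFBR", "SMAD3", +1), ("SMAD7", "TGFBR", -1)}
predicted = {("TGFB", "TGFBR", +1), ("SMAD3", "TGFBR", +1),
             ("SMAD7", "TGFBR", +1), ("GENEX", "TGFBR", +1)}

cov, inv, dir_acc, sign_acc = network_metrics(predicted, reference)
```

Separating node-level from edge-level error in this way would show, for example, whether a model invents genes outright or merely reverses or mis-signs real interactions.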
