Abstract
Computational models of signaling networks provide frameworks for predicting how molecular cues guide cell decisions, but they are typically limited by manual curation from an incomplete literature. Here, we test whether general-purpose large language models (LLMs) generate accurate models of signaling networks. We find that general-purpose LLMs recover 24-58% of the reactions in literature-curated networks for cardiomyocyte hypertrophy, myofibroblast activation, and mechano-signaling, and predict network responses to perturbations with accuracies of 5-26%. While current general-purpose LLMs generate signaling networks with limited accuracy, this study provides a pipeline and benchmarks to guide future improvements.
Main
The response of cells to external stimuli is a fundamental characteristic of life, mediated by complex molecular networks of signal transduction and gene regulation. Computational models of signaling networks have been widely used to understand their systems-level properties and guide new experiments1,2. However, a central hurdle is that computational models of signaling networks typically require substantial manual literature curation of network reactions and rate constants, which is time-intensive and requires extensive experimental knowledge3. Further, existing data are often incomplete.
AI, and large language models (LLMs) in particular, have the potential to outperform labor-intensive human processes of knowledge curation. For example, general-purpose LLMs have scored highly on the U.S. Medical Licensing Exam4, law exams5, and Advanced Placement exams in many subjects5. AI has also made substantial contributions in biology, including predicting protein structure6,7 and cell types8,9, and is expected to play a central role in the development of comprehensive virtual cells10–12. However, to our knowledge, LLM-generated predictive models of signaling networks have not been reported or benchmarked. Here, we develop a pipeline for LLM-generated models of signaling networks, and we test whether general-purpose LLMs generate networks with accurate structure and accurate predictions of perturbation responses, as benchmarked against three manually curated and validated models13–15.
We first established a computational pipeline to generate network models from LLMs using a given gene set (Figure 1A). For benchmarking purposes, here the gene set is based on annotations of manually curated network models that have been extensively validated. However, future studies could derive gene sets from other sources such as proteomics or transcriptomics. We optimized a programmatic LLM prompt that iteratively queried for directed (activation or inhibition) interactions among the gene products in the context of a particular phenotype (here “hypertrophy”, “fibroblast”, or “mechanosignaling”). With text processing, we extracted the returned directed interactions to examine network structure and then converted the network into the format needed to simulate network responses to perturbations, here with logic-based differential equations in Netflux16. LLM-based models were generated 10 times per network to examine reproducibility.

Figure 1. Signaling networks generated by general-purpose large language models.
A) Schematic of pipeline for LLM-generated models of signaling networks. B) Network reactions recalled by three large language models (Gemini 2.0, orange; GPT 4.1, blue; Claude 3.7, green) compared with a “Ground Truth” literature-curated and validated cardiomyocyte hypertrophy signaling network13 (gray reactions). LLM networks were generated using iterative prompts based on the gene set of the Ground Truth hypertrophy network. Summary of reaction recall accuracy for three literature-curated signaling networks (C, hypertrophy13; D, fibroblast14; and E, mechanosignaling15) by Gemini, GPT, and Claude. * Indicates p < 10−9 in one-sample t-test between LLM-generated replicates (n = 10) and the ground truth network.
We first tested the extent to which three general-purpose LLMs (GPT 4.1, OpenAI; Gemini 2.0, Google; and Claude 3.7, Anthropic) could reconstruct the structure of the manually curated cardiomyocyte hypertrophy signaling network13, based on its gene annotations alone. Overall, all three LLMs captured aspects of this manually curated network, but with varying coverage depending on the region of the network and the particular LLM. In particular, GPT 4.1, Gemini 2.0, and Claude 3.7 all performed fairly well at reconstructing upstream ligand-receptor interactions (e.g. EGF to EGFR) and well-established second messenger cascades such as the β-adrenergic receptor to cAMP to CREB pathway or phospholipase C to diacylglycerol to PKC. They also accurately predicted the structure of highly conserved signaling axes such as PI3K-Akt-mTOR and calcium-calmodulin-calcineurin-NFAT. In contrast, the three LLMs were less able to reconstruct regulation of downstream transcription factors (e.g. cFOS, NFAT, cJUN, MEF2), gene expression (SERCA, αMHC, βMHC, natriuretic peptides ANP/BNP), or cell area/hypertrophy. The lower reconstruction accuracy of downstream signaling may arise because these regions contain more cardiomyocyte-specific reactions, but such gaps could also represent a general challenge for LLMs in reconstructing gene regulation from the literature. Overall, the three LLMs recovered 26.70-58.12% of network reactions in the cardiomyocyte hypertrophy network. The top-performing LLM was Claude 3.7, which was reproducible across repeated queries.
To what extent is the limited recall of LLM-generated networks specific to cardiomyocyte hypertrophy? To address this question, we used GPT 4.1, Gemini 2.0, and Claude 3.7 to also reconstruct network models of fibroblast activation14 (91 nodes, 134 reactions) (Supplementary Figure 1) and mechanosignaling15 (94 nodes, 125 reactions) (Supplementary Figure 2). Like the hypertrophy model, these networks were previously manually reconstructed from the literature, translated into predictive logic-based models, and had their predictions validated against substantial experiments from the literature that were not used to develop the models. Consistent with the hypertrophy network, LLM-generated models of fibroblast activation and mechanosignaling reconstructed 25.93-50.00% and 24.42-48.26% of the reactions in their manually curated counterparts, respectively. Across all three networks, Claude 3.7 and GPT 4.1 outperformed Gemini 2.0, but all showed limited recall of the overall network.
We then asked whether the regional coverage of LLM-generated networks of fibroblasts and mechanosignaling mirrors that of cardiomyocytes. Indeed, the LLM-generated fibroblast network was most effective at reconstructing ligand-receptor interactions and highly conserved pathways such as cAMP and Ras/Raf/ERK, but could not recall AT1R signaling, PDGFR-abl signaling, and especially downstream regulation of transcription factors, genes, and extracellular matrix (e.g. the MMP-collagen subnetwork) (Supplementary Figure 1). Unlike hypertrophy and fibroblast networks, mechanosignaling is mediated by stretch-dependent proteins that are relatively less characterized. Indeed, the LLM-generated networks for mechanosignaling did not adequately reconstruct direct mechano-sensation and, consistent with the other networks, had limited coverage of gene regulation (Supplementary Figure 2). Still, core conserved pathways for Na/Ca regulation, regulation of protein synthesis, and MAPK signaling were reconstructed well.
Models of signaling networks are useful not just for their description of structure, but especially for their ability to simulate the response to new perturbations. Making such predictions requires a specific mathematical formalism. For the three networks used for benchmarking, we had previously automatically translated them into logic-based differential equation models17 using the software Netflux16. This allows network-wide prediction of dynamics in response to any desired perturbations and does not require prior knowledge of rate constants or concentrations17. The predictions of these three logic-based models have also been highly validated against experiments not used to develop the model13–15,18. Therefore, we generated logic-based models based on all of the LLM-generated network structures described above.
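In this formalism, each node's normalized activity evolves by a normalized-Hill differential equation; the following is a sketch for a single activating input x with reaction weight w, following the conventions of refs. 16,17:

```latex
\frac{dy}{dt} = \frac{1}{\tau}\left( w\, f_{\mathrm{act}}(x)\, y_{\max} - y \right),
\qquad
f_{\mathrm{act}}(x) = \frac{\beta\, x^{n}}{K^{n} + x^{n}},
\qquad
\beta = \frac{EC_{50}^{n}-1}{2\,EC_{50}^{n}-1},
\quad
K = \left(\beta - 1\right)^{1/n},
```

where the constraints on β and K ensure f_act(0) = 0, f_act(1) = 1, and f_act(EC50) = 0.5; multiple inputs combine through logic gates (AND as the product f_A f_B, OR as f_A + f_B − f_A f_B).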
Returning our focus to the hypertrophy network, we examined the ability of LLM-generated networks to predict the classic “fetal gene program” gene expression signature in cardiomyocytes. The manually curated model accurately predicts that AngII and/or ISO increase the expression of ANP, BNP, GATA4, βMHC, and sACT and increase cell area/size, while decreasing expression of SERCA and αMHC. In contrast, the LLM-generated network models reproduce only limited aspects of this signature, with the best performance in predicting ISO-dependent ANP expression by the GPT version (Figure 2A). We then tested whether this validation accuracy scales across all available manually curated experiments, which cover a wider array of perturbations and measured outputs. Overall, for 114 validations where the manually curated network has a functional accuracy of 94.74%, LLM-generated networks have accuracies of 6.14-26.32% (Figure 2B).

Figure 2. Experimental validation of perturbation responses predicted by LLM-generated signaling network models.
A) Representative validations of network models generated by manual curation or by LLMs (Gemini, GPT, Claude), in comparison to experiments from the literature13 under conditions of angiotensin II (AngII) or isoproterenol (ISO). B) Summary of systematic validations of manually curated (Ground Truth) and LLM-generated network models of hypertrophy, fibroblast, and mechano-signaling against perturbation experiments from the literature (n = 114, 83, and 171 experiments, respectively). * Indicates p < 10−11 in one-sample t-test between LLM-generated model validation scores (n = 10 replicates) and ground truth model validation accuracy.
Repeating this systematic experimental validation for LLM-generated models of the fibroblast (Figure 2C) and mechano-signaling (Figure 2D) networks, we observed similar validation accuracies of 16.87-21.69% and 5.81%, respectively. Together, these results indicate that the validation accuracy of LLM-generated models for perturbation responses is even lower than their accuracy for reconstructing network structure. Because the validations test longer-range functional interactions, this difference may be caused by gaps in reconstructed pathways.
In conclusion, we developed a pipeline for generating predictive, logic-based models of signaling networks using general-purpose large language models (LLMs). To test the reconstruction and simulation accuracy of LLM-generated network models, we benchmarked this pipeline using three LLMs (GPT 4.1, Claude 3.7, and Gemini 2.0) against three large-scale, manually curated and validated models of signaling networks. Across cardiac hypertrophy, fibroblast, and mechano-signaling networks, LLM-generated models reconstructed 24.42-58.12% of reactions from the manually curated models. Reconstruction accuracy was high for ligand-receptor interactions and well-conserved pathways, but lower for cell-specific gene regulation and biophysical regulation. Likely due to gaps in reconstructed pathways, LLM-generated models had a low ability to accurately simulate the responses of networks to perturbations. Overall, we provide a pipeline for LLM-generated models of signaling networks and benchmark their performance against three manually curated networks. Improving the performance of LLM-generated network models may require hybrid manual-LLM approaches or advances in special-purpose AI models.
Methods
Prior knowledge models of signaling networks
To benchmark LLM-generated network models, we selected three large-scale network models based on manual curation of directed interactions from the literature13–15. These models were recently used to benchmark protein interaction databases19, providing a point of comparison for the current analysis. They contain a range of biological processes including ligand-receptor interactions, phosphorylation, gene regulation, and cellular phenotypes. Each network structure is accompanied by an annotated list of genes associated with the nodes in the model. These models were mathematically formulated as systems of logic-based differential equations17, which enable dynamic network-wide predictions of how those cells respond to new perturbations. The predictive accuracy of these models has previously been rigorously tested with manual curation of literature not used to develop the models, with accuracies of 94.74% (hypertrophy), 81.93% (fibroblast), and 77.33% (mechanosignaling)13–15.
Querying large language models
We focused on widely used, general-purpose large language models (LLMs) that are most relevant to wider use in the field. Specifically, we selected current versions of GPT (4.1, OpenAI), Gemini (2.0, Google), and Claude (3.7, Anthropic). The prompt consisted of three steps:
1. The list of gene symbols, along with the description: “List of genes and other signaling nodes:”.
2. Description of the desired outputs: “For the first {batch_size} entries in this list of genes, proteins, and other signaling nodes from a {phenotype} network, please provide more than 0 but fewer than {max_connections} direct interactions with other nodes from the list supported by available literature. Simply list the input node, affected node, and if the affected node is stimulated / inhibited.” Batch size was set at 20 genes, max connections set to the maximum number of connections exhibited in that parent network, and phenotype set at “cardiac hypertrophy”, “fibroblast”, or “mechanosignaling” corresponding to the three networks used for benchmarking.
3. Instructions to repeat the process until reaching the end of the gene list: “That looks great! Please do the same operation for the next {len(chunk)} nodes! Thank you”
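As a sketch, these three prompt steps can be assembled programmatically; the function name and list handling below are illustrative (only the quoted prompt strings come from our protocol):

```python
def build_prompts(genes, phenotype, batch_size, max_connections):
    """Assemble the three-step prompt sequence: (1) the full gene list,
    (2) a request for interactions among the first batch, and (3) a repeated
    follow-up for each remaining batch of genes."""
    prompts = ["List of genes and other signaling nodes:\n" + "\n".join(genes)]
    prompts.append(
        f"For the first {batch_size} entries in this list of genes, proteins, "
        f"and other signaling nodes from a {phenotype} network, please provide "
        f"more than 0 but fewer than {max_connections} direct interactions "
        "with other nodes from the list supported by available literature. "
        "Simply list the input node, affected node, and if the affected node "
        "is stimulated / inhibited.")
    for start in range(batch_size, len(genes), batch_size):
        chunk = genes[start:start + batch_size]
        prompts.append("That looks great! Please do the same operation for "
                       f"the next {len(chunk)} nodes! Thank you")
    return prompts
```

Each prompt is sent in sequence within a single conversation, and the replies are stored for parsing.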
The responses were stored as .txt files and then programmatically extracted using regular expressions.
Extracting reactions from LLM outputs, logic-based model generation
Each LLM returned network reactions in a slightly different format. For example, Claude 3.7 returned reactions in the form “ADRB2 stimulates RAC1” or “RAC1 stimulates MAP3K1”. Relationships with a directional sign were kept, such as those using the terms “stimulates”, “down-regulates”, or “generates”. Returned reactions referring to non-relevant proteins or processes were discarded, as were non-directional relationships. We also attempted querying for PubMed IDs, but these were rarely accurate.
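This filtering step can be sketched with a regular expression over signed verbs; the verb lists below are illustrative examples rather than the exact vocabulary used:

```python
import re

# Illustrative signed-verb vocabulary (an assumption; the full lists differed
# slightly per LLM). Unsigned verbs such as "interacts with" are not matched.
ACTIVATING = ("stimulates", "activates", "up-regulates", "generates")
INHIBITING = ("inhibits", "down-regulates", "represses")

EDGE_RE = re.compile(
    r"(?P<src>[A-Za-z0-9_-]+)\s+(?P<verb>"
    + "|".join(ACTIVATING + INHIBITING)
    + r")\s+(?P<dst>[A-Za-z0-9_-]+)")

def extract_edges(text, nodes):
    """Return (source, sign, target) triples, discarding any edge that
    mentions a protein outside the model's node list."""
    edges = []
    for m in EDGE_RE.finditer(text):
        if m["src"] in nodes and m["dst"] in nodes:
            sign = "activates" if m["verb"] in ACTIVATING else "inhibits"
            edges.append((m["src"], sign, m["dst"]))
    return edges
```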
Reactions successfully extracted from the LLM outputs were formatted for uniform comparison to the reaction list from the corresponding manually curated network. Finally, this reaction list was translated into the Netflux format for automated conversion to a logic-based differential equation model16,17. This final logic-based model was used to perform validation simulations.
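Reaction recall against the curated network can then be scored by set comparison of signed reactions; a minimal sketch, assuming reactions are stored as (source, sign, target) triples:

```python
def reaction_recall(llm_edges, ground_truth_edges):
    """Fraction of ground-truth signed reactions recovered by the
    LLM-generated network (node names case-normalized before comparison)."""
    def norm(edge):
        src, sign, dst = edge
        return (src.upper(), sign, dst.upper())
    truth = {norm(e) for e in ground_truth_edges}
    return len(truth & {norm(e) for e in llm_edges}) / len(truth)
```

For example, an LLM network that recovers one of two curated reactions scores a recall of 0.5.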
Supplementary Information

Supplementary Figure 1. Visualization of LLM-generated fibroblast signaling networks, as recalled by three general-purpose large language models.
Network reactions recalled by three large language models (Gemini 2.0, orange; GPT 4.1, blue; Claude 3.7, green) compared with a “Ground Truth” literature-curated and validated fibroblast signaling network (gray reactions). LLM-generated networks used prompts based on the gene set of the Ground Truth fibroblast network. This visualization corresponds to the analyses in Figure 1D.

Supplementary Figure 2. Visualization of LLM-generated mechanosignaling networks, as recalled by three general-purpose large language models.
Network reactions recalled by three large language models (Gemini 2.0, orange; GPT 4.1, blue; Claude 3.7, green) compared with a “Ground Truth” literature-curated and validated mechanosignaling network (gray reactions). LLM-generated networks used prompts based on the gene set of the Ground Truth mechanosignaling network. This visualization corresponds to the analyses in Figure 1E.
Data availability
The code used in this project is freely available on GitHub at: https://github.com/saucermanlab/LLM-network-generation
Additional information
Funding
National Institutes of Health (R01HL162925)
Jeffrey J Saucerman
National Institutes of Health (R01HL160665)
Jeffrey J Saucerman
National Institutes of Health (R01HL172417)
Jeffrey J Saucerman
National Institutes of Health (T32GM156694)
Jeevan Tewari
References
- 1. A Whole-Cell Computational Model Predicts Phenotype from Genotype. Cell 150:389–401.
- 2. Network Modeling of TGFβ Signaling in Hepatocellular Carcinoma Epithelial-to-Mesenchymal Transition Reveals Joint Sonic Hedgehog and Wnt Pathway Activation. Cancer Research 74:5963–5977.
- 3. The Evolution of Systems Biology and Systems Medicine: From Mechanistic Models to Uncertainty Quantification. Annu Rev Biomed Eng 27:425–447.
- 4. Toward expert-level medical question answering with large language models. Nat Med 31:943–950.
- 5. GPT-4 Technical Report. arXiv. https://doi.org/10.48550/arXiv.2303.08774
- 6. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630:493–500.
- 7. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378:49–56.
- 8. Transfer learning enables predictions in network biology. Nature 618:616–624.
- 9. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 21:1470–1480.
- 10. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell 187:7045–7063.
- 11. Virtual Cell Challenge: Toward a Turing test for the virtual cell. Cell 188:3370–3374.
- 12. Toward a foundation model of causal cell and tissue biology with a Perturbation Cell and Tissue Atlas. Cell 187:4520–4545.
- 13. Network reconstruction and systems analysis of cardiac myocyte hypertrophy signaling. J Biol Chem 287:42259–42268.
- 14. A computational model of cardiac fibroblast signaling predicts context-dependent drivers of myofibroblast differentiation. J Mol Cell Cardiol 94:72–81.
- 15. Predictive model identifies key network regulators of cardiomyocyte mechano-signaling. PLoS Comput Biol 13:e1005854.
- 16. Logic-based modeling of biological networks with Netflux. PLoS Comput Biol 21:e1012864.
- 17. Modeling cardiac β-adrenergic signaling with normalized-Hill differential equations: comparison with a biochemical model. BMC Syst Biol 4:157.
- 18. Context-specific network modeling identifies new crosstalk in β-adrenergic cardiac hypertrophy. PLoS Comput Biol 16:e1008490.
- 19. Benchmarking of protein interaction databases for integration with manually reconstructed signalling network models. J Physiol. https://doi.org/10.1113/JP284616
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.109709. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2026, Tewari et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.