Critique of impure reason: Unveiling the reasoning behaviour of medical large language models
Figures
A graphical abstract illustrating the current state of medical large language models (LLMs) in the context of reasoning behaviour.
An illustration of the contrast in modalities between computer vision and natural language processing.
A schematic diagram illustrating different strategies for solving problems using large language models (LLMs).
Each rectangular box represents a ‘thought’—a meaningful segment of language that functions as an intermediate step in the reasoning or problem-solving process.
A graphical representation of reasoning, reasoning outcome, and reasoning behaviour.
Reasoning encapsulates the process of drawing conclusions, arriving at a reasoning outcome. At a more fundamental level, reasoning behaviour describes the logical flow through the system that occurs during reasoning.
Two frameworks with a focus on exposing reasoning behaviour.
Note that the two frameworks are independent but are shown together to facilitate comparison. Top: input data are standardised and fed to tree-based models, whose deterministic structure is exploited to make reasoning behaviour transparent. Bottom: an integrative framework combining the complementary strengths of LLMs and symbolic reasoning. The medical LLM extracts diagnostic rules from clinical algorithms, along with its chain-of-thought (CoT) reasoning and attention weights. These diagnostic rules, together with patient case inputs, are provided to the symbolic solver, which determines the final diagnosis and generates inference chains as its reasoning trace.
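To make the top framework concrete, the short sketch below fits a small decision tree on toy, standardised patient features and prints its learned rules. The feature names and data are illustrative assumptions, not taken from any reviewed system.

```python
# A minimal sketch of the top framework, assuming invented feature names and
# toy data: a decision tree's deterministic structure makes its reasoning
# behaviour directly auditable.
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["high_ICP", "blurred_vision", "nausea", "ct_bleeding"]

# Standardised patient inputs (rows) over the binary features above.
X = [[1, 1, 1, 0],
     [1, 1, 1, 1],
     [0, 0, 1, 0],
     [0, 1, 0, 0]]
y = ["ischaemic stroke", "haemorrhagic stroke", "gastritis", "migraine"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Every prediction follows one explicit root-to-leaf path, so the printed
# rules *are* the model's reasoning behaviour.
print(export_text(tree, feature_names=FEATURES))
```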
Tables
A table showing types of reasoning, their definition and examples.
Reasoning types are colour-coded for clarity. Logical reasoning encompasses abductive, deductive, and inductive subtypes.
| Type of reasoning | Definition | General example | Medical example |
|---|---|---|---|
| Abductive | Inferring the most likely explanation for observed data or evidence. | Ali, Muthu, and Ah Hock breathe oxygen. Therefore, Ali, Muthu, and Ah Hock are likely human. | A patient has increased intracranial pressure, blurred vision and nausea. Therefore, the patient may have a brain aneurysm or ischaemic stroke. |
| Deductive | Reasoning from a set of premises to reach a certain conclusion. | All humans breathe oxygen. Rentap is human. Therefore, Rentap breathes oxygen. | A patient has increased intracranial pressure, blurred vision and nausea. A CT scan shows no bleeding or swelling. Therefore, the patient does not have a brain aneurysm. |
| Inductive | Inferring general principles based on specific observations. | All humans that I have seen breathe oxygen. Therefore, Rentap probably breathes oxygen. | A patient has increased intracranial pressure, blurred vision and nausea. A CT scan shows no bleeding or swelling. Therefore, the patient probably has an ischaemic stroke. |
| Symbolic* | The abstraction of a system into its component parts, which enables a more direct application of mathematics. | Rule: If an organism breathes oxygen and nitrogen, and exhales carbon dioxide → likely human. Observation: Ali, Muthu, Ah Hock, and Rentap exhibit this respiratory pattern. Conclusion: Therefore, they are probably human. | Rule 1: If a patient presents with increased intracranial pressure (ICP), blurred vision, and nausea → infer high intracranial pathology. Observation 1: The patient shows increased ICP, blurred vision, and nausea. Rule 2: If CT scan shows no bleeding or swelling → rule out haemorrhagic causes. Observation 2: CT scan reveals no evidence of bleeding or swelling. Rule 3: If high ICP and haemorrhage is ruled out → suspect ischaemic stroke. Conclusion: Therefore, the patient most likely has an ischaemic stroke. |
| Causal/ Counterfactual | Establishing a cause-and-effect relationship between events. | Ali, Muthu, Ah Hock, and Rentap stay alive because they breathe oxygen; if Rentap stopped breathing oxygen, he would die. | A blood clot probably caused the blockage in the brain that led to the stroke. |
* We note that the term symbolic reasoning may be misleading, as it fundamentally denotes an abstract data representation that simplifies translating a scenario into a reasoning framework (a minimal forward-chaining sketch of the medical example follows below).
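As referenced in the footnote above, the sketch below forward-chains the three medical rules from the symbolic row of the table, mirroring the symbolic-solver component of the frameworks figure. The tiny rule engine is our own illustrative construction, not code from any cited system.

```python
# A minimal forward-chaining sketch of the symbolic medical example above.
# Rules and findings mirror the table; the engine itself is an illustrative
# assumption.
RULES = [
    ({"increased ICP", "blurred vision", "nausea"}, "high intracranial pathology"),
    ({"CT: no bleeding or swelling"}, "haemorrhage ruled out"),
    ({"high intracranial pathology", "haemorrhage ruled out"}, "suspect ischaemic stroke"),
]

def forward_chain(findings: set[str]) -> list[str]:
    """Apply rules until no new conclusion fires; return the inference chain."""
    chain, fired = [], True
    while fired:
        fired = False
        for antecedent, conclusion in RULES:
            if antecedent <= findings and conclusion not in findings:
                findings.add(conclusion)
                chain.append(f"{sorted(antecedent)} => {conclusion}")
                fired = True
    return chain

observations = {"increased ICP", "blurred vision", "nausea",
                "CT: no bleeding or swelling"}
for step in forward_chain(observations):
    print(step)
# Final step printed: ... => suspect ischaemic stroke
```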
A table showing medical reasoning methods, their defining characteristics, and approach to reasoning.
| Method name | Base architecture/method | Reasoning improvement strategy | Type of reasoning | Advantages | Disadvantages | Dataset | GitHub |
|---|---|---|---|---|---|---|---|
| Savage et al., 2024 | GPT-3.5; GPT-4.0 | Chain-of-thought (diagnostic reasoning) | Deductive | Easy to implement | Scope is limited to GPT models, focusing exclusively on English medical questions | Modified MedQA USMLE; NEJM (New England Journal of Medicine) case series | |
| Kwon et al., 2024 | GPT-4.0; OPT; LLaMA-2; 3D ResNet | Chain-of-thought; knowledge distillation (via SFT) | Deductive | Lightweight and practical to use | Narrow scope covering only a few disease conditions | Alzheimer’s Disease Neuroimaging Initiative (ADNI); Australian Imaging Biomarkers and Lifestyle Study of Ageing (AIBL) | https://github.com/ktio89/ClinicalCoT |
| MEDDM Li et al., 2023 | GPT | Chain-of-thought; clinical decision trees | Deductive | Adaptable to different systems | Requires heavy data collection to generate clinical guidance trees | Medical books, treatment guidelines, and other medical literature | |
| DRHOUSE Yang et al., 2024a | GPT-3.5; GPT-4.0; LLaMA-3 70b; HuatuoGPT-II; MEDDM | Chain-of-thought; clinical decision trees | Deductive | Incorporates objective sensor measurements | Available datasets are currently limited | MedDG; KaMed; DialMed | |
| DR. KNOWS Gao et al., 2023b | Vanilla T5; Flan T5; ClinicalT5; GPT | Chain-of-thought; extracted explainable diagnostic pathway | Deductive; neurosymbolic | Hybrid method improves accuracy; provides explainable diagnostic pathways | Particularly fragile to missing data | MIMIC-III; In-house EHR | |
| TEMED-LLM Bisercic et al., 2023 | text-davinci-003; GPT-3.5; logistic regression; decision tree; XGBoost | Few-shot learning; tabular ML modelling; neurosymbolic | Deductive | End-to-end interpretability, from data extraction to ML analysis | Requires human experts | EHR dataset (Kaggle; see referenced publication for details) | |
| EHRAgent Shi et al., 2024 | GPT-4 | Autonomous code generation and execution for multi-tabular reasoning in EHRs | Deductive | Facilitates automated solutions in complex medical scenarios | Non-deterministic; limited generalisability | MIMIC-III; eICU; TREQS | https://github.com/wshi83/EhrAgent; https://wshi83.github.io/EHR-Agent-page |
| AMIE Tu et al., 2024 | PaLM 2 | Reinforcement learning | Deductive | Effectively handles noisy and ambiguous real-world medical dialogues | Computationally expensive and resource-intensive; simulated data may not fully capture real-world clinical nuances | MedQA; HealthSearchQA; LiveQA; Medication QA (in MultiMedBench); MIMIC-III | |
| ArgMed-Agents Hong et al., 2024 | GPT-3.5-turbo; GPT-4 | Chain-of-thought; symbolic reasoning; neurosymbolic | Deductive | Training-free enhancement; explainability matches fully transparent, knowledge-based systems | Artificially restricted responses that do not match real-world cases | MedQA; PubMedQA | |
| Fansi Tchango et al., 2022a | BASD (baseline ASD): multi-layer perceptron (MLP); Diaformer | Reinforcement learning | Deductive | Closely aligns with clinical reasoning protocols | Limited testing on real patient data | DDxPlus | https://github.com/mila-iqia/Casande-RL |
| MEDIQ Li et al., 2024 | LLaMA-3-Instruct (8B, 70B); GPT-3.5; GPT-4 | Chain-of-thought; information-seeking dialogues | Abductive | Robust to missing information | Available datasets are limited and proprietary; artificially restricted responses that do not match real-world cases | iMEDQA; iCRAFT-MD | https://github.com/stellalisy/mediQ |
| Naik et al., 2023 | GPT-4 | Causal network generation | Causal/ counterfactual | Uses general LLMs | Lacks a specialised medical knowledge base | Providence St. Joseph Health (PSJH) clinical data warehouse | |
| Gopalakrishnan et al., 2024 | BioBERT; DistilBERT; BERT; GPT-4; LLaMA | Causality extraction | Causal/ counterfactual | Easy to implement | Narrow scope covering only a few disease conditions | American Diabetes Association (ADA); US Preventive Services Task Force (USPSTF); American College of Obstetricians and Gynecologists (ACOG); American Academy of Family Physicians (AAFP); Endocrine Society | https://github.com/gseetha04/LLMs-Medicaldata |
| InferBERT Wang et al., 2021 | ALBERT; Judea Pearl’s do-calculus | Causal inference using do-calculus | Causal/ counterfactual; mathematical | Establishes causal inference | Narrow scope covering only a few disease conditions; highly restrictive input format | FAERS case reports from the PharmaPendium database | https://github.com/XingqiaoWang/DeepCausalPV-master |
| Kıcıman et al., 2023 | text-davinci-003; GPT-3.5-turbo; GPT-4 | Determine direction of causality between pairs of variables (sketched after this table) | Causal/ counterfactual | Highly accurate for large models | Limited reproducibility due to dependency on tailored prompts | Tübingen cause-effect pairs dataset | https://github.com/py-why/pywhy-llm |
| HuatuoGPT-o1 Chen et al., 2024 | LLaMA-3.1-8B-Instruct and LLaMA-3.1-70B-Instruct | Supervised fine-tuning and PPO | Deductive | Instils multi-step reasoning in medical LLMs; built-in interpretability as the LLM outputs reasoning traces along with its answer | Evaluation limited to accuracy scores on medical MCQ benchmarks | Adapted from MedQA-USMLE and MedMCQA | https://github.com/FreedomIntelligence/HuatuoGPT-o1 |
| Med-R1 Lai et al., 2025 | Qwen2-VL-2B | Supervised fine-tuning and GRPO | Deductive | Joint image-text and multi-task reasoning; built-in interpretability as the LLM outputs reasoning traces along with its answer | Findings question the ‘more thinking is better’ assumption: longer reasoning traces do not guarantee better accuracy | OmniMedVQA | https://github.com/Yuxiang-Lai117/Med-R1 |
| MedVLM-R1 Pan et al., 2025 | Qwen2-VL-2B | Supervised fine-tuning and GRPO | Deductive | Joint image-text reasoning; built-in interpretability as the LLM outputs reasoning traces along with its answer | Evaluation limited to accuracy scores on medical MCQ benchmarks | VQA-RAD, SLAKE, PathVQA, OmniMedVQA, and PMC-VQA | https://huggingface.co/JZPeterPan/MedVLM-R1 |
| MedFound Liu et al., 2025 | 176-billion-parameter LLM pretrained from scratch | Supervised fine-tuning and DPO | Deductive | Self-bootstrapped chain-of-thought fine-tuning; rigorous rubric-based human evaluation of reasoning traces; built-in interpretability as the LLM outputs reasoning traces along with its answer | Proprietary EHR datasets are not fully open, hindering exact reproduction | MedCorpus, MedDX-FT and MedDX-Bench | https://github.com/medfound/medfound |
| DeepSeek-R1 Guo et al., 2025 | DeepSeek-V3-Base | Supervised fine-tuning and GRPO | Deductive | Built-in interpretability as the LLM outputs reasoning traces along with its answer | Pre-training and reasoning datasets are not open-sourced | - | https://github.com/deepseek-ai/DeepSeek-R1 |
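The pairwise causal-direction strategy in the Kıcıman et al. row above can be illustrated as follows. The prompt wording and the `fake_llm` stand-in are assumptions made for this sketch; the cited work evaluates real LLMs against the Tübingen cause-effect pairs.

```python
# A minimal sketch, assuming an invented prompt template, of querying an LLM
# for the direction of causality between a pair of variables.
def causal_direction_prompt(a: str, b: str) -> str:
    return (f"Which cause-and-effect relationship is more likely?\n"
            f"A. {a} causes {b}\n"
            f"B. {b} causes {a}\n"
            f"Answer with a single letter.")

def classify_pair(query_llm, a: str, b: str) -> str:
    """Ask the supplied LLM callable for the causal direction of (a, b)."""
    answer = query_llm(causal_direction_prompt(a, b)).strip().upper()
    return f"{a} -> {b}" if answer.startswith("A") else f"{b} -> {a}"

def fake_llm(prompt: str) -> str:
    """Toy stand-in for a real LLM call so the sketch runs end to end."""
    return "A"

print(classify_pair(fake_llm, "altitude", "air pressure"))
# altitude -> air pressure
```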
Comparison of reasoning evaluation paradigms in medical LLMs.
| Evaluation paradigm | Conceptual focus | Typical implementation and metrics |
|---|---|---|
| Conclusion-based | Assesses correctness of the final answer only, without inspecting the reasoning path. | Automated scoring on Q&A benchmarks (e.g. MedQA, MedMCQA). Metrics: accuracy, exact match, F1. Fast and reproducible, but offers only high-level insight. |
| Rationale-based | Evaluates the logic chain or narrative explanation produced by the model. Focuses on coherence, validity, and completeness of reasoning traces. | Manual expert review or rubric-based grading of CoT. Automated graph checks (e.g. DAG similarity, causal-direction tests). Metrics: Bayesian Dirichlet score, Normalised Hamming Distance (sketched after this table). |
| Mechanistic | Probes low-level internal signals to answer “why did the model arrive here?”. Targets feature attribution and internal attention contributions. | Explainable-AI toolkits (Integrated Gradients, SHAP, attention rollout; rollout is sketched after this table). Outputs saliency maps or keyword heat-maps for clinician inspection. |
| Interactive | Treats evaluation as a dialogue or game; dynamically stresses the model in real time. Explores the response space by challenging, re-prompting, or role-playing. | Game-theoretic tasks (e.g. debate, self-play). Rich insights but lower reproducibility; requires human-in-the-loop or scripted agents. |
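To ground the rationale-based metrics above, here is a minimal sketch of a Normalised Hamming Distance between a model-generated causal graph and an expert reference graph. The node names and the exact normalisation (by the number of possible directed edges) are illustrative assumptions, as implementations vary.

```python
# A minimal sketch of a Normalised Hamming Distance over causal graphs:
# the fraction of possible directed edges on which two graphs disagree.
from itertools import permutations

def normalised_hamming(edges_pred: set, edges_ref: set, nodes: list) -> float:
    """0.0 = identical graphs; 1.0 = disagreement on every possible edge."""
    possible = set(permutations(nodes, 2))      # all directed edges
    disagreements = (edges_pred ^ edges_ref) & possible
    return len(disagreements) / len(possible)

nodes = ["clot", "blockage", "stroke"]
reference = {("clot", "blockage"), ("blockage", "stroke")}
predicted = {("clot", "blockage"), ("stroke", "blockage")}  # one edge reversed
print(normalised_hamming(predicted, reference, nodes))      # 2/6 ≈ 0.33
```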
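For the mechanistic paradigm, the sketch below implements plain attention rollout on random stand-in matrices. In practice the inputs would be a model's head-averaged attention maps, and the residual weighting used here (add the identity, then renormalise) is one common convention rather than the only one.

```python
# A minimal sketch of attention rollout: propagate attention through the
# layers to estimate each input token's contribution to each position.
import numpy as np

def attention_rollout(layer_attentions: list) -> np.ndarray:
    """Multiply (attention + identity) matrices across layers, row-normalised."""
    n = layer_attentions[0].shape[0]
    rollout = np.eye(n)
    for attn in layer_attentions:
        a = attn + np.eye(n)                    # account for residual stream
        a = a / a.sum(axis=-1, keepdims=True)   # keep rows stochastic
        rollout = a @ rollout
    return rollout

rng = np.random.default_rng(0)
layers = [rng.random((5, 5)) for _ in range(4)]
layers = [a / a.sum(axis=-1, keepdims=True) for a in layers]  # row-stochastic
print(attention_rollout(layers)[0])  # token contributions to position 0
```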