Critique of impure reason: Unveiling the reasoning behaviour of medical large language models
Figures
A graphical abstract illustrating the current state of medical large language models (LLMs) in the context of reasoning behaviour.
An illustration of the contrast in modalities between computer vision and natural language processing.
A schematic diagram illustrating different strategies for solving problems using large language models (LLMs).
Each rectangular box represents a ‘thought’—a meaningful segment of language that functions as an intermediate step in the reasoning or problem-solving process.
A graphical representation of reasoning, reasoning outcome, and reasoning behaviour.
Reasoning encapsulates the process of drawing conclusions, arriving at a reasoning outcome. At a more fundamental level, reasoning behaviour describes the logical flow through the system that occurs during reasoning.
Two frameworks with a focus on exposing reasoning behaviour.
Note that the two frameworks are independent but are shown together to facilitate comparison. Top: input data are standardised and fed to tree-based models, whose deterministic structure is exploited to make reasoning behaviour transparent. Bottom: an integrative framework combining the complementary strengths of LLMs and symbolic reasoning. The medical LLM extracts diagnostic rules from clinical algorithms, along with its chain-of-thought (CoT) reasoning and attention weights. These diagnostic rules, together with patient case inputs, are provided to the symbolic solver, which determines the final diagnosis and generates inference chains as its reasoning trace.
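To make the top framework concrete, the short sketch below fits a small decision tree on toy, standardised patient features and prints its learned rules. The feature names and data are illustrative assumptions, not taken from any reviewed system.

```python
# A minimal sketch of the top framework, assuming invented feature names and
# toy data: a decision tree's deterministic structure makes its reasoning
# behaviour directly auditable.
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["high_ICP", "blurred_vision", "nausea", "ct_bleeding"]

# Standardised patient inputs (rows) over the binary features above.
X = [[1, 1, 1, 0],
     [1, 1, 1, 1],
     [0, 0, 1, 0],
     [0, 1, 0, 0]]
y = ["ischaemic stroke", "haemorrhagic stroke", "gastritis", "migraine"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Every prediction follows one explicit root-to-leaf path, so the printed
# rules *are* the model's reasoning behaviour.
print(export_text(tree, feature_names=FEATURES))
```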
Tables
A table showing types of reasoning, their definition and examples.
Reasoning types are colour-coded for clarity. Logical reasoning encompasses abductive, deductive, and inductive subtypes.
| Type of reasoning | Definition | General example | Medical example |
|---|---|---|---|
| Abductive | Inferring the most likely explanation for observed data or evidence. | Ali, Muthu, and Ah Hock breathe oxygen. Therefore, Ali, Muthu, and Ah Hock are likely human. | A patient has increased intracranial pressure, blurred vision and nausea. Therefore, the patient may have a brain aneurysm or ischaemic stroke. |
| Deductive | Reasoning from a set of premises to reach a certain conclusion. | All humans breathe oxygen. Rentap is human. Therefore, Rentap breathes oxygen. | A patient has increased intracranial pressure, blurred vision and nausea. A CT scan shows no bleeding or swelling. Therefore, the patient does not have a brain aneurysm. |
| Inductive | Inferring general principles based on specific observations. | All humans that I have seen breathe oxygen. Therefore, Rentap probably breathes oxygen. | A patient has increased intracranial pressure, blurred vision and nausea. A CT scan shows no bleeding or swelling. Therefore, the patient probably has an ischaemic stroke. |
| Symbolic* | The abstraction of a system into its component parts, which enables a more direct application of mathematics. | Rule: If an organism breathes oxygen and nitrogen, and exhales carbon dioxide → likely human. Observation: Ali, Muthu, Ah Hock, and Rentap exhibit this respiratory pattern. Conclusion: Therefore, they are probably human. | Rule 1: If a patient presents with increased intracranial pressure (ICP), blurred vision, and nausea → infer high intracranial pathology. Observation 1: The patient shows increased ICP, blurred vision, and nausea. Rule 2: If CT scan shows no bleeding or swelling → rule out haemorrhagic causes. Observation 2: CT scan reveals no evidence of bleeding or swelling. Rule 3: If high ICP and haemorrhage is ruled out → suspect ischaemic stroke. Conclusion: Therefore, the patient most likely has an ischaemic stroke. |
| Causal/ Counterfactual | Establishing a cause-and-effect relationship between events. | Ali, Muthu, Ah Hock, and Rentap stay alive because they breathe oxygen; if Rentap stopped breathing oxygen, he would die. | A blood clot probably caused the blockage in the brain that led to the stroke. |
* We note that the term symbolic reasoning may be misleading, as it fundamentally denotes an abstract data representation that simplifies translating a scenario into a reasoning framework (a minimal forward-chaining sketch of the medical example follows below).
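As referenced in the footnote above, the sketch below forward-chains the three medical rules from the symbolic row of the table, mirroring the symbolic-solver component of the frameworks figure. The tiny rule engine is our own illustrative construction, not code from any cited system.

```python
# A minimal forward-chaining sketch of the symbolic medical example above.
# Rules and findings mirror the table; the engine itself is an illustrative
# assumption.
RULES = [
    ({"increased ICP", "blurred vision", "nausea"}, "high intracranial pathology"),
    ({"CT: no bleeding or swelling"}, "haemorrhage ruled out"),
    ({"high intracranial pathology", "haemorrhage ruled out"}, "suspect ischaemic stroke"),
]

def forward_chain(findings: set[str]) -> list[str]:
    """Apply rules until no new conclusion fires; return the inference chain."""
    chain, fired = [], True
    while fired:
        fired = False
        for antecedent, conclusion in RULES:
            if antecedent <= findings and conclusion not in findings:
                findings.add(conclusion)
                chain.append(f"{sorted(antecedent)} => {conclusion}")
                fired = True
    return chain

observations = {"increased ICP", "blurred vision", "nausea",
                "CT: no bleeding or swelling"}
for step in forward_chain(observations):
    print(step)
# Final step printed: ... => suspect ischaemic stroke
```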
A table showing medical reasoning methods, their defining characteristics, and approach to reasoning.
| Method name | Base architecture/method | Reasoning improvement strategy | Type of reasoning | Advantages | Disadvantages | Dataset | GitHub |
|---|---|---|---|---|---|---|---|
| Savage et al., 2024 | GPT-3.5; GPT-4.0 | Chain-of-thought (diagnostic reasoning) | Deductive | Easy to implement | Scope is limited to GPT models, focusing exclusively on English medical questions | Modified MedQA USMLE; NEJM (New England Journal of Medicine) case series | |
| Kwon et al., 2024 | GPT-4.0; OPT; LLaMA-2; 3D ResNet | Chain-of-thought; knowledge distillation (via SFT) | Deductive | Lightweight and practical to use | Narrow scope covering only a few disease conditions | Alzheimer’s Disease Neuroimaging Initiative (ADNI); Australian Imaging Biomarkers and Lifestyle Study of Ageing (AIBL) | https://github.com/ktio89/ClinicalCoT |
| MEDDM Li et al., 2023 | GPT | Chain-of-thought; clinical decision trees | Deductive | Adaptable to different systems | Requires heavy data collection to generate clinical guidance trees | Medical books, treatment guidelines, and other medical literature | |
| DRHOUSE Yang et al., 2024a | GPT-3.5; GPT-4.0; LLaMA-3 70b; HuatuoGPT-II; MEDDM | Chain-of-thought; clinical decision trees | Deductive | Incorporates objective sensor measurements | Available datasets are currently limited | MedDG; KaMed; DialMed | |
| DR. KNOWS Gao et al., 2023b | Vanilla T5; Flan T5; ClinicalT5; GPT | Chain-of-thought; extracted explainable diagnostic pathway | Deductive; neurosymbolic | Hybrid method improves accuracy; provides explainable diagnostic pathways | Particularly fragile to missing data | MIMIC-III; In-house EHR | |
| TEMED-LLM Bisercic et al., 2023 | text-davinci-003; GPT-3.5; logistic regression; decision tree; XGBoost | Few-shot learning; tabular ML modelling; neurosymbolic | Deductive | End-to-end interpretability, from data extraction to ML analysis | Requires human experts | EHR dataset (Kaggle; see referenced publication for details) | |
| EHRAgent Shi et al., 2024 | GPT-4 | Autonomous code generation and execution for multi-tabular reasoning in EHRs | Deductive | Facilitates automated solutions in complex medical scenarios | Non-deterministic; limited generalisability | MIMIC-III; eICU; TREQS | https://github.com/wshi83/EhrAgent; https://wshi83.github.io/EHR-Agent-page |
| AMIE Tu et al., 2024 | PaLM 2 | Reinforcement learning | Deductive | Effectively handles noisy and ambiguous real-world medical dialogues | Computationally expensive and resource-intensive; simulated data may not fully capture real-world clinical nuances | MedQA; HealthSearchQA; LiveQA; Medication QA (in MultiMedBench); MIMIC-III | |
| ArgMed-Agents Hong et al., 2024 | GPT-3.5-turbo; GPT-4 | Chain-of-thought; symbolic reasoning; neurosymbolic | Deductive | Training-free enhancement; explainability matches fully transparent, knowledge-based systems | Artificially restricted responses that do not match real-world cases | MedQA; PubMedQA | |
| Fansi Tchango et al., 2022a | BASD (baseline ASD): multi-layer perceptron (MLP); Diaformer | Reinforcement learning | Deductive | Closely aligns with clinical reasoning protocols | Limited testing on real patient data | DDxPlus | https://github.com/mila-iqia/Casande-RL |
| MEDIQ Li et al., 2024 | LLaMA-3-Instruct (8B, 70B); GPT-3.5; GPT-4 | Chain-of-thought; information-seeking dialogues | Abductive | Robust to missing information | Available datasets are limited and proprietary; artificially restricted responses that do not match real-world cases | iMEDQA; iCRAFT-MD | https://github.com/stellalisy/mediQ |
| Naik et al., 2023 | GPT-4 | Causal network generation | Causal/ counterfactual | Uses general LLMs | Lacks a specialised medical knowledge base | Providence St. Joseph Health (PSJH) clinical data warehouse | |
| Gopalakrishnan et al., 2024 | BioBERT; DistilBERT; BERT; GPT-4; LLaMA | Causality extraction | Causal/ counterfactual | Easy to implement | Narrow scope covering only a few disease conditions | American Diabetes Association (ADA); US Preventive Services Task Force (USPSTF); American College of Obstetricians and Gynecologists (ACOG); American Academy of Family Physicians (AAFP); Endocrine Society | https://github.com/gseetha04/LLMs-Medicaldata |
| InferBERT Wang et al., 2021 | ALBERT; Judea Pearl’s do-calculus | Causal inference using do-calculus | Causal/ counterfactual; mathematical | Establishes causal inference | Narrow scope covering only a few disease conditions; highly restrictive input format | FAERS case reports from the PharmaPendium database | https://github.com/XingqiaoWang/DeepCausalPV-master |
| Kıcıman et al., 2023 | text-davinci-003; GPT-3.5-turbo; GPT-4 | Determine direction of causality between pairs of variables (sketched after this table) | Causal/ counterfactual | Highly accurate for large models | Limited reproducibility due to dependency on tailored prompts | Tübingen cause-effect pairs dataset | https://github.com/py-why/pywhy-llm |
| HuatuoGPT-o1 Chen et al., 2024 | LLaMA-3.1-8B-Instruct and LLaMA-3.1-70B-Instruct | Supervised fine-tuning and PPO | Deductive | Instils multi-step reasoning in medical LLMs; built-in interpretability as the LLM outputs reasoning traces along with its answer | Evaluation limited to accuracy scores on medical MCQ benchmarks | Adapted from MedQA-USMLE and MedMCQA | https://github.com/FreedomIntelligence/HuatuoGPT-o1 |
| Med-R1 Lai et al., 2025 | Qwen2-VL-2B | Supervised fine-tuning and GRPO | Deductive | Joint image-text and multi-task reasoning; built-in interpretability as the LLM outputs reasoning traces along with its answer | Findings question the ‘more thinking is better’ assumption: longer reasoning traces do not guarantee better accuracy | OmniMedVQA | https://github.com/Yuxiang-Lai117/Med-R1 |
| MedVLM-R1 Pan et al., 2025 | Qwen2-VL-2B | Supervised fine-tuning and GRPO | Deductive | Joint image-text reasoning; built-in interpretability as the LLM outputs reasoning traces along with its answer | Evaluation limited to accuracy scores on medical MCQ benchmarks | VQA-RAD, SLAKE, PathVQA, OmniMedVQA, and PMC-VQA | https://huggingface.co/JZPeterPan/MedVLM-R1 |
| MedFound Liu et al., 2025 | 176-billion-parameter LLM pretrained from scratch | Supervised fine-tuning and DPO | Deductive | Self-bootstrapped chain-of-thought fine-tuning; rigorous rubric-based human evaluation of reasoning traces; built-in interpretability as the LLM outputs reasoning traces along with its answer | Proprietary EHR datasets are not fully open, hindering exact reproduction | MedCorpus, MedDX-FT and MedDX-Bench | https://github.com/medfound/medfound |
| DeepSeek-R1 Guo et al., 2025 | DeepSeek-V3-Base | Supervised fine-tuning and GRPO | Deductive | Built-in interpretability as the LLM outputs reasoning traces along with its answer | Pre-training and reasoning datasets are not open-sourced | - | https://github.com/deepseek-ai/DeepSeek-R1 |
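The pairwise causal-direction strategy in the Kıcıman et al. row above can be illustrated as follows. The prompt wording and the `fake_llm` stand-in are assumptions made for this sketch; the cited work evaluates real LLMs against the Tübingen cause-effect pairs.

```python
# A minimal sketch, assuming an invented prompt template, of querying an LLM
# for the direction of causality between a pair of variables.
def causal_direction_prompt(a: str, b: str) -> str:
    return (f"Which cause-and-effect relationship is more likely?\n"
            f"A. {a} causes {b}\n"
            f"B. {b} causes {a}\n"
            f"Answer with a single letter.")

def classify_pair(query_llm, a: str, b: str) -> str:
    """Ask the supplied LLM callable for the causal direction of (a, b)."""
    answer = query_llm(causal_direction_prompt(a, b)).strip().upper()
    return f"{a} -> {b}" if answer.startswith("A") else f"{b} -> {a}"

def fake_llm(prompt: str) -> str:
    """Toy stand-in for a real LLM call so the sketch runs end to end."""
    return "A"

print(classify_pair(fake_llm, "altitude", "air pressure"))
# altitude -> air pressure
```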
Comparison of reasoning evaluation paradigms in medical LLMs.
| Evaluation paradigm | Conceptual focus | Typical implementation and metrics |
|---|---|---|
| Conclusion-based | Assesses correctness of the final answer only, without inspecting the reasoning path. | Automated scoring on Q&A benchmarks (e.g. MedQA, MedMCQA). Metrics: accuracy, exact match, F1. Fast and reproducible, but offers only high-level insight. |
| Rationale-based | Evaluates the logic chain or narrative explanation produced by the model. Focuses on coherence, validity, and completeness of reasoning traces. | Manual expert review or rubric-based grading of CoT. Automated graph checks (e.g. DAG similarity, causal-direction tests). Metrics: Bayesian Dirichlet score, Normalised Hamming Distance (sketched after this table). |
| Mechanistic | Probes low-level internal signals to answer “why did the model arrive here?”. Targets feature attribution and internal attention contributions. | Explainable-AI toolkits (Integrated Gradients, SHAP, attention rollout; rollout is sketched after this table). Outputs saliency maps or keyword heat-maps for clinician inspection. |
| Interactive | Treats evaluation as a dialogue or game; dynamically stresses the model in real time. Explores the response space by challenging, re-prompting, or role-playing. | Game-theoretic tasks (e.g. debate, self-play). Rich insights but lower reproducibility; requires human-in-the-loop or scripted agents. |
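To ground the rationale-based metrics above, here is a minimal sketch of a Normalised Hamming Distance between a model-generated causal graph and an expert reference graph. The node names and the exact normalisation (by the number of possible directed edges) are illustrative assumptions, as implementations vary.

```python
# A minimal sketch of a Normalised Hamming Distance over causal graphs:
# the fraction of possible directed edges on which two graphs disagree.
from itertools import permutations

def normalised_hamming(edges_pred: set, edges_ref: set, nodes: list) -> float:
    """0.0 = identical graphs; 1.0 = disagreement on every possible edge."""
    possible = set(permutations(nodes, 2))      # all directed edges
    disagreements = (edges_pred ^ edges_ref) & possible
    return len(disagreements) / len(possible)

nodes = ["clot", "blockage", "stroke"]
reference = {("clot", "blockage"), ("blockage", "stroke")}
predicted = {("clot", "blockage"), ("stroke", "blockage")}  # one edge reversed
print(normalised_hamming(predicted, reference, nodes))      # 2/6 ≈ 0.33
```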
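For the mechanistic paradigm, the sketch below implements plain attention rollout on random stand-in matrices. In practice the inputs would be a model's head-averaged attention maps, and the residual weighting used here (add the identity, then renormalise) is one common convention rather than the only one.

```python
# A minimal sketch of attention rollout: propagate attention through the
# layers to estimate each input token's contribution to each position.
import numpy as np

def attention_rollout(layer_attentions: list) -> np.ndarray:
    """Multiply (attention + identity) matrices across layers, row-normalised."""
    n = layer_attentions[0].shape[0]
    rollout = np.eye(n)
    for attn in layer_attentions:
        a = attn + np.eye(n)                    # account for residual stream
        a = a / a.sum(axis=-1, keepdims=True)   # keep rows stochastic
        rollout = a @ rollout
    return rollout

rng = np.random.default_rng(0)
layers = [rng.random((5, 5)) for _ in range(4)]
layers = [a / a.sum(axis=-1, keepdims=True) for a in layers]  # row-stochastic
print(attention_rollout(layers)[0])  # token contributions to position 0
```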