Critique of impure reason: Unveiling the reasoning behaviour of medical large language models

  1. Shamus Zi Yang Sim, QueueMed Healthtech, Malaysia (corresponding author)
  2. Tyrone Chen, Peter MacCallum Cancer Centre, Australia (corresponding author)
7 figures and 3 tables

Figures

A graphical abstract illustrating the current state of medical large language models (LLMs) in the context of reasoning behaviour.
An illustration of the contrast in modalities between computer vision and natural language processing.
A schematic diagram illustrating different strategies for solving problems using large language models (LLMs).

Each rectangular box represents a ‘thought’—a meaningful segment of language that functions as an intermediate step in the reasoning or problem-solving process.
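To make the notion of a 'thought' concrete, the minimal Python sketch below (ours, not from the figure; the step texts are invented placeholders) represents a chain of thoughts as an ordered sequence and a tree-style strategy as branching alternatives:

```python
# Hypothetical sketch: each 'thought' is one intermediate reasoning step.
# A chain-of-thought is a linear sequence of such steps; tree-style strategies
# let a step branch into alternative continuations that can be compared.

chain_of_thought = [
    "Identify the findings described in the question stem.",
    "List candidate diagnoses consistent with those findings.",
    "Eliminate candidates contradicted by the investigations.",
    "Commit to the most likely remaining diagnosis.",
]

tree_of_thoughts = {
    "Identify the findings": {
        "Consider vascular causes": {"Check the CT scan for a bleed": {}},
        "Consider infective causes": {"Check inflammatory markers": {}},
    },
}
```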

A sample directed acyclic graph.
A graphical representation of reasoning, reasoning outcome, and reasoning behaviour.

Reasoning encapsulates the process of drawing conclusions, arriving at a reasoning outcome. At a more fundamental level, reasoning behaviour describes the logical flow through the system that occurs during reasoning.

Two frameworks with a focus on exposing reasoning behaviour.

Note that the two frameworks are independent but are shown together for easier comparison. Top: input data is standardised and fed to tree-based models, whose deterministic structure is exploited to make reasoning behaviour transparent. Bottom: an integrative framework combining the complementary strengths of LLMs and symbolic reasoning. The medical LLM extracts diagnostic rules from clinical algorithms, along with its chain-of-thought (CoT) reasoning and attention weights. These diagnostic rules, together with patient case inputs, are passed to a symbolic solver, which determines the final diagnosis and emits inference chains as its reasoning trace.
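As a rough sketch of the top framework, the snippet below (ours; the features, toy cases, and labels are invented for illustration) fits a small decision tree and prints its explicit root-to-leaf rules, the deterministic structure that makes reasoning behaviour inspectable. The bottom framework's symbolic solver behaves like the forward-chaining sketch given after Table 1.

```python
# Minimal sketch of the top framework: standardised inputs into a tree-based
# model whose deterministic structure exposes its reasoning behaviour.
# The features, toy cases, and labels below are hypothetical.
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["raised_icp", "blurred_vision", "ct_shows_bleed"]
X = [
    [1, 1, 0],  # raised ICP and blurred vision, clean CT scan
    [1, 1, 1],  # raised ICP and blurred vision, bleed on CT scan
    [0, 0, 0],  # unremarkable presentation
]
y = ["ischaemic_stroke", "haemorrhage", "no_acute_pathology"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# The fitted tree *is* the reasoning trace: every prediction follows one
# explicit, auditable root-to-leaf path.
print(export_text(tree, feature_names=features))
```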

An illustration of the spectrum of ‘System 1’ fundamental thought processes to ‘System 2’ analytical thought processes.

Tables

Table 1
A table showing types of reasoning, their definitions, and examples.

Reasoning types are colour-coded for clarity. Logical reasoning encompasses abductive, deductive, and inductive subtypes.

| Type of reasoning | Definition | General example | Medical example |
| --- | --- | --- | --- |
| Abductive | Inferring the most likely explanation for observed data or evidence. | Ali, Muthu, and Ah Hock breathe oxygen. Therefore, Ali, Muthu, and Ah Hock are likely human. | A patient has increased intracranial pressure, blurred vision, and nausea. Therefore, the patient may have a brain aneurysm or ischaemic stroke. |
| Deductive | Reasoning from a set of premises to reach a certain conclusion. | All humans breathe oxygen. Rentap is human. Therefore, Rentap breathes oxygen. | A patient has increased intracranial pressure, blurred vision, and nausea. A CT scan shows no bleeding or swelling. Therefore, the patient does not have a brain aneurysm. |
| Inductive | Inferring general principles based on specific observations. | All humans that I have seen breathe oxygen. Therefore, Rentap probably breathes oxygen. | A patient has increased intracranial pressure, blurred vision, and nausea. A CT scan shows no bleeding or swelling. Therefore, the patient probably has an ischaemic stroke. |
| Symbolic* | The abstraction of a system into its component parts, which enables a more direct application of mathematics. | Rule: if an organism breathes oxygen and nitrogen, and exhales carbon dioxide → likely human. Observation: Ali, Muthu, Ah Hock, and Rentap exhibit this respiratory pattern. Conclusion: therefore, they are probably human. | Rule 1: if a patient presents with increased intracranial pressure (ICP), blurred vision, and nausea → infer high intracranial pathology. Observation 1: the patient shows increased ICP, blurred vision, and nausea. Rule 2: if a CT scan shows no bleeding or swelling → rule out haemorrhagic causes. Observation 2: CT scan reveals no evidence of bleeding or swelling. Rule 3: if high ICP and haemorrhage is ruled out → suspect ischaemic stroke. Conclusion: therefore, the patient most likely has an ischaemic stroke. |
| Causal/Counterfactual | Establishing a cause-and-effect relationship between events. | Ali, Muthu, Ah Hock, and Rentap exhibit this respiratory pattern. | A blood clot probably caused blockage in the brain, leading to the stroke. |
* We note that the term symbolic reasoning may be misleading, as it is fundamentally an abstract data representation which simplifies the process of translating a scenario into a reasoning framework.
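The symbolic medical example in Table 1 maps directly onto rule-based forward chaining. The minimal sketch below (ours; the fact and rule names are shorthand for the table entries, not a clinical system) fires the three rules until the conclusion is derived:

```python
# Forward chaining over the three diagnostic rules from the symbolic example.
# Fact names are shorthand for the table entries; this is illustrative only.

rules = [
    ({"raised_icp", "blurred_vision", "nausea"}, "intracranial_pathology"),            # Rule 1
    ({"ct_no_bleeding_or_swelling"}, "haemorrhage_ruled_out"),                          # Rule 2
    ({"intracranial_pathology", "haemorrhage_ruled_out"}, "suspect_ischaemic_stroke"),  # Rule 3
]

# Observations 1 and 2 from the table.
facts = {"raised_icp", "blurred_vision", "nausea", "ct_no_bleeding_or_swelling"}

# Repeatedly fire any rule whose antecedents all hold until nothing new follows;
# the printed lines form the inference chain (the reasoning trace).
derived = True
while derived:
    derived = False
    for antecedents, conclusion in rules:
        if antecedents <= facts and conclusion not in facts:
            facts.add(conclusion)
            print(f"{sorted(antecedents)} -> {conclusion}")
            derived = True
# The last line printed ends with suspect_ischaemic_stroke, the table's conclusion.
```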

Table 2
A table showing medical reasoning methods, their defining characteristics, and their approaches to reasoning.
| Method name | Base architecture/method | Reasoning improvement strategy | Type of reasoning | Advantages | Disadvantages | Dataset | GitHub |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Savage et al., 2024 | GPT-3.5; GPT-4.0 | Chain-of-thought (diagnostic reasoning) | Deductive | Easy to implement | Scope is limited to GPT models, focusing exclusively on English medical questions | Modified MedQA USMLE; NEJM (New England Journal of Medicine) case series | – |
| Kwon et al., 2024 | GPT-4.0; OPT; LLaMA-2; 3D ResNet | Chain-of-thought; knowledge distillation (via SFT) | Deductive | Lightweight and practical to use | Narrow scope, limited to specific disease conditions | Alzheimer's Disease Neuroimaging Initiative (ADNI); Australian Imaging Biomarkers and Lifestyle Study of Ageing (AIBL) | https://github.com/ktio89/ClinicalCoT |
| MEDDM (Li et al., 2023) | GPT | Chain-of-thought; clinical decision trees | Deductive | Adaptable to different systems | Heavy data collection required to generate clinical guidance trees | Medical books, treatment guidelines, and other medical literature | – |
| DRHOUSE (Yang et al., 2024a) | GPT-3.5; GPT-4.0; LLaMA-3 70b; HuatuoGPT-II; MEDDM | Chain-of-thought; clinical decision trees | Deductive | Incorporates objective sensor measurements | Available datasets are currently limited | MedDG; KaMed; DialMed | – |
| DR. KNOWS (Gao et al., 2023b) | Vanilla T5; Flan T5; ClinicalT5; GPT | Chain-of-thought; extracted explainable diagnostic pathway | Deductive; neurosymbolic | Hybrid method improves accuracy; provides explainable diagnostic pathways | Particularly fragile to missing data | MIMIC-III; in-house EHR | – |
| TEMED-LLM (Bisercic et al., 2023) | text-davinci-003; GPT-3.5; logistic regression; decision tree; XGBoost | Few-shot learning; tabular ML modelling; neurosymbolic | Deductive | End-to-end interpretability, from data extraction to ML analysis | Requires human experts | EHR dataset (Kaggle; see referenced publication for details) | – |
| EHRAgent (Shi et al., 2024) | GPT-4 | Autonomous code generation and execution for multi-tabular reasoning in EHRs | Deductive | Facilitates automated solutions in complex medical scenarios | Non-deterministic; limited generalisability | MIMIC-III; eICU; TREQS | https://github.com/wshi83/EhrAgent; https://wshi83.github.io/EHR-Agent-page |
| AMIE (Tu et al., 2024) | PaLM 2 | Reinforcement learning | Deductive | Effectively handles noisy and ambiguous real-world medical dialogues | Computationally expensive and resource-intensive; simulated data may not fully capture real-world clinical nuances | MedQA; HealthSearchQA; LiveQA; Medication QA in MultiMedBench; MIMIC-III | – |
| ArgMed-Agents (Hong et al., 2024) | GPT-3.5-turbo; GPT-4 | Chain-of-thought; symbolic reasoning (neurosymbolic) | Deductive | Training-free enhancement; explainability matches fully transparent, knowledge-based systems | Artificially restricted responses that do not match real-world cases | MedQA; PubMedQA | – |
| Fansi Tchango et al., 2022a | BASD (baseline ASD): multi-layer perceptron (MLP); Diaformer | Reinforcement learning | Deductive | Closely aligns with clinical reasoning protocols | Limited testing on real patient data | DDxPlus | https://github.com/mila-iqia/Casande-RL |
| MEDIQ (Li et al., 2024) | LLaMA-3-Instruct (8B, 70B); GPT-3.5; GPT-4 | Chain-of-thought; information-seeking dialogues | Abductive | Robust to missing information | Available datasets are limited and proprietary; artificially restricted responses that do not match real-world cases | iMEDQA; iCRAFT-MD | https://github.com/stellalisy/mediQ |
| Naik et al., 2023 | GPT-4 | Causal network generation | Causal/counterfactual | Uses general LLMs | Lacks a specialised medical knowledge base | Providence St. Joseph Health (PSJH) clinical data warehouse | – |
| Gopalakrishnan et al., 2024 | BioBERT; DistilBERT; BERT; GPT-4; LLaMA | Causality extraction | Causal/counterfactual | Easy to implement | Narrow scope, limited to specific disease conditions | American Diabetes Association (ADA); US Preventive Services Task Force (USPSTF); American College of Obstetricians and Gynecologists (ACOG); American Academy of Family Physicians (AAFP); Endocrine Society | https://github.com/gseetha04/LLMs-Medicaldata |
| InferBERT (Wang et al., 2021) | ALBERT; Judea Pearl's do-calculus | Causal inference using do-calculus | Causal/counterfactual; mathematical | Establishes causal inference | Narrow scope, limited to specific disease conditions; highly restrictive input format | FAERS case reports from the PharmaPendium database | https://github.com/XingqiaoWang/DeepCausalPV-master |
| Kıcıman et al., 2023 | text-davinci-003; GPT-3.5-turbo; GPT-4 | Determines the direction of causality between pairs of variables | Causal/counterfactual | Highly accurate for large models | Limited reproducibility due to dependency on tailored prompts | Tübingen cause-effect pairs dataset | https://github.com/py-why/pywhy-llm |
| HuatuoGPT-o1 (Chen et al., 2024) | LLaMA-3.1-8B-Instruct; LLaMA-3.1-70B-Instruct | Supervised fine-tuning and PPO | Deductive | Instils multi-step reasoning in medical LLMs; built-in interpretability, as the LLM outputs reasoning traces alongside answers | Limited evaluation, covering only accuracy scores on medical MCQ benchmarks | Adapted from MedQA-USMLE and MedMCQA | https://github.com/FreedomIntelligence/HuatuoGPT-o1 |
| Med-R1 (Lai et al., 2025) | Qwen2-VL-2B | Supervised fine-tuning and GRPO | Deductive | Joint image-text and multi-task reasoning; built-in interpretability, as the LLM outputs reasoning traces alongside answers | Challenges the 'more thinking is better' assumption | OmniMedVQA | https://github.com/Yuxiang-Lai117/Med-R1 |
| MedVLM-R1 (Pan et al., 2025) | Qwen2-VL-2B | Supervised fine-tuning and GRPO | Deductive | Joint image-text reasoning; built-in interpretability, as the LLM outputs reasoning traces alongside answers | Limited evaluation, covering only accuracy scores on medical MCQ benchmarks | VQA-RAD; SLAKE; PathVQA; OmniMedVQA; PMC-VQA | https://huggingface.co/JZPeterPan/MedVLM-R1 |
| MedFound (Liu et al., 2025) | 176-billion-parameter LLM pretrained from scratch | Supervised fine-tuning and DPO | Deductive | Self-bootstrapped chain-of-thought fine-tuning; rigorous rubric-based human evaluation of reasoning traces; built-in interpretability, as the LLM outputs reasoning traces alongside answers | Proprietary EHR datasets are not fully open, hindering exact reproduction | MedCorpus; MedDX-FT; MedDX-Bench | https://github.com/medfound/medfound |
| DeepSeek-R1 (Guo et al., 2025) | DeepSeek-V3-Base | Supervised fine-tuning and GRPO | Deductive | Built-in interpretability, as the LLM outputs reasoning traces alongside answers | Pre-training and reasoning datasets are not open-sourced | – | https://github.com/deepseek-ai/DeepSeek-R1 |
Table 3
Comparison of reasoning evaluation paradigms in medical LLMs.
| Evaluation paradigm | Conceptual focus | Typical implementation and metrics |
| --- | --- | --- |
| Conclusion-based | Assesses correctness of the final answer only, without inspecting the reasoning path. | Automated scoring on Q&A benchmarks (e.g. MedQA, MedMCQA). Metrics: accuracy, exact match, F1. Fast and reproducible, but offers only high-level insight. |
| Rationale-based | Evaluates the logic chain or narrative explanation produced by the model; focuses on coherence, validity, and completeness of reasoning traces. | Manual expert review or rubric-based grading of CoT; automated graph checks (e.g. DAG similarity, causal-direction tests). Metrics: Bayesian Dirichlet score, normalised Hamming distance. |
| Mechanistic | Probes low-level internal signals to answer 'why did the model arrive here?'; targets feature attribution and internal attention contributions. | Explainable-AI toolkits (Integrated Gradients, SHAP, attention rollout). Outputs saliency maps or keyword heat-maps for clinician inspection. |
| Interactive | Treats evaluation as a dialogue or game, dynamically stressing the model in real time; explores the response space by challenging, re-prompting, or role-playing. | Game-theoretic tasks (e.g. debate, self-play). Rich insights but lower reproducibility; requires a human-in-the-loop or scripted agents. |
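As an illustration of one rationale-based metric named in the table, the sketch below computes a normalised Hamming distance between two DAG adjacency matrices. This is one common definition (edge-by-edge disagreement, excluding self-loops); the cited works may normalise differently, and the example graphs are invented:

```python
import numpy as np

def normalised_hamming_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of possible (off-diagonal) edge slots on which two DAGs disagree.

    `a` and `b` are binary adjacency matrices over the same node set:
    0.0 means identical edge sets, 1.0 means disagreement on every slot.
    """
    assert a.shape == b.shape, "graphs must share the same node set"
    off_diagonal = ~np.eye(a.shape[0], dtype=bool)  # DAGs have no self-loops
    return float(((a != b) & off_diagonal).sum() / off_diagonal.sum())

# Expert-drawn reference DAG vs. a model-generated DAG over three variables.
expert = np.array([[0, 1, 1],
                   [0, 0, 1],
                   [0, 0, 0]])
model = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
print(normalised_hamming_distance(expert, model))  # 1 mismatch / 6 slots ≈ 0.167
```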

Cite this article

Shamus Zi Yang Sim, Tyrone Chen (2025) Critique of impure reason: Unveiling the reasoning behaviour of medical large language models. eLife 14:e106187. https://doi.org/10.7554/eLife.106187