The growing use of Large Language Models (LLMs) in healthcare has produced meaningful advances in diagnostic assistance and clinical documentation. However, the persistent risk of medical hallucinations remains a serious barrier to broader adoption. Detecting fabricated or harmful clinical outputs requires a reliable foundation of factual correctness, and establishing that foundation in medicine is far more difficult than it first appears. This paper examines what we call the “crisis of ground truth” in medical AI evaluation. We review the tools and methods used to verify AI outputs, organizing the literature around four interconnected themes: the limitations of traditional lexical metrics, the circular reasoning problem in LLM-as-a-judge setups, the challenges of building useful domain-specific benchmarks, and the need to rethink what clinical truth actually means for evaluation purposes. Static benchmarks are highly susceptible to data contamination and struggle to capture multi-turn clinical reasoning. Scalable automated alternatives that use models to judge other models risk validating outputs against themselves rather than against verified medical knowledge. Through thematic analysis of current work, including frameworks such as CLEVER, MedHallBench, and risk-sensitive evaluation methods, we show that automated evaluators can catch obvious factual errors but consistently miss the subtle reasoning failures and safety-critical gaps that clinical environments require. We argue that resolving the ground truth problem requires hybrid evaluation architectures that combine high-throughput automated checks with structured, expert-led human review at key decision points.
Introduction
This review examines how large language models (LLMs) are being used in medicine and why their tendency to hallucinate, that is, to produce confident yet incorrect outputs, is dangerous in clinical contexts. A central issue is the “ground truth crisis”: medicine has no stable or reliable standard for measuring factual correctness, because its knowledge is complex, evolving, and often embedded in expert consensus. Traditional evaluation methods (static exams and lexical similarity metrics such as ROUGE and BLEU) are insufficient and often misleading, because models can memorize static test items and lexical overlap does not capture semantic correctness.
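The weakness of lexical overlap can be illustrated with a minimal sketch. The function below is a simplified ROUGE-1-style unigram F1 written for illustration, not the official ROUGE implementation, and the three sentences are invented examples rather than data from any reviewed study or clinical guidance.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1-style score: F1 over overlapping unigrams."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, invented for this illustration only.
reference = "administer 5 mg of warfarin once daily"
unsafe = "administer 50 mg of warfarin once daily"      # tenfold dose error
paraphrase = "give the patient warfarin 5 mg every day"  # same meaning, new wording

unsafe_score = unigram_f1(reference, unsafe)          # 6 of 7 tokens match
paraphrase_score = unigram_f1(reference, paraphrase)  # 3 tokens match

print(f"unsafe candidate:     {unsafe_score:.3f}")
print(f"faithful paraphrase:  {paraphrase_score:.3f}")
```

Under these assumptions the candidate containing the dosing error scores higher than the faithful paraphrase, because the metric counts shared tokens and has no notion of which token carries the clinical meaning.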
The review analyzes 28 studies and identifies four main directions in current research: improving factual evaluation metrics beyond lexical similarity, using LLMs to judge other LLMs (which introduces bias and circular validation problems), developing more realistic multi-turn and domain-specific benchmarks, and rethinking ground truth as a structural, ontology-based constraint rather than a fixed dataset.
Key findings show that current hallucination detection tools are limited: automated judges can be biased, static benchmarks fail to reflect real clinical reasoning, and single-turn evaluations miss the complexity of real medical dialogue. The paper argues for hybrid solutions that combine structured medical knowledge systems with human expert oversight.
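One way to picture such a hybrid arrangement is a routing rule that lets automated checks settle clear cases and sends everything else to clinicians. The sketch below is an assumption made for illustration: the `AutoCheck` fields, the `route` function, and both thresholds are hypothetical and are not taken from CLEVER or any other reviewed framework.

```python
from dataclasses import dataclass


@dataclass
class AutoCheck:
    """Result of a hypothetical automated evaluation pass over one model output."""
    output_id: str
    judge_score: float  # assumed automated judge score in [0, 1]
    high_risk: bool     # assumed flag for safety-critical content (e.g. dosing)


def route(check: AutoCheck,
          accept_threshold: float = 0.9,
          reject_threshold: float = 0.3) -> str:
    """Decide whether an output is settled automatically or escalated.

    Thresholds are illustrative placeholders, not validated values.
    """
    # Safety-critical content always goes to an expert, whatever the score.
    if check.high_risk:
        return "expert_review"
    if check.judge_score >= accept_threshold:
        return "auto_accept"
    if check.judge_score < reject_threshold:
        return "auto_reject"
    # Ambiguous middle band: the automated judge is not trusted to decide.
    return "expert_review"


checks = [
    AutoCheck("note-1", judge_score=0.95, high_risk=False),
    AutoCheck("note-2", judge_score=0.95, high_risk=True),
    AutoCheck("note-3", judge_score=0.55, high_risk=False),
    AutoCheck("note-4", judge_score=0.10, high_risk=False),
]
decisions = {c.output_id: route(c) for c in checks}
print(decisions)
```

The design choice worth noting is that the risk flag is checked before the score, so a confident automated judge cannot bypass human review on safety-critical content.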
Conclusion
This review has mapped the current landscape of tools and frameworks being used to address the ground truth crisis in medical AI. We have shown how the field’s historical reliance on static benchmarks and lexical metrics falls short when it comes to identifying clinical hallucinations. The LLM-as-a-judge paradigm offers a practical path to scale but introduces evaluation biases that demand careful scrutiny. The most promising direction shifts validation away from isolated fact-checking and toward assessing whether a model’s outputs comply with the structural logic of constrained medical workflows. Resolving this crisis ultimately requires accepting that absolute binary truth in medicine is rarely available. Evaluation systems should instead optimize for clinical safety, support dynamic contextual reasoning, and incorporate meaningful human oversight at the points where it matters most.
References
[1] V. Kocaman et al., “Clinical Large Language Model Evaluation by Expert Review (CLEVER): Framework Development and Validation,” JMIR AI, 2025. doi: 10.2196/72153
[2] T. Miller et al., “HumanELY: Human evaluation of large language models in healthcare: gaps, challenges, and the need for standardization,” npj Health Systems, 2025.
[3] S. Doshi, “Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice,” arXiv preprint, 2026. doi: 10.48550/arXiv.2602.07319
[4] S. Gao, J. H. Lau, and J. Qi, “Beyond Seen Data: Improving KBQA Generalization Through Schema-Guided Logical Form Generation,” in Proc. EMNLP, 2025. doi: 10.48550/arXiv.2502.12737
[5] S. Pandit et al., “MedHallu-Bench: A benchmark for medical hallucinations,” arXiv preprint, 2024. doi: 10.48550/arXiv.2412.18947
[6] D. Janiak et al., “The illusion of progress: Re-evaluating hallucination detection in LLMs,” Proc. EMNLP, 2025. doi: 10.48550/arXiv.2508.08285
[7] E. Asgari et al., “A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation,” npj Digital Medicine, 8(1), 274, 2025. doi: 10.1038/s41746-025-01670-7
[8] Y. Kim et al., “Medical hallucination in foundation models,” Medical Machine Learning, 2025.
[9] L. Zheng et al., “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena,” in Proc. NeurIPS, 2023. doi: 10.48550/arXiv.2306.05685
[10] R. Williams et al., “Human evaluators vs. LLM judges in clinical decision support,” Nature Medicine, 2025.
[11] T. Wang et al., “Bioengineering perspectives on LLMs,” Bioengineering, 2025.
[12] S. Li et al., “ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions,” arXiv preprint, 2026. arXiv:2603.11281
[13] M. Eriksson et al., “The Swedish medical LLM benchmark,” Frontiers in Medicine, 2025.
[14] L. Chen et al., “Process vs. outcome in hallucination detection,” arXiv preprint, 2025. arXiv:2503.04567
[15] R. Gupta et al., “Semantic illusion in QA models,” arXiv preprint, 2025. arXiv:2501.08942
[16] A. Brown et al., “Assessing LLM ability in grading medical notes,” JMIR, 2025.
[17] Y. Wang et al., “Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB),” npj Digital Medicine, 2025.
[18] M. Croxford et al., “Evaluating clinical AI summaries with LLM judges,” Health Data Science, 2025.
[19] S. Jain et al., “Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations,” arXiv preprint, 2025. arXiv:2510.11822
[20] D. Fan et al., “HalluHard: A Hard Multi-Turn Hallucination Benchmark,” arXiv preprint, 2026. doi: 10.48550/arXiv.2602.01031
[21] K. Zhu et al., “Can We Trust AI Doctors? A Survey of Medical Hallucination in Large Language and Large Vision-Language Models,” in Findings of ACL, 2025. doi: 10.18653/v1/2025.findings-acl.350