Automatic assignment of ICD-10 diagnostic codes from free-text clinical narratives is a central task in healthcare analytics. While recent literature has focused heavily on deep learning and large language models (LLMs), classical machine learning methods remain essential for building transparent and reproducible baselines, especially when the data have quality issues. This paper presents a baseline study using Logistic Regression for ICD-10 classification of Spanish clinical text from the CodiEsp dataset. We describe a complete end-to-end pipeline covering dataset inspection, text preprocessing, TF–IDF feature extraction, one-vs-rest Logistic Regression modeling, and result analysis. Particular attention is given to practical challenges such as misaligned identifiers, missing text files, and multi-label imbalance. Experimental results on a cleaned subset of the data show that Logistic Regression provides interpretable decision boundaries and a reasonable macro-F1, but that performance is constrained by dataset structure and label sparsity. We conclude by outlining a path toward transformer-based and LLM-enhanced architectures that build directly on this baseline pipeline.
Introduction
This study develops a Logistic Regression baseline for automated ICD-10 coding of Spanish clinical texts using the CodiEsp dataset. ICD-10 codes, which are critical for billing, research, and quality assessment, are traditionally assigned manually, a process that is costly, slow, and prone to variability. We frame automatic coding as a multi-label text classification problem, in which each clinical document may map to multiple codes.
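The multi-label framing can be made concrete with a minimal sketch. It assumes a label file with one (document id, code) pair per tab-separated row; that column layout, the document ids, and the code assignments are illustrative assumptions and are not taken from CodiEsp. The helper names `read_label_sets` and `to_multi_hot` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical label file: one (document id, ICD-10 code) pair per row.
# The ids and code assignments are invented for illustration only.
LABEL_TSV = "doc_001\ti10\ndoc_001\te11.9\n\ndoc_002\tj18.9\ndoc_003\ti10\n"


def read_label_sets(tsv_text):
    """Group (doc_id, code) rows into one set of codes per document."""
    codes_by_doc = defaultdict(set)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank rows and rows that lack an id or a code.
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            continue
        codes_by_doc[row[0].strip()].add(row[1].strip().lower())
    return dict(codes_by_doc)


def to_multi_hot(codes_by_doc):
    """Turn per-document code sets into 0/1 vectors over a sorted code list."""
    vocab = sorted({c for codes in codes_by_doc.values() for c in codes})
    rows = {
        doc: [int(code in codes) for code in vocab]
        for doc, codes in codes_by_doc.items()
    }
    return vocab, rows


codes_by_doc = read_label_sets(LABEL_TSV)
vocab, rows = to_multi_hot(codes_by_doc)
print(vocab)
for doc, row in rows.items():
    print(doc, row)
```

Each document becomes one binary vector with an entry per code, so a document with two codes has two entries set to 1. This is the target representation that a one-vs-rest classifier consumes.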
The pipeline begins with data cleaning and alignment to handle irregular TSV formats, missing text files, and misaligned labels. Text preprocessing consists of normalization and tokenization, followed by TF–IDF feature extraction and one-vs-rest Logistic Regression, which trains one binary classifier per code. Experiments use an 80/20 train-test split, and evaluation focuses on macro-averaged precision, recall, and F1-score, which weight every code equally so that rare codes are not masked by frequent ones.
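The steps above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the exact configuration used in the experiments: the toy texts, ids, and labels are invented, the `align` helper is hypothetical, and the vectorizer and classifier settings (`strip_accents`, `max_iter`, `random_state`) are choices made for the sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented stand-ins for loaded data: texts and code sets keyed by document id.
TEXTS = {
    "d01": "Paciente con hipertensión arterial mal controlada.",
    "d02": "Diabetes mellitus tipo 2 en tratamiento con metformina.",
    "d03": "Neumonía adquirida en la comunidad, fiebre y tos.",
    "d04": "Hipertensión arterial y diabetes mellitus tipo 2.",
    "d05": "Tos productiva y fiebre; radiografía compatible con neumonía.",
    "d06": "Control de hipertensión arterial, tensión elevada.",
    "d07": "Diabetes tipo 2 con mal control glucémico.",
    "d08": "Neumonía basal derecha con fiebre alta.",
    "d09": "Hipertensión arterial de larga evolución.",
    "d10": "Diabetes mellitus tipo 2 e hipertensión arterial.",
    "d11": "",  # simulates an empty text file
}
CODES = {
    "d01": {"i10"}, "d02": {"e11.9"}, "d03": {"j18.9"},
    "d04": {"i10", "e11.9"}, "d05": {"j18.9"}, "d06": {"i10"},
    "d07": {"e11.9"}, "d08": {"j18.9"}, "d09": {"i10"},
    "d10": {"e11.9", "i10"},
    "d12": {"i10"},  # simulates a label row whose text file is missing
}


def align(texts, codes):
    """Keep ids that have both a non-empty text and labels; report the rest."""
    keep = sorted(i for i in texts.keys() & codes.keys() if texts[i].strip())
    dropped = sorted((texts.keys() | codes.keys()) - set(keep))
    return keep, dropped


ids, dropped = align(TEXTS, CODES)
docs = [TEXTS[i] for i in ids]

# Multi-hot label matrix: one column per code, one row per document.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([CODES[i] for i in ids])

# 80/20 split; the fixed seed only makes the sketch repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    docs, Y, test_size=0.2, random_state=0
)

# TF-IDF features feeding one binary Logistic Regression per code.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, strip_accents="unicode"),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Macro averaging gives every code the same weight regardless of frequency.
macro_p = precision_score(Y_test, Y_pred, average="macro", zero_division=0)
macro_r = recall_score(Y_test, Y_pred, average="macro", zero_division=0)
macro_f1 = f1_score(Y_test, Y_pred, average="macro", zero_division=0)

print("dropped ids:", dropped)
print("codes:", list(mlb.classes_))
print(f"macro P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f}")
```

Reporting the dropped identifiers alongside the scores keeps the size of the cleaned subset explicit. Scores on a toy corpus of this size carry no meaning and serve only to show the evaluation calls.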
Results indicate moderate performance, with frequent codes predicted more accurately than rare ones. Error analysis highlights challenges from short or ambiguous texts, label noise, and the long-tail distribution of ICD-10 codes. While Logistic Regression is interpretable and efficient, it lacks deep contextual understanding.
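The effect of the long tail on macro-averaged scores can be illustrated with a small worked example. The gold and predicted code sets below are invented, and the helper names are hypothetical. One simplification is assumed: the macro average is taken over codes that occur in the gold or predicted sets, whereas library implementations typically average over the full label space.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_code_counts(gold, pred):
    """Count true positives, false positives, and false negatives per code."""
    counts = {}
    for g, p in zip(gold, pred):
        for code in g | p:
            tp, fp, fn = counts.get(code, (0, 0, 0))
            if code in g and code in p:
                tp += 1
            elif code in p:
                fp += 1
            else:
                fn += 1
            counts[code] = (tp, fp, fn)
    return counts


def macro_micro_f1(gold, pred):
    """Macro F1 averages per-code F1; micro F1 pools counts across codes."""
    counts = per_code_counts(gold, pred)
    if not counts:
        return 0.0, 0.0
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


# Invented example: the frequent code is always found, the rare ones never.
gold = [{"i10"}, {"i10"}, {"i10"}, {"i10", "e11.9"}, {"j18.9"}]
pred = [{"i10"}, {"i10"}, {"i10"}, {"i10"}, set()]

macro, micro = macro_micro_f1(gold, pred)
print(f"macro-F1={macro:.3f} micro-F1={micro:.3f}")
```

In this example the frequent code is predicted perfectly and the two rare codes are missed, so the micro-averaged F1 is 0.8 while the macro-averaged F1 is about 0.333. This gap is why macro-averaged scores are sensitive to performance on rare codes.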
The study provides a reproducible, interpretable baseline, paving the way for future work with transformer-based models and LLMs, which can better capture context in clinical narratives and improve automated coding accuracy.
Conclusion
This paper presented a Logistic Regression baseline for ICD-10 classification on Spanish clinical text from the CodiEsp dataset. By documenting dataset inspection, preprocessing, feature extraction, model training, and error analysis, we provide a transparent foundation for further research. While the performance of the linear model is limited by dataset quality and the inherent complexity of clinical language, the insights gained from this baseline can guide the development of more powerful transformer-based and LLM-driven systems.