Clinical text classification is an important task in Natural Language Processing (NLP) with far-reaching implications for healthcare processes. In this project, we categorize medical transcripts into their respective medical specialties.
To accomplish this, we employ Sequential Forward Selection (SFS), a feature selection technique chosen specifically for its dimensionality reduction capabilities. With SFS, we aim not only to improve classification performance but also to make pattern recognition more efficient, so that clinical text can be classified both quickly and accurately. This work underscores the importance of clinical text classification and, in particular, shows how SFS can streamline the process.
Introduction
This paper presents a Natural Language Processing (NLP)–based approach for classifying unstructured clinical text into appropriate medical specialties, addressing an urgent need in healthcare to extract value from large volumes of clinical notes. Clinical text classification is important for improving healthcare management, supporting clinical decision-making, and enhancing patient outcomes.
The project treats medical specialty as the target variable and clinical transcripts as input features. It involves preprocessing steps such as tokenization, lemmatization, stop-word removal, and vectorization using TF-IDF. Multiple machine learning models are considered, with performance evaluated using accuracy, precision, recall, and F1-score.
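As a concrete illustration of this preprocessing pipeline, the sketch below cleans the transcription text with NLTK (tokenization, stop-word removal, lemmatization) and vectorizes it with TF-IDF. The column names, the 5,000-term vocabulary cap, and the cleaning regex are assumptions for illustration, not the paper's exact settings.

```python
# Hedged sketch of the preprocessing stage described above.
# Assumptions: mtsamples.csv has 'transcription' and 'medical_specialty'
# columns, and a 5,000-term TF-IDF vocabulary is sufficient.
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, keep alphabetic tokens, drop stop words, lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens if t not in STOP_WORDS)

df = pd.read_csv("mtsamples.csv").dropna(subset=["transcription", "medical_specialty"])
corpus = df["transcription"].map(clean_text)

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(corpus)   # sparse document-term matrix
y = df["medical_specialty"]            # target: medical specialty
```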
A key objective is to improve efficiency, interpretability, and performance by applying Sequential Forward Selection (SFS) for feature selection. SFS reduces the high dimensionality of text data by iteratively selecting the most informative features, leading to simpler, faster, and more interpretable models. The selected features are then used to train a CatBoost classifier, chosen for its robustness, high performance, and ability to handle categorical data effectively.
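A minimal sketch of the forward-selection step is shown below, using scikit-learn's SequentialFeatureSelector with direction="forward". The scoring estimator (a lightweight logistic regression), the feature budget, and the cross-validation setting are assumptions made to keep the example tractable; forward selection over a full TF-IDF vocabulary is computationally expensive, so the budget would need tuning in practice. The CatBoost training itself appears in a later sketch.

```python
# Hedged sketch of Sequential Forward Selection over the TF-IDF features.
# A lightweight logistic regression scores candidate subsets; the selected
# columns are later fed to CatBoost (see the training sketch further below).
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=100,   # assumed budget, not the paper's setting
    direction="forward",        # start from an empty set and add features greedily
    cv=3,
    n_jobs=-1,
)
X_selected = sfs.fit_transform(X, y)   # keeps only the selected TF-IDF columns
```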
The system is implemented as a web-based application using Django, featuring user authentication, admin control, secure input handling, and clear output presentation. The architecture includes modules for data ingestion, preprocessing, feature selection, model training, evaluation, and deployment.
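One possible shape for the classification endpoint is sketched below as a single Django view. The artifact paths, template name, and the 'transcript' form field are hypothetical, and the real application (with authentication and admin control) would be considerably larger; the loaded artifacts are the ones persisted by the training sketch after the next paragraph.

```python
# Hypothetical Django view for real-time classification. Artifact paths,
# the template name, and the 'transcript' form field are assumptions.
import joblib
from django.shortcuts import render

VECTORIZER = joblib.load("artifacts/tfidf_vectorizer.joblib")
SELECTOR = joblib.load("artifacts/sfs_selector.joblib")
MODEL = joblib.load("artifacts/catboost_model.joblib")

def classify_transcript(request):
    prediction = None
    if request.method == "POST":
        text = request.POST.get("transcript", "")
        # Apply the same TF-IDF vectorizer and SFS selector used at training time.
        features = SELECTOR.transform(VECTORIZER.transform([text]))
        prediction = MODEL.predict(features.toarray()).ravel()[0]
    return render(request, "classify.html", {"prediction": prediction})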
The dataset used is sourced from Kaggle (mtsamples.csv). After preprocessing and feature selection, the data is split into training and testing sets, balanced using random undersampling, and used to train the CatBoost model. The trained model and vectorizer are saved and deployed to classify new clinical text inputs in real time.
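Continuing the earlier sketches, the snippet below covers the split, random undersampling via imbalanced-learn's RandomUnderSampler, CatBoost training, and persistence of the artifacts. Hyperparameters, random seeds, and file names are illustrative assumptions rather than the paper's settings.

```python
# Hedged sketch of the split / balance / train / save stage described above.
import joblib
from catboost import CatBoostClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, stratify=y, random_state=42
)

# Balance the training classes by randomly undersampling the majority classes.
X_train_bal, y_train_bal = RandomUnderSampler(random_state=42).fit_resample(
    X_train, y_train
)

model = CatBoostClassifier(iterations=500, learning_rate=0.1, verbose=False)
model.fit(X_train_bal.toarray(), y_train_bal)

# Persist the fitted artifacts for the Django app to load at request time.
joblib.dump(model, "artifacts/catboost_model.joblib")
joblib.dump(vectorizer, "artifacts/tfidf_vectorizer.joblib")
joblib.dump(sfs, "artifacts/sfs_selector.joblib")
```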
Evaluation results show that the CatBoost model achieved a high training accuracy of 99.15%, demonstrating the effectiveness of combining NLP preprocessing, Sequential Forward Selection, and gradient boosting for clinical text classification.
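For completeness, a held-out evaluation in the same setup might look like the following; using classification_report (per-class precision, recall, and F1 alongside accuracy) is an assumption about how the reported metrics were computed.

```python
# Hedged sketch of the evaluation step on the held-out test split.
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test.toarray()).ravel()   # flatten CatBoost's output
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```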
Conclusion
This study shows that sub-categorizing similar groups within the dataset can simplify the classification process by reducing the number of categories to be analyzed. Although manually engineered features may improve performance on this dataset, they are unlikely to generalize well to other clinical transcription datasets. Our findings indicate that additional data is necessary to classify the transcriptions accurately into their respective medical categories, as the limited size of the current dataset restricts the achievable accuracy.