Early identification of Parkinson's Disease (PD) helps doctors provide better care and treatment. Analyzing vocal patterns is a straightforward, non-invasive way to recognize the initial symptoms of the condition. Despite this, many existing detection systems struggle with two persistent issues: unequal distribution of class samples and high-dimensional feature spaces. This paper introduces an enhanced machine learning framework that applies Borderline-SMOTE to address data imbalance by synthesizing samples within challenging classification zones, that is, near the class boundary. A feature reduction step is incorporated to minimize redundancy, and a range of classifiers is evaluated against each other. The Decision Tree classifier demonstrates superior performance on the processed dataset. The model is evaluated using accuracy, precision, recall, and F1-score, achieving 98.67% accuracy, 98.69% precision, 98.66% recall, and 98.67% F1-score. These outcomes stem from improved handling of skewed data distributions and enhanced pattern recognition. The result is a transparent and dependable solution for speech-based early PD identification.
Introduction
Parkinson’s Disease (PD) is a progressive neurological disorder that affects motor and vocal functions, making early detection important but difficult due to reliance on specialist diagnosis and expensive clinical tests. This research explores an alternative approach using machine learning on vocal (speech) data, since voice changes often appear early and can be captured easily and non-invasively. However, challenges such as class imbalance (far more healthy samples than PD cases), high-dimensional features, and lack of interpretability reduce the effectiveness of conventional models.
To address these issues, the study proposes a structured machine learning pipeline that includes preprocessing, class balancing using Borderline-SMOTE, feature selection using Recursive Feature Elimination (RFE), and comparison of multiple classifiers such as SVM, Random Forest, KNN, XGBoost, and Decision Tree. After evaluating performance using accuracy, precision, recall, and F1-score, the Decision Tree model emerges as the best-performing and most interpretable classifier. To further improve transparency, SHAP analysis is used to explain which vocal features contribute most to predictions, making the system more suitable for clinical use.
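As a rough illustration of the pipeline described above, the following sketch chains scaling, RFE, and several of the named classifiers using scikit-learn. It is a minimal sketch under stated assumptions, not the study's implementation: the data are synthetic (generated with make_classification), and the RFE ranking estimator, the number of retained features (10), the 80/20 split, and the weighted metric averaging are all illustrative choices. Borderline-SMOTE and XGBoost are left out so that the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a table of vocal features (NOT the study's data):
# 300 samples, 22 features, roughly a 3:1 class ratio (all sizes are arbitrary).
X, y = make_classification(
    n_samples=300, n_features=22, n_informative=8,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def build_pipeline(classifier, n_features=10):
    """Scale -> RFE (features ranked by a Decision Tree) -> classifier.

    The ranking estimator and n_features are assumptions for illustration.
    """
    selector = RFE(DecisionTreeClassifier(random_state=0),
                   n_features_to_select=n_features)
    return make_pipeline(StandardScaler(), selector, classifier)


candidates = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, clf in candidates.items():
    model = build_pipeline(clf).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted"),
        "recall": recall_score(y_test, y_pred, average="weighted"),
        "f1": f1_score(y_test, y_pred, average="weighted"),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

In a complete pipeline, oversampling would be applied to the training split only, so that synthetic samples never reach the test set.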
The results show that proper handling of imbalanced data and feature reduction significantly improves classification performance and reliability. Overall, the proposed system provides an efficient, interpretable, and practical solution for early Parkinson’s Disease detection using voice-based machine learning analysis.
Conclusion
This work presents a reliable and easy-to-understand system for early PD detection by analyzing speech patterns. This framework was built to systematically address three fundamental limitations that are commonly encountered in medical machine learning applications. Class imbalance was resolved using Borderline-SMOTE, which directs synthetic sample generation toward the most critical regions of the feature space. The dimensionality of the input data was reduced through Recursive Feature Elimination, retaining only the attributes with the strongest predictive relevance. Model transparency was achieved by incorporating SHAP-based explainability, which maps the role of each feature in making a prediction. These three components work in concert to produce a system that is both technically sound and practically applicable.
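To make the idea of directing synthetic samples toward critical regions concrete, the sketch below implements the core of Borderline-SMOTE in NumPy. It is a simplified illustration, not the code used in this work: the function names, the neighbourhood size k, and the toy data are assumptions chosen for readability. A minority sample is treated as a "danger" point when at least half, but not all, of its k nearest neighbours belong to the majority class, and new samples are interpolated only from such points.

```python
import numpy as np


def find_danger_points(X, y, minority=1, k=5):
    """Indices of minority samples whose k nearest neighbours are mostly,
    but not entirely, majority samples (the 'danger' set)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    danger = []
    for i in np.flatnonzero(y == minority):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                       # exclude the point itself
        nearest = np.argsort(dist)[:k]
        n_major = int(np.sum(y[nearest] != minority))
        if k / 2 <= n_major < k:               # n_major == k is treated as noise
            danger.append(int(i))
    return danger


def borderline_smote(X, y, minority=1, k=5, n_new=None, seed=0):
    """Generate synthetic minority samples by interpolating from danger points."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X_min = X[y == minority]
    danger = find_danger_points(X, y, minority, k)
    if n_new is None:                          # default: balance the two classes
        n_new = max(0, int(np.sum(y != minority)) - len(X_min))
    if not danger or len(X_min) < 2:
        return np.empty((0, X.shape[1]))
    synthetic = []
    for _ in range(n_new):
        p = X[rng.choice(danger)]
        dist = np.linalg.norm(X_min - p, axis=1)
        neighbours = np.argsort(dist)[1:k + 1]  # drop the closest entry (p itself)
        q = X_min[rng.choice(neighbours)]
        synthetic.append(p + rng.random() * (q - p))  # point on the segment p -> q
    return np.array(synthetic).reshape(-1, X.shape[1])


# Toy data: four minority points on a line, six majority points to their right.
X_demo = np.array([[0, 0], [1, 0], [2, 0], [3, 0],
                   [3.3, 0], [3.6, 0], [5, 0], [6, 0], [7, 0], [8, 0]])
y_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

print(find_danger_points(X_demo, y_demo, k=3))   # [3]: only the point next to the majority
synthetic = borderline_smote(X_demo, y_demo, k=3)
print(synthetic.shape)                           # (2, 2): two samples balance 4 vs 6
```

Only the minority point at (3, 0), which sits beside the majority samples, seeds new data; the three minority points further from the boundary are left alone.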
A total of eight models were examined to select the most appropriate classifier for this task. Among them, the Decision Tree model performed best, achieving the highest accuracy, precision, recall, and F1-score. This outcome indicates that targeting the most difficult classification cases and eliminating uninformative features produces tangible improvements in model learning. The addition of SHAP analysis further strengthens the system by giving clinicians a clear understanding of how individual vocal attributes influence each diagnostic outcome, thereby increasing trust and usability in real medical environments. The overall results of this study affirm that combining principled data balancing, targeted feature selection, and an inherently interpretable model can substantially improve the quality of automated PD detection.
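For reference, the four evaluation metrics can be computed from the counts of true and false positives and negatives. The sketch below uses plain Python and treats one class as positive. This is an assumption made for illustration, since the averaging scheme behind the reported figures is not stated here, and the function name and example labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
# accuracy 0.8, precision 0.75, recall 0.75, f1 0.75

# On skewed data, a model that never predicts the positive class still
# scores high accuracy while recall drops to zero.
skewed_true = [1] + [0] * 9
always_negative = [0] * 10
print(binary_metrics(skewed_true, always_negative))
# accuracy 0.9, precision 0.0, recall 0.0, f1 0.0
```

The second example shows why accuracy alone is not sufficient on imbalanced data and why precision, recall, and F1-score are reported alongside it.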
The proposed system is not only high-performing but also straightforward to validate and explain. These qualities are particularly important when deploying AI-based tools in healthcare contexts, where accountability and transparency are non-negotiable.
Looking ahead, several directions exist for extending this work. Testing the framework on larger and more demographically diverse datasets would strengthen confidence in its generalizability. Incorporating complementary data modalities such as gait analysis, neuroimaging, or handwriting patterns alongside speech features could provide a richer diagnostic signal and further boost performance. Ultimately, translating this pipeline into a real-time clinical system would be a major step, allowing quick, low-cost, and non-invasive screening of PD directly in healthcare settings.