Student dropout and inconsistent academic performance continue to pose serious challenges in higher education, affecting both institutional outcomes and student success. Early identification of students at risk enables institutions to implement timely interventions and improve retention rates. This study presents a machine learning-based framework for predicting student academic outcomes and dropout risk using historical educational data.
The proposed model is developed using a dataset from the UCI Machine Learning Repository that includes academic, demographic, and socioeconomic attributes. Multiple machine learning algorithms, including Logistic Regression, Random Forest, and XGBoost, are implemented and evaluated for classifying student outcomes. Experimental results indicate that XGBoost achieves the best performance, with a test accuracy of 87.45%, outperforming the other models.
To enhance interpretability, Shapley Additive Explanations (SHAP) are used to examine feature contributions. The findings reveal that academic performance indicators, particularly second-semester results and the number of curricular units completed, play a significant role in predicting student dropout.
Overall, the proposed system provides an accurate and interpretable solution for early dropout detection, supporting data-driven decision-making in educational institutions.
Introduction
Student dropout and poor academic performance are significant challenges in higher education, often caused by academic, financial, and personal factors. Traditional identification methods are reactive and inefficient, whereas machine learning enables early prediction using student data, allowing institutions to take proactive measures like mentoring and support.
This study proposes a machine learning-based system that predicts both academic performance and dropout risk using a dataset with academic, demographic, and socioeconomic features. Models such as Logistic Regression, Random Forest, and XGBoost are implemented and compared, with an ensemble approach improving overall performance. The system also incorporates SHAP for interpretability and an interactive dashboard for clear visualization of insights.
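The model-comparison step described above can be sketched as follows. This is a minimal illustration using only scikit-learn: a synthetic dataset stands in for the UCI student-outcome data, and scikit-learn's GradientBoostingClassifier stands in for XGBoost, which exposes a compatible fit/predict interface.

```python
# Sketch of the model-comparison step. Assumptions: synthetic data replaces
# the UCI dataset, and GradientBoostingClassifier replaces XGBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student-outcome dataset.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and record its held-out accuracy.
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda t: -t[1]):
    print(f"{name}: {acc:.3f}")
```

In the actual system, `GradientBoostingClassifier` would be replaced by `xgboost.XGBClassifier`, and the dictionary-of-models pattern extends naturally to the ensemble comparison reported in the results.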
Data preprocessing includes handling missing values, encoding categorical variables, and feature scaling. Model evaluation is performed using metrics like accuracy, precision, recall, and F1-score. Among the models, XGBoost achieved the highest accuracy (around 87%), outperforming others.
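The preprocessing and evaluation steps above (imputation of missing values, categorical encoding, feature scaling, and the four reported metrics) can be combined into a single scikit-learn pipeline. The sketch below is illustrative only: the small in-memory table, the column names, and the synthetic label are assumptions, not the paper's actual data.

```python
# Sketch of the preprocessing + evaluation workflow. The DataFrame, column
# names, and label construction are illustrative stand-ins for the real data.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "grade_sem2": rng.normal(12, 3, n),           # numeric, will get gaps
    "units_approved": rng.integers(0, 8, n),      # numeric
    "scholarship": rng.choice(["yes", "no"], n),  # categorical
})
df.loc[rng.choice(n, 20, replace=False), "grade_sem2"] = np.nan  # missing values
# Synthetic binary "dropout" label for demonstration purposes.
y = ((df["grade_sem2"].fillna(df["grade_sem2"].mean())
      + df["units_approved"]) > 15).astype(int)

# Impute + scale numeric columns; one-hot encode categorical columns.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]),
     ["grade_sem2", "units_approved"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["scholarship"]),
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.25, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
acc = accuracy_score(y_te, pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, pred, average="binary", zero_division=0)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Wrapping imputation, encoding, and scaling inside the pipeline ensures these transformations are fitted on the training split only, avoiding leakage into the evaluation metrics.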
Overall, the system provides accurate predictions, interpretable results, and user-friendly insights, helping educational institutions identify at-risk students early and implement effective intervention strategies.
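The interpretability component in the proposed system is SHAP. As a dependency-light stand-in, the sketch below uses scikit-learn's permutation importance, which answers the same qualitative question (which features drive the model's predictions) with a coarser, global attribution; the `shap` package would replace it in the actual system. Feature names are illustrative, and the dataset is synthetic with the informative columns placed first by construction.

```python
# Feature-attribution sketch. The paper uses SHAP; permutation importance is
# shown here as a stand-in that likewise ranks features by their contribution.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in the first 3 columns,
# so the illustrative names below line up with the generated data.
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=1)
names = ["sem2_grade", "units_approved", "sem1_grade",   # informative
         "age", "application_order", "course_code"]      # noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=1)

# Rank features by mean drop in accuracy when permuted.
ranking = sorted(zip(names, result.importances_mean), key=lambda t: -t[1])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With SHAP, per-prediction attributions (rather than a single global ranking) become available, which is what supports the finding that second-semester results dominate individual dropout predictions.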
Conclusion
This study presents a machine learning-based framework for predicting student dropout risk using academic, demographic, and socio-economic data. Multiple models, including Logistic Regression, Random Forest, and XGBoost, were implemented and evaluated, with XGBoost demonstrating the best overall performance. The results indicate that ensemble and boosting techniques outperform traditional approaches in terms of accuracy and robustness. The integration of Shapley Additive Explanations (SHAP) enhances model interpretability by identifying the key features influencing predictions.
The proposed system provides a reliable and interpretable solution for early identification of at-risk students. It can assist educational institutions in implementing timely interventions, ultimately improving student retention and academic success.
References
[1] M. S. A. N. Araújo, P. J. G. Lisboa, and A. M. P. de Carvalho, “Predict Students’ Dropout and Academic Success,” UCI Machine Learning Repository, 2020. [Online]. Available: https://archive.ics.uci.edu/
[2] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016, pp. 785–794.
[3] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[4] D. W. Hosmer, S. Lemeshow, and R. X. Sturdivant, “Applied Logistic Regression,” 3rd ed., Wiley, 2013.
[5] S. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 4765–4774.
[6] C. Romero and S. Ventura, “Educational Data Mining: A Review of the State of the Art,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 40, no. 6, pp. 601–618, 2010.
[7] J. Han, M. Kamber, and J. Pei, “Data Mining: Concepts and Techniques,” 3rd ed., Morgan Kaufmann, 2012.
[8] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[9] A. Géron, “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow,” 2nd ed., O’Reilly Media, 2019.
[10] H. Abdi and L. J. Williams, “Principal Component Analysis,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.