Predicting student academic success is essential for providing early support to at-risk learners. However, high-accuracy models such as Extreme Gradient Boosting (XGBoost) are often underutilized by educators due to their opaque decision-making processes. This paper implements a predictive framework using the UCI Student Performance dataset (649 records). We evaluate Random Forest, XGBoost, Logistic Regression, and a Multi-Layer Perceptron (MLP) baseline, with our XGBoost model achieving a classification accuracy of 0.892 and an F1-score of 0.936. To provide transparency, we integrate SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-Agnostic Explanations) as Explainable AI (XAI) methods. Our analysis shows that previous academic results (G2) and past failures are the most significant predictors. We quantitatively compare these XAI methods, finding that SHAP demonstrates near-perfect stability (≈1.000) compared to LIME (0.846). A demographic parity evaluation confirms equitable prediction across socioeconomic groups. This study provides a practical framework for educational decision support, integrating predictive performance with the interpretability required for institutional trust.
Introduction
This study develops an Explainable Artificial Intelligence (XAI)-based Educational Data Mining (EDM) framework to predict student academic outcomes (Pass/Fail) while providing transparent explanations for the predictions. Traditional machine learning models often act as "black boxes," making educators hesitant to trust and act on their predictions. To address this issue, the research combines predictive modeling with explainability techniques and fairness evaluation.
Using the UCI Student Performance dataset, which contains records for 649 Portuguese students, we compare four machine learning models: XGBoost, Random Forest, Logistic Regression, and Multi-Layer Perceptron (MLP). The dataset was preprocessed by converting final grades into Pass/Fail categories and applying SMOTE to handle class imbalance. Model performance was evaluated using accuracy, precision, recall, and F1-score.
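As a rough illustration of the labeling and evaluation steps described above, the following sketch binarizes final grades and computes the four reported metrics from confusion counts. The pass threshold of 10 on the 0-20 grade scale, the function names, and the toy grades and predictions are all assumptions made for this example; they are not taken from the study's pipeline, and SMOTE resampling is omitted.

```python
def to_pass_fail(final_grade, threshold=10):
    """Map a final grade on the 0-20 scale to 1 (Pass) or 0 (Fail).

    The threshold of 10 is an assumption for illustration only.
    """
    return 1 if final_grade >= threshold else 0


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data (hypothetical): eight final grades and eight model predictions.
grades = [14, 9, 11, 0, 17, 10, 8, 13]
y_true = [to_pass_fail(g) for g in grades]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because Pass is treated as the positive class and is typically the majority class in this kind of data, F1 can exceed accuracy, which matches the pattern in the reported figures.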
The study applies SHAP for global feature importance analysis and LIME for local, instance-level explanations. A key contribution is the introduction of a stability-based comparison of SHAP and LIME, measuring explanation consistency across multiple runs using Spearman rank correlation. Results show that SHAP achieved perfect stability (1.000) and generated explanations much faster, while LIME provided slightly higher local fidelity but exhibited lower stability (0.846) and significantly higher computational cost.
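One plausible way to compute such a stability score is sketched below: each run yields a vector of feature-importance magnitudes, and stability is taken as the mean pairwise Spearman rank correlation across runs. The averaging over all run pairs, the helper names, and the toy importance values are assumptions for illustration; the study's exact protocol may differ.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def stability(runs):
    """Mean pairwise Spearman correlation across explanation runs."""
    pairs = list(combinations(runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Hypothetical importance vectors for four features over three runs.
deterministic_runs = [[0.40, 0.25, 0.10, 0.05]] * 3
sampled_runs = [
    [0.40, 0.25, 0.10, 0.05],
    [0.38, 0.09, 0.27, 0.04],
    [0.41, 0.24, 0.06, 0.11],
]

stable_score = stability(deterministic_runs)
noisy_score = stability(sampled_runs)
print(f"deterministic: {stable_score:.3f}, sampled: {noisy_score:.3f}")
```

An explainer that returns the same ranking on every run scores 1.0 under this measure, while one that relies on random perturbation sampling scores lower whenever its feature rankings shift between runs.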
Experimental results indicate that Logistic Regression achieved the highest test accuracy (91.5%) and F1-score (95.0%), while XGBoost and Random Forest produced comparable performance. Feature importance analysis revealed that second-period grades (G2) and previous academic failures were the strongest predictors of student success. However, student absenteeism emerged as the most actionable factor for educational interventions because it can be directly influenced by schools.
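To make the idea of a per-feature attribution concrete, the sketch below computes exact Shapley values for a toy scoring function by averaging each feature's marginal contribution over all feature orderings. This is the brute-force definition, not the optimized algorithms used by the SHAP library, and the linear toy model, its coefficients, and the baseline student are invented for illustration; they are not results from the study.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every feature ordering.

    Features not yet 'revealed' in an ordering keep their baseline value.
    The cost grows factorially, so this only suits a handful of features.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous_output = model(current)
        for name in order:
            current[name] = instance[name]
            new_output = model(current)
            totals[name] += new_output - previous_output
            previous_output = new_output
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    """Hypothetical pass score; the coefficients are illustrative only."""
    return 0.05 * x["G2"] - 0.10 * x["failures"] - 0.01 * x["absences"]


student = {"G2": 14, "failures": 0, "absences": 2}
reference = {"G2": 10, "failures": 1, "absences": 6}

phi = shapley_values(toy_model, student, reference)
for name, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the model output for the student and for the reference point, which is the additivity property that makes these values convenient for ranking features by mean absolute contribution.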
A fairness assessment based on parental education levels showed only minor disparities in prediction outcomes, suggesting that the model does not significantly disadvantage students from lower socioeconomic backgrounds. An ablation study removing the G2 feature demonstrated that predictive performance remained strong, confirming the model’s usefulness for early-warning systems before later academic grades become available.
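A demographic parity check of this kind can be sketched as follows: compute the rate of positive (Pass) predictions within each group and report the largest gap between groups. The group labels, the toy predictions, and the use of the max-minus-min gap as the summary statistic are assumptions for illustration, not details confirmed by the study.

```python
from collections import defaultdict


def positive_rates(predictions, groups):
    """Share of positive (1) predictions within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for pred, group in zip(predictions, groups):
        counts[group][0] += pred
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical predictions for students grouped by parental education.
preds = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
parent_edu = ["low"] * 5 + ["high"] * 5

rates = positive_rates(preds, parent_edu)
gap = demographic_parity_difference(preds, parent_edu)
print(rates, f"gap={gap:.2f}")
```

A gap near zero indicates that the model predicts Pass at similar rates across groups; what counts as an acceptably small gap is a policy decision rather than something the metric itself settles.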
The study concludes that combining predictive analytics with explainable AI improves trust and transparency in educational decision-making. SHAP is recommended for institutional reporting due to its stability and efficiency, while LIME is more suitable for detailed individual student assessments. Despite promising results, limitations include the dataset’s restricted scope, potential data leakage from highly correlated grade features, lack of temporal analysis, and the use of only a shallow neural network baseline. Future work should explore larger datasets, longitudinal student data, and more advanced deep learning models.
Conclusion
This paper implemented a reproducible machine learning pipeline that achieves 0.892 accuracy and an F1-score of 0.936 with XGBoost, and a further validated ablation result of 0.838 accuracy without the G2 feature, confirming that the pipeline retains strong predictive power even in early-intervention scenarios. We used SHAP (stability: ≈1.000) and LIME (stability: 0.846) to transform "black-box" predictions into diagnostic insights for educators, and conducted the first quantitative stability- and efficiency-based comparison of these two XAI methods within the EDM domain. Our analysis confirms that while academic history (G2) is the strongest statistical predictor, behavioral features like absenteeism provide earlier and more actionable signals for timely institutional intervention.
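The size of the ablation effect can be read in two ways, as the short calculation below shows using only the two accuracies reported above. The variable names are illustrative.

```python
full_accuracy = 0.892      # XGBoost with all features (reported)
ablated_accuracy = 0.838   # XGBoost without G2 (reported)

# Absolute drop, expressed in percentage points.
absolute_drop_points = (full_accuracy - ablated_accuracy) * 100

# Relative drop, as a share of the full model's accuracy.
relative_drop_percent = (full_accuracy - ablated_accuracy) / full_accuracy * 100

print(f"absolute: {absolute_drop_points:.1f} points")
print(f"relative: {relative_drop_percent:.1f}%")
```

The absolute drop is 5.4 percentage points, while the relative drop is about 6.1% of the full model's accuracy, so it is worth stating which of the two is meant when the figure is quoted.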
Unlike prior studies that focus solely on predictive accuracy [7, 8, 15], this work emphasizes explanation reliability as a critical factor for institutional deployment. Our findings suggest that stability metrics such as Spearman rank correlation of SHAP values should be incorporated as a standard evaluation criterion in future XAI-based educational systems.
Future work will pursue three directions. First, building on the completed G2 ablation (Table IV), which shows a 5.4-percentage-point accuracy drop (from 0.892 to 0.838), we will measure how SHAP stability changes across the ablation conditions and validate the behavioral-only model on a held-out institutional dataset. Second, cross-dataset validation on the Open University Learning Analytics Dataset (OULAD) [10] will test the generalizability of both the predictive and explainability findings beyond the UCI benchmark. Third, a real-time student engagement dashboard will be developed, and Large Language Model (LLM) post-processing will be explored to convert SHAP values into natural language explanations accessible to students and parents without technical backgrounds. The full implementation code for this study is available at: https://github.com/kashishtomar-11/xai-student-performance (link anonymized for review; will be made public upon acceptance).
References
[1] P. Cortez and A. Silva, "Using data mining to predict secondary school student performance," in Proc. 5th Annual Future Business Technology Conference (FUBUTEC), Porto, Portugal, 2008, pp. 5–12.
[2] K. Kesgin, S. Kiraz, S. Kosunalp, and B. Stoycheva, "Beyond performance: Explaining and ensuring fairness in student academic performance prediction with machine learning," Applied Sciences, vol. 15, no. 15, art. no. 8409, 2025.
[3] I. T. Adom, C. O. Julius, S. Akuma, and S. U. Otor, "Comparative analysis of explainable AI frameworks (LIME and SHAP) in student performance prediction," International Journal of Information Engineering and Electronic Business (IJIEEB), vol. 17, no. 6, pp. 60–70, 2025.
[4] X. Xie et al., "Explainable AI in educational data mining: Transparent predictions for student performance," IEEE Access, vol. 10, pp. 33132–33143, 2022.
[5] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proc. 31st International Conference on Neural Information Processing Systems (NIPS), 2017, pp. 4765–4774.
[6] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?': Explaining the predictions of any classifier," in Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[7] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[8] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[9] C. Romero and S. Ventura, "Educational data mining: A review of the state of the art," IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 40, no. 6, pp. 601–618, 2010.
[10] J. Kuzilek, M. Hlosta, and Z. Zdrahal, "Open University Learning Analytics dataset," Scientific Data, vol. 4, art. no. 170171, 2017.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[12] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd International Conference on Learning Representations (ICLR), San Diego, CA, 2015.
[14] A. Fernández, S. García, F. Herrera, and N. V. Chawla, "SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary," Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018.
[15] S. Dreiseitl and L. Ohno-Machado, "Logistic regression and artificial neural network classification models: A methodology review," Journal of Biomedical Informatics, vol. 35, no. 5–6, pp. 352–359, 2002.
[16] D. Durães, B. Lacerda, R. Bezerra, and P. Novais, "Predictive analytics in education: A comparative analysis of machine learning models for predicting student performance," in Proc. EPIA 2024, Lecture Notes in Computer Science, vol. 14967, Springer, 2025, pp. 145–157.
[17] D. Frontera, A. Ramos-Pulido, and M. Choi, "Machine learning models for academic performance prediction: Interpretability and application in educational decision-making," Frontiers in Education, vol. 10, art. no. 1632315, 2025.
[18] T. Trinh, A. Nguyen, and M. Bui, "Which LIME should I trust? Concepts, challenges, and solutions," arXiv preprint arXiv:2503.24365, 2025.
[19] A. Salih, I. Galazzo, Z. Raisi-Estabragh, S. Petersen, G. Menegaz, and P. Radeva, "A perspective on explainable artificial intelligence methods: SHAP and LIME," Advanced Intelligent Systems, vol. 7, no. 1, art. no. 2400304, 2025.