An early prediction of student performance is one significant part that improves the learning achievements, reduce the student attrition and personalized learning strategy. The traditional methods of performance assessment usually includes a summative grade scheme which do not provide early warning system for a struggling student. This paper demonstrates a supervised learning model for initial estimation of student achievement using academic, behavioral, and demographic factors. The proposed method for early prediction of student performance consists of preprocessing the data, feature selection, building the classifier models, and evaluation of the achievement of the classification models using various classification algorithms, namely, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest. The proposed method is evaluated based on accuracy, precision, recall, F1 measure, and ROC AUC. The empirical results show that the ensemble learning models have better performance than linear models based on the complex relationship between the features. The result of the work highlights the role of attendance, internal performance and prior GPA performance as strong predictors. The proposed method is a potential for developing early alert systems for higher education institutes.
Introduction
This research focuses on using Machine Learning (ML) and Educational Data Mining (EDM) to predict students’ academic performance at an early stage. With the widespread adoption of Learning Management Systems (LMS), online assessments, attendance systems, and institutional databases, educational institutions now generate large amounts of student data. However, student performance is still often evaluated using traditional post-examination methods. The proposed system aims to leverage this data to predict student outcomes, identify at-risk learners, and support timely academic interventions.
Background and Motivation
Student performance is influenced by multiple factors, including:
Attendance and participation
Internal assessment scores
Previous GPA
Assignment completion
Study habits
Classroom engagement
Demographic and socio-economic factors
Because these factors interact in complex and often non-linear ways, traditional regression models may not be sufficient. Advanced ML techniques such as:
Decision Trees
Support Vector Machines (SVM)
Random Forests
Ensemble Learning Methods
Probabilistic Classifiers
Deep Learning Models
can better capture these relationships and provide more accurate predictions.
Early prediction of academic performance helps institutions:
Identify students at risk of failure or dropout.
Design targeted intervention strategies.
Improve academic outcomes.
Support data-driven educational decision-making.
Research Gap
Previous studies mainly focused on prediction accuracy while neglecting other important evaluation metrics such as:
Precision
Recall
F1-Score
ROC-AUC
Additionally, many high-performing deep learning models lack interpretability, making it difficult for educators to understand the reasons behind predictions. Since transparency is crucial in educational environments, the proposed work emphasizes both prediction performance and model explainability.
Literature Survey
Research in educational data mining has explored various predictive approaches:
Traditional statistical methods generally perform worse than machine learning algorithms.
Random Forest and other ensemble methods have demonstrated strong predictive performance.
Feature engineering and attribute selection significantly improve model efficiency and accuracy.
Rule-based and interpretable models provide transparency while maintaining competitive accuracy.
Ensemble learning techniques often outperform individual classifiers because they can model complex non-linear relationships.
Recent studies using boosting and stacking methods achieved improvements in recall and F1-score.
Explainable AI techniques such as SHAP (Shapley Additive Explanations) help identify the most influential factors affecting academic performance.
Deep learning models, including Bidirectional LSTM networks, have shown success in analyzing sequential educational data.
Hybrid and multi-model approaches generally outperform single-model systems.
Including demographic and environmental factors improves model generalization.
Metaheuristic optimization techniques such as Genetic Algorithms and Particle Swarm Optimization further enhance prediction performance.
Proposed System
The primary objective of the proposed system is to classify students into three performance categories:
Low Performance
Medium Performance
High Performance
The system combines machine learning, feature engineering, model comparison, and explainable AI techniques to create an accurate and interpretable prediction framework.
Data Collection
Student data is collected from institutional databases and learning management systems. Key input features include:
Attendance percentage
Internal test scores
Previous GPA
Assignment completion status
Study hours
Classroom participation
Demographic information
The target variable is the student's final academic performance.
Data Preprocessing
To improve data quality and model reliability, several preprocessing techniques are applied:
1. Data Cleaning
Missing values are handled using mean and mode imputation.
Duplicate records are removed.
Outliers are detected using the Z-score method and eliminated when necessary.
2. Data Transformation
Categorical variables are converted into numerical format using One-Hot Encoding.
3. Feature Scaling
Min-Max Normalization is used to scale features to a common range.
This improves model stability and training efficiency.
Feature Extraction and Model Development
The system uses feature engineering and feature selection techniques to identify the most relevant predictors of academic success. Multiple machine learning models are trained and evaluated using cross-validation to determine the best-performing classifier.
Feature importance analysis is also performed to identify the factors that most strongly influence student outcomes.
Conclusion
In this paper, a machine learning based system was proposed to predict students’ academic achievement using educational data. The measurement of the system performance was relatively standard performance measures like accuracy, precision, recall, F1, measure. The experiment outcome showed that on the classification accuracy aspect, the proposed system outperformed standard baseline approach.
The proposed ensemble strategy demonstrated strong generalization capabilities and robust performance even when cross-validation and noisy data were used for validation. In addition, the proposed system demonstrated efficient use of resources and execution time.
Attendance, internal scores, and past GPA are all very significant indicators of student performance, as indicated by the feature significance analysis. This information is very useful for interventions related to early student performance.
Future studies can focus on combining behavioral, psychological, and socioeconomic factors to further enhance the model’s forecasting ability , although the proposed framework already yields promising results. To enhance its generalizability, the model can also be developed using larger datasets.
In conclusion proposed approach could be good foundation for developing intelligent academic tracking systems and offers a precise, effective, and scalable approach for predicting student performance.
References
[1] A. Abukader, A. Alzubi, and O. R. Adegboye, “Intelligent System for Student Performance Prediction: An Educational Data Mining Approach Using Metaheuristic-Optimized LightGBM with SHAP-Based Learning Analytics,” Applied Sciences, vol. 15, no. 20, p. 10875, Oct. 2025, doi: 10.3390/app152010875.
[2] W. Ahmed, M. A. Wani, P. Plawiak, S. Meshoul, A. Mahmoud, and M. Hammad, “Machine learning-based academic performance prediction with explainability for enhanced decision-making in educational institutions,” Sci Rep, vol. 15, no. 1, p. 26879, Jul. 2025, doi: 10.1038/s41598-025-12353-4.
[3] M. S. N. Al-Din and H. A. A. Abdulqader, “Students’ Academic Performance Prediction Using Educational Data Mining and Machine Learning: A Systematic Review,” IJRISS, vol. VIII, no. VIII, pp. 1264–1291, 2024, doi: 10.47772/IJRISS.2024.808095.
[4] R. Guevara-Reyes, I. Ortiz-Garcés, R. Andrade,
[5] F. Cox-Riquetti, and W. Villegas-Ch, “Machine learning models for academic performance prediction: interpretability and application in educational decision-making,” Front. Educ., vol. 10, p. 1632315, Aug. 2025, doi: 10.3389/feduc.2025.1632315.
[6] E. Kalita et al., “Predicting student academic performance using Bi-LSTM: a deep learning framework with SHAP-based interpretability and statistical validation,” Front. Educ., vol. 10,
[7] p. 1581247, Jun. 2025, doi: 10.3389/feduc.2025.1581247.
[8] D. Khairy, N. Alharbi, M. A. Amasha, M. F. Areed, S. Alkhalaf, and R. A. Abougalala, “Prediction of student exam performance using data mining classification algorithms,” Educ Inf Technol, vol. 29, no. 16, pp. 21621–21645, Nov. 2024, doi: 10.1007/s10639-024-12619-w.
[9] M. Liu, W. He, G. Zhou, and H. Zhu, “A New Student Performance Prediction Method Based on Belief Rule Base with Automated Construction,” Mathematics, vol. 12, no. 15, p. 2418, Aug. 2024, doi: 10.3390/math12152418.
[10] S. Malik et al., “Advancing educational data mining for enhanced student performance prediction: a fusion of feature selection algorithms and classification techniques with dynamic feature ensemble evolution,” Sci Rep, vol. 15, no. 1, p. 8738, Mar. 2025, doi: 10.1038/s41598-025-92324-x.
[11] A. Namoun and A. Alshanqiti, “Predicting Student Performance Using Data Mining and Learning Analytics Techniques: A Systematic Literature Review,” Applied Sciences, vol. 11, no. 1, p. 237, Dec. 2020, doi: 10.3390/app11010237.
[12] B. Santana-Perera, C. García-Barceló, M. González Arcas, and D. Gil, “Exploring Predictive Insights on Student Success Using Explainable Machine Learning: A Synthetic Data Study,” Information, vol. 16, no. 9, p. 763, Sep. 2025, doi: 10.3390/info16090763.
[13] S. M. F. D. Syed Mustapha, “Predictive Analysis of Students’ Learning Performance Using Data Mining Techniques: A Comparative Study of Feature Selection Methods,” ASI, vol. 6, no. 5, p. 86, Sep. 2023, doi: 10.3390/asi6050086.
[14] [12] E. Vecchi, “Investigating the Efficacy and Interpretability of ML Classifiers for Student Performance Prediction in the Small-Data Regime,” Education Sciences, vol. 16, no. 1, p. 149, Jan. 2026, doi: 10.3390/educsci16010149.
[15] M. Ya?c?, “Educational data mining: prediction of students’ academic performance using machine learning algorithms,” Smart Learn. Environ., vol. 9, no. 1, p. 11, Dec. 2022, doi: 10.1186/s40561-022-00192-z.
[16] W. Zou, W. Zhong, J. Du, and L. Yuan, “Prediction of Student Academic Performance Utilizing a Multi-Model Fusion Approach in the Realm of Machine Learning,” Applied Sciences, vol. 15, no. 7, p. 3550, Mar. 2025, doi: 10.3390/app15073550.