Air Quality Index (AQI) Prediction for Indian Cities Using Machine Learning: A Comparative Study of Random Forest and XGBoost on Delhi, Noida, and Faridabad
Air pollution has emerged as one of the most pressing environmental and public health challenges in rapidly urbanizing India. The National Capital Region (NCR), which includes Delhi, Noida, and Faridabad, consistently records some of the highest Air Quality Index (AQI) levels in the world, posing severe health risks to millions of residents. This study presents a machine learning-based comparative framework for predicting AQI across these three NCR cities using historical air quality datasets sourced from Kaggle. Two ensemble learning algorithms, Random Forest and XGBoost (Extreme Gradient Boosting), are implemented, trained, and evaluated under identical conditions. Key pollutant features, including PM2.5, PM10, NO2, SO2, CO, and O3, are used as predictors. XGBoost achieves the higher predictive accuracy, with an R² of 0.94 and an RMSE of 12.31, compared with Random Forest (R² = 0.91, RMSE = 15.82). Feature importance analysis identifies PM2.5 and PM10 as the dominant predictors. These findings highlight the potential of gradient-boosted ensemble methods for real-time air quality forecasting in urban Indian environments and offer actionable insights for pollution management and early warning systems.
Introduction
This study focuses on predicting the Air Quality Index (AQI) in the National Capital Region (Delhi, Noida, and Faridabad), one of the most polluted areas in India. Because air pollution there is driven by vehicles, industry, construction dust, and crop burning, accurate AQI prediction is essential for public health planning and environmental management. Traditional statistical models often fail to capture these complex pollution patterns, so the study uses machine learning techniques.
The research compares two ensemble models, Random Forest and XGBoost, using a Kaggle dataset of pollutant measurements such as PM2.5, PM10, NO2, SO2, CO, and O3. Data preprocessing includes handling missing values and outliers, feature engineering (seasonal and temporal patterns), categorical encoding, and train-test splitting.
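The preprocessing steps listed above can be sketched as follows. This is a minimal, dependency-free illustration and not the study's actual pipeline: the median imputation, the 1.5 x IQR clipping rule, the four-season month grouping, the chronological split, and every function and column name (`impute_median`, `clip_iqr`, `season_of`, `preprocess`, `chronological_split`, `date`, `city`) are assumptions made for the example.

```python
import statistics

def season_of(month):
    """Map a month number (1-12) to a coarse season label (assumed grouping)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "summer"
    if month in (6, 7, 8, 9):
        return "monsoon"
    return "post_monsoon"

def impute_median(rows, column):
    """Replace missing (None) readings in `column` with the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(observed)
    for r in rows:
        if r[column] is None:
            r[column] = median

def clip_iqr(rows, column, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1, _, q3 = statistics.quantiles([r[column] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    for r in rows:
        r[column] = min(max(r[column], low), high)

def preprocess(rows, pollutants, cities):
    """Return cleaned copies of `rows` with temporal and one-hot city features."""
    cleaned = [dict(r) for r in rows]  # leave the caller's rows untouched
    for column in pollutants:
        impute_median(cleaned, column)
        clip_iqr(cleaned, column)
    for r in cleaned:
        r["month"] = int(r["date"][5:7])  # assumes ISO dates, YYYY-MM-DD
        r["season"] = season_of(r["month"])
        for city in cities:
            r["city_" + city] = 1 if r["city"] == city else 0
    return cleaned

def chronological_split(rows, test_fraction=0.2):
    """Hold out the most recent `test_fraction` of rows as the test set."""
    ordered = sorted(rows, key=lambda r: r["date"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]
```

A call such as `preprocess(rows, ["PM2.5", "PM10"], ["Delhi", "Noida", "Faridabad"])` followed by `chronological_split(...)` would yield training and test rows. A chronological split is used here because it avoids training on readings that postdate the test period; a random split is equally consistent with the description above.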
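Random Forest averages the predictions of many decision trees, each trained on a bootstrap sample of the data, which improves prediction stability. XGBoost builds trees sequentially, with each new tree fitted to the errors of the current ensemble, and adds regularization to limit overfitting. Both models are evaluated using root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²).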
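To make the contrast between the two ensemble strategies and the three metrics concrete, the sketch below uses one-split trees (stumps) on a single feature. It is an illustration under stated assumptions, not the study's models: bagged stumps stand in for Random Forest (without feature subsampling), and plain squared-error gradient boosting with a learning rate stands in for XGBoost (without its explicit regularization term or second-order updates). The data are synthetic, and all function names are hypothetical.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_stump(x, target):
    """Least-squares one-split tree: returns (threshold, left_mean, right_mean)."""
    mean = sum(target) / len(target)
    best = (float("inf"), x[0], mean, mean)  # fallback when no split exists
    for threshold in sorted(set(x))[:-1]:
        left = [t for xi, t in zip(x, target) if xi <= threshold]
        right = [t for xi, t in zip(x, target) if xi > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, threshold, lv, rv)
    return best[1:]

def stump_predict(stump, xi):
    threshold, lv, rv = stump
    return lv if xi <= threshold else rv

def fit_bagged(x, y, n_trees=30, seed=0):
    """Bagging: fit each stump independently on a bootstrap resample."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(x)) for _ in range(len(x))]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict_bagged(stumps, xi):
    """Average the independent stump predictions."""
    return sum(stump_predict(s, xi) for s in stumps) / len(stumps)

def fit_boosted(x, y, n_rounds=30, learning_rate=0.3):
    """Boosting: fit each stump to the residuals of the ensemble so far."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump_predict(stump, xi) for xi, pi in zip(x, pred)]
    return base, stumps, learning_rate

def predict_boosted(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)

# Synthetic stand-in data: x plays the role of PM2.5, y the role of AQI.
x = [20.0 + 15.0 * i for i in range(20)]
y = [1.2 * xi + 10.0 * ((i % 3) - 1) for i, xi in enumerate(x)]
bagged, boosted = fit_bagged(x, y), fit_boosted(x, y)
fits = {
    "bagged stumps": [predict_bagged(bagged, xi) for xi in x],
    "boosted stumps": [predict_boosted(boosted, xi) for xi in x],
}
for name, pred in fits.items():
    print(f"{name}: RMSE={rmse(y, pred):.2f} MAE={mae(y, pred):.2f} R2={r2(y, pred):.3f}")
```

The printed scores are training-set fits on toy data and say nothing about the study's results; they only show how the three metrics are computed and how sequential residual fitting differs from averaging independently trained trees.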
The literature review shows that ensemble methods consistently outperform traditional models, with XGBoost often achieving slightly better accuracy due to its regularization and optimization techniques.
Overall, the study finds that both ensemble models are effective for AQI prediction, with XGBoost outperforming Random Forest on the complex air pollution data examined here.
Conclusion
This study presented a comprehensive machine learning framework for Air Quality Index prediction across three major cities of India's National Capital Region (Delhi, Noida, and Faridabad) using historical AQI data sourced from Kaggle. Two ensemble learning models, Random Forest and XGBoost, were implemented, tuned, and evaluated under identical experimental conditions.
The results show that XGBoost achieves higher predictive accuracy (R² = 0.94, RMSE = 12.31) than Random Forest (R² = 0.91, RMSE = 15.82), with consistent improvements across all three cities and all evaluation metrics. Feature importance analysis confirms PM2.5 and PM10 as the dominant AQI drivers in NCR, with seasonal factors playing a significant secondary role. The models provide a practical foundation for real-time AQI forecasting systems that could be integrated into public health dashboards and pollution advisory services.
Promising directions for future work include: (1) integration of real-time meteorological data (temperature, humidity, wind speed and direction) to further improve prediction accuracy; (2) exploration of deep learning architectures such as LSTM networks for multi-day AQI forecasting; (3) expansion of the study to additional Indian cities including Mumbai, Kolkata, Chennai, and Ahmedabad; (4) development of a web-based or mobile AQI early-warning application powered by the trained XGBoost model; and (5) investigation of interpretability methods such as SHAP (SHapley Additive exPlanations) for granular feature-level explanations of individual AQI predictions.