Authors: Arpan Ghosh, Vansh Nanda, Tushar Bansal, Er. Reshma
Certificate: View Certificate
The air quality monitoring system collects data of pollutants from different location to maintain optimum air quality. In the current situation, it is the critical concern,. The introduction of hazardous gases into the atmosphere from industrial sources, vehicle emissions, etc. pollutes the air. Today, the amount of air pollution in large cities has surpassed the government-set air quality index value and reached dangerous levels. It has a significant effect on a human health. The prediction of air pollution can be done by the Machine Learning (ML) algorithms. Machine Learning (ML) combines statistics and computer science to maximize the prediction power. ML is used in order to calculate the Air Quality Index. Various sensors and an Arduino Uno microcontroller are utilized to collect the dataset. Then by using K- Nearest Neighbor (KNN) algorithm, the air quality is predicted.
Among the most crucial challenges faced in the world today is air pollution. Industrial activity is increasing more regularly due to the explosive growth of economy, which is causing air pollution to increase more rapidly. Environmental pollution is a serious issue that affects all living things, including humans, with pollution from industry accounting for a significant portion of it. Solid particles such as dust, pollen, and spores, and gases, contribute to air pollution. Carbon monoxide, Carbon dioxide, Nitrogen dioxide, Sulphur oxide, Chlorofluorocarbons, Particulate Matter, and other air pollutants that cause air pollution are released by the combustion of natural gas, coal, and wood, as well as factories, cars, and other sources. Prolonged exposure to air pollution leads to serious health problems, such as lung and respiratory illnesses
The annual death toll from household exposure to gasoline smoke is 3.8 million. Exposure to the outdoor air pollution will cause 4.2 million deaths annually. 9 out of 10 people on the earth reside in areas with air quality that is worse than recommended by the World Health Organization. As per the Greenpeace Southeast Asia Analysis of IQAir statistics, air pollution and associated problems caused over 120,000 deaths in India in 2020.According to the report, air pollution caused economic losses of ?2 lakh crore in India. This demonstrates how crucial it is to pay attention on the air quality.
Primary pollutants and the secondary pollutants are the two major classifications of air pollutants. One that is directly emitted into the atmosphere from its source is referred to as a primary pollutant, whereas a secondary pollutant is one that is produced due to the interaction between two primary pollutants or with other elements of the atmosphere. One of the detrimental effects of pollutants emitted into the environment is the degradation of air quality. Also, other harmful effects, such as acid rain, global warming, aerosol production, and photochemical smog has increased in past years.
Predicting the air quality is crucial for preventing the problem of air pollution. The Machine Learning (ML) models can be used for this. With the use of training data, a computer can learn how to build models via a technique called as Machine Learning. It is a branch of Artificial Intelligence that gives computer program the ability to forecast outcomes with ever-increasing accuracy. ML can examine a variety of data and identify patterns and particular trends. Machine learning is the ability given to a computer program to do a task without any external programming and this is task is achieved by using some statistical and advanced mathematical algorithms.
As air pollution has been rising every day, monitoring has proven to be a significant task. The amount of pollution in a given area is determined through continuous air quality monitoring at that location. The information obtained by the sensors reveals the source and concentration of the pollutants in that area. Measures to minimise pollution levels can be taken usingthat knowledge and the ML model. The hardware device consists of three different sensors like MQ-135 air quality sensor, MQ-5 sensor, Optical dust sensor connected to the Arduino uno board, which helps in collecting the pollutants information of the currentplace.
II. LITERATURE SURVEY
The authors of  proposed that Machine Learning algorithms plays important role in measuring air quality index accurately. Logistic regression and auto regression, ANN help in determining the level of PM2.5. ANN comes out with best results in the paper.
In  authors gives the prediction of the air quality index by using different machine learning algorithms like Decision Tree and Random Forest. From the results, concluded that the Random Forest algorithm gives better prediction of air quality index.
In  authors proposed model by using BILSTM which is the Deep Learning model to predicted the PM2.5 with improved performance comparing the existing model and produced exceptional MAE, RMSE.
In  authors used the prediction model results were based on Big Data Analytics and Machine Learning, which have helped to evaluate and contrast current assessments of air quality. The Decision Tree algorithm gave the best results among all the algorithms.
The authors of  used SVR, and LSTM Machine Learning models. The Machine Learning algorithms used for estimating the atmospheric pollutants (PM10 and PM2.5), it was demonstrated that SVR algorithms are the most suitable in forecasting the air pollutants concentrations.
This study focused on machine learning algorithms to predict air quality indices and pollutant concentrations. The research, published in Applied Sciences, demonstrated the potential of machine learning in accurately forecasting air quality parameters, highlighting the practical applications of these algorithms.
The paper introduced a machine learning framework specifically tailored for predicting air quality in California. Published in Complexity, the research delved into the complexities of Californian air quality and proposed innovative machine learning techniques, providing a comprehensive understanding of the region's pollution dynamics.
Focusing on Chennai, this research employed regression and ARIMA time series models to predict air quality indices. Published in the Journal of Engineering Research, the study showcased the efficacy of combining traditional statistical methods with advanced machine learning, offering a nuanced approach to air quality prediction.
This paper  presented an integrated model employing Artificial Neural Networks (ANN) and Kriging for forecasting air pollutants. Published in the International Journal of Advanced Research in Computer, Communication Engineering, the research underscored the importance of incorporating meteorological data into machine learning models, enhancing the accuracy of air pollution predictions. Focused on supervised machine learning algorithms, this study, published in the International Journal of Scientific Research in Computer Science, Engineering, and Information Technology, presented a robust air quality prediction model. The research emphasized the significance of algorithm selection and training data quality in building effective prediction systems.
In this paper, the proposed methods use three different algorithms to draw a comparative analysis of the AQI values of New Delhi, Bangalore, Kolkata, and Hyderabad by using parameters such as PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3, Benzene, and toluene levels, which will then compare the three algorithms and find the most accurate and efficient algorithm. The aim is to analyze and present it in an efficient way. It would help us discover interesting and insightful information. These particular cities have a higher population density and give a good estimate of the pollution in a major South Asian city. More cities have not been added due to the fact that it makes the research paper way too lengthy. Hence, the major cities of India have been chosen to analyze the pollution levels in different urban cities of India as they are the major contributors to pollution.
Some of the existing algorithms used are Naive Bayes-a Bayes theorem-based classifier, support vector machine-a supervised learning model for classification and regression, artificial neural network-learning methodology inspired by actual neurons of the brain, gradient boost-techniques utilizing an ensemble of weak prediction models, decision tree-which works by making predictive models using data, and k-nearest neighbor-a lazy learning nonparametric supervised method.
The proposed algorithms used and compared are given below.
A. Synthetic Minority Oversampling Technique (SMOTE) Algorithm
Synthetic samples are created for the minority class using this oversampling technique. It aids in making an imbalanced dataset balanced. This approach helps with beating the issue of overfitting brought about by arbitrary oversampling.
B. Support Vector Regression
It is a discrete value prediction technique that uses supervised learning. For comparable purposes, SVMs and support vector regression are likewise used. Finding the most appropriate line is the main tenet of SVR. In SVR, the hyperplane with the most points is the line that fits the data the best.
C. Random Forest Regression (RFR) Algorithm
It is a frequently used supervised machine-learning technique for classification and regression problems. It creates decision trees based on a variety of samples, utilizing the average for regression and the classification vote.
D. CatBoost Regression (CR) Algorithm
Yandex has developed a library of open-source software. It offers a framework for gradient boosting which, unlike the standard technique, aims at resolving categorical features using an alternative based on permutation. All the three algorithms showed promising results in other works which had been studied through the literature survey. These three algorithms were chosen due to their high accuracy in previous different works, and with the proposed work, the aim is to draw a comparative analysis and find the one with the best accuracy with balanced and imbalanced datasets. The aim is to use them and apply them to the Bangalore, Kolkata, Hyderabad, and New Delhi datasets and compare their accuracies to figure out what best fits our use case. The picked algorithms have the highest accuracy based on our extensive literature survey, used for the AQI prediction. The algorithms being used for prediction are support vector regression (SVR), random forest regression (RFR), and CatBoost regression (CR). These algorithms will be provided with a suitably large dataset of cities, such as New Delhi, Bangalore, Kolkata, and Hyderabad, and will provide a practical environment. The dataset used will be cleaned, reduced, and prepared according to our requirements and the data will be split into training and testing data. The plan is to use the simplest, most straightforward implementation in order for the algorithms to be applied easily in a real-life use case. Then, different parameters will be taken to finalize and draw up a comparison between these 3 algorithms and then come to the conclusion to show which is the most accurate. The comparison can bring out important information about AQI prediction methods and even help us choose the most suitable one. A comparison of the accuracy levels obtained with an imbalanced dataset and a balanced dataset with the help of the SMOTE algorithm will also be done.
Hence, the methodology is a step-by-step process in which the first step is to find a suitable dataset and clean it. After this, further data preprocessing is applied which makes use of SMOTE in order to balance the dataset. Both balanced and imbalanced datasets will be preserved and used in order to bring to light any differences in performance that may arise due to balancing. Following this, in a standard machine learning procedure, the dataset is split into train and test to train the models and test their accuracies against real data. Feature scaling and normalization are carried out. Now, each regression model which has been picked, namely, random forest, support vector regression, and CatBoost, are used for prediction and its accuracy is gauged, for each balanced and imbalanced dataset as mentioned previously. They are compared using metrics such as RMSE and R-SQUARE. Finally, all the data and results have been displayed using clear figures, graphs, and charts which easily make one understand what exactly has led to the increase in accuracy and hence help future research. Various steps which will be performed during the implementation of this work to achieve the determined result. The flowchart is a process-based flowchart that shows the steps of the process in a detailed manner. It has been derived from the actual working out into running these models and extracting results. The process flowchart is drawn in Western ANSI standards.
The Central Pollution Control Board of India provided the AQI in the report National Air Quality Index, which is shown in the Fig 2 below
The quality of the air is determined by components like gases and particulate matter. These pollutants decrease the air quality, which can lead to serious illnesses when breathed in repeatedly. With air quality monitoring systems, it is possible to identify the presence of these toxics and monitor air quality in order to take sensible measures to enhance air quality. As a result, production rises and health problems caused by air pollution are reduced. The prediction models built using machine learning have been shown to be more reliable and consistent. Data collecting is now simple and precise due to advanced technology and sensors. Only machine learning (ML) algorithms can effectively handle the rigorous analysis needed to make accurate and efficent predictions from such vast environmental data. In order to predict air pollution, the KNN algorithm is used, which is better suitable for prediction tasks.
 Shreyas Simu,Varsha Turkar, Rohit Martires, “Air Pollution Prediction using Machine Learning”, 2020, IEEE  Tanisha Madan, Shrddha Sagar, Deepali Virmani, “Air Quality Prediction using Machine Learning Algorithms”, 2020, IEEE  Venkat Rao Pasupuleti, Uhasri , Pavan Kalyan, “Air Quality Prediction Of Data Log By Machine Learning”, 2020 , IEEE  S. Jeya, Dr. L. Sankari, “Air Pollution Prediction by Deep Learning Model”, 2020, IEEE  SriramKrishna Yarragunta, Mohammed Abdul Nabi, Jeyanthi.P, “Prediction of Air Pollutants Using Supervised Machine Learning”, 2021, IEEE  H. Liu, Q. Li, D. Yu, and Y. Gu, “Air quality index and air pollutant concentration prediction based on machine learning algorithms,” Applied Sciences, vol. 9, p. 4069, 2019.  M. Castelli, F. M. Clemente, A. Popovic, S. Silva, and L. Vanneschi, “A machine learning approach to predict air quality in California,” Complexity, vol.2020, Article ID 8049504, 23 pages, 2020.  G. Mani, J. K. Viswanadhapalli, and A. A. Stonie, “Prediction and forecasting of air quality index in Chennai using regression and ARIMA time series models,” Journal of Engineering Research, vol. 9, 2021.  S. V. Kottur and S. S. Mantha, “An integrated model using Artificial Neural Network (ANN) and Kriging for forecasting air pollutants using meteorological data,” Int. J. Adv. Res. Comput. Commun. Eng, vol. 4, pp. 146–152, 2015.  S. Halsana, “Air quality prediction model using supervised machine learning algorithms,” International Journal of Scientific Research in Computer Science, Engineering and Information Technology, vol. 8, pp. 190–201, 2020.  Hasbeh, M. Ahmadi Alvar, K. Y. Aghdam, H. Ghorbani, N. Mohamadian, and J. Moghadasi, “Hybrid computing models to predict oil formation volume factor using multilayer perceptron algorithm,” Journal of Petroleum and Mining Engineering, vol. 23, no. 1, pp. 17–30, 2021.  F. Jafarizadeh, M. Rajabi, S. Tabasi et al., “Data driven models to predict pore pressure using drilling and petrophysical data,” Energy  G. Zhang, S. Davoodi, S. S. Band, H. Ghorbani, A. Mosavi, and M Moslehpour, “A robust approach to pore pressure prediction applying petrophysical log data aided by machine learning techniques,” Energy Reports, vol. 8, pp. 2233– 2247, 2022.  S. Tabasi, P. Soltani Tehrani, M. Rajabi et al., “Optimized machine learning models for natural fractures prediction using conventional well logs,” Fuel, vol. 326, Article ID 124952, 2022.  M. Rajabi, O. Hazbeh, S. Davoodi et al., “Predicting shear wave velocity from conventional well logs with deep and hybrid machine learning algorithms,” Journal of Petroleum Exploration and Production Technology, 2022.  S. Beheshtian, M. Rajabi, S. Davoodi et al., “Robust computational approach to determine the safe mud weight window using well-log data from a large gas reservoir,” Marine and Petroleum Geology, vol. 142, Article ID 105772, 2022.  Z. K. Masoud, S. Davoodi, H. Ghorbani et al., “Band, Permeability prediction of heterogeneous carbonate gas condensate reservoirs applying group method of data handling,” Marine and Petroleum Geology, vol. 139, 2022.  N. Mohamadian, H. Ghorbani, D. A. Wood, and M. A. Khoshmardan, “A hybrid nanocomposite of poly(styrene-methyl methacrylate- acrylic acid)/clay as a novel rheology-improvement additive for drilling fluids,” Journal of Polymer Research, vol. 26, no. 2, p. 33, 2019.  N. Mohamadian, H. Ghorbani, D. A. Wood, and H. K. Hormozi, “Rheological and filtration characteristics of drilling fluids enhanced by nanoparticles with selected additives: an experimental study,” Advances in Geo-Energy Research, vol. 2, no. 3, pp. 228–236, 2018.
Copyright © 2023 Arpan Ghosh, Vansh Nanda, Tushar Bansal, Er. Reshma. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.