Authors: Anushka Awasthi, Ishwar Gangwal, Mihir Jain
Prediction of diabetes using machine learning algorithms has been studied thoroughly by several researchers. The need to find, critically assess, and combine the findings of all pertinent, high-quality individual studies prompted us to conduct this assessment of multiple diabetes prediction models. This publication presents an analysis of several authors' work on diabetes prediction systems, with the aim of identifying the best strategies for selecting and synthesising the many high-quality studies. Most medical data are nonlinear, correlation-structured, and complicated, which makes them difficult to analyse. Machine learning-based techniques have nevertheless come to be widely used in healthcare and medical imaging.
As part of this research, this paper examines a number of machine learning-based approaches to diabetes prediction and compares them.
The study's goals are as follows:
A thorough review of all relevant, high-quality individual research and a rigorous analysis of the resulting data prompted us to examine several diabetes prediction models. We read a large number of papers on diabetes prediction models as part of this evaluation process.
Diabetes is a condition brought on by an abnormally high level of glucose in the blood. The human body is in constant need of energy, and sugar is one of the primary sources of energy used by our muscles and other tissues. The primary causes of type 2 diabetes are often an unhealthy lifestyle combined with a lack of physical activity. Diabetes occurs when the pancreas fails to produce enough insulin; as a result, sugar is not taken up by the body's cells and accumulates in the bloodstream. Diabetes may cause problems in a variety of body systems, including the kidneys, eyes, nervous system, and arteries. There are three different forms of diabetes. The first, type 1, is known as juvenile diabetes (Sun & Zhang, 2019); it primarily affects young people and damages the cells in the pancreas that are responsible for insulin production. The second form is type 2, which is often diagnosed in people over the age of 40 who do not get enough exercise and have unhealthy lifestyles. Diabetes cannot be cured, but it may be managed well through a healthy diet, regular exercise, and appropriate medication (Sun & Zhang, 2019).
People with type 2 diabetes (Malik, et al., 2020) do not receive insulin injections at regular intervals, which is why this form is often referred to as insulin-independent diabetes. Patients with type 1 diabetes, on the other hand, receive insulin injections at regular intervals, which is why this form is often referred to as insulin-dependent diabetes (Malik, et al., 2020). The third kind of diabetes (Han, et al., 2020) is gestational diabetes, which is caused by the shift in hormone levels that happens during pregnancy. In many cases, gestational diabetes goes away once the child is born. Prediabetes is a condition in which a patient's blood sugar levels are borderline for diabetes; it may be corrected through physical activity and a healthier diet.
B. Need of Machine Learning
Machine learning is the field of artificial intelligence concerned with how computers forecast future events from historical data and the information they already possess. There are two main forms of machine learning. In supervised learning, the labelled data serve as the instructor, and the model is constructed from the dataset. In unsupervised learning, the system learns without labels by identifying patterns within the dataset and then grouping those patterns. Over the last several years, a large number of authors have reported research on diabetes prediction using machine learning algorithms.
C. Machine Learning algorithms
Research on machine learning has produced multiple data mining methods. These algorithms may be applied directly to a dataset to make predictions or to derive important inferences and conclusions from it. Decision trees, Naive Bayes, k-means, and neural networks are examples of prominent data mining techniques. The following section discusses them.
II. RELATED WORK
Healthcare systems help patients resume their normal routines by providing individualised services across a variety of medical specialties. Diabetes mellitus ranks among the most critical and severe challenges faced by the medical community. In practice, classification is one of the most important decision-making approaches available. The major objective is to classify the data as diabetic or non-diabetic and to improve the classification accuracy (Saxena, et al., 2022). When it comes to the diagnosis of diabetes, machine learning is primarily focused on recognising patterns within the diabetes dataset provided. In recent years, machine learning has emerged as one of the most reliable and helpful innovations in the field of medicine, and this trend is expected to continue. The primary objective of this work is to use machine learning classifiers to categorise patients according to the personal and clinical information they provide. This section gives an overview of the works proposed by various researchers over the previous ten years, which helps to identify the shortcomings of the works proposed on machine learning classifiers for the treatment of diabetic patients. The identification of diabetes is becoming an increasingly important topic of research (Saxena, et al., 2022).
Several machine learning and deep learning approaches, including artificial neural networks, decision trees, random forests, and support vector machines, are described by (Sun & Zhang, 2019). To classify diabetes-related data, (Qawqzeh, et al., 2020) used a classification strategy based on logistic regression. The training data comprised 459 patients and the testing data 128 patients. Using logistic regression, the authors achieved a classification accuracy of 92 percent. The model's most significant shortcoming was that it was not compared with any other diabetes prediction model and therefore could not be validated. One half of the dataset was used to train the algorithm while the other half was used to test it (Qawqzeh, et al., 2020). In the framework presented by (Tafa, et al., 2015), the naive Bayes and support vector machine (SVM) methods were combined to predict diabetes. The proposed model was validated on a dataset collected from three separate locations. The dataset covered 402 individuals and eight features, and 80 of the patients had type 2 diabetes (Tafa, et al., 2015). The ensemble of naive Bayes and SVM achieved an accuracy of 97.6 percent, which is higher than the accuracy obtained by either algorithm run individually on the dataset: 94.52 percent for naive Bayes and 95.52 percent for SVM. The authors did not state any pre-processing techniques used to remove undesirable inputs from the dataset.
(Karan, et al., 2012) presented a novel approach for diagnosing diabetes by constructing a distributed, end-to-end, three-level pervasive healthcare framework using artificial neural network (ANN) computation. At the first level, sensors and other wearable devices measure vital signs and other indicators on the human body. At the second level, client-side devices such as personal digital assistants (PDAs) and personal computers (PCs) act as mediators between the first level and the final level. At the third level, powerful desktop servers support clients with healthcare procedures and database activities (Karan, et al., 2012). ANN computations are performed at both the second and third levels in order to identify disorders. The client-server paradigm thus depends on ANN calculations, and the approach uses the diagnosis task as the basis for computation and communication on both the client and the server sides.
On the Pima Indians Diabetes Dataset, (Sisodia & Sisodia, 2018) used the Naive Bayes, decision tree, and SVM learning methods. The Naive Bayes classifier obtained the highest accuracy in forecasting diabetes. The authors used tenfold cross-validation, which divides the dataset into 10 equal portions and, in each round, uses nine of those parts for training and the remaining part for assessment. Precision, accuracy, recall, and area under the curve were the evaluation measures used. (Hussain & Naaz, 2021) provided an evaluation of a number of different machine learning techniques. Within this review, the accuracy of random forest, Naive Bayes, and neural networks (NN) was examined and compared. The authors used the Matthews correlation coefficient to evaluate these techniques. The research conducted by (Kumari, et al., 2021) on the Pima Indians Diabetes Dataset applied Naive Bayes, random forest (RF), and logistic regression (LR). The researchers then compared these three methods with an ensemble method and found that the ensemble strategy provided the most accurate results.
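The tenfold cross-validation procedure described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited papers: the function name `k_fold_indices`, the random seed, and the sample count of 768 (the number of records usually reported for the Pima Indians Diabetes Dataset) are assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                # the held-out part
        train_idx = indices[:start] + indices[start + size:]  # the other k - 1 parts
        yield train_idx, test_idx
        start += size

# Assumed dataset size; each record is held out exactly once across the folds.
folds = list(k_fold_indices(768, k=10))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

A classifier would be trained on `train_idx` and scored on `test_idx` in each round, and the ten scores averaged to give the reported figure.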
(Olaniyi & Adnan, 2014) used a multilayer feed-forward neural network in their work. The researchers applied the technique to the Pima Indians Diabetes Dataset, which was split so that 500 entries were used for training and 268 entries for testing. Before any other pre-processing was carried out, the dataset was normalised in order to establish numerical stability. Each feature was divided by its associated amplitude so that all values fell within the range of 0 to 1. The authors attained an accuracy of 82 percent with their predictions. (Gupta, et al., 2021) used SVM and Naive Bayes techniques to classify the diabetes dataset. The authors trained and tested their model using k-fold cross-validation, and when they applied both classification methods to their data, SVM performed much better than Naive Bayes.
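The normalisation step can be sketched as follows. The sketch assumes that the "amplitude" of a feature means its largest absolute value in the dataset, which is one plausible reading rather than a detail confirmed by the cited paper; the function name `scale_by_max` and the three sample rows are illustrative only.

```python
def scale_by_max(rows):
    """Divide every column by its largest absolute value (assumed 'amplitude')."""
    n_cols = len(rows[0])
    col_max = [max(abs(row[j]) for row in rows) for j in range(n_cols)]
    # Guard against all-zero columns to avoid division by zero.
    return [
        [row[j] / col_max[j] if col_max[j] else 0.0 for j in range(n_cols)]
        for row in rows
    ]

# Illustrative rows only: (plasma glucose, BMI, age)
sample = [[148, 33.6, 50], [85, 26.6, 31], [183, 23.3, 32]]
scaled = scale_by_max(sample)
for row in scaled:
    print([round(v, 3) for v in row])
```

For non-negative features such as these, every scaled value lies between 0 and 1, and the largest value in each column maps to exactly 1.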
(Kandhasamy & Balamurali, 2015) used several machine learning algorithms to predict diabetes on a dataset obtained from the UCI repository. These algorithms included random forest, J48, k-nearest neighbours, and SVM. The authors applied these classifiers to the dataset both before and after pre-processing; the pre-processed data were used in the second experiment. The pre-processing procedures were not discussed; the authors mentioned only that the dataset contained some noise that was eliminated. They assessed their predictions using three criteria: accuracy, sensitivity, and specificity. The J48 decision tree obtained the best accuracy of 73.82 percent when the dataset was not pre-processed, whereas the random forest obtained the best accuracy of 100 percent when the data were pre-processed.
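Evaluation measures of this kind are all derived from the four confusion-matrix counts. The sketch below computes accuracy and sensitivity together with specificity and precision, which usually accompany them; the labels and predictions are invented for illustration (1 = diabetic, 0 = non-diabetic) and do not reproduce any result from the cited studies.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # also called recall
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Invented example data
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
scores = evaluate(y_true, y_pred)
print(scores)
```

Reporting sensitivity and specificity alongside accuracy matters for diabetes data because the classes are typically imbalanced, so accuracy alone can hide poor detection of the diabetic class.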
(Choubey, et al., 2020) used two feature selection approaches, principal component analysis (PCA) and linear discriminant analysis (LDA), to extract relevant characteristics from the Pima Indians Diabetes Dataset. PCA and LDA are both dimensionality reduction techniques. The paper also included a comparative study of the attribute selection strategies. For classification, a select group of machine learning methods, including the radial basis function (RBF) kernel, KNN, and AdaBoost, were applied to the dataset. (Perveen, et al., 2016) used a dataset obtained from the Canadian Primary Care Sentinel Surveillance Network. The features in the dataset include sex, BMI, fasting blood sugar, triglycerides, and systolic and diastolic blood pressure. The authors used decision trees, bagging, and adaptive boosting as their classifiers of choice.
(Gujral, 2017) published a survey on the key phases of diagnosing type 2 diabetes using machine learning methods. The survey also identified frequently occurring problems associated with diabetic retinopathy and nephropathy.
Quite a few approaches to machine learning have already been examined, including artificial neural networks, principal component analysis, decision trees, genetic algorithms, and fuzzy logic. The Pima Indians Diabetes Dataset serves as the dataset for the vast bulk of the relevant body of research. It is critical to diagnose diabetes accurately in its initial stages so that life-threatening complications associated with the disease may be mitigated. Based on the findings of the literature survey on diabetes prediction (Gujral, 2017), a single method is not a very effective way of diagnosing diabetes at an early stage. Combining several types of techniques, including SVM, principal component analysis, and evolutionary algorithms, together with ANN, yields the best results.
(Mamuda & Sathasivam, 2017) used supervised machine learning methods such as scaled conjugate gradient, Levenberg–Marquardt, and Bayesian regularisation. After separating the data into training and testing sets, the Levenberg–Marquardt algorithm demonstrated the highest accuracy. (Malik, et al., 2020) worked with a locally available dataset acquired from a facility in Germany, on which they implemented decision trees, KNN, and random forest. (Soltani & Jafarian, 2016) identified diabetes using a probabilistic neural network. The Pima Indians Diabetes Dataset was split into two parts: a training set consisting of 90 percent of the total and a testing set consisting of the remaining 10 percent. The accuracy reached on the training set was 89.56 percent, while the accuracy on the test set was 81.49 percent.
(Tigga & Garg, 2020) worked with the Pima Indians Diabetes Dataset. Blood glucose level, the number of pregnancies, and BMI were shown to be three of the most important parameters in the data. Logistic regression was implemented in RStudio, and the accuracy reached was 75.32 percent. (Yuvaraj & SriPreethaa, 2017) analysed the Pima Indians Diabetes Dataset using the Naive Bayes and random forest classification methods. In addition to the machine learning algorithms, the information gain attribute selection approach was used to retrieve the key features; as a result, eight features were used instead of thirteen. 30 percent of the dataset was used for testing, and the authors showed that the random forest algorithm achieved the highest accuracy, 94 percent.
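Information gain scores a feature by how much knowing its value reduces the entropy of the class label; features are then ranked and the top ones kept. The following sketch shows the calculation for categorical features. It is a generic illustration, not the implementation used in the cited study, and the toy features `glucose_band` and `record_parity` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: 1 = diabetic, 0 = non-diabetic
labels = [1, 1, 0, 0]
features = {
    "glucose_band": ["high", "high", "low", "low"],    # separates the classes perfectly
    "record_parity": ["even", "odd", "even", "odd"],   # carries no class information
}
gains = {name: information_gain(values, labels) for name, values in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(gains, ranked)
```

Continuous attributes such as BMI would need to be discretised into bands before this calculation applies.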
(Rashid, et al., 2016) constructed a decision support system for diabetes mellitus that operates automatically using classification algorithms and takes several of the aforementioned concerns into account. It also reflects the view of medical professionals, who are certain that there is a substantial association between the adverse effects of particular chronic diseases and the level of glucose in the blood. The implications of this research may extend beyond classifying people who have diabetes into different groups. A primary contribution of the system is that it makes use of only a few independent variables.
(Negi & Jaiswal, 2016) developed their own dataset, which has 102538 records and 49 attributes. About 64419 of the individuals in this dataset were diabetic, whereas the remaining patients did not have diabetes. A pre-processing step filled in the missing values and converted non-numeric values into numerical data. To choose the pertinent features from the dataset, the wrapper feature selection approach and the ranking method were both used. To further improve accuracy, an ensemble of several classifiers was applied. The resulting accuracy was 72 percent.
(Mercaldo, et al., 2017) used models including the Bayes net, Hoeffding tree, JRip, and multilayer perceptron to analyse the Pima Indians Diabetes Dataset. To choose the relevant features that would improve the performance of the classifiers, the researchers turned to both the greedy stepwise and best-first approaches to feature selection. Out of a total of eight attributes, only four were used: age, BMI, diabetes pedigree function, and plasma glucose concentration. The Hoeffding tree method achieved a recall of 76.2 percent and a precision of 75.7 percent. (Swapna, et al., 2018) used convolutional neural networks with long short-term memory on heart rate signals derived from electrocardiograms. The dataset was private and comprised 142000 samples. The researchers reached an accuracy of 90.9 percent. The dataset did not undergo any pre-processing or feature selection.
According to (Vasapalli, et al., 2021), type 2 diabetes mellitus (DM) is a long-lasting condition whose prevalence has been rising continuously all over the globe. Diabetes affects around 30 million people in India, with millions more at risk of developing the condition. Early identification is therefore essential in order to prevent diabetes and the complications linked with it. The purpose of using multiple techniques for the predictive determination of type 2 diabetes is to bring forward the detection of the disease by evaluating indicative features and daily habits. This will enable type 2 diabetes to be assessed through predictive modelling, without the use of clinical exams (Vasapalli, et al., 2021).
At this point in time, there is an abundance of clinical information available about diseases, their symptoms, the factors that contribute to them, and the effects that they have on health. The accuracy of machine learning algorithms makes it possible to predict the risk of developing type 2 diabetes, which is of vital importance to the medical industry.
(Lekha & Suchetha, 2018) developed their own dataset, which was centred on breathing patterns and included a total of 25 patients. Eleven of these patients were healthy, five were diagnosed with type 1 diabetes, and the other nine were diagnosed with type 2 diabetes. Leave-one-out cross-validation was used for validation, and the ROC curve served as the assessment metric. The accuracy was close to 96 percent. (Mohebbi, et al., 2017) employed a CNN and an MLP to identify diabetes on a dataset comprising nine individuals, which was based on continuous glucose monitoring signals. Six patients were used for training and validation and three for testing. The CNN attained the highest accuracy, 77.5 percent.
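Leave-one-out cross-validation suits very small cohorts such as these, because every record is used for testing exactly once while all the others are used for training. The sketch below shows the evaluation loop. A 1-nearest-neighbour classifier is swapped in purely as a simple stand-in and is not the model used in the cited work; the six two-feature records are invented.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Return the label of the training point closest to the query (squared Euclidean)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]

def leave_one_out_accuracy(xs, ys):
    """Hold out each sample in turn, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # every sample except the i-th
        train_y = ys[:i] + ys[i + 1:]
        if nearest_neighbour_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Invented, well-separated toy records: 0 = healthy, 1 = diabetic
xs = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [3.0, 3.1], [3.2, 2.9], [2.9, 3.3]]
ys = [0, 0, 0, 1, 1, 1]
loo_accuracy = leave_one_out_accuracy(xs, ys)
print(loo_accuracy)
```

With n records the model is retrained n times, which is affordable for a 25-patient dataset but becomes costly for large ones, where k-fold cross-validation is the usual compromise.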
(Zou, et al., 2018) worked with the Pima Indians Diabetes Dataset and the Luzhou dataset, the latter gathered from a regional hospital in China, and used two different feature selection techniques on each dataset. On both sets of data, three machine learning classifiers, including random forest and neural network, were evaluated. PCA and minimum redundancy maximum relevance (mRMR) were the feature selection approaches used to reduce the number of features. Using random forest together with mRMR, the authors reached the highest accuracy, 77.21 percent.
This section summarised the papers that have predicted diabetes; a brief summary of the discussion is presented below in Table 1.
III. DISCUSSION AND CHALLENGES
Several approaches to predicting diabetes rely on a few machine learning techniques, such as random forests and support vector machines (SVMs), yet only a few features are selected for prediction. While reading through these papers, we encountered the following difficulties:
In conclusion, the best approach to diabetes prediction was that of (Choubey, et al., 2020), who used two feature selection approaches, PCA and linear discriminant analysis, to extract relevant characteristics from the Pima Indians Diabetes Dataset. Their paper also included a comparative study of the attribute selection strategies. For classification, a select group of machine learning methods, including the radial basis function (RBF) kernel, KNN, and AdaBoost, were applied to the dataset.
Choubey, D. K. et al., 2020. Comparative analysis of classification methods with PCA and LDA for diabetes. Current Diabetes Reviews, 16(8), p. 833–850.
Gujral, S., 2017. Early diabetes detection using machine learning: a review. International Journal for Innovative Research in Science & Technology, 3(10), p. 45–60.
Gupta, S., Verma, H. K. & Bhardwaj, D., 2021. Classification of diabetes using naïve bayes and support vector machine as a technique. Singapore, Springer, p. 365–376.
Han, J., Rodriguez, J. C. & Behesti, M., 2020. Discovering Decision Tree-Based Diabetes Prediction Model. Jeju Island, Korea, Springer, p. 99–109.
Hussain, A. & Naaz, S., 2021. Prediction of diabetes mellitus: comparative study of various machine learning models. Advances in Intelligent Systems and Computing, Volume 1166, p. 103–115.
Kandhasamy, J. P. & Balamurali, S., 2015. Performance analysis of classifier models to predict diabetes mellitus. Procedia Computer Science, Volume 47, p. 45–51.
Karan, O., Bayraktar, C., Karlık, H. & Karlik, B., 2012. Diagnosing diabetes using neural networks on small mobile devices. Expert Systems with Applications, 39(1), p. 54–60.
Kumari, S., Kumar, D. & Mittal, M., 2021. An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier. International Journal of Cognitive Computing in Engineering, p. 40–46.
Lekha, S. & Suchetha, M., 2018. Real-time non-invasive detection and classification of diabetes using modified convolution neural Network. IEEE Journal of Biomedical Health Information, Volume 22, p. 1630–1636.
Malik, S., Harous, S. & El-Sayed, H., 2020. Comparative analysis of machine learning algorithms for early prediction of diabetes mellitus in women. Algeria, Springer, p. 95–106.
Mamuda, M. & Sathasivam, S., 2017. Predicting the survival of diabetes using neural network. Poland, AIP Conference Proceedings, p. 40–46.
Mercaldo, F., Nardone, V. & Santone, A., 2017. Diabetes mellitus affected patients classification and diagnosis through machine learning techniques. Procedia Computer Science, Volume 112, p. 2519–2528.
Mohebbi, A. et al., 2017. A Deep Learning Approach to Adherence Detection for Type 2 Diabetics. Korea, s.n., p. 2896–2899.
Negi, A. & Jaiswal, V., 2016. A First Attempt to Develop a Diabetes Prediction Method Based on Different Global Datasets. Waknaghat, India, s.n., p. 237–241.
Olaniyi, E. O. & Adnan, K., 2014. Onset diabetes diagnosis using artificial neural network. International Journal of Scientific Engineering and Research, Volume 5, p. 754–759.
Perveen, S., Shahbaz, M., Guergachi, A. & Keshavjee, K., 2016. Performance analysis of data mining classification techniques to predict diabetes. Procedia Computer Science, Volume 82, p. 115–121.
Qawqzeh, Y. K. et al., 2020. Classification of diabetes using photoplethysmogram (PPG) waveform analysis: logistic regression modeling. BioMed Research International, p. 1–20.
Rashid, T. A., Abdulla, S. M. & Abdulla, R. M., 2016. Decision support system for diabetes mellitus through machine learning techniques. International Journal of Advanced Computer Science and Applications, 7(7), p. 1–30.
Saxena, R., Sharma, S. K., Gupta, M. & Sampada, G. C., 2022. A Comprehensive Review of Various Diabetic Prediction Models: A Literature Survey. Journal of Healthcare Engineering, p. 1–22.
Sisodia, D. & Sisodia, D. S., 2018. Prediction of diabetes using classification algorithms. Procedia Computer Science, Volume 132, p. 1578–1585.
Soltani, Z. & Jafarian, A., 2016. A new artificial neural networks approach for diagnosing diabetes disease type II. International Journal of Advanced Computer Science and Applications, Volume 7, p. 89–94.
Sun, Y. L. & Zhang, D. L., 2019. Machine Learning Techniques for Screening and Diagnosis of Diabetes: A Survey. Technical Gazette, Volume 26, p. 872–880.
Swapna, G., Soman, K. P. & Vinayakumar, R., 2018. Automated detection of diabetes using CNN and CNN-LSTM network and heart rate signals. Procedia Computer Science, Volume 132, p. 1253–1262.
Tafa, Z., Pervetica, N. & Karahoda, B., 2015. An Intelligent System for Diabetes Prediction. Budva, Montenegro, s.n., p. 378–382.
Tigga, N. P. & Garg, S., 2020. Prediction of type 2 diabetes using machine learning classification methods. Procedia Computer Science, Volume 167, p. 706–716.
Vasapalli, M. et al., 2021. Prediction of Type 2 Diabetes Using Machine Learning algorithms. Pichanur, India, s.n.
Yuvaraj, N. & SriPreethaa, K. R., 2017. Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster. Cluster Computing, 22(1), p. 1–9.
Zou, Q. et al., 2018. Predicting diabetes mellitus with machine learning techniques. Frontiers in Genetics, Volume 9, p. 515–522.
Copyright © 2022 Anushka Awasthi, Ishwar Gangwal, Mihir Jain. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.