Authors: Dr. Senthil Kumar M, Mr. Sakthivel S, Mr. Sasitharan M, Mr. Shakir Ahamed M
Certificate: View Certificate
Machine learning has vast algorithms in which each and everything is specialized to predict and compute certain functionalities and tasks. A neural network is a collection of neurons in which every neuron holds a numerical value as the output of other neurons. Furthermore, neural networks are classified more as regular neural networks, convolutional neural networks, feed-forward neural networks, and long-term memory networks. Each is specialized in unique scenarios, such as regular neural networks work better in position-based image detection and convolutional neural networks work better in edge-based detection. Therefore, this paper bases the comparison of different algorithms on a base prediction problem, which is fake news detection. This project displays various performance metrics such as accuracy score and various visualization plots like scatter, pie, and bar. so that users will know the different algorithms and their scalability on non-relatable text patterns.
The fields of machine learning and deep learning basically read and understand the patterns and flaws in the data we feed into them and Predictions are based on the input we provide, and machines can be programmed to make their own decisions without any human interference. They learn from their errors and rectify them. there are algorithms that are specifically good at certain fields of prerequisite. For example, let us consider a dataset that is related to the chance of heart disease and has an attribute of whether a person is prone to heart attack or not. Considering the required feature sets, regular neural networks (RNN) can produce good accuracy with low loss, and on the other hand, ML algorithms such as logistic regression can project good metrics. These two above-mentioned algorithms perform at their best as long as the data is linear (numerical or categorical). What if the data we have is an image, video, or audio? In this case, videos are nothing but a continuous frame of images, so let us consider that images and videos are of the same format. Image processing totally depends on the position or the edge-based classification. Regular neural networks can deal with position-based prediction, but this may not help in every case, so here come convolutional neural networks. They classify things based on edges. To be more specific, depending on the dataset and the problem statements, different algorithms can perform differently.
II. Related Works
There are various machine learning algorithms used to solve the fake news detection problem. Xavier Jose et al. discusses the many features and forms of false news as well as a practical approach for detecting fake news on OSM networks. The two key components of the approach are the stance detection model and the created content classifier. The posture detection model achieved an accuracy of 90.37 percent using Logistic Regression, and the manufactured content classifier achieved an accuracy of 93.46 percent using Bi-directional LSTM . Zhang Bao-wen et al. states that regression algorithms can be used in many classification issues. The purpose of this article is to look at how temperature affects the selling of iced items. To begin, the author gathered data on last year's anticipated temperature and iced product sales, followed by data compilation and purification. Using data mining theory, the author created a mathematical regression analysis model based on the cleaned data . Sarker. I.H explains the concepts of numerous machine learning techniques and their usefulness in a variety of real-world application areas, including cybersecurity, smart cities, healthcare, e-commerce, agriculture, and many more. Based on his findings, he also emphasizes the obstacles and future research avenues . Saima Akhter et al. proposes to computerize the fake news detection in Twitter datasets, this research provides a strategy for spotting fabricated news messages from Twitter tweets by figuring out how to anticipate accuracy evaluations. After that, the author compared five well-known Machine Learning techniques, including Support Vector Machine, Nave Bayes Method, Logistic Regression, and Recurrent Neural Network models, to show how efficient the classification performance on the dataset was.
The SVM and Nave Bayes classifiers outperformed the other methods in the experiments . R. Saravanan et al. examines supervised learning methods that are commonly employed in data classification. The strategies are evaluated in terms of their goals, methodologies, benefits, and drawbacks. Finally, readers will get an understanding of supervised machine learning approaches to data classification . Hadeer Ahmed et al. proposes a false-news detection model based on n-gram analysis and machine learning approaches is presented in this study. Two different feature extraction strategies and six different machine classification algorithms are investigated and compared. The best results were obtained with Term Frequency-Inverted Document Frequency (TF-IDF) as a feature extraction technique and Linear Support Vector Machine (LSVM) as a classifier, with an accuracy of 92 percent .
III. MATERIALS AND METHODS
A. Various library packages are available to process text, to split the data into training and testing, and to feed the training data to the algorithms. Finally, some visualisation plots and performance metrics such as accuracy and losses are presented.
The Jupyter Notebook is the platform used to implement the code. There is 50+ kernels running in the notebook, each containing visualization, performance metrics checking, and comparison. The final outcome of the project is a confusion matrix, which includes the accuracy and loss of each algorithm and plots those on a linear graph and scatters (Accuracy vs Loss). As the dataset used in the model is not balanced, let us consider an example. Suppose a person has a shoe factory and has 100,000 shoes, of which 99,000 are Adidas and 1,000 are Nike. Suppose a client needs a project where the model must be able to classify the shoe as two labels, whether it is Adidas or Nike, but the model always predicts Adidas no matter what brand and has an accuracy of 99%. It makes no sense in this case to provide high accuracy, but the model fails to predict correctly. So here the confusion matrix is projected. It has 2 columns and 2 rows, namely, true positive, true negative, and predicted positive and predicted negative. It shows
a two-dimensional array of true and false values.
Starting from the environment being coded to the final output system projects, everything is done in Python. The Jupyter notebook is used as an integrated development environment as it facilitates kernels that will be useful for machine learning and deep learning. And coming to the part of coding, there is the usage of various library packages, which include re (Regular Expression), nltk (Natural Language Processing Toolkit), Matplotlib, Pandas, and Scikit Learn. As this project deals with text processing, nltk plays a vital role. It cleans the text and forms corpus, such as removing symbols and stopwords by using the Porter-Stremer method. The program reads the data in the form of numbers, although the data provided by the user is in text, so no matter which format the data is in, it must be converted to the numerical format. CountVectorizer is a method in the sklearn.feature_extraction set that performs the required conversion of text to numbers. Furthermore, the data is split into training and testing derived from a method model_selection with a testing size of 20% of the data we feed, and after that, various algorithms are applied to check their performance by visualizing it, comparing its accuracy, and finally a confusion matrix. Performing with a low accuracy of 0.883 (Gaussian Naive Bayesian) and a high accuracy of 0.9495 (Logistic Regression). Logistic regression works based on linear regression.
The only difference is that logistic regression has a sigmoid graph and a pivot point. On the other hand, Gaussian NB is based on conditional probability and the Bayes theorem. We can observe that logistic regression works for binary class labels (male and female, black and white, fake and legit). Gaussian NB performs worst when it comes to binary label predictions as they are performed on a given condition that has already occurred.
Machine learning is a vast domain with numerous algorithms to work with, each specialising in a specific area problem statement. As a result, developers have a wide range of options for selecting an algorithm and incorporating it into their projects. our proposed model applies 6 different algorithms to the model, which are fake news detection and computing its performance. parameters including accuracy, loss, and the confusion matrix. Let\\\'s say convolutional neural networks (CNN) are good at image processing and NLTK is good at text processing. For image processing, it is good to use conventional neural networks whereas NLTK is good at processing text. In addition to this, the model deals with CountVectorized at first followed by six different algorithms to process data and accuracy in order to find out which algorithms are best performing and which are worst-performing. With the values of accuracy versus loss, a graph is plotted to find out the relationship between the accuracy of six different algorithms and to suggest which is best and worst-performing in text processing.
 Ehesas Mia Mahir, Saima Akhter, Mohammad Rezwanul Huq et al., Detecting fake news using machine and deep learning algorithms, Intl Conf. on Smart Computing & Communications, 2019, Pp.43-75.  Hadeer Ahmed, Issa Traore and Sherif Saad, Detection of online fake news using n-gram analysis and machine learning techniques, Intl conf. on intelligent secure and dependable system, 2017, Pp. 145-167.  Xichen Zhang and Ali A.Ghorbani, An overview of online fake news Characterization detection, Information Processing and Management, 2020, Pp.45-56.  Zhang Bao-wen, Shen Rong, The research of regression model in machine learning field, MATEC Web of Conferences, 2018, Pp-95-110.  R. Saravanan, Pothula Sujatha, A Perspective of Supervised Learning Approaches in Data Classification, Second Intl Conf. on Intelligent Computing and Control Systems, 2019, Pp-78-102.  Sarker.I.H, Machine Learning: Real-World Applications and Research Directions, 2021, Pp-76-89.  Harsh Patel, urvi Prajapati, Study and Analysis of Decision Tree Based Classification Algorithms, 2018, Intl Jrnl of computer sciences and engineering, Pp-65-80.  Xavier Jose, S. D Madhu Kumar, Priya Chandran, Characterization, Classification and Detection of Fake News in Online Social Media Networks, Mysore Sub Section Intl Conf. ,2021, Pp-30-43.  Alessandro Bondielli And Francesco Marcelloni, A Survey On Fake News Detection Techniques, 2019, Information Science, Pp. 78-85.
Copyright © 2022 Dr. Senthil Kumar M, Mr. Sakthivel S, Mr. Sasitharan M, Mr. Shakir Ahamed M. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.