Authors: Hemalatha A, Karpahalakshmi S, Thanga Sri R, Vaishnavi M, Bhavani N
Certificate: View Certificate
The role of social media in our day to day life has increased rapidly in recent years. Information quality in social media is an increasingly important issue, but web-scale data hinders experts’ ability to assess and correct much of the inaccurate content, or “fake news”, present in these platforms. It is now used not only for social interaction, but also as an important platform for exchanging information and news. Twitter, Facebook a micro-blogging service, connects millions of users around the world and allows for the real-time propagation of information and news. The fake news on social media and various other media is wide spreading and is a matter of serious concern due to its ability to cause a lot of social and national damage with destruction impacts. A lot of research is already focused on detecting it. A human being is unable to detect all these fake news. Detecting fake news is an important step. This process will result in feature extraction and vectorization; we propose using Python scikit-learn library to perform tokenization and feature extraction of text data, because this library contains useful tools like Count Vectorizer and Tiff Vectorizer. Then, we will perform feature selection methods, to experiment and choose the best fit features to obtain the highest precision, according to confusion matrix results. A feature analysis then identifies features that are most predictive for crowdsourced and journalistic accuracy assessments, results of which are consistent with prior work. We aim to provide the user with the ability to classify the news as “fake” or “real”.
In today’s world social media plays an important role in information passing. It is mandatory to identify the truthfulness of the information which has been circulated in order to avoid fault manipulation, misconception and to create minimum impact on society regarding the fake news.
Machine Learning (ML) is a branch of Artificial Intelligence which includes the study of computer algorithms that improve automatically through experience. It allows applications to become more accurate in predicting outcomes without explicitly programmed based on the training data. Machine Learning models help us in many tasks, such as: Object Recognition, Summarization, Prediction, Classification, Clustering, Recommender systems.
Machine Learning (ML) is one of the most exciting technologies that one would have ever come across. As it is evident from the name, it gives the computer that makes it more similar to humans: The ability to learn. Machine learning is actively being used everywhere.
The process starts with feeding good quality data and then training our machines (computers) by building machine learning models using the data and different algorithms. The choice of algorithms depends on what type of data we have and what kind of task we are trying to automate.
Machine Learning (ML) has proven valuable because it can solve problems at a speed and scale that cannot be duplicated by the human mind alone. With massive amounts of computational ability behind a single task or multiple specific tasks, machines can be trained to identify patterns in and relationships between input data and automate routine processes.
A. TF-IDF Vectorizer
TF-IDF vectorizer technology is used in the model. It is a common algorithm to transform text into meaningful representation of numbers. It is used to extract features from text strings based on occurrence.
TF-IDF is an acronym that stands for “Term Frequency – Inverse Document” Frequency which are the components of the resulting scores assigned to each word.
TF-IDF is a numerical statistic which measures the importance of the word in a document.
It helps us in dealing with the most frequent words. Using it we can penalize them. Tf-idf Vectorizer weights the word counts by a measure of how often they appear in the documents.
1. TF (Term Frequency)
The number of times a word appears in a document is its Term Frequency. A higher value means a term appears more often than others, and so, the document is a good match when the term is part of the search terms.
Term frequency indicates how important a specific term is in a document. Term frequency represents every text from the data as a matrix whose rows are the number of documents and columns are the number of distinct terms throughout all documents.
2. IDF (Inverse Document Frequency)
Words that occur many times in a document, but also occur many times in many others, may be irrelevant. IDF is a measure of how significant a term is in the entire corpus.
The TF-IDF Vectorizer converts a collection of raw documents into a matrix of TF-IDF features.
Document frequency is the number of documents containing a specific term. Document frequency indicates how common the term is. Inverse document frequency (IDF) is the weight of a term, it aims to reduce the weight of a term if the term’s occurrences are scattered throughout all the documents.
B. Passive Aggresive Classifier
Passive Aggressive algorithms are online learning algorithms. Such an algorithm remains passive for a correct classification outcome, and turns aggressive in the event of a miscalculation, updating and adjusting. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss, causing very little change in the norm of the weight vector.
This is very useful in situations where there is a huge amount of data and it is computationally infeasible to train the entire dataset because of the sheer size of the data.
C. Confusion Matrix
D. Sentimental Analyzer
Sentiment Analysis is a procedure used to determine if a chunk of text is positive, negative or neutral. In text analytics, natural language processing (NLP) and machine learning (ML) techniques are combined to assign sentiment scores to the topics, categories or entities within a phrase.
Sentiment analysis is used to determine whether a given text contains negative, positive, or neutral emotions. It’s a form of text analytics that uses machine learning.
Sentiment analysis looks at the emotion expressed in a text. It is commonly used to analyze customer feedback, survey responses, and product reviews. Social media monitoring, reputation management, and customer experience are just a few areas that can benefit from sentiment analysis. We have used sentimental analysis for detecting the positiveness in the news that are posted in social media.
Sentiment Analysis is a Natural Language Processing and Information Extraction task that aims to obtain writer’s feelings expressed in positive or negative comments, questions and requests, by analyzing a large number of documents. It is also a machine learning tool used to study human behavior that analyzes texts for polarity, from positive to negative. By training machine learning tools with examples of emotions in text, machines automatically learn how to detect sentiment without human input.
Sentiment analysis can be used to quickly analyze the text of research papers, news articles, social media posts like tweets and more. Social Sentiment Analysis is an algorithm that is tuned to analyze the sentiment of social media content, like tweets and status updates.
Automatically extracting opinions, emotions and sentiments in text. Language - independent technology that understands the meaning of the text. It identifies the opinion or attitude that a person has towards a topic or an object.
Sentiment analysis is a powerful marketing tool that enables product managers to understand customer emotions in their marketing campaigns. It is an important factor when it comes to product and brand recognition, customer loyalty, customer satisfaction, advertising and promotion's success, and product acceptance.
In our project, once the user gives the news as the input, data undergoes the preprocessing stage, where the words get stemmed and tokenized. Next, the data is given to the Sentimental Analyzer from where the sentiment of the news is detected based on the positivity of the news.
II. SYSTEM DESIGN
A. ns using Passive Aggressive Classifier to give the desired output.
Our system also analyzes the sentiment of the news, where the news is given as an input. Data preprocessing is done, in order to analyze the positiveness of the news.
B. Data Preprocessing
2. Need of Data Preprocessing
4. Stemming: Stemming tries to cut off details like exact form of a word and produce word bases as features for classification. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).
5. Need of Stemming: When a text is pre-processed for classification purposes, stemming is applied in order to bring words from their current variation to their original root in order to better the process in natural language with subsequent steps..
6. Tokenization: Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword tokenization .As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level.
7. Need of Tokenization: Tokenization helps in understanding the context or developing the model for Natural Language Processing. The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words.
Figure 4.2 Tokenization of Fake news dataset
Figure 4.3 Tokenization of Real news dataset
The entire project is implemented using Python, the library and packages includes:
The implementation is segregated into several modules according to the process:
A. Importing The Dataset
The first step involves importing the dataset. The two datasets namely Real and Fake having the real news data and fake news data respectively are to be imported as separate csv files. Datasets namely,
B. Data Preprocessing
C. Training The Model
Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. Usually, machine learning models require a lot of data in order for them to perform well. Usually, when training a machine learning model, one needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text.
D. Sentiment Analyzer
Once the user gives the input, data gets preprocessed by stemming and tokenization. Preprocessed data is given to the sentiment analyzer to analyze the sentiment of the news.
V. FUTURE ENHANCEMENT
Fake news Detector detects only the linguistic-based information as Real or Fake.In our application ,news articles are given as the input and the output is delivered as real or fake. Also our application detects the sentiment of the required news.Our future enhancement of our project is to give more details on the fake news detection by including the publisher of the article,subject of the article and also to detect the visual-based information as real or fake.
media, more and more people consume news from social media instead of traditional news media. However, social media has also been used to spread fake news, which has strong negative impacts on individual users and to the society.Classification news manually requires in-depth knowledge of the domain and expertise to identify anomalies in the text. In Fake News Detection, we discussed the problem of classifying fake news articles and their sentiments using TF-IDF vectorizer,Passive aggressive classifier and VaderSentiment.
 DETECTING FAKE NEWS WITH MACHINE LEARNING METHOD, Supanya Aphiwongsophon, Prabhas Chongstitvatana, Department of Computer Engineering, Faculty of Engineering Chulalongkorn, University Bangkok, Thailand, July, 2018  A SMART SYSTEM FOR FAKE NEWS DETECTION USING MACHINE LEARNING, Anjali Jain, Harsh Khatter and AvinashShakya, Dr. APJ Abdul Kalam University, Lucknow, India, September, 2019.  FAKE NEWS DETECTION USING MACHINE LEARNING ENSEMBLE METHODS, Iftikhar Ahmad, Muhammad Yousaf and Suhail Yousaf - Department of Computer Science and Information Technology, University of Engineering and Technology, Peshawar, Pakistan. Muhammad Ovais Ahmad - Department of Mathematics and Computer Science, Karlstad University, Karlstad, Sweden, October, 2020.  FAKE NEWS DETECTION USING PASSIVE AGGRESSIVE AND TF-IDF VECTORIZER, Jayashree M Kudari - Jain University. Varsha V, Monica BG and Archana R - UG Scholar, Jain University, September, 2020.  FAKE NEWS DETECTION USING MACHINE LEARNING, Vijaya Balpande, Kasturi Baswe, Kajol Somaiya, Achal Dhande, Prajwal Mire - Department of Computer Science and Engineering, Priyadarshini J. L. College of Engineering, Nagpur, Maharashtra, India, June, 2021.  AN AUTONOMOUS MODEL FOR FAKE NEWS DETECTION, Noman Islam - Department of Computer Science, Iqra University, Karachi, Pakistan. Asadullah Shaikh, Yousef Asiri , Sultan Almakdi and Adel Sulaiman - College of Computer Science and Information Systems, Najran University, Najran 61441, Saudi Arabia. Asma Qaiser, Verdah Moazzam and Syeda Aiman Babar - Department of Computer Science, NED University of Engineering and Technology, Karachi, Pakistan, October, 2021.
Copyright © 2022 Hemalatha A, Karpahalakshmi S, Thanga Sri R, Vaishnavi M, Bhavani N. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.