Authors: Chandan Singh, V. Vijayalakshmi, Harsh Raj
Website attacks are one of the main threats to the websites and web portals of private and public organizations. Web applications are now an important part of day-to-day life, so securing them has become a challenging task. Attackers aim to extract sensitive information about users through URL links sent to the victims. Traditional methods for stopping these attacks no longer perform well because attackers are becoming more skilled at attacking web applications, and people are looking for reliable and consistent web application attack detection software. This work tries to fill that gap. The proposed model aims to protect web applications from vulnerabilities and from different types of attack using a machine learning approach based on the Random Forest model, which gave higher accuracy than the other machine learning algorithms considered.
Web applications have become an essential part of daily life because they reduce complexity and make everyday tasks easier. This importance also has a downside: because web applications handle all kinds of traffic, from banking and defense transactions to activity such as cyber bullying, they attract a great deal of attention from third parties and hackers.
An attacker cannot easily take control of someone else's web application. To do so, they need an open port or a weakness in the system through which they can enter. Such a weakness is called a vulnerability: a flaw that hackers can use to gain access and then manipulate, change or destroy data.
Vulnerabilities are typically created unintentionally during system development, usually through incorrect design decisions in one of the phases of the system life cycle. Bugs found and fixed during the development and testing stages are not counted as vulnerabilities; only bugs that remain in the operational system are. If a vulnerability is created maliciously, and therefore intentionally, its creation and discovery coincide. After a vulnerability is discovered, the point at which it was created can be determined retroactively.
Vulnerabilities open the door to many attacks, such as phishing, defacement, malware, SQL injection, XSS and many more. In this work we consider only three types of attack: phishing, defacement and malware.
Many traditional techniques exist for detecting this type of vulnerability in web applications, but they are limited to only some operations, so more efficient detection methods are required. By using machine learning methods we reduce the false rate and increase the accuracy.
Many algorithms can help in detecting vulnerabilities, but we need the one with the highest accuracy, so we use the Random Forest algorithm to detect multiple forms of attack. Because we use machine learning, we need sufficient datasets to train and test the model. This is a major problem: datasets are not available in sufficient amounts, so for some types of attack training the model becomes very difficult.
The project detects phishing, malware and defacement attacks with an accuracy of 96.6%. It will help individuals as well as organizations detect attacks that can occur when clicking on an infected link.
The rest of the paper is organized as follows. Section II presents a detailed literature study. Section III discusses system tool selection and problem identification. Section IV discusses the system architecture, the detailed system design steps and the implementation steps. The paper concludes with future enhancements.
II. LITERATURE SURVEY
III. SYSTEM DESIGN
A. Problem Analysis
We took the dataset from Kaggle; it is a collection of 651191 URLs. Of these, 428103 are safe URLs, 96457 are defacement URLs, 94111 are phishing URLs and 32520 are malware URLs. Together these legitimate, phishing, malware and defacement websites form the training dataset.
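The class counts quoted above can be checked with a few lines of Python. This is only an arithmetic sketch using the figures stated in the text; the names class_counts and class_share are illustrative choices and nothing here reads the actual Kaggle file.

```python
# Class counts as stated in the text (not re-derived from the Kaggle file).
class_counts = {
    "safe": 428103,
    "defacement": 96457,
    "phishing": 94111,
    "malware": 32520,
}

total = sum(class_counts.values())

# Share of each class, useful for spotting class imbalance before training.
class_share = {label: count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"{label:<11} {class_counts[label]:>7} {share:6.1%}")
print(f"{'total':<11} {total:>7}")
```

The four counts sum to 651191, and roughly two thirds of the URLs are safe, so any accuracy figure should be read with this class imbalance in mind.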
The dataset is a combination of three datasets and contains all the attack types considered: phishing, malware and defacement. It is processed through feature extraction, a technique that reduces the number of features in a dataset by creating a new feature set from a particular feature in the dataset (here, the URL string). This is useful when a dataset contains many characteristics, which makes the model difficult to fit to the data. The extracted features include fd_length, hostname_length, count_dir, url_length, count_letters, tld_length, count-www, count= and count%, along with other character counts. Malicious and legitimate URLs are differentiated on the basis of these features, because URLs that carry an attack show different feature values from those of legitimate websites. The features of each URL are passed through the Random Forest model, which classifies the URL as safe, malware, defacement or phishing. A detection report is then generated from the result, and a warning dialogue box showing the outcome and its accuracy is presented to the user.
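A minimal sketch of how such lexical features might be computed from a URL string is shown below. The feature definitions are assumptions made for illustration (for example, the top-level domain is taken naively as the last dot-separated label of the hostname), and the function name extract_features is hypothetical; the original project may define its features differently.

```python
import re
from urllib.parse import urlparse

_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def extract_features(url: str) -> dict:
    """Compute simple lexical features from a URL (illustrative definitions)."""
    # Dataset URLs may lack a scheme; add one so urlparse finds the hostname.
    parsed = urlparse(url if _SCHEME.match(url) else "http://" + url)
    hostname = parsed.hostname or ""
    path = parsed.path

    # First directory: the text between the first and second "/" of the path.
    segments = path.split("/")
    first_dir = segments[1] if len(segments) > 1 else ""

    # Naive TLD: the last dot-separated label of the hostname (assumption).
    tld = hostname.rsplit(".", 1)[-1] if "." in hostname else ""

    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "fd_length": len(first_dir),
        "tld_length": len(tld),
        "count_dir": path.count("/"),
        "count_letters": sum(ch.isalpha() for ch in url),
        "count_www": url.count("www"),
        "count_equals": url.count("="),
        "count_percent": url.count("%"),
    }


features = extract_features("http://www.example.com/login/index.php?id=1%20")
print(features)
```

Each URL becomes one row of numeric values, which is the form a tree-based classifier expects as input.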
A. System Architecture
The system architecture provides an overview of how the system works. Here's how this system works:
Dataset collection gathers the data, including URLs and websites that may be either malicious or legitimate. Through feature extraction, we extract the features that differentiate the attacks and process them further to determine whether a URL is legitimate.
First, the input URL is passed to the model trained on the dataset, which checks whether the URL is malicious. If it is malicious, that is, phishing, malware or defacement, the algorithm intercepts it and a warning dialogue shows the type of attack. If it is not malicious, we show a dialogue saying so and load the page in the normal manner.
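The decision flow described above can be sketched as follows. The function handle_url and the stub classifier are hypothetical names introduced for illustration; in the real system the classify argument would be the trained Random Forest model wrapped behind the same interface.

```python
from typing import Callable

MALICIOUS_LABELS = {"phishing", "malware", "defacement"}


def handle_url(url: str, classify: Callable[[str], str]) -> dict:
    """Decide whether to warn the user or load the page for a given URL."""
    label = classify(url)
    if label in MALICIOUS_LABELS:
        message = f"Warning: possible {label} attack detected."
        return {"url": url, "action": "warn", "message": message}
    return {"url": url, "action": "load", "message": "No attack detected."}


def stub_classifier(url: str) -> str:
    """Hypothetical stand-in for the trained model; NOT a real detector."""
    return "phishing" if "login-verify" in url else "safe"


blocked = handle_url("http://example.com/login-verify", stub_classifier)
allowed = handle_url("http://example.com/home", stub_classifier)
print(blocked["action"], allowed["action"])
```

Keeping the classifier as a parameter separates the warning logic from the model, so the model can be retrained or replaced without touching the interception code.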
B. Supervised Machine Learning
In the simplest sense, supervised learning means that an algorithm learns to map an input to a particular output. If the mapping is correct, the algorithm has learned successfully. If not, the necessary changes are made to the algorithm so that it can learn properly. Once trained, supervised learning algorithms can make predictions on unseen data that will be received in the future.
The supervised learning model is used to build and improve several business applications, including:
C. Random Forest model
Random forest is a non-parametric supervised machine learning algorithm, meaning it makes no assumption about the probability distribution of the data points. It extends the decision tree classifier with bagging to improve performance. It combines tree predictors, each of which depends on an independently sampled random vector. It belongs to the class of ensemble methods: it tries to reduce variance by producing an "average" decision rule from a set (the forest) of many different decision trees. These trees are constructed so that the prediction for a new data point given by one part of the forest should be similar to the average rule produced by the other parts.
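The averaging behaviour can be illustrated with scikit-learn's RandomForestClassifier. This is a sketch on synthetic data, assumed here as a stand-in for the URL feature vectors; the sample sizes and hyperparameters are arbitrary choices, not the settings used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 4-class data standing in for URL feature vectors (assumption).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=4, random_state=0
)

# bootstrap=True (the default) trains each tree on a resampled copy of the data.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Average the class probabilities predicted by the individual trees.
tree_average = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0
)

# The forest's own class probabilities are this same average.
matches = np.allclose(tree_average, forest.predict_proba(X))
print("forest equals average of trees:", matches)
```

The check confirms that the forest's prediction is the mean of its trees' predictions, which is the variance-reducing "average" decision rule described above.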
To set up the model for training, we import Python packages such as Pandas, NumPy, Scikit-learn, Matplotlib and Flask, all of which must be installed beforehand.
A. Feature Pattern
To detect and classify attacks, the model needs features, and we have chosen some important ones for our dataset. Before classifying the data, we check whether there are any matching patterns in the collected data that can determine the type of a link.
B. Suspicious Words
In the collected dataset we search for suspicious words, which help to identify threats more accurately.
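A suspicious-word check of this kind could look like the sketch below. The keyword list is hypothetical and chosen only for illustration, since the text does not state which words were used; the function name suspicious_word_features is likewise an assumption.

```python
import re

# Hypothetical keyword list for illustration; the source does not give its list.
SUSPICIOUS_WORDS = (
    "login", "signin", "verify", "account", "update", "bank", "free", "bonus",
)

_PATTERN = re.compile("|".join(map(re.escape, SUSPICIOUS_WORDS)), re.IGNORECASE)


def suspicious_word_features(url: str) -> dict:
    """Return a binary flag and a count of suspicious keyword matches."""
    matches = _PATTERN.findall(url)
    return {
        "has_suspicious_word": int(bool(matches)),
        "suspicious_word_count": len(matches),
    }


flagged = suspicious_word_features("http://example.com/Login/verify-account")
clean = suspicious_word_features("http://example.com/index.html")
print(flagged, clean)
```

The resulting flag or count can be appended to the other URL features before training.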
The dataset covers different classes of URL, such as normal, phishing and defacement, so we need the total number of samples available for each class in order to train the model for detection.
D. Feature Extractor
For the model to detect attacks, it needs features on which the type of a URL can be decided. Features play an important role because they can reduce the false rate. Since we use the Random Forest algorithm and every type of attack has different characteristic features, we collected the most commonly used features across all three attack types.
E. Confusion Matrix
The confusion matrix is a technique for summarizing the performance of a classification algorithm. If each class has an unequal number of observations, or if the dataset has more than two classes, classification accuracy alone can be misleading.
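The following sketch builds a confusion matrix by hand and shows how accuracy can mislead on imbalanced classes. The toy labels are made up for illustration and are not results from the paper; the helper names are hypothetical.

```python
from collections import Counter

LABELS = ["safe", "defacement", "phishing", "malware"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix, labels):
    """Recall = correct predictions for a class / all true members of it."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else 0.0
    return recalls


# Made-up toy labels, chosen only to show the imbalance effect.
y_true = ["safe"] * 8 + ["malware"] * 2
y_pred = ["safe"] * 10  # a model that always answers "safe"

matrix = confusion_matrix(y_true, y_pred, LABELS)
accuracy = sum(matrix[i][i] for i in range(len(LABELS))) / len(y_true)
recall = per_class_recall(matrix, LABELS)
print(accuracy, recall["malware"])
```

Here the accuracy is 0.8 even though no malware URL is detected, which is the kind of failure the confusion matrix exposes and a single accuracy figure hides.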
VI. RESULTS AND DISCUSSIONS
A. Attack Detection
The major challenge with the proposed model is the use of large datasets. To assess the machine learning techniques, we used a dataset containing 651191 URLs, both legitimate and malicious, each described by 21 features. Each URL is checked against a standard: if the standard is met, the URL is considered malicious; if not, it is considered legitimate. We found that using SVM or logistic regression instead of Random Forest degrades system performance.
Web technology is growing rapidly in today's internet world, so websites must be kept safe, and attacks such as phishing, malware and defacement must be detected and prevented. This project creates a model that helps with the safety and security of a given website. It identifies malicious and non-malicious websites using datasets: features are extracted from each website URL, the URL is classified by machine learning using the Random Forest algorithm, and the accuracy of the detection is determined. The model offers a safe and secure way to access a website without worrying about unfamiliar or unpredictable behavior, since machine learning gives an idea of that behavior in advance. Machine learning is a technology that can have an impact in all domains, and in this digital world it is one of the best tools for a safe and secure internet platform.
Copyright © 2022 Chandan Singh, V. Vijayalakshmi, Harsh Raj. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.