The rapid growth of internet-based services has significantly increased exposure to cyber threats, and phishing remains one of the most common and effective social engineering attacks. Phishing websites are designed to deceive users and steal confidential information such as login credentials, banking details, and credit card numbers. This research develops an intelligent phishing website detection system that distinguishes phishing websites from legitimate ones. A phishing dataset containing website attributes and URL-based features is collected from PhishTank and other verified phishing repositories. The dataset is preprocessed and analyzed using machine learning techniques to classify websites as phishing or legitimate. A Logistic Regression model is employed for classification because it is efficient, interpretable, and well suited to binary classification problems. The model analyzes the extracted website features and predicts the probability that a website is phishing. Additionally, a web-based application is developed to provide real-time phishing detection for users. The system incorporates a database of previously identified phishing websites, which functions as a blacklist to speed up detection and avoid redundant classification. Experimental evaluation indicates that the proposed Logistic Regression approach detects phishing websites effectively while remaining computationally efficient, making it suitable for practical cybersecurity applications.
Introduction
The rapid growth of internet-based services such as online banking, shopping, communication, and social networking has increased the risk of cyber threats, particularly phishing attacks. Phishing websites imitate legitimate websites to steal sensitive user information such as usernames, passwords, and financial details. Traditional detection methods, including blacklists and manual verification, often fall behind because phishing websites are created and taken down quickly. To address this issue, the proposed system uses Machine Learning (ML) techniques to automatically detect phishing websites based on features such as URL length, domain age, HTTPS usage, special characters, redirections, and webpage content.
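To make the URL-based features listed above concrete, the following is a minimal sketch of how a few of them might be computed from the URL string alone. The function name `extract_url_features`, the feature names, and the set of characters treated as "special" are illustrative assumptions, not the feature set used in this work. Domain age, redirections, and webpage content are left out because they require WHOIS or network access.

```python
from urllib.parse import urlparse

# Characters counted as "special"; this set is an illustrative assumption.
SPECIAL_CHARS = "@-_=?&%"


def extract_url_features(url: str) -> dict:
    """Return a few lexical features computed from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "special_char_count": sum(url.count(c) for c in SPECIAL_CHARS),
        "dot_count_in_host": host.count("."),
        "has_at_symbol": int("@" in url),
    }


if __name__ == "__main__":
    # Hypothetical URL used only to show the output shape.
    print(extract_url_features("http://secure-login.example.com/verify?id=1@2"))
```

Each feature is returned as a number so the dictionary can be turned directly into a row of the training dataset.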
The literature indicates that ML-based approaches can outperform traditional methods because they identify both known and newly created phishing websites. Algorithms such as Decision Tree, Random Forest, Support Vector Machine (SVM), Logistic Regression, and Naive Bayes have been widely used, with Random Forest and SVM often reported to achieve higher accuracy. Recent studies also explore deep learning techniques to improve detection performance.
Developing an effective phishing detection system involves several challenges, including the rapid evolution of phishing tactics, the difficulty of obtaining up-to-date datasets, feature selection, false positives and false negatives, and the need for continuous model updates. The proposed methodology involves collecting phishing and legitimate website data, extracting relevant features, preprocessing the dataset, and training ML models. The trained model then analyzes user-entered URLs and predicts whether a website is legitimate or phishing.
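The training and prediction steps described above can be illustrated with a small, self-contained sketch of Logistic Regression fitted by batch gradient descent. This is not the implementation used in this work: the toy data, the three feature columns, the learning rate, the number of epochs, and all function names are assumptions made for illustration. The `classify` helper also shows one way the blacklist lookup mentioned in the abstract could short-circuit the model.

```python
import math

# Toy, invented data for illustration only. Columns (all assumptions):
# [scaled URL length, uses HTTPS (0/1), scaled special-character count]
TOY_X = [
    [0.9, 0, 0.8],
    [0.8, 0, 0.6],
    [0.7, 1, 0.9],
    [0.2, 1, 0.1],
    [0.3, 1, 0.0],
    [0.1, 0, 0.2],
]
TOY_Y = [1, 1, 1, 0, 0, 0]  # 1 = phishing, 0 = legitimate


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, row) -> float:
    """Estimated probability that a feature row belongs to a phishing site."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def train_logistic_regression(X, y, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            error = predict_proba(weights, bias, row) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def classify(url, features, weights, bias, blacklist, threshold=0.5):
    """Check the blacklist first; otherwise fall back to the trained model."""
    if url in blacklist:
        return "phishing", 1.0
    p = predict_proba(weights, bias, features)
    return ("phishing" if p >= threshold else "legitimate"), p


if __name__ == "__main__":
    w, b = train_logistic_regression(TOY_X, TOY_Y)
    # Hypothetical URL and feature row, not taken from any real dataset.
    print(classify("http://unknown.example/login", [0.85, 0, 0.7], w, b, set()))
```

In practice a library implementation such as scikit-learn's `LogisticRegression` would normally replace the hand-written training loop; the loop is spelled out here only to show what the model learns.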
The system operates in two phases: training and prediction. During training, website data is processed and used to build ML models, while in the prediction phase, website features are analyzed and classified by the trained model. Performance is evaluated using metrics such as Accuracy, Precision, Recall, and F1-Score.
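The four evaluation metrics named above follow directly from the counts of true and false positives and negatives. The sketch below computes them in plain Python, treating phishing (label 1) as the positive class. The function names and the example labels are illustrative assumptions, not results from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1-score as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up labels: 1 = phishing, 0 = legitimate.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(classification_metrics(y_true, y_pred))
```

For phishing detection, recall reflects how many phishing sites are caught, while precision reflects how often a legitimate site is wrongly flagged, so the two are usually reported together rather than relying on accuracy alone.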
Conclusion
The proposed “Phishing Website Detection Using Machine Learning” system demonstrates an effective approach for identifying and classifying malicious websites by applying machine learning techniques. In this work, the Logistic Regression algorithm is utilized as a binary classification model to analyze extracted website features and predict whether a given website is phishing or legitimate. The system improves the reliability of phishing detection by evaluating important attributes such as URL characteristics, security-related parameters, and domain-based features.
The integration of data preprocessing, feature extraction, and feature selection enhances the performance and efficiency of the classification process. The developed model provides faster detection, reduces the risk of credential theft, and strengthens user protection against cyber threats. Experimental evaluation using performance metrics such as accuracy, precision, recall, and F1-score demonstrates the effectiveness of Logistic Regression in phishing website identification. The proposed system highlights the potential of machine learning-based approaches in developing intelligent cybersecurity solutions and provides a scalable framework for real-time phishing detection and secure web browsing.