The evolving cybersecurity landscape demands intrusion detection systems capable of identifying diverse attack patterns across network and application layers. This study addresses limitations in current benchmark datasets by enhancing the CICIDS-2017 dataset through systematic incorporation of additional attack variants, including web-based threats such as Cross-Site Scripting alongside its existing network attack profiles. Our methodology combines realistic attack simulation with rigorous feature engineering to maintain dataset integrity while expanding its threat coverage. We train and evaluate multiple algorithms, selecting the most effective approach based on comprehensive evaluation metrics. The resulting model demonstrates strong capabilities in identifying both traditional network intrusions and contemporary attack patterns. Particular attention is given to maintaining low false positive rates while ensuring broad threat coverage.
Introduction
The CICIDS2017 dataset is a widely used benchmark for network intrusion detection research, containing realistic and labeled network traffic with various modern attack types alongside benign data. Researchers have applied numerous traditional machine learning algorithms (like Random Forest, SVM, KNN) and deep learning models (including CNNs, RNNs, LSTMs, and autoencoders) to classify network traffic and detect intrusions. Key challenges include handling class imbalance, where benign traffic dominates, leading to biased models. Techniques such as oversampling (e.g., SMOTE), undersampling, cost-sensitive learning, and ensemble methods have been employed to mitigate this issue. Feature selection and engineering also play critical roles in improving model accuracy and efficiency.
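The oversampling idea behind SMOTE mentioned above can be illustrated with a minimal NumPy sketch: each synthetic sample is an interpolation between a minority sample and one of its k nearest minority neighbors. This is an illustration of the technique, not the imbalanced-learn implementation; `smote_sketch` and its parameters are names chosen here for exposition.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    each chosen sample toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # Euclidean distances from sample i to every other minority sample
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        neighbors = np.argsort(d)[:k]      # indices of the k nearest neighbors
        j = rng.choice(neighbors)
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.vstack(synthetic)
```

In practice, a tested library such as imbalanced-learn would be used for SMOTE; the sketch only shows the interpolation step that keeps synthetic samples inside the minority class's local neighborhoods.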
Comparative studies help identify the best-performing algorithms across multiple metrics like accuracy, precision, recall, and F1-score, balancing detection quality with computational cost.
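The metrics named above all derive from confusion-matrix counts; a minimal sketch for the binary case (attack as the positive class) makes their definitions concrete. The function name is illustrative, not from any particular library.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # a.k.a. detection rate
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On imbalanced IDS data, accuracy alone is misleading: a model that labels every flow benign still scores high accuracy but has zero recall on attacks, which is why precision, recall, and F1 are reported alongside it.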
The methodology involves simulating a web-based Cross-Site Scripting (XSS) attack in a controlled virtual environment, capturing the resulting network traffic, converting it into a structured dataset compatible with CICIDS2017, and integrating the new attack data into the original dataset. A Random Forest classifier trained on the augmented dataset achieved high overall accuracy (98%) but showed reduced performance on the minority intrusion class due to imbalance. Cross-validation confirmed the model's robustness, suggesting Random Forest is effective for intrusion detection, though addressing class imbalance remains important for improving detection of rare attacks.
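The training and cross-validation steps described above can be sketched with scikit-learn. The data below is a synthetic stand-in for the augmented CICIDS2017 flow features; the feature count, split sizes, and hyperparameters are placeholder assumptions, not the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 1000 flows x 10 features, ~10% attacks (label 1),
# mimicking the class imbalance discussed above.
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.10).astype(int)
X[y == 1] += 1.5  # shift attack flows so the classes are learnable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # one simple imbalance mitigation
    random_state=0)
clf.fit(X_train, y_train)

test_acc = clf.score(X_test, y_test)
# 5-fold cross-validation on the full set checks robustness
cv_scores = cross_val_score(clf, X, y, cv=5)
```

Stratifying the split (and, by default for classifiers, the CV folds) keeps the rare attack class represented in every fold, which matters when the minority class is only a few percent of the data.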
Conclusion
This study shows how machine learning methods, specifically Random Forest, can be used for intrusion detection. The results are promising, but future work should focus on improving model performance through more sophisticated feature engineering and better handling of class imbalance. Intrusion decision support systems may benefit greatly from machine learning models like Random Forest, which could help security analysts identify threats more quickly and accurately.