This research addresses the critical challenge of securing sensitive information by leveraging machine learning to detect data privacy threats. The study systematically evaluates and compares the performance of a CNN and an XGBoost classifier, with the latter further optimized using an advanced hyperparameter tuning framework. A robust preprocessing pipeline, including privacy-preserving noise, was implemented to ensure data integrity. The results demonstrate a clear performance hierarchy, with the optimized XGBoost model achieving a classification accuracy that significantly exceeds that of the other models. An analysis of feature importances from the optimized model provides an interpretable view of the most influential features driving the model's decisions. These findings underscore the potential of combining powerful boosting algorithms with modern optimization techniques to build highly effective and insightful solutions for data privacy protection.
Introduction
In the digital age, the surge in data usage—especially in cloud computing, IoT, and online transactions—has intensified concerns about data privacy and security. Traditional security methods (e.g., encryption, access control) often fall short in addressing evolving cyber threats such as malware, phishing, data breaches, and insider attacks. This has led to growing reliance on Artificial Intelligence (AI) and Machine Learning (ML) for robust threat detection.
Hybrid CNN-XGBoost Framework
The study proposes a hybrid model combining:
Convolutional Neural Networks (CNN) for automatic feature extraction from large datasets.
XGBoost for fast and accurate classification of privacy threats.
This combined model outperforms traditional deep learning methods in accuracy, speed, and efficiency, particularly in handling complex, inconsistent, or unbalanced data.
Types of Data Privacy Threats Identified
Malware (viruses, ransomware)
Social engineering (phishing, pretexting)
Insider threats
Man-in-the-middle (MitM) attacks
Data breaches
Non-compliance with regulations (e.g., GDPR, HIPAA)
Key Defensive Measures
Encryption and access control
Data Loss Prevention (DLP)
Privacy-by-design
Employee training
Zero Trust Architecture (ZTA)
Regulatory compliance
Review of Related Work
The literature review explores various privacy-preserving techniques such as:
Federated learning with encryption (e.g., MedPFL, Fed-AugMix)
Hybrid DL models (CNN-LSTM-XGBoost)
AI for intrusion detection in IoT, edge, and cyber-physical systems
These models balance accuracy and efficiency but often face issues like high computational cost or lack of scalability.
The proposed approach applies machine learning to detect known attack signatures or anomalies in network traffic.
It uses the UNSW-NB15 dataset for training and evaluation.
XGBoost is favored for its superior performance over other models such as Decision Trees and SVMs.
XGBoost Algorithm Steps
The XGBoost training procedure consists of model initialization, prediction, gradient and Hessian calculation, tree construction, and iterative updates, all aimed at minimizing the loss while avoiding overfitting through regularization.
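The steps listed above can be illustrated with a minimal pure-Python sketch. It is not the study's implementation and not the XGBoost library itself: it assumes a logistic loss, depth-1 trees (stumps), and a tiny made-up dataset, and the names fit, fit_stump, lam, gamma, and eta are chosen here for illustration (they loosely follow XGBoost's L2 penalty, split penalty, and learning-rate conventions).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_weight(g_sum, h_sum, lam):
    # Optimal leaf value under a second-order loss approximation with L2 penalty lam.
    return -g_sum / (h_sum + lam)

def split_gain(gl, hl, gr, hr, lam, gamma):
    # Loss reduction from a split; gamma penalizes the additional leaf.
    score = lambda g, h: g * g / (h + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

def fit_stump(X, g, h, lam, gamma):
    # Tree construction (depth 1): pick the feature/threshold with the highest gain.
    G, H = sum(g), sum(h)
    best = None  # (gain, feature, threshold, left_weight, right_weight)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for low, high in zip(values, values[1:]):
            thr = (low + high) / 2.0
            gl = sum(gi for row, gi in zip(X, g) if row[j] <= thr)
            hl = sum(hi for row, hi in zip(X, h) if row[j] <= thr)
            gain = split_gain(gl, hl, G - gl, H - hl, lam, gamma)
            if best is None or gain > best[0]:
                best = (gain, j, thr,
                        leaf_weight(gl, hl, lam), leaf_weight(G - gl, H - hl, lam))
    if best is None or best[0] <= 0.0:
        w = leaf_weight(G, H, lam)  # no useful split: fall back to a single leaf
        return (0, float("inf"), w, w)
    return best[1:]

def predict_proba(model, row):
    margin = model["base"]
    for j, thr, w_left, w_right in model["trees"]:
        margin += model["eta"] * (w_left if row[j] <= thr else w_right)
    return sigmoid(margin)

def log_loss(y, probs):
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
                for yi, p in zip(y, probs)) / len(y)

def fit(X, y, n_rounds=20, eta=0.3, lam=1.0, gamma=0.0):
    # Initialization: zero margin, i.e. a predicted probability of 0.5 for every row.
    model = {"base": 0.0, "eta": eta, "trees": []}
    for _ in range(n_rounds):
        p = [predict_proba(model, row) for row in X]   # prediction
        g = [pi - yi for pi, yi in zip(p, y)]          # gradient of the logistic loss
        h = [pi * (1.0 - pi) for pi in p]              # Hessian of the logistic loss
        model["trees"].append(fit_stump(X, g, h, lam, gamma))  # iterative update
    return model

# Made-up toy data: feature 0 separates the two classes, feature 1 is noise.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [6.0, 1.0], [7.0, 6.0], [8.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
print(log_loss(y, [predict_proba(model, row) for row in X]))
```

In this sketch the regularization enters in two places: lam shrinks every leaf weight toward zero, and gamma rejects splits whose gain does not justify an extra leaf.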
Results
Data preprocessing significantly boosts model performance.
CNN and Optimized XGBoost models both achieved perfect AUC scores (1.00) on ROC curves.
Top features contributing to malware detection included SizeOfStackReserve, VersionInformationSize, and Subsystem.
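To make the AUC figure above concrete, the following is a small sketch of how the area under the ROC curve can be computed as a ranking statistic: the fraction of (malicious, benign) pairs in which the malicious sample receives the higher score. The function name roc_auc and the example scores are illustrative assumptions, not values from the study.

```python
def roc_auc(labels, scores):
    # AUC as the probability that a random positive outranks a random negative.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def accuracy_at(labels, scores, threshold=0.5):
    # Accuracy after turning scores into hard decisions at a fixed threshold.
    hits = sum((s >= threshold) == (l == 1) for l, s in zip(labels, scores))
    return hits / len(labels)

# Illustrative scores only: perfectly ranked, but all below the 0.5 threshold.
labels = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4]
print(roc_auc(labels, scores), accuracy_at(labels, scores))
```

Because AUC depends only on the ordering of scores, a model can reach an AUC of 1.00 while its accuracy at a fixed decision threshold is lower, so the two metrics are best read together.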
Conclusion
This work demonstrated the effectiveness of a hyperparameter-optimized XGBoost model for malware detection in a data privacy-sensitive environment, with a view to real-time threat detection. By intentionally under-tuning the Simple CNN and Simple XGBoost models, this research met the core objective of showcasing the significant performance gains achieved through intelligent hyperparameter optimization with Optuna.
The results clearly illustrate a performance hierarchy: the Optimized XGBoost model achieved the highest accuracy, outperforming the Simple XGBoost model by approximately 15% and the Simple CNN model by over 20%. This validates the hypothesis that a carefully tuned boosting algorithm can capture complex patterns in the data more effectively than simpler or unoptimized models. The use of Optuna proved to be a highly efficient method for navigating the complex hyperparameter space, converging on a superior solution with fewer trials than a traditional grid search. The feature importance analysis identified the most influential features for distinguishing between legitimate and malicious files, thereby transforming the model's abstract predictions into actionable intelligence. This demonstrates that the machine learning model is not just a black box; it is a powerful analytical tool that can provide a deeper understanding of the underlying data privacy threats.
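The difference in trial budgets between an exhaustive grid and a sampling-based search can be sketched as follows. This example deliberately does not use Optuna: it substitutes plain random search for Optuna's sampler so that it runs with the standard library alone, and toy_score is a synthetic stand-in for a validation score, not a result from the study. The parameter names learning_rate and max_depth are assumed for illustration.

```python
import itertools
import math
import random

def toy_score(learning_rate, max_depth):
    # Synthetic objective (assumption): peaks at learning_rate=0.1, max_depth=6.
    lr_term = math.exp(-((math.log10(learning_rate) + 1.0) ** 2))
    depth_term = math.exp(-((max_depth - 6) ** 2) / 18.0)
    return lr_term * depth_term

def grid_search(learning_rates, depths):
    # Evaluates every combination, so the cost is the product of the grid sizes.
    trials = [((lr, d), toy_score(lr, d))
              for lr, d in itertools.product(learning_rates, depths)]
    return max(trials, key=lambda t: t[1]), len(trials)

def random_search(n_trials, seed=0):
    # Evaluates a fixed budget of sampled configurations, whatever the space size.
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-3.0, 0.0)  # log-uniform over [0.001, 1]
        depth = rng.randint(2, 10)         # inclusive integer range
        trials.append(((lr, depth), toy_score(lr, depth)))
    return max(trials, key=lambda t: t[1]), len(trials)

best_grid, n_grid = grid_search([0.001, 0.01, 0.1, 1.0], [2, 4, 6, 8, 10])
best_rand, n_rand = random_search(n_trials=10)
print(n_grid, best_grid)
print(n_rand, best_rand)
```

The grid cost grows multiplicatively with every added hyperparameter, while a sampling-based search keeps a fixed budget; adaptive samplers such as the one Optuna uses by default go further by concentrating later trials in regions that scored well earlier.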
Future work could explore the application of more advanced deep learning architectures, such as Recurrent Neural Networks (RNNs) or Attention mechanisms, to see if they can achieve even higher performance. Additionally, the robustness of the privacy-preserving noise and its impact on a wider range of datasets and attack vectors could be further investigated to strengthen the model's real-world applicability.
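The text does not state which noise mechanism the preprocessing pipeline used. As one possible illustration only, the sketch below assumes Laplace noise with scale sensitivity / epsilon, as in the standard Laplace mechanism; the function names, the sensitivity value, and the epsilon values are all hypothetical.

```python
import random

def laplace_sample(scale, rng):
    # The difference of two i.i.d. exponential draws follows a Laplace(0, scale) law.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def add_laplace_noise(values, sensitivity, epsilon, seed=None):
    # Assumed mechanism: smaller epsilon means larger noise and stronger privacy.
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [v + laplace_sample(scale, rng) for v in values]

def mean_abs_noise(original, noisy):
    return sum(abs(a - b) for a, b in zip(original, noisy)) / len(original)

# Hypothetical feature column, assumed to be pre-scaled to [0, 1] (sensitivity 1).
column = [0.2, 0.5, 0.9, 0.1] * 250
weak_privacy = add_laplace_noise(column, sensitivity=1.0, epsilon=1.0, seed=42)
strong_privacy = add_laplace_noise(column, sensitivity=1.0, epsilon=0.1, seed=42)
print(mean_abs_noise(column, weak_privacy), mean_abs_noise(column, strong_privacy))
```

A robustness study of the kind proposed above could sweep epsilon in this way and record how the classifier's accuracy and AUC degrade as the noise scale grows.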
References
[1] Q. Wu, S. Zhuang, and X. Wang, “A novel detection mechanism against malicious attacks by using spatio and temporal topology information,” Sci Rep, vol. 15, no. 1, p. 9978, Mar. 2025, doi: 10.1038/s41598-025-93957-8.
[2] S. S. Reka, T. Dragicevic, P. Venugopal, V. Ravi, and M. K. Rajagopal, “Big data analytics and artificial intelligence aspects for privacy and security concerns for demand response modelling in smart grid: A futuristic approach,” Heliyon, vol. 10, no. 15, p. e35683, Aug. 2024, doi: 10.1016/j.heliyon.2024.e35683.
[3] Maureen Oluchukwuamaka Okafor, “Deep learning in cybersecurity: Enhancing threat detection and response,” World J. Adv. Res. Rev., vol. 24, no. 3, pp. 1116–1132, Dec. 2024, doi: 10.30574/wjarr.2024.24.3.3819.
[4] Y. Liu, S. Li, X. Wang, and L. Xu, “A Review of Hybrid Cyber Threats Modelling and Detection Using Artificial Intelligence in IIoT,” CMES, vol. 140, no. 2, pp. 1233–1261, 2024, doi: 10.32604/cmes.2024.046473.
[5] H. Li, W. Chen, and X. Zhang, “Fed-AugMix: Balancing Privacy and Utility via Data Augmentation,” Dec. 18, 2024, arXiv: arXiv:2412.13818. doi: 10.48550/arXiv.2412.13818.
[6] A. Korkmaz and P. Rao, “A Selective Homomorphic Encryption Approach for Faster Privacy-Preserving Federated Learning,” Feb. 27, 2025, arXiv: arXiv:2501.12911. doi: 10.48550/arXiv.2501.12911.
[7] K. Harahsheh, M. Alzaqebah, and C.-H. Chen, “An Enhanced Real-Time Intrusion Detection Framework Using Federated Transfer Learning in Large-Scale IoT Networks,” ijacsa, vol. 15, no. 12, 2024, doi: 10.14569/IJACSA.2024.0151204.
[8] B. C. Das, M. H. Amini, and Y. Wu, “In-depth Analysis of Privacy Threats in Federated Learning for Medical Data,” Sep. 27, 2024, arXiv: arXiv:2409.18907. doi: 10.48550/arXiv.2409.18907.
[9] S.-H. Choi and K.-W. Park, “GENOME: Genetic Encoding for Novel Optimization of Malware Detection and Classification in Edge Computing,” CMC, vol. 82, no. 3, pp. 4021–4039, 2025, doi: 10.32604/cmc.2025.061267.
[10] Dr. S. Bahmaid and Dr. S. A. Mahyoub Ghaleb, “Intrusion Detection System Using Chaotic Walrus Optimization-based Convolutional Echo State Networks for IoT-assisted Wireless Sensor Networks,” JOWUA, vol. 15, no. 3, pp. 236–252, Sep. 2024, doi: 10.58346/JOWUA.2024.I3.016.
[11] A. F. Al-zubidi, A. K. Farhan, and S. M. Towfek, “Predicting DoS and DDoS attacks in network security scenarios using a hybrid deep learning model,” Journal of Intelligent Systems, vol. 33, no. 1, p. 20230195, Apr. 2024, doi: 10.1515/jisys-2023-0195.
[12] Dari, S. S., Dhabliya, D., Govindaraju, K., Dhablia, A., & Mahalle, P. N. (2024). Data Privacy in the Digital Era: Machine Learning Solutions for Confidentiality. E3S Web of Conferences, 491, 02024. https://doi.org/10.1051/e3sconf/202449102024
[13] Ch. Nanda Krishna and k.f. Bharati (2024) “An Adaptive Privacy Preserving Based Ensemble Learning Framework for Large Dimensional Datasets” Journal of Theoretical and Applied Information Technology , ISSN: 1992-8645 , 15th January 2024. Vol.102. No 1
[14] Elaheh Jafarigol, Theodore B. Trafalis, Talayeh Razzaghi, Mona Zamankhani (2023) “Exploring Machine Learning Models for Federated Learning: A Review of Approaches, Performance, and Limitations” arXiv:2311.10832v1 [cs.LG] 17 Nov 2023
[15] Liu, K., & Tang, C. (2023) “Privacy-preserving Naive Bayes classification based on secure two-party computation”, AIMS Mathematics, 8(12), 28517–28539. https://doi.org/10.3934/math.20231459
[16] Mohtady Ehab Barakat and Chung Gwo Chin et al. (2023) "Performance Analysis of Chronic Kidney Disease Detection Based on K-Nearest Neighbors Data Mining" International Journal of Intelligent Systems And Applications In Engineering, ISSN:2147-67992, IJISAE, 2023, 11(8s), 393–400
[17] Madhu, B., Aerranagula, V., Mahomad, R., Ravindernaik, V., Madhavi, K., & Krishna, G. (2023). Techniques of Machine Learning for the Purpose of Predicting Diabetes Risk in PIMA Indians. E3S Web of Conferences, 430, 01151. https://doi.org/10.1051/e3sconf/202343001151
[18] Yerra Renu Sree and Prof. M. Ramjee (2023) "Heart Disease Prediction Using Machine Learning Algorithms" International Journal of Creative Research Thoughts (IJCRT), ISSN: 2320-2882, Volume 11, Issue 9 September 2023
[19] Suyal, M., & Goyal, P. (2022, July 31). A Review on Analysis of K-Nearest Neighbor Classification Machine Learning Algorithms based on Supervised Learning. International Journal of Engineering Trends and Technology, 70(7), 43–48. https://doi.org/10.14445/22315381/ijett-v70i7p205
[20] N. B. Henda, A. Msolli, I. Hagui, A. Helali, H. Maaref and R. Mghaieth, "A Novel SVM Based CFS for Intrusion Detection in IoT Network," 2023 IEEE International Conference on Advanced Systems and Emergent Technologies (IC_ASET), Hammamet, Tunisia, 2023, pp. 1-5, doi: 10.1109/IC_ASET58101.2023.10150979.
[21] S. Sharma, A. K. M. M. Alam and K. Chen, "Image Disguising for Protecting Data and Model Confidentiality in Outsourced Deep Learning," 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), Chicago, IL, USA, 2021, pp. 71-77, doi: 10.1109/CLOUD53861.2021.00020.