Distributed network attacks, including botnets, pose significant challenges in detecting and mitigating their activities. We present the application of learning Discriminative Boosted Bayesian Networks to detect botnet activity using the CTU-13-Dataset. Our results are compared with traditional machine learning approaches, with and without expert knowledge. This marks the first application of statistical relational learning in this domain, addressing the need for effective detection in evolving threat landscapes. Our approach focuses on learning a generalized model from sparse botnet data, addressing the challenges of limited data availability. By carefully engineering features and selecting appropriate learning algorithms, we aim to achieve accurate results. The CTU-13-Dataset, capturing diverse botnet examples, is utilized for experiments. Our research contributes to intrusion detection and botnet detection by emphasizing the importance of domain knowledge in feature engineering.
Introduction
Distributed network attacks, often facilitated by botnets, remain a significant and evolving threat on the internet. They are used for denial-of-service attacks, spam, malware distribution, and data theft. Despite extensive research and defensive efforts, these attacks have become more prevalent, so detection models must generalize well to new and unseen attacks.
The paper focuses on creating generalizable machine learning models to detect botnets from sparse, real-world data, in which botnet activity is rare compared to normal traffic. It uses the CTU-13 Dataset, which contains diverse botnet traffic captures, to test various algorithms and feature engineering approaches.
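To illustrate why rare botnet traffic makes evaluation delicate, the following minimal sketch compares accuracy and recall for a trivial classifier that labels every flow as normal. The label strings, the 2-in-100 class ratio, and the helper name `accuracy_and_recall` are assumptions made up for illustration; they are not taken from the paper or from the CTU-13 Dataset.

```python
def accuracy_and_recall(y_true, y_pred, positive="botnet"):
    """Compute accuracy and recall for the positive (botnet) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Recall is undefined without positives; report 0.0 in that case.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Toy labels (made up): 2 botnet flows among 100 flows in total.
y_true = ["botnet"] * 2 + ["normal"] * 98
# A trivial classifier that always predicts the majority class.
y_pred = ["normal"] * 100

acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={acc:.2f} recall={rec:.2f}")
```

In this toy setting the majority-class predictor scores high accuracy while detecting no botnet flows, which suggests that per-class measures such as recall are more informative than accuracy when positives are sparse.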
Traditional machine learning methods like Bayesian networks, random forests, and Naive Bayes have been applied historically to botnet detection, often focusing on command-and-control traffic. However, challenges include overfitting to specific botnets and poor generalization to new data. Domain knowledge is crucial for selecting and engineering features that generalize well, as some features (e.g., source/destination IP addresses, start times) can cause overfitting.
The paper explores statistical relational learning (BoostSRL) to handle the imbalance and complexity in the data, aiming for better generalization and adaptability. Experiments show many features do not generalize well, necessitating careful feature selection guided by domain expertise.
Results indicate that removing non-generalizing features (like IP addresses and start time) improves model generality, while retaining features like destination port and flow duration. However, scalability issues with the relational learning approach remain, due to the massive size of the data and computational limits.
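A minimal sketch of this feature removal step follows, assuming flow records are held as dictionaries. The column names (`StartTime`, `SrcAddr`, `DstAddr`, `Dur`, `Dport`) are an assumption based on the bidirectional NetFlow layout commonly associated with CTU-13 files, and the record values are invented. It is not the authors' preprocessing code.

```python
# Fields the text identifies as not generalizing: IP addresses and start time.
# The exact column names are assumed, not confirmed by the paper.
NON_GENERALIZING = {"StartTime", "SrcAddr", "DstAddr"}


def drop_non_generalizing(flow):
    """Return a copy of a flow record without the non-generalizing fields."""
    return {key: value for key, value in flow.items() if key not in NON_GENERALIZING}


# Hypothetical flow record for illustration only; all values are made up.
example_flow = {
    "StartTime": "2011/08/10 09:46:53",
    "Dur": 3.5,
    "SrcAddr": "10.0.0.1",
    "DstAddr": "10.0.0.2",
    "Dport": 80,
    "Label": "Botnet",
}

features = drop_non_generalizing(example_flow)
print(sorted(features))
```

Destination port and flow duration survive the filter, matching the features the text reports as retained. Keeping the excluded names in one set makes it easy to test other candidate exclusions.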
In conclusion, combining domain knowledge and advanced learning techniques is essential for building generalized botnet detection models, but challenges remain in computational efficiency and handling sparse, imbalanced data.
Conclusion
While the experiments still may need considerable work, the authors have presented a novel approach toward detecting botnet activity on a network. As far as we know, this is the first application of statistical relational learning to this domain.