Android Dynamic Malware Analysis

Authors: Anunay Anand, Md. Shahroz, Kishan Dixit, Nikhil Ranjan

DOI Link: https://doi.org/10.22214/ijraset.2025.72665

Abstract

In today’s mobile computing landscape, Android-based systems are highly prevalent and frequently targeted by malicious applications that exhibit anomalous behavior. Detecting such anomalies in real time is critical for ensuring system stability, user data privacy, and overall device security. This review paper explores the implementation and evaluation of unsupervised machine learning techniques for dynamic malware detection in Android applications. The focus is on models such as Isolation Forest, One-Class SVM, Local Outlier Factor, and Elliptic Envelope, which learn from normal process behavior to identify deviations without requiring labeled data. Among these, Isolation Forest demonstrates superior accuracy and efficiency, achieving up to 99% accuracy in detecting anomalous activity based on real-time process metrics like CPU usage, memory consumption, and disk operations. The system is designed to be lightweight, privacy-preserving, and suitable for deployment on individual devices without the need for external infrastructure. This paper also discusses the limitations of existing methods, presents a comparative analysis of model performance, and outlines potential future enhancements including deep learning integration, hybrid detection strategies, and cloud-based intelligence sharing. The findings support the feasibility and effectiveness of machine learning-driven anomaly detection as a proactive defense mechanism in modern Android environments.

Introduction

With the rise of smartphones, Android has become the most widely used mobile OS globally. Its open-source nature and flexibility have made it a hub for innovation but also a major target for cyber threats. Android’s broad application scope (e.g., social, financial, enterprise) and practices like sideloading APKs or rooting devices increase its vulnerability to hidden threats.

Challenges:
Traditional malware detection methods—like static analysis and signature-based detection—struggle against modern threats, especially polymorphic or zero-day malware. These methods can’t effectively detect threats that execute dynamically or use obfuscation techniques.

Proposed Solution:
The research presents a real-time anomaly detection framework using unsupervised machine learning to detect Android application anomalies based on behavior rather than known signatures.

System Architecture:

Data Collection:
- System-level metrics (CPU, memory, disk I/O, thread count) collected in real time using tools like ADB and psutil.
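A minimal sketch of the collection step, assuming psutil runs somewhere the target processes are visible (for example on the device itself or in an emulator shell). The field list, the `to_row` helper, and the row schema are illustrative assumptions, not the authors' implementation; disk I/O counters are left out because psutil does not expose them on every platform.

```python
import time
from typing import Dict, List

# Attribute names accepted by psutil.process_iter(attrs=...).
ATTRS = ["pid", "name", "cpu_percent", "memory_percent", "num_threads"]


def to_row(info: Dict, timestamp: float) -> Dict:
    """Flatten one process-info dict into a fixed-schema row.

    Values psutil could not read arrive as None and are replaced by 0 / "".
    """
    return {
        "ts": timestamp,
        "pid": info.get("pid"),
        "name": info.get("name") or "",
        "cpu_percent": float(info.get("cpu_percent") or 0.0),
        "memory_percent": float(info.get("memory_percent") or 0.0),
        "num_threads": int(info.get("num_threads") or 0),
    }


def snapshot() -> List[Dict]:
    """Collect one row per visible process (requires the third-party psutil)."""
    import psutil  # imported lazily so to_row() can be used without psutil

    now = time.time()
    return [to_row(proc.info, now) for proc in psutil.process_iter(attrs=ATTRS)]
```

Calling `snapshot()` on a fixed interval (say once per second) and appending the rows to a log would yield the kind of per-process time series the later stages consume.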
Preprocessing:
- Cleans data, handles missing values, performs normalization (e.g., Min-Max scaling), and engineers features such as rolling averages or usage rates per process.
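The preprocessing steps named above can be sketched in a few lines of plain Python. The forward-fill policy for missing values and the trailing window for the rolling average are assumptions chosen for illustration; the paper does not specify either.

```python
from collections import deque
from typing import Iterable, List, Optional


def fill_missing(values: Iterable[Optional[float]]) -> List[float]:
    """Forward-fill None entries; leading Nones become 0.0."""
    out, last = [], 0.0
    for value in values:
        if value is not None:
            last = float(value)
        out.append(last)
    return out


def min_max_scale(values: List[float]) -> List[float]:
    """Map values linearly onto [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rolling_mean(values: List[float], window: int) -> List[float]:
    """Trailing mean over the last `window` samples (shorter at the start)."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out


cpu = [10.0, None, 30.0, 50.0]           # raw CPU readings with one gap
scaled = min_max_scale(fill_missing(cpu))
smoothed = rolling_mean(scaled, window=2)
```

In a deployed pipeline the minimum and maximum would normally be learned once from the training data and reused at detection time, so that new samples are scaled the same way as the data the model was fitted on.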
Model Training:
- Uses unsupervised models trained only on “normal” behavior:
  - Isolation Forest
  - One-Class SVM
  - Local Outlier Factor (LOF)
  - Elliptic Envelope
- Anomalous behaviors are simulated (e.g., CPU spikes, rogue processes) for evaluation.
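One way to simulate anomalies for evaluation is to overwrite a few points of a normal trace with extreme readings and keep the ground-truth labels. The sketch below does this for CPU spikes; the Gaussian baseline around 20% and the 85 to 100% spike range are invented values for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def normal_trace(n: int, rng: random.Random) -> List[float]:
    """Synthetic 'normal' CPU usage in percent, clipped to [0, 100]."""
    return [min(100.0, max(0.0, rng.gauss(20.0, 5.0))) for _ in range(n)]


def inject_spikes(
    trace: Sequence[float],
    n_spikes: int,
    rng: random.Random,
    low: float = 85.0,
    high: float = 100.0,
) -> Tuple[List[float], List[int]]:
    """Overwrite n_spikes random positions with high readings.

    Returns the modified trace and 0/1 labels (1 marks an injected spike).
    """
    data = list(trace)
    labels = [0] * len(data)
    for i in rng.sample(range(len(data)), n_spikes):
        data[i] = rng.uniform(low, high)
        labels[i] = 1
    return data, labels
```

The models would be fitted on the untouched trace only, and the labelled, spiked copy would be held back for computing the evaluation metrics.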
Real-Time Detection:
- New system activity is fed through the same pipeline, evaluated in real time, and anomalies are flagged based on model output.
- Ensemble approaches are supported for improved robustness.
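The paper does not say how the ensemble combines its members; a simple majority vote is one plausible reading, sketched below. The three threshold rules are toy stand-ins for trained models, and a sample is assumed to be a (cpu, memory) pair already scaled to [0, 1].

```python
from typing import Callable, List, Sequence

# A detector returns True when it considers the sample anomalous.
Detector = Callable[[Sequence[float]], bool]


def majority_vote(detectors: List[Detector], sample: Sequence[float]) -> bool:
    """Flag the sample when more than half of the detectors flag it."""
    votes = sum(1 for detect in detectors if detect(sample))
    return votes * 2 > len(detectors)


# Toy stand-ins for trained models (hypothetical rules, for illustration only).
def cpu_high(sample: Sequence[float]) -> bool:
    return sample[0] > 0.9


def memory_high(sample: Sequence[float]) -> bool:
    return sample[1] > 0.9


def combined_load_high(sample: Sequence[float]) -> bool:
    return sample[0] + sample[1] > 1.5


DETECTORS = [cpu_high, memory_high, combined_load_high]
```

With real models, each detector would wrap a fitted estimator and translate its output into a boolean; scikit-learn's outlier detectors, for instance, label outliers as -1.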
Evaluation:
- Uses accuracy, precision, recall, F1-score, confusion matrix, and classification reports.
- Isolation Forest performed best, reaching up to 99% accuracy with minimal false positives.
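For reference, the listed metrics all follow from the four confusion-matrix counts. The sketch below assumes binary labels with 1 meaning anomalous and 0 meaning normal; the function names are illustrative.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count outcomes, treating label 1 as anomalous and 0 as normal."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1; undefined ratios are reported as 0.0."""
    c = confusion_counts(y_true, y_pred)
    total = sum(c.values())
    flagged = c["tp"] + c["fp"]
    actual = c["tp"] + c["fn"]
    precision = c["tp"] / flagged if flagged else 0.0
    recall = c["tp"] / actual if actual else 0.0
    denom = precision + recall
    return {
        "accuracy": (c["tp"] + c["tn"]) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }
```

Because anomalies are rare, accuracy alone can be misleading: a detector that never raises an alert on data with 1% anomalies still scores 99% accuracy while its recall is zero, which is why precision, recall, and F1-score are reported alongside it.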
User Interface (GUI):
- Provides real-time dashboards for monitoring, anomaly tracking, and visualization.
- Built using web technologies like Flask/Django (backend) and React/Vue.js (frontend).
Privacy & Deployment:
- Runs locally or within a private network, so user data stays on the device or inside that network.
- Modular design allows extensions (e.g., permission tracking, antivirus integration).

Literature Survey Highlights:

- Shift from rule-based to machine learning-based methods.
- Focus on log analysis, Sysmon events, lateral movement detection, and big data platforms (Spark, Hadoop).
- Use of deep learning, graph-based, and semi-supervised models for handling diverse logs.
- Parsing and structuring unstructured logs improve interpretability and detection.
- Recent works emphasize the scalability, real-time processing, and cross-domain applicability of anomaly detection in logs.

Conclusion

In conclusion, this research presents a comprehensive, robust, and forward-looking approach to real-time anomaly detection in Android applications by leveraging advanced machine learning models, with a particular focus on unsupervised learning techniques. The critical importance of anomaly detection in the mobile computing domain cannot be overstated, as Android applications operate within dynamic and often unpredictable environments that are prone to various malicious behaviors and system faults. Unlike traditional supervised methods that require extensive labeled datasets, which are difficult, time-consuming, and expensive to acquire for malware detection, this research emphasizes unsupervised models such as Isolation Forest, One-Class Support Vector Machine (SVM), Local Outlier Factor (LOF), and Elliptic Envelope, all of which offer practical advantages in detecting previously unseen threats without relying on pre-labeled attack examples.

Among these evaluated models, Isolation Forest consistently emerged as the most effective and reliable anomaly detection technique. It outperformed other methods in key performance metrics including accuracy, precision, recall, and F1-score, highlighting its superior ability to distinguish between normal and anomalous process behaviors on Android systems. This effectiveness can be attributed to Isolation Forest’s unique operational principle, which isolates anomalies by constructing random decision trees that partition the data, leveraging the insight that anomalies are ‘few and different’ and thus require fewer splits to isolate. Its ensemble approach inherently provides robustness against noisy and high-dimensional data, making it exceptionally well-suited for the complex and multi-faceted feature space generated by system process monitoring.

One-Class SVM demonstrated strong performance, particularly in recall, which is crucial for environments where missing any anomaly could result in severe security breaches. Although it exhibited a slightly higher false positive rate than Isolation Forest, its theoretical grounding in margin maximization and capacity to model complex decision boundaries make it an invaluable component of the overall detection framework. LOF, meanwhile, showed moderate success in identifying localized anomalies by analyzing the density deviation of each point relative to its neighbors. However, it struggled in scenarios involving globally distributed anomalies or sudden spikes in system metrics, which are common in real-world Android environments. Its dependence on distance-based calculations also poses scalability challenges in large-scale streaming data. Elliptic Envelope, relying on Gaussian distribution assumptions, exhibited the lowest performance, which aligns with its limitations in handling real-world, irregular, and bursty process data.

The real strength of this research lies not only in the performance metrics but also in the practical applicability of the proposed system architecture. The system continuously monitors key system metrics such as CPU usage, memory consumption, and I/O operations of active processes in real time, creating a rich dataset that reflects the operational state of the device. By incorporating preprocessing steps including data normalization, feature extraction, and noise reduction, the model efficiently handles streaming data and improves anomaly discernment. These preprocessing techniques transform raw telemetry into meaningful features that highlight deviations from normal behavior, reducing false positives and enhancing detection sensitivity. This process ensures that the system operates effectively under the constraints of limited device resources, providing timely alerts without degrading overall device performance or user experience.

Scalability and modularity are other defining characteristics of the architecture. The system is designed with a modular framework, wherein components such as data collection, preprocessing, anomaly detection, alert generation, and logging operate independently yet cohesively. This modularity facilitates straightforward system maintenance, upgrades, and integration with other cybersecurity tools and enterprise management systems. The architecture can be deployed across individual user devices or scaled to enterprise environments managing thousands of endpoints, making it highly versatile. Its flexible design enables extension to additional system metrics, incorporation of new detection models, or integration of feedback mechanisms for continuous model refinement.

Despite these advancements, the system faces several ongoing challenges that present avenues for future research and development. Model adaptability remains critical as the behavioral patterns of Android applications and system processes evolve due to software updates, user behavior changes, and emerging malware techniques. Although the proposed architecture includes incremental learning capabilities to address concept drift, the optimal strategies for continuous model retraining, balancing stability and plasticity, require further exploration. Data imbalance is another prevalent challenge: anomalies are rare events, which can bias models towards normal class dominance. While unsupervised models mitigate the need for labeled anomalies, ensuring robust anomaly representation without overfitting to noise remains complex. Interpretability of anomaly detection results is also vital for practical deployment; users and administrators must understand the rationale behind flagged anomalies to effectively respond and mitigate threats. Techniques such as explainable AI (XAI) offer promising directions to enhance model transparency.

Looking ahead, this research lays a solid foundation for future innovations in real-time Android security solutions. Hybrid detection methodologies that combine unsupervised learning with supervised or semi-supervised techniques could further boost detection accuracy by leveraging labeled threat intelligence alongside continuous behavior modeling. Advances in online learning algorithms may allow even more seamless adaptation to new threat patterns with minimal human intervention. Integration with broader cybersecurity ecosystems, including endpoint detection and response (EDR) platforms, network intrusion detection systems, and automated incident response workflows, will increase the system’s operational impact. Moreover, leveraging cloud-based analytics and federated learning could enable cross-device collaboration, enhancing detection of coordinated or distributed attacks while preserving user privacy.

In summary, this research demonstrates a practical, scalable, and highly effective approach to real-time anomaly detection in Android applications using state-of-the-art unsupervised machine learning models. By combining efficient data preprocessing, modular architecture design, and robust anomaly detection techniques, the system delivers timely and accurate identification of malicious activities without relying on extensive labeled datasets. It addresses the unique challenges of Android environments including resource constraints, dynamic behavior, and evolving threat landscapes, positioning itself as a valuable tool in the arsenal of mobile security. With continuous enhancements in model training paradigms, interpretability, and integration capabilities, this approach holds immense potential for transforming the detection and mitigation of anomalies in modern digital infrastructures. Ultimately, it contributes to creating more secure, resilient, and stable mobile computing environments that can keep pace with the rapidly evolving cybersecurity landscape.
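The isolation principle described in this conclusion (anomalies are few and different, so random splits isolate them sooner) can be illustrated with a compact pure-Python sketch. It follows the scoring rule from the original Isolation Forest formulation, score = 2^(-E[h(x)] / c(n)), but simplifies tree construction by tracing only the path of the queried point; the function names, tree count, and subsample size are assumptions for illustration, and a production system would more likely rely on a library implementation.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]
EULER_GAMMA = 0.5772156649


def average_path_length(n: int) -> float:
    """c(n): average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def path_length(x: Point, sample: List[Point], rng: random.Random, limit: int) -> float:
    """Count random axis-aligned splits until x is alone or the depth limit is hit."""
    node, depth = list(sample), 0
    while depth < limit and len(node) > 1:
        dim = rng.randrange(len(x))
        lo = min(p[dim] for p in node)
        hi = max(p[dim] for p in node)
        if lo == hi:  # simplification: stop if the chosen feature cannot split
            break
        split = rng.uniform(lo, hi)
        # Keep only the points that fall on the same side of the split as x.
        node = [p for p in node if (p[dim] < split) == (x[dim] < split)]
        depth += 1
    # Points left unsplit contribute their expected remaining depth.
    return depth + average_path_length(len(node))


def anomaly_score(x: Point, data: List[Point], n_trees: int = 100,
                  sample_size: int = 64, seed: int = 0) -> float:
    """Score in (0, 1]; values closer to 1 mean x was isolated with fewer splits.

    Assumes `data` holds at least two points.
    """
    rng = random.Random(seed)
    n = min(sample_size, len(data))
    limit = math.ceil(math.log2(n))
    total = sum(path_length(x, rng.sample(data, n), rng, limit) for _ in range(n_trees))
    return 2.0 ** (-(total / n_trees) / average_path_length(n))
```

Scoring a point far from a cluster of normal (cpu, memory) samples should give a noticeably higher value than scoring a point at the cluster's center, which is the behavior the conclusion attributes to Isolation Forest.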

Copyright

Copyright © 2025 Anunay Anand, Md. Shahroz, Kishan Dixit, Nikhil Ranjan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET72665

Publish Date : 2025-06-19

ISSN : 2321-9653

Publisher Name : IJRASET
