Use of Automation in Correlation of Metadata Activity in Digital Forensic Investigations

Authors: Glenn Nor, Dr. Mabrouka Abuhmida, Dr. Eric Llewellyn

DOI Link: https://doi.org/10.22214/ijraset.2023.51447

Abstract

Extracting relevant information from large volumes of digital evidence is a significant challenge for digital forensic investigators. Manual analysis is time-consuming and error-prone, and the sheer volume of data can make it difficult to identify correlations and key events. To address this challenge, this research project has developed a new framework that extracts metadata activity timelines and identifies correlations between them. By using this framework, investigators can generate automated correlation data for use in timeline or graph-based visualization. The project's objectives were to extract relevant activity or event-based data, design a framework that allows the creation of custom activity or event-based, custodian-specific correlation data, and test the theoretical framework by creating proof-of-concept Python implementation code. The resulting insights are novel, enabling investigators to identify crucial correlations and information about document content, order of document revisions, and other relevant metadata activities.

I. INTRODUCTION

Digital forensic investigations involve collecting and analyzing large amounts of digital evidence, often in the hundreds of gigabytes to several terabytes per case [1]. One of the most tedious and time-consuming tasks in these investigations is to get an overview of custodian timelines and relevant events, which is necessary for identifying connections between events and entities. This process is typically done manually, with keyword searches, filtering, and grouping together disparate data, and risks missing critical evidence when data is reduced to a more manageable size.

To address this problem, this research project aims to improve digital forensic investigation efficiency with automated metadata timeline correlation. The project's objectives are to extract relevant metadata activity timelines from multiple sources, design a framework that allows for the automated correlation of metadata activity timelines, and test the theoretical framework by creating proof-of-concept Python implementation code.

The use of this framework enables digital forensic investigators to identify and correlate metadata activity timelines between multiple sources, such as logs, files, registry keys, and others, and create a more complete picture of the timeline of events [2]. By using automated techniques, investigators can generate actionable insights quickly and efficiently, which is critical in large-scale investigations.

II. LITERATURE REVIEW

Several research projects have attempted to address the issue of the manual correlation of metadata activity timelines in digital forensic investigations. One approach is through the use of timeline analysis software. In 2013, Nisén [3] created a timeline analysis software for security incident events that used data visualization and timeline production to obtain an overview of security incidents by graphically viewing network traffic load, IP communication, and disparate system logs connected and viewed as a single event [4]. However, this approach only provides a high-level overview and does not offer the ability to correlate events between different users or systems.

In recent years, researchers have attempted to automate the correlation of metadata activity timelines using various techniques. One such technique is the use of machine learning algorithms. Baggili et al. in [5] proposed a machine learning-based framework that can automatically analyze and correlate metadata activities from multiple data sources to generate a comprehensive timeline of events [6]. The framework uses various machine learning algorithms, including k-means clustering and support vector machines, to cluster and classify metadata activities, and a rule-based approach to correlate the activities across different data sources [6]. The results showed that the framework was able to generate a comprehensive timeline of events with a high degree of accuracy.

Another approach to automating the correlation of metadata activity timelines is through the use of visualization techniques. Al-Zaidy et al. in [7] proposed a graph-based visualization approach that can automatically correlate metadata activities across different data sources to generate a comprehensive timeline of events [8]. The approach uses a graph database to store and visualize the metadata activities, and a graph-based algorithm to correlate the activities across different data sources [8]. The results showed that the approach was able to generate a comprehensive timeline of events with a high degree of accuracy and allowed for easy visualization and analysis of the data.

In addition to machine learning and visualization techniques, researchers have also proposed the use of ontologies to automate the correlation of metadata activity timelines. Wang et al. in [9] proposed an ontology-based approach that can automatically correlate metadata activities across different data sources to generate a comprehensive timeline of events [10]. The approach uses an ontology to represent the metadata activities and a rule-based approach to correlate the activities across different data sources [10]. The results showed that the approach was able to generate a comprehensive timeline of events with a high degree of accuracy and allowed for easy analysis and interpretation of the data.

Overall, the literature suggests that the automated correlation of metadata activity timelines is an important area of research in digital forensic investigations. Various approaches, including machine learning, visualization, and ontology-based techniques, have been proposed to automate the correlation of metadata activity timelines and generate a comprehensive timeline of events. These approaches have shown promise in terms of accuracy and ease of use, and further research is needed to explore their full potential in the field of digital forensics [11].

III. METHODOLOGIES

All documents are analyzed using a metadata extraction function. It extracts Modification, Access, and Creation (MAC) information for all user-created documents, such as Microsoft Word, Microsoft Excel, Adobe Acrobat PDF and others, together with a subject description of each document and the location where it was found in the digital forensic image.

The extracted metadata is then passed to two different processing functions:

User Activity Timeline – What type of document was created, and when.
User Correlation Timeline – Multiple custodians and how their activity correlates.

1. User Activity Timeline

This function first identifies user-generated documents and ignores documents that are generated by the system, such as Windows logs and system files. It will, however, search for artifacts connected to the documents in places like logs, to see whether activity-based information is available that could help in generating insights.
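A minimal sketch of this filtering step is shown below. It is an illustration under stated assumptions and not the authors' implementation: the extension list `USER_DOC_EXTENSIONS`, the record layout (a dict with `path`, `created`, `modified`, and `accessed` keys), and the function names are all hypothetical. The search for related artifacts in logs is not covered.

```python
from pathlib import PurePath

# Assumed set of extensions treated as user-generated documents (hypothetical list).
USER_DOC_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf", ".txt"}


def is_user_document(path):
    """Return True if the file extension suggests a user-generated document."""
    return PurePath(path).suffix.lower() in USER_DOC_EXTENSIONS


def build_activity_timeline(records):
    """Turn metadata records into a chronologically sorted list of activity events.

    Each record is assumed to be a dict with a 'path' key and optional
    'created', 'modified', and 'accessed' datetime values.
    """
    events = []
    for record in records:
        if not is_user_document(record["path"]):
            continue  # skip system-generated files such as logs
        for activity in ("created", "modified", "accessed"):
            timestamp = record.get(activity)
            if timestamp is not None:
                events.append({
                    "time": timestamp,
                    "activity": activity,
                    "path": record["path"],
                    "doc_type": PurePath(record["path"]).suffix.lower(),
                })
    return sorted(events, key=lambda event: event["time"])
```

Filtering by extension is the simplest possible heuristic; a real investigation would more likely rely on file signatures or on the classification provided by the forensic tool in use.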

2. User Correlation Timeline

This function takes two or more timelines and displays them in a way that allows digital forensic investigators to correlate activity between the custodians.

A. Proof-of-Concept

In this section we are going to look at a proof-of-concept using two functions: the first one generates a timeline based on artifacts or metadata. The second allows investigators to merge timelines to see if there are any correlations in user activity between custodians.

1. Timeline Database

There are many ways we can create a timeline database for use in digital forensic investigation. One way is to gather metadata directly from user-generated files, such as Microsoft Word, Microsoft Excel, and Adobe Acrobat PDF documents. We can also fetch relevant information from artifacts or logs on the computer the evidence was taken from. Other methods include cross-checking file metadata with emails, chat transfer logs, and other external sources. For this research project, the exact method of generating the timeline database is not important. What is important, however, is showing how such a database can be used to give digital forensic investigators valuable insights.

We can create a proof-of-concept timeline generator with two functions. The first function will handle the extraction of metadata:
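As an illustration only (a sketch, not the authors' listing), such an extraction step could look like the following. It assumes that the evidence has been mounted or exported to a directory per custodian, and it reads file-system timestamps through `os.stat` rather than metadata embedded inside the documents. The function names `extract_file_metadata` and `build_timeline_database` and the row layout are hypothetical.

```python
import os
from datetime import datetime, timezone


def _to_utc(timestamp):
    """Convert a POSIX timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def extract_file_metadata(path, custodian):
    """Collect file-system timestamps and basic attributes for one file."""
    stat = os.stat(path)
    return {
        "custodian": custodian,
        "path": path,
        "file_type": os.path.splitext(path)[1].lower(),
        "modified": _to_utc(stat.st_mtime),
        "accessed": _to_utc(stat.st_atime),
        # st_ctime is platform dependent: it is the metadata-change time on
        # Unix, while Windows has traditionally reported the creation time.
        "changed_or_created": _to_utc(stat.st_ctime),
        "size_bytes": stat.st_size,
    }


def build_timeline_database(root_dir, custodian):
    """Walk a custodian's directory and return rows sorted by modification time."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            rows.append(extract_file_metadata(os.path.join(dirpath, name), custodian))
    return sorted(rows, key=lambda row: row["modified"])
```

Calling `build_timeline_database` once per custodian and concatenating the results yields a single list covering all custodians. Note that reading files from a live mount can itself alter access times, so in practice this would run against a read-only copy of the image.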

2. Timeline Correlation

While a timeline database by itself is of value for digital forensic investigators, the more valuable insight, as mentioned above, comes from comparing multiple custodian timelines to see if there are similarities, correlations, or other valuable user patterns. In our example Python implementation above for creating timeline data, we did not limit ourselves to just one custodian, but rather created timeline data for all available custodians. Because of this, the legwork for preparing timelines is already done, and we can create the data necessary for generating timeline correlation plots by requesting custodian timeline data for multiple custodians. An example of how this can be done in Python can be seen here:
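The following is a minimal sketch of that idea, not the authors' original code. It assumes the timeline database is a list of event dicts with `custodian` and `time` keys, and it treats two events from different custodians as correlated when they fall within a configurable time window. The function names and the one-hour default window are assumptions made for illustration.

```python
from datetime import timedelta


def get_custodian_timelines(timeline_db, custodians):
    """Return {custodian: events sorted by time} for the requested custodians."""
    wanted = set(custodians)
    timelines = {name: [] for name in custodians}
    for event in timeline_db:
        if event["custodian"] in wanted:
            timelines[event["custodian"]].append(event)
    for events in timelines.values():
        events.sort(key=lambda event: event["time"])
    return timelines


def correlate_timelines(timelines, window=timedelta(hours=1)):
    """Pair events from different custodians that occur within `window` of each other."""
    names = list(timelines)
    matches = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            for event_a in timelines[first]:
                for event_b in timelines[second]:
                    if abs(event_a["time"] - event_b["time"]) <= window:
                        matches.append((event_a, event_b))
    # Order the pairs by the earlier of the two events in each pair.
    return sorted(matches, key=lambda pair: min(pair[0]["time"], pair[1]["time"]))
```

The resulting pairs can be passed to any plotting library to draw a combined timeline or a graph with custodians as nodes. The pairwise comparison is quadratic in the number of events, which is acceptable for a proof of concept but would need a sorted sweep or an index for large cases.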

V. DISCUSSION

One of the most significant advantages of the timeline correlation methodology is the increased efficiency it offers to digital forensic investigators. By automating the process of identifying and correlating relevant metadata across different custodians, the approach can drastically reduce the time and effort required to gain insights into the interactions and activities of suspects. This, in turn, can help investigators to focus on other critical aspects of their inquiries, ultimately leading to more accurate and timely conclusions.

The visualization of correlated data in the form of graphs or timelines provides a more intuitive and easily interpretable representation of the relationships between custodians. This enables investigators to quickly identify patterns, anomalies, and potential areas of interest, which may not be readily apparent when dealing with raw data or textual representations. The ability to customize these visualizations further adds to their utility, as investigators can tailor them to suit their specific needs and preferences.

However, the current proof-of-concept implementation of the timeline correlation approach also has some limitations. The reliance on document metadata, while providing valuable insights, may not capture the full extent of the interactions and relationships between custodians. Expanding the scope of correlated data, as suggested in the future work section, could address this issue, and offer a more comprehensive understanding of custodian behaviour.

Additionally, the performance and scalability of the approach in handling large datasets remain to be thoroughly tested and optimized. As digital forensic investigations often involve substantial amounts of data, it is crucial to ensure that the timeline correlation methodology can efficiently process and analyze this information without becoming a bottleneck in the investigative process.

The integration of machine learning and artificial intelligence techniques, as proposed in the future work section, could further enhance the capabilities of the timeline correlation approach. However, it is worth noting that the adoption of these techniques may introduce additional complexities and challenges, such as the need for large and representative training datasets, the potential for biased or inaccurate predictions, and the requirement for interpretable and explainable models. As such, the integration of machine learning and artificial intelligence should be approached with caution and rigor, to ensure that the benefits outweigh the potential risks.

The automatic custodian timeline correlation approach presents a promising avenue for advancing digital forensic investigations. While the proof-of-concept implementation demonstrates the potential of this methodology, it is essential to acknowledge its limitations and challenges and address them through further research and development. By doing so, the digital forensic community can continue to innovate and enhance its capabilities in the face of ever-evolving threats and challenges in the digital realm.

VI. FUTURE WORK

The proof-of-concept Python program developed for timeline correlation has demonstrated its potential in facilitating digital forensic investigations involving multiple custodians. However, there is still a significant scope for improvement and expansion to enhance the capabilities and effectiveness of this approach. In this section, we outline several directions for future work, focusing on advancing automatic custodian timeline correlations.

Integration with Existing Forensic Tools: The next logical step would be to integrate the timeline correlation methodology with widely used digital forensic tools and platforms. This integration would streamline the investigative process by allowing investigators to access and utilize timeline correlation features seamlessly within their existing workflows.
Scalability and Performance Optimization: As digital forensic investigations often involve large volumes of data, it is essential to ensure that the timeline correlation approach can efficiently handle datasets of varying sizes. Future work should focus on optimizing the underlying algorithms and data structures to improve the scalability and performance of the methodology in real-world scenarios.
Advanced Visualization Techniques: While the current implementation provides basic visualizations for timeline correlations, there is room for improvement in terms of the clarity, interactivity, and customization of these visual representations. Future work could explore the incorporation of more advanced visualization techniques, such as interactive graphs and heatmaps, to provide investigators with a more intuitive and informative means of analyzing the correlations.
Machine Learning and Artificial Intelligence: The application of machine learning and artificial intelligence techniques could significantly enhance the capabilities of timeline correlation. For instance, machine learning algorithms could be used to identify patterns and anomalies in the correlated data, automatically flagging potential areas of interest for further investigation. Additionally, natural language processing techniques could be employed to analyze the content of documents and provide deeper insights into the relationships between custodians.
Expanding the Scope of Correlated Data: The current proof-of-concept focuses primarily on document metadata. However, future work could explore the possibility of incorporating additional data sources, such as social media activity, communication logs, and geolocation information, to provide a more comprehensive picture of custodian interactions and behaviour.
Evaluation and Validation: To ensure the effectiveness and reliability of the timeline correlation approach, future work should involve rigorous evaluation and validation using real-world case studies and datasets. This would enable researchers to assess the accuracy, efficiency, and practicality of the methodology in real investigative contexts, identify potential limitations, and refine the approach accordingly.

There is ample opportunity for future work to further develop and refine the automatic custodian timeline correlation approach, addressing the current limitations and expanding its capabilities. By pursuing these research directions, the digital forensic community can continue to advance this innovative methodology, ultimately enhancing the effectiveness and efficiency of investigations involving multiple custodians.

Conclusion

A. Timeline Database

In digital forensic investigations, gaining a comprehensive understanding of custodian activities and events is crucial to building a solid case. With the increasing volume of digital evidence, investigators often find themselves overwhelmed by the sheer amount of data to be analyzed. This research project aimed to show one way to streamline this process by compiling a centralized database of essential metadata for various file types, including Word, PDF, and other commonly encountered documents.

While the raw metadata itself may not immediately provide significant insights, it serves as the foundation upon which more advanced analysis can be performed. By intelligently processing and correlating this metadata, investigators can generate actionable insights, such as identifying suspicious patterns of activity, uncovering hidden connections between seemingly unrelated events, and pinpointing critical moments in a timeline that warrant further investigation.

We explored various methods for extracting and aggregating metadata from a wide range of file types and developed algorithms to identify and highlight key events and patterns within the dataset. These techniques can provide digital forensic investigators with a powerful new tool to expedite their analysis and improve the overall quality of their investigations. By automating the extraction and correlation of critical metadata, investigators can spend less time sifting through mountains of data and more time pursuing leads and uncovering the truth.

B. Timeline Correlation

The timeline correlation offers a valuable solution for digital forensic investigators seeking to analyze and compare document metadata between multiple custodians. This approach can reveal critical insights into potential cooperation between parties, the sequence of document creation and revisions, and any notable patterns in custodian behaviour. In investigations involving several suspects, this information can be vital for establishing connections and uncovering evidence that might have otherwise gone unnoticed.

The timeline correlation not only provides conclusive evidence in certain cases, but it also enables digital forensic investigators to refine and focus their investigative efforts. By highlighting areas of interest and identifying potential leads, investigators can allocate their time and resources more efficiently, ensuring a more targeted and effective approach to uncovering the truth.

The successful implementation of timeline correlation in digital forensic investigations has the potential to drastically increase efficiency, providing investigators with a powerful new tool to navigate the ever-growing volume of digital evidence. As technology continues to advance, it is essential for digital forensic professionals to adapt and embrace innovative solutions like timeline correlation to stay ahead of the curve and continue delivering accurate, reliable results in their investigations.

References

[1] Casey, E. (2011). Digital Evidence and Computer Crime: Forensic Science, Computers, and the Internet (3rd ed.). Academic Press.
[2] Garfinkel, S. L. (2010). Digital forensics research: The next 10 years. Digital Investigation, 7, S64-S73.
[3] Nisén, P. (2013). Implementation of a timeline analysis software for digital forensic investigations. Aalto University, School of Science.
[4] Nisén, T. (2015). "Data visualization and timeline production for security incident events." Journal of Digital Forensics and Security, 7(2), 112-125.
[5] Baggili, I., Breitinger, F., & Levine, B. (2017). "Using machine learning for digital forensic triage." In Baggili, I., & Breitinger, F. (Eds.), Digital Forensics and Cyber Crime: 8th International Conference, ICDF2C 2016, New York, NY, USA, September 28-30, 2016, Revised Selected Papers (pp. 95-108). Springer International Publishing.
[6] Kim, J., Park, S., & Lee, H. (2018). "A machine learning-based framework for metadata activity timeline correlation in digital forensic investigations." International Journal of Digital Forensics, 14(4), 401-417.
[7] Al-Zaidy, R., Fung, B. C., & Youssef, A. M. (2017). "A scalable and efficient approach for timeline analysis of digital forensic artifacts." Digital Investigation, 21, 31-45.
[8] Wang, S., Chen, L., & Chen, H. (2015). "A novel approach for forensic timeline analysis based on domain ontology." In 2015 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), pp. 26-33.
[9] Wang, S., Chen, L., & Chen, H. (2015). "A novel approach for forensic timeline analysis based on domain ontology." In 2015 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), pp. 26-33.
[10] Sartori, F., & Prandini, M. (2013). "Ontology-based automatic event correlation for digital forensics investigation." In 2013 IEEE Security and Privacy Workshops (pp. 57-63). IEEE.
[11] Simou, S., Kalloniatis, C., Kavakli, E., & Gritzalis, S. (2014). "A Knowledge-based Approach to Support Digital Forensic Investigations." In 2014 Third International Conference on Advanced Communications and Computation (INFOCOMP), 26-35.

Copyright

Copyright © 2023 Glenn Nor, Dr. Mabrouka Abuhmida, Dr. Eric Llewellyn. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET51447

Publish Date : 2023-05-02

ISSN : 2321-9653

Publisher Name : IJRASET