Autonomous agents built on large language models increasingly rely on persistent memory to support long-horizon reasoning, personalization, and tool-augmented decision-making. However, current agent architectures generally lack strong guarantees that deleted memories are unrecoverable under adversarial querying, paraphrastic retrieval, or indirect inference [3,4,16]. We introduce ForgetAgent, a formal framework and reference implementation for verifiable deletion in heterogeneous agent memory systems. We first develop a system model in which agent memory spans embeddings, structured records, caches, derived entities, and tool transcripts, and define a threat model encompassing adversarial users, compromised agents, and external attackers capable of semantic, paraphrastic, neighborhood-based, and narrative (story-completion) retrieval. Building on this model, we formalize deletion correctness in terms of counterfactual indistinguishability, together with utility preservation and computational efficiency, and propose a seven-layer deletion pipeline that combines dependency-graph-based identification, raw storage purging, tombstone embeddings with neighborhood reweighting, cache and derived-entity invalidation, and transcript sanitization, alongside cryptographic deletion receipts and automated red-team verification. Our red-team library instantiates membership inference, paraphrase retrieval, neighbor leakage, transcript analysis, and story-completion attacks, providing a comprehensive black-box evaluation of residual information flow. To enable reproducible assessment, we release ForgetAgentBench, a benchmark of 500 synthetic agent interactions with 1,500 labeled deletion targets spanning personal, preference, and domain-specific memories at multiple difficulty levels. In experiments on ForgetAgentBench, naive deletion achieves only 18% robustness to attack, while our full layered method reaches 94% robustness with 97% retained task utility; we note, however, that these results are limited to synthetic benchmark settings.
These results demonstrate that verifiable, multi-layer deletion is both necessary and feasible for trustworthy, privacy- and regulation-compliant LLM agents and establish a concrete foundation for future work on principled memory control in agentic systems.
Introduction
LLM-powered autonomous agents are increasingly used because they maintain persistent memory (e.g., user preferences, past interactions, tool outputs), enabling long-term reasoning and personalization. However, this memory creates a major privacy risk: when users request deletion of their data (e.g., under GDPR or CCPA), current systems cannot guarantee that the information is truly unrecoverable.
Deleted data can persist across multiple layers—raw storage, embeddings, summaries, inferred data, caches, and tool logs—and may still be reconstructed through adversarial techniques such as paraphrasing, semantic search, or inference chains. This makes agent memory deletion fundamentally harder than traditional model unlearning, which assumes static datasets and centralized control.
Key challenges include:
Distributed and dynamic memory across heterogeneous systems
Indirect reconstruction via reasoning or multi-agent communication
Lack of ground truth for verifying successful deletion
To address these challenges, we introduce ForgetAgent, a framework that:
Defines formal threat models and attack surfaces
Proposes a multi-layer deletion approach covering all memory representations
Develops adversarial testing methods to verify deletion robustness
Provides benchmarks and evaluation metrics
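The adversarial-testing idea in the list above can be sketched as a minimal paraphrase-retrieval probe. This is an illustrative stand-in rather than our implementation: token-set overlap plays the role of learned embedding similarity, the threshold is arbitrary, and all data is invented.

```python
def token_jaccard(a: str, b: str) -> float:
    """Crude stand-in for embedding similarity: token-set overlap (Jaccard)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def paraphrase_probe(memory_texts, paraphrases, threshold=0.4):
    """The attack 'succeeds' (returns True) if any paraphrase of the
    deleted fact still retrieves a sufficiently similar memory record."""
    return any(
        token_jaccard(p, m) >= threshold
        for p in paraphrases
        for m in memory_texts
    )

# Before deletion the probe recovers the fact; after purging it does not.
before = ["user prefers aisle seats on flights", "meeting notes from march"]
after = ["meeting notes from march"]
probes = ["user likes aisle seats on flights"]
print(paraphrase_probe(before, probes), paraphrase_probe(after, probes))  # True False
```

A real probe would embed both sides with the retriever under attack and reuse the agent's own similarity threshold, so that a "success" corresponds to an actual retrieval the agent would act on.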
We also formalize deletion as a security problem, introducing concepts such as:
Counterfactual indistinguishability (the post-deletion system should behave as if the data had never existed)
Dependency-aware deletion (removing all derived information)
Utility preservation (maintaining performance on unaffected tasks)
Efficiency and verifiability
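Counterfactual indistinguishability, the first concept above, admits a simple behavioral check: compare the post-deletion system against a counterfactual twin that never ingested the data. The sketch below is a toy with hypothetical responder functions; a gap of 0.0 on a probe set means the two systems are indistinguishable on those probes.

```python
from typing import Callable, List

def counterfactual_gap(
    respond_after_delete: Callable[[str], str],
    respond_never_ingested: Callable[[str], str],
    probes: List[str],
) -> float:
    """Fraction of probe queries on which the post-deletion system's answer
    differs from a system that never saw the deleted data."""
    diffs = sum(
        1 for q in probes
        if respond_after_delete(q) != respond_never_ingested(q)
    )
    return diffs / len(probes)

# Toy responders: 'clean' behaves as if the data never existed,
# 'leaky' still reveals a deleted fact on one probe.
clean = lambda q: "no record found"
leaky = lambda q: "alice@example.com" if "email" in q else "no record found"
probes = ["What is Alice's email?", "Summarize my notes."]
print(counterfactual_gap(clean, clean, probes))  # 0.0
print(counterfactual_gap(leaky, clean, probes))  # 0.5
```

In practice exact string equality is too brittle; a deployed check would compare distributions over responses or retrieval results, but the 0-gap target is the same.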
Finally, we propose a layered deletion pipeline that systematically removes data from all memory components (storage, embeddings, caches, transcripts), applies techniques such as embedding reweighting, and verifies deletion through adversarial testing.
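The core of that pipeline can be sketched in miniature. The code below is a simplified model, not the reference implementation: it covers three of the layers (raw storage, embeddings, caches) plus dependency-graph identification, and uses a zero vector as the "tombstone" embedding; all record IDs and structures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class MemoryStore:
    """Toy agent memory with a few of the layers named in the text."""
    raw: Dict[str, str] = field(default_factory=dict)                 # record id -> text
    embeddings: Dict[str, List[float]] = field(default_factory=dict)  # record id -> vector
    derived_from: Dict[str, Set[str]] = field(default_factory=dict)   # derived id -> source ids
    cache_sources: Dict[str, Set[str]] = field(default_factory=dict)  # cached query -> ids used

def delete_with_dependencies(mem: MemoryStore, target: str) -> List[str]:
    # 1. Dependency-graph identification: transitive closure of derivations.
    doomed = {target}
    changed = True
    while changed:
        changed = False
        for d, srcs in list(mem.derived_from.items()):
            if d not in doomed and srcs & doomed:
                doomed.add(d)
                changed = True
    # 2. Purge raw storage; overwrite embeddings with a zero "tombstone"
    #    so nearest-neighbor search stops surfacing the deleted record.
    for rid in doomed:
        mem.raw.pop(rid, None)
        if rid in mem.embeddings:
            mem.embeddings[rid] = [0.0] * len(mem.embeddings[rid])
        mem.derived_from.pop(rid, None)
    # 3. Invalidate any cached answer that drew on a purged record.
    mem.cache_sources = {
        q: srcs for q, srcs in mem.cache_sources.items() if not srcs & doomed
    }
    return sorted(doomed)  # ids to be covered by a deletion receipt
```

The returned ID list is what a cryptographic deletion receipt would attest to; a full pipeline would additionally reweight surviving neighbor embeddings and sanitize tool transcripts.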
Conclusion
We introduce ForgetAgent, a research framework for verifiable memory deletion in LLM-based autonomous agents. This work addresses a critical gap in agent trustworthiness: the lack of formal mechanisms to guarantee that deleted memories cannot be recovered.
A. Our Contributions Are Fourfold
1) Formal threat model that identifies attack surfaces across seven memory layers (raw text, embeddings, summaries, derived entities, tool transcripts, neighborhoods, context)
2) Multi-layer deletion architecture with novel techniques for embedding neighborhood reweighting and cryptographic deletion receipts, achieving 94% robustness with < 3% utility loss
3) Comprehensive attack library including membership inference, paraphrase retrieval, neighbor leakage, transcript analysis, and story completion—each grounded in realistic adversarial scenarios
4) ForgetAgentBench, an open benchmark with 500 synthetic interactions, multiple baseline implementations, and reproducible evaluation protocols
Our empirical results demonstrate that naive deletion is catastrophically weak (18% robustness) and that defense-in-depth is essential. In our benchmark, neighborhood effects emerge as the strongest attack surface, followed by paraphrase-based semantic reconstruction.
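The robustness percentages above come from pooling per-attack trial outcomes; a simple version of that pooling rule is sketched below. The trial data here is invented for illustration, and the benchmark additionally weights by difficulty level.

```python
from typing import Dict, List

def robustness_score(attack_results: Dict[str, List[bool]]) -> float:
    """Robustness = fraction of attack trials that FAIL to recover deleted
    information, pooled over all attack families (True = attack succeeded)."""
    trials = [ok for fam in attack_results.values() for ok in fam]
    return 1.0 - sum(trials) / len(trials)

# Hypothetical outcomes for three attack families (10 trials, 1 success).
results = {
    "membership_inference": [True, False, False, False],
    "paraphrase_retrieval": [False, False],
    "neighbor_leakage":     [False, False, False, False],
}
print(round(robustness_score(results), 2))  # 0.9
```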
This work is the first to systematically address memory deletion in agents as a distinct problem from model unlearning. By establishing formal verification protocols, practical baselines, and open benchmarks, we aim to make verifiable deletion a standard requirement for trustworthy agent deployment.
As autonomous agents become increasingly prevalent in enterprise and consumer applications, the ability to reliably delete user data is not an optional feature—it is a prerequisite for responsible AI. We hope ForgetAgent enables this critical capability and opens new research directions at the intersection of privacy, unlearning, and agent verification.
References
[1] Cao, Y., & Yang, J. (2015). Towards Making Systems Forget with Machine Unlearning. IEEE Symposium on Security and Privacy.
[2] Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C., et al. (2021). Machine Unlearning. IEEE Symposium on Security and Privacy.
[3] Carlini, N., Liu, C., Erlingsson, U., Kos, J., & Song, D. (2019). The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. USENIX Security.
[4] Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017). Membership Inference Attacks Against Machine Learning Models. IEEE Symposium on Security and Privacy.
[5] Abadi, M., Chu, A., Goodfellow, I., et al. (2016). Deep Learning with Differential Privacy. ACM CCS.
[6] Li, Y., Li, S., et al. (2024). A Closer Look at Machine Unlearning for Large Language Models. arXiv:2410.08109.
[7] Liu, K., Wang, X., et al. (2024). Rethinking Machine Unlearning for Large Language Models. Nature Machine Intelligence.
[8] Tu, Y., Hu, P., & Ma, J. (2024). Towards Reliable Empirical Machine Unlearning Evaluation: A Game-Theoretic View. arXiv:2404.11577.
[9] Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
[10] Karpukhin, V., Oguz, B., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
[11] Guu, K., Lee, K., et al. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. ICML.
[12] Zhang, Y., et al. (2024). TOFU: A Benchmark for Machine Unlearning in Large Language Models. NeurIPS Datasets and Benchmarks.
[13] Park, J., O'Brien, J., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST.
[14] Packer, C., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560.
[15] Kandpal, N., et al. (2023). Large Language Models Struggle to Learn Long-Tail Knowledge. ICML.
[16] Carlini, N., et al. (2021). Extracting Training Data from Large Language Models. USENIX Security.
[17] Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS.
[18] Meng, K., Sharma, A., et al. (2023). Mass-Editing Memory in a Transformer. ICLR.