Data preparation and data cleaning are typically the most labor-intensive tasks in data analytics, accounting for approximately 50 to 80 percent of the total time spent on real-world analytic initiatives [1]. Most existing data cleaning tools use static, rule-based approaches that neither accommodate the unique needs of a specific domain nor provide clear visibility into how they arrive at their cleaned outputs [1]. This study introduces Cleansera, a context-aware, AI-based data cleaning solution that emphasizes algorithmic characterization and flowchart-driven execution. Cleansera offers automated context detection, Retrieval-Augmented Generation (RAG), deterministic data cleaning workflows, version-controlled user interaction, and dual-checkpoint quality assurance [1][2]. This article focuses on the algorithms and validated workflows designed during the partial implementation of the Cleansera system. Through defined algorithms, execution paths, and verification checkpoints, Cleansera offers transparency, auditability, and repeatability for automated data cleaning [1]. By combining AI-driven flexibility with the principles of traditional algorithms, Cleansera provides a data cleaning methodology that may be adopted in both industry and academia [1][2].
Introduction
Cleansera is a context-aware, deterministic data cleaning system designed to automate preprocessing while maintaining transparency, auditability, and reproducibility. Traditional data cleaning tools rely heavily on fixed rules and manual effort, making them inefficient for large or domain-diverse datasets. Cleansera addresses this by explicitly modeling industry-specific semantics and business rules, using deterministic algorithms and flowcharts instead of opaque "black-box" methods.
Key Features and Objectives:
Context-Sensitive: Detects dataset semantics and applies relevant cleaning rules.
Algorithmic Transparency: All operations follow explicit, verifiable algorithms.
Flowchart-Driven Execution: Visual representation facilitates understanding and validation.
System Modules:
Dataset Ingestion & Profiling – Examines datasets, identifies data types, missing values, and duplicates.
Context Detection Engine – Determines the semantic meaning of each column (e.g., phone numbers vs. product IDs).
Context-Aware Cleaning Engine – Applies tailored cleaning rules for missing values, duplicates, outliers, and format standardization.
Master Field Identification – Detects unique identifiers requiring special handling.
Data Loss Detection & Validation – Monitors changes, prevents critical data loss, and generates detailed reports.
Version Control & Audit Module – Maintains complete logs for reproducibility, rollback, and compliance.
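The Version Control & Audit Module described above can be pictured as a versioned log of dataset snapshots. The sketch below is illustrative only: the class name, in-memory storage, and SHA-256 content digests are assumptions, not the authors' implementation.

```python
import copy
import hashlib
import json
from datetime import datetime, timezone


class AuditLog:
    """Sketch of a version-control and audit module: each cleaning step is
    recorded with a timestamp and a content digest, so any prior state can
    be inspected or rolled back."""

    def __init__(self, initial_state):
        self.versions = []  # list of (metadata, snapshot) pairs
        self.record("initial", initial_state)

    @staticmethod
    def _digest(state) -> str:
        # Content hash makes tampering or silent divergence detectable.
        payload = json.dumps(state, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

    def record(self, operation: str, state) -> None:
        meta = {
            "version": len(self.versions),
            "operation": operation,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "digest": self._digest(state),
        }
        self.versions.append((meta, copy.deepcopy(state)))

    def rollback(self, version: int):
        """Return a copy of the snapshot stored at the given version."""
        return copy.deepcopy(self.versions[version][1])
```

A real deployment would persist these snapshots (or deltas) to disk for compliance; the in-memory list keeps the idea visible.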
Core Algorithms:
Authentication uses bcrypt hashing and token-based sessions.
Rate limiting leverages Redis to control request flow.
Dataset profiling calculates column-level statistics to guide cleaning.
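The column-level statistics mentioned above might look like the following sketch. The exact statistics Cleansera computes are not specified here, so the chosen metrics (missing ratio, distinct count, inferred type, mean/std) are assumptions.

```python
import math


def profile_column(values):
    """Sketch of column-level profiling used to guide cleaning:
    missing ratio, distinct count, inferred type, and numeric summary."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    stats = {
        "missing_ratio": round(1 - len(non_null) / len(values), 3) if values else 0.0,
        "distinct": len(set(non_null)),
        "inferred_type": "numeric" if non_null and len(numeric) == len(non_null) else "text",
    }
    if stats["inferred_type"] == "numeric":
        # Mean and population standard deviation guide outlier handling later.
        mean = sum(numeric) / len(numeric)
        stats["mean"] = round(mean, 3)
        stats["std"] = round(math.sqrt(sum((x - mean) ** 2 for x in numeric) / len(numeric)), 3)
    return stats
```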
Context detection combines semantic and statistical analysis with confidence scores.
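One way to combine semantic evidence with a statistical confidence score, as described above, is to match a column sample against labeled patterns and report the best match rate. The patterns and the confidence definition below are illustrative assumptions, not Cleansera's actual rule set.

```python
import re

# Illustrative semantic patterns; a real system would have many more.
PATTERNS = {
    "phone_number": re.compile(r"^\+?\d[\d\s\-]{7,14}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def detect_context(values):
    """Return (best_label, confidence) for a column sample, where
    confidence is the fraction of non-null values matching the pattern."""
    sample = [str(v) for v in values if v is not None]
    if not sample:
        return ("unknown", 0.0)
    best_label, best_conf = "unknown", 0.0
    for label, pattern in PATTERNS.items():
        conf = sum(bool(pattern.match(v)) for v in sample) / len(sample)
        if conf > best_conf:
            best_label, best_conf = label, conf
    return (best_label, round(best_conf, 3))
```

A confidence threshold (e.g., requiring conf above 0.8 before applying context-specific rules) would keep low-evidence guesses from triggering the wrong cleaning logic.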
Cleaning engine performs schema validation, duplicate removal, missing value treatment, format standardization, outlier handling, and semantic checks—all logged for auditability.
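Two of the cleaning steps listed above, duplicate removal keyed on master fields and missing-value treatment, can be sketched with an audit trail as follows. The function signature, the mode-imputation strategy, and the log format are assumptions for illustration.

```python
from collections import Counter


def clean(rows, key_fields, audit_log):
    """Sketch of the cleaning engine: deduplicate on master fields, then
    impute missing values, appending an audit entry for each step."""
    # 1. Duplicate removal keyed on the master (identifier) fields.
    seen, deduped = set(), []
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            deduped.append(dict(row))
    audit_log.append({"step": "dedupe", "removed": len(rows) - len(deduped)})

    # 2. Missing-value treatment: fill with the column's most common value.
    columns = {c for row in deduped for c in row}
    for col in sorted(columns):
        present = [r[col] for r in deduped if r.get(col) is not None]
        if not present:
            continue
        fill = Counter(present).most_common(1)[0][0]
        for r in deduped:
            if r.get(col) is None:
                r[col] = fill
    audit_log.append({"step": "impute", "columns": sorted(columns)})
    return deduped
```

Each step writes to the audit log before the next runs, which is what makes the overall workflow replayable and inspectable.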
Summary: Cleansera combines AI-assisted automation with explicit deterministic logic, providing a transparent, auditable, and domain-aware data cleaning platform suitable for complex, regulated industry datasets.
Conclusion
The goal of this paper was to provide a more comprehensive and expanded view of Cleansera's algorithmic foundation. Through its focus on transparency, flowchart-based execution, and deterministic quality assurance, Cleansera is designed to overcome some of the major challenges associated with currently available data cleaning systems [1]. Future work will focus on developing a full operational implementation and on evaluating the system's performance and empirical efficacy across a variety of industry datasets.
References
[1] S. Deshmukh, O. Bhise, S. Rao, and V. Daware, "Cleansera: An intelligent desktop application for domain-specific data cleaning," International Research Journal of Engineering and Technology (IRJET), vol. 12, no. 11, 2025.
[2] P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Proc. NeurIPS, 2020.
[3] Y. Gao et al., "Retrieval-augmented generation for large language models: A survey," arXiv:2312.10997, 2023.
[4] L. Li et al., "AutoDCWorkflow: LLM-based data cleaning workflow auto-generation and benchmark," arXiv, 2025.
[5] M. Naeem et al., "RetClean: Retrieval-based data cleaning using LLMs and data lakes," arXiv, 2024.
[6] E. Meguellati et al., "Are LLMs good data preprocessors?" arXiv, 2025.
[7] S. Zhang, Z. Huang, and E. Wu, "Data cleaning using large language models," arXiv, 2024.
[8] L. Biester et al., "LLMClean: Context-aware tabular data cleaning via LLM generated OFDs," in Proc. VLDB, 2024.
[9] F. Ahmadi, Y. Mandirali, and Z. Abedjan, "Accelerating the data cleaning systems Raha and Baran through task and data parallelism," in Proc. VLDB Workshop, 2024.
[10] J. Choi et al., "Multi-News+: Cost-efficient dataset cleansing via LLM-based data annotation," in Proc. EMNLP, 2024.
[11] S. Zhang, Z. Huang, and E. Wu, "Cocoon: Data cleaning using LLMs," arXiv, 2024.
[12] W. Ni et al., "IterClean: Iterative data cleaning with LLMs," in Proc. SIGMOD, 2024.
[13] P. Martins et al., "Performance and scalability of data cleaning tools," MDPI Data, 2025.
[14] T. Brown et al., "Language models are few-shot learners," in Proc. NeurIPS, 2020.
[15] A. Vaswani et al., "Attention is all you need," in Proc. NeurIPS, 2017.
[16] J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers," in Proc. NAACL-HLT, 2019.
[17] E. Rahm and H. H. Do, "Data cleaning: Problems and current approaches," IEEE Data Engineering Bulletin, vol. 23, no. 4, pp. 3–13, 2000.