The rapid growth of heterogeneous data across modern enterprises has created strong demand for intelligent, self-managing data infrastructure capable of operating without human supervision. Manually managed data pipeline architectures are well defined, but they require human intervention to diagnose quality failures, detect anomalous patterns, and adapt to evolving schemas. This paper presents the Autonomous Data Pipeline Monitoring System (ADPMS), an agentic AI framework consisting of four specialized agents: an Ingestion Agent for data acquisition; a Data Quality Agent performing automated deduplication and forward filling of missing values; an Anomaly Detection Agent applying the Isolation Forest algorithm from scikit-learn; and a Decision Agent orchestrating the pipeline and classifying its health through a multi-threshold rule engine. A Streamlit-based monitoring dashboard with six analytical tabs, four-format CSV export, and anomaly removal functionality completes the system. Evaluation on IoT sensor telemetry and large-scale compensation datasets demonstrates a mean anomaly detection F1 score of 90.9%, sub-400 ms pipeline execution for datasets of up to 10,000 rows, and complete resolution of missing values and duplicate records. The system achieves sub-two-minute deployment via pip install and outperforms all compared systems on the combined criteria of autonomous operation, anomaly detection, data export, and deployment simplicity.
Introduction
This paper presents an Autonomous Data Pipeline Monitoring System (ADPMS) that uses Agentic AI to automatically monitor, clean, analyze, and manage data quality in modern data pipelines. With the rapid growth of data generated by industries such as manufacturing, finance, and healthcare, traditional data quality methods based on manual monitoring and rule-based validation are no longer sufficient. ADPMS addresses these limitations by employing autonomous agents that continuously observe data, detect anomalies, make decisions, and execute corrective actions without human intervention.
The system is based on the IBM MAPE-K (Monitor, Analyze, Plan, Execute, Knowledge) architecture and consists of four main agents: Ingestion Agent, Data Quality Agent, Anomaly Detection Agent, and Decision Agent. These agents work together to ingest data, clean missing values and duplicates, detect anomalies using the Isolation Forest machine learning algorithm, classify pipeline health, and generate recommendations. The system also provides clean datasets with anomalies removed, enabling direct use in downstream machine learning applications.
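The cleaning and detection stages described above can be illustrated with a minimal sketch. It assumes pandas and scikit-learn are available; the function names, the `contamination` value, the back-fill fallback for leading gaps, and the synthetic data are all hypothetical choices made for illustration, not the system's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Data Quality Agent sketch: drop exact duplicate rows, then forward-fill gaps."""
    deduped = df.drop_duplicates().reset_index(drop=True)
    # Forward fill cannot repair a gap in the first row; the back-fill
    # fallback for that edge case is an assumption, not from the paper.
    return deduped.ffill().bfill()


def flag_anomalies(df: pd.DataFrame, contamination: float = 0.05,
                   seed: int = 42) -> pd.DataFrame:
    """Anomaly Detection Agent sketch: label rows with Isolation Forest."""
    numeric = df.select_dtypes(include="number")
    model = IsolationForest(contamination=contamination, random_state=seed)
    labels = model.fit_predict(numeric)  # -1 marks an anomaly, 1 a normal row
    out = df.copy()
    out["is_anomaly"] = labels == -1
    return out


# Synthetic sensor-like data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "temperature": rng.normal(25.0, 1.0, 200),
    "pressure": rng.normal(101.0, 0.5, 200),
})
raw.loc[5, "temperature"] = np.nan                        # one missing value
raw.loc[10, ["temperature", "pressure"]] = [90.0, 10.0]   # one obvious outlier
raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)  # one duplicate row

cleaned = clean(raw)
result = flag_anomalies(cleaned)
print(result["is_anomaly"].sum(), "rows flagged out of", len(result))
```

Because Isolation Forest is unsupervised, no labels are needed; the `contamination` parameter only sets the expected share of outliers and would have to be tuned per dataset.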
A comprehensive Streamlit dashboard with six analytical tabs and interactive Plotly visualizations allows users to monitor data quality, analyze anomalies, view decision reports, and download processed datasets in multiple formats. The platform is dataset-independent and supports large datasets while maintaining fast execution times and easy deployment.
The literature review highlights the shortcomings of existing solutions such as Great Expectations, AWS Deequ, and Apache Griffin, which primarily rely on manually defined rules and lack autonomous decision-making and anomaly removal capabilities. Compared to these systems, ADPMS offers unsupervised anomaly detection, integrated quality management, automated decision intelligence, rapid deployment, and clean data exports.
The methodology defines hardware and software requirements, functional and non-functional specifications, input/output design, and modular implementation principles. The system supports CSV ingestion, automatic cleaning, anomaly detection, health classification, and export functionalities while ensuring scalability, reproducibility, and ease of maintenance.
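The export step with anomaly removal can be sketched as follows. The export variants are not enumerated in the text above, so the three file names below, the `is_anomaly` flag column, and the `export_csvs` helper are hypothetical; the sketch only assumes pandas.

```python
import io

import pandas as pd


def export_csvs(df: pd.DataFrame, flag_col: str = "is_anomaly") -> dict[str, str]:
    """Return CSV text for several export variants (names are hypothetical)."""
    is_anom = df[flag_col].astype(bool)
    return {
        # Every row, with the anomaly flag retained.
        "full_with_flags.csv": df.to_csv(index=False),
        # Anomalies removed and the flag dropped: ready for downstream ML use.
        "clean.csv": df.loc[~is_anom].drop(columns=flag_col).to_csv(index=False),
        # Only the flagged rows, for manual review.
        "anomalies_only.csv": df.loc[is_anom].drop(columns=flag_col).to_csv(index=False),
    }


# Tiny illustrative frame (made-up values).
flagged = pd.DataFrame({
    "value": [1.0, 2.0, 99.0],
    "is_anomaly": [False, False, True],
})
exports = export_csvs(flagged)
clean_df = pd.read_csv(io.StringIO(exports["clean.csv"]))
print(clean_df)
```

Returning CSV text rather than writing files keeps the sketch free of side effects; a dashboard download button could serve each string directly.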
Conclusion
With the rapid growth of data-intensive applications across enterprise environments, the need for autonomous, self-managing data pipeline infrastructure has become a strategic imperative. Data owners and engineering teams face the challenge of ensuring that pipelines continuously deliver high-quality data without prohibitive manual intervention costs.
To address this challenge, the Autonomous Data Pipeline Monitoring System has been proposed and implemented as a four-agent agentic AI framework in Python. The system's Isolation Forest-based Anomaly Detection Agent detects statistical outliers without requiring labeled training data or manually specified rules, which addresses a fundamental limitation of the existing data quality tools reviewed. The Decision Agent autonomously classifies pipeline health across four quality dimensions and generates severity-classified alerts with actionable recommendations. The anomaly removal functionality produces clean export datasets ready for downstream ML consumption, a capability absent from all compared systems.
Our scheme significantly reduces manual intervention requirements, as the system autonomously executes the complete quality management lifecycle from ingestion through export. This makes it difficult for data quality issues to persist undetected, because they trigger immediate automated action. To validate the effectiveness of the proposed system, we conducted functional testing across 12 test cases (all passed) and a performance evaluation across five dataset sizes.
The results demonstrate that ADPMS is both efficient and practical for real-world deployment: sub-400 ms latency for 10,000-row datasets, a 90.9% mean F1 score for anomaly detection, and sub-two-minute setup via pip install of the requirements.
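A multi-threshold rule engine of the kind the Decision Agent uses might look like the sketch below. The four dimension names, the threshold values, the severity labels, and the recommendation wording are all hypothetical placeholders, since the text does not specify them; only the Python standard library is used.

```python
from dataclasses import dataclass

# Hypothetical (warning, critical) thresholds per quality dimension,
# expressed as fractions of rows. None of these values come from the paper.
THRESHOLDS = {
    "missing_rate": (0.01, 0.10),
    "duplicate_rate": (0.01, 0.05),
    "anomaly_rate": (0.05, 0.15),
    "invalid_rate": (0.01, 0.05),
}
SEVERITY_ORDER = ["HEALTHY", "WARNING", "CRITICAL"]


@dataclass
class Alert:
    dimension: str
    severity: str
    value: float
    recommendation: str


def classify(metrics: dict[str, float]) -> tuple[str, list[Alert]]:
    """Return the overall health label plus one alert per breached dimension."""
    alerts = []
    for dim, (warn, crit) in THRESHOLDS.items():
        value = metrics.get(dim, 0.0)
        if value >= crit:
            severity = "CRITICAL"
        elif value >= warn:
            severity = "WARNING"
        else:
            continue  # dimension is within limits, so no alert is raised
        alerts.append(Alert(
            dim, severity, value,
            f"Investigate {dim}: {value:.1%} exceeds the {severity.lower()} threshold",
        ))
    # Overall health is the worst severity seen; no alerts means healthy.
    overall = max((a.severity for a in alerts),
                  key=SEVERITY_ORDER.index, default="HEALTHY")
    return overall, alerts


status, alerts = classify({
    "missing_rate": 0.02,    # breaches the warning threshold
    "duplicate_rate": 0.0,
    "anomaly_rate": 0.20,    # breaches the critical threshold
    "invalid_rate": 0.0,
})
print(status, [a.dimension for a in alerts])
```

Taking the worst per-dimension severity as the overall label is one simple aggregation policy; a weighted score would be an equally plausible alternative.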
References
[1] Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). Isolation Forest. In Proc. 8th IEEE ICDM, pp. 413–422.
[2] Wooldridge, M., & Jennings, N. R. (1995). Intelligent agents: Theory and practice. Knowledge Engineering Review, 10(2), 115–152.
[3] Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. JMLR, 12, 2825–2830.
[4] McKinney, W. (2010). Data structures for statistical computing in Python. Proc. 9th Python in Science Conf., pp. 56–61.
[5] Sculley, D., et al. (2015). Hidden technical debt in machine learning systems. NeurIPS, pp. 2503–2511.
[6] Kephart, J. O., & Chess, D. M. (2003). The vision of autonomic computing. Computer, 36(1), 41–50.
[7] Redman, T. C. (1996). Data Quality for the Information Age. Artech House.
[8] Pipino, L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. CACM, 45(4), 211–218.
[9] Minsky, M. (1986). The Society of Mind. Simon & Schuster.
[10] Kreuzberger, D., Kühl, N., & Hirschl, S. (2022). MLOps: Overview, definition, and architecture. IEEE Access, 10.
[11] Breunig, M. M., et al. (2000). LOF: Identifying density-based local outliers. ACM SIGMOD, 29(2), 93–104.
[12] Hawkins, D. M. (1980). Identification of Outliers. Chapman and Hall.
[13] Few, S. (2006). Information Dashboard Design. O'Reilly Media.
[14] Streamlit Inc. (2023). Streamlit: The fastest way to build and share data apps. https://streamlit.io
[15] Schelter, S., et al. (2018). Automating large-scale data quality verification. Proc. VLDB Endowment.
[16] ISO/IEC 25012:2008. Data quality model. International Organization for Standardization.
[17] Schölkopf, B., et al. (2001). Estimating the support of a high-dimensional distribution. Neural Computation.
[18] Jennings, N. R., & Wooldridge, M. (1998). Applications of intelligent agents. Agent Technology. Springer.
[19] Chen, P., et al. (2021). A benchmark study on error detection for tabular data. VLDB, 14(9).
[20] Abedjan, Z., et al. (2016). Detecting data errors: Where are we and what do we need to know? VLDB, 9(12).