This project develops an automated expense processing system that uses OCR and LLMs to extract and enrich invoice data. The system automatically extracts the necessary information from documents of various types and formats, which greatly reduces the time spent on data entry without sacrificing accuracy. After extraction, the data is classified and further processed into a consistent financial record. Users can view and manage their expenditure through an intuitive interface that supports filters and user-defined tracking criteria. One of the main features of the system is self-service analytics: users can analyze their spending habits from historical data and build predictive models that help them reduce future spending. In particular, XGBoost was trained on structured historical expense data to forecast spending for the upcoming month. XGBoost does not rely on explicitly defined relationships; instead, it learns complex, non-linear relationships between features such as previous expense amounts, transaction frequency, category, seasonality, and trend indicators. It constructs an ensemble of decision trees in which each subsequent tree is fitted to correct the errors of the trees before it, so accuracy generally improves as trees are added, up to the point where the model begins to overfit. Tesseract OCR, Django, OpenAI’s language models, and XGBoost were combined into a single cohesive and scalable framework that supports the personalized financial goals of both organizational and individual users.
Ultimately, the goal of this project is to make expense tracking more accurate and efficient than spreadsheet-based workflows, regardless of the intended user. The system is also designed to integrate with pre-existing systems, which helps it remain relevant and valuable over time.
Our aim is to develop a functional prototype for automated invoice processing and expense management using Django, OCR, and OpenAI language models to streamline financial workflows. The system is designed to handle various invoice templates, applying Tesseract OCR for text extraction and NLP models for classifying key fields such as dates, vendors, and itemized costs. Through Django’s web framework, users can easily upload, organize, and manage expenses, while historical data is used to estimate future costs through machine learning. The intuitive web interface delivers actionable financial insights, and the combination of OCR, NLP, and predictive analytics allows for detailed, customizable expense tracking suited to both individuals and organizations, making the system efficient, accurate, and user-friendly.
Introduction
The project aims to automate invoice processing and expense tracking by integrating Optical Character Recognition (OCR), Natural Language Processing (NLP), and Machine Learning (ML). It extracts invoice data (e.g., date, items, cost) using Tesseract OCR, structures it into formats like JSON/CSV, and processes it using OpenAI's GPT models for categorization and XGBoost for future expense predictions. The application is built using the Django framework, styled with Tailwind CSS, and deployed on cloud platforms using Docker and AWS, with Apache Kafka handling real-time data flow.
Key Components:
1. Data Acquisition & Preprocessing:
Invoices are uploaded and stored securely.
Preprocessing includes grayscale conversion, resizing, noise removal, and skew correction to enhance OCR accuracy.
2. OCR & Text Extraction:
Text is extracted using PyTesseract and processed with OpenCV.
Output is cleaned and structured for easier data extraction.
3. Data Extraction & Structuring:
Regular expressions and NLP extract entities like vendor name, date, total, and items.
Structured data is stored in relational databases or JSON format for analytics and retrieval.
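The extraction step above can be illustrated with a minimal sketch. It assumes a simple invoice layout with "Vendor:", "Date:", and "Total:" labels, dates in DD/MM/YYYY form, and item lines that separate the description from the price with at least two spaces. The patterns, the function name extract_invoice_fields, and the sample text are hypothetical and are not taken from the project's code.

```python
import json
import re
from datetime import datetime

# Hypothetical patterns for one simple invoice layout; real invoices vary widely.
VENDOR_RE = re.compile(r"^Vendor:\s*(?P<vendor>.+)$", re.MULTILINE)
DATE_RE = re.compile(r"^Date:\s*(?P<date>\d{2}/\d{2}/\d{4})$", re.MULTILINE)
TOTAL_RE = re.compile(r"^Total:\s*\$?(?P<total>\d+(?:\.\d{2})?)$", re.MULTILINE)
# Assumed item line: "<description><two or more spaces><price>".
ITEM_RE = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z ]*?)[ \t]{2,}(?P<price>\d+\.\d{2})$",
    re.MULTILINE,
)


def extract_invoice_fields(text: str) -> dict:
    """Return vendor, ISO date, total, and items; missing fields become None."""
    vendor = VENDOR_RE.search(text)
    date = DATE_RE.search(text)
    total = TOTAL_RE.search(text)
    return {
        "vendor": vendor.group("vendor").strip() if vendor else None,
        # The DD/MM/YYYY input format is an assumption of this sketch.
        "date": (
            datetime.strptime(date.group("date"), "%d/%m/%Y").date().isoformat()
            if date
            else None
        ),
        "total": float(total.group("total")) if total else None,
        "items": [
            {"name": m.group("name").strip(), "price": float(m.group("price"))}
            for m in ITEM_RE.finditer(text)
        ],
    }


# Made-up OCR output used only to exercise the function.
sample_text = """Vendor: Acme Stationery
Date: 14/03/2024
Printer paper  12.50
Blue pens  4.25
Total: $16.75
"""

record = extract_invoice_fields(sample_text)
print(json.dumps(record, indent=2))
```

The resulting dictionary serializes directly to JSON, or its fields can be mapped onto the columns of a relational table.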
4. NLP & Categorization:
GPT-3.5-turbo-instruct analyzes text and classifies expenses into predefined categories using dynamic prompts.
This reduces manual classification effort and improves accuracy.
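One way a dynamic prompt could be assembled and the model's reply mapped back onto a predefined category is sketched below. The category list reuses the categories named in the evaluation and adds an assumed "Other" fallback. The function names build_prompt and parse_category and the prompt wording are hypothetical, and the call to the language model itself is deliberately left out.

```python
# Categories named in the evaluation, plus an assumed "Other" fallback.
CATEGORIES = ["Transportation", "Entertainment", "Utilities", "Clothing", "Other"]


def build_prompt(vendor: str, items: list[str], categories: list[str]) -> str:
    """Assemble a classification prompt from the extracted invoice fields."""
    item_lines = "\n".join(f"- {item}" for item in items)
    return (
        "Classify the expense below into exactly one of these categories: "
        f"{', '.join(categories)}.\n"
        f"Vendor: {vendor}\n"
        f"Items:\n{item_lines}\n"
        "Answer with the category name only."
    )


def parse_category(reply: str, categories: list[str], fallback: str = "Other") -> str:
    """Map a free-text model reply onto a known category, else the fallback."""
    cleaned = reply.strip().strip(".").lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    return fallback


prompt = build_prompt("City Metro", ["Monthly pass", "Single ticket"], CATEGORIES)
print(prompt)

# The reply would normally come from the language model; this one is made up.
print(parse_category(" transportation.\n", CATEGORIES))
```

Validating the reply against the fixed category list keeps unexpected model output from introducing new categories into the database.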
5. Expense Prediction using Machine Learning:
XGBoost forecasts monthly expenses from features such as historical spending, day of the week, and vendor behavior.
Tree-based models of this kind are relatively robust to outliers, and XGBoost's feature importance scores show which features most affect predictions.
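As a rough illustration of how historical spending can be turned into model inputs, the sketch below aggregates dated expenses into monthly totals and builds lagged features for each month. The helper names monthly_totals and make_training_rows, the choice of two lags, and the sample data are assumptions for illustration. The project's actual feature set is not specified beyond the features listed above.

```python
from datetime import date


def monthly_totals(expenses):
    """Sum amounts per (year, month), returned in chronological order."""
    totals = {}
    for day, amount in expenses:
        key = (day.year, day.month)
        totals[key] = totals.get(key, 0.0) + amount
    return sorted(totals.items())


def make_training_rows(totals, n_lags=2):
    """Build one row per month: calendar month, previous totals, and target.

    lags[0] is the total of the previous month, lags[1] the one before, etc.
    This sketch assumes there are no missing months in the history.
    """
    rows = []
    for i in range(n_lags, len(totals)):
        (_, month), target = totals[i]
        lags = [totals[i - k][1] for k in range(1, n_lags + 1)]
        rows.append({"month": month, "lags": lags, "target": target})
    return rows


# Made-up expense history: (date, amount) pairs.
expenses = [
    (date(2024, 1, 5), 100.0),
    (date(2024, 1, 20), 50.0),
    (date(2024, 2, 3), 80.0),
    (date(2024, 3, 9), 120.0),
    (date(2024, 3, 28), 30.0),
    (date(2024, 4, 2), 90.0),
]

rows = make_training_rows(monthly_totals(expenses))
for row in rows:
    print(row)
```

Rows of this shape could then be flattened into a feature matrix and target vector and passed to a regressor such as XGBoost's XGBRegressor.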
6. Data Visualization:
Expense trends are visualized using Matplotlib, Seaborn, and Plotly.
Forecasts are displayed alongside historical data on a Django-based dashboard, enhancing user insights and planning.
7. Django Framework & Admin Panel:
Follows the Model-View-Template (MVT) pattern.
Django ORM handles data modeling and retrieval.
Admin interface provides tools for invoice review, category management, and system monitoring.
8. System Deployment:
Uses Docker, AWS Elastic Beanstalk, and Heroku for scalable deployment, with AWS Lambda for serverless components.
Apache Kafka enables asynchronous, real-time data processing between services.
Role-based access, encryption, and modular microservices support security and responsiveness.
Evaluation & Results:
Prediction Metrics:
Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²) are used to evaluate forecast accuracy.
The model predicts well in categories such as Transportation (MAE: 23.91) and Entertainment (MAE: 34.52).
Lower performance in Utilities and Clothing indicates room for improvement with enhanced features or models.
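For reference, the three metrics can be computed as in the sketch below, which uses the standard definitions. The actual and predicted values are made up for illustration and are not the project's results.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more than MAE."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance.

    Undefined (division by zero) when all actual values are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative monthly totals only.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

print(f"MAE:  {mae(actual, predicted):.2f}")
print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"R2:   {r2(actual, predicted):.3f}")
```

MAE and RMSE are expressed in the same currency units as the expenses, so they are comparable across categories only when the categories have similar spending levels. R² is unitless.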
Invoice Accuracy:
OCR and GPT successfully extracted and matched invoice fields (date, item, amount, category) with high precision.
The system demonstrated effective, real-time processing for diverse invoice formats.
Literature Review:
References focus on Tesseract OCR, image preprocessing (binarization, resizing, skew correction), and early systems that used OCR for invoice extraction.
Prior systems lacked ML-based forecasting or NLP categorization, which this project integrates for improved functionality and scalability.
Conclusion
Invoice management within a company is a considerable chore in its own right. This project integrates OCR and machine learning into a user-friendly web interface to automate invoice processing and expense tracking, improving both productivity and accuracy. Combining data extraction with forecasting allows financial management to be tailored to each user and makes the system easier to scale. Tesseract OCR extracts the text, a GPT model categorizes the expenses, and a machine learning model forecasts and analyzes spending, so expense handling becomes both more streamlined and more accurate.
References
[1] An Intelligent Invoice Processing System using Tesseract OCR, 2024, Ashlin Deepa R N, Suhas Chinta, Nikhil Kumar Ashili, B Sankara Babu, Revanth Reddy Vydugula, Raj Sripada VSL.
[2] Digitization of Data from Invoice using OCR, 2022, Venkata Naga Sai Rakesh Kamisetty, Bodapati Sohan Chidvilas, S. Revathy, P. Jeyanthi, V. Maria Anu, L. Mary Gladence.
[3] An Empirical Analysis of Topic Categorization using PaLM, GPT and BERT Models, 2023, Dhanvanth Reddy Yerramreddy, Jayasurya Marasani, Ponnuru Sathwik Venkata Gowtham, S Abhishek, Anjali