Typically, large-scale data transfers into the database cause lots of locking which negatively affects query execution, concurrency, as well as blocking operations such as ALTER, UPDATE, and DELETE. To ensure data consistency, a set of locks are necessary, that can become a performance bottleneck as well as cause deadlocks. The contributions of this paper include three novel techniques to improve database operation and defeat large data loads through lock contention. We minimize data handling efficiency by converting lists into sets first. The second is that data can be processed outside the main database by using temporary or virtual memory, which minimizes the lock durations. Third, by using ‘placeholders’ when inserting data, the operations stay non-blocking and do not affect other transactions. We implement these methods to show how the performance of the database can be increased, how high the concurrency is, and how we can prevent locking issues in the database in the future.
Introduction
This study focuses on optimizing database performance and addressing foreign key violations during large-scale data integration using AWS Glue Jobs. The main challenge involved transaction failures and data inconsistencies when handling four highly interdependent tables (Provider, Sides, Cases, and Benefit Profiles) among a total of 62 tables.
Key Issues:
Foreign key constraints caused failures during bulk insertions.
AWS Glue jobs experienced long database locks and slow ETL performance.
Optimization Strategies Implemented:
Relaxing foreign key constraints during insertion to avoid violations.
Uploading large-scale generated data (400,000 records) to AWS S3, then processing via Glue jobs.
Glue job enhancements: batching, parallelism, and transformation logic optimization.
Three Code-Level Models Used:
List-to-Set Conversion: Eliminates duplicates for faster lookups.
Temporary Memory Limit (50MB): Controls memory usage during large operations.
Placeholders (e.g., None): Optimizes memory by skipping or reducing unneeded data storage.
Performance Results:
Model 1: Reduced job runtime from 1.10s → 0.53s (↓ 0.57s)
Model 2: Further reduced to 0.43s using virtual memory (↓ 0.67s total)
Model 3: Final optimization to 0.38s using placeholders (↓ 0.72s total)
These improvements reduced ETL latency and maintained data consistency under high-concurrency loads.
Discussion:
Each method improved performance by reducing memory overhead, SQL parsing time, and avoiding redundant operations.
The study demonstrates how code-level optimization can significantly enhance serverless data pipelines in AWS.
Future Directions:
Use Amazon ElastiCache for distributed caching to reduce I/O latency.
Leverage Spark Adaptive Query Execution (AQE) in Glue.
Employ vectorized operations (e.g., via NumPy/Pandas) for faster data processing.
Implement data partitioning, bucketing, and dynamic partitioning in S3 to improve scalability and filtering.
Conclusion
This research clearly shows that specific code optimizations can substantially affect the AWS Glue job duration during the large ingestion of data. This way, we were able to overcome the foreign key constraint problems and reduce the amount of time it takes to process the list of data by converting lists to sets by allocating memory and using parameterized queries. The improvements that were observed especially in the use of placeholders show the effectiveness of reducing the I/O overhead and query optimization in serverless ETL frameworks. These results emphasize that improving the efficiency and scalability of data pipelines in cloud environments involves careful code-level improvements to the processing pipelines, which can help practitioners working on data processing workflows.
References
[1] Vamsi Krishna Myalapalli, Thirumala Padmakumar Totakura, and Sunitha Geloth, “Augmenting Database Performance via SQL
[2] Tuning,” International Conference on Energy Systems and Applications, pp. 13-18, 2015.
[3] Khaled Saleh Maabreh, “Optimizing Database Query Performance Using Table Partitioning Techniques,”International Arab Conference on Information Technology, pp. 1-4,2018.
[4] Structure of Database Management System – Geeks for Geeks, 2020. [Online]. Available: https://www.geeksforgeeks.org/structure-of-database-management-system/
[5] Vivek Basavegowda Ramu, “Performance Impact of Microservices Architecture,” The Review of Contemporary Scientific and Academic Studies, vol. 3, no. 6, 2023. [CrossRef] [Publisher Link]