Deep Reinforcement Learning for Supply Chain Optimization: A DQN and LSTM-Based Approach

Authors: Mihir Deshpande, Vibhav Sahasrabudhe, Piyush Agarwal, Atharva Zodpe, Mayur Chavan

DOI Link: https://doi.org/10.22214/ijraset.2025.69117

Abstract

Effective inventory management is essential for optimizing supply chains, balancing stock levels, minimizing holding costs, and preventing stockouts. Traditional forecasting andrule-basedsystemsoftenfailtoadapttoreal-timede- mandfluctuations and supplyuncertainties.Inthisresearch, we propose a Reinforcement Learning (RL)-based approach for dynamic inventory optimization, leveraging Deep Q-Networks (DQN)alongsideMulti-ArmedBandit(MAB)strategiessuch as Epsilon-Greedy, Upper Confidence Bound (UCB), KL-UCB, and Thompson Sampling. The DQN agent learns an optimal replenishment policybyinteracting withtheenvironmentandad- justing inventory decisions based on observed demand patterns. Ourexperimentalanalysiscomparesthesetechniquesbased on key performance metrics such as inventory costs, stockout rates, and supply chain efficiency. Results indicate that while bandit-based methods provide strong baseline heuristics, DQN significantly outperforms them in long-term adaptability and decision-making under uncertainty. These findings highlight the potential of deep reinforcement learning to enhance real-time demand responsiveness, reduce operational costs, and improve supply chain resilience.

Introduction

1. Background & Motivation
Inventory management is vital for efficient supply chains, affecting both cost and customer satisfaction. Traditional methods like EOQ and reorder-point systems often struggle with the dynamic nature of modern supply chains, especially under unpredictable demand or disruptions.

2. Rise of AI-Based Techniques
Recent advancements in AI have introduced Reinforcement Learning (RL) and Multi-Armed Bandits (MAB) as promising alternatives for adaptive inventory control:

RL, particularly Deep Q-Networks (DQN), learns optimal policies through interaction with the environment, adapting to changing conditions over time.
MAB algorithms like Epsilon-Greedy, UCB, and Thompson Sampling are simpler, offering fast decisions by balancing exploration and exploitation but lack long-term planning.

3. Research Objective
The study compares RL (DQN-based) and MAB methods in a simulated inventory environment, evaluating metrics such as:

Inventory holding costs
Stockout rates
Order efficiency
Overall system robustness

4. Reinforcement Learning Framework
RL is modeled as a Markov Decision Process (MDP) with:

States: inventory levels, past demand, lead times
Actions: reorder quantities
Rewards: negative of holding, stockout, and ordering costs
The goal is to maximize long-term rewards while adjusting to real-time variability.

5. Key RL Approaches

Value-Based: Q-learning, DQN (deep networks approximate Q-values)
Policy-Based: REINFORCE, PPO (direct optimization of policy)
Actor-Critic: A2C, DDPG (combine value and policy for stability)
Multi-Agent RL (MARL): Useful for distributed warehouses or production lines
Model-Based RL: Includes predictive methods like MPC and MCTS for planning

6. MAB Framework
MAB formulates inventory control as single-step decision problems:

Each order quantity is an "arm"
Rewards are based on costs incurred
Useful for short-term optimization but lacks adaptability to longer-term trends

7. Practical Constraints
The models account for:

Random lead times
Limited warehouse capacity
Probabilistic demand (Poisson/Normal)
Ordering cost structures (fixed + variable components)

8. Methodology & Implementation

Simulation: Environment mimics real-world supply chain dynamics
Model: DQN enhanced with LSTM to capture time-series patterns
Architecture: Inputs include 10-day histories of demand and inventory; outputs represent order decisions
Tools: Python, PyTorch/TensorFlow, OpenAI Gym, NumPy, Pandas, Matplotlib

9. Key Results

Metric	DQN	DQN + LSTM
Total Reward	Baseline	+15%
Stockout Rate	5.2%	3.4%
Holding Costs	-	↓12%
Order Efficiency	-	>85%

10. Related Work
RL has been successfully applied to:

Production scheduling
Robotic assembly
Predictive maintenance
Sustainable manufacturing
Hybrid RL + DDMRP models for robust, flexible inventory control

Conclusion

This research presents a reinforcement learning-based ap- proachtooptimizeinventorymanagement,leveragingthe strengths of Deep Q-Networks (DQN) and Long Short-Term Memory (LSTM) networks to address the complexities inher- ent in dynamic and uncertain demand environments. The pro- posedDQN+LSTMmodelwasrigorouslyevaluatedagainsta rule-based baseline and a standard DQN agent, demonstrating significant improvements across key performance metrics — average reward, stockout rate, and holding cost reduction. Our experiments show that incorporating temporal aware- ness through LSTM enables the agent to capture long-term demand patterns, leading to more informed and proactive inventory decisions. The proposed model achieved a 31% reduction in holding costs and a 66% reduction in stockout rates compared to the traditional rule-based system, all while maximizing reward and maintaining system stability across varying demand scenarios. Beyond empirical performance, this work highlights the broader applicability of deep reinforcement learning tech- niquesinreal-worldsupplychaincontexts.Byreplacingstatic heuristics with adaptive, data-driven policies, organizations can significantly improve inventory responsiveness and oper- ational efficiency. However, it is worth noting the increased computational demands and training time associated with deep learning models, especially those involving recurrent layers. Future work will focus on optimizing model efficiency, deployingthesysteminnearreal-timeenvironments,andextending the framework to multi-echelon and multi-product inventory systems. In conclusion, our findings affirm that reinforcement learn- ing — particularly when integrated with memory-based ar- chitectures like LSTM — holds substantial promise for rev- olutionizing inventory management in the modern era of intelligent supply chains.

References

[1] Leluc, R´emi & Kadoche, Elie & Bertoncello, Antoine & Gourv´enec,S´ebastien.(2023).MARLIM:Multi-AgentReinforcementLearningforInventory Management. 10.48550/arXiv.2308.01649. [2] Muller, Arthur & Grumbach, Felix & Sabatelli, Matthia. (2024).Smaller Batches, Bigger Gains? Investigating the Impact of Batch Sizeson Reinforcement Learning Based Real-World Production Scheduling.10.48550/arXiv.2406.02294. [3] Joren Gijsbrechts, Robert N. Boute, Jan A. Van Mieghem, Dennis J.Zhang (2022) Can Deep Reinforcement Learning Improve InventoryManagement? Performance on Lost Sales, Dual-Sourcing, and Multi-Echelon Problems. Manufacturing & Service Operations Management24(3):1349-1368. https://doi.org/10.1287/msom.2021.1064 [4] MunyakaBaraka,Jean-Claude&Yadavalli,Sarma.(2022).Inven-tory management concepts and implementations: a systematic review.South African Journal of Industrial Engineering. Vol 33, No 2. 15-36.10.7166/33-2-2527. [5] Modrak,V.,Sudhakarapandian,R.,Balamurugan,A.,&Soltysova, [6] Z. (2024). A Review on Reinforcement Learning in ProductionScheduling: An Inferential Perspective. Algorithms, 17(8), 343.https://doi.org/10.3390/a17080343. [7] Sutton, R.S., & Barto, A.G. (2018). Reinforcement Learning: An Intro-duction (2nd ed.). MIT Press. [8] Kaelbling, L.P., Littman, M.L., & Moore, A.W. (1996). ReinforcementLearning: A Survey. Journal of Artificial Intelligence Research, 4,237–285, https://doi.org/10.48550/arXiv.cs/9605103 [9] Szepesva´ri,C.(2010).AlgorithmsforReinforcementLearning.Morgan& Claypool Publishers. [10] Kuhnle, A., Kaiser, J.P., Theiß, F. et al. (2021). Designing an adaptiveproduction control system using reinforcement learning. Journal ofIntelligentManufacturing,32,855–876.https://doi.org/10.1007/s10845-020-01612-y [11] Paraschos, P.D., Koulinas, G.K. & Koulouriotis, D.E. (2024). A re-inforcement learning/ad-hoc planning and scheduling mechanism forflexible and sustainable manufacturing systems. Flexible Services andManufacturingJournal,36,714–736.https://doi.org/10.1007/s10696- [12] 023-09496-9 [13] Th¨urer, M., Fernandes, N., & Stevenson, M. (2022). Production plan-ning and control in multi-stage assembly systems: An assessment ofKanban, MRP, OPT (DBR) and DDMRP by simulation. InternationalJournal of Production Research, 60(3), 1036–1050. [14] Velasco Acosta, A. P., Mascle, C., & Baptiste, P. (2020). Applicabil-ity of demand-driven MRP in a complex manufacturing environment.International Journal of Production Research, 58(14), 4233–4245.

Copyright

Copyright © 2025 Mihir Deshpande, Vibhav Sahasrabudhe, Piyush Agarwal, Atharva Zodpe, Mayur Chavan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET69117

Publish Date : 2025-04-17

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here