Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Saber Muthanna Ahmed Qasem, Feng Wang
DOI Link: https://doi.org/10.22214/ijraset.2026.76765
Urban traffic congestion remains a pressing challenge, particularly in developing regions where infrastructure development struggles to keep pace with rapid urbanization. Fixed-time traffic signal systems, commonly deployed in countries like Yemen, lack the adaptability needed to manage dynamic and unpredictable traffic flows efficiently. This work addresses these limitations by proposing and evaluating reinforcement learning (RL)–based algorithms for adaptive traffic signal control, with a dual focus on improving traffic flow and reducing environmental impact. The research introduces two core innovations: a decentralized Q-learning algorithm and a deep reinforcement learning framework based on Proximal Policy Optimization with Masking (PPO-Mask). While Q-learning demonstrated substantial improvements over traditional control strategies in reducing vehicle delays and emissions, its scalability was limited in complex traffic environments. To overcome these challenges, the PPO-Mask model was developed, incorporating action masking to ensure safer decision-making and faster convergence in high-dimensional settings. Simulations conducted using the SUMO platform across both synthetic and real-world scenarios demonstrated that PPO-Mask consistently outperformed Q-learning and fixed-time baselines across all key performance metrics. This work contributes a robust, scalable, and cost-effective approach to adaptive traffic signal control that is particularly suitable for low-resource urban environments. It also provides a comparative framework and practical insights that can inform future integration of AI-driven traffic control strategies into broader urban mobility planning.
In recent years, Reinforcement Learning (RL) has emerged as a powerful approach for adaptive traffic signal timing, overcoming the limitations of traditional fixed and rule-based systems. Unlike conventional traffic models that rely on predefined schedules or simplified assumptions, RL enables traffic controllers to learn optimal signal strategies directly through interaction with the environment. By continuously receiving feedback in the form of rewards (e.g., minimizing delay, congestion, emissions), RL agents adapt to dynamic and unpredictable traffic conditions without requiring prior knowledge of traffic patterns.
However, applying RL to real-world traffic control presents significant challenges. As intersections become more complex (e.g., multi-phase intersections with multiple turning movements), the state-action space grows exponentially, leading to computational complexity, inefficient exploration, and slow convergence. Trial-and-error learning in real traffic systems is also costly, as poor policies may increase congestion and environmental pollution.
To address these challenges, this research proposes an adaptive traffic signal timing framework based on Q-learning and PPO-Mask (Proximal Policy Optimization with action masking). Unlike traditional model-based methods, these model-free RL techniques learn optimal policies directly from environmental interaction. Q-learning optimizes signal timing using real-time traffic data, while PPO-Mask enhances performance through the following (a minimal masking sketch is given after this list):
Action masking, preventing invalid or unsafe phase transitions
Improved exploration efficiency
Faster and more stable convergence
Better scalability in multi-agent, multi-intersection settings
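As an illustration of the masking mechanism, the sketch below shows one common way to implement it, assuming a PyTorch policy head; the helper name, the two-action layout, and the minimum-green rule in the example are illustrative assumptions rather than the authors' exact implementation. Actions that would violate a constraint receive an effectively infinite negative logit, so they get zero probability after the softmax.

```python
import torch
from torch.distributions import Categorical

def masked_action_distribution(logits: torch.Tensor,
                               valid_mask: torch.Tensor) -> Categorical:
    """Build an action distribution that excludes invalid actions.

    logits:     raw policy outputs, shape (batch, n_actions)
    valid_mask: 1 for allowed actions, 0 for forbidden ones
                (e.g. switching phase before the minimum green has elapsed)
    """
    # Masked actions get a very large negative logit, so their
    # probability becomes zero after the softmax.
    masked_logits = logits.masked_fill(valid_mask == 0, -1e9)
    return Categorical(logits=masked_logits)

# Example with two actions: 0 = keep current phase, 1 = switch phase.
# Switching is masked here because the minimum green time is not over yet.
logits = torch.tensor([[0.3, 1.2]])
mask = torch.tensor([[1, 0]])
dist = masked_action_distribution(logits, mask)
action = dist.sample()            # always 0 under this mask
log_prob = dist.log_prob(action)  # fed into the PPO surrogate objective
```

The same mask has to be applied again when log-probabilities are recomputed during the PPO update, so that the importance-sampling ratios remain consistent with the behaviour policy.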
The system is implemented and evaluated using the SUMO (Simulation of Urban MObility) simulation platform integrated with the TraCI interface. Traffic state information (queue length, vehicle count, waiting time, phase status) is transmitted to the RL agent, which selects an action (switch phase or maintain phase) and receives a reward based on performance metrics including the following (a brief TraCI sketch is given after this list):
Queue length
Waiting time
Stops
Fuel consumption
Emissions (CO, CO₂, NOx)
Noise level
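To make this loop concrete, the snippet below reads the listed metrics through TraCI for a single incoming lane and applies a phase decision. The configuration file, lane ID, traffic-light ID, phase count, and the placeholder decision rule are assumptions for illustration; in the full system the decision comes from the Q-learning or PPO-Mask agent.

```python
import traci

def agent_wants_to_switch(queue, waiting):
    # Placeholder rule standing in for the RL agent's action selection.
    return queue > 5 and waiting > 30.0

# Start SUMO headless with a (placeholder) scenario configuration.
traci.start(["sumo", "-c", "intersection.sumocfg"])

TLS_ID = "tls0"        # traffic-light ID (placeholder)
LANE_ID = "edge_in_0"  # one incoming lane (placeholder)

for step in range(3600):
    traci.simulationStep()

    # State features forwarded to the RL agent.
    queue    = traci.lane.getLastStepHaltingNumber(LANE_ID)
    vehicles = traci.lane.getLastStepVehicleNumber(LANE_ID)
    waiting  = traci.lane.getWaitingTime(LANE_ID)
    phase    = traci.trafficlight.getPhase(TLS_ID)

    # Environmental terms used in the reward.
    fuel  = traci.lane.getFuelConsumption(LANE_ID)
    co    = traci.lane.getCOEmission(LANE_ID)
    co2   = traci.lane.getCO2Emission(LANE_ID)
    nox   = traci.lane.getNOxEmission(LANE_ID)
    noise = traci.lane.getNoiseEmission(LANE_ID)

    # Action: keep the current phase or advance to the next one
    # (assuming a four-phase signal program).
    if agent_wants_to_switch(queue, waiting):
        traci.trafficlight.setPhase(TLS_ID, (phase + 1) % 4)

traci.close()
```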
The problem is formulated as a Markov Decision Process (MDP), where (an illustrative reward sketch follows this list):
State includes traffic metrics and signal phase information
Actions include switching or maintaining signal phases
Reward penalizes congestion, delay, emissions, and unsafe transitions
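A reward of this form can be written as a negative weighted sum of the penalty terms; the weights in the sketch below are illustrative placeholders, not the tuned values used in the study.

```python
def compute_reward(queue, waiting_time, stops, fuel, co2, unsafe_switch):
    """Penalize congestion, delay, consumption, emissions and unsafe
    transitions; all weights here are illustrative assumptions."""
    reward = -(1.0 * queue
               + 0.5 * waiting_time
               + 0.2 * stops
               + 0.1 * fuel
               + 0.05 * co2)
    if unsafe_switch:
        # Extra penalty for a premature phase change; with PPO-Mask such
        # actions are masked out instead of being penalized after the fact.
        reward -= 10.0
    return reward
```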
The proposed PPO-Mask architecture uses a fully connected neural network with two hidden layers (512 neurons each, ReLU activation). Additionally, a lane-level state decomposition strategy preserves invariance and improves learning efficiency.
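A minimal PyTorch sketch of such a network is shown below. The input dimension, the number of phase actions, and the separate value head are assumptions added to make the example complete; only the two 512-unit ReLU hidden layers are taken from the description above.

```python
import torch
import torch.nn as nn

class PPOMaskPolicy(nn.Module):
    """Fully connected actor-critic with two 512-unit ReLU hidden layers."""

    def __init__(self, state_dim: int = 40, n_actions: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.policy_head = nn.Linear(512, n_actions)  # phase logits
        self.value_head = nn.Linear(512, 1)           # state value for PPO

    def forward(self, state: torch.Tensor, valid_mask: torch.Tensor):
        h = self.backbone(state)
        logits = self.policy_head(h)
        # Apply the action mask before sampling (see the masking sketch above).
        logits = logits.masked_fill(valid_mask == 0, -1e9)
        return logits, self.value_head(h)
```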
The key contributions of the proposed framework include:
Integration of constraint-aware action masking for safe transitions
Improved convergence and exploration efficiency
Multi-agent coordination capability
Reduced vehicle waiting time, congestion, and emissions
Superior performance compared with the fixed-time signal plans currently deployed in Yemen and with conventional Q-learning
Simulation results using real-world traffic data demonstrate that the proposed PPO-Mask-based adaptive system:
Achieves faster convergence
Reduces vehicle waiting times
Lowers fuel consumption and emissions
Performs effectively in complex multi-phase intersections
Overall, this research advances scalable, intelligent, and sustainable urban traffic management through safe and efficient reinforcement learning-based signal optimization.
This work explores reinforcement learning for adaptive traffic signal timing in cities of developing countries such as Yemen. It proposes two RL models, decentralized Q-learning and PPO-Mask, both of which significantly outperform fixed-time systems in reducing delay, fuel use, and emissions. While Q-learning suits small-scale deployments, PPO-Mask excels in complex environments with safer, more stable learning. The work demonstrates RL’s practical potential for low-cost, sustainable traffic management in resource-constrained urban settings.
Copyright © 2026 Saber Muthanna Ahmed Qasem, Feng Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET76765
Publish Date : 2026-01-01
ISSN : 2321-9653
Publisher Name : IJRASET
