Multi-Agent Deep Q Network to Enhance the Reinforcement Learning for Delayed Reward System
Abstract
1. Introduction
2. Related Works
2.1. Q-Learning
2.2. Maze Finding
2.3. The Ping-Pong Game
2.4. Multi-Agent System and Reinforcement Learning
3. Proposed Multi-Agent N-DQN
3.1. Proposed Architecture
3.1.1. Hierarchization and Policy
Algorithm 1: Action Control Policy

```
Input: number of sublayers N, current state s, target action a
Initialize: shared replay memory D of size N * n
Initialize: N action-value functions Q with random weights
Initialize: episode history queue H of size N * n
Copy data to H from D
For 1, N do
    For state in H do
        if state = s then
            For action in state.action do
                if action != a then
                    do action
                else
                    break
        else
            do action
```
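One possible reading of Algorithm 1 as Python (a minimal sketch, not the paper's reference code): when the agent revisits a state recorded in the episode history H, it substitutes the first recorded alternative for the target action a. The `history` layout of (state, recorded_actions) pairs and the function name are assumptions made for illustration.

```python
# Minimal sketch of Algorithm 1's action control. `history` stands in for
# the episode history queue H as a list of (state, recorded_actions)
# pairs; the data layout and function name are assumptions.

def select_controlled_action(history, s, a):
    """Prefer a recorded alternative to target action `a` when state `s`
    was already visited; otherwise keep `a`."""
    for state, recorded_actions in history:
        if state == s:
            for action in recorded_actions:
                if action != a:
                    return action  # first alternative to the target action
            break  # only `a` was recorded for this state
    return a
```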
Algorithm 2: Reward Policy

```
Input: current state s, immediate reward r
Initialize: list of lethal states L, dict of rewards P
Initialize: the number of reward points K
After do action, before return immediate reward r:
For k in range(0, K) do
    if s = L[k] then
        return r + P[L[k]]
return r
```
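A minimal sketch of this reward policy, assuming the lethal-state list L and reward dict P are configured per environment (the names below are illustrative):

```python
# Sketch of Algorithm 2's reward shaping: if the current state is one of
# the K configured reward/lethal states, add its bonus or penalty to the
# immediate reward. `lethal_states` plays the role of L, `reward_table` of P.

def shape_reward(s, r, lethal_states, reward_table):
    for state in lethal_states:
        if s == state:
            return r + reward_table[state]  # shaped reward for state s
    return r  # no shaping applies
```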
3.1.2. Parallel Processing and Memory Sharing
Algorithm 3: N-DQN Experience Sharing

```
Procedure: training
Initialize: shared replay memory D of size N * n
Initialize: N action-value functions Q with random weights
Loop: episode = 1, M do
    Initialize state s_1
    For t = 1, T do
        For each Q_n do
            With probability ε select a random action a_t,
            otherwise select a_t = argmax_a Q_n(s_t, a; θ_i)
            Execute action a_t in the emulator and observe r_t and s_{t+1}
            Store transition (s_t, a_t, r_t, s_{t+1}) in D
        End For
        Sample a minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        Set y_j := r_j                                     for terminal s_{j+1}
            y_j := r_j + γ max_{a'} Q_n(s_{j+1}, a'; θ_i)  for non-terminal s_{j+1}
        Perform a gradient step on (y_j − Q_n(s_j, a_j; θ_i))^2 with respect to θ
    End For
End Loop
```
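To make the sharing scheme concrete, here is a minimal Python sketch: N sub-agents append transitions to one bounded memory, and every update samples from the pooled experience. The `agent.step()`, `agent.q_values()`, and `agent.update()` interfaces are assumptions for illustration, not the paper's published code.

```python
import random
from collections import deque

N, n = 4, 10_000
D = deque(maxlen=N * n)  # shared replay memory of size N * n

def collect(agents):
    """Each sub-agent Q_n takes one ε-greedy step and stores its
    transition in the shared memory D (hypothetical agent interface)."""
    for agent in agents:
        s, a, r, s_next, done = agent.step()
        D.append((s, a, r, s_next, done))

def train_step(agent, batch_size=32, gamma=0.99):
    """Sample pooled experience and fit toward the DQN target
    y = r + γ·max_a' Q(s', a'), as in Algorithm 3."""
    batch = random.sample(list(D), batch_size)
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * max(agent.q_values(s_next))
        agent.update(s, a, y)  # gradient step on (y − Q(s, a))^2
```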
3.1.3. Training Policy
Algorithm 4: Prioritized Experience Replay

```
Input: minibatch k, step-size η, replay period K and size N, exponents α and β,
       budget T, and N action-value functions Q
Initialize replay memory H = ∅, Δ = 0, p_1 = 1
Observe S_0 and choose A_0 ~ π_θ(S_0)
for t = 1 to T do
    Observe S_t, R_t, γ_t
    for p = 1 to N do
        Store transition (S_{t−1}, A_{t−1}, R_t, γ_t, S_t) in H
            with maximal priority p_t = max_{i<t} p_i
    end for
    if t ≡ 0 mod K then
        for j = 1 to k do
            Sample transition j ~ P(j) = p_j^α / Σ_i p_i^α
            Compute importance-sampling weight w_j = (N · P(j))^{−β} / max_i w_i
            Compute TD-error δ_j = R_j + γ_j Q_target(S_j, argmax_a Q(S_j, a)) − Q(S_{j−1}, A_{j−1})
            Update transition priority p_j ← |δ_j|
            Accumulate weight-change Δ ← Δ + w_j · δ_j · ∇_θ Q(S_{j−1}, A_{j−1})
        end for
        Update weights θ ← θ + η · Δ, reset Δ = 0
        From time to time copy weights into the target network
    end if
    for p = 1 to N do
        Choose action A_t ~ π_θ(S_t)
    end for
end for
```
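A small sketch of the proportional prioritization used above. A flat priority array stands in for the sum-tree normally used for efficiency, and the function names are illustrative:

```python
import numpy as np

def sample_prioritized(priorities, k, alpha=0.6, beta=0.4):
    """Sample k indices with P(i) = p_i^α / Σ p^α and return the
    importance-sampling weights w_i = (N·P(i))^(−β), normalized by max w."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()
    idx = np.random.choice(len(probs), size=k, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()

def update_priorities(priorities, idx, td_errors, eps=1e-6):
    """After the gradient step, set p_j ← |δ_j| (plus a small ε so no
    transition's sampling probability collapses to zero)."""
    for i, delta in zip(idx, td_errors):
        priorities[i] = abs(delta) + eps
```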
4. Evaluation Results and Analysis with Maze Finding
4.1. Environment
4.2. Training Features
4.3. Implementation and Performance Evaluation
4.3.1. N-DQN-Based Implementation
4.3.2. Performance Evaluation and Discussion
5. Evaluation Results and Analysis with Ping-Pong
5.1. Environment
5.2. Training Features
5.3. Implementation and Performance Evaluation
5.3.1. N-DQN-Based Implementation
Algorithm 5: Ping-Pong Game's DQN Training

```
Procedure: training
Initialize: replay memory D to size N
Initialize: action-value function Q with random weights
Loop: episode = 1, M do
    Initialize state s_1
    for t = 1, T do
        With probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(s_t, a; θ_i)
        Execute action a_t in the emulator and observe r_t and s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in D
        Sample a minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        Set y_j := r_j                                   for terminal s_{j+1}
            y_j := r_j + γ max_{a'} Q(s_{j+1}, a'; θ_i)  for non-terminal s_{j+1}
        Perform a gradient step on (y_j − Q(s_j, a_j; θ_i))^2 with respect to θ
    end for
End Loop
```
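For concreteness, a PyTorch rendering of the update step in Algorithm 5. The framework choice and tensor layout are assumptions, not the paper's implementation; `a` holds action indices (LongTensor) and `done` is a float mask marking terminal s_{j+1}:

```python
import torch
import torch.nn as nn

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch  # minibatch sampled from D
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_j, a_j; θ)
    with torch.no_grad():
        max_next = q_net(s_next).max(dim=1).values  # max_a' Q(s_{j+1}, a')
        y = r + gamma * max_next * (1.0 - done)     # y_j targets
    loss = nn.functional.mse_loss(q_sa, y)          # (y_j − Q(s_j, a_j))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```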
5.3.2. Performance Evaluation and Discussion
6. Conclusions and Future Research
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Maze environment rules:

No. | Rule
---|---
1 | The actor can only move in four directions: up/down/left/right
2 | The actor cannot go through walls
3 | The actor must reach the destination in the shortest time possible
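Read as code, the rules above amount to a move table plus a wall check. Representing the maze as a set of blocked (x, y) cells is an assumption made for illustration:

```python
# Rule 1: only four directions; Rule 2: a move into a wall is a no-op.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(pos, action, walls):
    dx, dy = MOVES[action]
    nxt = (pos[0] + dx, pos[1] + dy)
    return pos if nxt in walls else nxt
```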
Ping-Pong environment rules:

No. | Rule
---|---
1 | The actor can perform five actions: up/down/left/right/nothing
2 | Up/down moves the paddle, and left/right pushes the ball
3 | When the ball touches the upper/lower edge of the game screen, it is refracted
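Rule 3's refraction can be sketched as a sign flip of the ball's vertical velocity at the screen edges (the coordinate convention and variable names are assumptions):

```python
def advance_ball(y, vy, height):
    """Advance the ball one tick, reflecting at the top/bottom edges."""
    if y <= 0 or y >= height:
        vy = -vy  # refraction: vertical velocity flips
    return y + vy, vy
```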
Maze training features:

Category | Contents
---|---
State | The coordinates of the actor's location (x/y)
Action | One of the four movement actions: up/down/left/right
Rewards | +1 on arrival at the goal; −(0.1/number of cells) for each movement step
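The reward row above translates directly into a small function (names are illustrative): +1 at the goal, otherwise a small per-step penalty scaled by maze size.

```python
def maze_reward(pos, goal, num_cells):
    return 1.0 if pos == goal else -(0.1 / num_cells)
```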
Maze-finding results, first experiment:

Model | Goal | Steps Needed | Time Required (400 Steps)
---|---|---|---
Q-Learning | Success | 50 steps | 46.7918 s
DQN | Fail | 500 steps | 57.8764 s
N-DQN | Success | 100 steps | 66.1712 s
Maze-finding results, second experiment:

Model | Goal | Steps Needed | Time Required (400 Steps)
---|---|---|---
Q-Learning | Fail | 1200 steps | 379.1281 s
DQN | Fail | steps | 427.3794 s
N-DQN | Success | 700 steps | 395.9581 s
Ping-Pong training features:

Category | Contents
---|---
State | Coordinates of the current position of the paddle (x/y); coordinates of the current position of the ball (x/y)
Action | One of the up/down/left/right actions
Rewards | +1 if the opponent's paddle misses the ball; −1 if the ball is missed by the actor's paddle; 0 otherwise
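The reward signal above is likewise a three-case function (the boolean flag names are assumptions for illustration):

```python
def pong_reward(opponent_missed, actor_missed):
    if opponent_missed:
        return 1.0
    if actor_missed:
        return -1.0
    return 0.0
```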