Hierarchical Reinforcement Learning for Multi-agent MOBA Game
Zhijian Zhang1, Haozheng Li2, Luo Zhang2, Tianyin Zheng2, Ting Zhang2, Xiong Hao2,3, Xiaoxin Chen2,3, Min Chen2,3, Fangxu Xiao2,3, Wei Zhou2,3
1 vivo AI Lab
{zhijian.zhang, haozheng.li, zhangluo, zhengtianyin, haoxiong}@vivo.com
arXiv:1901.08004v1 [cs.LG] 23 Jan 2019
Abstract
Although deep reinforcement learning has achieved great success recently, there are still challenges in Real Time Strategy (RTS) games. Due to their large state and action spaces, as well as hidden information, RTS games require macro strategies as well as micro-level manipulation to obtain satisfactory performance. In this paper, we present a novel hierarchical reinforcement learning model for mastering Multiplayer Online Battle Arena (MOBA) games, a sub-genre of RTS games. In this hierarchical framework, agents make macro strategies by imitation learning and perform micromanipulations through reinforcement learning. Moreover, we propose a simple self-learning method to obtain better sample efficiency for the reinforcement learning part, and extract global features by a multi-target detection method in the absence of a game engine or API. In 1v1 mode, our agent successfully learns to combat and defeat the built-in AI with a 100% win rate, and experiments show that our method can create a competitive multi-agent team for the mobile MOBA game King of Glory (KOG) in 5v5 mode.

Figure 1: (a) Screenshot from the 5v5 map of KOG. Players can get the positions of allies, towers and enemies in view, and know from the mini-map whether the jungle monsters are alive. From the screen, players can observe surrounding information, including which skills have been released and are being released. (b) Screenshot from the 1v1 map of KOG, known as solo mode.
1 Introduction
Since its success in playing Atari games [Mnih et al., 2015], AlphaGo [Silver et al., 2017], Dota 2 [OpenAI, 2018] and so on, deep reinforcement learning (DRL) has become a promising tool for game AI. Researchers can verify algorithms by conducting experiments in games quickly, and transfer this ability to the real world, for example to robotics control and recommendation services. Unfortunately, there are still many challenges in practice. Recently, more and more researchers have started to tackle real time strategy (RTS) games such as StarCraft and Defense of the Ancients (Dota), which are much more complex. Dota is a MOBA game with 5v5 and 1v1 multiplayer modes. To achieve victory in a MOBA game, the players need to control their single unit to destroy the enemies' crystal.

MOBA games take up more than 30% of online gameplay all over the world, including Dota, League of Legends, and King of Glory [Murphy, 2015]. Figure 1a shows a 5v5 map. KOG players control movement by using the left bottom steer button, and use skills by controlling the right bottom set of buttons. The upper-left corner shows the mini-map, with the blue markers indicating own towers and the red markers indicating the enemies' towers. Each player can obtain gold and experience by killing enemies, killing jungle monsters and destroying towers. The final goal of the players is to destroy the enemies' crystal. As shown in Figure 1b, there are only two players in the 1v1 map.

The main challenges of MOBA games for us, compared to Atari or AlphaGo, are as follows: (1) No game engine or API. We need to extract features by multi-target detection and run the game through the terminal, which implies low computational power. However, the computational complexity can be up to 10^20,000, while that of AlphaGo is about 10^250 [OpenAI, 2018]. (2) Delayed and sparse rewards. The final goal of the game is to destroy the enemies' crystal, which means that rewards are seriously delayed. Meanwhile, rewards are extremely sparse if we set them to -1/1 according to the final loss/win result. (3) Multi-agent. Cooperation and communication are crucially important for RTS games, especially in 5v5 mode.

In this paper, (1) we propose hierarchical reinforcement learning for the mobile MOBA game KOG, a novel algorithm which combines imitation learning with reinforcement learning. Imitation learning based on human experience is responsible for macro strategies, such as where to go and when to attack or defend, while reinforcement learning is in charge of micromanipulations, such as which skill to use and how to move in battle. (2) As we don't have a game engine or API, in order to get better sample efficiency and accelerate the training of the reinforcement learning part, we use a simple self-learning method which learns to compete with the agent's past good decisions and come up with an optimal policy. (3) A multi-target detection method is used to extract global features composing the state of reinforcement learning, given the lack of a game engine or API. (4) Dense reward function design and multi-agent communication. We design a dense reward function and use real-time, actual data to learn communication among agents [Sukhbaatar et al., 2016], which is a branch of multi-agent reinforcement learning research [Foerster et al., 2018]. Experiments show that our agent learns a good policy and trains faster than other reinforcement learning methods.
2 Related Work

2.1 RTS Games
There is a long history of studies on RTS games such as StarCraft [Ontanón et al., 2013] and Dota [OpenAI, 2018]. One practical way is the rule-based method; the rule-based bot SAIDA achieved the championship of SSCAIT recently. Based on experience of the game, rule-based bots can only choose predefined actions and policies from the beginning of the game, which is insufficient to deal with the large, real-time state space throughout the game, and they lack the ability to learn and improve. The Dota 2 AI created by OpenAI, named OpenAI Five, has achieved great success by using the proximal policy optimization algorithm along with well-designed rewards. However, OpenAI Five has used huge resources due to the lack of macro strategy.

Related work on macro strategy has also been done by Tencent AI Lab in the game King of Glory [Wu et al., 2018], and their 5-AI team achieved a 48% winning rate against human player teams ranked in the top 1% of the player ranking system. However, the 5-AI team used supervised learning, and the training data could be obtained from game replays processed by the game engine and API, which ran on the server. This method is not available to us because we don't have a game engine or API, and we need to run on the terminal.

2.2 Hierarchical Reinforcement Learning
Due to the large state space of the environment, traditional reinforcement learning methods such as Q-learning or DQN are difficult to apply. Hierarchical reinforcement learning [Barto and Mahadevan, 2003] solves this kind of problem by decomposing a high-dimensional target into several sub-targets which are easier to solve.

Hierarchical reinforcement learning has been explored in different environments. For games, somewhat related to our hierarchical architecture is that of [Sun et al., 2018], which designs macro strategies using prior knowledge of the game StarCraft (e.g., its TechTree), but with no imitation learning and no high-level expert guidance. Many novel hierarchical reinforcement learning algorithms have been proposed in recent years. One approach combining meta-learning with hierarchical learning is MLSH [Frans et al., 2017], which is mainly used for multi-task learning and transferring to new tasks. FeUdal Networks [Vezhnevets et al., 2017] designed a Manager module and a Worker module; the Manager operates at a lower temporal resolution and sets goals for the Worker. This architecture also has the ability of transfer and multi-task learning. However, it is complex and hard to tune.

2.3 Multi-agent Reinforcement Learning in Games
Multi-agent reinforcement learning (MARL) has certain advantages over single-agent learning. Different agents can complete tasks faster and better through experience sharing. There are challenges at the same time. For example, the computational complexity increases due to the larger state and action space compared to a single agent. Based on these challenges, MARL research mainly focuses on stability and adaptation. Simple applications of reinforcement learning to MARL are limited, for example by the absence of communication and cooperation among agents [Sukhbaatar et al., 2016], the lack of global rewards [Rashid et al., 2018], and the failure to consider enemies' strategies when learning a policy. Some recent studies relevant to these challenges have been done. [Foerster et al., 2017] addressed cooperative settings with shared rewards; the approach interprets the experience in the replay memory as off-environment data and marginalizes the action of a single agent while keeping the others unchanged. These methods enable the successful combination of experience replay with multi-agent learning. Similarly, [Jiang and Lu, 2018] proposed an attentional communication model based on an actor-critic algorithm for MARL, which learns to communicate and share information when making decisions; this approach can therefore be a complement to ours. The parameter sharing multi-agent gradient descent Sarsa(λ) (PS-MASGDS) algorithm [Shao et al., 2018] used a neural network to estimate the value function and proposed a reward function to balance the units' movement and attack in the game of StarCraft, which we can learn from.
3 Methods
In this section, we first introduce our hierarchical architecture, state representation and action definition. Then the network architecture and training algorithm are given. Finally, we discuss the reward function design and the self-learning method used in this paper.

3.1 Hierarchical Architecture
The hierarchical architecture is shown in Fig. 2. There are four types of macro actions, including attack, move, purchase and learning skills, and the macro action is selected by imitation learning (IL) and high-level expert guidance. Then the reinforcement learning algorithm chooses a specific action a according to policy π for micromanagement in state s. The encoded action is performed, and we get a reward r and the next observation s' from the KOG environment. We define the discounted return as R_π = Σ_{t=0}^{T} γ^t r_t, where γ ∈ [0,1] is a discount factor. The aim of the agents is to learn a policy that maximizes the expected discounted return, J = E_π[R_π].

Figure 2: Hierarchical Architecture. (The decision layer, a scheduler with imitation learning, selects macro actions — attack, move, purchase, learning skills — while the execution layer, reinforcement learning, issues refined actions such as attack, skills, movement and equipment purchase to the agents (heroes), which interact with the KOG environment and return observations and rewards.)

With this architecture, we relieve the heavy burden of dealing with massive actions directly, and reduce the complexity of exploration in sparse-reward scenes, such as going to the front line at the beginning of the game. Moreover, the tuples (s, a, r) collected by imitation learning are stored in the experience replay buffer and trained through the reinforcement learning network.

From the above, we can see several advantages of the hierarchical architecture. First, the use of macro actions decreases the dimensionality of the action space for reinforcement learning, and alleviates the problem of sparse rewards in macro scenes to some extent. Second, in some complicated situations such as team battles, a pure imitation learning algorithm cannot handle the game well, especially when we don't have a game engine or API. Last but not least, the hierarchical architecture lowers the required training resources and makes the design of the reward function easier. Meanwhile, we can also replace the imitation learning part with a high-level expert system, since the data in the imitation learning model is produced by high-level expert guidance.
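A single decision step of this two-level controller can be summarized in a few lines of Python. The following is a minimal sketch under stated assumptions: the grouping of the 17 micro actions under the four macro actions, and the callables `il_model` and `rl_policy`, are illustrative names of ours, not interfaces given in the paper.

```python
import numpy as np

# Hypothetical grouping of the 17 micro actions under the four macro actions;
# the paper does not specify the exact index layout.
MACRO_TO_MICRO = {
    "move":     list(range(0, 9)),    # 9 move directions incl. stay still
    "attack":   list(range(9, 15)),   # Skill-1/2/3, Attack, Flash, Restore
    "purchase": [15],
    "skills":   [16],                 # learning/upgrading skills
}

def select_action(state, il_model, rl_policy):
    """One hierarchical decision step (sketch).

    il_model(state)  -> macro action name from imitation learning / expert guidance
    rl_policy(state) -> probability vector over the 17 micro actions
    """
    macro = il_model(state)
    probs = np.asarray(rl_policy(state), dtype=np.float64).copy()
    mask = np.zeros_like(probs)
    mask[MACRO_TO_MICRO[macro]] = 1.0
    probs *= mask
    probs /= probs.sum()                      # renormalize within the macro action set
    return int(np.random.choice(len(probs), p=probs))   # refined micro action

def discounted_return(rewards, gamma=0.99):
    """R_pi = sum_t gamma^t r_t, the quantity the agents maximize in expectation."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

The masking-and-renormalization step mirrors lines 8-12 of Algorithm 1 below: micro actions outside the selected macro action set get zero probability before sampling.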
States                 Dimension   Type
Extracted Features     170         ∈ R
Mini-map Information   32×32×3     ∈ R
Big-map Information    64×64×5     ∈ R
Action                 17          one-hot

Table 1: The dimension and data type of our states.
3.2 State Representation and Action Definition

State Representation
How to represent the states of RTS games is an open problem without a universal solution. We construct a state representation as the input of the neural network from features extracted by multi-target detection, image information of the game, and global features for all agents, which have different dimensions and data types, as illustrated in Table 1. The big-map information consists of five gray frames from the different agents, and the mini-map information is one RGB image from the upper left corner of the screenshot.

The extracted features include friendly and enemy heroes' and towers' positions and health, our own hero's money and skills, and soldiers' positions in the field of vision, as shown in Fig. 3. Our input at the current step is composed of the current state information, the last step's information, and the last action, which has been proven to be useful for the learning process in reinforcement learning. Moreover, real-valued states are normalized to [0, 1].
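A minimal sketch of assembling these state pieces is given below. The field names, the normalization bounds, and the way current-step, last-step and last-action inputs are combined are our assumptions; the paper only specifies the shapes in Table 1 and that real-valued states are normalized to [0, 1].

```python
import numpy as np

def build_observation(frame, prev_frame, prev_action_id, num_actions=17):
    """Assemble the state pieces listed in Table 1 (sketch).

    `frame` / `prev_frame` are hypothetical dicts with:
      'features' : 170-dim vector from multi-target detection + global features
      'mini_map' : 32x32x3 RGB crop of the upper-left mini-map
      'big_map'  : 64x64x5 stack of gray frames from the five agents
    """
    def to_unit_range(x, lo, hi):
        # real-valued states are normalized to [0, 1]; (lo, hi) are placeholder
        # bounds for whatever ranges the real features actually have
        return (np.asarray(x, dtype=np.float32) - lo) / (hi - lo)

    last_action = np.zeros(num_actions, dtype=np.float32)
    last_action[prev_action_id] = 1.0                       # one-hot, dim 17

    return {
        "features":      to_unit_range(frame["features"], 0.0, 1.0e4),
        "prev_features": to_unit_range(prev_frame["features"], 0.0, 1.0e4),
        "last_action":   last_action,
        "mini_map":      to_unit_range(frame["mini_map"], 0.0, 255.0),
        "big_map":       to_unit_range(frame["big_map"], 0.0, 255.0),
    }
```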
Action Definition
In this game, players control movement by using the left bottom steer button, which is continuous over 360 degrees. In order to simplify the action space, we select 9 move directions: Up, Down, Left, Right, Lower-right, Lower-left, Upper-right, Upper-left, and Stay still. When the selected action is attack, it can be Skill-1, Skill-2, Skill-3, Attack, or the summoned skills Flash and Restore. Meanwhile, attacking the weakest enemy is our first choice whenever the attack action is available for a unit. Moreover, we can go to a position through path planning when choosing the last action.
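The discretized movement and the attack-target heuristic can be sketched as follows; the mapping of each direction to a steer-button angle and the enemy record fields are illustrative assumptions.

```python
import math

# 8 compass directions plus "stay still"; the angle assignment is our assumption.
MOVE_DIRECTIONS = {
    "stay": None, "right": 0, "upper_right": 45, "up": 90, "upper_left": 135,
    "left": 180, "lower_left": 225, "down": 270, "lower_right": 315,
}

def steer_vector(direction):
    """Convert a discrete move direction into a (dx, dy) steer-button offset."""
    angle = MOVE_DIRECTIONS[direction]
    if angle is None:
        return (0.0, 0.0)
    rad = math.radians(angle)
    return (math.cos(rad), math.sin(rad))

def choose_attack_target(enemies):
    """Heuristic from the paper: prefer the weakest enemy in view.
    `enemies` is a hypothetical list of dicts with 'hp' and 'in_view' fields."""
    visible = [e for e in enemies if e["in_view"]]
    return min(visible, key=lambda e: e["hp"]) if visible else None
```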
3.3 Network Architecture and Training Algorithm

Network Architecture
Tabular reinforcement learning methods such as Q-learning are limited in large state spaces. To solve this problem, the micro-level algorithm is, similarly to OpenAI Five, the proximal policy optimization (PPO) algorithm [Schulman et al., 2017]. The inputs of the convolutional network are the big-map and mini-map information, with shapes of 64×64×5 and 32×32×3 respectively. Meanwhile, the input of fully-connected layer 1 (fc1) is a 170-dimensional tensor of extracted features. We use the rectified linear unit (ReLU) activation function in the hidden layers:

f(x) = max(0, x)    (1)

where x is the output of the hidden layer. The activation function of the output layer is the softmax function, which outputs the probability of each action:

σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}    (2)

where j = 1, ..., K. Our model in the game KOG, including the inputs and architecture of the network and the output actions, is depicted in Fig. 3.
Figure 3: Network Architecture of the Hierarchical reinforcement learning model. (For each of the five agents, the extracted features, mini-map information and big-map information from steps t−1 and t are processed by shared convolutional layers (conv1–conv4) and fully-connected layers (fc1–fc6); imitation learning supplies the macro actions, and softmax layers output the actions for agents 1 to 5.)
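The following PyTorch sketch shows one way to wire the described inputs — two convolutional image branches, a fully-connected branch for the extracted-feature vector, and a softmax head over the 17 actions. The layer counts, kernel sizes and hidden widths are illustrative assumptions; the paper gives only the input shapes, the activations, and the overall structure of Fig. 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KOGPolicyNet(nn.Module):
    """Sketch of the Fig. 3 policy network (assumed hyperparameters).

    Expects channel-first tensors: big_map (N, 5, 64, 64), mini_map (N, 3, 32, 32),
    features (N, 170).
    """
    def __init__(self, feat_dim=170, num_actions=17):
        super().__init__()
        self.big_conv = nn.Sequential(
            nn.Conv2d(5, 16, 5, stride=2), nn.ReLU(),    # 64x64x5 -> 30x30x16
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),   # -> 14x14x32
        )
        self.mini_conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),    # 32x32x3 -> 15x15x16
        )
        self.fc1 = nn.Linear(feat_dim, 128)              # vector-feature branch (fc1)
        fused = 14 * 14 * 32 + 15 * 15 * 16 + 128
        self.fc2 = nn.Linear(fused, 256)
        self.policy_head = nn.Linear(256, num_actions)   # softmax over 17 actions
        self.value_head = nn.Linear(256, 1)              # value estimate for PPO

    def forward(self, big_map, mini_map, features):
        b = self.big_conv(big_map).flatten(1)
        m = self.mini_conv(mini_map).flatten(1)
        v = F.relu(self.fc1(features))
        h = F.relu(self.fc2(torch.cat([b, m, v], dim=1)))
        return F.softmax(self.policy_head(h), dim=-1), self.value_head(h)
```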
Training Algorithm
We propose a hierarchical RL algorithm for multi-agent learning, and the training process is presented in Algorithm 1. First, we initialize our controller policy and global state. Then each unit takes action a_t and receives reward r_{t+1} and next state s_{t+1}. From state s_{t+1}, we obtain both the macro action through imitation learning and the micro action from reinforcement learning. In order to choose action a_{t+1} from macro action A_{t+1}, we normalize the action probabilities. At the end of each iteration, we use the experience replay samples to update the parameters of the policy. In order to balance the trade-off between exploration and exploitation, we take the entropy loss and the self-learning loss into account to encourage exploration. Our loss is as follows:

L_t(θ) = E_t[w_1 L^v_t(θ) + w_2 N_t(π, a_t) + L^p_t(θ) + w_3 S_t(π, a_t)]    (3)

where w_1, w_2, w_3 are the weights of the value loss, the entropy loss and the self-learning loss that we need to tune, N_t denotes the entropy loss, and S_t denotes the self-learning loss. L^v_t(θ) and L^p_t(θ) are defined as follows:

L^v_t(θ) = E_t[(r(s_t, a_t) + V_t(s_t) − V_t(s_{t+1}))^2]    (4)

L^p_t(θ) = E_t[min(r_t(θ) A_t, clip(r_t(θ), 1 − ε, 1 + ε) A_t)]    (5)

where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t), and A_t is the advantage, computed as the difference between the return and the value estimate.

Algorithm 1 Hierarchical RL Training Algorithm
Input: reward function R_n, max episodes M, function IL(s) denoting the imitation learning model.
Output: hierarchical reinforcement learning neural network.
1:  Initialize the controller policy π and the global state s_g shared among our agents;
2:  for episode = 1, 2, ..., M do
3:      Initialize s_t, a_t;
4:      repeat
5:          Take action a_t, receive reward r_{t+1} and next state s_{t+1};
6:          Choose macro action A_{t+1} from s_{t+1} according to IL(s = s_{t+1});
7:          Choose micro action a_{t+1} from A_{t+1} according to the output of RL in state s_{t+1};
8:          if a^i_{t+1} ∉ A_{t+1}, where i = 0, ..., 16, then
9:              P(a^i_{t+1} | s_{t+1}) = 0;
10:         else
11:             P(a^i_{t+1} | s_{t+1}) = P(a^i_{t+1} | s_{t+1}) / Σ_i P(a^i_{t+1} | s_{t+1});
12:         end if
13:         Collect samples (s_t, a_t, r_{t+1});
14:         Update the policy parameters θ to maximize the expected return;
15:     until s_t is terminal
16: end for
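A minimal sketch of the combined objective of Eqs. (3)-(5) follows. The weight values and the sign convention (the paper writes Eq. (3) as a plain weighted sum) are our assumptions; the self-learning term S_t is computed as in Section 3.4.

```python
import torch

def hrl_loss(ratio, advantage, rewards, values, next_values,
             action_probs, self_learning_loss, w1=0.5, w2=0.01, w3=0.1, eps=0.2):
    """Combined loss of Eqs. (3)-(5) (sketch; weights are placeholders).

    ratio     = pi_theta(a|s) / pi_theta_old(a|s)   (r_t(theta))
    advantage = return - value estimate              (A_t)
    """
    # Eq. (5): clipped PPO policy surrogate (to be maximized)
    policy_obj = torch.min(ratio * advantage,
                           torch.clamp(ratio, 1 - eps, 1 + eps) * advantage).mean()
    # Eq. (4): value loss exactly as written in the paper
    value_loss = ((rewards + values - next_values) ** 2).mean()
    # Entropy term N_t to encourage exploration (to be maximized)
    entropy = -(action_probs * torch.log(action_probs + 1e-8)).sum(dim=-1).mean()
    # Eq. (3): minimizing this quantity maximizes the surrogate and the entropy;
    # the sign placement is our reading of the paper's weighted sum.
    return w1 * value_loss - w2 * entropy - policy_obj + w3 * self_learning_loss
```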
3.4 Reward Design and Self-learning

Reward Design
The reward function is significant for reinforcement learning, and good learning results of an agent mainly depend on diverse rewards. The final goal of the game is to destroy the enemies' crystal. If our reward were only based on the final result, it would be extremely sparse, and the seriously delayed reward would make it difficult for the agent to learn fast. Obviously, a dense reward gives more positive or negative feedback to the agent, and can help it learn faster and better. As we don't have a game engine or API, the damage dealt by an agent is not available to us. In our experiment, all agents receive a reward with two parts, a self-reward and a global-reward. The self-reward contains the agent's own money and health point (HP) loss/gain, while the global-reward includes tower losses and the deaths of friendly/enemy players:

r_t = ρ_1 × r_self + ρ_2 × r_global
    = ρ_1((money_t − money_{t−1}) f_m + (HP_t − HP_{t−1}) f_H) + ρ_2(towerloss_t × f_t + playerdeath_t × f_d)    (6)

where towerloss_t is positive when an enemy tower is destroyed and negative when an own tower is destroyed, and the same holds for playerdeath_t; f_m is the coefficient of the money term, and likewise f_H, f_t and f_d; ρ_1 is the weight of the self-reward and ρ_2 is the weight of the global-reward. The reward function is effective for training, and the results are shown in the experiment section.
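Eq. (6) translates directly into a per-step reward function; the coefficient values below are placeholders, since the paper does not report the tuned values.

```python
def step_reward(money, prev_money, hp, prev_hp, tower_loss, player_death,
                f_m=0.01, f_h=0.5, f_t=5.0, f_d=2.0, rho1=0.5, rho2=0.5):
    """Dense reward of Eq. (6) (sketch; all coefficients are illustrative).

    tower_loss   > 0 when an enemy tower falls, < 0 when an own tower falls;
    player_death follows the same sign convention for enemy/friendly deaths.
    """
    r_self = (money - prev_money) * f_m + (hp - prev_hp) * f_h
    r_global = tower_loss * f_t + player_death * f_d
    return rho1 * r_self + rho2 * r_global
```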
Self-learning
There are many kinds of self-learning methods for reinforcement learning, such as Self-Imitation Learning (SIL) proposed by [Oh et al., 2018] and Episodic Memory Deep Q-Networks (EMDQN) presented by [Lin et al., 2018]. SIL is applicable to actor-critic architectures, while EMDQN combines episodic memory with DQN. However, considering the better sample efficiency and easier tuning of the system, we migrate EMDQN to our reinforcement learning algorithm, PPO. The loss of the self-learning part can be written as follows:

S_t(π, a_t) = E_t[(V_{t+1} − V_H)^2] + E_t[min(r_t(θ) A_{Ht}, clip(r_t(θ), 1 − ε, 1 + ε) A_{Ht})]    (7)

where the memory target V_H is the best value from the memory buffer, and A_{Ht} is the best advantage from it:

V_H = max(max_i(R_i(s_t, a_t)), R(s_t, a_t)),  if (s_t, a_t) ∈ memory
V_H = R(s_t, a_t),                             otherwise    (8)

A_{Ht} = V_H − V_{t+1}(s_{t+1})    (9)

where i ∈ [1, 2, ..., E], and E represents the number of episodes in the memory buffer that the agent has experienced.
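A compact sketch of this memory-based target and loss is given below. How (s, a) pairs are keyed in the memory buffer is an implementation assumption; the paper only defines V_H and A_Ht through Eqs. (8) and (9).

```python
import torch

class EpisodicMemory:
    """Minimal episodic memory for the self-learning loss (sketch)."""
    def __init__(self):
        self.best_return = {}          # (state_key, action) -> best return seen so far

    def update(self, state_key, action, ret):
        key = (state_key, action)
        self.best_return[key] = max(ret, self.best_return.get(key, float("-inf")))

    def target(self, state_key, action, current_return):
        # Eq. (8): V_H = max(best stored return, current return) if (s, a) is in
        # memory, otherwise just the current return
        key = (state_key, action)
        if key in self.best_return:
            return max(self.best_return[key], current_return)
        return current_return

def self_learning_loss(v_next, v_h, ratio, eps=0.2):
    """Eq. (7): value term towards the memory target V_H plus a clipped policy
    term using the memory advantage A_Ht = V_H - V_{t+1} (Eq. (9))."""
    a_h = v_h - v_next
    value_term = ((v_next - v_h) ** 2).mean()
    policy_term = torch.min(ratio * a_h,
                            torch.clamp(ratio, 1 - eps, 1 + eps) * a_h).mean()
    return value_term + policy_term
```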
4 Experiments
In this section, we introduce the experimental setting first. Then we evaluate the performance of our algorithms in two environments: (i) a 1v1 map with entry-level, easy-level and medium-level built-in AI (the difficult level is not included), and (ii) a challenging 5v5 map. For better comprehension, we analyze the average rewards and win rates during training.

4.1 Setting
The experimental setting includes a terminal experiment platform and a GPU cluster training platform. In order to increase the diversity and quantity of samples, we use 10 vivo X23 and NEX phones for an agent to collect distributed data. Meanwhile, we need to maintain the consistency of all the distributed phones during training. In the training process, we transmit the data and share the parameters of the network through gRPC. The categories and accuracy of the features obtained by multi-target detection are shown in Table 2. In our experiment, the speed of taking an action is about 150 APM, compared to 180 APM for a high-level player, which is enough for this game. For moving to a given position, we use the A-star path planning algorithm.

Category        Training Set   Testing Set   Precision
Own Soldier     2677           382           0.6158
Enemy Soldier   2433           380           0.6540
Own Tower       485            79            0.9062
Enemy Tower     442            76            0.9091
Own Crystal     95             17            0.9902
Enemy Crystal   152            32            0.8425

Table 2: The accuracy of multi-target detection.

4.2 1v1 mode of game KOG
As shown in Figure 1b, there are one agent and one enemy player in the 1v1 map. We need to destroy the enemies' tower first and then destroy the crystal to get the final victory. We report the number of episodes needed to win when our agent fights against different levels of built-in AI and different genres of internal AI.

Episodes until win
Figure 4 shows the number of episodes our agent Angela needs to defeat the opponents. The higher the level of the built-in AI, the longer our agent needs to train. Moreover, for different kinds of enemies, the training time differs as well. The results of our AI playing against AI without macro strategy, without multi-agent, without global reward and without the self-learning method are listed in Table 3: 50 games are played against AI.1 (without macro strategy), AI.2 (without multi-agent), AI.3 (without global reward) and AI.4 (without the self-learning method), and the win rates are 80%, 50%, 52% and 58% respectively.

Scenarios   AI.1   AI.2   AI.3   AI.4
1v1 mode    80%    50%    52%    58%
5v5 mode    82%    68%    66%    60%

Table 3: Win rates playing against AI.1: AI without macro strategy, AI.2: without multi-agent, AI.3: without global reward, and AI.4: without the self-learning method.
Figure 4: The number of episodes needed to train our model against different levels of internal AI when combatting with Support, Mage, Shooter, Assassin and Warrior heroes.

Average rewards
Generally speaking, the aim of our agent is to defeat the enemies as soon as possible. Figure 5 illustrates the average rewards of our agent Angela in 1v1 mode when combatting different types of enemies. In the beginning, the rewards are low because the agent is still a beginner and doesn't have enough learning experience. However, our agent learns gradually and becomes more and more experienced. When the number of training episodes reaches about 100, the rewards per step become positive overall and our agent starts to have some advantage in battle. There are also some decreases in rewards when facing high-level internal AI, because the agent is not able to defeat the Warrior at first. To sum up, the average rewards increase markedly and become smooth after about 600 episodes.

Figure 5: The average rewards of our agent in 1v1 mode during training (entry-level, easy-level and medium-level AI).
4.3 5v5 mode of game KOG
As shown in Fig. 1a, there are five agents and five enemy players in the 5v5 map. What we need to do is destroy the enemies' crystal. In this scenario, we train our agents against the internal AI, and each agent holds its own model. In order to analyze the results during training, we illustrate the win rates and average rewards in Fig. 6 and Fig. 7.

Win rates
We show the win rates in Figure 6; there are three different levels of built-in AI that our agents combat with. When fighting against the entry-level internal AI, our agents learn fast and the win rate finally reaches 100%. When training with the medium-level AI, the learning process is slow and our agents can't win until 100 episodes; in this mode, the win rate is about 55% in the end. This is likely due to the fact that our agents can hardly obtain dense global rewards in games against high-level AI, which makes cooperation in team fights hard. One approach, the supervised learning method from Tencent AI Lab, obtains a 100% win rate [Wu et al., 2018]; however, that method used about 300 thousand game replays with the advantage of an API. Another approach is the PPO algorithm that OpenAI Five used [OpenAI, 2018], without macro strategy, which achieves only about a 22% win rate when combatting the entry-level internal AI. Meanwhile, the results of our AI playing against AI without macro strategy, without multi-agent, without global reward and without the self-learning method are listed in Table 3. These results indicate the importance of each component of our hierarchical reinforcement learning algorithm.

Figure 6: The win rates of our agents in 5v5 mode against different levels of internal AI. (Curves: HRL against entry-, easy- and medium-level AI, plus a PPO baseline against entry-level AI and a supervised learning baseline against medium-level AI.)

Average rewards
As shown in Figure 7, the total rewards are divided by the episode steps of the combat. For all three levels, the average rewards increase overall. Against the medium-level internal AI, it is hard to learn well at first. However, the rewards grow after 500 episodes and become smooth after almost 950 episodes, although there are still some losses during training. This is reasonable, given that we encounter different lineups of internal AI, which lead to different levels of difficulty.

Figure 7: The average rewards of our agents in 5v5 mode during training.
5 Conclusion
In this paper, we proposed hierarchical reinforcement learning for the multi-agent MOBA game KOG, which learns macro strategies through imitation learning and takes micro actions by reinforcement learning. In order to obtain better sample efficiency, we presented a simple self-learning method, and we extracted global features as part of the state input by multi-target detection. Our results showed that hierarchical reinforcement learning is very helpful for this MOBA game.

In addition, there is still some work to do in the future. Cooperation and communication of the multi-agent team are learned by sharing the network and constructing an efficient global reward function and state representation. Although our agents can successfully learn some cooperation strategies, we are going to explore more effective methods for multi-agent collaboration. Meanwhile, the implementation of this hierarchical reinforcement learning architecture encourages us to go further in the 5v5 mode of the game King of Glory, especially when our agents compete with human beings.

Acknowledgments
We would like to thank our colleagues at vivo AI Lab, particularly Jingwei Zhao and Guozhi Wang, for their helpful comments on the writing of this paper. We are also very grateful for the support from vivo AI Lab.
References
[Barto and Mahadevan, 2003] Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.
[Foerster et al., 2017] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. arXiv preprint arXiv:1702.08887, 2017.
[Foerster et al., 2018] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[Frans et al., 2017] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.
[Jiang and Lu, 2018] Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. arXiv preprint arXiv:1805.07733, 2018.
[Lin et al., 2018] Zichuan Lin, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. Episodic memory deep Q-networks. arXiv preprint arXiv:1805.07603, 2018.
[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[Murphy, 2015] M Murphy. Most played games: November 2015 – Fallout 4 and Black Ops III arise while StarCraft II shines, 2015.
[Oh et al., 2018] Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
[Ontanón et al., 2013] Santiago Ontanón, Gabriel Synnaeve, Alberto Uriarte, Florian Richoux, David Churchill, and Mike Preuss. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293–311, 2013.
[OpenAI, 2018] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.
[Rashid et al., 2018] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.
[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[Shao et al., 2018] Kun Shao, Yuanheng Zhu, and Dongbin Zhao. StarCraft micromanagement with reinforcement learning and curriculum transfer learning. IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
[Silver et al., 2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[Sukhbaatar et al., 2016] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
[Sun et al., 2018] Peng Sun, Xinghai Sun, Lei Han, Jiechao Xiong, Qing Wang, Bo Li, Yang Zheng, Ji Liu, Yongsheng Liu, Han Liu, et al. TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game. arXiv preprint arXiv:1809.07193, 2018.
[Vezhnevets et al., 2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
[Wu et al., 2018] Bin Wu, Qiang Fu, Jing Liang, Peng Qu, Xiaoqian Li, Liang Wang, Wei Liu, Wei Yang, and Yongsheng Liu. Hierarchical macro strategy model for MOBA game AI. arXiv preprint arXiv:1812.07887, 2018.