Hierarchical Reinforcement Learning for Multi-agent MOBA Game


Zhijian Zhang1, Haozheng Li2, Luo Zhang2, Tianyin Zheng2, Ting Zhang2, Xiong Hao2,3, Xiaoxin Chen2,3, Min Chen2,3, Fangxu Xiao2,3, Wei Zhou2,3
vivo AI Lab
{zhijian.zhang, haozheng.li, zhangluo, zhengtianyin, haoxiong}@vivo.com
arXiv:1901.08004v1 [cs.LG] 23 Jan 2019

Abstract

Although deep reinforcement learning has achieved great success recently, there are still challenges for it in Real Time Strategy (RTS) games. Due to their large state and action spaces, as well as hidden information, RTS games require macro strategies as well as micro-level manipulation to obtain satisfactory performance. In this paper, we present a novel hierarchical reinforcement learning model for mastering Multiplayer Online Battle Arena (MOBA) games, a sub-genre of RTS games. In this hierarchical framework, agents make macro strategies by imitation learning and perform micromanipulations through reinforcement learning. Moreover, we propose a simple self-learning method to obtain better sample efficiency for the reinforcement learning part, and we extract global features with a multi-target detection method in the absence of a game engine or API. In 1v1 mode, our agent successfully learns to combat and defeat the built-in AI with a 100% win rate, and experiments show that our method can create a competitive multi-agent team for the mobile MOBA game King of Glory (KOG) in 5v5 mode.

[Figure 1: (a) 5v5 map; (b) 1v1 map. (a) Screenshot from the 5v5 map of KOG. Players can get the positions of allies, towers, and enemies in view, and learn from the mini-map whether jungle monsters are alive. From the screen, players can observe surrounding information, including which skills have been or are being released. (b) Screenshot from the 1v1 map of KOG, known as solo mode.]

1 Introduction

Since its success in playing the game Atari [Mnih et al., 2015], AlphaGo [Silver et al., 2017], Dota 2 [OpenAI, 2018] and so on, deep reinforcement learning (DRL) has become a promising tool for game AI. Researchers can verify algorithms quickly by conducting experiments in games and transfer this ability to the real world, such as robotics control, recommendation services and so on. Unfortunately, there are still many challenges in practice. Recently, more and more researchers have started to tackle real-time strategy (RTS) games such as StarCraft and Defense of the Ancients (Dota), which are much more complex. Dota is a MOBA game which includes 5v5 and 1v1 multiplayer modes. To achieve victory in a MOBA game, each player needs to control their single unit to destroy the enemies' crystal.

MOBA games take up more than 30% of online gameplay all over the world, including Dota, League of Legends, and King of Glory [Murphy, 2015]. Figure 1a shows a 5v5 map. KOG players control movements with the left-bottom steer button and use skills with the right-bottom set of buttons. The upper-left corner shows the mini-map, with blue markers indicating the player's own towers and red markers indicating the enemies' towers. Each player can obtain gold and experience by killing enemies and jungle monsters and by destroying towers. The final goal of each player is to destroy the enemies' crystal. As shown in Figure 1b, there are two players in total on the 1v1 map.

The main challenges of MOBA games for us, compared to Atari or AlphaGo, are as follows. (1) No game engine or API. We need to extract features by multi-target detection and run the game through the terminal, which implies low computational power. However, the computational complexity can be up to 10^20,000, while that of AlphaGo is about 10^250 [OpenAI, 2018]. (2) Delayed and sparse rewards. The final goal of the game is to destroy the enemies' crystal, which means that rewards are seriously delayed. Meanwhile, rewards are really sparse if we set them to -1/1 according to the final loss/win result. (3) Multi-agent. Cooperation and communication are crucially important for RTS games, especially in 5v5 mode.

In this paper, (1) we propose hierarchical reinforcement learning for the mobile MOBA game KOG, a novel algorithm which combines imitation learning with reinforcement learning. Imitation learning based on human experience is responsible for macro strategies, such as where to go and when to attack or defend, while reinforcement learning is in charge of micromanipulations, such as which skill to use and how to move in battle. (2) As we have no game engine or API, in order to obtain better sample efficiency and accelerate the training of the reinforcement learning part, we use a simple self-learning method which learns to compete with
agent’s past good decisions and come up with an optimal pol-     architecture also has the ability of transferring and multitask
icy. (3) A multi-target detection method is used to extract      learning. However, it’s complex and hard-to-tune.
global features composing the state of reinforcement learning
in case of lacking of game engine or API. (4) Dense reward       2.3    Multi-agent Reinforcement Learning in Games
function design and multi-agent communication. Designing         Multi-agent reinforcement learning(MARL) has certain ad-
a dense reward function and using real-time and actual data      vantages over single agent. Different agents can complete
to learn communication with each other [Sukhbaatar et al.,       tasks faster and better through experience sharing. There are
2016], which is a branch of multi-agent reinforcement learn-     some challenges at the same time. For example, the com-
ing research [Foerster et al., 2018]. Experiments show that      putational complexity increases due to larger state and ac-
our agent learns good policy which trains faster than other      tion space compared to single agent. Based on the above
reinforcement learning methods.                                  challenges, MARL is mainly focus on stability and adaption.
                                                                 Simple applications of reinforcement learning to MARL is
2     Related Work                                               limited, such as no communication and cooperation among
                                                                 agents [Sukhbaatar et al., 2016], lack of global rewards
2.1    RTS Games                                                 [Rashid et al., 2018], and failure to consider enemies’ strate-
There has been a history of studies on RTS games such as         gies when learning policy. Some recent studies relevant to
StarCraft [Ontanón et al., 2013] and Dota [OpenAI, 2018].       the challenges have been done. [Foerster et al., 2017] in-
One practical way using rule-based method by bot SAIDA           troduced a concentrated criticism of the cooperative settings
achieved champion on SSCAIT recently. Based on the ex-           with shared rewards. The approach interprets the experience
perience of the game, rule-based bots can only choose the        in the replay memory as off-environment data and marginal-
predefined action and policy at the beginning of the game,       ize the action of a single agent while keeping others un-
which is insufficient to deal with large and real time state     changed. These methods enable the successful combination
space throughout the game, and it hasn’t the ability of learn-   of experience replay with multi-agent. Similarly, [Jiang
ing and growing up. Dota2 AI created by OpenAI, named            and Lu, 2018] proposed an attentional communication model
OpenAI Five, has made great success by using proximal pol-       based on actor-critic algorithm for MARL, which learns to
icy optimization algorithm along with well-designed rewards.     communicate and share information when making decision.
However, OpenAI Five has used huge resources due to lack         Therefore, this approach can be a complement for us. Pa-
of macro strategy.                                               rameter sharing multi-agent gradient descent Sarsa(λ) (PS-
   Related work has also been done in macro strategy by Ten-     MASGDS) algorithm [Shao et al., 2018] used a neural net-
cent AI Lab in game King of Glory [Wu et al., 2018], and         work to estimate the value function and proposed a reward
their 5-AI team achieved 48% winning rate against human          function to balance the units move and attack in the game of
player teams which are ranked top 1% in the player ranking       StarCraft, which can be learned from for us.
system. However, 5-AI team used supervised learning and the
training data can be obtained from game replays processed by     3     Methods
game engine and API, which ran on the server. This method
is not available for us because we don’t have game engine or     In this section, we introduce our hierarchical architecture,
API, and we need to run on the terminal.                         state representation and action definition firstly. Then the
                                                                 network architecture and training algorithm are given. At
2.2    Hierarchical Reinforcement Learning                       last, we discuss the reward function design and self-learning
                                                                 method used in this paper.
Due to large state space in the environment, traditional rein-
forcement learning method such as Q-learning or DQN is dif-      3.1    Hierarchical Architecture
ficult to handle. Hierarchical reinforcement learning [Barto
and Mahadevan, 2003] solves this kind of problem by decom-       The hierarchical architecture is shown in Fig.2. There are
posing a high dimensional target into several sub-target which   four types of macro actions including attack, move, purchase
is easier to solve.                                              and learning skills, and it’s selected by imitation learning (IL)
   Hierarchical reinforcement learning has been explored in      and high-level expert guidance. Then reinforcement learning
different environments. As for games, somewhat related to        algorithm chooses specific action a according policy π for
our hierarchical architecture is that of [Sun et al., 2018],     making micromanagement in state s. The encoded action is
which designs macro strategy using prior knowledge of game       performed and we can get reward r and next observation s
StarCraft (e.g. TechTree), but no imitation learning and no      from KOG environment. Defining the discounted return as
high-level expert guidance. There have been many novel hi-       Rπ = t=0 γ t rt , where γ ∈[0,1] is a discount factor. The
erarchical reinforcement learning algorithms come up with in     aim of agents is to learn a policy that maximizes the expected
recent years. One approach of combining meta-learning with       discounted returns, J = Eπ [Rπ ].
a hierarchical learning is MLSH [Frans et al., 2017], which         With this architecture, we relieve the heavy burden of deal-
is mainly used for multi-task and transferring to new tasks.     ing with massive actions directly, and the complexity of ex-
FeUdal Networks [Vezhnevets et al., 2017] designed a Man-        ploration for some sparse rewards scenes such as going to
ager module and a Worker module. The Manager operates at         the front at the beginning of the game. Moreover, the tuple
a lower temporal resolution and sets goals to Worker. This       (s,a,r) collected by imitation learning will be stored in ex-
[Figure 2: Hierarchical Architecture. In the decision layer, a scheduler with imitation learning selects a macro action (Attack, Move, Purchase, or Learning Skills); the reinforcement learning module refines it into a concrete action (attack and skills, movement, equipment purchase, skill 1, 2, 3, ...). In the execution layer, the agents (heroes) perform the refined action in the KOG environment and receive observations and rewards.]
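To make the decision flow of Figure 2 concrete, here is a minimal sketch of how a macro action chosen by imitation learning can mask and renormalize the micro-action probabilities produced by the RL policy, in the spirit of steps 8-12 of Algorithm 1. All names and the macro-to-micro mapping are our illustrative assumptions, not the paper's implementation.

```python
# Sketch: a macro action from IL masks the RL policy's micro-action
# probabilities, which are then renormalized (hypothetical names).

MACRO_TO_MICRO = {                      # assumed mapping, for illustration
    "attack": [0, 1, 2, 3],             # normal attack, skill 1/2/3
    "move":   [4, 5, 6, 7, 8],          # a few of the 9 move directions
    "purchase": [9],
    "learn_skill": [10],
}

def select_micro_action(macro_action, policy_probs):
    """Zero out micro actions outside the chosen macro action,
    renormalize the rest, and pick the most probable one."""
    allowed = MACRO_TO_MICRO[macro_action]
    masked = [p if i in allowed else 0.0 for i, p in enumerate(policy_probs)]
    total = sum(masked)
    if total == 0.0:                    # fall back to uniform over allowed
        masked = [1.0 / len(allowed) if i in allowed else 0.0
                  for i in range(len(policy_probs))]
    else:
        masked = [p / total for p in masked]
    return max(range(len(masked)), key=lambda i: masked[i]), masked
```

For example, with the macro action "move", only the move-direction probabilities survive, so the RL policy never wastes exploration on actions the scheduler has ruled out.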

From the above, we can see several advantages of the hierarchical architecture. First, the use of macro actions decreases the dimensionality of the action space for reinforcement learning and alleviates the problem of sparse rewards in macro scenes to some extent. Second, in some complicated situations, such as team battles, a pure imitation learning algorithm is unable to cope well, especially when we have no game engine or API. Last but not least, the hierarchical architecture lowers the required training resources and eases the design of the reward function. Meanwhile, we can also replace the imitation learning part with a high-level expert system, since the data for the imitation learning model is produced under high-level expert guidance.

3.2 State Representation and Action Definition

State Representation
How to represent the states of RTS games is an open problem without a universal solution. We construct the state representation used as input to the neural network from features extracted by multi-target detection, image information of the game, and global features for all agents, which have different dimensions and data types, as illustrated in Table 1. Big-map information consists of five gray-scale frames from the different agents, and mini-map information is one RGB image of the upper-left corner of the screenshot.

States                  Dimension    Type
Extracted Features      170          ∈ R
Mini-map Information    32×32×3      ∈ R
Big-map Information     64×64×5      ∈ R
Action                  17           one-hot

Table 1: The dimension and data type of our states

The extracted features include friendly and enemy heroes' and towers' positions and health, our own hero's money and skills, and soldiers' positions in the field of vision, as shown in Fig. 3. The input at the current step is composed of the current state information, the last step's information, and the last action, which has been proven useful for the learning process in reinforcement learning. Moreover, states with real values are normalized to [0, 1].

Action Definition
In this game, players control movement using the left-bottom steer button, which is continuous over 360 degrees. To simplify the action space, we select 9 move directions: Up, Down, Left, Right, Lower-right, Lower-left, Upper-right, Upper-left, and Stay still. When the selected macro action is attack, the micro action can be Skill-1, Skill-2, Skill-3, Attack, or the summoned skills Flash and Restore. Meanwhile, attacking the weakest enemy is our first choice whenever the attack action is available for a unit. Moreover, we can move to a position through path planning when choosing the last action.

3.3 Network Architecture and Training Algorithm

Network Architecture
Tabular reinforcement learning such as Q-learning is limited in large state spaces. To address this, the micro-level algorithm design is similar to that of OpenAI Five: the proximal policy optimization (PPO) algorithm [Schulman et al., 2017]. The inputs of the convolutional network are big-map and mini-map information with shapes of 64×64×5 and 32×32×3, respectively. Meanwhile, the input of fully-connected layer 1 (fc1) is a 170-dimensional tensor of extracted features. We use the rectified linear unit (ReLU) activation function in the hidden layers, defined as

f(x) = max(0, x)    (1)

where x is the input to the activation. The output layer's activation function is the Softmax function, which outputs the probability of each action:

σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}    (2)

where j = 1, ..., K. Our model for the game KOG, including the inputs and architecture of the network and the output actions, is depicted in Fig. 3.
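As a minimal illustration of Eqs. (1) and (2), the two activation functions can be written as follows. This is a plain-Python sketch; the actual model applies them inside convolutional and fully-connected layers.

```python
import math

def relu(x):
    """Rectified linear unit, Eq. (1): f(x) = max(0, x), elementwise."""
    return [max(0.0, v) for v in x]

def softmax(z):
    """Softmax, Eq. (2): turns K logits into action probabilities."""
    m = max(z)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

A forward pass through a hidden layer followed by the output layer then looks like `probs = softmax(logits)` with `logits` built from ReLU-activated hidden units; the returned probabilities always sum to 1.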
[Figure 3: Network Architecture of the hierarchical reinforcement learning model. For each agent, image features (extracted features, mini-map and big-map information from steps t-1 and t) are concatenated and passed through shared convolutional layers (conv1-conv4, then flattened), while vector features (enemy buildings, enemy units, own buildings, own player, other features) pass through fully-connected layers (fc1, fc2). Further fully-connected layers (fc4, fc5) produce the actions of agents 1 to 5, with macro actions provided by imitation learning for each agent.]
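The micro-level policy is optimized with PPO's clipped surrogate objective (Eq. (5) below). A self-contained sketch of that single term, with hypothetical argument names:

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Clipped surrogate term of Eq. (5):
    min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t),
    where ratio = pi_theta(a|s) / pi_theta_old(a|s)."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)
```

With a large ratio and positive advantage the term is capped at (1 + ε)·A_t, so a single update cannot move the policy too far from the old one; with negative advantage the clipping likewise bounds the penalty.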

Training Algorithm
We propose a hierarchical RL algorithm for multi-agent learning; the training process is presented in Algorithm 1. Firstly, we initialize the controller policy and the global state. Then each unit takes action a_t and receives reward r_{t+1} and next state s_{t+1}. From state s_{t+1}, we obtain both the macro action through imitation learning and the micro action from reinforcement learning. In order to choose action a_{t+1} from macro action A_{t+1}, we normalize the action probabilities. At the end of each iteration, we use the experience replay samples to update the parameters of the policy. To balance the trade-off between exploration and exploitation, we take the entropy loss and the self-learning loss into account to encourage exploration. Our loss is as follows:

L_t(θ) = E_t[w_1 L^v_t(θ) + w_2 N_t(π, a_t) + L^p_t(θ) + w_3 S_t(π, a_t)]    (3)

where w_1, w_2, w_3 are the weights of the value loss, entropy loss and self-learning loss that we need to tune, N_t denotes the entropy loss, and S_t denotes the self-learning loss. L^v_t(θ) and L^p_t(θ) are defined as follows:

L^v_t(θ) = E_t[(r(s_t, a_t) + γV_t(s_{t+1}) − V_t(s_t))²]    (4)

L^p_t(θ) = E_t[min(r_t(θ)A_t, clip(r_t(θ), 1 − ε, 1 + ε)A_t)]    (5)

where r_t(θ) = π_θ(a_t|s_t)/π_θold(a_t|s_t), and A_t is the advantage, computed as the difference between the return and the value estimate.

3.4 Reward Design and Self-learning

Reward Design
The reward function is significant for reinforcement learning, and good learning results mainly depend on diverse rewards. The final goal of the game is to destroy the enemies' crystal. If our reward were based only on the final result, it would be extremely sparse, and the seriously delayed reward would make it difficult for the agent to learn fast. Obviously, a dense reward gives more positive or negative feedback to the agent and helps it learn faster and better. As we have no game engine or API, the damage dealt by an agent is not available to us. In our experiment, all agents receive a two-part reward consisting of a self-reward and a global reward. The self-reward contains the agent's own money and health point (HP) loss/gain, while the global reward includes tower loss and the deaths of friendly/enemy players:

r_t = ρ_1 × r_self + ρ_2 × r_global
    = ρ_1((money_t − money_{t−1})f_m + (HP_t − HP_{t−1})f_H)
    + ρ_2(towerloss_t × f_t + playerdeath_t × f_d)    (6)

where towerloss_t is positive when an enemy tower is destroyed and negative when one of our own towers is destroyed, and similarly for playerdeath_t; f_m is the coefficient of the money term, and similarly for f_H, f_t and f_d; ρ_1 is the weight of the self-reward and ρ_2 is the weight of the global reward. This reward function is effective for training, and the results are shown in the experiment section.

Self-learning
There are many self-learning methods for reinforcement learning, such as Self-Imitation Learning (SIL) proposed by [Oh et al., 2018] and Episodic Memory Deep Q-Networks (EMDQN) presented by [Lin et al., 2018]. SIL is applicable to the actor-critic architecture, while EMDQN combines episodic memory with DQN. However, considering the better sample efficiency and easier tuning of the system, we migrate EMDQN to our reinforcement learning algorithm, PPO.
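To make the migration concrete, here is a minimal, assumption-laden sketch (the class and method names are ours, and keys would be discretized states in practice) of maintaining an episodic memory of best returns per state-action pair, which yields the memory target V_H used in the self-learning loss (cf. Eq. (8)):

```python
# Sketch of an episodic memory of best returns, in the spirit of
# EMDQN-style self-learning (hypothetical simplification).

class EpisodicMemory:
    def __init__(self):
        self.best_return = {}           # (state, action) -> best return seen

    def update(self, state, action, ret):
        """Keep the highest return ever observed for (s, a)."""
        key = (state, action)
        if key not in self.best_return or ret > self.best_return[key]:
            self.best_return[key] = ret

    def value_target(self, state, action, current_return):
        """V_H: max of remembered and current return if (s, a) is in
        memory, else just the current return (cf. Eq. (8))."""
        key = (state, action)
        if key in self.best_return:
            return max(self.best_return[key], current_return)
        return current_return
```

The memory-based advantage A_Ht = V_H − V(s_{t+1}) then replaces A_t in the clipped surrogate term of the self-learning loss, pulling the policy toward the agent's past good decisions.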
The loss of the self-learning part can be written as follows:

S_t(π, a_t) = E_t[(V_{t+1} − V_H)²]
    + E_t[min(r_t(θ)A_Ht, clip(r_t(θ), 1 − ε, 1 + ε)A_Ht)]    (7)

where the memory target V_H is the best value from the memory buffer and A_Ht is the best advantage from it:

V_H = max(max_i R_i(s_t, a_t), R(s_t, a_t)),  if (s_t, a_t) ∈ memory;
      R(s_t, a_t),                            otherwise    (8)

A_Ht = V_H − V_{t+1}(s_{t+1})    (9)

where i ∈ [1, 2, ..., E], and E represents the number of episodes in the memory buffer that the agent has experienced.

Algorithm 1 Hierarchical RL Training Algorithm
Input: Reward function R_n, max episodes M, function IL(s) indicating the imitation learning model.
Output: Hierarchical reinforcement learning neural network.
 1: Initialize controller policy π and global state s_g shared among our agents;
 2: for episode = 1, 2, ..., M do
 3:   Initialize s_t, a_t;
 4:   repeat
 5:     Take action a_t, receive reward r_{t+1} and next state s_{t+1};
 6:     Choose macro action A_{t+1} from s_{t+1} according to IL(s = s_{t+1});
 7:     Choose micro action a_{t+1} from A_{t+1} according to the output of RL in state s_{t+1};
 8:     if a^i_{t+1} ∉ A_{t+1}, where i = 0, ..., 16, then
 9:       P(a^i_{t+1}|s_{t+1}) = 0;
10:     else
11:       P(a^i_{t+1}|s_{t+1}) = P(a^i_{t+1}|s_{t+1}) / Σ_i P(a^i_{t+1}|s_{t+1});
12:     end if
13:     Collect samples (s_t, a_t, r_{t+1});
14:     Update policy parameter θ to maximize the expected returns;
15:   until s_t is terminal
16: end for

4 Experiments

In this section, we first introduce the experiment setting. Then we evaluate the performance of our algorithms in two environments: (i) a 1v1 map including entry-level, easy-level and medium-level built-in AI (there is no difficult-level AI), and (ii) a challenging 5v5 map. For a better comprehension, we analyze the average rewards and win rates during training.

4.1 Setting
The experiment setting includes a terminal experiment platform with distributed phones for training. In the training process, we transmit the data and share the network parameters through gRPC. The accuracy and categories of the features obtained by multi-target detection are depicted in Table 2. In our experiment, the speed of taking an action is about 150 APM, compared to 180 APM for a high-level player, which is enough for this game. For moving to a given position, we use the A-star path planning algorithm.

Category         Training Set    Testing Set    Precision
Own Soldier      2677            382            0.6158
Enemy Soldier    2433            380            0.6540
Own Tower        485             79             0.9062
Enemy Tower      442             76             0.9091
Own Crystal      95              17             0.9902
Enemy Crystal    152             32             0.8425

Table 2: The accuracy of multi-target detection

4.2 1v1 Mode of Game KOG
As shown in Figure 1b, there are one agent and one enemy player on the 1v1 map. We need to destroy the enemy tower first and then destroy the crystal to achieve the final victory. We plot the number of episodes needed to win when our agent fights against different levels of built-in AI and different genres of internal AI.

Episodes until win
Figure 4 shows the number of episodes our agent Angela needs to defeat the opponents. The higher the level of the built-in AI, the longer our agent needs to train. Moreover, for different kinds of enemies, the training time also differs. The results when our AI plays against the ablated versions, AI.1: without macro strategy, AI.2: without multi-agent, AI.3: without global reward, and AI.4: without self-learning method, are listed in Table 3. 50 games are played against each, and the win rates are 80%, 50%, 52% and 58%, respectively.

Scenarios    AI.1    AI.2    AI.3    AI.4
1v1 mode     80%     50%     52%     58%
5v5 mode     82%     68%     66%     60%

Table 3: Win rates playing against AI.1: AI without macro strategy, AI.2: without multi-agent, AI.3: without global reward, and AI.4: without self-learning method

Average rewards
Generally speaking, the aim of our agent is to defeat the enemies as soon as possible. Figure 5 illustrates the average rewards of our agent Angela in 1v1 mode when combatting different types of enemies. In the beginning, the rewards are low because the agent is still a beginner and lacks learning experience. However, our agent is learning gradually
form and GPU cluster training platform. In order to increase                       and being more and more experienced. When the training
the diversity and quantity of samples, we use 10 vivo X23                          episodes of our agent reach about 100, the rewards in each
and NEX phones for an agent to collect the distributed data.                       step become positive overall and our agent is starting to have
Meanwhile, we need to maintain the consistency of all the                          some advantages in battle. There are also some decreases in
1000                                                                                                                     1.1
                                                        HRL with entry-level AI                                                                                            HRL with entry-level AI
                                                       HRL with easy-level AI                                                                                              HRL with easy-level AI
                                       800                                                                                                                 0.9
                                                                                                                                                                           HRL with medium-level AI
                                                       HRL with medium-level AI
                                       700                                                                                                                                 PPO algorithm with entry-level AI
                  Episodes Until Win

                                                                                                                                                           0.7             Supervised learning with medium-level AI


                                                                                                                                         Win Rates


                                       100                                                                                                                 0.1
                                                    vs. Support     vs. Mage   vs. Shooter    vs. Assassin    vs. Warrior     Average
                                                                                                                                                                      0    90      180        270   360    450     540   630     720    810   900 1000

Figure 4: The episodes to train of our model against with differ-                                                                                                                                          Episodes
ent level internal AI when combatting with Support, Mage, Shooter,
Assassin and Warrior.                                                                                                                    Figure 6: The win rates of our agents in 5v5 mode against different
                                                                                                                                         level of internal AI.
                                                                                                                                                             2                  Easy-level
Average Rewards

                                                                                                                                         Average Rewards
                                                                                                                      Medium-level                         0.5

                                                0      100        200   300     400     500      600    700     800         900   1000                     -0.5

                                                                                                                                                                  1       100      199        298    397     496      595      694     793    892   991
Figure 5: The average rewards of our agent in 1v1 mode during
                                                                                                                                         Figure 7: The average rewards of our agents in 5v5 mode during
rewards when facing high level internal AI because of the fact                                                                           training.
that the agent is not able to defeat the Warrior at first. To sum
up, the average rewards are increasing obviously, and stay                                                                               method used about 300 thousand game replays under the ad-
smooth after about 600 episodes.                                                                                                         vantage of API. Another way is using PPO algorithm that
                                                                                                                                         OpenAI Five used [OpenAI, 2018] without macro strategy,
4.3                                         5v5 mode of game KOG                                                                         which achieves about 22% win rate when combatting with
As shown in Fig.1a, there are five agents and five enemy play-                                                                           entry-level internal AI. Meanwhile, the results of our AI play-
ers in 5v5 map. What we need to do actually is to destroy the                                                                            ing against AI without macro strategy, without multi-agent,
enemies’ crystal. In this scenario, we train our agents with in-                                                                         without global reward and without self-learning method are
ternal AI, and each agent hold one model. In order to analyze                                                                            listed in Table 3. These indicate the importance of each
the results during training, we illustrate the average rewards                                                                           method in our hierarchical reinforcement learning algorithm.
and win rates in Fig.6 and Fig.7.                                                                                                        Average rewards
Win rates                                                                                                                                As shown in Figure.7, the total rewards are divided by episode
                                                                                                                                         steps in the combat. In three levels, the average rewards are
We draw the win rates in Figure6. there are three different
                                                                                                                                         increasing overall. For medium-level internal AI, it’s hard
levels of built-in AI that our agents combat with. When fight-
                                                                                                                                         to learn well at first. However, the rewards are growing up
ing with entry-level internal AI, our agents learn fast and the
                                                                                                                                         after 500 episodes and stay smooth after almost 950 episodes.
win rates reach 100% finally. When training with medium-
                                                                                                                                         Although there are still some losses during training. This is
level AI, the learning process is slow and our agents can’t
                                                                                                                                         reasonable for the fact that we encounter different lineups of
win until 100 episodes. In this mode, the win rates are about
                                                                                                                                         internal AI which make different levels of difficulty.
55% in the end. This is likely due to the fact that our agents
can hardly obtain dense global rewards in games against high
level AI, which leads to hard cooperation in team fight. One                                                                             5                    Conclusion
way using supervised learning method from Tencent AI Lab                                                                                 In this paper, we proposed hierarchical reinforcement learn-
obtains 100% win rate [Wu et al., 2018]. However, the                                                                                    ing for multi-agent MOBA game KOG, which learns macro
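As a concrete illustration of the self-learning objective in Eqs. (7)–(9), the per-step computation can be sketched as follows. This is a minimal sketch under our reading of the equations: the function name and scalar interface are illustrative only, and in practice S_t is estimated over minibatches of transitions by the training framework.

```python
def self_learning_loss(v_next, ret, memory_returns, ratio, eps=0.2):
    """Per-step sketch of Eqs. (7)-(9).

    v_next          -- V_{t+1}(s_{t+1}) predicted by the value network
    ret             -- R(s_t, a_t), the return observed in the current episode
    memory_returns  -- returns R_i(s_t, a_t), i = 1..E, stored in the memory
                       buffer for this state-action pair (empty if unseen)
    ratio           -- r_t(theta), the probability ratio of new to old policy
    """
    # Eq. (8): the memory target V_H is the best return recorded for (s_t, a_t)
    if memory_returns:
        v_h = max(max(memory_returns), ret)
    else:
        v_h = ret
    # Eq. (9): the best advantage relative to the current value estimate
    a_h = v_h - v_next
    # Eq. (7): value-regression term toward V_H plus the PPO-style
    # clipped surrogate computed with the best advantage A_Ht
    value_term = (v_next - v_h) ** 2
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    policy_term = min(ratio * a_h, clipped_ratio * a_h)
    return value_term + policy_term
```

For example, with `v_next = 1.0`, `ret = 2.0`, `memory_returns = [1.5, 3.0]` and `ratio = 1.0`, the memory target is 3.0 and the objective evaluates to 6.0.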
5   Conclusion
In this paper, we proposed hierarchical reinforcement learning for the multi-agent MOBA game KOG, which learns macro strategies through imitation learning and takes micro actions through reinforcement learning. To obtain better sample efficiency, we presented a simple self-learning method, and we extracted global features as part of the state input by multi-target detection. Our results showed that hierarchical reinforcement learning is very helpful for this MOBA game.
   In addition, there is still work to do in the future. Cooperation and communication among multiple agents are learned by sharing the network and by constructing an efficient global reward function and state representation. Although our agents can successfully learn some cooperation strategies, we are going to explore more effective methods for multi-agent collaboration. Meanwhile, the implementation of this hierarchical reinforcement learning architecture encourages us to go further in the 5v5 mode of game King of Glory, especially when our agents compete with human beings.

Acknowledgments
We would like to thank our colleagues at vivo AI Lab, particularly Jingwei Zhao and Guozhi Wang, for their helpful comments on the writing of this paper. We are also very grateful for the support from vivo AI Lab.

References
[Barto and Mahadevan, 2003] Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.
[Foerster et al., 2017] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. arXiv preprint arXiv:1702.08887, 2017.
[Foerster et al., 2018] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[Frans et al., 2017] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.
[Jiang and Lu, 2018] Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. arXiv preprint arXiv:1805.07733, 2018.
[Lin et al., 2018] Zichuan Lin, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. Episodic memory deep Q-networks. arXiv preprint arXiv:1805.07603, 2018.
[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[Murphy, 2015] M Murphy. Most played games: November 2015 – Fallout 4 and Black Ops III arise while StarCraft II shines, 2015.
[Oh et al., 2018] Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
[Ontanón et al., 2013] Santiago Ontanón, Gabriel Synnaeve, Alberto Uriarte, Florian Richoux, David Churchill, and Mike Preuss. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293–311, 2013.
[OpenAI, 2018] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.
[Rashid et al., 2018] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.
[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[Shao et al., 2018] Kun Shao, Yuanheng Zhu, and Dongbin Zhao. StarCraft micromanagement with reinforcement learning and curriculum transfer learning. IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
[Silver et al., 2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[Sukhbaatar et al., 2016] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
[Sun et al., 2018] Peng Sun, Xinghai Sun, Lei Han, Jiechao Xiong, Qing Wang, Bo Li, Yang Zheng, Ji Liu, Yongsheng Liu, Han Liu, et al. TStarBots: Defeating the cheating level built-in AI in StarCraft II in the full game. arXiv preprint arXiv:1809.07193, 2018.
[Vezhnevets et al., 2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
[Wu et al., 2018] Bin Wu, Qiang Fu, Jing Liang, Peng Qu, Xiaoqian Li, Liang Wang, Wei Liu, Wei Yang, and Yongsheng Liu. Hierarchical macro strategy model for MOBA game AI. arXiv preprint arXiv:1812.07887, 2018.