A Memory Efficient Deep Reinforcement Learning Approach For Snake Game Autonomous Agents

Md. Rafat Rahman Tushar¹ and Shahnewaz Siddique²
Department of Electrical and Computer Engineering, North South University, Dhaka, Bangladesh
rafat.tushar@northsouth.edu, shahnewaz.siddique@northsouth.edu

arXiv:2301.11977v1 [cs.AI] 27 Jan 2023
Abstract—To perform well, Deep Reinforcement Learning (DRL) methods require significant memory resources and computational time. Also, sometimes these systems need additional environment information to achieve a good reward. However, for many applications and devices it is more important to reduce memory usage and computational time than to achieve the maximum reward. This paper presents a modified DRL method that performs reasonably well with compressed imagery data without requiring additional environment information and also uses less memory and time. We have designed a lightweight Convolutional Neural Network (CNN) with a variant of the Q-network that efficiently takes preprocessed image data as input and uses less memory. Furthermore, we use a simple reward mechanism and a small experience replay memory so as to provide only the minimum necessary information. Our modified DRL method enables our autonomous agent to play Snake, a classical control game. The results show our model can achieve similar performance as other DRL methods.

Index Terms—Deep Reinforcement Learning, Convolutional Neural Network, Deep Q Learning, Hyperparameter Tuning, Replay Size, Image Preprocessing

I. INTRODUCTION

Complex problems can be solved in real-world applications by carefully designing Deep Reinforcement Learning (DRL) models that take high dimensional input data and produce discrete or continuous outputs. It is challenging to build an agent from sensory data that is capable of controlling and acting in an environment. The environment is also complex and primarily unknown to the acting agent. The agent needs to learn the underlying distribution of the state and action spaces, and the distribution changes as the agent encounters new data from the environment. Previously, reinforcement learning algorithms [1]–[5] were presented on lower-constraint problems to demonstrate the algorithms' effectiveness. However, these systems were not well generalized for high dimensional inputs; thus, they could not meet the requirements of practical applications.

Recently, DRL has had success in CNN-based vision problems [6]–[8]. These works successfully implement DRL methods that learn to control directly from image pixels. Although image-based DRL methods have enjoyed considerable success, they are memory intensive during training as well as deployment. Since they require a massive amount of memory, they are not suitable for training or deployment on mobile devices or mid-range autonomous robots.

All modern reinforcement learning algorithms, mainly off-policy ones, use a replay buffer for sampling uncorrelated data for online training. The experience replay buffer also improves data efficiency [9] during data sampling. Since the use of neural networks in various DRL algorithms is increasing, it is necessary to stabilize the neural network with uncorrelated data. That is why the experience replay buffer is a desirable property of various reinforcement learning algorithms. The first successful implementation of DRL in high dimensional observation spaces, Deep Q-learning [6], used a replay buffer of size 10^6. After that, [8], [10]–[12], to name a few, have solved complex high dimensional problems but still use a replay buffer of the same size.

The experience replay buffer suffers from two types of issues. One is choosing the size of the replay buffer, and the second is the method of sampling data from the buffer. [13]–[15] consider the latter problem of how best to sample from the replay buffer, but the favorable size for the replay buffer remains unknown. Although [15] points out that the learning algorithm is sensitive to the size of the replay buffer, they do not reach a definitive conclusion on the size of the buffer.

In this paper, we tackle the memory usage of DRL algorithms by implementing a modified approach to image preprocessing and replay buffer sizing. Although we want the agent to obtain a decent score, we are more concerned about memory usage. We choose a Deep Q-Network (DQN) [6] for our algorithm, with some variations. Our objective is to design a DRL model that can be trained and deployed on mobile devices. To be deployed on mobile devices, memory consumption must be minimized, as traditional DRL models with visual inputs sometimes need half a terabyte of memory. We achieve low memory consumption by preprocessing the visual image data and tuning the replay buffer size along with other hyperparameters. Then, we evaluate our model in our simulation environment using the classical control game Snake.* The results show that our model can achieve similar performance as other DRL methods.

¹ Research Assistant.
² Assistant Professor, IEEE Member.
* GitHub implementation: https://github.com/rafattushar/rl-snake
II. RELATED WORK

The core idea of reinforcement learning is a sequential decision-making process in which an agent learns from experience and acts in uncertain environments. After the development of a formal framework for reinforcement learning, many algorithms have been introduced, such as [1]–[5].

Q-learning [1] is a model-free asynchronous dynamic programming algorithm for reinforcement learning. Q-learning proposes that by sampling all the actions in all states and iterating the action-value function repeatedly, convergence can be achieved. Q-learning works well on limited state and action spaces but collapses on high dimensional, infinite state spaces. [6] then proposes the Deep Q-Network algorithm, which demonstrates significant results with image data. Among other variations, they use a convolutional neural network and a replay buffer. Double Q-learning [16] is applied with DQN to overcome the overestimation of the action-value function and is named Deep Reinforcement Learning with Double Q-Learning (DDQN) [8]. DDQN proposes another neural network with the same structure as DQN that gets updated less frequently. Refined DQN [17] proposes another DRL method that involves a carefully designed reward mechanism and a dual experience replay structure. Refined DQN evaluates their work by enabling their agent to play the snake game.

The experience replay buffer is a desirable property of modern DRL algorithms. It provides powerful, model-free, off-policy DRL algorithms with uncorrelated data and improves data efficiency [9] during data sampling. DQN [6] shows the power of the replay buffer in sampling data; DQN uses a replay buffer of size 10^6. After that, [8], [10]–[12], [17], among others, have presented their work with a replay buffer of the same size and structure. Schaul et al. propose an efficient sampling strategy in their prioritized experience replay (PER) [13]. PER shows that instead of sampling data uniformly at random, transitions can be prioritized (new transitions receive maximal priority and stored transitions are weighted by their temporal-difference error), so the most informative data have a higher probability of being selected, and this selection method seems to improve results. [15] shows that a large experience replay buffer can hurt performance. They also propose that when sampling data to train DRL algorithms, the most recent data should be appended to the batch.
III. METHOD

Our objective is to reduce memory usage during training while achieving the best performance possible. The replay memory takes a considerable amount of memory, as described later. We try to achieve memory efficiency by reducing the massive replay buffer requirement through image preprocessing and a smaller buffer size. The buffer size is carefully chosen so that the agent has the necessary information to train well and achieves a moderate score. We use a slight variation of the deep Q-learning algorithm for this purpose.

A. Image Preprocessing

The agent gets the RGB values in a 3-D array format from the game environment. We convert the RGB array into grayscale because doing so does not affect the performance [18] and it saves three times the memory. We resize the grayscale data to 84 × 84 pixels. Finally, for further memory reduction, we convert this resized grayscale data into binary data (values of only 0 and 1). The memory requirement for storing the various image data (scaled down between 0 and 1) is given in Table II. Table II shows that converting RGB into grayscale saves around 67% of the memory and converting RGB into binary saves around 96%. Also, the memory requirement is reduced by around 87.5% when converting from grayscale into binary. The visual pixel data transformation achieved by preprocessing is shown in Fig. 1, and the preprocessing method is presented as a flowchart in Fig. 2.

TABLE II
MEMORY REQUIREMENT FOR DIFFERENT PIXEL DATA

                                  RGB        Grayscale    Binary
Data Type                         float      float        int
Size (kB)                         165.375    55.125       6.890
Memory Save % w.r.t. RGB          0%         67%          96%
Memory Save % w.r.t. Grayscale    -          0%           87.5%

Fig. 1. Visual image data before and after preprocessing: (a) before preprocessing, (b) after preprocessing.

Fig. 2. Diagram of image preprocessing: Game Env → Grayscale → Resize 84×84 → Pixel value 0 or 1.
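The preprocessing chain of Fig. 2 is straightforward to express in code. The following Python sketch is illustrative rather than the implementation used in this work; the use of OpenCV for conversion and resizing and the 0.5 binarization threshold are assumptions, while the per-frame sizes in the comments follow Table II (64-bit floats for RGB and grayscale, one byte per binary pixel).

import cv2
import numpy as np

def preprocess(rgb_frame: np.ndarray) -> np.ndarray:
    """Convert one RGB game frame into an 84x84 binary image.

    Per-frame storage (Table II): 84x84x3 RGB float64 = 165.375 kB,
    84x84 grayscale float64 = 55.125 kB, 84x84 binary int8 = 6.890 kB.
    """
    # RGB -> grayscale: no loss of performance [18], one third of the memory.
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    # Downscale to 84 x 84 pixels.
    gray = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    # Scale to [0, 1] and binarize (the 0.5 threshold is an assumption).
    return (gray / 255.0 > 0.5).astype(np.uint8)  # values are only 0 or 1

Four consecutive preprocessed frames are stacked to form the 84 × 84 × 4 input of the Q-network listed in Table III.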
B. Game Selection and Their Environments

The use case for our target applications is less complex tasks. For this reason, we implemented the classical Snake game [19] with the 'pygame' module. The game screen is divided into a 12 × 12 grid, and the game resolution is set to 252 × 252. The initial snake size is 3, and the controller has four inputs for navigation. Table I shows the valid actions and the respective rewards for the snake game environment.

TABLE I
REWARD MECHANISM FOR THE SNAKE GAME

Moves                                    Rewards    Results
Eats an apple                            +1         Score increases
Hits the wall or itself                  -1         End of episode
Neither eats nor hits wall or itself     -0.1       Continue playing
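The reward scheme of Table I reduces to a small helper function. The sketch below is hypothetical; the flags ate_apple and collided are illustrative names rather than identifiers from the implementation.

def compute_reward(ate_apple: bool, collided: bool) -> float:
    """Reward mechanism of Table I for the Snake environment."""
    if collided:      # hits the wall or itself: episode ends
        return -1.0
    if ate_apple:     # eats an apple: score increases
        return 1.0
    return -0.1       # neither: keep playing, small living penalty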
C. Reinforcement Learning Preliminary

Any reinforcement learning or sequential decision-making problem can be formulated as a Markov Decision Process (MDP). An MDP is a triplet M = (X, A, P_0), where X is the set of valid states, A is the set of valid actions, and P_0 is the transition probability kernel that maps X × A to a distribution over next states. For a deterministic system, the state transition is defined as

    s_{t+1} = f(s_t, a_t)    (1)

The reward is defined as

    r_t = R(s_t, a_t)    (2)

The cumulative reward over a trajectory or episode is called the return, R(τ). The discounted return is given by

    R(τ) = Σ_{t=0}^{∞} γ^t r_t    (3)

D. Deep Q-Learning

The goal of the RL agent is to maximize the expected return. Following a policy π, the expected return, J(π), is defined as

    J(π) = E_{τ∼π}[R(τ)]    (4)

The optimal action-value (or Q) function Q*(s, a) maximizes the expected return obtained by taking any action a at state s and acting optimally in the following states,

    Q*(s, a) = max_π E_{τ∼π}[R(τ) | s_0 = s, a_0 = a]    (5)

To find the optimal actions from an optimal action-value function at time t, Q* must satisfy the Bellman equation,

    Q*(s, a) = E_{s'∼ρ}[ r(s, a) + γ max_{a'} Q*(s', a') ]    (6)

The optimal action-value function gives rise to the optimal action a*(s), which can be described as

    a*(s) = arg max_a Q*(s, a)    (7)

For training an optimal action-value function, sometimes a non-linear function approximator such as a neural network [6] is used. We use a convolutional neural network.

E. Neural Network

The action-value function is iteratively updated to approach the optimal action-value function. The neural network used to approximate the action-value function and update it at each iteration is called the Q-network. We train the Q-network, parameterized by θ, by minimizing a loss function L_i(θ_i) at the i-th iteration,

    L_i(θ_i) = E_{s,a∼ρ}[ (y_i − Q(s, a; θ_i))^2 ]    (8)

where y_i = E_{s'∼ρ}[ r(s, a) + γ max_{a'} Q'(s', a'; θ'_k) ] is the target for that update. Here Q' is another Q-network with the same shape as the Q-network but with frozen parameters θ'_k, called the target Q-network, which provides training stability. We train the Q-network by minimizing this loss function (8) w.r.t. the parameter θ_i. We use the Adam [20] optimizer for fast convergence. Our convolutional neural network structure is shown in Table III.

TABLE III
THE ARCHITECTURE OF THE NEURAL NETWORK

Layer Name    Filter    Stride    Units/Filters     Activation    Zero Padding    Output
Input                                                                             84*84*4
Conv1         8*8       4         32                ReLU          Yes             21*21*32
M. Pool       2*2       2                                         Yes             11*11*32
Conv2         4*4       2         64                ReLU          Yes             6*6*64
M. Pool       2*2       2                                         Yes             3*3*64
B. Norm                                                                           3*3*64
Conv3         3*3       2         128               ReLU          Yes             2*2*128
M. Pool       2*2       2                                         Yes             1*1*128
B. Norm                                                                           1*1*128
Flatten                                                                           128
FC                                512               ReLU                          512
FC                                512               ReLU                          512
Output                            No. of actions    Linear                        No. of actions

M. Pool = Max Pooling, B. Norm = Batch Normalization, FC = Fully Connected

Fig. 3. Structure of the experience replay memory and flowchart: an action (random or chosen by the agent) is applied to the environment, and each resulting tuple E_t = (s_t, a_t, r_{t+1}, s_{t+1}) is stored in the experience replay memory.
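For concreteness, the network of Table III can be written in Keras as below. This is a sketch, not the released implementation; it assumes that "Zero Padding: Yes" corresponds to 'same' padding, which reproduces the output shapes listed in the table.

import tensorflow as tf
from tensorflow.keras import layers

def build_q_network(num_actions: int) -> tf.keras.Model:
    """Q-network of Table III; input is a stack of four 84x84 binary frames."""
    inputs = tf.keras.Input(shape=(84, 84, 4))
    x = layers.Conv2D(32, 8, strides=4, padding="same", activation="relu")(inputs)  # 21x21x32
    x = layers.MaxPooling2D(2, strides=2, padding="same")(x)                        # 11x11x32
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(x)       # 6x6x64
    x = layers.MaxPooling2D(2, strides=2, padding="same")(x)                        # 3x3x64
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)      # 2x2x128
    x = layers.MaxPooling2D(2, strides=2, padding="same")(x)                        # 1x1x128
    x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)                                                         # 128
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(512, activation="relu")(x)
    outputs = layers.Dense(num_actions, activation="linear")(x)  # one Q-value per valid action
    return tf.keras.Model(inputs, outputs)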
Fig. 4. The deep reinforcement learning design structure of our model: preprocessed states are fed to the online DQN, whose max-Q action A_t is applied to the environment; experience tuples E_t = (s_t, a_t, r_{t+1}, s_{t+1}) are stored in the replay memory, and random mini-batches train the online DQN with the loss [y_t − Q(A_t)]^2, where y_t = R_{t+1} + γ max_a Q'(a) is computed by the target DQN, whose weights are synced every p steps.
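Figs. 3 and 4 depict the data collection cycle: an action (random or chosen by the agent) is applied to the environment, and the resulting tuple E_t = (s_t, a_t, r_{t+1}, s_{t+1}) is pushed into the replay memory. The sketch below illustrates one turn of that cycle; the Gym-style env.step() interface, the ε-greedy exploration rule, and the replay_memory.add() method are assumptions rather than details given in the paper.

import numpy as np

def collect_step(env, online_dqn, replay_memory, state, epsilon):
    """One turn of the Fig. 3 cycle: act, observe, store the experience tuple."""
    if np.random.rand() < epsilon:                  # random exploratory action
        action = env.action_space.sample()
    else:                                           # greedy action from the online DQN
        q_values = online_dqn(state[np.newaxis].astype(np.float32))
        action = int(np.argmax(q_values[0]))
    next_state, reward, done, _ = env.step(action)  # assumed Gym-style API
    replay_memory.add(state, action, reward, next_state, done)
    return next_state, done

Any buffer exposing an add() method can be used here, for example the FIFO memory sketched after Section III-F below.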

F. Experience Replay Buffer

As our focus is to keep memory requirements as low as possible during training, choosing the size of the replay buffer is one of the critical design decisions. The size of the replay buffer directly alters the memory requirement. We use a replay buffer of size 50,000, only 5% of the size used by [6], [8], [17], which use a replay buffer of size 1,000,000. [6], [8], [17] store grayscale data in the replay buffer; Table IV shows that we use 99.4% less memory compared to these works. The replay buffer stores data in FIFO (first in, first out) order so that the buffer contains only the latest data. We present the complete cycle of the experience replay buffer in Fig. 3, and Fig. 4 illustrates our complete design diagram.

TABLE IV
MEMORY REQUIREMENT FOR THE EXPERIENCE REPLAY

                                  RGB        Grayscale    Binary
Memory Usage (GB)                 1261.71    420.57       2.628
Memory Save % w.r.t. RGB          0%         67%          99.7%
Memory Save % w.r.t. Grayscale    -          0%           99.4%
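A 50,000-transition FIFO buffer is small enough to hold in a standard deque. The class below is an illustration, not the authors' code; the sizes in the docstring reproduce the Table IV figures under the assumption that every transition stores two four-frame 84 × 84 stacks (state and next state), with 64-bit floats for RGB and grayscale pixels and one byte per binary pixel.

import random
from collections import deque

class ReplayMemory:
    """FIFO experience replay: the oldest transitions are dropped first.

    Approximate sizes (Table IV), two 4-frame 84x84 stacks per transition:
      1,000,000 RGB float64 transitions        ~ 1261.71 GB
      1,000,000 grayscale float64 transitions  ~  420.57 GB
         50,000 binary int8 transitions (ours) ~    2.628 GB
    """

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)  # FIFO: keeps only the latest data

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random mini-batch, as used during training.
        return random.sample(self.buffer, batch_size)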
IV. EXPERIMENTS

A. Training

For training our model, we take a random batch of 32 experiences from the replay buffer at each iteration. Our model has two convolutional neural networks (the online DQN and the target DQN) sharing the same structure, but they do not sync automatically. The weights of the target network are frozen so that it cannot be trained. The state history from the mini-batch is fed into the online DQN, which outputs the Q-values Q(s_t, a_t).

    Loss = [y_t − Q(s_t, a_t)]^2    (9)

The target y_t is calculated from the target Q-network. We pass the next-state values to the target Q-network and, for each next state in the batch, obtain the corresponding Q-values. That gives the max_{a'} Q'(s', a') term in the equation below,

    y_t = R_{t+1} + γ max_{a'} Q'(s', a')    (10)

Here γ is the discount factor, one of the many hyperparameters we use in our model; initially, we set γ to 0.99. R_{t+1} is the reward in each experience tuple. With these values we obtain y_t, and the loss is computed by putting these values into (9). Then, we use this loss to backpropagate through our online DQN with the Adam optimizer. The Adam optimizer is used instead of classical stochastic gradient descent for more speed. The target DQN is synced with the online DQN every 10,000 steps. The values of the hyperparameters we choose are listed in Table VI.
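The update described above condenses into a single training step. The sketch below (TensorFlow, matching the earlier network sketch) illustrates equations (9) and (10); it is not the authors' code, and masking the bootstrap term at terminal states via the done flag is an assumption.

import numpy as np
import tensorflow as tf

GAMMA = 0.99                             # discount factor
optimizer = tf.keras.optimizers.Adam()   # Adam instead of plain SGD, for speed

def train_step(online_dqn, target_dqn, batch):
    """One gradient step on a random mini-batch of 32 experiences."""
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    states = states.astype(np.float32)
    next_states = next_states.astype(np.float32)
    # y_t = R_{t+1} + gamma * max_a' Q'(s', a')  (eq. 10), from the frozen target network
    next_q = target_dqn(next_states).numpy()
    targets = rewards + GAMMA * next_q.max(axis=1) * (1.0 - dones)  # zero bootstrap at terminal states (assumption)
    targets = targets.astype(np.float32)
    with tf.GradientTape() as tape:
        q_values = online_dqn(states)
        # select Q(s_t, a_t) for the actions actually taken
        action_mask = tf.one_hot(actions, q_values.shape[-1])
        q_taken = tf.reduce_sum(q_values * action_mask, axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))         # eq. (9)
    grads = tape.gradient(loss, online_dqn.trainable_variables)
    optimizer.apply_gradients(zip(grads, online_dqn.trainable_variables))
    return float(loss)

# Every 10,000 steps the frozen target network copies the online weights:
#   target_dqn.set_weights(online_dqn.get_weights())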
Fig. 5. Results of our agent playing the Snake game during training: (a) score vs. episode graph; (b) reward vs. episode graph.

Fig. 6. Results of the baseline DQN model playing the Snake game during training: (a) score vs. episode graph; (b) reward vs. episode graph.

Fig. 8. Comparison between the Refined DQN model and our model: (a) score graph of Refined DQN (graph taken from [17]); (b) score graph of our model.

Fig. 9. Testing evaluation of our agent by playing 50 random episodes after being trained for 134,000 games: (a) Refined DQN score (taken from [17]); (b) our model's score.
B. Results and Comparisons

We allow the DRL agents to play 140,000 episodes of games to match the training results presented in [17]. We train one agent with our method and another with the DQN method presented in [6]; we refer to [6] as the baseline DQN model. Next, we compare our model with the baseline DQN model.

Fig. 6 displays the baseline DQN results during training on the snake game. In Fig. 7 we present the score and reward comparison between our model and the baseline DQN model. The blue line in Fig. 7(a) represents our model's score; the purple line represents the score of the baseline DQN model.
                                                                                                                                                                            1015                      of our due to the li
      DQN model, but without our carefully designed reward mech-                              140,000 numbers                    of    training        episodes,              our model remains
[6] and   the refined DQN model [17]. The results of training                                   refined DQN            modelDQN
                                                                                                                    Refined       withBesthuman performance,
                                                                                                                                                      17                    5039 trying to further able to re-ru
      anism, training gap, and dual experience replay strategy. Fig. 2                        better      at theepisode         scoreof though            it requires
the snake                                                                                       evaluate              capability             our proposed             model. fewer     As shown       resources.
                                                                                                                                                                                                              in Nonetheless,
      clearlygame      with our
               demonstrates      thatmodel    are shown
                                       our model              in Fig.
                                                    outperforms      the 5.  Fig.
                                                                         baseline             Fig.   7(b)      demonstrates              that   our    model          is     capable            of    achieving
5(a) shows    the    game’s    score   with   our   model     during   training.                Fig. 2, the performance of our refined DQN model in terms                                                     of 77,000th gam
      model in terms of both the game score and the survival                                  higher      cumulative            rewards
Fig. 5(b)                                                                                       game score           increases      slowly than over thethefirst  baseline
                                                                                                                                                                         50,000DQN       gamesmodel.    along correctly in t
      time. shows      that even
             This finding            thoughshows
                              empirically      our reward      mechanism
                                                     the effectiveness          is
                                                                            of our             game play policies learned during the exploration phase may
                                                                                                                                                                                                                   of this section
simpler   than the refined      DQN     model,    the agent    maximizes      the                 We the
                                                                                                with      alsodecay compare          the results
                                                                                                                          of . Moreover,                 between ourin model
                                                                                                                                                   the performance                         terms ofand      the the
      improvements       over the  baseline   model,   i.e., the  reward assign-               not be optimal or near optimal that after a while (around
                                                                                                number
                                                                                              refined        of    steps     survived      even     gets    decreasing             (see       Fig.      2(b)).     can already s
cumulative    reward
      ment based      on optimally.
                          distance, the training gap, the timeout punish-                      27,000 DQN   games model  after  [17].decaysRefinedto 0), the     DQN           follows a ofdual
                                                                                                                                                                         performance                         the ex-
                                                                                                These
                                                                                              perience
                                                                                               agent drops
                                                                                                           findings
                                                                                                               replay      are   due
                                                                                                                             memory(also
                                                                                                                    significantly
                                                                                                                                         to   the   exploration-exploitation
                                                                                                                                             architecture
                                                                                                                                                   shown as aand          slight a complex
                                                                                                                                                                                       drop in terms        reward To further
                                                                                                                                                                                                       trade-
   In ment,
      sectionand    the we
                III-F    dualshowed
                               experience
                                        that replay  strategies.
                                              our model     is moreNevertheless,
                                                                       memory                   off.   As     in    the    exploration         phase,      wherein               linearly           decays        trained agent
      as shown     in Fig.   2, theDQN
                                     highest   valuesandof the averaged      game             mechanism.
                                                                                               of game scores         However,
                                                                                                                            in Fig. our  2(a)).modelHowever, surpasses              their score. to
                                                                                                                                                                        it is encouraging                      Since
efficient  than the     baseline            model                refined DQN                    from      0.5     to   0,    the    agent      is   actually         getting          familiar            with     the results in
      score  and  the   averaged   number    of steps survived     are seemingly              their
                                                                                               see that game evenisafter similar       to ours, we
                                                                                                                               the exploration            phase, compare our agent    our isresults  able towith
model during training. In this section we show that despite low                                 the game environment by accumulating knowledge learned minimum sco
                                                                                              the
                                                                                               learnresults provided               inknowledge
                                                                                                                                        their paper.           Fig. 8(a) shows                       the results
memorysmall,  i.e., around
            usage,            2.5 and
                      our model       can80,achieve
                                              respectively.
                                                      similarHowever,       please
                                                                  if not better                 from more random   appropriate
                                                                                                                         exploration.          After and the achieves
                                                                                                                                                                  exploration        monotonically
                                                                                                                                                                                             phase, the score of arou
      note that these numbers are computed as the average of 1,000                            presented
                                                                                               increasing
                                                                                                performance
                                                                                                                 in   [17],
                                                                                                                  performance   and
                                                                                                                       of the agent
                                                                                                                                        Fig.    8(b)
                                                                                                                                         afterstarts     is    our
                                                                                                                                                  the performance
                                                                                                                                                         to improve by
                                                                                                                                                                        model’s     drop.    results
                                                                                                                                                                                                  It seems
                                                                                                                                                                                             making
                                                                                                                                                                                                             during
                                                                                                                                                                                                             all higher than
      games, within which several outlier cases may drastically                                the   period        of       decay,      i.e.,   50,000         games,
                                                                                                the decisions based on the learned knowledge. As shown in number of s           is     not       sufficient
      lower the averaged performance. Furthermore, in the latter part                          for
                                                                                                Fig.the    agent
                                                                                                       2(a),     the to     obtain game
                                                                                                                       averaged        a converged            knowledge
                                                                                                                                                score generally               keeps   set.improving.
                                                                                                                                                                                                However, again signific
      of this experiment section, we compare the performance of our                            due    to   the     limited      computing      TABLEresource
                                                                                                Similarly, as shown in Fig. 2(b), the averaged number      V          we      have,          we      are not  of      To further
      refined DQN model with human performance, trying to further                              able   LtoIST   OF P ERFORMANCE
                                                                                                           re-run       all   the    experiments
                                                                                                                                            COMPARISON    due      to
                                                                                                                                                                   OF    Dthe    time
                                                                                                                                                                             IFFERENT            AGENTS
                                                                                                                                                                                              constraint.
                                                                                                steps survived also shows improvements in general. There is performance,
      evaluate the capability of our proposed model. As shown in                               Nonetheless,
                                                                                                a noticeable the       peakmonotonically
                                                                                                                                in Performance
                                                                                                                                     terms of the     increasing
                                                                                                                                                            number           performance
                                                                                                                                                                              of steps survived            after Snake Game
                                                                                                                                                                      Score
      Fig. 2, the performance of our refined DQN model in terms of                             77,000th       game       empirically
                                                                                                around 50,000th to 77,000th                 shows
                                                                                                                                  Human Average        that      our
                                                                                                                                               games. This unexpected   agent
                                                                                                                                                                      1.98 *         is    able     to    learn
                                                                                                                                                                                                peak may performance
      game score increases slowly over the first 50,000 games along                            correctly
                                                                                                be due tointhe      thecompletion
                                                                                                                           Snake      Game.
                                                                                                                                  Baseline       Moreover,
                                                                                                                                            ofAverage
                                                                                                                                                    decay that0.26     inthe the     last paragraph
                                                                                                                                                                                *performance                  of 10 games to
      with the decay of . Moreover, the performance in terms of the                           of
                                                                                                thethis   section,
                                                                                                      agent       startswetoRefined
                                                                                                                              show
                                                                                                                                improve   DQN
                                                                                                                                        that   as Average
                                                                                                                                               although
                                                                                                                                                    it relies         9.04 *
                                                                                                                                                                pre-converged,
                                                                                                                                                                     purely         on theour            agent implementati
                                                                                                                                                                                                    learned
                                                                                               can   already for    surpass         Our Average
                                                                                                                                  average       human         players.  9.53
      number of steps survived even gets decreasing (see Fig. 2(b)).                            knowledge                 decision      making.
                                                                                                                                     Human       BestHowever,15we* suspect that the game scores
      These findings are due to the exploration-exploitation trade-                               To further justify the                 performance
                                                                                                                                    Baseline      Best           of our  2 *agent, we let the
      off. As in the exploration phase, wherein  linearly decays                              trained agent play additional    Refined DQN      50Bestgames with       17 *  = 0 and show
         (a) Score comparison              (b) Reward comparison                                                                       Our Best
      from 0.5 to 0, the agent is actually getting familiar with                               the results in Fig. 3. In terms                 of game score,20             our agent obtains a
     Fig. 7.  Comparison between our model
      the game environment by accumulating  and baseline DQN model
                                                     knowledge   learned                       minimum score of *3,Data                  taken fromscore
                                                                                                                                   a maximum             [17] of 17, and the averaged
      from random exploration. After the exploration phase, the                                score of around 9. The averaged score of 9 is significantly
      performance of the agent starts to improve by making all                                 higher than 2.5 shown in Fig. 2(a). Similarly, the averaged
      the decisions based on the learned knowledge. As shown in                                number of steps survived is approximately 1,500, which is
      Fig. 2(a), the averaged game score generally keeps improving.                            again significantly higher than that of 80 shown in Fig. 2(b).
      Similarly, as shown in Fig. 2(b), the averaged number of                                    To further compare our refined DQN model with human
      steps survived also shows improvements in general. There is                              performance, we invite ten undergraduate students to play the
      a noticeable peak in terms of the number of steps survived                               Snake Game for 50 games. Before they play 50 games for
      around 50,000th to 77,000th games. This unexpected peak may                              performance comparisons, each human player played at least
      be due to the completion of  decay that the performance of                              10 games to get familiar with this particular Snake Game
      the agent starts to improve as it relies purely on the learned                           implementation. The performance comparisons in terms of
      knowledge for decision making. However, we suspect that the                              game scores and the number of steps survived are shown
By comparing Fig. 8(a) and Fig. 8(b), we can safely say that our
model achieves better scores despite having a simple replay
buffer, a simple reward mechanism, and lower memory consumption.
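To make the memory argument concrete, the following is a minimal
sketch of the kind of compact replay buffer this refers to. It is
illustrative rather than a copy of our training code: only the
50,000-transition capacity and the mini-batch size of 32 come from
Table VI, while the 84x84 frame shape and the uint8 storage format
are assumptions made for the example.

import numpy as np

class CompactReplayBuffer:
    """Fixed-capacity ring buffer that stores preprocessed frames as uint8,
    so memory use stays low (a float32 copy would be four times larger)."""

    def __init__(self, capacity=50_000, frame_shape=(84, 84)):
        self.capacity = capacity
        self.states = np.zeros((capacity, *frame_shape), dtype=np.uint8)
        self.next_states = np.zeros((capacity, *frame_shape), dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.bool_)
        self.index = 0   # next slot to overwrite
        self.size = 0    # number of valid transitions stored so far

    def add(self, state, action, reward, next_state, done):
        i = self.index
        self.states[i] = state
        self.actions[i] = action
        self.rewards[i] = reward
        self.next_states[i] = next_state
        self.dones[i] = done
        self.index = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size=32):
        idx = np.random.randint(0, self.size, size=batch_size)
        # Convert frames to float32 only for the sampled mini-batch.
        return (self.states[idx].astype(np.float32) / 255.0,
                self.actions[idx],
                self.rewards[idx],
                self.next_states[idx].astype(np.float32) / 255.0,
                self.dones[idx])

Keeping the stored frames in uint8 and normalizing only the sampled
mini-batch is the main source of the saving illustrated here.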
   Fig. 9(a) and Fig. 9(b) show the scores of 50 random episodes
during testing of the refined DQN model and our model, respectively.
Table V summarizes the scores of the refined DQN model and our
model. We can identify from Table V that their refined DQN average
score is 9.04 while ours is 9.53, and their refined DQN best score
is 17 while ours is 20. So, we can see that our model also performs
better in both the training and testing phases.

                               TABLE V
      LIST OF PERFORMANCE COMPARISON OF DIFFERENT AGENTS

   Performance                Score
   Human Average              1.98 *
   Baseline Average           0.26 *
   Refined DQN Average        9.04 *
   Our Average                9.53
   Human Best                 15 *
   Baseline Best              2 *
   Refined DQN Best           17 *
   Our Best                   20

   * Data taken from [17]
                               TABLE VI
                      LIST OF HYPERPARAMETERS

   Hyperparameter             Value     Description
   Discount Factor            0.99      γ-value in the max Q-function
   Initial Epsilon            1.0       Initial value of the exploration epsilon
   Final Epsilon              0.01      Final value of the exploration epsilon
   Batch Size                 32        Mini-batch size sampled from replay memory
   Max Step                   10,000    Maximum number of steps allowed per episode
   Learning Rate              0.0025    Learning rate for the Adam optimizer
   Clip-Norm                  1.0       Gradient clipping value for the Adam optimizer
   Random Frames              50,000    Number of initial random steps
   Epsilon Greedy Frames      500,000   Number of frames over which epsilon is annealed
                                        from its initial to its final value
   Experience Replay Memory   50,000    Capacity of the experience replay memory
   Update of DQN              4         Number of steps between online DQN updates
   Update Target DQN          10,000    Number of steps between target and online
                                        DQN synchronization
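To show how the values in Table VI fit together, the snippet below
collects them into a single configuration and sketches the epsilon
annealing schedule and the two update frequencies. The numbers are
taken from Table VI; the exact way the random warm-up phase and the
annealing window are combined here is a simplified illustration,
not a line-by-line copy of our training loop.

# All numeric values below are taken from Table VI.
CONFIG = {
    "discount_factor": 0.99,           # gamma in the max Q-function target
    "initial_epsilon": 1.0,
    "final_epsilon": 0.01,
    "batch_size": 32,
    "max_steps_per_episode": 10_000,
    "learning_rate": 0.0025,           # Adam optimizer
    "clip_norm": 1.0,                  # gradient clipping for Adam
    "random_frames": 50_000,           # purely random initial steps
    "epsilon_greedy_frames": 500_000,  # frames over which epsilon is annealed
    "replay_capacity": 50_000,
    "online_update_every": 4,          # steps between online DQN updates
    "target_sync_every": 10_000,       # steps between target/online DQN syncs
}

def epsilon_at(frame: int) -> float:
    """Linearly anneal epsilon from 1.0 to 0.01 over 500,000 frames,
    after an initial 50,000 fully random frames."""
    if frame < CONFIG["random_frames"]:
        return CONFIG["initial_epsilon"]
    progress = (frame - CONFIG["random_frames"]) / CONFIG["epsilon_greedy_frames"]
    span = CONFIG["initial_epsilon"] - CONFIG["final_epsilon"]
    return max(CONFIG["final_epsilon"], CONFIG["initial_epsilon"] - progress * span)

def should_update_online(frame: int) -> bool:
    # Perform one gradient step on the online DQN every 4 environment steps.
    return frame % CONFIG["online_update_every"] == 0

def should_sync_target(frame: int) -> bool:
    # Copy the online weights into the target DQN every 10,000 steps.
    return frame % CONFIG["target_sync_every"] == 0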
                           V. CONCLUSION

   In this paper, we have shown that better image preprocessing
and a better replay buffer mechanism can reduce the memory
consumption of DRL algorithms during training. We have also
demonstrated that, using our method, the performance of the DRL
agent in a resource-constrained application is comparable, if not
better. We combined our method with a modified DQN algorithm to
observe the method's effectiveness. Our presented design requires
less memory and only a simple CNN. We established that our
method's results are as good as those of other DRL approaches for
the snake game autonomous agent.

                          ACKNOWLEDGMENT

   This work was supported by North South University research
grant CTRG-21-SEPS-18.
   The authors would like to gratefully acknowledge that the
computing resources used in this work were housed at the National
University of Sciences and Technology (NUST), Pakistan. The
cooperation was pursued under the South Asia Regional Development
Center (RDC) framework of the Belt & Road Aerospace Innovation
Alliance (BRAIA).

                            REFERENCES

 [1] C. J. C. H. Watkins and P. Dayan, "Q-learning," in Machine Learning,
     1992, pp. 279-292.
 [2] G. Tesauro, "Temporal difference learning and TD-Gammon," Commun.
     ACM, vol. 38, no. 3, pp. 58-68, Mar. 1995.
 [3] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient
     methods for reinforcement learning with function approximation," in
     Advances in Neural Information Processing Systems, S. Solla, T. Leen,
     and K. Müller, Eds., vol. 12. MIT Press, 1999.
 [4] J. Peters, S. Vijayakumar, and S. Schaal, "Natural actor-critic," in
     Machine Learning: ECML 2005. Berlin, Heidelberg: Springer Berlin
     Heidelberg, 2005, pp. 280-291.
 [5] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller,
     "Deterministic policy gradient algorithms," in Proceedings of the
     31st International Conference on International Conference on Machine
     Learning - Volume 32, ser. ICML'14. JMLR.org, 2014, pp. I-387-I-395.
 [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra,
     and M. A. Riedmiller, "Playing Atari with deep reinforcement learning,"
     Computing Research Repository, vol. abs/1312.5602, 2013.
 [7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare,
     A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen,
     C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra,
     S. Legg, and D. Hassabis, "Human-level control through deep
     reinforcement learning," Nature, vol. 518, pp. 529-533, Feb. 2015.
 [8] H. v. Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with
     double Q-learning," in Proceedings of the Thirtieth AAAI Conference on
     Artificial Intelligence, ser. AAAI'16. AAAI Press, 2016, pp. 2094-2100.
 [9] L.-J. Lin, "Self-improving reactive agents based on reinforcement
     learning, planning and teaching," Mach. Learn., vol. 8, no. 3-4,
     pp. 293-321, May 1992.
[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
     D. Silver, and D. Wierstra, "Continuous control with deep reinforcement
     learning," Computing Research Repository, 2019.
[11] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and S. Russell, "Robust
     multi-agent reinforcement learning via minimax deep deterministic
     policy gradient," Proceedings of the AAAI Conference on Artificial
     Intelligence, vol. 33, no. 01, pp. 4213-4220, Jul. 2019.
[12] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic:
     Off-policy maximum entropy deep reinforcement learning with a
     stochastic actor," in ICML, ser. Proceedings of Machine Learning
     Research, vol. 80. PMLR, 2018, pp. 1856-1865.
[13] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience
     replay," 2015. [Online]. Available: https://arxiv.org/abs/1511.05952
[14] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder,
     B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba, "Hindsight
     experience replay," in Advances in Neural Information Processing
     Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
     S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates,
     Inc., 2017.
[15] S. Zhang and R. S. Sutton, "A deeper look at experience replay,"
     Computing Research Repository, vol. abs/1712.01275, 2017.
[16] H. Hasselt, "Double Q-learning," in Advances in Neural Information
     Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel,
     and A. Culotta, Eds., vol. 23. Curran Associates, Inc., 2010.
[17] Z. Wei, D. Wang, M. Zhang, A.-H. Tan, C. Miao, and Y. Zhou,
     "Autonomous agents in snake game via deep reinforcement learning," in
     2018 IEEE International Conference on Agents (ICA), 2018, pp. 20-25.
[18] T. D. Nguyen, K. Mori, and R. Thawonmas, "Image colorization using
     a deep convolutional neural network," Computing Research Repository,
     vol. abs/1604.07904, 2016.
[19] A. Punyawee, C. Panumate, and H. Iida, "Finding comfortable settings
     of snake game using game refinement measurement," in Advances in
     Computer Science and Ubiquitous Computing. Singapore: Springer
     Singapore, 2017, pp. 66-73.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization,"
     in 3rd International Conference on Learning Representations, ICLR 2015,
     San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings,
     Y. Bengio and Y. LeCun, Eds., 2015.