Applied Sciences
Article
Mobile Robot Path Optimization Technique Based on
Reinforcement Learning Algorithm in Warehouse Environment
HyeokSoo Lee and Jongpil Jeong *

                                          Department of Smart Factory Convergence, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu,
                                          Suwon 16419, Korea; huxlee@g.skku.edu
                                          * Correspondence: jpjeong@skku.edu; Tel.: +82-31-299-4260

Abstract: This paper reports on the use of reinforcement learning technology for optimizing mobile robot paths in a warehouse environment with automated logistics. First, we compared the results of experiments conducted using two basic algorithms to identify the fundamentals required for planning the path of a mobile robot and utilizing reinforcement learning techniques for path optimization. The algorithms were tested using a path optimization simulation of a mobile robot under the same experimental environment and conditions. Thereafter, we attempted to improve on the previous experiment and conducted additional experiments to confirm the improvement. The experimental results helped us understand the characteristics of and differences between the reinforcement learning algorithms. The findings of this study will facilitate our understanding of the basic concepts of reinforcement learning for further studies on more complex and realistic path optimization algorithm development.

                                          Keywords: warehouse environment; mobile robot path optimization; reinforcement learning

         
         

Citation: Lee, H.; Jeong, J. Mobile Robot Path Optimization Technique Based on Reinforcement Learning Algorithm in Warehouse Environment. Appl. Sci. 2021, 11, 1209. https://doi.org/10.3390/app11031209

Academic Editor: Marina Paolanti
Received: 30 December 2020; Accepted: 26 January 2021; Published: 28 January 2021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The fourth industrial revolution and digital innovation have led to increased interest in the use of robots across industries; thus, robots are being used in various industrial sites. Robots can be broadly categorized as industrial or service robots. Industrial robots are automated equipment and machines used in industrial sites such as factory lines and are used for assembly, machining, inspection, and measurement as well as pressing and welding [1]. The use of service robots is expanding in manufacturing environments and other areas such as industrial sites, homes, and institutions. Service robots can be broadly divided into personal service robots and professional service robots. In particular, professional service robots are robots used by workers to perform tasks, and mobile robots are one type of professional service robot [2].

A mobile robot needs the following elements to perform tasks: localization, mapping, and path planning. First, the location of the robot must be known in relation to the surrounding environment. Second, a map is needed to identify the environment and move accordingly. Third, path planning is necessary to find the best path to perform a given task [3]. Path planning has two types: point-to-point and complete coverage. Complete coverage path planning is performed when all positions must be visited, such as for a cleaning robot. Point-to-point path planning is performed for moving from the start position to the target position [4].

To review the use of reinforcement learning as a path optimization technique for mobile robots in a warehouse environment, the following points must be understood. First, it is necessary to understand the warehouse environment and the movement of mobile robots. To do this, we must check the configuration and layout of the warehouse in which the mobile robot is used as well as the tasks and actions that the mobile robot performs in this environment. Second, it is necessary to understand the recommended research steps for the path planning and optimization of mobile robots. Third, it is necessary to understand the connections between layers and modules as well as their working when
they are integrated with mobile robots, environments, hardware, and software. With this basic information, a simulation program for real experiments can be implemented. Because reinforcement learning can be tested only when a specific experimental environment is prepared in advance, substantial preparation of the environment is required early in the study, and the researcher must prepare it. Subsequently, the reinforcement learning algorithm to be tested is integrated with the simulation environment for the experiment.

This paper discusses an experiment performed using a reinforcement learning algorithm as a path optimization technique in a warehouse environment. Section 2 briefly describes the warehouse environment, the concept of path planning for mobile robots, the concept of reinforcement learning, and the algorithms to be tested. Section 3 briefly describes the structure of the mobile robot path-optimization system, the main logic of path search, and the idea of dynamically adjusting the reward value according to the moving action. Section 4 describes the construction of the simulation environment for the experiments, the parameters and reward values to be used for the experiments, and the experimental results. Section 5 presents the conclusions and directions for future research.

2. Related Work

2.1. Warehouse Environment

2.1.1. Warehouse Layout and Components

The major components of an automated warehouse with mobile robots are as follows [5–10].

•    Robot drive device: It instructs the robot from the center to pick or load goods between the workstation and the inventory pod.
•    Inventory pod: This is a shelf rack that includes storage boxes. There are two standard sizes: a small pod can hold up to 450 kg, and a large pod can load up to 1300 kg.
•    Workstation: This is the work area in which the worker operates and performs tasks such as replenishment, picking, and packaging.

Figure 1 shows an example of a warehouse layout with inventory pods and workstations.

Figure 1. Sample Layout of Warehouse Environment.
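Purely as an illustration for this write-up (not from the paper), a layout such as the one in Figure 1 can be encoded as a small occupancy grid for simulation, with inventory pods treated as blocked cells and workstations as possible start or goal cells; the cell codes and helper function below are assumptions.

# Hypothetical grid encoding of a warehouse layout such as Figure 1.
# 0 = free aisle cell, 1 = inventory pod (blocked), 2 = workstation.
WAREHOUSE = [
    [2, 0, 0, 0, 0, 0, 0, 2],
    [0, 1, 1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 0],
    [2, 0, 0, 0, 0, 0, 0, 2],
]

def passable(cell):
    """A mobile robot may occupy aisle and workstation cells, but not pods."""
    row, col = cell
    inside = 0 <= row < len(WAREHOUSE) and 0 <= col < len(WAREHOUSE[0])
    return inside and WAREHOUSE[row][col] != 1

print(passable((1, 1)), passable((3, 0)))   # False (pod), True (aisle)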
2.1.2. Mobile Robot’s Movement Type

In the robotic mobile fulfillment system, a robot performs a storage trip, in which it transports goods, and a retrieval trip, in which it moves to pick up goods. After dropping the goods into the inventory pod, additional goods can be picked up or returned to the workstation [11].

The main movements of the mobile robot in the warehouse are to perform the following tasks [12].

1.    move goods from the workstation to the inventory pod;
2.    move from one inventory pod to another to pick up goods;
3.    move goods from the inventory pod to the workstation
2.2. Path Planning for Mobile Robot

For path planning in a given work environment, a mobile robot searches for an optimal path from the starting point to the target point according to specific performance criteria. In general, path planning is performed for two path types, global and local, depending on whether the mobile robot has environmental information. In the case of global path planning, all information about the environment is provided to the mobile robot before it starts moving. In the case of local path planning, the robot does not know most information about the environment before it moves [13,14].

The path planning development process for mobile robots usually follows the steps below, shown in Figure 2 [13].
                            •    Environmental modeling
                            •    Optimization criteria
                            •    Path finding algorithm

Figure 2. Development steps for global/local path optimization [13].

The classification of the path planning of the mobile robot is illustrated in Figure 3.

Figure 3. Classification of path planning and global path search algorithm [13].

                                  Dijkstra, A*, and D* algorithms for the heuristic method are the most commonly
                             used algorithms for path search in global path planning [13,15]. Traditionally, warehouse
                             path planning has been optimized using heuristics and meta heuristics. Most of these
                             methods work without additional learning and require significant computation time as
                             the problem becomes complex. In addition, it is challenging to design and understand
heuristic algorithms, making it difficult to maintain and scale the path. Therefore, interest in artificial intelligence technology for path planning is increasing [16].
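For contrast with the learning-based methods discussed next, the following minimal sketch (written for this edit, not taken from the paper) shows the kind of classical global path search named above, running Dijkstra's algorithm over a small hypothetical occupancy grid with unit step costs.

import heapq

# Hypothetical 0/1 occupancy grid: 0 = free aisle, 1 = blocked (e.g., inventory pod).
GRID = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]

def dijkstra(grid, start, goal):
    """Classical global path search: uniform-cost (Dijkstra) over 4-connected cells."""
    rows, cols = len(grid), len(grid[0])
    frontier = [(0, start, [start])]   # (cost so far, cell, path taken)
    visited = set()
    while frontier:
        cost, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return cost, path
        if cell in visited:
            continue
        visited.add(cell)
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                heapq.heappush(frontier, (cost + 1, (nr, nc), path + [(nr, nc)]))
    return None  # no path exists

print(dijkstra(GRID, (0, 0), (4, 4)))

Such search runs from scratch for every query and grows costly as the map and constraints become more complex, which is the limitation the learning-based approach below is meant to address.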
2.3. Reinforcement Learning Algorithm

Reinforcement learning is a method that uses trial and error to find a policy to reach a goal by learning through failure and reward. Deep neural networks learn weights and biases from labeled data, and reinforcement learning learns weights and biases using the concept of rewards [17].

The essential elements of reinforcement learning are as follows; see Figure 4 for how they interact. First, agents are the subjects of learning and decision making. Second, the environment refers to the external world recognized by the agent, and it interacts with the agent. Third is the policy: the agent finds actions to be performed from the state of the surrounding environment. The fourth element is the reward, information that tells the agent whether its behavior was good or bad. In reinforcement learning, positive rewards are given for actions that help agents reach their goals, and negative rewards are given for actions that prevent them from reaching their goals. The fifth element is the value function, which tells what is good in the long term; value is the total amount of reward an agent can expect. Sixth is the environmental model, which predicts the next state and the reward according to the current state. Planning is a way to decide what to do before experiencing future situations. Reinforcement learning using models and planning is called a model-based method, and reinforcement learning using trial and error is called a model-free method [18].

Figure 4. The essential elements and interactions of Reinforcement Learning [18].
The basic concept of reinforcement learning modeling aims to find the optimal policy for problem solving and can be defined on the basis of the Markov Decision Process (MDP). If the state and action pair is finite, this is called a finite MDP [18]. The most basic Markov process concept in MDP consists of a state s, a state transition probability p, and an action a performed. Here p is the probability that, when performing an action a in state s, the state will change to s' [18].

Formula (1), which defines the relationship between state, action, and probability, is as follows [18,19].

p(s' \mid s, a) = \Pr\{ S_t = s' \mid S_{t-1} = s, A_{t-1} = a \}    (1)
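As a small illustration of Formula (1) (the structure and values here are assumed, not the paper's), the transition probabilities of a finite MDP can be written out explicitly, for example as a nested dictionary keyed by state and action:

# Hypothetical finite MDP: p[s][a] maps each next state s' to Pr{S_t = s' | s, a}.
p = {
    "aisle": {
        "forward": {"aisle": 0.9, "at_pod": 0.1},
        "stop":    {"aisle": 1.0},
    },
    "at_pod": {
        "forward": {"aisle": 1.0},
        "stop":    {"at_pod": 1.0},
    },
}

# Probabilities over next states must sum to 1 for every (s, a) pair.
for s, actions in p.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9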

                                  The agent’s goal is to maximize the sum of regularly accumulated rewards, and the
                            expected return is the sum of rewards. In other words, using rewards to formalize the
                            concept of a goal is characteristic of reinforcement learning. In addition, the discount factor
                            is also important in the concept of reinforcement learning. The discount factor (γ) is a
                            coefficient that discounts the future expected reward and has a value between 0 and 1 [18].
The discounted return obtained from the rewards and the discount factor can be defined as the following Formula (2) [18,19].

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}    (2)
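The following snippet is a minimal numeric sketch of Formula (2) with arbitrarily chosen rewards and discount factor; it is included only to make the discounted return concrete.

def discounted_return(rewards, gamma=0.9):
    """Compute G_t at t = 0 for a finite episode: the sum of gamma**k * R_{k+1}."""
    g = 0.0
    for r in reversed(rewards):       # work backward so each step folds in gamma once
        g = r + gamma * g
    return g

# Example: small step penalties and a final goal reward, as in a path-finding task.
print(discounted_return([-1, -1, -1, 10], gamma=0.9))   # -1 + 0.9*(-1 + 0.9*(-1 + 0.9*10))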

The reinforcement learning algorithm has the concept of a value function, a function of a state that estimates whether the environment given to an agent is good or bad. Moreover, the value function is defined according to the behavioral method called the policy. A policy is the probability of choosing possible actions in a state [18].

Whereas the value function considers the cumulative reward and the change of state together, the state value function additionally takes the policy into account [20]. The state value function is the expected value of the return when policy π is followed starting from state s, and the formula is as follows [18,20].

v_\pi(s) = \mathbb{E}_\pi[ G_t \mid S_t = s ] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s \right]    (3)

Among the policies, the policy with the highest expected value of the return is the optimal policy, and the optimal policy has an optimal state value function and an optimal action value function [18]. The optimal state value function is given in Equation (4), and the optimal action value function is given in Equations (5) and (6) [18,20].

v_*(s) = \max_{\pi} v_\pi(s)    (4)

q_*(s, a) = \max_{\pi} q_\pi(s, a)    (5)

q_*(s, a) = \mathbb{E}[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a ]    (6)
The value of a state that follows an optimal policy is equal to the expected value of the return from the best action that can be chosen in that state [18]. The formulas are as follows [18,20].

q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a \right]    (7)

         = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]    (8)
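To make Formula (8) concrete, here is a brief sketch (hypothetical names and data layout, not code from the paper) of a single Bellman optimality backup for q*(s, a) when the transition probabilities and rewards are known:

# transitions[(s, a)] is a list of (probability, reward, next_state) triples,
# and q is a dict mapping (state, action) pairs to current value estimates.
def bellman_optimality_backup(q, transitions, actions, s, a, gamma=0.9):
    """One application of Formula (8): expected reward plus discounted best next value."""
    total = 0.0
    for prob, reward, s_next in transitions[(s, a)]:
        best_next = max(q[(s_next, a_next)] for a_next in actions)
        total += prob * (reward + gamma * best_next)
    return total

actions = ["left", "right"]
q = {(s, a): 0.0 for s in ("s1", "s2") for a in actions}
transitions = {("s1", "right"): [(1.0, 5.0, "s2")]}
print(bellman_optimality_backup(q, transitions, actions, "s1", "right"))   # 5.0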

2.3.1. Q-Learning

One of the most basic reinforcement learning algorithms is the off-policy temporal-difference control algorithm known as Q-Learning [18]. The Q-Learning algorithm stores each state and action in the Q Table in the form of a Q value; for action selection, it selects either a random action or the maximum-value action according to the epsilon-greedy value and stores the resulting reward values in the Q Table [18].
                                  The algorithm stores the values in the Q Table using Equation (9) below [21].

Q(s, a) = r(s, a) + \max_{a'} Q(s', a')    (9)

Equation (9) is generally converted into the update rule in Equation (10) by considering a factor of time [18,22,23].

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]    (10)

                                 The procedure of the Q-Learning algorithm is as follows (Algorithm 1) [18].
Algorithm 1: Q-Learning.
Initialize Q(s, a)
Loop for each episode:
    Initialize s (state)
    Loop for each step of episode until s (state) is destination:
        Choose a (action) from s (state) using policy from Q values
        Take a (action), find r (reward), s' (next state)
        Calculate Q value and Save it into Q Table
        Change s' (next state) to s (state)
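The following is a compact, self-contained Python sketch of Algorithm 1 written for this edit rather than taken from the paper's implementation; the grid world, reward values, and hyperparameters (alpha, gamma, epsilon) are assumptions chosen only for illustration.

import random

ROWS, COLS = 5, 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
GOAL = (4, 4)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1           # assumed hyperparameters

# Q Table: one entry per (state, action) pair, initialized to zero.
Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}

def step(state, action):
    """Move on the grid; -1 per step, +10 at the goal, walls keep the robot in place."""
    r, c = state
    nr, nc = max(0, min(ROWS - 1, r + action[0])), max(0, min(COLS - 1, c + action[1]))
    next_state = (nr, nc)
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward

def choose_action(state):
    """Epsilon-greedy selection: random action with probability epsilon, else greedy."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(500):
    s = (0, 0)                                  # start cell
    while s != GOAL:
        a = choose_action(s)
        s_next, r = step(s, a)
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        # Update rule from Equation (10).
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

After training, following the greedy action argmax_a Q(s, a) from the start cell traces the learned path to the goal.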
In reinforcement learning algorithms, the process of exploration and exploitation plays an important role. Initially, lower priority actions should be selected to investigate the best possible environment, and then select and proceed with higher priority actions [23]. The Q-Learning algorithm has become the basis of the reinforcement learning algorithm because it is simple and shows excellent learning ability in a single agent environment. However, Q-Learning updates only once per action; it is thus difficult to solve complex problems effectively when many states and actions have not yet been realized.

In addition, because the Q table for reward values is predefined, a considerable amount of storage memory is required. For this reason, the Q-Learning algorithm is disadvantageous in a complex and advanced multi-agent environment [24].
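As a back-of-the-envelope illustration of this memory concern (the numbers are assumed, not from the paper), the size of a tabular Q representation grows with the product of the state and action counts, and joint multi-robot states make it grow much faster:

# A 50 x 50 warehouse grid with 4 move actions already needs 10,000 table entries;
# representing two robots jointly squares the state count and multiplies the actions.
states_single = 50 * 50
actions = 4
print(states_single * actions)              # 10,000 Q values for one robot
print((states_single ** 2) * actions ** 2)  # joint two-robot table: 100,000,000 values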
2.3.2. Dyna-Q

Dyna-Q is a combination of the Q-Learning and Q-Planning algorithms. The Q-Learning algorithm changes a policy based on a real experience. The Q-Planning algorithm obtains simulated experience through search control and changes the policy based on a simulated experience [18]. Figure 5 illustrates the structure of the Dyna-Q model.

Figure 5. Dyna-Q model structure [18].

The Q-Planning algorithm randomly selects samples only from previously experienced information, and the Dyna-Q model obtains information for simulated experience through search control exactly as it obtains information through model learning based on a real experience [18]. The Dyna-Q update-equation is the same as that for Q-Learning (10) [22].

The Dyna-Q algorithm is as follows (Algorithm 2) [18].

Algorithm 2: Dyna-Q.
Initialize Q(s, a)
Loop forever:
(a) Get s (state)
(b) Choose a (action) from s (state) using policy from Q values
(c) Take a (action), find r (reward), s' (next state)
(d) Calculate Q value and save it into Q Table
(e) Save s (state), a (action), r (reward), s' (next state) in memory
(f) Loop for n times:
        Take s (state), a (action), r (reward), s' (next state) from memory
        Calculate Q value and save it into Q Table
(g) Change s' (next state) to s (state)
     Steps (a) to (d) and (g) are the same as in Q-Learning. Steps (e) and (f), which are added in the Dyna-Q algorithm, store past experiences in memory and use them to generate the next state and a more accurate reward value.
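     To make the planning steps (e) and (f) concrete, the following is a minimal Python sketch of a tabular Dyna-Q loop. The environment interface (reset, step, actions), the epsilon-greedy exploration, and the planning count n_planning are illustrative assumptions rather than the exact implementation used in our experiments.

import random
from collections import defaultdict

def dyna_q(env, episodes=5000, n_planning=10, alpha=0.01, gamma=0.9, epsilon=0.1):
    # Q table and model memory; the memory stores (s, a) -> (r, s') from real experience.
    Q = defaultdict(float)
    memory = {}
    for _ in range(episodes):
        s = env.reset()                                    # (a) get state
        done = False
        while not done:
            if random.random() < epsilon:                  # (b) choose action from Q values
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            r, s_next, done = env.step(a)                  # (c) take action, observe r and s'
            best_next = max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # (d) update Q table
            memory[(s, a)] = (r, s_next)                   # (e) save experience in memory
            for _ in range(n_planning):                    # (f) Q-Planning from stored samples
                (ps, pa), (pr, pn) = random.choice(list(memory.items()))
                p_best = max(Q[(pn, act)] for act in env.actions)
                Q[(ps, pa)] += alpha * (pr + gamma * p_best - Q[(ps, pa)])
            s = s_next                                     # (g) s' becomes s
    return Q

     In this sketch, the real-experience update in step (d) and the planning updates in step (f) apply the same update rule, which is why the Dyna-Q update equation coincides with Equation (10).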
3. Reinforcement Learning-Based Mobile Robot Path Optimization

3.1. System Structure
     In the mobile robot framework structure in Figure 6, the agent is controlled by the supervisor controller, and the robot controller is executed by the agent, so that the robot operates in the warehouse environment and the necessary information is observed and collected [25–27].

Figure 6. Warehouse mobile robot framework structure.

3.2. Operation Procedure

     The mobile robot framework works as follows. At the beginning, the simulator is set to its initial state, and the mobile robot collects the first observation information and sends it to the supervisor controller. The supervisor controller verifies the received information and communicates the necessary action to the agent. At this stage, the supervisor controller can obtain additional information if necessary. The supervisor controller can also communicate job information to the agent. The agent conveys the action to the mobile robot through the robot controller, and the mobile robot performs the received action using actuators. After the action is performed by the robot controller, the changed position and status information is transmitted to the agent and the supervisor controller, and the supervisor controller calculates the reward value for the performed action. Subsequently, the action, reward value, and status information are stored in memory. These procedures are repeated until the mobile robot achieves the defined goal, the defined number of epochs/episodes, or conditions that satisfy the goal [25].
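     The procedure described above can be summarized as a control loop. The following is a minimal Python sketch of one episode of that loop; the component objects (simulator, supervisor, agent, robot) and their method names are hypothetical placeholders for the framework elements in Figure 6, not the exact interfaces of our implementation.

def run_episode(simulator, supervisor, agent, robot, max_steps=800):
    # One training episode following the operation procedure above.
    simulator.reset()                                   # simulator set to its initial state
    observation = robot.observe()                       # first observation from the mobile robot
    history = []
    for _ in range(max_steps):
        supervisor.verify(observation)                  # supervisor checks the received information
        action = agent.choose_action(observation)       # agent decides the action to convey
        robot.execute(action)                           # robot controller drives the actuators
        observation = robot.observe()                   # changed position and status information
        reward = supervisor.compute_reward(observation) # supervisor calculates the reward value
        history.append((action, reward, observation))   # store action, reward, and status in memory
        if supervisor.goal_reached(observation):        # repeat until the defined goal is achieved
            break
    return history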
     In Figure 7, we show an example class that is composed of the essential functions and main logic required by the supervisor controller for path planning.

Figure 7. Required methods and logic of path finding [25].

3.3. Dynamic Reward Adjustment Based on Action

     The simplest type of reward for the movement of a mobile robot in a grid-type warehouse environment is achieved by specifying a negative reward value for an obstacle and a positive reward value for the target location. The direction of the mobile robot is determined by its current and target positions. The path planning algorithm requires significant time to find the best path by repeating many random path-finding attempts. With the basic reward method introduced earlier, an agent searches randomly for a path regardless of whether it is located far from or close to the target position. To improve on these vulnerabilities, we considered the reward method shown in Figure 8.
     The suggestions are as follows. First, when the travel path and target position are determined, a partial area close to the target position is selected (a reward adjustment area close to the target position). Second, a higher reward value than in the basic reward method is assigned to the path positions within the area selected in the previous step. Third, whenever the movement path changes, a new area is selected in the same way and the path positions within it are set to the high reward value. We call this dynamic reward adjustment based on action, and the following (Algorithm 3) describes the process.
Algorithm 3: Dynamic reward adjustment based on action.
1. Get new destination to move.
2. Select a partial area near the target position.
3. A higher positive reward value is assigned to the selected area, excluding the target point and obstacles.
(Highest positive reward value to target position)

Figure 8. Dynamically adjusting reward values for the moving action.
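     A minimal Python sketch of Algorithm 3 is given below. The grid is represented as a 2D list of reward values; the half-width of the selected area is an illustrative assumption, and the default reward constants follow the values used later in Section 4.3 (obstacle −1, target 10, selected area 0.3).

def adjust_rewards(reward_grid, obstacles, target, area_half_width=3,
                   area_reward=0.3, target_reward=10.0, obstacle_reward=-1.0):
    # Dynamic reward adjustment based on action (Algorithm 3).
    # reward_grid: 2D list of floats; obstacles: set of (row, col); target: (row, col).
    rows, cols = len(reward_grid), len(reward_grid[0])
    tr, tc = target
    # Step 2: select a partial area near the (new) target position.
    for r in range(max(0, tr - area_half_width), min(rows, tr + area_half_width + 1)):
        for c in range(max(0, tc - area_half_width), min(cols, tc + area_half_width + 1)):
            if (r, c) in obstacles:
                reward_grid[r][c] = obstacle_reward      # obstacles keep their negative reward
            elif (r, c) != (tr, tc):
                reward_grid[r][c] = area_reward          # step 3: higher reward near the target
    reward_grid[tr][tc] = target_reward                  # highest positive reward at the target position
    return reward_grid

     The function would be called in step 1 whenever a new destination is assigned, so that the high-reward area follows the current target.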

4. Test and Results

4.1. Warehouse Simulation Environment

     To create a simulation environment, we referred to the paper [23] and built a virtual experiment environment based on the previously described warehouse environment. A total of 15 inventory pods were included in a 25 × 25 grid layout. A storage trip of a mobile robot from the workstation to the inventory pod was simulated. This experiment was conducted to find the optimal path, avoiding the inventory pods, from the starting position, where a single mobile robot was defined, to the target position. Figure 9 shows the target position for the experiments.

Figure 9. Target position of simulation experiment.

4.2. Test Parameter Values

     Q-Learning and Dyna-Q algorithms were tested in a simulation environment. Table 1 shows the parameter values used in the experiment.

Table 1. Parameter values for experiment.

Parameter              Q-Learning    Dyna-Q
Learning Rate (α)      0.01          0.01
Discount Rate (γ)      0.9           0.9

4.3. Test Reward Values

     The experiment was conducted with the following reward methods.
•    Basic method: The obstacle was set as −1, the target position as 1, and the other positions as 0.
•    Dynamic reward adjustment based on action: To check the efficacy of the proposal, an experiment was conducted designating set values similar to those in the proposed method, as shown in Figure 10 [28–30]. The obstacle was set as −1, the target position as 10, the positions in the selected area (the area close to the target) as 0.3, and the other positions as 0.

Figure 10. Selected area for dynamic reward adjustment based on action.
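     To make the two reward settings concrete, the following sketch builds both reward grids with the values listed above, reusing the adjust_rewards helper sketched after Algorithm 3; the obstacle coordinates and target position are placeholders, and only the reward constants come from the settings above.

GRID = 25                                        # 25 x 25 grid layout used in the experiments
obstacles = {(5, 5), (5, 6), (12, 9)}            # placeholder inventory-pod cells
target = (20, 18)                                # placeholder target position

# Basic method: obstacle -1, target position 1, all other positions 0.
basic = [[0.0] * GRID for _ in range(GRID)]
for (r, c) in obstacles:
    basic[r][c] = -1.0
basic[target[0]][target[1]] = 1.0

# Dynamic reward adjustment based on action: obstacle -1, target position 10,
# positions in the selected area 0.3, all other positions 0.
dynamic = [[0.0] * GRID for _ in range(GRID)]
for (r, c) in obstacles:
    dynamic[r][c] = -1.0
dynamic = adjust_rewards(dynamic, obstacles, target,
                         area_reward=0.3, target_reward=10.0, obstacle_reward=-1.0)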

                            4.4. Results
                                  First, we tested two basic reinforcement learning algorithms: the Q-Learning algorithm
and the Dyna-Q algorithm. Because results may deviate between runs, the experiment was performed five times for each algorithm under the same conditions and environment. The following points were verified
                            and compared using the experimental results. The optimization efficiency of each algorithm
                            was checked based on the length of the final path, and the path optimization performance
                            was confirmed by the path search time. We were also able to identify learning patterns by
                            the number of learning steps per episode.
                                  The path search time of the Q-Learning algorithm was 19.18 min for the first test,
                            18.17 min for the second test, 24.28 min for the third test, 33.92 min for the fourth test, and
                            19.91 min for the fifth test. It took about 23.09 min on average. The path search time of the
                            Dyna-Q algorithm was 90.35 min for the first test, 108.04 min for the second test, 93.01 min
                            for the third test, 100.79 min for the fourth test, and 81.99 min for the fifth test. The overall
                            average took about 94.83 min.
                                  The final path length of the Q-Learning algorithm was 65 steps for the first test, 34 steps
                            for the second test, 60 steps for the third test, 91 steps for the fourth test, 49 steps for the
                            fifth test, and the average was 59.8 steps. The final path length of the Dyna-Q algorithm
                            was 38 steps for the first test, 38 steps for the second test, 38 steps for the third test, 34 steps
                            for the fourth test, 38 steps for the fifth test, and the average was 37.2 steps.
                                  The Q-Learning algorithm was faster than Dyna-Q in terms of path search time,
                            and Dyna-Q selected shorter and simpler final paths than those selected by Q-Learning.
     Figures 11 and 12 show the learning pattern through the number of training steps per episode. Q-Learning found the path by repeating many steps throughout the 5000 episodes, whereas Dyna-Q stabilized after finding the optimal path between 1500 and 3000 episodes. In both algorithms, the maximum number of steps rarely exceeded 800. Each test ran 5000 episodes. In the case of the Q-Learning algorithm, the second test exceeded 800 steps 5 times (0.1%) and the fourth test exceeded 800 steps 5 times (0.1%), so the Q-Learning tests exceeded 800 steps at a total level of 0.04% across the five tests. In the case of the Dyna-Q algorithm, the first test exceeded 800 steps once (0.02%), the third test 7 times (0.14%), and the fifth test 3 times (0.06%), so the Dyna-Q tests exceeded 800 steps at a total level of 0.044% across the five tests. To relate the path search time to the learning pattern, we compared experiment (e) of the two algorithms. The path search time was 19.91 min for the Q-Learning algorithm and 81.99 min for Dyna-Q. Looking at the learning patterns in Figures 11 and 12, the Q-Learning algorithm learned by executing 200 to 800 steps per episode until the end of the 5000 episodes, whereas the Dyna-Q algorithm did so only until around the 2000th episode. Nevertheless, the Q-Learning search time was less than a quarter of the Dyna-Q search time. From this perspective, it was confirmed that Dyna-Q takes longer to learn in a single episode than Q-Learning.

Figure 11. Path optimization learning progress status of Q-Learning: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.

Figure 12. Path optimization learning progress status of Dyna-Q: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.

                                             Subsequently, additional experiments were conducted by applying ‘dynamic reward
                                      adjustment
                                      adjustment based based on action’ to the Q-Learning algorithm. The                 The path
                                                                                                                               path search time of this
                                      experiment was 19.04 min for the first test, 15.21 min for the second test,test,
                                      experiment       was    19.04   min   for the   first  test,   15.21  min    for the   second            17.05
                                                                                                                                            17.05  minmin      for
                                                                                                                                                          for the
                                      the third
                                      third   test,test,
                                                    17.5117.51
                                                           min min      forfourth
                                                                   for the   the fourth
                                                                                     test, test,   and 17.38
                                                                                            and 17.38            minthe
                                                                                                           min for     forfifth
                                                                                                                            the test.
                                                                                                                                 fifth Ittest.
                                                                                                                                            tookIt about
                                                                                                                                                    took about
                                                                                                                                                            17.23
                                      17.23on
                                      min     min    on average.
                                                 average.    The finalThepath
                                                                           finallength
                                                                                  path length
                                                                                           of this of    this experiment
                                                                                                      experiment     was 38 wassteps38for  steps   for the
                                                                                                                                             the first       first
test, 34 steps for the second test, 34 steps for the third test, 34 steps for the fourth test, and 34 steps for the fifth test. The average was 34.8 steps. The experimental results can be seen in Figure 13, which also includes the results of the Q-Learning and Dyna-Q experiments conducted previously. In the case of the Q-Learning algorithm to which 'dynamic reward adjustment based on action' was applied, the path search time was slightly faster than that of Q-Learning, and the final path length was confirmed to improve to a level similar to that of Dyna-Q. Figure 14 shows the learning pattern obtained in the additional experiments compared with that of Q-Learning; no notable difference is observed. Figure 15 shows, on the map, the path found in the experiment with the Q-Learning algorithm applying 'dynamic reward adjustment based on action'.
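As a rough sketch of how such a comparison can be summarized (each test records the elapsed search time and the length of the final path, and the per-algorithm averages are what Figure 13 plots), the following helper is illustrative only; run_experiment and its return value are assumptions, not code from this study:

import time

def summarize_runs(run_experiment, n_tests=5):
    """Repeat a path-search experiment and average the two metrics compared in Figure 13."""
    steps, elapsed = [], []
    for _ in range(n_tests):
        start = time.perf_counter()
        final_path = run_experiment()          # hypothetical: returns the learned path as a list of grid cells
        elapsed.append(time.perf_counter() - start)
        steps.append(len(final_path) - 1)      # number of moves between consecutive cells
    return sum(steps) / n_tests, sum(elapsed) / n_tests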

Figure 13. Comparison of experiment results: (a) Total elapsed time; (b) Length of final path.

Figure 14. Path optimization learning progress status of Q-Learning with action based reward: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.

Figure 15. Path results of Q-Learning with action based reward: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.

5. Conclusions
     In this paper, we reviewed the use of reinforcement learning as a path optimization technique in a warehouse environment. We built a simulation environment for testing path navigation in a warehouse and tested the two most basic reinforcement learning algorithms, Q-Learning and Dyna-Q. We then compared the experimental results in terms of the path search times and the final path lengths. Averaged over the five experiments, the path search time of the Q-Learning algorithm was about 4.1 times faster than that of the Dyna-Q algorithm, and the final path length of the Dyna-Q algorithm was about 37% shorter than that of the Q-Learning algorithm. As a result, both the Q-Learning and Dyna-Q algorithms were unsatisfactory in terms of either the path search time or the final path length. To improve the results, we
conducted various experiments using the Q-Learning algorithm and found that it does not perform well when searching for paths in a complex environment with many obstacles. The Q-Learning algorithm has the property of finding a path at random. To minimize these random path-search trials, the target position for movement was fixed and the reward value was set high around it; with this 'dynamic reward adjustment based on action', the performance and efficiency of the agent improved, because agents mainly explore locations with high reward values. In the additional experiments, the path search time was about 5.5 times faster than that of the Dyna-Q algorithm, and the final path length was about 71% shorter than that of the Q-Learning algorithm. In other words, the path search time remained slightly faster than that of the Q-Learning algorithm, while the final path length improved to the same level as that of the Dyna-Q algorithm.
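The implementation of 'dynamic reward adjustment based on action' is not listed in this paper, but its core idea can be sketched as follows: in addition to the usual step penalty and goal reward, an action that brings the agent closer to the known target position earns a small bonus, so exploration concentrates around the goal. All constants and the environment interface below are illustrative assumptions rather than the authors' code.

import numpy as np

# Illustrative constants; the values actually used in the experiments are not assumed here.
STEP_PENALTY = -1.0        # cost of any move
GOAL_REWARD = 100.0        # reward for reaching the target cell
NEAR_GOAL_BONUS = 5.0      # extra reward when a move reduces the distance to the target
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def shaped_reward(state, next_state, goal):
    """Base reward plus an action-based bonus toward the target position."""
    if next_state == goal:
        return GOAL_REWARD
    d_before = abs(state[0] - goal[0]) + abs(state[1] - goal[1])
    d_after = abs(next_state[0] - goal[0]) + abs(next_state[1] - goal[1])
    return STEP_PENALTY + (NEAR_GOAL_BONUS if d_after < d_before else 0.0)

def q_learning_episode(Q, env, goal, rng):
    """One tabular Q-Learning episode with the shaped reward.

    Q is a dict mapping state tuples to arrays of len(ACTIONS) action values;
    env is a hypothetical grid-world with reset() and step(state, action) -> (next_state, done).
    """
    state = env.reset()
    done = False
    while not done:
        if rng.random() < EPSILON:                 # epsilon-greedy action selection
            action = int(rng.integers(len(ACTIONS)))
        else:
            action = int(np.argmax(Q[state]))
        next_state, done = env.step(state, ACTIONS[action])
        reward = shaped_reward(state, next_state, goal)
        Q[state][action] += ALPHA * (reward + GAMMA * np.max(Q[next_state]) - Q[state][action])
        state = next_state
    return Q

In this sketch the Manhattan distance to the goal decides whether an action counts as moving toward the target; other distance measures, or a precomputed high-reward region around the goal, would serve the same purpose.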
     It is difficult to minimize the path search time and, at the same time, achieve high path search accuracy. If the learning rate is too high, the algorithm will not converge to the optimal value function, and if it is set too low, learning may take too long. In this paper, we experimentally confirmed that a simple technique for reducing inefficient exploration can improve path search accuracy while maintaining the path search time.
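For reference, the 'learning speed' mentioned here corresponds to the learning rate \alpha in the standard tabular Q-Learning update (with discount factor \gamma):

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

A large \alpha lets each new sample overwrite earlier estimates and can prevent convergence to the optimal value function, whereas a very small \alpha makes the estimates change slowly and lengthens training.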
     In this study, we examined how to use reinforcement learning for a single agent and a path optimization technique for a single path, but warehouse environments in the field are much more complex and change dynamically. In an automated warehouse, multiple mobile robots work simultaneously. This basic study is therefore an example of improving a basic algorithm under a single-agent, single-path condition, and it is not sufficient for applying the algorithm to an actual warehouse environment. Based on the obtained results, an in-depth study of improved reinforcement learning algorithms is needed, considering highly challenging multi-agent environments, in order to solve more complex and realistic problems.

                                 Author Contributions: Conceptualization, H.L. and J.J.; methodology, H.L.; software, H.L.; vali-
                                 dation, H.L. and J.J.; formal analysis, H.L.; investigation, H.L.; resources, J.J.; data curation, H.L.;
                                 writing–original draft preparation, H.L.; writing—review and editing, J.J.; visualization, H.L.; super-
                                 vision, J.J.; project administration, J.J.; funding acquisition, J.J. All authors have read and agreed to
                                 the published version of the manuscript.
                                 Funding: This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the
                                 ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01417) super-
                                 vised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
                                 Acknowledgments: This work was supported by the Smart Factory Technological R&D Program
                                 S2727115 funded by Ministry of SMEs and Startups (MSS, Korea).
                                  Conflicts of Interest: The authors declare no conflict of interest.

References
1.    Hajduk, M.; Koukolová, L. Trends in Industrial and Service Robot Application. Appl. Mech. Mater. 2015, 791, 161–165. [CrossRef]
2.    Belanche, D.; Casaló, L.V.; Flavián, C. Service robot implementation: A theoretical framework and research agenda. Serv. Ind. J.
      2020, 40, 203–225. [CrossRef]
3.    Fethi, D.; Nemra, A.; Louadj, K.; Hamerlain, M. Simultaneous localization, mapping, and path planning for unmanned vehicle
      using optimal control. SAGE 2018, 10, 168781401773665. [CrossRef]
4.    Zhou, P.; Wang, Z.-M.; Li, Z.-N.; Li, Y. Complete Coverage Path Planning of Mobile Robot Based on Dynamic Programming
      Algorithm. In Proceedings of the 2nd International Conference on Electronic & Mechanical Engineering and Information
      Technology (EMEIT-2012), Shenyang, China, 7 September 2012; pp. 1837–1841.
5.    Azadeh, K.; Koster, M.; Roy, D. Robotized and Automated Warehouse Systems: Review and Recent Developments. INFORMS
      2019, 53, 917–1212. [CrossRef]
6.    Koster, R.; Le-Duc, T.; Jan Roodbergen, K. Design and control of warehouse order picking: A literature review. EJOR 2007, 182,
      481–501. [CrossRef]
Appl. Sci. 2021, 11, 1209                                                                                                          16 of 16

7.    Jan Roodbergen, K.; Vis, I.F.A. A model for warehouse layout. IIE Trans. 2006, 38, 799–811. [CrossRef]
8.    Larson, T.N.; March, H.; Kusiak, A. A heuristic approach to warehouse layout with class-based storage. IIE Trans. 1997, 29,
      337–348. [CrossRef]
9.    Kumar, N.V.; Kumar, C.S. Development of collision free path planning algorithm for warehouse mobile robot. In Proceedings
      of the International Conference on Robotics and Smart Manufacturing (RoSMa2018), Chennai, India, 19–21 July 2018; pp. 456–463.
10.   Sedighi, K.H.; Ashenayi, K.; Manikas, T.W.; Wainwright, R.L.; Tai, H.-M. Autonomous local path planning for a mobile robot
      using a genetic algorithm. In Proceedings of the 2004 Congress on Evolutionary Computation, Portland, OR, USA, 19–23 June
      2004; pp. 456–463.
11.   Lamballais, T.; Roy, D.; De Koster, M.B.M. Estimating performance in a Robotic Mobile Fulfillment System. EJOR 2017, 256,
      976–990. [CrossRef]
12.   Merschformann, M.; Lamballais, T.; de Koster, M.B.M.; Suhl, L. Decision rules for robotic mobile fulfillment systems. Oper.
      Res. Perspect. 2019, 6, 100128. [CrossRef]
13.   Zhang, H.-Y.; Lin, W.-M.; Chen, A.-X. Path Planning for the Mobile Robot: A Review. Symmetry 2018, 10, 450. [CrossRef]
14.   Han, W.-G.; Baek, S.-M.; Kuc, T.-Y. Genetic algorithm based path planning and dynamic obstacle avoidance of mobile robots.
      In Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and
      Simulation, Orlando, FL, USA, 12–15 October 1997.
15.   Nurmaini, S.; Tutuko, B. Intelligent Robotics Navigation System: Problems, Methods, and Algorithm. IJECE 2017, 7, 3711–3726.
      [CrossRef]
16.   Johan, O. Warehouse Vehicle Routing Using Deep Reinforcement Learning. Master’s Thesis, Uppsala University, Uppsala,
      Sweden, November 2019.
17.   Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron. Available online: https://www.oreilly.com/
      library/view/hands-on-machine-learning/9781491962282/ch01.html (accessed on 22 December 2020).
18.   Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning, 2nd ed.; MIT Press: London, UK, 2018; pp. 1–528.
19.   Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed.; Wiley: New York, NY, USA, 1994; pp.
      1–649.
20.   Deep Learning Wizard. Available online: https://www.deeplearningwizard.com/deep_learning/deep_reinforcement_learning_
      pytorch/bellman_mdp/ (accessed on 22 December 2020).
21.   Meerza, S.I.A.; Islam, M.; Uzzal, M.M. Q-Learning Based Particle Swarm Optimization Algorithm for Optimal Path Planning of
      Swarm of Mobile Robots. In Proceedings of the 1st International Conference on Advances in Science, Engineering and Robotics
      Technology, Dhaka, Bangladesh, 3–5 May 2019.
22.   Hsu, Y.-P.; Jiang, W.-C. A Fast Learning Agent Based on the Dyna Architecture. J. Inf. Sci. Eng. 2014, 30, 1807–1823.
23.   Sichkar, V.N. Reinforcement Learning Algorithms in Global Path Planning for Mobile Robot. In Proceedings of the 2019
      International Conference on Industrial Engineering Applications and Manufacturing, Sochi, Russia, 25–29 March 2019.
24.   Jang, B.; Kim, M.; Harerimana, G.; Kim, J.W. Q-Learning Algorithms: A Comprehensive Classification and Applications. IEEE
      Access 2019, 7, 133653–133667. [CrossRef]
25.   Kirtas, M.; Tsampazis, K.; Passalis, N. Deepbots: A Webots-Based Deep Reinforcement Learning Framework for Robotics. In
      Proceedings of the 16th IFIP WG 12.5 International Conference AIAI 2020, Marmaras, Greece, 5–7 June 2020; pp. 64–75.
26.   Altuntas, N.; Imal, E.; Emanet, N.; Ozturk, C.N. Reinforcement learning-based mobile robot navigation. Turk. J. Electr. Eng.
      Comput. Sci. 2016, 24, 1747–1767. [CrossRef]
27.   Glavic, M.; Fonteneau, R.; Ernst, D. Reinforcement Learning for Electric Power System Decision and Control: Past Considerations
      and Perspectives. In Proceedings of the 20th IFAC World Congress, Toulouse, France, 14 July 2017; pp. 6918–6927.
28.   Wang, Y.-C.; Usher, J.M. Application of reinforcement learning for agent-based production scheduling. Eng. Appl. Artif. Intell.
      2005, 18, 73–82. [CrossRef]
29.   Mehta, N.; Natarajan, S.; Tadepalli, P.; Fern, A. Transfer in variable-reward hierarchical reinforcement learning. Mach. Learn. 2008,
      73, 289. [CrossRef]
30.   Hu, Z.; Wan, K.; Gao, X.; Zhai, Y. A Dynamic Adjusting Reward Function Method for Deep Reinforcement Learning with
      Adjustable Parameters. Math. Probl. Eng. 2019, 2019, 7619483. [CrossRef]