A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator

Yangyang Hou, Huajie Hong * , Dasheng Xu, Zhe Zeng, Yaping Chen and Zhaoyang Liu

 College of Intelligence Science and Technology, National University of Defense Technology,
 Changsha 410073, China; houyangyang14@nudt.edu.cn (Y.H.); ixudasheng@163.com (D.X.);
 zengzhe@nudt.edu.cn (Z.Z.); yaping_chen2021@163.com (Y.C.); ZhaoyangLiuNUDT@163.com (Z.L.)
 * Correspondence: honghuajie@nudt.edu.cn; Tel.: +86-138-7313-0046

Citation: Hou, Y.; Hong, H.; Xu, D.; Zeng, Z.; Chen, Y.; Liu, Z. A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator. Actuators 2021, 10, 254. https://doi.org/10.3390/act10100254

Academic Editors: Ioan Doroftei and Karsten Berns

Received: 11 August 2021; Accepted: 20 September 2021; Published: 29 September 2021

Abstract: Deep Reinforcement Learning (DRL) has been an active research area in view of its capability in solving large-scale control problems. Until presently, many algorithms have been developed, such as Deep Deterministic Policy Gradient (DDPG), Twin-Delayed Deep Deterministic Policy Gradient (TD3), and so on. However, the convergence of DRL often requires extensive collected data sets and training episodes, which is data-inefficient and consumes considerable computing resources. Motivated by this problem, in this paper we propose a Twin-Delayed Deep Deterministic Policy Gradient algorithm with a Rebirth Mechanism, Tetanic Stimulation and Amnesic Mechanisms (ATRTD3) for continuous control of a multi-DOF manipulator. In the training process of the proposed algorithm, the weighting parameters of the neural network are learned using Tetanic stimulation and the Amnesia mechanism. The main contribution of this paper is that we show a biomimetic view to speed up the converging process by imitating the biochemical reactions generated by neurons in the biological brain during memory and forgetting. The effectiveness of the proposed algorithm is validated by a simulation example including comparisons with previously developed DRL algorithms. The results indicate that our approach shows performance improvement in terms of convergence speed and precision.

Keywords: multi-DOF manipulator; tetanic stimulation; amnesia mechanism; deep reinforcement learning

1. Introduction

Deep Reinforcement Learning (DRL) is an advanced intelligent control method. It uses a neural network to parameterize the Markov decision process (MDP). DRL has been successfully applied to the fields of robots [1–4], machine translation [5], autonomous driving [6], and target positioning [7], and shows strong adaptability. There are two kinds of DRL algorithms. One kind is based on a value function, such as Deep Q Network (DQN) [8] and Nature DQN [9]; the output of a value-function-based DRL algorithm is a discrete state-action value. The other kind is policy-based, such as Deep Deterministic Policy Gradient (DDPG) [10], Trust Region Policy Optimization (TRPO) [11], Asynchronous Advantage Actor-Critic (A3C) [12], Distributed Proximal Policy Optimization (DPPO) [13,14], and Twin-Delayed Deep Deterministic Policy Gradient (TD3) [15]. For continuous action spaces, an advanced exploration policy can improve the sampling efficiency of the underlying algorithms [10]. Many research results focus on improving the exploration policy. Among them, Fortunato et al. [16] and Plappert et al. [17] put forward noise-based exploration policies that add noise to the action space and observation space. Bellemare et al. propose an exploration algorithm based on pseudo-counting: it evaluates visit frequency by designing a density model that satisfies certain properties and computes pseudo-counts that generalize to continuous spaces to encourage exploration [18]. In [19], Fox et al.

 innovatively propose a framework, DORA, that uses two parallel MDP processes to inject
 exploration signals into random tasks. Reward-based exploration results in a slower
 approximation of functions and failure to provide an intrinsic reward signal in time. So,
 Badia et al. [20] propose a “never give up” exploration strategy (NGU), which is designed
 to quickly prevent repeated access to the same state in the same episode.
Due to its coupling and complexity, the multi-degree-of-freedom (multi-DOF) manipulator is a popular application domain for DRL. The direct applications of DRL algorithms
 have, so far, been restricted to simulated settings and relatively simple tasks, due to their
 apparent high sample complexity [21]. Therefore, the application of DRL in multi-DOF
 robots requires a more effective DRL algorithm. Neuroscience provides a rich source of
 inspiration for new types of algorithms and architectures, independent of and complemen-
 tary to the mathematical and logic-based methods and ideas that have largely dominated
 traditional approaches to AI [22].
 In this paper, we design a new DRL algorithm named ATRTD3 based on the research
 results of neuroscience and the analysis of the human brain memory learning process.
 Our team shows a biomimetic view to speed up the converging process by biochemical
 reactions generated by neurons in the biological brain during memory and forgetting.
We apply the neural network parameter updating mechanism with Tetanic stimulation and the Amnesia mechanism to the DRL algorithm to further improve its efficiency in manipulator applications.

 2. Related Work
 The advancement of DRL has led to the development of intelligent control of a multi-
DOF manipulator in recent years. Kim et al. [23] propose a motion planning algorithm for robot manipulators using TD3, whose planned paths are smoother and shorter after 140,000 training episodes than those designed by a Probabilistic Roadmap. Based on the classic DDPG algorithm, Zhang et al. smoothly add Gaussian parameters to improve the exploratory nature of the algorithm, dynamically set the robot grasping space parameters to adapt to workspaces of multiple scales, and realize accurate robot grasping [24]. Robert Kwiatkowski et al. [25] used deep learning methods to make the manipulator build a self-model after 35 h of training. By comparing the application of DDPG and Proximal Policy Optimization (PPO) in the manipulator, Iriondo et al. [26] concluded that the current DRL algorithms could not obtain robust motion ability and acceptable training efficiency. The difficulty of applying DRL to the motion control of the multi-DOF manipulator lies in how to improve exploration efficiency and the robustness of the output actions of the manipulator. Therefore, it is necessary to draw inspiration from the research results of neuroscience. For flexible manipulators, some research [27,28] is very interesting, which brings some help to the kinematics modeling and control design in this paper.
 Long-term potentiation (LTP) is a form of activity-dependent plasticity which results
 in a persistent enhancement of synaptic transmission. LTP has been a source of great
 fascination to neuroscientists since its discovery in the early 1970s [29] because it satisfies
 the criteria proposed by Donald Hebb for a synaptic memory mechanism in his influential
 book ‘The Organization of Behavior’ [30].
 LTP is a persistent enhancement of excitatory synaptic transmission induced by some
 kinds of preceding operations of high-frequency stimulation (HFS) [31]. In LTP, stimulation
 changes the synaptic proteins, that is, changes the sensitivity of postsynaptic neurons to
 presynaptic neurons, thus changing the strength and efficiency of synaptic signal transmis-
 sion. Memory formation is considered to be the result of long-term synaptic plasticities,
 such as long-term depression (LTD) and LTP [32].
 LTP and LTD have another potentially important role in modern neuroscience, and
 that is the possibility that they may be exploited to treat disorder and disease in the human
 central nervous system (CNS). A variety of neurological conditions arise from lost or
 excessive synaptic drive due to sensory deprivation during childhood, brain damage, or
 disease [33]. Memory and forgetting are the stages that the human brain must go through in
the process of accepting knowledge and accumulating experience. This paper will modify the Actor-network module in DRL, and change the Actor-network module optimized by gradient descent into a network module with biological characteristics.

3. Methods

ATRTD3 is a DRL algorithm proposed in this paper to improve the motion ability of a multi-DOF manipulator. The innovation of the algorithm is to transform research results from neuroscience into DRL. The algorithm is based on the Twin-Delayed Deep Deterministic Policy Gradient algorithm with Rebirth Mechanism (RTD3) [34] and improves the update of the network weight parameters of the Actor-network module. The highlight of the algorithm is that it uses Tetanic stimulation and the Amnesia mechanism to randomly enhance and weaken the weighting parameters of the neural network, thus realizing a bionic update of the neural network. The Actor-network module obtained through the deterministic policy gradient is further updated through these two mechanisms. Compared with other DRL algorithms, ATRTD3 adds a network update step with biological characteristics and further expands the scope of exploration. Figure 1 shows the framework of the overall algorithm.

Figure 1. ATRTD3 algorithm framework.

The pseudo-code of ATRTD3 is shown in Appendix A at the end of this paper. The following is a detailed description of the Tetanic stimulation and Amnesia mechanism.
 3.1. Tetanic Stimulation
Tetanic stimulation is the memory part of the Actor-network module. In back propagation, the neural network obtains the parameter updates by gradient descent, realizing the iterative updating of the network. By evaluating these parameter updates, we can determine which neural nodes' weights are enhanced and which are weakened. Among the strengthened neurons, the degree of strengthening also differs. Therefore, we evaluate and sort the parameters of the strengthened neuron nodes, select the most strengthened ones, and obtain the neuron nodes qualified for Tetanic stimulation; Tetanic stimulation is then applied to the parameters of these nodes to achieve LTP, as shown in Figure 2, and the specific pseudo code is shown in Algorithm 1. Directly modifying the parameters of neural nodes directly affects the nonlinear expression of the neural network. This influence is immediate and shows up at once as a change of neuron weights in the continuous MDP, so we need to keep it within a reasonable range while the parameters of the neural network are constantly updated in the iterative process. The designed Tetanic stimulation mechanism is nested in the fixed delayed-update step, which exerts the effect of Tetanic stimulation to a certain extent without disturbing the overall network update iteration, ensuring that the network converges towards improved performance during training without weakening the exploration attempts.

Algorithm 1 Tetanic stimulation
1: Tetanic stimulation coefficient κ
2: Load Actor network φ, W ← Actor.fc.weight
3: Update Actor network φ′ ← φ
4: Load new Actor network, W_new ← Actor.fc.weight
5: ∆W = W_new − W
6: Select the indices (row_list, col_list) of the T largest entries in ∆W
7: For t = 1 to T do:
8:   If random(0, 1) < κ:
9:     If A.w(row_t, col_t) > 0:
10:      A.w(row_t, col_t) ← (1 + random(0, 0.01)) × A.w(row_t, col_t)
11:    else:
12:      A.w(row_t, col_t) ← (1 − random(0, 0.01)) × A.w(row_t, col_t)
13:    End if
14:  End if
15: End for
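As a concrete illustration of Algorithm 1, the following PyTorch-style sketch applies Tetanic stimulation to one fully connected layer of the Actor network. The layer handle, the number of stimulated weights T, and the coefficient κ are illustrative assumptions, not the authors' exact implementation; the snapshot of the weights is assumed to be taken just before the gradient-descent update.

```python
import torch

def tetanic_stimulation(old_weight, fc_layer, top_t=8, kappa=0.5):
    """Randomly strengthen the T most-increased weights of a linear layer (Algorithm 1)."""
    with torch.no_grad():
        delta_w = fc_layer.weight - old_weight               # ΔW = W_new − W
        flat_idx = torch.topk(delta_w.flatten(), top_t).indices
        n_cols = delta_w.shape[1]
        for idx in flat_idx.tolist():
            r, c = idx // n_cols, idx % n_cols
            if torch.rand(1).item() < kappa:                 # stimulate with probability κ
                scale = 0.01 * torch.rand(1).item()          # random(0, 0.01)
                if fc_layer.weight[r, c] > 0:
                    fc_layer.weight[r, c] *= (1.0 + scale)   # line 10: strengthen
                else:
                    fc_layer.weight[r, c] *= (1.0 - scale)   # line 12: weaken
    return fc_layer
```

A typical call site would clone the layer weights before the delayed policy-gradient step and pass that clone as `old_weight` afterwards.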

Figure 2. Schematic diagram of the working mechanism of Tetanic stimulation.

3.2. Amnesia Mechanism

The Amnesia mechanism is the forgetting part of the Actor-network module. When there are problems in information transmission between neurons, synapses cannot normally perform the function of neurotransmitter transmission, the brain begins to have problems in information transmission, and forgetting begins to occur; at this time, the huge redundant brain neural network begins to have problems, and some memory and logic units begin to fail. Neuron function is not always in a stable and good working state; there will be all kinds of accidents, just as the world's best snooker players cannot guarantee that every shot will be accurate. So, there is forgetting throughout the whole period and process of neurons. For this reason, the Amnesia mechanism is added to the neural network by randomly selecting neurons in the network with a small probability and then weakening the parameters of those neuron nodes. When using the Amnesia mechanism to weaken the representation ability of the neural network, the influence of this weakening must be controllable, that is to say, it must not affect the convergence trend of the neural network. Therefore, to ensure that the influence of the Amnesia mechanism can be controlled, the weights of the selected neurons are weakened by a random force (a random percentage) in this paper, as shown in Figure 3, and the specific pseudo code is shown in Algorithm 2.

Algorithm 2 Amnesia Framework
1: Load Actor network, W ← A.w(Actor.fc.weight)
2: N is the number of nodes in W
3: For i = 1 to N:
4:   Random(0, 1) number ξ, Amnesia threshold value τ, Mutation coefficient k
5:   If ξ < τ:
6:     Actor.fc.weight[i] ← Actor.fc.weight[i] × (1 − k × random(0, 1))
7:   End if
8: End for
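For comparison, a minimal PyTorch-style sketch of Algorithm 2 is shown below. It perturbs one weight matrix in place; the threshold τ and mutation coefficient k are placeholder values, and the per-element Bernoulli draw is a vectorized stand-in for the per-node loop of the pseudo code.

```python
import torch

def amnesia(fc_layer, tau=0.05, k=0.1):
    """Randomly weaken a small fraction of weights, imitating forgetting (Algorithm 2)."""
    with torch.no_grad():
        w = fc_layer.weight
        mask = torch.rand_like(w) < tau           # "forget" each weight with probability τ
        factor = 1.0 - k * torch.rand_like(w)     # weaken by a random percentage in (0, k)
        w[mask] *= factor[mask]
    return fc_layer
```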

Figure 3. Schematic diagram of the Amnesia framework.
4. Experiment
4.1. Experiment Setup

For the control problem of the multi-DOF manipulator, if we only consider the kinematics model of the manipulator and regard the motion of the multi-DOF manipulator as a discrete process from one position of the end effector to another, a DRL method with a deterministic policy gradient, such as RTD3, can achieve good results. However, if the motion of the manipulator is regarded as a continuous motion process, a new group of DRL inputs and outputs must be found. The idea adopted in this paper is to discretize the motion process of the manipulator in time, take the position deviation of the end-effector from the target, the angular velocities of the joints, and the angles of the joints as the input information of the DRL, and then take the angular acceleration command for the next interval of the controlled joints as the output information, as shown in Figure 4. In this way, by controlling the angular acceleration of the joints, the problem of discrete processes in the previous position-control scheme can be solved. However, this change inevitably increases the model dimensions, which puts forward new requirements not only for the capability of the DRL algorithm but also for the redesign of the reward function.

Figure 4. Schematic diagram of the control flow.
The UR manipulator is a representative manipulator in industrial production and scientific research. Therefore, this paper uses its structural size and joint layout to establish the simulation manipulator.

4.2. Task Introduction

In this paper, the DRL algorithm is used to train a model controller of the multi-DOF manipulator. The model controls the angular acceleration of the joints of the manipulator so that the manipulator starts to move from an initial position of the workspace in a static state, then moves to the target position and stops. In the whole training process, the target position of the task is a fixed position in the workspace of the manipulator. The core of the task is that the manipulator reaches the target position and, at the moment it reaches the target position, each joint of the manipulator is in a static state. Finally, the manipulator can reach the target position smoothly by controlling the angular acceleration of the manipulator joints. In order to limit the boundaries of the entire task, the training process must be restricted: each episode is divided into twenty steps. This setting mainly takes into account that the training convergence process takes a long time and the time of a single episode must be shortened. This task is a simulation experiment to test the convergence ability and learning ability of the improved algorithm ATRTD3.

4.3. Simulation Environment Construction

The DRL algorithm establishes the manipulator model through the standard DH [35] method and uses the forward kinematics solution to compute the spatial pose of the end effector from the joint angles. The DH modeling method is a general modeling method for multi-link mechanisms, and standard DH models are used for serial-structure robots. In Table 1, we show the DH parameters of this manipulator, where a is the length of the link, d is the offset of the link, α is the twist angle of the link, and θ is the joint angle. The units of a and d are meters, and the units of α and θ are radians.

Table 1. The D-H parameters of the manipulator.

Joint      a         d        α       θ
Base       0         0.0892   π/2     θ1
Shoulder   −0.4250   0        0       θ2
Elbow      −0.3923   0        0       θ3
Wrist1     0         0.1092   π/2     θ4
Wrist2     0         0.0946   −π/2    θ5
Wrist3     0         0.0823   0       θ6

During the experiment, only the base, shoulder, and elbow are controlled; wrist1, wrist2, and wrist3 are locked. Because the problem studied in this paper focuses on the position-reaching ability of the manipulator, which can be realized using only three joints, the three wrist joints are locked. The homogeneous transformation matrix is established through the D-H parameters, as shown in Equation (1).

{}^{i-1}_{i}T = \begin{bmatrix} \cos\theta_i & -\sin\theta_i\cos\alpha_i & \sin\theta_i\sin\alpha_i & a_i\cos\theta_i \\ \sin\theta_i & \cos\theta_i\cos\alpha_i & -\cos\theta_i\sin\alpha_i & a_i\sin\theta_i \\ 0 & \sin\alpha_i & \cos\alpha_i & d_i \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad (1)

The solution of the forward kinematics of the manipulator can be obtained by multiplying the homogeneous transformation matrices, as shown in Equation (2). In the base coordinate system {B}, the position of the end of the manipulator can then be obtained.

{}^{0}_{6}T = {}^{0}_{1}T\,{}^{1}_{2}T\,{}^{2}_{3}T\,{}^{3}_{4}T\,{}^{4}_{5}T\,{}^{5}_{6}T \quad (2)
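As a worked example of Equations (1) and (2) with the D-H parameters of Table 1, the following NumPy sketch chains the link transforms and returns the end-effector position in {B}. It is a minimal reconstruction for checking the kinematics under the stated parameters, not the authors' simulation code, and the example joint angles are arbitrary.

```python
import numpy as np

# Standard D-H parameters from Table 1: (a, d, alpha); theta is the joint variable
DH = [
    (0.0,     0.0892,  np.pi / 2),   # Base
    (-0.4250, 0.0,     0.0),         # Shoulder
    (-0.3923, 0.0,     0.0),         # Elbow
    (0.0,     0.1092,  np.pi / 2),   # Wrist1
    (0.0,     0.0946, -np.pi / 2),   # Wrist2
    (0.0,     0.0823,  0.0),         # Wrist3
]

def dh_transform(a, d, alpha, theta):
    """Homogeneous transform of one link, Equation (1)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def end_effector_position(joint_angles):
    """Equation (2): chain the six link transforms and return the end position in {B}."""
    T = np.eye(4)
    for (a, d, alpha), theta in zip(DH, joint_angles):
        T = T @ dh_transform(a, d, alpha, theta)
    return T[:3, 3]

# Wrist joints locked at zero; only base, shoulder, and elbow move
print(end_effector_position([0.3, -0.8, 1.2, 0.0, 0.0, 0.0]))
```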

In each episode, the target position is randomly generated in the workspace of the manipulator. In the experiment, the distance differences between the center position of the end effector and the target position in three directions (dx, dy, and dz), the angular velocities of the first three joints (ω_Joint_Base, ω_Joint_Shoulder, and ω_Joint_Elbow), and the absolute angles of the first three joints (θ_Joint_Base, θ_Joint_Shoulder, and θ_Joint_Elbow) are used as the input of the DRL, and the angular acceleration control commands of the base, shoulder, and elbow (ω̇_Joint_Base, ω̇_Joint_Shoulder, and ω̇_Joint_Elbow) are output by the DRL. In order to ensure the safe operation of the virtual manipulator, the angular acceleration (rad/s²) is limited, as shown in Equation (3).

\dot{\omega}_i \in (-0.5, 0.5), \quad i \in \{Base, Shoulder, Elbow\} \quad (3)

When the DRL outputs the angular acceleration control command ω̇_i, the joint angle increment ∆θ_i obtained in this step is calculated by Equation (4) according to the interval time t = 0.1 s and the angular velocity of the previous step ω_i. The updated joint angle θ_i′ is obtained from the joint angle increment ∆θ_i and the previous joint angle θ_i in Equation (5). The position of the manipulator end effector in the {B} coordinate system is then obtained from the homogeneous transformation matrix {}^{0}_{6}T. The joint angular velocity is updated as shown in Equation (6).

\Delta\theta_i = \omega_i t + \tfrac{1}{2}\dot{\omega}_i t^2, \quad i \in \{Base, Shoulder, Elbow\} \quad (4)

\theta_i' = \theta_i + \Delta\theta_i \quad (5)

\omega_i' = \omega_i + \dot{\omega}_i t \quad (6)
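A short sketch of the per-step state update described by Equations (3)–(6) is given below. The 0.1 s interval and the ±0.5 rad/s² limit follow the text; the three-element array layout for the base, shoulder, and elbow is an assumption made for illustration.

```python
import numpy as np

DT = 0.1          # control interval t = 0.1 s
ACC_LIMIT = 0.5   # Equation (3): commanded angular acceleration limited to (−0.5, 0.5)

def step_joints(theta_prev, omega_prev, acc_cmd):
    """Integrate one control step for the base, shoulder, and elbow joints."""
    acc = np.clip(acc_cmd, -ACC_LIMIT, ACC_LIMIT)         # Equation (3)
    delta_theta = omega_prev * DT + 0.5 * acc * DT ** 2   # Equation (4)
    theta = theta_prev + delta_theta                      # Equation (5)
    omega = omega_prev + acc * DT                         # Equation (6)
    return theta, omega

# Example: joints at rest receive an acceleration command from the policy
theta, omega = step_joints(np.zeros(3), np.zeros(3), np.array([0.2, -0.5, 0.4]))
```

The returned joint angles can then be passed to the forward kinematics above to obtain the new end-effector position.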
In this motion process, the DRL model sends the angular acceleration command of the joints (the output of the DRL algorithm) to the manipulator according to the perceived environment and its state (the input of the DRL algorithm), and issues the termination command when it judges that the motion should end.

4.4. Rewriting Experience Playback Mechanism and Reward Function Design

The motion process of a multi-DOF manipulator is no longer a set of discrete spatial position points. As can be seen from Figure 5 below, this experiment completes the specified task by controlling the angular acceleration of the three joints. In the joint angular velocity field drawn from the three joint angular velocities, in Case 1 the manipulator can reach the target position and stop, which means that the task is successfully completed; in Case 2, the angular velocity of the joints does not stop at all, or the end passes through the target position quickly, and the task fails; in Case 3, although the manipulator stops in the scheduled time, the end of the manipulator does not reach the target position, and the task also fails. Figure 5 also shows that the task of this experiment is more difficult than a task that only reaches the goal through discrete spatial positions.

Figure 5. Schematic diagram of the angular velocity field of joints. ξ is the threshold for determining joint stop.

Therefore, the design of the reward function must be reconsidered; that is to say, the angular velocity information of the joints must be introduced. In the whole process of the manipulator movement, each instance of acceleration and deceleration has an impact on the angular velocity of each joint when the termination condition is reached and the iteration of the round stops. Therefore, it is necessary to further improve the experience playback mechanism and change the experience pool storage. In other words, the final angular velocity of each joint should be shared by all the previous continuous movements. As shown in Equation (7), the absolute value of the angular velocity of each joint is taken, multiplied by the constant λ_i, and divided by the number of steps T_Stop to obtain the corresponding reward value.

R_{Joint\_Vel} = -\dfrac{\lambda_1|\omega_{Base}| + \lambda_2|\omega_{Shoulder}| + \lambda_3|\omega_{Elbow}|}{T_{Stop}}, \quad \lambda_i > 0,\ i \in \{1, 2, 3\} \quad (7)

This part of the reward is added as a negative reward to the corresponding reward in the experience pool, so as to realize the feedback of the joint angular velocity state in the neural network parameter update, as shown in Figure 6.

Figure 6. Schematic diagram of the rewriting experience playback mechanism.

The design of the reward function in this paper adds the angular velocity reward part to the Step-by-Step reward function (r_{Step-by-Step}) [35]. The Step-by-Step reward function mainly includes two parts: the first part is the negative value of the Euclidean distance between the end of the manipulator and the target; the second part is the reward obtained by comparing how much closer to the target the current position of the manipulator end is than its last position during the movement. Therefore, the reward function in this paper is shown in Equation (8):

r' = \lambda_4\, r_{Step\text{-}by\text{-}Step} + \lambda_5\, R_{Joint\_Vel} \quad (8)

where λ_4 and λ_5 are two constants.
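To illustrate Equations (7) and (8) together with the rewritten experience playback, the sketch below writes the terminal angular-velocity penalty back into every transition of a finished episode before it is stored in the replay buffer. The λ values and the transition-tuple layout are placeholders rather than the settings used in the paper.

```python
import numpy as np

LAMBDA = dict(l1=1.0, l2=1.0, l3=1.0, l4=1.0, l5=1.0)   # placeholder values for λ1..λ5

def joint_vel_penalty(final_omega, t_stop):
    """Equation (7): negative reward from the absolute final joint angular velocities."""
    w = np.abs(final_omega)                               # |ω_Base|, |ω_Shoulder|, |ω_Elbow|
    return -(LAMBDA['l1'] * w[0] + LAMBDA['l2'] * w[1] + LAMBDA['l3'] * w[2]) / t_stop

def rewrite_episode(episode, final_omega):
    """Share the terminal velocity penalty over all transitions of the episode (Equation (8))."""
    r_vel = joint_vel_penalty(final_omega, len(episode))
    return [(s, a, s_next, LAMBDA['l4'] * r_step + LAMBDA['l5'] * r_vel)
            for (s, a, s_next, r_step) in episode]
```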
4.5. Simulation Experimental Components

In the application of the DRL algorithm, a problem that cannot be avoided is the random generation of a large number of neural network parameters. It is because of the randomness of the parameters that we cannot train and learn effectively in the face of specific tasks, so we need to explore a more efficient, faster-converging, and more stable algorithm framework to make up for this disadvantage. Since ATRTD3, RTD3, and TD3 are all improved and innovated on the basis of DDPG, the contrast group of this experiment chooses the above four algorithms. In the contrast experiment, we train with the same task and acquire the ability to solve the target task through learning and training. In the experiment, we specify two kinds of evaluation indexes. The first index is to calculate all the reward scores in an episode and divide them by the total number of steps in the episode to get the average score. The second index is to record the absolute values of the angular velocities of the base, shoulder, and elbow at the end of an episode. Additionally, in the same group of experiments, in order to ensure a fair comparison of the practical application ability and model convergence ability of the four algorithms, we use the same initialization model as the test algorithm model.

5. Discussion

Through the comparison of DDPG, TD3, RTD3, and ATRTD3 in Figure 7a, we can clearly see the improvement in the learning ability of ATRTD3. Therefore, we need to further analyze the evaluation index of the average score when the final distance error is roughly the same. From Figure 7a, we can see that the average score of ATRTD3 is higher than that of the other algorithms when the final distance error is the same in purple areas A, B, and C, which indicates that the learning effect of ATRTD3 is better. The reason is that part of the reward function is the negative reward introduced by the absolute value of the angular velocity of each joint after each termination of an episode. Through purple areas A, B, and C of Figure 7b, ATRTD3 can better guide the multi-DOF manipulator to move to the target position and stop, so ATRTD3 is more efficient and stable than the other algorithms.

Figure 7. Four randomized experiments (Group1, Group2, Group3, and Group4) were conducted to evaluate the performance of four algorithms (RTD3, ATRTD3, TD3 and DDPG). There are Group1, Group2, Group3, and Group4 in the average score (Avg_Score) in (a) and the final error distance (Final_Dis) in (b). Purple areas A, B, and C need special attention.

Secondly, through Figure 7, we can draw the conclusion that ATRTD3 shows stronger stability than the other algorithms in the late training period. As can be seen in Figure 8, although DDPG reaches a score level close to that of ATRTD3 through the later training, we can clearly see from the average score curve and the final error distance curve that ATRTD3 has better stability in the later training stage compared with DDPG: the two curves of ATRTD3 are straighter, while there are many spikes in the two curves of DDPG.
 From Figure 9, we can see that ATRTD3 improves the average score by at least 49.89%
 compared with the other three algorithms. In terms of stability, ATRTD3 performs better,
 with an improvement of at least 89.27% compared with the other three algorithms.
Figure 8. The results of four experiments (Group1, Group2, Group3, and Group4) of four algorithms (RTD3, ATRTD3, TD3, and DDPG) are compared in a stacked way. From left to right, (a) represents the final error distance; (b) represents the average score.

Figure 9. Combined with four groups of experiments (Group1, Group2, Group3, and Group4), the ATRTD3 algorithm improves the average score and final error distance performance of the model.

To further demonstrate the advantages of ATRTD3 from the underlying control variables, we collect the final angular speeds of the three joints during model training, as shown in Figure 10a,b. In Figure 10a, we show the final angular velocity of the three joints of the high score model (Group1). In Figure 10b, we show the final angular velocity of the three joints of the arm controlled by a low score model (Group2). By placing a local magnification in each image, we can clearly see the final angular velocity of the three joints at the end of the training process, where the curve drawn in red is the speed of each joint under the guidance of ATRTD3 throughout the training process. ATRTD3 has obvious advantages over the other three algorithms.

Figure 10. (a) The final angular velocity of the joints in the high score model Group1; (b) The final angular velocity of the joints in the low score model Group2.

The final joint angular velocity of the manipulator based on ATRTD3 is always around 0 rad/s, with the minimum fluctuation and the best stability. In the high score model (Group1), only DDPG can achieve a score similar to ATRTD3 in the training score, so only the joint angular velocities of DDPG and ATRTD3 are compared. It can be seen from Table 2 that DDPG has an order of magnitude disadvantage in the comparison of the angular velocities of the other two joints, except that the final angular velocity of the base joint is roughly the same as that of ATRTD3. By comparing the variances in Table 2, it is not difficult to find that ATRTD3 has absolute advantages in stability, which is generally one order of magnitude better, because the smaller the variance is, the more stable the training is. In the low score model (Group2), the advantages of ATRTD3 cannot be seen in Table 3. After accumulating and comparing the angular velocity and angular velocity variance of the three joints, it can be seen from Figure 11 that ATRTD3 is still better than the other three algorithms.
 three algorithms.
Table 2. The mean and variance of the joint angular velocities in the local enlarged images of the high score model (Group1).

Joint               Base                          Shoulder                      Elbow
Angular Velocity    Average (rad/s)   Var         Average (rad/s)   Var         Average (rad/s)   Var
ATRTD3              −2.50 × 10−3   2.80 × 10−5    2.51 × 10−3    3.09 × 10−5    −3.90 × 10−4   2.83 × 10−5
RTD3                −4.65 × 10−3   4.58 × 10−4    −3.13 × 10−2   1.07 × 10−3    3.35 × 10−3    6.81 × 10−4
TD3                 3.13 × 10−2    7.26 × 10−4    −2.79 × 10−2   1.32 × 10−3    −6.69 × 10−3   4.73 × 10−4
DDPG                2.66 × 10−3    4.60 × 10−5    2.52 × 10−2    6.41 × 10−5    −2.01 × 10−2   3.95 × 10−5

Table 3. The mean and variance of the joint angular velocities in the local enlarged images of the low score model (Group2).

Joint               Base                          Shoulder                      Elbow
Angular Velocity    Average (rad/s)   Var         Average (rad/s)   Var         Average (rad/s)   Var
ATRTD3              −1.71 × 10−2   4.17 × 10−5    8.04 × 10−3    1.08 × 10−4    −1.23 × 10−2   6.18 × 10−5
RTD3                3.03 × 10−2    1.14 × 10−4    −2.50 × 10−3   5.87 × 10−4    −1.73 × 10−2   4.48 × 10−4
TD3                 1.39 × 10−2    3.40 × 10−5    3.67 × 10−2    9.69 × 10−5    −6.58 × 10−2   6.95 × 10−5
DDPG                3.20 × 10−2    5.38 × 10−5    −1.40 × 10−2   7.07 × 10−5    8.21 × 10−3    1.27 × 10−4

Figure 11. In the high score model (Group1) and low score model (Group2), the improvement effect of the ATRTD3 algorithm compared with the other three algorithms in two aspects: control of joint angular velocity skill level and late training stability.

 Through Figure 11, we can see that ATRTD3 can significantly improve the skill level
 and stability compared with the other three algorithms. ATRTD3 generally improves the
 skill level by more than 25.27% and improves stability by more than 15.90%. Compared
 with TD3, the stability improvement of ATRTD3 in Group 2 is −5.54%. The main reason is
that TD3 falls into a more stable local optimum, so the performance of ATRTD3 should not be judged by this index alone.
In the Actor neural network, the Tetanic stimulation mechanism forcibly changes the parameters of some neurons and enlarges the exploration space of actions within a certain range. Through Tetanic stimulation, the weights of the selected neurons can be increased again, which helps to improve the convergence speed of the neural network and speed up training. The Amnesia mechanism forcibly changes the parameters of neurons, which is in line with the decay of biological cells.

 6. Conclusions
 In this paper, we propose an algorithm named ATRTD3, for continuous control of a
 multi-DOF manipulator. In the training process of the proposed algorithm, the weighting
 parameters of the neural network are learned using Tetanic stimulation and Amnesia
 mechanism. The main contribution of this paper is that we show a biomimetic view to
 speed up the converging process by biochemical reactions generated by neurons in the
 biological brain during memory and forgetting. The integration of the two mechanisms
 is of great significance to expand the scope of exploration, jump out of the local optimal
 solution, speed up the learning process and enhance the stability of the algorithm. The
 effectiveness of the proposed algorithm is validated by a simulation example including
 the comparisons with previously developed DRL algorithms. The results indicate that our
 approach shows performance improvement in terms of convergence speed and precision
 in the multi-DOF manipulator.
 The ATRTD3 we proposed successfully applies the research results of neuroscience
 to the DRL algorithm to enhance its performance, and uses computer technology to carry
 out biological heuristic design through code to approximately realize some functions of
 the biological brain. In the future, we will further learn from neuroscience achievements,
 improve the learning efficiency of DRL, and use DRL to achieve reliable and accurate
 manipulator control. Furthermore, we will continue to challenge the scheme of controlling
 six joints to realize the cooperative control of position and orientation.

 Author Contributions: Conceptualization, H.H. and Y.H.; methodology, Y.H.; software, Y.H.; valida-
 tion, Y.H., D.X. and Z.Z.; formal analysis, Y.H.; investigation, Y.H.; resources, H.H.; data curation, Y.C.;
 writing—original draft preparation, Y.H.; writing—review and editing, Y.C. and Z.L.; visualization,
 Y.H.; supervision, H.H. All authors have read and agreed to the published version of the manuscript.
 Funding: This research received no external funding.
 Institutional Review Board Statement: Not applicable.
 Informed Consent Statement: Not applicable.
 Data Availability Statement: The authors exclude this statement.
 Conflicts of Interest: The authors declare no conflict of interest.

 Appendix A
 Pseudo code of Algorithm A1.

Algorithm A1. ATRTD3
1: Initialize critic networks Q_θ1, Q_θ2, and actor network π_φ with random parameters θ1, θ2, φ
2: Initialize target networks Q_θ1′, Q_θ2′, π_φ′
3: Target network parameter assignment θ1 → θ1′; θ2 → θ2′; φ → φ′
4: Initialize replay buffer B
5: for t = 1 to T do
6:   Amnesia framework
7:   Select action with exploration noise a ∼ π_φ(s) + ε, ε ∼ N(0, σ)
8:   Temporarily store transition tuple (s, a, s′, r) in B
9:   Fix transition tuple r′ ← r + R_Joint_Vel, with the angular velocity correction R_Joint_Vel
10:  Store transition tuple (s, a, s′, r′) in B
11:  If Sum_B < mini_batch then
12:    return
13:  Sample mini-batch of N transitions (s, a, r, s′) from B
14:  ã ← π_φ′(s′) + ε, ε ∼ clip(N(0, σ̃), −c, c)
15:  y = r + γ min_{i=1,2} Q_θi′(s′, ã)
16:  Statistical calculation of the Q value utilization ratios w_θ1′, w_θ2′
17:  If w_θ1′ < Mini_utilization then
18:    Rebirth target network θ1′
19:  End if
20:  If w_θ2′ < Mini_utilization then
21:    Rebirth target network θ2′
22:  End if
23:  Update critics θi ← argmin_θi N⁻¹ Σ (y − Q_θi(s, a))²
24:  If t mod d then
25:    Update φ by the deterministic policy gradient:
26:    ∇_φ J(φ) = N⁻¹ Σ ∇_a Q_θ1(s, a)|_{a=π_φ(s)} ∇_φ π_φ(s)
27:    Tetanic stimulation framework
28:    Update target networks:
29:    θi′ ← τ θi + (1 − τ) θi′
30:    φ′ ← τ φ + (1 − τ) φ′
31:  End if
32:  End if
33: End for
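To show how the pieces of Algorithm A1 fit together, the following condensed PyTorch sketch performs one ATRTD3 update on a sampled mini-batch. The network sizes, hyperparameters, and optimizer setup are placeholders; the rebirth test of lines 16–22 is omitted (it belongs to RTD3 [34]); `tetanic_stimulation` and `amnesia` refer to the sketches given with Algorithms 1 and 2, with the Amnesia step applied once per environment step (line 6) outside this function. This is an outline under stated assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, s_dim=9, a_dim=3):
        super().__init__()
        self.fc = nn.Linear(s_dim, 64)
        self.out = nn.Linear(64, a_dim)
    def forward(self, s):                       # acceleration command in (−0.5, 0.5)
        return 0.5 * torch.tanh(self.out(torch.relu(self.fc(s))))

class Critic(nn.Module):
    def __init__(self, s_dim=9, a_dim=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def atrtd3_update(actor, critics, targets, opts, batch, step,
                  gamma=0.99, tau=0.005, delay=2, sigma=0.1, clip=0.5):
    """One ATRTD3 gradient step following lines 13-31 of Algorithm A1."""
    s, a, r, s2 = batch            # shapes: (N, s_dim), (N, a_dim), (N, 1), (N, s_dim)
    c1, c2 = critics
    actor_t, c1_t, c2_t = targets
    a_opt, c_opt = opts
    with torch.no_grad():
        noise = (torch.randn_like(a) * sigma).clamp(-clip, clip)     # line 14
        a2 = actor_t(s2) + noise
        y = r + gamma * torch.min(c1_t(s2, a2), c2_t(s2, a2))        # line 15
    c_loss = ((y - c1(s, a)) ** 2 + (y - c2(s, a)) ** 2).mean()      # line 23
    c_opt.zero_grad(); c_loss.backward(); c_opt.step()
    if step % delay == 0:                                            # delayed policy update
        old_w = actor.fc.weight.detach().clone()
        a_loss = -c1(s, actor(s)).mean()                             # line 26
        a_opt.zero_grad(); a_loss.backward(); a_opt.step()
        tetanic_stimulation(old_w, actor.fc)                         # line 27
        for net, tgt in zip([actor, c1, c2], [actor_t, c1_t, c2_t]):
            for p, pt in zip(net.parameters(), tgt.parameters()):
                pt.data.mul_(1 - tau).add_(tau * p.data)             # lines 29-30
```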

References
1. Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Large-Scale
 Data Collection. In International Symposium on Experimental Robotics; Springer: Cham, Switzerland, 2016; pp. 173–184.
2. Zhang, M.; Mccarthy, Z.; Finn, C.; Levine, S.; Abbeel, P. Learning deep neural network policies with continuous memory states.
 In Proceedings of the International Conference on Robotics and Auto-mation, Stockholm, Sweden, 16 May 2016; pp. 520–527.
3. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1–40.
4. Lenz, I.; Knepper, R.; Saxena, A. DeepMPC:learning deep latent features for model predictive control. In Proceedings of the
 Robotics Scienceand Systems, Rome, Italy, 13–17 July 2015; pp. 201–209.
5. Satija, H.; Pineau, J. Simultaneous machine translation using deep reinforcement learning. In Proceedings of the Workshops of
 International Conference on Machine Learning, New York, NY, USA, 24 June 2016; pp. 110–119.
6. Sallab, A.; Abdou, M.; Perot, E.; Yogamani, S. Deep reinforcement learning framework for autonomous driving. Electron. Imaging
 2017, 19, 70–76. [CrossRef]
7. Caicedo, J.; Lazebnik, S. Active Object Localization with Deep Reinforcement Learning. In Proceedings of the IEEE International
 Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2488–2496.
8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep
 Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
9. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.;
 Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef]