Autonomous Obstacle Avoidance by Learning Policies for Reference Modification
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Autonomous Obstacle Avoidance by Learning Policies for Reference
Modification
Benjamin Evans1 , Herman A. Engelbrecht1 and Hendrik W. Jordaan1
Abstract— The key problem for autonomous robots is how to
navigate through complex, obstacle-filled environments, without
colliding into obstacles. The navigation problem is split into
generating a global reference path to the goal and then using a
local planner to track the reference path and avoid obstacles by
arXiv:2102.11042v2 [cs.RO] 4 May 2021
generating velocity and steering references for the robot control
system to execute. Path
This paper presents a hybrid local planning architecture Follower RL Agent
that combines a classic path following algorithm with a deep
reinforcement learning agent. The novel “modification planner”
uses the path follower to track the global plan and the deep
reinforcement learning agent to modify the references generated
by the path follower to avoid obstacles. Importantly, our Hybrid Modification Planner
architecture does not require an updated obstacle map and
only 10 laser range finders to avoid obstacles. The modification
planner is evaluated in the context of F1/10th autonomous
racing and compared to a mapless navigation baseline, the
Follow the Gap Method and an optimisation based planner. The
results show that the modification planner can achieve faster
average times compared to the baseline mapless navigation
planner and a 94% success rate which is similar to the baseline. Fig. 1. Examples of Autonomous Navigation: The green dotted line is
the global plan, the blue blocks are the obstacles and the red line is the path
taken by the vehicle. Path following algorithms are able to perfectly follow
the global plan, yet crash into unknown obstacles. Reinforcement learning
I. I NTRODUCTION agents are able to avoid obstacles, yet suffer from being inefficient. The
lower image shows the desired behaviour which we create in the proposed
The general problem of robot navigation is to plan and modification planner.
execute a trajectory that moves a robot from a starting loca-
tion to a goal location without colliding with any obstacles.
Successful approaches to robot navigation have implemented smoothly than a mapless navigation planner and does not
a hierarchical approach by splitting the problem into global require an updated obstacle map as with optimisation based
planning, generating a high-level plan before the mission planners. We evaluate our novel approach in the context
begins, and local planning, following the global plan while of F1/10th autonomous racing and compare it to current
avoiding obstacles [1]. Local planning is a difficult problem approaches which include a mapless navigation planner, the
to solve because it has the dual objective of (1) trying to Follow the Gap Method, and an oracle optimisation based
follow a predetermined path (2) and being able to react planner.
appropriately to uncertainty in the environment, namely The rest of the paper is summarised as follows. We
smoothly avoiding obstacles. first discuss previous attempts to solve the local planning
In this paper, we address the problem of local planning problem in Section II. This is followed by a theoretical
without an updated obstacle, which involves using the global overview of our proposed reference modification planner
plan and onboard sensor readings to generate velocity and in Section III. In Section IV we explain the components
steering commands that follow the global reference while that we use in building our solution and in Section V the
avoiding obstacles. We present our novel approach which learning formulation for the Reinforcement Learning (RL)
is a hybrid local planner that integrates a path follower agent is explained. Lastly in Sections VI and VII we detail
and a deep reinforcement learning (RL) agent that modi- the evaluation methodology and present our results.
fies the steering reference generated by the path follower
to accomplish the dual objective of tracking a reference
II. R ELATED W ORK
path while also smoothly avoiding obstacles. We call our
approach a reference modification planner. The advantage of Different methods of local planning, following a global
our architecture is that it can learn to avoid obstacles more plan while avoiding obstacles have been developed. We study
1 Electrical and Electronic Engineering Department, Stellenbosch Uni-
reactive methods [2], [3], classical planning techniques [4],
versity, Stellenbosch, 7600, South Africa. {19811799, hebrecht, [5], learning based approaches [6], [7] and hybrid systems
wjordaan}@sun.ac.za; [8], [9].The earliest obstacle avoidance algorithms were reactive jectory being followed and obstacles being smoothly avoided
algorithms (such as “The Bug” algorithm) which generate as shown in Figure 1. The solution should track a reference
steering references based on the surrounding environment at trajectory and deviate as necessary to safely and efficiently
each step [2]. One of the most recently proposed reactive ob- avoid the obstacle. The desired behaviour is analogous to
stacle avoidance algorithms is the “Follow the Gap Method” driving a car, seeing an obstacle in the road, swerving around
(FGM) [3]. The FGM method uses a dense LiDAR scan it and then returning to the original lane.
to navigate away from the nearest obstacle into free space.
Reactive methods do not have a model of the vehicle or the Global
ability to plan for multiple timesteps and thus they cannot Plan +
guarantee optimality. Path Follower
Classical local planning methods aimed to improve on + Modified
reactive methods by storing a map of the environment which references
can be used to plan a trajectory, multiple time-steps in Vehicle
advance. Classical methods typical build a map using a state
Simultaneous Localisation and Mapping (SLAM) algorithm RL Agent
[10], and then use the map to plan and optimise a collision-
free trajectory [4]. The trajectory is then followed using a Fig. 2. Proposed Local Planner Architecture: A path follower to track
path tracking algorithm such as Pure Pursuit [5]. Classical the reference path is combined in parallel with an RL agent that modifies
planning and path tracking methods can generate and execute the path follower references.
near-optimal trajectories, however, because map building
is computationally expensive and slow, these solutions are We present a hybrid local planner that features a pure
not feasible for fast-moving, dynamic environments such as pursuit path follower in parallel with an RL agent. Figure 2
autonomous racing [11]. shows how we combine the path follower and an RL agent
Due to the hardware requirements/constraints of classical in parallel such that the agent can modify the path follower
planning solutions, machine learning (ML) methods have references to avoid obstacles. This system is designed that
been applied to the problem of local planning for autonomous the path follower maintains the reference path and the agent
navigation [12]. Neural networks, trained with reinforcement can deal with the uncertainty of obstacles.
learning (RL) [6] and imitation learning (IL) [7] have been The path follower uses the current location of the vehicle
used to associate sensor readings (camera images and LiDAR and a list of way-points to calculate the required steering
scans) directly with navigation commands. While using an and velocity commands to follow the global plan. an RL
RL agent as a local planner does not require the expensive agent is added in parallel to avoid obstacles by modifying
maintenance of a map, RL agents exhibit poor long term the steering angle calculated by the path follower. We train
performance and have shown to require a high-level planner the agent by punishing it for crashing and rewarding it for
in certain circumstances [13]. progress along the global plan. These two components in
In response to the problem of neural networks exhibiting the reward signal represent the dual objectives of achieving
poor long-term performance in complex environments, hy- efficient, collision-free paths.
brid architectures have been introduced which combine high- Where previous attempts have used RL agents to replace
level planners with neural networks. Paths generated with A* the local planner and learn to navigate, we use an RL agent
search have been used in conjunction with a neural network to learn to navigate as a subsystem within our local planner.
as a local planner and shown to be able to execute long This is done with the aim of better utilising the global path
missions [9], [8]. The above methods completely replace the and avoiding obstacles more smoothly. The aim is that the
local planner with an RL agent. A recent survey notes that agent will learn to swerve around obstacles where needed
using machine learning (ML) for specific subsystems tends without modifying the behaviour of the vehicle when no
to lead towards better overall performance [12]. We aim to obstacles are present.
improve upon the performance of these solutions by using an
IV. P RELIMINARIES : NAVIGATION S TACK
RL agent to learn the local planning subsystem of obstacle
avoidance. We aim to improve the performance of the vehicle We design our system in the context of a standard per-
by giving the agent less influence over the navigation. Our ception, planning and control stack as shown in Figure 3.
solution doesn’t require an updated obstacle map, as with Before the mission begins, the global planner uses a map of
classical model-based planning solutions, and can harness the the environment to generate a reference path of way-points
adaptability of learning techniques to form a planner which is that connect the start and goal locations. The map does not
able to smoothly avoid obstacles while maintaining a global contain any dynamic obstacles. During the mission, the local
trajectory. planner and control system continuously operates to move
the vehicle to the goal location.
III. T HEORETICAL D ESIGN We use a pure pursuit path follower which navigates
We address the problem of generating navigation refer- towards a way-point situated at a fixed look-ahead distance
ences for autonomous vehicles that result in a reference tra- in front of the vehicle. Figure 4 shows how the vehicle andEnvironment Map Global Planner actions. For each training step, the network receives a state
and selects an action that is implemented in the environment
as shown in Figure 5. The simulator then returns the updated
Perception Local Planner Control System state and a reward to the agent and the process is repeated.
State Action
Agent
Vehicle Reward
Fig. 3. Planning, Perception, and Control Navigation Stack
Environment
way-points are modelled so that the steering angle can be
Fig. 5. Reinforcement Learning Agent Interacting With Environment
calculated. Using the notation from Figure 4, the desired
steering angle is calculated according to,
We train the agent to modify the pure pursuit steering ref-
2L sin(α) erence δpp to prevent collisions. The state which the network
δpp = arctan , (1)
ld receives consists of a vector containing the current velocity
where L is the length of the vehicle body, ld is the look Vt and steering of the vehicle δt , the references calculated
ahead distance to the way-point being pursued and α is the by the path follower δpp , and the current range finder scan
angle between the vehicles current heading and the heading [r1 , ..., r10 ]. The network outputs an action quantity which is
to the way-point. used to modify the steering reference calculated by the path
follower. The path follower (pp) and neural network (nn)
steering commands are combined as,
α
δref = δpp + δnn . (3)
R The reference steering value is limited to be within the
ld maximum range and then sent to the control system to be
L executed. The inputs and outputs from the network are in
2α the range [−1, 1] and are scaled according to the maximum
values.
R Path
Fig. 4. Pure Pursuit Path Follower: R is the turning radius, g is the goal
point, alpha is the angle to the goal and delta is the steering angle which
is calculated. r1 , ..., r10
Action
State
The velocity is calculated as the fastest the vehicle can δnn
safely travel while turning at the current steering angle. The δpp , δt , V
velocity is calculated according to the maximum amount of
lateral friction the vehicle can withstand, relative to the force
exerted by the car in turning based on the current steering Fig. 6. Reinforcement Learning Agent Architecture: The state consists
angle. We compute the velocity as, of 10 sparse laser range finders and the pure pursuit steering reference. A
s neural network with two fully connected layers is used to calculate an action
bg which is the steering reference which is used to modify the path follower
V = fs , (2) reference.
tan(|δref| )/ld
where b is the coefficient of friction, m is the mass of the The aim in training the network is that the vehicle does not
vehicle, l is the wheelbase length, and sf is a safety factor. crash and that it tracks the reference path closely. Therefore,
The velocity is limited to the vehicle’s maximum velocity. we use a reward signal which severely punishes crashing
The navigation stack uses a low-level proportional control and rewards progress along the reference path. We write our
system to command the onboard vehicle actuators to follow reward signal as,
the local planning references. The control system calculates
acceleration and steering-velocity commands which can be rcrash = −1
if crash
implemented on hardware. r(st , at ) = rcomplete = 1 if lap complete
rtracking = (st+1 − st )/stotal otherwise,
V. L EARNING P OLICIES FOR R EFERENCE M ODIFICATION (4)
The RL agent comprises a neural network that is trained where s is the progress along the track at a specific step. The
from experience to maximise a reward signal [14]. The neural sum of the tracking rewards for completing the episode will
network is used to represent a policy that maps states to always be 1 since the sum of the progress differences willbe equal to the track length. The tracking reward must be runs at a higher frequency of 100Hz compared to the local
a potential function of the terminal reward so that the agent planner which runs at 10Hz. Table I notes several important
is rewarded in proportion to the progress towards the ideal parameters used in our simulation.
terminal state.
Simulation Parameter Value
VI. E VALUATION M ETHODOLOGY Simulation dynamics time step 0.01 s
Local planning time step 0.1 s
The ability of the modification planner to follow a refer- Max range finder value 10 m
ence path and avoid obstacles efficiently is evaluated in the Range finder noise σs 0.01
context of F1/10th autonomous racing. We aim to determine Max vehicle speed 7 m/s
Max steering angle 0.4 rad
the performance of the modification planner and compare Pure Pursuit Look-ahead distance 0.8 m
it to current baseline solutions. We measure performance in
TABLE I
terms of the following metrics, i.e. (1) the time to navigate
PARAMETERS U SED IN THE S IMULATION
to a target and (2) the success rate, which is the percentage
of episodes successfully completed without crashing.
The new vehicle state is used to simulate a LiDAR scan of
10 sparse range finders on the map and the new state and the
scan is returned to the local planner. To simulate the noise
in range finder readings, Gaussian distributed process noise,
N (0, σs ), is added to the measurements.
B. Baseline Solutions
Fig. 7. F1/10th Vehicle Simulator with 10 Laser Range Finder Beams We compare the modification planner to three common
baseline local planners. The first baseline is the “Follow
the Gap Method” (FGM) [3]. The FGM method identifies a
A. Simulation Environment bubble around the nearest obstacle and then navigates away
We train and test our local planner architectures using from the bubble into free space. In the implementation of
a custom-built, F1/10th autonomous racing simulator. The this method, we adapt the simulator to produce 1080 beams.
simulator models a non-holonomic 1/10th scale autonomous The second baseline which is used is an optimisation
race car with a LiDAR sensor as shown in Figure 7. At each based, oracle planner which has full access to a map of
time-step, the simulator receives an action, updates the state the environment and the location of the obstacles before
according to the transition dynamics and then returns the new the episode begins. The obstacle map is used to generate
state of the vehicle. A version of our test code is available an optimal path around the obstacles. The optimal plan is
online 1 . followed using the pure pursuit algorithm. This solution is
treated as the best performance which could be achieved by
any local planner.
Action Control System
Local Planner Lastly, we attempt to replicate the mapless navigation
planner presented by Tai et al. [6] and adapt it to a non-
State holonomic vehicle. The planner presented by Tai uses a
neural network that receives 10 laser range finder readings
and the relative position of the target. The network is trained
LiDAR Simulator Vehicle Dynamics using reinforcement learning and rewarded according to the
progress made towards the goal.
Simulation Environment For all of our baseline implementations, we use the same
method of calculating the speed based on the steering angle
Fig. 8. Schematic of Planning Loop in Simulation Environment to ensure that the results are comparable.
Figure 8 shows how the simulator interfaces with the local C. Reinforcement Learning
planner that is being evaluated. The local planner selects an
action consisting of velocity and steering angle commands The RL agent in the modification planner is trained
which are passed to the simulator and used as the references using the Twin Delayed Deterministic Policy Gradient (TD3)
for the control system. The feedback control system calcu- reinforcement learning algorithm with the original training
lates acceleration and steering-velocity commands which are hyper-parameters [15]. Neural networks with two hidden
the input to the kinematic bicycle model equations used to layers of 200 neurons are trained in mini-batches of 100
model the car. As in practice, the simulated control system transitions. The training lasts for 200,000 steps which is
generally just below 4000 episodes. Figure 9 shows the
1 Available at: https://github.com/BDEvan5/ReferenceModification average reward per episode for the training of the network.2
Average Reward
1 x x
Modification
0
0 1,000 2,000 3,000 4,000
Episode Number
x x
Fig. 9. Modification Planner Training Graph
Navigation
Fig. 10. Example Trajectories Executed by the Modification and Navigation
VII. E VALUATION R ESULTS Planners
A. Random Forests
The modification planner is evaluated in a random forest
which consists of a straight race track with randomly located high success rates show that both planners learn to avoid a
obstacles. The random forest is a straight track (20m x 2m) significant majority of obstacles.
where 4 obstacles (0.5m x 0.5m) are randomly placed every The influence of the network in modifying the path
episode along the track. Table II shows the results from follower references is shown below. Figure 11 shows a
running 100 test episodes and taking the average times to graph of the output from the neural network as the vehicle
navigate to goal and success rate. moves along the path. The neural network modifies the path
enough to avoid the obstacles while roughly maintaining the
Vehicle No Obstacles Obstacles Success reference path.
Follow the Gap Method 3.73 4.38 99%
Oracle 3.72 3.87 100%
Navigation Planner 5.08 6.18 97% x x
Modification Planner 3.82 4.63 94%
TABLE II
R ESULTS FROM 20 M R ANDOM F OREST 0.4
0.2
In the tests with no obstacles the modification planner is 0
able to track the optimal path, straight line through the forest, −0.2
and achieve a very similar time to the optimisation based −0.4
oracle planner (within 3% of the oracle planner). The similar 0 5 10 15 20
times to the optimisation based planner when no obstacles Time
are present shows that the RL agent learns to not modify
the steering command unnecessarily. The navigation planner Fig. 11. Example Trajectory from the Modification Planner with Neural
achieves a significantly slower average time of 5.08 seconds Network Output
in this test compared to the 3.72 seconds which the other
planners are able to achieve.
In the tests with obstacles present, the optimisation based B. Repeatability
oracle planner outperforms the other planners as expected To measure the repeatability of the learned policies, we
with an average time of 3.87 seconds. The FGM planner train and evaluate 10 agents on the random forest described
achieves a mediocre time of 4.33 seconds, however, this above. The average and standard deviation statistics are
is still faster than any of the learning-based planners. The presented in Table III.
modification planner achieves an average time of 4.63 sec-
Statistic Average Standard Deviation
onds for the forest filled with obstacles which is faster than
No Obstacle Average Time 3.85 s ± 0.10
the navigation planners time of 6.18 seconds. The decrease Obstacle Average Time 4.63 s ± 0.17
in time shows that the modification planner is able to find Obstacle Success Rate 92.2% ± 5.3
smoother, faster trajectories than the mapless navigation TABLE III
solution. Figure 10 shows a comparison of the trajectories R ESULTS FROM R EPEATABILITY E VALUATION
taken by the navigation and modification planners and how
the modification planner learns to take smoother trajectories.
The modification planner achieves a slightly lower success The standard deviation in the times achieved by the
rate of 94% compared to the navigation planner’s 97%. These different agents is small and this shows that the method oftraining agents to achieve similar times is stable and highly our planner is able to avoid obstacles more smoothly than
repeatable. The standard deviation of the success rate is large, previous solutions from map-less navigation.
5%, which shows that there is a non-trivial difference in how
R EFERENCES
agents learn to avoid obstacles.
[1] P. Falcone, F. Borrelli, H. E. Tseng, J. Asgari, and D. Hrovat,
“A hierarchical model predictive control framework for autonomous
C. Discussion ground vehicles,” in 2008 American Control Conference. IEEE, 2008,
The results show that the modification planner can learn pp. 3719–3724.
[2] V. J. Lumelsky and A. A. Stepanov, “Path-planning strategies for a
policies that improve upon map-less navigation planners. point mobile automaton moving amidst unknown obstacles of arbitrary
Specifically, the modification planner avoids obstacles more shape,” Algorithmica, vol. 2, no. 1-4, pp. 403–430, 1987.
smoothly which results in the vehicle being able to travel [3] V. Sezer and M. Gokasan, “A novel obstacle avoidance algo-
rithm:“follow the gap method”,” Robotics and Autonomous Systems,
at higher speeds. This shows that the modification planner vol. 60, no. 9, pp. 1123–1134, 2012.
achieves the behaviour that it was designed for, allowing [4] L. Heng, L. Meier, P. Tanskanen, F. Fraundorfer, and M. Pollefeys,
non-holonomic vehicles to avoid obstacles smoothly at high “Autonomous obstacle avoidance and maneuvering on a vision-guided
mav using on-board processing,” in 2011 IEEE International Confer-
speeds. The success rate which the planners achieve show ence on Robotics and Automation. IEEE, 2011, pp. 2472–2477.
that while the modification planner is able to learn to avoid [5] R. C. Coulter, “Implementation of the pure pursuit path tracking
a significant number of obstacles, it still underperforms the algorithm,” Carnegie-Mellon UNIV Pittsburgh PA Robotics INST,
Tech. Rep., 1992.
map-less navigation planners ability to consistently avoid [6] L. Tai, G. Paolo, and M. Liu, “Virtual-to-real deep reinforcement
obstacles. learning: Continuous control of mobile robots for mapless navigation,”
in 2017 IEEE/RSJ International Conference on Intelligent Robots and
It is suggested that the reason for the lower rate of Systems (IROS), 2017, pp. 31–36.
completion is that the complexity of the learning problem [7] G. Kahn, T. Zhang, S. Levine, and P. Abbeel, “Plato: Policy learning
for the modification planner is higher than for the map- using adaptive trajectory optimization,” in 2017 IEEE International
Conference on Robotics and Automation (ICRA). IEEE, 2017, pp.
less navigation baseline. The transition dynamics (movement 3342–3349.
from one state to the next) for the modification planner are [8] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian, “Intention-
dependant on combining the action, which the agent has net: Integrating planning and deep learning for goal-directed au-
tonomous navigation,” in Conference on Robot Learning. PMLR,
selected, with the pure pursuit reference. The result of this 2017, pp. 185–194.
combination is that the agent has less influence on how the [9] M. Wei, S. Wang, J. Zheng, and D. Chen, “Ugv navigation opti-
vehicle moves. Having less influence improves the quality mization aided by reinforcement learning-based path tracking,” IEEE
Access, vol. 6, pp. 57 814–57 825, 2018.
of the path taken which results in the modification planner [10] H. Durrant-Whyte and T. Bailey, “Simultaneous localization and
achieving faster times. A problem that arises in having less mapping: part i,” IEEE robotics & automation magazine, vol. 13, no. 2,
influence is that it achieves a lower rate of completion. pp. 99–110, 2006.
[11] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira,
This problem of guaranteeing vehicle safety is a current I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous
topic of research and could be solved in future work by ex- localization and mapping: Toward the robust-perception age,” IEEE
ploring different training methods such as imitation learning. Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[12] X. Xiao, B. Liu, G. Warnell, and P. Stone, “Motion control for mobile
Another possible solution is to combine the RL agent with robot navigation using machine learning: a survey,” arXiv preprint
a safety system that can stop the vehicle before it crashes. arXiv:2011.13112, 2020.
A recent survey [12] suggested that future work in mobile [13] F. de Villiers and W. Brink, “Learning fine-grained control for mapless
navigation,” in 2020 International SAUPEC/RobMech/PRASA Confer-
navigation should continue into using ML to learn specific ence. IEEE, 2020, pp. 1–6.
subsystems of the navigation pipeline. In future, we plan to [14] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
investigate how RL agents can be used in different archi- MIT press, 2018.
[15] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approxi-
tectures to learn obstacle avoidance behaviours by reducing mation error in actor-critic methods,” in International Conference on
the complexity of the navigation problem to more simple Machine Learning. PMLR, 2018, pp. 1587–1596.
components.
VIII. C ONCLUSION
In this paper, we addressed the problem of local planning
for autonomous navigation. Local planners are responsible
for tracking a reference trajectory and avoiding obstacles.
We presented a novel local planning architecture that uses a
path following algorithm in conjunction with an RL agent
to smoothly avoid obstacles while maintaining the global
plan. Our architecture is evaluated in the context of F1/10th
autonomous racing and shown to achieve comparable racing
times to optimisation based and reactive methods, yet using
fewer inputs. Specifically, our modification local planner
requires only 10 sparse laser range finder readings and is
able to smoothly avoid obstacles. Additionally, we show thatYou can also read