Benchmarking Perturbation-based Saliency Maps for Explaining Atari Agents

Tobias Huber
University of Augsburg, Augsburg, Germany
tobias.huber@uni-a.de

Benedikt Limmer
University of Augsburg, Augsburg, Germany
benedikt.limmer@student.uni-augsburg.de

Elisabeth André
University of Augsburg, Augsburg, Germany
andre@informatik.uni-augsburg.de

arXiv:2101.07312v2 [cs.LG] 19 Jun 2021
Abstract

Recent years saw a plethora of work on explaining complex intelligent agents. One example is the development of several algorithms that generate saliency maps which show how much each pixel contributed to the agent's decision. However, most evaluations of such saliency maps focus on image classification tasks. As far as we know, there is no work that thoroughly compares different saliency maps for Deep Reinforcement Learning agents. This paper compares four perturbation-based approaches to create saliency maps for Deep Reinforcement Learning agents trained on four different Atari 2600 games. All four approaches work by perturbing parts of the input and measuring how much this affects the agent's output. The approaches are compared using three computational metrics: dependence on the learned parameters of the agent (sanity checks), faithfulness to the agent's reasoning (input degradation), and run-time. In particular, during the sanity checks we find issues with two approaches and propose a solution to fix one of those issues.

1   Introduction

With the rapid development of machine learning methods, Intelligent Agents powered by Deep Reinforcement Learning (DRL) are making their way into increasingly high-risk applications, such as healthcare and robotics [Stone et al., 2016]. However, with the growing complexity of these algorithms, it is hardly possible, if at all, to comprehend the decisions of the resulting agents [Selbst and Barocas, 2018]. The research areas of Explainable Artificial Intelligence (XAI) and Interpretable Machine Learning aim to shed light on the decision-making process of existing black-box models. In the case of Neural Networks with visual inputs, the most common explanation approach is the generation of saliency maps that highlight the most relevant input pixels for a given decision. Recent years saw a plethora of methods to create such saliency maps [Arrieta et al., 2020]. However, a current challenge for XAI is finding suitable measures for evaluating these explanations. For black-box models like deep neural networks, it is especially crucial to evaluate the faithfulness of the explanations (i.e., is the reasoning given by the explanation the same reasoning which the agent actually used) [Mohseni et al., 2020]. This need for evaluating the faithfulness of explanations was further demonstrated by Adebayo et al. [2018], who proposed sanity checks showing that, for some saliency approaches, there is no strong dependence between the agent's learned parameters and the resulting saliency maps.

Preprint. Under review.
So far, most faithfulness comparisons of saliency maps focus on image classification tasks. There
is little work on computationally evaluating saliency maps in different tasks like Reinforcement
Learning. Furthermore, these evaluations often try to cover as many different saliency map approaches
as possible. This mostly leads to selections of algorithms with distinct motivations and requirements,
which is less helpful for people with specific requirements. Without full access to the agent’s inner
architecture, for example, one cannot use methods that rely on the inner workings of the agent but
must rely on model agnostic methods, which can be applied to any agent. Model agnostic saliency
maps mostly come in the form of perturbation-based approaches that perturb parts of the input and
observe how much this affects the agent’s decision.
This work presents a computational comparison of four perturbation-based saliency map approaches:
the original Occlusion Sensitivity approach [Zeiler and Fergus, 2014], Local Interpretable Model-
agnostic Explanations (LIME) [Ribeiro et al., 2016], a Noise Sensitivity approach proposed for DRL
[Greydanus et al., 2018], and Randomized Input Sampling for Explanation (RISE) [Petsiuk et al.,
2018]. As test-bed, we use four DRL agents trained on different Atari 2600 games. As metrics, we
use the sanity checks proposed by Adebayo et al. [2018], an insertion metric that slowly inserts the
most important pixels according to the saliency maps, and run-time analysis. As far as we know, this
is the first time that sanity checks were done for perturbation-based saliency maps and the first direct
comparison of how faithful different perturbation-based saliency maps are to DRL agents.

2   Related Work

The XAI literature is rapidly growing in recent years. In this work, we focus on saliency maps
that highlight the areas of the input which were important for the agents’ decision. There are three
main ideas on how to create saliency maps. The first idea is to use the gradient with respect to each
input to see how much small changes of this input influence the prediction [Simonyan et al., 2014,
Sundararajan et al., 2017, Selvaraju et al., 2020]. These approaches require the underlying agent to
be differentiable and need access to the gradients of the agent. The second group of methods uses
modified propagation rules to calculate how relevant each neuron of the network was, based on the
intermediate results of the prediction. Examples for this are Layer-wise Relevance Propagation (LRP)
[Bach et al., 2015] or PatternAttribution [Kindermans et al., 2018]. This idea requires access to the
inner workings of the agent. Finally, perturbation-based approaches perturb areas of the input and
measure how much this changes the output of the agent. The major advantage of perturbation-based
approaches over the aforementioned methods is their model agnosticism. Since they only use the inputs and outputs of the agent, they can be applied to any agent without adjustments.
The evaluation metrics for XAI approaches can be separated into two broad categories: human user
studies and computational measurements [Mohseni et al., 2020]. Examples of human user-studies
of saliency maps for DRL agents are Huber et al. [2020] and Anderson et al. [2019], who evaluate
LRP and Noise Sensitivity saliency maps, respectively, with regard to mental models, trust, and
user satisfaction. To obtain more objective quantitative data, it is important to additionally evaluate
explanations through computational measurements. Such measurements also provide an easy way to
collect preliminary data before recruiting users for a user study.
The most common computational measurement for saliency maps is input degradation. Here, the
input of the agent is gradually deleted, starting with the most relevant input features according to
the saliency map. In each step, the agent’s confidence is measured. If the saliency maps faithfully
describe the agent’s reasoning, then the agent’s confidence should fall quickly. For visual input, this
is either done by deleting individual pixels per step [Petsiuk et al., 2018, Ancona et al., 2018] or by
deleting patches of the image in each step [Samek et al., 2017, Kindermans et al., 2018, Schulz et al.,
2020]. In addition to deleting features, some newer approaches also propose an insertion metric where
they start with "empty" inputs and gradually insert input features [Ancona et al., 2018, Petsiuk et al.,
2018, Schulz et al., 2020]. The aforementioned image degradation tests mostly compared several
gradient-based methods and one or two perturbation-based and modified propagation approaches.
Furthermore, all previous tests use image classification tasks for their degradation measurements. As
far as we know, there are no input degradation benchmarks for Reinforcement Learning tasks.
Another computational measurement for saliency maps is the so-called sanity checks proposed
by Adebayo et al. [2018]. These tests measure whether the saliency map is dependent on what the
agent learned. One method for this is gradually randomizing the layers of the neural network and

measuring how much this changes the saliency maps. Adebayo et al. did this for various gradient-
based approaches and Sixt et al. [2019] additionally tested LRP methods. As far as we know, there is
no work that computed sanity checks for perturbation-based saliency maps even though this is one of
the most popular saliency map approaches.

3       Experiments
This section presents details about the implementation of our experiments. The code for all experi-
ments is available online.1
The test-bed in our paper is the Atari Learning Environment [Bellemare et al., 2013]. Four DRL
agents were trained on the games MsPacman (simplified to Pac-Man in this work), Space Invaders,
Frostbite, and Breakout using the Deep Q-Network (DQN) [Mnih et al., 2015] implementation of the
OpenAI Baselines Framework [Dhariwal et al., 2017] (available under the MIT License). We chose
the DQN because it is the most basic DRL architecture which most other DRL agents build upon.
The games were selected because the DQN performs very well on Breakout and Space Invaders but
performs badly on Frostbite and Pac-Man. The agents make predictions by observing the last 4 frames
of the game and then choose from a pool of possible actions. Hereby, each frame is down-sampled
and greyscaled resulting in 84 × 84 × 4 input images. The reward is given by the change in in-game
score since the last state, which we scaled such that the minimal possible reward is 1. To normalize
the output values between different inputs, we use a softmax activation function for the output layer.
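
For illustration, a minimal Python sketch of this confidence read-out is shown below. The helper names (`confidence`, `q_network`, `state`) are placeholders and not part of the OpenAI Baselines API; the sketch only assumes that the trained Q-network maps an 84 × 84 × 4 frame stack to one Q-value per action.

```python
import numpy as np

def confidence(q_values: np.ndarray) -> np.ndarray:
    """Normalize Q-values to a confidence distribution via softmax."""
    z = q_values - q_values.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative usage, assuming `q_network` maps a stacked state of shape
# (84, 84, 4) to a vector with one Q-value per action:
# probs = confidence(q_network(state))
# original_action = int(np.argmax(probs))
# f_I = probs[original_action]   # f(I): confidence in the chosen action
```
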
Saliency Map Methods: The basic saliency map generation process is the same for all four
approaches compared in this work. Let f be the agent that takes a visual input I and maps it to a
confidence value for each possible action. Without loss of generality, f(I) describes the confidence in
the agent's original prediction, that is, the action which the agent chooses for the unperturbed image.
An input image I with height H and width W can be defined as a mapping I : Λ_I → ℝ^c of each pixel
λ ∈ Λ_I = {1, ..., H} × {1, ..., W} to c channels (e.g., c = 4 for the Atari environment). To determine
the relevance of each pixel λ for the prediction of the agent, all four approaches feed perturbed
versions of I to the agent and then compare the resulting confidence values with the original results.
However, the approaches differ widely in the way the image is perturbed and how the relevance per
pixel is computed:
Occlusion Sensitivity [Zeiler and Fergus, 2014]: This approach creates perturbed images I′ by
shifting an n × n patch across the original image I and occluding this patch by setting all the pixels
within to a certain color (e.g., black or gray). The importance S(λ) of each pixel λ inside the patch is
then computed based on the agent's confidence after the perturbation:

$$S(\lambda) = 1 - f(I') \qquad (1)$$

Since the original source does not go into details about the algorithm, we use the tf-explain implementation
as reference [tf explain, 2019]. As long as the saliency maps are normalized, this is equivalent
to f(I) − f(I′), since all values in the saliency map are shifted by the same constant f(I) − 1.
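
The following sketch illustrates how such a map can be computed. It is a simplified stand-in, not the tf-explain implementation: it assumes the patch is shifted with a stride equal to the patch size and that `f` already returns the agent's confidence in its original action.

```python
import numpy as np

def occlusion_sensitivity(state, f, patch_size=4, color=0.0):
    """Sketch of Occlusion Sensitivity (Eq. 1): occlude an n x n patch with a
    constant color and store 1 - f(I') for every pixel inside the patch."""
    h, w = state.shape[:2]
    saliency = np.zeros((h, w))
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            perturbed = state.copy()
            perturbed[y:y + patch_size, x:x + patch_size] = color
            saliency[y:y + patch_size, x:x + patch_size] = 1.0 - f(perturbed)
    return saliency
```
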
Noise Sensitivity [Greydanus et al., 2018]: Instead of completely occluding patches of the image,
this approach adds noise to the image I by applying a Gaussian blur to a circle with radius r around a
pixel λ. The modified image I′(λ) is then used to compute the importance of the covered circle by
comparing the agent's logit units π(·) (i.e., the outputs of all output neurons before softmax):

$$S(\lambda) = \frac{1}{2} \left\lVert \pi(I) - \pi(I'(\lambda)) \right\rVert^2 \qquad (2)$$

This is done for every r-th pixel, resulting in a temporary saliency map that is smaller than the input.
For the final saliency map, the result is up-sampled using bilinear interpolation.
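
A rough sketch of this procedure is given below. It assumes `pi` returns the vector of output-layer logits and uses SciPy and scikit-image for blurring and up-sampling; the blur strength and the hard circular mask are simplifying assumptions, and the original implementation by Greydanus et al. may differ in these details.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.transform import resize

def noise_sensitivity(state, pi, radius=4, sigma=3):
    """Sketch of Noise Sensitivity (Eq. 2): blur a circle of radius r around
    every r-th pixel and compare the agent's logits pi(.) before and after."""
    h, w = state.shape[:2]
    blurred = gaussian_filter(state, sigma=(sigma, sigma, 0))
    logits = pi(state)
    ys, xs = np.ogrid[:h, :w]
    scores = np.zeros((h // radius, w // radius))
    for i, cy in enumerate(range(0, h, radius)):
        for j, cx in enumerate(range(0, w, radius)):
            mask = ((ys - cy) ** 2 + (xs - cx) ** 2) <= radius ** 2
            perturbed = np.where(mask[..., None], blurred, state)
            scores[i, j] = 0.5 * np.sum((logits - pi(perturbed)) ** 2)
    # up-sample the coarse map to the input resolution (bilinear interpolation)
    return resize(scores, (h, w), order=1)
```
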
RISE [Petsiuk et al., 2018]: This approach uses a set of N randomly generated masks {M_1, ..., M_N}
for perturbation. To this end, temporary n × n masks are created by setting each element to 1 with a
probability p and 0 otherwise. These temporary masks are upsampled to the size of the input image
using bilinear interpolation. The images are perturbed by element-wise multiplication with those
masks, I ⊙ M_i. The relevance of each pixel λ is given by

$$S(\lambda) = \frac{1}{p \cdot N} \sum_{i=1}^{N} f(I \odot M_i) \cdot M_i(\lambda), \qquad (3)$$

where M_i(λ) denotes the value of the pixel λ in the i-th mask.

1 https://github.com/belimmer/PerturbationSaliencyEvaluation
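
The sketch below illustrates the core of Eq. (3). For brevity it omits the random shifting of the up-sampled masks used in the original RISE implementation; `f` is again assumed to return the confidence in the agent's original action.

```python
import numpy as np
from skimage.transform import resize

def rise(state, f, n_masks=3000, cell_size=18, p=0.8, rng=None):
    """Sketch of RISE (Eq. 3): average the confidences f(I * M_i) of randomly
    masked inputs, weighted by each mask's value at every pixel."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = state.shape[:2]
    saliency = np.zeros((h, w))
    for _ in range(n_masks):
        # small binary grid, kept with probability p, bilinearly up-sampled
        grid = (rng.random((cell_size, cell_size)) < p).astype(float)
        mask = resize(grid, (h, w), order=1)
        saliency += f(state * mask[..., None]) * mask  # element-wise masking
    return saliency / (p * n_masks)
```
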
LIME [Ribeiro et al., 2016]: The original image is divided into superpixels using segmentation
algorithms. Perturbed variations of the image are generated by “deleting” different combinations of
superpixels (i.e., setting all pixels of the superpixels to 0). The combination of occluded images and
the corresponding predictions by the agent are then used to train a locally weighted interpretable
model for N steps. Analyzing the weights of this local model provides a relevance value for each
superpixel.
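
A much simplified sketch of this idea is shown below. It uses SLIC segmentation (recent scikit-image) and a ridge regression as the locally weighted interpretable model; the similarity kernel and the regression choice are illustrative assumptions, and our experiments use the lime library itself.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.linear_model import Ridge

def lime_saliency(state, f, n_segments=80, n_samples=1000, rng=None):
    """Sketch of the LIME idea: segment the input into superpixels, 'delete'
    random subsets of them, and fit a weighted linear model whose coefficients
    serve as per-superpixel relevance values."""
    rng = np.random.default_rng(0) if rng is None else rng
    segments = slic(state, n_segments=n_segments, compactness=10, sigma=0.5,
                    channel_axis=-1)
    ids = np.unique(segments)
    samples = rng.integers(0, 2, size=(n_samples, len(ids)))  # 1 = keep superpixel
    preds, weights = [], []
    for z in samples:
        keep = np.isin(segments, ids[z == 1])
        preds.append(f(np.where(keep[..., None], state, 0.0)))
        # weight samples by their closeness to the unperturbed image
        weights.append(np.exp(-(1.0 - z.mean()) ** 2 / 0.25))
    model = Ridge(alpha=1.0).fit(samples, preds, sample_weight=weights)
    saliency = np.zeros(state.shape[:2])
    for seg_id, coef in zip(ids, model.coef_):
        saliency[segments == seg_id] = coef
    return saliency
```
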
We evaluate the generated saliency maps using three different computational metrics:
Sanity Checks: The parameter randomization test proposed by Adebayo et al. [2018] measures the
dependence between the saliency maps and the parameters learned by the neural network of the agent.
To this end, the parameters of each layer in the network are randomized in a cascading manner, starting
with the output layer. Every time a new layer is randomized, a saliency map for this version of the
agent is created. The resulting saliency maps are then compared to the saliency map for the original
network, using three different similarity metrics (Spearman rank correlation, Structural Similarity
(SSIM), and Pearson correlation of the Histogram of Oriented Gradients (HOGs)). Following Sixt et al.
[2019], we account for saliency maps that differ only in sign by additionally computing similarity with
an inverse version of the saliency maps and using the maximum similarity. Analogous to Adebayo
et al. [2018] we tuned the similarity metrics such that two randomly sampled saliency maps with
uniform distribution have mean similarity values (0.0087, 0.0136, 0.0096) and two random saliency
maps with Gaussian distribution have mean similarity (0.0093, 0.0374, 0.0087). If the saliency maps
depend on the learned parameters of the agent then the saliency maps for the randomized model
should vastly differ from the ones of the original model.
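
A compact sketch of the cascading randomization is given below. It assumes a Keras-style model whose layers expose get_weights()/set_weights() and, for brevity, only computes the Spearman rank correlation out of the three similarity metrics; `saliency_fn` is a placeholder for one of the approaches above.

```python
import numpy as np
from scipy.stats import spearmanr

def cascading_randomization(model, state, saliency_fn, rng=None):
    """Sketch of the parameter randomization sanity check: starting with the
    output layer, re-initialize one additional layer at a time and compare the
    resulting saliency map with the one for the trained model."""
    rng = np.random.default_rng(0) if rng is None else rng
    reference = saliency_fn(model, state).ravel()
    similarities = []
    for layer in reversed(model.layers):  # output layer first
        weights = layer.get_weights()
        if not weights:
            continue  # skip layers without learnable parameters
        layer.set_weights([rng.normal(0, 0.01, size=w.shape) for w in weights])
        randomized = saliency_fn(model, state).ravel()
        # account for saliency maps that only differ in sign
        rho = max(spearmanr(reference, randomized).correlation,
                  spearmanr(reference, -randomized).correlation)
        similarities.append(rho)
    return similarities
```
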
Insertion Metric: To test the premise that the most relevant pixels, according to the saliency maps,
have the highest impact on the agent, we use the insertion metric proposed by Petsiuk et al. [2018].
We do not use a deletion metric, since we feel that it is too similar to the way that perturbation-based
saliency maps are created. The insertion metric starts with a fully occluded image (i.e. the values
of all pixels are set to 0). In each step, 84 occluded pixels (approx. 1.2% of the full image) are
uncovered, starting with the most relevant pixels according to the saliency map. For LIME, the
superpixels are sorted by their relevance but the order of pixels within superpixels is randomized. The
partly uncovered image is then fed to the agent and its confidence in the original prediction, which the
agent chooses for the full image, is stored. If the saliency map correctly highlights the most important
pixels, then the agent’s confidence should increase quickly for each partly uncovered image.
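
The sketch below illustrates this procedure together with the Area Under the Curve (AUC) reported in Section 5; `f` is assumed to return the agent's confidence in the action chosen for the full image, and the per-superpixel randomization used for LIME is omitted.

```python
import numpy as np

def insertion_metric(state, saliency, f, pixels_per_step=84):
    """Sketch of the insertion metric: start from an all-zero image and uncover
    pixels in order of decreasing saliency, recording the confidence after
    every step."""
    h, w = state.shape[:2]
    order = np.argsort(saliency.ravel())[::-1]  # most relevant pixels first
    current = np.zeros_like(state)
    confidences = [f(current)]
    for start in range(0, h * w, pixels_per_step):
        ys, xs = np.unravel_index(order[start:start + pixels_per_step], (h, w))
        current[ys, xs] = state[ys, xs]
        confidences.append(f(current))
    # area under the confidence curve, with the x-axis normalized to [0, 1]
    auc = np.trapz(confidences, dx=1.0 / (len(confidences) - 1))
    return confidences, auc
```
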
Run-time Analysis: The run-time of an algorithm can be an important aspect when choosing between
different approaches. Therefore, we computed the mean time it took each algorithm to create a single
saliency map using the timeit python library. To this end, we measured a total of 1000 saliency maps
for each game.
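
A minimal sketch of such a measurement with timeit is shown below; `saliency_fn` and `states` stand in for one of the approaches above and the 1000 sampled game states.

```python
import timeit

def mean_runtime(saliency_fn, states):
    """Average seconds per saliency map over a list of states."""
    total = timeit.timeit(lambda: [saliency_fn(s) for s in states], number=1)
    return total / len(states)
```
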
Hardware: All the insertion metric and run-time tests were done on the same machine with an
Nvidia GeForce GTX TITAN X GPU to ensure comparability of the run-time results. The sanity
checks and parameter tests were divided between the aforementioned machine and another one with
an Nvidia GeForce GTX 1080 Ti GPU.

4   Parameter Tuning

All the perturbation-based saliency map approaches tested in this work depend on a choice of
parameters. To get an estimate of which parameters work well with the Atari environment, we
tested a range of different parameters for each approach. Since LIME and RISE in particular have
long computation times and a large number of possible parameter combinations, we only used 5
images to test the parameters. We chose the images among a stream of Pac-Man game-play with
the HIGHLIGHTS-DIV algorithm which selects a diverse set of states that give a good overview of
the agent’s policy [Amir and Amir, 2018]. These states were shown to produce more informative
saliency maps for human observers than randomly sampled states [Huber et al., 2020]. While the
sample size is too small to find optimal parameters, this does allow us to get a good estimate of the
approaches’ performance for a wide range of different parameters in a reasonable amount of time.
However, the process still took up to 10 hours for some of the segmentation algorithms we tested
with LIME. As metric, we used the insertion metric to estimate how well the generated saliency maps
capture the agent's reasoning.

Figure 1: Example saliency maps for three different Pac-Man game states generated by each of the
approaches investigated in this paper (columns: Input State, Occlusion Sensitivity, RISE, NS Original,
NS Black, NS Chosen Action, LIME Quickshift, LIME SLIC; NS is Noise Sensitivity). The circles mark
Pac-Man's position. For the LIME variants we only show the top 5 superpixels, as is customary for this approach.

For Occlusion Sensitivity, we tested patches of size 1 to 10 and two different occlusion colors: black
and gray. Independent of the size, black was better than gray. For Noise Sensitivity, we tested circles
with a radius of 1 to 10. In general, the smaller the patch size and radius, the better the results,
while the run-time increased. Since LIME and RISE are not suited to create such fine-grained
saliency maps, we decided against using the sizes 1 or 2. Moreover, the results for sizes 3 and 4 were
very close, and 4 even beat 3. Therefore, we decided to use patch size and radius 4 such that the
results are more comparable with the other approaches.
For RISE we tested 500, 1000, ..., 3000 masks of size 4 to 24. The probability p with which each mask
element is set to 1 varied between 0.1 and 0.9 in steps of 0.1. The best parameters were a probability of 0.8,
mask size 18, and 3000 masks.
For LIME we tested the three most common segmentation techniques, SLIC, Quickshift, and Felzenszwalb,
and varied the number of samples on which the local interpretable model is trained. Starting
from the default number of samples (1000), we increased it in steps
of 500 up to 3000. This range produced good results (all top 5 results contain some parameter
combinations with less than 3000 samples) while the run-time per saliency map did not diverge too
much from the other approaches. To determine which parameter ranges we should use for each
segmentation algorithm, we performed preliminary tests where we visually checked which parameters
resulted in different segmentation. The exact parameters we used are listed in Appendix A.1. The best
parameters for Felzenszwalb were scale factor 1, Gaussian smoothing kernel width 0.25, minimum
component size 2 and 2500 training samples. The best parameters for SLIC were 80 segments,
compactness factor 10, Gaussian smoothing kernel width 0.5, and 1000 samples. Quickshift obtained
the best result with kernel size 1, max distance 4, color ratio 0 and 3000 samples.
The top five results of all approaches and segmentation algorithms can be seen in Appendix A.1 and
the full results can be seen in our GitHub Repository.2

5        Results
Visual Assessment: Fig. 1 shows example saliency maps for the Pac-Man agent (saliency maps
for the remaining agents are shown in Appendix A.2). We only show the two LIME segmentation
algorithms that performed better on Pac-Man. To prevent cherry-picking, the states were chosen by
the HIGHLIGHTS-DIV algorithm which selects diverse and informative states about the agent’s
strategy [Amir and Amir, 2018]. Except for the Noise Sensitivity approaches with blurring, the
saliency maps generally seem to highlight Pac-Man and its surroundings.
Sanity Checks: The results of the parameter randomization test are shown in Fig. 2. The lower the
scores the higher the dependence on the agents’ learned parameters. Fig. 3 shows an example for
the different saliency maps during a single run of the sanity check. Notably, LIME has a very high
     2
         https://github.com/belimmer/PerturbationSaliencyEvaluation

Figure 2: Results of the parameter randomization sanity check for the different saliency map
approaches (three panels: Spearman, SSIM, Pearson). Measured for 1000 states of each of the 4 tested
games. Starting from the left, each mark represents an additional randomized layer starting with the
output layer. The y-axis shows the average similarity values (Spearman rank correlation, SSIM, Pearson
correlation of the HOGs). High values indicate a low parameter dependence. Since all LIME variants were
similar, we only show the one with the highest parameter dependence (Quickshift). The translucent error
bands show the 99% CI.

Figure 3: Example saliency maps for the parameter randomization sanity check (columns: Occlusion
Sensitivity, RISE, NS Original, NS Black, NS Chosen Action, LIME Quickshift, LIME SLIC, LIME
Felzenszwalb). All saliency maps are generated for the first state in Fig. 1. From top to bottom, each row
after the first is generated for agents with cascadingly randomized layers starting with the output layer. In
contrast to Fig. 1, the LIME saliency maps show all superpixels with their corresponding importance values.

Pearson correlation of HOGs, and RISE’s similarity values increase with the number of randomized
layers. Furthermore, the original Noise Sensitivity has very low dependence on the parameters of the
output layer when compared to Occlusion Sensitivity. Since those two approaches are very similar
in theory, we implemented two modifications of Noise Sensitivity to investigate the reason for this
difference in parameter dependence. First, Noise Sensitivity Black occludes the circles in the Noise
Sensitivity approach with black color instead of blurring them. Second, Noise Sensitivity Chosen
Action changes the way that the importance of each pixel is calculated from Eq. (2), which takes all
actions into account, to the one used by Occlusion Sensitivity (Eq. (1)), which focuses on the chosen
action. We did not test a combination of black circles and the Occlusion Sensitivity importance
calculation, since that would be essentially equivalent to Occlusion Sensitivity with circles instead
of squares. While the black occlusion barely changed the sanity check results, the change of
the importance calculation immensely increased the dependence on the learned parameters of the
output layer. Noise Sensitivity Chosen Action and Occlusion Sensitivity both show high parameter
dependence across all three similarity metrics.

Figure 4: The insertion metric results for four different Atari games (from left to right: Pac-Man,
Space Invaders, Breakout, and Frostbite), averaged over 1000 steps. The x-axis shows the percentage
of inserted pixels and the y-axis shows the average confidence in the original prediction for those
modified states. For Noise Sensitivity and LIME we only plot the variant with the highest AUC. The
error bands show the 99% CI.

AUC                     Pac-Man   Space Invaders   Breakout   Frostbite
Occlusion Sensitivity   0.351     0.293            0.354      0.123
RISE                    0.351     0.248            0.359      0.123
NS Original             0.130     0.211            0.281      0.068
NS Black                0.141     0.213            0.279      0.072
NS Chosen Action        0.115     0.257            0.301      0.076
LIME Quickshift         0.21      0.214            0.289      0.072
LIME SLIC               0.197     0.202            0.285      0.067
LIME Felzenszwalb       0.172     0.219            0.292      0.071

Table 1: The average Area Under the Curve (AUC) for the graphs obtained by the insertion metric
(Fig. 4). NS is Noise Sensitivity. The average was computed across 1000 states for each Atari model.

Insertion Metric: Fig. 4 shows the insertion metric results for the best parameters for each game
(the remaining LIME and Noise Sensitivity variants can be seen in Appendix A.2). Table 1 reports the
average Area Under the Curve (AUC) for all parameter combinations. Across all games, Occlusion
Sensitivity and RISE achieve the highest AUC values, while LIME and Noise Sensitivity perform worse.
Noticeably, the confidence often rises above the confidence for the full image. This is related to
the fact that the agents are not perfectly sure about their actions (average confidence around 0.3).
Furthermore, we use a softmax activation function, which increases the confidence in the observed
action when the agent's confidence in other actions decreases.
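
A small numeric illustration of this softmax effect, with hypothetical Q-values, is given below: the chosen action's Q-value stays the same, but lowering the other Q-values (e.g., by occluding objects that only matter for those actions) raises the softmax confidence in the chosen action.

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 1.0, 1.0]))[0])  # ~0.475 (full image)
print(softmax(np.array([2.0, 0.0, 0.0, 0.0]))[0])  # ~0.711 (others suppressed)
```
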
Run-time Analysis: The average run-times for each of the tested saliency map approaches with the
final parameters we used are shown in Table 2. Occlusion Sensitivity and Noise Sensitivity Black,
which simply occlude image patches with black pixels, are faster than the approaches with more
complex image perturbations. However, this strongly depends on the chosen parameters as can be
seen in the low run-time of LIME with SLIC segmentation. More time measurements from our
parameter tests can be seen in Appendix A.1.

6   Discussion
Occlusion Sensitivity performed the best across all tests we ran. It achieved the highest AUC values
in 3 out of 4 games and is only slightly behind the best approach in the remaining game (Fig. 4 and
Table 1). This result is in contrast to the evaluations by Schulz et al. [2020] and Petsiuk et al. [2018],
where Occlusion Sensitivity was among the worst of the tested approaches. We think that this is
mainly due to the differences in the domains. In most Atari games, a black square really means that

Approach                Run-time (s)
Occlusion Sensitivity   0.722
RISE                    4.914
NS Blur                 1.614
NS Black                0.712
NS Chosen Action        1.632
LIME Quickshift         3.135
LIME SLIC               0.858
LIME Felzenszwalb       3.189

Table 2: The mean number of seconds it took a perturbation-based approach to generate one saliency
map. The average was computed across 1000 states for each game.

there is no relevant object at this position. This is not the case for the real-world images used by
Petsiuk et al. and Schulz et al. Moreover, Schulz et al. use a different implementation of the insertion
metric, where n × n patches are inserted in each step instead of the top n pixels. Ancona et al.
[2018], who use individual pixels in each step, found that Occlusion Sensitivity performed similarly
to the compared gradient-based method. In the parameter randomization sanity checks, Occlusion
Sensitivity is very dependent on the learned parameters (Fig. 2). The parameter dependence is
among the highest of all the perturbation-based saliency maps we tested and it is on par with the best
gradient-based saliency maps tested by Adebayo et al. [2018]. However, their tests were done in
another domain, so this comparison should be taken with a grain of salt. Finally, Occlusion Sensitivity has
one of the lowest run-times (Table 2) and was the approach for which we found suitable parameters most easily.
Noise Sensitivity, in its original formulation, performed quite badly in our tests. It is especially
concerning that the approach only showed very little dependence on the parameters of the output
layer (Fig. 2). Since the output layer has the highest impact on the actual decision of a network,
it is crucial that a faithful saliency map depends on the weights learned in this layer. Our results
empirically show that replacing the original equation to calculate the importance of each pixel S(λ)
(Eq. (2)) with the equation used by Occlusion Sensitivity (Eq. (1)) greatly increases the parameter
dependence. We think that this is due to the fact that Eq. (2) takes all actions into account and
therefore measures a general increase in entropy within the activations of the output layer. In contrast,
Eq. (1) only measures the action which is actually chosen and therefore captures a more specific
change in the output layer activation. Recently, Puri et al. [2020] also criticized that the saliency maps
by Greydanus et al. [2018] take all actions into account. The results of our sanity checks provide the
first computational evidence for this critique. Puri et al. propose a solution to this problem, which is
similar to our adjustment. In the future, we would like to include their approach in our evaluation.
Changing the perturbation within the circles from blurring to black occlusion did not have a big
impact on the parameter dependence. Interestingly, however, changing the perturbation in this way
increased the insertion metric score for the game Pac-Man (Table 1). For this game, Greydanus
et al. [2018] reported that their Noise Sensitivity approach produced unintuitive saliency maps. Our
results indicate that this is not due to a flaw in the agents but rather that blurred perturbation is not
suitable for this game. In the other games, the Noise Sensitivity Chosen Action variant achieves higher
insertion metric scores than the other Noise Sensitivity variants (Table 1). In Space Invaders, it even
obtained the second-highest AUC among all saliency approaches. Together with its good parameter
dependence and the fact that it was easy to find suitable parameters for this approach, we think that this
modified Noise Sensitivity is a good alternative to Occlusion Sensitivity for environments without a
single color that is suited to occlude the input.
RISE obtained comparable AUC values to Occlusion Sensitivity in the insertion metric (Fig. 4 and
Table 1). In Breakout it had the highest AUC but in Space Invaders it was below both Occlusion
and the Noise Sensitivity Chosen Action. These high scores are in line with the results by Petsiuk
et al. [2018] who found their RISE approach to perform better than Occlusion Sensitivity, LIME,
and a gradient-based method on an image classification task. Notably, RISE required a much higher
run-time to acquire those results (Table 2) and took more resources for fine-tuning than Occlusion and
Noise Sensitivity. Visually, the saliency maps produced by RISE and Occlusion Sensitivity mostly
agree on the most relevant region in the input states (see Fig. 1). However, RISE produces noisier
saliency maps, making it harder to quickly interpret the results. This noise might also be related to
the biggest disadvantage of RISE. During the parameter randomization sanity check, RISE saliency
maps got more similar to the original explanation after the first randomization of the output layer
(Fig. 2). Investigating further, we found that nearly all saliency maps, which are generated after more
than the output layer was randomized, look the same (Fig. 3). It seems like they reflect the structure
of the randomly generated masks. We made sure that the same masks are used for all saliency maps
during the sanity check. Since the background noise in the RISE saliency maps for the fully trained
agent also seems to reflect the structure of the masks, this noise might be the reason for the similarity.
Thus, the high similarity values might not stem from low parameter dependence. However, this needs
to be investigated further before the approach can be relied upon.
LIME was the hardest approach to fine-tune. General parameters, like the number of training samples
for the local model, combined with different segmentation algorithms that have their own parameters,
result in a rapidly growing number of possible parameter combinations. Even when taking all
those parameters into account and trying to optimize for the insertion metric, we were not able to
achieve good results in this metric (Fig. 4 and Table 1). This contrasts with the findings by Petsiuk et al.

[2018] who found LIME to perform better than Occlusion Sensitivity in an image classification task.
Our results indicate that LIME is not suited to identify the most important pixels for Atari agents.
LIME’s run-time highly depends on the chosen parameters and we found that it could easily explode
during our parameter tests, making the parameter search even more resource-intensive (Appendix
A.1). However, with the final parameters we used, LIME was faster than RISE. The SLIC segmentation variant
was even among the fastest saliency map approaches (Table 2). The main positive result for LIME is
its high dependence on the learned parameters of the agents. Here, the best LIME variant (Quickshift)
was on par with Occlusion Sensitivity and Noise Sensitivity Chosen Action. Only the Pearson
correlation of the HOGs was very high between LIME saliency maps for the trained and randomized
agents. However, the reason for this is not necessarily a low dependence on the agent’s learned
weights. More likely it is due to the fact that all LIME saliency maps for a given state work with
the same superpixels. Since every pixel inside a superpixel has the same value there are hard edges
between the superpixels. These edges are captured by the HOGs and result in high values of the
Pearson correlation of the HOGs.

Limitations As with every evaluation, our study has limitations. First, we did not fine-tune the
approaches for each game individually. To save time, we only used one game to find parameters
that work reasonably well with the general Atari environment. It is likely that there are differences
between the optimal parameters for each game. However, since the tuning process was the same for
all approaches, we think that the results are still representative. The results for a fully fine-tuned
game can be seen with Pac-Man. Second, the metrics in our evaluation only provide an estimate of
the faithfulness of saliency maps. Especially the insertion metric is only an approximation of how
well a saliency map captures the reasoning of an agent. So far, there is no way to obtain perfect
ground truth about which pixels were the most important for a DRL agent. In this context, we want
to emphasize that we do not claim that the best saliency maps according to our evaluation perfectly
capture the agents’ reasoning. Creating such perfect saliency maps is still an open challenge and this
work aims to guide the development in this direction. Solely relying on saliency maps to be 100%
accurate in high-risk domains like healthcare could lead to a negative social impact. For now, saliency
maps should not be used in isolation but as part of an interpretability toolbox. Finally, we chose one
of the most basic DRL architectures without any sophisticated adjustments for our experiments to
ensure that the results generalize as much as possible. Since all the saliency map approaches we
tested are model-agnostic the results should not change drastically with different agent architectures.
In particular, we expect the sanity check results to be largely independent of the underlying agents.
However, to be absolutely sure, we plan to include other architectures in future experiments.

7   Conclusion

This paper compared four different perturbation-based saliency map approaches measuring their
dependence on the agent’s parameters, their faithfulness to the agent’s reasoning, and their run-time.
The three most interesting findings from our experiments are:

       • The simplest approach produces the best-suited saliency map for our agents. Occlusion
         Sensitivity with black occlusion color performs the best across all our metrics.
       • Noise Sensitivity, which was proposed for the Atari environment and is one of the most
         prominent saliency map methods for DRL agents, did not perform well in our tests and
should be adjusted in the future. Especially concerning is the fact that the original Noise
Sensitivity approach shows little dependence on the learned parameters of the output layer.
         We empirically show that replacing the original importance calculation with the one used by
         Occlusion Sensitivity, which only takes the chosen action into account, drastically increases
         parameter dependence. Moreover, it also improves the insertion metric results in most games
         we tested. Thus, we propose that this variant should be used in the future.
       • Both LIME and RISE showed more severe issues in our tests. Even with extensive parameter
         tuning to optimize the insertion metric we did not manage to achieve good insertion metric
         results with LIME. In contrast, RISE failed the parameter randomization sanity check by
         showing high similarities between saliency maps for trained and randomized agents. While
         we think that this might not completely stem from low parameter dependence, it should be
         investigated further before the approach can be relied upon.

The computational measurements in this work present a first step to fully evaluate perturbation-based
saliency maps for DRL. In the future, we want to build upon the insights from this paper and conduct
a human user study, similar to the one we did in Huber et al. [2020], to evaluate how useful the
saliency map approaches with good computational results are for actual end-users.

References
J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency
   maps. In Advances in Neural Information Processing Systems, pages 9505–9515, 2018.
D. Amir and O. Amir. HIGHLIGHTS: summarizing agent behavior to people. In Proceedings of the
  17th International Conference on Autonomous Agents and MultiAgent Systems, pages 1168–1176,
  2018. URL http://dl.acm.org/citation.cfm?id=3237869.
M. Ancona, E. Ceolini, C. Öztireli, and M. Gross. Towards better understanding of gradient-based
 attribution methods for deep neural networks. In 6th International Conference on Learning
 Representations, 2018. URL https://openreview.net/forum?id=Sy21R9JAW.
A. Anderson, J. Dodge, A. Sadarangani, Z. Juozapaitis, E. Newman, J. Irvine, S. Chattopadhyay,
  A. Fern, and M. Burnett. Explaining reinforcement learning to mere mortals: An empirical study. In
  Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-
  19, pages 1328–1334. International Joint Conferences on Artificial Intelligence Organization, 7
  2019. URL https://doi.org/10.24963/ijcai.2019/184.
A. B. Arrieta, N. D. Rodríguez, J. D. Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-Lopez,
  D. Molina, R. Benjamins, R. Chatila, and F. Herrera. Explainable artificial intelligence (XAI):
  concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion, 58:82–115,
  2020. doi: 10.1016/j.inffus.2019.12.012.
S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise
  explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10
  (7), 2015.
M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An
 evaluation platform for general agents. J. Artif. Intell. Res., 47:253–279, 2013. doi: 10.1613/jair.
 3912. URL https://doi.org/10.1613/jair.3912.
P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu,
   and P. Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.
S. Greydanus, A. Koul, J. Dodge, and A. Fern. Visualizing and understanding atari agents. In
  Proceedings of the 35th International Conference on Machine Learning, pages 1787–1796, 2018.
   URL http://proceedings.mlr.press/v80/greydanus18a.html.
T. Huber, K. Weitz, E. André, and O. Amir. Local and global explanations of agent behavior:
   Integrating strategy summaries with saliency maps. CoRR, abs/2005.08874, 2020. URL https:
  //arxiv.org/abs/2005.08874.
P. Kindermans, K. T. Schütt, M. Alber, K. Müller, D. Erhan, B. Kim, and S. Dähne. Learning how
   to explain neural networks: Patternnet and patternattribution. In 6th International Conference
   on Learning Representations. OpenReview.net, 2018. URL https://openreview.net/forum?
   id=Hkn7CBaTW.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Ried-
   miller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement
   learning. nature, 518(7540):529–533, 2015.
S. Mohseni, N. Zarei, and E. D. Ragan. A multidisciplinary survey and framework for design and
   evaluation of explainable ai systems, 2020.
V. Petsiuk, A. Das, and K. Saenko. Rise: Randomized input sampling for explanation of black-box
   models. arXiv preprint arXiv:1806.07421, 2018.

N. Puri, S. Verma, P. Gupta, D. Kayastha, S. Deshmukh, B. Krishnamurthy, and S. Singh. Explain
  your move: Understanding agent actions using specific and relevant feature attribution. In 8th
  International Conference on Learning Representations, ICLR. OpenReview.net, 2020.
M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should i trust you?" Explaining the predictions of
 any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge
 discovery and data mining, pages 1135–1144, 2016.
W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller. Evaluating the visualization
  of what a deep neural network has learned. IEEE Trans. Neural Networks Learn. Syst., 28(11):
  2660–2673, 2017. URL https://doi.org/10.1109/TNNLS.2016.2599820.
K. Schulz, L. Sixt, F. Tombari, and T. Landgraf. Restricting the flow: Information bottlenecks for
  attribution. In 8th International Conference on Learning Representations, 2020. URL https:
  //openreview.net/forum?id=S1xWh1rYwB.
A. D. Selbst and S. Barocas. The intuitive appeal of explainable machines. Fordham L. Rev., 87:1085,
  2018.
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual
  explanations from deep networks via gradient-based localization. Int. J. Comput. Vis., 128(2):
  336–359, 2020. URL https://doi.org/10.1007/s11263-019-01228-7.
K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising
  image classification models and saliency maps. In 2nd International Conference on Learning
  Representations, 2014. URL http://arxiv.org/abs/1312.6034.
L. Sixt, M. Granz, and T. Landgraf. When explanations lie: Why modified BP attribution fails. CoRR,
   abs/1912.09818, 2019. URL http://arxiv.org/abs/1912.09818.
P. Stone, R. Brooks, E. Brynjolfsson, R. Calo, O. Etzioni, G. Hager, J. Hirschberg, S. Kalyanakrishnan,
   E. Kamar, S. Kraus, K. Leyton-Brown, D. Parkes, P. William, S. AnnaLee, S. Julie, T. Milind, and
   T. Astro. Artificial intelligence and life in 2030. One Hundred Year Study on Artificial Intelligence:
   Report of the 2015-2016 Study Panel, 2016.
M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In D. Precup
 and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning,
 volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 2017. URL
 http://proceedings.mlr.press/v70/sundararajan17a.html.
tf explain. Interpretability methods for tf.keras models with tensorflow 2.0. https://github.com/
   sicara/tf-explain, 2019.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European
 conference on computer vision, pages 818–833. Springer, 2014.

A     Appendix
A.1   Parameter Tuning Results

Tables 3, 4, and 5 show the top five results of the parameter tests, as described in Section 4 of the main
paper, for Occlusion Sensitivity, Noise Sensitivity, and RISE, respectively. Note that the run-time
values in the appendix might differ from the ones in the main results since we ran some of the
parameter tests on a different machine than the run-time tests.

AUC     Patch Size   Color   Time
4.513   1            0.0     64.622
4.020   2            0.0     14.593
3.160   4            0.0      3.958
3.130   3            0.0      7.274
2.682   5            0.0      2.517

Table 3: Best parameters for Occlusion Sensitivity.

AUC     Radius   Time
0.992   1        127.870
0.917   2         34.349
0.858   5          5.864
0.854   4          8.060
0.852   3         14.868

Table 4: Best parameters for Noise Sensitivity.

AUC     Probability p   Mask Size   Number of Masks   Time
                 3.288        0.8            18              3000               25.928
                 3.207        0.8            22              3000               25.374
                 3.184        0.8            21              2500               21.091
                 3.182        0.7            24              3000               26.170
                 3.145        0.8            16              3000               25.351
                                Table 5: Best parameters for RISE

For LIME we tested the three most prominent segmentation algorithms: Felzenszwalb, SLIC, and
Quickshift. For Felzenszwalb segmentation we used a scale factor of 1, 21, ..., 101, a minimum
component size from 1 to 8, and Gaussian smoothing kernels with width σ of 0, 0.25, ..., 1. The top
results are shown in Table 6. For SLIC we tested 40, 60, ..., 240 segments, a compactness factor of
0.001, 0.01, ..., 10, and Gaussian smoothing kernels with width σ of 0, 0.25, ..., 1. The top five parameter
combinations can be seen in Table 7. Finally, we tested Quickshift with a color ratio of 0.0, 0.33, 0.66,
and 0.99, a kernel size from 1 to 6, and a max distance of kernel size · i, where i goes from 1 to 4.
The top results are shown in Table 8.

                AUC Scale Sigma Minimum Size Num Samples                  Time
                1.843     1      0.25          2              2500        7.116
                1.792     1       1.0          2              3000        21.70
                1.741     1       1.0          0              3000       53.359
                1.740     1       1.0          1              1000       17.367
                1.731     1      0.25          1              1000        5.528
               Table 6: Best parameters for LIME with Felzenszwalb segmentation.

AUC     Number of Segments Compactness Sigma Num Samples             Time
1.987            80                 10.0        0.5       1000       0.835
1.966            80                 10.0        0.5       3000       2.429
1.952            80                 10.0       0.75       1000       0.859
1.949            80                 10.0        0.5       1500       1.256
1.942            80                 10.0       0.25       2500       2.132
         Table 7: Best parameters for LIME with SLIC segmentation.

   AUC Kernel Size Max Distance Ratio Num Samples                 Time
   2.061        1               4         0.0        3000         2.957
   2.051        1               1        0.33        3000        13.380
   2.014        1               4         0.0        2500         2.687
   2.005        1               1        0.66        3000        13.086
   1.951        1               1        0.99        2500        10.911
      Table 8: Best parameters for LIME with Quickshift segmentation.

A.2    Additional Results

In this section, we show some additional results that did not fit in the main paper. Fig. 5 shows
example saliency maps for HIGHLIGHTS-DIV states of the remaining three games apart from Pac-
Man. Fig. 6 shows the results of the sanity checks for LIME with all three segmentation algorithms.
The insertion metric results for all variants of Noise Sensitivity and LIME are shown in Fig. 7.

Figure 5: Example saliency maps for the remaining games we tested (columns: Input State, Occlusion
Sensitivity, RISE, NS Original, NS Black, NS Chosen Action, LIME Quickshift, LIME Felzenszwalb).
From top to bottom: Breakout, Space Invaders, and Frostbite. NS is Noise Sensitivity. For the LIME
variants we only show the top 5 superpixels, as is customary for this approach, and we only show the two
segmentation variants that performed best on these games.

Figure 6: Results of the parameter randomization sanity check for the different LIME segmentation
variants (three panels: Spearman, SSIM, Pearson). Measured for 1000 states of each of the 4 tested games.
Starting from the left, each mark represents an additional randomized layer starting with the output layer.
The y-axis shows the average similarity values (Spearman rank correlation, SSIM, Pearson correlation of
the HOGs). High values indicate a low parameter dependence. The translucent error bands show the 99% CI.

Figure 7: The remaining insertion metric results for four different Atari games (panels: Pac-Man,
Space Invaders, Breakout, and Frostbite), averaged over 1000 steps. NS is Noise Sensitivity. The x-axis
shows the percentage of inserted pixels and the y-axis shows the average confidence in the original
prediction for those modified states. The error bands show the 99% CI.
