What's in the Box? Exploring the Inner Life of Neural Networks with Robust Rules


                                      Jonas Fischer 1 Anna Oláh 1 Jilles Vreeken 2

Abstract

We propose a novel method for exploring how neurons within neural networks interact. In particular, we consider activation values of a network for given data, and propose to mine noise-robust rules of the form X → Y, where X and Y are sets of neurons in different layers. We identify the best set of rules by the Minimum Description Length principle, as those rules that together are most descriptive of the activation data. To learn good rule sets in practice, we propose the unsupervised ExplaiNN algorithm. Extensive evaluation shows that the patterns it discovers give clear insight into how networks perceive the world: they identify shared and class-specific traits, compositionality, as well as locality in convolutional layers. Moreover, they are not only easily interpretable, but also super-charge prototyping by identifying which neurons to consider in unison.

1. Introduction

Neural networks achieve state-of-the-art performance in many settings. However, how they perform their tasks, how they perceive the world, and especially how the neurons within the network operate in concert, remains largely elusive. While there exists a plethora of methods for explaining neural networks, most of these focus either on the mapping between input and output (e.g. model distillation) or only characterize a given set of neurons, but cannot identify which set to look at in the first place (e.g. prototyping). In this paper, we introduce a new approach to explain how the neurons in a neural network interact. In particular, we consider the activations of neurons in the network over a given dataset, and propose to characterize these in terms of rules X → Y, where X and Y are sets of neurons in different layers of the network. A rule hence represents that neurons Y are typically active when neurons X are. For robustness we explicitly allow for noise, and to ensure that we discover a succinct yet descriptive set of rules that captures the regularities in the data, we formalize the problem in terms of the Minimum Description Length principle (Grünwald & Roos, 2019). To discover good rule sets, we propose the unsupervised ExplaiNN algorithm and show that the rules we discover give clear insight into how networks perform their tasks. As we will see, these identify what the network deems similar and different between classes, how information flows within the network, and which convolutional filters it expects to be active where. Our rules are easily interpretable, give insight into the differences between datasets, show the effects of fine-tuning, and super-charge prototyping as they tell us which neurons to consider in unison.

Explaining neural networks is of widespread interest, and especially important with the emergence of applications in healthcare and autonomous driving. In the interest of space we here only briefly introduce the work most relevant to ours, and refer to surveys for more information (Adadi & Berrada, 2018; Ras et al., 2018; Xie et al., 2020; Gilpin et al., 2018). There exist several proposals for investigating how networks arrive at a decision for a given sample, with saliency mapping techniques for CNNs among the most prominent (Bach et al., 2015; Zhou et al., 2016; Sundararajan et al., 2017; Shrikumar et al., 2017). Although these provide insight into what parts of the image are used, they are inherently limited to single samples, and do not reveal structure across multiple samples or classes. For explaining the inner workings of a CNN, research mostly focuses on feature visualization techniques (Olah et al., 2017) that produce visual representations of the information captured by neurons (Mordvintsev et al., 2015; Gatys et al., 2015). Although these visualizations provide insight into how CNNs perceive the world (Øygard, 2016; Olah et al., 2018), it has been shown that concepts are often encoded over multiple neurons, and that inspecting individual neurons does not provide meaningful information about their role (Szegedy et al., 2013; Bau et al., 2017). How to find such groups of neurons, and how information is routed between layers of the network, however, remains unsolved.

1 Max Planck Institute for Informatics, Germany. 2 CISPA Helmholtz Center for Information Security, Germany. Correspondence to: Jonas Fischer.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

An orthogonal approach is model distillation, where we train easy-to-interpret white-box models to mimic the decisions of a neural network (Ribeiro et al., 2016; Frosst & Hinton, 2017; Bastani et al., 2017; Tan et al., 2018). Rules of the form if–then are easily interpretable, and hence a popular technique for model distillation (Taha & Ghosh, 1999; Lakkaraju et al., 2017). Existing techniques (Robnik-Šikonja & Kononenko, 2008; Özbakır et al., 2010; Barakat & Diederich, 2005) aim for rules that directly map input to output, rather than providing insight into how information flows through the network. Tran & d'Avila Garcez (2018) restrict themselves to Deep Belief Networks only, and for these propose to mine all sufficiently strong association rules. As such, their method suffers from the well-known pattern explosion. In contrast, Chu et al. (2018) propose to explain NNs by deriving decision boundaries of a network using polytope theory. While this approach permits strong guarantees, it is limited to very small (< 20 hidden neurons) piecewise linear NNs. In sum, existing methods either do not give insight into what happens inside a neural network, and/or are not applicable to the type or size of state-of-the-art convolutional neural networks. Zhang et al. (2018) show how we can gain insight into convolutional layers of neural networks by building an explanatory graph over sets of neurons. In contrast to what we propose, their method does not elucidate the relation between such filters and subsequent dense layers, nor their relation to the network output.

Instead, we propose to mine sets of rules to discover groups of neurons that act together across different layers in feedforward networks, and so reveal how information is composed and routed through the network to arrive at the output. To discover rules over neuron activations, we need an unsupervised approach. While many rule mining methods exist, either based on frequency (Agrawal & Srikant, 1994; Bayardo, 1998; Moerchen et al., 2011) or statistical testing (Hämäläinen, 2012; Webb, 2010), these typically return millions of rules even for small datasets, thus thwarting the goal of interpretability. We therefore take a pattern set mining approach similar to Grab (Fischer & Vreeken, 2019), where we are after that set of rules that maximizes a global criterion, rather than treating each rule independently. Although it provides succinct and accurate sets of rules, Grab is limited to conjunctive expressions. This is too restrictive for our setting, as we are also after rules that explain shared patterns between classes, and that are robust to the inherently noisy activation data, both of which require a more expressive pattern language of conjunctions, approximate conjunctions, and disjunctions. We hence present ExplaiNN, a non-parametric and unsupervised method that learns sets of such rules efficiently.

2. Theory

We first informally discuss how to discover association rules between neurons. We then formally introduce the concept of robust rules and how to find them for arbitrary binary datasets; last, we show how to combine these ideas to reveal how neurons are orchestrated within feedforward networks.

2.1. Patterns of neuron co-activation

Similar to neurons in the brain, artificial neurons send information along their outgoing edges when they are active. To understand the flow of information through the network, it is hence essential to understand the activation patterns of neurons between layers. Our key idea is to use recent advances in pattern mining to discover a succinct and non-redundant set of rules that together describe the activation patterns found for a given dataset. For two layers I_i, I_j, these rules X → Y, with X ⊂ I_i and Y ⊂ I_j, express that the set of neurons Y is usually co-activated when the neurons X are co-activated. That is, such a rule provides local information about co-activations within a layer, as well as the dependence of neurons between layers. Starting from the output layer, we discover rules between consecutive layers I_j, I_{j−1}. Discovering overlapping rules between layers, X → Y and Y → Z with X ⊂ I_j, Y ⊂ I_{j−1}, Z ⊂ I_{j−2}, allows us to trace how information flows through the entire network.

Before we can mine rules between two sets of neurons – e.g. layers – I_i and I_j of a network, we have to obtain its binarized activations for a given data set D = {d_k = (s_k, o_k)}. In particular, for each sample s_k and neuron set I_i, we take the tensor of activations φ_i and binarize it to φ_i^b. For networks with ReLU activations, which binarize naturally at threshold 0, we might lose some information about activation strength that is eventually used by subsequent layers. This binarization, however, allows us to derive crisp symbolic, directly interpretable statements on how neurons interact. Furthermore, binarization reflects the natural on/off state of biological neurons, also captured by smooth step functions such as sigmoid or tanh used in artificial neural networks. We gather the binarized activations into a dataset D where each row t_k corresponds to the concatenation of φ_i^b and φ_j^b of I_i and I_j for s_k, i.e., t_k ∈ D is a binary vector of length |I_i| + |I_j|. See Fig. 1 for a toy example.
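
As a concrete illustration of this construction, the following sketch binarizes ReLU activations of two layers at threshold 0 and concatenates them into the binary database D. This is our own illustration of the setup described above, not the reference implementation; the model and layer names are placeholders.

    # Sketch: building the binary activation database D for two layers.
    # Illustration under the paper's setup, not the authors' code; the model
    # and layer names ("fc7", "fc6") are placeholders.
    import numpy as np
    import tensorflow as tf

    def binarize_activations(model, layer_names, samples, threshold=0.0):
        """Return one binary row per sample: the concatenated, thresholded
        activations of the requested layers (phi_i^b and phi_j^b)."""
        outputs = [model.get_layer(name).output for name in layer_names]
        probe = tf.keras.Model(inputs=model.input, outputs=outputs)
        activations = probe.predict(samples)
        if not isinstance(activations, list):
            activations = [activations]
        flat = [a.reshape(len(samples), -1) for a in activations]
        # ReLU activations binarize naturally at 0: active (1) vs. inactive (0).
        return np.concatenate([(a > threshold).astype(np.uint8) for a in flat],
                              axis=1)

    # D = binarize_activations(vgg_s, ["fc7", "fc6"], images)
    # D[k] is the binary vector t_k of length |I_i| + |I_j| for sample k.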

Figure 1: Overview. For a given network (left), binarized activations are gathered for the layers I_i, I_j for each sample, and summarized in the binary database D (right). Rules are discovered over D; a good rule set M is given at the bottom right, with rules X → Y, X ⊂ I_i, Y ⊂ I_j.

Next, given binary activation data D, our goal is to find that set of rules that together succinctly describe the observed activations. The Minimum Description Length (MDL) principle lends itself as an objective to find such sets. MDL is a statistically well-founded and computable approximation of Kolmogorov complexity (Li & Vitányi, 1993). First introduced by Rissanen (1978), the essential idea is that the model M* ∈ M that best describes data D is the model that losslessly encodes D using the fewest bits, M* = arg min_{M ∈ M} L(D, M). Here, our model class M is the superset of all possible rules over D, and by MDL we identify the best model M* as the one that compresses the data best. Traditionally, rule mining is restricted to conjunctions over items, which is not sufficient for our application: neuron activations tend to be noisy, and labels are inherently mutually exclusive. We hence consider an extended language that allows for partial disjunctions of items (neurons, labels), and introduce a codelength function L(D, M) to instantiate MDL for our model class of rule sets.

2.2. MDL for Robust Rules

Our goal is to find a set of rules M that, in terms of description length L(D, M), best describes a binary database D = {t | t ⊂ I} that consists of transactions t that are subsets of items I. Each rule is of the form X → Y, with X, Y ⊂ I, and indicates that Y is strongly associated with X, i.e. occurs mostly in transactions where X is present. We say a rule X → Y applies to a transaction t iff X ⊂ t, and say a rule holds for t if additionally Y ⊂ t. We denote these transaction sets as T_X = {i | t_i ∈ D, X ⊂ t_i}, respectively T_{Y|X} = {i | t_i ∈ T_X, Y ⊂ t_i}. Based on these definitions of rule transaction sets, we can now formally introduce our codelength function L(D, M).

Baseline model   Our base model M_ind = {∅ → I | ∀I ∈ I} consists of singleton rules only, i.e. it models that all items I are generated independently. To send the n transactions of D using M_ind, we simply send for each item I in which of all transactions in the database it appears. We can do so optimally using a log-binomial code, which is given by log (|T_∅| choose |T_I|) = log (n choose |T_I|). To unambiguously decode, the recipient needs to know each |T_I|, which we can optimally encode via the parametric complexities of the binomials, defined as

    L_pc(n) = log Σ_{k=0}^{n} [ n! / ((n−k)! k!) ] (k/n)^k ((n−k)/n)^{n−k} ,

which can be computed in linear time (Kontkanen & Myllymäki, 2007). We thus have

    L(D, M_ind) = Σ_{I ∈ I} [ log (n choose |T_I|) + L_pc(n) ] .

M_ind serves as our baseline model, and its singleton rules are a required part of any more complex model, as they ensure we can always send any data over I.

Non-trivial models   A non-trivial model M contains rules of the form X → Y, X, Y ⊂ I, that are not part of M_ind. The idea is that we first transmit the data for where these non-trivial rules hold, and then send the remaining data using M_ind. To determine where such a rule applies, the receiver needs to know where X holds, and hence the data over X needs to be transmitted first. To ensure that we can decode the data, we only consider models M for which the directed graph G = (I, E) is acyclic, where there exists an edge between two items i_1, i_2 iff they occur in the head and tail of a rule, that is, E = {(i_1, i_2) | ∃ X → Y ∈ M. i_1 ∈ X ∧ i_2 ∈ Y}. We thus get a codelength

    L(D | M ∪ M_ind) = Σ_{X→Y ∈ M} log (|T_X| choose |T_{Y|X}|) + Σ_{∅→I ∈ M_ind} log (n choose |T′_I|) ,

where T′_I = {t ∈ D | (I ∈ t) ∧ (∀ X → Y ∈ M. I ∈ Y ⟹ t ∉ T_{Y|X})} is a modified transaction set containing the transactions with item I that are not covered by any non-trivial rule.

In addition to the parametric complexities of the binomial codes, the model cost of a non-trivial model also includes the cost of transmitting the non-trivial rules. To transmit a rule X → Y, we first send the cardinalities of X resp. Y using the universal code for integers L_N (Rissanen, 1983). For z ≥ 1, this is defined as L_N(z) = log* z + log c_0 with log*(z) = log z + log log z + ..., summing only over the positive components (Rissanen, 1983). To satisfy the Kraft inequality up to equality we set c_0 = 2.865064.
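
For concreteness, the two ingredients of this encoding – the log-binomial data cost and the parametric complexity L_pc(n) – can be computed directly. The sketch below is our own illustration of the formulas above (all logarithms base 2, as usual for code lengths), not the authors' implementation.

    # Sketch of the baseline encoding L(D, M_ind): for each item I we pay the
    # log-binomial data cost plus the parametric complexity of the binomial.
    # Illustration of the formulas in the text, not the reference implementation.
    import numpy as np
    from scipy.special import gammaln

    LOG2E = np.log2(np.e)

    def log2_binom(n, k):
        """log2 of the binomial coefficient (n choose k)."""
        return (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)) * LOG2E

    def parametric_complexity(n):
        """L_pc(n) = log2 sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k); the regret
        term of the binomial (cf. Kontkanen & Myllymaki, 2007)."""
        k = np.arange(n + 1)
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = (log2_binom(n, k)
                     + np.where(k > 0, k * np.log2(k / n), 0.0)
                     + np.where(n - k > 0, (n - k) * np.log2((n - k) / n), 0.0))
        return np.log2(np.sum(np.exp2(terms - terms.max()))) + terms.max()

    def baseline_cost(D):
        """L(D, M_ind) for a binary matrix D of shape (n, |I|)."""
        n = D.shape[0]
        support = D.sum(axis=0)                   # |T_I| per item
        return float(np.sum(log2_binom(n, support) + parametric_complexity(n)))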

Knowing the cardinalities, we can then send the items of X resp. Y one by one using an optimal prefix code, given by L(X) = − Σ_{x ∈ X} log ( |T_x| / Σ_{I ∈ I} |T_I| ). For a particular rule X → Y ∈ M, the model costs for a rule, respectively the full model, thus amount to

    L(X → Y) = L_N(|X|) + L_N(|Y|) + L(X) + L(Y) + L_pc(|T_X|) ,
    L(M ∪ M_ind) = |I| × L_pc(n) + L_N(|M|) + Σ_{X→Y ∈ M} L(X → Y) .

We provide an example calculation in Supp. A.1. With these definitions, we have an MDL score that identifies the best rule set M* for data D as

    M* = arg min_{M ∈ M} [ L(M ∪ M_ind) + L(D | M ∪ M_ind) ] ,

where M contains all possible rule sets over the items in D.

Robust Rules   In real-world applications, we need a score that is robust against noise. The key problem with noisy data is that a single missing item in a transaction can cause a whole rule not to hold or apply. To discover rules that generalize well, we need to explicitly account for noise. The idea is to let rules apply, and hold, also when some items of the head respectively tail are missing. By specifying how many items l, and k, out of all items in the rule head, respectively tail, need to be part of a transaction, we relax the original rule definition to account for missing items, or in other words, noise.

Furthermore, as output neurons – the classes – are only active mutually exclusively, rules need to be able to model disjunctions. Setting l = 1 and k = 1 means that only one of the items of the head respectively tail needs to be present, which corresponds to a disjunction of items in the head and tail of the rule X → Y and thus allows us to model output neurons correctly, while l = |X| and k = |Y| correspond to the original stringent rule definition. Varying between the two extremes accounts for varying levels of noise. The optimal l and k are those that minimize the MDL score.

To ensure a lossless encoding, we need to make sure that the receiver can reconstruct the original data. Thus, for the previously introduced relaxed definition of when rules hold and apply, we send for each rule the corresponding number of items l that need to be present for it to apply, using L_N(l) bits. Knowing each l, the receiver can reconstruct where each rule applies. Sending where a rule holds now leaves the receiver with an approximation of the data. To be able to reconstruct the actual data, Fischer & Vreeken (2019) introduced error matrices that, when XORed with the approximation, yield the original data. These two matrices, X⁺_{X→Y} and X⁻_{X→Y}, correct for the errors made in the part where the rule applies and holds, respectively applies but does not hold. These error matrices are part of the model M and have to be transmitted with an adapted L(D, M). We provide examples and a short review of how to adapt the codelength function in Supp. A.

Complexity of the search   To discover rules over the activations of layers I_i, I_j, we have to explore all rules formed by subsets of neurons in I_i for the head, combined with any subset of neurons of I_j for the tail. There exist 2^|I_i| × 2^|I_j| such rules, and hence 2^(2^(|I_i|+|I_j|)) distinct models would need to be explored. Fischer & Vreeken (2019) showed that the rule set search space does not lend itself to efficient search, as it is neither monotone nor submodular, with the counterexamples also holding for our model definition. In fact, for robust rules we additionally have to consider where rules should apply respectively hold – optimizing k and l – which results in approximately 2^(|I_i|×|I_j|×2^(|I_i|+|I_j|)) models (details in Supp. A.4). Exhaustive search is therewith infeasible, which is why we present ExplaiNN, a heuristic algorithm to efficiently discover good sets of rules.

2.3. Discovering good rule sets with ExplaiNN

ExplaiNN is based on the idea of iteratively refining the current model by merging and refining already selected rules. The key insight of the algorithm is that for a rule X → Y to summarize the data well, rules X → Y′ with only part of the tail, Y′ ⊂ Y, should also summarize it well, as all tail items should be similarly co-occurring with head X. Starting from the baseline model M_ind, we iteratively and greedily search for better models until we can no longer improve the MDL score. As search steps, we either introduce a new rule to M, by taking a good set of items X ⊂ I_i for the head and a single item A ∈ I_j for the tail and refining the model to M′ = M ⊕ {X → A}, checking whether it decreases the overall MDL costs (Eq. 2.2); or we merge two existing rules r_1 = X → Y_1 ∈ M and r_2 = X → Y_2 ∈ M to form a new rule r′ = X → Y_1 ∪ Y_2, and refine the model to M′ = M ⊕ {r′} = (M \ {r_1, r_2}) ∪ {r′}. For a rule r′, the refinement operator ⊕ adds the rule r′ = X → Y to M, and removes the merged rules that led to r′, if any. Moreover, it updates the singleton transaction lists T_A for all items A ∈ Y, removing all transactions where r′ holds.

To permit scaling up to the size of a typical neural net, we next discuss how to efficiently search for candidate rules with heads that can express anything from conjunctions to disjunctions. Immediately after, we present the full algorithm, ExplaiNN, for mining high-quality rule sets for two arbitrary sets of neurons (e.g. layers) of a network.
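
The relaxed apply/hold semantics introduced above are easy to state in code. The following sketch (ours, for illustration only) computes, for given noise thresholds on head and tail, the transactions where a rule robustly applies and where it additionally holds, mirroring the sets T_X and T_{Y|X}.

    # Sketch of the relaxed rule semantics: a rule applies to a transaction if
    # at least l of its head items are present, and holds if additionally at
    # least k of its tail items are present. Illustration, not the authors' code.
    import numpy as np

    def robust_apply_hold(D, head, tail, l, k):
        """D: binary matrix (n transactions x items); head, tail: item indices.
        Returns boolean masks for T_X (applies) and T_{Y|X} (holds)."""
        head_hits = D[:, head].sum(axis=1)
        tail_hits = D[:, tail].sum(axis=1)
        applies = head_hits >= l        # l = 1: disjunction, l = |X|: conjunction
        holds = applies & (tail_hits >= k)
        return applies, holds

    # Example: with l = 1 the head acts as a disjunction over mutually exclusive
    # class labels; with l = len(head) we recover the strict conjunctive rule.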

Searching for candidates   A key component of ExplaiNN is the candidate generation process, which implements the two possible steps of generating new and merging existing rules. Given two layers I_i, I_j, to efficiently discover rules that are both robust to noise and may include disjunctively active neurons in the head, we cannot enumerate all possible rule heads for each individual tail neuron, as this would result in |I_j| × 2^|I_i| many rules. Instead, we keep a list H_y for each item y ∈ I_j, storing all head neurons x ∈ I_i for which y is frequently active when x is active, that is, σ_{x,y} = |T_x ∩ T_y| / |T_x| > θ, where θ is a confidence threshold. We consider a rule X → Y to be good if, when neurons X are active, the neurons Y are also likely to be active, which is directly represented by the confidence θ. With the parameter µ we account for early decisions on rule merging that would later prevent us from seeing a more general trend. The lists are sorted in decreasing order of σ. We search in each H_y for the rule with the highest gain over all unions of the first t = 1 . . . |H_y| neurons in the list, and add that rule X → y with the highest gain to the candidate list. To compute the gain, we consider all possible values k = 1 . . . |X| to determine for which transactions T_X^k = {t ∈ D | |X ∩ t| ≥ k} the rule should robustly apply, where k = 1 corresponds to a disjunction and k = |X| to a conjunction of neurons.

For an individual neuron y, such a rule would be optimal, but our goal is to discover groups of neurons that act in concert. To this end we iteratively merge rules with similar heads – similar, rather than the same, as this gives robustness both against noise in the data and against earlier merging decisions of the algorithm. For two rules X_1 → Y_1, X_2 → Y_2 with symmetric difference X_1 ∆ X_2 = (X_1 \ X_2) ∪ (X_2 \ X_1), we consider the possible candidate rules X_1 ∪ X_2 → Y_1 ∪ Y_2 and X_1 ∩ X_2 → Y_1 ∪ Y_2, iff |X_1 ∆ X_2| ≤ µ for some threshold µ ∈ N. For example, µ = 1 corresponds to the case that one head has one label more than the other, while all other labels are the same.

Both parameters θ and µ are simple, yet effective runtime optimizations. The best results with respect to MDL will always be obtained with the largest search space, i.e. with θ set to 0 and µ set to |X_1| + |X_2|. Besides impacting runtime, many of those rules may be uninteresting from a user perspective; µ and θ allow us to directly instruct ExplaiNN to ignore such rules.

ExplaiNN   Assembling the above, we have ExplaiNN, which, given two sets of neurons I_i, I_j and a database of activations of these neurons, yields a heuristic approximation to the MDL-optimal model M*. After first introducing all relevant single-neuron rules, it proceeds by iteratively merging existing rules using the approach described above, until it can achieve no more gain. For efficiency, we separate the generation of new rules from the merging of existing rules. In practice, this does not harm performance, as we allow merging of similar heads and can thus revert too-greedy decisions introduced earlier. Furthermore, by observing that independent rules X_1 → Y_1, X_2 → Y_2 with Y_1 ∩ Y_2 = ∅ do not influence each other's impact on the codelength, we can add all independent rules with the highest respective gain at once. We provide pseudocode for candidate generation and the ExplaiNN algorithm in Supp. A.5.

Complexity of ExplaiNN   The generation of new rules takes time O(n × |I_j| × |I_i|^3), by iterating over each neuron in I_j, considering each subset of the most overlapping neurons in I_i, and considering each threshold k = 1 . . . |I_i| for when the rule should apply, with the factor n stemming from intersecting transaction lists T to compute the gain. We can have at most |I_j| generated rules before considering rule merges, and in every iteration of merging we combine two rules, reducing the rule set size by 1. In each such step, we consider |I_j|^2 merges, for each of which we compute the gain considering noisy head and tail. We thus have a worst-case runtime of O(n × |I_j|^4 × |I_i|). As MDL ensures we consider models that tend to be succinct and hence capture only relevant structure in the data, ExplaiNN is in practice much faster and easily scales to several thousands of neurons.
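
To make the candidate generation step described above concrete, the sketch below grows the head X for a single tail neuron y from the confidence-sorted list H_y and picks the prefix and noise threshold k with the best gain. It is our own simplification: the gain used here ignores parametric complexities, rule model costs, and error matrices, so it only approximates the MDL score optimized by the actual algorithm (Supp. A.5).

    # Simplified sketch of ExplaiNN's candidate generation for one tail neuron.
    import numpy as np
    from scipy.special import gammaln

    def log2_binom(n, k):
        return (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)) / np.log(2)

    def best_rule_for_tail(D, head_items, y, theta=0.2):
        n = D.shape[0]
        T_y = D[:, y].astype(bool)
        # H_y: head neurons x with confidence sigma_{x,y} = |T_x ∩ T_y|/|T_x| > theta
        conf = {}
        for x in head_items:
            T_x = D[:, x].astype(bool)
            if T_x.sum() > 0:
                conf[x] = (T_x & T_y).sum() / T_x.sum()
        H_y = sorted((x for x in conf if conf[x] > theta), key=conf.get, reverse=True)

        best = (0.0, None, None)                  # (gain, head, k)
        for t in range(1, len(H_y) + 1):
            X = H_y[:t]
            head_hits = D[:, X].sum(axis=1)
            for k in range(1, t + 1):             # k = 1: disjunction, k = t: conjunction
                applies = head_hits >= k          # T_X^k
                holds = applies & T_y             # T_{y|X}
                baseline = log2_binom(n, T_y.sum())
                with_rule = (log2_binom(applies.sum(), holds.sum())
                             + log2_binom(n, (T_y & ~holds).sum()))
                gain = baseline - with_rule       # bits saved (simplified)
                if gain > best[0]:
                    best = (gain, tuple(X), k)
        return best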

3. Experiments

In this section we empirically evaluate ExplaiNN on synthetic data with known ground truth, and on real-world data to explore how CNNs perceive the world. Other approaches to discover patterns, based on e.g. frequency measures or statistical testing, have already been shown to yield millions or billions of rules or patterns, most spurious and redundant, and many more than anyone would be willing to investigate, see e.g. Fischer & Vreeken (2019); we hence focus on evaluating our method for the task of finding activation patterns. Here, we look at CNNs, as they are among the most widely used feedforward networks and naturally lend themselves to visualization, which helps us to interpret the discovered rules. We compare to traditional prototyping and activation map approaches on MNIST (LeCun & Cortes, 2010), and examine which information is used, and how, to arrive at a classification for ImageNet (Russakovsky et al., 2015). Finally, we investigate the effect of fine-tuning in transfer learning on the Oxford Flower data (Nilsback & Zisserman, 2008). The implementation of ExplaiNN is publicly available.1 For the experiments below, running on commodity hardware, ExplaiNN took minutes for MNIST and Flower, and up to 6 hours for ImageNet – yielding from a few hundred up to 3000 rules, for the smaller, respectively larger, networks, and earlier, respectively later, layers.

1 http://eda.mmci.uni-saarland.de/explainn/

Figure 2: Evaluation of rule quality. Top: Performance of ExplaiNN as precision and recall on data with varying numbers of planted rules with mutually exclusive head items (left) and co-occurring head items with varying noise (right). 10% noise corresponds to more noise than signal in the data. We provide the average (bar) and distribution (boxplot) across 10 repetitions. Bottom: Accuracy per class of VGG-S before (yellow) and after (blue) intervention on the weights connecting neurons to a class as given by a rule, and the 90% quantile of accuracies obtained for randomized interventions (red).

3.1. Recovering ground truth

To evaluate how well ExplaiNN can recover the ground truth from data, we first generate synthetic binary data sets of 10000 samples and introduce {10, 50, 100, 200, 300, 500} rules with up to 5 items in head and tail, respectively. For each rule, the frequency is drawn from U(.02, .08) and the confidence is drawn from U(.5, 1). We introduce noise by flipping 0.1% of the entries chosen uniformly at random, and add 5 noise features with frequencies equal to those of the rules. Fischer & Vreeken (2019) showed that a similar MDL model works for conjunctive rules; hence we focus on investigating performance for mutually exclusive rule heads and noise. In the first set of experiments, we set head items to be mutually exclusive, in line with finding rules over the NN output labels. ExplaiNN achieves high recall and precision (see Figure 2) in terms of retrieving exact ground truth rules, and does not retrieve any redundant rules. Next, we investigate the impact of noise on the performance, generating data of 10000 samples and 100 rules similar to the above, with head items now set to be co-occurring, and varying the level of noise in {0.1%, 0.5%, 1%, 3%, 5%, 10%} bit flips in the matrix, where 10% noise means more noise than actual signal. ExplaiNN is robust to noise, even when facing almost the same amount of noise and signal (see Fig. 2).
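
The synthetic setup above can be sketched as follows. The parameter choices follow the description in the text; the sampling details are our own simplification, and the exact generator is described in the paper's supplement.

    # Sketch of the synthetic-data setup: plant rules with given frequency and
    # confidence in a binary matrix, then flip a fraction of entries as noise.
    import numpy as np

    def plant_rules(n_samples=10000, n_rules=100, max_items=5,
                    noise=0.001, n_noise_features=5, seed=0):
        rng = np.random.default_rng(seed)
        columns, rules = [], []
        for _ in range(n_rules):
            n_head = rng.integers(1, max_items + 1)
            n_tail = rng.integers(1, max_items + 1)
            freq = rng.uniform(0.02, 0.08)        # rule frequency ~ U(.02, .08)
            conf = rng.uniform(0.5, 1.0)          # rule confidence ~ U(.5, 1)
            applies = rng.random(n_samples) < freq
            holds = applies & (rng.random(n_samples) < conf)
            head = np.repeat(applies[:, None], n_head, axis=1)
            tail = np.repeat(holds[:, None], n_tail, axis=1)
            columns.extend([head, tail])
            rules.append((n_head, n_tail, freq, conf))
        # additional noise features with frequencies matching the planted rules
        for _ in range(n_noise_features):
            freq = rng.uniform(0.02, 0.08)
            columns.append(rng.random((n_samples, 1)) < freq)
        D = np.concatenate(columns, axis=1).astype(np.uint8)
        flips = (rng.random(D.shape) < noise).astype(np.uint8)   # destructive bit flips
        return D ^ flips, rules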

3.2. How neural networks perceive the world

How information is filtered   We first consider the MNIST data of handwritten digits. We train a simple CNN of two convolutional and one fully connected layer using Keras, achieving 99% classification accuracy on test data (see Supp. B.1 for details). We are interested in what the individual filters learn about the digits, and how ExplaiNN reveals shared features across several classes. We compare to average activation maps and single-neuron prototypes. Whereas the average activation maps per class do not reveal the purpose of a filter, we see that the rules learned by ExplaiNN clearly identify which pixels together trigger a filter. For example, for filter 2 in layer 1 the prototype looks like a maze and does not reveal any insight, and average activation maps just show the number given by the class, whereas the discovered rules identify shared structure, such as curvatures shared between digits. For filter 36 in layer 2, the discovered rules show that it detects horizontal edges in a class-specific manner, whereas prototyping and activation maps again fail to reveal this information. Interestingly, the discovered rules indicate that certain filters learn a negative, with activated areas corresponding to the imprint of the digit. We provide images visualizing rules, prototypes, and average activations in Supp. B.1.

How information flows   To understand the inner life of neural networks in a more complex setting, we examine the activations for the ImageNet data set of the pretrained VGG-S and GoogLeNet architectures (Chatfield et al., 2014; Szegedy et al., 2015). We focus on analyzing the VGG-S results, for which an optimized and highly interpretable prototyping method to visualize multiple neurons exists (Øygard, 2016), and provide results for GoogLeNet in Supp. B.2.1. Here, we focus on particular rules, and provide a larger and more diverse set of results in Supp. Fig. 16, 17. We see that rule-derived prototypes generally show highly interpretable features for the corresponding classes. Mining for rules from the output to the last layer, ExplaiNN yields rules with individual heads spanning multiple labels and tails spanning multiple neurons, which together encode the information shared between labels. Examples include the faces of certain dog breeds, for which, if we visualize the corresponding neurons individually (Fig. 3), it is hard to extract anything meaningful from the images: the information is really encoded in the set of neurons that act together.

Figure 3: Characteristic faces. From the data for all dog breed categories, ExplaiNN discovered the rule between the labels {Japanese spaniel, Pekinese, Shih-Tzu, Lhasa, Affenpinscher, Pug, Brabancon griffon} and 5 units from the FC7 layer, for which a prototype is given in the top image (a: visualization for the whole tail). The units together capture the characteristic face of these breeds, whereas the individual units (bottom; b: visualization for the units in the tail individually) give only little insight into the encapsulated information.

We also observe cases where rules describe how the network discriminates between similar classes. We give an example in Fig. 4 for the neurons ExplaiNN discovers to be associated with just huskies, just malamutes, and both of these classes together. These dog breeds are visually similar, sharing a black–white fur pattern, as well as head and ear shapes. These traits are reflected by the neurons corresponding to the rule for both breeds. Looking closer, we can see that distinct traits, the more pointy ears of the husky, respectively the fluffy fur of the malamute, are picked up by the neurons discovered for the individual classes. Besides discovering what shared and distinct traits the network has learned for classes, we also find out when it learns differences across samples of the same class. As one example, for the dog breed Great Dane, we discover three rules that upon visualization each correspond to visually very different sub-breeds, whereas a simple class prototype does not reveal any such information (Supp. Fig. 15).

Next, we investigate the information flow within the network by iteratively finding rules between adjacent layers, starting with rules X → Y from the output layer to the last fully connected layer, FC7. Based on this set of rules, we then apply ExplaiNN to discover rules Y → Z between FC7 and FC6, where the heads Y are groups of neurons found as tails in the previous iteration. We recursively apply this process until we arrive at a convolutional layer. This gives us traces of neuronal activity by chaining the rules X → Y → Z → · · · discovered in the iterative runs. We visualize two such traces in Fig. 5, which give insight into how the network perceives different classes, passing on information from layer to layer. One example of a discovered trace is for the class totem pole (Fig. 5a). We observe that the sets of neurons discovered for FC7 and FC6 each yield prototypes that clearly resemble the animalistic ornaments of such totem poles, which can also be found in the training data. Interestingly, we see that the neuron sets found for different filters of the last convolutional layer, CONV5, together detect parts of the object, including the vertical pole and the rooftop-like structure, decorated with animalistic shapes with eyes, that is typically found at the top of a totem. These filters act in a highly specific manner, detecting only specific parts of the image, such as thinner or wider vertical structures in the center, or objects at the top center of the image.

We also find signs of overfitting, e.g. when considering the information trace for a set of dog breeds (Fig. 5b). Note that due to space, we here only show a subset of the discovered rules. We observe that the prototypes for FC7 and FC6 both show side views of animals. The network seems to learn features that are specific to side photos of dogs, which are prevalent in the training data, as also indicated by the filter prototypes. For the filters, we see that the network acts on very specific parts of the image, detecting structures at the bottom that resemble paws and pairs of front and hind legs, and at the top of the image, structures that resemble dog faces and clouds. We also find more abstract features, with groups of filters detecting horizontal edges, reminiscent of the back of a dog in side view. While there is room for interpretation of prototypes, the discovered traces provide evidence on how the network perceives the world.
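
The iterative layer-by-layer tracing described above can be sketched as follows. Here `mine_rules(D, heads, tails)` stands in for a single ExplaiNN run between two neuron sets; it is a hypothetical placeholder for illustration, not the released API.

    # Sketch of tracing information flow by chaining rules across layers.
    import numpy as np

    def trace_information_flow(layer_activations, mine_rules):
        """layer_activations: list of binarized activation matrices, ordered from
        the output layer down to a convolutional layer (e.g. labels, FC7, FC6, CONV5).
        mine_rules(D, heads, tails) -> list of (head, tail) rules over column
        indices of D; a hypothetical stand-in for one ExplaiNN run."""
        heads = list(range(layer_activations[0].shape[1]))   # start from the labels
        cascades = []
        for upper, lower in zip(layer_activations, layer_activations[1:]):
            offset = upper.shape[1]
            D = np.concatenate([upper, lower], axis=1)
            tails = list(range(offset, D.shape[1]))
            rules = mine_rules(D, heads, tails)
            cascades.append(rules)
            # neurons found as tails in this iteration become heads in the next,
            # re-indexed into the coordinates of the lower layer
            heads = sorted({i - offset for _, tail in rules for i in tail})
        return cascades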

Figure 4: Neurons discriminating Huskies and Malamutes. (a) Top: Malamute, bottom: Siberian husky – the two breeds are very similar-looking. (b) Prototypes for rules X → Y discovered for classes X, Siberian husky (red frame), Malamute (yellow frame), resp. both (orange frame), and neurons Y in FC7; shown left to right: Siberian husky AND NOT Malamute, Siberian husky AND Malamute, Malamute AND NOT Siberian husky. The neurons associated with both classes represent typical features shared between the two classes; those associated only with Siberian huskies show their slightly sharper, more defined head, while those associated only with Malamutes capture their more fluffy fur.

Rules carry class information   To quantitatively assess the rules that ExplaiNN discovers, we here consider the VGG-S network for ImageNet and intervene on those neurons in the last fully connected layer that ExplaiNN finds to be class-associated. For each class c, we set to 0 the incoming weights from all neurons y for which we have discovered a rule X → Y with c ∈ X and y ∈ Y, comparing the classification rate before and after the intervention. As a baseline, we additionally intervene on an equally sized random subset of all weights leading to class c, again measuring the classification rate after intervention. We see that for all classes, performance drops much more strongly for the actual interventions than for the random ones, in most cases even to 0 (see Fig. 2, bottom). This gives evidence that the discovered rules capture information necessary for classification. We further observe that under intervention the model often predicts closely related classes, e.g. Fire Salamander to Spotted Salamander, Barbell to Dumbbell, or Palace to Monastery, which gives insight into the similarity of classes, the robustness of predictions, and therewith the sensitivity to adversarial attacks.

The effect of fine-tuning   Finally, we show that ExplaiNN provides insight into the effect of fine-tuning in transfer learning. For this we consider the Oxford Flower data (Nilsback & Zisserman, 2008), which consists of 8k images of flowers of 102 classes. For investigation, we consider both the vanilla VGG-S network trained on ImageNet from above, and a fine-tuned version from the Caffe model zoo.2 We run ExplaiNN to obtain rules between the output and the final layer of both networks. We provide examples in Supp. Fig. 18. The visualizations show, as expected, a strong emphasis on the colour and shape of the corresponding flower. Interestingly, the visualizations of the same neurons for the original VGG-S show almost identical shapes and patterns, but with less intense colour, and in both we observe prototypes with animal-like features such as eyes or beaks.

2 https://github.com/jimgoo/caffe-oxford102
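
The intervention on class-associated neurons described above can be sketched as follows, assuming a Keras-style model whose last dense layer exposes its kernel; the layer name "fc8" and the (head, tail) rule format are placeholders, not the actual setup used by the authors.

    # Sketch of the intervention experiment: for class c, zero the weights coming
    # into its output unit from the neurons y that appear in tails of rules with
    # c in the head, then compare per-class accuracy against a random baseline.
    import numpy as np

    def intervene_on_class(model, rules, c, layer_name="fc8", rng=None):
        layer = model.get_layer(layer_name)
        kernel, bias = layer.get_weights()        # kernel: (n_prev_neurons, n_classes)
        tail_neurons = sorted({y for head, tail in rules if c in head for y in tail})
        ablated = kernel.copy()
        ablated[tail_neurons, c] = 0.0            # actual intervention
        rng = rng or np.random.default_rng(0)
        baseline_neurons = rng.choice(kernel.shape[0], size=len(tail_neurons),
                                      replace=False)
        random_kernel = kernel.copy()
        random_kernel[baseline_neurons, c] = 0.0  # equally sized random intervention
        return ablated, random_kernel, bias

    # Per-class accuracy is then compared before and after swapping in the weights:
    # layer.set_weights([ablated, bias]); evaluate; layer.set_weights([kernel, bias]).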

Figure 5: Information flow. Example rule cascades discovered for ImageNet; each panel shows, from left to right, the labels and the FC7, FC6, and CONV5 layers. For each rule X → Y, the group of neurons of tail Y is used to generate a prototype (images in colored frames). To discover these rule cascades, we first mine rules between the output and FC7. We use the tails of these rules (neurons of FC7) as heads to mine rules to the next layer (FC6). Finally, we use the tails of those rules to mine rules between FC6 and CONV5. (a) Information flow for the class Totem pole. (b) Part of an information flow for the dog classes {Black-and-tan coonhound, English foxhound, Borzoi, Ibizan hound, Saluki, Scottish deerhound, Curly-coated retriever, Entlebucher, Mexican hairless}.

4. Discussion and Conclusion

The experiments show that ExplaiNN is able to discover distinct groups of neurons that together capture traits shared and distinct between classes, within-class heterogeneity, and how filters are used to detect shared features, segment background, or detect edges locally. Neither of these are revealed by activation maps, which miss the local information that patterns provide, nor by saliency maps, which investigate network attention for an individual image alone. Prototyping is a great tool for visualizing neuron information content, but by itself it is limited by the massive number of possible combinations of neurons, requiring thousands of hours to painstakingly handpick and connect the information of just individual neurons (Olah et al., 2020). Combining ExplaiNN with prototyping permits exploring networks beyond single neurons, by automatically discovering which neurons act in concert, which information they encode, and how information flows through the network.

In particular, we discover distinct groups of neurons in fully connected layers that capture shared respectively distinct traits across classes, which helps in understanding how the network learns generality but can still discriminate between classes. Due to the local information that our rules provide, we can also detect differences in perception across samples of a single class, where for example different groups of neurons describe visually different sub-breeds of a class of dogs. By connecting rules that we find across several layers, we trace how information is gathered and combined to arrive at a classification, from filters that detect typical class-specific features in the image, through fully connected layers where multiple neurons together encode the combined information, up to the final classification output. Applying ExplaiNN to investigate the impact of fine-tuning in transfer learning, we found that, surprisingly, the information contained in groups of neurons of the given fine-tuned CNN is almost identical to that of the original CNN, yet captures the traits of the new classes almost perfectly. For the given task, fine-tuning thus mostly resulted in routing information differently, rather than in learning to detect new features.

Overall, ExplaiNN performs well and finds surprising results that help to understand how CNNs perceive the world. While many important tasks are solved by such networks, attention-based architectures play an important role in e.g. language processing. Although rules can likely also help to understand what these models learn, these networks encode an entirely different type of information that is inherently hard to understand and visualize, and hence pose an exciting challenge for future work. Here, our main interest was characterizing information flow through neural networks, and hence we focused on subsequent layers. ExplaiNN, however, operates on arbitrary sets of neurons, and thus naturally allows investigating e.g. residual networks, where the previous two layers contribute information. Currently scaling to thousands of neurons, it will make for engaging future work to scale to entire networks at once.

Acknowledgements

Anna Olah and Jonas Fischer are supported by scholarships from the International Max Planck Research School for Computer Science (IMPRS-CS).

References

Adadi, A. and Berrada, M. Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6:52138–52160, 2018.

Agrawal, R. and Srikant, R. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, pp. 487–499, 1994.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

Barakat, N. and Diederich, J. Eclectic rule-extraction from support vector machines. International Journal of Computational Intelligence, 2(1):59–62, 2005.

A. MDL for robust rules

In this section, we give extended examples of how to compute the MDL score for a given database and set of rules, elaborate on the error encoding for the rule tails, and give a visual toy example of the impact of the extended pattern language for the rule head.

A.1. Computing MDL for rules

For the example in Fig. 6, we now compute the codelength $L(D, M) = L(M) + L(D \mid M)$ of transmitting the whole database $D$ using $M \cup M_{ind}$. Here, we stick with the simple encoding without error matrices, to make the computation easier to follow. For reference, we first compute the baseline model, which is given by
\[
\begin{aligned}
L(D, M_{ind}) &= |\mathcal{I}| \times L_{pc}(|D|) + \sum_{I \in \mathcal{I}} \log \binom{|D|}{|T_I|} \\
&= 4 \times L_{pc}(100) + \log\binom{100}{40} + 2\log\binom{100}{35} + \log\binom{100}{33} \\
&\approx 14.88 + 93.47 + 179.64 + 87.93 = 375.92\,.
\end{aligned}
\]
Thus, sending the data with just the baseline model costs 375.92 bits. Next, we compute $L(D, M \cup M_{ind})$, starting with the cost of sending the data, $L(D \mid M \cup M_{ind})$:
\[
\begin{aligned}
L(D \mid M \cup M_{ind}) &= \sum_{X \to Y \in M} \log \binom{|T_X|}{|T_{Y|X}|} + \sum_{I \in \mathcal{I}} \log \binom{|D|}{|T'_I|} \\
&= \log\binom{40}{30} + \log\binom{100}{40} + \log\binom{100}{5} + \log\binom{100}{3} + \log\binom{100}{35} \\
&\approx 29.66 + 93.47 + 26.17 + 17.30 + 89.82 = 256.42\,.
\end{aligned}
\]
The model costs are composed of the parametric complexities for the (adapted) baseline rules, plus the cost of transmitting what the rule is composed of, along with its parametric complexity. We thus get
\[
\begin{aligned}
L(M \cup M_{ind}) &= |\mathcal{I}| \times L_{pc}(|D|) + \sum_{X \to Y \in M} \Big( L_{\mathbb{N}}(|X|) + L_{\mathbb{N}}(|Y|) + L(X) + L(Y) + L_{pc}(|T_X|) \Big) \\
&= 4 \times L_{pc}(100) + L_{\mathbb{N}}(1) + L_{\mathbb{N}}(2) - \log\tfrac{40}{143} - \log\tfrac{35}{143} - \log\tfrac{33}{143} + L_{pc}(40) \\
&\approx 14.88 + 1.52 + 2.52 + 1.84 + 2.03 + 2.12 + 3.11 = 28.02\,.
\end{aligned}
\]
Hence, the model with the complex rule has a smaller codelength than the baseline, with $L(D, M \cup M_{ind}) = 256.42 + 28.02 = 284.44$ bits.
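The binomial data-cost terms in the worked example above are easy to verify numerically. The snippet below is only a sanity check under the numbers stated in the text; the parametric complexities $L_{pc}$ are taken from the text rather than recomputed.

```python
# Sanity check of the binomial data-cost terms of the worked example; the
# parametric complexities L_pc (here 4 x L_pc(100) ≈ 14.88) are taken from the text.
from math import comb, log2

def log_binom(n, k):
    return log2(comb(n, k))

# baseline: each item is transmitted independently via its margin
baseline_data = log_binom(100, 40) + 2 * log_binom(100, 35) + log_binom(100, 33)
print(baseline_data + 14.88)   # ≈ 375.9 bits, matching L(D, M_ind)

# with the rule A -> BC: code where the rule holds, plus adapted singleton codes
rule_data = (log_binom(40, 30) + log_binom(100, 40) + log_binom(100, 5)
             + log_binom(100, 3) + log_binom(100, 35))
print(rule_data)               # ≈ 256.4 bits, matching L(D | M ∪ M_ind)
```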
A.2. The error encoding for tails

The error encoding for tails allows us to discover rules in noisy settings (compare Fig. 7a,b): we send where a rule $X \to Y$ approximately holds according to a parameter $k$, which defines the number of tail items that have to be present in a transaction. The errors made by this approximation are then accounted for by sending error correcting matrices $\mathcal{X}^-_{X \to Y}$ and $\mathcal{X}^+_{X \to Y}$, which account for the destructive, respectively additive, noise in the area where the rule applies (compare Fig. 7c).

Let us first assume we are given $k$; we later show how to optimize for $k$. We redefine the transaction sets as $T_{Y|X} = \{\, t \in D \mid (X \subset t) \wedge (|Y \cap t| \geq k) \,\}$, which corresponds to the transactions where the rule approximately holds. We now slightly abuse notation and denote by $D$ the binary input matrix that corresponds to $D$, and we subset this matrix using transaction-id lists and item subsets; both are sets of indices that indicate which rows, respectively columns, of the matrix to use. For example, the submatrix where $X$ holds is given by $D[T_X, X]$. We can now define the error correcting matrices as $\mathcal{X}^-_{X \to Y} = D[T_{Y|X}, Y] \oplus \mathbf{1}^{|T_{Y|X}| \times |Y|}$ and $\mathcal{X}^+_{X \to Y} = D[T_X \setminus T_{Y|X}, Y]$, where $\oplus$ is the element-wise XOR operator and $\mathbf{1}^{i \times j}$ is a matrix of size $i \times j$ filled with ones. The receiver, knowing $T_X$ and $T_{Y|X}$, can then reconstruct the original data as $D[T_{Y|X}, Y] = \mathbf{1}^{|T_{Y|X}| \times |Y|} \oplus \mathcal{X}^-_{X \to Y}$, respectively $D[T_X \setminus T_{Y|X}, Y] = \mathcal{X}^+_{X \to Y}$.
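To make the reconstruction argument concrete, the following is a small NumPy sketch of the encoding and decoding step; the helper names are chosen for illustration and do not correspond to the released implementation.

```python
# Sketch of the tail error encoding: build X^- and X^+ as defined above and
# show that the receiver can recover the tail columns exactly via XOR.
import numpy as np

def error_matrices(D, T_X, T_YgX, Y):
    """T_YgX: transactions where the rule approximately holds (|Y ∩ t| >= k)."""
    rest = [t for t in T_X if t not in set(T_YgX)]
    ones = np.ones((len(T_YgX), len(Y)), dtype=int)
    X_minus = D[np.ix_(T_YgX, Y)] ^ ones    # destructive noise where the rule holds
    X_plus = D[np.ix_(rest, Y)]             # additive noise where it applies but does not hold
    return X_minus, X_plus, rest

def reconstruct(shape, T_YgX, rest, Y, X_minus, X_plus):
    """Recover the tail columns in the area where the rule applies."""
    block = np.zeros(shape, dtype=int)
    block[np.ix_(T_YgX, Y)] = np.ones_like(X_minus) ^ X_minus
    block[np.ix_(rest, Y)] = X_plus
    return block
```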

Figure 6: Example database and model. A toy database D, with blocks indicating where the items A, B, C, D occur in D; margins and relevant joint counts are given on the right. A sensible rule set M ∪ Mind = A → BC ∪ Mind is given on the right; the part of the database where the rule applies, respectively holds, is indicated by the light, respectively dark, orange area.

While this explains how the error correcting matrices can be used to reconstruct the original input, and hence define a lossless encoding, we are mainly interested in the codelength functions. To adapt the data costs, we now additionally send the two error matrices, which we can do using binomial codes. Hence, we get
\[
\begin{aligned}
L(D \mid M) &= \sum_{X \to Y \in M} \log \binom{|T_X|}{|T_{Y|X}|} + \sum_{I \in \mathcal{I}} \log \binom{|D|}{|T'_I|} \\
&\quad + \log \binom{|T_{Y|X}| \times |Y|}{|\mathcal{X}^-_{X \to Y}|} + \log \binom{|T_X \setminus T_{Y|X}| \times |Y|}{|\mathcal{X}^+_{X \to Y}|}\,,
\end{aligned}
\]
with the second line providing the codelength of the error matrices, and $|\mathcal{X}|$ indicating the number of ones in $\mathcal{X}$.

Our model now not only consists of the rules $M \cup M_{ind}$, but also of the set of error correcting matrices. As the submatrix to which we need to apply each matrix is fully defined by $T_X$, $T_{Y|X}$, and $Y$ of the corresponding rule, which also determine its size, the only adaptation we need for the model costs is the parametric complexities induced by the codes for transmitting the data. This yields
\[
L(M) = |\mathcal{I}| \times L_{pc}(|D|) + \sum_{X \to Y \in M} \Big( L(X \to Y) + L_{pc}(|T_{Y|X}| \times |Y|) + L_{pc}(|T_X \setminus T_{Y|X}| \times |Y|) \Big)\,.
\]
This completes the MDL costs for rules robust to noise in the tail, for a given $k$. To optimize $k$, the crucial insight is that the codelengths of individual complex rules are independent, as is the data cost. That means we can optimize $k$ for each rule separately. Thus, for a given rule $X \to Y$ we can enumerate all $|Y|$ many models for the different thresholds $k$ and let MDL decide which one fits the data best.
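A minimal sketch of this per-rule search over $k$ is given below; it scores only the data-cost terms and leaves out the parametric complexities, and the function names are illustrative rather than the actual implementation.

```python
# For a rule X -> Y, enumerate the |Y| thresholds k and keep the one with the
# smallest data cost (binomial codes for where the rule holds plus both error matrices).
import numpy as np
from math import comb, log2

def log_binom(n, k):
    return log2(comb(n, k))

def best_k(D, T_X, Y):
    """D: binary transaction-by-item matrix, T_X: row indices where the head holds,
    Y: column indices of the tail; returns the threshold k with minimal data cost."""
    costs = {}
    for k in range(1, len(Y) + 1):
        holds = [t for t in T_X if D[t, Y].sum() >= k]     # T_{Y|X} for this threshold
        rest = [t for t in T_X if t not in set(holds)]
        n_minus = int((1 - D[np.ix_(holds, Y)]).sum())     # ones in X^- (destructive noise)
        n_plus = int(D[np.ix_(rest, Y)].sum())             # ones in X^+ (additive noise)
        costs[k] = (log_binom(len(T_X), len(holds))
                    + log_binom(len(holds) * len(Y), n_minus)
                    + log_binom(len(rest) * len(Y), n_plus))
    return min(costs, key=costs.get)
```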
A.3. The impact of the extended pattern language

Extending the pattern language for rule heads is crucial for tracing activation patterns through a neural network. First of all, we need to start from the labels, which are inherently mutually exclusive in their activation, as each sample carries only a single class label. To find shared features of labels, it is therefore essential to be able to express disjunctions in rule heads. Furthermore, the data as well as the activation patterns across the data are very noisy. Thus, determining where a rule applies based on conjunctions of features alone can give a heavily distorted picture of the data at hand, as visualized in Fig. 8. That is the reason to introduce a more flexible language, similar to the approximate rule tails, which solves these issues.
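As a toy illustration of the first point (a hypothetical example, not taken from our experiments): with one-hot class labels, a conjunctive head over two classes never applies, whereas a disjunctive head covers all samples of either class.

```python
# One-hot labels: a conjunctive head {cat AND dog} applies to no transaction,
# while the disjunctive head {cat OR dog} applies to every cat or dog image.
labels = [{"cat"}, {"dog"}, {"cat"}, {"plane"}]
conjunctive = [t for t in labels if {"cat", "dog"} <= t]   # -> 0 transactions
disjunctive = [t for t in labels if {"cat", "dog"} & t]    # -> 3 transactions
print(len(conjunctive), len(disjunctive))                  # prints: 0 3
```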

Figure 7: Example of tail error encoding. For the database D given in (a), where blocks indicate the occurrence of items, a good rule is given by A → BCDE. The part of the database where the rule applies is indicated by the orange area. In (b) we show the part of the transactions where the rule holds for a varying number k of tail items that have to be present in a transaction, from all items on the left (corresponding to a conjunction) to just a single item on the right (corresponding to a disjunction). In (c) we visualize the error encoding used to transmit the data for k = 3. We first transmit the data where the rule holds, resulting in the area indicated by the gray block. XORing the error matrix $\mathcal{X}^-$ with this block makes it possible to reconstruct the original data for the part where the rule holds. Using $\mathcal{X}^+$, we reconstruct the original data in the area where the rule applies but does not hold.

A.4. Search complexity

The size of the search space implied by our model class $\mathcal{M}$ is in $O(2^{|I_i| \times |I_j| \times 2^{|I_i|+|I_j|}})$. For two layers $I_i, I_j$, we can enumerate all possible rules by
\[
\underbrace{\sum_{k=0}^{|I_i|} k \times \binom{|I_i|}{k}}_{\text{possibilities for head}} \times \underbrace{\sum_{l=0}^{|I_j|} l \times \binom{|I_j|}{l}}_{\text{possibilities for tail}}
\;\leq\; |I_i| \sum_{k=0}^{|I_i|} \binom{|I_i|}{k} \times |I_j| \sum_{l=0}^{|I_j|} \binom{|I_j|}{l}
\;=\; |I_i|\, 2^{|I_i|} \times |I_j|\, 2^{|I_j|} \;=\; |I_j|\, |I_i|\, 2^{|I_i|+|I_j|}\,,
\]
where the first sum enumerates all heads of size $k$, the binomial coefficient describes the number of ways of drawing heads of that size, and the factor $k$ is the number of models given by the robust head encoding. Similarly, the second sum enumerates all tails of size $l$, the binomial coefficient describes the ways of drawing such tails, and the factor $l$ is the number of ways to place the error correcting matrices for the robust tail encoding. As in theory we can have any subset of these rules as a model, we thus get approximately $2^{|I_j| \times |I_i| \times 2^{|I_i|+|I_j|}}$ many different models.
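For a feel of how quickly this grows, the short check below evaluates the exact product and the upper bound for two small, hypothetical layer sizes; it only reproduces the counting argument above.

```python
# Count candidate rules between two tiny layers and compare with the bound
# |I_i| * 2^|I_i| * |I_j| * 2^|I_j|; real layers with hundreds of neurons explode.
from math import comb

ni, nj = 6, 8   # hypothetical layer sizes
exact = sum(k * comb(ni, k) for k in range(ni + 1)) * \
        sum(l * comb(nj, l) for l in range(nj + 1))
bound = ni * 2**ni * nj * 2**nj
print(exact, bound)   # 196608 <= 786432; the model space is of order 2^exact
```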
A.5. Algorithm Pseudocode

ExplaiNN explores the search space of rule sets in an iterative fashion, either generating new rules with a single item in the tail, or merging two existing rules and thus generating more complex rules with multiple items in the tail. Using these two steps, we can generate all potential candidate rules to add to the model, and evaluate their respective gain in terms of MDL. For a rule $r'$, we say that the model $M' = M \oplus r'$ is the refined model, with the refinement operator $\oplus$ adding the rule $r' = X \to Y$ to $M$, removing the merged rules that led to $r'$, if any, and updating the singleton transaction lists $T_A$ for all items $A \in Y$ in the tail.
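The pseudocode itself lies outside this excerpt. As a rough, hypothetical sketch of the iteration just described, the following loop alternates candidate generation and MDL-based selection; the merge criterion (rules sharing a head), the gain function, and all names are assumptions for illustration rather than the actual algorithm, and the update of the singleton transaction lists is omitted.

```python
# Hypothetical sketch of ExplaiNN's iterative search: generate candidates by
# (i) proposing single-tail-item rules and (ii) merging existing rules, then
# greedily add the candidate with the best MDL gain until no gain is positive.

def explainn_sketch(heads, tails, data, gain):
    model = []                                                   # rules as (head, tail) pairs
    while True:
        candidates = [(frozenset([h]), frozenset([t]))           # single item in the tail
                      for h in heads for t in tails]
        candidates += [(r1[0], r1[1] | r2[1])                    # merge two rules (assumed: same head)
                       for r1 in model for r2 in model
                       if r1 is not r2 and r1[0] == r2[0]]
        scored = [(gain(model, c, data), c) for c in candidates]
        best_gain, best = max(scored, default=(0, None))
        if best is None or best_gain <= 0:
            return model                                         # no refinement shrinks L(D, M)
        # refinement M ⊕ r': add r', drop the merged rules that led to it
        model = [r for r in model
                 if not (r[0] == best[0] and r[1] < best[1])] + [best]
```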