DIRECTPROBE: Studying Representations without Classifiers

Yichu Zhou and Vivek Srikumar
School of Computing, University of Utah
flyaway@cs.utah.edu, svivek@cs.utah.edu

Abstract

Understanding how linguistic structure is encoded in contextualized embeddings could help explain their impressive performance across NLP. Existing approaches for probing them usually call for training classifiers and use the accuracy, mutual information, or complexity as a proxy for the representation's goodness. In this work, we argue that doing so can be unreliable because different representations may need different classifiers. We develop a heuristic, DIRECTPROBE, that directly studies the geometry of a representation by building upon the notion of a version space for a task. Experiments with several linguistic tasks and contextualized embeddings show that, even without training classifiers, DIRECTPROBE can shed light on how an embedding space represents labels, and also anticipate classifier performance for the representation.

1 Introduction

Distributed representations of words (e.g., Peters et al., 2018; Devlin et al., 2019) have propelled the state-of-the-art across NLP to new heights. Recently, there is much interest in probing these opaque representations to understand the information they bear (e.g., Kovaleva et al., 2019; Conneau et al., 2018; Jawahar et al., 2019). The most commonly used strategy calls for training classifiers on them to predict linguistic properties such as syntax, or cognitive skills like numeracy (e.g., Kassner and Schütze, 2020; Perone et al., 2018; Yaghoobzadeh et al., 2019; Krasnowska-Kieraś and Wróblewska, 2019; Wallace et al., 2019; Pruksachatkun et al., 2020). Using these classifiers, criteria such as accuracy or model complexity are used to evaluate the representation's quality for the task (e.g., Goodwin et al., 2020; Pimentel et al., 2020a; Michael et al., 2020).

Such classifier-based probes are undoubtedly useful for estimating a representation's quality for a task. However, their ability to reveal the information in a representation is occluded by numerous factors, such as the choice of the optimizer and the initialization used to train the classifiers. For example, in our experiments using the task of preposition supersense prediction (Schneider et al., 2018), we found that the accuracies across different training runs of the same classifier can vary by as much as ~8%! (Detailed results can be found in Appendix F.)

Indeed, the very choice of a classifier influences our estimate of the quality of a representation. For example, one representation may achieve the best classification accuracy with a linear model, whereas another may demand a multi-layer perceptron for its non-linear decision boundaries. Of course, enumerating every possible classifier for a task is untenable. A common compromise involves using linear classifiers to probe representations (Alain and Bengio, 2017; Kulmizev et al., 2020), but doing so may mischaracterize representations that need non-linear separators. Some work recognizes this problem (Hewitt and Liang, 2019) and proposes to report probing results for at least logistic regression and a multi-layer perceptron (Eger et al., 2019), or to compare the learning curves between multiple controls (Talmor et al., 2020). However, the success of these methods still depends on the choice of classifiers.

In this paper, we pose the question: Can we evaluate the quality of a representation for an NLP task directly, without relying on classifiers as a proxy? Our approach is driven by a characterization of not one, but all decision boundaries in a representation that are consistent with a training set for a task. This set of consistent (or approximately consistent) classifiers constitutes the version space for the task (Mitchell, 1982), and includes both simple (e.g., linear) and complex (e.g., non-linear) classifiers for the task. However, perfectly characterizing the version space for a problem presents computational challenges.
To develop an approximation, we note that any decision boundary partitions the underlying feature space into contiguous regions associated with labels. We present a heuristic approach called DIRECTPROBE, which builds upon hierarchical clustering to identify such regions for a given task and embedding.

The resulting partitions allow us to directly probe the embeddings via their geometric properties. For example, distances between these regions correlate with the difficulty of learning with the representation: larger distances between regions of different labels indicate that there are more consistent separators between them, and imply easier learning and better generalization of classifiers. Further, by assigning test points to their closest partitions, we obtain a parameter-free classifier as a side effect, which can help benchmark representations without committing to a specific family of classifiers (e.g., linear) as probes.

Our experiments study five different NLP tasks that involve syntactic and semantic phenomena. We show that our approach allows us to ascertain, without training a classifier, (a) if a representation admits a linear separator for a dataset, (b) how different layers of BERT differ in their representations for a task, (c) which labels for a task are more confusable, (d) the expected performance of the best classifier for the task, and (e) the impact of fine-tuning.

In summary, the contributions of this work are:

1. We point out that training classifiers as probes is not reliable, and instead, we should directly analyze the structure of a representation space.

2. We formalize the problem of evaluating representations via the notion of version spaces and introduce DIRECTPROBE, a heuristic method to approximate it directly without training classifiers.

3. Via experiments, we show that our approach can help identify how good a given representation will be for a prediction task.¹

¹DIRECTPROBE can be downloaded from https://github.com/utahnlp/DirectProbe.

2 Representations and Learning

In this section, we will first briefly review the relationship between representations and model learning. Then, we will introduce the notion of ε-version spaces to characterize representation quality.

2.1 Two Sub-tasks of a Learning Problem

The problem of training a predictor for a task can be divided into two sub-problems: (a) representing data, and (b) learning a predictor (Bengio et al., 2013). The former involves transforming input objects x—words, pairs of words, sentences, etc.—into a representation E(x) ∈ R^n that provides features for the latter. Model learning builds a classifier h over E(x) to make a prediction, denoted by the function composition h(E(x)).

[Figure 1: An illustration of the total work in a learning problem, visualized as the distance traveled from the input bar to the output bar. Different representations E_i and different classifiers h_i require different effort.]

Figure 1 illustrates the two sub-tasks and their roles in figuratively "transporting" an input x towards a prediction with a probability of error below a small ε. For the representation E1, the best classifier h1 falls short of this error requirement. The representation E4 does not need an additional classifier because it is identical to the label. The representations E2 and E3 both admit classifiers h2 and h3 that meet the error threshold. Further, note that E3 leaves less work for the classifier than E2, suggesting that it is a better representation as far as this task is concerned.

This illustration gives us the guiding principle for this work²:

    The quality of a representation E for a task is a function of both the performance and the complexity of the best classifier h over that representation.

²In addition to the performance and complexity of the best classifier, other aspects such as sample complexity, the stability of learning, etc., are also important. We do not consider them in this work: these aspects are related to optimization and learnability, and are more closely tied to classifier-based probes.

Two observations follow from the above discussion. First, we cannot enumerate every possible classifier to find the best one.
Other recent work, such as that of Xia et al. (2020), makes a similar point. Instead, we need to resort to an approximation to evaluate representations. Second, trivially, the best representation for a task is identical to an accurate classifier; in the illustration in Figure 1, this is represented by E4. However, such a representation is over-specialized to one task. In contrast, learned representations like BERT promise task-independent representations that support accurate classifiers.

2.2 ε-Version Spaces

Given a classification task, we seek to disentangle the evaluation of a representation E from the classifiers h that are trained over it. To do so, the first step is to characterize all classifiers supported by a representation.

Classifiers are trained to find a hypothesis (i.e., a classifier) that is consistent with a training set. A representation E admits a set of such hypotheses, and a learner chooses one of them. Consider the top-left example of Figure 2. There are many classifiers that separate the two classes; the figure shows two linear (h1 and h2) and one non-linear (h3) examples. Given a set H of classifiers of interest, the subset of classifiers that are consistent with a given dataset represents the version space with respect to H (Mitchell, 1982). To account for errors or noise in data, we define an ε-version space: the set of hypotheses that achieve less than ε error on a given dataset.

[Figure 2: Using piecewise linear functions to mimic decision boundaries in two different cases. The solid lines on the left are the real decision boundaries for a binary classification problem. The dashed lines in the middle are the piecewise linear functions used to mimic these decision boundaries. The gray area is the region that a separator must cross. The connected points on the right represent the convex regions that the piecewise linear separators lead to.]

Let us formalize this definition. Suppose H represents the whole hypothesis space consisting of all possible classifiers h of interest. The ε-version space V_ε(H, E, D) expressed by a representation E for a labeled dataset D is defined as:

    V_ε(H, E, D) ≜ {h ∈ H | err(h, E, D) ≤ ε}    (1)

where err represents the training error.

Note that the ε-version space V_ε(H, E, D) is only a set of functions and does not involve any learning. However, understanding a representation requires examining its ε-version space—a larger one would allow for easier learning.

In previous work, the quality of a representation E for a task represented by a dataset D is measured via properties of a specific h ∈ V_ε(H, E, D), typically a linear probe. Commonly measured properties include generalization error (Kim et al., 2019), minimum description length of labels (Voita and Titov, 2020), and complexity (Whitney et al., 2020). Instead, we seek to directly evaluate the ε-version space of a representation for a task, without committing to a restricted set of probe classifiers.

3 Approximating an ε-Version Space

Although the notion of an ε-version space is well defined, finding it for a given representation and task is impossible because it involves enumerating all possible classifiers. In this section, we describe a heuristic to approximate the ε-version space. We call this approach DIRECTPROBE.

3.1 Piecewise Linear Approximation

Each classifier in V_ε(H, E, D) is a decision boundary in the representation space that separates examples with different labels (see Figure 2, left). The decisions of a classifier can be mimicked by a set of piecewise linear functions. Figure 2 (middle) shows two examples. At the top is the simple case with linearly separable points. At the bottom is a more complicated case with a circular separator; the set of piecewise linear functions that matches its decisions needs at least three lines.

The ideal piecewise linear separator partitions training points into groups, each of which contains points with exactly one label. These groups can be seen as defining convex regions in the embedding space (see Figure 2, right). Any classifier in V_ε(H, E, D) must cross the regions between the groups with different labels; these are the regions that separate labels from each other, as shown in the gray areas in Figure 2 (middle).
Inspired by this, we posit that these regions between groups with different labels, and indeed the partitions themselves, offer insight into V_ε(H, E, D).

3.2 Partitioning Training Data

Although finding the set of all decision boundaries remains hard, finding the regions between the convex groups that these piecewise linear functions split the data into is less so. Grouping data points in this fashion is related to a well-studied problem, namely clustering, and several recent works have looked at clustering of contextualized representations (Reimers et al., 2019; Aharoni and Goldberg, 2020; Gupta et al., 2020). In this work, we have a new clustering problem with the following criteria: (i) All points in a group have the same label; we need to ensure we are mimicking the decision boundaries. (ii) There are no overlaps between the convex hulls of the groups; if the convex hulls of two groups do not overlap, there must exist a hyperplane that separates them, as guaranteed by the hyperplane separation theorem. (iii) The total number of groups is minimized; otherwise, a trivial solution is for each data point to become a group by itself.

Note that the criteria do not require all points of one label to be grouped together. For example, in Figure 2 (bottom right), points of the circle (i.e., ◦) class are in three subgroups.

Next, let us see a heuristic for partitioning the training set into clusters based on these criteria.

3.3 A Heuristic for Clustering

To find clusters as described in §3.2, we define a simple bottom-up heuristic clustering strategy that forms the basis of DIRECTPROBE (Algorithm 1). In the beginning, each example x_i with label y_i in a training set D is a cluster C_i by itself. In each iteration, we select the closest pair of clusters with the same label and merge them (lines 4, 5). If the convex hull of the new cluster does not overlap with any of the other clusters³, we keep this new cluster (line 9). Otherwise, we flag this pair (line 7) and choose the next closest pair of clusters. We repeat these steps until no more clusters can be merged.

³"Other clusters" here means clusters with different labels. There is no need to prohibit overlap between clusters with the same label, since they might be merged in later iterations.

Algorithm 1 DIRECTPROBE: Bottom-up Clustering

Input: A dataset D with labeled examples (x_i, y_i).
Output: A set of clusters of points C.
 1: Initialize C_i as the cluster for each (x_i, y_i) ∈ D
 2: C = {C_0, C_1, ...}, B = ∅
 3: repeat
 4:    Select the closest pair (C_i, C_j) ∈ C that has the same label and is not flagged in B
 5:    S ← C_i ∪ C_j
 6:    if the convex hull of S overlaps with other elements of C then
 7:        B ← B ∪ {(C_i, C_j)}
 8:    else
 9:        Update C by removing C_i, C_j and adding S
10:    end if
11: until no more pairs can be merged
12: return C

We define the distance between clusters C_i and C_j as the Euclidean distance between their centroids. Although Algorithm 1 does not guarantee the minimization criterion in §3.2, since it is a greedy heuristic, we will see in our experiments that, in practice, it works well.
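To make the merging loop concrete, the following is a minimal Python sketch of Algorithm 1. The overlap test is abstracted behind a hulls_overlap function (a linear-program feasibility check, sketched below); the quadratic search for the closest pair and the conservative clearing of the flagged set B after every merge are simplifications for illustration, not the released implementation.

import numpy as np

def directprobe_clusters(X, y, hulls_overlap):
    """Greedy bottom-up clustering (a sketch of Algorithm 1).

    X: (n, d) array of embeddings E(x); y: length-n array of labels.
    hulls_overlap(A, B): True if conv(A) and conv(B) intersect.
    """
    # Each point starts as its own cluster (a list of row indices).
    clusters = [{"idx": [i], "label": y[i]} for i in range(len(X))]
    blocked = set()  # flagged pairs that may not merge (the set B)

    def centroid(c):
        return X[c["idx"]].mean(axis=0)

    while True:
        # Find the closest unblocked pair of clusters with the same label.
        best, best_dist = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i]["label"] != clusters[j]["label"]:
                    continue
                if (i, j) in blocked:
                    continue
                d = np.linalg.norm(centroid(clusters[i]) - centroid(clusters[j]))
                if d < best_dist:
                    best, best_dist = (i, j), d
        if best is None:
            break  # no more pairs can be merged
        i, j = best
        merged_idx = clusters[i]["idx"] + clusters[j]["idx"]
        # Overlap only needs to be checked against differently labeled clusters.
        rivals = [c for k, c in enumerate(clusters)
                  if k not in (i, j) and c["label"] != clusters[i]["label"]]
        if any(hulls_overlap(X[merged_idx], X[c["idx"]]) for c in rivals):
            blocked.add((i, j))  # flag the pair; try the next closest one
        else:
            new = {"idx": merged_idx, "label": clusters[i]["label"]}
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(new)
            blocked = set()  # indices shifted; re-flag lazily (safe but slow)
    return clusters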
Overlaps Between Convex Hulls  A key point of Algorithm 1 is checking whether the convex hulls of two sets of points overlap (line 6). Suppose we have two sets C = {x_1, ..., x_n} and C′ = {x′_1, ..., x′_m}. We can restate this as the problem of checking if there is some vector w and scalar b such that:

    ∀ x_i ∈ C:    wᵀE(x_i) + b ≥ 1
    ∀ x′_j ∈ C′:  wᵀE(x′_j) + b ≤ −1    (2)

where E is the representation under investigation.

We can state this problem as a linear program that checks for the feasibility of this system of inequalities. If the LP is feasible, there must exist a separator between the two sets of points, and they do not overlap. In our implementation, we use the Gurobi optimizer (Gurobi Optimization, 2020) to solve the linear programs.⁴

⁴Algorithm 1 can be made faster by avoiding unnecessary calls to the solver. Appendix A gives a detailed description of these techniques, which are also incorporated in the code release.
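Our implementation uses the Gurobi optimizer, but the same feasibility test can be sketched with any LP solver. The version below uses scipy.optimize.linprog purely for illustration; the function name hulls_overlap is ours and matches the clustering sketch above.

import numpy as np
from scipy.optimize import linprog

def hulls_overlap(A, B):
    """True iff the convex hulls of point sets A and B intersect.

    Following Eq. (2), we search for (w, b) with w.x + b >= 1 on A and
    w.x + b <= -1 on B. Such a separator exists iff the hulls are
    disjoint, so an infeasible LP means the hulls overlap.
    """
    d = A.shape[1]
    # Variables z = [w_1, ..., w_d, b]; both constraint families are
    # rewritten as rows of A_ub @ z <= b_ub.
    rows_a = np.hstack([-A, -np.ones((len(A), 1))])  # -(w.x + b) <= -1
    rows_b = np.hstack([B, np.ones((len(B), 1))])    #   w.x + b  <= -1
    res = linprog(c=np.zeros(d + 1),                 # feasibility only: zero objective
                  A_ub=np.vstack([rows_a, rows_b]),
                  b_ub=-np.ones(len(A) + len(B)),
                  bounds=[(None, None)] * (d + 1),
                  method="highs")
    return not res.success  # infeasible => no separator => overlap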
3.4 Noise: Controlling ε

Clusters with only a small number of points could be treated as noise. Geometrically, a point in the neighborhood of other points with different labels could be thought of as noise. Other clusters cannot merge with it because of the no-overlap constraint. As a result, such clusters will have only a few points (one or two in practice). If we want a zero error rate on the training data, we can keep these noise points; if we allow a small error rate ε, then we can remove these noise clusters. In our experiments, for simplicity, we keep all clusters.

4 Representations, Tasks and Classifiers

Before looking at the analysis offered by the partitions obtained via DIRECTPROBE in §5, let us first enumerate the English NLP tasks and representations we will encounter.

4.1 Representations

Our main experiments focus on BERTbase,cased, and we also show additional analysis on other contextual representations: ELMo (Peters et al., 2018)⁵, BERTlarge,cased (Devlin et al., 2019), RoBERTabase, and RoBERTalarge (Liu et al., 2019b). We refer the reader to Appendix C for further details about these embeddings.

⁵For simplicity, we use the equally weighted three layers of ELMo in our experiments.

We use the average of the subword embeddings as the token vector for representations that use subwords. We use the original implementation of ELMo, and the HuggingFace library (Wolf et al., 2020) for the others.
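As a rough sketch of this averaging with the HuggingFace library (the model name and the use of the final hidden layer here are illustrative choices, and a fast tokenizer is assumed so that word_ids() is available):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

def token_vectors(words):
    """One vector per word: the average of its subword embeddings."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_subwords, dim)
    word_ids = enc.word_ids()  # maps each subword position to its word
    vectors = []
    for i in range(len(words)):
        pieces = [j for j, w in enumerate(word_ids) if w == i]
        vectors.append(hidden[pieces].mean(dim=0))
    return torch.stack(vectors)

print(token_vectors(["The", "embeddings", "are", "averaged"]).shape)
# torch.Size([4, 768]) for BERT-base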
4.2 Tasks

We conduct our experiments on five NLP tasks that cover the varied usages of word representations (token-based, span-based, and token pairs) and include both syntactic and semantic prediction problems. Appendix D has more details about the tasks to help with replication.

Preposition supersense disambiguation represents a pair of tasks, involving classifying a preposition's semantic role (SS-role) and semantic function (SS-func). Following previous work (Liu et al., 2019a), we only use the single-token prepositions in the STREUSLE v4.2 corpus (Schneider et al., 2018).

Part-of-speech tagging (POS) is a token-level prediction task. We use the English portion of the parallel universal dependencies treebank (UD-PUD; Nivre et al., 2016).

Semantic relation (SR) is the task of predicting the semantic relation between pairs of nominals. We use the dataset of SemEval-2010 Task 8 (Hendrickx et al., 2010). To represent the pair of nominals, we concatenate their embeddings. Some nominals could be spans instead of individual tokens, and we represent them via the average embedding of the tokens in the span.

Dependency relation (DEP) is the task of predicting the syntactic dependency relation between a token w_head and its modifier w_mod. We use the universal dependency annotation of the English Web Treebank (Bies et al., 2012). As with semantic relations, to represent the pair of tokens, we concatenate their embeddings.

4.3 Classifier Accuracy

The key starting point of this work is that restricting ourselves to linear probes may be insufficient. To validate the results of our analysis, we evaluate a large collection of classifiers—from simple linear classifiers to two-layer neural networks—for each task. For each one, we choose the best hyper-parameters using cross-validation. From these classifiers, we find the best test accuracy for each task and representation. All classifiers are trained with the scikit-learn library (Pedregosa et al., 2011). To reduce the impact of randomness, we trained each classifier 10 times with different initializations and report their average accuracy. Appendix E summarizes the best classifiers we found and their performance.
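A sketch of this search with scikit-learn follows; the two candidate families and the hyper-parameter grids shown here are illustrative stand-ins for the full pool summarized in Appendix E.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

def best_classifier_accuracy(X_train, y_train, X_test, y_test):
    """Search probes from linear to two-layer; return the best test accuracy."""
    candidates = [
        (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
        (MLPClassifier(max_iter=500),
         {"hidden_layer_sizes": [(64,), (256,), (64, 64)]}),
    ]
    best = 0.0
    for model, grid in candidates:
        # Hyper-parameters chosen by cross-validation on the training set.
        search = GridSearchCV(model, grid, cv=5).fit(X_train, y_train)
        best = max(best, search.score(X_test, y_test))
    return best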
5 Experiments and Analysis

DIRECTPROBE helps partition an embedding space for a task, and thus characterize its ε-version space. Here, we will see that these clusters do indeed characterize various linguistic properties of the representations we consider.

5.1 Number of Clusters

The number of clusters is an indicator of the linear separability of a representation for a task. The best scenario is when the number of clusters equals the number of labels. In this case, examples with the same label are placed close enough by the representation to form a cluster that is separable from the other clusters; a simple linear multi-class classifier can fit well in this scenario. In contrast, if the number of clusters is more than the number of labels, then some labels are distributed across multiple clusters (as in Figure 2, bottom), and there must be a non-linear decision boundary. Consequently, this scenario calls for a more complex classifier, e.g., a multi-layer neural network.

In other words, using the clusters, and without training a classifier, we can answer the question: can a linear classifier fit the training set for a task with a given representation?

To validate our predictions, we use the training accuracy of a linear SVM (Chang and Lin, 2011) classifier. If a linear SVM can perfectly fit (100% accuracy) a training set, then there exist linear decision boundaries that separate the labels.

Embedding          Linear SVM Training Accuracy    #Clusters
BERTbase,cased                 100                      17
BERTlarge,cased                100                      17
RoBERTabase                    100                      17
RoBERTalarge                   99.97                    1487
ELMo                           100                      17

Table 1: Linearity experiments on the POS tagging task. Our POS tagging task has 17 labels in total. Both the linear SVM and the number of clusters suggest that RoBERTalarge is non-linear while the others are all linear, which means the best classifier for RoBERTalarge is not a linear model. More details can be found in Appendix E.

Table 1 shows the linearity experiments on the POS task, which has 17 labels in total. All representations except RoBERTalarge have 17 clusters, suggesting a linearly separable space, which is confirmed by the SVM accuracy. We conjecture that this may be the reason why linear models usually work for BERT-family models. Of course, linear separability does not mean the task is easy or that the best classifier is a linear one. We found that, while most representations we considered are linearly separable for most of our tasks, the best classifier is not always linear. We refer the reader to Appendix E for the full results.
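This check is easy to sketch with scikit-learn's libsvm-backed SVC; the large C value below approximates a hard margin and is an illustrative choice.

from sklearn.svm import SVC

def is_linearly_separable(X_train, y_train):
    """True if a linear SVM perfectly fits the training set, i.e.,
    linear decision boundaries separate the labels."""
    clf = SVC(kernel="linear", C=1e4).fit(X_train, y_train)
    return clf.score(X_train, y_train) == 1.0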
5.2 Distances between Clusters

As we mentioned in §3.1, a learning process seeks to find a decision boundary that separates clusters with different labels. Intuitively, a larger gap between them would make it easier for a learner to find a suitable hypothesis h that generalizes better.

We use the distance between the convex hulls of clusters as an indicator of the size of these gaps. We note that computing the distance between the convex hulls of two clusters is equivalent to finding the maximum-margin separator between them. To find the distance between two clusters, we train a linear SVM (Chang and Lin, 2011) that separates them and compute its margin; the distance we seek is twice the margin. For a given representation, we are interested in the minimum distance across all pairs of clusters with different labels.
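A sketch of this computation: for a hard-margin linear SVM with weight vector w, each cluster sits at distance 1/||w|| from the separating hyperplane, so the hull-to-hull distance is 2/||w||. The large C value stands in for a hard margin and assumes the two clusters are separable, which DIRECTPROBE guarantees for clusters with different labels.

import numpy as np
from sklearn.svm import SVC

def cluster_distance(A, B):
    """Distance between the convex hulls of two point sets A and B."""
    X = np.vstack([A, B])
    y = np.array([0] * len(A) + [1] * len(B))
    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # ~hard-margin separator
    return 2.0 / np.linalg.norm(clf.coef_)       # twice the margin

# The quantity studied below is then
# min(cluster_distance(Ci, Cj)) over all pairs with different labels.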
[Figure 3: Here we juxtapose the minimum distances between clusters and the best classifier accuracy for all 12 layers. The horizontal axis is the layer index of BERTbase,cased; the left vertical axis is the best classifier accuracy and the right vertical axis is the minimum distance between all pairs of clusters.]

5.2.1 Minimum Distance of Different Layers

Higher layers usually have larger ε-version spaces. Different layers of BERT play different roles when encoding linguistic information (Tenney et al., 2019). To investigate the geometry of different layers of BERT, we apply DIRECTPROBE to each layer of BERTbase,cased for all five tasks. Then, we compute the minimum distances among all pairs of clusters with different labels. By comparing the minimum distances of different layers, we answer the question: how do different layers of BERT differ in their representations for a task?

Figure 3 shows the results on all tasks. In each subplot, the horizontal axis is the layer index. For each layer, the blue circles (left vertical axis) are the best classifier accuracy, and the red triangles (right vertical axis) are the minimum distance described above. We observe that both the best classifier accuracy and the minimum distance show similar trends across different layers: first increasing, then decreasing. This shows that the minimum distance correlates with the best performance for an embedding space, though the relation is not a simple linear one. Another interesting observation is the decreasing performance and minimum distance of the higher layers, which is also corroborated by Ethayarajh (2019) and Liu et al. (2019a).

5.2.2 Impact of Fine-tuning

Fine-tuning expands the ε-version space. Past work (Peters et al., 2019; Arase and Tsujii, 2019; Merchant et al., 2020) has shown that fine-tuning pre-trained models on a specific task improves performance, and fine-tuning is now the de facto procedure for using contextualized embeddings. In this experiment, we try to understand why fine-tuning can improve performance. Without training classifiers, we answer the question: What changes in the embedding space after fine-tuning?

We conduct the experiments described in §5.2.1 on the last layer of BERTbase,cased before and after fine-tuning for all tasks. Table 2 shows the results. We see that after fine-tuning, both the best classifier accuracy and the minimum distance show a big boost. This means that fine-tuning pushes the clusters away from each other in the representation space, which results in a larger ε-version space. As we discussed in §5.2, a larger ε-version space admits more good classifiers and allows for better generalization.

Task      Setting       Min Distance    Best Acc
SS-role   original      0.778           77.51
SS-role   fine-tuned    4.231           81.62
SS-func   original      0.333           86.13
SS-func   fine-tuned    2.686           88.4
POS       original      0.301           93.59
POS       fine-tuned    0.7696          95.95
SR        original      0.421           86.85
SR        fine-tuned    4.734           90.03
DEP       original      0.345           91.52
DEP       fine-tuned    1.075           94.82

Table 2: The best performance and the minimum distances between all pairs of clusters of the last layer of BERTbase,cased before and after fine-tuning.

5.2.3 Label Confusion

Small distances between clusters can confuse a classifier. By comparing the distances between clusters, we can answer the question: Which labels for a task are more confusable?

We compute the distances between all pairs of labels based on the last layer of BERTbase,cased.⁶ Based on an even split of the distances, we partition all label pairs into three bins: small, medium, and large. For each task, we use the predictions of the best classifier to compute the number of misclassified label pairs for each bin. For example, if the clusters associated with the part-of-speech tags ADV and ADJ are close to each other, and the best classifier misclassified ADV as ADJ, we put this error pair into the small-distance bin. The distribution of all errors is shown in Table 3. This table shows that a large majority of the misclassified labels are concentrated in the small-distance bin. For example, in the supersense role task (SS-role), 97.17% of the errors happened in the small-distance bin. The number of label pairs in each bin is shown in the parentheses. Table 3 shows that small distances between clusters indeed confuse a classifier, and we can detect this without training classifiers.

⁶For all tasks, the BERTbase,cased space (last layer) is linearly separable, so the number of label pairs equals the number of cluster pairs.

Task      Small Distance    Medium Distance    Large Distance
SS-role   97.17% (555)      2.83% (392)        0% (88)
SS-func   96.88% (324)      3.12% (401)        0% (55)
POS       99.19% (102)      0.81% (18)         0% (16)
SR        93.20% (20)       6.80% (20)         0% (5)
DEP       99.97% (928)      0.03% (103)        0% (50)

Table 3: Error distribution based on different distance bins. The number of label pairs in each bin is shown in the parentheses.

5.3 By-product: A Parameter-free Classifier

[Figure 4: Comparison between the best classifier accuracy, intra-accuracy, and 1-kNN accuracy. The x-axis shows different representation models. BB: BERTbase,cased, BL: BERTlarge,cased, RB: RoBERTabase, RL: RoBERTalarge, E: ELMo. The Pearson correlation coefficient between best classifier accuracy and intra-accuracy is shown in the parentheses alongside each task title.]

We can predict the expected performance of the best classifier. Any h ∈ V_ε(H, E, D) is a predictor for the task D on the representation E.
As a by-product of the clusters from DIRECTPROBE, we can define a predictor. The prediction strategy is simple: for a test example, we assign it to its closest cluster.⁷ Indeed, if the label of the cluster is the true label of the test point, then we know that there exists some classifier that allows this example to be correctly labeled. We can verify the label of every test point and compute the aggregate accuracy to serve as an indicator of the generalization ability of the representation at hand. We call this accuracy the intra-accuracy. In other words, without training classifiers, we can answer the question: given a representation, what is the expected performance of the best classifier for a task?

⁷To find the distance between the convex hull of a cluster and a test point, we find a max-margin separating hyperplane by training a linear SVM that separates the point from the cluster. The distance is twice the distance between the hyperplane and the test point.

Figure 4 compares the best classifier accuracy and the intra-accuracy of the last layer of different embeddings. Because our assignment strategy is similar to nearest neighbor classification (1-kNN), which assigns an unlabelled test point to its closest labeled point, the figure also compares to the 1-kNN accuracy.

First, we observe that intra-accuracy always outperforms the simple 1-kNN classifier, showing that DIRECTPROBE can use more information from the representation space. Second, we see that the intra-accuracy is close to the best accuracy for some tasks (the supersense tasks and POS tagging). Moreover, all the Pearson correlation coefficients between best accuracy and intra-accuracy (shown in the parentheses alongside each task title in Figure 4) suggest a high linear correlation between the best classifier accuracy and intra-accuracy. That is, the intra-accuracy can be a good predictor of the best classifier accuracy for a representation. From this, we argue that intra-accuracy can be interpreted as a benchmark accuracy of a given representation without actually training classifiers.
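The point-to-hull distance from footnote 7 and the resulting intra-accuracy can be sketched as follows. As with the earlier sketches, this is an illustration rather than the released code; in particular, it assumes each test point lies outside every cluster's convex hull, so that a separator exists.

import numpy as np
from sklearn.svm import SVC

def point_to_cluster_distance(x, cluster_points):
    """Footnote-7 distance: train a max-margin separator between the
    point and the cluster; the point and the hull are equidistant from
    it, so the distance is twice the point-to-hyperplane distance."""
    X = np.vstack([cluster_points, x[None, :]])
    y = np.array([0] * len(cluster_points) + [1])
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    return 2.0 * abs(w @ x + b) / np.linalg.norm(w)

def intra_accuracy(clusters, cluster_labels, X_test, y_test):
    """Assign each test point to its closest cluster; report accuracy."""
    correct = 0
    for x, gold in zip(X_test, y_test):
        dists = [point_to_cluster_distance(x, c) for c in clusters]
        correct += cluster_labels[int(np.argmin(dists))] == gold
    return correct / len(y_test)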
5.4 Case Study: Identifying Difficult Examples

The distances between a test point and all the clusters from the training set can be used not only to predict the label, but also to identify difficult examples as per a given representation. Doing so could lead to re-annotation of the data, and perhaps to cleaner data or improved embeddings. Using the supersense role task, we show a randomly chosen example of a mismatch between the annotated label and the BERTbase,cased neighborhood:

    . . . our new mobile number is . . .

The data labels the word our as GESTALT, while the embedding places it in the neighborhood of POSSESSOR. The annotation guidelines for these labels (Schneider et al., 2017) note that GESTALT is a supercategory of POSSESSOR. The latter is specifically used to identify cases where the possessed item is alienable and has monetary value. From this definition, we see that though the annotated label is GESTALT, it could arguably also be a POSSESSOR if phone numbers are construed as alienable possessions that have monetary value. Importantly, it is unclear whether BERTbase,cased makes this distinction. Other examples we examined required similarly nuanced analysis. This example shows that DIRECTPROBE can be used to identify examples in datasets that are potentially mislabeled, or at least require further discussion.

6 Related Work and Discussion

In addition to the classifier-based probes described in the rest of the paper, a complementary line of work focuses on probing representations using a behavior-based methodology. Controlled test sets (Şenel et al., 2018; Jastrzebski et al., 2017) are designed and errors are analyzed to reverse-engineer what information can be encoded by the model (e.g., Marvin and Linzen, 2018; Ravichander et al., 2021; Wu et al., 2020). Another line of work probes the space by "opening up" the representation space or the model (e.g., Michel et al., 2019; Voita et al., 2019). There are some efforts to inspect the space from a geometric perspective (e.g., Ethayarajh, 2019; Mimno and Thompson, 2017). Our work extends this line of work to connect the geometric structure of the embedding space with classifier performance without actually training a classifier.

Recent work (Pimentel et al., 2020b; Voita and Titov, 2020; Zhu and Rudzicz, 2020) probes representations from an information theoretic perspective. These efforts still need a probability distribution p(y|x) from a trained classifier. In §5.3, we use clusters to predict labels. In the same vein, the conditional probability p(y|x) can be obtained by treating the negative distances between the test point x and all clusters as predicted scores and normalizing via softmax. Our formalization can thus fit into the information theoretic analysis and yet avoid training a classifier.
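A sketch of this conversion is below. It consumes distances computed as in the §5.3 sketch (e.g., via point_to_cluster_distance); taking the minimum distance over clusters that share a label is our own simplification for labels that DIRECTPROBE splits across several clusters.

import numpy as np

def label_distribution(distances, cluster_labels):
    """p(y|x): softmax over negative point-to-cluster distances.

    distances[i] is the distance from the test point x to cluster i,
    and cluster_labels[i] is that cluster's label.
    """
    label_set = sorted(set(cluster_labels))
    scores = np.array([
        -min(d for d, l in zip(distances, cluster_labels) if l == lab)
        for lab in label_set
    ])
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    return dict(zip(label_set, probs / probs.sum()))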
Our analysis and experiments open new directions for further research:

Novel pre-training target: The analysis presented here informs us that larger distances between clusters can improve classifiers. This could guide loss function design when pre-training representations.

Quality of a representation: In this paper, we focus on the accuracy of a representation. We could seek to measure other properties (e.g., complexity) or proxies for them. These analytical approaches can be applied to the ε-version space to further analyze the quality of the representation space.

Theory of representation: Learning theory, e.g., VC-theory (Vapnik, 2013), describes the learnability of classifiers; representation learning lacks such theoretical analysis. The ideas explored in this work (ε-version spaces, distances between clusters being critical) could serve as a foundation for an analogous theory of representations.

7 Conclusion

In this work, we ask the question: what makes a representation good for a task? We answer it by developing DIRECTPROBE, a heuristic approach that builds upon hierarchical clustering to approximate the ε-version space. Via experiments with several contextualized embeddings and linguistic tasks, we showed that DIRECTPROBE can help us understand the geometry of the embedding space and ascertain when a representation can successfully be employed for a task.

Acknowledgments

We thank the members of the Utah NLP group and Nathan Schneider for discussions and valuable insights, and the reviewers for their helpful feedback. We also acknowledge the support of NSF grants #1801446 (SATC) and #1822877 (Cyberlearning).
References

Roee Aharoni and Yoav Goldberg. 2020. Unsupervised domain clusters in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7747–7763, Online. Association for Computational Linguistics.

Guillaume Alain and Yoshua Bengio. 2017. Understanding intermediate layers using linear classifier probes. In International Conference on Learning Representations.

Yuki Arase and Jun'ichi Tsujii. 2019. Transfer fine-tuning: A BERT case study. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5393–5404, Hong Kong, China. Association for Computational Linguistics.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. 2012. English Web Treebank. Linguistic Data Consortium, Philadelphia, PA.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Steffen Eger, Andreas Rücklé, and Iryna Gurevych. 2019. Pitfalls in the evaluation of sentence embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 55–60, Florence, Italy. Association for Computational Linguistics.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.

Emily Goodwin, Koustuv Sinha, and Timothy J. O'Donnell. 2020. Probing linguistic systematicity. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1958–1969, Online. Association for Computational Linguistics.
Vikram Gupta, Haoyue Shi, Kevin Gimpel, and Mrin-              Empirical Methods in Natural Language Processing
  maya Sachan. 2020. Clustering contextualized repre-          and the 9th International Joint Conference on Natu-
sentations of text for unsupervised syntax induction. arXiv preprint arXiv:2010.12784.

Gurobi Optimization, LLC. 2020. Gurobi optimizer reference manual.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38, Uppsala, Sweden. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

Stanisław Jastrzebski, Damian Leśniak, and Wojciech Marian Czarnecki. 2017. How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks. arXiv preprint arXiv:1702.02170.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Nora Kassner and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7811–7818, Online. Association for Computational Linguistics.

Najoung Kim, Roma Patel, Adam Poliak, Patrick Xia, Alex Wang, Tom McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, Samuel R. Bowman, and Ellie Pavlick. 2019. Probing what different NLP tasks teach machines about function word comprehension. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 235–249, Minneapolis, Minnesota. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Katarzyna Krasnowska-Kieraś and Alina Wróblewska. 2019. Empirical linguistic study of sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5729–5739, Florence, Italy. Association for Computational Linguistics.

Artur Kulmizev, Vinit Ravishankar, Mostafa Abdou, and Joakim Nivre. 2020. Do neural language models show preferences for syntactic formalisms? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4077–4091, Online. Association for Computational Linguistics.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019a. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.

Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, and Ian Tenney. 2020. What happens to BERT embeddings during fine-tuning? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 33–44, Online. Association for Computational Linguistics.

Julian Michael, Jan A. Botha, and Ian Tenney. 2020. Asking without telling: Exploring latent ontologies in contextual representations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6792–6812.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 14014–14024.
David Mimno and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2873–2878, Copenhagen, Denmark. Association for Computational Linguistics.

Tom M. Mitchell. 1982. Generalization as search. Artificial Intelligence, 18(2):203–226.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Christian S. Perone, Roberto Silveira, and Thomas S. Paula. 2018. Evaluation of sentence embeddings in downstream and linguistic probing tasks. arXiv preprint arXiv:1806.06259.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14, Florence, Italy. Association for Computational Linguistics.

Tiago Pimentel, Naomi Saphra, Adina Williams, and Ryan Cotterell. 2020a. Pareto probing: Trading off accuracy for complexity. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3138–3153, Online. Association for Computational Linguistics.

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020b. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4609–4622, Online. Association for Computational Linguistics.

Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. 2020. Intermediate-task transfer learning with pretrained language models: When and why does it work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5231–5247, Online. Association for Computational Linguistics.

Abhilasha Ravichander, Yonatan Belinkov, and Eduard H. Hovy. 2021. Probing the probing paradigm: Does probing accuracy entail task relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021, pages 3363–3377. Association for Computational Linguistics.

Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and clustering of arguments with contextualized word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 567–578, Florence, Italy. Association for Computational Linguistics.

Nathan Schneider, Jena D. Hwang, Archna Bhatia, Vivek Srikumar, Na-Rae Han, Tim O'Gorman, Sarah R. Moeller, Omri Abend, Adi Shalev, Austin Blodgett, et al. 2017. Adposition and case supersenses v2.5: Guidelines for English. arXiv preprint arXiv:1704.02134.

Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Jakob Prange, Austin Blodgett, Sarah R. Moeller, Aviram Stern, Adi Bitan, and Omri Abend. 2018. Comprehensive supersense disambiguation of English prepositions and possessives. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 185–196, Melbourne, Australia. Association for Computational Linguistics.

Lütfi Kerem Şenel, Ihsan Utlu, Veysel Yücesoy, Aykut Koc, and Tolga Cukur. 2018. Semantic structure and interpretability of word embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1769–1779.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics — on what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Vladimir Vapnik. 2013. The Nature of Statistical Learning Theory. Springer Science & Business Media.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196, Online. Association for Computational Linguistics.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5307–5315, Hong Kong, China. Association for Computational Linguistics.

William F. Whitney, Min Jae Song, David Brandfonbrener, Jaan Altosaar, and Kyunghyun Cho. 2020. Evaluating representations by the complexity of learning low-loss predictors. arXiv preprint arXiv:2009.07368.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. 2020. Perturbed masking: Parameter-free probing for analyzing and interpreting BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4166–4176, Online. Association for Computational Linguistics.

Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, and Graham Neubig. 2020. Predicting performance for natural language processing tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8625–8646, Online. Association for Computational Linguistics.

Yadollah Yaghoobzadeh, Katharina Kann, T. J. Hazen, Eneko Agirre, and Hinrich Schütze. 2019. Probing for semantic classes: Diagnosing the meaning content of word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5740–5753, Florence, Italy. Association for Computational Linguistics.

Zining Zhu and Frank Rudzicz. 2020. An information theoretic view on selecting linguistic probes. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9251–9262, Online. Association for Computational Linguistics.

A    DIRECTPROBE in Practice

In practice, we apply several techniques to speed up the probing process. First, we add two caching strategies:

Black List Suppose we know that two clusters A and B cannot be merged. Then A′ and B′ cannot be merged either if A ⊆ A′ and B ⊆ B′.

White List Suppose we know that two clusters A and B can be merged. Then A′ and B′ can be merged too if A′ ⊆ A and B′ ⊆ B.

These two observations allow us to cache previous decisions when checking whether two clusters overlap. Applying these caching strategies helps us avoid unnecessary overlap checks, which are time-consuming.
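As a concrete illustration, the two lists can be kept as collections of cluster pairs and consulted before the expensive overlap test. The following is a minimal sketch of this idea, with clusters represented as frozensets of example indices; the class and method names are ours, not from the DIRECTPROBE release, and the overlap test itself is left abstract.

class MergeCache:
    # Caches earlier mergeability decisions between clusters. A
    # cluster is a frozenset of example indices, so the subset
    # tests below are exact set operations. This is a sketch of
    # the black/white-list idea, not the released implementation.

    def __init__(self):
        self.black = []   # pairs known NOT to be mergeable
        self.white = []   # pairs known to be mergeable

    def known_unmergeable(self, a, b):
        # Black list: if (A, B) could not merge, then no pair of
        # supersets (A' ⊇ A, B' ⊇ B) can merge either.
        return any((x <= a and y <= b) or (x <= b and y <= a)
                   for x, y in self.black)

    def known_mergeable(self, a, b):
        # White list: if (A, B) could merge, then any pair of
        # subsets (A' ⊆ A, B' ⊆ B) can merge too.
        return any((a <= x and b <= y) or (b <= x and a <= y)
                   for x, y in self.white)

    def record(self, a, b, mergeable):
        (self.white if mergeable else self.black).append((a, b))

Before invoking the expensive overlap check on a candidate pair, one would first call known_unmergeable and known_mergeable, and only run the geometric test on a cache miss, recording its outcome afterwards.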
Second, instead of merging from the start to the end and checking at every step, we keep merging all the way to the end without any overlap checking. Once we arrive at the end, we check backwards to see whether there is an overlap. If the final clusters overlap, we find the step at which the first error occurred, correct it, and continue merging. Merging to the end also avoids a large amount of unnecessary checking because, in most representations, the first error usually happens only in the final few merges. Algorithm 2 shows the whole procedure.

Algorithm 2 DIRECTPROBE in Practice
Input: A dataset D with labeled examples (xi, yi).
Output: A set of clusters of points C.
 1: Initialize Ci as the cluster for each (xi, yi) ∈ D
 2: C = {C0, C1, · · · }
 3: Keep merging the closest pairs that have the same label in C, without overlap checking.
 4: if there are overlaps in C then
 5:     Find the step k at which the first overlap happens.
 6:     Rebuild C from step k − 1.
 7:     Keep merging the closest pairs that have the same label in C, with overlap checking.
 8: end if
 9: return C
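For readers who prefer code, here is our Python rendering of Algorithm 2. It is a sketch under two assumptions that the pseudocode does not fix: clusters are compared by centroid distance, and the expensive overlap test is passed in as a callable; neither detail is taken from the released implementation.

import numpy as np
from itertools import combinations

def direct_probe(X, y, overlaps):
    # Sketch of Algorithm 2 (ours, not the released code).
    # X: (n, d) array of embeddings; y: length-n label sequence.
    # overlaps(clusters) -> bool is the expensive overlap test,
    # treated here as a black box.

    def centroid(c):
        return X[list(c)].mean(axis=0)

    def same_label_pairs_by_distance(cs):
        # candidate merges: same-label cluster pairs, closest first
        pairs = [(a, b) for a, b in combinations(cs, 2)
                 if y[min(a)] == y[min(b)]]  # clusters are label-pure
        return sorted(pairs, key=lambda p: float(
            np.linalg.norm(centroid(p[0]) - centroid(p[1]))))

    def run(cs, check, trace=None):
        # Repeatedly merge the closest same-label pair. With
        # check=True, a merge that would create an overlap is
        # skipped in favour of the next closest pair.
        while True:
            merged = None
            for a, b in same_label_pairs_by_distance(cs):
                nxt = [c for c in cs if c not in (a, b)] + [a | b]
                if check and overlaps(nxt):
                    continue
                merged = nxt
                break
            if merged is None:
                return cs
            if trace is not None:
                trace.append(merged)  # remember every state
            cs = merged

    singletons = [frozenset([i]) for i in range(len(X))]

    # Line 3: merge all the way to the end without any checking.
    trace = []
    final = run(singletons, check=False, trace=trace)
    if not overlaps(final):
        return final

    # Lines 4-8: scan backwards for the last overlap-free state
    # (the singleton clustering is overlap-free by construction),
    # then redo the remaining merges with checking switched on.
    states = [singletons] + trace
    k = next(i for i in range(len(states) - 1, -1, -1)
             if not overlaps(states[i]))
    return run(states[k], check=True)

With a stub such as overlaps = lambda cs: False, direct_probe degenerates to plain same-label agglomerative clustering, which makes for a quick sanity check.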
Table 4 shows the runtime comparison between DIRECTPROBE and training classifiers.

               DirectProbe               Classifier
            Clustering   Predict       CV    Train-Test
SS-role         1 min     7 min    1.5 hours     1 hour
SS-func       1.5 min   5.5 min      1 hour      1 hour
POS            14 min    43 min    3.5 hours    2 hours
SR             10 min    22 min      50 min      1 hour
DEP           3 hours   3 hours     20 hours   11 hours

Table 4: Comparison of running time between DIRECTPROBE and training classifiers, using the settings described in Appendix B. Times are measured on the BERTbase,cased space; DIRECTPROBE and the classifiers run on the same machine.

B    Classifier Training Details
We train 3 different kinds of classifiers in order to find the best one: Logistic Regression, one-layer neural networks with hidden layer sizes of (32, 64, 128, 256), and two-layer neural networks with hidden layer sizes of (32, 64, 128, 256) × (32, 64, 128, 256), for a total of 1 + 4 + 16 = 21 classifiers. All neural networks use ReLU as the activation function and are optimized with Adam (Kingma and Ba, 2015). The cross-validation process chooses the regularizer weight for each classifier, with the weight ranging from 10^-7 to 10^0. We set the maximum number of iterations to 1000. After choosing the best hyperparameters, each specific classifier is trained 10 times with different initializations.
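For concreteness, this pool of classifiers could be instantiated with scikit-learn (Pedregosa et al., 2011) roughly as follows. This is an illustrative sketch: the function name and the trailing notes on the search procedure are ours, and the original experiments may have configured the models differently.

from itertools import product
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

SIZES = (32, 64, 128, 256)

def make_classifiers():
    # 1 logistic regression + 4 one-layer + 16 two-layer ReLU
    # networks, all trained with Adam: 21 classifiers in total.
    models = {"logistic": LogisticRegression(max_iter=1000)}
    for h in SIZES:
        models[f"({h},)"] = MLPClassifier(
            hidden_layer_sizes=(h,), activation="relu",
            solver="adam", max_iter=1000)
    for h1, h2 in product(SIZES, SIZES):
        models[f"({h1}, {h2})"] = MLPClassifier(
            hidden_layer_sizes=(h1, h2), activation="relu",
            solver="adam", max_iter=1000)
    return models

# The regularizer weight (C for logistic regression, alpha for
# the networks) would then be selected by cross-validation over
# the range 10^-7 to 10^0, and each chosen configuration
# retrained 10 times with different random_state values.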
C    Summary of Representations

Table 5 summarizes all the representations in our experiments.

Representations      #Parameters   Dimensions
BERTbase,cased            110M            768
BERTlarge,cased           340M           1024
RoBERTabase               125M            768
RoBERTalarge              355M           1024
ELMo                     93.6M           1024

Table 5: Statistics of the five representations in our experiments.
                                                           mance.
D    Summary of Tasks

In this work, we conduct experiments on 5 tasks, which are designed to cover different uses of representations. Table 6 summarizes these tasks.

E    Classifier Results vs. DIRECTPROBE Results

Table 8 summarizes the results of the best classifier for each representation and task. The meaning of each column is as follows:

   • Embedding: The name of the representation.

   • Best classification: The accuracy of the best classifier among the 21 classifiers described in Appendix B. Each classifier is run 10 times with different starting points; the final accuracy is the average over these 10 runs.

   • Intra-accuracy: The prediction accuracy obtained by assigning test examples to their closest clusters (see the sketch after this list).

   • Type: The type of the best classifier, e.g., (256,) denotes a one-layer neural network with hidden size 256, and (256, 128) denotes a two-layer neural network with hidden sizes 256 and 128, respectively.

   • Parameters: The number of parameters of the best classifier.

   • Linear SVM: The training accuracy of a linear SVM classifier.

   • #Clusters: The number of final clusters after probing.
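The Intra-accuracy column, for instance, could be computed along the following lines. This is our reconstruction for illustration only: it measures the distance from a test point to a cluster by the distance to the cluster's centroid, which is an assumption of the sketch rather than a detail taken from the paper.

import numpy as np

def intra_accuracy(clusters, y_train, X_train, X_test, y_test):
    # clusters: list of index sets over X_train, each label-pure
    # (the output of DirectProbe). Predict each test point with
    # the label of its nearest cluster, then score accuracy.
    centroids = np.stack([X_train[list(c)].mean(axis=0)
                          for c in clusters])
    # each cluster is label-pure, so any member's label will do
    cluster_labels = np.array([y_train[min(c)] for c in clusters])

    # distances from every test point to every cluster centroid
    dists = np.linalg.norm(
        X_test[:, None, :] - centroids[None, :, :], axis=-1)
    preds = cluster_labels[dists.argmin(axis=1)]
    return float((preds == np.asarray(y_test)).mean())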
F    Detailed Classification Results

Table 7 shows the detailed classification results on the Supersense-role task, using the same settings described in Appendix B. In this table, we record the difference between the minimum and maximum test accuracies over the 10 runs of each classifier. The maximum difference for each representation is underlined, and the best performance for each representation is shown in bold. From this table, we observe that: (i) the difference between different runs of the same classifier can be as large as 4–7%, which cannot be ignored; and (ii) different representations require different model architectures to achieve their best performance.