Learning from Human-Generated Lists

Kwang-Sung Jun, Xiaojin Zhu                                       {deltakam,jerryzhu}@cs.wisc.edu
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706 USA
Burr Settles                                                                   bsettles@cs.cmu.edu
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Timothy T. Rogers                                                               ttrogers@wisc.edu
Department of Psychology, University of Wisconsin-Madison, Madison, WI 53706 USA

Abstract

Human-generated lists are a form of non-iid data with important applications in machine learning and cognitive psychology. We propose a generative model, sampling with reduced replacement (SWIRL), for such lists. We discuss SWIRL's relation to standard sampling paradigms, provide the maximum likelihood estimate for learning, and demonstrate its value with two real-world applications: (i) In a "feature volunteering" task where non-experts spontaneously generate feature⇒label pairs for text classification, SWIRL improves the accuracy of state-of-the-art feature-learning frameworks. (ii) In a "verbal fluency" task where brain-damaged patients generate word lists when prompted with a category, SWIRL parameters align well with existing psychological theories, and our model can classify healthy people vs. patients from the lists they generate.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1. Introduction

We present a probabilistic model describing the human process of ordered list generation. For machine learning, such a model can enhance the way computers learn from people. Consider a user who is training a system to classify sports articles. She might begin by generating a list of relevant phrase⇒label rules (e.g., "touchdown⇒football, home-run⇒baseball"). Incorporating such "feature volunteering" as training data in a machine learning algorithm has been an area of considerable research interest (Druck et al., 2008; Liu et al., 2004; Settles, 2011). Less effort has gone into modeling the human process of generating these lists, which (as we show) can be combined with these algorithms for further improvement. For cognitive psychology, such list-generation tasks are used to probe the structure of human memory and to diagnose forms of cognitive impairment. For instance, the "verbal fluency" task requires a subject to generate as many words of a given category (e.g., "vehicle") as possible within 60 seconds (e.g., "car, plane, boat, ..."). Performance on this simple task is highly sensitive to even mild mental dysfunction.

Learning from human-generated lists differs from more familiar machine learning settings. Unlike ranking, the human teacher does not have the collection of items in front of her to arrange in order. Unlike active learning, she is not prompted with item queries, either. Instead, she must generate the list internally via memory search. Such lists have two key characteristics:

1. Order matters. Items (e.g., vehicle names or phrase⇒label rules) that appear earlier in a list tend to be more "important." This suggests that we can estimate the "value" of an item according to its position in the list. This relation is not deterministic, however, but stochastic. Very important items can appear later or be missing from any single list altogether.

2. Repeats happen. Humans tend to repeat items in their list even though they know they should not (see Table 1). Indeed, we will see that people tend to repeat items even when their lists are displayed right in front of them (e.g., Figure 1). Though this is a nuisance for some applications, the propensity to repeat can provide useful information in others.

We propose a new sampling paradigm, sampling with reduced replacement (SWIRL), to model human list production. Informally, SWIRL is "in-between" sampling with and without replacement, since a drawn

(a) sports:   1 baseball bat⇒BS, ..., 7 quarterback⇒F, 8 football field⇒F, 9 soccer ball⇒S, ..., 23 basketball court⇒BK, 24 football field⇒F*, 25 soccer field⇒S, ...

(b) webkb:    1 research⇒P, ..., 16 school⇒F, 17 requirement⇒C, 18 grade⇒C, 19 science⇒C, ..., 37 school⇒F*, 38 grade⇒C*, ...

(c) vehicles: 1 automobile, 2 truck, 3 train, 4 boat, 5 train*, 6 airplane, 7 bicycle, 8 motorcycle, 9 minivan, 10 bus

(d) boats:    1 yacht, 2 rowboat, 3 paddle boat, 4 casino, 5 steam liner, 6 warship, 7 aircraft carrier, ..., 11 clipper ship, 12 rowboat*

Table 1. Example human-generated lists, with repeats marked by *. (a) Feature volunteering for sports articles: BS=baseball, F=football, S=soccer, and BK=basketball. (b) Feature volunteering for academic web pages: P=project, F=faculty, and C=course. (c,d) Verbal fluency tasks, from healthy and brain-damaged individuals, respectively.

item is replaced but with its probability discounted. This allows us to model order and repetition simultaneously. In Section 2, we formally define SWIRL and provide the maximum likelihood estimate to learn model parameters from human-generated lists. Though not in closed form, our likelihood function is convex and easily optimized. We present a machine learning application in Section 3: feature volunteering for text classification, where we incorporate SWIRL parameters into down-stream text classifiers. We compare two frameworks: (i) Generalized Expectation (GE) for logistic regression (Druck et al., 2008) and (ii) informative Dirichlet priors (IDP) for naïve Bayes (Settles, 2011). We then present a psychology application in Section 4: verbal fluency, where SWIRL itself is used to classify healthy vs. brain-damaged populations, and its parameters provide insight into the different mental processes of the two groups.

2. Sampling With Reduced Replacement (SWIRL)

Let V denote the vocabulary of items for a task, and z = (z_1, ..., z_n) be an ordered list of n items where z_t ∈ V and the z's are not necessarily distinct. The set of N lists produced by different people is written z^(1) = (z_1^(1), ..., z_{n^(1)}^(1)), ..., z^(N) = (z_1^(N), ..., z_{n^(N)}^(N)).

We now formally define SWIRL. Assume that humans possess an unnormalized distribution over the items for this task. Let s_i ≥ 0 be the "initial size" of item i for i ∈ V, not necessarily normalized. One would select item i with probability proportional to s_i. Critically, the size of the selected item (say, z_1) will be discounted by a factor α ∈ [0, 1] for the next draw: s_{z_1} ← αs_{z_1}. This reduces the chance that item z_1 will be selected again in the future. To make it a full generative model, we assume a Poisson(λ) list length distribution. The process of generating a single list z is specified in Algorithm 1.

Algorithm 1 The SWIRL Model
  Parameters: λ, s = {s_i | i ∈ V}, α.
  n ∼ Poisson(λ)
  for t = 1, ..., n do
    z_t ∼ Multinomial( s_i / Σ_{j∈V} s_j ; i ∈ V )
    s_{z_t} ← α s_{z_t}
  end for

2.1. Relation to Other Sampling Paradigms

Setting α = 1 recovers sampling with replacement, while α = 0 differs subtly from sampling without replacement. Consider an "urn-ball" model where balls have |V| distinct colors. Let there be m_i balls of color i, each with size s_i. The probability of drawing a ball is proportional to its size. The chance that a draw has color i is P(color i) = s_i m_i / (Σ_{j∈V} s_j m_j). We contrast several sampling paradigms:

Sampling without replacement, where all colors have the same size s_i = s_j. If a draw produces a ball with color i, then m_i ← m_i − 1 for the next draw. Let m = (m_1, ..., m_{|V|})^⊤ be the counts of balls in the urn and k = (k_1, ..., k_{|V|})^⊤ be the counts of balls drawn; then k follows the multivariate hypergeometric distribution mhypg(k; m, 1^⊤k).

Sampling without replacement, where the sizes may be different. The distribution of k follows the multivariate Wallenius' noncentral hypergeometric distribution mwnchypg(k; m, s, 1^⊤k), which is a generalization of the multivariate hypergeometric distribution (Wallenius, 1963; Chesson, 1976; Fog, 2008). "Noncentral" means that the s_i's may be different. Note that after drawing a ball of color i, we subtract s_i from the "total size" of color i.

SWIRL. Each color has only one ball: m_i = 1, but the sizes s_i may differ. Balls are replaced but trimmed: m_i stays at one but s_i ← αs_i. This results in a geometric (rather than a Pólya urn process-style arithmetic)

change in that color’s size. We are interested in the                                  sizes s are scale invariant. We remove this invariance
probability of the ordered sequence of draws (i.e., z)                                 by setting sM F I = 1 where M F I is the most frequent
rather than just a count vector k. The salient differ-                                 item in z(1) . . . z(N ) . Equivalently, βM F I = 0.
ences are illustrated by the following example.
                                                                                       The complete convex optimization problem for finding
Example (Non-exchangeability). Sampling without                                        the MLE of SWIRL is
replacement is well-known to be exchangeable. This
is not true for SWIRL. Let there be two colors V =                                                              min           −`(β, γ)                    (2)
                                                                                                                β,γ
{A, B}. Consider two experiments with matching to-                                                              s.t.          βM F I = 0                  (3)
tal size for each color:
                                                                                                                              γ ≤ 0.                      (4)
(Experiment 1) There are m1 = a balls of color A and
m2 = b balls of color B. Let s1 = s2 = 1, and perform                                  The MLE is readily computed by quasi-Newton meth-
sampling without replacement. Let zi be the color of                                   ods such as LBFGS, where the required gradient for (2)
the ith draw. It is easy to show that P (zi = A) =                                     is computed by
               a
P (zj = A) = a+b  ; ∀i, j.                                                                                      N X
                                                                                                                  n     (j)
                                                                                       ∂`                       X                  exp(c(i, j, t)γ + βi )
(Experiment 2) There is m1 = 1 ball of color A with                                            = −c(i) +                      P               0
                                                                                       ∂βi                      j=1 t=1         i0 ∈V exp(c(i , j, t)γ + βi )
                                                                                                                                                           0
size s1 = a, and m2 = 1 ball of color B with size
s2 = b. We perform SWIRL with discounting factor α.                                                   n(j)
                                                                                                    N X
                      a                        αa    a                                  ∂`          X                  (j)
Then P (z1 = A) = a+b    , but P (z2 = A) = ( αa+b a+b )+                                      =                −c(zt , j, t)
    a   b                                                                               ∂γ
( a+αb a+b ). In general, P (z1 = A) 6= P (z2 = A). For
                                                                                                    j=1 t=1
                                                                                                        P
instance, when α = 0 we have P (z2 = A) = a+b    b
                                                   .                                                      i∈V   exp(c(i, j, t)γ + βi )c(i, j, t)
                                                                                                    +       P               0
                                                                                                                                                 ,
                                                                                                              i0 ∈V exp(c(i , j, t)γ + βi )
                                                                                                                                          0

2.2. Maximum Likelihood Estimate
                                                                                       where c(i) is the total count of occurrences for item i
Given observed lists z(1) . . . z(N ) , the log likelihood is
     N                   n(j)
                                                                                       in the lists z(1) . . . z(N ) , for i ∈ V.
                                                         
                                          (j)    (j)
    X                    X
`=      n(j) log λ − λ +        log P zt | z1:t−1 , s, α ,                             2.3. Maximum A Posteriori Estimate
     j=1                               t=1
                                                                                       Additionally, it is easy to compute the MAP estimate
where n(j) is the length of the jth list, and λ is the                                 when we have prior knowledge on β and γ. Sup-
Poisson intensity parameter. The key quantity here is                                  pose we believe the initial sizes should be approxi-
                                                         (j)
                                                αc(zt        ,j,t)
                                                                       s   (j)
                                                                                       mately proportional to s0 . For example, in the ve-
                   (j)        (j)                                 zt
          P       zt     |   z1:t−1 , s, α       =P                              (1)   hicle verbal fluency task, s0 may be the counts of
                                                           α c(i,j,t) s
                                                       i∈V             i               various vehicles in a large corpus. We can define
where c(i, j, t) is the count of item i in the jth prefix                              β0i ≡ log(s0i /s0M F I ); ∀i ∈ V, and adopt a Gaussian
                        (j)         (j)                                                prior on β centered around β 0 (equivalently, s follows a
list of length t − 1: z1 , . . . , zt−1 . This repeated dis-
counting and renormalization couples α and s, making                                   log-Normal distribution). Since this prior removes the
the MLE difficult to solve in closed form.                                             scale invariance on β, we no longer need the constraint
                                                                                       βM F I = 0. Similarly, we may have prior knowledge of
The λ parameter is independent of s and α, so the                                      γ0 . The MAP estimate is the solution to
MLE λ  b is the usual average list length. To simplify
notation, we omit the Poisson and focus on the log                                           min        −`(β, γ) + τ1 kβ − β 0 k2 + τ2 (γ − γ0 )2 (5)
                                                                                             β,γ
likelihood of s and α, namely `(s, α) =                                                      s.t.       γ≤0,
  N Xn(j)
             (j)
 X                                             X
          c(zt , j, t) log α + log sz(j) − log   αc(i,j,t) si .                        where τ1 , τ2 are appropriate regularization terms to
                                                   t
j=1 t=1                                                          i∈V
                                                                                       prevent overfitting.
We transform the variables into the log domain which
is easier to work with: β ≡ log s, and γ ≡ log α. The              3. Application I: Feature Volunteering
log likelihood can now be written as `(β, γ) =                         for Text Classification
       (j)
  N Xn
 X            (j)
                                       X                           We now turn to a machine learning application
           c(zt , j, t)γ + βz(j) − log     exp(c(i, j, t)γ + βi ).
                             t                                     of SWIRL: training text classifiers from human-
 j=1 t=1                               i∈V
                                                                   volunteered feature labels (rather than documents). A
Due to the log-sum-exp form, `(β, γ) is concave in β               feature label is a simple rule stating that the presence
and γ (Boyd & Vandenberge, 2004). Note the initial                 of a word or phrase indicates a particular class label.
Learning from Human-Generated Lists
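Before turning to the application, the generative process of Algorithm 1 and the log likelihood of Section 2.2 can be sketched in a few lines of Python. This is an illustrative sketch under our own conventions (dict-based vocabulary, function names ours), not the authors' code; the Poisson draw uses Knuth's inversion method, which is only suitable for small λ.

```python
import math
import random

def swirl_sample(sizes, alpha, lam, rng=None):
    """One list from SWIRL (Algorithm 1): draw n ~ Poisson(lam), then repeatedly
    draw an item with probability proportional to its current size and discount
    the drawn item's size by alpha ("reduced replacement")."""
    rng = rng or random.Random()
    s = dict(sizes)                          # working copy of initial sizes s_i
    # n ~ Poisson(lam) via Knuth's inversion method
    n, p, limit = 0, 1.0, math.exp(-lam)
    while True:
        p *= rng.random()
        if p <= limit:
            break
        n += 1
    z = []
    for _ in range(n):
        total = sum(s.values())
        if total <= 0:                       # every size discounted to zero
            break
        r = rng.random() * total             # multinomial draw proportional to s
        for item, si in s.items():
            if si <= 0:
                continue
            r -= si
            if r <= 0:
                break
        z.append(item)
        s[item] *= alpha                     # s_{z_t} <- alpha * s_{z_t}
    return z

def swirl_loglik(lists, vocab, beta, gamma):
    """Log likelihood l(beta, gamma) of Section 2.2 (Poisson term omitted),
    with beta_i = log s_i and gamma = log alpha <= 0."""
    ll = 0.0
    for z in lists:
        c = {i: 0 for i in vocab}            # c(i, j, t): count of i in the prefix
        for zt in z:
            terms = [c[i] * gamma + beta[i] for i in vocab]
            m = max(terms)                   # stabilized log-sum-exp
            lse = m + math.log(sum(math.exp(u - m) for u in terms))
            ll += c[zt] * gamma + beta[zt] - lse
            c[zt] += 1                       # update prefix count after the draw
    return ll
```

As a sanity check, γ = 0 (i.e., α = 1) reduces the likelihood to iid multinomial sampling, while γ < 0 makes an immediate repeat of an item strictly less likely; the MLE of Section 2.2 can then be obtained by handing the negated log likelihood to any quasi-Newton optimizer with the constraints β_MFI = 0 and γ ≤ 0.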

For example, "touchdown⇒football" implies that documents containing the word "touchdown" probably belong to the class "football." Some prior work exploits a bag of volunteered features from users, where each has an equal importance. Although such feature labels help classification (Liu et al., 2004), the order of volunteered words is not taken into account. The order turns out to be very useful, however, as we will show later. Other prior work solicits labeled features through a form of feature queries: the computer, via unsupervised corpus analysis (e.g., topic modeling), proposes a list of high-probability candidate features for a human to label (Druck et al., 2008).

Departing from previous works, we point out that the human teacher, upon hearing the categorization goal (e.g., the classes to be distinguished), can volunteer an ordered list of feature labels without first consulting a corpus; see Table 1(a,b). This "feature volunteering" procedure is particularly attractive when a classifier is promptly needed for a novel task, since humans can be recruited quickly via crowdsourcing or other means, even before a corpus is fully compiled. Another possibility, which we recommend for practitioners¹, is to treat feature volunteering as a crucial first step in a chain of progressively richer interactive supervision, followed by queries on both features and documents. Queries can be selected by the computer in order to build better classifiers over time. Such a combination has been studied recently (Settles, 2011).

¹ Feature volunteering and feature query labeling are complementary ways of obtaining feature labels. The former provides a way to capture the importance of feature labels (e.g., by their order), while the latter can consult extra resources (e.g., unlabeled data) to harvest more feature labels, as pointed out by Liu et al. (2004).

In this section, we show that feature volunteering can be successfully combined with two existing frameworks for training classifiers with labeled features: (i) Generalized Expectation (GE) for logistic regression (Druck et al., 2008) and (ii) informative Dirichlet priors (IDP) for multinomial naïve Bayes (Settles, 2011). We also show that "order matters" by highlighting the value of SWIRL as a model for feature volunteering. That is, by endowing each volunteered feature with its size s_i as estimated in Section 2, we can build better classifiers than by treating the volunteered features equally under both of these machine learning frameworks.

3.1. Generalized Expectation (GE)

Let y ∈ Y be a class label, and x ∈ R^{|F⁺|} be a vector describing a text document using feature set F⁺, which is a superset of the volunteered feature set F. Consider the conditional probability distributions realizable by multinomial logistic regression

  Δ_Θ = { p_θ(y | x) | θ ∈ R^{|F⁺|×|Y|} },    (6)

where

  p_θ(y | x) = exp(θ_y^⊤ x) / Σ_{y′∈Y} exp(θ_{y′}^⊤ x).    (7)

Generalized Expectation (GE) seeks a distribution p* ∈ Δ_Θ that matches a set of given reference distributions p̂_f(y): distributions over the label y if the feature f ∈ F is present. We will construct p̂_f(y) from feature volunteering and compare it against other constructions in the next section. Before that, we specify the matching sought by GE (Druck et al., 2008).

We restrict ourselves to sufficient statistics based on counts, such that x_f ∈ x is the number of times feature f occurs in the document. Let U be an unlabeled corpus. The empirical mean conditional distribution on documents where x_f > 0 is given by

  M_f[p_θ(y | x)] ≡ Σ_{x∈U} 1{x_f > 0} p_θ(y | x) / Σ_{x′∈U} 1{x′_f > 0}.    (8)

GE training minimizes a regularized objective based on the KL-divergence of these distributions:

  p* = argmin_{p_θ∈Δ_Θ} Σ_{f∈F} KL( p̂_f(y) ‖ M_f[p_θ(y | x)] ) + ‖θ‖²/2.    (9)

In other words, GE seeks to make the reference and empirical distributions as similar as possible.

3.2. Constructing GE Reference Distributions p̂_f(y) with SWIRL

Recall that in feature volunteering, the N human users produce multiple ordered lists z^(1), ..., z^(N). Each item z in these lists is an f⇒y (feature⇒label) pair. It is possible that the same f appears multiple times in different z's, mapping to different y's (e.g., "goalie⇒soccer" and "goalie⇒hockey"). In this case we say feature f co-occurs with multiple y's.

For each list z, we split it into |Y| sublists by item labels. This produces one ordered sublist per class. We collect all N sublists for a single class y, and treat them as N lists generated by SWIRL from the yth urn. We find the MLE of this yth urn using (2), and normalize it to sum to one. We are particularly interested in the size estimates s^y = {s_{f⇒y} | f ∈ F}. This is repeated for all |Y| urns, so we have s^1, ..., s^{|Y|}.

We construct reference distributions using

  p̂_f(y) = s_{f⇒y} / Σ_{y′∈Y} s_{f⇒y′};  ∀f ∈ F,    (10)

where F is the union of features appearing in the lists z^(1), ..., z^(N), and s_{f⇒y} = 0 if feature f is absent from

the yth list. For example, imagine "goalie⇒soccer" appears near the top of a list, so s_{goalie⇒soccer} is large (say 0.4), "goalie⇒hockey" appears much later in another list, so s_{goalie⇒hockey} is small (say 0.1), and "goalie" is never associated with "baseball". Then by (10), p̂_goalie(soccer) = 0.8, p̂_goalie(hockey) = 0.2, and p̂_goalie(baseball) = 0.

In our experiments, we compare three ways of creating reference distributions. GE/SWIRL is given by Equation (10). GE/Equal is defined as

  p̂_f(y) = 1{s_{f⇒y} > 0} / Σ_{y′∈Y} 1{s_{f⇒y′} > 0};  ∀f ∈ F,    (11)

which is similar to (10), except that all features co-occurring with y have equal size. This serves as a baseline to investigate whether order matters. GE/Schapire is the reference distribution used in previous work (Druck et al., 2008):

  p̂_f(y) = q/m if feature f co-occurs with y, and (1 − q)/(|Y| − m) otherwise,    (12)

where m is the number of distinct labels co-occurring with feature f in z^(1), ..., z^(N), and q is a smoothing parameter. We use q = 0.9 as in prior work.

3.3. Informative Dirichlet Priors (IDP)

Another way to use feature volunteering is by training multinomial naïve Bayes models, where feature⇒label rules are adapted as informative priors on feature parameters. The class distribution p(y) is parametrized by π_y, and the likelihood of a document x given a class label y is modeled as p(x | y) = Π_f (φ_{fy})^{x_f}, where x_f is the frequency count of feature f and φ_{fy} is the multinomial parameter for feature f under class y. Dirichlet priors are placed on each class-conditional multinomial parameter, where the hyperparameter is denoted by d_{fy} for phrase f under class y.

We estimate π_y by the class proportion of labeled documents, and φ_{fy} by posterior expectation as follows:

  φ_{fy} ∝ d_{fy} + Σ_x p(y | x) x_f,    (13)

where φ_{fy} is normalized over phrases to sum to one for each y. When learning from only labeled instances,

Table 2. Domains in the feature volunteering application.

  Corpus   Class Labels                                      N    |F|   |F⁺|
  sports   baseball, basketball, football, hockey, soccer    52   594   2948
  movies   negative, positive                                27   382   2514
  webkb    course, faculty, project, student                 56   961   2521

3.4. Constructing IDP Priors d_{fy} with SWIRL

We compare two approaches for incorporating prior knowledge into naïve Bayes by feature volunteering. IDP/SWIRL sets the hyperparameters as follows: d_{fy} = 1 + k n_y s_{f⇒y}, where f is a feature, k a parameter, and n_y is the number of unique features in y's list. Again, we compute s_{f⇒y} via SWIRL as in Section 3.2.

Note that replacing s_{f⇒y} with 1{s_{f⇒y} > 0}/n_y recovers prior work (Settles, 2011). In this method, only the association between a feature f and a class y is taken into account, rather than the relative importance of these associations. This baseline, IDP/Settles, sets d_{fy} = 1 + k·1{s_{f⇒y} > 0} and serves to investigate whether order matters in human-generated lists.

3.5. Experiments

We conduct feature volunteering text classification experiments in three different domains: sports (sports articles), movies (movie reviews), and webkb (university web pages). The classes, number of human participants (N), and the number of distinct list features they produced (|F|) are listed in Table 2.

Participants. A total of 135 undergraduate students from the University of Wisconsin-Madison participated for partial course credit. No one participated in more than one domain. All human studies in this paper were approved by the institutional review board.

Procedure. Participants were informed of only the class labels, and were asked to provide as many words or short phrases as they thought would be necessary to accurately classify documents into the classes. They volunteered features using the web-based computer interface illustrated in Figure 1 (shown here for the sports domain). The interface consists of a text box to enter features, followed by a series of buttons cor-
p(y | x) ∈ {0, 1} indicates the true labeling of instance        responding to labels in the domain. When the partic-
x. When learning from both labeled and unlabeled                 ipant clicks on a label (e.g., “hockey” in the figure),
instances, we run EM as follows. First, initialize φf y ’s       the phrase is added to the bottom of the list below the
by (13) using only the df y hyperparameters. Second,             corresponding button, and erased from the input box.
repeatedly apply (13) until convergence, where the sec-          The feature⇒label pair is recorded for each action in
ond summation term is over both labeled and unla-                sequence. The label order was randomized for each
beled instances and p(y | x) for unlabeled instance x            subject to avoid presentation bias. Participants had
is computed using Bayes rule.                                    15 minutes to complete the task.
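To make the reference-distribution baselines concrete, here is a minimal sketch of GE/Equal (11) and GE/Schapire (12). The function name `reference_distributions` and the dict-based representation are illustrative choices, not part of the paper; GE/SWIRL additionally requires the SWIRL estimates s_{f⇒y} and is omitted.

```python
from collections import defaultdict

def reference_distributions(pairs, labels, q=0.9):
    """Per-feature reference distributions over labels, built from
    volunteered feature=>label pairs.
    GE/Equal (11): uniform over the labels co-occurring with f.
    GE/Schapire (12): mass q spread over co-occurring labels,
    mass 1-q spread over the remaining labels (Druck et al., 2008)."""
    cooc = defaultdict(set)             # feature -> labels seen with it
    for f, y in pairs:
        cooc[f].add(y)

    equal, schapire = {}, {}
    for f, ys in cooc.items():
        m = len(ys)
        rest = max(len(labels) - m, 1)  # guard: f co-occurs with every label
        equal[f] = {y: 1.0 / m if y in ys else 0.0 for y in labels}
        schapire[f] = {y: q / m if y in ys else (1.0 - q) / rest
                       for y in labels}
    return equal, schapire
```

For example, given the pairs ("goal", "soccer") and ("goal", "hockey") with a third label "baseball", GE/Equal puts 0.5 on soccer and hockey, while GE/Schapire puts 0.45 on each and 0.1 on baseball. (Equation (12) is undefined when m = |Y|; the `rest` guard is our own addition.)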
Table 3. Text categorization results.

(a) GE accuracies for logistic regression

  Corpus    SWIRL   Equal   Schapire     FV
  sports    0.865   0.847      0.795  0.875
  movies    0.733   0.733      0.725  0.681
  webkb     0.463   0.444      0.429  0.426

(b) IDP accuracies for multinomial naïve Bayes

  Corpus    SWIRL   Settles     FV
  sports    0.911     0.901  0.875
  movies    0.687     0.656  0.681
  webkb     0.659     0.651  0.426
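The informative-prior training of Section 3.3, whose results appear in Table 3(b), reduces to a short computation: build the hyperparameters d_{fy}, then apply the posterior-expectation update (13). The sketch below uses hypothetical names (`mnb_phi`, `one_step_em`) and NumPy; it assumes uniform class priors as in our IDP experiments and is not the exact experimental code.

```python
import numpy as np

def mnb_phi(X, post, d):
    """Eq. (13): phi_{fy} proportional to d_{fy} + sum_x p(y|x) x_f,
    normalized over features f for each class y.
    X: (docs, feats) count matrix; post: (docs, classes) p(y|x);
    d: (feats, classes) Dirichlet hyperparameters."""
    phi = d + X.T @ post
    return phi / phi.sum(axis=0, keepdims=True)

def one_step_em(X, d, log_pi):
    """One EM step on unlabeled documents: initialize phi from the
    d_{fy} alone, infer p(y|x) by Bayes rule, then re-apply (13)."""
    phi = d / d.sum(axis=0, keepdims=True)
    log_post = X @ np.log(phi) + log_pi       # unnormalized log p(y|x)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)   # normalize per document
    return mnb_phi(X, post, d)
```

With d_{fy} = 1 + k n_y s_{f⇒y} this yields IDP/SWIRL, and with d_{fy} = 1 + k 1{s_{f⇒y} > 0} it yields IDP/Settles.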

Figure 1. Screenshot of the feature volunteering interface.

Data cleaning. We normalized the volunteered phrases by case-folding, punctuation removal, and space normalization. We manually corrected obvious misspellings. We also manually mapped different forms of a feature to its dictionary canonical form: for example, we mapped "lay-up" and "lay up" to "layup." The average (and maximum) list length is 39 (91) for sports, 20 (46) for movies, and 40 (85) for webkb. Most participants volunteered features at a fairly uniform speed for the first five minutes or so; some then exhausted ideas. This suggests the importance of combining feature volunteering with feature and document querying, as mentioned earlier. This combination is left for future work.

Unlabeled corpus U. Computing (8) requires an unlabeled corpus. We produce U for the sports domain by collecting 1123 Wikipedia documents via a shallow web crawl starting from the top-level wiki-category for the five sport labels (e.g., "Category:Baseball"). We produce a matching U for the movies and webkb domains from the standard movie sentiment corpus (Pang et al., 2002) (2000 instances) and the WebKB corpus (Craven et al., 1998) (4199 instances), respectively. Note that each U is treated as unlabeled when training our models; however, we use each U in a transductive fashion to serve as our test set as well.

Training with GE. We define the feature set for learning F+ (i.e., the dimensionality in (6)) to be the union of F (volunteered phrases) plus all unigrams occurring at least 50 times in the corpus U for that domain. That is, we include all volunteered phrases, even if they are not a unigram or appear fewer than 50 times in the corpus. |F+| for each domain is listed in Table 2. We construct the reference distributions according to GE/SWIRL, GE/Equal, and GE/Schapire as in Section 3.2, and find the optimal logistic regression models p*(y | x) by solving (9) with LBFGS for each domain and reference distribution.

Training with IDP. We use the same F+ as in GE training, construct IDP hyperparameters according to IDP/SWIRL and IDP/Settles, and learn MNB classifiers as in Section 3.3 using uniform π_y. Following prior work, we apply one-step EM with k = 50.

Feature Voting Baseline (FV). We also include a simple baseline for both frameworks. To classify a document x, FV scans through the unique volunteered features for which x_f > 0. Each class y for which f⇒y exists in z^(1), ..., z^(N) receives one vote. At the end, FV predicts the label with the most votes. Ties are broken randomly; due to this randomness, the accuracy of FV is measured by averaging over 20 trials.

Results. Text classifiers built from volunteered features and SWIRL consistently outperform the baselines. Classification accuracies of the different models under GE and IDP are shown in Table 3(a,b). For each domain, we show the best accuracy in boldface, as well as any accuracies whose difference from the best is not statistically significant². Under the GE framework, GE/SWIRL is the best on movies and webkb, and is indistinguishable from the best on sports. Under the IDP framework, IDP/SWIRL consistently outperforms all baselines.

² Using paired two-tailed t-tests, p < 0.05.

The fact that both GE/SWIRL and IDP/SWIRL are better than (or on par with) the baselines under both frameworks strongly indicates that order matters. That is, when working with human-generated lists, the item order carries information that can be useful to machine learning algorithms. Such information can be extracted by SWIRL parameter estimates and successfully incorporated into a secondary classification task. Although dominated by SWIRL-based approaches, FV is reasonably strong and may be a quick stand-in due to its simplicity.

A caveat: the human participants were only informed of the class labels and did not know U. A mildly amusing mismatch ensued. For example, the webkb corpus was collected in 1997 (before the "social media" era), but the volunteered feature labels
included "facebook⇒student," "dropbox⇒course," "reddit⇒faculty," and so on. Our convenient but outdated choice of U quite possibly explains the low accuracy of all methods in the webkb domain.

4. Application II: Verbal Fluency for Brain-Damaged Patients

One human list-generation task that has received detailed examination in cognitive psychology is "verbal fluency." Human participants are asked to say as many examples of a category as possible in one minute with no repetitions (Glasdjo et al., 1999).³ For instance, participants may be asked to generate examples of a semantic category (e.g., "animals" or "furniture"), a phonemic or orthographic category (e.g., "words beginning with the letter F"), or an ad-hoc category (e.g., "things you would rescue from a burning house").

³ Despite the instruction, people still do repeat.

Verbal fluency has been widely adopted in neurology to aid in the diagnosis of cognitive dysfunction (Monsch et al., 1992; Troyer et al., 1998; Rosser & Hodges, 1994). Category and letter fluency in particular are sensitive to a broad range of cognitive disorders resulting from brain damage (Rogers et al., 2006). For instance, patients with prefrontal injuries are prone to inappropriately repeating the same item several times (perseverative errors) (Baldo & Shimamura, 1998), whereas patients with pathology in the anterior temporal cortex are more likely to generate incorrect responses (semantic errors) and produce many fewer items overall (Hodges et al., 1999; Rogers et al., 2006). Despite these observations and the widespread adoption of the task, standard methods for analyzing the data are comparatively primitive: correct responses and different error types are counted, while sequential information is typically discarded.

We propose to use SWIRL as a computational model of the verbal fluency task, since we are unaware of any other such models. We show that, though overly simplified in some respects, SWIRL nevertheless estimates key parameters that correspond to cognitive mechanisms. We further show that these estimates differ in healthy populations versus patients with temporal-lobe epilepsy, a neurological disorder thought to disrupt semantic knowledge. Finally, we report promising classification results, indicating that our model could be useful in aiding diagnosis of cognitive dysfunction in the future.

Participants. We investigated fluency data generated from two populations: a group of 27 patients with temporal-lobe epilepsy (a disorder thought to disrupt semantic abilities), and a group of 24 healthy controls matched to the patients in age, education, sex, nonverbal IQ, and working-memory span. Patients were recruited through an epilepsy clinic at the University of Wisconsin-Madison. Controls were recruited through fliers posted throughout Madison, Wisconsin.

Procedure. We conducted four category-fluency tasks: animals, vehicles, dogs, and boats. In each task, participants were shown a category name on a computer screen and were instructed to verbally list as many examples of the category as possible in 60 seconds without repetition. The audio recordings were later transcribed by lab technicians to render word lists. We normalize the lists by expanding abbreviations ("lab" → "labrador"), removing inflections ("birds" → "bird"), and discarding junk utterances and interjections. The average (and maximum) list length is 20 (37) for animals, 14 (33) for vehicles, 11 (23) for dogs, and 11 (20) for boats.

Results. We estimated the SWIRL parameters λ, s, α using (2) for the patient and control groups on each task separately, and observed that:

1. Patients produced shorter lists. Figure 2(a) shows that the estimated Poisson intensity λ̂ (i.e., the average list length) is smaller for patients on all four tasks. This is consistent with the psychological hypothesis that patients suffering from temporal-lobe epilepsy produce items at a slower rate, and hence fewer items within the time limit.

2. Patients and controls have similar lexicon distributions, but both deviate from word usage frequency. Figure 2(b) shows the top 10 lexicon items and their probabilities (normalized s) for patients (s_P) and controls (s_C) in the animals task, sorted by (s_Pi + s_Ci)/2. s_P and s_C are qualitatively similar. We also show corpus probabilities s_W from the Google Web 1T 5-gram data set (Brants et al., 2007) for comparison (normalized on items appearing in the human-generated lists). Not only does s_W have smaller values, but its order is very different: horse (0.06) is second largest, while other top corpus words like fish (0.05) and mouse (0.03) are not even in the human top 10. This challenges the psychological view that verbal fluency largely follows real-world lexicon usage frequency. Both observations are supported quantitatively by comparing the Jensen-Shannon divergence (JSD)⁴ between the whole distributions s_P, s_C, s_W; see Figure 2(c). Clearly, s_P and s_C are relatively close, and both are far from s_W.

⁴ Each group's MLE of s (and hence p) has zero probability on items appearing only in the other groups. We use JSD since it is symmetric and allows disjoint supports.
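The Jensen-Shannon divergence used to compare s_P, s_C, and s_W can be computed directly on the lexicon distributions; unlike KL divergence, it stays finite when the two supports are disjoint. This sketch represents distributions as dicts mapping items to probabilities and uses a base-2 logarithm (an assumed convention, not stated in the paper):

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    given as dicts item -> probability. Finite even when the supports
    of p and q are disjoint, since both are compared to the mixture m."""
    support = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in support}

    def kl(a):
        # KL(a || m); terms with a(w) = 0 contribute nothing
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

`jsd(p, p)` is 0, and two distributions with disjoint supports attain the maximum value of 1 bit.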
[Figure 2: four panels. (a) Bar plot of the estimated list length λ̂ for patients vs. controls on the four tasks. (b) Table of the top 10 "animal" words with probabilities s_P, s_C, s_W. (c) Table of Jensen-Shannon divergences between distribution pairs per task. (d) Bar plot of the estimated discount factor α̂ for patients vs. controls.]

(b) Top 10 "animal" words:

  Item       s_P   s_C   s_W
  cat        .15   .13   .05
  dog        .17   .11   .08
  lion       .04   .04   .01
  tiger      .04   .03   .01
  bird       .04   .02   .03
  elephant   .03   .03   .01
  zebra      .01   .04   .00
  bear       .03   .03   .03
  snake      .02   .02   .01
  horse      .02   .03   .06

(c) Distribution comparisons (JSD):

  Task      (s_P, s_C)  (s_P, s_W)  (s_C, s_W)
  animals      .09         .17         .19
  vehicles     .18         .23         .25
  dogs         .15         .54         .55
  boats        .15         .54         .52

Figure 2. Verbal fluency experimental results. SWIRL distributions for patients, controls, and general word frequency on the World Wide Web (using Google 1T 5-gram data) are denoted by s_P, s_C, and s_W, respectively.
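The role of the discount factor α shown in Figure 2(d) is easy to see in a toy simulation of sampling with reduced replacement: draw an item with probability proportional to its current size, then multiply that size by α. This is a deliberately minimal sketch of the mechanism (fixed number of draws, known lexicon, no Poisson length or parameter estimation), with function names of our own choosing:

```python
import random

def swirl_draws(sizes, alpha, n, rng):
    """Sample n items with reduced replacement: each draw is
    proportional to the current sizes, and the drawn item's size is
    then multiplied by the discount factor alpha."""
    sizes = dict(sizes)                  # work on a copy
    out = []
    for _ in range(n):
        r = rng.random() * sum(sizes.values())
        acc = 0.0
        for item, s in sizes.items():
            acc += s
            if r <= acc:
                out.append(item)
                sizes[item] *= alpha     # reduced replacement
                break
    return out

def repeats(seq):
    """Number of repeated draws in a generated list."""
    return len(seq) - len(set(seq))
```

With α close to 1 an item's chance of being re-drawn barely drops, so lists contain more repeats; with α close to 0 the process behaves almost like sampling without replacement.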

3. Patients discount less and repeat more. Figure 2(d) shows the estimated discount factor α̂. In three out of four tasks, the patient group has larger α̂. Recall that a larger α̂ leaves an item's size relatively unchanged, thus increasing the item's chance of being sampled again; in other words, more repeats. This is consistent with the psychological hypothesis that patients have a reduced ability to inhibit items already produced.

Healthy vs. Patient Classification. We conducted additional experiments in which we used SWIRL parameters to build downstream healthy vs. patient classifiers. Specifically, we performed leave-one-out (LOO) classification experiments for each of the four verbal fluency tasks. Given a task, each training set (minus one person's list, held out for testing) was used to learn two separate SWIRL models with MAP estimation (5): one for patients and the other for healthy participants. We set τ1 = ∞ and β0 = 1 due to the finding that both populations have similar lexicon distributions, and τ2 = 0. We classified the held-out test list by thresholding the likelihood ratio of the patient vs. healthy models at 1. The LOO accuracies of SWIRL on animals, vehicles, dogs, and boats were 0.647, 0.706, 0.784, and 0.627, respectively. In contrast, the majority vote baseline has accuracy 27/(24 + 27) = 0.529 for all tasks. The improvement for the dogs task over the baseline approach is statistically significant⁵.

⁵ Using paired two-tailed t-tests, p < 0.05.

5. Conclusion and Future Work

Human list generation is an interesting process of data creation that deserves attention from the machine learning community. Our initial foray into modeling human-generated lists by sampling with reduced replacement (SWIRL) has resulted in two interesting applications for both machine learning (effectively combining SWIRL statistics with modern feature-labeling frameworks) and cognitive psychology (modeling memory in healthy vs. brain-damaged populations, and predicting cognitive dysfunction).

Learning from human-generated lists opens up several lines of future work. For example: (i) Designing a "supervision pipeline" that combines feature volunteering with feature label querying, document label querying, and other forms of interactive learning to build better text classifiers more quickly. (ii) Identifying more applications which can benefit from models of human list generation. For example, creating lists of photo tags on Flickr.com, or hashtags on Twitter.com, can be viewed as a form of human list generation conditioned on a specific photo or tweet. (iii) Developing a hierarchical version of SWIRL, so that each human has their own personalized parameters λ, s, and α, while a group has summary parameters, too. This is particularly attractive for applications like verbal fluency, where we want to understand both individual and group behaviors. (iv) Developing structured models of human list generation that can capture and learn from people's tendency to generate "runs" of semantically-related items in their lists (e.g., pets, then predators, then fish).

Acknowledgments

The authors were supported in part by National Science Foundation grants IIS-0953219, IIS-0916038, IIS-1216758, IIS-0968487, DARPA, and Google. The patient data was collected under National Institutes of Health grant R03 NS061164 with TR as the PI. We thank Bryan Gibson for help collecting the feature volunteering data.
References

Baldo, J. V. and Shimamura, A. P. Letter and category fluency in patients with frontal lobe lesions. Neuropsychology, 12(2):259–267, 1998.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.

Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. Large language models in machine translation. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.

Chesson, J. A non-central multivariate hypergeometric distribution arising from biased sampling with application to selective predation. Journal of Applied Probability, 13(4):795–797, 1976.

Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 509–516. AAAI Press, 1998.

Druck, G., Mann, G., and McCallum, A. Learning from labeled features using generalized expectation criteria. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 595–602, 2008.

Fog, A. Calculation methods for Wallenius' noncentral hypergeometric distribution. Communications in Statistics – Simulation and Computation, 37(2):258–273, 2008.

Glasdjo, J. A., Schuman, C. C., Evans, J. D., Peavy, G. M., Miller, S. W., and Heaton, R. K. Norms for letter and category fluency: Demographic corrections for age, education, and ethnicity. Assessment, 6(2):147–178, 1999.

Hodges, J. R., Garrard, P., Perry, R., Patterson, K., Bak, T., and Gregory, C. The differentiation of semantic dementia and frontal lobe dementia from early Alzheimer's disease: a comparative neuropsychological study. Neuropsychology, 13:31–40, 1999.

Liu, B., Li, X., Lee, W. S., and Yu, P. S. Text classification by labeling words. In Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 425–430. AAAI Press, 2004.

Monsch, A. U., Bondi, M. W., Butters, N., Salmon, D. P., Katzman, R., and Thal, L. J. Comparisons of verbal fluency tasks in the detection of dementia of the Alzheimer type. Archives of Neurology, 49(12):1253–1258, 1992.

Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79–86. ACL, 2002.

Rogers, T. T., Ivanoiu, A., Patterson, K., and Hodges, J. R. Semantic memory in Alzheimer's disease and the fronto-temporal dementias: A longitudinal study of 236 patients. Neuropsychology, 20(3):319–335, 2006.

Rosser, A. and Hodges, J. R. Initial letter and semantic category fluency in Alzheimer's disease, Huntington's disease, and progressive supranuclear palsy. Journal of Neurology, Neurosurgery, and Psychiatry, 57:1389–1394, 1994.

Settles, B. Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1467–1478. ACL, 2011.

Troyer, A. K., Moscovitch, M., Winocur, G., Alexander, M., and Stuss, D. Clustering and switching on verbal fluency: The effects of focal frontal- and temporal-lobe lesions. Neuropsychologia, 36(6), 1998.

Wallenius, K. T. Biased Sampling: The Non-Central Hypergeometric Probability Distribution. PhD thesis, Department of Statistics, Stanford University, 1963.