Online Latent Dirichlet Allocation with Infinite Vocabulary


Ke Zhai                                                                         
Department of Computer Science, University of Maryland, College Park, MD USA
Jordan Boyd-Graber                                                              
iSchool and UMIACS, University of Maryland, College Park, MD USA
Abstract

Topic models based on latent Dirichlet allocation (LDA) assume a predefined vocabulary. This is reasonable in batch settings but not reasonable for streaming and online settings. To address this lacuna, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings rather than from a finite Dirichlet. We develop inference using online variational inference and, to consider only a finite number of words for each topic, propose heuristics to dynamically order, expand, and contract the set of words we consider in our vocabulary. We show our model can successfully incorporate new words and that it performs better than topic models with finite vocabularies in evaluations of topic quality and classification performance.

1. Introduction

Latent Dirichlet allocation (LDA) is a probabilistic approach for exploring topics in document collections (Blei et al., 2003). Topic models offer a formalism for exposing a collection’s themes and have been used to aid information retrieval (Wei & Croft, 2006), understand academic literature (Dietz et al., 2007), and discover political perspectives (Paul & Girju, 2010).

As hackneyed as the term “big data” has become, researchers and industry alike require algorithms that are scalable and efficient. Topic modeling is no different. A common scalability strategy is converting batch algorithms into streaming algorithms that only make one pass over the data. In topic modeling, Hoffman et al. (2010) extended LDA to online settings.

However, this and later online topic models (Wang et al., 2011; Mimno et al., 2012) make the same limiting assumption. The namesake topics, distributions over words that evince thematic coherence, are always modeled as a multinomial drawn from a finite Dirichlet distribution. This assumption precludes additional words being added over time.

Particularly for streaming algorithms, this is neither reasonable nor appealing. There are many reasons immutable vocabularies do not make sense: words are invented (“crowdsourcing”), words cross languages (“Gangnam”), or words common in one context become prominent elsewhere (“vuvuzelas” moving from music to sports in the 2010 World Cup). To be flexible, topic models must be able to capture the addition, invention, and increased prominence of new terms.

Allowing models to expand topics to include additional words requires changing the underlying statistical formalism. Instead of assuming that topics come from a finite Dirichlet distribution, we assume that they come from a Dirichlet process (Ferguson, 1973) with a base distribution over all possible words, of which there are an infinite number. Bayesian nonparametric tools like the Dirichlet process allow us to reason about distributions over infinite supports. We review both topic models and Bayesian nonparametrics in Section 2. In Section 3, we present the infinite vocabulary topic model, which uses Bayesian nonparametrics to go beyond fixed vocabularies.

In Section 4, we derive approximate inference for our model. Since emerging vocabulary is most important in non-batch settings, in Section 5, we extend inference to streaming settings. We compare the coherence and effectiveness of our infinite vocabulary topic model against models with fixed vocabulary in Section 6.

Figure 1 shows a topic evolving during inference. The algorithm processes documents in sets we call minibatches; after each minibatch, online variational inference updates our model’s parameters. This shows that out-of-vocabulary words can enter topics and eventually become high probability words.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

[Figure 1 image: ten ranked word lists for one topic, one per column, after minibatches 2, 3, 5, 8, 10, 16, 17, 39, 83, and 120. A box below the lists shows words added at the corresponding minibatch, including: minibatch 2: hiram, moskowitz; 3: crohn, corpu; 5: captain, seqitur; 8: comicstrip, mutant, patlafontain; 10: bloom, hulk, mazelyah; 16: wolverin, albion; 17: laci; 39: izzo; 83: gown.
Settings: number of topics K = 50; truncation level T = 20K; minibatch size S = 155; DP scale parameter α^β = 5000; reordering delay U = 20; learning inertia τ0 = 256; learning rate κ = 0.6.]

Figure 1. The evolution of a single “comic book” topic from the 20 newsgroups corpus. Each column is a ranked list of
word probabilities after processing a minibatch (numbers preceding words are the exact rank). The box below the topics
contains words introduced in a minibatch. For example, “hulk” first appeared in minibatch 10, was ranked at 9659 after
minibatch 17, and became the second most important word by the final minibatch. Colors help show words’ trajectories.

2. Background

Latent Dirichlet allocation (Blei et al., 2003) assumes a simple generative process. The $K$ topics, drawn from a symmetric Dirichlet distribution, $\beta_k \sim \mathrm{Dir}(\eta)$, $k \in \{1, \dots, K\}$, generate a corpus of observed words:

 1: for each document $d$ in a corpus $D$ do
 2:     Choose a distribution $\theta_d$ over topics from a Dirichlet distribution, $\theta_d \sim \mathrm{Dir}(\alpha^\theta)$.
 3:     for each of the $n = 1, \dots, N_d$ word indexes do
 4:         Choose a topic $z_n$ from the document’s distribution over topics, $z_n \sim \mathrm{Mult}(\theta_d)$.
 5:         Choose a word $w_n$ from the appropriate topic’s distribution over words, $p(w_n \mid \beta_{z_n})$.

Implicit in this model is a finite number of words in the vocabulary because the support of the Dirichlet distribution $\mathrm{Dir}(\eta)$ is fixed. Moreover, it fixes a priori which words we can observe, a patently false assumption (Algeo, 1980).

2.1. Bayesian Nonparametrics

Bayesian nonparametrics is an appealing solution; it models arbitrary distributions with an unbounded and possibly countably infinite support. While Bayesian nonparametrics is a broad field, we focus on the Dirichlet process (DP, Ferguson 1973).

The Dirichlet process is a two-parameter distribution with scale parameter $\alpha^\beta$ and base distribution $G_0$. A draw $G$ from $\mathrm{DP}(\alpha^\beta, G_0)$ is modeled as

$$b_1, \dots, b_i, \dots \sim \mathrm{Beta}(1, \alpha^\beta), \qquad \rho_1, \dots, \rho_i, \dots \sim G_0.$$

Individual draws from a Beta distribution are the foundation for the stick-breaking construction of the DP (Sethuraman, 1994). Each break point $b_i$ models how much probability mass remains. These break points combine to form an infinite multinomial,

$$\beta_i \equiv b_i \prod_{j=1}^{i-1} (1 - b_j), \qquad G \equiv \sum_i \beta_i \delta_{\rho_i}, \tag{1}$$
where the weights $\beta_i$ give the probability of selecting …

 4:     Set stick weights $\beta_{kt} = b_{kt}$ …
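The stick-breaking construction in Eqn. (1) can be illustrated with a short sketch. It is not the authors' implementation: the function names, the seed, and the finite number of atoms are assumptions made for illustration, and the atoms $\rho_i$ drawn from $G_0$ are omitted because the base distribution over strings is not specified in this excerpt.

```python
import random

def weights_from_breaks(breaks):
    """Turn break points b_1, b_2, ... into weights beta_i = b_i * prod_{j<i} (1 - b_j)."""
    weights = []
    remaining = 1.0  # length of stick left before the current break
    for b in breaks:
        weights.append(b * remaining)  # atom i takes a fraction b of what is left
        remaining *= 1.0 - b
    return weights

def sample_stick_weights(alpha_beta, num_atoms, seed=0):
    """Draw the first `num_atoms` weights of G ~ DP(alpha_beta, G0).

    `num_atoms` and `seed` are illustrative assumptions; the true draw has
    infinitely many atoms.
    """
    rng = random.Random(seed)
    breaks = [rng.betavariate(1.0, alpha_beta) for _ in range(num_atoms)]
    return weights_from_breaks(breaks)

weights = sample_stick_weights(alpha_beta=5.0, num_atoms=50)
```

For instance, `weights_from_breaks([0.5, 0.5, 0.5])` yields `[0.5, 0.25, 0.125]`. A larger scale parameter makes each break smaller on average, so mass spreads over more atoms.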

the evidence lower bound (ELBO) $\mathcal{L}$,

$$\log p(W) \ge \mathbb{E}_{q(Z)}[\log p(W, Z)] - \mathbb{E}_{q(Z)}[\log q(Z)] = \mathcal{L}. \tag{2}$$

Maximizing $\mathcal{L}$ is equivalent to minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior.

Unlike mean-field approaches (Blei et al., 2003), which assume $q$ is a fully factorized distribution, we integrate out the word-level topic distribution vector $\theta$: $q(z_d \mid \eta)$ is a single distribution over $K^{N_d}$ possible topic configurations rather than a product of $N_d$ multinomial distributions over $K$ topics. Combined with a beta distribution $q(b_{kt} \mid \nu^1_{kt}, \nu^2_{kt})$ for stick break points, the variational distribution $q$ is

$$q(Z) \equiv q(\beta, z) = \prod_{D} q(z_d \mid \eta) \prod_{K} q(b_k \mid \nu^1_k, \nu^2_k). \tag{3}$$

However, we cannot explicitly represent a distribution over all possible strings, so we truncate our variational stick-breaking distribution $q(b \mid \nu)$ to a finite set.

4.1. Truncation Ordered Set

Variational methods typically cope with infinite dimensionality of nonparametric models by truncating the distribution to a finite subset of all possible atoms that nonparametric distributions consider (Blei & Jordan, 2005; Kurihara et al., 2006; Boyd-Graber & Blei, 2009). This is done by selecting a relatively large truncation index $T_k$, and then stipulating that the variational distribution uses the rest of the available stick at that index, i.e., $q(b_{T_k} = 1) \equiv 1$. As a consequence, $\beta$ is zero in expectation under $q$ beyond that index.

However, directly applying such a technique is not feasible here, as truncation is not just a search over … a higher probability than words with higher indices.

Each topic has a unique TOS $T_k$ of limited size that maps every word type $w$ to an integer $t$; thus $t = T_k(w)$ is the index of the atom $\rho_{kt}$ that corresponds to $w$. We defer how we choose this mapping until Section 4.3. More pressing is how we compute the two variational distributions of interest. For $q(z \mid \eta)$, we use local collapsed MCMC sampling (Mimno et al., 2012) and for $q(b \mid \nu)$ we use stochastic variational inference (Hoffman et al., 2010). We describe both in turn.

4.2. Stochastic Inference

Recall that the variational distribution $q(z_d \mid \eta)$ is a single distribution over the $N_d$ vectors of length $K$. While this removes the tight coupling between $\theta$ and $z$ that often complicates mean-field variational inference, it is no longer as simple to determine the variational distribution $q(z_d \mid \eta)$ that optimizes Eqn. (2). However, Mimno et al. (2012) showed that Gibbs sampling instantiations of $z^*_{dn}$ from the distribution conditioned on other topic assignments results in a sparse, efficient empirical estimate of the variational distribution. In our model, the conditional distribution of a topic assignment of a word with TOS index $t = T_k(w_{dn})$ is

$$q(z_{dn} = k \mid z_{-dn}, t = T_k(w_{dn})) \propto \Big( \sum_{\substack{m=1 \\ m \ne n}}^{N_d} \mathbb{I}_{z_{dm} = k} + \alpha^\theta_k \Big) \exp\big\{ \mathbb{E}_{q(\nu)}[\log \beta_{kt}] \big\}. \tag{4}$$

We iteratively sample from this conditional distribution to obtain the empirical distribution $\phi_{dn} \equiv \hat{q}(z_{dn})$ for latent variable $z_{dn}$, which is fundamentally different from the mean-field approach (Blei et al., 2003).

There are two cases to consider for computing Eqn. (4): whether a word $w_{dn}$ is in the TOS for topic $k$ or not. First, we look up the word’s index $t = T_k(w_{dn})$. If this word is in the TOS, i.e., $t \le T_k$, the expectations are straightforward (Mimno et al., 2012)

$$q(z_{dn} = k) \propto \Big( \sum_{\substack{m=1 \\ m \ne n}}^{N_d} \phi_{dmk} + \alpha^\theta_k \Big) \cdot \exp\{ \Psi(\nu^1_{kt}) \, \dots \tag{5}$$

… $> T_k$) is then

$$q(z_{dn} = k) \propto \Big( \sum_{\substack{m=1 \\ m \ne n}}^{N_d} \phi_{dmk} + \alpha^\theta_k \Big) \cdot \exp\Big\{ \sum_{s=1}^{s \le t} \Psi(\nu^2_{ks}) - \Psi(\nu^1_{ks} + \nu^2_{ks}) \Big\}. \tag{6}$$

This is different from finite vocabulary topic models that set vocabulary a priori and ignore OOV words.
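To make the conditional in Eqn. (4) concrete, the sketch below computes $\mathbb{E}_q[\log \beta_{kt}]$ for one topic and then normalizes the conditional for one token. It is a sketch under stated assumptions, not the paper's code: it relies on the standard Beta identities $\mathbb{E}[\log b] = \Psi(\nu^1) - \Psi(\nu^1 + \nu^2)$ and $\mathbb{E}[\log(1 - b)] = \Psi(\nu^2) - \Psi(\nu^1 + \nu^2)$ together with the stick-breaking form of Eqn. (1). The helper `digamma` and all function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (recurrence, then an asymptotic series)."""
    result = 0.0
    while x < 6.0:               # psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def expected_log_stick_weights(nu1, nu2):
    """E_q[log beta_t] per TOS index t, with b_t ~ Beta(nu1[t], nu2[t])."""
    out = []
    log_remaining = 0.0  # running sum of E[log(1 - b_s)] over s < t
    for a, b in zip(nu1, nu2):
        total = digamma(a + b)
        out.append(digamma(a) - total + log_remaining)
        log_remaining += digamma(b) - total
    return out

def topic_conditional(topic_counts, alpha_theta, e_log_beta_word):
    """Normalized Eqn. (4) over topics for a single token.

    topic_counts[k]: tokens in the document assigned to topic k, excluding this one.
    e_log_beta_word[k]: E_q[log beta_kt] at this word's TOS index in topic k.
    """
    log_p = [math.log(c + a) + e
             for c, a, e in zip(topic_counts, alpha_theta, e_log_beta_word)]
    top = max(log_p)                      # subtract the max for numerical stability
    unnorm = [math.exp(v - top) for v in log_p]
    z = sum(unnorm)
    return [u / z for u in unnorm]

probs = topic_conditional([3, 0], [0.5, 0.5], [math.log(0.2), math.log(0.1)])
```

Sampling a topic for the token from `probs` and repeating over tokens gives the empirical estimate $\phi_{dn}$ described above.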

4.3. Refining the Truncation Ordered Set

In this section, we describe heuristics to update the TOS inspired by MCMC conditional equations, a common practice for updating truncations. One component of a good TOS is that more frequent words should come first in the ordering. This is reasonable because the stick-breaking prior induces a size-biased ordering of the clusters. This has previously been used for truncation optimization for Dirichlet process mixtures and admixtures (Kurihara et al., 2007).

Another component of a good TOS is that words consistent with the underlying base distribution should be ranked higher than those not consistent with the base distribution. This intuition is also consistent with the conditional sampling equations for MCMC inference (Müller & Quintana, 2004); the probability of creating a new table with dish $\rho$ is proportional to $\alpha^\beta G_0(\rho)$ in the Chinese restaurant process.

Thus, to update the TOS, we define the ranking score of word $t$ in topic $k$ as

$$R(\rho_{kt}) = p(\rho_{kt} \mid G_0) \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{dnk} \delta_{\omega_{dn} = \rho_{kt}}, \tag{7}$$

sort all words by the scores within that topic, and then use those positions as the new TOS. In Section 5.1, we present online updates for the TOS.

5. Online Inference

Online variational inference seeks to optimize the ELBO $\mathcal{L}$ according to Eqn. (2) by stochastic gradient optimization. Because gradients estimated from a single observation are noisy, stochastic inference for topic models typically uses “minibatches” of $S$ documents out of $D$ total documents (Hoffman et al., 2010).

An approximation of the natural gradient of $\mathcal{L}$ with respect to $\nu$ is the product of the inverse Fisher information and its first derivative (Sato, 2001)

$$\begin{aligned}
\Delta\nu^1_{kt} &= 1 + \frac{D}{|S|} \sum_{d \in S} \sum_{n=1}^{N_d} \phi_{dnk} \delta_{\omega_{dn} = \rho_{kt}} - \nu^1_{kt} \\
\Delta\nu^2_{kt} &= \alpha^\beta + \frac{D}{|S|} \sum_{d \in S} \sum_{n=1}^{N_d} \phi_{dnk} \delta_{\omega_{dn} > \rho_{kt}} - \nu^2_{kt},
\end{aligned} \tag{8}$$

which leads to an update of $\nu$,

$$\nu^1_{kt} = \nu^1_{kt} + \epsilon \cdot \Delta\nu^1_{kt}, \qquad \nu^2_{kt} = \nu^2_{kt} + \epsilon \cdot \Delta\nu^2_{kt}, \tag{9}$$

where $\epsilon_i = (\tau_0 + i)^{-\kappa}$ defines the step size of the algorithm in minibatch $i$. The learning rate $\kappa$ controls how quickly new parameter estimates replace the old; $\kappa \in (0.5, 1]$ is required for convergence. The learning inertia $\tau_0$ prevents premature convergence. We recover the batch setting if $S = D$ and $\kappa = 0$.

5.1. Updating the Truncation Ordered Set

A nonparametric streaming model should allow the vocabulary to dynamically expand as new words appear (e.g., introducing “vuvuzelas” for the 2010 World Cup), and contract as needed to best model the data (e.g., removing “vuvuzelas” after the craze passes). We describe three components of this process: expanding the truncation, refining the ordering of TOS, and contracting the vocabulary.

Determining the TOS Ordering. This process depends on the ranking score of a word in topic $k$ at minibatch $i$, $R_{i,k}(\rho)$. Ideally, we would compute $R$ from all data. However, only a single minibatch is accessible. We have a per-minibatch rank estimate

$$r_{i,k}(\rho) = p(\rho \mid G_0) \cdot \frac{D}{|S_i|} \sum_{d \in S_i} \sum_{n=1}^{N_d} \phi_{dnk} \delta_{\omega_{dn} = \rho},$$

which we interpolate with our previous ranking

$$R_{i,k}(\rho) = (1 - \epsilon) \cdot R_{i-1,k}(\rho) + \epsilon \cdot r_{i,k}(\rho). \tag{10}$$

We introduce an additional algorithm parameter, the reordering delay $U$. We found that reordering after every minibatch ($U = 1$) was not effective; we explore the role of reordering delay in Section 6. After $U$ minibatches have been observed, we reorder the TOS for each topic according to the words’ ranking score $R$ in Eqn. (10); $T_k(w)$ becomes the rank position of $w$ according to the latest $R_{i,k}$.

Expanding the Vocabulary. Each minibatch contains words we have not seen before. When we see them, we must determine their relative rank position in the TOS, their rank scores, and their associated variational parameters. The latter two issues are relevant for online inference because both are computed via interpolations from previous values in Eqns. (10) and (9). For an unseen word $\omega$, previous values are undefined. Thus, we set $R_{i-1,k}$ for unobserved words to be 0, $\nu$ to be 1, and $T_k(\omega)$ is $T_k + 1$ (i.e., increase truncation and append to the TOS).

Contracting the Vocabulary. To ensure tractability we must periodically prune the words in the TOS. When we reorder the TOS (after every $U$ minibatches), we only keep the top $T$ terms, where $T$ is a user-defined integer. A word type $\rho$ will be removed from $T_k$ if its index $T_k(\rho) > T$, and its previous information (e.g., rank and variational parameters) is discarded. In a later minibatch, if a previously discarded word reappears, it is treated as a new word.
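The three TOS operations of Section 5.1 (interpolating rank scores as in Eqn. (10), admitting unseen words with a previous score of 0, and keeping only the top T words) can be sketched for a single topic as follows. This is an illustrative sketch rather than the authors' code: the function and argument names are assumptions, ties are broken alphabetically as an arbitrary choice, and the sketch reorders on every call, whereas the text reorders only after every U minibatches.

```python
def update_tos(prev_scores, batch_counts, base_prob, step, scale, truncation):
    """One score update plus reordering pass for a single topic.

    prev_scores: dict word -> R_{i-1,k}(word) for words currently in the TOS.
    batch_counts: dict word -> sum of phi_dnk over minibatch tokens of that word.
    base_prob: function word -> p(word | G0).
    step: interpolation weight epsilon_i; scale: D / |S_i|; truncation: T.
    Returns (scores, tos), where tos maps each kept word to its 1-based rank.
    """
    scores = {}
    for word in set(prev_scores) | set(batch_counts):
        # Per-minibatch rank estimate r_{i,k}(word).
        r = base_prob(word) * scale * batch_counts.get(word, 0.0)
        # Unseen words enter with a previous score of 0.
        prev = prev_scores.get(word, 0.0)
        scores[word] = (1.0 - step) * prev + step * r
    # Higher scores come first; keep only the top `truncation` words.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))[:truncation]
    scores = {w: scores[w] for w in ranked}   # pruned words lose their history
    tos = {w: rank for rank, w in enumerate(ranked, start=1)}
    return scores, tos

scores, tos = update_tos(
    prev_scores={"issu": 4.0, "copi": 2.0},
    batch_counts={"hulk": 3.0, "issu": 1.0},
    base_prob=lambda w: 0.25,   # hypothetical flat base distribution
    step=0.5, scale=4.0, truncation=2,
)
```

In this toy call the new word "hulk" enters at rank 2 and "copi" is pruned; if "copi" reappeared later it would be scored as a new word because its history was dropped.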
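The stochastic update of the stick parameters in Eqns. (8) and (9) can be sketched for one topic as below. The names are hypothetical, and the sketch assumes that the indicator $\delta_{\omega_{dn} > \rho_{kt}}$ selects tokens whose word sits at a later TOS position than $t$, which is how the second Beta parameter is usually accumulated in stick-breaking models.

```python
def step_size(i, tau0, kappa):
    """Step size epsilon_i = (tau0 + i) ** (-kappa) for minibatch i."""
    return (tau0 + i) ** (-kappa)

def update_nu(nu1, nu2, counts, alpha_beta, scale, step):
    """Apply the natural-gradient step of Eqns. (8)-(9) to one topic.

    nu1, nu2: lists of Beta parameters indexed by TOS position.
    counts[t]: sum of phi_dnk over minibatch tokens whose word is at position t.
    scale: D / |S|; step: epsilon_i.
    """
    new_nu1, new_nu2 = [], []
    later = sum(counts)  # after subtracting counts[t], mass at positions after t
    for a, b, c in zip(nu1, nu2, counts):
        later -= c
        delta1 = 1.0 + scale * c - a
        delta2 = alpha_beta + scale * later - b
        new_nu1.append(a + step * delta1)
        new_nu2.append(b + step * delta2)
    return new_nu1, new_nu2

new_nu1, new_nu2 = update_nu(
    nu1=[1.0, 1.0], nu2=[2.0, 2.0], counts=[3.0, 1.0],
    alpha_beta=2.0, scale=1.0, step=0.5,
)
```

With `step = 1` the old parameters are replaced outright by the minibatch estimate, which matches the remark that $\kappa = 0$ (so $\epsilon_i = 1$) with $S = D$ recovers the batch setting.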

[Figure 2: PMI score against minibatch, faceted by truncation level T and reordering delay U ∈ {10, 20}. (a) de-news, S = 140 and (b) de-news, S = 245: T ∈ {4000, 5000}, lines show infvoc with α^β ∈ {1k, 2k}. (c) 20 newsgroups, S = 155 and (d) 20 newsgroups, S = 310: T ∈ {20000, 30000, 40000}, lines show infvoc with α^β ∈ {3k, 4k, 5k}.]

Figure 2. PMI score on de-news (Figure 2(a) and 2(b), K = 10) and 20 newsgroups (Figure 2(c) and 2(d), K = 50) against
different settings of DP scale parameter α^β, truncation level T, and reordering delay U, under learning rate κ = 0.8 and
learning inertia τ0 = 64. Our model is more sensitive to α^β and less sensitive to T.
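
The learning rate κ and learning inertia τ0 named in the caption control how quickly the online updates shrink. As a minimal sketch, assuming the step-size schedule ρ_t = (τ0 + t)^(-κ) used for online LDA by Hoffman et al. (2010) (this section does not restate the schedule, so the formula, the function names, and the blending helper below are illustrative assumptions):

```python
def step_size(t, tau0=64.0, kappa=0.8):
    """Assumed schedule rho_t = (tau0 + t) ** (-kappa) for minibatch index t >= 0."""
    return (tau0 + t) ** (-kappa)


def blend(old, new, rho):
    """Move a global parameter vector a fraction rho toward a minibatch estimate."""
    return [(1.0 - rho) * o + rho * n for o, n in zip(old, new)]


if __name__ == "__main__":
    # Compare how fast the step decays for the settings explored in the figures.
    for kappa in (0.6, 0.8, 1.0):
        for tau0 in (64, 256):
            steps = [round(step_size(t, tau0, kappa), 4) for t in (0, 10, 100)]
            print(f"kappa={kappa} tau0={tau0} steps={steps}")
```

Under this schedule, a smaller κ or a smaller τ0 gives larger steps, so each new minibatch moves the topics further.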

[Figure 3: PMI score against minibatch, faceted by κ ∈ {0.6, 0.7, 0.8, 0.9, 1} and τ0 ∈ {64, 256}. (a) de-news, S = 245 and K = 10; lines show infvoc with α^β ∈ {1k, 2k} and T ∈ {4k, 5k}. (b) 20 newsgroups, S = 155 and K = 50; lines show infvoc with α^β ∈ {3k, 4k, 5k} and T = 20k.]

Figure 3. PMI score on two datasets with reordering delay U = 20 against different settings of decay factor κ and τ0.
A suitable choice of DP scale parameter α^β increases the performance significantly. Learning parameters κ and τ0
jointly define the step decay. Larger step sizes promote better topic evolution.

6. Experimental Evaluation

In this section, we evaluate the performance of our infinite vocabulary topic model (infvoc) on two corpora: de-news (footnote 1) and 20 newsgroups (footnote 2). Both corpora were parsed by the same tokenizer and stemmer with a common English stopword list (Bird et al., 2009). First, we examine its sensitivity to both model parameters and online learning rates. Having chosen those parameters, we then compare our model with other topic models with fixed vocabularies.

Evaluation Metric. Typical evaluation of topic models is based on held-out likelihood or perplexity. However, creating a strictly fair comparison for our model against existing topic model algorithms is difficult, as traditional topic model algorithms must discard words that have not previously been observed. Moreover, held-out likelihood is a flawed proxy for how topic models are used in the real world (Chang et al., 2009). Instead, we use two evaluation metrics: topic coherence and classification accuracy.

Pointwise mutual information (PMI), which correlates with human perceptions of topic coherence, measures how words fit together within a topic. Following Newman et al. (2009), we extract document co-occurrence statistics from Wikipedia and score a topic’s coherence by averaging the pairwise PMI score (w.r.t. Wikipedia co-occurrence) of the topic’s ten highest ranked words. Higher average PMI implies a more coherent topic.

Classification accuracy is the accuracy of a classifier learned from the topic distribution of training documents applied to test documents (the topic model sees both sets). A higher accuracy means the unsupervised topic model better captures the underlying structure of the corpus. To better simulate real-world situations, the 20 newsgroups test/train split is by date (test documents appeared after training documents).
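
The PMI-based coherence score described above can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the function name `topic_pmi`, the tiny in-memory reference corpus standing in for Wikipedia, and the `eps` smoothing term (used only to avoid taking the log of zero for pairs that never co-occur) are all assumptions.

```python
import math
from itertools import combinations


def topic_pmi(top_words, reference_docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words.

    Probabilities are document co-occurrence frequencies in reference_docs,
    a list of tokenized documents (a stand-in for the Wikipedia statistics).
    """
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)

    def doc_prob(*words):
        # Fraction of reference documents containing every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = doc_prob(w1, w2)
        independent = doc_prob(w1) * doc_prob(w2)
        # eps is an assumed smoothing constant, not taken from the paper.
        scores.append(math.log((joint + eps) / (independent + eps)))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    reference = [
        ["comic", "hero"],
        ["comic", "hero"],
        ["stock", "market"],
        ["stock", "market"],
    ]
    print(topic_pmi(["comic", "hero"], reference))
    print(topic_pmi(["comic", "stock"], reference))
```

In the evaluation described above, `top_words` would be a topic's ten highest ranked words, and the per-topic scores would then be averaged over all topics.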

Comparisons. We evaluate the performance of our model (infvoc) against three other models with fixed vocabularies: online variational Bayes LDA (fixvoc-vb, Hoffman et al. 2010), online hybrid LDA (fixvoc-hybrid, Mimno et al. 2012), and dynamic topic models (dtm, Blei & Lafferty 2006). Including dynamic topic models is not a fair comparison, as their inference requires access to all of the documents in the dataset; unlike the other algorithms, dtm is not online.

    1 A collection of daily news items from 1996 to 2000 in English. It contains 9,756 documents, 1,175,526 word tokens, and 20,000 distinct word types. Available at
    2 A collection of discussions in 20 different newsgroups. It contains 18,846 documents and 100,000 distinct word types. It is sorted by date into roughly 60% training and 40% testing data. Available at


[Figure 4: PMI score against minibatch for dtm-dict, fixvoc-vb-dict, fixvoc-vb-null, fixvoc-hybrid-dict, fixvoc-hybrid-null, and infvoc. (a) de-news, S = 245, K = 10, κ = 0.6 and τ0 = 64; dtm-dict uses tcv = 0.01 and infvoc uses α^β = 2k, T = 4k, U = 10. (b) 20 newsgroups, S = 155, K = 50, κ = 0.8 and τ0 = 64; dtm-dict uses tcv = 0.05 and infvoc uses α^β = 5k, T = 20k, U = 20.]

Figure 4. PMI score on two datasets against different models. Our model infvoc yields a better PMI score against fixvoc and dtm; gains are more marked in later minibatches as more and more proper names have been added to the topics. Because dtm is not an online algorithm, we do not have detailed per-minibatch coherence statistics and thus show topic coherence as a box plot per epoch.

Vocabulary. For fixed vocabulary models, we must decide on a vocabulary a priori. We consider two different vocabulary methods: use the first minibatch to define a vocabulary (null) or use a comprehensive dictionary (footnote 3) (dict). We use the same dictionary to train infvoc’s base distribution.

Experiment Configuration. For all models, we use the same symmetric document Dirichlet prior with α^θ = 1/K, where K is the number of topics. Online models see exactly the same minibatches. For dtm, which is not an online algorithm but instead partitions its input into “epochs”, we combine documents in ten consecutive minibatches into an epoch (longer epochs tended to have worse performance; this was the shortest epoch that had reasonable runtime).

For online hybrid approaches (infvoc and fixvoc-hybrid), we collect 10 samples empirically from the variational distribution in the E-step with 5 burn-in sweeps. For fixvoc-vb, we run 50 iterations for local parameter updates.
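
As a rough illustration of the configuration above, the sketch below builds the symmetric document prior and merges ten consecutive minibatches into one dtm epoch. The helper names are hypothetical, and keeping a shorter final epoch when the number of minibatches is not a multiple of ten is an assumption; the text does not say how leftovers were handled.

```python
def symmetric_doc_prior(num_topics):
    """Symmetric document Dirichlet prior with every entry equal to 1/K."""
    return [1.0 / num_topics] * num_topics


def group_into_epochs(minibatches, per_epoch=10):
    """Concatenate each run of `per_epoch` consecutive minibatches into one epoch.

    Assumption: a trailing run shorter than `per_epoch` becomes a final,
    smaller epoch rather than being dropped.
    """
    epochs = []
    for start in range(0, len(minibatches), per_epoch):
        chunk = minibatches[start:start + per_epoch]
        epochs.append([doc for batch in chunk for doc in batch])
    return epochs


if __name__ == "__main__":
    # 25 toy minibatches of 3 document ids each.
    batches = [[f"doc{b}_{i}" for i in range(3)] for b in range(25)]
    epochs = group_into_epochs(batches)
    print([len(epoch) for epoch in epochs])
    print(symmetric_doc_prior(10)[:3])
```

Because the epochs are built from the same minibatches in the same order, the batch model and the online models see documents in a consistent sequence.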

6.1. Sensitivity to Parameters

Figure 2 shows how the PMI score is affected by the DP scale parameter α^β, the truncation level T, and the reordering delay U. The relatively high values of α^β may be surprising to readers used to seeing a DP that instantiates dozens of atoms, but when vocabularies number in the tens of thousands, such scale parameters are necessary to support the long tail. Although we did not investigate such approaches, this suggests that more advanced nonparametric distributions (Teh, 2006) or explicitly optimizing α^β may be useful. Relatively large values of U suggest that accurate estimates of the rank order are important for maintaining coherent topics.

While infvoc is sensitive to parameters related to the vocabulary, once suitable values of those parameters are chosen, it is no more sensitive to learning-specific parameters than other online LDA algorithms (Figure 3), and values used for other online topic models also work well here.

6.2. Comparing Algorithms: Coherence

Now that we have some idea of how we should set parameters for infvoc, we compare it against other topic modeling techniques. We used grid search to select parameters for each of the models (footnote 4) and plotted the topic coherence averaged over all topics in Figure 4. While infvoc initially holds its own against other models, it does better and better in later minibatches, since it has managed to gain a good estimate of the vocabulary and the topic distributions have stabilized. Most of the gains in topic coherence come from highly specific proper nouns which are missing from vocabularies of the fixed-vocabulary topic models. This advantage holds even against dtm, which uses batch inference.

    4 For the de-news dataset, we select (20 newsgroups parameters in parentheses) minibatch size S ∈ {140, 245} (S ∈ {155, 310}), DP scale parameter α^β ∈ {1k, 2k} (α^β ∈ {3k, 4k, 5k}), truncation size T ∈ {3k, 4k} (T ∈ {20k, 30k, 40k}), reordering delay U ∈ {10, 20} for infvoc; and topic chain variable tcv ∈ {0.001, 0.005, 0.01, 0.05} for dtm.

6.3. Comparing Algorithms: Classification

For the classification comparison, we consider additional topic models. While we need the most probable topic strings for PMI calculations, classification experiments only need a document’s topic vector. Thus, we consider hashed vocabulary schemes. The first, which we call dict-hashing, uses a dictionary for the known words and hashes any other words into the same set
of integers. The second, full-hash, used in Vowpal Wabbit (footnote 5), hashes all words into a set of T integers.

We train 50 topics for all models on the entire dataset and collect the document-level topic distribution for every article. We treat such statistics as features and train an SVM classifier on all training data using Weka (Hall et al., 2009) with default parameters. We then use the classifier to label testing documents with one of the 20 newsgroup labels. A higher accuracy means the model is better capturing the underlying content.

Our model infvoc captures better topic features than online LDA fixvoc (Table 1) under all settings (footnote 6). This suggests that in a streaming setting, infvoc can better categorize documents. However, the batch algorithm dtm, which has access to the entire dataset, performs better because it can use later documents to retrospectively improve its understanding of earlier ones. Unlike dtm, infvoc only sees early minibatches once and cannot revise its model when it is tested on later minibatches.

    6 Parameters were chosen via cross-validation on a 30%/70% dev-test split from the following parameter settings: DP scale parameter α^β ∈ {2k, 3k, 4k}, reordering delay U ∈ {10, 20} (for infvoc only); truncation level T ∈ {20k, 30k, 40k} (for infvoc and fixvoc full-hash models); step decay factors τ0 ∈ {64, 256} and κ ∈ {0.6, 0.7, 0.8, 0.9, 1.0} (for all online models); and topic chain variable tcv ∈ {0.01, 0.05, 0.1, 0.5} (for dtm only).

    setting                      model    settings                        accuracy %
    S = 155, τ0 = 64, κ = 0.6    infvoc   α^β = 3k, T = 40k, U = 10       52.683
                                 fixvoc   vb-dict                         45.514
                                 fixvoc   vb-null                         49.390
                                 fixvoc   hybrid-dict                     46.720
                                 fixvoc   hybrid-null                     50.474
                                 fixvoc   vb dict-hash                    52.525
                                 fixvoc   vb full-hash, T = 30k           51.653
                                 fixvoc   hybrid dict-hash                50.948
                                 fixvoc   hybrid full-hash, T = 30k       50.948
                                 dtm-dict tcv = 0.001                     62.845
    S = 310, τ0 = 64, κ = 0.6    infvoc   α^β = 3k, T = 40k, U = 20       52.317
                                 fixvoc   vb-dict                         44.701
                                 fixvoc   vb-null                         51.815
                                 fixvoc   hybrid-dict                     46.368
                                 fixvoc   hybrid-null                     50.569
                                 fixvoc   vb dict-hash                    48.130
                                 fixvoc   vb full-hash, T = 30k           47.276
                                 fixvoc   hybrid dict-hash                51.558
                                 fixvoc   hybrid full-hash, T = 30k       43.008
                                 dtm-dict tcv = 0.001                     64.186

Table 1. Classification accuracy based on 50 topic features extracted from 20 newsgroups data. Our model (infvoc) outperforms algorithms with a fixed or hashed vocabulary but not dtm, a batch algorithm that has access to all documents.

6.4. Qualitative Example

Figure 1 shows the evolution of a topic in 20 newsgroups about comics as new vocabulary words enter from new minibatches. While topics improve over time (e.g., relevant words like “seri(es)”, “issu(e)”, “forc(e)” are ranked higher), interesting words are being added throughout training and become prominent after later minibatches are processed (e.g., “captain”, “comic-strip”, “mutant”). This is not the case for standard online LDA: these words are ignored and the model does not capture such information. In addition, only about 60% of the word types appeared in the SIL English dictionary. Even with a comprehensive English dictionary, online LDA could not capture all the word types in the corpus, especially named entities.

7. Conclusion and Future Work

We proposed an online topic model that, instead of assuming vocabulary is known a priori, adds and sheds words over time. While our model is better able to create coherent topics, it does not outperform dynamic topic models (Blei & Lafferty, 2006; Wang et al., 2008) that explicitly model how topics change. It would be interesting to allow such models, in addition to modeling the change of topics, to also change the underlying dimensionality of the vocabulary.

In addition to explicitly modeling the change of topics over time, it is also possible to model additional structure within a topic. Rather than a fixed, immutable base distribution, modeling each topic with a hierarchical character n-gram model would capture regularities in the corpus that would, for example, allow certain topics to favor different orthographies (e.g., a technology topic might prefer words that start with “i”). While some topic models have attempted to capture orthography for multilingual applications (Boyd-Graber & Blei, 2009), our approach is more robust, and incorporating our approach with models of transliteration (Knight & Graehl, 1997) might allow concepts expressed in one language to better capture concepts in another, further improving the ability of algorithms to capture the evolving themes and topics in large, streaming datasets.

Acknowledgments

The authors thank Chong Wang, Dave Blei, and Matt Hoffman for answering questions and sharing code. We thank Jimmy Lin and the anonymous reviewers for helpful suggestions. Research supported by NSF grant #1018625. Any opinions, conclusions, or recommendations are the authors’ and not those of the sponsors.

References

Algeo, John. Where do all the new words come from?
  American Speech, 55(4):264–277, 1980.

Bird, Steven, Klein, Ewan, and Loper, Edward. Natural
  Language Processing with Python. O’Reilly Media, 2009.

Blei, David M. and Jordan, Michael I. Variational inference
  for Dirichlet process mixtures. Journal of Bayesian
  Analysis, 1(1):121–144, 2005.

Blei, David M. and Lafferty, John D. Dynamic topic models.
  In ICML, 2006.

Blei, David M., Ng, Andrew, and Jordan, Michael. Latent
  Dirichlet allocation. JMLR, 3:993–1022, 2003.

Blunsom, Phil and Cohn, Trevor. A hierarchical Pitman-Yor
  process HMM for unsupervised part of speech induction.
  In ACL, 2011.

Boyd-Graber, Jordan and Blei, David M. Multilingual topic
  models for unaligned text. In UAI, 2009.

Chang, Jonathan, Boyd-Graber, Jordan, and Blei, David M.
  Connections between the lines: Augmenting social
  networks with text. In KDD, 2009.

Clark, Alexander. Combining distributional and
  morphological information for part of speech induction. 2003.

Cohen, Shay B., Blei, David M., and Smith, Noah A.
  Variational inference for adaptor grammars. In NAACL,
  2010.

Dietz, Laura, Bickel, Steffen, and Scheffer, Tobias.
  Unsupervised prediction of citation influences. In ICML,
  2007.

Ferguson, Thomas S. A Bayesian analysis of some
  nonparametric problems. The Annals of Statistics, 1(2):209–230,
  1973.

Goldwater, Sharon and Griffiths, Thomas L. A fully
  Bayesian approach to unsupervised part-of-speech
  tagging. In ACL, 2007.

Hall, Mark, Frank, Eibe, Holmes, Geoffrey, Pfahringer,
  Bernhard, Reutemann, Peter, and Witten, Ian H. The
  WEKA data mining software: An update. SIGKDD
  Explorations, 11, 2009.

Hoffman, Matthew, Blei, David M., and Bach, Francis.
  Online learning for latent Dirichlet allocation. In NIPS,
  2010.

Jelinek, F. and Mercer, R. Probability distribution
  estimation from sparse data. IBM Technical Disclosure Bulletin,
  28:2591–2594, 1985.

Knight, Kevin and Graehl, Jonathan. Machine
  transliteration. In ACL, 1997.

Kurihara, Kenichi, Welling, Max, and Teh, Yee Whye.
  Collapsed variational Dirichlet process mixture models.
  In IJCAI, 2007.

Kurihara, Kenichi, Welling, Max, and Vlassis, Nikos.
  Accelerated variational Dirichlet process mixtures. In NIPS,

Mimno, David, Hoffman, Matthew, and Blei, David. Sparse
  stochastic inference for latent Dirichlet allocation. In
  ICML, 2012.

Müller, Peter and Quintana, Fernando A. Nonparametric
  Bayesian data analysis. Statistical Science, 19(1):95–110,
  2004.

Neal, Radford M. Probabilistic inference using Markov
  chain Monte Carlo methods. Technical Report
  CRG-TR-93-1, University of Toronto, 1993.

Newman, David, Karimi, Sarvnaz, and Cavedon, Lawrence.
  External evaluation of topic models. In ADCS, 2009.

Paul, Michael and Girju, Roxana. A two-dimensional
  topic-aspect model for discovering multi-faceted topics. 2010.

Sato, Masa-Aki. Online model selection based on the
  variational Bayes. Neural Computation, 13(7):1649–1681, July

Sethuraman, Jayaram. A constructive definition of Dirichlet
  priors. Statistica Sinica, 4:639–650, 1994.

Teh, Yee Whye. A hierarchical Bayesian language model
  based on Pitman-Yor processes. In ACL, 2006.

Wang, Chong and Blei, David. Variational inference for the
  nested Chinese restaurant process. In NIPS, 2009.

Wang, Chong and Blei, David M. Truncation-free online
  variational inference for Bayesian nonparametric models.
  In NIPS, 2012.

Wang, Chong, Blei, David M., and Heckerman, David.
  Continuous time dynamic topic models. In UAI, 2008.

Wang, Chong, Paisley, John, and Blei, David. Online
  variational inference for the hierarchical Dirichlet process.
  In AISTATS, 2011.

Wei, Xing and Croft, Bruce. LDA-based document models
  for ad-hoc retrieval. In SIGIR, 2006.

Weinberger, K.Q., Dasgupta, A., Langford, J., Smola, A.,
  and Attenberg, J. Feature hashing for large scale
  multitask learning. In ICML, pp. 1113–1120. ACM, 2009.

Zhai, Ke, Boyd-Graber, Jordan, Asadi, Nima, and Alkhouja,
  Mohamad. Mr. LDA: A flexible large scale topic modeling
  package using variational inference in mapreduce. In
  WWW, 2012.