Text as Data Matthew Gentzkow, Bryan Kelly, and Matt Taddy* - Stanford University

Journal of Economic Literature 2019, 57(3), 535–574
https://doi.org/10.1257/jel.20181020

 Text as Data†
 Matthew Gentzkow, Bryan Kelly, and Matt Taddy*

 An ever-increasing share of human interaction, communication, and culture is
 recorded as digital text. We provide an introduction to the use of text as an input to
 economic research. We discuss the features that make text different from other forms
 of data, offer a practical overview of relevant statistical methods, and survey a variety
 of applications. (JEL C38, C55, L82, Z13)

* Gentzkow: Stanford University. Kelly: Yale University and AQR Capital Management. Taddy: University of Chicago Booth School of Business.

† Go to https://doi.org/10.1257/jel.20181020 to visit the article page and view author disclosure statement(s).

1. Introduction

New technologies have made available vast quantities of digital text, recording an ever-increasing share of human interaction, communication, and culture. For social scientists, the information encoded in text is a rich complement to the more structured kinds of data traditionally used in research, and recent years have seen an explosion of empirical economics research using text as data.

To take just a few examples: In finance, text from financial news, social media, and company filings is used to predict asset price movements and study the causal impact of new information. In macroeconomics, text is used to forecast variation in inflation and unemployment, and estimate the effects of policy uncertainty. In media economics, text from news and social media is used to study the drivers and effects of political slant. In industrial organization and marketing, text from advertisements and product reviews is used to study the drivers of consumer decision making. In political economy, text from politicians’ speeches is used to study the dynamics of political agendas and debate.

The most important way that text differs from the kinds of data often used in economics is that text is inherently high dimensional. Suppose that we have a sample of documents, each of which is $w$ words long, and suppose that each word is drawn from a vocabulary of $p$ possible words. Then the unique representation of these documents has dimension $p^w$. A sample of thirty-word Twitter messages that use only the one thousand most common words in the English language, for example, has roughly as many dimensions as there are atoms in the universe.

A consequence is that the statistical methods used to analyze text are closely related to those used to analyze high-dimensional data in other domains, such as machine learning and computational biology. Some methods, such as lasso and other penalized regressions, are applied to text more or less exactly as they are in other settings. Other methods, such as topic models and multinomial inverse regression, are close cousins of more general
536 Journal of Economic Literature, Vol. LVII (September 2019)

methods adapted to the specific structure of text data.

In all of the cases we consider, the analysis can be summarized in three steps:

 1. Represent raw text $\mathcal{D}$ as a numerical array $C$;

 2. Map $C$ to predicted values $\hat{V}$ of unknown outcomes $V$; and

 3. Use $\hat{V}$ in subsequent descriptive or causal analysis.

In the first step, the researcher must impose some preliminary restrictions to reduce the dimensionality of the data to a manageable level. Even the most cutting-edge high-dimensional techniques can make nothing of $1{,}000^{30}$-dimensional raw Twitter data. In almost all the cases we discuss, the elements of $C$ are counts of tokens: words, phrases, or other predefined features of text. This step may involve filtering out very common or uncommon words; dropping numbers, punctuation, or proper names; and restricting attention to a set of features such as words or phrases that are likely to be especially diagnostic. The mapping from raw text to $C$ leverages prior information about the structure of language to reduce the dimensionality of the data prior to any statistical analysis.

The second step is where high-dimensional statistical methods are applied. In a classic example, the data is the text of emails, and the unknown variable of interest $V$ is an indicator for whether the email is spam. The prediction $\hat{V}$ determines whether or not to send the email to a spam filter. Another classic task is sentiment prediction (e.g., Pang, Lee, and Vaithyanathan 2002), where the unknown variable $V$ is the true sentiment of a message (say positive or negative), and the prediction $\hat{V}$ might be used to identify positive reviews or comments about a product. A third task is predicting the incidence of local flu outbreaks from Google searches, where the outcome $V$ is the true incidence of flu.

In these examples, and in the vast majority of settings where text analysis has been applied, the ultimate goal is prediction rather than causal inference. The interpretation of the mapping from $V$ to $\hat{V}$ is not usually an object of interest. Why certain words appear more often in spam, or why certain searches are correlated with flu is not important so long as they generate highly accurate predictions. For example, Scott and Varian (2014, 2015) use data from Google searches to produce high-frequency estimates of macroeconomic variables such as unemployment claims, retail sales, and consumer sentiment that are otherwise available only at lower frequencies from survey data. Groseclose and Milyo (2005) compare the text of news outlets to speeches of congresspeople in order to estimate the outlets’ political slant. A large literature in finance following Antweiler and Frank (2004) and Tetlock (2007) uses text from the internet or the news to predict stock prices.

In many social science studies, however, the goal is to go further and, in the third step, use text to infer causal relationships or the parameters of structural economic models. Stephens-Davidowitz (2014) uses Google search data to estimate local areas’ racial animus, then studies the causal effect of racial animus on votes for Barack Obama in the 2008 election. Gentzkow and Shapiro (2010) use congressional and news text to estimate each news outlet’s political slant, then study the supply and demand forces that determine slant in equilibrium. Engelberg and Parsons (2011) measure local news coverage of earnings announcements, then use the relationship between coverage and trading by local investors to separate the causal effect of news from other sources of correlation between news and stock prices.
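The three-step recipe described in this introduction (represent text as counts, predict an outcome, use the prediction) can be illustrated with a minimal sketch on a toy spam-filtering task. Everything in it is an assumption made for illustration: the emails and labels are made up, the helper names `tokenize`, `count_matrix`, `fit_naive_bayes`, and `predict` are hypothetical, and the multinomial naive Bayes classifier with add-one smoothing is only a simple stand-in for step 2, not a method prescribed by the text.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def count_matrix(docs, vocab):
    """Step 1: represent raw documents as a matrix C of token counts."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[w] for w in vocab])  # words outside vocab are dropped
    return rows

def fit_naive_bayes(C, v):
    """Step 2 (fit): class priors and smoothed per-class word probabilities."""
    model = {}
    for label in sorted(set(v)):
        rows = [c for c, vi in zip(C, v) if vi == label]
        totals = [sum(col) + 1 for col in zip(*rows)]  # add-one smoothing
        denom = sum(totals)
        model[label] = {
            "log_prior": math.log(len(rows) / len(C)),
            "log_word": [math.log(t / denom) for t in totals],
        }
    return model

def predict(model, c):
    """Step 2 (predict): return the label with the highest posterior score."""
    def score(label):
        m = model[label]
        return m["log_prior"] + sum(n * lw for n, lw in zip(c, m["log_word"]))
    return max(model, key=score)

# Made-up training emails; 1 = spam, 0 = not spam.
train_docs = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch on monday with the team",
]
train_v = [1, 1, 0, 0]

vocab = sorted({w for d in train_docs for w in tokenize(d)})
C_train = count_matrix(train_docs, vocab)
model = fit_naive_bayes(C_train, train_v)

test_docs = ["claim your free money", "agenda for the team lunch"]
C_test = count_matrix(test_docs, vocab)
v_hat = [predict(model, c) for c in C_test]

# Step 3: use the predictions in a simple descriptive summary.
spam_share = sum(v_hat) / len(v_hat)
```

On these made-up emails the sketch flags the first test message as spam and the second as not spam. With real data, step 2 would be replaced by one of the methods discussed in section 3.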
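As a worked check of the dimensionality example used in the introduction, take the vocabulary size $p = 1000$ and the document length $w = 30$ given there. The only step added here is the arithmetic itself:

```latex
\[
p^{w} = 1000^{30} = \left(10^{3}\right)^{30} = 10^{90}.
\]
```

This is the count behind the $1{,}000^{30}$ figure quoted for raw Twitter data, and it is why some dimension reduction is needed before any statistical method is applied.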

In this paper, we provide an overview of methods for analyzing text and a survey of current applications in economics and related social sciences. The methods discussion is forward looking, providing an overview of methods that are currently applied in economics as well as those that we expect to have high value in the future. Our discussion of applications is selective and necessarily omits many worthy papers. We highlight examples that illustrate particular methods and use text data to make important substantive contributions even if they do not apply methods close to the frontier.

A number of other excellent surveys have been written in related areas. See Evans and Aceves (2016) and Grimmer and Stewart (2013) for related surveys focused on text analysis in sociology and political science, respectively. For methodological surveys, Bishop (2006), Hastie, Tibshirani, and Friedman (2009), and Murphy (2012) cover contemporary statistics and machine learning in general while Jurafsky and Martin (2009) overview methods from computational linguistics and natural language processing. The Spring 2014 issue of the Journal of Economic Perspectives contains a symposium on “big data,” which surveys broader applications of high-dimensional statistical methods to economics.

In section 2 we discuss representing text data as a manageable (though still high-dimensional) numerical array $C$; in section 3 we discuss methods from data mining and machine learning for predicting $V$ from $C$. Section 4 then provides a selective survey of text analysis applications in social science, and section 5 concludes.

2. Representing Text as Data

When humans read text, they do not see a vector of dummy variables, nor a sequence of unrelated tokens. They interpret words in light of other words, and extract meaning from the text as a whole. It might seem obvious that any attempt to distill text into meaningful data must similarly take account of complex grammatical structures and rich interactions among words.

The field of computational linguistics has made tremendous progress in this kind of interpretation. Most of us have mobile phones that are capable of complex speech recognition. Algorithms exist to efficiently parse grammatical structure, disambiguate different senses of words, distinguish key points from secondary asides, and so on.

Yet virtually all analysis of text in the social sciences, like much of the text analysis in machine learning more generally, ignores the lion’s share of this complexity. Raw text consists of an ordered sequence of language elements: words, punctuation, and white space. To reduce this to a simpler representation suitable for statistical analysis, we typically make three kinds of simplifications: dividing the text into individual documents $i$, reducing the number of language elements we consider, and limiting the extent to which we encode dependence among elements within documents. The result is a mapping from raw text $\mathcal{D}$ to a numerical array $C$. A row $c_i$ of $C$ is a numerical vector with each element indicating the presence or count of a particular language token in document $i$.

2.1 What Is a Document?

The first step in constructing $C$ is to divide raw text $\mathcal{D}$ into individual documents $\{\mathcal{D}_i\}$. In many applications, this is governed by the level at which the attributes of interest $V$ are defined. For spam detection, the outcome of interest is defined at the level of individual emails, so we want to divide text that way too. If $V$ is daily stock price movements that we wish to predict from the prior day’s news text, it might make sense to divide the news text by day as well.

In other cases, the natural way to define a document is not so clear. If we wish to
predict legislators’ partisanship from their floor speeches (Gentzkow, Shapiro, and Taddy 2016), we could aggregate speech so a document is a speaker–day, a speaker–year, or all speech by a given speaker during the time she is in Congress. When we use methods that treat documents as independent (which is true most of the time), finer partitions will typically ease computation at the cost of limiting the dependence we are able to capture. Theoretical guidance for the right level of aggregation is often limited, so this is an important dimension along which to check the sensitivity of results.

2.2 Feature Selection

To reduce the number of features to something manageable, a common first step is to strip out elements of the raw text other than words. This might include punctuation, numbers, HTML tags, proper names, and so on.

It is also common to remove a subset of words that are either very common or very rare. Very common words, often called “stop words,” include articles (“the,” “a”), conjunctions (“and,” “or”), forms of the verb “to be,” and so on. These words are important to the grammatical structure of sentences, but they typically convey relatively little meaning on their own. The frequency of “the” is probably not very diagnostic of whether an email is spam, for example. Common practice is to exclude stop words based on a predefined list.1 Very rare words do convey meaning, but their added computational cost in expanding the set of features that must be considered often exceeds their diagnostic value. A common approach is to exclude all words that occur fewer than $k$ times for some arbitrary small integer $k$.

1 There is no single stop word list that has become a standard. How aggressive one wants to be in filtering stop words depends on the application. The web page http://www.ranks.nl/stopwords shows several common stop word lists, including the one built into the database software SQL and the list claimed to have been used in early versions of Google search. (Modern Google search does not appear to filter any stop words.)

An approach that excludes both common and rare words and has proved very useful in practice is filtering by “term frequency–inverse document frequency” (tf–idf). For a word or other feature $j$ in document $i$, term frequency ($tf_{ij}$) is the count $c_{ij}$ of occurrences of $j$ in $i$. Inverse document frequency ($idf_j$) is the log of one over the share of documents containing $j$: $\log(n/d_j)$ where $d_j = \sum_i \mathbb{1}_{[c_{ij} > 0]}$ and $n$ is the total number of documents. The object of interest tf–idf is the product $tf_{ij} \times idf_j$. Very rare words will have low tf–idf scores because $tf_{ij}$ will be low. Very common words that appear in most or all documents will have low tf–idf scores because $idf_j$ will be low. (Note that this improves on simply excluding words that occur frequently because it will keep words that occur frequently in some documents but do not appear in others; these often provide useful information.) A common practice is to keep only the words within each document $i$ with tf–idf scores above some rank or cutoff.

A final step that is commonly used to reduce the feature space is stemming: replacing words with their root such that, e.g., “economic,” “economics,” “economically” are all replaced by the stem “economic.” The Porter stemmer (Porter 1980) is a standard stemming tool for English language text.

2 Denny and Spirling (2018) discuss the sensitivity of unsupervised text analysis methods such as topic modeling to preprocessing steps.

All of these cleaning steps reduce the number of unique language elements we must consider and thus the dimensionality of the data. This can provide a massive computational benefit, and it is also often key to getting more interpretable model fits (e.g., in topic modeling). However, each of these steps requires careful decisions about the elements likely to carry meaning in a particular application.2 One researcher’s

stop words are another’s subject of interest. Dropping numerals from political text means missing references to “the first 100 days” or “September 11.” In online communication, even punctuation can no longer be stripped without potentially significant information loss :-(.

2.3 n-grams

Producing a tractable representation also requires that we limit dependence among language elements. A fairly mild step in this direction, for example, might be to parse documents into distinct sentences and encode features of these sentences while ignoring the order in which they occur. The most common methodologies go much further.

The simplest and most common way to represent a document is called bag-of-words. The order of words is ignored altogether, and $c_i$ is a vector whose length is equal to the number of words in the vocabulary and whose elements $c_{ij}$ are the number of times word $j$ occurs in document $i$. Suppose that the text of document $i$ is

 Good night, good night!
 Parting is such sweet sorrow.

After stemming, removing stop words, and removing punctuation, we might be left with “good night good night part sweet sorrow.” The bag-of-words representation would then have $c_{ij} = 2$ for $j \in \{\text{good}, \text{night}\}$, $c_{ij} = 1$ for $j \in \{\text{part}, \text{sweet}, \text{sorrow}\}$, and $c_{ij} = 0$ for all other words in the vocabulary.

This scheme can be extended to encode a limited amount of dependence by counting unique phrases rather than unique words. A phrase of length $n$ is referred to as an $n$-gram. For example, in our snippet above, the count of 2-grams (or “bigrams”) would have $c_{ij} = 2$ for $j = \text{good.night}$, $c_{ij} = 1$ for $j$ including night.good, night.part, part.sweet, and sweet.sorrow, and $c_{ij} = 0$ for all other possible 2-grams. The bag-of-words representation then corresponds to counts of 1-grams.

Counting $n$-grams of order $n > 1$ yields data that describe a limited amount of the dependence between words. Specifically, the $n$-gram counts are sufficient for estimation of an $n$-order homogeneous Markov model across words (i.e., the model that arises if we assume that word choice is only dependent upon the previous $n$ words). This can lead to richer modeling. In analysis of partisan speech, for example, single words are often insufficient to capture the patterns of interest: “death tax” and “tax break” are phrases with strong partisan overtones that are not evident if we look at the single words “death,” “tax,” and “break” (see, e.g., Gentzkow and Shapiro 2010).

Unfortunately, the dimension of $c_i$ increases exponentially quickly with the order $n$ of the phrases tracked. The majority of text analyses consider $n$-grams up to two or three at most, and the ubiquity of these simple representations (in both machine learning and social science) reflects a belief that the return to richer $n$-gram modeling is usually small relative to the cost. Best practice in many cases is to begin analysis by focusing on single words. Given the accuracy obtained with words alone, one can then evaluate if it is worth the extra time to move on to 2-grams or 3-grams.

2.4 Richer Representations

While rarely used in the social science literature to date, there is a vast array of methods from computational linguistics that capture richer features of text and may have high return in certain applications. One basic step beyond the simple $n$-gram counting above is to use sentence syntax to inform the text tokens used to summarize a document. For example, Goldberg and Orwant (2013) describe syntactic $n$-grams where words are grouped together whenever their meaning depends upon each

other, according to a model of language syntax.

An alternative approach is to move beyond treating documents as counts of language tokens, and to instead consider the ordered sequence of transitions between words. In this case, one would typically break the document into sentences, and treat each as a separate unit for analysis. A single sentence of length $s$ (i.e., containing $s$ words) is then represented as a binary $p \times s$ matrix $S$, where the nonzero elements of $S$ indicate occurrence of the row-word in the column-position within the sentence, and $p$ is the length of the vocabulary. Such representations lead to a massive increase in the dimensions of the data to be modeled, and analysis of this data tends to proceed through word embedding: the mapping of words to a location in $\mathbb{R}^K$ for some $K \ll p$, such that the sentences are then sequences of points in this $K$ dimensional space. This is discussed in detail in section 3.3.

2.5 Other Practical Considerations

It is worth mentioning two details that can cause practical social science applications of these methods to diverge a bit from the ideal case considered in the statistics literature. First, researchers sometimes receive data in a pre-aggregated form. In the analysis of Google searches, for example, one might observe the number of searches containing each possible keyword on each day, but not the raw text of the individual searches. This means documents must be similarly aggregated (to days, rather than individual searches), and it also means that the natural representation where $c_{ij}$ is the number of occurrences of word $j$ on day $i$ is not available. This is probably not a significant limitation, as the missing information (how many times per search a word occurs conditional on occurring at least once) is unlikely to be essential, but it is useful to note when mapping practice to theory.

A more serious issue is that researchers sometimes do not have direct access to the raw text and must access it through some interface such as a search engine. For example, Gentzkow and Shapiro (2010) count the number of newspaper articles containing partisan phrases by entering the phrases into a search interface (e.g., for the database ProQuest) and counting the number of matches they return. Baker, Bloom, and Davis (2016) perform similar searches to count the number of articles mentioning terms related to policy uncertainty. Saiz and Simonsohn (2013) count the number of web pages measuring combinations of city names and terms related to corruption by entering queries in a search engine. Even if one can automate the searches in these cases, it is usually not feasible to produce counts for very large feature sets (e.g., every two-word phrase in the English language), and so the initial feature selection step must be relatively aggressive. Relatedly, interacting through a search interface means that there is no simple way to retrieve objects like the set of all words occurring at least twenty times in the corpus of documents, or the inputs to computing tf–idf.

3. Statistical Methods

This section considers methods for mapping the document-token matrix $C$ to predictions $\hat{V}$ of an attribute $V$. In some cases, the observed data is partitioned into submatrices $C^{train}$ and $C^{test}$, where the matrix $C^{train}$ collects rows for which we have observations $V^{train}$ of $V$ and the matrix $C^{test}$ collects rows for which $V$ is unobserved. The dimension of $C^{train}$ is $n^{train} \times p$, and the dimension of $V^{train}$ is $n^{train} \times k$, where $k$ is the number of attributes we wish to predict.

Attributes in $V$ can include observable quantities such as the frequency of flu cases, the positive or negative rating of movie reviews, or the unemployment rate, about

which the documents are informative. There can also be latent attributes of interest, such as the topics being discussed in a congressional debate or in news articles.

Methods to connect counts $c_i$ to attributes $v_i$ can be roughly divided into four categories. The first, which we will call dictionary-based methods, do not involve statistical inference at all: they simply specify $\hat{v}_i = f(c_i)$ for some known function $f(\cdot)$. This is by far the most common method in the social science literature using text to date. In some cases, researchers define $f(\cdot)$ based on a prespecified dictionary of terms capturing particular categories of text. In Tetlock (2007), for example, $c_i$ is a bag-of-words representation and the outcome of interest $v_i$ is the latent “sentiment” of Wall Street Journal columns, defined along a number of dimensions such as “positive,” “optimistic,” and so on. The author defines the function $f(\cdot)$ using a dictionary called the General Inquirer, which provides lists of words associated with each of these sentiment categories.3 The elements of $f(c_i)$ are defined to be the sum of the counts of words in each category. (As we discuss below, the main analysis then focuses on the first principal component of the resulting counts.) In Baker, Bloom, and Davis (2016), $c_i$ is the count of articles in a given newspaper–month containing a set of prespecified terms such as “policy,” “uncertainty,” and “Federal Reserve,” and the outcome of interest $v_i$ is the degree of “policy uncertainty” in the economy. The authors define $f(\cdot)$ to be the raw count of the prespecified terms divided by the total number of articles in the newspaper–month, averaged across newspapers. We do not provide additional discussion of dictionary-based methods in this section, but we return to them in section 3.5 and in our discussion of applications in section 4.

3 http://www.wjh.harvard.edu/~inquirer/.

The second and third groups of methods are distinguished by whether they begin from a model of $p(v_i \mid c_i)$ or a model of $p(c_i \mid v_i)$. In the former case, which we will call text regression methods, we directly estimate the conditional outcome distribution, usually via the conditional expectation $E[v_i \mid c_i]$ of attributes $v_i$. This is intuitive: if we want to predict $v_i$ from $c_i$, we would naturally regress the observed values of the former ($V^{train}$) on the corresponding values of the latter ($C^{train}$). Any generic regression technique can be applied, depending upon the nature of $v_i$. However, the high dimensionality of $c_i$, where $p$ is often as large as or larger than $n^{train}$, requires use of regression techniques appropriate for such a setting, such as penalized linear or logistic regression.

In the latter case, we begin from a generative model of $p(c_i \mid v_i)$. To see why this is intuitive, note that in many cases the underlying causal relationship runs from outcomes to language rather than the other way around. For example, Google searches about the flu do not cause flu cases to occur; rather, people with the flu are more likely to produce such searches. Congresspeople’s ideology is not determined by their use of partisan language; rather, people who are more conservative or liberal to begin with are more likely to use such language. From an economic point of view, the correct “structural” model of language in these cases maps from $v_i$ to $c_i$, and as in other cases familiar to economists, modeling the underlying causal relationships can provide powerful guidance to inference and make the estimated model more interpretable.

Generative models can be further divided by whether the attributes are observed or latent. In the first case of unsupervised methods, we do not observe the true value of $v_i$ for any documents. The function relating $c_i$ and $v_i$ is unknown, but we are willing to impose sufficient structure on it to allow us to infer $v_i$ from $c_i$. This class includes methods

such as topic modeling and its variants (e.g., latent Dirichlet allocation, or LDA). In the second case of supervised methods, we observe training data $V^{train}$ and we can fit our model, say $f_\theta(c_i; v_i)$ for a vector of parameters $\theta$, to this training set. The fitted model $f_{\hat{\theta}}$ can then be inverted to predict $v_i$ for documents in the test set and can also be used to interpret the structural relationship between attributes and text. Finally, in some cases, $v_i$ includes both observed and latent attributes for a semi-supervised analysis.

Lastly, we discuss word embeddings, which provide a richer representation of the underlying text than the token counts that underlie other methods. They have seen limited application in economics to date, but their dramatic successes in deep learning and other machine learning domains suggest they are likely to have high value in the future.

We close in section 3.5 with some broad recommendations for practitioners.

3.1 Text Regression

Predicting an attribute $v_i$ from counts $c_i$ is a regression problem like any other, except that the high dimensionality of $c_i$ makes ordinary least squares (OLS) and other standard techniques infeasible. The methods in this section are mainly applications of standard high-dimensional regression methods to text.

3.1.1 Penalized Linear Models

The most popular strategy for very high-dimensional regression in contemporary statistics and machine learning is the estimation of penalized linear models, particularly with $L_1$ penalization. We recommend this strategy for most text regression applications: linear models are intuitive and interpretable; fast, high-quality software is available for big sparse input matrices like our $C$. For simple text-regression tasks with input dimension on the same order as the sample size, penalized linear models typically perform close to the frontier in terms of out-of-sample prediction.

Linear models in the sense we mean here are those in which $v_i$ depends on $c_i$ only through a linear index $\eta_i = \alpha + x_i' \beta$, where $x_i$ is a known transformation of $c_i$. In many cases, we simply have $E[v_i \mid x_i] = \eta_i$. It is also possible that $E[v_i \mid x_i] = f(\eta_i)$ for some known link function $f(\cdot)$, as in the case of logistic regression.

Common transformations are the identity $x_i = c_i$, normalization by document length $x_i = c_i / m_i$ with $m_i = \sum_j c_{ij}$, or the positive indicator $x_{ij} = \mathbb{1}_{[c_{ij} > 0]}$. The best choice is application specific, and may be driven by interpretability; does one wish to interpret $\beta_j$ as the added effect of an extra count for token $j$ (if so, use $x_{ij} = c_{ij}$) or as the effect of the presence of token $j$ (if so, use $x_{ij} = \mathbb{1}_{[c_{ij} > 0]}$)? The identity is a reasonable default in many settings.

Write $l(\alpha, \beta)$ for an unregularized objective proportional to the negative log likelihood, $-\log p(v_i \mid x_i)$. For example, in Gaussian (linear) regression, $l(\alpha, \beta) = \sum_i (v_i - \eta_i)^2$ and in binomial (logistic) regression, $l(\alpha, \beta) = -\sum_i \left[ \eta_i v_i - \log(1 + e^{\eta_i}) \right]$ for $v_i \in \{0, 1\}$. A penalized estimator is then the solution to

(1)  $\min \left\{ l(\alpha, \beta) + n\lambda \sum_{j=1}^{p} \kappa_j(|\beta_j|) \right\},$

where $\lambda > 0$ controls overall penalty magnitude and $\kappa_j(\cdot)$ are increasing “cost” functions that penalize deviations of the $\beta_j$ from zero.

A few common cost functions are shown in figure 1. Those that have a non-differentiable spike at zero (lasso, elastic net, and log) lead to sparse estimators, with some coefficients set to exactly zero. The curvature of the penalty away from zero dictates the weight of shrinkage imposed on the nonzero coefficients: $L_2$ costs increase with coefficient size; lasso’s $L_1$ penalty has zero curvature and imposes constant shrinkage, and as curvature

[Figure 1: four panels plotting penalty cost against $\beta$ over the range $-20$ to $20$. A. Ridge: $\beta^2$. B. Lasso: $|\beta|$. C. Elastic net: $|\beta| + 0.1 \times \beta^2$. D. log: $\log(1 + |\beta|)$.]

Figure 1

Note: From left to right, $L_2$ costs (ridge, Hoerl and Kennard 1970), $L_1$ (lasso, Tibshirani 1996), the “elastic net” mixture of $L_1$ and $L_2$ (Zou and Hastie 2005), and the log penalty (Candès, Wakin, and Boyd 2008).

goes toward $-\infty$ one approaches the $L_0$ penalty of subset selection. The lasso’s $L_1$ penalty (Tibshirani 1996) is extremely popular: it yields sparse solutions with a number of desirable properties (e.g., Bickel, Ritov, and Tsybakov 2009; Wainwright 2009; Belloni, Chernozhukov, and Hansen 2013; Bühlmann and van de Geer 2011), and the number of nonzero estimated coefficients is an unbiased estimator of the regression degrees of freedom (which is useful in model selection; see Zou, Hastie, and Tibshirani 2007).4

 4 Penalties with a bias that diminishes with coefficient size, such as the log penalty in figure 1 (Candès, Wakin, and Boyd 2008), the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), or the adaptive lasso of Zou (2006), have been promoted in the statistics literature as improving upon the lasso by providing consistent variable selection and estimation in a wider range of settings. These diminishing-bias penalties lead to increased computation costs (due to a non-convex loss), but there exist efficient approximation algorithms (see, e.g., Fan, Xue, and Zou 2014; Taddy 2017b).

Focusing on $L_1$ regularization, rewrite the penalized linear model objective as

(2) $\min \left\{ l(\alpha, \beta) + n\lambda \sum_{j=1}^{p} \omega_j |\beta_j| \right\}.$

A common strategy sets $\omega_j$ so that the penalty cost for each coefficient is scaled by the sample standard deviation of that covariate. In text analysis, where each covariate corresponds to some transformation of a specific text token, this type of weighting is referred to as “rare feature up-weighting” (e.g., Manning, Raghavan, and Schütze 2008) and is generally thought of as good practice: rare words are often most useful in differentiating between documents.5

 5 This is the same principle that motivates “inverse-document frequency” weighting schemes, such as tf–idf.

Large $\lambda$ leads to simple model estimates in the sense that most coefficients will be set at or close to zero, while as $\lambda \to 0$ we approach maximum likelihood estimation (MLE). Since there is no way to define an optimal $\lambda$ a priori, standard practice is to compute estimates for a large set of possible $\lambda$ and then use some criterion to select the one that yields the best fit.

Several criteria are available to choose an optimal $\lambda$. One common approach is to leave out part of the training sample in estimation and then choose the $\lambda$ that yields the best out-of-sample fit according to some criterion such as mean squared error. Rather than work with a single leave-out sample, researchers most often use $K$-fold cross-validation (CV).
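The weighted objective in (2) and the practice of computing estimates over a grid of $\lambda$ values can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian loss $l(\alpha, \beta) = \frac{1}{2}\sum_i (v_i - \alpha - c_i'\beta)^2$, so that dividing (2) by $n$ gives the objective in the docstring, and it uses plain coordinate descent with soft-thresholding. The function names, the simulated data, and the $\lambda$ grid are all hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso_path(X, v, lambdas, n_sweeps=200):
    """Coordinate descent for (1/2n)||v - a - X b||^2 + lam * sum_j w_j |b_j|,
    with w_j set to the sample standard deviation of column j (assumed Gaussian loss)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # centering absorbs the intercept
    vc = v - v.mean()
    w = Xc.std(axis=0)                 # omega_j: per-covariate standard deviation
    col_ss = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    path = []
    for lam in sorted(lambdas, reverse=True):    # warm starts, large to small lambda
        for _ in range(n_sweeps):
            for j in range(p):
                if col_ss[j] == 0.0:
                    continue                     # constant column stays at zero
                # partial residual that excludes feature j's current contribution
                r = vc - Xc @ beta + Xc[:, j] * beta[j]
                z = Xc[:, j] @ r / n
                beta[j] = soft_threshold(z, lam * w[j]) / col_ss[j]
        path.append((lam, beta.copy()))
    return path

# Simulated count-like covariates; only the first two coefficients are nonzero.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 10)).astype(float)
beta_true = np.zeros(10)
beta_true[:2] = [2.0, -1.5]
v = X @ beta_true + rng.normal(size=200)

path = weighted_lasso_path(X, v, lambdas=[10.0, 1.0, 0.3, 0.1, 0.01])
for lam, b in path:
    print(lam, int(np.count_nonzero(b)))
```

Fitting from the largest $\lambda$ downward lets each fit start from the previous solution, which is the usual way a full regularization path is computed; a selection criterion is then applied across the stored fits.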

This splits the sample into $K$ disjoint subsets, and then fits the full regularization path $K$ times excluding each subset in turn. This yields $K$ realizations of the mean squared error or other out-of-sample fit measure for each value of $\lambda$. Common rules are to select the value of $\lambda$ that minimizes the average error across these realizations, or (more conservatively) to choose the largest $\lambda$ with mean error no more than one standard error away from the minimum.

Analytic alternatives to cross-validation are Akaike’s information criterion (AIC; Akaike 1973) and the Bayesian information criterion (BIC) of Schwarz (1978). In particular, Flynn, Hurvich, and Simonoff (2013) describe a bias-corrected AIC objective for high-dimensional problems that they call AICc. It is motivated as an approximate likelihood maximization subject to a degrees of freedom ($df_\lambda$) adjustment: $\mathrm{AICc}(\lambda) = 2\,l(\alpha_\lambda, \beta_\lambda) + 2\,df_\lambda \frac{n}{n - df_\lambda - 1}$. Similarly, the BIC objective is $\mathrm{BIC}(\lambda) = l(\alpha_\lambda, \beta_\lambda) + df_\lambda \log n$, and is motivated as an approximation to the Bayesian posterior marginal likelihood in Kass and Wasserman (1995). AICc and BIC selection choose $\lambda$ to minimize their respective objectives. The BIC tends to choose simpler models than cross-validation or AICc. Zou, Hastie, and Tibshirani (2007) recommend BIC for lasso penalty selection whenever variable selection, rather than predictive performance, is the primary goal.

3.1.2 Dimension Reduction

Another common solution for taming high dimensional prediction problems is to form a small number of linear combinations of predictors and to use these derived indices as variables in an otherwise standard predictive regression. Two classic dimension reduction techniques are principal components regression (PCR) and partial least squares (PLS).

Penalized linear models use shrinkage and variable selection to manage high dimensionality by forcing the coefficients on most regressors to be close to (or, for lasso, exactly) zero. This can produce suboptimal forecasts when predictors are highly correlated. A transparent illustration of this problem would be a case in which all of the predictors are equal to the forecast target plus an i.i.d. noise term. In this situation, choosing a subset of predictors via lasso penalty is inferior to taking a simple average of the predictors and using this as the sole predictor in a univariate regression. This predictor averaging, as opposed to predictor selection, is the essence of dimension reduction.

PCR consists of a two-step procedure. In the first step, principal components analysis (PCA) combines regressors into a small set of $K$ linear combinations that best preserve the covariance structure among the predictors. This amounts to solving the problem

(3) $\min_{\Gamma, B} \operatorname{trace}\left[ (C - \Gamma B')(C - \Gamma B')' \right],$

subject to

$\operatorname{rank}(\Gamma) = \operatorname{rank}(B) = K.$

The count matrix $C$ consists of $n$ rows (one for each document) and $p$ columns (one for each term). PCA seeks a low-rank representation $\Gamma B'$ that best approximates the text data $C$. This formulation has the character of a factor model. The $n \times K$ matrix $\Gamma$ captures the prevalence of $K$ common components, or “factors,” in each document. The $p \times K$ matrix $B$ describes the strength of association between each word and the factors. As we will see, this reduced-rank decomposition bears a close resemblance to other text analytic methods such as topic modeling and word embeddings.

In the second step, the $K$ components are used in standard predictive regression. As an

example, Foster, Liberman, and Stine (2013) use PCR to build a hedonic real estate pricing model that takes textual content of property listings as an input.6 With text data, where the number of features tends to vastly exceed the observation count, regularized versions of PCA such as predictor thresholding (e.g., Bai and Ng 2008) and sparse PCA (Zou, Hastie, and Tibshirani 2006) help exclude the least informative features to improve predictive content of the dimension-reduced text.

 6 See Stock and Watson (2002a, b) for development of the PCR estimator and an application to macroeconomic forecasting with a large set of numerical predictors.

A drawback of PCR is that it fails to incorporate the ultimate statistical objective (forecasting a particular set of attributes) in the dimensionality reduction step. PCA condenses text data into indices based on the covariation among the predictors. This happens prior to the forecasting step and without consideration of how predictors associate with the forecast target.

In contrast, PLS performs dimension reduction by directly exploiting covariation of predictors with the forecast target.7 Suppose we are interested in forecasting a scalar attribute $v_i$. PLS regression proceeds as follows. For each element $j$ of the feature vector $c_i$, estimate the univariate covariance between $v_i$ and $c_{ij}$. This covariance, denoted $\varphi_j$, reflects the attribute’s “partial” sensitivity to each feature $j$. Next, form a single predictor by averaging all attributes into a single aggregate predictor $\hat{v}_i = \sum_j \varphi_j c_{ij} / \sum_j \varphi_j$. This forecast places the highest weight on the strongest univariate predictors, and the least weight on the weakest. In this way, PLS performs its dimension reduction with the ultimate forecasting objective in mind. The description of $\hat{v}_i$ reflects the $K = 1$ case, i.e., when text is condensed into a single predictive index. To use additional predictive indices, both $v_i$ and $c_{ij}$ are orthogonalized with respect to $\hat{v}_i$, the above procedure is repeated on the orthogonalized data set, and the resulting forecast is added to the original $\hat{v}_i$. This is iterated until the desired number of PLS components $K$ is reached. Like PCR, PLS components describe the prevalence of $K$ common factors in each document. And also like PCR, PLS can be implemented with a variety of regularization schemes to aid its performance in the ultra-high-dimensional world of text. Section 4 discusses applications using PLS in text regression.

 7 See Kelly and Pruitt (2013, 2015) for the asymptotic theory of PLS regression and its application to forecasting risk premia in financial markets.

PCR and PLS share a number of common properties. In both cases, $K$ is a user-controlled parameter which, in many social science applications, is selected ex ante by the researcher. But, like any hyperparameter, $K$ can be tuned via cross-validation. And neither method is scale invariant: the forecasting model is sensitive to the distribution of predictor variances. It is therefore common to variance-standardize features before applying PCR or PLS.

3.1.3 Nonlinear Text Regression

Penalized linear models are the most widely applied text regression tools due to their simplicity, and because they may be viewed as a first-order approximation to potentially nonlinear and complex data generating processes (DGPs). In cases where a linear specification is too restrictive, there are several other machine learning tools that are well suited to represent nonlinear associations between text $c_i$ and outcome attributes $v_i$. Here we briefly describe four such nonlinear regression methods (generalized linear models, support vector machines, regression trees, and deep learning) and provide references for readers interested in thorough treatments of each.

GLMs and SVMs. One way to capture nonlinear associations between $c_i$ and $v_i$ is

with a generalized linear model (GLM). These expand the linear model to include nonlinear functions of $c_i$ such as polynomials or interactions, while otherwise treating the problem with the penalized linear regression methods discussed above.

A related method used in the social science literature is the support vector machine, or SVM (Vapnik 1995). This is used for text classification problems (when $V$ is categorical), the prototypical example being email spam filtering. A detailed discussion of SVMs is beyond the scope of this review, but from a high level, the SVM finds hyperplanes in a basis expansion of $C$ that partition the observations into sets with equal response (i.e., so that $v_i$ are all equal in each region).8

 8 Hastie, Tibshirani, and Friedman (2009, chapter 12) and Murphy (2012, chapter 14) provide detailed overviews of GLMs and SVMs. Joachims (1998) and Tong and Koller (2001) (among others) study text applications of SVMs.

GLMs and SVMs both face the limitation that, without a priori assumptions for which basis transformations and interactions to include, they may overfit and require extensive tuning (Hastie, Tibshirani, and Friedman 2009; Murphy 2012). For example, multi-way interactions increase the parameterization combinatorially and can quickly overwhelm the penalization routine, and their performance suffers in the presence of many spurious “noise” inputs (Hastie, Tibshirani, and Friedman 2009).9

 9 Another drawback of SVMs is that they cannot be easily connected to the estimation of a probabilistic model and the resulting fitted model can sometimes be difficult to interpret. Polson and Scott (2011) provide a pseudo-likelihood interpretation for a variant of the SVM objective. Our own experience has led us to lean away from SVMs for text analysis in favor of more easily interpretable models. Murphy (2012, chapter 14.6) attributes the popularity of SVMs in some application areas to an ignorance of alternatives.

Regression Trees. Regression trees have become a popular nonlinear approach for incorporating multi-way predictor interactions into regression and classification problems. The logic of trees differs markedly from traditional regressions. A tree “grows” by sequentially sorting data observations into bins based on values of the predictor variables. This partitions the data set into rectangular regions, and forms predictions as the average value of the outcome variable within each partition (Breiman et al. 1984). This structure is an effective way to accommodate rich interactions and nonlinear dependencies.

Two extensions of the simple regression tree have been highly successful thanks to clever regularization approaches that minimize the need for tuning and avoid overfitting. Random forests (Breiman 2001) average predictions from many trees that have been randomly perturbed in a bootstrap step. Boosted trees (e.g., Friedman 2002) recursively combine predictions from many oversimplified trees.10

 10 Hastie, Tibshirani, and Friedman (2009) provide an overview of these methods. In addition, see Wager, Hastie, and Efron (2014) and Wager and Athey (2018) for results on confidence intervals for random forests, and see Taddy et al. (2015) and Taddy et al. (2016) for an interpretation of random forests as a Bayesian posterior over potentially optimal trees.

The benefits of regression trees (nonlinearity and high-order interactions) are sometimes lessened in the presence of high-dimensional inputs. While we would generally recommend tree models, and especially random forests, they are often not worth the effort for simple text regression. Often, a more beneficial use of trees is in a final prediction step after some dimension reduction derived from the generative models in section 3.2.

Deep Learning. There is a host of other machine learning techniques that have been applied to text regression. The most common techniques not mentioned thus far are neural networks, which typically allow the inputs to act on the response through one

or more layers of interacting nonlinear basis functions (e.g., see Bishop 1995). A main attraction of neural networks is their status as universal approximators, a theoretical result describing their ability to mimic general, smooth nonlinear associations.

In high-dimensional and very noisy settings, such as in text analysis, classical neural nets tend to suffer from the same issues referenced above: they often overfit and are difficult to tune. However, the recently popular “deep” versions of neural networks (with many layers, and fewer nodes per layer) incorporate a number of innovations that allow them to work better, faster, and with little tuning, even in difficult text analysis problems. Such deep neural nets (DNNs) are now the state-of-the-art solution for many machine learning tasks (LeCun, Bengio, and Hinton 2015).11 DNNs are now employed in many complex natural language processing tasks, such as translation (Sutskever, Vinyals, and Le 2014; Wu et al. 2016) and syntactic parsing (Chen and Manning 2014), as well as in exercises of relevance to social scientists; for example, Iyyer et al. (2014) infer political ideology from text using a DNN. They are frequently used in conjunction with richer text representations such as word embeddings, described more below.

 11 Goodfellow, Bengio, and Courville (2016) provide a thorough textbook overview of these “deep learning” technologies, while Goldberg (2016) is an excellent primer on their use in natural language processing.

3.1.4 Bayesian Regression Methods

The penalized methods above can all be interpreted as posterior maximization under some prior. For example, ridge regression maximizes the posterior under independent Gaussian priors on each coefficient, while Park and Casella (2008) and Hans (2009) give Bayesian interpretations to the lasso. See also the horseshoe of Carvalho, Polson, and Scott (2010) and the double Pareto of Armagan, Dunson, and Lee (2013) for Bayesian analogues of diminishing bias penalties like the log penalty on the right of figure 1.

For those looking to do a full Bayesian analysis for high-dimensional (e.g., text) regression, an especially appealing model is the spike-and-slab introduced in George and McCulloch (1993). This models the distribution over regression coefficients as a mixture between two densities centered at zero: one with very small variance (the spike) and another with large variance (the slab). This model allows one to compute posterior variable inclusion probabilities as, for each coefficient, the posterior probability that it came from the slab and not the spike component. Due to a need to integrate over the posterior distribution, e.g., via Markov chain Monte Carlo (MCMC), inference for spike-and-slab models is much more computationally intensive than fitting the penalized regressions of section 3.1.1. However, Yang, Wainwright, and Jordan (2016) argue that spike-and-slab estimates based on short MCMC samples can be useful in application, while Scott and Varian (2014) have engineered efficient implementations of the spike-and-slab model for big data applications. These procedures give a full accounting of parameter uncertainty, which we miss in a quick penalized regression.

3.2 Generative Language Models

Text regression treats the token counts as generic high-dimensional input variables, without any attempt to model structure that is specific to language data. In many settings it is useful to instead propose a generative model for the text tokens to learn about how the attributes influence word choice and account for various dependencies among words and among attributes. In this approach, the words in a document are viewed as the realization of a generative process defined through a probability model for $p(c_i \mid v_i)$.
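To make the idea of a generative process for $p(c_i \mid v_i)$ concrete, here is a minimal sketch in which the token counts of a document are drawn from a multinomial whose probabilities depend on a categorical attribute. Everything specific in it is an assumption made for illustration: the five-token vocabulary, the two attribute values, the probability vectors, and the helper names are hypothetical, and the multinomial form is only one possible choice of sampling model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["tax", "growth", "health", "care", "vote"]   # hypothetical vocabulary

# Hypothetical token probabilities for each attribute value; each vector sums to 1.
q = {
    "economy": np.array([0.35, 0.35, 0.05, 0.05, 0.20]),
    "health":  np.array([0.05, 0.05, 0.40, 0.40, 0.10]),
}

def draw_document(v, m):
    """Draw token counts c_i for a document of length m with attribute v."""
    return rng.multinomial(m, q[v])

def log_likelihood(c, v):
    """Multinomial log-likelihood of counts c given attribute v,
    dropping the combinatorial term that does not depend on v."""
    return float(c @ np.log(q[v]))

c = draw_document("health", m=50)
print(dict(zip(vocab, c.tolist())))
print({v: log_likelihood(c, v) for v in q})
```

Reading the model in the other direction, comparing `log_likelihood(c, v)` across candidate values of `v` is how a fitted generative model is used to learn about attributes from observed text.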

3.2.1 Unsupervised Generative Models

In the unsupervised setting, we have no direct observations of the true attributes $v_i$. Our inference about these attributes must therefore depend entirely on strong assumptions that we are willing to impose on the structure of the model $p(c_i \mid v_i)$. Examples in the broader literature include cases where the $v_i$ are latent factors, clusters, or categories. In text analysis, the leading application has been the case in which the $v_i$ are topics.

A typical generative model implies that each observation $c_i$ is a conditionally independent draw from the vocabulary of possible tokens according to some document-specific token probability vector, say $q_i = [q_{i1} \cdots q_{ip}]'$. Conditioning on document length, $m_i = \sum_j c_{ij}$, this implies a multinomial distribution for the counts

(4) $c_i \sim \mathrm{MN}(q_i, m_i).$

This multinomial model underlies the vast majority of contemporary generative models for text.

Under the basic model in (4), the function $q_i = q(v_i)$ links attributes to the distribution of text counts. A leading example of this link function is the topic model specification of Blei, Ng, and Jordan (2003),12 where

(5) $q_i = v_{i1}\theta_1 + v_{i2}\theta_2 + \cdots + v_{ik}\theta_k = \Theta v_i.$

 12 Standard least-squares factor models have long been employed in “latent semantic analysis” (LSA; Deerwester et al. 1990), which applies PCA (i.e., singular value decompositions) to token count transformations such as $x_i = c_i / m_i$ or $x_{ij} = c_{ij} \log(d_j)$ where $d_j = \sum_i 1_{[c_{ij} > 0]}$. Topic modeling and its precursor, probabilistic LSA, are generally seen as improving on such approaches by replacing arbitrary transformations with a plausible generative model.

Many readers will recognize the model in (5) as a factor model for the vector of normalized counts for each token in document $i$, $c_i / m_i$. Indeed, a topic model is simply a factor model for multinomial data. Each topic is a probability vector over possible tokens, denoted $\theta_l$, $l = 1, \ldots, k$ (where $\theta_{lj} \ge 0$ and $\sum_{j=1}^{p} \theta_{lj} = 1$). A topic can be thought of as a cluster of tokens that tend to appear in documents. The latent attribute vector $v_i$ is referred to as the set of topic weights (formally, a distribution over topics, $v_{il} \ge 0$ and $\sum_{l=1}^{k} v_{il} = 1$). Note that $v_{il}$ describes the proportion of language in document $i$ devoted to the $l$th topic. We can allow each document to have a mix of topics, or we can require that one $v_{il} = 1$ while the rest are zero, so that each document has a single topic.13

 13 Topic modeling is alternatively labeled as “latent Dirichlet allocation” (LDA), which refers to the Bayesian model in Blei, Ng, and Jordan (2003) that treats each $v_i$ and $\theta_l$ as generated from a Dirichlet-distributed prior. Another specification that is popular in political science (e.g., Quinn et al. 2010) keeps $\theta_l$ as Dirichlet-distributed but requires each document to have a single topic. This may be most appropriate for short documents, such as press releases or single speeches.

Since its introduction into text analysis, topic modeling has become hugely popular.14 (See Blei 2012 for a high-level overview.) The model has been especially useful in political science (e.g., Grimmer 2010), where researchers have been successful in attaching political issues and beliefs to the estimated latent topics.

 14 The same model was independently introduced in genetics by Pritchard, Stephens, and Donnelly (2000) for factorizing gene expression as a function of latent populations; it has been similarly successful in that field. Latent Dirichlet allocation is also an extension of a related mixture modeling approach in the latent semantic analysis of Hofmann (1999).

Since the $v_i$ are of course latent, estimation for topic models tends to make use of some alternating inference for $V \mid \Theta$ and $\Theta \mid V$. One possibility is to employ a version of the expectation-maximization (EM) algorithm to either maximize the likelihood implied by

(4) and (5) or, after incorporating the usual Dirichlet priors on $v_i$ and $\theta_l$, to maximize the posterior; this is the approach taken in Taddy (2012; see this paper also for a review of topic estimation techniques). Alternatively, one can target the full posterior distribution $p(\Theta, V \mid c_i)$. Estimation, say for $\Theta$, then proceeds by maximization of the estimated marginal posterior, say $p(\Theta \mid c_i)$.

Due to the size of the data sets and dimension of the models, posterior approximation for topic models usually uses some form of variational inference (Wainwright and Jordan 2008) that fits a tractable parametric family to be as close as possible (e.g., in Kullback–Leibler divergence) to the true posterior. This variational approach was used in the original Blei, Ng, and Jordan (2003) paper and in many applications since. Hoffman et al. (2013) present a stochastic variational inference algorithm that takes advantage of techniques for optimization on massive data; this algorithm is used in many contemporary topic modeling applications. Another approach, which is more computationally intensive but can yield more accurate posterior approximations, is the MCMC algorithm of Griffiths and Steyvers (2004). Alternatively, for quick estimation without uncertainty quantification, the posterior maximization algorithm of Taddy (2012) is a good option.

The choice of $k$, the number of topics, is often fairly arbitrary. Data-driven choices do exist: Taddy (2012) describes a model selection process for $k$ that is based upon Bayes factors, Airoldi et al. (2010) provide a cross-validation (CV) scheme, while Teh et al. (2006) use Bayesian nonparametric techniques that view $k$ as an unknown model parameter. In practice, however, it is very common to simply start with a number of topics on the order of ten, and then adjust the number of topics in whatever direction seems to improve interpretability. Whether this ad hoc procedure is problematic depends on the application. As we discuss below, in many applications of topic models to date, the goal is to provide an intuitive description of text, rather than inference on some underlying “true” parameters; in these cases, the ad hoc selection of the number of topics may be reasonable.

The basic topic model has been generalized and extended in a variety of ways. A prominent example is the dynamic topic model of Blei and Lafferty (2006), which considers documents that are indexed by date (e.g., publication date for academic articles) and allows the topics, say $\Theta_t$, to evolve smoothly in time. Another example is the supervised topic model of Blei and McAuliffe (2007), which combines the standard topic model with an extra equation relating the weights $v_i$ to some additional attribute $y_i$ in $p(y_i \mid v_i)$. This pushes the latent topics to be relevant to $y_i$ as well as the text $c_i$. In these and many other extensions, the modifications are designed to incorporate available document metadata (in these examples, time and $y_i$ respectively).

3.2.2 Supervised Generative Models

In supervised models, the attributes $v_i$ are observed in a training set and thus may be directly harnessed to inform the model of text generation. Perhaps the most common supervised generative model is the so-called naive Bayes classifier (e.g., Murphy 2012), which treats counts for each token as independent with class-dependent means. For example, the observed attribute might be author identity for each document in the corpus with the model specifying different mean token counts for each author.

In naive Bayes, $v_i$ is a univariate categorical variable and the token count distribution is factorized as $p(c_i \mid v_i) = \prod_j p_j(c_{ij} \mid v_i)$, thus “naively” specifying conditional independence between tokens $j$. This rules out the possibility that by choosing to say one token (say, “hello”) we reduce the probability