Semi-supervised Learning of Compact Document Representations
                       with Deep Networks

Marc’Aurelio Ranzato                                                        ranzato@courant.nyu.edu
Courant Institute, New York University, 719 Broadway 12th fl., New York NY 10003, USA
Martin Szummer                                                            szummer@microsoft.com
Microsoft Research Cambridge, 7 J J Thomson Avenue, Cambridge CB3 0FB, UK

                              Abstract

    Finding good representations of text documents is crucial in information retrieval and classification systems. Today the most popular document representation is based on a vector of word counts in the document. This representation neither captures dependencies between related words, nor handles synonyms or polysemous words. In this paper, we propose an algorithm to learn text document representations based on semi-supervised autoencoders that are stacked to form a deep network. The model can be trained efficiently on partially labeled corpora, producing very compact representations of documents, while retaining as much class information and joint word statistics as possible. We show that it is advantageous to exploit even a few labeled samples during training.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

Document representations are a key ingredient in all information retrieval and processing systems. The goal of the representation is to make certain aspects of the document readily accessible, e.g. the document topic. To identify a document topic, we cannot rely on specific words in the document, as it may use other synonymous words or misspellings. Likewise, the presence of a word does not warrant that the document is related to it, as it may be taken out of context, or polysemous, or unimportant to the document topic.

The most widespread representations for document classification and retrieval today are based on a vector of counts. These include various term-weighting retrieval schemes, such as tf-idf and BM25 (Robertson and Walker, 1994), and bag-of-words generative models such as naive Bayes text classifiers. The pertinent feature of these representations is that they represent individual words. A serious drawback of the basic tf-idf and BM25 representations is that all dimensions are treated as independent, whereas in reality word occurrences are highly correlated.

There have been many attempts at modeling word correlations by rotating the vector space and projecting documents onto principal axes that expose related words. Methods include LSI (Deerwester et al., 1990) and pLSI (Hofmann, 1999). These methods constitute a linear re-mapping of the original vector space, and while an improvement, still can only capture very limited relations between words. As a result they need a large number of projections in order to give an appropriate representation.

Other models, such as LDA (Blei et al., 2003), have shown superior performance over pLSI and LSI. However, inferring the representation is computationally expensive because of the “explaining away” effect that plagues all directed graphical models.
More recently, a number of authors have proposed undirected graphical models that can make inference efficient at the cost of more complex learning due to a global (rather than local) partition function whose exact gradient is intractable. These models build on Restricted Boltzmann Machines (RBMs) by adapting the conditional distribution of the input visible units to model discrete counts of words (Hinton and Salakhutdinov, 2006; Gehler et al., 2006; Salakhutdinov and Hinton, 2007a,b). These models have shown state-of-the-art performance in retrieval and clustering, and can be easily used as a building block for deep multi-layer networks (Hinton et al., 2006). This might allow the top-level representation to capture high-order correlations that would be difficult to efficiently represent with similar but shallow models (Bengio and LeCun, 2007). Many authors have pointed out that RBMs are robust to uncorrelated noise in the input since they model the distribution of the input data, and they implicitly perform automatic model selection by not using unnecessary hidden units. But they are also somewhat cumbersome to train, relying on two disparate steps: unsupervised pre-training using an approximate sampling technique such as contrastive divergence (Hinton, 2000), followed by supervised back-propagation. It is rather difficult to predict when training can be stopped and how long the Markov Chain has to run. An alternative is to replace RBMs with autoencoders (Bengio et al., 2006), or special autoencoders that produce sparse representations (Ranzato et al., 2007b). According to these authors, the performance of RBMs and standard autoencoders is quite similar as long as the dimensionality of the latent space is smaller than the input. Seeking an algorithm that can be trained efficiently, and that can produce a representation with just a few matrix multiplications, we propose a deep network whose building blocks are autoencoders, with a specially designed first layer for modeling discrete counts of words.

Previously, deep networks have been trained either from fully labeled data, or purely unlabeled data. Neither method is ideal, as it is expensive to label large collections, whereas purely unsupervised learning may not capture the relevant class information in the data. Inspired by the experiments by Bengio, Lamblin et al. (2006), we learn the parameters of the model by using both a supervised and an unsupervised objective. In other words, we require the representation to produce good reconstructions of the input documents and, at the same time, to give good predictions of the document class labels. Besides demonstrating better accuracy in retrieval, we also extend the deep network framework to a semi-supervised setting where we deal with partially labeled collections of documents. This allows us to use relatively few labeled documents yet leverage language structure learned from large corpora, see Sec. 3.1.

Finally, we study the relative advantages of different deep models. For instance, we investigate when deep models are better than shallow ones. Our experiments in Sec. 3.2 show that for learning compact representations of documents, deep architectures greatly outperform shallow models. Compact representations are beneficial because they require less storage (an important consideration for large search engines), and they are more computationally efficient when used in indexing. We also explored the possibility to use deep networks to learn binary high-dimensional representations instead of compact representations. These high-dimensional representations were trained using the Symmetric Encoding Sparse Machine (SESM) (Ranzato et al., 2007b). However, the compact representations proved to be far more efficient in terms of memory usage and CPU time, as described in Sec. 3.3. Also, training is more computationally efficient than for related models such as RBMs.

Figure 1. Architecture of a model with three stages. The system is trained layer by layer. During the training of the n-th layer, the n-th encoder is coupled with the n-th decoder and classifier (shown in dashed line). The n-th encoder will provide the codes to train the layer above. After training, the feedback decoding modules are discarded and the system is used to produce very compact codes by a feed-forward pass through the chain of encoders.

2. The model

The input to the system is a bag of words representation of each text document in the form of a count vector. The length of the vector equals the number of unique words in the collection, and its i-th entry stores the number of times the corresponding word occurs in the document. The goal of the system is to extract a compact representation from this very high-dimensional but sparse input vector. A compact representation is good because it requires less storage, and allows fast index lookup. Since the representation is produced by a deep multi-layer model, it can efficiently discover latent topics by grouping similar words and by activating features whenever some “interesting” combination of words is detected (see visualization in Sec. 3.4).

We propose a system that is composed of multiple layers. Each layer computes a weighted sum of its input followed by a logistic nonlinearity. Each layer can be seen as an encoder producing a representation, or code, from its input. This code will be propagated and used as the input to the next layer of the model. This architecture is quite similar to a neural network model, but is trained differently and has a special first layer able to encode discrete count data. The goal of training is to find the parameters in each layer.
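A minimal numpy sketch of this layered encoding is given below; it is an illustration, not the authors' code, and it assumes the weight matrices and biases have already been learned as described in the following sections (the log(1 + x) transform of the first stage is introduced in Sec. 2.1).

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def encode(x, layers):
        """Feed-forward pass through a chain of encoders.

        `layers` is a list of (W_E, b_E) pairs, one per stage; the first
        stage operates on the log of the shifted word counts (Sec. 2.1).
        """
        z = np.log1p(x)                 # log(1 + counts)
        for W_E, b_E in layers:
            z = sigmoid(W_E @ z + b_E)  # weighted sum followed by a logistic
        return z                        # compact code from the top layer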
In order to successfully learn the parameters we follow the strategy advocated by recent work (Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Bengio et al., 2006) on deep multi-layer models. Learning proceeds greedily layer by layer. When the parameters of one layer have been found, the data is fed through that layer and the output becomes the input for the next layer, which is trained subsequently.

Let us consider a generic layer in the model, and let x be its input and let z be the representation produced by the layer. In order to warrant the fidelity of the code z, we attach a feedback module which aims to reconstruct the input x from the code z. The reason is that if the model achieves a good reconstruction from the code z, then we can be sure that the representation has preserved most of the information from x. The original layer can then be interpreted as an encoder that computes a code from the input, while the feedback module can be seen as a decoder that reconstructs the input x from the code z. Learning consists of minimizing a reconstruction error E_R with respect to the parameters in the encoder and decoder when the input x is drawn from the training dataset. Since we learn by stochastic gradient descent, any type of encoder and decoder is allowed as long as it is differentiable.

Some inputs may have labels specifying the class to which they belong. In order to incorporate this information, we add another module to the feedback path: the feedback module now not only has a decoder reconstructing the input x, but also a classifier predicting the label y from the code z, see Fig. 1. During training the parameters of the encoder, classifier and decoder are learned by minimizing the loss

    L = E_R + \alpha_c E_C,                                        (1)

where E_R and E_C are terms measuring the reconstruction and classification error respectively, and α_c is a coefficient balancing them. The first term is common to many unsupervised learning algorithms and makes the system model the structure and the dependencies among the input components of x. The second term represents the supervised goal ensuring that codes are also going to be good for discriminating between classes. For the classifier module we used a linear classifier trained by cross-entropy error E_C. Denoting with (W_C)_i the i-th row of the classifier weight matrix, with b_{C_i} the i-th bias of the classifier, and with h_j the j-th output unit of the classifier passed through a soft-max:

    h_j = \frac{\exp((W_C)_j \cdot z + b_{C_j})}{\sum_i \exp((W_C)_i \cdot z + b_{C_i})},

we define E_C = -\sum_i y_i \log h_i, where y is a 1-of-N encoding of the target class label.
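The combined loss of eq. 1 for one layer can be sketched in a few lines of numpy; this is an illustrative reimplementation under our own conventions, not the released code, with the reconstruction error left generic so that either the Poisson decoder of Sec. 2.1 or the Gaussian decoder of Sec. 2.2 can supply it.

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    def layer_loss(z, E_R, y, W_C, b_C, alpha_c):
        """Loss of eq. 1 for one layer: L = E_R + alpha_c * E_C.

        `z` is the code, `E_R` the reconstruction error computed by the
        stage's decoder, and `y` a 1-of-N target vector (None if unlabeled).
        """
        if y is None:                      # unlabeled sample: loss reduces to E_R
            return E_R
        h = softmax(W_C @ z + b_C)         # classifier output units
        E_C = -np.sum(y * np.log(h))       # cross-entropy error
        return E_R + alpha_c * E_C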
Figure 2. The architecture of the first stage has three components: (1) an encoder, (2) a decoder (Poisson regressor), and (3) a classifier. The loss is the weighted sum of cross-entropy (CE) and negative log-likelihood (NLL) under the Poisson model.

2.1. Training the First Stage

The first stage is special because the input x is a discrete vector of word counts, with x_i counting the number of occurrences of the i-th word in the document. The decoder is a Poisson regression model aiming to predict x from the code z. A Poisson regressor is a log-linear model which assigns the following probability to an observed x:

    P(x) = \prod_i P(x_i) = \prod_i e^{-\lambda_i} \frac{\lambda_i^{x_i}}{x_i!},                 (2)

where the set of rates is given by λ = β e^{W_D z + b_D}, with a decoder weight matrix W_D, decoder biases b_D, and a constant β proportional to the document length. This normalization handles documents of different lengths and makes learning stable. The reconstruction error E_R minimized at the first stage (eq. 1) is the negative log-likelihood of the data

    E_R = \sum_i \left( \beta e^{(W_D)_i \cdot z + b_{D_i}} - x_i (W_D)_i \cdot z - x_i b_{D_i} + \log x_i! \right),        (3)

averaged over the samples x in the training dataset.

We design the encoder by “reverse-engineering” the decoder to make the machine symmetric. Since the decoder computes an exponential of a weighted sum, the encoder performs a weighted sum of the log-transformed input x. In addition to this, the encoder applies a logistic nonlinearity. Hence, the code z is given by z = σ(W_E log(x) + b_E), where W_E and b_E are the weight matrix and the biases in the encoder, and σ is the logistic. Since many components in x are zero (because only a few dictionary words are actually present in a given document), and since the rate at which a word might appear is generally fairly low, this architecture is prone to numerical problems in the evaluation of the logarithm, and would possibly require large negative weights in W_D (in order to make the rate λ small). Thus, we shift the Poisson regression by adding one to the input.
As a result, if a word does not occur in the document, the input to the encoder weight matrix W_E will be zero (and not minus infinity), and if a word is rare its rate will be one, forcing the corresponding weights in W_D to be close to zero (and not to minus infinity). Fig. 2 shows the final architecture of the first stage.
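For concreteness, here is a minimal numpy/scipy sketch of the first stage. It is one plausible reading of the description above rather than the authors' code; in particular, applying the shift by one both to the encoder input and to the Poisson target is our assumption.

    import numpy as np
    from scipy.special import gammaln   # log(n!) = gammaln(n + 1)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def first_stage(x, W_E, b_E, W_D, b_D, beta):
        """First stage: log-transformed encoder and shifted Poisson decoder."""
        x1 = x + 1.0                          # shift the raw counts by one
        z = sigmoid(W_E @ np.log(x1) + b_E)   # encoder code (Sec. 2.1)
        rate = beta * np.exp(W_D @ z + b_D)   # Poisson rates; beta ~ document length
        # Poisson negative log-likelihood of the shifted counts: the
        # reconstruction error E_R (cf. eq. 3, constants kept for clarity).
        E_R = np.sum(rate - x1 * np.log(rate) + gammaln(x1 + 1.0))
        return z, E_R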
2.2. Training the Upper Stages

The outputs of earlier layers are fed as inputs of subsequent layers. The architecture of the subsequent layers differs from the first one in that the decoder uses a Gaussian regressor instead of a Poisson regressor. Accordingly, the encoder computes a weighted sum of its input and applies a logistic nonlinearity. This architecture is similar to an autoencoder neural network, but here the feedback layer also includes a supervised classifier. If z^{(n-1)} is the input to the n-th layer, the code z^{(n)} produced at this stage is z^{(n)} = σ(W_E z^{(n-1)} + b_E). The reconstruction error E_R in the loss of eq. 1 can be written as E_R = ||z^{(n-1)} - W_D z^{(n)} - b_D||_2^2.
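A corresponding sketch for an upper stage, under the same caveats as before (reusing numpy and the sigmoid helper from the previous sketch), is:

    def upper_stage(z_prev, W_E, b_E, W_D, b_D):
        """Upper stage: logistic encoder and Gaussian (linear) decoder.

        Returns the code z^(n) and the squared reconstruction error
        E_R = ||z^(n-1) - W_D z^(n) - b_D||^2 of Sec. 2.2.
        """
        z = sigmoid(W_E @ z_prev + b_E)        # code of the n-th layer
        residual = z_prev - (W_D @ z + b_D)    # Gaussian regressor residual
        E_R = np.sum(residual ** 2)
        return z, E_R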
2.3. Training the Whole Model

Learning consists of determining the parameters at each layer of the deep model. The algorithm proceeds as follows:
(1) attach a Poisson regressor and a linear classifier to the first layer, and minimize the loss in eq. 1 with respect to the parameters (W_E, b_E, W_D, b_D, W_C, b_C) by stochastic gradient descent;
(2) transform the training samples x into codes z^{(1)} using the trained encoder of the first layer;
(3) train the second layer by attaching a Gaussian regressor and a linear classifier to the encoder, using the codes z^{(1)} as input;
(4) use the trained encoder of the second layer to transform the codes z^{(1)} into the higher-level codes z^{(2)};
(5) repeat the previous two steps for as many layers as desired.

When the input sample is not accompanied by a label, the classifier is not updated and the loss function simply reduces to L = E_R. In order to minimize the loss with respect to the parameters we use stochastic gradient descent and we back-propagate the derivatives through the decoder, classifier and encoder (LeCun et al., 1998).
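The greedy loop can be written schematically as follows. The per-stage `grad_step` and `encode` methods are placeholders for the machinery sketched in the previous sections (they are not part of the paper); the function only fixes the order of operations of steps (1)-(5).

    def train_greedy(docs, labels, stages, epochs=4):
        """Greedy layer-wise training loop of Sec. 2.3 (schematic).

        `stages` is a list of per-layer objects, each providing
        grad_step(x, y): one SGD step on eq. 1 (on E_R alone if y is None),
        encode(x): the trained encoder's forward pass.
        """
        inputs = list(docs)
        for stage in stages:
            for _ in range(epochs):
                for x, y in zip(inputs, labels):
                    stage.grad_step(x, y)   # first stage: Poisson decoder; above: Gaussian
            # Feed the data through the trained encoder; its codes become
            # the input used to train the layer above.
            inputs = [stage.encode(x) for x in inputs]
        return stages   # at test time only the chain of encoders is kept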
The learning algorithm is particularly efficient. The computational cost of learning is linear in the number of training samples (sublinear for redundant datasets, which are frequent). For each training document at any given layer, the cost is given by a forward and backward pass through encoder, decoder and classifier. Each pass is dominated by a matrix-vector multiplication whose complexity depends on the size of the matrix. Since at each layer we reduce the dimensionality of the input, the first layer dominates the computational cost. However, the sparsity of the input count vector can be exploited to speed up the computation by taking into account only those rows in W_E that are involved in the computation. In general, the computational cost at a given layer scales as 4MN + 2NK, where M is the dimensionality of the input, N is the dimensionality of the code, and K is the number of classes.
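As a rough worked example (our arithmetic, plugging in the first two layers of the 2000-200-100-50-20 architecture used in Sec. 3.1, with K = 20 classes), the first layer indeed dominates:

    4 \cdot 2000 \cdot 200 + 2 \cdot 200 \cdot 20 \approx 1.6 \times 10^6      (first layer: M = 2000, N = 200)
    4 \cdot 200 \cdot 100 + 2 \cdot 100 \cdot 20 \approx 8.4 \times 10^4       (second layer: M = 200, N = 100)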
If we are interested in classification we can also use the trained classifier to predict the labels from the features (at any layer), without training a separate supervised system (see Sec. 3.1 for an example). Also, our experiments show that there is not much advantage in “fine-tuning” the parameters by doing global non-greedy supervised training of the machine as performed by Hinton et al. (2006). The label injection during the greedy training of each layer renders this final supervised training stage unnecessary. This saves a lot of time because it is expensive to do forward and backward propagation through a large and deep network.

Inference is also very efficient. Once the model is trained the encoders are stacked and the decoder and classifier modules are removed. A feature vector is computed by a forward propagation of the input sparse count vector through the sequence of encoders. This computation requires a few matrix-vector multiplications, where the most expensive one is at the first layer, which can benefit further from a sparse computation.

3. Experiments

In our experiments we considered three standard datasets: 20 Newsgroups, Reuters-21578, and Ohsumed¹. The 20 Newsgroups dataset contains 18845 postings taken from the Usenet newsgroup collection. Documents are partitioned into 20 topics. The dataset is split into 11314 training documents and 7531 test documents. Training and test articles are separated in time. Reuters has a predefined ModApte split of the data into 11413 training documents and 4024 test documents. Documents belong to one of 91 topics. The Ohsumed dataset has 34389 documents with 30689 words and each document might be assigned to more than one topic, for a total of 23 topics. The dataset is split into training and test by randomly selecting 67% and 33% of the data. Rainbow² was used to pre-process these datasets by stemming the documents, removing stop words and words appearing less than three times or in only a single document, and retaining between 1000 and 30,000 words with the highest mutual information.

    ¹ These corpora were downloaded from http://people.csail.mit.edu/jrennie/20Newsgroups and http://www.kyb.mpg.de/bs/people/pgehler/rap
    ² Rainbow is available at http://www.cs.cmu.edu/~mccallum/bow/rainbow
Figure 3. SVM classification of documents from the 20 Newsgroups dataset (2000 word vocabulary) trained with between 2 and 50 labeled samples per class. The SVM was applied to representations from the deep model trained in a semi-supervised or unsupervised way, and to the tf-idf representation. The numbers in parentheses denote the number of code units. Error bars indicate one standard deviation. The fourth layer representation has only 20 units, and is much more compact and computationally efficient than all the other representations.

Figure 4. Precision-recall curves for the Reuters dataset comparing a linear model (LSI) to the nonlinear deep model with the same number of code units (in parentheses). Retrieval is done using the k most similar documents according to cosine similarity, with k ∈ [1 . . . 4095].

Unless stated otherwise, we trained each layer of the network for only 4 epochs over the whole training dataset. Convergence took only a couple of epochs, and was robust to the choice of the learning rate. This was set to about 10⁻⁴ when training the first layer, and to 10⁻³ when training the layers above. The learning rate was exponentially decreased by multiplying it by 0.97 every 1000 samples. A small L1 regularizer on the parameters was added to the loss. Each weight was randomly initialized, and was updated by taking a gradient step with a regularizer given by the value of the learning rate times 5·10⁻⁴ times the sign of the weight. The value of α_c in eq. 1 was set to the ratio between the number of input units in the layer and the number of classes in order to make the two error terms E_R and E_C comparable. Its exact value did not affect the performance as long as it had the right order of magnitude.
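Read literally, this update rule amounts to the following sketch (our reading of the schedule above, not the released training code):

    import numpy as np

    def sgd_step(W, grad, lr, t):
        """One stochastic gradient step with the schedule of Sec. 3.

        `t` counts training samples; the learning rate decays by a factor
        of 0.97 every 1000 samples, and the L1 regularizer enters through
        the sign of the weights, scaled by the current learning rate.
        """
        lr_t = lr * (0.97 ** (t // 1000))             # exponential decay
        return W - lr_t * (grad + 5e-4 * np.sign(W))  # gradient + L1 (sign) term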
3.1. The Value of Labels

In order to assess whether semi-supervised training was better than purely unsupervised training, we trained the deep model on the 20 Newsgroups dataset using only 2, 5, 10, 20 and 50 samples per class. During training we showed the system 10 labeled samples every 100 examples by sweeping more often over the labeled data. This procedure was repeated at each layer during training. We trained 4 layers for 10 epochs with an architecture of 2000-200-100-50-20, denoting 2000 inputs, 200 hidden units at the first layer, 100 at the second, 50 at the third, and 20 at the fourth. Then, we trained a Support Vector Machine³ (SVM) with a Gaussian kernel on (1) the codes that corresponded to the labeled documents, and we compared the accuracy of the semi-supervised model to the one achieved by a Gaussian SVM trained on the features produced by (2) the same model but trained in an unsupervised way, and by (3) the tf-idf representation of the same labeled documents. The SVM was generally tuned by five-fold cross validation on the available labeled samples (but two-fold cross validation when using only two samples per class). Fig. 3 demonstrates that the learned features gave much better accuracy than the tf-idf representation overall when labeled data was scarce. The model was able to exploit the very few labeled samples, producing features that were easier to discriminate. The performance actually improved when the dimensionality of the code was reduced and only 2 or 5 labeled samples per class were available, probably because a more compact code implicitly enforces a stronger regularization. Semi-supervised training outperformed unsupervised training, and the gap widened as we increased the number of labeled samples, indicating that the unsupervised method had failed to model information relevant for classification when compressing to a low-dimensional space.

    ³ We used the libsvm package available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
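The sweeping scheme (roughly 10 labeled samples shown for every 100 examples) can be sketched as a data stream; the exact interleaving policy is our assumption.

    import itertools

    def semi_supervised_stream(unlabeled, labeled, labeled_per_100=10):
        """Yield (x, y) pairs, revisiting the small labeled set more often.

        `unlabeled` yields count vectors x; `labeled` is a small list of
        (x, y) pairs cycled so that roughly `labeled_per_100` labeled
        examples are shown for every 100 examples overall.
        """
        labeled_cycle = itertools.cycle(labeled)
        for i, x in enumerate(unlabeled):
            yield x, None                           # unlabeled: loss reduces to E_R
            if i % (100 // labeled_per_100) == 0:   # interleave a labeled example
                yield next(labeled_cycle)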
Figure 5. Precision-recall curves for the Reuters dataset comparing shallow models (one-layer) to deep models with the same number of code units. The deep models are more accurate overall when the codes are extremely compact. This also suggests that the number of hidden units has to be gradually decreased from layer to layer.

Figure 6. Precision-recall curves for the 20 Newsgroups dataset comparing the performance of tf-idf versus a one-layer shallow model with 200 code units for varying sizes of the word dictionary (from 1000 to 10000 words).

Interestingly, if we classify the data using the classifier of the feedback module we obtain a performance similar to the one achieved by the Gaussian SVM. For example, when all training samples are labeled the classifier at the first stage achieves accuracy of 76.3% (as opposed to 75.5% of the SVM trained either on the learned representation or on tf-idf), while the one on the fourth layer achieves accuracy of 74.8%. Hence, the training algorithm provides an accurate classifier as a side product of the training, reducing the overall learning time.

3.2. Deep or Shallow?

In all the experiments discussed in this section the model was trained using fully labeled data (still, training also includes an unsupervised objective as discussed earlier). In order to retrieve documents after training the model, all documents are mapped into the latent low-dimensional space, the cosine similarity between each document in the test dataset and each document in the training dataset is measured, and the k most similar documents are retrieved. k is chosen to be equal to 1, 3, 7, ..., 4095. Based on the topic label of the documents, we assess the performance by computing the recall and the precision averaged over the whole test dataset.
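This retrieval protocol is straightforward to express in numpy. The sketch below is our illustration, with a simplified notion of relevance: a retrieved training document counts as relevant when it shares the query's (single) topic label.

    import numpy as np

    def precision_recall_at_k(test_codes, train_codes, test_labels, train_labels, k):
        """Cosine-similarity retrieval as described in Sec. 3.2 (illustrative).

        Codes are one row per document; labels are 1-D integer arrays.
        """
        test = test_codes / np.linalg.norm(test_codes, axis=1, keepdims=True)
        train = train_codes / np.linalg.norm(train_codes, axis=1, keepdims=True)
        sims = test @ train.T                          # cosine similarities
        precisions, recalls = [], []
        for i in range(len(test)):
            top_k = np.argsort(-sims[i])[:k]           # k most similar training docs
            relevant = (train_labels[top_k] == test_labels[i]).sum()
            total_relevant = (train_labels == test_labels[i]).sum()
            precisions.append(relevant / k)
            recalls.append(relevant / total_relevant)
        return np.mean(precisions), np.mean(recalls)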
In the first experiment, we compared the linear mapping produced by LSI to the nonlinear mapping produced by our model. We considered the Reuters dataset with a 12317 word vocabulary and trained a network with 3 layers. The first layer had 100 code units, the second layer had 40 units in one experiment and 10 in another, and the third layer was trained with either 3 or 2 code units. As shown in Fig. 4, the nonlinear representation is more powerful than the linear one when the representation is very compact.

Another interesting question is whether adding layers is useful. Fig. 5 shows that for a given dimensionality of the output latent space the deep architecture outperforms the shallow one. The deep architecture is capable of capturing more complex dependencies among the input variables than the shallow one, while the representation remains compact. The compactness allows us to efficiently handle very large vocabularies (more than 30,000 words for the Ohsumed, see Sec. 3.4). Fig. 6 shows that increasing the number of words (i.e. the dimensionality of the input) does give better retrieval performance.

3.3. Compact or Binary High-Dimensional?

The most popular representation of documents is tf-idf, a very high-dimensional and sparse representation. One might wonder whether we should learn a high-dimensional representation instead of a compact representation. Unfortunately, the autoencoder based learning algorithm forces us to map data into a lower-dimensional space at each layer, as without additional constraints (Ranzato et al., 2007a) the trivial identity function would be learned.
Figure 7. Precision-recall curves comparing compact representations vs. high-dimensional binary representations. Compact representations can achieve better performance using less memory and CPU time.

We used the sparse encoding symmetric machine (SESM) (Ranzato et al., 2007b) as a building block for training a deep network producing sparse features. SESM is a symmetric autoencoder with a sparsity constraint on the representation, and it is trained without labels. In order to make the sparse representation at the final layer computationally appealing we thresholded it to make it binary. We trained a 2000-1000-1000 SESM network on the Reuters dataset. In order to make a fair comparison with our compact representation, we fixed the information content of the code in terms of precision⁴ at k = 1. We measured the precision and recall of the binary representation of a test document by computing its Hamming distance from the representation of the training documents. We then trained our model with the following number of units: 2000-200-100-7. The last number of units was set to match the precision of the binary representation at k = 1.

    ⁴ The entropy of the representation would be more natural, but its value depends on the quantization level.
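For completeness, a sketch of the binary-code retrieval used in this comparison (our illustration; the 0.5 threshold is an assumption): compact codes are ranked by cosine similarity as in Sec. 3.2, while the binarized codes are ranked by Hamming distance.

    import numpy as np

    def retrieve_binary(query_code, train_codes, k, threshold=0.5):
        """Rank training documents by Hamming distance to a binarized query.

        Codes are thresholded to {0, 1}; smaller Hamming distance means
        more similar, mirroring the comparison of Sec. 3.3.
        """
        q = (query_code > threshold).astype(np.uint8)
        t = (train_codes > threshold).astype(np.uint8)
        hamming = np.sum(t != q, axis=1)        # per-document Hamming distance
        return np.argsort(hamming)[:k]          # indices of the k nearest documents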
Fig. 7 shows that our compact representation outperforms the high-dimensional and binary representation at higher values of k. Just 7 continuous units are able to achieve better retrieval than 1000 binary units! Storing the Reuters dataset with the compact representation takes less than half the memory space needed by the binary representation, and comparing a test document against the whole training dataset is five times faster with the compact representation. The best accuracy for our model is given with a 20-unit representation. Fig. 7 shows the performance of a representation with the same number of units learned by a deep belief network (DBN) following Salakhutdinov and Hinton's constrained Poisson model (2007). Their model was greedily pre-trained for one epoch in an unsupervised way (200 pre-training epochs gave similar fine-tuned accuracy), and then fine-tuned with supervision for 100 epochs. While fine-tuning does not help our model, it significantly improves the DBN, which eventually achieves the same accuracy as our model. Despite the similar accuracy, the computational cost of training a DBN (with our implementation using conjugate gradient on mini-batches) is several times higher due to this supervised training through a large and deep network.

By looking at how words are mapped to the top-level feature space, we can get an intuition about the learned mapping. For instance, the code closest to the representation of the word “jakarta” corresponds to the word “indonesia”; similarly, “meat” is closest to “beef” (Table 1). As expected, the model implicitly clusters synonymous and related words.

Table 1. Neighboring word stems for the model trained on Reuters. The number of units is 2000-200-100-7.

    Word stem       Neighboring word stems
    livestock       beef, meat, pork, cattle
    lend            rate, debt, bond, downgrad
    acquisit        merger, stake, takeov
    port            ship, port, vessel, freight
    branch          stake, merger, takeov, acquisit
    plantat         coffe, cocoa, rubber, palm
    barrel          oil, crude, opec, refineri
    subcommitte     bill, trade, bond, committe
    coconut         soybean, wheat, corn, grain
    meat            beef, pork, cattl, hog
    ghana           cocoa, buffer, coffe, icco
    varieti         wheat, grain, agricultur, crop
    warship         ship, freight, vessel, tanker
    edibl           beef, pork, meat, poultri

3.4. Visualization

The deep model can also be used to visualize documents. When the top layer is two-dimensional we can visualize high-dimensional nonlinear manifolds in the space of bags of words. Fig. 8 shows how documents in the Ohsumed test set are mapped to the plane. The model clusters documents according to the topic class, and places similar topics next to each other. The dimensionality reduction is extreme in this case, from more than 30000 to 2.

Figure 8. Two-dimensional codes produced by the deep model 30689-100-10-5-2 trained on the Ohsumed dataset (only the 6 most numerous classes are shown). The codes result from propagating documents in the test set through the four-layer network. [The classes visible in the plot are Neoplasms, Musculoskeletal, Parasitic, Virus, Digestive System, and Bacterial Infections and Mycoses.]

4. Conclusions

We have proposed and demonstrated a simple and efficient algorithm to learn document representations from partially labeled datasets.
The representation is rich in that it can model complex dependencies between words, which allows us to capture higher-level semantic aspects of documents than is possible with linear models. Capturing such complex structure would not be possible based on labeled data alone; by leveraging unlabeled documents we get access to a much larger amount of data.

This algorithm trains faster than a similar model based on RBMs, and it finds more efficient representations than a network trained with SESMs that produce high-dimensional binary features. We have shown that these deep models greatly outperform similar but shallow models when the learning task is very hard, such as learning very compact representations. Compact representations are very important for search engines because they are cheap to store, and fast to compute and to compare. Also, we have shown that even a few labels can be exploited to make the features more discriminative.

For future work, we are interested in applying the representation for clustering and ranking. It would also be interesting to go beyond the bag of words model to capture word proximity.

Acknowledgments

The authors would like to thank Y. LeCun for his insights and suggestions.

References

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2006). Greedy layer-wise training of deep networks. In NIPS.

Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. MIT Press.

Blei, D., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. JMLR.

Deerwester, S., Dumais, S., Landauer, T., Furnas, G., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407.

Gehler, P. V., Holub, A. D., and Welling, M. (2006). The rate adapting Poisson model for information retrieval and object recognition. In ICML.

Hinton, G. (2000). Training products of experts by minimizing contrastive divergence. Technical report, U. Toronto.

Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554.

Hinton, G. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence.

LeCun, Y., Bottou, L., Orr, G., and Muller, K. (1998). Efficient backprop. In Orr, G. and Muller, K., editors, Neural Networks: Tricks of the Trade. Springer.

Ranzato, M., Boureau, Y., Chopra, S., and LeCun, Y. (2007a). A unified energy-based framework for unsupervised learning. In AI-STATS.

Ranzato, M., Boureau, Y., and LeCun, Y. (2007b). Sparse feature learning for deep belief networks. In NIPS. MIT Press.

Robertson, S. and Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proc. ACM SIGIR, pages 232–241.

Salakhutdinov, R. and Hinton, G. (2007a). Semantic hashing. In ACM SIGIR workshop on Information Retrieval and Applications of Graphical Models.

Salakhutdinov, R. and Hinton, G. (2007b). Using deep belief nets to learn covariance kernels for Gaussian processes. In NIPS 20. MIT Press.