Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words


                                         Gina-Anne Levow
                                       University of Chicago
                              1100 E. 58th St., Chicago, IL 60637, USA
                                   levow@cs.uchicago.edu

Abstract

Query expansion by pseudo-relevance feedback is a well-established technique in both mono- and cross-lingual information retrieval, enriching and disambiguating the typically terse queries provided by searchers. Comparable document-side expansion is a relatively more recent development motivated by error-prone transcription and translation processes in spoken document and cross-language retrieval. In the cross-language case, one can perform expansion before translation, after translation, and at both points. We investigate the relative impact of pre- and post-translation document expansion for cross-language spoken document retrieval in Mandarin Chinese. We find that post-translation expansion yields a highly significant improvement in retrieval effectiveness, while improvements due to pre-translation expansion alone or in combination do not reach significance. We identify two key factors of segmentation and translation in Chinese orthography that limit the effectiveness of pre-translation expansion in the Chinese-English case, while post-translation expansion yields its full benefit.

1 Introduction

Information retrieval aims to match the information need expressed by the searcher in the query with concepts expressed in documents. This matching process is complicated by the variety of different ways - different terms - available to express these concepts and information needs. In addition, this matching process is dramatically complicated in cross-language and spoken document retrieval by the need to match expressions across languages and typically using error-prone processes such as translation and automatic speech recognition transcription. To compensate for this variation in expression of underlying concepts, researchers have developed the technique of pseudo-relevance feedback whereby the information representation - query or document - is enriched with highly selective, topically related terms from a large collection of comparable documents. Such expansion techniques have proved useful across the range of information retrieval applications from mono-lingual to multi-lingual, from text to speech, and from queries to documents.

Expansion in the context of cross-language information retrieval (CLIR) is particularly interesting as it presents multiple opportunities for improving retrieval effectiveness. The pseudo-relevance feedback process can be applied, depending on the retrieval architecture, before translating the query, after translating the query, before translating the document, after translating the document, or at some subset of these points, though not all combinations are reasonable. While pre- and post-translation expansion have been well-studied for a query translation architecture in European languages, as we describe in more detail below, these effects are less well-understood on the document side, especially
for Asian languages.

In this paper, we compare the effects of pre-translation, post-translation, and combined pre- and post-translation document expansion for cross-language retrieval using English queries to retrieve spoken documents in Mandarin Chinese. We identify not only significant enhancements to retrieval effectiveness for post-translation document expansion, but also key contrasts with prior work on query translation and expansion, caused by certain characteristics of Mandarin Chinese, shared by many Asian languages, including issues of segmentation and orthography.

2 Related Work
This work draws on prior research in pseudo-relevance feedback for both queries and documents.

2.1 Pre- and Post-translation Query Expansion
In pre-translation query expansion, the goal is twofold. The first aim is that of monolingual query expansion: providing additional terms to refine the query and to enhance the probability of matching the terminology chosen by the authors of the document. The second is to provide additional terms to limit the possibility of failing to translate a concept in the query simply because the particular term is not present in the translation lexicon. (Ballesteros and Croft, 1997) evaluated pre- and post-translation query expansion in a Spanish-English cross-language information retrieval task and found that combining pre- and post-translation query expansion improved both precision and recall, with pre-translation expansion improving both precision and recall, and post-translation expansion enhancing precision. The dictionary ablation experiments of (McNamee and Mayfield, 2002) on the effect of translation resource size and pre- and post-translation query expansion effectiveness demonstrated the key and dominant role of pre-translation expansion in providing translatable terms. If too few terms are translated, post-translation expansion can provide little improvement.

2.2 Document Expansion
The document expansion approach was first proposed by (Singhal et al., 1999) in the context of spoken document retrieval. Since spoken document retrieval involves search of error-prone automatic speech recognition transcriptions, Singhal et al. introduced document expansion as a way of recovering those words that might have been in the original broadcast but that had been misrecognized. They speculated that correctly recognized terms would yield a topically coherent transcript, while the sporadic errors would be from a random distribution. Enriching the documents with highly selective terms drawn from highly ranked documents retrieved by using the document itself as a query yielded retrieval effectiveness that improved not only over the original errorful transcription but also over a perfect manual transcription. (Levow and Oard, 2000) applied post-translation document expansion to both spoken documents and newswire text in Mandarin-English multi-lingual retrieval and found some improvements in retrieval effectiveness. (Levow, 2003) evaluated multi-scale units (words and bigrams) for post-transcription expansion of Mandarin spoken documents, finding significant improvements for expansion with word units using bigram-based indexing.

3 Experimental Configuration
Here we describe the basic experimental configuration under which contrastive document expansion experiments were carried out.

3.1 Experimental Collection
We used the Topic Detection and Tracking (TDT) Collection for this work. TDT is an evaluation program where participating sites tackle tasks such as identifying the first time a story is reported on a given topic or grouping similar topics from audio and textual streams of newswire data. In recent years, TDT has focused on performing such tasks in both English and Mandarin Chinese (this year Arabic was added to the languages of interest). The task that we have performed is not a strict part of TDT because we are performing retrospective retrieval, which permits knowledge of the statistics for the entire collection. Nevertheless, the TDT collection serves as a valuable resource for our work. The TDT multilingual collection includes English and Mandarin newswire text as well as (audio) broadcast news. For most of the Mandarin audio data, word-level transcriptions produced by the Dragon
automatic speech recognition system are provided. All news stories are exhaustively tagged with event-based topic labels, which serve as the relevance judgments for performance evaluation of our cross-language spoken document retrieval work. We used a subset of the TDT-2 corpus for the experiments reported here.

3.2 Query Formulation
TDT frames the retrieval task as query-by-example, designating 4 exemplar documents to specify the information need. For query formulation, we constructed a vector of the 180 terms that best distinguish the query exemplars from other contemporaneous (and hopefully not relevant) stories. We used a χ² test in a manner similar to that used by Schütze et al. (Schütze et al., 1995) to select these terms. The pure χ² statistic is symmetric, assigning equal value to terms that help to recognize known relevant stories and those that help to reject the other contemporaneous stories. We limited our choice to terms that were positively associated with the known relevant training stories. For the χ² computation, we constructed a set of 996 contemporaneous documents for each topic by removing the four query exemplars from a topic-dependent set of up to 1000 stories working backwards chronologically from the last English query example. Additional details may be found in (Levow and Oard, 2000).

3.3 Document Translation
Our translation strategy implemented a word-for-word translation approach. For our original spoken documents, we used the word boundaries provided in the baseline recognizer transcripts. We next performed dictionary-based word-for-word translation, using a bilingual term list produced by merging the entries from the second release of the LDC Chinese-English term list (http://www.ldc.upenn.edu, (Huang, 1999)) and entries from the CETA file, a large human-readable Chinese-English dictionary. The resulting term list contains 195,078 unique Mandarin terms, with an average of 1.9 known English translations per Mandarin term. We selected the translation with the highest target language unigram frequency, based on a side collection in the target language.

3.4 Document Expansion
We implemented document expansion for the VOA Mandarin broadcast news stories in an effort to partially recover terms that may have been mistranscribed. Singhal et al. used document expansion for monolingual speech retrieval (Singhal and Pereira, 1999).

The automatic transcriptions of the VOA Mandarin broadcast news stories and their word-for-word translations are an often noisy representation of the underlying stories. For expansion, the text of these documents was treated as a query to a comparable collection (in Mandarin before translation and English after translation), by simply combining all the terms with uniform weighting. This query was presented to the InQuery retrieval system version 3.1pl developed at the University of Massachusetts (Callan et al., 1992).

Figure 1 depicts the document expansion process. The use of pre- and post-translation document expansion components was varied as part of the experimental suite described below. We selected the five highest ranked documents from the ranked retrieval list. From those five documents, we extracted the most selective terms and used them to enrich the original translations of the stories. For this expansion process we first created a list of terms from the documents where each document contributed one instance of a term to the list. We then sorted the terms by inverse document frequency (IDF). We next augmented the original documents with these terms until the document had approximately doubled in length. Doubling was computed in terms of number of whitespace delimited units. For Chinese audio documents, words were identified by the Dragon automatic speech recognizer as part of the transcription process. For the Chinese newswire text, segmentation was performed by the NMSU segmenter (Jin, 1998). The expansion factor chosen here followed Singhal et al.'s original proposal. A proportional expansion factor is more desirable than some constant additive number of words or some selectivity threshold, as it provides a more consistent effect on documents of varying lengths; an IDF-based threshold, for example, adds disproportionately more new terms to short original documents than long ones, outweighing the original content. Prior experiments indicate little sensitivity to the exact expansion factor chosen, as long as it is proportional.
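
The term selection and doubling steps described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function name, the IDF formula log(N/df), the alphabetical tie-breaking, and the toy document frequencies are all assumptions made for illustration, and the caller is assumed to supply the five top-ranked documents already retrieved.

```python
import math


def expand_document(doc_terms, top_docs, doc_freq, n_docs, factor=2.0):
    """Append high-IDF terms from top-ranked documents to doc_terms.

    doc_terms: list of whitespace-delimited units of the original document.
    top_docs:  term lists of the highest-ranked comparable documents.
    doc_freq:  hypothetical mapping term -> number of collection documents
               containing it; n_docs is the collection size.
    """
    # Each top-ranked document contributes one instance of each of its terms.
    candidates = [term for doc in top_docs for term in sorted(set(doc))]

    def idf(term):
        # Assumed IDF form; unseen terms are treated as occurring once.
        return math.log(n_docs / doc_freq.get(term, 1))

    # Most selective terms first; ties broken alphabetically (an assumption).
    candidates.sort(key=lambda term: (-idf(term), term))

    # Add terms until the document reaches `factor` times its original length.
    n_new = max(0, int(len(doc_terms) * factor) - len(doc_terms))
    return list(doc_terms) + candidates[:n_new]


doc = ["oil", "embargo", "iraq"]
top_docs = [["iraq", "aziz", "oil", "the"], ["aziz", "saddam", "the"]]
doc_freq = {"the": 1000, "oil": 50, "iraq": 20, "aziz": 2, "saddam": 5}
expanded = expand_document(doc, top_docs, doc_freq, n_docs=1000)
print(expanded)  # ['oil', 'embargo', 'iraq', 'aziz', 'aziz', 'saddam']
```

Because a term is listed once per top-ranked document that contains it, a term that is rare in the collection but shared by several related documents (here "aziz") is added more than once, which raises its weight in the expanded document.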
This process thus relatively increased the weight of terms that occurred rarely in the document collection as a whole but frequently in related documents. The resulting augmented documents were then indexed by InQuery in the usual way. This expanded document collection formed the basis for retrieval using the translated exemplar queries.

The intuition behind document expansion is that terms that are correctly transcribed will tend to be topically coherent, while mistranscription will introduce spurious terms that lack topical coherence. In other words, although some “noise” terms are randomly introduced, some “signal” terms will survive. The introduction of spurious terms degrades ranked retrieval somewhat, but the adverse effect is limited by the design of ranking algorithms that give high scores to documents that contain many query terms. Because topically related terms are far more likely to appear together in documents than are spurious terms, the correctly transcribed terms will have a disproportionately large impact on the ranking process. The highest ranked documents are thus likely to be related to the correctly transcribed terms, and to contain additional related terms. For example, a system might fail to accurately transcribe the name “Yeltsin” in the context of the (former) “Russian Prime Minister”. However, in a large contemporaneous text corpus, the correct form of the name will appear in such document contexts, and relatively rarely outside of such contexts. Thus, it will be a highly correlated and highly selective term to be added in the course of document expansion.
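
The intuition above can be illustrated with a toy ranking, offered only as a sketch under stated assumptions. Counting shared terms is a crude stand-in for InQuery's actual ranking function, which is not reproduced here. The three-story collection and the misrecognition of "yeltsin" as "yelling" are invented, and English is used for readability although the experiments transcribe Mandarin speech.

```python
def rank_by_overlap(query_terms, collection, k=5):
    """Rank documents by the number of distinct query terms they contain."""
    query = set(query_terms)
    scored = [(len(query & set(terms)), name) for name, terms in collection.items()]
    # Highest overlap first; ties broken by document name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical transcript in which the name was replaced by a spurious term.
noisy_transcript = ["russian", "prime", "minister", "yelling", "moscow"]

# Hypothetical comparable collection.
collection = {
    "kremlin_story": ["russian", "prime", "minister", "yeltsin", "moscow"],
    "sports_story": ["fans", "yelling", "stadium", "goal"],
    "market_story": ["stocks", "rally", "moscow", "exchange"],
}

top = rank_by_overlap(noisy_transcript, collection, k=2)
print(top[0])  # kremlin_story
```

The four correctly transcribed terms outvote the single spurious one, so the top-ranked story is the topically related one, and it contains the correct name that expansion can then add back to the document.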
[Figure 1: Document Expansion Process. Stages shown: Mandarin Broadcast News -> ASR Transcription -> Transcribed Documents -> Pre-translation Expansion (InQuery over Comp Chinese Newswire Corpus, Top 5, Term Selection) -> Translation -> Translated Documents -> Post-translation Expansion (InQuery over Comp English Newswire Corpus, Top 5, Term Selection) -> InQuery with Query Vector -> Results]

4 Document Expansion Experiments
Our goal is to evaluate the effectiveness of pseudo-relevance feedback expansion applied at different stages of document processing and determine what factors contribute to any differences in final retrieval effectiveness. We consider expansion before translation, after translation, and at both points. The expansion process aims to (re)introduce terminology that could have been used by the author to express the concepts in the documents. Expansion at different stages of processing addresses different causes of loss or absence of terms. At all points, it can address terminological choice by the author.

Since we are working with automatic transcriptions of spoken documents, pre-translation (post-transcription) expansion directly addresses term loss due to substitution or deletion errors in automatic recognition. In addition, as emphasized by (McNamee and Mayfield, 2002), pre-translation expansion can be crucial to providing translatable terms so that there is some material for post-translation indexing and matching to operate on. In other words, by including a wider range of expressions of the document concepts, pre-translation expansion can avoid translation gaps by enhancing the possibility that some term representing a concept that appears in the original document will have a translation in the bilingual term list. Addition of terms can also have a disambiguating effect, as identified by (Ballesteros and Croft, 1997).

Post-translation expansion provides an opportunity to address translation gaps even more strongly. Pre-translation expansion requires that there be some representation of the document language concept in the term list, whereas post-translation expansion can acquire related terms with no representation in the translation resources from the query language side collection. This capability is particularly desirable given both the important role of named entities (e.g. person and organization names) in many retrieval activities and their poor coverage in most translation resources. Finally, it provides the opportunity to introduce additional conceptually related terminology in the query language, even if the document language form of the term was not introduced by the original author to enhance the representation.

We evaluate four document processing configurations:

  1. No Expansion
     Documents are translated directly as described above, based on the provided automatic speech recognition transcriptions.

  2. Pre-translation Expansion
     Documents are expanded as described above, using a contemporaneous Mandarin newswire text collection from Xinhua and Zaobao news agencies. These collections are segmented into words using the NMSU segmenter. The resulting documents are translated as usual. Note that translation requires that the expansion units be words.

  3. Post-translation Expansion
     The English document forms produced by item 1 are expanded using a contemporaneous collection of English newswire text from the New York Times and Associated Press (also part of the TDT-2 corpus).

  4. Pre- and Post-translation Expansion
     The document forms produced by item 2 are translated in the usual word-for-word process. The resulting English text is expanded as in item 3.

After the above processing, the resulting English documents are indexed.

4.1 Results
The results of these different expansion configurations appear in Figure 2. We observe that both post-translation expansion and combined pre- and post-translation document expansion yield highly significant improvements (Wilcoxon signed rank test, two-tailed) in retrieval effectiveness over the unexpanded case. In contrast, although pre-translation expansion yields an 18% relative increase in mean average precision, this improvement does not reach significance. The combination of pre- and post-translation expansion increases effectiveness by only 3% relative over post-translation expansion, but 33% relative over pre-translation expansion alone. This combination of pre- and post-translation expansion significantly improves over pre-translation document expansion alone.

5 Discussion
These results clearly demonstrate the significant utility of post-translation document expansion for English-Mandarin CLIR with Mandarin spoken documents, in contrast to pre-translation expansion. Not only do these results extend our understanding of the interactions of translation and expansion, but they contrast dramatically with prior work on translation
and query expansion, in particular with the (McNamee and Mayfield, 2002) work emphasizing the primary importance of pre-translation expansion.

                          Document Expansion
                          None   Pre    Post   Pre+Post
  Mean average precision  0.39   0.46   0.59   0.61

Figure 2: Retrieval effectiveness of document expansion

Two main factors contribute to this contrast: first, differences between languages, and second, differences between documents and queries. The characteristics of the document and query languages play a crucial role in determining the effectiveness of pre- and post-translation document expansion. In particular, the orthography of Mandarin Chinese and the difference in writing systems between the English queries and Mandarin documents affect the expansion process. If one examines the terms contributed by post-translation expansion, one can quickly observe the utility of the enriching terms. For instance, in a document about the Iraqi oil embargo, one finds the names of Tariq Aziz and Saddam; in an article about the former Soviet republic of Georgia, one finds the name of former president Zviad Gamsakhurdia. These and many of the other useful expansion terms do not appear anywhere in the translation resource. Even if these terms were proposed by pre-translation expansion or existed in the original document, they would not be available in the translated result. These named entities are highly useful in many information retrieval activities but are notoriously absent from translation resources. For languages with different orthographies, these terms cannot match as cognates but must be explicitly translated or transliterated. Thus, these terms are only useful for enrichment when the translation barrier has already been passed. In contrast, the majority of the query translation experiments that demonstrate the utility of pre-translation expansion have been performed on European language pairs that share a common alphabet, making names found at any stage of expansion available for matching as cognates in retrieval even when no explicit translation is available. Recent side experiments on pre- and post-translation query expansion on the English-Chinese pair show a similar pattern of effectiveness for post-translation expansion over pre-translation expansion (Levow et al., Under Review).

A further complication is caused by the fact that Mandarin Chinese is written without white space separating words. As a result, some segmentation process must be performed to identify words for translation, even though indexing and retrieval can be performed effectively on n-gram units (Meng et al., 2001). This segmentation process typically relies on a list of terms that may appear in legal segmentations. Just as in the case of translation, these term lists often lack good coverage of proper names. Thus, these terms may not be identified for translation, expansion, or even transcription by an automatic speech recognition system that also depends on word lists as models. These constraints limit the effectiveness of pre-translation expansion. In post-translation expansion, however, these problems are much less significant. In English, white-space delimited terms are available and largely sufficient for retrieval (especially after stemming). Even with multi-word concepts as in the name examples above, the cooccurrence of these terms in expansion documents makes it likely that they will cooccur in the list of enriching terms as well, though perhaps not in the same order. In Chinese or other typically unsegmented languages, overlapping n-grams can be used as indexing or expansion units, to bypass segmentation issues, once translation has been completed.

Finally, (McNamee and Mayfield, 2002) observe that pre-translation query expansion plays a crucial role in ensuring that some terms are translatable, and post-translation expansion would have nothing to operate on if no query terms translated. This is certainly true, but this problem is much more likely to arise in the case of short queries, where only a single term may represent a topic and there are few terms in the query. As documents are typically much longer, there is often more redundancy of representation.
This is analogous to the observation (Krovetz, 1993) that stemming has less of an impact as documents become longer because a wider variety of surface forms are likely to appear. Thus some translatable form of a concept is more likely to appear in a long document, even without expansion and even with a poor translation resource. As a result, pre-translation expansion may be less crucial for long documents.

6 Conclusion
These factors together explain both the significant improvement for post-translation document expansion that our experiments illustrate in contrast to the much weaker effects of pre-translation expansion, and also the difference observed between the experimental results reported here and prior work on pre- and post-translation query expansion that has emphasized European language pairs. We have identified a key role for post-translation expansion in CLIR language pairs where trivial cognate matching is not possible, but explicit translation or transliteration is required. We have also identified limitations on pre-translation expansion due to corresponding gaps in segmentation, translation, and transcription resources. We believe that these findings will extend to other CLIR language combinations with comparable characteristics, including many other Asian languages.

References

Lisa Ballesteros and W. Bruce Croft. 1997. Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, July.

James P. Callan, W. Bruce Croft, and Stephen M. Harding. 1992. The INQUERY retrieval system. In Proceedings of the Third International Conference on Database and Expert Systems Applications, pages 78–83. Springer-Verlag.

Shudong Huang. 1999. Evaluation of LDC’s bilingual dictionaries. Unpublished manuscript.

Wanying Jin. 1998. NMSU Chinese segmenter. In First Chinese Language Processing Workshop, Philadelphia.

Robert Krovetz. 1993. Viewing morphology as an inference process. In SIGIR-93, pages 191–202.

Gina-Anne Levow and Douglas W. Oard. 2000. Translingual topic tracking with PRISE. In Working Notes of the Third Topic Detection and Tracking Workshop, February.

Gina-Anne Levow, Douglas W. Oard, and Philip Resnik. Under Review. Dictionary-based techniques for cross-language information retrieval.

Gina-Anne Levow. 2003. Multi-scale document expansion for Mandarin Chinese. In Proceedings of the ISCA Workshop on Multi-lingual Spoken Document Retrieval.

Paul McNamee and James Mayfield. 2002. Comparing cross-language query expansion techniques by degrading translation resources. In Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR-2002).

Helen Meng, Berlin Chen, Erika Grams, Wai-Kit Lo, Gina-Anne Levow, Douglas Oard, Patrick Schone, Karen Tang, and Jian Qiang Wang. 2001. Mandarin-English Information (MEI): Investigating translingual speech retrieval. In Human Language Technology Conference.

Hinrich Schütze, David A. Hull, and Jan O. Pedersen. 1995. A comparison of classifiers and document representations for the routing problem. In Edward A. Fox, Peter Ingwersen, and Raya Fidel, editors, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 229–237, July. ftp://parcftp.xerox.com/pub/qca/schuetze.html.

Amit Singhal and Fernando Pereira. 1999. Document expansion for speech retrieval. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, pages 34–41, August.

Amit Singhal, John Choi, Donald Hindle, Julia Hirschberg, Fernando Pereira, and Steve Whittaker. 1999. AT&T at TREC-7 SDR Track. In Proceedings of the DARPA Broadcast News Workshop.