How do you pronounce your name? Improving G2P with transliterations


Aditya Bhargava and Grzegorz Kondrak
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada, T6G 2E8
{abhargava,kondrak}@cs.ualberta.ca

Abstract

Grapheme-to-phoneme conversion (G2P) of names is an important and challenging problem. The correct pronunciation of a name is often reflected in its transliterations, which are expressed within a different phonological inventory. We investigate the problem of using transliterations to correct errors produced by state-of-the-art G2P systems. We present a novel re-ranking approach that incorporates a variety of score and n-gram features, in order to leverage transliterations from multiple languages. Our experiments demonstrate significant accuracy improvements when re-ranking is applied to n-best lists generated by three different G2P programs.

1   Introduction

Grapheme-to-phoneme conversion (G2P), in which the aim is to convert the orthography of a word to its pronunciation (phonetic transcription), plays an important role in speech synthesis and understanding. Names, which comprise over 75% of unseen words (Black et al., 1998), present a particular challenge to G2P systems because of their high pronunciation variability. Guessing the correct pronunciation of a name is often difficult, especially if it is of foreign origin; this is attested by the ad hoc transcriptions which sometimes accompany new names introduced in news articles, especially for international stories with many foreign names.

Transliterations provide a way of disambiguating the pronunciation of names. They are more abundant than phonetic transcriptions, for example when news items of international or global significance are reported in multiple languages. In addition, writing scripts such as Arabic, Korean, or Hindi are more consistent and easier to identify than various phonetic transcription schemes. The process of transliteration, also called phonetic translation (Li et al., 2009b), involves “sounding out” a name and then finding the closest possible representation of the sounds in another writing script. Thus, the correct pronunciation of a name is partially encoded in the form of the transliteration. For example, given the ambiguous letter-to-phoneme mapping of the English letter g, the initial phoneme of the name Gershwin may be predicted by a G2P system to be either /ɡ/ (as in Gertrude) or /dʒ/ (as in Gerald). The transliterations of the name in other scripts provide support for the former (correct) alternative.

Although it seems evident that transliterations should be helpful in determining the correct pronunciation of a name, designing a system that takes advantage of this insight is not trivial. The main source of the difficulty stems from the differences between the phonologies of distinct languages. The mappings between phonemic inventories are often complex and context-dependent. For example, because Hindi has no /w/ sound, the transliteration of Gershwin instead uses a symbol that represents the phoneme /ʋ/, similar to the /v/ phoneme in English. In addition, converting transliterations into phonemes is often non-trivial; although few orthographies are as inconsistent as that of English, this is effectively the G2P task for the particular language in question.

In this paper, we demonstrate that leveraging transliterations can, in fact, improve the grapheme-to-phoneme conversion of names. We propose a novel system based on discriminative re-ranking that is capable of incorporating multiple transliterations. We show that simplistic approaches to the problem fail to achieve the same goal, and that transliterations from multiple languages are more helpful than those from a single language. Our approach can be combined with any G2P system that produces n-best lists instead of single outputs. The experiments that we perform demonstrate significant error reduction for three very different G2P base systems.
2   Improving G2P with transliterations

2.1   Problem definition

In both G2P and machine transliteration, we are interested in learning a function that, given an input sequence x, produces an output sequence y. In the G2P task, x is composed of graphemes and y is composed of phonemes; in transliteration, both sequences consist of graphemes but they represent different writing scripts. Unlike in machine translation, the monotonicity constraint is enforced; i.e., we assume that x and y can be aligned without the alignment links crossing each other (Jiampojamarn and Kondrak, 2010). We assume that we have available a base G2P system that produces an n-best list of outputs with a corresponding list of confidence scores. The goal is to improve the base system's performance by applying existing transliterations of the input x to re-rank the system's n-best output list.
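To make the setting concrete, the following minimal Python sketch shows the objects involved: a base G2P system's n-best list with confidence scores, a set of transliterations for the same name, and a re-ranking function that reorders the list. The class and field names are illustrative only, not the authors' actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    phonemes: List[str]   # one n-best output of the base G2P system
    base_score: float     # the base system's confidence score for this output

@dataclass
class ReRankingInstance:
    graphemes: str                # the input name, e.g. "Gershwin"
    nbest: List[Candidate]        # n-best outputs from the base G2P system
    transliterations: List[str]   # transliterations of the same name, possibly in several scripts

# A re-ranker maps an instance to a re-ordered n-best list; the goal is to move
# the correct pronunciation to the top using the transliteration evidence.
ReRanker = Callable[[ReRankingInstance], List[Candidate]]
```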
2.2   Similarity-based methods

A simple and intuitive approach to improving G2P with transliterations is to select from the n-best list the output sequence that is most similar to the corresponding transliteration. For example, the Hindi transliteration in Figure 1 is arguably closest in perceptual terms to the phonetic transcription of the second output in the n-best list, as compared to the other outputs. One obvious problem with this method is that it ignores the relative ordering of the n-best lists and their corresponding scores produced by the base system.

A better approach is to combine the similarity score with the output score from the base system, allowing it to contribute an estimate of confidence in its output. For this purpose, we apply a linear combination of the two scores, where a single parameter λ, ranging between zero and one, determines the relative weight of the scores. The exact value of λ can be optimized on a training set. This approach is similar to the method used by Finch and Sumita (2010) to combine the scores of two different machine transliteration systems.
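The sketch below illustrates that linear combination (the function names are hypothetical, and the base and similarity scores are assumed to be on comparable scales, e.g. after normalization): each candidate is scored as λ times the base-system score plus (1 − λ) times its similarity to the transliteration, with λ itself tuned on training data.

```python
def rerank_linear(nbest, transliteration, similarity, lam=0.5):
    """Re-rank (phonemes, base_score) pairs by a lambda-weighted combination of
    the base system's score and the candidate's similarity to one transliteration.
    `similarity` is any function comparing a phoneme sequence to a transliteration;
    lam=1.0 reduces to the base system's ranking, lam=0.0 to pure similarity."""
    scored = [
        (lam * base_score + (1.0 - lam) * similarity(phonemes, transliteration),
         phonemes, base_score)
        for phonemes, base_score in nbest
    ]
    scored.sort(reverse=True)                      # best combined score first
    return [(phonemes, base_score) for _, phonemes, base_score in scored]
```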
2.3   Measuring similarity

The approaches presented in the previous section crucially depend on a method for computing the similarity between various symbol sequences that represent the same word. If we have a method of converting transliterations to phonetic representations, the similarity between two sequences of phonemes can be computed with a simple method such as normalized edit distance or the longest common subsequence ratio, which take into account the number and position of identical phonemes. Alternatively, we could apply a more complex approach, such as ALINE (Kondrak, 2000), which computes the distance between pairs of phonemes. However, the implementation of a conversion program would require ample training data or language-specific expertise.

A more general approach is to skip the transcription step and compute the similarity between phonemes and graphemes directly. For example, the edit distance function can be learned from a training set of transliterations and their phonetic transcriptions (Ristad and Yianilos, 1998). In this paper, we apply M2M-ALIGNER (Jiampojamarn et al., 2007), an unsupervised aligner, which is a many-to-many generalization of the learned edit distance algorithm. M2M-ALIGNER was originally designed to align graphemes and phonemes, but can be applied to discover the alignment between any sets of symbols (given training data). The logarithm of the probability assigned to the optimal alignment can then be interpreted as a similarity measure between the two sequences.
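For concreteness, here is one way (a sketch, not the paper's implementation) to compute the two simple phoneme-level measures mentioned above, assuming the inputs are lists of phoneme symbols: a normalized edit distance turned into a similarity, and the longest common subsequence ratio.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_similarity(a, b):
    """1 minus the edit distance normalized by the longer sequence length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def lcs_ratio(a, b):
    """Longest common subsequence length divided by the longer sequence length."""
    if not a or not b:
        return 0.0
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if x == y
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1] / max(len(a), len(b))
```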
2.4   Discriminative re-ranking

The methods described in Section 2.2, which are based on the similarity between outputs and transliterations, are difficult to generalize when multiple transliterations of a single name are available. A linear combination is still possible but in this case optimizing the parameters would no longer be straightforward. Also, we are interested in utilizing other features besides sequence similarity.
input:             Gershwin
n-best outputs:    /d͡ʒɜːʃwɪn/   /ɡɜːʃwɪn/   /d͡ʒɛɹʃwɪn/
transliterations:  गर्शविन (/ɡʌrʃʋɪn/)   ガーシュウィン (/ɡaːɕuwiɴ/)   Гершвин (/ɡerʂvin/)

Figure 1: An example name showing the data used for feature construction. Each arrow links a pair used to generate features, including n-gram and score features. The score features use similarity scores for transliteration-transcription pairs and system output scores for input-output pairs. One feature vector is constructed for each system output.

The SVM re-ranking paradigm offers a solution to the problem. Our re-ranking system is informed by a large number of features, which are based on scores and n-grams. The scores are of three types:

1. The scores produced by the base system for each output in the n-best list.

2. The similarity scores between the outputs and each available transliteration.

3. The differences between scores in the n-best lists for both (1) and (2).

Our set of binary n-gram features includes those used for DIRECTL+ (Jiampojamarn et al., 2010). They can be divided into four types:

1. The context features combine output symbols (phonemes) with n-grams of varying sizes in a window of size c centred around a corresponding position on the input side.

2. The transition features are bigrams on the output (phoneme) side.

3. The linear chain features combine the context features with the bigram transition features.

4. The joint n-gram features are n-grams containing both input and output symbols.

We apply the features in a new way: instead of being applied strictly to a given input-output set, we expand their use across many languages and use all of them simultaneously. We apply the n-gram features across all transliteration-transcription pairs in addition to the usual input-output pairs corresponding to the n-best lists. Figure 1 illustrates the set of pairs used for feature generation.

In this paper, we augment the n-gram features by a set of reverse features. Unlike a traditional G2P generator, our re-ranker has access to the outputs produced by the base system. By swapping the input and the output side, we can add reverse context and linear-chain features. Since the n-gram features are also applied to transliteration-transcription pairs, the reverse features enable us to include features which bind a variety of n-grams in the transliteration string with a single corresponding phoneme.

The construction of n-gram features presupposes a fixed alignment between the input and output sequences. If the base G2P system does not provide input-output alignments, we use M2M-ALIGNER for this purpose. The transliteration-transcription pairs are also aligned by M2M-ALIGNER, which at the same time produces the corresponding similarity scores. (We set a lower limit of -100 on the M2M-ALIGNER scores.) If M2M-ALIGNER is unable to produce an alignment, we indicate this with a binary feature that is included with the n-gram features.
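As a rough sketch of how one feature vector per system output might be assembled, the example below combines the score features with simple context-style n-gram features extracted from aligned unit pairs, applied both to the input-output alignment and to every transliteration-transcription alignment. The feature-naming scheme and the simplified templates are assumptions for illustration; the actual DIRECTL+ feature templates are richer.

```python
def ngram_features(aligned_pairs, prefix, max_n=2):
    """Binary context-style features over an alignment given as a list of
    (source_unit, target_unit) pairs; joins source n-grams with the target unit."""
    feats = {}
    sources = [s for s, _ in aligned_pairs]
    for i, (_, tgt) in enumerate(aligned_pairs):
        for n in range(1, max_n + 1):
            ctx = "+".join(sources[i:i + n])
            feats[f"{prefix}:ctx[{n}]={ctx}|out={tgt}"] = 1.0
    return feats

def build_feature_vector(base_score, score_diff, tl_similarities,
                         io_alignment, tt_alignments):
    """One feature vector for one n-best output: score features plus n-gram
    features over the input-output alignment and each transliteration-
    transcription alignment (tt_alignments maps language -> aligned pairs or None)."""
    feats = {"base_score": base_score, "base_score_diff": score_diff}
    for lang, sim in tl_similarities.items():
        feats[f"sim:{lang}"] = max(sim, -100.0)    # lower limit, as in Section 2.4
    feats.update(ngram_features(io_alignment, "io"))
    for lang, alignment in tt_alignments.items():
        if alignment is None:                      # aligner failed: binary indicator
            feats[f"noalign:{lang}"] = 1.0
        else:
            feats.update(ngram_features(alignment, f"tt:{lang}"))
    return feats
```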
3   Experiments

We perform several experiments to evaluate our transliteration-informed approaches. We test simple similarity-based approaches on single-transliteration data, and evaluate our SVM re-ranking approach against this as well. We then test our approach using all available transliterations. Relevant code and scripts required to reproduce our experimental results are available online at http://www.cs.ualberta.ca/~ab31/g2p-tl-rr.
3.1   Data & setup

For pronunciation data, we extracted all names from the Combilex corpus (Richmond et al., 2009). We discarded all diacritics, duplicates, and multi-word names, which yielded 10,084 unique names. Both the similarity and SVM methods require transliterations for identifying the best candidates in the n-best lists. They are therefore trained and evaluated on the subset of the G2P corpus for which transliterations are available. Naturally, allowing transliterations from all languages results in a larger corpus than the one obtained by the intersection with transliterations from a single language.

For our experiments, we split the data into 10% for testing, 10% for development, and 80% for training. The development set was used for initial tests and experiments; for the final results, the training and development sets were combined into one set for final system training. For SVM re-ranking, during both development and testing we split the training set into 10 folds; this is necessary when training the re-ranker, as it must have system output scores that are representative of the scores on unseen data. We ensured that there was never any overlap between the training and testing data for all trained systems.

Our transliteration data come from the shared tasks on transliteration at the 2009 and 2010 Named Entities Workshops (Li et al., 2009a; Li et al., 2010). We use all of the 2010 English-source data plus the English-to-Russian data from 2009, which makes nine languages in total. In cases where the data provide alternative transliterations for a given input, we keep only one; our preliminary experiments indicated that including alternative transliterations did not improve performance. It should be noted that these transliteration corpora are noisy: Jiampojamarn et al. (2009) note a significant increase in English-to-Hindi transliteration performance with a simple cleaning of the data.

Our tests involving transliterations from multiple languages are performed on the set of names for which we have both the pronunciation and transliteration data. There are 7,423 names in the G2P corpus for which at least one transliteration is available. Table 1 lists the total size of the transliteration corpora as well as the amount of overlap with the G2P data. Note that the base G2P systems are trained using all 10,084 names in the corpus, as opposed to only the 7,423 names for which there are transliterations available. This ensures that the G2P systems have more training data to provide the best possible base performance.

   Language     Corpus size     Overlap
   Bengali           12,785       1,840
   Chinese           37,753       4,713
   Hindi             12,383       2,179
   Japanese          26,206       4,773
   Kannada           10,543       1,918
   Korean             6,761       3,015
   Russian            6,447         487
   Tamil             10,646       1,922
   Thai              27,023       5,436

Table 1: The number of unique single-word entries in the transliteration corpora for each language and the amount of common data (overlap) with the pronunciation data.

For our single-language experiments, we normalize the various scores when tuning the linear combination parameter λ so that we can compare values across different experimental conditions. For SVM re-ranking, we directly implement the method of Joachims (2002) to convert the re-ranking problem into a classification problem, and then use the very fast LIBLINEAR (Fan et al., 2008) to build the SVM models. Optimal hyperparameter values were determined during development.
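A minimal sketch of that conversion is shown below, using scikit-learn's LinearSVC (which wraps LIBLINEAR) as a stand-in for the paper's direct LIBLINEAR setup; the per-name input format is an assumption, and feature extraction is taken to happen elsewhere. Following Joachims (2002), each pair consisting of the correct candidate and an incorrect candidate contributes a difference vector as a classification example.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_examples(instances):
    """instances: list of (candidate_feature_matrix, index_of_correct_candidate).
    Each (correct, incorrect) candidate pair yields a difference vector; both
    orientations are kept so the binary classifier sees two classes."""
    X, y = [], []
    for feats, correct in instances:
        for j in range(feats.shape[0]):
            if j == correct:
                continue
            diff = feats[correct] - feats[j]
            X.append(diff)
            y.append(1)
            X.append(-diff)
            y.append(-1)
    return np.array(X), np.array(y)

def train_reranker(instances, C=1.0):
    """Fit a linear SVM on the pairwise difference examples."""
    X, y = pairwise_examples(instances)
    return LinearSVC(C=C).fit(X, y)

def rerank(model, feats):
    """Score each candidate with the learned weight vector; return indices best-first."""
    scores = feats @ model.coef_.ravel()
    return np.argsort(-scores)
```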
We evaluate using word accuracy, the percentage of words for which the pronunciations are correctly predicted. This measure marks pronunciations that are even slightly different from the correct one as incorrect, so even a small change in pronunciation that might be acceptable or even unnoticeable to humans would count against the system's performance.
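Under the assumption that each pronunciation is represented as a phoneme sequence and only exact matches count, the metric is simply:

```python
def word_accuracy(predictions, references):
    """Percentage of words whose predicted pronunciation exactly matches the reference."""
    correct = sum(tuple(p) == tuple(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```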
3.2   Base systems

It is important to test multiple base systems in order to ensure that any gain in performance applies to the task in general and not just to a particular system. We use three G2P systems in our tests:

1. FESTIVAL (FEST), a popular speech synthesis package, which implements G2P conversion with CARTs (decision trees) (Black et al., 1998).

2. SEQUITUR (SEQ), a generative system based on the joint n-gram approach (Bisani and Ney, 2008).

3. DIRECTL+ (DTL), the discriminative system on which our n-gram features are based (Jiampojamarn et al., 2010).

All systems are capable of providing n-best output lists along with scores for each output, although for FESTIVAL these had to be constructed from the list of output probabilities for each input character.

We run DIRECTL+ with all of the features described in (Jiampojamarn et al., 2010) (i.e., context features, transition features, linear chain features, and joint n-gram features). System parameters, such as the maximum number of iterations, were determined during development. For SEQUITUR, we keep the default options except for enabling 10-best outputs, and we convert the probabilities assigned to the outputs to log-probabilities. We set SEQUITUR's joint n-gram order to 6 (this was also determined during development).

Note that the three base systems differ slightly in terms of the alignment information that they provide in their outputs. FESTIVAL operates letter-by-letter, so we use the single-letter inputs with the phoneme outputs as the aligned units. DIRECTL+ specifies many-to-many alignments in its output. Since SEQUITUR provides no information regarding the output structure, we use M2M-ALIGNER to induce alignments for n-gram feature generation.

3.3   Transliterations from a single language

The goal of the first experiment is to compare several similarity-based methods, and to determine how they compare to our re-ranking approach. In order to find the similarity between phonetic transcriptions, we use the two different methods described in Section 2.2: ALINE and M2M-ALIGNER. We further test the use of a linear combination of the similarity scores with the base system's score so that its confidence information can be taken into account; the linear combination weight is determined from the training set. These methods are referred to as ALINE+BASE and M2M+BASE. For these experiments, our training and testing sets are obtained by intersecting our G2P training and testing sets respectively with the Hindi transliteration corpus, yielding 1,950 names for training and 229 names for testing.

Since the similarity-based methods are designed to incorporate homogeneous same-script transliterations, we can only run this experiment on one language at a time. Furthermore, ALINE operates on phoneme sequences, so we first need to convert the transliterations to phonemes. An alternative would be to train a proper G2P system, but this would require a large set of word-pronunciation pairs. For this experiment, we choose Hindi, for which we constructed a rule-based G2P converter. Aside from simple one-to-one mapping (romanization) rules, the converter has about ten rules to adjust for context.
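The converter itself is not described further in the paper; the fragment below only illustrates the flavour of such a rule-based approach with a toy symbol table and one hypothetical context rule (word-final schwa deletion). The mappings and rules are illustrative assumptions, not the authors' actual converter.

```python
# Toy inventory: a handful of Devanagari symbols only (illustrative).
CONSONANTS = {"ग": "ɡ", "र": "r", "श": "ʃ", "व": "ʋ", "न": "n", "क": "k", "स": "s"}
VOWEL_SIGNS = {"ि": "ɪ", "ा": "aː", "ु": "ʊ", "े": "eː"}
INDEP_VOWELS = {"अ": "ə", "आ": "aː", "इ": "ɪ"}
VIRAMA = "्"
SCHWA = "ə"

def hindi_g2p(word):
    """Very small rule-based Devanagari-to-phoneme converter: consonants carry an
    inherent schwa, vowel signs replace it, the virama deletes it, and a final
    context rule drops a word-final schwa (a simplified stand-in for the roughly
    ten context rules mentioned in the text)."""
    phonemes = []
    for ch in word:
        if ch in CONSONANTS:
            phonemes += [CONSONANTS[ch], SCHWA]   # consonant plus inherent vowel
        elif ch in VOWEL_SIGNS:
            phonemes[-1] = VOWEL_SIGNS[ch]        # vowel sign overrides the schwa
        elif ch == VIRAMA:
            phonemes.pop()                        # virama suppresses the schwa
        elif ch in INDEP_VOWELS:
            phonemes.append(INDEP_VOWELS[ch])
    if phonemes and phonemes[-1] == SCHWA:        # context rule: final schwa deletion
        phonemes.pop()
    return phonemes
```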
For these experiments, we apply our SVM re-ranking method in two ways:

1. Using only Hindi transliterations (referred to as SVM-HINDI).

2. Using all available languages (referred to as SVM-ALL).

In both cases, the test set is restricted to the same 229 names, in order to provide a valid comparison.
                       Base system
                  FEST    SEQ    DTL
   Base           58.1   67.3   71.6
   ALINE          28.0   26.6   27.5
   M2M            39.3   36.2   36.2
   ALINE+BASE     58.5   65.9   71.2
   M2M+BASE       58.5   66.4   70.3
   SVM-HINDI      63.3   69.0   69.9
   SVM-ALL        68.6   72.5   75.6

Table 2: Word accuracy (in percentages) of various methods when only Hindi transliterations are used.

Table 2 presents the results. Regardless of the choice of the similarity function, the simplest approaches fail in a spectacular manner, significantly reducing the accuracy with respect to the base system. The linear combination methods give mixed results, improving the accuracy for FESTIVAL but not for SEQUITUR or DIRECTL+ (although the differences are not statistically significant). However, they perform much better than the methods based on similarity scores alone as they are able to take advantage of the base system's output scores. If we look at the values of λ that provide the best performance on the training set, we find that they are higher for the stronger base systems, indicating more reliance on the base system output scores. For example, for ALINE+BASE the FESTIVAL-based system has λ = 0.58 whereas the DIRECTL+-based system has λ = 0.81. Counter-intuitively, the ALINE+BASE and M2M+BASE methods are unable to improve upon SEQUITUR or DIRECTL+. We would expect to achieve at least the base system's performance, but disparities between the training and testing sets prevent this.

The two SVM-based methods achieve much better results. SVM-ALL produces impressive accuracy gains for all three base systems, while SVM-HINDI yields smaller (but still statistically significant) improvements for FESTIVAL and SEQUITUR. These results suggest that our re-ranking method provides a bigger boost to systems built with different design principles than to DIRECTL+, which utilizes a similar set of features. On the other hand, the results also show that the information obtained by consulting a single transliteration may be insufficient to improve an already high-performing G2P converter.

3.4   Transliterations from multiple languages

Our second experiment expands upon the first; we use all available transliterations instead of being restricted to one language. This rules out the simple similarity-based approaches, but allows us to test our re-ranking approach in a way that fully utilizes the available data. We test three variants of our transliteration-informed SVM re-ranking approach, which differ with respect to the set of included features:

1. SVM-SCORE includes only the three types of score features described in Section 2.4.

2. SVM-N-GRAM uses only the n-gram features.

3. SVM-ALL is the full system that combines the score and n-gram features.

The objective is to determine the degree to which each of the feature classes contributes to the overall results. Because we are using all available transliterations, we achieve much greater coverage over our G2P data than in the previous experiment; in this case, our training set consists of 6,660 names while the test set has 763 names.

                       Base system
                  FEST    SEQ    DTL
   Base           55.3   66.5   70.8
   SVM-SCORE      62.1   68.4   71.0
   SVM-N-GRAM     66.2   72.5   73.8
   SVM-ALL        67.2   73.4   74.3

Table 3: Word accuracy of the base system versus the re-ranking variants with transliterations from multiple languages.

Table 3 presents the results. Note that the baseline accuracies are somewhat lower than in Table 2 because of the different test set. We find that, when using all features, the SVM re-ranker can provide a very impressive error reduction over FESTIVAL (26.7%) and SEQUITUR (20.7%) and a smaller but still significant (p < 0.01 with the McNemar test) error reduction over DIRECTL+ (12.1%).

When we consider our results using only the score and n-gram features, we can see that, interestingly, the n-gram features are most important. We draw a further conclusion from our results: consider the large disparity in improvements over the base systems. This indicates that FESTIVAL and SEQUITUR are benefiting from the DIRECTL+-style features used in the re-ranking. Without the n-gram features, however, there is still a significant improvement over FESTIVAL, demonstrating that the scores do provide useful information.
In this case there is no way for DIRECTL+-style information to make its way into the re-ranking; the process is based purely on the transliterations and their similarities with the transcriptions in the output lists, indicating that the system is capable of extracting useful information directly from transliterations. In the case of DIRECTL+, the transliterations help through the n-gram features rather than the score features; this is probably because the crucial feature that signals the inability of M2M-ALIGNER to align a given transliteration-transcription pair belongs to the set of the n-gram features. Both the n-gram features and score features are dependent on the alignments, but they differ in that the n-gram features allow weights to be learned for local n-gram pairs whereas the score features are based on global information, providing only a single feature for a given transliteration-transcription pair. The two therefore overlap to some degree, although the score features still provide useful information via probabilities learned during the alignment training process.

A closer look at the results provides additional insight into the operation of our re-ranking system. For example, consider the name Bacchus, which DIRECTL+ incorrectly converts into /bæktʃəs/. The most likely reason why our re-ranker selects instead the correct pronunciation /bækəs/ is that M2M-ALIGNER fails to align three of the five available transliterations with /bæktʃəs/. Such alignment failures are caused by a lack of evidence in the transliteration training data for the mapping of the grapheme representing the sound /k/ to the phoneme /tʃ/. In addition, the lack of alignments prevents any n-gram features from being enabled.

Considering the difficulty of the task, the top accuracy of almost 75% is quite impressive. In fact, many instances of human transliterations in our corpora are clearly incorrect. For example, the Hindi transliteration of Bacchus contains the /tʃ/ consonant instead of the correct /k/. Moreover, our strict evaluation based on word accuracy counts all system outputs that fail to exactly match the dictionary data as errors. The differences are often very minor and may reflect an alternative pronunciation. The phoneme accuracy (calculated from the minimum edit distance between the predicted and correct pronunciations) of our best result is 93.1%, which provides some idea of how similar the predicted pronunciation is to the correct one.
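Under that definition, phoneme accuracy can be computed roughly as follows; the corpus-level normalization by the total number of reference phonemes is an assumption, since the text only states that the measure is based on the minimum edit distance.

```python
def edit_distance(a, b):
    """Minimum edit distance between a predicted and a correct phoneme sequence."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def phoneme_accuracy(predictions, references):
    """Corpus-level phoneme accuracy: one minus the total edit distance divided by
    the total number of reference phonemes (assumed normalization), in percent."""
    total_dist = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_ref = sum(len(r) for r in references)
    return 100.0 * (1.0 - total_dist / total_ref)
```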
   # TL    # Entries    Improvement
   ≤ 1          111            0.9
   ≤ 2          266            3.0
   ≤ 3          398            3.8
   ≤ 4          536            3.2
   ≤ 5          619            2.8
   ≤ 6          685            3.4
   ≤ 7          732            3.7
   ≤ 8          762            3.5
   ≤ 9          763            3.5

Table 4: Absolute improvement in word accuracy (%) over the base system (DIRECTL+) of the SVM re-ranker for various numbers of available transliterations.

3.5   Effect of multiple transliterations

One motivating factor for the use of SVM re-ranking was the ability to incorporate multiple transliteration languages. But how important is it to use more than one language? To examine this question, we look particularly at the sets of names having at most k transliterations available. Table 4 shows the results with DIRECTL+ as the base system. Note that the number of names with more than five transliterations was small. Importantly, we see that the increase in performance when only one transliteration is available is so small as to be insignificant. From this, we can conclude that obtaining improvement on the basis of a single transliteration is difficult in general. This corroborates the results of the experiment described in Section 3.3, where we used only Hindi transliterations.
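A breakdown of this kind can be produced with a simple filter over per-name results; the sketch below assumes a hypothetical result format (number of transliterations plus binary correctness for the base and re-ranked outputs), not the paper's evaluation code.

```python
def improvement_by_max_transliterations(results, max_k):
    """Word-accuracy improvement (re-ranked minus base, in percentage points) on the
    subset of names with at most max_k transliterations. `results` is a list of
    (num_transliterations, base_correct, reranked_correct) triples (hypothetical)."""
    subset = [(base, reranked) for n, base, reranked in results if n <= max_k]
    if not subset:
        return 0.0
    base_acc = sum(b for b, _ in subset) / len(subset)
    reranked_acc = sum(r for _, r in subset) / len(subset)
    return 100.0 * (reranked_acc - base_acc)
```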
4   Previous work

There are three lines of research that are relevant to our work: (1) G2P in general; (2) G2P on names; and (3) combining diverse data sources and/or systems.

The two leading approaches to G2P are represented by SEQUITUR (Bisani and Ney, 2008) and DIRECTL+ (Jiampojamarn et al., 2010). Recent comparisons suggest that the latter obtains somewhat higher accuracy, especially when it includes joint n-gram features (Jiampojamarn et al., 2010). Systems based on decision trees are far behind. Our results confirm this ranking.

Names can present a particular challenge to G2P systems. Kienappel and Kneser (2001) report a higher error rate for German names than for general words, while on the other hand Black et al. (1998) report similar accuracy on names as for other types of English words. Yang et al. (2006) and van den Heuvel et al. (2007) post-process the output of a general G2P system with name-specific phoneme-to-phoneme (P2P) systems. They find significant improvement using this method on data sets consisting of Dutch first names, family names, and geographical names. However, it is unclear whether such an approach would be able to improve the performance of the current state-of-the-art G2P systems. In addition, the P2P approach works only on single outputs, whereas our re-ranking approach is designed to handle n-best output lists.

Although our approach is (to the best of our knowledge) the first to use different tasks (G2P and transliteration) to inform each other, it is conceptually similar to model and system combination approaches. In statistical machine translation (SMT), methods that incorporate translations from other languages (Cohn and Lapata, 2007) have proven effective in low-resource situations: when phrase translations are unavailable for a certain language, one can look at other languages where the translation is available and then translate from that language. A similar pivoting approach has also been applied to machine transliteration (Zhang et al., 2010). Notably, the focus of these works has been on cases in which less data are available; they also modify the generation process directly, rather than operating on existing outputs as we do. Ultimately, a combination of the two approaches is likely to give the best results.

Finch and Sumita (2010) combine two very different approaches to transliteration using simple linear interpolation: they take SEQUITUR's n-best outputs and re-rank them using a linear combination of the original SEQUITUR score and the score for that output from a phrase-based SMT system. The linear weights are hand-tuned. We similarly use linear combinations, but with many more scores and other features, necessitating the use of SVMs to determine the weights. Importantly, we combine different data types where they combine different systems.

5   Conclusions & future work

In this paper, we explored the application of transliterations to G2P. We demonstrated that transliterations have the potential for helping choose between n-best output lists provided by standard G2P systems. Simple approaches based solely on similarity do not work when tested using a single transliteration language (Hindi), necessitating the use of smarter methods that can incorporate multiple transliteration languages. We apply SVM re-ranking to this task, enabling us to use a variety of features based not only on similarity scores but on n-grams as well. Our method shows impressive error reductions over the popular FESTIVAL system and the generative joint n-gram SEQUITUR system. We also find significant error reduction using the state-of-the-art DIRECTL+ system. Our analysis demonstrated that it is essential to provide the re-ranking system with transliterations from multiple languages in order to mitigate the differences between phonological inventories and smooth out noise in the transliterations.

In the future, we plan to generalize our approach so that it can be applied to the task of generating transliterations, and to combine data from distinct G2P dictionaries. The latter task is related to the notion of domain adaptation. We would also like to apply our approach to web data; we have shown that it is possible to use noisy transliteration data, so it may be possible to leverage the noisy ad hoc pronunciation data as well. Finally, we plan to investigate earlier integration of such external information into the G2P process for single systems; while we noted that re-ranking provides a general approach applicable to any system that can generate n-best lists, there is a limit to what re-ranking can do, as it relies on the correct output existing in the n-best list. Modifying existing systems would provide greater potential for improving results, even though the changes would necessarily be system-specific.

Acknowledgements

We are grateful to Sittichai Jiampojamarn and Shane Bergsma for the very helpful discussions. This research was supported by the Natural Sciences and Engineering Research Council of Canada.
References

Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451, May.

Alan W. Black, Kevin Lenzo, and Vincent Pagel. 1998. Issues in building general letter to sound rules. In The Third ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis, Jenolan Caves House, Blue Mountains, New South Wales, Australia, November.

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 728–735, Prague, Czech Republic, June. Association for Computational Linguistics.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Andrew Finch and Eiichiro Sumita. 2010. Transliteration using a phrase-based statistical machine translation system to re-score the output of a joint multigram model. In Proceedings of the 2010 Named Entities Workshop (NEWS 2010), pages 48–52, Uppsala, Sweden, July. Association for Computational Linguistics.

Sittichai Jiampojamarn and Grzegorz Kondrak. 2010. Letter-phoneme alignment: An exploration. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 780–788, Uppsala, Sweden, July. Association for Computational Linguistics.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 372–379, Rochester, New York, USA, April. Association for Computational Linguistics.

Sittichai Jiampojamarn, Aditya Bhargava, Qing Dou, Kenneth Dwyer, and Grzegorz Kondrak. 2009. DirecTL: a language independent approach to transliteration. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), pages 28–31, Suntec, Singapore, August. Association for Computational Linguistics.

Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2010. Integrating joint n-gram features into a discriminative training framework. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 697–700, Los Angeles, California, USA, June. Association for Computational Linguistics.

Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, Edmonton, Alberta, Canada. Association for Computing Machinery.

Anne K. Kienappel and Reinhard Kneser. 2001. Designing very compact decision trees for grapheme-to-phoneme transcription. In EUROSPEECH-2001, pages 1911–1914, Aalborg, Denmark, September.

Grzegorz Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics, pages 288–295, Seattle, Washington, USA, April.

Haizhou Li, A Kumaran, Vladimir Pervouchine, and Min Zhang. 2009a. Report of NEWS 2009 machine transliteration shared task. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), pages 1–18, Suntec, Singapore, August. Association for Computational Linguistics.

Haizhou Li, A Kumaran, Min Zhang, and Vladimir Pervouchine. 2009b. Whitepaper of NEWS 2009 machine transliteration shared task. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), pages 19–26, Suntec, Singapore, August. Association for Computational Linguistics.

Haizhou Li, A Kumaran, Min Zhang, and Vladimir Pervouchine. 2010. Report of NEWS 2010 transliteration generation shared task. In Proceedings of the 2010 Named Entities Workshop (NEWS 2010), pages 1–11, Uppsala, Sweden, July. Association for Computational Linguistics.

Korin Richmond, Robert Clark, and Sue Fitt. 2009. Robust LTS rules with the Combilex speech technology lexicon. In Proceedings of Interspeech, pages 1295–1298, Brighton, UK, September.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532, May.

Henk van den Heuvel, Jean-Pierre Martens, and Nanneke Konings. 2007. G2P conversion of names. What can we do (better)? In Proceedings of Interspeech, pages 1773–1776, Antwerp, Belgium, August.

Qian Yang, Jean-Pierre Martens, Nanneke Konings, and Henk van den Heuvel. 2006. Development of a phoneme-to-phoneme (p2p) converter to improve the grapheme-to-phoneme (g2p) conversion of names. In Proceedings of the 2006 International Conference on Language Resources and Evaluation, pages 2570–2573, Genoa, Italy, May.

Min Zhang, Xiangyu Duan, Vladimir Pervouchine, and Haizhou Li. 2010. Machine transliteration: Leveraging on third languages. In Coling 2010: Posters, pages 1444–1452, Beijing, China, August. Coling 2010 Organizing Committee.