A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

Andrew McCallum and Kedar Bellare
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA 01003, USA
{mccallum,kedarb}@cs.umass.edu

Fernando Pereira
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104, USA
pereira@cis.upenn.edu
Abstract

The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

1   Introduction

Parameterized string similarity models based on string edits have a long history (Levenshtein, 1966; Needleman & Wunsch, 1970; Sankoff & Kruskal, 1999). However, there are few methods for learning model parameters from training data, even though, as in other tasks, learning may lead to greater accuracy on real-world problems.

Ristad and Yianilos (1998) proposed an expectation-maximization-based method for learning string edit distance with a generative finite-state model. In their approach, training data consists of pairs of strings that should be considered similar, and the parameters are probabilities of certain edit operations. In the E-step, the highest-probability edit sequence is found using the current parameters; in the M-step the probabilities are re-estimated using the expectations determined in the E-step so as to reduce the cost of the edit sequences expected to have caused the match. A useful attribute of this method is that the edit operations and parameters can be associated with states of a finite-state machine (with probabilities of edit operations depending on previous edit operations, as determined by the finite-state structure). However, as a generative model, it cannot tractably incorporate arbitrary features of the input strings, and it cannot benefit from negative evidence from pairs of strings that (while partially overlapping) should be considered dissimilar.

Bilenko and Mooney (2003) extend Ristad's model to include affine gaps, and also present a learned string similarity measure based on unordered bags of words, with training performed by an SVM. Cohen and Richman (2002) use a conditional maximum entropy classifier to learn weights on several sequence distance features. A survey of string edit distance measures is provided by Cohen et al. (2003). However, none of these methods combines the expressive power of a Markov model of edit operations with discriminative training.

This paper presents an undirected graphical model for string edit distance, and a conditional-probability parameter estimation method that exploits both matching and non-matching sequence pairs. Based on conditional random fields (CRFs), the approach not only provides powerful capabilities long sought in many application domains, but also demonstrates an interesting example of discriminative learning of a probabilistic model involving structured latent variables.

The training data consists of input string pairs, each associated with a binary label indicating whether the pair should be considered a "match" or a "mismatch." Model parameters are estimated from both positive and negative examples, unlike in previous generative models (Ristad & Yianilos, 1998; Bilenko & Mooney, 2003). As in those models, however, it is not necessary to provide the desired edit operations or alignments: the alignments that enable the most accurate discrimination will be discovered automatically through an EM procedure. Thus this model is an example of an interesting class of graphical models that are trained conditionally, but have latent variables, and find the latent-variable parameters that maximize discriminative performance. Another recent example is work on CRFs for object recognition from images (Quattoni et al., 2005).
The model is structured as a finite-state machine (FSM) with a single initial state and two disjoint sets of non-initial states with no transitions between them. State transitions are labeled by edit operations. One of the disjoint sets represents the match condition, the other the mismatch condition. Any non-empty transition path starting at the initial state defines an edit sequence that is wholly contained in either the match or the mismatch subset of the machine. By marginalizing out all the edit sequences in a subset, we obtain the probability of match or mismatch.

The cost of a transition is a function of its edit operation, the previous state, the new state, the two input strings, and the starting and ending positions (the positions of the match-so-far before and after performing this edit operation) in each of the two input strings. In applications, we take full advantage of this flexibility. For example, the cost function can examine portions of the input strings both before and after the current match position, it can draw on domain knowledge such as lexicons, or it can depend on rich conjunctions of more primitive features.

The flexibility of edit operations is possibly even more valuable. Edits can make arbitrarily-sized forward jumps in both input strings, and the size of the jumps can be conditioned on the input strings, the current match points in each, and the previous state of the finite-state process. For example, a single edit operation could match a three-letter acronym against its expansion in the other string by consuming three capitalized characters in the first string and three matching words in the second string. The cost of such an operation could be conditioned on the previous state of the finite-state process, as well as the appearance of the consumed strings in various lexicons, and the words following the acronym.

Inference and training in the model depend on a complex dynamic program in three dimensions. We employ various optimizations to speed learning.

We present experimental results on five standard text data sets, including short strings such as names and addresses, as well as longer, more complex strings such as bibliographic citations. We show significant error reductions in all but one of the data sets.

2   Discriminatively Trained String Edit Distance

Let x = x_1 · · · x_m and y = y_1 · · · y_n be two strings or symbol sequences. This pair of input strings is associated with an output label z ∈ {0, 1} indicating whether the strings should be considered a match (1) or a mismatch (0).¹ As we now explain, our model scores alignments between x and y as to whether they represent a match or a mismatch. An alignment a is a four-tuple consisting of a sequence of edit operations, two sequences of string positions, and a sequence of FSM states.

¹ One could also straightforwardly imagine a different, regression-based scenario in which z is real-valued, or a ranking-based criterion, in which two pairs are provided and z indicates which pair of strings should be considered closer.

Let a.e = e_1 · · · e_k denote the sequence of edit operations, such as delete-one-character-in-x, substitute-one-character-in-x-for-one-character-in-y, or delete-all-characters-in-x-up-to-its-next-nonalphabetic. Each edit operation e_p in the sequence consumes either some of x (deletion), some of y (insertion), or some of both (substitution), up to positions i_p in x and j_p in y. We therefore have corresponding non-decreasing sequences a.ix = i_1, . . . , i_k and a.iy = j_1, . . . , j_k of edit-operation positions for x and y.

To classify alignments into matches or mismatches, we take edits as transition labels for a non-deterministic FSM with state set S = {q_0} ∪ S_0 ∪ S_1. There are transitions from the initial state q_0 to states in the disjoint sets S_0 and S_1, but no transitions between those two sets. In addition to the edit sequence and string position sequences, we associate the alignment a with a sequence of consecutive destination states a.q = q_1 · · · q_k, where e_p labels an allowed transition from q_{p−1} to q_p. By construction, either a.q ⊆ S_0 or a.q ⊆ S_1. Alignments with states in S_1 are supposed to represent matches, while alignments with states in S_0 are supposed to represent mismatches.

In summary, an alignment is specified by the four-tuple a = ⟨a.e = e_1 · · · e_k, a.ix = i_1 · · · i_k, a.iy = j_1 · · · j_k, a.q = q_1 · · · q_k⟩. For convenience, we also write a = a_0, a_1 · · · a_k with a_p = ⟨e_p, i_p, j_p, q_p⟩ for 1 ≤ p ≤ k, and a_0 = ⟨−, 0, 0, q_0⟩, where − is a dummy initial edit.
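To make the four-tuple concrete, here is a minimal sketch in Python (our own naming for illustration, not the authors' implementation):

    from typing import List, NamedTuple

    class Step(NamedTuple):
        """One alignment step a_p = <e_p, i_p, j_p, q_p>."""
        edit: str   # e_p, the edit operation
        i: int      # i_p, position reached in x
        j: int      # j_p, position reached in y
        q: int      # q_p, destination FSM state

    # A toy alignment of x = "kat" and y = "cat": three substitutions,
    # staying in a hypothetical match-subset state 1 after leaving q0 = 0.
    alignment: List[Step] = [
        Step("-", 0, 0, 0),        # a_0: dummy initial edit in state q0
        Step("subst", 1, 1, 1),    # k -> c
        Step("subst", 2, 2, 1),    # a -> a
        Step("subst", 3, 3, 1),    # t -> t
    ]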
Given two strings x and y, our discriminative string edit CRF defines the probability of an alignment between x and y as

    p(a | x, y) = (1 / Z_{x,y}) ∏_{i=1..|a|} Φ(a_{i−1}, a_i, x, y),
where the potential function Φ(·) is a non-negative function of its arguments, and Z_{x,y} is the normalizer (partition function). In our experiments we parameterize these potential functions as the exponential of a linear scoring function,

    Φ(a_{i−1}, a_i, x, y) = exp(Λ · f(a_{i−1}, a_i, x, y)),

where f is a vector of feature functions, each taking as arguments two consecutive states in the alignment sequence, the corresponding edits, and their string positions, which allow the feature functions to depend on the context of a_i in x and y. A typical feature function combines some predicate on the input, or input feature, with a predicate over the alignment itself (edit operation, states, positions).
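For instance, a minimal sketch of such feature functions and the resulting potential (toy features and hand-set weights of our own, not the feature set used in the experiments of Section 5.1):

    import math

    def features(edit, prev_state, state, i, j, x, y):
        """Binary feature vector f for one alignment step: input-feature
        predicates conjoined with edit-operation predicates."""
        same = 0 < i <= len(x) and 0 < j <= len(y) and x[i - 1] == y[j - 1]
        return [
            1.0 if (edit == "subst" and same) else 0.0,      # same & substitute
            1.0 if (edit == "subst" and not same) else 0.0,  # different & substitute
            1.0 if edit in ("insert", "delete") else 0.0,    # any gap edit
        ]

    LAMBDA = [2.0, -1.5, -1.0]  # hypothetical weights; learned in the real model

    def potential(edit, prev_state, state, i, j, x, y):
        """Phi = exp(Lambda . f) for one transition of the alignment."""
        f = features(edit, prev_state, state, i, j, x, y)
        return math.exp(sum(w * v for w, v in zip(LAMBDA, f)))

    print(potential("subst", 1, 1, 3, 3, "katzu", "katsu"))  # 't' == 't': e^2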
To obtain the probability of match given just the input strings, we marginalize over all alignments in the corresponding state set:

    p(z | x, y) = Σ_{a : a.q ⊆ S_z} (1 / Z_{x,y}) ∏_{i=1..|a|} Φ(a_{i−1}, a_i, x, y).

Fortunately, this sum can be calculated efficiently by dynamic programming. Typically, for any given edit operation, starting positions, and input strings, there are only a small number of possible resulting ending positions. Max-product (Viterbi-like) inference can also be performed efficiently.
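As an illustration, a minimal sketch of this sum-product dynamic program (a toy two-state machine with a hand-set potential standing in for exp(Λ · f); not the Mallet implementation described in Section 4):

    import math
    from collections import defaultdict

    def phi(prev_q, q, di, dj, i, j, x, y):
        """Toy potential standing in for exp(Lambda . f): the match-subset
        state (1) rewards agreeing substitutions; the mismatch-subset
        state (2) is indifferent."""
        same = di == 1 and dj == 1 and x[i - 1] == y[j - 1]
        if q == 1:
            return math.exp(2.0 if same else -1.0)
        return math.exp(0.0)

    def match_probability(x, y, transitions, states0={2}, states1={1}):
        """Sum over all alignments in each state subset; return p(z=1|x,y).
        `transitions` holds (prev_state, state, di, dj) tuples, where
        (di, dj) is how much of x and y the edit consumes."""
        m, n = len(x), len(y)
        alpha = defaultdict(float)
        alpha[(0, 0, 0)] = 1.0                  # initial state q0, nothing consumed
        for d in range(m + n + 1):              # visit cells in diagonal order
            for i in range(max(0, d - n), min(m, d) + 1):
                j = d - i
                for (pq, q, di, dj) in transitions:
                    if alpha[(pq, i, j)] == 0.0 or i + di > m or j + dj > n:
                        continue
                    alpha[(q, i + di, j + dj)] += (
                        alpha[(pq, i, j)]
                        * phi(pq, q, di, dj, i + di, j + dj, x, y))
        z0 = sum(alpha[(q, m, n)] for q in states0)
        z1 = sum(alpha[(q, m, n)] for q in states1)
        return z1 / (z0 + z1)

    # Substitute, delete, and insert edits: from q0 into each subset,
    # then self-loops within each subset.
    edits = ((1, 1), (1, 0), (0, 1))
    ops = ([(0, q, di, dj) for q in (1, 2) for (di, dj) in edits]
           + [(q, q, di, dj) for q in (1, 2) for (di, dj) in edits])
    print(match_probability("katzu", "katsu", ops))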
                                                                 tive non-probabilistic methods for structured problems
3   Parameter Estimation

Parameters are estimated by penalized maximum likelihood on a set of training data. Training data consists of a set of N string pairs ⟨x^(j), y^(j)⟩ with corresponding labels z^(j) ∈ {0, 1} indicating whether or not each pair is a match. We use a zero-mean spherical Gaussian prior, Σ_k λ_k²/σ², for penalization.

The incomplete (non-penalized) log-likelihood is then

    L_I = Σ_j log p(z^(j) | x^(j), y^(j)),

and the complete log-likelihood is

    L_C = Σ_j Σ_a log( p(z^(j) | a, x^(j), y^(j)) p(a | x^(j), y^(j)) ).

We maximize this likelihood with EM, estimating p(a | x^(j), y^(j)) given current parameters Λ in the E-step, and maximizing the complete penalized log-likelihood in the M-step. For optimization in the M-step we use BFGS. Unlike CRFs without latent variables, the objective function has local maxima. To avoid getting stuck in poor local maxima, the parameters are initialized to yield a reasonable default edit distance.
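Spelled out in the standard EM form (our notation; the paper does not display these equations), the E-step computes a posterior over alignments consistent with the observed label, and the M-step maximizes the expected penalized complete log-likelihood:

    E-step:   p(a | z^(j), x^(j), y^(j); Λ_old) ∝ p(a | x^(j), y^(j); Λ_old) · 1[a.q ⊆ S_{z^(j)}]

    M-step:   Λ_new = argmax_Λ  Σ_j Σ_{a : a.q ⊆ S_{z^(j)}} p(a | z^(j), x^(j), y^(j); Λ_old) log p(a | x^(j), y^(j); Λ) − Σ_k λ_k²/σ²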
Dynamic programming for this model fills a three-dimensional table (two dimensions for the two input strings, and one for the states in S). The table can be moderately large in practice (n = m = 100 and |S| = 12 results in 120,000 entries), and beam search may effectively be used to increase speed, just as in speech recognition, where even larger tables are common.

It is interesting to examine what alignments will be learned in S_0, the non-match portion of the model. To attain high accuracy, these states should attract string pairs that are dissimilar. But even similar strings have bad alignments, for example the alignment that first deletes all of x and then inserts all of y. Fortunately, finding how dissimilar two strings are requires finding as good an alignment as is possible, and then deciding that this alignment is not very good. These as-good-as-possible alignments are exactly what our learning procedure discovers: driven by an objective function that aims to maximize the likelihood of the correct binary match/non-match labels, the model finds the latent alignment paths that enable it to maximize this likelihood.

This model thus falls in a family of interesting techniques involving discrimination among complex structured objects, in which the structure or relationship among the parts is unknown (latent), and the latent choice has high impact on the discrimination task. Similar considerations are at the core of discriminative non-probabilistic methods for structured problems such as handwriting recognition (LeCun et al., 1998) and speech recognition (Woodland & Povey, 2002), and, more recently, computer vision object recognition (Quattoni et al., 2005). We discuss related work further in Section 6.

4   Implementation

The model has been implemented as part of the finite-state transducer classes in Mallet (McCallum, 2002). We map the three-dimensional dynamic programming problems over positions in x and y and states S to Mallet's existing finite-state forward-backward and Viterbi implementations by encoding the two position indices into a single index in a diagonal crossing pattern that starts at (0, 0). For example, a single-character delete operation, which would be a hop to an adjacent vertical or horizontal cell in the original table, becomes a longer, one-dimensional (but deterministically-calculated) jump in the encoding.

In addition to the standard edit operations (insertion, deletion, substitution), we also have more powerful edits that fit naturally into this model, such as delete-until-end-of-word, delete-word-in-lexicon, and delete-word-appearing-in-other-string.
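The paper does not spell out the exact index mapping; a minimal sketch of one diagonal crossing order consistent with the description above (hypothetical naming):

    def diagonal_index(i, j, m, n):
        """Index of DP cell (i, j) in a diagonal crossing of the
        (m+1) x (n+1) table starting at (0, 0): all cells on diagonal
        d = i + j come before any cell on diagonal d + 1."""
        d = i + j
        # total number of cells on diagonals 0 .. d-1
        before = sum(min(m, dd) - max(0, dd - n) + 1 for dd in range(d))
        return before + (i - max(0, d - n))

    m = n = 3
    # A single-character delete moves (i, j) -> (i+1, j): an adjacent cell
    # in the table, but a multi-cell jump in the one-dimensional encoding.
    print(diagonal_index(1, 1, m, n), diagonal_index(2, 1, m, n))  # 4 8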
5   Experimental Results

We show experimental results on one synthetic and six real-world data sets, all of which have been used in previous work evaluating string edit measures. The first two data sets are the name and address fields of the Restaurant database. Among its 864 records, 112 are matches. The last four data sets are citation strings from the standard Reasoning, Constraint, Reinforcement, and Face sections of the CiteSeer data. The ratios of citations to unique papers for these are 514/196, 349/242, 406/148, and 295/199, respectively. Making the problem more challenging than certain other evaluations on these data sets, our strings are not segmented into fields such as title or author, but are each treated as a single unsegmented character sequence. We also present results on synthetic noise on person names, generated by the UIS Database generator. This program produces perturbed names according to modifiable noise parameters, including the probability of an error anywhere in a record, the probability of single-character insertion, deletion, or swap, and the probability of a word swap.

5.1   Edit Operations and Features

One of the main advantages of our model is the ability to include non-independent input features and extremely flexible edit operations. The input features used in our experiments include subsets of the following, described as acting on cell i, j in the dynamic programming table and the two input strings x and y:

• same, different: x_i and y_j match (do not match);
• same-alphabetic, different-alphabetic: x_i and y_j are alphabetic and they match (do not match);
• same-numeric, different-numeric: x_i and y_j are numeric and they match (do not match);
• punctuation-x, punctuation-y: x_i (respectively y_j) is punctuation;
• alphabet-mismatch, number-mismatch: one of x_i and y_j is alphabetic (numeric), the other is not;
• end-of-x, end-of-y: i = |x| (j = |y|);
• same-next-character, different-next-character: x_{i+1} and y_{j+1} match (do not match).

Edit operations on FSM transitions include:

• Standard string edit operations: insert, delete, and substitute.
• Two-character operations: swap-two-characters.
• Word skip operations: skip-if-word-in-lexicon, skip-word-if-present-in-other-string, skip-parenthesized-words, and skip-any-word.
• Operations for handling acronyms and abbreviations by inserting, deleting, or substituting specific types of substrings.

Learned parameters are associated with the input features as well as with state transitions in the FSM. All transitions entering a state may share tied parameters (first order), or have different parameters (second order). Since the FSM can have more states than edit operations, it can remember the context of previous edit actions.

5.2   Experimental Methodology

Our model exploits both positive and negative examples during training. Positive training examples include all pairs of strings referring to the same object (the matching strings). However, the total number of negative examples is quadratic in the number of objects. Due to both time and memory constraints, as well as a desire to avoid overwhelming the positive training examples, we sample the negative (mismatch) string pairs so as to attain a 1:10 ratio of match to mismatch pairs. In order to preferentially sample "near misses," we filter negative examples in one of two ways, as sketched below:

• Remove negative examples that are too dissimilar according to a suitable metric. For the CiteSeer datasets we use the cosine metric to measure similarity of two citations; for other datasets we use the metric of Jaro (1989).
• Select the best-matching negative pairs according to a CRF with parameters set by hand to reasonable values.
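A minimal sketch of the first filtering strategy together with the 1:10 sampling ratio (our own toy tokenization and threshold; `records` is a hypothetical mapping from an object id to its co-referent strings):

    import random
    from collections import Counter
    from itertools import combinations

    def cosine(s, t):
        """Cosine similarity between bag-of-words vectors of two strings."""
        a, b = Counter(s.split()), Counter(t.split())
        dot = sum(a[w] * b[w] for w in a)
        norm = (sum(v * v for v in a.values()) ** 0.5 *
                sum(v * v for v in b.values()) ** 0.5)
        return dot / norm if norm else 0.0

    def sample_training_pairs(records, threshold=0.2, ratio=10, seed=0):
        """All positive pairs; negatives filtered to 'near misses' and
        downsampled to `ratio` mismatch pairs per match pair."""
        positives, negatives = [], []
        for strings in records.values():
            positives.extend(combinations(strings, 2))
        for g1, g2 in combinations(list(records.values()), 2):
            negatives.extend((s, t) for s in g1 for t in g2
                             if cosine(s, t) >= threshold)
        random.Random(seed).shuffle(negatives)
        return positives, negatives[: ratio * len(positives)]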
As in Bilenko and Mooney (2003), we use a 50/50 train/test split of the data, and repeat the process with the folds interchanged. With the restaurant name and restaurant address datasets, we run our algorithm with different choices of features and states, and 4 random splits of the data. With the CiteSeer datasets, we have results for two random splits of the data.

To give EM training a reasonable starting point, we hand-set the initial parameters to somewhat arbitrary, yet reasonable, values. (Of course, hand-setting of string edit parameters is the standard for all the non-learning approaches.) We examined a small held-out set of data to verify that these initial parameters were reasonable. We set the parameters on the match portion of the FSM to provide good alignments; we then copy these parameters to the mismatch portion of the model, offsetting them by bringing all values closer to zero by a small constant.
Distance Metric          Restaurant name   Restaurant address   Reasoning    Face    Reinforcement   Constraint
Edit Distance                 0.290              0.686            0.927      0.952       0.893         0.924
Learned Edit Distance         0.354              0.712            0.938      0.966       0.907         0.941
Vector-space                  0.365              0.380            0.897      0.922       0.903         0.923
Learned Vector-space          0.433              0.532            0.924      0.875       0.808         0.913
CRF Edit Distance             0.448              0.783            0.964      0.918       0.917         0.976

Table 1: Averaged F-measure for detecting matching field values on several standard data sets (bold indicates highest F1). The top four rows are results duplicated from Bilenko and Mooney (2003); the bottom row is the performance of the CRF method introduced in this paper.

Lexicons were populated automatically by gathering the most frequent words in the training set. (Alternatively, one could imagine lexicon feature values set to inverse-document-frequency values, or similar information retrieval metrics.) In some cases, before training, lexicons were edited to remove author surnames.

The equations in Section 3 are used to calculate p(z | x, y), with a first-order model. A threshold of 0.5 on this probability predicts whether the string pair is a match or a mismatch. (Note that alternative thresholds could easily be used to trade off precision and recall, and that CRFs are typically good at predicting the calibrated posterior probabilities needed for such tuning, as well as for accuracy/coverage curves.) Bilenko and Mooney (2003) found transitive closure to improve F1 and use it for their results; we did not find it to help, and do not use it.

Precision is calculated as the ratio of the number of correctly classified duplicates to the total number of duplicates identified. Recall is the ratio of correctly classified duplicates to the total number of duplicates in the dataset. We report the mean performance across multiple random splits.
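In code, these evaluation quantities amount to the following small sketch (pairs are assumed to be canonically ordered):

    def f1_score(predicted_pairs, true_pairs):
        """Precision over predicted duplicate pairs, recall over true
        duplicate pairs, and their harmonic mean F1."""
        predicted, true = set(predicted_pairs), set(true_pairs)
        correct = len(predicted & true)
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(true) if true else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(f1_score({("a", "b"), ("a", "c")}, {("a", "b"), ("b", "c")}))  # 0.5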
5.3   Results

In experiments on the six real-world data sets, we compare our performance against results in a recent benchmark paper by Bilenko and Mooney (2003); Bilenko recently completed thesis work in this area. These results are summarized in Table 1, where the top four rows are duplicated from Bilenko and Mooney (2003), and the bottom row shows the results of our method. The entries are the average F1 measure across the folds. We observe large performance improvements on most datasets. The fact that the difference in performance across our trials is typically around 0.01 suggests strong statistical significance. Our average F1 on the Face dataset was 0.04 less than the previous best. The examples on which we made errors generally had a large venue, authors, or URL field in one string but not in the other.

We also evaluate the effect on performance of using Viterbi (max-product) inference in training instead of forward-backward (sum-product) inference. Except for the restaurant address dataset, forward-backward performs significantly better than Viterbi on all datasets. The restaurant address data set contains positive examples with a large unmatched suffix in one of the strings, which may lead to an inappropriate dilution of probability amongst many alignments. Average F1 measures for the restaurant datasets using Viterbi and forward-backward are shown in Table 2. All results shown in Table 1 use forward-backward probabilities.

Dataset              Viterbi   Forward-Backward
Restaurant name       0.689         0.720
Restaurant address    0.708         0.651

Table 2: Averaged F-measures for Viterbi vs. forward-backward inference (trained and evaluated on a subset of the data; the smaller test set yields higher accuracy).

In the other tables we present results showing the impact of various edit operations and features.

Table 3 shows F1 on the restaurant name data set as various edit operations are added to the model: i denotes insert, d denotes delete, s denotes substitute, paren denotes skip-parenthesized-words, lex denotes skip-if-word-in-lexicon, and pres denotes skip-word-if-present-in-other-string. All use the same-alphabetic and different-alphabetic input features. As can be seen from the results, adding "skip" edits improves performance. Although skip-parenthesized-words gives better results on the smaller data set used for the experiments in the table, skip-if-word-in-lexicon produces higher accuracy on larger data sets, because of peculiarities in how restaurants with the same name and different locations are named in the data set. We also see that a second-order model performs less well, presumably because of data sparseness.
Run                                      F1
i, d, s                                0.701
i, d, s, paren                         0.835
i, d, s, lex                           0.769
i, d, s, lex, 2nd order                0.742
i, d, s, paren, lex, pres              0.746
i, d, s, paren, lex, pres, 2nd order   0.699

Table 3: Averaged maximum F-measure for different state combinations on a subset of restaurant name (trained and evaluated on the same train/test split).

Table 4 shows the benefits of including various features for the restaurant address data set, while fixing the edit operations (insert, delete, and substitute). In the table, s and d denote the same and different features, salp and dalp stand for the same-alphabetic and different-alphabetic features, and snum and dnum stand for the same-numeric and different-numeric features. The s and d features are different from the salp, dalp, snum, and dnum features in that the weights learned for the former depend only on whether the two characters are equal or not, and no separate weights are learned for a number match or a letter match. We conjecture that a number mismatch in the address data needs to be penalized more than a letter mismatch. Separating the same and different features into features for letters and numbers reduces the error from about 6% to 3%.

Run                          F1
s, d                        0.944
salp, dalp, snum, dnum      0.973

Table 4: Averaged maximum F1-measure for different feature combinations on a subset of the restaurant address data set.

Finally, Table 5 demonstrates the power of CRFs to include extremely flexible edit operations that examine arbitrary pieces of the two input strings. In particular, we measure the impact of including the skip-word-if-present-in-other-string operation ("skip" for short). Here we train and test on the UIS synthetic name data, in which the error probability is 40%, the typo error probability is 40%, and the swap-first-and-last-name probability is 50% (the rest of the parameters were unchanged from the default values). The difference in performance is dramatic, bringing error down from about 14% to less than 2%. Of course, arbitrary substring swaps are not expressible in standard dynamic programs, but the skip operation gives an excellent approximation while preserving efficient finite-state inference. Typical improved alignments with the new operation may skip over a matching swapped first name, and then proceed to correct individual typographic errors in the last name.

Run             F1
Without skip   0.856
With skip      0.981

Table 5: Average maximum F-measure for the synthetic name dataset with and without the skip-if-present-in-other-string state.

An example alignment found by our model on restaurant name is shown in Table 7. As discussed in Section 3, the mismatch portion of the model indeed learns the best possible latent alignments in order to measure distance with the most salient features. This example's alignment score from the match portion is higher. The entries in the dynamic programming table i, d, s, l, and p correspond to states reached by the operations insert, delete, substitute, skip-word-in-lexicon, and skip-parenthesized-word, respectively. The symbol − denotes a null transition.

Table 6: Alignment in both the match and mismatch subsets of the model, with correct prediction. Operations causing edits are in bold. [alignment grid not legibly recoverable in this copy]

Table 7: Alignment in both the match and mismatch subsets of the model, with correct prediction. Operations causing edits are in bold. [alignment grid not legibly recoverable in this copy]

6   Related Work

String (dis)similarity metrics based on edit distance are widely used in applications ranging from approximate matching and duplicate removal in database records to identifying conserved regions in comparative genomics. Levenshtein (1966) introduced least-cost editing based on independent symbol insertion, deletion, and substitution costs, and Needleman and Wunsch (1970) extended the method to allow gaps. Editing between strings over the same alphabet can be generalized to transduction between strings in different alphabets, for instance in letter-to-sound mappings (Riley & Ljolje, 1996) and in speech recognition (Jelinek et al., 1975).

In most applications, the edit distance model is derived by heuristic means, possibly including some data-dependent tuning of parameters. For example, Monge and Elkan (1997) recognize duplicate corrupted records using an edit distance with tunable edit and gap costs.
Hernandez and Stolfo (1995) merge records in large databases using rules based on domain-specific edit distances for duplicate detection. Cohen (2000) uses a token-based TF-IDF string similarity score to compute ranked approximate joins on tables derived from Web pages. Koh et al. (2004) use association rule mining to check for duplicate records with per-field exact, Levenshtein, or BLAST 2 gapped alignment (Altschul et al., 1997) matching. Cohen et al. (2003) survey edit and common-substring similarity metrics for name and record matching, and their application in various duplicate detection tasks.

In bioinformatics, sequence alignment with edit costs based on evolutionary or biochemical estimates is common (Durbin et al., 1998). Position-independent costs are normally used for general sequence similarity search, but position-dependent costs are often used when searching for specific sequence motifs.

In basic edit distance, the cost of individual edit operations is independent of the string context. However, applications often require edit costs to change depending on context. For instance, the characters in an author's first name after the first character are more likely to be deleted than the first character. Instead of specialized representations and dynamic programming algorithms, we can represent context-dependent editing with weighted finite-state transducers (Eilenberg, 1974; Mohri et al., 2000) whose states represent different types of editing contexts. The same idea has also been expressed with pair hidden Markov models for pairwise biological sequence alignment (Durbin et al., 1998).

If edit costs are identified with −log probabilities (up to normalization), edit distance models and certain weighted transducers can be interpreted as generative models for pairs of sequences. Pair HMMs are such generative models by definition. Therefore, expectation-maximization using an appropriate version of the forward-backward algorithm can be used to learn parameters that maximize the likelihood of a given training set of pairs of strings according to the generative model (Ristad & Yianilos, 1996; Ristad & Yianilos, 1998; Durbin et al., 1998). Bilenko and Mooney (2003) use EM to train the probabilities in a simple edit transducer for one of the duplicate detection measures they evaluate. Eisner (2002) gives a general algorithm for learning weights for transducers, and notes that the approach applies to transducers with transition scores given by globally normalized log-linear models. These models are to CRFs as pair HMMs are to HMMs.

The foregoing methods for training edit transducers or pair HMMs use positive examples alone, but do not need to be given explicit alignments because they do EM with alignment as a latent (structured) variable. Joachims (2003) gives a generic maximum-margin method for learning to score alignments from positive and negative examples, but the training examples must include the actual alignments. In addition, he cannot solve the problem exactly because he does not exploit factorizations of the problem that yield a polynomial number of constraints and efficient dynamic programming search over alignments.

While the basic models and algorithms are expressed in terms of single-letter edits, in practice it is convenient to use a richer, application-specific set of edit operations, for example name abbreviation. For instance, Brill and Moore (2000) use edit operations designed for spelling correction in a spelling correction model trained by EM. Tejada et al. (2001) have edit operations such as abbreviation and acronym for record linkage.

7   Conclusions

We have presented a new discriminative model for learning finite-state edit distance from positive and negative examples consisting of matching and non-matching strings. It is not necessary to provide sequence alignments during training. Experimental results show the method to outperform previous approaches.

The model is an interesting member of a family of models that use a discriminative objective function to discover latent structure. The latent edit-operation sequences that are learned by EM are indeed the alignments that help discriminate matching from non-matching strings.

We have described in some detail the finite-state version of this model. A context-free grammar version of the model could, through edit operations defined on trees, handle swaps of arbitrarily-sized substrings.

Acknowledgments

We thank Charles Sutton and Xuerui Wang for useful conversations, and Mikhail Bilenko for helpful comments on a previous draft. This work was supported in part by the Center for Intelligent Information Retrieval, the National Science Foundation under NSF grants #IIS-0326249, #IIS-0427594, and #IIS-0428193, and by the Defense Advanced Research Projects Agency, through the Department of the Interior, NBC, Acquisition Services Division, under contract #NBCHD030010.
References

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25.

Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (pp. 39–48). Washington, DC.

Brill, E., & Moore, R. C. (2000). An improved error model for noisy channel spelling correction. Proceedings of the 38th Annual Meeting of the ACL.

Cohen, W. W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18, 288–321.

Cohen, W. W., Ravikumar, P., & Fienberg, S. (2003). A comparison of string metrics for matching names and records. KDD Workshop on Data Cleaning and Object Consolidation.

Cohen, W. W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. KDD (pp. 475–480). ACM.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.

Eilenberg, S. (1974). Automata, languages and machines, vol. A. Academic Press.

Eisner, J. (2002). Parameter estimation for probabilistic finite-state transducers. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Hernandez, M. A., & Stolfo, S. J. (1995). The merge/purge problem for large databases. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD-95) (pp. 127–138). San Jose, CA.

Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84, 414–420.

Jelinek, F., Bahl, L. R., & Mercer, R. L. (1975). The design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory, 3, 250–256.

Joachims, T. (2003). Learning to align sequences: A maximum-margin approach (Technical Report). Department of Computer Science, Cornell University.

Koh, J. L. Y., Lee, M. L., Khan, A. M., Tan, P. T. J., & Brusic, V. (2004). Duplicate detection in biological data using association rule mining. Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10, 707–710.

McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Mohri, M., Pereira, F., & Riley, M. (2000). The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231, 17–32.

Monge, A. E., & Elkan, C. (1997). An efficient domain-independent algorithm for detecting approximately duplicate database records. DMKD. Tucson, Arizona.

Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

Quattoni, A., Collins, M., & Darrell, T. (2005). Conditional random fields for object recognition. In L. K. Saul, Y. Weiss and L. Bottou (Eds.), Advances in Neural Information Processing Systems 17 (pp. 1097–1104). Cambridge, MA: MIT Press.

Riley, M. D., & Ljolje, A. (1996). Automatic generation of detailed pronunciation lexicons. In C. H. Lee, F. K. Soong and K. K. Paliwal (Eds.), Automatic speech and speaker recognition: Advanced topics, chapter 12. Boston: Kluwer Academic.

Ristad, E. S., & Yianilos, P. N. (1996). Finite growth models (Technical Report TR-533-96). Department of Computer Science, Princeton University.

Ristad, E. S., & Yianilos, P. N. (1998). Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 522–532.

Sankoff, D., & Kruskal, J. (Eds.). (1999). Time warps, string edits, and macromolecules. Stanford, California: CSLI Publications. Reissue edition; originally published by Addison-Wesley, 1983.

Tejada, S., Knoblock, C. A., & Minton, S. (2001). Learning object identification rules for information integration. Information Systems, 26, 607–633.

Woodland, P. C., & Povey, D. (2002). Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, 16, 25–47.