String Similarity Joins: An Experimental Evaluation

Page created by Irene Cummings
 
CONTINUE READING
String Similarity Joins: An Experimental Evaluation

                      Yu Jiang†                      Guoliang Li†                          Jianhua Feng†          Wen-Syan Li‡
                                †
                          Department of Computer Science, Tsinghua University, Beijing, China
                                              ‡
                                                SAP Lab, Shanghai, China
               y-jiang12@mails.tsinghua.edu.cn; {liguoliang, fengjh}@tsinghua.edu.cn; wen-syan.li@sap.com

ABSTRACT                                                                              (2) Verification step: verifying each candidate pair by com-
String similarity join is an important operation in data in-                          puting the real similarity and outputting the final results.
tegration and cleansing that finds similar string pairs from                          Filtering algorithms in the first step play an important role
two collections of strings. More than ten algorithms have                             in the framework. Most of existing filtering algorithms em-
been proposed to address this problem in the recent two                               ploy a signature-based technique, which generates signatures
decades. However, existing algorithms have not been thor-                             for each string such that if two strings are similar, their sig-
oughly compared under the same experimental framework.                                natures must have overlaps. Thus the signature-based tech-
For example, some algorithms are tested only on specific                              nique can prune string pairs that have no common signature.
datasets. This makes it rather difficult for practitioners to                            Recently many filtering techniques have been proposed,
decide which algorithms should be used for various scenarios.                         e.g., count filtering [8,13,18], length filtering [8,14], posi-
To address this problem, in this paper we provide a compre-                           tion filtering [25,27], prefix filtering [4] and content filter-
hensive survey on a wide spectrum of existing string simi-                            ing [25]. As prefix filtering is the most effective filtering tech-
larity join algorithms, classify them into different categories                       nique, many algorithms have been proposed to optimize pre-
based on their main techniques, and compare them through                              fix filtering for different similarity metrics, e.g., AllPair [2],
extensive experiments on a variety of real-world datasets                             PPJoin [27], EDJoin [25], QChunk [17], VChunk [24], AdaptJoin [23].
with different characteristics. We also report comprehen-                             There are also many other signature schemes, e.g., PartEnum [1],
sive findings obtained from the experiments and provide new                           PassJoin [14], FastSS [20]. In addition, there are some algo-
insights about the strengths and weaknesses of existing simi-                         rithms which directly compute join results using tree-based
larity join algorithms which can guide practitioners to select                        index structures, e.g., TrieJoin [22].
appropriate algorithms for various scenarios.                                            However these algorithms have not been thoroughly com-
                                                                                      pared under the same experimental framework. For ex-
                                                                                      ample, some algorithms [20,22] are tested only on specific
                                                                                      datasets, e.g., datasets with short strings. Some algorithms [14,
1.    INTRODUCTION                                                                    17] are effective for some specific similarity metrics while ex-
   Given two collections of strings, e.g., products and movie                         pensive for other similarity metrics. Some fast algorithms,
names, the string similarity join problem is to find all similar                      e.g., PassJoin, FastSS, and QChunk, are never compared.
string pairs from the two collections. The similarity between                         This makes it rather difficult for practitioners to decide
two strings can be quantified by similarity metrics (see Sec-                         which algorithms should be used for various scenarios.
tion 2). String similarity join can play an important role in                            To address this problem, in this paper we thoroughly com-
many real-world applications, e.g., data cleansing and inte-                          pare existing similarity join algorithms on the same exper-
gration, and duplicate detection. The brute-force algorithm                           imental framework. We make the following contributions.
that enumerates every string pair and checks whether the                              (1) We provide a comprehensive survey on a wide spectrum
two strings in the pair are similar is rather expensive. To                           of existing string similarity join algorithms and classify them
alleviate this problem, many algorithms have been proposed                            into different categories based on their main techniques. Fig-
in the recent two decades [1–5,8,13,14,17–20,22–25,27,28].                            ure 1 summarizes these algorithms. (2) We compare exist-
   One widely-adopted technique employs a filter-verification                         ing algorithms through extensive experiments on a variety of
framework, which includes two steps: (1) Filter step: devis-                          real-world datasets with different characteristics. (3) We re-
ing effective filtering algorithms to prune large numbers of                          port comprehensive findings obtained from the experiments
dissimilar pairs and generating a set of candidate pairs; and                         and provide new insights about the strengths and weak-
                                                                                      nesses of existing algorithms which can guide practitioners
                                                                                      to select appropriate algorithms for various scenarios.
This work is licensed under the Creative Commons Attribution-
NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li-               2.    PRELIMINARIES
cense, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain per-              Given two collections of strings, the string similarity join
mission prior to any use beyond those covered by the license. Contact                 problem is to find all the similar pairs from the two collec-
copyright holder by emailing info@vldb.org. Articles from this volume                 tions. The criterion used to judge whether two strings are
were invited to present their results at the 40th International Conference on         similar is called a similarity metric. We can broadly classify
Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China.
Proceedings of the VLDB Endowment, Vol. 7, No. 8                                      existing similarity metrics into two categories: character-
Copyright 2014 VLDB Endowment 2150-8097/14/04.                                        based similarity metrics and token-based similarity metrics.

                                                                                625
Signature-based Algorithms              Tree-based Algorithms                    Table 1: An Example String Set.
                                                                    M-Tree
                                                                                                   ID String
Count Filtering                     Prefix Filtering
                                                                    TrieJoin
                                                                                                   s1 database concepts system
 GramCount            ;+*($.      SSJoin                             ed                            s2 database large-scale techniques
                                                                    B -tree
 ListMerger
Algorithm 1: Signature-based Framework (Self-Join)                                  where γ = 0 if the i-th character of r is equal to the j-th
                                                                                      character of s; γ = 1 otherwise. Obviously, the complexity
   Input: String set S
                                                                                      of this algorithm is O(|r||s|).
   Output: The similarity join result set A
                                                                                         To avoid unnecessary computations, given an edit-distance
1 A ← Φ;
                                                                                      threshold τ , only values M [i][j] for |i−j| ≤ τ are required to
2 Initialize an empty index I; // a map from a
                                                                                      compute since M [i][j] > τ for |i − j| > τ . Thus in each row
      signature to a list of string IDs
                                                                                      (or column), only 2τ + 1 values are required to compute and
3 for each s ∈ S do
                                                                                      the complexity can be improved to O((2τ +1)·min(|r|, |s|)) =
4     Generate the signature set Ss of string s;
                                                                                      O(τ · min(|r|, |s|)). Moreover, the early termination tech-
5     for each sig ∈ Ss do
                                                                                      niques [14] can be used to further improve the performance.
6         for each c ∈ I[sig] do
7             verify(c, s);
8             if c is similar to s then A = A ∪ {(c, s)};                             4.    SIGNATURE-BASED ALGORITHMS
9           append s to I[sig];                                                          In this section, we discuss signature-based filtering algo-
                                                                                      rithms. We first introduce the count filtering (Section 4.1)
10   return A;                                                                        and length filtering (Section 4.2). Then we discuss the prefix
                                                                                      filtering (Section 4.3). Finally we introduce other signature-
                                                                                      based algorithms (Section 4.4).
Then for each string s, it generates s’s signature set Ss                             4.1    Count Filtering
(line 4). For each signature sig ∈ Ss , it retrieves the in-                             Count filtering is proposed in GramCount [8]. The basic
verted list I[sig]. Each string c in the list is a candidate of                       idea is that if two strings are similar, their signatures must
s and it verifies candidate pair (c, s) (line 7). If c and s are                      share at least T common signatures. In other words, if the
similar, it adds the pair into the result set (line 8). Notice                        number of shared signatures between two strings is smaller
that it requires to append s into the list I[sig] (line 9).                           than T , the string pair can be pruned. GramCount utilizes
R-S Join. The above techniques can be easily extended to                              this property to support edit distance. ListMerger extends
support the case of R 6= S. It first builds the inverted index                        GramCount to support token-based metrics in [13,18].
for strings in one set, e.g., R, and then scans strings in the                        Computing T . GramCount generates q-grams for each string
other set S and uses the index on R to compute the results.                           and takes q-grams as signatures. Suppose edit-distance thresh-
   For example, consider the strings in Table 1. Suppose we                           old is τ . As one edit operation destroys at most q grams and
use Jaccard and the threshold δ is 0.4. According to the                              string s has |s| + 1 − q grams, if strings r and s are similar,
definition of Jaccard, if the similarity between two strings                          their signature sets share at least T = max(|r|, |s|)+1−q−qτ
is larger than 0, they should share at least one common to-                           signatures. ListMerger takes each token as a signature.
ken. Thus we can take the token set as the signature set. For                         Suppose the overlap threshold is δ. Two strings are simi-
example, the signature set of s1 is {“database”, “concepts”,                          lar w.r.t. the overlap similarity, they must share at least δ
“system”}. Using this signature scheme, it generates the fol-                         common signatures, and T = δ. Other token-based metrics
lowing candidate pairs: (s2 , s1 ), (s4 , s2 ), (s4 , s3 ), (s5 , s1 ), (s5 , s2 ).   can be converted to overlap. We take Jaccard as an ex-
Obviously, the number of candidate pairs is reduced from 10                                       |r∩s|
                                                                                      ample. If |r∪s|          |r∩s|
                                                                                                        = |r|+|s|−|r∩s| ≥ δ, it is easy to deduce that
to 5. In the verification step, it verifies each candidate pair
                                                                                      |r ∩ s| ≥ dδ|s|e and T = dδ|s|e. Similar techniques can be
by computing their real similarity. For example, we will
                                                                                      applied to Cosine and Dice. Thus if string r is similar to
eliminate pair (s2 , s1 ) since their similarity (0.2) is smaller
                                                                                      s, r shares at least T signatures with s, where
than the threshold (0.4). Similarly, we can prune (s4 , s2 )
and (s5 , s2 ). Here we get two results: (s4 , s3 ) and (s5 , s1 ).                                     
                                                                                                            |s| + 1 − q − qτ          ED
                                                                                                        
                                                                                                        
                                                                                                         δ
                                                                                                                                    Olp
3.2      Verification Step                                                                              
   The verification step aims to verify each candidate pair                                         T =     dδ|s|e                    Jac          (1)
                                                                                                               2
by computing the real similarity.
                                                                                                        
                                                                                                        
                                                                                                           dδ   |s|e                 Cos
                                                                                                         d δ |s|e
                                                                                                        
                                                                                                                                     Dice
Token-based Metrics. The token-based similarity metrics                                                       2−δ

rely on the overlap of two string sets and string set sizes to
compute the similarity. It is easy to get the string set sizes,                       Pruning with T . ListMerger builds inverted indexes for
and we discuss how to compute the overlap. It first sorts                             signatures. For each string s, ListMerger generates its sig-
the tokens in the set (e.g., by alphabetical order) and then                          natures and retrieves inverted lists of its signatures. For
uses a merge-based algorithm to compute the overlap.                                  each string r on these lists, ListMerger computes its count
                                                                                      number, which is the number of lists that contain r. Obvi-
Character-based Metrics. The dynamic programming
                                                                                      ously if the count number of string r is small than T , it can
algorithm is used to compute the edit distance. Given two
                                                                                      prune the pair (r, s). This is called the T-occurrence prob-
strings r and s, a matrix M with |r|+1 rows and |s|+1
                                                                                      lem. Five efficient algorithms have been proposed to solve
columns is used to compute their edit distance, where |r|(|s|)
                                                                                      this problem by merging inverted lists [13,18] and the two
is the length of r(s). M [i][j] is the edit distance between the
                                                                                      heap-based algorithms MergeOpt [18] and DivideSkip [13]
prefix of r with length i and the prefix of s with length j.
                                                                                      achieved better performance.
Initially M [i][0] = i and M [0][j] = j for 0 ≤ i ≤ |r| and
0 ≤ j ≤ |s|. Then each value of the matrix can be computed
using the following equation,
                                                                                      4.2    Length Filtering
                                                                                        Length filtering is proposed in GramCount [8]. The basic
M [i][j] = min(M [i−1][j]+1, M [i][j−1]+1, M [i−1][j−1]+γ),                           idea is that if two strings are similar, their length difference

                                                                               627
cannot be large. Formally, if a string r is similar to s, then            (1) Position Filtering. EDJoin removes unnecessary signa-
                                                                          tures from the prefix to further reduce the prefix length. For
               ED :     |s| − τ ≤ |r| ≤ |s| + τ
                                                                          example, consider string “sigmod” with q = 2 and τ = 1.
                                        |s|                               Suppose the signatures are “si”, “gm” and “od” in the prefix
               Jac :    δ|s| ≤ |r| ≤
                                          δ                               filtering. EDJoin proves that it can remove the last signature
                         2                |s|                 (2)         “od” safely since it requires at least τ + 1 edit operations to
               Cos : δ |s| ≤ |r| ≤ 2
                                           δ                              destroy both “si” and “gm”. EDJoin devises effective algo-
                           δ                  2−δ                         rithms to detect and remove unnecessary signatures.
              Dice :           |s| ≤ |r| ≤        |s|
                        2−δ                    δ                          (2) Content Filtering. The position filtering prunes dissimi-
  To utilize the length filtering, one can partition the strings          lar pairs using grams’ positions. It is effective for mismatch
into different groups Gl1 , Gl2 , · · · , Gln and strings in each         grams scattered in different positions in the string but in-
group have the same length. The string pairs between two                  effective for consecutive grams. To alleviate this problem,
groups Gli and Glj can be pruned, if the string lengths of                EDJoin proposes the content filtering to do further pruning.
the two groups do not satisfy Equation 2.                                 For example, consider the following two strings: sigmod and
                                                                          sigkdd. Consider the underlined substrings. Suppose the
4.3     Prefix Filtering                                                  edit-distance threshold is τ = 1. As the two substrings is
4.3.1    Prefix Filtering                                                 rather different, EDJoin can prune this pair. To achieve this
   The prefix filtering is proposed in SSJoin [4] which fo-               goal, EDJoin uses the L1 distance as a bound of edit dis-
cuses on enabling similarity joins inside DBMS. AllPair [2]               tance to do pruning, where L1 distance between two strings
extends the idea and proposes a prefix filtering based frame-             is the total number of different characters. EDJoin proves
work. We first use the Overlap similarity to illustrate how               that the edit distance cannot be smaller than half of the L1
prefix filtering works and then apply it to other metrics.                distance. For example, the L1 distance of the two substrings
Overlap. Assume we have a global ordering on all tokens                   is 4 so the edit distance between the two strings is at least
in the string sets, e.g., the alphabetical order or inverse doc-          2. Therefore, this pair can be pruned safely.
ument frequency (idf) order. For each string, the prefix                  QChunk [17]. The QChunk algorithm contains two types of sig-
filtering sorts its tokens based on the given token ordering.             natures: q-grams and q-chunks. The q-chunks are q-grams
Given the overlap threshold δ, for each string s, the prefix              with starting positions at i ∗ q + 1 for 0 ≤ i ≤ l−1
                                                                                                                             q
                                                                                                                                where l is
filtering selects the first |s| − δ + 1 tokens as the signatures,         the string length. To guarantee the last q-chunk has exactly
denoted by sp . It is easy to prove that if two strings r and s           q characters, it appends several special characters, e.g., $.
are similar, then sp ∩ rp 6= φ. For example, consider the first           For example, the q-chunks of “vldbj” are {“vl”, “db”, “j$”}.
two strings “database concepts system” and “database large-               Let gq denote the q-gram set and cq denote the q-chunk set.
scale techniques” in Table 1. Suppose the overlap threshold               Given two strings r and s, QChunk proves if they are similar,
is 2. The prefix size is 3-2+1=2. Assume the tokens are                          (1) |gq (r) ∩ cq (s)| ≥ d |s|
                                                                                                            q
                                                                                                               e − τ , and
sorted by idf in the descending order. The prefix filtering se-                 (2 )|cq (r) ∩ gq (s)| ≥ d |r| e − τ.
                                                                                                           q
lects “concepts system” and “large-scale techniques” as their
                                                                             To use the two properties, QChunk devises two strategies
prefixes. As their prefixes have no overlap, the pair can be
                                                                          to prune dissimilar pairs. The first one is to index q-grams of
pruned. Existing methods use idf in descending order and
                                                                          r and use q-chunks of s to generate candidates. The second
take tokens with large idf (infrequent tokens) as signatures.
                                                                          one is to index q-chunks of r and use q-grams of s to generate
Edit Distance. The gram-based method [8] is used to                       candidates. Both of the two strategies can be applied into
support the edit distance. It generates the q-gram set for                the prefix filtering framework. Moreover, QChunk devises
each string, sorts the grams based on a global ordering, and              effective techniques to decrease the number of signatures to
selects the first qτ + 1 grams as its signatures. It is easy to           the lower bound τ + 1 which are similar to those in EDJoin.
prove that if r and s are similar, their prefixes must have
common signatures. This is because each edit operation                     4.3.3 Optimizations on Jaccard, Cosine, Dice
can only destroy at most q q-grams, and if more than qτ + 1               PPJoin [27]. The PPJoin algorithm proposes two optimiza-
grams are destroyed, it requires at least τ +1 edit operations.           tion techniques for Jaccard, Cosine, and Dice.
For example, consider strings “sigmod” and “sigir” with                   (1) Position Filtering. It is similar to the first optimization
edit-distance threshold τ = 1. It sorts the q-grams by idf                of EDJoin, by using positions of signatures to prune dissimi-
in descending order and generates their prefixes (underlined              lar pairs. For example, consider the following two sets sorted
grams): {gm, mo, od, ig, si} and {gi, ir, ig, si}. Since they             by the alphabetical order: {B, C, D, E, F }, {A, B, C, D, F }.
do not share any common signature, this pair will be pruned.              Suppose the Jaccard threshold is 0.8. The prefix filter-
   Based on Equation 1, for string s, the prefix filtering sorts          ing selects the first 5-0.8*5+1=2 tokens as the signatures
its tokens and selects its first p tokens as signatures, where            (underlined). As the two prefixes have overlaps, the prefix
                 
                 
                   qτ + 1                       ED                       filtering cannot prune the pair. However, from two pre-
                  |s| − δ + 1                                            fixes, one can easily estimate the lower bound of the union
                 
                                               Olp
            p=      b(1 − δ)|s|c + 1             Jac          (3)         size (3+3=6) and the upper bound of the intersection size
                             2
                  b(1 − δ )|s|c + 1
                 
                                               Cos                       (1+3=4). Thus the upper bound of the Jaccard similar-
                  b(1 − δ )|s|c + 1                                      ity can be computed ( 46 ). Since the upper bound is smaller
                 
                            2−δ
                                               Dice
                                                                          than the threshold, the position filtering can prune this pair.
 4.3.2 Optimizations on Edit Distance                                     Similar ideas can be applied to Cosine and Dice.
EDJoin [25]. The EDJoin algorithm proposes two optimiza-                  (2) Suffix Filtering. It probes tokens in the suffix to estimate
tion techniques to improve prefix filtering for edit distance.            a tighter upper bound. For example, consider the following

                                                                    628
two sets (“?” stands for unknown signatures):                           distance and then extend the idea to other metrics. The
                                                                        basic idea is based on the pigeon-hole principle. Given an
         {A, B, D, E, ?, ?, ?, ?, ?, ?, Q, ?, ?, ?, ?, ?, ?, ?}         edit-distance threshold τ , for each string s, PassJoin splits
         {A, C, D, E, ?, ?, ?, ?, Q, ?, ?, ?, ?, ?, ?, ?, ?, ?}         the string into τ + 1 segments and if another string r is
                                                                        similar to the string s, r must contain a substring which
This pair cannot be pruned by the first optimization tech-              is equal to one of the segments of string s. For example,
                                     17
nique since the upper bound is 19       > 0.8. It chooses the           consider the following two strings with threshold τ = 2:
middle token in the suffix of the first set and uses it to do           {si|gm|od, pvldb}. As the second string does not contain
a binary search in the second set. By using the searching               any substring that matches a segment of the first one, their
result position, it can estimate a tighter upper bound of the           edit distance must be larger than 2.
Jaccard similarity. In the example, the upper bound of                     Intuitively, one should enumerate all substrings to check
the intersection size is 3 + 4 + 1 + 7 = 15, the lower bound of         the condition, and this process is obviously rather expensive.
the union size is 5 + 6 + 1 + 9 = 21, and the upper bound of            To alleviate this problem, PassJoin proposes the multi-match-
                              15
the Jaccard similarity is 21     < 0.8. Then this pair can be           aware selection technique to select the minimum number of
pruned by the suffix filtering. PPJoin iteratively estimates            substrings. The number of selected substrings can be re-
much tighter upper bounds [27].                                                       2   2
                                                                        duced to b τ −∆ 2
                                                                                            c + τ + 1, where ∆ = ||r| − |s|| is the length
4.3.4     The AdaptJoin Algorithm                                       difference between two strings. For example, suppose τ = 3
                                                                        and ∆ = 1. PassJoin selects only 8 substrings.
   The AdaptJoin algorithm aims to improve the prefix fil-                 For Jaccard, Cosine and Dice, PassJoin can convert
tering fundamentally for all similarity metrics [23]. All the           them to Edit Distance. Take Jaccard as an example.
aforementioned algorithms select fixed-length prefix for strings        First, PassJoin sorts all the tokens in the set based on a
with same length (different content), and they take the string          global ordering. Next it converts the Jaccard threshold to
pairs which have common signatures as candidate pairs.                  Overlap and then to Edit Distance. Notice that how to
However, it is worth noting that if l more tokens in the set            convert Jaccard to Overlap has been discussed in Sec-
are selected into the prefix, called l-prefix scheme, two sim-          tion 4.1. Here we discuss how to convert Overlap to Edit
ilar strings should share at least l + 1 signatures. Thus the           Distance. Suppose the Overlap threshold is τ . Then the
filtering power is enhanced while it involves more filtering            corresponding edit-distance threshold is at most l1 + l2 − 2τ
cost. Thus there is a tradeoff and the AdaptJoin algorithm              where l1 and l2 are the numbers of tokens in the two sets
proposes an adaptive framework to choose the best prefix                respectively. This is because one string can be transformed
scheme to achieve high performance.                                     to another string by l1 − τ deletions and l2 − τ insertions.
4.4     Other Signatures                                                Since l2 is unknown when building the index, one can use
                                                                        the maximum possible length b lδ1 c (see Equation 2) to re-
4.4.1     The VChunk Algorithm                                          place l2 . Thus given a string with length l, the Jaccard
  The VChunk algorithm [24] is designed for the edit dis-               threshold δ can be converted to the edit-distance thresh-
tance. The signatures of VChunk are variable-length chunks,             old τ = b 1−δδ
                                                                                        lc. For Cosine and Dice, the thresholds are
                                                                                                  2
called “vchunk”. The key part of the signature generation is            respectively τ = b 1−δ δ2
                                                                                                    lc and τ = b 2(1−δ)
                                                                                                                    δ
                                                                                                                        lc.
the chunk boundary dictionary (CBD) which is a set of rules
used to split the strings into chunks. For example, “vl” can
be a rule. For each string, if it contains a substring “vl”,
                                                                        4.4.3      The PartEnum Algorithm
VChunk cuts the string just after the substring. For example,              The PartEnum algorithm [1] is designed for hamming dis-
the string “pvldb” will be split into two chunks: “pvl” and             tance which is the minimum number of substitutions re-
“db” by this rule. VChunk proposes the tail-restricted CBD              quired to change one string to the other. The signature of
which is a subset of all the CBDs and proves that if using a            the PartEnum algorithm is a combination of “partition” and
tail-restricted CBD to split strings, any edit operation will           “enumeration”. There are two parameters in the PartEnum
destroy at most 2 chunks. Thus the gram-based technique                 algorithm: N1 and N2. Given a hamming distance thresh-
can be applied to the chunks split by CBDs. Next we define              old τ , in the first level, it partitions the string into N1 parts.
the tail-restricted CBD. It divides the character alphabet Σ            According to the pigeon-hole principle, there should exist at
in the strings into two disjoint subsets: the prefix character          least one partition and the corresponding partitions of the
                                                                                                                                       τ
set P and the suffix character set Q (i.e., P ∪ Q = Σ and               two strings have hamming distance no larger than τ2 = b N1       c.
P ∩Q = φ). All the rules in the CBD can be described by reg-            In the second level, it partitions each part into N2 parti-
ular expression [P ]*[Q] which denotes strings with arbitrary           tions and generates the enumeration signatures: all possible
numbers of characters from P and a single character from                combinations     of selecting N2 − τ2 parts from N2 parts (i.e.,
                                                                           N2
                                                                                
Q. For example, {“vl”, “v”} is an invalid tail-restricted CBD            N2−τ2
                                                                                  combinations).     Obviously if two strings are similar,
and {“vl”, “d”} is a valid tail-restricted CBD. If this CBD is          they must share a common signature.
used to split string “pvldb”, it will get three chunks {“pvl”,             Next we introduce how to convert the Jaccard thresh-
“d”, “b”}. VChunk adopts similar techniques of EDJoin to do             old δ to the hamming distance threshold. Suppose all strings
the similarity join using the CBD. The most important part              are with the same length l. If the Jaccard similarity of two
of this algorithm is to select a proper method to generate              strings is not smaller than δ, then their Overlap should be
                                                                                              2δl
the CBD. More details are referred to [24].                             not smaller than 1+δ      . Thus the hamming distance between
                                                                                                                        2δl
                                                                        their vectors should be no larger than 2l − 2 ∗ 1+δ = 2(1−δ)l
                                                                                                                                1+δ
                                                                                                                                      .
4.4.2    The PassJoin Algorithm                                         Similarly, both thresholds of Cosine and Dice can be con-
                                                                                     2l
   The PassJoin algorithm [14] employs a partition-based                verted to 1−δ   . To deal with strings with different lengths,
framework. We will first introduce how to support the edit              it splits all the strings into groups based on string lengths,

                                                                  629
applies the techniques to strings in the same group, and uses             tree structure, called M-Tree, where each leaf node contains
the length filtering to do pruning between different groups.              a group of similar strings and each internal node is used for
Finally, we discuss how to convert the ED threshold τ to the              pruning. An internal node N in an M-Tree stores a land-
hamming distance threshold. It also uses the q-gram based                 mark string Nm and a radius Nr , which denotes that all the
method. Since an edit operation can at most destroy q q-                  strings under the node have distance to Nm no larger than
grams, the ED threshold τ can be converted to the hamming                 Nr and vice versa. Based on this property, for each string s,
distance threshold 2qτ .                                                  it traverses the M-Tree from the root node. Consider a node
                                                                          N . If d(Nm , s) > Nr +τ , the node can be pruned; otherwise,
4.4.4     The FastSS Algorithm                                            it accesses each of its children. If the node is a leaf, it visits
   The FastSS algorithm [20] is designed only for the edit                each string in the node and computes its similarity to the
distance, which uses a neighborhood-based method. The                     string. It is important to construct an effective M-Tree with
basic idea is that if two strings are similar, their neigh-               high pruning power and more details are referred to [5].
bors must have overlap. Thus it takes neighbors as sig-
natures. Formally, suppose the edit-distance threshold is                 5.3      Other Related Works
τ . Given a string s, let Di (s) denote the set of s’ sub-                Similarity Joins inside DBMS. Gravano et. al. [8] stud-
                                                                          ied how to enable similarity joins inside DBMS using database
strings by deleting i characters. Obviously, |Di (s)| = |s|
                                                             
                                                           i
                                                              .
                                                                          functionalities. SSJoin [4] proposed prefix filtering and in-
For example, D2 (“pvldb”) = {ldb, vdb, vlb, vld, pdb,
                                                                          corporated the techniques into DBMS.
plb, pld, pub, pvd, pvl}. Let D̄τ (s) = ∪τi=0 Di (s). It
can be proved that if two strings r and s are similar within              Parallel Similarity Joins. Vernica et. al. [21] studied
threshold τ , then D̄τ (r) ∩ D̄τ (s) 6= φ. Based on this prop-            parallel similarity joins using MapReduce by extending pre-
erty, for each string s, FastSS takes the substrings (called              fix filtering based techniques. Metwally et. al. [16] studied
neighborhoods) in D̄τ (s) as the signatures of s. Then it can             the parallel set similarity join problem using count filtering.
use the filter-verification framework to compute join results.            Top-k Similarity Joins. Given two sets of strings, the
FastSS also devises effective verification algorithms to verify           top-k similarity join problem is to find top-k similar pairs
the candidate pairs based on the deleted characters. Obvi-                with the largest similarity. Xiao et. al. [26] extended prefix
ously FastSS is ineffective for long strings since it involves            filleting to support top-k similarity join. Zhang et. al. [28]
larger numbers of signatures.                                             extended B-tree to support top-k similarity search problem.
                                                                          Lee et. al. [11,12] studied the similarity join size estimation
5. OTHER ALGORITHMS                                                       problem which can also be utilized to compute top-k similar
5.1 The TrieJoin Algorithm                                                pairs. Kim et. al. [10] studied parallel top-k string similarity
                                                                          join problem using MapReduce.
   The TrieJoin algorithm [22] uses a trie structure to cal-
                                                                              Jestes et. al. [9] and Lian et. al. [15] studied the problem
culate the similarity join result directly. Each trie node is
                                                                          of similarity joins on probabilistic data. Satuluri et. al. [19]
associated with a character and the path from the root to a
                                                                          studied the problem of approximate similarity joins (which
leaf node corresponds to a string. Two strings with a com-
                                                                          may miss results) and proposed a Bayesian algorithm by
mon prefix will share a common ancestor.
                                                                          extending traditional locality-sensitive hashing.
   TrieJoin relies on an important concept “active nodes”. A
node is called an active node for string s if the edit distance           Similarity Search. Similarity search has been extensively
between s and the string w.r.t. the node is not larger than a             studied [6,7,13,18], which finds the similar strings of the
given threshold τ . For self join, TrieJoin first builds a single         query string from a set of data strings. Similarity joins based
trie structure and computes all the active nodes of the leaf              techniques can be extended to support similarity search [17,
nodes. Then given a leaf node, the strings corresponding                  23]. Disk-based similarity search was studied in [3,28]. Dif-
to active nodes must be similar to the string corresponding               ferent from similarity joins, the indexing time can be ex-
to the leaf node. Thus TrieJoin can easily obtain the join                cluded in similarity search, and thus in similarity joins we
result based on active nodes of leaf nodes. TrieJoin devises              need to consider the indexing cost which cannot be expensive
three efficient algorithms to compute the active nodes. The               but in similarity search we can build sophisticated indexes.
main idea is to traverse the trie index in pre-order and use                  Recently EDBT/ICDT 2013 organized a competition on
the active nodes of the parent node to compute those of the               string similarity search and join1 . The competition used two
child nodes. In other words, TrieJoin can share the com-                  datasets: the city-name dataset with short strings and the
putations among sibling nodes when computing their active                 Genome dataset with long strings. It allowed participants to
nodes. TrieJoin also proposes a partition-based method to                 use parallel algorithms. For similarity search, PassJoin won
improve the algorithm. It partitions strings into two parts,              the champion on the city-name dataset and the partition-
and thus can decrease the threshold to τ2 . Then it builds                based method with sophisticated graph indexes won the
two tries for the two parts and utilizes the two tries to find            champion on the Genome dataset. For similarity join, PassJoin
join results. More technical details can be found in [22].                won the champion on both of the two datasets. Our work
                                                                          had the following differences from the competition. First,
5.2    The M-Tree Algorithm                                               the competition only considered the edit-distance functions
  The M-Tree algorithm [5] is devised to support metric                   and our work discussed both edit-distance functions and
space similarity functions, which satisfy: (1) d(x, y) ≥ 0;               token-based functions. Second, in the competition, different
(2) d(x, y) = 0 ⇔ x = y; (3) d(x, y) = d(y, x); and (4) d(x, z) ≤         authors implemented different algorithms while we imple-
d(x, y) + d(y, z), where d is a similarity function. We can               mented all the algorithms by ourselves. Third, we evaluated
see that M-Tree can be used for Edit Distance since it is                 on more datasets, varied the dataset distributions, and com-
a metric. The basic idea of M-Tree is to use the triangle                 pared disk-based algorithms.
                                                                          1
inequality to prune dissimilar pairs. To this end, it builds a                www2.informatik.hu-berlin.de/∼wandelt/searchjoincompetition2013/

                                                                    630
τ=1                                                                   τ=1                       1000        IndexGram                                     300                     Gram
                                           τ=2                                1000                               τ=2                                  IndexChunk                                                        IndexGram
          1000                             τ=3                                                                   τ=3                                                                                 250               IndexChunk
Time(s)

                                                                   Time(s)

                                                                                                                                 Time(s)

                                                                                                                                                                                       Time(s)
                                           τ=4                                                                   τ=4                                                                                 200
                                                                                                                                                                                                     150
                                                                                 100
          100                                                                                                                                                                                        100
                                                                                                                                           100
                                                                                                                                                                                                     50
                 3   4             5       6       7       8                           2   3   4             5     6    7   8                     2   3    4           5          6                        2       3          4     5   6
                                       q                                                                     q                                             q                                                                  q

             (a) AllPair, DBLP                                                         (b) EDJoin, DBLP                                    (c) QChunk, τ = 2, DBLP                               (d) AdaptJoin, τ = 2, DBLP

                                                                                                                                                                      160
                                                                             τ=1                                                τ=1                                             N1=2
                                                                             τ=2                                                τ=2                                   140       N1=3
                                                                             τ=3                                                τ=3                                   120       N1=4
                         Time(s)

                                                                                                   Time(s)

                                                                                                                                                            Time(s)
                                       1000                                  τ=4                                 100            τ=4
                                                                                                                                                                      100
                                                                                                                                                                      80
                                                                                                                                                                      60
                                       100
                                                                                                                  10                                                  40
                                               4       5       6             7         8   9                           1 2 3 4 5 6 7 8 9 10                                 3     4              5     6       7   8      9
                                                                      q                                                        scale                                                                  N2

                                           (e) ListMerger, DBLP           (f) VChunk, DBLP     (g) PartEnum, τ = 2, DBLP-50K
                                                        Figure 2: Evaluation on Parameter Selection (ED).
6.         EXPERIMENTS                                                           Table 3: Datasets for ED.
   We conducted extensive experiments to compare existing         Dataset Cardinality Avg Len Max Len            Size    Type
                                                                     Word       122,823        8.7        29 1.2M       Short
string similarity join algorithms. Our experimental goal was
                                                                Querylog        500,000      18.9        500 9.7M       Short
to evaluate running time, candidate size, memory usage and           DBLP     1,000,000      70.3        766    69M Middle
scalability of different algorithms. The running time was          Genome       250,000       100        100    25M      Long
the total time to compute the join results, including sorting
tokens, generating signatures, indexing signatures, filtering          Table 4: Datasets for Jac, Cos and Dice.
                                                                Dataset Cardinality Avg Tok Max Tok             Size    Type
and verification (if necessary). The candidate size was the         Trec      347,949         75        273 103M       Small
number of candidates generated in the filtering step. The         Enron       245,567       135        3162 129M Middle
memory usage was the peak memory used during the algo-              Wiki    4,000,000       213       36907    3.2G    Large
rithm execution. We compared the following algorithms.
                                                                 in data integration applications.
Character-based Metrics. ListMerger [13,18], AllPair [2],
                                                                    For Jaccard, Cosine and Dice, we used the following
EDJoin [25], QChunk [17], AdaptJoin [23], VChunk [24], PassJoin
                                                                 datasets. We used white spaces to tokenize each string.
[14], PartEnum [1], FastSS [20], TrieJoin [22], BiTrieJoin
                                                                 Small Token Number: Trec [2,17,24,25,27] was a set of doc-
[22], M-Tree [5]. BiTrieJoin was the improved version of
                                                                 uments from the well-known benchmark in information re-
TrieJoin by using two tries as discussed in Section 5.1. We
                                                                 trieval5 and the average token number was 75. It can be
implemented the best algorithm DivideSkip [13] for ListMerger.
                                                                 used in data cleansing applications.
Token-based Metrics. ListMerger [13,18], AllPair [2], PPJoin     Middle Token Number: Enron [2,23,27] included a set of
[27], PPJoin+ [27], AdaptJoin [23], PassJoin [14], PartEnum      emails with titles and bodies6 and the average token number
[1]. As discussed in Section 4.3.3, we used PPJoin to de-        was 135. It can be used in data integration applications.
note the algorithm which only contained the first optimiza-      Large Token Number: Wiki [19] included a set of English
tion (position filtering) and PPJoin+ to denote the algorithm    wikipedia webpages7 and the average token number was 213.
which used two optimizations (position & suffix filtering).      It can be used in document clustering applications.
   All the algorithms were implemented by ourselves using
                                                                 Global Ordering. Algorithms for token-based metrics re-
C++, compiled by GCC 4.8.2 with -O3 flag. All the exper-
                                                                 quired the tokens to be sorted in a global ordering. Some
iments were conducted on a Ubuntu server with two Intel
                                                                 algorithms for Edit Distance needed to sort grams using
Xeon E5420 CPUs (8 cores, 2.5GHz) and 32GB memory.
                                                                 a global ordering. We used the best order for all algorithms.
Datasets. We used a variety of datasets with different char-     For PartEnum and PassJoin, the best order is random order.
acteristics which were also widely used in previous works.       For other algorithms, the best order is idf order.
Tables 3-4 showed the details of the datasets. For Edit
Distance we used the following datasets.                         6.1 Experiments on Edit Distance
Short Strings: Word [22] was a set of real English words and     Parameter Selection. Many existing algorithms required
its average string length was 8.7. Querylog [14,22,23] was       to tune parameters to achieve the best performance. The q-
a set of query log strings2 and the average length was 18.9.     gram based algorithms, e.g., ListMerger, AllPair, EDJoin,
These datasets can be used in data cleansing applications.       QChunk and AdaptJoin, required to tune parameter q. VChunk
Middle-Length Strings: DBLP [1,2,14,17,23–25,27] included        required to tune parameter scale. PartEnum required to tune
publication titles3 and the average string length was 70.3.      parameters N1 and N2. Figure 2 shows the results. As
It can be used in duplicate detection applications.              PartEnum was slow, we ran it on a small dataset DBLP-50K
Long Strings: Genome was a dataset of human gene se-             (by randomly selecting 50K string from DBLP) and other al-
quences4 and the average length was 100. It can be used          gorithms were run on the DBLP dataset.
2                                                                                                                                           5
  http://www.gregsadetsky.com/aol-data/                                                                                                       http://trec.nist.gov/data/t9_filtering.html
3                                                                                                                                           6
  http://dblp.uni-trier.de/xml/                                                                                                               http://www.cs.cmu.edu/~enron/
4                                                                                                                                           7
  http://www.1000genomes.org/                                                                                                                 http://dumps.wikimedia.org/

                                                                                                                                631
AdaptJoin
              100                                       BiTrieJoin
                                                           TrieJoin                   alleviate such problems as it can decrease the threshold
                                                             AllPair                  by half. Fourth, PassJoin achieved the best performance
   Time(s)
                                                            EDJoin
               10                                       PartEnum                      since it had large pruning power using the partition-based
                                                         PassJoin                     strategy which was effective for both long strings and short
                                                           QChunk
                1                                          VChunk                     strings. Fifth, some q-gram based algorithms, e.g., AllPair,
                                                        ListMerger
                                                            FastSS                    EDJoin, QChunk, achieved similar results. As these algo-
              0.1                                                                     rithms relied on q, their signatures were usually short and
                    1          2          3         4
                                                                                      leaded to low pruning power.
                         Threshold                                                    Datasets with Short Strings. Since FastSS only achieved
               Figure 3: Evaluation on DBLP-50K (ED).                                 high performance on datasets with short strings, we con-
             Table 5: Best parameter values for Figure 3.                             ducted an experiment on the Word dataset with average
   Method                  ED = 1        ED = 2        ED = 3             ED = 4      string length of 8.7. We compared with other two algo-
 AdaptJoin                     q =2          q =2          q =3               q =3    rithms PassJoin and BiTrieJoin which were also efficient
   AllPair                     q =3          q =4          q =5               q =5    for short strings. Figure 4 shows the results. FastSS was
    EDJoin                     q =2          q =3          q =3               q =3
                                                                                      faster than both PassJoin and BiTrieJoin. This is because
  PartEnum              N1/N2=3/6     N1/N2=3/3     N1/N2=2/4          N1/N2=2/5
    QChunk                     q =2          q =3          q =3               q =4    the signatures of FastSS have much stronger pruning power.
    VChunk                 scale =1      scale =2      scale =3           scale =3    However FastSS had rather large space overhead to main-
ListMerger                     q =6          q =7          q =8               q =7    tain the signatures as it generated exponential numbers of
                                                                                      signatures. In terms of candidate sizes, as BiTrieJoin did
     We have the following observations. First, different pa-                         not involve false positives, its candidate number was the
  rameter choices indeed had effect on the performance, thus                          minimum. FastSS had smaller numbers of candidates than
  it is very important to select an appropriate parameter. For                        PassJoin which implied it had larger pruning power.
  example, consider EDJoin in Figure 2(b). The performance                            Large Datasets. We evaluated the algorithms on large
  on q = 2 was much worse than that on q = 3. Second, for the                         datasets. As many algorithms cannot return results in 2
  gram-based methods, q should be neither too large nor too                           hours on the Genome dataset, we used Genome-100K (by ran-
  small, because a large q involved large numbers of signatures                       domly selecting 100K strings from Genome). Figure 5 shows
  (the number of signatures is related to q, e.g., qτ + 1) and a                      the results. We have the following observations. First,
  small q leaded to low pruning power (short grams had large                          PassJoin was nearly the best among all the algorithms. On
  probabilities to match). Thus we should carefully choose an                         the Genome dataset with long strings, PassJoin was much
  appropriate parameter q. Third, QChunk had two strategies                           faster than other algorithms, even by an order of magnitude.
  to build indexes: (1) indexing q-grams; and (2) indexing q-                         Similar results can be achieved on the DBLP dataset with
  chunks, and AdaptJoin had three strategies, (1) traditional                         middle length strings. On the datasets with short strings,
  q-gram based method; (2) indexing q-grams; and (3) index-                           PassJoin and BiTrieJoin achieved similar results and out-
  ing q-chunks. We can see that it was rather hard to select                          performed other algorithms. Second, similarity joins were
  the best strategy and the two methods required to tune dif-                         hard for datasets with short strings using large thresholds,
  ferent strategies to achieve the best performance, and this                         since there were many similar pairs and it was rather hard
  was coincided with the findings in the original papers.                             to select appropriate signatures to prune dissimilar pairs.
     In the following experiments, for each threshold τ , we                          Scalability. We evaluated the scalability of different al-
  tuned the parameters for every algorithm and reported the                           gorithms, including running time, candidate numbers, and
  best performance each algorithm can achieve.                                        memory usage. Figure 6 shows the result. We can see that
  Small Dataset. As PartEnum and M-Tree were rather inef-                             BiTrieJoin and PassJoin outperformed other algorithms.
  ficient, we first compared all algorithms on the small dataset                      This is consistent with results on the large datasets. More-
  DBLP-50K. Figure 3 shows the results and Table 5 shows the                          over, with the increase of dataset sizes, PassJoin nearly
  best parameters. (Since there are many experiment figures,                          achieved the linear results, and this was also attributed to its
  we cannot show the best parameter values for the following                          effective pruning techniques. PassJoin involved less mem-
  experiments due to the space constraints.) We also eval-                            ory than other algorithms, because the number of signatures
  uated the M-Tree algorithm but it failed to report results                          in the q-gram-based methods depended on the string length,
  within 2 hours. In the rest of the experiments, we do not                           while that of PassJoin relied on threshold τ . BiTrieJoin
  show the result in the figure if the corresponding algorithm                        and TrieJoin had the smallest number of candidates as they
  cannot return results within 2 hours.                                               directly computed the answers.
     We have the following observations. First, PartEnum were                         R-S Join. We evaluated the R-S join on the Word dataset.
  much worse than other algorithms because PartEnum was                               We randomly divided the original dataset into two sets with
  designed for hamming distance and had low pruning power                             cardinalities ratio of 1:4. We compared FastSS, PassJoin,
  for other similarity metrics. Thus we excluded it in the re-                        and BiTrieJoin. For each algorithm, we tested two strate-
  mainder of the experiments. Second, FastSS can only com-                            gies: indexing the smaller set or indexing the larger set. Fig-
  pute the results for small thresholds, e.g., 1 and 2. This                          ure 7 shows the results. We can see that these algorithms
  is because it was designed for short strings, and for long                          can efficiently support R-S join and the running time of the
  strings it involved exponential numbers of signatures. We                           different choices of which part to index was similar.
  will compare it deeply on datasets with extremely short
  strings later. Third, TrieJoin was efficient for small thresh-                      Algorithm Selection. We report our findings from the
  olds but inefficient for large thresholds. This is because, for                     results and discuss how to select appropriate algorithms
  large thresholds, it involved large numbers of active nodes                         (Table 6). On the datasets with short strings, FastSS >
  which was rather expensive to calculate. BiTrieJoin could                           PassJoin ∼ BiTrieJoin > AdaptJoin > TrieJoin > QChunk ∼

                                                                                632
BiTrieJoin                                                                                        BiTrieJoin                                                                     BiTrieJoin

                                                                                                Number of candidates
                                         PassJoin                                                                      1e+009              PassJoin                                                                       PassJoin
                100                       FastSS                                                                                            FastSS                                                                         FastSS
                                                                                                                                                                                                            1

                                                                                                                                                                                            Memory(GB)
                                                                                                                       1e+008
      Time(s)
                 10                                                                                                    1e+007
                                                                                                                                                                                                           0.1

                                                                                                                       1e+006
                  1
                                                                                                                                                                                                          0.01
                                                                                                                       100000
                      1                                      2                       3                                             1                          2                     3                               1                           2                      3
                                                        Threshold                                                                                         Threshold                                                                         Threshold
                                         (a) Running Time              (b) Candidate Number             (c) Memory Usage
                                              Figure 4: Evaluation on the Word Dataset with Very Short Strings (ED).
                                              AllPair                VChunk                                            10000                AllPair                  VChunk                                                   AllPair                   VChunk
                                             EDJoin                  TrieJoin                                                              EDJoin                    TrieJoin                                                EDJoin                     TrieJoin
                10000                       QChunk                 BiTrieJoin                                                             QChunk                   BiTrieJoin                            10000              QChunk                    BiTrieJoin
                                         AdaptJoin                  PassJoin                                                           AdaptJoin                    PassJoin                                             AdaptJoin                     PassJoin
                                         ListMerger                                                                     1000           ListMerger                                                                        ListMerger
                                                                                                                                                                                                         1000
    Time(s)

                                                                                          Time(s)

                                                                                                                                                                                         Time(s)
                 1000

                                                                                                                        100                                                                               100
                  100
                                                                                                                                                                                                            10

                   10                                                                                                     10
                                                                                                                                                                                                                1

                      1                                                                                                                                                                                    0.1
                          1                             2              3            4                                          1                      2                3           4                                1                   2                 3           4
                                                         Threshold                                                                                     Threshold                                                                        Threshold
                                  (a) Querylog (Short)                                          (b) DBLP (Middle)           (c) Genome-100K (Long)
                                                                                  Figure 5: Evaluation on Large Datasets (ED).
                1e+006
                                               AllPair              AdaptJoin                                                                AllPair                AdaptJoin                                                AllPair                 AdaptJoin
                                                                                          Number of candidates

                100000                       TrieJoin                 VChunk                                           1e+012              TrieJoin                   VChunk                                               TrieJoin                    VChunk
                                             QChunk                 BiTrieJoin                                                             QChunk                   BiTrieJoin                                             QChunk                    BiTrieJoin
                                              EDJoin                 PassJoin                                                               EDJoin                   PassJoin                              1                EDJoin                    PassJoin

                                                                                                                                                                                         Memory(GB)
                 10000                                                                                                 1e+010
                                          ListMerger                                                                                    ListMerger                                                                      ListMerger
    Time(s)

                  1000                                                                                                 1e+008
                   100
                                                                                                                       1e+006                                                                             0.1
                      10
                                                                                                                        10000
                          1
                                                                                                                          100
                   0.1                                                                                                                                                                                   0.01
                     50K                        100K        150K           200K    250K                                        50K           100K           150K           200K   250K                      50K              100K             150K            200K   250K
                                                Number of records                                                                             Number of records                                                                Number of records
                                         (a) Runing Time               (b) Candidate Number               (c) Memory Usage
                                                Figure 6: Evaluation on Scalability on the Genome Dataset (ED, τ = 3).
                                                BiTrieJoin-small
                                          100
                                                             as discussed in Section 6.1. Due to space constraint, we
                                                BiTrieJoin-large
                                                 PassJoin-small
                                                             omitted the details to tune its parameters.
                                                 PassJoin-large
                                                  FastSS-small
                              Time(s)

                                                   FastSS-large
                10
                                                             Large Datasets. We compared the algorithms on large
                                                             datasets. As many algorithms cannot return results in 2
                                                             hours on the Wiki dataset, we used Wiki-400K (by ran-
                 1                                           domly selecting 400K strings from Wiki). Figure 8 shows
                                                             the results. Since PassJoin and PartEnum are very slow for
                   1            2            3
                            Threshold                        Cosine and Dice, we only showed their results on Jac-
    Figure 7: Evalution on R-S Join (Word, ED).              card. We make the following observations. First, different
         Table 6: Algorithm Selection for ED.                token-based similarity metrics had no large effect on the per-
       Datasets                 Suggested Algorithms         formance of different algorithms. In other words, these algo-
       Short Strings            FastSS & PassJoin            rithms nearly achieved the same results on different metrics.
       Middle-length Strings PassJoin                        Second, AllPair, PPJoin, PPJoin+ and AdaptJoin achieved
       Long Strings             PassJoin                     similar results and outperformed PartEnum and PassJoin.
                                                             Third, PPJoin and PPJoin+ were better than AllPair be-
EDJoin > ListMerger > VChunk > AllPair  PartEnum           cause the two filtering techniques (position and suffix filter-
M-Tree, where ∼ denotes similar performance, > denotes       ing) can prune more dissimilar pairs. For large thresholds
better, and  denotes significantly better. On the datasets  (≥ 0.8), PPJoin+ was better than PPJoin because the suf-
with long strings, PassJoin > BiTrieJoin > EDJoin >          fix was long and the suffix filtering can significantly prune
AdaptJoin > QChunk > VChunk > ListMerger ∼ AllPair >         dissimilar pairs. For small similarity thresholds (< 0.8),
TrieJoin  PartEnum ∼ FastSS  M-Tree. On the datasets       PPJoin was better than PPJoin+, because the suffix was
with middle-length strings, PassJoin > VChunk > BiTrieJoin > short and the suffix filtering was more expensive than calcu-
TrieJoin > AdaptJoin > QChunk > EDJoin > AllPair >           lating overlap directly. Fourth, on large similarity thresholds
ListMerger  PartEnum ∼ FastSS  M-Tree.                     (≥ 0.8), PPJoin+ was the best. On small thresholds (< 0.8),
                                                             AdaptJoin outperformed AllPair, PPJoin, PPJoin+. This is
6.2 Experiments on Jaccard, Cosine and Dice                  because for small thresholds, similarity join algorithms took
  Among algorithms for Jaccard, Cosine and Dice, only
                                                             longer time and AdaptJoin had opportunity to select adap-
PartEnum required parameters and we tuned its parameters

                                                                                                                                                          633
PartEnum                   PPJoin                                           PartEnum                                  PPJoin                                         PartEnum                     PPJoin+                                        10000         ListMerger
           10000              PassJoin                 PPJoin+                                            PassJoin                                PPJoin+                                         ListMerger                  AdaptJoin                                                         AllPair
                             ListMerger               AdaptJoin                            1000          ListMerger                              AdaptJoin                             1000           AllPair                                                                                   PPJoin
                                 AllPair                                                                     AllPair                                                                                  PPJoin                                                                   1000           PPJoin+
            1000                                                                                                                                                                                                                                                                            AdaptJoin
 Time(s)

                                                                                 Time(s)

                                                                                                                                                                             Time(s)

                                                                                                                                                                                                                                                                   Time(s)
                                                                                            100
                                                                                                                                                                                        100                                                                                       100
             100

                                                                                             10
              10                                                                                                                                                                                                                                                                  10
                                                                                                                                                                                         10

               1                                                                              1                                                                                                                                                                                    1
                0.95           0.9    0.85    0.8   0.75    0.7   0.65    0.6                  0.95       0.9   0.85                      0.8   0.75   0.7   0.65     0.6                  0.95     0.9   0.85    0.8        0.75           0.7    0.65    0.6                      0.95     0.9   0.85     0.8        0.75   0.7   0.65   0.6
                                             Threshold                                                                             Threshold                                                                     Threshold                                                                                Threshold
                               (a) Trec, Jac                                                              (b) Enron, Jac                                                                       (c) Wiki-400K, Jac                                                                            (d) Trec, Cos

                             ListMerger                                                                  ListMerger                                                                    10000      ListMerger                                                                               ListMerger
           10000                 AllPair                                                                     AllPair                                                                                  AllPair                                                                10000             AllPair
                                 PPJoin                                                    10000             PPJoin                                                                                   PPJoin                                                                                   PPJoin
                               PPJoin+                                                                     PPJoin+                                                                      1000        PPJoin+                                                                                  PPJoin+
           1000              AdaptJoin                                                                   AdaptJoin                                                                                AdaptJoin                                                                   1000         AdaptJoin
Time(s)

                                                                                Time(s)

                                                                                                                                                                            Time(s)

                                                                                                                                                                                                                                                                  Time(s)
                                                                                            1000
                                                                                                                                                                                         100
             100                                                                                                                                                                                                                                                               100
                                                                                             100
                                                                                                                                                                                         10
             10                                                                                                                                                                                                                                                                   10
                                                                                             10
              1                                                                                                                                                                           1                                                                                        1
               0.95           0.9    0.85     0.8   0.75   0.7    0.65    0.6                     0.95    0.9    0.85                     0.8   0.75   0.7   0.65    0.6                   0.95     0.9   0.85    0.8        0.75           0.7    0.65   0.6                       0.95    0.9    0.85    0.8     0.75       0.7   0.65   0.6
                                            Threshold                                                                             Threshold                                                                      Threshold                                                                                Threshold
                             (e) Enron, Cos                                       (f) Wiki-400K, Cos         (g) Trec, Dice            (h) Enron, Dice
                                                                         Figure 8: Evaluation on Large Datasets (Token-based Metrics).
                                                                                                                                           1e+010
                                                AllPair                                                                                                     AllPair                                                                                       AllPair
                                                                                                                   Number of candidates

                                               PPJoin                                                                                                      PPJoin                                                                                        PPJoin
                                              PPJoin+                                                                                                     PPJoin+                                                                                       PPJoin+
                                             AdaptJoin                                                                                                   AdaptJoin                                                                                     AdaptJoin

                                                                                                                                                                                                                              Memory(GB)
                                                                                                                                                                                                                                            10
                              1000
                   Time(s)

                                                                                                                                           1e+009

                                                                                                                                           1e+008
                                100

                                     1M                    2M                   3M                        4M                                       1M                 2M                       3M                 4M                              1M                         2M                    3M                         4M
                                                     Number of records                                                                                              Number of records                                                                               Number of records
                                               (a) Runing Time                (b) Candidate Number              (c) Memory Usage
                                                     Figure 9: Evaluation on Scalability on the Wiki Dataset (Jac, δ = 0.75).
                                                                                                                                                                                                                                                 AllPair-small
      tive prefix lengths to improve performance. Fifth, different                                                                                                                                                            1000
                                                                                                                                                                                                                                                 AllPair-large
                                                                                                                                                                                                                                                PPJoin-small
                                                                                                                                                                                                                                                PPJoin-small
      from Edit Distance, different string lengths had no much                                                                                                                                                                                 PPJoin+-small
                                                                                                                                                                                                                                               PPJoin+-large
                                                                                                                                                                                                                   Time(s)

      effect on performance for token-based similarity metrics.                                                                                                                                                                               AdaptJoin-small
                                                                                                                                                                                                                                              AdaptJoin-large
      Scalability. We evaluated the scalability of different al-
      gorithms. Figure 9 shows the results on the large dataset                                                                                                                                                                  100
      Wiki. We only showed the results of Jaccard since the
      results of Cosine and Dice are similar. We can see that
                                                                                                                                                                                                                                           0.95                  0.9                       0.85                  0.8
      when the number of strings was large, AdaptJoin was al-                                                                                                                                                                                                        Threshold
      ways the fastest one. This is because for large datasets,                                                                                                                              Figure 10: Evalution on R-S Join (Wiki, Jac).
      similarity join algorithms took long time to find similar pairs                                                                                                                      Table 7: Algorithm Selection for Jac, Cos, Dice.
      and AdaptJoin had large potential to select variable-length                                                                                                                          Datasets         Thresholds   Suggested Algorithms
      prefixes to improve the performance. AdaptJoin involved                                                                                                                              Large Token No. Large/Small AdaptJoin
      slightly larger indexes than PPJoin+ as AdaptJoin required                                                                                                                           Small Token No. Large         PPJoin+ & PPJoin
      to build additional index structures.                                                                                                                                                Small Token No. Small         AdaptJoin
      R-S Join. We evaluated the R-S join on the Wiki dataset.
                                                                                                                                                                                         ListMerger  PassJoin > PartEnum. On small datasets,
      We randomly selected 2 million strings as one set and other
                                                                                                                                                                                         for large thresholds, PPJoin+ > PPJoin > AdaptJoin >
      2 million strings as another set. We compared AllPair,
                                                                                                                                                                                         AllPair  ListMerger  PassJoin > PartEnum; for small
      PPJoin, PPJoin+ and AdaptJoin. For each algorithm, we
                                                                                                                                                                                         thresholds, AdaptJoin > PPJoin > PPJoin+ > AdaptJoin >
      tested two strategies, indexing the small set or indexing the
                                                                                                                                                                                         AllPair  ListMerger  PassJoin > PartEnum.
      large set. Figure 10 shows the results. We can see that
      AdaptJoin still achieved the best performance. Moreover,                                                                                                                           6.3              Other Experiments
      AdaptJoin-small (indexing the small set) was slightly better                                                                                                                        6.3.1             Similarity Search
      than AdaptJoin-large (indexing the large set) because index-                                                                                                                          We extended the similarity join algorithms to support
      ing small/large datasets will not affect the pruning power                                                                                                                         similarity search and compared them with the similarity
      but indexing the large set will take longer time.                                                                                                                                  search algorithm ListMerger. We randomly selected 10K
      Algorithm Selection. We report our findings and discuss                                                                                                                            queries and reported the average search time. Figure 11
      how to select an algorithm (Table 7). On large datasets,                                                                                                                           shows the results. Note that TrieJoin and BiTrieJoin re-
      for large thresholds, AdaptJoin > PPJoin+ > PPJoin >                                                                                                                               lied on dual subtrie pruning to achieve high performance
      AllPair  ListMerger  PassJoin > PartEnum; for small                                                                                                                              and they cannot utilize this feature for similarity search
      thresholds, AdaptJoin > PPJoin > PPJoin+ > AllPair                                                                                                                                and thus they were inefficient for similarity search. In addi-
                                                                                                                                                                                         tion, AllPair was always slower than EDJoin. Thus we did

                                                                                                                                                                            634
100
                         ListMerger                                                           ListMerger                                                        PartEnum               PPJoin                             PartEnum               PPJoin
                             EDJoin                                               100             AllPair                                              100       PassJoin             PPJoin+                              PassJoin             PPJoin+
                10          QChunk                                                                PPJoin                                                        ListMerger           AdaptJoin                            ListMerger           AdaptJoin
                         AdaptJoin                                                              PPJoin+                                                             AllPair                                                   AllPair
                          PassJoin                                                            AdaptJoin
Time(ms)

                                                                      Time(ms)
                                                                                   10
                1

                                                                                                                                             Time(s)

                                                                                                                                                                                                       Time(s)
                                                                                       1
            0.1                                                                                                                                        10

                                                                                  0.1                                                                                                                            10
           0.01
                                                                                 0.01
                     1                2               3           4                    0.95                 0.9          0.85         0.8               1
                                                                                                                                                         0.95                 0.9       0.85     0.8               0.95                 0.9       0.85     0.8
                                          Threshold                                                           Threshold
                                                                                                                                                                                Threshold                                                 Threshold
                       (a) ED, DBLP              (b) Jac, Wiki                                                                                                          (a) Zipfian                                              (b) Uniform
                     Figure 11: Evalutaion on Similarity Search.
           20                                                                    700
                                                    EDJoin                                                              Allpair
                                                   QChunk                        600                                   PPJoin                          100                                                       100
                                                 AdaptJoin                                                            PPJoin+                                   PartEnum               PPJoin                             PartEnum               PPJoin
           15                                     PassJoin                                                           AdaptJoin                                   PassJoin             PPJoin+                              PassJoin             PPJoin+
                                                                                 500                                                                            ListMerger           AdaptJoin                            ListMerger           AdaptJoin
Time(s)

                                                                      Time(s)
                                                                                                                                                                    AllPair                                                   AllPair
                                                                                 400
           10

                                                                                                                                             Time(s)

                                                                                                                                                                                                       Time(s)
                                                                                 300

                                                                                 200                                                                   10                                                        10
            5
                                                                                 100

            0                                                                      0
                         1            2          4            8                               1              2       4            8
                                      Threads                                                                 Threads
                                                                                                                                                         0.95                 0.9       0.85     0.8               0.95                 0.9       0.85     0.8
               (a) ED, DBLP                    (b) Jac, Wiki                                                                                                                    Threshold                                       Threshold
           Figure 12: Evalutaion on Parallel Algorithms                                                                                                   (c) Normal, N (0, ( N3 )2 )     (d)                    N 2
                                                                                                                                                                                                             , ( 10 ) )     Normal, N ( N
                                                                                                                                                                                                                                        2
       not show TrieJoin, BiTrieJoin and AllPair for Edit Dis-                                                                                             Figure 13: Evalutaion on Distributions (Jac).
       tance. We can see that the extended algorithms were better                                                                                      PPJoin+ were slightly worse. We got similar results for uni-
       than ListMerger as they shorten the signatures using effec-                                                                                                                             N 2
                                                                                                                                                       form distribution especially N ( N2 , ( 10 ) ). The main reason
       tive signature-based techniques (e.g., prefix filtering) and ac-                                                                                is that the high pruning power of PPJoin and PPJoin+ re-
       cessed smaller numbers of signatures. On Edit Distance,                                                                                         lied on the low frequency of signatures in the prefix. In Zip-
       the average search time was under 1 ms and PassJoin out-                                                                                        fian distribution, the signatures in prefixes were tokens with
       performed other methods; and on Jaccard, the average                                                                                            largest idf and their frequencies were small. Thus the match-
       search time was also under 1 ms and PPJoin and PPJoin+                                                                                          ing probability between different strings was low. However
       outperformed other algorithms. AdaptJoin was slower than                                                                                        in other distributions, the matching probability can be larger
       PPJoin and PPJoin+ as AdaptJoin had additional overhead                                                                                         than that in the Zipfian distribution. Instead, AdaptJoin
       to select high-quality signatures.                                                                                                              can adaptively selected the high-quality prefixes and thus it
           6.3.2              Evaluation on Parallel Algorithms                                                                                        still achieved higher performance in different distributions.
          We parallelized existing similarity join algorithms and eval-                                                                                6.3.4            Evaluation on Disk-based Algorithms
       uated the parallel algorithms. We used R-S joins as an ex-                                                                                         We extended similarity join algorithms to support really
       ample. We evenly split each dataset into two parts. We first                                                                                    large datasets and compared them with ListMerger [3] and
       indexed strings in one dataset and then performed search on                                                                                     Bed -tree [28], which were specially designed for disk-based
       strings in another dataset in parallel. Figure 12 shows the                                                                                     settings in order to support similarity search using disk-
       results. We showed the running time of the index step and                                                                                       based indexes. We implemented two types of methods to
       search step separately where the lower bar denoted the in-                                                                                      support large datasets that cannot be loaded into the mem-
       dexing time and the upper bar denoted the search time. We                                                                                       ory. The first built the disk-based indexes on strings in one
       did not show ListMerger since it was too slow. We can see                                                                                       dataset and performed search on strings in another dataset
       that the parallel algorithms can improve the performance.                                                                                       by utilizing the disk-based indexes (denoted by EDJoin-d,
       For example, PassJoin can improve the join time to 2 sec-                                                                                       PPJoin-d). The second partitioned each dataset into several
       onds on 8 nodes from 9 seconds on 1 node. We find that the                                                                                      small groups and performed join algorithms on the small
       speedup cannot reach the optimal value as we can parallelize                                                                                    groups which can be loaded into the memory (denoted by
       the search step but cannot parallelize the indexing step. It                                                                                    EDJoin-p, PPJoin-p). For Edit Distance, we set the mem-
       is still challenging to devise efficient parallel algorithms.                                                                                   ory buffer as 64MB and for Jaccard, we set the buffer
           6.3.3              Evalutaion on Dataset Distribution                                                                                       as 512MB, which were about 5% of memory used by in-
         We evaluated different dataset distributions. We gener-                                                                                       memory algorithms EDJoin and PPJoin respectively. Fig-
       ated four datasets with different distributions as follows.                                                                                     ure 14 shows the results. Our extended algorithms using the
       We first generated N = 400K distinct tokens and then                                                                                            first strategy outperformed Bed -tree but were worse than
       constructed each string by randomly selecting tokens. The                                                                                       ListMerger as Bed -tree involved large numbers of random
       probability to select the i-th token ti is p(ti ) = P score(ti)
                                                                            ,                                                                          accesses and ListMerger avoided unnecessary disk accesses
                                                                 score(tj )
                                                          1
                                                                                                                    1≤j≤N                              using a novel physical layout for inverted indexes. Our ex-
       where score(ti ) =          for Zipfian distribution, score(ti ) =
                                                          i                                                                                            tended algorithms using the second strategy outperformed
                                                                   (x−µ)2
        1
                                                      Ri
           for uniform distribution, score(ti ) = i−1 σ√12π e 2σ2
                                                                 −                                                                                     ListMerger, as we loaded the data into memory, used in-
       N
                                                                                                                                                       memory methods to compute results, and avoided randomly
       for normal distribution. We generated two datasets using
                                                                                                                                                       accessing disk-based indexes. Thus for similarity search the
       normal distribution with N (0, ( N3 )2 ) (µ = 0, σ = N3 ) and
                  N 2
                                                                                                                                                       disk-based methods were better but for similarity joins the
       N ( N2 , ( 10 ) )(µ = N2 , σ = 10
                                       N
                                         ). Each dataset contained 500K                                                                                partition-based methods achieved higher performance.
       strings and each string contained 50 tokens. Figure 13 shows
       the results. We can see that in the traditional Zipfian distri-                                                                                 6.3.5            Comparision with Approxiamte Algorithms
       bution, PPJoin, PPJoin+, and AdaptJoin achieved the best                                                                                          We compared with approximate similarity join algorithms,
       performance. However, for some special distributions such                                                                                       BayesLSH-lite [19], which approximately computed the re-
       as uniform distribution, the performance for PPJoin and                                                                                         sults and may miss results. We set the expected recall of

                                                                                                                                            635
You can also read