Harvesting Large-Scale Weakly-Tagged Image Databases from the Web


Jianping Fan¹, Yi Shen¹, Ning Zhou¹, Yuli Gao²
¹ Department of Computer Science, UNC-Charlotte, NC 28223, USA
² Multimedia Interaction and Understanding, HP Labs, Palo Alto, CA 94304, USA

Abstract

To leverage large-scale weakly-tagged images for computer vision tasks (such as object detection and scene recognition), a novel cross-modal tag cleansing and junk image filtering algorithm is developed for cleansing the weakly-tagged images and their social tags (i.e., removing irrelevant images and finding the most relevant tags for each image) by integrating both the visual similarity contexts between the images and the semantic similarity contexts between their tags. Our algorithm can address the issues of spams, polysemes and synonyms more effectively and determine the relevance between the images and their social tags more precisely. It therefore allows us to create large amounts of training images with more reliable labels by harvesting from large-scale weakly-tagged images, which can in turn support more effective classifier training for many computer vision tasks.

1. Introduction

For many computer vision tasks, such as object detection and scene recognition, machine learning techniques are usually employed to learn the classifiers from a set of labeled training images [1]. The set of labeled training images must be large because: (1) the number of object classes and scenes of interest can be very large; (2) the learning complexity for some object classes and scenes can be very high because of visual ambiguity; and (3) a small number of labeled training images is insufficient to capture the diverse visual properties of large amounts of unseen test images. However, hiring professionals to label large amounts of training images is costly and poses a key limitation for the practical use of some advanced computer vision techniques. On the other hand, large-scale digital images and their associated text terms are available on the Internet, so it is very attractive to leverage large-scale online images for computer vision tasks [2].

Some pioneering work has been done to leverage Internet images for computer vision tasks [2, 4-8]. Fergus et al. [4] and Li et al. [6] dealt with the precision problem by re-ranking images downloaded from an image search engine. More recently, Schroff et al. [7] developed a new algorithm for harvesting image databases from the web by combining text, meta-data and visual information. All these existing techniques make a hidden assumption: that image semantics have an explicit correspondence with the associated or nearby texts. Unfortunately, this assumption may not always hold.

Collaborative image tagging systems, such as Flickr [3], are now a popular way to obtain large sets of labeled images easily by relying on the collaborative effort of a large population of Internet users. In a collaborative image tagging system, people tag images according to their social or cultural backgrounds, personal expertise and perception. We call such collaboratively tagged images weakly-tagged images because their social tags may not correspond exactly to the underlying image semantics. With the exponential growth of weakly-tagged images, it is very attractive to develop new algorithms that can leverage them for computer vision tasks (such as learning classifiers for object detection and scene recognition). Without a controlled vocabulary, many text terms used for image tagging may be synonyms or polysemes, or even spam. Synonyms, polysemes and spam may either return incomplete sets of the relevant images or produce large amounts of ambiguous or even junk images. Thus it is not a trivial task to leverage large-scale weakly-tagged images for computer vision tasks.

In this paper, we focus on collecting large-scale weakly-tagged images from collaborative image tagging systems such as Flickr by addressing the following crucial issues:

(a) Synonymous Tags: Different people may use different tags with the same or close meanings (synonyms) to tag their images. For example, car, auto, and automobile are a set of synonyms. Synonyms may result in incomplete returns of the relevant images during image crawling, and most tag clustering algorithms cannot incorporate the visual similarities between the relevant images to deal with the issue of synonyms more effectively.

(b) Polysemous Tags: Collaborative image tagging is an ambiguous process. Without a controlled vocabulary, different people may apply the same tag in different ways (i.e., the same tag may have different meanings in different contexts), which may result in large amounts of ambiguous images. For example, the text term "bank" can be used to tag "bank office", "river bank" and "cloud bank". Word sense disambiguation is one potential solution for addressing this ambiguity, but it cannot incorporate the visual properties of the relevant images to deal with the issue of polysemes more effectively [9-10].
(c) Spam Tags: Spam tags, which are used to drive traffic to certain images for fun or profit, are created by inserting text terms related to popular query terms rather than to the actual image content. Spam tags are problematic because the resulting junk images may mislead the underlying machine learning tools during classifier training. Junk image filtering is an attractive direction for dealing with spam tags, but it is worth noting that the scenario for junk image filtering in a collaborative image tagging space is significantly different.

In this paper, a novel cross-modal tag cleansing and junk image filtering algorithm is developed that integrates both the visual properties of the weakly-tagged images and their social tags to deal with the issues of spams, polysemes and synonyms more effectively, so that we can create large amounts of training images with more reliable labels for computer vision tasks by harvesting from large-scale weakly-tagged images. The paper is organized as follows. In Section 2, an automatic algorithm is introduced for image topic extraction. In Section 3, a mixture-of-kernels algorithm is introduced for image similarity characterization. In Section 4, a spam tag detection technique is introduced for junk image filtering. In Section 5, a cross-modal tag cleansing algorithm is introduced for addressing the issues of synonyms and polysemes. The algorithm evaluation results are given in Section 6. We conclude the paper in Section 7.

2. Image Topic Extraction

Each image in a collaborative tagging system is associated with the image holder's tags for the underlying image content as well as other users' tags or comments. It is worth noting that entity extraction can be done more effectively in a collaborative image tagging space. In this paper, we first focus on extracting the social tags which are strongly related to the most popular real-world objects, scenes, or events. Social tags related to image capture time and place are also very attractive, but they are beyond the scope of this paper. The image tags are first partitioned into two categories: noun phrases versus verb phrases. The noun phrases are further partitioned automatically into content-relevant tags (i.e., tags that are relevant to image objects and scenes) and content-irrelevant tags. The verb phrases are likewise partitioned automatically into event-relevant tags (i.e., tags that are relevant to image events) and event-irrelevant tags.

The occurrence frequency for each content-relevant tag and each event-relevant tag is counted automatically using the number of relevant images. Misspelled tags tend to have low frequencies (different people make different typing mistakes), so it is easy to correct such misspelled tags and add their images to the corresponding relevant tags automatically. Two tags that are used for tagging the same image are considered to co-occur once, regardless of their order. A co-occurrence matrix is obtained by counting the frequencies of such pairwise tag co-occurrences.

The content-relevant tags and the event-relevant tags are further partitioned into two categories according to their interestingness scores: interesting tags and uninteresting tags. Multiple information sources are exploited to determine the interesting tags more accurately. For a given tag C, its interestingness score ω(C) depends on: (1) its occurrence frequency t(C) (a higher occurrence frequency corresponds to a higher interestingness score); and (2) its co-occurrence frequency ϑ(C) with any other tag in the vocabulary (a higher co-occurrence frequency corresponds to a higher interestingness score). The occurrence frequency t(C) for a given tag C is the number of images tagged by C. The co-occurrence frequency ϑ(C) for C is the number of images tagged jointly by C and any other tag in the vocabulary.

The interestingness score ω(C) for a given tag C is defined as:

    ω(C) = ξ \log\big( t(C) + \sqrt{t^2(C) + 1} \big) + ζ \log\big( ϑ(C) + \sqrt{ϑ^2(C) + 1} \big)    (1)

where ξ and ζ are weighting factors with ξ + ζ = 1.

All the interesting tags with large values of ω(·) (the top 5000 tags in our current experiments) are treated as image topics. In this work, only the interesting tags which describe the most popular real-world object classes, scenes, or events are treated as image topics. It is worth noting that a single weakly-tagged image may be assigned to multiple image topics when the relevant tags are used to tag the image jointly. Collecting large-scale training images for the most popular real-world object classes, scenes, and events, and learning their classifiers more accurately, are crucial for many computer vision tasks.
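To make the scoring step concrete, the sketch below computes the interestingness score of Eq. (1) from a simple tag-to-images index. The index layout, the helper name, and the even split ξ = ζ = 0.5 are illustrative assumptions; the paper only requires ξ + ζ = 1.

```python
import math

def interestingness(tag_index, xi=0.5, zeta=0.5):
    """Interestingness score omega(C) of Eq. (1) for every tag.

    tag_index maps each tag to the set of image ids it labels.
    xi and zeta must satisfy xi + zeta = 1 (split evenly here).
    """
    scores = {}
    for tag, imgs in tag_index.items():
        t = len(imgs)                          # occurrence frequency t(C)
        jointly = set()                        # images shared with any other tag
        for other, other_imgs in tag_index.items():
            if other != tag:
                jointly |= imgs & other_imgs
        v = len(jointly)                       # co-occurrence frequency
        scores[tag] = (xi * math.log(t + math.sqrt(t * t + 1))
                       + zeta * math.log(v + math.sqrt(v * v + 1)))
    return scores

# The top-scoring tags (top 5000 in our experiments) become image topics.
index = {"car": {1, 2, 3}, "auto": {2, 3}, "beach": {4}}
scores = interestingness(index)
topics = sorted(scores, key=scores.get, reverse=True)
```

The log(x + sqrt(x² + 1)) form damps both frequencies, so extremely common tags do not dominate the ranking outright.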
3. Image Similarity Characterization

To characterize the various visual properties of the images more sufficiently, both global and local visual features are extracted for image content representation. In our current experiments, the following visual features are extracted: (1) a 36-bin RGB color histogram to characterize the global color distributions of the images; (2) 48-dimensional texture features from Gabor filter banks to characterize the global visual properties (i.e., global structures) of the images; and (3) a number of interest points and their SIFT (scale-invariant feature transform) features to characterize the local visual properties of the underlying salient image components.

By using these high-dimensional visual features (color histogram, wavelet textures, and SIFT features) for image content representation, we can characterize the various visual properties of the images more sufficiently. On the other hand, the statistical properties of the images in the high-dimensional feature space may be heterogeneous and sparse, because different feature subsets are used to characterize different visual properties of the images. Therefore, it is hard to use a single type of kernel to characterize the diverse visual similarity contexts between the images precisely.

Based on these observations, the high-dimensional visual features are first partitioned into multiple feature subsets, where each feature subset characterizes one certain type of visual property of the images. The underlying visual similarity contexts between the images then become more homogeneous and can be approximated more precisely by one particular type of kernel. For each feature subset, a suitable base kernel is designed for image similarity characterization. Because different base image kernels play different roles in characterizing the diverse visual similarity contexts between the images, the optimal kernel for diverse image similarity characterization can be approximated more accurately by a linear combination of these base image kernels with different importance.

For a given image topic C_j in the vocabulary, the diverse visual similarity contexts between its images are therefore characterized more precisely by using a mixture-of-kernels [13-14]:

    κ(x, y) = \sum_{l=1}^{τ} β_l κ_l(x, y),    \sum_{l=1}^{τ} β_l = 1    (2)

where τ is the number of feature subsets (i.e., the number of base image kernels) and β_l ≥ 0 is the importance factor for the lth base image kernel κ_l(x, y). Combining multiple base kernels allows us to achieve a more precise characterization of the diverse visual similarity contexts between the weakly-tagged images.
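As an illustration of Eq. (2), the following sketch combines per-subset base kernel matrices into a single mixture kernel. The choice of an RBF base kernel, the γ value, and the fixed-length SIFT-based descriptor per image are assumptions for the sketch; the paper only requires one suitable base kernel per feature subset.

```python
import numpy as np

def rbf_kernel(features, gamma=1.0):
    """One base kernel kappa_l over a single feature subset (an RBF
    kernel is assumed here as the 'suitable base kernel')."""
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    return np.exp(-gamma * d2)

def mixture_of_kernels(feature_subsets, beta):
    """Eq. (2): kappa = sum_l beta_l * kappa_l with beta on the simplex."""
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)
    return sum(b * rbf_kernel(f) for b, f in zip(beta, feature_subsets))

# Example: color-histogram, Gabor-texture, and (fixed-length) SIFT-based
# feature subsets for five images; dimensions follow Section 3.
subsets = [np.random.rand(5, 36), np.random.rand(5, 48), np.random.rand(5, 128)]
K = mixture_of_kernels(subsets, beta=[0.4, 0.3, 0.3])
```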
Figure 1: Image clustering for the image topic "beach": (a) cluster correlation network; (b) filtered junk images.

4. Spam Tag Detection

Some popular image topics in the vocabulary may contain large numbers of junk images because of spam tagging, and incorporating the junk images into classifier training may seriously mislead the underlying machine learning tools. Obviously, the junk images induced by spam tagging may differ significantly in their visual properties from the relevant images. Thus the junk images can be filtered out effectively by performing visual-based image clustering and relevance analysis.

4.1 Image Clustering

A K-way min-max cut algorithm is developed to achieve more effective image clustering, where the cumulative inter-cluster visual similarity contexts are minimized while the cumulative intra-cluster visual similarity contexts (the summation of pairwise image similarity contexts within a cluster) are maximized. These two criteria can be satisfied simultaneously with a simple K-way min-max cut function [11].

For a given image topic C, a graph is first constructed for organizing all its weakly-tagged images according to their visual similarity contexts [11-12], where each node on the graph is one weakly-tagged image for the given image topic C and an edge between two nodes characterizes the visual similarity context κ(·, ·) between two weakly-tagged images.

All the weakly-tagged images for the given image topic C are partitioned into K clusters automatically by minimizing the following objective function:

    \min Ψ(C, K, β) = \sum_{i=1}^{K} \frac{s(G_i, G/G_i)}{s(G_i, G_i)}    (3)

where G = {G_i | i = 1, …, K} represents the K image clusters, G/G_i represents the other K − 1 image clusters in G except G_i, K is the total number of image clusters, and β is the set of optimal kernel weights. The cumulative inter-cluster visual similarity context s(G_i, G/G_i) is defined as:

    s(G_i, G/G_i) = \sum_{u ∈ G_i} \sum_{v ∈ G/G_i} κ(u, v)    (4)

The cumulative intra-cluster visual similarity context s(G_i, G_i) is defined as:

    s(G_i, G_i) = \sum_{u ∈ G_i} \sum_{v ∈ G_i} κ(u, v)    (5)

We further define X = [X_1, …, X_l, …, X_K] as the cluster indicators, where the component X_l is a binary indicator for membership in the lth cluster G_l:

    X_l(u) = \begin{cases} 1, & u ∈ G_l \\ 0, & \text{otherwise} \end{cases}    (6)

W is defined as an n × n symmetric matrix (n is the total number of web images), with components:

    W_{u,v} = κ(u, v)    (7)

D is defined as an n × n diagonal matrix whose diagonal components are:

    D_{u,u} = \sum_{v=1}^{n} W_{u,v}    (8)

For the given image topic C, an optimal partition of its weakly-tagged images (i.e., image clustering) is achieved by:

    \min Ψ(C, K, β) = \sum_{l=1}^{K} \frac{X_l^T (D − W) X_l}{X_l^T W X_l}    (9)

Let \widetilde{W} = D^{−1/2} W D^{−1/2} and \widetilde{X}_l = D^{1/2} X_l / ‖D^{1/2} X_l‖. The objective function for our K-way min-max cut algorithm can then be rewritten as:

    \min Ψ(C, K, β) = \sum_{l=1}^{K} \frac{1}{\widetilde{X}_l^T \widetilde{W} \widetilde{X}_l} − K    (10)

subject to:

    \widetilde{X}_l^T \widetilde{X}_l = I,  \widetilde{X}_l^T \widetilde{W} \widetilde{X}_l > 0,  l ∈ [1, …, K]

The optimal solution of Eq. (10) is finally obtained by solving multiple eigenvalue equations:

    \widetilde{W} \widetilde{X}_l = λ_l \widetilde{X}_l,  l ∈ [1, …, K]    (11)

The objective for kernel weight determination is to maximize the inter-cluster separability and the intra-cluster compactness. For one specific cluster G_l, its inter-cluster separability μ(G_l) and its intra-cluster compactness σ(G_l) are defined as:

    μ(G_l) = X_l^T (D − W) X_l,  σ(G_l) = X_l^T W X_l    (12)

For one specific cluster G_l, we can rewrite its cumulative intra-cluster pairwise image similarity context s(G_l, G_l) as W(G_l):

    W(G_l) = \sum_{u ∈ G_l} \sum_{v ∈ G_l} κ(u, v) = \sum_{i=1}^{τ} β_i ω_i(G_l)    (13)

Similarly, using the degree matrix D of Eq. (8) restricted to G_l:

    D(G_l) − W(G_l) = \sum_{i=1}^{τ} β_i [ε_i(G_l) − ω_i(G_l)]    (14)

where ω_i(G_l) and ε_i(G_l) are defined as:

    ω_i(G_l) = \sum_{u ∈ G_l} \sum_{v ∈ G_l} κ_i(u, v),  ε_i(G_l) = \sum_{u ∈ G_l} \sum_{v=1}^{n} κ_i(u, v)    (15)

The optimal weight vector β = [β_1, …, β_τ] for kernel combination is determined automatically by maximizing the inter-cluster separability and the intra-cluster compactness:

    \max_{β} \left\{ \frac{1}{K} \sum_{l=1}^{K} \frac{σ(G_l)}{μ(G_l)} \right\}    (16)

subject to: \sum_{i=1}^{τ} β_i = 1, ∀i: β_i ≥ 0.

The optimal kernel weights β = [β_1, …, β_τ] are obtained by solving the following quadratic programming problem:

    \min_{β} \left\{ \frac{1}{2} β^T \left( \sum_{l=1}^{K} Ω(G_l) Ω(G_l)^T \right) β \right\}    (17)

subject to: \sum_{i=1}^{τ} β_i = 1, ∀i: β_i ≥ 0, where Ω(G_l) is defined as:

    Ω(G_l) = \frac{ω(G_l)}{ε(G_l) − ω(G_l)}    (18)

with ω(G_l) = [ω_1(G_l), …, ω_τ(G_l)]^T and ε(G_l) = [ε_1(G_l), …, ε_τ(G_l)]^T from Eq. (15), the division taken component-wise.

In summary, our K-way min-max cut algorithm iterates the following steps for image clustering and kernel weight determination: (1) β is set equally for all feature subsets in the first iteration. (2) Given the current kernel weights, the K-way min-max cut algorithm partitions the weakly-tagged images into K clusters according to their pairwise visual similarity contexts. (3) Given the current partition of the weakly-tagged images, the kernel weight determination algorithm estimates more suitable kernel weights, so that the diverse visual similarity contexts between the images can be characterized more precisely. (4) Return to step 2 and continue the loop until β converges. As shown in Fig. 1(a) and Fig. 2(a), our image clustering algorithm can achieve a good partition of large amounts of weakly-tagged images and determine their global distributions and inter-cluster correlations effectively. Unfortunately, this image clustering process cannot directly identify the clusters of junk images.
Figure 2: Image clustering for the image topic "rock": (a) cluster correlation network; (b) filtered junk images.

4.2 Relevance Re-Ranking

Different users have significantly different motivations for spam tagging, and their spam-tagged images will contain different content and have different visual properties. Thus the clusters of junk images (which come from different users with different motivations) tend to be small. Based on this observation, it is reasonable to define the relevance score ρ(C, G_i) between a given image cluster G_i and the image topic C as:

    ρ(C, G_i) = \frac{\sum_{x ∈ G_i} P(x, C)}{\sum_{y ∈ C} P(y, C)}    (19)

where x and y represent particular weakly-tagged images for the image topic C, and P(x, C) and P(y, C) indicate the co-occurrence probabilities of the images x and y with the image topic C.

In order to leverage the inter-cluster correlations for more effective relevance re-ranking, a random walk process is performed for automatic relevance score refinement [15]. For a given image topic C, our image clustering algorithm automatically determines a cluster correlation network (i.e., K image clusters and their inter-cluster correlations), as shown in Fig. 1(a) and Fig. 2(a). We use ρ_l(G_i) to denote the relevance score for the ith image cluster G_i at the lth iteration. The relevance scores for all K image clusters at the lth iteration form a column vector [ρ_l(G_i)]_{K×1}. We further define Φ as a K × K transition matrix whose element φ_{G_i,G_j} is the probability of a transition from the image cluster G_i to its inter-related image cluster G_j:

    φ_{G_i,G_j} = \frac{s(G_i, G_j)}{\sum_{G_h ∈ C} s(G_i, G_h)}    (20)

where s(G_i, G_j) is the inter-cluster visual similarity context between two image clusters G_i and G_j as defined in Eq. (4).

The random walk process is then formulated as:

    ρ_l(G_i) = θ \sum_{j ∈ Ω_i} ρ_{l−1}(G_j) φ_{G_i,G_j} + (1 − θ) ρ(C, G_i)    (21)

where Ω_i is the set of first-order nearest neighbors of the image cluster G_i on the cluster correlation network, ρ(C, G_i) is the initial relevance score for the image cluster G_i, and θ is a weight parameter. This random walk process promotes the image clusters which have many connections on the cluster correlation network, i.e., the image clusters which have close visual properties (stronger visual similarity contexts) with other image clusters. On the other hand, it weakens the isolated image clusters, i.e., the image clusters which have weak visual correlations with other image clusters. The random walk process terminates when the relevance scores converge.
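The refinement of Eqs. (19)-(21) amounts to a damped power iteration over the row-normalized inter-cluster similarity matrix. A minimal sketch, assuming dense similarities (absent edges encoded as zeros) and an illustrative damping value θ = 0.85, which the paper does not specify:

```python
import numpy as np

def refine_relevance(S, rho0, theta=0.85, tol=1e-6, max_iter=1000):
    """Random-walk relevance refinement implementing Eq. (21).

    S[i, j]: inter-cluster similarity s(G_i, G_j) from Eq. (4), with
    zeros for clusters that are not first-order neighbors.
    rho0[i]: initial relevance score rho(C, G_i) from Eq. (19).
    theta:   damping weight (0.85 is an assumed value).
    """
    # Eq. (20): row-normalize similarities into transition probabilities.
    phi = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)
    rho = rho0.copy()
    for _ in range(max_iter):
        rho_next = theta * phi @ rho + (1.0 - theta) * rho0
        if np.abs(rho_next - rho).max() < tol:   # stop once scores converge
            return rho_next
        rho = rho_next
    return rho
```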
By performing the random walk over the cluster correlation network, our relevance score refinement algorithm can re-rank the relevance between the image clusters and the image topic C more precisely. The top-k image clusters with the highest relevance scores are then selected as the most relevant image clusters for the given image topic C. By integrating the cluster correlation network and the random walk for relevance re-ranking, our spam tag detection algorithm can filter out the junk images effectively, as shown in Fig. 1(b) and Fig. 2(b). By filtering out the junk images, we can automatically create large-scale sets of training images with more reliable labels and learn more accurate classifiers for object detection and scene recognition.

5. Cross-Modal Tag Cleansing

The appearance of synonyms may result in insufficient image collections, which may prevent the underlying machine learning techniques from learning reliable classifiers for the synonymous image topics. On the other hand, the appearance of polysemes may result in image sets with huge visual diversity, which may also prevent the underlying machine learning tools from learning precise classifiers for the polysemous image topics. To leverage large-scale weakly-tagged images for computer vision tasks, it is therefore very attractive to develop cross-modal tag cleansing techniques for addressing the issues of synonyms and polysemes more effectively.

5.1 Combining Synonymous Topics

When people tag their images, they may use multiple text terms with similar meanings interchangeably. The image tags are thus inter-related, and such inter-related tags and their relevant images should be considered jointly. Based on this observation, a topic network is constructed automatically for characterizing such inter-tag (inter-topic) similarity contexts more precisely. Our topic network consists of two key components: (a) a large number of image topics; and (b) their cross-modal inter-topic correlations. The cross-modal inter-topic correlations consist of two components: (1) inter-topic co-occurrence correlations; and (2) inter-topic visual similarity contexts.

Figure 3: Different views of our topic network.

For two given image topics C_i and C_j, their visual similarity context γ(C_i, C_j) is defined as:

    γ(C_i, C_j) = \frac{1}{2 |C_i| |C_j|} \sum_{u ∈ C_i} \sum_{v ∈ C_j} [κ̂(u, v) + κ̄(u, v)]    (22)

where |C_i| and |C_j| are the numbers of weakly-tagged images for the image topics C_i and C_j, κ̂(u, v) is the kernel-based visual similarity context between two weakly-tagged images u and v using the kernel weights for the image topic C_i, and κ̄(u, v) is the kernel-based visual similarity context between u and v using the kernel weights for the image topic C_j.

The co-occurrence correlation β(C_i, C_j) between two image topics C_i and C_j is defined as:

    β(C_i, C_j) = −P(C_i, C_j) \log \frac{P(C_i, C_j)}{P(C_i) + P(C_j)}    (23)

where P(C_i, C_j) is the co-occurrence probability of the two image topics C_i and C_j, and P(C_i) and P(C_j) are the occurrence probabilities of the image topics C_i and C_j.

The cross-modal inter-topic correlation between two image topics C_i and C_j is finally defined as:

    ϕ(C_i, C_j) = α · γ(C_i, C_j) + (1 − α) · β(C_i, C_j)    (24)

where α is a weighting factor determined through cross-validation. The topic network for our image collections is shown in Fig. 3, where each image topic is linked with its most relevant image topics, i.e., those with larger values of ϕ(·, ·).

Our K-way min-max cut algorithm is further performed on the topic network for topic clustering, so that synonymous topics are grouped into the same cluster and can be combined into one super-topic. The images for these synonymous topics may share similar visual properties and semantics; they are therefore combined and assigned to the super-topic automatically, yielding a more comprehensive set of relevant images. The multiple tags describing these synonymous topics are combined into one unified phrase for tagging the super-topic. By combining the synonymous topics and their similar images, we can obtain sufficient images to achieve more reliable learning of the classifier for the corresponding super-topic.
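Assuming the pairwise image similarities between two topics have already been evaluated under each topic's kernel weights, the cross-modal correlation of Eqs. (22)-(24) can be sketched as follows; the function name is illustrative, and α = 0.5 stands in for the cross-validated value used in the paper.

```python
import numpy as np

def intertopic_correlation(K_hat, K_bar, p_i, p_j, p_ij, alpha=0.5):
    """Cross-modal inter-topic correlation phi(Ci, Cj) of Eq. (24).

    K_hat, K_bar: |Ci| x |Cj| image-similarity matrices computed with
    topic Ci's and topic Cj's kernel weights, respectively.
    p_i, p_j, p_ij: occurrence and co-occurrence probabilities (p_ij > 0).
    alpha: weighting factor; 0.5 is an assumed default.
    """
    n_i, n_j = K_hat.shape
    # Eq. (22): visual similarity context between the two topics.
    gamma = (K_hat + K_bar).sum() / (2.0 * n_i * n_j)
    # Eq. (23): co-occurrence correlation between the two topics.
    beta = -p_ij * np.log(p_ij / (p_i + p_j))
    # Eq. (24): combine the visual and co-occurrence channels.
    return alpha * gamma + (1.0 - alpha) * beta
```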
5.2 Splitting Polysemous Topics

Some image topics may be polysemous, which may result in large amounts of ambiguous images with diverse visual properties. Using the ambiguous images for classifier training may produce classifiers with high variance and low generalization ability. To address the issue of polysemes, automatic image clustering is performed to split the polysemous topics by partitioning their ambiguous images into multiple clusters with more homogeneous visual properties. Our K-way min-max cut algorithm is used to partition the ambiguous images under the same polysemous topic into multiple groups automatically, and each group may correspond to one certain sub-topic with more homogeneous visual properties and a smaller semantic gap.

To address the issue of polysemous topics more effectively, WordNet is first used to identify candidate polysemous topics. For a given candidate polysemous topic P, all its weakly-tagged images are first partitioned into multiple clusters according to their visual similarity contexts by using our K-way min-max cut algorithm. The visual diversity Ω(P) of the given candidate P is defined as:

    Ω(P) = \sum_{G_i, G_j ∈ P} \left( \frac{μ(G_i) − μ(G_j)}{σ(G_i) + σ(G_j)} \right)^2    (25)

where μ(G_i) and μ(G_j) are the means of the image clusters G_i and G_j, and σ(G_i) and σ(G_j) are the variances of the image clusters G_i and G_j.

The candidates with large visual diversity between their images are treated as polysemous topics and are further partitioned into multiple sub-topics. For a given polysemous topic, all its ambiguous images are partitioned into multiple clusters automatically, and each cluster may correspond to one certain sub-topic. By assigning the ambiguous images of a polysemous topic to multiple sub-topics, we obtain multiple image sets with more homogeneous visual properties, which may have better correspondences between the tags (i.e., sub-topics) and the image semantics (i.e., smaller semantic gaps). By splitting the polysemous topics and their ambiguous images, we obtain: (a) multiple sub-topics with smaller semantic gaps and lower visual diversity; and (b) more precise image collections (with lower visual diversity) which can be used to achieve more accurate learning of the classifiers for these sub-topics.

6. Algorithm Evaluation

We have carried out our experimental studies using large-scale weakly-tagged Flickr images; we downloaded more than 10 million Flickr images. Our algorithm evaluation focuses on how well our techniques can address the issues of spams, polysemes and synonyms. To evaluate the performance of our algorithms for spam tag detection and cross-modal tag cleansing, we have designed an interactive system for searching and exploring large-scale collections of Flickr images. The benchmark metrics for algorithm evaluation are the precision ρ and recall ϱ for image retrieval, defined as:

    ρ = \frac{|ϑ|}{|ϑ| + |ξ|},    ϱ = \frac{|ϑ|}{|ϑ| + |ν|}    (26)

where ϑ is the set of images that are relevant to the given image topic and are returned correctly, ξ is the set of images that are irrelevant to the given image topic and are returned incorrectly, and ν is the set of images that are relevant to the given image topic but are not returned. In our experiments, only the top 200 images are used for calculating the precision and recall rates.

Figure 4: The comparison of the precision rates before and after performing spam tag detection.

The precision rate characterizes the accuracy of our system in finding the particular images of interest, so it can be used to assess the effectiveness of our spam tag detection algorithm. As shown in Fig. 4, our spam tag detection algorithm can filter out the junk images effectively, resulting in higher precision rates for image retrieval. On the other hand, the recall rate characterizes the efficiency of our system in finding the particular images of interest, so it can be used to assess the effectiveness of our cross-modal tag cleansing algorithm at addressing the issue of synonymous tags. As shown in Fig. 5, our cross-modal tag cleansing algorithm can combine the synonymous topics and their similar images effectively, resulting in higher recall rates for image retrieval.

Figure 5: The comparison of the recall rates before and after merging the synonymous topics.

To evaluate the effectiveness of our cross-modal tag cleansing algorithm at dealing with polysemous tags, we have compared the precision rates before and after separating the polysemous tags and their ambiguous images. Some results are shown in Fig. 6; one can observe that our cross-modal tag cleansing algorithm can tackle the issue of polysemous tags effectively. By splitting the polysemous topics and their ambiguous images into multiple sub-topics, our system achieves higher precision rates for image retrieval.

Figure 6: The precision rates for some query terms before and after separating the polysemous topics and their ambiguous images.
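The retrieval metrics above can be computed directly; the following sketch evaluates Eq. (26) over the top-200 returned images, as in our experiments (the helper name and argument layout are illustrative).

```python
def precision_recall(returned, relevant, top_k=200):
    """Precision and recall of Eq. (26) over the top-k returned images.

    returned: ranked list of image ids produced by the system.
    relevant: set of ground-truth relevant image ids.
    """
    top = set(returned[:top_k])
    true_pos = top & relevant                    # the set denoted theta
    precision = len(true_pos) / max(len(top), 1)
    recall = len(true_pos) / max(len(relevant), 1)
    return precision, recall
```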
We have also compared the precision and recall rates of our system (which provides techniques to deal with the critical issues of spam tags, synonymous tags, and polysemous tags) with those of the Flickr search system (which does not provide such techniques). As shown in Fig. 7 and Fig. 8, one can observe that our system achieves higher precision and recall rates for all 5000 queries (i.e., the 5000 tags of interest in our experiments) by addressing the critical issues of spams, synonyms and polysemes effectively.

Figure 7: The precision rates for 5000 query terms: (a) our system; (b) Flickr search.

Figure 8: The recall rates for 5000 query terms: (a) our system; (b) Flickr search.

7. Conclusions

The objective of this work is to create large amounts of training images with more reliable labels for computer vision tasks by harvesting from large-scale weakly-tagged images. A novel cross-modal tag cleansing and junk image filtering algorithm is developed by integrating both the visual similarity contexts between the images and the semantic similarity contexts between their tags for cleansing the weakly-tagged images and their social tags. Our experiments on large-scale collections of weakly-tagged Flickr images have provided very positive results. We will also release our image sets with more reliable labels on our web site.

References

[1] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, "Content-based image retrieval at the end of the early years", IEEE Trans. on PAMI, 2000.

[2] J. Fan, C. Yang, Y. Shen, N. Babaguchi, H. Luo, "Leveraging large-scale weakly-tagged images to train inter-related classifiers for multi-label annotation", Proc. of the first ACM workshop on Large-scale multimedia retrieval and mining, 2009.

[3] Flickr, http://www.flickr.com.

[4] R. Fergus, P. Perona, A. Zisserman, "A visual category filter for Google Images", ECCV, 2004.

[5] T. Berg, D. Forsyth, "Animals on the Web", IEEE CVPR, 2006.

[6] L. Li, G. Wang, L. Fei-Fei, "OPTIMOL: automatic online picture collection via incremental model learning", IEEE CVPR, 2007.

[7] F. Schroff, A. Criminisi, A. Zisserman, "Harvesting image databases from the web", IEEE ICCV, 2007.

[8] B.C. Russell, A. Torralba, R. Fergus, W.T. Freeman, "80 million tiny images: a large dataset for non-parametric object and scene recognition", IEEE Trans. on PAMI, vol. 30, no. 11, 2008.

[9] K. Barnard, M. Johnson, "Word sense disambiguation with pictures", Artificial Intelligence, vol. 167, pp. 13-30, 2005.

[10] J. Yuan, Y. Wu, M. Yang, "Discovery of collocation patterns: from visual words to visual phrases", IEEE CVPR, 2007.

[11] C. Ding, X. He, H. Zha, M. Gu, H. Simon, "A min-max cut algorithm for graph partitioning and data clustering", ICDM, 2001.

[12] J. Shi, J. Malik, "Normalized cuts and image segmentation", IEEE Trans. on PAMI, 2000.

[13] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, "Local features and kernels for classification of texture and object categories: a comprehensive study", Intl. Journal of Computer Vision, vol. 73, no. 2, 2007.

[14] J. Fan, Y. Gao, H. Luo, "Integrating concept ontology and multi-task learning to achieve more effective classifier training for multi-level image annotation", IEEE Trans. on Image Processing, vol. 17, no. 3, pp. 407-426, 2008.

[15] W. Hsu, L. Kennedy, S.F. Chang, "Video search reranking through random walk over document-level context graph", ACM Multimedia, 2007.