Fashion Coordinates Recommender System Using Photographs from Fashion Magazines

Page created by Anne Leonard
 
CONTINUE READING
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

                              Fashion Coordinates Recommender System
                              Using Photographs from Fashion Magazines
                           Tomoharu Iwata Shinji Watanabe Hiroshi Sawada
                                   NTT Communication Science Laboratories
                              2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan
                         {iwata.tomoharu,watanabe.shinji,sawada.hiroshi}@lab.ntt.co.jp

                         Abstract                                           from fashion magazines. We use a topic model to learn co-
                                                                            ordinate information that includes the intrinsic relationship
     Fashion magazines contain a number of pho-                             between fashion items. The proposed topic model is a proba-
     tographs of fashion models, and their clothing co-                     bilistic generative model of images for each fashion item re-
     ordinates serve as useful references. In this pa-                      gion. The learned coordinate information is used to recom-
     per, we propose a recommender system for cloth-                        mend coordinates. Intuitively, the proposed method finds ref-
     ing coordinates using full-body photographs from                       erence photographs that are similar to the query photograph,
     fashion magazines. The task is that, given a pho-                      and recommends fashion items that are similar to those in the
     tograph of a fashion item (e.g. tops) as a query,                      found reference photograph. By using photographs that are
     to recommend a photograph of other fashion items                       taken from the user’s favorite fashion magazines to train the
     (e.g. bottoms) that is appropriate to the query. With                  topic model, the proposed method can recommend personal-
     the proposed method, we use a probabilistic topic                      ized coordinates that coincide with the user’s preference.
     model for learning information about coordinates                          The proposed method is novel in the sense that it learns
     from visual features in each fashion item region.                      information about coordinates automatically from a collec-
     We demonstrate the effectiveness of the proposed                       tion of reference photographs. Some fashion recommen-
     method using real photographs from a fashion mag-                      dation methods have been proposed [Nagao et al., 2008;
     azine and two fashion style sharing services with                      Shen et al., 2007; Yonezawa and Nakatani, 2009]. For exam-
     the task of making top (bottom) recommendations                        ple, [Shen et al., 2007] proposed a recommender system for
     given bottom (top) photographs.                                        clothing to suit a given situation using item attributes such as
                                                                            brands, types (e.g. jeans) and styles (e.g. formal, trendy and
                                                                            sporty). [Nagao et al., 2008] recommends clothing relevant
1 Introduction                                                              to the weather and the situation using annotated data. These
Many people are interested in their fashion. They regularly                 methods require system developers and users to attach meta-
check trends in fashion magazines, and refer to them when                   data to clothes, which is a tiresome and time-consuming task.
buying clothes and choosing coordinates. Fashion magazines                  On the other hand, the proposed method does not require the
contain a number of photographs of fashion models, and their                attachment of meta-data; it only requires users to take pho-
clothing coordinates serve as useful references.                            tographs of their own clothes and system developers to col-
   In this paper, we propose a recommender system for cloth-                lect reference photographs. We can easily collect reference
ing coordinates using full-body photographs from fashion                    photographs from fashion magazines, online clothes stores,
magazines. The task is that, given a photograph of a fash-                  or fashion photograph sharing services, such as Chicisimo1
ion item (e.g. tops) as a query, to recommend a photograph                  and Weardrobe2.
of other fashion items (e.g. bottoms) that is appropriate to                   The remainder of this paper is organized as follows. In Sec-
the query. The proposed method can be helpful when select-                  tion 2, we present a method for detecting the top and bottom
ing clothes to wear from one’s wardrobe. Sometimes, it is                   regions in full-body photographs from a wide variety of pho-
difficult for a person to make a quick decision about stylish                tographs. In Section 3, we propose a topic model for learning
coordinates. The proposed method can also be used when                      information about coordinates from the reference full-body
purchasing clothes that match one’s own clothes, and when                   photographs. In Section 4, we describe a coordinates recom-
purchasing matching clothes for the top and bottom half of                  mendation method that uses the topic model. In Section 5, we
the body in an online store. If the store employs experts to                demonstrate the effectiveness of the proposed method by us-
provide advice, we can ask them about coordinates. How-                     ing real photographs from a fashion magazine and two style
ever, online stores have no such staff.                                     sharing services with the task of top (bottom) recommenda-
   With the proposed method, coordinate information is
                                                                               1
learned from visual features in each fashion item region (e.g.                     http://chicisimo.com
                                                                               2
top or bottom half of the body) using full-body photographs                        http://www.weardrobe.com

                                                                     2262
Table 1: Notation
                                                                      Symbol     Description
                                                                      D          set of reference full-body photographs
                                                                      R          set of regions, e.g. R = {top, bottom}
                                                                      r          region, r ∈ R
                                                                      V          number of unique visual features
                                                                         (r)
                                                                      Nd         number of features in region r of photograph d
                                                                        (r)
                                                                      xdn        nth feature in region r of photograph d,
                                                                                   (r)
                                                                                 xdn ∈ {1, · · · , V }
Figure 1: Example of top and bottom region detection using            K          number of latent topics
a face detection algorithm.                                            (r)
                                                                      zdn        latent topic of the nth feature in region r
                                                                                                      (r)
                                                                                 of photograph d, zdn ∈ {1, · · · , K}
                                                                      θd         topic proportions for photograph
                                                                                                                    d,
tions given bottom (top) photographs. Finally, we present
                                                                                 θd = {θdk }K  k=1 , θdk ≥ 0,   k θdk = 1
concluding remarks and a discussion of future work in Sec-              (r)
tion 6.                                                               φk         multinomial distribution over features
                                                                                 for topic k in region r,
                                                                                   (r)       (r)         (r)      (r)
                                                                                 φk = {φkv }Vv=1 , φkv ≥ 0, v φkv = 1
2 Top and Bottom Region Detection
The proposed method described in Sections 3 and 4 can be             3 Topic Model for Coordinates
used to recommend arbitrary fashion items, such as tops, bot-        Let D be a set of reference photographs. Each reference pho-
toms, hats, bags, shoes and accessories. However, this pa-           tograph consists of a pair of visual features, one in the top
per focuses on top (bottom) recommendations given a bottom                                              (t)  (b)          (r)
                                                                     and one in the bottom region (xd , xd ). Here, xd repre-
(top) photograph, since it is the most basic and frequently          sents visual features in region r of photograph d. The top and
occurring situation. In such cases, the proposed method re-          bottom regions can be detected using the method described in
quires a collection of pairs of top and bottom images to learn       Section 2. We can use arbitrary visual features, such as color,
coordinates. This section describes a simple method of identi-       texture and local descriptors such as the Scale Invariant Fea-
fying full-body photographs from a collection of photographs                                                                     (r)
in fashion magazines, and detecting their top and bottom re-         ture Transform (SIFT) [Lowe, 2004]. We assume that xd
                                                                                                                                  (r)
gions.                                                                                                          (r)       (r) N
                                                                     is represented by bag-of-features; i.e. xd = {xdn }n=1      d
                                                                                                                                   ,
   In addition to full-body photographs of fashion models,                    (r)
                                                                     where xdn is the nth visual feature in region r of photograph
fashion magazines include photographs of clothes, bags, ac-          d. Our notation is summarized in Table 1.
cessories, wording and logos. Therefore, we need to select              We use a topic model for learning information about co-
only full-body photographs from all the given photographs.           ordinates since the topic model can extract the intrinsic rela-
We utilize a face detection algorithm [Rowley et al., 1998;          tionship between fashion items from full-body photographs.
Viola and Jones, 2004] for this task. First, we detect the lo-       A topic model is a probabilistic generative model for dis-
cation of faces in each photograph. Next, we select full-body        crete data such as text documents. Topics in fashion pho-
photographs by using the following two rules: they must con-         tographs, for example, represent fashion categories, such
tain a face region, and they must have sufficient space for a         as casual, formal, elegant, classic and conservative. Topic
full body; the space can be calculated from the size and po-         models are successfully used for a wide variety of appli-
sition of the face region. When more than one face region is         cations including information retrieval [Blei et al., 2003;
detected, we use one that is the closest to the vertical center      Hofmann, 1999], collaborative filtering [Hofmann, 2003;
line of the photograph.                                              Iwata et al., 2009a], and modeling images [Blei and Jor-
   We determine the top and bottom regions by assuming that          dan, 2003; Cao and Fei-Fei, 2007; Iwata et al., 2009b]. The
the top region is double the width and 2.5 times the height          proposed model is an extension of latent Dirichlet allocation
of the face region, and the bottom region is double the width        (LDA) [Blei et al., 2003], which is a representative topic
and 3.5 times the height of the face region. Figure 1 shows an       model.
example of top and bottom region detection.                             The proposed topic model can learn cooccurrence informa-
   We can detect the top and bottom regions directly by using        tion about visual features in top and bottom regions. For ex-
pose estimation instead of face detection algorithms. Some           ample, a white top is likely to be coordinated with a dark blue
algorithms for detecting human body parts have been pro-             bottom when color information is used for the visual features.
posed [Ramanan, 2007; Andriluka et al., 2009]. However, we           Using the learned cooccurrence information, we recommend
utilize a face detection algorithm for simplicity, because face      coordinates as described in the next section. The proposed
detection performance is superior to that for other regions,         model assumes that photograph d has its own topic propor-
and most of the poses in fashion magazine photographs show           tions θd , which is a low dimensional intrinsic representation
the front view of a standing model.                                  of the photograph. Each topic has its own visual feature prob-

                                                              2263
distributions. The first factor on the right hand side of (1) is
                               (t)           (t)                              (t)
                           z                x                 φ (t)         β                            calculated as follows:
        α        θ
                                                   N(t)                                                                             |D|  
                                                                                                                              Γ(Kα)               k Γ(Ndk + α)
                           z (b)            x(b)               φ(b)         β(b)                                P (Z|α) =                                       , (2)
                                                    (b)                                                                       Γ(α)K              Γ(Nd + Kα)
                                                   N      D           K                                                                        d

                                                                                                         where Γ(·) is the gamma function, |D| is the size of set D,
Figure 2: Graphical model representation of the proposed                                                 Ndk is the number of  features assigned to topic k in the pho-
topic model for recommending top and bottom coordinates.                                                 tograph d and Nd = k Ndk . Similarly, the second factor is
                                                                                                         given as follows:
                                            (r)                                                                             Γ(V β (r) ) K   Γ(N (r) + β (r) )
ability depending on region φk , which represents cooccur-                                               P (X|Z, β) =                               w       kv
                                                                                                                                                                         ,
                                         (r)                                                                                    Γ(β (r) )V              (r)       (r) )
rence information. The visual feature xdn is generated ac-                                                                 r                   k   Γ(N  k +Vβ
                                               (r)                                                                                                                     (3)
cording to the region dependent probabilities φ (r) after topic                                                    (r)
                                                                      zdn                                where Nkv is the number of times feature v has been as-
 (r)
zdn is chosen from topic proportions θd .                                                                                                     (r)          (r)
                                                                                                         signed to topic k in region r, and Nk = v Nkv .
  In summary, the proposed model assumes the following                                                      The inference of the latent topics Z given reference pho-
generative process for a set of reference photographs X =                                                tographs can be efficiently computed using collapsed Gibbs
   (r)
{xd }r,d ,                                                                                               sampling [Griffiths and Steyvers, 2004]. Given the current
                                                                                                         state of all but one variable zj , where j = (d, r, n), the as-
 1. For each topic k = 1, · · · , K:
                                                                                                         signment of a latent topic to the nth feature in region r of
       (a) For each region r ∈ R:                                                                        photograph d is sampled from the following probability:
           i. Draw the visual feature probability for region r                                                                                            (r)
               (r)                                                                                                                                   Nkxj \j + β (r)
              φk ∼ Dirichlet(β (r) )                                                                       P (zj = k|D, Z\j ) ∝ (Ndk\j + α) ·                          ,   (4)
                                                                                                                                                          (r)
 2. For each reference photograph d ∈ D:                                                                                                             Nk\j + V β (r)

       (a) Draw topic proportions θd ∼ Dirichlet(α)                                                      where \j represents the count when excluding the jth feature.
       (b) For each region r ∈ R:                                                                           The parameters α and β (r) can be estimated by maximiz-
                                                                                        (r)              ing the joint distribution (1) by using the fixed-point iteration
            i. For each feature in region r, n = 1, · · · , Nd :                                         method described in [Minka, 2000] as follows:
                                      (r)
            A. Draw topic zdn ∼ Multinomial(θd )                                                                           
                                                                                                                                    Ψ(Nkd + α) − |D|KΨ(α)
                                           (r)                              (r)                                   α ← α d k                                     ,      (5)
            B. Draw feature xdn ∼ Multinomial(φ                              (r)    )                                    K ( d Ψ(Nd + Kα) − |D|Ψ(Kα))
                                                                            zdn
                                                                                                                                        (r)     (r)
where α and β (r) are Dirichlet distribution parameters. Fig-                                                                      v Ψ(Nkv + β        ) − KV Ψ(β (r) )
                                                                                                           β (r) ← β (r)    k                                        ,
ure 2 shows a graphical model representation of the proposed                                                                            (r)       (r) ) − KΨ(V β (r) )
                                                                                                                         V      k Ψ(N  k    + V β
topic model for top and bottom regions, where shaded and un-
shaded nodes indicate observed and latent variables, respec-                                                                                                            (6)
tively. The structure of the model is the same as that of the                                            where Ψ(·) is the digamma function defined by Ψ(x) =
                                                                                                         ∂ log Γ(x)
polylingual topic model [Mimno et al., 2009], which is used                                                  ∂x     . By iterating Gibbs sampling with (4) and maxi-
for extracting topics from documents in many languages.                                                  mum likelihood estimation with (5) and (6), we can infer la-
   The proposed topic model can learn coordinate informa-                                                tent topics while optimizing the parameters.
                                                                                                                                (r)
tion about other regions in addition to top and bottom re-                                                  We can estimate φkv by using the following equation:
gions, such as hats, bags, shoes and accessories. In this
                                                                                                                                         (r)
case, a reference photograph consists of a set of visual fea-                                                                  (r)     Nkv + β (r)
                                  (r)                                                                                        φ̂kv =                   ,                    (7)
tures with multiple regions {xd }r∈R , where e.g. R =                                                                                   (r)
                                                                                                                                      Nk + V β (r)
{top, bottom, hat, bag}.
   The joint distribution of visual features and latent topics is                                        which is used for recommending coordinates as described in
described as follows:                                                                                    the next section.

            P (X, Z, |α, β) = P (Z|α)P (X|Z, β),                                          (1)            4 Coordinate Recommendation based on
                     (r)                                                            (r)                    Topic Model
where Z = {zd }r,d is a set of latent topics, zd                                              =
  (r)
        (r)
       Nd                            (r)
                                                                                                         In this section, we describe a method for recommending co-
{zdn }n=1 ,  and β = {β }r is a set of Dirichlet parame-                                                 ordinates using the topic model learned with reference pho-
ters. We can integrate out multinomial distribution parame-                                              tographs. Intuitively speaking, we recommend a bottom (top)
                     (r)
ters, {θd}d and {φk }k,r , because we use Dirichlet distri-                                              that has the closest topic proportions to those of the given top
butions for their priors, which are conjugate to multinomial                                             (bottom).

                                                                                                  2264
Let C (t) and C (b) respectively be a set of photographs                   full-body   visual features                       topic        visual       query top
                                                                               photos     for each region                    proportions    features        photo
of top and bottom garments that the user owns. The pro-
posed recommendation method finds a bottom (top) photo-
graph ĉ ∈ C (b) (ĉ ∈ C (t) ) that matches the given top (bot-                           top     bottom

tom) photograph c∗ ∈ C (t) (c∗ ∈ C (b) ). Each photograph                                                        topic
                                                                                                                                                       own bottom
                                                                                                                                                         photos
                                      (r)
is associated with visual features xc , c ∈ C (r) . Note that                                                    model
                   (r)
photographs in C       need not be paired; they can be pho-
tographs of a shirt, a blouse or a skirt. We can create those
data from full-body photographs of the user by using the re-
gion detection method described in Section 2. Here, we aim
to recommend coordinates from among the clothes that the                                           recommended
user owns. With recommendation in an online store, C (r)                                              bottom
                                                                                                                                   find a bottom that has the
can be the store’s photographs.                                                                                          recommend closest topic proportions to
                                                                                                                           system  the query top
   In the following, we explain the case of a bottom recom-
mendation given a top. We can recommend a top given a
bottom in the same way. First, we estimate topic proportions                 Figure 3: Framework of proposed coordinate recommenda-
θc∗ for a given top photograph and the user’s own bottom                     tion method.
photographs C (b) using the topic model learned with refer-
                             (r)
ence photographs Φ̂ = {φ̂k }k,r . The assignment of a latent
topic zj , where j = (c, r, n), is sampled as follows:                       women’s fashion magazine. There were 14,813 photographs
                                                                             including photographs of clothes, bags, shoes, cosmetics
                         (r)
      P (zj = k|x(r)
                 c , zc\j , Φ̂) ∝ (Nkc\j + α) · φ̂kxj ,         (8)          and accessories as well as full-body fashion model pictures.
                                                                             Chicisimo and Weardrobe are fashion community web ser-
        (r)           N (r)                                                  vices for women in which users can share style photographs.
where zc = {zcn}n=1     c
                            represents a set of latent topics in
                                                                             The communities have a rule stating that shared photographs
region r of photograph c. Then, we recommend the bottom ĉ
                                                                             should show the full-body of the user. We used 1,059 and
that has the closest topic proportions to those of the given top
                                                                             1,410 photographs from Chicisimo and Weardrobe data, re-
photograph as follows:
                                                                             spectively.
                  ĉ = arg max S(θc∗ , θc ),                    (9)
                              c∈C (b)                                        5.2      Region Detection
where S(·, ·) represents a similarity function. For the similar-             In the Magazine data, 2,062 photographs out of 14,813 were
ity, we use the following negative KL divergence:                            estimated as full-body by the proposed method described
                                                    θc ∗ k                   in Section 2. We checked the results for top and bot-
              S(θc∗ , θc ) = −          θc∗ k log          ,   (10)          tom region extraction from the 2,062 photographs by hu-
                                                    θck                      man judgement. The regions were adequately extracted
                                   k
                                                                             in 1,502 photographs (73%). We also checked the Chi-
since topic proportions θc are a probability distribution, and
                                                                             cisimo and Weardrobe data, and the regions were adequately
the KL divergence is one of the most commonly used dis-
                                                                             extracted from 75%(797/1,059) and 71%(1,004/1,410) of
tances for probability distributions. Although the recommen-
                                                                             the photographs, respectively. For the Magazine data, we
dation of one photograph is assumed above, we can also rec-
                                                                             checked randomly sampled 221 photographs that were de-
ommend n photographs with the highest S(θc∗ , θc ) values.
                                                                             termined as not being full-body using the proposed method.
   Figure 3 shows the framework of the proposed coordinate
                                                                             81%(180/221) of the photographs were truly not full-body.
recommendation method. In the framework, full-body pho-
                                                                             Although the proposed detection method is simple, it can de-
tographs for reference are converted to visual features for
                                                                             tect top and bottom regions with the high true positive rate,
each region, and the proposed topic model is learned using
                                                                             and the high true negative rate. In the following recommenda-
the visual features. The query top and the user’s own bottom
                                                                             tion experiments, we used 1,502, 797 and 1,004 photographs
photographs are also converted to visual features, and topic
                                                                             respectively from the Magazine, Chicisimo and Weardrobe
proportions of the query and the user’s own photographs are
                                                                             data that were checked by human judgement.
estimated using the learned topic model. Then, a bottom is
recommended that has the closest topic proportions to those                  5.3      Baseline Recommendation Method based on
of the query top.
                                                                                      Visual Feature Similarities
5 Experiments                                                                Before describing our experimental coordinate recommenda-
                                                                             tion results, we describe a baseline method for recommenda-
5.1   Data Set                                                               tion based on visual feature similarities and compare it with
We evaluated our proposed coordinate recommender system                      the proposed topic model based method.
based on the topic model using three real photograph col-                       Assume that a photograph of a top c∗ is given as a query.
lections: Magazine, Chicisimo and Weardrobe. The Maga-                       We would like to find a photograph of a bottom that matches
zine data consists of photographs taken from 32 copies of a                  a given top from a set of the user’s own bottom photographs

                                                                      2265
C (b) . First, the baseline method finds the top photograph dˆ              ure 5 shows the average similarities and the standard devia-
that is closest to the given top c∗ from a collection of refer-            tions with different numbers of topics. The proposed method
ence photographs D as follows:                                             achieved higher accuracies and similarities for all three data
                                                                           sets compared with the baseline and random methods. This
                 dˆ = arg max F (x(t)
                                        (t)
                                  c∗ , xd ),                (11)           result indicates that the proposed method can learn infor-
                           d∈D
                                                                           mation about coordinates adequately using the topic model.
where F (·, ·) represents a similarity function, For the similar-          Since the proposed method uses the similarities of extracted
ity function, we used the cosine similarity in the experiment,             low dimensional intrinsic representations, it is more robust
which is a commonly used similarity function in the bag-of-                than the baseline method, which naively uses similarities be-
features representation. Then, the baseline method recom-                  tween high dimensional visual features.
mends bottom photograph ĉ that is the most similar to the                    When we analyze the extracted topics, photographs with
bottom in the found reference photograph dˆ from the user’s                similar coordinates form clusters. For example, a topic is
own bottom photographs C (b) :                                             frequently assigned to photographs showing white one-piece
                                                                           suits, and another topic is frequently assigned to photographs
                                     (b)
                ĉ = arg max F (xd̂ , x(b)
                                       c ).                 (12)           showing white t-shirts and blue jeans, which correspond to
                         c∈C (b)
                                                                           formal and casual topics, respectively.
In the same way, the baseline method can recommend a                          The computational time of a query response was 0.04 sec-
matching top given a bottom.                                               ond, which includes estimating topic proportions of the query
                                                                           and sorting photographs by their similarities.
5.4   Recommendation
We evaluated the proposed method for recommending coor-                    6 Conclusion
dinates based on a topic model with a baseline method based                We have proposed a topic model for recommending fashion
on the similarities between visual features.                               coordinates using a collection of photographs from fashion
   We used a color histogram for the visual features as de-                magazines. We have confirmed experimentally that the pro-
scribed in [Swain and Ballard, 1991]. Specifically, the three               posed method can learn information about coordinates appro-
color axes used for the histogram are defined as follows:                   priately, and can be used for recommending fashion coordi-
rg = r − g, by = 2b − r − g, wb = r + g + b, where r,                      nates.
g and b represent red, green and blue signals, respectively.                  Although our results have been encouraging to date, our
The wb axis, which indicates intensity, was divided into eight             approach can be further improved in a number of ways. First,
sections, and the rg and by axes were divided into 16 sec-                 we would like to improve the region detection method. We
tions, and therefore the dimension of the visual features was              utilized a face detection algorithm for simplicity to detect
V = 2, 048.                                                                top and bottom regions. However, top and bottom regions
   We generated 100 sets of training and test data, in which 10            may not appear immediately below the face region when the
% of the samples were randomly selected as test data. With                 fashion model leans, sits or lies down; the border of top and
the proposed method, the training data set is assumed to be a              bottom regions differs depending on the clothes, the fash-
collection of reference photographs D, and the parameters of               ion model’s style, and the angle at which the photographs
the topic model are inferred using the training data. The test             were taken. By detecting regions directly using pose esti-
data are assumed to be sets of photographs C (t) and C (b)                 mation algorithms, the method becomes applicable to pho-
owned by a user. In the evaluation, each of the top (bot-                  tographs with a wide variety of poses. Second, the proposed
tom) photographs in the test data is given as the input, and               topic model can be used to recommend coordinates for fash-
a matching bottom (top) photograph is recommended, while                   ion items other than tops and bottoms, such as shoes, bags,
the bottom (top) photographs are held out. We assumed that                 hats and accessories. Third, we would like to investigate ap-
the correct answer for a recommendation given a top (bottom)               propriate visual features for recommending coordinates. In
photograph was the bottom (top) photograph that was paired                 our experiments, we used color information for the features.
with the given photograph.                                                 However, the texture and the shape of items may also consti-
   For the evaluation, we used the following two measure-                  tute important information. Finally, it is valuable to recom-
ments: n-best accuracy and similarity. The n-best accuracy                 mend coordinates that match the user’s preference and situ-
represents the rate of recommending the correct bottom (top)               ation, such as business, dating or formal. This personalized
photograph with n recommendations given a test top (bottom)                recommendation can be achieved by using the history of the
photograph. The similarity evaluation measurement is the av-               clothes the user has worn and the context.
erage similarity between the recommended clothing and the
held-out paired clothing.                                                  References
   Figure 4 shows the average accuracies and the standard de-              [Andriluka et al., 2009] Mykhaylo Andriluka, Stefan Roth, and
viations over 100 experiments on the three data sets. The Pro-                Bernt Schiele. Pictorial structures revisited: People detection and
posed plot represents results for the proposed method where                   articulated pose estimation. In CVPR, pages 1014–1021, 2009.
the number of topics K = 20, the Baseline plot represents                  [Blei and Jordan, 2003] David M. Blei and Michael I. Jordan. Mod-
results for the baseline method based on the similarities be-                 eling annotated data. In SIGIR ’03: Proceedings of the 26th
tween visual features, and the Random plot represents re-                     Annual International ACM SIGIR Conference on Research and
sults for a method that recommends clothes randomly. Fig-                     Development in Information Retrieval, pages 127–134, 2003.

                                                                    2266
0.8                                                                              0.9                                                                              0.9
                       Proposed                                                                         Proposed                                                                         Proposed
               0.7      Baseline                                                                0.8      Baseline                                                                0.8      Baseline
                        Random                                                                           Random                                                                           Random
               0.6                                                                              0.7                                                                              0.7
                                                                                                0.6                                                                              0.6
               0.5
  accuracy

                                                                                   accuracy

                                                                                                                                                                    accuracy
                                                                                                0.5                                                                              0.5
               0.4
                                                                                                0.4                                                                              0.4
               0.3
                                                                                                0.3                                                                              0.3
               0.2                                                                              0.2                                                                              0.2
               0.1                                                                              0.1                                                                              0.1
                0                                                                                0                                                                                0
                           5        10         15        20         25        30                            5        10         15        20         25        30                            5        10         15        20         25        30
                                                n                                                                                n                                                                                n

                                (a) Magazine                                                                     (b) Chicisimo                                                                    (c) Weardrobe

                                                        Figure 4: n-best accuracies with different numbers of recommendations.

               0.55                                                                              0.5                                                                             0.55

                0.5                                                                                                                                                               0.5
                                                                                                0.45
                                                                                                                                                                                 0.45
               0.45
                                                                                                 0.4
  similarity

                                                                                   similarity

                                                                                                                                                                    similarity
                                                                                                                                                                                  0.4
                0.4
                                                                                                0.35                                                                             0.35
               0.35
                                                                                                                                                                                  0.3
                                                        Proposed                                 0.3                                     Proposed                                                                         Proposed
                0.3                                                                                                                                                              0.25
                                                         Baseline                                                                         Baseline                                                                         Baseline
                                                         Random                                                                           Random                                                                           Random
               0.25                                                                             0.25                                                                              0.2
                      10   15      20    25     30       35   40         45   50                       10   15      20    25     30       35   40         45   50                       10   15      20    25     30       35   40         45   50
                                              #topics                                                                          #topics                                                                          #topics

                                (a) Magazine                                                                     (b) Chicisimo                                                                    (c) Weardrobe

                                                                         Figure 5: Similarities with different numbers of topics.

[Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I.                                                                  [Mimno et al., 2009] David Mimno, Hanna M. Wallach, Jason
   Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.                                                                     Naradowsky, David A. Smith, and Andrew McCallum. Polylin-
[Cao and Fei-Fei, 2007] L. Cao and L. Fei-Fei. Spatially coherent                                                                   gual topic models. In EMNLP ’09, pages 880–889, 2009.
   latent topic model for concurrent object segmentation and clas-                                                               [Minka, 2000] Thomas Minka. Estimating a Dirichlet distribution.
   sification. In Proceedings of IEEE Intern. Conf. in Computer                                                                      Technical report, M.I.T., 2000.
   Vision (ICCV), 2007.                                                                                                          [Nagao et al., 2008] S. Nagao, S. Takahashi, and J. Tanaka. Mirror
[Griffiths and Steyvers, 2004] Thomas L. Griffiths and Mark                                                                           appliance: Recommendation of clothes coordination in daily life.
   Steyvers. Finding scientific topics. Proceedings of the National                                                                  In HFT ’08, pages 367–374, 2008.
   Academy of Sciences, 101 Suppl 1:5228–5235, 2004.                                                                             [Ramanan, 2007] Deva Ramanan. Learning to parse images of ar-
[Hofmann, 1999] Thomas Hofmann. Probabilistic latent semantic                                                                       ticulated bodies. In Advances in Neural Information Processing
   analysis. In UAI ’99: Proceedings of 15th Conference on Uncer-                                                                   Systems, volume 19, 2007.
   tainty in Artificial Intelligence, pages 289–296, 1999.                                                                        [Rowley et al., 1998] Henry A. Rowley, Shumeet Baluja, and
[Hofmann, 2003] Thomas Hofmann. Collaborative filtering via                                                                          Takeo Kanade. Neural network-based face detection. IEEE
   Gaussian probabilistic latent semantic analysis. In Proceedings                                                                  Transactions on Pattern Analysis and Machine Intelligence,
   of the 26th Annual International ACM SIGIR Conference on Re-                                                                     20(1):23–38, 1998.
   search and Development in Information Retrieval, pages 259–
                                                                                                                                 [Shen et al., 2007] Edward Shen, Henry Lieberman, and Francis
   266, 2003.
                                                                                                                                    Lam. What am I gonna wear?: scenario-oriented recommenda-
[Iwata et al., 2009a] Tomoharu Iwata, Shinji Watanabe, Takeshi                                                                      tion. In IUI ’07, pages 365–368, 2007.
   Yamada, and Naonori Ueda. Topic tracking model for analyzing
                                                                                                                                 [Swain and Ballard, 1991] Michael J. Swain and Dana H. Ballard.
   consumer purchase behavior. In IJCAI ’09: Proceedings of 21st
   International Joint Conference on Artificial Intelligence, pages                                                                  Color indexing. Int. J. Comput. Vision, 7(1):11–32, 1991.
   1427–1432, 2009.                                                                                                              [Viola and Jones, 2004] Paul Viola and Michael J. Jones. Robust
[Iwata et al., 2009b] Tomoharu Iwata, Takeshi Yamada, and                                                                           real-time face detection. Int. J. Comput. Vision, 57:137–154, May
   Naonori Ueda. Modeling social annotation data with content rel-                                                                  2004.
   evance using a topic model. In NIPS ’09, pages 835–843, 2009.                                                                 [Yonezawa and Nakatani, 2009] Yuri Yonezawa and Yoshio
[Lowe, 2004] David G. Lowe. Distinctive image features from                                                                         Nakatani. Fashion support from clothes with characteristics. In
   scale-invariant keypoints. International Journal of Computer Vi-                                                                 HCI International ’09, pages 323–330, 2009.
   sion, 60(2):91–110, 2004.

                                                                                                                          2267
You can also read