
DeepStyle: Multimodal Search Engine for Fashion and Interior Design

Ivona Tautkute1,3, Tomasz Trzciński2,3, Aleksander Skorupa3, Lukasz Brocki1 and Krzysztof Marasek1

1 Polish-Japanese Academy of Information Technology, Warsaw, Poland
2 Warsaw University of Technology, Warsaw, Poland
3 Tooploox, Warsaw, Poland

arXiv:1801.03002v1 [cs.CV] 8 Jan 2018

Abstract—In this paper, we propose a multimodal search engine that combines visual and textual cues to retrieve items from a multimedia database that are aesthetically similar to the query. The goal of our engine is to enable intuitive retrieval of fashion merchandise such as clothes or furniture. Existing search engines treat textual input only as an additional source of information about the query image and do not correspond to the real-life scenario where the user looks for "the same shirt but made of denim". Our novel method, dubbed DeepStyle, mitigates those shortcomings by using a joint neural network architecture to model contextual dependencies between features of different modalities. We prove the robustness of this approach on two different challenging datasets of fashion items and furniture, where our DeepStyle engine outperforms baseline methods by 18-21% on the tested datasets. Our search engine is commercially deployed and available through a Web-based application.

Index Terms—Multimedia computing, Multi-layer neural network, Multimodal search, Machine learning

I. INTRODUCTION

A multimodal search engine allows the user to retrieve a set of items from a multimedia database according to their similarity to the query in more than one feature space, e.g. textual and visual or audiovisual. This problem can be divided into smaller subproblems by using separate solutions for each modality. The advantage of this approach is that both textual and visual search engines have been developed for several decades now and have reached a certain level of maturity. Traditional approaches such as Video Google [2] have been improved, adapted and deployed in industry, especially in the ever-growing domain of e-commerce. Major online retailers such as Zalando and ASOS already offer visual search functionalities to help users find products that they want to buy [3]. Furthermore, interactive multimedia search engines are omnipresent in mobile devices and allow for speech, text or visual queries [4], [5], [6].

Fig. 1. Example of a typical multimodal query sent to a search engine for fashion items. By modeling a common multimodal space with a deep neural network, we can provide a more flexible and natural user interface while retrieving results that are semantically correct, as opposed to the results of a search based on the state-of-the-art Visual Search Embedding model [9].

Nevertheless, using a separate search engine per modality suffers from one significant shortcoming: it prevents the users from specifying a very natural query such as "I want this type of dress but made of silk". This is mainly due to the fact that the notion of similarity in the separate spaces of different modalities differs from the one in a joint multimodal space. Furthermore, modeling this highly dimensional multimodal space requires more complex training strategies and thoroughly annotated datasets. Finally, the right balance between the importance of the various modalities in the context of a user query is not obvious and hard to estimate a priori. Although several multimodal representations have been proposed in the context of a search for fashion items, they typically focus on using other modalities as an additional source of information, e.g. to increase the classification accuracy of compatible and non-compatible outfits [7].

To address the above-mentioned shortcomings of the currently available search engines, we propose a novel end-to-end method that uses a neural network architecture to model the joint multimodal space of database objects. This method is an extension of our previous work [8] that blended multimodal results. Although in this paper we focus mostly on fashion items such as clothes and accessories as well as furniture, our search engine is in principle agnostic to object types and can be successfully applied in many other multimedia applications. We call our method DeepStyle and show that, thanks to its ability to jointly model both visual and textual modalities, it allows for more intuitive search queries while providing higher accuracy than the competing approaches. We prove the superiority of our method over single-modality approaches and a state-of-the-art multimodal representation using two large-scale datasets of fashion and furniture items. Finally, we deploy our DeepStyle search engine as a web-based application.

To summarize, the contributions of our paper are threefold:

• In addition to the results obtained with blending methods that combine multiple modalities, we propose a novel multimodal end-to-end search engine based on a deep neural network architecture. It is robust to domain changes and outperforms the state-of-the-art methods on two diversified datasets by 18% and 21%, respectively.
• Our system is deployed in production and available through a Web-based application.
• Last but not least, we introduce a new interior design dataset of furniture items offered by IKEA, an international furniture manufacturer, which contains both visual and textual meta-data of over 2 000 objects from almost 300 rooms. We plan to release the dataset to the public.

The remainder of this work is organized in the following manner. In Sec. II we discuss related work. In Sec. III we present a set of methods based on blending single-modality search results that serve as our baseline. In Sec. IV we introduce our DeepStyle multimodal approach as well as its extension. In Sec. V we present the datasets used for evaluation and in Sec. VI we evaluate our method and compare its results against the baseline. Sec. VIII concludes the paper.

II. RELATED WORK

In this section, we first give an overview of the current visual search solutions proposed in the literature. Secondly, we discuss several approaches used in the context of textual search. We then present works related to defining similarity in the context of aesthetics and style, as it directly pertains to the results obtained using our proposed method. Finally, we present an overview of existing search methods in the fashion domain, as this topic is gaining popularity.

A. Visual Search

Traditionally, image-based search methods drew their inspiration from textual retrieval systems [10]. By using the k-means clustering method in the space of local feature descriptors such as SIFT [11], they mimic textual word entities with so-called visual words. Once the mapping from salient image keypoints to visually representative words was established, typical textual retrieval methods such as Bag-of-Words [12] could be used. Video Google [2] was one of the first visual search engines that relied on this concept. Several extensions of this concept were proposed, e.g. spatial verification [13], which checks the geometrical correctness of the initial query, or fine-grained image search [14], which accounts for semantic attributes of visual words.

Successful applications of deep learning in other computer vision tasks have motivated researchers to apply those methods also to visual search. Although preliminary results did not seem promising due to the lack of robustness to cropping, scaling and image clutter [15], later works proved the potential of those methods in the domain of image-based retrieval [16]. Other deep architectures such as Siamese networks were also proposed and proved successful when applied to content-based image retrieval [17].

Nevertheless, all of the above-mentioned methods suffer from one important drawback: they do not take into account the stylistic similarity of the retrieved objects, which is often a different problem from visual similarity. Items that are similar in style do not necessarily have to be close in the visual feature space.

B. Textual Search

The first methods proposed for textual information retrieval were based on token counts, e.g. Bag-of-Words [12] or TF-IDF [18].

Later, a new type of representation called word2vec was proposed by Mikolov et al. [19]. The models in the word2vec family, namely Continuous Bag-of-Words (CBOW) and Skip-Gram, allow the token representation to be learned from its local context. To also capture the global context of a token, GloVe [20] was introduced. GloVe takes advantage of information from both the local context and the global co-occurrence matrix, thus providing a powerful and discriminative representation of textual data. Still, not all queries can be represented with text only: style similarities that are apparent in visual examples may lack a clear textual definition, and the same concepts may be expressed with different synonyms.

C. Stylistic Similarity

Comparing the style similarity of two objects or scenes is one of the challenges that have to be addressed when training a machine learning model for interior design or fashion retrieval applications. This problem is far from being solved, mainly due to the lack of a clear metric defining how to measure style similarity. Various approaches have been proposed for defining a style similarity metric. Some of them focus on evaluating similarity between shapes based on their structures [21], [22] and on measuring the differences between scales and orientations of bounding boxes. Other approaches propose a structure-transcending style similarity that accounts for element similarity [23]. In this work, we follow [24] and define style as a distinctive manner which permits the grouping of works into related categories. We enforce this definition by including context information that groups different objects together (clothing items in an outfit, or furniture shown in a room picture in an interior design catalog). This allows us to take a data-driven approach that measures style similarity without using hand-crafted features and predefined styles.

D. Deep Learning in Fashion

A significant number of works have been published in the domain of fashion item retrieval and recommendation due to the potential of their application in the highly profitable e-commerce business. Some of them focus on the notion of fashionability, e.g. [26] rates a user's photo in terms of how fashionable it is and provides fashion recommendations that would increase the overall outfit score. Others focus on fashion item retrieval from an online database when presented with user photos taken "in the wild", usually with phone cameras [27]. Finally, there is ongoing research on clothing co-segmentation [28], [29], which is an important preprocessing step for better item retrieval results.

Kiros et al. [9] present an encoder-decoder pipeline that learns a joint multimodal embedding (VSE) from images and text, which is later used to generate text captions for custom images.

Fig. 2. A high-level overview of our Simple Style Search Engine. The visual search block uses the YOLO 9000 object detection algorithm [25] and the outputs of a pretrained deep neural network. The textual block allows the user to further specify the search criteria with text and increases the contextual importance of the retrieved results. Finally, blending the visual and textual search results significantly improves the stylistic and aesthetic similarity of the retrieved items.

Their approach is inspired by successes in Neural Machine Translation (NMT) and perceives the visual and textual modalities as the same concept described in different languages. The proposed architecture consists of LSTM RNNs for encoding sentences, a CNN for encoding images and a structure-content neural language model (SC-NLM) for decoding. The authors show that their learned multimodal embedding space preserves semantic regularities in terms of vector space arithmetic, e.g. an image of a blue car - "blue" + "red" is near images of red cars. However, results for this task are only shown on a few example images. We would like to leverage their work and numerically evaluate multimodal query retrieval, specifically in the domain of fashion and interior design.

Han et al. [30] train a bi-LSTM model to predict the next item in outfit generation. Moreover, they learn a joint image-text embedding by regressing image features to their semantic representations, aiming to inject attribute and category information as a regularization for training the LSTM. It should be noted, however, that their approach to stylistic compatibility is different from ours in that they optimize for the generation of a complete outfit (e.g. it should not contain two pairs of shoes), whereas we would like to retrieve items of similar style regardless of the category they belong to. Also, they evaluate compatibility with a "fill-in-the-blanks" test that does not incorporate retrieval from the full dataset of items. Only a few example results are illustrated and no quantitative evaluation is presented.

Numerous works focus on the task of generating a compatible outfit from available clothing products [7], [30]. However, none of the related works focus on the notion of multimodality and multimodal fashion retrieval. Text information is only used as an alternative query and not as complementary information that extends the query about the searched object. Finally, the research community has not yet paid much attention to defining or evaluating style similarity.

III. FROM SINGLE TO MULTIMODAL SEARCH

In this section, we present the baseline style search engine model introduced in [8], which is the basis for our current research. It is built on top of two single-modality modules. More precisely, two searches are run independently for the image and text queries, resulting in two initial sets of results. Then, the best matches are selected from the initial pool of results according to blending methods: re-ranking based on visual feature similarity to the query image as well as on contextual similarity (items that appear together more often in the same context).

As input, the baseline style search engine takes two types of query information: an image containing one or more objects, e.g. a picture of a dining room, and a textual query used to specify search criteria, e.g. cozy and fluffy. If needed, an object detection algorithm is run on the uploaded picture to detect objects of classes of interest such as chairs, tables or sofas. Once the objects are detected, their regions of interest are extracted as picture patches and run through the visual search method. For queries that already represent a single object, no object detection is required. Simultaneously, the engine retrieves the results for the textual query. With all visual and textual matches retrieved, our blending algorithm ranks them depending on their similarity in the respective feature spaces and returns the resulting list of stylistically and aesthetically similar objects. Fig. 2 shows a high-level overview of our Style Search Engine. Below, we describe each part of the engine in more detail.

Fig. 3. Architecture comparison for choosing the object detection model in Visual Search. The recall is plotted as a function of the number of returned items k. The best retrieval results are achieved for YOLO object detection and visual feature extraction from ResNet-50.

A. Visual Search

Instead of using an entire image of the interior as a query, our search engine applies an object detection algorithm as a pre-processing step. This way, not only can we retrieve the results with higher precision, since we search only within a limited space of same-class pictures, but we also do not need to know the object category beforehand. This is in contrast to other visual search engines proposed in the literature [17], [31], where the object category is known at test time or inferred from textual tags provided by human labeling.

For object detection, we use YOLO 9000 [25], which is based on the DarkNet-19 model [32], [25]. The bounding boxes are used to generate regions of interest in the pictures, and the search is performed on the extracted parts of the image.

Once the regions of interest are extracted, we feed them to a pretrained deep neural network to obtain a vector representation. More precisely, we use the outputs of fully connected layers of neural networks pretrained on the ImageNet dataset [33]. We then normalize the extracted output vectors so that their L2 norm is equal to 1. We search for similar images within the dataset using this representation, retrieving a number of closest vectors in terms of Euclidean distance.

To determine the pretrained neural network architecture providing the best performance, we conduct several experiments that are illustrated in Fig. 3. As a result, we choose ResNet-50 as our visual feature extraction architecture.
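The feature-extraction and retrieval steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes torchvision (version 0.13 or later) for the ImageNet-pretrained ResNet-50, uses the 2048-dimensional pooled output in place of the fully connected layer features, and performs brute-force Euclidean search over precomputed catalog embeddings.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # ResNet-50 pretrained on ImageNet; dropping the classification head
    # leaves the 2048-d pooled descriptor used as the image representation.
    backbone = models.resnet50(weights="IMAGENET1K_V1")
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def embed_image(crop: Image.Image) -> torch.Tensor:
        """Embed a detected object crop and L2-normalize it (Sec. III-A)."""
        with torch.no_grad():
            v = backbone(preprocess(crop).unsqueeze(0)).squeeze(0)
        return v / v.norm()

    def visual_search(query_vec: torch.Tensor, catalog: torch.Tensor, k: int = 5) -> torch.Tensor:
        """Return indices of the k catalog rows closest to the query in Euclidean distance.
        catalog is an (N, 2048) tensor of embed_image outputs for all catalog crops."""
        dists = torch.cdist(query_vec.unsqueeze(0), catalog).squeeze(0)
        return torch.topk(dists, k, largest=False).indices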
                                                                                  In order to leverage the information about different item
B. Text Query Search

To extend the functionality of our Style Search Engine, we implement a text query search that allows the user to further specify the search criteria. This part of our engine is particularly useful when trying to search for product items that represent abstract concepts such as minimalism, Scandinavian style, casual and so on.

In order to perform such a search, we need to find a mapping from textual information to the vector representation of an item, i.e. from the space of textual queries to the space of items in the database. The resulting representation should live in a multidimensional space where stylistically similar objects reside close to each other.

To obtain the above-defined space embedding, we use a Continuous Bag-of-Words (CBOW) model that belongs to the word2vec model family [19]. In order to train the model, we use the item descriptions available as metadata supplied with the catalog images. Such descriptions are available as part of both the IKEA and the Polyvore datasets, which we describe in detail in Sec. V. The textual description embedding is calculated as the mean vector of the individual word embeddings.

In order to optimize the hyper-parameters of CBOW for item embedding, we run a set of initial experiments on the validation dataset and use cluster analysis of the embedding results. We select the parameters that minimize intra-cluster distances while maximizing inter-cluster distances.

Having found such a mapping, we can perform the search by returning the k nearest neighbors of the transformed query in the space of product descriptions from the database, using cosine similarity as the distance measure.
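A small sketch of this step, assuming the gensim (version 4 or later) implementation of word2vec and toy tokenized descriptions in place of the real IKEA/Polyvore metadata; the hyper-parameter values are illustrative, not the ones selected by the cluster analysis above.

    import numpy as np
    from gensim.models import Word2Vec

    # Toy tokenized catalog descriptions standing in for the real metadata.
    descriptions = [
        ["soft", "grey", "wool", "armchair"],
        ["black", "steel", "dining", "table"],
        ["cosy", "knitted", "wool", "blanket"],
    ]

    # CBOW model (sg=0), as in Sec. III-B.
    w2v = Word2Vec(sentences=descriptions, vector_size=64, window=5,
                   min_count=1, sg=0, epochs=50)

    def embed_text(tokens):
        """Description embedding = mean of the individual word embeddings."""
        vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vectors, axis=0)

    def text_search(query_tokens, k=2):
        """Return indices of the k descriptions most similar to the query (cosine)."""
        q = embed_text(query_tokens)
        scores = []
        for tokens in descriptions:
            v = embed_text(tokens)
            scores.append(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        return np.argsort(scores)[::-1][:k]

    print(text_search(["cosy", "wool"]))
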
C. Context Space Search

In order to leverage the information about the compatibility of different items, which is available as context data (an outfit or a room), we train an additional word2vec model (again using CBOW), in which different products are treated as words. Compatible sets of products appearing in the same context are treated as sentences. It is worth noticing that our context embedding is trained without relying on any linguistic knowledge. The only information that the model sees during training is whether given objects appeared in the same set.

Fig. 4 shows the obtained feature embeddings for the IKEA dataset, visualized with the t-SNE dimensionality reduction algorithm [34]. One can see that some classes of objects, e.g. those that appear in a bathroom or a baby room, are clustered around the same region of the space.

Fig. 4. t-SNE visualization of the interior items' embedding using context information only. Distinctive classes of objects, e.g. those that appear in a bathroom or a baby room, are clustered around the same region of the space. No text descriptions nor information about image room categories was used during training.
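The context embedding can be reproduced with the same gensim word2vec API by feeding product identifiers instead of words; the rooms and product IDs below are made up for illustration.

    from gensim.models import Word2Vec

    # Each room (or outfit) is a "sentence" whose tokens are product IDs (Sec. III-C).
    rooms = [
        ["chair_01", "table_07", "lamp_03"],
        ["sofa_02", "table_07", "rug_05"],
        ["crib_01", "lamp_03", "rug_09"],
    ]

    # CBOW over product IDs; no linguistic information is used during training.
    ctx = Word2Vec(sentences=rooms, vector_size=32, window=10,
                   min_count=1, sg=0, epochs=200)

    # Products that co-occur in similar contexts end up close in the embedding space.
    print(ctx.wv.most_similar("table_07", topn=3))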

D. Blending Methods

Let us denote by p = (i, t) the representation of a product stored in the database P. This representation consists of a catalog image i ∈ I and a textual description t ∈ T. The multimodal query provided by the user is given by Q = (i_q, t_q), where i_q ∈ I is the visual query and t_q ∈ T is the textual query.

We run a series of experiments with blending methods, aiming to combine the retrieval results from the various modalities in the most effective way. To that end, we use the following blending approaches.

Late-fusion Blending: In the simplest case, we retrieve the top k items independently for each modality and take their union as the final set of results. We do not use the contextual information here.

Early-fusion Blending: In order to use the full potential of our multimodal search engine, we combine the retrieval results of the visual, textual and contextual search engines in a specific order. We optimize this order to present the most stylistically coherent sets to the user. To that end, we propose the Early-fusion Blending approach, which uses features extracted from different modalities in a sequential manner.

More precisely, for a multimodal query (i_q, t_q), an initial set of results R_vis is returned for the visual modality: the closest images to i_q in terms of the Euclidean distance d_vis between their visual representations. Then, we retrieve contextually similar products R_cont that are close to the R_vis results in terms of the distance d_cont (the context space search described in Sec. III-C). Finally, R_vis and R_cont form a list of candidate items from which we select the results R by extracting the textual features (word2vec vectors) from the items' descriptors and ranking them by the distance d_text to the textual query.

This process can be formulated as:

R_{vis} = \Big\{ p : \underset{p_1,\ldots,p_{n_1} \in P}{\operatorname{argmin}} \sum_{i=1}^{n_1} d_{vis}(i_q, i_i) \Big\} \Rightarrow

R_{cont} = \bigcup_{p \in R_{vis}} \Big\{ p : \underset{p_1,\ldots,p_{n_2} \in P}{\operatorname{argmin}} \sum_{i=1}^{n_2} d_{cont}(p, p_i) \Big\} \Rightarrow

R_{cand} = R_{cont} \cup R_{vis}

R = \Big\{ p : \underset{p_1,\ldots,p_{n_3} \in R_{cand}}{\operatorname{argmin}} \sum_{i=1}^{n_3} d_{text}(t_q, t_i) \Big\}    (1)

where n_1, n_2 and n_3 are parameters to be chosen empirically.
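A compact sketch of the Early-fusion Blending of Eq. (1), using brute-force NumPy distance computations; the choice of Euclidean distance for d_text and the default values of n_1, n_2 and n_3 are illustrative assumptions.

    import numpy as np

    def euclidean(matrix, vector):
        return np.linalg.norm(matrix - vector, axis=1)

    def early_fusion(i_q, t_q, vis, ctx, txt, n1=20, n2=5, n3=10):
        """Sequential blending of Eq. (1): visual -> context -> text.
        vis, ctx and txt are (N, d) matrices of per-product features."""
        r_vis = np.argsort(euclidean(vis, i_q))[:n1]            # closest items by d_vis
        r_cand = set(r_vis)                                      # R_cand = R_vis ∪ R_cont
        for p in r_vis:                                          # context neighbours of each visual hit
            r_cand |= set(np.argsort(euclidean(ctx, ctx[p]))[:n2])
        cand = np.array(sorted(r_cand))
        order = np.argsort(euclidean(txt[cand], t_q))[:n3]       # re-rank candidates by d_text
        return cand[order]

    # Toy usage with random feature matrices for 100 products.
    rng = np.random.default_rng(0)
    vis, ctx, txt = rng.normal(size=(100, 8)), rng.normal(size=(100, 4)), rng.normal(size=(100, 6))
    print(early_fusion(vis[0], txt[3], vis, ctx, txt))
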
IV. DEEPSTYLE: MULTIMODAL STYLE SEARCH ENGINE WITH DEEP LEARNING

Inspired by recent advances in deep learning for computer vision, we experiment with end-to-end approaches that learn the embedding space jointly. In this section, we propose neural network architectures that are fed with an image and text as inputs while learning a multimodal embedding space. Such an embedding can later be used to retrieve results for a multimodal query. The first proposed architecture is a multimodal DeepStyle network that learns a common image-text embedding through a classification task. The second, the DeepStyle-Siamese network, improves over the first network by introducing a second branch with shared weights and a contrastive loss, learning to map pairs from the same outfit close to each other in the embedding space.

DeepStyle: Our proposed neural network learns a common embedding through a classification task. Our architecture, dubbed DeepStyle, is inspired by [7], which uses a multimodal joint embedding for fashion product retrieval. In contrast to their work, our goal is not to retrieve images with a text query (or vice versa) but to retrieve items for which the text query complements the image and provides additional query requirements.

Similarly to [7], our network has two inputs: image features (the output of the penultimate layer of a pretrained CNN) and text features (processed with the same word2vec model trained on the descriptions). We then optimize a classification loss to enforce the concept of semantic regularities. For this purpose, product category labels (with an arbitrary number of classes) should be present in the dataset. Unlike [7], we do not consider the image and text branches separately for predictions, but add a fully connected layer on top of the concatenated image and text embeddings that is used to predict a single class.

DeepStyle-Siamese: We also want to include context information (whether or not two items appeared in the same context) in our network. For this purpose, we design a Siamese network [35] where each branch has a dual input consisting of image and text features. Positive pairs are generated as image-text pairs from the same outfit, while unrelated pairs are obtained by randomly sampling an item (image and description) from a different outfit.

Two types of losses are optimized. A classification loss is used, as before, to help the network learn semantic regularities. In addition, minimizing a contrastive loss encourages image-text pairs from the same outfit to have a small distance between their embedding vectors, while items from different outfits are pushed to a distance larger than a predefined margin.

Formally, the contrastive loss is defined in the following manner [35]:

L_C(d, y) = (1 - y)\,\tfrac{1}{2}\,d^2 + y\,\tfrac{1}{2}\,\{\max(0,\, m - d)\}^2,    (2)

where d is the Euclidean distance between two embedded image-text vectors (i, t) and (i', t'), y is a binary label indicating whether the two vectors come from the same outfit (y = 0) or from different outfits (y = 1), and m is a predefined margin for the minimal distance between items from different outfits.

The full training loss consists of a weighted sum of the contrastive loss and the cross-entropy classification losses:

\alpha L_C(d, y) + \beta L_X(Cl_1(i, t), \tilde{y}(i, t)) + \gamma L_X(Cl_2(i', t'), \tilde{y}(i', t')),    (3)

where L_X is the cross-entropy loss, Cl_1(i, t) and Cl_2(i', t') are the outputs of the first and second classification branches respectively, and \tilde{y}(i, t) is the category label of the product with image i and text description t. The parameters \alpha, \beta and \gamma are treated as hyperparameters for tuning.
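The two loss terms of Eqs. (2)-(3) are straightforward to express in PyTorch. The sketch below is an illustration under our reading of the equations; the margin and the weights alpha, beta and gamma are placeholder values.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(d, y, m=1.0):
        """Eq. (2): y = 0 for image-text pairs from the same outfit, y = 1 otherwise."""
        return (1 - y) * 0.5 * d.pow(2) + y * 0.5 * torch.clamp(m - d, min=0).pow(2)

    def deepstyle_siamese_loss(emb1, emb2, logits1, logits2, labels1, labels2, y,
                               alpha=1.0, beta=1.0, gamma=1.0):
        """Eq. (3): weighted contrastive loss plus two cross-entropy classification losses."""
        d = F.pairwise_distance(emb1, emb2)
        return (alpha * contrastive_loss(d, y.float()).mean()
                + beta * F.cross_entropy(logits1, labels1)
                + gamma * F.cross_entropy(logits2, labels2))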

Fig. 5. The proposed architecture of the DeepStyle network. An image is first fed through the ResNet-50 network pretrained on ImageNet, while the corresponding text description is transformed with word2vec. Both branches are compressed to 128 dimensions and then concatenated into a common vector. The final layer predicts the clothing item category. The penultimate layer serves as a multimodal image and text representation of a product item.
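A minimal PyTorch sketch of the DeepStyle branch from Fig. 5. Only the 128-dimensional branch size comes from the paper; the input feature sizes, the activation and the default number of classes are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class DeepStyle(nn.Module):
        def __init__(self, img_dim=2048, txt_dim=300, num_classes=85):
            super().__init__()
            self.img_fc = nn.Linear(img_dim, 128)   # compress ResNet-50 image features
            self.txt_fc = nn.Linear(txt_dim, 128)   # compress word2vec description embedding
            self.classifier = nn.Linear(256, num_classes)

        def forward(self, img_feat, txt_feat):
            z = torch.cat([torch.relu(self.img_fc(img_feat)),
                           torch.relu(self.txt_fc(txt_feat))], dim=1)
            return self.classifier(z), z            # category logits and the multimodal representation

    model = DeepStyle()
    logits, embedding = model(torch.randn(4, 2048), torch.randn(4, 300))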

Fig. 6. The architecture of the DeepStyle-Siamese network. The DeepStyle block is the block of dense and concatenation layers from Fig. 5 and has shared weights between the image-text pairs. Three losses are optimised: the classification loss for each image-text branch and the contrastive loss for image-text pairs. The contrastive loss is computed on the joint image and text descriptors.

V. DATASETS

Although several datasets for standard visual search methods exist, e.g. Oxford 5K [13] or Paris 6K [36], they are not suitable for our experiments, as our multimodal approach requires an additional type of information to be evaluated. More precisely, a dataset that can be used with a multimodal search engine should fulfill the following conditions:

• It should contain both images of individual objects and scene images (room/outfit images) with those objects present.
• It should have a ground truth defining which objects are present in the scene photo.
• It should also have textual descriptions.

We specifically focus on datasets containing pictures of interior design and fashion, as both domains are highly dependent on style and would benefit from style search engine applications. In addition, we analyze datasets with varying degrees of context information, as in real-life applications it may differ from dataset to dataset. In some cases (specifically when the database is not very extensive), items co-occur very often (in the context of the same design, look or outfit), whereas in other cases, when the database of available items is much bigger, the majority of items do not have many co-occurrences with other items. We apply our Multimodal Search Engine to both types of datasets and perform a quantitative evaluation to find the best model.

A. Interior Design

To our knowledge, there is no publicly available dataset that contains interior design items and fulfills the previously mentioned criteria. Hence, we collect our own dataset by scraping the website of one of the most popular interior design distributors, IKEA (https://ikea.com/). We collect 298 room photos with their descriptions and 2193 individual product photos with their textual descriptions. A sample image of a room scene and an interior item, along with their descriptions, can be seen in Fig. 7. We also group together products from some of the most frequent object classes (e.g. chair, table, sofa) for more detailed analysis. In addition, we divide the room scene photos into 10 categories based on the room class (e.g. kitchen, living room, bedroom, children room, office).

Fig. 7. Example entries from the IKEA dataset. It contains room images, object images and their respective text descriptions.

The vast majority of furniture items in the dataset (especially those from the frequent classes above) have rich context, as they appear in more than one room.

B. Fashion

Several datasets for fashion-related tasks are already publicly available. DeepFashion [37] contains 800 000 images divided into several subsets for different computer vision tasks. However, it lacks the context (outfit) information as well as detailed text descriptions. The Fashion Icon [28] dataset contains video frames for human parsing but no individual product images. In contrast, the Polyvore [30] dataset satisfies the dataset conditions mentioned before.

The Polyvore dataset contains 111 589 clothing items grouped into compatible outfits (5-10 items per outfit). We perform additional dataset cleaning: we remove non-clothing items such as electronic gadgets, furniture, cosmetics, designer logos and plants. In addition, we scrape the Polyvore website (http://polyvore.com) for the product items in the cleaned dataset to obtain longer product descriptions and to add descriptions where they are missing. As a result, we have 82 229 items from 85 categories with text descriptions and context information. The context information is much weaker than in the IKEA dataset: only 30% of the clothing items appear in more than one outfit.

The item (query) images are already object photos. Therefore, for the fashion dataset, the object detection step of the style search engine is omitted during evaluation.

Fig. 8. t-SNE visualization of the clothing items' visual feature embedding. Distinctive classes of objects, e.g. those that share visual similarities, are clustered around the same region of the space.

VI. EVALUATION

In this section, we present the evaluation procedure as well as the quantitative results.

A. Evaluation Metrics

Similarity score: As mentioned in Sec. II-C, defining a similarity metric that allows quantifying the stylistic similarity between products is a challenging task and an active area of research. In this work, we propose the following similarity measure, which is inspired by [24] and based on a probabilistic, data-driven approach.

Let us recall that P is the set of all possible product items available in the catalog. Let us then denote by C the set of all sets that contain stylistically compatible items (such as outfits or interior design rooms). We search for a similarity function between two items p_1, p_2 ∈ P which determines whether they fit well together. We propose the empirical similarity function s_c : P × P → [0, 1], which is computed in the following way:

s_c(p_1, p_2) = \frac{|\{C_i \in C : p_1 \in C_i \wedge p_2 \in C_i\}|}{\max_{p \in \{p_1, p_2\}} |\{C_j \in C : p \in C_j\}|}.    (4)

It is the number of compatible sets C_i empirically found in C in which both p_1 and p_2 appear, normalized by the maximum number of compatible sets in which either of those items occurs. This metric can be interpreted as an empirical probability for the two objects p_1 and p_2 to appear in the same compatible set, and it is expressed by a similarity score lying in the interval [0, 1].

In order to account for datasets that have weak context information (where two items rarely co-occur in the same compatible set), we add an additional similarity measure s_n that is directly derived from name overlap. It accounts for the overlap of some of the most frequent descriptive words such as elegant, denim, casual, etc. It should be mentioned, however, that the product name information should be independent of the text description (which is used during training). As a result, the name-derived similarity is non-zero only on datasets that have this kind of additional name information.

s_n(p_1, p_2) = \mathbb{1}\{W_{p_1} \cap W_{p_2} \neq \emptyset\},    (5)

where W_f is the set of frequent descriptive words appearing in the name of item f.

To summarize, an evaluated pair is considered to be similar if either of the two conditions is satisfied:

• the items co-occurred in the same outfit before,
• the names of the two items overlap.

Formally,

s(p_1, p_2) = \max(s_c(p_1, p_2), s_n(p_1, p_2)).    (6)
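For concreteness, the pairwise similarity of Eqs. (4)-(6) can be written down directly; the function and variable names below are ours, introduced only for illustration.

    def s_c(p1, p2, compatible_sets):
        """Eq. (4): empirical co-occurrence similarity."""
        both = sum(1 for C in compatible_sets if p1 in C and p2 in C)
        denom = max(sum(1 for C in compatible_sets if p in C) for p in (p1, p2))
        return both / denom if denom else 0.0

    def s_n(name1, name2, frequent_words):
        """Eq. (5): 1 if the two item names share a frequent descriptive word."""
        shared = set(name1.lower().split()) & set(name2.lower().split()) & frequent_words
        return 1.0 if shared else 0.0

    def s(p1, p2, names, compatible_sets, frequent_words):
        """Eq. (6): final pairwise similarity score."""
        return max(s_c(p1, p2, compatible_sets),
                   s_n(names[p1], names[p2], frequent_words))
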
Intra-List similarity: Given that our multimodal query search engine provides a non-ranked list of stylistically similar items, the definition of the evaluation problem differs significantly from other information retrieval domains. For this reason, instead of using the usual metrics for performance evaluation such as mAP [38] or nDCG [39], which take a ranked list of items as input, we apply a modified version of an established metric for non-ranked list retrieval. Inspired by [40], we define the average intra-list similarity for a generated result list R of length k to be:

AILS(R) = \binom{k}{2}^{-1} \sum_{p_i \in R} \sum_{p_j \in R,\, p_i \neq p_j} s(p_i, p_j),    (7)

which is the average similarity score computed across all possible pairs in the list of generated items. By doing so, we aim to assess the overall compatibility of the generated set. As mentioned in [40], this metric is also permutation-insensitive, hence the order of the retrieved results does not matter, making it suitable for non-ranked results.

B. Baseline

In our experiments, we compare the results with a recent multimodal approach to item retrieval, namely Visual Search Embedding (VSE) [9]. For evaluation, we fine-tune on our datasets the weights of a pretrained model made publicly available by the authors. The model was pretrained on the MS COCO dataset, which has 80 categories with a broad semantic context, hence it is applicable to our datasets. For feature extraction we use the VGG-19 [41] architecture, as suggested by the authors.

We also compare our method with the Late-fusion and Early-fusion Blending strategies.

C. Results

Evaluation protocol: In order to test the ability of our method to generalize, we evaluate it on data not seen during training. For both datasets, we set aside 10% of the initial number of items for that purpose. All results shown in this section come from the following evaluation procedure (a sketch of this loop is given below):

1) For each item/text query from the test set, we extract visual and textual features.
2) We run the engine and retrieve the set of the k most compatible items from the trained embedding space.
3) We evaluate the query results by computing the Average Intra-List Similarity metric for all possible pairs between the retrieved items and the query, which gives \binom{k}{2} pairs for k retrieved items.
4) The final results are computed as the mean of the AILS scores over all tested queries.
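The protocol above, together with Eq. (7), reduces to a short loop. In this sketch, embed, retrieve and pair_similarity stand for the feature extractor, the search engine and the similarity s(·,·) described earlier; they are interface assumptions, not the authors' code, and whether the query itself joins the scored list (step 3) is left as a flag.

    import itertools
    import numpy as np

    def ails(result_list, pair_similarity):
        """Eq. (7): mean similarity over all unordered pairs in the retrieved list."""
        pairs = list(itertools.combinations(result_list, 2))
        return 0.0 if not pairs else sum(pair_similarity(a, b) for a, b in pairs) / len(pairs)

    def evaluate(test_queries, embed, retrieve, pair_similarity, k=5, include_query=False):
        """Steps 1-4: embed each test query, retrieve k items, score the list with AILS."""
        scores = []
        for query in test_queries:
            results = list(retrieve(embed(query), k))
            if include_query:
                results.append(query)
            scores.append(ails(results, pair_similarity))
        return float(np.mean(scores))
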
It should be noted that for the IKEA dataset, object detection is performed on the room images and similar items are returned for the most confidently detected item in the picture. On the other hand, for the Polyvore dataset, the test set images are already catalog items of clothes on a white background, hence object detection is not necessary and this step is omitted.

Quantitative results: Tab. I shows the results of the blending methods for the IKEA dataset in terms of the mean value of our similarity metric.

When analyzing the results of the blending approaches, we experiment with several textual queries in order to evaluate the robustness of the system to changes in the text search. We observe that the DeepStyle approach outperforms the baseline and the other blending methods for almost all text queries, achieving the highest average similarity score.

TABLE I
MEAN AILS RESULTS AVERAGED FOR THE IKEA DATASET AND SAMPLE TEXT QUERIES FROM THE SET OF MOST FREQUENT WORDS IN TEXT DESCRIPTIONS.

Text query    VSE [9]    Late-fusion [8]    Early-fusion [8]    DeepStyle    DeepStyle-Siamese
decorative    0.1475     0.2742             0.2332              0.2453       0.2840
black         0.3217     0.2361             0.2354              0.1967       0.2237
white         0.1476     0.2534             0.2048              0.1730       0.2742
smooth        0.1648     0.2667             0.2472              0.3022       0.2642
cosy          0.2918     0.1073             0.2283              0.3591       0.2730
fabric        0.1038     0.1352             0.2225              0.0817       0.2487
colourful     0.3163     0.2698             0.2327              0.3568       0.2623
Average       0.2134     0.2164             0.2287              0.2449       0.2589

TABLE II
MEAN AILS RESULTS FOR FASHION SEARCH ON THE POLYVORE DATASET. SAMPLE TEXT QUERIES ARE SELECTED FROM THE SET OF MOST FREQUENT WORDS IN TEXT DESCRIPTIONS.

Text query    VSE [9]    Late-fusion [8]    Early-fusion [8]    DeepStyle    DeepStyle-Siamese
black         0.2932     0.2038             0.3038              0.2835       0.2719
white         0.2524     0.2047             0.2898              0.2012       0.2179
leather       0.2885     0.2355             0.2946              0.2510       0.3155
jeans         0.2381     0.1925             0.2843              0.4341       0.4066
wool          0.3025     0.1836             0.2657              0.5457       0.4337
women         0.2488     0.1931             0.3088              0.3808       0.3460
men           0.2836     0.1944             0.2900              0.1961       0.2549
floral        0.2729     0.3212             0.2954              0.3384       0.2858
vintage       0.2986     0.3104             0.3035              0.3317       0.3935
boho          0.2543     0.3074             0.2893              0.2750       0.3641
casual        0.2808     0.3361             0.3030              0.2071       0.2693
Average       0.2740     0.2439             0.2935              0.3131       0.3236

The DeepStyle-Siamese approach gives the best results, outperforming the VSE baseline by 21% for the IKEA dataset and by 18% for the Polyvore dataset.

Tab. II shows the results of all of the tested methods for the Polyvore dataset in terms of the mean value of our similarity metric. Here, we also evaluate the two joint architectures, namely DeepStyle and DeepStyle-Siamese. Fig. 9 shows that the DeepStyle architecture yields better results in terms of average performance over different textual queries when compared to our previous manual blending approaches as well as to the VSE baseline. In this case, DeepStyle-Siamese again yields the best average similarity results. In terms of average performance, it scores 32% higher than the simplest baseline model and more than 4% higher than DeepStyle.

Fig. 9. Mean AILS metric scores for selected textual queries and the average of the mean scores for all of the methods. Our DeepStyle-Siamese architecture significantly outperforms the other architectures on multiple text queries.

VII. WEB APPLICATION

To apply our method in a real-life setting, we implemented a Web-based application of our Style Search Engine. The application allows the user either to choose the query image from a pre-defined set of room images or to upload his/her own image. The application was implemented using Python Flask (http://flask.pocoo.org/), a lightweight server library. It is currently released to the public at http://stylesearch.tooploox.com/. Fig. 10 shows a screenshot of the working Web application with the Style Search Engine.
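The server side of such an application boils down to a single search endpoint. The snippet below is a hypothetical, simplified sketch of how the engine could be exposed with Flask; the route name, form fields and the run_style_search stub are our own illustration, not the deployed code.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def run_style_search(image_file, text_query):
        """Stub: the deployed service plugs the DeepStyle retrieval pipeline in here."""
        return []

    @app.route("/search", methods=["POST"])
    def search():
        # Accepts an uploaded query image and an optional text query,
        # and returns identifiers of stylistically similar items.
        image = request.files.get("image")
        text = request.form.get("text", "")
        return jsonify(results=run_style_search(image, text))

    if __name__ == "__main__":
        app.run()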

                                                                                  [7] Y. Li, L. Cao, J. Zhu, and J. Luo, “Mining fashion outfit composition
                                                                                      using an end-to-end deep learning approach on set data,” IEEE Trans.
                                                                                      Multimedia, vol. 19, no. 8, pp. 1946–1955, 2017.
                                                                                  [8] I. Tautkute, A. Mozejko, W. Stokowiec, T. Trzcinski, L. Brocki, and
                                                                                      K. Marasek, “What looks good with my sofa: Multimodal search engine
                                                                                      for interior design,” in Proceedings of the 2017 Federated Conference on
                                                                                      Computer Science and Information Systems (M. Ganzha, L. Maciaszek,
                                                                                      and M. Paprzycki, eds.), vol. 11 of Annals of Computer Science and
                                                                                      Information Systems, pp. 1275–1282, IEEE, 2017.
                                                                                  [9] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-
                                                                                      semantic embeddings with multimodal neural language models,” CoRR,
                                                                                      vol. abs/1411.2539, 2014.
                                                                                 [10] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary
                                                                                      tree,” In CVPR, 2006.
                                                                                 [11] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”
                                                                                      In IJCV, 2004.
                                                                                 [12] Z. S. Harris, “Distributional structure,” Papers on Syntax, pp. 3–22,
                                                                                      1981.
                                                                                 [13] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object
                                                                                      retrieval with large vocabularies and fast spatial matching,” In CVPR,
                                                                                      2007.
                                                                                 [14] L. Xie, J. Wang, B. Zhang, and Q. Tian, “Fine-grained image search,”
                                                                                      IEEE Trans. Multimedia, vol. 17, no. 5, pp. 636–647, 2015.
                                                                                 [15] A. Gordo, J. Almazn, J. Revaud, and D. Larlus, “Deep image retrieval:
                                                                                      Learning global representations for image search,” In ECCV, 2016.
                                                                                 [16] G. Tolias, R. Sicre, and H. Jégou, “Particular object retrieval with
                                                                                      integral max-pooling of CNN activations,” CoRR, vol. abs/1511.05879,
                                                                                      2015.
                                                                                 [17] S. Bell and K. Bala, “Learning visual similarity for product design with
                                                                                      convolutional neural networks,” ACM Trans. on Graph., vol. 34, no. 4,
                                                                                      pp. 98:1–98:10, 2015.
                                                                                 [18] G. Salton, A. Wong, and C. S. Yang, “A vector space model for
                                                                                      automatic indexing,” Commun. of the ACM, vol. 18, no. 11, pp. 613–620,
                                                                                      1975.
                                                                                 [19] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of
                                                                                      word representations in vector space,” CoRR, vol. abs/1301.3781, 2013.
Fig. 10. Sample screenshot of our Style Search Engine for interior design,
deployed as a Web application, showing product detection and retrieval of
visually similar products.
                        VIII. CONCLUSIONS
   In this paper, we experiment with several different
architectures for multimodal query item retrieval. These
include manual result blending approaches as well as joint
systems in which we learn common embeddings using
classification and contrastive loss functions. Our method
achieves state-of-the-art results in the generation of
stylistically compatible item sets from multimodal queries.
We also show that our methodology can be applied in various
commercial domains, easily adapting to new e-commerce
datasets by exploiting the product images and their associated
metadata. Finally, we deploy a publicly available Web
implementation of our solution and release a new dataset of
IKEA furniture items.
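For reference, one common formulation of the contrastive loss mentioned above is sketched below. The margin value and the plain-Python distance computation are illustrative assumptions and do not reflect the exact training configuration of DeepStyle-Siamese.

```python
import math

def euclidean_distance(u, v):
    """Distance between two embedding vectors given as plain Python lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, is_matching_pair, margin=1.0):
    """One common contrastive-loss formulation: matching pairs are pulled
    together, non-matching pairs are pushed at least `margin` apart."""
    d = euclidean_distance(u, v)
    if is_matching_pair:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```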