TransRec: Learning Transferable Recommendation from Mixture-of-Modality Feedback

Jie Wang1,3∗  Fajie Yuan2†  Mingyue Cheng4  Joemon M. Jose1  Chenyun Yu6  Beibei Kong3  Zhijin Wang5  Bo Hu3  Zang Li3

1 University of Glasgow  2 Westlake University  3 Platform and Content Group, Tencent  4 University of Science and Technology of China  5 Jimei University  6 Sun Yat-sen University

j.wang.9@research.gla.ac.uk  yuanfajie@westlake.edu.cn  mycheng@mail.ustc.edu.cn  joemon.jose@glasgow.ac.uk  {echokong, harryyfhu, gavinzli}@tencent.com  yuchy35@mail.sysu.edu.cn  zhijinecnu@gmail.com

arXiv:2206.06190v1 [cs.IR] 13 Jun 2022

Abstract

Learning big models and then transferring them has become the de facto practice in computer vision (CV) and natural language processing (NLP). However, such a unified paradigm is still uncommon in recommender systems (RS). A critical issue that hampers this is that standard recommendation models are built on unshareable identity data, where both users and their interacted items are represented by unique IDs. In this paper, we study a novel scenario where users' interaction feedback involves mixture-of-modality (MoM) items. We present TransRec, a straightforward modification of the popular ID-based RS framework. TransRec directly learns from MoM feedback in an end-to-end manner and thus enables effective transfer learning under various scenarios without relying on overlapping users or items. We empirically study the transfer ability of TransRec across four different real-world recommendation settings. Besides, we study its effects when scaling the size of the source and target data. Our results suggest that learning recommenders from MoM feedback provides a promising way to realize universal recommender systems. Our code and datasets will be made available.

                                         1     Introduction
The mainstream recommender systems (RS) model domain-specific user behaviors and generate item recommendations only for the same platform. Such specialized RS are well established in the literature [6, 8, 18, 19, 22], yet they routinely suffer from intrinsic limitations, such as low accuracy for cold and new items [44], and the heavy manual work and high cost of training separate models from scratch [46]. Hence, developing general-purpose recommendation models that can serve many systems has significant practical value. Such models have become popular in the computer vision (CV) [15, 10] and natural language processing (NLP) [9, 3] literature, where they are now called foundation models (FM) [2]. Despite this remarkable progress, there is not yet a recognized learning paradigm for building general-purpose models for recommender systems (gpRS). One paramount reason is that existing RS models are dominated by ID-based collaborative filtering (CF) approaches, where users and items are represented by unique IDs assigned by the recommendation platform. Thereby, a well-trained RS model can only serve the current system, because neither users nor items can easily be shared across different private systems, e.g. TikTok3 and YouTube4. Even in the special cases where userIDs or itemIDs from two platforms can be shared, the desired transferability is still hard to realize because the two platforms overlap only partially: little user and item overlap leads to very limited transfer effects. Recent attempts such as PeterRec [44], the lifelong Conure model [46] and STAR [33] fall exactly into this category.

∗ Work was done when Jie Wang was a visiting scholar at Westlake University and an intern at the Platform and Content Group, Tencent.
† Corresponding author. Fajie Yuan designed the research and Jie Wang led the experiments.
3 https://www.tiktok.com/
4 https://www.youtube.com/

Preprint. Under review.
To address the above problems, we explore modality content-based recommendation, where items are represented by a modality encoder (e.g. BERT [9] or ResNet [15]) rather than by ID embeddings. By modeling modality features, recommendation models intuitively gain the potential to achieve domain transfer for that modality in a broader sense, i.e. no longer relying on overlapping or shared-ID information. Moreover, the revolution of big encoder networks in NLP and computer vision is also potentially beneficial to multi-modal item recommendation, and might even bring about a paradigm shift for RS from ID-based CF back to content-based recommendation.
In this paper, we study a common yet unexplored recommendation scenario where user behaviors are composed of items with mixed-modal features, e.g. the items a user interacts with can be texts, images, or both. Such a scenario is common in many practical recommender systems such as feed recommendation, where a recommended feed can be a piece of news, an image, or a micro-video.
We claim that developing recommendation models based on user feedback with mixture-of-modality items is a vital way towards transferable and general-purpose recommendation. To verify this, we design TransRec, a simple yet representative framework for modeling mixture-of-modality feedback. TransRec is a direct modification of the classical two-tower ID-based DSSM [21] model, where one tower represents users and the other represents items. To eliminate the non-transferable ID features, items are encoded by a modality encoder instead of ID embeddings, whereas users are represented by a sequence of items instead of user embeddings, as shown in Figure 1. We train TransRec, including both the user and item encoders, in an end-to-end manner rather than using frozen features pre-extracted from modality encoders.
More importantly, we perform broad empirical studies on TransRec, the first RS regime enabling effective transfer across modalities and domains. Specifically, we first train TransRec on a large-scale source dataset collected from a commercial website, where a user's feedback contains textual modality, visual modality, or both. We then evaluate the pre-trained TransRec on a first target dataset that is collected from a different platform but has similar user feedback formats. Second, we evaluate TransRec on a second target dataset where items have only one modality. Third, we evaluate TransRec with a single modality again, but with additional user/item features, to verify its flexibility. Finally, we evaluate TransRec on another target dataset that is very different from the source domain, so as to verify its generality. Beyond this, we empirically examine the performance of TransRec with different scaling strategies on the source and target datasets. Our results confirm that TransRec, learning from MoM feedback, is effective for various transfer learning tasks. To date, TransRec is probably the closest model to gpRS. We believe its success points out a new way towards foundation models in the RS domain.
To summarize, our contributions are as follows: (1) we identify the important fact that learning from MoM feedback has the potential to realize gpRS; (2) we design TransRec, the first recommendation model that realizes cross-modality and cross-domain recommendation; (3) we evaluate the transferability of TransRec across four types of recommendation scenarios; (4) we study the effects of TransRec when scaling the data, and provide useful insights; (5) we make our code, datasets5 and pre-trained parameters of TransRec available for future research.

5 For privacy reasons, the datasets are provided only by email, with permission, for research purposes.

2       Related Work

In this section, we briefly review the progress of gpRS and its core techniques: self-supervised
pre-training (SSP) and transfer learning (TF).
General-purpose Recommendation. Big foundation models have achieved astounding feats in the NLP and CV communities [2]. BERT [9], GPT-3 [3], ResNet [15] and various Vision Transformers [10, 1, 27] have almost dominated the two fields because of their superb performance and transferability. By contrast, very few efforts have been devoted to foundation recommendation models, which are key towards gpRS. Some works adopt multi-task learning (MTL) to combine multiple objectives so as to obtain a more general representation model [28, 47, 37, 31]. However, typical MTL is only useful for the trained tasks and cannot be directly transferred to newly arriving recommendation tasks or scenarios.

Figure 1: Illustration of the training process in TransRec. Here, the inner product is employed to compute the preference between users and candidate items.
PeterRec [44] proposed the first pre-training-then-fine-tuning paradigm for learning and transferring general-purpose user representations, closely following the BERT recipe. Building on it, Conure [46] introduced the 'one person, one model, one world' idea and claimed that recommendation models benefit from lifelong learning. Similar cross-domain recommendation works include [26, 29, 40, 49, 4, 30, 33, 7]. However, these models are all based on the shared-ID assumption between source and target domains, which is difficult to satisfy in practice. Distinct from these works, some preprint papers [34, 35, 41] devised general-purpose gpRS by leveraging textual information, and [39] learned user representations from text and image modalities, with images processed into frozen features beforehand. The most recent preprint, P5 [12], formulates various recommendation-related tasks (e.g. rating prediction, item recommendation and explanation generation) as a unified text-to-text paradigm with the same language modeling objective for pre-training. However, to the best of our knowledge, there exists no general-purpose recommendation model that is trained on feedback from various modalities in an end-to-end fashion.
SSP & TF. Recent years have seen growing interest in the paradigm of upstream SSP and downstream TF. In terms of pre-training, the existing literature can be broadly categorized into two classes: BERT/GPT-like generative pre-training [9, 3] and discriminative pre-training [14, 5, 38]. Compared with supervised learning, SSP methods are more likely to learn general-purpose representations since they are trained directly on self-generated labels rather than explicit task-specific supervision. Recommender systems, with their large amounts of implicit user feedback, are a natural fit for self-supervised learning. Both generative pre-training [44, 46, 36, 12] and contrastive pre-training [13, 48, 42] are popular, although many of these works [48, 42, 36] only investigated pre-training for the current task rather than pursuing general-purpose recommendation. Regarding transfer learning, current work mainly adopts parameter freezing [17], full fine-tuning [9, 16], adapter fine-tuning [44, 20] and prompting [3, 11] to adapt an upstream SSP model to downstream tasks. In this paper, we consider both generative and contrastive pre-training for learning the gpRS model and apply full fine-tuning for domain adaptation.

3   The TransRec Framework

In this section, we first formulate the recommendation task involving mixture-of-modality feedback. Then, we introduce the TransRec framework in detail.

3.1       Problem Definition

Assume that we are given two categories of domains: a source domain $S$ and target domains $T = \{T_1, T_2, \ldots, T_N\}$. In the source (target) domain, suppose there exist a user set $U_s$ ($U_t$) and an item set $V_s$ ($V_t$), involving $|U_s|$ ($|U_t|$) users and $|V_s|$ ($|V_t|$) items. In both domains, the content features of items are recorded with a modality set $M = \{a, b\}$ containing the textual and visual modalities, denoted as $a$ and $b$. Following the setting of item-based collaborative filtering [44], a user can be represented by the sequence of her historical interaction records $C^u = \{c_{1,m}, \ldots, c_{n,m}\}$, where $n$ is the sequence length and $m \in M$. The goal of this work is to learn a generic recommender from the source domain $S$ that can be transferred to the $N$ target domains $T$ without any overlap of userIDs or itemIDs.

Figure 2: Illustration of the transfer process. TransRec first pre-trains a unified recommender in the source domain and then serves the target domain with the pre-trained network.
As illustrated in Figure 2, by learning from $M$, the trained model can be applied to the following domains: a single-modality domain $C = \{c_{1,a}, \ldots, c_{n,a}\}$ ($c_{i,a} \in V_t^a$) in domain $T^a$, a single-modality domain $C = \{c_{1,b}, \ldots, c_{n,b}\}$ ($c_{i,b} \in V_t^b$) in domain $T^b$, and a mixed-modality domain $C = \{c_{1,m}, \ldots, c_{n,m}\}$ ($c_{i,m} \in V_t^m$, $m \in M$) in domain $T^m$. Suppose that we extend the source domain to four modalities $M = \{a, b, c, d\} = \{\text{text}, \text{vision}, \text{audio}, \text{video}\}$. The target domains can then be served with any of the $2^4 - 1 = 15$ non-empty modality subsets, i.e. $\{a\}, \{b\}, \{c\}, \{a, b\}, \{a, b, c\}, \ldots, \{a, b, c, d\}$, which covers a majority of existing multimedia modalities.
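To make the combinatorics concrete, the short Python snippet below (purely illustrative; the modality names simply mirror the notation above) enumerates the non-empty modality subsets that a source model trained on all four modalities could in principle serve:

```python
from itertools import combinations

# Illustrative only: a recommender trained on M = {text, vision, audio, video}
# could in principle serve any non-empty subset of these modalities.
modalities = ["text", "vision", "audio", "video"]

subsets = [
    set(combo)
    for k in range(1, len(modalities) + 1)
    for combo in combinations(modalities, k)
]

print(len(subsets))  # 2^4 - 1 = 15 possible target-domain modality mixes
```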

3.2       TransRec Architecture

To verify our claim, we develop TransRec on the most popular two-tower recommendation architecture, a.k.a. DSSM [21]. To eliminate ID features, we represent both users and items with item modality content. That is, the item tower of DSSM is replaced with an item modality encoder network (e.g. BERT for text and ResNet for images), and the user tower is replaced with a user encoder network that directly models an ordered collection of item interactions rather than explicit userID data. In this sense, TransRec is also a sequential recommendation (SR) model [19, 38].6

6 It is worth mentioning that there exist various types of SR frameworks, e.g. the popular CPC framework [19, 38], the NextItNet- and SASRec-style autoregressive frameworks [45, 22], and the BERT4Rec-style denoising framework [36]. The reason we extend the DSSM/CPC framework is mainly its flexibility in incorporating various user and item features. In fact, TransRec fits most sequential recommender models.
Formally, given a user interaction sequence $C = \{c_{1,m}, \ldots, c_{n+l,m}\}$ from a mixed-modal scenario, TransRec takes two sub-sequences $C^u = \{c_{1,m}, \ldots, c_{n,m}\}$ and $C^e = \{c_{n+1,m}, \ldots, c_{n+l,m}\}$ as inputs. $Z^u = \{z_1, \ldots, z_n\}$ and $Z^e = \{z_{n+1}, \ldots, z_{n+l}\}$ are the item representations of $C^u$ and $C^e$ produced by the item encoder $E_i$. The item representations in $Z^u$ are then fed into the user encoder $E_u$ to obtain the user representation $U^u$. $U^u$ and $Z^e$ are used to compute relevance scores. The process can be formulated as:

$Z^e = E_i(C^e), \quad Z^u = E_i(C^u)$,   (1)
$U^u = E_u(Z^u), \quad R_{u,e} = U^u \cdot Z^e$,   (2)

where $R_{u,e} = \{r_{u,1}, r_{u,2}, \ldots, r_{u,l}\}$, $r_{u,t}$ denotes the relevance score between $U^u$ and the $t$-th item of the sub-sequence $C^e$, and $R_{u,e}$ captures the relation between the user and his next interaction sequence. Next, we describe the components of our model in detail.
Item Encoder. In an MoM scenario, the item encoders of TransRec take individual modality content as input. We consider two types of modalities in this paper, i.e. textual tokens and image pixels; extending to more modalities is an interesting direction for future work. For an item with textual modality (i.e. $c_{i,t}$), we adopt BERT-base [9] to model its token content $t = [t_1, t_2, \ldots, t_k]$. We then apply an attention network as a pooling layer to obtain the final textual representation $Z_{i,t}$:

$Z_{i,t} = \mathrm{SelfAtt}(\mathrm{BERT}(t))$.   (3)

Similarly, we apply ResNet-18 [15] to encode the visual pixels of an image, denoted as $v$. We then perform average pooling, followed by an MLP layer. The visual representation is given by:

$Z_{i,v} = \mathrm{MLP}(\mathrm{ResNet}(v))$.   (4)
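To make the two item encoders concrete, below is a minimal PyTorch-style sketch of Eq. (3) and Eq. (4). It assumes HuggingFace Transformers and torchvision are available; the exact attention-pooling layer, the MLP sizes, and the 256-dimensional output (chosen to match the embedding size in Section 4.1) are our own assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchvision.models import resnet18


class TextItemEncoder(nn.Module):
    """BERT token encoding followed by attention pooling, as in Eq. (3)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.attn = nn.Linear(self.bert.config.hidden_size, 1)   # per-token attention score
        self.proj = nn.Linear(self.bert.config.hidden_size, dim)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state   # (B, L, 768)
        scores = self.attn(hidden).masked_fill(attention_mask.unsqueeze(-1) == 0, -1e9)
        weights = torch.softmax(scores, dim=1)                                 # pooling weights
        pooled = (weights * hidden).sum(dim=1)                                 # (B, 768)
        return self.proj(pooled)                                               # z_{i,t}


class ImageItemEncoder(nn.Module):
    """ResNet-18 pixel encoding, average pooling, then an MLP, as in Eq. (4)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        backbone = resnet18()                                       # weight initialization not specified here
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # keep layers up to global average pooling
        self.mlp = nn.Sequential(nn.Linear(512, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, pixels):                                      # pixels: (B, 3, H, W)
        feat = self.cnn(pixels).flatten(1)                          # (B, 512)
        return self.mlp(feat)                                       # z_{i,v}
```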

User Encoder. For the user encoder, we still use the BERT architecture (denoted as $\mathrm{BERT}_u$), where each token embedding is the representation produced by an item encoder rather than the original word ID embedding. A position embedding $P = \{p_1, \ldots, p_n\}$ is added to model the sequential patterns of user behaviors. The specific process is formulated as follows:

$S^u = Z^u + P^u$,   (5)
$U^u = E_u(S^u) = \mathrm{BERT}_u(S^u)$.   (6)

TransRec can incorporate user features by simply concatenating them with the user embedding, and item features by concatenating them with the item embedding. By contrast, typical sequential recommendation models such as NextItNet and SASRec cannot incorporate user features in such a straightforward way.
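A hedged sketch of the user encoder (Eq. (5)-(6)) and of the inner-product scoring in Eq. (2) follows. A standard Transformer encoder stands in for BERT_u, the layer/head counts match Section 4.1, and taking the last position's hidden state as the user vector is our assumption, since the paper does not specify the pooling:

```python
import torch
import torch.nn as nn


class UserEncoder(nn.Module):
    """BERT-style Transformer over item embeddings with learned positions (Eq. 5-6)."""

    def __init__(self, dim: int = 256, n_layers: int = 4, n_heads: int = 4, max_len: int = 25):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)                      # position embeddings P
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, item_emb):                                   # item_emb: (B, n, dim) = Z^u
        positions = torch.arange(item_emb.size(1), device=item_emb.device)
        s = item_emb + self.pos(positions)                         # S^u = Z^u + P^u
        hidden = self.encoder(s)                                   # BERT_u(S^u)
        return hidden[:, -1]                                       # user representation U^u (assumed pooling)


def relevance(user_repr, cand_emb):
    """Inner-product scores between a user and l candidate items (Eq. 2).

    user_repr: (B, dim), cand_emb: (B, l, dim) = Z^e  ->  scores R_{u,e}: (B, l)
    """
    return torch.einsum("bd,bld->bl", user_repr, cand_emb)
```

User or item side features could be concatenated to the item or user embeddings before scoring, as described above.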

3.3   Optimization

Inspired by the pre-training and fine-tuning paradigm, we apply two-stage training for TransRec: we first pre-train the user encoder network, and then train the whole TransRec framework.
Stage 1: User Encoder Pre-training. We pre-train the user encoder network in a self-supervised manner. Specifically, we apply left-to-right generative pre-training to predict the next item in the interaction sequence, similar to SASRec and NextItNet [22, 45]. We choose unidirectional pre-training rather than the BERT-style (bidirectional) objective of Eq. (6) simply because unidirectional pre-training converges much faster without loss of precision. We use the softmax cross-entropy loss as the objective function:

$\tilde{y}_t = \mathrm{softmax}(\mathrm{ReLU}(S'_t W^U + b^U))$,   (7)
$\mathcal{L}_{\mathrm{UEP}} = -\sum_{u \in \mathcal{U}} \sum_{t=1}^{n} y_t \log(\tilde{y}_t)$,   (8)

where $W^U$ and $b^U$ are the projection matrix and bias term, and $S'_t$ is the last hidden layer's representation at position $t$.
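A minimal sketch of this Stage-1 objective is given below. It assumes the user encoder is causally masked (left-to-right), that $W^U$ projects hidden states onto the source item vocabulary, and it averages over positions where Eq. (8) writes a sum:

```python
import torch
import torch.nn.functional as F


def user_encoder_pretrain_loss(hidden, targets, proj, bias):
    """Next-item softmax cross-entropy for Stage 1 (Eq. 7-8).

    hidden:  (B, n, dim)   last-layer states S'_t of a causally masked user encoder
    targets: (B, n)        ID of the next item at each position (source vocabulary)
    proj:    (dim, |V_s|)  projection matrix W^U
    bias:    (|V_s|,)      bias term b^U
    """
    logits = F.relu(hidden @ proj + bias)          # ReLU(S'_t W^U + b^U), shape (B, n, |V_s|)
    # cross_entropy applies the softmax of Eq. (7); mean over positions instead of the paper's sum
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```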
Stage 2: End-to-End Training. TransRec is trained in an end-to-end manner by fine-tuning the parameters of both the user and item encoders. This is significantly different from many multi-modal recommendation tasks that pre-extract modal features before training the model [17]. End-to-end training enables better adaptation of the textual and visual features to the current recommendation domain. Specifically, we propose to use the Contrastive Predictive Coding (CPC) [38] learning method. Given a sequence of user interactions $C$, we divide it into the sub-sequences $C^u$ and $C^e$ and encode the relationship between them. The binary cross-entropy loss function is as follows:

$\mathcal{L}_{\mathrm{CPC}} = -\sum_{u \in \mathcal{U}} \Big[ \sum_{t=n+1}^{n+l} \log(\sigma(r_{u,t})) + \sum_{g=1}^{j} \log(1 - \sigma(r_{u,g})) \Big]$,   (9)

where $g$ indexes the randomly sampled negative items [32, 43] used during model training.
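The CPC objective can be sketched as follows; the number of negatives and the use of a batch mean rather than the sum in Eq. (9) are our assumptions:

```python
import torch
import torch.nn.functional as F


def cpc_loss(user_repr, pos_emb, neg_emb):
    """Binary cross-entropy CPC objective for Stage 2 (Eq. 9).

    user_repr: (B, dim)      user representations U^u
    pos_emb:   (B, l, dim)   item representations Z^e of the next l interactions
    neg_emb:   (B, j, dim)   representations of j randomly sampled negative items
    """
    pos_scores = torch.einsum("bd,bld->bl", user_repr, pos_emb)   # r_{u,t}
    neg_scores = torch.einsum("bd,bjd->bj", user_repr, neg_emb)   # r_{u,g}
    pos_loss = F.binary_cross_entropy_with_logits(pos_scores, torch.ones_like(pos_scores))
    neg_loss = F.binary_cross_entropy_with_logits(neg_scores, torch.zeros_like(neg_scores))
    return pos_loss + neg_loss
```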

4     Experiments
To verify the effectiveness of our proposed TransRec, we conduct empirical experiments and evaluate
pre-trained recommenders on four types of downstream tasks.

4.1        Experiments for The Source Domain

Datasets. The source data is news recommendation data collected from QQBrowser7 from the 14th to the 17th of December 2020. We collect around 25 million user-item interactions, involving about 1 million randomly sampled users and 133,000 interacted items. Each interaction denotes observed feedback at a certain time, including full plays and clicks.

For each user, we construct the behavior sequence from her 25 most recent ordered interactions. In addition to the ID information, the dataset has rich content features to represent each item. More precisely, the items in a user session can be videos only, news only, or both, but each individual item is either a video or a piece of news, i.e. it contains only one modality. Note that video items are represented by their cover images. The statistics are presented in Table 1.

Table 1: Characteristics of the source dataset. 'Form' indicates the item category. 'All' means all users in this dataset, including the above three types. For example, the first line denotes that there are 765,895 users whose interacted items always have two-modal (i.e. textual and visual) features.

Form      Modality        User        Item      Interaction
Mixed     Text + Vision   765,895     -         19,233,882
Article   Text            133,107     -         3,327,463
Video     Vision          123,897     -         2,996,048
All       Text + Vision   1,022,899   133,107   25,557,393

7 https://browser.qq.com/
Evaluation Metrics. We use the typical leave-one-out strategy [18] for evaluation, where the last item of a user's interaction sequence is held out as test data, and the item before the last one is used as validation data. Unlike many previous methods that rank only a small set of randomly sampled items for evaluation, which may lead to inconsistencies with the non-sampled version [23], we rank the entire item set and avoid such inaccurate sampled metrics. We apply Hit Ratio (HR) [44] and Normalized Discounted Cumulative Gain (NDCG) [45] to measure the performance of each method. Our evaluation protocol is identical for the source and downstream tasks.
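For clarity, the sketch below shows how HR@k and NDCG@k can be computed for a single user under this full-ranking, leave-one-out protocol; it is illustrative rather than the authors' evaluation code:

```python
import numpy as np


def hr_ndcg_at_k(scores, target, k=10):
    """HR@k and NDCG@k for one user under full (non-sampled) ranking.

    scores: (|V|,) predicted relevance for every item in the catalogue
    target: index of the single held-out test item (leave-one-out)
    """
    rank = int((scores > scores[target]).sum())          # 0-based rank of the test item
    hit = 1.0 if rank < k else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0  # single relevant item
    return hit, ndcg
```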
Implementation Details. To ensure fair evaluation, every hyper-parameter is tuned on the validation set. We set the embedding size to 256. All networks are optimized with the Adam optimizer with a learning rate of 1e-4. The batch size is set to 512 during pre-training. All baseline models are either tuned on the validation set or use the suggested settings from their original papers. We train all models until convergence and save the parameters when they reach the highest accuracy on the validation set. We set the number of user encoder (i.e. BERT) layers to 4 and the number of attention heads to 4. A similar setup is adopted for all downstream tasks.
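For reference, the stated hyper-parameters can be gathered into a single configuration sketch; the field names are our own, and anything not reported in the paper (e.g. dropout, weight decay) is deliberately omitted:

```python
# Hyper-parameters reported in Section 4.1; field names are illustrative.
transrec_config = {
    "embedding_dim": 256,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "pretrain_batch_size": 512,
    "user_encoder_layers": 4,
    "attention_heads": 4,
    "max_sequence_length": 25,   # each user's 25 most recent interactions
}
```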
Results. Before evaluating TransRec on the downstream datasets, we first examine its performance in the source domain. The purpose here is not to demonstrate that TransRec achieves state-of-the-art recommendation accuracy. Instead, we hope to investigate whether learning from modality content has advantages over traditional ID-based methods, i.e. IDRec. Throughout this paper, we use IDRec to denote the network that has a similar architecture to TransRec, but with the item encoder replaced by an ID embedding layer.
Table 2 shows the results of IDRec and TransRec in terms of HR@10 and NDCG@10. Two important observations can be made: (1) Both TransRec and TransRec- largely outperform IDRec (e.g. 0.0699 vs. 0.0230, and 0.0532 vs. 0.0230 on HR@10), which demonstrates the strong potential of learning from modality content data. The higher results of TransRec are presumably attributable to three key factors: large-scale training data, powerful item encoder networks, and an end-to-end training fashion. (2) TransRec exceeds TransRec- under all modality settings (e.g. 0.0699 vs. 0.0532, and 0.0679 vs. 0.0530), which evidences the effectiveness of user encoder pre-training (see Section 3.3). Observation (1) further motivates us to develop modality content-based recommendation models for downstream tasks.

Table 2: Results on the source dataset. The terms below have the same meaning as in Table 1. TransRec- denotes TransRec without the first-stage user encoder pre-training.

Method      Modality   HR@10    NDCG@10
IDRec       ID         0.0230   0.0118
TransRec-   Vision     0.0540   0.0281
            Text       0.0536   0.0280
            Mixed      0.0530   0.0275
            All        0.0532   0.0276
TransRec    Vision     0.1128   0.0553
            Text       0.0582   0.0272
            Mixed      0.0679   0.0326
            All        0.0699   0.0334

4.2        Experiments for The Target Domains

All the target datasets below come from recommender systems different from the source domain.
TN-mixed: This dataset was collected from Tencent News (TN)8, where an interacted item can be either a piece of news or a video (represented by its cover image). Similar to the source domain, the interacted item set of a user contains mixture-of-modality features, i.e. both visual and textual features.
TN-video/text: These two datasets contain only items with a single modality. For example, a user's interactions in TN-video include only videos, while TN-text includes only textual items, i.e. news. They are used to evaluate TransRec's generality for single-modal item recommendation. Given that users and items in real-world recommender systems have various additional features, we introduce two types of user features (gender and age) and one item feature (category) for TN-text.

Table 3: Datasets for the downstream recommendation tasks. Except for Douyin, interactions in the other datasets are mainly users' clicking or watching behaviors.

Domain     Modality      User      Item
TN-mixed   Text+Vision   49,639    48,383
TN-video   Vision        47,004    50,053
TN-text    Text          49,033    49,142
Douyin     Vision        100,000   66,228
Douyin: This dataset was collected from Douyin9 (the Chinese version of TikTok), a well-known short-video recommendation application. Unlike all previous datasets, the positive user feedback in Douyin contains only comment behaviors. In addition, the video genres and cover image sizes in Douyin are vastly different from the source domain. Table 3 summarizes the statistics of the downstream datasets.
Baselines. In addition to IDRec, we also report IDGru [19] and IDNext [45], two popular ID-based sequential recommendation baselines, as a reference. That said, we emphasize again that the purpose of this study is neither to propose a more advanced neural recommendation architecture nor to pursue state-of-the-art results. The key purpose of this study is to show that: (1) learning from modality content features instead of ID features achieves the goal of transferable recommendation across different domains; (2) learning from MoM feedback rather than single-modal or multi-modal feedback reaches the goal of generic recommendation across different modalities.
Results. The overall results are shown in Table 4, which covers four recommendation scenarios. Two important observations can be made: (1) TransRec performs largely better than training from scratch; (2) TransRec performs better than IDRec as well. The results suggest that pre-training on the source domain brings much better results for TransRec on all target datasets. From these scenarios, we can conclude that TransRec, by learning from MoM feedback, can be broadly transferred to various recommendation scenarios: the source-like mixed-modal scenario (TN-mixed), the single-modal scenario (TN-video), the scenario with additional features (TN-text), and the scenario with very different modality content (Douyin).

4.3        Scaling Effects
The larger the source dataset, the stronger the representation TransRec achieves. To investigate the scaling effect of the pre-training dataset in the source domain, we compare the results of TransRec on the downstream tasks while changing the size of the QQBrowser dataset. To be specific, we extract 20% and 50% of the user sequences from the original dataset, conduct the same training process on them, and then evaluate their transferability.

Figure 3: Convergence trend when scaling the source data. The two panels plot HR@10 against training epochs on TN-mixed and TN-text, comparing Train-from-scratch, the 20% and 50% source subsets, and full TransRec.
8 https://news.qq.com/
9 https://www.douyin.com/

Table 4: Comparison of recommendation results on the four downstream domains. Train-from-scratch denotes training on the target datasets with randomly initialized parameters; it shares exactly the same network architecture and hyper-parameters as TransRec. The best results are in bold.

Domain     Metric    IDRec    IDGru    IDNext   Train-from-scratch   TransRec
TN-mixed   HR@10     0.0210   0.0281   0.0334   0.0428               0.0478
           NDCG@10   0.0100   0.0143   0.0167   0.0213               0.0239
TN-video   HR@10     0.0267   0.0357   0.0406   0.0336               0.0424
           NDCG@10   0.0125   0.0206   0.0208   0.0173               0.0221
TN-text    HR@10     0.0192   -        -        0.0500               0.0597
           NDCG@10   0.0090   -        -        0.0255               0.0303
Douyin     HR@10     0.0019   0.0140   0.0200   0.0205               0.0259
           NDCG@10   0.0011   0.0068   0.0095   0.0101               0.0126

Table 5: Results on the target data when scaling the source corpus.

Domain     Metric    Train-from-scratch   20%      50%      TransRec
TN-mixed   HR@10     0.0428               0.0448   0.0474   0.0485
           NDCG@10   0.0213               0.0227   0.0237   0.0245
TN-video   HR@10     0.0336               0.0400   0.0417   0.0424
           NDCG@10   0.0173               0.0209   0.0214   0.0221
TN-text    HR@10     0.0500               0.0543   0.0581   0.0597
           NDCG@10   0.0255               0.0281   0.0292   0.0303
Douyin     HR@10     0.0205               0.0233   0.0254   0.0260
           NDCG@10   0.0101               0.0113   0.0120   0.0126

Table 5 shows the results on the four downstream recommendation tasks. First, it can be seen that the recommendation accuracy improves markedly when more source data is used for training. For example, on the TN-mixed dataset, HR@10 grows from 0.0428 to 0.0448 when using 20% of the source data, and then from 0.0448 to 0.0474 when using 50%. This property of TransRec is desirable since it implies that scaling up the source dataset is an effective way to improve downstream tasks. We also depict the convergence behavior in Figure 3, which shows the improvements even more clearly.

The smaller the downstream dataset, the larger the improvement TransRec achieves. We study the effects of TransRec when scaling down the target data, aiming to verify whether TransRec can alleviate the insufficient-data problem and thus improve recommendation performance. To be specific, we decrease the size of the target datasets by using 20% and 60% of the original data. Results are shown in Table 6 and Figure 4. It can be seen that (1) recommendation accuracy increases with more training data; (2) larger relative gains are achieved with less training data.

Figure 4: Comparison of convergence when scaling the TN-mixed dataset. The three panels plot HR@10 against training epochs for 10,000, 30,000 and 49,639 training sequences, comparing fine-tuning (TransRec) against Train-from-scratch.

Table 6: Comparison of relative performance improvement on downstream tasks with varied target data size. 'Num. Sample' denotes the number of user behavior sequences used for training. 'Improv.' indicates the relative improvement of TransRec over Train-from-scratch.

                          HR@10                                   NDCG@10
Domain     Num. Sample    Train-from-scratch  TransRec  Improv.   Train-from-scratch  TransRec  Improv.
TN-mixed   10,000         0.0261              0.0385    41.51%    0.0126              0.0193    53.17%
           30,000         0.0354              0.0448    26.55%    0.0176              0.0223    26.70%
           49,639         0.0428              0.0485    13.32%    0.0213              0.0245    15.02%
TN-text    10,000         0.0393              0.0597    51.91%    0.0201              0.0262    30.35%
           30,000         0.0453              0.0549    21.19%    0.0230              0.0283    23.04%
           49,033         0.0500              0.0597    19.40%    0.0255              0.0303    18.82%

Table 7: End-to-end training vs. frozen features.

                                TN-mixed            TN-text             TN-video
Method              Manner     HR@10    NDCG@10    HR@10    NDCG@10    HR@10    NDCG@10
Train-from-scratch  Frozen     0.0334   0.0167     0.0350   0.0176     0.0037   0.0017
                    End2end    0.0428   0.0213     0.0500   0.0255     0.0336   0.0173
TransRec            Frozen     0.0359   0.0181     0.0411   0.0206     0.0040   0.0023
                    End2end    0.0485   0.0245     0.0597   0.0303     0.0424   0.0221

This suggests that when a recommender system is short of training data, transferring the (user-item) matching relationship from a large source dataset is of great help.

4.4    End-to-End vs. Frozen Features.

Traditional multimodal and multimedia recommendation has been well studied [17]. Due to high computing costs and the less powerful text/image encoder networks of the past, prior art tends to extract frozen modality features and then feed them into a CTR or recommendation model. This practice is especially common in industrial applications with billions of training examples [8, 6]. Here, we want to explore whether end-to-end learning is superior to learning from frozen features. The results are given in Table 7. Clearly, end-to-end learning achieves consistent improvements, although fine-tuning BERT and ResNet is computationally more expensive than using pre-extracted features. In particular, we notice that using frozen textual features degrades results less than using frozen visual features. This may imply that the textual features generated by BERT are more general than the visual features generated by ResNet. This also aligns with findings in the NLP and CV fields: fine-tuning all parameters is in general better than fine-tuning only the classification layer (with the backbone network frozen).
5   Conclusion, Limitations, and Future Work

In this paper, we study a new recommendation scenario where user feedback contains items with mixture-of-modality features. We develop TransRec, the first recommendation model that learns from MoM feedback in an end-to-end manner. To show its transfer ability, we conduct empirical studies on four types of downstream recommendation tasks. Our results verify that TransRec is a generic model that can be broadly transferred to improve many recommendation tasks, as long as the modality has been trained in the source domain. Our work has significant practical implications for universal recommender systems with the goal of realizing 'One Model to Serve All' [46, 33].

One limitation is that we only examine TransRec with two types of modality features (vision and text). As a result, it can only serve three scenarios: vision-only, text-only, and vision-text. Since both video and audio data can be represented as images [24, 25], TransRec can intuitively be extended to scenarios where items involve more modalities. For example, if four distinct modalities (vision, text, audio and video) are available in user feedback, then TransRec has the potential to serve up to 15 types of scenarios, which would cover most modalities of multimedia data. This is an interesting future direction and may lead to a more general recommendation system.

References
 [1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A
     video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
     pages 6836–6846, 2021.
 [2] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S
     Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of
     foundation models. arXiv preprint arXiv:2108.07258, 2021.
 [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
     Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.
     Advances in neural information processing systems, 33:1877–1901, 2020.
 [4] Lei Chen, Fajie Yuan, Jiaxi Yang, Xiangnan He, Chengming Li, and Min Yang. User-specific adaptive
     fine-tuning for cross-domain recommendations. IEEE Transactions on Knowledge and Data Engineering,
     2021.
 [5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for
     contrastive learning of visual representations. In International conference on machine learning, pages
     1597–1607. PMLR, 2020.
 [6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen
     Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems.
     In Proceedings of the 1st workshop on deep learning for recommender systems, pages 7–10, 2016.
 [7] Mingyue Cheng, Fajie Yuan, Qi Liu, Xin Xin, and Enhong Chen. Learning transferable user representations
     with sequential behaviors via contrastive pre-training. In 2021 IEEE International Conference on Data
     Mining (ICDM), pages 51–60. IEEE, 2021.
 [8] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In
     Proceedings of the 10th ACM conference on recommender systems, pages 191–198, 2016.
 [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec-
     tional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
     Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth
     16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[11] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners.
     arXiv preprint arXiv:2012.15723, 2020.
[12] Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. Recommendation as
     language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). arXiv preprint
     arXiv:2203.13366, 2022.
[13] Jie Gu, Feng Wang, Qinghui Sun, Zhiquan Ye, Xiaoxiao Xu, Jingmin Chen, and Jun Zhang. Exploiting
     behavioral consistence for universal user representation. arXiv preprint arXiv:2012.06146, 2020.
[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised
     visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern
     recognition, pages 9729–9738, 2020.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
     In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[16] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with
     one-class collaborative filtering. In proceedings of the 25th international conference on world wide web,
     pages 507–517, 2016.
[17] Ruining He and Julian McAuley. Vbpr: visual bayesian personalized ranking from implicit feedback. In
     Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[18] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative
     filtering. In Proceedings of the 26th international conference on world wide web, pages 173–182, 2017.
[19] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommen-
     dations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
[20] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges-
     mundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International
     Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[21] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep
     structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM
     international conference on Information & Knowledge Management, pages 2333–2338, 2013.

[22] Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. In 2018 IEEE
     International Conference on Data Mining (ICDM), pages 197–206. IEEE, 2018.
[23] Walid Krichene and Steffen Rendle. On sampled metrics for item recommendation. In Proceedings of the
     26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1748–1757,
     2020.
[24] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer
     network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706–6713,
     2019.
[25] Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario Lucic, Yi Tay, Adrian Weller, and
     Mostafa Dehghani. Polyvit: Co-training vision transformers on images, videos and audio. arXiv preprint
     arXiv:2111.12993, 2021.
[26] Jian Liu, Pengpeng Zhao, Fuzhen Zhuang, Yanchi Liu, Victor S Sheng, Jiajie Xu, Xiaofang Zhou, and
     Hui Xiong. Exploiting aesthetic preference in deep cross networks for cross-domain recommendation. In
     Proceedings of The Web Conference 2020, pages 2768–2774, 2020.
[27] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin
     transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF
     International Conference on Computer Vision, pages 10012–10022, 2021.
[28] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships
     in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD
     International Conference on Knowledge Discovery & Data Mining, pages 1930–1939, 2018.
[29] Muyang Ma, Pengjie Ren, Yujie Lin, Zhumin Chen, Jun Ma, and Maarten de Rijke. π-net: A parallel
     information-sharing network for shared-account cross-domain sequential recommendations. In Proceedings
     of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval,
     pages 685–694, 2019.
[30] Tong Man, Huawei Shen, Xiaolong Jin, and Xueqi Cheng. Cross-domain recommendation: An embedding
     and mapping approach. In IJCAI, volume 17, pages 2464–2470, 2017.
[31] Yabo Ni, Dan Ou, Shichen Liu, Xiang Li, Wenwu Ou, Anxiang Zeng, and Luo Si. Perceive your users in
     depth: Learning universal user representations from multiple e-commerce tasks. In Proceedings of the 24th
     ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 596–605, 2018.
[32] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian
     personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618, 2012.
[33] Xiang-Rong Sheng, Liqin Zhao, Guorui Zhou, Xinyao Ding, Binding Dai, Qiang Luo, Siran Yang, Jingshan
     Lv, Chi Zhang, Hongbo Deng, et al. One model to serve all: Star topology adaptive recommender for
     multi-domain ctr prediction. In Proceedings of the 30th ACM International Conference on Information &
     Knowledge Management, pages 4104–4113, 2021.
[34] Kyuyong Shin, Hanock Kwak, Kyung-Min Kim, Minkyu Kim, Young-Jin Park, Jisu Jeong, and Se-
     ungjae Jung. One4all user representation for recommender systems in e-commerce. arXiv preprint
     arXiv:2106.00573, 2021.
[35] Kyuyong Shin, Hanock Kwak, Kyung-Min Kim, Su Young Kim, and Max Nihlen Ramstrom. Scal-
     ing law for recommendation models: Towards general-purpose user representations. arXiv preprint
     arXiv:2111.11294, 2021.
[36] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. Bert4rec: Sequential
     recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th
     ACM international conference on information and knowledge management, pages 1441–1450, 2019.
[37] Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. Progressive layered extraction (ple): A novel
     multi-task learning (mtl) model for personalized recommendations. In Fourteenth ACM Conference on
     Recommender Systems, pages 269–278, 2020.
[38] Aaron Van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive
     coding. arXiv e-prints, pages arXiv–1807, 2018.
[39] Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. Mm-rec: Multimodal news recommendation.
     arXiv preprint arXiv:2104.07407, 2021.
[40] Chuhan Wu, Fangzhao Wu, Tao Qi, Jianxun Lian, Yongfeng Huang, and Xing Xie. Ptum: Pre-training user
     model from unlabeled user behaviors via self-supervision. arXiv preprint arXiv:2010.01494, 2020.
[41] Chuhan Wu, Fangzhao Wu, Yang Yu, Tao Qi, Yongfeng Huang, and Xing Xie. Userbert: Contrastive user
     model pre-training. arXiv preprint arXiv:2109.01274, 2021.
[42] Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. Self-
     supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR
     Conference on Research and Development in Information Retrieval, pages 726–735, 2021.

[43] Fajie Yuan, Guibing Guo, Joemon M Jose, Long Chen, Haitao Yu, and Weinan Zhang. Lambdafm: learning
     optimal ranking with factorization machines using lambda surrogates. In Proceedings of the 25th ACM
     international on conference on information and knowledge management, pages 227–236, 2016.
[44] Fajie Yuan, Xiangnan He, Alexandros Karatzoglou, and Liguang Zhang. Parameter-efficient transfer from
     sequential behaviors for user modeling and recommendation. In Proceedings of the 43rd International
     ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1469–1478, 2020.
[45] Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M Jose, and Xiangnan He. A simple
     convolutional generative network for next item recommendation. In Proceedings of the Twelfth ACM
     International Conference on Web Search and Data Mining, pages 582–590, 2019.
[46] Fajie Yuan, Guoxiao Zhang, Alexandros Karatzoglou, Joemon Jose, Beibei Kong, and Yudong Li. One
     person, one model, one world: Learning continual user representation without forgetting. In Proceedings
     of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval,
     pages 696–705, 2021.
[47] Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Mah-
     eswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. Recommending what video to watch next: a multitask
     ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 43–51,
     2019.
[48] Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and
     Ji-Rong Wen. S3-rec: Self-supervised learning for sequential recommendation with mutual information
     maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge
     Management, pages 1893–1902, 2020.
[49] Feng Zhu, Yan Wang, Chaochao Chen, Guanfeng Liu, Mehmet Orgun, and Jia Wu. A deep framework for
     cross-domain and cross-system recommendations. arXiv preprint arXiv:2009.06215, 2020.
