One-Class Collaborative Filtering
Rong Pan (1), Yunhong Zhou (2), Bin Cao (3), Nathan N. Liu (3), Rajan Lukose (1), Martin Scholz (1), Qiang Yang (3)
(1) HP Labs, 1501 Page Mill Rd, Palo Alto, CA, 94304, US, {rong.pan,rajan.lukose,scholz}@hp.com
(2) Rocket Fuel Inc., Redwood Shores, CA 94065, yzhou@rocketfuelinc.com
(3) Hong Kong University of Science & Technology, Clear Water Bay, Kowloon, Hong Kong, {caobin, nliu, qyang}@cse.ust.hk
Abstract

Many applications of collaborative filtering (CF), such as news item recommendation and bookmark recommendation, are most naturally thought of as one-class collaborative filtering (OCCF) problems. In these problems, the training data usually consist simply of binary data reflecting a user’s action or inaction, such as page visitation in the case of news item recommendation or webpage bookmarking in the bookmarking scenario. Usually this kind of data is extremely sparse (a small fraction are positive examples), therefore ambiguity arises in the interpretation of the non-positive examples. Negative examples and unlabeled positive examples are mixed together and we are typically unable to distinguish them. For example, we cannot really attribute a user not bookmarking a page to a lack of interest or lack of awareness of the page. Previous research addressing this one-class problem only considered it as a classification task. In this paper, we consider the one-class problem under the CF setting. We propose two frameworks to tackle OCCF. One is based on weighted low rank approximation; the other is based on negative example sampling. The experimental results show that our approaches significantly outperform the baselines.

1 Introduction

Personalized services are becoming increasingly indispensable on the Web, ranging from providing search results to product recommendation. Examples of such systems include recommending products at Amazon.com¹, DVDs at Netflix², News by Google³, etc. The central technique used in these systems is collaborative filtering (CF), which aims at predicting the preference of items for a particular user based on the items previously rated by all users. The rating expressed in different scores (such as a 1-5 scale in Netflix) can be explicitly given by users in many of these systems. However, in many more situations, it can also be implicitly expressed by users’ behaviors such as click or not-click and bookmark or not-bookmark. These forms of implicit ratings are more common and easier to obtain.

Although the advantages are clear, a drawback of implicit rating, especially in situations of data sparsity, is that it is hard to identify representative negative examples. All of the negative examples and missing positive examples are mixed together and cannot be distinguished. We refer to collaborative filtering with only positive examples given as One-Class Collaborative Filtering (OCCF). OCCF occurs in different scenarios, with two examples as follows.

• Social Bookmarks: Social bookmarks are very popular in Web 2.0 services such as del.icio.us. In such a system, each user bookmarks a set of webpages which can be regarded as positive examples of the user’s interests. But two possible explanations can be made for the behavior that a user did not bookmark a webpage. The first one is that the page is of interest to the user but she did not see the page before; the second one is that the user had seen this page but it is not of interest to her. We cannot assume all the pages not in her bookmarks are negative examples. Similar examples include social annotation, etc.

¹ http://www.amazon.com
² http://www.netflix.com
³ http://news.google.com
• Clickthrough History: Clickthrough data are widely used for personalized search and search result improvement. Usually a triple <u, q, p> indicates a user u submitted a query q and clicked a page p. It is common that pages that have not been clicked on are not collected. Similar to the bookmark example, we cannot judge whether the page is not clicked because of the irrelevance of its content or redundancy, for example.

There are several intuitive strategies to attack this problem. One approach is to label negative examples to convert the data into a classical CF problem. But this is very expensive or even intractable because the users generating the preference data will not bear the burden. In fact, users rarely supply the ratings needed by traditional learning algorithms, specifically not negative examples [23]. Moreover, based on some user studies [14], if a customer is asked to provide many positive and negative examples before the system performs well, she would get a bad impression of it, and may decline to use the system. Another common solution is to treat all the missing data as negative examples. Empirically, this solution works well (see Section 4.6). The drawback is that it biases the recommendation results because some of the missing data might be positive. On the other hand, if we treat missing as unknown, that is, ignore all the missing examples, utilize the positive examples only, and feed them into CF algorithms that only model non-missing data (as in [24]), a trivial solution arising from this approach is that all the predictions on missing values are positive examples. All missing as negative (AMAN) and all missing as unknown (AMAU) are therefore two extreme strategies in OCCF.

In this paper, we consider how to balance the extent of treating missing values as negative examples. We propose two possible approaches to OCCF. These methods allow us to tune the tradeoff in the interpretation of so-called negative examples and actually result in better performing CF algorithms overall. The first approach is based on weighted low rank approximation [24]. The second is based on negative example sampling. They both utilize the information contained in unknown data and correct the bias of treating them as negative examples. While the weighting-based approach solves the problem deterministically, the sampling-based method approximates the exact solution with much lower computational costs for large scale sparse datasets.

Our contributions are summarized as follows. First, we propose two possible frameworks for the one-class collaborative filtering problem and provide and characterize their implementations; second, we empirically study various weighting and sampling approaches using several real world datasets. Our proposed solutions significantly outperform the two extremes (AMAN and AMAU) in OCCF problems, with at least 8% improvement over the best baseline approaches in our experiments. In addition, we show empirically that these two proposed solution frameworks (weighting and sampling based) for OCCF have almost identical performance.

The rest of the paper is organized as follows. In the next section, we review previous works related to the OCCF problems. In Section 3, we propose two approaches for OCCF problems. In Section 4, we empirically compare our methods to some baselines on two real world data sets. Finally, we conclude the paper and give some future works.

2 Related Works

2.1 Collaborative Filtering

In the past, many researchers have explored collaborative filtering (CF) from different aspects ranging from improving the performance of algorithms to incorporating more resources from heterogeneous data sources [1]. However, previous research on collaborative filtering still assumes that we have positive (high rating) as well as negative (low rating) examples. In the non-binary case, items are rated using scoring schemes. Most previous work focuses on this problem setting. In all the CF problems, there are a lot of examples whose rating is missing. In [2] and [19], the authors discuss the issue of modeling the distribution of missing values in collaborative filtering problems. Neither of them can handle the case where negative examples are absent.

In the binary case, each example is either positive or negative. Das et al. [8] studied news recommendation, where a click on a news story is a positive example and a non-click indicates a negative example. The authors compare some practical methods on this large scale binary CF problem. KDD Cup 2007 hosted a “Who rated What” recommendation task whose training data are the same as the Netflix prize dataset (with rating). The winning team [15] proposed a hybrid method combining both SVD and popularity using binary training data.

2.2 One-class Classification

Algorithms for learning from positive-only data have been proposed for binary classification problems. Some research addresses problems where only examples of the positive class are available [22] (referred to as one-class classification), whereas others also utilize unlabeled examples [16]. For one-class SVMs [22], the model describes the single class and is learned only from positive examples. This approach is similar to density estimation [4]. When unlabeled data are available, a strategy to solve one-class classification problems is to use EM-like algorithms to iteratively predict the negative examples and learn the classifier [28, 17, 26]. In [9], Denis shows that function classes learnable under the statistical query model are also learnable from positive and unlabeled examples if each positive example is left unlabeled with a constant probability.

The difference between our research and previous studies on learning from one-class data is that they aim at learning one single concept with positive examples. In this paper, we are exploring collaboratively learning many concepts in a social network.

3.1 Problem Definition

Suppose we have $m$ users and $n$ items and the previous viewing information stored in a matrix $R$. Each element of $R$ takes value 1, which represents a positive example, or ‘?’, which indicates an unknown (missing) positive or negative example. Our task is to identify potential positive examples from the missing data based on $R$. We refer to it as One-Class Collaborative Filtering (OCCF). Note that, in this paper, we assume that we have no additional information about users and items besides $R$. In this paper, we use bold capital letters to denote a matrix. Given a matrix $A$, $A_{ij}$ represents its element, $A_{i\cdot}$ indicates the $i$-th row of $A$, $A_{\cdot j}$ symbolizes the $j$-th column of $A$, and $A^T$ stands for the transpose of $A$.
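As a small illustration of the problem definition and notation above, the following sketch (not part of the original paper) stores the matrix R for a toy problem, with `None` standing in for ‘?’. The helper names `build_occf_matrix`, `row`, `column` and `transpose`, and the toy pairs, are hypothetical and chosen only for this example.

```python
def build_occf_matrix(positive_pairs, m, n):
    """Return R as an m x n list of lists: 1 for a positive example, None for '?'."""
    R = [[None] * n for _ in range(m)]
    for i, j in positive_pairs:
        R[i][j] = 1
    return R

def row(A, i):
    """A_{i.}: the i-th row of A."""
    return list(A[i])

def column(A, j):
    """A_{.j}: the j-th column of A."""
    return [r[j] for r in A]

def transpose(A):
    """A^T: rows and columns swapped."""
    return [list(c) for c in zip(*A)]

# Hypothetical toy data: 3 users, 4 items, 4 observed positives.
pairs = [(0, 0), (0, 2), (1, 1), (2, 2)]
R = build_occf_matrix(pairs, m=3, n=4)

# The OCCF task ranks the '?' entries; these are the candidates.
candidates = [(i, j) for i in range(3) for j in range(4) if R[i][j] is None]
density = len(pairs) / (3 * 4)
```

Here every entry that is not a recorded positive stays unknown rather than being set to 0, which is the point of the one-class setting: nothing in R itself says which of the `candidates` are true negatives.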
2.3 Class Imbalance Problem

Our work is also related to the class imbalance problem, which typically occurs in classification tasks with more instances of some classes than others. The one-class problem can be regarded as one extreme case of a class imbalance problem. Two strategies are used for solving the class imbalance problem. One is at the data level. The idea is to use sampling to re-balance the data [3] [18]. Another one is at the algorithmic level, where cost-sensitive learning is used [10] [27]. A comparison of the two strategies can be found in [20].

3 Weighting & Sampling based Approaches

As discussed above, AMAN and AMAU (no missing as negative) are two general strategies for collaborative filtering, which can be considered to be two extremes. We will argue that there can be some methods in between that can outperform the two strategies in OCCF problems; examples include “all missing as weak negative” or “some missing as negative”. In this section, we introduce two different approaches to address the issue of one-class collaborative filtering. They both balance between the strategies of missing as negative and missing as unknown. The first method uses weighted low rank approximations [24]. The idea is to give different weights to the error terms of positive examples and negative examples in the objective function; the second one is to sample some missing values as negative examples based on some sampling strategies. We first formulate the problem and introduce the main notation in this paper. In the next two subsections, we discuss the two solutions in detail.

3.2 wALS for OCCF

Our first approach to tackle the one-class collaborative filtering problem is based on a weighted low-rank approximation [11, 24] technique. In [24], weighted low-rank approximation (wLRA) is applied to a CF problem with a naive weighting scheme assigning “1” to observed examples and “0” to missing (unobserved) values, which corresponds to AMAU. Another naive method for OCCF is to treat all missing values as negative examples. However, because there are positive examples in missing values, this treatment can make mistakes. We address this issue by using low weights on the error terms. Next, we propose the weighted alternating least squares (wALS) for OCCF. We further discuss various weighting schemes different from the naive schemes AMAU ([24]) and AMAN.

Given a matrix $R = (R_{ij})_{m \times n} \in \{0, 1\}^{m \times n}$ with $m$ users and $n$ items and a corresponding non-negative weight matrix $W = (W_{ij})_{m \times n} \in \mathbb{R}_{+}^{m \times n}$, weighted low-rank approximation aims at approximating $R$ with a low rank matrix $X = (X_{ij})_{m \times n}$ minimizing the objective of a weighted Frobenius loss function as follows.

$$\mathcal{L}(X) = \sum_{ij} W_{ij} \left( R_{ij} - X_{ij} \right)^2. \tag{1}$$

In the above objective function $\mathcal{L}(X)$ (Eq. (1)), $(R_{ij} - X_{ij})^2$ is the common square error term often seen in low-rank approximations, and $W_{ij}$ reflects the contribution of minimizing the term to the overall objective $\mathcal{L}(X)$. In OCCF, we set $R_{ij} = 1$ for positive examples; for missing values, we posit that most of them are negative examples. We replace all the missing values with zeros. As we have high confidence on the observed positive examples where $R_{ij} = 1$, we set the corresponding weights $W_{ij}$ to 1. In contrast, different from the simple treatment of missing as negative, we lower the weights on “negative” examples. Generally we set $W_{ij} \in [0, 1]$ where $R_{ij} = 0$. Before discussing the weighting schemes for “negative” examples, we show how to solve the optimization problem $\arg\min_X \mathcal{L}(X)$ effectively and efficiently.

Consider the decomposition $X = U V^T$ where $U \in \mathbb{R}^{m \times d}$ and $V \in \mathbb{R}^{n \times d}$. Note that usually the number of features $d \ll r$, where $r \approx \min(m, n)$ is the rank of the matrix $R$. Then we can re-write the objective function (Eq. (1)) as Eq. (2):

$$\mathcal{L}(U, V) = \sum_{ij} W_{ij} \left( R_{ij} - U_{i\cdot} V_{j\cdot}^T \right)^2. \tag{2}$$

To prevent overfitting, one can append a regularization term to the objective function $\mathcal{L}$ (Eq. (2)):

$$\mathcal{L}(U, V) = \sum_{ij} W_{ij} \left( R_{ij} - U_{i\cdot} V_{j\cdot}^T \right)^2 + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right), \tag{3}$$

or

$$\mathcal{L}(U, V) = \sum_{ij} W_{ij} \left( \left( R_{ij} - U_{i\cdot} V_{j\cdot}^T \right)^2 + \lambda \left( \|U_{i\cdot}\|_F^2 + \|V_{j\cdot}\|_F^2 \right) \right). \tag{4}$$

In Eq. (3) and Eq. (4), $\|\cdot\|_F$ denotes the Frobenius norm and $\lambda$ is the regularization parameter which, in practical problems, is determined with cross-validation. Note that Eq. (4) subsumes the special case of regularized low-rank approximation in [21, 30]. Zhou et al. [30] show that the alternating least squares (ALS) [11] approach is efficient for solving these low rank approximation problems. In this paper, we extend this approach to weighted ALS (wALS). Now we focus on minimizing the objective function $\mathcal{L}$ (Eq. (4)) to illustrate how wALS works.

Taking partial derivatives of $\mathcal{L}$ with respect to each entry of $U$ and $V$, we obtain

$$\frac{1}{2} \frac{\partial \mathcal{L}(U, V)}{\partial U_{ik}} = \sum_{j} W_{ij} \left( U_{i\cdot} V_{j\cdot}^T - R_{ij} \right) V_{jk} + \lambda \Big( \sum_{j} W_{ij} \Big) U_{ik}, \quad \forall\, 1 \le i \le m,\ 1 \le k \le d. \tag{5}$$

Then we have

$$\frac{1}{2} \frac{\partial \mathcal{L}(U, V)}{\partial U_{i\cdot}} = \frac{1}{2} \left( \frac{\partial \mathcal{L}(U, V)}{\partial U_{i1}}, \ldots, \frac{\partial \mathcal{L}(U, V)}{\partial U_{id}} \right) = U_{i\cdot} \Big( V^T \widetilde{W}_{i\cdot} V + \lambda \Big( \sum_{j} W_{ij} \Big) I \Big) - R_{i\cdot} \widetilde{W}_{i\cdot} V,$$

where $\widetilde{W}_{i\cdot} \in \mathbb{R}^{n \times n}$ is a diagonal matrix with the elements of $W_{i\cdot}$ on the diagonal, and $I$ is a $d \times d$ identity matrix.

Fixing $V$ and solving $\partial \mathcal{L}(U, V) / \partial U_{i\cdot} = 0$, we have

$$U_{i\cdot} = R_{i\cdot} \widetilde{W}_{i\cdot} V \Big( V^T \widetilde{W}_{i\cdot} V + \lambda \Big( \sum_{j} W_{ij} \Big) I \Big)^{-1}, \quad \forall\, 1 \le i \le m. \tag{6}$$

Notice that the matrix $V^T \widetilde{W}_{i\cdot} V + \lambda \big( \sum_{j} W_{ij} \big) I$ is strictly positive definite, thus invertible. It is not difficult to prove that without regularization, $V^T \widetilde{W}_{i\cdot} V$ can be a degenerate matrix which is not invertible.

Similarly, given a fixed $U$, we can solve $V$ as follows.

$$V_{j\cdot} = R_{\cdot j}^T \widetilde{W}_{\cdot j} U \Big( U^T \widetilde{W}_{\cdot j} U + \lambda \Big( \sum_{i} W_{ij} \Big) I \Big)^{-1}, \quad \forall\, 1 \le j \le n, \tag{7}$$

where $\widetilde{W}_{\cdot j} \in \mathbb{R}^{m \times m}$ is a diagonal matrix with the elements of $W_{\cdot j}$ on the diagonal.

Based on Eq. (6) and Eq. (7), we propose the following iterative algorithm for wLRA with regularization (based on Eq. (4)). We first initialize the matrix $V$ with Gaussian random numbers with zero mean and small standard deviation (we use 0.01 in our experiments). Next, we update the matrix $U$ as per Eq. (6) and then update the matrix $V$ based on Eq. (7). We repeat these iterative update procedures until convergence. We summarize the above process in Algorithm 1, which we refer to as Weighted Alternating Least Squares (wALS). Note that for the objective function Eq. (3) with a uniform regularization term, we only need to change both $\big( \sum_{j} W_{ij} \big)$ in Eq. (6) and $\big( \sum_{i} W_{ij} \big)$ in Eq. (7) to 1.

Algorithm 1 Weighted Alternating Least Squares (wALS)
Require: data matrix $R$, weight matrix $W$, rank $d$
Ensure: Matrices $U$ and $V$ with ranks of $d$
  Initialize $V$
  repeat
    Update $U_{i\cdot}$, $\forall i$ with Eq. (6)
    Update $V_{j\cdot}$, $\forall j$ with Eq. (7)
  until convergence.
  return $U$ and $V$

3.2.1 Weighting Schemes: Uniform, User Oriented, and Item Oriented

As we discussed above, the matrix $W$ is crucial to the performance of OCCF. $W = 1$ is equivalent to the case of AMAN with the bias discussed above. The basic idea of correcting the bias is to let $W_{ij}$ involve the credibility of the training data ($R$) that we use to build a collaborative filtering model. Positive examples have a relatively high likelihood of being true. We let $W_{ij} = 1$ for each pair $(i, j)$ such that $R_{ij} = 1$. For missing data, it is very likely that most of them are negative examples. For instance, in social bookmarking, a user has very few web pages and tags; for news recommendation, a user does not read most of the news. That is why previous studies make the AMAN assumption although it biases the recommendations. However, we notice that the confidence of missing values being negative is not as high as that of non-missing values being positive. Therefore, essentially, we should give lower weights to the “negative” examples. The first weighting scheme assumes that a missing data being a negative example has an equal chance over all users or all items, that is, it uniformly assigns a weight $\delta \in [0, 1]$ to “negative” examples. The second weighting scheme posits that if a user has more positive examples, it is more likely that she does not like the other items, that is, the missing data for this user is negative with higher probability. The third weighting scheme assumes that if an item has fewer positive examples, the missing data for this item is negative with higher probability. We summarize these three schemes in Table 1. A parameter for the three schemes is the ratio of the sum of the positive example weights to the sum of the negative example weights. We will discuss the impact of the parameter in Section 4.6. In the future, we plan to explore more weighting schemes and learn the weight matrix $W$ iteratively.

Table 1. Weighting Schemes
                 Pos Examples     “Neg” Examples
Uniform          $W_{ij} = 1$     $W_{ij} = \delta$
User-Oriented    $W_{ij} = 1$     $W_{ij} \propto \sum_{j} R_{ij}$
Item-Oriented    $W_{ij} = 1$     $W_{ij} \propto m - \sum_{i} R_{ij}$

3.3 Sampling-based ALS for OCCF

As we state above, for one-class CF, a naive strategy is to assume all missing values to be negative. This implicit assumption of most of the missing values being negative roughly holds in most cases. However, the main drawback here is that the computational costs are very high when the size of the rating matrix $R$ is large. wALS has the same issue. We will analyze its computational complexity in the next subsection. Another critical issue with the naive strategy is the imbalanced-class problem discussed in Section 2.3.

In this subsection, we present a stochastic method based on negative example sampling for OCCF as shown in Figure 1. In phase I, we sample negative examples from missing values. Based on an assumed probability distribution, we generate a new matrix $\tilde{R}^{(i)}$ including all positive examples in $R$. In phase II, for each $\tilde{R}^{(i)}$, we re-construct the rating matrix $\hat{R}^{(i)}$ by a special version of wALS which we discuss in Section 3.2. Finally, we combine all the $\hat{R}^{(i)}$ with equal weights, generating a matrix $\hat{R}$ which approximates $R$. We refer to this method as sampling ALS Ensemble (sALS-ENS).

[Figure 1: diagram. Phase I (Negative Example Sampling) derives $\tilde{R}^{(1)}, \tilde{R}^{(2)}, \ldots, \tilde{R}^{(\ell)}$ from $R$; phase II (Matrix Reconstruction by ALS) produces $\hat{R}^{(1)}, \hat{R}^{(2)}, \ldots, \hat{R}^{(\ell)}$, which are combined into $\hat{R}$.]
Figure 1. A diagram that illustrates an ensemble based on negative example sampling for OCCF

3.3.1 Sampling Scheme

Since there are too many negative examples (compared to positive ones), it is costly and not necessary to learn the model on all entries of $R$. The idea of sampling can help us to solve the OCCF problem. We use a fast ($O(q)$) random sampling algorithm [25] to generate new training data $\tilde{R}$ from the original training data $R$ by negative example sampling given a sampling probability matrix $\hat{P}$ and negative sample size $q$. As OCCF is a class-imbalanced problem, where positive examples are very sparse, we transfer all positive examples to the new training set. We then sample negative examples from missing data based on $\hat{P}$ and the negative sample size $q$.

In this algorithm, $\hat{P}$ is an important input. In this paper, we provide three solutions which correspond to the following sampling schemes:

1. Uniform Random Sampling: $\hat{P}_{ij} \propto 1$. All the missing data share the same probability of being sampled as negative examples.

2. User-Oriented Sampling: $\hat{P}_{ij} \propto \sum_{j} I[R_{ij} = 1]$, that is, if a user has viewed more items, those items that she has not viewed could be negative with higher probability.

3. Item-Oriented Sampling: $\hat{P}_{ij} \propto 1 / \sum_{i} I[R_{ij} = 1]$, which means that if an item is viewed by fewer users, those users that have not viewed the item will not view it either.
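The pieces introduced in Sections 3.2 to 3.3.1 can be tied together in a short sketch. It is an illustration under stated assumptions, not the authors' implementation: it solves the row updates of Eq. (6) and Eq. (7) as small linear systems, builds the weights of Table 1, and draws negative examples according to the three sampling schemes. All function names (`solve`, `update_rows`, `objective`, `wals`, `weight_matrix`, `sample_negatives`) and the toy matrix are hypothetical. Further assumptions: a fixed number of rounds replaces the convergence test of Algorithm 1, the proportional weights are scaled so that the largest “negative” weight equals δ, the sampler uses Efraimidis-Spirakis keys as a simple stand-in for the O(q) algorithm of [25], and item counts of zero are clamped to 1 in the item-oriented scheme.

```python
import random

def solve(A, b):
    """Solve the d x d system A x = b by Gauss-Jordan elimination with pivoting."""
    d = len(b)
    M = [A[k][:] + [b[k]] for k in range(d)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(d):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[k][d] / M[k][k] for k in range(d)]

def update_rows(R, W, F, lam):
    """Eq. (6) for every row of R with the other factor F held fixed.
    Passing the transposed R and W instead gives Eq. (7)."""
    d = len(F[0])
    rows = []
    for r_row, w_row in zip(R, W):
        w_sum = sum(w_row)
        if w_sum == 0:  # no weighted entry in this row (this guard is an assumption)
            rows.append([0.0] * d)
            continue
        A = [[lam * w_sum if a == b else 0.0 for b in range(d)] for a in range(d)]
        rhs = [0.0] * d
        for r, w, f in zip(r_row, w_row, F):
            for a in range(d):
                rhs[a] += w * r * f[a]
                for b in range(d):
                    A[a][b] += w * f[a] * f[b]
        rows.append(solve(A, rhs))
    return rows

def objective(R, W, U, V, lam):
    """Regularized weighted loss of Eq. (4)."""
    total = 0.0
    for i, u in enumerate(U):
        for j, v in enumerate(V):
            err = R[i][j] - sum(a * b for a, b in zip(u, v))
            reg = sum(a * a for a in u) + sum(b * b for b in v)
            total += W[i][j] * (err * err + lam * reg)
    return total

def wals(R, W, d, lam=0.05, rounds=10, seed=0):
    """Algorithm 1, run for a fixed number of rounds (an assumption)."""
    rng = random.Random(seed)
    V = [[rng.gauss(0.0, 0.01) for _ in range(d)] for _ in R[0]]
    Rt, Wt = [list(c) for c in zip(*R)], [list(c) for c in zip(*W)]
    history = []
    for _ in range(rounds):
        U = update_rows(R, W, V, lam)
        V = update_rows(Rt, Wt, U, lam)
        history.append(objective(R, W, U, V, lam))
    return U, V, history

def weight_matrix(R, scheme, delta):
    """Table 1 weights; scaling the largest "negative" weight to delta is an assumption."""
    m, n = len(R), len(R[0])
    user_pos = [sum(r) for r in R]
    item_pos = [sum(c) for c in zip(*R)]
    raw = {"uniform": lambda i, j: 1.0,
           "user": lambda i, j: float(user_pos[i]),
           "item": lambda i, j: float(m - item_pos[j])}[scheme]
    peak = max(raw(i, j) for i in range(m) for j in range(n) if R[i][j] == 0) or 1.0
    return [[1.0 if R[i][j] == 1 else delta * raw(i, j) / peak for j in range(n)]
            for i in range(m)]

def sample_negatives(R, scheme, q, rng):
    """Keep all positives plus q sampled missing entries; return the 0/1 training mask."""
    m, n = len(R), len(R[0])
    user_pos = [sum(r) for r in R]
    item_pos = [sum(c) for c in zip(*R)]
    prob = {"uniform": lambda i, j: 1.0,
            "user": lambda i, j: float(user_pos[i]),
            "item": lambda i, j: 1.0 / max(item_pos[j], 1)}[scheme]
    # Efraimidis-Spirakis keys: the q largest u ** (1 / p) form a weighted sample.
    keyed = [(rng.random() ** (1.0 / prob(i, j)), i, j)
             for i in range(m) for j in range(n) if R[i][j] == 0 and prob(i, j) > 0]
    chosen = {(i, j) for _, i, j in sorted(keyed, reverse=True)[:q]}
    return [[1.0 if R[i][j] == 1 or (i, j) in chosen else 0.0 for j in range(n)]
            for i in range(m)]

# Hypothetical toy data: 4 users x 5 items, missing entries already replaced by 0.
R = [[1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0],
     [1, 0, 0, 0, 1]]
W = weight_matrix(R, "user", delta=0.3)
U, V, history = wals(R, W, d=2)                # weighting-based variant
mask = sample_negatives(R, "uniform", q=5, rng=random.Random(1))
U_s, V_s, history_s = wals(R, mask, d=2)       # one sampled predictor
```

Because each update solves its least-squares subproblem exactly, the recorded objective values in `history` should not increase from one round to the next, which is a convenient sanity check. Running `wals` on the 0/1 `mask` treats the sampled entries as the only observed data, which is one way to read the "special version of wALS" used in phase II.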
3.3.2 Bagging We use two test datasets to compare our proposed
algorithms with possible baselines. The first dataset,
After generating a new training matrix by the above the Yahoo news dataset, is a news click through record
algorithm, we can use a low-rank matrix R b to approx-
e e b can stream4 . Each record is a user-news pair which con-
imate R using wALS. Because R is stochastic, R sists of the user id and the URL of the Yahoo news
also be biased and unstable. A practical solution to the article. After preprocessing to make sure that the
problem is to construct an ensemble. In particular, we same news always gets the same article id, we have
use the bagging technique [6] (Algorithm 2). 3158 unique users and 1536 identical news stories. The
second dataset is from a social bookmarking site. It
Algorithm 2 Bagging Algorithm for OCCF is crawled from http://del.icio.us. The data contains
Require: matrix R ∈ Rm×n , matrix P b ∈ Rm×n , sam- 246, 436 posts with 3000 users and 2000 tags.
ple size q, number of single predictor ℓ
Ensure: Reconstructed matrix R b 4.2 Experiment setting and Evaluation
for i = 1 : ℓ do Measurement
Generate a new training matrix R fi by negative
example sampling (Section 3.3.1 ) As a most frequently used methodology in machine
Reconstruct Rci from Rfi by ALS [30] learning and data mining, we use cross-validation to
end for estimate the performance of different algorithms. The
b = 1 PR
ℓ
ci validation datasets are randomly divided into training
R
ℓ i=1 and test sets with a 80/20 splitting ratio. The train-
return R b ing set contains 80% known positive examples and the
other elements of the matrix are treated as unknown.
The test set includes the other 20% known positive and
3.4 Computational Complexity Analysis all unknown examples. Note that the known positives
in the training set are excluded in the test process.
Next, we analyze the running time of wALS and The intuition of good performance of a method is that
sALS-ENS. Recall that U is a m × d matrix, V is a the method has high probabilities to rank known posi-
n×d matrix, and d ≤ min{m, n}. For wALS, tives over unknown examples most of which are usually
each step negative examples. We evaluate the performance on
of updating U (or M ) takes time O d2 nm (based on
Eqs. 6 and 7). The total running time of wALS is the test set using MAP and half-life utility which will
be discussed below. We repeat the above procedure
O d2 nt mn 20 times and report both mean and standard devia-
tion of the experimental results. The parameters of
assuming that it takes nt rounds to stop. our approaches and the baselines are determined by
For sALS-ENS similarly, assuming that ALS takes cross-validation.
on-average nt rounds to stop, and the number of NES MAP (Mean Average Precision) is widely used in
predictors is ℓ, then its total running time is information retrieval for evaluating the ranked docu-
ments over a set of queries. We use it in this paper
O d2 ℓnt (nr (1 + α) + (m + n) d) .
to assess the overall performance based on precisions
In practice, nt , ℓ, α are small constants (nt ranges at different recall levels on a test set. It computes the
from 20 to 30, ℓ ≤ 20, and α ≤ 5), and (n + m) d ≤ mean of average precision (AP) over all users in the test
nr (1 + α). Therefore the running time of wALS set, where AP is the average of precisions computed at
is O(mn), while the running time of sALS-ENS is all positions with a preferred item:
O(nr ). Thus sALS-ENS is more scalable to large- PN
prec (i) × pref (i)
scale sparse data compared to wALS. To be pre- APu = i=1 , (8)
cise, # of preferred items
the running time ratio of wALS to sALS-ENS
mn where i is the position in the rank list, N is the number
is ℓ(nr (1+α)+(n+m)d) . When the data is very sparse
nr
of retrieved items, prec (i) is the precision (fractions of
mn ≪ 1 , then ALS-ENS takes less time to finish than
wALS; otherwise, wALS is faster. 4 We thank “NielsenOnline” for providing the clickstream data.retrieved items that are preferred by the user) of a cut- algorithm, namely the one-class SVM [22] into our pool
off rank list from 1 to i, and pref (i)is a binary indicator of baseline methods. The idea is to create a one-class
returning 1 if the i-th item is preferred or 0 otherwise. SVM classifier for every item, which takes a user’s rat-
Half-life Utility (HLU): Breese et al. [5] introduced ings on the remaining set of items as input features
a half-life utility [13] (“cfaccuracy” [12]) to estimate and predicts if the user’s rating on the target item is
of how likely a user will view/choose an item from a positive or negative. The set of training instances for
ranked list, which assumes that the user will view each a SVM classifier consists of the rating profiles of those
consecutive item in the list with an exponential decay users who have rated the target item, which should
of possibility. A half-life utility over all users in a test consists of positive examples only in the one-class col-
set is defined as in Eq. (9). laborative filtering setting, which could be used to train
P a one-class SVM classifier for each target item.
Ru
R = 100 P u max , (9)
R
u u 4.4 Impact of Number of Features
where Ru is the expected utility of the ranked list for
user u and Rumax is the maximally achievable utility if
Figure 2 shows the impact of the number of features
all true positive items are at the top of the ranked list.
(parameter d) on SVD and wALS. We can see for SVD,
According to [5], Ru is defined as follows.
the performance will first increase and then drop as we
X δ (j) increase the number of features. But for wALS, the
Ru = , (10) performance is much more stable and keeps increas-
j
2(j−1)(β−1)
ing. The performance of wALS usually converge at
where δ (j) equals 1 if the item at position j is pre- around 50 features. In our following experiments, we
ferred by the user and 0 otherwise, and β is the half-life will use the optimal feature number for SVD (10 for
parameter which is set to 5 in this paper, which is the Yahoo news data and 16 for user-tag data) and wALS
same as in [5]. (50).
4.3 Baselines 4.5 Sampling and Weighting Approaches
We evaluate our approaches weighting/sampling
negative examples by comparing with two categories In Section 3, we introduced two approaches to
of baselines treating all missing as negative (AMAN) OCCF based on sampling and weighting. For each ap-
or treating all missing as unknown (AMAU). proach, we proposed three types of schemes, that is uni-
form, user-oriented and item-oriented. In this section,
4.3.1 AMAN we will compare the three schemes for both sampling
and weighting.
In AMAN settings, most traditional collaborative fil- Table 2 compares these schemes. Among the weight-
tering algorithms can be directly applied. In this ing schemes, the user-oriented weighting scheme is the
paper, we use several well-known collaborative filter- best and the item-oriented weighting scheme is the
ing algorithms combined with the AMAN strategy worst. The uniform weighting lies in between. This
as our baselines, which include the alternating least may be due to the imbalance between the number
squares with the missing as negative assumption (ALS- of users and the number of items. For the current
AMAN), singular value decomposition (SVD) [29], and datasets, the number of users is much larger than the
a neighborhood-based approach including user-user number of items.
similarity[1] and item-item similarity algorithms[1].
4.6 Comparison with Baselines
4.3.2 AMAU
Following the AMAU strategy, it is difficult to adapt Figure 3 shows the performance comparisons of dif-
traditional collaborative filtering algorithms to obtain ferent methods based on the missing as unknown strat-
non-trivial solutions, as we discussed in Section 1. In egy (Popularity and SVM), methods based on the miss-
this case, ranking items by their overall popularity is a ing as negative strategy (SVD and ALS-AMAN) and
simple but widely used recommendation method. An- our proposed methods(wALS,
P sALS).
P The x-axes α is
other possible approach is to convert the one-class col- defined as α = ( ij∈Rij =0 Wij )/( ij∈Rij =1 Wij ). For
laborative filtering problem into a one-class classifica- comparison consideration, given α, the negative sam-
tion problem. In this paper, we also include one such pling parameter q in Algorithm 2 (sALS-ENS) is set11 15
10 14
13
9
12
8
MAP
HLU
11
7
10
6
SVD 9 SVD
wALS−Uniform wALS−Uniform
5 wALS−UserOriented 8 wALS−UserOriented
wALS−ItemOriented wALS−ItemOriented
4 7
0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
# of Features # of Features
(a) Yahoo news data
14 28
13 26
12 24
MAP
HLU
11 22
10 20
SVD SVD
wALS−Uniform wALS−Uniform
9 18
wALS−UserOriented wALS−UserOriented
wALS−ItemOriented wALS−ItemOriented
8 16
0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
# of Features # of Features
(b) User-Tag data
Figure 2. Impact of number of features
Yahoo News User-Tag
sALS-ENS wALS sALS-ENS wALS
MAP HLU MAP HLU MAP HLU MAP HLU
Uniform 9.55 13.98 9.56 14.05 13.33 25.81 13.35 25.89
UserOriented 9.80 14.42 9.82 14.39 13.44 25.90 13.48 25.94
ItemOriented 9.42 13.76 9.41 13.83 13.03 25.32 13.07 25.43
Table 2. Comparisons of different weighting and sampling schemes

to α × n_r, where n_r indicates the total number of positive examples. The baseline popularity does not change with the parameter α, so it is shown as a horizontal line in the figures. The same holds for the baselines SVD and ALS-AMAN. It can be seen from the figures that our methods outperform all the methods based on the missing as negative and missing as unknown strategies.

Parameter α controls the proportion of negative examples. As α → 0, the methods approach the AMAU strategies, and as α → 1, the methods approach the AMAN strategies. We can clearly see that the best results lie in between. That is to say, both weighting and sampling methods outperform the baselines. The weighting approach slightly outperforms the sampling approach. But as indicated in Table 3, the sampling approach is much more efficient when α is relatively small.

          wALS      sALS     sALS-ENS
α = 1    904.00    32.58     651.67
α = 2    904.00    37.45     749.13

Table 3. Runtime (seconds) of wALS, sALS and sALS-ENS with 20 sALS combinations on the Yahoo News data.

We can also see that, compared to the AMAU strategies, the missing as negative strategy is more effective. This is because, although the label information of unlabeled examples is unknown, we still have the prior knowledge that most of them are negative examples. Disregarding such information does not lead to competitive recommendations. This is somewhat different from the conclusion in class imbalance classification problems, where discarding examples of the dominating class often produces better results [7].

5 Conclusion and Future Work

Motivated by the fact that negative examples are often absent in many recommender systems, we formulate the problem of one-class collaborative filtering (OCCF). We show that some simple solutions such as the “all missing as negative” and “all missing as unknown” strategies both bias the predictions. We correct the bias in two different ways, negative example weighting and negative example sampling. Experimental results show that our methods outperform state of the art algorithms on real life data sets including social bookmarking data from del.icio.us and a Yahoo news dataset.

For future work, we plan to address the problem of how to determine the parameter α. We also plan to test other weighting and sampling schemes and to identify the optimal scheme. We also plan to study the relationship between wALS and sALS-ENS theoretically.

References

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. TKDE, 17(6):734–749, 2005.
[2] Y. Azar, A. Fiat, A. R. Karlin, F. McSherry, and J. Saia. Spectral analysis of data. In STOC, pages 619–626. ACM, 2001.
[3] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20–29, 2004.
[4] S. Ben-David and M. Lindenbaum. Learning distributions by their density levels: a paradigm for learning without a teacher. J. Comput. Syst. Sci., 55:171–182, 1997.
[5] J. S. Breese, D. Heckerman, and C. M. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In UAI, pages 43–52. Morgan Kaufmann, 1998.
[6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[7] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations, 6:1–6, 2004.
[8] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271–280. ACM, 2007.
[9] F. Denis. PAC learning from positive statistical queries. In ALT, LNCS 1501, pages 112–126, 1998.
[10] C. Drummond and R. Holte. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Proc. ICML’2003 Workshop on Learning from Imbalanced Data Sets II, 2003.
[11] K. R. Gabriel and S. Zamir. Lower rank approximation of matrices by least squares with any choice of weights. Technometrics, 21(4):489–498, 1979.
[12] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. M. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. JMLR, 1:49–75, 2000.
[13] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. TOIS, 22(1):5–53, 2004.
[14] D. Kelly and J. Teevan. Implicit feedback for inferring user preference: a bibliography. SIGIR Forum, 37:18–28, 2003.
[15] M. Kurucz, A. A. Benczur, T. Kiss, I. Nagy, A. Szabo, and B. Torma. Who rated what: a combination of SVD, correlation and frequent sequence mining. In Proc. KDD Cup and Workshop, 2007.
[Figure 3: MAP and HLU plotted against α (1 to 16) for USER_USER, ITEM_ITEM, SVD, ALS-AMAN, sALS, sALS-ENS and wALS-UserOriented.]
(a) Yahoo news data. POP: MAP 3.4528, HLU 5.3461. OCSVM: MAP 2.2545, HLU 2.5958.
(b) User-Tag data. POP: MAP 7.315, HLU 15.7703. OCSVM: MAP 3.2500, HLU 3.6260.
Figure 3. Impact of the ratio of negative examples to positive examples (α).
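As a rough illustration of the quantity varied in Figure 3, the sketch below computes α as the total weight on missing entries divided by the total weight on positive entries, and the sample size α × n_r used to set q for sALS-ENS. It is a minimal sketch under stated assumptions, not the authors' implementation: R and W are assumed to be small dense lists of lists, the function names are hypothetical, and rounding α × n_r to the nearest integer is an assumption the text does not specify.

```python
def negative_to_positive_ratio(R, W):
    """Return alpha = (sum of W_ij where R_ij = 0) / (sum of W_ij where R_ij = 1).

    R is a 0/1 matrix (1 = observed positive, 0 = missing) and W holds the
    corresponding confidence weights. Both are lists of rows (an assumption).
    """
    pos = 0.0
    neg = 0.0
    for r_row, w_row in zip(R, W):
        for r, w in zip(r_row, w_row):
            if r == 1:
                pos += w
            else:
                neg += w
    if pos == 0:
        raise ValueError("alpha is undefined without positive examples")
    return neg / pos


def negative_sample_size(R, alpha):
    """Return alpha * n_r, where n_r is the number of positive examples.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    n_r = sum(1 for row in R for r in row if r == 1)
    return round(alpha * n_r)


# Toy example: 2 users, 3 items, weight 1.0 on positives, 0.5 on missing.
R = [[1, 0, 0],
     [0, 1, 0]]
W = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5]]

alpha = negative_to_positive_ratio(R, W)  # (4 * 0.5) / (2 * 1.0) = 1.0
q = negative_sample_size(R, alpha)        # 1.0 * 2 = 2
print(alpha, q)
```

Matching the sample size to α × n_r is what makes the weighting and sampling curves comparable at the same value on the α axis.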
[16] W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, pages 448–455. AAAI Press, 2003.
[17] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In ICDM, pages 179–188. IEEE Computer Society, 2003.
[18] X.-Y. Liu, J. Wu, and Z.-H. Zhou. Exploratory under-sampling for class-imbalance learning. In ICDM, pages 965–969. IEEE Computer Society, 2006.
[19] B. Marlin, R. Zemel, S. Roweis, and M. Slaney. Collaborative filtering and the missing at random assumption. In UAI, 2007.
[20] B. Raskutti and A. Kowalczyk. Extreme re-balancing for SVMs: a case study. SIGKDD Explorations, 6:60–69, 2004.
[21] R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, pages 791–798. ACM, 2007.
[22] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[23] I. Schwab, A. Kobsa, and I. Koychev. Learning user interests through positive examples using content analysis and collaborative filtering, 2001. Internal Memo.
[24] N. Srebro and T. Jaakkola. Weighted low-rank approximations. In ICML, pages 720–727. AAAI Press, 2003.
[25] J. S. Vitter. Faster methods for random sampling. Commun. ACM, 27(7):703–718, 1984.
[26] G. Ward, T. Hastie, S. Barry, J. Elith, and J. Leathwick. Presence-only data and the EM algorithm. Biometrics, 2008.
[27] G. M. Weiss. Mining with rarity: a unifying framework. SIGKDD Explorations, 6:7–19, 2004.
[28] H. Yu, J. Han, and K. C.-C. Chang. PEBL: positive example based learning for web page classification using SVM. In KDD, pages 239–248. ACM, 2002.
[29] S. Zhang, W. Wang, J. Ford, F. Makedon, and J. Pearlman. Using singular value decomposition approximation for collaborative filtering. In CEC, pages 257–264. IEEE Computer Society, 2005.
[30] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In AAIM, LNCS 5034, pages 337–348, 2008.