GetJar Mobile Application Recommendations with Very Sparse Datasets

Kent Shi                                Kamal Ali
GetJar Inc.                             GetJar Inc.
San Mateo, CA, USA                      San Mateo, CA, USA
kent@getjar.com                         kamal@getjar.com
ABSTRACT
The Netflix competition of 2006 [2] spurred significant activity in the recommendations field, particularly in approaches using latent factor models [3, 5, 8, 12]. However, the near ubiquity of the Netflix dataset and the similar MovieLens datasets¹ may be narrowing the generality of lessons learned in this field. At GetJar, our goal is to make appealing recommendations of mobile applications (apps). For app usage, we observe a distribution that has higher kurtosis (heavier head and longer tail) than that of the aforementioned movie datasets. This happens primarily because of the large disparity in resources available to app developers and the low cost of app publication relative to movies.

In this paper we compare a latent factor model (PureSVD) and a memory-based model with our novel PCA-based model, which we call Eigenapp. We use both accuracy and variety as evaluation metrics. PureSVD did not perform well due to its reliance on explicit feedback such as ratings, which we do not have. Memory-based approaches that perform vector operations in the original high-dimensional space over-predict popular apps because they fail to capture the neighborhood of less popular apps. They have high accuracy due to the concentration of mass in the head, but did poorly in terms of the variety of apps exposed. Eigenapp, which exploits neighborhood information in low-dimensional spaces, did well both on precision and variety, underscoring the importance of dimensionality reduction for forming quality neighborhoods in high-kurtosis distributions.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

General Terms
Algorithms, Experimentation, Performance

Keywords
Recommender system, mobile application, evaluation, sparse data, PCA

1. INTRODUCTION
In the last few years, there has been a tremendous amount of growth in the mobile app space, particularly on the Android platform. As of January 2012, there are more than 400,000 apps hosted on Google's app store:² Google Play (formerly known as Android Market). However, Google Play provides little personalization beyond location-based tailoring of catalogs. That means all users from a given country will see the same list of apps regardless of their tastes and preferences.

Since most users typically navigate no more than a few pages when browsing the store, lack of personalization limits exposure for the majority of the apps. By analyzing the usage of apps on a sample of devices, we find that this space is dominated by a few apps, which unsurprisingly are ones that have been "featured" recently on the front page of Google Play.

GetJar, founded in 2004, is the largest free app store in the world. It provides mobile apps to users of all mobile platforms. We have recently begun to focus on the Android platform due to its openness and surging market share. Our goal is to become an attractive destination for Android apps by providing high quality personalization as a means to app discovery.

1.1 Challenges
While recommendation techniques, especially those using collaborative filtering, have been common since the early 1990s [6] and have been deployed on a number of e-commerce websites such as Amazon.com [9], recommendation in the emerging app domain is a task beset by unique challenges, mainly due to the greater kurtosis in the distribution of app usage data.

From anonymous usage data collected at GetJar, we find that there are a few well-known apps popular among a large number of users, but the vast majority of apps are rarely used by most users. Figure 1(a) shows a comparison of the data distribution between the movie (Netflix) and app (GetJar) domains. Note the plot only includes apps that have been recently used by GetJar users. This constitutes approximately 55,000 apps, or about 14% of all apps.

¹ http://www.grouplens.org/node/73
² http://www.distimo.com/blog/2012_01_google-android-market-tops-400000-applications

Figure 1: (a) Distribution of items (GetJar apps or Netflix movies) in terms of percentage of total users,
with items sorted by popularity. (b) Distributions of items plotted in log-log scale.

The movie at the first percentile (rank 177) is rated by 20% of Netflix users. In contrast, the app at the first percentile (rank 550) is used by only 0.6% of GetJar users. Furthermore, the movie at the first percentile has 42% as many users as the most popular movie, but the app at the first percentile has only 1.3% as many users as the most popular app. Therefore, even though there are over 400,000 available apps, in reality only a few thousand of them are being used in any significant sense.

The same data is plotted in Figure 1(b), this time using a log scale for both axes. We can see that the GetJar curve is almost a straight line in log-log space, indicating that the frequencies can be approximated by a Zipf distribution [17]. This figure definitively shows the qualitative difference in distribution: the app distribution is linear in log-log space whereas the movie distribution is not. Traditional collaborative filtering techniques [9, 14] and even the newer latent factor models [3, 5, 8, 12, 13] were not designed to handle this level of sparsity.
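The Zipf claim can be sanity-checked directly: for a Zipf-like distribution, regressing log frequency on log rank gives a near-straight line with slope close to -1. A minimal sketch, where the usage_counts array of per-app user counts is a hypothetical input rather than the GetJar data:

import numpy as np

def zipf_fit(usage_counts):
    """Least-squares fit of log(frequency) vs. log(rank).
    A slope near -1 with small residuals suggests a Zipf law."""
    freqs = np.sort(np.asarray(usage_counts, dtype=float))[::-1]
    freqs = freqs[freqs > 0]                      # drop unused items
    ranks = np.arange(1, freqs.size + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    resid = np.log(freqs) - (slope * np.log(ranks) + intercept)
    return slope, resid.std()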
There are at least three reasons for this difference. First, the disparity in available resources among app developers is larger than that among movie producers, mainly because the cost (in time and money) of publishing an app is much lower than that of releasing a movie. Second, due to the less mature nature of the smartphone space, most casual users are unaware of the full capabilities of their device or of what apps are available for it. This is in contrast to other domains such as movies, where there are numerous outlets dedicated to reviewing or promoting those products. Third, discovery mechanisms in the app space are less effective and mature compared to those of other domains.

Today, most app stores offer three ways for users to discover apps: (1) listings of apps sorted by the number of downloads or a similar trending metric, (2) category-based browsing, and (3) keyword-based searching. We know that the number of apps that can be exposed using listings is limited, and that methods 2 and 3 are not as effective as we would like. Browsing by category is only useful if the ontology of categories is rich, as in the case of Amazon. But most app stores rely on developers to categorize their own apps using a fixed inventory of labels. This leads to a small number of categories and a large number of apps within each, causing only the top few apps in each category to ever have significant visibility. Search is also ineffective because we find that most users don't know what to search for. About 90% of search queries at GetJar are titles (or close variants) of popular apps, which means search currently is not an effective vehicle for discovering new apps.

1.2 Goal and evaluation criteria
Users visit GetJar hoping to find interesting and useful apps. But as we have seen, common strategies such as browsing and searching, which have worked well for other e-commerce sites, don't work as well in domains where many items remain under-publicized. Our goal is to use personalization to help users find a greater variety of appealing apps. Our prototype recommendation system recommends a top-N list of apps to each user based on her recent app usage. We judge the quality of the recommendations primarily by accuracy, which represents the ability of the recommender to predict the presence of an app on the user's device. To increase the exposure of under-publicized apps, the recommender is also evaluated on its ability to recommend tail apps as well as on the variety of the apps it recommends.

A number of app stores currently offer personalized app recommendations, most notably the Apple App Store and the Amazon Appstore. However, little is known about how they generate their recommendations. Furthermore, we are unaware of any publications on mobile app recommendations.

The rest of the paper is organized as follows: Section 2 reviews how the data was collected and some of its properties; Section 3 provides details of the algorithms that we considered; Section 4 provides the experimental setup and results; and finally Sections 5 and 6 provide discussion and conclusions.
2. THE GETJAR DATA
The data we report upon in this paper comes from server log files at GetJar where all personally identifying information had been stripped, but information pertaining to a single source can be uniquely identified up to a common anonymous identifier. The apps we report on here include those hosted on GetJar as well as those on Google Play.

For the purposes of this study, we rely upon app usage data rather than installation data. The reason we choose not to use installation data is that it is a poor indicator of interest, since many app installations are experimental from a user's perspective. A significant fraction of our users are found to uninstall an app on the same day they installed it. There is also another significant fraction of users that have a vast number of installed apps that never get used.

Many users are new to the mobile app space and thus are likely experimenting with a variety of apps. We restrict our data to recent app usage to account for the fact that users' tastes in apps can change more rapidly than in traditional domains such as movies and music. We are only interested in recommending apps that reflect their current tastes or interests.

The observation period for the data used in this study is from November 7 to November 21, 2011. We find that varying the length of the observation period by a few days makes almost no difference in the number of apps used by the users.³ In an effort to reduce noise in the data from apps that were being trialed by users, we filtered out apps that were not used other than on the day of installation. We further cleaned the data by removing users that joined or left midway through the observation period and those that were not associated with a pre-determined list of legitimate devices. The resultant dataset contains 101,106 users. For each user we used the list of apps and the number of days each app was used during the observation period. The total number of unique apps used by all users during the interval satisfying our constraints was 55,020.

³ We use the more convenient word users to denote their anonymized identifiers.
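The cleaning steps described above are simple filters over a usage log. The following sketch illustrates the kind of pipeline involved; the paper does not publish its log schema, so the DataFrame columns (user, app, day_used, install_day) and the valid_users set are assumptions:

import pandas as pd

def clean_usage_log(log: pd.DataFrame, valid_users: set) -> pd.DataFrame:
    """Filters mirroring the cleaning described above; column names are assumed."""
    # Drop apps never used after their install day (trial noise).
    active = log[log["day_used"] > log["install_day"]]
    log = log[log["app"].isin(set(active["app"]))]
    # Keep only users present for the whole window on legitimate devices.
    log = log[log["user"].isin(valid_users)]
    # One row per (user, app) with the number of distinct days of usage.
    return (log.groupby(["user", "app"])["day_used"]
               .nunique()
               .rename("days_used")
               .reset_index())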
2.1 Data sparsity and long tail
As we have already illustrated in Figure 1, our data is extremely sparse and the vast majority of apps have low usage. While it is well known that sparsity and a long tail [1] are two characteristics of all e-commerce data, these are especially pronounced in our dataset.

Figure 2 plots the cumulative distribution of the items in terms of the total amount of usage. We can see that the GetJar dataset is far more head-heavy than the Netflix dataset, with the top 1% of apps accounting for 58% of usage, in contrast to Netflix where the top 1% of movies contribute 22% of all ratings. An even more selective subset, the 100 most popular apps, accounts for 30% of total app usage. For the GetJar dataset, we define the head to be the top 100 apps and the remaining apps to be the tail.

Figure 2: Cumulative distribution of items in terms of percentage of total usage; the curves can be viewed as the integrals of the curves in Figure 1.

One major reason for this difference is that many apps are used every day, but movies are seldom watched more than once or twice. Thus Netflix users may be more likely to explore new items relative to GetJar users. Another reason is that the Netflix data was collected over a much longer period of time. The longer tail in the GetJar dataset, as previously alluded to, is primarily due to the low cost of publishing apps compared to the cost of releasing a movie. This encourages developers to release as many apps as possible to increase the chances of their apps being discovered by search. This strategy often leads to apps being published multiple times with different titles but similar functionality. It also encourages the proliferation of a large number of apps tailored to very specific needs (e.g. ringtone apps dedicated to music by specific artists) as opposed to general apps (e.g. a single ringtone app containing music by all artists).

Given that we have little or no usage information on the bulk of the tail apps, recommending them is a very difficult task. In order to ensure that the recommended apps have a certain amount of support, for this study we limited our app population to apps with more than 20 users. This reduces the number of apps from 55,020 to 7,304. Even though this pruning process removed 87% of apps (or 98% if we include apps with no usage), it is noteworthy that only 9% of the total usage was thus eliminated from our modeling. Table 1 shows the size and density of the user-item matrices before and after our pruning. It shows that even after rejecting the bottom 87% of the apps, the GetJar* dataset is still much sparser than Netflix.

  Dataset    Users     Items    Usages/Ratings   Density
  GetJar     101,106   55,020   1.99M            0.04%
  GetJar*    101,031    7,304   1.82M            0.25%
  Netflix    480,189   17,770   100M             1.18%

Table 1: Size of the user-item matrices for the Netflix and GetJar datasets. GetJar* denotes the GetJar dataset including only apps that have been used by more than 20 users.

2.2 Usage versus ratings
Another difference between the GetJar dataset and the Netflix dataset is that movie ratings are explicit feedback on interest, whereas days of usage is implicit [11]. The benefit of an explicit rating system is that it is well-defined and standardized, thus generating a more accurate measurement of interest than implicit feedback such as days of usage. The latter can be influenced by a number of factors such as mood, unforeseen events, or logging errors. Furthermore, there is also a correlation between usage and category: we find that "social" apps are consistently the most heavily used apps among nearly all users. This is because "social" apps need to be used often in order to serve their purpose, whereas apps in categories such as "productivity" are seldom needed on a continuous basis. So while it is safe to assume that a user enjoyed a movie she rated highly relative to one she rated lowly, the same cannot be said for a user that used a "social" app more than a "productivity" app.

We choose not to use ratings because they have a number of drawbacks in the mobile app domain. Most importantly, they are very difficult to collect for a large number of users without forceful intervention. Furthermore, since users' tastes in apps may change and many app developers frequently update their apps with new features or functionality, ratings may become obsolete in as little as one month. Finally, observing ratings on Google Play, we find they are polarized, with the vast majority of ratings being either 1 or 5. This is likely due to fragmentation of the Android platform,⁴ resulting in most ratings being given based on whether the app worked (5) or not (1) for the user.

⁴ There are many manufacturers that produce Android devices with various hardware specifications and tweaks of the operating system. This makes it difficult for developers to test their apps on all devices, resulting in apps not working as intended on many devices.
Due to the influence of the Netflix competition, most research in the recommendations community has been geared toward rating prediction by means of minimizing root mean square error (RMSE). However, Cremonesi et al. [3] reported that improving RMSE does not translate into improvement in accuracy for the top-N task. On the Netflix and MovieLens datasets, the predictive accuracy of a naive most-popular list is comparable to that of sophisticated algorithms optimized for RMSE. We tried the same using the GetJar dataset, substituting days of usage for ratings, and found that algorithms optimized for RMSE actually performed far worse than a simple most-popular list.

With that said, days of usage can still be used for neighborhood approaches, provided there exists some correlation between it and interest. Part of this study is to evaluate the usefulness of this metric. Thus, for our experiments, we used two versions of the user-item matrix. In the first version, each cell represents the number of days the app was used, and in the second, each cell is a binary indicator of usage during the observation period. We would like to see whether the additional granularity provided by days of usage generates better recommendations than a binary indicator.
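Both versions of the matrix can be built from the same (user, app, days-used) triples. A sketch using SciPy sparse matrices, with index mappings assumed to be precomputed:

import numpy as np
from scipy.sparse import csr_matrix

def build_matrices(triples, n_users, n_items):
    """triples: (user_idx, item_idx, days_used) tuples. Returns (DAY, BIN)."""
    rows, cols, days = zip(*triples)
    R_day = csr_matrix((np.asarray(days, dtype=float), (rows, cols)),
                       shape=(n_users, n_items))
    R_bin = R_day.copy()
    R_bin.data[:] = 1.0          # any usage at all becomes a 1
    return R_day, R_bin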
3. MODELS
Two common recommendation approaches in use today are memory-based models and latent factor models. Memory-based models leverage the neighborhood of items in user space or that of users in item space. A user-user or item-item similarity matrix is computed for pairs, and recommendations are generated based on these similarities. Latent factor models are more sophisticated approaches in which the user-item matrix is decomposed via matrix factorization techniques such as Singular Value Decomposition (SVD). Latent factors are then extracted and used to generate predictions.

We evaluated both of the above approaches using our data. In addition, we developed a hybrid system using Principal Components Analysis (PCA), which we call Eigenapp. These three algorithms were also compared against a non-personalized baseline recommendation system that serves the most popular items.

3.1 Non-personalized models
Non-personalized models are those that serve the same list of items to all users. They commonly sort items by the number of purchases, profit margin, click-through rate (CTR), or other similar metrics. In this paper, our non-personalized baseline algorithm sorts items by popularity, where popularity is defined as the number of distinct users that have used the item during the observation period.
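This POP baseline reduces to column sums of the binary matrix from the earlier sketch. In code:

import numpy as np

def most_popular(R_bin, n=50):
    """Items ranked by number of distinct users (column sums of the 0/1 matrix)."""
    popularity = np.asarray(R_bin.sum(axis=0)).ravel()
    return np.argsort(-popularity)[:n]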
3.2 Memory-based models
There are two types of memory-based models: item-based and user-based. Item-based models find similarities between items, and for a given user they recommend items that are similar to items she already owns. User-based models find similarities between users, and for a given user they recommend items owned by her most similar users.

Computationally, item-based models are more scalable because there are usually far fewer items than users, as is the case in the mobile app space. In addition, there is research showing that item-based algorithms generally perform better than user-based algorithms [9, 14]. Hence, our memory-based model uses the item-based approach.

Two of the most common neighborhood similarity metrics in current use are the Pearson correlation coefficient and cosine similarity. The Pearson correlation coefficient is computed for a pair of items based on the set of users that have used both. Since the vast majority of our items reside in the long tail, many of those items are unlikely to share common users with most other items.

  Dataset            Number of Common Users
              0       1      2-10    11-20    > 20
  GetJar*   83.2%    9.1%    6.6%    0.6%     0.6%
  Netflix    0.2%    0.4%   33.8%   22.2%    43.3%

Table 2: Breakdown of the number of common users for the GetJar and Netflix datasets. For n items, the total number of item pairs is (n² − n)/2.

Table 2 presents the distribution of the number of common users in the GetJar and Netflix datasets. The table shows that 83.2% of item pairs in the GetJar dataset have zero users in common, whereas that same percentage for Netflix is 0.2%. For GetJar, more than 90% of item pairs have one or no common users, so it is impossible to compute correlations for these item pairs. In addition, the vast majority of the remaining item pairs share 10 or fewer users,
meaning that the sample correlation estimate is likely to be inaccurate due to poor support. In contrast, the published Netflix dataset has less than 1% of movie pairs sharing 1 or fewer common users, and about 65% of movie pairs share more than 10 common users. Since the Pearson correlation coefficient is undefined for 90% of our item pairs, we use cosine similarity.

Let R denote the m × n user-item matrix, where m is the number of users and n is the number of items. From R, we compute an item-item similarity matrix S, whose (i, j) entry is:

    s_{i,j} = \frac{r_{*,i} \cdot r_{*,j}}{\|r_{*,i}\|_2 \, \|r_{*,j}\|_2}    (1)

where r_{*,i} and r_{*,j} are the ith and jth columns, respectively, of R. Cosine similarity does not require items to share common users; in such cases it simply produces a similarity of 0. However, it still suffers from low overlap support. The closest neighbors of a less popular item will often occur by coincidence, simply because they are the only ones that produced non-zero similarity scores.

Using S, the affinity t_{u,i} between user u and item i is the sum of similarities between i and the items used by u:

    t_{u,i} = \sum_{j \in I_u} s_{i,j}    (2)

where I_u is the set of items used by u. For a given user, all items are sorted by their affinity score in order to produce a top-N list.⁵

⁵ In equation (2), users that use a greater number of items will have more summands, but since we are only interested in the relative order of items for a given user, the varying number of summands does not pose a problem.
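Since equation (1) is a normalized Gram matrix, S can be computed by scaling each column of R to unit length and multiplying. A sketch of this computation over a sparse users × items matrix (illustrative, not the authors' implementation):

from sklearn.preprocessing import normalize

def cosine_item_similarity(R):
    """Equation (1) for all item pairs at once; R is a sparse users x items matrix."""
    R_unit = normalize(R, norm="l2", axis=0)   # unit-length item columns
    return (R_unit.T @ R_unit).toarray()       # s[i, j] = cosine of columns i, j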
We made two slight modifications to the above method that produced better results. First, the item-item similarity scores s_{i,j} were normalized before being used in equation (2). Deshpande et al. [4] suggested a normalization such that the similarities sum to 1. However, we found that normalizing using the z-score worked much better for the GetJar dataset, producing the asymmetric similarity:

    \hat{s}_{i,j} = \frac{s_{i,j} - \bar{s}_{*,j}}{\sigma_{s_{*,j}}}    (3)

where \bar{s}_{*,j} is the average similarity to item j and \sigma_{s_{*,j}} is the standard deviation of similarities to item j. Second, for each candidate item i, instead of summing over all items in I_u, we considered only the l nearest items, namely those with the greatest normalized similarity scores to i. This has the effect of reducing noise by discarding items weakly related to the given i. For the GetJar dataset, we find that setting l = 5 seemed to work best.
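Combining equations (2) and (3) with the l-nearest-item cutoff yields a scorer like the following sketch, which is a direct reading of the text rather than the production code:

import numpy as np

def zscore_similarities(S):
    """Equation (3): z-score each column, giving an asymmetric similarity."""
    std = S.std(axis=0)
    std[std == 0] = 1.0                        # guard degenerate columns
    return (S - S.mean(axis=0)) / std

def top_n(S_hat, user_items, N=10, l=5):
    """Equation (2), summing only the l most similar of the user's items."""
    owned = list(user_items)
    scores = np.empty(S_hat.shape[0])
    for i in range(S_hat.shape[0]):
        sims = np.sort(S_hat[i, owned])        # ascending order
        scores[i] = sims[-min(l, sims.size):].sum()
    ranked = [i for i in np.argsort(-scores) if i not in set(owned)]
    return ranked[:N]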
3.3 Latent factor models
Latent factor models work by factorizing the user-item matrix R into two lower-rank matrices: user factors and item factors. These models are often used for rating prediction, where a rating r_{u,i} for user u on item i is predicted by taking the inner product of their respective vectors in the user factors and item factors. User bias and item bias are commonly removed by subtracting the row and column means from R prior to the factorization step. The biases are added back onto the inner product to generate the final prediction.

Examples of this approach include [5, 8, 12, 13]. We tried [5] and [13], substituting days of usage for ratings and then sorting the predictions to generate a top-N recommended list. The results were by far the worst of all algorithms, for reasons explained in Section 2.2. We expect similar results for other rating-prediction-based algorithms.

The only latent factor top-N algorithm we are aware of is PureSVD [3]. The algorithm works by replacing all missing values (those with no ratings) in R with 0, and then factorizing R via SVD:

    R = U \Sigma V^T    (4)

The affinity between user u and item i can then be computed as:

    t_{u,i} = r_{u,*} \cdot Q \cdot q_i^T    (5)

where Q consists of the top k singular vectors extracted from V and q_i is the row of Q corresponding to item i. Note that t_{u,i} is simply an association measure and not a predicted rating. A top-N list can then be made for user u by selecting the N items with the highest affinity scores to u.

PureSVD is the only latent factor algorithm we evaluated that was able to generate reasonable recommendations. The main reason for this is that, unlike the other algorithms, PureSVD is optimized not for RMSE-based rating prediction but rather for the relative ordering of items produced by the association scores.
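A PureSVD-style top-N list as described in [3] can be sketched with a truncated sparse SVD; the choice of k and the exclusion of already-used items follow the text, everything else is an assumption:

import numpy as np
from scipy.sparse.linalg import svds

def puresvd_topn(R, user_row, k=300, N=10):
    """Equations (4)-(5): truncated SVD of the zero-filled R, then t_u = r_u Q Q^T."""
    _, _, vt = svds(R.asfptype(), k=k)          # vt is k x n, so Q = vt.T
    Q = vt.T
    t = np.asarray(user_row @ Q @ Q.T).ravel()  # association scores, not ratings
    seen = set(user_row.nonzero()[1])
    return [i for i in np.argsort(-t) if i not in seen][:N]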
3.4 Eigenapp model
Of the two previously mentioned approaches, memory-based models yielded far better results despite only having neighborhoods for popular items. We want to improve the results of memory-based models by borrowing ideas from latent factor models. Along these lines, we used dimensionality reduction techniques to extract meaningful features for the items and then applied memory-based techniques to generate recommendations in this reduced space.

Our neighborhood is still item-based, but items are now represented using features instead of users. Similar to [3], we replace all missing values in R with 0. Given the large disparity in app frequencies, we normalized the item vectors to prevent the features from being based only on popular items. This is done by normalizing each column of R to have zero mean and length one: \sum_u r'_{u,i} = 0 and \sum_u (r'_{u,i})^2 = 1. We denote this new normalized user-item matrix as R' and apply PCA to R' for feature extraction.

PCA is performed via eigendecomposition of the covariance matrix C. C is computed by first calculating the mean item vector b, with b_u = \frac{1}{n} \sum_i r'_{u,i}, then removing the mean by forming the matrix A, where each cell a_{u,i} = r'_{u,i} - b_u, and finally computing C = A A^T. Note that C is an m × m matrix, with the number of users m likely to be very large. This makes eigendecomposition practically impossible in time and space. Observing that the number of items n is likely to be much lower, we used the same procedure as in Eigenface [16] to optimize the process. The procedure works by first conducting an eigendecomposition of the n × n matrix A^T A, obtaining eigenvectors v_j^* and eigenvalues \lambda_j^* such that for each j:

    A^T A v_j^* = \lambda_j^* v_j^*    (6)

Multiplying both sides by A, we get:

    A A^T (A v_j^*) = \lambda_j^* (A v_j^*)    (7)
                    Figure 3: (a) Precision-recall curves and (b) Recall at N curves using all users in the test set.

We see that the vectors v_j = A v_j^* are eigenvectors of C. From there, we normalize each v_j to length one and keep only the k eigenvectors with the highest corresponding eigenvalues. These eigenvectors represent the dimensions with the largest variances, i.e. the dimensions that best differentiate the items. Alternatively, the eigenvectors can be viewed as item features: items with similar projected values on a particular eigenvector are likely to be similar in certain attributes. We denote these eigenvectors as eigenapps. Finally, we can project all the items onto the reduced eigenspace by D = V A, where the rows of V are the k selected eigenvectors. D is a k × n matrix in which each column contains the projected values of an item onto each of the eigenapps. These values can be viewed as the coefficients or weights of the eigenapps for the items. Inspecting several rows of D, apps with high projected values on an eigenapp are often similar types of apps. This was useful as preliminary validation that the Eigenapp approach indeed captures latent item features.
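The whole construction, the Eigenface trick of equations (6) and (7) followed by the projection D, can be condensed into a few lines. A dense NumPy sketch for clarity; the real matrices are much larger:

import numpy as np

def eigenapp_projection(R_norm, k=300):
    """R_norm: m x n matrix whose columns have zero mean and unit length.
    Returns D (k x n), the items projected onto the top-k eigenapps."""
    b = R_norm.mean(axis=1, keepdims=True)     # mean item vector
    A = R_norm - b                             # remove the mean
    # Eigenface trick: decompose the small n x n Gram matrix, not the m x m C.
    evals, evecs = np.linalg.eigh(A.T @ A)     # eq. (6); eigenvalues ascending
    top = np.argsort(evals)[::-1][:k]
    V = A @ evecs[:, top]                      # v_j = A v*_j, per eq. (7); m x k
    V /= np.linalg.norm(V, axis=0)             # normalize each eigenapp to length 1
    return V.T @ A                             # D with eigenapps as rows of V^T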
Item-item similarities can be computed using equation (1), except that we use D instead of R. Since D is dense, similarity scores will likely be non-zero for all item pairs. Once the item-item similarity matrix S has been computed, the remainder of the algorithm is identical to the memory-based algorithm described in Section 3.2. We find that the neighborhoods computed in the reduced eigenspace are of much better quality than those computed by the memory-based methods in the non-reduced space. However, neighborhood quality is still better for popular items than for less popular items, likely due to better support. We also find that the quality of the neighborhoods improves as we increase the number of eigenapps used, and that the neighborhoods become relatively stable after k = 200.

The computational complexity of this algorithm, up to generating S, is O(mn²). Using the current GetJar dataset, that process took about 11 minutes on an Intel Core i7 machine using the Eigen library.⁶ However, since the computation of S is the offline phase of the recommender system, and the number of apps with some minimum amount of usage is unlikely to increase significantly with more users, we do not believe this will pose a problem.

Eigenapp is similar to another PCA-based algorithm, Eigentaste [7]. The main difference is that Eigentaste, which was evaluated on the Jester joke dataset,⁷ requires a gauge item set, where every user must have rated every item in the gauge set. Coming up with such a gauge set is impossible for our application, much less one that is representative. In addition, Eigentaste uses a user-based neighborhood approach to generate recommendations, whereas Eigenapp utilizes item-based neighborhoods.

⁶ http://eigen.tuxfamily.org
⁷ http://eigentaste.berkeley.edu/dataset

4. EVALUATION
We evaluated the four types of models from Section 3: non-personalized (POP), memory-based (MEM), PureSVD, and Eigenapp, using the GetJar dataset. The experiment is set up by randomly dividing the users into five equal-sized groups. Four of the groups are used for training, and the remaining one for evaluation. Using the training set, we compute the item-item similarity matrix S for MEM and Eigenapp, the item factor matrix Q for PureSVD, and the list of most popular items for POP. The number of eigenvectors used for Eigenapp and the number of singular vectors used for PureSVD are both 300. For each user in the test set, we sort the apps by install time. We feed the first M − 1 apps to the model to generate its recommendation list of N apps. Then we check whether the left-out app is in the recommended list (all algorithms exclude from their recommendation lists the M − 1 apps known to already be installed for the given user). This procedure is repeated for all 5 possible ways of dividing the user groups, allowing every group to be used as the evaluation group once, so that a recommendation list exists for every user.

Two forms of the user-item matrix R were considered for the experiments, as described in Section 2.2. The version using days of usage will be denoted DAY, and the binarized version will be denoted BIN.
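The protocol above can be skeletonized as follows. The fit and recommend callables are hypothetical stand-ins for training a model on four groups and producing a top-N list; users_apps is assumed to hold each user's apps sorted by install time:

import numpy as np

def recall_at_n(fit, recommend, users_apps, N=10, folds=5, seed=0):
    """Five-fold protocol: train on four groups, leave-one-out test on the fifth."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(users_apps))
    hits = total = 0
    for f in range(folds):
        test = set(order[f::folds].tolist())
        train = [apps for u, apps in enumerate(users_apps) if u not in test]
        model = fit(train)                       # refit on the other four groups
        for u in test:
            apps = users_apps[u]                 # sorted by install time
            if len(apps) < 2:
                continue
            given, held_out = apps[:-1], apps[-1]
            recs = recommend(model, given, N)    # must exclude the given apps
            hits += held_out in recs
            total += 1
    return hits / total                          # matches recall(N) in eq. (9)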


Figure 4: (a) Precision-recall curves and (b) Recall at N curves after removing the 100 most popular items.

Accuracy is the first evaluation criterion we use because we want our recommendations to be relevant to the user's interests and preferences. However, user satisfaction is not solely dependent on accuracy [10]. In particular, given the dominance of the popular apps in this domain, it is important to expose apps in the tail. With that in mind, we also evaluated the accuracy of the models in recommending tail apps, and the variety of the apps recommended.

4.1 Accuracy
The accuracies of the models were evaluated by the standard precision-recall methodology. Since we have only one relevant item to be predicted for each user (the left-out app), we set h_u equal to 1 if the relevant item is in the top-N list for user u and 0 otherwise. Precision and recall at each N are computed by:

    \text{precision}(N) = \frac{\sum_{u=1}^{m} h_u}{m \cdot N}    (8)

    \text{recall}(N) = \frac{\sum_{u=1}^{m} h_u}{m}    (9)

where m is the number of users.
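Because there is exactly one relevant item per user, equation (8) is just equation (9) divided by N, so the precision and recall curves carry the same information at each N. In code:

def precision_recall_at_n(hits, m, N):
    """hits: users whose held-out app appeared in their top-N; m: all users."""
    recall = hits / m          # eq. (9)
    precision = recall / N     # eq. (8), since each user has one relevant item
    return precision, recall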
Figure 3(a) illustrates the precision-recall curves for the algorithms. As we can see, the best performer was MEM, despite using an item-item similarity matrix consisting mostly of zeros. A close second was Eigenapp, followed by POP and PureSVD. Figure 3(b) illustrates the recall at each N, up to N = 50. This figure shows the percentage of users whose missing app was identified in the top-N. When N is 10, MEM identified the missing app for about 11% of users, Eigenapp for about 10%, and POP and PureSVD for about 7% and 4% of users, respectively.

The two types of user-item matrix (BIN and DAY) made little difference in the global accuracy of any of the three algorithms, indicating that the additional signal contributed by the number of days of usage does not outweigh its inaccuracies.

4.2 Accuracy of less popular items
Given the overwhelming exposure popular apps receive today in the Android ecosystem, many users will use them simply because those are the only apps they know. Thus using a popular app may not be a strong indicator of interest relative to less popular apps. In order to measure precision and recall on the tail, we redrew the precision-recall curves after excluding the 100 most popular apps from the recommended list of each user. Note, therefore, that h_u will always be 0 for users whose relevant items are among the 100 most popular apps; those users were removed from this experiment.

Figure 4(a) shows the precision-recall curves after removing the 100 most popular items. The figure shows that Eigenapp has the highest accuracy for this tail subset. MEM is now second, followed by PureSVD and POP. The recall at N shown in Figure 4(b) paints a similar picture, but it is worth noting that, relative to Figure 3(b), recall dropped for every algorithm with the exception of PureSVD. This shows it is more difficult to recommend relevant tail apps than head apps.

Using the two types of user-item matrix (BIN and DAY) still achieved similar performance for all three algorithms, but it appears Eigenapp and PureSVD yielded slightly better results using BIN rather than DAY.

4.3 Presentation
The impression that the recommended list makes on the user is also important to their satisfaction [10]. An artifact of our methodology for predicting the left-out item is that we penalize algorithms for predicting items that the user might have liked had she known about them. Since it is impossible for us to know which of the "irrelevant" items (those that do not correspond to the left-out item) in the top-N are potentially interesting, we can only judge the diversity of the items that are presented. In this study, we are interested in recommending a diverse list of apps from the entire popularity spectrum.

  Algorithm                 Popularity Rank
              1-50   51-100   101-500   501-1000   >1000
  POP         100%     0         0          0         0
  MM BIN       85%    5%        6%         2%        2%
  MM DAY       80%    6%        8%         3%        4%
  PS BIN       …
5. DISCUSSION
… list of apps. This is because all item vectors are normalized prior to applying PCA, so the usage of less popular apps can be captured by the top eigenvectors. That makes it possible for less popular apps to be among the closest neighbors of the popular apps. This is particularly important for the exposure of less popular apps because, given the dominance of the popular apps, only apps that are close to one of the popular apps can make frequent appearances at the top of the recommended lists. Using traditional memory-based models, the popular apps form a tight cluster (relative to the less popular apps) in the neighborhood, making it difficult for less popular apps to surface to the top of the recommended lists for many users.

6. CONCLUSION
With increasing numbers of people switching to smartphones, the mobile application space is an emerging domain for recommendation systems. Due to the wide disparity in resources among app publishers, the apps that large companies develop receive far more exposure than those developed by individual developers. This results in app usage being dominated by a few popular apps. The problem is further exacerbated by existing app stores using non-personalized ranking mechanisms. While that approach may help most users find high quality and essential apps quickly, it is less effective in recommending apps to users who are in an exploratory mode.

In this study, we used app usage as our metric. Given the characteristics of this data, we found that traditional memory-based approaches heavily favor popular apps, contrary to our mission. On the other hand, latent factor models that were developed based on the Netflix data performed quite poorly accuracy-wise. We find that the Eigenapp model performed best both in accuracy and in promotion of less well known apps in the tail of our dataset.

A system using the Eigenapp model is currently in internal trials at GetJar. It presents a personalized app list to users along with a non-personalized most-popular list. The first list is elicited when users are in an exploratory mode, and the second when they are looking for the most sought-after apps. We plan to open this system for general use in the second half of 2012. Simultaneously, we are working continuously to improve our system.

A limitation of the current model is that it includes only apps with a certain minimum of usage, a condition that most apps do not satisfy. While the set of apps included probably contains most of the potentially interesting ones, it is possible that we removed some interesting niche apps, or high quality apps by individual developers that were not exposed due to lack of marketing. The latter case is particularly important to us. We are currently exploring content-based models that extract useful features from app metadata, and we plan to combine the results of the collaborative and content-based approaches in future work.

7. ACKNOWLEDGEMENTS
The authors would like to thank Anand Venkataraman for guidance, edits, and help with revisions. Chris Dury provided valuable feedback, and Sunil Yarram helped during various stages of data preparation.

8. REFERENCES
[1] C. Anderson. The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006.
[2] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of KDD Cup and Workshop, pages 3–6, 2007.
[3] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pages 39–46, New York, NY, USA, 2010. ACM.
[4] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143–177, Jan. 2004.
[5] S. Funk. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.
[6] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Commun. ACM, 35(12):61–70, Dec. 1992.
[7] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr., 4(2):133–151, July 2001.
[8] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 426–434, New York, NY, USA, 2008. ACM.
[9] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7:76–80, 2003.
[10] S. M. McNee, J. Riedl, and J. A. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI '06 Extended Abstracts on Human Factors in Computing Systems, CHI EA '06, pages 1097–1101, New York, NY, USA, 2006. ACM.
[11] D. W. Oard and J. Kim. Implicit feedback for recommender systems. In Proceedings of the AAAI Workshop on Recommender Systems, pages 81–83, 1998.
[12] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, pages 39–42, 2007.
[13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender system – a case study. In Proceedings of the ACM WebKDD Workshop, 2000.
[14] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, WWW '01, pages 285–295, New York, NY, USA, 2001. ACM.
[15] G. Shani and A. Gunawardana. Evaluating recommendation systems. Recommender Systems Handbook, pages 257–297, 2011.
[16] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71–86, Jan. 1991.
[17] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
