Multidimensional Characterization of Expert Users in the
                     Yelp Review Network ∗

                        Cheng Han Lee                                            Sean Massung
               Department of Computer Science                          Department of Computer Science
           University of Illinois at Urbana-Champaign              University of Illinois at Urbana-Champaign
                      clee17@illinois.edu                                  massung1@illinois.edu

ABSTRACT
In this paper, we propose a multidimensional model that integrates text analysis, temporal information, network structure, and user metadata to effectively predict experts from a large collection of user profiles. We make use of the Yelp Academic Dataset, which provides us with a rich social network of bidirectional friendships, full review text including formatting, timestamped activities, and user metadata (such as votes and other information) in order to analyze and train our classification models. Through our experiments, we hope to develop a feature set that can be used to accurately predict whether a user is a Yelp expert (also known as an 'elite' user) or a normal user. We show that each of the four feature types is able to capture a signal that a user is an expert user. In the end, we combine all feature sets together in an attempt to raise the classification accuracy even higher.

Keywords
network mining, text mining, expert finding, social network analysis, time series analysis

1.  INTRODUCTION
Expert finding seeks to locate users in a particular domain that have more qualifications or knowledge (expertise) than the average user. Usually, the number of experts is very low compared to the overall population, making this a challenging problem. Expert finding is especially important in medical, legal, and even governmental situations. In our work, we focus on the Yelp academic dataset [8] since it has many rich features that are unavailable in other domains. In particular, we have full review content, timestamps, the friend graph, and user metadata; this allows us to use techniques from text mining, time series analysis, social network analysis, and classical machine learning.

From a user's perspective, it's important to find an expert reviewer to give a fair or useful review of a business that may be a future destination. From a business's perspective, expert reviewers should be great summarizers and able to explain exactly how to improve their store or restaurant. In both cases, it's much more efficient to find the opinion of an expert reviewer than to sift through hundreds of thousands of potentially useless or spam reviews.

Yelp is a crowd-sourced business review site as well as a social network, consisting of several objects: users, reviews, and businesses. Users write text reviews accompanied by a star rating for businesses they visit. Users also have bidirectional friendships as well as one-directional fans. We consider the social network to consist of the bidirectional friendships, since each user consents to the friendship of the other user. Additionally, popular users are much less likely to know their individual fans, making this connection much weaker. Each review object is annotated with a time stamp, so we are able to investigate trends temporally.

The purpose of this work is to investigate and analyze the Yelp dataset and find potentially interesting patterns that we can exploit in our future expert-finding system. The key question we hope to answer is:

    Given a network of Yelp users, who is an elite user?

To answer the above question, we have to first address the following:

  1. How does the text in expert reviews differ from text in normal reviews?

  2. How does the average number of votes per review for a user change over time?

  3. Are elite users the first to review a new business?

  4. Does the social network structure suggest whether a user is an elite user?

  5. Does user metadata available from Yelp have any indication about a user's status?

∗Submitted as the semester project for CS 598hs Fall 2014.
The structure of this paper is as follows: in section 2, we discuss related work. In sections 3, 4, 5, and 6, we discuss the four different dimensions of the Yelp dataset. For the first three feature types, we use text analysis, temporal analysis, and social network analysis respectively. The user metadata is already in a quantized format, so we simply overview the fields available. Section 7 details running experiments on the proposed features on balanced (number of experts is equal to the number of normal users) and unbalanced (number of experts is much less) data. Finally, we end with conclusions and future work in section 8.

2.  RELATED WORK
RankClus [13] integrates clustering and ranking on heterogeneous information networks. Within each cluster, a ranking of nodes is created, so the top-ranked nodes could be considered experts for a given cluster. For example, consider the DBLP bibliographic network. Clusters are formed based on authors who share coauthors, and within each cluster there is a ranking of authoritative authors (experts in their field).

Clustering and ranking are defined to mutually enhance each other, since conditional rank is used as a clustering feature and cluster membership is used as an object feature. In order to determine the final configuration, an expectation-maximization algorithm is used to iteratively update cluster and ranking assignments.

This work is relevant to our Yelp dataset if we consider clusters to be the business categories, and experts to be domain experts. However, the Yelp categories are not well-defined, since some category labels overlap, so some extra processing may be necessary to deal with this issue.

Expert Finding in Social Networks [3] considers Facebook, LinkedIn, and Twitter as domains where experts reside. Instead of labeling nodes from the entire graph as experts, a subset of candidate nodes is considered, and they are ranked according to an expertise score. These expertise scores are obtained through link relation types defined on each social network, such as creates, contains, annotates, owns, etc.

To rank experts, they used a vector-space retrieval model common in information retrieval [10] and evaluated with popular IR metrics such as MAP, MRR, and NDCG [10]. Their vector space consisted of resources, related entities, and expertise measures. They concluded that profile information is a less useful determiner for expertise than their extracted relations, and that resources created by others including the target are also quite useful.

A paper on "Expertise Networks" [17] begins with a large study analyzing a question and answer forum; typical centrality measures such as PageRank [11] and HITS [9] are used to initially find expert users. Then, other features describing these expert users are defined or extracted in order to create an "ExpertiseRank" algorithm, which (as far as we can tell) is essentially PageRank. This algorithm was then evaluated by human raters, and it was found that ExpertiseRank had slightly smaller errors than the other measures (including HITS, but it was not evaluated against PageRank).

While the result of ExpertiseRank is unsurprising, we would be unable to directly use it or PageRank, since the Yelp social network is undirected; running PageRank on an undirected network approximates degree centrality.

A paper on Expert Language Models [1] builds two different language models by invoking Bayes' Theorem. The conditional probability of a candidate given a specific query is estimated by representing it using a multinomial probability distribution over the vocabulary terms. A candidate model θ_ca is inferred for each candidate ca, such that the probability of a term used in the query given the candidate model is p(t|θ_ca). For one of the models, they assumed that the document and the candidate are conditionally independent; for the other model, they use the probability p(t|d, ca), which is based on the strength of the co-occurrence between a term and a candidate in a particular document.

In terms of the modeling techniques used above, we can adopt a similar method whereby the candidate in the Expert Language Models [1] is a Yelp user, and we will determine the extent to which a review characterizes an elite or normal user.

For the paper on interesting YouTube commenters [5], the goal is to determine a real scalar value corresponding to each conversation to measure its interestingness. The model comprises detecting conversational themes using a mixture model approach, determining 'interestingness' of participants and conversations based on a random walk model, and lastly, establishing the consequential impact of 'interestingness' via different metrics. The paper could be useful to us for characterizing reviews and Yelp users in terms of 'interestingness'. An intuitive conjecture is that 'elite' users should be ones with a high 'interestingness' level and, likewise, they should post reviews that are interesting.

In summary, Fig 1 shows a comparison of the related work surveyed and which aspects of the dataset each examined.

    Paper                        Text?    Time?    Graph?
    Sun et al. 2009 [13]                           X
    Bozzon et al. 2013 [3]                         X
    Zhang et al. 2007 [17]                         X
    Choudhury et al. 2009 [5]    X        X
    Balog et al. 2009 [1]        X
    Ehrlich et al. 2007 [6]      X                 X

Figure 1: Comparison of features used in previous work.

3.  TEXT ANALYSIS
The text analysis examines the reviews written by each user in order to extract features from the unstructured text content. Common text processing techniques such as indexing, categorization, and language modeling are explored in the next sections.

3.1  Datasets
First, we preprocessed the text by lowercasing, removing stop words, and performing stemming with the Porter2 English stemmer. Text was tokenized as unigram bag-of-words by the MeTA toolkit¹.

1 http://meta-toolkit.github.io/meta/
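To make this preprocessing concrete, here is a minimal sketch of an equivalent pipeline in Python. It is an illustration only: our experiments used MeTA, and NLTK's Snowball ("Porter2") English stemmer and stop word list are assumed here as stand-ins (NLTK requires a one-time nltk.download("stopwords")).

    from collections import Counter

    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer  # Snowball English = "Porter2"

    STOP = set(stopwords.words("english"))
    STEMMER = SnowballStemmer("english")

    def to_unigram_bag(review_text):
        """Lowercase, drop stop words, stem, and count unigram tokens."""
        tokens = review_text.lower().split()
        return Counter(STEMMER.stem(t) for t in tokens if t not in STOP)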
    Dataset    Docs       Len_avg    |V|        Raw    Index
    All        659,248    81.8       164,311    480    81
    Elite      329,624    98.8       125,137    290    46
    Normal     329,624    64.9       95,428     190    37

Figure 2: Comparison of the three text datasets of Yelp reviews in terms of corpus size, average document length (in words), vocabulary size, raw data size, and indexed data size (both in MB).

    Confusion Matrix
              classified as elite    classified as normal
    elite     0.665                  0.335
    normal    0.252                  0.748

    Class      F1 Score    Precision    Recall
    Elite      0.694       0.665        0.725
    Normal     0.718       0.748        0.691
    Total      0.706       0.706        0.708

Figure 3: Confusion matrix and classification accuracy on normal vs elite reviews.
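As a quick arithmetic check of Fig 3 (our own calculation, using the harmonic-mean definition of the F1 score given in section 3.2), the elite row is internally consistent:

    F_1(elite) = \frac{2 \cdot 0.665 \cdot 0.725}{0.665 + 0.725} \approx 0.694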
We created three datasets that are used in this section:

  • All: contains all the elite reviews (reviews written by an elite user) and an equal number of normal reviews

  • Elite: contains only reviews written by elite users

  • Normal: contains only reviews written by normal users

Elite and Normal together make up All; this is to ensure the analyses run on these corpora have a balanced class distribution. Overall, there were 1,125,458 reviews, consisting of 329,624 elite reviews and 795,834 normal reviews. Thus, the normal reviews were randomly shuffled and truncated to 329,624.

The number of elite users is far smaller than the ratio of written reviews may suggest; this is because elite users write many more reviews on average than normal users. A summary of the three datasets is found in Fig 2.

3.2  Classification
We tested how easy it is to distinguish between an elite review and a non-elite (normal) review with a simple supervised classification task. We used the dataset All described in the previous section, along with each review's true label, to train an SVM classifier. Evaluation was performed with five-fold cross validation and had a baseline of 50% accuracy. Results of this categorization experiment are displayed in Fig 3.

The confusion matrix tells us that it was slightly easier to classify normal reviews, though the overall accuracy was acceptable at just over 70%. Precision and recall had opposite maximums for normal and elite, though overall the F1 scores were similar. Recall that the F1 score is the harmonic mean of precision P and recall R:

    F_1 = \frac{2PR}{P + R}

Since this is just a baseline classifier, we expect it is possible to achieve higher accuracy using more advanced features such as word n-grams or grammatical features like part-of-speech tags or parse tree productions. However, this initial experiment is to determine whether elite and non-elite reviews can be separated based on text alone, with no regard to the author or context. Since the accuracy of this default model is 70%, it seems that text will make a useful subset of the overall features used to predict expertise.

Furthermore, remember that this classification experiment is not about whether a user is elite, but rather whether a review has been written by an elite user; it would be very straightforward to extend this problem to classify users instead, where each user is a combination of all reviews that he or she writes. In fact, this is what we do in section 7, where we are concerned with identifying elite users.

3.3  Language Model
We now turn to the next text analysis method: unigram language models. A language model is simply a distribution of words given some context. In our case, we define three language models, each based on a corpus described in section 3.1.

The background language model (or "collection" language model) simply represents the All corpus. We define a smoothed collection language model p_C(w) as

    p_C(w) = \frac{count(w, C) + 1}{|C| + |V|}

This creates a distribution p_C(·) over each word w ∈ V. Here, C is the corpus and V is the vocabulary (each unique word in C), so |C| is the total number of words in the corpus and |V| is the number of unique terms.

The collection language model essentially shows the probability of a word occurring in the entire corpus. Thus, we can sort the outcomes in the distribution by their assigned probabilities to get the most frequent words. Unsurprisingly, these words are the common stop words with no real content information. However, we will use this background model to filter out words specific to elite or normal users.

We now define another unigram language model to represent the probability of seeing a word w in a corpus θ ∈ {elite, normal}. We create a normalized language model score per word using the smoothed background model defined previously:

    score(w, \theta) = \frac{count(w, \theta) / |\theta|}{p_C(w)} = \frac{count(w, \theta)}{p_C(w) \cdot |\theta|}
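These two models are simple enough to state in a few lines of code. The following is a minimal sketch (our illustration, not MeTA code); all_tokens and theta_tokens are assumed to be the preprocessed unigram token lists from section 3.1.

    from collections import Counter

    def background_model(all_tokens):
        """Add-one smoothed collection model p_C(w) over the All corpus."""
        counts = Counter(all_tokens)
        total, vocab_size = sum(counts.values()), len(counts)
        return lambda w: (counts[w] + 1) / (total + vocab_size)

    def top_scored_words(theta_tokens, p_c, k=20):
        """Rank words by score(w, theta) = count(w, theta) / (p_C(w) * |theta|)."""
        counts = Counter(theta_tokens)
        size = sum(counts.values())
        scores = {w: c / (p_c(w) * size) for w, c in counts.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]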
    Background    Normal     Elite
    the           gorsek     uuu
    and           forks)     aloha!!!
    a             yu-go      **recommendations**
    i             sabroso    meter:
    to            (***       **summary**
    was           eloff      carin
    of            -/+        no1dp
    is            jeph       (lyrics
    for           deirdra    friends!!!!!
    it            ruffin'    **ordered**
    in            josefa     8/20/2011
    that          ubox       rickie
    my            waite      kuge
    with          again!!    ;]]]
    but           optionz    #365
    this          ecig       g
    you           nulook     *price
    we            gtr        visits):
    they          shiba      r
    on            kenta      ik

Figure 4: Top 20 tokens from each of the three language models.

The goal of the language model score is to find unigram tokens that are very indicative of their respective categories; using a language model this way can be seen as a form of feature selection. Fig 4 shows a comparison of the top twenty words from each of the three models.

These default language models did not reveal very clear differences in word usage between the two categories, despite the elite users using a larger vocabulary, as shown in Fig 2. The singular finding was that the elite language model shows that its users are more likely to segment their reviews into different sections, discussing different aspects of the business: for example, recommendations, summary, ordered, or price.

Also, it may appear that there are a good deal of nonsense words in the top words from each language model. However, upon closer inspection, these words are actually valid given some domain knowledge of the Yelp dataset. For example, the top word "gorsek" in the normal language model is the last name of a normal user who always signs his posts. Similarly, "sabroso" is a Spanish word meaning delicious that a particular user likes to use in his posts. Similar arguments can be made for other words in the normal language model. In the elite model, "uuu" was originally "\uuu/", an emoticon that an elite user is fond of. "No1DP" is a Yelp username that is often referred to by a few other elite users in their review text.

Work on supervised and unsupervised review aspect segmentation has been done before [15, 16], and it may be applicable in our case since there are clear boundaries in aspect mentions. Another approach would be to add a boolean feature has_aspects that detects whether a review is segmented in the style popular among elite users.

3.4  Typographical Features
Based partly on the experiments performed in section 3.3, we now define typographical features of the review text. We call a feature 'typographical' if it is a trait that can't be detected by a unigram words tokenizer and is indicative of the style of review writing.

We use the following six style or typographical features; a sketch of how they can be extracted appears at the end of this section.

  • Average review length. We calculate review length as the number of whitespace-delimited tokens in a review. Average review length is simply the average of this count across all of a user's reviews.

  • Average review sentiment. We used sentiment valence scores [12] to calculate the sentiment of an entire review. The sentiment valence score is < 0 if the overall sentiment is negative and > 0 if the overall sentiment is positive.

  • Paragraph rate. Based on the language model analysis, we included a feature to detect whether paragraph segmentation was used in a review. We simply count the rate of multiple newline characters per review per user.

  • List rate. Again based on the language model analysis, we add this feature to detect whether a bulleted list is included in the review. We defined a list as the beginning of a line followed by '*' or '-' before alpha characters.

  • All caps. The rate of words in all capital letters. We suspect very high rates of capital letters will indicate spam or useless reviews.

  • Bad punctuation. Again, this feature is to detect less serious reviews in an attempt to find spam. A basic example of bad punctuation is not starting a new sentence with a capital letter.

Although the number of features here is low, we hope that the added meaning behind each one is more informative than a single unigram words feature.
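Below is a minimal sketch of how five of these six features can be extracted (our illustration only; the sentiment feature is omitted since it requires the external valence lexicon of [12], and the exact patterns used in our experiments may differ).

    import re

    def typographical_features(reviews):
        """Style features for one user's list of raw review strings,
        mirroring the definitions above (assumes at least one review)."""
        n = len(reviews)
        words = [w for r in reviews for w in r.split()]
        return {
            # average review length in whitespace-delimited tokens
            "avg_length": len(words) / n,
            # rate of paragraph breaks (multiple newlines) per review
            "paragraph_rate": sum(len(re.findall(r"\n\s*\n", r)) for r in reviews) / n,
            # bulleted lines: '*' or '-' at line start before alpha characters
            "list_rate": sum(len(re.findall(r"(?m)^\s*[*-] ?[A-Za-z]", r)) for r in reviews) / n,
            # fraction of words written in all capital letters
            "all_caps_rate": sum(w.isupper() and len(w) > 1 for w in words) / len(words),
            # crude bad-punctuation signal: a sentence starting with a lowercase letter
            "bad_punct_rate": sum(len(re.findall(r"[.!?]\s+[a-z]", r)) for r in reviews) / n,
        }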
4.  TEMPORAL ANALYSIS
In this section, we look at how features change temporally by making use of the time stamps on reviews as well as tips. This allows us to analyze the activity of a user over time, as well as how the average number of votes the user has received changes with each review posted.
4.1  Average Votes-per-review Over Time
We first examine how the average number of votes-per-review varies with each review posted by a user. To gather this data, we grouped the reviews in the Yelp dataset by user and ordered each user's reviews by the date they were posted.

The goal was to try to predict whether a user is an "Elite" or "Normal" user using the votes-per-review vs. review number plot. The motivation for this was that, after processing the data, we found that the number of votes on average was significantly greater for elite users than for normal users, as shown in Fig 5. Thus, we decided to find out whether any trend exists in how the average number of votes grows with each review posted by users from both categories. We hypothesized that elite users should have an increasing average number of votes over time.

    Elite vs Normal Users Statistics
                    useful votes    funny votes    cool votes
    elite users     616             361            415
    normal users    20              7              7

Figure 5: Average number of votes per category for elite and normal users.

On the y-axis, we have υ_i, which is the votes-per-review after a user posts his ith review. This is defined as the sum of the number of "useful" votes, "cool" votes, and "funny" votes divided by the number of reviews by the user up to that point in time. On the x-axis, we have the review count.

Using the Yelp dataset, we plotted a scatter plot for each user. Visual inspection of the graphs did not show any obvious trends in how the average number of votes per review varied with each review being posted by the user.

We then proceeded to perform a logistic regression using the following variables:

    P_{increase} = \frac{count(increases)}{count(reviews)}

    \mu = \frac{\sum_{i=1}^{count(reviews)} \upsilon_i}{count(reviews)}

where count(increases) is the number of times the average votes-per-review increased (i.e., υ_{i+1} > υ_i) after a user posted a review, and count(reviews) is the number of reviews the user has made.
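As a concrete illustration (a minimal sketch, not our exact implementation), both regression inputs can be computed in one pass over a user's chronologically ordered per-review vote totals:

    def vote_trend_features(vote_counts):
        """P_increase and mu from per-review vote totals (useful + funny +
        cool), ordered by posting date; assumes at least one review."""
        running_avg, total = [], 0
        for i, v in enumerate(vote_counts, start=1):
            total += v
            running_avg.append(total / i)  # v_i: votes-per-review after review i
        increases = sum(b > a for a, b in zip(running_avg, running_avg[1:]))
        return increases / len(vote_counts), sum(running_avg) / len(vote_counts)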
Both the training and testing sets consist of only users with at least one review. For each user, we calculated the variables P_increase and µ. The training and testing data are shown in Fig 6. 10% of users with at least one review became part of the training data and the remaining 90% were used to test.

    Logistic Regression Summary
                elite users    normal users
    training    2005           2005
    testing     18040          18040

Figure 6: Summary of training and testing data for logistic regression.

There was an accuracy of 0.69 on the testing set. The results are shown in Fig 7.

    Confusion Matrix
              classified as elite    classified as normal
    elite     0.64                   0.36
    normal    0.26                   0.74

Figure 7: Summary of results for logistic regression.

Given that the overall accuracy of our model is relatively high at 0.69, we can hypothesize that P_increase is higher for elite users than for normal users. This means that each review that an elite user posts tends to be a "quality" review that receives enough votes to increase the running average of votes-per-review for this user. The second hypothesis is that the mean of the running average votes-per-review for elite users is higher than that of normal users. This is supported by the data shown in Fig 5, where the average votes for elite users are higher than for normal users.

4.2  User Review Rank
For the second part of our temporal analysis, we look at the rank of each review a user has posted. Using 0-indexing, if a review has rank r for business b, the review was the rth review written for business b.

Our hypothesis was that an elite user should be one of the first few users who write a review for a restaurant, because elite users are more likely to find new restaurants to review. Also, based on the dataset, elite users write approximately 230 more reviews on average than normal users; thus it is more likely that elite users will be one of the first users to review a business. Over time, since there are more normal users, the ratio of elite to normal users will decrease as more normal users write reviews.

To verify this, we calculated the percentage of elite reviews for each rank across the top 10,000 businesses, whereby the top business is defined as the business with the most reviews. The number of ranks we look at will be the minimum number of reviews of a single business among the top 10,000 businesses. The plot is shown in Fig 8.

[Figure 8: Plot of the probability of being an elite user for reviews at rank r. (Plot not reproduced.)]
Given that the dataset consists of approximately 10% elite users, the plot shows us that it is more likely for an elite user to be among the first few reviewers of a business.

We calculated a score for each user which is a function of the rank of each review of the user, and we included this as a feature in the SVM. For each review of a user, we find the total number of reviews for the business that the review belongs to. We take the total review count of this business and subtract the rank of the review from it. We then sum this value over each review to assign a score to the user. Based on our hypothesis, since elite users will more likely have a lower rank for each review than normal users, the score for elite users should therefore be higher than for normal users.

The score for a user is defined as follows:

    score = \sum_{rev} ( review_count(business_of(rev)) - rank(rev) )

We subtract the rank from the total review count so that, based on our hypothesis, elite users will end up having a higher score.
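A minimal sketch of this score (our illustration; the names are hypothetical):

    def review_rank_score(user_reviews, total_reviews):
        """Rank score defined above. user_reviews holds (business_id, rank)
        pairs with 0-indexed ranks; total_reviews[b] is how many reviews
        business b has in the dataset."""
        return sum(total_reviews[b] - rank for b, rank in user_reviews)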

4.3  User Tip Rank
A tip is a short chunk of text that a user can submit to a restaurant via any Yelp mobile application. Using 0-indexing, if a tip has rank r for business b, the tip was the rth tip written for business b. Similar to the review rank, we hypothesized that an elite user should be one of the first few tippers (people who give tips) of a restaurant. We plotted the same kind of graph, which shows the percentage of elite tips for each rank across the top 10,000 businesses, whereby the top business is defined as the business with the most tips. The plot is shown in Fig 9.

[Figure 9: Plot of the probability of being an elite user for tips at rank r. (Plot not reproduced.)]

Given that the dataset consists of approximately 10% elite users, the plot shows us that it is more likely for an elite user to be among the first few tippers of a business. Furthermore, for this specific dataset, elite users only make up approximately 25% of the total number of tips, yet for the top 10,000 businesses they make up more than 25% of the tips for almost all the ranks shown in Fig 9.

We then calculated a score for each user based on the rank of each tip of the user, and we included this as a feature in the SVM. The score is defined as follows:

    score = \sum_{tip} ( tip_count(business_of(tip)) - rank(tip) )

(The equation for this score follows the same reasoning as the one in the user review rank section.)

4.4  Review Activity Window
In this section, we look at the distribution of a user's activity over time. The window we look at is between the user's join date and the end date, defined as the last date of any review posted in the entire dataset. For each user, we find the interval in days between each review, including the join date and end date. For example, if the user has two reviews on date1 and date2, where date2 is after date1, the interval durations will be date1 − joinDate, date2 − date1, and endDate − date2. So for n reviews, we will get n + 1 intervals. Based on the list of intervals, we calculate a score. For this feature, we hypothesize that the lower the score, the more likely the user is an elite user.

The score is defined as:

    score = \frac{var(intervals) + avg(intervals)}{days_on_yelp}

where var(intervals) is the variance of all the interval values, avg(intervals) is their average, and days_on_yelp is the number of days a user has been on Yelp.

For the variance, the hypothesis is that for elite users the variance will tend to be low, as we hypothesize that elite users should post regularly. For normal users, the variance will be high, possibly due to irregular posting and long periods of inactivity between posts.

We also look at the average value of the intervals. This is because, if we were to only look at variance, a user who writes a review every two days would get the same variance (zero) as a user who writes a review every day. The average of the intervals accounts for this by increasing the score for less active users.

Finally, we divide the score by the number of days the user has been on Yelp. This is to account for situations where a user makes a post every week but has only been on Yelp for three weeks, versus a user who also makes a post every week but has been on Yelp for a year. The user who has been on Yelp for a year will then get a lower value for this score (elite user).
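A sketch of this score using Python's statistics module (our illustration; we assume the population variance, which the text does not specify):

    from statistics import mean, pvariance

    def activity_window_score(intervals, days_on_yelp):
        """Activity-window score defined above; intervals holds the n + 1
        gaps in days between consecutive reviews, including the join and
        end dates. Lower scores should indicate elite users."""
        return (pvariance(intervals) + mean(intervals)) / days_on_yelp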
5.  SOCIAL NETWORK ANALYSIS
The Yelp social network is the user friendship graph. This data is available in the latest version of the Yelp academic dataset. We used the graph library from the same toolkit that was used to do the text analysis in section 3.

We make an assumption that users on the Yelp network don't become friends at random; that is, we hypothesize that users become friends if they think their friendship is mutually beneficial. In this model, we think one user will become friends with another user if he or she thinks the other user is worth knowing (i.e., is a "good reviewer"). We believe this is a fair assumption to make, since the purpose of the Yelp website is to provide quality reviews for both businesses and users. One potential downside we can see is users becoming friends just because they are friends in real life, or in a different social network.

5.1  Network Centrality
Since our goal is to find "interesting" or "elite" users, we use three network centrality measures to identify central (important) nodes. We would like to find out if elite users are more likely to be central nodes in their friendship network. We'd also like to investigate whether the results of the three centrality measures we investigate are correlated. Next, we briefly summarize each measure. For a more in-depth discussion of centrality (including the measures we use), we suggest the reader consult [7]. For our centrality calculations, we considered the graph of 123,369 users that wrote at least one review.

Degree centrality for a user u is simply the degree of node u. In our network, this is the same value as the number of friends. Therefore, it makes sense that users with more friends are more important (or active) than those that have fewer or no friends. Degree centrality can be calculated almost instantly.

Betweenness centrality for a node u essentially captures the number of shortest paths between all pairs of nodes that pass through u. In this case, a user being an intermediary between many user pairs signifies importance. Betweenness centrality is very expensive to calculate, even using an O(mn) algorithm [4]. This algorithm is part of the toolkit we used, and it took two hours to run on 3.0 GHz processors with 24 threads.

Eigenvector centrality operates under the assumption that important nodes are connected to other important nodes. PageRank [11] is a simple extension of eigenvector centrality. If a graph is represented as an adjacency matrix A, then the (i, j)th cell is 1 if there is an edge between i and j, and 0 otherwise. This notation is convenient when defining eigenvector centrality for a node u, denoted as x_u:

    x_u = \frac{1}{\lambda} \sum_{i=1}^{n} A_{iu} x_i

Since this can be rewritten as Ax = λx, we can solve for the eigenvector centrality values with power iteration, which converges in a small number of iterations and is quite fast. The eigenvector centralities for the Yelp social network were calculated in less than 30 seconds.
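A minimal dense sketch of power iteration (illustrative only; the real friendship graph is large and sparse, and the toolkit's implementation was used in our experiments):

    import numpy as np

    def eigenvector_centrality(A, max_iter=100, tol=1e-8):
        """Power iteration for Ax = lambda x on adjacency matrix A;
        assumes a connected graph so the iteration converges."""
        x = np.ones(A.shape[0])
        x /= np.linalg.norm(x)
        for _ in range(max_iter):
            x_next = A @ x
            x_next /= np.linalg.norm(x_next)
            if np.linalg.norm(x_next - x) < tol:
                break
            x = x_next
        return x_next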
    Degree Centrality
    Name        Reviews    Useful    Friends    Fans     Elite
    Walker      240        6,166     2,917      142      Y
    Kimquyen    628        7,489     2,875      128      Y
    Katie       985        23,030    2,561      1,068    Y
    Philip      706        4,147     2,551      86       Y
    Gabi        1,440      12,807    2,550      420      Y

    Betweenness Centrality
    Name        Reviews    Useful    Friends    Fans     Elite
    Gabi        1,440      12,807    2,550      420      Y
    Philip      706        4,147     2,551      86       Y
    Lindsey     906        7,641     1,617      348      Y
    Jon         230        2,709     1,432      60       Y
    Walker      240        6,166     2,917      142      Y

    Eigenvector Centrality
    Name        Reviews    Useful    Friends    Fans     Elite
    Kimquyen    628        7,489     2,875      128      Y
    Carol       505        2,740     2,159      163      Y
    Sam         683        9,142     1,960      100      Y
    Alina       329        2,096     1,737      141      N
    Katie       985        23,030    2,561      1,068    Y

Figure 10: Comparison of the top-ranked users as defined by the three centrality measures on the social network.

Fig 10 shows the comparison of the top five ranked users based on each centrality score. The top five users of each centrality shared some names: Walker, Gabi, and Philip in degree and betweenness; Kimquyen and Katie in degree and eigenvector; betweenness and eigenvector shared no users in the top five (though not shown, there are some that are the same in the range six to ten).

The top users defined by the centrality measures are almost all elite users, even though elite users only make up about 8% of the dataset. The only exception here is Alina from eigenvector centrality. Her other statistics look like they fit in with the other elite users, so perhaps this could be a prediction that Alina will be elite in the year 2015.

The next step is to use these social network features to predict elite users.

5.2  Weighted Networks
Adding weighted links between users could definitely enhance the graph representation. The types which could potentially be weighted are fans and votes. Additionally, if we had some tie strength of friendship based on communication or profile views, we could use weighted centrality measures for this aspect as well.

Unfortunately, we have no way to define the strength of the friendship between two users, since we only have the information present in the Yelp academic dataset. As for the votes and fans, in the Yelp academic dataset we are only given a raw number for these values, as opposed to the actual links for the social network. If we had this additional information, we could add those centrality measures to the friendship graph centrality measures for an enriched social network feature set.
6.  USER METADATA
User metadata is information that is already part of the JSON Yelp user object. It is possible to see all the metadata by visiting the Yelp website and viewing specific numerical fields. An illustrative sketch of this object appears after the following list.

  • Votes. Votes are ways to show a specific type of appreciation towards a user. There are three types of votes: funny, useful, and cool. There is no specific definition for what each means.

  • Review count. This is simply the total number of reviews that a user has written.

  • Number of friends. The total number of friends in a user's friendship graph. This feature is duplicated in the degree centrality measure of the social network analysis.

  • Number of fans. The total number of fans a user has.

  • Average rating. The average star rating in [1, 5] the user gives a business.

  • Number of compliments. According to Yelp, the compliment button is "an easy way to send some good vibes." This is separate from a review. In fact, users get compliments from other users based on particular reviews that they write.
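For illustration, a sketch of the shape of one such user record (reconstructed from memory of the 2014-era academic dataset; field names vary across releases and all values below are invented):

    example_user = {
        "user_id": "abc123",
        "name": "Alice",
        "review_count": 120,
        "average_stars": 3.8,                                 # average rating in [1, 5]
        "votes": {"useful": 310, "funny": 120, "cool": 150},  # raw totals only
        "friends": ["def456", "ghi789"],                      # bidirectional friendships
        "fans": 14,                                           # one-directional fans
        "compliments": {"cool": 22, "funny": 9},
        "elite": [2012, 2013],                                # years of elite status (the label)
    }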
We hope to use these metadata features in order to classify users as elite. We already saw in section 5 that some metadata fields seemed to be correlated with network centrality measures as well as a user's status, so it seems like they will be informative features.

7.  EXPERIMENTS
We now run experiments to test whether each feature generation method is a viable candidate to distinguish between elite and normal users. As mentioned before, the number of elite users is much smaller than the number of total users; about 8% of all 252,898 users are elite. This presents us with a very imbalanced class distribution. Since using the entire user base to classify elite users has such a high baseline (92% accuracy), we also truncate the dataset to a balanced class distribution with a total of 40,090 users, giving an alternate baseline of 50% accuracy. Both datasets are used for all future experiments.

As described in section 3.1, we use the MeTA toolkit² to do the text tokenization, class balancing, and five-fold cross-validation with an SVM. The SVM is implemented here as stochastic gradient descent with hinge loss.

2 http://meta-toolkit.github.io/meta/

7.1  Text Features
We represent users as a collection of all their review text. Based on the previous experiments, we saw that it was possible to classify a single review as being written by an elite or normal user. Now, we want to classify users based on all their reviews as either an elite or normal user. Figure 11 shows the results of the text classification task. Using the balanced dataset we achieve about 77% accuracy, compared to barely achieving the baseline accuracy on the full dataset.

    Confusion Matrix: Balanced Text Features
              classified as elite    classified as normal
    elite     0.651                  0.349
    normal    0.124                  0.876
    Overall accuracy: 76.7%, baseline 50%

    Confusion Matrix: Unbalanced Text Features
              classified as elite    classified as normal
    elite     0.582                  0.418
    normal    0.039                  0.961
    Overall accuracy: 91.8%, baseline 92%

Figure 11: Confusion matrices for normal vs elite users on balanced and unbalanced datasets using text features.

Since the text features are so high dimensional, we performed some basic feature selection by selecting the most frequent features from the dataset. Before feature selection, we had an accuracy on the balanced dataset of about 70%. Using the top 100, 250, and 500 features all resulted in a similar accuracy of around 76%. We use the reduced feature set of 250 in our experimental results in the rest of this paper.

7.2  Temporal Features
The temporal features consist of features derived using changes in the average number of votes per review posted, the sum of the ranks of a user's reviews as well as tips, and the distribution of reviews posted over the lifetime of a user. Using these features, we obtained the results shown in Figure 12.

    Confusion Matrix: Balanced Temporal Features
              classified as elite    classified as normal
    elite     0.790                  0.210
    normal    0.320                  0.680
    Overall accuracy: 73.5%, baseline 50%

    Confusion Matrix: Unbalanced Temporal Features
              classified as elite    classified as normal
    elite     0.267                  0.733
    normal    0.067                  0.933
    Overall accuracy: 88%, baseline 92%

Figure 12: Confusion matrices for normal vs elite users on balanced and unbalanced datasets using temporal features.
Confusion Matrix: Balanced Graph Features                        Confusion Matrix: Balanced All Features
                classified as elite classified as normal                        classified as elite classified as normal
        elite          0.842                 0.158                      elite          0.754                 0.256
       normal          0.251                 0.749                     normal          0.111                 0.889
            Overall Accuracy: 79.6%, baseline 50%                           Overall Accuracy: 82.2%, baseline 50%

  Confusion Matrix: Unbalanced Graph Features                         Confusion Matrix: Unbalanced All Features
             classified as elite classified as normal                            classified as elite classified as normal
     elite          0.311                 0.689                         elite           0.976                 0.024
    normal          0.075                 0.925                        normal           0.731                 0.269
         Overall accuracy: 87.6%, baseline 92%                                Overall accuracy: 92%, baseline 92%

Figure 13: Confusion matrices for normal vs elite                Figure 15: Confusion matrices for normal vs elite
users on balanced and unbalanced datasets.                       users on balanced and unbalanced datasets with all
                                                                 features present.
  Confusion Matrix: Balanced Metadata Features
             classified as elite classified as normal                             Text     Temp.      Graph      Meta     All
     elite          0.959                 0.041                   Balanced        .767     .735       .796       .938     .822∗
    normal          0.083                 0.917                   Unbalanced      .918     .880       .876       .901     .920
         Overall Accuracy: 93.8%, baseline 50%

Confusion Matrix: Unbalanced Metadata Features                   Figure 16: Final results summary for all features and
             classified as elite classified as normal            feature combinations on balanced and unbalanced
     elite          0.880                 0.120                  data. ∗ Excluding just the text features resulted in
   normal           0.097                 0.903                  90.4% accuracy.
         Overall accuracy: 90.1%, baseline 92%

                                                                 the difficult baseline.
Figure 14: Confusion matrices for normal vs elite
users on balanced and unbalanced datasets.                       Using all combined features except the text features resulted
7.5     Feature Combination and Discussion
To combine features, we simply concatenated the feature vectors for all the previous feature types and used the same splits and classifier as before. Figure 15 shows the breakdown of this classification experiment, and Figure 16 summarizes all results by final accuracy.
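As a minimal sketch of this concatenation step: only the 250 reduced text features, the three graph features, and the six metadata features match counts mentioned above; the temporal width and user count are made up for illustration.

    import numpy as np

    # Illustrative per-user feature blocks for 1,000 users.
    text_feats     = np.random.rand(1000, 250)  # reduced text features
    temporal_feats = np.random.rand(1000, 12)   # width is a made-up example
    graph_feats    = np.random.rand(1000, 3)    # three centrality measures
    meta_feats     = np.random.rand(1000, 6)    # six Yelp metadata features

    # Combining features is concatenation along the feature axis; the
    # resulting vectors feed the same classifier as before.
    X_all = np.hstack([text_feats, temporal_feats, graph_feats, meta_feats])
    assert X_all.shape == (1000, 271)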
    Confusion Matrix: Balanced All Features
                classified as elite   classified as normal
     elite             0.754                 0.246
     normal            0.111                 0.889
          Overall accuracy: 82.2%, baseline 50%

    Confusion Matrix: Unbalanced All Features
                classified as elite   classified as normal
     elite             0.269                 0.731
     normal            0.024                 0.976
          Overall accuracy: 92%, baseline 92%

Figure 15: Confusion matrices for normal vs elite users on balanced and unbalanced datasets with all features present.

                 Text     Temp.     Graph     Meta     All
    Balanced     .767     .735      .796      .938     .822∗
    Unbalanced   .918     .880      .876      .901     .920

Figure 16: Final results summary for all features and feature combinations on balanced and unbalanced data. ∗ Excluding just the text features resulted in 90.4% accuracy.

Unfortunately, the combined feature vectors did not significantly improve classification accuracy on the balanced dataset as we had expected. Initially, we thought this might be due to overfitting, which is why we reduced the number of text features from over 70,000 to 250. Using all 70,000 text features combined with the other feature types resulted in about 70% accuracy; with the top 250 features, we achieved 82.2% as shown in the tables. For the unbalanced dataset, the results did improve, reaching the difficult 92% baseline.

Using all combined features except the text features resulted in 90.4% accuracy, suggesting some disagreement between the "predictive" text features and all the other predictive features. Removing the text features thus yielded a much higher result, approaching the accuracy of the Yelp metadata features alone.

Since we dealt with some overfitting issues, we made sure that the classifier used regularization. Regularization keeps the weight of any individual feature from growing too large when that feature appears to be extremely predictive of the class label. Fortunately (or unfortunately), the classifier we used already employs regularization, so there was nothing further we could do on this front to increase performance.
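To make the role of regularization concrete, here is a sketch using an L2-regularized logistic regression. scikit-learn stands in for the toolkit we actually used, and the data is synthetic.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 271)        # combined feature vectors (synthetic)
    y = np.random.randint(0, 2, 1000)    # 1 = elite, 0 = normal (synthetic)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # The L2 penalty shrinks large weights; a smaller C means a stronger
    # penalty, keeping any single highly predictive feature in check.
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))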
8.    CONCLUSION AND FUTURE WORK
We investigated several different feature types in an attempt to classify elite users in the Yelp network. We found that each of our feature types was able to distinguish between the two user types. However, when combined, we were not able to improve accuracy on the class-balanced dataset over the best-performing single feature type.

In the text analysis, we can investigate different classifiers to improve the classification accuracy. For example, k-nearest neighbors could be a good approach since it is nonlinear and we have a relatively small number of dimensions after reducing the text features. The text analysis could also be extended with the aid of topic modeling [2]. This algorithm clusters documents into separate topics; a document is then represented as a mixture of these topics, and each document's topic mixture can be used as a feature vector for the classifier, as sketched below.
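A sketch of both proposed extensions together, on toy reviews: LDA [2] turns each document into a low-dimensional topic mixture, which a k-nearest-neighbor classifier then consumes. All data here is illustrative.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.neighbors import KNeighborsClassifier

    docs = ["great food friendly staff", "terrible service cold pizza",
            "amazing brunch spot great coffee", "rude waiter long wait"]
    labels = [1, 0, 1, 0]   # toy labels: 1 = elite author, 0 = normal

    # Represent each document as a mixture over (here) two latent topics.
    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    topic_mixtures = lda.fit_transform(counts)

    # k-NN is nonlinear and works well in this small feature space.
    knn = KNeighborsClassifier(n_neighbors=1).fit(topic_mixtures, labels)
    print(knn.predict(topic_mixtures[:1]))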
In the temporal analysis, we made some assumptions about the 'elite' data provided by the Yelp dataset. The data tells us in which years a user was 'elite', and we made the simplifying assumption that as long as a user has at least one year of elite status, the user is currently and has always been an elite user. For instance, if a user was elite only in 2010, we still treated that user's reviews from 2008 as elite reviews. We could also have made use of more advanced models such as the vector autoregression (VAR) model [14], which might improve the analysis of votes per review over time; one possibility would be to take the votes-per-review series of every user in the dataset and fit the model to that data, as sketched below. Finally, in the network analysis, we can consider additional network features such as the clustering coefficient or similarity via random walks.
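As a sketch of the VAR idea [14], the following fits a small vector autoregression to synthetic monthly series standing in for a user's votes-per-review and review activity; the series names and lag order are illustrative.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.api import VAR

    # Synthetic stand-ins for 48 months of per-user activity series.
    rng = np.random.default_rng(0)
    data = pd.DataFrame({
        "votes_per_review":  rng.normal(2.0, 0.5, 48),
        "reviews_per_month": rng.normal(5.0, 1.0, 48),
    })

    model = VAR(data)
    results = model.fit(4)   # fixed lag order of 4 for the sketch
    print(results.summary())

    # Forecast the next six months from the last four observations.
    forecast = results.forecast(data.values[-4:], steps=6)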
The graph features would certainly benefit from added edge weights, but as mentioned in Section 5, we unfortunately do not have this data. The social graph structure could also be enriched with more information about fans and votes.
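If such weights were available, attaching them to the friendship graph and computing weighted clustering coefficients would be straightforward. A sketch with hypothetical weights (e.g., counts of shared vote interactions):

    import networkx as nx

    # Hypothetical weighted friendship edges; weights might come from fan
    # or vote interactions if that data were available.
    G = nx.Graph()
    G.add_weighted_edges_from([("u1", "u2", 3.0), ("u2", "u3", 1.0),
                               ("u1", "u3", 2.0), ("u3", "u4", 1.0)])

    # Per-user clustering coefficient, one of the proposed extra features.
    clustering = nx.clustering(G, weight="weight")
    print(clustering["u1"])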
Finally, since the metadata features were by far the best performing, it would be an interesting auxiliary problem to predict their values via regression using the other feature types we created.
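A sketch of this auxiliary task: regress one metadata value (a user's fan count, chosen here for illustration) on the remaining feature types, with synthetic data throughout.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Synthetic text + temporal + graph features (250 + 12 + 3 columns).
    X_other = np.random.rand(1000, 265)
    y_fans  = np.random.poisson(3.0, 1000)   # stand-in for fan counts

    reg = Ridge(alpha=1.0).fit(X_other, y_fans)
    print("R^2 on training data:", reg.score(X_other, y_fans))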
APPENDIX
A.    REFERENCES
 [1] Krisztian Balog, Leif Azzopardi, and Maarten de Rijke. A language modeling framework for expert finding. Inf. Process. Manage., 45(1):1–19, January 2009.
 [2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
 [3] Alessandro Bozzon, Marco Brambilla, Stefano Ceri, Matteo Silvestri, and Giuliano Vesci. Choosing the right crowd: Expert finding in social networks. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, pages 637–648, New York, NY, USA, 2013. ACM.
 [4] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163–177, 2001.
 [5] Munmun De Choudhury, Hari Sundaram, Ajita John, and Dorée Duncan Seligmann. What makes conversations interesting? Themes, participants and consequences of conversations in online social media. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 331–340, New York, NY, USA, 2009. ACM.
 [6] Kate Ehrlich, Ching-Yung Lin, and Vicky Griffiths-Fisher. Searching for experts in the enterprise: Combining text and social network analysis. In Proceedings of the 2007 International ACM Conference on Supporting Group Work, GROUP '07, pages 117–126, New York, NY, USA, 2007. ACM.
 [7] Jiawei Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
 [8] Yelp Inc. Yelp Dataset Challenge, 2014. http://www.yelp.com/dataset_challenge.
 [9] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, September 1999.
[10] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[11] Larry Page, Sergey Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1998.
[12] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2):1–135, January 2008.
[13] Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu. RankClus: Integrating clustering with ranking for heterogeneous information network analysis. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT '09, pages 565–576, New York, NY, USA, 2009. ACM.
[14] Hiro Y. Toda and Peter C. B. Phillips. Vector autoregression and causality. Cowles Foundation Discussion Papers 977, Cowles Foundation for Research in Economics, Yale University, May 1991.
[15] Hongning Wang, Yue Lu, and ChengXiang Zhai. Latent aspect rating analysis on review text data: A rating regression approach. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 783–792, New York, NY, USA, 2010. ACM.
[16] Hongning Wang, Yue Lu, and ChengXiang Zhai. Latent aspect rating analysis without aspect keyword supervision. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 618–626, New York, NY, USA, 2011. ACM.
[17] Jun Zhang, Mark S. Ackerman, and Lada Adamic. Expertise networks in online communities: Structure and algorithms. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 221–230, New York, NY, USA, 2007. ACM.