A Multiple Feature Integration Model to Infer Occupation from Social Media Records

                   Xiang Wang, Lele Yu, Junjie Yao, and Bin Cui

                         Department of Computer Science
      Key Lab of High Confidence Software Technologies (Ministry of Education)
                                Peking University
                  {kingflying.pkueecs,yulele214}@gmail.com,
                        {junjie.yao,bin.cui}@pku.edu.cn

       Abstract. With the rapid development of social media applications,
       large numbers of users are connected with friends, and their daily lives
       and opinions are recorded. Social media provides an unprecedented
       way to collect and analyze information about billions of users. Proper
       user attribute identification, or profile inference, is becoming
       increasingly attractive and feasible. However, the flourishing social
       records also pose great challenges for effective feature selection and
       integration in user profile inference, mainly because of text sparsity
       and complex community structures.
          In this paper, we propose a comprehensive framework to infer a user's
       occupation from his or her social activities recorded in micro-blog
       message streams. A multi-source integrated classification model is set
       up with carefully selected features. We first identify some beneficial
       basic content features, and then tailor a community-discovery-based
       latent dimension solution to extract community features.
          Extensive empirical studies are conducted on a large real micro-blog
       dataset. We not only demonstrate that the integrated model shows
       advantages over several baseline methods, but also verify the effect of
       homophily in users' interaction records. The different effects of
       heterogeneous interactive networks are also revealed.

       Keywords: User Profile Modeling, Occupation Inference, Feature
       Selection, Heterogeneous Network, Micro-blog.

1     Introduction
With the recent rapid spread of Web 2.0 and social network services, social media
has become increasingly popular around the world and already exerts great
influence on the everyday lives of billions of users. According to recent
statistics, Facebook has more than one billion registered users and 660 million
daily active users. Twitter has around 500 million registered users, and several
hundred million messages are posted every day¹.
¹ http://goo.gl/90KjQ,
  http://blog.twitter.com/2013/03/celebrating-twitter7.html

X. Lin et al. (Eds.): WISE 2013, Part II, LNCS 8181, pp. 137–150, 2013.

© Springer-Verlag Berlin Heidelberg 2013

   The explosive development of social media has brought great opportunities
to many fields. The lives of billions of users are recorded, giving enterprises
and researchers an unprecedented opportunity to extract and analyze users'
tastes, interests and other information. Because users usually provide little
explicit information about themselves, owing to privacy or other concerns,
automatic user attribute inference is required to fill in missing attributes
such as gender, age and interest. With accurate attribute inference, personalized
and targeted services, such as product and content recommendation, can be
improved for each individual user. Occupation inference can likewise be used to
tune advertising and user profile modeling.
   Besides these great opportunities, challenges also exist. The main challenge
stems from the complicated nature of social media: extremely rich data features,
low-quality social content, and complex user interaction networks.
   Recently, several works have focused on the task of user attribute inference.
For example, [16] implemented a classification method to infer three
characteristics of users: political affiliation, ethnicity identification and
affiliation with Starbucks. [12] used a community detection method to identify
the department and college affiliation of undergraduate students. Inference of
gender, age, location and other attributes was discussed in [18,10,15,1]. A
common assumption in user attribute inference is homophily [11]: similar users
tend to interact with each other, so a user's attributes can be inferred from
the information of similar users.
   However, many challenges in inferring user attributes on social media remain
unsolved. First, user representation is difficult: proper features must be
extracted from large amounts of noisy data, and different inference tasks call
for different features. A flexible way to integrate features is therefore
valuable, so that a user representation can be built quickly for each task.
   Another challenge is information-rich heterogeneous networks, i.e., network
systems consisting of multiple object types and multiple link types. On social
media sites, users can be connected through friendship, co-discussion and
mention activities. [8] observed that knowledge about such networks is often
hidden in massive numbers of links. [20] put forward the concept of a meta-path,
which encodes the different relationships in heterogeneous networks in order to
cluster objects under limited user guidance. However, many facets of
heterogeneous networks, such as unstructured data and cyber-physical networks,
are still untouched.
   In this paper, we propose a multi-source integration model to infer the
occupation of users on social media sites. We carry out a comprehensive feature
analysis on a large real dataset and find that the language behaviors of users
in different occupation categories differ markedly. In addition, we propose a
latent network factor, the latent social dimension, to capture the community
structure of users. To integrate these features, we represent each user as a
feature vector and use a supervised machine learning classification framework
to infer the user's occupation.

   Analyses and experiments are conducted on Sina Weibo², the largest
microblog platform in China. Comprehensive results demonstrate the significant
advantage of our proposed model. The contributions of this paper can be
summarized as follows:
 1. We systematically analyze the feature representation of users and examine
    the network structure to capture users' latent community affiliations.
 2. We propose a multi-source supervised classification framework that combines
    content-based and community-based features.
 3. We conduct several experiments, and the results show the good performance
    of both content and community features, especially the community ones.
    We also validate the homophily assumption in this user inference task.
   The rest of this paper is organized as follows. Section 2 discusses related
work. Section 3 introduces the problem definition and approach framework.
Section 4 details the feature selection and engineering work in the inference
model. Section 5 presents experiments and evaluations on a large real dataset,
and Section 6 concludes this work.

2     Related Work
Research in this paper is related to several areas. Here we briefly review the
corresponding literature.
User Profiling: Work in this field focuses on expertise modeling, influence
inference, and interest extraction. [22] proposed a model that propagates
interest in an item among users via their friendships. [10] put forward a
unified discriminative influence probabilistic model to identify users'
locations. [3] measured users' influence in terms of in-degree, retweets,
mentions, topics and time. Another common method for inferring user profiles is
collective classification [19], whose idea is to infer a user's attributes from
neighbors' information. Normally a relational classifier is constructed from the
relational features of labeled data, and an iterative process is then required
to infer the unlabeled data. However, the main drawback of collective inference
is that it considers only the direct neighbors of users and ignores the
interactions between indirectly linked users. Moreover, collective
classification fails to capture the underlying factors that actually influence
users' behaviors.
Community Detection: Community detection has long been an active topic.
Traditional community detection algorithms use a closeness metric, adding edges
to an empty network one by one. However, manual division is required to cut the
resulting hierarchical tree and determine the final network communities. [6] put
forward the edge betweenness metric to divide communities. This method removes
edges with larger betweenness from the original network, which is the opposite
of the closeness method. [14] proposed the modularity metric to identify
communities. Larger modularity means that there are more intra-community edges
relative to inter-community edges. Many previous works used community metrics
to infer user attributes. [12] used a greedy algorithm to maximize a new
evaluation metric called normalized conductance, which measures the quality of a
single community, in order to detect communities, and then assigned an identical
attribute value to users in the same community. The disadvantage of this method
is that using labeled data in a community to infer the unlabeled data involves
too much noise and cannot capture the interactions between communities.
² http://weibo.com/

3     Problem Formulation and Approach Framework
In this section, we first define the problem of user occupation inference and
then introduce our multi-source integration framework for inferring users'
occupation information.
Data Scenario: The dataset used in this work comes from one of the largest
micro-blog platforms, Sina Weibo³. Users can post, re-tweet and comment on
messages, and they can also follow other users. Every day, hundreds of millions
of messages are posted and spread on this social media site⁴. Sina Weibo has
already labeled a small subset of its users and categorized them into 12
occupation classes, such as entertainment, media and government⁵. We use the
open API provided by Sina Weibo to crawl these users' data and get about ten
thousand accounts. Profiles, tweets, tags, and friend and follower lists are
collected. After removing some low-activity users, we select 65828 accounts for
the later empirical study. With this data we can not only identify differences
in users' language behaviors from their posted messages but also capture the
varying strength of users' interactions by utilizing the community structure.
   In Table 1, we provide the occupation distribution of this dataset. Media
accounts for the largest proportion (26%), followed by the entertainment class
(18%). Public welfare has the smallest share, only about 1%.

          Table 1. Occupation Distribution of Verified Users on Sina Weibo

                         Occupation        Share
                         Transport          1.9%
                         Government        14.4%
                         Finance            8.9%
                         Electronic         2.8%
                         Public welfare     1.2%
                         Education         10.0%
                         Estate             8.8%
                         Media             26.1%
                         Service            3.7%
                         Entertainment     17.9%
                         Others             3.0%
                         Medical            1.3%

   To make our problem clear and unambiguous, we give its formal definition
below.
Definition 1. There are $K$ occupation labels $\kappa = \{c_1, \dots, c_K\}$. We
are given a social network $G = (V, E, Y)$, where $V$ is the set of user
vertices, $E$ is the set of connection edges and $Y$ is the set of users'
occupation labels; $y_i \in Y$, with $y_i \in \kappa$, is the occupation label
of user vertex $i$. The occupation labels $Y_{know}$ of some vertices $V_{know}$
are already known. The occupation inference task is to select the occupation
labels $Y_{unknow}$ for the remaining vertices $V_{unknow}$.
³ http://weibo.com
⁴ http://www.36kr.com/p/201443.html
⁵ Verified Account: http://verified.weibo.com/
Multi-source User Occupation Inference Model: To solve this problem, we
utilize both content and network features and transform the problem into a
machine learning classification task. Figure 1 shows the framework of our
multi-source inference model. The approach can be divided into two stages:

[Figure 1: framework diagram. Content-based features (tweet content, tag,
hashtag, profile) and community-based features enter multi-source feature
selection and feature integration; classifiers are trained on users with known
labels and then used to infer unknown labels.]

Fig. 1. Multi-source User Occupation Inference Model

 1. Feature Selection and Integration: We integrate features from two categories.
    One is the content feature, including tweet content, hashtag, tag and
    profile. We also investigated other possible features, such as location,
    temporal patterns of behavior and linguistic characteristics, but these
    features do not perform well in our task. The other category is the
    community feature, i.e., the latent social dimension used in this work. We
    use it to capture the latent community affiliations of users so that global
    network information can be utilized.
    After this selection, we represent each user as a feature vector that
    combines these two types of social media features. New features can easily
    be added to our model.
 2. Model Prediction: After feature extraction and user representation, we
    choose a standard supervised machine learning classifier to infer the user's
    occupation. Common classifiers include naive Bayes, decision tree, support
    vector machine and logistic regression. In theory, any classifier is
    applicable in this case; however, its actual utility needs to be verified
    through experiments. For the latent social dimension, it is very important
    that the classifier choose effective and significant community dimensions
    so that the inference can be optimized. We compare different classifiers in
    Section 5.
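
As a minimal sketch of the user representation step, the following Python
snippet concatenates a user's content-based and community-based feature vectors
into one vector. Plain concatenation, the helper name `integrate_features` and
the toy values are assumptions made for illustration; the text states only that
the two feature types are combined into a single feature vector.

```python
from typing import Dict, List, Sequence


def integrate_features(sources: Dict[str, Sequence[float]],
                       order: Sequence[str]) -> List[float]:
    """Concatenate per-source feature vectors in a fixed order.

    `sources` maps a feature-source name (e.g. "lda", "cmty") to that
    source's vector for one user. Plain concatenation is an assumption
    of this sketch, not a detail confirmed by the paper.
    """
    vector: List[float] = []
    for name in order:
        vector.extend(sources[name])
    return vector


# Hypothetical toy user: a 3-topic content vector and a 3-community vector.
user = {"lda": [0.7, 0.2, 0.1], "cmty": [0.3, 0.2, 0.1]}
combined = integrate_features(user, order=["lda", "cmty"])
print(combined)
```

Adding a new feature source then only requires one more entry in `sources` and
in `order`, which is one way to read the claim that new features can be added
easily.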

4     Feature Selection

In this section, we introduce the features used in occupation inference work. Sina
Weibo is a rich social media platform, with a large variety of user generated
content and multiple types of user interactions.

4.1    Content Feature

Personal Profile: The personal profile is the information users provide
themselves when they register a Sina Weibo account, such as gender and location.
Because the profile data returned by the Sina Weibo API is in dictionary format,
we choose the keys description, verified reason and screen name as the user's
personal profile. Description is the personal description given by the user;
verified reason is the reason why the user was verified by Sina Weibo; screen
name is the user's Sina Weibo nickname.
Personal Tag: Personal tags are keywords that users provide themselves to
describe their personal interests. For example, a programmer may use Technology,
Mobile Internet and Programming Language as personal tags. Little research has
touched this feature before. A tentative analysis shows that around 76% of
users have personal tags. Based on the idea of homophily, we collect tags from
users' neighbors as a complement to their own tags. To utilize neighbors' tags,
we combine a user's own tags with the tags of his or her top-k most similar
neighbors, where similarity between users is measured with the Jaccard
Similarity Coefficient⁶. After adding neighbors' tags, nearly every user has at
least one tag, and the average number of tags per user is between 10 and 20,
depending on the choice of k.
   Note that neighbors here are bi-follower friends, i.e., two users who follow
each other. Friends or neighbors mentioned in this paper refer to bi-follower
friends by default. A bi-follower friendship indicates a stronger relationship
than a one-directional one and thus filters out much noise.
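
The tag enrichment step can be sketched as follows. The sketch assumes that
user similarity is the Jaccard coefficient of the two users' bi-follower friend
sets; the text does not say which sets are compared, so this choice, the
function names and the toy data are all illustrative assumptions.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity coefficient |a & b| / |a | b| (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def enrich_tags(user: str,
                friends: Dict[str, Set[str]],
                tags: Dict[str, Set[str]],
                k: int) -> Set[str]:
    """Union a user's own tags with those of the top-k most similar friends.

    Similarity is computed on friend sets (an assumption of this sketch);
    ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(
        friends[user],
        key=lambda f: (-jaccard(friends[user], friends.get(f, set())), f),
    )
    enriched = set(tags.get(user, set()))
    for friend in ranked[:k]:
        enriched |= tags.get(friend, set())
    return enriched


# Hypothetical toy data: bi-follower friend sets and self-provided tags.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave": {"alice", "bob"},
}
tags = {
    "alice": set(),
    "bob": {"Programming Language"},
    "carol": {"Mobile Internet"},
    "dave": {"Finance"},
}
print(enrich_tags("alice", friends, tags, k=2))
```

In this toy case the user has no tags of her own and receives the tags of her
two most similar friends, which mirrors the observation that nearly every user
has at least one tag after enrichment.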
Hashtag: A hashtag is a word or phrase enclosed between two # symbols. It serves
as a marker that groups similar tweets. The usage of hashtags is related to
the occupations of users. For example, users from public welfare may be
concerned with hashtags like Beijing Rescue Team and Social Public Welfare,
while users at IT companies may be interested in Iphone5 and Google I/O
Conference. After extracting hashtags from tweets, we apply a word filtering
strategy based on word frequency and represent users as hashtag vectors.
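
A minimal sketch of hashtag extraction and frequency filtering is given below.
The regular expression, the `min_count` threshold and the use of raw counts as
vector entries are assumptions; the text specifies only that hashtags are
delimited by # symbols and filtered by word frequency.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# A hashtag is a phrase enclosed by a pair of '#' symbols, e.g. #Google I/O#.
HASHTAG = re.compile(r"#([^#]+)#")


def extract_hashtags(tweet: str) -> List[str]:
    """Return the phrases enclosed by '#' pairs in one tweet."""
    return [h.strip() for h in HASHTAG.findall(tweet) if h.strip()]


def hashtag_vectors(tweets_by_user: Dict[str, List[str]],
                    min_count: int = 2) -> Tuple[List[str], Dict[str, List[int]]]:
    """Build per-user hashtag count vectors over a frequency-filtered vocabulary.

    Hashtags seen fewer than `min_count` times overall are dropped; the
    threshold is a hypothetical choice for this sketch.
    """
    per_user = {
        user: Counter(h for tweet in tweets for h in extract_hashtags(tweet))
        for user, tweets in tweets_by_user.items()
    }
    total: Counter = Counter()
    for counts in per_user.values():
        total.update(counts)
    vocab = sorted(h for h, n in total.items() if n >= min_count)
    vectors = {user: [counts[h] for h in vocab]
               for user, counts in per_user.items()}
    return vocab, vectors


# Hypothetical toy tweets.
tweets_by_user = {
    "u1": ["#Google I/O# keynote today",
           "watching #Google I/O# again #Iphone5#"],
    "u2": ["#Beijing Rescue Team# needs volunteers",
           "#Google I/O# was fun"],
}
vocab, vectors = hashtag_vectors(tweets_by_user)
print(vocab, vectors)
```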
Tweet Content: Sina Weibo allows users to post tweets of up to 140 characters.
Intuitively, users of different occupations often post tweets that differ in
content. For example, a property developer may habitually use words like
housing, bank, inflation, while a famous singer may favor words such as
singing, composer, popularity. However, tweet content is still difficult to
represent, for the following reasons. First, the distribution of tweets is
unbalanced: most users post comparatively few tweets, while only a small
fraction of users post many, following a power-law distribution. Second, the
content is noisy, owing to users' arbitrary writing habits and the low-quality
content of ordinary users.
⁶ http://en.wikipedia.org/wiki/Jaccard_index
   Many methods have been proposed to solve the content representation problem.
Probabilistic Latent Semantic Analysis (PLSA) was introduced in [9] to project
similar words into a latent dimension. Its disadvantage is that it assumes a
single topic for each document. Latent Dirichlet Allocation (LDA) [2] was later
proposed to overcome this disadvantage: it allows multiple topics in each
document and is considered the state-of-the-art method. In this paper, we treat
all tweets of a user as one document and use LDA to represent each user as a
probability distribution over topics. This method can identify a user's latent
topic distribution from his or her posted content.
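
For reference, the standard LDA generative process [2], applied to the setting
where all tweets of user $u$ form one document, can be written as below. The
notation ($T$ topics, hyperparameters $\alpha$ and $\eta$, $N_u$ words for
user $u$) is chosen here for illustration and is not taken from the paper.

```latex
\begin{aligned}
\phi_t     &\sim \mathrm{Dirichlet}(\eta),        & t &= 1, \dots, T \\
\theta_u   &\sim \mathrm{Dirichlet}(\alpha),      & &\text{for each user } u \\
z_{u,n}    &\sim \mathrm{Multinomial}(\theta_u),  & n &= 1, \dots, N_u \\
w_{u,n}    &\sim \mathrm{Multinomial}(\phi_{z_{u,n}}), & n &= 1, \dots, N_u
\end{aligned}
```

Under this reading, the inferred $\theta_u$, a $T$-dimensional vector whose
entries sum to 1, serves as the tweet content feature of user $u$.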

4.2   Community Feature

Here we discuss the extraction method for the community feature. There are
a variety of heterogeneous networks on Sina Weibo, such as the friendship,
retweet and mention (using @) networks. One important phenomenon in social
networks is community structure. We set up a new community feature, the latent
social dimension, based on the community structure of users to infer their
occupations.
Latent Social Dimension: [21] presented a new way to utilize community
structure, called the latent social dimension. The latent social dimension
represents the affiliation of users to different communities. Figure 2 is a
toy example with three communities and five users. One user can be affiliated
with multiple communities, and the thickness of the line between a user and a
community indicates the strength of the affiliation. Mathematically, the latent
social dimension is a vector in which each dimension corresponds to one
community. Take User A as an example: User A can be represented as the vector
<0.3, 0.2, 0.1>, where 0.3 means that the strength of affiliation between
community 1 and User A is 0.3.

Fig. 2. A user and community interactive graph to illustrate latent social dimension

   The advantage of the latent social dimension over collective classification
is that it can capture the interactions between users across the whole network,
whereas collective classification can only infer user attributes with the help
of neighbors.
   Intuitively, users of the same occupation are more likely to connect with
each other in networks and thus more likely to form community structure. Users
of different occupations tend to form different communities, and thus their
latent social dimensions differ. We validate this assumption in Section 5.
Extraction of Latent Dimension: This task is based on a community detection
algorithm. The traditional graph partition-based algorithm [7] aims to minimize
the number of edges between communities. [13] points out that this is not a good
metric because it tends to put most nodes into one community. A better metric
for dividing communities is modularity, defined as the number of intra-edges in
the target network minus the number of intra-edges in a comparable random
network. Intra-edges are edges inside communities, not between communities. The
community detection task is then to maximize the modularity function. The
advantage of a modularity-based algorithm is that it can find the communities
that naturally exist in the network, without the need to specify in advance the
number of communities to be detected.
   To simplify our explanation, we make some definitions. Let $n$ be the number
of users in the network and $m$ the number of edges. We first define the
adjacency matrix $A$: $A_{ij} = 1$ if there is an edge between node $i$ and
node $j$, and $A_{ij} = 0$ otherwise. Here we ignore the direction of edges,
that is, $A_{ij} = A_{ji}$, and we also do not consider edge weights. Next, we
define modularity:

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - P_{ij} \right] \delta(g_i, g_j) \tag{1}$$

where $g_i$ is the community of node $i$, and $\delta(r, s) = 1$ if $r = s$ and
$\delta(r, s) = 0$ if $r \neq s$. $P_{ij}$ is the probability that there is an
edge between node $i$ and node $j$ in a random network. For convenience, we
choose $P_{ij}$ as:

$$P_{ij} = \frac{k_i k_j}{2m} \tag{2}$$

where $k_i$ is the degree of node $i$, calculated as $k_i = \sum_j A_{ij}$.
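
Equations (1) and (2) can be evaluated directly on a small graph. The sketch
below does so for an undirected, unweighted graph without self-loops or
duplicate edges; the function name and the toy graph (two triangles joined by
one edge) are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def modularity(edges: List[Tuple[int, int]], community: Dict[int, int]) -> float:
    """Q = 1/(2m) * sum_ij [A_ij - k_i * k_j / (2m)] * delta(g_i, g_j)."""
    m = len(edges)
    degree = {node: 0 for node in community}
    neighbors = {node: set() for node in community}
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
        neighbors[i].add(j)
        neighbors[j].add(i)
    q = 0.0
    for i in community:
        for j in community:
            if community[i] != community[j]:
                continue  # delta(g_i, g_j) = 0 across communities
            a_ij = 1.0 if j in neighbors[i] else 0.0
            q += a_ij - degree[i] * degree[j] / (2.0 * m)
    return q / (2.0 * m)


# Toy graph: triangle {0, 1, 2}, triangle {3, 4, 5}, bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
merged = {node: 0 for node in range(6)}
print(modularity(edges, split), modularity(edges, merged))
```

Splitting the toy graph into its two triangles gives $Q = 5/14$, while putting
all nodes into one community gives $Q = 0$, which illustrates why a larger
modularity signals a better division.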
   Now we consider the problem of dividing the network into $c$ communities. We
first define the index matrix $S = (s_1 | s_2 | \dots | s_c)$. Every column of
$S$ is an index vector of 0s and 1s, which can be regarded as a latent
community; 0 or 1 represents non-affiliation or affiliation with this community.
Formally, $S_{ij} = 1$ if node $i$ belongs to community $j$, and 0 otherwise.
Then modularity can be rewritten as:

$$Q = \frac{1}{2m} \mathrm{Tr}(S^T B S) \tag{3}$$

where $B = A - P$ is called the modularity matrix. $B$ is a real symmetric
matrix, and it plays the same role as the Laplacian matrix in standard spectral
partitioning. We decompose $B$ as $B = U D U^T$, where
$U = (u_1 | u_2 | \dots)$ is a matrix made up of the eigenvectors of $B$ and $D$
is a diagonal matrix made up of the eigenvalues of $B$. Writing $\beta_j$ for
the eigenvalue corresponding to $u_j$, we can rewrite modularity as:

$$Q = \frac{1}{2m} \sum_{j=1}^{n} \sum_{k=1}^{c} \beta_j \, (u_j^T s_k)^2 \tag{4}$$
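
The step from Eq. (3) to Eq. (4) can be made explicit. It uses only the
decomposition $B = U D U^T$ with $D = \mathrm{diag}(\beta_1, \dots, \beta_n)$,
where $s_k$ is the $k$-th column of $S$ and $u_j$ the $j$-th column of $U$:

```latex
\begin{aligned}
\mathrm{Tr}(S^T B S)
  &= \sum_{k=1}^{c} s_k^T B \, s_k
   = \sum_{k=1}^{c} s_k^T U D U^T s_k \\
  &= \sum_{k=1}^{c} (U^T s_k)^T D \, (U^T s_k)
   = \sum_{k=1}^{c} \sum_{j=1}^{n} \beta_j \, (u_j^T s_k)^2 .
\end{aligned}
```

Dividing both sides by $2m$ gives Eq. (4).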

   According to [17], modularity is maximized when the column vectors of $S$
are proportional to the leading eigenvectors of $B$. To avoid the 0-or-1
constraint on $S$, we relax $S$ to be continuous. In this case, modularity is
maximized in theory when $S$ is made up of the top-$c$ eigenvectors of $B$. Note
that the number of communities, $c$, is not fixed in advance; in practice we
need to choose a proper $c$ to maximize modularity. According to Equation 4, an
eigenvalue $\beta_j$ contributes positively to modularity only when it is
positive. As a result, $c$ will not exceed the number of positive eigenvalues
of $B$.
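
As an illustration of the relaxed solution, the sketch below builds the
modularity matrix for a toy graph and extracts its leading eigenvector. It uses
shifted power iteration in pure Python, which is a substitute chosen here for
self-containment; the paper does not state which eigensolver was used, and only
the first eigenvector (the case $c = 1$) is computed.

```python
from typing import List, Tuple


def modularity_matrix(n: int, edges: List[Tuple[int, int]]) -> List[List[float]]:
    """B_ij = A_ij - k_i * k_j / (2m) for an undirected, unweighted graph."""
    m = len(edges)
    a = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    k = [sum(row) for row in a]
    return [[a[i][j] - k[i] * k[j] / (2.0 * m) for j in range(n)]
            for i in range(n)]


def leading_eigenvector(b: List[List[float]], iterations: int = 500) -> List[float]:
    """Power iteration on B + shift * I.

    The shift (largest absolute row sum, a Gershgorin bound) makes every
    eigenvalue non-negative, so the iteration converges to the eigenvector
    of the largest eigenvalue of B rather than the largest in magnitude.
    """
    n = len(b)
    shift = max(sum(abs(x) for x in row) for row in b)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Toy graph: triangle {0, 1, 2}, triangle {3, 4, 5}, bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
u1 = leading_eigenvector(modularity_matrix(6, edges))
print([round(x, 3) for x in u1])
```

The signs of the entries of `u1` separate the two triangles, and each entry can
be read as one coordinate of a user's latent social dimension. With several
eigenvectors, row $i$ of the resulting matrix would be the feature vector of
user $i$.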
   With the features discussed above, we can enrich the classification
framework introduced in Section 3. The empirical study is presented in the
next section.

5     Experiments
In this section, we report our evaluation experiments on Sina Weibo. We first
compare the results of different inference models and different classifiers.
Then we validate our homophily assumption from two aspects, and finally examine
the characteristics of the heterogeneous networks.

5.1   Baseline and Evaluation Metrics
To demonstrate the improvement of the proposed model, we select the following
baselines:
 – Weighted Random Model (WRand): This model ignores all content and network
   information and simply assigns each user a random occupation with
   probability proportional to the share of that occupation.
 – Majority Model (Majority): This model also ignores all content and network
   information. All users are assigned the occupation label that accounts for
   the largest proportion of all occupations.
 – Content-based Model: This model considers the content-based features. We
   divide it into 4 variants: tweet content model (LDA), hashtag model
   (Hashtag), tag model (Tag) and profile model (Profile).
 – Community-based Model (Cmty): This model contains only the latent social
   dimension feature.
 – Combined Model: This model contains both content and community features.
   We divide it into 4 variants: tweet content and community (LDA-Cmty),
   hashtag and community (Hashtag-Cmty), tag and community (Tag-Cmty) and
   profile and community (Profile-Cmty).
   We choose common evaluation metrics to evaluate our model: Precision,
Recall and F-measure.
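
The two feature-free baselines and the evaluation metrics can be sketched as
follows. The function names, the fixed random seed and the toy labels are
assumptions, and the metrics are computed per class here because the text does
not say how precision, recall and F-measure are averaged over the 12 classes.

```python
import random
from collections import Counter
from typing import List, Tuple


def majority_baseline(train_labels: List[str], n: int) -> List[str]:
    """Predict the most frequent training label for every user."""
    label = Counter(train_labels).most_common(1)[0][0]
    return [label] * n


def weighted_random_baseline(train_labels: List[str], n: int,
                             seed: int = 0) -> List[str]:
    """Draw each prediction with probability proportional to label frequency."""
    rng = random.Random(seed)
    counts = Counter(train_labels)
    labels = sorted(counts)
    return rng.choices(labels, weights=[counts[l] for l in labels], k=n)


def precision_recall_f1(truth: List[str], predicted: List[str],
                        label: str) -> Tuple[float, float, float]:
    """Per-class precision, recall and F-measure (0.0 when undefined)."""
    tp = sum(t == label and p == label for t, p in zip(truth, predicted))
    fp = sum(t != label and p == label for t, p in zip(truth, predicted))
    fn = sum(t == label and p != label for t, p in zip(truth, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy labels.
train = ["media"] * 3 + ["entertainment"] * 2 + ["government"]
truth = ["media", "media", "entertainment", "government"]
pred = majority_baseline(train, len(truth))
print(pred, precision_recall_f1(truth, pred, "media"))
print(weighted_random_baseline(train, len(truth)))
```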

5.2   Classiﬁer Choice

Figure 3 shows the performance of different classifiers using the community
feature. We take logistic regression and the support vector machine (SVM) as
examples. The results show that although the precision of the two classifiers
is almost the same, SVM performs much worse than logistic regression in recall,
by more than 10%, which results in a poor F-measure for SVM.
   This study indicates that, with regard to dimension selection for the
community feature, logistic regression performs much better than SVM.

                    Fig. 3. Comparison of Different Classifiers

5.3   Inference Performance
Figure 4 shows the results of different inference models. To make the different
models comparable, we set the dimension of Hashtag, Tag, Profile and Cmty to
500 each. From the results, we observe that WRand and Majority perform very
poorly because they do not consider any content or community information. Among
the four content-based models, the performance of LDA, Tag and Profile is
almost the same, about 60% on all metrics, while the performance of Hashtag is
comparatively poor, with an F-measure of just 24%. The Cmty model outperforms
all the content-based models, with nearly 70% precision, recall and F-measure.
We therefore conclude that the community feature, i.e., the latent social
dimension, performs better than content-based features in our occupation
inference task. The benefit of this finding is that we can infer a user's
occupation from network structure alone, without considering what he or she
tweets or re-tweets. This is especially important for users who seldom post
tweets or make comments.

5.4   Homophily Characters
In Section 4, we discussed the homophily hypothesis, i.e., users of the same
occupation are more likely to gather together and establish connections, thus
forming significant community structure. We verify this assumption in this
section from the following two aspects.

  Fig. 4. Comparison of Different Inference Models (Logistic Regression Classifier)

 – Verification 1: [4] points out that when the value of modularity in community
   discovery is larger than 0.3, an obvious community structure can be observed
   in the network. Accordingly, we calculate the modularity of the friend network
   and obtain 0.59, which is much larger than 0.3. We can thus conclude
   that a significant community structure exists in the friend network.
 – Verification 2: Having verified significant community structure in the friend
   network, we analyze whether there is a dominant occupation in these
   communities. Figure 5 shows the occupation distribution of the top-10
   communities detected from the friend network. The red color indicates the
   dominant occupation in each community. We find that there is always a
   dominant occupation in the top-10 communities. This also indicates that users
   of the same occupation are more likely to connect with each other and form
   community structure.

Community size  Education  Service  Estate  Entertainment  Media  Medical  Government  Public welfare  Others  Electronic  Transport  Finance
          1823      0.095    0.049   0.012          0.252  0.188    0.023       0.017           0.015   0.052       0.042      0.010    0.245
          1319      0.019    0.033   0.595          0.020  0.047    0.014       0.014           0.011   0.020       0.087      0.006    0.134
           984      0.023    0.447   0.027          0.034  0.061    0.012       0.045           0.021   0.066       0.075      0.149    0.039
           927      0.068    0.037   0.013          0.024  0.035    0.011       0.040           0.634   0.071       0.033      0.008    0.027
           810      0.075    0.101   0.017          0.060  0.064    0.044       0.017           0.002   0.072       0.475      0.021    0.049
           715      0.022    0.076   0.001          0.006  0.027    0.580       0.069           0.048   0.091       0.052      0.014    0.015
           693      0.017    0.014   0.001          0.003  0.087    0.007       0.694           0.006   0.088       0.020      0.046    0.016
           644      0.012    0.022   0.003          0.042  0.054    0.009       0.002           0.006   0.017       0.030      0.764    0.039
           485      0.501    0.019   0.008          0.016  0.031    0.019       0.153           0.023   0.167       0.049      0.012    0.002
           336      0.042    0.021   0.015          0.036  0.054    0.074       0.083           0.024   0.196       0.009      0.435    0.012
           270      0.089    0.063   0.004          0.052  0.085    0.537       0.004           0.019   0.019       0.067      0.015    0.048
           250      0.172    0.020   0.012          0.012  0.080    0.028       0.128           0.004   0.396       0.036      0.044    0.068
           239      0.025    0.100   0.004          0.548  0.151    0.013       0.038           0.013   0.038       0.029      0.004    0.038
           100      0.060    0.080   0.000          0.000  0.090    0.050       0.110           0.010   0.480       0.080      0.010    0.030
           100      0.040    0.030   0.060          0.170  0.080    0.000       0.100           0.030   0.390       0.050      0.020    0.030

Fig. 5. The Occupation Distribution of top 10 Communities Detected (Friend Network)
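
The dominant occupation of a community is simply the occupation with the
largest share in its row of Fig. 5. The short sketch below reads it off for
three rows copied from the figure; the column order is taken from the figure
header as extracted and should be treated as an assumption, as should the
helper name.

```python
from typing import List, Tuple

# Column order as read from the header of Fig. 5 (assumed).
OCCUPATIONS = ["Education", "Service", "Estate", "Entertainment", "Media",
               "Medical", "Government", "Public welfare", "Others",
               "Electronic", "Transport", "Finance"]

# Three rows copied from Fig. 5: community size -> occupation shares.
communities = {
    1319: [0.019, 0.033, 0.595, 0.020, 0.047, 0.014,
           0.014, 0.011, 0.020, 0.087, 0.006, 0.134],
    693: [0.017, 0.014, 0.001, 0.003, 0.087, 0.007,
          0.694, 0.006, 0.088, 0.020, 0.046, 0.016],
    644: [0.012, 0.022, 0.003, 0.042, 0.054, 0.009,
          0.002, 0.006, 0.017, 0.030, 0.764, 0.039],
}


def dominant_occupation(shares: List[float]) -> Tuple[str, float]:
    """Return the occupation with the largest share and that share."""
    idx = max(range(len(shares)), key=shares.__getitem__)
    return OCCUPATIONS[idx], shares[idx]


for size, shares in communities.items():
    print(size, dominant_occupation(shares))
```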

5.5   Heterogeneous Network Eﬀects

Heterogeneous networks are networks made up of different types of objects or
different interaction patterns. Here, we focus on three heterogeneous networks
based on different interaction types: the friend network, the retweet network
and the mention network. The friend network is made up of bi-follower
relationships. The retweet network is formed by retweeting actions between
users. The mention network is created by users using @ to mention each other.
Figure 6 shows the inference results of the three networks using only the
community feature. From the results, we can conclude that the friend network
performs best in all three metrics, the retweet network performs slightly worse
than the friend network, and the mention network performs worst. One
interesting phenomenon is that even though both the retweet network and the
mention network arise from user interaction behaviors, their performance is
quite different.
   To investigate the reason, we conduct a tentative analysis of the three
networks. From Table 2, we find that the link counts and average degrees of the
friend network and the retweet network are almost the same, while the mention
network is much sparser, with more than 10000 users of degree zero, which might
be one reason for its poor performance. In addition, [5] points out that in
social networks the mention function (@) often connects users having different
or even opposed opinions and behaviors, which results in weak homophily in the
mention network.
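
The homophily argument above can be made concrete with a simple edge-level
measure: the fraction of links whose two endpoints share the same occupation
label. The sketch below is only an illustration under stated assumptions. The
function name edge_homophily, the toy edge lists and the occupation labels are
hypothetical and are not taken from the dataset or from the method used in this
paper.

```python
def edge_homophily(edges, occupation):
    """Fraction of links joining two users with the same occupation label.

    edges: iterable of (user, user) pairs.
    occupation: dict mapping user -> occupation label (hypothetical labels).
    Links with an unlabeled endpoint are skipped.
    """
    labeled = [(u, v) for u, v in edges if u in occupation and v in occupation]
    if not labeled:
        return 0.0
    same = sum(1 for u, v in labeled if occupation[u] == occupation[v])
    return same / len(labeled)


# Toy data, invented for illustration only.
occupation = {"a": "media", "b": "media", "c": "IT", "d": "IT"}
retweet_edges = [("a", "b"), ("c", "d"), ("a", "c")]
mention_edges = [("a", "c"), ("b", "d"), ("a", "b")]

retweet_h = edge_homophily(retweet_edges, occupation)
mention_h = edge_homophily(mention_edges, occupation)
print(f"retweet homophily: {retweet_h:.2f}, mention homophily: {mention_h:.2f}")
```

A network in which this ratio is low offers little community signal for
occupation inference, which is consistent with the explanation given for the
mention network.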

        Fig. 6. Inference Performance of Diﬀerent Heterogeneous Networks

            Table 2. Statistical Characteristics of Heterogeneous Networks

                                       Friend Network   Retweet Network   Mention Network
  Links                                       2770378           2745765            776177
  Average degree                                   42                41                12
  Number of nodes with degree of zero             618              3714             10811
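
As a rough illustration of how statistics like those in Table 2 could be
derived, the sketch below builds an undirected network from interaction records
and reports the link count, the average degree and the number of zero-degree
nodes. It rests on several assumptions: the helper names network_stats and
friend_edges, the toy records, and the definition of average degree as
2 * links / nodes are ours; the paper does not state which definition it uses.

```python
from collections import defaultdict


def network_stats(users, edges):
    """Return (links, average degree, zero-degree node count).

    The network is treated as undirected; duplicate links and self-loops
    are ignored. Average degree is assumed to be 2 * links / len(users).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors[u].add(v)
        neighbors[v].add(u)
    links = sum(len(n) for n in neighbors.values()) // 2
    isolated = sum(1 for u in users if not neighbors.get(u))
    avg_degree = 2 * links / len(users) if users else 0.0
    return links, avg_degree, isolated


def friend_edges(follows):
    """Keep only mutual (bi-follower) pairs from directed follow records."""
    follow_set = set(follows)
    return {
        (min(a, b), max(a, b))
        for a, b in follow_set
        if a != b and (b, a) in follow_set
    }


# Toy records, invented for illustration only.
users = ["a", "b", "c", "d"]
follows = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c")]
mentions = [("a", "b")]

friend_stats = network_stats(users, friend_edges(follows))
mention_stats = network_stats(users, mentions)
print("friend network:", friend_stats)
print("mention network:", mention_stats)
```

In this toy example the mention network leaves two of the four users with
degree zero, mirroring on a small scale the sparsity reported for the mention
network in Table 2.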


6   Conclusions
In this paper, we propose a multi-source model to infer users' occupation
categories on micro-blog platforms. The model uses both content features from
users' posts and community features from their interactions. The content
features include tweet content, hashtags, personal tags and the personal
profile. For the network feature, we use latent social dimensions to better
identify a user's latent affiliations. We then model the inference task as a
supervised classification problem and use manually labeled data to train an
inference model.
   We systematically analyze the data characteristics of a large real micro-blog
(Sina Weibo) dataset and demonstrate the advantage of the proposed approach. We
also reveal the patterns of different user interaction communities and the
homophily among users of the same occupation category.
   With the continuous growth of social media services, effective user profile
extraction and user interest modeling are becoming increasingly important. The
occupation inference model proposed in this paper has several promising future
directions. For example, we can extract users' occupation evolution, profile
variety and interest distribution among different groups.

Acknowledgements. This research was supported by the National Natural
Science Foundation of China under Grant Nos. 61272155 and 61073019.

References
 1. Backstrom, L., Sun, E., Marlow, C.: Find me if you can: improving geographical
    prediction with social and spatial proximity. In: Proceedings of the 19th Interna-
    tional Conference on World Wide Web, pp. 61–70. ACM (2010)
 2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. The Journal of
    Machine Learning Research 3, 993–1022 (2003)
 3. Cha, M., Haddadi, H., Benevenuto, F., Gummadi, K.P.: Measuring user inﬂuence
    in Twitter: The million follower fallacy. In: 4th International AAAI Conference on
    Weblogs and Social Media (ICWSM), vol. 14, page 8 (2010)
 4. Clauset, A., Newman, M.E., Moore, C.: Finding community structure in very large
    networks. Physical Review E 70(6), 066111 (2004)
 5. Conover, M.D., Ratkiewicz, J., Francisco, M., Gonçalves, B., Flammini, A.,
    Menczer, F.: Political polarization on Twitter. In: Proc. 5th Intl. Conference on
    Weblogs and Social Media (2011)
 6. Girvan, M., Newman, M.E.: Community structure in social and biological networks.
    Proceedings of the National Academy of Sciences 99(12), 7821–7826 (2002)
 7. Hagen, L., Kahng, A.B.: New spectral methods for ratio cut partitioning and clus-
    tering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
    Systems 11(9), 1074–1085 (1992)
 8. Han, J.: Mining heterogeneous information networks by exploring the power of
    links. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds.) DS 2009. LNCS,
    vol. 5808, pp. 13–30. Springer, Heidelberg (2009)
 9. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fif-
    teenth Conference on Uncertainty in Artiﬁcial Intelligence, pp. 289–296. Morgan
    Kaufmann Publishers Inc. (1999)


10. Li, R., Wang, S., Deng, H., Wang, R., Chang, K.C.-C.: Towards social user pro-
    ﬁling: uniﬁed and discriminative inﬂuence model for inferring home locations. In:
    Proceedings of the 18th ACM SIGKDD International Conference on Knowledge
    Discovery and Data Mining, pp. 1023–1031. ACM (2012)
11. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: Homophily in
    social networks. Annual Review of Sociology, 415–444 (2001)
12. Mislove, A., Viswanath, B., Gummadi, K.P., Druschel, P.: You are who you know:
    inferring user proﬁles in online social networks. In: Proceedings of the Third ACM
    International Conference on Web Search and Data Mining, pp. 251–260. ACM
    (2010)
13. Newman, M.E.: Finding community structure in networks using the eigenvectors
    of matrices. Physical Review E 74(3), 036104 (2006)
14. Newman, M.E.: Modularity and community structure in networks. Proceedings of
    the National Academy of Sciences 103(23), 8577–8582 (2006)
15. Otterbacher, J.: Inferring gender of movie reviewers: exploiting writing style, con-
    tent and metadata. In: Proceedings of the 19th ACM International Conference on
    Information and Knowledge Management, pp. 369–378. ACM (2010)
16. Pennacchiotti, M., Popescu, A.-M.: Democrats, Republicans and Starbucks
    afficionados: user classification in Twitter. In: Proceedings of the 17th ACM SIGKDD
    International Conference on Knowledge Discovery and Data Mining, pp. 430–438.
    ACM (2011)
17. Pothen, A., Simon, H.D., Liou, K.-P.: Partitioning sparse matrices with eigenvec-
    tors of graphs. SIAM Journal on Matrix Analysis and Applications 11(3), 430–452
    (1990)
18. Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user attributes
    in Twitter. In: Proceedings of the 2nd International Workshop on Search and Mining
    User-Generated Contents, pp. 37–44. ACM (2010)
19. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collec-
    tive classiﬁcation in network data. AI Magazine 29(3), 93 (2008)
20. Sun, Y., Norick, B., Han, J., Yan, X., Yu, P.S., Yu, X.: Integrating meta-path
    selection with user-guided object clustering in heterogeneous information networks.
    In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge
    Discovery and Data Mining, pp. 1348–1356. ACM (2012)
21. Tang, L., Liu, H.: Relational learning via latent social dimensions. In: Proceedings
    of the 15th ACM SIGKDD International Conference on Knowledge Discovery and
    Data Mining, pp. 817–826. ACM (2009)
22. Yang, S.-H., Long, B., Smola, A., Sadagopan, N., Zheng, Z., Zha, H.: Like like alike:
    joint friendship and interest propagation in social networks. In: Proceedings of the
    20th International Conference on World Wide Web, pp. 537–546. ACM (2011)
