Node metadata can produce predictability transitions in network inference problems
Oscar Fajardo-Fontiveros,1,∗ Marta Sales-Pardo,1,† and Roger Guimerà2,1,‡
1 Department of Chemical Engineering, Universitat Rovira i Virgili, 43007 Tarragona, Catalonia
2 ICREA, 08010 Barcelona, Catalonia
(Dated: March 29, 2021)
arXiv:2103.14424v1 [physics.data-an] 26 Mar 2021
Network inference is the process of learning the properties of complex networks from data. Besides using
information about known links in the network, node attributes and other forms of network metadata can help
to solve network inference problems. Indeed, several approaches have been proposed to introduce metadata
into probabilistic network models and to use them to make better inferences. However, we know little about
the effect of such metadata in the inference process. Here, we investigate this issue. We find that, rather than
affecting inference gradually, adding metadata causes abrupt transitions in the inference process and in our
ability to make accurate predictions, from a situation in which metadata does not play any role to a situation
in which metadata completely dominates the inference process. When network data and metadata are partly
correlated, metadata optimally contributes to the inference process at the transition between data-dominated and
metadata-dominated regimes.
∗ oscar.fajardo@urv.cat
† marta.sales@urv.cat
‡ roger.guimera@urv.cat; Corresponding author

Many systems can be represented as networks, with nodes representing units (for example, people in a social network, or proteins in a protein-protein interaction network), and links representing interactions between the units (for example, friendship relationships or physical binding interactions between proteins). Network inference is the process of inferring the properties of those networks from data; typical network inference problems include the identification of groups of nodes with similar connection patterns, or the identification of unobserved interactions, that is, link prediction [1–6]. Network inference and, in particular, link prediction are increasingly important in problems with applications ranging from the prediction of interactions between drugs [7–9] to the prediction of human preferences and decisions [10–13].

Typically, network inference starts from observations of some of the links in the network, which are used to predict unobserved links or to infer other network properties. However, other sources of information such as system dynamics [14, 15] or node attributes [13, 16–23] can also be used to aid in the inference process. Here we study how node attributes are introduced in the inference process, and what the effect of using such metadata is.

We present our work in terms of the problem of link prediction in recommender systems [11, 12, 24], in which the goal is to predict the association between users and items (for example, books or movies). However, our conclusions apply to network inference problems in general. We introduce a multipartite network model that encompasses and generalizes previous attempts to use node metadata in network inference problems (Fig. 1). Within this framework, the problem of link prediction in general unipartite or bipartite networks is just a particular case. Unlike most previous approaches, our multipartite network model allows us to control the importance of the node metadata and thus to investigate when and how metadata helps in the inference.

We find that, contrary to what one may expect, node metadata do not affect the inference problem gradually. Rather, even when the weight of metadata increases smoothly, the inference process undergoes a transition from a situation in which metadata does not play any role, to a situation in which metadata completely dominates the inference process. When network data and metadata are partly correlated, metadata optimally contributes to the inference process at the transition between data-dominated and metadata-dominated regimes.

I. MULTIPARTITE MIXED-MEMBERSHIP STOCHASTIC BLOCK MODELS WITH LABELED LINKS

We introduce a very general network model based on stochastic block models [3, 25, 26] that allows us to deal with (directed or undirected) unipartite and bipartite networks, whose links are binary or labeled, and with node attributes of different types that can be combined as needed (Fig. 1). As we discuss below, this model extends and generalizes previous models.

In what follows we use the terminology of recommender systems [11, 12, 24] although, as previously mentioned, the model is completely general and applicable to any type of relational data with node attributes. Our objective is to model a bipartite network with labeled links connecting $N$ users to $M$ items (for example, movies or books). Links $r_{ij}$ represent ratings of users $i$ to items $j$ and are labeled, that is, $r_{ij}$ can take values in a finite discrete set such as {like, dislike}, {green, yellow, red}, or $\{0, 1, \dots, R\}$. To model these ratings, we assume that: (i) there are user and item groups, and users and items belong to mixtures of such groups; (ii) the probability that a user $i$ rates item $j$ with $r_{ij}$ depends only on the groups to which they belong.

These assumptions lead to a bipartite [10, 11, 27] mixed-membership [28] stochastic block model [12] in which the probability that user $i$ gives item $j$ a rating $r$ is

$$\Pr[r_{ij} = r] = \sum_{\alpha\beta} \theta_{i\alpha}\, \eta_{j\beta}\, p_{\alpha\beta}(r) . \tag{1}$$
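As a concrete reading of Eq. (1), the following minimal sketch evaluates the rating distribution for a single user-item pair. The function name `rating_distribution`, the nested-list layout `p[alpha][beta][r]`, and the numerical values are illustrative assumptions, not part of the model specification; the symbols $\theta$, $\eta$ and $p$ are defined in the text that follows.

```python
def rating_distribution(theta_i, eta_j, p):
    """Return [Pr[r_ij = r] for every rating r], following Eq. (1).

    theta_i: membership vector of user i (sums to 1).
    eta_j:   membership vector of item j (sums to 1).
    p:       p[alpha][beta][r], normalized over r for each (alpha, beta).
    """
    n_ratings = len(p[0][0])
    return [
        sum(theta_i[a] * eta_j[b] * p[a][b][r]
            for a in range(len(theta_i))
            for b in range(len(eta_j)))
        for r in range(n_ratings)
    ]


# Hypothetical toy values: 2 user groups, 2 item groups, 2 possible ratings.
theta_i = [0.8, 0.2]
eta_j = [0.5, 0.5]
p = [[[0.9, 0.1], [0.6, 0.4]],
     [[0.3, 0.7], [0.2, 0.8]]]
dist = rating_distribution(theta_i, eta_j, p)
```

Because the membership vectors and each $p_{\alpha\beta}(\cdot)$ are normalized, the returned list is itself a probability distribution over ratings.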
Here, $\boldsymbol{\theta}_i$ is the normalized membership vector of user $i$, and each element $\theta_{i\alpha}$ represents the probability that user $i$ belongs to group $\alpha$ (with $\sum_\alpha \theta_{i\alpha} = 1$). Similarly, $\boldsymbol{\eta}_j$ is the normalized membership vector of item $j$; $\eta_{j\beta}$ represents the probability that item $j$ belongs to group $\beta$. Finally, $p_{\alpha\beta}(r)$ is the probability that a user in group $\alpha$ and an item in group $\beta$ are connected with a rating $r$. The normalization condition here is $\sum_r p_{\alpha\beta}(r) = 1$.

FIG. 1. Multipartite mixed-membership stochastic block model with labeled links. (a) We cast the recommendation problem (in which one aims to predict how users will rate certain items) into a network inference problem. Here, users rate movies with three possible ratings (green, orange or red). Additionally, we have excluding attributes for users (two excluding genders and three excluding age groups, represented by different shades of the same color) and non-excluding attributes for movies (two movie genres; the connection to these attributes is binary, yes/no, but in general it does not need to be). Similar to ratings, we represent these attributes as bipartite networks. Although we frame our description of the model in terms of recommendations or link prediction in a bipartite network, the problem of link prediction in regular unipartite networks is just a particular case in which user nodes and item nodes are the same. (b) Each bipartite network in the multipartite network is modeled using a mixed-membership stochastic block model (see text). The individual block models are coupled by the user and item membership vectors ($\theta$ and $\eta$, respectively), shown in (c) along with all other model parameters and their dimensions (see text).

We note that the association between nodes (users and items) and attributes can also be represented as a bipartite network. Therefore we can model node-attribute associations in a similar manner to ratings. Because we are interested in how node attributes can help in the inference of the model for ratings ($\theta$, $\eta$, $p$), we consider that membership vectors for users ($\theta$) and items ($\eta$) in their respective attribute networks are the same as in the model for the ratings.

We consider both excluding and non-excluding attributes. For excluding attributes, having one attribute excludes from having another; for example, a user's age group cannot be 30-39 years old and 40-49 years old simultaneously. We model each set of excluding attributes as a single attribute node (for example, an age node) that is connected to users or items through labeled links (each label representing a mutually excluding age group in the example). The probability that user $i$ has an excluding attribute $e$ (that is, the probability that the link $e_{i\ell}$ between user $i$ and attribute node $\ell$ is of type $e$) is

$$\Pr[e_{i\ell} = e] = \sum_{\alpha} \theta_{i\alpha}\, q_{\alpha}(e) , \tag{2}$$

where $q_\alpha(e)$ is the probability that a user of group $\alpha$ has an attribute of type $e$, and $\sum_e q_\alpha(e) = 1$. For items, the expression is identical except that we use item membership vectors $\eta$ instead of user membership vectors $\theta$.

We also consider non-excluding attributes, such as item genre (for example, a movie could be both “action” and “western”). We model each of these non-excluding attribute types as individual attribute nodes connected to user or item nodes by links that are typically binary (either do or do not have the attribute) but that could in general be also labeled. Then, the probability that user $i$ has attribute $g$ of type $a$ is also modeled using a mixed-membership, bipartite stochastic block model

$$\Pr[a_{ig} = a] = \sum_{\alpha\gamma} \theta_{i\alpha}\, \zeta_{g\gamma}\, \hat{q}_{\alpha\gamma}(a) \tag{3}$$

where $\zeta_{g\gamma}$ is the membership vector of attribute $g$ and $\hat{q}_{\alpha\gamma}(a)$ is the probability that a user in group $\alpha$ has an attribute of type $a$ for an attribute in attribute group $\gamma$. As before, the expression for item non-excluding attributes is identical, just replacing user membership vectors $\theta$ by item membership vectors $\eta$.

II. MODEL POSTERIOR AND INFERENCE

Our objective is to model the observed ratings $R^O$, and to predict the value of some unobserved ratings $R$. For this, and given Eq. (1), we need to infer the parameters $\theta$, $\eta$ and $p$ from $R^O$; the posterior distribution over these parameters is given by

$$P(\theta, \eta, p \mid R^O) \propto P(R^O \mid \theta, \eta, p)\, P(\theta, \eta, p) \equiv L^R(\theta, \eta, p)\, P(\theta, \eta, p) , \tag{4}$$

where $L^R(\theta, \eta, p) = P(R^O \mid \theta, \eta, p)$ is the likelihood of the model and $P(\theta, \eta, p)$ is the prior over model parameters. According to Eq. (1), the likelihood is

$$L^R(\theta, \eta, p) = \prod_{(i,j)\in R^O} \sum_{\alpha\beta} \theta_{i\alpha}\, \eta_{j\beta}\, p_{\alpha\beta}(r^O_{ij}) . \tag{5}$$

Similarly, if we decide to jointly model the ratings and the metadata encoded in the observed user and item attributes $A^O$, we also need to infer the values of the parameters $\zeta$, $q$ and $\hat{q}$ using the posterior

$$P(\theta, \eta, \zeta, p, q, \hat{q} \mid R^O, A^O) \propto L^R(\theta, \eta, p) \times \prod_k L^{A_k}(\theta, \eta, \zeta, q, \hat{q}) \times P(\theta, \eta, \zeta, p, q, \hat{q}) \tag{6}$$
where $L^{A_k}(\theta, \eta, \zeta, q, \hat{q}) = P(A^O_k \mid \theta, \eta, \zeta, q, \hat{q})$ is the likelihood of the $k$-th attribute network (for example, the age attribute network for users, or the genre attribute network for items). For the $k$-th excluding attribute, this likelihood reads

$$L^{A_k}(\theta, \eta, q) = \prod_{(i,\ell_k)\in A^O_k} \left[ \sum_{\alpha} \theta_{i\alpha}\, q^k_{\alpha}\big((e_k)^O_{i\ell_k}\big) \right] , \tag{7}$$

where $\ell_k$ is the $k$-th excluding attribute and the product is over all nodes $i$ for which we observe attribute $\ell_k$.

For the $k$-th non-excluding attribute we have

$$L^{A_k}(\theta, \eta, \zeta, \hat{q}) = \prod_{(i,g)\in A^O_k} \left[ \sum_{\alpha\gamma} \theta_{i\alpha}\, \zeta^k_{g\gamma}\, \hat{q}^k_{\alpha\gamma}\big((a_k)^O_{ig}\big) \right] . \tag{8}$$

where the product is over all observed associations between nodes $i$ and attributes $g$ within the $k$-th class of non-excluding attributes.

Ignoring normalizing constants, and in a spirit similar to Refs. [17, 23], we define a parametric log-posterior as

$$\pi(\theta, \eta, \zeta, p, q, \hat{q} \mid R^O, A^O) = \mathcal{L}^R(\theta, \eta, p) + \sum_k \lambda_k\, \mathcal{L}^{A_k}(\theta, \eta, \zeta, q, \hat{q}) , \tag{9}$$

where $\mathcal{L}^R(\theta, \eta, p)$ and $\mathcal{L}^{A_k}(\theta, \eta, \zeta, q, \hat{q})$ are the log-likelihoods of ratings and attributes, respectively. For $\lambda_k = 0$, we recover Eq. (4) with uniform priors on the parameters, thus completely ignoring all metadata. Conversely, for $\lambda_k = 1$, we are jointly modeling the network of ratings and the network of attributes as in Eq. (6), with uniform priors on the parameters. By tuning the values of $\lambda_k$ we can interpolate between these situations, and extrapolate to situations with $\lambda_k > 1$ in which we would eventually only model the attribute network ($\lambda_k \gg 1$). The terms corresponding to the attribute models can indistinctly be interpreted as part of the likelihood of a joint model of ratings and attributes, similar to Refs. [17, 18, 22, 23], or as a non-uniform prior over membership vectors as in Refs. [16, 19, 20]. If interpreted as part of a joint model, then $\lambda_k$ can be seen as factors that are needed because attribute data are somehow less (or more) reliable than rating data, perhaps because we have reason to believe that attributes are more (or less) subject to noise, or because each rating corresponds, in fact, to a mean over several observations. Conversely, if interpreted as priors over the partitions, $\lambda_k$ should be interpreted as hyperparameters defining how certain we are a priori about the importance of node attributes.

Either way, this parametrized posterior allows us to investigate how the metadata encoded in the attribute networks enter the inference process for the ratings, and under which conditions it results in better and more predictive models for those ratings. To do this, we maximize the posterior for fixed values of $\lambda_k$ using an expectation-maximization algorithm [12, 19, 22, 23] (see Appendix A), which gives the most plausible parameter values. Because the posterior landscape is in general rugged, we perform several runs of the EM algorithm and compute the average probability for each unobserved rating to make predictions (see [12] and Appendix A).

III. RELATIONSHIP TO PREVIOUS WORK

The literature on using metadata for link prediction and recommender systems is vast, and includes all sorts of approaches ranging from simple heuristics to sophisticated machine learning methods. However, our interest here is more closely related to probabilistic approaches to network inference, even when those approaches are not applied directly to link prediction [13, 16–21]; as shown in Refs. [22, 23], once model parameters are inferred for, say, community detection, they can easily be used to predict links as well. Our focus on approaches based on probabilistic generative models is motivated by three characteristics of such approaches: (i) all assumptions in them are explicit; (ii) principled (as opposed to heuristic) and sometimes even exact inference approaches are possible; and (iii) their results are more readily interpretable. These three characteristics make probabilistic approaches especially appropriate for our ultimate goal of understanding how node attributes enter and help in the inference process.

From this perspective, the multipartite mixed-membership stochastic block model is useful because it extends and generalizes previous models. By introducing excluding and non-excluding attributes, the model can accommodate simultaneously attributes like those considered in Refs. [19, 23] (excluding) and in Refs. [17, 18] (non-excluding). It can also combine an arbitrary number of attributes of different types, unlike approaches that can only deal with single attributes [19, 23] or, more often, with a single type of attribute; and it deals naturally with missing attribute data, unlike approaches that require all node attributes to be known [16, 20]. Since attributes are modeled with a stochastic block model, our approach also automatically clusters attributes that have similar effects on the data (for example, age groups that show similar behavior) as in Ref. [18]. Unlike most previous approaches for attributed networks, nodes and attributes in our model belong to mixtures of groups, which makes the model more expressive [12], links between nodes and to attributes can be labeled, and the influence of the attributes can be tuned on and off (as in Ref. [23]). As stated above, this last feature is precisely the main focus of our work.

IV. SYNTHETIC DATA

We first use synthetic data to validate the expectation-maximization inference approach and to investigate the role of introducing node attributes. We generate synthetic data with a model similar to the model in Fig. 1. Our synthetic rating networks consist of 200 users and 200 items, partitioned into $K = 2$ groups of users and $L = 4$ groups of items. Users have an excluding attribute labeled “male” or “female”, and items have an excluding attribute labeled from 0 to 3, which may represent four different genres.

In the simplest case, in which ratings and attributes are completely correlated, all female users have membership vectors $\boldsymbol{\theta}_f = (0.8, 0.2)$; conversely, all male users have $\boldsymbol{\theta}_m = (0.2, 0.8)$. Similarly, an item with attribute $a$ has a membership of 0.8 to group $a$ and 0.067 to all other groups. To
simulate partial correlation $c$ or even no correlation ($c = 0$) between membership vectors and attributes, with probability $1 - c$ we reassign each node attribute to a value selected uniformly at random among all possibilities (2 for users and 4 for items).
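
The attribute-generation procedure just described can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `make_nodes` is hypothetical, and the sketch assumes that each node's initial attribute is drawn uniformly at random, which the text does not specify.

```python
import random


def make_nodes(n_nodes, n_groups, c, rng):
    """Return (memberships, attributes) for n_nodes synthetic nodes.

    Each node first gets an attribute a (assumption: drawn uniformly) and a
    membership of 0.8 on group a, with the remaining 0.2 split evenly over
    the other groups. With probability 1 - c the attribute is then redrawn
    uniformly among all possibilities, decoupling it from the membership.
    """
    off_group = 0.2 / (n_groups - 1)
    memberships, attributes = [], []
    for _ in range(n_nodes):
        a = rng.randrange(n_groups)
        memberships.append([0.8 if g == a else off_group
                            for g in range(n_groups)])
        if rng.random() < 1.0 - c:
            a = rng.randrange(n_groups)  # reassigned attribute
        attributes.append(a)
    return memberships, attributes


rng = random.Random(0)
theta, gender = make_nodes(200, 2, c=0.75, rng=rng)  # users, 2 groups
eta, genre = make_nodes(200, 4, c=0.75, rng=rng)     # items, 4 groups
```

With two groups the off-group membership is 0.2, and with four groups it is 0.2/3 ≈ 0.067, matching the values quoted in the text.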

For the experiments reported in Fig. 2, we consider all attribute links, but only a number $|R^O| = 400$ of observed ratings (that is, 1% of all generated ratings). Although the synthetic data are created with item genre as an excluding attribute, we carry out the inference process assuming that genre is a non-excluding attribute, which is what one would likely assume in real settings where the generating model is unknown.
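
The sparse-observation setup can be illustrated with a short sketch that draws 400 of the 200 × 200 = 40,000 user-item pairs (1%) as observed. The helper names are hypothetical, and the splitting of the observed pairs into five folds is only an assumption about how the 5-fold cross-validation mentioned in the caption of Fig. 2 could be arranged.

```python
import random


def sample_observed_pairs(n_users, n_items, n_observed, rng):
    """Draw n_observed distinct (user, item) pairs uniformly at random."""
    all_pairs = [(i, j) for i in range(n_users) for j in range(n_items)]
    return rng.sample(all_pairs, n_observed)


def k_folds(pairs, k):
    """Split pairs into k disjoint folds of (nearly) equal size."""
    return [pairs[f::k] for f in range(k)]


observed = sample_observed_pairs(200, 200, 400, random.Random(1))
folds = k_folds(observed, 5)  # assumed arrangement: 5 folds of 80 pairs
```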

We infer the values of the model parameters using the expectation-maximization equations, and use the inferred parameters to predict unobserved ratings in the bipartite ratings network. We do this for different levels of correlation $c$ between the ratings and the attribute networks (Fig. 2), from a situation $c = 1$ in which the attributes are perfectly correlated with user and item membership vectors (all male users belong to one group and have identical parameters, and all females belong to another group with different parameters; items with each genre belong to the exact same mixture of groups) to a situation $c = 0$ in which user and item memberships and attributes are completely uncorrelated (Fig. 2).
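
The quantity maximized in these runs is the parametric log-posterior of Eq. (9). The following minimal sketch evaluates it for given parameters; it is not the expectation-maximization algorithm itself. As a simplifying assumption, both the user and the item attribute are treated as excluding attributes (Eq. (7)), whereas the experiments above treat genre as non-excluding; all function names and toy values are hypothetical.

```python
import math


def rating_loglik(ratings, theta, eta, p):
    """Logarithm of Eq. (5); ratings is a list of (i, j, r) triples."""
    total = 0.0
    for i, j, r in ratings:
        prob = sum(theta[i][a] * eta[j][b] * p[a][b][r]
                   for a in range(len(theta[i]))
                   for b in range(len(eta[j])))
        total += math.log(prob)
    return total


def excluding_attr_loglik(attrs, member, q):
    """Logarithm of Eq. (7); attrs is a list of (node, e) pairs."""
    total = 0.0
    for node, e in attrs:
        prob = sum(member[node][a] * q[a][e]
                   for a in range(len(member[node])))
        total += math.log(prob)
    return total


def log_posterior(ratings, user_attrs, item_attrs, theta, eta, p,
                  q_user, q_item, lam_user, lam_item):
    """Eq. (9) with one excluding attribute per node type."""
    return (rating_loglik(ratings, theta, eta, p)
            + lam_user * excluding_attr_loglik(user_attrs, theta, q_user)
            + lam_item * excluding_attr_loglik(item_attrs, eta, q_item))


# Hypothetical toy problem: 2 users, 2 items, 2 groups each, 2 ratings.
theta = [[0.8, 0.2], [0.2, 0.8]]
eta = [[0.5, 0.5], [0.5, 0.5]]
p = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
q = [[0.9, 0.1], [0.1, 0.9]]
ratings = [(0, 0, 1), (1, 1, 0)]
user_attrs = [(0, 0), (1, 1)]
item_attrs = [(0, 0), (1, 1)]
pi_data_only = log_posterior(ratings, user_attrs, item_attrs,
                             theta, eta, p, q, q, 0.0, 0.0)
pi_with_users = log_posterior(ratings, user_attrs, item_attrs,
                              theta, eta, p, q, q, 1.0, 0.0)
```

Setting both weights to zero reduces the objective to the rating log-likelihood alone, while increasing a weight scales the contribution of the corresponding attribute network linearly.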

FIG. 2. Predictive performance and effect of metadata on synthetic ratings. We create synthetic ratings from 200 users on 200 items, with different levels of correlation $c$ between ratings and node attributes (see text). We then use 5-fold cross-validation to calculate the performance of the expectation-maximization equations at predicting unobserved ratings. In particular, we take as a reference the predictive accuracy $a_0$ of the algorithm when all attributes are ignored ($\lambda_{\mathrm{user}} = \lambda_{\mathrm{item}} = 0$), and measure relative accuracy $\alpha$ for a given pair $(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}})$ as the log-ratio $\alpha(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}}) = \log\left[a(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}})/a_0\right]$. The value $\alpha(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}}) = 0$ (dashed line) thus indicates no change with respect to the reference $a_0$, and $\alpha(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}}) > 0$ (respectively, $\alpha(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}}) < 0$) indicates predictions that are more (less) accurate than those obtained by ignoring node attributes. The maximum possible relative performance (dotted line) is obtained when each rating is assigned the exact probability that was used to generate it. For each value of the correlation ((a)-(b), full correlation, $c = 1$; (c)-(d), $c = 0.75$; (e)-(f), $c = 0.50$; (g)-(h), no correlation, $c = 0$) we show the variation of $\alpha(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}})$ with $\lambda_{\mathrm{item}}$ for different values of $\lambda_{\mathrm{user}}$ (left), and the whole dependence of $\alpha(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}})$ on both $\lambda_{\mathrm{user}}$ and $\lambda_{\mathrm{item}}$ (right).
[Figure 2 panels, by row: totally correlated, 75% correlated, 50% correlated, 0% correlated. Axes: relative accuracy $\alpha$ versus $\lambda_{\mathrm{item}}$ (left panels); $\lambda_{\mathrm{user}}$ versus $\lambda_{\mathrm{item}}$ (right panels).]

Since we focus on sparse observations in which the number of observed ratings is low (only 1% of all ratings), model parameters cannot be inferred accurately from the ratings alone. Therefore, when we only consider the observed ratings $R^O$ and ignore all attributes $A^O$ by setting $\lambda_{\mathrm{user}} = \lambda_{\mathrm{item}} = 0$ in Eq. (9) ($\lambda_{\mathrm{user}}$ and $\lambda_{\mathrm{item}}$ correspond to the user and item attribute networks, respectively), the prediction of unobserved links is suboptimal, that is, the inferred probabilities of unobserved links differ significantly from the actual probabilities used to build the network.

When there is perfect correlation between node attributes and group memberships, considering the attributes $A^O$ by setting $\lambda_{\mathrm{user}} > 0$ and $\lambda_{\mathrm{item}} > 0$ should in principle help in the inference process. In fact, since attributes are perfectly correlated to group memberships, in the limit $\lambda_{\mathrm{user}} \to \infty$ and $\lambda_{\mathrm{item}} \to \infty$ nodes will be forced into the correct groups and predictions should be near optimal. This is what we observe in our numerical experiments (Fig. 2a-b). Interestingly, as we increase the weight of the attributes in the log-posterior from $\lambda_{\mathrm{user}} = \lambda_{\mathrm{item}} = 0$, the effect on prediction accuracy is not smooth. Rather, below certain threshold values of $\lambda_{\mathrm{user}}$ and $\lambda_{\mathrm{item}}$, using the attributes does not have any significant effect on prediction accuracy. Then, at those threshold values, a transition occurs and prediction accuracy increases abruptly until it reaches its theoretical maximum, as expected.

When attributes and ratings are completely uncorrelated (Fig. 2g-h), the role of attributes is reversed. Predictions are equally suboptimal at $\lambda_{\mathrm{user}} = \lambda_{\mathrm{item}} = 0$, but then, as $\lambda_{\mathrm{user}}$ and $\lambda_{\mathrm{item}}$ cross certain threshold values, predictions suddenly worsen as user and item nodes are forced into groups that are uncorrelated with their real membership vectors and, thus, with the observed ratings.

Unlike the extreme cases of total correlation or zero correlation, when attributes are partly correlated with the true group memberships of the nodes, the change in performance is not monotonic as we increase the importance of the attributes. As before, when $\lambda_{\mathrm{user}}$ and $\lambda_{\mathrm{item}}$ are small enough, we observe no difference with the situation in which the attributes are ignored entirely. In the other extreme, when $\lambda_{\mathrm{user}} \to \infty$ and $\lambda_{\mathrm{item}} \to \infty$, user and item nodes are forced into groups that match partly, but not perfectly, the true group memberships of the nodes, so the performance may increase or decrease with respect to the situation with no attributes, depending on
whether the correlation is high (Fig. 2c-d) or low (Fig. 2e-f). However, we find that the most predictive models in this case are those at intermediate values of $\lambda_{\mathrm{user}}$ and $\lambda_{\mathrm{item}}$, precisely at the transition region where both the observed ratings and the observed attributes play a role in determining the most plausible group memberships. In this case, the inferred node memberships do not coincide with either those that maximize $\mathcal{L}^R$ or those that maximize $\mathcal{L}^{A_k}$.

To understand the transition from the rating-dominated to the attribute-dominated regime, we study the posterior of the two extreme models corresponding to the maximum a posteriori estimates obtained by expectation-maximization for $\lambda_{\mathrm{user}} = \lambda_{\mathrm{item}} = 0$ and for $\lambda_{\mathrm{user}} = \lambda_{\mathrm{item}} \to \infty$ (Fig. 3). These are the most plausible models when only data (ratings) and only metadata (attributes) are taken into consideration, respectively. Regardless of the correlation between ratings and attributes, we find that the transition in predictability in Fig. 2 coincides with the region where the data-dominated and metadata-dominated posteriors cross. By considering Eq. (9) we see that this must be the case. Indeed, for each attribute network we find three regimes: one dominated by the $\mathcal{L}^R$ term, one dominated by the $\mathcal{L}^A$ term, and one in which both terms are comparable. Unless there is perfect or almost perfect correlation between attributes and node memberships, any improvement in predictive power must come from considering both the observed ratings and the observed attributes, and therefore in the transition region.

FIG. 3. Transition between data-dominated and metadata-dominated inference regimes. For the synthetic data in Fig. 2, we plot the log-posterior $\pi(\theta, \eta, \zeta, p, q, \hat{q} \mid R^O, A^O)$ as a function of the hyperparameter $\lambda = \lambda_{\mathrm{item}} = \lambda_{\mathrm{user}}$ for three models: the model that maximizes the data likelihood $\mathcal{L}^R$, the model that maximizes the metadata likelihood $\mathcal{L}^A$, and the model that maximizes the posterior when the two previous cases cross (that is, have equal posteriors). The position of the crossing coincides with the transitions and the maxima observed in Fig. 2.
[Figure 3 panels: (a) totally correlated, (b) 75% correlated, (c) 50% correlated, (d) 0% correlated. Axes: log-posterior $\pi$ versus $\lambda$. Curves: optimal model for data, optimal model for attributes, optimal model at transition.]

V. REAL DATA

Finally, we analyze two empirical data sets and study whether we observe the same behaviors as in the synthetic data. First, we consider the 100K MovieLens data set [29], which contains 100,000 ratings of movies by users. Age and gender attributes are available for users, which we model as excluding attributes (Fig. 4). Movies have genre attributes, which we model as non-excluding attributes. The relative weights of user and movie attributes are given by the parameters $\lambda_{\mathrm{user}}$ and $\lambda_{\mathrm{item}}$.

FIG. 4. Predictive performance and effect of metadata on the MovieLens data set. As in Fig. 2, we take as a reference the predictive accuracy $a_0$ of the algorithm when all attributes are ignored ($\lambda_{\mathrm{user}} = \lambda_{\mathrm{item}} = 0$), and measure relative accuracy $\alpha$ for a given pair $(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}})$ as the log-ratio $\alpha(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}}) = \log\left[a(\lambda_{\mathrm{user}}, \lambda_{\mathrm{item}})/a_0\right]$. We consider three different attributes for user nodes: (a)-(b), age; (c)-(d), gender; (e)-(f), age and gender combined as a single attribute. We plot the whole range of $\lambda_{\mathrm{user}}$ (left), and zoom into the intermediate (shaded) region of $\lambda_{\mathrm{user}}$ in which predictions are significantly more accurate than the reference (right).
[Figure 4 panels, by row: age, gender, age and gender. Axes: relative accuracy $\alpha$ versus $\lambda_{\mathrm{item}}$, for different values of $\lambda_{\mathrm{user}}$.]

Just as in the synthetic networks with small but finite correlation, we observe an intermediate value of $\lambda_{\mathrm{user}}$ and $\lambda_{\mathrm{item}}$
that provides more accurate rating predictions than either considering the observed ratings alone or considering the node attributes alone. This behavior is similar when we consider age only, gender only, or age and gender simultaneously. As in synthetic networks, the optimal combination of rating data and node metadata occurs for values of $\lambda$ such that the ratings network and the attributes networks have comparable contributions to the log-posterior.

Second, we consider a data set on the votes of 441 members of the U.S. House of Representatives in the 108th U.S. Congress [30] (Fig. 5). Between January 2003 and January 2005, these representatives voted on 1,217 bills, casting one of 9 different types of vote, which, following previous analyses, we simplify to Yes, No, and Other [30]. In this data set, “users” are the representatives and “items” are the bills. The ratings represent the votes of the representatives on the bills. For representatives, we have attribute data indicating their party and state, which we model as excluding attributes. Although all votes of all members are recorded in the data set (in total, 536,698 votes), for the purpose of our analysis we infer the parameters of the multipartite mixed-membership stochastic block model using 1% of the data, and predict the remaining 99% (and repeat this using each 1% of the data as training set).

Again, the effects of introducing the attributes in the inference process are very similar to those we encounter in synthetic data (Fig. 5). When using only the state of the representatives, we observe a behavior that is compatible with small but finite correlation between attribute and voting patterns, since the optimal predictive performance is observed at intermediate values of $\lambda_{\mathrm{user}}$. By contrast, when we consider party affiliation we observe a behavior that is compatible with almost perfect correlation between attribute and voting behavior. Indeed, in this case the predictive performance of the model increases monotonically with $\lambda_{\mathrm{user}}$, with an abrupt transition at $\lambda_{\mathrm{user}} \approx 1$, just as for perfectly correlated attributes in synthetic data. When state and party are combined into a single excluding attribute (for example, “Democrat from Texas” is a group), we observe a behavior compatible with strong (but imperfect) correlation between attributes and voting behavior. In this case, predictive accuracy does not improve monotonically with $\lambda_{\mathrm{user}}$ because, for very large values, representatives are forced into small groups that are more prone to fluctuations, that is, the model overfits the data, thus worsening the predictive power with respect to considering large groups associated to party affiliation alone.

FIG. 5. Predictive performance and effect of metadata on the U.S. Congress data set. As in Fig. 2, we take as a reference the predictive accuracy $a_0$ of the algorithm when all attributes are ignored ($\lambda_{\mathrm{user}} = 0$), and measure relative accuracy $\alpha$ for a given $\lambda_{\mathrm{user}}$ as the log-ratio $\alpha(\lambda_{\mathrm{user}}) = \log\left[a(\lambda_{\mathrm{user}})/a_0\right]$. We consider three different attributes for user nodes: party, state, and party and state simultaneously.
[Figure 5 axes: relative accuracy $\alpha$ versus $\lambda_{\mathrm{user}}$. Curves: party, party and state, state.]

VI. CONCLUSION

There is ample evidence that using node metadata can help to solve network inference problems. As we have discussed, several approaches have been proposed in recent years to introduce node attributes into probabilistic network models, and to use them to make better inferences about, for example, the group structure of networks or the existence of unobserved interactions. In these approaches, node attributes are introduced either as part of a whole-system model (including both the links between nodes and node attributes), or as priors over the parameters of the model for the links (for example, as priors for the node group memberships that, in turn, determine the probability of existence of links). However, beyond the improvement in performance that they may entail in a given task such as group detection or link prediction, we know little about the effect that node attributes have in the inference process. Here, our goal has been to clarify this issue.

Regardless of whether attributes are introduced as part of a whole model or as a prior for model parameters, they appear in probabilistic models as additional terms in the likelihood or the posterior. As we have shown, our results depend on this simple observation alone: only when all terms in these likelihoods or posteriors are comparable in magnitude, or when attributes are perfectly correlated with ratings, can we expect attributes to improve the inference process. In this sense, our findings here may be expected to be universal.

From a practical point of view, our work helps to understand when certain approaches will not work. For example, our results suggest that modeling data and metadata jointly will only improve link predictions (or other network inference problems) if two conditions are fulfilled simultaneously: (i) the metadata are correlated to the data; (ii) as we have mentioned, the balance between amount of data and metadata is such that their likelihoods ($\mathcal{L}^R$ and $\mathcal{L}^A$ above) are of the same order. If the first condition is not fulfilled, using metadata will in general worsen predictions, rather than improving them; if the second condition is not fulfilled, one may, in practice, inadvertently ignore either the data or the metadata and thus make, again, suboptimal predictions.

Some works have intuitively addressed this problem by introducing tuning parameters akin to our $\lambda_k$ [17, 23]. However, the impact of those parameters has not been studied in detail and, instead, their values are typically chosen among a very limited set by means of cross-validation. Our work clarifies how the value of those parameters should be chosen, and why.

From a broader perspective, our work opens the door to understanding the role of different terms in probabilistic network models, as well as the transitions that occur between the regimes in which one term or another dominates. This sets
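the stage for more systematic approaches to building better probabilistic models of network systems.

ACKNOWLEDGMENTS

The authors acknowledge support by the Spanish Ministerio de Economía y Competitividad (Grants FIS2016-78904-C3-P-1 and PID2019-106811GB-C31) and by the Government of Catalonia (Grant 2017SGR-896).

Appendix A: Expectation-maximization equations

We aim to maximize the parametric log-posterior in Eq. (9) as a function of the model parameters $\theta$, $\eta$, $p$, $\zeta$, $q$ and $\hat{q}$. Because logarithms of sums are hard to deal with, we use a variational trick that first introduces an auxiliary distribution $p(x)$ with $\sum_x p(x) = 1$ into a sum of terms as $\sum_x x = \sum_x p(x)\,(x/p(x))$. Then, because $\sum_x p(x)\,(x/p(x)) = \langle x/p(x) \rangle$, we can use Jensen's inequality $\log \langle y \rangle \geq \langle \log y \rangle$ to write $\log \left[ \sum_x p(x)\,(x/p(x)) \right] \geq \sum_x p(x) \log \left[ x/p(x) \right]$.

Because both rating and attribute terms in Eq. (9) contain logarithms of sums, we introduce an auxiliary distribution for each of the terms as follows. For the ratings, we have

$$
\begin{aligned}
\mathcal{L}_R &= \sum_{(i,j) \in R^O} \log \sum_{\alpha\beta} \theta_{i\alpha} \eta_{j\beta}\, p_{\alpha\beta}(r^O_{ij}) \\
&= \sum_{(i,j) \in R^O} \log \sum_{\alpha\beta} \omega_{ij}(\alpha,\beta)\, \frac{\theta_{i\alpha} \eta_{j\beta}\, p_{\alpha\beta}(r^O_{ij})}{\omega_{ij}(\alpha,\beta)} \\
&\geq \sum_{(i,j) \in R^O} \sum_{\alpha\beta} \omega_{ij}(\alpha,\beta) \log \frac{\theta_{i\alpha} \eta_{j\beta}\, p_{\alpha\beta}(r^O_{ij})}{\omega_{ij}(\alpha,\beta)}
\end{aligned}
\tag{A1}
$$

where $\omega_{ij}(\alpha,\beta)$ is the auxiliary distribution.

For the term corresponding to excluding node attributes we have

$$
\begin{aligned}
\mathcal{L}_{A_k} &= \sum_{(i,\ell_k) \in A^O_k} \log \sum_{\alpha} \theta_{i\alpha}\, q^k_{\alpha}(i\ell_k) \\
&= \sum_{(i,\ell_k) \in A^O_k} \log \sum_{\alpha} \sigma^k_{i\ell_k}(\alpha)\, \frac{\theta_{i\alpha}\, q^k_{\alpha}(i\ell_k)}{\sigma^k_{i\ell_k}(\alpha)} \\
&\geq \sum_{(i,\ell_k) \in A^O_k} \sum_{\alpha} \sigma^k_{i\ell_k}(\alpha) \log \frac{\theta_{i\alpha}\, q^k_{\alpha}(i\ell_k)}{\sigma^k_{i\ell_k}(\alpha)}
\end{aligned}
\tag{A2}
$$

where $\sigma^k_{i\ell_k}(\alpha)$ is the auxiliary distribution, and to simplify the notation we have defined $q^k_{\alpha}(i\ell_k) \equiv q^k_{\alpha}\left( (e^O_k)_{i\ell_k} \right)$.

Finally, for the term corresponding to non-excluding node attributes we have

$$
\begin{aligned}
\mathcal{L}_{A_k} &= \sum_{(i,g) \in A^O_k} \log \sum_{\alpha\gamma} \theta_{i\alpha} \zeta_{g\gamma}\, \hat{q}^k_{\alpha\gamma}(ig) \\
&= \sum_{(i,g) \in A^O_k} \log \sum_{\alpha\gamma} \hat{\sigma}^k_{ig}(\alpha,\gamma)\, \frac{\theta_{i\alpha} \zeta_{g\gamma}\, \hat{q}^k_{\alpha\gamma}(ig)}{\hat{\sigma}^k_{ig}(\alpha,\gamma)} \\
&\geq \sum_{(i,g) \in A^O_k} \sum_{\alpha\gamma} \hat{\sigma}^k_{ig}(\alpha,\gamma) \log \frac{\theta_{i\alpha} \zeta_{g\gamma}\, \hat{q}^k_{\alpha\gamma}(ig)}{\hat{\sigma}^k_{ig}(\alpha,\gamma)}
\end{aligned}
\tag{A3}
$$

where $\hat{\sigma}^k_{ig}(\alpha,\gamma)$ is the auxiliary distribution, and to simplify the notation we have defined $\hat{q}^k_{\alpha\gamma}(ig) \equiv \hat{q}^k_{\alpha\gamma}\left( (a^O_k)_{ig} \right)$.

Note that, in Eqs. (A1)-(A3) above, the equality is satisfied when maximizing with respect to the auxiliary distributions. By solving these optimization problems we obtain

$$
\omega_{ij}(\alpha,\beta) = \frac{\theta_{i\alpha} \eta_{j\beta}\, p_{\alpha\beta}(r^O_{ij})}{\sum_{\alpha'\beta'} \theta_{i\alpha'} \eta_{j\beta'}\, p_{\alpha'\beta'}(r^O_{ij})},
\tag{A4}
$$

$$
\sigma^k_{i\ell_k}(\alpha) = \frac{\theta_{i\alpha}\, q^k_{\alpha}(i\ell_k)}{\sum_{\alpha'} \theta_{i\alpha'}\, q^k_{\alpha'}(i\ell_k)},
\tag{A5}
$$

$$
\hat{\sigma}^k_{ig}(\alpha,\gamma) = \frac{\theta_{i\alpha} \zeta_{g\gamma}\, \hat{q}^k_{\alpha\gamma}(ig)}{\sum_{\alpha'\gamma'} \theta_{i\alpha'} \zeta_{g\gamma'}\, \hat{q}^k_{\alpha'\gamma'}(ig)}.
\tag{A6}
$$

Therefore, the auxiliary distributions have the following interpretations: $\omega_{ij}(\alpha,\beta)$ is the contribution of user group $\alpha$ and item group $\beta$ to the probability that user $i$ gives item $j$ a rating $r^O_{ij}$; $\sigma^k_{i\ell_k}(\alpha)$ is the contribution of user group (or item group) $\alpha$ to the probability that user (item) $i$ has attribute type $(e^O_k)_{i\ell_k}$ in the $k$-th excluding attribute; and, finally, $\hat{\sigma}^k_{ig}(\alpha,\gamma)$ is the contribution of groups $\alpha$ and $\gamma$ to the probability that, for the $k$-th non-excluding attribute, the association between node $i$ and attribute $g$ is of type $(a^O_k)_{ig}$.

Using Lagrange multipliers for the normalization constraints, and equating to zero the derivatives of the log-posterior with respect to the model parameters yields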
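$$
\theta_{i\alpha} = \frac{\sum_{j \in \partial_i} \sum_{\beta} \omega_{ij}(\alpha,\beta) + \sum_k \lambda_k\, \sigma^k_{i\ell_k}(\alpha) + \sum_l \lambda_l \sum_{g \in \partial^l_i} \sum_{\gamma} \hat{\sigma}^l_{ig}(\alpha,\gamma)}{d_i + \sum_k \lambda_k \delta^k_i + \sum_l \lambda_l \Delta^l_i}
\tag{A7}
$$

where $\partial^k_i$ is the set of $k$-th attributes associated with user $i$, $d_i$ is the degree of user $i$ in the network of ratings, and $\Delta^l_i = |\partial^l_i|$. Note that the term $\sigma^k_{i\ell_k}(\alpha)$ is equal to zero if user $i$ does not have attribute $\ell_k$, so that $\delta^k_i = 1$ if user $i$ has excluding attribute $\ell_k$ and zero otherwise.

$$
\eta_{j\beta} = \frac{\sum_{i \in \partial_j} \sum_{\alpha} \omega_{ij}(\alpha,\beta) + \sum_k \lambda_k\, \sigma^k_{j\ell_k}(\beta) + \sum_l \lambda_l \sum_{g \in \partial^l_j} \sum_{\gamma} \hat{\sigma}^l_{jg}(\beta,\gamma)}{d_j + \sum_k \lambda_k \delta^k_j + \sum_l \lambda_l \Delta^l_j}
\tag{A8}
$$

where $\partial^k_j$ is the set of $k$-th attributes associated with item $j$, $d_j$ is the degree of item $j$ in the network of ratings, and $\Delta^l_j = |\partial^l_j|$. As before, the term $\sigma^k_{j\ell_k}(\beta)$ is equal to zero if item $j$ does not have attribute $\ell_k$, so that $\delta^k_j = 1$ if item $j$ has excluding attribute $\ell_k$ and zero otherwise.

$$
\zeta_{g\gamma} = \frac{\sum_{i \in \partial^k_g} \sum_{\alpha} \hat{\sigma}^k_{ig}(\alpha,\gamma)}{\Delta^k_g}
\tag{A9}
$$

where $\partial^k_g$ is the set of nodes associated with attribute $g$, and $\Delta^k_g = |\partial^k_g|$. Additionally, we have

$$
p_{\alpha\beta}(r) = \frac{\sum_{(i,j) \in R^O \,|\, r^O_{ij} = r} \omega_{ij}(\alpha,\beta)}{\sum_{(i,j) \in R^O} \omega_{ij}(\alpha,\beta)}
\tag{A10}
$$

$$
q^k_{\alpha}(e) = \frac{\sum_{(i,\ell_k) \in A^O_k \,|\, (e^O_k)_{i\ell_k} = e} \sigma^k_{i\ell_k}(\alpha)}{\sum_{(i,\ell_k) \in A^O_k} \sigma^k_{i\ell_k}(\alpha)}
\tag{A11}
$$

$$
\hat{q}^k_{\alpha\gamma}(a) = \frac{\sum_{(i,g) \in A^O_k \,|\, (a^O_k)_{ig} = a} \hat{\sigma}^k_{ig}(\alpha,\gamma)}{\sum_{(i,g) \in A^O_k} \hat{\sigma}^k_{ig}(\alpha,\gamma)}
\tag{A12}
$$

Appendix B: Expectation-maximization algorithm

To obtain a maximum of the posterior we start by generating random initial conditions for each model parameter $\theta$, $\eta$, $p$, $\zeta$, $q$, $\hat{q}$.

Then we iteratively perform two steps until the model parameters converge:

1. Expectation step: compute the auxiliary functions $\omega_{ij}(\alpha,\beta)$, $\sigma^k_{i\ell_k}(\alpha)$, and $\hat{\sigma}^k_{ig}(\alpha,\gamma)$ from the current values of $\theta$, $\eta$, $p$, $\zeta$, $q$, $\hat{q}$ using Eqs. (A4), (A5) and (A6).

2. Maximization step: compute the new values for the model parameters using the values for the auxiliary functions and Eqs. (A7)-(A12).

Because the posterior landscape is very rugged, to make predictions we perform the EM algorithm 10 times and consider all of the models to estimate the average probability that user $i$ rates item $j$ with rating $r$ (see [31]) as follows:

$$
\langle p(r_{ij} = r \,|\, R^O, A^O_k) \rangle \approx \frac{1}{N} \sum_{n=1}^{N} p_n(r_{ij} = r \,|\, R^O, A^O_k, (\dots))
\tag{B1}
$$

where $(\dots) = \{\theta, \eta, p, \zeta, q, \hat{q}\}$, and $p_n(r_{ij} = r \,|\, R^O, A^O_k, (\dots))$ is the probability that user $i$ rates item $j$ with rating $r$ in run $n$ of the EM algorithm.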
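As an illustration of the procedure in Appendix B, the following is a minimal Python sketch of the EM iteration for the simplest case in which there are no attribute terms (all $\lambda_k = 0$), so that only Eqs. (A4), (A7), (A8) and (A10) are needed, followed by the averaging over runs of Eq. (B1). It is not the authors' implementation: the function names (`em_run`, `predict`, `averaged_prediction`), the array layout, the fixed number of iterations used instead of a convergence test, and the assumption that ratings are coded as integers 0, ..., `n_ratings` - 1 are all choices made here for illustration.

```python
import numpy as np

def em_run(ratings, n_users, n_items, K, L, n_ratings, n_iter=50, seed=0):
    """One EM run for the ratings-only model (assumes all lambda_k = 0).

    ratings: integer array of shape (M, 3) with rows (user i, item j, rating r).
    Returns theta (n_users, K), eta (n_items, L), p (K, L, n_ratings) and the
    log-likelihood evaluated at the start of each iteration.
    """
    rng = np.random.default_rng(seed)
    i, j, r = ratings.T
    # Random, normalized initial conditions for the model parameters.
    theta = rng.random((n_users, K)); theta /= theta.sum(1, keepdims=True)
    eta = rng.random((n_items, L)); eta /= eta.sum(1, keepdims=True)
    p = rng.random((K, L, n_ratings)); p /= p.sum(2, keepdims=True)
    d_i = np.maximum(np.bincount(i, minlength=n_users), 1)  # user degrees
    d_j = np.maximum(np.bincount(j, minlength=n_items), 1)  # item degrees
    history = []
    for _ in range(n_iter):
        # Expectation step, Eq. (A4): omega[m, alpha, beta] for each rating m.
        num = (theta[i][:, :, None] * eta[j][:, None, :]
               * p[:, :, r].transpose(2, 0, 1))
        norm = num.sum(axis=(1, 2), keepdims=True)
        history.append(float(np.log(norm).sum()))
        omega = num / norm
        # Maximization step, Eqs. (A7) and (A8) without attribute terms.
        theta = np.zeros((n_users, K))
        np.add.at(theta, i, omega.sum(axis=2))
        theta /= d_i[:, None]
        eta = np.zeros((n_items, L))
        np.add.at(eta, j, omega.sum(axis=1))
        eta /= d_j[:, None]
        # Maximization step, Eq. (A10).
        p_acc = np.zeros((n_ratings, K, L))
        np.add.at(p_acc, r, omega)
        p = p_acc.transpose(1, 2, 0)
        p = p / p.sum(2, keepdims=True)
    return theta, eta, p, np.array(history)

def predict(theta, eta, p, i, j):
    """Probability of each rating value for the user-item pair (i, j)."""
    return np.einsum("a,b,abr->r", theta[i], eta[j], p)

def averaged_prediction(ratings, n_users, n_items, K, L, n_ratings, i, j,
                        n_runs=10, n_iter=50):
    """Average the prediction over independent EM runs, as in Eq. (B1)."""
    preds = []
    for n in range(n_runs):
        theta, eta, p, _ = em_run(ratings, n_users, n_items, K, L,
                                  n_ratings, n_iter=n_iter, seed=n)
        preds.append(predict(theta, eta, p, i, j))
    return np.mean(preds, axis=0)
```

In this sketch each run differs only in its random seed, which plays the role of the random initial conditions. Attribute terms would enter by adding the $\lambda_k$-weighted sums of Eqs. (A7) and (A8) to the numerators and denominators of the `theta` and `eta` updates.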
[1] D. Liben-Nowell and J. Kleinberg, “The link-prediction problem for social networks,” J. Am. Soc. Inf. Sci. Tec. 58, 1019–1031 (2007).
[2] A. Clauset, C. Moore, and M. E. J. Newman, “Hierarchical structure and the prediction of missing links in networks,” Nature 453, 98–101 (2008).
[3] R. Guimerà and M. Sales-Pardo, “Missing and spurious interactions and the reconstruction of complex networks,” Proc. Natl. Acad. Sci. U.S.A. 106, 22073–22078 (2009).
[4] L. Lü, L. Pan, T. Zhou, Y.-C. Zhang, and H. E. Stanley, “Toward link predictability of complex networks,” Proc. Natl. Acad. Sci. U.S.A. 112, 2325–2330 (2015).
[5] A. Ghasemian, H. Hosseinmardi, A. Galstyan, E. M. Airoldi, and A. Clauset, “Stacking models for nearly optimal link prediction in complex networks,” Proc. Natl. Acad. Sci. U.S.A. 117, 23393–23400 (2020).
[6] R. Guimerà, “One model to rule them all in network science?” Proc. Natl. Acad. Sci. U.S.A. 117, 25195–25197 (2020).
[7] R. Guimerà and M. Sales-Pardo, “A network inference method for large-scale unsupervised identification of novel drug-drug interactions,” PLoS Comput. Biol. 9, e1003374 (2013).
[8] M. Tarrés-Deulofeu, A. Godoy-Lorite, R. Guimerà, and M. Sales-Pardo, “Tensorial and bipartite block models for link prediction in layered networks and temporal networks,” Phys. Rev. E 99, 032307 (2019).
[9] Michael P. Menden, Dennis Wang, Mike J. Mason, Bence Szalai, Krishna C. Bulusu, Yuanfang Guan, Thomas Yu, Jaewoo Kang, Minji Jeon, Russ Wolfinger, Tin Nguyen, Mikhail Zaslavskiy, AstraZeneca-Sanger Drug Combination DREAM Consortium, In Sock Jang, Zara Ghazoui, Mehmet Eren Ahsen, Robert Vogel, Elias Chaibub Neto, Thea Norman, Eric K. Y. Tang, Mathew J. Garnett, Giovanni Y. Di Veroli, Stephen Fawell, Gustavo Stolovitzky, Justin Guinney, Jonathan R. Dry, and Julio Saez-Rodriguez, “Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen,” Nat. Comm. 10, 2674 (2019).
[10] R. Guimerà and M. Sales-Pardo, “Justice blocks and predictability of U.S. Supreme Court votes,” PLoS ONE 6, e27188 (2011).
[11] R. Guimerà, A. Llorente, E. Moro, and M. Sales-Pardo, “Predicting human preferences using the block structure of complex social networks,” PLoS ONE 7, e44620 (2012).
[12] A. Godoy-Lorite, R. Guimerà, C. Moore, and M. Sales-Pardo, “Accurate and scalable social recommendation using mixed-membership stochastic block models,” Proc. Natl. Acad. Sci. U.S.A. 113, 14207–14212 (2016).
[13] S. Cobo-López, A. Godoy-Lorite, J. Duch, M. Sales-Pardo, and R. Guimerà, “Optimal prediction of decisions and model selection in social dilemmas using block models,” EPJ Data Sci. 7, 48 (2018).
[14] M. Timme, “Revealing Network Connectivity from Response Dynamics,” Phys. Rev. Lett. 98, 224101 (2007).
[15] T. P. Peixoto, “Network reconstruction and community detection from dynamics,” Phys. Rev. Lett. 123, 128301 (2019).
[16] C. Tallberg, “A Bayesian approach to modeling stochastic blockstructures with covariates,” J. Math. Sociol. 29, 1–23 (2004).
[17] J. Yang, J. McAuley, and J. Leskovec, “Community detection in networks with node attributes,” in 2013 IEEE 13th International Conference on Data Mining (2013) pp. 1151–1156.
[18] D. Hric, T. P. Peixoto, and S. Fortunato, “Network structure, metadata, and the prediction of missing nodes and annotations,” Phys. Rev. X 6, 031038 (2016).
[19] M. E. J. Newman and A. Clauset, “Structure and inference in annotated networks,” Nat. Comm. 7, 11863 (2016).
[20] A. White and T. B. Murphy, “Mixed-membership of experts stochastic blockmodel,” Netw. Sci. 4, 48–80 (2016).
[21] L. Peel, D. B. Larremore, and A. Clauset, “The ground truth about metadata and community detection in networks,” Sci. Adv. 3 (2017), 10.1126/sciadv.1602548.
[22] N. Stanley, T. Bonacci, and R. Kwitt, “Stochastic block models with multiple continuous attributes,” Appl. Netw. Sci. 4, 54 (2019).
[23] M. Contisciani, E. A. Power, and C. De Bacco, “Community detection with node attributes in multilayer networks,” Sci. Rep. 10, 1–16 (2020).
[24] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer 42, 30–37 (2009).
[25] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Soc. Networks 5, 109–137 (1983).
[26] K. Nowicki and T. A. B. Snijders, “Estimation and prediction for stochastic blockstructures,” J. Am. Stat. Assoc. 96, 1077–1087 (2001).
[27] T.-C. Yen and D. B. Larremore, “Community detection in bipartite networks with stochastic block models,” Phys. Rev. E 102, 032309 (2020).
[28] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, “Mixed membership stochastic blockmodels,” J. Mach. Learn. Res. 9, 1981–2014 (2008).
[29] F. M. Harper and J. A. Konstan, “The Movielens datasets: History and context,” ACM Trans. Interact. Intell. Syst. 5 (2015).
[30] A. S. Waugh, L. Pei, J. Fowler, P. Mucha, and M. A. Porter, “Party polarization in Congress: A network science approach,” arXiv: Physics and Society (2009).
[31] A. Godoy-Lorite, R. Guimerà, C. Moore, and M. Sales-Pardo, “Accurate and scalable social recommendation using mixed-membership stochastic block models,” Proc. Natl. Acad. Sci. U.S.A. 113, 14207–14212 (2016).