Hinge-loss Markov Random Fields: Convex Inference for Structured Prediction
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Hinge-loss Markov Random Fields:
Convex Inference for Structured Prediction
Stephen H. Bach Bert Huang Ben London Lise Getoor
Computer Science Dept.
University of Maryland
College Park, MD, 20742, USA
Abstract loss Markov random fields (HL-MRFs) [3], these mod-
els are analogous to discrete Markov random fields,
Graphical models for structured domains are except that random variables are continuous valued
powerful tools, but the computational com- in the unit interval [0,1], and potentials are linear or
plexities of combinatorial prediction spaces squared hinge-loss functions.
can force restrictions on models, or re- In this work, we demonstrate that HL-MRFs are
quire approximate inference in order to be powerful tools for structured prediction by producing
tractable. Instead of working in a combina- state-of-the-art performance in a number of domains.
torial space, we use hinge-loss Markov ran- We are the first to leverage some of the most pow-
dom fields (HL-MRFs), an expressive class erful features of HL-MRFs, such as squared poten-
of graphical models with log-concave density tials, which we show produce better results on mul-
functions over continuous variables, which tiple tasks. We show that HL-MRFs are well-suited to
can represent confidences in discrete predic- structured prediction for the following reasons. They
tions. This paper demonstrates that HL- are expressive, interpretable, and easily defined using
MRFs are general tools for fast and accu- the modeling language probabilistic soft logic (PSL)
rate structured prediction. We introduce the [6, 13]. Further, continuous variables are useful both
first inference algorithm that is both scalable for modeling continuous data as well as for expressing
and applicable to the full class of HL-MRFs, confidences in discrete predictions. Confidences are
and show how to train HL-MRFs with several desirable for the same reason that practitioners often
learning algorithms. Our experiments show prefer marginal probabilities to the single most prob-
that HL-MRFs match or surpass the predic- able discrete prediction. Finally, HL-MRFs have log-
tive performance of state-of-the-art methods, concave density functions, so finding an exact most
including discrete models, in four application probable explanation (MPE) for a given input is a
domains. convex optimization, and therefore exactly solvable in
polynomial time.
Our specific contributions include the following. First,
1 INTRODUCTION we introduce a new, fast algorithm for MPE inference
in HL-MRFs, which is the first to be both scalable
The study of probabilistic modeling in structured
and applicable to the full class of HL-MRFs. Sec-
and relational domains primarily focuses on predict-
ond, we show how to train HL-MRFs with several
ing discrete variables [12, 19, 24]. However, except
learning algorithms. Third, we show empirically that
for some isolated cases, the combinatorial nature of
these advances enable HL-MRFs to tackle a diverse set
discrete, structured prediction spaces requires conces-
of relational and structured prediction tasks, provid-
sions: most notably, for inference algorithms to be
ing state-of-the-art performance on collective classifi-
tractable, they must be relaxed or approximate (e.g.,
cation, social-trust prediction, collaborative filtering,
[19, 22, 25]), or the model’s structure must be re-
and image reconstruction. In particular, we show that
stricted (e.g., [2, 8]). Broecheler et al. [6] introduced
HL-MRFs can outperform their discrete counterparts,
a class of models for continuous variables that has the
as well as other leading methods.
potential to combine fast and exact inference with the
expressivity of discrete graphical models, but this po-
tential has not been well explored. Now called hinge-1.1 RELATED WORK Definition 1. Let Y = (Y1 , . . . , Yn ) be a vector of
n variables and X = (X1 , . . . , Xn0 ) a vector of n0
0
Probabilistic soft logic (PSL) [6, 13] is a declarative variables with joint domain D = [0, 1]n+n . Let φ =
language for defining templated HL-MRFs. Its de- (φ1 , . . . , φm ) be m continuous potentials of the form
velopment was partially motivated by the need for
rich models of continuous similarity values, for use φj (Y, X) = [max {`j (Y, X), 0}]pj
in tasks such as entity resolution, collective classifi-
cation, and ontology alignment. PSL is in a family of where `j is a linear function of Y and X and pj ∈
systems for defining templated, relational probabilistic {1, 2}. Let C = (C1 , . . . , Cr ) be linear constraint func-
models that includes, for instance, Markov logic net- tions associated with index sets denoting equality con-
works [19], relational dependency networks [17], and straints E and inequality constraints I, which define
relational Markov networks [23]. In our experiments, the feasible set
we compare against Markov logic networks, which use
Ck (Y, X) = 0, ∀k ∈ E
a first-order syntax similar to PSL’s to build discrete D̃ = Y, X ∈ D .
Ck (Y, X) ≥ 0, ∀k ∈ I
probabilistic models.
MPE inference algorithms for HL-MRFs solve a con- For Y, X ∈ D̃, given a vector of nonnegative free
strained, convex optimization. A standard approach parameters, i.e., weights, λ = (λ1 , . . . , λm ), a con-
for general, constrained, convex optimization is to use strained hinge-loss energy function fλ is defined as
an interior-point method, which Broecheler et al. [6] m
use. While theoretically efficient, the practical run- X
fλ (Y, X) = λj φj (Y, X) .
ning time of interior-point optimizers quickly becomes
j=1
cumbersome for large problems. For discrete graph-
ical models, recent advances use consensus optimiza- Definition 2. A hinge-loss Markov random field P
tion to obtain fast, approximate MPE inference algo- over random variables Y and conditioned on random
rithms [5, 15, 16]. Bach et al. [3] recently developed variables X is a probability density defined as follows:
an analogous algorithm for exact MPE inference in if Y, X ∈
/ D̃, then P (Y|X) = 0; if Y, X ∈ D̃, then
HL-MRFs that produced a significant improvement in
running time over interior-point methods, though it 1
P (Y|X) = exp [−fλ (Y, X)] ,
was limited to pairwise potentials and constraints, and Z(λ)
cost considerably more computation to optimize over R
squared potentials. where Z(λ) = Y
exp [−fλ (Y, X)].
The learning methods we adapt for HL-MRFs are stan- Thus, MPE inference is equivalent to finding the min-
dard approaches for learning parameters of probabilis- imizer of the convex energy fλ .
tic models. In particular, our adaptations are anal-
The potentials and weights can be grouped together
ogous to previous learning algorithms for relational
into templates, which can be used to define general
and structured models using approximate maximum-
classes of HL-MRFs that are parameterized by the in-
likelihood or maximum-pseudolikelihood estimation
put data. Let T = (t1 , . . . , ts ) denote a vector of tem-
[6, 14, 17, 19] and large-margin estimation [11, 12, 24].
plates with associated weights Λ = (Λ1 , . . . , Λs ). We
partition the potentials
P by their associated templates
2 HINGE-LOSS MARKOV and let Φq (Y, X) = j∈tq φj (Y, X) for all tq ∈ T . In
RANDOM FIELDS the HL-MRF, the weight of the j’th hinge-loss poten-
tial is set to the weight of the template from which it
Hinge-loss Markov random fields (HL-MRFs) are a was derived, i.e., λj = Λq , for each j ∈ tq .
general class of conditional, continuous probabilistic Probabilistic soft logic [6, 13] provides a natural in-
models [3]. HL-MRFs are log-linear models whose fea- terface to represent hinge-loss potential templates us-
tures are hinge-loss functions of the variable states. ing logical rules. In particular, a logical conjunc-
Through constructions based on soft logic, hinge-loss tion of Boolean variables X ∧ Y can be general-
potentials can be used to model generalizations of log- ized to continuous variables using the hinge function
ical conjunction and implication. HL-MRFs can be max{X + Y − 1, 0}, which is known as the Lukasiewicz
defined using the modeling language probabilistic soft t-norm. Disjunction X ∨Y is relaxed to min{X +Y, 1},
logic (PSL) [6, 13], making these powerful models in- and negation ¬X to 1 − X. PSL allows modelers to
terpretable, flexible, and expressive. In this section, we design rules that, given data, ground out substitutions
formally present constrained hinge-loss energy func- for logical terms. The groundings of a template define
tions, HL-MRFs, and briefly review PSL. hinge-loss potentials that share the same weight andhave the form one minus the truth value of the ground alternating-direction method of multipliers (ADMM)
rule. We defer to Broecheler et al. [6] and Kimmig [5]. The algorithm works by creating local copies of the
et al. [13] for details on PSL. variables in each potential and constraint, constrain-
ing them to be equal to the original variables, and re-
To further demonstrate this templating, consider the
laxing those equality constraints to make independent
task of predicting who trusts whom in a social network.
subproblems. By solving the subproblems repeatedly
Let the network contain three individuals: A, B, and
and averaging the results, the algorithm reaches a con-
C. We can design an HL-MRF to include potentials
sensus on the best values of the original variables, also
that encode the belief that trust is transitive (which
called the consensus variables. This procedure is guar-
is a rule we use in our experiments). Let the variable
anteed to converge to the global minimizer of fλ . See
YA,B represent how much A trusts B, and similarly so
[3] and [5] for more details on consensus optimization
for YB,C and YA,C . Then the potential
and ADMM.
φ(Y, X) = [max{YA,B + YB,C − YA,C − 1, 0}]p
This previous consensus-optimization approach to
is equivalent to one minus the truth value of MPE inference works well for linear potentials with at
the Boolean formula YA,B ∧ YB,C → YA,C when most two unobserved variables, and empirical evidence
YA,B , YB,C , YA,C ∈ {0, 1}. When they are allowed to suggests it scales linearly with the size of the problem
take on their full range [0, 1], the potential is a convex [3]. However, it is restricted to pairwise potentials and
relaxation of the implication. An HL-MRF with this constraints, and requires an interior-point method as
potential function assigns higher probability to vari- a subroutine to solve subproblems induced by squared
able states that satisfy the logical implication above, potentials. Because of the embedded interior-point
which can occur to varying degrees in the continuous method, its running time can increase roughly 100-fold
domain. Given a social network with more than these with squared potentials [3].
three individuals, PSL can ground out possible sub-
We improve the algorithm of Bach et al. [3] by refor-
stitutions for the roles of A, B, and C to generate
mulating the optimization to enforce Y ∈ [0, 1]n only
potential functions for each substitution, thus defining
on the consensus variables, not the local copies. This
the full, ground HL-MRF.
form of consensus optimization is described in greater
HL-MRFs support a few additional components useful detail by Boyd et al. [5]. The result is that, in our algo-
for modeling. The constraints in Definition 1 allow the rithm, the potentials and constraints are not restricted
encoding of functional modeling requirements, which to a certain number of unknowns, and the subproblems
is useful, e.g., when variables correspond to mutually can all be solved quickly using simple linear algebra.
exclusive labels, and thus should sum to one. The
Algorithm 1 gives pseudocode for our new algorithm.
exponent parameter pj allows flexibility in the shape
It starts by initializing local copies of the variables that
of the hinge, affecting the sharpness of the penalty for
appear in each potential and constraint, along with
violating the logical implication. Setting pj to 1 penal-
a corresponding Lagrange multiplier for each copy.
izes violation linearly with the amount the implication
Then, until convergence, it iteratively updates La-
is unsatisfied, while setting pj to 2 penalizes small vi-
grange multipliers and solves suproblems induced by
olations much less. In effect, some linear potentials
the HL-MRF’s potentials and constraints. If the sub-
overrule others, while the influences of squared poten-
problem is induced by a potential, it sets the local vari-
tials are averaged together.
able copies to a balance between the minimizer of the
potential and the emerging consensus. Eventually the
3 MPE INFERENCE Lagrange multipliers will enforce agreement between
the local copies and the consensus. If instead the sub-
MPE inference for HL-MRFs requires finding a feasible problem is induced by a constraint in the HL-MRF,
assignment that minimizes fλ . Performing MPE infer- the algorithm projects the consensus variables’ current
ence quickly is crucial, especially because weight learn- values to that constraint’s feasible region. The con-
ing often requires performing inference many times sensus variables are updated at each iteration based
with different weights (as we discuss in Section 4). on the values of their local copies and corresponding
Here, HL-MRFs have a distinct advantage over gen- Lagrange multipliers, and clipped to [0,1].
eral discrete models, since minimizing fλ is a convex
optimization rather than a combinatorial one. In this We take the same basic approach as Bach et al. [3] to
section, we detail a new, faster MPE inference algo- solve the subproblems. We first try to find a minimizer
rithm for HL-MRFs. on either side of the hinge (where `j (Y, X) is either
positive or negative) before projecting onto the plane
Bach et al. [3] showed how to minimize fλ with defined by `j (Y, X) = 0. Without the constraints on
a consensus-optimization algorithm, based on theAlgorithm 1 MPE Inference for HL-MRFs performs approximate maximum-likelihood estimation
Input: HL-MRF(Y, X, φ, λ, C, E, I), ρ > 0 using MPE inference to approximate the gradient of
the log-likelihood. The second method maximizes the
Initialize yj as copies of the variables Yj that appear
pseudolikelihood. The third method finds a large-
in φj , j = 1, . . . , m
margin solution, preferring weights that discriminate
Initialize yk+m as copies of the variables Yk+m
the ground truth from other states. We describe below
that appear in Ck , k = 1, . . . , r
how to apply these learning strategies to HL-MRFs.
Initialize Lagrange multipliers αi corresponding to
variable copies yi , i = 1, . . . , m + r
4.1 MAXIMUM-LIKELIHOOD
while not converged do ESTIMATION
for j = 1, . . . , m do
αj ← αj + ρ(yj − Yj ) The canonical approach for learning parameters Λ is
yj ← Yj − αj /ρ to maximize the log-likelihood of training data. The
if `j (yj , X) > 0 then partial derivative of the log-likelihood with respect to
λj [`j (yj , X)]pj
a parameter Λq is
yj ← arg minyj
+ ρ2 kyj − Yj + ρ1 αj k22
∂ log p(Y|X)
if `j (yj , X) < 0 then = EΛ [Φq (Y, X)] − Φq (Y, X) ,
yj ← Proj`j =0 (Yj ) ∂Λq
end if
where EΛ is the expectation under the distribution de-
end if
fined by Λ. The voted perceptron algorithm [7] opti-
end for
mizes Λ by taking steps of fixed length in the direc-
for k = 1, . . . , r do tion of the gradient, then averaging the points after
αk+m ← αk+m + ρ(yk+m − Yk+m ) all steps. Any step that is outside the feasible region
yk+m ← ProjCk (Yk+m ) is projected back before continuing. For a smoother
end for ascent, it is often helpful to divide the q-th component
for i = 1, . . . , n do of the gradient by the number of groundings |tq | of the
q’th template [14], which we do in our experiments.
1
P αc
Yi ← |copies(Y i )| yc ∈copies(Yi ) yc + ρ Computing the expectation is intractable, so we use
Clip Yi to [0,1] a common approximation: the values of the potential
end for functions at the most probable setting of Y with the
end while current parameters [6].
4.2 MAXIMUM-PSEUDOLIKELIHOOD
the local variable copies, finding these minimizers and ESTIMATION
projections is much simpler. Solving for the constraint
subproblems is now also simpler, requiring just a pro- Since exact maximum likelihood estimation is in-
jection onto a hyperplane or a halfspace. tractable in general, we can instead perform
Our algorithm retains all of the benefits of the original maximum-pseudolikelihood estimation (MPLE) [4],
MPE inference algorithm while removing all restric- which maximizes the likelihood of each variable condi-
tions on the numbers of unknowns in the potentials tioned on all other variables, i.e.,
and constraints, making MPE inference fast with both n
Y
linear and squared potentials. Another advantage of P ∗ (Y|X) = P ∗ (Yi |MB(Yi ))
consensus optimization for MPE inference is that it is i=1
easy to warm start. Warm starting provides signifi- Yn
1
exp −fλi (Yi , Y, X) ;
cant efficiency gains when inference is repeated on the =
Z(λ, Yi )
same HL-MRF with small changes in the weights, as i=1
Z
often occurs during weight learning. i
Z(λ, Yi ) = exp −fλ (Yi , Y, X) ;
Y
i
4 WEIGHT LEARNING
X
fλi (Yi , Y, X) =
λj φj {Yi ∪ Y\i }, X .
j:i∈φj
In this section, we present three weight learning meth-
ods for HL-MRFs, each with a different objective func- Here, i ∈ φj means that Yi is involved in φj , and
tion, two of which are new for learning HL-MRFs. MB(Yi ) denotes the Markov blanket of Yi —that is,
The first method, introduced by Broecheler et al. [6], the set of variables that co-occur with Yi in any po-tential function. The partial derivative of the log- objective for a large-margin solution
pseudolikelihood with respect to Λq is
1
min ||Λ||2 + Cξ
2
n Λ≥0
∂ log P ∗ (Y |X) X X
= EYi |MB φj (Y, X) s.t. Λ> (Φ(Y, X) − Φ(Ỹ, X)) ≤ −L(Y, Ỹ) + ξ, ∀Y,
∂Λq i=1 j∈tq :i∈φj
where Φ(Y, X) = (Φ1 (Y, X), . . . , Φs (Y, X)). This
− Φj (Y, X).
formulation is analogous to the margin-rescaling ap-
Computing the pseudolikelihood gradient does not re- proach by Joachims et al. [12]. Though such a struc-
quire inference and takes time linear in the size of tured objective is natural and intuitive, its number
Y. However, the integral in the above expectation of constraints is the cardinality of the output space,
does not readily admit a closed-form antiderivative, which here is infinite. Following their approach, we
so we approximate the expectation. When a variable optimize subject to the infinite constraint set using
in unconstrained, the domain of integration is a one- a cutting-plane algorithm: we greedily grow a set K
dimensional interval on the real number line, so Monte of constraints by iteratively adding the worst-violated
Carlo integration quickly converges to an accurate es- constrain given by a separation oracle, then updating
timate of the expectation. Λ subject to the current constraints. The goal of the
cutting-plane approach is to efficiently find the set of
We can also apply MPLE when the constraints are not active constraints at the solution for the full objec-
too interdependent. For example, for linear equality tive, without having to enumerate the infinite inactive
constraints over disjoint groups of variables (e.g., vari- constraints. The worst-violated constraint is
able sets that must sum to 1.0), we can block-sample
the constrained variables by sampling uniformly from arg min Λ> Φ(Ỹ, X) − L(Y, Ỹ).
a simplex. These types of constraints are often used Ỹ
to represent mutual exclusivity of classification labels. The separation oracle performs loss-augmented in-
We can compute accurate estimates quickly because ference by adding additional loss-augmenting poten-
these blocks are typically low-dimensional. tials to the HL-MRF. For ground truth in {0, 1},
these loss-augmenting potentials are also examples of
4.3 LARGE-MARGIN ESTIMATION hinge-losses, and thus adding them simply creates an
augmented HL-MRF. The worst-violated constraint
A different approach to learning drops the probabilistic is then computed as standard inference on the loss-
interpretation of the model and views HL-MRF infer- augmented HL-MRF. However, ground-truth variables
ence as a prediction function. Large-margin estima- in the interior (0, 1) cause any distance-based loss to be
tion (LME) shifts the goal of learning from producing concave, which require the separation oracle to solve
accurate probabilistic models to instead producing ac- a non-convex objective. For interior ground truth val-
curate MPE predictions. The learning task is then to ues, we use the difference of convex functions algo-
find the weights Λ that provide high-accuracy struc- rithm [1] to find a local optimum. Since the concave
tured predictions. We describe in this section a large- portion of the loss-augmented inference objective piv-
margin method based on the cutting-plane approach ots around the ground truth value, the subgradients
for structural support vector machines (SVMs) [12]. are 1 or −1, depending on whether the current value
The intuition behind large-margin structured predic- is greater than the ground truth. We simply choose an
tion is that the ground-truth state should have energy initial direction for interior labels by rounding, and flip
lower than any alternate state by a large margin. In the direction of the subgradients for variables whose
our setting, the output space is continuous, so we pa- solution states are not in the interval corresponding to
rameterize this margin criterion with a continuous loss the subgradient direction until convergence.
function. For any valid output state Ỹ, a large-margin Given a set K of constraints, we solve the SVM objec-
solution should satisfy: tive as in the primal form minΛ≥0 21 ||Λ||2 + Cξ s.t. K.
We then iteratively invoke the separation oracle to find
fλ (Y, X) ≤ fλ (Ỹ, X) − L(Y, Ỹ), ∀Ỹ,
the worst-violated constraint. If this new constraint is
where the decomposable loss function L(Y, Ỹ) = not violated, or its violation is within numerical toler-
P ance, we have found the max-margin solution. Other-
i L(Yi , Ỹi ) measures the disagreement between a
wise, we add the new constraint to K, and repeat.
state Ỹ and the training label state Y. We define
L as the `1 distance. Since we do not expect all prob- One fact of note is that the large-margin criterion
lems to be perfectly separable, we relax this constraint always requires a little slack for squared HL-MRFs.
with a penalized slack ξ. We obtain a convex learning Since the squared hinge potential is quadratic and theloss is linear, there always exists a small enough dis-
Table 1: Average accuracy of classification by HL-
tance from the ground truth such that an absolute (i.e.,
MRFs and discrete MRFs. Scores statistically equiva-
linear) distance is greater than the squared distance.
lent to the best scoring method are typed in bold.
In these cases, the slack parameter trades off between
the peakedness of the learned quadratic energy func- Citeseer Cora
tion and the margin criterion.
HL-MRF-Q (MLE) 0.729 0.816
HL-MRF-Q (MPLE) 0.729 0.818
5 EXPERIMENTS HL-MRF-Q (LME) 0.683 0.789
HL-MRF-L (MLE) 0.724 0.802
To demonstrate the flexibility and effectiveness of HL- HL-MRF-L (MPLE) 0.729 0.808
MRFs, we test them on four diverse learning tasks: HL-MRF-L (LME) 0.695 0.789
collective classification, social-trust prediction, prefer-
ence prediction, and image reconstruction. 1 Each of MRF (MLE) 0.686 0.756
these experiments represents a problem domain that MRF (MPLE) 0.715 0.797
is best solved with relational learning approaches be- MRF (LME) 0.687 0.783
cause structure is a critical component of their prob-
lems. The experiments show that HL-MRFs perform
as well as or better than state-of-the-art approaches.
5.1 COLLECTIVE CLASSIFICATION
For these diverse tasks, we compare against a number
of competing methods. For collective classification and When classifying documents, links between those
social-trust prediction, we compare HL-MRFs to dis- documents—such as hyperlinks, citations, or co-
crete Markov random fields (MRFs). We construct authorship—provide extra signal beyond the local fea-
them with Markov logic networks (MLNs) [19], which tures of individual documents. Collectively predicting
template discrete MRFs using logical rules similarly to document classes with these links tends to improve
PSL for HL-MRFs. We perform inference in discrete accuracy [21]. We classify documents in citation net-
MRFs using 2500 rounds of the sampling algorithm works using data from the Cora and Citeseer scientific
MC-Sat (500 of which are burn in), and we find ap- paper repositories. The Cora data set contains 2,708
proximate MPE states during MLE learning using the papers in seven categories, and 5,429 directed citation
search algorithm MaxWalkSat [19]. For collaborative links. The Citeseer data set contains 3,312 papers in
filtering, a task that is inherently continuous and non- six categories, and 4,591 directed citation links.
trivial to encode in discrete logic, we compare against
Bayesian probabilistic matrix factorization [20]. Fi- The prediction task is, given a set of seed documents
nally, for image reconstruction, we run the same exper- whose labels are observed, to infer the remaining doc-
imental setup as Poon and Domingos [18] and compare ument classes by propagating the seed information
against the results they report, which include tests us- through the network. For each of 20 runs, we split the
ing sum product networks, deep belief networks, and data sets 50/50 into training and testing partitions,
deep Boltzmann machines. and seed half of each set. To predict discrete cate-
gories with HL-MRFs we predict the category with
When appropriate, we evaluate statistical significance the highest predicted value.
using a paired t-test with rejection threshold 0.01. We
omit variance statistics to save space and only report We compare HL-MRFs to discrete MRFs on this task.
the average and statistical significance. We describe We construct both using the same logical rules, which
the HL-MRFs used for our experiments using the PSL simply encode the tendency for a class to propagate
rules that define them. To investigate the differences across citations. For each category Ci , we have two
between linear and squared potentials we use both in separate rules for each direction of citation:
our experiments. HL-MRF-L refers to a model with
all linear potentials and HL-MRF-Q to one with all Label(A, Ci ) ∧ Cites(A, B) ⇒ Label(B, Ci ),
squared potentials. When training with MLE and
Label(A, Ci ) ∧ Cites(B, A) ⇒ Label(B, Ci ).
MPLE, we use 100 steps of voted perceptron and a
step size of 1.0 (unless otherwise noted), and for LME
we set C = 0.1. We experimented with various set- Table 1 lists the results of this experiment. HL-MRFs
tings, but the scores of HL-MRFs and discrete MRFs are the most accurate predictors on both data sets. We
were not sensitive to changes. also note that both variants of HL-MRFs are much
faster than discrete MRFs. See Table 3 for average
1
All code is available at http://psl.umiacs.umd.edu. inference times on five folds.Table 2: Average area under ROC and precision-recall Table 3: Average inference times (reported in seconds)
curves of social-trust prediction by HL-MRFs and dis- of single-threaded HL-MRFs and discrete MRFs.
crete MRFs. Scores statistically equivalent to the best
scoring method by metric are typed in bold. Citeseer Cora Epinions
HL-MRF-Q 0.42 0.70 0.32
ROC P-R (+) P-R (-)
HL-MRF-L 0.46 0.50 0.28
HL-MRF-Q (MLE) 0.822 0.978 0.452 MRF 110.96 184.32 212.36
HL-MRF-Q (MPLE) 0.832 0.979 0.482
HL-MRF-Q (LME) 0.814 0.976 0.462
HL-MRF-L (MLE) 0.765 0.965 0.357 them using various learning algorithms. We compute
HL-MRF-L (MPLE) 0.757 0.963 0.333 three metrics: the area under the receiver operating
HL-MRF-L (LME) 0.783 0.967 0.453 characteristic (ROC) curve, and the areas under the
MRF (MLE) 0.655 0.942 0.270 precision-recall curves for positive trust and negative
MRF (MPLE) 0.725 0.963 0.298 trust. On all three metrics, HL-MRFs with squared
MRF (LME) 0.795 0.973 0.441 potentials score significantly higher. The differences
among the learning methods for squared HL-MRFs are
insignificant, but the differences among the models is
5.2 SOCIAL-TRUST PREDICTION statistically significant for the ROC metric. For area
under the precision-recall curve for positive trust, dis-
An emerging problem in the analysis of online social crete MRFs trained with LME are statistically tied
networks is the task of inferring the level of trust be- with the best score, and both HL-MRF-L and discrete
tween individuals. Predicting the strength of trust MRFs trained with LME are statistically tied with the
relationships can provide useful information for viral best area under the precision-recall curve for negative
marketing, recommendation engines, and internet se- trust. The results are listed in Table 2.
curity. HL-MRFs with linear potentials have recently Though the random fold splits are not the same, using
been applied by Huang et al. [10] to this task, show- the same experimental setup, Huang et al. [10] also
ing superior results with models based on sociologi- scored the precision-recall area for negative trust of
cal theory. We reproduce their experimental setup us- standard trust prediction algorithms EigenTrust and
ing their sample of the signed Epinions trust network, TidalTrust, which scored 0.131 and 0.130, respectively.
in which users indicate whether they trust or distrust The logical models based on structural balance that
other users. We perform eight-fold cross-validation. In we run here are significantly more accurate, and HL-
each fold, the prediction algorithm observes the entire MRFs more than discrete MRFs.
unsigned social network and all but 1/8 of the trust
ratings. We measure prediction accuracy on the held- Table 3 lists average inference times on five folds of
out 1/8. The sampled network contains 2,000 users, three prediction tasks: Cora, Citeseer, and Epinions.
with 8,675 signed links. Of these links, 7,974 are pos- We implemented each method in Java. Both HL-
itive and only 701 are negative. MRF-Q and HL-MRF-L are much faster than discrete
MRFs. This illustrates an important difference be-
We use a model based on the social theory of struc- tween performing structured prediction via convex in-
tural balance, which suggests that social structures are ference versus sampling in a discrete prediction space:
governed by a system that prefers triangles that are using our MPE inference algorithm is much faster.
considered balanced. Balanced triangles have an odd
number of positive trust relationships; thus, consider-
ing all possible directions of links that form a triad of 5.3 PREFERENCE PREDICTION
users, there are sixteen logical implications of the form
Preference prediction is the task of inferring user at-
Trusts(A, B) ∧ Trusts(B, C) ⇒ Trusts(A, C). titudes (often quantified by ratings) toward a set of
items. This problem is naturally structured, since a
Huang et al. [10] list all sixteen of these rules, a reci-
user’s preferences are often interdependent, as are an
procity rule, and a prior in their Balance-Recip model,
item’s ratings. Collaborative filtering is the task of
which we omit to save space.
predicting unknown ratings using only a subset of ob-
Since we expect some of these structural implications served ratings. Methods for this task range from sim-
to be more or less accurate, learning weights for these ple nearest-neighbor classifiers to complex latent fac-
rules provides better models. Again, we use these rules tor models. To illustrate the versatility of HL-MRFs,
to define HL-MRFs and discrete MRFs, and we train we design a simple, interpretable collaborative filtering5.4 IMAGE RECONSTRUCTION
Table 4: Normalized mean squared/absolute errors
(NMSE/NMAE) for preference prediction using the Digital image reconstruction requires models that un-
Jester dataset. The lowest errors are typed in bold. derstand how pixels relate to each other, such that
when some pixels are unobserved, the model can in-
NMSE NMAE
fer their values from parts of the image that are ob-
HL-MRF-Q (MLE) 0.0554 0.1974 served. We construct pixel-grid HL-MRFs for image
HL-MRF-Q (MPLE) 0.0549 0.1953 reconstruction. We test these models using the exper-
HL-MRF-Q (LME) 0.0738 0.2297 imental setup of Poon and Domingos [18]: we recon-
HL-MRF-L (MLE) 0.0578 0.2021 struct images from the Olivetti face data set and the
HL-MRF-L (MPLE) 0.0535 0.1885 Caltech101 face category. The Olivetti data set con-
HL-MRF-L (LME) 0.0544 0.1875 tains 400 images, 64 pixels wide and tall, and the Cal-
tech101 face category contains 435 examples of faces,
BPMF 0.0501 0.1832 which we crop to the center 64 by 64 patch, as was
done by Poon and Domingos [18]. Following their ex-
perimental setup, we hold out the last fifty images and
predict either the left half of the image or the bottom
half.
model for predicting humor preferences. We test this
model on the Jester dataset, a repository of ratings The HL-MRFs in this experiment are much more com-
from 24,983 users on a set of 100 jokes [9]. Each joke plex than the ones in our other experiments because
is rated on a scale of [−10, +10], which we normalize we allow each pixel to have its own weight for the fol-
to [0, 1]. We sample a random 2,000 users from the lowing rules, which encode agreement or disagreement
set of those who rated all 100 jokes, which we then between neighboring pixels:
split into 1,000 train and 1,000 test users. From each
train and test matrix, we sample a random 50% to use Bright(Pij , I) ∧ North(Pij , Q) ⇒ Bright(Q, I),
as the observed features X; the remaining ratings are Bright(Pij , I) ∧ North(Pij , Q) ⇒ ¬Bright(Q, I),
treated as the variables Y. ¬Bright(Pij , I) ∧ North(Pij , Q) ⇒ Bright(Q, I),
Our HL-MRF model uses an item-item similarity rule: ¬Bright(Pij , I) ∧ North(Pij , Q) ⇒ ¬Bright(Q, I),
where Bright(Pij , I) is the normalized brightness of
pixel Pij in image I, and North(Pij , Q) indicates that
SimRate(J1 , J2 ) ∧ Likes(U, J1 ) ⇒ Likes(U, J2 )
Q is the north neighbor of Pij . We similarly include
analogous rules for the south, east, and west neighbors,
as well as the pixels mirrored across the horizontal and
where J1 , J2 are jokes and U is a user; the pred-
vertical axes. This setup results in up to 24 rules per
icate Likes indicates the degree of preference (i.e.,
pixel, which, in a 64 by 64 image, produces 80,896
rating value); and SimRate measures the mean-
weighted potential templates.
adjusted cosine similarity between the observed rat-
ings of two jokes. We also include rules to enforce that We train these HL-MRFs using MPE-approximate
Likes(U, J) concentrates around the observed average maximum likelihood with a 5.0 step size on the first
rating of user U and item J, and the global average. 200 images of each data set and test on the last fifty.
For training, we maximize the data log-likelihood of
We compare our HL-MRF model to a current state-
uniformly random held-out pixels for each training
of-the-art latent factors model, Bayesian probabilis-
image, allowing for generalization throughout the im-
tic matrix factorization (BPMF) [20]. BPMF is a
age. Table 5 lists our results and others reported by
fully Bayesian treatment and, as such, is considered
Poon and Domingos [18]. HL-MRFs produce the best
“parameter-free”; the only parameter that must be
mean squared error on the left- and bottom-half set-
specified is the rank of the decomposition.For our ex-
tings for the Caltech101 set and the left-half setting in
periments, we use Xiong et al.’s code [2010]. Since
the Olivetti set. Only sum product networks produce
BPMF does not train a model, we allow BPMF to use
lower error on the Olivetti bottom-half faces. Some
all of the training matrix during the prediction phase.
reconstructed faces are displayed in Figure 1, where
Table 4 lists the normalized mean squared er- the shallow, pixel-based HL-MRFs produce compara-
ror (NMSE) and normalized mean absolute error bly convincing images to sum-product networks, es-
(NMAE), averaged over 10 random splits. Though pecially in the left-half setting, where HL-MRF can
BPMF produces the best scores, the improvement over learn which pixels are likely to mimic their horizontal
HL-MRF-L (LME) is not significant in NMAE. mirror. While neither method is particularly good atTable 5: Mean squared errors per pixel for image reconstruction. HL-MRFs produce the most accurate recon-
structions on the Caltech101 and the left-half Olivetti faces, and only sum-product networks produce better
reconstructions on Olivetti bottom-half faces. Scores for other methods are taken from Poon and Domingos [18].
HL-MRF-Q (MLE) SPN DBM DBN PCA NN
Caltech-Left 1741 1815 2998 4960 2851 2327
Caltech-Bottom 1910 1924 2656 3447 1944 2575
Olivetti-Left 927 942 1866 2386 1076 1527
Olivetti-Bottom 1226 918 2401 1931 1265 1793
Figure 1: Example results on image reconstruction of Caltech101 (left) and Olivetti (right) faces. From left
to right in each column: (1) true face, left side predictions by (2) HL-MRFs and (3) SPNs, and bottom half
predictions by (4) HL-MRFs and (5) SPNs. SPN reconstructions are downloaded from Poon and Domingos [18].
reconstructing the bottom half of faces, the qualitative variety of domains. HL-MRFs admit fast, convex in-
difference between the deep SPN and the shallow HL- ference. The MPE inference algorithm we introduce
MRF reconstructions is that SPNs seem to hallucinate is applicable to the full class of HL-MRFs. With this
different faces, often with some artifacts, while HL- fast, general algorithm, we are the first to show results
MRFs predict blurry shapes roughly the same pixel using quadratic HL-MRFs on real-world data. In our
intensity as the observed, top half of the face. The experiments, HL-MRFs match or exceed the predic-
tendency to better match pixel intensity helps HL- tive performance of state-of-the-art methods on four
MRFs score better quantitatively on the Caltech101 diverse tasks. The natural mapping between hinge-
faces, where the lighting conditions are more varied. loss potentials and logic rules makes HL-MRFs easy
to define and interpret.
Training and predicting with these HL-MRFs takes lit-
tle time. In our experiments, training each model takes
about 45 minutes on a 12-core machine, while predict- Acknowledgements
ing takes under a second per image. While Poon and
Domingos [18] report faster training with SPNs, both This work was supported by NSF grants CCF0937094
HL-MRFs and SPNs clearly belong to a class of faster and IIS1218488, and IARPA via DoI/NBC contract
models when compared to DBNs and DBMs, which number D12PC00337. The U.S. Government is autho-
can take days to train on modern hardware. rized to reproduce and distribute reprints for govern-
mental purposes notwithstanding any copyright anno-
tation thereon. Disclaimer: The views and conclusions
6 CONCLUSION contained herein are those of the authors and should
not be interpreted as necessarily representing the of-
We have shown that HL-MRFs are a flexible and inter- ficial policies or endorsements, either expressed or im-
pretable class of models, capable of modeling a wide plied, of IARPA, DoI/NBC, or the U.S. Government.References [14] D. Lowd and P. Domingos. Efficient weight learn-
ing for Markov logic networks. In Principles and
[1] L. An and P. Tao. The DC (difference of convex
Practice of Knowledge Discovery in Databases,
functions) programming and DCA revisited with
2007.
DC models of real world nonconvex optimization
problems. Annals of Operations Research, 133: [15] A. Martins, M. Figueiredo, P. Aguiar, N. Smith,
23–46, 2005. and E. Xing. An augmented Lagrangian approach
to constrained MAP inference. In International
[2] F. Bach and M. Jordan. Thin junction trees. In
Conference on Machine Learning, 2011.
Neural Information Processing Systems, 2001.
[3] S. Bach, M. Broecheler, L. Getoor, and [16] O. Meshi and A. Globerson. An alternating di-
D. O’Leary. Scaling MPE inference for con- rection method for dual MAP LP relaxation. In
strained continuous Markov random fields with European Conference on Machine Learning and
consensus optimization. In Neural Information Knowledge Discovery in Databases, 2011.
Processing Systems, 2012. [17] J. Neville and D. Jensen. Relational dependency
[4] J. Besag. Statistical analysis of non-lattice data. networks. Journal of Machine Learning Research,
Journal of the Royal Statistical Society, 24(3): 8:653–692, 2007.
179–195, 1975. [18] H. Poon and P. Domingos. Sum-product net-
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and works: A new deep architecture. In Uncertainty
J. Eckstein. Distributed optimization and statisti- in Artificial Intelligence, 2011.
cal learning via the alternating direction method [19] M. Richardson and P. Domingos. Markov logic
of multipliers. Foundations and Trends in Ma- networks. Machine Learning, 62(1-2):107–136,
chine Learning, 3(1), 2011. 2006.
[6] M. Broecheler, L. Mihalkova, and L. Getoor. [20] R. Salakhutdinov and A. Mnih. Bayesian prob-
Probabilistic similarity logic. In Uncertainty in abilistic matrix factorization using Markov chain
Artificial Intelligence, 2010. Monte Carlo. In International Conference on Ma-
[7] M. Collins. Discriminative training methods for chine Learning, 2008.
hidden Markov models: theory and experiments [21] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gal-
with perceptron algorithms. In Empirical Meth- lagher, and T. Eliassi-Rad. Collective classifica-
ods in Natural Language Processing, 2002. tion in network data. AI Magazine, 29(3):93–106,
[8] P. Domingos and W. Webb. A tractable first- 2008.
order probabilistic logic. In AAAI Conference on [22] D. Sontag and T. Jaakkola. Tree block coordinate
Artificial Intelligence, 2012. descent for MAP in graphical models. In Artificial
[9] K. Goldberg, T. Roeder, D. Gupta, and Intelligence and Statistics, 2009.
C. Perkins. Eigentaste: A constant time collabo- [23] B. Taskar, M. Wong, P. Abbeel, and D. Koller.
rative filtering algorithm. Information Retrieval, Link prediction in relational data. In Neural In-
4(2):133–151, July 2001. formation Processing Systems, 2003.
[10] B. Huang, A. Kimmig, L. Getoor, and J. Golbeck. [24] B. Taskar, C. Guestrin, and D. Koller. Max-
A flexible framework for probabilistic models of margin Markov networks. In Neural Information
social trust. In Conference on Social Comput- Processing Systems, 2004.
ing, Behavioral-Cultural Modeling, & Prediction,
2013. [25] M. Wainwright and M. Jordan. Graphical mod-
els, exponential families, and variational infer-
[11] T. Huynh and R. Mooney. Online max-margin ence. Foundations and Trends in Machine Learn-
weight learning for Markov logic networks. In ing, 1(1-2), January 2008.
SIAM International Conference on Data Mining,
2011. [26] L. Xiong, X. Chen, T. Huang, J. Schneider, and
J. Carbonell. Temporal collaborative filtering
[12] T. Joachims, T. Finley, and C. Yu. Cutting-plane with Bayesian probabilistic tensor factorization.
training of structural SVMs. Machine Learning, In SIAM International Conference on Data Min-
77(1):27–59, 2009. ing, 2010.
[13] A. Kimmig, S. Bach, M. Broecheler, B. Huang,
and L. Getoor. A short introduction to probabilis-
tic soft logic. In NIPS Workshop on Probabilis-
tic Programming: Foundations and Applications,
2012.You can also read