Hinge-loss Markov Random Fields: Convex Inference for Structured Prediction

Page created by Theodore Turner
 
CONTINUE READING
Hinge-loss Markov Random Fields:
                     Convex Inference for Structured Prediction

                    Stephen H. Bach          Bert Huang       Ben London          Lise Getoor
                                             Computer Science Dept.
                                              University of Maryland
                                          College Park, MD, 20742, USA

                      Abstract                              loss Markov random fields (HL-MRFs) [3], these mod-
                                                            els are analogous to discrete Markov random fields,
    Graphical models for structured domains are             except that random variables are continuous valued
    powerful tools, but the computational com-              in the unit interval [0,1], and potentials are linear or
    plexities of combinatorial prediction spaces            squared hinge-loss functions.
    can force restrictions on models, or re-                In this work, we demonstrate that HL-MRFs are
    quire approximate inference in order to be              powerful tools for structured prediction by producing
    tractable. Instead of working in a combina-             state-of-the-art performance in a number of domains.
    torial space, we use hinge-loss Markov ran-             We are the first to leverage some of the most pow-
    dom fields (HL-MRFs), an expressive class               erful features of HL-MRFs, such as squared poten-
    of graphical models with log-concave density            tials, which we show produce better results on mul-
    functions over continuous variables, which              tiple tasks. We show that HL-MRFs are well-suited to
    can represent confidences in discrete predic-           structured prediction for the following reasons. They
    tions. This paper demonstrates that HL-                 are expressive, interpretable, and easily defined using
    MRFs are general tools for fast and accu-               the modeling language probabilistic soft logic (PSL)
    rate structured prediction. We introduce the            [6, 13]. Further, continuous variables are useful both
    first inference algorithm that is both scalable         for modeling continuous data as well as for expressing
    and applicable to the full class of HL-MRFs,            confidences in discrete predictions. Confidences are
    and show how to train HL-MRFs with several              desirable for the same reason that practitioners often
    learning algorithms. Our experiments show               prefer marginal probabilities to the single most prob-
    that HL-MRFs match or surpass the predic-               able discrete prediction. Finally, HL-MRFs have log-
    tive performance of state-of-the-art methods,           concave density functions, so finding an exact most
    including discrete models, in four application          probable explanation (MPE) for a given input is a
    domains.                                                convex optimization, and therefore exactly solvable in
                                                            polynomial time.
                                                            Our specific contributions include the following. First,
1   INTRODUCTION                                            we introduce a new, fast algorithm for MPE inference
                                                            in HL-MRFs, which is the first to be both scalable
The study of probabilistic modeling in structured
                                                            and applicable to the full class of HL-MRFs. Sec-
and relational domains primarily focuses on predict-
                                                            ond, we show how to train HL-MRFs with several
ing discrete variables [12, 19, 24]. However, except
                                                            learning algorithms. Third, we show empirically that
for some isolated cases, the combinatorial nature of
                                                            these advances enable HL-MRFs to tackle a diverse set
discrete, structured prediction spaces requires conces-
                                                            of relational and structured prediction tasks, provid-
sions: most notably, for inference algorithms to be
                                                            ing state-of-the-art performance on collective classifi-
tractable, they must be relaxed or approximate (e.g.,
                                                            cation, social-trust prediction, collaborative filtering,
[19, 22, 25]), or the model’s structure must be re-
                                                            and image reconstruction. In particular, we show that
stricted (e.g., [2, 8]). Broecheler et al. [6] introduced
                                                            HL-MRFs can outperform their discrete counterparts,
a class of models for continuous variables that has the
                                                            as well as other leading methods.
potential to combine fast and exact inference with the
expressivity of discrete graphical models, but this po-
tential has not been well explored. Now called hinge-
1.1   RELATED WORK                                           Definition 1. Let Y = (Y1 , . . . , Yn ) be a vector of
                                                             n variables and X = (X1 , . . . , Xn0 ) a vector of n0
                                                                                                             0
Probabilistic soft logic (PSL) [6, 13] is a declarative      variables with joint domain D = [0, 1]n+n . Let φ =
language for defining templated HL-MRFs. Its de-             (φ1 , . . . , φm ) be m continuous potentials of the form
velopment was partially motivated by the need for
rich models of continuous similarity values, for use                    φj (Y, X) = [max {`j (Y, X), 0}]pj
in tasks such as entity resolution, collective classifi-
cation, and ontology alignment. PSL is in a family of        where `j is a linear function of Y and X and pj ∈
systems for defining templated, relational probabilistic     {1, 2}. Let C = (C1 , . . . , Cr ) be linear constraint func-
models that includes, for instance, Markov logic net-        tions associated with index sets denoting equality con-
works [19], relational dependency networks [17], and         straints E and inequality constraints I, which define
relational Markov networks [23]. In our experiments,         the feasible set
we compare against Markov logic networks, which use                                                              
                                                                                        Ck (Y, X) = 0, ∀k ∈ E
a first-order syntax similar to PSL’s to build discrete           D̃ = Y, X ∈ D                                     .
                                                                                        Ck (Y, X) ≥ 0, ∀k ∈ I
probabilistic models.
MPE inference algorithms for HL-MRFs solve a con-            For Y, X ∈ D̃, given a vector of nonnegative free
strained, convex optimization. A standard approach           parameters, i.e., weights, λ = (λ1 , . . . , λm ), a con-
for general, constrained, convex optimization is to use      strained hinge-loss energy function fλ is defined as
an interior-point method, which Broecheler et al. [6]                                    m
use. While theoretically efficient, the practical run-                                   X
                                                                           fλ (Y, X) =         λj φj (Y, X) .
ning time of interior-point optimizers quickly becomes
                                                                                         j=1
cumbersome for large problems. For discrete graph-
ical models, recent advances use consensus optimiza-         Definition 2. A hinge-loss Markov random field P
tion to obtain fast, approximate MPE inference algo-         over random variables Y and conditioned on random
rithms [5, 15, 16]. Bach et al. [3] recently developed       variables X is a probability density defined as follows:
an analogous algorithm for exact MPE inference in            if Y, X ∈
                                                                     / D̃, then P (Y|X) = 0; if Y, X ∈ D̃, then
HL-MRFs that produced a significant improvement in
running time over interior-point methods, though it                                   1
                                                                       P (Y|X) =          exp [−fλ (Y, X)] ,
was limited to pairwise potentials and constraints, and                              Z(λ)
cost considerably more computation to optimize over                          R
squared potentials.                                          where Z(λ) =     Y
                                                                                  exp [−fλ (Y, X)].

The learning methods we adapt for HL-MRFs are stan-          Thus, MPE inference is equivalent to finding the min-
dard approaches for learning parameters of probabilis-       imizer of the convex energy fλ .
tic models. In particular, our adaptations are anal-
                                                             The potentials and weights can be grouped together
ogous to previous learning algorithms for relational
                                                             into templates, which can be used to define general
and structured models using approximate maximum-
                                                             classes of HL-MRFs that are parameterized by the in-
likelihood or maximum-pseudolikelihood estimation
                                                             put data. Let T = (t1 , . . . , ts ) denote a vector of tem-
[6, 14, 17, 19] and large-margin estimation [11, 12, 24].
                                                             plates with associated weights Λ = (Λ1 , . . . , Λs ). We
                                                             partition the potentials
                                                                                   P by their associated templates
2     HINGE-LOSS MARKOV                                      and let Φq (Y, X) = j∈tq φj (Y, X) for all tq ∈ T . In
      RANDOM FIELDS                                          the HL-MRF, the weight of the j’th hinge-loss poten-
                                                             tial is set to the weight of the template from which it
Hinge-loss Markov random fields (HL-MRFs) are a              was derived, i.e., λj = Λq , for each j ∈ tq .
general class of conditional, continuous probabilistic       Probabilistic soft logic [6, 13] provides a natural in-
models [3]. HL-MRFs are log-linear models whose fea-         terface to represent hinge-loss potential templates us-
tures are hinge-loss functions of the variable states.       ing logical rules. In particular, a logical conjunc-
Through constructions based on soft logic, hinge-loss        tion of Boolean variables X ∧ Y can be general-
potentials can be used to model generalizations of log-      ized to continuous variables using the hinge function
ical conjunction and implication. HL-MRFs can be             max{X + Y − 1, 0}, which is known as the Lukasiewicz
defined using the modeling language probabilistic soft       t-norm. Disjunction X ∨Y is relaxed to min{X +Y, 1},
logic (PSL) [6, 13], making these powerful models in-        and negation ¬X to 1 − X. PSL allows modelers to
terpretable, flexible, and expressive. In this section, we   design rules that, given data, ground out substitutions
formally present constrained hinge-loss energy func-         for logical terms. The groundings of a template define
tions, HL-MRFs, and briefly review PSL.                      hinge-loss potentials that share the same weight and
have the form one minus the truth value of the ground        alternating-direction method of multipliers (ADMM)
rule. We defer to Broecheler et al. [6] and Kimmig           [5]. The algorithm works by creating local copies of the
et al. [13] for details on PSL.                              variables in each potential and constraint, constrain-
                                                             ing them to be equal to the original variables, and re-
To further demonstrate this templating, consider the
                                                             laxing those equality constraints to make independent
task of predicting who trusts whom in a social network.
                                                             subproblems. By solving the subproblems repeatedly
Let the network contain three individuals: A, B, and
                                                             and averaging the results, the algorithm reaches a con-
C. We can design an HL-MRF to include potentials
                                                             sensus on the best values of the original variables, also
that encode the belief that trust is transitive (which
                                                             called the consensus variables. This procedure is guar-
is a rule we use in our experiments). Let the variable
                                                             anteed to converge to the global minimizer of fλ . See
YA,B represent how much A trusts B, and similarly so
                                                             [3] and [5] for more details on consensus optimization
for YB,C and YA,C . Then the potential
                                                             and ADMM.
    φ(Y, X) = [max{YA,B + YB,C − YA,C − 1, 0}]p
                                                             This previous consensus-optimization approach to
is equivalent to one minus the truth value of                MPE inference works well for linear potentials with at
the Boolean formula YA,B ∧ YB,C → YA,C when                  most two unobserved variables, and empirical evidence
YA,B , YB,C , YA,C ∈ {0, 1}. When they are allowed to        suggests it scales linearly with the size of the problem
take on their full range [0, 1], the potential is a convex   [3]. However, it is restricted to pairwise potentials and
relaxation of the implication. An HL-MRF with this           constraints, and requires an interior-point method as
potential function assigns higher probability to vari-       a subroutine to solve subproblems induced by squared
able states that satisfy the logical implication above,      potentials. Because of the embedded interior-point
which can occur to varying degrees in the continuous         method, its running time can increase roughly 100-fold
domain. Given a social network with more than these          with squared potentials [3].
three individuals, PSL can ground out possible sub-
                                                             We improve the algorithm of Bach et al. [3] by refor-
stitutions for the roles of A, B, and C to generate
                                                             mulating the optimization to enforce Y ∈ [0, 1]n only
potential functions for each substitution, thus defining
                                                             on the consensus variables, not the local copies. This
the full, ground HL-MRF.
                                                             form of consensus optimization is described in greater
HL-MRFs support a few additional components useful           detail by Boyd et al. [5]. The result is that, in our algo-
for modeling. The constraints in Definition 1 allow the      rithm, the potentials and constraints are not restricted
encoding of functional modeling requirements, which          to a certain number of unknowns, and the subproblems
is useful, e.g., when variables correspond to mutually       can all be solved quickly using simple linear algebra.
exclusive labels, and thus should sum to one. The
                                                             Algorithm 1 gives pseudocode for our new algorithm.
exponent parameter pj allows flexibility in the shape
                                                             It starts by initializing local copies of the variables that
of the hinge, affecting the sharpness of the penalty for
                                                             appear in each potential and constraint, along with
violating the logical implication. Setting pj to 1 penal-
                                                             a corresponding Lagrange multiplier for each copy.
izes violation linearly with the amount the implication
                                                             Then, until convergence, it iteratively updates La-
is unsatisfied, while setting pj to 2 penalizes small vi-
                                                             grange multipliers and solves suproblems induced by
olations much less. In effect, some linear potentials
                                                             the HL-MRF’s potentials and constraints. If the sub-
overrule others, while the influences of squared poten-
                                                             problem is induced by a potential, it sets the local vari-
tials are averaged together.
                                                             able copies to a balance between the minimizer of the
                                                             potential and the emerging consensus. Eventually the
3    MPE INFERENCE                                           Lagrange multipliers will enforce agreement between
                                                             the local copies and the consensus. If instead the sub-
MPE inference for HL-MRFs requires finding a feasible        problem is induced by a constraint in the HL-MRF,
assignment that minimizes fλ . Performing MPE infer-         the algorithm projects the consensus variables’ current
ence quickly is crucial, especially because weight learn-    values to that constraint’s feasible region. The con-
ing often requires performing inference many times           sensus variables are updated at each iteration based
with different weights (as we discuss in Section 4).         on the values of their local copies and corresponding
Here, HL-MRFs have a distinct advantage over gen-            Lagrange multipliers, and clipped to [0,1].
eral discrete models, since minimizing fλ is a convex
optimization rather than a combinatorial one. In this        We take the same basic approach as Bach et al. [3] to
section, we detail a new, faster MPE inference algo-         solve the subproblems. We first try to find a minimizer
rithm for HL-MRFs.                                           on either side of the hinge (where `j (Y, X) is either
                                                             positive or negative) before projecting onto the plane
Bach et al. [3] showed how to minimize fλ with               defined by `j (Y, X) = 0. Without the constraints on
a consensus-optimization algorithm, based on the
Algorithm 1 MPE Inference for HL-MRFs                           performs approximate maximum-likelihood estimation
 Input: HL-MRF(Y, X, φ, λ, C, E, I), ρ > 0                      using MPE inference to approximate the gradient of
                                                                the log-likelihood. The second method maximizes the
    Initialize yj as copies of the variables Yj that appear
                                                                pseudolikelihood. The third method finds a large-
       in φj , j = 1, . . . , m
                                                                margin solution, preferring weights that discriminate
    Initialize yk+m as copies of the variables Yk+m
                                                                the ground truth from other states. We describe below
       that appear in Ck , k = 1, . . . , r
                                                                how to apply these learning strategies to HL-MRFs.
    Initialize Lagrange multipliers αi corresponding to
       variable copies yi , i = 1, . . . , m + r
                                                                4.1   MAXIMUM-LIKELIHOOD
    while not converged do                                            ESTIMATION
      for j = 1, . . . , m do
        αj ← αj + ρ(yj − Yj )                                   The canonical approach for learning parameters Λ is
        yj ← Yj − αj /ρ                                         to maximize the log-likelihood of training data. The
        if `j (yj , X) > 0 then                                 partial derivative of the log-likelihood with respect to
                              
                                λj [`j (yj , X)]pj
                                                               a parameter Λq is
           yj ← arg minyj
                                + ρ2 kyj − Yj + ρ1 αj k22
                                                                      ∂ log p(Y|X)
           if `j (yj , X) < 0 then                                                 = EΛ [Φq (Y, X)] − Φq (Y, X) ,
              yj ← Proj`j =0 (Yj )                                         ∂Λq
           end if
                                                                where EΛ is the expectation under the distribution de-
        end if
                                                                fined by Λ. The voted perceptron algorithm [7] opti-
      end for
                                                                mizes Λ by taking steps of fixed length in the direc-
      for k = 1, . . . , r do                                   tion of the gradient, then averaging the points after
        αk+m ← αk+m + ρ(yk+m − Yk+m )                           all steps. Any step that is outside the feasible region
        yk+m ← ProjCk (Yk+m )                                   is projected back before continuing. For a smoother
      end for                                                   ascent, it is often helpful to divide the q-th component
      for i = 1, . . . , n do                                   of the gradient by the number of groundings |tq | of the
                                                                q’th template [14], which we do in our experiments.
                                                           
                      1
                               P                       αc
        Yi ← |copies(Y    i )|  yc ∈copies(Yi ) yc +   ρ        Computing the expectation is intractable, so we use
        Clip Yi to [0,1]                                        a common approximation: the values of the potential
      end for                                                   functions at the most probable setting of Y with the
    end while                                                   current parameters [6].

                                                                4.2   MAXIMUM-PSEUDOLIKELIHOOD
the local variable copies, finding these minimizers and               ESTIMATION
projections is much simpler. Solving for the constraint
subproblems is now also simpler, requiring just a pro-          Since exact maximum likelihood estimation is in-
jection onto a hyperplane or a halfspace.                       tractable in general, we can instead perform
Our algorithm retains all of the benefits of the original       maximum-pseudolikelihood estimation (MPLE) [4],
MPE inference algorithm while removing all restric-             which maximizes the likelihood of each variable condi-
tions on the numbers of unknowns in the potentials              tioned on all other variables, i.e.,
and constraints, making MPE inference fast with both                                  n
                                                                                      Y
linear and squared potentials. Another advantage of                   P ∗ (Y|X) =           P ∗ (Yi |MB(Yi ))
consensus optimization for MPE inference is that it is                                i=1
easy to warm start. Warm starting provides signifi-                                   Yn
                                                                                          1
                                                                                                 exp −fλi (Yi , Y, X) ;
                                                                                                                    
cant efficiency gains when inference is repeated on the                         =
                                                                                       Z(λ, Yi )
same HL-MRF with small changes in the weights, as                                  i=1
                                                                                   Z
often occurs during weight learning.                                                       i              
                                                                       Z(λ, Yi ) =     exp −fλ (Yi , Y, X) ;
                                                                                       Y
                                                                                       i

4     WEIGHT LEARNING
                                                                                       X
                                                                  fλi (Yi , Y, X) =
                                                                                                                   
                                                                                               λj φj {Yi ∪ Y\i }, X .
                                                                                      j:i∈φj
In this section, we present three weight learning meth-
ods for HL-MRFs, each with a different objective func-          Here, i ∈ φj means that Yi is involved in φj , and
tion, two of which are new for learning HL-MRFs.                MB(Yi ) denotes the Markov blanket of Yi —that is,
The first method, introduced by Broecheler et al. [6],          the set of variables that co-occur with Yi in any po-
tential function. The partial derivative of the log-         objective for a large-margin solution
pseudolikelihood with respect to Λq is
                                                                    1
                                                             min      ||Λ||2 + Cξ
                                                                    2
                                               
                      n                                      Λ≥0
   ∂ log P ∗ (Y |X) X              X
                   =     EYi |MB      φj (Y, X)             s.t. Λ> (Φ(Y, X) − Φ(Ỹ, X)) ≤ −L(Y, Ỹ) + ξ, ∀Y,
         ∂Λq         i=1          j∈tq :i∈φj
                                                             where Φ(Y, X) = (Φ1 (Y, X), . . . , Φs (Y, X)). This
                     − Φj (Y, X).
                                                             formulation is analogous to the margin-rescaling ap-
Computing the pseudolikelihood gradient does not re-         proach by Joachims et al. [12]. Though such a struc-
quire inference and takes time linear in the size of         tured objective is natural and intuitive, its number
Y. However, the integral in the above expectation            of constraints is the cardinality of the output space,
does not readily admit a closed-form antiderivative,         which here is infinite. Following their approach, we
so we approximate the expectation. When a variable           optimize subject to the infinite constraint set using
in unconstrained, the domain of integration is a one-        a cutting-plane algorithm: we greedily grow a set K
dimensional interval on the real number line, so Monte       of constraints by iteratively adding the worst-violated
Carlo integration quickly converges to an accurate es-       constrain given by a separation oracle, then updating
timate of the expectation.                                   Λ subject to the current constraints. The goal of the
                                                             cutting-plane approach is to efficiently find the set of
We can also apply MPLE when the constraints are not          active constraints at the solution for the full objec-
too interdependent. For example, for linear equality         tive, without having to enumerate the infinite inactive
constraints over disjoint groups of variables (e.g., vari-   constraints. The worst-violated constraint is
able sets that must sum to 1.0), we can block-sample
the constrained variables by sampling uniformly from                     arg min Λ> Φ(Ỹ, X) − L(Y, Ỹ).
a simplex. These types of constraints are often used                        Ỹ

to represent mutual exclusivity of classification labels.    The separation oracle performs loss-augmented in-
We can compute accurate estimates quickly because            ference by adding additional loss-augmenting poten-
these blocks are typically low-dimensional.                  tials to the HL-MRF. For ground truth in {0, 1},
                                                             these loss-augmenting potentials are also examples of
4.3   LARGE-MARGIN ESTIMATION                                hinge-losses, and thus adding them simply creates an
                                                             augmented HL-MRF. The worst-violated constraint
A different approach to learning drops the probabilistic     is then computed as standard inference on the loss-
interpretation of the model and views HL-MRF infer-          augmented HL-MRF. However, ground-truth variables
ence as a prediction function. Large-margin estima-          in the interior (0, 1) cause any distance-based loss to be
tion (LME) shifts the goal of learning from producing        concave, which require the separation oracle to solve
accurate probabilistic models to instead producing ac-       a non-convex objective. For interior ground truth val-
curate MPE predictions. The learning task is then to         ues, we use the difference of convex functions algo-
find the weights Λ that provide high-accuracy struc-         rithm [1] to find a local optimum. Since the concave
tured predictions. We describe in this section a large-      portion of the loss-augmented inference objective piv-
margin method based on the cutting-plane approach            ots around the ground truth value, the subgradients
for structural support vector machines (SVMs) [12].          are 1 or −1, depending on whether the current value
The intuition behind large-margin structured predic-         is greater than the ground truth. We simply choose an
tion is that the ground-truth state should have energy       initial direction for interior labels by rounding, and flip
lower than any alternate state by a large margin. In         the direction of the subgradients for variables whose
our setting, the output space is continuous, so we pa-       solution states are not in the interval corresponding to
rameterize this margin criterion with a continuous loss      the subgradient direction until convergence.
function. For any valid output state Ỹ, a large-margin      Given a set K of constraints, we solve the SVM objec-
solution should satisfy:                                     tive as in the primal form minΛ≥0 21 ||Λ||2 + Cξ s.t. K.
                                                             We then iteratively invoke the separation oracle to find
        fλ (Y, X) ≤ fλ (Ỹ, X) − L(Y, Ỹ), ∀Ỹ,
                                                             the worst-violated constraint. If this new constraint is
where   the decomposable loss function L(Y, Ỹ) =            not violated, or its violation is within numerical toler-
P                                                            ance, we have found the max-margin solution. Other-
   i L(Yi , Ỹi ) measures the disagreement between a
                                                             wise, we add the new constraint to K, and repeat.
state Ỹ and the training label state Y. We define
L as the `1 distance. Since we do not expect all prob-       One fact of note is that the large-margin criterion
lems to be perfectly separable, we relax this constraint     always requires a little slack for squared HL-MRFs.
with a penalized slack ξ. We obtain a convex learning        Since the squared hinge potential is quadratic and the
loss is linear, there always exists a small enough dis-
                                                              Table 1: Average accuracy of classification by HL-
tance from the ground truth such that an absolute (i.e.,
                                                              MRFs and discrete MRFs. Scores statistically equiva-
linear) distance is greater than the squared distance.
                                                              lent to the best scoring method are typed in bold.
In these cases, the slack parameter trades off between
the peakedness of the learned quadratic energy func-                                         Citeseer   Cora
tion and the margin criterion.
                                                                      HL-MRF-Q (MLE)          0.729     0.816
                                                                      HL-MRF-Q (MPLE)         0.729     0.818
5       EXPERIMENTS                                                   HL-MRF-Q (LME)          0.683     0.789
                                                                      HL-MRF-L (MLE)          0.724     0.802
To demonstrate the flexibility and effectiveness of HL-               HL-MRF-L (MPLE)         0.729     0.808
MRFs, we test them on four diverse learning tasks:                    HL-MRF-L (LME)          0.695     0.789
collective classification, social-trust prediction, prefer-
ence prediction, and image reconstruction. 1 Each of                  MRF (MLE)               0.686     0.756
these experiments represents a problem domain that                    MRF (MPLE)              0.715     0.797
is best solved with relational learning approaches be-                MRF (LME)               0.687     0.783
cause structure is a critical component of their prob-
lems. The experiments show that HL-MRFs perform
as well as or better than state-of-the-art approaches.
                                                              5.1     COLLECTIVE CLASSIFICATION
For these diverse tasks, we compare against a number
of competing methods. For collective classification and       When classifying documents, links between those
social-trust prediction, we compare HL-MRFs to dis-           documents—such as hyperlinks, citations, or co-
crete Markov random fields (MRFs). We construct               authorship—provide extra signal beyond the local fea-
them with Markov logic networks (MLNs) [19], which            tures of individual documents. Collectively predicting
template discrete MRFs using logical rules similarly to       document classes with these links tends to improve
PSL for HL-MRFs. We perform inference in discrete             accuracy [21]. We classify documents in citation net-
MRFs using 2500 rounds of the sampling algorithm              works using data from the Cora and Citeseer scientific
MC-Sat (500 of which are burn in), and we find ap-            paper repositories. The Cora data set contains 2,708
proximate MPE states during MLE learning using the            papers in seven categories, and 5,429 directed citation
search algorithm MaxWalkSat [19]. For collaborative           links. The Citeseer data set contains 3,312 papers in
filtering, a task that is inherently continuous and non-      six categories, and 4,591 directed citation links.
trivial to encode in discrete logic, we compare against
Bayesian probabilistic matrix factorization [20]. Fi-         The prediction task is, given a set of seed documents
nally, for image reconstruction, we run the same exper-       whose labels are observed, to infer the remaining doc-
imental setup as Poon and Domingos [18] and compare           ument classes by propagating the seed information
against the results they report, which include tests us-      through the network. For each of 20 runs, we split the
ing sum product networks, deep belief networks, and           data sets 50/50 into training and testing partitions,
deep Boltzmann machines.                                      and seed half of each set. To predict discrete cate-
                                                              gories with HL-MRFs we predict the category with
When appropriate, we evaluate statistical significance        the highest predicted value.
using a paired t-test with rejection threshold 0.01. We
omit variance statistics to save space and only report        We compare HL-MRFs to discrete MRFs on this task.
the average and statistical significance. We describe         We construct both using the same logical rules, which
the HL-MRFs used for our experiments using the PSL            simply encode the tendency for a class to propagate
rules that define them. To investigate the differences        across citations. For each category Ci , we have two
between linear and squared potentials we use both in          separate rules for each direction of citation:
our experiments. HL-MRF-L refers to a model with
all linear potentials and HL-MRF-Q to one with all                  Label(A, Ci ) ∧ Cites(A, B) ⇒ Label(B, Ci ),
squared potentials. When training with MLE and
                                                                    Label(A, Ci ) ∧ Cites(B, A) ⇒ Label(B, Ci ).
MPLE, we use 100 steps of voted perceptron and a
step size of 1.0 (unless otherwise noted), and for LME
we set C = 0.1. We experimented with various set-             Table 1 lists the results of this experiment. HL-MRFs
tings, but the scores of HL-MRFs and discrete MRFs            are the most accurate predictors on both data sets. We
were not sensitive to changes.                                also note that both variants of HL-MRFs are much
                                                              faster than discrete MRFs. See Table 3 for average
    1
        All code is available at http://psl.umiacs.umd.edu.   inference times on five folds.
Table 2: Average area under ROC and precision-recall         Table 3: Average inference times (reported in seconds)
curves of social-trust prediction by HL-MRFs and dis-        of single-threaded HL-MRFs and discrete MRFs.
crete MRFs. Scores statistically equivalent to the best
scoring method by metric are typed in bold.                                       Citeseer     Cora    Epinions
                                                                   HL-MRF-Q          0.42      0.70        0.32
                            ROC     P-R (+)     P-R (-)
                                                                   HL-MRF-L          0.46      0.50        0.28
 HL-MRF-Q (MLE)            0.822       0.978     0.452             MRF             110.96    184.32      212.36
 HL-MRF-Q (MPLE)           0.832       0.979     0.482
 HL-MRF-Q (LME)            0.814       0.976     0.462
 HL-MRF-L (MLE)            0.765       0.965      0.357      them using various learning algorithms. We compute
 HL-MRF-L (MPLE)           0.757       0.963      0.333      three metrics: the area under the receiver operating
 HL-MRF-L (LME)            0.783       0.967     0.453       characteristic (ROC) curve, and the areas under the
 MRF (MLE)                 0.655        0.942     0.270      precision-recall curves for positive trust and negative
 MRF (MPLE)                0.725        0.963     0.298      trust. On all three metrics, HL-MRFs with squared
 MRF (LME)                 0.795       0.973     0.441       potentials score significantly higher. The differences
                                                             among the learning methods for squared HL-MRFs are
                                                             insignificant, but the differences among the models is
5.2   SOCIAL-TRUST PREDICTION                                statistically significant for the ROC metric. For area
                                                             under the precision-recall curve for positive trust, dis-
An emerging problem in the analysis of online social         crete MRFs trained with LME are statistically tied
networks is the task of inferring the level of trust be-     with the best score, and both HL-MRF-L and discrete
tween individuals. Predicting the strength of trust          MRFs trained with LME are statistically tied with the
relationships can provide useful information for viral       best area under the precision-recall curve for negative
marketing, recommendation engines, and internet se-          trust. The results are listed in Table 2.
curity. HL-MRFs with linear potentials have recently         Though the random fold splits are not the same, using
been applied by Huang et al. [10] to this task, show-        the same experimental setup, Huang et al. [10] also
ing superior results with models based on sociologi-         scored the precision-recall area for negative trust of
cal theory. We reproduce their experimental setup us-        standard trust prediction algorithms EigenTrust and
ing their sample of the signed Epinions trust network,       TidalTrust, which scored 0.131 and 0.130, respectively.
in which users indicate whether they trust or distrust       The logical models based on structural balance that
other users. We perform eight-fold cross-validation. In      we run here are significantly more accurate, and HL-
each fold, the prediction algorithm observes the entire      MRFs more than discrete MRFs.
unsigned social network and all but 1/8 of the trust
ratings. We measure prediction accuracy on the held-         Table 3 lists average inference times on five folds of
out 1/8. The sampled network contains 2,000 users,           three prediction tasks: Cora, Citeseer, and Epinions.
with 8,675 signed links. Of these links, 7,974 are pos-      We implemented each method in Java. Both HL-
itive and only 701 are negative.                             MRF-Q and HL-MRF-L are much faster than discrete
                                                             MRFs. This illustrates an important difference be-
We use a model based on the social theory of struc-          tween performing structured prediction via convex in-
tural balance, which suggests that social structures are     ference versus sampling in a discrete prediction space:
governed by a system that prefers triangles that are         using our MPE inference algorithm is much faster.
considered balanced. Balanced triangles have an odd
number of positive trust relationships; thus, consider-
ing all possible directions of links that form a triad of    5.3   PREFERENCE PREDICTION
users, there are sixteen logical implications of the form
                                                             Preference prediction is the task of inferring user at-
  Trusts(A, B) ∧ Trusts(B, C) ⇒ Trusts(A, C).                titudes (often quantified by ratings) toward a set of
                                                             items. This problem is naturally structured, since a
Huang et al. [10] list all sixteen of these rules, a reci-
                                                             user’s preferences are often interdependent, as are an
procity rule, and a prior in their Balance-Recip model,
                                                             item’s ratings. Collaborative filtering is the task of
which we omit to save space.
                                                             predicting unknown ratings using only a subset of ob-
Since we expect some of these structural implications        served ratings. Methods for this task range from sim-
to be more or less accurate, learning weights for these      ple nearest-neighbor classifiers to complex latent fac-
rules provides better models. Again, we use these rules      tor models. To illustrate the versatility of HL-MRFs,
to define HL-MRFs and discrete MRFs, and we train            we design a simple, interpretable collaborative filtering
5.4   IMAGE RECONSTRUCTION
Table 4: Normalized mean squared/absolute errors
(NMSE/NMAE) for preference prediction using the            Digital image reconstruction requires models that un-
Jester dataset. The lowest errors are typed in bold.       derstand how pixels relate to each other, such that
                                                           when some pixels are unobserved, the model can in-
                               NMSE      NMAE
                                                           fer their values from parts of the image that are ob-
      HL-MRF-Q (MLE)           0.0554     0.1974           served. We construct pixel-grid HL-MRFs for image
      HL-MRF-Q (MPLE)          0.0549     0.1953           reconstruction. We test these models using the exper-
      HL-MRF-Q (LME)           0.0738     0.2297           imental setup of Poon and Domingos [18]: we recon-
      HL-MRF-L (MLE)           0.0578    0.2021            struct images from the Olivetti face data set and the
      HL-MRF-L (MPLE)          0.0535    0.1885            Caltech101 face category. The Olivetti data set con-
      HL-MRF-L (LME)           0.0544    0.1875            tains 400 images, 64 pixels wide and tall, and the Cal-
                                                           tech101 face category contains 435 examples of faces,
      BPMF                     0.0501    0.1832            which we crop to the center 64 by 64 patch, as was
                                                           done by Poon and Domingos [18]. Following their ex-
                                                           perimental setup, we hold out the last fifty images and
                                                           predict either the left half of the image or the bottom
                                                           half.
model for predicting humor preferences. We test this
model on the Jester dataset, a repository of ratings       The HL-MRFs in this experiment are much more com-
from 24,983 users on a set of 100 jokes [9]. Each joke     plex than the ones in our other experiments because
is rated on a scale of [−10, +10], which we normalize      we allow each pixel to have its own weight for the fol-
to [0, 1]. We sample a random 2,000 users from the         lowing rules, which encode agreement or disagreement
set of those who rated all 100 jokes, which we then        between neighboring pixels:
split into 1,000 train and 1,000 test users. From each
train and test matrix, we sample a random 50% to use        Bright(Pij , I) ∧ North(Pij , Q) ⇒ Bright(Q, I),
as the observed features X; the remaining ratings are       Bright(Pij , I) ∧ North(Pij , Q) ⇒ ¬Bright(Q, I),
treated as the variables Y.                                ¬Bright(Pij , I) ∧ North(Pij , Q) ⇒ Bright(Q, I),
Our HL-MRF model uses an item-item similarity rule:        ¬Bright(Pij , I) ∧ North(Pij , Q) ⇒ ¬Bright(Q, I),

                                                           where Bright(Pij , I) is the normalized brightness of
                                                           pixel Pij in image I, and North(Pij , Q) indicates that
  SimRate(J1 , J2 ) ∧ Likes(U, J1 ) ⇒ Likes(U, J2 )
                                                           Q is the north neighbor of Pij . We similarly include
                                                           analogous rules for the south, east, and west neighbors,
                                                           as well as the pixels mirrored across the horizontal and
where J1 , J2 are jokes and U is a user; the pred-
                                                           vertical axes. This setup results in up to 24 rules per
icate Likes indicates the degree of preference (i.e.,
                                                           pixel, which, in a 64 by 64 image, produces 80,896
rating value); and SimRate measures the mean-
                                                           weighted potential templates.
adjusted cosine similarity between the observed rat-
ings of two jokes. We also include rules to enforce that   We train these HL-MRFs using MPE-approximate
Likes(U, J) concentrates around the observed average       maximum likelihood with a 5.0 step size on the first
rating of user U and item J, and the global average.       200 images of each data set and test on the last fifty.
                                                           For training, we maximize the data log-likelihood of
We compare our HL-MRF model to a current state-
                                                           uniformly random held-out pixels for each training
of-the-art latent factors model, Bayesian probabilis-
                                                           image, allowing for generalization throughout the im-
tic matrix factorization (BPMF) [20]. BPMF is a
                                                           age. Table 5 lists our results and others reported by
fully Bayesian treatment and, as such, is considered
                                                           Poon and Domingos [18]. HL-MRFs produce the best
“parameter-free”; the only parameter that must be
                                                           mean squared error on the left- and bottom-half set-
specified is the rank of the decomposition.For our ex-
                                                           tings for the Caltech101 set and the left-half setting in
periments, we use Xiong et al.’s code [2010]. Since
                                                           the Olivetti set. Only sum product networks produce
BPMF does not train a model, we allow BPMF to use
                                                           lower error on the Olivetti bottom-half faces. Some
all of the training matrix during the prediction phase.
                                                           reconstructed faces are displayed in Figure 1, where
Table 4 lists the normalized mean squared er-              the shallow, pixel-based HL-MRFs produce compara-
ror (NMSE) and normalized mean absolute error              bly convincing images to sum-product networks, es-
(NMAE), averaged over 10 random splits. Though             pecially in the left-half setting, where HL-MRF can
BPMF produces the best scores, the improvement over        learn which pixels are likely to mimic their horizontal
HL-MRF-L (LME) is not significant in NMAE.                 mirror. While neither method is particularly good at
Table 5: Mean squared errors per pixel for image reconstruction. HL-MRFs produce the most accurate recon-
structions on the Caltech101 and the left-half Olivetti faces, and only sum-product networks produce better
reconstructions on Olivetti bottom-half faces. Scores for other methods are taken from Poon and Domingos [18].

                                     HL-MRF-Q (MLE)        SPN     DBM      DBN     PCA      NN
                  Caltech-Left               1741          1815    2998     4960    2851    2327
                  Caltech-Bottom             1910          1924    2656     3447    1944    2575
                  Olivetti-Left               927          942     1866     2386    1076    1527
                  Olivetti-Bottom            1226           918    2401     1931    1265    1793

Figure 1: Example results on image reconstruction of Caltech101 (left) and Olivetti (right) faces. From left
to right in each column: (1) true face, left side predictions by (2) HL-MRFs and (3) SPNs, and bottom half
predictions by (4) HL-MRFs and (5) SPNs. SPN reconstructions are downloaded from Poon and Domingos [18].

reconstructing the bottom half of faces, the qualitative   variety of domains. HL-MRFs admit fast, convex in-
difference between the deep SPN and the shallow HL-        ference. The MPE inference algorithm we introduce
MRF reconstructions is that SPNs seem to hallucinate       is applicable to the full class of HL-MRFs. With this
different faces, often with some artifacts, while HL-      fast, general algorithm, we are the first to show results
MRFs predict blurry shapes roughly the same pixel          using quadratic HL-MRFs on real-world data. In our
intensity as the observed, top half of the face. The       experiments, HL-MRFs match or exceed the predic-
tendency to better match pixel intensity helps HL-         tive performance of state-of-the-art methods on four
MRFs score better quantitatively on the Caltech101         diverse tasks. The natural mapping between hinge-
faces, where the lighting conditions are more varied.      loss potentials and logic rules makes HL-MRFs easy
                                                           to define and interpret.
Training and predicting with these HL-MRFs takes lit-
tle time. In our experiments, training each model takes
about 45 minutes on a 12-core machine, while predict-      Acknowledgements
ing takes under a second per image. While Poon and
Domingos [18] report faster training with SPNs, both       This work was supported by NSF grants CCF0937094
HL-MRFs and SPNs clearly belong to a class of faster       and IIS1218488, and IARPA via DoI/NBC contract
models when compared to DBNs and DBMs, which               number D12PC00337. The U.S. Government is autho-
can take days to train on modern hardware.                 rized to reproduce and distribute reprints for govern-
                                                           mental purposes notwithstanding any copyright anno-
                                                           tation thereon. Disclaimer: The views and conclusions
6   CONCLUSION                                             contained herein are those of the authors and should
                                                           not be interpreted as necessarily representing the of-
We have shown that HL-MRFs are a flexible and inter-       ficial policies or endorsements, either expressed or im-
pretable class of models, capable of modeling a wide       plied, of IARPA, DoI/NBC, or the U.S. Government.
References                                                 [14] D. Lowd and P. Domingos. Efficient weight learn-
                                                                ing for Markov logic networks. In Principles and
 [1] L. An and P. Tao. The DC (difference of convex
                                                                Practice of Knowledge Discovery in Databases,
     functions) programming and DCA revisited with
                                                                2007.
     DC models of real world nonconvex optimization
     problems. Annals of Operations Research, 133:         [15] A. Martins, M. Figueiredo, P. Aguiar, N. Smith,
     23–46, 2005.                                               and E. Xing. An augmented Lagrangian approach
                                                                to constrained MAP inference. In International
 [2] F. Bach and M. Jordan. Thin junction trees. In
                                                                Conference on Machine Learning, 2011.
     Neural Information Processing Systems, 2001.
 [3] S. Bach, M. Broecheler, L. Getoor, and                [16] O. Meshi and A. Globerson. An alternating di-
     D. O’Leary. Scaling MPE inference for con-                 rection method for dual MAP LP relaxation. In
     strained continuous Markov random fields with              European Conference on Machine Learning and
     consensus optimization. In Neural Information              Knowledge Discovery in Databases, 2011.
     Processing Systems, 2012.                             [17] J. Neville and D. Jensen. Relational dependency
 [4] J. Besag. Statistical analysis of non-lattice data.        networks. Journal of Machine Learning Research,
     Journal of the Royal Statistical Society, 24(3):           8:653–692, 2007.
     179–195, 1975.                                        [18] H. Poon and P. Domingos. Sum-product net-
 [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and                works: A new deep architecture. In Uncertainty
     J. Eckstein. Distributed optimization and statisti-        in Artificial Intelligence, 2011.
     cal learning via the alternating direction method     [19] M. Richardson and P. Domingos. Markov logic
     of multipliers. Foundations and Trends in Ma-              networks. Machine Learning, 62(1-2):107–136,
     chine Learning, 3(1), 2011.                                2006.
 [6] M. Broecheler, L. Mihalkova, and L. Getoor.           [20] R. Salakhutdinov and A. Mnih. Bayesian prob-
     Probabilistic similarity logic. In Uncertainty in          abilistic matrix factorization using Markov chain
     Artificial Intelligence, 2010.                             Monte Carlo. In International Conference on Ma-
 [7] M. Collins. Discriminative training methods for            chine Learning, 2008.
     hidden Markov models: theory and experiments          [21] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gal-
     with perceptron algorithms. In Empirical Meth-             lagher, and T. Eliassi-Rad. Collective classifica-
     ods in Natural Language Processing, 2002.                  tion in network data. AI Magazine, 29(3):93–106,
 [8] P. Domingos and W. Webb. A tractable first-                2008.
     order probabilistic logic. In AAAI Conference on      [22] D. Sontag and T. Jaakkola. Tree block coordinate
     Artificial Intelligence, 2012.                             descent for MAP in graphical models. In Artificial
 [9] K. Goldberg, T. Roeder, D. Gupta, and                      Intelligence and Statistics, 2009.
     C. Perkins. Eigentaste: A constant time collabo-      [23] B. Taskar, M. Wong, P. Abbeel, and D. Koller.
     rative filtering algorithm. Information Retrieval,         Link prediction in relational data. In Neural In-
     4(2):133–151, July 2001.                                   formation Processing Systems, 2003.
[10] B. Huang, A. Kimmig, L. Getoor, and J. Golbeck.       [24] B. Taskar, C. Guestrin, and D. Koller. Max-
     A flexible framework for probabilistic models of           margin Markov networks. In Neural Information
     social trust. In Conference on Social Comput-              Processing Systems, 2004.
     ing, Behavioral-Cultural Modeling, & Prediction,
     2013.                                                 [25] M. Wainwright and M. Jordan. Graphical mod-
                                                                els, exponential families, and variational infer-
[11] T. Huynh and R. Mooney. Online max-margin                  ence. Foundations and Trends in Machine Learn-
     weight learning for Markov logic networks. In              ing, 1(1-2), January 2008.
     SIAM International Conference on Data Mining,
     2011.                                                 [26] L. Xiong, X. Chen, T. Huang, J. Schneider, and
                                                                J. Carbonell. Temporal collaborative filtering
[12] T. Joachims, T. Finley, and C. Yu. Cutting-plane           with Bayesian probabilistic tensor factorization.
     training of structural SVMs. Machine Learning,             In SIAM International Conference on Data Min-
     77(1):27–59, 2009.                                         ing, 2010.
[13] A. Kimmig, S. Bach, M. Broecheler, B. Huang,
     and L. Getoor. A short introduction to probabilis-
     tic soft logic. In NIPS Workshop on Probabilis-
     tic Programming: Foundations and Applications,
     2012.
You can also read