Learning Low-Rank Kernel Matrices


Brian Kulis                                                                        kulis@cs.utexas.edu
Mátyás Sustik                                                                   sustik@cs.utexas.edu
Inderjit Dhillon                                                                inderjit@cs.utexas.edu
Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712

Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).

Abstract

Kernel learning plays an important role in many machine learning tasks. However, algorithms for learning a kernel matrix often scale poorly, with running times that are cubic in the number of data points. In this paper, we propose efficient algorithms for learning low-rank kernel matrices; our algorithms scale linearly in the number of data points and quadratically in the rank of the kernel. We introduce and employ Bregman matrix divergences for rank-deficient matrices; these divergences are natural for our problem since they preserve the rank as well as positive semi-definiteness of the kernel matrix. Special cases of our framework yield faster algorithms for various existing kernel learning problems. Experimental results demonstrate the effectiveness of our algorithms in learning both low-rank and full-rank kernels.
1. Introduction

Kernel methods have played a major role in many recent machine learning algorithms. However, scalability is often a concern: given n input data points, many kernel-based algorithms scale as O(n^3). Furthermore, the kernel matrix requires O(n^2) memory overhead, which may be prohibitive for large-scale learning tasks. Recently, research has been done on using low-rank kernel representations to improve scalability (Fine & Scheinberg, 2001). If the kernel matrix is assumed to be of rank r (with r < n), we need only store the decomposition of the kernel matrix K = GG^T, where G is n × r. Many kernel-based learning algorithms can be reformulated in terms of G, and lead to algorithms that scale linearly with n. One of the main obstacles in using low-rank kernel matrices lies in obtaining a matrix with such a property; most standard kernel functions do not produce low-rank kernel matrices, in general.

The focus in this paper is on learning low-rank kernel matrices given distance and similarity constraints on the data. Learning a kernel matrix has been a topic of significant research, but most existing algorithms are not very efficient. Moreover, there is not much literature on learning low-rank kernel matrices. We propose to learn a low-rank kernel by minimizing the divergence to an initial low-rank kernel matrix while satisfying distance and similarity constraints as well as a low-rank constraint. However, low-rank constraints are non-convex, and optimization problems involving such constraints are intractable in general. By introducing specific matrix divergences, we show how to naturally obtain convexity of the optimization problem, leading to algorithms that are substantially more efficient than current kernel learning algorithms.

Our main contributions in this paper are:

  • We employ rank-preserving Bregman matrix divergences, dissimilarity measures over matrices which are natural for learning low-rank kernel matrices, as they implicitly constrain the rank and maintain positive semi-definiteness during the updates of our algorithms.

  • We develop projection algorithms based on the Burg matrix divergence and the von Neumann divergence that scale linearly with the number of data points.

  • A special case of our formulation leads to the DefiniteBoost optimization problem of Tsuda et al. (2005). Our approach improves the running time of their algorithm by a factor of n, from O(n^3) to O(n^2) per projection. Additional special cases of our formulation lead to improved algorithms for nonlinear dimensionality reduction and the nearest correlation matrix problem.
  • We experimentally demonstrate that our algorithms can be effectively used in classification and clustering tasks.

2. Background and Related Work

In this section, we briefly review relevant background material and related work.

2.1. Kernel Methods

Given a set of training points a_1, ..., a_n, a common step in kernel algorithms is to transform the data using a non-linear function ψ. This mapping typically represents a transformation of the data to a higher-dimensional feature space. A kernel function is a function κ that gives the inner product between two vectors in the feature space:

    κ(a_i, a_j) = ψ(a_i) · ψ(a_j).

It is often possible to compute this inner product without explicitly computing the expensive mapping of the input points to the higher-dimensional feature space. Generally, given n points a_i, we form an n × n matrix K, called the kernel matrix, whose (i, j) entry corresponds to κ(a_i, a_j). In kernel-based algorithms, the only information needed about the input data points is the inner products; hence, the kernel matrix provides all relevant information for learning in the feature space. A kernel matrix formed from any set of input data points is always positive semi-definite (has non-negative eigenvalues). See Shawe-Taylor and Cristianini (2004) for more details.

2.2. Low-Rank Kernel Representations and Kernel Learning

Despite the popularity of kernel methods in machine learning, many kernel-based algorithms scale poorly. To improve scalability, the use of low-rank kernel representations has been proposed. Given an n × n kernel matrix K, if the matrix is of low rank, say r < n, we can represent the kernel matrix in terms of a decomposition K = GG^T, with G an n × r matrix.

In addition to easing the burden of memory overhead from O(n^2) storage to O(nr), this low-rank decomposition can lead to improved efficiency. For example, Fine and Scheinberg (2001) show that SVM training reduces from O(n^3) to O(nr^2) when using a low-rank decomposition. Empirically, this algorithm was shown to outperform other SVM training algorithms in terms of training time by several orders of magnitude. In clustering, the kernel k-means algorithm (Kulis et al., 2005) has a running time of O(n^2) per iteration but can be improved to O(nrk) time per iteration with a low-rank representation, where k is the number of desired clusters. Low-rank kernel representations are often obtained using incomplete Cholesky decompositions (Fine & Scheinberg, 2001). Recently, work has been done on using labeled data to improve the low-rank decomposition (Bach & Jordan, 2005).
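For illustration, the following NumPy sketch (ours, not from the paper; the helper names are hypothetical) shows the kind of low-rank representation being discussed: a rank-r factor G with K ≈ GG^T, obtained here from a truncated eigendecomposition rather than an incomplete Cholesky factorization, from which feature-space quantities such as squared distances can be read off in O(r) time per pair.

    import numpy as np

    def low_rank_factor(K, r):
        # Rank-r factor G (n x r) with K ~= G G^T, via a truncated eigendecomposition.
        # The paper notes incomplete Cholesky is the usual tool in practice.
        lam, V = np.linalg.eigh(K)                 # eigenvalues in ascending order
        lam, V = lam[-r:], V[:, -r:]               # keep the top-r eigenpairs
        return V * np.sqrt(np.maximum(lam, 0.0))   # scale columns by sqrt(eigenvalue)

    def squared_distance(G, i, j):
        # K_ii + K_jj - 2 K_ij computed from the factor alone in O(r) time.
        d = G[i] - G[j]
        return float(d @ d)

    # Example: a rank-deficient linear kernel on 100 points in R^5.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 5))
    K = A @ A.T                                    # rank-5 kernel matrix
    G = low_rank_factor(K, 5)
    assert np.allclose(G @ G.T, K, atol=1e-8)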
In this paper, our focus is on using distance and similarity constraints to learn a low-rank kernel matrix. The problem of learning a kernel matrix has been studied in various contexts. Lanckriet et al. (2004) study transductive learning of the kernel matrix and multiple kernel learning using semi-definite programming. In Kwok and Tsang (2003), a formulation based on idealized kernels is presented to learn a kernel matrix when some labels are given. Another recent paper (Weinberger et al., 2004) considers learning a kernel matrix for nonlinear dimensionality reduction; like much of the research on learning a kernel matrix, semi-definite programming is used and the running time is at least cubic in the number of data points. Our work is closest to that of Tsuda et al. (2005), who learn a (full-rank) kernel matrix using von Neumann divergence under linear constraints. However, our framework is more general and our emphasis is on low-rank kernel learning. Our algorithms are more efficient than those of Tsuda et al.; we use exact instead of approximate projections to speed up convergence, and we consider the Burg divergence in addition to the von Neumann divergence.

3. Optimization Framework

3.1. Bregman Matrix Divergences

Let φ be a real-valued strictly convex function defined over a convex set S = dom(φ) ⊆ R^m such that φ is differentiable on the relative interior of S. The Bregman divergence (Bregman, 1967) with respect to φ is defined as

    D_φ(x, y) = φ(x) − φ(y) − (x − y)^T ∇φ(y).

For example, if φ(x) = x^T x, then the resulting Bregman divergence is D_φ(x, y) = ||x − y||_2^2. Alternately, if φ(x) = Σ_i (x_i log x_i − x_i), then the resulting Bregman divergence is the (unnormalized) relative entropy. Bregman divergences generalize many properties of squared loss and relative entropy.

We can naturally extend this definition to convex functions defined over matrices. In this case, given a strictly convex, differentiable function φ(X), the Bregman matrix divergence is defined to be

    D_φ(X, Y) = φ(X) − φ(Y) − tr((∇φ(Y))^T (X − Y)),

where tr(A) denotes the trace of matrix A.
Examples include φ(X) = ||X||_F^2, which leads to the squared Frobenius norm ||X − Y||_F^2. We consider two other matrix divergences which are less well-known. Let the function φ compute the (negative) entropy of the eigenvalues of a positive semi-definite matrix. More specifically, if X has eigenvalues λ_1, ..., λ_n, then φ(X) = Σ_i (λ_i log λ_i − λ_i), which may be expressed as φ(X) = tr(X log X − X), where log X is the matrix logarithm.¹ The resulting matrix divergence is

    D_vN(X, Y) = tr(X log X − X log Y − X + Y),            (1)

and is known as the von Neumann divergence (or quantum relative entropy in the physics literature). Another example is to take the Burg entropy of the eigenvalues, i.e. φ(X) = −Σ_i log λ_i; this may be expressed as φ(X) = −log det X. The resulting matrix divergence is

    D_Burg(X, Y) = tr(XY^{-1}) − log det(XY^{-1}) − n,     (2)

which we call the Burg matrix divergence (or the LogDet divergence).

¹ If X = V Λ V^T is the eigendecomposition of the positive-definite matrix X, then its matrix logarithm can be written as V log Λ V^T, where log Λ is the diagonal matrix whose entries contain the logarithm of the eigenvalues. The matrix exponential can be defined analogously.
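As an illustration of definitions (1) and (2), the following NumPy sketch (ours, not from the paper; full-rank positive-definite inputs only) evaluates both divergences directly from eigendecompositions. The rank-deficient extensions are the subject of Section 4.

    import numpy as np

    def matrix_log(A):
        # Matrix logarithm of a symmetric positive-definite matrix, as in footnote 1:
        # A = V diag(lam) V^T  =>  log A = V diag(log lam) V^T.
        lam, V = np.linalg.eigh(A)
        return V @ np.diag(np.log(lam)) @ V.T

    def von_neumann_div(X, Y):
        # D_vN(X, Y) = tr(X log X - X log Y - X + Y), equation (1).
        return np.trace(X @ matrix_log(X) - X @ matrix_log(Y) - X + Y)

    def burg_div(X, Y):
        # D_Burg(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n, equation (2).
        n = X.shape[0]
        XYinv = X @ np.linalg.inv(Y)
        return np.trace(XYinv) - np.linalg.slogdet(XYinv)[1] - n

    # Sanity check: both divergences vanish when X = Y.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 5))
    K = A @ A.T + 5.0 * np.eye(5)
    print(von_neumann_div(K, K), burg_div(K, K))    # both approximately 0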
3.2. Problem Description

We now give a formal statement of the problem. Given an input kernel matrix K_0, we attempt to solve the following for K:

    minimize    D_φ(K, K_0)
    subject to  tr(K A_i) ≤ b_i,  1 ≤ i ≤ c,
                rank(K) ≤ r,
                K ⪰ 0.

Any of the linear inequality constraints above may be replaced with equalities. This problem is non-convex in general, due to the rank constraint. However, when the rank of K_0 does not exceed r, then this problem surprisingly turns out to be convex when we use rank-preserving Bregman matrix divergences (defined later in Section 4). As we will see later, another advantage of using the von Neumann and Burg divergences is that the algorithms used to solve the minimization problem implicitly maintain the positive semi-definiteness constraint.

Though our algorithms can handle general linear constraints of the form tr(K A_i) ≤ b_i, we will focus on two types of constraints. The first is a distance constraint. The squared Euclidean distance in feature space between the ith and the jth data points is given by K_ii + K_jj − 2K_ij. Given the constraint K_ii + K_jj − 2K_ij ≤ b, it can be represented as tr(KA) ≤ b, where A = zz^T, z_i = 1, z_j = −1, and all other entries of z are 0. The second is a similarity constraint: the constraint K_ij ≤ b can be written as tr(KA) ≤ b using A = xy^T, with x_j = 1, y_i = 1, and all other entries of x and y are 0.

3.3. Bregman Projections

Consider the convex optimization problem presented above, without the rank constraint. (We will see how to handle the rank constraint in the next section.) To solve this problem, we use the method of cyclic projections (Bregman, 1967; Censor & Zenios, 1997), which we briefly summarize. Suppose we wish to minimize f(X) = D_φ(X, X_0) subject to linear equality and inequality constraints. In each iteration of the cyclic projection algorithm, we choose one constraint (assumed to be tr(X A_i) = b_i or tr(X A_i) ≤ b_i). If constraint i is an equality constraint, we project our current solution X_t onto constraint i to obtain X_{t+1} by solving the following system of equations uniquely for α and X_{t+1}:

    ∇f(X_{t+1}) = ∇f(X_t) + α A_i^T,                       (3)
    tr(X_{t+1} A_i) = b_i.

If constraint i is an inequality constraint, we maintain a non-negative dual variable λ_i for that constraint. After solving the above system of equations for α, we set α' = min(λ_i, α) and λ_i = λ_i − α'. Then we update X_{t+1} via

    ∇f(X_{t+1}) = ∇f(X_t) + α' A_i^T.                      (4)

We cycle through constraints in such a way that all constraints are visited infinitely often in the limit. Both of our algorithms in Section 5 are based on this method. The main difficulty in using cyclic projections lies in solving the nonlinear system of equations efficiently. Details of the convergence of this method can be found in Censor and Zenios (1997). Note that Tsuda et al. (2005) do not handle inequality constraints correctly, as they fail to make the correction (4) based on the dual variables.

For simplicity, we assume that the constraint matrices are symmetric rank-one matrices: A_i = z_i z_i^T (we briefly discuss extensions to higher-rank constraints in Section 5.3). By calculating the gradient for the Burg and the von Neumann matrix divergences, respectively, (3) simplifies to:

    X_{t+1} = (X_t^{-1} − α z_i z_i^T)^{-1}                (5)
    X_{t+1} = exp(log(X_t) + α z_i z_i^T),                 (6)

subject to tr(X_{t+1} z_i z_i^T) = b_i, or equivalently, z_i^T X_{t+1} z_i = b_i.
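The following Python skeleton (ours, not the paper's implementation) shows the overall shape of the cyclic projection method for rank-one inequality constraints tr(X z_i z_i^T) ≤ b_i, including the dual-variable correction of (4). The divergence-specific pieces, solve_alpha and apply_update, are placeholders to be filled in with, for example, the closed-form Burg update of Section 5.1 or the von Neumann update of Section 5.2.

    import numpy as np

    def distance_constraint_vector(n, i1, i2):
        # z with z_{i1} = 1, z_{i2} = -1, so that z^T K z = K_{i1,i1} + K_{i2,i2} - 2 K_{i1,i2}.
        z = np.zeros(n)
        z[i1], z[i2] = 1.0, -1.0
        return z

    def cyclic_projections(X0, constraints, solve_alpha, apply_update, num_cycles=100):
        # constraints: list of (z, b) pairs encoding tr(X z z^T) <= b.
        # solve_alpha(X, z, b): the equality-projection parameter alpha solving (3).
        # apply_update(X, z, alpha): the divergence-specific update, e.g. (5) or (6).
        X = X0.copy()
        lam = np.zeros(len(constraints))      # one dual variable per inequality constraint
        for _ in range(num_cycles):           # a fixed cycle count stands in for a convergence test
            for i, (z, b) in enumerate(constraints):
                alpha = solve_alpha(X, z, b)
                corr = min(lam[i], alpha)     # alpha' = min(lambda_i, alpha)
                lam[i] -= corr                # lambda_i <- lambda_i - alpha'
                X = apply_update(X, z, corr)  # apply the update with the corrected parameter, as in (4)
        return X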
4. Bregman Divergences for Rank-Deficient Matrices

The von Neumann and Burg matrix divergences, as well as the corresponding updates in Bregman's algorithm, are seemingly undefined for low-rank matrices. We now discuss extensions of these divergences to low-rank matrices.

Let the eigendecompositions of X and Y be X = V Λ V^T and Y = U Θ U^T. We express the Burg divergence D_Burg(X, Y) given in (2) using the eigendecomposition of X and Y:

    D_Burg(X, Y) = Σ_{i,j} (λ_i / θ_j) (v_i^T u_j)^2 − Σ_i log(λ_i / θ_i) − n.

Similarly, the von Neumann divergence D_vN(X, Y) can be written as:

    D_vN(X, Y) = Σ_i λ_i log λ_i − Σ_{i,j} (v_i^T u_j)^2 λ_i log θ_j − Σ_i (λ_i − θ_i).

Using continuity arguments, it is possible to show that the Burg divergence between X and Y is finite if and only if the range spaces of X and Y are the same. Similarly, the von Neumann divergence is finite if and only if the range space of Y contains the range space of X (for example, by defining 0 log 0 = 0, we can verify this for the von Neumann divergence).

This fact has an important implication: when minimizing D_Burg(K, K_0), the updates of our algorithm automatically maintain the range space of the input kernel at each iteration as long as the optimization problem is feasible. It follows that the rank of K is equal to the rank of K_0 during all updates of the algorithm. Similarly, for the von Neumann divergence, the range space of K_0 must contain the range space of K, so we conclude that the rank of K never exceeds the rank of K_0 during all updates of the algorithm. Thus, we implicitly maintain rank constraints in our optimization problem under both divergences.

We can re-write the divergences in terms of the reduced eigendecompositions of X and Y when they share the same range space. We consider only the top r leading eigenvalues and eigenvectors of X and Y, where r is the rank of Y. Then the Burg matrix divergence D_Burg(X, Y) can be written as

    D_Burg(X, Y) = Σ_{i,j ≤ r} (λ_i / θ_j) (v_i^T u_j)^2 − Σ_{i ≤ r} log(λ_i / θ_i) − r,

where λ_1 ≥ λ_2 ≥ ... ≥ λ_n and θ_1 ≥ θ_2 ≥ ... ≥ θ_n. The von Neumann divergence D_vN(X, Y) is expressed using the reduced eigendecomposition as

    D_vN(X, Y) = Σ_{i ≤ r} λ_i log λ_i − Σ_{i,j ≤ r} (v_i^T u_j)^2 λ_i log θ_j − Σ_{i ≤ r} (λ_i − θ_i).
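As a quick illustration of the reduced-eigendecomposition formula, the sketch below (ours, not from the paper) evaluates D_Burg for two rank-r matrices that share a range space; the von Neumann case is analogous.

    import numpy as np

    def burg_div_lowrank(X, Y, r):
        # D_Burg for rank-r PSD matrices sharing a range space, via the reduced
        # (top-r) eigendecompositions X = V Lam V^T and Y = U Th U^T:
        #   sum_{i,j<=r} (lam_i/th_j)(v_i^T u_j)^2 - sum_{i<=r} log(lam_i/th_i) - r.
        lam, V = np.linalg.eigh(X)
        th, U = np.linalg.eigh(Y)
        lam, V = lam[-r:], V[:, -r:]
        th, U = th[-r:], U[:, -r:]
        overlap = (V.T @ U) ** 2                       # entries (v_i^T u_j)^2
        return float((lam[:, None] / th[None, :] * overlap).sum()
                     - np.log(lam / th).sum() - r)

    # Two rank-r matrices with the same range space: X = G A G^T, Y = G B G^T.
    rng = np.random.default_rng(2)
    n, r = 8, 3
    G = rng.standard_normal((n, r))
    A = rng.standard_normal((r, r)); A = A @ A.T + np.eye(r)
    B = rng.standard_normal((r, r)); B = B @ B.T + np.eye(r)
    print(burg_div_lowrank(G @ A @ G.T, G @ B @ G.T, r))   # finite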
5. Algorithms

5.1. Burg Divergence

5.1.1. Matrix Updates

Consider minimizing D_Burg(X, X_0), the Burg matrix divergence between X and X_0. Recall the projection update rule (5). To accommodate low-rank kernels, we can write the update as:

    X_{t+1} = (X_t^† − α z_i z_i^T)^†,

where X_t^† is the pseudoinverse of the positive semi-definite matrix X_t (Golub & Van Loan, 1996). Alternatively, we could have written this update using the more "procedural" reduced eigendecomposition of X_t. Note that the limit of ((X_t + εI)^{-1} − α z_i z_i^T)^{-1} as ε → 0 leads to the above update, justifying our use of the pseudoinverse. Note that in the full-rank case, the above update is the same as (5). We apply the Sherman-Morrison inverse formula to this update, which can be extended to use the pseudoinverse of a positive semi-definite matrix:

    (A + uv^T)^† = A^† − (A^† u v^T A^†) / (1 + v^T A^† u).

Using this formula, we arrive at a simplified expression for X_{t+1}:

    X_{t+1} = X_t + (α X_t z_i z_i^T X_t) / (1 − α z_i^T X_t z_i).

Since X_{t+1} must satisfy the ith constraint, i.e. tr(X_{t+1} z_i z_i^T) = b_i, we can solve the following equation for α:

    tr((X_t + (α X_t z_i z_i^T X_t) / (1 − α z_i^T X_t z_i)) z_i z_i^T) = b_i.

Let p = z_i^T X_t z_i. Note that tr(X_t z_i z_i^T) = p and tr(X_t z_i z_i^T X_t z_i z_i^T) = p^2. In the case of distance constraints, p is the squared distance between the two data points corresponding to constraint i, and can be computed in constant time. Then we have:

    α = 1/p − 1/b_i.

If we let β = α/(1 − αp), then our matrix update is given by

    X_{t+1} = X_t + β X_t z_i z_i^T X_t.

This update is pleasantly surprising since the projection parameter for Bregman's algorithm does not usually admit a closed form solution (see Section 5.2.2 for the case of the von Neumann divergence).
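A direct transcription of this closed-form projection for a single distance constraint, operating on the full n × n matrix (the O(n^2) version; the factored O(r^2) version follows in Section 5.1.2), might look as follows. This is our sketch; the dual-variable correction of Section 3.3 is omitted here.

    import numpy as np

    def burg_project_distance(X, i1, i2, b):
        # One exact Burg projection of X onto z^T X z = b with z = e_{i1} - e_{i2}
        # (the squared feature-space distance between points i1 and i2).
        n = X.shape[0]
        z = np.zeros(n)
        z[i1], z[i2] = 1.0, -1.0
        p = z @ X @ z                     # current squared distance
        alpha = 1.0 / p - 1.0 / b         # closed-form projection parameter
        beta = alpha / (1.0 - alpha * p)
        Xz = X @ z
        return X + beta * np.outer(Xz, Xz)

    # After the update, z^T X z equals b exactly; the range space of X is preserved.
    rng = np.random.default_rng(3)
    A = rng.standard_normal((6, 4))
    X = A @ A.T                           # a rank-4 kernel matrix
    Xnew = burg_project_distance(X, 0, 1, 2.0)
    z = np.zeros(6); z[0], z[1] = 1.0, -1.0
    print(z @ Xnew @ z)                   # approximately 2.0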
5.1.2. Update for the Factored Matrix

If one stopped with the update given above, the cost is O(n^2) per iteration. However, we can achieve a more efficient update for low-rank matrices by working on a suitable factored form of the matrix X_t. If X_t can be written as GG^T, where G is an n × r matrix, the update can be formulated using only the factored matrices. Let β be as above. Then we have:

    X_{t+1} = GG^T + β GG^T z_i z_i^T GG^T
            = G(I + β G^T z_i z_i^T G)G^T.

The matrix I + β z̃_i z̃_i^T, where z̃_i = G^T z_i, is an r × r matrix. To update G for the next iteration, we factor this matrix as LL^T; then our new G is updated to GL. Since I + β z̃_i z̃_i^T is a rank-one perturbation of the identity, this update can be done in O(r^2) time using a Cholesky rank-one update routine.

To increase computational efficiency, we note that G = G_0 B, where B is the product of all the L matrices from every iteration and G_0 is the initial Cholesky factor of K_0. Instead of updating G explicitly at each iteration, we can just update B as BL. The matrix I + β G^T z_i z_i^T G is then I + β B^T G_0^T z_i z_i^T G_0 B. In the case of distance constraints, we can compute G_0^T z_i in O(r) time as the difference of two rows of G_0. Furthermore, α, β and the Cholesky factorization can all be computed in O(r^2) time. The multiplication BL appears to cost O(r^3) time, but the simple structure of L and the fact that it depends on O(r) parameters allow us to implement this multiplication in O(r^2) time as well. Details are omitted due to lack of space.

5.1.3. Algorithm

The algorithm for distance inequality constraints using the Burg divergence is given as Algorithm 1. As discussed in the previous section, every projection can be done in O(r^2) time. Thus, cycling through all c projections requires O(cr^2) time. The only dependence on n (the number of data points) occurs in steps 1 and 4. If we need to explicitly create the matrix G_0, this takes O(nr) time, and the last step, multiplying G = G_0 B, takes O(nr^2) time. Hence, the algorithm is linear in n (and linear in c).

Convergence can be checked by using the dual variables λ. The cyclic projection algorithm can be viewed as a dual ascent algorithm, thus convergence can be measured as follows: after cycling through all constraints, we check to see how much λ has changed after a full cycle. At convergence, this change (as measured with a vector norm) should be small.

Algorithm 1: Learning a low-rank kernel in Burg divergence under distance constraints.

  KernelLearnBurg(r, {A_i}_{i=1}^c, G_0)
  Input: r: rank of desired kernel matrix; {A_i}_{i=1}^c: constraints; G_0: optional input kernel factor matrix of rank r
  Output: G: output low-rank kernel factor matrix
  1. If not given, create a random n × r matrix G_0.
  2. Set B = I_r, i = 1, and λ_j = 0 for all constraints j.
  3. Repeat until convergence:
     • Let v^T be row i_1 of G_0 minus row i_2 of G_0, where, corresponding to constraint i, points i_1 and i_2 are constrained to have squared Euclidean distance less than or equal to b_i.
     • Set the following variables:
           w   ← B^T v
           α   ← min(λ_i, 1/||w||_2^2 − 1/b_i)
           λ_i ← λ_i − α
           β   ← α/(1 − α||w||_2^2)
     • Compute the Cholesky factorization LL^T = I + βww^T.
     • Set B ← BL, i ← mod(i, c) + 1.
  4. Return G = G_0 B.
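A compact NumPy rendering of Algorithm 1 is sketched below (ours). It uses a dense r × r Cholesky factorization, which costs O(r^3) per projection, instead of the O(r^2) Cholesky rank-one update routine the paper describes, and a fixed number of cycles in place of the dual-variable convergence test.

    import numpy as np

    def kernel_learn_burg(G0, constraints, num_cycles=50):
        # Sketch of Algorithm 1 (Burg divergence, distance constraints).
        # constraints: list of (i1, i2, b) meaning squared distance(i1, i2) <= b.
        n, r = G0.shape
        B = np.eye(r)
        lam = np.zeros(len(constraints))            # dual variables
        for _ in range(num_cycles):
            for i, (i1, i2, b) in enumerate(constraints):
                v = G0[i1] - G0[i2]                 # v^T = row i1 of G0 minus row i2
                w = B.T @ v
                p = float(w @ w)                    # ||w||_2^2, the current squared distance
                alpha = min(lam[i], 1.0 / p - 1.0 / b)
                lam[i] -= alpha
                beta = alpha / (1.0 - alpha * p)
                L = np.linalg.cholesky(np.eye(r) + beta * np.outer(w, w))
                B = B @ L
        return G0 @ B                               # factor G of the learned kernel K = G G^T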
                                                             Λt is r × r. We must calculate exp(log(Xt ) + αzi zTi ).
this takes O(nr) time, and the last step, multiplying
                                                             When Xt is of full rank, the definitions of the log
G = G0 B, takes O(nr2 ) time. Hence, the algorithm is
                                                             and exp functions imply that exp(log(Xt ) + αzi zTi ) =
linear in n (and linear in c).
                                                             Vt exp(log(Λt ) + αVtT zi zTi Vt )VtT . This motivates the
Convergence can be checked by using the dual vari-           following natural extension of the update when Xt is
ables λ. The cyclic projection algorithm can be viewed       low rank:
as a dual ascent algorithm, thus convergence can be
                                                                   Xt+1 = Vt exp(log(Λt ) + αVtT zi zTi Vt )VtT .
measured as follows: after cycling through all con-
straints, we check to see how much λ has changed after       We can verify that the matrix on the right-hand side
a full cycle. At convergence, this change (as measured       is equal to the limit of exp(log(Xt + ǫI) + αzi zTi ) when
with a vector norm) should be small.                         ǫ → 0, thus justifying the update.
If the eigendecomposition of the exponent log(Λ_t) + α V_t^T z_i z_i^T V_t is U_t Θ_t U_t^T, then the eigendecomposition of X_{t+1} is given by V_{t+1} = V_t U_t and Λ_{t+1} = exp(Θ_t). This special eigenvalue problem (diagonal plus rank-one update) can be solved in O(r^2) time; see Demmel (1997). This means that the matrix multiplication V_{t+1} = V_t U_t becomes the most expensive step in the computation, yielding O(nr^2) complexity. We reduce this cost by modifying the decomposition slightly. Let X_t = V_t W_t Λ_t W_t^T V_t^T be the factorization of X_t, where W_t is an r × r orthogonal matrix, while V_t and Λ_t are defined as before. The matrix update may be written as:

    X_{t+1} = V_t W_t exp(log Λ_t + α W_t^T V_t^T z_i z_i^T V_t W_t) W_t^T V_t^T,

yielding the following updates:

    V_{t+1} = V_t,   W_{t+1} = W_t U_t,   Λ_{t+1} = exp(Θ_t),

where log Λ_t + α W_t^T V_t^T z_i z_i^T V_t W_t = U_t Θ_t U_t^T. For a general rank-one constraint, the product V_t^T z_i can be calculated in O(nr) time, but for distance constraints O(r) operations are sufficient. The calculation of W_t^T (V_t^T z_i) and computing the eigendecomposition U_t Θ_t U_t^T both take O(r^2) time. Forming the matrix product W_t U_t appears to cost O(r^3) time, but in fact the multiplication can be approximated very accurately in O(r^2 log r) and even in O(r^2) time using the fast multipole method (Barnes & Hut, 1986; Greengard & Rokhlin, 1987).

5.2.2. Computing the Projection Parameter

In the previous section, we assumed that we knew the projection parameter α. Unlike in the case of the Burg divergence, this parameter does not have a closed form solution. Instead, we must compute α by solving the nonlinear system of equations given in (3).

The main computational difficulty in finding α is in calculating z_i^T X_{t+1} z_i for a given α. Using the approach from Section 5.2.1, we have that z_i^T X_{t+1} z_i = z_i^T V_{t+1} W_{t+1} Λ_{t+1} W_{t+1}^T V_{t+1}^T z_i, and for a given choice of α, this trace can be computed in O(r^2) time. We built a custom nonlinear solver that is optimized for this problem, as a result of which we can accurately compute α using only a few trace computations (rarely more than six trace evaluations per projection). Thus, the projection parameter can effectively be computed in O(r^2) time.
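For illustration, one exact von Neumann projection can be sketched as below (ours, not the paper's custom solver): the projection parameter α is found by generic one-dimensional root finding on z^T X_{t+1}(α) z − b, and a dense eigendecomposition of the r × r exponent replaces the O(r^2) diagonal-plus-rank-one routine. The dual-variable correction is again omitted.

    import numpy as np
    from scipy.optimize import brentq

    def von_neumann_project(V, lam, z, b):
        # One exact von Neumann projection of X = V diag(lam) V^T (V: n x r,
        # lam: positive length-r vector) onto z^T X z = b.
        w = V.T @ z
        log_lam = np.log(lam)

        def exponent_eig(alpha):
            return np.linalg.eigh(np.diag(log_lam) + alpha * np.outer(w, w))

        def gap(alpha):                             # z^T X_{t+1}(alpha) z - b
            theta, U = exponent_eig(alpha)
            y = U.T @ w
            return float(np.exp(theta) @ (y * y)) - b

        lo, hi = -1.0, 1.0                          # gap is increasing in alpha; bracket a sign change
        while gap(lo) > 0.0:
            lo *= 2.0
        while gap(hi) < 0.0:
            hi *= 2.0
        alpha = brentq(gap, lo, hi)

        theta, U = exponent_eig(alpha)
        return V @ U, np.exp(theta)                 # X_{t+1} = (V U) diag(exp(theta)) (V U)^T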
5.2.3. Algorithm

The algorithm for distance inequality constraints using the von Neumann divergence is given as Algorithm 2. By using the fast multipole method, every projection can be done in O(r^2) time. The asymptotic running time of this algorithm is the same as the algorithm that uses the Burg divergence.

Algorithm 2: Learning a low-rank kernel in von Neumann divergence under distance constraints.

  KernelLearnVonNeumann(r, {A_i}_{i=1}^c, V_0, Λ_0)
  Input: r: rank of desired kernel matrix; {A_i}_{i=1}^c: constraints; V_0, Λ_0: optional input kernel of rank r in factored form
  Output: V: output low-rank kernel factor matrix
  1. If not given, create a random n × r orthogonal matrix V_0 and r × r diagonal matrix Λ_0.
  2. Set W = I_r, Λ = Λ_0, i = 1, and λ_j = 0 for all constraints j.
  3. Repeat until convergence:
     • Let v^T be row i_1 of V_0 minus row i_2 of V_0, where, corresponding to constraint i, points i_1 and i_2 are constrained to have squared Euclidean distance less than or equal to b_i.
     • Set the following variables:
           w   ← W^T v
           α   ← ComputeProj(v, w, W, Λ, b_i)
           β   ← min(λ_i, α)
           λ_i ← λ_i − β
     • Compute the eigendecomposition UΘU^T = log Λ + βww^T.
     • Set W ← WU, Λ ← exp(Θ), i ← mod(i, c) + 1.
  4. Return V = V_0 W Λ^{1/2}.

5.3. Generalizations and Special Cases

In the previous two sections, we focused on the case of distance constraints. In this section, we briefly discuss generalizations to other constraints, and discuss important special cases of our optimization problem.

When the constraints are similarity constraints (i.e. K_ij ≤ b_i), the updates can be modified easily (details are omitted due to lack of space). Other constraints are possible as well; for example, one could incorporate relative distance constraints (such as ||a_i − a_j||_2^2 ≤ ||a_l − a_m||_2^2) or relative similarity constraints (such as K_ij ≤ K_lm). Arbitrary linear constraints can also be applied; the cost per projection update is then O(nr).

If we use the von Neumann divergence with r = n and b_i = 0 for all constraints, we exactly obtain the DefiniteBoost optimization problem from Tsuda et al. (2005). In this case, our algorithm computes
the projection update in O(n^2) time. In contrast, the algorithm from Tsuda et al. (2005) computes the projection in a more expensive manner in O(n^3) time. Another difference of our approach is that we compute the exact projection, whereas Tsuda et al. (2005) compute an approximate projection. Computing an approximate projection may lead to a faster per-iteration cost, but it takes more iterations to converge to the optimal solution. We illustrate this further in Section 6.

Another important special case is the nearest correlation matrix problem (Higham, 2002), which arises in applications in finance. In this problem, we set the constraints to be K_ii = 1 for all i. Thus, we seek to find the nearest positive semi-definite matrix with unit diagonal. Our algorithms from this paper give new methods of finding low-rank correlation matrices. Previous algorithms scale cubically in n.

Our formulation can also be employed for nonlinear dimensionality reduction, as in Weinberger et al. (2004). The enforced constraints on the kernel matrix (centering and isometry) are linear, and thus can be encoded into our framework. The only difference is that Weinberger et al. maximize the trace of K, whereas we minimize a matrix divergence. A comparison between these approaches is a potential area of future research.
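As a small usage sketch (ours) of the nearest-correlation-matrix special case: each unit-diagonal constraint K_ii = 1 is the rank-one equality constraint tr(K e_i e_i^T) = 1, so the closed-form Burg projection of Section 5.1.1 applies directly, and equality constraints need no dual-variable capping. In practice one would use the factored low-rank algorithms rather than this dense version.

    import numpy as np

    def burg_project_rank_one(K, z, b):
        # Exact Burg projection onto the equality constraint z^T K z = b;
        # the same closed form as Section 5.1.1 for a general rank-one constraint.
        p = z @ K @ z
        alpha = 1.0 / p - 1.0 / b
        beta = alpha / (1.0 - alpha * p)
        Kz = K @ z
        return K + beta * np.outer(Kz, Kz)

    def nearest_correlation_burg(K0, num_cycles=200):
        # Cycle over the unit-diagonal constraints K_ii = 1, i.e. tr(K e_i e_i^T) = 1.
        n = K0.shape[0]
        K = K0.copy()
        for _ in range(num_cycles):
            for i in range(n):
                e = np.zeros(n)
                e[i] = 1.0
                K = burg_project_rank_one(K, e, 1.0)
        return K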
6. Experiments

To show the effectiveness of our algorithms, we present both clustering and classification results. We use two data sets from real-life applications:

1. Digits data: a subset of the Pendigits data from the UCI repository that contains handwritten samples of the digits 3, 8, and 9. The raw data for each digit is 16-dimensional, and our subset contains 317 digits.

2. GyrB protein data: a 52 × 52 kernel matrix among bacteria proteins, containing three bacteria species. This matrix is identical to the one used to test the DefiniteBoost algorithm in Tsuda et al. (2005).

For classification, we compute accuracy using a k-nearest neighbor classifier (k = 5) that computes distance in the feature space, with a 50/50 training/test split and 2-fold cross validation. Our results are averaged over 20 runs. For clustering, we use the kernel k-means algorithm and compute accuracy using the Normalized Mutual Information (NMI) measure, a standard technique for determining quality of clusters. NMI measures the amount of statistical information shared by the random variables representing the cluster and class distributions.

Constraints for the GyrB data set were generated randomly as follows: two data points are chosen at random and a distance constraint is constructed. If the data points are in the same class, the constraint is of the form d(i_1, i_2) ≤ b_i, where b_i is the distance from the original kernel matrix, and for different classes, we construct d(i_1, i_2) ≥ b_i constraints. For the Digits data set, we added constraints of the form d(i_1, i_2) ≤ (1 − ε)b_i for same-class pairs and d(i_1, i_2) ≥ (1 + ε)b_i for different-class pairs (b_i is the original distance and ε = 0.25). This is very similar to "idealizing" the kernel, as in Kwok and Tsang (2003). Our convergence tolerance was 10^{-3}.
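The constraint-generation procedure just described might be sketched as follows (ours; the (i1, i2, bound, sense) tuple representation is an assumption of this sketch, not the paper's, and with eps = 0 it reduces to the GyrB variant).

    import numpy as np

    def make_distance_constraints(K, labels, num_constraints, eps=0.25, seed=0):
        # For a random pair (i1, i2), require d(i1, i2) <= (1 - eps) * b for
        # same-class pairs and d(i1, i2) >= (1 + eps) * b for different-class
        # pairs, where b is the squared feature-space distance
        # K_ii + K_jj - 2 K_ij in the original kernel.
        rng = np.random.default_rng(seed)
        n = K.shape[0]
        constraints = []
        for _ in range(num_constraints):
            i1, i2 = rng.choice(n, size=2, replace=False)
            b = K[i1, i1] + K[i2, i2] - 2.0 * K[i1, i2]
            if labels[i1] == labels[i2]:
                constraints.append((i1, i2, (1.0 - eps) * b, "<="))
            else:
                constraints.append((i1, i2, (1.0 + eps) * b, ">="))
        return constraints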
6.1. Results

We first ran our algorithms on the Digits data set to learn a rank-16 kernel matrix using the randomly generated constraints. The Gram matrix of the data set (a 317 × 317 rank-16 matrix formed using a linear kernel) was used as our initial kernel matrix. Figure 1 shows NMI values for the clustering with increasing constraints. Adding just a few constraints improves the results significantly, and both of the kernel learning algorithms perform comparably. Classification accuracy using the k-nearest neighbor method was also computed for this data set: marginal classification accuracy gains were observed with the addition of constraints (an increase from 94 to 97 percent accuracy for both divergences). We also recorded convergence data in terms of the number of cycles (sweeps) through all constraints; for the von Neumann divergence, convergence was attained from 11 sweeps for few constraints to 105 sweeps for many constraints, and for the Burg divergence, between 17 and 354 sweeps were needed for convergence. This experiment highlights that our algorithm can use constraints to learn a low-rank kernel matrix (rank 16 as opposed to a full rank of 317). It is noteworthy that the learned kernel performs better than the original kernel for clustering and classification.

[Figure 1. Results of clustering the Digits data set: NMI versus number of constraints for the Burg and von Neumann divergences.]

As a second experiment, we performed a comparison to the DefiniteBoost algorithm of Tsuda et al. (2005) (modified to correctly handle inequality constraints) using the GyrB data set. Using only constraints, we attempt to learn a kernel matrix that achieves high classification accuracy. As in the experiments of Tsuda et al. (2005), we learned a full-rank kernel matrix starting from the scaled identity matrix. In our experiments, we observed that using approximate projections, as done in DefiniteBoost, increases the number of sweeps needed for convergence considerably. For example, starting with the scaled identity as the initial kernel matrix and 100 constraints, it took our von Neumann algorithm only 11 sweeps to converge, whereas it took 3220 sweeps for the DefiniteBoost algorithm to converge. Since the optimal solutions are the same for approximate versus exact projections, we converge to the same kernel matrix as DefiniteBoost but in far fewer sweeps.

The slow convergence of the DefiniteBoost algorithm did not allow us to run it with a larger set of constraints. For the Burg and the von Neumann exact projection algorithms, the number of sweeps required for convergence never exceeded 600 on runs of up to 1000 constraints. Figure 2 depicts the classification accuracy achieved versus the number of constraints. The classification accuracy on the original matrix is 0.948, and so our learned kernels achieve even higher accuracy than the target kernel with a sufficient number of constraints. These results highlight that excellent classification accuracy can be obtained using a kernel that is learned using only distance constraints. Note that the starting kernel was the scaled identity matrix, and so did not encode any domain information.

[Figure 2. Classification accuracy with the gyrB data set: accuracy versus number of constraints for the Burg and von Neumann divergences.]

7. Conclusions

In this paper, we have developed algorithms for learning low-rank kernel matrices. By exploiting the rank-preserving property of Bregman matrix divergences, we were able to obtain convexity of our optimization problem. Unlike previous kernel learning algorithms, which have running times that are cubic in the number of data points, our algorithms are highly efficient: both algorithms have running times linear in the number of data points and quadratic in the rank of the kernel. Furthermore, our algorithms can be used in conjunction with a number of kernel-based learning algorithms that are optimized for low-rank kernel representations. The experimental results demonstrate that our algorithms effectively learned low-rank and full-rank kernels for classification and clustering problems.

Acknowledgements

This research was supported by NSF grant CCF-0431257, NSF Career Award ACI-0093404, and NSF-ITR award IIS-0325116. We thank Koji Tsuda for the protein data and the anonymous reviewers for their helpful suggestions.

References

Bach, F., & Jordan, M. (2005). Predictive low-rank decomposition for kernel methods. Proc. ICML-2005.

Barnes, J., & Hut, P. (1986). A hierarchical O(n log n) force calculation algorithm. Nature, 324, 446–449.

Bregman, L. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Mathematics and Mathematical Physics, 7, 200–217.

Censor, Y., & Zenios, S. (1997). Parallel optimization. Oxford University Press.

Demmel, J. D. (1997). Applied numerical linear algebra. Society for Industrial and Applied Mathematics.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Johns Hopkins University Press.

Greengard, L., & Rokhlin, V. (1987). A fast algorithm for particle simulations. J. Comput. Phys., 73, 325–348.

Higham, N. (2002). Computing the nearest correlation matrix—a problem from finance. IMA J. Numerical Analysis, 22, 329–343.

Kulis, B., Basu, S., Dhillon, I., & Mooney, R. (2005). Semi-supervised graph clustering: A kernel approach. Proc. ICML-2005.

Kwok, J., & Tsang, I. (2003). Learning with idealized kernels. Proc. ICML-2003.

Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.

Tsuda, K., Rätsch, G., & Warmuth, M. (2005). Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6, 995–1018.

Weinberger, K., Sha, F., & Saul, L. (2004). Learning a kernel matrix for nonlinear dimensionality reduction. Proc. ICML-2004.