Low-Rank Similarity Metric Learning in High Dimensions

Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Wei Liu†, Cun Mu‡, Rongrong Ji♮, Shiqian Ma§, John R. Smith†, Shih-Fu Chang‡
†IBM T. J. Watson Research Center   ‡Columbia University   ♮Xiamen University   §The Chinese University of Hong Kong
{weiliu,jsmith}@us.ibm.com   cm3052@columbia.edu   sfchang@ee.columbia.edu   rrji@xmu.edu.cn   sqma@se.cuhk.edu.hk


Abstract

Metric learning has become a widely used tool in machine learning. To reduce the expensive costs brought by increasing dimensionality, low-rank metric learning has arisen as a more economical choice in both storage and computation. However, existing low-rank metric learning algorithms usually adopt nonconvex objectives and are hence sensitive to the choice of a heuristic low-rank basis. In this paper, we propose a novel low-rank metric learning algorithm that yields bilinear similarity functions. The algorithm scales linearly with the input dimensionality in both space and time, and is therefore applicable to high-dimensional data domains. A convex objective free of heuristics is formulated by leveraging trace norm regularization to promote low-rankness. Crucially, we prove that all globally optimal metric solutions must retain a certain low-rank structure, which enables our algorithm to decompose the high-dimensional learning task into two steps: an SVD-based projection and a metric learning problem with reduced dimensionality. The latter step can be tackled efficiently by a linearized Alternating Direction Method of Multipliers. The efficacy of the proposed algorithm is demonstrated through experiments on four benchmark datasets with tens of thousands of dimensions.

Introduction

During the past decade, distance metric learning, typically referring to learning a Mahalanobis distance metric, has become ubiquitous in a variety of applications stemming from machine learning, data mining, information retrieval, and computer vision. Many fundamental machine learning algorithms, such as K-means clustering and k-nearest neighbor classification, need an appropriate distance function to measure pairwise affinity (or closeness) between data examples. Beyond the naive Euclidean distance, the distance metric learning problem seeks a better metric through learning with a training dataset subject to particular constraints arising from fully supervised or semi-supervised information, such that the learned metric can reflect domain-specific characteristics of data affinities or relationships. A vast number of distance metric learning approaches have been proposed; the representatives include supervised metric learning, which is the main focus in the literature, and semi-supervised metric learning (Wu et al. 2009; Hoi, Liu, and Chang 2010; Liu et al. 2010a; 2010b; Niu et al. 2012). The suite of supervised metric learning methods can further be divided into three categories in terms of the form of supervision: methods supervised by instance-level labels (Goldberger, Roweis, and Salakhutdinov 2004; Globerson and Roweis 2005; Weinberger and Saul 2009), methods supervised by pair-level labels (Xing et al. 2002; Bar-Hillel et al. 2005; Davis et al. 2007; Ying and Li 2012), and methods supervised by triplet-level ranks (Ying, Huang, and Campbell 2009; McFee and Lanckriet 2010; Shen et al. 2012; Lim, McFee, and Lanckriet 2013). More methods can be found in two surveys (Kulis 2012) and (Bellet, Habrard, and Sebban 2013).

Despite the large amount of literature, relatively little work has dealt with the setting where the dimensionality of the input data is extremely high. This setting is relevant to a wide variety of practical data collections, e.g., images, videos, documents, time-series, and genomes, which frequently involve tens to hundreds of thousands of dimensions. Unfortunately, learning full-rank metrics in a high-dimensional input space R^d quickly becomes computationally prohibitive due to computational complexities ranging from O(d^2) to O(d^6.5). Additionally, limited memory makes storing a huge metric matrix M ∈ R^{d×d} a heavy burden.

This paper concentrates on a more reasonable learning paradigm, low-rank metric learning, also known as structural metric learning. (Davis and Dhillon 2008) demonstrated that in the small sample setting n ≪ d (n is the number of training samples), learning a low-rank metric in a high-dimensional input space is tractable, achieving linear space and time complexities with respect to the dimensionality d. We follow this setting n ≪ d throughout the paper. Given a rank r ≪ d, a distance metric can be expressed in the low-rank form M = LL⊤, which is parameterized by a rectangular matrix L ∈ R^{d×r}. L constitutes a low-rank basis in R^d and also acts as a dimensionality-reducing linear transformation, since any distance between two inputs x and x′ under such a low-rank metric M can be equivalently viewed as a Euclidean distance between the transformed inputs L⊤x and L⊤x′. Therefore, the use of low-rank metrics allows reduced storage of metric matrices (only L needs to be saved) and simultaneously enables efficient distance computation (O(dr) time). Despite these benefits, most existing methods adopt nonconvex optimization objectives, e.g., (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; Mensink et al. 2013; Lim and Lanckriet 2014), and are thus confined to the subspace spanned by a heuristic low-rank basis L0; even the convex method (Davis and Dhillon 2008) shares this limitation. In terms of training costs, most low-rank metric learning approaches, including (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; McFee and Lanckriet 2010; Shen et al. 2012; Kunapuli and Shavlik 2012; Lim, McFee, and Lanckriet 2013), still incur quadratic or cubic time complexities in the dimensionality d. A few methods such as (Davis and Dhillon 2008) and (Mensink et al. 2013) provide efficient O(d) procedures, but they both rely on, and are sensitive to, the initialization of the heuristic low-rank basis L0.

This work pursues a convex low-rank metric learning approach without resorting to any heuristics. Recently, a trend in similarity learning has emerged (Kar and Jain 2011; 2012; Chechik et al. 2010; Crammer and Chechik 2012; Cheng 2013), which attempts to learn similarity functions of the form S(x, x′) = x⊤Mx′ and appears to be more flexible than distance metric learning. In light of this trend, we combine the benefits of low-rank metrics and similarity functions to learn a low-rank metric M that yields a discriminative similarity function S(·). Specifically, we propose a novel low-rank metric learning algorithm by designing a convex objective which consists of a hinge loss enforced over pairwise supervision and a trace norm regularization penalty promoting low-rankness of the desired metric. Significantly, we show that minimizing this objective is guaranteed to produce an optimal solution with a certain low-rank structure M⋆ = UW⋆U⊤. The low-rank basis U included in any optimal metric M⋆ turns out to lie in the column (or range) space of the input data matrix X ∈ R^{d×n}, and W⋆ can thus be sought within a fixed subspace whose dimensionality is much lower than the original one.

By virtue of this key finding, our algorithm decomposes the challenging high-dimensional learning task into two steps which are both computationally tractable and cost O(d) time in total. The first step is dimensionality reduction via an SVD-based projection, and the second step is lower dimensional metric learning via a linearized modification of the Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011). The linearized ADMM maintains analytic-form updates across all iterations, works efficiently, and converges rapidly. We evaluate the proposed algorithm on four benchmark datasets with tens of thousands of dimensions, demonstrating that the low-rank metrics obtained by our algorithm accomplish prominent performance gains in terms of kNN classification.

Low-Rank Similarity Metrics

Different from most previous metric learning approaches, which plug the learned metric into a distance function of the form √((x − x′)⊤M(x − x′)), in this paper we pursue a low-rank metric M = LL⊤ for a bilinear similarity function that measures the similarity between two arbitrary inputs x and x′ in R^d:

    S_M(x, x′) = x⊤Mx′ = x⊤LL⊤x′ = (L⊤x)⊤(L⊤x′),    (1)

which is parameterized by a low-rank basis L ∈ R^{d×r} with rank r ≪ d. Naturally, such a similarity function characterized by a low-rank metric essentially measures the inner product between the two transformed inputs L⊤x and L⊤x′ in a lower dimensional space R^r.
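
As a quick concreteness check on Eq. (1), the following minimal NumPy sketch (with toy sizes of my own choosing, not values from the paper) verifies that the bilinear similarity x⊤Mx′ with M = LL⊤ coincides with the inner product of the projected inputs L⊤x and L⊤x′, which is the O(dr) evaluation used throughout:

    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 1000, 20                         # illustrative input dimension d and rank r
    L = rng.standard_normal((d, r))         # low-rank basis L in R^{d x r}
    x, x2 = rng.standard_normal(d), rng.standard_normal(d)

    # Bilinear similarity S_M(x, x') = x^T M x' with M = L L^T
    # (M is formed here only to verify the identity; it is never needed in practice).
    M = L @ L.T
    s_full = x @ M @ x2

    # Equivalent O(dr) evaluation: inner product of the projected inputs L^T x and L^T x'.
    s_lowrank = (L.T @ x) @ (L.T @ x2)

    assert np.allclose(s_full, s_lowrank)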

Motivated by (Kriegel, Schubert, and Zimek 2008), which presented a statistical means based on angles rather than distances to identify outliers scattered in a high-dimensional space, we argue that angles have a stronger discriminating power than distances in capturing affinities among data points in high dimensions. Suppose that we are provided with a set of pairwise labels y_ij ∈ {1, −1}, in which y_ij = 1 stands for a similar data pair (x_i, x_j) while y_ij = −1 stands for a dissimilar pair (x_i, x_j). The purpose of our supervised similarity metric learning is to produce an acute angle between any similar pair and an obtuse angle between any dissimilar pair after applying the learned metric M, or equivalently, after mapping data points from x to L⊤x. Formally, this purpose is expressed as

    S_M(x_i, x_j) > 0 if y_ij = 1,    S_M(x_i, x_j) < 0 if y_ij = −1.    (2)

Fig. 1 exemplifies our notion of the angles between similar and dissimilar data pairs.

Figure 1: The effect of a desired low-rank similarity metric that induces a linear transformation L⊤. Points with the same color denote similar data examples, while points with different colors denote dissimilar data examples.

Learning Framework

Let us inherit the conventional deployment of pairwise supervision adopted by most metric learning methods, e.g., (Xing et al. 2002; Bar-Hillel et al. 2005; Davis et al. 2007; Davis and Dhillon 2008; Wu et al. 2009; Hoi, Liu, and Chang 2010; Liu et al. 2010a; 2010b; Niu et al. 2012; Ying and Li 2012). This paper tackles a supervised similarity metric learning problem provided with a set of labeled pairs I = {(x_i, x_j, y_ij)}. Suppose that n (≪ d) data examples X = {x_i ∈ R^d}_{i=1}^n are gathered in the set I. Inspired by the angle-based learning principle in Eq. (2), we design a margin criterion over the similarity outputs of the supervised pairs, that is, S_M(x_i, x_j) ≥ 1 for y_ij = 1 and S_M(x_i, x_j) ≤ −ε (ε is a small positive value) for y_ij = −1. Note that setting the margin values to 1 and −1 for positive and negative pairs in the context of similarity learning may be infeasible for multi-class data. For example, suppose there are data points x_1, x_2, and x_3 from three different classes. Imposing S_M(x_1, x_2) ≤ −1, S_M(x_1, x_3) ≤ −1, and S_M(x_2, x_3) ≤ −1 is problematic, since the former two constraints are likely to lead to S_M(x_2, x_3) > 0.

This similarity margin criterion ensures that, after applying the linear transformation L⊤ induced by a desired metric M, the angle between a similar pair (x_i, x_j) is acute, since their inner product (L⊤x_i)⊤(L⊤x_j) = S_M(x_i, x_j) is large enough, while the angle between a dissimilar pair is obtuse, since their inner product is negative.

In order to promote low-rankness of the target metric, we leverage the nuclear norm ‖M‖_*, the sum of all singular values of M, as a regularization penalty in our learning framework. The nuclear norm has been corroborated to encourage low-rank matrix structure in the matrix completion literature (Fazel, Hindi, and Boyd 2001; Candès and Recht 2009; Recht, Fazel, and Parrilo 2010). Moreover, since our framework constrains M to lie in the positive semidefinite (PSD) cone S^d_+ to make M a valid metric, the nuclear norm ‖M‖_* can be replaced by the trace norm tr(M).

Integrating the angle principle and the trace norm regularizer, the objective of our proposed Low-Rank Similarity Metric Learning (LRSML) framework is formulated as

    min_{M ∈ S^d_+}  f(M) := Σ_{i,j=1}^n [ỹ_ij − y_ij S_M(x_i, x_j)]_+ + α tr(M),    (3)

where ỹ_ij = ε if y_ij = −1 and ỹ_ij = y_ij otherwise, [x]_+ = max(0, x), and α > 0 is the regularization parameter. First, the objective in Eq. (3) is convex and does not introduce any heuristic low-rank basis, which was required by the prior convex method (Davis and Dhillon 2008) and by the nonconvex methods (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; Mensink et al. 2013; Lim and Lanckriet 2014) in low-rank distance metric learning. Second, minimizing the hinge loss in Eq. (3) enforces the margin criterion on the similarity function S_M so that dissimilar pairs can be clearly distinguished from similar ones. It is worth clarifying that we do not pursue S_M(x_i, x_j) = 0 (i.e., ε = 0) in the case of y_ij = −1 in Eq. (2) because it may lead to the trivial solution M = 0 to Eq. (3); we observed this trivial solution in the case of ε = 0 when the training examples are few and the regularization parameter α is relatively large.

Due to the space limit, all proofs of the lemmas and theorems presented in this section are placed in the supplemental material.
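
To make the objective concrete, the sketch below (an illustrative helper with my own naming, not code from the paper) evaluates f(M) of Eq. (3) for a metric given in the factored form M = LL⊤, using tr(LL⊤) = ‖L‖²_F for the trace term:

    import numpy as np

    def lrsml_objective(X, Y, L, alpha, eps=0.1):
        """Evaluate f(M) of Eq. (3) with M = L L^T.

        X: d x n data matrix; Y: n x n pairwise label matrix with entries in {1, -1};
        alpha: trace norm weight; eps: margin target for dissimilar pairs.
        """
        XL = L.T @ X                              # r x n projected data
        S = XL.T @ XL                             # S[i, j] = S_M(x_i, x_j)
        Ytilde = np.where(Y == -1, eps, 1.0)      # y~_ij: 1 for similar pairs, eps for dissimilar pairs
        hinge = np.maximum(0.0, Ytilde - Y * S)   # [y~_ij - y_ij S_M(x_i, x_j)]_+
        return hinge.sum() + alpha * np.sum(L * L)   # tr(L L^T) = ||L||_F^2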

Optimality Characterization

Despite being convex, the framework in Eq. (3) is not easy to tackle directly when the data points in X live in a space of high dimension, i.e., when d is very large. In the high-dimensional scenario, basic matrix manipulations upon M are too expensive to execute; for example, projecting M onto S^d_+ costs O(d^3). Nonetheless, by delving into problem (3) we can characterize a certain low-rank structure shared by all of its optimal solutions. In particular, such a low-rank structure can be proven to be of the form UWU⊤, in which U is a low-rank basis lying in the column space of the data matrix X.

Let us perform the singular value decomposition (SVD) of the data matrix X = [x_1, ..., x_n] ∈ R^{d×n}, obtaining X = UΣV⊤ = Σ_{i=1}^m σ_i u_i v_i⊤, where m (≤ n ≪ d) is the rank of X, σ_1, ..., σ_m are the positive singular values, Σ = diag(σ_1, ..., σ_m), and U = [u_1, ..., u_m] ∈ R^{d×m} and V = [v_1, ..., v_m] ∈ R^{n×m} are the matrices composed of the left- and right-singular vectors, respectively. It is obvious that range(X) ≡ range(U), in which we write range(A) = span{a_i}_i for the column space of a given matrix A. Significantly, the optimality of problem (3) can be characterized as follows.

Lemma 1. For any optimal solution M⋆ to problem (3), we have M⋆ ∈ {UWU⊤ | W ∈ S^m_+}.

Lemma 1 reveals that although we are seeking M⋆ in S^d_+, any optimal solution M⋆ must obey the low-rank structure UWU⊤ because of the nature of problem (3). This important discovery enables us to first project the data onto the low-rank basis U and then seek W by optimizing a new problem in Eq. (4), which is at a much smaller scale than the raw problem in Eq. (3).

Theorem 1. For any i ∈ [1 : n], denote x̃_i = U⊤x_i ∈ R^m. Then M⋆ is an optimal solution to problem (3) if and only if M⋆ = UW⋆U⊤, in which W⋆ belongs to the set

    arg min_{W ∈ S^m_+}  f̃(W) := Σ_{i,j=1}^n [ỹ_ij − y_ij x̃_i⊤Wx̃_j]_+ + α tr(W).    (4)

Theorem 1 indicates that through a simple projection step (x_i → U⊤x_i), we are able to reduce the high-dimensional metric learning task in Eq. (3) to a lower dimensional metric learning problem in Eq. (4) with guarantees, thereby greatly reducing the computational costs. However, the scale of problem (4) can still be non-negligible, which severely limits the use of off-the-shelf interior point based solvers such as SDPT3 (Toh, Todd, and Tütüncü 1999). In the next subsection, we devise an efficient first-order method to solve problem (4).
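
As a sanity check on this reduction, the following sketch (toy sizes chosen arbitrarily for the n ≪ d setting; not the authors' code) carries out the projection step x̃_i = U⊤x_i with a thin SVD and confirms that bilinear similarities computed with any W ∈ S^m_+ in the projected space agree with those computed in R^d through M = UWU⊤:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 5000, 200                              # hypothetical sizes with n << d
    X = rng.standard_normal((d, n))

    # Thin SVD of X (equivalently, the eigen-decomposition of the n x n matrix X^T X).
    U, sigma, _ = np.linalg.svd(X, full_matrices=False)
    m = int((sigma > 1e-10 * sigma[0]).sum())     # numerical rank m <= n
    U = U[:, :m]
    X_proj = U.T @ X                              # m x n projected data, x~_i = U^T x_i

    W = np.eye(m)                                 # stand-in for the learned W* of problem (4)
    i, j = 0, 1
    s_low = X_proj[:, i] @ W @ X_proj[:, j]       # similarity in the m-dimensional space
    s_high = X[:, i] @ U @ W @ U.T @ X[:, j]      # same similarity via M = U W U^T
    assert np.allclose(s_low, s_high)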

Similar projection tricks have recently been exploited and proven for convex distance metric learning problems regularized by the squared Frobenius norm (Chatpatanasiri et al. 2010) and by several other strictly convex functions (Kulis 2012); their proof strategies, however, cannot carry over to the trace norm since the trace norm is not strictly convex. To the best of our knowledge, the structural representation UW⋆U⊤ characterizing optimal metric solutions is explored here for the first time for trace norm regularized convex metric learning.

Linearized ADMM

We now develop an efficient optimization method to exactly solve Eq. (4). To facilitate the derivations, we use three matrix operators: the Hadamard product ∘, the inner product ⟨·, ·⟩, and the Frobenius norm ‖·‖_F.

We first introduce a dummy variable matrix Z = [z_ij]_ij ∈ R^{n×n} to equivalently rewrite problem (4) as

    min_{W ∈ S^m_+, Z}  Σ_{i,j=1}^n [z_ij]_+ + α tr(W)
    s.t.  Z − Ỹ + Y ∘ (X̃⊤WX̃) = 0,    (5)

in which three constant matrices X̃ = [x̃_1, ..., x̃_n] = U⊤X ∈ R^{m×n}, Y = [y_ij]_ij ∈ R^{n×n}, and Ỹ = [ỹ_ij]_ij ∈ R^{n×n} are involved. Both the objective function and the constraint in Eq. (5) are separable in the two variables Z and W, which naturally suggests the use of alternating direction methods. To this end, we employ the increasingly popular Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011) to cope with Eq. (5). ADMM works on the following augmented Lagrangian function:

    L_ρ(Z, W; Λ) := Σ_{i,j=1}^n [z_ij]_+ + α tr(W) + ⟨Λ, Z − Ỹ + Y ∘ (X̃⊤WX̃)⟩ + (ρ/2)‖Z − Ỹ + Y ∘ (X̃⊤WX̃)‖²_F,    (6)

where Λ = [λ_ij]_ij ∈ R^{n×n} is the multiplier for the linear constraint in Eq. (5) (a.k.a. the dual variable), and ρ > 0 is the penalty parameter. ADMM proceeds by successively and alternatingly updating the three variable matrices Z^k, W^k, and Λ^k (k = 0, 1, ...).

Z, Λ-Updates

We first present the updates for Z and Λ, which are easier to derive than the update for W.

The Z-update is achieved by minimizing L_ρ(Z, W; Λ) with respect to Z while keeping W and Λ fixed, that is,

    Z^{k+1} := arg min_Z L_ρ(Z, W^k; Λ^k) = arg min_Z Σ_{i,j=1}^n [z_ij]_+ + (ρ/2)‖Z − Ỹ + Y ∘ (X̃⊤W^kX̃) + Λ^k/ρ‖²_F.

To solve this Z subproblem, we utilize the proximal operator T_θ(a) = arg min_{z ∈ R} θ[z]_+ + (1/2)(z − a)² (θ > 0) to handle [z_ij]_+. T_θ(a) is the proximal mapping of θ[z]_+ and was shown in (Ye, Chen, and Xie 2009) to be

    T_θ(a) = a − θ  if a > θ,
             0      if 0 ≤ a ≤ θ,
             a      if a < 0.    (7)

By applying T_{1/ρ} elementwise, we obtain a closed-form solution to the Z subproblem as

    Z^{k+1} := T_{1/ρ}(Ỹ − Y ∘ (X̃⊤W^kX̃) − Λ^k/ρ).    (8)

The Λ-update is prescribed to be

    Λ^{k+1} := Λ^k + ρ(Z^{k+1} − Ỹ + Y ∘ (X̃⊤W^{k+1}X̃)).    (9)
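
In code, these two closed-form updates are one-liners; the sketch below (variable names are mine and purely illustrative) applies T_{1/ρ} of Eq. (7) elementwise to realize the Z-update of Eq. (8) and then forms the Λ-update of Eq. (9):

    import numpy as np

    def prox_T(A, theta):
        # Elementwise proximal operator of theta * [z]_+ from Eq. (7):
        # a - theta if a > theta, 0 if 0 <= a <= theta, a if a < 0.
        return np.where(A > theta, A - theta, np.where(A < 0.0, A, 0.0))

    def z_update(Ytil, Y, Xt, W, Lam, rho):
        # Eq. (8): Z^{k+1} = T_{1/rho}(Y~ - Y o (X~^T W^k X~) - Lam^k / rho).
        return prox_T(Ytil - Y * (Xt.T @ W @ Xt) - Lam / rho, 1.0 / rho)

    def lam_update(Lam, Z, Ytil, Y, Xt, W, rho):
        # Eq. (9): Lam^{k+1} = Lam^k + rho * (Z^{k+1} - Y~ + Y o (X~^T W^{k+1} X~)).
        return Lam + rho * (Z - Ytil + Y * (Xt.T @ W @ Xt))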

W-Update

In the standard ADMM framework, W is updated by minimizing L_ρ(Z, W; Λ) with respect to W while Z and Λ are fixed. Noticing that X̃X̃⊤ = ΣV⊤VΣ = Σ², we derive the W-update as follows:

    W^{k+1} := arg min_{W ∈ S^m_+} L_ρ(Z^{k+1}, W; Λ^k)
             = arg min_{W ∈ S^m_+} α tr(W) + ⟨Λ^k, Y ∘ (X̃⊤WX̃)⟩ + (ρ/2)‖Z^{k+1} − Ỹ + Y ∘ (X̃⊤WX̃)‖²_F
             = arg min_{W ∈ S^m_+} (ρ/2)‖X̃⊤WX̃‖²_F + ⟨αI, W⟩ + ⟨Λ^k + ρZ^{k+1} − ρỸ, Y ∘ (X̃⊤WX̃)⟩
             = arg min_{W ∈ S^m_+} (ρ/2)⟨WΣ², Σ²W⟩ + ⟨αI, W⟩ + ⟨X̃(Y ∘ (Λ^k + ρZ^{k+1} − ρỸ))X̃⊤, W⟩
             = arg min_{W ∈ S^m_+} (1/2)⟨WΣ², Σ²W⟩ + ⟨αI/ρ + X̃(Y ∘ (Z^{k+1} − Ỹ + Λ^k/ρ))X̃⊤, W⟩
             = arg min_{W ∈ S^m_+} g(W; Z^{k+1}, Λ^k),    (10)

where I denotes the m × m identity matrix. Unfortunately, no analytical solution is available for this W subproblem in Eq. (10), and resorting to iterative optimization methods to solve Eq. (10) would limit the efficiency of ADMM. In what follows, we develop a linearized modification of the original ADMM framework in Eqs. (8)(10)(9) by rectifying the W-update in Eq. (10) into an easier one.

Linearization. Rather than dealing with the intrinsically quadratic g(W; Z^{k+1}, Λ^k), we resort to optimizing a surrogate function ĝ_τ(W; Z^{k+1}, Λ^k), which is the linear approximation to g augmented by a quadratic proximal term. Concretely, the rectified W-update is

    W^{k+1} := arg min_{W ∈ S^m_+} ĝ_τ(W; Z^{k+1}, Λ^k)
             = arg min_{W ∈ S^m_+} g(W^k; Z^{k+1}, Λ^k) + ⟨∇_W g(W; Z^{k+1}, Λ^k)|_{W=W^k}, W − W^k⟩ + (1/2τ)‖W − W^k‖²_F
             = arg min_{W ∈ S^m_+} ⟨G^k, W − W^k⟩ + (1/2τ)‖W − W^k‖²_F
             = arg min_{W ∈ S^m_+} (1/2τ)‖W − (W^k − τG^k)‖²_F
             = P_{S^m_+}(W^k − τG^k).    (11)

In Eq. (11) we introduce the matrix

    G^k = ∇_W g(W; Z^{k+1}, Λ^k)|_{W=W^k} = αI/ρ + X̃(Y ∘ (Z^{k+1} − Ỹ + Λ^k/ρ))X̃⊤ + Σ²W^kΣ².    (12)

The operator P_{S^m_+}(·) denotes the projection onto the PSD cone S^m_+; for example, P_{S^m_+}(A) = Σ_{i=1}^m [γ_i]_+ p_i p_i⊤ for any symmetric matrix A ∈ R^{m×m} whose eigenvector-eigenvalue pairs are (p_i, γ_i), i = 1, ..., m. The parameter τ > 0 is the step size, which will be specified later.
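
Both ingredients of the rectified update are straightforward to implement; the following sketch (illustrative, with my own argument names) assembles G^k from Eq. (12) and realizes Eq. (11) as a single projected gradient step, where the PSD projection clips the negative eigenvalues as described above:

    import numpy as np

    def project_psd(A):
        # P_{S^m_+}(A): keep only the nonnegative eigenvalues of the symmetrized A.
        A = (A + A.T) / 2.0
        gamma, P = np.linalg.eigh(A)
        gamma = np.maximum(gamma, 0.0)
        return (P * gamma) @ P.T

    def w_update(W, Z, Ytil, Y, Xt, Lam, sigma2, alpha, rho, tau):
        # Eq. (12): G = alpha*I/rho + X~ (Y o (Z - Y~ + Lam/rho)) X~^T + Sigma^2 W Sigma^2,
        # where sigma2 is the vector of squared singular values (the diagonal of Sigma^2).
        m = W.shape[0]
        G = (alpha / rho) * np.eye(m) \
            + Xt @ (Y * (Z - Ytil + Lam / rho)) @ Xt.T \
            + (sigma2[:, None] * W) * sigma2[None, :]
        # Eq. (11): projected gradient step with step size tau.
        return project_psd(W - tau * G)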

The advantage of linearizing g(W; Z^{k+1}, Λ^k) is that the resulting W-update in Eq. (11) enjoys an analytical expression and consequently works very efficiently. This modification of the standard ADMM, i.e., optimizing ĝ_τ instead of g, is the core ingredient of our framework, which we call the linearized ADMM. The following theorem certifies that the developed linearized ADMM is theoretically sound, producing a globally optimal solution to the raw problem in Eq. (5).

Theorem 2. Given 0 < τ < 1/‖X̃‖_op^4 = 1/‖X‖_op^4, the sequence {Z^k, W^k, Λ^k}_k generated by the linearized ADMM in Eqs. (8)(11)(9), starting from any symmetric (Z^0, W^0, Λ^0), converges to an optimal solution of problem (5).

Note that ‖·‖_op denotes the operator norm of matrices. For simplicity, we initialize W^0 = I_{m×m} and Λ^0 = 0_{n×n} to launch the linearized ADMM.

Algorithm

We summarize the proposed Low-Rank Similarity Metric Learning (LRSML) approach in Algorithm 1, which consists of two steps: 1) SVD-based projection, and 2) lower dimensional metric learning via the linearized ADMM. In the projection step, the SVD of X can be performed efficiently in O(dn²) time by running the eigen-decomposition of the matrix X⊤X. If the data vectors in X are zero-centered, this decomposition is exactly PCA. In the linearized ADMM step, we use the scaled dual variable matrix Λ̃^k = Λ^k/ρ. Importantly, there is no need to explicitly compute or even store the large d × d metric matrix M; instead, we only store the low-rank basis U and update the much smaller m × m metric matrix W^k, as indicated by Theorem 1. In doing so, LRSML circumvents expensive matrix manipulations such as projecting M onto S^d_+, thereby accomplishing considerable savings in both storage and computation. The total time complexity of LRSML is bounded by O(dn² + Tn³), which scales linearly with the input dimensionality d. In practice, we find that the linearized ADMM converges rapidly within T = 1,000 iterations under the setting ρ = 1, τ = 0.01 (see the convergence curves in the supplemental material).

Algorithm 1: Low-Rank Similarity Metric Learning
  Input: the data matrix X ∈ R^{d×n} (d ≫ n), the pairwise label matrix Y ∈ {1, −1}^{n×n}, the regularization parameter α > 0, three positive constants ε, ρ, τ, and the budget iteration number T.
  SVD Projection: perform the SVD of X = UΣV⊤ to obtain U ∈ R^{d×m} and Σ ∈ R^{m×m}_+, and compute the projected data X̃ = U⊤X ∈ R^{m×n}.
  Linearized ADMM:
    form the matrix Ỹ ∈ {1, ε}^{n×n} according to Y;
    initialize W^0 = I_{m×m}, Λ̃^0 = 0_{n×n}, S^0 = X̃⊤X̃;
    for k = 1, ..., T do
      Z^k ← T_{1/ρ}(Ỹ − Y ∘ S^{k−1} − Λ̃^{k−1}),
      G^k ← (α/ρ)I_{m×m} + X̃(Y ∘ (Z^k − Ỹ + Λ̃^{k−1}))X̃⊤ + Σ²W^{k−1}Σ²,
      W^k ← P_{S^m_+}(W^{k−1} − τG^k),    S^k ← X̃⊤W^kX̃,
      Λ̃^k ← Λ̃^{k−1} + Z^k − Ỹ + Y ∘ S^k,
      if ‖Z^k − Ỹ + Y ∘ S^k‖²_F / n² ≤ 10⁻¹² then break
    end for
  Output: run the eigen-decomposition W^k = HEH⊤, in which E ∈ R^{r×r} retains the positive eigenvalues, and output the optimal low-rank basis L⋆ = UHE^{1/2} ∈ R^{d×r} (the corresponding low-rank metric is M⋆ = L⋆L⋆⊤).
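
For readers who prefer an executable transcription, here is a compact NumPy sketch of Algorithm 1 (my own naming and numerical-rank tolerances; a sketch of the two-step scheme rather than the authors' released implementation):

    import numpy as np

    def prox_T(A, theta):
        # Elementwise proximal operator T_theta of Eq. (7).
        return np.where(A > theta, A - theta, np.where(A < 0.0, A, 0.0))

    def project_psd(A):
        # Projection onto the PSD cone via eigenvalue clipping (Eq. (11)).
        A = (A + A.T) / 2.0
        w, P = np.linalg.eigh(A)
        return (P * np.maximum(w, 0.0)) @ P.T

    def lrsml(X, Y, alpha, eps=0.1, rho=1.0, tau=0.01, T=1000, tol=1e-12):
        """Algorithm 1: SVD projection followed by the linearized ADMM."""
        d, n = X.shape
        # SVD projection (the paper's O(dn^2) route via the eigen-decomposition of X^T X works equally well).
        U, sigma, _ = np.linalg.svd(X, full_matrices=False)
        keep = sigma > 1e-10 * sigma.max()
        U, sigma = U[:, keep], sigma[keep]
        Xt = U.T @ X                                  # m x n projected data X~
        Sig2 = np.diag(sigma ** 2)                    # Sigma^2

        Ytil = np.where(Y == -1, eps, 1.0)            # margin targets Y~
        m = Xt.shape[0]
        W = np.eye(m)                                 # W^0
        Lam = np.zeros((n, n))                        # scaled dual variable
        S = Xt.T @ W @ Xt                             # S^0
        for _ in range(T):
            Z = prox_T(Ytil - Y * S - Lam, 1.0 / rho)
            G = (alpha / rho) * np.eye(m) + Xt @ (Y * (Z - Ytil + Lam)) @ Xt.T + Sig2 @ W @ Sig2
            W = project_psd(W - tau * G)
            S = Xt.T @ W @ Xt
            R = Z - Ytil + Y * S
            Lam = Lam + R
            if np.linalg.norm(R) ** 2 / n ** 2 <= tol:
                break
        # Output: L* = U H E^{1/2} from the positive eigenpairs of W.
        w, H = np.linalg.eigh(W)
        pos = w > 1e-10
        return U @ (H[:, pos] * np.sqrt(w[pos]))      # d x r low-rank basis L*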

Discussion

A primary advantage of our analysis in Lemma 1 and Theorem 1 is that it avoids the expensive projection onto the high-dimensional PSD cone S^d_+, which was required by previous trace norm regularized metric learning methods such as (McFee and Lanckriet 2010; Lim, McFee, and Lanckriet 2013). In the supplemental material, we further provide an in-depth theoretical analysis (see Lemma 3 and Theorem 3) that justifies the low-rank solution structure M⋆ = UW⋆U⊤ for any convex loss function in terms of x_i⊤Mx_j regularized by the trace norm tr(M) or the squared Frobenius norm ‖M‖²_F. As a result, our analysis directly leads to scalable O(d) algorithms for low-rank distance or similarity metric learning supervised by instance-level, pairwise, or listwise label information. For example, our analysis would give an O(d)-time algorithm for optimizing the low-rank distance metric learning objective in (McFee and Lanckriet 2010) (a hinge loss based on listwise supervision plus a trace norm regularizer) by following our proposed two-step scheme: SVD projection + lower dimensional metric learning.

Linearized ADMM techniques have been investigated in the optimization community and applied to problems arising from compressed sensing (Yang and Zhang 2011), image processing (Zhang et al. 2010), and nuclear norm regularized convex optimization (Yang and Yuan 2013). Nevertheless, the linearized ADMM method developed in this paper is self-motivated in the sense that it was derived for solving a particular optimization problem of low-rank metric learning. To the best of our knowledge, linearized ADMM has not been applied to metric learning before, and the analytic-form updates produced by our linearized ADMM method are quite advantageous. Note that (Lim, McFee, and Lanckriet 2013) applied the standard ADMM to solve a subproblem in every iteration of their optimization method, but executing ADMM even once costs cubic time complexity O(T′d³) (T′ is the iteration number of ADMM), which is computationally infeasible in high dimensions. By contrast, our linearized ADMM tackles the entire metric optimization in O(Tn³) time, performing at least one order of magnitude faster than the standard ADMM by virtue of the elegant W-update.

Experiments

We carry out experiments on four benchmark datasets, including two document datasets, Reuters-28 and TDT2-30 (Cai, He, and Han 2011), and two image datasets, UIUC-Sports (Li and Fei-Fei 2007) and UIUC-Scene (Lazebnik, Schmid, and Ponce 2006). Note that we choose 28 categories from the raw Reuters-21578 corpus (Cai, He, and Han 2011) such that each category has no fewer than 20 examples. In Reuters-28 and TDT2-30, each document is represented by an ℓ2 normalized TF-IDF feature vector; in UIUC-Sports and UIUC-Scene, each image is represented by an ℓ2 normalized sparse-coding feature vector (Wang et al. 2010). The basic information about these four datasets is shown in Table 1.

Table 1: Basic descriptions of the four datasets used in the experiments.

    Dataset        # Classes   # Samples   # Dimensions
    Reuters-28        28         8,030        18,933
    TDT2-30           30         9,394        36,771
    UIUC-Sports        8         1,579        87,040
    UIUC-Scene        15         4,485        21,504

The baseline method, named "Original", takes the original feature vectors for 1NN classification, and a linear SVM is also included in the comparison. We evaluate and compare eight metric and similarity learning methods: two classical linear dimensionality reduction methods, Latent Semantic Analysis (LSA) (Deerwester et al. 1990) and Fisher Linear Discriminant Analysis (FLDA) (Hastie, Tibshirani, and Friedman 2009); two low-rank variants, LSA-ITML and CM-ITML (Davis and Dhillon 2008), of the representative distance metric learning method Information-Theoretic Metric Learning (ITML) (Davis et al. 2007), where LSA-ITML uses LSA's output as its heuristic low-rank basis and CM-ITML uses the class means as its heuristic low-rank basis; two recent low-rank distance metric learning methods, Metric Learning to Rank (MLR) (McFee and Lanckriet 2010) and Metric Learning for the Nearest Class Mean classifier (MLNCM) (Mensink et al. 2013), where MLNCM also uses LSA's output as its heuristic low-rank basis; a similarity learning method, Adaptive Regularization Of MAtrix models (AROMA) (Crammer and Chechik 2012); and the Low-Rank Similarity Metric Learning (LRSML) method proposed in this paper. Except for MLNCM, which requires the Nearest Class Mean classifier, all the compared metric/similarity learning methods are paired with the 1NN classifier. Note that we choose to run LSA instead of PCA because of the nonnegative and sparse nature of the used feature vectors. LSA-ITML, CM-ITML, MLR, and LRSML are convex methods, while MLNCM is a nonconvex method. We choose the most efficient version of AROMA, which restricts the matrix M used in the similarity function to be diagonal. To implement MLR efficiently, we follow the two-step scheme justified in this paper: SVD projection followed by MLR with lower dimensionality.

Except for AROMA, which does not impose the PSD constraint on M, each of the referred methods can yield a low-rank basis L. For these methods, we try three measures between two inputs x and x′: (i) the distance ‖L⊤x − L⊤x′‖, (ii) the inner product (L⊤x)⊤(L⊤x′), and (iii) the cosine (L⊤x)⊤(L⊤x′)/(‖L⊤x‖‖L⊤x′‖). For LSA, FLDA, LSA-ITML, CM-ITML, MLR, and MLNCM, the distance and cosine measures are tried; for LRSML, the inner-product and cosine measures are tried, since the inner-product measure (ii) is exactly the bilinear similarity function shown in Eq. (1). The baseline "Original" gives the same results under the three measures, as the feature vectors are already ℓ2 normalized.
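
For concreteness, the three measures are computed from a learned low-rank basis L as in the following small sketch (an illustrative helper, not code from the paper):

    import numpy as np

    def pairwise_measures(L, x, x2):
        # Compare two inputs after the low-rank mapping x -> L^T x.
        u, v = L.T @ x, L.T @ x2
        distance = np.linalg.norm(u - v)                              # measure (i)
        inner = u @ v                                                 # measure (ii): the bilinear similarity of Eq. (1)
        cosine = inner / (np.linalg.norm(u) * np.linalg.norm(v))      # measure (iii)
        return distance, inner, cosine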

Let C be the number of classes in each dataset. On Reuters-28 and TDT2-30, we select 5 × C up to 30 × C samples for training such that each category covers at least one sample; we pick the same number of samples for cross-validation; the remaining samples are used for testing. We repeat 20 random training/validation/test splits, and then report the average classification error rates and training time for all the competing methods in Tables 2 and 3. On the two image datasets, we follow the commonly used evaluation protocols as in (Lazebnik, Schmid, and Ponce 2006; Wang et al. 2010). On UIUC-Sports, we select 10 to 70 samples per class for training, half that number of samples for validation, and the remaining ones for testing; on UIUC-Scene, we follow a similar setting and the training samples range from 10 to 100 per class. Over 20 runs, the averages of per-class recognition rates as well as the training time are reported in the tables of the supplemental material. To run our proposed method LRSML, we fix ε = 0.1, ρ = 1, and find that τ = 0.01 makes the linearized ADMM converge within T = 1,000 iterations on all datasets. The results shown in all the referred tables demonstrate that LRSML consistently leads to the highest accuracy in terms of 1NN classification. Across all datasets, the best two results are achieved by LRSML with its two measures; the cosine measure almost always results in slightly higher classification accuracy than the inner-product measure. More experimental results are presented in the supplemental material.

Conclusions

The proposed LRSML formulates a trace norm regularized convex objective which is proven to yield a globally optimal solution with a certain low-rank structure. With this optimality guarantee, the challenging high-dimensional metric learning task can be reduced to a lower dimensional metric learning problem after a simple SVD projection. The linearized ADMM is further developed to efficiently solve the reduced problem. Our approach maintains linear space and time complexities in the input dimensionality and is consequently scalable to high-dimensional data domains. Through extensive experiments carried out on four benchmark datasets with up to 87,000 dimensions, we certify that the learned low-rank similarity metrics are well suited to high-dimensional problems and exhibit prominent performance gains in kNN classification. In future work, we would like to investigate generalization properties of the proposed LRSML, as in (Cao, Guo, and Ying 2012), and aim to accomplish a tighter generalization bound for trace norm regularized convex metric learning.

Acknowledgments

During an earlier version of this work, Wei Liu was supported by the Josef Raviv Memorial Postdoctoral Fellowship. The authors would like to thank Prof. Gang Hua and Dr. Fuxin Li for constructive suggestions that helped in making the final version.
                          References                                           Kunapuli, G., and Shavlik, J. 2012. Mirror descent for metric
                                                                               learning: A unified approach. In Proc. ECML.
Bar-Hillel, A.; Hertz, T.; Shental, N.; and Weinshall, D. 2005.
Learning a mahalanobis metric from equivalence constraints.                    Lazebnik, S.; Schmid, C.; and Ponce, J. 2006. Beyond bags of
JMLR 6:937–965.                                                                features: Spatial pyramid matching for recognizing natural scene
                                                                               categories. In Proc. CVPR.
Bellet, A.; Habrard, A.; and Sebban, M. 2013. A survey on metric
                                                                               Li, L.-J., and Fei-Fei, L. 2007. What, where and who? classifying
learning for feature vectors and structured data. arXiv:1306.6709.
                                                                               event by scene and object recognition. In Proc. ICCV.
Boyd, S. P.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011.
                                                                               Lim, D. K. H., and Lanckriet, G. 2014. Efficient learning of maha-
Distributed optimization and statistical learning via the alternating
                                                                               lanobis metrics for ranking. In Proc. ICML.
direction method of multipliers. Foundations and Trendsr in Ma-
chine Learning 3(1):1–122.                                                     Lim, D. K. H.; McFee, B.; and Lanckriet, G. 2013. Robust struc-
                                                                               tural metric learning. In Proc. ICML.
Cai, D.; He, X.; and Han, J. 2011. Locally consistent concept fac-
torization for document clustering. IEEE Transactions on Knowl-                Liu, W.; Ma, S.; Tao, D.; Liu, J.; and Liu, P. 2010a. Semi-
edge and Data Engineering 23(6):902–913.                                       supervised sparse metric learning using alternating linearization
                                                                               optimization. In Proc. KDD.
Candès, E. J., and Recht, B. 2009. Exact matrix completion via
convex optimization. Foundations of Computational mathematics                  Liu, W.; Tian, X.; Tao, D.; and Liu, J. 2010b. Constrained metric
9(6):717–772.                                                                  learning via distance gap maximization. In Proc. AAAI.
Cao, Q.; Guo, Z.-C.; and Ying, Y. 2012. Generalization bounds for              McFee, B., and Lanckriet, G. 2010. Metric learning to rank. In
metric and similarity learning. Technical report.                              Proc. ICML.
                                                                               Mensink, T.; Verbeek, J.; Perronnin, F.; and Csurka, G. 2013.
Chatpatanasiri, R.; Korsrilabutr, T.; Tangchanachaianan, P.; and Ki-
                                                                               Distance-based image classification: Generalizing to new classes
jsirikul, B. 2010. A new kernelization framework for mahalanobis
                                                                               at near zero cost. TPAMI 35(11):2624–2637.
distance learning algorithms. Neurocomputing 73(10):1570–1579.
                                                                               Niu, G.; Dai, B.; Yamada, M.; and Sugiyama, M.                  2012.
Chechik, G.; Sharma, V.; Shalit, U.; and Bengio, S. 2010. Large                Information-theoretic semi-supervised metric learning via entropy
scale online learning of image similarity through ranking. JMLR                regularization. In Proc. ICML.
11:1109–1135.
                                                                               Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed
Cheng, L. 2013. Riemannian similarity learning. In Proc. ICML.                 minimum-rank solutions of linear matrix equations via nuclear
Crammer, K., and Chechik, G. 2012. Adaptive regularization for                 norm minimization. SIAM review 52(3):471–501.
weight matrices. In Proc. ICML.                                                Shen, C.; Kim, J.; Wang, L.; and van den Hengel, A. 2012. Positive
Davis, J. V., and Dhillon, I. S. 2008. Structured metric learning for          semidefinite metric learning using boosting-like algorithms. JMLR
high dimensional problems. In Proc. KDD.                                       13:1007–1036.

Table 2: Reuters-28 dataset: classification error rates and training time of eight competing metric/similarity learning methods.

                               5×28 training samples                       30×28 training samples
  Method       Measure         Error Rate (%)  Train Time (sec)    Measure         Error Rate (%)  Train Time (sec)
  Original     –               23.94±1.63      –                   –               16.56±0.55      –
  Linear SVM   –               21.25±1.46      0.03                –                9.16±0.43      0.12
  LSA          distance        42.21±4.48      0.02                distance        24.56±5.30      0.55
               cosine          21.99±1.89                          cosine          11.85±0.48
  FLDA         distance        39.46±6.10      0.03                distance        15.19±1.38      0.75
               cosine          21.34±3.07                          cosine          10.05±0.65
  LSA-ITML     distance        20.17±0.95      47.2                distance        10.06±0.99      6158.0
               cosine          21.29±1.02                          cosine           9.06±0.45
  CM-ITML      distance        33.38±3.57      7.3                 distance        18.18±0.74      219.3
               cosine          23.49±1.42                          cosine          13.28±0.76
  MLR          distance        27.08±2.75      17.8                distance        13.29±1.14      1805.7
               cosine          20.22±1.44                          cosine           9.86±0.53
  MLNCM        distance        21.07±1.62      1.1                 distance        12.62±0.55      138.5
               cosine          22.88±2.52                          cosine          13.51±0.79
  AROMA        similarity      21.25±1.53      33.1                similarity      13.49±0.63      2328.1
  LRSML        inner-product   15.56±1.55      9.1                 inner-product    8.84±0.61      676.0
               cosine          14.83±1.20                          cosine           7.29±0.40
  (Training time is reported once per method; it does not depend on the test-time measure.)

Table 3: TDT2-30 dataset: classification error rates and training time of eight competing metric/similarity learning methods.

                               5×30 training samples                       30×30 training samples
  Method       Measure         Error Rate (%)  Train Time (sec)    Measure         Error Rate (%)  Train Time (sec)
  Original     –               10.12±1.24      –                   –                6.75±0.39      –
  Linear SVM   –                9.80±2.46      0.06                –                2.59±0.24      0.37
  LSA          distance        30.01±10.14     0.02                distance         8.52±1.19      0.68
               cosine           8.19±1.49                          cosine           4.26±0.44
  FLDA         distance        36.57±4.80      0.04                distance         7.50±0.62      1.08
               cosine           8.77±1.96                          cosine           3.56±0.34
  LSA-ITML     distance        11.35±2.99      35.7                distance         5.68±0.76      1027.4
               cosine           5.85±1.14                          cosine           2.84±0.33
  CM-ITML      distance        32.56±4.87      7.8                 distance         7.67±0.71      297.7
               cosine           8.75±2.48                          cosine           3.42±0.36
  MLR          distance        18.50±3.51      30.1                distance         7.50±0.75      1036.4
               cosine           8.04±1.11                          cosine           3.84±0.29
  MLNCM        distance         8.48±1.38      1.5                 distance         4.69±0.32      231.5
               cosine           8.58±1.64                          cosine           4.39±0.38
  AROMA        similarity      11.31±2.55      35.6                similarity       4.27±0.41      2655.6
  LRSML        inner-product    4.74±0.83      10.2                inner-product    2.37±0.20      813.9
               cosine           4.95±1.06                          cosine           2.27±0.15
  (Training time is reported once per method; it does not depend on the test-time measure.)
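For readers interpreting the measure columns in Tables 2 and 3, the following sketch (not part of the original experiments; all variable names are hypothetical) illustrates how a distance, an inner-product (bilinear) similarity, and a cosine similarity could be computed from a learned low-rank factor L, assuming the learned matrix factors as M = L L^T.

    import numpy as np

    def mahalanobis_distance(x, y, L):
        """Distance measure: sqrt((x - y)^T M (x - y)) with M = L L^T."""
        d = L.T @ (x - y)  # project the difference into the low-rank space
        return np.sqrt(d @ d)

    def bilinear_similarity(x, y, L):
        """Inner-product measure: x^T M y with M = L L^T."""
        return (L.T @ x) @ (L.T @ y)

    def cosine_similarity(x, y, L):
        """Cosine measure: bilinear similarity normalized in the projected space."""
        u, v = L.T @ x, L.T @ y
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    # Toy usage with random data (dimensions are arbitrary for illustration).
    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(100), rng.standard_normal(100)
    L = rng.standard_normal((100, 10))  # input dimension d = 100, rank r = 10
    print(mahalanobis_distance(x, y, L))
    print(bilinear_similarity(x, y, L))
    print(cosine_similarity(x, y, L))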

Toh, K. C.; Todd, M. J.; and Tütüncü, R. H. 1999. SDPT3 - a Matlab software package for semidefinite programming, version 1.3. Optimization Methods and Software 11(1-4):545–581.
Torresani, L., and Lee, K.-C. 2006. Large margin component analysis. In NIPS 19.
Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; and Gong, Y. 2010. Locality-constrained linear coding for image classification. In Proc. CVPR.
Weinberger, K. Q., and Saul, L. K. 2009. Distance metric learning for large margin nearest neighbor classification. JMLR 10:207–244.
Wu, L.; Jin, R.; Hoi, S. C. H.; Zhu, J.; and Yu, N. 2009. Learning Bregman distance functions and its application for semi-supervised clustering. In NIPS 22.
Xing, E. P.; Ng, A. Y.; Jordan, M. I.; and Russell, S. 2002. Distance metric learning with application to clustering with side-information. In NIPS 15.
Yang, J., and Yuan, X. 2013. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation 82(281):301–329.
Yang, J., and Zhang, Y. 2011. Alternating direction algorithms for ℓ1 problems in compressive sensing. SIAM Journal on Scientific Computing 33(1):250–278.
Ye, G.-B.; Chen, Y.; and Xie, X. 2009. Efficient variable selection in support vector machines via the alternating direction method of multipliers. In Proc. AISTATS.
Ying, Y., and Li, P. 2012. Distance metric learning with eigenvalue optimization. JMLR 13:1–26.
Ying, Y.; Huang, K.; and Campbell, C. 2009. Sparse metric learning via smooth optimization. In NIPS 22.
Zhang, X.; Burger, M.; Bresson, X.; and Osher, S. 2010. Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM Journal on Imaging Sciences 3(3):253–276.
