Black Box Submodular Maximization: Discrete and Continuous Settings

Lin Chen¹†    Mingrui Zhang¹‡    Hamed Hassani§    Amin Karbasi†

† Department of Electrical Engineering, Yale University
‡ Department of Statistics and Data Science, Yale University
§ Department of Electrical and Systems Engineering, University of Pennsylvania
¹ These authors contributed equally to this work.

Abstract

In this paper, we consider the problem of black box continuous submodular maximization where we only have access to the function values and no information about the derivatives is provided. For a monotone and continuous DR-submodular function, and subject to a bounded convex body constraint, we propose Black-box Continuous Greedy, a derivative-free algorithm that provably achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d/ε³) function evaluations. We then extend our result to the stochastic setting where function values are subject to stochastic zero-mean noise. It is through this stochastic generalization that we revisit the discrete submodular maximization problem and use the multilinear extension as a bridge between the discrete and continuous settings. Finally, we extensively evaluate the performance of our algorithm on continuous and discrete submodular objective functions using both synthetic and real data.

1 Introduction

Black-box optimization, also known as zeroth-order or derivative-free optimization², has been extensively studied in the literature [Conn et al., 2009, Bergstra et al., 2011, Rios and Sahinidis, 2013, Shahriari et al., 2016]. In this setting, we assume that the objective function is unknown and we can only obtain zeroth-order information such as (stochastic) function evaluations.

Fueled by a growing number of machine learning applications, black-box optimization methods are usually considered in scenarios where gradients (i.e., first-order information) are 1) difficult or slow to compute, e.g., graphical model inference [Wainwright et al., 2008] and structured prediction [Taskar et al., 2005, Sokolov et al., 2016], or 2) inaccessible, e.g., hyper-parameter tuning for natural language processing or image classification Snoek et al. [2012], Thornton et al. [2013], and black-box attacks for finding adversarial examples Chen et al. [2017c], Ilyas et al. [2018]. Even though heuristics such as random or grid search, with undesirable dependencies on the dimension, are still used in some applications (e.g., parameter tuning for deep networks), there has been a growing number of rigorous methods that address the convergence rate of black-box optimization in convex and non-convex settings [Wang et al., 2017, Balasubramanian and Ghadimi, 2018, Sahu et al., 2018].

The focus of this paper is constrained continuous DR-submodular maximization over a bounded convex body. We aim to design an algorithm that uses only zeroth-order information while avoiding expensive projection operations. Note that one way optimization methods can deal with constraints is to apply a projection oracle once the proposed iterates land outside the feasible region. However, computing the projection in many constrained settings is computationally prohibitive (e.g., projection onto bounded trace-norm matrices, the flow polytope, the matroid polytope, or rotation matrices). In such scenarios, projection-free algorithms, a.k.a. Frank-Wolfe [Frank and Wolfe, 1956], replace the projection with a linear program. Indeed, our proposed algorithm efficiently combines the zeroth-order information with solving a series of linear programs to ensure convergence to a near-optimal solution.

² We note that black-box optimization (BBO) and derivative-free optimization (DFO) are not identical terms. Audet and Hare [2017] defined DFO as "the mathematical study of optimization algorithms that do not use derivatives" and BBO as "the study of design and analysis of algorithms that assume the objective and/or constraint functions are given by blackboxes". However, as the differences are nuanced in most scenarios, this paper uses them interchangeably.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

Table 1: Number of function queries in different settings, where D1 is the diameter of K′.

Function                           Additional Assumptions             Function Queries
continuous DR-submodular           monotone, G-Lipschitz, L-smooth    O(max{G, LD1}³ · d/ε³)   [Theorem 1]
stoch. continuous DR-submodular    monotone, G-Lipschitz, L-smooth    O(max{G, LD1}³ · d³/ε⁵)  [Theorem 2]
discrete submodular                monotone                           O(d⁵/ε⁵)                 [Theorem 3]

Continuous DR-submodular functions are an important subset of non-convex functions that can be minimized exactly Bach [2016], Staib and Jegelka [2017] and maximized approximately Bian et al. [2017a,b], Hassani et al. [2017], Mokhtari et al. [2018a], Hassani et al. [2019], Zhang et al. [2019b]. This class of functions generalizes the notion of diminishing returns, usually defined over discrete set functions, to continuous domains. They have found numerous applications in machine learning, including MAP inference in determinantal point processes (DPPs) Kulesza et al. [2012], experimental design Chen et al. [2018c], resource allocation Eghbali and Fazel [2016], and mean-field inference in probabilistic models Bian et al. [2018], among many others.

Motivation: Computing the gradient of a continuous DR-submodular function has been shown to be computationally prohibitive (or even intractable) in many applications. For example, the objective function of influence maximization is defined via specific stochastic processes [Kempe et al., 2003, Rodriguez and Schölkopf, 2012], and computing/estimating the gradient of the multilinear extension would require a relatively high computational complexity. In the problem of D-optimal experimental design, the gradient of the objective function involves the inversion of a potentially large matrix [Chen et al., 2018c]. Moreover, when one attacks a submodular recommender model, only black-box information is available and the service provider is unlikely to provide additional first-order information (this is known as the black-box adversarial attack model) [Lei et al., 2019].

There has been very recent progress on developing zeroth-order methods for constrained optimization problems in convex and non-convex settings Ghadimi and Lan [2013], Sahu et al. [2018]. Such methods typically assume the objective function is defined on the whole R^d so that they can sample points from a proper distribution defined on R^d. For DR-submodular functions, this assumption might be unrealistic, since many DR-submodular functions might be defined only on a subset of R^d; e.g., the multilinear extension Vondrák [2008], a canonical example of DR-submodular functions, is only defined on the unit cube. Moreover, such methods can only guarantee convergence to a first-order stationary point. However, Hassani et al. [2017] showed that for a monotone DR-submodular function, stationary points can only guarantee a 1/2 approximation to the optimum. Therefore, if a state-of-the-art zeroth-order non-convex algorithm is used for maximizing a monotone DR-submodular function, it is likely to terminate at a suboptimal stationary point whose approximation ratio is only 1/2.

Our contributions: In this paper, we propose a derivative-free and projection-free algorithm, Black-box Continuous Greedy (BCG), that maximizes a monotone continuous DR-submodular function over a bounded convex body K ⊆ R^d. We consider three scenarios:

(1) In the deterministic setting, where function evaluations can be obtained exactly, BCG achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d/ε³) function evaluations.

(2) In the stochastic setting, where function evaluations are noisy, BCG achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d³/ε⁵) function evaluations.

(3) In the discrete setting, Discrete Black-box Greedy (DBG) achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d⁵/ε⁵) function evaluations.

All the theoretical results are summarized in Table 1.

We would like to note that in the discrete setting, due to the conservative upper bounds on the Lipschitz and smoothness parameters of general multilinear extensions, and the variance of the gradient estimators subject to noisy function evaluations, the number of function queries required in theory is larger than the best known result, O(d^{5/2}/ε³) in Mokhtari et al. [2018a,b]. However, our experiments show that, empirically, our proposed algorithm often requires significantly fewer function evaluations and less running time, while achieving a practically similar utility.

Novelty of our work: All the previous results in constrained DR-submodular maximization assume access to (stochastic) gradients. In this work, we address a harder problem, i.e., we provide the first rigorous analysis when only (stochastic) function values can be obtained. More specifically, with the smoothing trick [Flaxman et al., 2005], one can construct an unbiased gradient estimator via function queries. However, this estimator has a large O(d²/δ²) variance, which may cause FW-type methods to diverge. To overcome this issue, we build on the momentum method proposed by Mokhtari et al. [2018a], in which they assumed access to first-order information.

Given a point x, the smoothed version of F at x is defined as E_{v∼B^d}[F(x + δv)]. If x is close to the boundary of the domain D, the point x + δv may fall outside of D, leaving the smoothed function undefined for many instances of DR-submodular functions (e.g., the multilinear extension is only defined over the unit cube). Thus the vanilla smoothing trick will not work. To this end, we transform the domain D and the constraint set K in a proper way and run our zeroth-order method on the transformed constraint set K′. Importantly, we retrieve the same convergence rate of O(T^{−1/3}) as in Mokhtari et al. [2018a] with a minimal number of function queries in the different settings (continuous, stochastic continuous, discrete).

We further note that by using more recent variance reduction techniques [Zhang et al., 2019b], one might be able to reduce the required number of function evaluations.

1.1 Further Related Work

Submodular functions Nemhauser et al. [1978], which capture the intuitive notion of diminishing returns, have become increasingly important in various machine learning applications. Examples include graph cuts in computer vision Jegelka and Bilmes [2011a,b], data summarization Lin and Bilmes [2011b,a], Tschiatschek et al. [2014], Chen et al. [2018a, 2017b], influence maximization Kempe et al. [2003], Rodriguez and Schölkopf [2012], Zhang et al. [2016], feature compression Bateni et al. [2019], network inference Chen et al. [2017a], active and semi-supervised learning Guillory and Bilmes [2010], Golovin and Krause [2011], Wei et al. [2015], crowd teaching Singla et al. [2014], dictionary learning Das and Kempe [2011], fMRI parcellation Salehi et al. [2017], compressed sensing and structured sparsity Bach [2010], Bach et al. [2012], fairness in machine learning Balkanski and Singer [2015], Celis et al. [2016], and learning causal structures Steudel et al. [2010], Zhou and Spanos [2016], to name a few. Continuous DR-submodular functions naturally extend the notion of diminishing returns to continuous domains Bian et al. [2017b]. Monotone continuous DR-submodular functions can be (approximately) maximized over convex bodies using first-order methods Bian et al. [2017b], Hassani et al. [2017], Mokhtari et al. [2018a]. Bandit maximization of monotone continuous DR-submodular functions Zhang et al. [2019a] is a closely related setting to ours. However, to the best of our knowledge, none of the existing works has developed a zeroth-order algorithm for maximizing a monotone continuous DR-submodular function. For a detailed review of DFO and BBO, interested readers may refer to the book [Audet and Hare, 2017].

2 Preliminaries

Submodular Functions We say a set function f : 2^Ω → R is submodular if it satisfies the diminishing returns property: for any A ⊆ B ⊆ Ω and x ∈ Ω \ B, we have

    f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B).    (1)

In words, the marginal gain of adding an element x to a subset A is no less than that of adding x to its superset B.

For the continuous analogue, consider a function F : X → R₊, where X = Π_{i=1}^n X_i and each X_i is a compact subset of R₊. We define F to be continuous submodular if F is continuous and for all x, y ∈ X we have

    F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y),    (2)

where ∨ and ∧ are the component-wise maximizing and minimizing operators, respectively.

The continuous function F is called DR-submodular Bian et al. [2017b] if F is differentiable and ∇F(x) ≥ ∇F(y) for all x ≤ y. An important implication of DR-submodularity is that the function F is concave along any non-negative direction, i.e., for x ≤ y we have

    F(y) ≤ F(x) + ⟨∇F(x), y − x⟩.    (3)

The function F is called monotone if F(x) ≤ F(y) for all x ≤ y.
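To make the DR property concrete, the following minimal Python sketch (our own illustration, not from the paper) numerically checks that ∇F(x) ≥ ∇F(y) for x ≤ y on a simple probabilistic-coverage-style function F(x) = 1 − Π_i(1 − p_i x_i); the probabilities p and the test points are arbitrary choices.

```python
import numpy as np

# Illustrative only: F(x) = 1 - prod_i (1 - p_i * x_i) is a simple monotone
# DR-submodular function on [0, 1]^d.  We numerically check the DR condition:
# grad F(x) >= grad F(y) (entry-wise) whenever x <= y.
rng = np.random.default_rng(0)
d = 5
p = rng.uniform(0.1, 0.9, size=d)          # assumed coverage probabilities

def grad_F(x):
    # dF/dx_i = p_i * prod_{j != i} (1 - p_j * x_j)
    terms = 1.0 - p * x
    return p * np.prod(terms) / terms

x = rng.uniform(0.0, 0.5, size=d)          # x <= y component-wise
y = x + rng.uniform(0.0, 0.5, size=d)
assert np.all(grad_F(x) >= grad_F(y) - 1e-12)
```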
Smoothing Trick For a function F defined on R^d, its δ-smoothed version is given as

    F̃_δ(x) ≜ E_{v∼B^d}[F(x + δv)],    (4)

where v is chosen uniformly at random from the d-dimensional unit ball B^d. In words, the function F̃_δ at any point x is obtained by "averaging" F over a ball of radius δ around x. In the sequel, we omit the subscript δ for the sake of simplicity and use F̃ instead of F̃_δ.

Lemma 1 below shows that under the Lipschitz assumption for F, the smoothed version F̃ is a good approximation of F, and also inherits the key structural properties of F (such as monotonicity and submodularity). Thus one can (approximately) optimize F via optimizing F̃.
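As a concrete illustration of the smoothing trick in (4), the short sketch below (our own Python, with a placeholder objective F and an arbitrary sample count) estimates F̃_δ(x) by Monte Carlo sampling from the unit ball.

```python
import numpy as np

def sample_unit_ball(d, rng):
    # Uniform sample from the d-dimensional unit ball B^d.
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return v * rng.uniform(0.0, 1.0) ** (1.0 / d)

def smoothed_value(F, x, delta, n_samples=1000, rng=None):
    # Monte Carlo estimate of the delta-smoothed value in (4):
    # F_delta(x) = E_{v ~ B^d}[F(x + delta * v)].
    rng = rng or np.random.default_rng()
    d = len(x)
    vals = [F(x + delta * sample_unit_ball(d, rng)) for _ in range(n_samples)]
    return float(np.mean(vals))
```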

Lemma 1 (Proof in Appendix A). If F is monotone continuous DR-submodular and G-Lipschitz continuous on R^d, then so is F̃, and

    |F̃(x) − F(x)| ≤ δG.    (5)

An important property of F̃ is that one can obtain an unbiased estimate of its gradient ∇F̃ with a single query of F. This property plays a key role in our proposed derivative-free algorithms.

Lemma 2 (Lemma 6.5 in [Hazan, 2016]). Given a function F on R^d, if we choose u uniformly at random from the (d − 1)-dimensional unit sphere S^{d−1}, then we have

    ∇F̃(x) = E_{u∼S^{d−1}}[(d/δ) F(x + δu) u].    (6)

3 DR-Submodular Maximization

In this paper, we mainly focus on the constrained optimization problem

    max_{x∈K} F(x),    (7)

where F is a monotone continuous DR-submodular function on R^d, and the constraint set K ⊆ X ⊆ R^d is convex and compact.

For first-order monotone DR-submodular maximization, one can use Continuous Greedy Calinescu et al. [2011], Bian et al. [2017b], a variant of the Frank-Wolfe algorithm [Frank and Wolfe, 1956, Jaggi, 2013, Lacoste-Julien and Jaggi, 2015], to achieve the [(1 − 1/e)OPT − ε] approximation guarantee. At iteration t, the FW variant first maximizes the linearization of the objective function F:

    v_t = arg max_{v∈K} ⟨v, ∇F(x_t)⟩.    (8)

Then the current point x_t moves in the direction of v_t with a step size γ_t ∈ (0, 1]:

    x_{t+1} = x_t + γ_t v_t.    (9)

Hence, by solving linear optimization problems, the iterates are updated without resorting to the projection oracle.
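For intuition, the linear subproblem (8) and the update (9) can be carried out with an off-the-shelf LP solver when K is a polytope {v : Av ≤ b, 0 ≤ v ≤ u}. The sketch below is our own illustration (the gradient vector, constraint data, and step size are placeholders), not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def fw_step(grad, A, b, upper, x, gamma):
    # Solve v_t = argmax_{v in K} <v, grad> with K = {v : A v <= b, 0 <= v <= upper},
    # then move x in the direction of v_t with step size gamma, as in (8)-(9).
    res = linprog(c=-grad, A_ub=A, b_ub=b,
                  bounds=[(0.0, ub) for ub in upper], method="highs")
    v = res.x
    return x + gamma * v

# Illustrative usage: one budget constraint sum(v) <= 10 in R^5.
d = 5
A = np.ones((1, d)); b = np.array([10.0]); upper = 5.0 * np.ones(d)
x = np.zeros(d)
grad = np.array([1.0, 3.0, 2.0, 0.5, 4.0])     # placeholder gradient
x = fw_step(grad, A, b, upper, x, gamma=0.1)
```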
Here we introduce our main algorithm, Black-box Continuous Greedy, which assumes access only to function values (i.e., zeroth-order information). This algorithm is partially based on the idea of Continuous Greedy. The basic idea is to utilize the function evaluations of F at carefully selected points to obtain unbiased estimates of the gradient of the smoothed version, ∇F̃. By extending Continuous Greedy to the derivative-free setting and using recently proposed variance reduction techniques, we can then optimize F̃ near-optimally. Finally, by Lemma 1 we show that the obtained optimizer also provides a good solution for F.

Recall that continuous DR-submodular functions are defined on a box X = Π_{i=1}^n X_i. To simplify the exposition, we can assume, without loss of generality, that the objective function F is defined on D ≜ Π_{i=1}^d [0, a_i] Bian et al. [2017a]. Moreover, we note that since F̃(x) = E_{v∼B^d}[F(x + δv)], for x close to ∂D (the boundary of D), the point x + δv may fall outside of D, leaving the function F̃ undefined.

To circumvent this issue, we shrink the domain D by δ. Precisely, the shrunk domain is defined as

    D′_δ = {x ∈ D | d(x, ∂D) ≥ δ}.    (10)

Since we assume D = Π_{i=1}^d [0, a_i], the shrunk domain is D′_δ = Π_{i=1}^d [δ, a_i − δ]. Then for all x ∈ D′_δ, we have x + δv ∈ D, so F̃ is well-defined on D′_δ. By Lemma 1, the optimum of F̃ on the shrunk domain D′_δ will be close to that on the original domain D, if δ is small enough. Therefore, we can first optimize F̃ on D′_δ and thereby approximately optimize F̃ (and thus F) on D. For simplicity of analysis, we also translate the shrunk domain D′_δ by −δ, and denote it as D_δ = Π_{i=1}^d [0, a_i − 2δ].

Besides the domain D, we also need to consider the transformation of the constraint set K. Intuitively, if there were no translation, we would consider the intersection of K and the shrunk domain D′_δ. But since we translate D′_δ by −δ, the same transformation should be performed on K. Thus, we define the transformed constraint set as the translated intersection (by −δ) of D′_δ and K:

    K′ ≜ (D′_δ ∩ K) − δ1 = D_δ ∩ (K − δ1).    (11)

It is well known that the FW algorithm is sensitive to the accuracy of the gradient, and may have arbitrarily poor performance with stochastic gradients Hazan and Luo [2016], Mokhtari et al. [2018b]. Thus we incorporate two variance reduction methods into our proposed algorithm Black-box Continuous Greedy, which correspond to Step 7 and Step 8 of Algorithm 1, respectively. First, instead of the one-point gradient estimator in Lemma 2, we adopt the two-point estimator of ∇F̃(x) [Agarwal et al., 2010, Shamir, 2017]:

    (d/(2δ)) (F(x + δu) − F(x − δu)) u,    (12)

where u is chosen uniformly at random from the unit sphere S^{d−1}. We note that (12) is an unbiased gradient estimator with lower variance than the one-point estimator. We also average over a mini-batch of B_t independently sampled two-point estimators for further variance reduction.
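A minimal sketch of this mini-batch two-point estimator (our own Python, with F standing in for whatever zeroth-order oracle is available) is:

```python
import numpy as np

def two_point_gradient(F, x, delta, batch_size, rng=None):
    # Mini-batch average of the two-point estimator (12):
    # (d / (2*delta)) * (F(x + delta*u) - F(x - delta*u)) * u,  u ~ Unif(S^{d-1}).
    rng = rng or np.random.default_rng()
    d = len(x)
    g = np.zeros(d)
    for _ in range(batch_size):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                       # uniform on the unit sphere
        g += (d / (2.0 * delta)) * (F(x + delta * u) - F(x - delta * u)) * u
    return g / batch_size
```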

The second variance-reduction technique is the momentum method used in [Mokhtari et al., 2018a], which estimates the gradient by a vector ḡ_t that is updated at each iteration as follows:

    ḡ_t = (1 − ρ_t) ḡ_{t−1} + ρ_t g_t.    (13)

Here ρ_t is a given step size, ḡ_0 is initialized as the all-zero vector 0, and g_t is an unbiased estimate of the gradient at iterate x_t. As ḡ_t is a weighted average of the previous gradient approximation ḡ_{t−1} and the newly computed stochastic gradient g_t, it has a lower variance than g_t. Although ḡ_t is not an unbiased estimate of the true gradient, its error approaches zero as time proceeds. The detailed description of Black-box Continuous Greedy is provided in Algorithm 1.

Algorithm 1 Black-box Continuous Greedy
 1: Input: constraint set K, iteration number T, radius δ, step size ρ_t, batch size B_t
 2: Output: x_{T+1} + δ1
 3: x_1 ← 0, ḡ_0 ← 0
 4: for t = 1 to T do
 5:   Sample u_{t,1}, ..., u_{t,B_t} i.i.d. from S^{d−1}
 6:   For i = 1 to B_t, let y⁺_{t,i} ← δ1 + x_t + δu_{t,i}, y⁻_{t,i} ← δ1 + x_t − δu_{t,i}, and evaluate F(y⁺_{t,i}), F(y⁻_{t,i})
 7:   g_t ← (1/B_t) Σ_{i=1}^{B_t} (d/(2δ)) [F(y⁺_{t,i}) − F(y⁻_{t,i})] u_{t,i}
 8:   ḡ_t ← (1 − ρ_t) ḡ_{t−1} + ρ_t g_t
 9:   v_t ← arg max_{v∈K′} ⟨v, ḡ_t⟩
10:   x_{t+1} ← x_t + v_t / T
11: end for
12: Output x_{T+1} + δ1
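Putting the pieces together, the following compact Python sketch mirrors Algorithm 1 under our own simplifying assumptions (it is not the authors' code): the caller supplies the zeroth-order oracle F and a linear maximization oracle lmo over the transformed constraint set K′, e.g., the LP-based routine sketched earlier.

```python
import numpy as np

def bcg(F, lmo, d, T, delta, batch_size, rng=None):
    # Black-box Continuous Greedy (Algorithm 1), sketched.
    #   F   : zeroth-order oracle for the objective
    #   lmo : g -> argmax_{v in K'} <v, g>  (linear maximization over K')
    rng = rng or np.random.default_rng()
    x = np.zeros(d)
    g_bar = np.zeros(d)
    for t in range(1, T + 1):
        rho = 2.0 / (t + 3) ** (2.0 / 3.0)           # step size from Theorem 1
        g = np.zeros(d)
        for _ in range(batch_size):                  # Steps 5-7: two-point estimates
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)
            y_plus = delta + x + delta * u           # delta*1 + x_t +/- delta*u_{t,i}
            y_minus = delta + x - delta * u
            g += (d / (2.0 * delta)) * (F(y_plus) - F(y_minus)) * u
        g /= batch_size
        g_bar = (1.0 - rho) * g_bar + rho * g        # Step 8: momentum averaging
        v = lmo(g_bar)                               # Step 9: linear maximization
        x = x + v / T                                # Step 10
    return x + delta                                 # output x_{T+1} + delta*1
```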
Theorem 1 (Proof in Appendix B). For a monotone continuous DR-submodular function F, which is also G-Lipschitz continuous and L-smooth on a convex and compact constraint set K, if we set ρ_t = 2/(t + 3)^{2/3} in Algorithm 1, then we have

    (1 − 1/e) F(x*) − E[F(x_{T+1} + δ1)] ≤ 3D1 Q^{1/2}/T^{1/3} + L D1²/(2T) + δG(1 + (√d + 1)(1 − 1/e)),

where Q = max{4^{2/3} G², 4cdG²/B_t + 6L²D1²}, c is a constant, D1 = diam(K′), and x* is the global maximizer of F on K.

Remark 1. By setting T = O(1/ε³), B_t = d, and δ = ε/√d, the error term (RHS) is guaranteed to be at most O(ε). Also, the total number of function evaluations is at most O(d/ε³).

We can also extend Algorithm 1 to the stochastic case in which we obtain information about F only through its noisy function evaluations F̂(x) = F(x) + ξ, where ξ is stochastic zero-mean noise. In particular, in Step 6 of Algorithm 1, we obtain independent stochastic function evaluations F̂(y⁺_{t,i}) and F̂(y⁻_{t,i}) instead of the exact function values F(y⁺_{t,i}) and F(y⁻_{t,i}). For unbiased function evaluation oracles with uniformly bounded variance, we have the following theorem.

Theorem 2 (Proof in Appendix C). Under the conditions of Theorem 1, if we further assume that for all x, E[F̂(x)] = F(x) and E[|F̂(x) − F(x)|²] ≤ σ₀², then we have

    (1 − 1/e) F(x*) − E[F(x_{T+1} + δ1)] ≤ 3D1 Q^{1/2}/T^{1/3} + L D1²/(2T) + δG(1 + (√d + 1)(1 − 1/e)),

where D1 = diam(K′), Q = max{4^{2/3} G², 6L²D1² + (4cdG² + 2d²σ₀²/δ²)/B_t}, c is a constant, and x* is the global maximizer of F on K.

Remark 2. By setting T = O(1/ε³), B_t = d³/ε², and δ = ε/√d, the error term (RHS) is at most O(ε). The total number of function evaluations is at most O(d³/ε⁵).

4 Discrete Submodular Maximization

In this section, we describe how Black-box Continuous Greedy can be used to solve a discrete submodular maximization problem with a general matroid constraint, i.e., max_{S∈I} f(S), where f is a monotone submodular set function and I is a matroid.

For any monotone submodular set function f : 2^Ω → R≥0, its multilinear extension F : [0, 1]^d → R≥0, defined as

    F(x) = Σ_{S⊆Ω} f(S) Π_{i∈S} x_i Π_{j∉S} (1 − x_j),    (14)

is monotone and DR-submodular [Calinescu et al., 2011]. Here, d = |Ω| is the size of the ground set Ω. Equivalently, we have F(x) = E_{S∼x}[f(S)], where S ∼ x means that each element i ∈ Ω is included in S with probability x_i independently.

It can be shown that in lieu of solving the discrete optimization problem, one can solve the continuous optimization problem max_{x∈K} F(x), where K = conv{1_I : I ∈ I} is the matroid polytope [Calinescu et al., 2011]. This equivalence is obtained by showing that (i) the optimal values of the two problems are the same, and (ii) for any fractional vector x ∈ K we can deploy efficient, lossless rounding procedures that produce a set S ∈ I such that E[f(S)] ≥ F(x) (e.g., pipage rounding [Ageev and Sviridenko, 2004, Calinescu et al., 2011] and contention resolution [Chekuri et al., 2014]). So we can view F̃ as the underlying function that we intend to optimize, and invoke Black-box Continuous Greedy.
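For illustration, an unbiased estimate of F(x) = E_{S∼x}[f(S)] can be computed by sampling random sets; the sketch below (our own, with a toy set function as a placeholder for f) shows one way to realize such an evaluation oracle.

```python
import numpy as np

def multilinear_estimate(f, x, n_samples=1, rng=None):
    # Unbiased estimate of F(x) = E_{S ~ x}[f(S)]: include each element i
    # independently with probability x_i, evaluate f, and average.
    rng = rng or np.random.default_rng()
    d = len(x)
    vals = []
    for _ in range(n_samples):
        S = {i for i in range(d) if rng.random() < x[i]}
        vals.append(f(S))
    return float(np.mean(vals))

# Toy monotone submodular set function, used only to exercise the estimator.
f = lambda S: len(S | {0})
print(multilinear_estimate(f, np.array([0.5, 0.2, 0.9, 0.1]), n_samples=100))
```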

As a result, we want F to be G-Lipschitz and L-smooth, as in Theorem 1. The following lemma shows that these properties are satisfied automatically if f is bounded.

Lemma 3. For a submodular set function f defined on Ω with sup_{X⊆Ω} |f(X)| ≤ M, its multilinear extension F is 2M√d-Lipschitz and 4M√(d(d − 1))-smooth.

We note that the bounds on the Lipschitz and smoothness parameters actually depend on the norms that we consider. However, different norms are equivalent up to a factor that may depend on the dimension, and if we consider another norm, some dimension factors may be absorbed into the norm. Therefore, we only study the Euclidean norm in Lemma 3.

We further note that computing the exact value of F is difficult, as it requires evaluating f over all subsets S ⊆ Ω. However, one can construct an unbiased estimate of the value F(x) by simply sampling a random set S ∼ x and returning f(S) as the estimate. We present our algorithm in detail in Algorithm 2, where we have D = [0, 1]^d, since F is defined on [0, 1]^d, and thus D_δ = [0, 1 − 2δ]^d. We state the theoretical result formally in Theorem 3.

Algorithm 2 Discrete Black-box Greedy
 1: Input: matroid constraint I, transformed constraint set K′ = D_δ ∩ (K − δ1) where K = conv{1_I : I ∈ I}, number of iterations T, radius δ, step size ρ_t, batch size B_t, sample size S_{t,i}
 2: Output: X_{T+1}
 3: x_1 ← 0, ḡ_0 ← 0
 4: for t = 1 to T do
 5:   Sample u_{t,1}, ..., u_{t,B_t} i.i.d. from S^{d−1}
 6:   For i = 1 to B_t, let y⁺_{t,i} ← δ1 + x_t + δu_{t,i}, y⁻_{t,i} ← δ1 + x_t − δu_{t,i}; independently sample subsets according to y⁺_{t,i} and y⁻_{t,i} for S_{t,i} times each, obtaining sampled subsets Y⁺_{t,i,j}, Y⁻_{t,i,j} for all j ∈ [S_{t,i}]; evaluate the function values f(Y⁺_{t,i,j}), f(Y⁻_{t,i,j}) for all j ∈ [S_{t,i}], and calculate the averages f̄⁺_{t,i} ← (1/S_{t,i}) Σ_{j=1}^{S_{t,i}} f(Y⁺_{t,i,j}) and f̄⁻_{t,i} ← (1/S_{t,i}) Σ_{j=1}^{S_{t,i}} f(Y⁻_{t,i,j})
 7:   g_t ← (1/B_t) Σ_{i=1}^{B_t} (d/(2δ)) (f̄⁺_{t,i} − f̄⁻_{t,i}) u_{t,i}
 8:   ḡ_t ← (1 − ρ_t) ḡ_{t−1} + ρ_t g_t
 9:   v_t ← arg max_{v∈K′} ⟨v, ḡ_t⟩
10:   x_{t+1} ← x_t + v_t / T
11: end for
12: Output X_{T+1} = round(x_{T+1} + δ1)

Theorem 3 (Proof in Appendix E). For a monotone submodular set function f with sup_{X⊆Ω} |f(X)| ≤ M, if we set ρ_t = 2/(t + 3)^{2/3} and S_{t,i} = l in Algorithm 2, then we have

    (1 − 1/e) f(X*) − E[f(X_{T+1})] ≤ 3D1 Q^{1/2}/T^{1/3} + 2M√(d(d − 1)) D1²/T + 2Mδ√d (1 + (√d + 1)(1 − 1/e)),

where D1 = diam(K′), Q = max{2d²M²(1/(lδ²) + 8c)/B_t + 96d(d − 1)M²D1², 4^{5/3} dM²}, c is a constant, and X* is the global maximizer of f under the matroid constraint I.

Remark 3. By setting T = O(d³/ε³), B_t = 1, l = d²/ε², and δ = ε/d, the error term (RHS) is at most O(ε). The total number of function evaluations is at most O(d⁵/ε⁵).

We note that in Algorithm 2, f̄⁺_{t,i} is an unbiased estimate of F(y⁺_{t,i}), and the same holds for f̄⁻_{t,i} and F(y⁻_{t,i}). As a result, we can analyze the algorithm under the framework of stochastic continuous submodular maximization. By applying Theorem 2, Lemma 3, and the facts E[|f̄⁺_{t,i} − F(y⁺_{t,i})|²] ≤ M²/S_{t,i} and E[|f̄⁻_{t,i} − F(y⁻_{t,i})|²] ≤ M²/S_{t,i} directly, we obtain Theorem 3.

5 Experiments

In this section, we compare Black-box Continuous Greedy (BCG) and Discrete Black-box Greedy (DBG) with the following baselines:

(1) Zeroth-Order Gradient Ascent (ZGA) is the projected gradient ascent algorithm equipped with the same two-point gradient estimator as BCG uses. Therefore, it is a zeroth-order projected algorithm.

(2) Stochastic Continuous Greedy (SCG) is the state-of-the-art first-order algorithm for maximizing continuous DR-submodular functions Mokhtari et al. [2018a,b]. Note that it is a projection-free algorithm.

(3) Gradient Ascent (GA) is the first-order projected gradient ascent algorithm Hassani et al. [2017].

The stopping criterion for the algorithms is whenever a given number of iterations is reached. Moreover, the batch sizes B_t in Algorithm 1 and S_{t,i} in Algorithm 2 are both set to 1. Therefore, in the experiments, DBG uses 1 query per iteration while SCG uses O(d) queries.

We perform four sets of experiments, which are described in detail in the following. The first two are maximization of continuous DR-submodular functions, which Black-box Continuous Greedy is designed to solve. The last two are submodular set maximization problems, to which we apply Discrete Black-box Greedy.

The function values at different rounds and the execution times are presented in Fig. 1 and Fig. 2. The first-order algorithms (SCG and GA) are marked in orange, and the zeroth-order algorithms are marked in blue.

Non-convex/non-concave Quadratic Programming (NQP): In this set of experiments, we apply our proposed algorithm and the baselines to the problem of non-convex/non-concave quadratic programming. The objective function is of the form F(x) = ½ xᵀHx + bᵀx, where x is a 100-dimensional vector, H is a 100-by-100 matrix, and every component of H is an i.i.d. random variable whose distribution is equal to that of the negated absolute value of a standard normal distribution. The constraints are Σ_{i=1}^{30} x_i ≤ 30, Σ_{i=31}^{60} x_i ≤ 20, and Σ_{i=61}^{100} x_i ≤ 20. To guarantee that the gradient is non-negative, we set b = −Hᵀ1. One can observe from Fig. 1a that the function value that BCG attains is only slightly lower than that of the first-order algorithm SCG. The final function value that BCG attains is similar to that of ZGA.
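A sketch of how such an NQP instance could be generated, following the description above (the random seed is our own choice, and the snippet is illustrative rather than the authors' experimental code):

```python
import numpy as np

rng = np.random.default_rng(0)                      # seed is our own choice
d = 100
# Each entry of H: negated absolute value of a standard normal (i.i.d.).
H = -np.abs(rng.standard_normal((d, d)))
b = -H.T @ np.ones(d)                               # as in the text, keeps the gradient non-negative

def F(x):
    # Non-convex/non-concave quadratic objective F(x) = 0.5 x^T H x + b^T x.
    return 0.5 * x @ H @ x + b @ x

# Budget constraints over coordinates 1-30, 31-60, and 61-100.
A = np.zeros((3, d))
A[0, :30], A[1, 30:60], A[2, 60:] = 1.0, 1.0, 1.0
budgets = np.array([30.0, 20.0, 20.0])
```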
Topic Summarization: Next, we consider the topic summarization problem [El-Arini et al., 2009, Yue and Guestrin, 2011], which is to maximize the probabilistic coverage of selected articles on news topics. Each news article is characterized by its topic distribution, which is obtained by applying latent Dirichlet allocation to the corpus of Reuters-21578, Distribution 1.0. The number of topics is set to 10, and we choose from 120 news articles. The probabilistic coverage of a subset of news articles (denoted by X) is defined by f(X) = (1/10) Σ_{j=1}^{10} [1 − Π_{a∈X} (1 − p_a(j))], where p_a(·) is the topic distribution of article a. The multilinear extension of f is F(x) = (1/10) Σ_{j=1}^{10} [1 − Π_{a∈Ω} (1 − p_a(j) x_a)], where x ∈ [0, 1]^{120} Iyer et al. [2014]. The constraints are Σ_{i=1}^{40} x_i ≤ 25, Σ_{i=41}^{80} x_i ≤ 30, and Σ_{i=81}^{120} x_i ≤ 35. It can be observed from Fig. 1b that the proposed BCG algorithm achieves the same function value as the first-order algorithm SCG and outperforms the other two. As shown in Fig. 2a, BCG is the most efficient method. The two projection-free algorithms BCG and SCG run faster than the projected methods ZGA and GA. We will elaborate on the running time later in this section.
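The coverage objective and its multilinear extension are easy to express in code; the sketch below is our own rendering of the two formulas above, with a randomly generated topic matrix standing in for the LDA output.

```python
import numpy as np

def coverage(X, P):
    # f(X) = (1/10) * sum_j [1 - prod_{a in X} (1 - p_a(j))],
    # where P[a, j] = p_a(j) is the topic distribution of article a.
    return float(np.mean(1.0 - np.prod(1.0 - P[list(X), :], axis=0)))

def coverage_multilinear(x, P):
    # Multilinear extension F(x) = (1/10) * sum_j [1 - prod_a (1 - p_a(j) * x_a)].
    return float(np.mean(1.0 - np.prod(1.0 - P * x[:, None], axis=0)))

# Illustrative topic matrix: 120 articles, 10 topics (rows sum to 1).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(10), size=120)
print(coverage_multilinear(np.full(120, 0.2), P))
```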
Active Set Selection We study the active set selection problem that arises in Gaussian process regression Mirzasoleiman et al. [2013]. We use the Parkinsons Telemonitoring dataset, which is composed of biomedical voice measurements from people with early-stage Parkinson's disease [Tsanas et al., 2010]. Let X ∈ R^{n×d} denote the data matrix. Each row X[i, :] is a voice recording, while each column X[:, j] denotes an attribute. The covariance matrix Σ is defined by Σ_{ij} = exp(−‖X[:, i] − X[:, j]‖²/h²), where h is set to 0.75. The objective function of the active set selection problem is defined by f(S) = log det(I + Σ_{S,S}), where S ⊆ [d] and Σ_{S,S} is the principal submatrix indexed by S. The 22 attributes are partitioned into 5 disjoint subsets of sizes 4, 4, 4, 5, and 5, respectively. The problem is subject to a partition matroid requiring that at most one attribute be active within each subset. Since this is a submodular set maximization problem, evaluating the gradient (i.e., obtaining an unbiased estimate of the gradient) as required by the first-order algorithms SCG and GA needs 2d function value queries: the i-th component of the gradient is E_{S∼x}[f(S ∪ {i}) − f(S)] and requires two function value queries. It can be observed from Fig. 1c that DBG outperforms the other zeroth-order algorithm ZGA. Although its performance is slightly worse than the two first-order algorithms SCG and GA, it requires a significantly smaller number of function value queries than those two first-order methods (as discussed above).
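A minimal sketch of this objective (our own code; the Gaussian-kernel covariance is written as we read the formula above, and loading of the actual dataset is omitted):

```python
import numpy as np

def rbf_covariance(X, h=0.75):
    # Sigma_ij = exp(-||X[:, i] - X[:, j]||^2 / h^2), over attribute columns.
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    return np.exp(-sq / h ** 2)

def logdet_objective(S, Sigma):
    # Active-set-selection objective f(S) = log det(I + Sigma_{S,S}),
    # where Sigma_{S,S} is the principal submatrix indexed by S.
    idx = sorted(S)
    sub = Sigma[np.ix_(idx, idx)]
    return np.linalg.slogdet(np.eye(len(idx)) + sub)[1]
```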
Influence Maximization In the influence maximization problem, we assume that every node in the network is able to influence all of its one-hop neighbors. The objective of influence maximization is to select a subset of nodes in the network, called the seed set (denoted by S), so that the total number of influenced nodes, including the seed nodes, is maximized. We choose the social network of Zachary's karate club Zachary [1977] in this study. The subjects in this social network are partitioned into three disjoint groups, whose sizes are 10, 14, and 10, respectively. The chosen seed nodes are subject to a partition matroid; i.e., we select at most two subjects from each of the three groups. Note that this problem is also a submodular set maximization problem. Similar to the situation in the active set selection problem, the first-order algorithms need function value queries to obtain an unbiased estimate of the gradient. We can observe from Fig. 1d that DBG attains a better influence coverage than the other zeroth-order algorithm ZGA. Again, even though SCG and GA achieve a slightly better coverage, due to their first-order nature they require a significantly larger number of function value queries.
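Under our reading of this one-hop influence model, the objective is the neighborhood coverage |S ∪ N(S)|; the sketch below (our own, using the networkx copy of the karate club graph) is one way to evaluate it, and the seed set shown is arbitrary.

```python
import networkx as nx

def one_hop_influence(S, G):
    # Number of influenced nodes when every seed influences its one-hop
    # neighbors: |S ∪ N(S)|  (our reading of the objective described above).
    influenced = set(S)
    for v in S:
        influenced.update(G.neighbors(v))
    return len(influenced)

G = nx.karate_club_graph()          # Zachary's karate club network
print(one_hop_influence({0, 33}, G))
```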
                                                                   compare first-order and zeroth-order algorithms, we

[Figure 1 omitted. Caption: Function value vs. number of oracle queries for (a) NQP, (b) topic summarization, (c) active set selection, and (d) influence maximization. Note that every chart has dual horizontal axes: orange lines (first-order methods) use the orange horizontal axes above, while blue lines (zeroth-order methods) use the blue ones below.]

[Figure 2 omitted. Caption: Relative running time normalized with respect to BCG (for continuous DR-submodular maximization in the first two sets of experiments) and DBG (for submodular set maximization in the last two sets of experiments), for (a) NQP, (b) topic summarization, (c) active set selection, and (d) influence maximization.]

can observe that the zeroth-order algorithms (BCG, DBG, and ZGA) run faster than their first-order counterparts (SCG and GA).

Summary  The above experimental results show the following major advantages of our method over the baselines, including SCG and ZGA.

 1. BCG/DBG is at least twice as fast as SCG and ZGA in all tasks in terms of running time (Figs. 2a to 2d).

 2. DBG requires remarkably fewer function evaluations in the discrete setting (Figs. 1c and 1d).

 3. In addition to saving function evaluations, BCG/DBG achieves an objective function value comparable to that of the first-order baselines SCG and GA.

Furthermore, we note that the number of first-order queries required by SCG is only half the number of zeroth-order queries required by BCG. However, as shown in Figs. 2a and 2b, BCG still runs significantly faster than SCG because a zeroth-order evaluation is cheaper than a first-order one; a minimal sketch of such a zeroth-order gradient surrogate is given at the end of this section.

In the topic summarization task (Fig. 1b), BCG attains an objective function value similar to that of the first-order baselines SCG and GA. In the other three tasks, BCG/DBG runs notably faster while achieving an only slightly lower function value. Therefore, BCG/DBG is particularly preferable in large-scale machine learning tasks and in applications where the total number of function evaluations or the running time is subject to a budget.
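To make the query-cost comparison concrete, the following is a minimal sketch of a two-point spherical-smoothing gradient surrogate built purely from function evaluations. It illustrates the zeroth-order oracle model rather than the exact estimator analyzed for BCG; the objective f, the smoothing radius delta, and the number of sampled directions are placeholder choices.

import numpy as np

def two_point_gradient_estimate(f, x, delta=1e-2, num_samples=20, rng=None):
    # Illustrative spherical-smoothing estimator: every sample costs two
    # function evaluations and no derivative information is needed.
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    grad = np.zeros(d)
    for _ in range(num_samples):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere
        # Symmetric finite difference along u, using only zeroth-order queries.
        grad += (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * u
    return d * grad / num_samples  # dimension factor d for sphere sampling

Each call above issues 2 * num_samples function evaluations, which is typically much cheaper than computing an exact gradient of an expensive objective; this is consistent with the running-time comparison in Fig. 2.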
6   Conclusion

In this paper, we presented Black-box Continuous Greedy, a derivative-free and projection-free algorithm for maximizing a monotone and continuous DR-submodular function subject to a general convex body constraint. We showed that Black-box Continuous Greedy achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d/ε³) function evaluations. We then extended the algorithm to the stochastic continuous setting and to the discrete submodular maximization problem. Our experiments on both synthetic and real data validated the performance of the proposed algorithms. In particular, we observed that Black-box Continuous Greedy attains practically the same utility as Continuous Greedy while being far more efficient in terms of the number of function evaluations.
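As a reminder of how the discrete problem is connected to the continuous one, the sketch below estimates the multilinear extension of a set function by Monte Carlo sampling, using only set-function evaluations. It is a standard construction shown for illustration; the set function f and the sample count are placeholders and need not match the exact choices made in our discrete experiments.

import numpy as np

def multilinear_extension_estimate(f, x, num_samples=100, rng=None):
    # Monte Carlo estimate of F(x) = E[f(R_x)], where the random set R_x
    # contains element i independently with probability x[i].
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    total = 0.0
    for _ in range(num_samples):
        random_set = {i for i in range(d) if rng.random() < x[i]}
        total += f(random_set)
    return total / num_samples

Maximizing such an estimate with a zeroth-order continuous method and then rounding the fractional solution is the general pattern that links the continuous and discrete settings discussed above.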
Acknowledgements

LC is supported by the Google PhD Fellowship. HH is supported by AFOSR Award 19RT0726, NSF HDR TRIPODS award 1934876, NSF award CPS-1837253, NSF award CIF-1910056, and NSF CAREER award CIF-1943064. AK is partially supported by NSF (IIS-1845032), ONR (N00014-19-1-2406), and AFOSR (FA9550-18-1-0160).
                                                           convex optimization: Submodular maximization over