Statistical Science
2020, Vol. 35, No. 4, 614–622
https://doi.org/10.1214/20-STS809
Main article: https://doi.org/10.1214/19-STS701, https://doi.org/10.1214/19-STS733
© Institute of Mathematical Statistics, 2020

A Look at Robustness and Stability of ℓ1- versus ℓ0-Regularization: Discussion of Papers by Bertsimas et al. and Hastie et al.
Yuansi Chen, Armeen Taeb and Peter Bühlmann

Abstract. We congratulate the authors Bertsimas, Pauphilet and van Parys (hereafter BPvP) and Hastie, Tibshirani and Tibshirani (hereafter HTT) for providing fresh and insightful views on the problem of variable selection and prediction in linear models. Their contributions at the fundamental level provide guidance for more complex models and procedures.

Key words and phrases: Distributional robustness, high-dimensional estimation, latent variables, low-rank estimation, variable selection.

Yuansi Chen is Postdoctoral Fellow, Seminar for Statistics, ETH Zürich, Rämistrasse 101, CH-8092 Zürich, Switzerland (e-mail: chen@stat.math.ethz.ch). Armeen Taeb is Postdoctoral Fellow, Seminar for Statistics, ETH Zürich, Rämistrasse 101, CH-8092 Zürich, Switzerland (e-mail: armeen.taeb@stat.math.ethz.ch). Peter Bühlmann is Professor, Seminar for Statistics, ETH Zürich, Rämistrasse 101, CH-8092 Zürich, Switzerland (e-mail: buhlmann@stat.math.ethz.ch).

1. A BIT OF HISTORY ON SUBSET SELECTION

ℓ0 regularization in linear and other models has taken a prominent role, perhaps because of Occam's razor principle of simplicity. Information criteria like AIC (Akaike, 1973) and BIC (Schwarz, 1978) have been a breakthrough for complexity regularization with ℓ0 regularization; see also Kotz and Johnson (1992). Miller's book on subset selection (Miller, 1990) provided a comprehensive view of the state of the art 30 years ago. But the landscape has changed since then.

Breiman introduced the nonnegative garrote for better subset selection (Breiman, 1995), mentioned the instability of forward selection (Breiman, 1996b) and promoted bagging (Breiman, 1996a) as a way to address it. A bit earlier, Basis pursuit was invented by Chen and Donoho (1994), just before Tibshirani (1996) proposed the famous Lasso with ℓ1 norm regularization that has seen massive use in statistics and machine learning; see also Chen, Donoho and Saunders (2001) and Fuchs (2004). The view and fact that ℓ1 norm regularization can be seen as a powerful convex relaxation (Donoho, 2006) for the ℓ0 problem has perhaps overshadowed that there are other statistical aspects in favor of ℓ1 norm regularization, as pointed out now again by HTT in their current paper.

The scientific debate on whether to use ℓ0 or ℓ1 regularization to induce sparsity has always been active. One way to tackle the computational shortcoming of the exact ℓ0 approach is through greedy algorithms. Tropp (2004) justified the good old greedy forward selection from a theoretical viewpoint in the noiseless case, and Cai and Wang (2011) generalized this result to noisy situations, both assuming fairly restrictive coherence conditions on the design; the conditions in the latter work for noisy problems are weakened in Zhang (2011). Also matching pursuit (Mallat and Zhang, 1993) has been a contribution in favor of an algorithmic forward approach, perhaps mostly for computational reasons at that time. The similarity of matching pursuit to ℓ1 norm penalization was made more clear by Least Angle Regression (Efron et al., 2004) and L2 boosting (Bühlmann, 2006). It is worth mentioning that the recent work by Bertsimas, King and Mazumder (2016) illustrates that solving moderate-size exact ℓ0 problems is no longer as slow as people used to think. Apart from their smart implementations, the speed-up they are able to attain benefits from tremendous progress in modern optimization theory and in physical computing power.

On choosing between the ℓ0 and the ℓ1 approach, both BPvP and HTT have provided valuable statistical and optimization comparisons and discussions. To consolidate the contributions in both papers in a principled manner, here we outline what we believe are the three aspects that a data analyst must consider when deciding between ℓ0 and ℓ1 regularization: namely, the application for which they will be employed, the optimization guarantees (computation speed and certificates of optimality) that are desired, and the statistical properties that the model might possess.


2. THREE ASPECTS OF THE PROBLEM

2.1 Application Aspect

Without a concrete data problem, the application aspect is perhaps not clearly distinguishable from a statistical data analysis viewpoint. Whether one would prefer ℓ0 or ℓ1 norm regularization is problem dependent. For example, for problems where the cardinality of the model parameters is naturally constrained, a truly sparse approximation is reasonable and one may prefer ℓ0 regularization. While ℓ0 regularization is more natural from a modeling perspective, data analysts have often resorted to ℓ1 regularization as a convex surrogate to speed up computations. However, the ℓ1 surrogate can become problematic if the solution of the ℓ1 problem were interpreted as that of the exact ℓ0 problem. In particular, it is well known that ℓ1 tends to overshrink the fitted estimates (Zou, 2006); that is, the ℓ1 estimates are overly biased toward zero. This phenomenon can lead to poorly fitted models in practice. For example, ℓ1 procedures for deconvolving calcium imaging data to determine the location of neuron activity lead to incorrect estimates (Friedrich, Zhou and Paninski, 2017), which has motivated exact solvers for the ℓ0-penalized problem in this context (Jewell and Witten, 2018). In directed acyclic graph (DAG) estimation, ℓ0 regularization is clearly preferred over ℓ1 regularization. In particular, ℓ0 regularization preserves the important property that Markov-equivalent DAGs have the same penalized likelihood score, while this is not the case for ℓ1 regularization (van de Geer and Bühlmann, 2013).

The two examples above illustrate that many additional desiderata—other than computational complexity—may arise when deciding between ℓ0 and ℓ1 regularization. Depending on the application, a practitioner may consider: certificates of optimality (e.g., in mission-critical applications), statistical prediction guarantees, feature selection guarantees, stability of the estimated model, distributional robustness and generalization of the theoretical understanding in linear models to more complex models. Our main objective in this discussion is to raise awareness of the above considerations when choosing between ℓ0 and ℓ1 regularization.

2.2 Optimization Aspect

Exact ℓ0 regularization has long been abandoned because of its computational hardness. One could optimize using branch-and-bound techniques for moderate dimensions (Gatu and Kontoghiorghes, 2006, Hofmann, Gatu and Kontoghiorghes, 2007, Bertsimas and Shioda, 2009) or, as most statisticians did, resort to some forward–backward heuristics. These approaches either provide no guarantee for finding the optimal solution or are not computationally tractable for large-scale problems. On the other hand, ℓ1 regularization is convex. When ℓ1 regularization is combined with a convex loss, the optimization program can be solved efficiently using standard subgradient-based methods. Gradient methods are ubiquitous and ready-to-use in today's deep learning research and popular computing packages such as Tensorflow (Abadi et al., 2015) or PyTorch (Paszke et al., 2019). These publicly available computing packages make ℓ1 regularization easily deployable in applications beyond linear models.
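To make the convexity point concrete, here is a minimal R sketch (our own illustration, not code from either paper) of a plain subgradient method for the Lasso objective; in practice one would of course use a dedicated solver such as the coordinate descent in glmnet, but the point is that convexity makes generic first-order schemes applicable:

```r
# Subgradient descent on the Lasso objective
#   (1/(2n)) * ||y - X b||_2^2 + lambda * ||b||_1,
# using a diminishing step size; sign(b) is a valid subgradient of ||b||_1.
lasso_subgradient <- function(X, y, lambda, n_iter = 5000) {
  n <- nrow(X)
  b <- rep(0, ncol(X))
  best <- b; best_obj <- Inf
  for (t in seq_len(n_iter)) {
    g <- -crossprod(X, y - X %*% b) / n + lambda * sign(b)  # a subgradient of the objective
    b <- b - as.numeric(g) / sqrt(t)                        # diminishing step size
    obj <- sum((y - X %*% b)^2) / (2 * n) + lambda * sum(abs(b))
    if (obj < best_obj) { best_obj <- obj; best <- b }      # keep the best iterate seen so far
  }
  best
}
```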
Of course, finding the global optimum may not be necessary for good statistical performance. The search for greedy algorithms for ℓ0 regularization and the study of the corresponding statistical performance has been and still is an active research field (see Tropp, 2004, Cai and Wang, 2011 and Zhang, 2011 mentioned in Section 1). However, in order to provide both computational and statistical guarantees for these greedy algorithms, often fairly restrictive coherence conditions on the design matrix have to be assumed.

The computational difficulty of exact ℓ0 regularization is not completely insurmountable. As pointed out by the recent work of Bertsimas, King and Mazumder (2016), with the advances in physical computing power and in modern optimization theory on mixed-integer optimization (MIO), data analysts now have a rigorous approach to employ ℓ0 regularization efficiently in regression problems that would have taken years to compute decades ago. However, as pointed out by BPvP and HTT, the mixed-integer programming formulation of the ℓ0 problem is not yet as fast as the Lasso, especially for low-SNR regimes and large problem sizes. In such scenarios, one often has to trade off between getting closer to the global minimum and terminating the program early under a time limit. As an example, HTT point out that for problem size n = 500, p = 100, the method originally introduced in Bertsimas, King and Mazumder (2016) still requires an hour to certify optimality. The addition of ℓ2 regularization by BPvP does improve computations in the low-SNR regime, in particular with their new convex relaxation SS for the ℓ0 + ℓ2 framework. As a convex relaxation, SS is a heuristic solution and is able to solve sparse linear regression of problem size n = 10,000, p = 100,000 in less than 20 seconds, putting it on par with the Lasso. Overall, the contribution of Bertsimas and coauthors in the series of related papers Bertsimas, King and Mazumder (2016), Bertsimas and King (2017), Bertsimas and Van Parys (2020) enables the possibility of ℓ0 regularization in high-dimensional data analysis, and inspires future research to further integrate ℓ0-based procedures.

2.3 Statistical Aspect

The statistical aspect is rooted in the goal to do accurate or "optimal" information extraction in the context of

uncertainty and aiming for stability and replicability. This inferential aspect may have very different aims, and depending on them and on the data generating processes, different methods might be preferred for different scenarios. Even more so, the statistical aspect might even embrace the idea that there are several "competing" methods and one would typically gain information by inspecting consensus or using aggregation.

Both the BPvP and HTT papers demonstrate a careful and responsible comparison of subset selection methods, and we complement them with a few additional empirical results in Section 3. With respect to statistical performance, HTT point out some key takeaways from their extensive empirical study: there is no clear winner, and depending on the size of the SNR, different methods perform favorably. In particular, they suggest that ℓ0-based best subset suffers both computationally and statistically in the low-SNR regime. To address some of the instability concerns raised by HTT, and inspired by the Elastic Net (Zou and Hastie, 2005), BPvP add an ℓ2 penalty to their MIO formulation. With this new estimator, they demonstrate good performance across the SNR spectrum. We applaud that both BPvP and HTT put the No Cherry Picking (NCP) guidelines (Bühlmann and van de Geer, 2018) in action.

2.3.1 Relaxed and adaptive Lasso: A compromise between ℓ0- and ℓ1-regularization. HTT find as an overall recommendation that the relaxed Lasso (Meinshausen, 2007), or a version of it, performs "overall best." In fact, when Nicolai Meinshausen came up independently with the idea around 2005, we learned that Hui Zou had invented the adaptive Lasso (Zou, 2006). Both proposals aim to reduce the bias of the Lasso and push the solution more towards ℓ0 penalization. So we wonder why HTT did not include the adaptive Lasso in their study as well (by using the plain Lasso as initial estimator; see also Bühlmann and van de Geer, 2011, Chapter 2.8). We would expect a similar behavior as for the relaxed Lasso.

The relaxed Lasso and adaptive Lasso above are "pushing" the convex ℓ1 regularization towards the nonconvex ℓ0 to benefit from both types of regularization. In the same spirit, the ℓ0 + ℓ2 approach from BPvP combines the ℓ0 regularization with the convex ℓ2 regularization to increase stability in the low-SNR regime. If one is willing to accept the conceptual connection between the ℓ0 + ℓ2 approach from BPvP and the relaxed Lasso, it is no longer surprising to see that the ℓ0 + ℓ2 approach from BPvP performs better than best subset and Lasso from HTT in many simulation settings.

2.3.2 Statistical theory. The beauty of statistical theory is to characterize mathematically under which assumptions a certain method exhibits performance guarantees such as minimax optimality, rate of statistical convergence, and consistency. Such guarantees are typically provided for the prediction performance or feature selection performance of a procedure.

Regarding ℓ0 and ℓ1 regularization in high-dimensional sparse linear models, some statistical guarantees are known and some are still unknown (at least to us). For prediction, that is, estimating the regression surface Xβ, ℓ0 regularization is very powerful as it leads to minimax optimality under no assumption on the design matrix X (Barron, Birgé and Massart, 1999). For the Lasso, this is not true and the fast convergence rate requires conditions on the design such as the restricted eigenvalue condition (Bickel, Ritov and Tsybakov, 2009) or the compatibility condition (van de Geer and Bühlmann, 2009, Bühlmann and van de Geer, 2011, Bellec, 2018). On the other hand, for feature selection accuracy, that is, estimation of the parameter vector β*, one necessarily needs a condition on the fixed design matrix X, since β* is not identifiable in general if p > n. Specifically, since the least squares loss function is the quadratic form ‖Y − Xb‖_2²/n ≈ (β* − b)^T (X^T X/n)(β* − b) + σ_ε², with β* denoting the true regression parameter and σ_ε² the error variance, we might necessarily need a restricted eigenvalue/compatibility-type condition to estimate β* (these are in fact the weakest assumptions known for accurate parameter estimation or feature selection consistency with the Lasso). Whether ℓ0 minimization has an advantage over the Lasso for parameter estimation (e.g., requiring weaker assumptions or yielding more accurate estimators) is unclear to us, and we believe investigating such relationships is an interesting direction for future research. On the fine scale, methods which improve upon the bias of the Lasso, such as the adaptive Lasso (Zou, 2006), the relaxed Lasso (Meinshausen, 2007) or thresholding after the Lasso, are indeed a bit better than the plain Lasso, under some assumptions (van de Geer, Bühlmann and Zhou, 2011). We wonder whether these Lasso variants also have similar advantages when introduced with ℓ0 regularization.

3. A FEW ADDITIONAL THOUGHTS

BPvP and HTT provide many empirical illustrations on prediction, parameter estimation, cross-validation or degrees of freedom. We complement this with empirical studies on the following points: (i) distributional robustness and (ii) feature selection stability. The first point (i) is very briefly mentioned by BPvP in their Sections 2.1.1 and 2.3.1 but not further considered in their empirical results. Further, we highlight a few problem settings—with more complicated decision spaces than linear subset selection—where developing optimization tools for ℓ0 regularization may be fruitful.

Implementation details. In all our experiments, the Lasso is solved via the R package glmnet (Friedman, Hastie and Tibshirani, 2010) and best subset is solved via the R package bestsubset, with a maximum computing time limit of

30 minutes following HTT. The convex relaxation of the ℓ0 + ℓ2 approach, SS (introduced by BPvP), is solved via the Julia package SubsetSelection, with a maximum limit of 200 computing iterations following BPvP. While CIO is also an important method in BPvP, we did not include it here because we had a hard time executing the CIO code by BPvP. The difficulty is due to version incompatibilities when installing both SS and CIO under the same Julia version in the current SparseRegression package provided by BPvP. We would like to encourage the authors to update the SparseRegression package to make their software contributions more accessible.

The code to generate all the figures here is written in R, with the integration of Julia code via the R package JuliaCall. Our code is released publicly in the GitHub repository STSDiscussion_SparseRegression (https://github.com/yuachen/STSDiscussion_SparseRegression).
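As a simplified illustration of this setup (the repository contains the full version; the helper names below are our own, and the data-generating step only sketches the HTT-style designs used in the experiments that follow), the following R snippet generates a sparse linear model at a target SNR, fits the Lasso path with glmnet and picks the regularization parameter on a separate validation set:

```r
# Minimal sketch of the experimental harness (illustrative only; see the
# repository for the actual code). Design: Gaussian with identity covariance,
# s nonzero coefficients, noise variance set so that var(x'beta)/sigma^2 = SNR.
library(glmnet)

simulate_data <- function(n, p, s, snr) {
  X <- matrix(rnorm(n * p), n, p)
  beta <- c(rep(1, s), rep(0, p - s))
  sigma2 <- sum(beta^2) / snr                  # identity covariance => var(x'beta) = ||beta||^2
  y <- as.numeric(X %*% beta + rnorm(n, sd = sqrt(sigma2)))
  list(X = X, y = y, beta = beta)
}

fit_lasso_validated <- function(train, valid) {
  fit <- glmnet(train$X, train$y, alpha = 1)   # full Lasso path
  pred <- predict(fit, newx = valid$X)         # predictions for every lambda on the path
  err <- colMeans((valid$y - pred)^2)          # validation error per lambda
  as.numeric(coef(fit)[-1, which.min(err)])    # coefficients at the best lambda (intercept dropped)
}

set.seed(1)
train <- simulate_data(n = 100, p = 30, s = 5, snr = 2)
valid <- simulate_data(n = 100, p = 30, s = 5, snr = 2)
beta_hat <- fit_lasso_validated(train, valid)
```

Best subset and SS are handled analogously, with the number of selected features k (respectively, the ℓ0 parameter) tuned on the same kind of validation set.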
3.1 Distributional Robustness

It is well known that the Lasso has a robustness property with respect to "measurement error," as the robust optimization problem (Xu, Caramanis and Mannor, 2009)

(3.1)    argmin_b max_{Δ∈U(λ)} ‖Y − (X + Δ)b‖_2

with the perturbation set U(λ) = {Δ = (δ_1, . . . , δ_p) : ‖δ_j‖_2 ≤ λ ∀j} is equivalent to the square root Lasso

argmin_b ‖Y − Xb‖_2 + λ‖b‖_1.

We note that the square root Lasso and the Lasso have the same solution path when varying λ from 0 to ∞. Thus, the Lasso is robust under small covariate perturbations; "small" since the optimal choice of λ to guarantee the optimal statistical risk in the Lasso is typically of order √(log(p)/n).

More generally, Bertsimas and Copenhaver (2018) demonstrate that the robust optimization problem (3.1) with a norm h(·) and a perturbation set U(λ) = {Δ : max_{b∈R^p} ‖Δb‖_2/h(b) ≤ λ} is equivalent to argmin_b ‖Y − Xb‖_2 + λh(b), matching the previous result of Xu, Caramanis and Mannor (2009) in the ℓ1 norm regularized regression setting. These results suggest a duality between norm-based regularization and robustness. However, as ℓ0 is not a norm, the connection with robustness is not immediately transparent. We believe that theoretically characterizing the type of perturbations that ℓ0 regularization is robust to would be an interesting and important future research direction.

Due to the lack of theoretical comparisons between the distributional robustness of ℓ0 and ℓ1 regularization, in the following we investigate the distributional robustness of these two regularization techniques empirically. We evaluate the distributional robustness with the following DR metric, which takes a regression parameter or its estimate as input and outputs its distributional robustness:

(3.2)    DR : β → max_{Δ∈U(η)} ‖Y − (X + Δ)β‖_2,

with (X, Y) generated independently and identically distributed with respect to the training data that yielded the estimate for β. DR computes the "worst-case" prediction performance of β when the covariates X have been perturbed inside the set U(η). Here, we take the perturbation set U(η) = {Δ = (δ_1, . . . , δ_p) : ‖δ_j‖_2 ≤ η ∀j}, parametrized by the perturbation magnitude η > 0. To obtain a better understanding of DR, it helps to consider the distributional robustness difference (DRD)

(3.3)    DRD : β → max_{Δ∈U(η)} ‖Y − (X + Δ)β‖_2 − ‖Y − Xβ‖_2.

The metrics DR and DRD are related trivially by the relation DR(β) = DRD(β) + ‖Y − Xβ‖_2, for a fixed β. Further, due to the ℓ1 norm and robustness duality obtained in Xu, Caramanis and Mannor (2009), DRD(β) = η‖β‖_1. As a consequence, the DRD value is independent of how the data (X, Y) are generated. We also deduce that DR(β) = η‖β‖_1 + ‖Y − Xβ‖_2. In other words, the distributional robustness of an estimator β (as measured by DR(β)) trades off the ℓ1 norm of the solution against the prediction performance on unperturbed data.
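Thanks to this closed form, evaluating DR and DRD requires no inner maximization over the perturbation set. A minimal R sketch (using the notation above; the variable names are ours):

```r
# DRD(beta) = eta * ||beta||_1 by the duality of Xu, Caramanis and Mannor (2009);
# DR(beta)  = DRD(beta) + ||y - X beta||_2 evaluated on fresh (X, y) data.
drd <- function(beta, eta) {
  eta * sum(abs(beta))
}

dr <- function(beta, eta, X_new, y_new) {
  drd(beta, eta) + sqrt(sum((y_new - X_new %*% beta)^2))
}

# Example: DR as a function of the perturbation magnitude eta is a line with
# intercept ||y - X beta||_2 and slope ||beta||_1 (cf. Figure 2 below).
# etas <- seq(0, 2, by = 0.1)
# dr_curve <- sapply(etas, function(e) dr(beta_hat, e, test$X, test$y))
```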
regularization and robustness. However, as 0 is not a
                                                                set. Due to the equality DRD(β) = ηβ1 , the behavior
norm, the connection with robustness is not immediately         of DRD mimics the 1 norm of the estimates in the reg-
transparent. We believe that theoretically characterizing       ularization solution path. As expected, choosing larger λ
the type of perturbations that the 0 regularization is ro-     leads to smaller DRD for the Lasso; we observe a similar
bust to would be an interesting and important future re-        behavior with best subset for smaller k.
search direction.                                                  For both problems, we label two choices for the regular-
   Due to the lack of theoretical comparisons between the       ization parameters: 1. the choice that leads to the best pre-
distributional robustness of the 0 and 1 regularization,      diction performance on a separate validation set of size n;
in the following, we investigate the distributional robust-     2. the choice that leads to the best prediction performance
ness of these two regularization techniques empirically.        on the same validation set among all regularization values
                                                                that yield support size less than or equal to s = 5. Choice
 1 https://github.com/yuachen/STSDiscussion_SparseRegression    1 prioritizes the test prediction performance, while Choice

2 prioritizes selecting the correct number of features. The left plot of Figure 1 shows that for the Lasso, the two choices lead to very different values of λ, and as a result, different DRD. Theoretically, the Lasso can be used to achieve good prediction performance and good feature selection performance, but the choice of λ for each goal is different (Wainwright, 2009, Zhao and Yu, 2006). In particular, optimizing with respect to prediction performance requires a smaller value of λ (Choice 1) and optimizing for model selection accuracy requires a larger value of λ (Choice 2). The right plot of Figure 1 shows that the two choices for best subset are similar. Comparing the left and right plots in Figure 1, we observe that Choice 1 yields approximately the same DRD for the Lasso and best subset. Choice 2, however, chooses a large λ for the Lasso, leading to a slightly better DRD for the Lasso than for best subset.

FIG. 1. Left—Distributional robustness difference (DRD) of the Lasso as a function of the regularization parameter λ; right—DRD of best subset as a function of the regularization parameter k.

3.1.2 Empirical illustration: Robustness versus signal-to-noise ratio and ambient dimension. In this experiment, we explore the distributional robustness of the Lasso and best subset, as well as the new method SS developed by BPvP.

Following the simulation settings in HTT, we consider the following four data generation settings:

1. n = 100, p = 30, s = 5, SNR = 2.0, ρ = 0
2. n = 100, p = 30, s = 5, SNR = 0.1, ρ = 0
3. n = 50, p = 1000, s = 5, SNR = 20.0, ρ = 0
4. n = 50, p = 1000, s = 5, SNR = 1.0, ρ = 0,

and the design matrix is sampled from a Gaussian distribution with identity covariance (i.e., correlation ρ = 0). For the Lasso and best subset, we select their regularization parameters based on prediction performance on a separate validation set of size n. Since SS has two hyperparameter choices, we fix the ℓ2 penalty regularization (denoted as 1/γ by BPvP) to take one of the two values γ ∈ {1000, 0.01} and tune the ℓ0 regularization parameter based on validation performance. The two levels of the ℓ2 penalty lead to two versions of SS: SS1 (γ = 1000) with a small ℓ2 penalty and SS2 (γ = 0.01) with a large ℓ2 penalty.

For each method, after appropriately selecting the regularization parameters and computing the corresponding estimates, we evaluate the metric DR in equation (3.2) as a function of the perturbation magnitude η. Figure 2 compares the performance of the four methods. Given the relation DR(β) = η‖β‖_1 + ‖Y − Xβ‖_2, the Y-intercepts of the linear curves in Figure 2 represent the prediction performance with unperturbed data, and the slopes are precisely the ℓ1 norms of the estimates.

Focusing on the high-SNR regimes in the left column of Figure 2, the Lasso, best subset and SS1 have comparable distributional robustness for small amounts of perturbation in both low and high dimensions. As a comparison, SS2 has larger prediction error with unperturbed data but has better distributional robustness (as measured by DR) if the perturbations are large enough; this behavior can be attributed to the large ℓ2 penalty in SS2.

Focusing on the low-SNR regimes in the right column of Figure 2, in the low-dimensional setting 2, the Lasso is more robust than best subset. The robustness of the Lasso is largely due to the shrinkage provided by ℓ1 regularization. In the low-SNR and high-dimensional setting 4, best subset turns out to be more robust than the Lasso. We observe that in the low-SNR and high-dimensional setting, best subset selects a very small number of features, enhancing its robustness. Once again, we observe that SS2 yields substantially more robust solutions as compared to the other three methods.

Our experiments, as well as the empirical studies by BPvP, suggest that ℓ0 + ℓ2 minimization can substantially improve robustness. It is worth noting that this comes at the cost of choosing an additional regularization parameter. In practice, searching over the two-dimensional grid of regularization parameters could make SS a less preferred choice for computational reasons. Nonetheless, the promising results of SS raise interesting questions about its statistical properties, as well as its optimization guarantees. On the statistical side, it is interesting for future research to understand theoretically when the nonconvex

ℓ0 + ℓ2 regularization has similar prediction, model selection or robustness guarantees as the Lasso. On the optimization side, since SS only solves a relaxation of the ℓ0 + ℓ2 regularized optimization problem, it is interesting to understand when the SS solution is close to the global optimum.

FIG. 2. Distributional robustness (DR) of the Lasso, best subset, SS1 and SS2 for four ambient dimensions and SNR settings. For all problems, the data matrix is uncorrelated, that is, ρ = 0.

3.2 Sampling Stability for Feature Selection

Next, we empirically study the stability of best subset and the Lasso as feature selection methods. We measure the feature selection stability by the probability of true and null variables being selected across multiple independently and identically generated datasets. Ideally, we would like the probability of selecting any true variable to be close to one and that of any null variable to be close to zero. As with distributional robustness, the feature selection stability of the Lasso and best subset is strongly dependent on the choice of regularization parameter. We illustrate this phenomenon in the supplementary material, Section A.1 (see Chen, Taeb and Bühlmann, 2020).
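A sketch of how such selection probabilities can be estimated by Monte Carlo (illustrative only, reusing the hypothetical helpers from the sketch in Section 3; the repository contains the actual code):

```r
# Monte Carlo estimate of per-variable selection probabilities for the Lasso:
# refit on many independently generated datasets and record how often each
# variable enters the selected support (nonzero coefficient).
selection_frequency <- function(n_rep = 100, n = 100, p = 30, s = 5, snr = 2) {
  counts <- numeric(p)
  for (r in seq_len(n_rep)) {
    train <- simulate_data(n, p, s, snr)            # helper sketched in Section 3
    valid <- simulate_data(n, p, s, snr)
    beta_hat <- fit_lasso_validated(train, valid)   # lambda chosen on validation data
    counts <- counts + (abs(beta_hat) > 1e-8)       # variable selected if coefficient is nonzero
  }
  counts / n_rep    # first s entries: true variables; remaining entries: null variables
}
```

The analogous computation for best subset replaces the Lasso fit by the ℓ0 solver with its number of selected features tuned in the same way.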
We consider the following stylized settings:

1. n = 100, p = 30, s = 5, SNR = 2.0, ρ = 0.35
2. n = 100, p = 30, s = 5, SNR = 0.1, ρ = 0.35

Figure 3 shows the feature stability of the Lasso and best subset where the regularization parameters are chosen based on prediction on a validation set. In the high-SNR regime, the feature selection stability of the ℓ0 approach is better than that of the Lasso. This is perhaps not very surprising, as it is known that using validation error to select the regularization parameter for the Lasso yields overly complex models (Wainwright, 2009) (we consider in the supplementary material, Section A.2 of Chen, Taeb and Bühlmann, 2020, the setting where λ for the Lasso is chosen so that the solution has s nonzeros). In the low-SNR regime (albeit still a bit larger than the extreme SNR = 0.05 considered by HTT), we see both methods struggle to tease apart true variables from null ones, although we would argue that the Lasso performs favorably here as the true variables appear with much larger probability.

3.3 More Complicated Decision Spaces

Much of the focus of BPvP and HTT has been on subset selection in linear sparse regression settings. Although BPvP consider logistic regression and the hinge loss in their mixed-integer framework as well, the theoretical and computational aspects of these approaches appear less understood as compared to the linear setting. In practice, ℓ1 norm regularization has been widely used beyond the linear setting for generalized linear models and survival models, but also for scenarios with nonconvex loss functions such as mixture regression or mixed effect models (cf. Hastie, Tibshirani and Friedman, 2009, Bühlmann and van de Geer, 2011, Hastie, Tibshirani and Wainwright, 2015).

As such, we believe that deepening our understanding of ℓ0-optimization for problems involving nonquadratic loss functions (beyond generalized linear models) is an important direction for future research. Furthermore, we next outline a few problem instances where ℓ0-based approaches may provide a fresh and interesting perspective.

FIG. 3. Feature stability of the Lasso and best subset across 100 independent datasets with problem parameters n = 100, p = 30, s = 5, ρ = 0.35 and SNR = {2, 0.1}. The regularization parameters are chosen based on prediction performance on a validation set.

3.3.1 Plug-in to squared error loss. There are extensions of linear models in a plug-in sense, as indicated below, which can deal with important issues around hidden confounding, causality and distributional robustness in the presence of large perturbations.

The trick with such plug-in methodology is as follows. We linearly transform the data to X̃ = FX and Ỹ = FY for a carefully chosen n × n matrix F. Subsequently, we fit a linear model of Ỹ versus X̃, typically with a regularization term such as the ℓ1 norm or now, due to the recent contributions by BPvP, we can also use ℓ0 regularization. We list here two choices of F:

1. F is a spectral transform which trims the singular values of X (see the sketch after this list). If there is unobserved hidden linear confounding which is dense, affecting many components of X, then the regularized regression of Ỹ versus X̃ yields the deconfounded regression coefficient (Cevid, Bühlmann and Meinshausen, 2018).

2. F is a linear transformation involving projection matrices and a robustness tuning parameter. Then Anchor regression is simply regression of Ỹ versus X̃, and we may want to use ℓ0 or ℓ1 regularization (Rothenhäusler et al., 2018). The obtained regression coefficient has a causal interpretation and leads to distributional robustness under a class of large perturbations.
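A minimal R sketch of choice 1 (our own illustration; the specific trimming rule, capping the singular values at their median, is one possible choice in the spirit of Cevid, Bühlmann and Meinshausen (2018), not necessarily the rule used there):

```r
# Spectral "trim" transform: F leaves directions orthogonal to the data
# untouched and shrinks the large singular values of X down to a cap tau.
trim_transform <- function(X, y, tau = NULL) {
  n <- nrow(X)
  sv <- svd(X)                                   # X = U D V'
  if (is.null(tau)) tau <- median(sv$d)          # assumed trimming level
  scale <- pmin(1, tau / sv$d)                   # per-singular-value shrinkage factor
  Fmat <- diag(n) - sv$u %*% (t(sv$u) * (1 - scale))   # F = I - U diag(1 - scale) U'
  list(X = Fmat %*% X, y = as.numeric(Fmat %*% y))
}

# After the transform, any sparse regression routine can be plugged in, e.g.
# td <- trim_transform(X, y); glmnet::glmnet(td$X, td$y, alpha = 1)
# or, following BPvP, an l0-based solver applied to (td$X, td$y).
```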
3.3.2 Fitting directed acyclic graphs. Fitting Gaussian structural equation models with acyclic directed graph (DAG) structure from observational data is a well-studied topic in structure learning. As mentioned in Section 2.1, one should always prefer the ℓ0 regularization principle as it respects the Markov equivalence property. The optimization of the ℓ0-regularized log-likelihood function with a DAG constraint is very difficult, since the DAG constraint induces a high degree of nonconvexity. The famous proposal of greedy equivalence search (GES) is the most used heuristic, but with only crude consistency guarantees proven in low dimensions (Chickering, 2003). It would be wonderful to have more rigorous optimization tools which could address this highly nonconvex problem.

3.3.3 Extensions to low-rank matrix estimation. A common task in data analysis is to obtain a low-rank model from observed data. These problems often aim at solving the optimization problem

(3.4)    argmin_{X∈R^{n×p}} f(X)   subject to   rank(X) ≤ k,

for some differentiable function f : R^{n×p} → R and k ≪ min{n, p}. The constraint rank(X) ≤ k can be viewed as ‖singular-values(X)‖_0 ≤ k, making equation (3.4) a matrix analog of the subset selection problem analyzed in BPvP and HTT. Due to the computational intractability of rank minimization, a large body of work has resorted to convex relaxation in order to obtain computationally feasible solutions by replacing the rank constraint rank(X) ≤ k with the nuclear norm penalty λ‖X‖_* in the objective (Fazel, 2002), yielding a semidefinite program for certain function classes f. Such relaxations, while having the advantage of convexity, are not scalable to large problem instances. As such, practitioners often resort to solving other nonconvex formulations of equation (3.4) such as the Burer–Monteiro approach (Burer and Monteiro, 2003) and projected gradient descent on the low-rank manifold (Jain, Tewari and Kar, 2014). For a range of problem settings, these nonconvex techniques achieve appealing estimation accuracy (see Chen and Chen, 2018 for a summary), and are able to solve larger-dimensional problems than approaches based on convex relaxation. While these nonconvex approaches are commonly employed, they do not certify optimality. As such, nonlinear

semi-definite optimization techniques have been developed to produce more accurate solutions for some specializations of equation (3.4); see, for example, Bertsimas, Copenhaver and Mazumder (2017) in the context of rank-constrained factor analysis. Beyond the setting analyzed in Bertsimas, Copenhaver and Mazumder (2017), we wonder whether there are conceptual advancements to the mixed-integer framework proposed in BPvP to handle low-rank optimization problems (3.4) involving continuous nonconvex optimization. Developing this connection may enable a fresh and powerful approach to provide—in a computationally efficient manner—certifiably optimal solutions to equation (3.4).
a computationally efficiently manner—certifiably optimal                  B ERTSIMAS , D., C OPENHAVER , M. S. and M AZUMDER , R. (2017).
solutions to equation (3.4).                                                 Certifiably optimal low rank factor analysis. J. Mach. Learn. Res.
                                                                             18 Paper No. 29, 53. MR3634896
                       4. CONCLUSIONS                                     B ERTSIMAS , D. and K ING , A. (2017). Logistic regression: From art
                                                                             to science. Statist. Sci. 32 367–384. MR3696001 https://doi.org/10.
   It is great that practitioners can begin to integrate fast                1214/16-STS602
and exact 0 -based solvers in their data analysis pipelines.             B ERTSIMAS , D., K ING , A. and M AZUMDER , R. (2016). Best subset
The methods developed by BPvP can benefit many data                          selection via a modern optimization lens. Ann. Statist. 44 813–852.
analysts who would prefer to focus on the application                        MR3476618 https://doi.org/10.1214/15-AOS1388
aspect of the problem rather than the optimization as-                    B ERTSIMAS , D. and S HIODA , R. (2009). Algorithm for cardinality-
pect. Aiming for the most parsimonious model fit is a                        constrained quadratic optimizaiton. Comput. Optim. Appl. 43 1–22.
                                                                             MR2501042 https://doi.org/10.1007/s10589-007-9126-9
very plausible principle which is easy to communicate in                  B ERTSIMAS , D. and VAN PARYS , B. (2020). Sparse high-dimensional
many applications. But this alone does not rule out the at-                  regression: Exact scalable algorithms and phase transitions.
tractiveness of 1 norm regularization and its versions: it                  Ann. Statist. 48 300–323. MR4065163 https://doi.org/10.1214/
builds in additional estimation shrinkage, which may be                      18-AOS1804
desirable in high noise settings, and sticking to convex-                 B ICKEL , P. J., R ITOV, Y. and T SYBAKOV, A. B. (2009). Simultane-
ity is yet another plausible principle. In this discussion,                  ous analysis of Lasso and Dantzig selector. Ann. Statist. 37 1705–
                                                                             1732. MR2533469 https://doi.org/10.1214/08-AOS620
we tried to add a few additional thoughts to the excellent
                                                                          B REIMAN , L. (1995). Better subset regression using the non-
and insightful papers by BPvP and HTT, in the hope of                        negative garrote. Technometrics 37 373–384. MR1365720
encouraging new research in this direction.                                  https://doi.org/10.2307/1269730
                                                                          B REIMAN , L. (1996a). Bagging predictors. Mach. Learn. 24 123–140.
                    ACKNOWLEDGMENTS                                       B REIMAN , L. (1996b). Heuristics of instability and stabilization
                                                                             in model selection. Ann. Statist. 24 2350–2383. MR1425957
   Y. Chen, A. Taeb and P. Bühlmann have received fund-                      https://doi.org/10.1214/aos/1032181158
ing from the European Research Council under the Grant                    B ÜHLMANN , P. (2006). Boosting for high-dimensional linear mod-
Agreement No. 786461 (CausalStats—ERC-2017-ADG).                             els. Ann. Statist. 34 559–583. MR2281878 https://doi.org/10.1214/
They all acknowledge scientific interaction and exchange                     009053606000000092
at “ETH Foundations of Data Science.”                                     B ÜHLMANN , P. and VAN DE G EER , S. (2011). Statistics for
                                                                             High-Dimensional Data: Methods, Theory and Applications.
                                                                             Springer Series in Statistics. Springer, Heidelberg. MR2807761
              SUPPLEMENTARY MATERIAL                                         https://doi.org/10.1007/978-3-642-20192-9
                                                                          B ÜHLMANN , P. and VAN DE G EER , S. (2018). Statistics for big
   Supplement to “A Look at Robustness and Stability                         data: A perspective. Statist. Probab. Lett. 136 37–41. MR3806834
of 1 - versus 0 -Regularization: Discussion of Papers                      https://doi.org/10.1016/j.spl.2018.02.016
by Bertsimas et al. and Hastie et al.” (DOI: 10.1214/20-                  B URER , S. and M ONTEIRO , R. D. C. (2003). A nonlinear pro-
STS809SUPP; .pdf). The supplement contains an empiri-                        gramming algorithm for solving semidefinite programs via low-
cal analysis of the dependence of feature selection stabil-                  rank factorization. Math. Program. 95 329–357. MR1976484
ity on the choice of regularization parameter for the Lasso                  https://doi.org/10.1007/s10107-002-0352-8
                                                                          C AI , T. T. and WANG , L. (2011). Orthogonal matching pursuit
and best subset. Further, in the context of feature selection
                                                                             for sparse signal recovery with noise. IEEE Trans. Inf. The-
stability, the effect of choosing λ for the Lasso to obtain a                ory 57 4680–4688. MR2840484 https://doi.org/10.1109/TIT.2011.
fixed cardinality is explored.                                               2146090
                                                                          C EVID , D., B ÜHLMANN , P. and M EINSHAUSEN , N. (2018). Spectral
                                                                             deconfounding and perturbed sparse linear models. Preprint. Avail-
                         REFERENCES                                          able at arXiv:1811.05352.
A BADI , M., AGARWAL , A., BARHAM , P., B REVDO , E., C HEN , Z.,         C HEN , Y. and C HEN , Y. (2018). Harnessing structures in big data via
   C ITRO , C., C ORRADO , G. S., DAVIS , A., D EAN , J. et al. (2015).      guaranteed low-rank matrix estimation: Recent theory and fast al-
   TensorFlow: Large-scale machine learning on heterogeneous sys-            gorithms via convex and nonconvex optimization. IEEE Signal Pro-
   tems. Software available from tensorflow.org.                             cess. Mag. 35 14–31.

CHEN, S. and DONOHO, D. (1994). Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers 1 41–44. IEEE, New York.
CHEN, S. S., DONOHO, D. L. and SAUNDERS, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Rev. 43 129–159. MR1854649 https://doi.org/10.1137/S003614450037906X
CHEN, Y., TAEB, A. and BÜHLMANN, P. (2020). Supplement to "A look at robustness and stability of ℓ1- versus ℓ0-regularization: Discussion of papers by Bertsimas et al. and Hastie et al." https://doi.org/10.1214/20-STS809SUPP
CHICKERING, D. M. (2003). Optimal structure identification with greedy search. J. Mach. Learn. Res. 3 507–554. MR1991085 https://doi.org/10.1162/153244303321897717
DONOHO, D. L. (2006). For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 797–829. MR2217606 https://doi.org/10.1002/cpa.20132
EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004). Least angle regression. Ann. Statist. 32 407–499. MR2060166 https://doi.org/10.1214/009053604000000067
FAZEL, M. (2002). Matrix rank minimization with applications. Ph.D. thesis, Stanford, CA.
FRIEDMAN, J., HASTIE, T. and TIBSHIRANI, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 1–22.
FRIEDRICH, J., ZHOU, P. and PANINSKI, L. (2017). Fast online deconvolution of calcium imaging data. PLoS Comput. Biol. 13 1–26.
FUCHS, J.-J. (2004). On sparse representations in arbitrary redundant bases. IEEE Trans. Inf. Theory 50 1341–1344. MR2094894 https://doi.org/10.1109/TIT.2004.828141
GATU, C. and KONTOGHIORGHES, E. J. (2006). Branch-and-bound algorithms for computing the best-subset regression models. J. Comput. Graph. Statist. 15 139–156. MR2269366 https://doi.org/10.1198/106186006X100290
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer Series in Statistics. Springer, New York. MR2722294 https://doi.org/10.1007/978-0-387-84858-7
HASTIE, T., TIBSHIRANI, R. and WAINWRIGHT, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Monographs on Statistics and Applied Probability 143. CRC Press, Boca Raton, FL. MR3616141
HOFMANN, M., GATU, C. and KONTOGHIORGHES, E. J. (2007). Efficient algorithms for computing the best subset regression models for large-scale problems. Comput. Statist. Data Anal. 52 16–29. MR2409961 https://doi.org/10.1016/j.csda.2007.03.017
JAIN, P., TEWARI, A. and KAR, P. (2014). On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems 685–693.
JEWELL, S. and WITTEN, D. (2018). Exact spike train inference via ℓ0 optimization. Ann. Appl. Stat. 12 2457–2482. MR3875708 https://doi.org/10.1214/18-AOAS1162
KOTZ, S. and JOHNSON, N. L., eds. (1992). Breakthroughs in Statistics. Vol. I: Foundations and Basic Theory. Springer Series in Statistics. Perspectives in Statistics. Springer, New York. MR1182794
MALLAT, S. and ZHANG, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41 3397–3415.
MEINSHAUSEN, N. (2007). Relaxed Lasso. Comput. Statist. Data Anal. 52 374–393. MR2409990 https://doi.org/10.1016/j.csda.2006.12.019
MILLER, A. J. (1990). Subset Selection in Regression. Monographs on Statistics and Applied Probability 40. CRC Press, London. MR1072361 https://doi.org/10.1007/978-1-4899-2939-6
PASZKE, A., GROSS, S., MASSA, F., LERER, A., BRADBURY, J., CHANAN, G., KILLEEN, T., LIN, Z., GIMELSHEIN, N. et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 8026–8037.
ROTHENHÄUSLER, D., MEINSHAUSEN, N., BÜHLMANN, P. and PETERS, J. (2018). Anchor regression: Heterogeneous data meets causality. Preprint. Available at arXiv:1801.06229. To appear in J. Roy. Statist. Soc. Ser. B.
SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464. MR0468014
TIBSHIRANI, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
TROPP, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inf. Theory 50 2231–2242. MR2097044 https://doi.org/10.1109/TIT.2004.834793
VAN DE GEER, S. A. and BÜHLMANN, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3 1360–1392. MR2576316 https://doi.org/10.1214/09-EJS506
VAN DE GEER, S. and BÜHLMANN, P. (2013). ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Ann. Statist. 41 536–567. MR3099113 https://doi.org/10.1214/13-AOS1085
VAN DE GEER, S., BÜHLMANN, P. and ZHOU, S. (2011). The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J. Stat. 5 688–749. MR2820636 https://doi.org/10.1214/11-EJS624
WAINWRIGHT, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 55 2183–2202. MR2729873 https://doi.org/10.1109/TIT.2009.2016018
XU, H., CARAMANIS, C. and MANNOR, S. (2009). Robust regression and Lasso. In Advances in Neural Information Processing Systems 1801–1808.
ZHANG, T. (2011). Sparse recovery with orthogonal matching pursuit under RIP. IEEE Trans. Inf. Theory 57 6215–6221. MR2857968 https://doi.org/10.1109/TIT.2011.2162263
ZHAO, P. and YU, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563. MR2274449
ZOU, H. (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469 https://doi.org/10.1198/016214506000000735
ZOU, H. and HASTIE, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320. MR2137327 https://doi.org/10.1111/j.1467-9868.2005.00503.x