ON THE POWER OF DEEP BUT NAIVE PARTIAL LABEL LEARNING

Junghoon Seo¹†          Joon Suk Huh²†
¹ SI Analytics Co. Ltd, South Korea          ² UW–Madison, USA
jhseo@si-analytics.ai          jhuh23@wisc.edu

arXiv:2010.11600v2 [cs.LG] 8 Feb 2021
                                  ABSTRACT

Partial label learning (PLL) is a class of weakly supervised learning where each training instance consists of a data point and a set of candidate labels containing a unique ground-truth label. To tackle this problem, the majority of current state-of-the-art methods employ either label disambiguation or averaging strategies. So far, PLL methods without such techniques have been considered impractical. In this paper, we challenge this view by revealing the hidden power of the oldest and most naive PLL method when it is instantiated with deep neural networks. Specifically, we show that, with deep neural networks, the naive model can achieve competitive performance against the other state-of-the-art methods, suggesting it as a strong baseline for PLL. We also address the question of how and why such a naive model works well with deep neural networks. Our empirical results indicate that deep neural networks trained on partially labeled examples generalize very well even in the over-parametrized regime and without label disambiguation or regularization. We point out that existing learning theories on PLL are vacuous in the over-parametrized regime; hence they cannot explain why the deep naive method works. We propose an alternative theory of how deep learning generalizes in PLL problems.

    Index Terms— classification, partial label learning, weakly supervised learning, deep neural network, empirical risk minimization

                              1. INTRODUCTION

State-of-the-art performance on the standard classification task is one of the fastest-growing areas in the field of machine learning. In the standard classification setting, a learner requires an unambiguously labeled dataset. However, it is often hard or even impossible to obtain completely labeled datasets in the real world. Many lines of research have therefore formulated problem settings under which classifiers are trainable with incompletely labeled datasets. These settings are often denoted as weakly supervised. Learning from similar vs. dissimilar pairs [1], learning from positive vs. unlabeled data [2, 3], and multiple instance learning [4, 5] are some examples of weakly supervised learning.

In this paper, we focus on partial label learning (PLL) [6], which is one of the most classic examples of weakly supervised learning. In the PLL problem, classifiers are trained with a set of candidate labels, among which only one label is the ground truth. Web mining [7], ecoinformatics [8], and automatic image annotation [9] are notable examples of real-world instantiations of the PLL problem.

The majority of state-of-the-art parametric methods for PLL involve two types of parameters. One is associated with the label confidence, and the other is the model parameters. These methods iteratively and alternately update the two types of parameters; methods of this type are denoted as identification-based. On the other hand, average-based methods [10, 11] treat all the candidate labels equally, assuming they contribute equally to the trained classifier. Average-based methods do not require any label disambiguation process, so they are much simpler than identification-based methods. However, numerous works [6, 12, 13, 14, 15] pointed out that label disambiguation processes are essential to achieving high performance in PLL problems; hence, attempts to build a high-performance PLL model through the average-based scheme have been avoided.

Contrary to this common belief, we show that one of the oldest and most naive average-based methods can train accurate classifiers on real PLL problems. Specifically, our main contributions are two-fold:

    1. We generalize the classic naive model of [6] to the modern deep learning setting. Specifically, we present a naive surrogate loss for deep PLL. We test our deep naive model's performance and show that it outperforms the existing state-of-the-art methods despite its simplicity.¹

    2. We empirically analyze the unreasonable effectiveness of the naive loss with deep neural networks. Our experiments show closing generalization gaps in the over-parametrized regime, where bounds from existing learning theories are vacuous. We propose an alternative explanation of the workings of deep PLL based on observations of Valle-Perez et al. [16].

† Both authors contributed equally to this work.
¹ All code for the experiments in this paper is publicly available at https://github.com/mikigom/DNPL-PyTorch.
                      2. DEEP NAIVE MODEL FOR PLL

2.1. Problem Formulation

We denote $x \in \mathcal{X}$ as a data point and $y \in \mathcal{Y} = \{1, \ldots, K\}$ as a label, and a set $S \in \mathcal{S} = 2^{\mathcal{Y}} \setminus \emptyset$ such that $y \in S$ as a partial label. A partial label data distribution is defined by a joint data-label distribution $p(x, y)$ and a partial label generating process $p(S|x, y)$, where $p(S|x, y) = 0$ if $y \notin S$. A learner's task is to output a model $\theta$ with small $\mathrm{Err}(\theta) = \mathbb{E}_{(x,y) \sim p(x,y)} \mathbb{I}(h_\theta(x) \neq y)$, given a finite number of partially labeled samples $\{(x_i, S_i)\}_{i=1}^{n}$, where each $(x_i, S_i)$ is independently sampled from $p(x, S)$.

2.2. Deep Naive Loss for PLL

The work of Jin and Ghahramani [6], the first pioneering work on PLL, proposed a simple baseline method for PLL denoted as the 'naive model'. It is defined as follows:

    \hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \frac{1}{|S_i|} \sum_{y \in S_i} \log p(y | x_i; \theta).    (1)

We denote the naive loss as the negative of the above objective. In [6], the authors proposed the disambiguation strategy as a better alternative to the naive model. Moreover, many works on PLL [12, 13, 14, 15] considered this naive model to be low-performing, and it is still commonly believed that label disambiguation processes are crucial to achieving high performance.

In this work, we propose the following differentiable loss to instantiate the naive loss with deep neural networks:

    \hat{l}_n(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log \langle S_{\theta,i}, S_i \rangle,    (2)

    S_{\theta,i} = \mathrm{softmax}(f_\theta(x_i)),    (3)

where $f_\theta(x_i) \in \mathbb{R}^K$ is the output of the neural network and $S_i$ is treated as a binary indicator vector over the candidate labels. The softmax layer is used to make the outputs of the neural network lie in the probability simplex. One can see that the above loss is almost identical to the naive loss in (1) up to constant factors; hence we denote (2) as the deep naive loss, and a model trained with it is denoted as a deep naive model.

The above loss can be identified as a surrogate of the partial label risk defined as follows:

    R_p(\theta) = \mathbb{E}_{(x,S) \sim p(x,S)} \mathbb{I}(h_\theta(x) \notin S),    (4)

where $\mathbb{I}(\cdot)$ is the indicator function. We denote by $\hat{R}_{p,n}(\theta)$ an empirical estimator of $R_p(\theta)$ over $n$ samples. When $h_\theta(x) = \arg\max_i f_{\theta,i}(x)$, one can easily see that the deep naive loss (2) is a surrogate of the partial label risk (4).
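For concreteness, the deep naive loss in Eqs. (2)-(3) can be written in a few lines of PyTorch. The sketch below is an illustration rather than the reference implementation from the linked repository; the function and variable names (deep_naive_loss, candidate_mask) and the small stabilizing epsilon are placeholders of our own choosing.

```python
import torch
import torch.nn.functional as F

def deep_naive_loss(logits, candidate_mask):
    """Surrogate loss of Eq. (2): -log <softmax(f_theta(x)), S>.

    logits:         (batch, K) raw network outputs f_theta(x).
    candidate_mask: (batch, K) binary indicator of the partial label set S.
    """
    probs = F.softmax(logits, dim=1)                       # Eq. (3)
    candidate_prob = (probs * candidate_mask).sum(dim=1)   # inner product <S_theta, S>
    return -torch.log(candidate_prob + 1e-12).mean()       # Eq. (2); epsilon added for stability

# Toy usage: 4 classes, a batch of 2 partially labeled examples.
logits = torch.randn(2, 4, requires_grad=True)
candidate_mask = torch.tensor([[1., 1., 0., 0.],   # S_1 = {0, 1}
                               [0., 1., 0., 1.]])  # S_2 = {1, 3}
loss = deep_naive_loss(logits, candidate_mask)
loss.backward()
```

Note that the loss only pushes probability mass onto the candidate set as a whole; no per-label confidence is estimated or updated, which is exactly what distinguishes this average-based objective from identification-based methods.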
Method            Lost               MSRCv2            Soccer Player       Yahoo! News       Avg. Rank    Reference Presented at
   DNPL       81.1 ±3.7% (2)      54.4 ±4.3% (1)       57.3 ±1.4% (2)      69.1 ±0.9% (1)         1.50             This Work
   CLPL       74.2 ±3.8% (7) •    41.3 ±4.1% (12) •    36.8 ±1.0% (12) •   46.2 ±0.9% (12) •     10.75        [11]       JMLR 11
  CORD        80.6 ±2.6% (4)       47.4 ±4.0% (9) •    45.7 ±1.3% (11) •    62.4 ±1.0% (9) •      8.25        [13]       AAAI 17
   ECOC       70.3 ±5.2% (9) •     50.5 ±2.7% (6) •     53.7 ±2.0% (7) •    66.2 ±1.0% (5) •      6.75        [21]       TKDE 17
 GM-PLL       73.7 ±4.3% (8) •    53.0 ±1.9% (3)        54.9 ±0.9% (4) •    62.9 ±0.7% (8) •      5.75        [22]       TKDE 19
    IPAL      67.8 ±5.3% (10) •   52.9 ±3.9% (4)        54.1 ±1.6% (5) •   60.9 ±1.1% (10) •      7.25        [12]       AAAI 15
 PL-BLC       80.6 ±3.2% (4)      53.6 ±3.7% (2)        54.0 ±0.8% (6) •    67.9 ±0.5% (2) •      3.50        [15]       AAAI 20
   PL-LE      62.9 ±5.6% (11) •    49.9 ±3.7% (7) •     53.6 ±2.0% (8) •    65.3 ±0.6% (6) •      8.00        [23]       AAAI 19
  PLKNN       43.2 ±5.1% (12) •   41.7 ±3.4% (11) •    49.5 ±1.8% (10) •   48.3 ±1.1% (11) •     11.00        [10]        IDA 06
 PRODEN       81.6 ±3.5% (1)      43.4 ±3.3% (10) •     55.3 ±5.6% (3) •    67.5 ±0.7% (3) •      4.25        [24]       ICML 20
   SDIM       80.1 ±3.1% (5)      52.0 ±3.7% (5)       57.7 ±1.6% (1)      66.3 ±1.3% (4) •       3.75        [14]       IJCAI 19
   SURE       78.0 ±3.6% (6) •     48.1 ±3.6% (8) •     53.3 ±1.7% (9) •    64.4 ±1.5% (7) •      7.50        [25]       AAAI 19

Table 1. Benchmark results (mean accuracy ± std) on the real-world datasets. Numbers in parentheses are the per-dataset rankings of the methods, and the sixth column reports the average ranking. The best method on each dataset is emphasized in boldface. •/◦ indicates that our method (DNPL) is significantly better/worse than the comparing method according to an unpaired Welch's t-test at the 5% significance level.

2.3. Existing Theories of Generalization in PLL

In this subsection, we review two existing learning theories and their implications, which may explain the effectiveness of deep naive models.

2.3.1. EPRM Learnability

Under a mild assumption on data distributions, Liu and Dietterich [17] proved that minimizing an empirical partial label risk gives a correct classifier.

Formally, they proved a finite sample complexity bound for the empirical partial risk minimizer (EPRM):

    \hat{\theta}_n = \arg\min_{\theta \in \Theta} \hat{R}_{p,n}(\theta),    (5)

under a mild distributional assumption called the small ambiguity degree condition. The ambiguity degree [11] quantifies the hardness of a PLL problem and is defined as

    \gamma = \sup_{\substack{(x,y) \in \mathcal{X} \times \mathcal{Y},\ p(x,y) > 0 \\ \bar{y} \in \mathcal{Y},\ \bar{y} \neq y}} \Pr_{S \sim p(S|x,y)} [\bar{y} \in S].    (6)

When γ is less than 1, we say the small ambiguity degree condition is satisfied. Intuitively, γ measures how often a specific non-ground-truth label co-occurs with a specific ground-truth label. When such a distractor label co-occurs with a ground-truth label in every instance, it is impossible to disambiguate the label, hence PLL is not EPRM learnable. Under the mild assumption that γ < 1, Liu and Dietterich showed the following sample complexity bound for PLL.

Theorem 1 (PLL sample complexity bound [17]). Suppose the ambiguity degree of a PLL problem is small, 0 ≤ γ < 1. Let $\eta = \log\frac{2}{1+\gamma}$ and $d_{\mathcal{H}}$ be the Natarajan dimension of the hypothesis space $\mathcal{H}$. Define

    n_0(\mathcal{H}, \epsilon, \delta) = \frac{4}{\eta \epsilon} \left( d_{\mathcal{H}} \log 4 d_{\mathcal{H}} + 2 \log K + \log\frac{1}{\eta} + \log\frac{1}{\delta} + 1 \right);

then, when $n > n_0(\mathcal{H}, \epsilon, \delta)$, $\mathrm{Err}(\hat{\theta}_n) < \epsilon$ with probability at least $1 - \delta$.

We denote this result as Empirical Partial Risk Minimization (EPRM) learnability.

2.3.2. Classifier-consistency

A very recent work by Feng et al. [18] proposed new PLL risk estimators by viewing the partial label generation process as a multiple complementary label generation process [19, 20]. One of the proposed estimators is called the classifier-consistent (CC) risk $R_{cc}(\theta)$. For any multi-class loss function $\mathcal{L}: \mathbb{R}^K \times \mathcal{Y} \to \mathbb{R}_+$, $R_{cc}(\theta)$ is defined as follows:

    R_{cc}(\theta) = \mathbb{E}_{(x,S) \sim p(x,S)} \mathcal{L}\left(Q^\top p(y|x; \theta),\ s\right),    (7)

where $Q \in \mathbb{R}^{K \times K}$ is a label transition matrix in the context of multiple complementary label learning and $s$ is a label chosen uniformly at random from $S$. We denote by $\hat{R}_{cc,n}(\theta)$ the empirical counterpart of the risk in Eq. 7.

Feng et al.'s main contribution is to prove an estimation error bound for the CC risk (7). Let $\hat{\theta}_n = \arg\min_{\theta \in \Theta} \hat{R}_{cc,n}(\theta)$ and $\theta^\star = \arg\min_{\theta \in \Theta} R_{cc}(\theta)$ denote the empirical and the true minimizer, respectively. Additionally, $\mathcal{H}_y$ refers to the model hypothesis space for label $y$. Then, the estimation error bound for the CC risk is given as follows.

Theorem 2 (Estimation error bound for the CC risk [18]). Assume the loss function $\mathcal{L}(Q^\top p(y|x;\theta), s)$ is ρ-Lipschitz with respect to its first argument in the 2-norm and upper-bounded by M. Then, for any δ > 0, with probability at least 1 − δ,

    R_{cc}(\hat{\theta}_n) - R_{cc}(\theta^\star) \le 8\rho \sum_{y=1}^{K} \mathfrak{R}_n(\mathcal{H}_y) + 2M \sqrt{\frac{\log\frac{2}{\delta}}{2n}},

where $\mathfrak{R}_n(\mathcal{H}_y)$ denotes the expected Rademacher complexity of the hypothesis space for label $y$, $\mathcal{H}_y$, with sample size $n$.

If a uniform label transition probability is assumed, i.e., $Q_{ij} = \delta_{ij}\,\mathbb{I}(j \in S_j)/(2^{K-1} - 1)$, Eq. 7 becomes equivalent to our deep naive loss (Eq. 2) up to some constant factors. Hence, Theorems 1 and 2 give generalization bounds on the partial risk and the CC risk (same as Eq. 2), respectively.

2.4. Alternative Explanation of Generalization in DNPL

Since the work of [26], the mystery of deep learning's generalization ability has been widely investigated in the standard supervised learning setting. While it is still not fully understood why over-parametrized deep neural networks generalize well, several studies suggest that deep learning models are inherently biased toward simple functions [16, 27]. In particular, Valle-Perez et al. [16] empirically observed that solutions found by stochastic gradient descent (SGD) are biased toward neural networks with smaller complexity. They observed the following universal scaling behavior in the output distribution p(θ) of SGD:

    p(\theta) \lesssim e^{-a C(\theta) + b},    (8)

where $C(\theta)$ is a computable proxy of the (uncomputable) Kolmogorov complexity and $a, b$ are θ-independent constants. One example of such a complexity measure $C(\theta)$ is the Lempel-Ziv complexity [16], which is roughly the length of θ compressed with a ZIP compressor.

In deep naive PLL, the model parameter is a minimizer of the empirical partial label risk $\hat{R}_{p,n}(\theta)$ (Eq. 4). The minimum of $\hat{R}_{p,n}(\theta)$ is wide because many model parameters perfectly fit the given partially labeled examples. The support of SGD's output distribution lies in this wide minimum. According to Eq. 8, this distribution is heavily biased toward parameters with small complexities. One crucial observation is that models fitting inconsistent labels will generally have large complexities, since they have to memorize each example. According to Eq. 8, such models are exponentially unlikely to be output by SGD. Hence the most likely output of the deep naive PLL method is a classifier with small error. As a result, the implications of both Theorems 1 and 2 appear to hold empirically, even though the bounds themselves are vacuous for over-parametrized models.
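To make the complexity proxy concrete, the sketch below estimates a Lempel-Ziv-style complexity of a parameter vector by zlib-compressing a quantized copy of its entries. This is an illustration only: the quantization scheme, bin count, and function name are our assumptions, not the exact measure used in [16].

```python
import zlib
import numpy as np

def lz_complexity_proxy(theta, n_bins=256):
    """Rough Lempel-Ziv-style proxy C(theta): quantize the flattened
    parameters and return the length (in bytes) of their zlib-compressed
    representation."""
    flat = np.asarray(theta, dtype=np.float64).ravel()
    lo, hi = flat.min(), flat.max()
    # Quantize to n_bins levels so that structured parameters compress well.
    q = np.round((flat - lo) / (hi - lo + 1e-12) * (n_bins - 1)).astype(np.uint8)
    return len(zlib.compress(q.tobytes()))

# A structured (low-complexity) parameter vector compresses far better than
# an i.i.d. random (high-complexity) one of the same size.
structured = np.tile(np.linspace(-1.0, 1.0, 100), 1000)  # repeating pattern
random_vec = np.random.randn(100_000)
print(lz_complexity_proxy(structured), lz_complexity_proxy(random_vec))
```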
                              3. EXPERIMENTS

In this section we make two points. First, deep neural network classifiers trained with the naive loss can achieve competitive performance on real-world benchmarks. Second, the generalization gaps of trained classifiers effectively decrease as the training set size increases.

Fig. 1. Generalization gaps with respect to training set size for (a) the Yahoo! News dataset and (b) the Soccer Player dataset. Error bars represent standard deviations over 10 repeated experiments. We ran the same experiment on the other two, smaller datasets (Lost and MSRCv2) but omit those results because they show the same tendency.

3.1. Benchmarks on Real-world PLL Datasets

3.1.1. Datasets and Comparing Methods

We use four real-world datasets: Lost [28], MSRCv2 [8], Soccer Player [9], and Yahoo! News [29]. All real-world datasets can be found at this website.² We denote the proposed method as Deep Naive Partial-label Learning (DNPL). We compare DNPL with eleven baseline methods. There are eight parametric methods: CLPL [11], CORD [13], ECOC [21], PL-BLC [15], PL-LE [23], PRODEN [24], SDIM [14], and SURE [25], and three non-parametric methods: GM-PLL [22], IPAL [12], and PLKNN [10]. Note that both CORD and PL-BLC are deep learning-based PLL methods that include label identification or mean-teaching techniques.

3.1.2. Models and Hyperparameters

We employ a neural network with the architecture d_in−512−256−d_out, where the numbers denote layer dimensions and d_in (d_out) is the input (output) dimension. The neural network has the same size as that of PL-BLC. Batch normalization [30] is applied after each layer, followed by an ELU activation [31]. The Yogi optimizer [32] is used with a fixed learning rate of 10⁻³ and the default momentum parameters (0.9, 0.999).
3.1.3. Benchmark Results

Table 1 reports the means and standard deviations of the observed accuracies. Accuracies of the naive model are measured over 5 repetitions of 10-fold cross-validation, and accuracies of the other methods are measured over 10-fold cross-validation.

The benchmark results indicate that DNPL achieves state-of-the-art performance on all four datasets. In particular, DNPL outperforms PL-BLC, which uses a neural network of the same size as ours, on those datasets. Unlike PL-BLC or CORD, DNPL does not need computationally expensive processes such as label identification and mean-teaching. This means that by simply plugging our surrogate loss into a deep learning classifier, we can build a sufficiently competitive PLL model.

We also observe that on the Soccer Player and Yahoo! News datasets, DNPL outperforms almost all of the comparing methods. Given the large-scale, high-dimensional nature of the Soccer Player and Yahoo! News datasets compared to the other datasets, this observation suggests that DNPL is particularly advantageous on large-scale, high-dimensional datasets.

3.2. Generalization Gaps of Deep Naive PLL

In this section, we empirically show that conventional learning theories (Theorems 1 and 2) cannot explain the learning behavior of DNPL. Figure 1 shows how the gap $|\mathrm{Err}(\hat{\theta}_n) - \hat{R}_{p,n}(\hat{\theta}_n)|$ and the CC risk³ $R_{cc}(\hat{\theta}_n)$ decrease as the dataset size $n$ increases. We observe this gap-closing behavior even though the neural networks are over-parametrized, i.e., the number of parameters (∼10⁵) is much larger than the training set size (∼10⁴).
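For clarity, the sketch below spells out the two quantities behind the plotted gap: the empirical partial label risk of Eq. (4) on partially labeled examples and the 0-1 error against ground-truth labels. It is an illustrative sketch; the function names and the train/test split convention are assumptions rather than the paper's exact evaluation code.

```python
import torch

@torch.no_grad()
def empirical_partial_risk(model, x, candidate_mask):
    """R_hat_{p,n}: fraction of examples whose predicted label falls outside S (Eq. 4)."""
    pred = model(x).argmax(dim=1)
    in_candidates = candidate_mask[torch.arange(len(pred)), pred] > 0
    return (~in_candidates).float().mean().item()

@torch.no_grad()
def true_error(model, x, y):
    """Err: standard 0-1 error against the (hidden) ground-truth labels."""
    pred = model(x).argmax(dim=1)
    return (pred != y).float().mean().item()

# Assumed usage: partial risk on the partially labeled training split,
# true error on a fully labeled held-out split.
# gap = abs(true_error(model, x_test, y_test)
#           - empirical_partial_risk(model, x_train, S_train))
```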
                              4. CONCLUSIONS

This work showed that a simple naive loss suffices to train high-performance deep classifiers with partially labeled examples. Moreover, this method does not require any label disambiguation or explicit regularization. Our observations indicate that the deep naive method's unreasonable effectiveness cannot be explained by existing learning theories. This raises interesting questions deserving further study: 1) To what extent does label disambiguation help learning with partial labels? 2) How does deep learning generalize in partial label learning?

² http://palm.seu.edu.cn/zhangml/
³ We always observed that our over-parametrized neural network achieves zero risk for $R_{cc}(\theta^\star)$; therefore, we omit this term.
                              5. REFERENCES

[1] Yen-Chang Hsu, Zhaoyang Lv, Joel Schlosser, Phillip Odom, and Zsolt Kira, "Multi-class classification without multi-class labels," in ICLR, 2019.

[2] Ryuichi Kiryo, Gang Niu, Marthinus C Du Plessis, and Masashi Sugiyama, "Positive-unlabeled learning with non-negative risk estimator," in NeurIPS, 2017.

[3] Hirotaka Kaji, Hayato Yamaguchi, and Masashi Sugiyama, "Multi task learning with positive and unlabeled data and its application to mental state prediction," in ICASSP, 2018.

[4] Oded Maron and Tomás Lozano-Pérez, "A framework for multiple-instance learning," in NeurIPS, 1998.

[5] Yun Wang, Juncheng Li, and Florian Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," in ICASSP, 2019.

[6] Rong Jin and Zoubin Ghahramani, "Learning with multiple labels," in NeurIPS, 2003.

[7] Jie Luo and Francesco Orabona, "Learning from candidate labeling sets," in NeurIPS, 2010.

[8] Liping Liu and Thomas G Dietterich, "A conditional multinomial mixture model for superset label learning," in NeurIPS, 2012.

[9] Zinan Zeng, Shijie Xiao, Kui Jia, Tsung-Han Chan, Shenghua Gao, Dong Xu, and Yi Ma, "Learning by associating ambiguously labeled images," in CVPR, 2013.

[10] Eyke Hüllermeier and Jürgen Beringer, "Learning from ambiguously labeled examples," Intell Data Anal, 2006.

[11] Timothee Cour, Ben Sapp, and Ben Taskar, "Learning from partial labels," JMLR, 2011.

[12] Min-Ling Zhang and Fei Yu, "Solving the partial label learning problem: An instance-based approach," in AAAI, 2015.

[13] Cai-Zhi Tang and Min-Ling Zhang, "Confidence-rated discriminative partial label learning," in AAAI, 2017.

[14] Lei Feng and Bo An, "Partial label learning by semantic difference maximization," in IJCAI, 2019.

[15] Yan Yan and Yuhong Guo, "Partial label learning with batch label correction," in AAAI, 2020.

[16] Guillermo Valle-Perez, Chico Q Camargo, and Ard A Louis, "Deep learning generalizes because the parameter-function map is biased towards simple functions," in ICLR, 2018.

[17] Liping Liu and Thomas Dietterich, "Learnability of the superset label learning problem," in ICML, 2014.

[18] Lei Feng, Jiaqi Lv, Bo Han, Miao Xu, Gang Niu, Xin Geng, Bo An, and Masashi Sugiyama, "Provably consistent partial-label learning," in NeurIPS, 2020.

[19] Lei Feng and Bo An, "Learning from multiple complementary labels," in ICML, 2020.

[20] Yuzhou Cao and Yitian Xu, "Multi-complementary and unlabeled learning for arbitrary losses and models," in ICML, 2020.

[21] Min-Ling Zhang, Fei Yu, and Cai-Zhi Tang, "Disambiguation-free partial label learning," IEEE Trans Knowl Data Eng, 2017.

[22] Gengyu Lyu, Songhe Feng, Tao Wang, Congyan Lang, and Yidong Li, "GM-PLL: Graph matching based partial label learning," IEEE Trans Knowl Data Eng, 2019.

[23] Ning Xu, Jiaqi Lv, and Xin Geng, "Partial label learning via label enhancement," in AAAI, 2019.

[24] Jiaqi Lv, Miao Xu, Lei Feng, Gang Niu, Xin Geng, and Masashi Sugiyama, "Progressive identification of true labels for partial-label learning," in ICML, 2020.

[25] Lei Feng and Bo An, "Partial label learning with self-guided retraining," in AAAI, 2019.

[26] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, "Understanding deep learning requires rethinking generalization," in ICLR, 2017.

[27] Giacomo De Palma, Bobak Kiani, and Seth Lloyd, "Random deep neural networks are biased towards simple functions," in NeurIPS, 2019.

[28] Gabriel Panis, Andreas Lanitis, Nicholas Tsapatsoulis, and Timothy F Cootes, "Overview of research on facial ageing using the FG-NET ageing database," IET Biometrics, 2016.

[29] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid, "Multiple instance metric learning from automatically labeled bags of faces," in ECCV, 2010.

[30] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.

[31] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," in ICLR, 2016.

[32] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar, "Adaptive methods for nonconvex optimization," in NeurIPS, 2018.