
Safe Weakly Supervised Learning

Yu-Feng Li
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
liyf@nju.edu.cn

Abstract

Weakly supervised learning (WSL) refers to learning from a large amount of weak supervision data. This includes i) incomplete supervision (e.g., semi-supervised learning); ii) inexact supervision (e.g., multi-instance learning); and iii) inaccurate supervision (e.g., label noise learning). Unlike supervised learning, which typically achieves performance improvement with more labeled data, WSL may sometimes even degrade performance with more weak supervision data. It is thus desirable to study safe WSL, which could robustly improve performance with weak supervision data. In this article, we share our understanding of the problem, from in-distribution data to out-of-distribution data, and discuss possible ways to alleviate it from the aspects of worst-case analysis, ensemble learning, and bi-level optimization. We also share some open problems to inspire future research.

1   Introduction

Machine learning has achieved great success in numerous tasks, particularly in supervised learning such as classification and regression. Most successful techniques, however, such as deep learning [LeCun et al., 2015], require ground-truth labels for a large training data set. In many tasks it can be difficult to attain such strong supervision, because hand-labeled data sets are time-consuming and expensive to collect. It is therefore desirable for machine learning techniques to be able to work well with weakly supervised data [Zhou, 2017].
   Compared to the data in traditional supervised learning, weakly supervised data does not carry a large amount of precise label information. Specifically, three types of weakly supervised data commonly exist [Zhou, 2017].

   • Incomplete supervised data, i.e., only a small subset of the training data is given with labels whereas the other data remain unlabeled. For example, in image categorization it is easy to get a huge number of images from the Internet, whereas only a small subset of images can be annotated due to the annotation cost. A representative technique for this situation is semi-supervised learning [Chapelle et al., 2006], which aims to learn a prediction model by leveraging a number of unlabeled data.

   • Inexact supervised data, i.e., only coarse-grained labels are given. Reconsidering the image categorization task, it is desirable to have every object in the images annotated; however, usually we only have image-level labels rather than object-level labels. One representative technique for this scenario is multi-instance learning [Carbonneau et al., 2018], which aims to improve the performance by exploiting the coarse-grained label information.

   • Inaccurate supervised data, i.e., the given labels are not always ground-truth. Such a situation occurs in various tasks when the annotator is careless or weary, or is not an expert. For this type of label information, label noise learning techniques are one main paradigm for learning a promising prediction from noisy labels [Frénay and Verleysen, 2014].

   In traditional machine learning, it is often expected that techniques such as supervised learning will improve their performance when given more data. This observation, however, no longer holds for weakly supervised learning. Many studies [Li and Zhou, 2015; Li et al., 2017; Guo and Li, 2018; Oliver et al., 2018] report that the usage of weakly supervised data may sometimes lead to performance degradation, that is, the learning performance becomes even worse than that of baseline methods that do not use the weakly supervised data. More specifically, semi-supervised learning using unlabeled data may be worse than vanilla supervised learning with only limited labeled data [Li and Zhou, 2015; Li et al., 2016]. Multi-instance learning may be outperformed by naive methods that simply assign the coarse-grained label to every instance in a bag [Carbonneau et al., 2018]. Label noise learning may be worse than learning from a small amount of high-quality labeled data [Frénay and Verleysen, 2014]. Such phenomena undoubtedly go against the expectation of WSL and limit its effectiveness in a large number of practical tasks.
   Building safe WSL, that is to say, WSL that uses extra weakly supervised data without being inferior to a simple supervised learning model, is the Holy Grail of WSL [Chapelle et al., 2006; Li and Zhou, 2015; Zhou, 2017].


Since the problem was pointed out in [Cozman et al., 2003], there have been many attempts to solve this important and challenging problem. In this article, we review the recent developments on safe WSL and share our contributions on two aspects of safe WSL:

   • For WSL with in-distribution data, we proposed a general ensemble scheme that maximizes the performance gain in the worst case to improve the safeness of WSL.

   • For WSL with out-of-distribution (OOD) data, we give a particular focus to SSL with unseen-class unlabeled data and propose a bi-level optimization based framework to alleviate the potential performance hurt caused by OOD unlabeled examples.

   We will also discuss some open challenges in real-world applications that may have received less notice and deserve more attention.

2   Safe WSL with In-Distribution Data

WSL with in-distribution data, i.e., where all supervised data and weakly supervised data are drawn from the same distribution, is the most natural situation. [Cozman et al., 2003] pointed out that WSL could suffer from the performance degradation problem even with in-distribution data. There are multiple reasons; for example, the assumption adopted by the WSL algorithm is not suitable for the data distribution [Chapelle et al., 2006], or many candidate large-margin decision boundaries exist in a semi-supervised support vector machine (SVM) and prior knowledge is insufficient to help choose the best one [Li and Zhou, 2015], and so on.
   Some attempts have been devoted to this problem [Li and Zhou, 2015; Loog, 2015; Li et al., 2017; Krijthe and Loog, 2017]. For example, [Li and Zhou, 2015] builds safe semi-supervised SVMs by optimizing the worst-case performance gain given a set of candidate low-density separators. [Loog, 2015] proposes to maximize the likelihood gain over a supervised model in the worst case for generative models. [Balsubramani and Freund, 2015] proposes to learn a robust prediction given that the ground-truth label assignment is restricted to a specific candidate set. [Wei et al., 2018] study safe multi-label learning from weakly labeled data; they optimize multi-label evaluation metrics (F1 score and Top-k precision) given that the ground-truth label assignment is realized by a convex combination of base multi-label learners. More introductions can be found in our recent summary [Li and Liang, 2019].
   To address this problem, we propose a general ensemble learning scheme, SAFEW (SAFE Weakly supervised learning) [Li et al., 2021], which learns a prediction by integrating multiple weakly supervised learners. Specifically, we propose a maximin framework that maximizes the performance gain in the worst case. Suppose we have obtained b predictions {f1, · · · , fb} generated by base weakly supervised learners, and let f0 denote the prediction of the baseline approach, i.e., direct supervised learning with only the limited labeled data. Our ultimate goal is to derive a safe prediction f = g({f1, · · · , fb}, f0) which often outperforms the baseline f0 and, meanwhile, would not be worse than f0. In other words, we would like to maximize the performance gain between our prediction and the baseline prediction.
   By assuming that the ground-truth label assignment f* can be realized as a convex combination of the base learners, specifically, f* = Σ_{i=1}^{b} αi fi where α = [α1; α2; · · · ; αb] ≥ 0 is the weight vector of the base learners and Σ_{i=1}^{b} αi = 1, we have the following objective function:
\[
\max_{f}\; \ell\Big(f_0, \sum_{i=1}^{b}\alpha_i f_i\Big) - \ell\Big(f, \sum_{i=1}^{b}\alpha_i f_i\Big) \tag{1}
\]
This is in line with our goal, which is to find a prediction f that maximizes the performance gain against the baseline f0.
   In practice, however, it may be hard to know the precise weights of the base learners. To make the proposal more practical, we further assume that α comes from a convex set M, where M captures the prior knowledge about the importance of the base learners. Without any further information to locate the weights of the base learners, to guarantee safeness we aim to optimize the worst-case performance gain, since, intuitively, the algorithm is robust as long as good performance is guaranteed in the worst case. We then obtain a general formulation for weakly supervised data:
\[
\max_{f}\;\min_{\alpha \in \mathcal{M}}\; \ell\Big(f_0, \sum_{i=1}^{b}\alpha_i f_i\Big) - \ell\Big(f, \sum_{i=1}^{b}\alpha_i f_i\Big) \tag{2}
\]
   The following theorem guarantees the safeness of our proposal for commonly used convex loss functions in both classification and regression tasks, e.g., hinge loss, cross-entropy loss, mean squared loss, etc.

Theorem 1. Suppose the ground truth f* can be constructed by the base learners, i.e., f* ∈ {f | f = Σ_{i=1}^{b} αi fi, α ∈ M}. Let f̂ and α̂ be the optimal solution to Eq.(2). We have ℓ(f̂, f*) ≤ ℓ(f0, f*), and f̂ achieves the maximal performance gain against f0.

   Theorem 1 shows that Eq.(2) is a reasonable formulation for our purpose, that is, the derived optimal solution f̂ from Eq.(2) often outperforms f0 and never gets worse than f0.
   The objective can be globally and efficiently optimized via a simple convex quadratic program or linear program. For example, with the mean squared loss, the objective can be equivalently written as
\[
\min_{\alpha \in \mathcal{M}}\; \alpha^\top F \alpha - v^\top \alpha \tag{3}
\]
where F ∈ R^{b×b} is the linear kernel matrix of the fi's, i.e., Fij = fi^⊤ fj for all 1 ≤ i, j ≤ b, and v = [2 f1^⊤ f0; . . . ; 2 fb^⊤ f0]. Since F is positive semi-definite, Eq.(3) is convex and can be efficiently solved. After obtaining the optimal solution α*, the optimal prediction f̄ = Σ_{i=1}^{b} αi* fi is obtained.
   Moreover, the optimization can be written as a geometric projection problem. Specifically, let Ω = {f | f = Σ_{i=1}^{b} αi fi, α ∈ M}; then f̄ can be rewritten as
\[
\bar{f} = \arg\min_{f \in \Omega}\; \|f - f_0\|^2, \tag{4}
\]
which learns a projection of f0 onto the convex set Ω. Figure 1 illustrates the intuition of our proposed method from the viewpoint of geometric projection.
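   To make the recipe above concrete, the following is a minimal sketch of the mean-squared-loss instantiation of Eq.(3), assuming M is simply the probability simplex over the base learners; the function names (`safew_mse`, `project_to_simplex`) are ours for illustration and this is not the released SAFEW implementation. It finds α by projected gradient descent and returns f̄ = Σ αi* fi.

```python
import numpy as np

def project_to_simplex(a):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(a)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(a) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(a + tau, 0.0)

def safew_mse(F_base, f0, lr=0.01, n_iter=2000):
    """SAFEW with mean squared loss, assuming M = probability simplex.

    F_base: (b, m) array, row i holds base learner i's predictions f_i.
    f0:     (m,)  array, predictions of the purely supervised baseline.
    Returns the aggregated prediction f_bar and the learned weights alpha.
    """
    b = F_base.shape[0]
    F = F_base @ F_base.T        # F_ij = f_i^T f_j        (Eq. 3)
    v = 2.0 * F_base @ f0        # v_i  = 2 f_i^T f_0
    alpha = np.full(b, 1.0 / b)  # start from uniform weights
    for _ in range(n_iter):
        grad = 2.0 * F @ alpha - v          # gradient of alpha^T F alpha - v^T alpha
        alpha = project_to_simplex(alpha - lr * grad)  # step size may need tuning
    f_bar = alpha @ F_base       # f_bar = sum_i alpha_i f_i, i.e. projection of f0 onto Omega
    return f_bar, alpha

# toy usage: three base learners' predictions on five data points
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F_base = rng.normal(size=(3, 5))
    f0 = rng.normal(size=5)
    f_bar, alpha = safew_mse(F_base, f0)
    print(alpha, np.round(f_bar, 3))
```

With a more general convex set M than the simplex, the same quadratic program in Eq.(3) can instead be handed to an off-the-shelf convex solver.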


Figure 1: Illustration of the intuition of our proposal via the projection viewpoint. Intuitively, the proposal learns a projection of f0 onto a convex feasible set Ω.

   It is noteworthy that, compared with previous studies [Li and Zhou, 2015; Balsubramani and Freund, 2015; Li et al., 2017], the SAFEW framework brings multiple advantages to safe WSL. i) The proposal is provably safe as long as the ground-truth label assignment can be expressed as a convex combination of the base learners; in contrast to [Li and Zhou, 2015], which requires that the ground truth be one of the base learners, the condition in Theorem 1 is looser and more practical. ii) Prior knowledge related to the weights of the base learners can be easily embedded in this framework. iii) The framework is readily applicable to many loss functions in both classification and regression, which is more general than [Li et al., 2017], which focuses on regression. iv) The proposed formulation can be globally and efficiently optimized and has an intuitive geometric interpretation.

3   Safe WSL with OOD Data

Previous WSL studies are based on the basic assumption that labeled data and weakly supervised data come from the same distribution. Such an assumption is difficult to satisfy in many practical applications; one common case in SSL is out-of-distribution (OOD) unlabeled data containing classes that are not seen in the labeled set. For example, in medical diagnosis, unlabeled medical images often contain foci different from the diseases to be diagnosed. Faced with OOD weakly supervised data, WSL no longer works well and may even suffer severe performance degradation [Oliver et al., 2018].
   Efforts on safe WSL with OOD data remain limited. We have made particular efforts on the safe SSL problem and proposed a simple and effective safe deep SSL framework, DS3L (Deep Safe Semi-Supervised Learning) [Guo et al., 2020].
   Specifically, in SSL scenarios we are given a set of training data from an unknown distribution, which includes n labeled instances Dl = {(x1, y1), · · · , (xn, yn)} and m unlabeled instances Du = {xn+1, · · · , xn+m}, where x ∈ X ⊆ R^D, y ∈ Y = {1, · · · , C}, D is the input dimension, and C is the number of output classes in the labeled data.
   The goal of SSL is to learn a model h(x; θ) : X → Y, parameterized by θ ∈ Θ, from the training data to minimize the generalization risk R(h) = E_{(X,Y)}[ℓ(h(X; θ), Y)], where ℓ : Y × Y → R refers to a certain loss function, e.g., the mean squared error or the cross-entropy loss.
   SSL usually utilizes the structure of the unlabeled data through the introduction of regularization. The objective of SSL is typically formulated as follows:
\[
\min_{\theta \in \Theta}\; \sum_{i=1}^{n} \ell(h(x_i;\theta), y_i) + \Omega(x;\theta) \quad \text{s.t.}\; x \in D_l \cup D_u, \tag{5}
\]
where Ω(x; θ) refers to the regularization term, e.g., entropy-minimization regularization [Grandvalet and Bengio, 2005] or consistency regularization [Sohn et al., 2020].
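   As a concrete illustration of the regularizer Ω in Eq.(5), the sketch below (ours, not taken from any of the cited systems) computes a simple consistency-regularization term in the spirit of [Sohn et al., 2020]; FixMatch itself additionally uses confidence-thresholded pseudo-labels and strong augmentation. Here `model` is any differentiable classifier and `augment` is an assumed stochastic data augmentation.

```python
import torch
import torch.nn.functional as F

def consistency_regularizer(model, x_unlabeled, augment):
    """A simple consistency term: predictions on two random augmentations
    of the same unlabeled batch are encouraged to agree."""
    p1 = F.softmax(model(augment(x_unlabeled)), dim=1)
    p2 = F.softmax(model(augment(x_unlabeled)), dim=1)
    return F.mse_loss(p1, p2)

def ssl_objective(model, x_lab, y_lab, x_unlab, augment, lam=1.0):
    """Eq.(5)-style objective: supervised loss plus an unlabeled regularizer."""
    supervised = F.cross_entropy(model(x_lab), y_lab)
    regularizer = consistency_regularizer(model, x_unlab, augment)
    return supervised + lam * regularizer
```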
   Different from existing SSL techniques, which use all the unlabeled data, DS3L uses them selectively and keeps tracking the effect on the supervised learning model to prevent performance hazards. Meanwhile, DS3L exploits beneficial unlabeled data as much as possible to improve performance, preventing the performance gains from being too conservative.
   On the one hand, DS3L uses the unlabeled data selectively. The main methodology is to design a weighting function w : R^D → R, parameterized by α ∈ B_d, that maps an instance to a weight. DS3L then tries to find the optimal θ̂(α) that minimizes the corresponding weighted empirical risk,
\[
\hat{\theta}(\alpha) = \arg\min_{\theta \in \Theta}\; \sum_{i=1}^{n} \ell(h(x_i;\theta), y_i) + \sum_{i=n+1}^{n+m} w(x_i;\alpha)\,\Omega(x_i;\theta), \tag{6}
\]
where θ̂(α) denotes the model trained with the weighting function parameterized by α.
   On the other hand, DS3L keeps tracking the supervised performance to prevent performance degradation. Specifically, DS3L requires that the model returned by the weighted empirical risk minimization maximize the generalization performance, i.e.,
\[
\alpha^{*} = \arg\min_{\alpha \in \mathcal{B}_d}\; \mathbb{E}_{(X,Y)}\big[\ell(h(X;\hat{\theta}(\alpha)), Y)\big]. \tag{7}
\]
   In practice the distribution is unknown. Similar to empirical risk minimization, DS3L therefore tries to find the optimal parameters α̂ such that the model returned by optimizing the weighted instance loss also performs well on the labeled data, which acts as an unbiased and reliable estimate of the underlying distribution, i.e.,
\[
\hat{\alpha} = \arg\min_{\alpha \in \mathcal{B}_d}\; \sum_{i=1}^{n} \ell(h(x_i;\hat{\theta}(\alpha)), y_i). \tag{8}
\]
   To simplify the notation, we denote θ̂(α) as θ̂. Taking both Eq.(6) and Eq.(8) into consideration, the objective of our framework can be formulated as the following bi-level optimization problem:
\[
\min_{\alpha \in \mathcal{B}_d}\; \sum_{i=1}^{n} \ell(h(x_i;\hat{\theta}), y_i)
\quad \text{s.t.}\quad
\hat{\theta} = \arg\min_{\theta \in \Theta}\; \sum_{i=1}^{n} \ell(h(x_i;\theta), y_i) + \sum_{i=n+1}^{n+m} w(x_i;\alpha)\,\Omega(x_i;\theta). \tag{9}
\]


   Eq.(9) can be understood in two stages: first, DS3L seeks the optimal model parameters θ̂ via weighted empirical risk minimization; it then evaluates θ̂ on the n labeled instances and optimizes the weighting-function parameters α so that the learned θ̂ achieves better and more reliable performance. Moreover, the bi-level optimization can be efficiently solved via stochastic gradient descent methods [Ren et al., 2018].
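   To illustrate how such a bi-level objective can be approached with stochastic gradients, the sketch below follows the one-step "virtual update" approximation popularized by [Ren et al., 2018]; it is a simplified illustration of the general idea rather than the exact DS3L algorithm. Here `weight_net` stands in for w(·; α), entropy minimization stands in for Ω, all component names are ours, and `torch.func.functional_call` (available in recent PyTorch releases) is used for the functional forward pass.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def entropy(logits):
    """Per-instance prediction entropy, playing the role of Omega(x; theta)
    (entropy-minimization regularization, [Grandvalet and Bengio, 2005])."""
    return -(F.softmax(logits, dim=1) * F.log_softmax(logits, dim=1)).sum(dim=1)

def ds3l_style_step(model, weight_net, x_lab, y_lab, x_unlab,
                    weight_opt, model_opt, inner_lr=0.1):
    """One bi-level update in the spirit of Eq.(9): a one-step 'virtual'
    inner update of theta so that gradients can flow back to alpha."""
    params = dict(model.named_parameters())

    # inner objective (Eq. 6): supervised loss + weighted unlabeled regularizer
    w = torch.sigmoid(weight_net(x_unlab)).squeeze(-1)       # w(x; alpha)
    sup = F.cross_entropy(functional_call(model, params, (x_lab,)), y_lab)
    reg = (w * entropy(functional_call(model, params, (x_unlab,)))).mean()
    inner_loss = sup + reg

    # virtual one-step update of theta, keeping the graph w.r.t. alpha
    grads = torch.autograd.grad(inner_loss, list(params.values()),
                                create_graph=True)
    theta_hat = {name: p - inner_lr * g
                 for (name, p), g in zip(params.items(), grads)}

    # outer objective (Eq. 8): labeled loss of the virtually updated model
    outer_loss = F.cross_entropy(functional_call(model, theta_hat, (x_lab,)),
                                 y_lab)
    weight_opt.zero_grad()
    outer_loss.backward()          # gradients reach weight_net's parameters
    weight_opt.step()

    # real update of theta with the refreshed, detached instance weights
    with torch.no_grad():
        w = torch.sigmoid(weight_net(x_unlab)).squeeze(-1)
    loss = F.cross_entropy(model(x_lab), y_lab) + (w * entropy(model(x_unlab))).mean()
    model_opt.zero_grad()
    loss.backward()
    model_opt.step()
    return loss.item()
```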
   In order to show the safeness of DS3L, we analyze the empirical risk of DS3L compared with the simple supervised method and obtain the following theorem.

Theorem 2. Let θ_SL be the supervised model, i.e., θ_SL = arg min_{θ∈Θ} Σ_{i=1}^{n} ℓ(h(xi; θ), yi). Define the empirical risk as
\[
\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i;\theta), y_i).
\]
Then the empirical risk of the θ̂ returned by DS3L is never worse than that of the θ_SL learned from merely the labeled data, i.e., R̂(θ̂) ≤ R̂(θ_SL).

   Theorem 2 reveals that, compared with previous SSL methods, DS3L achieves safeness in terms of the empirical risk, i.e., with the learned α, the performance is not worse than that of its supervised counterpart.
   We further analyze the generalization risk of DS3L to better understand the effect of the parameter dimension and the size of the labeled data on α, and derive the following theorem.

Theorem 3. Assume that the loss function is λ-Lipschitz continuous w.r.t. α. Let α ∈ B_d be the parameter of the example weighting function w in a d-dimensional unit ball, and let n be the labeled data size. Define the generalization risk as
\[
R(\theta) = \mathbb{E}_{(X,Y)}\big[\ell(h(X;\theta), Y)\big].
\]
Let α* = arg max_{α∈B_d} R(θ̂(α)) be the optimal parameter in the unit ball, and α̂ = arg max_{α∈A} R̂(θ̂(α)) be the empirical optimum among a candidate set A. With probability at least 1 − δ we have
\[
R(\hat{\theta}(\alpha^{*})) \leq R(\hat{\theta}(\hat{\alpha})) + \frac{3\lambda + 4\sqrt{d\ln(n) + 8\ln(2/\delta)}}{\sqrt{n}}.
\]

   Theorem 3 establishes that DS3L approaches the optimal weight at the rate O(√(d ln(n)/n)). Based on Theorems 2 and 3, from the perspectives of both safeness and generalization, it is reasonable to expect that DS3L achieves better generalization performance than baseline supervised learning methods.

4   Open Problems

Although significant progress has been made in safe WSL with in-distribution data and out-of-distribution data, many open problems remain in this area.

   • Safe WSL in dynamic environments. Learning in dynamic environments is far more difficult than in static ones. The challenges come from distribution drift, emerging new classes, feature space changes, and so on. There are some studies trying to tackle these problems [Da et al., 2014]; however, the issue of safeness remains an open problem for weakly supervised learning in dynamic environments, e.g., an interesting problem is when the unlabeled data are useful in online learning.

   • Automated safe WSL [Feurer et al., 2015]. AutoML, which seeks to build an appropriate machine learning model for an unseen dataset in an automatic manner (without human intervention), has received increasing attention recently. However, existing AutoML systems focus on supervised learning, and existing AutoML techniques cannot directly be used for the automated WSL problem. Efforts on automated WSL remain limited. Automated WSL introduces some new challenges, e.g., the various meta-features extracted from a limited number of supervised data are no longer available or suitable, and the use of auxiliary weakly supervised examples may sometimes even be outperformed by direct supervised learning. Therefore, safeness is one of the crucial aspects of AutoWSL, since it is not desirable to have an automated yet performance-degraded WSL system. [Li et al., 2019] first presented an automated learning system for SSL. They incorporate meta-learning with enhanced meta-features to help search for well-performing instantiations, and a large-margin separation method to fine-tune the hyper-parameters as well as alleviate performance deterioration. More efforts are expected to be devoted to this direction.

   • Safe deep WSL. We have introduced SAFEW and other related methods that aim to solve the safe WSL problem with in-distribution data; however, current safe WSL studies typically work on shallow models such as support vector machines, logistic regression, linear regression, etc. Applying WSL techniques to deep neural networks has attracted much attention in recent years owing to the promising results achieved by deep models. However, studies of safe WSL with deep neural networks remain limited. It is expected that efficient schemes for safe deep WSL will be designed.

   • Safe imbalanced WSL. Previous WSL studies typically assume a balanced class distribution in both the labeled set and the unlabeled set. However, it is well known that real-world datasets are often imbalanced or long-tailed. The performance of previous WSL methods decreases seriously when the class distribution is imbalanced, since their predictions are biased toward majority classes and result in low recall on minority classes [Kim et al., 2020]. Some efforts have begun to address the imbalanced SSL problem [Kim et al., 2020], but how to achieve safe performance for imbalanced WSL is still under study and remains an open problem.

Acknowledgments

This work was partially supported by NSFC (61772262) and the Collaborative Innovation Center of Novel Software Technology and Industrialization. The author would like to thank Lan-Zhe Guo for improving the paper.


References

[Balsubramani and Freund, 2015] Akshay Balsubramani and Yoav Freund. Optimally combining classifiers using unlabeled data. In Proceedings of the 28th Conference on Learning Theory, pages 211–225, 2015.

[Carbonneau et al., 2018] Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353, 2018.

[Chapelle et al., 2006] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. The MIT Press, 2006.

[Cozman et al., 2003] Fábio Gagliardi Cozman, Ira Cohen, and Marcelo Cesar Cirelo. Semi-supervised learning of mixture models. In Proceedings of the 20th International Conference on Machine Learning, pages 99–106, Washington, DC, 2003.

[Da et al., 2014] Qing Da, Yang Yu, and Zhi-Hua Zhou. Learning with augmented class by exploiting unlabeled data. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 1760–1766, 2014.

[Feurer et al., 2015] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.

[Frénay and Verleysen, 2014] Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014.

[Grandvalet and Bengio, 2005] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.

[Guo and Li, 2018] Lan-Zhe Guo and Yu-Feng Li. A general formulation for safely exploiting weakly supervised data. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3126–3133, New Orleans, LA, 2018.

[Guo et al., 2020] Lan-Zhe Guo, Zhen-Yu Zhang, Yuan Jiang, Yu-Feng Li, and Zhi-Hua Zhou. Safe deep semi-supervised learning for unseen-class unlabeled data. In Proceedings of the 37th International Conference on Machine Learning, pages 3897–3906, 2020.

[Kim et al., 2020] Jaehyung Kim, Youngbum Hur, Sejun Park, Eunho Yang, Sung Ju Hwang, and Jinwoo Shin. Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. In Advances in Neural Information Processing Systems, pages 14567–14579, 2020.

[Krijthe and Loog, 2017] Jesse H. Krijthe and Marco Loog. Projected estimators for robust semi-supervised classification. Machine Learning, 106(7):993–1008, 2017.

[LeCun et al., 2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[Li and Liang, 2019] Yu-Feng Li and De-Ming Liang. Safe semi-supervised learning: a brief introduction. Frontiers of Computer Science, 13(4):669–676, 2019.

[Li and Zhou, 2015] Yu-Feng Li and Zhi-Hua Zhou. Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):175–188, 2015.

[Li et al., 2016] Yu-Feng Li, James T. Kwok, and Zhi-Hua Zhou. Towards safe semi-supervised learning for multivariate performance measures. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1816–1822, Phoenix, AZ, 2016.

[Li et al., 2017] Yu-Feng Li, Han-Wen Zha, and Zhi-Hua Zhou. Learning safe prediction for semi-supervised regression. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2217–2223, San Francisco, CA, 2017.

[Li et al., 2019] Yu-Feng Li, Hai Wang, Tong Wei, and Wei-Wei Tu. Towards automated semi-supervised learning. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, pages 4237–4244, 2019.

[Li et al., 2021] Yu-Feng Li, Lan-Zhe Guo, and Zhi-Hua Zhou. Towards safe weakly supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):334–346, 2021.

[Loog, 2015] Marco Loog. Contrastive pessimistic likelihood estimation for semi-supervised classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):462–475, 2015.

[Oliver et al., 2018] Avital Oliver, Augustus Odena, Colin Raffel, Ekin Dogus Cubuk, and Ian J. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3239–3250, 2018.

[Ren et al., 2018] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In Proceedings of the 35th International Conference on Machine Learning, pages 4331–4340, 2018.

[Sohn et al., 2020] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In Advances in Neural Information Processing Systems, pages 596–608, 2020.

[Wei et al., 2018] Tong Wei, Lan-Zhe Guo, Yu-Feng Li, and Wei Gao. Learning safe multi-label prediction for weakly labeled data. Machine Learning, 107(4):703–725, 2018.

[Zhou, 2017] Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2017.
