ON THE POWER OF DEEP BUT NAIVE PARTIAL LABEL LEARNING
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
ON THE POWER OF DEEP BUT NAIVE PARTIAL LABEL LEARNING
Junghoon Seo1† Joon Suk Huh2†
1 2
SI Analytics Co. Ltd, South Korea UW–Madison, USA
jhseo@si-analytics.ai jhuh23@wisc.edu
ABSTRACT In this paper, we focus on Partial label learning [6] (PLL),
arXiv:2010.11600v2 [cs.LG] 8 Feb 2021
Partial label learning (PLL) is a class of weakly supervised which is one of the most classic examples of weakly super-
learning where each training instance consists of a data and vised learning. In the PLL problem, classifiers are trained
a set of candidate labels containing a unique ground truth la- with a set of candidate labels, among which only one label
bel. To tackle this problem, a majority of current state-of- is the ground truth. Web mining [7], ecoinformatic [8], and
the-art methods employs either label disambiguation or av- automatic image annotation [9] are notable examples of real-
eraging strategies. So far, PLL methods without such tech- world instantizations of the PLL problem.
niques have been considered impractical. In this paper, we The majority of state-of-the-art parametric methods for
challenge this view by revealing the hidden power of the old- PLL involves two types of parameters. One is associated
est and naivest PLL method when it is instantiated with deep with the label confidence, and the other is the model parame-
neural networks. Specifically, we show that, with deep neu- ters. These methods iteratively and alternatively update these
ral networks, the naive model can achieve competitive perfor- two types of parameters. This type of methods is denoted as
mances against the other state-of-the-art methods, suggesting identification-based. On the other hand, average-based meth-
it as a strong baseline for PLL. We also address the question ods [10, 11] treat all the candidate labels equally, assuming
of how and why such a naive model works well with deep they contribute equally to the trained classifier. Average-
neural networks. Our empirical results indicate that deep neu- based methods do not require any label disambiguation pro-
ral networks trained on partially labeled examples generalize cesses so they are much simpler than identification-based
very well even in the over-parametrized regime and without methods. However, numerous works [6, 12, 13, 14, 15]
label disambiguations or regularizations. We point out that pointed out that the label disambiguation processes are es-
existing learning theories on PLL are vacuous in the over- sential to achieve high-performance in PLL problems, hence,
parametrized regime. Hence they cannot explain why the attempts to build a high-performance PLL model through the
deep naive method works. We propose an alternative theory average-based scheme have been avoided.
on how deep learning generalize in PLL problems. Contrary to this common belief, we show that one of
naivest and oldest average-based methods can train accu-
Index Terms— classification, partial label learning,
rate classifiers in real PLL problems. Specifically, our main
weakly supervised learning, deep neural network, empiri-
contributions are two-fold:
cal risk minimization
1. We generalize the classic naive model of [6] to the
1. INTRODUCTION modern deep learning setting. Specifically, we present
a naive surrogate loss for deep PLL. We test our deep
State-of-the-art performance of the standard classification naive model’s performance and show that it outper-
task is one of the fastest-growing in the field of machine forms the existing state-of-the-art methods despite its
learning. In the standard classification setting, a learner re- simplicity.1
quires an unambiguously labeled dataset. However, it is
often hard or even not possible to obtain completely labeled 2. We empirically analyze the unreasonable effectiveness
datasets in the real world. Many pieces of research formu- of the naive loss with deep neural networks. Our exper-
lated problem settings under which classifiers are trainable iments shows closing generalization gaps in the over-
with incompletely labeled datasets. These settings are often parametrized regime where bounds from existing learn-
denoted as weakly supervised. Learning from similar vs. dis- ing theories are vacuous. We propose an alternative ex-
similar pairs [1], Learning from positive vs. unlabeled data planation of the working of deep PLL based on obser-
[2, 3], Multiple instance learning [4, 5] are some examples of vations of Valle-Perez et al. [16].
weakly supervised learning. 1 All codes for the experiments in this paper are public on https://
† Both authors contributed equally to this work. github.com/mikigom/DNPL-PyTorch.2. DEEP NAIVE MODEL FOR PLL 2.3. Existing Theories of Generalization in PLL
In this sub-section, we review two existing learning theories
2.1. Problem Formulation
and their implications which may explain the effectiveness of
We denote x ∈ X as a data and y ∈ Y = {1, . . . , K} as deep naive models.
a label, and a set S ∈ S = 2Y \ ∅ such that y ∈ S as a
partial label. A partial label data distribution is defined by a 2.3.1. EPRM Learnability
joint data-label distribution p(x, y) and a partial label gener-
Under a mild assumption on data distributions, Liu and Diet-
ating process p(S|x, y) where p(S|x, y) = 0 if y ∈ / S. A
terich [17] proved that minimizing an empirical partial label
learner’s task is to output a model θ with small Err(θ) =
risk gives a correct classifier.
E(x,y)∼p(x,y) I (hθ (x) 6= y) given with a finite number of par-
n Formally, they proved a finite sample complexity bound
tially labeled samples {(xi , Si )}i=1 , where each (xi , Si ) is
for the empirical partial risk minimizer (EPRM):
independently sampled from p(x, S).
θ̂n = arg min R̂p,n (θ), (5)
θ∈Θ
2.2. Deep Naive Loss for PLL
under a mild distributional assumption called small ambiguity
The work of Jin and Gharhramani [6], which is the first pio- degree condition. The ambiguity degree [11] quantifies the
neering work on PLL, proposed a simple baseline method for hardness of a PLL problem and is defined as
PLL denoted as the ‘Naive model’. It is defined as follows: γ= sup Pr [ȳ ∈ S] . (6)
(x,y)∈X ×Y, S∼p(S|x,y)
n
X 1 X ȳ∈Y:p(x,y)>0, ȳ6=y
θ̂ = arg max log p (y|xi ; θ) . (1)
θ∈Θ i=1
|Si | When γ is less than 1, we say the small ambiguity degree
y∈Si
condition is satisfied. Intuitively, it measures how a specific
We denote the naive loss as the negative of the objective in non-ground-truth label co-occurs with a specific ground-truth
the above. In [6], the authors proposed the disambiguation label. When such distractor labels co-occurs with a ground-
strategy as a better alternative to the naive model. Moreover, truth label in every instance, it is impossible to disambiguate
many works on PLL [12, 13, 14, 15] considered this naive the label hence PLL is not EPRM learnable. With the mild as-
model to be low-performing and it is still commonly believed sumption that γ < 1, Liu and Ditterich showed the following
that label disambiguation processes are crucial in achieving sample complexity bound for PLL,
high-performance. Theorem 1. (PLL Sample complexity bound [17]). Suppose
In this work, we propose the following differentiable loss the ambiguity degree of a PLL problem is small, 0 ≤ γ < 1.
to instantiate the naive loss with deep neural networks: 2
Let η = log 1+γ and dH be the Natarajan dimension of the
n
hypothesis space H. Define
ˆln (θ) = − 1
X
log hS θ,i Si i , (2) n0 (H, , δ) =
n i=1
4 1 1
S θ,i = SOFTMAX (fθ (xi )) , (3) dH log 4dH + 2 log K + log + log + 1 ,
η η δ
where fθ (xi ) ∈ RK is the output of the neural network. The then when n > n0 (H, , δ), Err(θ̂n ) < with probability at
softmax layer is used to make the outputs of the neural net- least 1 − δ.
work lie in the probability simplex. One can see that the above
loss is almost identical to the naive loss in (1) up to constant We denote this result as Empirical Partial Risk Minimiza-
factors, hence we denote (2) as the deep naive loss while a tion (EPRM) learnability.
model trained from it is denoted as a deep naive model.
The above loss can be identified as a surrogate of the par- 2.3.2. Classifier-consistency
tial label risk defined as follows: A very recent work by Feng et al. [18] proposed new PLL risk
estimators by viewing the partial label generation process as
Rp (θ) = E I (hθ (x) ∈
/ S) , (4) a multiple complementary label generation process [19, 20].
(x,S)∼p(x,S)
One of the proposed estimators is called classifier-consistent
(CC) risk Rcc (θ). For any multi-class loss function L : RK ×
where I (·) is the indicator function. We denote R̂p,n (θ) as an
Y → R+ , Rcc (θ) it is defined as follows:
empirical estimator of Rp (θ) over n samples. When hθ (x) =
arg maxi fθ,i (x), one can easily see that the deep naive loss L Q> p (y|x; θ) , s ,
Rcc (θ) = E (7)
(2) is a surrogate of the partial-label risk (4). (x,S)∼p(x,S)Method Lost MSRCv2 Soccer Player Yahoo! News Avg. Rank Reference Presented at
DNPL 81.1 ±3.7% (2) 54.4 ±4.3% (1) 57.3 ±1.4% (2) 69.1 ±0.9% (1) 1.50 This Work
CLPL 74.2 ±3.8% (7) • 41.3 ±4.1% (12) • 36.8 ±1.0% (12) • 46.2 ±0.9% (12) • 10.75 [11] JMLR 11
CORD 80.6 ±2.6% (4) 47.4 ±4.0% (9) • 45.7 ±1.3% (11) • 62.4 ±1.0% (9) • 8.25 [13] AAAI 17
ECOC 70.3 ±5.2% (9) • 50.5 ±2.7% (6) • 53.7 ±2.0% (7) • 66.2 ±1.0% (5) • 6.75 [21] TKDE 17
GM-PLL 73.7 ±4.3% (8) • 53.0 ±1.9% (3) 54.9 ±0.9% (4) • 62.9 ±0.7% (8) • 5.75 [22] TKDE 19
IPAL 67.8 ±5.3% (10) • 52.9 ±3.9% (4) 54.1 ±1.6% (5) • 60.9 ±1.1% (10) • 7.25 [12] AAAI 15
PL-BLC 80.6 ±3.2% (4) 53.6 ±3.7% (2) 54.0 ±0.8% (6) • 67.9 ±0.5% (2) • 3.50 [15] AAAI 20
PL-LE 62.9 ±5.6% (11) • 49.9 ±3.7% (7) • 53.6 ±2.0% (8) • 65.3 ±0.6% (6) • 8.00 [23] AAAI 19
PLKNN 43.2 ±5.1% (12) • 41.7 ±3.4% (11) • 49.5 ±1.8% (10) • 48.3 ±1.1% (11) • 11.00 [10] IDA 06
PRODEN 81.6 ±3.5% (1) 43.4 ±3.3% (10) • 55.3 ±5.6% (3) • 67.5 ±0.7% (3) • 4.25 [24] ICML 20
SDIM 80.1 ±3.1% (5) 52.0 ±3.7% (5) 57.7 ±1.6% (1) 66.3 ±1.3% (4) • 3.75 [14] IJCAI 19
SURE 78.0 ±3.6% (6) • 48.1 ±3.6% (8) • 53.3 ±1.7% (9) • 64.4 ±1.5% (7) • 7.50 [25] AAAI 19
Table 1. Benchmark results (mean accuracy±std) on the real-world datasets. Numbers in parenthesis represent rankings of com-
paring methods and the sixth column is the average rankings. Best methods are emphasized in boldface. •/◦ indicates whether
our method (DNPL) is better/worse than the comparing methods with respect to unpaired Welch t-test at 5% significance level.
where Q ∈ RK×K is a label transition matrix in the context Especially, Valle-Perez et al. [16] empirically observed that
of multiple complementary label learning, s is a uniformly solutions from stochastic gradient descent (SGD) are biased
randomly chosen label from S. R̂cc,n (θ) is denoted as empir- toward neural networks with smaller complexity. They ob-
ical risk of Eq. 7. served the following universal scaling behavior in the output
Feng et al.’s main contribution is to prove an estimation distribution p(θ) of SGD:
error bound for the CC risk (7). Let θ̂n = arg minθ∈Θ R̂cc,n (θ)
and θ? = arg minθ∈Θ Rcc (θ) denote the empirical and the p(θ) . e−aC(θ)+b , (8)
true minimizer, respectively. Additionally, Hy refers the
model hypothesis space for label y. Then, the estimation where C(θ) is a computable proxy of (uncomputable) Kol-
error bound for the CC risk is given as mogorov complexity and a, b are θ-independent constants.
One example of complexity measure C(θ) is Lempel-Ziv
Theorem 2. (Estimation error bound for the CC risk [18]). complexity [16] which is roughly the length of compressed θ
Assume the loss function L Q> p (y|x; θ) , s is ρ-Lipschitz with ZIP compressor.
with respect to the first augment in the 2-norm and upper- In the deep naive PLL, the model parameter is a mini-
bounded by M . Then, for any δ > 0, with probability at least mizer of the empirical partial label risk R̂p,n (θ) (Eq. 4). The
1 − δ, minima of R̂p,n (θ) is wide because there are many model pa-
q 2 rameters perfectly fit to given partially labeled examples. The
Pk log δ
Rcc (θ̂n ) − Rcc (θ? ) ≤ 8ρ y=1 Rn (Hy ) + 2M 2n , support of SGD’s output distribution will lie in this wide min-
ima. According to Eq. 8, this distribution is heavily biased
where Rn (Hy ) refers the expected Rademacher complexity toward parameters with small complexities. One crucial ob-
of the hypothesis space for the label y, Hy , with sample size servation is that models fitting inconsistent labels will gener-
n. ally have large complexities since they have to memorize each
example. According to Eq. 8, such models are exponentially
If the uniform label transition probability is assumed i.e.,
unlikely to be outputted by SGD. Hence the most likely out-
Qij = δij I (j ∈ Sj ) / 2K−1 − 1 , Eq. 7 becomes equiva-
put of the deep naive PLL method is a classifier with small
lent to our deep naive loss (Eq. 2) up to some constant fac-
error. As a result, the implications of both Theorem 1 and
tors. Hence, Theorem 1 and 2 give generalization bounds on
2 appear to be empirically correct in spite of their vacuity of
the partial risk and the CC risk (same as Eq. 2) respectively.
model complexity.
2.4. Alternative Explanation of Generalization in DNPL
3. EXPERIMENTS
Since the work of [26], the mystery of deep learning’s gener-
alization ability has been widely investigated in the standard In this section we give the readers two points. First, deep neu-
supervised learning setting. While it is still not fully under- ral network classifiers trained with the naive loss can achieve
stood why over-parametrized deep neural networks generalize competitive performance in real-world benchmarks. Second,
well, several studies are suggesting that deep learning mod- the generalization gaps of trained classifiers effectively de-
els are inherently biased toward simple functions [16, 27]. crease with respect to the increasing training set size.Fig. 1. Generalization gaps with respect to training set size for (a) Yahoo! dataset and (b) Soccer dataset are shown. Error
bars represent STDs over 10 repeated experiments. Note that we went through the same experiment process for the other two
smaller datasets (Lost / MSRCv2), but these results were omitted because of the same tendency.
3.1. Benchmarks on Real-world PLL Datasets or CORD, DNPL does not need computationally expensive
processes like label identification and mean-teaching. This
3.1.1. Datasets and Comparing Methods
means that by simply borrowing our surrogate loss to the
We use four real-world datasets including Lost [28], MSRCv2 deep learning classifier, we can build a sufficiently competi-
[8], Soccer Player [9], and Yahoo! News [29]. All real- tive PLL model.
world datasets can be found in this website2 . We denote Observing that for Soccer Player and Yahoo! News
the suggested method as Deep Naive Partial label Learning datasets, DNPL outperforms almost all of the comparing
(DNPL). We compare DNPL with eleven baseline methods. methods. Regarding the large-scale and high-dimensional
There are eight parametric methods: CLPL, CORD [13], nature of Soccer Player and Yahoo! News datasets comparing
ECOC, PL-BLC [15], PL-LE [23], PRODEN, SDIM [14], to other datasets, this observation suggests that DNPL has its
SURE, and three non-parametric methods: GM-PLL [22], advantage on large-scale, high-dimensional datasets.
IPAL, PLKNN. Note that both CORD and PL-BLC are deep
learning-based PLL methods which includes label identifica- 3.2. Generalization Gaps of Deep Naive PLL
tion or mean-teaching techniques.
In this section, we empirically show that conventional learn-
ing theories (Theorem 1, 2) cannot explain the learning be-
3.1.2. Models and Hyperparameters haviors of DNPL. Figure 1 shows how the gap |Err(θ̂n ) −
We employ a neural network of the following architecture: R̂p,n (θ̂n )| and the CC risk3 Rcc (θ̂n ) decreases as dataset size
din − 512 − 256 − dout , where numbers represent dimensions n increases. We observe that gap closing behaviors despite
of layers and din (dout ) is input (output) dimension. The neu- the neural networks are over-parametrized, i.e., # of parame-
ral network have the same size as that of PL-BLC. Batch ters ∼ 105 >> the training set size ∼ 104 .
normalization [30] is applied after each layer followed by
ELU activation layer [31]. Yogi optimizer [32] is used with 4. CONCLUSIONS
fixed learning rate 10−3 and default momentum parameters
(0.9, 0.999). This work showed that a simple naive loss is applicable in
training high-performance deep classifiers with partially la-
3.1.3. Benchmark Results beled examples. Moreover, this method does not require any
label disambiguation or explicit regularization. Our observa-
Table 1 reports means and standard deviations of observed tions indicate that the deep naive method’s unreasonable ef-
accuracies. Accuracies of the naive model are measured over fectiveness cannot be explained by existing learning theories.
5 repeated 10-fold cross-validation and accuracies of others These raise interesting questions deserving further studies: 1)
are measured over 10-fold cross-validation. To what extent does the label disambiguation help learning
The benchmark results indicate that DNPL achieves state- with partial labels? 2) How deep learning generalizes in par-
of-the-art performances over all four datasets. Especially, tial label learning?
DNPL outperforms PL-BLC which uses a neural network
of the same size as ours on those datasets. Unlike PL-BLC 3 We have always observed that with our over-parameterized neural net-
2 http://palm.seu.edu.cn/zhangml/ work zero risk can be achieved for Rcc (θ? ). Therefore, we omit this term.5. REFERENCES [17] Liping Liu and Thomas Dietterich, “Learnability of the
superset label learning problem,” in ICML, 2014.
[1] Yen-Chang Hsu, Zhaoyang Lv, Joel Schlosser, Phillip
Odom, and Zsolt Kira, “Multi-class classification with- [18] Lei Feng, Jiaqi Lv, Bo Han, Miao Xu, Gang Niu, Xin
out multi-class labels,” in ICLR, 2019. Geng, Bo An, and Masashi Sugiyama, “Provably con-
sistent partial-label learning,” in NeurIPS, 2020.
[2] Ryuichi Kiryo, Gang Niu, Marthinus C Du Plessis, and
Masashi Sugiyama, “Positive-unlabeled learning with [19] Lei Feng and Bo An, “Learning from multiple comple-
non-negative risk estimator,” in NeurIPS, 2017. mentary labels,” in ICML, 2020.
[3] Hirotaka Kaji, Hayato Yamaguchi, and Masashi [20] Yuzhou Cao and Yitian Xu, “Multi-complementary and
Sugiyama, “Multi task learning with positive and unla- unlabeled learning for arbitrary losses and models,” in
beled data and its application to mental state prediction,” ICML, 2020.
in ICASSP, 2018. [21] Min-Ling Zhang, Fei Yu, and Cai-Zhi Tang,
“Disambiguation-free partial label learning,” IEEE
[4] Oded Maron and Tomás Lozano-Pérez, “A framework
Trans Knowl Data Eng, 2017.
for multiple-instance learning,” in NeurIPS, 1998.
[22] Gengyu Lyu, Songhe Feng, Tao Wang, Congyan Lang,
[5] Yun Wang, Juncheng Li, and Florian Metze, “A com-
and Yidong Li, “Gm-pll: Graph matching based partial
parison of five multiple instance learning pooling func-
label learning,” IEEE Trans Knowl Data Eng, 2019.
tions for sound event detection with weak labeling,” in
ICASSP, 2019. [23] Ning Xu, Jiaqi Lv, and Xin Geng, “Partial label learning
via label enhancement,” in AAAI, 2019.
[6] Rong Jin and Zoubin Ghahramani, “Learning with mul-
tiple labels,” in NeurIPS, 2003. [24] Jiaqi Lv, Miao Xu, Lei Feng, Gang Niu, Xin Geng, and
Masashi Sugiyama, “Progressive identification of true
[7] Jie Luo and Francesco Orabona, “Learning from candi- labels for partial-label learning,” in ICML, 2020.
date labeling sets,” in NeurIPS, 2010.
[25] Lei Feng and Bo An, “Partial label learning with self-
[8] Liping Liu and Thomas G Dietterich, “A conditional guided retraining,” in AAAI, 2019.
multinomial mixture model for superset label learning,”
in NeurIPS, 2012. [26] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin
Recht, and Oriol Vinyals, “Understanding deep learning
[9] Zinan Zeng, Shijie Xiao, Kui Jia, Tsung-Han Chan, requires rethinking generalization,” in ICLR, 2017.
Shenghua Gao, Dong Xu, and Yi Ma, “Learning by as-
sociating ambiguously labeled images,” in CVPR, 2013. [27] Giacomo De Palma, Bobak Kiani, and Seth Lloyd,
“Random deep neural networks are biased towards sim-
[10] Eyke Hüllermeier and Jürgen Beringer, “Learning from ple functions,” in NeurIPS, 2019.
ambiguously labeled examples,” Intell Data Anal, 2006.
[28] Gabriel Panis, Andreas Lanitis, Nicholas Tsapatsoulis,
[11] Timothee Cour, Ben Sapp, and Ben Taskar, “Learning and Timothy F Cootes, “Overview of research on facial
from partial labels,” JMLR, 2011. ageing using the FG-NET ageing database,” IET Bio-
metrics, 2016.
[12] Min-Ling Zhang and Fei Yu, “Solving the partial la-
bel learning problem: An instance-based approach,” in [29] Matthieu Guillaumin, Jakob Verbeek, and Cordelia
AAAI, 2015. Schmid, “Multiple instance metric learning from au-
tomatically labeled bags of faces,” in ECCV, 2010.
[13] Cai-Zhi Tang and Min-Ling Zhang, “Confidence-rated
discriminative partial label learning,” in AAAI, 2017. [30] Sergey Ioffe and Christian Szegedy, “Batch normaliza-
tion: Accelerating deep network training by reducing
[14] Lei Feng and Bo An, “Partial label learning by semantic
internal covariate shift,” in ICML, 2015.
difference maximization,” in IJCAI, 2019.
[31] Djork-Arné Clevert, Thomas Unterthiner, and Sepp
[15] Yan Yan and Yuhong Guo, “Partial label learning with Hochreiter, “Fast and accurate deep network learning
batch label correction,” in AAAI, 2020. by exponential linear units (ELUs),” in ICLR, 2016.
[16] Guillermo Valle-Perez, Chico Q Camargo, and Ard A [32] Manzil Zaheer, Sashank Reddi, Devendra Sachan,
Louis, “Deep learning generalizes because the Satyen Kale, and Sanjiv Kumar, “Adaptive methods for
parameter-function map is biased towards simple func- nonconvex optimization,” in NeurIPS, 2018.
tions,” in ICLR, 2018.You can also read