Learning from Crowds with Sparse and Imbalanced Annotations

Ye Shi¹, Shao-Yuan Li∗¹,², Sheng-Jun Huang¹

¹ Ministry of Industry and Information Technology Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
² State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
{shiye1998, lisy, huangsj}@nuaa.edu.cn

∗ This research was supported by the National Natural Science Foundation of China (61906089), the Jiangsu Province Basic Research Program (BK20190408), and the China Postdoctoral Science Foundation (2019TQ0152). Shao-Yuan Li is the corresponding author.

arXiv:2107.05039v1 [cs.LG] 11 Jul 2021

Abstract

Traditional supervised learning requires ground truth labels for the training data, whose collection can be difficult in many cases. Recently, crowdsourcing has established itself as an efficient labeling solution by resorting to non-expert crowds. To reduce the effects of labeling errors, one common practice is to distribute each instance to multiple workers, whereas each worker only annotates a subset of the data, resulting in the sparse annotation phenomenon. In this paper, we note that when meeting with class-imbalance, i.e., when the ground truth labels are class-imbalanced, the sparse annotations are prone to be skewed, which can severely bias the learning algorithm. To combat this issue, we propose one self-training based approach named Self-Crowd that progressively adds confident pseudo-annotations and rebalances the annotation distribution. Specifically, we propose one distribution aware confidence measure to select confident pseudo-annotations, which adopts a resampling strategy to oversample the minority annotations and undersample the majority annotations. On one real-world crowdsourcing image classification task, we show that the proposed method yields more balanced annotations throughout training than distribution agnostic methods and substantially improves the learning performance at different annotation sparsity levels.

[Figure 1: Class distributions of ground truth labels, observed annotations, and two intermediate steps (iterations 5 and 10) of confidence based self-training. In each iteration, 10,000 pseudo-annotations are added into the training data.]

1 Introduction

The achievements of deep learning rely on large amounts of ground truth labels, but collecting them is difficult in many cases. To alleviate this problem, crowdsourcing provides a time- and cost-efficient solution by collecting non-expert annotations from crowd workers. Learning from training data with crowdsourcing annotations is called learning from crowds or crowdsourcing learning, and has attracted much attention in recent years [Thierry et al., 2010].

As crowd workers can make mistakes, one core task in crowdsourcing learning is to deal with annotation noise, for which purpose many approaches have been proposed [Philip and M, 1979; Raykar et al., 2010a; Zhou et al., 2012; Filipe and Francisco, 2018]. In this paper, we move one step further by noticing the sparsity and class-imbalance phenomena of crowdsourcing annotations in real-world applications.

In crowdsourcing, annotation sparsity is common. For example, to reduce the effects of labeling errors, repetitive labeling is employed to introduce labeling redundancy, i.e., each instance is distributed to more than one worker. At the same time, to collect annotations efficiently, a rather large number of workers are employed, whereas each worker only annotates a subset of the data. This results in sparse annotations. We note that when meeting with class-imbalance, i.e., when the ground truth labels of the concerned task are class-imbalanced, the sparsity can lead to an inevitably skewed annotation distribution, which may severely bias the learning algorithm.
Here we show one real-world example. LabelMe [Russell et al., 2008] is an image crowdsourcing dataset, consisting of 1000 training images with annotations collected from 59 workers through the Amazon Mechanical Turk (AMT) platform. On average, each image is annotated by 2.547 workers, and each worker is assigned 43.169 images. On one hand, this sparsity makes estimating each worker's expertise quite challenging. On the other hand, Figure 1 shows the effects of sparsity when it encounters class-imbalance. Besides the ground truth labels and the observed crowd annotations, we also show the results of two intermediate steps of plain confidence based self-training, i.e., the most confident pseudo-annotations are iteratively added into the training data to update the model. The standard deviations over classes are respectively 1.85% for the ground truth labels and 2.95% for the observed annotations, meaning the observed annotations are more skewed. Moreover, the skewness biases the self-training algorithm to prefer majority class annotations and ignore the minority classes, which in turn leads to even more severely skewed annotations and a stronger learning bias. We will show in the experiment section that this bias significantly hurts the learning performance. Nevertheless, this issue has rarely been paid attention to in crowdsourcing learning.

In this paper, we propose one distribution aware self-training based approach to combat this issue. At a high level, we iteratively add confident pseudo-annotations and rebalance the annotation distribution. Within each iteration, we efficiently train a deep learning model using the available annotations, and then use it as a teacher model to generate pseudo-annotations. To alleviate the imbalance issue, we select the most confident pseudo-annotations using resampling strategies, i.e., we undersample the majority classes and oversample the minority classes. Then the learning model is retrained on the combination of observed and pseudo-annotations. We name our approach Self-Crowd, and empirically show its positive effect at different sparsity levels on the LabelMe dataset.
2 The Self-Crowd Framework

With $\mathcal{X} \subset \mathbb{R}^d$ denoting the feature space and $\mathcal{Y} = \{1, 2, \cdots, C\}$ the label space, we use $x \in \mathcal{X}$ to denote an instance, and $y, \bar{y} \in \mathcal{Y}$ to denote its ground truth label and a crowd annotation respectively. Let $D = \{(x_i, \bar{y}_i)\}_{i=1}^{N}$ denote the training data with $N$ instances, where $\bar{y}_i = \{\bar{y}_i^r\}_{r=1}^{R}$ collects the crowdsourcing annotations from $R$ workers.
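To make the notation concrete, the sketch below shows one simple way such data can be represented in code: an $N \times R$ integer matrix whose zero entries mark missing annotations, matching the indicator $\mathbb{I}[\bar{y}_i^r \neq 0]$ used later in Eq. 2. The sizes and the random filling are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Illustrative sketch: store crowd annotations as an N x R matrix, where
# entry (i, r) is worker r's label in {1, ..., C} for instance i and
# 0 marks "not annotated" (the convention behind the indicator in Eq. 2).
N, R, C = 1000, 59, 8          # LabelMe-like sizes (instances, workers, classes)
rng = np.random.default_rng(0)

annotations = np.zeros((N, R), dtype=np.int64)
for i in range(N):
    workers = rng.choice(R, size=3, replace=False)  # ~3 annotations per image
    annotations[i, workers] = rng.integers(1, C + 1, size=3)

mask = annotations != 0        # True where an annotation was observed
print("average annotations per image:", mask.sum(axis=1).mean())
```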
In the following, we first introduce the deep crowdlayer model proposed by [Filipe and Francisco, 2018] as our base learning model, then propose the distribution aware confidence measure to deal with annotation sparsity and class-imbalance, and finally summarize the algorithm procedure.

2.1 Deep Crowdlayer Base Learning Model

With the ubiquitous success of deep neural networks (DNN), deep crowdsourcing learning has been studied by combining the strength of DNNs with crowdsourcing. As one of the pioneers in this direction, [Shadi et al., 2016] extended the classic DS model [Philip and M, 1979] by using a convolutional neural network (CNN) classifier as the latent true label prior, and conducted optimization using the expectation-maximization (EM) algorithm. To avoid the computational overhead of the iterative EM procedure, [Filipe and Francisco, 2018] introduced the crowdlayer model and conducted efficient end-to-end SGD optimization.

[Figure 2: The network architecture of the deep crowdlayer model.]

In detail, using $f(x;\theta) \in [0,1]^{1 \times C}$ to denote the softmax output of the deep neural network classifier $f(\cdot)$ with parameter $\theta$ for some instance $x$, the crowdlayer model introduces $R$ parameters $\{W_r \in \mathbb{R}^{C \times C}\}_{r=1,\cdots,R}$ to capture the annotating process of the crowds, i.e., the annotation of $x$ given by worker $r$ is derived as:

$$\hat{p}(y^r) = \mathrm{softmax}(f(x;\theta) \cdot W_r). \tag{1}$$
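As a concrete illustration, the following PyTorch sketch implements a crowdlayer in the spirit of Eq. 1: one trainable $C \times C$ matrix per worker applied to the classifier's softmax output. This is our reading of the model, not the authors' code, and the identity initialization is an assumed (though common) choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrowdLayer(nn.Module):
    """Sketch of the crowdlayer of Eq. 1: a trainable C x C matrix W_r per
    worker, applied to the classifier's softmax output f(x; theta)."""

    def __init__(self, num_classes: int, num_workers: int):
        super().__init__()
        # Identity initialization: every worker starts out "perfect"
        # (an assumed choice, not prescribed by the text above).
        self.W = nn.Parameter(torch.eye(num_classes).repeat(num_workers, 1, 1))

    def forward(self, class_probs: torch.Tensor) -> torch.Tensor:
        # class_probs: (batch, C) softmax output of the classifier.
        # Returns p_hat(y^r) for every worker: (batch, R, C), as in Eq. 1.
        logits = torch.einsum("bc,rcd->brd", class_probs, self.W)
        return F.softmax(logits, dim=-1)
```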
While $W_r$ is real valued without any structural constraints, it can be used to represent the worker's annotating expertise, i.e., $W_r(i,j)$ can model the process by which instances belonging to class $i$ are annotated with class label $j$ by worker $r$. Larger diagonal values mean better worker expertise. Given a specific loss function $\ell$, e.g., the cross entropy loss used in this paper, the loss over the crowdsourcing training data $D$ is defined as:

$$\mathcal{L} := \sum_{i=1}^{N} \sum_{r=1}^{R} \mathbb{I}[\bar{y}_i^r \neq 0]\, \ell(\hat{p}(y_i^r), \bar{y}_i^r). \tag{2}$$

Here $\mathbb{I}$ is the indicator function. Then, regarding $W_r$ as one crowdlayer after the neural network classifier $f(\cdot)$, [Filipe and Francisco, 2018] proposed to simultaneously optimize the classifier parameter $\theta$ and $W_r$ in an end-to-end manner by minimizing the loss defined in Eq. 2.
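A matching sketch of the masked loss in Eq. 2, under the matrix representation assumed earlier (0 = missing annotation); averaging over the observed annotations rather than summing is our implementation choice and only rescales the gradients:

```python
import torch

def crowd_loss(worker_probs: torch.Tensor, annotations: torch.Tensor) -> torch.Tensor:
    """Masked cross-entropy in the spirit of Eq. 2.

    worker_probs: (batch, R, C) output of CrowdLayer.forward.
    annotations:  (batch, R) labels in {0, 1, ..., C}, where 0 = unannotated.
    """
    mask = (annotations != 0).float()               # indicator I[y_bar != 0]
    target = (annotations - 1).clamp(min=0)         # shift to 0-based classes
    log_p = torch.log(worker_probs.clamp_min(1e-12))
    nll = -log_p.gather(-1, target.unsqueeze(-1)).squeeze(-1)   # (batch, R)
    # Eq. 2 is a plain sum; normalizing by the observed-annotation count
    # is our choice for more stable mini-batch training.
    return (nll * mask).sum() / mask.sum().clamp(min=1)
```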
The network architecture of the crowdlayer model is shown in Figure 2. This architecture and the end-to-end optimization of the loss in Eq. 2 have become the cornerstone of various deep crowdsourcing learning approaches [Tanno et al., 2019; Chu et al., 2021; Zhijun et al., 2020], which mainly differ in the specific structural regularization imposed on the expertise parameters $W_r$, with different motivations. In this paper, we adopt the straightforward crowdlayer as our base learning model for simplicity, and focus on using the self-training idea to deal with the sparsity and class-imbalance issues in crowdsourcing annotations.
2.2 Distribution Aware Confidence Measure

To combat the annotation sparsity and class-imbalance, we use the crowdlayer model as the base model, and propose one distribution aware confidence measure to conduct self-training. During training, we progressively predict pseudo-annotations on the unannotated instances for each worker, add some of them into the training data, and then update the learning model. The most confident pseudo-annotations that contribute to rebalancing the annotation distribution are selected. Next, we explain the measure in detail.

Confidence Confidence is a commonly used measure in self-training, which measures how confident the current model's prediction is for some instance. Using $\hat{p}(y^r)$ defined in Eq. 1 to denote the pseudo-annotation probability of worker $r$ on some unannotated instance $x$, we use its entropy to measure confidence:

$$\mathrm{entropy}(y^r) = -\sum_{c=1}^{C} \hat{p}_c(y^r) \log \hat{p}_c(y^r) \tag{3}$$

where $\hat{p}_c(y^r)$ denotes the $c$-th entry of $\hat{p}(y^r)$. Pseudo-annotations with lower entropy values are considered more confident and more likely to be correct. Following the traditional self-training motivation, the pseudo-annotations with the lowest entropy values would be selected as authentic ones. However, as discussed in the introduction, without taking the class-imbalance issue into account, the learning algorithm would be biased towards selecting majority class annotations and ignoring the minority ones. More seriously, this bias can accumulate throughout the training process, which inevitably damages the performance. In the following, we propose our distribution aware confidence measure.
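In code, the confidence score of Eq. 3 is a one-liner over the crowdlayer outputs; the sketch below (with our assumed tensor shapes) returns one entropy per (instance, worker) pair:

```python
import torch

def entropy_confidence(worker_probs: torch.Tensor) -> torch.Tensor:
    """Entropy of each pseudo-annotation distribution (Eq. 3).

    worker_probs: (batch, R, C) probabilities; returns (batch, R) scores,
    where lower entropy means a more confident pseudo-annotation.
    """
    p = worker_probs.clamp_min(1e-12)      # guard against log(0)
    return -(p * p.log()).sum(dim=-1)
```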
Distribution Aware Confidence Resampling is a common strategy for addressing the class-imbalance problem. It oversamples the minority classes or undersamples the majority classes to avoid the dominant effect of the majority data. In this paper, we adopt the resampling strategy within each class, i.e., the $M_c$ most confident pseudo-annotations of each class $c \in \{1, \cdots, C\}$ are selected:

$$M_c = t_c \cdot M, \qquad \sum_{c=1}^{C} t_c = 1. \tag{4}$$

Here $M$ denotes the total number of selected pseudo-annotations within each iteration, a hyperparameter set by the user. $t_c$ denotes the normalized fraction coefficient of class $c$, which is inversely proportional to the number $N_c'$ of pseudo-annotations of class $c$ among all the generated pseudo-annotations:

$$t_c \propto \frac{1}{N_c'}. \tag{5}$$
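The selection rule of Eqs. 4-5 can be sketched as follows; the rounding of the per-class quotas and the guard for empty classes are our assumptions, as the text does not spell out these corner cases:

```python
import numpy as np

def select_pseudo_annotations(entropy: np.ndarray, pred_class: np.ndarray,
                              M: int, C: int) -> np.ndarray:
    """Distribution aware selection in the spirit of Eqs. 4-5.

    entropy:    (K,) entropy scores of the K candidate pseudo-annotations.
    pred_class: (K,) predicted class in {1, ..., C} of each candidate.
    Returns the indices of the selected candidates (about M in total).
    """
    counts = np.array([(pred_class == c).sum() for c in range(1, C + 1)])
    t = 1.0 / np.maximum(counts, 1)      # t_c proportional to 1 / N'_c (Eq. 5)
    t = t / t.sum()                      # normalize so the t_c sum to 1
    selected = []
    for c in range(1, C + 1):
        Mc = int(round(t[c - 1] * M))    # per-class quota M_c (Eq. 4)
        cand = np.flatnonzero(pred_class == c)
        order = cand[np.argsort(entropy[cand])]   # most confident first
        selected.extend(order[:Mc].tolist())
    return np.asarray(selected, dtype=int)
```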
Algorithm 1 The Self-Crowd Framework
1: Input:
2:    D = {(x_i, ȳ_i)}_{i=1}^N : crowdsourcing training data
3:    ℓ: loss function
4: Output: classifier f
5: Initialization:
6:    train the crowdlayer model on D using the loss in Eq. 2
7:    obtain the pseudo-annotation predictions of each worker on its unannotated instances using Eq. 1
8: Repeat:
9:    for each pseudo-annotation, calculate its confidence score according to Eq. 3
10:   for each class c, calculate the corresponding selection number M_c according to Eq. 4
11:   select the M_c most confident pseudo-annotations within each class
12:   add the selected pseudo-annotations into the training data and retrain the crowdlayer model
13: Until expected performance reached

Algorithm 1 summarizes the main steps of the Self-Crowd approach. We iteratively predict the unobserved annotations and add the most confident ones into the training data. The pseudo-annotations with lower entropy values that rebalance the annotation distribution are selected according to Eqs. 4-5. Then the learning model is retrained on the combination of observed and pseudo-annotations. This process repeats until the expected performance is reached.
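Putting the pieces together, a high-level sketch of Algorithm 1 might look as follows. Here `train_crowdlayer`, `predict_missing`, and `add_pseudo_annotations` are hypothetical helpers standing in for lines 6, 7, and 12 of the algorithm, while `entropy_confidence` and `select_pseudo_annotations` are the sketches given above.

```python
def self_crowd(model, data, M: int, C: int, num_iters: int):
    """High-level sketch of Algorithm 1 (the helper functions are
    hypothetical stand-ins for the steps described in the text)."""
    train_crowdlayer(model, data)                   # line 6: loss of Eq. 2
    for _ in range(num_iters):                      # lines 8-13
        probs = predict_missing(model, data)        # line 7: Eq. 1 on missing pairs
        conf = entropy_confidence(probs)            # line 9: Eq. 3
        labels = probs.argmax(dim=-1) + 1           # hard pseudo-annotations
        picked = select_pseudo_annotations(         # lines 10-11: Eqs. 4-5
            conf.detach().cpu().numpy().ravel(),
            labels.detach().cpu().numpy().ravel(), M, C)
        data = add_pseudo_annotations(data, picked) # line 12: grow training data
        train_crowdlayer(model, data)               # line 12: retrain
    return model
```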
3 Experiments

3.1 Settings

Dataset We conduct experiments on LabelMe [Russell et al., 2008], a real-world crowdsourcing image classification dataset. LabelMe consists of 1000 training and 1688 testing images from 8 classes: "highway", "inside city", "tall building", "street", "forest", "coast", "mountain" and "open country". The authors distributed the training images to 59 crowd workers through the Amazon Mechanical Turk (AMT) platform, and obtained on average 2.547 annotations per image. The accuracy of individual workers ranges from 0 to 100%, with mean accuracy and standard deviation 69.2% ± 18.1%. As shown in Figure 1, the LabelMe dataset is imbalanced with regard to both the ground truth labels and the collected crowd annotations.

Network and Optimization For a fair comparison, we implement the methods following the setting of [Filipe and Francisco, 2018]. Specifically, we use the pretrained CNN layers of the VGG-16 deep neural network [Simonyan and Zisserman, 2015], with one fully connected layer of 128 units with ReLU activations and one output layer on top. Besides, 50% random dropout is applied. Training is conducted with the Adam optimizer [Kingma and Ba, 2015] for 25 epochs, with batch size 512 and a learning rate of 0.001. L2 weight decay regularization with λ = 0.0005 is used on all layers.
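For reference, a PyTorch approximation of this setup is sketched below. The original implementation of [Filipe and Francisco, 2018] is in Keras, so details such as the flattened feature size (which assumes 224 × 224 inputs) are our assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

C, R = 8, 59                                   # classes, workers (LabelMe)
backbone = vgg16(pretrained=True).features     # pretrained VGG-16 CNN layers
classifier = nn.Sequential(
    backbone,
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 128), nn.ReLU(),    # one 128-unit FC layer
    nn.Dropout(0.5),                           # 50% random dropout
    nn.Linear(128, C), nn.Softmax(dim=-1),     # output layer f(x; theta)
)
crowd_layer = CrowdLayer(C, R)                 # sketch from Section 2.1

optimizer = torch.optim.Adam(
    list(classifier.parameters()) + list(crowd_layer.parameters()),
    lr=0.001, weight_decay=0.0005)             # Adam + L2 weight decay
# Train for 25 epochs with batch size 512, minimizing crowd_loss (Eq. 2).
```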
Baselines To assess the performance of the proposed approach, we conduct comparisons among the following implementations:

Self-Crowd_r: randomly selects pseudo-annotations for training.

Self-Crowd_c: selects the most confident pseudo-annotations, i.e., those with the lowest entropy values according to Eq. 3, without considering class-imbalance.

Self-Crowd: selects pseudo-annotations taking the class-imbalance issue into account, according to Eqs. 3-4.

We examine the classification accuracy on the test images. To avoid the influence of randomness, we repeat the experiments 20 times and report the average results.

[Figure 3: Comparison of Self-Crowd, Self-Crowd_r and Self-Crowd_c from three different perspectives: (a) test accuracy; (b) R of pseudo-annotations; (c) R of combined annotations.]
3.2 Results

Figure 3 (a) shows the test accuracy of the compared methods on the LabelMe dataset as the self-training process iterates. Results for 14 iterations are recorded, and in each iteration 10,000 pseudo-annotations are selected without replacement. As we can see, the test accuracy of Self-Crowd_c and Self-Crowd_r decreases rapidly as self-training proceeds. In contrast, Self-Crowd improves stably.

To examine what happens during the learning procedure, we define a class-imbalance ratio R as follows:

$$R = \frac{N_{\max} - N_{\min}}{N_{anno}} \tag{6}$$

Here $N_{\max}$ and $N_{\min}$ respectively denote the numbers of generated annotations for the most frequent and the least frequent class, and $N_{anno}$ denotes the total number of generated annotations over all classes. $R$ ranges in $[0, 1]$, with smaller values meaning more balanced annotations.
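Computed over an annotation matrix in the representation assumed earlier (0 = missing), Eq. 6 amounts to:

```python
import numpy as np

def imbalance_ratio(annotations: np.ndarray, C: int) -> float:
    """Class-imbalance ratio R of Eq. 6 over labels in {1, ..., C};
    zero entries (missing annotations) are ignored."""
    counts = np.array([(annotations == c).sum() for c in range(1, C + 1)])
    return float(counts.max() - counts.min()) / counts.sum()
```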
We record the variation of R for the pseudo-annotations selected by the three methods during self-training in Figure 3 (b). The R value of Self-Crowd_c increases rapidly, indicating that the confidence based measure mostly selects majority class pseudo-annotations, leading to a severely imbalanced annotation distribution, which in turn badly hurts the learning performance, as shown in Figure 3 (a). The random selection strategy is much better than the confidence based measure but is still biased by the imbalance issue. The proposed distribution aware strategy is more robust and achieves improved performance.

Combining the original observed annotations and the selected pseudo-annotations, Figure 3 (c) shows the variation of R on the combined annotations. The solid and dashed black lines respectively represent the R values of the original observed annotations and the ground truth labels. It can be seen that our proposed method greatly alleviates the class-imbalance issue during learning, whereas the random and confidence based selection measures always lead to more imbalanced annotations. This explains the performance decline of Self-Crowd_c and Self-Crowd_r.
3.3 Various Sparsity Level Study

To examine the effectiveness of our approach at different sparsity levels, we remove fractions of the original observed annotations from LabelMe and repeat the experiment (a sketch of the removal step is given below). Specifically, we remove a fraction p of the observed annotations, with p ranging from 0% to 90%, in a uniformly random manner. To alleviate the effect of randomness, we repeat each experiment 5 times and report the average results. For the self-training process, 5 iterations are conducted, with 10,000 pseudo-annotations selected within each iteration. Figure 4 shows the results.
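Again under the assumed matrix representation with 0 marking a missing annotation, the removal step can be sketched as:

```python
import numpy as np

def remove_annotations(annotations: np.ndarray, p: float,
                       seed: int = 0) -> np.ndarray:
    """Drop a uniformly random fraction p of the observed annotations."""
    rng = np.random.default_rng(seed)
    out = annotations.copy()
    obs = np.argwhere(out != 0)                        # observed (i, r) pairs
    drop = rng.choice(len(obs), size=int(p * len(obs)), replace=False)
    out[obs[drop, 0], obs[drop, 1]] = 0                # mark as unannotated
    return out
```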
[Figure 4: Test accuracy with different sparsity levels.]

The yellow line represents the test accuracy when only the observed annotations are used for training without self-training, i.e., t = 0. It can be seen that our approach always achieves the best and most stable performance. In contrast, the confidence based approach Self-Crowd_c degrades rapidly as the observed annotations decrease, and Self-Crowd_r performs stably but worse than the crowdlayer baseline.
4 Conclusion

In this paper, we propose a self-training based method, Self-Crowd, to deal with the sparsity and class-imbalance issues in crowdsourcing learning. To combat the selection bias towards majority class annotations, we propose a distribution aware confidence measure to select the most confident pseudo-annotations and rebalance the annotation distribution. Experiments on a real-world crowdsourcing dataset show the effectiveness of our approach. As a primary attempt at sparse and imbalanced crowdsourcing learning, the proposed method can be extended by combining it with more sophisticated deep crowdsourcing learning models and selection measures.
References

[Buda et al., 2018] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.

[Chu et al., 2021] Zhendong Chu, Jing Ma, and Hongning Wang. Learning from crowds by modeling common confusions. In AAAI, pages 5832–5840, 2021.

[Devansh et al., 2017] Arpit Devansh, Jastrzębski Stanisław, Ballas Nicolas, Krueger David, Bengio Emmanuel, Kanwal Maxinder S, Maharaj Tegan, Fischer Asja, Courville Aaron, Bengio Yoshua, et al. A closer look at memorization in deep networks. In ICML, pages 233–242, 2017.

[Filipe and Francisco, 2018] Rodrigues Filipe and Pereira C Francisco. Deep learning from crowds. In AAAI, 2018.

[Guan et al., 2018] Melody Y. Guan, Varun Gulshan, Andrew M. Dai, and Geoffrey E. Hinton. Who said what: Modeling individual labelers improves classification. In AAAI, pages 3109–3118, 2018.

[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[Philip and M, 1979] Dawid Alexander Philip and Skene Allan M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28:20–28, 1979.

[Raykar et al., 2010a] V.C. Raykar, S. Yu, L.H. Zhao, G.H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.

[Raykar et al., 2010b] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.

[Russell et al., 2008] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77:157–173, 2008.

[Samira et al., 2018] Pouyanfar Samira, Tao Yudong, Mohan Anup, Tian Haiman, Kaseb Ahmed S., Gauen Kent, Dailey Ryan, Aghajanzadeh Sarah, Lu Yung-Hsiang, and Chen Shu-Ching. Dynamic sampling in convolutional neural networks for imbalanced data classification. In MIPR, pages 112–117, 2018.

[Shadi et al., 2016] Albarqouni Shadi, Baur Christoph, Achilles Felix, Belagiannis Vasileios, Demirci Stefanie, and Navab Nassir. AggNet: Deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Transactions on Medical Imaging, pages 1313–1321, 2016.

[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[Tanno et al., 2019] Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C. Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. In CVPR, pages 11244–11253, 2019.

[Thierry et al., 2010] Buecheler Thierry, Sieg Jan Henrik, Füchslin Rudolf Marcel, and Pfeifer Rolf. Crowdsourcing, open innovation and collective intelligence in the scientific method: a research agenda and operational framework. In ALIFE, pages 679–686, 2010.

[Venanzi et al., 2014] Matteo Venanzi, John Guiver, Gabriella Kazai, Pushmeet Kohli, and Milad Shokouhi. Community-based Bayesian aggregation models for crowdsourcing. In WWW, pages 155–164, 2014.

[Zhijun et al., 2020] Chen Zhijun, Wang Huimin, Sun Hailong, Chen Pengpeng, Han Tao, Liu Xudong, and Yang Jie. Structured probabilistic end-to-end learning from crowds. In IJCAI, pages 1512–1518, 2020.

[Zhou et al., 2012] D. Zhou, S. Basu, Y. Mao, and J.C. Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems 25, pages 2195–2203, 2012.