Multi-class Text Classification using BERT-based Active Learning

Sumanth Prabhu†    Moosa Mohamed†    Hemant Misra
Applied Research, Swiggy
arXiv:2104.14289v1 [cs.IR] 27 Apr 2021

Abstract

Text Classification finds interesting applications in the pickup and delivery services industry, where customers require one or more items to be picked up from a location and delivered to a certain destination. Classifying these customer transactions into multiple categories helps understand the market needs for different customer segments. Each transaction is accompanied by a text description provided by the customer to describe the products being picked up and delivered, which can be used to classify the transaction. BERT-based models have proven to perform well in Natural Language Understanding. However, the product descriptions provided by the customers tend to be short, incoherent and code-mixed (Hindi-English) text, which demands fine-tuning of such models with manually labelled data to achieve high accuracy. Collecting this labelled data can prove to be expensive. In this paper, we explore Active Learning strategies to label transaction descriptions cost effectively while using BERT to train a transaction classification model. On TREC-6, AG's News Corpus and an internal dataset, we benchmark the performance of BERT across different Active Learning strategies in multi-class text classification.

1 Introduction

Most of the focus of the machine learning community is on creating better algorithms for learning from data. However, getting useful annotated datasets can prove to be difficult. In many scenarios, we may have access to a large unannotated corpus while it would be infeasible to annotate all instances (Settles, 2009). Thus, weaker forms of supervision (Liang, 2005; Natarajan et al., 2013; Ratner et al., 2017; Mekala and Shang, 2020) were explored to label data cost effectively. In this paper, we focus on Active Learning, a popular technique under Human-in-the-Loop Machine Learning (Monarch, 2019). Active Learning overcomes the labeling bottleneck by choosing a subset of instances from a pool of unlabeled instances to be labeled by the human annotators (a.k.a. the oracle). The goal of identifying this subset is to achieve high accuracy while having labeled as few instances as possible.

BERT-based (Devlin et al., 2019; Sanh et al., 2020; Yinhan Liu et al., 2019) Deep Learning models have been proven to achieve state-of-the-art results on GLUE (Wang et al., 2018), RACE (Lai et al., 2017) and SQuAD (Rajpurkar et al., 2016, 2018). Active Learning has been successfully integrated with various Deep Learning approaches (Gal and Ghahramani, 2016; Gissin and Shalev-Shwartz, 2019). However, the use of Active Learning with BERT-based models for multi-class text classification has not been studied extensively.

In this paper, we consider an industry text classification use-case specific to pickup and delivery services, where customers make use of short text to describe the products to be picked up from a certain location and dropped at a target location. Table 1 shows a few examples of the descriptions used by our customers to describe their transactions. Customers tend to use short, incoherent and code-mixed (using more than one language in the same message¹) textual descriptions of the products. These descriptions, if mapped to a fixed set of possible categories, help assist critical business decisions such as demographic-driven prioritization of categories, launch of new product categories and so on. Furthermore, a transaction may comprise multiple products, which adds to the complexity of the task. In this work, we focus on multi-class classification of transactions, where a single majority category drives the transaction.

† Equal contribution.
¹ In our case, Hindi written in roman case is mixed with English.
Transaction Description                              Category
"Get me dahi 1.5kg"                                  Grocery
  (Translation: Get me 1.5 kilograms of curd)
"Pick up 1 yellow coloured dress"                    Clothes
"3 plate chole bhature"                              Food
"2254/- pay krke samaan uthana hai"                  Package
  (Translation: Pay 2254/- and pick up the package)

Table 1: Instances of actual transaction descriptions used by our customers along with their corresponding categories.

We explored supervised category classification of transactions, which required labelled data for training. Our experiments with different approaches revealed that BERT-based models performed really well for the task. The training data used in this paper was labeled manually by subject matter experts. However, this was proving to be a very expensive exercise and hence necessitated the exploration of cost-effective strategies for collecting manually labelled training data.

The key contributions of our work are as follows:

• Active Learning with incoherent and code-mixed data: An Active Learning framework to reduce the cost of labelling data for multi-class text classification by 85% in an industry use-case.

• BERT for Active Learning in multi-class text classification: The first work, to the best of our knowledge, to explore and compare multiple advanced strategies in Active Learning, such as Discriminative Active Learning, using BERT for multi-class text classification on the publicly available TREC-6 and AG's News Corpus benchmark datasets.

2 Related Work

Active Learning has been widely studied and applied in a variety of tasks including classification (Novak et al., 2006; Tong and Koller, 2002), structured output learning (Roth and Small, 2006; Tomanek and Hahn, 2009; Haffari and Sarkar, 2009), clustering (Bodó et al., 2011) and so on.

2.1 Traditional Active Learning

Popular traditional Active Learning approaches include sampling based on model uncertainty using measures such as entropy (Lewis and Gale, 1994; Zhu et al., 2008), margin sampling (Scheffer et al., 2001) and least model confidence (Settles and Craven, 2008; Culotta and McCallum, 2005; Settles, 2009). Researchers also explored sampling based on the diversity of instances (McCallum and Nigam, 1998; Settles and Craven, 2008; Xu et al., 2016; Dasgupta and Hsu, 2008; Wei et al., 2015) to prevent the selection of redundant instances for labelling. (Baram et al., 2004; Hsu and Lin, 2015; Settles et al., 2008) explored hybrid strategies combining the advantages of uncertainty- and diversity-based sampling. Our work focusses on extending such strategies to Active Learning in a Deep Learning setting.

2.2 Active Learning for Deep Learning

Deep Learning in an Active Learning setting can be challenging because deep models are rarely uncertain during inference and require a relatively large amount of training data. However, Bayesian Deep Learning based approaches (Gal and Ghahramani, 2016; Gal et al., 2017; Kirsch et al., 2019) were leveraged to demonstrate high performance in an active learning setting for image classification. Similar to Traditional Active Learning approaches, researchers explored diversity-based sampling (Sener and Savarese, 2018) and hybrid sampling (Zhdanov, 2019) to address the drawbacks of pure model-uncertainty-based approaches. Moreover, researchers also explored approaches based on Expected Model Change (Settles et al., 2007; Huang et al., 2016; Zhang et al., 2017; Yoo and Kweon, 2019), where the selected instances are expected to result in the greatest change to the current model parameter estimates when their labels are provided. (Gissin and Shalev-Shwartz, 2019) proposed "Discriminative Active Learning (DAL)", where active learning in image classification was posed as a binary classification task in which instances to label were chosen such that the set of labeled instances and the set of unlabeled instances became indistinguishable. However, there is minimal literature on applying these strategies to Natural Language Understanding (NLU). Our work focusses on extending such Deep Active Learning strategies to text classification using BERT-based models.
2.3 Deep Active Learning for Text Classification in NLP

(Siddhant and Lipton, 2018) conducted an empirical study of Active Learning in NLP and observed that Bayesian Active Learning outperformed classical uncertainty sampling across all settings. However, they did not explore the performance of BERT-based models. (Prabhu et al., 2019; Zhang, 2019) explored Active Learning strategies extensible to BERT-based models. (Prabhu et al., 2019) conducted an empirical study to demonstrate that active set selection using the posterior entropy of deep models like FastText.zip (FTZ) is robust to sampling biases and to various algorithmic choices such as BERT. (Zhang, 2019) applied an ensemble of Active Learning strategies to BERT for the task of intent classification. However, neither of these works compares against advanced strategies like Discriminative Active Learning. (Ein-Dor et al., 2020) explored Active Learning with multiple strategies using BERT-based models for binary text classification tasks. Our work is closely related to theirs, with the difference that we focus on multi-class text classification.

3 Methodology

We consider multiple Active Learning strategies to understand their impact on the performance of BERT-based models. We attempt to reproduce settings similar to the study conducted by Ein-Dor et al. (2020) for binary text classification. All strategies share the same pool-based query loop and differ only in how they score unlabeled instances; a sketch of this shared loop is given below, followed by the individual strategies.
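
The following is a minimal sketch of this shared loop. The names (train_model, acquisition_fn, oracle_label) are illustrative placeholders of our own, not functions from the released code we build on.

```python
# Minimal sketch of the pool-based Active Learning loop shared by all
# strategies. Each strategy plugs in a different acquisition_fn.

def active_learning_loop(labeled, unlabeled, train_model, acquisition_fn,
                         oracle_label, batch_size, n_iterations):
    """Grow the labeled set one batch per iteration and retrain each time."""
    model = train_model(labeled)
    for _ in range(n_iterations):
        # Score every unlabeled instance; higher score = more worth labeling.
        scores = [(acquisition_fn(model, x), i) for i, x in enumerate(unlabeled)]
        top = {i for _, i in sorted(scores, reverse=True)[:batch_size]}
        # The oracle (human annotator) labels the selected batch.
        labeled += [(x, oracle_label(x)) for i, x in enumerate(unlabeled) if i in top]
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in top]
        # A fresh model is trained from scratch each iteration (Section 4.2).
        model = train_model(labeled)
    return model
```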
                                                         corpus of news articles (Zhang et al., 2015) which
    • Random We sample instances at random in            comprises of 120,000 training samples and 7,600
      each iteration while leveraging them to retrain    validation samples with 4 classes.
      the model.
                                                         4.2       Experiment Setting
    • Uncertainty (Entropy) Instances selected to        We leverage DistilBERT (Sanh et al., 2020) for the
      retrain the model have the highest entropy in      purpose of the experiments in this paper. We run
      model predictions.                                 each strategy for two random seed settings on the
                                                         TREC-6 Dataset and AG’s News Corpus and one
    • Expected Gradient Length (EGL) (Huang              random seed setting for our Internal Dataset. The
      et al., 2016) This strategy picks instances that   results present here consist of 558 fine-tuning ex-
      are expected to have the largest gradient norm     periments (78 for our Internal dataset and 240 each
      over all possible labelings of the instances.      for the TREC-6 dataset and AG’s News Corpus).
                                                         The experiments were run on AWS Sagemaker’s p3
    • Deep      Bayesian     Active      Learning
                                                         2x large Spot Instances. The implementation was
      (DBAL) (Gal et al., 2017) Instances
                                                         based on the code2 made available by (Gissin and
      are selected to improve the uncertainty
                                                         Shalev-Shwartz, 2019). Table 2 shows the batch
      measures of BERT with Monte Carlo dropout
                                                             2
      using the max-entropy acquisition function.                https://github.com/dsgissin/DiscriminativeActiveLearning
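
As an illustration, the sketches below show how the Uncertainty (Entropy), DBAL, and Core-Set scores could be computed. They assume a Hugging Face transformers-style classification model whose output exposes .logits; they are minimal sketches under that assumption, not the exact code used in our experiments.

```python
import torch
import torch.nn.functional as F

def entropy_scores(model, input_ids, attention_mask):
    """Uncertainty (Entropy): score each instance by the entropy of its
    predicted class distribution; higher entropy = more uncertain."""
    model.eval()
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def dbal_max_entropy_scores(model, input_ids, attention_mask, n_samples=10):
    """DBAL (max-entropy acquisition): average the softmax outputs over
    several stochastic forward passes with dropout left active (MC dropout),
    then score by the entropy of the averaged distribution."""
    model.train()  # keeps dropout layers active during inference
    with torch.no_grad():
        mean_probs = torch.stack([
            F.softmax(model(input_ids=input_ids,
                            attention_mask=attention_mask).logits, dim=-1)
            for _ in range(n_samples)
        ]).mean(dim=0)
    return -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)

def core_set_farthest_first(emb_unlabeled, emb_labeled, k):
    """Core-Set: greedy farthest-first traversal. Repeatedly pick the
    unlabeled point farthest from its nearest labeled/selected point,
    so the chosen batch covers the representation space."""
    selected = []
    # Distance from each unlabeled point to its nearest labeled point.
    dists = torch.cdist(emb_unlabeled, emb_labeled).min(dim=1).values
    for _ in range(k):
        idx = int(torch.argmax(dists))
        selected.append(idx)
        # Update nearest-centre distances with the newly selected point.
        new_d = torch.cdist(emb_unlabeled, emb_unlabeled[idx:idx + 1]).squeeze(1)
        dists = torch.minimum(dists, new_d)
    return selected
```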
4 Experiments

4.1 Data

Internal Dataset: Our internal dataset comprises 55,038 customer transactions, each of which has an associated transaction description provided by the customer. The instances were manually annotated by a team of SMEs and mapped to one of 10 pre-defined categories. The dataset is split into 44,030 training samples and 11,008 validation samples. The categories considered are: {'Food', 'Grocery', 'Package', 'Medicines', 'Household Items', 'Cigarettes', 'Clothes', 'Electronics', 'Keys', 'Documents/Books'}.

TREC-6 Dataset: To verify the reproducibility of our observations, we consider the TREC dataset (Li and Roth, 2002), which comprises 5,452 training samples and 500 validation samples with 6 classes.

AG's News Corpus: To verify the reproducibility of our observations, we also consider the AG's corpus of news articles (Zhang et al., 2015), which comprises 120,000 training samples and 7,600 validation samples with 4 classes.

4.2 Experiment Setting

We leverage DistilBERT (Sanh et al., 2020) for the experiments in this paper. We run each strategy with two random seed settings on the TREC-6 Dataset and AG's News Corpus, and with one random seed setting on our Internal Dataset. The results presented here comprise 558 fine-tuning experiments (78 for our Internal Dataset and 240 each for the TREC-6 Dataset and AG's News Corpus). The experiments were run on AWS SageMaker p3.2xlarge Spot Instances. The implementation was based on the code made available by (Gissin and Shalev-Shwartz, 2019) at https://github.com/dsgissin/DiscriminativeActiveLearning. Table 2 shows the batch size and number of iterations used for the experiments.

In each iteration, a batch of samples is identified and the model is retrained for 1 epoch with a learning rate of 3 × 10⁻⁵. We retrain the models from scratch in each iteration to prevent overfitting (Siddhant and Lipton, 2018; Hu et al., 2019). In other words, a new model is constructed in each iteration without dependency on the model learnt in the previous iteration.

Dataset             Batch Size    # Iterations
Internal Dataset    500           13
TREC-6 Dataset      100           20
AG's News Corpus    100           20

Table 2: Training details for the Internal Dataset, TREC-6 Dataset and AG's News Corpus.

[Figure 1: Comparison of F1 scores of various Active Learning strategies on the Internal Dataset (BatchSize=500, Iterations=13), the TREC-6 Dataset (BatchSize=100, Iterations=20) and AG's News Corpus (BatchSize=100, Iterations=20).]
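
A minimal sketch of this retrain-from-scratch step is shown below. It assumes the Hugging Face transformers DistilBERT classes and is illustrative rather than the exact training code used in our experiments; the model is re-initialized from the pretrained checkpoint each iteration so that no fine-tuned weights carry over, and it trains full-batch purely for brevity.

```python
import torch
from transformers import (DistilBertForSequenceClassification,
                          DistilBertTokenizerFast)

def train_from_scratch(texts, labels, num_classes, lr=3e-5, epochs=1):
    """Fine-tune a freshly re-initialized DistilBERT on the current labeled
    set, so no weights carry over from the previous iteration's model."""
    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=num_classes)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    targets = torch.tensor(labels)
    model.train()
    for _ in range(epochs):  # 1 epoch per iteration in our setting
        # Full-batch for brevity; a real run would iterate over mini-batches.
        optimizer.zero_grad()
        out = model(**enc, labels=targets)  # the model computes the loss itself
        out.loss.backward()
        optimizer.step()
    return model
```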

5 Results

We report results for the Active Learning strategies using F1-score as the classification metric. Figure 1 visualizes the performance of the different approaches. We observe that the best F1-score that Active Learning using BERT achieves on the Internal Dataset is 0.83. A fully supervised setting with DistilBERT using the entire training data also achieved an F1-score of 0.83. Thus, we reduce the labelling cost by 85% while achieving a similar F1-score (13 iterations × 500 instances = 6,500 labeled instances, roughly 15% of the 44,030 training samples). However, we observe that no single strategy significantly outperforms the others. (Lowell et al., 2019), who studied Active Learning for text classification and sequence tagging in non-BERT models, demonstrated the brittleness and inconsistency of Active Learning results. (Ein-Dor et al., 2020) also observed inconsistency of results using BERT in binary classification. Moreover, we observe that the performance improvements over the Random Sampling strategy are also inconsistent. Concretely, on the Internal Dataset and TREC-6, we observe that DBAL and Core-Set perform poorly when compared to Random Sampling. However, on the AG's News Corpus, we observe contradicting behaviour, where both DBAL and Core-Set outperform the Random Sampling strategy. The Uncertainty sampling strategy performs very similarly to the Random Sampling strategy. EGL performs consistently well across all three datasets. The DAL strategy, proven to have worked well in image classification, is consistent across the datasets; however, it does not outperform the EGL strategy. A deeper analysis of these observations will be conducted in future work.

6 Conclusion

In this paper, we explored multiple Active Learning strategies using BERT. Our goal was to understand whether BERT-based models can prove effective in an Active Learning setting for multi-class text classification. We observed that BERT-based models can be trained cost effectively with less data using Active Learning strategies. We also observed that none of the strategies significantly outperformed the others, while EGL performs reasonably well across datasets. In future work, we plan to investigate the results of the paper in detail and also explore ensembling approaches that combine the advantages of each strategy to achieve further performance improvements.
References

Yoram Baram, Ran El-Yaniv, and Kobi Luz. 2004. Online choice of active learning algorithms. J. Mach. Learn. Res., 5:255–291.

Zalán Bodó, Zsolt Minier, and Lehel Csató. 2011. Active learning with clustering. In Active Learning and Experimental Design Workshop (in conjunction with AISTATS 2010), volume 16 of Proceedings of Machine Learning Research, pages 127–139, Sardinia, Italy. JMLR Workshop and Conference Proceedings.

Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In Proceedings of the 20th National Conference on Artificial Intelligence - Volume 2, AAAI'05, pages 746–751. AAAI Press.

Sanjoy Dasgupta and Daniel Hsu. 2008. Hierarchical sampling for active learning. In ICML'08, pages 208–215, New York, NY, USA. Association for Computing Machinery.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz, and Noam Slonim. 2020. Active Learning for BERT: An Empirical Study. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7949–7962, Online. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. Bayesian convolutional neural networks with Bernoulli approximate variational inference.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1183–1192. PMLR.

Daniel Gissin and Shai Shalev-Shwartz. 2019. Discriminative active learning.

Gholamreza Haffari and Anoop Sarkar. 2009. Active learning for multilingual statistical machine translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 181–189, Suntec, Singapore. Association for Computational Linguistics.

Wei-Ning Hsu and Hsuan-Tien Lin. 2015. Active learning by learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2659–2665. AAAI Press.

Peiyun Hu, Zachary C. Lipton, Anima Anandkumar, and Deva Ramanan. 2019. Active learning with partial feedback.

Jiaji Huang, Rewon Child, Vinay Rao, Hairong Liu, Sanjeev Satheesh, and Adam Coates. 2016. Active learning for speech recognition: the power of gradients.

Andreas Kirsch, Joost R. van Amersfoort, and Yarin Gal. 2019. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. In NeurIPS.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. CoRR, abs/cmp-lg/9407020.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02, pages 1–7, USA. Association for Computational Linguistics.

Percy Liang. 2005. Semi-supervised learning for natural language. Master's thesis, MIT.

David Lowell, Zachary Chase Lipton, and Byron C. Wallace. 2019. Practical obstacles to deploying active learning. In EMNLP/IJCNLP.

Andrew McCallum and Kamal Nigam. 1998. Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pages 350–358, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Dheeraj Mekala and Jingbo Shang. 2020. Contextualized weak supervision for text classification. In Association for Computational Linguistics, ACL'20.

Robert (Munro) Monarch. 2019. Human-in-the-Loop Machine Learning. Manning Publications.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. 2013. Learning with noisy labels. In Neural Information Processing Systems, NIPS'13.

Blaž Novak, Dunja Mladenič, and Marko Grobelnik. 2006. Text classification with active learning. In From Data and Information Analysis to Knowledge Engineering, pages 398–405, Berlin, Heidelberg. Springer Berlin Heidelberg.

Ameya Prabhu, Charles Dognin, and Maneesh Singh. 2019. Sampling bias in deep active classification: An empirical study. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4058–4068, Hong Kong, China. Association for Computational Linguistics.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing.

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment, VLDB'17.

Dan Roth and Kevin Small. 2006. Margin-based active learning for structured output spaces. In Machine Learning: ECML 2006, pages 413–424, Berlin, Heidelberg. Springer Berlin Heidelberg.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden Markov models for information extraction.

Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach.

Burr Settles. 2009. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In EMNLP '08, pages 1070–1079, USA. Association for Computational Linguistics.

Burr Settles, Mark Craven, and Soumya Ray. 2007. Multiple-instance active learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, pages 1289–1296, Red Hook, NY, USA. Curran Associates Inc.

Burr Settles, Mark Craven, and Soumya Ray. 2008. Multiple-instance active learning. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc.

Aditya Siddhant and Zachary C. Lipton. 2018. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2904–2909, Brussels, Belgium. Association for Computational Linguistics.

Katrin Tomanek and Udo Hahn. 2009. Reducing class imbalance during active learning for named entity annotation. In Proceedings of the Fifth International Conference on Knowledge Capture, K-CAP '09, pages 105–112, New York, NY, USA. Association for Computing Machinery.

Simon Tong and Daphne Koller. 2002. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2:45–66.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Empirical Methods in Natural Language Processing.

Kai Wei, Rishabh Iyer, and Jeff Bilmes. 2015. Submodularity in data subset selection and active learning. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 1954–1963. JMLR.org.

Zuobing Xu, Ram Akella, and Yi Zhang. 2016. Incorporating diversity and density in active learning for relevance feedback.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692. Version 1.

Donggeun Yoo and In So Kweon. 2019. Learning loss for active learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 93–102.

L. Zhang. 2019. An ensemble deep active learning method for intent classification. In Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 649–657, Cambridge, MA, USA. MIT Press.

Ye Zhang, Matthew Lease, and Byron C. Wallace. 2017. Active discriminative text representation learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 3386–3392. AAAI Press.

Fedor Zhdanov. 2019. Diverse mini-batch active learning.

Jingbo Zhu, Huizhen Wang, Tianshun Yao, and Benjamin K. Tsou. 2008. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING'08, pages 1137–1144, USA. Association for Computational Linguistics.