Explanation-Based Human Debugging of NLP Models: A Survey

Piyawat Lertvittayakumjorn and Francesca Toni
Department of Computing, Imperial College London, UK
{pl1515, ft}@imperial.ac.uk

arXiv:2104.15135v2 [cs.CL] 25 Sep 2021

Abstract

Debugging a machine learning model is hard since the bug usually involves the training data and the learning process. This becomes even harder for an opaque deep learning model if we have no clue about how the model actually works. In this survey, we review papers that exploit explanations to enable humans to give feedback and debug NLP models. We call this problem explanation-based human debugging (EBHD). In particular, we categorize and discuss existing work along three dimensions of EBHD (the bug context, the workflow, and the experimental setting), compile findings on how EBHD components affect the feedback providers, and highlight open problems that could be future research directions.

1 Introduction

Explainable AI focuses on generating explanations for AI models as well as for their predictions. It is gaining more and more attention these days since explanations are necessary in many applications, especially in high-stake domains such as healthcare, law, transportation, and finance (Adadi and Berrada, 2018). Some researchers have explored various merits of explanations to humans, such as supporting human decision making (Lai and Tan, 2019; Lertvittayakumjorn et al., 2021), increasing human trust in AI (Jacovi et al., 2020), and even teaching humans to perform challenging tasks (Lai et al., 2020). On the other hand, explanations can benefit the AI systems as well, e.g., when explanations are used to promote system acceptance (Cramer et al., 2008), to verify the model reasoning (Caruana et al., 2015), and to find potential causes of errors (Han et al., 2020).

In this paper, we review progress to date specifically on how explanations have been used in the literature to enable humans to fix bugs in NLP models. We refer to this research area as explanation-based human debugging (EBHD), as a general umbrella term encompassing explanatory debugging (Kulesza et al., 2010) and human-in-the-loop debugging (Lertvittayakumjorn et al., 2020). We define EBHD as the process of fixing or mitigating bugs in a trained model using human feedback given in response to explanations for the model. EBHD is helpful when the training data at hand leads to suboptimal models (due, for instance, to biases or artifacts in the data), and hence human knowledge is needed to verify and improve the trained models. In fact, EBHD is related to three challenging and intertwined issues in NLP: explainability (Danilevsky et al., 2020), interactive and human-in-the-loop learning (Amershi et al., 2014; Wang et al., 2021), and knowledge integration (von Rueden et al., 2021; Kim et al., 2021). Although there are overviews for each of these topics (as cited above), our paper is the first to draw connections among the three towards the final application of model debugging in NLP.

Whereas most people agree on the meaning of the term bug in software engineering, various meanings have been ascribed to this term in machine learning (ML) research. For example, Selsam et al. (2017) considered bugs as implementation errors, similar to software bugs, while Cadamuro et al. (2016) defined a bug as a particularly damaging or inexplicable test error. In this paper, we follow the definition of (model) bugs from Adebayo et al. (2020) as contamination in the learning and/or prediction pipeline that makes the model produce incorrect predictions or learn error-causing associations. Examples of bugs include spurious correlation, labelling errors, and undesirable behavior in out-of-distribution (OOD) testing.

The term debugging is also interpreted differently by different researchers. Some consider debugging as a process of identifying or uncovering causes of model errors (Parikh and Zitnick, 2011; Graliński et al., 2019), while others stress that debugging must not only reveal the causes of problems but also fix or mitigate them (Kulesza et al., 2015; Yousefzadeh and O'Leary, 2019). In this paper, we adopt the latter interpretation.
[Figure 1: A general framework for explanation-based human debugging (EBHD) of NLP models, consisting of the inspected (potentially buggy) model, the humans providing feedback, and a three-step workflow (1: provide explanations, 2: give feedback, 3: update the model). Boxes list examples of the options (considered in the selected studies) for the components or steps in the general framework: potential bug sources (natural artifacts, small training subset, wrong label injection, out-of-distribution tests), explanation scopes (local, global), explanation methods (self-explaining, post-hoc), forms of feedback (label, word, score, feature, attention, reasoning, rule), model update approaches (adjust model parameters, improve training data, influence the training process), and experimental settings (selected participants, crowdsourced participants, simulation).]

[Figure 2: The proposal by Ribeiro et al. (2016) as an instance of the general EBHD framework: a trainer builds an SVM classifier (to be improved) from training data; (1) LIME explanations (local, post-hoc) are shown to Amazon Mechanical Turk workers, who (2) give word-level feedback (e.g., remove words such as "the", "will", "to", "in", "is"), which is used to (3) improve the training data.]

Scope of the survey. We focus on work using explanations of NLP models to expose whether there are bugs and exploit human feedback to fix the bugs (if any). To collect relevant papers, we started from some pivotal EBHD work, e.g., (Kulesza et al., 2015; Ribeiro et al., 2016; Teso and Kersting, 2019), and added EBHD papers citing or being cited by the pivotal work, e.g., (Stumpf et al., 2009; Kulesza et al., 2010; Lertvittayakumjorn et al., 2020; Yao et al., 2021). Next, to ensure that we did not miss any important work, we searched for papers on Semantic Scholar (https://www.semanticscholar.org/) using the Cartesian product of five keyword sets: {debugging}, {text, NLP}, {human, user, interactive, feedback}, {explanation, explanatory}, and {learning}. With 16 queries in total, we collected the top 100 papers (ranked by relevancy) for each query and kept only the ones appearing in at least 2 out of the 16 query results. This resulted in 234 papers which we then manually checked, leading to selecting a few additional papers, including (Han and Ghosh, 2020; Zylberajch et al., 2021). The overall process resulted in 15 papers in Table 1 as the selected studies primarily discussed in this survey. In contrast, some papers from the following categories appeared in the search results, but were not selected since, strictly speaking, they are not in the main scope of this survey: debugging without explanations (Kang et al., 2018), debugging outside the NLP domain (Ghai et al., 2021; Popordanoska et al., 2020; Bekkemoen and Langseth, 2021), refining the ML pipeline instead of the model (Lourenço et al., 2020; Schoop et al., 2020), improving the explanations instead of the model (Ming et al., 2019), and work centered on revealing but not fixing bugs (Ribeiro et al., 2020; Krause et al., 2016; Krishnan and Wu, 2017).
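For concreteness, the 16 queries arise from the Cartesian product of the five keyword sets (1 x 2 x 4 x 2 x 1 = 16); a small sketch of how one might enumerate them (our illustration, not the authors' actual search script):

```python
# Enumerate the 16 search queries formed by the Cartesian product of the keyword sets.
from itertools import product

keyword_sets = [["debugging"],
                ["text", "NLP"],
                ["human", "user", "interactive", "feedback"],
                ["explanation", "explanatory"],
                ["learning"]]
queries = [" ".join(combo) for combo in product(*keyword_sets)]
print(len(queries))  # 16
```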
General Framework. EBHD consists of three main steps, as shown in Figure 1. First, the explanations, which provide interpretable insights into the inspected model and possibly reveal bugs, are given to humans. Then, the humans inspect the explanations and give feedback in response. Finally, the feedback is used to update and improve the model. These steps can be carried out once, as a one-off improvement, or iteratively, depending on how the debugging framework is designed.

As a concrete example, Figure 2 illustrates how Ribeiro et al. (2016) improved an SVM text classifier trained on the 20Newsgroups dataset (Lang, 1995). This dataset has many artifacts which could make the model rely on wrong words or tokens when making predictions, reducing its generalizability (for more details, see section 2.1.3). To perform EBHD, Ribeiro et al. (2016) recruited humans from a crowdsourcing platform (i.e., Amazon Mechanical Turk) and asked them to inspect LIME explanations (i.e., word relevance scores) for model predictions of ten examples. (LIME stands for Local Interpretable Model-agnostic Explanations (Ribeiro et al., 2016); for each model prediction, it returns relevance scores for words in the input text to show how important each word is for the prediction.) Then, the humans gave feedback by identifying words in the explanations that should not have got high relevance scores (i.e., supposed to be the artifacts). These words were then removed from the training data, and the model was retrained. The process was repeated for three rounds, and the results show that the model generalized better after every round. Using the general framework in Figure 1, we can break the framework of Ribeiro et al. (2016) into components as depicted in Figure 2. Throughout the paper, when reviewing the selected studies, we will use the general framework in Figure 1 for analysis, comparison, and discussion.
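The following sketch (ours, simplified, and not the authors' original code) mimics one round of this workflow on placeholder data: train a bag-of-words classifier, show a LIME explanation for one prediction, and retrain after removing the words a feedback provider flagged as artifacts. It assumes the third-party lime package; the flagged word set is a stand-in for real human feedback, and logistic regression stands in for the SVM of the original study since LIME needs predicted probabilities.

```python
# One simplified EBHD round in the style of Ribeiro et al. (2016):
# explain -> collect word-level feedback -> remove flagged words -> retrain.
import re
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["the team won the game", "the senate passed the bill",
               "great match and score", "new policy announced today"]
train_labels = [0, 1, 0, 1]            # 0 = sports, 1 = politics (toy labels)

def fit(texts, labels):
    # Bag-of-words pipeline; probabilities are required by LIME.
    return make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

model = fit(train_texts, train_labels)

# Step 1: provide a local, post-hoc explanation for one prediction.
explainer = LimeTextExplainer(class_names=["sports", "politics"])
exp = explainer.explain_instance("the team played today", model.predict_proba,
                                 num_features=5)
print(exp.as_list())                   # [(word, relevance score), ...]

# Step 2: word-level human feedback (hard-coded here as a stand-in).
flagged = {"the", "today"}             # words judged to be artifacts

# Step 3: update by improving the training data and retraining.
cleaned = [" ".join(w for w in re.findall(r"\w+", t) if w not in flagged)
           for t in train_texts]
model = fit(cleaned, train_labels)
```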
Human Roles. To avoid confusion, it is worth noting that there are actually two human roles in the EBHD process. One, of course, is that of feedback provider(s), looking at the explanations and providing feedback (noted as 'Human' in Figure 1). The other is that of model developer(s), training the model and organizing the EBHD process (not shown in Figure 1).

In practice, a person could be both model developer and feedback provider. This usually happens during the model validation and improvement phase where the developers try to fix the bugs themselves. Sometimes, however, other stakeholders could also take the feedback provider role. For instance, if the model is trained to classify electronic medical records, the developers (who are mostly ML experts) hardly have the medical knowledge to provide feedback. So, they may ask doctors acting as consultants to the development team to be the feedback providers during the model improvement phase. Further, EBHD can be carried out after deployment, with end users as the feedback providers. For example, a model auto-suggesting the categories of new emails to end users can provide explanations supporting the suggestions as part of its normal operation. Also, it can allow the users to provide feedback to both the suggestions and the explanations. Then, a routine written by the developers will be triggered to process the feedback and update the model automatically to complete the EBHD workflow. In this case, we need to care about the trust, frustration, and expectation of the end users while and after they give feedback. In conclusion, EBHD can take place practically both before and after the model is deployed, and many stakeholders can act as the feedback providers, including, but not limited to, the model developers, the domain experts, and the end users.

Paper Organization. Section 2 explains the choices made by existing work to achieve EBHD of NLP models. This illustrates the current state of the field with the strengths and limitations of existing work. Naturally, though, a successful EBHD framework cannot neglect the "imperfect" nature of feedback providers, who may not be an ideal oracle. Hence, Section 3 compiles relevant human factors which could affect the effectiveness of the debugging process as well as the satisfaction of the feedback providers. After that, we identify open challenges of EBHD for NLP in Section 4 before concluding the paper in Section 5.

2 Categorization of Existing Work

Table 1 summarizes the selected studies along three dimensions, amounting to the debugging context (i.e., tasks, models, and bug sources), the workflow (i.e., the three steps in our general framework), and the experimental setting (i.e., the mode of human engagement). We will discuss these dimensions with respect to the broader knowledge of explainable NLP and human-in-the-loop learning, to shed light on the current state of EBHD of NLP models.

2.1 Context

To demonstrate the debugging process, existing works need to set up the bug situation they aim to fix, including the target NLP task, the inspected ML model, and the source of the bug to be addressed.
Paper | Task | Model | Bug sources | Exp. scope | Exp. method | Feedback | Update | Setting
Kulesza et al. (2009) | TC | NB | AR | G,L | SE | LB,WS | M,D | SP
Stumpf et al. (2009) | TC | NB | SS | L | SE | WO | T | SP
Kulesza et al. (2010) | TC | NB | AR | G,L | SE | WO | M,D | SP
Kulesza et al. (2015) | TC | NB | AR | G,L | SE | WO,WS | M | SP
Ribeiro et al. (2016) | TC | SVM | AR | L | PH | WO | D | CS
Koh and Liang (2017) | TC | LR | WL | L | PH | LB | D | SM
Ribeiro et al. (2018b) | VQA, TC | TellQA, fastText | AR | G | PH | RU | D | SP
Teso and Kersting (2019) | TC | LR | AR | L | PH | WO | D | SM
Cho et al. (2019) | TQA | NeOp | AR | L | SE | AT | T | NR
Khanna et al. (2019) | TC | LR | WL | L | PH | LB | D | SM
Lertvittayakumjorn et al. (2020) | TC | CNN | AR,SS,OD | G | PH | FE | T | CS
Smith-Renner et al. (2020) | TC | NB | AR,SS | L | SE | LB,WO | M,D | CS
Han and Ghosh (2020) | TC | LR | WL | L | PH | LB | D | SM
Yao et al. (2021) | TC | BERT* | AR,OD | L | PH | RE | D,T | NR
Zylberajch et al. (2021) | NLI | BERT | AR | L | PH | ES | T | SP

Table 1: Overview of existing work on EBHD of NLP models. The Task, Model, and Bug sources columns describe the context; the Exp. scope, Exp. method, Feedback, and Update columns describe the workflow. We use abbreviations as follows: Task: TC = Text Classification (single input), VQA = Visual Question Answering, TQA = Table Question Answering, NLI = Natural Language Inference / Model: NB = Naive Bayes, SVM = Support Vector Machines, LR = Logistic Regression, TellQA = Telling QA, NeOp = Neural Operator, CNN = Convolutional Neural Networks, BERT* = BERT and RoBERTa / Bug sources: AR = Natural artifacts, SS = Small training subset, WL = Wrong label injection, OD = Out-of-distribution tests / Exp. scope: G = Global explanations, L = Local explanations / Exp. method: SE = Self-explaining, PH = Post-hoc method / Feedback (form): LB = Label, WO = Word, WS = Word Score, ES = Example Score, FE = Feature, RU = Rule, AT = Attention, RE = Reasoning / Update: M = Adjust the model parameters, D = Improve the training data, T = Influence the training process / Setting: SP = Selected participants, CS = Crowdsourced participants, SM = Simulation, NR = Not reported.

2.1.1 Tasks

Most papers in Table 1 focus on text classification with single input (TC) for a variety of specific problems such as e-mail categorization (Stumpf et al., 2009), topic classification (Kulesza et al., 2015; Teso and Kersting, 2019), spam classification (Koh and Liang, 2017), sentiment analysis (Ribeiro et al., 2018b), and auto-coding of transcripts (Kulesza et al., 2010). By contrast, Zylberajch et al. (2021) targeted natural language inference (NLI), which is a type of text-pair classification, predicting whether a given premise entails a given hypothesis. Finally, two papers involve question answering (QA), i.e., Ribeiro et al. (2018b) (focusing on visual question answering (VQA)) and Cho et al. (2019) (focusing on table question answering (TQA)).
Ghai et al. (2021) suggested that most researchers work on TC because, for this task, it is much easier for lay participants to understand explanations and give feedback (e.g., which keywords should be added or removed from the list of top features). Nevertheless, some specific TC tasks, such as authorship attribution (Juola, 2007) and deceptive review detection (Lai et al., 2020), are exceptions because lay people are generally not good at these tasks; thus, they are not suitable for EBHD. Meanwhile, some other NLP tasks require the feedback providers to have linguistic knowledge, such as part-of-speech tagging, parsing, and machine translation. The need for linguists or experts renders experiments for these tasks more difficult and costly. However, we suggest that there are several tasks where the trained models are prone to be buggy but the tasks are underexplored in the EBHD setting, though they are not too difficult to experiment on with lay people. NLI, the focus of (Zylberajch et al., 2021), is one of them. Indeed, McCoy et al. (2019) and Gururangan et al. (2018) showed that NLI models can exploit annotation artifacts and fallible syntactic heuristics to make predictions rather than learning the logic of the actual task. Other tasks and their bugs include: QA, where Ribeiro et al. (2019) found that the answers from models are sometimes inconsistent (i.e., contradicting previous answers); and reading comprehension, where Jia and Liang (2017) showed that models, which answer a question by reading a given paragraph, can be fooled by an irrelevant sentence being appended to the paragraph. These non-TC NLP tasks would be worth exploring further in the EBHD setting.
2.1.2 Models

Early work used Naive Bayes models with bag-of-words (NB) as text classifiers (Kulesza et al., 2009, 2010; Stumpf et al., 2009), which are relatively easy to generate explanations for and to incorporate human feedback into (discussed in section 2.2). Other traditional models used include logistic regression (LR) (Teso and Kersting, 2019; Han and Ghosh, 2020) and support vector machines (SVM) (Ribeiro et al., 2016), both with bag-of-words features. The next generation of tested models involves word embeddings. For text classification, Lertvittayakumjorn et al. (2020) focused on convolutional neural networks (CNN) (Kim, 2014) and touched upon bidirectional LSTM networks (Hochreiter and Schmidhuber, 1997), while Ribeiro et al. (2018b) used fastText, relying also on n-gram features (Joulin et al., 2017). For VQA and TQA, the inspected models used attention mechanisms for attending to relevant parts of the input image or table. These models are Telling QA (Zhu et al., 2016) and Neural Operator (NeOp) (Cho et al., 2018), used by Ribeiro et al. (2018b) and Cho et al. (2019), respectively. While the NLP community nowadays is mainly driven by pre-trained language models (Qiu et al., 2020), with many papers studying their behaviors (Rogers et al., 2021; Hoover et al., 2020), only Zylberajch et al. (2021) and Yao et al. (2021) have used pre-trained language models, including BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), as test beds for EBHD.

2.1.3 Bug Sources

Most of the papers in Table 1 experimented on training datasets with natural artifacts (AR), which cause spurious correlation bugs (i.e., the input texts having signals which are correlated to but not the reasons for specific outputs) and undermine models' generalizability. Out of the 15 papers we surveyed, 5 used the 20Newsgroups dataset (Lang, 1995) as a case study, since it has lots of natural artifacts. For example, some punctuation marks appear more often in one class due to the writing styles of the authors contributing to the class, so the model uses these punctuation marks as clues to make predictions. However, because 20Newsgroups is a topic classification dataset, a better model should focus more on the topic of the content, since the punctuation marks can also appear in other classes, especially when we apply the model to texts in the wild. Apart from classification performance drops, natural artifacts can also cause model biases, as shown in (De-Arteaga et al., 2019; Park et al., 2018) and debugged in (Lertvittayakumjorn et al., 2020; Yao et al., 2021).

In the absence of strong natural artifacts, bugs can still be simulated using several techniques. First, using only a small subset of labeled data (SS) for training could cause the model to exploit spurious correlation, leading to poor performance (Kulesza et al., 2010). Second, injecting wrong labels (WL) into the training data can obviously blunt the model quality (Koh and Liang, 2017). Third, using out-of-distribution tests (OD) can reveal that the model does not work effectively in the domains that it has not been trained on (Lertvittayakumjorn et al., 2020; Yao et al., 2021). All of these techniques give rise to undesirable model behaviors, requiring debugging. Another technique, not found in Table 1 but suggested in related work (Idahl et al., 2021), is contaminating input texts in the training data with decoys (i.e., injected artifacts) which could deceive the model into predicting for the wrong reasons. This has been experimented with in the computer vision domain (Rieger et al., 2020), and its use in the EBHD setting in NLP could be an interesting direction to explore.
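As a rough illustration of how such bugs are typically simulated (our own sketch, not code from the cited studies), the snippet below injects the two simplest bug sources into a toy training set: wrong labels (WL) by flipping a random fraction of labels, and a small training subset (SS) by subsampling. The texts and labels are placeholders; the 10% flip rate mirrors the simulation by Koh and Liang (2017) described in section 2.3.

```python
# Simulating two common bug sources on a toy binary classification dataset:
# WL = wrong label injection, SS = training on only a small subset.
import numpy as np

rng = np.random.default_rng(0)
texts = [f"document {i}" for i in range(1000)]     # placeholder corpus
labels = rng.integers(0, 2, size=len(texts))       # placeholder binary labels

# WL: flip the labels of a random 10% of the training examples.
flip_idx = rng.choice(len(labels), size=int(0.1 * len(labels)), replace=False)
noisy_labels = labels.copy()
noisy_labels[flip_idx] = 1 - noisy_labels[flip_idx]

# SS: keep only a small random subset (e.g., 5%) of the training data.
small_idx = rng.choice(len(texts), size=int(0.05 * len(texts)), replace=False)
small_texts = [texts[i] for i in small_idx]
small_labels = labels[small_idx]

print(len(flip_idx), "labels flipped;", len(small_texts), "examples kept for the SS setting")
```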
2.2 Workflow

This section describes existing work around the three steps of the EBHD workflow in Figure 1, i.e., how to generate and present the explanations, how to collect human feedback, and how to update the model using the feedback. Researchers need to make decisions on these key points harmoniously to create an effective debugging workflow.

2.2.1 Providing Explanations

The main role of explanations here is to provide interpretable insights into the model and uncover its potential misbehavior or irrationality, which sometimes cannot be noticed by looking at the model outputs or the evaluation metrics.
Explanation scopes. Basically, there are two main types of explanations that could be provided to feedback providers. Local explanations (L) explain the predictions by the model for individual inputs. In contrast, global explanations (G) explain the model overall, independently of any specific inputs. It can be seen from Table 1 that most existing work uses local explanations. One reason for this may be that, for complex models, global explanations can hardly reveal details of the models' inner workings in a comprehensible way to users. So, some bugs are imperceptible in such high-level global explanations and then not corrected by the users. For example, the debugging framework FIND, proposed by Lertvittayakumjorn et al. (2020), uses only global explanations, and it was shown to work more effectively on significant bugs (such as gender bias in abusive language detection) than on less-obvious bugs (such as dataset shift between product types of sentiment analysis on product reviews). Otherwise, Ribeiro et al. (2018b) presented adversarial replacement rules as global explanations to reveal the model weaknesses only, without explaining how the whole model worked.

On the other hand, using local explanations has limitations in that it demands a large amount of effort from feedback providers to inspect the explanation of every single example in the training/validation set. With limited human resources, efficient ways to rank or select examples to explain would be required (Idahl et al., 2021). For instance, Khanna et al. (2019) and Han and Ghosh (2020) targeted explanations of incorrect predictions in the validation set. Ribeiro et al. (2016) picked sets of non-redundant local explanations to illustrate the global picture of the model. Instead, Teso and Kersting (2019) leveraged heuristics from active learning to choose unlabeled examples that maximize some informativeness criteria.

Recently, some work in explainable AI considers generating explanations for a group of predictions (Johnson et al., 2020; Chan et al., 2020) (e.g., for all the false positives of a certain class), thus staying in the middle of the two extreme explanation types (i.e., local and global). This kind of explanation is not too fine-grained, yet it can capture some suspicious model behaviors if we target the right group of examples. So, it would be worth studying in the context of EBHD (to the best of our knowledge, no existing study experiments with it).
Generating explanations. To generate explanations in general, there are two important questions we need to answer. First, which format should the explanations have? Second, how do we generate the explanations?

For the first question, we see many possible answers in the literature of explainable NLP (e.g., see the survey by Danilevsky et al. (2020)). For instance, input-based explanations (so called feature importance explanations) identify parts of the input that are important for the prediction. The explanation could be a list of importance scores of words in the input, so called attribution scores or relevance scores (Lundberg and Lee, 2017; Arras et al., 2016). Example-based explanations select influential, important, or similar examples from the training set to explain why the model makes a specific prediction (Han et al., 2020; Guo et al., 2020). Rule-based explanations provide interpretable decision rules that approximate the prediction process (Ribeiro et al., 2018a). Adversarial-based explanations return the smallest changes in the inputs that could change the predictions, revealing the model misbehavior (Zhang et al., 2020a). In most NLP tasks, input-based explanations are the most popular approach for explaining predictions (Bhatt et al., 2020). This is also the case for EBHD, as most selected studies use input-based explanations (Kulesza et al., 2009, 2010; Teso and Kersting, 2019; Cho et al., 2019), followed by example-based explanations (Koh and Liang, 2017; Khanna et al., 2019; Han and Ghosh, 2020). Meanwhile, only Ribeiro et al. (2018b) use adversarial-based explanations, whereas Stumpf et al. (2009) experiment with input-based, rule-based, and example-based explanations.

For the second question, there are two ways to generate the explanations: self-explaining methods and post-hoc explanation methods. Some models, e.g., Naive Bayes, logistic regression, and decision trees, are self-explaining (SE) (Danilevsky et al., 2020), also referred to as transparent (Adadi and Berrada, 2018) or inherently interpretable (Rudin, 2019). Local explanations of self-explaining models can be obtained at the same time as predictions, usually from the process of making those predictions, while the models themselves can often serve directly as global explanations. For example, feature importance explanations for a Naive Bayes model can be directly derived from the likelihood terms in the Naive Bayes equation, as done by several papers in Table 1 (Kulesza et al., 2009; Smith-Renner et al., 2020). Also, using attention scores on input as explanations, as done in (Cho et al., 2019), is a self-explaining method because the scores were obtained during the prediction process.
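As a concrete illustration of the self-explaining case, the sketch below (ours, not from the surveyed papers) derives global word-importance scores for a bag-of-words Naive Bayes classifier directly from its learned log-likelihood terms, assuming a scikit-learn MultinomialNB; the toy data and class names are placeholders.

```python
# Reading off a global, self-explaining word-importance view from a
# bag-of-words Naive Bayes model via its learned likelihood terms.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great plot and acting", "boring plot", "great fun", "boring and slow"]
labels = np.array([1, 0, 1, 0])        # 1 = positive, 0 = negative (toy labels)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# feature_log_prob_[c, w] = log P(word w | class c); the difference between the
# two classes indicates which words push the model towards the positive class.
log_ratio = model.feature_log_prob_[1] - model.feature_log_prob_[0]
words = vectorizer.get_feature_names_out()
for i in np.argsort(log_ratio)[::-1][:5]:
    print(f"{words[i]:>10s}  {log_ratio[i]:+.3f}")
```

A feedback provider who spots an artifact word near the top of such a ranking could flag it, which is exactly the kind of word-level feedback (WO) discussed in section 2.2.2.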
In contrast, post-hoc explanation methods (PH) perform additional steps to extract explanations after the model is trained (for a global explanation) or after the prediction is made (for a local explanation). If the method is allowed to access model parameters, it may calculate word relevance scores by propagating the output scores back to the input words (Arras et al., 2016) or analyzing the derivative of the output with respect to the input words (Smilkov et al., 2017; Sundararajan et al., 2017). If the method cannot access the model parameters, it may perturb the input and see how the output changes to estimate the importance of the altered parts of the input (Ribeiro et al., 2016; Jin et al., 2020). The important words and/or the relevance scores can be presented to the feedback providers in the EBHD workflow in many forms, such as a list of words and their scores (Teso and Kersting, 2019; Ribeiro et al., 2016), word clouds (Lertvittayakumjorn et al., 2020), and a parse tree (Yao et al., 2021). Meanwhile, the influence functions method, used in (Koh and Liang, 2017; Zylberajch et al., 2021), identifies training examples which influence the prediction by analyzing how the prediction would change if we did not have each training point. This is another post-hoc explanation method as it takes place after prediction. It is similar to the other two example-based explanation methods used in (Khanna et al., 2019; Han and Ghosh, 2020).
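To make the perturbation-based idea above concrete, here is a minimal, model-agnostic sketch (our illustration, not code from any surveyed paper) that scores each word by how much the predicted class probability drops when that word is removed; predict_proba stands for any black-box text classifier that returns class probabilities.

```python
# Minimal occlusion-style post-hoc explanation: score each word by the drop in
# the predicted class probability when the word is deleted from the input.
from typing import Callable, List, Tuple
import numpy as np

def occlusion_scores(text: str,
                     predict_proba: Callable[[List[str]], np.ndarray]) -> List[Tuple[str, float]]:
    words = text.split()
    base = predict_proba([text])[0]
    target = int(np.argmax(base))               # explain the predicted class
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        drop = base[target] - predict_proba([reduced])[0][target]
        scores.append((words[i], float(drop)))  # large drop = important word
    return sorted(scores, key=lambda x: -x[1])

# Usage (hypothetical): occlusion_scores("this film was great", pipeline.predict_proba)
```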
Presenting explanations. It is important to carefully design the presentation of explanations, taking into consideration the background knowledge, desires, and limits of the feedback providers. In the debugging application by Kulesza et al. (2009), lay users were asked to provide feedback to email categorizations predicted by the system. The users were allowed to ask several Why questions (inspired by Myers et al. (2006)) through either the menu bar or by right-clicking on the object of interest (such as a particular word). Examples include "Why will this message be filed to folder A?" and "Why does word x matter to folder B?". The system then responded with textual explanations (generated using templates), together with visual explanations such as bar plots for some types of questions. All of these made the interface more user-friendly. In 2015, Kulesza et al. proposed, as desirable principles, that the presented explanations should be sound (i.e., truthful in describing the underlying model), complete (i.e., not omitting important information about the model), but not overwhelming (i.e., remaining comprehensible). However, these principles are challenging to satisfy, especially when working on non-interpretable complex models.

2.2.2 Collecting Feedback

After seeing explanations, humans generally desire to improve the model by giving feedback (Smith-Renner et al., 2020). Some existing work asked humans to confirm or correct machine-computed explanations. Hence, the form of feedback fairly depends on the form of the explanations, and in turn this shapes how to update the model too (discussed in section 2.2.3). For text classification, most EBHD papers asked humans to decide which words (WO) in the explanation (considered important by the model) are in fact relevant or irrelevant (Kulesza et al., 2010; Ribeiro et al., 2016; Teso and Kersting, 2019). Some papers even allowed humans to adjust the word importance scores (WS) (Kulesza et al., 2009, 2015). This is analogous to specifying relevancy scores for example-based explanations (ES) in (Zylberajch et al., 2021). Meanwhile, feedback at the level of learned features (FE) (i.e., the internal neurons in the model) and learned rules (RU), rather than individual words, was asked for in (Lertvittayakumjorn et al., 2020) and (Ribeiro et al., 2018b), respectively. Additionally, humans may be asked to check the predicted labels (Kulesza et al., 2009; Smith-Renner et al., 2020) or even the ground truth labels (collectively noted as LB in Table 1) (Koh and Liang, 2017; Khanna et al., 2019; Han and Ghosh, 2020). Targeting table question answering, Cho et al. (2019) asked humans to identify where in the table and the question the model should focus (AT). This is analogous to identifying relevant words to attend to for text classification.

It is likely that identifying important parts in the input is sufficient to make the model accomplish simple text classification tasks. However, this might not be enough for complex tasks which require reasoning. Recently, Yao et al. (2021) asked humans to provide, as feedback, compositional explanations to show how the humans would reason (RE) about the models' failure cases. An example of the feedback for hate speech detection is "Because X is the word dumb, Y is a hateful word, and X is directly before Y, the attribution scores of both X and Y as well as the interaction score between X and Y should be increased". To acquire richer information like this as feedback, their framework requires more expertise from the feedback providers. In the future, it would be interesting to explore how we can collect and utilize other forms
of feedback, e.g., natural language feedback (Cam- the training data only, it is applicable to any model buru et al., 2018), new training examples (Fiebrink regardless of the model complexity. et al., 2009), and other forms of decision rules used by humans (Carstens and Toni, 2017). (3) Influence the training process (T). Another approach is to influence the (re-)training process 2.2.3 Updating the Model in a way that the resulting model will behave as Techniques to incorporate human feedback into the the feedback suggests. This approach could be ei- model can be categorized into three approaches. ther model-specific (such as attention supervision) or model-agnostic (such as user co-training). Cho (1) Directly adjust the model parameters (M). et al. (2019) used human feedback to supervise at- When the model is transparent and the explanation tention weights of the model. Similarly, Yao et al. displays the model parameters in an intelligible (2021) added a loss term to regularize explanations way, humans can directly adjust the parameters guided by human feedback. Stumpf et al. (2009) based on their judgements. This idea was adopted proposed (i) constraint optimization, translating hu- by Kulesza et al. (2009, 2015) where humans can man feedback into constraints governing the train- adjust a bar chart showing word importance scores, ing process and (ii) user co-training, using feed- corresponding to the parameters of the underlying back as another classifier working together with Naive Bayes model. In this special case, steps 2 the main ML model in a semi-supervised learning and 3 in Figure 1 are combined into a single step. setting. Lertvittayakumjorn et al. (2020) disabled Besides, human feedback can be used to modify the some learned features deemed irrelevant, based on model parameters indirectly. For example, Smith- the feedback, and re-trained the model, forcing it Renner et al. (2020) increased a word weight in to use only the remaining features. With many the Naive Bayes model by 20% for the class that techniques available, however, there has not been a the word supported, according to human feedback, study testing which technique is more appropriate and reduced the weight by 20% for the opposite for which task, domain, or model architecture. The class (binary classification). This choice gives good comparison issue is one of the open problems for results, however, it is not clear why and whether EBHD research (to be discussed in section 4). 20% is the best choice here. Overall, this approach is fast because it does not 2.2.4 Iteration require model retraining. However, it is important The debugging workflow (explain, feedback, and to ensure that the adjustments made by humans update) can be done iteratively to gradually im- generalize well to all examples. Therefore, the prove the model where the presented explanation system should update the overall results (e.g., per- changes after the model update. This allows hu- formance metrics, predictions, and explanations) in mans to fix vital bugs first and finer bugs in later real time after applying any adjustment, so the hu- iterations, as reflected in (Ribeiro et al., 2016; Koh mans can investigate the effects and further adjust and Liang, 2017) via the performance plots. How- the model parameters (or undo the adjustments) if ever, the interactive process could be susceptible to necessary. This agrees with the correctability prin- local decision pitfalls where local improvements ciples proposed by Kulesza et al. 
(2015) that the for individual predictions could add up to inferior system should be actionable and reversible, honor overall performance (Wu et al., 2019). So, we need user feedback, and show incremental changes. to ensure that the update in the current iteration (2) Improve the training data (D). We can use is generally favorable and does not overwrite the human feedback to improve the training data and re- good effects of previous updates. train the model to fix bugs. This approach includes correcting mislabeled training examples (Koh and 2.3 Experimental Setting Liang, 2017; Han and Ghosh, 2020), assigning To conduct experiments, some studies in Table 1 noisy labels to unlabeled examples (Yao et al., selected human participants (SP) to be their feed- 2021), removing irrelevant words from input texts back providers. The selected participants could be (Ribeiro et al., 2016), and creating augmented train- people without ML/NLP knowledge (Kulesza et al., ing examples to reduce the effects of the artifacts 2010, 2015) or with ML/NLP knowledge (Ribeiro (Ribeiro et al., 2018b; Teso and Kersting, 2019; et al., 2018b; Zylberajch et al., 2021) depending Zylberajch et al., 2021). As this approach modifies on the study objectives and the complexity of the
feedback process. Early work even conducted ex- Although some of the findings were not derived in periments with the participants in-person (Stumpf NLP settings, we believe that they are generalizable et al., 2009; Kulesza et al., 2009, 2015). Although and worth discussing in the context of EBHD. this limited the number of participants (to less than 100), the researchers could closely observe their be- 3.1 Model Understanding haviors and gain some insights concerning human- So far, we have used explanations as means to help computer interaction. humans understand models and conduct informed By contrast, some used a crowdsourcing plat- debugging. Hence, it is important to verify, at least form, Amazon Mechanical Turk5 in particular, to preliminarily, that the explanations help feedback collect human feedback for debugging the mod- providers form an accurate understanding of how els. Crowdsourcing (CS) enables researchers to the models work. This is an important prerequisite conduct experiments at a large scale; however, the towards successful debugging. quality of human responses could be varying. So, it Existing studies have found that some explana- is important to ensure some quality control such as tion forms are more conducive to developing model specifying required qualifications (Smith-Renner understanding in humans than others. Stumpf et al., 2020), using multiple annotations per ques- et al. (2009) found that rule-based and keyword- tion (Lertvittayakumjorn et al., 2020), having a based explanations were easier to understand than training phase for participants, and setting up some similarity-based explanations (i.e., explaining by obvious questions to check if the participants are similar examples in the training data). Also, they paying attention to the tasks (Egelman et al., 2014). found that some users did not understand why the Finally, simulation (SM), without real humans absence of some words could make the model be- involved but using oracles as human feedback in- come more certain about its predictions. Lim et al. stead, has also been considered (for the purpose of (2009) found that explaining why the system be- testing the EBHD framework only). For example, haved and did not behave in a certain way resulted Teso and Kersting (2019) set 20% of input words as in good user’s understanding of the system, though relevant using feature selection. These were used to the former way of explanation (why) was more respond to post-hoc explanations, i.e., top k words effective than the latter (why not). Cheng et al. selected by LIME. Koh and Liang (2017) simu- (2019) reported that interactive explanations could lated mislabeled examples by flipping the labels of improve users’ comprehension on the model better a random 10% of the training data. So, when the than static explanations, although the interactive explanation showed suspicious training examples, way took more time. In addition, revealing inner the true labels could be used to provide feedback. workings of the model could further help under- Compared to the other settings, simulation is faster standing; however, it introduced additional cogni- and cheaper, yet its results may not reflect the effec- tive workload that might make participants doubt tiveness of the framework when deployed with real whether they really understood the model well. humans. 
Naturally, human feedback is sometimes 3.2 Willingness inaccurate and noisy, and humans could also be interrupted or frustrated while providing feedback We would like humans to provide feedback for im- (Amershi et al., 2014). These factors, discussed proving models, but do humans naturally want to? in detail in the next section, cannot be thoroughly Prior to the emerging of EBHD, studies found that studied in only simulated experiments. humans are not willing to be constantly asked about labels of examples as if they were just simple or- 3 Research on Human Factors acles (Cakmak et al., 2010; Guillory and Bilmes, 2011). Rather, they want to provide more than just Though the major goal of EBHD is to improve data labels after being given explanations (Amershi models, we cannot disregard the effect on feed- et al., 2014; Smith-Renner et al., 2020). By collect- back providers of the debugging workflow. In this ing free-form feedback from users, Stumpf et al. section, we compile findings concerning how expla- (2009) and Ghai et al. (2021) discovered various nations and feedback could affect the humans, dis- feedback types. The most prominent ones include cussed along five dimensions: model understand- removing-adding features (words), tuning weights, ing, willingness, trust, frustration, and expectation. and leveraging feature combinations. Stumpf et al. 5 https://www.mturk.com/ (2009) further analyzed categories of background
knowledge underlying the feedback and found, in relies on the wrong reasons. These studies go along their experiment, that it was mainly based on com- with a perspective by Zhang et al. (2020b) that ex- monsense knowledge and English language knowl- planations should help calibrate user perceptions edge. Such knowledge may not be efficiently in- to the model quality, signaling whether the users jected into the model if we exploit human feedback should trust or distrust the AI. Although, in some which contains only labels. This agrees with some cases, explanations successfully warned users of participants, in (Smith-Renner et al., 2020), who faulty models (Ribeiro et al., 2016), this is not easy described their feedback as inadequate when they when the model flaws are not obvious (Zhang et al., could only confirm or correct predicted labels. 2020b; Lertvittayakumjorn and Toni, 2019). Although human feedback beyond labels con- Besides explanations, the effect of feedback on tains helpful information, it is naturally neither human trust is quite inconclusive according to some complete nor precise. Ghai et al. (2021) observed (but fewer) studies. On one hand, Smith-Renner that human feedback usually focuses on a few fea- et al. (2020) found that, after lay humans see ex- tures that are most different from human expec- planations of low-quality models and lose their tation, ignoring the others. Also, they found that trust, the ability to provide feedback makes human humans, especially lay people, are not good at cor- trust and acceptance rally, remedying the situation. recting model explanations quantitatively (e.g., ad- In contrast, Honeycutt et al. (2020) reported that justing weights). This is consistent with the find- providing feedback decreases human trust in the ings of Miller (2019) that human explanations are system as well as their perception of system accu- selective (in a biased way) and rarely refer to prob- racy no matter whether the system truly improves abilities but express causal relationships instead. after being updated or not. 3.4 Frustration 3.3 Trust Working with explanations can cause frustration Trust (as well as frustration and expectation, dis- sometimes. Following the discussion on trust, ex- cussed next) is an important issue when the sys- planations of poor models increase user frustration tem end users are feedback providers in the EBHD (as they reveal model flaws), whereas the ability framework. It has been discussed widely that ex- to provide feedback reduces frustration. Hence, in planations engender human trust in AI systems (Pu general situations, the most frustrating condition is and Chen, 2006; Lipton, 2018; Toreini et al., 2020). showing explanations to the users without allowing This trust may be misplaced at times. Showing them to give feedback (Smith-Renner et al., 2020). more detailed explanations can cause users to over Another cause of frustration is the risk of de- rely on the system, leading to misuse where users tailed explanations overloading users (Narayanan agree with incorrect system predictions (Stumpf et al., 2018). This is especially a crucial issue for in- et al., 2016). Moreover, some users may over herently interpretable models where all the internal trust the explanations (without fully understand- workings can be exposed to the users. 
Though pre- ing them) only because the tools generating them senting all the details is comprehensive and faith- are publicly available, widely used, and showing ful, it could create barriers for lay users (Gershon, appealing visualizations (Kaur et al., 2020). 1998). In fact, even ML experts may feel frustrated However, recent research reported that explana- if they need to understand a decision tree with a tions do not necessarily increase trust and reliance. depth of ten or more. Poursabzi-Sangdeh et al. Cheng et al. (2019) found that, even though ex- (2018) found that showing all the model internals planations help users comprehend systems, they undermined users’ ability to detect flaws in the cannot increase human trust in using the systems model, likely due to information overload. So, they in high-stakes applications involving lots of quali- suggested that model internals should be revealed tative factors, such as graduate school admissions. only when the users request to see them. Smith-Renner et al. (2020) reported that explana- tions of low-quality models decrease trust and sys- 3.5 Expectation tem acceptance as they reveal model weaknesses to Smith-Renner et al. (2020) observed that some par- the users. According to Schramowski et al. (2020), ticipants expected the model to improve after the despite correct predictions, the trust still drops if session where they interacted with the model, re- the users see from the explanations that the model gardless of whether they saw explanations or gave
feedback during the interaction session. EBHD 4 Open Problems should manage these expectations properly. For instance, the system should report changes or im- This section lists potential research directions and provements to users after the model gets updated. open problems for EBHD of NLP models. It would be better if the changes can be seen incre- 4.1 Beyond English Text Classification mentally in real time (Kulesza et al., 2015). All papers in Table 1 conducted experiments only 3.6 Summary on English datasets. We acknowledge that qual- Based on the findings on human factors reviewed itatively analyzing explanations and feedback in in this section, we summarize suggestions for ef- languages at which one is not fluent is not easy, not fective EBHD as follows. to mention recruiting human subjects who know the languages. However, we hope that, with more Feedback providers. Buggy models usually multilingual data publicly available (Wolf et al., lead to implausible explanations, adversely affect- 2020) and growing awareness in the NLP com- ing human trust in the system. Also, it is not munity (Bender, 2019), there will be more EBHD yet clear whether giving feedback increases or de- studies targeting other languages in the near future. creases human trust. So, it is safer to let the devel- Also, most existing EBHD works target text clas- opers or domain experts in the team (rather than end sifiers. It would be interesting to conduct more users) be the feedback providers. For some kinds EBHD work for other NLP tasks such as reading of bugs, however, feedback from end users is es- comprehension, question answering, and natural sential for improving the model. To maintain their language inference (NLI), to see whether existing trust, we may collect their feedback implicitly (e.g., techniques still work effectively. Shifting to other by inferring from their interactions with the system tasks requires an understanding of specific bug after showing them the explanations (Honeycutt characteristics in those tasks. For instance, unlike et al., 2020)) or collect the feedback without telling bugs in text classification which are usually due them that the explanations are of the production to word artifacts, bugs in NLI concern syntactic system (e.g., by asking them to answer a separate heuristics between premises and hypotheses (Mc- survey). All in all, we need different strategies to Coy et al., 2019). Thus, giving human feedback at collect feedback from different stakeholders. word level may not be helpful, and more advanced Explanations. We should avoid using forms of methods may be needed. explanations which are difficult to understand, such as similar training examples and absence of some 4.2 Tackling More Challenging Bugs keywords in inputs, unless the humans are already Lakkaraju et al. (2020) remarked that the evalu- trained to interpret them. Also, too much infor- ation setup of existing EBHD work is often too mation should be avoided as it could overload the easy or unrealistic. For example, bugs are obvi- humans; instead, humans should be allowed to re- ous artifacts which could be removed using sim- quest more information if they are interested, e.g., ple text pre-processing (e.g., removing punctuation by using interactive explanations (Dejl et al., 2021). and redacting named entities). Hence, it is not clear how powerful such EBHD frameworks are when Feedback. Given that human feedback is not al- dealing with real-world bugs. 
If bugs are not dom- ways complete, correct, or accurate, EBHD should inant and happen less often, global explanations use it with care, e.g., by relying on collective feed- may be too coarse-grained to capture them while back rather than individual feedback and allowing many local explanations may be needed to spot feedback providers to verify and modify their feed- a few appearances of the bugs, leading to ineffi- back before applying it to update the model. ciency. As reported by Smith-Renner et al. (2020), Update. Humans, especially lay people, usually feedback results in minor improvements when the expect the model to improve over time after they model is already reasonably good. give feedback. So, the system should display im- Other open problems, whose solutions may help provements after the model gets updated. Where deal with challenging bugs, include the following. possible, showing the changes incrementally in real First, different people may give different feedback time is preferred, as the feedback providers can for the same explanation. As raised by Ghai et al. check if their feedback works as expected or not. (2021), how can we integrate their feedback to get
robust signals for model update? How should we efficiency of EBHD frameworks (for both machine deal with conflicts among feedback and training and human sides) require further research. examples (Carstens and Toni, 2017)? Second, con- firming or removing what the model has learned is 4.4 Reliable Comparison across Papers easier than injecting, into the model, new knowl- User studies are naturally difficult to replicate as edge (which may not even be apparent in the expla- they are inevitably affected by choices of user nations). How can we use human feedback to inject interfaces, phrasing, population, incentives, etc. new knowledge, especially when the model is not (Lakkaraju et al., 2020). Further, research in ML transparent? Lastly, EBHD techniques have been rarely adopts practices from the human-computer proposed for tabular data and image data (Shao interaction community (Abdul et al., 2018), lim- et al., 2020; Ghai et al., 2021; Popordanoska et al., iting the possibility to compare across studies. 2020). Can we adapt or transfer them across modal- Hence, most existing work only considers model ities to deal with NLP tasks? performance before and after debugging or com- pares the results among several configurations of 4.3 Analyzing and Enhancing Efficiency a single proposed framework. This leads to lit- tle knowledge about which explanation types or Most selected studies focus on improving correct- feedback mechanisms are more effective across ness of the model (e.g., by expecting a higher F1 or several settings. Thus, one promising research di- a lower bias after debugging). However, only some rection would be proposing a standard setup or a of them discuss efficiency of the proposed frame- benchmark for evaluating and comparing EBHD works. In general, we can analyze the efficiency of frameworks reliably across different settings. an EBHD framework by looking at the efficiency of each main step in Figure 1. Step 1 generates the 4.5 Towards Deployment explanations, so its efficiency depends on the expla- nation method used and, in the case of local expla- So far, we have not seen EBHD research widely de- nation methods, the number of local explanations ployed in applications, probably due to its difficulty needed. Step 2 lets humans give feedback, so its to set up the debugging aspects outside a research efficiency concerns the amount of time they spend environment. One way to promote adoption of to understand the explanations and to produce the EBHD is to integrate EBHD frameworks into avail- feedback. Step 3 updates the model using the feed- able visualization systems such as the Language back, so its efficiency relates to the time used for Interpretability Tool (LIT) (Tenney et al., 2020), processing the feedback and retraining the model allowing users to provide feedback to the model (if needed). Existing work mainly reported effi- after seeing explanations and supporting experi- ciency of step 1 or step 2. For instance, approaches mentation. Also, to move towards deployment, it using example-based explanations measured the is important to follow human-AI interaction guide- improved performance with respect to the number lines (Amershi et al., 2019) and evaluate EBHD of explanations computed (step 1) (Koh and Liang, with potential end users, not just via simulation or 2017; Khanna et al., 2019; Han and Ghosh, 2020). crowdsourcing, since human factors play an impor- Kulesza et al. (2015) compared the improved F1 of tant role in real situations (Amershi et al., 2014). 
EBHD with the F1 of instance labelling given the 5 Conclusion same amount of time for humans to perform the task (step 2). Conversely, Yao et al. (2021) com- We presented a general framework of explanation- pared the time humans need to do EBHD versus based human debugging (EBHD) of NLP models instance labelling in order to achieve the equivalent and analyzed existing work in relation to the com- degree of correctness improvement (step 2). ponents of this framework to illustrate the state-of- None of the selected studies considered the ef- the-art in the field. Furthermore, we summarized ficiency of the three steps altogether. In fact, the findings on human factors with respect to EBHD, efficiency of step 1 and 3 is important especially suggested design practices accordingly, and iden- for black box models where the cost of post-hoc tified open problems for future studies. As EBHD explanation generation and model retraining is not is still an ongoing research topic, we hope that our negligible. It is even more crucial for iterative or survey will be helpful for guiding interested re- responsive EBHD. Thus, analyzing and enhancing searchers and for examining future EBHD papers.
Acknowledgments using explanation feedback to improve classifica- tion abilities. arXiv preprint arXiv:2105.02653. We would like to thank Marco Baroni (the action editor) and anonymous reviewers for very helpful Emily Bender. 2019. The #benderrule: On naming comments. Also, we thank Brian Roark and Cindy the languages we study and why it matters. The Robinson for their technical support concerning Gradient. the submission system. Besides, the first author wishes to thank the support from Anandamahidol Umang Bhatt, Alice Xiang, Shubham Sharma, Foundation, Thailand. Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José MF Moura, and Peter Eckersley. 2020. Explainable machine learning References in deployment. In Proceedings of the 2020 Con- ference on Fairness, Accountability, and Trans- Ashraf Abdul, Jo Vermeulen, Danding Wang, parency, pages 648–657. Brian Y Lim, and Mohan Kankanhalli. 2018. Trends and trajectories for explainable, account- Gabriel Cadamuro, Ran Gilad-Bachrach, and Xi- able and intelligible systems: An hci research aojin Zhu. 2016. Debugging machine learning agenda. In Proceedings of the 2018 CHI con- models. In ICML Workshop on Reliable Ma- ference on human factors in computing systems, chine Learning in the Wild. pages 1–18. Maya Cakmak, Crystal Chao, and Andrea L Amina Adadi and Mohammed Berrada. 2018. Thomaz. 2010. Designing interactions for robot Peeking inside the black-box: a survey on ex- active learners. IEEE Transactions on Au- plainable artificial intelligence (xai). IEEE ac- tonomous Mental Development, 2(2):108–118. cess, 6:52138–52160. Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-snli: Julius Adebayo, Michael Muelly, Ilaria Liccardi, natural language inference with natural language and Been Kim. 2020. Debugging tests for model explanations. In Proceedings of the 32nd In- explanations. In Advances in Neural Informa- ternational Conference on Neural Information tion Processing Systems. Processing Systems, pages 9560–9572. Saleema Amershi, Maya Cakmak, William Bradley Lucas Carstens and Francesca Toni. 2017. Using Knox, and Todd Kulesza. 2014. Power to the argumentation to improve classification in nat- people: The role of humans in interactive ma- ural language problems. ACM Transactions on chine learning. Ai Magazine, 35(4):105–120. Internet Technology (TOIT), 17(3):1–23. Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Rich Caruana, Yin Lou, Johannes Gehrke, Paul Adam Fourney, Besmira Nushi, Penny Collisson, Koch, Marc Sturm, and Noemie Elhadad. 2015. Jina Suh, Shamsi Iqbal, Paul N Bennett, Kori Intelligible models for healthcare: Predicting Inkpen, et al. 2019. Guidelines for human-ai pneumonia risk and hospital 30-day readmission. interaction. In Proceedings of the 2019 chi con- In Proceedings of the 21th ACM SIGKDD inter- ference on human factors in computing systems, national conference on knowledge discovery and pages 1–13. data mining, pages 1721–1730. Leila Arras, Franziska Horn, Grégoire Montavon, Gromit Yeuk-Yin Chan, Jun Yuan, Kyle Overton, Klaus-Robert Müller, and Wojciech Samek. Brian Barr, Kim Rees, Luis Gustavo Nonato, En- 2016. Explaining predictions of non-linear clas- rico Bertini, and Claudio T Silva. 2020. Subplex: sifiers in NLP. In Proceedings of the 1st Work- Towards a better understanding of black box shop on Representation Learning for NLP, pages model explanations at the subpopulation level. 1–7, Berlin, Germany. Association for Computa- arXiv preprint arXiv:2007.10609. 
tional Linguistics. Hao-Fei Cheng, Ruotong Wang, Zheng Zhang, Yanzhe Bekkemoen and Helge Langseth. 2021. Fiona O’Connell, Terrance Gray, F Maxwell Correcting classification: A bayesian framework Harper, and Haiyi Zhu. 2019. Explaining