Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation


 Yasuhide Miura, Yuhao Zhang, Emily Bao Tsai, Curtis P. Langlotz, Dan Jurafsky
 Stanford University
 {ysmiura, zyh, ebtsai, langlotz, jurafsky}@stanford.edu

 Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5288–5304, June 6–11, 2021. ©2021 Association for Computational Linguistics

Abstract

Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. However, existing report generation systems, despite achieving high performances on natural language generation metrics such as CIDEr or BLEU, still suffer from incomplete and inconsistent generations. Here we introduce two new simple rewards to encourage the generation of factually complete and consistent radiology reports: one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways. We combine these with the novel use of an existing semantic equivalence metric (BERTScore). We further propose a report generation system that optimizes these rewards via reinforcement learning. On two open radiology report datasets, our system substantially improved the F1 score of a clinical information extraction performance by +22.1 (∆ + 63.9%). We further show via a human evaluation and a qualitative analysis that our system leads to generations that are more factually complete and consistent compared to the baselines.

[Figure 1: A (partial) example of a report generated from our system (with ". . . " representing abbreviated text). The system encodes images and generates text from that encoded representation. Underlined words are disease and anatomy entities. The shaded sentences are an example of a contradictory pair.
Reference Report: "Large right pleural effusion is unchanged in size. There is associated right basilar atelectasis/scarring, also stable. Healed right rib fractures are noted. On the left, there is persistent apical pleural thickening and apical scarring. Linear opacities projecting over the lower lobe are also compatible with scarring, unchanged. There is no left pleural effusion. There is no pneumothorax. . . ."
Generated Report: ". . . The heart size remains unchanged and is within normal limits. Unchanged appearance of thoracic aorta. The pulmonary vasculature is not congested. Bilateral pleural effusions are again noted and have increased in size on the right than the left. The left-sided pleural effusion has increased in size and is now moderate in size."]

1 Introduction

An important new application of natural language generation (NLG) is to build assistive systems that take X-ray images of a patient and generate a textual report describing clinical observations in the images (Jing et al., 2018; Li et al., 2018; Liu et al., 2019; Boag et al., 2020; Chen et al., 2020). Figure 1 shows an example of a radiology report generated by such a system. This is a clinically important task, offering the potential to reduce radiologists' repetitive work and generally improve clinical communication (Kahn et al., 2009).

Automatic radiology report generation systems have achieved promising performance as measured by widely used NLG metrics such as CIDEr (Vedantam et al., 2015) and BLEU (Papineni et al., 2002) on several datasets (Li et al., 2018; Jing et al., 2019; Chen et al., 2020). However, reports that achieve high performance on these NLG metrics are not always factually complete or consistent. In addition to the use of inadequate metrics, the factual incompleteness and inconsistency issue in generated reports is further exacerbated by the inadequate training of these systems. Specifically, the standard teacher-forcing training algorithm (Williams and Zipser, 1989) used by most existing work can lead to a discrepancy between what the model sees during training and test time (Ranzato et al., 2016), resulting in degenerate outputs with factual hallucinations (Maynez et al., 2020). Liu et al. (2019) and Boag et al. (2020) have shown that reports generated by state-of-the-art systems still have poor quality when evaluated by their clinical metrics as measured with an information extraction system designed for radiology reports. For example, the generated report in Figure 1 is incomplete since it neglects an observation of atelectasis that can be found in the images. It is also inconsistent since it mentions left-sided pleural effusion which is not present in the images. Indeed, we show that existing systems are inadequate in factual completeness and consistency, and that an image-to-text radiology report generation system can be substantially improved by replacing widely used NLG metrics with simple alternatives.

We propose two new simple rewards that can encourage the factual completeness and consistency of the generated reports. First, we propose the Exact Entity Match Reward (factENT) which captures the completeness of a generated report by measuring its coverage of entities in the radiology domain, compared with a reference report. The goal of the reward is to better capture disease and anatomical knowledge that are encoded in the entities. Second, we propose the Entailing Entity Match Reward (factENTNLI), which extends factENT with a natural language inference (NLI) model that further considers how inferentially consistent the generated entities are with their descriptions in the reference. We add NLI to control the overestimation of disease when optimizing towards factENT. We use these two metrics along with an existing semantic equivalence metric, BERTScore (Zhang et al., 2020a), to potentially capture synonyms (e.g., "left and right" effusions are synonymous with "bilateral" effusions) and distant dependencies between diseases (e.g., a negation like ". . . but underlying consolidation or other pulmonary lesion not excluded") that are present in radiology reports.

Although recent work in summarization, dialogue, and data-to-text generation has tried to address this problem of factual incompleteness and inconsistency by using natural language inference (NLI) (Falke et al., 2019; Welleck et al., 2019), question answering (QA) (Wang et al., 2020a), or content matching constraint (Wang et al., 2020b) approaches, they either show negative results or are not directly applicable to the generation of radiology reports due to a substantial task and domain difference. To construct the NLI model for factENTNLI, we present a weakly supervised approach that adapts an existing NLI model to the radiology domain. We further present a report generation model which directly optimizes a Transformer-based architecture with these rewards using reinforcement learning (RL).

We evaluate our proposed report generation model on two publicly available radiology report generation datasets. We find that optimizing the proposed rewards along with BERTScore by RL leads to generated reports that achieve substantially improved performance in the important clinical metrics (Liu et al., 2019; Boag et al., 2020; Chen et al., 2020), demonstrating the higher clinical value of our approach. We make all our code and the expert-labeled test set for evaluating the radiology NLI model publicly available to encourage future research¹. To summarize, our contributions in this paper are:

1. We propose two simple rewards for image-to-text radiology report generation, which focus on capturing the factual completeness and consistency of generated reports, and a weak supervision-based approach for training a radiology-domain NLI model to realize the second reward.

2. We present a new radiology report generation model that directly optimizes these new rewards with RL, showing that previous approaches that optimize traditional NLG metrics are inadequate, and that the proposed approach substantially improves performance on clinical metrics (as much as ∆ + 64.2%) on two publicly available datasets.

¹ https://github.com/ysmiura/ifcc

2 Related Work

2.1 Image-to-Text Radiology Report Generation

Wang et al. (2018) and Jing et al. (2018) first proposed multi-task learning models that jointly generate a report and classify disease labels from a chest X-ray image. Their models were extended to use multiple images (Yuan et al., 2019), to adopt a hybrid retrieval-generation model (Li et al., 2018), or to consider structure information (Jing et al., 2019). More recent work has focused on generating reports that are clinically consistent and accurate. Liu et al. (2019) presented a system that generates accurate reports by fine-tuning it with their Clinically
 
Coherent Reward. Boag et al. (2020) evaluated several baseline generation systems with clinical metrics and found that standard NLG metrics are ill-equipped for this task. Very recently, Chen et al. (2020) proposed an approach to generate radiology reports with a memory-driven Transformer. Our work is most related to Liu et al. (2019); their system, however, is dependent on a rule-based information extraction system specifically created for chest X-ray reports and has limited robustness and generalizability to different domains within radiology. By contrast, we aim to develop methods that improve the factual completeness and consistency of generated reports by harnessing more robust statistical models and are easily generalizable.

2.2 Consistency and Faithfulness in Natural Language Generation

A variety of recent work has focused on consistency and faithfulness in generation. Our work is inspired by Falke et al. (2019), Welleck et al. (2019), and Matsumaru et al. (2020) in using NLI to rerank or filter generations in text summarization, dialogue, and headline generation systems, respectively. Other attempts in this direction include evaluating consistency in generations using QA models (Durmus et al., 2020; Wang et al., 2020a; Maynez et al., 2020), with distantly supervised classifiers (Kryściński et al., 2020), and with task-specific content matching constraints (Wang et al., 2020b). Liu et al. (2019) and Zhang et al. (2020b) studied improving the factual correctness in generating radiology reports with rule-based information extraction systems. Our work mainly differs from theirs in the direct optimization of factual completeness with an entity-based reward and of factual consistency with a statistical NLI-based reward.

2.3 Image Captioning with Transformer

The problem of generating text from image data has been widely studied in the image captioning setting. While early work focused on combining convolutional neural network (CNN) and recurrent neural network (RNN) architectures (Vinyals et al., 2015), more recent work has discovered the effectiveness of using the Transformer architecture (Vaswani et al., 2017). Li et al. (2019) and Pan et al. (2020) introduced an attention process to exploit semantic and visual information into this architecture. Herdade et al. (2019), Cornia et al. (2020), and Guo et al. (2020) extended this architecture to learn geometrical and other relationships between input regions. We find Meshed-Memory Transformer (Cornia et al., 2020) (M2 Trans) to be more effective in our radiology report generation task than the traditional RNN-based models and Transformer models (an empirical result will be shown in §4), and therefore use it as our base architecture.

[Figure 2: An overview of Meshed-Memory Transformer extended to multiple images. The diagram shows K input images feeding a memory-augmented encoder (memory-augmented attention, add & norm, and feed-forward layers, stacked N times) and a meshed decoder (masked self-attention, cross-attention with max-pooling over the K images, meshed attention, and feed-forward layers, stacked N times).]

3 Methods

3.1 Image-to-Text Radiology Report Generation with M2 Trans

Formally, given K individual images x_{1...K} of a patient, our task involves generating a sequence of words to form a textual report ŷ, which describes the clinical observations in the images. This task resembles image captioning, except with multiple images as input and longer text sequences as output. We therefore extend a state-of-the-art image captioning model, M2 Trans (Cornia et al., 2020), with multi-image input as our base architecture. We first briefly introduce this model and refer interested readers to Cornia et al. (2020).

Figure 2 illustrates an overview of the M2 Trans model. Given an image x_k, image regions are first extracted with a CNN as X = CNN(x_k). X is then encoded with a memory-augmented attention process M_mem(X) as

  M_mem(X) = Att(W_q X, K, V)   (1)
  Att(Q, K, V) = softmax(Q K^T / √d) V   (2)
  K = [W_k X; M_k]   (3)
  V = [W_v X; M_v]   (4)

where W_q, W_k, W_v are weights, M_k, M_v are memory matrices, d is a scaling factor, and [∗; ∗] is the concatenation operation. Att(Q, K, V) is an attention process derived from the Transformer architecture (Vaswani et al., 2017) and extended to include memory matrices that can encode a priori knowledge between image regions. In the encoder, this attention process is a self-attention process since all of the query Q, the key K, and the value V depend on X. M_mem(X) is further processed with a feed forward layer, a residual connection, and a layer normalization to output X̃. This encoding process can be stacked N times and is applied to the K images, and the n-th layer output for the K images will be X̃^{n,K}.

The meshed decoder first processes an encoded text Y with a masked self-attention and further processes it with a feed forward layer, a residual connection, and a layer normalization to output Ÿ. Ÿ is then passed to a cross attention C(X̃^{n,K}, Ÿ) and a meshed attention M_mesh(X̃^{N,K}, Ÿ) as

  M_mesh(X̃^{N,K}, Ÿ) = Σ_n α_n ⊙ C(X̃^{n,K}, Ÿ)   (5)
  C(X̃^{n,K}, Ÿ) = max_K(Att(W_q Ÿ, W_k X̃^{n,K}, W_v X̃^{n,K}))   (6)
  α_n = σ(W_n [Y; C(X̃^{n,K}, Ÿ)] + b_n)   (7)

where ⊙ is element-wise multiplication, max_K is max-pooling over the K images, σ is a sigmoid function, W_n is a weight, and b_n is a bias. The weighted summation in M_mesh(X̃^{N,K}, Ÿ) exploits both low-level and high-level information from the N stacked encoder layers. Differing from the self-attention process in the encoder, the cross attention uses a query that depends on Y and a key and a value that depend on X. M_mesh(X̃^{N,K}, Ÿ) is further processed with a feed forward layer, a residual connection, and a layer normalization to output Ỹ. As in the encoder, the decoder can be stacked N times to output Ỹ^N. Ỹ^N is further passed to a feed forward layer to output report ŷ.

3.2 Optimization with Factual Completeness and Consistency

3.2.1 Exact Entity Match Reward (factENT)

We designed an F-score entity match reward to capture factual completeness. This reward assumes that entities encode disease and anatomical knowledge that relates to factual completeness. A named entity recognizer is applied to ŷ and the corresponding reference report y. Given entities E_gen and E_ref recognized from y_gen and y_ref respectively, the precision (pr) and recall (rc) of entity match are calculated as

  pr_ENT = Σ_{e∈E_gen} δ(e, E_ref) / |E_gen|   (8)
  rc_ENT = Σ_{e∈E_ref} δ(e, E_gen) / |E_ref|   (9)
  δ(e, E) = 1 if e ∈ E, and 0 otherwise   (10)

The harmonic mean of precision and recall is taken as factENT to reward a balanced match of entities. We used Stanza (Qi et al., 2020) and its clinical models (Zhang et al., 2020c) as a named entity recognizer for radiology reports. For example, in the case of Figure 1, the common entities among the reference report and the generated report are pleural and effusion, resulting in factENT = 33.3.

3.2.2 Entailing Entity Match Reward (factENTNLI)

We additionally designed an F-score style reward that expands factENT with NLI to capture factual consistency. NLI is used to control the overestimation of disease when optimizing towards factENT. In factENTNLI, δ in Eq. 10 is expanded to

  φ(e, E) = 1 if e ∈ E ∧ NLI_e(P, h) ≠ contradiction; 1 if NLI_e(P, h) = entailment; 0 otherwise   (11)
  NLI_e(P, h) = nli(p̂, h) where p̂ = argmax_{p∈P} sim(h, p)   (12)

where h is a sentence that includes e, P is all sentences in a counterpart text (if h is a sentence in a generated report, P is all sentences in the corresponding reference report), nli(∗, ∗) is an NLI function that returns an NLI label which is one of {entailment, neutral, contradiction}, and sim(∗, ∗) is a text similarity function. We used BERTScore (Zhang et al., 2020a) as sim(∗, ∗) in the experiments (the detail of BERTScore can be found in Appendix A). The harmonic mean of precision and recall is taken as factENTNLI to encourage a balanced factual consistency between a generated text and the corresponding reference text. For example, in the case of Figure 1, the sentence "The left-sided pleural effusion has increased in size and is now moderate in size." will be contradictory to "There is no left pleural effusion.", resulting in pleural and effusion being rejected in y_gen.
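The reward computations in Eqs. 8-12 are simple enough to sketch directly. Below is a minimal illustrative Python re-implementation, not the authors' released code (https://github.com/ysmiura/ifcc): the entity sets are toy stand-ins for the output of Stanza's clinical NER models, and the `nli` and `sim` callbacks stand in for the trained radiology NLI model and BERTScore.

```python
from typing import Callable, List, Set

def harmonic_mean(pr: float, rc: float) -> float:
    return 0.0 if pr + rc == 0 else 2 * pr * rc / (pr + rc)

def fact_ent(gen_ents: Set[str], ref_ents: Set[str]) -> float:
    """Exact Entity Match Reward (Eqs. 8-10): harmonic mean of entity
    precision and recall between a generated and a reference report."""
    if not gen_ents or not ref_ents:
        return 0.0
    common = gen_ents & ref_ents
    return harmonic_mean(len(common) / len(gen_ents),
                         len(common) / len(ref_ents))

def phi(e: str, counterpart_ents: Set[str], h: str,
        counterpart_sents: List[str],
        nli: Callable[[str, str], str],
        sim: Callable[[str, str], float]) -> int:
    """Eq. 11: entity e in sentence h is accepted if the most similar
    counterpart sentence (Eq. 12) entails h, or if e also appears in the
    counterpart text and that sentence does not contradict h."""
    p_hat = max(counterpart_sents, key=lambda p: sim(h, p))  # Eq. 12
    label = nli(p_hat, h)  # (premise, hypothesis)
    if label == "entailment":
        return 1
    return int(e in counterpart_ents and label != "contradiction")

# Toy entity sets loosely modeled on the Figure 1 example; a real system
# would extract these with Stanza's clinical NER models.
gen_ents = {"heart", "aorta", "pleural", "effusion"}
ref_ents = {"pleural", "effusion", "atelectasis", "scarring",
            "rib", "fractures", "thickening", "pneumothorax"}
print(round(100 * fact_ent(gen_ents, ref_ents), 1))  # 33.3, as in Fig. 1
```

Replacing δ with φ in Eqs. 8-9 and again taking the harmonic mean yields factENTNLI, so a contradicted sentence such as the Figure 1 example has its entities removed from the match.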
3.2.3 Joint Loss for Optimizing Factual Completeness and Consistency

We integrate the proposed factual rewards into self-critical sequence training (Rennie et al., 2017). An RL loss L_RL is minimized as the negative expectation of the reward r. The gradient of the loss is estimated with a single Monte Carlo sample as

  ∇θ L_RL(θ) = −∇θ log Pθ(ŷ_sp | x_{1...K}) (r(ŷ_sp) − r(ŷ_gd))   (13)

where ŷ_sp is a sampled text and ŷ_gd is a greedy decoded text. Paulus et al. (2018) and Zhang et al. (2020b) have shown that a generation can be improved by combining multiple losses. We combine a factual metric loss with a language model loss and an NLG loss as

  L = λ1 L_NLL + λ2 L_RL_NLG + λ3 L_RL_FACT   (14)

where L_NLL is a language model loss, L_RL_NLG is the RL loss using an NLG metric (e.g., CIDEr or BERTScore), L_RL_FACT is the RL loss using a factual reward (e.g., factENT or factENTNLI), and λ∗ are scaling factors to balance the multiple losses.

3.3 A Weakly-Supervised Approach for Radiology NLI

We propose a weakly-supervised approach to construct an NLI model for radiology reports. (There already exists an NLI system for the medical domain, MedNLI (Romanov and Shivade, 2018), but we found that a model trained on MedNLI does not work well on radiology reports.) Given a large scale dataset of radiology reports, a sentence pair is sampled and filtered with weakly-supervised rules. The rules are prepared to extract a randomly sampled sentence pair (s1 and s2) that are in an entailment, neutral, or contradiction relation. We designed the following 6 rules for weak supervision.

Entailment 1 (E1): (1) s1 and s2 are semantically similar and (2) NE of s2 is a subset of or equal to NE of s1.

Neutral 1 (N1): (1) s1 and s2 are semantically similar and (2) NE of s1 is a subset of NE of s2.

Neutral 2 (N2): (1) NE of s1 are equal to NE of s2 and (2) s1 includes an antonym of a word in s2.

Neutral 3 (N3): (1) NE types of s1 are equal to NE types of s2 and (2) NE of s1 is different from NE of s2. NE types are used in this rule to introduce a certain level of similarity between s1 and s2.

Neutral 4 (N4): (1) NE of s1 are equal to NE of s2 and (2) s1 and s2 include observation keywords.

Contradiction 1 (C1): (1) NE of s1 is equal to or a subset of NE of s2 and (2) s1 is a negation of s2.

The rules rely on a semantic similarity measure and the overlap of entities to determine the relationship between s1 and s2. In the neutral rules and the contradiction rule, we included similarity measures to avoid extracting easy-to-distinguish sentence pairs.

We evaluated this NLI by preparing training data, validation data, and test data. For the training data, the training set of MIMIC-CXR (Johnson et al., 2019) is used as the source of sentence pairs. 2k pairs are extracted for E1 and C1, and 0.5k pairs are extracted for N1, N2, N3, and N4, resulting in a total of 6k pairs. The training set of MedNLI is also used as additional data. For the validation data and the test data, we sampled 480 sentence pairs from the validation section of MIMIC-CXR and had them annotated by two experts: one medical expert and one NLP expert. Each pair is annotated twice, swapping its premise and hypothesis, resulting in 960 pairs, which are split in half into 480 pairs for a validation set and 480 pairs for a test set. The test set of MedNLI is also used as alternative test data.

We used BERT (Devlin et al., 2019) as an NLI model since it performed as a strong baseline in the existing MedNLI system (Ben Abacha et al., 2019), and used Stanza (Qi et al., 2020) and its clinical models (Zhang et al., 2020c) as a named entity recognizer. Table 1 shows the result of the model trained with and without the weakly-supervised data.

  Training Data     #samples   Test Accuracy (RadNLI)   Test Accuracy (MedNLI)
  MedNLI            13k        53.3                     80.9
  MedNLI + RadNLI   19k        77.8                     79.8

Table 1: The accuracies of the NLI model trained with the weakly-supervised approach. RadNLI is the proposed NLI for radiology reports. The values are the average of 5 runs and the bold values are the best results of each test set.

The accuracy of NLI on radiology data increased substantially by +24.5% with the addition
of the radiology NLI training set. (See Appendix A for the detail of the rules, the datasets, and the model configuration.)

4 Experiments

4.1 Data

We used the training and validation sets of MIMIC-CXR (Johnson et al., 2019) to train and validate models. MIMIC-CXR is a large publicly available database of chest radiographs. We extracted the findings sections from the reports with a text extraction tool for MIMIC-CXR², and used them as our reference reports as in previous work (Liu et al., 2019; Boag et al., 2020). A findings section is a natural language description of the important aspects in a radiology image. The reports with empty findings sections were discarded, resulting in 152173 and 1196 reports for the training and validation sets, respectively. We used the test set of MIMIC-CXR and the entire Open-i Chest X-ray dataset (Demner-Fushman et al., 2012) as two individual test sets. Open-i is another publicly available database of chest radiographs which has been widely used in past studies. We again extracted the findings sections, resulting in 2347 reports for MIMIC-CXR and 3335 reports for Open-i. Open-i is used only for testing since the number of reports is too small to train and test a neural report generation model.

4.2 Evaluation Metrics

BLEU4, CIDEr-D & BERTScore: We first use general NLG metrics to evaluate the generation quality. These metrics include the 4-gram BLEU score (Papineni et al., 2002, BLEU4), the CIDEr score (Vedantam et al., 2015) with gaming penalties (CIDEr-D), and the F1 score of BERTScore (Zhang et al., 2020a).

Clinical Metrics: However, NLG metrics such as BLEU and CIDEr are known to be inadequate for evaluating factual completeness and consistency. We therefore followed previous work (Liu et al., 2019; Boag et al., 2020; Chen et al., 2020) by additionally evaluating the clinical accuracy of the generated reports using a clinical information extraction system. We use CheXbert (Smit et al., 2020), an information extraction system for chest reports, to extract the presence status of a series of observations (i.e., whether a disease is present or not), and score a generation by comparing the values of these observations to those obtained from the reference³. The micro average of accuracy, precision, recall, and F1 scores are calculated over 5 observations (following previous work (Irvin et al., 2019)) for: atelectasis, cardiomegaly, consolidation, edema, and pleural effusion⁴.

factENT & factENTNLI: We additionally include our proposed rewards factENT and factENTNLI as metrics to compare their values for different models.

4.3 Model Variations

We used M2 Trans as our report generation model and used DenseNet-121 (Huang et al., 2017) as our image encoder. We trained M2 Trans with the following variety of joint losses.

NLL: M2 Trans simply optimized with NLL loss as a baseline loss.

NLL+CDr: CIDEr-D and NLL loss are jointly optimized with λ1 = 0.01 and λ2 = 0.99 for the scaling factors.

NLL+BS: The F1 score of BERTScore and NLL loss are jointly optimized with λ1 = 0.01 and λ2 = 0.99.

NLL+BS+fcE: factENT is added to NLL+BS with λ1 = 0.01, λ2 = 0.495, and λ3 = 0.495.

NLL+BS+fcEN: factENTNLI is added to NLL+BS with λ1 = 0.01, λ2 = 0.495, and λ3 = 0.495.

We additionally prepared three previous models that have been tested on MIMIC-CXR.

TieNet: We reimplemented the model of Wang et al. (2018) consisting of a CNN encoder and an RNN decoder optimized with a multi-task setting of language generation and image classification.

CNN-RNN²: We reimplemented the model of Liu et al. (2019) consisting of a CNN encoder and a hierarchical RNN decoder optimized with CIDEr and Clinically Coherent Reward

² https://github.com/MIT-LCP/mimic-cxr/tree/master/txt
³ We used CheXbert instead of CheXpert (Irvin et al., 2019) since CheXbert was evaluated to be approximately 5.5% more accurate than CheXpert. The evaluation using CheXpert can be found in Appendix C.
⁴ These 5 observations are evaluated to be most represented in real-world radiology reports and therefore using these 5 observations (and excluding others) leads to less variance and more statistical strength in the results. We include the detailed results of the clinical metrics in Appendix C for completeness.
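As a concrete reading of §3.2.3 and the scaling factors listed above, the self-critical loss of Eq. 13 and the joint objective of Eq. 14 can be sketched as follows. This is a schematic with plain floats, not the authors' implementation: in an actual training loop each term is an autograd tensor and the reward difference in Eq. 13 is treated as a constant (detached) advantage.

```python
def scst_loss(sample_logprob: float, r_sample: float, r_greedy: float) -> float:
    """Self-critical RL loss (Eq. 13): the reward of the greedy-decoded
    report acts as a baseline, so a sampled report is reinforced only
    when it scores higher than greedy decoding."""
    return -sample_logprob * (r_sample - r_greedy)

def joint_loss(l_nll: float, l_rl_nlg: float, l_rl_fact: float,
               lambdas: tuple = (0.01, 0.495, 0.495)) -> float:
    """Joint objective of Eq. 14; the default weights are the
    NLL+BS+fcE / NLL+BS+fcEN setting from Section 4.3."""
    l1, l2, l3 = lambdas
    return l1 * l_nll + l2 * l_rl_nlg + l3 * l_rl_fact

# A sampled report that beats the greedy baseline gets a positive
# advantage; minimizing the loss then raises its log-probability.
rl_fact = scst_loss(sample_logprob=-4.0, r_sample=0.6, r_greedy=0.4)
rl_nlg = scst_loss(sample_logprob=-4.0, r_sample=0.7, r_greedy=0.5)
print(round(joint_loss(3.2, rl_nlg, rl_fact), 3))
```

Here `l_rl_nlg` would come from BERTScore or CIDEr-D and `l_rl_fact` from factENT or factENTNLI, combined with the teacher-forcing NLL term exactly as in the model variations above.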
                                                   NLG Metrics         Clinical Metrics (micro-avg)   Factual Rewards
  Dataset    Model                                 BL4   CDr    BS     P     R     F1    acc.          fcE    fcEN
  MIMIC-CXR  Previous models
             TieNet (Wang et al., 2018)            8.1   37.2   49.2   38.6  20.9  27.1  74.0          −      −
             CNN-RNN² (Liu et al., 2019)           7.6   44.7   41.2   66.4  18.7  29.2  79.0          −      −
             R2Gen (Chen et al., 2020)             8.6   40.6   50.8   41.2  29.8  34.6  73.9          −      −
             Proposed approach without proposed optimization
             M2 Trans w/ NLL                       10.5  44.5   51.2   48.9  41.1  44.7  76.5          27.3   24.4
             M2 Trans w/ NLL+CDr                   13.3  67.0   55.9   50.0  51.3  50.6  76.9          35.2   32.9
             Proposed approach
             M2 Trans w/ NLL+BS                    12.2  58.4   58.4   46.3  67.5  54.9  74.4          35.9   33.0
             M2 Trans w/ NLL+BS+fcE                11.1  49.2   57.2   46.3  73.2  56.7  74.2          39.5   34.8
             M2 Trans w/ NLL+BS+fcEN               11.4  50.9   56.9   50.3  65.1  56.7  77.1          38.5   37.9
  Open-i     Previous models
             TieNet (Wang et al., 2018)            9.0   65.7   56.1   46.9  15.9  23.7  96.0          −      −
             CNN-RNN² (Liu et al., 2019)           12.1  87.2   57.1   55.1  7.5   13.2  96.1          −      −
             R2Gen (Chen et al., 2020)             6.7   61.4   53.8   27.0  17.3  21.1  94.9          −      −
             Proposed approach without proposed optimization
             M2 Trans w/ NLL                       8.2   64.4   53.1   44.7  32.7  37.8  95.8          31.1   34.1
             M2 Trans w/ NLL+CDr                   13.4  97.2   59.9   48.2  24.2  32.2  96.0          40.6   42.9
             Proposed approach
             M2 Trans w/ NLL+BS                    12.3  87.3   62.4   47.7  46.6  47.2  95.9          41.5   44.1
             M2 Trans w/ NLL+BS+fcE                12.0  99.6   62.6   44.0  53.5  48.3  95.5          44.4   46.8
             M2 Trans w/ NLL+BS+fcEN               13.1  103.4  61.0   48.7  46.9  47.8  96.0          43.6   47.1

Table 2: Results of the baselines and our M2 Trans model trained with different joint losses. For the metrics, BL4, CDr, and BS represent BLEU4, CIDEr-D, and the F1 score of BERTScore; P, R, F1, and acc. represent the precision, recall, F1, and accuracy scores output by the clinical CheXbert labeler, respectively. For the rewards, fcE and fcEN represent factENT and factENTNLI, respectively.

 which is a reward based on the clinical met- on MIMIC-CXR with M2 Trans when compared
 rics. against M2 Trans w/ BS. For the clinical metrics,
 the best recalls and F1 scores are obtained with
R2Gen The model of Chen et al. (2020) with a M2 Trans using factENT as a reward, achieving a
 CNN encoder and a memory-driven Trans- substantial +22.1 increase (∆+63.9%) in F1 score
 former optimized with NLL loss. We used the against the best baseline R2Gen. We further find
 publicly available official code and its check- that using factENTNLI as a reward leads to higher
 point as its implementation. precision and accuracy compared to factENT with
For reproducibility, we include model configura- decreases in the recalls. The best precisions and
tions and training details in Appendix B. accuracies were obtained in the baseline CNN-
 RNN2 . This is not surprising since this model
5 Results and Discussions directly optimizes the clinical metrics with its Clin-
 ically Coherent Reward. However, this model is
5.1 Evaluation with NLG Metrics and strongly optimized against precision resulting in
 Clinical Metrics the low recalls and F1 scores.
Table 2 shows the results of the baselines5 and
M2 Trans optimized with the five different joint The results of M2 Trans without the proposed
losses. We find that the best result for a metric or rewards and BERTScore reveal the strength of M2
a reward is achieved when that metric or reward is Trans and the inadequacy of NLL loss and CIDEr
used directly in the optimization objective. Notably, for factual completeness and consistency. M2
for the proposed factual rewards, the increases of Trans w/ NLL shows strong improvements in the
 +3.6 factENT and +4.9 factENTNLI are observed clinical metrics against R2Gen. These improve-
 5
 These MIMIC-CXR scores have some gaps from the pre- ments are a little surprising since both models are
viously reported values with some possible reasons. First, Transformer-based models and are optimized with
TieNet and CNN-RNN2 in Liu et al. (2019) are evaluated NLL loss. We assume that these improvements
on a pre-release version of MIMIC-CXR. Second, we used
report-level evaluation for all models, but Chen et al. (2020) are due to architecture differences such as memory
tested R2Gen using image-level evaluation. matrices in the encoder of M2 Trans. The differ-
 5294
M2 Trans w/ BS R2Gen No Metric ρ
 Proposed (simple) (Chen et al., 2020) difference BLEU4 0.092
 36.5% 12.0% 51.5% CIDEr-D 0.034
 BERTScore 0.155
Table 3: The human evaluation result for randomly factENT 0.196
sampled 100 reports from the test set of MIMIC-CXR factENTNLI 0.255
by two board-certified radiologists.
 Table 4: The Spearman correlations ρ of NLG met-
 rics and factual metrics against clinical accuracy. The
ence between NLL and NLL+CDr on M2 Trans strongest correlation among all metrics is shown is
indicates that NLL and CIDEr are unreliable for bold.
factual completeness and consistency.
5.2 Human Evaluation

We performed a human evaluation to further confirm whether the generated radiology reports are factually complete and consistent. Following prior studies of radiology report summarization (Zhang et al., 2020b) and image captioning evaluation (Vedantam et al., 2015), we designed a simple human evaluation task. Given a reference report (R) and two candidate model-generated reports (C1, C2), two board-certified radiologists decided whether C1 or C2 is more factually similar to R. To cover cases where C1 and C2 are difficult to differentiate, we also provided "No difference" as an answer. We sampled 100 reports randomly from the test set of MIMIC-CXR for this evaluation. Since this evaluation is (financially) expensive and there had been no human evaluation among the baseline models, we selected R2Gen as the best previous model and M2 Trans w/ BS as the simplest proposed model, in order to be able to weakly infer that all of our proposed models are better than all of the baselines. Table 3 shows the result of the evaluation. The majority of the reports were labeled "No difference", but the proposed approach received three times as much preference as the baseline.

There are two main reasons why "No difference" was frequent in the human evaluation. First, we found that a substantial portion of the examples were normal studies (no abnormal observations), which leads to generated reports of similar quality from both models. Second, in some reports with multiple abnormal observations, both models made mistakes on a subset of these observations, making it difficult to decide which model output was better.

5.3 Estimating Clinical Accuracy with Factual Rewards

The integrations of factENT and factENTNLI showed improvements in the clinical metrics. We further examined whether these rewards can be used to estimate the performance of the clinical metrics, to see whether the proposed rewards can be used in an evaluation where a strong clinical information extraction system like CheXbert is not available. Table 4 shows Spearman correlations calculated on the generated reports of NLL+BS. factENTNLI shows the strongest correlation with clinical accuracy, which aligns with the optimization result, where the best accuracy is obtained with NLL+BS+factENTNLI. This correlation value is slightly lower than the Spearman correlation that Maynez et al. (2020) observed with NLI on factuality data (0.264). The result suggests the effectiveness of using the factual rewards to estimate the factual completeness and consistency of radiology reports, although the correlations are still limited, with some room for improvement.

5.4 Qualitative Analysis of Improved Clinical Completeness and Consistency

The evaluation with the clinical findings metrics showed improved generation performance from integrating BERTScore, factENT, and factENTNLI. As a qualitative analysis, we examined some of the generated reports to see the improvements. Example 1 in Figure 3 shows the improved factual completeness and consistency with BERTScore: the atelectasis is correctly generated, and the left pleural effusion is correctly suppressed, with NLL+BS. Example 2 in Figure 4 shows the improved factual completeness with factENTNLI: the edema is correctly generated, and atelectasis is correctly suppressed, with NLL+BS+fcEN. These examples reveal the strength of integrating the three metrics to generate factually complete and consistent reports.

Despite observing large improvements with our model in the clinical findings metrics evaluation, the model is still not complete, and some typical factual errors can be found in its generated reports. For example, Example 3 in Figure 4 includes a comparison of an observation against a previous study as
[Figure 3, Example 1: chest images with the reference report and the reports generated by R2Gen and by M2 Trans w/ NLL+BS, shown side by side]
Figure 3: An example of radiology reports generated by R2Gen and by the proposed model with the optimization
integrating BERTScore. Repeated sentences are removed from the example to improve readability.
[Figure 4, Examples 2 and 3: chest images with the reference reports and the reports generated by M2 Trans w/ NLL+BS and M2 Trans w/ NLL+BS+fcEN, shown side by side]
Figure 4: Examples of radiology reports generated by the proposed model with the optimization integrating
BERTScore and factENTNLI . Repeated sentences are removed from the examples to improve readability.

“... appear more prominent since ...” in the reference, but our model (or any previous model) cannot capture this kind of comparison, since the model is not designed to take the past reports of a patient as input. Additionally, in this example, edema is mentioned with uncertainty as “cannot be excluded” in the reference, but the generated report with factENTNLI simply states “There is mild pulmonary edema”.

6 Conclusion

We proposed two new simple rewards and combined them with a semantic equivalence metric to improve image-to-text radiology report generation systems. The two new rewards make use of radiology domain entities extracted with a named entity recognizer and a weakly-supervised NLI model to capture the factual completeness and consistency of the generated reports. We further presented a Transformer-based report generation system that directly optimizes these rewards with self-critical reinforcement learning. On two open datasets, we showed that our system generates reports that are more factually complete and consistent than the baselines, leading to substantially higher scores in clinical metrics. The integration of entities and NLI to improve the factual completeness and consistency of generation is not restricted to the domain of radiology reports, and we predict that a similar approach might improve other data-to-text tasks.

Acknowledgements

We would like to thank the anonymous reviewers and the members of the Stanford NLP Group for their very helpful comments that substantially improved this paper.
References

Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 370–379.

William Boag, Tzu-Ming Harry Hsu, Matthew McDermott, Gabriela Berner, Emily Alsentzer, and Peter Szolovits. 2020. Baselines for chest X-ray report generation. In Proceedings of the Machine Learning for Health NeurIPS Workshop, volume 116, pages 126–140.

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. 2020. Generating radiology reports via memory-driven transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1439–1449.

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10575–10584.

Dina Demner-Fushman, Sameer Antani, Matthew Simpson, and George R. Thoma. 2012. Design and development of a multimodal biomedical information retrieval system. Journal of Computing Science and Engineering, 6(2):168–177.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.

Christiane Fellbaum, editor. 1998. WordNet: A Lexical Database for English. MIT Press.

Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, and Hanqing Lu. 2020. Normalized and geometry-aware self-attention network for image captioning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10324–10333.

Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image captioning: Transforming objects into words. In Advances in Neural Information Processing Systems, volume 32, pages 11137–11147.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269.

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn L. Ball, Katie S. Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. 2019. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In The Thirty-Third AAAI Conference on Artificial Intelligence, volume 33, pages 590–597.

Baoyu Jing, Zeya Wang, and Eric Xing. 2019. Show, describe and conclude: On exploiting the structure information of chest x-ray reports. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6570–6580.

Baoyu Jing, Pengtao Xie, and Eric Xing. 2018. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2577–2586.

Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng. 2019. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(317).

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(160035).

Charles E. Kahn, Curtis P. Langlotz, Elizabeth S. Burnside, John A. Carrino, David S. Channin, David M. Hovsepian, and Daniel L. Rubin. 2009. Toward best practices in radiology reporting. Radiology, 252(3):852–856.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 9332–9346.

Christy Yuan Li, Xiaodan Liang, Zhiting Hu, and Eric P. Xing. 2018. Hybrid retrieval-generation reinforced agent for medical image report generation. In Advances in Neural Information Processing Systems, volume 31, pages 1530–1540.

Guang Li, Linchao Zhu, Ping Liu, and Yi Yang. 2019. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8927–8936.

Guanxiong Liu, Tzu-Ming Harry Hsu, Matthew McDermott, Willie Boag, Wei-Hung Weng, Peter Szolovits, and Marzyeh Ghassemi. 2019. Clinically accurate chest x-ray report generation. In Proceedings of the 4th Machine Learning for Healthcare Conference, volume 106, pages 249–269.

Kazuki Matsumaru, Sho Takase, and Naoaki Okazaki. 2020. Improving truthfulness of headline generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1335–1346.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919.

Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei. 2020. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10968–10977.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.

Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiyong Lu. 2018. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. In AMIA 2018 Informatics Summit.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In International Conference on Learning Representations.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1179–1195.

Alexey Romanov and Chaitanya Shivade. 2018. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1586–1596.

Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Ng, and Matthew Lungren. 2020. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1500–1519.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020a. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020.

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M. Summers. 2018. TieNet: Text-image embedding network for common thorax disease classification and reporting in chest x-rays. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9049–9058.

Zhenyi Wang, Xiaoyang Wang, Bang An, Dong Yu, and Changyou Chen. 2020b. Towards faithful neural table-to-text generation with content-matching constraints. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1072–1086.

Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019. Dialogue natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3731–3741.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.

Zhaofeng Wu, Yan Song, Sicong Huang, Yuanhe Tian, and Fei Xia. 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 415–426.

Jianbo Yuan, Haofu Liao, Rui Luo, and Jiebo Luo. 2019. Automatic radiology report generation based on multi-view image fusion and medical concept enrichment. In Medical Image Computing and Computer Assisted Intervention, pages 721–729.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020a. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Yuhao Zhang, Derek Merck, Emily Tsai, Christopher D. Manning, and Curtis Langlotz. 2020b. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5108–5120.

Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, and Curtis P. Langlotz. 2020c. Biomedical and clinical English model packages in the Stanza Python NLP library. arXiv preprint arXiv:2007.14640.
A Detail of Radiology NLI

A.1 Rules & Examples of Weakly-Supervised Radiology NLI

We prepared the 6 rules (E1, N1–N4, and C1) to train the weakly-supervised radiology NLI. The rules are applied to sentence pairs consisting of premises (s1) and hypotheses (s2) to extract pairs that are in an entailment, neutral, or contradiction relation.

Entailment Rule: E1
1. s1 and s2 are semantically similar.
2. The named entities (NE) of s2 are a subset of or equal to the named entities of s1, as NE(s2) ⊆ NE(s1).

We used BERTScore (Zhang et al., 2020a) as the similarity metric and set the threshold to sim(s1, s2) ≥ 0.7.[6] The clinical model of Stanza (Zhang et al., 2020c) is used to extract anatomy entities and observation entities. s1 and s2 are required to be both negated or both non-negated. Negation is determined with a negation identifier or the existence of an uncertain entity, using NegBio (Peng et al., 2018) as the negation identifier and the clinical model of Stanza to extract uncertain entities. s2 is further restricted to include at least 2 entities, as |NE(s2)| ≥ 2. The same similarity metric, named entity recognition model, and entity count restriction are used in the later neutral and contradiction rules. The negation restriction is used in the neutral rules but not in the contradiction rule. The following is an example of a sentence pair that matches E1, with entities in bold:

s1 The heart is mildly enlarged.
s2 The heart appears again mild-to-moderately enlarged.

Neutral Rule 1: N1
1. s1 and s2 are semantically similar.
2. The named entities of s1 are a proper subset of the named entities of s2, as NE(s1) ⊊ NE(s2).

Since s1 is a premise, this condition denotes that the counterpart hypothesis has entities that are not included in the premise. The following is an example of a sentence pair that matches N1, with entities in bold:

s1 There is no pulmonary edema or definite consolidation.
s2 There is no focal consolidation, pleural effusion, or pulmonary edema.

Neutral Rule 2: N2
1. The named entities of s1 are equal to the named entities of s2, as NE(s1) = NE(s2).
2. The anatomy modifiers (NEmod) of s1 include an antonym (ANT) of an anatomy modifier of s2, as NEmod(s1) ∩ ANT(NEmod(s2)) ≠ ∅.

Anatomy modifiers are extracted with the clinical model of Stanza, and antonyms are decided using WordNet (Fellbaum, 1998). Antonyms in anatomy modifiers are considered in this rule to differentiate expressions like left vs. right and upper vs. lower. The following is an example of a sentence pair that matches N2, with antonyms in bold:

s1 Moreover, a small left pleural effusion has newly occurred.
s2 Small right pleural effusion has worsened.

Neutral Rule 3: N3
1. The named entity types (NEtype) of s1 are equal to the named entity types of s2, as NEtype(s1) = NEtype(s2).
2. The named entities of s1 are disjoint from the named entities of s2, as NE(s1) ∩ NE(s2) = ∅.

The specific entity types that we used are anatomy and observation. This rule ensures that s1 and s2 have related but different entities of the same types. The following is an example of a sentence pair that matches N3, with entities in bold:

s1 There is minimal bilateral lower lobe atelectasis.
s2 The cardiac silhouette is moderately enlarged.

Neutral Rule 4: N4
1. The named entities of s1 are equal to the named entities of s2, as NE(s1) = NE(s2).
2. s1 and s2 include observation keywords (KEY) that belong to different groups, as KEY(s1) ≠ KEY(s2).

The groups of observation keywords are set up following the observation keywords of the CheXpert labeler (Irvin et al., 2019). Specifically, G1 = {normal, unremarkable}, G2 = {stable, unchanged}, and G3 = {clear} are used to determine words in different groups as a neutral relation. The following is an example of a sentence pair that matches N4, with keywords in bold:

s1 Normal cardiomediastinal silhouette.
s2 Cardiomediastinal silhouette is unchanged.

Contradiction Rule: C1
1. The named entities of s2 are a subset of or equal to the named entities of s1, as NE(s2) ⊆ NE(s1).
2. s1 or s2 is a negated sentence.

Negation is determined with the same approach as in E1. The following is an example of a sentence pair that matches C1, with entities in bold:

s1 There are also small bilateral pleural effusions.
s2 No pleural effusions.

[6] distilbert-base-uncased with the baseline score is used as the model of BERTScore for a fast comparison and a smooth score scale. We swept the threshold value over {0.6, 0.7, 0.8, 0.9} and set it to 0.7 as a relaxed boundary balancing accuracy and diversity.

A.2 Validation and Test Datasets of Radiology NLI

We sampled 480 sentence pairs that satisfy the following conditions from the validation section of MIMIC-CXR:
1. The two sentences (s1 and s2) have BERTScore(s1, s2) ≥ 0.5.
2. MedNLI labels are equally distributed over the three labels: entailment, neutral, and contradiction.[7]

These conditions are introduced to reduce neutral pairs, since most pairs would be neutral under random sampling. The sampled pairs were annotated twice, swapping premise and hypothesis, by two experts: one medical expert and one NLP expert. For pairs on which the two annotators disagreed, the labels were decided in a discussion with one additional NLP expert. The resulting 960 bidirectional pairs were split in half, resulting in 480 pairs for a validation set and 480 pairs for a test set.

[7] We used the baseline BERT model of Wu et al. (2019) to assign MedNLI labels to the pairs.

A.3 Configuration of Radiology NLI Model

We used bert-base-uncased as a pre-trained BERT model and further fine-tuned it on MIMIC-III (Johnson et al., 2016) radiology reports with a masked language modeling loss for 8 epochs. The model is then optimized on the training data with a classification negative log likelihood loss. We used Adam (Kingma and Ba, 2015) as the optimization method with β1 = 0.9, β2 = 0.999, a batch size of 16, and a gradient clipping norm of 5.0. The learning rate is set to lr = 1e−5 by running a preliminary experiment with lr = {1e−5, 2e−5}. The model is optimized for a maximum of 20 epochs, and validation accuracy is used to decide the model checkpoint used to evaluate the test set. We trained the model with a single Nvidia Titan XP, taking approximately 2 hours to complete 20 epochs.

B Configurations of Radiology Report Generation Models

B.1 M2 Trans

We used DenseNet-121 (Huang et al., 2017) as a CNN image feature extractor and pre-trained it on the CheXpert dataset with the 14-class classification setting. We used GloVe (Pennington et al., 2014) to pre-train text embeddings, and both pre-trainings were done on the training set with an embedding size of 512. The model parameters are set to a dimensionality of 512, 8 attention heads, and 40 memory vectors. We set the number of Transformer layers to nlayer = 1 by running a preliminary experiment with nlayer = {1, 2, 3}. The model is first trained against the NLL loss using the learning rate scheduler of Transformer (Devlin et al., 2019) with 20000 warm-up steps, and is further optimized with a joint loss at a fixed learning rate of 5e−6. Adam is used as the optimization method with β1 = 0.9 and β2 = 0.999. The batch size is set to 48 for the NLL loss and 24 for the joint losses. For λ∗, we first swept the optimal value of λ1 over {0.03, 0.02, 0.01, 0.001} using the development set. We restricted λ2 and λ3 to have equal values in our experiments and constrained all λ∗ values to sum to 1.0. The model is trained with the NLL loss for 32 epochs and further trained for 32 epochs with the joint loss. Beam search with a beam size of 4 is used to decode texts when evaluating the model against a validation set or a test set. We trained the model with a single Nvidia
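Linearized, the rule set in Appendix A.1 amounts to set operations over extracted entities plus similarity and negation checks. The sketch below is a hypothetical simplification, not the paper's code: `similar` stands in for the BERTScore ≥ 0.7 check, `neg1`/`neg2` for the NegBio/Stanza negation decision, and the entity arguments for Stanza clinical NER output. The rule ordering, the omission of the N2/N4 modifier and keyword checks, and reading C1's negation condition as "exactly one side negated" are our assumptions.

```python
# Hypothetical sketch of the weak-labeling rules E1, N1, N3, and C1.
# ne1/ne2: named-entity sets for premise s1 and hypothesis s2
# similar: whether sim(s1, s2) passed the BERTScore threshold
# neg1/neg2: whether s1/s2 are negated (NegBio / uncertain entities)

def weak_label(ne1, ne2, similar, neg1, neg2):
    """Return an NLI label for a premise/hypothesis pair, or None."""
    # C1: hypothesis entities covered by the premise and exactly one side
    # negated (our reading of "s1 or s2 is negated") -> contradiction.
    if ne2 <= ne1 and neg1 != neg2:
        return "contradiction"
    if neg1 != neg2:
        return None  # E1 and the neutral rules require matching negation
    # E1: semantically similar, hypothesis entities covered by the premise,
    # and |NE(s2)| >= 2 -> entailment.
    if similar and ne2 <= ne1 and len(ne2) >= 2:
        return "entailment"
    # N1: similar, but the hypothesis adds entities absent from the premise.
    if similar and ne1 < ne2:
        return "neutral"
    # N3: disjoint entity sets (the entity-type check is omitted here).
    if ne1 and ne2 and not (ne1 & ne2):
        return "neutral"
    return None

# E1 example from the paper: "The heart is mildly enlarged." vs.
# "The heart appears again mild-to-moderately enlarged."
print(weak_label({"heart", "enlarged"}, {"heart", "enlarged"},
                 similar=True, neg1=False, neg2=False))  # -> entailment
```

In the paper's pipeline the entity sets, similarity score, and negation flags would come from Stanza, BERTScore, and NegBio respectively; this sketch only shows how the set-theoretic conditions compose into labels.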