Natural Language Processing
with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 13: Coreference Resolution

Announcements
• Congratulations on surviving Ass5!
• I hope it was a rewarding state-of-the-art learning experience
• It was a new assignment: We look forward to feedback on it
• Final project proposals: Thank you! Lots of interesting stuff!
• We’ll get them back to you tomorrow!
• Do get started working on your projects now – talk to your mentor!
• We plan to get Ass4 back this week…
• Final project milestone is out, due Tuesday March 2 (next Tuesday)
• Second invited speaker – Colin Raffel, super exciting! – is this Thursday
• Again, you need to write a reaction paragraph for it
Lecture Plan: Lecture 13: Coreference Resolution
1. What is Coreference Resolution? (10 mins)
2. Applications of coreference resolution (5 mins)
3. Mention Detection (5 mins)
4. Some Linguistics: Types of Reference (5 mins)
Four Kinds of Coreference Resolution Models
5. Rule-based (Hobbs Algorithm) (10 mins)
6. Mention-pair and mention-ranking models (15 mins)
7. Interlude: ConvNets for language (sequences) (10 mins)
8. Current state-of-the-art neural coreference systems (10 mins)
9. Evaluation and current results (10 mins)
1. What is Coreference Resolution?
• Identify all mentions that refer to the same entity in the world
A couple of years later, Vanaja met Akhila at the local park. Akhila’s son Prajwal was just two months younger than her son Akash, and they went to the same school. For the pre-school play, Prajwal was chosen for the lead role of the naughty child Lord Krishna. Akash was to be a tree. She resigned herself to make Akash the best tree that anybody had ever seen. She bought him a brown T-shirt and brown trousers to represent the tree trunk. Then she made a large cardboard cutout of a tree’s foliage, with a circular opening in the middle for Akash’s face. She attached red balls to it to represent fruits. It truly was the nicest tree. From The Star by Shruthi Rao, with some shortening.
Applications
• Full text understanding
• information extraction, question answering, summarization, …
• “He was born in 1961” (Who?)
Applications
• Full text understanding
• Machine translation
• languages have different features for gender, number,
dropped pronouns, etc.
Applications
• Full text understanding
• Machine translation
• Dialogue Systems
“Book tickets to see James Bond”
“Spectre is playing near you at 2:00 and 3:00 today. How many
tickets would you like?”
“Two tickets for the showing at three”
Coreference Resolution in Two Steps
1. Detect the mentions (easy)
“[I] voted for [Nader] because [he] was most aligned with
[[my] values],” [she] said
• mentions can be nested!
2. Cluster the mentions (hard)
“[I] voted for [Nader] because [he] was most aligned with
[[my] values],” [she] said
3. Mention Detection
• Mention: A span of text referring to some entity
• Three kinds of mentions:
1. Pronouns
• I, your, it, she, him, etc.
2. Named entities
• People, places, etc.: Paris, Joe Biden, Nike
3. Noun phrases
• “a dog,” “the big fluffy cat stuck in the tree”
Mention Detection
• Mention: A span of text referring to some entity
• For detection: traditionally, use a pipeline of other NLP systems
1. Pronouns
• Use a part-of-speech tagger
2. Named entities
• Use a Named Entity Recognition system
3. Noun phrases
• Use a parser (especially a constituency parser – next week!)
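As a concrete illustration of this pipeline view (not part of the lecture), here is a minimal sketch using spaCy: the POS tagger marks pronouns, the NER component marks named entities, and the parser exposes (flat) noun chunks. The model name en_core_web_sm is just one choice of pretrained pipeline.

import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("I voted for Nader because he was most aligned with my values.")

pronouns = [tok.text for tok in doc if tok.pos_ == "PRON"]   # 1. pronouns via POS tags
named_entities = [ent.text for ent in doc.ents]              # 2. named entities via NER
noun_phrases = [chunk.text for chunk in doc.noun_chunks]     # 3. noun phrases via the parser

print(pronouns)         # e.g. ['I', 'he', 'my']
print(named_entities)   # e.g. ['Nader']
print(noun_phrases)     # e.g. ['I', 'Nader', 'he', 'my values']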
Mention Detection: Not Quite So Simple
• Marking all pronouns, named entities, and NPs as mentions
over-generates mentions
• Are these mentions?
• It is sunny
• Every student
• No student
• The best donut in the world
• 100 miles
How to deal with these bad mentions?
• Could train a classifier to filter out spurious mentions
• Much more common: keep all mentions as “candidate
mentions”
• After your coreference system is done running, discard all singleton mentions (i.e., ones that have not been marked as coreferent with anything else)
Avoiding a traditional pipeline system
• We could instead train a classifier specifically for mention
detection instead of using a POS tagger, NER system, and parser.
• Or we can not even try to do mention detection explicitly:
• We can build a model that begins with all spans and jointly
does mention-detection and coreference resolution end-to-
end in one model
• Will cover later in this lecture!
4. On to Coreference! First, some linguistics
• Coreference is when two mentions refer to the same entity in
the world
• Barack Obama traveled to … Obama ...
• A different-but-related linguistic concept is anaphora: when a
term (anaphor) refers to another term (antecedent)
• the interpretation of the anaphor is in some way determined
by the interpretation of the antecedent
• Barack Obama said he would sign the bill.
antecedent anaphor
Anaphora vs. Coreference
Coreference with named entities
text Barack Obama Obama
world
Anaphora
text Barack Obama he
world
Not all anaphoric relations are coreferential
• Not all noun phrases have reference
• Every dancer twisted her knee.
• No dancer twisted her knee.
• There are three NPs in each of these sentences; because the first one is non-referential, the other two aren’t either.
Anaphora vs. Coreference
• Not all anaphoric relations are coreferential
We went to see a concert last night. The tickets were really expensive.
• This is referred to as bridging anaphora.
(Diagram: coreference vs. anaphora — “Barack Obama … Obama” is coreference; pronominal anaphora falls under both coreference and anaphora; bridging anaphora is anaphora but not coreference)
Anaphora vs. Cataphora
• Usually the antecedent comes before the anaphor (e.g., a pronoun), but not always
Cataphora
“From the corner of the divan of Persian saddle-
bags on which he was lying, smoking, as was his
custom, innumerable cigarettes, Lord Henry
Wotton could just catch the gleam of the honey-
sweet and honey-coloured blossoms of a
laburnum…”
(Oscar Wilde – The Picture of Dorian Gray)
Taking stock …
• It’s often said that language is interpreted “in context”
• We’ve seen some examples, like word-sense disambiguation:
• I took money out of the bank vs. The boat disembarked from the bank
• Coreference is another key example of this:
• Obama was the president of the U.S. from 2008 to 2016. He was born in Hawaii.
• As we progress through an article, or dialogue, or webpage, we build up a (potentially
very complex) discourse model, and we interpret new sentences/utterances with
respect to our model of what’s come before.
• Coreference and anaphora are all we will see of this kind of whole-discourse meaning in this class
Four Kinds of Coreference Models
• Rule-based (pronominal anaphora resolution)
• Mention Pair
• Mention Ranking
• Clustering [skipping this year; see Clark and Manning (2016)]
5. Traditional pronominal anaphora resolution: Hobbs’ naive algorithm
1. Begin at the NP immediately dominating the pronoun
2. Go up tree to first NP or S. Call this X, and the path p.
3. Traverse all branches below X to the left of p, left-to-right, breadth-first. Propose as antecedent any NP that has a NP or S between it and X
4. If X is the highest S in the sentence, traverse the parse trees of the previous sentences in the order of recency. Traverse each tree left-to-right, breadth first. When an NP is encountered, propose as antecedent. If X not the highest node, go to step 5.
Hobbs’ naive algorithm (1976)
5. From node X, go up the tree to the first NP or S. Call it X, and the path p.
6. If X is an NP and the path p to X came from a non-head phrase of X (a specifier or adjunct,
such as a possessive, PP, apposition, or relative clause), propose X as antecedent
(The original said “did not pass through the N’ that X immediately dominates”, but
the Penn Treebank grammar lacks N’ nodes….)
7. Traverse all branches below X to the left of the path, in a left-to-right, breadth first
manner. Propose any NP encountered as the antecedent
8. If X is an S node, traverse all branches of X to the right of the path but do not go
below any NP or S encountered. Propose any NP as the antecedent.
9. Go to step 4
Until deep learning, Hobbs’ algorithm was still often used as a feature in ML systems!

Hobbs Algorithm Example
(This slide repeats steps 1–9 of Hobbs’ algorithm alongside a worked parse-tree example; the tree figure is not reproduced here.)

Knowledge-based Pronominal Coreference
• She poured water from the pitcher into the cup until it was full.
• She poured water from the pitcher into the cup until it was empty.
• The city council refused the women a permit because they feared violence.
• The city council refused the women a permit because they advocated violence.
• Winograd (1972)
• These are called Winograd Schema
• Recently proposed as an alternative to the Turing test
• See: Hector J. Levesque “On our best behaviour” IJCAI 2013
http://www.cs.toronto.edu/~hector/Papers/ijcai-13-paper.pdf
• http://commonsensereasoning.org/winograd.html
• If you’ve fully solved coreference, arguably you’ve solved AI!!!

Hobbs’ algorithm: commentary
“… the naïve approach is quite good. Computationally
speaking, it will be a long time before a semantically
based algorithm is sophisticated enough to perform as
well, and these results set a very high standard for any
other approach to aim for.
“Yet there is every reason to pursue a semantically
based approach. The naïve algorithm does not work.
Any one can think of examples where it fails. In these
cases, it not only fails; it gives no indication that it has
failed and offers no help in finding the real antecedent.”
— Hobbs (1978), Lingua, p. 345

6. Coreference Models: Mention Pair
“I voted for Nader because he was most aligned with my values,” she said.
I Nader he my she
Coreference Cluster 1
Coreference Cluster 2
Coreference Models: Mention Pair
• Train a binary classifier that assigns every pair of mentions a
probability of being coreferent:
• e.g., for “she” look at all candidate antecedents (previously
occurring mentions) and decide which are coreferent with it
“I voted for Nader because he was most aligned with my values,” she said.
I Nader he my she
coreferent with she?
Coreference Models: Mention Pair
• Train a binary classifier that assigns every pair of mentions a
probability of being coreferent:
• e.g., for “she” look at all candidate antecedents (previously
occurring mentions) and decide which are coreferent with it
“I voted for Nader because he was most aligned with my values,” she said.
I Nader he my she
Positive examples: want to be near 1
Coreference Models: Mention Pair
• Train a binary classifier that assigns every pair of mentions a
probability of being coreferent:
• e.g., for “she” look at all candidate antecedents (previously
occurring mentions) and decide which are coreferent with it
“I voted for Nader because he was most aligned with my values,” she said.
I Nader he my she
Negative examples: want to be near 0
Mention Pair Training
• N mentions in a document
• y_ij = 1 if mentions m_i and m_j are coreferent, −1 otherwise
• Just train with regular cross-entropy loss (looks a bit different
because it is binary classification)
J = − Σ_{i=2..N} Σ_{j=1..i−1} y_ij log p(m_j, m_i)
(the outer sum iterates through mentions; the inner sum iterates through candidate antecedents, i.e. previously occurring mentions; coreferent mention pairs should get high probability, others should get low probability)
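A minimal PyTorch sketch of this training objective (an illustration, not the lecture’s code): it assumes the pair probabilities have already been produced by some classifier, and uses 1/0 labels in place of the slide’s +1/−1 convention.

import torch

def mention_pair_loss(pair_probs, labels):
    # pair_probs: p(m_j, m_i) for every candidate pair (j < i), values in (0, 1)
    # labels: 1.0 if the pair is coreferent, 0.0 otherwise
    eps = 1e-8  # numerical stability
    return -(labels * torch.log(pair_probs + eps)
             + (1 - labels) * torch.log(1 - pair_probs + eps)).sum()

# toy usage with three candidate pairs
probs = torch.tensor([0.9, 0.2, 0.7])
gold = torch.tensor([1.0, 0.0, 1.0])
print(mention_pair_loss(probs, gold))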
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only
scoring pairs of mentions… what to do?
I Nader he my she
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only
scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold
“I voted for Nader because he was most aligned with my values,” she said.
I Nader he my she
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only
scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold
• Take the transitive closure to get the clustering
“I voted for Nader because he was most aligned with my values,” she said.
I Nader he my she
Even though the model did not predict this coreference link,
I and my are coreferent due to transitivity
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only
scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold
• Take the transitive closure to get the clustering
“I voted for Nader because he was most aligned with my values,” she said.
I Nader he my she
Adding this extra link would merge everything
into one big coreference cluster!
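The threshold-then-transitive-closure step described above can be implemented with a simple union-find; a minimal sketch (the mention indexing and pair probabilities below are made up for illustration):

def cluster_mentions(pair_probs, num_mentions, threshold=0.5):
    # pair_probs: {(i, j): p(m_i, m_j)} for scored mention pairs; mentions indexed 0..N-1
    parent = list(range(num_mentions))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for (i, j), p in pair_probs.items():
        if p > threshold:          # add a coreference link
            union(i, j)

    clusters = {}                  # transitive closure falls out of the union-find roots
    for m in range(num_mentions):
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

# "I voted for Nader because he was most aligned with my values," she said.
# mentions: 0=I, 1=Nader, 2=he, 3=my, 4=she; made-up pair probabilities
links = {(0, 3): 0.8, (0, 4): 0.7, (1, 2): 0.9, (2, 3): 0.1}
print(cluster_mentions(links, 5))  # -> I/my/she in one cluster, Nader/he in another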
Mention Pair Models: Disadvantage
• Suppose we have a long document with the following mentions
• Ralph Nader … he … his … him …
… voted for Nader because he …
(Linking the final “he” to the nearby “Nader” is relatively easy; linking it all the way back to “Ralph Nader” at the start of the document is almost impossible.)
Mention Pair Models: Disadvantage
• Suppose we have a long document with the following mentions
• Ralph Nader … he … his … him …
… voted for Nader because he …
(Linking the final “he” to the nearby “Nader” is relatively easy; linking it all the way back to “Ralph Nader” at the start of the document is almost impossible.)
• Many mentions only have one clear antecedent
• But we are asking the model to predict all of them
• Solution: instead train the model to predict only one antecedent for each mention
• More linguistically plausible
Coreference Models: Mention Ranking
• Assign each mention its highest scoring candidate antecedent
according to the model
• Dummy NA mention allows model to decline linking the current
mention to anything (“singleton” or “first” mention)
NA I Nader he my she
best antecedent for she?
Coreference Models: Mention Ranking
• Assign each mention its highest scoring candidate antecedent
according to the model
• Dummy NA mention allows model to decline linking the current
mention to anything (“singleton” or “first” mention)
NA I Nader he my she
Positive examples: model has to assign a high
probability to either one (but not necessarily both)
Coreference Models: Mention Ranking
• Assign each mention its highest scoring candidate antecedent
according to the model
• Dummy NA mention allows model to decline linking the current
mention to anything (“singleton” or “first” mention)
NA I Nader he my she
best antecedent for she?
p(NA, she) = 0.1
p(I, she) = 0.5
p(Nader, she) = 0.1
p(he, she) = 0.1
p(my, she) = 0.2
(Apply a softmax over the scores for candidate antecedents so probabilities sum to 1)

Coreference Models: Mention Ranking
• Assign each mention its highest scoring candidate antecedent
according to the model
• Dummy NA mention allows model to decline linking the current
mention to anything (“singleton” or “first” mention)
NA I Nader he my she
only add highest scoring
coreference link
p(NA, she) = 0.1
p(I, she) = 0.5
p(Nader, she) = 0.1
p(he, she) = 0.1
p(my, she) = 0.2
(Apply a softmax over the scores for candidate antecedents so probabilities sum to 1)

Coreference Models: Training
• We want the current mention mj to be linked to any one of the
candidate antecedents it’s coreferent with.
• Mathematically, we want to maximize this probability:
Σ_{j=1..i−1} 1[y_ij = 1] p(m_j, m_i)
(iterate through candidate antecedents, i.e. previously occurring mentions; for the ones that are coreferent with m_j, we want the model to assign a high probability)
• Turning this into a training objective over all N mentions in the document:
J = log Π_{i=2..N} Σ_{j=1..i−1} 1[y_ij = 1] p(m_j, m_i)
• The model could produce 0.9 probability for one of the correct
antecedents and low probability for everything else, and the
sum will still be large
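A minimal PyTorch sketch of this objective for a single mention (an illustration, not the lecture’s code): raw scores over the candidate antecedents, including the dummy NA, go through a softmax, and we maximize the probability mass assigned to gold antecedents.

import torch

def mention_ranking_loss(antecedent_scores, gold_mask):
    # antecedent_scores: raw scores for [NA, m_1, ..., m_{i-1}] for the current mention
    # gold_mask: 1.0 where that candidate is a gold antecedent (or NA for a first/singleton mention)
    log_probs = torch.log_softmax(antecedent_scores, dim=-1)       # softmax over candidates
    masked = log_probs.masked_fill(gold_mask == 0, float("-inf"))  # keep only gold antecedents
    return -torch.logsumexp(masked, dim=-1)                        # negative marginal log-likelihood

# toy usage for "she" with candidates [NA, I, Nader, he, my]; I and my are gold antecedents
scores = torch.tensor([0.1, 1.2, -0.3, 0.0, 0.4])
gold = torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0])
print(mention_ranking_loss(scores, gold))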
How do we compute the probabilities?
A. Non-neural statistical classifier
B. Simple neural network
C. More advanced model using LSTMs, attention, transformers
A. Non-Neural Coref Model: Features
• Person/Number/Gender agreement
• Jack gave Mary a gift. She was excited.
• Semantic compatibility
• … the mining conglomerate … the company …
• Certain syntactic constraints
• John bought him a new car. [him can not be John]
• More recently mentioned entities are preferred for reference
• John went to a movie. Jack went as well. He was not busy.
• Grammatical Role: Prefer entities in the subject position
• John went to a movie with Jack. He was not busy.
• Parallelism:
• John went with Jack to a movie. Joe went with him to a bar.
• …
B. Neural Coref Model
• Standard feed-forward neural network
• Input layer: word embeddings and a few categorical features
Score: s = W4 h3 + b4
Hidden Layer: h3 = ReLU(W3 h2 + b3)
Hidden Layer: h2 = ReLU(W2 h1 + b2)
Hidden Layer: h1 = ReLU(W1 h0 + b1)
Input Layer: h0 = [Candidate Antecedent Embeddings, Candidate Antecedent Features, Mention Embeddings, Mention Features, Additional Features]
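A minimal PyTorch sketch of a scorer with this shape (the layer sizes and the split of the input features are placeholders, not the values used in the lecture):

import torch
from torch import nn

class MentionPairScorer(nn.Module):
    def __init__(self, input_size=500, hidden_size=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size), nn.ReLU(),   # h1 = ReLU(W1 h0 + b1)
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),  # h2 = ReLU(W2 h1 + b2)
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),  # h3 = ReLU(W3 h2 + b3)
            nn.Linear(hidden_size, 1),                       # s  = W4 h3 + b4
        )

    def forward(self, antecedent_emb, antecedent_feats, mention_emb, mention_feats, extra_feats):
        # h0 = concatenation of candidate-antecedent and mention embeddings/features
        h0 = torch.cat([antecedent_emb, antecedent_feats,
                        mention_emb, mention_feats, extra_feats], dim=-1)
        return self.net(h0).squeeze(-1)

scorer = MentionPairScorer()
# placeholder dimensions chosen so they sum to input_size=500
s = scorer(torch.randn(200), torch.randn(30), torch.randn(200), torch.randn(30), torch.randn(40))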
Neural Coref Model: Inputs
• Embeddings
• Previous two words, first word, last word, head word, … of each
mention
• The head word is the “most important” word in the mention – you can find it using a parser, e.g., “cat” in “The fluffy cat stuck in the tree”
• Still need some other features to get a strongly performing model:
• Distance
• Document genre
• Speaker information
7. Convolutional Neural Nets (very briefly)
• Main CNN/ConvNet idea:
• What if we compute vectors for every possible word subsequence of a certain length?
• Example: “tentative deal reached to keep government open” computes vectors for:
• tentative deal reached, deal reached to, reached to keep, to keep government, keep
government open
• Regardless of whether phrase is grammatical
• Not very linguistically or cognitively plausible???
• Then group them afterwards (more soon)
What is a convolution anyway?
• 1d discrete convolution definition: (f ∗ g)[n] = Σ_m f[m] g[n − m]
• Convolution is classically used to extract features from images
• Models position-invariant identification
• Go to cs231n!
• 2d example →
• Yellow color and red numbers
show filter (=kernel) weights
• Green shows input
• Pink shows output
From Stanford UFLDL wiki
A 1D convolution for text
tentative 0.2 0.1 −0.3 0.4
deal 0.5 0.2 −0.3 −0.1 t,d,r −1.0 0.0 0.50
reached −0.1 −0.3 −0.2 0.4 d,r,t −0.5 0.5 0.38
to 0.3 −0.3 0.1 0.1 r,t,k −3.6 -2.6 0.93
keep 0.2 −0.3 0.4 0.2 t,k,g −0.2 0.8 0.31
government 0.1 0.2 −0.1 −0.1 k,g,o 0.3 1.3 0.21
open −0.4 −0.4 0.2 0.3
Apply a filter (or kernel) of size 3, then + bias, then ➔ non-linearity:
3 1 2 −3
−1 2 1 −3
1 1 −1 1
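The first column of numbers next to the trigrams above (the raw convolution outputs, before bias and non-linearity) can be reproduced with PyTorch; a small sanity-check sketch, not lecture code. Note that F.conv1d computes a cross-correlation (no kernel flip), which is exactly what the table computes.

import torch
import torch.nn.functional as F

words = torch.tensor([            # 7 words x 4-dim embeddings, copied from the table above
    [0.2, 0.1, -0.3, 0.4],        # tentative
    [0.5, 0.2, -0.3, -0.1],       # deal
    [-0.1, -0.3, -0.2, 0.4],      # reached
    [0.3, -0.3, 0.1, 0.1],        # to
    [0.2, -0.3, 0.4, 0.2],        # keep
    [0.1, 0.2, -0.1, -0.1],       # government
    [-0.4, -0.4, 0.2, 0.3],       # open
])
kernel = torch.tensor([           # filter of size 3 over the 4 embedding dimensions
    [3.0, 1.0, 2.0, -3.0],
    [-1.0, 2.0, 1.0, -3.0],
    [1.0, 1.0, -1.0, 1.0],
])

x = words.T.unsqueeze(0)          # shape (batch=1, channels=4, length=7)
w = kernel.T.unsqueeze(0)         # shape (out_channels=1, in_channels=4, kernel_size=3)
out = F.conv1d(x, w)              # shape (1, 1, 5)
print(out.squeeze())              # -> approx [-1.0, -0.5, -3.6, -0.2, 0.3], matching the table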
1D convolution for text with padding
∅ 0.0 0.0 0.0 0.0
tentative 0.2 0.1 −0.3 0.4 ∅,t,d −0.6
deal 0.5 0.2 −0.3 −0.1 t,d,r −1.0
reached −0.1 −0.3 −0.2 0.4 d,r,t −0.5
to 0.3 −0.3 0.1 0.1 r,t,k −3.6
keep 0.2 −0.3 0.4 0.2 t,k,g −0.2
government 0.1 0.2 −0.1 −0.1 k,g,o 0.3
open −0.4 −0.4 0.2 0.3 g,o,∅ −0.5
∅ 0.0 0.0 0.0 0.0
Apply a filter (or kernel) of size 3
3 1 2 −3
−1 2 1 −3
1 1 −1 1

3 channel 1D convolution with padding = 1
∅ 0.0 0.0 0.0 0.0
tentative 0.2 0.1 −0.3 0.4 ∅,t,d −0.6 0.2 1.4
deal 0.5 0.2 −0.3 −0.1 t,d,r −1.0 1.6 −1.0
reached −0.1 −0.3 −0.2 0.4 d,r,t −0.5 −0.1 0.8
to 0.3 −0.3 0.1 0.1 r,t,k −3.6 0.3 0.3
keep 0.2 −0.3 0.4 0.2 t,k,g −0.2 0.1 1.2
government 0.1 0.2 −0.1 −0.1 k,g,o 0.3 0.6 0.9
open −0.4 −0.4 0.2 0.3 g,o,∅ −0.5 −0.9 0.1
∅ 0.0 0.0 0.0 0.0
Apply 3 filters of size 3. (Could also use (zero) padding = 2, also called “wide convolution”.)
3 1 2 −3    1 0 0 1    1 −1 2 −1
−1 2 1 −3   1 0 −1 −1  1 0 −1 3
1 1 −1 1    0 1 0 1    0 2 2 1
conv1d, padded with max pooling over time
∅ 0.0 0.0 0.0 0.0 ∅,t,d −0.6 0.2 1.4
tentative 0.2 0.1 −0.3 0.4 t,d,r −1.0 1.6 −1.0
deal 0.5 0.2 −0.3 −0.1 d,r,t −0.5 −0.1 0.8
reached −0.1 −0.3 −0.2 0.4 r,t,k −3.6 0.3 0.3
to 0.3 −0.3 0.1 0.1 t,k,g −0.2 0.1 1.2
keep 0.2 −0.3 0.4 0.2 k,g,o 0.3 0.6 0.9
government 0.1 0.2 −0.1 −0.1 g,o,∅ −0.5 −0.9 0.1
open −0.4 −0.4 0.2 0.3
∅ 0.0 0.0 0.0 0.0 max p 0.3 1.6 1.4
Apply 3 filters of size 3
3 1 2 −3 1 0 0 1 1 −1 2 −1
−1 2 1 −3 1 0 −1 −1 1 0 −1 3
1 1 −1 1    0 1 0 1    0 2 2 1

conv1d, padded with ave pooling over time
∅ 0.0 0.0 0.0 0.0 ∅,t,d −0.6 0.2 1.4
tentative 0.2 0.1 −0.3 0.4 t,d,r −1.0 1.6 −1.0
deal 0.5 0.2 −0.3 −0.1 d,r,t −0.5 −0.1 0.8
reached −0.1 −0.3 −0.2 0.4 r,t,k −3.6 0.3 0.3
to 0.3 −0.3 0.1 0.1 t,k,g −0.2 0.1 1.2
keep 0.2 −0.3 0.4 0.2 k,g,o 0.3 0.6 0.9
government 0.1 0.2 −0.1 −0.1 g,o,∅ −0.5 −0.9 0.1
open −0.4 −0.4 0.2 0.3
∅ 0.0 0.0 0.0 0.0 ave p −0.87 0.26 0.53
Apply 3 filters of size 3
3 1 2 −3 1 0 0 1 1 −1 2 −1
−1 2 1 −3 1 0 −1 −1 1 0 −1 3
1 1 −1 1    0 1 0 1    0 2 2 1

In PyTorch
import torch
from torch import nn

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)
conv1 = nn.Conv1d(in_channels=word_embed_size, out_channels=3,
                  kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)                  # shape (batch, 3, seq_len - 2)
hidden2, _ = torch.max(hidden1, dim=2)  # max pool over time (torch.max returns values and indices)
Learning Character-level Representations for Part-of-
Speech Tagging Dos Santos and Zadrozny (2014)
• Convolution over
characters to generate
word embeddings
• Fixed window of word
embeddings used for PoS
tagging
8. End-to-end Neural Coref Model
• Current state-of-the-art models for coreference resolution
• Kenton Lee et al. from UW (EMNLP 2017) et seq.
• Mention ranking model
• Improvements over simple feed-forward NN
• Use an LSTM
• Use attention
• Do mention detection and coreference end-to-end
• No mention detection step!
• Instead consider every span of text (up to a certain length) as a candidate mention
• a span is just a contiguous sequence of words
End-to-end Model
• First embed the words in the document using a word embedding matrix and a
character-level CNN
[Figure 1 from Lee et al. (2017): word & character embeddings (x) → bidirectional LSTM (x*) → span head (x̂) → span representation (g) → mention score (s_m), illustrated on “General Electric said the Postal Service contacted the company”.]
Figure 1: First step of the end-to-end coreference resolution model, which computes embedding representations of spans for scoring potential entity mentions. Low-scoring spans are pruned, so that only a manageable number of spans is considered for coreference decisions. In general, the model considers all possible spans up to a maximum width, but we depict here only a small subset.
End-to-end Model
• Then run a bidirectional LSTM over the document
End-to-end Model
• Next, represent each span of text i going from START(i) to END(i) as a vector
End-to-end Model
• Next, represent each span of text i going from START(i) to END(i) as a vector
• General, General Electric, General Electric said, …, Electric, Electric said, … will all get its own vector representation
End-to-end Model
• Next, represent each span of text i going from START(i) to END(i) as a vector
• For example, for “the Postal Service”

Span representation: g_i = [x*_START(i), x*_END(i), x̂_i, φ(i)]
• x*_START(i), x*_END(i): BiLSTM hidden states for the span’s start and end
• x̂_i: attention-based representation of the words in the span (details next slide)
• φ(i): additional features

End-to-end Model
• x̂_i is an attention-weighted average of the word embeddings in the span:
Attention scores: α_t = w_α · FFNN_α(x*_t)   (dot product of a weight vector and a transformed hidden state)
Attention distribution: a_{i,t} = exp(α_t) / Σ_{k=START(i)..END(i)} exp(α_k)   (just a softmax over the attention scores for the span)
Attention-weighted sum of word embeddings: x̂_i = Σ_{t=START(i)..END(i)} a_{i,t} · x_t

End-to-end Model
• Why include all these different terms in the span representation?
g_i = [x*_START(i), x*_END(i), x̂_i, φ(i)]
• x*_START(i), x*_END(i) represent the context to the left and right of the span
• x̂_i represents the span itself
• φ(i) represents other information not in the text

End-to-end Model
• Lastly, score every pair of spans to decide if they are coreferent mentions

s(i, j) = 0                               if j = ε (the dummy antecedent)
s(i, j) = s_m(i) + s_m(j) + s_a(i, j)     otherwise
(Is i a mention? Is j a mention? Do they look coreferent?)

• e.g., s(the company, the Postal Service)
• Scoring functions take the span representations as input:
s_m(i) = w_m · FFNN_m(g_i)
s_a(i, j) = w_a · FFNN_a([g_i, g_j, g_i ∘ g_j, φ(i, j)])
(where ∘ denotes element-wise multiplication; φ(i, j) again adds some extra features, and g_i ∘ g_j gives multiplicative interactions between the representations — a code sketch of these scoring functions appears after the attention examples below)

End-to-end Model
• Intractable to score every pair of spans
• O(T^2) spans of text in a document (T is the number of words)
• O(T^4) runtime!
• So have to do lots of pruning to make this work (only consider a few of
the spans that are likely to be mentions)
• Attention learns which words are important in a mention (a bit like
head words)
(A fire in a Bangladeshi garment factory) has left at least 37 people dead and 100 hospitalized. Most
of the deceased were killed in the crush as workers tried to flee (the blaze) in the four-story building.
1
A fire in (a Bangladeshi garment factory) has left at least 37 people dead and 100 hospitalized. Most
of the deceased were killed in the crush as workers tried to flee the blaze in (the four-story building).
We are looking for (a region of central Italy bordering the Adriatic Sea). (The area) is mostly
mountainous and includes Mt. Corno, the highest peak of the Apennines. (It) also includes a lot of
2
sheep, good clean-living, healthy sheep, and an Italian entrepreneur has an idea about how to make a
little money of them.
(The flight attendants) have until 6:00 today to ratify labor concessions. (The pilots’) union and ground
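Putting the pieces above together, here is a minimal, self-contained PyTorch sketch of the span scoring machinery (attention-weighted span head x̂, span representation g, and the scores s_m and s_a). The dimensions, layer sizes, and the one-layer attention scorer are illustrative choices, not those of Lee et al. (2017).

import torch
from torch import nn

class SpanScorer(nn.Module):
    def __init__(self, hidden=64, feat=20):
        super().__init__()
        self.attn = nn.Linear(hidden, 1)       # alpha_t = w_a . FFNN_a(x*_t), collapsed to one layer
        span_dim = 3 * hidden + feat           # [x*_start, x*_end, x_hat, phi(i)]
        self.ffnn_m = nn.Sequential(nn.Linear(span_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        pair_dim = 3 * span_dim + feat         # [g_i, g_j, g_i * g_j, phi(i, j)]
        self.ffnn_a = nn.Sequential(nn.Linear(pair_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def span_repr(self, states, start, end, phi):
        # states: (seq_len, hidden) BiLSTM outputs x*_t; start/end index a span inclusively
        span = states[start:end + 1]                            # words inside the span
        a = torch.softmax(self.attn(span).squeeze(-1), dim=0)   # attention distribution a_{i,t}
        x_hat = (a.unsqueeze(-1) * span).sum(dim=0)             # attention-weighted head x_hat_i
        return torch.cat([states[start], states[end], x_hat, phi])

    def score_pair(self, g_i, g_j, phi_ij):
        s_m_i = self.ffnn_m(g_i)                                # is i a mention?
        s_m_j = self.ffnn_m(g_j)                                # is j a mention?
        s_a = self.ffnn_a(torch.cat([g_i, g_j, g_i * g_j, phi_ij]))  # do they look coreferent?
        return (s_m_i + s_m_j + s_a).squeeze(-1)                # s(i, j)

scorer = SpanScorer()
states = torch.randn(9, 64)   # e.g. BiLSTM states for the 9-word example sentence
g_i = scorer.span_repr(states, 3, 5, torch.randn(20))   # a "the Postal Service"-style span
g_j = scorer.span_repr(states, 7, 8, torch.randn(20))   # a "the company"-style span
print(scorer.score_pair(g_i, g_j, torch.randn(20)))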
3

BERT-based coref: Now has the best results!
• Pretrained transformers can learn long-distance semantic dependencies in text.
• Idea 1, SpanBERT: Pretrains BERT models to be better at span-based prediction tasks
like coref and QA
• Idea 2, BERT-QA for coref: Treat Coreference like a deep QA task
• “Point to” a mention, and ask “what is its antecedent”
• Answer span is a coreference link
9. Coreference Evaluation
• Many different metrics: MUC, CEAF, LEA, B-CUBED, BLANC
• People often report the average over a few different metrics
• Essentially the metrics think of coreference as a clustering task and
evaluate the quality of the clustering
Gold Cluster 1
Gold Cluster 2
System Cluster 1 System Cluster 2
Coreference Evaluation
• An example: B-cubed
• For each mention, compute a precision and a recall
P = 4/5
R= 4/6
Gold Cluster 1
Gold Cluster 2
System Cluster 1 System Cluster 2
Coreference Evaluation
• An example: B-cubed
• For each mention, compute a precision and a recall
P = 4/5
R= 4/6
Gold Cluster 1
Gold Cluster 2
P = 1/5
R= 1/3
System Cluster 1 System Cluster 2
Coreference Evaluation
• An example: B-cubed
• For each mention, compute a precision and a recall
• Then average the individual Ps and Rs
P = [4·(4/5) + 1·(1/5) + 2·(2/4) + 2·(2/4)] / 9 = 0.6
(per-mention values in the figure: P = 4/5, R = 4/6; P = 1/5, R = 1/3; P = 2/4, R = 2/6; P = 2/4, R = 2/3)
Gold Cluster 1, Gold Cluster 2
System Cluster 1, System Cluster 2
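A small sketch of B-cubed precision and recall (computed per mention, then averaged); the toy clusters below mirror the worked example above and give P = 0.6.

def b_cubed(gold_clusters, system_clusters):
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}    # mention -> its gold cluster
    sys_of = {m: frozenset(c) for c in system_clusters for m in c}   # mention -> its system cluster
    mentions = list(gold_of)
    precisions, recalls = [], []
    for m in mentions:
        overlap = len(gold_of[m] & sys_of[m])
        precisions.append(overlap / len(sys_of[m]))   # how pure is m's system cluster?
        recalls.append(overlap / len(gold_of[m]))     # how much of m's gold cluster is found?
    return sum(precisions) / len(mentions), sum(recalls) / len(mentions)

# two gold clusters (sizes 6 and 3) and two system clusters (sizes 5 and 4), as in the figure
gold = [{1, 2, 3, 4, 5, 6}, {7, 8, 9}]
system = [{1, 2, 3, 4, 7}, {5, 6, 8, 9}]
print(b_cubed(gold, system))   # -> (0.6, ~0.56)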
Coreference Evaluation
100% Precision, 33% Recall vs. 50% Precision, 100% Recall
System Performance
• OntoNotes dataset: ~3000 documents labeled by humans
• English and Chinese data
• Standard evaluation: an F1 score averaged over 3 coreference metrics
System Performance
Model | English | Chinese | Notes
Lee et al. (2010) | ~55 | ~50 | Rule-based system, used to be state-of-the-art!
Chen & Ng (2012) [CoNLL 2012 Chinese winner] | 54.5 | 57.6 | Non-neural machine learning model
Fernandes (2012) [CoNLL 2012 English winner] | 60.7 | 51.6 | Non-neural machine learning model
Wiseman et al. (2015) | 63.3 | — | Neural mention ranker
Clark & Manning (2016) | 65.4 | 63.7 | Neural clustering model
Lee et al. (2017) | 67.2 | — | End-to-end neural ranker
Joshi et al. (2019) | 79.6 | — | End-to-end neural mention ranker + SpanBERT
Wu et al. (2019) [CorefQA] | 79.9 | — | CorefQA
Xu et al. (2020) | 80.2 | |
Wu et al. (2020) [CorefQA + SpanBERT-large] | 83.1 | | CorefQA + SpanBERT rulez
Where do neural scoring models help?
• Especially with NPs and named entities with no string matching.
Neural vs non-neural scores:
These kinds of coreference are hard and the scores are still low!
Example Wins
Anaphor Antecedent
the country’s leftist rebels the guerillas
the company the New York firm
216 sailors from the ``USS cole’’ the crew
the gun the rifle
Conclusion
• Coreference is a useful, challenging, and linguistically interesting task
• Many different kinds of coreference resolution systems
• Systems are getting better rapidly, largely due to better neural models
• But most models still make many mistakes – OntoNotes coref is the easy newswire case
• Try out a coreference system yourself!
• http://corenlp.run/ (ask for coref in Annotations)
• https://huggingface.co/coref/