Beyond tagging: segmentation+labeling tasks Intro to NLP - ETHZ - 24/03/2014
Summary
● POS review
  ○ Bayesian HMM
● Information Extraction:
  ○ NER and related tasks
● Segmentation & Labeling:
  ○ Models
  ○ Features
● Shallow parsing
● Entity disambiguation
● Hand out: mid-term exam
POS ambiguity
"Squad helps dog bite victim"
● bite -> verb?
● bite -> noun?
"Dealers will hear Car Talk at noon"
● car -> noun & talk -> verb?
● car & talk -> proper names?
PoS tagging
● The process of assigning a part-of-speech tag (label) to each word in a text.
She/PRP promised/VBD to/TO back/VB the/DT bill/NN ./.
Applications
A useful pre-processing step in many tasks:
● Speech synthesis
● Syntactic parsing
● Machine translation
● Information retrieval
● Named entity recognition
● Summarization
  ○ Title: "But Siri, Apple's personal assistant application on the iPhone 4s, doesn't disappoint"
PTB tagset
Sequence classification
● Dependencies between variables: P(NN|DT) >> P(VB|DT) ...
● Independent per-word tagging is suboptimal
● Sequence models: classify whole sequences y1:T

HMMs:
a. Given observation x1:T, predict the best tagging y1:T
   ■ Viterbi algorithm (sketch below)
b. Given model theta, compute P(x1:T|theta)
   ■ Forward (Backward) algorithm
c. Given a dataset of sequences, estimate theta
   ■ ML estimation (supervised)
   ■ EM (unsupervised)
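A minimal Viterbi sketch for problem (a). The dictionary-based parameters (trans, emit), the START handling, and the assumption that all referenced probabilities are nonzero are illustrative choices made here, not the lecture's code:

    # Viterbi decoding in log-space for a first-order HMM.
    import math

    def viterbi(words, tags, trans, emit):
        """Return the most probable tag sequence for `words`."""
        V = [{}]  # V[t][tag] = (best log-prob, best previous tag)
        for tag in tags:
            V[0][tag] = (math.log(trans["START"][tag]) + math.log(emit[tag][words[0]]), None)
        for t in range(1, len(words)):
            V.append({})
            for tag in tags:
                best_prev, best_lp = None, float("-inf")
                for prev in tags:
                    lp = V[t - 1][prev][0] + math.log(trans[prev][tag])
                    if lp > best_lp:
                        best_prev, best_lp = prev, lp
                V[t][tag] = (best_lp + math.log(emit[tag][words[t]]), best_prev)
        # Backtrace from the best final tag.
        tag = max(tags, key=lambda s: V[-1][s][0])
        path = [tag]
        for t in range(len(words) - 1, 0, -1):
            tag = V[t][tag][1]
            path.append(tag)
        return list(reversed(path))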
HMMs
HMM = (Q, O, A, B)
1. States: Q = q1..qN [the part-of-speech tags]
   a. Including special initial/final states q0 and qF
   b. lambda = (A, B)
2. Observation symbols: O = o1..oV [words]
3. Transitions:
   a. A = {aij}; aij = P(qt = j | qt-1 = i) ~ Multi(ai)
4. Emissions:
   a. B = {bik}; bik = P(ot = vk | qt = i) ~ Multi(bi)
Complete data likelihood The joint probability of a sequence of words and tags, given a model:
Generative process: 1. generate a tag sequence 2. emit the words for each tag
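The joint-probability formula on the original slide is not reproduced in this transcript; its standard form, matching the generative process above (with y0 = START and an optional final transition to END), is:

P(x_{1:T}, y_{1:T} \mid \lambda) = \prod_{t=1}^{T} a_{y_{t-1} y_t}\, b_{y_t}(x_t)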
Parameter estimation (supervised) Maximum likelihood estimates (MLE) on data 1. Transition probabilities:
2. Emission probabilities:
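The slide's formulas are not reproduced in this transcript; the standard counts-based MLE estimates, in the notation above, are:

\hat{a}_{ij} = \frac{\mathrm{count}(q_{t-1}=i,\ q_t=j)}{\mathrm{count}(q_{t-1}=i)}
\qquad
\hat{b}_{ik} = \frac{\mathrm{count}(q_t=i,\ o_t=v_k)}{\mathrm{count}(q_t=i)}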
Parameter estimation (unsupervised) 1. Transition probabilities:
2. Emission probabilities:
Parameter estimation 1. Expected number of times in state i for observation O:
2. Expected number of transitions from i to j for observation O:
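The slide's equations are not in this transcript; in the usual forward-backward notation (alpha and beta from the Forward and Backward passes), the expected counts are:

\gamma_t(i) = P(q_t = i \mid O, \lambda) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}
\qquad
\xi_t(i,j) = P(q_t = i,\ q_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}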
Gamma
[Figure: trellis with states V, N (plus START/END) over time steps 1-4 for the observation "board backs plan vote", illustrating the state posteriors; e.g., γ2(V) = 0.7752.]
Emission re-estimation
bik = (expected number of times in state i emitting word k) / (expected number of times in state i)
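In terms of the γ quantities above (standard Baum-Welch form; the slide's own equation is not reproduced here):

\hat{b}_{ik} = \frac{\sum_{t:\, o_t = v_k} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}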
Xi
[Figure: the same trellis over "board backs plan vote" (states V, N with START/END, time steps 1-4), illustrating expected transitions; e.g., ξ2(V,V) = 0.40.]
Transition re-estimation
aij = (expected number of transitions from i to j) / (expected number of transitions out of i)
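In terms of ξ and γ (standard Baum-Welch form; the slide's own equation is not reproduced here):

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}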
Forward-Backward
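A compact forward-backward sketch in Python, assuming the toy set-up of the next slide (tags V and N, with START/END handled through trans["START"] and trans[tag]["END"]). The dictionary-based parameter format and the use of NumPy are choices made here for illustration, not the lecture's code:

    import numpy as np

    def forward_backward(words, tags, trans, emit):
        """Return gamma[t, i] and xi[t, i, j] for one observation sequence."""
        T, N = len(words), len(tags)
        a = np.array([[trans[i][j] for j in tags] for i in tags])   # N x N transitions
        b = np.array([[emit[i][w] for w in words] for i in tags])   # N x T emissions
        pi = np.array([trans["START"][i] for i in tags])
        end = np.array([trans[i]["END"] for i in tags])

        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * b[:, 0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ a) * b[:, t]
        beta[T - 1] = end
        for t in range(T - 2, -1, -1):
            beta[t] = a @ (b[:, t + 1] * beta[t + 1])

        Z = alpha[T - 1] @ end                      # P(O | lambda)
        gamma = alpha * beta / Z                    # state posteriors, T x N
        xi = np.zeros((T - 1, N, N))                # expected transitions
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * a * (b[:, t + 1] * beta[t + 1])[None, :] / Z
        return gamma, xi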
Example
X = [ "board backs plan vote",
      "vote backs board plan",
      "backs board vote plan",
      "plan vote backs board" ]

A =
          V     N     END
  V      0.4   0.6   0.3
  N      0.5   0.5   0.7
  START  0.4   0.6    -

B =
         board  backs  plan  vote
  V       0.1    0.4   0.3   0.2
  N       0.4    0.1   0.2   0.3
Example
Optimizing only some of the parameters also improves the log-likelihood (LogL): Generalized EM
Bayesian modeling
■ Model uncertainty about the data
■ Prefer models with certain properties; e.g., sparsity
■ Non-parametric models: number of hidden variables unknown a priori
Data likelihood - Likelihood function:
- Maximum likelihood problem:
- Bayesian modeling:
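The equations on the original slide are omitted in this transcript; the standard forms are:

L(\theta) = P(X \mid \theta) = \sum_{Y} P(X, Y \mid \theta)
\qquad
\theta^{*} = \arg\max_{\theta} P(X \mid \theta)
\qquad
P(\theta \mid X) \propto P(X \mid \theta)\, P(\theta)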
Bayesian inference
1. Model the distribution over theta (inference via sampling):
2. Variational Bayes: find (point-estimate) variational parameters lambda that minimize an upper bound on the negative log-likelihood, including the priors.
Bayesian HMMs
HMM = (Q, O, A, B)
1. States: Q = q1..qN [the part-of-speech tags]
   a. Including special initial/final states q0 and qF
2. Observation symbols: O = o1..oV [words]
3. Transitions:
   a. A = {aij}; aij = P(qt = j | qt-1 = i) ~ Multi(ai)
   b. ai | alphaA ~ Dir(alphaA)
4. Emissions:
   a. B = {bik}; bik = P(ot = vk | qt = i) ~ Multi(bi)
   b. bi | alphaB ~ Dir(alphaB)
alpha: controls the sparsity of A and B
Parameter estimation - VB 1. Transition probabilities:
2. Emission probabilities:
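The slide's update equations are omitted in this transcript. For reference, the mean-field VB updates for a Dirichlet-multinomial HMM replace the EM ratios with exponentiated digamma functions (following Johnson, 2007; notation adapted here, so treat this as a sketch rather than the slide's exact formula):

\hat{a}_{ij} = \frac{\exp\big(\psi(E[n_{ij}] + \alpha_A)\big)}{\exp\big(\psi(E[n_{i\cdot}] + N\alpha_A)\big)}
\qquad
\hat{b}_{ik} = \frac{\exp\big(\psi(E[m_{ik}] + \alpha_B)\big)}{\exp\big(\psi(E[m_{i\cdot}] + V\alpha_B)\big)}

where E[n_{ij}] and E[m_{ik}] are the expected counts from forward-backward and ψ is the digamma function.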
Example with varying alpha
alpha = 1.0, after FB
X = [ "board backs plan vote",
      "vote backs board plan",
      "backs board vote plan",
      "plan vote backs board" ]

A =
          V      N      END
  V      0.34   0.66   0.12
  N      0.13   0.87   0.88
  START  0.51   0.49    -

B =
         board  backs  plan  vote
  V      0.25   0.32   0.18  0.24
  N      0.25   0.22   0.28  0.25
alpha = 0.1, after FB
X = [ "board backs plan vote",
      "vote backs board plan",
      "backs board vote plan",
      "plan vote backs board" ]

A =
          V      N      END
  V      0.0    1.0    0.0
  N      0.36   0.64   1.0
  START  0.2    0.8     -

B =
         board  backs  plan  vote
  V      0.0    1.0    0.0   0.0
  N      0.33   0.0    0.33  0.33
[Figure: tag frequencies under different estimators (Johnson, 2007)]
Information Extraction
Goal: Identify structured data in unstructured text:
○ Identify mentions of instances of predefined classes: names of people, locations, organizations, etc.
  ■ Named-entity recognition (NER)
  ■ Co-reference resolution
○ Identify and normalize temporal expressions:
  ■ Normalize temporal expressions
  ■ Associate events with points in time
  ■ Timeline generation
○ Relation detection and classification
○ Event detection
IE: input "Citing high fuel prices, United Airlines said Friday it has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco."
IE: (desired) output
AIRLINE FARE-RAISE:
● LEAD AIRLINE: UA
● AMOUNT: 6$
● EFFECTIVE DATE: 2006-10-26
● FOLLOWER: AA
1. Populate a database
2. Compare with historical stock prices
How? - Structured prediction: e.g., sequence models - Supervised and semi-supervised methods - Challenging problems - Limitations and new perspectives
IE: NER "Citing high fuel prices, United Airlines said Friday it has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco."
IE: co-reference resolution "Citing high fuel prices, United Airlines1 said Friday it1 has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, a unit of AMR Corp.3, immediately matched the move, spokesman Tim Wagner4 said. United1, a unit of UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."
IE: temporal expressions "Citing high fuel prices, United Airlines1 said Friday it1 has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, a unit of AMR Corp.3, immediately matched the move, spokesman Tim Wagner4 said. United1, a unit of UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."
IE: relation extraction "Citing high fuel prices, United Airlines1 said Friday it1 [has increased fares by] 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, [a unit of] AMR Corp.3, [immediately matched the move], [spokesman] Tim Wagner4 said. United1, [a unit of] UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."
IE: event extraction "Citing high fuel prices, United Airlines1 said Friday [it1 [has increased fares by] 6$ per round trip on flights to some cities also served by lower-cost carriers.]1 American Airlines2, [a unit of] AMR Corp.3, [immediately matched [the move]1], [spokesman] Tim Wagner4 said. United1, [a unit of] UAL Corp.5, said [the increase]1 took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."
Sequence classifiers Choose your favorite sequence classifier: HMM, CRF, MEMM, SVM, Perceptron, ....
We abstract from the specifics going forward
Beyond POS tagging
● Sequence labeling tasks: assign one (POS) label to each word
  ○ Cats/NNS love/VBP fish/NN ./.
● Given: labeled data, sequence classifier
● Sequence segmentation + labeling:
  ○ Information extraction, biomedical IE, shallow (partial) parsing
  ○ Task: labels can span several words
"U.N. Security Council tries to end Syria conflict"
Segmentation + labeling "U.N. Security Council tries to end Syria conflict"
"U.N.ORG SecurityORG CouncilORG tries to end SyriaLOC conflict"
How many organizations?
Segmentation + labeling "U.N. Security Council tries to end Syria conflict"
"[U.N. Security Council]ORG tries to end [Syria]LOC conflict"
One organization...
Segmentation + Labeling 1. Identify the boundaries of a segment: a. "[U.N. Security Council] tries to end Syria conflict" 2. Label the segment: a. "[U.N. Security Council]ORG tries to end Syria conflict"
3. How?
   a. Reduce the problem to simple labeling
   b. BIO encoding: split the labels into
      i. B-X: beginning of label X
      ii. I-X: inside a label X
      iii. O: outside a label (NULL label)
BIO encoding
1. Encode the data in BIO format:
   a. [U.N. Security Council]ORG tries to end [Syria]LOC conflict.
   b. U.N.B-ORG SecurityI-ORG CouncilI-ORG triesO toO endO SyriaB-LOC conflictO
   c. No termination ambiguity: B-X -> I-X continues the same entity; any other transition starts a new entity (or none).
2. Train and use your favorite sequence classifier as usual a. No guarantee that output will be consistent b. Set B-X -> I-Y transitions to 0 if necessary 3. Evaluate appropriately: P/R/F1
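A minimal sketch of turning a BIO tag sequence back into labeled spans; the token/tag pairing and the (label, start, end) span format are illustrative assumptions, not a prescribed interface:

    # Decode a BIO tag sequence into (label, start, end) spans; end is exclusive.
    # Inconsistent sequences (e.g., a stray I-X after O) are repaired by treating
    # the I-X as B-X, one common convention.
    def bio_to_spans(tags):
        spans, start, label = [], None, None
        for i, tag in enumerate(tags):
            if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
                if label is not None:
                    spans.append((label, start, i))
                start, label = i, tag[2:]
            elif tag == "O":
                if label is not None:
                    spans.append((label, start, i))
                start, label = None, None
        if label is not None:
            spans.append((label, start, len(tags)))
        return spans

    tags = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "B-LOC", "O"]
    print(bio_to_spans(tags))   # [('ORG', 0, 3), ('LOC', 6, 7)]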
Features
● Words:
  a. Target token identity, stem; w=Corp., sw=corp.
  b. Same for words within a window around the target; w-1=AMR, w+1=,
● PoS tags:
  a. Target token and tokens within the window; pos=NNP, pos+1=,
● Shape:
  a. Regexp-like features; s=Xxp, s+1=p
● Prefixes/Suffixes:
  a. Of the target and its neighbors; suf=., suf=p., pre=c, pre=co, ...
● Gazetteers: labeled lists of phrases (locations, cities, first names, etc.):
  a. gX = target is in list X; example: Tim -> gFirstName=true
  b. Crucial features out-of-domain
  c. Features on segments? Semi-Markov models
● Label features:
  a. label-1=B-ORG, ...
  b. label+1=O, ... (bidirectional models)
(A small feature-extraction sketch follows.)
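A minimal sketch of per-token feature extraction along these lines; the feature names, the +/-1 window, and the shape heuristic are illustrative choices, not the lecture's implementation:

    import re

    def shape(w):
        """Coarse word shape, e.g. 'Corp.' -> 'Xxp' (collapsed character classes)."""
        s = re.sub(r"[a-z]", "x", w)
        s = re.sub(r"[A-Z]", "X", s)
        s = re.sub(r"[^xX]", "p", s)
        return re.sub(r"(.)\1+", r"\1", s)          # collapse repeated runs

    def token_features(tokens, pos_tags, i, gazetteers):
        """Features for token i; `gazetteers` maps list names to sets of strings."""
        w = tokens[i]
        feats = {
            "w=" + w: 1, "sw=" + w.lower(): 1,
            "pos=" + pos_tags[i]: 1, "s=" + shape(w): 1,
            "pre=" + w[:2].lower(): 1, "suf=" + w[-2:].lower(): 1,
        }
        for off in (-1, +1):                        # a +/-1 token window
            if 0 <= i + off < len(tokens):
                feats[f"w{off:+d}=" + tokens[i + off]] = 1
                feats[f"pos{off:+d}=" + pos_tags[i + off]] = 1
        for name, entries in gazetteers.items():    # gazetteer membership
            if w in entries:
                feats[f"g{name}=true"] = 1
        return feats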
NER Identifying mentions (spans of text) referring to instances of a pre-defined set of classes
Entity ambiguity
NOTE: The problem of the actual "identity" of an entity is not addressed in classic IE, unless approximated by coreference.
NER as sequence classification
Evaluation ● Precision/Recall/F1: ○ micro/macro: mention/type level
● Correct prediction: ○ both entity boundaries are correctly identified: [New York]ORG State vs. [New York State]ORG ○ the label is correct: [New York State]LOC vs [New York State]ORG
● F1 > 90% in the supervised case, in-domain!
  ○ Significant degradation out-of-domain
  ○ Domain adaptation, semi-supervised learning, etc.
Relation extraction
Relation: a set of ordered tuples over a domain
● E.g., President-of = {(Obama, US), (Sarkozy, France), ...}
1. Identify named entities
2. For each pair (ei, ej):
   a. Predict-Relation(ei, ej) [possibly NULL]
Task-specific features: 1. Text between entities: BOW 2. Syntactic structure
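A minimal sketch of the pairwise set-up with a bag-of-words-between-entities feature; the span format and the `classify` interface are illustrative assumptions, not a prescribed API:

    from itertools import combinations

    def relation_candidates(tokens, entities, classify):
        """entities: list of (label, start, end) spans sorted by position;
        classify: feature dict -> relation label or None (NULL)."""
        predictions = []
        for (lab1, s1, e1), (lab2, s2, e2) in combinations(entities, 2):
            between = tokens[min(e1, e2):max(s1, s2)]            # tokens between the mentions
            feats = {"types=" + lab1 + "-" + lab2: 1}
            feats.update({"bow=" + w.lower(): 1 for w in between})
            rel = classify(feats)
            if rel is not None:
                predictions.append((rel, (s1, e1), (s2, e2)))
        return predictions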
Relation Extraction
Appositive constructions: part-of, is-a etc.
Distant supervision for RE
Supervised approaches are limited.
Idea: exploit existing repositories such as Wikipedia (infoboxes) and Freebase (relations) to bootstrap models for RE:
1. Start with a seed set of pairs
2. Repeat:
   a. Build extraction patterns from the data
   b. Extract new pairs
Issues: semantic drift, evaluation
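A naive sketch of the bootstrapping loop above, where a "pattern" is simply the literal text observed between a known pair; this is deliberately crude (and shows how semantic drift creeps in), not a production system:

    def bootstrap(sentences, seed_pairs, rounds=3):
        """sentences: list of strings; seed_pairs: iterable of (e1, e2) strings."""
        pairs, patterns = set(seed_pairs), set()
        for _ in range(rounds):
            # 1. Induce patterns from sentences that contain a known pair.
            for sent in sentences:
                for e1, e2 in pairs:
                    if e1 in sent and e2 in sent and sent.index(e1) < sent.index(e2):
                        between = sent[sent.index(e1) + len(e1):sent.index(e2)].strip()
                        if between:
                            patterns.add(between)
            # 2. Extract new candidate pairs with the induced patterns.
            for sent in sentences:
                for pat in patterns:
                    if pat in sent:
                        left, right = sent.split(pat, 1)
                        if left.split() and right.split():
                            pairs.add((left.split()[-1], right.split()[0]))
        return pairs, patterns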
Temporal expressions
● More regular patterns than in other tasks
● Challenges:
  ○ Coverage
  ○ False positives
● Normalization: map to an interval or point in time
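A toy illustration of the pattern-based view; real systems (e.g., TimeML/TIMEX annotators) use far richer grammars plus normalization rules, and the patterns below are illustrative only:

    import re

    MONTHS = r"(January|February|March|April|May|June|July|August|September|October|November|December)"
    PATTERNS = [
        re.compile(MONTHS + r"\s+\d{1,2}(,\s*\d{4})?"),   # "October 26, 2006"
        re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),             # "2006-10-26"
        re.compile(r"\b(today|yesterday|tomorrow|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b", re.I),
    ]

    def find_temporal_expressions(text):
        # Note: "May" firing on the verb "may" is exactly the false-positive
        # problem mentioned in the slide above.
        return [m.group(0) for p in PATTERNS for m in p.finditer(text)]

    print(find_temporal_expressions("United said Friday the increase took effect Thursday."))
    # ['Friday', 'Thursday']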
Partial parsing
Goal: Identify the basic (non-recursive) N/V/A/P phrases of a sentence (chunking):
● flat/non-overlapping
● a segmentation+labeling task
● base phrases do not contain constituents of the same type
● include the head of the phrase
● exclude post-head modifiers
"[The morning flight]NP from [Denver]NP [has arrived]VP"
BIO encoding
American Airlines , a unit of AMR Inc.
[American Airlines]NP , [a unit]NP of [AMR Inc.]NP
AmericanB-NP AirlinesI-NP ,O aO unitB-NP ofO AMRB-NP Inc.I-NP
From Trees to Chunks
[American Airlines]NP , [a unit]NP of [AMR Inc.]NP
Partial parsing
PP as sequence classification
Limitations of NER 1. Washington was born at Haywood Farms near Oak Grove, Virginia. 2. Washington attended Yale Law School for a year and then studied at the University of Oxford. 3. Washington served as army chief of staff under William Tubman. 4. Historians laud Washington for his selection and supervision of his generals. 5. Washington was the founder of the town of Centralia, Washington.
Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. 3. WashingtonPER served as army chief of staff under William Tubman. 4. Historians laud WashingtonPER for his selection and supervision of his generals. 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC.
Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. 3. WashingtonPER served as army chief of staff under William Tubman. 4. Historians laud WashingtonPER for his selection and supervision of his generals. 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC. 5 different people!
Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. [George] 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. [George] 3. WashingtonPER served as army chief of staff under William Tubman. [George] 4. Historians laud WashingtonPER for his selection and supervision of his generals. [George] 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC. [George] With the same first name!
1. Ambiguity: a word (phrase) can mean different things
[Images: two different entities both referred to as "President Bush"]
hypothesis: "George Bush father" vs. "George Bush son"
ambiguity resolved by proxy strings
[Images: several different entities all referred to as "Mercury"]
side effects on text similarity
1. "Astronomers took pictures of the star" 2. "Paparazzis took pictures of the star" ● Unrelated but close in string space
1. Ambiguity: a word (phrase) can mean different things
2. Synonymy: the same thing can be said in different ways
[Diagram: mentions linked by synonymy and ambiguity relations around a shared topic]
side effects on text similarity
1. "U.S. President issues Libya ultimatum" 2. "Barack Obama says Gaddafi may wait out military offensive"
● Closely related but unrelated in string space
text similarity in entity space "Astronomers took pictures of the star"
"Paparazzis took pictures of the star"
"U.S. President issues Libya ultimatum"
"Barack Obama says Gaddafi may wait out military offensive"
Entity disambiguation ● Segmentation and labeling problem ○ Output = Wikipedia, Freebase, YAGO, etc. ○ Updated and maintained by thousands of users ○ Good coverage of popular topics
● How? ○ Classic supervised approach dubious ○ Exploit social aspects (link graph)
● One example: TagMe
(a large) problem space Entities: ● English Wikipedia: 4 million ● YAGO 10 million ● Freebase: 40 million ● Web knowledge graphs: hundreds of millions Facts: ● Billions of tuples: link graphs, infoboxes, Freebase relations etc.
Entity disambiguation in Wikipedia
Basic ingredients: 1. A canonical set of entities (topics, concepts,..) Z, |Z| = K 2. P(zk|w): how likely is entity zk given string w - from anchors, titles (mentions) - from words in articles (content words) 3. Sim(zi,zj), pairwise entity similarity - from the in-link graph, e.g., Google distance
"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."
Segmentation via NER
"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."
P(zk|w)
"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."
+ Sim(zi,zj)
"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."
1. Eliminate low P(zk|w)
"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."
1. Eliminate low P(zk|w) 2. Eliminate weakly connected
"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."
1. Eliminate low P(zk|w) 2. Eliminate weakly connected 3. Keep max P(zk|w)*
"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."
*TagMe: (Ferragina & Scaiella, 2010)
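A rough sketch of this pruning-and-voting scheme in the spirit of TagMe; the data structures (anchor-derived priors, a link-based relatedness function) and the thresholds are illustrative assumptions, not the published algorithm's exact details:

    def disambiguate(mentions, prior, relatedness, min_prior=0.02, min_coherence=0.1):
        """mentions: surface strings; prior[m] = {entity: P(entity | m)};
        relatedness(e1, e2) in [0, 1], e.g. from Wikipedia in-link overlap."""
        # 1. Candidate entities per mention, dropping low-prior candidates.
        candidates = {m: {e: p for e, p in prior.get(m, {}).items() if p >= min_prior}
                      for m in mentions}
        resolved = {}
        for m in mentions:
            best, best_score = None, -1.0
            for e, p in candidates[m].items():
                # 2. Coherence: average relatedness to the other mentions' candidates.
                others = [relatedness(e, e2)
                          for m2 in mentions if m2 != m
                          for e2 in candidates[m2]]
                coherence = sum(others) / len(others) if others else 0.0
                if coherence < min_coherence:
                    continue                      # eliminate weakly connected entities
                score = p * (1.0 + coherence)     # 3. keep the best remaining candidate
                if score > best_score:
                    best, best_score = e, score
            resolved[m] = best                    # None if every candidate was pruned
        return resolved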
Eval
● Evaluation on manually annotated data: CoNLL-Aida dataset
● Metrics:
  ○ micro@1: fraction of entity mentions correctly disambiguated
  ○ macro@1: accuracy averaged over documents
● Some results (CoNLL-Aida test-b):

            Baseline   Aida-best   TagMe*
  macro@1    72.74      81.66       78.21
  micro@1    69.82      82.54       78.64
Applications
● Summarization?
● Choose sentences/snippets containing important entities in the document?
● Need data annotation...
References
- Book: J&M: Ch. 13.5 (Partial parsing), Ch. 22 (Information extraction)
- Entity disambiguation:
  ● Ferragina & Scaiella, "TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities)". CIKM 2010.
  ● Hoffart et al., "Robust Disambiguation of Named Entities in Text". EMNLP 2011.
- Tools for the project: Stanford CoreNLP, NLTK, LingPipe, OpenNLP, TagMe etc.
References - Gao & Johnson, "A Comparison of Bayesian Estimators for unsupervised Hidden Markov Model POS taggers". EMNLP 2008. - Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models" - Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition" - Neal & Hinton, "A view of the EM algorithm that justifies incremental, sparse and other variants"
Summary
● POS review
  ○ Bayesian HMM
● Information Extraction:
  ○ NER and related tasks
● Segmentation & Labeling:
  ○ Models
  ○ Features
● Shallow parsing
● Entity disambiguation
● Hand out: mid-term exam