Beyond tagging: segmentation+labeling tasks Intro to NLP - ETHZ - 24/03/2014

Summary
● POS review
  ○ Bayesian HMM
● Information Extraction:
  ○ NER and related tasks
● Segmentation & Labeling:
  ○ Models
  ○ Features
● Shallow parsing
● Entity disambiguation
● Hand out: mid-term exam

POS ambiguity "Squad helps dog bite victim" ● bite -> verb? ● bite -> noun? "Dealers will hear Car Talk at noon" ● car -> noun & talk -> verb? ● car & talk -> proper names?

PoS tagging
● The process of assigning a part-of-speech tag (label) to each word in a text.

She/PRP promised/VBD to/TO back/VB the/DT bill/NN ./.

Applications
A useful pre-processing step in many tasks:
● Speech synthesis
● Syntactic parsing
● Machine translation
● Information retrieval
● Named entity recognition
● Summarization
  ○ Title: "But Siri, Apple's personal assistant application on the iPhone 4s, doesn't disappoint"

PTB tagset

Sequence classification
● Dependencies between variables: P(NN|DT) >> P(VB|DT) ...
● Independent per-word tagging is suboptimal
● Sequence models: classify whole sequences y1:T

HMMs:
a. Given observations x1:T, predict the best tagging y1:T
   ■ Viterbi algorithm
b. Given model theta and observations x1:T, compute P(x1:T|theta)
   ■ Forward (Backward) algorithm
c. Given a dataset of sequences, estimate theta
   ■ ML estimation (supervised)
   ■ EM (unsupervised)
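To make task (a) concrete, here is a minimal Viterbi sketch for this setup. The dictionary-based representation of the model (trans[(i, j)] and emit[(i, w)]) is an assumption made for compactness, not the lecture's notation.

```python
import math

def viterbi(words, states, trans, emit, start="START", end="END"):
    """Most likely tag sequence under an HMM (a minimal sketch).
    trans[(i, j)] = P(j | i), emit[(i, w)] = P(w | i); plain dicts with
    missing entries treated as probability 0."""
    def lp(p):                      # log with log(0) = -inf
        return math.log(p) if p > 0 else float("-inf")
    V = [{}]                        # V[t][q] = best log-prob of a path ending in q at time t
    back = [{}]                     # back-pointers to recover the argmax path
    for q in states:
        V[0][q] = lp(trans.get((start, q), 0)) + lp(emit.get((q, words[0]), 0))
        back[0][q] = start
    for t in range(1, len(words)):
        V.append({}); back.append({})
        for q in states:
            best_prev = max(states, key=lambda p: V[t-1][p] + lp(trans.get((p, q), 0)))
            V[t][q] = (V[t-1][best_prev] + lp(trans.get((best_prev, q), 0))
                       + lp(emit.get((q, words[t]), 0)))
            back[t][q] = best_prev
    last = max(states, key=lambda q: V[-1][q] + lp(trans.get((q, end), 0)))
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

With the toy V/N tag set and the "board backs plan vote" example used later in these slides, this returns one tag per word.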

HMMs
HMM = (Q, O, A, B)
1. States: Q = q1..qN [the part-of-speech tags]
   a. Including special initial/final states q0 and qF
   b. lambda = (A, B)
2. Observation symbols: O = o1..oV [words]
3. Transitions:
   a. A = {aij}; aij = P(qt = j | qt-1 = i) ~ Multi(ai)
4. Emissions:
   a. B = {bik}; bik = P(ot = vk | qt = i) ~ Multi(bi)

Complete data likelihood
The joint probability of a sequence of words and tags, given a model:

P(x1:T, y1:T | theta) = prod_{t=1..T} P(yt | yt-1) * P(xt | yt)

(with y0 = q0 the START state and a final transition P(qF | yT))

Generative process: 1. generate a tag sequence 2. emit the words for each tag
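As an illustration of this generative story, a minimal ancestral-sampling sketch, using the same assumed dict-based trans/emit representation as the Viterbi sketch above:

```python
import random

def sample_sentence(states, trans, emit, vocab, start="START", end="END"):
    """Sample (tags, words) from the HMM: walk the tag chain, then emit one word per tag."""
    def draw(options, probs):
        return random.choices(options, weights=probs, k=1)[0]
    tags, words = [], []
    prev = start
    while True:
        nxt = draw(states + [end], [trans.get((prev, q), 0) for q in states + [end]])
        if nxt == end:
            break                                   # reached the final state qF
        tags.append(nxt)
        words.append(draw(vocab, [emit.get((nxt, w), 0) for w in vocab]))
        prev = nxt
    return tags, words
```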

Parameter estimation (supervised)
Maximum likelihood estimates (MLE) on data
1. Transition probabilities:

   aij = Count(qt-1 = i, qt = j) / Count(qt-1 = i)

2. Emission probabilities:

   bik = Count(qt = i, ot = vk) / Count(qt = i)
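A minimal sketch of collecting these relative-frequency counts from a tagged corpus; the list-of-(word, tag)-sentences input format is an assumption for illustration:

```python
from collections import Counter

def mle_estimate(tagged_sentences, start="START", end="END"):
    """Supervised MLE for HMM parameters: plain relative-frequency counts."""
    trans_c, emit_c = Counter(), Counter()
    from_c, tag_c = Counter(), Counter()           # denominators
    for sent in tagged_sentences:                   # sent = [(word, tag), ...]
        prev = start
        for word, tag in sent:
            trans_c[(prev, tag)] += 1; from_c[prev] += 1
            emit_c[(tag, word)] += 1;  tag_c[tag] += 1
            prev = tag
        trans_c[(prev, end)] += 1; from_c[prev] += 1
    trans = {(i, j): c / from_c[i] for (i, j), c in trans_c.items()}
    emit = {(i, w): c / tag_c[i] for (i, w), c in emit_c.items()}
    return trans, emit
```

The resulting dicts plug directly into the Viterbi sketch above.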

Parameter estimation (unsupervised)
No tags observed: replace the counts with expected counts under the current model (EM / forward-backward).
1. Transition probabilities:

   aij = E[# transitions i -> j] / E[# transitions out of i]

2. Emission probabilities:

   bik = E[# times in state i emitting vk] / E[# times in state i]

Parameter estimation
1. Expected number of times in state i for observation O:

   gamma_t(i) = P(qt = i | O, lambda) = alpha_t(i) * beta_t(i) / P(O | lambda)

2. Expected number of transitions from i to j for observation O:

   xi_t(i, j) = P(qt = i, qt+1 = j | O, lambda) = alpha_t(i) * aij * bj(ot+1) * beta_t+1(j) / P(O | lambda)

[Trellis figure: states START, V, N, END over the observations "board backs plan vote" at time steps 1-4; example value gamma_2(V) = 0.7752]

Emission re-estimation

bik = (expected number of times in state i emitting word vk) / (expected number of times in state i)

[Trellis figure: same trellis as above; example value xi_2(V, V) = 0.40]

Transition re-estimation

aij = (expected number of transitions from i to j) / (expected number of transitions out of i)

Forward-Backward
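A minimal sketch of one forward-backward pass that produces the gamma and xi quantities above; the dense numpy layout (pi for the START transitions, end for the END transitions) is my assumption for compactness, not the lecture's code.

```python
import numpy as np

def forward_backward(obs, pi, A, B, end):
    """One E-step on a single observation sequence (obs = indices into the vocabulary).
    pi[i]   = P(q1 = i | START)       A[i, j] = P(qt = j | qt-1 = i)
    B[i, k] = P(ot = vk | qt = i)     end[i]  = P(END | qT = i)
    Returns gamma (T x N), xi (T-1 x N x N) and the sequence likelihood P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    beta[T-1] = end
    for t in range(T-2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    likelihood = alpha[T-1] @ end
    gamma = alpha * beta / likelihood                 # expected state occupancy
    xi = np.zeros((T-1, N, N))
    for t in range(T-1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t+1]] * beta[t+1])[None, :] / likelihood
    return gamma, xi, likelihood
```

The M-step then re-estimates aij from xi.sum(0)[i, j] / gamma[:-1].sum(0)[i] and bik from gamma, exactly as in the two re-estimation formulas above, with counts accumulated over all training sequences.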

Example
X = [ "board backs plan vote"
      "vote backs board plan"
      "backs board vote plan"
      "plan vote backs board" ]

A =           V      N      END
    V        0.4    0.6    0.3
    N        0.5    0.5    0.7
    START    0.4    0.6     -

B =          board  backs  plan   vote
    V         0.1    0.4    0.3    0.2
    N         0.4    0.1    0.2    0.3

Example

Optimizing only some parameters also improves LogL: Generalized EM

Bayesian modeling
■ Model uncertainty about the data
■ Prefer models with certain properties; e.g., sparsity
■ Non-parametric models: number of hidden variables unknown a priori

Data likelihood
- Likelihood function:  L(theta) = P(X | theta) = sum_Y P(X, Y | theta)
- Maximum likelihood problem:  theta* = argmax_theta P(X | theta)
- Bayesian modeling:  treat theta as a random variable with prior P(theta); P(theta | X) ∝ P(X | theta) P(theta)

Bayesian inference
1. Model the distribution over theta (inference via sampling, e.g., Gibbs sampling):

   P(theta | X) ∝ P(X | theta) P(theta)

2. Variational Bayes: find (point estimate) parameters lambda that minimize an upper bound on the negative log likelihood, including the priors.

Bayesian HMMs
HMM = (Q, O, A, B)
1. States: Q = q1..qN [the part-of-speech tags]
   a. Including special initial/final states q0 and qF
2. Observation symbols: O = o1..oV [words]
3. Transitions:
   a. A = {aij}; aij = P(qt = j | qt-1 = i) ~ Multi(ai)
   b. ai | alphaA ~ Dir(alphaA)
4. Emissions:
   a. B = {bik}; bik = P(ot = vk | qt = i) ~ Multi(bi)
   b. bi | alphaB ~ Dir(alphaB)
alpha: controls the sparsity of A and B

Parameter estimation - VB
1. Transition probabilities:

   aij ∝ exp( psi(E[nij] + alphaA) ) / exp( psi(sum_j E[nij] + N*alphaA) )

2. Emission probabilities:

   bik ∝ exp( psi(E[nik] + alphaB) ) / exp( psi(sum_k E[nik] + V*alphaB) )

where psi is the digamma function and E[n] are expected counts from forward-backward (cf. Johnson, 2007).

F: the variational bound (free energy) optimized by VB
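A minimal sketch of this VB M-step, assuming expected counts collected with the forward-backward sketch above; the array layout is my assumption:

```python
import numpy as np
from scipy.special import digamma

def vb_m_step(expected_counts, alpha):
    """Mean-field VB update for a bank of Dirichlet-multinomial rows.
    expected_counts[i, k] = E[n_ik] from the E-step; alpha is the Dirichlet
    hyperparameter. Returns the exp(psi(.))-based weights used in place of the
    ML ratios; small alpha pushes rows towards sparsity."""
    c = expected_counts + alpha
    return np.exp(digamma(c)) / np.exp(digamma(c.sum(axis=1, keepdims=True)))

# e.g. A_new = vb_m_step(expected_transition_counts, alpha_A)
#      B_new = vb_m_step(expected_emission_counts, alpha_B)
```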

Example with varying alpha

alpha = 1.0, after FB
X = [ "board backs plan vote"
      "vote backs board plan"
      "backs board vote plan"
      "plan vote backs board" ]

A =           V      N      END
    V        0.34   0.66   0.12
    N        0.13   0.87   0.88
    START    0.51   0.49    -

B =          board  backs  plan   vote
    V        0.25   0.32   0.18   0.24
    N        0.25   0.22   0.28   0.25

alpha = 0.1, after FB
X = [ "board backs plan vote"
      "vote backs board plan"
      "backs board vote plan"
      "plan vote backs board" ]

A =           V      N      END
    V        0.0    1.0    0.0
    N        0.36   0.64   1.0
    START    0.2    0.8     -

B =          board  backs  plan   vote
    V        0.0    1.0    0.0    0.0
    N        0.33   0.0    0.33   0.33

tag-frequencies (Johnson, 2007)

Information Extraction
Goal: Identify structured data in unstructured text:
○ Identify mentions of instances of predefined classes: names of people, locations, organizations, etc.
  ■ Named-entity recognition (NER)
  ■ Co-reference resolution
○ Identify and normalize temporal expressions:
  ■ Normalize temporal expressions
  ■ Associate events with points in time
  ■ Timeline generation
○ Relation detection and classification
○ Event detection

IE: input "Citing high fuel prices, United Airlines said Friday it has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco."

IE: (desired) output
AIRLINE FARE-RAISE:
● LEAD AIRLINE: UA
● AMOUNT: 6$
● EFFECTIVE DATE: 2006-10-26
● FOLLOWER: AA
1. Populate a database
2. Compare with historical stock prices

How? - Structured prediction: e.g., sequence models - Supervised and semi-supervised methods - Challenging problems - Limitations and new perspectives

IE: NER "Citing high fuel prices, United Airlines said Friday it has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco."

IE: co-reference resolution "Citing high fuel prices, United Airlines1 said Friday it1 has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, a unit of AMR Corp.3, immediately matched the move, spokesman Tim Wagner4 said. United1, a unit of UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."

IE: temporal expressions "Citing high fuel prices, United Airlines1 said [Friday] it1 has increased fares by 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, a unit of AMR Corp.3, immediately matched the move, spokesman Tim Wagner4 said. United1, a unit of UAL Corp.5, said the increase took effect [Thursday] and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."

IE: relation extraction "Citing high fuel prices, United Airlines1 said Friday it1 [has increased fares by] 6$ per round trip on flights to some cities also served by lower-cost carriers. American Airlines2, [a unit of] AMR Corp.3, [immediately matched the move], [spokesman] Tim Wagner4 said. United1, [a unit of] UAL Corp.5, said the increase took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."

IE: event extraction "Citing high fuel prices, United Airlines1 said Friday [it1 [has increased fares by] 6$ per round trip on flights to some cities also served by lower-cost carriers.]1 American Airlines2, [a unit of] AMR Corp.3, [immediately matched [the move]1], [spokesman] Tim Wagner4 said. United1, [a unit of] UAL Corp.5, said [the increase]1 took effect Thursday and applies to most routes where it1 competes against discount carriers, such as Chicago6 to Dallas7 and Denver8 to San Francisco9."

Sequence classifiers Choose your favorite sequence classifier: HMM, CRF, MEMM, SVM, Perceptron, ....

We abstract from the specifics going forward

Beyond POS tagging
● Sequence labeling tasks: assign one (POS) label to each word
  ○ Cats/NNS love/VBP fish/NN ./.
● Given: labeled data, sequence classifier
● Sequence segmentation + labeling:
  ○ Information extraction, biomedical IE, shallow (partial) parsing
  ○ Task: labels can span several words
    "U.N. Security Council tries to end Syria conflict"

Segmentation + labeling "U.N. Security Council tries to end Syria conflict"

"U.N.ORG SecurityORG CouncilORG tries to end SyriaLOC conflict"

How many organizations?

Segmentation + labeling "U.N. Security Council tries to end Syria conflict"

"[U.N. Security Council]ORG tries to end [Syria]LOC conflict"

One organization...

Segmentation + Labeling 1. Identify the boundaries of a segment: a. "[U.N. Security Council] tries to end Syria conflict" 2. Label the segment: a. "[U.N. Security Council]ORG tries to end Syria conflict"

3. How?
   a. Reduce the problem to simple labeling
   b. BIO encoding: split the labels into
      i. B-X: beginning of label X
      ii. I-X: inside a label X
      iii. O: outside a label (NULL label)

BIO encoding
1. Encode the data in BIO format:
   a. [U.N. Security Council]ORG tries to end [Syria]LOC conflict.
   b. U.N./B-ORG Security/I-ORG Council/I-ORG tries/O to/O end/O Syria/B-LOC conflict/O
   c. No termination ambiguity: B-X -> I-X continues the same entity, otherwise a new entity starts

2. Train and use your favorite sequence classifier as usual
   a. No guarantee that the output will be consistent
   b. Set B-X -> I-Y transitions to 0 if necessary (for X != Y)
3. Evaluate appropriately: P/R/F1
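A minimal sketch of turning a BIO-tagged sequence back into labeled spans, which is what both the consistency check in step 2b and the span-level evaluation in step 3 operate on; the (label, start, end) tuple format is an assumption:

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (label, start, end) spans, end exclusive.
    An I-X that does not continue a span of label X is treated as starting a new one."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O":
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans

# bio_to_spans("U.N. Security Council tries to end Syria conflict".split(),
#              ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "B-LOC", "O"])
# -> [("ORG", 0, 3), ("LOC", 6, 7)]
```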

Features
● Words:
  a. Target token identity, stem; w=Corp., sw=corp.
  b. Same for words within a window around the target; w-1=AMR, w+1=,
● PoS tags:
  a. Target token and tokens within window; pos=NNP, pos+1=,
● Shape:
  a. Regexp-like features; s=Xxp, s+1=p
● Prefixes/Suffixes:
  a. of target and neighbors; suf=., suf=p., pre=c, pre=co, ...
● Gazetteers: labeled lists of phrases (locations, cities, first names etc.):
  a. gX = target is in list X; example, Tim -> gFirstName=true
  b. Crucial features out-of-domain
  c. Features on segments? Semi-Markov models
● Label features:
  a. label-1=B-ORG, ...
  b. label+1=O (bidirectional models)
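A minimal sketch of this kind of feature extraction for one target position; the feature names and the word_shape helper are illustrative assumptions, not a fixed standard:

```python
import re

def word_shape(w):
    """Collapse characters into a coarse shape string, e.g. 'Corp.' -> 'Xxp', 'AMR' -> 'X'."""
    s = re.sub(r"[A-Z]+", "X", w)
    s = re.sub(r"[a-z]+", "x", s)
    s = re.sub(r"[0-9]+", "d", s)
    return re.sub(r"[^Xxd]+", "p", s)

def token_features(tokens, pos_tags, i, gazetteers=None):
    """Features for position i: identity/shape/affixes of the target and its neighbours,
    plus gazetteer membership. gazetteers = {name: set_of_phrases}."""
    w = tokens[i]
    feats = {"w=" + w: 1, "sw=" + w.lower(): 1, "s=" + word_shape(w): 1,
             "pos=" + pos_tags[i]: 1,
             "pre=" + w[:2].lower(): 1, "suf=" + w[-2:].lower(): 1}
    for d in (-1, +1):                               # a +/-1 window around the target
        if 0 <= i + d < len(tokens):
            feats[f"w{d:+d}=" + tokens[i + d]] = 1
            feats[f"pos{d:+d}=" + pos_tags[i + d]] = 1
            feats[f"s{d:+d}=" + word_shape(tokens[i + d])] = 1
    for name, phrases in (gazetteers or {}).items():
        if w in phrases:
            feats["g" + name + "=true"] = 1          # e.g. Tim -> gFirstName=true
    return feats
```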

NER Identifying mentions (spans of text) referring to instances of a pre-defined set of classes

Entity ambiguity

NOTE: The problem of the actual "identity" of an entity is not addressed in classic IE, unless approximated by coreference.

NER as sequence classification

Evaluation ● Precision/Recall/F1: ○ micro/macro: mention/type level

● Correct prediction: ○ both entity boundaries are correctly identified: [New York]ORG State vs. [New York State]ORG ○ the label is correct: [New York State]LOC vs [New York State]ORG

● F1 > 90% in the supervised, in-domain case!
  ○ Significant degradation out of domain
  ○ Domain adaptation, semi-supervised learning, etc.
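A minimal sketch of span-level P/R/F1 as described above (a prediction counts only if both boundaries and the label match); the set-of-(label, start, end) span format follows the BIO decoder sketched earlier:

```python
def span_prf1(gold_spans, pred_spans):
    """Exact-match precision/recall/F1 over (label, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```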

Relation extraction
Relation: a set of ordered tuples over a domain
● E.g., President-of = {(Obama, US), (Sarkozy, France), ...}
1. Identify named entities
2. For each pair (ei, ej):
   a. Predict-Relation(ei, ej) [possibly NULL]

Task-specific features: 1. Text between entities: BOW 2. Syntactic structure
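A minimal sketch of the pair enumeration and the bag-of-words-between feature from point 1; the classifier itself is left abstract, and entity spans use the assumed (label, start, end) format from earlier:

```python
from itertools import combinations

def relation_candidates(tokens, entity_spans):
    """For each pair of entity mentions, build bag-of-words features from the text
    between them; a classifier then predicts the relation label (NULL = no relation)."""
    ordered = sorted(entity_spans, key=lambda x: x[1])
    for (lab1, s1, e1), (lab2, s2, e2) in combinations(ordered, 2):
        between = tokens[e1:s2]                      # tokens between the two mentions
        feats = {"bow=" + w.lower(): 1 for w in between}
        feats["type-pair=" + lab1 + "-" + lab2] = 1  # e.g. ORG-PER
        feats["dist=" + str(len(between))] = 1
        yield (lab1, s1, e1), (lab2, s2, e2), feats

# Each yielded pair goes to your favourite classifier: Predict-Relation(ei, ej) -> label or NULL.
```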

Relation Extraction

Appositive constructions: part-of, is-a etc.

Distant supervision for RE
Supervised approaches are limited.
Idea: exploit existing repositories such as Wikipedia (infoboxes) and Freebase (relations) to bootstrap models for RE:
1. Start with a seed set of pairs
2. Repeat:
   a. Build extraction patterns from the data
   b. Extract new pairs

Issues: semantic drift, evaluation
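A minimal sketch of this bootstrapping loop under strong simplifying assumptions: here a "pattern" is just the literal token sequence between a known pair, whereas real systems generalize and score patterns precisely to limit the semantic drift mentioned above.

```python
def bootstrap_relation(corpus_sentences, seed_pairs, rounds=3):
    """corpus_sentences: list of token lists; seed_pairs: set of (arg1, arg2) tuples.
    Alternates between harvesting patterns from known pairs and new pairs from known patterns."""
    pairs, patterns = set(seed_pairs), set()
    for _ in range(rounds):
        # a. build extraction patterns from sentences mentioning a known pair
        for sent in corpus_sentences:
            for a, b in pairs:
                if a in sent and b in sent and sent.index(a) + 1 < sent.index(b):
                    patterns.add(tuple(sent[sent.index(a) + 1:sent.index(b)]))
        # b. extract new pairs that match a known pattern
        for sent in corpus_sentences:
            for pat in patterns:
                n = len(pat)
                for i in range(1, len(sent) - n):
                    if tuple(sent[i:i + n]) == pat:
                        pairs.add((sent[i - 1], sent[i + n]))
    return pairs, patterns
```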

Temporal expressions

● More regular patterns than in other tasks ● Challenges: ○ Coverage ○ False positives ● Normalization: map to interval or point in time

Partial parsing
Goal: Identify the basic (non-recursive) N/V/A/P phrases of a sentence (chunking):
● flat / non-overlapping
● a segmentation+labeling task
● Base phrases do not contain constituents of the same type
● Include the head of the phrase
● Exclude post-head modifiers
"[The morning flight]NP from [Denver]NP [has arrived]VP"

BIO encoding
American Airlines , a unit of AMR Inc.

[American Airlines]NP , [a unit]NP of [AMR Inc.]NP

American/B-NP Airlines/I-NP ,/O a/B-NP unit/I-NP of/O AMR/B-NP Inc./I-NP

From Trees to Chunks

[American Airlines]NP , [a unit]NP of [AMR Inc.]NP

Partial parsing

PP as sequence classification

Limitations of NER 1. Washington was born at Haywood Farms near Oak Grove, Virginia. 2. Washington attended Yale Law School for a year and then studied at the University of Oxford. 3. Washington served as army chief of staff under William Tubman. 4. Historians laud Washington for his selection and supervision of his generals. 5. Washington was the founder of the town of Centralia, Washington.

Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. 3. WashingtonPER served as army chief of staff under William Tubman. 4. Historians laud WashingtonPER for his selection and supervision of his generals. 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC. 5 different people!

Limitations of NER 1. WashingtonPER was born at Haywood Farms near Oak Grove, Virginia. [George] 2. WashingtonPER attended Yale Law School for a year and then studied at the University of Oxford. [George] 3. WashingtonPER served as army chief of staff under William Tubman. [George] 4. Historians laud WashingtonPER for his selection and supervision of his generals. [George] 5. WashingtonPER was the founder of the town of Centralia, WashingtonLOC. [George] With the same first name!

1. Ambiguity: a word (phrase) can mean different things

"President Bush" ... "President Bush"

hypothesis: "George Bush father" vs. "George Bush son"

ambiguity resolved by proxy strings

"Mercury" ... "Mercury" ?

side effects on text similarity

1. "Astronomers took pictures of the star" 2. "Paparazzis took pictures of the star" ● Unrelated but close in string space

1. Ambiguity: a word (phrase) can mean different things

2. Synonymy: the same thing can be said in different ways

[Diagram: surface strings linked to underlying topics; edges marked "synonyms" and "ambiguity"]

side effects on text similarity

1. "U.S. President issues Libya ultimatum" 2. "Barack Obama says Gaddafi may wait out military offensive"

● Closely related but unrelated in string space

text similarity in entity space "Astronomers took pictures of the star"

"Paparazzis took pictures of the star"

"U.S. President issues Libya ultimatum"

"Barack Obama says Gaddafi may wait out military offensive"

Entity disambiguation ● Segmentation and labeling problem ○ Output = Wikipedia, Freebase, YAGO, etc. ○ Updated and maintained by thousands of users ○ Good coverage of popular topics

● How? ○ Classic supervised approach dubious ○ Exploit social aspects (link graph)

● One example: TagMe

(a large) problem space Entities: ● English Wikipedia: 4 million ● YAGO 10 million ● Freebase: 40 million ● Web knowledge graphs: hundreds of millions Facts: ● Billions of tuples: link graphs, infoboxes, Freebase relations etc.

Entity disambiguation in Wikipedia

Basic ingredients: 1. A canonical set of entities (topics, concepts,..) Z, |Z| = K 2. P(zk|w): how likely is entity zk given string w - from anchors, titles (mentions) - from words in articles (content words) 3. Sim(zi,zj), pairwise entity similarity - from the in-link graph, e.g., Google distance
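A minimal sketch of ingredient 3: a link-based relatedness in the spirit of the normalized Google distance over Wikipedia in-links (the flavour used by TagMe-style systems); the in-link sets are assumed to be given.

```python
import math

def link_relatedness(inlinks_a, inlinks_b, total_pages):
    """Milne-Witten style relatedness from shared in-links: 1 - normalized Google distance.
    inlinks_a / inlinks_b: sets of pages linking to entities a and b."""
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    big, small = max(len(inlinks_a), len(inlinks_b)), min(len(inlinks_a), len(inlinks_b))
    ngd = (math.log(big) - math.log(len(common))) / (math.log(total_pages) - math.log(small))
    return max(0.0, 1.0 - ngd)
```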

"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."

Segmentation via NER

"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."

P(zk|w)

"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."

+ Sim(zi,zj)

"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."

1. Eliminate low P(zk|w)

"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."

1. Eliminate low P(zk|w) 2. Eliminate weakly connected

"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."

1. Eliminate low P(zk|w) 2. Eliminate weakly connected 3. Keep max P(zk|w)*

"Scoreboard of the second one-day cricket match England - Pakistan ... Croft bowled Younis .. Total for 8 wickets .. Pakistan were 318 for two at lunch ..."

*TagMe: (Ferragina & Scaiella, 2010)
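A minimal sketch of this filter-then-pick scheme. It is only loosely inspired by TagMe, not its actual algorithm: the candidates dictionary mapping each mention to {entity: P(z|w)}, the pairwise sim() function (e.g. the link relatedness above) and the two thresholds are all assumptions.

```python
def disambiguate(candidates, sim, min_prior=0.01, min_coherence=0.1):
    """candidates: {mention: {entity: P(entity | mention)}}.
    1) drop low-prior candidates, 2) drop candidates weakly related to the other
    mentions' candidates, 3) keep the highest-prior survivor per mention."""
    # 1. eliminate low P(z|w)
    cands = {m: {z: p for z, p in zs.items() if p >= min_prior}
             for m, zs in candidates.items()}
    # 2. eliminate weakly connected candidates (average relatedness to other mentions)
    def coherence(mention, z):
        scores = [max((sim(z, z2) for z2 in zs), default=0.0)
                  for m2, zs in cands.items() if m2 != mention and zs]
        return sum(scores) / len(scores) if scores else 0.0
    cands = {m: {z: p for z, p in zs.items() if coherence(m, z) >= min_coherence}
             for m, zs in cands.items()}
    # 3. keep the max P(z|w) among the survivors
    return {m: max(zs, key=zs.get) for m, zs in cands.items() if zs}
```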

Eval
● Evaluation on manually annotated data: CoNLL-Aida dataset
● Metrics:
  ○ micro@1: fraction of entity mentions correctly disambiguated
  ○ macro@1: accuracy averaged over documents
● Some results (CoNLL-Aida test-b):

              Baseline   Aida-best   TagMe*
  macro@1      72.74       81.66      78.21
  micro@1      69.82       82.54      78.64
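A minimal sketch of these two metrics, assuming gold and predicted entity assignments per document are given as dicts from mention to entity:

```python
def micro_macro_at_1(gold_docs, pred_docs):
    """gold_docs / pred_docs: lists of {mention_id: entity} dicts, one per document.
    micro@1: fraction of all mentions disambiguated correctly.
    macro@1: per-document accuracy, averaged over documents."""
    correct = total = 0
    doc_accs = []
    for gold, pred in zip(gold_docs, pred_docs):
        hits = sum(1 for m, z in gold.items() if pred.get(m) == z)
        correct += hits
        total += len(gold)
        doc_accs.append(hits / len(gold) if gold else 0.0)
    micro = correct / total if total else 0.0
    macro = sum(doc_accs) / len(doc_accs) if doc_accs else 0.0
    return micro, macro
```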

Applications ● Summarization? ● Choose sentences/snippets containing important entities in the document? ● Need data annotation...

References
- Book: J&M: Ch. 13.5 (Partial parsing), Ch. 22 (Information extraction)
- Entity disambiguation:
  ● TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities). Ferragina & Scaiella, CIKM 2010.
  ● Robust Disambiguation of Named Entities in Text. Hoffart et al., EMNLP 2011.

- Tools for the project: Stanford CoreNLP, NLTK, LingPipe, OpenNLP, TagMe etc.

References
- Gao & Johnson, "A Comparison of Bayesian Estimators for Unsupervised Hidden Markov Model POS Taggers". EMNLP 2008.
- Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models".
- Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition".
- Neal & Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse and Other Variants".

Summary
● POS review
  ○ Bayesian HMM
● Information Extraction:
  ○ NER and related tasks
● Segmentation & Labeling:
  ○ Models
  ○ Features
● Shallow parsing
● Entity disambiguation
● Hand out: mid-term exam