POS tagging Intro to NLP - ETHZ - 11/03/2013
Summary
● Parts of speech
● Tagsets
● Part of speech tagging
● HMM tagging:
  ○ Most likely tag sequence
  ○ Probability of an observation
  ○ Parameter estimation
  ○ Evaluation
POS ambiguity
"Squad helps dog bite victim"
● bite -> verb?
● bite -> noun?
Parts of Speech (PoS)
Traditional parts of speech:
● Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
● Also called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, ...
Examples
N (noun): car, squad, dog, bite, victim
V (verb): help, bite
ADJ (adjective): purple, tall
ADV (adverb): unfortunately, slowly
P (preposition): of, by, to
PRO (pronoun): I, me, mine
DET (determiner): the, a, that, those
Open and closed classes
1. Closed class: a small, stable set
   a. Auxiliaries: may, can, will, been, ...
   b. Prepositions: of, in, by, ...
   c. Pronouns: I, you, she, mine, his, them, ...
   d. Usually function words: short, with a grammatical role
2. Open class:
   a. New ones are created all the time ("to google/tweet", "e-quaintance", "captcha", "cloud computing", "netbook", "webinar", "widget")
   b. English has 4: nouns, verbs, adjectives, adverbs
   c. Many languages have these 4, but not all!
Open class words
1. Nouns:
   a. Proper nouns: Zurich, IBM, Albert Einstein, The Godfather, ... Capitalized in many languages.
   b. Common nouns: the rest; also capitalized in German; mass/count nouns (goat/goats, snow/*snows)
2. Verbs:
   a. Morphological affixes in English: eat/eats/eaten
3. Adverbs: tend to modify things:
   a. John walked home extremely slowly yesterday
   b. Directional/locative adverbs (here, home, downhill)
   c. Degree adverbs (extremely, very, somewhat)
   d. Manner adverbs (slowly, slinkily, delicately)
4. Adjectives: qualify nouns and noun phrases
Closed class words
1. Prepositions: on, under, over, ...
2. Particles: up, down, on, off, ...
3. Determiners: a, an, the, ...
4. Pronouns: she, who, I, ...
5. Conjunctions: and, but, or, ...
6. Auxiliary verbs: can, may, should, ...
7. Numerals: one, two, three, third, ...
Prepositions with corpus frequencies
Conjunctions
Pronouns
Auxiliaries
Applications
● A useful pre-processing step in many tasks
● Syntactic parsing: an important source of information for syntactic analysis
● Machine translation
● Information retrieval: stemming, filtering
● Named entity recognition
● Summarization?
Applications
Speech synthesis, for correct pronunciation of "ambiguous" words:
○ lead /lid/ (guide) vs. /led/ (chemical)
○ inSULT vs. INsult
○ obJECT vs. OBject
○ overFLOW vs. OVERflow
○ conTENT vs. CONtent
Summarization
● Idea: Filter out sentences starting with certain PoS tags
● Use PoS statistics from gold-standard titles (might need cross-validation)

Summarization
● Idea: Filter out sentences starting with certain PoS tags:
● Title1: "Apple introduced Siri, an intelligent personal assistant to which you can ask questions"
● Title2: "Especially now that a popular new feature from Apple is bound to make other phones users envious: voice control with Siri"
● Title3: "But Siri, Apple's personal assistant application on the iPhone 4s, doesn't
PoS tagging
● The process of assigning a part-of-speech tag (label) to each word in a text.
● Pre-processing: tokenization

Word     Tag
Squad    N
helps    V
dog      N
bite     N
victim   N
Choosing a tagset
1. There are many parts of speech and potential distinctions we could draw
2. For POS tagging, we need to choose a standard set of tags to work with
3. Coarse tagsets: N, V, Adj, Adv
   a. A universal PoS tagset? http://en.wikipedia.org/wiki/Part-of-speech_tagging
4. The more commonly used sets are finer grained, e.g. the "Penn Treebank tagset" with 45 tags
PTB tagset
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN 2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/MD you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Complexities
● Book/VB that/DT flight/NN ./.
● There/EX are/VBP 70/CD children/NNS there/RB ./.
● Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG ./.
● All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN ./.
● Unresolvable ambiguity:
  ○ The Duchess was entertaining last night .
Words      PoS WSJ   PoS Universal
The        DT        DET
oboist     NN        NOUN
Heinz      NNP       NOUN
Holliger   NNP       NOUN
has        VBZ       VERB
taken      VBN       VERB
a          DT        DET
hard       JJ        ADJ
line       NN        NOUN
about      IN        ADP
the        DT        DET
problems   NNS       NOUN
.          .         .
POS Tagging
● Words often have more than one POS: back
  ○ The back door = JJ
  ○ On my back = NN
  ○ Win the voters back = RB
  ○ Promised to back the bill = VB
● The POS tagging problem is to determine the POS tag for a particular instance of a word.
Word type tag ambiguity
Methods
1. Rule-based
   a. Start with a dictionary
   b. Assign all possible tags
   c. Write rules by hand to remove tags in context
2. Stochastic
   a. Supervised / unsupervised
   b. Generative / discriminative
   c. Independent / structured output
   d. HMMs
Rule-based tagging
1. Start with a dictionary:

she        PRP
promised   VBN, VBD
to         TO
back       VB, JJ, RB, NN
the        DT
bill       NN, VB
Rule-based tagging
2. Assign all possible tags:

she        PRP
promised   VBN, VBD
to         TO
back       VB, JJ, RB, NN
the        DT
bill       NN, VB
Rule-based tagging
3. Introduce rules to reduce ambiguity:

Rule: "PRP {VBN, VBD}" -> "PRP VBD"

she        PRP
promised   VBN, VBD -> VBD (the rule applies because "she" is unambiguously PRP)
to         TO
back       VB, JJ, RB, NN
the        DT
bill       NN, VB
Statistical models for POS tagging
1. A classic: one of the first successful applications of statistical methods in NLP
2. Extensively studied with all possible approaches (a benchmark task for sequence models)
3. Simple to get started on: data, evaluation, literature
4. An introduction to more complex segmentation and labelling tasks: NER, shallow parsing, global optimization
5. An introduction to HMMs, used in many variants in POS tagging and related tasks
Supervision and resources
1. Supervised case: data with words manually annotated with POS tags
2. Partially supervised: annotated data + unannotated data
3. Unsupervised: only raw text available
4. Resources: dictionaries listing each word's possible tags
5. We start with the supervised task
HMMs
An HMM is a tuple (Q, O, A, B):
1. States: Q = q_1 .. q_N [the part-of-speech tags]
2. Observation symbols: O = o_1 .. o_V [the words]
3. Transitions:
   a. A = {a_ij}, where a_ij = P(t_s = q_j | t_{s-1} = q_i)
   b. t_s | t_{s-1} = q_i ~ Multi(a_i)
   c. Special vector of initial/final probabilities
4. Emissions:
   a. B = {b_ik}, where b_ik = P(w_s = o_k | t_s = q_i)
   b. w_s | t_s = q_i ~ Multi(b_i)
Markov Chain Interpretation
The tagging process is modelled as a (hidden) Markov process.
Independence assumptions:
1. Limited horizon
2. Time-invariant
3. The observation depends only on the state
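Written out (these are the standard HMM assumptions, in the notation of the HMM definition above; the slide's own formulas are not in the extracted text):

1. Limited horizon: P(t_s | t_{1:s-1}, w_{1:s-1}) = P(t_s | t_{s-1})
2. Time-invariant: P(t_s = q_j | t_{s-1} = q_i) = a_ij for every position s
3. Output independence: P(w_s | t_{1:s}, w_{1:s-1}) = P(w_s | t_s)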
Complete data likelihood
The joint probability of a sequence of words and tags, given a model.

Generative process:
1. generate a tag sequence
2. emit the words for each tag
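In the notation above (a reconstruction of the slide's equation; t_0 denotes the START state):

P(w_{1:N}, t_{1:N} | theta) = prod_{s=1..N} P(t_s | t_{s-1}) * P(w_s | t_s) = prod_{s=1..N} a_{t_{s-1} t_s} * b_{t_s}(w_s)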
Inference in HMMs
Three fundamental problems:
1. Given an observation (e.g., a sentence), find the most likely sequence of states (e.g., POS tags)
2. Given an observation, compute its probability
3. Given a dataset of observations (sequences), estimate the model's parameters theta = (A, B)
HMMs and FSA
also Bayes nets, directed graphical models, etc.
Other applications of HMMs
● NLP:
  - Named entity recognition, shallow parsing
  - Word segmentation
  - Optical character recognition
● Speech recognition
● Computer vision: image segmentation
● Biology: protein structure prediction
● Economics, climatology, robotics, ...
POS as sequence classification
● Observation: a sequence of N words w_{1:N}
● Response: a sequence of N tags t_{1:N}
● Task: find the predicted t'_{1:N} that is the best possible tagging for the sequence (formula below)
Bayes rule reformulation
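Reconstructing the two formulas (the standard argmax objective and its Bayes-rule decomposition):

t'_{1:N} = argmax_{t_{1:N}} P(t_{1:N} | w_{1:N})
         = argmax_{t_{1:N}} P(w_{1:N} | t_{1:N}) * P(t_{1:N}) / P(w_{1:N})
         = argmax_{t_{1:N}} P(w_{1:N} | t_{1:N}) * P(t_{1:N})

since P(w_{1:N}) is the same for every candidate tag sequence.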
HMM POS tagging
● How can we find t'_{1:N}?
● Enumeration of all possible sequences? O(|Tagset|^N) !
● Dynamic programming: the Viterbi algorithm
Viterbi algorithm
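The algorithm itself is not in the extracted text; the standard recursion, in the d_s(t) notation used on the next slides, is:

d_s(1) = a_{START,s} * b_s(w_1)
d_s(t) = max_{s'} [ d_{s'}(t-1) * a_{s',s} ] * b_s(w_t), storing a backpointer to the maximizing s'
best path probability = max_s [ d_s(N) * a_{s,END} ], then backtrack along the stored pointers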
Example: model

A = (transition probabilities; rows = from state, columns = to state)

            N     V     END
  V        0.8   0.2   0.3
  N        0.3   0.7   0.7
  START    0.6   0.4

B = (emission probabilities; rows = tag, columns = word)

          board   backs   plan
  V        0.3     0.3     0.4
  N        0.4     0.2     0.4
Example: observation
● Sentence: "Board backs plan"
● Find the most likely tag sequence
● d_s(t) = probability of the most likely path ending at state s at time t
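A compact Python sketch of Viterbi decoding on exactly this model (the transition and emission numbers are the slide's A and B; variable names are illustrative). It reproduces the d values in the trellis that follows:

```python
# Viterbi decoding for the toy model: tags N and V plus START/END,
# observation "board backs plan".

A = {  # transition probabilities P(next tag | previous tag)
    "START": {"N": 0.6, "V": 0.4},
    "N": {"N": 0.3, "V": 0.7, "END": 0.7},
    "V": {"N": 0.8, "V": 0.2, "END": 0.3},
}
B = {  # emission probabilities P(word | tag)
    "N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
    "V": {"board": 0.3, "backs": 0.3, "plan": 0.4},
}
STATES = ["N", "V"]

def viterbi(words):
    # delta[i][s]: probability of the best path ending in state s after word i
    # psi[i][s]:   predecessor of s on that best path (for backtracking)
    delta = [{s: A["START"][s] * B[s][words[0]] for s in STATES}]
    psi = [{}]
    for i in range(1, len(words)):
        delta.append({})
        psi.append({})
        for s in STATES:
            best_prev = max(STATES, key=lambda p: delta[i - 1][p] * A[p][s])
            delta[i][s] = delta[i - 1][best_prev] * A[best_prev][s] * B[s][words[i]]
            psi[i][s] = best_prev
    # final transition into END, then backtrack
    last = max(STATES, key=lambda s: delta[-1][s] * A[s]["END"])
    best_prob = delta[-1][last] * A[last]["END"]
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return list(reversed(path)), best_prob

tags, prob = viterbi(["board", "backs", "plan"])
print(tags, round(prob, 3))   # ['N', 'V', 'N'] 0.011
```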
Viterbi algorithm: example
The trellis has rows START, N, V, END and one column per word.

Viterbi: forward pass
d_s(t) values computed left to right:

Time       1 (board)   2 (backs)   3 (plan)   END
  d_V        .12         .050        .005
  d_N        .24         .019        .016
  d_END                                        .011

Viterbi: backtrack
Starting from END (d = .011) and following the stored best predecessors backwards gives the most likely path.

Viterbi: output
board/N backs/V plan/N
Observation probability
● Given an HMM theta = (A, B) and an observation sequence w_{1:N}, compute P(w_{1:N} | theta)
● Applications: language modeling
● Complete data likelihood: sum over all possible tag sequences
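Reconstructed formula: the observation probability marginalizes the complete data likelihood over all tag sequences,

P(w_{1:N} | theta) = sum_{t_{1:N}} P(w_{1:N}, t_{1:N} | theta)

a sum with |Tagset|^N terms, which is why dynamic programming is needed.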
Forward algorithm
● Dynamic programming: each state of the trellis stores a value alpha_i(s) = probability of being in state s having observed w_{1:i}
● Sum over all paths up to i-1 leading to s
● Init: alpha_1(s) = a_{START,s} * b_s(w_1)
Forward algorithm
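The recursion on this slide is reconstructed below (the standard forward algorithm, consistent with the init step above):

alpha_i(s) = [ sum_{s'} alpha_{i-1}(s') * a_{s',s} ] * b_s(w_i)
P(w_{1:N} | theta) = sum_s alpha_N(s) * a_{s,END}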
Forward computation
alpha_i(s) values computed left to right on the same trellis:

Time          1 (board)   2 (backs)   3 (plan)   END
  alpha_V       .12         .058        .014
  alpha_N       .24         .034        .022
  alpha_END                                       .020
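The same computation as a Python sketch (the model is the toy A/B from the Viterbi example; it differs from Viterbi only in replacing max with sum):

```python
# Forward algorithm on the toy model: P(observation | model).

A = {"START": {"N": 0.6, "V": 0.4},
     "N": {"N": 0.3, "V": 0.7, "END": 0.7},
     "V": {"N": 0.8, "V": 0.2, "END": 0.3}}
B = {"N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
     "V": {"board": 0.3, "backs": 0.3, "plan": 0.4}}
STATES = ["N", "V"]

def forward(words):
    # alpha[i][s]: total probability of all paths ending in s after word i
    alpha = [{s: A["START"][s] * B[s][words[0]] for s in STATES}]
    for i in range(1, len(words)):
        alpha.append({s: sum(alpha[i - 1][p] * A[p][s] for p in STATES) * B[s][words[i]]
                      for s in STATES})
    return sum(alpha[-1][s] * A[s]["END"] for s in STATES)

print(forward(["board", "backs", "plan"]))   # ~0.020
```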
Parameter estimation
Maximum likelihood estimates (MLE) on the annotated data:
1. Transition probabilities
2. Emission probabilities
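The MLE formulas (reconstructed; C(.) denotes counts in the annotated corpus):

1. Transitions: a_ij = C(t_{s-1} = q_i, t_s = q_j) / C(t_{s-1} = q_i)
2. Emissions:  b_ik = C(t_s = q_i, w_s = o_k) / C(t_s = q_i)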
Implementation details
1. Start/End states
2. Log space / rescaling (see the sketch below)
3. Vocabularies: model pruning
4. Higher-order models:
   a. state representation
   b. estimation and sparsity: deleted interpolation
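A tiny illustration of item 2 (the numbers are made up): products of many probabilities underflow in floating point, so implementations work with sums of log probabilities instead.

```python
import math

probs = [1e-5] * 300                       # 300 small probabilities along one path
print(math.prod(probs))                    # 0.0 -- underflows in double precision
print(sum(math.log(p) for p in probs))     # about -3453.9 -- safe, and argmax is preserved
```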
Evaluation
● Once you have your POS tagger running, how do you evaluate it?
1. Overall error rate with respect to a gold-standard test set
   a. ER = # words incorrectly tagged / # words tagged (toy computation below)
2. Error rates on particular tags (and pairs)
3. Error rates on particular words (especially unknown words)
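A toy computation of the overall error rate from item 1 (the gold and predicted tag lists are invented for illustration):

```python
# Tag-level error rate against a gold standard.
gold = ["PRP", "VBP", "DT", "NN", "IN", "NNP"]
pred = ["PRP", "VBP", "DT", "NN", "IN", "NN"]   # the tagger got the last word wrong

errors = sum(g != p for g, p in zip(gold, pred))
print("error rate:", errors / len(gold))        # 0.1666...
```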
Evaluation
● The result is compared with a manually annotated "Gold Standard"
● Typically accuracy > 97% on the WSJ portion of the PTB
● This may be compared with the result of a baseline tagger (one that uses no context)
  ○ Baselines (most frequent tag) can achieve up to 90% accuracy
● Important: 100% is impossible, even for human annotators
Summary
● Parts of speech
● Tagsets
● Part of speech tagging
● HMM tagging:
  ○ Most likely tag sequence (decoding)
  ○ Probability of an observation (word sequence)
  ○ Parameter estimation (supervised)
  ○ Evaluation
Next class
● Unsupervised POS tagging models (HMMs)
● Parameter estimation: the forward-backward algorithm
● Discriminative sequence models: MaxEnt, CRF, perceptron, SVM, etc.
● Read J&M chapters 5-6
● Pre-process and POS tag the data: report problems & baselines