POS tagging Intro to NLP - ETHZ - 11/03/2013
Summary
● Parts of speech
● Tagsets
● Part of speech tagging
● HMM tagging:
  ○ Most likely tag sequence
  ○ Probability of an observation
  ○ Parameter estimation
  ○ Evaluation
POS ambiguity
"Squad helps dog bite victim"
● bite -> verb?
● bite -> noun?
Parts of Speech (PoS)
Traditional parts of speech:
● Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
● Also called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, ...
Examples
N (noun): car, squad, dog, bite, victim
V (verb): help, bite
ADJ (adjective): purple, tall
ADV (adverb): unfortunately, slowly
P (preposition): of, by, to
PRO (pronoun): I, me, mine
DET (determiner): the, a, that, those
Open and closed classes
1. Closed class: a small, stable set
   a. Auxiliaries: may, can, will, been, ...
   b. Prepositions: of, in, by, ...
   c. Pronouns: I, you, she, mine, his, them, ...
   d. Usually function words: short, with a grammatical role
2. Open class:
   a. New ones are created all the time ("to google/tweet", "e-quaintance", "captcha", "cloud computing", "netbook", "webinar", "widget")
   b. English has 4: nouns, verbs, adjectives, adverbs
   c. Many languages have these 4, but not all!
Open class words
1. Nouns:
   a. Proper nouns: Zurich, IBM, Albert Einstein, The Godfather, ... Capitalized in many languages.
   b. Common nouns: the rest; also capitalized in German; mass/count nouns (goat/goats, snow/*snows)
2. Verbs:
   a. Morphological affixes in English: eat/eats/eaten
3. Adverbs: tend to modify things:
   a. John walked home extremely slowly yesterday
   b. Directional/locative adverbs (here, home, downhill)
   c. Degree adverbs (extremely, very, somewhat)
   d. Manner adverbs (slowly, slinkily, delicately)
4. Adjectives: qualify nouns and noun phrases
Closed class words
1. Prepositions: on, under, over, ...
2. Particles: up, down, on, off, ...
3. Determiners: a, an, the, ...
4. Pronouns: she, who, I, ...
5. Conjunctions: and, but, or, ...
6. Auxiliary verbs: can, may, should, ...
7. Numerals: one, two, three, third, ...
Prepositions with corpus frequencies
Conjunctions
Pronouns
Auxiliaries
Applications
● A useful pre-processing step in many tasks
● Syntactic parsing: an important source of information for syntactic analysis
● Machine translation
● Information retrieval: stemming, filtering
● Named entity recognition
● Summarization?
Applications
Speech synthesis, for correct pronunciation of "ambiguous" words:
○ lead /lid/ (guide) vs. /led/ (chemical)
○ inSULT vs. INsult
○ obJECT vs. OBject
○ overFLOW vs. OVERflow
○ conTENT vs. CONtent
Summarization
● Idea: Filter out sentences starting with certain PoS tags
● Use PoS statistics from gold-standard titles (might need cross-validation)

Summarization
● Idea: Filter out sentences starting with certain PoS tags:
● Title1: "Apple introduced Siri, an intelligent personal assistant to which you can ask questions"
● Title2: "Especially now that a popular new feature from Apple is bound to make other phones users envious: voice control with Siri"
● Title3: "But Siri, Apple's personal assistant application on the iPhone 4s, doesn't
PoS tagging
● The process of assigning a part-of-speech tag (label) to each word in a text.
● Pre-processing: tokenization

Word     Tag
Squad    N
helps    V
dog      N
bite     N
victim   N
Choosing a tagset
1. There are many parts of speech and potential distinctions we could draw
2. For POS tagging, we need to choose a standard set of tags to work with
3. Coarse tagsets: N, V, Adj, Adv
   a. A universal PoS tagset? http://en.wikipedia.org/wiki/Part-of-speech_tagging
4. The more commonly used sets are finer grained, e.g. the "Penn Treebank tagset" with 45 tags
PTB tagset
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN 2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/MD you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Complexities
● Book/VB that/DT flight/NN ./.
● There/EX are/VBP 70/CD children/NNS there/RB ./.
● Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG ./.
● All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN ./.
● Unresolvable ambiguity:
  ○ The Duchess was entertaining last night .
Words      PoS WSJ   PoS Universal
The        DT        DET
oboist     NN        NOUN
Heinz      NNP       NOUN
Holliger   NNP       NOUN
has        VBZ       VERB
taken      VBN       VERB
a          DT        DET
hard       JJ        ADJ
line       NN        NOUN
about      IN        ADP
the        DT        DET
problems   NNS       NOUN
.          .         .
POS Tagging
● Words often have more than one POS: back
  ○ The back door = JJ
  ○ On my back = NN
  ○ Win the voters back = RB
  ○ Promised to back the bill = VB
● The POS tagging problem is to determine the POS tag for a particular instance of a word.
Word type tag ambiguity
Methods
1. Rule-based
   a. Start with a dictionary
   b. Assign all possible tags
   c. Write rules by hand to remove tags in context
2. Stochastic
   a. Supervised / unsupervised
   b. Generative / discriminative
   c. Independent / structured output
   d. HMMs
Rule-based tagging
1. Start with a dictionary:

she        PRP
promised   VBN, VBD
to         TO
back       VB, JJ, RB, NN
the        DT
bill       NN, VB
Rule-based tagging
2. Assign all possible tags:

she        PRP
promised   VBN, VBD
to         TO
back       VB, JJ, RB, NN
the        DT
bill       NN, VB
Rule-based tagging
3. Introduce rules to reduce ambiguity:

Rule: "PRP {VBN, VBD}" -> "PRP VBD"

she        PRP
promised   VBN, VBD -> VBD (the rule applies because "she" is unambiguously PRP)
to         TO
back       VB, JJ, RB, NN
the        DT
bill       NN, VB
Statistical models for POS tagging
1. A classic: one of the first successful applications of statistical methods in NLP
2. Extensively studied with all possible approaches (a benchmark task for sequence models)
3. Simple to get started on: data, evaluation, literature
4. An introduction to more complex segmentation and labelling tasks: NER, shallow parsing, global optimization
5. An introduction to HMMs, used in many variants in POS tagging and related tasks
Supervision and resources
1. Supervised case: data with words manually annotated with POS tags
2. Partially supervised: annotated data + unannotated data
3. Unsupervised: only raw text available
4. Resources: dictionaries listing each word's possible tags
5. We start with the supervised task
HMMs
An HMM is a tuple (Q, O, A, B):
1. States: Q = q_1 .. q_N [the part-of-speech tags]
2. Observation symbols: O = o_1 .. o_V [the words]
3. Transitions:
   a. A = {a_ij}, where a_ij = P(t_s = q_j | t_{s-1} = q_i)
   b. t_s | t_{s-1} = q_i ~ Multi(a_i)
   c. Special vector of initial/final probabilities
4. Emissions:
   a. B = {b_ik}, where b_ik = P(w_s = o_k | t_s = q_i)
   b. w_s | t_s = q_i ~ Multi(b_i)
Markov Chain Interpretation
The tagging process is modelled as a (hidden) Markov process.
Independence assumptions:
1. Limited horizon
2. Time-invariant
3. The observation depends only on the state
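Written out (these are the standard HMM assumptions, in the notation of the HMM definition above; the slide's own formulas are not in the extracted text):

1. Limited horizon: P(t_s | t_{1:s-1}, w_{1:s-1}) = P(t_s | t_{s-1})
2. Time-invariant: P(t_s = q_j | t_{s-1} = q_i) = a_ij for every position s
3. Output independence: P(w_s | t_{1:s}, w_{1:s-1}) = P(w_s | t_s)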
Complete data likelihood
The joint probability of a sequence of words and tags, given a model.

Generative process:
1. generate a tag sequence
2. emit the words for each tag
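In the notation above (a reconstruction of the slide's equation; t_0 denotes the START state):

P(w_{1:N}, t_{1:N} | theta) = prod_{s=1..N} P(t_s | t_{s-1}) * P(w_s | t_s) = prod_{s=1..N} a_{t_{s-1} t_s} * b_{t_s}(w_s)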
Inference in HMMs
Three fundamental problems:
1. Given an observation (e.g., a sentence), find the most likely sequence of states (e.g., POS tags)
2. Given an observation, compute its probability
3. Given a dataset of observations (sequences), estimate the model's parameters theta = (A, B)
HMMs and FSA
also Bayes nets, directed graphical models, etc.
Other applications of HMMs
● NLP:
  - Named entity recognition, shallow parsing
  - Word segmentation
  - Optical character recognition
● Speech recognition
● Computer vision: image segmentation
● Biology: protein structure prediction
● Economics, climatology, robotics, ...
POS as sequence classification
● Observation: a sequence of N words w_{1:N}
● Response: a sequence of N tags t_{1:N}
● Task: find the predicted t'_{1:N} that is the best possible tagging for the sequence (formula below)
Bayes rule reformulation
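Reconstructing the two formulas (the standard argmax objective and its Bayes-rule decomposition):

t'_{1:N} = argmax_{t_{1:N}} P(t_{1:N} | w_{1:N})
         = argmax_{t_{1:N}} P(w_{1:N} | t_{1:N}) * P(t_{1:N}) / P(w_{1:N})
         = argmax_{t_{1:N}} P(w_{1:N} | t_{1:N}) * P(t_{1:N})

since P(w_{1:N}) is the same for every candidate tag sequence.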
HMM POS tagging
● How can we find t'_{1:N}?
● Enumeration of all possible sequences? O(|Tagset|^N) !
● Dynamic programming: the Viterbi algorithm
Viterbi algorithm
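The algorithm itself is not in the extracted text; the standard recursion, in the d_s(t) notation used on the next slides, is:

d_s(1) = a_{START,s} * b_s(w_1)
d_s(t) = max_{s'} [ d_{s'}(t-1) * a_{s',s} ] * b_s(w_t), storing a backpointer to the maximizing s'
best path probability = max_s [ d_s(N) * a_{s,END} ], then backtrack along the stored pointers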
Example: model

A = (transition probabilities; rows = from state, columns = to state)

            N     V     END
  V        0.8   0.2   0.3
  N        0.3   0.7   0.7
  START    0.6   0.4

B = (emission probabilities; rows = tag, columns = word)

          board   backs   plan
  V        0.3     0.3     0.4
  N        0.4     0.2     0.4
Example: observation
● Sentence: "Board backs plan"
● Find the most likely tag sequence
● d_s(t) = probability of the most likely path ending at state s at time t
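A compact Python sketch of Viterbi decoding on exactly this model (the transition and emission numbers are the slide's A and B; variable names are illustrative). It reproduces the d values in the trellis that follows:

```python
# Viterbi decoding for the toy model: tags N and V plus START/END,
# observation "board backs plan".

A = {  # transition probabilities P(next tag | previous tag)
    "START": {"N": 0.6, "V": 0.4},
    "N": {"N": 0.3, "V": 0.7, "END": 0.7},
    "V": {"N": 0.8, "V": 0.2, "END": 0.3},
}
B = {  # emission probabilities P(word | tag)
    "N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
    "V": {"board": 0.3, "backs": 0.3, "plan": 0.4},
}
STATES = ["N", "V"]

def viterbi(words):
    # delta[i][s]: probability of the best path ending in state s after word i
    # psi[i][s]:   predecessor of s on that best path (for backtracking)
    delta = [{s: A["START"][s] * B[s][words[0]] for s in STATES}]
    psi = [{}]
    for i in range(1, len(words)):
        delta.append({})
        psi.append({})
        for s in STATES:
            best_prev = max(STATES, key=lambda p: delta[i - 1][p] * A[p][s])
            delta[i][s] = delta[i - 1][best_prev] * A[best_prev][s] * B[s][words[i]]
            psi[i][s] = best_prev
    # final transition into END, then backtrack
    last = max(STATES, key=lambda s: delta[-1][s] * A[s]["END"])
    best_prob = delta[-1][last] * A[last]["END"]
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return list(reversed(path)), best_prob

tags, prob = viterbi(["board", "backs", "plan"])
print(tags, round(prob, 3))   # ['N', 'V', 'N'] 0.011
```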
Viterbi algorithm: example
The trellis has rows START, N, V, END and one column per word.

Viterbi: forward pass
d_s(t) values computed left to right:

Time       1 (board)   2 (backs)   3 (plan)   END
  d_V        .12         .050        .005
  d_N        .24         .019        .016
  d_END                                        .011

Viterbi: backtrack
Starting from END (d = .011) and following the stored best predecessors backwards gives the most likely path.

Viterbi: output
board/N backs/V plan/N
Observation probability
● Given an HMM theta = (A, B) and an observation sequence w_{1:N}, compute P(w_{1:N} | theta)
● Applications: language modeling
● Complete data likelihood: sum over all possible tag sequences
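Reconstructed formula: the observation probability marginalizes the complete data likelihood over all tag sequences,

P(w_{1:N} | theta) = sum_{t_{1:N}} P(w_{1:N}, t_{1:N} | theta)

a sum with |Tagset|^N terms, which is why dynamic programming is needed.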
Forward algorithm
● Dynamic programming: each state of the trellis stores a value alpha_i(s) = probability of being in state s having observed w_{1:i}
● Sum over all paths up to i-1 leading to s
● Init: alpha_1(s) = a_{START,s} * b_s(w_1)
Forward algorithm
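The recursion on this slide is reconstructed below (the standard forward algorithm, consistent with the init step above):

alpha_i(s) = [ sum_{s'} alpha_{i-1}(s') * a_{s',s} ] * b_s(w_i)
P(w_{1:N} | theta) = sum_s alpha_N(s) * a_{s,END}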
Forward computation
alpha_i(s) values computed left to right on the same trellis:

Time          1 (board)   2 (backs)   3 (plan)   END
  alpha_V       .12         .058        .014
  alpha_N       .24         .034        .022
  alpha_END                                       .020
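The same computation as a Python sketch (the model is the toy A/B from the Viterbi example; it differs from Viterbi only in replacing max with sum):

```python
# Forward algorithm on the toy model: P(observation | model).

A = {"START": {"N": 0.6, "V": 0.4},
     "N": {"N": 0.3, "V": 0.7, "END": 0.7},
     "V": {"N": 0.8, "V": 0.2, "END": 0.3}}
B = {"N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
     "V": {"board": 0.3, "backs": 0.3, "plan": 0.4}}
STATES = ["N", "V"]

def forward(words):
    # alpha[i][s]: total probability of all paths ending in s after word i
    alpha = [{s: A["START"][s] * B[s][words[0]] for s in STATES}]
    for i in range(1, len(words)):
        alpha.append({s: sum(alpha[i - 1][p] * A[p][s] for p in STATES) * B[s][words[i]]
                      for s in STATES})
    return sum(alpha[-1][s] * A[s]["END"] for s in STATES)

print(forward(["board", "backs", "plan"]))   # ~0.020
```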
Parameter estimation
Maximum likelihood estimates (MLE) on the annotated data:
1. Transition probabilities
2. Emission probabilities
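The MLE formulas (reconstructed; C(.) denotes counts in the annotated corpus):

1. Transitions: a_ij = C(t_{s-1} = q_i, t_s = q_j) / C(t_{s-1} = q_i)
2. Emissions:  b_ik = C(t_s = q_i, w_s = o_k) / C(t_s = q_i)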
Implementation details
1. Start/End states
2. Log space / rescaling (see the sketch below)
3. Vocabularies: model pruning
4. Higher-order models:
   a. state representation
   b. estimation and sparsity: deleted interpolation
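A tiny illustration of item 2 (the numbers are made up): products of many probabilities underflow in floating point, so implementations work with sums of log probabilities instead.

```python
import math

probs = [1e-5] * 300                       # 300 small probabilities along one path
print(math.prod(probs))                    # 0.0 -- underflows in double precision
print(sum(math.log(p) for p in probs))     # about -3453.9 -- safe, and argmax is preserved
```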
Evaluation
● Once you have your POS tagger running, how do you evaluate it?
1. Overall error rate with respect to a gold-standard test set
   a. ER = # words incorrectly tagged / # words tagged (toy computation below)
2. Error rates on particular tags (and pairs)
3. Error rates on particular words (especially unknown words)
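A toy computation of the overall error rate from item 1 (the gold and predicted tag lists are invented for illustration):

```python
# Tag-level error rate against a gold standard.
gold = ["PRP", "VBP", "DT", "NN", "IN", "NNP"]
pred = ["PRP", "VBP", "DT", "NN", "IN", "NN"]   # the tagger got the last word wrong

errors = sum(g != p for g, p in zip(gold, pred))
print("error rate:", errors / len(gold))        # 0.1666...
```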
Evaluation
● The result is compared with a manually annotated "Gold Standard"
● Typically accuracy > 97% on the WSJ portion of the PTB
● This may be compared with the result of a baseline tagger (one that uses no context)
  ○ Baselines (most frequent tag) can achieve up to 90% accuracy
● Important: 100% is impossible, even for human annotators
Summary
● Parts of speech
● Tagsets
● Part of speech tagging
● HMM tagging:
  ○ Most likely tag sequence (decoding)
  ○ Probability of an observation (word sequence)
  ○ Parameter estimation (supervised)
  ○ Evaluation
Next class
● Unsupervised POS tagging models (HMMs)
● Parameter estimation: the forward-backward algorithm
● Discriminative sequence models: MaxEnt, CRF, perceptron, SVM, etc.
● Read J&M chapters 5-6
● Pre-process and POS tag the data: report problems & baselines