COSC343: Artificial Intelligence Lecture 23: Statistical Language Modelling
Alistair Knott Dept. of Computer Science, University of Otago
In today’s lecture
Ambiguity in natural language
Simple probability models for natural language
Ambiguity in natural language
Prostitutes Appeal to Pope
New Vaccine May Contain Rabies
Squad Helps Dog Bite Victim
British Left Waffles On Falklands
A list of teachers broken down by age and sex is posted in the lobby
A man is run over in New York every hour
Techniques for resolving ambiguity
1. Use semantics. Work out the meaning of each interpretation. Which interpretation makes most sense?
2. Use statistics. Which word sense is most likely, in the current context? Which syntactic structure is most likely, in the current context?
Probabilistic models of natural language
To build a probabilistic model of a natural language L:
We start by obtaining a corpus of text in language L.
We may choose to enrich the corpus by annotating it, to identify important features explicitly, e.g. word senses, syntactic structures.
We estimate probabilities of events in the corpus by counting how often they occur in the corpus.
We use these probabilities to estimate probabilities of events in new (unseen) texts.
Example: Bayesian text classification
You saw this example in one of your tutes.
Training corpus: a set of emails, annotated (by a human) as either ‘spam’ or ‘ham’.
Probabilities are estimated from relative frequencies: for each word wi in the corpus, P(wi | spam) and P(wi | ¬spam).
From these probabilities, using a naive Bayes model, we can take a document containing words w1 . . . wn and calculate P(Spam | w1 ∧ w2 ∧ . . . ∧ wn).
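As a concrete illustration, here is a minimal Python sketch of the naive Bayes calculation described above, using a tiny invented email corpus. The add-one smoothing and log-space scoring are standard implementation details, not something specified on the slide.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate P(class) and per-word P(word | class) from a labelled corpus."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    priors = {c: doc_counts[c] / len(labels) for c in ("spam", "ham")}
    # Add-one smoothing so unseen words don't zero out the product.
    likelihoods = {
        c: {w: (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
            for w in vocab}
        for c in ("spam", "ham")
    }
    return priors, likelihoods

def classify(doc, priors, likelihoods):
    """Return the class with the highest posterior, working in log space."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in doc:
            if w in likelihoods[c]:
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Toy corpus, invented for illustration.
docs = [["win", "cash", "now"], ["meeting", "at", "noon"],
        ["cash", "prize", "now"], ["lunch", "at", "noon"]]
labels = ["spam", "ham", "spam", "ham"]
priors, likelihoods = train_naive_bayes(docs, labels)
print(classify(["cash", "now"], priors, likelihoods))  # -> 'spam'
```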
Dependencies between words
A naive Bayes model makes the assumption that the words in a document are (conditionally) independent of one another (given the document category). It’s called a ‘bag-of-words’ model. But if we want to build a model of sentence structure, rather than just classify documents, the whole point is that the words in a sentence are not fully independent of one another.
Recap: the two types of structure in human language
Recall from Lecture 20: knowing a natural language involves two things:
1. General principles, which are neatly encoded in a grammar: principles which group words into general categories, and principles which define hierarchical structure in sentences.
2. A lot of specific facts: knowledge of individual word meanings, and knowledge of idioms: word combinations that occur particularly frequently.
Probabilistic models are good at capturing knowledge of idioms.
Idiomatic structure in language
Idioms are commonly occurring word sequences or patterns.
Collocations are combinations of words that are more likely than you would predict (e.g. from a grammar): to pull over, to tuck in, to pick up; How are you doing?; Never mind.
Fixed expressions are groups of words that contribute a meaning collectively, rather than individually: to kick the bucket, to take X to task.
Lots of expressions are in between: Don’t worry about it.
Probabilistic language models
A common way of using probability theory to model language is to adopt a Markovian model: The probability of a word W is dependent on the previous n words. (Also called an n-gram model.) Note this is very different from a grammatical model! It ignores hierarchical structure in sentences altogether.
Building an n-gram language model
To build an n-gram model, we count frequencies of word sequences in a training corpus.
If n = 0, we count the frequency of each individual word, and estimate the prior probability distribution for words.
If n = 1, we count the frequency of each possible sequence of two words. Then if we’re given a word, we can estimate the conditional probability distribution for the next word.
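For concreteness, here is a minimal Python sketch of the counting step, using a toy whitespace-tokenised corpus. The n = 0 case gives unigram counts; the n = 1 case gives bigram counts, from which we estimate the conditional distribution over the next word.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
corpus = "the dog chased the cat the cat chased the mouse".split()

# Unigram counts give the prior distribution over words (the n = 0 case).
unigrams = Counter(corpus)

# Bigram counts give conditional distributions over the next word (the n = 1 case).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def p_next(prev, nxt):
    """Relative-frequency estimate of P(next word | previous word)."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

print(p_next("the", "cat"))  # 2 of the 4 bigrams starting with 'the' -> 0.5
```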
Some n-gram models
Here’s a sequence generated by a model trained on unigrams in AIMA: logical are as are confusion a may right tries agent goal the was diesel more object then information-gathering search is
Here’s one generated by a model trained on bigrams: planning purely diagnostic expert systems are very similar computational approach would be represented compactly using tic-tac-toe a predicate
Here’s one generated by a model trained on trigrams: planning and scheduling are integrated the success of naive bayes model is just a possible prior source by that time
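The exact generation procedure isn’t given on the slide, but the idea is to sample each next word from the model’s conditional distribution. A minimal sketch with a toy bigram model:

```python
import random
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
corpus = "the dog chased the cat the cat chased the mouse".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(start, length=8):
    """Sample a word sequence from the bigram counts, one word at a time."""
    word, output = start, [start]
    for _ in range(length - 1):
        counts = bigrams[word]
        if not counts:
            break  # no observed continuation for this word
        words, weights = zip(*counts.items())
        word = random.choices(words, weights=weights)[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. 'the cat chased the mouse' (varies per run)
```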
Problems building a linear language model
Clearly, a higher n gives a better model. But there’s a sparse data problem: it’s hard to get good estimates for sequences of length n when n gets large.
The problem is particularly severe because words are distributed according to Zipf’s law:
[Figure: word frequency plotted against word rank, on linear and log–log axes. A few words (‘the’, ‘of’) are extremely frequent; most words (e.g. ‘bayes’) are far down the tail.]
Basically, there are lots of very rare words. So to get good estimates of all word sequences, you need a huge corpus.
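To see the Zipf distribution in a corpus directly, count word frequencies and compare rank with frequency. A minimal sketch (the file name corpus.txt is a placeholder for any large plain-text corpus):

```python
from collections import Counter

# Hypothetical corpus file; any large plain-text file will do.
with open("corpus.txt") as f:
    words = f.read().lower().split()

freqs = Counter(words)
ranked = freqs.most_common()

# Zipf's law predicts frequency is roughly proportional to 1 / rank,
# so rank * frequency should stay roughly constant.
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    print(f"{rank:>4} {word:<12} {freq:>8} {rank * freq:>10}")
```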
Uses of a linear language model
We can use a linear model for word sense disambiguation. Many words are semantically ambiguous: e.g. bank can mean ‘river edge’ or ‘financial institution’.
To disambiguate: identify the sense of each ambiguous word by hand in the training corpus, then build an n-gram probability model over the sense-tagged corpus.
Now the context of an ambiguous word carries information about which sense was intended:
P(RIVERBANK | swam, to, the) = 0.000001
P(FINANCIALBANK | swam, to, the) = 0.000000000001
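A toy sketch of this idea, using an invented hand-sense-tagged corpus and conditioning on the previous three words, as in the example above. (A real system would need far more data, plus smoothing for unseen contexts.)

```python
from collections import Counter, defaultdict

# Hypothetical sense-tagged data: each ambiguous word carries a sense label
# assigned by a human annotator, along with its preceding context words.
tagged_corpus = [
    (["he", "swam", "to", "the"], "RIVERBANK"),
    (["she", "drove", "to", "the"], "FINANCIALBANK"),
    (["we", "swam", "to", "the"], "RIVERBANK"),
]

n = 3  # condition on the previous three words, as in the slide's example
context_counts = defaultdict(Counter)
for context, sense in tagged_corpus:
    context_counts[tuple(context[-n:])][sense] += 1

def p_sense(sense, context):
    """Relative-frequency estimate of P(sense | previous n words)."""
    counts = context_counts[tuple(context[-n:])]
    total = sum(counts.values())
    return counts[sense] / total if total else 0.0

print(p_sense("RIVERBANK", ["he", "swam", "to", "the"]))      # -> 1.0
print(p_sense("FINANCIALBANK", ["he", "swam", "to", "the"]))  # -> 0.0
```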
Uses of a linear language model
We can also use a linear language model in speech interpretation.
Say the speaker’s audio signal is consistent with several alternative word sequences (e.g. recognise speech / wreck a nice beach). The probability model can be used to choose between these alternatives. (Which sequence is most probable?)
P(recognise, speech) = 0.0001
P(wreck, a, nice, beach) = 0.00000001
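A sketch of how the comparison might be made with a bigram model, using invented probabilities and a log-space product over each candidate word sequence. (The floor value for unseen bigrams is an implementation assumption, not something from the slide.)

```python
import math

# Hypothetical bigram probabilities, as might be estimated from a large corpus.
bigram_prob = {
    ("<s>", "recognise"): 0.001, ("recognise", "speech"): 0.01,
    ("<s>", "wreck"): 0.0001, ("wreck", "a"): 0.01,
    ("a", "nice"): 0.005, ("nice", "beach"): 0.0001,
}
FLOOR = 1e-8  # back-off value for unseen bigrams

def sequence_log_prob(words):
    """Score a candidate transcription as a product of bigram probabilities,
    computed in log space to avoid underflow."""
    score = 0.0
    for prev, nxt in zip(["<s>"] + words, words):
        score += math.log(bigram_prob.get((prev, nxt), FLOOR))
    return score

candidates = [["recognise", "speech"], ["wreck", "a", "nice", "beach"]]
best = max(candidates, key=sequence_log_prob)
print(best)  # -> ['recognise', 'speech'] under these made-up probabilities
```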
Uses of a linear language model
We can also use n-grams to improve a bag-of-words model of text classification. Instead of using single words in our Bayesian model of spam/ham, we could use bigrams in the training corpus. (Or n-grams.)
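A minimal sketch of the feature change: each document is mapped to bigram ‘tokens’, which can then be counted exactly as the unigram features were in the naive Bayes sketch earlier.

```python
def bigram_features(tokens):
    """Turn a token list into bigram 'words' that can be counted just like
    the unigram features in the naive Bayes model sketched earlier."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

print(bigram_features(["win", "cash", "now"]))  # -> ['win_cash', 'cash_now']
```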
Uses of a linear language model
Linear language models can also be built with single letters. We compute the probability of the next letter, given the previous n letters. E.g.
P(p | h, e, l) = ?
P(l | h, e, l) = ?
This is how ‘predictive text’ algorithms are learned.
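A minimal character-level sketch of the same counting idea, conditioning on the previous three letters (the toy training text is invented):

```python
from collections import Counter, defaultdict

# Toy text, invented for illustration.
text = "hello help hello held hello"
n = 3  # condition on the previous three letters, e.g. P(next | 'h', 'e', 'l')

counts = defaultdict(Counter)
for i in range(len(text) - n):
    counts[text[i:i + n]][text[i + n]] += 1

def p_next_char(context, char):
    """Relative-frequency estimate of P(next letter | previous n letters)."""
    c = counts[context[-n:]]
    total = sum(c.values())
    return c[char] / total if total else 0.0

print(p_next_char("hel", "l"))  # P(l | h, e, l)
print(p_next_char("hel", "p"))  # P(p | h, e, l)
```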
Learning linear language models with neural networks
There’s a very neat variation on a backprop neural network that can learn a kind of probabilistic language model.
In a regular feedforward network, there’s an input layer, a hidden layer and an output layer.
In a simple recurrent network (invented by Jeff Elman, 1990):
The network takes each word in the training corpus one by one, and is trained to predict the next word. (Using regular backprop.)
To help it, the hidden layer from each time point is copied, and used as another input (the ‘current context’) at the next time point.
At any time point, the network’s inputs represent not just the current word, but the previous state of the network.
After training, the hidden layer learns to represent the aspects of the previous n words that are useful in predicting the next word.
It’s basically learning P(next-word | recent-n-words). But with no fixed n. . .
[Diagram: the current word and the current context feed into the hidden layer, which produces the predicted next word; the hidden layer is copied back as the next time step’s context.]
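The slides describe the architecture rather than an implementation, but a minimal NumPy sketch of a simple recurrent network along these lines might look as follows. The corpus, layer sizes, learning rate and number of epochs are invented for illustration, and, as in Elman’s original setup, the error signal is not backpropagated through the copied context.

```python
import numpy as np

# A minimal simple-recurrent-network sketch. Corpus and hyperparameters are
# invented for illustration.
rng = np.random.default_rng(0)

corpus = "woman break glass monster eat man dog chase cat".split()
vocab = sorted(set(corpus))
V, H = len(vocab), 20          # vocabulary size, hidden-layer size
idx = {w: i for i, w in enumerate(vocab)}

W_xh = rng.normal(0, 0.1, (H, V))   # current word -> hidden
W_hh = rng.normal(0, 0.1, (H, H))   # context (previous hidden) -> hidden
W_hy = rng.normal(0, 0.1, (V, H))   # hidden -> predicted next word

def one_hot(i):
    v = np.zeros(V); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for epoch in range(200):
    context = np.zeros(H)                      # reset context at each pass
    for cur, nxt in zip(corpus, corpus[1:]):
        x, target = one_hot(idx[cur]), idx[nxt]
        h = np.tanh(W_xh @ x + W_hh @ context) # hidden = f(word, context)
        y = softmax(W_hy @ h)                  # distribution over next word

        # Backprop for this time step only (no backprop through the context copy).
        dy = y.copy(); dy[target] -= 1.0       # d(cross-entropy)/d(logits)
        dh = (W_hy.T @ dy) * (1 - h ** 2)
        W_hy -= lr * np.outer(dy, h)
        W_xh -= lr * np.outer(dh, x)
        W_hh -= lr * np.outer(dh, context)

        context = h                            # copy hidden layer to context

# After training, the output approximates P(next word | history).
h = np.tanh(W_xh @ one_hot(idx["monster"]) + W_hh @ np.zeros(H))
probs = softmax(W_hy @ h)
print(vocab[int(probs.argmax())])              # most likely next word after 'monster'
```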
A simple training grammar
Elman trained his network on sentences produced by a simple grammar: e.g. woman break glass, monster eat man.
The lexicon was organised into categories (Elman 1990, Table 3), for example:
NOUN-HUM: man, woman
NOUN-ANIM: cat, mouse
NOUN-INANIM: book, rock
NOUN-AGRESS: dragon, monster
NOUN-FRAG: glass, plate
NOUN-FOOD: cookie, bread
VERB-INTRAN: think, sleep
VERB-TRAN: see, chase
VERB-AGPAT: move, break
VERB-PERCEPT: smell, see
VERB-DESTROY: break, smash
VERB-EAT: eat
Sentences were generated from templates over these categories (Elman 1990, Table 4), e.g. NOUN-HUM VERB-EAT NOUN-FOOD, NOUN-AGRESS VERB-DESTROY NOUN-FRAG, and presented as one continuous word stream with no breaks between successive sentences.
Analysis of the trained network
After training, if we build a dendrogram of the patterns of activity produced in the hidden layer by each word in the training lexicon, we see that words cluster into syntactic categories!
[Figure: dendrogram of hidden-layer activity patterns (Elman 1990). Verbs and nouns form separate clusters, with subclusters for verbs taking optional vs. obligatory direct objects, and for nouns such as animates (humans and animals) and breakables.]
The network doesn’t have to be ‘taught’ what syntactic categories are. They emerge very clearly from the data.
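The dendrogram analysis can be reproduced with standard hierarchical clustering. A sketch, assuming a hypothetical hidden_vectors dictionary holding each word’s average hidden-layer activation collected from the trained network (random vectors stand in here just so the snippet runs):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical: average hidden-layer activation for each word in the lexicon,
# collected by running the trained network over the corpus. Random stand-ins here.
hidden_vectors = {w: np.random.rand(20) for w in ["man", "woman", "dog", "eat", "break"]}

words = list(hidden_vectors)
X = np.stack([hidden_vectors[w] for w in words])

# Agglomerative clustering of the hidden representations; words that the
# network treats similarly (e.g. nouns vs. verbs) end up in the same subtree.
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z, labels=words)
plt.show()
```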
An Elman network for next letter/phoneme prediction
An Elman network can also be trained using letters (or phonemes) as input. In this setup, it learns to predict the next phoneme based on the previous n phonemes.
[Diagram: the current phoneme and the current context feed into the hidden layer, which produces the predicted next phoneme.]
E.g. M,a,n,y,y,e,a,r,s,a,g,o,a,b,o,y,a,n,d,g,i,r,l,l,i,v,e,d,b,y,t,h,e,s,e,a
An Elman network for next letter/phoneme prediction
An interesting thing to do is look at the trained network’s confidence in its predictions about the next letter.
We can do this by treating the predicted next phoneme layer as a probability distribution (by scaling activations to sum to 1). We can then compute the information gained by learning the actual next letter. Recall Lecture 6:
$I(\langle P_1, \ldots, P_n \rangle) = \sum_{i=1}^{n} -P_i \log_2 P_i$
High confidence = low information.
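A minimal sketch of that calculation: normalise the output layer’s activations into a probability distribution and compute its entropy in bits.

```python
import numpy as np

def entropy_of_prediction(output_activations):
    """Treat the output layer as a probability distribution (scale to sum to 1)
    and return its entropy in bits: high confidence = low entropy."""
    p = np.asarray(output_activations, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # 0 * log(0) is taken to be 0
    return float(-(p * np.log2(p)).sum())

print(entropy_of_prediction([0.9, 0.05, 0.05]))     # confident -> low entropy
print(entropy_of_prediction([1.0, 1.0, 1.0, 1.0]))  # uniform -> 2.0 bits
```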
Learning what words are
If you plot the network’s confidence for each prediction, you see that the places where it’s least confident about the next letter correspond to word boundaries.
[Figure: graph of root mean squared error in the letter-in-word prediction task (Elman 1990, Figure 6).]
An Elman network can learn what words are! No-one has to teach it!
The limits of a linear language model
We’ve seen that in order to describe the range of sentences in a human language, we need to build a grammar, which assigns each sentence a hierarchical structure. On this model, sentences are not simple sequences: they’re trees. Our model of semantic interpretation uses these trees to build sentence meanings.
What we need to do is to build a probability model which works with a grammar. I’ll talk about that in the next lecture.
Summary and reading
Reading for this lecture: AIMA Section 22.1.
Reading for next lecture: AIMA Section 23.2.1.