COSC343: Artificial Intelligence Lecture 23: Statistical Language Modelling
Alistair Knott Dept. of Computer Science, University of Otago
In today’s lecture
Ambiguity in natural language
Simple probability models for natural language
Ambiguity in natural language
Prostitutes Appeal to Pope
New Vaccine May Contain Rabies
Squad Helps Dog Bite Victim
British Left Waffles On Falklands
A list of teachers broken down by age and sex is posted in the lobby
A man is run over in New York every hour
Techniques for resolving ambiguity
1. Use semantics. Work out the meaning of each interpretation. Which interpretation makes most sense?
2. Use statistics. Which word sense is most likely, in the current context? Which syntactic structure is most likely, in the current context?
Probabilistic models of natural language
To build a probabilistic model of a natural language L:
We start by obtaining a corpus of text in language L.
We may choose to enrich the corpus by annotating it, to identify important features explicitly, e.g. word senses, syntactic structures.
We estimate probabilities of events in the corpus by counting how often they occur in the corpus.
We use these probabilities to estimate probabilities of events in new (unseen) texts.
Example: Bayesian text classification
You saw this example in one of your tutes.
Training corpus: a set of emails, annotated (by a human) as either ‘spam’ or ‘ham’.
Probabilities are estimated from relative frequencies: for each word wi in the corpus, P(wi | spam) and P(wi | ¬spam).
From these probabilities, using a naive Bayes model, we can take a document containing words w1 . . . wn and calculate P(Spam | w1 ∧ w2 ∧ . . . ∧ wn).
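As a concrete illustration, here is a minimal Python sketch of the naive Bayes calculation described above, using a tiny invented email corpus. The add-one smoothing and log-space scoring are standard implementation details, not something specified on the slide.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate P(class) and per-word P(word | class) from a labelled corpus."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    priors = {c: doc_counts[c] / len(labels) for c in ("spam", "ham")}
    # Add-one smoothing so unseen words don't zero out the product.
    likelihoods = {
        c: {w: (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
            for w in vocab}
        for c in ("spam", "ham")
    }
    return priors, likelihoods

def classify(doc, priors, likelihoods):
    """Return the class with the highest posterior, working in log space."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in doc:
            if w in likelihoods[c]:
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Toy corpus, invented for illustration.
docs = [["win", "cash", "now"], ["meeting", "at", "noon"],
        ["cash", "prize", "now"], ["lunch", "at", "noon"]]
labels = ["spam", "ham", "spam", "ham"]
priors, likelihoods = train_naive_bayes(docs, labels)
print(classify(["cash", "now"], priors, likelihoods))  # -> 'spam'
```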
Dependencies between words
A naive Bayes model makes the assumption that the words in a document are (conditionally) independent of one another (given the document category). It’s called a ‘bag-of-words’ model. But if we want to build a model of sentence structure, rather than just classify documents, the whole point is that the words in a sentence are not fully independent of one another.
Recap: the two types of structure in human language
Recall from Lecture 20: knowing a natural language involves two things:
1. General principles, which are neatly encoded in a grammar: principles which group words into general categories, and principles which define hierarchical structure in sentences.
2. A lot of specific facts: knowledge of individual word meanings, and knowledge of idioms: word combinations that occur particularly frequently.
Probabilistic models are good at capturing knowledge of idioms.
Idiomatic structure in language
Idioms are commonly occurring word sequences or patterns.
Collocations are combinations of words that are more likely than you would predict (e.g. from a grammar): to pull over, to tuck in, to pick up; How are you doing?; Never mind.
Fixed expressions are groups of words that contribute a meaning collectively, rather than individually: to kick the bucket, to take X to task.
Lots of expressions are in between: Don’t worry about it.
Probabilistic language models
A common way of using probability theory to model language is to adopt a Markovian model: The probability of a word W is dependent on the previous n words. (Also called an n-gram model.) Note this is very different from a grammatical model! It ignores hierarchical structure in sentences altogether.
Building an n-gram language model
To build an n-gram model, we count frequencies of word sequences in a training corpus.
If n = 0, we count the frequency of each individual word, and estimate the prior probability distribution for words.
If n = 1, we count the frequency of each possible sequence of two words. Then if we’re given a word, we can estimate the conditional probability distribution for the next word.
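For concreteness, here is a minimal Python sketch of the counting step, using a toy whitespace-tokenised corpus. The n = 0 case gives unigram counts; the n = 1 case gives bigram counts, from which we estimate the conditional distribution over the next word.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
corpus = "the dog chased the cat the cat chased the mouse".split()

# Unigram counts give the prior distribution over words (the n = 0 case).
unigrams = Counter(corpus)

# Bigram counts give conditional distributions over the next word (the n = 1 case).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def p_next(prev, nxt):
    """Relative-frequency estimate of P(next word | previous word)."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

print(p_next("the", "cat"))  # 2 of the 4 bigrams starting with 'the' -> 0.5
```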
Some n-gram models
Here’s a sequence generated by a model trained on unigrams in AIMA: logical are as are confusion a may right tries agent goal the was diesel more object then information-gathering search is
Here’s one generated by a model trained on bigrams: planning purely diagnostic expert systems are very similar computational approach would be represented compactly using tic-tac-toe a predicate
Here’s one generated by a model trained on trigrams: planning and scheduling are integrated the success of naive bayes model is just a possible prior source by that time
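The exact generation procedure isn’t given on the slide, but the idea is to sample each next word from the model’s conditional distribution. A minimal sketch with a toy bigram model:

```python
import random
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
corpus = "the dog chased the cat the cat chased the mouse".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(start, length=8):
    """Sample a word sequence from the bigram counts, one word at a time."""
    word, output = start, [start]
    for _ in range(length - 1):
        counts = bigrams[word]
        if not counts:
            break  # no observed continuation for this word
        words, weights = zip(*counts.items())
        word = random.choices(words, weights=weights)[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. 'the cat chased the mouse' (varies per run)
```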
Problems building a linear language model
Clearly, a higher n gives a better model. But there’s a sparse data problem: it’s hard to get good estimates for sequences of length n when n gets large.
The problem is particularly severe because words are distributed according to Zipf’s law:
[Figure: word frequency plotted against word rank, on linear and log–log axes. A few words (‘the’, ‘of’) are extremely frequent; most words (e.g. ‘bayes’) are far down the tail.]
Basically, there are lots of very rare words. So to get good estimates of all word sequences, you need a huge corpus.
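To see the Zipf distribution in a corpus directly, count word frequencies and compare rank with frequency. A minimal sketch (the file name corpus.txt is a placeholder for any large plain-text corpus):

```python
from collections import Counter

# Hypothetical corpus file; any large plain-text file will do.
with open("corpus.txt") as f:
    words = f.read().lower().split()

freqs = Counter(words)
ranked = freqs.most_common()

# Zipf's law predicts frequency is roughly proportional to 1 / rank,
# so rank * frequency should stay roughly constant.
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    print(f"{rank:>4} {word:<12} {freq:>8} {rank * freq:>10}")
```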
Uses of a linear language model
We can use a linear model for word sense disambiguation. Many words are semantically ambiguous: e.g. bank can mean ‘river edge’ or ‘financial institution’.
To disambiguate: identify the sense of each ambiguous word by hand in the training corpus, then build an n-gram probability model over the sense-tagged corpus.
Now the context of an ambiguous word carries information about which sense was intended:
P(RIVERBANK | swam, to, the) = 0.000001
P(FINANCIALBANK | swam, to, the) = 0.000000000001
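A toy sketch of this idea, using an invented hand-sense-tagged corpus and conditioning on the previous three words, as in the example above. (A real system would need far more data, plus smoothing for unseen contexts.)

```python
from collections import Counter, defaultdict

# Hypothetical sense-tagged data: each ambiguous word carries a sense label
# assigned by a human annotator, along with its preceding context words.
tagged_corpus = [
    (["he", "swam", "to", "the"], "RIVERBANK"),
    (["she", "drove", "to", "the"], "FINANCIALBANK"),
    (["we", "swam", "to", "the"], "RIVERBANK"),
]

n = 3  # condition on the previous three words, as in the slide's example
context_counts = defaultdict(Counter)
for context, sense in tagged_corpus:
    context_counts[tuple(context[-n:])][sense] += 1

def p_sense(sense, context):
    """Relative-frequency estimate of P(sense | previous n words)."""
    counts = context_counts[tuple(context[-n:])]
    total = sum(counts.values())
    return counts[sense] / total if total else 0.0

print(p_sense("RIVERBANK", ["he", "swam", "to", "the"]))      # -> 1.0
print(p_sense("FINANCIALBANK", ["he", "swam", "to", "the"]))  # -> 0.0
```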
Uses of a linear language model
We can also use a linear language model in speech interpretation.
Say the speaker’s audio signal is consistent with several alternative word sequences (e.g. recognise speech / wreck a nice beach). The probability model can be used to choose between these alternatives. (Which sequence is most probable?)
P(recognise, speech) = 0.0001
P(wreck, a, nice, beach) = 0.00000001
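A sketch of how the comparison might be made with a bigram model, using invented probabilities and a log-space product over each candidate word sequence. (The floor value for unseen bigrams is an implementation assumption, not something from the slide.)

```python
import math

# Hypothetical bigram probabilities, as might be estimated from a large corpus.
bigram_prob = {
    ("<s>", "recognise"): 0.001, ("recognise", "speech"): 0.01,
    ("<s>", "wreck"): 0.0001, ("wreck", "a"): 0.01,
    ("a", "nice"): 0.005, ("nice", "beach"): 0.0001,
}
FLOOR = 1e-8  # back-off value for unseen bigrams

def sequence_log_prob(words):
    """Score a candidate transcription as a product of bigram probabilities,
    computed in log space to avoid underflow."""
    score = 0.0
    for prev, nxt in zip(["<s>"] + words, words):
        score += math.log(bigram_prob.get((prev, nxt), FLOOR))
    return score

candidates = [["recognise", "speech"], ["wreck", "a", "nice", "beach"]]
best = max(candidates, key=sequence_log_prob)
print(best)  # -> ['recognise', 'speech'] under these made-up probabilities
```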
Uses of a linear language model
We can also use n-grams to improve a bag-of-words model of text classification. Instead of using single words in our Bayesian model of spam/ham, we could use bigrams in the training corpus. (Or n-grams.)
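A minimal sketch of the feature change: each document is mapped to bigram ‘tokens’, which can then be counted exactly as the unigram features were in the naive Bayes sketch earlier.

```python
def bigram_features(tokens):
    """Turn a token list into bigram 'words' that can be counted just like
    the unigram features in the naive Bayes model sketched earlier."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

print(bigram_features(["win", "cash", "now"]))  # -> ['win_cash', 'cash_now']
```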
Uses of a linear language model
Linear language models can also be built with single letters. We compute the probability of the next letter, given the previous n letters. E.g.
P(p | h, e, l) = ?
P(l | h, e, l) = ?
This is how ‘predictive text’ algorithms are learned.
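A minimal character-level sketch of the same counting idea, conditioning on the previous three letters (the toy training text is invented):

```python
from collections import Counter, defaultdict

# Toy text, invented for illustration.
text = "hello help hello held hello"
n = 3  # condition on the previous three letters, e.g. P(next | 'h', 'e', 'l')

counts = defaultdict(Counter)
for i in range(len(text) - n):
    counts[text[i:i + n]][text[i + n]] += 1

def p_next_char(context, char):
    """Relative-frequency estimate of P(next letter | previous n letters)."""
    c = counts[context[-n:]]
    total = sum(c.values())
    return c[char] / total if total else 0.0

print(p_next_char("hel", "l"))  # P(l | h, e, l)
print(p_next_char("hel", "p"))  # P(p | h, e, l)
```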
Learning linear language models with neural networks
There’s a very neat variation on a backprop neural network that can learn a kind of probabilistic language model.
In a regular feedforward network, there’s an input layer, a hidden layer and an output layer.
In a simple recurrent network (invented by Jeff Elman, 1990):
The network takes each word in the training corpus one by one, and is trained to predict the next word. (Using regular backprop.)
To help it, the hidden layer from each time point is copied, and used as another input (the ‘current context’) at the next time point.
At any time point, the network’s inputs represent not just the current word, but the previous state of the network.
After training, the hidden layer learns to represent the aspects of the previous n words that are useful in predicting the next word.
It’s basically learning P(next-word | recent-n-words). But with no fixed n. . .
[Diagram: the current word and the current context feed into the hidden layer, which produces the predicted next word; the hidden layer is copied back as the next time step’s context.]
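The slides describe the architecture rather than an implementation, but a minimal NumPy sketch of a simple recurrent network along these lines might look as follows. The corpus, layer sizes, learning rate and number of epochs are invented for illustration, and, as in Elman’s original setup, the error signal is not backpropagated through the copied context.

```python
import numpy as np

# A minimal simple-recurrent-network sketch. Corpus and hyperparameters are
# invented for illustration.
rng = np.random.default_rng(0)

corpus = "woman break glass monster eat man dog chase cat".split()
vocab = sorted(set(corpus))
V, H = len(vocab), 20          # vocabulary size, hidden-layer size
idx = {w: i for i, w in enumerate(vocab)}

W_xh = rng.normal(0, 0.1, (H, V))   # current word -> hidden
W_hh = rng.normal(0, 0.1, (H, H))   # context (previous hidden) -> hidden
W_hy = rng.normal(0, 0.1, (V, H))   # hidden -> predicted next word

def one_hot(i):
    v = np.zeros(V); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for epoch in range(200):
    context = np.zeros(H)                      # reset context at each pass
    for cur, nxt in zip(corpus, corpus[1:]):
        x, target = one_hot(idx[cur]), idx[nxt]
        h = np.tanh(W_xh @ x + W_hh @ context) # hidden = f(word, context)
        y = softmax(W_hy @ h)                  # distribution over next word

        # Backprop for this time step only (no backprop through the context copy).
        dy = y.copy(); dy[target] -= 1.0       # d(cross-entropy)/d(logits)
        dh = (W_hy.T @ dy) * (1 - h ** 2)
        W_hy -= lr * np.outer(dy, h)
        W_xh -= lr * np.outer(dh, x)
        W_hh -= lr * np.outer(dh, context)

        context = h                            # copy hidden layer to context

# After training, the output approximates P(next word | history).
h = np.tanh(W_xh @ one_hot(idx["monster"]) + W_hh @ np.zeros(H))
probs = softmax(W_hy @ h)
print(vocab[int(probs.argmax())])              # most likely next word after 'monster'
```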
A simple training grammar
Elman trained his network on sentences produced by a simple grammar: e.g. woman break glass, monster eat man.
The lexicon was organised into categories (Elman 1990, Table 3), for example:
NOUN-HUM: man, woman
NOUN-ANIM: cat, mouse
NOUN-INANIM: book, rock
NOUN-AGRESS: dragon, monster
NOUN-FRAG: glass, plate
NOUN-FOOD: cookie, bread
VERB-INTRAN: think, sleep
VERB-TRAN: see, chase
VERB-AGPAT: move, break
VERB-PERCEPT: smell, see
VERB-DESTROY: break, smash
VERB-EAT: eat
Sentences were generated from templates over these categories (Elman 1990, Table 4), e.g. NOUN-HUM VERB-EAT NOUN-FOOD, NOUN-AGRESS VERB-DESTROY NOUN-FRAG, and presented as one continuous word stream with no breaks between successive sentences.
Analysis of the trained network
After training, if we build a dendrogram of the patterns of activity produced in the hidden layer by each word in the training lexicon, we see that words cluster into syntactic categories!
[Figure: dendrogram of hidden-layer activity patterns (Elman 1990). Verbs and nouns form separate clusters, with subclusters for verbs taking optional vs. obligatory direct objects, and for nouns such as animates (humans and animals) and breakables.]
The network doesn’t have to be ‘taught’ what syntactic categories are. They emerge very clearly from the data.
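The dendrogram analysis can be reproduced with standard hierarchical clustering. A sketch, assuming a hypothetical hidden_vectors dictionary holding each word’s average hidden-layer activation collected from the trained network (random vectors stand in here just so the snippet runs):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical: average hidden-layer activation for each word in the lexicon,
# collected by running the trained network over the corpus. Random stand-ins here.
hidden_vectors = {w: np.random.rand(20) for w in ["man", "woman", "dog", "eat", "break"]}

words = list(hidden_vectors)
X = np.stack([hidden_vectors[w] for w in words])

# Agglomerative clustering of the hidden representations; words that the
# network treats similarly (e.g. nouns vs. verbs) end up in the same subtree.
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z, labels=words)
plt.show()
```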
An Elman network for next letter/phoneme prediction
An Elman network can also be trained using letters (or phonemes) as input. In this setup, it learns to predict the next phoneme based on the previous n phonemes.
[Diagram: the current phoneme and the current context feed into the hidden layer, which produces the predicted next phoneme.]
E.g. M,a,n,y,y,e,a,r,s,a,g,o,a,b,o,y,a,n,d,g,i,r,l,l,i,v,e,d,b,y,t,h,e,s,e,a
An Elman network for next letter/phoneme prediction
An interesting thing to do is look at the trained network’s confidence in its predictions about the next letter.
We can do this by treating the predicted next phoneme layer as a probability distribution (by scaling activations to sum to 1). We can then compute the information gained by learning the actual next letter. Recall Lecture 6:
$I(\langle P_1, \ldots, P_n \rangle) = \sum_{i=1}^{n} -P_i \log_2 P_i$
High confidence = low information.
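A minimal sketch of that calculation: normalise the output layer’s activations into a probability distribution and compute its entropy in bits.

```python
import numpy as np

def entropy_of_prediction(output_activations):
    """Treat the output layer as a probability distribution (scale to sum to 1)
    and return its entropy in bits: high confidence = low entropy."""
    p = np.asarray(output_activations, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # 0 * log(0) is taken to be 0
    return float(-(p * np.log2(p)).sum())

print(entropy_of_prediction([0.9, 0.05, 0.05]))     # confident -> low entropy
print(entropy_of_prediction([1.0, 1.0, 1.0, 1.0]))  # uniform -> 2.0 bits
```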
Learning what words are
If you plot the network’s confidence for each prediction, you see that the places where it’s least confident about the next letter correspond to word boundaries.
[Figure: graph of root mean squared error in the letter-in-word prediction task (Elman 1990, Figure 6).]
An Elman network can learn what words are! No-one has to teach it!
The limits of a linear language model
We’ve seen that in order to describe the range of sentences in a human language, we need to build a grammar, which assigns each sentence a hierarchical structure. On this model, sentences are not simple sequences: they’re trees. Our model of semantic interpretation uses these trees to build sentence meanings.
What we need to do is to build a probability model which works with a grammar. I’ll talk about that in the next lecture.
Summary and reading
Reading for this lecture: AIMA Section 22.1.
Reading for next lecture: AIMA Section 23.2.1.