INTRODUCTION FINITE STATE AUTOMATA (FSA) FINITE STATE TRANSDUCERS

Download Kenneth R. Beesley and Lauri Karttunen, Finite. State Morphology, CSLI Publications, 2003. Roche and Schabes 1997. Finite-State Language Pr...

1 downloads 620 Views 337KB Size
Introduction Finite State Automata (FSA) Finite State Transducers (FST)

NLP FS Models

1

Aplications: Increasing use in NLP Morphology Phonology Lexical generation ASR POS tagging simplification of CFG Information Extraction NLP FS Models

2



Why? • Temporal and spatial efficiency • Some FS Machines can be determinized and optimized for leading to more compact representations • Possibility to be used in cascaded form

NLP FS Models

3

Some readings Kenneth R. Beesley and Lauri Karttunen, Finite State Morphology, CSLI Publications, 2003 Roche and Schabes 1997 Finite-State Language Processing. 1997. MIT Press, Cambridge, Massachusetts. References to Finite-State Methods in Natural Language Processing http://www.cis.upenn.edu/~cis639/docs/fsrefs. html

NLP FS Models

4

Some toolbox ATT FSM tools http://www2.research.att.com/~fsmtools/fsm/ Beesley, Kartunnen book http://www.stanford.edu/~laurik/fsmbook/hom e.html Carmel http://www.isi.edu/licensed-sw/carmel/ Dan Colish's PyFSA (Python FSA) https: //github.com/dcolish/PyFSA

NLP FS Models

5

Equivalence Regular Expressions Regular Languages FSA

NLP FS Models

6

Regular Expressions •



Basically they are combinations of simple units (character or strings) with connectives as concatenation, disjunction, option, kleene star, etc. Available in languages as Perl or Python and Unix commands as awk, grep, ...

NLP FS Models

7

Example, acronym detection patterns acrophile acro1 = re.compile('^([A-Z][,\.-/_])+$') acro2 = re.compile('^([A-Z])+$') acro3 = re.compile('^\d*[A-Z](\d[A-Z])*$') acro4 = re.compile('^[A-Z][A-Z][A-Z]+[A-Za-z]+$') acro5 = re.compile('^[A-Z][A-Z]+[A-Za-z]+[A-Z]+$') acro6 = re.compile('^([A-Z][,\.-/_]){2,9}(\'s|s)?$') acro7 = re.compile('^[A-Z]{2,9}(\'s|s)?$') acro8 = re.compile('^[A-Z]*\d[-_]?[A-Z]+$') acro9 = re.compile('^[A-Z]+[A-Za-z]+[A-Z]+$') acro10 = re.compile('^[A-Z]+[/-][A-Z]+$') NLP FS Models

8

Formal Languages

Alphabet (vocabulary) Σ concatenation operation Σ* strings over Σ (free monoid) Language L ⊆ Σ* Languages and grammars Regular Languages (RL) NLP FS Models

9

L, L1 y L2 are languages operations concatenation

L1 • L2 = {u• v | u ∈ L1 ∧ v ∈ L 2 }

union

L1 ∪ L2 = {u | u ∈ L1 ∨ u ∈ L 2 }

intersection

L1 ∩ L2 = {u | u ∈ L1 ∧ u ∈ L 2 }

difference

L1 − L2 = {u | u ∈ L1 ∧ u ∉ L 2 }

complement

L=Σ −L

NLP FS Models

*

10

FSA <Σ, Q, i, F, E> Σ Q i∈Q F⊆Q E ⊆ Q × (Σ ∪ {ε}) × Q E: {d | d: Q × (Σ ∪ {ε}) → 2Q}

NLP FS Models

alphabet finite set of states initial state final states set arc set transitions set

11

Example 1: Recognizes multiple of 2 codified in binary

0

1 1

0

1

0

NLP FS Models

State 0: The string recognized till now ends with 0 State 1: The string recognized till now ends with 1

12

Example 2: Recognizes multiple of 3 codified in binary 0 0

1 0

1

1

2

0 1 State 0: The string recognized till now is multiple of 3 State 1: The string recognized till now is multiple of 3 + 1 State 2: The string recognized till now is multiple of 3 + 2 The transition from a state to the following multiplies by 2 the current string and adds to it the current tag NLP FS Models

13

tabular representation of the FSA

0

1

0

0

1

1

2

0

2

1

2

1

0 0

1 0

2

1 1

0

Recognizes multiple of 3 codified in binary NLP FS Models

14

Properties of RL and FSA Let A a FSA L(A) is the language generated (recognized) by A The class of RL (o FSA) is closed under union intersection concatenation complement Kleene star(A*)

Determinization of FSA Minimization de FSA NLP FS Models

15

The following properties of FSA are decidible w ∈ L(A) ? L(A) = ∅ ? L(A) = Σ* ? L(A1) ⊆ L(A2) ? L(A1) = L(A2) ?

Only the first two are for CFG

NLP FS Models

16

Example of the use of closure properties

Pro

Representation of the Lexicon

that

Det

Pro he

Pro

hopes

N V

that

Conj Det

Conj

this

Det

works

Pro

N V

Let S the FSA: Representation of the sentence with POS tags NLP FS Models

17

Restrictions (negative rules) FSA C1

that

Det

this

Det

FSA C2

that

Det

?

V

We are interested on S - (Σ* • C1 • Σ*) - (Σ* • C2 • Σ*) = S - (Σ* • ( C1 ∪ C2) • Σ*)

NLP FS Models

18

From the union of negative rules we can build a Negative grammar G = Σ* • ( C1 ∪ C2 ∪ … ∪ Cn) • Σ*) Pro Det

that

NLP FS Models

this

this

Det

?

?

Pro

?

Det

Pro

V

N

19

The difference between the two FSA S -G will result on:

he

Pro

hopes

V

that

Conj

this

Pro works

V

Det works N

Most of the ambiguities have been solved

NLP FS Models

20

FST <Σ1, Σ2, Q, i, F, E> Σ1 Σ2

input alphabet output alphabet

Q i∈Q F⊆Q E ⊆ Q × (Σ1* × Σ2 *) × Q

finite states set initial state final states set arcs set

frequently Σ1 = Σ2 = Σ

NLP FS Models

21

Example 3

1/1

0/0 0/0

1/0 0

2

1

1/1

0/1 Td3: division by 3 of a binary string Σ1 = Σ2 = Σ ={0,1}

NLP FS Models

22

Example 3

input 0 11 110 1001 1100 1111 10010

NLP FS Models

output 0 01 010 0011 0100 0101 00110

1/1

0/0 0/0

1/0 0

2

1 1/1

0/1

23

1/1

0/0 0/0

1/0 0

2

1 0/1

1/1 State 0: Recognized: Emited: invariant: emited * 3 = Recognized NLP FS Models

3k k

State 1: Recognized : Emited :

3k+1 k

invariant: emited * 3 + 1 = Recognized

State 2: Recognized : Emited :

3k+2 k

invariant: emited * 3 + 2 = Recognized

24

1/1

0/0 0/0

1/0 0

2

1 0/1

1/1 state 0: Recognized: Emited:

NLP FS Models

3k k

satisfies invariant state 0

consums: emits: recognized: emited:

0 0 3*k*2 = 6k k*2 = 2k

consums: emits: recognized: emited:

1 0 3*k*2 + 1= 6k + 1 k*2 = 2k

satisfies invariant state 1

25

1/1

0/0 0/0

1/0 0

2

1 0/1

1/1 state 1: recognized: emited:

3k+1 k

consums: emits: recognized: Emited: consums: emits: recogniced: emited:

NLP FS Models

satisfaces invariant state 2

0 0 (3k+1)*2 = 6k + 2 k*2 = 2k satisfaces invariant state 0 1 1 (3k+1)*2 + 1= 6k + 3 k*2 + 1 = 2k + 1

26

1/1

0/0 0/0

1/0 0

2

1 0/1

1/1 state 2: recognized: emited:

3k+2 k

consums: emits: recognized: emited:

consums: emits: recognized: emited: NLP FS Models

satisfaces invariant state 1

0 1 (3k+2)*2 = 6k + 4 k*2 + 1 = 2k + 1

satisfaces invariant state 2

1 1 (3k+2)*2 + 1= 6k + 5 k*2 + 1 = 2k + 1

27

FSA associated to a FST

FST <Σ1, Σ2, Q, i, F, E> FSA <Σ, Q, i, F, E’> Σ = Σ 1 × Σ2 (q1, (a,b), q2) ∈ E’ ⇔ (q1, a, b, q2) ∈ E

NLP FS Models

28

FST 9 Projections of a FST

FST T = <Σ1, Σ2, Q, i, F, E> First projection P1(T) <Σ1, Q, i, F, EP1> EP1 = {(q,a,q’) | (q,a,b,q’) ∈ E}

Second projection P2(T) <Σ2, Q, i, F, EP2> EP2 = {(q,b,q’) | (q,a,b,q’) ∈ E} NLP FS Models

29

FST are closed under union invertion example: Td3-1 is equivalent to multiply by 3

composition example : Td9 = Td3 • Td3

FST are not closed under intersection

NLP FS Models

30

Application of a FST

Traverse the FST in all forms compatible with the input (using backtracking if needed) until reaching a final state and generate the corresponding output Consider input as a FSA and compute the intersection of the FSA and the FST

NLP FS Models

31

Determinization of a FST

Not all FST are determinizable, if it is the case they are named subsequential The non deterministic FST is equivalent to the deterministic one a/b

h/h 1 0

0 2 a/c NLP FS Models

h/bh

a/ε

e/e

0

1

2 e/ce

32