===== Nondeterministic Automata =====
==== Motivation ====
In the previous lecture we investigated the **semantics** of regular expressions and saw how we can determine the language generated by, e.g., $math[(A\cup B)(a\cup b)*(0 \cup 1)*]. However, it is not straightforward to **compute** whether a given word $math[w] is a member of $math[L(e)], and this is precisely the task of the **lexical stage**.
In more formal terms, we have a //generator// - a means to construct a language from a regular expression, but we lack a means for //accepting// (words of) languages.
We shall informally illustrate an algorithm for verifying the membership $math[w \in L((A\cup B)(a\cup b)*(0 \cup 1)*)], in Haskell:
<code haskell>
-- stage one: the word must start with 'A' or 'B'
check ('A':xs) = check1 xs
check ('B':xs) = check1 xs
check _        = False

-- stage two: a sequence of 'a's and 'b's, possibly followed by binary digits
check1 ('a':xs) = check1 xs
check1 ('b':xs) = check1 xs
check1 ('0':xs) = check2 xs
check1 ('1':xs) = check2 xs
check1 []       = True
check1 _        = False

-- stage three: only binary digits may follow
check2 ('0':xs) = check2 xs
check2 ('1':xs) = check2 xs
check2 []       = True
check2 _        = False
</code>
The algorithm proceeds in **three stages**:
* in the first stage, we check if ''A'' or ''B'' are encountered, otherwise we move on to the second stage;
* in the second stage, we check if ''a'', ''b'', ''0'' or ''1'' are encountered; if ''a'' or ''b'' are found, we continue inspection in the second stage; if ''0'' or ''1'' are found, we continue inspection in the third stage; finally, if the string terminates, we report true;
* in the third stage we search for binary digits in a similar way;
The same strategy can be written in a more elegant way as:
<code haskell>
check w = chk (w ++ "!") [0]
  where chk (x:xs) set
          | x `elem` ['A', 'B'] && 0 `elem` set = chk xs [1,2,3]
          | x `elem` ['a', 'b'] && 1 `elem` set = chk xs [1,2,3]
          | x `elem` ['0', '1'] && 2 `elem` set = chk xs [2,3]
          | x == '!' && 3 `elem` set            = True
          | otherwise                           = False
        chk [] _ = False
</code>
Here, we have introduced the symbol ''!'' to mark the string termination, and thus make the whole code nicer to write. We have also made the //stage idea// explicit. The procedure ''chk'' maintains a set of //stages// or //states//:
* $math[0\in set] indicates that we are in the initial stage, where we are looking for ''A'' or ''B''
* $math[1\in set] indicates that we have read a sequence of alphabetic symbols: ''a''s, ''b''s may follow
* $math[2\in set] indicates that the sequence of alphabetic symbols has ended; only ''0''s or ''1''s may follow;
* $math[3\in set] indicates that the string may also terminate at any time - ''3'' is an //end-stage//.
We start in the initial stage. Whenever a symbol is read, the stage, i.e. the set of possible states, is updated: for instance, when ''0'' or ''1'' is read, only the second and third situations remain possible.
The idea behind our code could be expressed as the following diagram:
{{:lfa:example.png|}}
where
* each node is a **state**, which indicates what is the current stage in the recognition of the input word;
* each arrow is a **transition** which takes the recognition process from one stage to another;
* here, $math[Q_0] is the initial state, $math[Q'] is the state from which any lower-case alphanumeric symbol in the alphabet may follow, and $math[Q''] is the state from which only numerics are accepted.
The string can terminate successfully in both $math[Q'] and $math[Q''], which is shown via double circles.
==== Nondeterministic automata ====
The key idea behind the previous algorithm can be generalised to **any** regular expression, and its associated code, written in the same style, yields a similar diagram.
In practice, it is the diagram, i.e. the **nondeterministic finite automaton** (NFA), which helps us generate the code.
$def[NFA]
A **non-deterministic finite automaton** is a tuple $math[M=(K,\Sigma,\Delta,q_0,F)] where:
* $math[K] is a finite set of **states**
* $math[\Sigma] is an alphabet
* $math[\Delta] is a **subset** of $math[K \times \Sigma^* \times K] and is called a **transition relation**
* $math[q_0\in K] is **the initial state**
* $math[F\subseteq K] is **the set of final states**
$end
As an example, consider:
* $math[K=\{q_0,q_1,q_2\}]
* $math[\Sigma=\{0,1\}]
* $math[\Delta=\{(q_0,0,q_0),(q_0,1,q_0),(q_0,0,q_1),(q_1,1,q_2)\}]
* $math[F = \{q_2\}]
Notice that the NFA can get **stuck** for certain inputs, i.e. there may be no transition to follow; such inputs are **not accepted**.
**Graphical notation**
$def[Configuration]
A **configuration** of an NFA is a **member** of $math[K\times \Sigma^*].
$end
Informally, configurations capture a **snapshot** of the execution of an NFA. The snapshot consists of the:
* **current state** of the automaton and
* **the rest of the word** from the input.
For instance, $math[(q_0,0001)] is the **initial configuration** of the automaton from our example, on input $math[0001].
$def[Transition]
We call $math[\vdash_M \subseteq (K\times \Sigma^*) \times (K\times\Sigma^*)] the **one-step** move relation of automaton $math[M]. The relation describes how the automaton **may move** from one configuration to another. Formally:
* $math[(q,w) \vdash_M (q',w')] if and only if there exists $math[u\in\Sigma^*], such that $math[w=uw'] ($math[u] is a prefix of $math[w]) and $math[(q,u,q')\in\Delta]: from state $math[q] on input $math[u] we reach state $math[q'].
We call $math[\vdash_M^*], the **reflexive and transitive closure of** $math[\vdash_M], i.e. the **zero-or-more step(s)** move of automaton $math[M].
$end
For instance, in our previous example, $math[(q_0,0001)\vdash_M(q_0,001)] and also $math[(q_0,0001)\vdash_M(q_1,001)]. At the same time, $math[(q_0,0001)\vdash_M^*(q_2,\epsilon)]. Can you figure out the sequence of steps?
$prop[Acceptance]
A word $math[w] is accepted by an NFA $math[M] iff $math[(q_0,w)\vdash_M^*(q,\epsilon)] and $math[q\in F]. In other words, after the word $math[w] was processed by the automaton, we reach a **final state**.
$end
Notice that the word $math[0001] is indeed accepted by the automaton $math[M] from our example.
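The acceptance condition can be turned directly into a naive search procedure over configurations. Below is a sketch in Haskell; the encoding of $math[\Delta] as a list of triples and the names ''deltaEx'', ''moves'' and ''accepts'' are ours, chosen for illustration (transitions are labelled with words, as in the definition of $math[\Delta]).

```haskell
import Data.List (stripPrefix)

-- the transition relation of the example automaton, as a list of triples
deltaEx :: [(Int, String, Int)]
deltaEx = [(0,"0",0), (0,"1",0), (0,"0",1), (1,"1",2)]

-- the one-step move relation |-_M on configurations (q, w)
moves :: [(Int, String, Int)] -> (Int, String) -> [(Int, String)]
moves d (q, w) = [ (q', w') | (p, u, q') <- d, p == q
                            , Just w' <- [stripPrefix u w] ]

-- w is accepted iff (q0, w) |-*_M (q, eps) for some final state q
accepts :: [(Int, String, Int)] -> [Int] -> Int -> String -> Bool
accepts d fs q0 w = go [] [(q0, w)]
  where
    go _ [] = False
    go seen (c@(q, u) : rest)
      | c `elem` seen         = go seen rest   -- configuration already explored
      | null u && q `elem` fs = True           -- reached (q, eps) with q final
      | otherwise             = go (c : seen) (moves d c ++ rest)
```

For instance, ''accepts deltaEx [2] 0 "0001"'' yields ''True'' (the search finds the sequence of steps asked about above), while ''"000"'' is rejected.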
$def[Language accepted by an NFA]
Given an NFA $math[M], we define $math[L(M) = \{w \mid w \text{ is accepted by } M\}] as the language **accepted** by $math[M]. We say $math[M] accepts the language $math[L(M)].
$end
=== Execution tree for Nondeterministic Finite Automata ===
Illustration of an NFA for $math[(A\cup B)(a\cup b)*(0 \cup 1)*].
There are two ways of writing this automaton:
* one that follows exactly our previous algorithm sketch.
* one that employs **epsilon transitions**.
**Epsilon transitions** are a means for jumping from one state to another without consuming input. They are a useful device for defining automata, because they allow us to **combine** several automata into one.
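To see why epsilon transitions are easy to handle, here is a small Haskell sketch (our own helper, not part of the lecture's construction) that computes the **epsilon closure** of a set of states, i.e. all states reachable via epsilon transitions alone. We encode an epsilon transition as a triple labelled with the empty string:

```haskell
import Data.List (nub)

-- epsilon transitions are labelled with "" in this encoding;
-- the closure is computed as a fixpoint
eclose :: [(Int, String, Int)] -> [Int] -> [Int]
eclose d qs
  | qs' == qs = qs                    -- fixpoint reached
  | otherwise = eclose d qs'
  where qs' = nub (qs ++ [ q' | (q, "", q') <- d, q `elem` qs ])
```

For instance, on ''[(0,"",1),(1,"",2),(2,"a",3)]'', the closure of ''[0]'' is ''[0,1,2]'': state 3 is only reachable by consuming an ''a''.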
==== Nondeterminism as imperfect information ====
Notice that **nondeterminism** actually refers to our imperfect information regarding the current state of the automaton. **Nondeterminism** means that, after consuming some part (prefix) of a word, //several concrete states may be possible current states//.
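This reading of nondeterminism can be made concrete: starting from the initial state, we can compute the //set// of possible current states after each consumed symbol, just as ''chk'' did with its ''set'' argument. Below is a minimal sketch for automata whose transitions are labelled with single symbols; the names ''step'' and ''possible'' are ours, for illustration:

```haskell
import Data.List (nub)

-- all states reachable from some state in qs by consuming the symbol c
step :: [(Int, Char, Int)] -> [Int] -> Char -> [Int]
step d qs c = nub [ q' | (q, c', q') <- d, q `elem` qs, c' == c ]

-- the set of possible current states after consuming the word w
possible :: [(Int, Char, Int)] -> Int -> String -> [Int]
possible d q0 w = foldl (step d) [q0] w
```

On the automaton from our example, ''possible [(0,'0',0),(0,'1',0),(0,'0',1),(1,'1',2)] 0 "00"'' is ''[0,1]'': after reading two zeroes, both $math[q_0] and $math[q_1] are possible current states.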
==== From Regular Expressions to NFAs ====
While Regular Expressions are a natural instrument for declaring (or generating) tokens, NFAs are a **natural instrument for accepting** tokens (i.e. their respective language).
The following theorem shows how this can be achieved.
$justtheorem
For every language $math[L(E)] defined by the regular expression $math[E], there exists an NFA $math[M], such that $math[L(M)=L(E)].
$end
This theorem is particularly important, because it also provides an **algorithm** for constructing NFAs from regular expressions.
$proof
Let $math[E] be a regular expression. We construct an NFA with:
* **exactly one initial state**.
* **exactly one final state**.
* **no transitions from the final state**.
The proof is by **induction** over the expression structure.
**Basis case $math[E=\emptyset]**
We construct the following automaton:
{{:lfa:emptyset.jpg|}}
It is clear that this automaton accepts no word, and obeys the three aforementioned conditions.
**Basis case $math[E=\epsilon]**
We construct the following automaton:
{{:lfa:emptyword.jpg|}}
which only accepts the empty word.
**Basis case $math[E=c]** where $math[c] is a symbol of the alphabet.
We construct the following automaton:
{{:lfa:char.jpg|}}
Since there are three //inductive rules// for constructing regular expressions (union, concatenation and Kleene star), we have to treat three induction steps:
**Induction step $math[E=E_1E_2] (concatenation)**
Suppose $math[E_1] and $math[E_2] are regular expressions for which NFAs can be built (**induction hypothesis**). We build the following NFA which accepts all words generated by the regular expression $math[E_1E_2].
{{:lfa:concat.jpg|}}
**Induction step $math[E=E_1\cup E_2] (union)**
Suppose $math[E_1] and $math[E_2] are regular expressions for which NFAs can be built (**induction hypothesis**). We build the following NFA which accepts all words generated by the regular expression $math[E_1\cup E_2].
{{:lfa:union.jpg|}}
**Induction step $math[E=E_1^*] (Kleene star)**
Suppose $math[E_1] is a regular expression for which an NFA can be built (**induction hypothesis**). We build the following NFA which accepts all words generated by the regular expression $math[E_1^*].
{{:lfa:kleene.jpg|}}
$end
We illustrate the algorithmic procedure on our regular expression $math[(A\cup B)(a\cup b)*(0 \cup 1)*].
The result is shown below:
{{:lfa:slide4.jpg|}}
From the proof, a naive algorithm can be easily implemented. We illustrate it in Haskell:
<code haskell>
data RegExp = EmptyString
            | Atom Char
            | RegExp :| RegExp
            | RegExp :. RegExp
            | Kleene RegExp
            deriving Show

data NFA = NFA { delta :: [(Int, Char, Int)], fin :: [Int] } deriving Show
</code>
We begin with a list-based representation of the transition relation $math[\Delta]. We assume the character ''e'' is reserved for encoding $math[\epsilon], the empty word.
<code haskell>
-- the strategy is to increment each state by i
relabel :: Int -> NFA -> NFA
relabel i (NFA delta fin) =
  NFA (map (\(s, c, s') -> (s + i, c, s' + i)) delta) (map (+i) fin)
</code>
Since we have chosen to represent states as integers, we use a re-labelling function to ensure state uniqueness when combining two NFAs. Re-labelling simply shifts every state by a given increment. For instance, by calling ''relabel (f1+1) n'', we ensure that the NFA ''n'' will have its initial state equal to ''f1+1''. Note that ''f1'' is the final and highest-numbered state of the first NFA in our code, which guarantees that the shifted states are fresh.
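As a small illustration of the shift (the definitions are repeated here so the snippet is self-contained):

```haskell
data NFA = NFA { delta :: [(Int, Char, Int)], fin :: [Int] } deriving Show

relabel :: Int -> NFA -> NFA
relabel i (NFA d f) =
  NFA (map (\(s, c, s') -> (s + i, c, s' + i)) d) (map (+i) f)

-- shifting the one-transition NFA 0 --a--> 1 by 2 yields 2 --a--> 3
shifted :: NFA
shifted = relabel 2 (NFA [(0, 'a', 1)] [1])
```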
<code haskell>
toNFA :: RegExp -> NFA
toNFA EmptyString = NFA [(0,'e',1)] [1]
toNFA (Atom c)    = NFA [(0,c,1)] [1]
toNFA (e :. e')   = let NFA delta1 [f1] = toNFA e
                        NFA delta2 [f2] = relabel (f1+1) (toNFA e')
                    in NFA (delta1 ++ delta2 ++ [(f1,'e',f1+1)]) [f2]
toNFA (e :| e')   = let NFA delta1 [f1] = relabel 1 (toNFA e)
                        NFA delta2 [f2] = relabel (f1+1) (toNFA e')
                    in NFA (delta1 ++ delta2 ++ [(0,'e',1),
                                                 (0,'e',f1+1),
                                                 (f1,'e',f2+1),
                                                 (f2,'e',f2+1)]) [f2+1]
toNFA (Kleene e)  = let NFA delta [f] = toNFA e
                    in NFA (delta ++ [(0,'e',f),(f,'e',0)]) [f]
</code>
Apart from relabelling, the code follows the steps from the proof, with one simplification: the Kleene-star case reuses the existing initial and final states instead of introducing new ones.
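To check the construction, we can run ''toNFA'' on our running example. The definitions from above are repeated so the snippet is self-contained; the value ''example'' and its parenthesisation are ours (the constructors '':|'' and '':.'' have the default fixity, so we parenthesise explicitly):

```haskell
-- definitions repeated from above, for a self-contained example
data RegExp = EmptyString
            | Atom Char
            | RegExp :| RegExp
            | RegExp :. RegExp
            | Kleene RegExp
            deriving Show

data NFA = NFA { delta :: [(Int, Char, Int)], fin :: [Int] } deriving Show

relabel :: Int -> NFA -> NFA
relabel i (NFA d f) =
  NFA (map (\(s, c, s') -> (s + i, c, s' + i)) d) (map (+i) f)

toNFA :: RegExp -> NFA
toNFA EmptyString = NFA [(0,'e',1)] [1]
toNFA (Atom c)    = NFA [(0,c,1)] [1]
toNFA (e :. e')   = let NFA delta1 [f1] = toNFA e
                        NFA delta2 [f2] = relabel (f1+1) (toNFA e')
                    in NFA (delta1 ++ delta2 ++ [(f1,'e',f1+1)]) [f2]
toNFA (e :| e')   = let NFA delta1 [f1] = relabel 1 (toNFA e)
                        NFA delta2 [f2] = relabel (f1+1) (toNFA e')
                    in NFA (delta1 ++ delta2 ++
                            [(0,'e',1),(0,'e',f1+1),(f1,'e',f2+1),(f2,'e',f2+1)]) [f2+1]
toNFA (Kleene e)  = let NFA d [f] = toNFA e
                    in NFA (d ++ [(0,'e',f),(f,'e',0)]) [f]

-- (A u B)(a u b)*(0 u 1)*
example :: RegExp
example = ((Atom 'A' :| Atom 'B') :. Kleene (Atom 'a' :| Atom 'b'))
            :. Kleene (Atom '0' :| Atom '1')
```

With this relabelling scheme, ''toNFA example'' produces an automaton with 24 transitions (many of them epsilon transitions) and the single final state 17.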