===== Nondeterministic Automata =====

==== Motivation ====

In the previous lecture we investigated the **semantics** of regular expressions and saw how we can determine the language generated by, e.g., $math[(A\cup B)(a\cup b)*(0 \cup 1)*]. However, it is not straightforward to **compute** whether a given word $math[w] is a member of $math[L(e)], and this is precisely the task of the **lexical stage**. In more formal terms, we have a //generator// - a means to construct a language from a regular expression - but we lack a means for //accepting// (words of) languages.

We shall informally illustrate an algorithm for verifying the membership $math[w \in L((A\cup B)(a\cup b)*(0 \cup 1)*)], in Haskell:

<code haskell>
check ('A':xs) = check1 xs
check ('B':xs) = check1 xs
check _        = False

check1 ('a':xs) = check1 xs
check1 ('b':xs) = check1 xs
check1 ('0':xs) = check2 xs
check1 ('1':xs) = check2 xs
check1 []       = True
check1 _        = False

check2 ('0':xs) = check2 xs
check2 ('1':xs) = check2 xs
check2 []       = True
check2 _        = False
</code>

The algorithm proceeds in **three stages**:
  * in the first stage, we check whether ''A'' or ''B'' is encountered; if so, we move on to the second stage, otherwise we reject;
  * in the second stage, we check whether ''a'', ''b'', ''0'' or ''1'' is encountered; if ''a'' or ''b'' is found, we continue inspection in the second stage; if ''0'' or ''1'' is found, we continue inspection in the third stage; finally, if the string terminates, we report true;
  * in the third stage we consume binary digits in a similar way, again reporting true if the string terminates.

The same strategy can be written in a more elegant way as:

<code haskell>
check w = chk (w ++ "!") [0]
  where chk (x:xs) set
          | (x `elem` ['A', 'B']) && (0 `elem` set) = chk xs [1,2,3]
          | (x `elem` ['a', 'b']) && (1 `elem` set) = chk xs [1,2,3]
          | (x `elem` ['0', '1']) && (2 `elem` set) = chk xs [2,3]
          | (x == '!') && (3 `elem` set)            = True
          | otherwise                               = False
</code>

Here, we have introduced the symbol ''!'' to mark the string termination, and thus make the whole code nicer to write. We have also made the //stage idea// explicit. The procedure ''chk'' maintains a set of //stages// or //states//:
  * $math[0\in set] indicates that we are in the initial stage, where we are looking for ''A'' or ''B'';
  * $math[1\in set] indicates that ''a''s or ''b''s may follow;
  * $math[2\in set] indicates that ''0''s or ''1''s may follow, i.e. the (possibly empty) alphabetic part may be considered finished;
  * $math[3\in set] indicates that the string may also terminate at this point - ''3'' is an //end-stage//.

We start in the initial stage. Whenever a symbol is read, the stage, i.e. the set of possible states, is updated: for instance, when ''0'' or ''1'' is read, only situations 2 and 3 remain possible. The idea behind our code could be expressed as the following diagram:

{{:lfa:example.png|}}

where:
  * each node is a **state**, which indicates the current stage in the recognition of the input word;
  * each arrow is a **transition**, which takes the recognition process from one stage to another;
  * here, $math[Q_0] is the initial state, $math[Q'] is the state from which any lower-case alphanumeric symbol of the alphabet may follow, and $math[Q''] is the state from which only digits are accepted. The string can terminate successfully in both $math[Q'] and $math[Q''], which is shown via double circles.
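To see the checker at work, here are a few sample runs (a sketch of the expected behaviour, assuming either of the two definitions of ''check'' above is loaded in GHCi):

<code haskell>
ghci> check "Aab01"   -- an 'A', then a's and b's, then binary digits
True
ghci> check "B"       -- the starred subexpressions may match the empty word
True
ghci> check "ab01"    -- the leading 'A' or 'B' is missing
False
ghci> check "A01a"    -- letters may not follow digits
False
</code>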
==== Nondeterministic automata ====

The key idea behind the previous algorithm can be generalised to **any** regular expression, and its associated code, written in the same style, yields a similar diagram. In practice, it is the diagram, i.e. the **nondeterministic finite automaton** (NFA), which helps us generate the code.

$def[NFA]
A **non-deterministic finite automaton** is a tuple $math[M=(K,\Sigma,\Delta,q_0,F)] where:
  * $math[K] is a finite set of **states**
  * $math[\Sigma] is an alphabet
  * $math[\Delta] is a **subset** of $math[K \times \Sigma^* \times K] and is called a **transition relation**
  * $math[q_0\in K] is **the initial state**
  * $math[F\subseteq K] is **the set of final states**
$end

As an example, consider:
  * $math[K=\{q_0,q_1,q_2\}]
  * $math[\Sigma=\{0,1\}]
  * $math[\Delta=\{(q_0,0,q_0),(q_0,1,q_0),(q_0,0,q_1),(q_1,1,q_2)\}]
  * $math[q_0] is the initial state
  * $math[F = \{q_2\}]

Notice that the NFA gets stuck for certain inputs, i.e. it **does not accept**: for instance, there is no transition from $math[q_1] on symbol $math[0], and no transitions from $math[q_2] at all.

**Graphical notation**

$def[Configuration]
A **configuration** of an NFA is a **member** of $math[K\times \Sigma^*].
$end

Informally, configurations capture a **snapshot** of the execution of an NFA. The snapshot consists of:
  * the **current state** of the automaton and
  * **the rest of the word** from the input.

For instance, $math[(q_0,0001)] is the **initial configuration** of the automaton from our example, on input $math[0001].

$def[Transition]
We call $math[\vdash_M \subseteq (K\times \Sigma^*) \times (K\times\Sigma^*)] the **one-step** move relation of automaton $math[M]. The relation describes how the automaton **can** move from one configuration to another. Formally:
  * $math[(q,w) \vdash_M (q',w')] if and only if there exists $math[u\in\Sigma^*] such that $math[w=uw'] ($math[u] is a prefix of $math[w]) and $math[(q,u,q')\in\Delta]: from state $math[q], on input $math[u], we reach state $math[q'].
We call $math[\vdash_M^*] the **reflexive and transitive closure of** $math[\vdash_M], i.e. the **zero-or-more step(s)** move relation of automaton $math[M].
$end

For instance, in our previous example, $math[(q_0,0001)\vdash_M(q_0,001)] and also $math[(q_0,0001)\vdash_M(q_1,001)]. At the same time, $math[(q_0,0001)\vdash_M^*(q_2,\epsilon)]. Can you figure out the sequence of steps?

$prop[Acceptance]
A word $math[w] is accepted by an NFA $math[M] iff $math[(q_0,w)\vdash_M^*(q,\epsilon)] and $math[q\in F]. In other words, after the word $math[w] has been processed by the automaton, we reach a **final state**.
$end

Notice that the word $math[0001] is indeed accepted by the automaton $math[M] from our example.

$def[Language accepted by an NFA]
Given an NFA $math[M], we define $math[L(M) = \{w\mid w\text{ is accepted by } M\}] as the language **accepted** by $math[M]. We say $math[M] accepts the language $math[L(M)].
$end

=== Execution tree for Nondeterministic Finite Automata ===

Illustration of an NFA for $math[(A\cup B)(a\cup b)*(0 \cup 1)*]. There are two ways of writing this automaton:
  * one that follows exactly our previous algorithm sketch;
  * one that employs **epsilon transitions**.

**Epsilon transitions** are a means for jumping from one state to another without consuming any input. They are a useful device for defining automata, because they allow us to **combine** several automata into one.

==== Nondeterminism as imperfect information ====

Notice that **nondeterminism** actually refers to our imperfect information regarding the current state of the automaton. **Nondeterminism** means that, after consuming some part (prefix) of a word, //several concrete states may be possible current states//.
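To make these definitions concrete, below is a small Haskell sketch (not part of the original lecture code; the names ''State'', ''Delta'', ''step'' and ''accepts'' are ours) that implements acceptance by exploring configurations along the relation $math[\vdash_M]. Transition labels are strings, so that $math[\epsilon]-transitions can be encoded as the empty string.

<code haskell>
import Data.List (stripPrefix)

-- states are integers; transition labels are strings, so that
-- epsilon transitions can be encoded as "".
type State = Int
type Delta = [(State, String, State)]

-- one step of the move relation |-_M : from configuration (q, w)
-- we may reach (q', w') whenever (q, u, q') is a transition and w = u ++ w'
step :: Delta -> (State, String) -> [(State, String)]
step delta (q, w) =
  [ (q', w') | (p, u, q') <- delta, p == q, Just w' <- [stripPrefix u w] ]

-- acceptance: explore the configurations reachable via |-*_M (breadth-first)
-- and succeed if some final state is reached with an empty remaining word
accepts :: Delta -> [State] -> State -> String -> Bool
accepts delta finals q0 w = go [(q0, w)]
  where
    go []                 = False
    go ((q, rest) : cfgs)
      | null rest && q `elem` finals = True
      | otherwise                    = go (cfgs ++ step delta (q, rest))

-- the example automaton from above, with 0, 1, 2 standing for q0, q1, q2
exampleDelta :: Delta
exampleDelta = [(0, "0", 0), (0, "1", 0), (0, "0", 1), (1, "1", 2)]

-- accepts exampleDelta [2] 0 "0001" == True
-- accepts exampleDelta [2] 0 "0100" == False
</code>

Since every transition of the example automaton consumes a symbol, the exploration always terminates here; in the presence of $math[\epsilon]-cycles one would additionally have to remember the configurations already visited.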
==== From Regular Expressions to NFAs ====

While regular expressions are a natural instrument for declaring (or generating) tokens, NFAs are a **natural instrument for accepting** tokens (i.e. their respective language). The following theorem shows how this can be achieved.

$justtheorem
For every language $math[L(E)] defined by the regular expression $math[E], there exists an NFA $math[M] such that $math[L(M)=L(E)].
$end

This theorem is particularly important because it also provides an **algorithm** for constructing NFAs from regular expressions.

$proof
Let $math[E] be a regular expression. We construct an NFA with:
  * **exactly one initial state**;
  * **exactly one final state**;
  * **no transitions from the final state**.

The proof is by **induction** over the expression structure.

**Basis case $math[E=\emptyset]**

We construct the following automaton:

{{:lfa:emptyset.jpg|}}

It is clear that this automaton accepts no word, and obeys the three aforementioned conditions.

**Basis case $math[E=\epsilon]**

We construct the following automaton:

{{:lfa:emptyword.jpg|}}

which only accepts the empty word.

**Basis case $math[E=c]**, where $math[c] is a symbol of the alphabet.

We construct the following automaton:

{{:lfa:char.jpg|}}

Since there are three //inductive rules// for constructing regular expressions (union, concatenation and Kleene star), we have to treat three induction steps:

**Induction step $math[E=E_1E_2] (concatenation)**

Suppose $math[E_1] and $math[E_2] are regular expressions for which NFAs can be built (**induction hypothesis**). We build the following NFA, which accepts all words generated by the regular expression $math[E_1E_2]:

{{:lfa:concat.jpg|}}

**Induction step $math[E=E_1\cup E_2] (union)**

Suppose $math[E_1] and $math[E_2] are regular expressions for which NFAs can be built (**induction hypothesis**). We build the following NFA, which accepts all words generated by the regular expression $math[E_1\cup E_2]:

{{:lfa:union.jpg|}}

**Induction step $math[E=E_1^*] (Kleene star)**

Suppose $math[E_1] is a regular expression for which an NFA can be built (**induction hypothesis**). We build the following NFA, which accepts all words generated by the regular expression $math[E_1^*]:

{{:lfa:kleene.jpg|}}
$end

We illustrate the algorithmic procedure on our regular expression $math[(A\cup B)(a\cup b)*(0 \cup 1)*]. The result is shown below:

{{:lfa:slide4.jpg|}}

From the proof, a naive algorithm can be easily implemented. We illustrate it in Haskell:

<code haskell>
data RegExp = EmptyString | Atom Char
            | RegExp :| RegExp | RegExp :. RegExp | Kleene RegExp
            deriving Show

data NFA = NFA {delta :: [(Int,Char,Int)], fin :: [Int]} deriving Show
</code>

We use a list-based representation of the transition relation $math[\Delta]. We assume the character ''e'' is reserved for labelling $math[\epsilon]-transitions (i.e. the empty string). Note that the datatype does not include a case for the empty-set expression $math[\emptyset].

<code haskell>
-- the strategy is to increment each state by i
relabel :: Int -> NFA -> NFA
relabel i (NFA delta fin) =
  NFA (map (\(s,c,s') -> (s+i,c,s'+i)) delta) (map (+i) fin)
</code>

Since we have chosen to represent states as integers, we use a re-labelling function to ensure uniqueness. Re-labelling relies on incrementing every state. For instance, by calling ''relabel (f1+1) n'', we ensure that the NFA ''n'' will have its initial state equal to ''f1+1''. Note that, in our code, ''f1'' is the final (and highest-numbered) state of the first automaton, which guarantees that the state sets of the two automata are disjoint.
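For instance (a small, hypothetical example using the types above), shifting a one-transition automaton by 2 renames its states 0 and 1 to 2 and 3:

<code haskell>
ghci> relabel 2 (NFA [(0,'a',1)] [1])
NFA {delta = [(2,'a',3)], fin = [3]}
</code>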
The construction itself is implemented by ''toNFA'':

<code haskell>
toNFA :: RegExp -> NFA
toNFA EmptyString = NFA [(0,'e',1)] [1]
toNFA (Atom c)    = NFA [(0,c,1)] [1]
toNFA (e :. e')   =
  let NFA delta1 [f1] = toNFA e
      NFA delta2 [f2] = relabel (f1+1) (toNFA e')
  in NFA (delta1 ++ delta2 ++ [(f1,'e',f1+1)]) [f2]
toNFA (e :| e')   =
  let NFA delta1 [f1] = relabel 1 (toNFA e)
      NFA delta2 [f2] = relabel (f1+1) (toNFA e')
  in NFA (delta1 ++ delta2 ++ [(0,'e',1), (0,'e',f1+1), (f1,'e',f2+1), (f2,'e',f2+1)]) [f2+1]
toNFA (Kleene e)  =
  let NFA delta [f] = toNFA e
  in NFA (delta ++ [(0,'e',f), (f,'e',0)]) [f]
</code>

Apart from relabelling, the code closely follows the steps from the proof; note, however, that the Kleene-star case simply reuses the initial and final states of the sub-automaton, so the resulting automaton may have transitions leaving its final state.
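As a usage sketch (the name ''exampleNFA'' is ours, not part of the lecture code), the automaton for the running example $math[(A\cup B)(a\cup b)*(0 \cup 1)*] can be built as follows; its transitions and final state can then be inspected via the record selectors ''delta'' and ''fin'':

<code haskell>
-- the running example, written with the constructors defined above
exampleNFA :: NFA
exampleNFA = toNFA ((Atom 'A' :| Atom 'B')
                 :. (Kleene (Atom 'a' :| Atom 'b')
                 :. Kleene (Atom '0' :| Atom '1')))
</code>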