====== Deterministic automata ======

===== Motivation =====

In the last lecture, we showed that, for **each** regular expression $math[e], we can construct an NFA $math[M] such that $math[L(e) = L(M)]. While NFAs are a computational instrument for deciding acceptance, they **are not an ideal one**. Nondeterminism is difficult to translate, per se, into a programming language, and it is also quite inefficient. We illustrate this via our previous example, below. Recall our NFA representation in Haskell:

  data NFA = NFA {delta :: [(Int, Char, Int)], fin :: [Int]}

The following code checks, for a word $math[w] and an NFA $math[nfa], whether $math[w\in L(nfa)]:

  check :: String -> NFA -> Bool
  check w nfa = chk (w ++ "!") nfa [0]
    where chk (x:xs) nfa set
            | x == '!'  = [s | s <- set, s `elem` fin nfa] /= []
            | set == [] = False
            | otherwise = chk xs nfa [s' | s <- set, (p, c, s') <- delta nfa, p == s, c == x]

At each step, the procedure builds the set of all successor states $math[s'] which can be reached from **some** current state $math[s], on the current input symbol $math[x]. This procedure can be optimised if we make the following observations:
  * we can //pre-compute// all possible //state-combinations// (and rule out those which are inaccessible);
  * in so doing, we will have **only one transition** from one state-combination to another;
  * the number of state-combinations may be exponential in the general case, but we build them only **once**, not on each membership test $math[w\in L(nfa)], as is done in the example above.

These observations lead us to Deterministic Finite Automata (DFA). More formally, we can easily obtain the definition of DFAs by enforcing the following restrictions on the **transition relation** $math[\Delta]:
  * $math[\Delta : K \times \Sigma \rightarrow K],
  * the function $math[\Delta] is **total** (i.e. it is defined for all possible values of its input).

Thus:
  * each transition must occur on a **symbol**,
  * exactly **one** transition is possible for each state-symbol combination.

$def[DFA]
A **Deterministic Finite Automaton** is an NFA where $math[\Delta : K \times \Sigma \rightarrow K] is a **total** function. In what follows, we write $math[\delta] instead of $math[\Delta] to refer to the transition function of a DFA.
$end

$def[Configuration]
A **configuration** of a DFA is an element of $math[K\times\Sigma^*].
$end

$def[One-step move]
We call $math[\vdash_M \subseteq (K\times \Sigma^*) \times (K\times \Sigma^*)] the **one-step move** relation over configurations. The relation is defined as follows:
  * $math[(q,w) \vdash_M (q',w')] iff $math[w=cw'] for some symbol $math[c] of the alphabet, and $math[\delta(q,c)=q'].

We write $math[\vdash_M^*] to refer to the **reflexive and transitive closure** of $math[\vdash_M]. Hence, $math[(q,w)\vdash_M^*(q',w')] means that the DFA $math[M] can reach configuration $math[(q',w')] from configuration $math[(q,w)] in zero or more steps.
$end

$def[Acceptance]
We say a word $math[w] is **accepted** by a DFA $math[M] iff $math[(q_0,w)\vdash_M^*(q,\epsilon)] and $math[q\in F] ($math[q] is a final state).
$end

==== Example(s) ====
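As a small illustration of these definitions (a hypothetical example, not part of the original notes), consider the DFA over $math[\Sigma=\{a,b\}] which accepts exactly the words containing an **even** number of ''a''s. Its (total) transition function and the acceptance test could be sketched in Haskell as:

  -- hypothetical example: state 0 = "even number of 'a's so far" (final), state 1 = "odd"
  deltaEven :: Int -> Char -> Int
  deltaEven 0 'a' = 1
  deltaEven 1 'a' = 0
  deltaEven q  _  = q                          -- 'b' leaves the state unchanged

  acceptsEven :: String -> Bool
  acceptsEven w = foldl deltaEven 0 w == 0     -- run the DFA from q0 = 0; F = {0}

For instance, $math[(0,aab)\vdash_M(1,ab)\vdash_M(0,b)\vdash_M(0,\epsilon)] and $math[0\in F], hence ''aab'' is accepted, while ''ab'' ends in state $math[1\notin F] and is rejected.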
===== NFA to DFA transformation =====

Let $math[M=(K,\Sigma, \Delta, q_0, F)] be an NFA. We assume $math[M] **does not contain transitions on words of length larger than 1**. If $math[(q,w,q')\in\Delta] for some $math[w=c_1\ldots c_n] of size 2 or more, we first replace this transition: we construct intermediary states $math[q^1, \ldots, q^{n+1}], as well as transitions $math[(q^1,c_1,q^2), \ldots, (q^n,c_n,q^{n+1})], where $math[q=q^1] and $math[q'=q^{n+1}].

We denote by $math[E_M(q) = \{p\in K\mid (q,\epsilon)\vdash^*_M (p,\epsilon)\}] the **$math[\epsilon]-closure** of state $math[q]. In effect, $math[E_M(q)] contains **all states reachable from $math[q] via $math[\epsilon]-transitions**. When the automaton $math[M] is understood from the context, we omit the subscript and simply write $math[E(q)].

We build the DFA $math[M'=(K',\Sigma, \delta, q_0', F')] as follows:
  * $math[K'=2^{K}] - each state of $math[M'] is a **subset** of states of the NFA. Some of these states may be unreachable, hence we shall ignore them in our construction;
  * $math[q_0' = E(q_0)];
  * $math[\delta(Q,c) = \displaystyle \bigcup_{q\in Q, (q,c,q')\in\Delta} E(q')] - a transition from $math[Q] on symbol $math[c] ends in the union of the $math[\epsilon]-closures of all states of $math[M] reachable from some member of $math[Q] on symbol $math[c];
  * $math[F'=\{Q\in K'\mid Q \cap F \neq\emptyset\}] - a state of $math[M'] is final iff it contains some final state of $math[M].
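This construction translates quite directly into code. Below is a minimal sketch (not from the lecture) over the list-based NFA representation recalled in the Motivation section; it assumes $math[\epsilon]-transitions are encoded as transitions on a distinguished character ''eps'' (here ''&''), and computes $math[\delta(Q,c)] on demand, so unreachable subsets are simply never built:

  import Data.List (nub, sort)

  eps :: Char
  eps = '&'                      -- assumed encoding of epsilon-labelled transitions

  -- E(Q): all states reachable from the states of Q via epsilon-transitions
  closure :: NFA -> [Int] -> [Int]
  closure nfa qs
    | qs' == qs = qs
    | otherwise = closure nfa qs'
    where qs' = nub (sort (qs ++ [t | (s, c, t) <- delta nfa, c == eps, s `elem` qs]))

  -- delta'(Q, c): one step of the subset construction
  deltaSubset :: NFA -> [Int] -> Char -> [Int]
  deltaSubset nfa qs c = closure nfa (nub [t | (s, c', t) <- delta nfa, c' == c, s `elem` qs])

  -- a state Q of M' is final iff it contains some final state of M
  finalSubset :: NFA -> [Int] -> Bool
  finalSubset nfa qs = any (`elem` fin nfa) qs

Starting from the initial state ''closure nfa [0]'' and repeatedly applying ''deltaSubset'' on every symbol of the alphabet yields exactly the reachable states of $math[M'].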
===== Correctness of the transformation =====

$prop[1]
For all $math[q,p\in K]: $math[(q,w)\vdash^*_M (p,\epsilon)] iff $math[(E(q),w)\vdash^*_{M'}(P,\epsilon)], for some $math[P] such that $math[p\in P].
$end

The proposition states that, for each path in the NFA $math[M] which starts in $math[q] and //consumes// word $math[w] (hence ends in configuration $math[(p,\epsilon)]), there is an //equivalent// path in the DFA $math[M'] which starts in the $math[\epsilon]-closure of $math[q] and ends in some state $math[P] which contains $math[p], and vice-versa.

Proposition 1 is essential for proving the following result:

$justtheorem
Let $math[M'] be the DFA constructed from the NFA $math[M] according to the above rules. Then $math[L(M)=L(M')].
$end

$proof
The languages of the two machines coincide iff, for all words $math[w]: $math[w\in L(M)] iff $math[w\in L(M')], that is:
  * $math[(q_0,w)\vdash^*_M (p,\epsilon)] with $math[p\in F] iff $math[(E(q_0),w)\vdash^*_{M'}(P,\epsilon)] with $math[p\in P].

The above statement follows immediately from Proposition 1, where:
  * $math[q] is the initial state of $math[M],
  * $math[p] is some final state of $math[M],
as well as from the definition of $math[F'].
$end

We now turn to the proof of Proposition 1:

$proof
The proof is by **induction** over the length of the word $math[w].

**Basis step**: $math[\mid w\mid=0], that is, $math[w=\epsilon].

  * //direction $math[\implies]//:
    - Suppose $math[(q,\epsilon) \vdash^*_M(p,\epsilon)].
    - From the definition of $math[E] and 1., we have that $math[p\in E(q)].
    - Since $math[\vdash^*_{M'}] is **reflexive**, we have $math[(E(q),\epsilon)\vdash^*_{M'}(E(q),\epsilon)].
    - Therefore, we have $math[(E(q),\epsilon)\vdash^*_{M'}(P,\epsilon)] with $math[p\in P]: $math[P] is actually $math[E(q)].
  * //direction $math[\impliedby]//:
    - Suppose $math[(E(q),\epsilon) \vdash^*_{M'} (P,\epsilon)].
    - Since $math[\delta] does not allow $math[\epsilon]-transitions (and $math[\vdash^*_{M'}] is reflexive), it follows that $math[E(q)=P].
    - By the definition of $math[E], we have that $math[(q,\epsilon)\vdash^*_{M}(p,\epsilon)] for any $math[p\in E(q)=P].

**Induction hypothesis**: suppose that the claim holds for all words $math[w] such that $math[\mid w\mid\leq k], for some $math[k\geq0].

**Induction step**: we prove the claim for any word of length $math[k+1]; we write it as $math[w'=wa], where $math[\mid w\mid = k] and $math[a] is the last symbol of $math[w'].

  * //direction $math[\implies]//:
    - Suppose $math[(q,wa)\vdash^*_{M} (p,\epsilon)].
    - By the definition of $math[\vdash^*_M], we have: $math[(q,wa)\vdash^*_{M} (r_1,a) \vdash_{M} (r_2,\epsilon)\vdash^*_{M} (p,\epsilon)]. In other words, there is a //path// from $math[q] which takes us to $math[r_1] by consuming $math[w], then to $math[r_2] via a **one-step** transition on $math[a], and finally to $math[p] in zero or more $math[\epsilon]-transitions. Notice that $math[p] may be equal to $math[r_2], which is taken into account since $math[\vdash^*_{M}] is reflexive.
    - By the construction of $math[\vdash^*_{M}], we also have $math[(q,w)\vdash^*_{M}(r_1,\epsilon)].
    - From 3., by the **induction hypothesis**, we have $math[(E(q),w)\vdash^*_{M'}(R_1,\epsilon)] with $math[r_1 \in R_1].
    - By construction of $math[\vdash^*_{M'}], we have $math[(E(q),wa)\vdash^*_{M'}(R_1,a)].
    - Since $math[(r_1,a) \vdash_{M} (r_2,\epsilon)], by the definition of $math[\delta], we have $math[E(r_2) \subseteq \delta(R_1,a)].
    - Since $math[(r_2,\epsilon) \vdash^*_{M} (p,\epsilon)], it follows that $math[p \in E(r_2)], and therefore $math[p \in \delta(R_1,a)].
    - In effect, from 5. and 7. we have shown that $math[(E(q),wa)\vdash^*_{M'}(R_1,a)\vdash_{M'}(R_2,\epsilon)], where $math[R_2=\delta(R_1,a)] and $math[p\in R_2], which concludes this direction.
  * //direction $math[\impliedby]//:
    - Suppose $math[(E(q),wa)\vdash^*_{M'}(P,\epsilon)] with $math[p\in P].
    - Since no $math[\epsilon]-transitions are possible in a DFA, we must have: $math[(E(q),wa)\vdash^*_{M'}(R,a)\vdash_{M'}(P,\epsilon)].
    - By construction of $math[\vdash^*_{M'}]: $math[(E(q),w)\vdash^*_{M'}(R,\epsilon)].
    - By the **induction hypothesis**, $math[(q,w)\vdash^*_{M}(r,\epsilon)], for each $math[r\in R].
    - By construction of $math[\vdash^*_{M}]: $math[(q,wa)\vdash^*_{M}(r,a)].
    - Since $math[(R,a)\vdash_{M'}(P,\epsilon)], we have $math[P=\delta(R,a)]; hence, by the definition of $math[\delta], there exist $math[r\in R] and $math[(r,a,x)\in\Delta] such that $math[p\in E(x)].
    - From 5. and 6., $math[(q,wa)\vdash^*_{M}(r,a)\vdash_{M}(x,\epsilon)\vdash^*_{M}(p,\epsilon)] (the last steps are $math[\epsilon]-transitions, since $math[p\in E(x)]), which completes our proof.
$end

===== Conclusion =====

We have shown so far that the problem $math[w\in L(e)] can be solved algorithmically, by checking $math[w \in L(D)], where $math[D] is a DFA obtained via the subset construction from $math[M], and $math[M] is an NFA obtained directly from $math[e]. The algorithmic procedure for checking $math[w \in L(D)] is actually quite straightforward, and is shown below:

  data DFA = DFA {delta :: Int -> Char -> Maybe Int, fin :: [Int]}

  check :: String -> DFA -> Bool
  check w dfa = chk (w ++ "!") dfa 0
    where chk (x:xs) dfa state
            | (x == '!') && (state `elem` fin dfa) = True
            | Just next <- delta dfa state x       = chk xs dfa next
            | otherwise                            = False
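As a quick usage example (the DFA below is hypothetical, not taken from the lecture), here is a DFA for the binary words ending in ''0'', together with two membership tests:

  -- state 0: the word read so far does not end in '0'; state 1: it does (the only final state)
  dfa0 :: DFA
  dfa0 = DFA d [1]
    where d 0 '0' = Just 1
          d 0 '1' = Just 0
          d 1 '0' = Just 1
          d 1 '1' = Just 0
          d _  _  = Nothing   -- any other symbol (including the end-marker '!') has no transition

  -- check "10" dfa0 == True
  -- check "01" dfa0 == False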
===== Writing a lexical analyser from scratch =====

We now have all the necessary tools to implement a **lexical analyser**, or **scanner**, from scratch. We will proceed to implement such an analyser for the language IMP. The input of the scanner consists of two parts:
  - the //spec//, containing a regular expression for each possible word which may appear in the input;
  - the actual word to be scanned.

Input 1. is specific to our language IMP, and is directly implemented in Haskell. It consists of a datatype for regular expressions, as well as a datatype describing each possible token. We also implement, for each different token, a function ''String -> Token'', which actually returns the token when a matching substring is found. To make the code nicer, we include such functions in the ''DFA'' datatype.

  data RegExp = EmptyString | Atom Char | RegExp :| RegExp | RegExp :. RegExp | Kleene RegExp

  plus :: RegExp -> RegExp
  plus e = e :. (Kleene e)

  -- (A U B)(a U b)*(0 U 1)*
  example = ((Atom 'A') :| (Atom 'B')) :. (Kleene ((Atom 'a') :| (Atom 'b'))) :. (Kleene ((Atom '0') :| (Atom '1')))

  data Token = If | Leq | Tru | Fals | ... | Var String | AVal Integer | BVal Bool | While | OpenPar | ...

  -- the keyword "if"
  ifToken = (Atom 'i') :. (Atom 'f')
  -- we can build an auxiliary function for that
  f_if :: String -> Token
  f_if _ = If

  -- [a,b]+
  varToken = plus ((Atom 'a') :| (Atom 'b'))
  f_var :: String -> Token
  f_var s = Var s

  -- [0,1]+
  intToken = plus ((Atom '0') :| (Atom '1'))
  f_int :: String -> Token
  f_int s = AVal ((read s) :: Integer)

  data DFA = DFA {delta :: Int -> Char -> Maybe Int, fin :: [Int], getToken :: String -> Token}

We also recall some of our previously-defined procedures, already shown in the previous lectures:

  -- converts a regular expression to an NFA
  toNFA :: RegExp -> NFA
  -- converts an NFA to a DFA (the subset construction)
  subset :: NFA -> DFA
  -- checks if a word is accepted by the DFA
  check :: String -> DFA -> Bool
  -- takes a regular expression and its token function and builds the DFA
  convert :: RegExp -> (String -> Token) -> DFA
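The lecture only gives the type of ''convert''. A plausible definition is sketched below, under the assumption that ''subset'' already produces the transition function and the final states, so that ''convert'' merely attaches the token-building function:

  -- a sketch, not the official implementation
  convert :: RegExp -> (String -> Token) -> DFA
  convert e f = DFA (delta d) (fin d) f     -- attach the token-building function
    where d = subset (toNFA e)              -- regular expression -> NFA -> DFA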
The logic of the scanner is as follows. While processing the input, the scanner maintains:
  * the rest of the (yet-unprocessed) input (i.e. ''(x:xs)'');
  * the word which has been read so far, but whose token has not yet been identified (i.e. ''crt'');
  * a list of //configurations//, i.e. pairs //(current state, automaton)//, one for each regular expression which may be matched in the input;
  * the list of tokens which were found so far.

The function responsible for scanning is:

  lex :: String -> String -> [Config] -> [Token] -> [Token]

In the initial phase, all regular expressions are converted to DFAs, and each DFA is placed in its initial configuration. Then:
  * whenever the input is consumed (''x == '!''') and **all** DFAs are in their initial state (''(filter (\(s,_) -> s /= 0) cfgs) == []''), return the list of tokens;
  * whenever some DFA is in a final state, take the **first** such DFA (''(a:_) <- [a | (s,a) <- cfgs, s `elem` fin a]''), build its respective token from the scanned word (''getToken a crt'') and add it to the list of tokens. The search process continues after:
    * resetting all configurations to the initial ones;
    * resetting the scanned (but unmatched) current word;
  * whenever no DFA is in a final state, we simply //move// each DFA to its successor state, and rule out configurations in which a sink state was reached;
  * whenever no configuration is left, no regular expression matches the current input, and the scanning stops by returning the empty list of tokens.

  type Config = (Int, DFA)

  regularExpressions :: [(RegExp, String -> Token)]
  regularExpressions = ...

  dfas :: [DFA]
  dfas = map (\(e, f) -> convert e f) regularExpressions

  lexical :: String -> [Token]
  lexical w = lex (w ++ "!") "" (map (\a -> (0, a)) dfas) []
    where
      lex :: String -> String -> [Config] -> [Token] -> [Token]
      lex (x:xs) crt cfgs tokens
        -- the input ended, and all DFAs are back in their initial state
        | (x == '!') && (filter (\(s, _) -> s /= 0) cfgs) == [] = tokens
        -- some DFA accepted; push the token of the first such DFA and restart
        -- from the initial configurations (tokens accumulate in reverse order)
        | (a:_) <- [a' | (s, a') <- cfgs, s `elem` fin a'] =
            lex (x:xs) "" (map (\d -> (0, d)) dfas) ((getToken a crt) : tokens)
        -- if no continuing configuration exists, fail
        | cfgs == [] = []
        -- otherwise, proceed with the next symbol
        | otherwise =
            lex xs (crt ++ [x]) [(s', a) | (s, a) <- cfgs, Just s' <- [delta a s x]] tokens
      -- the input ended while an unmatched word was still being scanned: fail
      lex [] _ _ _ = []
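Note that ''lex'' accumulates tokens with ''(:)'', so ''lexical'' returns them in reverse order of discovery. If the original order is needed, a small wrapper (a hypothetical addition, not part of the lecture) suffices:

  -- restore the order in which tokens were discovered in the input
  lexicalInOrder :: String -> [Token]
  lexicalInOrder = reverse . lexical

For instance, if the spec consisted only of the ''ifToken''/''f_if'' pair above, ''lexicalInOrder "if"'' would evaluate to ''[If]''.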