Deterministic automata

Motivation

In the last lecture, we showed that, for each regular expression $ e$, we can construct an NFA $ M$ such that $ L(e) = L(M)$.

While NFAs are a computational instrument for establishing acceptance, they are not an ideal one. Nondeterminism is difficult to translate per se into a programming language, and simulating it is quite inefficient. We illustrate this via our previous example, below. Recall our NFA representation in Haskell:

data NFA = NFA {delta :: [(Int,Char,Int)], fin :: [Int]}

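Given a word $ w$ and an NFA $ nfa$, the following code checks whether $ w\in L(nfa)$.

check :: String -> NFA -> Bool
check w nfa = chk (w ++ "!") [0]
  where
    -- unreachable: the input always ends with the marker '!'
    chk [] _ = False
    chk (x:xs) set
      -- end of input: accept iff some current state is final
      | x == '!'  = any (`elem` fin nfa) set
      | null set  = False
      -- keep only the transitions which leave a current state s on symbol x
      | otherwise = chk xs [s' | s <- set, (s1,c,s') <- delta nfa, s1 == s, c == x]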

At each step, the procedure builds the set of all successor states $ s'$ which can be reached from some current state $ s$ , on current input $ x$ .
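To make the cost of this simulation visible, here is a minimal sketch which records the state set after each symbol. The names `step`, `stateSets`, `accepts` and the automaton `example` are illustrative assumptions, not part of the lecture code.

```haskell
data NFA = NFA { delta :: [(Int, Char, Int)], fin :: [Int] }

-- One step of the simulation: all successors of the current states on symbol x.
step :: NFA -> [Int] -> Char -> [Int]
step nfa set x = [s' | s <- set, (s1, c, s') <- delta nfa, s1 == s, c == x]

-- The state sets visited while reading w, starting from the initial state 0.
stateSets :: NFA -> String -> [[Int]]
stateSets nfa = scanl (step nfa) [0]

-- w is accepted iff the last state set contains a final state.
accepts :: NFA -> String -> Bool
accepts nfa w = any (`elem` fin nfa) (foldl (step nfa) [0] w)

-- Hypothetical NFA for (a|b)*b: state 0 loops on both symbols and
-- "guesses" when the last b is read.
example :: NFA
example = NFA { delta = [(0,'a',0), (0,'b',0), (0,'b',1)], fin = [1] }
```

For instance, `stateSets example "abb"` evaluates to `[[0],[0],[0,1],[0,1]]`: every step scans the whole transition list once for each current state.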

This procedure can be optimised, if we make the following observations:

These observations lead us to Deterministic Finite Automata (DFA). More formally, we can easily obtain the definition for DFAs by enforcing the following restriction on the transition relation $ \Delta$ :

Thus:

Definition (DFA):

A Deterministic Finite Automaton is an NFA, where $ \Delta : K \times \Sigma \rightarrow K$ is a total function. In what follows, we write $ \delta$ instead of $ \Delta$ to refer to the transition function of a DFA.

Definition (Configuration):

The configuration of a DFA is an element of $ K\times\Sigma^*$ .

Definition (One-step move):

We call $ \vdash_M \subseteq (K\times \Sigma^*) \times (K\times \Sigma^*)$ a one-step move relation, over configurations. The relation is defined as follows:

  • $ (q,w) \vdash_M (q',w')$ iff $ w=cw'$ for some symbol $ c$ of the alphabet, and $ \delta(q,c)=q'$

We write $ \vdash_M^*$ to refer to the reflexive and transitive closure of $ \vdash_M$. Hence, $ (q,w)\vdash_M^*(q',w')$ means that DFA $ M$ can reach configuration $ (q',w')$ from configuration $ (q,w)$ in zero or more steps.

Definition (Acceptance):

We say a word $ w$ is accepted by a DFA $ M$ iff $ (q_0,w)\vdash_M^*(q,\epsilon)$ and $ q\in F$ ($ q$ is a final state).
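The three definitions above translate almost literally into code. The following is a sketch under some assumptions: states are `Int`s, the record field `start` and the names `move`, `accepts` and `evenAs` are hypothetical, and `Nothing` stands for the moves which a total transition function would send to a non-accepting sink state.

```haskell
data DFA = DFA { delta :: Int -> Char -> Maybe Int, start :: Int, fin :: [Int] }

-- A configuration (q, w): the current state and the remaining input.
type Config = (Int, String)

-- One-step move: (q, c:w') |- (q', w') iff delta q c = q'.
move :: DFA -> Config -> Maybe Config
move _   (_, [])     = Nothing
move dfa (q, c : w') = fmap (\q' -> (q', w')) (delta dfa q c)

-- Acceptance: (q0, w) |-* (q, epsilon) with q a final state.
accepts :: DFA -> String -> Bool
accepts dfa w = go (start dfa, w)
  where
    go (q, []) = q `elem` fin dfa
    go cfg     = maybe False go (move dfa cfg)

-- Hypothetical example: words over {a,b} with an even number of a's.
evenAs :: DFA
evenAs = DFA { delta = d, start = 0, fin = [0] }
  where
    d 0 'a' = Just 1
    d 1 'a' = Just 0
    d q 'b' = Just q
    d _ _   = Nothing
```

Since each configuration has at most one successor, `accepts` follows a single path and never needs to keep a set of states.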

Example(s)

NFA to DFA transformation

Let $ M=(K,\Sigma, \Delta, q_0, F)$ be an NFA. We assume $ M$ does not contain transitions on words of length larger than 1. If $ (q,w,q')\in\Delta$ for some $ w=c_1\ldots c_n$ of size 2 or more, we construct intermediary states $ q^1, \ldots, q^{n+1}$ as well as transitions $ (q^1,c_1,q^2), \ldots, (q^n,c_n,q^{n+1})$, where $ q=q^1$ and $ q'=q^{n+1}$.
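The splitting of word transitions can be sketched as follows. The encoding is an assumption made for illustration: transitions are labelled by strings, the empty string stands for an $ \epsilon$-transition, states are numbered `0..fresh-1`, and `splitAll` is a hypothetical name.

```haskell
type WordTrans = (Int, String, Int)
type CharTrans = (Int, Maybe Char, Int)   -- Nothing stands for epsilon

-- Replace every transition on a word of length n >= 2 by a chain of n
-- single-symbol transitions through n-1 fresh states.
-- `fresh` is the first unused state number.
splitAll :: Int -> [WordTrans] -> [CharTrans]
splitAll _ [] = []
splitAll fresh ((q, w, q') : rest) = case w of
  []  -> (q, Nothing, q') : splitAll fresh rest
  [c] -> (q, Just c, q') : splitAll fresh rest
  _   -> chain ++ splitAll (fresh + n - 1) rest
  where
    n      = length w
    -- q = q^1, then the fresh states q^2 .. q^n, then q' = q^{n+1}
    states = q : take (n - 1) [fresh ..] ++ [q']
    chain  = zipWith3 (\s c t -> (s, Just c, t)) states w (tail states)
```

For example, `splitAll 2 [(0,"abc",1)]` yields `[(0,Just 'a',2),(2,Just 'b',3),(3,Just 'c',1)]`, using the fresh states 2 and 3.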

We denote by $ E_M(q) = \{p\in K\mid (q,\epsilon)\vdash^*_M (p,\epsilon)\}$ , the $ \epsilon$ -closure of state $ q$ . In effect $ E_M(q)$ contains all states reachable from $ q$ via $ \epsilon$ -transitions. When the automaton $ M$ is understood from the context, we omit the subscript and simply write $ E(q)$ .
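A minimal sketch of how $ E(q)$ could be computed by a simple graph search. It assumes the transition relation is a list of triples in which `Nothing` labels an $ \epsilon$-transition; the names `Trans` and `closure` are hypothetical.

```haskell
import Data.List (sort)

type Trans = (Int, Maybe Char, Int)   -- Nothing labels an epsilon-transition

-- E(q): all states reachable from q using only epsilon-transitions.
-- The result is sorted, so that equal sets have equal representations.
closure :: [Trans] -> Int -> [Int]
closure delta q = sort (go [q] [])
  where
    go [] seen = seen
    go (s : todo) seen
      | s `elem` seen = go todo seen
      | otherwise     = go ([p | (s1, Nothing, p) <- delta, s1 == s] ++ todo) (s : seen)
```

With `delta = [(0,Nothing,1),(1,Nothing,2),(2,Just 'a',3),(2,Nothing,0)]`, `closure delta 0` is `[0,1,2]` and `closure delta 3` is `[3]`. Note that $ q\in E(q)$ always holds, since $ \vdash^*_M$ is reflexive.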

We build the DFA $ M'=(K',\Sigma, \delta, q_0', F')$ as follows:

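  • $ K' = 2^K$: each state of $ M'$ is a set of states of $ M$
  • $ q_0' = E(q_0)$
  • $ \delta(Q,c) = \bigcup \{E(p) \mid q\in Q, (q,c,p)\in\Delta\}$
  • $ F' = \{Q\subseteq K \mid Q\cap F\neq\emptyset\}$: a state of $ M'$ is final iff it contains some final state in $ M$.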
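The subset construction can be sketched in a few lines. This is an illustration under assumptions, not the lecture's `subset` function: DFA states are sorted lists of NFA states, `Nothing` labels an $ \epsilon$-transition, and only the sets reachable from $ E(q_0)$ are built rather than all of $ 2^K$. The names `closure`, `deltaD`, `subsetStates` and `exDelta` are hypothetical.

```haskell
import Data.List (nub, sort)

type Trans  = (Int, Maybe Char, Int)   -- Nothing labels an epsilon-transition
type DState = [Int]                    -- a sorted set of NFA states

-- E(q): states reachable from q via epsilon-transitions only.
closure :: [Trans] -> Int -> [Int]
closure delta q = sort (go [q] [])
  where
    go [] seen = seen
    go (s : todo) seen
      | s `elem` seen = go todo seen
      | otherwise     = go ([p | (s1, Nothing, p) <- delta, s1 == s] ++ todo) (s : seen)

-- delta'(Q, c): the union of E(p) for all q in Q with (q, c, p) in Delta.
deltaD :: [Trans] -> DState -> Char -> DState
deltaD delta qs c =
  sort (nub (concat [closure delta p | q <- qs, (q1, Just c1, p) <- delta, q1 == q, c1 == c]))

-- Returns (reachable states, transitions, final states) of the DFA.
subsetStates :: [Trans] -> [Char] -> Int -> [Int]
             -> ([DState], [(DState, Char, DState)], [DState])
subsetStates delta sigma q0 fins = (states, trans, finals)
  where
    states = go [closure delta q0] []
    go [] seen = reverse seen
    go (s : todo) seen
      | s `elem` seen = go todo seen
      | otherwise     = go (todo ++ [deltaD delta s c | c <- sigma]) (s : seen)
    trans  = [(s, c, deltaD delta s c) | s <- states, c <- sigma]
    finals = [s | s <- states, any (`elem` fins) s]

-- Hypothetical NFA for (a|b)*b, with initial state 0 and final state 1.
exDelta :: [Trans]
exDelta = [(0, Just 'a', 0), (0, Just 'b', 0), (0, Just 'b', 1)]
```

Here `subsetStates exDelta "ab" 0 [1]` produces the two states `[0]` and `[0,1]`, of which only `[0,1]` is final. When no transition applies, `deltaD` returns the empty set, which plays the role of a sink state and keeps the transition function total.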

Correctness of the transformation

Proposition (1):

For all $ q,p\in K$ , $ (q,w)\vdash^*_M (p,\epsilon)$ iff $ (E(q),w)\vdash^*_{M'}(P,\epsilon)$ , for some $ P$ such that $ p\in P$ .

The proposition states that, for each path in NFA $ M$ starting on $ q$ which consumes word $ w$ (hence ends up in configuration $ (p,\epsilon)$ ), there is an equivalent path in the DFA $ M'$ , which starts in the $ \epsilon$ -closure of $ q$ and ends in some state $ P$ which contains $ p$ — and vice-versa.

Proposition 1 is essential for proving the following result:

Let $ M'$ be a DFA constructed from NFA $ M$ according to the above rules. Then $ L(M)=L(M')$.

Proof:

The languages of the two machines coincide if, for all words: $ w\in L(M)$ iff $ w\in L(M')$ , thus:

  • $ (q_0,w)\vdash^*_M (p,\epsilon)$ with $ p\in F$ iff $ (E(q_0),w)\vdash^*_{M'}(P,\epsilon)$ with $ p\in P$ .

The above statement follows immediately from Proposition 1, where:

  • $ q$ is the initial state of $ M$
  • $ p$ is some final state of $ M$

as well as from the definition of $ F'$ .

We now turn to the proof of Proposition 1:

Proof:

The proof is by induction over the length of the word $ w$ .

Basis step: $ |w|=0$, that is $ w=\epsilon$

  • direction $ \implies$ :
    1. Suppose $ (q,\epsilon) \vdash^*_M(p,\epsilon)$ .
    2. From the definition of $ E$ and 1., we have that $ p\in E(q)$ .
    3. Since $ \vdash^*_{M'}$ is reflexive, we have $ (E(q),\epsilon)\vdash^*_{M'}(E(q),\epsilon)$ .
    4. Therefore, we have $ (E(q),\epsilon)\vdash^*_{M'}(P,\epsilon)$ with $ p\in P$: $ P$ is actually $ E(q)$.
  • direction $ \impliedby$ :
    1. Suppose $ (E(q),\epsilon) \vdash^*_{M'} (P,\epsilon)$
    2. Since $ \delta$ does not allow $ \epsilon$ -transitions (and $ \vdash^*_{M'}$ is reflexive), it follows that $ E(q)=P$ .
    3. By the definition of $ E$ , we have that $ (q,\epsilon)\vdash^*_{M}(p,\epsilon)$ for any $ p\in E(q)$ .

Induction hypothesis: suppose that the claim is true for all strings $ w$ such that $ |w|\leq k$, for $ k\geq0$.

Induction step: we prove the claim for any string $ w'$ of length $ k+1$; let $ w'=wa$ (hence $ a$ is the last symbol of $ w'$ and $ |w|=k$).

  • direction $ \implies$ :
    1. Suppose $ (q,wa)\vdash^*_{M} (p,\epsilon)$ .
    2. By the definition of $ \vdash^*$ , we have: $ (q,w)\vdash^*_{M} (r_1,a) \vdash_{M} (r_2,\epsilon)\vdash^*_{M} (p,\epsilon)$ . In other words, there is a path from $ q$ which takes us to $ r_1$ by consuming $ w$ , then to $ r_2$ via a one-step transition, then to $ p$ in zero or more $ \epsilon$ -transitions. Notice that $ p$ may be equal to $ r_2$ , which is taken into account since $ \vdash^*_{M}$ is reflexive.
    3. By the construction of $ \vdash^*_{M}$ , we also have $ (q,w)\vdash^*_{M}(r_1,\epsilon)$
    4. From 3. by induction hypothesis, we have $ (E(q),w)\vdash^*_{M'}(R_1,\epsilon)$ with $ r_1 \in R_1$
    5. By construction of $ \vdash^*_{M'}$ , we have $ (E(q),wa)\vdash^*_{M'}(R_1,a)$
    6. Since $ (r_1,a) \vdash_{M} (r_2,\epsilon)$ and $ r_1\in R_1$, by the definition of $ \delta$, we have $ E(r_2) \subseteq \delta(R_1,a)$.
    7. Since $ (r_2,\epsilon) \vdash^*_{M} (p,\epsilon)$ it follows that $ p \in E(r_2)$, and therefore $ p \in \delta(R_1,a)$.
    8. In effect, from 5. and 7. we have shown that $ (E(q),wa)\vdash^*_{M'}(R_1,a)\vdash_{M'}(R_2,\epsilon)$ with $ R_2=\delta(R_1,a)$ and $ p\in R_2$, which concludes our proof.
  • direction $ \impliedby$ :
    1. Suppose $ (E(q),wa)\vdash^*_{M'}(P,\epsilon)$ .
    2. Since no $ \epsilon$ -transitions are allowed in a DFA, we must have: $ (E(q),wa)\vdash^*_{M'}(R,a)\vdash_{M'}(P,\epsilon)$ .
    3. By construction of $ \vdash^*_{M'}$ : $ (E(q),w)\vdash^*_{M'}(R,\epsilon)$ .
    4. By induction hypothesis, $ (q,w)\vdash^*_{M}(r,\epsilon)$ , with $ r\in R$ .
    5. By construction of $ \vdash^*_{M}$ : $ (q,wa)\vdash^*_{M}(r,a)$ .
    6. Since $ (R,a)\vdash_{M'}(P,\epsilon)$ , by the definition of $ \delta$ , for some $ r\in R$ , $ (r,a,x)\in\Delta$ and $ E(x)\subseteq P$ .
    7. From 6. $ (q,wa)\vdash^*_{M}(r,a)\vdash_{M}(x,\epsilon)$ with $ x\in E(x)\subseteq P$ which completes our proof.

Conclusion

We have shown so far that the problem $ w\in L(e)$ can be algorithmically solved by checking $ w \in L(D)$ , where $ D$ is a DFA obtained via subset construction from $ M$ , and $ M$ is obtained directly from $ e$ .

The algorithmic procedure for $ w \in L(D)$ is actually quite straightforward, and is shown below:

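data DFA = DFA {delta :: Int -> Char -> Maybe Int, fin :: [Int] }

check :: String -> DFA -> Bool
check w dfa = chk (w ++ "!") 0
  where
    -- unreachable: the input always ends with the marker '!'
    chk [] _ = False
    chk (x:xs) state
      -- end of input: accept iff the current state is final
      | x == '!' = state `elem` fin dfa
      -- exactly one successor state to follow
      | Just next <- delta dfa state x = chk xs next
      | otherwise = False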

Writing a lexical analyser from scratch

We now have all the necessary tools to implement a lexical analyser, or scanner, from scratch. We will proceed to implement such an analyser for the language IMP. The input of the scanner consists of two parts:

  1. the spec, containing regular expressions for each possible word which appears at input
  2. the actual word to be scanned

The input 1. is specific to our language IMP, and is directly implemented in Haskell. It consists of a datatype for regular expressions, as well as a datatype describing each possible token. We also implement, for each different token, a function String → Token, which actually returns the token, when a substring is found. To make the code nicer, we include such functions in the DFA datatype.

data RegExp = EmptyString | Atom Char | RegExp :| RegExp | RegExp :. RegExp | Kleene RegExp 
 
plus :: RegExp -> RegExp
plus e = e :. (Kleene e)
 
-- (AUB)(aUb)*(0U1)*
example = ((Atom 'A') :| (Atom 'B')) :.(Kleene ((Atom 'a') :| (Atom 'b'))) :. (Kleene ((Atom '0') :| (Atom '1')))
 
data Token = If | Leq | Tru | Fals | .... | Var String | AVal Integer | BVal Bool | While | OpenPar | ...
 
ifToken = Atom 'i' :. Atom 'f'          -- we can build an auxiliary function for that
f_if :: String -> Token
f_if _ = If

varToken = plus (Atom 'a' :| Atom 'b')  -- [a,b]+
f_var :: String -> Token
f_var s = Var s

intToken = plus (Atom '0' :| Atom '1')  -- [0,1]+
f_int :: String -> Token
f_int s = AVal ((read s) :: Integer)
 
data DFA = DFA {delta :: Int -> Char -> Maybe Int, fin :: [Int], getToken :: String -> Token }

We also recall some of the procedures defined in the previous lectures:

-- converts a regular expression to a NFA
toNFA :: RegExp -> NFA
-- converts an NFA to a DFA
subset :: NFA -> DFA
-- checks if a word is accepted by the DFA
check :: String -> DFA -> Bool
 
-- takes a regular expression and its token function and builds the DFA
convert :: RegExp -> (String -> Token) -> DFA

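The logic of the scanner is as follows. While processing the input, the scanner will maintain the remaining input, the word scanned so far, the current configuration of each DFA, and the tokens found so far.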

The function responsible for scanning is:

lex :: String -> String -> [Config] -> [Token] -> [Token]

In the initial phase, all regular expressions are converted to DFAs, and each DFA is converted to its initial configuration.

type Config = (Int, DFA)

regularExpressions :: [(RegExp, String -> Token)]
regularExpressions = ...

dfas :: [DFA]
dfas = map (\(e,f) -> convert e f) regularExpressions

lexical :: String -> [Token]
lexical w = lex (w ++ "!") "" initial []
  where
    -- every DFA starts in its initial state 0
    initial = map (\a -> (0,a)) dfas

    lex :: String -> String -> [Config] -> [Token] -> [Token]
    -- the end marker was consumed without reaching acceptance: fail
    lex [] _ _ _ = []
    lex (x:xs) crt cfgs tokens
      -- the input ended, and every DFA is in its initial state;
      -- tokens were pushed in front, so we reverse them
      | x == '!' && null [s | (s,_) <- cfgs, s /= 0] = reverse tokens

      -- we found a dfa which accepted; push the token of the first such dfa
      | (a:_) <- [a | (s,a) <- cfgs, s `elem` fin a] = lex (x:xs) "" initial (getToken a crt : tokens)

      -- if no continuing configuration exists, fail
      | null cfgs = []

      -- proceed with the next symbol
      | otherwise = lex xs (crt ++ [x]) [(s',a) | (s,a) <- cfgs, Just s' <- [delta a s x]] tokens
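To try the scanning loop without the rest of the pipeline, the following sketch replaces `convert` by two hand-written DFAs. Everything here is an assumption made for illustration: the reduced `Token` type, the automata `ifDFA` and `intDFA`, and the name `scan`.

```haskell
data Token = If | AVal Integer deriving (Eq, Show)

data DFA = DFA { delta :: Int -> Char -> Maybe Int, fin :: [Int], getToken :: String -> Token }

type Config = (Int, DFA)

-- Hand-written DFA for the keyword "if" (stands in for `convert`).
ifDFA :: DFA
ifDFA = DFA { delta = d, fin = [2], getToken = const If }
  where
    d 0 'i' = Just 1
    d 1 'f' = Just 2
    d _ _   = Nothing

-- Hand-written DFA for (0|1)+.
intDFA :: DFA
intDFA = DFA { delta = d, fin = [1], getToken = AVal . read }
  where
    d _ c | c `elem` "01" = Just 1
    d _ _ = Nothing

dfas :: [DFA]
dfas = [ifDFA, intDFA]

-- Same loop as the scanner above: run all DFAs in lockstep and emit a
-- token as soon as one of them reaches a final state.
scan :: String -> [Token]
scan w = go (w ++ "!") "" initial []
  where
    initial = [(0, a) | a <- dfas]
    go [] _ _ _ = []
    go (x : xs) crt cfgs tokens
      | x == '!' && all ((== 0) . fst) cfgs = reverse tokens
      | (a : _) <- [a | (s, a) <- cfgs, s `elem` fin a] =
          go (x : xs) "" initial (getToken a crt : tokens)
      | null cfgs = []
      | otherwise = go xs (crt ++ [x]) [(s', a) | (s, a) <- cfgs, Just s' <- [delta a s x]] tokens
```

Here `scan "if01"` evaluates to `[If,AVal 0,AVal 1]`. Because a token is emitted at the first acceptance, `01` is split into two tokens; a longest-match policy would have to keep reading after a DFA accepts.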