Deterministic automata

Motivation

In the last lecture, we showed that, for each regular expression $ e$ , we can construct an NFA $ M$ such that $ L(e) = L(M)$ .

While NFAs are a computational instrument for establishing acceptance, they are not an ideal one. Nondeterminism is difficult to translate directly into a programming language, and simulating it is quite inefficient. We illustrate this with our previous example. Recall our NFA representation in Haskell:

data NFA = NFA {delta :: [(Int,Char,Int)], fin :: [Int]}

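Given a word $ w$ and an NFA $ nfa$ , the following code checks whether $ w\in L(nfa)$ . The symbol '!' is appended to mark the end of the input, so we assume it does not occur in $ w$ .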

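check :: String -> NFA -> Bool
check w nfa = chk (w++"!") [0]
  where
    -- set holds all the states the NFA may currently be in; state 0 is initial
    chk (x:xs) set
      | x == '!'  = [s | s <- set, s `elem` fin nfa] /= []
      | set == [] = False
      -- keep only the transitions which leave s on the current symbol x
      | otherwise = chk xs [s' | s <- set, (q,c,s') <- delta nfa, q == s, c == x]
    chk [] _ = False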

At each step, the procedure builds the set of all successor states $ s'$ which can be reached from some current state $ s$ , on current input $ x$ .
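To make the set-of-states simulation concrete, here is a minimal self-contained sketch that runs it on a small hand-written automaton. The NFA `abNFA` (words over {a,b} ending in "ab") is a hypothetical example and not part of the lecture; as in the text, state 0 is assumed to be initial and '!' is assumed not to occur in the input.

```haskell
data NFA = NFA { delta :: [(Int, Char, Int)], fin :: [Int] }

-- Set-of-states simulation: the third argument of chk lists every state
-- the NFA may currently be in. '!' marks the end of the input.
check :: String -> NFA -> Bool
check w nfa = chk (w ++ "!") [0]
  where
    chk (x:xs) set
      | x == '!'  = any (`elem` fin nfa) set
      | null set  = False
      | otherwise = chk xs [s' | s <- set, (q, c, s') <- delta nfa, q == s, c == x]
    chk [] _ = False

-- Hypothetical example: words over {a,b} ending in "ab".
-- State 0 loops on both symbols and "guesses" where the final "ab" starts.
abNFA :: NFA
abNFA = NFA
  { delta = [(0,'a',0), (0,'b',0), (0,'a',1), (1,'b',2)]
  , fin   = [2]
  }

-- For the word "aab" the successive state lists are
-- [0], [0,1], [0,1], [0,2]; the last one contains the final state 2.
```

With these definitions, `check "aab" abNFA` evaluates to `True` and `check "aba" abNFA` to `False`. Note that the list of successor states is rebuilt for every symbol of every word, which is exactly the cost the observations that follow aim to avoid.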

This procedure can be optimised, if we make the following observations:

we can pre-compute all possible state combinations (and rule out those which are inaccessible);
in so doing, we obtain exactly one transition from each state combination on each symbol;
the number of state combinations may be exponential in the general case, but we build them only once, instead of rebuilding them for each word on each test $ w\in L(nfa)$ , as is done in the above example.

These observations lead us to Deterministic Finite Automata (DFA). More formally, we can easily obtain the definition for DFAs by enforcing the following restriction on the transition relation $ \Delta$ :

$ \Delta : K \times \Sigma \rightarrow K$ ,
the function $ \Delta$ is total (i.e. it is defined for all possible values in its input).

Thus:

each transition must occur on exactly one symbol (there are no $ \epsilon$ -transitions),
exactly one transition is possible for each state-symbol combination.

Definition (DFA):

A Deterministic Finite Automaton (DFA) is an NFA where $ \Delta : K \times \Sigma \rightarrow K$ is a total function. In what follows, we write $ \delta$ instead of $ \Delta$ to refer to the transition function of a DFA.

Definition (Configuration):

The configuration of a DFA is an element of $ K\times\Sigma^*$ .

Definition (One-step move):

We call $ \vdash_M \subseteq (K\times \Sigma^*) \times (K\times \Sigma^*)$ a one-step move relation, over configurations. The relation is defined as follows:

$ (q,w) \vdash_M (q',w')$ iff $ w=cw'$ for some symbol $ c$ of the alphabet, and $ \delta(q,c)=q'$

We write $ \vdash_M^*$ to refer to the reflexive and transitive closure of $ \vdash_M$ . Hence, $ (q,w)\vdash_M^*(q',w')$ means that DFA $ M$ can reach configuration $ (q',w')$ from configuration $ (q,w)$ in zero or more steps.

Definition (Acceptance):

We say a word $ w$ is accepted by a DFA $ M$ iff $ (q_0,w)\vdash_M^*(q,\epsilon)$ and $ q\in F$ ($ q$ is a final state).

Example(s)
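As a small worked illustration of the definitions above, consider a hypothetical DFA $ M$ (not taken from the lecture) with $ K=\{q_0,q_1\}$ , $ \Sigma=\{a,b\}$ , $ F=\{q_1\}$ , and $ \delta(q,a)=q_0$ , $ \delta(q,b)=q_1$ for both states $ q$ . It accepts exactly the words which end in $ b$ . The one-step moves on the words $ aab$ and $ ba$ are:

```latex
\begin{align*}
(q_0, aab) &\vdash_M (q_0, ab) \vdash_M (q_0, b) \vdash_M (q_1, \epsilon),
  & q_1 &\in F \quad \text{(accepted)} \\
(q_0, ba)  &\vdash_M (q_1, a) \vdash_M (q_0, \epsilon),
  & q_0 &\notin F \quad \text{(rejected)}
\end{align*}
```

Since $ \delta$ is a total function, each configuration with a non-empty word has exactly one successor, so each word determines a single sequence of moves.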

NFA to DFA transformation

Let $ M=(K,\Sigma, \Delta, q_0, F)$ be an NFA. We assume $ M$ does not contain transitions on words of length larger than 1. This assumption loses no generality: if $ (q,w,q')\in\Delta$ for some $ w=c_1\ldots c_n$ of size 2 or more, we replace this transition by new intermediary states $ q^2, \ldots, q^{n}$ and the transitions $ (q^1,c_1,q^2), \ldots, (q^n,c_n,q^{n+1})$ , where $ q=q^1$ and $ q'=q^{n+1}$ .
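The splitting of a word transition into single-symbol transitions can be sketched as follows. This is only an illustration under stated assumptions: transitions on words are encoded as `(Int, String, Int)`, which is not the lecture's NFA type, and the names `splitWord` and `fresh` (the first state number not yet in use) are hypothetical.

```haskell
-- Replace a transition (q, c_1...c_n, q') by a chain of n single-symbol
-- transitions through n-1 fresh intermediary states.
-- Returns the new transitions and the next unused state number.
-- A transition on the empty word yields no transition at all here,
-- since epsilon cannot be expressed with a Char label.
splitWord :: Int -> (Int, String, Int) -> ([(Int, Char, Int)], Int)
splitWord fresh (q, w, q') =
  (zip3 (q : middle) w (middle ++ [q']), fresh + length middle)
  where
    -- the intermediary states q^2 .. q^n
    middle = take (length w - 1) [fresh ..]
```

For instance, `splitWord 10 (0, "abc", 1)` evaluates to `([(0,'a',10),(10,'b',11),(11,'c',1)], 12)`, while a transition on a single symbol is returned unchanged.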

We denote by $ E_M(q) = \{p\in K\mid (q,\epsilon)\vdash^*_M (p,\epsilon)\}$ , the $ \epsilon$ -closure of state $ q$ . In effect $ E_M(q)$ contains all states reachable from $ q$ via $ \epsilon$ -transitions. When the automaton $ M$ is understood from the context, we omit the subscript and simply write $ E(q)$ .

We build the DFA $ M'=(K',\Sigma, \delta, q_0', F')$ as follows:

$ K'=2^{K}$ - each state of $ M'$ is a subset of states of the NFA. Some such states may not be reachable from the initial state, hence we omit them from our construction;
$ q_0' = E(q_0)$ ;
$ \delta(Q,c) = \displaystyle \bigcup_{q\in Q, (q,c,q')\in\Delta} E(q')$ - a transition from $ Q$ on symbol $ c$ ends in the union of all $ \epsilon$ -closures of states in $ M$ reachable from some member of $ Q$ on symbol $ c$ ;
$ F'=\{Q\in K'\mid Q \cap F \neq\emptyset\}$ - a state is final in $ M'$ iff it contains some final state of $ M$ .
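The construction can be sketched in Haskell as below. This is a minimal illustration, not the `subset` function of the lecture: it assumes an NFA encoding in which `Nothing` labels an $ \epsilon$ -transition, and the names `Trans`, `closure`, `move`, `subsetConstruction`, `isFinal` and `endsInAb` are hypothetical. Only the subsets reachable from $ E(q_0)$ are built.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)
import Data.List (nub)

-- Assumed encoding for this sketch: Nothing labels an epsilon-transition.
type Trans = (Int, Maybe Char, Int)

-- Epsilon-closure of a set of states (E(q) is the closure of {q}).
closure :: [Trans] -> Set Int -> Set Int
closure ts qs
  | next `Set.isSubsetOf` qs = qs
  | otherwise                = closure ts (Set.union qs next)
  where
    next = Set.fromList [q' | (q, Nothing, q') <- ts, q `Set.member` qs]

-- delta(Q, c): union of E(q') over all (q, c, q') with q in Q.
move :: [Trans] -> Set Int -> Char -> Set Int
move ts qs c =
  closure ts (Set.fromList [q' | (q, Just c', q') <- ts, c' == c, q `Set.member` qs])

-- Worklist exploration of the reachable subsets, starting from E(q0).
-- Returns the DFA states (in discovery order) and the DFA transitions.
subsetConstruction :: [Trans] -> [Char] -> Int -> ([Set Int], [(Set Int, Char, Set Int)])
subsetConstruction ts sigma q0 = go [start] [start] []
  where
    start = closure ts (Set.singleton q0)
    go seen [] edges = (seen, edges)
    go seen (s:rest) edges =
      let out = [(s, c, move ts s c) | c <- sigma]
          new = nub [t | (_, _, t) <- out, t `notElem` seen]
      in go (seen ++ new) (rest ++ new) (edges ++ out)

-- A DFA state is final iff it contains some final state of the NFA.
isFinal :: [Int] -> Set Int -> Bool
isFinal fs s = any (`Set.member` s) fs

-- Hypothetical NFA: words over {a,b} ending in "ab", final state 2.
endsInAb :: [Trans]
endsInAb = [(0, Just 'a', 0), (0, Just 'b', 0), (0, Just 'a', 1), (1, Just 'b', 2)]
```

For `subsetConstruction endsInAb "ab" 0`, only three of the eight subsets of {0,1,2} are reachable, namely {0}, {0,1} and {0,2}, and only {0,2} is final. The empty set, when reachable, plays the role of the sink state, which keeps the transition function total.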

Correctness of the transformation

Proposition (1):

For all $ q,p\in K$ , $ (q,w)\vdash^*_M (p,\epsilon)$ iff $ (E(q),w)\vdash^*_{M'}(P,\epsilon)$ , for some $ P$ such that $ p\in P$ .

The proposition states that, for each path in NFA $ M$ starting on $ q$ which consumes word $ w$ (hence ends up in configuration $ (p,\epsilon)$ ), there is an equivalent path in the DFA $ M'$ , which starts in the $ \epsilon$ -closure of $ q$ and ends in some state $ P$ which contains $ p$ — and vice-versa.

Proposition 1 is essential for proving the following result:

Let $ M'$ be a DFA constructed from NFA $ M$ according to the above rules. Then $ L(M)=L(M')$ .

Proof:

The languages of the two machines coincide if, for all words $ w$ , $ w\in L(M)$ iff $ w\in L(M')$ , that is:

$ (q_0,w)\vdash^*_M (p,\epsilon)$ for some $ p\in F$ iff $ (E(q_0),w)\vdash^*_{M'}(P,\epsilon)$ for some $ P\in F'$ .

The above statement follows immediately from Proposition 1, where:

$ q$ is the initial state of $ M$

$ p$ is some final state of $ M$

as well as from the definition of $ F'$ .

We now turn to the proof of Proposition 1:

Proof:

The proof is by induction over the length of the word $ w$ .

Basis step: $ \mid w\mid=0$ that is $ w=\epsilon$

direction $ \implies$ :

1. Suppose $ (q,\epsilon) \vdash^*_M(p,\epsilon)$ .

2. From the definition of $ E$ and 1., we have that $ p\in E(q)$ .

3. Since $ \vdash^*_{M'}$ is reflexive, we have $ (E(q),\epsilon)\vdash^*_{M'}(E(q),\epsilon)$ .

4. Therefore, we have $ (E(q),\epsilon)\vdash^*_{M'}(P,\epsilon)$ with $ p\in P$ , where $ P$ is actually $ E(q)$ .

direction $ \impliedby$ :

1. Suppose $ (E(q),\epsilon) \vdash^*_{M'} (P,\epsilon)$ with $ p\in P$ .

2. Since $ \delta$ does not allow $ \epsilon$ -transitions, $ M'$ cannot move without consuming a symbol; as no symbol is consumed here, the two configurations coincide (recall that $ \vdash^*_{M'}$ is reflexive), so $ E(q)=P$ .

3. By the definition of $ E$ , we have that $ (q,\epsilon)\vdash^*_{M}(p,\epsilon)$ for any $ p\in E(q)=P$ .

Induction hypothesis: suppose that the claim is true for all strings $ w$ such that $ \mid w\mid\leq k$ , for some $ k\geq0$ .

Induction step: we prove the claim for any string $ w'$ of length $ k+1$ ; let $ w'=wa$ , where $ \mid w\mid=k$ and $ a$ is the last symbol of $ w'$ .

direction $ \implies$ :

1. Suppose $ (q,wa)\vdash^*_{M} (p,\epsilon)$ .

2. By the definition of $ \vdash^*$ , we have: $ (q,wa)\vdash^*_{M} (r_1,a) \vdash_{M} (r_2,\epsilon)\vdash^*_{M} (p,\epsilon)$ . In other words, there is a path from $ q$ which takes us to $ r_1$ by consuming $ w$ , then to $ r_2$ via a one-step transition on $ a$ , then to $ p$ in zero or more $ \epsilon$ -transitions. Notice that $ p$ may be equal to $ r_2$ , which is taken into account since $ \vdash^*_{M}$ is reflexive.

3. By the construction of $ \vdash^*_{M}$ , we also have $ (q,w)\vdash^*_{M}(r_1,\epsilon)$ .

4. From 3., by the induction hypothesis, we have $ (E(q),w)\vdash^*_{M'}(R_1,\epsilon)$ with $ r_1 \in R_1$ .

5. By construction of $ \vdash^*_{M'}$ , we have $ (E(q),wa)\vdash^*_{M'}(R_1,a)$ .

6. Since $ (r_1,a) \vdash_{M} (r_2,\epsilon)$ and $ r_1\in R_1$ , by the definition of $ \delta$ , we have $ E(r_2) \subseteq \delta(R_1,a)$ .

7. Since $ (r_2,\epsilon) \vdash^*_{M} (p,\epsilon)$ , it follows that $ p \in E(r_2)$ , and therefore $ p \in \delta(R_1,a)$ .

8. Let $ R_2=\delta(R_1,a)$ . From 5. and 7., we have shown that $ (E(q),wa)\vdash^*_{M'}(R_1,a)\vdash_{M'}(R_2,\epsilon)$ with $ p\in R_2$ , which concludes our proof.

direction $ \impliedby$ :

1. Suppose $ (E(q),wa)\vdash^*_{M'}(P,\epsilon)$ with $ p\in P$ .

2. Since no $ \epsilon$ -transitions are allowed in a DFA, we must have: $ (E(q),wa)\vdash^*_{M'}(R,a)\vdash_{M'}(P,\epsilon)$ .

3. By construction of $ \vdash^*_{M'}$ : $ (E(q),w)\vdash^*_{M'}(R,\epsilon)$ .

4. Since $ (R,a)\vdash_{M'}(P,\epsilon)$ , we have $ P=\delta(R,a)$ . As $ p\in P$ , by the definition of $ \delta$ there exist some $ r\in R$ and some state $ x$ such that $ (r,a,x)\in\Delta$ and $ p\in E(x)$ .

5. From 3., by the induction hypothesis applied to $ r\in R$ , we have $ (q,w)\vdash^*_{M}(r,\epsilon)$ .

6. By construction of $ \vdash^*_{M}$ : $ (q,wa)\vdash^*_{M}(r,a)$ .

7. From 4. and 6., $ (q,wa)\vdash^*_{M}(r,a)\vdash_{M}(x,\epsilon)\vdash^*_{M}(p,\epsilon)$ , which completes our proof.

Conclusion

We have shown so far that the problem $ w\in L(e)$ can be algorithmically solved by checking $ w \in L(D)$ , where $ D$ is a DFA obtained via subset construction from $ M$ , and $ M$ is obtained directly from $ e$ .

The algorithmic procedure for $ w \in L(D)$ is actually quite straightforward, and is shown below:

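data DFA = DFA {delta :: Int -> Char -> Maybe Int, fin :: [Int] }

check :: String -> DFA -> Bool
check w dfa = chk (w++"!") 0
  where
    chk (x:xs) state
      -- the end marker was reached: accept iff the current state is final
      | x == '!' = state `elem` fin dfa
      -- exactly one successor state exists: follow it
      | Just next <- delta dfa state x = chk xs next
      -- no transition is defined (sink state): reject
      | otherwise = False
    chk [] _ = False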
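As a usage illustration, the following self-contained sketch runs the acceptance procedure on a hand-written automaton. The DFA `evenAs` (words over {a,b} with an even number of a's) is a hypothetical example. We also assume that a `Nothing` result of `delta` stands for a missing transition, i.e. a move into a sink state, which is one way to read the `Maybe` in the type.

```haskell
data DFA = DFA { delta :: Int -> Char -> Maybe Int, fin :: [Int] }

-- Acceptance test: a single current state is tracked, instead of a set.
check :: String -> DFA -> Bool
check w dfa = chk (w ++ "!") 0
  where
    chk (x:xs) state
      | x == '!'                       = state `elem` fin dfa
      | Just next <- delta dfa state x = chk xs next
      | otherwise                      = False
    chk [] _ = False

-- Hypothetical example: an even number of a's; state 0 is initial and final.
evenAs :: DFA
evenAs = DFA { delta = step, fin = [0] }
  where
    step 0 'a' = Just 1
    step 1 'a' = Just 0
    step s 'b' = Just s
    step _ _   = Nothing   -- any other symbol leads to the sink
```

Here `check "abab" evenAs` evaluates to `True`, while `check "ab" evenAs` and `check "ac" evenAs` evaluate to `False`. Each input symbol costs one call to `delta`, independently of the number of states of the automaton.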

Writing a lexical analyser from scratch

We now have all the necessary tools to implement a lexical analyser, or scanner, from scratch. We will implement such an analyser for the language IMP. The input of the scanner consists of two parts:

the spec, containing regular expressions for each possible word which appears at input
the actual word to be scanned

The spec is specific to our language IMP, and is directly implemented in Haskell. It consists of a datatype for regular expressions, as well as a datatype describing each possible token. We also implement, for each different token, a function String -> Token, which returns the token when a matching substring is found. To make the code nicer, we include such functions in the DFA datatype.

data RegExp = EmptyString | Atom Char | RegExp :| RegExp | RegExp :. RegExp | Kleene RegExp 
 
plus :: RegExp -> RegExp
plus e = e :. (Kleene e)
 
-- (AUB)(aUb)*(0U1)*
example = ((Atom 'A') :| (Atom 'B')) :.(Kleene ((Atom 'a') :| (Atom 'b'))) :. (Kleene ((Atom '0') :| (Atom '1')))
 
data Token = If | Leq | Tru | Fals | .... | Var String | AVal Integer | BVal Bool | While | OpenPar | ...
 
ifToken = Atom 'i' :. Atom 'f'          -- we can build an auxiliary function for that
f_if :: String -> Token
f_if _ = If

varToken = plus (Atom 'a' :| Atom 'b')  -- [a,b]+
f_var :: String -> Token
f_var s = Var s

intToken = plus (Atom '0' :| Atom '1')  -- [0,1]+
f_int :: String -> Token
f_int s = AVal ((read s) :: Integer)
 
data DFA = DFA {delta :: Int -> Char -> Maybe Int, fin :: [Int], getToken :: String -> Token }

We also recall some procedures defined in the previous lectures:

-- converts a regular expression to a NFA
toNFA :: RegExp -> NFA
-- converts an NFA to a DFA
subset :: NFA -> DFA
-- checks if a word is accepted by the DFA
check :: String -> DFA -> Bool
 
-- takes a regular expression and its token function and builds the DFA
convert :: RegExp -> (String -> Token) -> DFA

The logic of the scanner is as follows. While processing the input, the scanner will maintain:

the rest of the (yet-unprocessed) input (i.e. (x:xs))
the word which has been read so far, but whose token was not yet identified (i.e. crt)
a list of configurations, i.e. pairs: (current state, automaton), for each regular expression which may be matched in the input
a list of tokens which were found so far

The function responsible for scanning is:

lex :: String -> String -> [Config] -> [Token] -> [Token]

In the initial phase, all regular expressions are converted to DFAs, and each DFA is paired with its initial state to form its initial configuration. The scanner then proceeds as follows:

whenever the input is consumed (x == '!') and all dfas are in the initial state (null (filter (\(s,_) -> s /= 0) cfgs)), return the list of tokens;
whenever a dfa is in a final state, take the first such dfa ((a:_) <- [a | (s,a) <- cfgs, s `elem` fin a]), build its respective token from the scanned word (getToken a crt) and add it to the list of tokens. The search process continues after:
- resetting all configurations to the initial ones
- resetting the scanned (but unmatched) current word
whenever we have no dfa in a final state, we simply move each dfa to its successor state, and rule out configurations where a sink state was reached;
whenever no current configuration is left, no regular expression matches the current input and the scanning stops by returning the empty list of tokens.

type Config = (Int, DFA)

regularExpressions :: [(RegExp, String -> Token)]
regularExpressions = ...

dfas :: [DFA]
dfas = map (\(e,f)-> convert e f) regularExpressions

lexical :: String -> [Token]
lexical w = lex (w++"!") "" (map (\a->(0,a)) dfas) []
  where
    lex :: String -> String -> [Config] -> [Token] -> [Token]
    lex (x:xs) crt cfgs tokens
      -- if no continuing configuration exists, fail; this is checked first,
      -- since the next test also holds for an empty list of configurations
      | null cfgs = []

      -- the input ended and all dfas are in the initial state: return the
      -- tokens (they were pushed in reverse order)
      | (x == '!') && null (filter (\(s,_) -> s /= 0) cfgs) = reverse tokens

      -- we found a dfa which accepted; push the token of the first such dfa
      | (a:_) <- [a | (s,a) <- cfgs, s `elem` fin a] = lex (x:xs) "" (map (\a->(0,a)) dfas) (getToken a crt : tokens)

      -- proceed with the next symbol
      | otherwise = lex xs (crt++[x]) [(s',a) | (s,a) <- cfgs, Just s' <- [delta a s x]] tokens
    -- the end marker was consumed in the middle of a word: fail
    lex [] _ _ _ = []
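To see the scanner at work without relying on `toNFA`, `subset` and `convert`, here is a self-contained sketch which plugs three hand-written DFAs into the same scanning loop. Everything specific to it is an assumption made for illustration: the reduced `Token` type, the automata `ifDFA`, `intDFA` and `parDFA`, and the local name `go` used instead of `lex`.

```haskell
data Token = If | AVal Integer | OpenPar
  deriving (Show, Eq)

data DFA = DFA
  { delta    :: Int -> Char -> Maybe Int
  , fin      :: [Int]
  , getToken :: String -> Token
  }

type Config = (Int, DFA)

-- Hand-written DFA for the keyword "if" (state 0 is initial).
ifDFA :: DFA
ifDFA = DFA step [2] (const If)
  where
    step 0 'i' = Just 1
    step 1 'f' = Just 2
    step _ _   = Nothing

-- Hand-written DFA for (0U1)+.
intDFA :: DFA
intDFA = DFA step [1] (AVal . read)
  where
    step _ c | c `elem` "01" = Just 1
    step _ _ = Nothing

-- Hand-written DFA for a single "(".
parDFA :: DFA
parDFA = DFA step [1] (const OpenPar)
  where
    step 0 '(' = Just 1
    step _ _   = Nothing

dfas :: [DFA]
dfas = [ifDFA, intDFA, parDFA]

lexical :: String -> [Token]
lexical w = go (w ++ "!") "" start []
  where
    start = [(0, a) | a <- dfas]
    go [] _ _ _ = []
    go (x:xs) crt cfgs tokens
      -- every dfa reached its sink: no expression matches
      | null cfgs = []
      -- input consumed and nothing is half-read: success
      | x == '!' && all ((== 0) . fst) cfgs = reverse tokens
      -- some dfa accepts: emit its token and restart all dfas
      | (a:_) <- [a | (s, a) <- cfgs, s `elem` fin a] =
          go (x:xs) "" start (getToken a crt : tokens)
      -- otherwise advance every dfa on x and drop the ones with no move
      | otherwise =
          go xs (crt ++ [x]) [(s', a) | (s, a) <- cfgs, Just s' <- [delta a s x]] tokens
```

With these definitions, `lexical "if(10"` evaluates to `[If,OpenPar,AVal 1,AVal 0]` and `lexical "ix"` to `[]`. The first result shows a consequence of the rules as stated: a token is emitted as soon as some DFA reaches a final state, so the digits of `10` are reported one at a time rather than as a single longest match.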