In the last lecture, we showed that for each regular expression $ e$ we can construct an NFA $ M$ such that $ L(e) = L(M)$ .
While NFAs are a computational instrument for establishing acceptance, they are not an ideal one. Nondeterminism per se is difficult to translate into a programming language, and simulating it is also quite inefficient. We illustrate this on our previous example. Recall our NFA representation in Haskell:
data NFA = NFA {delta :: [(Int,Char,Int)], fin :: [Int]}
Given a word $ w$ and an NFA $ nfa$ , the following code checks whether $ w\in L(nfa)$ .
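    check :: String -> NFA -> Bool
    check w nfa = chk (w ++ "!") [0]
      where
        -- '!' marks the end of the input; state 0 is the initial state
        chk [] _ = False
        chk (x:xs) set
          -- end of input: accept iff some current state is final
          | x == '!'  = any (`elem` fin nfa) set
          -- no current state left: reject
          | null set  = False
          -- move to all successors s' of current states s on symbol x
          | otherwise = chk xs [s' | (s, c, s') <- delta nfa, c == x, s `elem` set]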
At each step, the procedure builds the set of all successor states $ s'$ which can be reached from some current state $ s$ , on current input $ x$ .
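To make the set-of-states bookkeeping concrete, here is a minimal sketch that records the state set after each symbol. The automaton `exampleNFA` (words over {a,b} ending in "ab") and the helpers `step`, `trace` and `accepts` are illustrative assumptions, not part of the lecture code; as in the procedure above, $ \epsilon$ -transitions are not handled.

```haskell
import Data.List (nub, sort)

-- Same representation as in the lecture: transitions and final states;
-- state 0 is assumed to be the initial state.
data NFA = NFA {delta :: [(Int, Char, Int)], fin :: [Int]}

-- Hypothetical example: words over {a,b} ending in "ab".
exampleNFA :: NFA
exampleNFA = NFA
  { delta = [(0,'a',0), (0,'b',0), (0,'a',1), (1,'b',2)]
  , fin   = [2]
  }

-- All successors of the states in `set` on symbol `x`.
step :: NFA -> [Int] -> Char -> [Int]
step nfa set x = sort (nub [s' | (s, c, s') <- delta nfa, c == x, s `elem` set])

-- The state set after each prefix of the word, starting from [0].
trace :: NFA -> String -> [[Int]]
trace nfa = scanl (step nfa) [0]

-- Accept iff the last state set contains a final state.
accepts :: NFA -> String -> Bool
accepts nfa w = any (`elem` fin nfa) (last (trace nfa w))
```

For instance, `trace exampleNFA "aab"` yields `[[0],[0,1],[0,1],[0,2]]`: after every symbol the simulation has to carry a whole set of states, which is the source of the inefficiency.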
This procedure can be optimised, if we make the following observations:
These observations lead us to Deterministic Finite Automata (DFA). More formally, we can easily obtain the definition for DFAs by enforcing the following restriction on the transition relation $ \Delta$ :
Thus:
Definition (DFA):
A Deterministic Finite Automaton (DFA) is an NFA whose transition relation $ \Delta : K \times \Sigma \rightarrow K$ is a total function. In what follows, we write $ \delta$ instead of $ \Delta$ to refer to the transition function of a DFA.
Definition (Configuration):
The configuration of a DFA is an element of $ K\times\Sigma^*$ .
Definition (One-step move):
We call $ \vdash_M \subseteq (K\times \Sigma^*) \times (K\times \Sigma^*)$ a one-step move relation, over configurations. The relation is defined as follows:
$ (q,w) \vdash_M (q',w')$ iff $ w=cw'$ for some symbol $ c$ of the alphabet, and $ \delta(q,c)=q'$ .

We write $ \vdash_M^*$ to refer to the reflexive and transitive closure of $ \vdash_M$ . Hence, $ (q,w)\vdash_M^*(q',w')$ means that DFA $ M$ can reach configuration $ (q',w')$ from configuration $ (q,w)$ in zero or more steps.
Definition (Acceptance):
We say a word $ w$ is accepted by a DFA $ M$ iff $ (q_0,w)\vdash_M^*(q,\epsilon)$ and $ q\in F$ ($ q$ is a final state).
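The definitions of configuration, one-step move and acceptance can be sketched directly in Haskell. This is only an illustration: the type `DFA q`, its field names, and the example automaton `evenAs` are assumptions introduced here, and the alphabet is simply taken to be `Char`.

```haskell
-- A DFA over states of type q; `delta` is a total function, as in the definition.
data DFA q = DFA
  { delta   :: q -> Char -> q
  , initial :: q
  , isFinal :: q -> Bool
  }

-- A configuration: the current state and the unread part of the word.
type Config q = (q, String)

-- One-step move: (q, c:w') |- (delta q c, w'); there is no move from (q, "").
move :: DFA q -> Config q -> Maybe (Config q)
move _   (_, [])     = Nothing
move dfa (q, c : w') = Just (delta dfa q c, w')

-- The sequence of configurations reached from (q0, w).
run :: DFA q -> String -> [Config q]
run dfa w = go (initial dfa, w)
  where
    go cfg = cfg : maybe [] go (move dfa cfg)

-- w is accepted iff (q0, w) |-* (q, "") with q final.
accepts :: DFA q -> String -> Bool
accepts dfa w = isFinal dfa (fst (last (run dfa w)))

-- Hypothetical example: words with an even number of a's.
-- State 0: even so far, state 1: odd so far.
evenAs :: DFA Int
evenAs = DFA
  { delta   = \q c -> if c == 'a' then 1 - q else q
  , initial = 0
  , isFinal = (== 0)
  }
```

Here `run evenAs "aba"` produces the configurations `[(0,"aba"),(1,"ba"),(1,"a"),(0,"")]`, so the word is accepted. Because $ \delta$ is a function, each configuration has exactly one successor and no set of states needs to be tracked.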
Example(s)
Let $ M=(K,\Sigma, \Delta, q_0, F)$ be an NFA. We assume $ M$ does not contain transitions on words of length larger than 1. This is no loss of generality: if $ (q,w,q')\in\Delta$ for some $ w=c_1\ldots c_n$ of size 2 or more, we construct intermediary states $ q^1, \ldots, q^{n+1}$ as well as transitions $ (q^1,c_1,q^2), \ldots, (q^n,c_n,q^{n+1})$ , where $ q=q^1$ and $ q'=q^{n+1}$ , and use them in place of the original transition.
We denote by $ E_M(q) = \{p\in K\mid (q,\epsilon)\vdash^*_M (p,\epsilon)\}$ , the $ \epsilon$ -closure of state $ q$ . In effect $ E_M(q)$ contains all states reachable from $ q$ via $ \epsilon$ -transitions. When the automaton $ M$ is understood from the context, we omit the subscript and simply write $ E(q)$ .
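The $ \epsilon$ -closure can be computed with a simple worklist search. The sketch below assumes a representation that is not in the lecture code: a transition label is `Nothing` for an $ \epsilon$ -transition and `Just c` for a symbol; `epsClosure` and `exampleTrans` are hypothetical names.

```haskell
import Data.List (sort)

-- Assumption: a transition label is `Nothing` for an epsilon-transition
-- and `Just c` for a transition on symbol c.
type Transition = (Int, Maybe Char, Int)

-- E(q): all states reachable from q using only epsilon-transitions.
-- q itself is always included (zero steps).
epsClosure :: [Transition] -> Int -> [Int]
epsClosure trans q = sort (go [q] [])
  where
    -- worklist of states still to visit, and the states seen so far
    go [] seen = seen
    go (s : rest) seen
      | s `elem` seen = go rest seen
      | otherwise     = go ([p | (s', Nothing, p) <- trans, s' == s] ++ rest) (s : seen)

-- Hypothetical example: 0 -eps-> 1 -eps-> 2, 2 -a-> 3, 3 -eps-> 0.
exampleTrans :: [Transition]
exampleTrans = [(0, Nothing, 1), (1, Nothing, 2), (2, Just 'a', 3), (3, Nothing, 0)]
```

With these transitions, `epsClosure exampleTrans 0` is `[0,1,2]` and `epsClosure exampleTrans 3` is `[0,1,2,3]`; the transition on `a` is never followed.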
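We build the DFA $ M'=(K',\Sigma, \delta, q_0', F')$ as follows:
$ K' = 2^K$ : each state of $ M'$ is a set of states of $ M$ ;
$ q_0' = E(q_0)$ : the initial state of $ M'$ is the $ \epsilon$ -closure of the initial state of $ M$ ;
$ \delta(Q,c) = \bigcup \{E(p) \mid q\in Q, (q,c,p)\in\Delta\}$ , for each $ Q\in K'$ and $ c\in\Sigma$ ;
$ F' = \{Q\in K' \mid Q\cap F\neq\emptyset\}$ : a state of $ M'$ is final iff it contains some final state in $ M$ .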
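The subset construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's `subset` function: $ \epsilon$ -transitions are labelled `Nothing`, DFA states are represented as sorted lists of NFA states, and subsets are computed on demand while reading the word rather than enumerated up front. All names (`closure`, `deltaD`, `startD`, `finalD`, `acceptsD`, `exampleNFA`) are hypothetical.

```haskell
import Data.List (nub, sort)

-- Assumption: `Nothing` labels an epsilon-transition, `Just c` a transition on c.
data NFA = NFA
  { trans  :: [(Int, Maybe Char, Int)]
  , start  :: Int
  , finals :: [Int]
  }

-- A DFA state is a sorted, duplicate-free list of NFA states.
type StateSet = [Int]

norm :: [Int] -> StateSet
norm = sort . nub

-- E(q): states reachable from q via epsilon-transitions only.
closure :: NFA -> Int -> StateSet
closure nfa q = norm (go [q] [])
  where
    go [] seen = seen
    go (s : rest) seen
      | s `elem` seen = go rest seen
      | otherwise     = go ([p | (s', Nothing, p) <- trans nfa, s' == s] ++ rest) (s : seen)

-- delta(Q, c): union of E(p) for all q in Q with (q, c, p) in Delta.
deltaD :: NFA -> StateSet -> Char -> StateSet
deltaD nfa qs c =
  norm (concat [closure nfa p | (q, Just c', p) <- trans nfa, c' == c, q `elem` qs])

-- Initial state of the DFA: E(q0).
startD :: NFA -> StateSet
startD nfa = closure nfa (start nfa)

-- Q is final in the DFA iff it contains some final state of the NFA.
finalD :: NFA -> StateSet -> Bool
finalD nfa = any (`elem` finals nfa)

-- Run the DFA, building only the subsets that are actually visited.
acceptsD :: NFA -> String -> Bool
acceptsD nfa w = finalD nfa (foldl (deltaD nfa) (startD nfa) w)

-- Hypothetical example for a*b: 0 -a-> 0, 0 -eps-> 1, 1 -b-> 2.
exampleNFA :: NFA
exampleNFA = NFA
  { trans  = [(0, Just 'a', 0), (0, Nothing, 1), (1, Just 'b', 2)]
  , start  = 0
  , finals = [2]
  }
```

For `exampleNFA` the initial DFA state is `[0,1]`, and reading `b` from it leads to `[2]`, which is final. The empty list plays the role of the sink state that makes the transition function total.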
Proposition (1):
For all $ q,p\in K$ , $ (q,w)\vdash^*_M (p,\epsilon)$ iff $ (E(q),w)\vdash^*_{M'}(P,\epsilon)$ , for some $ P$ such that $ p\in P$ .
The proposition states that, for each path in NFA $ M$ starting in $ q$ which consumes word $ w$ (hence ends up in configuration $ (p,\epsilon)$ ), there is an equivalent path in the DFA $ M'$ , which starts in the $ \epsilon$ -closure of $ q$ and ends in some state $ P$ which contains $ p$ , and vice versa.
Proposition 1 is essential for proving the following result:
Let $ M'$ be a DFA constructed from NFA $ M$ according to the above rules. Then $ L(M)=L(M')$ .
Proof:
The languages of the two machines coincide if, for all words: $ w\in L(M)$ iff $ w\in L(M')$ , thus:
$ (q_0,w)\vdash^*_M (p,\epsilon)$ with $ p\in F$ iff $ (E(q_0),w)\vdash^*_{M'}(P,\epsilon)$ with $ p\in P$ .

The above statement follows immediately from Proposition 1, where $ q$ is the initial state of $ M$ and $ p$ is some final state of $ M$ , as well as from the definition of $ F'$ .
We now turn to the proof of Proposition 1:
Proof:
The proof is by induction over the length of the word $ w$ .
Basis step: $ \mid w\mid=0$ that is $ w=\epsilon$
direction $ \implies$ :
1. Suppose $ (q,\epsilon) \vdash^*_M(p,\epsilon)$ .
2. From the definition of $ E$ and 1., we have that $ p\in E(q)$ .
3. Since $ \vdash^*_{M'}$ is reflexive, we have $ (E(q),\epsilon)\vdash^*_{M'}(E(q),\epsilon)$ .
4. Therefore, we have $ (E(q),\epsilon)\vdash^*_{M'}(P,\epsilon)$ with $ p\in P$ : $ P$ is actually $ E(q)$ .
direction $ \impliedby$ :
1. Suppose $ (E(q),\epsilon) \vdash^*_{M'} (P,\epsilon)$ .
2. Since $ \delta$ does not allow $ \epsilon$ -transitions (and $ \vdash^*_{M'}$ is reflexive), it follows that $ E(q)=P$ .
3. By the definition of $ E$ , we have that $ (q,\epsilon)\vdash^*_{M}(p,\epsilon)$ for any $ p\in E(q)$ .

Induction hypothesis: suppose that the claim is true for all strings $ w$ such that $ \mid w\mid\leq k$ , for some $ k\geq0$ .
Induction step: we prove the claim for any string $ w'$ of length $ k+1$ ; let $ w'=wa$ with $ \mid w\mid = k$ (hence $ a$ is the last symbol of $ w'$ ).
direction $ \implies$ :
1. Suppose $ (q,wa)\vdash^*_{M} (p,\epsilon)$ .
2. By the definition of $ \vdash^*$ , we have: $ (q,wa)\vdash^*_{M} (r_1,a) \vdash_{M} (r_2,\epsilon)\vdash^*_{M} (p,\epsilon)$ . In other words, there is a path from $ q$ which takes us to $ r_1$ by consuming $ w$ , then to $ r_2$ via a one-step transition on $ a$ , then to $ p$ in zero or more $ \epsilon$ -transitions. Notice that $ p$ may be equal to $ r_2$ , which is taken into account since $ \vdash^*_{M}$ is reflexive.
3. By the construction of $ \vdash^*_{M}$ , we also have $ (q,w)\vdash^*_{M}(r_1,\epsilon)$ .
4. From 3., by the induction hypothesis, we have $ (E(q),w)\vdash^*_{M'}(R_1,\epsilon)$ with $ r_1 \in R_1$ .
5. By construction of $ \vdash^*_{M'}$ , we have $ (E(q),wa)\vdash^*_{M'}(R_1,a)$ .
6. Since $ (r_1,a) \vdash_{M} (r_2,\epsilon)$ and $ r_1\in R_1$ , by the definition of $ \delta$ , we have $ E(r_2) \subseteq \delta(R_1,a)$ .
7. Since $ (r_2,\epsilon) \vdash^*_{M} (p,\epsilon)$ it follows that $ p \in E(r_2)$ , and therefore $ p \in \delta(R_1,a)$ .
8. In effect, from 5. and 7. we have shown that $ (E(q),wa)\vdash^*_{M'}(R_1,a)\vdash_{M'}(R_2,\epsilon)$ with $ R_2=\delta(R_1,a)$ and $ p\in R_2$ , which concludes our proof.
direction $ \impliedby$ :
1. Suppose $ (E(q),wa)\vdash^*_{M'}(P,\epsilon)$ .
2. Since no $ \epsilon$ -transitions are allowed in a DFA, we must have: $ (E(q),wa)\vdash^*_{M'}(R,a)\vdash_{M'}(P,\epsilon)$ .
3. By construction of $ \vdash^*_{M'}$ : $ (E(q),w)\vdash^*_{M'}(R,\epsilon)$ .
4. By the induction hypothesis, $ (q,w)\vdash^*_{M}(r,\epsilon)$ , for any $ r\in R$ .
5. By construction of $ \vdash^*_{M}$ : $ (q,wa)\vdash^*_{M}(r,a)$ .
6. Since $ (R,a)\vdash_{M'}(P,\epsilon)$ , by the definition of $ \delta$ , for some $ r\in R$ , $ (r,a,x)\in\Delta$ and $ E(x)\subseteq P$ .
7. From 5. and 6., $ (q,wa)\vdash^*_{M}(r,a)\vdash_{M}(x,\epsilon)$ with $ x\in E(x)\subseteq P$ , which completes our proof.
We have shown so far that the problem $ w\in L(e)$ can be algorithmically solved by checking $ w \in L(D)$ , where $ D$ is a DFA obtained via subset construction from $ M$ , and $ M$ is obtained directly from $ e$ .
The algorithmic procedure for $ w \in L(D)$ is actually quite straightforward, and is shown below:
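    data DFA = DFA {delta :: Int -> Char -> Maybe Int, fin :: [Int]}

    check :: String -> DFA -> Bool
    check w dfa = chk (w ++ "!") 0
      where
        -- '!' marks the end of the input; state 0 is the initial state
        chk [] _ = False
        chk (x:xs) state
          -- end of input: accept iff the current state is final
          | x == '!'                       = state `elem` fin dfa
          -- a transition exists: follow it
          | Just next <- delta dfa state x = chk xs next
          -- no transition on x: reject
          | otherwise                      = False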
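As a usage illustration, here is a hand-built DFA for the expression (A∪B)(a∪b)*(0∪1)* in the representation above. The automaton `example` and the function `accepts` are assumptions made for this sketch; `accepts` follows the same idea as `check`, but recurses over the word without an end marker.

```haskell
-- DFA representation from the lecture; `Nothing` means "no transition" (reject).
data DFA = DFA {delta :: Int -> Char -> Maybe Int, fin :: [Int]}

-- Follow the unique transition for each symbol; state 0 is the initial state.
accepts :: DFA -> String -> Bool
accepts dfa w = maybe False (`elem` fin dfa) (go 0 w)
  where
    go state []       = Just state
    go state (x : xs) = delta dfa state x >>= \next -> go next xs

-- Hand-built (hypothetical) DFA for (A U B)(a U b)*(0 U 1)*:
--   state 0: nothing read yet
--   state 1: read the capital letter, now inside (a U b)*
--   state 2: inside (0 U 1)*
example :: DFA
example = DFA {delta = d, fin = [1, 2]}
  where
    d 0 c | c `elem` "AB" = Just 1
    d 1 c | c `elem` "ab" = Just 1
          | c `elem` "01" = Just 2
    d 2 c | c `elem` "01" = Just 2
    d _ _ = Nothing
```

With this automaton, `accepts example "Aab01"` is `True`, while `accepts example "A0a"` is `False` because state 2 has no transition on `a`. Each symbol costs a single call to `delta`, independent of the number of states.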
We now have all the necessary tools to implement a lexical analyser, or scanner, from scratch. We will proceed to implement such an analyser for the language IMP. The input of the scanner consists of two parts:
Input 1. is specific to our language IMP, and is directly implemented in Haskell. It consists of a datatype for regular expressions, as well as a datatype describing each possible token. We also implement, for each different token, a function String → Token, which returns the token once a matching substring is found. To make the code nicer, we include such functions in the DFA datatype.
    data RegExp = EmptyString
                | Atom Char
                | RegExp :| RegExp    -- union
                | RegExp :. RegExp    -- concatenation
                | Kleene RegExp       -- Kleene star

    -- e+ = e e*
    plus :: RegExp -> RegExp
    plus e = e :. (Kleene e)

    -- (AUB)(aUb)*(0U1)*
    example = ((Atom 'A') :| (Atom 'B')) :. (Kleene ((Atom 'a') :| (Atom 'b'))) :. (Kleene ((Atom '0') :| (Atom '1')))

    data Token = If | Leq | Tru | Fals | .... | Var String | AVal Integer | BVal Bool | While | OpenPar | ...

    ifToken = (Atom 'i') :. (Atom 'f')   -- we can build an auxiliary function for that
    f_if :: String -> Token
    f_if _ = If

    varToken = plus ((Atom 'a') :| (Atom 'b'))   -- [a,b]+
    f_var :: String -> Token
    f_var s = Var s

    intToken = plus ((Atom '0') :| (Atom '1'))
    f_int :: String -> Token
    f_int s = AVal ((read s) :: Integer)

    -- the DFA now also carries the function that builds its token
    data DFA = DFA {delta :: Int -> Char -> Maybe Int,
                    fin :: [Int],
                    getToken :: String -> Token }
We also recall some of our previously-defined procedures, already shown in the previous lectures:
    -- converts a regular expression to a NFA
    toNFA :: RegExp -> NFA

    -- converts an NFA to a DFA
    subset :: NFA -> DFA

    -- checks if a word is accepted by the DFA
    check :: String -> DFA -> Bool

    -- takes a regular expression and its token function and builds the DFA
    convert :: RegExp -> (String -> Token) -> DFA
The logic of the scanner is as follows. While processing the input, the scanner will maintain:
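the part of the input which has not yet been read ((x:xs))
the word scanned so far (crt)
the current configuration of each DFA (cfgs)
the list of tokens found so far (tokens)

The function responsible for scanning is: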
lex :: String -> String -> [Config] -> [Token] -> [Token]
In the initial phase, all regular expressions are converted to DFAs, and each DFA is converted to its initial configuration.
If the input has ended (x == '!') and all dfas are in the initial state ((filter (\(s,_) -> s /= 0) cfgs) == []), return the list of tokens.
If some dfa is in a final state ((a:_) <- [a | (s,a) <- cfgs, s `elem` fin a]), build its respective token from the scanned word (getToken a crt) and add it to the list of tokens. The search process then continues from the current symbol, with the scanned word reset to the empty string and every dfa reset to its initial configuration.

    type Config = (Int, DFA)

    regularExpressions :: [(RegExp, String -> Token)]
    regularExpressions = ...

    dfas :: [DFA]
    dfas = map (\(e,f) -> convert e f) regularExpressions

    lexical :: String -> [Token]
    lexical w = lex (w ++ "!") "" initial []
      where
        -- every dfa starts in state 0
        initial = map (\a -> (0,a)) dfas

        lex :: String -> String -> [Config] -> [Token] -> [Token]
        -- the end marker was consumed without acceptance: fail
        lex [] _ _ _ = []
        lex (x:xs) crt cfgs tokens
          -- the input ended, and all dfas are in the initial state;
          -- tokens were pushed in reverse order, so restore the input order
          | x == '!' && null (filter (\(s,_) -> s /= 0) cfgs) = reverse tokens
          -- we found a dfa which accepted; push the token of the first such dfa
          | (a:_) <- [a | (s,a) <- cfgs, s `elem` fin a] = lex (x:xs) "" initial (getToken a crt : tokens)
          -- if no continuing configuration exists, fail
          | null cfgs = []
          -- proceed with the next symbol
          | otherwise = lex xs (crt ++ [x]) [(s',a) | (s,a) <- cfgs, Just s' <- [delta a s x]] tokens