====== 10. Context-Free Languages & Lexers ======
===== 10.1. Context-Free Grammar to PDA conversion =====
For each context-free grammar G: \\
- describe L(G) \\
- algorithmically construct a PDA that accepts the same language \\
- run the PDA on the given inputs \\
- is the grammar ambiguous? If yes, write an unambiguous grammar that generates the same language \\
**10.1.1** input: aaabb \\
$ S \leftarrow aS | aSb | \epsilon $ \\
The initial stack symbol of the PDA is the grammar's start symbol S. \\ The PDA has a single state q and accepts by empty stack. \\
For each rule $ A \leftarrow \gamma $ add a transition **q ---$(\epsilon, A/ \gamma)$--➤ q** and for each terminal c add **q ---$(c, c/ \epsilon)$--➤ q**
Thus, our PDA has the following transitions looping on state q:
* $ \epsilon, S/aS $
* $ \epsilon, S/aSb $
* $ \epsilon, S/\epsilon $
* $ a, a/\epsilon $
* $ b, b/\epsilon $
Input: aaabb \\ (aaabb, q, S) => (**a**aabb, q, **a**Sb) => (aabb, q, Sb) => (**a**abb, q, **a**Sbb) => (abb, q, Sbb) => (**a**bb, q, **a**Sbb) => (bb, q, Sbb) => (bb, q, bb) => (b, q, b) => ($\epsilon$, q, $\epsilon$) \\
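The construction and the run above can be reproduced with a small script. This is only a sketch under some assumptions: symbols are single characters, the names ''cfg_to_pda'' and ''accepts'' are made up for this illustration, and the nondeterminism is resolved by a breadth-first search that drops configurations whose stack holds more terminals than there is input left.

```python
from collections import deque


def cfg_to_pda(rules):
    """Build the one-state PDA for a grammar.

    rules maps each nonterminal to a list of right-hand sides
    (strings of one-character symbols, '' stands for epsilon).
    Each transition is (input_symbol, stack_top, pushed_string);
    input_symbol '' denotes an epsilon move.
    """
    transitions = []
    terminals = set()
    for head, bodies in rules.items():
        for body in bodies:
            transitions.append(("", head, body))  # epsilon, A / gamma
            terminals.update(ch for ch in body if ch not in rules)
    for c in sorted(terminals):
        transitions.append((c, c, ""))  # c, c / epsilon
    return transitions


def accepts(transitions, start, word):
    """Explore configurations (remaining input, stack); accept by empty stack."""
    nonterminals = {top for sym, top, _ in transitions if sym == ""}
    seen = set()
    queue = deque([(word, start)])
    while queue:
        rest, stack = queue.popleft()
        if (rest, stack) in seen:
            continue
        seen.add((rest, stack))
        if rest == "" and stack == "":
            return True
        if stack == "":
            continue
        # Every terminal on the stack must still be matched against the input.
        if sum(ch not in nonterminals for ch in stack) > len(rest):
            continue
        top, below = stack[0], stack[1:]
        for sym, expected_top, push in transitions:
            if expected_top != top:
                continue
            if sym == "":
                queue.append((rest, push + below))
            elif rest.startswith(sym):
                queue.append((rest[1:], push + below))
    return False


rules = {"S": ["aS", "aSb", ""]}
pda = cfg_to_pda(rules)
print(accepts(pda, "S", "aaabb"))
print(accepts(pda, "S", "abb"))
```

The search is exponential in the worst case, so it is only meant for checking short exercise inputs by hand.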
\\
Is the grammar ambiguous? Yes, because there exist 2 different leftmost derivations for the word aaabb: \\ S => aSb => aaSbb => aaaSbb => aaabb \\ S => aS => aaSb => aaaSbb => aaabb \\
\\
The accepted language is $ L(G) = \{a^{m}b^{n} | m \ge n \ge 0\} $ \\
\\
Repaired grammar: \\ $ S \leftarrow aS | A \\ A \leftarrow aAb | \epsilon $
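One way to sanity-check the ambiguity claim is to count the leftmost derivations of a word in both grammars. The sketch below assumes single-character symbols and a grammar without left recursion (otherwise the recursion would not terminate); ''count_leftmost_derivations'' is a name invented for this illustration.

```python
from functools import lru_cache


def count_leftmost_derivations(rules, start, word):
    """Count the leftmost derivations of word from start.

    rules maps each nonterminal to a list of right-hand sides
    ('' stands for epsilon). Assumes no left recursion.
    """

    @lru_cache(maxsize=None)
    def ways(symbols, text):
        # Number of ways the sentential form `symbols` derives exactly `text`.
        if not symbols:
            return 1 if not text else 0
        head, tail = symbols[0], symbols[1:]
        if head not in rules:  # terminal: must match the next input symbol
            return ways(tail, text[1:]) if text.startswith(head) else 0
        return sum(ways(body + tail, text) for body in rules[head])

    return ways(start, word)


original = {"S": ["aS", "aSb", ""]}
repaired = {"S": ["aS", "A"], "A": ["aAb", ""]}

print(count_leftmost_derivations(original, "S", "aaabb"))
print(count_leftmost_derivations(repaired, "S", "aaabb"))
```

A count greater than 1 for some word shows that a grammar is ambiguous. A count of 1 on a handful of words does not prove the opposite; it only fails to find a counterexample.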
**10.1.2** input: xayxcayatabcazz \\
$ S \leftarrow a | xAz | SbS | cS \\ A \leftarrow SyS | SyStS $
The PDA has the following transitions looping on state q:
* $ \epsilon, S/a $
* $ \epsilon, S/xAz $
* $ \epsilon, S/SbS $
* $ \epsilon, S/cS $
* $ \epsilon, A/SyS $
* $ \epsilon, A/SyStS $
* $ a, a/\epsilon $
* $ b, b/\epsilon $
* $ c, c/\epsilon $
* $ x, x/\epsilon $
* $ y, y/\epsilon $
* $ z, z/\epsilon $
* $ t, t/\epsilon $
Input: xayxcayatabcazz \\ (xayxcayatabcazz, q, S) => (xayxcayatabcazz, q, xAz) => (ayxcayatabcazz, q, Az) => (ayxcayatabcazz, q, SySz) => (ayxcayatabcazz, q, aySz) => (yxcayatabcazz, q, ySz) => (xcayatabcazz, q, Sz) => (xcayatabcazz, q, xAzz) => (cayatabcazz, q, Azz) => (cayatabcazz, q, SyStSzz) => (cayatabcazz, q, cSyStSzz) => (ayatabcazz, q, SyStSzz) => (ayatabcazz, q, ayStSzz) => (yatabcazz, q, yStSzz) => (atabcazz, q, StSzz) => (atabcazz, q, atSzz) => (tabcazz, q, tSzz) => (abcazz, q, Szz) => (abcazz, q, SbSzz) => (abcazz, q, abSzz) => (bcazz, q, bSzz) => (cazz, q, Szz) => (cazz, q, cSzz) => (azz, q, Szz) => (azz, q, azz) => (zz, q, zz) => (z, q, z) => ($\epsilon$, q, $\epsilon$)\\
\\
Is the grammar ambiguous? Yes, because the word ababa has 2 different leftmost derivations: \\ S => SbS => abS => abSbS => ababS => ababa \\ S => SbS => SbSbS => abSbS => ababS => ababa \\
\\
It is hard to directly explain the language in this form. Another form may be easier. Let's relabel the terminals: a => bool; b => and; c => not; x => if; y => then; z => fi; t => else.\\
The grammar becomes: $ S \leftarrow \text{bool} | \text{if } A \text{ fi} | S \text{ and } S | \text{not } S \\ A \leftarrow S \text{ then } S | S \text{ then } S \text{ else } S $\\
The language generated can be described as the language of boolean expressions (considering 'bool' is either a variable or a literal) with the operations 'and', 'not', 'if-then' and 'if-then-else'.
\\
Why is it ambiguous? The grammar does not fix the associativity of the 'and'/b operator, nor the relative precedence of 'and'/b and 'not'/c.\\
To fix this grammar we will use the following conventions: 'and' is left-associative, ababa == (aba)ba, and 'not' binds tighter than 'and', caba == (ca)ba \\
Repaired grammar: \\
$ S \leftarrow SbT | T \\ T \leftarrow cT | xAz | a \\ A \leftarrow SyS | SyStS $ \\
**10.1.3** input: aaabbbbbccc \\
$ S \leftarrow ABC \\ A \leftarrow aA | \epsilon \\ B \leftarrow bbB | b \\ C \leftarrow cC | c $
The PDA has the following transitions looping on state q:
* $ \epsilon, S/ABC $
* $ \epsilon, A/aA $
* $ \epsilon, A/\epsilon $
* $ \epsilon, B/bbB $
* $ \epsilon, B/b $
* $ \epsilon, C/cC $
* $ \epsilon, C/c $
* $ a, a/\epsilon $
* $ b, b/\epsilon $
* $ c, c/\epsilon $
Input: aaabbbbbccc \\ (aaabbbbbccc, q, S) => (aaabbbbbccc, q, ABC) => (aaabbbbbccc, q, aABC) => (aabbbbbccc, q, ABC)
=> (aabbbbbccc, q, aABC) => (abbbbbccc, q, ABC) => (abbbbbccc, q, aABC) => (bbbbbccc, q, ABC)
=> (bbbbbccc, q, BC) => (bbbbbccc, q, bbBC) => (bbbbccc, q, bBC) => (bbbccc, q, BC) => (bbbccc, q, bbBC) => (bbccc, q, bBC) => (bccc, q, BC) => (bccc, q, bC)
=> (ccc, q, C) => (ccc, q, cC) => (cc, q, C) => (cc, q, cC) => (c, q, C) => (c, q, c) => ($\epsilon$, q, $\epsilon$) \\
\\
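Is the grammar ambiguous? No: A, B and C generate blocks over different letters, and each of them has exactly one derivation for every block length. \\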
\\
The accepted language is $ L(G) = \{a^{m}b^{2n + 1}c^{p+1} | m,n,p \ge 0\} $ \\
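Since this language is regular, the description can be cross-checked with a regular expression. The pattern ''a*b(bb)*c+'' below is a direct transcription of the set above; the helper name ''in_language'' is only illustrative.

```python
import re

# a^m b^(2n+1) c^(p+1) with m, n, p >= 0
LANGUAGE = re.compile(r"a*b(bb)*c+")


def in_language(word):
    """Return True if the whole word has the form a^m b^(2n+1) c^(p+1)."""
    return LANGUAGE.fullmatch(word) is not None


print(in_language("aaabbbbbccc"))
print(in_language("abbc"))
```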
===== 10.2. Lexer Spec =====
Given the following specs, construct the lexer DFA as presented in Lecture 14:
* PAIRS: $ (10 | 01)* $
* ONES: $ 1+ $
* NO_CONSEC_ONE: $ (1 | \epsilon)(01 | 0)* $
Separate the following input strings into lexemes:
* 010101
Although the entire string is matched by both PAIRS and NO_CONSEC_ONE, PAIRS is defined first, so it takes priority. \\
PAIRS "010101"
* 1010101011
The longest match is "101010101", found by NO_CONSEC_ONE (PAIRS stops after "10101010" and ONES after "1"). The remaining string, "1", is matched by both ONES and NO_CONSEC_ONE, but ONES is defined first. \\
NO_CONSEC_ONE "101010101" \\ ONES "1"
* 01110101001
PAIRS "01" \\
ONES "11" \\
NO_CONSEC_ONE "0101001"
* 01010111111001010
PAIRS "010101" \\
ONES "11111" \\
NO_CONSEC_ONE "001010"
* 1101101001111100001010011001
ONES "11" \\
PAIRS "01101001" \\
ONES "1111" \\
NO_CONSEC_ONE "0000101001" \\
PAIRS "1001"
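\\
The splits above follow the usual maximal-munch rule: take the longest prefix matched by any rule, and on ties prefer the rule defined first. The sketch below mimics that behaviour with Python's ''re'' module instead of the lexer DFA from the lecture; ''SPEC'', ''longest_prefix'' and ''tokenize'' are names chosen for this illustration. Because backtracking regex engines do not guarantee the longest match, each rule is tested with ''fullmatch'' on successively shorter prefixes.

```python
import re

# Rules in priority order (earlier rules win ties).
SPEC = [
    ("PAIRS", re.compile(r"(10|01)*")),
    ("ONES", re.compile(r"1+")),
    ("NO_CONSEC_ONE", re.compile(r"1?(01|0)*")),
]


def longest_prefix(pattern, text):
    """Length of the longest non-empty prefix of text matched by pattern (0 if none)."""
    for end in range(len(text), 0, -1):
        if pattern.fullmatch(text, 0, end):
            return end
    return 0


def tokenize(text):
    """Split text into (rule_name, lexeme) pairs using maximal munch."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in SPEC:
            length = longest_prefix(pattern, text[pos:])
            if length > best_len:  # strict '>' keeps the earlier rule on ties
                best_name, best_len = name, length
        if best_len == 0:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for name, lexeme in tokenize("1101101001111100001010011001"):
    print(name, lexeme)
```

Empty matches are ignored on purpose: PAIRS accepts the empty string, and a lexer that emitted empty lexemes would never advance.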