====== 3. Lexers ======

===== 3.1. Longest match =====

Lexers use //specs// to split a string into lexemes, each of a given type called a **token**. In this lab, the //specs// will be **DFAs**. A **lexer** **runs** the DFAs on a string, searching for the **longest prefix** which is **accepted** by at least one of the DFAs.

----

**3.1.1.** Suppose $math[A_1] is a DFA and $math[w]=''aabaaabb'' is a word. Find the **longest prefix** of $math[w] which is accepted by $math[A_1].

{{ :lfa:lexer-a1.png?300 |}}

----

When such a prefix is found, it is reported as a new **lexeme**. The DFAs are then reset to their initial configurations and the search starts over on the rest of the string.

----

**3.1.2.** Split the word $math[w]=''ababbaabbabaab'' into lexemes, using $math[A_2] as the only token.

{{ :lfa:lexer-a2.png?300 |}}

**3.1.3.** Given DFAs $math[A_3], $math[A_4] and $math[A_5], use them to split the word $math[w]=''abaaabbabaaaab'' into lexemes.

^^^^
| {{ :lfa:lexer-a3.png?300 |}} | {{ :lfa:lexer-a4.png?200 |}} | {{ :lfa:lexer-a5.png?200 |}} |

===== 3.2. Priority and longest match =====

When two or more DFAs accept the same (longest) prefix, the one with the highest priority (the first in the list) is selected. An interesting question is whether longest match could be replaced by priorities alone. The following exercise illustrates why this is not the case.

----

**3.2.1.** Assume that a lexer splits the input by the **first match** principle: for a given prefix, we check whether the **first DFA** in our list accepts it, then move on to the next DFA in the list, and so forth. If no DFA accepts the prefix ''p[1:n]'', we try the prefix ''p[1:n+1]''.

Let:
  * $math[A] be a DFA which matches sequences of lowercase characters (''[a-z]+'') followed by a whitespace (e.g. ''aba '')
  * $math[B] be a DFA which matches "''def ''" (the four-character sequence).

Let $math[w]="''def deffunction ''". Suppose:
  * $math[A] has higher priority than $math[B]. How will the string be split? (Which are the lexemes?)
  * $math[B] has higher priority than $math[A]. What will the split look like?
  * finally, let us return to the **maximal match** principle. How should the DFAs $math[A] and $math[B] be ordered (w.r.t. priority) so that our word is split in the correct way (assuming Python syntax)?

===== 3.3. Implementation =====

**3.3.1.** Implement a three-DFA lexer with DFAs $math[A_3], $math[A_4] and $math[A_5]. You can use the code from the last lab to instantiate the three DFAs directly. The input should be a word, and the output should be a string of the form ''id:lexeme ... id:lexeme'', where ''id'' is the DFA's id (from 3 to 5) and ''lexeme'' is the matched lexeme.
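
As a starting point, the longest-match loop with priorities might be sketched as below. This is only a minimal sketch under stated assumptions: the ''DFA'' class, its ''longest_accepted_prefix'' method and the ''lex'' function are hypothetical names (your DFA code from the last lab will likely look different), and the three toy DFAs are invented for illustration; they are **not** $math[A_3], $math[A_4] and $math[A_5]. The output format (space-separated ''id:lexeme'' pairs) is also an assumed reading of the required format.

```python
from typing import Dict, List, Set, Tuple


class DFA:
    """Hypothetical minimal DFA; a missing transition plays the role of the sink state."""

    def __init__(self, dfa_id: int, start: int, finals: Set[int],
                 delta: Dict[Tuple[int, str], int]) -> None:
        self.dfa_id = dfa_id
        self.start = start
        self.finals = finals
        self.delta = delta

    def longest_accepted_prefix(self, word: str, pos: int) -> int:
        """Length of the longest non-empty prefix of word[pos:] that is accepted (0 if none)."""
        state = self.start
        best = 0
        for offset, ch in enumerate(word[pos:], start=1):
            nxt = self.delta.get((state, ch))
            if nxt is None:  # no transition: the DFA can never accept from here
                break
            state = nxt
            if state in self.finals:
                best = offset  # remember the longest accepting point seen so far
        return best


def lex(dfas: List[DFA], word: str) -> str:
    """Split word by longest match; dfas is ordered by priority (first = highest)."""
    pos = 0
    lexemes = []
    while pos < len(word):
        best_len, best_id = 0, None
        for dfa in dfas:
            n = dfa.longest_accepted_prefix(word, pos)
            if n > best_len:  # strict '>' keeps the higher-priority DFA on ties
                best_len, best_id = n, dfa.dfa_id
        if best_id is None:
            raise ValueError(f"no DFA accepts a prefix starting at position {pos}")
        lexemes.append(f"{best_id}:{word[pos:pos + best_len]}")
        pos += best_len  # every DFA restarts from its initial state on the rest
    return " ".join(lexemes)


# Toy DFAs (assumptions, not the ones from the figures):
AB = DFA(1, 0, {2}, {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1})  # (ab)+
A_PLUS = DFA(2, 0, {1}, {(0, "a"): 1, (1, "a"): 1})           # a+
B_PLUS = DFA(3, 0, {1}, {(0, "b"): 1, (1, "b"): 1})           # b+

print(lex([AB, A_PLUS, B_PLUS], "ababaab"))  # 1:abab 2:aa 3:b
```

Each DFA is run on its own from the current position, and the strict comparison means priority is only consulted when two DFAs accept prefixes of the same length. Empty matches are ignored on purpose, so the loop always advances.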