3. Lexers

A lexer uses specifications (specs) to split a string into lexemes, each of a given type called a token. In this lab, the specs will be DFAs.

A lexer runs DFAs on a string, searching for the longest prefix which is accepted by at least one of the DFAs.
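
A minimal sketch of this search for a single DFA, assuming a hypothetical DFA interface with a start state `dfa.start`, a transition function `dfa.step(state, symbol)` returning the next state (or None once the match cannot continue) and an acceptance test `dfa.is_final(state)`; the names in last lab's code may differ:

    def longest_accepted_prefix(dfa, word):
        """Return the length of the longest prefix of `word` accepted by `dfa`,
        or -1 if no prefix (not even the empty one) is accepted."""
        state = dfa.start
        best = 0 if dfa.is_final(state) else -1
        for i, symbol in enumerate(word):
            state = dfa.step(state, symbol)
            if state is None:        # no way to extend the match any further
                break
            if dfa.is_final(state):
                best = i + 1         # the prefix word[:i + 1] is accepted
        return best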


3.1.1. Suppose $A_1$ is a DFA and $w$ = aabaaabb is a word. Find the longest prefix of $w$ which is accepted by $A_1$.


When such a prefix is found, it is reported as a new lexeme. The DFAs are then reset to their initial configurations and the search starts over on the remaining input.
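
Putting the pieces together, the main loop might look like the sketch below, which reuses `longest_accepted_prefix` from above (same assumed DFA interface):

    def tokenize(dfas, word):
        """Repeatedly strip the longest prefix accepted by some DFA in `dfas`.
        Returns a list of (dfa_index, lexeme) pairs; stops early if nothing matches."""
        lexemes = []
        rest = word
        while rest:
            # Every DFA starts again from its initial configuration on the remaining input.
            lengths = [longest_accepted_prefix(dfa, rest) for dfa in dfas]
            best = max(lengths)
            if best <= 0:
                break                       # no DFA accepts a non-empty prefix: stop
            index = lengths.index(best)     # ties between DFAs are discussed below
            lexemes.append((index, rest[:best]))
            rest = rest[best:]              # the search starts over on the suffix
        return lexemes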


3.1.2. Split the word $w$ = ababbaabbabaab using $A_2$ as the only token.

3.1.3. Given DFAs $A_3$, $A_4$ and $A_5$, use them to split the word $w$ = abaaabbabaaaab into lexemes.

When two or more DFAs match the same (longest) prefix, the first one in priority order is selected. An interesting question is whether maximal matching can be replaced by priorities alone. The following exercise illustrates why this is not the case.
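
In the sketch above, this rule is exactly what `lengths.index(best)` implements: among DFAs matching the same maximal length, the one listed first (the highest-priority one) wins. Written out explicitly, under the same assumed interface:

    def pick_match(dfas, rest):
        """Choose the winning DFA at the current position: longest match first;
        on equal lengths, the DFA appearing earlier in `dfas` (higher priority)."""
        best_len, best_idx = 0, None
        for i, dfa in enumerate(dfas):           # dfas is listed in priority order
            length = longest_accepted_prefix(dfa, rest)
            if length > best_len:                # a later DFA wins only by matching
                best_len, best_idx = length, i   # a strictly longer prefix
        return best_len, best_idx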


3.2.1. Let us assume that a lexer splits the input by the first-match principle: we check whether the first DFA in our list matches a prefix, then move on to the next DFA in the list, and so forth. If no DFA has matched the prefix p[1:n], we try the prefix p[1:n+1] (a sketch of this strategy is given after the questions below).

Let:

  • $A$ be a DFA which matches lowercase character sequences ([a-z]+) ending with a whitespace (e.g. “aba ”);
  • $B$ be a DFA which matches “def ” (the four-character sequence, including the trailing whitespace).

Let $w$ = “def deffunction ”.

Suppose:

  • $A$ has higher priority than $B$. How will the string be split? (What are the lexemes?)
  • $B$ has higher priority than $A$. What will the splitting look like?
  • finally, let us return to the maximal-match principle. How should the DFAs $A$ and $B$ be ordered (w.r.t. priority) so that our word is split in the correct way (assuming Python syntax)?
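
For reference, a minimal sketch of the first-match strategy from 3.2.1, using the same hypothetical DFA interface as before (full-word acceptance is what matters here, so an `accepts` helper is included):

    def accepts(dfa, word):
        """True iff `dfa` accepts `word` in full."""
        state = dfa.start
        for symbol in word:
            state = dfa.step(state, symbol)
            if state is None:
                return False
        return dfa.is_final(state)

    def first_match_split(dfas, word):
        """First-match principle: grow the candidate prefix one symbol at a time
        and cut it off at the first DFA (in priority order) that accepts it."""
        lexemes, rest = [], word
        while rest:
            cut = next(((n, i) for n in range(1, len(rest) + 1)
                               for i, dfa in enumerate(dfas)
                               if accepts(dfa, rest[:n])), None)
            if cut is None:
                break                  # no DFA accepts any prefix of the rest
            n, i = cut
            lexemes.append((i, rest[:n]))
            rest = rest[n:]
        return lexemes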

3.3.1. Implement a three-DFA lexer with DFAs $A_3$, $A_4$ and $A_5$. You can use the code from last lab to directly instantiate the three DFAs. The input should be a word, and the output should be a string of the form <token_1>:<lexeme_1> … <token_n>:<lexeme_n>, where <token_i> is the DFA's id (from 3 to 5) and <lexeme_i> is the matched lexeme.
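
One possible shape for the output step, reusing the `tokenize` sketch above; `a3`, `a4` and `a5` are placeholders for the three DFAs instantiated with last lab's code:

    def lexer_output(lexemes, first_id=3):
        """Render (dfa_index, lexeme) pairs as '<token_1>:<lexeme_1> ... <token_n>:<lexeme_n>',
        numbering tokens from `first_id` (indices 0, 1, 2 become ids 3, 4, 5)."""
        return " ".join(f"{first_id + i}:{lexeme}" for i, lexeme in lexemes)

    # e.g. print(lexer_output(tokenize([a3, a4, a5], "abaaabbabaaaab")))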