What is missing in the Haskell analyser?

Our Haskell analyser is generic (independent of the language being scanned); however, it lacks certain features which are essential for production development:

  • the regular expressions are defined as part of the code, together with the functions String → Token. These could be generated automatically by simply parsing the regular expressions from a spec file;
  • analysers usually perform specific actions when certain tokens are found (e.g. add a variable to a list, etc.); in our Haskell approach, the list of tokens is built first, and it is assumed that the set of actions is performed in a subsequent phase. This may be inefficient, as well as counter-intuitive for the programmer. We would like to assign an action to each regular expression, which is executed once the regular expression is matched.
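
To make the last point concrete, here is a minimal sketch in plain Java of a scanner in which each regular expression carries its own action. The class ActionScanner and its rule table are hypothetical and only illustrate the idea; generated lexers work differently internally (they compile all rules into a single automaton).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mini-scanner: each regular expression carries its own action.
class ActionScanner {
    int words = 0;     // updated by the action of the "word" rule
    int numbers = 0;   // updated by the action of the "number" rule

    // Rules in priority order; LinkedHashMap preserves insertion order.
    private final Map<Pattern, Consumer<String>> rules = new LinkedHashMap<>();

    ActionScanner() {
        rules.put(Pattern.compile("[a-zA-Z]+"), lexeme -> words++);
        rules.put(Pattern.compile("[0-9]+"), lexeme -> numbers++);
        rules.put(Pattern.compile("\\s+"), lexeme -> { /* skip whitespace */ });
    }

    // Scans left to right and runs an action as soon as its rule matches.
    void scan(String input) {
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<Pattern, Consumer<String>> rule : rules.entrySet()) {
                Matcher m = rule.getKey().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > pos) {
                    rule.getValue().accept(m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) pos++; // no rule applies: skip one character
        }
    }

    public static void main(String[] args) {
        ActionScanner s = new ActionScanner();
        s.scan("let x1 = 42");
        System.out.println(s.words + " words, " + s.numbers + " numbers"); // 2 words, 2 numbers
    }
}
```

No intermediate token list is built here: the counters are updated while the input is being read.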

We will now briefly illustrate a tool which incorporates these features. The tool is JFlex and requires programmers to develop their code in Java. (An alternative for Haskell, called Alex, also exists.)

JFlex receives as input a spec file *.flex, which, informally, contains a list of regular expressions to be searched, as well as actions (Java code) to be executed once a regular expression is matched. After compiling the flex file, JFlex outputs a Java class, which implements a scanner. The Java class can be included in a larger project with extended functionality.

Installing JFlex

A complete, platform-dependent set of installation instructions can be found here. In a nutshell, JFlex comes as a binary app jflex.

The structure of a flex file

Consider the following simple JFlex file:

import java.util.*;
 
%%
 
%class HelloLexer
%standalone
 
%{
  public Integer words = 0;
%}
 
LineTerminator = \r|\n|\r\n
 
%%   
 
[a-zA-Z]+ { words+=1; }
{LineTerminator} { /* do nothing*/ }

Suppose the above file is called Hello.flex. Running the command jflex Hello.flex will generate a Java class which implements a lexer.

Each JFlex file (such as the one above) contains 5 sections:

  • the first section, which ends at the first occurrence of %%, contains declarations which will be added at the beginning of the Java class file.
  • the second section, right after %% and until %{, contains a sequence of options for jflex. Here, we use two options:
    • %class HelloLexer tells jflex that the generated Java class should be named HelloLexer
    • %standalone tells jflex to print any unmatched input to standard output and continue scanning.
    • More details regarding possible options can be found in the JFlex docs.
  • the third section, delimited by %{ and %}, contains declarations which will be copied into the lexer class. Here we declare a public variable words.
  • the fourth section contains regular expression declarations. Here, we have declared LineTerminator to be the regular expression \r | \n | \r\n. Declarations can be used to build more complicated regexps from simple ones, and can also be used in the fifth section of the flex file.
  • the fifth section contains rules and actions: a rule specifies a regular expression to be scanned, as well as the appropriate action to be taken, when a word satisfying the regexp is found:
    • the rule [a-zA-Z]+ { words+=1; } states that whenever [a-zA-Z]+ (a regexp defined inline) is matched by a word, words+=1; should be executed;
    • the rule {LineTerminator} { } refers to the regexp declared above (note the braces); here no action should be executed;
    • JFlex will always scan for the longest input word which satisfies a regexp. When a word satisfies more than one regexp the first one from the flex file will be matched.
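
The selection policy from the last item can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: the helper pick is hypothetical, and it uses greedy java.util.regex matching as a stand-in for longest match, which is adequate for the simple patterns shown but is not how JFlex is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of rule selection: longest match, then first rule.
class LongestMatch {
    // Returns the index of the winning rule at position pos, or -1 if none matches.
    static int pick(List<Pattern> rules, String input, int pos) {
        int best = -1;
        int bestLen = 0;
        for (int i = 0; i < rules.size(); i++) {
            Matcher m = rules.get(i).matcher(input);
            m.region(pos, input.length());
            // Strictly longer matches win; on ties the earlier rule is kept.
            if (m.lookingAt() && m.end() - pos > bestLen) {
                best = i;
                bestLen = m.end() - pos;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Pattern> rules = Arrays.asList(
            Pattern.compile("MOD"),        // rule 0: keyword
            Pattern.compile("[a-zA-Z]+")); // rule 1: identifier
        System.out.println(pick(rules, "MOD", 0));  // 0: same length, first rule wins
        System.out.println(pick(rules, "MODx", 0)); // 1: the longer match wins
    }
}
```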

Compiling a Hello World project

After performing:

jflex Hello.flex

we obtain HelloLexer.java which contains the HelloLexer public class implementing our lexer. We can easily include this class in our project, e.g.:

import java.io.*;
import java.util.*;
 
public class Hello {
  public static void main (String[] args) throws IOException {
    HelloLexer l = new HelloLexer(new FileReader(args[0]));
 
    l.yylex();
 
    System.out.println(l.words);
 
 
  }
}
  • Note that the lexer constructor receives a Java Reader as input (other options are possible, see the docs), and we take the name of the file to be scanned from the first command-line argument (args[0]).
  • Each lexer implements the method yylex which starts the scanning process.

After compiling:

javac HelloLexer.java Hello.java

and running the program on an input file, whose name is passed as the first argument:

java Hello <input-file>

we obtain:

 
 

 6

at standard output.

Recall that the option standalone tells the lexer to print unmatched words. In our example, those unmatched words are whitespace characters.

Application - parsing expressions

Consider the following BNF grammar which describes a different variant of IMP arithmetic expressions:

<val> ::= [0-9]+
<var> ::= [A-Z][a-z]*[0-9]*
<op> ::= "+" | "MOD"
<atom> ::= <val> | <var>
<expr> ::= <atom> | <atom> <op> <expr>

The following are examples of expressions:

X + 1
A + 1 MOD Bx + Cy0
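
As a quick sanity check of the lexical part of the grammar, the patterns for <var> and <val> can be tried out directly with java.util.regex. The class LexemeCheck and its helper names are hypothetical; the two patterns are copied from the grammar above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: tests single lexemes against <var> and <val>.
class LexemeCheck {
    static final Pattern VAR = Pattern.compile("[A-Z][a-z]*[0-9]*");
    static final Pattern VAL = Pattern.compile("[0-9]+");

    // matches() requires the whole string to fit the pattern.
    static boolean isVar(String s) { return VAR.matcher(s).matches(); }
    static boolean isVal(String s) { return VAL.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isVar("Cy0")); // true
        System.out.println(isVar("Bx"));  // true
        System.out.println(isVar("A1b")); // false: letters may not follow digits
        System.out.println(isVal("42"));  // true
        System.out.println(isVal("4x"));  // false
    }
}
```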

We start this exercise by first identifying the regular expressions we are interested in. The flex file is given below:

import java.util.*;

%%

%class ExprLexer
%standalone

%{
      public Expr crtexpr = null;
      public String crtop = null;
%}

LineTerminator = \r|\n|\r\n
WS             = (" "|\t)+

op             = "+"|"MOD"
alfastream     = [a-zA-Z]+
digitstream    = [0-9]+
var            = [A-Z]{alfastream}?{digitstream}?
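val            = {digitstream}

%%   

  /* op comes first: MOD also matches var, and on equal length the earlier rule wins */
{op} {crtop = yytext();}
{var} { if (crtop == null) crtexpr = new Var(yytext()); else crtexpr = new Binary(crtop,crtexpr,new Var(yytext())); }
{val} { if (crtop == null) crtexpr = new Val(yytext()); else crtexpr = new Binary(crtop,crtexpr,new Val(yytext())); }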

{WS} {}

{LineTerminator} { /* do nothing*/ }

Note that we have opted to define the regular expression var in terms of other regular expressions. The regexp e? should be read as zero or one occurrence of e. Also note that the text MOD matches op as well as var (an uppercase letter followed by an alfastream). Both matches have the same length, so the rule listed first in the rules section wins; that is why the rule for op must come before the rule for var.

Once the regular expressions are defined, we proceed to the program logic:

  • we would like to store the currently-scanned expression (as an Expr public variable)
  • as well as the currently-scanned operator (a String)

Whenever a variable is parsed, we add it as a new expression, if no operator has been previously scanned. Otherwise, we use the existing operator to create a new (sub) expression.

Note that the above program doesn't handle malformed inputs well. Can you identify such cases?

Finally, we add the data-structures required to hold parsed expressions as well as the main test class:

import java.io.*;
import java.util.*;
 
interface Expr {}
 
abstract class Atom implements Expr {
  private String name;
  public Atom (String name) {this.name = name;}
  @Override
  public String toString () {return "{"+this.name+"}";}
}
 
class Binary implements Expr {
  private Expr l,r;
  private String op;
  public Binary (String op, Expr l, Expr r) {this.op = op; this.l = l; this.r = r;}
  @Override
  public String toString () {return l.toString()+"<"+op+">"+r.toString();} 
}
 
 
class Var extends Atom {
  public Var (String s) {super(s);}
}
class Val extends Atom {
  public Val (String s) {super(s);}
}
 
public class Test {
 
  public static void main (String[] args) throws IOException {
    ExprLexer l = new ExprLexer(new FileReader(args[0]));
 
    l.yylex();
 
    System.out.println(l.crtexpr);
 
  }
}
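
To see what the actions compute without running JFlex, the following sketch replays them by hand on a sequence of lexemes. The class ActionReplay and the methods onOp, onVar and replay are hypothetical, and the nested Var and Binary classes are compact stand-ins for the ones above, so that the example compiles on its own.

```java
// Hypothetical replay of the lexer actions on an already-split input.
class ActionReplay {
    interface Expr {}

    static class Var implements Expr {
        final String name;
        Var(String name) { this.name = name; }
        @Override public String toString() { return "{" + name + "}"; }
    }

    static class Binary implements Expr {
        final String op;
        final Expr l, r;
        Binary(String op, Expr l, Expr r) { this.op = op; this.l = l; this.r = r; }
        @Override public String toString() { return l + "<" + op + ">" + r; }
    }

    Expr crtexpr = null;  // expression scanned so far
    String crtop = null;  // last operator scanned

    // Mirrors the action of the op rule.
    void onOp(String text) { crtop = text; }

    // Mirrors the action of the var rule.
    void onVar(String text) {
        Expr atom = new Var(text);
        crtexpr = (crtop == null) ? atom : new Binary(crtop, crtexpr, atom);
    }

    // Feeds the lexemes to the actions in order and prints the final expression.
    static String replay(String... lexemes) {
        ActionReplay r = new ActionReplay();
        for (String t : lexemes) {
            if (t.equals("+") || t.equals("MOD")) r.onOp(t); else r.onVar(t);
        }
        return String.valueOf(r.crtexpr);
    }

    public static void main(String[] args) {
        System.out.println(replay("A", "+", "Bx", "MOD", "Cy0")); // {A}<+>{Bx}<MOD>{Cy0}
    }
}
```

Note that each new atom becomes the right child of a Binary whose left child is everything scanned so far, so the resulting tree is nested to the left, although the grammar itself is written right-recursively.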