===== An introduction to JFlex =====

==== What is missing in the Haskell analyser? ====

Our Haskell analyser is generic (independent of the language being scanned); however, it lacks certain features which are essential for //production// development:
  * the regular expressions are defined as part of the code, together with the functions ''String -> Token''. These could be **generated automatically** by simply //parsing// the regular expressions from a spec file;
  * analysers usually perform specific **actions** when certain tokens are found (e.g. add a variable to a list, etc.); in our Haskell approach, the list of tokens is built first, and the actions are assumed to be performed in a subsequent phase. This may be inefficient, as well as counter-intuitive for the programmer. We would like to assign an action to each regular expression, which is executed once the regular expression is matched.

We will now briefly illustrate a tool which incorporates these features. The tool is JFlex, and it requires programmers to develop their code in Java. (An alternative for Haskell, called Alex, also exists.)

JFlex receives as input a spec file ''*.flex'', which, informally, contains a list of regular expressions to be searched, as well as actions (Java code) to be executed once a regular expression is matched. After compiling the flex file, JFlex outputs a **Java class**, which implements a scanner. The Java class can be included in a larger project with extended functionality.

==== Installing JFlex ====

A complete, platform-dependent set of installation instructions can be found [[http://jflex.de/installing.html| here]]. In a nutshell, JFlex comes as a binary app ''jflex''.

==== The structure of a flex file ====

Consider the following simple JFlex file:
<code java>
import java.util.*;

%%

%class HelloLexer
%standalone

%{
  public Integer words = 0;
%}

LineTerminator = \r|\n|\r\n

%%   

[a-zA-Z]+ { words+=1; }
{LineTerminator} { /* do nothing*/ }
</code>

Suppose the above file is called ''Hello.flex''. Running the command ''jflex Hello.flex'' will generate a Java class which implements a lexer.

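Each JFlex file, such as the one above, can be read as having 5 sections (the JFlex manual itself groups the second, third and fourth into a single "options and declarations" section):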
  * the first section, which ends at the first occurrence of '' % % '' contains declarations which will be added at the beginning of the Java class file.
  * the second section, right after '' % % '' and until ''%{'', contains a sequence of options for JFlex. Here, we use two options:
      * ''%class HelloLexer'' tells JFlex that the generated Java class should be named ''HelloLexer''
      * ''%standalone'' tells JFlex to generate a ''main'' method in the lexer class, and to print any unmatched input to standard output and continue scanning.
      * More details regarding possible options can be found in the [[http://jflex.de/manual.pdf|JFlex docs]].
  * the third section, delimited by ''%{'' and ''%}'', contains declarations which will be copied into the body of the generated lexer class. Here we declare a public variable ''words''.
  * the fourth section contains regular expression **declarations**. Here, we have declared ''LineTerminator'' to be the regular expression ''\r | \n | \r\n''. Declarations can be used to build more complicated regexps from simple ones, and can also be referred to in the fifth section of the flex file.
  * the fifth section contains rules and actions: a rule specifies a regular expression to be scanned, as well as the appropriate action to be taken, when a word satisfying the regexp is found:
    * the rule ''[a-zA-Z]+ { words+=1; }'' states that whenever ''[a-zA-Z]+'' (a regexp defined inline) is matched by a word, ''words+=1;'' should be executed;
    * the rule ''{LineTerminator} { /* do nothing*/ }'' refers to the regexp declared above (note the curly braces around its name); here no action should be executed;
    * JFlex will always scan for the **longest** input word which satisfies a regexp. When a word satisfies more than one regexp the **first** one from the flex file will be matched.
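
The matching policy from the last bullet (longest match, ties broken by rule order) can be sketched in plain Java. This is only an illustration under stated assumptions, not what JFlex generates: the class ''LongestMatchSketch'', its rule names and its four rules are hypothetical, and ''java.util.regex'' is used instead of the automaton JFlex builds. The patterns are kept free of alternation so that the greedy match of ''java.util.regex'' coincides with the longest match.

<code java>
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of JFlex or of any generated lexer.
class LongestMatchSketch {

  // Rules in priority order, like the rule list at the end of a flex file.
  static final String[] NAMES = { "KEYWORD", "IDENT", "NUMBER", "SPACE" };
  static final Pattern[] RULES = {
    Pattern.compile("if"),
    Pattern.compile("[a-z]+"),
    Pattern.compile("[0-9]+"),
    Pattern.compile(" +")
  };

  static List<String> scan(String input) {
    List<String> out = new ArrayList<>();
    int pos = 0;
    while (pos < input.length()) {
      int bestLen = 0;
      int bestRule = -1;
      for (int i = 0; i < RULES.length; i++) {
        Matcher m = RULES[i].matcher(input);
        m.region(pos, input.length());
        // lookingAt() only accepts a match that starts exactly at pos;
        // the strict '>' keeps the earlier rule when two matches have equal length
        if (m.lookingAt() && m.end() - pos > bestLen) {
          bestLen = m.end() - pos;
          bestRule = i;
        }
      }
      if (bestRule < 0) {
        pos++;            // no rule matches here: skip one character
        continue;
      }
      out.add(NAMES[bestRule] + ":" + input.substring(pos, pos + bestLen));
      pos += bestLen;
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(scan("if iffy 42"));
  }
}
</code>

On the input ''if iffy 42'', the text ''if'' is matched by both the first and the second rule with the same length, so the first rule (''KEYWORD'') is chosen, whereas ''iffy'' is reported as ''IDENT'' because the second rule yields a longer match.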

==== Compiling a Hello World project ====

After performing:
<code>
jflex Hello.flex
</code>

we obtain ''HelloLexer.java'' which contains the ''HelloLexer'' public class implementing our lexer. We can easily include this class in our project, e.g.:

<code java>
import java.io.*;
import java.util.*;

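public class Hello {
  public static void main (String[] args) throws IOException {
    // the file to scan is given as the first command-line argument
    HelloLexer l = new HelloLexer(new FileReader(args[0]));

    // scan the whole input; the rule actions update l.words
    l.yylex();

    // number of words matched by [a-zA-Z]+
    System.out.println(l.words);
  }
}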
</code>
  * Note that the lexer constructor receives a Java ''Reader'' as input (other options are possible, see the docs), and we take the name of the file to be scanned from the first command-line argument.
  * Each lexer implements the method ''yylex'' which starts the scanning process.

After compiling:
<code>
javac HelloLexer.java Hello.java
</code>

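and running it with the name of the file to scan as its argument (''<input-file>'' is a placeholder):

<code>
java Hello <input-file>
</code>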

we obtain:
<code>
 
 
 6
</code>
on standard output.

Recall that the option ''standalone'' tells the lexer to print unmatched input. In our example, the unmatched input consists of whitespace characters.
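
To make the effect of ''standalone'' more concrete, here is a minimal plain-Java sketch that imitates the behaviour described above: words are counted, line terminators are silently consumed, and every other character is echoed. The class ''StandaloneSketch'' and its fields are hypothetical stand-ins written with ''java.util.regex''; the real ''HelloLexer'' is generated by JFlex and looks different.

<code java>
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the generated HelloLexer; it only imitates its observable behaviour.
class StandaloneSketch {
  static final Pattern WORD = Pattern.compile("[a-zA-Z]+");
  // "\r\n" is listed first so that a Windows line ending is consumed as one unit
  static final Pattern LINE_TERMINATOR = Pattern.compile("\r\n|\r|\n");

  int words = 0;                                    // same role as the 'words' field in Hello.flex
  final StringBuilder echoed = new StringBuilder(); // what would be printed for unmatched input

  void scan(String input) {
    int pos = 0;
    while (pos < input.length()) {
      Matcher w = WORD.matcher(input).region(pos, input.length());
      Matcher t = LINE_TERMINATOR.matcher(input).region(pos, input.length());
      if (w.lookingAt()) {
        words += 1;                                 // action of the [a-zA-Z]+ rule
        pos = w.end();
      } else if (t.lookingAt()) {
        pos = t.end();                              // action of {LineTerminator}: do nothing
      } else {
        echoed.append(input.charAt(pos));           // unmatched character: echo it and go on
        pos++;
      }
    }
  }

  public static void main(String[] args) {
    StandaloneSketch s = new StandaloneSketch();
    s.scan("one two three\nfour five six\n");
    System.out.println(s.echoed);
    System.out.println(s.words);
  }
}
</code>

For the sample string used in ''main'' (an assumption, not the input used in the run above), the sketch counts six words and echoes only the four spaces that separate them.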

==== Application - parsing expressions ====

Consider the following BNF grammar which describes a different variant of IMP arithmetic expressions:
<code>
<val> ::= [0-9]+
<var> ::= [A-Z][a-z]*[0-9]*
<op> ::= "+" | "MOD"
<atom> ::= <val> | <var>
<expr> ::= <atom> | <atom> <op> <expr>
</code>

The following are examples of expressions:
<code>
X + 1
A + 1 MOD Bx + Cy0
</code>

We start this exercise by first identifying the regular expressions we are interested in. The flex file is given below:
<code>
import java.util.*;

%%

%class ExprLexer
%standalone

%{
      public Expr crtexpr = null;
      public String crtop = null;
%}

LineTerminator = \r|\n|\r\n
WS             = (" "|\t)+

op             = "+"|"MOD"
alfastream     = [a-zA-Z]+
digitstream    = [0-9]+
var            = [A-Z]{alfastream}?{digitstream}?
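val            = {digitstream}

%%   

/* {op} must come before {var}: "MOD" matches both, and on a tie the first rule wins */
{op} {crtop = yytext();}

{var} { if (crtop == null) crtexpr = new Var(yytext()); else crtexpr = new Binary(crtop,crtexpr,new Var(yytext())); }
/* values are handled like variables, but wrapped in Val */
{val} { if (crtop == null) crtexpr = new Val(yytext()); else crtexpr = new Binary(crtop,crtexpr,new Val(yytext())); }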

{WS} {}

{LineTerminator} { /* do nothing*/ }
</code>

Note that we have opted to define the regular expression ''var'' in terms of other regular expressions. The regexp ''e?'' should be read as //zero or one occurrence of e//. Also note that the text ''MOD'' matches ''op'' as well as ''var'' (an uppercase letter followed by an ''alfastream''); since both matches have the same length, JFlex picks the rule which is listed first, which is why the rule for ''{op}'' must appear before the rule for ''{var}''.
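
The composition of ''var'' from smaller regexps, and the ambiguity of ''MOD'', can be checked with a small plain-Java sketch. It assumes that the JFlex declarations translate directly into ''java.util.regex'' syntax, which holds for these simple patterns; the class ''VarPatternSketch'' and its helper methods are hypothetical.

<code java>
import java.util.regex.Pattern;

// Hypothetical helper; it mirrors the declarations of the flex file with java.util.regex.
class VarPatternSketch {
  static final String ALFASTREAM  = "[a-zA-Z]+";
  static final String DIGITSTREAM = "[0-9]+";

  // [A-Z]{alfastream}?{digitstream}? : each optional part occurs zero or one time
  static final Pattern VAR = Pattern.compile("[A-Z](" + ALFASTREAM + ")?(" + DIGITSTREAM + ")?");
  static final Pattern OP  = Pattern.compile("\\+|MOD");

  // matches() succeeds only if the whole string satisfies the pattern
  static boolean isVar(String s) { return VAR.matcher(s).matches(); }
  static boolean isOp(String s)  { return OP.matcher(s).matches(); }

  public static void main(String[] args) {
    for (String s : new String[] { "A", "Bx", "Cy0", "x", "A1b", "MOD" }) {
      System.out.println(s + " var=" + isVar(s) + " op=" + isOp(s));
    }
  }
}
</code>

''A'', ''Bx'' and ''Cy0'' satisfy ''var''; ''x'' (no leading uppercase letter) and ''A1b'' (a letter after the digits) do not; ''MOD'' satisfies both ''var'' and ''op'', which is exactly the tie that rule order has to resolve.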

Once the regular expressions are defined, we proceed to the program logic:
  * we would like to store the currently-scanned expression (as an ''Expr'' public variable),
  * as well as the currently-scanned operator (a ''String'').

Whenever a variable is parsed, we add it as a new expression, if no operator has been previously scanned. Otherwise, we use the existing operator to create a new (sub) expression.

Note that the above program doesn't handle malformed inputs well. Can you identify such cases?
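
To see what the actions compute on a well-formed input, they can be replayed by hand in plain Java. The sketch below is an illustration only: the class ''ActionTraceSketch'' is hypothetical, it receives an already tokenised input instead of running the lexer, and it carries its own minimal nested ''Var'' and ''Binary'' classes so that it is self-contained.

<code java>
import java.util.List;

// Hypothetical replay of the {var} and {op} actions; no JFlex involved.
class ActionTraceSketch {
  interface Expr {}

  static class Var implements Expr {
    private final String name;
    Var(String name) { this.name = name; }
    @Override public String toString() { return "{" + name + "}"; }
  }

  static class Binary implements Expr {
    private final String op;
    private final Expr l, r;
    Binary(String op, Expr l, Expr r) { this.op = op; this.l = l; this.r = r; }
    @Override public String toString() { return l + "<" + op + ">" + r; }
  }

  Expr crtexpr = null;   // currently-scanned expression
  String crtop = null;   // currently-scanned operator

  // same logic as the action attached to {var}
  void onVar(String text) {
    if (crtop == null) crtexpr = new Var(text);
    else crtexpr = new Binary(crtop, crtexpr, new Var(text));
  }

  // same logic as the action attached to {op}
  void onOp(String text) { crtop = text; }

  static String replay(List<String> tokens) {
    ActionTraceSketch s = new ActionTraceSketch();
    for (String t : tokens) {
      if (t.equals("+") || t.equals("MOD")) s.onOp(t);
      else s.onVar(t);
    }
    return String.valueOf(s.crtexpr);
  }

  public static void main(String[] args) {
    System.out.println(replay(List.of("A", "+", "Bx", "MOD", "Cy0")));
  }
}
</code>

For the tokens ''A'', ''+'', ''Bx'', ''MOD'', ''Cy0'' the replay yields ''{A}<+>{Bx}<MOD>{Cy0}''. Each new variable becomes the right operand of a ''Binary'' whose left operand is everything scanned so far, so the resulting tree nests to the left, although the printed form does not show this.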

Finally, we add the data-structures required to hold parsed expressions as well as the main test class:

<code java>
import java.io.*;
import java.util.*;

interface Expr {}

abstract class Atom implements Expr {
  private String name;
  public Atom (String name) {this.name = name;}
  @Override
  public String toString () {return "{"+this.name+"}";}
}

class Binary implements Expr {
  private Expr l,r;
  private String op;
  public Binary (String op, Expr l, Expr r) {this.op = op; this.l = l; this.r = r;}
  @Override
  public String toString () {return l.toString()+"<"+op+">"+r.toString();} 
}


class Var extends Atom {
  public Var (String s) {super(s);}
}
class Val extends Atom {
    public Val (String s) {super(s);}
}

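public class Test {

  public static void main (String[] args) throws IOException {
    // the file to scan is given as the first command-line argument
    ExprLexer l = new ExprLexer(new FileReader(args[0]));

    l.yylex();

    // the expression built by the rule actions
    System.out.println(l.crtexpr);
  }
}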
</code>