===== An introduction to JFlex =====
==== What is missing in the Haskell analyser? ====
Our Haskell analyser is generic (independent of the language being scanned); however, it lacks certain features which are essential for //production// development:
* the regular expressions are defined as part of the code, together with the functions ''String -> Token''. These could be **generated automatically** by simply //parsing// the regular expressions from a spec file;
* analysers usually perform specific **actions** when certain tokens are found (e.g. adding a variable to a list); in our Haskell approach, the list of tokens is built first, and the actions are assumed to be performed in a subsequent phase. This may be inefficient, as well as counter-intuitive for the programmer. We would like to assign an action to each regular expression, which is executed as soon as the regular expression is matched.
We will now briefly illustrate a tool which incorporates these features. The tool is JFlex, and it requires programmers to develop their code in Java. (An alternative for Haskell, called Alex, also exists.)
JFlex receives as input a spec file ''*.flex'', which, informally, contains a list of regular expressions to be searched, as well as actions (Java code) to be executed once a regular expression is matched. After compiling the flex file, JFlex outputs a **Java class**, which implements a scanner. The Java class can be included in a larger project with extended functionality.
==== Installing JFlex ====
A complete, platform-dependent set of installation instructions can be found [[http://jflex.de/installing.html|here]]. In a nutshell, JFlex is invoked through the command-line tool ''jflex''.
==== The structure of a flex file ====
Consider the following simple JFlex file:
import java.util.*;
%%
%class HelloLexer
%standalone
%{
public Integer words = 0;
%}
LineTerminator = \r|\n|\r\n
%%
[a-zA-Z]+ { words+=1; }
{LineTerminator} { /* do nothing*/ }
Suppose the above file is called ''Hello.flex''. Running the command ''jflex Hello.flex'' will generate a Java class which implements a lexer.
Each JFlex file (such as the above) contains five sections:
* the first section, which ends at the first occurrence of '' % % '', contains declarations (such as imports) which will be added at the beginning of the generated Java file.
* the second section, right after '' % % '' and until ''%{'', contains a sequence of options for JFlex. Here, we use two options:
* ''%class HelloLexer'' tells JFlex that the generated lexer class should be named ''HelloLexer'';
* ''%standalone'' tells JFlex to print any unmatched input text to standard output and continue scanning.
* More details regarding possible options can be found in the [[http://jflex.de/manual.pdf|JFlex docs]].
* the third section, delimited by ''%{'' and ''%}'', contains declarations which will be copied into the body of the generated lexer class. Here we declare a public variable ''words''.
* the fourth section contains regular expression **declarations**. Here, we have declared ''LineTerminator'' to be the regular expression ''\r|\n|\r\n''. Declarations can be used to build more complicated regexps from simple ones, and can also be referred to in the fifth section of the flex file.
* the fifth section contains rules and actions: a rule specifies a regular expression to be scanned, as well as the appropriate action to be taken, when a word satisfying the regexp is found:
* the rule ''[a-zA-Z]+ { words+=1; }'' states that whenever ''[a-zA-Z]+'' (a regexp defined inline) is matched by a word, ''words+=1;'' should be executed;
* the rule ''{LineTerminator} { /* do nothing*/ }'' refers to the regexp declared above (note the curly braces around its name); here no action is executed;
* JFlex will always scan for the **longest** input word which satisfies a regexp. When a word satisfies more than one regexp the **first** one from the flex file will be matched.
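The longest-match rule and its first-rule tie-break can be sketched in plain Java. This is only an illustration built on ''java.util.regex'', not how JFlex is implemented (JFlex compiles the rules into an automaton); the class ''LongestMatchSketch'', the rule names ''KEYWORD'' and ''WORD'', and the helper methods are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchSketch {
    // Hypothetical rule table: insertion order stands in for the order of rules in a flex file.
    public static Map<String, Pattern> demoRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("KEYWORD", Pattern.compile("MOD"));
        rules.put("WORD", Pattern.compile("[a-zA-Z]+"));
        return rules;
    }

    // Returns the name of the rule selected at position pos, or null if no rule matches there.
    public static String choose(Map<String, Pattern> rules, String input, int pos) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            // lookingAt() anchors the match at pos; for these two patterns the
            // greedy match is also the longest possible one.
            if (m.lookingAt()) {
                int len = m.end() - pos;
                if (len > bestLen) { // strict '>' keeps the earlier rule on ties
                    bestLen = len;
                    best = rule.getKey();
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Pattern> rules = demoRules();
        System.out.println(choose(rules, "MOD", 0));  // KEYWORD: equal length, first rule wins
        System.out.println(choose(rules, "MODx", 0)); // WORD: the longer match wins
    }
}
```

Swapping the two ''put'' calls in ''demoRules'' would make ''WORD'' win for the input ''MOD'' as well, which is why the order of rules matters whenever two regexps can match the same text.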
==== Compiling a Hello World project ====
After performing:
jflex Hello.flex
we obtain ''HelloLexer.java'' which contains the ''HelloLexer'' public class implementing our lexer. We can easily include this class in our project, e.g.:
import java.io.*;
import java.util.*;
public class Hello {
public static void main (String[] args) throws IOException {
HelloLexer l = new HelloLexer(new FileReader(args[0]));
l.yylex();
System.out.println(l.words);
}
}
* Note that the lexer constructor receives a Java ''Reader'' as input (other options are possible, see the docs), and that we take the name of the file to be scanned from the first command-line argument (''args[0]'').
* Each lexer implements the method ''yylex'' which starts the scanning process.
After compiling:
javac HelloLexer.java Hello.java
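and running the program on the file to be scanned (here a hypothetical ''input.txt'' containing six words):
java Hello input.txt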
we obtain:
6
at standard output.
Recall that the option ''standalone'' tells the lexer to print unmatched text. In our example, the unmatched text consists of whitespace characters (other than line terminators), so these are echoed to standard output as well.
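To see which characters end up echoed, the behaviour of the two rules from ''Hello.flex'' can be imitated by hand. The sketch below is an approximation written with ''java.util.regex'', not the code JFlex generates; the class ''StandaloneSketch'' and its fields are hypothetical, and the echoed text is collected in a buffer rather than printed directly so that it can be inspected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StandaloneSketch {
    // Group 1 mirrors the rule [a-zA-Z]+; the remaining alternatives mirror LineTerminator.
    private static final Pattern RULES = Pattern.compile("([a-zA-Z]+)|\\r\\n|\\r|\\n");

    public int words = 0;                                   // plays the role of the lexer's 'words' field
    public final StringBuilder echoed = new StringBuilder(); // text that no rule matched

    public void scan(String input) {
        Matcher m = RULES.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (m.lookingAt()) {
                if (m.group(1) != null) {
                    words += 1;       // action of the word rule
                }                     // line terminators: do nothing
                pos = m.end();
            } else {
                echoed.append(input.charAt(pos)); // unmatched character is echoed
                pos += 1;
            }
        }
    }

    public static void main(String[] args) {
        StandaloneSketch s = new StandaloneSketch();
        s.scan("one two three\nfour five six\n");
        System.out.println("[" + s.echoed + "]"); // the four spaces between the words
        System.out.println(s.words);              // 6
    }
}
```

Adding a rule that matches spaces and tabs with an empty action, as is done for ''WS'' in the next section, would leave nothing to echo for this input.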
==== Application - parsing expressions ====
Consider the following BNF grammar which describes a different variant of IMP arithmetic expressions:
<val> ::= [0-9]+
<var> ::= [A-Z][a-z]*[0-9]*
<op> ::= "+" | "MOD"
<atom> ::= <val> | <var>
<expr> ::= <atom> | <expr> <op> <expr>
The following are examples of expressions:
X + 1
A + 1 MOD Bx + Cy0
We start this exercise by first identifying the regular expressions we are interested in. The flex file is given below:
import java.util.*;
%%
%class ExprLexer
%standalone
%{
public Expr crtexpr = null;
public String crtop = null;
%}
LineTerminator = \r|\n|\r\n
WS = (" "|\t)+
op = "+"|"MOD"
alfastream = [a-zA-Z]+
digitstream = [0-9]+
var = [A-Z]{alfastream}?{digitstream}?
val = {digitstream}
%%
{op} {crtop = yytext();}
{var} { if (crtop == null) crtexpr = new Var(yytext()); else crtexpr = new Binary(crtop,crtexpr,new Var(yytext())); }
{WS} {}
{LineTerminator} { /* do nothing*/ }
Note that we have opted to define the regular expression ''var'' in terms of other regular expressions. The regexp ''e?'' should be read as //zero or one occurrence of e//. Also note that the text ''MOD'' matches ''op'' as well as ''var'' (an uppercase letter followed by an ''alfastream''). Since both matches have the same length, the rule listed first is chosen, so the ''{op}'' rule has to be listed before the ''{var}'' rule for ''MOD'' to be scanned as an operator.
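The effect of the optional parts, and the overlap on ''MOD'', can be checked with ''java.util.regex'', whose ''?'' and ''+'' operators have the same meaning for these simple patterns. The patterns below are hand-made translations of the ''var'' and ''op'' declarations with the macros expanded; the class ''OptionalPartsSketch'' and its helpers are hypothetical.

```java
import java.util.regex.Pattern;

public class OptionalPartsSketch {
    // Translation of: var = [A-Z]{alfastream}?{digitstream}?
    private static final Pattern VAR = Pattern.compile("[A-Z]([a-zA-Z]+)?([0-9]+)?");
    // Translation of: op = "+"|"MOD"
    private static final Pattern OP = Pattern.compile("\\+|MOD");

    // matches() requires the whole string to match, not just a prefix.
    public static boolean isVar(String s) { return VAR.matcher(s).matches(); }
    public static boolean isOp(String s)  { return OP.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isVar("A"));    // true: both optional parts absent
        System.out.println(isVar("Bx"));   // true: letters present, digits absent
        System.out.println(isVar("Cy0"));  // true: both parts present
        System.out.println(isVar("x"));    // false: must start with an uppercase letter
        System.out.println(isVar("A1b"));  // false: letters may not follow the digits
        System.out.println(isVar("MOD") && isOp("MOD")); // true: both regexps accept MOD
    }
}
```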
Once the regular expressions are defined, we proceed to the program logic:
* we would like to store the currently-scanned expression (as an ''Expr'' public variable)
* as well as the currently-scanned operator (a ''String'')
Whenever a variable is scanned, we store it as a new expression if no operator has been previously scanned. Otherwise, we use the existing operator to combine the current expression and the new variable into a new (sub)expression.
Note that the above program doesn't handle malformed inputs well. Can you identify such cases?
Finally, we add the data-structures required to hold parsed expressions as well as the main test class:
import java.io.*;
import java.util.*;
interface Expr {}
abstract class Atom implements Expr {
private String name;
public Atom (String name) {this.name = name;}
@Override
public String toString () {return "{"+this.name+"}";}
}
class Binary implements Expr {
private Expr l,r;
private String op;
public Binary (String op, Expr l, Expr r) {this.op = op; this.l = l; this.r = r;}
@Override
public String toString () {return l.toString()+"<"+op+">"+r.toString();}
}
class Var extends Atom {
public Var (String s) {super(s);}
}
class Val extends Atom {
public Val (String s) {super(s);}
}
public class Test {
public static void main (String[] args) throws IOException {
ExprLexer l = new ExprLexer(new FileReader(args[0]));
l.yylex();
System.out.println(l.crtexpr);
}
}
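The way the two lexer actions cooperate can be tried out without running JFlex, by applying the same logic to a list of tokens that has already been split. The sketch below is a simplified, self-contained stand-in: the class ''FoldSketch'', its nested classes and the ''build'' method are hypothetical, and every token other than ''+'' and ''MOD'' is treated as a variable.

```java
public class FoldSketch {
    interface Expr {}

    static class Var implements Expr {
        private final String name;
        Var(String name) { this.name = name; }
        @Override public String toString() { return "{" + name + "}"; }
    }

    static class Binary implements Expr {
        private final String op;
        private final Expr l, r;
        Binary(String op, Expr l, Expr r) { this.op = op; this.l = l; this.r = r; }
        @Override public String toString() { return l + "<" + op + ">" + r; }
    }

    // Replays the actions of the {op} and {var} rules over pre-split tokens.
    public static Expr build(String... tokens) {
        Expr crtexpr = null;   // currently-scanned expression
        String crtop = null;   // currently-scanned operator
        for (String t : tokens) {
            if (t.equals("+") || t.equals("MOD")) {
                crtop = t;                                   // action of the {op} rule
            } else if (crtop == null) {
                crtexpr = new Var(t);                        // first operand
            } else {
                crtexpr = new Binary(crtop, crtexpr, new Var(t)); // extend to the right
            }
        }
        return crtexpr;
    }

    public static void main(String[] args) {
        System.out.println(build("A", "+", "Bx", "MOD", "Cy0")); // {A}<+>{Bx}<MOD>{Cy0}
    }
}
```

Because each new operand is attached to everything scanned so far, the expression is grouped from left to right, i.e. as ''(A + Bx) MOD Cy0'', regardless of any precedence one might expect between ''+'' and ''MOD''.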