An introduction to JFlex
What is missing in the Haskell analyser?
Our Haskell analyser is generic (independent of the language being scanned); however, it lacks certain features which are essential for production development:
- the regular expressions are defined as part of the code, together with the functions String → Token. These could be generated automatically by simply parsing the regular expressions from a spec file;
- analysers usually perform specific actions when certain tokens are found (e.g. add a variable to a list); in our Haskell approach, the list of tokens is built first, and the actions are assumed to be performed in a subsequent phase. This may be inefficient, as well as counter-intuitive for the programmer. We would like to assign an action to each regular expression, which is executed once the regular expression is matched.
We will now briefly illustrate a tool which incorporates these features. The tool is JFlex, which requires programmers to develop their code in Java. (An alternative for Haskell, called Alex, also exists.)
JFlex receives as input a spec file *.flex, which, informally, contains a list of regular expressions to be searched, as well as actions (Java code) to be executed once a regular expression is matched. After compiling the flex file, JFlex outputs a Java class, which implements a scanner. The Java class can be included in a larger project with extended functionality.
Installing JFlex
A complete, platform-dependent set of installation instructions can be found in the JFlex documentation. In a nutshell, JFlex comes as a binary application, jflex.
The structure of a flex file
Consider the following simple JFlex file:
import java.util.*;

%%

%class HelloLexer
%standalone

%{
    public Integer words = 0;
%}

LineTerminator = \r|\n|\r\n

%%

[a-zA-Z]+         { words+=1; }
{LineTerminator}  { /* do nothing */ }
Suppose the above file is called Hello.flex. Running the command jflex Hello.flex will generate a Java class which implements a lexer.
Each JFlex file (such as the one above) contains 5 sections:
- the first section, which ends at the first occurrence of %%, contains declarations which will be added at the beginning of the Java class file.
- the second section, right after %% and until %{, contains a sequence of options for jflex. Here, we use two options:
  - %class HelloLexer tells jflex that the generated lexer class should be named HelloLexer;
  - %standalone tells jflex to print unmatched input to standard output and continue scanning.
  More details regarding possible options can be found in the JFlex docs.
- the third section, delimited by %{ and %}, contains declarations which will be copied into the lexer class. Here we declare a public variable words.
- the fourth section contains regular expression declarations. Here, we have declared LineTerminator to be the regular expression \r|\n|\r\n. Declarations can be used to build more complicated regexps from simple ones, and can also be used in the fifth section of the flex file.
- the fifth section contains rules and actions: a rule specifies a regular expression to be scanned, as well as the action to be taken when a word satisfying the regexp is found:
  - the rule [a-zA-Z]+ { words+=1; } states that whenever [a-zA-Z]+ (a regexp defined inline) is matched by a word, words+=1; should be executed;
  - the rule {LineTerminator} { } refers to the regexp defined above (note the braces); here no action should be executed.
JFlex will always scan for the longest input word which satisfies a regexp. When the same longest word satisfies more than one regexp, the rule listed first in the flex file is chosen.
Compiling a Hello World project
After performing:
jflex Hello.flex
we obtain HelloLexer.java, which contains the HelloLexer class implementing our lexer. We can easily include this class in our project, e.g.:
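import java.io.*;
import java.util.*;

public class Hello {
    public static void main (String[] args) throws IOException {
        // args[0] is the path of the file to be scanned
        HelloLexer l = new HelloLexer(new FileReader(args[0]));
        // run the scanner over the whole input
        l.yylex();
        // words was incremented by the action of the [a-zA-Z]+ rule
        System.out.println(l.words);
    }
}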
- Note that the lexer constructor receives a Java Reader as input (other options are possible, see the docs), and we take the name of the file to be scanned from the first command-line argument.
- Each lexer implements the method yylex, which starts the scanning process.
After compiling:
javac HelloLexer.java Hello.java
and running the program on an input file which contains six words:
java Hello <input-file>
we obtain:
6
at standard output.
Recall that the option %standalone tells the lexer to print unmatched input. In our example, the unmatched input consists of whitespace characters.
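To make the behaviour of the generated lexer more concrete, here is a minimal hand-written sketch of roughly what Hello.flex asks for, built on java.util.regex. It is not the code JFlex generates: the class name HelloByHand, the method scan and the field echoed are assumptions made for illustration, and unmatched characters are collected in a buffer instead of being printed as they are found.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-written counterpart of Hello.flex (not JFlex output).
class HelloByHand {
    static final Pattern WORD = Pattern.compile("[a-zA-Z]+");
    static final Pattern LINE_TERMINATOR = Pattern.compile("\\r\\n|\\r|\\n");

    int words = 0;                                   // same role as `words` in the spec
    final StringBuilder echoed = new StringBuilder(); // unmatched input, as %standalone would print

    void scan(String input) {
        int pos = 0;
        while (pos < input.length()) {
            Matcher w = WORD.matcher(input).region(pos, input.length());
            Matcher t = LINE_TERMINATOR.matcher(input).region(pos, input.length());
            if (w.lookingAt()) {          // rule: [a-zA-Z]+ { words+=1; }
                words += 1;
                pos = w.end();
            } else if (t.lookingAt()) {   // rule: {LineTerminator} { }
                pos = t.end();
            } else {                      // no rule matches: echo one character
                echoed.append(input.charAt(pos));
                pos++;
            }
        }
    }

    public static void main(String[] args) {
        HelloByHand h = new HelloByHand();
        h.scan("hello big world\nfoo bar baz\n");
        System.out.println(h.words); // prints 6
    }
}
```

The two rules never match at the same position here, so the order of the tests in scan does not matter; the general case needs the longest-match policy described earlier.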
Application - parsing expressions
Consider the following BNF grammar which describes a different variant of IMP arithmetic expressions:
<val>  ::= [0-9]+
<var>  ::= [A-Z][a-z]*[0-9]*
<op>   ::= "+" | "MOD"
<atom> ::= <val> | <var>
<expr> ::= <atom> | <atom> <op> <expr>
The following are examples of expressions:
x + 1
A + 1 MOD Bx + Cy0
We start this exercise by first identifying the regular expressions we are interested in. The flex file is given below:
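import java.util.*;

%%

%class ExprLexer
%standalone

%{
    public Expr crtexpr = null;
    public String crtop = null;
%}

LineTerminator = \r|\n|\r\n
WS = (" "|\t)+
op = "+"|"MOD"
alfastream = [a-zA-Z]+
digitstream = [0-9]+
var = [A-Z]{alfastream}?{digitstream}?
val = {digitstream}

%%

/* {op} comes before {var}: "MOD" matches both, and the first rule wins */
{op}   { crtop = yytext(); }
{var}  { if (crtop == null) crtexpr = new Var(yytext());
         else crtexpr = new Binary(crtop, crtexpr, new Var(yytext())); }
/* values are combined in the same way as variables */
{val}  { if (crtop == null) crtexpr = new Val(yytext());
         else crtexpr = new Binary(crtop, crtexpr, new Val(yytext())); }
{WS}   { }
{LineTerminator} { /* do nothing */ }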
Note that we have opted to define the regular expression var in terms of other regular expressions. The regexp e? should be read as zero or one occurrence of e. Also note that the text MOD may be interpreted as an op as well as a var (an uppercase letter followed by an alfastream); that is why it is important that the rule for op comes before the rule for var: when several rules match the same longest word, the first one is chosen.
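The interplay between the longest-match policy and rule order can be tried out with a small sketch based on java.util.regex. This is only an emulation under stated assumptions, not how JFlex works internally (JFlex compiles the rules into an automaton): the class LongestMatchDemo and the method nextToken are hypothetical names, and the three patterns are hand translations of the op, var and val definitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical emulation of "longest match, then first rule".
class LongestMatchDemo {
    // Insertion order plays the role of rule order in the spec file.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("op", Pattern.compile("\\+|MOD"));
        RULES.put("var", Pattern.compile("[A-Z][a-zA-Z]*[0-9]*"));
        RULES.put("val", Pattern.compile("[0-9]+"));
    }

    // Returns "rule:lexeme" for the rule selected at position pos,
    // or null when no rule matches there.
    static String nextToken(String input, int pos) {
        String bestRule = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> e : RULES.entrySet()) {
            Matcher m = e.getValue().matcher(input).region(pos, input.length());
            // Strict '>' keeps the earlier rule when two matches have equal length.
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestLen = m.end() - pos;
                bestRule = e.getKey();
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(pos, pos + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(nextToken("MOD", 0));  // op:MOD   (tie of length 3, first rule wins)
        System.out.println(nextToken("MODx", 0)); // var:MODx (var gives the longer match)
    }
}
```

If the entry for var were inserted before the one for op, the first call would return var:MOD instead, which is exactly the situation the rule order is meant to avoid.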
Once the regular expressions are defined, we proceed to the program logic:
- we would like to store the currently-scanned expression (as an Expr public variable)
- as well as the currently-scanned operator (a String)
Whenever a variable is scanned and no operator has been scanned before it, we store it as the current expression. Otherwise, we use the stored operator to combine the current expression and the new variable into a new (sub)expression.
Note that the above program doesn't handle malformed inputs well. Can you identify such cases?
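The bookkeeping done by the actions can be exercised without JFlex by calling the same logic by hand. The sketch below is an assumption-laden stand-in: ActionSketch, onAtom and onOp are hypothetical names, and the nested Atom and Binary classes are simplified copies that only mirror the toString format used in this section. It may also be a convenient place to try out token sequences that are not valid expressions.

```java
// Hypothetical driver that replays the lexer actions on a fixed token sequence.
class ActionSketch {
    interface Expr {}

    static class Atom implements Expr {
        private final String name;
        Atom(String name) { this.name = name; }
        @Override public String toString() { return "{" + name + "}"; }
    }

    static class Binary implements Expr {
        private final String op;
        private final Expr l, r;
        Binary(String op, Expr l, Expr r) { this.op = op; this.l = l; this.r = r; }
        @Override public String toString() { return l + "<" + op + ">" + r; }
    }

    Expr crtexpr = null;   // currently-scanned expression
    String crtop = null;   // currently-scanned operator

    // Same logic as the action attached to the {var} rule.
    void onAtom(String text) {
        Expr atom = new Atom(text);
        crtexpr = (crtop == null) ? atom : new Binary(crtop, crtexpr, atom);
    }

    // Same logic as the action attached to the {op} rule.
    void onOp(String text) { crtop = text; }

    public static void main(String[] args) {
        ActionSketch s = new ActionSketch();
        // Tokens of the input "A + B MOD C", fed in scanning order.
        s.onAtom("A"); s.onOp("+"); s.onAtom("B"); s.onOp("MOD"); s.onAtom("C");
        System.out.println(s.crtexpr); // prints {A}<+>{B}<MOD>{C}
    }
}
```

Each new atom becomes the right operand of a Binary whose left operand is everything scanned so far, so the resulting tree leans to the left.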
Finally, we add the data structures required to hold parsed expressions, as well as the main test class:
import java.io.*;
import java.util.*;

interface Expr {}

abstract class Atom implements Expr {
    private String name;
    public Atom (String name) {this.name = name;}
    @Override
    public String toString () {return "{"+this.name+"}";}
}

class Binary implements Expr {
    private Expr l,r;
    private String op;
    public Binary (String op, Expr l, Expr r) {this.op = op; this.l = l; this.r = r;}
    @Override
    public String toString () {return l.toString()+"<"+op+">"+r.toString();}
}

class Var extends Atom {
    public Var (String s) {super(s);}
}

class Val extends Atom {
    public Val (String s) {super(s);}
}

public class Test {
    public static void main (String[] args) throws IOException {
        ExprLexer l = new ExprLexer(new FileReader(args[0]));
        l.yylex();
        System.out.println(l.crtexpr);
    }
}