Our Haskell analyser is generic (independent of the language being scanned); however, it lacks certain features which are essential for production development: the String → Token scanning functions could be generated automatically by simply parsing the regular expressions from a spec file. We will now briefly illustrate a tool which incorporates these features. The tool is JFlex, and it requires programmers to develop their code in Java. (An alternative in Haskell, called Alex, also exists.)
JFlex receives as input a spec file *.flex, which, informally, contains a list of regular expressions to be searched for, as well as actions (Java code) to be executed once a regular expression is matched. After compiling the flex file, JFlex outputs a Java class which implements a scanner. This class can be included in a larger project with extended functionality.

A complete, platform-dependent set of installation instructions can be found on the JFlex website. In a nutshell, JFlex comes as a binary app jflex.
Consider the following simple JFlex file:
```
import java.util.*;
%%
%class HelloLexer
%standalone
%{
  public Integer words = 0;
%}
LineTerminator = \r|\n|\r\n
%%
[a-zA-Z]+        { words += 1; }
{LineTerminator} { /* do nothing */ }
```
Suppose the above file is called Hello.flex. Running the command jflex Hello.flex will generate a Java class which implements a lexer.
Each JFlex file (such as the one above) contains five sections:
1. The code preceding the first %% contains declarations which will be added at the beginning of the generated Java class file (here, an import).
2. The code between the first %% and %{ contains a sequence of options for jflex. Here, we use two options:
   - %class HelloLexer tells jflex that the generated lexer class should be named HelloLexer;
   - %standalone tells jflex to print any unmatched input word to standard output and continue scanning.
3. The code between %{ and %} contains declarations which will be copied into the lexer class. Here we declare a public variable words.
4. Next come macro declarations; here we declare LineTerminator to be the regular expression \r|\n|\r\n. Declarations can be used to build more complicated regexps from simple ones, and can be referenced in the fifth section of the flex file.
5. The code after the second %% contains the lexing rules:
   - [a-zA-Z]+ { words+=1; } states that whenever [a-zA-Z]+ (a regexp defined inline) is matched by a word, words+=1; should be executed;
   - {LineTerminator} { } refers to the regexp declared above (note the braces); here no action is executed.

After performing:
jflex Hello.flex
we obtain HelloLexer.java, which contains the public class HelloLexer implementing our lexer. We can easily include this class in our project, e.g.:
```java
import java.io.*;
import java.util.*;

public class Hello {
    public static void main (String[] args) throws IOException {
        HelloLexer l = new HelloLexer(new FileReader(args[0]));
        l.yylex();
        System.out.println(l.words);
    }
}
```
The generated class provides a method yylex which starts the scanning process. After compiling:
javac HelloLexer.java Hello.java
and running it on an input file:

java Hello input.txt

we obtain the word count at standard output; for instance, for an input containing six words:

6
Recall that the option %standalone tells the lexer to print unmatched input. In our example, the unmatched characters are whitespace.
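The behaviour of the generated scanner can be approximated in plain Java with java.util.regex. The sketch below is not the code JFlex emits, just an illustration of the same policy: repeatedly take a match at the current position, count letter runs, and echo anything unmatched, as %standalone does (the class name WordCountSketch is ours).

```java
import java.util.regex.*;

public class WordCountSketch {
    // Approximates HelloLexer's behaviour: count maximal [a-zA-Z]+ runs,
    // skip line terminators, and echo any other unmatched character
    // (what %standalone would print). A sketch, not JFlex output.
    public static int countWords(String input) {
        Matcher m = Pattern.compile("[a-zA-Z]+|\r\n|[\r\n]").matcher(input);
        int words = 0, pos = 0;
        while (pos < input.length()) {
            if (m.find(pos) && m.start() == pos) {
                if (Character.isLetter(input.charAt(pos))) words += 1;
                pos = m.end();
            } else {
                System.out.print(input.charAt(pos)); // unmatched input, echoed
                pos += 1;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        int w = countWords("hello world\nfoo bar baz qux"); // w == 6
        System.out.println("\nwords: " + w);
    }
}
```

Note that the echoed characters (here, the spaces between words) appear on standard output before the count, exactly as with the %standalone lexer.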
Consider the following BNF grammar which describes a different variant of IMP arithmetic expressions:
```
<val>  ::= [0-9]+
<var>  ::= [A-Z][a-z]*[0-9]*
<op>   ::= "+" | "MOD"
<atom> ::= <val> | <var>
<expr> ::= <atom> | <atom> <op> <expr>
```
The following are examples of expressions:
```
x + 1
A + 1 MOD Bx + Cy0
```
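The terminal token classes of this grammar are ordinary regular expressions, so candidate tokens can be checked directly with Java's String.matches. This is just a quick sanity check on the grammar, not part of the lexer:

```java
public class TokenCheck {
    public static void main(String[] args) {
        // <val> ::= [0-9]+
        System.out.println("42".matches("[0-9]+"));             // true
        // <var> ::= [A-Z][a-z]*[0-9]*
        System.out.println("Cy0".matches("[A-Z][a-z]*[0-9]*")); // true
        System.out.println("x".matches("[A-Z][a-z]*[0-9]*"));   // false: <var> starts uppercase
        // <op> ::= "+" | "MOD"
        System.out.println("MOD".matches("\\+|MOD"));           // true
    }
}
```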
We start this exercise by first identifying the regular expressions we are interested in. The flex file is given below:
```
import java.util.*;
%%
%class ExprLexer
%standalone
%{
  public Expr crtexpr = null;
  public String crtop = null;
%}
LineTerminator = \r|\n|\r\n
WS = (" "|\t)+
op = "+"|"MOD"
alfastream = [a-zA-Z]+
digitstream = [0-9]+
var = [A-Z]{alfastream}?{digitstream}?
val = {digitstream}
%%
{var}            { if (crtop == null)
                     crtexpr = new Var(yytext());
                   else
                     crtexpr = new Binary(crtop, crtexpr, new Var(yytext()));
                 }
{op}             { crtop = yytext(); }
{WS}             { /* skip whitespace */ }
{LineTerminator} { /* do nothing */ }
```
Note that we have opted to define the regular expression var in terms of other regular expressions. The regexp e? should be read as zero or one occurrence of e. Also note that the text MOD may be interpreted as an op as well as an alfastream; that is why it is important to define it as an operator first.
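JFlex resolves such overlaps by taking the longest match and, when two rules match text of the same length, preferring the rule that appears earlier in the file. That tie-breaking policy can be sketched in plain Java (the helper winningRule is ours, not JFlex internals):

```java
import java.util.regex.*;

public class RuleOrderSketch {
    // Returns the name of the rule that wins at the start of the input:
    // longest match first, earliest rule on ties (JFlex's policy).
    public static String winningRule(String input, String[] names, String[] regexps) {
        String winner = null;
        int best = -1;
        for (int i = 0; i < names.length; i++) {
            Matcher m = Pattern.compile(regexps[i]).matcher(input);
            // strictly longer matches win; on ties the earlier rule is kept
            if (m.lookingAt() && m.end() > best) {
                best = m.end();
                winner = names[i];
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        String[] names   = { "op", "alfastream" };
        String[] regexps = { "\\+|MOD", "[a-zA-Z]+" };
        // "MOD" matches both rules with the same length; the earlier rule wins.
        System.out.println(winningRule("MOD", names, regexps));    // op
        // "MODULE": alfastream matches 6 characters vs op's 3, so it wins.
        System.out.println(winningRule("MODULE", names, regexps)); // alfastream
    }
}
```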
Once the regular expressions are defined, we proceed to the program logic. We keep track of:
- the expression built so far (the public Expr variable crtexpr);
- the most recently scanned operator (the public String variable crtop).
Whenever a variable is scanned, we turn it into a new expression if no operator has been previously scanned; otherwise, we use the stored operator to combine it with the existing expression into a new (sub)expression.
Note that the above program doesn't handle malformed inputs well. Can you identify such cases?
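To see how the actions build expressions, and where they break, we can replay the crtexpr/crtop logic on a hand-split token stream. This is a sketch with minimal stand-ins for the Expr classes given below; the class name ExprSketch and the scan helper are ours:

```java
public class ExprSketch {
    // Minimal stand-ins for the Expr/Var/Binary classes used by the lexer.
    interface Expr {}
    static class Var implements Expr {
        final String name;
        Var(String name) { this.name = name; }
        public String toString() { return "{" + name + "}"; }
    }
    static class Binary implements Expr {
        final String op; final Expr l, r;
        Binary(String op, Expr l, Expr r) { this.op = op; this.l = l; this.r = r; }
        public String toString() { return l + "<" + op + ">" + r; }
    }

    // Replays the lexer actions: operators set crtop, variables extend crtexpr.
    public static String scan(String[] tokens) {
        Expr crtexpr = null;
        String crtop = null;
        for (String t : tokens) {
            if (t.equals("+") || t.equals("MOD")) {
                crtop = t;                      // the {op} action
            } else {
                crtexpr = (crtop == null)       // the {var} action
                    ? new Var(t)
                    : new Binary(crtop, crtexpr, new Var(t));
            }
        }
        return String.valueOf(crtexpr);
    }

    public static void main(String[] args) {
        // The built expression nests to the left: {A}<+>{B}<MOD>{Cy0}
        System.out.println(scan(new String[]{"A", "+", "B", "MOD", "Cy0"}));
        // Malformed input: a leading operator leaves crtexpr null when it is used.
        System.out.println(scan(new String[]{"+", "B"}));
    }
}
```

The second call illustrates one of the malformed cases: an expression starting with an operator silently produces a Binary node with a null left subexpression.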
Finally, we add the data-structures required to hold parsed expressions as well as the main test class:
```java
import java.io.*;
import java.util.*;

interface Expr {}

abstract class Atom implements Expr {
    private String name;
    public Atom (String name) { this.name = name; }
    @Override
    public String toString () { return "{" + this.name + "}"; }
}

class Binary implements Expr {
    private Expr l, r;
    private String op;
    public Binary (String op, Expr l, Expr r) { this.op = op; this.l = l; this.r = r; }
    @Override
    public String toString () { return l.toString() + "<" + op + ">" + r.toString(); }
}

class Var extends Atom {
    public Var (String s) { super(s); }
}

class Val extends Atom {
    public Val (String s) { super(s); }
}

public class Test {
    public static void main (String[] args) throws IOException {
        ExprLexer l = new ExprLexer(new FileReader(args[0]));
        l.yylex();
        System.out.println(l.crtexpr);
    }
}
```