===== An introduction to JFlex =====

==== What is missing in the Haskell analyser? ====

Our Haskell analyser is generic (independent of the language being scanned); however, it lacks certain features which are essential for //production// development:
  * the regular expressions are defined as part of the code, together with the functions ''String -> Token''. These could be **generated automatically** by simply //parsing// the regular expressions from a spec file;
  * analysers usually perform specific **actions** when certain tokens are found (e.g. add a variable to a list, etc.); in our Haskell approach, the list of tokens is built first, and it is assumed that the set of actions is performed in a subsequent phase. This may be inefficient, as well as counter-intuitive for the programmer. We would like to assign an action to each regular expression, which is executed as soon as the regular expression is matched.

We will now briefly illustrate a tool which incorporates these features. The tool is JFlex, and it requires programmers to develop their code in Java. (An alternative for Haskell, called Alex, also exists.)

JFlex receives as input a spec file ''*.flex'', which, informally, contains a list of regular expressions to be searched for, as well as actions (Java code) to be executed once a regular expression is matched. After processing the flex file, JFlex outputs a **Java class**, which implements a scanner. The Java class can be included in a larger project with extended functionality.

==== Installing JFlex ====

A complete, platform-dependent set of installation instructions can be found [[http://jflex.de/installing.html|here]]. In a nutshell, JFlex comes as a command-line application, ''jflex''.
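
Before turning to the flex syntax, the idea of attaching an action to each regular expression can be sketched in plain Java. The class below is only a hypothetical illustration: the names ''ActionScannerSketch'' and ''scan'' are made up for this example and are not part of JFlex. It relies on ''java.util.regex'' and simply tries the rules in order at every position, which is far simpler than the scanner JFlex generates.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: each regular expression is paired with an action
// that runs as soon as the expression matches (no intermediate token list).
public class ActionScannerSketch {

    // Returns {number of words, number of line terminators} found in the input.
    static int[] scan(String input) {
        int[] counts = new int[2]; // state updated by the actions

        // Rules are kept in insertion order: regular expression -> action.
        Map<Pattern, Runnable> rules = new LinkedHashMap<>();
        rules.put(Pattern.compile("[a-zA-Z]+"), () -> counts[0]++);
        rules.put(Pattern.compile("\\r\\n|\\r|\\n"), () -> counts[1]++);

        int pos = 0;
        while (pos < input.length()) {
            int advance = 1; // skip one character when no rule matches here
            for (Map.Entry<Pattern, Runnable> rule : rules.entrySet()) {
                Matcher m = rule.getKey().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > pos) {
                    rule.getValue().run();   // the action runs at match time
                    advance = m.end() - pos;
                    break;
                }
            }
            pos += advance;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = scan("hello lexer world\nbye\n");
        System.out.println(counts[0] + " words, " + counts[1] + " line terminators");
    }
}
```

The point of the sketch is that the counters are updated during the scan itself; no list of tokens is built and traversed afterwards.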
==== The structure of a flex file ====

Consider the following simple JFlex file:

<code>
import java.util.*;

%%

%class HelloLexer
%standalone

%{
    public Integer words = 0;
%}

LineTerminator = \r|\n|\r\n

%%

[a-zA-Z]+        { words+=1; }
{LineTerminator} { /* do nothing*/ }
</code>

Suppose the above file is called ''Hello.flex''. Running the command ''jflex Hello.flex'' will generate a Java class which implements a lexer. Each JFlex file (such as the above) contains 5 sections:
  * the first section, which ends at the first occurrence of '' % % '', contains declarations which will be added at the beginning of the Java class file.
  * the second section, right after '' % % '' and until ''%{'', contains a sequence of options for JFlex. Here, we use two options:
    * ''%class HelloLexer'' tells JFlex that the generated lexer class should be named ''HelloLexer'';
    * ''%standalone'' tells JFlex to generate a lexer which prints unmatched input to standard output and continues scanning.
    * More details regarding possible options can be found in the [[http://jflex.de/manual.pdf|JFlex docs]].
  * the third section, delimited by ''%{'' and ''%}'', contains declarations which will be copied into the generated lexer class. Here we declare a public variable ''words''.
  * the fourth section contains regular expression **declarations**. Here, we have declared ''LineTerminator'' to be the regular expression ''\r | \n | \r\n''.
Declarations can be used to build more complicated regexps from simpler ones, and can also be used in the fifth section of the flex file:
  * the fifth section contains rules and actions: a rule specifies a regular expression to be scanned for, as well as the action to be taken when a word matching the regexp is found:
    * the rule ''[a-zA-Z]+ { words+=1; }'' states that whenever ''[a-zA-Z]+'' (a regexp defined inline) is matched by a word, ''words+=1;'' should be executed;
    * the rule ''{LineTerminator} { /* do nothing*/ }'' refers to the regexp declared above (note the curly braces); here no action is executed;
    * JFlex will always scan for the **longest** input word which matches a regexp. When a word of that length matches more than one regexp, the **first** one from the flex file is chosen.

==== Compiling a Hello World project ====

After performing:

<code>
jflex Hello.flex
</code>

we obtain ''HelloLexer.java'', which contains the public class ''HelloLexer'' implementing our lexer. We can easily include this class in our project, e.g.:

<code>
import java.io.*;
import java.util.*;

public class Hello {
    public static void main (String[] args) throws IOException {
        HelloLexer l = new HelloLexer(new FileReader(args[0]));
        l.yylex();
        System.out.println(l.words);
    }
}
</code>

  * Note that the lexer constructor receives a Java ''Reader'' as input (other options are possible, see the docs), and we take the name of the file to be scanned from the first command-line argument (''args[0]'').
  * Each lexer implements the method ''yylex'', which starts the scanning process.

After compiling:

<code>
javac HelloLexer.java Hello.java
</code>

and running the program with the name of the file to be scanned as its argument:

<code>
java Hello INPUT_FILE
</code>

we obtain, for an input file containing six words:

<code>
6
</code>

at standard output. Recall that the option ''%standalone'' tells the lexer to print unmatched input. In our example, the unmatched input consists of whitespace.
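
The matching policy described in this section (longest match first, ties broken by the order of the rules) can be imitated in plain Java. The sketch below is hypothetical: the class ''LongestMatchSketch'', the method ''classify'' and the three rule names are invented for illustration and are not JFlex API. It also assumes patterns for which the greedy match of ''java.util.regex'' coincides with the longest match, which holds for the simple patterns used here but not for arbitrary regular expressions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the disambiguation policy: the longest match wins,
// and among equally long matches the rule listed first wins.
public class LongestMatchSketch {

    // Returns the names of the rules selected while scanning the input.
    static List<String> classify(String input) {
        Map<String, Pattern> rules = new LinkedHashMap<>(); // keeps rule order
        rules.put("keyword", Pattern.compile("if"));
        rules.put("word", Pattern.compile("[a-zA-Z]+"));
        rules.put("space", Pattern.compile(" +"));

        List<String> selected = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String best = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    best = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (best == null) { // no rule matches here: skip one character
                pos++;
                continue;
            }
            selected.add(best);
            pos += bestLen;
        }
        return selected;
    }

    public static void main(String[] args) {
        // "if" ties between keyword and word (first rule wins);
        // "iffy" is longer as a word than as a keyword (longest match wins).
        System.out.println(classify("if iffy")); // prints [keyword, space, word]
    }
}
```

Swapping the first two ''rules.put'' calls would make ''word'' win the tie on ''if'', which is why the order of the rules in a flex file matters.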
==== Application - parsing expressions ====

Consider the following BNF grammar which describes a different variant of IMP arithmetic expressions:

<code>
<val>  ::= [0-9]+
<var>  ::= [A-Z][a-z]*[0-9]*
<op>   ::= "+" | "MOD"
<atom> ::= <val> | <var>
<expr> ::= <atom> | <expr> <op> <expr>
</code>

The following are examples of expressions:

<code>
X + 1
A + 1 MOD Bx + Cy0
</code>

We start this exercise by first identifying the regular expressions we are interested in. The flex file is given below:

<code>
import java.util.*;

%%

%class ExprLexer
%standalone

%{
    public Expr crtexpr = null;
    public String crtop = null;
%}

LineTerminator = \r|\n|\r\n
WS = (" "|\t)+
op = "+"|"MOD"
alfastream = [a-zA-Z]+
digitstream = [0-9]+
var = [A-Z]{alfastream}?{digitstream}?
val = {digitstream}

%%

/* {op} comes first, so that MOD is scanned as an operator and not as a variable */
{op}    { crtop = yytext(); }
{var}   { if (crtop == null) crtexpr = new Var(yytext());
          else crtexpr = new Binary(crtop, crtexpr, new Var(yytext())); }
{val}   { if (crtop == null) crtexpr = new Val(yytext());
          else crtexpr = new Binary(crtop, crtexpr, new Val(yytext())); }
{WS}    {}
{LineTerminator} { /* do nothing*/ }
</code>

Note that we have opted to define the regular expression ''var'' in terms of other regular expressions. The regexp ''e?'' should be read as //zero or one occurrence of// ''e''. Also note that the text ''MOD'' matches ''op'' as well as ''var'' (''alfastream'' also admits uppercase letters, which makes the lexer slightly more permissive than the grammar). Both matches have the same length, so the rule listed first is chosen; that is why it is important for the ''{op}'' rule to appear before the ''{var}'' rule.

Once the regular expressions are defined, we proceed to the program logic:
  * we would like to store the currently-scanned expression (as a public ''Expr'' variable),
  * as well as the currently-scanned operator (a ''String'').

Whenever a variable or a value is scanned, we store it as the current expression if no operator has been scanned previously. Otherwise, we use the existing operator to create a new (sub)expression.

Note that the above program doesn't handle malformed inputs well. Can you identify such cases?
Finally, we add the data-structures required to hold parsed expressions as well as the main test class:

<code>
import java.io.*;
import java.util.*;

interface Expr {}

abstract class Atom implements Expr {
    private String name;
    public Atom (String name) {this.name = name;}
    @Override
    public String toString () {return "{"+this.name+"}";}
}

class Binary implements Expr {
    private Expr l,r;
    private String op;
    public Binary (String op, Expr l, Expr r) {this.op = op; this.l = l; this.r = r;}
    @Override
    public String toString () {return l.toString()+"<"+op+">"+r.toString();}
}

class Var extends Atom {
    public Var (String s) {super(s);}
}

class Val extends Atom {
    public Val (String s) {super(s);}
}

public class Test {
    public static void main (String[] args) throws IOException {
        ExprLexer l = new ExprLexer(new FileReader(args[0]));
        l.yylex();
        System.out.println(l.crtexpr);
    }
}
</code>
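
To follow how the actions build an expression without running JFlex, the logic of the rules can be replayed by hand on a sequence of tokens. The sketch below is hypothetical and self-contained: the class ''ActionReplaySketch'' and the method ''replay'' are invented for this illustration, the token list is written by hand instead of being produced by the lexer, and a single ''Atom'' class stands in for both ''Var'' and ''Val''.

```java
import java.util.List;

// Hypothetical replay of the lexer actions on a hand-written token sequence.
public class ActionReplaySketch {
    interface Expr {}

    static class Atom implements Expr {
        private final String name;
        Atom(String name) { this.name = name; }
        @Override public String toString() { return "{" + name + "}"; }
    }

    static class Binary implements Expr {
        private final Expr l, r;
        private final String op;
        Binary(String op, Expr l, Expr r) { this.op = op; this.l = l; this.r = r; }
        @Override public String toString() { return l + "<" + op + ">" + r; }
    }

    // Mirrors the fields crtexpr and crtop and the actions of the flex rules.
    static String replay(List<String> tokens) {
        Expr crtexpr = null;
        String crtop = null;
        for (String t : tokens) {
            if (t.equals("+") || t.equals("MOD")) {
                crtop = t;                     // action of the operator rule
            } else if (crtop == null) {
                crtexpr = new Atom(t);         // no operator seen so far
            } else {
                // combine the expression built so far with the new operand
                crtexpr = new Binary(crtop, crtexpr, new Atom(t));
            }
        }
        return String.valueOf(crtexpr);
    }

    public static void main(String[] args) {
        // Tokens of the input "A + Bx MOD Cy0", whitespace already dropped.
        System.out.println(replay(List.of("A", "+", "Bx", "MOD", "Cy0")));
        // prints {A}<+>{Bx}<MOD>{Cy0}
    }
}
```

The replay makes it visible that operators are applied strictly from left to right, with no precedence between ''+'' and ''MOD''. The same method can be used to experiment with malformed token sequences.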