====== Writing parsers with ANTLR ======

===== What is ANTLR (4.0) =====

ANTLR (**ANother Tool for Language Recognition**) is a widely used //parser generator//. The **input** for ANTLR is a **grammar**. The **output** is a set of files implementing:
  * a **lexer**
  * a **parser**

together with a //walker// which relies on the **visitor design pattern**. In short, the programmer simply needs to implement a **visitor object** which defines the desired behaviour when each element of the grammar is reached.

===== Installing and using ANTLR 4.0 with Python =====

ANTLR can be used with several languages including Java and Python. In our example we choose Python because the code is easier to deploy and test. As far as ANTLR 4.0 is concerned, switching to Java or another language is just a matter of syntax.

To install the ANTLR runtime for Python, run:

  pip install antlr4-python2-runtime

and download the antlr4 JAR (which can be easily found online). To compile a grammar ''Sample.g4'', run the following command, where the placeholder after ''-cp'' stands for the location of the downloaded JAR:

  java -Xmx500M -cp <path-to-antlr4-jar> org.antlr.v4.Tool -Dlanguage=Python2 Sample.g4

The output consists of the files ''SampleLexer.py'', ''SampleParser.py'' and ''SampleListener.py''.

===== A grammar for arithmetic expressions =====

A grammar definition (named ''Expr.g4'') for arithmetic expressions is given below:

  grammar Expr;
  expr : mult_expr ('+' expr)? ;
  mult_expr : atom ('*' mult_expr)? ;
  atom : '(' expr ')' | TOKEN | NUMBER ;
  NUMBER : [0-9]+ ;
  TOKEN : [A-Z][a-z]*[0-9]* ;
  WS : [ \t\r\n]+ -> skip ;

  * The first line declares the grammar and its name, which must match the file name (''Expr.g4'').
  * The last three lines define **tokens**, in a manner similar to Flex. Token names start with an uppercase letter and are usually written entirely in uppercase. Note that some tokens such as ''*'' or ''('' are defined **inline** in the grammar itself.
  * The last token also defines an **action** (''-> skip''), which consists in //ignoring// spaces, tabs and newlines.
  * Our grammar consists of three rules: ''expr'', ''mult_expr'' and ''atom''.
  * Rules have the general form: ''rule_name : body ;'', where the body lists one or more alternatives.
  * The symbol ''|'' separates alternatives, in exactly the same way as in the usual (BNF-style) grammar definitions.
  * The ANTLR operator ''?'' denotes //zero or one occurrences// of the expression it follows, e.g. ''('+' expr)?''.

===== Running the parser =====

After running ANTLR (see the installation section above), the files ''ExprLexer.py'', ''ExprParser.py'' and ''ExprListener.py'' are generated. We can incorporate them in our code as shown below:

  import sys
  from antlr4 import *
  from ExprLexer import ExprLexer
  from ExprListener import ExprListener
  from ExprParser import ExprParser
  stream = FileStream(sys.argv[1])
  lexer = ExprLexer(stream)
  stream = CommonTokenStream(lexer)
  parser = ExprParser(stream)
  tree = parser.expr()
  printer = PrintListener()
  walker = ParseTreeWalker()
  walker.walk(printer, tree)
  print(printer.str)

  * the first line after the imports creates a **file stream** by opening the file whose **name** is given as the first **command-line argument**
  * the second line creates a **lexer** from that stream. The lexer **identifies** the tokens of the input
  * the third line wraps the lexer in a **token stream**, and the fourth line creates a **parser** for our grammar, which reads its tokens from that stream
  * the object ''tree'' (fifth line) is the AST of our parsed input. Here, ''expr'' is the **start symbol** of our grammar
  * the object ''printer'' (sixth line) is a **visitor object**, which we shall implement below
  * the call ''walker.walk(printer, tree)'' triggers the **visiting process**, where the visited object is ''tree'' and the visitor is ''printer''
  * the last line prints the string built by the visitor during the walk

===== Implementing visitors =====

In order to implement a visitor, we need to have a look at the file ''ExprListener.py''. It contains **stub definitions** (ANTLR calls this kind of visitor a //listener//): a pair of //enter// and //exit// methods for each rule of our grammar. The //enter// method is called when the walker starts visiting the corresponding AST sub-tree, and the //exit// method when it has finished visiting it.
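To make the calling order concrete, here is a minimal sketch which does not use ANTLR at all. The names ''Node'', ''TraceListener'' and ''walk'' are hypothetical and only mimic the behaviour described above; the generated listener has one enter/exit pair per rule, whereas this toy version uses a single generic pair.

```python
# Toy illustration of the enter/exit calling order; this is NOT ANTLR's code.
class Node(object):
    """A hypothetical parse-tree node: a rule name plus its child nodes."""
    def __init__(self, rule, children=()):
        self.rule = rule
        self.children = list(children)


class TraceListener(object):
    """Records every enter/exit event in the order in which it is received."""
    def __init__(self):
        self.events = []

    def enter(self, node):
        self.events.append("enter " + node.rule)

    def exit(self, node):
        self.events.append("exit " + node.rule)


def walk(listener, node):
    """Depth-first walk: enter the node, walk its children, then exit it."""
    listener.enter(node)
    for child in node.children:
        walk(listener, child)
    listener.exit(node)


# Sub-tree shape for the input "5": expr -> mult_expr -> atom
tree = Node("expr", [Node("mult_expr", [Node("atom")])])
listener = TraceListener()
walk(listener, tree)
print(listener.events)
```

The events are recorded outside-in for //enter// and inside-out for //exit//, which is why a listener can open something in an //enter// method and close it in the matching //exit// method.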
We can implement our visitor by extending the class ''ExprListener'', as shown in the code below:

  def toString(obj):
      return str(obj)
  class PrintListener(ExprListener):
      def __init__(self):
          self.str = ""    # the expression printed so far
          self.plus = 0    # '+' signs waiting to be printed
          self.mult = 0    # '*' signs waiting to be printed
          self.saved = []  # counters of the enclosing expressions, one pair per open parenthesis
      def enterExpr(self, ctx):
          if ctx.expr() is not None:
              self.plus += 1
      def exitMult_expr(self, ctx):
          if self.plus > 0:
              self.str += "+"
              self.plus -= 1
      def enterMult_expr(self, ctx):
          if ctx.mult_expr() is not None:
              self.mult += 1
      def enterAtom(self, ctx):
          if ctx.TOKEN() is not None:
              self.str += toString(ctx.TOKEN())
          if ctx.NUMBER() is not None:
              self.str += toString(ctx.NUMBER())
          if ctx.expr() is not None:
              self.str += "("
              # signs pending outside the parentheses must wait until ')'
              self.saved.append((self.plus, self.mult))
              self.plus = self.mult = 0
      def exitAtom(self, ctx):
          if ctx.expr() is not None:
              self.str += ")"
              self.plus, self.mult = self.saved.pop()
          if self.mult > 0:
              self.str += "*"
              self.mult -= 1

Our class ''PrintListener'' builds a pretty-printed version of the parsed expression in its member ''str''. Let us start with the method ''enterAtom'':
  * it receives the object ''ctx'' as parameter. ''ctx'' can be used to explore the **AST sub-tree** corresponding to our rule ''atom''. Recall that this rule is ''atom : '(' expr ')' | TOKEN | NUMBER ;''. We use the function calls ''TOKEN()'', ''NUMBER()'' and ''expr()'' in order to examine the exact structure of our parsed atom: each returns ''None'' if the corresponding element is absent.
  * we also declare a **class member** ''str'', which will hold the printed portion of our expression.
  * the function ''toString'', which is defined top-level, is used as a more legible alternative to Python's ''str''.
  * finally, note that ''enterAtom'' is called once we start visiting an atom. Hence, if this atom is a parenthesised expression, we add ''('' to our string. The method ''exitAtom'' adds the matching closing parenthesis. In between, the signs still pending for the enclosing expression are set aside in ''saved'', so that they are not printed inside the parentheses.

We now look at ''enterExpr'':
  * Note that we need to add the addition (resp. multiplication) signs **after** the first element of the expression (here, ''mult_expr'') has been visited. Thus, if an expression does contain addition, we use the member variable ''plus'' to account for our ''+''. When the first element of a (possible) addition has been visited, i.e.
when ''exitMult_expr'' is called, we check whether we need to add a plus sign, and if so, we add it.
  * The same idea is applied for multiplication.

Finally, we can test our code using e.g.:

  Var1 + (5 * 6)

===== What grammars can ANTLR support =====

Unlike other parser generators (e.g. Yacc), and even earlier versions of ANTLR, ANTLR 4 is very permissive in terms of what kind of grammars it accepts. Generators which commit to a rule after inspecting a small, fixed number of lookahead tokens reject, or report conflicts for, grammars which need more lookahead. For instance, the two rules:

  expr1 : TOKEN '+' '+' ;
  expr2 : TOKEN '+' TOKEN ;

can only be told apart at the **third** token, which is too late for a parser that must choose between them after seeing one or two tokens. Yacc, for example, builds LALR(1) parsers, which take their decisions using a **single token** of lookahead. ANTLR 4 instead looks ahead as far as needed in order to establish which rule matches a certain input.

Moreover, ANTLR does not fail on **ambiguous grammars**. For instance, the ambiguous grammar:

  grammar Expr2;
  expr : expr OP expr | OPEN expr CLOSE | TOKEN | NUMBER ;
  OPEN : '(' ;
  CLOSE : ')' ;
  OP : ('+' | '*') ;
  NUMBER : [0-9]+ ;
  TOKEN : [A-Z][a-z]*[0-9]* ;
  WS : [ \t\r\n]+ -> skip ;

is accepted. The resulting parse tree for the expression ''1 + 2 * 3'', fed to the above grammar in ANTLR, is:

        *
       / \
      +   3
     / \
    1   2

Note that this tree groups the input as ''(1 + 2) * 3'': the grammar has a single rule for both operators, so the usual precedence of multiplication is not encoded. The details regarding how ambiguity is dealt with in ANTLR go beyond this lecture.

===== A visitor for evaluating expressions =====

We return to our previous unambiguous grammar of expressions, and illustrate a visitor which evaluates expressions.
  def toInteger(ctx):
      return int(toString(ctx))
  def prev(l):
      # element before the top, provided the top is a value and not a marker
      if len(l) < 2 or l[-1] in ("+", "*", "("):
          return None
      return l[-2]
  class EvalListener(ExprListener):
      def __init__(self, dict):
          self.dict = dict    # maps each token name to its value
          self.stack = []
      def clean(self, val):
          if prev(self.stack) == "+":
              valp = self.stack.pop()
              self.stack.pop()    # remove the "+"
              self.clean(valp + val)
          elif prev(self.stack) == "*":
              valp = self.stack.pop()
              self.stack.pop()    # remove the "*"
              self.clean(valp * val)
          else:
              self.stack.append(val)
      def enterExpr(self, ctx):
          if ctx.expr() is not None:
              self.stack.append("+")
      def enterMult_expr(self, ctx):
          if ctx.mult_expr() is not None:
              self.stack.append("*")
      def enterAtom(self, ctx):
          if ctx.TOKEN() is not None:
              self.stack.append(self.dict[toString(ctx.TOKEN())])
          if ctx.NUMBER() is not None:
              self.stack.append(toInteger(ctx.NUMBER()))
          if ctx.expr() is not None:
              self.stack.append("(")
      def exitAtom(self, ctx):
          if ctx.expr() is not None:
              val = self.stack.pop()
              self.stack.pop()    # remove the open parenthesis
              self.clean(val)
          if ctx.TOKEN() is not None or ctx.NUMBER() is not None:
              val = self.stack.pop()
              self.clean(val)

The class also relies on two helper functions:
  * ''toInteger'', which takes an AST object and returns an integer (it is used for getting the value of a ''NUMBER'' AST node)
  * ''prev'', which looks at the element //before the top of the stack// (represented as a list). It returns ''None'' when the stack has fewer than two elements, or when the top of the stack is itself an operator or an open parenthesis: an operation can only be performed once its first operand lies on top of its operator. Without this check, an input such as ''2 * 3 + 4'' would try to combine a value with the operator ''*''.

The underlying idea behind the visitor is as follows:
  * whenever an addition or multiplication node is entered, the respective operator is placed on the stack, before its operands
  * whenever an open parenthesis is found (see ''enterAtom''), it is also placed on the stack
  * whenever a token is found, its value from the dictionary is placed on the stack
  * whenever a number is found, it is placed on the stack
  * when the visiting process of a token or number has ended, operations from the stack could be performed. Thus, the **last** value from the stack is popped, and forwarded to the ''clean'' method.
  * when the visiting process of a parenthesised expression has ended, we expect the stack to contain: ''[ ... "(", val ]''.
Note that the expression in parentheses has already been evaluated. We pop the value and the open parenthesis, and forward the value to the ''clean'' method, which can perform other subsequent evaluations.

The ''clean'' method:
  * operates on the stack, and receives as parameter a value ''val'' - the last value to be added on the stack. Instead of simply adding ''val'', we check if an evaluation can be performed. This is the case if the stack ends with an operator followed by a value: ''[ ... , op, valp ]''. In this case, we pop ''valp'' and the operator. We do not simply push the result of the operation, but continue the //cleaning// process by a recursive call. This will evaluate other operations waiting in line.
  * if no evaluation can be performed, ''val'' is simply added on the stack

At the end of the visiting process, the stack will contain a unique value: the result of evaluating the expression.
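The cleaning step can be tried in isolation, without ANTLR. The sketch below is only an illustration under stated assumptions: it replays by hand the stack operations which the listener would perform for the input ''2 * (3 + 4)''. The function ''clean'' mirrors the method described above, but it is written as a free function with a loop instead of a recursive call; ''is_value'' and ''OPERATORS'' are hypothetical names introduced for this example.

```python
OPERATORS = ("+", "*")


def is_value(item):
    # Markers ("+", "*", "(") are strings; operands are assumed to be numbers.
    return not isinstance(item, str)


def clean(stack, val):
    """Push val, first folding it into every operation waiting on the stack."""
    # An operation is waiting when the stack ends with [..., operator, value].
    while len(stack) >= 2 and is_value(stack[-1]) and stack[-2] in OPERATORS:
        left = stack.pop()
        op = stack.pop()
        val = left + val if op == "+" else left * val
    stack.append(val)


# Hand-written replay of the listener events for "2 * (3 + 4)".
stack = []
stack.append("*")     # enterMult_expr: a multiplication follows
clean(stack, 2)       # exitAtom for 2: nothing to fold yet
stack.append("(")     # enterAtom for the parenthesised expression
stack.append("+")     # enterExpr: an addition follows
clean(stack, 3)       # exitAtom for 3: the "+" still lacks its first operand
clean(stack, 4)       # exitAtom for 4: folds 3 + 4, then stops at "("
inner = stack.pop()   # exitAtom for the parentheses: take the inner value
stack.pop()           # ... and discard the "("
clean(stack, inner)   # folds 2 * 7
print(stack)
```

The loop stops at an open parenthesis or at an operator which has no first operand yet, so a value is only ever combined with another value.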