ANTLR (ANother Tool for Language Recognition) is a widely used parser generator. The input for ANTLR is a grammar. The output is a set of files implementing a lexer, a parser and a listener for that grammar.
ANTLR can be used with several languages including Java and Python. In our example we choose Python because the code is easier to deploy and test. As far as ANTLR 4.0 is concerned, switching to Java or another language is just a matter of syntax.
To install ANTLR, run:
pip install antlr4-python2-runtime
and download the antlr4 JAR (which can be easily found online).
To compile a grammar Sample.g4, run:
java -Xmx500M -cp <path to ANTLR complete JAR> org.antlr.v4.Tool -Dlanguage=Python2 Sample.g4
The output consists of the files SampleLexer.py, SampleParser.py, and SampleListener.py.
A grammar definition (named Expr.g4) for arithmetic expressions is given below:

    grammar Expr;

    expr      : mult_expr ('+' expr)? ;
    mult_expr : atom ('*' mult_expr)? ;
    atom      : '(' expr ')' | TOKEN | NUMBER ;

    NUMBER : [0-9]+ ;
    TOKEN  : [A-Z][a-z]*[0-9]* ;
    WS     : [ \t\r\n]+ -> skip ;
- Literal tokens such as * or ( are defined inline in the grammar itself.
- The grammar has three parser rules: expr, mult_expr and atom. Rules have the general form: <name> : <body> ;
- | separates alternatives and is used in exactly the same way as in usual grammar definitions.
- <expr>? denotes zero or one occurrences of <expr>.
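To make the notation concrete, the three parser rules can be mirrored by a small hand-written recursive-descent recognizer. This is only a sketch of what the rules mean, not the code that ANTLR generates: the names tokenize, parse_expr, parse_mult_expr, parse_atom and recognizes are hypothetical, and the tokenizer relies on Python's re module instead of a generated lexer.

```python
import re

# Lexer rules NUMBER and TOKEN, plus the inline literals + * ( ).
# Leading whitespace is consumed and dropped, like the WS rule.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Z][a-z]*[0-9]*)|([+*()]))")

def tokenize(text):
    """Split text into (kind, value) pairs."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected input near position %d" % pos)
        number, name, literal = m.groups()
        if number is not None:
            tokens.append(("NUMBER", number))
        elif name is not None:
            tokens.append(("TOKEN", name))
        else:
            tokens.append((literal, literal))
        pos = m.end()
    return tokens

def parse_expr(tokens, i):
    # expr : mult_expr ('+' expr)? ;
    i = parse_mult_expr(tokens, i)
    if i < len(tokens) and tokens[i][0] == "+":   # the optional part (...)?
        i = parse_expr(tokens, i + 1)
    return i

def parse_mult_expr(tokens, i):
    # mult_expr : atom ('*' mult_expr)? ;
    i = parse_atom(tokens, i)
    if i < len(tokens) and tokens[i][0] == "*":
        i = parse_mult_expr(tokens, i + 1)
    return i

def parse_atom(tokens, i):
    # atom : '(' expr ')' | TOKEN | NUMBER ;   (one branch per alternative)
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    kind = tokens[i][0]
    if kind == "(":
        i = parse_expr(tokens, i + 1)
        if i >= len(tokens) or tokens[i][0] != ")":
            raise SyntaxError("expected ')'")
        return i + 1
    if kind in ("TOKEN", "NUMBER"):
        return i + 1
    raise SyntaxError("unexpected token %r" % (tokens[i][1],))

def recognizes(text):
    """Return True if the whole text is derived from the rule expr."""
    try:
        tokens = tokenize(text)
        return parse_expr(tokens, 0) == len(tokens)
    except SyntaxError:
        return False

print(recognizes("Var1 + (5 * 6)"))
print(recognizes("1 + * 2"))
```

Each function returns the index of the first token it did not consume, so an input is accepted only when expr consumes all of its tokens.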
After running ANTLR (see the first section), the files ExprLexer.py, ExprParser.py and ExprListener.py are generated.
We can incorporate them in our code as shown below:
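    from antlr4 import *
    from ExprLexer import ExprLexer
    from ExprListener import ExprListener
    from ExprParser import ExprParser
    import sys

    # PrintListener is implemented below; its definition has to precede this code

    stream = FileStream(sys.argv[1])    # the input file is given as first argument
    lexer = ExprLexer(stream)
    stream = CommonTokenStream(lexer)
    parser = ExprParser(stream)
    tree = parser.expr()                # parse, starting from the rule expr

    printer = PrintListener()
    walker = ParseTreeWalker()
    walker.walk(printer, tree)
    print(printer.str)                  # output the pretty-printed expression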
- tree is the AST of our parsed input. Here, expr is the start symbol of our grammar.
- printer is a visitor object, which we shall implement below.
- walker.walk(printer, tree) triggers the visiting process, where the visited object is tree, and the visitor is printer.
In order to implement a visitor, we need to have a look at the file ExprListener.py. It contains stub definitions: a pair of enter and exit methods for each rule of our grammar. The enter method is called at the beginning, and the exit method at the end, of visiting the corresponding AST sub-tree.
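The calling order can be illustrated without ANTLR. The sketch below uses a hypothetical Node class and walk function as simplified stand-ins for the parse tree and for ParseTreeWalker; they are assumptions made for illustration and not the implementation of the ANTLR runtime. TraceListener is likewise a hypothetical listener which only records the calls it receives.

```python
class Node(object):
    """Hypothetical parse-tree node: a rule name plus child nodes."""
    def __init__(self, rule, children=()):
        self.rule = rule
        self.children = list(children)

class TraceListener(object):
    """Records every enter/exit call in the order in which it happens."""
    def __init__(self):
        self.events = []
    def enterExpr(self, ctx):
        self.events.append("enterExpr")
    def exitExpr(self, ctx):
        self.events.append("exitExpr")
    def enterMult_expr(self, ctx):
        self.events.append("enterMult_expr")
    def exitMult_expr(self, ctx):
        self.events.append("exitMult_expr")
    def enterAtom(self, ctx):
        self.events.append("enterAtom")
    def exitAtom(self, ctx):
        self.events.append("exitAtom")

def walk(listener, node):
    """Depth-first walk: enter the node, visit its children, exit the node."""
    name = node.rule[0].upper() + node.rule[1:]   # mult_expr -> Mult_expr
    getattr(listener, "enter" + name)(node)
    for child in node.children:
        walk(listener, child)
    getattr(listener, "exit" + name)(node)

# Tree for an input consisting of a single number: expr -> mult_expr -> atom
tree = Node("expr", [Node("mult_expr", [Node("atom")])])
listener = TraceListener()
walk(listener, tree)
print(listener.events)
```

Every enter call is matched by an exit call, and the calls for a sub-tree are nested between the enter and exit calls of its parent.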
We can implement our visitor by extending the class ExprListener (defined in ExprListener.py), as shown in the code below:
    def toString(obj):
        return str(obj)

    class PrintListener(ExprListener):
        def __init__(self):
            self.str = ""    # the portion of the expression printed so far
            self.plus = 0    # number of pending + signs
            self.mult = 0    # number of pending * signs

        def enterExpr(self, ctx):
            if ctx.expr() is not None:
                self.plus += 1

        def exitMult_expr(self, ctx):
            if self.plus > 0:
                self.str += "+"
                self.plus -= 1

        def enterMult_expr(self, ctx):
            if ctx.mult_expr() is not None:
                self.mult += 1

        def enterAtom(self, ctx):
            if ctx.TOKEN() is not None:
                self.str += toString(ctx.TOKEN())
            if ctx.NUMBER() is not None:
                self.str += toString(ctx.NUMBER())
            if ctx.expr() is not None:
                self.str += "("

        def exitAtom(self, ctx):
            if ctx.expr() is not None:
                self.str += ")"
            if self.mult > 0:
                self.str += "*"
                self.mult -= 1
Our class PrintListener pretty-prints the parsed expression; the output is accumulated in the member str.
Let us start with the method enterAtom:
- It receives ctx as parameter. ctx can be used to explore the AST sub-tree corresponding to our rule atom. Recall that this rule is atom : '(' expr ')' | TOKEN | NUMBER ; . We use the function calls TOKEN() and NUMBER() in order to examine the exact structure of our parsed atom.
- The result is accumulated in the member str, which holds the portion of our expression printed so far.
- toString, which is defined at top level, is used as a more legible alternative to Python's str.
- enterAtom is called once we start visiting an atom. Hence, if the atom is a parenthesised expression, we add ( to our string. The method exitAtom adds the matching closing parenthesis.
We now look at enterExpr:
- The method is called before the first operand of the expression (a mult_expr) has been visited. Thus, if the expression does contain an addition, we use the member variable plus to account for the pending +.
- When the first element of a (possible) addition has been visited, i.e. when exitMult_expr is called, we check if we need to add a plus sign, and if so, we add it.
Finally, we can test our code using e.g.:
Var1 + (5 * 6)
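Unlike other parser generators (e.g. Yacc), and even earlier versions of ANTLR, ANTLR 4 is very permissive in terms of what kind of grammars it accepts. For a top-down parser that chooses between alternatives using a fixed, small number of lookahead tokens, rules of the form:

    expr1 : TOKEN '+' '+';
    expr2 : TOKEN '+' TOKEN;

are problematic when both are candidates for the same input: their first two tokens coincide, so the parser has to look ahead to the third token in order to establish which rule matches. ANTLR 4 does not fix the amount of lookahead in advance and accepts such rules.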
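Both rules start with TOKEN '+', so the choice between them depends on the third token only. A minimal sketch of that decision, in which choose_rule is a hypothetical helper that works on a list of token kinds rather than on a real token stream:

```python
def choose_rule(kinds):
    """Return "expr1", "expr2" or None for a list of three token kinds."""
    # The first two tokens are the same in both rules: they cannot be
    # used to tell the rules apart.
    if len(kinds) != 3 or kinds[0] != "TOKEN" or kinds[1] != "+":
        return None
    # Only the third token distinguishes the two rules.
    if kinds[2] == "+":
        return "expr1"
    if kinds[2] == "TOKEN":
        return "expr2"
    return None

print(choose_rule(["TOKEN", "+", "+"]))
print(choose_rule(["TOKEN", "+", "TOKEN"]))
```

A parser which inspects fewer than three tokens before committing to one of the rules has no basis for this choice.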
Moreover, ANTLR does not fail on ambiguous grammars. For instance, the ambiguous grammar:
    grammar Expr2;

    expr : expr OP expr
         | OPEN expr CLOSE
         | TOKEN
         | NUMBER
         ;

    OPEN   : '(' ;
    CLOSE  : ')' ;
    OP     : ('+' | '*') ;
    NUMBER : [0-9]+ ;
    TOKEN  : [A-Z][a-z]*[0-9]* ;
    WS     : [ \t\r\n]+ -> skip ;
is accepted. The resulting parse tree for the expression 1 + 2 * 3, fed to the above grammar in ANTLR, is:

          *
         / \
        +   3
       / \
      1   2
The details regarding how ambiguity is dealt with in ANTLR go beyond this lecture.
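The shape of the tree matters as soon as the expression is evaluated. The sketch below represents a tree as nested tuples (op, left, right), which is an assumption of this example and not ANTLR's parse-tree representation; evaluate is a hypothetical helper. It compares the tree shown above with the tree dictated by the usual precedence of * over +.

```python
def evaluate(tree):
    """Evaluate a tree given as an int or as a tuple (op, left, right)."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    if op == "*":
        return evaluate(left) * evaluate(right)
    raise ValueError("unknown operator %r" % (op,))

antlr_tree = ("*", ("+", 1, 2), 3)   # the tree shown above: (1 + 2) * 3
usual_tree = ("+", 1, ("*", 2, 3))   # usual precedence:     1 + (2 * 3)

print(evaluate(antlr_tree))  # 9
print(evaluate(usual_tree))  # 7
```

The two readings of 1 + 2 * 3 yield different values, which is why the unambiguous grammar Expr encodes the precedence of the operators in its rules.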
We return to our previous unambiguous grammar of expressions, and illustrate a visitor which evaluates expressions.
    def toInteger(ctx):
        return int(toString(ctx))

    def prev(l):
        # the element before the top of the stack, if there is one
        if len(l) < 2:
            return None
        return l[len(l)-2]

    class EvalListener(ExprListener):
        def __init__(self, dict):
            self.dict = dict     # maps variable names to their values
            self.stack = []

        def clean(self, val):
            # an operation can be performed only if the top of the stack is a
            # value (not an operator or an opened parenthesis) and the
            # element before it is an operator
            top = self.stack[-1] if self.stack else None
            ready = top is not None and top not in ("+", "*", "(")
            if ready and prev(self.stack) == "+":
                valp = self.stack.pop()
                self.stack.pop()
                self.clean(val+valp)
            elif ready and prev(self.stack) == "*":
                valp = self.stack.pop()
                self.stack.pop()
                self.clean(val*valp)
            else:
                self.stack.append(val)

        def enterExpr(self, ctx):
            if ctx.expr() is not None:
                self.stack.append("+")

        def enterMult_expr(self, ctx):
            if ctx.mult_expr() is not None:
                self.stack.append("*")

        def enterAtom(self, ctx):
            if ctx.TOKEN() is not None:
                self.stack.append(self.dict[toString(ctx.TOKEN())])
            if ctx.NUMBER() is not None:
                self.stack.append(toInteger(ctx.NUMBER()))
            if ctx.expr() is not None:
                self.stack.append("(")

        def exitAtom(self, ctx):
            if ctx.expr() is not None:
                val = self.stack.pop()
                self.stack.pop()    # remove the opened parenthesis
                self.clean(val)
            if ctx.TOKEN() is not None or ctx.NUMBER() is not None:
                val = self.stack.pop()
                self.clean(val)
The class also relies on two helper functions:
- toInteger, which takes an AST object and returns an integer (it is used for getting the value of a NUMBER AST node)
- prev, which looks at the element before the top of the stack (represented as a list)
The underlying idea behind the visitor is as follows:
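- Whenever an addition or a multiplication is encountered (enterExpr, enterMult_expr), its operator is placed on the stack. Whenever a value or an opened parenthesis is encountered (enterAtom), it is also placed on the stack.
- When we finish visiting a variable or a number (exitAtom), its value is popped and forwarded to the clean method.
- When we finish visiting an expression in parenthesis, the stack is of the form [ … "(", <value> ]. Note that the expression in parenthesis is already evaluated. We pop the <value> and the opened parenthesis, and forward the value to the clean method, which can perform other subsequent evaluations.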
The clean method:
- receives val, the last value to be added on the stack. Instead of simply adding val, we check if an evaluation can be performed. This is the case if the stack contains: [… <op>, <valp>], i.e. an operator followed by its first operand. In this case, we pop <valp> and the operator, and apply the operator to <valp> and val. We do not simply push the result of the operation, but continue the cleaning process by a recursive call. This will evaluate other operations waiting in line.
- otherwise, val is simply added on the stack.
At the end of the visiting process, the stack will contain a unique value: the result of evaluating the expression.
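The stack discipline can be replayed by hand, without ANTLR. The sketch below is a stand-alone approximation of the listener: clean is rewritten as a plain function over a list, the push in enterAtom and the pop in exitAtom are merged into a single call to clean, and the function explicitly checks that the top of the stack is a value before evaluating (this check is an assumption of the sketch). The function replay is a hypothetical helper which lists, by hand, the events produced for the input 1 + 2 * 3.

```python
OPERATORS = ("+", "*")

def clean(stack, val):
    """Add val to the stack, first applying every operation waiting for it."""
    ready = (len(stack) >= 2
             and stack[-1] not in OPERATORS and stack[-1] != "("
             and stack[-2] in OPERATORS)
    if ready:
        valp = stack.pop()      # the first operand
        op = stack.pop()        # the operator pushed before it
        result = valp + val if op == "+" else valp * val
        clean(stack, result)    # other operations may be waiting in line
    else:
        stack.append(val)

def replay():
    """Replay, by hand, the listener events for the input 1 + 2 * 3."""
    stack = []
    stack.append("+")   # enterExpr: the expression contains an addition
    clean(stack, 1)     # atom 1 visited:  ["+", 1]
    stack.append("*")   # enterMult_expr: 2 * 3 contains a multiplication
    clean(stack, 2)     # atom 2 visited:  ["+", 1, "*", 2]
    clean(stack, 3)     # atom 3 visited:  2 * 3 = 6, then 1 + 6 = 7
    return stack

print(replay())  # [7]
```

Because operators are pushed before their operands, a pending operation is always found as an operator followed by a value at the top of the stack.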