Writing parsers with ANTLR
What is ANTLR (4.0)
ANTLR (ANother Tool for Language Recognition) is a widely used parser generator. The input for ANTLR is a grammar. The output is a set of files implementing:
- a lexer
- a parser, together with a walker which relies on a variant of the visitor design pattern. In short, the programmer simply needs to implement a visitor object (called a listener in ANTLR terminology) which defines the desired behaviour when each element of the grammar is reached.
Installing and using ANTLR 4.0 with Python
ANTLR can be used with several languages including Java and Python. In our example we choose Python because the code is easier to deploy and test. As far as ANTLR 4.0 is concerned, switching to Java or another language is just a matter of syntax.
To install ANTLR, run:
pip install antlr4-python2-runtime
and download the antlr4 JAR (which can be easily found online).
To compile a grammar Sample.g4, run:
java -Xmx500M -cp <path to ANTLR complete JAR> org.antlr.v4.Tool -Dlanguage=Python2 Sample.g4
The output consists of the files SampleLexer.py, SampleParser.py and SampleListener.py.
A grammar for arithmetic expressions
A grammar definition (named Expr.g4) for arithmetic expressions is given below:

    grammar Expr;

    expr      : mult_expr ('+' expr)? ;
    mult_expr : atom ('*' mult_expr)? ;
    atom      : '(' expr ')' | TOKEN | NUMBER ;

    NUMBER : [0-9]+ ;
    TOKEN  : [A-Z][a-z]*[0-9]* ;
    WS     : [ \t\r\n]+ -> skip ;

- The first line declares the grammar. Its name must match the name of the file (here, Expr.g4).
- The last three lines define tokens, in a manner similar to Flex. Token names must start with an uppercase letter and are conventionally written entirely in uppercase. Note that some tokens, such as * or (, are defined inline in the grammar itself.
- The last token also defines an action (-> skip), which consists in ignoring spaces, tabs and newlines.
- Our grammar consists of three rules: expr, mult_expr and atom. Rules have the general form: <name> : <body> ;
- The symbol | separates alternatives, in exactly the same way as in BNF grammar definitions.
- The ANTLR expression <expr>? denotes zero or one occurrences of <expr>.
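To make the shape of the three rules more concrete, the sketch below is a small hand-written recognizer that mirrors them one function per rule. It is not what ANTLR generates, only an illustration in plain Python; the names tokenize, Recognizer and accepts are hypothetical helpers introduced here. Unlike a call to the generated expr rule alone, accepts also insists that the whole input is consumed.

```python
import re

# One pattern per lexer rule (NUMBER, TOKEN) plus the inline literals.
# Leading spaces, tabs and newlines are skipped, as the WS rule does.
TOKEN_RE = re.compile(
    r"[ \t\r\n]*(?:(?P<NUMBER>[0-9]+)"
    r"|(?P<TOKEN>[A-Z][a-z]*[0-9]*)"
    r"|(?P<OP>[+*()]))"
)

def tokenize(text):
    """Split text into (kind, text) pairs; raise ValueError on a bad character."""
    tokens, pos = [], 0
    text = text.rstrip(" \t\r\n")
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

class Recognizer(object):
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Text of the next token, or None at the end of the input."""
        return self.tokens[self.pos][1] if self.pos < len(self.tokens) else None

    def expr(self):       # expr : mult_expr ('+' expr)? ;
        self.mult_expr()
        if self.peek() == "+":
            self.pos += 1
            self.expr()

    def mult_expr(self):  # mult_expr : atom ('*' mult_expr)? ;
        self.atom()
        if self.peek() == "*":
            self.pos += 1
            self.mult_expr()

    def atom(self):       # atom : '(' expr ')' | TOKEN | NUMBER ;
        if self.pos >= len(self.tokens):
            raise ValueError("unexpected end of input")
        kind, text = self.tokens[self.pos]
        if text == "(":
            self.pos += 1
            self.expr()
            if self.peek() != ")":
                raise ValueError("expected ')'")
            self.pos += 1
        elif kind in ("TOKEN", "NUMBER"):
            self.pos += 1
        else:
            raise ValueError("unexpected %r" % text)

def accepts(text):
    """True if the whole text is an expr according to the three rules."""
    try:
        recognizer = Recognizer(tokenize(text))
        recognizer.expr()
        return recognizer.pos == len(recognizer.tokens)
    except ValueError:
        return False

print(accepts("Var1 + (5 * 6)"))
print(accepts("1 + * 2"))
```

Each optional part ( ... )? of a rule becomes a single if, which is why the rules are written with the recursion on the right.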
Running the parser
After running ANTLR (see the first section), the files ExprLexer.py, ExprParser.py and ExprListener.py are generated.
We can incorporate them in our code as shown below:
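
    import sys

    from antlr4 import *
    from ExprLexer import ExprLexer
    from ExprListener import ExprListener
    from ExprParser import ExprParser

    stream = FileStream(sys.argv[1])     # input file, named on the command line
    lexer = ExprLexer(stream)            # splits the characters into tokens
    stream = CommonTokenStream(lexer)    # buffers the tokens for the parser
    parser = ExprParser(stream)
    tree = parser.expr()                 # parse, starting from the rule expr
    printer = PrintListener()            # the class implemented in the next section
    walker = ParseTreeWalker()
    walker.walk(printer, tree)           # triggers the enter/exit methods
    print(printer.str)                   # the string built during the walk
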
- the first line after the imports creates a file stream by opening the file whose name is given as the first command-line argument
- the second line creates a lexer from that stream. The lexer identifies tokens from the input
- the third line wraps the lexer in a token stream, and the fourth line creates a parser for our grammar, which reads its tokens from that stream
- the object tree (fifth line) is the parse tree (AST) of our parsed input. Here, expr is the start symbol for our grammar
- the object printer (sixth line) is a visitor object, which we shall implement below
- the call walker.walk(printer, tree) triggers the visiting process, where the visited object is tree, and the visitor is printer
Implementing visitors
In order to implement a visitor, we need to have a look at the file ExprListener.py. It contains stub definitions: a pair of enter and exit methods for each rule of our grammar. The enter (resp. exit) method is called at the beginning (resp. end) of visiting the corresponding AST sub-tree.
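The order in which the enter and exit methods fire can be sketched without ANTLR. The snippet below uses a hand-built tree and a tiny walker; Node, TraceListener and walk are hypothetical stand-ins written for this illustration, not the classes of the ANTLR runtime.

```python
class Node(object):
    """A bare-bones tree node: the name of a grammar rule plus its children."""
    def __init__(self, rule, children=()):
        self.rule = rule
        self.children = list(children)

class TraceListener(object):
    """Records every enter/exit event instead of acting on it."""
    def __init__(self):
        self.events = []

    def enter(self, node):
        self.events.append("enter " + node.rule)

    def exit(self, node):
        self.events.append("exit " + node.rule)

def walk(listener, node):
    # Depth-first traversal: enter the node, walk its children, then exit it.
    listener.enter(node)
    for child in node.children:
        walk(listener, child)
    listener.exit(node)

# Shape of the tree for an input made of a single atom, e.g. "5":
# expr -> mult_expr -> atom
tree = Node("expr", [Node("mult_expr", [Node("atom")])])
listener = TraceListener()
walk(listener, tree)
for event in listener.events:
    print(event)
```

Every exit event comes after all the events of the node's sub-tree, which is what allows a listener to emit or compute something once a sub-tree has been fully visited.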
We can implement our visitor by extending the class ExprListener, as shown in the code below:

    def toString (obj):
        return str(obj)

    class PrintListener(ExprListener):
        def __init__(self):
            self.str = ""
            # Sub-trees after which a pending "+" (resp. "*") must be printed.
            # Remembering the sub-tree itself, rather than a plain counter,
            # keeps nested inputs such as (A * B) + C correct.
            self.plus = []
            self.mult = []

        def enterExpr(self, ctx):
            if ctx.expr() is not None:
                self.plus.append(ctx.mult_expr())

        def exitMult_expr(self, ctx):
            if self.plus and self.plus[-1] is ctx:
                self.str += "+"
                self.plus.pop()

        def enterMult_expr(self, ctx):
            if ctx.mult_expr() is not None:
                self.mult.append(ctx.atom())

        def enterAtom(self, ctx):
            if ctx.TOKEN() is not None:
                self.str += toString(ctx.TOKEN())
            if ctx.NUMBER() is not None:
                self.str += toString(ctx.NUMBER())
            if ctx.expr() is not None:
                self.str += "("

        def exitAtom(self, ctx):
            if ctx.expr() is not None:
                self.str += ")"
            if self.mult and self.mult[-1] is ctx:
                self.str += "*"
                self.mult.pop()

Our class PrintListener pretty-prints the parsed expression: the output is accumulated in its member str.
Let us start with the method enterAtom:
- it receives the object ctx as parameter. ctx can be used to explore the AST sub-tree corresponding to our rule atom. Recall that this rule is atom : '(' expr ')' | TOKEN | NUMBER ;. We use the function calls TOKEN() and NUMBER() in order to examine the exact structure of our parsed atom.
- we also declare a class member str, which will hold the printed portion of our expression.
- the function toString, which is defined at top level, is used as a more legible alternative to Python's str.
- finally, note that enterAtom is called once we start visiting an atom. Hence, if this atom is a parenthesised expression, we add ( to our string. The method exitAtom will add the matching closing parenthesis.
We now look at enterExpr:
- Note that we need to add the addition (resp. multiplication) sign after the first element of the expression (here, mult_expr) has been visited. Thus, if an expression does contain an addition, we use the member variable plus to record the pending +. When the first element of a (possible) addition has been visited, i.e. when exitMult_expr is called, we check if we need to add a plus sign, and if so, we add it.
- The same idea is applied for multiplication, using the member variable mult and the method exitAtom.
Finally, we can test our code using e.g.:
Var1 + (5 * 6)
What grammars can ANTLR support
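Unlike other parser generators (e.g. Yacc), and even earlier versions of ANTLR, ANTLR 4 is extremely powerful in terms of the kind of grammars it accepts. Many generators bound in advance the number of lookahead tokens the parser may inspect (Yacc, for instance, builds LALR(1) parsers, which use a single lookahead token). For a top-down parser with such a bound, rules of the form:

    expr1 : TOKEN '+' '+' ;
    expr2 : TOKEN '+' TOKEN ;

are problematic: both rules start with TOKEN '+', so the parser must look three tokens ahead in order to establish which rule matches a certain input. ANTLR 4 does not fix the lookahead depth in advance, and looks ahead as many tokens as needed.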
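The role of lookahead for the two rules above can be sketched as follows. The snippet only compares token kinds against the rule bodies; RULES and candidates are hypothetical names used for this illustration and do not reflect how any parser generator is implemented.

```python
# Bodies of the two rules, written as sequences of token kinds.
RULES = {
    "expr1": ["TOKEN", "+", "+"],
    "expr2": ["TOKEN", "+", "TOKEN"],
}

def candidates(lookahead, k):
    """Names of the rules still compatible with the first k lookahead tokens."""
    return sorted(
        name for name, body in RULES.items() if body[:k] == lookahead[:k]
    )

# Token kinds of an input such as "A + B".
tokens = ["TOKEN", "+", "TOKEN"]

for k in (1, 2, 3):
    print(k, candidates(tokens, k))
```

With one or two tokens of lookahead both rules remain possible; only the third token singles out expr2.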
Moreover, ANTLR does not fail on ambiguous grammars. For instance, the ambiguous grammar:

    grammar Expr2;

    expr : expr OP expr
         | OPEN expr CLOSE
         | TOKEN
         | NUMBER
         ;

    OPEN   : '(' ;
    CLOSE  : ')' ;
    OP     : ('+' | '*') ;
    NUMBER : [0-9]+ ;
    TOKEN  : [A-Z][a-z]*[0-9]* ;
    WS     : [ \t\r\n]+ -> skip ;

is accepted. The resulting parse tree for the expression 1 + 2 * 3, fed to the above grammar in ANTLR, is:

          *
         / \
        +   3
       / \
      1   2

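Note that this tree groups 1 + 2 first, which does not match the usual precedence of multiplication over addition. The details regarding how ambiguity is dealt with in ANTLR are beyond the scope of this lecture.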
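Which tree is chosen matters as soon as the tree is given a meaning. As a small sketch, the snippet below evaluates the tree shown above and the tree one would obtain with the usual precedence rules. Trees are encoded as nested tuples (op, left, right); this encoding and the function evaluate are assumptions made for the illustration only.

```python
def evaluate(tree):
    """Evaluate a tree given as an int leaf or a tuple (op, left, right)."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs * rhs

# The tree produced for 1 + 2 * 3 by the ambiguous grammar: (1 + 2) * 3
shown_tree = ("*", ("+", 1, 2), 3)

# The tree matching the usual precedence of * over +: 1 + (2 * 3)
conventional_tree = ("+", 1, ("*", 2, 3))

print(evaluate(shown_tree))         # 9
print(evaluate(conventional_tree))  # 7
```

The unambiguous grammar Expr avoids the problem by encoding precedence in the rules themselves: a mult_expr is always nested below an expr.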
A visitor for evaluating expressions
We return to our previous unambiguous grammar of expressions, and illustrate a visitor which evaluates expressions.

    # toString is the helper defined together with PrintListener above.
    def toInteger (ctx):
        return int(toString(ctx))

    def prev (l):
        # Element just below the top of the stack. None if the stack is too
        # short, or if its top is not a value (i.e. it is "+", "*" or "(");
        # without this check, inputs such as 2 * 3 + 4 would try to add a
        # number to an operator.
        if len(l) < 2 or isinstance(l[-1], str):
            return None
        return l[len(l)-2]

    class EvalListener(ExprListener):
        def __init__(self, dict):
            self.dict = dict    # maps TOKEN names to their values
            self.stack = []

        def clean(self, val):
            if prev(self.stack) == "+":
                valp = self.stack.pop()
                self.stack.pop()
                self.clean(val+valp)
            elif prev(self.stack) == "*":
                valp = self.stack.pop()
                self.stack.pop()
                self.clean(val*valp)
            else:
                self.stack.append(val)

        def enterExpr(self, ctx):
            if ctx.expr() is not None:
                self.stack.append("+")

        def enterMult_expr(self, ctx):
            if ctx.mult_expr() is not None:
                self.stack.append("*")

        def enterAtom(self, ctx):
            if ctx.TOKEN() is not None:
                self.stack.append(self.dict[toString(ctx.TOKEN())])
            if ctx.NUMBER() is not None:
                self.stack.append(toInteger(ctx.NUMBER()))
            if ctx.expr() is not None:
                self.stack.append("(")

        def exitAtom(self, ctx):
            if ctx.expr() is not None:
                val = self.stack.pop()
                self.stack.pop()    # remove the opened parenthesis
                self.clean(val)
            if ctx.TOKEN() is not None or ctx.NUMBER() is not None:
                val = self.stack.pop()
                self.clean(val)

The class also relies on two helper functions:
- toInteger, which takes an AST object and returns an integer (it is used for getting the value of a NUMBER node)
- prev, which looks at the element just below the top of the stack (represented as a list)
The underlying idea behind the visitor is as follows:
- whenever an addition or multiplication node is visited, the respective operation is placed on the stack
- whenever an opened parenthesis is found (see enterAtom), it is also placed on the stack
- whenever a token is found, its value from the dictionary is placed on the stack
- whenever a number is found, it is placed on the stack
- when the visiting process of a token or number has ended, operations from the stack may be performed. Thus, the last value from the stack is popped, and forwarded to the clean method.
- when the visiting process of a parenthesised expression has ended, we expect the stack to contain: [ …, "(", <value> ]. Note that the expression in parentheses is already evaluated. We pop the <value> and the opened parenthesis, and forward the value to the clean method, which can perform other subsequent evaluations.
The clean method:
- operates on the stack, and receives as parameter a value val, the last value to be added on the stack. Instead of simply adding val, we check if an evaluation can be performed. This is the case if the stack ends with: [ …, <op>, <valp> ], i.e. an operator followed by its first operand (recall that the operator is pushed when its node is entered, before its operands are visited). In this case, we pop <valp> and the operator. We do not simply push the result of the operation, but continue the cleaning process by a recursive call. This will evaluate other operations waiting in line.
- if no evaluation can be performed, val is simply added on the stack
At the end of the visiting process, the stack will contain a unique value: the result of evaluating the expression.
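The stack discipline can be tried out without ANTLR by feeding it the sequence of pushes the visitor would perform. The sketch below is a simplified, iterative variant of clean, not the listener itself; the function run, the event encoding, and the check that the top of the stack is a value are assumptions made for this illustration.

```python
def clean(stack, val):
    """Push val, first folding it into every pending [..., op, operand] pair."""
    while (len(stack) >= 2
           and not isinstance(stack[-1], str)   # the top must be a value
           and stack[-2] in ("+", "*")):        # with an operator just below
        operand = stack.pop()
        op = stack.pop()
        val = val + operand if op == "+" else val * operand
    stack.append(val)

def run(events):
    """Replay pushes: "+", "*" and "(" are markers, ")" closes a parenthesis,
    anything else is the value of a token or number."""
    stack = []
    for event in events:
        if event in ("+", "*", "("):
            stack.append(event)
        elif event == ")":
            val = stack.pop()
            stack.pop()          # the matching "("
            clean(stack, val)
        else:
            clean(stack, event)
    return stack

# Var1 + (5 * 6), assuming the dictionary maps Var1 to 2.
# The operator is pushed when its node is entered, hence before its operands.
print(run(["+", 2, "(", "*", 5, 6, ")"]))   # [32]

# 2 * 3 + 4: both operators are entered before the first atom is reached.
print(run(["+", "*", 2, 3, 4]))             # [10]
```

In the first run the stack grows to ["+", 2, "(", "*", 5] before 6 arrives; the multiplication is then folded to 30, the parenthesis is removed, and the pending addition yields 32.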