Writing parsers with ANTLR

ANTLR (ANother Tool for Language Recognition) is a widely used parser generator. The input for ANTLR is a grammar. The output is a set of files implementing:

  • a lexer
  • a parser, together with a listener class used by a tree walker, which relies on the visitor design pattern. In short, the programmer simply needs to implement a visitor object which defines the desired behaviour when each element of the grammar is reached.

ANTLR can be used with several languages including Java and Python. In our example we choose Python because the code is easier to deploy and test. As far as ANTLR 4.0 is concerned, switching to Java or another language is just a matter of syntax.

To install ANTLR, run:

pip install antlr4-python2-runtime

and download the antlr4 JAR (which can be easily found online).

To compile a grammar Sample.g4, run:

java -Xmx500M -cp <path to ANTLR complete JAR> org.antlr.v4.Tool -Dlanguage=Python2 Sample.g4

The output consists of the files:

  • SampleLexer.py, SampleParser.py, and SampleListener.py

A grammar definition (named Expr.g4) for arithmetic expressions is given below:

grammar Expr;

expr : mult_expr ('+' expr)? ;
mult_expr : atom ('*' mult_expr)? ;

atom : '(' expr ')' | TOKEN | NUMBER ;

NUMBER : [0-9]+ ;
TOKEN : [A-Z][a-z]*[0-9]* ;            
WS : [ \t\r\n]+ -> skip ; 

  • The first line declares the name of the grammar.
  • The last three lines define tokens, in a manner similar to Flex. Token names are written in uppercase (ANTLR requires them to start with an uppercase letter). Note that some tokens, such as * or (, are defined inline in the parser rules themselves.
  • The last token rule (WS) also carries the lexer command -> skip, which discards spaces, tabs and newlines instead of passing them to the parser.
  • Our grammar consists of three rules: expr, mult_expr and atom. Rules have the general form: <name> : <body> ;
  • The symbol | separates alternatives, exactly as in BNF grammar definitions.
  • The ANTLR expression <expr>? denotes zero or one occurrences of <expr>.
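
To make the meaning of the three rules concrete, the sketch below recognizes the same language with a hand-written recursive-descent parser in plain Python. It is only an illustration under our own assumptions: the names TOKEN_SPEC, tokenize and accepts are hypothetical, and this is not the code which ANTLR generates. Each nested function mirrors one rule of Expr.g4.

```python
import re

# Token patterns copied from the lexer rules of Expr.g4.
# LITERAL stands for the tokens defined inline in the parser rules.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("TOKEN", r"[A-Z][a-z]*[0-9]*"),
    ("WS", r"[ \t\r\n]+"),
    ("LITERAL", r"[+*()]"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Split text into (kind, value) pairs, skipping whitespace like WS -> skip."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError("unexpected character %r" % text[pos])
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def accepts(text):
    """Return True if text can be derived from the start rule expr."""
    try:
        tokens = tokenize(text)
    except ValueError:
        return False
    pos = [0]  # cursor shared by the nested functions

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else (None, None)

    def take(value):
        if peek()[1] != value:
            raise SyntaxError("expected %r" % value)
        pos[0] += 1

    def expr():        # expr : mult_expr ('+' expr)? ;
        mult_expr()
        if peek()[1] == "+":
            take("+")
            expr()

    def mult_expr():   # mult_expr : atom ('*' mult_expr)? ;
        atom()
        if peek()[1] == "*":
            take("*")
            mult_expr()

    def atom():        # atom : '(' expr ')' | TOKEN | NUMBER ;
        kind, value = peek()
        if value == "(":
            take("(")
            expr()
            take(")")
        elif kind in ("TOKEN", "NUMBER"):
            pos[0] += 1
        else:
            raise SyntaxError("unexpected %r" % (value,))

    try:
        expr()
    except SyntaxError:
        return False
    return pos[0] == len(tokens)  # the whole input must be consumed


print(accepts("Var1 + (5 * 6)"))  # True
```

The optional parts ('+' expr)? and ('*' mult_expr)? become plain if statements: the parser continues with the optional part only when the next token is the corresponding operator.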

After running ANTLR (see the first section), the files:

  • ExprLexer.py, ExprParser.py and ExprListener.py

are generated.

We can incorporate them in our code as shown below:

from antlr4 import *
from ExprLexer import ExprLexer
from ExprListener import ExprListener
from ExprParser import ExprParser
import sys
 
stream = FileStream(sys.argv[1])
lexer = ExprLexer(stream)
stream = CommonTokenStream(lexer)
parser = ExprParser(stream)
tree = parser.expr()
printer = PrintListener()   # PrintListener is implemented below
walker = ParseTreeWalker()
walker.walk(printer, tree)
print(printer.str)          # show the string built by the listener

  • the first line after the imports creates a file stream by opening the file whose name is given as the first command-line argument (sys.argv[1])
  • the second line creates a lexer from that stream. The lexer identifies tokens from the input
  • the third line wraps the lexer in a token stream, and the fourth line creates a parser for our grammar, which reads its tokens from that stream
  • the object tree (fifth line) is the parse tree (AST) of our input. Here, expr is the start symbol of our grammar
  • the object printer (sixth line) is a visitor object, which we shall implement below
  • the call walker.walk(printer, tree) triggers the visiting process, where the visited object is tree, and the visitor is printer

In order to implement a visitor, we need to have a look at the file ExprListener.py. It contains stub definitions: a pair of enter and exit functions, one for each rule from our grammar. Each respective method will be called at the beginning/end of visiting the corresponding AST sub-tree.
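
The order in which the enter and exit methods are called can be illustrated without ANTLR. The sketch below is a simplified model under our own assumptions: Node, TraceListener and walk are hypothetical stand-ins for the parse tree, the listener and the walker, and a single generic enter/exit pair replaces the per-rule methods such as enterExpr and exitExpr.

```python
class Node(object):
    """Hypothetical stand-in for a parse-tree node: a rule name plus children."""

    def __init__(self, rule, children=()):
        self.rule = rule
        self.children = list(children)


class TraceListener(object):
    """Records every enter/exit call in the order the walker makes them."""

    def __init__(self):
        self.calls = []

    def enter(self, node):
        self.calls.append("enter " + node.rule)

    def exit(self, node):
        self.calls.append("exit " + node.rule)


def walk(listener, node):
    """Depth-first walk: enter the node, walk its children, then exit it."""
    listener.enter(node)
    for child in node.children:
        walk(listener, child)
    listener.exit(node)


# Rule nodes of the parse tree for the input "1 + 2" (terminals omitted).
tree = Node("expr", [
    Node("mult_expr", [Node("atom")]),
    Node("expr", [Node("mult_expr", [Node("atom")])]),
])

listener = TraceListener()
walk(listener, tree)
print("\n".join(listener.calls))
```

The trace starts with enter expr, enter mult_expr, enter atom, exit atom, exit mult_expr: the exit method of the first operand is called before the second operand is entered, which is the property our listeners rely on.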

We can implement our visitor by extending the class ExprListener (defined in ExprListener.py), as shown in the code below:

def toString (obj):
    return str(obj)
 
class PrintListener(ExprListener):
 
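    def __init__(self):
        self.str = ""
        self.plus = 0
        self.mult = 0
        self.saved = []   # pending (plus, mult) counters of enclosing expressions
 
    def enterExpr(self, ctx):
        if ctx.expr() is not None:
            self.plus += 1
 
    def exitMult_expr(self, ctx):
        if self.plus > 0:
            self.str += "+"
            self.plus -= 1
 
    def enterMult_expr(self, ctx):
        if ctx.mult_expr() is not None:
            self.mult += 1
 
    def enterAtom(self, ctx):
        if ctx.TOKEN() is not None:
            self.str += toString(ctx.TOKEN())
        if ctx.NUMBER() is not None:
            self.str += toString(ctx.NUMBER())
 
        if ctx.expr() is not None:
            self.str += "("
            # operators pending outside the parentheses must not be printed
            # inside them, e.g. in (1)+2: set them aside until exitAtom
            self.saved.append((self.plus, self.mult))
            self.plus = 0
            self.mult = 0
 
    def exitAtom(self, ctx):
        if ctx.expr() is not None:
            self.str += ")"
            self.plus, self.mult = self.saved.pop()
        if self.mult > 0:
            self.str += "*"
            self.mult -= 1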

Our class PrintListener pretty-prints the parsed expression: it accumulates the text in its member str, which can be printed once the walk has finished.

Let us start with the method enterAtom :

  • it receives the object ctx as parameter. ctx can be used to explore the AST sub-tree corresponding to our rule atom. Recall that this rule is atom : '(' expr ')' | TOKEN | NUMBER ;. We use the calls ctx.TOKEN(), ctx.NUMBER() and ctx.expr(), each of which returns None when the corresponding element is absent, in order to examine the exact structure of our parsed atom.
  • the class member str, initialized in the constructor, holds the portion of our expression printed so far.
  • the function toString, which is defined at top level, is used as a more legible alternative to Python's str.
  • finally, note that enterAtom is called once we start visiting an atom. Hence, if the atom is a parenthesized expression, we add ( to our string. The method exitAtom adds the matching closing parenthesis.

We now look at enterExpr :

  • Note that we need to add the addition (resp. multiplication) sign after the first element of the expression (here, mult_expr) has been visited. Thus, if an expression does contain an addition, we use the member variable plus to record that a + is pending. When the first element of a (possible) addition has been visited, i.e. when exitMult_expr is called, we check if we need to add a plus sign, and if so, we add it.
  • The same idea is applied for multiplication, using the member variable mult and the method exitAtom.

Finally, we can test our code using e.g.:

Var1 + (5 * 6)

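Unlike many other parser generators, and even earlier versions of ANTLR, ANTLR 4 is very permissive in terms of what kind of grammars it accepts. Consider rules of the form:

  expr1 : TOKEN '+' '+';
  expr2 : TOKEN '+' TOKEN;

Both rules start with TOKEN '+', so a top-down parser, which must choose a rule before consuming the input, can only tell them apart at the third token. Generators with a small fixed lookahead fail to generate a parser for such grammars, unless the rules are rewritten. Yacc, which builds bottom-up parsers using a single token of lookahead, imposes restrictions of its own, reported as shift/reduce and reduce/reduce conflicts. ANTLR 4 instead looks ahead as many tokens as a decision requires.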
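
The amount of lookahead required to separate the two rules can be checked with a short sketch. The helper lookahead_needed is hypothetical (it is not part of ANTLR or Yacc) and simply compares the two right-hand sides token by token.

```python
def lookahead_needed(alt_a, alt_b):
    """Return how many tokens must be read before the two alternatives differ."""
    for k, (a, b) in enumerate(zip(alt_a, alt_b), start=1):
        if a != b:
            return k
    return None  # one alternative is a prefix of the other


expr1 = ["TOKEN", "+", "+"]
expr2 = ["TOKEN", "+", "TOKEN"]
print(lookahead_needed(expr1, expr2))  # 3
```

The first two tokens are identical in both rules, so a parser which has to pick a rule up front needs to see the third token.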

Moreover, ANTLR does not fail on ambiguous grammars. For instance, the ambiguous grammar:

grammar Expr2;

expr : expr OP expr | OPEN expr CLOSE | TOKEN | NUMBER ;

OPEN : '(' ;
CLOSE : ')' ;
OP : ('+' | '*');
NUMBER : [0-9]+ ;
TOKEN : [A-Z][a-z]*[0-9]* ;            
WS : [ \t\r\n]+ -> skip ; 

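is accepted. For the expression 1 + 2 * 3, the parse tree built by ANTLR from the above grammar is shown below (both operators are matched by the same token OP, hence they are grouped from left to right):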

      *
    /   \
   +     3
 /   \
1     2 

The details regarding how ambiguity is dealt with in ANTLR go beyond this lecture.
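
The choice between the possible parse trees matters as soon as the tree is evaluated. The sketch below uses a hypothetical tuple encoding of trees, (op, left, right), which is unrelated to ANTLR's own tree classes, and compares the tree shown above with the other tree allowed by the ambiguous grammar.

```python
def evaluate(tree):
    """Evaluate a tree given as an int leaf or an (op, left, right) tuple."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs * rhs


left_grouped = ("*", ("+", 1, 2), 3)    # the tree shown above: (1 + 2) * 3
right_grouped = ("+", 1, ("*", 2, 3))   # the other possible tree: 1 + (2 * 3)

print(evaluate(left_grouped))   # 9
print(evaluate(right_grouped))  # 7
```

Only the second tree follows the usual precedence of multiplication over addition, which is what the rules expr and mult_expr of our first grammar encode.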

We return to our previous unambiguous grammar of expressions, and illustrate a visitor which evaluates expressions.

def toInteger (ctx):
    return int(toString(ctx))
 
def prev (l):
    if len(l) < 2:
        return None
    return l[len(l)-2]
 
class EvalListener(ExprListener):
    def __init__(self,dict):
        self.dict = dict
        self.stack = []
 
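    def clean(self,val):
        # an operation can be applied only if the top of the stack is a value
        # (not an operator or a parenthesis) with an operator directly beneath it
        ready = len(self.stack) > 0 and self.stack[-1] not in ("+", "*", "(")

        if ready and prev(self.stack) == "+":
            valp = self.stack.pop()
            self.stack.pop()
            self.clean(val+valp)
 
        elif ready and prev(self.stack) == "*":
            valp = self.stack.pop()
            self.stack.pop()
            self.clean(val*valp) 
 
        else:
            self.stack.append(val)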
 
    def enterExpr(self, ctx):
        if ctx.expr() != None:
            self.stack.append("+")
 
    def enterMult_expr(self, ctx):
        if ctx.mult_expr() != None:
            self.stack.append("*")
 
    def enterAtom(self, ctx):
        if ctx.TOKEN() != None:
            self.stack.append(self.dict[toString(ctx.TOKEN())])
        if ctx.NUMBER() != None:
            self.stack.append(toInteger(ctx.NUMBER()))
 
        if ctx.expr() != None:
            self.stack.append("(")
 
    def exitAtom(self, ctx):
        if ctx.expr() != None:
            val = self.stack.pop()
            self.stack.pop() #remove the opened par
            self.clean(val)
        if ctx.TOKEN() != None or ctx.NUMBER() != None:
            val = self.stack.pop()
            self.clean(val)

The class also relies on two helper functions:

  • toInteger, which takes an AST object and returns an integer (it is used for getting the value of a NUMBER node)
  • prev, which looks at the element just below the top of the stack (represented as a list)

The underlying idea behind the visitor is as follows:

  • whenever an addition or multiplication node is visited, the respective operation is placed on the stack
  • whenever an opening parenthesis is found (see enterAtom), it is also placed on the stack
  • whenever a token is found, its value from the dictionary is placed on the stack
  • whenever a number is found, it is placed on the stack
  • when the visiting process of a token or number has ended, operations from the stack can be performed. Thus, the last value from the stack is popped, and forwarded to the clean method.
  • when the visiting process of a parenthesized atom has ended, we expect the stack to contain: [ … “(”, <value> ]. Note that the expression in parentheses is already evaluated. We pop the <value> and the opening parenthesis, and forward the value to the clean method, which can perform other subsequent evaluations.

The clean method:

  • operates on the stack, and receives as parameter a value val: the last value to be added on the stack. Instead of simply adding val, we check if an evaluation can be performed. This is the case if the stack contains: [… <op>, <valp>], i.e. a value <valp> on top, with an operator directly beneath it. In this case, we pop <valp> and the operator. We do not simply push the result of the operation, but continue the cleaning process by a recursive call. This will evaluate other operations waiting in line.
  • if no evaluation can be performed, val is simply added on the stack.

At the end of the visiting process, the stack will contain a unique value: the result of evaluating the expression.
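
The evolution of the stack can be followed without ANTLR by replaying the listener calls by hand. The sketch below is a simplified model under our own assumptions: the event names (op, token, number, open, close) and the function evaluate_events are hypothetical, clean is written as a loop instead of a recursive call, and it checks explicitly that the top of the stack is a value before applying an operator.

```python
OPERATORS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def clean(stack, val):
    """Fold val into the stack, applying every operation that is ready."""
    # An operation is ready when the stack ends with [..., <op>, <value>].
    while (len(stack) >= 2 and stack[-2] in OPERATORS
           and stack[-1] not in OPERATORS and stack[-1] != "("):
        valp = stack.pop()
        op = stack.pop()
        val = OPERATORS[op](valp, val)
    stack.append(val)


def evaluate_events(events, env):
    """Replay listener events and return the final stack."""
    stack = []
    for kind, arg in events:
        if kind == "op":          # enterExpr / enterMult_expr with an operator
            stack.append(arg)
        elif kind == "open":      # enterAtom on a parenthesized expression
            stack.append("(")
        elif kind == "token":     # TOKEN atom: look its value up
            clean(stack, env[arg])
        elif kind == "number":    # NUMBER atom
            clean(stack, arg)
        elif kind == "close":     # exitAtom on a parenthesized expression
            val = stack.pop()
            stack.pop()           # remove the matching "("
            clean(stack, val)
    return stack


# Events produced by walking the parse tree of: Var1 + (5 * 6)
events = [("op", "+"), ("token", "Var1"), ("open", None), ("op", "*"),
          ("number", 5), ("number", 6), ("close", None)]
print(evaluate_events(events, {"Var1": 2}))  # [32]
```

Just before the closing parenthesis is processed, the stack is ["+", 2, "(", 30]: the product has already been evaluated, while the addition waits until the parenthesis is removed.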