Writing parsers with ANTLR

What is ANTLR (4.0)

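ANTLR (ANother Tool for Language Recognition) is a widely used parser generator. The input for ANTLR is a grammar. The output is a set of files implementing a lexer and a parser for that grammar, together with a listener base class for walking the parse trees they produce.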

Installing and using ANTLR 4.0 with Python

ANTLR can be used with several languages including Java and Python. In our example we choose Python because the code is easier to deploy and test. As far as ANTLR 4.0 is concerned, switching to Java or another language is just a matter of syntax.

To install ANTLR, run:

pip install antlr4-python2-runtime

and download the antlr4 JAR (which can be easily found online).

To compile a grammar Sample.g4, run:

java -Xmx500M -cp <path to ANTLR complete JAR> org.antlr.v4.Tool -Dlanguage=Python2 Sample.g4

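The output consists of the files SampleLexer.py, SampleParser.py and SampleListener.py, together with token-vocabulary files such as Sample.tokens.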

A grammar for arithmetic expressions

A grammar definition (named Expr.g4) for arithmetic expressions is given below:

grammar Expr;

expr : mult_expr ('+' expr)? ;
mult_expr : atom ('*' mult_expr)? ;

atom : '(' expr ')' | TOKEN | NUMBER ;

NUMBER : [0-9]+ ;
TOKEN : [A-Z][a-z]*[0-9]* ;            
WS : [ \t\r\n]+ -> skip ; 
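
To get a feel for what the three lexer rules accept, the sketch below imitates them with Python's re module. It is only an approximation written for illustration, not the lexer that ANTLR generates: the names TOKEN_SPEC, MASTER, tokenize and the extra PUNCT class (standing in for the literals '(', ')', '+' and '*') are hypothetical.

```python
import re

# Hypothetical stand-ins for the lexer rules of Expr.g4.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("TOKEN", r"[A-Z][a-z]*[0-9]*"),
    ("WS", r"[ \t\r\n]+"),
    ("PUNCT", r"[()+*]"),   # the literals used by the parser rules
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Split text into (rule name, matched text) pairs, skipping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at position %d" % (text[pos], pos))
        if m.lastgroup != "WS":   # mirrors "-> skip"
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = tokenize("Var1 + (5 * 6)")
```

Note that TOKEN requires a leading upper-case letter, so an input such as var1 is rejected by this sketch.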

Running the parser

After running ANTLR (see the installation section above), the files ExprLexer.py, ExprParser.py and ExprListener.py are generated.

We can incorporate them in our code as shown below:

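from antlr4 import *
from ExprLexer import ExprLexer
from ExprListener import ExprListener
from ExprParser import ExprParser
import sys

# PrintListener is the listener class defined in the next section;
# its definition must appear before this point in the script.

# Lexing: characters of the input file -> tokens
chars = FileStream(sys.argv[1])
lexer = ExprLexer(chars)
tokens = CommonTokenStream(lexer)

# Parsing: tokens -> parse tree rooted at the start rule expr
parser = ExprParser(tokens)
tree = parser.expr()

# Walk the tree, firing the enter/exit callbacks of our listener
printer = PrintListener()
walker = ParseTreeWalker()
walker.walk(printer, tree)
print(printer.str)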

Implementing visitors

In order to implement a visitor, we need to have a look at the file ExprListener.py. It contains stub definitions: a pair of enter and exit methods for each rule of our grammar. While walking the parse tree, the walker calls the enter method when it reaches a node produced by the corresponding rule, and the exit method once all children of that node have been visited.
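
The order in which enter and exit callbacks fire can be illustrated without ANTLR. The sketch below is a toy model only: Node, TraceListener and walk are hypothetical names, and the real walker and generated listener have one method pair per rule rather than a single generic pair.

```python
class Node(object):
    """A toy parse-tree node labelled with the rule that produced it."""
    def __init__(self, rule, children=()):
        self.rule = rule
        self.children = list(children)

class TraceListener(object):
    """Records every callback so the calling order can be inspected."""
    def __init__(self):
        self.events = []

    def enter(self, node):
        self.events.append("enter " + node.rule)

    def exit(self, node):
        self.events.append("exit " + node.rule)

def walk(listener, node):
    # Depth-first: enter the node, visit its children, then exit it.
    listener.enter(node)
    for child in node.children:
        walk(listener, child)
    listener.exit(node)

# Shape of the tree for a single atom: expr -> mult_expr -> atom
tree = Node("expr", [Node("mult_expr", [Node("atom")])])
listener = TraceListener()
walk(listener, tree)
```

After the walk, listener.events lists the three enter events from the root downwards, followed by the three exit events in the opposite order.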

We can implement our visitor by extending the class ExprListener, as shown in the code below:

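def toString(obj):
    # str() of a terminal node yields the text of the matched token
    return str(obj)

class PrintListener(ExprListener):

    def __init__(self):
        self.str = ""    # the expression rebuilt so far
        self.plus = 0    # 1 while a "+" waits for its left operand to end
        self.mult = 0    # same, for "*"
        self.saved = []  # counters of the enclosing parenthesis levels

    def enterExpr(self, ctx):
        # expr : mult_expr ('+' expr)? ; a nested expr means a '+' was matched
        if ctx.expr() is not None:
            self.plus += 1

    def exitMult_expr(self, ctx):
        # the left operand of a pending "+" has just ended
        if self.plus > 0:
            self.str += "+"
            self.plus -= 1

    def enterMult_expr(self, ctx):
        if ctx.mult_expr() is not None:
            self.mult += 1

    def enterAtom(self, ctx):
        if ctx.TOKEN() is not None:
            self.str += toString(ctx.TOKEN())
        if ctx.NUMBER() is not None:
            self.str += toString(ctx.NUMBER())

        if ctx.expr() is not None:
            self.str += "("
            # operators pending outside the parentheses must not fire inside
            self.saved.append((self.plus, self.mult))
            self.plus = self.mult = 0

    def exitAtom(self, ctx):
        if ctx.expr() is not None:
            self.str += ")"
            self.plus, self.mult = self.saved.pop()
        if self.mult > 0:
            self.str += "*"
            self.mult -= 1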

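Our class PrintListener rebuilds the parsed expression as a string, which is available in the field str once the walk is complete.

Let us start with the method enterAtom: it appends the text of the matched TOKEN or NUMBER to the string. If the atom is a parenthesised expression, it appends "(" instead, and exitAtom appends the matching ")" after the inner expression has been visited.

We now look at enterExpr: a nested expr is present only when the rule matched a '+', in which case the counter plus records that an operator is pending. The "+" itself is emitted by exitMult_expr, that is, right after the left operand has been printed. The methods enterMult_expr and exitAtom treat '*' in the same way.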

Finally, we can test our code using e.g.:

Var1 + (5 * 6)

What grammars can ANTLR support

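Unlike other parser generators (e.g. Yacc), and even earlier versions of ANTLR, ANTLR 4 accepts a very wide class of grammars. Generators that make their decisions with a small, fixed number of lookahead tokens reject some grammars outright. For example, a top-down parser limited to two tokens of lookahead cannot handle rules of the form:

  expr1 : TOKEN '+' '+';
  expr2 : TOKEN '+' TOKEN;

because both rules begin with TOKEN '+' and only the third token tells them apart. (Yacc, an LALR(1) generator, works with a single token of lookahead.) ANTLR 4 does not fix the lookahead depth in advance: it looks as far ahead as needed in order to establish which rule matches a certain input.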
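
The amount of lookahead that the two rules above would require from a predictive (top-down) parser can be checked with a few lines of code. This is a minimal sketch for illustration; the function lookahead_needed is a hypothetical helper, not part of any parser generator.

```python
def lookahead_needed(alt_a, alt_b):
    """Number of tokens to read before the two alternatives differ, or None."""
    for k, (a, b) in enumerate(zip(alt_a, alt_b), start=1):
        if a != b:
            return k
    return None  # one alternative is a prefix of the other

expr1 = ["TOKEN", "+", "+"]
expr2 = ["TOKEN", "+", "TOKEN"]

k = lookahead_needed(expr1, expr2)  # 3: the first two tokens are identical
```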

Moreover, ANTLR does not fail on ambiguous grammars. For instance, the ambiguous grammar:

grammar Expr2;

expr : expr OP expr | OPEN expr CLOSE | TOKEN | NUMBER ;

OPEN : '(' ;
CLOSE : ')' ;
OP : ('+' | '*');
NUMBER : [0-9]+ ;
TOKEN : [A-Z][a-z]*[0-9]* ;            
WS : [ \t\r\n]+ -> skip ; 

is accepted. The parse tree that ANTLR builds for the expression 1 + 2 * 3 using the above grammar is:

      *
    /   \
   +     3
 /   \
1     2 

The details of how ANTLR deals with ambiguity are beyond the scope of this lecture.
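
Why the choice of parse tree matters can be seen by evaluating the two possible trees for 1 + 2 * 3. The sketch below encodes trees as nested tuples, which is an assumption made for illustration and not ANTLR's tree representation; evaluate is a hypothetical helper.

```python
def evaluate(tree):
    """Evaluate a tree given as an int leaf or an (operator, left, right) tuple."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    l = evaluate(left)
    r = evaluate(right)
    return l + r if op == "+" else l * r

grouped_left = ("*", ("+", 1, 2), 3)    # the tree shown above: (1 + 2) * 3
usual_priority = ("+", 1, ("*", 2, 3))  # 1 + (2 * 3)

results = (evaluate(grouped_left), evaluate(usual_priority))  # (9, 7)
```

The grammar Expr2 uses a single operator token for both '+' and '*', so nothing in it expresses that multiplication should bind tighter; the unambiguous grammar Expr encodes this through the separate rules expr and mult_expr.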

A visitor for evaluating expressions

We return to our previous unambiguous grammar of expressions, and illustrate a visitor which evaluates expressions.

def toInteger(ctx):
    # toString is the helper defined together with PrintListener
    return int(toString(ctx))

def prev(l):
    # second-to-last element of the list, or None if there is none
    if len(l) < 2:
        return None
    return l[len(l)-2]

class EvalListener(ExprListener):
    def __init__(self, dict):
        self.dict = dict    # maps variable names to their values
        self.stack = []     # holds values, "+" / "*" and "(" markers

    def clean(self, val):
        # Reduce only if the stack ends with [operator, left operand]:
        # the top must be a value, not an operator or "(" that is still
        # waiting for its operand.
        top = self.stack[-1] if self.stack else None
        op = prev(self.stack)
        if top in ("+", "*", "(") or op not in ("+", "*"):
            self.stack.append(val)
            return
        valp = self.stack.pop()   # left operand
        self.stack.pop()          # the operator itself
        self.clean(val + valp if op == "+" else val * valp)

    def enterExpr(self, ctx):
        if ctx.expr() is not None:
            self.stack.append("+")

    def enterMult_expr(self, ctx):
        if ctx.mult_expr() is not None:
            self.stack.append("*")

    def enterAtom(self, ctx):
        if ctx.TOKEN() is not None:
            self.stack.append(self.dict[toString(ctx.TOKEN())])
        if ctx.NUMBER() is not None:
            self.stack.append(toInteger(ctx.NUMBER()))

        if ctx.expr() is not None:
            self.stack.append("(")

    def exitAtom(self, ctx):
        if ctx.expr() is not None:
            val = self.stack.pop()
            self.stack.pop()  # remove the opening parenthesis marker
            self.clean(val)
        if ctx.TOKEN() is not None or ctx.NUMBER() is not None:
            val = self.stack.pop()
            self.clean(val)

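The class also relies on two helper methods: toInteger, which converts a NUMBER node to a Python integer, and prev, which returns the second-to-last element of a list (or None if the list has fewer than two elements).

The underlying idea behind the visitor is as follows: a stack holds values, the operators "+" and "*", and "(" markers. An operator is pushed when we enter an expr or mult_expr that has a right-hand operand, so it sits on the stack just below its left operand. Each atom pushes its value (or "(" for a parenthesised expression), and exitAtom hands the finished value to clean.

The clean method takes a freshly computed value. If the stack ends with an operator followed by its left operand, clean pops both, applies the operator and calls itself on the result, since that result may in turn complete an enclosing operation. Otherwise it pushes the value onto the stack.

At the end of the visiting process, the stack will contain a single value: the result of evaluating the expression.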
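
The stack discipline can be tried out without ANTLR by replaying, by hand, the sequence of listener events for Var1 + (5 * 6). The sketch below is an illustration under stated assumptions: reduce_value is a hypothetical, iterative stand-in for the clean method, and Var1 is assumed to be bound to 2.

```python
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def reduce_value(stack, val):
    """Push val, first combining it with any pending [operator, left operand] pair."""
    while (len(stack) >= 2 and stack[-2] in OPS
           and stack[-1] not in OPS and stack[-1] != "("):
        left = stack.pop()
        op = stack.pop()
        val = OPS[op](left, val)
    stack.append(val)

stack = []
stack.append("+")          # enterExpr: the outer expr contains a '+'
reduce_value(stack, 2)     # exitAtom for Var1 (assumed value 2)
stack.append("(")          # enterAtom for the parenthesised atom
stack.append("*")          # enterMult_expr: 5 * 6
reduce_value(stack, 5)     # exitAtom for 5: nothing to combine yet
reduce_value(stack, 6)     # exitAtom for 6: 5 * 6 is reduced to 30
val = stack.pop()          # exitAtom for the parenthesised atom ...
stack.pop()                # ... drops the "(" marker ...
reduce_value(stack, val)   # ... and 2 + 30 is reduced to 32
```

After the last step the stack holds the single value 32.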