Introduction to Formal Languages and Automata
Preliminaries
Formal Languages are a branch of theoretical Computer Science with applications in numerous areas. Some of them are:
- compilers and compiler design
- complexity theory
- program verification
In the first part of the lecture, we shall focus on compiler design as a case study for introducing Formal Languages. Later on, we shall explore the usage of Formal Languages for complexity theory and program verification.
Compilers and interpreters
A compiler is a software program that takes as input a character stream (usually a text file) containing a program, and outputs executable code in an assembly language appropriate for the computer architecture at hand.
The compiler's input program is written in a high-level programming language (e.g. C/C++, Java, Scala, Haskell, etc.). The compiler itself may be written in C/C++, Java, Scala or Haskell, usually relying on parser generators - APIs/libraries specifically designed for compiler development:
- for C/C++, such APIs are Flex and Bison.
- for Java - ANTLR.
- for Haskell - Happy.
In contrast, an interpreter takes a program as input, possibly together with an input for that program, and runs the program on that input. The interpreter's output is the program's output.
Parsing
Although functionally different, compilers and interpreters share the parsing phase. During parsing, a stream of symbols is converted into a set of complex data structures (the Abstract Syntax Tree being one of them). The roles of parsing are:
- to make sure the program is syntactically correct.
- to serve as a basis for interpreting the program or transforming it into machine code.
Formal Languages are especially helpful for the parsing phase.
Parsing itself has several stages, which we shall illustrate via the following example.
The programming language IMP
The programming language IMP (short for Imperative) is a very simple imperative language, equipped with if, while, assignments, and arithmetic and boolean expressions. We present the syntax of IMP below:
<Var>     ::= String
<AVal>    ::= Number
<BVal>    ::= "True" | "False"
<AExpr>   ::= <Var> | <AVal> | <AExpr> "+" <AExpr> | "(" <AExpr> ")"
<BExpr>   ::= <BVal> | <BExpr> "&&" <BExpr> | <AExpr> ">" <AExpr> | "(" <BExpr> ")"
<Stmt>    ::= <Var> "=" <AExpr> | "{" <Stmt> "}" | "if (" <BExpr> ")" <Stmt> <Stmt> | "while (" <BExpr> ")" <Stmt> | <Stmt> ";" <Stmt>
<VarList> ::= <Var> | <Var> "," <VarList>
<Prog>    ::= "int " <VarList> ";" <Stmt>
The notation used above is called Backus-Naur Form (BNF), and is often used as a semi-formal notation for describing the syntax of languages.
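To preview how such a grammar is used inside a compiler, below is a minimal Haskell sketch (our illustration; the type and constructor names are our own choice, not part of the lecture material) of data types mirroring the IMP syntax. A parsed IMP program would be represented as a value of type Prog:

type Var = String

data AExpr = AVar Var          -- <Var>
           | AVal Integer      -- <AVal>
           | Add AExpr AExpr   -- <AExpr> "+" <AExpr>
           deriving Show

data BExpr = BVal Bool         -- "True" | "False"
           | And BExpr BExpr   -- <BExpr> "&&" <BExpr>
           | Gt AExpr AExpr    -- <AExpr> ">" <AExpr>
           deriving Show

data Stmt = Assign Var AExpr   -- <Var> "=" <AExpr>
          | Block Stmt         -- "{" <Stmt> "}"
          | If BExpr Stmt Stmt -- "if (" <BExpr> ")" <Stmt> <Stmt>
          | While BExpr Stmt   -- "while (" <BExpr> ")" <Stmt>
          | Seq Stmt Stmt      -- <Stmt> ";" <Stmt>
          deriving Show

data Prog = Prog [Var] Stmt    -- "int " <VarList> ";" <Stmt>
          deriving Show

Note that the parenthesised alternatives need no constructors: grouping is already explicit in the tree structure.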
Parsing an IMP program
Consider the following program:
int s,n;
n = 1000;
s = 0;
while (n > 0) {
  s = s + n;
  n = n + (-1);
}
Ignoring details regarding how the input is read, the compiler/interpreter will start off with the following stream of characters, or word:
int s,n;\n n = 1000;\n s = 0;\n while (n > 0) {\t\n s = s + n; \t\n n = n + (-1); \t\n}
Stage 1: Lexical analysis (lexer)
- In the first parsing stage, we would like to identify tokens, or atomic program components. The keyword while or the variable name n are such components.
The result of token identification may be represented as follows, with each token shown between quotes:
'int' ' '(whitespace) 's' ',' 'n' ';' '\n' 'n' ' ' '=' ' ' '1000' ';' '\n' 's' ' ' '=' ' ' '0' ';' '\n' 'while' ' ' '(' ...
- During the same stage, while identifying tokens, the lexer assigns a role (a type) to each token. For instance, int is a type name, while while is a reserved keyword.
- Each token has a unique role (e.g. the language would not allow if as a variable name), and identifying this unique role may be challenging. For instance, if and ifvar are tokens with different roles (keyword vs. variable name) but which start similarly. Lexers use different strategies to disambiguate such cases.
- Some token types are only useful for programming discipline (e.g. newlines and tabs) and some for token separation (e.g. whitespaces). We will ignore them in what follows.
Below, we illustrate the same sequence of tokens (omitting whitespaces, tabs and newlines for legibility), together with their roles, in parentheses:
'int'(DECLARATION) 's'(VAR) ','(COMMA) 'n'(VAR) ';'(SEMICOLON) 'n'(VAR) '='(EQ) '1000'(NUMBER) ';'(SEMICOLON) 's'(VAR) '='(EQ) '0'(NUMBER) ';'(SEMICOLON) 'while'(WHILE) '('(OPEN_PAR) ...
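The token stream can thus be seen as a list of (lexeme, role) pairs. Below is a minimal Haskell sketch (our own names; we write ASSIGN instead of EQ, which would clash with Haskell's Prelude):

data Role = DECLARATION | VAR | COMMA | SEMICOLON | ASSIGN
          | NUMBER | WHILE | OPEN_PAR | CLOSED_PAR
          deriving Show

data Token = Token { lexeme :: String, role :: Role }
  deriving Show

-- the beginning of the stream above:
toks :: [Token]
toks = [ Token "int" DECLARATION, Token "s" VAR
       , Token ","   COMMA,       Token "n" VAR ]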
Stage 2: Syntactic analysis (parsing rules)
The output of the first stage is a sequence of tokens with roles assigned to each one. In the syntactic analysis stage, several operations take place:
- the program structure is validated (and syntactic errors are signalled, if any)
- the abstract syntax tree (a tree-like description of the program) is built
- the symbol table (the list of all encountered variables, functions, etc., together with their scopes) is built
During the Formal Languages lecture, we will only focus on the first operation, which is the foundation for all the rest.
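For instance, validating the fragment n = 1000 ; s = 0 amounts to finding a derivation for it from <Stmt>, using the grammar rules of IMP. A sketch of one such derivation:

<Stmt> => <Stmt> ";" <Stmt>
       => <Var> "=" <AExpr> ";" <Stmt>
       => n "=" <AVal> ";" <Stmt>
       => n "=" 1000 ";" <Var> "=" <AExpr>
       => n "=" 1000 ";" s "=" <AVal>
       => n = 1000 ; s = 0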
While syntactic analysis can be implemented ad hoc, a much more efficient (and widely adopted) approach is to specify a grammar, i.e. a set of syntactic rules for the formation of different language constructs.
Grammars are very similar to the BNF notation shown above; however, they require more care in their definition.
A considerable part of the Formal Languages lecture will be focused on such grammars, their properties, and how to write them in order to avoid ambiguities during parsing.
An example of such an ambiguity is shown below:
if (x>10) if (x>2) x=0 else x=1
This program is faithful to the BNF notation shown above; however, it is not clear whether the else branch belongs to the first or the second if.
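The two possible readings can be made explicit with braces (our illustration):

if (x>10) { if (x>2) x=0 else x=1 }     (the else belongs to the second if)
if (x>10) { if (x>2) x=0 } else x=1     (the else belongs to the first if)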
Consider the following extension (with multiplication) of arithmetic expressions:
<AExpr> ::= <AExpr> "+" <AExpr> | <AExpr> "*" <AExpr> | <Var> | <AVal>
For the following token sequence, determined in the lexical phase:
'x'(Var) '+'(Plus) 'y'(Var) '*'(Mult) '5'(AVal)
It is not clear to the parser how the rules should be matched: two possible abstract syntax trees (or derivations) can be built:
    +
   / \
  x   *
     / \
    y   5
or
      *
     / \
    +   5
   / \
  x   y
Therefore, the grammar is ambiguous: the first tree corresponds to the reading x + (y * 5), the second to (x + y) * 5, and a parser relying on this grammar cannot decide between them.
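A standard remedy (our sketch; these rules are not part of the grammar above) is to stratify the rules by precedence, so that "*" binds tighter than "+":

<AExpr>   ::= <AExpr> "+" <ATerm> | <ATerm>
<ATerm>   ::= <ATerm> "*" <AFactor> | <AFactor>
<AFactor> ::= <Var> | <AVal>

With these rules, the token sequence above has exactly one derivation, corresponding to x + (y * 5); as a side effect, both operators also become left-associative.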
Stage 3: Semantic analysis
During or after the syntactic phase, semantic analysis checks whether certain relations between tokens, which go beyond syntax, are valid (one such check is sketched after the list below). In more advanced programming languages:
- the declared return type of a function (e.g. Integer) must coincide with the type of the value actually returned
- comparisons (e.g. x > 10) must be made between comparable values
- a function must return a value on every execution path (e.g. each if must be matched by an else, with both branches returning an Integer)
- etc.
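As a small illustration (a sketch of ours, reusing the Prog and Stmt types from the Haskell sketch in the IMP section), one semantic check for IMP is that every assigned variable has been declared:

-- every variable assigned in the program body must appear in the
-- declaration list (a simple form of symbol-table checking)
declaredCheck :: Prog -> Bool
declaredCheck (Prog decls body) = go body
  where
    go (Assign v _) = v `elem` decls
    go (Block s)    = go s
    go (If _ s1 s2) = go s1 && go s2
    go (While _ s)  = go s
    go (Seq s1 s2)  = go s1 && go s2

A complete check would also traverse arithmetic and boolean expressions to validate the variables being read; we omit this for brevity.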
The semantic analysis may be implemented separately from parsing in some compilers, and goes beyond the scope of this lecture.
Other stages
Once the program is verified, compilers and interpreters behave differently:
- depending on the language's type system, type inference or type checking may be implemented as a stage distinct from semantic analysis
- compilers (and some interpreters) may perform certain program optimisations (e.g. remove unused program statements)
- compilers will allocate registers for each arithmetic operation in an efficient manner
These stages are outside the scope of our lecture.
The key objective of this lecture
Consider the following programming language, the well-known Lambda Calculus:
<Var>   ::= String
<LExpr> ::= <Var> | "\" <Var> "." <LExpr> | "(" <LExpr> " " <LExpr> ")"
The lexical and syntactic phases (i.e. parsing) follow the very same general ideas already presented. Instead of implementing another parser from scratch, it feels natural to build tools which automate, to some extent, the process of parser development. Much of the Formal Languages and Automata lecture relies on this insight.
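For instance, the grammar above maps directly onto a Haskell data type (a sketch with our own constructor names); a parser, hand-written or generated, turns the character stream into values of this type:

type Var = String

data LExpr = V Var            -- <Var>
           | Lam Var LExpr    -- "\" <Var> "." <LExpr>
           | App LExpr LExpr  -- "(" <LExpr> " " <LExpr> ")"
           deriving Show

-- the parse of "(\x.x y)": the identity function applied to y
example :: LExpr
example = App (Lam "x" (V "x")) (V "y")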
Lecture outline
Formal Languages provide an indispensable tool for automating most steps in parser development, so that each parser need not be written from scratch, but can instead rely on parser generators.
In the first part of the lecture, we shall study:
- regular expressions and automata: they are a means for defining (regular expressions) and recognising (automata) tokens; see the example after this list.
- grammars and push-down automata: they are a means for defining (grammars) and recognising (push-down automata) syntactic structure.
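For instance (our example, in the spirit of the lexing stage above), token types such as NUMBER or VAR can be defined by regular expressions, from which finite automata performing the actual token identification can be built:

NUMBER = [0-9]+           (one or more digits)
VAR    = [a-z][a-z0-9]*   (a letter, followed by letters or digits)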
In the second part of the lecture, we shall examine Computer Science areas, other than parsing, where the above concepts are deployed.
- One such area is Computability and Complexity Theory: we use the concept of language to model problems, and classify problem hardness by examining which computational devices (e.g. finite automata versus push-down automata) can accept a problem. This part of the lecture can be seen as an extension of the Algorithms and Complexity Theory lecture.