Introduction to Formal Languages and Automata
Preliminaries
Formal Languages is a branch of theoretical Computer Science with applications in numerous areas. Some of them are:
- compilers and compiler design
- complexity theory
- program verification
In the first part of the lecture, we shall focus on compiler design as a case study for introducing Formal Languages. Later on, we shall explore the usage of Formal Languages for complexity theory and program verification.
Compilers and interpreters
A compiler is a program that takes as input a character stream (usually a text file) containing a program, and outputs executable code in an assembly language appropriate for the target computer architecture.
The compiler's input program is written in a high-level programming language (e.g. C/C++, Java, Scala, Haskell). The compiler itself may be written in C/C++, Java, Scala or Haskell, usually relying on lexer and parser generators, i.e. tools and libraries specifically designed for compiler development:
- for C/C++, such tools are Flex and Bison.
- for Java - ANTLR.
- for Haskell - Happy.
In contrast, an interpreter takes a program as input, possibly with a program input string, and runs the program on that input. The interpreter output is the program output.
Parsing
Although functionally different, compilers and interpreters share the parsing phase. During parsing, a stream of symbols is converted into a set of data structures, one of which is the Abstract Syntax Tree. Parsing serves two purposes:
- to make sure the program is syntactically correct.
- to provide the basis for interpreting the program or translating it to machine code.
Formal Languages are especially helpful for the parsing phase.
Parsing itself has several stages, which we shall illustrate via the following example.
Parsing a simplistic program
Consider the following program:
def func (x : Integer) : Integer = {
    if (x > 10)
    0
    else if (x > 0)
        1
    else
        2
}
Ignoring details regarding how the input is read, the compiler will start off with the following stream of characters (or word):
def func (x : Integer) : Integer = {\n\tif (x > 10)\n\t0\n\telse if (x > 0) \n\t\t1\n\telse\n\t\t2\n}\n
Stage 1: Lexical analysis - Lexer
- In the first parsing stage, we would like to identify tokens, or atomic program components. The function name func, the keyword def or the variable name x are such components.
The result of token identification may be represented as follows, where each token is shown between quotes:
'def' ' '(whitespace) 'func' ' '(whitespace) '(' 'x' ' ' ':' ' ' 'Integer' ')' ' ' ':' ' ' 'Integer' ' ' '=' ' ' '{' '\n' '\t' 'if' ' ' '(' 'x' ' ' '>' ' ' '10' ')' '\n' '\t' '0' '\n' '\t' 'else' ' ' 'if' ' ' '(' 'x' ' ' '>' ' ' '0' ')' ' ' '\n' '\t' '\t' '1' '\n' '\t' 'else' '\n' '\t' '\t' '2' '\n' '}' '\n'
- During the same stage, while identifying tokens, the lexer assigns a role to each token. For instance, Integer is a type name, while def is a reserved keyword.
- Each token has a unique role (e.g. the language would not allow def as a variable name), and identifying that role may be challenging. For instance, if and ifunction are tokens with different roles (keyword vs function name) which start with the same characters. Lexers use different strategies to disambiguate.
- Some roles are only useful for programming discipline (e.g. newlines and tabs) and some for token separation (e.g. whitespaces). We will ignore them in what follows.
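One widely used disambiguation strategy is longest match: the lexer first reads the longest possible run of identifier characters, and only then checks whether the result is a reserved keyword. The text does not commit to a particular strategy, so the following Python sketch is only an illustration; the names KEYWORDS, read_word and classify are hypothetical.

```python
# Hypothetical keyword table: maps reserved words to their roles.
KEYWORDS = {"def": "FUNC_DEF", "if": "IF", "else": "ELSE"}

def read_word(text, start):
    """Return the longest run of letters, digits and underscores from `start`."""
    end = start
    while end < len(text) and (text[end].isalnum() or text[end] == "_"):
        end += 1
    return text[start:end]

def classify(text, start=0):
    """Read the longest word first, then decide between keyword and NAME."""
    word = read_word(text, start)
    return word, KEYWORDS.get(word, "NAME")

print(classify("if (x > 10)"))    # ('if', 'IF')
print(classify("ifunction(x)"))   # ('ifunction', 'NAME')
```

Because the whole word is read before the keyword table is consulted, ifunction is never mistaken for the keyword if followed by something else.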
Below, we illustrate the same sequence of tokens (omitting whitespaces, tabs and newlines for legibility), together with their roles, in parentheses:
'def'(FUNC_DEF) 'func'(NAME) '('(OPEN_PAR) 'x'(NAME) ':'(TYPE_DEF) 'Integer'(NAME) ')'(CLOSED_PAR) ':'(TYPE_DEF) 'Integer'(NAME) '='(EQUALS) '{'(OPEN_CURL) 'if'(IF) '('(OPEN_PAR) 'x'(NAME) '>'(COMPARISON) '10'(NUMBER) ')'(CLOSED_PAR) '0'(NUMBER) 'else'(ELSE) 'if'(IF) '('(OPEN_PAR) 'x'(NAME) '>'(COMPARISON) '0'(NUMBER) ')'(CLOSED_PAR) '1'(NUMBER) 'else'(ELSE) '2'(NUMBER) '}'(CLOSED_CURL)
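The lexical stage can be sketched in a few lines of Python with the standard re module. The role names are taken from the listing above; the regular expressions, the keyword table and the function name tokenize are assumptions made for this sketch, not the definition of an actual language.

```python
import re

# Role names follow the listing above; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("NUMBER",      r"\d+"),
    ("NAME",        r"[A-Za-z_]\w*"),
    ("TYPE_DEF",    r":"),
    ("COMPARISON",  r"[<>]"),
    ("EQUALS",      r"="),
    ("OPEN_PAR",    r"\("),
    ("CLOSED_PAR",  r"\)"),
    ("OPEN_CURL",   r"\{"),
    ("CLOSED_CURL", r"\}"),
    ("SKIP",        r"\s+"),   # whitespaces, tabs and newlines are dropped
]
KEYWORDS = {"def": "FUNC_DEF", "if": "IF", "else": "ELSE"}
MASTER = re.compile("|".join(f"(?P<{role}>{rx})" for role, rx in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (lexeme, role) pairs, or raise SyntaxError."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        role, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if role == "SKIP":
            continue
        if role == "NAME":                      # keywords look like names
            role = KEYWORDS.get(lexeme, "NAME")
        tokens.append((lexeme, role))
    return tokens

program = ("def func (x : Integer) : Integer = {\n\tif (x > 10)\n\t0\n"
           "\telse if (x > 0)\n\t\t1\n\telse\n\t\t2\n}\n")
for lexeme, role in tokenize(program):
    print(f"'{lexeme}'({role})")
```

Each alternative of the combined pattern is a named group, so the name of the group that matched gives the role of the token directly.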
Stage 2: Syntactic analysis (parsing rules)
The output of the first stage is a sequence of tokens with roles assigned to each one. In the syntactic analysis stage, several operations take place:
- the program structure is validated (and syntactic errors are signalled, if any are found)
- the abstract syntax tree (a tree-like description) of the program is built
- the symbol table (the list of all encountered variables, functions, etc. and their scope) is built
During the Formal Languages lecture, we will only focus on the first operation, which is the foundation for all the rest.
While syntactic analysis can be implemented ad-hoc, a much more efficient (and widely adopted) approach is to specify a grammar, i.e. a set of syntactic rules for the formation of different language constructs.
For functions in our language, such rules may look as follows:
<function> ::= <func_def> EQUALS OPEN_CURL <body> CLOSED_CURL
<func_def> ::= FUNC_DEF NAME OPEN_PAR <param_list> CLOSED_PAR TYPE_DEF NAME
<type> ::= NAME TYPE_DEF NAME
<param_list> ::= <type> | <type> COMMA <param_list>
Notice that in our grammar, we rely on token roles to specify what a correct program should look like. For instance, the last rule states that a list of function parameters is either a NAME followed by ':' followed by a NAME, or a comma-separated sequence of such definitions.
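To make the link between rules and code concrete, here is a minimal Python sketch of a hand-written recognizer for the <func_def>, <type> and <param_list> rules, with one function per rule. It works on a list of token roles and only answers whether the sequence is valid. The function names are hypothetical, and <body> is left out because the rules above do not define it.

```python
def parse_func_def(roles):
    """Return True if the list of token roles matches <func_def>."""
    pos = 0

    def expect(role):
        # Consume one token if it has the expected role.
        nonlocal pos
        if pos < len(roles) and roles[pos] == role:
            pos += 1
            return True
        return False

    def parse_type():
        # <type> ::= NAME TYPE_DEF NAME
        return expect("NAME") and expect("TYPE_DEF") and expect("NAME")

    def parse_param_list():
        # <param_list> ::= <type> | <type> COMMA <param_list>
        if not parse_type():
            return False
        if expect("COMMA"):
            return parse_param_list()
        return True

    # <func_def> ::= FUNC_DEF NAME OPEN_PAR <param_list> CLOSED_PAR TYPE_DEF NAME
    ok = (expect("FUNC_DEF") and expect("NAME") and expect("OPEN_PAR")
          and parse_param_list() and expect("CLOSED_PAR")
          and expect("TYPE_DEF") and expect("NAME"))
    return ok and pos == len(roles)   # no tokens may be left over

header = ["FUNC_DEF", "NAME", "OPEN_PAR", "NAME", "TYPE_DEF", "NAME",
          "CLOSED_PAR", "TYPE_DEF", "NAME"]
print(parse_func_def(header))        # True
print(parse_func_def(header[:-2]))   # False: the return type is missing
```

Parser generators produce code of a similar shape automatically from the grammar, which is why the grammar-based approach scales better than ad-hoc checks.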
A considerable part of the Formal Languages lecture will be focused on such grammars, their properties, and how to write them in order to avoid ambiguities during parsing.
An example of such an ambiguity is shown below:
if (x>10) if (x<2) x+=0 else x+=1
It is not clear whether the else construct belongs to the first or the second if.
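Many languages (C and Java, for instance) settle this by convention: an else belongs to the nearest preceding if that has no else yet. The Python sketch below applies that convention to a simplified token list. The toy grammar in the docstring and the name parse_stmt are assumptions made for illustration, and conditions and simple statements are treated as single tokens.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at `pos`; return (tree, next_pos).

    Assumed toy grammar: stmt ::= 'if' COND stmt ['else' stmt] | SIMPLE
    An 'else' is consumed greedily, so it attaches to the nearest 'if'.
    """
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        then_branch, pos = parse_stmt(tokens, pos + 2)
        else_branch = None
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return tokens[pos], pos + 1

tokens = ["if", "x>10", "if", "x<2", "x+=0", "else", "x+=1"]
tree, _ = parse_stmt(tokens)
print(tree)
# ('if', 'x>10', ('if', 'x<2', 'x+=0', 'x+=1'), None)
```

The printed tree shows the else attached to the inner if, while the outer if has no else branch. The other reading would be equally consistent with the grammar, which is exactly what makes it ambiguous.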
Another example of an ambiguous grammar is:
<expr> ::= <expr> + <expr> | <expr> * <expr> | ATOM
Suppose the syntactic phase receives the following input from the lexical phase:
'x'(ATOM) '+'(OPERATION) 'y'(ATOM) '*'(OPERATION) '5'(ATOM)
It is not clear to the parser how the rules should be matched; two possible abstract syntax trees (or derivations) can be built:
  +
 / \
x   *
   / \
  y   5
or
    *
   / \
  +   5
 / \
x   y
Therefore, the grammar is ambiguous, and the parser relying on it is flawed.
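The ambiguity can be checked mechanically. The Python sketch below enumerates every parse tree that the grammar above allows for a token list, and evaluates each tree for sample values of x and y. The function names and the sample values are assumptions made for illustration; the input is assumed to alternate atoms and operators.

```python
def trees(tokens):
    """All parse trees of `tokens` under
    <expr> ::= <expr> + <expr> | <expr> * <expr> | ATOM."""
    if len(tokens) == 1:
        return [tokens[0]]
    result = []
    # Try every operator position as the root of the tree.
    for i in range(1, len(tokens), 2):
        for left in trees(tokens[:i]):
            for right in trees(tokens[i + 1:]):
                result.append((tokens[i], left, right))
    return result

def evaluate(tree, env):
    """Evaluate a tree; atoms are variables from `env` or integer literals."""
    if isinstance(tree, tuple):
        op, left, right = tree
        a, b = evaluate(left, env), evaluate(right, env)
        return a + b if op == "+" else a * b
    return env[tree] if tree in env else int(tree)

tokens = ["x", "+", "y", "*", "5"]
for t in trees(tokens):
    print(t, "=", evaluate(t, {"x": 1, "y": 2}))
# ('+', 'x', ('*', 'y', '5')) = 11
# ('*', ('+', 'x', 'y'), '5') = 15
```

The two trees yield different values for the same input, so the choice between them is not a cosmetic one.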
Stage 3: Semantic analysis
During or after the syntactic phase, semantic analysis checks whether certain relations between program constructs are valid. For instance:
- the declared return type (e.g. Integer for our function) must coincide with the actual returned type.
- comparisons (e.g. x > 10) must be made between comparable tokens
- the function must return a value (e.g. each if from our function must be matched by an else body, each returning an Integer)
- etc.
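The checks listed above can be sketched in Python as a small recursive function over a tuple-based syntax tree. The tree encoding, the Boolean type for comparisons, and the names SemanticError, type_of and check_function are all assumptions made for this sketch.

```python
class SemanticError(Exception):
    pass

def type_of(node, env):
    """Return the type of an expression node, or raise SemanticError."""
    kind = node[0]
    if kind == "num":
        return "Integer"
    if kind == "var":
        if node[1] not in env:
            raise SemanticError(f"unknown variable {node[1]}")
        return env[node[1]]
    if kind == "cmp":                       # e.g. x > 10
        left, right = type_of(node[1], env), type_of(node[2], env)
        if left != right:
            raise SemanticError(f"cannot compare {left} with {right}")
        return "Boolean"
    if kind == "if":
        _, cond, then_branch, else_branch = node
        if type_of(cond, env) != "Boolean":
            raise SemanticError("condition must be Boolean")
        if else_branch is None:
            raise SemanticError("if without else does not return a value")
        t1, t2 = type_of(then_branch, env), type_of(else_branch, env)
        if t1 != t2:
            raise SemanticError(f"branches have types {t1} and {t2}")
        return t1
    raise SemanticError(f"unknown node {kind}")

def check_function(declared, body, params):
    """Check that the body's type matches the declared return type."""
    actual = type_of(body, params)
    if actual != declared:
        raise SemanticError(f"declared {declared}, but body returns {actual}")
    return True

x = ("var", "x")
body = ("if", ("cmp", x, ("num", 10)), ("num", 0),
        ("if", ("cmp", x, ("num", 0)), ("num", 1), ("num", 2)))
print(check_function("Integer", body, {"x": "Integer"}))   # True
```

Removing the final else branch from body, or declaring a different return type, makes check_function raise a SemanticError.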
The semantic analysis may be implemented separately from parsing in some compilers, and goes beyond the scope of this lecture.
Other stages
Once the program is verified, compilers and interpreters behave differently:
- depending on the program typing, type inference or type-checking may be implemented as a stage different from semantic analysis
- compilers (and some interpreters) may perform certain program optimisations (e.g. remove unused program statements)
- compilers will allocate registers for each arithmetic operation in an efficient manner
These stages are outside the scope of our lecture.
Lecture outline
Formal Languages provide an indispensable tool for automating most steps in parser development: a parser need not be written from scratch, but can instead be produced by a parser generator.
In the first part of the lecture, we shall study:
- regular expressions and automata: they are a means for defining (regular expressions) and computing (automata) tokens.
- grammars and push-down automata: they are means for defining (grammars) and computing (push-down automata) syntactic relations.
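As a preview of the two computation models, the Python sketch below simulates a two-state automaton that accepts the token NUMBER, and uses a stack, in the way a push-down automaton does, to check that parentheses and curly braces are properly nested. The state names and function names are hypothetical.

```python
def accepts_number(word):
    """Simulate a two-state automaton for NUMBER (one or more digits)."""
    state = "start"
    for ch in word:
        if ch in "0123456789":
            state = "digits"      # start -> digits, digits -> digits
        else:
            return False          # no transition defined: reject
    return state == "digits"      # "digits" is the only accepting state

def balanced(word):
    """Check nesting of ( ) and { } with a stack, ignoring other characters."""
    closing = {")": "(", "}": "{"}
    stack = []
    for ch in word:
        if ch in "({":
            stack.append(ch)                  # push an opening symbol
        elif ch in closing:
            if not stack or stack.pop() != closing[ch]:
                return False                  # mismatched or unopened
    return not stack                          # everything must be closed

print(accepts_number("10"))                 # True
print(accepts_number("1x"))                 # False
print(balanced("{ if (x > 10) 0 }"))        # True
print(balanced("{ if (x > 10 } 0 )"))       # False
```

The first function needs only a fixed, finite number of states, whereas the second needs unbounded memory to track arbitrarily deep nesting; this is the reason syntactic structure calls for a more powerful model than tokens do.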
In the second part of the lecture, we shall examine Computer Science areas other than parsing where the above concepts are used.
- One such area is Computability and Complexity Theory: we use the concept of language to model problems, and classify problem hardness by examining which means of computation (e.g. automata versus push-down automata) can accept a problem. This part of the lecture can be seen as an extension to the Algorithms and Complexity Theory lecture.