Introduction to Formal Languages and Automata
Preliminaries
Formal Languages are a branch of theoretical Computer Science with applications in numerous areas. Some of them are:
- compilers and compiler design
- complexity theory
- program verification
In the first part of the lecture, we shall focus on compiler design as a case study for introducing Formal Languages. Later on, we shall explore the usage of Formal Languages for complexity theory and program verification.
Compilers and interpreters
A compiler is a software program that takes as input a character stream (usually a text file) containing a program, and outputs executable code in an assembly language appropriate for the computer architecture at hand.
The compiler's input program is written in a high-level programming language (e.g. C/C++, Java, Scala, Haskell, etc.). The compiler itself may be written in C/C++, Java, Scala or Haskell, usually relying on parser generators - APIs/libraries specifically designed for compiler development:
- for C/C++, such APIs are Flex and Bison.
- for Java - ANTLR.
- for Haskell - Happy.
In contrast, an interpreter takes a program as input, possibly together with an input for that program, and runs the program on that input. The interpreter's output is the program's output.
Parsing
Although functionally different, compilers and interpreters share the parsing phase. During parsing, a stream of symbols is converted into a set of complex data structures (the Abstract Syntax Tree being one of them). The roles of parsing are:
- to make sure the program is syntactically correct.
- to serve as a basis for interpreting the program or transforming it into machine code.
Formal Languages are especially helpful for the parsing phase.
Parsing itself has several stages, which we shall illustrate via the following example.
The programming language IMP
The programming language IMP (short for Imperative) is a very simple imperative language, equipped with if, while, assignments, and arithmetic and boolean expressions. We present the syntax of IMP below:
<Var>     ::= String
<AVal>    ::= Number
<BVal>    ::= "True" | "False"
<AExpr>   ::= <Var> | <AVal> | <AExpr> "+" <AExpr> | "(" <AExpr> ")"
<BExpr>   ::= <BVal> | <BExpr> "&&" <BExpr> | <AExpr> ">" <AExpr> | "(" <BExpr> ")"
<Stmt>    ::= <Var> "=" <AExpr> | "{" <Stmt> "}" | "if (" <BExpr> ")" <Stmt> <Stmt> | "while (" <BExpr> ")" <Stmt> | <Stmt> ";" <Stmt>
<VarList> ::= <Var> | <Var> "," <VarList>
<Prog>    ::= "int " <VarList> ";" <Stmt>
The notation used above is called Backus-Naur Form (BNF), and is often used as a semi-formal notation for describing the syntax of languages.
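To preview how such a grammar is used inside a compiler, below is a minimal Haskell sketch (our illustration; the type and constructor names are our own choice, not part of the lecture material) of data types mirroring the IMP syntax. A parsed IMP program would be represented as a value of type Prog:

type Var = String

data AExpr = AVar Var          -- <Var>
           | AVal Integer      -- <AVal>
           | Add AExpr AExpr   -- <AExpr> "+" <AExpr>
           deriving Show

data BExpr = BVal Bool         -- "True" | "False"
           | And BExpr BExpr   -- <BExpr> "&&" <BExpr>
           | Gt AExpr AExpr    -- <AExpr> ">" <AExpr>
           deriving Show

data Stmt = Assign Var AExpr   -- <Var> "=" <AExpr>
          | Block Stmt         -- "{" <Stmt> "}"
          | If BExpr Stmt Stmt -- "if (" <BExpr> ")" <Stmt> <Stmt>
          | While BExpr Stmt   -- "while (" <BExpr> ")" <Stmt>
          | Seq Stmt Stmt      -- <Stmt> ";" <Stmt>
          deriving Show

data Prog = Prog [Var] Stmt    -- "int " <VarList> ";" <Stmt>
          deriving Show

Note that the parenthesised alternatives need no constructors: grouping is already explicit in the tree structure.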
Parsing an IMP program
Consider the following program:
int s,n;
n = 1000;
s = 0;
while (n > 0) {
  s = s + n;
  n = n + (-1);
}
Ignoring details regarding how the input is read, the compiler/interpreter will start off with the following stream of characters, or word:
int s,n;\n n = 1000;\n s = 0;\n while (n > 0) {\t\n s = s + n; \t\n n = n + (-1); \t\n}
Stage 1: Lexical analysis (lexer)
- In the first parsing stage, we would like to identify tokens, or atomic program components. The keyword while or the variable name n are such components.
The result of token identification may be represented as follows, with each token shown between quotes:
'int' ' '(whitespace) 's' ',' 'n' ';' '\n' 'n' ' ' '=' ' ' '1000' ';' '\n' 's' ' ' '=' ' ' '0' ';' '\n' 'while' ' ' '(' ...
- During the same stage, while identifying tokens, the lexer assigns a role (a type) to each token. For instance, int is a type name, while while is a reserved keyword.
- Each token has a unique role (e.g. the language would not allow if as a variable name), and identifying this unique role may be challenging. For instance, if and ifvar are tokens with different roles (keyword vs. variable name) but which start similarly. Lexers use different strategies to disambiguate such cases.
- Some token types are only useful for programming discipline (e.g. newlines and tabs) and some for token separation (e.g. whitespaces). We will ignore them in what follows.
Below, we illustrate the same sequence of tokens (omitting whitespaces, tabs and newlines for legibility), together with their roles, in parentheses:
'int'(DECLARATION) 's'(VAR) ','(COMMA) 'n'(VAR) ';'(SEMICOLON) 'n'(VAR) '='(EQ) '1000'(NUMBER) ';'(SEMICOLON) 's'(VAR) '='(EQ) '0'(NUMBER) ';'(SEMICOLON) 'while'(WHILE) '('(OPEN_PAR) ...
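The token stream can thus be seen as a list of (lexeme, role) pairs. Below is a minimal Haskell sketch (our own names; we write ASSIGN instead of EQ, which would clash with Haskell's Prelude):

data Role = DECLARATION | VAR | COMMA | SEMICOLON | ASSIGN
          | NUMBER | WHILE | OPEN_PAR | CLOSED_PAR
          deriving Show

data Token = Token { lexeme :: String, role :: Role }
  deriving Show

-- the beginning of the stream above:
toks :: [Token]
toks = [ Token "int" DECLARATION, Token "s" VAR
       , Token ","   COMMA,       Token "n" VAR ]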
Stage 2: Syntactic analysis (parsing rules)
The output of the first stage is a sequence of tokens with roles assigned to each one. In the syntactic analysis stage, several operations take place:
- the program structure is validated (and syntactic errors are signalled, if any)
- the abstract syntax tree (a tree-like description of the program) is built
- the symbol table (the list of all encountered variables, functions, etc., together with their scopes) is built
During the Formal Languages lecture, we will only focus on the first operation, which is the foundation for all the rest.
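For instance, validating the fragment n = 1000 ; s = 0 amounts to finding a derivation for it from <Stmt>, using the grammar rules of IMP. A sketch of one such derivation:

<Stmt> => <Stmt> ";" <Stmt>
       => <Var> "=" <AExpr> ";" <Stmt>
       => n "=" <AVal> ";" <Stmt>
       => n "=" 1000 ";" <Var> "=" <AExpr>
       => n "=" 1000 ";" s "=" <AVal>
       => n = 1000 ; s = 0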
While syntactic analysis can be implemented ad hoc, a much more efficient (and widely adopted) approach is to specify a grammar, i.e. a set of syntactic rules for the formation of different language constructs.
Grammars are very similar to the BNF notation shown above; however, they require more care in their definition.
A considerable part of the Formal Languages lecture will be focused on such grammars, their properties, and how to write them in order to avoid ambiguities during parsing.
An example of such an ambiguity is shown below:
if (x>10) if (x>2) x=0 else x=1
This program is faithful to the BNF notation shown above; however, it is not clear whether the else branch belongs to the first or the second if.
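The two possible readings can be made explicit with braces (our illustration):

if (x>10) { if (x>2) x=0 else x=1 }     (the else belongs to the second if)
if (x>10) { if (x>2) x=0 } else x=1     (the else belongs to the first if)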
Consider the following extension (with multiplication) of arithmetic expressions:
<AExpr> ::= <AExpr> "+" <AExpr> | <AExpr> "*" <AExpr> | <Var> | <AVal>
For the following token sequence, determined in the lexical phase:
'x'(Var) '+'(Plus) 'y'(Var) '*'(Mult) '5'(AVal)
It is not clear to the parser how the rules should be matched: two possible abstract syntax trees (or derivations) can be built:
    +
   / \
  x   *
     / \
    y   5
or
      *
     / \
    +   5
   / \
  x   y
Therefore, the grammar is ambiguous: the first tree corresponds to the reading x + (y * 5), the second to (x + y) * 5, and a parser relying on this grammar cannot decide between them.
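A standard remedy (our sketch; these rules are not part of the grammar above) is to stratify the rules by precedence, so that "*" binds tighter than "+":

<AExpr>   ::= <AExpr> "+" <ATerm> | <ATerm>
<ATerm>   ::= <ATerm> "*" <AFactor> | <AFactor>
<AFactor> ::= <Var> | <AVal>

With these rules, the token sequence above has exactly one derivation, corresponding to x + (y * 5); as a side effect, both operators also become left-associative.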
Stage 3: Semantic analysis
During or after the syntactic phase, semantic analysis checks whether certain relations between tokens, which go beyond syntax, are valid (one such check is sketched after the list below). In more advanced programming languages:
- the declared return type of a function (e.g. Integer) must coincide with the type of the value actually returned
- comparisons (e.g. x > 10) must be made between comparable values
- a function must return a value on every execution path (e.g. each if must be matched by an else, with both branches returning an Integer)
- etc.
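As a small illustration (a sketch of ours, reusing the Prog and Stmt types from the Haskell sketch in the IMP section), one semantic check for IMP is that every assigned variable has been declared:

-- every variable assigned in the program body must appear in the
-- declaration list (a simple form of symbol-table checking)
declaredCheck :: Prog -> Bool
declaredCheck (Prog decls body) = go body
  where
    go (Assign v _) = v `elem` decls
    go (Block s)    = go s
    go (If _ s1 s2) = go s1 && go s2
    go (While _ s)  = go s
    go (Seq s1 s2)  = go s1 && go s2

A complete check would also traverse arithmetic and boolean expressions to validate the variables being read; we omit this for brevity.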
The semantic analysis may be implemented separately from parsing in some compilers, and goes beyond the scope of this lecture.
Other stages
Once the program is verified, compilers and interpreters behave differently:
- depending on the language's type system, type inference or type checking may be implemented as a stage distinct from semantic analysis
- compilers (and some interpreters) may perform certain program optimisations (e.g. remove unused program statements)
- compilers will allocate registers for each arithmetic operation in an efficient manner
These stages are outside the scope of our lecture.
The key objective of this lecture
Consider the following programming language, the well-known Lambda Calculus:
<Var>   ::= String
<LExpr> ::= <Var> | "\" <Var> "." <LExpr> | "(" <LExpr> " " <LExpr> ")"
The lexical and syntactic phases (i.e. parsing) follow the very same general ideas already presented. Instead of implementing another parser from scratch, it feels natural to build tools which automate, to some extent, the process of parser development. Much of the Formal Languages and Automata lecture relies on this insight.
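For instance, the grammar above maps directly onto a Haskell data type (a sketch with our own constructor names); a parser, hand-written or generated, turns the character stream into values of this type:

type Var = String

data LExpr = V Var            -- <Var>
           | Lam Var LExpr    -- "\" <Var> "." <LExpr>
           | App LExpr LExpr  -- "(" <LExpr> " " <LExpr> ")"
           deriving Show

-- the parse of "(\x.x y)": the identity function applied to y
example :: LExpr
example = App (Lam "x" (V "x")) (V "y")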
Lecture outline
Formal Languages provide an indispensable tool for automating most steps in parser development, so that each parser need not be written from scratch, but can instead rely on parser generators.
In the first part of the lecture, we shall study:
- regular expressions and automata: they are a means for defining (regular expressions) and recognising (automata) tokens; see the example after this list.
- grammars and push-down automata: they are a means for defining (grammars) and recognising (push-down automata) syntactic structure.
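For instance (our example, in the spirit of the lexing stage above), token types such as NUMBER or VAR can be defined by regular expressions, from which finite automata performing the actual token identification can be built:

NUMBER = [0-9]+           (one or more digits)
VAR    = [a-z][a-z0-9]*   (a letter, followed by letters or digits)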
In the second part of the lecture, we shall examine Computer Science areas, other than parsing, where the above concepts are deployed.
- One such area is Computability and Complexity Theory: we use the concept of language to model problems, and classify problem hardness by examining which computational devices (e.g. finite automata versus push-down automata) can accept a problem. This part of the lecture can be seen as an extension of the Algorithms and Complexity Theory lecture.