===== Introduction to Formal Languages and Automata =====

==== Preliminaries ====

Formal Languages are a branch of theoretical Computer Science with applications in numerous areas, among them:
  - compilers and compiler design
  - complexity theory
  - program verification

In the first part of the lecture, we shall focus on **compiler design** as a case study for introducing Formal Languages. Later on, we shall explore the usage of Formal Languages in complexity theory and program verification.

==== Compilers and interpreters ====

A **compiler** is a software program that takes as input a //character stream// (usually a text file) containing a **//program//**, and outputs **executable code** in an **assembly language** appropriate for the computer architecture at hand. The compiler's **input program** is written in a high-level programming language (e.g. C/C++, Java, Scala, Haskell, etc.). The **compiler itself** may be written in C/C++, Java, Scala or Haskell, usually relying on **parser generators** - tools/libraries specifically designed for compiler development:
  * for C/C++, such tools are Flex and Bison,
  * for Java - ANTLR,
  * for Haskell - Happy.

In contrast, an **interpreter** takes a program as input, possibly together with an input string for that program, and **runs the program on that input**. The output of the interpreter is the output of the program.

==== Parsing ====

Although functionally different, **compilers and interpreters share //the parsing phase//**. During **parsing**, a stream of symbols is converted into a set of data structures (the **Abstract Syntax Tree** being one of them). Its roles are:
  * to make sure the program is syntactically correct,
  * to serve as the basis for **interpreting** the program or **transforming** it into machine code.

**Formal Languages** are especially helpful for the **parsing phase**.

/* Parsing itself has several stages, which we shall illustrate via the following example.

===== The programming language IMP =====

The programming language IMP (short for //Imperative//) is a very simple imperative language, equipped with ''if'', ''while'', assignments, arithmetic and boolean expressions. We present the syntax of IMP below:

<code>
<var>  ::= String
<num>  ::= Number
<bool> ::= "True" | "False"
<aexp> ::= <var> | <num> | <aexp> "+" <aexp> | "(" <aexp> ")"
<bexp> ::= <bool> | <bexp> "&&" <bexp> | <aexp> ">" <aexp> | "(" <bexp> ")"
<stmt> ::= <var> "=" <aexp> | "{" <stmt> "}" | "if (" <bexp> ")" <stmt>
         | "while (" <bexp> ")" <stmt> | <stmt> ";" <stmt>
<list> ::= <var> | <var> "," <list>
<prog> ::= "int " <list> ";" <stmt>
</code>

The notation used above is called **Backus-Naur form** (BNF), and is often used as a semi-formal notation for describing the syntax of languages.

===== Parsing an IMP program =====

Consider the following program:

<code>
int s,n;
n = 1000;
s = 0;
while (n > 0) {
  s = s + n;
  n = n + (-1);
}
</code>

Ignoring details regarding how the input is read, the compiler/interpreter will start off with the following stream of characters, or **word**:

<code>
int s,n;\n n = 1000;\n s = 0;\n while (n > 0) {\t\n s = s + n; \t\n n = n + (-1); \t\n}
</code>

==== Stage 1: Lexical analysis - Lexer ====

  * In the first parsing stage, we would like to identify **tokens**, i.e. atomic program components. The keyword ''while'' or the variable name ''n'' are such components. The result of token identification may be represented as follows, where each token is shown between quotes:

<code>
'int' ' '(whitespace) 's' ',' 'n' ';' '\n' 'n' '=' '1000' ';' '\n' ' ' 's' ' ' '=' ' ' '0' ';' '\n' 'while' '(' ...
</code>

  * During the same stage, while identifying **tokens**, the lexer assigns a //type// to each token. For instance, ''int'' is a type name, while ''while'' is a reserved keyword.
  * Each token must receive a unique type (e.g. the language would not allow ''if'' as a variable name), and identifying that unique role may be challenging. For instance, ''if'' and ''ifvar'' are tokens with different roles (keyword vs. variable name) which start in the same way. Lexers use different strategies to disambiguate such cases.
  * Some token //types// are only useful for programming discipline (e.g. newlines and tabs) and some for token separation (e.g. whitespaces). We will ignore them in what follows.

Below, we illustrate the same sequence of tokens (omitting whitespaces, tabs and newlines for legibility), together with their roles, in parentheses:

<code>
'int'(DECLARATION) 's'(VAR) ','(COMMA) 'n'(VAR) ';'(SEMICOLON) 'n'(VAR) '='(EQ) '1000'(NUMBER) ';'(SEMICOLON) 's'(VAR) '='(EQ) '0'(NUMBER) ';'(SEMICOLON) 'while'(WHILE) '('(OPEN_PAR) ...
</code>
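To make this stage concrete, below is a minimal tokenizer sketch in Python, using the classical technique of one regular expression per token type, combined into a single alternation. The token names follow the listing above; the ''lex'' function itself and its error handling are our own illustrative choices, not the API of any particular lexer generator.

<code python>
import re

# One regular expression per token type, tried in order; keyword patterns
# come before VAR so that 'while' is a keyword, not a variable name.
# Token names (DECLARATION, VAR, ...) mirror the listing above.
TOKEN_SPEC = [
    ("DECLARATION", r"\bint\b"),
    ("WHILE",       r"\bwhile\b"),
    ("IF",          r"\bif\b"),
    ("NUMBER",      r"\d+"),
    ("VAR",         r"[A-Za-z_]\w*"),
    ("EQ",          r"="),
    ("SEMICOLON",   r";"),
    ("COMMA",       r","),
    ("OPEN_PAR",    r"\("),
    ("CLOSE_PAR",   r"\)"),
    ("WS",          r"[ \t\n]+"),   # separators: recognised, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Yield (type, lexeme) pairs, skipping whitespace tokens."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(lex("int s,n; n = 1000;")))
# [('DECLARATION', 'int'), ('VAR', 's'), ('COMMA', ','), ('VAR', 'n'),
#  ('SEMICOLON', ';'), ('VAR', 'n'), ('EQ', '='), ('NUMBER', '1000'), ('SEMICOLON', ';')]
</code>

Note how the word boundaries (''\b'') in the keyword patterns implement one simple disambiguation strategy: ''ifvar'' fails to match the ''IF'' pattern and is classified as a ''VAR'', while ''if'' alone is classified as a keyword.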
==== Stage 2: Syntactic analysis (parsing rules) ====

The output of the first stage is a sequence of tokens, with a role assigned to each of them. In the syntactic analysis stage, several operations take place:
  * the program structure is validated (and syntactic errors are signalled, if any),
  * the **abstract syntax tree** (a tree-like description of the program) is built,
  * the **symbol table** (the list of all encountered variables, functions, etc., together with their scope) is built.

During the Formal Languages lecture, we will only focus on the **first** operation, which is the cornerstone for all the rest.

While **syntactic analysis** can be implemented **ad hoc**, a much more efficient (and widely adopted) approach is to specify **a grammar**, i.e. a set of **syntactic rules** for the formation of the different language constructs. A grammar is very similar to the BNF notation shown above; however, grammars require more care in their definition. A considerable part of the Formal Languages lecture will focus on such grammars, their properties, and how to write them in order to avoid ambiguities during parsing. An example of such an ambiguity is shown below:

<code>
if (x>10) if (x>2) x=0 else x=1
</code>

This program is faithful to the BNF notation shown above; however, it is not clear whether the ''else'' branch belongs to the first or to the second ''if''.

Consider the following extension of arithmetic expressions with multiplication:

<code>
<aexp> ::= <aexp> + <aexp> | <aexp> * <aexp> | <var>
</code>

For the following input, determined in the lexical phase:

<code>
'x'(Var) '+'(Plus) 'y'(Var) '*'(Mult) '5'(Var)
</code>

it is not clear to the parser how the rules should be matched: two different **abstract syntax trees** (or derivations) can be built:

<code>
      +                    *
     / \                  / \
    x   *      or        +   5
       / \              / \
      y   5            x   y
</code>

Therefore, the grammar is ambiguous, and a parser relying on it is flawed.
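A standard way to eliminate this particular ambiguity is to //stratify// the grammar into precedence levels, one rule per level. The sketch below is our own illustration in Python (with tokens simplified to plain strings rather than the (type, lexeme) pairs above): a recursive-descent parser for a stratified, unambiguous version of the grammar, in which ''*'' binds tighter than ''+''.

<code python>
# Recursive-descent sketch for the stratified grammar (our own illustration):
#   <expr> ::= <term> ( "+" <term> )*
#   <term> ::= <atom> ( "*" <atom> )*
#   <atom> ::= Var | Number
# '*' binds tighter than '+' because <term> sits one level below <expr>.

def parse_expr(tokens, pos=0):
    """Parse an <expr>; return (ast, next_pos). AST nodes are nested tuples."""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)
    return node, pos

def parse_term(tokens, pos):
    node, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_atom(tokens, pos + 1)
        node = ("*", node, right)
    return node, pos

def parse_atom(tokens, pos):
    return tokens[pos], pos + 1   # a Var or Number leaf

ast, _ = parse_expr(["x", "+", "y", "*", "5"])
print(ast)   # ('+', 'x', ('*', 'y', '5')) -- the left tree above
</code>

With this stratification, only the left tree can ever be derived; making grammars unambiguous (or resolving their ambiguities inside the parser) is a central theme of this lecture.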
==== Stage 3: Semantic analysis ====

During or after the syntactic phase, **semantic analysis** checks whether certain syntactic relations between tokens are valid. In more advanced programming languages:
  * the **declared** return type of a function (e.g. ''Integer'') must coincide with the type of the value it actually returns,
  * comparisons (e.g. ''x > 10'') must be made between **comparable** tokens,
  * every execution path of a function must return a value (e.g. each ''if'' must be matched by an ''else'' branch, each returning an ''Integer''),
  * etc.

In some compilers, semantic analysis may be implemented separately from parsing; it goes beyond the scope of this lecture.

==== Other stages ====

Once the program is verified, compilers and interpreters behave differently:
  * depending on the typing discipline of the language, **type inference** or **type checking** may be implemented as a stage separate from semantic analysis,
  * compilers (and some interpreters) may perform certain program optimisations (e.g. remove unused program statements),
  * compilers will allocate registers for each arithmetic operation in an efficient manner.

These stages are outside the scope of our lecture.

===== The key objective of this lecture =====

Consider the following programming language, the well-known Lambda Calculus:

<code>
<var>  ::= String
<expr> ::= <var> | "\" <var> "." <expr> | "(" <expr> " " <expr> ")"
</code>

The lexical and syntactic phases (i.e. //parsing//) follow the very same general ideas already presented. Instead of implementing yet another parser from scratch, it is only natural to build tools which automate, to some extent, the process of parser development. Much of the Formal Languages and Automata lecture relies on this insight. */

===== Lecture outline =====

Formal Languages provide an indispensable tool for **automating** most steps of **parser development**, so that each parser need not be written from scratch, but can instead rely on **parser generators**. In the first part of the lecture, we shall study:
  * **regular expressions** and **finite automata**: a means for //defining// (regular expressions) and //computing// (automata) tokens - see the sketch at the end of this section,
  * **grammars** and **push-down automata**: a means for //defining// (grammars) and //computing// (push-down automata) syntactic relations.

In the second part of the lecture, we shall examine areas of Computer Science other than parsing where the above concepts are deployed.
  * One such area is **Computability and Complexity Theory**: we use the concept of **language** to model problems, and classify problem hardness by examining which computational means (e.g. finite automata versus push-down automata) can accept a problem. This part of the lecture can be seen as an extension of the Algorithms and Complexity Theory lecture.
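As a first taste of automata as //computing// devices for tokens, the Python sketch below simulates a deterministic finite automaton for the regular expression ''[0-9]+'' (the ''NUMBER'' tokens of IMP). The state names and the ''accepts'' helper are our own illustrative choices.

<code python>
# A deterministic finite automaton (DFA) for the regular expression [0-9]+.
# Three states: START (nothing read), DIGITS (one or more digits read),
# SINK (the input can no longer be a valid number).
START, DIGITS, SINK = "start", "digits", "sink"
ACCEPTING = {DIGITS}

def step(state, ch):
    """The DFA transition function: one state, one input character."""
    if ch.isdigit() and state in (START, DIGITS):
        return DIGITS
    return SINK

def accepts(word):
    """Run the DFA over the whole word; accept iff it ends in an accepting state."""
    state = START
    for ch in word:
        state = step(state, ch)
    return state in ACCEPTING

print(accepts("1000"))   # True  -- a valid NUMBER token
print(accepts("10a0"))   # False -- rejected by the automaton
print(accepts(""))       # False -- [0-9]+ requires at least one digit
</code>

The lecture will make this correspondence precise: every regular expression can be translated into such an automaton, and lexer generators such as Flex essentially perform this translation automatically.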