Differences

This shows you the differences between two versions of the page.

--- lfa:introduction [2017/10/02 14:38]
pdmatei created
+++ lfa:introduction [2020/10/19 15:29] (current)
pdmatei
@@ Line 12: / Line 12: @@
 ==== Compilers and interpreters ====
-A **compiler** is a software program taking as input a //character stream// (usually as a text file), containing a **//program//** and outputs **executable code**, in an **assembly language** appropriate for the computer architecture at hand.
+A **compiler** is a software program that takes as input a //character stream// (usually as a text file), containing a **//program//** and outputs **executable code**, in an **assembly language** appropriate for the computer architecture at hand.
 The compiler's **input program** is written in a high-level programming language (e.g. C/C++, Java, Scala, Haskell). The **compiler itself** may be written in C/C++, Java, Scala or Haskell, usually relying on **parser generators** - APIs/libraries specifically designed for compiler development:
@@ Line 29: / Line 29: @@
 **Formal Languages** are especially helpful for the **parsing phase**.
+/*
 Parsing itself has several stages, which we shall illustrate via the following example.
+===== The programming language IMP =====
+The programming language IMP (short for //Imperative//) is a very simple imperative language, equipped with ''if'', ''while'', assignments, arithmetic and boolean expressions. We present the syntax of IMP below:
+<code>
+<Var> ::= String
+<AVal> ::= Number
+<BVal> ::= "True" | "False"
+<AExpr> ::= <Var> | <AVal> | <AExpr> "+" <AExpr> | "(" <AExpr> ")"
+<BExpr> ::= <BVal> | <BExpr> "&&" <BExpr> | <AExpr> ">" <AExpr> | "(" <BExpr> ")"
+<Stmt> ::= <Var> "=" <AExpr> |
+           "{" <Stmt> "}" |
+           "if (" <BExpr> ")" <Stmt> <Stmt> |
+           "while (" <BExpr> ")" <Stmt> |
+           <Stmt> ";" <Stmt>
+<VarList> ::= <Var> | <Var> "," <VarList>
+<Prog> ::= "int "<VarList> ";" <Stmt>
+</code>
+The notation used above is called Backus-Naur form (BNF), and is often used as a semi-formal notation for describing the syntax of different kinds of languages.
-===== Parsing a simplistic program =====
+===== Parsing an IMP program =====
 Consider the following program:
 <code>
-def func (x : Integer) : Integer = {
+int s,n;
-    if (x > 10)
+n = 1000;
+s = 0;
-    else if (x > 0)
+while (n > 0) {
+   s = s + n;
-         else 2
+   n = n + (-1);
 }
 </code>
-Ignoring details regarding how the input is read, the compiler will start of with the following stream of characters or **word** :
+Ignoring details regarding how the input is read, the compiler/interpreter will start off with the following stream of characters, or **word**:
 <code>
-def func (x : Integer) : Integer = {\n\tif (x > 10)\n\t0\n\telse if (x > 0) \n\t\t1\n\t\t2\n}\n
+int s,n;\n n = 1000;\n s = 0;\n while (n > 0) {\t\n s = s + n; \t\n n = n + (-1); \t\n}
 </code>
 ==== Stage 1: Lexical analysis - Lexer ====
-  * In the first parsing stage, we would like to identify **tokens**, or atomic program components. The function name ''func'', the keyword ''def'' or the variable name ''x'' are such components.
+  * In the first parsing stage, we would like to identify **tokens**, or atomic program components. The keyword ''while'' or the variable name ''n'' are such components.
 The result of token identification may be represented as follows, where each token is shown on one line, between quotes:
 <code>
-'def'
+'int'
 ' '(whitespace)
-'func'
+'s'
-' '(whitespace)
+','
-'('
+'n'
-'x'
+'\n'
-' '
+'n'
-':'
-' '
-'Integer'
-')'
-':'
-'Integer'
-' '
 '='
-' '
+'1000'
-'{'
 '\n'
-'\t'
-'if'
 ' '
-'('
+'s'
-'x'
 ' '
-'>'
+'='
 ' '
-'10'
-')'
-'\n'
-'\t'
 '0'
+';'
 '\n'
-'\t'
+'while'
-'else'
-' '
-'if'
-' '
 '('
-'x'
+...
-' '
-'>'
-' '
-'0'
-')'
-' '
-'\n'
-'\t'
-'\t'
-'1'
-'\n'
-'\t'
-'\t'
-'2'
-'\n'
-'}'
-'\n'
 </code>
-  * During the same stage, while identifying **tokens** a parser assigns a //role// for each token. For instance, ''Integer'' is a type name, while ''def'' is a reserved keyword.
+  * During the same stage, while identifying **tokens**, a parser assigns a //type// to each token. For instance, ''int'' is a type name, while ''while'' is a reserved keyword.
-    * Each token can have a unique role (e.g. the language would not allow ''def'' as a variable name), and identifying the unique role may be challenging. For instance ''if'' and ''ifunction'' are tokens with different roles (keyword vs function name) but which start similarly. Parsers use different strategies to disambiguate.
+    * Each token has a unique type (e.g. the language would not allow ''if'' as a variable name), and identifying that type may be challenging. For instance, ''if'' and ''ifvar'' are tokens with different types (keyword vs variable name) which start with the same characters. Parsers use different strategies to disambiguate such cases.
-    * Some //roles// are only useful for programming discipline (e.g. newlines and tabs) and some for token separation (e.g. whitespaces). We will ignore them in what follows.
+    * Some //types// are only useful for programming discipline (e.g. newlines and tabs) and some for token separation (e.g. whitespaces). We will ignore them in what follows.
 Below, we illustrate the same sequence of tokens (omitting whitespaces, tabs and newlines for legibility), together with their type, in parentheses:
 <code>
-'def'(FUNC_DEF)
+'int'(DECLARATION)
-'func'(NAME)
+' '(WS)
-'('(OPEN_PAR)
+'s'(VAR)
-'x'(NAME)
+','(COMMA)
-':'(TYPE_DEF)
+'n'(VAR)
-'Integer'(NAME)
+'\n'(NEWLINE)
-')'(CLOSED_PAR)
+'n'(VAR)
-':'(TYPE_DEF)
+'='(EQ)
-'Integer'(NAME)
+'1000'(NUMBER)
-'='(EQUALS)
+'\n'(NEWLINE)
-'{'(OPEN_CURL)
+' '(WS)
-'if'(IF)
+'s'(VAR)
-'('(OPEN_PAR)
+' '(WS)
-'x'(NAME)
+'='(EQ)
-'>'(COMPARISON)
+' '(WS)
-'10'(NUMBER)
-')'(CLOSED_PAR)
 '0'(NUMBER)
-'else'(ELSE)
+';'(SEMICOLON)
-'if'(IF)
+'\n'(NEWLINE)
+'while'(WHILE)
 '('(OPEN_PAR)
-'x'(NAME)
+...
-'>'(COMPARISON)
-'0'(NUMBER)
-')'(CLOSED_PAR)
-'1'(NUMBER)
-'else'(ELSE)
-'2'(NUMBER)
-'}'(CLOSED_CURL)
 </code>
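The lexical stage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's implementation: the function ''tokenize'', the tables ''KEYWORDS'' and ''TOKEN_SPEC'', and the token type names that do not appear in the listing above (e.g. ''SEMICOLON'', ''GT'', ''PLUS'', ''AND'') are hypothetical. Numbers are assumed to carry an optional leading minus sign, since IMP has no subtraction operator.

```python
import re

# Reserved words and their token types (names follow the listing above).
KEYWORDS = {"int": "DECLARATION", "while": "WHILE", "if": "IF"}

# One regular expression per token type. Order matters only where
# two patterns could match at the same position.
TOKEN_SPEC = [
    ("NUMBER", r"-?\d+"),
    ("NAME", r"[A-Za-z_]\w*"),  # keywords and variables, told apart below
    ("AND", r"&&"),
    ("EQ", r"="),
    ("GT", r">"),
    ("PLUS", r"\+"),
    ("COMMA", r","),
    ("SEMICOLON", r";"),
    ("OPEN_PAR", r"\("),
    ("CLOSED_PAR", r"\)"),
    ("OPEN_CURL", r"\{"),
    ("CLOSED_CURL", r"\}"),
    ("NEWLINE", r"\n"),
    ("WS", r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text, keep_layout=False):
    """Split text into (lexeme, type) pairs; drop WS/NEWLINE unless keep_layout is set."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "NAME":
            # The whole name is read first (longest match), and only then
            # compared with the keyword table, so 'ifvar' is not split as 'if' + 'var'.
            kind = KEYWORDS.get(lexeme, "VAR")
        if kind in ("WS", "NEWLINE") and not keep_layout:
            continue
        tokens.append((lexeme, kind))
    return tokens
```

Calling ''tokenize'' on the IMP program above yields the token sequence without layout tokens; passing ''keep_layout=True'' keeps the ''WS'' and ''NEWLINE'' entries as well. Reading the longest possible name before consulting the keyword table is one simple strategy for telling ''if'' apart from ''ifvar''.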
@@ Line 163: / Line 143: @@
 While **syntactic analysis** can be implemented **ad-hoc**, a much more efficient (and widely adopted) approach is to specify **a grammar**, i.e. a set of **syntactic rules** for the formation of different language constructs.
-For functions in our language, such rules may look as follows:
+The grammar is very similar to the BNF notation shown above; however, grammars require more care in their definition.
-<code>
-<function> ::= <func_def> EQUALS OPEN_CURL <body> CLOSED_CURL
-<func_def> ::= FUNC_DEF NAME OPEN_PAR <param_list> CLOSED_PAR TYPE_DEF NAME
-<type> ::= NAME TYPE_DEF NAME
-<param_list> ::= <type> | <type> COMMA <param_list>
-</code>
-Notice that in our grammar, we rely on token roles to specify what a correct program should look like. For instance, the last rule states that a list of function parameters is either a NAME followed by ':' followed by a NAME, or a comma-separated sequence of such definitions.
 A considerable part of the Formal Languages lecture will be focused on such grammars, their properties, and how to write them in order to avoid ambiguities during parsing.
@@ Line 177: / Line 149: @@
 An example of such an ambiguity is shown below:
 <code>
-if (x>10) if (x<2) x+=0 else x+=1
+if (x>10) if (x>2) x=0 else x=1
 </code>
-It is not clear whether the ''else'' construct belongs to the first or the second if.
+This program is faithful to the BNF notation shown above. However, it is not clear whether the ''else'' branch belongs to the first or the second ''if''.
-Another example of an ambiguous grammar is:
+Consider the following extension (with multiplication) of arithmetic expressions:
 <code>
-<expr> ::= <expr> + <expr> | <expr> * <expr> | ATOM
+<AExpr> ::= <AExpr> + <AExpr> | <AExpr> * <AExpr> | <Var>
 </code>
-Considering we have the following input from the syntactic phase:
+For the following input determined in the lexical phase:
 <code>
-'x'(ATOM)
+'x'(Var)
-'+'(OPERATION)
+'+'(Plus)
-'y'(ATOM)
+'y'(Var)
-'*'(OPERATION)
+'*'(Mult)
-'5'(ATOM)
+'5'(Var)
 </code>
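One way to see the ambiguity concretely is to enumerate every parse tree that the grammar above allows for this token sequence. The Python sketch below is a brute-force illustration only; the function ''parse_trees'' and the tuple representation of trees are assumptions made for this example, and any token that is not an operator is treated as a ''<Var>''.

```python
OPERATORS = ("+", "*")


def parse_trees(tokens):
    """Return all parse trees of tokens under
    <AExpr> ::= <AExpr> + <AExpr> | <AExpr> * <AExpr> | <Var>.
    A tree is either a single token or a tuple (left, operator, right)."""
    if len(tokens) == 1:
        # A single token is a valid expression only if it is not an operator.
        return [] if tokens[0] in OPERATORS else [tokens[0]]
    trees = []
    # Try every operator as the root of the tree, then parse both sides.
    for i in range(1, len(tokens) - 1):
        if tokens[i] in OPERATORS:
            for left in parse_trees(tokens[:i]):
                for right in parse_trees(tokens[i + 1:]):
                    trees.append((left, tokens[i], right))
    return trees
```

For the tokens ''x'', ''+'', ''y'', ''*'', ''5'', the function returns two distinct trees: one with ''+'' at the root and one with ''*'' at the root. A grammar is ambiguous precisely when some input admits more than one parse tree.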
@@ Line 213: / Line 185: @@
 ==== Stage 3: Semantic analysis ====
-During or after the syntactic phase, **semantic analysis** checks if certain syntactic relations between tokens are valid. For instance:
+During or after the syntactic phase, **semantic analysis** checks if certain syntactic relations between tokens are valid. In more advanced programming languages:
   * the **declared** return type (e.g. ''Integer'' for our function) must coincide with the actual return type.
   * comparisons (e.g. ''x > 10'') must be made between **comparable** tokens
@@ Line 229: / Line 201: @@
 These stages are outside the scope of our lecture.
+===== The key objective of this lecture =====
+Consider the following programming language, the well-known Lambda Calculus:
+<code>
+<Var> ::= String
+<LExpr> ::= <Var> |
+            "\" <Var> "." <LExpr> |
+            "(" <LExpr> " " <LExpr> ")"
+</code>
+The lexical and syntactic phases (i.e. //parsing//) follow the very same general ideas already presented. Instead of implementing another parser from scratch, it is only natural to build tools which automate, to some extent, the process of parser development. Much of the Formal Languages and Automata lecture relies on this insight.
+*/
 ===== Lecture outline =====