Differences

lfa:introduction [2017/10/02 14:38]
pdmatei created
lfa:introduction [2020/10/19 15:29] (current)
pdmatei
Line 12: Line 12:
 ==== Compilers and interpreters ====
  
-A **compiler** is a software program taking as input a //character stream// (usually as a text file), containing a **//program//** and outputs **executable code**, in an **assembly language** appropriate for the computer architecture at hand.
+A **compiler** is a software program that takes as input a //character stream// (usually as a text file), containing a **//program//** and outputs **executable code**, in an **assembly language** appropriate for the computer architecture at hand.
  
 The compiler's **input program** belongs to a high-level programming language (e.g. C/C++, Java, Scala, Haskell, etc.). The **compiler itself** may be written in C/C++, Java, Scala or Haskell, usually relying on **parser generators** - APIs/libraries specifically designed for compiler development:
Line 29: Line 29:
 **Formal Languages** are especially helpful for the **parsing phase**.
  
 +
 +/*
 Parsing itself has several stages, which we shall illustrate via the following example.
  
 +===== The programming language IMP =====
 +
 +The programming language IMP (short for //Imperative//) is a very simple imperative language, equipped with ''if'', ''while'', assignments, arithmetic and boolean expressions. We present the syntax of IMP below:
 +
 +<​code>​
 +<Var> ::= String
 +<​AVal>​ ::= Number
 +<​BVal>​ ::= "​True"​ | "​False"​
 +<​AExpr>​ ::= <Var> | <​AVal>​ | <​AExpr>​ "​+"​ <​AExpr>​ | "​("​ <​AExpr>​ "​)"​
 +<​BExpr>​ ::= <​BVal>​ | <​BExpr>​ "&&"​ <​BExpr>​ | <​AExpr>​ ">"​ <​AExpr>​ | "​("​ <​BExpr>​ "​)"​
 +<​Stmt>​ ::= <Var> "​="​ <​AExpr>​ |
 +           "​{"​ <​Stmt>​ "​}"​ |
 +           "​if (" <​BExpr>​ "​)"​ <​Stmt>​ <​Stmt>​ |
 +           "​while (" <​BExpr>​ "​)"​ <​Stmt>​ |
 +           <​Stmt>​ ";"​ <​Stmt>​
 +<​VarList>​ ::= <Var> | <Var> ","​ <​VarList>​
 +<​Prog>​ ::= "int "<​VarList>​ ";"​ <​Stmt>​
 +</​code>​
 +
 +The notation used above is called Backus-Naur Form (BNF), and is often used as a semi-formal notation for describing different kinds of syntax.
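The BNF rules above can be mirrored by data types that a parser would produce as output. The following is a minimal Python sketch, not part of the original page: the class names (''Var'', ''AVal'', ''Plus'', ''Assign'', etc.) and the helper ''show_aexpr'' are assumptions chosen for illustration, and parentheses and braces are treated as grouping only, so they have no node of their own.

```python
from dataclasses import dataclass
from typing import Union

# <AExpr> ::= <Var> | <AVal> | <AExpr> "+" <AExpr> | "(" <AExpr> ")"
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class AVal:
    value: int

@dataclass(frozen=True)
class Plus:
    left: "AExpr"
    right: "AExpr"

AExpr = Union[Var, AVal, Plus]

# <BExpr> ::= <BVal> | <BExpr> "&&" <BExpr> | <AExpr> ">" <AExpr> | "(" <BExpr> ")"
@dataclass(frozen=True)
class BVal:
    value: bool

@dataclass(frozen=True)
class And:
    left: "BExpr"
    right: "BExpr"

@dataclass(frozen=True)
class Greater:
    left: AExpr
    right: AExpr

BExpr = Union[BVal, And, Greater]

# <Stmt> ::= assignment | if | while | sequence (braces only group statements)
@dataclass(frozen=True)
class Assign:
    var: str
    expr: AExpr

@dataclass(frozen=True)
class If:
    cond: BExpr
    then_branch: "Stmt"
    else_branch: "Stmt"

@dataclass(frozen=True)
class While:
    cond: BExpr
    body: "Stmt"

@dataclass(frozen=True)
class Seq:
    first: "Stmt"
    second: "Stmt"

Stmt = Union[Assign, If, While, Seq]

def show_aexpr(e: AExpr) -> str:
    """Print an arithmetic expression back, fully parenthesised."""
    if isinstance(e, Var):
        return e.name
    if isinstance(e, AVal):
        return str(e.value)
    return "(" + show_aexpr(e.left) + " + " + show_aexpr(e.right) + ")"

# The statement  x = y + 1  as a tree
example = Assign("x", Plus(Var("y"), AVal(1)))
```

Each alternative of a BNF rule becomes one class, and each non-terminal becomes a union of those classes; this is one common design, not the only possible one.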
  
-===== Parsing a simplistic program =====
+===== Parsing an IMP program =====
  
 Consider the following program:
 <code>
-def func (x : Integer) : Integer = {
-    if (x > 10)
-    0
-    else if (x > 0)
-         1
-         else 2
+int s,n;
+n = 1000;
+s = 0;
+while (n > 0) {
+   s = s + n;
+   n = n + (-1);
 }
 </code>
  
-Ignoring details regarding how the input is read, the compiler will start of with the following stream of characters or **word** :
+Ignoring details regarding how the input is read, the compiler/interpreter will start off with the following stream of characters, or **word**:
 <code>
-def func (x : Integer) : Integer = {\n\tif (x > 10)\n\t0\n\telse if (x > 0) \n\t\t1\n\t\t2\n}\n
+int s,n;\n n = 1000;\n s = 0;\n while (n > 0) {\t\n s = s + n; \t\n n = n + (-1); \t\n}
 </code>
  
 ==== Stage 1: Lexical analysis - Lexer ====
  
-  * In the first parsing stage, we would like to identify **tokens**, or atomic program components. The function name ''func'', the keyword ''def'' or the variable name ''x'' are such components.
+  * In the first parsing stage, we would like to identify **tokens**, or atomic program components. The keyword ''while'' or the variable name ''n'' are such components.
  
 The result of token identification may be represented as follows, where each token is shown on one line, between quotes:
 <code>
-'def'
+'int'
 ' '(whitespace)
-'func'
+'s'
-' '(whitespace)
+','
-'('
+'n'
-'x'
+'\n'
-' '
+'n'
-':'
-' '
-'Integer'
-')'
-':'
-'Integer'
-' '
 '='
-' '
+'1000'
-'{'
 '\n'
-'\t'
-'if'
 ' '
-'('
+'s'
-'x'
 ' '
-'>'
+'='
 ' '
-'10'
-')'
-'\n'
-'\t'
 '0'
+';'
 '\n'
-'\t'
+'while'
-'else'
-' '
-'if'
-' '
 '('
-'x'
+...
-' '
-'>'
-' '
-'0'
-')'
-' '
-'\n'
-'\t'
-'\t'
-'1'
-'\n'
-'\t'
-'\t'
-'2'
-'\n'
-'}'
-'\n'
 </code>
  
-  * During the same stage, while identifying **tokens** a parser assigns a //role// for each token. For instance, ''Integer'' is a type name, while ''def'' is a reserved keyword.
-    * Each token can have a unique role (e.g. the language would not allow ''def'' as a variable name), and identifying the unique role may be challenging. For instance ''if'' and ''ifunction'' are tokens with different roles (keyword vs function name) but which start similarly. Parsers use different strategies to disambiguate.
-    * Some //roles// are only useful for programming discipline (e.g. newlines and tabs) and some for token separation (e.g. whitespaces). We will ignore them in what follows.
+  * During the same stage, while identifying **tokens**, a parser assigns a //type// to each token. For instance, ''int'' is a type name, while ''while'' is a reserved keyword.
+    * Each token has a unique type (e.g. the language would not allow ''if'' as a variable name), and identifying that type may be challenging. For instance, ''if'' and ''ifvar'' are tokens with different types (keyword vs. variable name) that start with the same characters. Parsers use different strategies to disambiguate such cases.
+    * Some //types// are only useful for programming discipline (e.g. newlines and tabs) and some for token separation (e.g. whitespaces). We will ignore them in what follows.
  
 Below, we illustrate the same sequence of tokens (omitting whitespaces, tabs and newlines for legibility), together with their role, in parentheses:
  
 <code>
-'def'(FUNC_DEF)
+'int'(DECLARATION)
-'func'(NAME)
+' '(WS)
-'('(OPEN_PAR)
+'s'(VAR)
-'x'(NAME)
+','(COMMA)
-':'(TYPE_DEF)
+'n'(VAR)
-'Integer'(NAME)
+'\n'(NEWLINE)
-')'(CLOSED_PAR)
+'n'(VAR)
-':'(TYPE_DEF)
+'='(EQ)
-'Integer'(NAME)
+'1000'(NUMBER)
-'='(EQUALS)
+'\n'(NEWLINE)
-'{'(OPEN_CURL)
+' '(WS)
-'if'(IF)
+'s'(VAR)
-'('(OPEN_PAR)
+' '(WS)
-'x'(NAME)
+'='(EQ)
-'>'(COMPARISON)
+' '(WS)
-'10'(NUMBER)
-')'(CLOSED_PAR)
 '0'(NUMBER)
-'else'(ELSE)
+';'(COMMA)
-'if'(IF)
+'\n'(NEWLINE)
+'while'(WHILE)
 '('(OPEN_PAR)
-'x'(NAME)
+...
-'>'(COMPARISON)
-'0'(NUMBER)
-')'(CLOSED_PAR)
-'1'(NUMBER)
-'else'(ELSE)
-'2'(NUMBER)
-'}'(CLOSED_CURL)
 </code>
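The lexical stage described above can be sketched with regular expressions. The following Python fragment is only an illustration under stated assumptions, not the lecture's implementation: the names ''TOKEN_SPEC'', ''KEYWORDS'' and ''tokenize'' are hypothetical, and the token types ''SEMICOLON'', ''PLUS'', ''MINUS'' and ''AND'' are invented here (the listing above does not name them). Keywords are separated from variable names by first matching the longest identifier and only then looking it up in a keyword table, which is one common way to tell ''if'' from ''ifvar''.

```python
import re

# (token type, regular expression) pairs; the type names are assumptions
TOKEN_SPEC = [
    ("NUMBER",      r"\d+"),
    ("NAME",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPEN_PAR",    r"\("),
    ("CLOSED_PAR",  r"\)"),
    ("OPEN_CURL",   r"\{"),
    ("CLOSED_CURL", r"\}"),
    ("COMMA",       r","),
    ("SEMICOLON",   r";"),
    ("COMPARISON",  r">"),
    ("AND",         r"&&"),
    ("PLUS",        r"\+"),
    ("MINUS",       r"-"),
    ("EQ",          r"="),
    ("NEWLINE",     r"\n"),
    ("WS",          r"[ \t]+"),
]

# Identifiers that are reserved words get their own token type
KEYWORDS = {"int": "DECLARATION", "while": "WHILE", "if": "IF"}

# One big alternation; each alternative is a named group
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Split `text` into (lexeme, type) pairs, dropping whitespace and newlines."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("WS", "NEWLINE"):
            continue  # only used for token separation and layout
        if kind == "NAME":
            # The whole identifier is matched first, so 'ifvar' is never split into 'if' + 'var'
            kind = KEYWORDS.get(lexeme, "VAR")
        tokens.append((lexeme, kind))
    return tokens
```

For example, ''tokenize("if ifvar")'' yields one ''IF'' token followed by one ''VAR'' token.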
  
Line 163: Line 143:
 While **syntactic analysis** can be implemented **ad-hoc**, a much more efficient (and widely adopted) approach is to specify **a grammar**, i.e. a set of **syntactic rules** for the formation of different language constructs.
  
-For functions in our language, such rules may look as follows:
-<code>
-<function> ::= <func_def> EQUALS OPEN_CURL <body> CLOSED_CURL
-<func_def> ::= FUNC_DEF NAME OPEN_PAR <param_list> CLOSED_PAR TYPE_DEF NAME
-<type> ::= NAME TYPE_DEF NAME
-<param_list> ::= <type> | <type> COMMA <param_list>
-</code>
-
-Notice that in our grammar, we rely on token roles to specify what a correct program should look like. For instance, the last rule states that a list of function parameters is either a NAME followed by ':' followed by a NAME, or a comma-separated sequence of such definitions.
+The grammar is very similar to the BNF notation shown above; however, grammars require more care in their definition.
  
 A considerable part of the Formal Languages lecture will be focused on such grammars, their properties, and how to write them in order to avoid ambiguities during parsing.
Line 177: Line 149:
 An example of such an ambiguity is shown below:
 <code>
-if (x>10) if (x<2) x+=0 else x+=1
+if (x>10) if (x>2) x=0 else x=1
 </code>
  
-It is not clear whether the ''else'' construct belongs to the first or the second if.
+This program is faithful to the BNF notation shown above; however, it is not clear whether the ''else'' branch belongs to the first or the second ''if''.
  
-Another example of an ambiguous grammar is:
+Consider the following extension (with multiplication) of arithmetic expressions:
 <code>
-<expr> ::= <expr> + <expr> | <expr> * <expr> | ATOM
+<AExpr> ::= <AExpr> + <AExpr> | <AExpr> * <AExpr> | <Var>
 </code>
  
-Considering we have the following input from the syntactic phase:
+For the following input determined in the lexical phase:
 <code>
-'x'(ATOM)
+'x'(Var)
-'+'(OPERATION)
+'+'(Plus)
-'y'(ATOM)
+'y'(Var)
-'*'(OPERATION)
+'*'(Mult)
-'5'(ATOM)
+'5'(Var)
 </code>
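The ambiguity of this grammar on the input above can be made concrete by enumerating every parse tree. The Python sketch below is an illustration only: the function names ''parse_trees'' and ''evaluate'' are hypothetical, trees are plain nested tuples, and the values given to ''x'' and ''y'' are arbitrary. It tries each operator as the root of the tree, which is exactly the freedom the rule ''<AExpr> ::= <AExpr> + <AExpr> | <AExpr> * <AExpr> | <Var>'' leaves open.

```python
OPERATORS = ("+", "*")

def parse_trees(tokens):
    """Return every parse tree of `tokens` under E ::= E + E | E * E | atom."""
    if len(tokens) == 1:
        # A single token is a tree only if it is an atom, not an operator
        return [] if tokens[0] in OPERATORS else [tokens[0]]
    trees = []
    # Operators can only sit at odd positions; try each one as the root
    for i in range(1, len(tokens) - 1, 2):
        if tokens[i] not in OPERATORS:
            continue
        for left in parse_trees(tokens[:i]):
            for right in parse_trees(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees

def evaluate(tree, env):
    """Evaluate a tree; atoms are either digits or names looked up in `env`."""
    if isinstance(tree, str):
        return int(tree) if tree.isdigit() else env[tree]
    op, left, right = tree
    a, b = evaluate(left, env), evaluate(right, env)
    return a + b if op == "+" else a * b

trees = parse_trees(["x", "+", "y", "*", "5"])
# Two distinct trees: x + (y * 5) and (x + y) * 5
values = [evaluate(t, {"x": 1, "y": 2}) for t in trees]
```

With ''x = 1'' and ''y = 2'' the two trees evaluate to different numbers, so a parser must commit to one of them, typically through operator precedence rules.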
  
Line 213: Line 185:
 ==== Stage 3: Semantic analysis ====
  
-During or after the syntactic phase, **semantic analysis** checks if certain syntactic relations between tokens are valid. For instance:
+During or after the syntactic phase, **semantic analysis** checks if certain syntactic relations between tokens are valid. In more advanced programming languages:
   * the **declared** returned type (e.g. ''Integer'' for our function), must coincide with the actual returned type.
   * comparisons (e.g. ''x > 10'') must be made between **comparable** tokens
Line 229: Line 201:
  
 These stages are outside the scope of our lecture.
 +
 +===== The key objective of this lecture =====
 +
 +Consider the following programming language, the well-known Lambda Calculus:
 +
 +<​code>​
 +<Var> ::= String
 +<​LExpr>​ ::= <Var> |
 +            "​\"​ <Var> "​."​ <​LExpr>​ |
 +            "​("​ <​LExpr>​ " " <​LExpr>​ "​)"​
 +</​code>​
 +
 +The lexical and syntactic phases (i.e. //parsing//) follow the very same general ideas already presented. Instead of implementing another parser from scratch, it is natural to build tools which automate, to some extent, the process of parser development. Much of the Formal Languages and Automata lecture relies on this insight.
 +*/
  
 ===== Lecture outline =====