The umbrella-term "Formal Languages and Automata" refers to a collection of tools that are inherently abstractions designed to help us write better and faster compilers. At their very beginning, compilers were heavy-weight pieces of software that ran to tens of thousands of lines of code and took up to three years to write (as was the case with the compiler for ALGOL - the ALGOrithmic Language). A considerable part of that weight was carried by **parsers**, the tools responsible for reading the program at hand. Historically, compilation has always been done in stages, and most compilers tend to stick to the following stages:

  - **lexical stage**: In this stage, the input is split into **lexemes**, chunks or words with a particular interpretation. For instance, in ''int x = 0'', ''int'' might be a keyword, ''x'' might be an identifier, ''='' might be an operator, and so forth. Whitespace may be skipped, or it may be an intrinsic part of the language syntax, as is the case in Haskell and Python, where indentation governs program structure. (A minimal tokeniser sketch is given after this list.)
  - **syntactic stage**: In this stage, most parsers will build an Abstract Syntax Tree (AST) which describes the relations between tokens. For instance, the program fragment ''int x = 0'' may be interpreted as a //definition// which consists of the assignment of the variable ''x'' to the expression ''0''. This stage is also responsible for making sure that the program is syntactically correct.
  - **semantic checks**: Most of these checks are related to typing, which may be more relaxed, as in dynamic languages such as Racket or Python, or rigid, as in most OO-languages or Haskell.
  - **optimisation** and **code-generation**: During these stages, machine code is generated, and then reorganised or rewritten in order to increase efficiency.
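
To make the lexical stage concrete, here is a minimal tokeniser sketch in Python (an illustration, not part of the lecture's toolchain; the token names and patterns are our own choices). It uses regular expressions, the very abstraction discussed below, to split ''int x = 0'' into lexemes:

<code python>
import re

# Illustrative token specification: each lexeme class is described by a
# regular expression. The names and patterns are example choices.
TOKEN_SPEC = [
    ("KEYWORD", r"\bint\b"),
    ("NUMBER",  r"\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("ASSIGN",  r"="),
    ("SKIP",    r"\s+"),   # whitespace carries no meaning in this toy language
]

MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Split the input into (kind, lexeme) pairs, skipping whitespace."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("int x = 0")))
# [('KEYWORD', 'int'), ('IDENT', 'x'), ('ASSIGN', '='), ('NUMBER', '0')]
</code>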
  
The first two stages, lexical and syntactic, are usually the responsibility of the **parser**, which is typically decoupled from the rest of the compiler. Also, in an interpreter there is no code-generation (and there may be less optimisation to be done); rather, the code is executed directly.

Finally, note that some languages (including many modern ones) do not fit perfectly into the previous description. Java is such an example. On the one hand, Java programs are compiled, because bytecode is generated during the process; this bytecode is then further translated to machine code by the JVM. On the other hand, JIT (Just-In-Time) compilation makes the setting more complex and more similar to interpretation.

Historically, writing parsers was challenging and time-consuming. Nowadays, writing parsers from scratch is rarely done in practice. This process has been replaced by powerful abstractions, which allow us to specify what type of lexemes we should search for in the lexical phase, and what kind of program structure we should look for during the syntactic phase. The former are the well-known **regular expressions**, while the latter are, more often than not, **context-free grammars**.
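
As a minimal sketch of how a context-free grammar drives the syntactic phase, consider the toy grammar below (our own illustration; rule and node names are invented). A recursive-descent parser mirrors each grammar rule with a function and builds an AST, reusing the ''tokenize'' function from the sketch above:

<code python>
# Toy context-free grammar (illustrative):
#   defn -> KEYWORD IDENT ASSIGN expr
#   expr -> NUMBER

def parse_defn(tokens):
    """defn -> KEYWORD IDENT ASSIGN expr"""
    tokens = list(tokens)
    kind, kw = tokens.pop(0)
    assert kind == "KEYWORD", "expected a type keyword"
    kind, name = tokens.pop(0)
    assert kind == "IDENT", "expected an identifier"
    kind, _ = tokens.pop(0)
    assert kind == "ASSIGN", "expected '='"
    return ("defn", kw, name, parse_expr(tokens))

def parse_expr(tokens):
    """expr -> NUMBER"""
    kind, value = tokens.pop(0)
    assert kind == "NUMBER", "expected a number"
    return ("num", int(value))

print(parse_defn(tokenize("int x = 0")))
# ('defn', 'int', 'x', ('num', 0))
</code>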
These abstractions are central to our lecture.

The modern parser-writing process goes as follows (a schematic sketch of the last two steps is given after this list):
  - the programmer decides on the syntactic structure of their programming language. They write regular expressions, as well as a **grammar** for the language, in a spec with a predefined syntax. You may view this as a sort of meta-programming.
  - a tool (one of the most widely used being ANTLR 4) is used to generate, from the spec, a code-stub for your parser. This stub contains unimplemented methods that are called when certain constructs have been parsed, and so forth. In some cases, the AST of the input will also be built.
  - you start work on your interpreter or compiler by extending the generated code and implementing the desired functionality.
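
To illustrate the last two steps, here is a schematic Python sketch of what a generated code-stub and its extension might look like. This is **not** actual ANTLR output (ANTLR emits listener/visitor classes whose exact shape depends on your grammar); all names below are hypothetical, and the sketch reuses ''tokenize'' and ''parse_defn'' from above:

<code python>
# Hypothetical stub, standing in for what a parser generator might emit.
# Each method corresponds to a construct in the toy grammar and is
# invoked once that construct has been parsed.
class ToyLangListener:
    def enter_defn(self, kw, name, value):
        pass  # unimplemented: to be overridden by the language implementer

def walk(ast, listener):
    """Minimal tree-walker: dispatches AST nodes to listener callbacks."""
    tag = ast[0]
    if tag == "defn":
        _, kw, name, expr = ast
        listener.enter_defn(kw, name, walk(expr, listener))
    elif tag == "num":
        return ast[1]

# Extending the generated code with the desired functionality:
# a tiny "interpreter" that records variable bindings.
class Interpreter(ToyLangListener):
    def __init__(self):
        self.env = {}

    def enter_defn(self, kw, name, value):
        self.env[name] = value

interp = Interpreter()
walk(parse_defn(tokenize("int x = 0")), interp)
print(interp.env)   # {'x': 0}
</code>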
The job of the Formal Languages and Automata lecture is to go into more detail regarding how such generation tools work and on what principles they are built. These principles revolve around two categories of languages (in a wider sense than just programming languages), called **regular** and **context-free**.
==== Beyond Formal Languages ====
The reader is most likely a Computer Science undergraduate and will apply much of what they learn in this lecture to writing parsers and compilers. However, Formal Languages yield formal tools that are applicable in a wide range of areas. Much of what you will learn is a cornerstone for the more advanced topic of software verification, where techniques such as **model checking** use automata (extended with infinite runs) in order to ensure program safety. Formal Languages also have numerous applications in natural language processing. Finally, Formal Languages are an important tool for studying computational complexity: classes of machines with different computational power identify classes of problems with different degrees of difficulty.
Throughout this lecture, we will focus mostly on those case-studies and scenarios pertaining to parsers, and only occasionally extend the discussion to other fields.