The beginning

In the early 1940s, some of the first programmable machines started to appear in research labs such as the Harvard Computation Lab or Bell Labs. These machines were glacially slow by today's standards. They were also very difficult to program and to operate. Programming required intimate knowledge of the machine, and often meant resorting to hacks in order to make computations run faster. Speed was always critical. During this time, when the US was at war, most of these machines were used to compute missile trajectories, damage coverage from bombs, and other war-related computations. Machines such as the Mark 1 were often used 24 hours a day, 7 days a week. The human effort of keeping such machines functional was split between programmers, mostly responsible for developing programs, and operators, who would ensure the machine remained functional throughout the computation. Operators would repair or replace relays and other components. On some machines, such as the ENIAC, operators had to step in during a computation for it to continue (e.g. flipping switches or making certain electrical connections).

Speed

In around 80 years, computers went from 3 (numerical) operations per second (on the Mark 1) to roughly 1 billion instructions per second on A15 Bionic chips. This means that one second of computation on an average mobile phone would have taken around 10.5 years to complete on the Mark 1: 10^9 operations at 3 operations per second is roughly 3.3 x 10^8 seconds, or about 10.5 years. The success of today's machines can largely be attributed to major advancements in hardware, with the emergence of silicon technology. The success of today's software, however, is a testament to our ability to build abstractions that make programming easier.

Complexity

One of the first abstractions used in programming started with the observation that many programs were reusing groups of instructions (e.g. to perform a more complex task such as numerical differentiation). Thus, programmers started to keep notebooks with such groups, so that they could reuse them in different programs. At this point, reuse simply meant copying those instructions from notebooks. Soon, notebooks started being shared between programmers, and at that point subroutines had effectively been invented. One of the earliest forms of compilers, called linkers, was responsible for automating the copying process: programs would be loaded and also linked to the routines required to run them. The subroutine call is thus an abstraction for a piece of code that should be executed at that point. Linkers made the programming process far easier and allowed the programmer to focus on the program logic, removing the task of copying or rewriting large pieces of code from scratch. The next big step was the observation that programmers think and express themselves better in natural language: most of the punch-cards used in early computers bear hand-written comments that make the programs easier to follow. It was Grace Hopper's intuition that programmers should program in something closer to natural language instead of machine code, and this led to the development of COBOL (COmmon Business-Oriented Language), a syntax-friendly language that is still in use today, more than 60 years after its creation, and of FORTRAN (FORmula TRANslator).

Abstractions

These programming languages, in the modern sense of the word, are abstractions that we use to hide, or abstract away, the often complex, messy details of machine hardware. They allow us to write ever more sophisticated programs and applications. These abstractions are powered by compilers or interpreters: the former translate our code into machine language, while the latter execute our code directly. In an attempt to generalise, we shall say that an abstraction is a tool that allows us to take a high-level description of something and translate it into something operational: a sequence of steps that can be executed to achieve the desired goal. In this sense, the high-level description is given in an appropriate syntax, while the translation process assigns a semantics - an intended, executable, meaning.
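
To make this split between syntax and semantics concrete, here is a minimal sketch (in Python, not tied to any real compiler) in which nested tuples play the role of a toy syntax and an evaluate function assigns them an executable meaning. All names in it are illustrative.

```python
# A minimal sketch of "syntax + semantics": nested tuples are the syntax
# of a toy expression language, and evaluate() assigns them a meaning
# by actually computing a value. All names here are illustrative.

def evaluate(expr):
    """Evaluate a toy expression: ('+', e1, e2), ('*', e1, e2) or a number."""
    if isinstance(expr, (int, float)):   # a literal evaluates to itself
        return expr
    op, left, right = expr               # otherwise: an operator and two sub-expressions
    if op == '+':
        return evaluate(left) + evaluate(right)
    if op == '*':
        return evaluate(left) * evaluate(right)
    raise ValueError(f"unknown operator: {op}")

# (1 + 2) * 3, written in the toy syntax:
print(evaluate(('*', ('+', 1, 2), 3)))   # prints 9
```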

Much of today's information technology is powered by abstractions. We note only two examples. Relational database systems are powered by SQL, which allows us to express and sequence various operations to be performed on data, using almost-human sentences such as SELECT * from user_table where id=0.
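
As a small illustration, a query of this shape can be run against an in-memory SQLite database directly from Python; the table and its contents below are made up for the example.

```python
import sqlite3

# A small, self-contained illustration of the SQL query above.
# The table name and its contents are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (id INTEGER, name TEXT)")
conn.execute("INSERT INTO user_table VALUES (0, 'alice'), (1, 'bob')")

# The declarative, almost-human query; the database engine decides
# how to actually scan the table and filter the rows.
rows = conn.execute("SELECT * FROM user_table WHERE id = 0").fetchall()
print(rows)   # [(0, 'alice')]
conn.close()
```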

Many modern datacenter topologies use software defined networking (SDN). In effect, this means that we can use software abstractions to govern how data is being switched or routed in a topology, without the need to change wiring or to individually update configuration files on distinct machines, in order to achieve some desired behaviour.

The umbrella term “Formal Languages and Automata” refers to a collection of tools that are inherently abstractions, designed to help us write better compilers, faster. At their very beginning, compilers were heavyweight pieces of software that ran to tens of thousands of lines of code and took up to 3 years to write (as was the case with the compiler for ALGOL - ALGOrithmic Language). A considerable part of that weight was carried by parsers, the tools responsible for reading the program at hand. Historically, compilation has always been done in stages, and most compilers tend to stick to the following stages (a small illustrative sketch follows the list):

- lexical stage: In this stage, the input is split into lexemes, chunks or words with a particular interpretation. For instance, in int x = 0, int might be a keyword, x an identifier, = an operator, and so forth. Whitespace may be skipped, or it may be an intrinsic part of the language syntax, as is the case in Haskell and Python, where indentation governs program structure.

- syntactic stage: In this stage, most parsers will build an Abstract Syntax Tree (AST) which describes the relations between tokens. For instance, the program fragment int x = 0 may be interpreted as a definition which consists of binding the variable x to the expression 0. This stage is also responsible for making sure that the program is syntactically correct.

- semantic checks: Most of these checks are related to typing, which may be dynamic and relaxed, as in languages such as Racket or Python, or static and strict, as in Haskell and most OO languages.

- optimisation and code generation: During these stages, machine code is generated, and the code is reorganised or rewritten in order to increase efficiency.
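
To make the first three stages concrete, here is a minimal hand-written sketch (not how a production compiler is organised) that lexes the fragment int x = 0 with regular expressions, builds a small AST for it, and performs a trivial type check; all token classes and names are made up for the example.

```python
import re

# Lexical stage: split the input into lexemes using regular expressions.
# The token classes below are made up for this tiny example.
TOKEN_SPEC = [
    ("KEYWORD", r"\bint\b"),
    ("NUMBER",  r"\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("ASSIGN",  r"="),
    ("SKIP",    r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "SKIP":            # whitespace carries no meaning here
            yield (match.lastgroup, match.group())

# Syntactic stage: recognise "int <identifier> = <number>" and build a small AST.
def parse_definition(tokens):
    (_, _), (_, name), (_, _), (_, value) = tokens   # KEYWORD IDENT ASSIGN NUMBER
    return ("definition", "int", name, ("literal", int(value)))

# Semantic check: the declared type must match the type of the initialiser.
def check(ast):
    _, declared_type, _, (_, value) = ast
    assert declared_type == "int" and isinstance(value, int), "type mismatch"

tokens = list(lex("int x = 0"))
ast = parse_definition(tokens)
check(ast)
print(tokens)   # [('KEYWORD', 'int'), ('IDENT', 'x'), ('ASSIGN', '='), ('NUMBER', '0')]
print(ast)      # ('definition', 'int', 'x', ('literal', 0))
```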

The first two stages, lexical and syntactic, are usually the responsibility of the parser, which is usually decoupled from the rest of the compiler. Also, in an interpreter there is no code generation (and there may be less optimisation to do); instead, the code is executed directly.

Finally, note that some languages (and many modern ones) do not fit perfectly into the previous description. Java is such an example. On the one hand, Java programs are compiled, because bytecode is generated during the process; this bytecode is then further translated to machine code by the JVM. On the other hand, JIT (Just-In-Time) compilation makes the setting more complex and more similar to interpretation.

Historically, writing parsers was challenging and took a lot of time. Nowadays, writing parsers from scratch is rarely done in practice. This process has been replaced by powerful abstractions, which allow us to specify what type of lexemes to search for during the lexical phase, and what kind of program structure to look for during the syntactic phase. The former are the well-known regular expressions, while the latter are, more often than not, context-free grammars.
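
As a taste of the second abstraction, here is a minimal sketch of a hand-written recursive-descent recogniser for a tiny context-free grammar of balanced parentheses, S -> ( S ) S | epsilon; the grammar and all names are chosen purely for illustration (parser generators automate exactly this kind of code).

```python
# A tiny context-free grammar of balanced parentheses,
#     S -> ( S ) S | epsilon
# recognised by a hand-written recursive-descent function.
# This only illustrates what a grammar describes; parser generators
# produce this kind of code automatically.

def parse_S(s, i=0):
    """Try to parse S starting at position i; return the position right after it."""
    if i < len(s) and s[i] == "(":
        i = parse_S(s, i + 1)        # S -> ( S ...
        if i >= len(s) or s[i] != ")":
            raise SyntaxError("expected ')'")
        return parse_S(s, i + 1)     # ... ) S
    return i                         # S -> epsilon

def is_balanced(s):
    try:
        return parse_S(s) == len(s)
    except SyntaxError:
        return False

print(is_balanced("(()())"))   # True
print(is_balanced("(()"))      # False
```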

These abstractions are central to our lecture.

The modern parser-writing process goes roughly as follows:

- the programmer decides on the syntactic structure of the programming language, and writes regexes, as well as a grammar for the language, in a spec with a predefined syntax. You may view this as a sort of meta-programming.

- a tool (one of the most widely used being ANTLR 4) generates, from the spec, a code stub for the parser. This stub contains unimplemented methods that are called when certain constructs have been parsed, and so forth. In some cases, the AST of the input will be built.

- work on the interpreter or compiler then starts by extending the generated code, achieving the desired functionality.
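
As an illustration of this workflow, here is a minimal sketch of how code generated by ANTLR's Python target is typically driven, assuming a hypothetical grammar file Expr.g4 whose start rule is prog; the generated class names follow ANTLR's usual conventions, but the grammar, rule names and listener method below are illustrative.

```python
# Driving code generated by ANTLR 4 (Python target) from a hypothetical
# grammar Expr.g4 whose start rule is assumed to be 'prog'.
from antlr4 import InputStream, CommonTokenStream, ParseTreeWalker
from ExprLexer import ExprLexer          # generated lexer (illustrative name)
from ExprParser import ExprParser        # generated parser (illustrative name)
from ExprListener import ExprListener    # generated listener stub (illustrative name)

class MyListener(ExprListener):
    # The generated stub declares empty enter.../exit... methods for each rule;
    # we override only the ones we care about.
    def exitProg(self, ctx):
        print("finished parsing a 'prog'")

stream = InputStream("1 + 2 * 3")
lexer = ExprLexer(stream)                # lexical stage
tokens = CommonTokenStream(lexer)
parser = ExprParser(tokens)              # syntactic stage
tree = parser.prog()                     # parse starting from the assumed 'prog' rule
ParseTreeWalker().walk(MyListener(), tree)
```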

The job of the Formal Languages and Automata lecture is to go into more detail regarding how such generation tools work and on what principles they are built. These principles revolve around two categories of languages (in a wider sense than just programming languages), called regular and context-free.

Beyond Formal Languages

The reader is most likely a Computer Science undergraduate and will apply much of what they learn in this lecture to writing parsers and compilers. However, Formal Languages yield formal tools that are applicable in a wide range of areas. Much of what you will learn is a cornerstone of the more advanced topic of software verification, where techniques such as model checking use automata (extended to infinite runs) in order to ensure program safety. Formal Languages have numerous applications in natural language processing. Finally, Formal Languages are an important tool for studying computational complexity: classes of machines with different computational power identify classes of problems with different degrees of difficulty.

Throughout this lecture, we will focus mostly on case studies and scenarios pertaining to parsers, and only occasionally extend the discussion to other fields.