Languages and Regular Expressions

Motivation

Regular expressions are a means for defining tokens (roles) during the lexical phase. Token definition is instrumental in the development of parser generators (e.g. ANTLR).

Let $ \Sigma$ be an alphabet, i.e. a finite set whose elements we call symbols or characters. In parsing, the alphabet we work with is naturally the (possibly extended) set of ASCII symbols. During the lecture, we will often use simpler alphabets such as the binary alphabet or $ \{a,b,c\}$ , etc.

Regular expressions - tentative definition

Let $ \Sigma$ be an alphabet.

It may be convenient to view regular expressions as members of an Abstract Datatype (ADT), with each formation rule acting as a constructor.

For instance:
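A minimal sketch of such a datatype, written in Python. The constructor names (`Empty`, `Epsilon`, `Sym`, `Concat`, `Union`, `Kleene`) are assumptions chosen for illustration, and including a constructor for the empty language is likewise an assumption. The sketch builds two different syntax trees that the flat string $ 01^*\cup 1$ could stand for.

```python
from dataclasses import dataclass


class Regex:
    """Base class for regular-expression syntax trees (hypothetical names)."""


@dataclass(frozen=True)
class Empty(Regex):
    """The expression denoting the empty language (assumed constructor)."""


@dataclass(frozen=True)
class Epsilon(Regex):
    """The expression denoting the empty word."""


@dataclass(frozen=True)
class Sym(Regex):
    """A single symbol of the alphabet."""
    char: str


@dataclass(frozen=True)
class Concat(Regex):
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Union(Regex):
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Kleene(Regex):
    body: Regex


# Two distinct trees for the same flat string 01* U 1.
reading_a = Union(Kleene(Concat(Sym("0"), Sym("1"))), Sym("1"))  # ((01)*) U 1
reading_b = Union(Concat(Sym("0"), Kleene(Sym("1"))), Sym("1"))  # (0(1*)) U 1

print(reading_a == reading_b)
```

Because every expression is an explicit tree, two readings of the same string are simply two different values, and no precedence convention is needed to tell them apart.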

Precedence

Regular expressions such as $ 01^*\cup 1$ may be ambiguous. If the ADT notation were employed, e.g. $ union(kleene(concat(0,1)),1)$ , there would be no ambiguity. However, such a notation is cumbersome, and for this reason we prefer the following order of precedence for the construction rules, from highest to lowest: Kleene star, concatenation, union. Under this convention, $ 01^*\cup 1$ is read as $ (0(1^*))\cup 1$ .

We shall also use parentheses wherever necessary.

Write a regular expression capturing all sequences of alternating 0s and 1s.

One tentative solution would be $ (01)^*$ ; however, it is incomplete, as sequences such as $ 1010$ cannot be generated. A complete solution could be $ (01)^*\cup(10)^*\cup 0(10)^*\cup 1(01)^*$ , which can also be written as $ (1\cup \epsilon)(01)^*\cup(0\cup \epsilon)(10)^*$ . We may also refactor this regular expression to a simpler one: $ (1\cup \epsilon)(01)^*(0\cup \epsilon)$ .

Notice that several regular expressions may capture exactly the same sequences.
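A claim of this kind can be sanity-checked by brute force. The sketch below is an illustration, not a proof: it compares expressions only on words up to a fixed length. It uses Python's `re` module, where union is written `|` and $ (e\cup\epsilon)$ is written `e?`; the helper names `words` and `accepted` are hypothetical.

```python
import re
from itertools import product


def words(alphabet, max_len):
    """Yield every word over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def accepted(pattern, alphabet="01", max_len=8):
    """Set of words up to max_len that `pattern` matches in full."""
    rx = re.compile(pattern)
    return {w for w in words(alphabet, max_len) if rx.fullmatch(w)}


long_form = r"(01)*|(10)*|0(10)*|1(01)*"   # the four-way union
short_form = r"1?(01)*0?"                  # (1 U eps)(01)*(0 U eps)
incomplete = r"(01)*"                      # the first, tentative attempt

# Do the long and the short form agree on every word up to max_len?
same = accepted(long_form) == accepted(short_form)
print(same)

# Is 1010 among the words that the tentative attempt misses?
print("1010" in accepted(short_form) - accepted(incomplete))
```

Agreement on all words up to length 8 does not establish equality of the two languages, but a single disagreement would refute it, which makes the check a cheap first test when refactoring an expression.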

Write a regular expression capturing all sequences which do not contain adjacent ones.

One alternative is $ 0^*(100^*)^*(\epsilon\cup 1)$ , but $ (10\cup 0)^*(\epsilon\cup 1)$ is also a correct answer. Note that going from one regular expression to the other is not trivial in this particular case.
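When the intended language has a simple direct description, a candidate expression can be tested against that description instead of against another expression. The sketch below assumes the specification "the word does not contain 11 as a factor" and lists the words, up to a fixed length, on which a pattern and the specification disagree. The helper names are hypothetical. For contrast, it also runs $ 0^*(100^*)^*$ with no trailing $ (\epsilon\cup 1)$ , which as far as this bounded check can tell misses words ending in 1.

```python
import re
from itertools import product


def no_adjacent_ones(w):
    """Specification: the word does not contain the factor 11."""
    return "11" not in w


def counterexamples(pattern, spec, alphabet="01", max_len=10):
    """Words up to max_len on which `pattern` and `spec` disagree."""
    rx = re.compile(pattern)
    bad = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if bool(rx.fullmatch(w)) != spec(w):
                bad.append(w)
    return bad


candidate = r"(10|0)*1?"       # (10 U 0)*(eps U 1) in `re` syntax
without_tail = r"0*(100*)*"    # same idea, but no optional final 1

print(counterexamples(candidate, no_adjacent_ones))
print(counterexamples(without_tail, no_adjacent_ones)[:3])
```

An empty list only says that no disagreement was found up to the chosen length; it is evidence, not a proof of correctness.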

In the above examples, we looked at several words and checked whether they are accepted by a regular expression.

Languages

Formally, a language is a subset of $ \Sigma^*$ , that is, a possibly infinite set of words. Formal languages are a powerful instrument which finds uses beyond compilers:

- So far, we have seen that a formal language models a valid set of tokens (specified e.g. using Regular Expressions). Thus, the membership $ w\in L$ tells us that token $ w$ has indeed the role modelled by $ L$ .

- At the same time, formal languages are models for programming languages, and words - for programs. The membership $ w\in L$ models that program $ w$ is indeed a valid program of the programming language $ L$ . It remains to be seen if we can use Regular Expressions to express the constraints suitable for modern programming languages.

- Formal languages are models for natural language(s), and a great deal of interest in them came from linguistics.

- Finally, formal languages are models for decision problems. Recall that each word $ w$ can be viewed as describing a problem instance (e.g. a graph of the k-Vertex-Cover problem, together with a value k, or a CNF formula of the SAT problem). In this case, the membership $ w\in L$ models the fact that the answer to the problem instance $ w$ is yes. Thus $ L$ is the set of all yes-instances of a problem.

Also, it is interesting to have a picture of the space of languages. We already know that each language is an enumerable set (possibly finite). However, is the set of all languages itself enumerable?

Proposition:

The set $ 2^{\Sigma^*}$ of languages is not enumerable.

Proof:

Suppose the set of languages is enumerable, and let $ L_1, L_2, \ldots, L_n, \ldots$ and $ w_1, w_2, \ldots, w_n, \ldots$ be enumerations of the languages and of the words in $ \Sigma^*$ , respectively. We construct the following language:

$ L^* = \{w_i \mid w_i\not\in L_i\}$

Informally, we can obtain $ L^*$ by taking each word $ w_i$ from $ \Sigma^*$ and checking if $ w_i\in L_i$ . If this is so, then we ignore $ w_i$ and move on. Otherwise, we add $ w_i$ to $ L^*$ .

Since we assumed that $ 2^{\Sigma^*}$ is enumerable, $ L^*$ must be some language $ L_k$ in the enumeration. So, we select $ w_k$ (the word with the same index as $ L_k = L^*$ ), and we inquire whether:

$ w_k \in L_k$ :

  • if the answer is yes, then by definition of $ L^*$ , we have $ w_k \not \in L^*$ . Since $ L^* = L_k$ , this means $ w_k \not \in L_k$ . Contradiction.
  • if the answer is no, then by the very same definition, we have $ w_k \in L^*$ , that is, $ w_k \in L_k$ . Contradiction.

Hence the set of languages is not enumerable.
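The diagonal construction can be illustrated on a finite prefix of such an enumeration. The sketch below is only an illustration under stated assumptions: the five languages are hypothetical examples given as membership predicates, and the words are enumerated by length and then alphabetically. It builds the part of $ L^*$ determined by $ w_1, \ldots, w_5$ and checks that it disagrees with each $ L_i$ on the word $ w_i$ .

```python
from itertools import count, islice, product


def shortlex(alphabet):
    """Enumerate Sigma* as w_1, w_2, ...: by length, then alphabetically."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


# A hypothetical finite prefix L_1 .. L_5 of an enumeration of languages,
# each language given by its membership predicate.
languages = [
    lambda w: True,                # L_1: all words
    lambda w: False,               # L_2: the empty language
    lambda w: len(w) % 2 == 0,     # L_3: words of even length
    lambda w: "11" not in w,       # L_4: no adjacent ones
    lambda w: w.startswith("1"),   # L_5: words starting with 1
]

# w_1 .. w_5, one word per listed language.
words = list(islice(shortlex("01"), len(languages)))

# Diagonal language restricted to these words: w_i is in it iff w_i is not in L_i.
diagonal = {w for w, in_L in zip(words, languages) if not in_L(w)}

# For every i, the diagonal language and L_i disagree on w_i.
differs = [(w in diagonal) != in_L(w) for w, in_L in zip(words, languages)]
print(sorted(diagonal), all(differs))
```

However the list of languages is extended, the same construction yields a language that differs from the i-th entry on the i-th word, which is exactly why no enumeration can contain it.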

This very simple proof has powerful implications. For example, regular expressions are essentially words over some alphabet, hence they are enumerable. Our proof shows that there are strictly more languages than regular expressions: the regular expressions can be enumerated, while the languages cannot. Hence, there are languages which no regular expression can capture. Intuitively, such languages are harder to define and require a more complex apparatus. Many interesting questions arise from this observation:

Languages and Complexity Theory?

We shall address most of these questions throughout the lecture. But for now, we return to regular expressions.

Computing L(e) - a semantics for regular expressions

We easily notice that:

In order to define the aforementioned map, we introduce a few operations on languages. Let $ A,B\subseteq \Sigma^*$ be two languages:

Next, we can define rules for determining the language generated by a regular expression.
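A minimal sketch of such rules in Python, under the usual definitions of the language operations (stated here as an assumption): concatenation $ AB = \{uv \mid u\in A, v\in B\}$ , union as set union, and $ A^*$ as the union of all finite powers of $ A$ . Since $ L(e)$ may be infinite, the sketch computes only the words of length at most `n`. Expressions are encoded as nested tuples, and the names `lang`, `concat`, `star`, `sym`, `union`, `cat` and `kleene` are hypothetical. It is tried on the expression $ (A\cup B)(a\cup b)^*(0\cup 1)^*$ used as an example in this section.

```python
def concat(A, B, n):
    """AB = {uv | u in A, v in B}, keeping only words of length <= n."""
    return {u + v for u in A for v in B if len(u + v) <= n}


def star(A, n):
    """A* truncated to words of length <= n, built power by power."""
    result = {""}
    frontier = {""}
    while frontier:
        frontier = concat(frontier, A, n) - result   # only the new words
        result |= frontier
    return result


def lang(e, n):
    """Words of L(e) of length at most n; `e` is a nested tuple."""
    tag = e[0]
    if tag == "empty":
        return set()
    if tag == "eps":
        return {""}
    if tag == "sym":
        return {e[1]} if len(e[1]) <= n else set()
    if tag == "union":
        return lang(e[1], n) | lang(e[2], n)
    if tag == "cat":
        return concat(lang(e[1], n), lang(e[2], n), n)
    if tag == "star":
        return star(lang(e[1], n), n)
    raise ValueError(f"unknown constructor: {tag}")


# Small helpers for building expressions.
def sym(c): return ("sym", c)
def union(x, y): return ("union", x, y)
def cat(x, y): return ("cat", x, y)
def kleene(x): return ("star", x)


e = cat(cat(union(sym("A"), sym("B")), kleene(union(sym("a"), sym("b")))),
        kleene(union(sym("0"), sym("1"))))

print(sorted(lang(e, 2)))
```

Truncating at every step is safe because any word of length at most `n` in $ AB$ or $ A^*$ is built only from words of length at most `n`.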

Returning to our previous example, the language $ L((A\cup B)(a\cup b)^*(0 \cup 1)^*)$ is:

$ L(A\cup B)\,L((a\cup b)^*)\,L((0 \cup 1)^*)$ , i.e. $ \{A,B\}\{a,b\}^*\{0,1\}^*$ .

Properties of languages

$ 2^{\Sigma^*}$ is the set of languages over $ \Sigma$ . Let $ E$ be the set of regular expressions. We have defined a semantics for regular expressions, i.e. the map $ L : E \rightarrow 2^{\Sigma^*}$ .

This map is powerful because it shows that we can assign a finite representation (the regular expression) to an infinite object (the language).

An interesting question is whether for each language there exists a regular expression which describes it. We shall investigate this question in detail in a later lecture.

References

- Algorithms & Complexity Theory
