Languages and Regular Expressions

Motivation

Regular expressions are a means for defining tokens (roles) during the lexical phase. Token definition is instrumental in the development of parser generators (e.g. ANTLR).

Let $ \Sigma$ be an alphabet, i.e. a finite set whose elements we call symbols or characters. In parsing, the alphabet we work with is naturally the (possibly extended) set of ASCII symbols. During the lecture, we will often use simpler alphabets such as the binary alphabet or $ \{a,b,c\}$ , etc.

Regular expressions - tentative definition

Let $ \Sigma$ be an alphabet.

  • $ \emptyset$ is a regular expression (reg. exp. for short).
  • any $ c\in\Sigma$ is a regular expression.
  • if $ e$ and $ e'$ are regular expressions then:
    • $ (ee')$ or simply $ ee'$ (concatenation) is a regular expression
    • $ (e\cup e')$ (union) is a regular expression
  • if $ e$ is a regular expression then $ (e)^*$ (Kleene Star) is a regular expression

It may be convenient to view regular expressions as members of an Abstract Datatype, and each formation rule, as a constructor rule.

For instance:

  • $ (A\cup B)(a\cup b)^*(0 \cup 1)^*$ may be used to declare variables, which must be any string starting with A or B, followed by a sequence of a's and b's of any length (including 0), and followed by a sequence of 0's and 1's of any length.
    • Thus, Abbb1 is a correct variable definition
    • Ba is also a correct variable definition (a matches $ (a \cup b)^*$ , while the empty string matches $ (0 \cup 1)^*$ )
    • aaa01 is invalid since it does not start with any of A and B.
    • Aa01a is also invalid since after the digit sequence symbols such as a or b are not allowed.
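
The four words above can be checked mechanically. The following is a minimal sketch using Python's standard `re` module. It assumes the usual translation of the notation into `re` syntax (union becomes `|`), and the names `VARIABLE` and `is_variable` are introduced here only for illustration.

```python
import re

# Translation of (A u B)(a u b)*(0 u 1)* into Python `re` syntax:
# union is written `|`, concatenation and star are unchanged.
VARIABLE = re.compile(r"(A|B)(a|b)*(0|1)*")

def is_variable(word: str) -> bool:
    """Return True if the whole word (not just a prefix) matches."""
    return VARIABLE.fullmatch(word) is not None

for word in ["Abbb1", "Ba", "aaa01", "Aa01a"]:
    print(word, is_variable(word))
```

Note that `fullmatch` is used rather than `match`, since a word is accepted only if the expression accounts for all of its symbols.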

Precedence

Regular expressions such as $ 01^*\cup 1$ may be ambiguous: without further conventions, this one could be read as $ (01)^*\cup 1$ or as $ 0(1^*)\cup 1$ . If ADT notation were used, e.g. $ union(kleene(concat(0,1)),1)$ for the first reading, there would be no ambiguity. However, such notation is cumbersome, so we adopt the following order of precedence for the construction rules, from highest to lowest:

  • Kleene Star
  • concatenation
  • union

We shall also use parentheses wherever necessary.
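
To make the Abstract Datatype view concrete, here is a minimal sketch in Python, with one class per formation rule. The class names (`Empty`, `Sym`, `Concat`, `Union`, `Star`) and the helper `show` are assumptions made for this illustration and are not part of the lecture's notation. The sketch builds both readings of $ 01^*\cup 1$ as distinct values, and prints them fully parenthesized.

```python
from dataclasses import dataclass

class RegExp:
    """Base class for the constructors of the regular-expression datatype."""

@dataclass(frozen=True)
class Empty(RegExp):      # the expression for the empty language
    pass

@dataclass(frozen=True)
class Sym(RegExp):        # a single symbol c of the alphabet
    c: str

@dataclass(frozen=True)
class Concat(RegExp):
    left: RegExp
    right: RegExp

@dataclass(frozen=True)
class Union(RegExp):
    left: RegExp
    right: RegExp

@dataclass(frozen=True)
class Star(RegExp):
    inner: RegExp

def show(e: RegExp) -> str:
    """Render an expression fully parenthesized, so no precedence is needed."""
    if isinstance(e, Empty):
        return "{}"
    if isinstance(e, Sym):
        return e.c
    if isinstance(e, Concat):
        return f"({show(e.left)}{show(e.right)})"
    if isinstance(e, Union):
        return f"({show(e.left)}|{show(e.right)})"
    if isinstance(e, Star):
        return f"({show(e.inner)})*"
    raise TypeError(f"not a regular expression: {e!r}")

# Reading under the precedence rules: star, then concatenation, then union.
by_precedence = Union(Concat(Sym("0"), Star(Sym("1"))), Sym("1"))
# The other reading, which must be written with parentheses: (01)* u 1.
other_reading = Union(Star(Concat(Sym("0"), Sym("1"))), Sym("1"))

print(show(by_precedence))
print(show(other_reading))
```

The two values are different trees, which is exactly why the flat notation needs either parentheses or a precedence convention.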

Write a regular expression capturing all sequences of alternating 0s and 1s.

One tentative solution would be $ (01)^*$ ; however, it is incomplete, as sequences such as $ 1010$ cannot be generated. A complete solution could be $ (01)^*\cup(10)^*\cup 0(10)^*\cup 1(01)^*$ , which can also be written as: $ (1\cup \epsilon)(01)^*\cup(0\cup \epsilon)(10)^*$ , where $ \epsilon$ denotes the empty word. Also, we may refactor this regular expression to a simpler one: $ (1\cup \epsilon)(01)^*(0\cup \epsilon)$ .

Notice that several regular expressions may capture exactly the same sequences.
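
A claim of this kind can be tested by brute force on short words, although not proved. The sketch below assumes a translation into Python `re` syntax, where `x?` stands for $ (x\cup\epsilon)$ . It compares the four-case expression and the compact expression $ (1\cup \epsilon)(01)^*(0\cup \epsilon)$ against a direct definition of "alternating" on all binary words up to a chosen length. The helper names are hypothetical.

```python
import re
from itertools import product

def words(alphabet: str, max_len: int):
    """Yield every word over the alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def alternating(w: str) -> bool:
    """True if no two neighbouring symbols are equal."""
    return all(a != b for a, b in zip(w, w[1:]))

# `re` translations of the expressions; `x?` means (x u epsilon).
FOUR_CASES = re.compile(r"(01)*|(10)*|0(10)*|1(01)*")
COMPACT = re.compile(r"1?(01)*0?")

def agree(max_len: int = 10) -> bool:
    """Check that both expressions accept exactly the alternating words."""
    return all(
        (FOUR_CASES.fullmatch(w) is not None) == alternating(w)
        and (COMPACT.fullmatch(w) is not None) == alternating(w)
        for w in words("01", max_len)
    )

print(agree())
```

Agreement up to a bounded length is evidence, not a proof; the equivalence itself needs an argument about all word lengths.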

Write a regular expression capturing all sequences which do not contain adjacent ones.

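One alternative is $ 0^*(100^*)^*(\epsilon\cup 1)$ , but $ (10\cup 0)^*(\epsilon\cup 1)$ is also a correct answer. The final factor $ (\epsilon\cup 1)$ is needed so that words ending in 1, such as $ 01$ , are captured. Note that going from one regular expression to the other is not trivial in this particular case.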
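
Since transforming one expression into the other by hand is not obvious, a bounded search for counterexamples is a useful sanity check. The sketch below is an illustration under the same `re` translation assumptions as before; both patterns include a trailing optional 1 so that words ending in 1 are covered. It lists every binary word of length at most 10 on which the two patterns and the direct condition ("11" does not occur) disagree.

```python
import re
from itertools import product

# `re` translations; `1?` stands for (epsilon u 1).
FIRST = re.compile(r"0*(100*)*1?")
SECOND = re.compile(r"(10|0)*1?")

def no_adjacent_ones(w: str) -> bool:
    """Direct statement of the requirement."""
    return "11" not in w

# Collect every short word on which the three tests do not all agree.
mismatches = [
    w
    for n in range(11)
    for w in map("".join, product("01", repeat=n))
    if not (
        (FIRST.fullmatch(w) is not None)
        == (SECOND.fullmatch(w) is not None)
        == no_adjacent_ones(w)
    )
]

print(mismatches)
```

An empty list means no disagreement was found among the words examined.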

In the above examples, we looked at several words and saw if they are accepted by a regular expression.

  • A word $ w$ (over $ \Sigma$ ) is a possibly 0-length sequence of symbols ($ w\in\Sigma^*$ ). See the Algorithms and Complexity Theory lecture [1].

Formally, a language is a subset of $ \Sigma^*$ , that is, a possibly infinite set of words. Formal languages are a powerful instrument which finds usage beyond compilers:

- So far, we have seen that a formal language models a valid set of tokens (specified e.g. using Regular Expressions). Thus, the membership $ w\in L$ tells us that token $ w$ has indeed the role modelled by $ L$ .

- At the same time, formal languages are models for programming languages, and words - for programs. The membership $ w\in L$ models that program $ w$ is indeed a valid program of the programming language $ L$ . It remains to be seen if we can use Regular Expressions to express the constraints suitable for modern programming languages.

- Formal languages are models for natural language(s), and a great deal of interest into them came from linguistics.

- Finally, formal languages are models for decision problems. Recall that each word $ w$ can be viewed as describing a problem instance (e.g. a graph of the k-Vertex-Cover problem, together with a value k, or a CNF formula of the SAT problem). In this case, the membership $ w\in L$ models the fact that the answer to the problem instance $ w$ is yes. Thus $ L$ is the set of all yes-instances of a problem.

It is also interesting to have a picture of the space of languages. We already know that each language is an enumerable set (possibly finite), being a subset of the enumerable set $ \Sigma^*$ . However, is the set of all languages enumerable?

Proposition:

The set $ 2^{\Sigma^*}$ of languages is not enumerable.

Proof:

Suppose the set of languages is enumerable, and consider $ L_1, L_2, \ldots, L_n, \ldots$ and $ w_1, w_2, \ldots, w_n, \ldots$ the enumeration of languages, and words, respectively. We construct the following language:

$ L^* = \{w_i \mid w_i\not\in L_i\}$

Informally, we can obtain $ L^*$ by taking each word $ w_i$ from $ \Sigma^*$ and checking if $ w_i\in L_i$ . If this is so, then we ignore $ w_i$ and move on. Otherwise - we add $ w_i$ to $ L^*$ .

Since we assumed that $ 2^{\Sigma^*}$ is enumerable, $ L^*$ must be some language $ L_k$ in the enumeration. So, we select $ w_k$ (the word with the same index $ k$ as $ L_k = L^*$ ), and we inquire whether:

$ w_k \in L_k$ :

  • if the answer is yes, then by definition of $ L^*$ , we have $ w_k \not \in L^* = L_k$ . Contradiction.
  • if the answer is no, by the very same definition, we have $ w_k \in L^* = L_k$ . Contradiction.

Hence the set of languages is not enumerable.
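
The diagonal construction can be illustrated on a finite prefix of an enumeration. The sketch below is only an illustration: the four languages are hypothetical examples chosen here, each given as a membership test, and words are enumerated by length and then lexicographically. It builds the part of $ L^*$ determined by $ w_1, \ldots, w_4$ and checks that it disagrees with each $ L_i$ on the word $ w_i$ .

```python
from itertools import count, islice, product

def enumerate_words(alphabet: str):
    """Enumerate all words: shorter first, then in lexicographic order."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

# A hypothetical start of an enumeration L_1, L_2, L_3, L_4,
# each language represented by its membership predicate.
languages = [
    lambda w: True,              # all words
    lambda w: len(w) % 2 == 0,   # words of even length
    lambda w: "1" in w,          # words containing a 1
    lambda w: False,             # the empty language
]

# w_1, ..., w_4 over the binary alphabet.
words = list(islice(enumerate_words("01"), len(languages)))

# Keep w_i exactly when w_i is not in L_i.
diagonal = {w for L, w in zip(languages, words) if not L(w)}

# The diagonal language differs from every L_i at the word w_i.
differs = [(w in diagonal) != L(w) for L, w in zip(languages, words)]

print(words, diagonal, differs)
```

Whatever languages are listed, the same construction yields a language that disagrees with the i-th one on the i-th word, which is the heart of the proof.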

This very simple proof has powerful implications. For example, regular expressions are themselves words over some alphabet, hence the set of regular expressions is enumerable, while the set of languages is not. Consequently, there are languages which no regular expression can capture. Such languages are harder to define and require a more complex apparatus. Several interesting questions arise from this observation:

  • can we afford to define any kind of programming language, without affecting the parsing process?
  • can we build parsers for natural language?
  • what is the relationship between Languages and Complexity Theory?

We shall address most of these questions throughout the lecture. But for now, we return to regular expressions.

Computing L(e) - a semantics for regular expressions

We easily notice that:

  • a regular expression $ e$ uniquely identifies the language $ L(e)$ containing the set of words which are accepted by $ e$ .
  • therefore there is a map which assigns to each regular expression $ e$ , the language $ L(e)$

In order to define the aforementioned map, we introduce a few operations on languages. Let $ A,B\subseteq \Sigma^*$ be two languages:

  • concatenation: the language $ AB$ is defined as $ \{ww' \mid w \in A, w'\in B\}$ - i.e. the set of words consisting of a word from $ A$ followed by a word from $ B$ .
  • union: the language $ A\cup B$ is simply the union of the two languages, that is, the set of words which belong to $ A$ or to $ B$ .
  • kleene-star: the language $ A^*$ is the set $ \{w \in\Sigma^* \mid w = w_1\ldots w_n, n\geq 0, w_1, \ldots, w_n \in A\}$ of concatenations of zero or more words from $ A$ . In particular, $ A^*$ always contains the empty word (the case $ n=0$ ).
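
For finite languages, these operations can be written directly on Python sets of strings. The sketch below is an illustration with hypothetical function names. Since $ A^*$ is in general infinite, `star` only computes the words of $ A^*$ up to a given length `max_len`, which is an assumption made to keep the result finite.

```python
def concat(A: set, B: set) -> set:
    """All words ww' with w in A and w' in B."""
    return {w + v for w in A for v in B}

def union(A: set, B: set) -> set:
    """All words that belong to A or to B."""
    return A | B

def star(A: set, max_len: int) -> set:
    """The words of A* whose length is at most max_len."""
    result = {""}        # n = 0: the empty word is always in A*
    frontier = {""}      # words discovered in the previous round
    while frontier:
        # Extend each newly found word by one more word from A.
        frontier = {
            w + v for w in frontier for v in A if len(w + v) <= max_len
        } - result
        result |= frontier
    return result

print(concat({"a", "ab"}, {"b", ""}))
print(union({"a"}, {"b"}))
print(sorted(star({"ab"}, 4)))
```

The loop in `star` stops because only finitely many words of bounded length exist, and each round keeps only words not seen before.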

Next, we can define rules for determining the language generated by a regular expression.

  • the language generated by the regular expression $ \emptyset$ is simply the empty language $ \emptyset$ .
  • the language generated by the regular expression $ c$ , for $ c\in\Sigma$ is $ L(c)=\{c\}$ (a single-word set).
  • $ L(e^*)$ is the language $ (L(e))^*$ , i.e. the kleene star of the language $ L(e)$ .
  • $ L(ee')$ is $ L(e)L(e')$
  • $ L(e\cup e')$ is $ L(e) \cup L(e')$
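
The rules above translate into a recursive function, one case per rule. The following is a minimal sketch under two assumptions made here for illustration: expressions are encoded as nested tuples tagged `"empty"`, `"sym"`, `"cat"`, `"union"` and `"star"`, and the function `lang(e, n)` returns only the words of $ L(e)$ of length at most `n`, since $ L(e)$ may be infinite.

```python
def lang(e, n):
    """Return the words of L(e) whose length is at most n."""
    tag = e[0]
    if tag == "empty":                      # L(empty) is the empty language
        return set()
    if tag == "sym":                        # L(c) = {c}
        return {e[1]} if n >= 1 else set()
    if tag == "cat":                        # L(ee') = L(e)L(e')
        return {
            w + v
            for w in lang(e[1], n)
            for v in lang(e[2], n)
            if len(w + v) <= n
        }
    if tag == "union":                      # L(e u e') = L(e) u L(e')
        return lang(e[1], n) | lang(e[2], n)
    if tag == "star":                       # L(e*) = (L(e))*
        base = lang(e[1], n)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {
                w + v for w in frontier for v in base if len(w + v) <= n
            } - result
            result |= frontier
        return result
    raise ValueError(f"unknown constructor: {tag!r}")

def sym(c):
    return ("sym", c)

# (A u B)(a u b)*(0 u 1)*
AB = ("union", sym("A"), sym("B"))
ab = ("union", sym("a"), sym("b"))
zo = ("union", sym("0"), sym("1"))
variable = ("cat", ("cat", AB, ("star", ab)), ("star", zo))

print(sorted(lang(variable, 2)))
```

Restricting to bounded length is safe here because every word of length at most `n` in a concatenation or star is built from pieces of length at most `n`.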

Returning to our previous example, the language $ L((A\cup B)(a\cup b)^*(0 \cup 1)^*)$ is:

  • $ L(A\cup B) L((a\cup b)^*) L((0 \cup 1)^*)$ , i.e.
  • $ L(A\cup B) (L(a\cup b))^* (L(0 \cup 1))^*$ , i.e.
  • $ (L(A)\cup L(B)) (L(a) \cup L(b))^* (L(0) \cup L(1))^*$ , i.e.
  • $ (\{A\}\cup \{B\}) (\{a\} \cup \{b\})^* (\{0\} \cup \{1\})^*$

Properties of languages

$ 2^{\Sigma^*}$ is the set of languages over $ \Sigma$ . Let $ E$ be the set of regular expressions. We have defined a semantics for regular expressions, i.e. the map:

  • $ E \rightarrow 2^{\Sigma^*}$

This map is powerful because it shows that we can assign a finite representation (the regular expression) to an infinite object (the language).

An interesting question is which languages can be described by a regular expression. The enumerability argument given earlier already shows that not every language can. We shall investigate this question in detail in a later lecture.

