Regular languages

In the previous lectures, we have introduced regular expressions, NFAs and DFAs as finite representations for languages, and shown the following links between them.

  • $ E \rightarrow NA$ - each regular expression $ e$ can be transformed to an NFA $ M$ such that $ L(e) = L(M)$.
  • $ NA \rightarrow DA$ - each NFA $ M$ can be transformed to a DFA $ M'$ such that $ L(M) = L(M')$ .

We introduce the following:

  • A language $ L$ is called regular if it can be generated by a regular expression, i.e. $ L=L(E)$ for some regular expression $ E$. Denote by $ LR\subset 2^{\Sigma^*}$ the set of regular languages,
  • Denote by $ L(NFA)\subset 2^{\Sigma^*}$ , the set of languages which can be accepted by NFAs, and
  • Denote by $ L(DFA)\subset 2^{\Sigma^*}$ , the set of languages which can be accepted by DFAs.

It is both formally and practically important to understand the limits of regular expressions and automata (of different types) in capturing languages.

We have already seen that the set of regular expressions is countable while the set of languages is not. Can automata capture more languages than regular expressions? Our lecture has so far proven the following:

  • $ LR \subseteq L(NFA) \subseteq L(DFA)$

To this we add the following observation:

  • $ L(DFA) \subseteq L(NFA)$ . If a language can be accepted by a DFA, it also can (trivially) be accepted by an NFA, since the latter automata extend the former.

Therefore, we have shown that NFAs and DFAs accept the same languages, i.e. $ L(NFA) = L(DFA)$ . In other words, if a language $ L$ is accepted by some DFA $ M$ ($ L=L(M)$ ), then it can also be accepted by some NFA, and vice-versa.
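The trivial inclusion $ L(DFA) \subseteq L(NFA)$ can be sketched in a few lines of Python. The encodings below (dictionaries for transition tables, the function names `dfa_as_nfa` and `nfa_accepts`, and the example automaton) are assumptions made for this illustration only; they are not notation from the lecture.

```python
from typing import Dict, Set, Tuple

# Hypothetical encodings, chosen only for this sketch:
# a DFA transition table maps (state, symbol) to exactly one state,
# an NFA transition table maps (state, symbol) to a set of states.
DfaDelta = Dict[Tuple[int, str], int]
NfaDelta = Dict[Tuple[int, str], Set[int]]


def dfa_as_nfa(delta: DfaDelta) -> NfaDelta:
    """View a DFA as an NFA by wrapping every target state in a singleton set."""
    return {key: {target} for key, target in delta.items()}


def nfa_accepts(delta: NfaDelta, start: int, finals: Set[int], word: str) -> bool:
    """Simulate an NFA without epsilon-transitions by tracking the set of current states."""
    current = {start}
    for symbol in word:
        current = {t for s in current for t in delta.get((s, symbol), set())}
    return bool(current & finals)


# Example DFA over {0, 1}: state 2 is reached after an odd number of 1s.
odd_ones = {(1, "0"): 1, (1, "1"): 2, (2, "0"): 2, (2, "1"): 1}
nfa = dfa_as_nfa(odd_ones)
```

Since every set of current states stays a singleton, the NFA simulation follows exactly the single run of the original DFA.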

It remains to establish the relationship between $ LR$ and $ L(DFA)$ (or equivalently $ L(NFA)$ ).

Theorem:

Let $ M$ be a DFA. There exists a regular expression $ E$ such that $ L(E)=L(M)$.

To prove the theorem, we rely on:

  • a naming scheme for states. We assume a state $ q_i$ is identified by its index $ i$. The indexes in our proof start with 1, hence $ 1$ is the initial state. How the states are ordered and their kind (final/nonfinal) are unimportant; however, we use the same ordering throughout the proof;
  • a naming scheme for partial regular expressions. We write $ R_{ij}^{(k)}$ for the regular expression whose language is the set of words that label a path from state $ i$ to state $ j$, where the intermediate states of the path (all states visited except the endpoints $ i$ and $ j$) have index at most $ k$.

We prove the following:

Proposition:

Given a DFA $ M$ with $ n$ states, for all states $ i,j$ of $ M$ and every $ k$ with $ 0 \leq k \leq n$, there exists a regular expression $ R_{ij}^{(k)}$ which satisfies the above conditions.

Proof:

The proof is by induction over $ k$. Basis case: $ k=0$. Since no intermediate states are allowed, a path from $ i$ to $ j$ with $ i\neq j$ must consist of exactly one transition:

  • $ R^{(0)}_{ij} = \emptyset$ if no transition exists from $ i$ to $ j$
  • $ R^{(0)}_{ij} = c_1 \cup \ldots \cup c_m$ if one or more transitions exist from $ i$ to $ j$, on symbols $ c_1$ to $ c_m$.

If $ i = j$ , then:

  • a path from $ i$ to $ i$ may contain zero transitions, which contributes $ \epsilon$
  • a path from $ i$ to $ i$ may contain one transition (a self-loop), and the construction follows the above rules, yielding some regular expression $ E_0$.

We combine the two situations in a single one: $ R^{(0)}_{ii} = \epsilon \cup E_0$ , where $ E_0$ is constructed as above.

Induction step: By the induction hypothesis, we assume there exist regular expressions $ R^{(k-1)}_{ij}$ that satisfy our designated constraints in $ M$.

We build $ R^{(k)}_{ij}$ for each possible pair of states $ i,j$ in $ M$. There are two kinds of paths to consider:

  1. a path from $ i$ to $ j$ may visit only intermediate states whose index is smaller than $ k$. Such paths are already described by $ R^{(k-1)}_{ij}$
  2. a path from $ i$ to $ j$ passes through $ k$ one or more times. Such a path can be decomposed into the following parts:
    • a path from $ i$ to $ k$ which only visits states $ <k$ , identified by the regular expression $ R_{ik}^{(k-1)}$
    • zero or more paths from $ k$ to $ k$ which only visit states $ <k$ , each identified by: $ R_{kk}^{(k-1)}$
    • a path from $ k$ to $ j$ which only visits states $ <k$ , identified by: $ R_{kj}^{(k-1)}$ .

The induction hypothesis guarantees that all regular expressions involved in the above construction can be properly built. Hence, we assemble $ R_{ij}^{(k)}$ by combining the two aforementioned cases:

$ \displaystyle R_{ij}^{(k)} = R_{ij}^{(k-1)} \cup R_{ik}^{(k-1)}(R_{kk}^{(k-1)})^*R_{kj}^{(k-1)}$
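As a small illustration of the induction step, take a hypothetical two-state DFA (an assumption of this example, not part of the lecture) over $ \Sigma=\{0,1\}$ in which reading $ 0$ keeps the current state and reading $ 1$ switches between states $ 1$ and $ 2$. Its basis expressions are $ R^{(0)}_{11} = R^{(0)}_{22} = \epsilon \cup 0$ and $ R^{(0)}_{12} = R^{(0)}_{21} = 1$. For $ k=1$, $ i=1$ and $ j=2$, the formula gives:

$ \displaystyle R_{12}^{(1)} = R_{12}^{(0)} \cup R_{11}^{(0)}(R_{11}^{(0)})^*R_{12}^{(0)} = 1 \cup (\epsilon \cup 0)(\epsilon \cup 0)^*1$

which denotes the same language as $ 0^*1$: any number of $ 0$s read while staying in state $ 1$, followed by a single $ 1$.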

The proof of our theorem consists in building the regular expression:

$ \displaystyle E = \bigcup_{i\in F}R_{1i}^{(n)}$

where $ n$ is the total number of states in $ M$ and $ F$ is the set of final states of $ M$. According to our proposition, $ E$ describes all paths that start in the initial state, end in a final state, and may visit any state of $ M$ along the way.
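The whole construction can be sketched in Python, under a few assumptions that are ours and not the lecture's: states are numbered $ 1..n$ with $ 1$ initial, the transition function is a dictionary, regular expressions are written as strings for Python's `re` module, `None` stands for $ \emptyset$ and the empty string for $ \epsilon$. The names `dfa_to_regex`, `union`, `concat`, `star` and `matches` are hypothetical helpers. The resulting expression is correct but not simplified, and its size grows quickly with $ n$.

```python
import re
from typing import Dict, Optional, Set, Tuple

# None encodes the empty language, "" encodes epsilon (assumption of this sketch).
Regex = Optional[str]


def union(a: Regex, b: Regex) -> Regex:
    if a is None or a == b:
        return b
    if b is None:
        return a
    return f"(?:{a}|{b})"


def concat(a: Regex, b: Regex) -> Regex:
    # Concatenation with the empty language is the empty language.
    return None if a is None or b is None else f"{a}{b}"


def star(a: Regex) -> Regex:
    # The star of the empty language and of epsilon is epsilon.
    return "" if a is None or a == "" else f"(?:{a})*"


def dfa_to_regex(n: int, delta: Dict[Tuple[int, str], int], finals: Set[int]) -> Regex:
    states = range(1, n + 1)
    # Basis (k = 0): epsilon on the diagonal, plus one symbol per direct transition.
    R = {(i, j): ("" if i == j else None) for i in states for j in states}
    for (i, symbol), j in delta.items():
        R[i, j] = union(R[i, j], re.escape(symbol))
    # Induction step: allow state k as an additional intermediate state.
    for k in states:
        loop = star(R[k, k])
        R = {
            (i, j): union(R[i, j], concat(concat(R[i, k], loop), R[k, j]))
            for i in states
            for j in states
        }
    # E is the union of R^(n)_{1f} over all final states f.
    result: Regex = None
    for f in sorted(finals):
        result = union(result, R[1, f])
    return result


def matches(pattern: Regex, word: str) -> bool:
    return pattern is not None and re.fullmatch(pattern, word) is not None


# Example DFA over {0, 1}: state 2 is reached after an odd number of 1s.
odd_ones = {(1, "0"): 1, (1, "1"): 2, (2, "0"): 2, (2, "1"): 1}
pattern = dfa_to_regex(2, odd_ones, {2})
```

Only the table for the previous value of $ k$ is kept at each step, which mirrors the fact that $ R^{(k)}_{ij}$ depends only on expressions with superscript $ k-1$.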

We have completed an extensive investigation into languages defined via:

  • regular expressions
  • nondeterministic FA
  • deterministic FA

and established that these three instruments for defining languages are equivalent. An important observation is that languages in general support two kinds of definitions:

  • via generators: for instance, regular expressions are generators for regular languages. They describe how words of the given language can be built;
  • via acceptors; they are, informally, machines. NFAs and DFAs are acceptors for regular languages. They describe how words can be tested for membership in a given language.

Both generators and acceptors are useful when working with any particular class of languages.

We already know that a language is regular iff it can be defined via a regular expression, or an automaton of either kind. However, what intrinsic feature do regular languages capture?

  • although we shall explore this in more detail later, we can already state that words in a regular language exhibit 'regularities' which can be observed without being required to count, or to have some form of memory available. We shall return to this intuition.

Interesting questions regarding languages arise:

  1. When is a language regular?
  2. When is a language not regular?

We can answer question 1 by constructing a regular expression, NFA or DFA that captures the language. However, in practice, there are a few tools which serve this purpose better.

Although $ LR \subseteq L(DFA)$ has already been proven in the former two lectures, there is another way of establishing this, which has further applications. This second means is related to closure properties of languages.

Generally, a set $ A \subseteq X$ is closed under a transformation ($ T:X\rightarrow X$) or an operation ($ O:X\times X \rightarrow X$) iff performing the transformation/operation on members $ a$, $ b$ of $ A$ (i.e. $ T(a)$ or $ O(a,b)$) always yields an element of $ A$.

Here, the set at hand is $ L(DFA)$ , and the transformations are:

  • Kleene Star
  • complement

and the operations are:

  • union
  • concatenation
  • intersection

If $ L\in L(DFA)$, then $ L^*\in L(DFA)$ and also $ \overline{L}\in L(DFA)$. By $ \overline{L}$, we refer to the complement of the language $ L$ with respect to $ \Sigma^*$: $ \overline{L}=\Sigma^* \setminus L$.

If $ L_1,L_2\in L(DFA)$ then the languages $ L_1 \cup L_2$, $ L_1L_2$ (language concatenation) and $ L_1 \cap L_2$ are also members of $ L(DFA)$.
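Three of these closure properties can be illustrated directly on DFAs. The sketch below uses standard techniques (swapping final and non-final states for the complement, and the product construction for intersection and union); the lecture does not state which constructions it uses, so this is an assumption, as are the `DFA` record, the function names and the two example automata. Kleene star and concatenation are usually shown via NFAs or regular expressions and are not covered here.

```python
import itertools
from typing import Dict, FrozenSet, NamedTuple, Tuple


class DFA(NamedTuple):
    states: FrozenSet
    alphabet: FrozenSet[str]
    delta: Dict[Tuple[object, str], object]  # assumed total: defined for every (state, symbol)
    start: object
    finals: FrozenSet


def accepts(m: DFA, word: str) -> bool:
    state = m.start
    for symbol in word:
        state = m.delta[(state, symbol)]
    return state in m.finals


def complement(m: DFA) -> DFA:
    """Swap final and non-final states; correct only because delta is total."""
    return m._replace(finals=m.states - m.finals)


def product_dfa(a: DFA, b: DFA, intersect: bool) -> DFA:
    """Run a and b in lockstep on pairs of states (both must share one alphabet)."""
    states = frozenset(itertools.product(a.states, b.states))
    delta = {
        ((p, q), c): (a.delta[(p, c)], b.delta[(q, c)])
        for (p, q) in states
        for c in a.alphabet
    }
    if intersect:
        finals = frozenset((p, q) for (p, q) in states if p in a.finals and q in b.finals)
    else:
        finals = frozenset((p, q) for (p, q) in states if p in a.finals or q in b.finals)
    return DFA(states, a.alphabet, delta, (a.start, b.start), finals)


sigma = frozenset("01")
# Words with an odd number of 1s.
odd_ones = DFA(frozenset({1, 2}), sigma,
               {(1, "0"): 1, (1, "1"): 2, (2, "0"): 2, (2, "1"): 1}, 1, frozenset({2}))
# Words whose last symbol is 0.
ends_in_zero = DFA(frozenset({1, 2}), sigma,
                   {(1, "0"): 2, (1, "1"): 1, (2, "0"): 2, (2, "1"): 1}, 1, frozenset({2}))

both = product_dfa(odd_ones, ends_in_zero, intersect=True)
either = product_dfa(odd_ones, ends_in_zero, intersect=False)
even_ones = complement(odd_ones)
```

The product automaton has $ |Q_1| \cdot |Q_2|$ states, so the result is again a DFA, which is what closure under intersection and union requires.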