4. Regex Representation in Python

In this laboratory we use Python 3 classes to introduce one possible representation of regular expressions in code, together with ways of working with them.

We will start with a base class Regex that will function as an interface for future descendants:

class Regex:
    '''Base class for Regex ADT'''
 
    def __str__(self) -> str:
        '''Returns the string representation of the regular expression'''
        pass
 
    def gen(self) -> [str]:
        '''Return a representative set of strings that the regular expression can generate'''
        pass
 
    def eval_gen(self):
        '''Prints the set of strings that the regular expression can generate'''
        pass

We remark the following aspects:

There is no __init__(self) blueprint, so descendants of Regex can be instantiated with Descendant() unless we explicitly override this default behaviour.

As mentioned in the [second laboratory](https://ocw.cs.pub.ro/ppcarte/doku.php?id=lfa:2024:lab02), the correspondences with Java are:

  • self $ \rightarrow $ this
  • __init__ $ \rightarrow $ Class Constructor
  • __str__(self) $ \rightarrow $ toString()

We have structures of type '''comment''' (docstrings). This is the preferred way of documenting classes and functions in Python, since editors show these comments in a help box when you hover over the documented name.

The base class declares the interface for the methods: __str__(self) -> str, gen(self) -> [str], eval_gen(self).

The keyword pass lets us write empty function and class bodies; without it, the Python 3 interpreter rejects an empty body with a syntax error.

Although Python is dynamically typed, we still encourage you to write the types for parameters and outputs explicitly, as they contribute to documenting the code.

Furthermore, when you write Python code in interviews, employers usually look for type annotations and grade you accordingly.
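
The remarks above can be checked directly in an interpreter. The sketch below uses a hypothetical descendant Demo (not part of regex.py) to show that a class without __init__ is instantiated with no arguments, that a method whose body is only pass returns None, and that docstrings and type hints are stored on the function object without being enforced. It writes list[str] (Python 3.9+) where the skeleton writes [str]; the interpreter accepts both, but list[str] is the form that static type checkers such as mypy expect.

```python
class Regex:
    '''Base class for Regex ADT'''

    def __str__(self) -> str:
        '''Returns the string representation of the regular expression'''
        pass

    def gen(self) -> list[str]:
        '''Return a representative set of strings that the regular expression can generate'''
        pass


class Demo(Regex):
    '''Hypothetical descendant used only for this illustration'''
    pass


d = Demo()                           # no __init__ defined, so no arguments are needed
result = d.gen()                     # a body made only of `pass` returns None
doc = Regex.gen.__doc__              # the docstring is the text editors show on hover
hints = Regex.gen.__annotations__    # type hints are stored, not enforced

print(result)
print(doc)
print(hints)
```

Note that the sketch does not call str(d): a __str__ that returns None raises a TypeError, which is one reason every descendant has to override it.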

Your Task

Your task is to implement the missing pieces of code in the regex.py file, following the comments in the TODOs:

# This is an auxiliary method used to sort our generated strings by length, then alphabetically
# E.g: (b | aa)* = ['', b, aa, bb, aab, baa, bbb, aaaa, ...]
def convert_to_sorted_set(lst: [str]) -> [str]:
    '''Converts a list to a sorted set'''
    crt_lst = list(set(lst))
    return sorted(crt_lst, key = lambda x: (len(x), x))
 
# Global variable used to threshold the number of Star items
INNER_STAR_NO_ITEMS = 3
 
class Regex:
    '''Base class for Regex ADT'''
 
    def __str__(self) -> str:
        '''Returns the string representation of the regular expression'''
        pass
 
    def gen(self) -> [str]:
        '''Return a representative set of strings that the regular expression can generate'''
        pass
 
    def eval_gen(self):
        '''Prints the set of strings that the regular expression can generate'''
        pass
 
# TODO 0: Implement the Void class following the blueprint above
class Void(Regex):
    '''Represents the empty regular expression'''
 
    # Returns Void as a string
    def __str__(self) -> str:
        pass
 
    # Generates the Python object corresponding to the Void class
    def gen(self) -> [str]:
        pass
 
    def eval_gen(self):
        print("Void generates nothing")
 
 
# TODO 1: Implement the Epsilon-String (Empty) class following the blueprint above
class Empty(Regex):
    '''Represents the empty string regular expression'''
 
    # Returns Empty as a string
    def __str__(self) -> str:
        pass
 
    # Returns the list of strings produced by the empty string
    def gen(self) -> [str]:
        pass
 
    def eval_gen(self):
        print("Empty string generates ''.")
 
# TODO 2: Implement the required functionality for the Symbol class following the blueprint above
class Symbol(Regex):
    '''Represents a symbol in the regular expression'''
 
    def __init__(self, char: str):
        self.char = char
 
    # TODO 2: Complete the __str__ and gen methods for the Symbol class
    def __str__(self) -> str:
        pass
 
    def gen(self) -> [str]:
        pass
 
    def eval_gen(self):
        self.gen()
 
        # gen(self) should store the generated strings in the words attribute of the current class
        return f"Symbol {self.char} generates {self.words}"
 
# TODO 3: Implement the required functionality for the Union class following the blueprint above
class Union(Regex):
    '''Represents the union of two regular expressions'''
    def __init__(self, *arg: [Regex]):
        self.components = arg
 
    # TODO 3: Complete the __str__, gen and eval_gen methods for the Union class
    def __str__(self) -> str:
        pass
 
    def gen(self) -> [str]:
        pass
 
    # Following the Symbol model, the next eval_gen methods should implement
    # return f"Class {the __str__ form of the class} generates {self.words}"
    def eval_gen(self) -> str:
        pass
 
# TODO 4: Implement the required functionality for the Concat class following the blueprint above
class Concat(Regex):
    '''Represents the concatenation of two regular expressions'''
 
    def __init__(self, *arg : [Regex]):
        self.components = arg
 
    # TODO 4: Complete the __str__, gen and eval_gen methods for the Concat class
    def __str__(self) -> str:
        pass
 
    def gen(self) -> [str]:
        pass
 
    def eval_gen(self) -> str:
        pass
 
# TODO 5: Implement the required functionality for the Star class following the blueprint above
class Star(Regex):
    '''Represents the Kleene star (zero or more repetitions) of a regular expression'''
    def __init__(self, regex: Regex):
        self.regex = regex
        self.base_words = [""]
        self.base_gen()
        self.words = []
 
 
    # TODO 5: Complete the __str__, gen and eval_gen methods for the Star class
    def __str__(self):
        pass
 
    def base_gen(self):
        '''To memorize the base words generated by the regex inside the star, we store them in a list'''
        if type(self.regex) == Star:
            # Implement Star.gen with a limiting threshold
            pass
 
        # Implement Regex.gen for the rest
        pass
 
 
    def gen(self, no_items = 10):
        pass
 
    def eval_gen(self, no_items = 10):
        pass
 
 
if __name__ == "__main__":
 
    # Example usage
 
    r1 = Symbol('a')
    r2 = Symbol('b')
    regex_union = Union(r1, r2)
    regex_union.eval_gen()
 
    e2 = Union(Symbol('a'),Symbol('b'), Symbol('c'))
    e4 = Concat(r1, e2)
    e4.eval_gen()
 
    e5 = Concat(e2, e2, r2)
    e5.eval_gen()
 
    star_ex = Star(r1)
    star_ex.eval_gen()
 
    regex_concat = Concat(regex_union, star_ex)
    regex_concat.eval_gen()
 
    last_expr = Concat(Star(Union(Symbol('a'), Symbol('b'))), Symbol('b'), Star(Symbol('c')))
    last_expr.eval_gen()
 

The output should be similar to:

Union (a | b) generates ['a', 'b'].

Concat (a(a | b | c)) generates ['aa', 'ab', 'ac'].

Concat ((a | b | c)(a | b | c)b) generates ['aab', 'abb', 'acb', 'bab', 'bbb', 'bcb', 'cab', 'cbb', 'ccb'].

Star (a*) generates ['', 'a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaa', '...'].

Concat ((a | b)(a*)) generates ['a', 'b', 'aa', 'ba', 'aaa', 'baa', 'aaaa', 'baaa'].

Concat (((a | b)*)b(c*)) generates ['b', 'ab', 'bb', 'bc', 'aab', 'abb', 'abc', 'bab', 'bbb', 'bbc', 'bcc', 'aaab', 'aabb', 'aabc', 'abab', 'abbb', 'abbc', 'abcc', 'baab', 'babb', 'babc', 'bbab', 'bbbb', 'bbbc', 'bbcc', 'bccc', 'aaaab', 'aaabb', '...'].
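
The order of the words in these outputs follows from the convert_to_sorted_set helper in regex.py: duplicates are dropped, then the words are sorted by length and, for equal lengths, alphabetically. A small check that reuses the helper as given, with sample inputs chosen only for illustration (the annotations are written as list[str], which assumes Python 3.9+):

```python
def convert_to_sorted_set(lst: list[str]) -> list[str]:
    '''Converts a list to a sorted set (helper copied from regex.py)'''
    crt_lst = list(set(lst))
    return sorted(crt_lst, key=lambda x: (len(x), x))


sample = ["bb", "aa", "b", "aab", "b", ""]   # illustrative input with a duplicate
ordered = convert_to_sorted_set(sample)
print(ordered)   # ['', 'b', 'aa', 'bb', 'aab']
```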

Implementation Details

We have to take the precedence rules into account and implement each possibility accordingly:

For example, what is the outer class of a(a|b|c)?

Concat

How can we write this Regex using our classes?

Concat(Symbol('a'), Union(Symbol('a'), Symbol('b'), Symbol('c')))

Note that the Concat and Union constructors use the following code structure, which takes a variable number of parameters as input:
    def __init__(self, *arg : [Regex]):
        self.components = arg

This design choice is meant to ease your work when representing more complex regexes. The alternative would be to use binary operations and fuse identical nested ones together: (a|b|c) = Union(Symbol('a'), Union(Symbol('b'), Symbol('c'))) = (a|(b|c))
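
To see what the variadic constructor buys us, the sketch below compares the flat and the nested binary encodings. Node and flatten are hypothetical names introduced only for this illustration; they are not part of regex.py, and plain characters stand in for Symbol objects. Because *arg collects all positional arguments into a tuple, the flat form stores three components directly, while the nested form has to be fused by hand.

```python
class Node:
    '''Hypothetical stand-in for Union/Concat: stores its operands in a tuple'''

    def __init__(self, *arg):
        self.components = arg


def flatten(node: Node) -> list:
    '''Fuses nested Node operands into one flat list of leaves'''
    leaves = []
    for part in node.components:
        if isinstance(part, Node):
            leaves.extend(flatten(part))   # recurse into a nested operation
        else:
            leaves.append(part)            # a leaf (here: a plain character)
    return leaves


flat = Node('a', 'b', 'c')             # (a|b|c) written in one call
nested = Node('a', Node('b', 'c'))     # (a|(b|c)) written with binary operations

print(type(flat.components).__name__, len(flat.components))   # tuple 3
print(len(nested.components))                                  # 2
print(flatten(nested))                                         # ['a', 'b', 'c']
```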

What should it generate?

[aa, ab, ac]
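
One way to arrive at this result is to pair every word generated by the first component with every word generated by the second and join each pair. The sketch below works on plain lists of strings rather than on the Regex classes, and concat_words is a hypothetical helper name, so treat it as an illustration of the idea and not as the required implementation of Concat.gen.

```python
from itertools import product


def concat_words(*word_lists: list[str]) -> list[str]:
    '''Joins one word from each list, in every possible combination'''
    return [''.join(combo) for combo in product(*word_lists)]


left = ['a']               # words generated by Symbol('a')
right = ['a', 'b', 'c']    # words generated by (a | b | c)

print(concat_words(left, right))                 # ['aa', 'ab', 'ac']
print(len(concat_words(right, right, ['b'])))    # 9
```

The second call mirrors (a | b | c)(a | b | c)b from the example usage, which yields 3 * 3 * 1 = 9 words.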

Therefore, for each class we should analyse whether having other types of regular expressions nested inside it changes how it generates words, and adapt our code to handle this recursion between classes.

We should build this up from simple examples to more complex ones: Symbol, Concat(Symbol, Symbol), Union(Symbol, Symbol), Concat(Symbol, Union), Concat(Union, Union)…

If some design choices in this laboratory do not suit you, feel free to adapt your implementation, as long as you keep the same classes and the main functionality of __str__, gen and eval_gen.