H02. Tree Sets

In this homework, you will implement a binary search tree and use it to gather statistics about the words of a particular text. Generally, in a binary search tree:

  • each non-empty node contains exactly one value and two children
  • all values from the left sub-tree are smaller than or equal to that of the current node
  • all values from the right sub-tree are larger than or equal to that of the current node
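The ordering property above can be checked mechanically. The following is a minimal sketch over plain integers; IntTree, Leaf, Branch and isBst are hypothetical names used only for illustration and are not part of the homework template, which uses WTree and Token instead.

```scala
object BstSketch {
  // Hypothetical stand-in types; the homework uses WTree and Token.
  sealed trait IntTree
  case object Leaf extends IntTree
  case class Branch(value: Int, left: IntTree, right: IntTree) extends IntTree

  // Checks the ordering property: every value in the left sub-tree is <= value,
  // every value in the right sub-tree is >= value, at every node.
  def isBst(t: IntTree, lo: Int = Int.MinValue, hi: Int = Int.MaxValue): Boolean = t match {
    case Leaf => true
    case Branch(v, l, r) =>
      lo <= v && v <= hi && isBst(l, lo, v) && isBst(r, v, hi)
  }

  // A valid tree (duplicates are allowed) and an invalid one (3 sits left of 2).
  val ok: IntTree  = Branch(2, Branch(1, Branch(1, Leaf, Leaf), Leaf), Branch(3, Leaf, Leaf))
  val bad: IntTree = Branch(2, Branch(3, Leaf, Leaf), Leaf)
}
```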

In your project, the value of each node will be represented by Token objects. The class Token is already implemented for you:

case class Token(word: String, freq: Int)

A token stores a string word together with freq, the number of occurrences (frequency) of that word in a text.

Your binary search tree will use frequencies as the ordering criterion. For instance, the text "All for one and one for one" may be represented by the tree:

      for (2)
      /   \
 and (1)  one (3)
  /            
all (1)            

Notice that several binary search trees can represent the same text; you do not need to take this into account in this homework. Our tree is called WTree, and is implemented by the following case classes:

case object Empty extends WTree
case class Node(word: Token, left: WTree, right: WTree) extends WTree 
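As an illustration, the tree drawn earlier for "All for one and one for one" can be written down directly with these case classes. This is only a sketch: WTree is declared here as a bare sealed trait so that the snippet is self-contained, whereas the WTree in the template also provides the interface methods.

```scala
object ExampleTree {
  case class Token(word: String, freq: Int)

  // Simplified stand-in (assumption): the template's WTree has more members.
  sealed trait WTree
  case object Empty extends WTree
  case class Node(word: Token, left: WTree, right: WTree) extends WTree

  // for (2) at the root, and (1) with child all (1) on the left, one (3) on the right.
  val tree: WTree =
    Node(Token("for", 2),
      Node(Token("and", 1),
        Node(Token("all", 1), Empty, Empty),
        Empty),
      Node(Token("one", 3), Empty, Empty))
}
```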

WTree implements the following trait:

trait WTreeInterface {
  def isEmpty: Boolean
  def filter(pred: Token => Boolean): WTree
  def ins(w: Token): WTree
  def contains(s:String): Boolean
  def size: Int
}

The method ins is already implemented, but the rest must be implemented by you. The project has two parts:

  • building a WTree from a text, and
  • using a WTree to gather information about that particular text.

In the next section you will find implementation details about each of the above.

1. Write a function which splits a text, using the whitespace character as a separator. Several consecutive whitespaces should be treated as a single separator. If the text contains only whitespaces, split should return the empty list. (Hints: Your implementation must be recursive, but do not try to make it tail-recursive, as that will make your code unnecessarily complicated. Several patterns over lists, in the proper order, will make the implementation cleaner.)

/*  split(List('h','i',' ','t','h','e','r','e')) = List(List('h','i'), List('t','h','e','r','e'))
*/
def split(text: List[Char]): List[List[Char]] = ???
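To illustrate what "several patterns over lists, in the proper order" can look like, here is a sketch on an unrelated problem. The function compress is a hypothetical example (it is not part of the template and is not a solution to split): it removes consecutive duplicates from a list of characters, using plain, non-tail recursion.

```scala
object PatternSketch {
  // Hypothetical helper: "aabccc" becomes "abc".
  def compress(cs: List[Char]): List[Char] = cs match {
    case Nil                      => Nil                 // nothing left to process
    case x :: y :: rest if x == y => compress(y :: rest) // specific pattern first: drop the repeat
    case x :: rest                => x :: compress(rest) // general pattern last: keep the head
  }
}
```

If the last two cases were swapped, the general pattern would always match first and no duplicate would ever be dropped, which is why the order of the patterns matters.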

2. Write a function which computes a list of Token from a list of strings. Recall that Tokens keep track of the string frequency. Use an auxiliary function insWord which inserts a new string in a list of Tokens. If the string is already a token, its frequency is incremented, otherwise it is added as a new token. (Hint: the cleanest way to implement aux is to use one of the two folds).

def computeTokens(words: List[String]): List[Token] = {
  /* insert a new string in a list of tokens */
  def insWord(s: String, acc: List[Token]): List[Token] = ???
  def aux(rest: List[String], acc: List[Token]): List[Token] = ???
  ???
}
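As a reminder of how the two folds differ, here is a small sketch on a list of integers (the names are illustrative only). Both folds thread an accumulator through the list; they differ in the direction of traversal and in the position of the accumulator in the operator.

```scala
object FoldSketch {
  val xs = List(1, 2, 3)

  // foldLeft goes left to right; the accumulator is the FIRST argument of the operator.
  val reversed: List[Int] = xs.foldLeft(List.empty[Int])((acc, x) => x :: acc)

  // foldRight goes right to left; the accumulator is the SECOND argument of the operator.
  val copied: List[Int] = xs.foldRight(List.empty[Int])((x, acc) => x :: acc)
}
```

With the same "prepend" operator, foldLeft reverses the list while foldRight preserves its order, so the choice of fold affects the order in which elements reach the accumulator.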

3. Write a function tokensToTree which creates a WTree from a list of tokens. Use the insertion function ins which is already implemented. (Hint: you can implement it as a single fold call, but you have to choose the right one)

def tokensToTree(tokens: List[Token]): WTree = ???

4. Write a function makeTree which takes a string and builds a WTree. makeTree relies on all the previous functions you implemented. You should use _.toList, which converts a String to List[Char]. You can also use andThen, which allows writing a concise and clear implementation. andThen is explained in detail in the next section.

def makeTree(s:String): WTree = ???

5. Implement the member method size, which must return the number of non-empty nodes in the tree.

6. Implement the member method contains, which must check if a string is a member of the tree (no matter its frequency).

7. Implement the filter method in the abstract class WTree. Filter will rely on the tail-recursive filterAux method, which must be implemented in the case classes Empty and Node.

8. In the code template you will find a string: scalaDescription.

Compute the number of occurrences of the keyword “Scala” in scalaDescription. Use word-trees and any of the previous functions you have defined.

def scalaFreq: Int = ??? 

9. Find how many programming languages are referenced in the same text. You may consider that a programming language is any keyword which starts with an uppercase character. To reference character i in a string s, use s(i). You can also use the method _.isUpper.

def progLang: Int = ???
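The two operations mentioned above can be tried on their own. A minimal sketch with made-up example strings:

```scala
object CharSketch {
  val s = "Scala"

  val first: Char = s(0)                                // the character at index 0
  val startsUpper: Boolean = s(0).isUpper               // 'S' is uppercase
  val otherStartsUpper: Boolean = "language"(0).isUpper // 'l' is not
}
```

Note that s(0) throws an exception on the empty string, so it should only be applied to non-empty words.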

10. Find how many words which are not prepositions or conjunctions appear in the same text. You may consider that a preposition or conjunction is any word whose size is less than or equal to 3.

def wordCount: Int = ???

Note: In order to be graded, exercises 5 to 9 must rely on a correct implementation of the previous parts of the homework.

Suppose we have two functions f: B ⇒ C and g: A ⇒ B. The functional composition $f \circ g$ can be defined in Scala as: x ⇒ f(g(x)). The intuition behind composition is that we first apply g to the parameter x, and then apply f to the result. The (higher-order) function andThen in Scala expresses the same thing, with the functions written in the order in which they are applied: g.andThen(f) represents the function x ⇒ f(g(x)). Here is an example:

((x: Int) => x * 2).andThen((x:Int) => x + 1)    // the function g.andThen(f) where g(x) = 2*x and f(x) = x + 1
((x: Int) => x * 2).andThen((x:Int) => x + 1)(2) // calling the previous function with parameter 2
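The same example can be written with named functions, which also makes it easy to compare andThen with compose, the standard Function1 method that follows the mathematical order. The names g, f, viaAndThen and viaCompose are chosen here only for illustration.

```scala
object ComposeSketch {
  val g: Int => Int = x => x * 2
  val f: Int => Int = x => x + 1

  val viaAndThen: Int => Int = g.andThen(f) // x => f(g(x)): g runs first
  val viaCompose: Int => Int = f.compose(g) // the same function, written as f ∘ g

  val result: Int = viaAndThen(2)           // g(2) = 4, then f(4) = 5
  val swapped: Int = f.andThen(g)(2)        // f(2) = 3, then g(3) = 6
}
```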

andThen is especially useful when we want to sequence several functions. Thus, instead of:

function1(function2(function3(x)))

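which may become less intuitive as the number of applied functions grows, we may use the following, where the functions are listed in the order in which they are applied (function3 first, function1 last):

function3
   .andThen(function2)
   .andThen(function1)(x)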
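Chaining also works when the type changes at every step, as long as the result type of each function matches the parameter type of the next one. A small sketch with hypothetical functions (trimmed, chars and count are not part of the template):

```scala
object PipelineSketch {
  val trimmed: String => String   = _.trim   // runs first
  val chars: String => List[Char] = _.toList // runs second
  val count: List[Char] => Int    = _.length // runs last

  // The function written first is the one applied first.
  val pipeline: String => Int = trimmed.andThen(chars).andThen(count)

  val n: Int = pipeline("  hi ")             // "hi" has 2 characters
}
```

Here pipeline(s) computes the same value as the nested call count(chars(trimmed(s))).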

Project format

  • Do not change any project files other than the template-file. For this homework, the template-file is Main.scala. Warning: if a submission has changes in other files, it may not be graded.
  • To solve your homework, download the Homework project and rename it using the following convention: HX_<LastName>_<FirstName>, where X is the homework number. (Example: H2_Popovici_Matei). If your project name disregards this convention, it may not be graded.
  • Each project file contains a profileID definition which you must fill out with your token ID received via email for this lecture. Make sure the token id is defined correctly. (Grades will be automatically assigned by token ID).
  • In order to be graded, the homework must compile. If a homework has compilation errors, it will not be graded. Please remove any code that does not compile by replacing the affected function bodies with ??? (or by keeping the original ???).

Submission

  • Your submission should be an archived file with your solved project, named via the convention specified previously.
  • All homework must be submitted via Moodle. Submissions sent via email will not be graded!
  • All homework must be submitted before the deadline. Submissions that miss the deadline (even by minutes) will not be graded. (All deadlines will be fixed at 8:00 AM, so that you can take advantage of an all-nighter, should you choose to).

Points

  • Points are assigned for each test (for a total of 100p), but the final grade will be assigned after manual review. Selected homework may need to be presented during lab before the final grade is assigned.

Integrity

  • Each homework uses public test-cases which you can use to guide and test your implementation. Most test-cases are simple, in order to be as easy to use as possible. If an implementation is written with the sole purpose of passing those specific tests, thus disregarding the statement, the entire homework will not be graded!
  • The homework must be solved individually: you are not allowed to share code or to take code from other sources, including the Internet.

We strongly encourage you to ask questions via the forum (instead of MS Teams) so that other students can benefit from the answers and discussion. You may ask questions about the homework during lab. You will receive feedback about your implementation ideas, but not on the actual written code.