Idea: Tiny Analytics Engine in Haskell

The dataset consists of course grades of a real lecture (the names have been changed). The dataset has the following problems:

  • Lecture grades are keyed by email address, whereas Homework grades and Exam grades are keyed by name.
  • Lecture grades also contains entries with no email address. These correspond to students who did not provide an email address in a valid form.
  • so, in order to have a unique key for each student, the Email to name map table was created using a form. However, it contains some typos.

Task 1 - Reading/writing a DS

Datasets (DS) are usually stored as CSV files. A CSV file uses a column delimiter (namely the character ,) to separate the columns of a given entry, and a row delimiter (in Unix-style encoding: \n) to separate entries.

In Haskell, we will represent the DS as a String. We will refine this strategy later. As in CSV files, we will make no distinction between the column header and the rest of the rows.
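As a first refinement of the String representation, the sketch below splits the text into rows and columns. The names Row, Table, splitOn and readCSV are assumptions introduced for illustration, and quoted fields (commas inside quotes) are deliberately not handled.

```haskell
-- Hypothetical type aliases: a table is a list of rows, a row is a list of cells.
type Row   = [String]
type Table = [Row]

-- Split a string on every occurrence of the delimiter, keeping empty fields.
splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOn d rest

-- Parse CSV text into a table. The header is just the first row.
-- `lines` splits on '\n' and ignores a single trailing newline.
readCSV :: String -> Table
readCSV = map (splitOn ',') . lines
```

Keeping empty fields matters here, because missing grades and missing emails show up as empty cells.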

We will also use CSV as an output format. For a nicer display of the output, we will use Google Sheets.
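Writing a table back to CSV could look roughly like this. The name writeCSV and the list-of-lists table type are assumptions; cells are assumed to contain no commas or newlines, so no quoting is done.

```haskell
import Data.List (intercalate)

-- Hypothetical type aliases: a table is a list of rows, a row is a list of cells.
type Row   = [String]
type Table = [Row]

-- Render a table as CSV text: cells joined by ',', every row ended by '\n'.
writeCSV :: Table -> String
writeCSV = unlines . map (intercalate ",")
```

The resulting String can be saved with writeFile and the file imported into Google Sheets.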

Task 2 - repairing the DS

Create a consistent dataset by:

  1. filtering out those entries in Lecture grades which contain no email.
  2. populating all empty entries in Lecture grades with 0.
  3. joining the Homework grades and Exam grades tables into a single one, henceforth called Main (careful: Exam grades contains only entries for students who have passed the exam).
  4. in order to add Lecture grades to the previous table, solving the typo problem in the Email to name map:
    1. we need to implement a distance function (e.g. Levenshtein distance). Initially, we compute it recursively, using a list-style matrix. Later on, we use dynamic programming to speed up the implementation.
    2. we extract (or select) the list of names from Main
    3. we match them with emails
    4. for names in Email to name map which are not identical to those in Main, we map the distance function over the complete list of names and select the name with the minimal distance as a match
    5. we correct typos in Email to name map in this way
    6. we then perform the simple join with Main
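The repair steps above can be sketched as follows. All function names are assumptions, as is the layout: the key (email or name) is taken to be the first column of every table. The join shown is a left join that pads missing cells with empty strings, which is one possible way to keep students who have no entry in Exam grades. The distance is the dynamic programming variant, computed one matrix row at a time.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

type Row   = [String]
type Table = [Row]

-- Step 1: drop rows whose first cell (assumed to be the email) is empty.
dropMissingEmail :: Table -> Table
dropMissingEmail = filter hasEmail
  where
    hasEmail (email : _) = not (null email)
    hasEmail []          = False

-- Step 2: replace every empty cell with "0".
fillEmpty :: Table -> Table
fillEmpty = map (map (\cell -> if null cell then "0" else cell))

-- Step 3: left join on the first column; rows of `left` without a match
-- in `right` are padded with empty cells.
leftJoin :: Table -> Table -> Table
leftJoin left right = [ key : rest ++ extraFor key | (key : rest) <- left ]
  where
    index      = [ (k, cols) | (k : cols) <- right ]
    width      = maximum (0 : map (length . snd) index)
    extraFor k = maybe (replicate width "") id (lookup k index)

-- Levenshtein distance by dynamic programming, keeping only the previous
-- row of the distance matrix while folding over the characters of s.
levenshtein :: String -> String -> Int
levenshtein s t = last (foldl nextRow [0 .. length t] s)
  where
    nextRow [] _ = []
    nextRow prev@(p : ps) c = scanl step (p + 1) (zip3 t prev ps)
      where
        -- left: deletion of c, up: insertion, diag: match or substitution
        step left (tc, diag, up) =
          minimum [left + 1, up + 1, diag + fromEnum (tc /= c)]

-- Pick the candidate with the smallest distance to the given name.
-- Assumes the candidate list is non-empty.
closestName :: [String] -> String -> String
closestName candidates name =
  minimumBy (comparing (levenshtein name)) candidates

-- Keep a name that already occurs in Main, otherwise replace it by the
-- closest known name.
repairName :: [String] -> String -> String
repairName known name
  | name `elem` known = name
  | otherwise         = closestName known name
```

Because the header row also has a non-empty first cell, it passes through dropMissingEmail and leftJoin like any other row.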

Task 3 - anonymization

Whenever we work with a DS that contains sensitive information, we need to anonymize it. This DS has already been anonymized; however, for the sake of the exercise, we will assume the names are real.

  • anonymize Main by (i) removing the name field and (ii) replacing each email with a hash
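A minimal sketch of this step, assuming that each row of Main has the shape name, email, grades... and that the first row is the header. The hash is a hand-written FNV-1a-style function over character codes, chosen only so the example has no dependencies. It is not cryptographic, and a real anonymization would need a salted cryptographic hash from a library.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.Word (Word64)
import Numeric (showHex)

type Row   = [String]
type Table = [Row]

-- FNV-1a-style 64-bit hash over character codes (illustration only,
-- NOT suitable for protecting real personal data).
fnv1a :: String -> Word64
fnv1a = foldl step 14695981039346656037
  where step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- Hexadecimal rendering of the hash, used in place of the email.
hashEmail :: String -> String
hashEmail email = showHex (fnv1a email) ""

-- Assumed layout: [name, email, grade, ...] with a header as first row.
-- The name column is dropped and the email is replaced by its hash.
-- Rows with fewer than two cells are discarded rather than leaked.
anonymize :: Table -> Table
anonymize []              = []
anonymize (header : rows) =
  drop 1 header : [ hashEmail email : grades | (_name : email : grades) <- rows ]
```

The same email always maps to the same hash, so the hashed column can still serve as a join key.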

Task 4 - rankings

  • Extract admission information (who could take part in the exam; note that not all students who could take part in the exam did so)
  • Todo

Task x - query language

  • Define a query language which includes all the above processing
  • Re-define the previous tasks using your query language
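One possible starting point is a small algebraic datatype of queries with an evaluator. The constructors Source, Select and Filter, and the function eval, are hypothetical; the task leaves the design of the language open. The sketch assumes that the first row of every table holds the column names.

```haskell
import Data.List (elemIndex)
import Data.Maybe (mapMaybe)

type Row   = [String]
type Table = [Row]

-- Hypothetical query constructors; real tasks would need more of them
-- (joins, column maps, and so on).
data Query
  = Source Table                          -- a literal table
  | Select [String] Query                 -- keep only the named columns
  | Filter String (String -> Bool) Query  -- keep rows whose named column passes

-- Evaluate a query to a table. Unknown column names are ignored by
-- Select, and leave the table unchanged under Filter.
eval :: Query -> Table
eval (Source t) = t
eval (Select cols q) = case eval q of
  []             -> []
  t@(header : _) ->
    let idxs = mapMaybe (`elemIndex` header) cols
    in  map (\row -> [ row !! i | i <- idxs, i < length row ]) t
eval (Filter col p q) = case eval q of
  []              -> []
  (header : rows) -> case elemIndex col header of
    Nothing -> header : rows
    Just i  -> header : filter (\row -> i < length row && p (row !! i)) rows
```

For example, eval (Select ["Name","Exam"] (Filter "Exam" (not . null) (Source main))) would list the students with an exam grade, given a table main with columns named Name and Exam.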

Task y - graph queries

  • Extend your language with graph queries … todo