===== Programming Paradigms project =====

**Idea:** Tiny Analytics Engine in Haskell

===== Datasets =====

The dataset consists of course grades from a real lecture (the names have been changed). The dataset has the following problems:
  * **lecture grades** are keyed by email address, whereas **homework grades** and **exam grades** are keyed by name.
  * **lecture grades** also contains entries with no email address. These correspond to students who did not provide an email address in a valid form.
  * so, in order to have a unique **key** for each student, the table **email to name student map** was created using a form. However, it contains some **typos**.

  * [[https://docs.google.com/spreadsheets/d/1d9ATUsYrcii80ffEDr8wyliZ_FLVChMF06IiuN6_GvU/edit#gid=1332667400| Homework grades]]
  * [[https://docs.google.com/spreadsheets/d/1YaUU6eGqjbw760G3YoLZVzuTTi13ZHmIy9Odrt6Yr7M/edit#gid=1503323838| Lecture grades]]
  * [[https://docs.google.com/spreadsheets/d/17rjcP8wLW1tZeJhNewKzC2Qs6nXHR775vJ0ttK-nHtw/edit#gid=673917849| Exam grades]]
  * [[https://docs.google.com/spreadsheets/d/1xoRW5Joo0IIyM4aXgXf-pv-TdDl9s1NEZemxuCqXWG4/edit#gid=484562090| Email to name map]]
  * [[? | Group assignment]] (todo?)

==== Task 1 - repairing the DS ====

 * Create a **consistent** dataset by:
     - **filtering** out those entries in **lecture grades** which contain no email.
     - **joining** the **Homework grades** and **Exam grades** tables into a single table, henceforth called **Main** (Careful: **Exam grades** contains only entries for students who have passed the exam)
     - in order to add **lecture grades** to the previous table, we have to solve the //typo problem// from the **email to name map**:
         - we need to implement a **distance function** (e.g. Levenshtein distance). Initially, we compute it recursively, using a list-style matrix. Later on, we use dynamic programming to speed up the implementation.
         - we **extract** (or **select**) the list of names from **Main**
         - we match them with emails
         - for names in **Email to name map** which are not identical to those in **Main**, we map the distance function over the complete list of names and select the name with the minimal distance as the match
         - we correct typos in **Email to name map** in this way
         - then perform the simple join with **Main**
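
The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the reference solution: tables are modelled as plain lists of ''(key, value)'' pairs, and the names ''Table'', ''dropNoEmail'', ''leftJoin'', ''levNaive'', ''levDP'', ''closest'' and ''repairName'' are all hypothetical. The dynamic-programming variant keeps only one matrix row at a time instead of the full list-style matrix.

```haskell
import Data.List (foldl', minimumBy)
import Data.Maybe (fromMaybe)
import Data.Ord (comparing)

-- Assumption: a table is a list of (key, value) pairs.
type Table k v = [(k, v)]

-- Drop entries whose key (here: the email) is empty.
dropNoEmail :: Table String v -> Table String v
dropNoEmail = filter (not . null . fst)

-- Left join on the key: rows of the left table without a partner
-- (e.g. students with no exam entry) are kept with Nothing.
leftJoin :: Eq k => Table k a -> Table k b -> Table k (a, Maybe b)
leftJoin left right = [ (k, (a, lookup k right)) | (k, a) <- left ]

-- Naive recursive Levenshtein distance (exponential time).
levNaive :: String -> String -> Int
levNaive [] ys = length ys
levNaive xs [] = length xs
levNaive (x:xs) (y:ys)
  | x == y    = levNaive xs ys
  | otherwise = 1 + minimum [ levNaive xs (y:ys)  -- delete x
                            , levNaive (x:xs) ys  -- insert y
                            , levNaive xs ys ]    -- substitute x by y

-- Dynamic-programming variant: builds the matrix one row per
-- character of xs, keeping only the previous row.
levDP :: String -> String -> Int
levDP xs ys = last (foldl' nextRow [0 .. length ys] xs)
  where
    -- prev is never empty: it always has (length ys + 1) cells.
    nextRow prev@(p:ps) x = scanl step (p + 1) (zip3 ys prev ps)
      where
        step left (y, diag, up) =
          minimum [left + 1, up + 1, diag + (if x == y then 0 else 1)]

-- The candidate with the smallest distance to the given name.
closest :: [String] -> String -> Maybe String
closest [] _      = Nothing
closest cs name   = Just (minimumBy (comparing (levDP name)) cs)

-- Keep a name that already occurs in Main; otherwise replace it
-- by its closest match among the names in Main.
repairName :: [String] -> String -> String
repairName names n
  | n `elem` names = n
  | otherwise      = fromMaybe n (closest names n)
```

For instance, ''levDP "kitten" "sitting"'' evaluates to ''3''. Note that picking the minimal distance always produces some match, even for a name that has no counterpart in **Main**, so a real solution may want an upper bound on the accepted distance.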

==== Task 2 - anonymization ====

Whenever we work with a dataset that contains sensitive information, we need to anonymize it. This dataset has already been anonymized; however, for the sake of the exercise, we will assume the names are real.

  * anonymize **Main** by (i) removing the name field and (ii) replacing ''email'' with a hash
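
A minimal sketch of this step, under stated assumptions: the record types ''Row'' and ''AnonRow'' and their fields are hypothetical, and the hash is a hand-written 32-bit FNV-1a, chosen only because it needs no external package. FNV-1a is not a cryptographic hash, so a real anonymization would more likely use something like SHA-256 from a library, possibly with a secret salt.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word32)
import Numeric (showHex)

-- Hypothetical shape of a row in Main.
data Row = Row
  { name   :: String
  , email  :: String
  , grades :: [Double]
  } deriving (Eq, Show)

-- Anonymized row: no name, and the email replaced by its hash.
data AnonRow = AnonRow
  { emailHash  :: String
  , anonGrades :: [Double]
  } deriving (Eq, Show)

-- 32-bit FNV-1a over the characters of the string.
-- It matches the byte-wise definition for ASCII input only.
-- NOT cryptographically secure; for illustration only.
fnv1a :: String -> Word32
fnv1a = foldl' step 2166136261
  where
    step h c = (h `xor` fromIntegral (ord c)) * 16777619

-- Drop the name field and replace the email by a hex-encoded hash.
anonymize :: Row -> AnonRow
anonymize r = AnonRow (showHex (fnv1a (email r)) "") (grades r)
```

Because equal emails produce equal hashes, the hashed column can still serve as a join key after anonymization.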

==== Task 3 - rankings ====

 * Extract admission information (who could take part in the exam; note that not all students who could take part in the exam did so)
 * Todo

==== Task x - query language ====

 * Define a query language which includes all the above processing
 * Re-define the previous tasks using your query language
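
One possible starting point, given purely as an assumption about what such a language could look like: queries are represented as a small data type and interpreted by an ''eval'' function. The types ''Row'', ''Table'' and ''Query'', the constructor names, and the helper ''hasEmail'' are all hypothetical; the task leaves the actual design open.

```haskell
-- Assumption: a row is a list of cells and a table is a list of rows.
type Row   = [String]
type Table = [Row]

-- A tiny query language as an algebraic data type.
data Query
  = FromTable Table              -- a literal table
  | Filter (Row -> Bool) Query   -- keep the rows satisfying a predicate
  | Select [Int] Query           -- keep the columns at the given indices
  | MapRows (Row -> Row) Query   -- transform every row

-- Interpreter: evaluates a query to a table.
eval :: Query -> Table
eval (FromTable t) = t
eval (Filter p q)  = filter p (eval q)
eval (Select is q) = map keep (eval q)
  where
    -- Columns are kept in their original order; indices that do
    -- not exist in a row are silently ignored.
    keep r = [ c | (i, c) <- zip [0 ..] r, i `elem` is ]
eval (MapRows f q) = map f (eval q)

-- Example predicate: the first column (assumed to be the email) is non-empty.
hasEmail :: Row -> Bool
hasEmail (e:_) = not (null e)
hasEmail []    = False
```

With this encoding, the filtering step of Task 1 would read ''eval (Filter hasEmail (FromTable lectureGrades))'', where ''lectureGrades'' is a hypothetical table value. Joins and the typo correction would need further constructors.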

==== Task y - graph queries ====

 * Extend your language with graph queries ... todo