Idea: Tiny Analytics Engine in Haskell

The dataset consists of the course grades of a real lecture (the names have been changed). The dataset has the following problems:

Lecture grades are mapped against email addresses, whereas homework grades and exam grades are mapped against names.
Lecture grades also contains entries with no email address. These correspond to students who did not provide an email address in a valid form.
To obtain a unique key for each student, the table Email to name student map was created using a form. However, it contains some typos.

Task 1 - Reading/writing a DS

- [Easy] (Relies on a correct splitBy written in lecture) Datasets (DS) are usually presented as CSV files. A CSV file uses a column_delimiter (namely the character ,) to separate the columns of a given entry, and a row delimiter (in Unix-style encoding: \n) to separate the entries.

In Haskell, we will represent the DS as a String. We will refine this strategy later. As with CSV files, we will make no distinction between the header row and the remaining rows.

- [Easy] (reverse splitBy) We will also use CSV as an output format. For a nicer display of the output, we will use Google Sheets.
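As a rough illustration of both directions of Task 1, the sketch below parses CSV text into a table and prints it back. The names Row, Table, readCSV and writeCSV are assumptions made for this example, and the splitBy shown is only one possible version, not necessarily the one written in lecture. It assumes that fields contain no quoted commas or embedded newlines.

```haskell
import Data.List (intercalate)

-- Hypothetical representation: a table is a list of rows, a row a list of cells.
type Row = [String]
type Table = [Row]

-- Split a list on every occurrence of the delimiter, keeping empty fields.
splitBy :: Eq a => a -> [a] -> [[a]]
splitBy delim = foldr step [[]]
  where
    step x acc@(cur : rest)
      | x == delim = [] : acc
      | otherwise  = (x : cur) : rest
    step _ [] = [[]] -- unreachable: the accumulator is never empty

-- Parse CSV text; empty lines (e.g. after a trailing newline) are skipped.
readCSV :: String -> Table
readCSV = map (splitBy ',') . filter (not . null) . splitBy '\n'

-- The reverse of splitBy, applied to cells and then to rows.
writeCSV :: Table -> String
writeCSV = intercalate "\n" . map (intercalate ",")
```

The header row is treated like any other row here, in line with the decision above to make no distinction between the two.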

Task 2 - repairing the DS

The objective of this set of tasks is to obtain a single, consistent dataset on which further processing can then be performed.

Basic repairing

- [Medium] filter out the entries in Lecture grades which contain no email (implement a general filtering procedure)

- [Hard] merge the datasets Homework grades and Exam grades using Name as key (implement a key-dependent join technique)
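One way the two basic repairs could look is sketched below. The function names filterRows and joinOn, and the Table type, are assumptions made for this example. The join shown is a left join (rows that appear only in the second table are dropped, missing values become empty cells), which is a simplification; the project may well call for a different join policy. Rows are assumed to be as long as their header.

```haskell
import Data.List (elemIndex)
import Data.Maybe (fromMaybe)

type Row = [String]
type Table = [Row]

-- Keep the header, plus the rows whose cell in the named column satisfies the predicate.
-- If the column does not exist, the table is returned unchanged.
filterRows :: String -> (String -> Bool) -> Table -> Table
filterRows _ _ [] = []
filterRows col keep (header : rows) =
  case elemIndex col header of
    Nothing -> header : rows
    Just i  -> header : filter (\row -> i < length row && keep (row !! i)) rows

-- Left join on a shared key column: every row of the first table is extended
-- with the matching row of the second table, or with blanks if there is none.
joinOn :: String -> Table -> Table -> Table
joinOn key (h1 : rows1) (h2 : rows2)
  | Just i <- elemIndex key h1
  , Just j <- elemIndex key h2 =
      let dropAt n xs = take n xs ++ drop (n + 1) xs
          lookupRight k = lookup k [(r !! j, dropAt j r) | r <- rows2, j < length r]
          blanks = replicate (length h2 - 1) ""
          extend r = r ++ fromMaybe blanks (lookupRight (r !! i))
      in (h1 ++ dropAt j h2) : map extend rows1
joinOn _ t1 _ = t1 -- missing key or empty table: leave the first table as it is
```

With these, removing the entries without an email would be `filterRows "Email" (not . null)`, assuming the column is called Email.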

Correcting typos in emails

- [Medium] implement a Levenshtein (must change the word to avoid internet copy/paste) distance function

a basic recursive implementation does not scale
a basic optimisation (see code) scales poorly
a dynamic programming (DP) implementation works well
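A minimal sketch of the DP variant is given below, assuming the standard edit distance with unit costs for insertion, deletion and substitution. It computes the DP table one row at a time and keeps only the previous row, so it needs time proportional to the product of the two lengths. The name levenshtein is an assumption, and this is not the optimisation referred to by "see code".

```haskell
-- Edit distance between two strings, computed row by row.
-- Each row holds the distances between a prefix of s and every prefix of t.
levenshtein :: String -> String -> Int
levenshtein s t = last (foldl nextRow [0 .. length t] s)
  where
    -- Build the row for one more character c of s from the previous row.
    nextRow prev@(p : ps) c = scanl compute (p + 1) (zip3 t prev ps)
      where
        -- left: cell to the left in the new row; diag and up: cells in the previous row.
        compute left (tc, diag, up) =
          minimum [up + 1, left + 1, diag + (if tc == c then 0 else 1)]
    nextRow [] _ = [] -- unreachable: a row always has at least one cell
```

For example, `levenshtein "kitten" "sitting"` evaluates to 3 (two substitutions and one insertion).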

- [Easy] implement a projection function which extracts some columns from the table as another table

- [Medium] implement a cartesian function (see code constraints) to be used to compute Levenshtein distances. This will also be used for graphs (maybe?!)
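The two operations might be sketched as follows. The signatures are assumptions made for this example; in particular, the actual constraints on cartesian are those in the project code, which is not shown here. In this version, cartesian takes a function that combines one row from each table, plus the header of the resulting table.

```haskell
import Data.List (elemIndex)
import Data.Maybe (mapMaybe)

type Row = [String]
type Table = [Row]

-- Keep only the named columns, in the order requested.
-- Column names that do not occur in the header are ignored.
project :: [String] -> Table -> Table
project _ [] = []
project cols (header : rows) = map pick (header : rows)
  where
    indices = mapMaybe (`elemIndex` header) cols
    pick row = [row !! i | i <- indices, i < length row]

-- Combine every data row of the first table with every data row of the second.
-- The header rows of both inputs are skipped and replaced by newHeader.
cartesian :: (Row -> Row -> Row) -> [String] -> Table -> Table -> Table
cartesian combine newHeader t1 t2 =
  newHeader : [combine r1 r2 | r1 <- drop 1 t1, r2 <- drop 1 t2]
```

To compute distances, the combining function could return both names together with their distance, for instance as a three-cell row.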

- [Hard, tedious] implement a query-like sequence of operations which uses cartesian to:

select those names in Emails that are not found in Main (these are probably typos)
cross (cartesian) only those names with the full list of (projected) names from Main, and apply the distance function to each pair
sort each row of the resulting matrix by distance and report the closest name as the correct one
replace the bad name by the correct one in the Emails table

Most likely, the last two steps cannot be expressed as a combination of queries, but rather as a user-defined operation.
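The steps above can be sketched on plain lists of names, before they are phrased as table queries. Everything named here (Distance, closest, repairNames, mismatches) is hypothetical. The distance function is passed in as a parameter; mismatches is only a crude stand-in so that the example is self-contained, and the real solution would use the Levenshtein distance instead.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

type Distance = String -> String -> Int

-- The reference name with the smallest distance to the given name.
closest :: Distance -> [String] -> String -> String
closest _ [] name = name
closest dist refs name = minimumBy (comparing (dist name)) refs

-- Leave names that occur in the reference list untouched and
-- replace every other name by its closest reference name.
repairNames :: Distance -> [String] -> [String] -> [String]
repairNames dist refs = map fix
  where
    fix name
      | name `elem` refs = name
      | otherwise        = closest dist refs name

-- Stand-in distance (NOT Levenshtein): differing positions plus length difference.
mismatches :: Distance
mismatches a b = length (filter id (zipWith (/=) a b)) + abs (length a - length b)
```

Here refs plays the role of the projected names from Main, and the list being repaired plays the role of the name column in Emails.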

Task 3 - anonymization

Whenever we work with a DS that contains sensitive information, we need to anonymize it. This DS has already been anonymized; however, for the sake of the exercise, we will assume the names are real.

- [Easy] anonymize Main by (i) removing the name field and (ii) replacing the email by a hash
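A possible shape for this step is sketched below. The column names Name and Email are assumptions about the header of Main, and toyHash is a simple illustrative string hash written for this example; it is not cryptographically secure, so a real anonymization would need a proper hash function.

```haskell
import Data.Char (ord)
import Data.List (elemIndex)
import Numeric (showHex)

type Row = [String]
type Table = [Row]

-- Illustrative, non-cryptographic hash, printed in hexadecimal.
toyHash :: String -> String
toyHash s = showHex (foldl step 5381 s) ""
  where
    step :: Int -> Char -> Int
    step h c = (h * 33 + ord c) `mod` 4294967296

-- Drop the Name column and replace every Email cell by its hash.
-- The header row keeps the Email label unhashed.
anonymize :: Table -> Table
anonymize [] = []
anonymize (header : rows) = dropName header : map (dropName . hashEmail) rows
  where
    nameIx  = elemIndex "Name" header
    emailIx = elemIndex "Email" header
    dropName row = case nameIx of
      Nothing -> row
      Just i  -> take i row ++ drop (i + 1) row
    hashEmail row = case emailIx of
      Nothing -> row
      Just i  -> [if j == i then toyHash x else x | (j, x) <- zip [0 ..] row]
```

The email is hashed before the name column is dropped, so the column indices computed from the original header remain valid.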

Task 4 - rankings

* Extract admission information (who could take part in the exam; note that not all students who could take part in the exam did so)

* Todo

Task x - query language

* Define a query language which includes all the above processing

* Re-define the previous tasks using your query language
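As one possible starting point, a query language can be modelled as a data type with an evaluator. The constructors below (FromTable, Select, Where, Cross) are hypothetical names chosen for this sketch and cover only filtering, projection and the cartesian product; the intended language may look quite different.

```haskell
import Data.List (elemIndex)
import Data.Maybe (mapMaybe)

type Row = [String]
type Table = [Row]

-- A small query syntax tree; each query evaluates to a table.
data Query
  = FromTable Table                            -- a literal table
  | Select [String] Query                      -- projection on column names
  | Where String (String -> Bool) Query        -- filter on one column
  | Cross (Row -> Row -> Row) [String] Query Query -- cartesian product with a new header

eval :: Query -> Table
eval (FromTable t) = t
eval (Select cols q) = case eval q of
  [] -> []
  t@(header : _) ->
    let ixs = mapMaybe (`elemIndex` header) cols
    in map (\row -> [row !! i | i <- ixs, i < length row]) t
eval (Where col p q) = case eval q of
  [] -> []
  (header : rows) -> case elemIndex col header of
    Nothing -> header : rows
    Just i  -> header : filter (\row -> i < length row && p (row !! i)) rows
eval (Cross f hdr q1 q2) =
  hdr : [f r1 r2 | r1 <- drop 1 (eval q1), r2 <- drop 1 (eval q2)]
```

Because queries are ordinary values, earlier tasks can be rewritten as nested expressions, for example `Select ["Name"] (Where "Email" (not . null) (FromTable t))`.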

Task y - graph queries

* Extend your language with graph queries … todo
