Programming Paradigms project
Idea: Tiny Analytics Engine in Haskell
Datasets
The dataset consists of the course grades from a real lecture (the names have been changed). It has the following problems:
- lecture grades are mapped against email addresses, whereas homework grades and exam grades are mapped against names.
- lecture grades also contain entries with no email address. These correspond to students who did not provide an email in a valid form.
- so, in order to have a unique key for each student, an email-to-name mapping table was created using a form. However, it contains some typos.
- Group assignment (todo?)
Task 1 - Reading/writing a DS
- [Easy] (Relies on a correct splitBy written in lecture) Datasets are usually presented as CSV files. A CSV file uses a column delimiter (namely the character ,) to separate the columns of a given entry, and a row delimiter (in Unix-style encoding: \n).
In Haskell, we will represent the DS as a String. We will refine this strategy later. As with CSVs, we will make no distinction between the column header and the rest of the rows.
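A minimal sketch of this reading direction, assuming the String representation above and a [[String]] result; the names readCsv and this particular splitBy are choices made here, not requirements:

<code haskell>
-- One possible splitBy: break a string into chunks separated by the given
-- delimiter character (an empty input yields a single empty chunk).
splitBy :: Char -> String -> [String]
splitBy delim = foldr op [""]
  where
    op c acc@(chunk : rest)
      | c == delim = "" : acc
      | otherwise  = (c : chunk) : rest

-- Read a CSV string into a list of rows, each row being a list of columns.
-- The header is kept as an ordinary first row, as stated above.
readCsv :: String -> [[String]]
readCsv = map (splitBy ',') . splitBy '\n'
</code>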
- [Easy] (reverse of splitBy) We will also use CSV as an output format. For a nicer output display, we will use Google Sheets.
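The writing direction is then just the opposite composition, again assuming the [[String]] representation:

<code haskell>
import Data.List (intercalate)

-- Reverse of readCsv: glue columns back with ',' and rows with '\n'.
writeCsv :: [[String]] -> String
writeCsv = intercalate "\n" . map (intercalate ",")
</code>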
Task 2 - repairing the DS
The objective of this set of tasks is to obtain a single, consistent dataset on which further processing can then be performed.
Basic repairing
- [Medium] filter out entries in lecture grades which contain no email (implement a general filtering procedure)
- [Hard] merge the datasets Homework grades and Exam grades using Name as key (implement a key-dependent join technique)
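A sketch of both operations, assuming a table is a [[String]] whose first row is the header; the names filterTable and joinTables, and the column names used below, are illustrative rather than required:

<code haskell>
import Data.List (elemIndex)

type Row   = [String]
type Table = [Row]   -- the first row is the header

-- General filtering: keep only the rows (header excluded) whose value in the
-- named column satisfies the predicate.
filterTable :: (String -> Bool) -> String -> Table -> Table
filterTable keep col (header : rows) =
  case elemIndex col header of
    Nothing  -> header : rows                     -- unknown column: unchanged
    Just idx -> header : filter (keep . (!! idx)) rows
filterTable _ _ [] = []

-- Key-dependent join: merge two tables on the given key column, keeping the
-- key once and appending the remaining columns of the second table.
joinTables :: String -> Table -> Table -> Table
joinTables key (h1 : rs1) (h2 : rs2) =
  case (elemIndex key h1, elemIndex key h2) of
    (Just i1, Just i2) ->
      let dropAt i row = take i row ++ drop (i + 1) row
          matches r1   = [ r1 ++ dropAt i2 r2 | r2 <- rs2, r2 !! i2 == r1 !! i1 ]
      in  (h1 ++ dropAt i2 h2) : concatMap matches rs1
    _ -> h1 : rs1                                 -- key missing: fall back to the first table
joinTables _ t1 _ = t1
</code>

For instance, filterTable (not . null) "Email" lectureGrades would drop the entries with no email, assuming the column is literally named Email.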
Correcting typos in emails
- [Medium] implement a Levenshtein (must change the word to avoid internet copy/paste) distance function
- a basic recursive implementation does not scale
- a basic optimisation (see code) scales poorly
- dynamic programming (DP) works great (see the sketch after this list)
- [Easy] implement a projection function which extracts some columns from the table as another table
- [Medium] implement a cartesian function (see code constraints) to be used to compute Levenshtein distances (a sketch follows after this list). This will also be used for graphs (maybe?!)
- [Hard,tedious] implement a query-like sequence of operations which uses cartesian to:
- filter those names in Emails not found in Main (those which are probably typos)
- cross (cartesian) only those with the full list of (projected) names, and apply distance
- sort each row of the above matrix by distance and report the correct name
- replace the bad name by the correct one in Emails table
Most likely, the last two steps cannot be expressed as a combination of queries, but rather as a user operation.
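A minimal sketch of the dynamic-programming variant (the naive recursion and the intermediate optimisation mentioned above are omitted); the name levenshtein and the use of Data.Array are choices made here, not requirements:

<code haskell>
import Data.Array

-- Edit distance via dynamic programming: table ! (i, j) is the distance
-- between the first i characters of xs and the first j characters of ys.
levenshtein :: String -> String -> Int
levenshtein xs ys = table ! (n, m)
  where
    n  = length xs
    m  = length ys
    xa = listArray (1, n) xs
    ya = listArray (1, m) ys
    table = array ((0, 0), (n, m))
              [ ((i, j), cell i j) | i <- [0 .. n], j <- [0 .. m] ]
    cell 0 j = j                                  -- insert j characters
    cell i 0 = i                                  -- delete i characters
    cell i j
      | xa ! i == ya ! j = table ! (i - 1, j - 1)
      | otherwise        = 1 + minimum [ table ! (i - 1, j)      -- deletion
                                       , table ! (i, j - 1)      -- insertion
                                       , table ! (i - 1, j - 1)  -- substitution
                                       ]
</code>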
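Building on the sketch above, the projection and cartesian helpers, together with a best-match step for the query pipeline, might look roughly as follows; the signatures (in particular cartesian's) are assumptions, since the actual constraints are given in the referenced code, and Row/Table are the synonyms from the join sketch in Task 2:

<code haskell>
import Data.List (elemIndex, transpose)

-- Projection: keep only the named columns, in the requested order
-- (columns not found in the header are silently skipped).
projection :: [String] -> Table -> Table
projection cols table@(header : _) =
  transpose [ map (!! i) table | name <- cols
                               , Just i <- [elemIndex name header] ]
projection _ [] = []

-- Cartesian product: combine every pair of rows from the two tables
-- (headers excluded) with the given function, under a new header.
cartesian :: (Row -> Row -> Row) -> Row -> Table -> Table -> Table
cartesian op header (_ : rs1) (_ : rs2) =
  header : [ op r1 r2 | r1 <- rs1, r2 <- rs2 ]
cartesian _ header _ _ = [header]

-- For a misspelled name, report the reference name at minimal edit distance
-- (assumes a non-empty reference list).
bestMatch :: [String] -> String -> String
bestMatch refs typo = snd (minimum [ (levenshtein typo r, r) | r <- refs ])
</code>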
Task 3 - anonymization
Whenever we work with a DS that contains sensitive information, we need to anonymize it. This DS has already been anonymized; however, for the sake of the exercise, we will assume the names are real.
- [Easy] anonymize Main by (i) removing the name field (ii) replacing email by a hash
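A possible shape for this step, reusing the Table and Row synonyms from Task 2 and assuming the columns are literally named Name and Email; the stand-in hash is only illustrative (a real solution would likely use a proper hash function):

<code haskell>
import Data.Char (ord)
import Data.List (elemIndex)

-- A simple stand-in hash over the characters of a string; not cryptographic,
-- only meant to show the shape of the transformation.
simpleHash :: String -> String
simpleHash = show . foldl (\acc c -> acc * 31 + fromIntegral (ord c)) (0 :: Integer)

-- Anonymize: drop the Name column and replace the Email value by its hash.
anonymize :: Table -> Table
anonymize (header : rows) =
  case (elemIndex "Name" header, elemIndex "Email" header) of
    (Just ni, Just ei) ->
      let dropAt i r = take i r ++ drop (i + 1) r
          hashAt i r = take i r ++ [simpleHash (r !! i)] ++ drop (i + 1) r
      in  dropAt ni header : map (dropAt ni . hashAt ei) rows
    _ -> header : rows                     -- expected columns missing: unchanged
anonymize [] = []
</code>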
Task 4 - rankings
* Extract admission information (who could take part in the exam - note - not all students who could take part in the exam did so)
* Todo
Task x - query language
* Define a query language which includes all the above processing (one possible shape is sketched below)
* Re-define the previous tasks using your query language
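One possible shape for such a language, reusing the helpers sketched in the earlier tasks (projection, joinTables, cartesian); the constructor set below is only an illustration of the idea, not a required interface:

<code haskell>
-- A possible abstract syntax for table queries; each constructor mirrors one
-- of the operations used in the previous tasks.
data Query
  = FromTable Table                                -- a literal table
  | Filter (Row -> Bool) Query                     -- keep rows satisfying a predicate
  | Project [String] Query                         -- keep only the named columns
  | Join String Query Query                        -- key-dependent join on a column
  | Cartesian (Row -> Row -> Row) Row Query Query  -- pairwise combination

-- Evaluation reduces a query to a concrete table (assumes every intermediate
-- table has a header row).
eval :: Query -> Table
eval (FromTable t)            = t
eval (Filter p q)             = let (h : rs) = eval q in h : filter p rs
eval (Project cols q)         = projection cols (eval q)
eval (Join key q1 q2)         = joinTables key (eval q1) (eval q2)
eval (Cartesian op h q1 q2)   = cartesian op h (eval q1) (eval q2)
</code>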
Task y - graph queries
* Extend your language with graph queries … todo