

Idea: Tiny Analytics Engine in Haskell

The dataset consists of course grades of a real lecture (the names have been changed). The dataset has the following problems:

  • lecture grades are mapped against email addresses, whereas homework grades and exam grades are mapped against names.
  • lecture grades also contain entries with no email address. These correspond to students who have not provided an email in a valid form.
  • so, in order to have a unique key for each student, an email-to-name student map table was created using a form. However, it contains some typos.

The query language to be implemented is shown below:

{-# LANGUAGE ExistentialQuantification #-}   -- needed for the forall in Filter

data Query =
    FromCSV String                                 -- extract a table
  | ToCSV
  | forall a. Filter (FilterCondition a) Query
  | Sort String Query                              -- sort by the given column name, values interpreted as integers
  | ValueMap (QVal -> QVal) Query
  | RowMap ([QVal] -> QVal) Query
  | TableJoin QKey Query Query                     -- table join according to specs
  | Projection [QVal] Query
  | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query
  | Graph EdgeOp Query
  | VUnion Query Query                             -- adds new rows to the DS (table heads should coincide)
  | HUnion Query Query                             -- 'horizontal union' glues more columns with zero logic; TableJoin is the smart alternative
  | NameDistanceQuery                              -- dedicated name query
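
For orientation, queries are built by nesting the constructors above. A minimal illustration (assuming the stub definitions of QVal, QKey, FilterCondition and EdgeOp; the file and column names are made up):

  -- Read the exam table from CSV and sort its rows by the "Total" column.
  exampleQuery :: Query
  exampleQuery = Sort "Total" (FromCSV "exam_grades.csv")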

Task Set 1 (Accommodation)

One dataset (exam points) is provided, already parsed as a matrix, in a stub Haskell file.

  • Implement procedures which determine which students have enough points to pass the exam (a sketch follows this list). These implementations will serve as a basis for ValueMap, RowMap and HUnion.
  • Other similar tasks, which may (but need not) include basic filtering or sorting.
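
A minimal sketch of the pass/fail procedure, assuming (purely for illustration) that each row holds a student's name followed by the points for each question, and that the passing threshold is passed in as a parameter; the stub file may use different names and types:

  type Row   = [String]
  type Table = [Row]

  -- Read a cell as a number, treating the empty string as 0 (assumption).
  readPoints :: String -> Float
  readPoints "" = 0
  readPoints s  = read s

  -- Total points of a row; the name (assumed to be the first cell) is skipped.
  totalPoints :: Row -> Float
  totalPoints []            = 0
  totalPoints (_name : pts) = sum (map readPoints pts)

  -- A student passes if the total reaches the given threshold.
  passedExam :: Float -> Row -> Bool
  passedExam threshold row = totalPoints row >= threshold

This is exactly the kind of per-value and per-row processing that ValueMap and RowMap generalise.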

Task Set 2 (Input/Output)

  • Reading and writing a dataset from CSV (relies on a correct splitBy written in lecture). Datasets are usually presented as CSV files. A CSV file uses a column delimiter (namely the character ,) to separate the columns of a given entry, and a row delimiter (in Unix-style encoding: \n).

In Haskell, we will represent the DS as a String. We will refine this strategy later. As with CSVs, we will make no distinction between the column header and the rest of the rows.
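
A possible sketch of this CSV layer over the String representation; splitBy is re-stated here in one common formulation so that the snippet is self-contained, and the names read_csv / write_csv are illustrative:

  import Data.List (intercalate)

  -- Split a string on every occurrence of the delimiter.
  splitBy :: Char -> String -> [String]
  splitBy delim = foldr op [""]
    where
      op c acc@(cur : rest)
        | c == delim = "" : acc
        | otherwise  = (c : cur) : rest
      op _ []        = [""]

  -- A dataset is a String; rows are separated by '\n', columns by ','.
  read_csv :: String -> [[String]]
  read_csv = map (splitBy ',') . splitBy '\n'

  write_csv :: [[String]] -> String
  write_csv = intercalate "\n" . map (intercalate ",")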

  • Students can also use the csv-uploading script and rely on Google Sheets for visualisation and experiments.

Task 2 - repairing the DS

The objective of this set of tasks is to obtain a single, consistent dataset on which further processing can then be performed.

Basic repairing

- [Medium] filter out entries in lecture grades which contain no email (implement a general filtering procedure; sketched below)
- [Hard] merge the datasets Homework grades and Exam grades using Name as key (implement a key-dependent join technique)
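
A possible shape for the general filtering procedure, over a [[String]] table whose first row is the header (the column name "Email" and the helper names are assumptions):

  import Data.List (elemIndex)

  type Row   = [String]
  type Table = [Row]   -- first row is the header

  -- Keep only the rows whose value in the named column satisfies the predicate.
  filterTable :: (String -> Bool) -> String -> Table -> Table
  filterTable keep colName (header : rows) =
    case elemIndex colName header of
      Nothing  -> header : rows                        -- unknown column: table unchanged
      Just idx -> header : filter (keep . (!! idx)) rows
  filterTable _ _ [] = []

  -- Drop the lecture-grade entries that have no email.
  dropNoEmail :: Table -> Table
  dropNoEmail = filterTable (not . null) "Email"

The key-dependent join can follow the same pattern: look up the key column in both headers, then, for each row of the first table, find the row of the second table with the same key value and glue the remaining columns together.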

Correcting typos in emails

- [Medium] implement a Levenshtein distance function (the name must be changed in the assignment to avoid internet copy/paste)

  • a basic recursive implementation does not scale
  • a basic optimisation (see code) scales poorly
  • dynamic programming works great (see the sketch after this list)
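
A dynamic-programming sketch of the distance (the standard textbook recurrence, memoised with a lazy array; not the reference solution):

  import Data.Array

  -- dist ! (i, j) is the distance between the first i characters of xs
  -- and the first j characters of ys; the array cells are filled lazily.
  levenshtein :: String -> String -> Int
  levenshtein xs ys = dist ! (n, m)
    where
      n  = length xs
      m  = length ys
      xa = listArray (1, n) xs
      ya = listArray (1, m) ys
      dist = array ((0, 0), (n, m))
               [((i, j), go i j) | i <- [0 .. n], j <- [0 .. m]]
      go 0 j = j
      go i 0 = i
      go i j
        | xa ! i == ya ! j = dist ! (i - 1, j - 1)
        | otherwise        = 1 + minimum [ dist ! (i - 1, j)        -- deletion
                                         , dist ! (i, j - 1)        -- insertion
                                         , dist ! (i - 1, j - 1) ]  -- substitution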

- [Easy] implement a projection function which extracts some columns from the table as another table
- [Medium] implement a cartesian function (see code constraints) to be used to compute Levenshtein distances; this will also be used for graphs (maybe?!). Possible shapes for both are sketched below.
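
Possible signatures and direct implementations for the two helpers, over the same header-plus-rows representation (the names mirror the Projection and Cartesian query constructors; the details are assumptions):

  import Data.List (elemIndex)
  import Data.Maybe (mapMaybe)

  type Row   = [String]
  type Table = [Row]   -- first row is the header

  -- Keep only the named columns, in the order they are requested.
  projection :: [String] -> Table -> Table
  projection cols table@(header : _) = map pick table
    where
      idxs     = mapMaybe (`elemIndex` header) cols
      pick row = map (row !!) idxs
  projection _ [] = []

  -- Combine every row of the first table with every row of the second,
  -- using the given operation; the new header is supplied explicitly.
  cartesian :: (Row -> Row -> Row) -> Row -> Table -> Table -> Table
  cartesian op newHeader (_ : rows1) (_ : rows2) =
    newHeader : [op r1 r2 | r1 <- rows1, r2 <- rows2]
  cartesian _ newHeader _ _ = [newHeader]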

- [Hard, tedious] implement a query-like sequence of operations which uses cartesian to:

  • filter those names in Emails not found in Main (those which are probably typos)
  • cross (cartesian) only those with the full list of (projected) names, and apply distance
  • sort each row of the above matrix by distance and report the correct name
  • replace the bad name by the correct one in the Emails table

Most likely, the last two steps cannot be expressed as a combination of queries, but rather have to be implemented as a user operation.
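
For instance, the last two steps could be a plain user operation along these lines, taking the distance function as a parameter (illustrative names; it assumes the bad name sits in the first column of the Emails table):

  import Data.List (minimumBy)
  import Data.Ord (comparing)

  -- Pick the reference name at minimal distance from a (possibly misspelled) name.
  bestMatch :: (String -> String -> Int) -> [String] -> String -> String
  bestMatch dist refNames typo = minimumBy (comparing (dist typo)) refNames

  -- Replace a bad name by its closest match from the reference list of names.
  correctRow :: (String -> String -> Int) -> [String] -> [String] -> [String]
  correctRow dist refNames (name : rest)
    | name `elem` refNames = name : rest
    | otherwise            = bestMatch dist refNames name : rest
  correctRow _ _ row = row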

Task 3 - anonymization

Whenever we work with a DS that contains sensitive information, we need to anonymize it. This DS has already been anonymized; however, for the sake of the exercise, we will assume the names are real.

  1. [Easy] anonymize Main by (i) removing the name field (ii) replacing the email by a hash (a possible sketch follows)
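
A minimal sketch, assuming the Name column is the first and the Email column is the second one in Main, and using a toy non-cryptographic hash (a real anonymization would use a proper hash function):

  import Data.Char (ord)

  -- Toy djb2-style string hash; fine for the exercise, not for real anonymization.
  hashString :: String -> Int
  hashString = foldl (\h c -> h * 33 + ord c) 5381

  -- Drop the name (assumed first column) and replace the email (assumed second)
  -- with the hash of its value; all other columns are kept as they are.
  anonymizeRow :: [String] -> [String]
  anonymizeRow (_name : email : rest) = show (hashString email) : rest
  anonymizeRow row                    = row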

Task 4 - rankings

  • Extract admission information (who could take part in the exam; note that not all students who could take part in the exam actually did so)
  • Todo

Task x - query language

  • Define a query language which includes all the above processing
  • Re-define the previous tasks using your query language

Task y - graph queries

  • Extend your language with graph queries … todo