===== Programming Paradigms project =====

**Idea:** Tiny Analytics Engine in Haskell

===== Datasets =====

The dataset consists of course grades of a real lecture (the names have been changed). The dataset has the following problems:
  * **lecture grades** are keyed by email address, whereas **homework grades** and **exam grades** are keyed by name.
  * **lecture grades** also contains entries with no email address. These correspond to students who did not provide an email address in a valid form.
  * In order to have a unique **key** for each student, the table **email to name student map** was created using a form. However, it contains some **typos**.

  * [[https://docs.google.com/spreadsheets/d/1d9ATUsYrcii80ffEDr8wyliZ_FLVChMF06IiuN6_GvU/edit#gid=1332667400| Homework grades]]
  * [[https://docs.google.com/spreadsheets/d/1YaUU6eGqjbw760G3YoLZVzuTTi13ZHmIy9Odrt6Yr7M/edit#gid=1503323838| Lecture grades]]
  * [[https://docs.google.com/spreadsheets/d/17rjcP8wLW1tZeJhNewKzC2Qs6nXHR775vJ0ttK-nHtw/edit#gid=673917849| Exam grades]]
  * [[https://docs.google.com/spreadsheets/d/1xoRW5Joo0IIyM4aXgXf-pv-TdDl9s1NEZemxuCqXWG4/edit#gid=484562090| Email to name map]]
  * [[? | Group assignment]] (todo?)

The query language to be implemented is shown below:
<code haskell>
{-# LANGUAGE ExistentialQuantification #-}          -- required by the 'forall a.' in Filter
data Query =
  FromCSV String                                     -- extract a table
   | ToCSV
   | forall a. Filter (FilterCondition a) Query
   | Sort String Query                            -- sort by column-name interpreted as integer. 
   | ValueMap (QVal -> QVal) Query
   | RowMap ([QVal]->QVal) Query
   | TableJoin QKey Query Query                    -- table join according to specs
   | Projection [QVal] Query
   | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query 
   | Graph EdgeOp Query

   | VUnion Query Query                           -- adds new rows to the DS (table heads should coincide)
   | HUnion Query Query                           -- 'horizontal union': glues on more columns with zero logic. TableJoin is the smart alternative
   | NameDistanceQuery                            -- dedicated name query
</code>
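
To make the intended semantics more concrete, here is a minimal sketch of an evaluator for a reduced variant of the type above. Everything in it is an assumption made for illustration: ''MiniQuery'', ''FromTable'', ''eval'' and ''toInt'' are hypothetical names, tables are plain ''Table'' values (first row is the header) rather than CSV files, ''Filter'' takes a plain row predicate, and cells that do not parse as integers are sorted as 0.

<code haskell>
import Data.List (elemIndex, sortOn)

type Row = [String]
type Table = [Row]

-- Hypothetical, reduced query type: the source is an in-memory table
-- instead of a CSV file, and only a handful of operations are covered.
data MiniQuery
  = FromTable Table
  | Filter (Row -> Bool) MiniQuery
  | Sort String MiniQuery
  | VUnion MiniQuery MiniQuery
  | HUnion MiniQuery MiniQuery

-- Parse a cell as an integer; cells that do not parse count as 0 (assumption).
toInt :: String -> Int
toInt s = case reads s of
  [(n, "")] -> n
  _ -> 0

eval :: MiniQuery -> Table
eval (FromTable t) = t
-- Keep the header, filter only the data rows.
eval (Filter keep q) = case eval q of
  [] -> []
  (header : rows) -> header : filter keep rows
-- Sort the data rows by the named column, read as an integer.
eval (Sort col q) = case eval q of
  [] -> []
  (header : rows) -> case elemIndex col header of
    Nothing -> header : rows
    Just i -> header : sortOn (toInt . (!! i)) rows
-- Append the rows of the second table when the headers coincide.
eval (VUnion q1 q2) = case (eval q1, eval q2) of
  (t1, []) -> t1
  ([], t2) -> t2
  (t1@(h1 : _), h2 : rows2)
    | h1 == h2 -> t1 ++ rows2
    | otherwise -> t1
-- Glue the tables side by side, row by row, with no further logic.
eval (HUnion q1 q2) = zipWith (++) (eval q1) (eval q2)
</code>

Representing each operation as a constructor keeps a query a plain value, so it can be built, inspected and combined before ''eval'' runs it.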

==== Task Set 1 (Accommodation) ====

One dataset (exam points) is provided already parsed as a matrix in a stub Haskell file.

  * Implement procedures which determine which students have **enough points** to pass the exam. These implementations will serve as a basis for ''ValueMap'', ''RowMap'' and ''HUnion''
  * Other similar tasks, which may (but need not) include basic filtering or sorting
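
A minimal sketch of the pass check, under assumptions that are not fixed by this page: the first row of the matrix is a header, the first column is the student name, the remaining columns are per-question points, empty or malformed cells count as 0, and the pass threshold is a parameter. The names ''toNum'', ''rowTotal'' and ''passed'' are hypothetical.

<code haskell>
type Row = [String]
type Table = [Row]

-- Per-value step (in the spirit of ValueMap): read a cell as a number,
-- treating cells that do not parse (for example empty ones) as 0.
toNum :: String -> Double
toNum s = case reads s of
  [(x, "")] -> x
  _ -> 0

-- Per-row step (in the spirit of RowMap): total points of one student,
-- skipping the name in the first column.
rowTotal :: Row -> Double
rowTotal = sum . map toNum . drop 1

-- Names of the students whose total reaches the threshold.
-- The header row is skipped.
passed :: Double -> Table -> [String]
passed threshold table =
  [name | row@(name : _) <- drop 1 table, rowTotal row >= threshold]
</code>

For example, ''passed 2.5 table'' lists the students with at least 2.5 points in total; the value 2.5 is only an illustration.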

==== Task Set 2 (Input/Output) ====

  * **Reading and writing** a dataset from CSV (relies on a correct ''splitBy'', as written in the lecture). Datasets are usually presented as CSV files. A CSV file uses a ''column delimiter'' (namely the character '','') to separate the columns of a given entry and a ''row delimiter'' (in Unix-style encoding: ''\\n'') to separate entries.

In Haskell, we will represent the dataset (DS) as a ''[[String]]''. We will refine this strategy later. As in the CSV file, we make no distinction between the header row and the rest of the rows.
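
A minimal sketch of this representation, assuming a ''splitBy'' with the signature shown here (the version written in the lecture may differ) and assuming cells never contain the delimiter, quotes or newlines. ''parseCSV'' and ''toCSV'' are hypothetical names.

<code haskell>
import Data.List (intercalate)

-- Split a string at every occurrence of the delimiter.
-- Empty fields are kept, so "a,,b" yields three fields.
splitBy :: Char -> String -> [String]
splitBy delim = foldr step [[]]
  where
    step c acc
      | c == delim = [] : acc
      | otherwise = (c : head acc) : tail acc

-- One inner list per line, one string per column.
-- 'lines' ignores a single trailing newline.
parseCSV :: String -> [[String]]
parseCSV = map (splitBy ',') . lines

-- Inverse direction: join columns with ',' and rows with '\n'.
toCSV :: [[String]] -> String
toCSV = intercalate "\n" . map (intercalate ",")
</code>

With these definitions, reading a file reduces to ''parseCSV <$> readFile path'' and writing to ''writeFile path (toCSV table)''.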

  * Students can also use the CSV-uploading script and rely on Google Sheets for visualisation and experiments.


==== Task 2 - repairing the DS ====

The objective of this set of tasks is to obtain a single, consistent dataset on which further processing can then be performed.

=== Basic repairing ===

- [Medium] filter out entries in **lecture grades** which contain no email (implement a general filtering procedure)
- [Hard] merge the datasets **Homework grades** and **Exam grades** using **Name** as key (implement a key-dependent join technique)
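
The two repairs above can be sketched as follows. This is one possible reading, not the required specification: the join keeps every row of the first table and pads with empty cells when the key is missing from the second one, tables are assumed rectangular with a header row, and ''filterRows'' and ''joinOn'' are hypothetical names.

<code haskell>
import Data.List (elemIndex)
import Data.Maybe (fromMaybe)

type Row = [String]
type Table = [Row]

-- General filter: keep the data rows whose value in the named column
-- satisfies the predicate. The header is always kept.
filterRows :: String -> (String -> Bool) -> Table -> Table
filterRows _ _ [] = []
filterRows col keep (header : rows) =
  case elemIndex col header of
    Nothing -> header : rows
    Just i -> header : filter (keep . (!! i)) rows

-- Key-dependent join: extend each row of the first table with the
-- columns of the matching row of the second table (key column dropped).
-- Rows without a match are padded with empty cells (assumption).
joinOn :: String -> Table -> Table -> Table
joinOn key (h1 : rows1) (h2 : rows2)
  | Just i <- elemIndex key h1
  , Just j <- elemIndex key h2 =
      let dropAt k row = take k row ++ drop (k + 1) row
          right = [(row !! j, dropAt j row) | row <- rows2]
          blanks = replicate (length h2 - 1) ""
          extend row = row ++ fromMaybe blanks (lookup (row !! i) right)
       in (h1 ++ dropAt j h2) : map extend rows1
joinOn _ t1 _ = t1
</code>

For instance, ''filterRows "Email" (not . null)'' drops the entries with no email, assuming the column is called ''Email''. The lookup in ''joinOn'' is linear per row; a map keyed by name would be the natural refinement for larger tables.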

=== Correcting typos in emails ===

- [Medium] implement a **Levenshtein** distance function (the name must be changed in the assignment text to avoid internet copy/paste)
    * a basic recursive implementation does not scale
    * a basic optimisation (see code) scales poorly
    * dynamic programming (DP) works great
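
A minimal sketch of the DP variant, assuming the standard unit costs for insertion, deletion and substitution. The name ''editDistance'' is hypothetical. Only the previous row of the DP table is kept, so the work is proportional to the product of the two lengths.

<code haskell>
-- Edit distance between two strings, computed row by row.
-- Row i holds the distances between the first i characters of s
-- and every prefix of t.
editDistance :: String -> String -> Int
editDistance s t = last (foldl nextRow [0 .. length t] s)
  where
    -- Build the next row from the previous one for character c of s.
    nextRow prev c = scanl step (head prev + 1) (zip3 t prev (tail prev))
      where
        -- left: cell to the left in the new row (insertion)
        -- up:   cell above in the previous row (deletion)
        -- diag: cell above-left (substitution, free when characters match)
        step left (tc, diag, up) =
          minimum [left + 1, up + 1, diag + (if tc == c then 0 else 1)]
</code>

As a quick check, ''editDistance "kitten" "sitting"'' evaluates to 3.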

- [Easy] implement a **projection** function which extracts some columns from the table as another table
- [Medium] implement a **cartesian** function (see code constraints) to be used to compute Levenshtein distances. This may also be used for graphs.
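
A minimal sketch of both operations on the ''[[String]]'' representation. The signatures mirror the ''Projection'' and ''Cartesian'' constructors with ''String'' in place of ''QVal'', which is an assumption; tables are assumed rectangular with a header row.

<code haskell>
import Data.List (elemIndex)
import Data.Maybe (mapMaybe)

type Row = [String]
type Table = [Row]

-- Keep only the named columns, in the order requested.
-- Column names missing from the header are ignored.
projection :: [String] -> Table -> Table
projection _ [] = []
projection names table@(header : _) = map pick table
  where
    indices = mapMaybe (`elemIndex` header) names
    pick row = [row !! i | i <- indices]

-- Combine every data row of the first table with every data row of the
-- second one. The caller supplies the combining function and the header
-- of the resulting table.
cartesian :: (Row -> Row -> Row) -> [String] -> Table -> Table -> Table
cartesian combine newHeader t1 t2 =
  newHeader : [combine r1 r2 | r1 <- drop 1 t1, r2 <- drop 1 t2]
</code>

Passing a combining function that computes a distance between two name cells turns ''cartesian'' into the pairwise distance table needed for typo correction.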

- [Hard,tedious] implement a query-like sequence of operations which uses **cartesian** to: 
    * filter those names in **Emails** not found in **Main** (those which are probably typos)
    * //cross// (cartesian) only those with the full list of (projected) names, and apply distance
    * sort each row of the above matrix by distance and report the correct name
    * replace the bad name by the correct one in **Emails** table

Most likely, the last two steps cannot be expressed as a combination of queries, but rather as a user operation.
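
The steps above can be sketched on plain name lists as a user operation. This is a simplified illustration, not the query-based solution: the distance function is passed in as a parameter, the reference list stands for the projected names of **Main**, and ''suspects'', ''closest'' and ''repairNames'' are hypothetical names.

<code haskell>
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Names that do not occur in the reference list: the likely typos.
suspects :: [String] -> [String] -> [String]
suspects reference = filter (`notElem` reference)

-- Closest reference name under the supplied distance function.
-- The reference list must be non-empty.
closest :: (String -> String -> Int) -> [String] -> String -> String
closest dist reference name = minimumBy (comparing (dist name)) reference

-- Replace every name missing from the reference list by its closest
-- reference name; names already present are left untouched.
repairNames :: (String -> String -> Int) -> [String] -> [String] -> [String]
repairNames _ [] names = names
repairNames dist reference names = map fix names
  where
    fix name
      | name `elem` reference = name
      | otherwise = closest dist reference name
</code>

Only the suspect names are compared against the full reference list, so the expensive distance computation is applied to a small part of the table.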

==== Task 3 - anonymization ====

Whenever we work with a DS that contains sensitive information, we need to anonymize it. This DS has already been anonymized; however, for the sake of the exercise, we will assume the names are real.

  - [Easy] anonymize **Main** by (i) removing the name field (ii) replacing ''email'' by a hash
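
A minimal sketch of this step, with the column names passed as parameters since they are not fixed here. ''simpleHash'' is a hypothetical, non-cryptographic stand-in chosen only to keep the example self-contained; for real data a salted cryptographic hash from a library would be needed, because short inputs such as emails can be recovered by brute force.

<code haskell>
import Data.Char (ord)
import Data.List (elemIndex)

type Row = [String]
type Table = [Row]

-- Hypothetical polynomial hash, for illustration only (not secure).
simpleHash :: String -> Int
simpleHash = foldl (\h c -> (h * 33 + ord c) `mod` 1000000007) 5381

-- Drop the name column and replace each email by its hash.
-- The header row keeps its remaining column names unchanged.
anonymize :: String -> String -> Table -> Table
anonymize _ _ [] = []
anonymize nameCol emailCol (header : rows) =
  keep header : map (keep . hashEmail) rows
  where
    nameIdx = elemIndex nameCol header
    emailIdx = elemIndex emailCol header
    -- Replace the cell in the email column, leave the others as they are.
    hashEmail row =
      [if Just i == emailIdx then show (simpleHash v) else v | (i, v) <- zip [0 ..] row]
    -- Remove the cell in the name column.
    keep row = [v | (i, v) <- zip [0 ..] row, Just i /= nameIdx]
</code>

Equal emails map to equal hashes, so the hashed column can still serve as a join key after the names are removed.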

==== Task 4 - rankings ====

 * Extract admission information (who was allowed to take the exam; note that not all students who were allowed to take the exam actually did so)
 * Todo

==== Task x - query language ====

 * Define a query language which includes all the above processing
 * Re-define the previous tasks using your query language

==== Task y - graph queries ====

 * Extend your language with graph queries ... todo