Programming Paradigms project
Idea: Tiny Analytics Engine in Haskell
Datasets
The dataset consists of the course grades of a real lecture (the names have been changed). It has the following problems:
lecture grades are mapped against email addresses, whereas homework grades and exam grades are mapped against names.
lecture grades also contain entries with no email address. These correspond to students who did not provide an email in a valid form.
so, in order to have a unique key for each student, an email-to-name student map table was created using a form. However, it contains some typos.
The query language to be implemented is shown below:
data Query =
    FromCSV String                  -- extract a table
  | ToCSV Query
  | AsList Query                    -- a column is turned into a list of values
  | forall a. Filter (FilterCondition a) Query
  | Sort String Query               -- sort by the given column, with values interpreted as integers
  | ValueMap (QVal -> QVal) Query
  | RowMap ([QVal] -> QVal) Query
  | TableJoin QKey Query Query      -- table join according to specs
  | Projection [QVal] Query
  | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query
  | Graph EdgeOp Query
  | VUnion Query Query              -- adds new rows to the DS (table heads should coincide)
  | HUnion Query Query              -- 'horizontal union': glues more columns with zero logic; TableJoin is the smart alternative
  | NameDistanceQuery               -- dedicated name query
data QResult = CSV String | Table [[String]] | List [String]
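The helper types QVal, QKey, EdgeOp and FilterCondition are not spelled out above. As a minimal sketch of the CSV side behind FromCSV / ToCSV, assuming plain comma-separated values with no quoting or escaping (the names parseCsv and writeCsv are illustrative, not part of the spec):

import Data.List (intercalate)

type Row   = [String]
type Table = [Row]          -- the first row is the CSV header

-- split a line on a separator character (no quoting support)
splitOn :: Char -> String -> [String]
splitOn sep = foldr step [[]]
  where
    step c acc@(cur : rest)
      | c == sep  = [] : acc
      | otherwise = (c : cur) : rest
    step _ []     = [[]]    -- never reached; keeps the match total

parseCsv :: String -> Table
parseCsv = map (splitOn ',') . lines

writeCsv :: Table -> String
writeCsv = unlines . map (intercalate ",")

As noted below, the header is simply the first row of the Table; no special case is made for it.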
Task Set 1 (Introduction)
One dataset (exam points) is provided, already parsed as a matrix, in a stub Haskell file.
Implement procedures which determine which students have enough points to pass the exam. These implementations will serve as a basis for ValueMap, RowMap and HUnion (a sketch follows after this task set).
Other similar tasks, which may (but do not necessarily) include basic filtering or sorting.
In Haskell, we will represent the DS as a String. We will refine this strategy later. As with CSVs, we will make no distinction between the column head and the rest of the rows.
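A rough sketch of the per-row computation meant here, working on the already-parsed matrix form and assuming each row is a student name followed by point columns; the 2.5 threshold and every name below are illustrative, not part of the handout:

type Row   = [String]
type Table = [Row]

-- missing grades show up as empty cells
readPoints :: String -> Float
readPoints "" = 0
readPoints s  = read s

totalPoints :: Row -> Float
totalPoints (_name : points) = sum (map readPoints points)
totalPoints []               = 0

-- a row-level map in the spirit of RowMap: keep the name, append the verdict
passedRow :: Row -> Row
passedRow row@(name : _) = [name, if totalPoints row >= 2.5 then "passed" else "failed"]
passedRow []             = []

-- applied to everything but the header; the new column can then be glued back (cf. HUnion)
passedTable :: Table -> Table
passedTable (_header : rows) = ["Name", "Verdict"] : map passedRow rows
passedTable []               = []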
Task Set 3 (Matrix)
Based on the cartesian and projection queries, implement (and visualise) the similarity graph of student lecture points. Several possible distances can be implemented (a sketch of one follows after this task set).
Students will have access to the Python CSV visualisation script, which they can modify on their own to use other graph-to-viz types of graphics.
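The sketch below shows one way the Cartesian / Graph queries could pair rows into similarity edges; the single lecture-points column, the threshold of 5 and all names are assumptions made only for illustration:

type Row   = [String]
type Table = [Row]

-- an EdgeOp in spirit: Just a value for the edge, or Nothing for no edge
edgeOp :: Row -> Row -> Maybe String
edgeOp [_, p1] [_, p2]
  | dist <= 5 = Just (show dist)
  | otherwise = Nothing
  where dist = abs (read p1 - read p2 :: Float)
edgeOp _ _    = Nothing

-- cartesian-style pairing of the data rows (header excluded), keeping only actual edges
similarityEdges :: Table -> Table
similarityEdges rows =
  ["From", "To", "Value"] :
  [ [n1, n2, v]
  | r1@(n1 : _) <- rows
  , r2@(n2 : _) <- rows
  , n1 < n2                    -- avoid duplicates and self-edges
  , Just v <- [edgeOp r1 r2]
  ]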
Task Set 4 (ADT)
Implement the ADT for query and result, and an evaluation function which couples the previous functionality.
Implement the ADT for filter (with class Eval) and use it to apply different combinations of filters to the dataset, for visualisation (a sketch follows after this task set).
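One possible shape for the filter ADT and its evaluation class, here written as FEval to keep it apart from the query evaluator; every name and constructor below is illustrative, not the required interface:

type Row   = [String]
type Table = [Row]

data FilterCondition a
  = Eq String a                 -- column equals a reference value
  | Lt String a                 -- column is strictly below the reference
  | FNot (FilterCondition a)    -- negation
  | Any [FilterCondition a]     -- at least one of the conditions holds

-- the header row is needed to locate a column by name
class FEval a where
  feval :: [String] -> FilterCondition a -> (Row -> Bool)

colIndex :: [String] -> String -> Int
colIndex header name = length (takeWhile (/= name) header)

instance FEval Float where
  feval header (Eq col v) = \row -> read (row !! colIndex header col) == v
  feval header (Lt col v) = \row -> read (row !! colIndex header col) <  v
  feval header (FNot c)   = not . feval header c
  feval header (Any cs)   = \row -> any (\c -> feval header c row) cs

-- filtering keeps the header row untouched
filterTable :: FEval a => FilterCondition a -> Table -> Table
filterTable cond (header : rows) = header : filter (feval header cond) rows
filterTable _    []              = []

A condition such as FNot (Lt "Total" 2.5) then becomes a plain row predicate that the Filter query can apply and combine freely.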
Task Set 5 (table join)
Implement several optimisations (the graph query expressed as a cartesian, combinations of sequential filters turned into a single function, etc.).
Implement table join, which allows us to merge the two CSVs using the email as key (a sketch follows after this task set).
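A naive reading of TableJoin, assuming the key column appears under the same name in both headers; quadratic on purpose, and every name below is illustrative:

import Data.List (elemIndex)
import Data.Maybe (fromMaybe)

type Row   = [String]
type Table = [Row]

tableJoin :: String -> Table -> Table -> Table
tableJoin key (h1 : rs1) (h2 : rs2) =
    (h1 ++ dropKey h2) : map joinRow rs1
  where
    i1 = fromMaybe 0 (elemIndex key h1)        -- key position in each header
    i2 = fromMaybe 0 (elemIndex key h2)
    dropKey r = take i2 r ++ drop (i2 + 1) r   -- the key column is kept only once
    padding   = replicate (length h2 - 1) ""   -- blanks for unmatched rows
    joinRow r = case [dropKey r2 | r2 <- rs2, r2 !! i2 == r !! i1] of
                  (m : _) -> r ++ m
                  []      -> r ++ padding
tableJoin _ t1 _ = t1                          -- degenerate case: one table is empty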
Task Set 6 (typos)
Filter out typoed names (using queries)
Implement the (Levenshtein) distance function and apply it as a cartesian operation:
the speed will suck
try the length optimisation, that is, first compute the difference in length between the words. The speed improves dramatically but still sucks on the big DS
implement the lazy (dynamic programming) alternative (a sketch follows after this task set)
Select the minimal distance per row, and thus determine the most likely correct name
Correct the dataset
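The lazy dynamic-programming variant mentioned above is a classic fit for Haskell: memoise the distance table in a lazy array so each cell is computed once, on demand. A minimal sketch (the function name is illustrative):

import Data.Array

-- Levenshtein distance, memoised in a lazy array: every cell is defined in
-- terms of its neighbours and evaluated on demand, so the naive exponential
-- recursion collapses to O(length a * length b)
levenshtein :: String -> String -> Int
levenshtein a b = table ! (n, m)
  where
    n  = length a
    m  = length b
    xs = listArray (1, n) a
    ys = listArray (1, m) b

    table :: Array (Int, Int) Int
    table = listArray bnds [dist i j | (i, j) <- range bnds]
      where bnds = ((0, 0), (n, m))

    dist 0 j = j
    dist i 0 = i
    dist i j = minimum
      [ table ! (i - 1, j)     + 1       -- deletion
      , table ! (i, j - 1)     + 1       -- insertion
      , table ! (i - 1, j - 1) + cost    -- substitution
      ]
      where cost = if xs ! i == ys ! j then 0 else 1

Applied as the cartesian operation over the two name columns, the minimal distance per row then points to the most likely intended name.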
Task Set 7 (data visualisation)