Programming Paradigms project
Idea: Tiny Analytics Engine in Haskell
Datasets
The dataset consists of the course grades of a real lecture (the names have been changed). It has the following problems:
lecture grades are mapped against email addresses, whereas homework grades and exam grades are mapped against names.
lecture grades also contain entries with no email address. These correspond to students who did not provide an email in a valid form.
so, in order to have a unique key for each student, an email-to-name student map table was created using a form. However, it contains some typos.
The query language to be implemented is shown below:
data Query =
    FromCSV String                  -- extract a table
  | ToCSV Query
  | AsList Query                    -- a column is turned into a list of values
  | forall a. Filter (FilterCondition a) Query
  | Sort String Query               -- sort by the given column, with values interpreted as integers
  | ValueMap (QVal -> QVal) Query
  | RowMap ([QVal] -> QVal) Query
  | TableJoin QKey Query Query      -- table join according to specs
  | Projection [QVal] Query
  | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query
  | Graph EdgeOp Query
  | VUnion Query Query              -- adds new rows to the DS (table heads should coincide)
  | HUnion Query Query              -- 'horizontal union': glues more columns with zero logic; TableJoin is the smart alternative
  | NameDistanceQuery               -- dedicated name query
data QResult = CSV String | Table [[String]] | List [String]
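The helper types QVal, QKey, EdgeOp and FilterCondition are not spelled out above. As a minimal sketch of the CSV side behind FromCSV / ToCSV, assuming plain comma-separated values with no quoting or escaping (the names parseCsv and writeCsv are illustrative, not part of the spec):

import Data.List (intercalate)

type Row   = [String]
type Table = [Row]          -- the first row is the CSV header

-- split a line on a separator character (no quoting support)
splitOn :: Char -> String -> [String]
splitOn sep = foldr step [[]]
  where
    step c acc@(cur : rest)
      | c == sep  = [] : acc
      | otherwise = (c : cur) : rest
    step _ []     = [[]]    -- never reached; keeps the match total

parseCsv :: String -> Table
parseCsv = map (splitOn ',') . lines

writeCsv :: Table -> String
writeCsv = unlines . map (intercalate ",")

As noted below, the header is simply the first row of the Table; no special case is made for it.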
Task Set 1 (Introduction)
One dataset (exam points) is provided, already parsed as a matrix, in a stub Haskell file.
Implement procedures which determine which students have enough points to pass the exam. These implementations will serve as a basis for ValueMap, RowMap and HUnion (a sketch follows after this task set).
Other similar tasks, which may (but do not necessarily) include basic filtering or sorting.
In Haskell, we will represent the DS as a String. We will refine this strategy later. As with CSVs, we will make no distinction between the column head and the rest of the rows.
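A rough sketch of the per-row computation meant here, working on the already-parsed matrix form and assuming each row is a student name followed by point columns; the 2.5 threshold and every name below are illustrative, not part of the handout:

type Row   = [String]
type Table = [Row]

-- missing grades show up as empty cells
readPoints :: String -> Float
readPoints "" = 0
readPoints s  = read s

totalPoints :: Row -> Float
totalPoints (_name : points) = sum (map readPoints points)
totalPoints []               = 0

-- a row-level map in the spirit of RowMap: keep the name, append the verdict
passedRow :: Row -> Row
passedRow row@(name : _) = [name, if totalPoints row >= 2.5 then "passed" else "failed"]
passedRow []             = []

-- applied to everything but the header; the new column can then be glued back (cf. HUnion)
passedTable :: Table -> Table
passedTable (_header : rows) = ["Name", "Verdict"] : map passedRow rows
passedTable []               = []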
Task Set 3 (Matrix)
Based on the cartesian and projection queries, implement (and visualise) the similarity graph of student lecture points. Several possible distances can be implemented (a sketch of one follows after this task set).
Students will have access to the Python CSV visualisation script, which they can modify on their own to use other graph-to-viz types of graphics.
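The sketch below shows one way the Cartesian / Graph queries could pair rows into similarity edges; the single lecture-points column, the threshold of 5 and all names are assumptions made only for illustration:

type Row   = [String]
type Table = [Row]

-- an EdgeOp in spirit: Just a value for the edge, or Nothing for no edge
edgeOp :: Row -> Row -> Maybe String
edgeOp [_, p1] [_, p2]
  | dist <= 5 = Just (show dist)
  | otherwise = Nothing
  where dist = abs (read p1 - read p2 :: Float)
edgeOp _ _    = Nothing

-- cartesian-style pairing of the data rows (header excluded), keeping only actual edges
similarityEdges :: Table -> Table
similarityEdges rows =
  ["From", "To", "Value"] :
  [ [n1, n2, v]
  | r1@(n1 : _) <- rows
  , r2@(n2 : _) <- rows
  , n1 < n2                    -- avoid duplicates and self-edges
  , Just v <- [edgeOp r1 r2]
  ]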
Task Set 4 (ADT)
Implement the ADT for query and result, and an evaluation function which couples the previous functionality.
Implement the ADT for filter (with class Eval) and use it to apply different combinations of filters to the dataset, for visualisation (a sketch follows after this task set).
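One possible shape for the filter ADT and its evaluation class, here written as FEval to keep it apart from the query evaluator; every name and constructor below is illustrative, not the required interface:

type Row   = [String]
type Table = [Row]

data FilterCondition a
  = Eq String a                 -- column equals a reference value
  | Lt String a                 -- column is strictly below the reference
  | FNot (FilterCondition a)    -- negation
  | Any [FilterCondition a]     -- at least one of the conditions holds

-- the header row is needed to locate a column by name
class FEval a where
  feval :: [String] -> FilterCondition a -> (Row -> Bool)

colIndex :: [String] -> String -> Int
colIndex header name = length (takeWhile (/= name) header)

instance FEval Float where
  feval header (Eq col v) = \row -> read (row !! colIndex header col) == v
  feval header (Lt col v) = \row -> read (row !! colIndex header col) <  v
  feval header (FNot c)   = not . feval header c
  feval header (Any cs)   = \row -> any (\c -> feval header c row) cs

-- filtering keeps the header row untouched
filterTable :: FEval a => FilterCondition a -> Table -> Table
filterTable cond (header : rows) = header : filter (feval header cond) rows
filterTable _    []              = []

A condition such as FNot (Lt "Total" 2.5) then becomes a plain row predicate that the Filter query can apply and combine freely.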
Task Set 5 (table join)
Implement several optimisations (the graph query expressed as a cartesian, combinations of sequential filters turned into a single function, etc.).
Implement table join, which allows us to merge the two CSVs using the email as key (a sketch follows after this task set).
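A naive reading of TableJoin, assuming the key column appears under the same name in both headers; quadratic on purpose, and every name below is illustrative:

import Data.List (elemIndex)
import Data.Maybe (fromMaybe)

type Row   = [String]
type Table = [Row]

tableJoin :: String -> Table -> Table -> Table
tableJoin key (h1 : rs1) (h2 : rs2) =
    (h1 ++ dropKey h2) : map joinRow rs1
  where
    i1 = fromMaybe 0 (elemIndex key h1)        -- key position in each header
    i2 = fromMaybe 0 (elemIndex key h2)
    dropKey r = take i2 r ++ drop (i2 + 1) r   -- the key column is kept only once
    padding   = replicate (length h2 - 1) ""   -- blanks for unmatched rows
    joinRow r = case [dropKey r2 | r2 <- rs2, r2 !! i2 == r !! i1] of
                  (m : _) -> r ++ m
                  []      -> r ++ padding
tableJoin _ t1 _ = t1                          -- degenerate case: one table is empty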
Task Set 6 (typos)
Filter out typoed names (using queries)
Implement the (Levenshtein) distance function and apply it as a cartesian operation:
the speed will suck
try the length optimisation, that is, first compute the difference in length between the words. The speed improves dramatically but still sucks on the big DS
implement the lazy (dynamic programming) alternative (a sketch follows after this task set)
Select the minimal distance per row, and thus determine the most likely correct name
Correct the dataset
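The lazy dynamic-programming variant mentioned above is a classic fit for Haskell: memoise the distance table in a lazy array so each cell is computed once, on demand. A minimal sketch (the function name is illustrative):

import Data.Array

-- Levenshtein distance, memoised in a lazy array: every cell is defined in
-- terms of its neighbours and evaluated on demand, so the naive exponential
-- recursion collapses to O(length a * length b)
levenshtein :: String -> String -> Int
levenshtein a b = table ! (n, m)
  where
    n  = length a
    m  = length b
    xs = listArray (1, n) a
    ys = listArray (1, m) b

    table :: Array (Int, Int) Int
    table = listArray bnds [dist i j | (i, j) <- range bnds]
      where bnds = ((0, 0), (n, m))

    dist 0 j = j
    dist i 0 = i
    dist i j = minimum
      [ table ! (i - 1, j)     + 1       -- deletion
      , table ! (i, j - 1)     + 1       -- insertion
      , table ! (i - 1, j - 1) + cost    -- substitution
      ]
      where cost = if xs ! i == ys ! j then 0 else 1

Applied as the cartesian operation over the two name columns, the minimal distance per row then points to the most likely intended name.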
Task Set 7 (data visualisation)