Programming Paradigms project
Idea: Tiny Analytics Engine in Haskell
Datasets
The dataset consists of course grades of a real lecture (the names have been changed). The dataset has the following problems:
- lecture grades are mapped against email addresses, whereas homework grades and exam grades are mapped against names.
- lecture grades also contain entries with no email address. These correspond to students who did not provide an email address in a valid form.
- so, in order to have a unique key for each student, an email-to-name map table was created using a form. However, it contains some typos.
- Group assignment (todo?)
The query language to be implemented is shown below:
data Query =
    FromCSV String                               -- extract a table
  | ToCSV Query
  | AsList Query                                 -- a column is turned into a list of values
  | forall a. Filter (FilterCondition a) Query
  | Sort String Query                            -- sort by column name, interpreted as integer
  | ValueMap (QVal -> QVal) Query
  | RowMap ([QVal] -> QVal) Query
  | TableJoin QKey Query Query                   -- table join according to specs
  | Projection [QVal] Query
  | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query
  | Graph EdgeOp Query
  | VUnion Query Query                           -- adds new rows to the DS (table heads should coincide)
  | HUnion Query Query                           -- 'horizontal union': glues on more columns with zero logic; TableJoin is the smart alternative
  | NameDistanceQuery                            -- dedicated name query

data QResult = CSV String | Table [[String]] | List [String]
Task Set 1 (Introduction)
One dataset (exam points) is presented parsed as a matrix in a stub Haskell file.
- Implement procedures which determine which students have enough points to pass the exam. These implementations will serve as a basis for ValueMap, RowMap and HUnion.
- Other similar tasks, which may (but need not) include basic filtering or sorting.
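As an illustration of the first task, here is a minimal sketch. The table layout (header row first, name in column 0, one numeric column per exam question), the sample rows, and the pass mark of 2.5 are all assumptions made for the example; the real stub file may differ.

```haskell
-- Hypothetical layout: header row, then one row per student.
type Table = [[String]]

-- Invented sample data, only for illustration.
examPoints :: Table
examPoints =
  [ ["Name",   "Q1",  "Q2",  "Q3"]
  , ["Ana",    "2.5", "1.0", "2.0"]
  , ["Bogdan", "0.5", "1.0", "0.5"]
  ]

-- Assumed pass mark; the text does not state the real threshold.
passMark :: Double
passMark = 2.5

-- Sum the points of one row (every column after the name).
rowTotal :: [String] -> Double
rowTotal = sum . map read . drop 1

-- Names of the students whose total reaches the pass mark.
passing :: Table -> [String]
passing table = [name | row@(name : _) <- drop 1 table, rowTotal row >= passMark]

-- Append a "Total" column: a row-wise computation glued on horizontally,
-- which is roughly the pattern that RowMap and HUnion generalise.
withTotal :: Table -> Table
withTotal []         = []
withTotal (h : rows) = (h ++ ["Total"]) : [r ++ [show (rowTotal r)] | r <- rows]
```

With the sample data, `passing examPoints` keeps only the first student, and `withTotal examPoints` adds one extra column to every row.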
Task Set 2 (Input/Output)
- Reading and writing a dataset from CSV. (Relies on a correct splitBy written in lecture.) Datasets are usually presented as CSV files. A CSV file uses a column_delimiter (namely the character ,) to separate the columns of a given entry and a row delimiter (in Unix-style encoding: \n).
In Haskell, we will represent the DS as a String. We will refine this strategy later. As with CSVs, we will make no distinction between the column head and the rest of the rows.
- Students can also use the csv-uploading script and rely on Google Sheets for visualisation experiments.
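The reading and writing step could be sketched as follows. The `splitBy` here is only a stand-in for the version written in lecture, and `readCSV` / `writeCSV` are hypothetical names. The sketch assumes that cells never contain the delimiter or quotes, so it does not handle quoted fields.

```haskell
import Data.List (intercalate)

-- Split a list on a separator element (stand-in for the lecture's splitBy).
-- Empty fields are preserved: splitBy ',' "a,,b" gives ["a", "", "b"].
splitBy :: Eq a => a -> [a] -> [[a]]
splitBy sep = foldr step [[]]
  where
    step x acc@(cur : rest)
      | x == sep  = [] : acc
      | otherwise = (x : cur) : rest
    step _ [] = [[]]  -- unreachable: the accumulator is never empty

-- Parse CSV text into rows of cells; blank lines (for example the one
-- produced by a trailing '\n') are dropped.
readCSV :: String -> [[String]]
readCSV = map (splitBy ',') . filter (not . null) . splitBy '\n'

-- Render rows of cells back into CSV text, without a trailing newline.
writeCSV :: [[String]] -> String
writeCSV = intercalate "\n" . map (intercalate ",")
```

For input without blank lines or a trailing newline, `writeCSV . readCSV` gives back the original text, which is a convenient sanity check.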
Task Set 3 (Matrix)
- Based on the cartesian and projection queries, implement (and visualise) the similarity graph of student lecture points. Several possible distances can be implemented.
- Students will have access to the python-from-csv-visualisation script, which they can modify on their own, using other Graph-to-viz types of graphics.
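One way to read the similarity graph is as a cartesian product of the table with itself, where each pair of rows becomes an edge labelled with a distance. The sketch below assumes that column 0 holds the student key and all other columns are numeric; the function names and the threshold parameter are illustrative only.

```haskell
type Row = [String]

-- Numeric part of a row, assuming column 0 is the key.
points :: Row -> [Double]
points = map read . drop 1

-- Two candidate distances between point vectors.
manhattan, euclidean :: [Double] -> [Double] -> Double
manhattan xs ys = sum (map abs (zipWith (-) xs ys))
euclidean xs ys = sqrt (sum (map (\d -> d * d) (zipWith (-) xs ys)))

-- Cartesian product of the rows with themselves. Each unordered pair is
-- visited once, and an edge [from, to, distance] is emitted only when the
-- distance does not exceed the threshold.
similarityEdges :: ([Double] -> [Double] -> Double) -> Double -> [Row] -> [Row]
similarityEdges dist threshold rows =
  [ [a, b, show d]
  | (i, ra@(a : _)) <- indexed
  , (j, rb@(b : _)) <- indexed
  , i < j
  , let d = dist (points ra) (points rb)
  , d <= threshold
  ]
  where
    indexed = zip [0 :: Int ..] rows
```

The resulting edge list is itself a table, so it can be written out as CSV and handed to a visualisation script.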
Task Set 4 (ADT)
- Implement the ADT for query, result, and an evaluation function which couples the previous functionality.
- Implement the ADT for filter (with class Eval) and use it to apply different combinations of filters to the dataset, for visualisation.
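A reduced sketch of how the query ADT, the filter ADT and the evaluation function might fit together is given below. Only four Query constructors are covered. The constructors of FilterCondition, the method name `evalCond`, the `Eval a =>` constraint on Filter, and the helper `splitOn` are assumptions made for this example, not the project's actual definitions.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (intercalate)

-- Hypothetical conditions on a named column holding values of type a.
data FilterCondition a
  = ColEq String a
  | ColLt String a
  | FNot (FilterCondition a)
  | FAll [FilterCondition a]        -- conjunction of conditions

-- Eval turns a condition into a row predicate, given the table header.
class Eval a where
  evalCond :: [String] -> FilterCondition a -> [String] -> Bool

instance Eval Double where
  evalCond h (ColEq c v) r = maybe False ((== v) . read) (lookup c (zip h r))
  evalCond h (ColLt c v) r = maybe False ((< v) . read) (lookup c (zip h r))
  evalCond h (FNot f)    r = not (evalCond h f r)
  evalCond h (FAll fs)   r = all (\f -> evalCond h f r) fs

-- A small subset of the query language.
data Query
  = FromCSV String
  | ToCSV Query
  | AsList Query
  | forall a. Eval a => Filter (FilterCondition a) Query

data QResult = CSV String | Table [[String]] | List [String]
  deriving (Eq, Show)

-- Split a string on a delimiter character (no quoting support).
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (x, [])       -> [x]
  (x, _ : rest) -> x : splitOn sep rest

-- Evaluate a query; results of an unexpected shape are passed through.
evalQuery :: Query -> QResult
evalQuery (FromCSV s) = Table (map (splitOn ',') (lines s))
evalQuery (ToCSV q) = case evalQuery q of
  Table t -> CSV (intercalate "\n" (map (intercalate ",") t))
  other   -> other
evalQuery (AsList q) = case evalQuery q of
  Table (_ : rows) -> List [c | (c : _) <- rows]  -- first column, header dropped
  other            -> other
evalQuery (Filter cond q) = case evalQuery q of
  Table (h : rows) -> Table (h : filter (evalCond h cond) rows)
  other            -> other
```

For example, `evalQuery (AsList (Filter (ColLt "Points" (5 :: Double)) (FromCSV csvText)))` would list the first column of every row whose `Points` value is below 5. Combinations of filters can be expressed by nesting `FAll` and `FNot`.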
Task Set 5 (table join)
- Implement several optimisations (a graph query expressed as a cartesian query, combinations of sequential filters turned into a single function, etc.).
- Implement table join, which allows us to merge the two CSVs using the email as key.
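The join itself could look roughly like the sketch below: an inner join on a column name that appears in both headers. The name `tableJoin`, the helper `dropAt`, and the choice to drop unmatched rows are assumptions. Note that rows with an empty key (such as the lecture entries with no email address) are not treated specially here and would need filtering beforehand.

```haskell
import Data.List (elemIndex)
import Data.Maybe (mapMaybe)

type Table = [[String]]

-- Remove the element at position n, if there is one.
dropAt :: Int -> [a] -> [a]
dropAt n xs = take n xs ++ drop (n + 1) xs

-- Inner join on a key column present in both headers. The key column is
-- kept once; rows of the first table without a match are dropped. If a key
-- occurs several times in the second table, the first occurrence wins.
tableJoin :: String -> Table -> Table -> Table
tableJoin key (h1 : rows1) (h2 : rows2) =
  case (elemIndex key h1, elemIndex key h2) of
    (Just i, Just j) -> (h1 ++ dropAt j h2) : mapMaybe (joinRow i (index j)) rows1
    _                -> []
  where
    -- Association list from key value to the remaining cells of the row.
    index j = [(r !! j, dropAt j r) | r <- rows2, length r > j]
    joinRow i idx r
      | length r > i = (r ++) <$> lookup (r !! i) idx
      | otherwise    = Nothing
tableJoin _ _ _ = []
```

The association list makes each lookup linear in the size of the second table; a map-based index would be the natural next optimisation.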
Task Set 6 (typos)
- Filter out typoed names (using queries)
- Implement the (Levenshtein) distance function and apply it as a cartesian operation:
- the speed will suck
- try the length optimisation, that is, compute the difference in length between words first. This difference is a lower bound on the distance, so pairs whose lengths differ too much can be skipped. The speed improves dramatically but still sucks on the big dataset
- implement the lazy (DP, dynamic programming) alternative
- Select the minimal distance per row, and thus determine the most likely correct name
- Correct the dataset
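The distance steps above can be sketched as follows, assuming the `array` package that ships with GHC. The table is a lazy array whose cells refer to each other, so each cell is computed at most once. The names `lengthBound` and `closest` are hypothetical, and the way the length bound is used to skip candidates is one possible reading of the length optimisation.

```haskell
import Data.Array (Array, array, listArray, (!))

-- Levenshtein distance via a lazily filled dynamic-programming table.
levenshtein :: String -> String -> Int
levenshtein s t = table ! (m, n)
  where
    m  = length s
    n  = length t
    sa = listArray (1, m) s :: Array Int Char
    ta = listArray (1, n) t :: Array Int Char
    -- Cells are thunks that reference neighbouring cells of the same table.
    table :: Array (Int, Int) Int
    table = array ((0, 0), (m, n))
              [((i, j), dist i j) | i <- [0 .. m], j <- [0 .. n]]
    dist i 0 = i
    dist 0 j = j
    dist i j = minimum
      [ table ! (i - 1, j) + 1                                  -- deletion
      , table ! (i, j - 1) + 1                                  -- insertion
      , table ! (i - 1, j - 1) + (if sa ! i == ta ! j then 0 else 1)  -- substitution
      ]

-- Cheap lower bound: the distance is at least the difference in length.
lengthBound :: String -> String -> Int
lengthBound a b = abs (length a - length b)

-- Closest reference name for a possibly misspelled one. A candidate is
-- skipped when its length bound cannot beat the best distance so far.
closest :: [String] -> String -> Maybe (String, Int)
closest refs name = foldl step Nothing refs
  where
    step Nothing r = Just (r, levenshtein name r)
    step best@(Just (_, d)) r
      | lengthBound name r >= d = best
      | otherwise =
          let d' = levenshtein name r
          in if d' < d then Just (r, d') else best
```

Mapping `closest referenceNames` over the column of typed names gives, per row, the most likely correct name together with its distance, which can then be used to correct the dataset.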
Task Set 7 (data visualisation)
- Plot the correlation (how?!) between final grades (which can now be computed) and other data (lab results, etc.), using queries only
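The text leaves open how the correlation should be measured. One common candidate is the Pearson correlation coefficient between two numeric columns; the sketch below assumes that choice, and the function name `pearson` is hypothetical. It returns Nothing when the coefficient is undefined (fewer than two points, mismatched lengths, or a constant column).

```haskell
-- Pearson correlation coefficient of two equally long samples.
pearson :: [Double] -> [Double] -> Maybe Double
pearson xs ys
  | n < 2 || length ys /= n = Nothing
  | sx == 0 || sy == 0      = Nothing      -- a constant column has no correlation
  | otherwise               = Just (cov / sqrt (sx * sy))
  where
    n       = length xs
    mean vs = sum vs / fromIntegral (length vs)
    dx      = map (subtract (mean xs)) xs  -- deviations from the mean
    dy      = map (subtract (mean ys)) ys
    cov     = sum (zipWith (*) dx dy)
    sx      = sum (map (\d -> d * d) dx)
    sy      = sum (map (\d -> d * d) dy)
```

The two input lists could be obtained with AsList queries over the joined table, so that the computation stays within the query language apart from this final step.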