===== Programming Paradigms project =====

**Idea:** Tiny Analytics Engine in Haskell

===== Datasets =====

The dataset consists of the course grades from a real lecture (the names have been changed). The dataset has the following problems:

  * **lecture grades** are mapped against email addresses, whereas **homework grades** and **exam grades** are mapped against names.
  * **lecture grades** also contains entries with no email address. These correspond to students who did not provide an email in a valid form.
  * so, in order to have a unique **key** for each student, the table **email to name student map** was created using a form. However, it contains some **typos**.

  * [[https://docs.google.com/spreadsheets/d/1d9ATUsYrcii80ffEDr8wyliZ_FLVChMF06IiuN6_GvU/edit#gid=1332667400| Homework grades]]
  * [[https://docs.google.com/spreadsheets/d/1YaUU6eGqjbw760G3YoLZVzuTTi13ZHmIy9Odrt6Yr7M/edit#gid=1503323838| Lecture grades]]
  * [[https://docs.google.com/spreadsheets/d/17rjcP8wLW1tZeJhNewKzC2Qs6nXHR775vJ0ttK-nHtw/edit#gid=673917849| Exam grades]]
  * [[https://docs.google.com/spreadsheets/d/1xoRW5Joo0IIyM4aXgXf-pv-TdDl9s1NEZemxuCqXWG4/edit#gid=484562090| Email to name map]]
  * [[? | Group assignment]] (todo?)

The query language to be implemented is shown below:

<code haskell>
data Query
  = FromCSV String             -- extract a table
  | ToCSV Query
  | AsList Query               -- a column is turned into a list of values
  | forall a. Filter (FilterCondition a) Query
  | Sort String Query          -- sort by column name; values interpreted as integers
  | ValueMap (QVal -> QVal) Query
  | RowMap ([QVal] -> QVal) Query
  | TableJoin QKey Query Query -- table join according to specs
  | Projection [QVal] Query
  | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query
  | Graph EdgeOp Query
  | VUnion Query Query         -- adds new rows to the dataset (table heads should coincide)
  | HUnion Query Query         -- horizontal union: glues on more columns with zero logic; TableJoin is the smart alternative
  | NameDistanceQuery          -- dedicated name query

data QResult = CSV String | Table [[String]] | List [String]
</code>

==== Task Set 1 (Introduction) ====

One dataset (exam points) is presented parsed as a matrix in a stub Haskell file.

  * Implement procedures which determine which students have **enough points** to pass the exam. These implementations will serve as a basis for ''ValueMap'', ''RowMap'' and ''HUnion''.
  * Other similar tasks, which may (but need not) include basic filtering or sorting.

==== Task Set 2 (Input/Output) ====

  * **Reading and writing** a dataset from/to CSV (relies on a correct ''splitBy'' written in lecture). Datasets are usually presented as CSV files. A CSV file uses a ''column delimiter'' (namely the character '','') to separate the columns of a given entry, and a ''row delimiter'' (in Unix-style encoding: ''\\n''). In Haskell, we will represent the dataset as a ''[[String]]''. We will refine this strategy later. As with CSVs, we will make no distinction between the column header and the rest of the rows. A minimal sketch of this CSV layer appears after Task Set 3.
  * Students can also use the CSV-uploading script and rely on Google Sheets for visualisation experiments.

==== Task Set 3 (Matrix) ====

  * Based on the **cartesian** and **projection** queries, implement (and visualise) the **similarity graph** of student lecture points. Several possible distances can be implemented; one is sketched below.
  * Students will have access to the python-from-csv-visualisation script, which they can modify on their own, using other **Graph-to-viz** types of graphics.
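A minimal sketch of the ''splitBy''-based CSV layer from Task Set 2, under the ''[[String]]'' representation above (''Dataset'', ''readCSV'' and ''writeCSV'' are placeholder names, not a required API):

<code haskell>
import Data.List (intercalate)

type Dataset = [[String]]

-- splitBy, as written in lecture: split a string on a delimiter character
splitBy :: Char -> String -> [String]
splitBy delim = foldr op [""]
  where
    op c acc@(cur : rest)
      | c == delim = "" : acc
      | otherwise  = (c : cur) : rest

-- rows are separated by '\n', the cells of a row by ','
readCSV :: String -> Dataset
readCSV = map (splitBy ',') . splitBy '\n'

writeCSV :: Dataset -> String
writeCSV = intercalate "\n" . map (intercalate ",")
</code>

Note that this naive splitter does not handle quoted cells that contain commas; the datasets above are assumed not to contain any.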
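For the Task Set 3 similarity graph, one possible distance (an assumption on our part, as is the ''name : points'' row layout) is the Manhattan distance between two students' point rows, written so that it can serve as the combining function of a ''Cartesian'' query:

<code haskell>
type QVal = String

-- Manhattan distance between two rows of the form name : points
dist :: [QVal] -> [QVal] -> [QVal]
dist (n1 : ps1) (n2 : ps2) = [n1, n2, show d]
  where
    d    = sum (zipWith (\a b -> abs (a - b)) (nums ps1) (nums ps2))
    nums = map read :: [String] -> [Int]
dist _ _ = []

-- the similarity table could then be expressed roughly as:
--   Cartesian dist ["Name1", "Name2", "Distance"] q q
</code>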
==== Task Set 4 (ADT) ====

  * Implement the ADT for **query**, **result**, and an evaluation function which couples the previous functionality.
  * Implement the ADT for filters (with the class ''Eval'') and use it to apply different combinations of filters to the dataset, for visualisation. (A possible shape is sketched at the end of this page.)

==== Task Set 5 (table join) ====

  * Implement several optimisations (a graph query expressed as a cartesian, combinations of sequential filters turned into a single function, etc.).
  * Implement **table join**, which allows us to merge the two CSVs using the email as key.

==== Task Set 6 (typos) ====

  * Filter out typoed names (using queries).
  * Implement the **Levenshtein** distance function and apply it as a cartesian operation:
    * the speed will suck
    * try the length optimisation, that is, first compute the difference in length between words. The speed improves dramatically, but still sucks on the big dataset
    * implement the lazy dynamic-programming (DP) alternative (sketched at the end of this page)
  * Select the minimal distance per row, and thus determine the most likely correct name
  * Correct the dataset

==== Task Set 7 (data visualisation) ====

  * Plot the correlation (how?!) between final grades (which can now be computed) and other data (lab results, etc.), using queries only.
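A possible shape for the Task Set 4 filter ADT, with the class ''Eval'' turning a condition into a row predicate (the constructor and helper names are illustrative, not prescribed):

<code haskell>
type QVal = String

data FilterCondition a
  = FEq  String a            -- cell in the named column equals the value
  | FLt  String a            -- cell in the named column is less than the value
  | FNot (FilterCondition a) -- negation of another condition

-- given the table header, a condition becomes a predicate on rows
class Eval a where
  eval :: [String] -> FilterCondition a -> ([QVal] -> Bool)

instance Eval Float where
  eval header (FEq col x) = \row -> read (getCell header col row) == x
  eval header (FLt col x) = \row -> read (getCell header col row) <  x
  eval header (FNot c)    = not . eval header c

-- look up a cell by column name (assumes the column exists)
getCell :: [String] -> String -> [QVal] -> QVal
getCell header col row = snd . head . filter ((== col) . fst) $ zip header row
</code>

Note that the ''forall a. Filter (FilterCondition a) Query'' constructor of the ''Query'' ADT requires the ''ExistentialQuantification'' extension (or GADT syntax).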
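For Task Set 6, a sketch of the lazy dynamic-programming alternative for the Levenshtein distance: the memo table is a lazy array whose cells reference one another, so each cell is computed at most once, on demand:

<code haskell>
import Data.Array

levenshtein :: String -> String -> Int
levenshtein xs ys = table ! (m, n)
  where
    m  = length xs
    n  = length ys
    ax = listArray (1, m) xs
    ay = listArray (1, n) ys
    -- table ! (i, j) = distance between the first i chars of xs
    -- and the first j chars of ys
    table = listArray ((0, 0), (m, n))
              [ cell i j | i <- [0 .. m], j <- [0 .. n] ]
    cell i 0 = i
    cell 0 j = j
    cell i j
      | ax ! i == ay ! j = table ! (i - 1, j - 1)
      | otherwise        = 1 + minimum [ table ! (i - 1, j)      -- deletion
                                       , table ! (i, j - 1)      -- insertion
                                       , table ! (i - 1, j - 1)  -- substitution
                                       ]
</code>

Applied as a cartesian operation between the two name columns, selecting the minimal distance per row then points to the most likely correct spelling.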