Differences

This shows you the differences between two versions of the page.

pp2021:project [2021/02/25 11:05]
pdmatei
pp2021:project [2021/03/04 19:24] (current)
pdmatei
Line 4: Line 4:
  
 ===== Datasets =====
 +
 +The dataset consists of course grades of a real lecture (the names have been changed). The dataset has the following problems:
 +  * **lecture grades** are mapped against email addresses, whereas **homework grades** and **exam grades** are mapped against names.
 +  * **lecture grades** also contains entries with no email address. These correspond to students who have not provided an email address in a valid form.
 +  * To obtain a unique **key** for each student, the table **email to name student map** was created using a form. However, it contains some **typos**.
  
   * [[https://docs.google.com/spreadsheets/d/1d9ATUsYrcii80ffEDr8wyliZ_FLVChMF06IiuN6_GvU/edit#gid=1332667400| Homework grades]]
Line 9: Line 14:
   * [[https://docs.google.com/spreadsheets/d/17rjcP8wLW1tZeJhNewKzC2Qs6nXHR775vJ0ttK-nHtw/edit#gid=673917849| Exam grades]]
   * [[https://docs.google.com/spreadsheets/d/1xoRW5Joo0IIyM4aXgXf-pv-TdDl9s1NEZemxuCqXWG4/edit#gid=484562090| Email to name map]]
 +  * [[? | Group assignment]] (todo?)
 +
 +The query language to be implemented is shown below:
 +<code haskell>
 +{-# LANGUAGE ExistentialQuantification #-}  -- needed for the 'forall' in Filter
 +
 +data Query =
 +     FromCSV String                               -- extract a table
 +   | ToCSV Query
 +   | AsList Query                                 -- a column is turned into a list of values
 +   | forall a. Filter (FilterCondition a) Query
 +   | Sort String Query                            -- sort by the named column, values interpreted as integers
 +   | ValueMap (QVal -> QVal) Query
 +   | RowMap ([QVal] -> QVal) Query
 +   | TableJoin QKey Query Query                   -- table join according to specs
 +   | Projection [QVal] Query
 +   | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query
 +   | Graph EdgeOp Query
 +
 +   | VUnion Query Query                           -- adds new rows to the DS (table heads should coincide)
 +   | HUnion Query Query                           -- horizontal union: glues on more columns with zero logic; TableJoin is the smart alternative
 +   | NameDistanceQuery                            -- dedicated name query
 +
 +data QResult = CSV String | Table [[String]] | List [String]
 +
 +</code>
 +
 +==== Task Set 1 (Introduction) ====
 +
 +One dataset (exam points) is provided, already parsed as a matrix, in a stub Haskell file.
 +
 +  * Implement procedures which determine which students have **enough points** to pass the exam. These implementations will serve as a basis for ''ValueMap'', ''RowMap'' and ''HUnion''.
 +  * Other similar tasks, which may (but need not) include basic filtering or sorting.
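
As a rough illustration of the first task, the sketch below selects the students whose total reaches a threshold. The table layout (a header row, then rows of the form name followed by points), the threshold value and the names ''passThreshold'', ''rowTotal'' and ''passing'' are all assumptions made for this example; the actual stub file may differ.

```haskell
-- Assumed layout: a header row, then one row per student:
-- the name first, followed by the points for each exam question.
type Table = [[String]]

-- Hypothetical passing threshold; the real value is not stated on this page.
passThreshold :: Double
passThreshold = 2.5

-- Sum the numeric cells of a row, treating empty cells as 0.
rowTotal :: [String] -> Double
rowTotal = sum . map cell
  where
    cell "" = 0
    cell s  = read s

-- Names of the students whose total reaches the threshold.
passing :: Table -> [String]
passing (_header : rows) =
  [name | (name : pts) <- rows, rowTotal pts >= passThreshold]
passing [] = []
```

''rowTotal'' has the shape of a row-level map, so a function like it could later be passed to ''RowMap''.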
 +
 +==== Task Set 2 (Input/Output) ====
 +
 +  * **Reading and writing** a dataset from CSV (relies on a correct ''splitBy'' written in the lecture). Datasets are usually presented as CSV files. A CSV file uses a ''column_delimiter'' (namely the character '','') to separate the columns of a given entry and a ''row_delimiter'' (in Unix-style encoding: ''\n'') to separate entries.
 +
 +In Haskell, we will represent the dataset as a ''[[String]]''. We will refine this strategy later. As with CSVs, we will make no distinction between the header row and the rest of the rows.
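
A minimal sketch of this representation is given below. The ''splitBy'' shown here is only one possible definition, not necessarily the one written in the lecture, and ''readCSV'' / ''writeCSV'' are names chosen for this example. Quoted fields and escaped delimiters are deliberately ignored.

```haskell
import Data.List (intercalate)

-- Split a list on a delimiter, keeping empty fields.
splitBy :: Eq a => a -> [a] -> [[a]]
splitBy d = foldr step [[]]
  where
    step x acc@(cur : rest)
      | x == d    = [] : acc
      | otherwise = (x : cur) : rest
    step _ [] = [[]]  -- unreachable: the accumulator is never empty

-- Rows are separated by '\n', columns by ','.
readCSV :: String -> [[String]]
readCSV = map (splitBy ',') . splitBy '\n'

-- Inverse of readCSV for tables whose cells contain no delimiters.
writeCSV :: [[String]] -> String
writeCSV = intercalate "\n" . map (intercalate ",")
```

With this definition a trailing newline in the input produces a final row holding a single empty cell, which a real implementation may want to drop.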
 +
 +  * Students can also use the CSV-uploading script and rely on Google Sheets for visualisation and experiments.
 +
 +==== Task Set 3 (Matrix) ====
 +
 +  * Based on the **cartesian** and **projection** queries, implement (and visualise) the **similarity graph** of student lecture points. Several possible distances can be implemented.
 +  * Students will have access to the python-from-csv-visualisation script, which they can modify on their own, using other **Graph-to-viz** types of graphics.
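
One way the similarity graph could be derived from a cartesian product is sketched below. It assumes rows of the form ''[name, points]'' and uses the absolute difference of points as the distance. Both choices, and the names ''cartesian'', ''pointsEdge'' and ''similarityEdges'', are illustrative only.

```haskell
type Row = [String]

-- Pair every row of the first table with every row of the second.
cartesian :: (Row -> Row -> Row) -> [Row] -> [Row] -> [Row]
cartesian combine xs ys = [combine x y | x <- xs, y <- ys]

-- One possible distance: absolute difference of the points column.
-- Assumes rows have the shape [name, points]; other shapes yield [].
pointsEdge :: Row -> Row -> Row
pointsEdge [n1, p1] [n2, p2] =
  [n1, n2, show (abs (read p1 - read p2 :: Double))]
pointsEdge _ _ = []

-- Edge list of the similarity graph: each unordered pair is kept once,
-- and only pairs whose distance does not exceed the threshold.
similarityEdges :: Double -> [Row] -> [Row]
similarityEdges maxDist rows =
  [ e | e@[a, b, d] <- cartesian pointsEdge rows rows
      , a < b
      , read d <= maxDist ]
```

The resulting rows (source, target, distance) can be written out as CSV and handed to a visualisation script.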
 +
 +==== Task Set 4 (ADT) ====
 +
 +  * Implement the ADT for **query**, **result**, and an evaluation function which couples the previous functionality.
 +  * Implement the ADT for filters (with class ''Eval'') and use it to apply different combinations of filters to the dataset, for visualisation.
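
The sketch below shows how a reduced query ADT, a filter ADT and an evaluation function might fit together. It is a simplification under several assumptions: ''FromTable'' stands in for ''FromCSV'', ''AsList'' takes an explicit column name, the constructors ''ColEq'', ''ColLt'' and ''FNot'' are invented for the example, and the shape of the ''Eval'' class is a guess rather than the intended design.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (elemIndex)

type Row = [String]

-- Hypothetical filter conditions over a column identified by its header name.
data FilterCondition a
  = ColEq String a
  | ColLt String a
  | FNot (FilterCondition a)

-- Turns a condition into a row predicate, given the header row.
class Eval a where
  evalCond :: [String] -> FilterCondition a -> Row -> Bool

instance Eval Double where
  evalCond header cond row = case cond of
      ColEq col v -> cell col == v
      ColLt col v -> cell col < v
      FNot c      -> not (evalCond header c row)
    where
      -- Cells are stored as text; a missing column or a non-numeric
      -- cell crashes this sketch.
      cell col = case elemIndex col header of
        Just i  -> read (row !! i) :: Double
        Nothing -> error ("no such column: " ++ col)

-- Reduced query type: only three constructors are modelled here.
data Query
  = FromTable [Row]
  | forall a. Eval a => Filter (FilterCondition a) Query
  | AsList String Query

data QResult = Table [Row] | List [String]
  deriving (Eq, Show)

eval :: Query -> QResult
eval (FromTable t) = Table t
eval (Filter cond q) = case eval q of
  Table (h : rows) -> Table (h : filter (evalCond h cond) rows)
  other            -> other
eval (AsList col q) = case eval q of
  Table (h : rows) -> List [r !! i | Just i <- [elemIndex col h], r <- rows]
  other            -> other
```

Because ''Filter'' hides the type of the compared value, a literal inside a condition needs a type annotation, for example ''ColLt "Points" (5 :: Double)''.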
 +
 +==== Task Set 5 (table join) ====
 +
 +  * Implement several optimisations (a graph query expressed as a cartesian, a sequence of filters fused into a single function, etc.).
 +  * Implement **table join**, which allows us to merge the two CSVs using the email as key.
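
A possible shape for the join is sketched below as an inner join on a column that appears in both headers. Whether unmatched rows should be dropped (as here) or kept with empty cells is not specified above, so the inner-join behaviour is an assumption, as is the name ''tableJoin''.

```haskell
import Data.List (elemIndex)

type Row = [String]

-- Inner join on a column present in both headers.
-- The key column is kept once; rows without a partner are dropped.
tableJoin :: String -> [Row] -> [Row] -> [Row]
tableJoin key (h1 : rows1) (h2 : rows2) =
  case (elemIndex key h1, elemIndex key h2) of
    (Just i, Just j) ->
      let dropAt k r = take k r ++ drop (k + 1) r
          index      = [(r !! j, dropAt j r) | r <- rows2]
      in  (h1 ++ dropAt j h2)
            : [r ++ rest | r <- rows1, Just rest <- [lookup (r !! i) index]]
    _ -> []
tableJoin _ _ _ = []
```

''lookup'' scans the second table linearly and returns the first match only; a map keyed by email would be the natural next step for larger datasets.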
 +
 +==== Task Set 6 (typos) ====
 +
 +  * Filter out misspelled names (using queries)
 +  * Implement the **Levenshtein** distance function and apply it as a cartesian operation:
 +      * the naive version will be very slow
 +      * try the length optimisation, that is, first compute the difference in length between the words. The speed improves dramatically but is still poor on the large dataset
 +      * implement the lazy (DP, dynamic programming) alternative
 +  * Select the minimal distance per row, and thus determine the most likely correct name
 +  * Correct the dataset
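
For reference, the sketch below computes the Levenshtein distance with a row-by-row dynamic programming scheme and then picks the closest reference name. This is one standard formulation and not necessarily the lazy variant the task has in mind; ''closest'' is a hypothetical helper that assumes a non-empty list of reference names.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Levenshtein distance, computed one DP row at a time.
-- Each row holds the distances from a prefix of s to every prefix of t.
levenshtein :: String -> String -> Int
levenshtein s t = last (foldl nextRow [0 .. length t] s)
  where
    nextRow [] _ = []
    nextRow prev@(p : ps) c = scanl step (p + 1) (zip3 t prev ps)
      where
        -- left: cell to the left in the new row (insertion)
        -- up:   cell above in the previous row (deletion)
        -- diag: cell above-left (substitution or match)
        step left (tc, diag, up) =
          minimum [left + 1, up + 1, diag + (if tc == c then 0 else 1)]

-- Most likely correct spelling: the reference name at minimal distance.
-- Assumes the list of reference names is non-empty.
closest :: [String] -> String -> String
closest refs name = minimumBy (comparing (levenshtein name)) refs
```

The difference in length between two words is a lower bound on their distance, so it can be used to discard candidates before running the full computation.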
 +
 +==== Task Set 7 (data visualisation) ====
 +
 +  * Plot the correlation (how?!) between final grades (which can now be computed) and other data (lab results, etc.), using queries only
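
The task leaves open how the correlation should be measured. One common choice, shown here purely as a suggestion, is the Pearson coefficient between two numeric columns (for instance obtained through ''AsList'' and converted to numbers). The function name ''pearson'' is an assumption of this sketch.

```haskell
-- Pearson correlation coefficient of two samples.
-- The inputs are truncated to their common length; the result is NaN
-- when either sample has zero variance.
pearson :: [Double] -> [Double] -> Double
pearson xs ys = covariance / sqrt (varX * varY)
  where
    (us, vs)   = unzip (zip xs ys)
    n          = fromIntegral (length us)
    mean zs    = sum zs / n
    dus        = map (subtract (mean us)) us
    dvs        = map (subtract (mean vs)) vs
    covariance = sum (zipWith (*) dus dvs)
    varX       = sum (map (^ (2 :: Int)) dus)
    varY       = sum (map (^ (2 :: Int)) dvs)
```

A value close to 1 or -1 indicates a strong linear relationship between the two columns, and a value near 0 indicates none.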
 +