Differences

pp2021:project [2021/03/04 12:31]
pdmatei
pp2021:project [2021/03/04 19:24] (current)
pdmatei
Line 20: Line 20:
 data Query =
   FromCSV String                                     -- extract a table
-   | ToCSV
+   | ToCSV Query
+   | AsList Query                                 -- a column is turned into a list of values
    | forall a. Filter (FilterCondition a) Query
    | Sort String Query                            -- sort by column-name interpreted as integer.
Line 33: Line 34:
    | HUnion Query Query                           -- 'horizontal union' glues more columns together with zero logic. TableJoin is the smart alternative
    | NameDistanceQuery                            -- dedicated name query
+
+data QResult = CSV String | Table [[String]] | List [String]
+
 </code>
  
-==== Task Set 1 (Accomodation) ====
+==== Task Set 1 (Introduction) ====
 
-One dataset (exam points) is presented parsed as a matrix in a stub haskell file.
+One dataset (exam points) is presented parsed as a matrix in a stub Haskell file.
 
   * Implement procedures which determine which students have **enough points** to pass the exam. These implementations will serve as a basis for ''ValueMap'', ''RowMap'' and ''HUnion''
Line 50: Line 54:
   * Students can also use the csv-uploading script and rely on Google Sheets for visualisation experiments.
  
 +==== Task Set 3 (Matrix) ====
  
-==== Task 2 - repairing the DS ====
-
+  * Based on the **cartesian** and **projection** queries, implement (and visualise) the **similarity graph** of student lecture points. Several possible distances can be implemented.
+  * Students will have access to the python-from-csv-visualisation script, which they can modify on their own, using other **Graph-to-viz** types of graphics.
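A minimal sketch of how **projection** and **cartesian** could combine into a similarity graph. It assumes a table is a ''[[String]]'' whose first row is the header and whose rows all have the same length. The signatures, the column names ''Name'' and ''Points'', and the absolute-difference distance are hypothetical choices for illustration; the course's own code constraints may differ.

```haskell
import Data.List (elemIndex)
import Data.Maybe (mapMaybe)

type Row = [String]
type Table = [Row] -- assumption: the first row is the header

-- Keep only the named columns, in the requested order.
-- Column names that are missing from the header are skipped.
projection :: [String] -> Table -> Table
projection _ [] = []
projection names (header : rows) = map pick (header : rows)
  where
    idxs = mapMaybe (`elemIndex` header) names
    pick row = [row !! i | i <- idxs]

-- Combine every data row of the first table with every data row of the
-- second one, using a caller-supplied function; the new header is explicit.
cartesian :: (Row -> Row -> Row) -> [String] -> Table -> Table -> Table
cartesian _ newHeader [] _ = [newHeader]
cartesian _ newHeader _ [] = [newHeader]
cartesian f newHeader (_ : rows1) (_ : rows2) =
  newHeader : [f r1 r2 | r1 <- rows1, r2 <- rows2]

-- Parse an integer cell; anything else counts as 0 (assumption of this sketch).
readPoints :: String -> Int
readPoints s = case reads s of
  [(n, "")] -> n
  _ -> 0

-- One possible distance: absolute difference of the points.
edge :: Row -> Row -> Row
edge [n1, p1] [n2, p2] = [n1, n2, show (abs (readPoints p1 - readPoints p2))]
edge _ _ = []

-- Edge list of the similarity graph: one row per ordered pair of students.
similarityGraph :: Table -> Table
similarityGraph t = cartesian edge ["From", "To", "Distance"] t' t'
  where
    t' = projection ["Name", "Points"] t
```

The result is itself a table, so it can be exported as CSV and handed to a visualisation script as an edge list.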
-The objective of this set of tasks is to obtain a single, consistent dataset on which further processing can then be performed.
-
-=== Basic repairing ===
-
-- [Medium] filter out entries in **lecture grades** which contain no email (implement a general filtering procedure)
-- [Hard] merge the datasets **Homework grades** and **Exam grades** using **Name** as key (implement a key-dependent join technique)
-
-=== Correcting typos in emails ===
-
-- [Medium] implement a **Levenstein** (must change the word to avoid internet copy/paste) distance function
-    * a basic recursive implementation does not scale
-    * an basic optimisation (see code) scales poorly
-    * DP works great
-
-- [Easy] implement a **projection** function which extracts some columns from the table as another table
-- [Medium] implement a **cartesian** function (see code constraints) to be used to compute levenstein distances. This will also be used for graphs (maybe?!)
-
-- [Hard,tedious] implement a query-like sequence of operations which uses **cartesian** to:
-    * filter those names in **Emails** not found in **Main** (those which are probably typos)
-    * //cross// (cartesian) only those with the full list of (projected) names, and apply distance
-    * sort each row of the above matrix by distance and report the correct name
-    * replace the bad name by the correct one in **Emails** table
  
-Most likely, the list two steps cannot be expressed as a combination of queries, but rather as a user operation.
+==== Task Set 4 (ADT) ====
 
-==== Task 3 - anonymization ====
+  * Implement the ADT for **query**, **result**, and an evaluation function which couples the previous functionality.
+  * Implement the ADT for filter (with class ''Eval'') and use it to apply different combinations of filters to the dataset, for visualisation.
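A minimal sketch of a query type, a result type and an evaluation function coupled through a class named ''Eval''. ''QResult'' follows the definition in the code above; the reduced ''MiniQuery'' type, the ''KeepRows'' constructor (a stand-in for ''Filter''), the column argument of ''AsList'', the method signature of ''Eval'' and the naive comma-splitting CSV parser are all assumptions made for illustration, not the course's definitions.

```haskell
import Data.List (intercalate)

data QResult = CSV String | Table [[String]] | List [String]
  deriving (Eq, Show)

-- Reduced, hypothetical query type; the real Query has more constructors.
data MiniQuery
  = FromCSV String                        -- parse CSV text into a table
  | ToCSV MiniQuery                       -- render a table result as CSV text
  | AsList String MiniQuery               -- turn the named column into a list
  | KeepRows ([String] -> Bool) MiniQuery -- stand-in for Filter

class Eval a where
  eval :: a -> QResult

-- Split a string on a separator character (no quoting support).
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, []) -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

parseCSV :: String -> [[String]]
parseCSV = map (splitOn ',') . lines

renderCSV :: [[String]] -> String
renderCSV = intercalate "\n" . map (intercalate ",")

instance Eval MiniQuery where
  eval (FromCSV text) = Table (parseCSV text)
  eval (ToCSV q) = case eval q of
    Table t -> CSV (renderCSV t)
    other -> other
  eval (AsList col q) = case eval q of
    Table (header : rows) ->
      case lookup col (zip header [0 ..]) of
        Just i -> List [row !! i | row <- rows]
        Nothing -> List []
    _ -> List []
  eval (KeepRows p q) = case eval q of
    Table (header : rows) -> Table (header : filter p rows) -- header is always kept
    other -> other
```

Because every constructor evaluates its sub-query first, queries nest freely, for example ''eval (ToCSV (KeepRows p (FromCSV text)))''.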
  
-Whenever we work with a DS that contains sensible information, we need to anonymize it. This DS has already been anonymized, however, for the sake of the exercise, we will assume the names are real.
+==== Task Set 5 (table join) ====
 
-  - [Easy] anonymize **Main** by (i) removing the name field (ii) replacing ''email'' by a hash
+  * Implement several optimisations (graph query expressed as cartesian, combinations of sequential filters turned into a single function, etc.).
+  * Implement **table join**, which allows us to merge the two CSVs using the email as key.
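A minimal sketch of a key-based join over two tables, assuming each table is a ''[[String]]'' with the header as first row and rectangular rows. The function name ''tableJoin'', its signature and the inner-join behaviour (rows without a match are dropped) are assumptions; the project may require different semantics for unmatched rows.

```haskell
import Data.List (elemIndex)

type Row = [String]
type Table = [Row] -- assumption: the first row is the header

-- Remove the element at position i.
dropAt :: Int -> [a] -> [a]
dropAt i xs = take i xs ++ drop (i + 1) xs

-- Inner join on a shared key column: each row of the first table is extended
-- with the non-key columns of the first matching row of the second table.
-- Returns an empty table if the key column is missing from either header.
tableJoin :: String -> Table -> Table -> Table
tableJoin key (h1 : rows1) (h2 : rows2) =
  case (elemIndex key h1, elemIndex key h2) of
    (Just i1, Just i2) ->
      let index = [(r !! i2, dropAt i2 r) | r <- rows2]
          joined = [r ++ extra | r <- rows1, Just extra <- [lookup (r !! i1) index]]
       in (h1 ++ dropAt i2 h2) : joined
    _ -> []
tableJoin _ _ _ = []
```

The association list keeps the sketch dependency-free but makes each lookup linear; a ''Data.Map'' index would be the usual choice for larger tables.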
  
-==== Task 4 - rankings ====
+==== Task Set 6 (typos) ====
 
- * Extract admission information (who could take part in exam - note - not all students which could take part in the exam did so)
- * Todo
+  * Filter out typoed names (using queries)
+  * Implement the **(Levenshtein)** distance function and apply it as a cartesian operation:
+      * the speed will be poor
+      * try the length optimisation, that is, compute the difference in length between words. The speed improves dramatically but is still poor on the big dataset
+      * implement the lazy (DP) alternative
+  * Select the minimal distance per row, and thus determine the most likely correct name
+  * Correct the dataset
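A minimal sketch of the lazy dynamic-programming variant and of the "minimal distance per row" step. It uses a lazily evaluated ''Data.Array'' as the memo table; this is one common way to express the DP in Haskell and not necessarily the one intended by the course. The names ''levenshtein'' and ''closestName'' are hypothetical.

```haskell
import Data.Array
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Edit distance via a lazily filled table: cell (i, j) holds the distance
-- between the first i characters of s and the first j characters of t.
-- Each cell refers to its neighbours and is computed at most once.
levenshtein :: String -> String -> Int
levenshtein s t = table ! (m, n)
  where
    m = length s
    n = length t
    sa = listArray (1, m) s
    ta = listArray (1, n) t
    table =
      array ((0, 0), (m, n))
        [((i, j), dist i j) | i <- [0 .. m], j <- [0 .. n]]
    dist i 0 = i
    dist 0 j = j
    dist i j
      | sa ! i == ta ! j = table ! (i - 1, j - 1)
      | otherwise =
          1 + minimum
            [ table ! (i - 1, j)     -- deletion
            , table ! (i, j - 1)     -- insertion
            , table ! (i - 1, j - 1) -- substitution
            ]

-- Pick the reference name closest to a (possibly misspelled) name.
closestName :: [String] -> String -> Maybe String
closestName [] _ = Nothing
closestName refs typo = Just (minimumBy (comparing (levenshtein typo)) refs)
```

The table has (m+1)(n+1) cells, so time and memory are quadratic in the word lengths, in contrast to the exponential cost of the plain recursive definition.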
  
-==== Task x - query language ====
+==== Task Set 7 (data visualisation) ====
 
- * Define a query language which includes all the above processing
+  * Plot the correlation (how?!) between final grades (which can now be computed) and other data (lab results, etc.), using queries only.
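The text leaves open how the correlation should be obtained. One possible option, sketched here as an assumption and not as the intended solution, is the Pearson correlation coefficient of two numeric columns that have already been extracted (for example with ''AsList'') and converted to ''Double''. The name ''pearson'' is hypothetical.

```haskell
-- Pearson correlation coefficient of two equally long samples.
-- Returns Nothing when it is undefined: fewer than two points,
-- mismatched lengths, or a column with zero variance.
pearson :: [Double] -> [Double] -> Maybe Double
pearson xs ys
  | n < 2 || length ys /= n || denom == 0 = Nothing
  | otherwise = Just (cov / denom)
  where
    n = length xs
    mean vs = sum vs / fromIntegral n
    dxs = map (subtract (mean xs)) xs -- deviations from the mean
    dys = map (subtract (mean ys)) ys
    cov = sum (zipWith (*) dxs dys)
    denom = sqrt (sum (map (^ 2) dxs) * sum (map (^ 2) dys))
```

A value close to 1 or -1 suggests a strong linear relationship between the two columns, while a value near 0 suggests none; a scatter plot of the two columns would be the matching visualisation.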
- * Re-define the previous tasks using your query language
  
-==== Task y - graph queries ==== 
  
- * Extend your language with graph queries ... todo