Differences

This shows you the differences between two versions of the page.

pp2021:project [2021/02/25 11:31]
pdmatei
pp2021:project [2021/03/04 19:24] (current)
pdmatei
Line 16: Line 16:
   * [[? | Group assignment]] (todo?)   * [[? | Group assignment]] (todo?)
  
-==== Task 1 repairing the DS ====+The query language to be implemented is shown below: 
 +<code haskell>​ 
 +data Query =
 +  FromCSV String                                -- extract a table
 +   | ToCSV Query
 +   | AsList Query                                 -- a column is turned into a list of values
 +   | forall a. Filter (FilterCondition a) Query   -- needs the ExistentialQuantification extension
 +   | Sort String Query                            -- sort by column name, values interpreted as integers
 +   | ValueMap (QVal -> QVal) Query
 +   | RowMap ([QVal] -> QVal) Query
 +   | TableJoin QKey Query Query                   -- table join according to specs
 +   | Projection [QVal] Query
 +   | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query
 +   | Graph EdgeOp Query
  
- * Create a **consistent** dataset by: +   | VUnion Query Query                           -- adds new rows to the DS (table heads should coincide)
-     **filtering** out those entries in **lecture grades** which contain no email. +   | HUnion Query Query                           -- 'horizontal union': glues tables with zero logic, adding more columns; TableJoin is the smart alternative
-     **joining** the **Homework grades** and **Exam grades** tables in a single one, henceforth called **Main** (Careful: **Exam grades** contains only entries for students who have passed the exam).+   | NameDistanceQuery                            -- dedicated name query
-     ​in order to add **lecture grades** to the previous table, we have to solve the //typo problem// from the **email to name map**: +
-         we need to implement a **distance function** (e.g. Levenshtein distance). Initially, we compute it recursively, using a list-style matrix. Later on, we use Dynamic Programming to improve the speed of the implementation. +
-         we **extract** (or **select**) the list of names from **Main** +
-         we match them with emails +
-         - for names in **Email to name map** which are not identical to those in **Main**, we map the distance function over the complete list of names and select the minimal value as a match +
-         - we correct typos in **Email to name map** in this way +
-         - then perform the simple join with **Main**+
  
-==== Task 2 - anonymization ====+data QResult = CSV String | Table [[String]] | List [String]
  
-Whenever we work with a DS that contains sensitive information, we need to anonymize it. This DS has already been anonymized; however, for the sake of the exercise, we will assume the names are real.+</code>
  
-  * anonymize **Main** by (i) removing the name field (ii) replacing ''email'' by a hash+==== Task Set 1 (Introduction) ====
  
-==== Task 3 - rankings ====+One dataset (exam points) is provided, already parsed as a matrix, in a stub Haskell file.
  
- * Extract admission information (who could take part in the exam; note that not all students who could take part did so) +  * Implement procedures which determine which students have **enough points** to pass the exam. These implementations will serve as a basis for ''ValueMap'', ''RowMap'' and ''HUnion''.
- * Todo+  * Other similar tasks, which may (but need not) include basic filtering or sorting.
  
-==== Task x - query language ​====+==== Task Set 2 (Input/​Output) ​====
  
- * Define a query language which includes all the above processing +  * **Reading and writing** a dataset from CSV (relies on a correct ''splitBy'', written in the lecture). Datasets are usually presented as CSV files. A CSV file uses a ''column_delimiter'' (namely the character '','') to separate the columns of a given entry and a ''row delimiter'' (in Unix-style encoding: ''\\n'') to separate entries.
- ​* ​Re-define ​the previous tasks using your query language+ 
 +In Haskell, we will represent the DS as a ''[[String]]''. We will refine this strategy later. As with CSVs, we will make no distinction between the header row (the column names) and the rest of the rows.
 + 
 +  * Students can also use the CSV-uploading script and rely on Google Sheets for visualisation and experiments.
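The CSV representation described above can be sketched as follows. This is only a minimal illustration, not the lecture's ''splitBy'': the names ''readCSV'' and ''writeCSV'' and the type alias ''Table'' are assumptions, and quoted fields containing commas or newlines are deliberately not handled.

```haskell
import Data.List (intercalate)

-- Assumed alias for the [[String]] representation; the header row is an ordinary row.
type Table = [[String]]

-- Split a list on a delimiter, keeping empty fields ("a,,b" has three fields).
splitBy :: Eq a => a -> [a] -> [[a]]
splitBy delim = foldr step [[]]
  where
    step x acc@(cur : rest)
      | x == delim = [] : acc          -- start a new (so far empty) field
      | otherwise  = (x : cur) : rest  -- extend the current field
    step _ [] = [[]]                   -- unreachable: the accumulator is never empty

-- Parse CSV text: rows are separated by '\n', columns by ','.
readCSV :: String -> Table
readCSV = map (splitBy ',') . splitBy '\n'

-- Inverse of readCSV for tables whose cells contain no delimiters.
writeCSV :: Table -> String
writeCSV = intercalate "\n" . map (intercalate ",")
```

With these definitions, ''readCSV "name,points\nana,7"'' evaluates to ''[["name","points"],["ana","7"]]''. A trailing newline in the input produces a final row with a single empty cell, which a real implementation may want to drop.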
 + 
 +==== Task Set 3 (Matrix) ==== 
 + 
 +  * Based on the **cartesian** and **projection** queries, implement (and visualise) the **similarity graph** of student lecture points. Several possible distances can be implemented. 
 +  * Students will have access to the python-from-csv-visualisation script, which they can modify on their own, using other **Graph-to-viz** types of graphics. 
 + 
 +==== Task Set 4 (ADT) ====
 +
 +  * Implement the ADT for **query**, **result**, and an evaluation function which couples the previous functionality.
 +  * Implement the ADT for filters (with class ''Eval'') and use it to apply different combinations of filters to the dataset, for visualisation.
 + 
 +==== Task Set 5 (table join) ==== 
 + 
 +  * Implement several optimisations (a graph query expressed as a cartesian, a sequence of filters fused into a single function, etc.).
 +  * Implement **table join**, which allows us to merge the two CSVs using the email as key.
 + 
 +==== Task Set 6 (typos) ==== 
 + 
 +  * Filter out typoed names (using queries) 
 +  * Implement the **Levenshtein** distance function and apply it as a cartesian operation:
 +      * the speed of the naive version will be poor
 +      * try the length optimisation, that is, compute the difference in length between words. The speed improves dramatically but is still poor on the big DS
 +      * implement the lazy (DP, dynamic programming) alternative
 +  * Select the minimal distance per row, and thus determine the most likely correct name
 +  * Correct the dataset 
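The lazy dynamic-programming variant mentioned above can be sketched with a lazily filled array from ''Data.Array'' (the ''array'' package that ships with GHC). The names ''levenshtein'' and ''closest'' are assumptions made for this illustration, and ''closest'' expects a non-empty candidate list.

```haskell
import Data.Array (Array, listArray, range, (!))
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Edit distance: table ! (i, j) is the distance between the first i
-- characters of s and the first j characters of t. Cells refer to each
-- other lazily, so each one is computed at most once.
levenshtein :: String -> String -> Int
levenshtein s t = table ! (m, n)
  where
    m = length s
    n = length t
    sa = listArray (1, m) s :: Array Int Char
    ta = listArray (1, n) t :: Array Int Char
    bnds = ((0, 0), (m, n))
    table = listArray bnds [dist i j | (i, j) <- range bnds] :: Array (Int, Int) Int
    dist i 0 = i                                   -- delete i characters
    dist 0 j = j                                   -- insert j characters
    dist i j = minimum
      [ table ! (i - 1, j) + 1                     -- deletion
      , table ! (i, j - 1) + 1                     -- insertion
      , table ! (i - 1, j - 1) + (if sa ! i == ta ! j then 0 else 1)  -- match or substitution
      ]

-- Pick the candidate with the minimal distance to a (possibly typoed) name.
closest :: String -> [String] -> String
closest name = minimumBy (comparing (levenshtein name))
```

Here ''levenshtein "kitten" "sitting"'' is 3, and ''closest "Jonn" ["John","Jane"]'' is ''"John"''. Mapping ''closest'' over the typoed names against the list of correct names is one way to perform the per-row minimum selection.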
 + 
 +==== Task Set 7 (data visualisation) ==== 
 + 
 +  * Plot the correlation (how?!) between final grades (which can now be computed) and other data (lab results, etc.), using queries only.
  
-==== Task y - graph queries ==== 
  
- * Extend your language with graph queries ... todo