  * [[? | Group assignment]] (todo?)
  
The query language to be implemented is shown below:
<code haskell>
data Query =
    FromCSV String                               -- extract a table
  | ToCSV Query
  | AsList Query                                 -- a column is turned into a list of values
  | forall a. Filter (FilterCondition a) Query
  | Sort String Query                            -- sort by column name, interpreted as integer
  | ValueMap (QVal -> QVal) Query
  | RowMap ([QVal] -> QVal) Query
  | TableJoin QKey Query Query                   -- table join according to specs
  | Projection [QVal] Query
  | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query
  | Graph EdgeOp Query
  | VUnion Query Query                           -- adds new rows to the DS (table heads should coincide)
  | HUnion Query Query                           -- horizontal union glues columns with no extra logic; TableJoin is the smart alternative
  | NameDistanceQuery                            -- dedicated name query

data QResult = CSV String | Table [[String]] | List [String]
</code>

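Evaluation can first be tried on a reduced version of the ADT. The sketch below is illustrative only: ''FromTable'' is a made-up constructor standing in for ''FromCSV'', and ''QVal'' is assumed to be a synonym for ''String''; the full language is the students' task.

<code haskell>
import Data.Char (toUpper)

-- Assumption: QVal is just String in this sketch.
type QVal = String

data Query
  = FromTable [[QVal]]            -- hypothetical: wrap an in-memory table
  | ValueMap (QVal -> QVal) Query -- apply a function to every value
  | AsList Query                  -- flatten the table into a list of values

data QResult = Table [[QVal]] | List [QVal] deriving (Eq, Show)

-- A minimal evaluator for the reduced language.
eval :: Query -> QResult
eval (FromTable t)  = Table t
eval (ValueMap f q) = case eval q of
                        Table t -> Table (map (map f) t)
                        List xs -> List (map f xs)
eval (AsList q)     = case eval q of
                        Table t -> List (concat t)
                        r       -> r
</code>

For example, ''eval (ValueMap (map toUpper) (FromTable [["ab"]]))'' yields ''Table [["AB"]]''.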
==== Task Set 1 (Introduction) ====

One dataset (exam points) is provided, already parsed as a matrix, in a stub Haskell file.

  * Implement procedures which determine which students have **enough points** to pass the exam. These implementations will serve as a basis for ''ValueMap'', ''RowMap'' and ''HUnion''.
  * Other similar tasks, which may (but need not) include basic filtering or sorting.

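The "enough points" check can be sketched as below. The row layout ''[name, points1, points2, ...]'' and the threshold parameter are assumptions for illustration, not part of the spec.

<code haskell>
-- Assumption: each row is [name, points...], and points parse as Float.
type Row = [String]

-- Sum all point columns of a row (everything after the name).
totalPoints :: Row -> Float
totalPoints (_ : points) = sum (map read points)
totalPoints []           = 0

-- Names of students whose summed points reach the threshold.
passed :: Float -> [Row] -> [String]
passed threshold = map head . filter ((>= threshold) . totalPoints)
</code>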
==== Task Set 2 (Input/Output) ====

  * **Reading and writing** a dataset from CSV (relies on a correct ''splitBy'' written in lecture). Datasets are usually presented as CSV files. A CSV file uses a ''column_delimiter'' (namely the character '','') to separate the columns of a given entry, and a ''row_delimiter'' (in Unix-style encoding: ''\n'').

In Haskell, we will represent the DS as a ''[[String]]''. We will refine this strategy later. As with CSVs, we will make no distinction between the column head and the rest of the rows.

  * Students can also use the CSV-uploading script and rely on Google Sheets for visualisation experiments.
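The reading and writing step can be sketched as follows, using one plausible ''splitBy'' (not necessarily the lecture version) and the delimiters described above:

<code haskell>
import Data.List (intercalate)

-- Split a string on a delimiter character, e.g. splitBy ',' "a,b" == ["a","b"].
splitBy :: Char -> String -> [String]
splitBy delim = foldr step [[]]
  where
    step c acc@(cur : rest)
      | c == delim = [] : acc          -- start a new chunk
      | otherwise  = (c : cur) : rest  -- extend the current chunk
    step _ []      = [[]]              -- unreachable: accumulator is never empty

-- Rows are separated by '\n', columns by ','.
readCSV :: String -> [[String]]
readCSV = map (splitBy ',') . splitBy '\n'

writeCSV :: [[String]] -> String
writeCSV = intercalate "\n" . map (intercalate ",")
</code>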
  
==== Task Set 3 (Matrix) ====

  * Based on the **cartesian** and **projection** queries, implement (and visualise) the **similarity graph** of student lecture points. Several possible distances can be implemented.
  * Students will have access to the python-from-csv-visualisation script, which they can modify on their own, using other **Graphviz**-style types of graphics.
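As a sketch, the similarity graph can be represented as an edge list built with a cartesian-style all-pairs operation. The row layout ''[name, totalPoints]'' and the absolute-difference distance are assumptions for illustration; any of the distances mentioned above could be plugged in.

<code haskell>
-- Assumption: each row is [name, totalPoints], points parse as Int.
type Row = [String]

-- Hypothetical distance: absolute difference of total points.
dist :: Row -> Row -> Int
dist [_, p1] [_, p2] = abs (read p1 - read p2)
dist _ _             = maxBound

-- All pairs (cartesian), keeping edges under a similarity threshold.
similarityEdges :: Int -> [Row] -> [(String, String)]
similarityEdges threshold rows =
  [ (n1, n2)
  | r1@(n1 : _) <- rows
  , r2@(n2 : _) <- rows
  , n1 < n2                       -- emit each undirected edge once
  , dist r1 r2 <= threshold ]
</code>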
  
==== Task Set 4 (ADT) ====

  * Implement the ADT for **query**, **result**, and an evaluation function which couples the previous functionality.
  * Implement the ADT for filters (with class ''Eval'') and use it to apply different combinations of filters to the dataset, for visualisation.
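One plausible shape for the filter ADT and the ''Eval'' class is sketched below; the concrete condition types are the students' design choice, and ''FieldEq''/''Both'' are invented examples.

<code haskell>
type Row = [String]

-- Anything that can be evaluated against a row yields a Bool.
class Eval c where
  evalCond :: c -> Row -> Bool

-- Two illustrative conditions: field equality and conjunction.
data FieldEq  = FieldEq Int String   -- column index, expected value
data Both a b = Both a b             -- combine two filters

instance Eval FieldEq where
  evalCond (FieldEq i v) row = row !! i == v

instance (Eval a, Eval b) => Eval (Both a b) where
  evalCond (Both c1 c2) row = evalCond c1 row && evalCond c2 row

-- Keep only the rows satisfying the condition.
applyFilter :: Eval c => c -> [Row] -> [Row]
applyFilter c = filter (evalCond c)
</code>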
  
==== Task Set 5 (table join) ====

  * Implement several optimisations (a graph query expressed as a cartesian, a combination of sequential filters turned into a single function, etc.).
  * Implement **table join**, which allows us to merge the two CSVs using the email as key.
  
==== Task Set 6 (typos) ====

  * Filter out typoed names (using queries).
  * Implement the **(Levenshtein)** distance function and apply it as a cartesian operation:
      * the speed will be poor
      * try the length optimisation, that is, first compute the difference in length between the two words; the speed improves dramatically but is still poor on the big DS
      * implement the lazy (dynamic programming) alternative
  * Select the minimal distance per row, and thus determine the most likely correct name.
  * Correct the dataset.
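The naive recursion and the dynamic-programming alternative mentioned above can be sketched as follows; ''lev'' is exponential, while ''levDP'' builds the distance matrix row by row in O(n*m).

<code haskell>
-- Naive recursive Levenshtein distance (exponential; "the speed will be poor").
lev :: String -> String -> Int
lev xs [] = length xs
lev [] ys = length ys
lev (x : xs) (y : ys)
  | x == y    = lev xs ys
  | otherwise = 1 + minimum [lev xs (y : ys), lev (x : xs) ys, lev xs ys]

-- List-based DP version: fold over xs, carrying one matrix row at a time.
levDP :: String -> String -> Int
levDP xs ys = last (foldl step [0 .. length ys] xs)
  where
    step prev x = scanl compute (head prev + 1) (zip3 prev (tail prev) ys)
      where
        -- left: value just computed; diag/up: previous row; y: column char
        compute left (diag, up, y) =
          minimum [left + 1, up + 1, diag + if x == y then 0 else 1]
</code>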
  
==== Task Set 7 (data visualisation) ====

  * Plot the correlation (how?!) between final grades (which can now be computed) and other data (lab results, etc.), using queries only.