  * [[? | Group assignment]] (todo?)
The query language to be implemented is shown below:

<code haskell>
{-# LANGUAGE ExistentialQuantification #-}

data Query =
    FromCSV String    -- extract a table
  | ToCSV Query
  | AsList Query      -- a column is turned into a list of values
  | forall a. Filter (FilterCondition a) Query
  | Sort String Query -- sort by column-name, values interpreted as integers
  | ValueMap (QVal -> QVal) Query
  | RowMap ([QVal] -> QVal) Query
  | TableJoin QKey Query Query -- table join according to the specs
  | Projection [QVal] Query
  | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query
  | Graph EdgeOp Query
  | VUnion Query Query -- adds new rows to the DS (table heads should coincide)
  | HUnion Query Query -- 'horizontal union' glues more columns with zero logic; TableJoin is the smart alternative
  | NameDistanceQuery  -- dedicated name query

data QResult = CSV String | Table [[String]] | List [String]
</code>
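To make the pipeline shape concrete, here is a hedged example of a composed query. ''Gt'' is a hypothetical ''FilterCondition'' constructor and the column name ''"Exam"'' is made up; neither is part of the spec above.

<code haskell>
-- a sketch, not part of the spec: read a table from a CSV string,
-- keep only the rows where a hypothetical "Exam" column exceeds 25
-- (Gt is an assumed FilterCondition constructor), sort by that
-- column, and render the result back to CSV
example :: String -> Query
example csvContents =
  ToCSV (Sort "Exam" (Filter (Gt "Exam" 25) (FromCSV csvContents)))
</code>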
==== Task Set 1 (Introduction) ====
One dataset (exam points) is provided, already parsed as a matrix, in a stub Haskell file.
  * Implement procedures which determine which students have **enough points** to pass the exam. These implementations will serve as a basis for ''ValueMap'', ''RowMap'' and ''HUnion'' (a sketch follows this list).
  * Other similar tasks, which may (but need not) include basic filtering or sorting.
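A minimal sketch of the pass/fail check, assuming the stub's matrix is a ''[[String]]'' whose first column is the student name and whose remaining columns are points; the 2.5-point threshold below is a made-up example:

<code haskell>
type Row = [String]

-- sketch: a student passes if her point columns sum to at least a
-- (hypothetical) threshold; empty cells are counted as zero
passed :: Float -> Row -> Bool
passed threshold (_name : points) = sum (map toF points) >= threshold
  where toF "" = 0
        toF s  = read s :: Float
passed _ [] = False

-- applied over the whole matrix, skipping the header row
passTable :: [Row] -> [Row]
passTable (_header : rows) = filter (passed 2.5) rows
passTable []               = []
</code>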
==== Task Set 2 (Input/Output) ====
  * **Reading and writing** a dataset from CSV (relies on a correct ''splitBy'' written in lecture). DSets are usually presented as CSV files. A CSV file uses a ''column delimiter'' (namely the character '','') to separate between the columns of an entry, and a ''row delimiter'' (in Unix-style encoding: ''\n''). A sketch of ''splitBy'' and the CSV conversions follows this list.
In Haskell, we will represent the DS as a ''[[String]]''. We will refine this strategy later. As with CSVs, we will make no distinction between the column head and the rest of the rows.
  * Students can also use the CSV-uploading script and rely on Google Sheets for visualisation experiments.
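One possible shape for these conversions (a sketch; the lecture's ''splitBy'' may differ in details such as quoted fields, which we ignore here):

<code haskell>
import Data.List (intercalate)

-- split a string on a delimiter character (no quoting support);
-- the accumulator starts non-empty, so the pattern never fails
splitBy :: Char -> String -> [String]
splitBy delim = foldr op [[]]
  where op c acc@(cur : rest)
          | c == delim = [] : acc
          | otherwise  = (c : cur) : rest

-- CSV <-> [[String]], following the delimiters described above
fromCSV :: String -> [[String]]
fromCSV = map (splitBy ',') . splitBy '\n'

toCSV :: [[String]] -> String
toCSV = intercalate "\n" . map (intercalate ",")
</code>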
==== Task Set 3 (Matrix) ====
  * Based on the **cartesian** and **projection** queries, implement (and visualise) the **similarity graph** of student lecture points. Several possible distances can be implemented; one option is sketched after this list.
  * Students will have access to the python-from-csv-visualisation script, which they can modify on their own, using other **Graph-to-viz** types of graphics.
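As a hedged illustration, one possible distance between two students' point rows. The spec leaves ''EdgeOp'' open, so the type below, the Manhattan-style distance and the 1.0 threshold are all assumptions:

<code haskell>
-- assumed shape for EdgeOp: produce an edge label for "similar" rows,
-- or Nothing for no edge; rows carry the name in their first column
type EdgeOp = [String] -> [String] -> Maybe String

-- Manhattan-style distance over the point columns
similarOp :: EdgeOp
similarOp (_ : ps1) (_ : ps2)
  | d <= 1.0  = Just (show d)   -- connect sufficiently similar students
  | otherwise = Nothing
  where d = sum (zipWith (\a b -> abs (toF a - toF b)) ps1 ps2)
        toF "" = 0
        toF s  = read s :: Float
similarOp _ _ = Nothing
</code>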
==== Task Set 4 (ADT) ====
  * Implement the ADT for **query** and **result**, and an evaluation function which couples the previous functionality.
  * Implement the ADT for filters (with class ''Eval'') and use it to apply different combinations of filters to the dataset, for visualisation. A sketch of the evaluation glue follows.
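A sketch of the evaluation glue, assuming the ''Query''/''QResult'' types from the top of the page and the ''fromCSV''/''toCSV'' helpers sketched earlier; only two constructors are shown:

<code haskell>
class Eval a where
  eval :: a -> QResult

instance Eval Query where
  eval (FromCSV str) = Table (fromCSV str)
  eval (ToCSV query) = case eval query of
                         Table t -> CSV (toCSV t)
                         other   -> other
  eval _             = error "remaining constructors omitted in this sketch"
</code>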
==== Task Set 5 (table join) ====
  * Implement several optimisations (a graph query expressed as a cartesian, a combination of sequential filters turned into a single function, etc.).
  * Implement **table join**, which allows us to merge the two CSVs using the email as key (one possible shape is sketched below).
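A sketch of the join over ''[[String]]'' tables with header rows; rows with missing keys are simply dropped, and duplicate keys get no special treatment:

<code haskell>
import Data.List (elemIndex)

-- join two tables on the column named `key`: each row of t1 is glued
-- to the matching row of t2, minus t2's copy of the key column
tableJoin :: String -> [[String]] -> [[String]] -> [[String]]
tableJoin key t1@(h1 : _) t2@(h2 : _) =
  case (elemIndex key h1, elemIndex key h2) of
    (Just i, Just j) ->
      (h1 ++ dropAt j h2) :
      [ r1 ++ dropAt j r2
      | r1 <- tail t1, r2 <- tail t2, r1 !! i == r2 !! j ]
    _ -> []  -- the key column is missing from one of the tables
  where dropAt k xs = take k xs ++ drop (k + 1) xs
tableJoin _ _ _ = []
</code>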
==== Task Set 6 (typos) ====
  * Filter out typoed names (using queries).
  * Implement the **Levenshtein** distance function and apply it as a cartesian operation:
    * a naive recursive implementation will be far too slow
    * try the length optimisation, i.e. first compare the difference in length between words; speed improves dramatically but is still poor on the big DS
    * implement the lazy dynamic-programming (DP) alternative (sketched after this list)
  * Select the minimal distance per row, and thus determine the most likely correct name.
  * Correct the dataset.
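For the last sub-task, the classic lazy-array formulation of the DP (a well-known technique, sketched here; not necessarily the reference solution):

<code haskell>
import Data.Array

-- lazy dynamic-programming Levenshtein: the table is defined in terms
-- of itself, and laziness orders the computation, so each cell is
-- evaluated at most once -- the memoisation the naive recursion lacks
levenshtein :: String -> String -> Int
levenshtein xs ys = table ! (m, n)
  where
    m = length xs
    n = length ys
    x = listArray (1, m) xs
    y = listArray (1, n) ys
    bnds  = ((0, 0), (m, n))
    table = listArray bnds [dist i j | (i, j) <- range bnds]
    dist i 0 = i
    dist 0 j = j
    dist i j = minimum
      [ table ! (i - 1, j) + 1                    -- deletion
      , table ! (i, j - 1) + 1                    -- insertion
      , table ! (i - 1, j - 1)
          + (if x ! i == y ! j then 0 else 1) ]   -- substitution
</code>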
==== Task Set 7 (data visualisation) ====
  * Plot the correlation (how?!) between the final grades (which can now be computed) and other data (lab results, etc.), using queries only.