pp2021:project [2021/03/04 19:24] (current) pdmatei
data Query =
    FromCSV String       -- extract a table
  | ToCSV Query
  | AsList Query         -- a column is turned into a list of values
  | forall a. Filter (FilterCondition a) Query
  | Sort String Query    -- sort by column-name, values interpreted as integers
  ...
  | HUnion Query Query   -- 'horizontal union': glues more columns with zero logic; TableJoin is the smart alternative
  | NameDistanceQuery    -- dedicated name query

data QResult = CSV String | Table [[String]] | List [String]
</code>
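A minimal sketch of how ''Query'' and ''QResult'' could fit together, using an assumed, reduced subset of the constructors above and a toy evaluator. The comma-splitting (no quoting or escaping) and the choice of the first column for ''AsList'' are illustrative assumptions, not the required behaviour:

<code haskell>
import Data.List (intercalate)

data Query   = FromCSV String | ToCSV Query | AsList Query
  deriving (Eq, Show)

data QResult = CSV String | Table [[String]] | List [String]
  deriving (Eq, Show)

-- naive comma split: no quoting, no escaping (assumption)
splitOn :: Char -> String -> [String]
splitOn c = foldr step [[]]
  where
    step x acc@(cur : rest)
      | x == c    = [] : acc
      | otherwise = (x : cur) : rest
    step _ []     = []  -- unreachable: the accumulator is never empty

eval :: Query -> QResult
eval (FromCSV s) = Table (map (splitOn ',') (lines s))
eval (ToCSV q)   = let Table t = eval q
                   in CSV (intercalate "\n" (map (intercalate ",") t))
eval (AsList q)  = let Table t = eval q
                   in List (map head t)  -- assumption: the first column
</code>

The real evaluator must of course cover every constructor; the point of the sketch is only the shape ''eval :: Query -> QResult''.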
==== Task Set 1 (Introduction) ====
One dataset (exam points) is provided, parsed as a matrix, in a stub Haskell file.
  * Implement procedures which determine which students have **enough points** to pass the exam. These implementations will serve as a basis for ''ValueMap'', ''RowMap'' and ''HUnion''
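A possible shape for such a procedure, assuming the matrix carries a header row and keeps the exam points in the last column (the 2.5 threshold is an invented example value, not the real passing rule):

<code haskell>
-- assumed layout: header row first, exam points in the last column
type Table = [[String]]

-- keep only the header and the rows whose points reach the threshold
passed :: Float -> Table -> Table
passed _ [] = []
passed threshold (header : rows) = header : filter enough rows
  where
    enough row = read (last row) >= threshold
</code>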
...

  * Students can also use the csv-uploading script and rely on Google Sheets for visualisation and experiments.
==== Task Set 3 (Matrix) ====
  * Based on the **cartesian** and **projection** queries, implement (and visualise) the **similarity graph** of student lecture points. Several possible distances can be implemented.
  * Students will have access to the python-from-csv-visualisation script, which they can modify on their own, using other **Graph-to-viz** types of graphics.
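One of the "several possible distances" could be sketched as follows; the per-student list of lecture points and the Euclidean choice are assumptions for illustration:

<code haskell>
-- assumed representation: each student is a row of lecture points
dist :: [Float] -> [Float] -> Float
dist xs ys = sqrt (sum (map (^ 2) (zipWith (-) xs ys)))

-- edges of the similarity graph: pairs of rows closer than a threshold
similarEdges :: Float -> [[Float]] -> [(Int, Int)]
similarEdges t rows =
  [ (i, j)
  | (i, r1) <- zip [0 ..] rows
  , (j, r2) <- zip [0 ..] rows
  , i < j
  , dist r1 r2 <= t
  ]
</code>

Other distances (Manhattan, cosine, or the Levenshtein-style ones used later for names) slot into the same ''similarEdges'' skeleton.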
==== Task Set 4 (ADT) ====
  * Implement the ADT for **query** and **result**, and an evaluation function which couples the previous functionality.
  * Implement the ADT for filters (with the class ''Eval'') and use it to apply different combinations of filters to the dataset, for visualisation.
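The ''Eval'' class might look roughly like this; the exact conditions come from the code stub, so the constructor names and the two instances below are assumptions for illustration:

<code haskell>
{-# LANGUAGE FlexibleInstances #-}

type Row = [String]

-- assumed condition set; the real one comes from the project stub
data FilterCondition a
  = Eq Int a   -- the column at the given index equals the value
  | Gt Int a   -- the column at the given index exceeds the value

class Eval a where
  eval :: FilterCondition a -> Row -> Bool

-- numeric columns are read out of their string cells
instance Eval Float where
  eval (Eq i v) row = read (row !! i) == v
  eval (Gt i v) row = read (row !! i) > v

-- string columns compare directly (Gt is lexicographic)
instance Eval String where
  eval (Eq i v) row = row !! i == v
  eval (Gt i v) row = row !! i > v
</code>

A combination of filters is then just function composition over ''filter (eval cond)'' applications.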
==== Task Set 5 (table join) ====
  * Implement several optimisations (a graph query expressed as a cartesian, combinations of sequential filters merged into a single function, etc.).
  * Implement **table join**, which allows us to merge the two CSVs using the email as key.
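A hedged sketch of what **table join** could do, assuming header-carrying tables, a key present in both headers, and an inner-join policy (all three are assumptions; the stub fixes the real interface):

<code haskell>
import Data.List (elemIndex)
import Data.Maybe (fromJust, mapMaybe)

type Table = [[String]]

-- join two tables on a key column name: rows of the first table are
-- kept and extended with the non-key columns of the matching row of
-- the second table; rows without a match are dropped (inner join)
tjoin :: String -> Table -> Table -> Table
tjoin key (h1 : r1) (h2 : r2) = (h1 ++ rest h2) : mapMaybe joinRow r1
  where
    k1 = fromJust (elemIndex key h1)  -- assumption: key exists in both
    k2 = fromJust (elemIndex key h2)
    rest row = [c | (i, c) <- zip [0 ..] row, i /= (k2 :: Int)]
    joinRow row =
      case [r | r <- r2, r !! k2 == row !! k1] of
        (m : _) -> Just (row ++ rest m)
        []      -> Nothing
tjoin _ _ _ = []
</code>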
==== Task Set 6 (typos) ====
  * Filter out typoed names (using queries)
  * Implement the **Levenshtein** distance function and apply it as a cartesian operation:
    * a naive recursive implementation will be unusably slow
    * try the length optimisation, that is, first compute the difference in length between the two words; the speed improves dramatically but is still poor on the big dataset
    * implement the lazy (DP) alternative
  * Select the minimal distance per row, and thus determine the most likely correct name
  * Correct the dataset
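The lazy (DP) alternative is essentially the classic memoised Levenshtein recurrence, which Haskell can express with a lazily filled array: each cell is computed on demand from its three neighbours. The names below are illustrative:

<code haskell>
import Data.Array

-- lev a b: minimal number of single-character insertions, deletions
-- and substitutions turning a into b; cell (i, j) of the lazy table
-- holds the distance between the first i chars of a and first j of b
lev :: String -> String -> Int
lev a b = table ! (n, m)
  where
    n  = length a
    m  = length b
    xs = listArray (1, n) a
    ys = listArray (1, m) b
    table = listArray ((0, 0), (n, m))
              [cell i j | i <- [0 .. n], j <- [0 .. m]]
    cell 0 j = j
    cell i 0 = i
    cell i j
      | xs ! i == ys ! j = table ! (i - 1, j - 1)
      | otherwise        = 1 + minimum [ table ! (i - 1, j)
                                       , table ! (i, j - 1)
                                       , table ! (i - 1, j - 1) ]
</code>

Because the array is built lazily, no explicit recursion bound or memo map is needed; this is what makes the big dataset tractable compared to the naive version.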
==== Task Set 7 (data visualisation) ====
  * Plot the correlation (how?!) between the final grades (which can now be computed) and other data (lab results, etc.), using queries only