Differences
This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
pp2021:project [2021/02/25 11:33] pdmatei |
pp2021:project [2021/03/04 19:24] (current) pdmatei |
||
---|---|---|---|
Line 16: | Line 16: | ||
* [[? | Group assignment]] (todo?) | * [[? | Group assignment]] (todo?) | ||
- | ==== Task 1 - repairing the DS ==== | + | The query language to be implemented is shown below: |
+ | <code haskell> | ||
+ | data Query = | ||
+ | FromCSV String -- extract a table | ||
+ | | ToCSV Query | ||
+ | | AsList Query -- a column is turned into a list of values | ||
+ | | forall a. Filter (FilterCondition a) Query | ||
+ | | Sort String Query -- sort by column-name interpreted as integer. | ||
+ | | ValueMap (QVal -> QVal) Query | ||
+ | | RowMap ([QVal]->QVal) Query | ||
+ | | TableJoin QKey Query Query -- table join according to specs | ||
+ | | Projection [QVal] Query | ||
+ | | Cartesian ([QVal] -> [QVal] -> [QVal]) [QVal] Query Query | ||
+ | | Graph EdgeOp Query | ||
- | * Create a **consistent** dataset by: | + | | VUnion Query Query -- adds new rows to the DS (table heads should coincide) |
- | - **filtering** out those entries in **lecture grades** which contain no email. | + | | HUnion Query Query -- 'horizontal union': glues on more columns with zero logic. TableJoin is the smart alternative |
- | - **joining** the **Homework grades** and **Exam grades** tables into a single one, henceforth called **Main** (Careful: **Exam grades** contains only entries for students who have passed the exam) | + | | NameDistanceQuery -- dedicated name query |
- | - in order to add **lecture grades** to the previous table, we have to solve the //typo problem// from the **email to name map**: | + | |
- | - we need to implement a **distance function** (e.g. Levenshtein distance). Initially, we compute it recursively, using a list-style matrix. Later on, we use Dynamic Programming to improve the speed of the implementation. | + | |
- | - we **extract** (or **select**) the list of names from **Main** | + | |
- | - we match them with emails | + | |
- | - for names in **Email to name map** which are not identical to those in **Main**, we map the distance function over the complete list of names and select the minimal value as a match | + | |
- | - we correct typos in **Email to name map** in this way | + | |
- | - then perform the simple join with **Main** | + | |
- | ==== Task 2 - anonymization ==== | + | data QResult = CSV String | Table [[String]] | List [String] |
- | Whenever we work with a DS that contains sensitive information, we need to anonymize it. This DS has already been anonymized; however, for the sake of the exercise, we will assume the names are real. | + | </code> |
- | * anonymize **Main** by (i) removing the name field (ii) replacing ''email'' by a hash | + | ==== Task Set 1 (Introduction) ==== |
- | ==== Task 3 - rankings ==== | + | One dataset (exam points) is provided, already parsed as a matrix, in a stub Haskell file. |
- | * Extract admission information (who could take part in the exam; note that not all students who could take part did so) | + | * Implement procedures that determine which students have **enough points** to pass the exam. These implementations will serve as a basis for ''ValueMap'', ''RowMap'' and ''HUnion'' |
- | * Todo | + | * Other similar tasks, which may (but need not) include basic filtering or sorting |
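A minimal sketch of the pass check, assuming a hypothetical layout (a header row, then one row per student with the name first and the points after it) and a hypothetical pass mark of 2.5. Neither the layout nor the threshold is stated on this page, and ''rowTotal'', ''passing'' and ''passMark'' are illustrative names.

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

type Row = [String]
type Table = [Row]

-- Hypothetical threshold; the real pass mark is not given on this page.
passMark :: Double
passMark = 2.5

-- Sum the numeric cells of a row, skipping the first cell (assumed to be the name).
-- Cells that do not parse as numbers (e.g. empty cells) count as 0.
rowTotal :: Row -> Double
rowTotal = sum . map cellValue . drop 1
  where
    cellValue cell = fromMaybe 0 (readMaybe cell)

-- Names of the students whose total reaches the pass mark.
-- The first row is assumed to be the header and is skipped.
passing :: Table -> [String]
passing table = [ name | row@(name : _) <- drop 1 table, rowTotal row >= passMark ]
```

''rowTotal'' has the shape of a row-level operation (one value computed from a whole row), which is roughly what ''RowMap'' is expected to generalise.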
- | ==== Task x - query language ==== | + | ==== Task Set 2 (Input/Output) ==== |
- | * Define a query language which includes all the above processing | + | * **Reading and writing** a dataset from/to CSV (relies on a correct ''splitBy'', written in lecture). Datasets are usually presented as CSV files. A CSV file uses a ''column_delimiter'' (namely the character '','') to separate the columns of a given entry and a ''row_delimiter'' (in Unix-style encoding: ''\\n'') to separate entries. |
- | * Re-define the previous tasks using your query language | + | |
+ | In Haskell, we will represent the DS as a ''[[String]]''. We will refine this strategy later. As with CSVs, we will make no distinction between the header row and the rest of the rows. | ||
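The representation above can be sketched as follows. The ''splitBy'' here is an assumed reconstruction, not the version written in lecture, and ''readCSV'' / ''writeCSV'' are illustrative names. Quoted fields (commas inside quotes) are deliberately not handled.

```haskell
import Data.List (intercalate)

type Table = [[String]]

-- Assumed reconstruction of splitBy: split a list at every occurrence
-- of the delimiter, keeping empty fields.
splitBy :: Eq a => a -> [a] -> [[a]]
splitBy delim = foldr step [[]]
  where
    step x acc@(cur : rest)
      | x == delim = [] : acc
      | otherwise  = (x : cur) : rest
    step _ [] = [[]]  -- unreachable: the accumulator is never empty

-- Rows are separated by '\n', columns by ','. The header is just the first row.
readCSV :: String -> Table
readCSV = map (splitBy ',') . splitBy '\n'

-- Inverse of readCSV.
writeCSV :: Table -> String
writeCSV = intercalate "\n" . map (intercalate ",")
```

Because empty fields are kept, ''writeCSV (readCSV s)'' gives back ''s''. A trailing newline in the input shows up as a final row containing one empty string.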
+ | |||
+ | * Students can also use the csv-uploading script and rely on Google Sheets for visualisation and experiments. | ||
+ | |||
+ | ==== Task Set 3 (Matrix) ==== | ||
+ | |||
+ | * Based on the **cartesian** and **projection** queries, implement (and visualise) the **similarity graph** of student lecture points. Several possible distances can be implemented. | ||
+ | * Students will have access to the python-from-csv-visualisation script, which they can modify on their own, using other **Graph-to-viz** types of graphics. | ||
+ | |||
+ | ==== Task Set 4 (ADT) ==== | ||
+ | |||
+ | * Implement the ADT for **query**, **result**, and an evaluation function which couples the previous functionality. | ||
+ | * Implement the ADT for filters (with class ''Eval'') and use it to apply different combinations of filters to the dataset, for visualisation. | ||
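A minimal sketch of such an evaluation function, restricted to four of the constructors listed at the top of the page (''FromCSV'', ''ToCSV'', ''VUnion'', ''HUnion''). Two things are assumptions made for this sketch: ''FromCSV'' is taken to carry the CSV text itself rather than a file name, and ''eval'' returns ''Maybe QResult'' so that ill-formed queries give ''Nothing''. The ''splitBy'' helper is again an assumed stand-in for the lecture version.

```haskell
import Data.List (intercalate)

data Query
  = FromCSV String      -- assumption: the CSV text itself, not a file name
  | ToCSV Query
  | VUnion Query Query  -- adds rows; headers must coincide
  | HUnion Query Query  -- glues columns row by row, no key matching

data QResult = CSV String | Table [[String]] | List [String]
  deriving (Eq, Show)

-- Assumed helper: split a string at every occurrence of the delimiter.
splitBy :: Char -> String -> [String]
splitBy d = foldr step [""]
  where
    step c acc@(cur : rest)
      | c == d    = "" : acc
      | otherwise = (c : cur) : rest
    step _ [] = [""]  -- unreachable

-- A failed pattern match in the Maybe monad yields Nothing, so applying
-- an operation to the wrong kind of result is rejected.
eval :: Query -> Maybe QResult
eval (FromCSV s) = Just (Table (map (splitBy ',') (splitBy '\n' s)))
eval (ToCSV q) = do
  Table t <- eval q
  Just (CSV (intercalate "\n" (map (intercalate ",") t)))
eval (VUnion q1 q2) = do
  Table (h1 : r1) <- eval q1
  Table (h2 : r2) <- eval q2
  if h1 == h2 then Just (Table (h1 : r1 ++ r2)) else Nothing
eval (HUnion q1 q2) = do
  Table t1 <- eval q1
  Table t2 <- eval q2
  Just (Table (zipWith (++) t1 t2))
```

For example, ''eval (ToCSV (VUnion (FromCSV "n,p\na,1") (FromCSV "n,p\nb,2")))'' evaluates to ''Just (CSV "n,p\na,1\nb,2")''. Note that ''zipWith'' truncates ''HUnion'' to the shorter table, which is one reading of "zero logic".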
+ | |||
+ | ==== Task Set 5 (table join) ==== | ||
+ | |||
+ | * Implement several optimisations (a graph query expressed as a cartesian, a sequence of filters fused into a single function, etc). | ||
+ | * Implement **table join**, which allows us to merge the two CSVs using the email as key. | ||
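One possible reading of the join, as a sketch. The exact specification is not given on this page, so the semantics here are an assumption: a left join on a named key column, where rows of the first table without a match are padded with empty strings. ''tableJoin'' and ''dropAt'' are illustrative names.

```haskell
import Data.List (elemIndex)
import Data.Maybe (fromMaybe)

type Row = [String]
type Table = [Row]

-- Remove the element at index i.
dropAt :: Int -> [a] -> [a]
dropAt i xs = take i xs ++ drop (i + 1) xs

-- Left join on the column named `key`. The first row of each table is its header.
-- Returns Nothing if either table is empty or lacks the key column.
tableJoin :: String -> Table -> Table -> Maybe Table
tableJoin key (h1 : rows1) (h2 : rows2) = do
  i <- elemIndex key h1
  j <- elemIndex key h2
  let blanks = replicate (length h2 - 1) ""
      -- association list: key value -> remaining columns of the second table
      index  = [ (r !! j, dropAt j r) | r <- rows2, length r > j ]
      extend r
        | length r > i = r ++ fromMaybe blanks (lookup (r !! i) index)
        | otherwise    = r ++ blanks
  pure ((h1 ++ dropAt j h2) : map extend rows1)
tableJoin _ _ _ = Nothing
```

The association-list lookup keeps the sketch dependency-free but is quadratic in the number of rows; a map keyed on the email would be the natural replacement for large tables.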
+ | |||
+ | ==== Task Set 6 (typos) ==== | ||
+ | |||
+ | * Filter out names containing typos (using queries) | ||
+ | * Implement the **Levenshtein** distance function and apply it as a cartesian operation: | ||
+ | * the naive version will be very slow | ||
+ | * try the length optimisation, that is, compute the difference in length between words. The speed improves dramatically, but is still poor on the big dataset | ||
+ | * implement the lazy (DP, i.e. dynamic programming) alternative | ||
+ | * Select the minimal distance per row, and thus determine the most likely correct name | ||
+ | * Correct the dataset | ||
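The lazy dynamic-programming variant and the minimal-distance selection can be sketched as below. This is one common way to write it in Haskell, using a lazily evaluated ''Data.Array'' as the memo table; it is not necessarily the formulation intended for the project. ''closest'' is an illustrative name.

```haskell
import Data.Array
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Levenshtein distance via a lazily evaluated table: cell (i, j) holds the
-- distance between the first i characters of s and the first j characters of t.
-- Each cell is computed at most once, on demand.
levenshtein :: String -> String -> Int
levenshtein s t = table ! (m, n)
  where
    m  = length s
    n  = length t
    sa = listArray (1, m) s
    ta = listArray (1, n) t
    table = array ((0, 0), (m, n))
      [ ((i, j), cell i j) | i <- [0 .. m], j <- [0 .. n] ]
    cell i 0 = i
    cell 0 j = j
    cell i j = minimum
      [ table ! (i - 1, j) + 1                                          -- deletion
      , table ! (i, j - 1) + 1                                          -- insertion
      , table ! (i - 1, j - 1) + (if sa ! i == ta ! j then 0 else 1)    -- substitution
      ]

-- The candidate with the smallest distance to the given name, if any.
closest :: String -> [String] -> Maybe String
closest _ [] = Nothing
closest name candidates = Just (minimumBy (comparing (levenshtein name)) candidates)
```

For instance, ''levenshtein "kitten" "sitting"'' is 3, and ''closest "Jon Smith" ["John Smith", "Jane Smyth"]'' picks ''"John Smith"''.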
+ | |||
+ | ==== Task Set 7 (data visualisation) ==== | ||
+ | |||
+ | * Plot the correlation (how?!) between final grades (which can now be computed) and other data (lab results, etc.), using queries only
- | ==== Task y - graph queries ==== | ||
- | * Extend your language with graph queries ... todo |