//Revision history for ep:labs:10: current version 2023/10/07 21:56 by emilian.radoi; previous version 2021/12/04 16:13 by vlad.stefanescu.//
  * Be able to compare multiple machine learning models
===== Resources =====

In this lab, we will study basic performance evaluation techniques used in machine learning, covering core concepts such as classification, regression, data fitting, and clustering.

You will work in an easy-to-use environment that provides tools for manipulating data and visualizing results: a **Jupyter Notebook** hosted on **Google Colab**, which comes with a variety of useful tools already installed.

The exercises will be solved in Python, using popular libraries that are usually integrated in machine learning projects:
  * [[https://scikit-learn.org/stable/documentation.html|Scikit-Learn]]: fast model development, performance metrics, pipelines, dataset splitting
  * [[https://matplotlib.org/3.1.1/users/index.html|Matplotlib]]: data plotting
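To give a feel for how these libraries fit together, here is a minimal sketch (not part of the lab skeleton) of the basic scikit-learn workflow: split a dataset, fit two models, and compare their train/test accuracy. The synthetic dataset and the two model choices are illustrative assumptions, not the lab's actual setup.

```python
# Minimal sketch of the scikit-learn workflow used throughout the lab:
# split a dataset, fit two models, and compare train/test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a real dataset (e.g. the Kaggle corpora below).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # A large gap between train and test accuracy suggests overfitting.
    print(f"{type(model).__name__}: train={train_acc:.2f} test={test_acc:.2f}")
```

Comparing the two accuracy columns is exactly the kind of model comparison this lab's learning goals refer to.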
As datasets, we will use public corpora provided by the Kaggle community:

  * [[https://www.kaggle.com/uciml/pima-indians-diabetes-database/data|Classification Dataset]]
  * [[https://www.kaggle.com/zaraavagyan/weathercsv|Regression dataset]]
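Once you have downloaded such a corpus from Kaggle, a first look at it with pandas might go like the sketch below. The column names follow the Pima diabetes dataset, but the inline sample merely stands in for `pd.read_csv("<your downloaded file>.csv")`; the exact file name depends on your Kaggle download.

```python
# Sketch: first look at a Kaggle CSV with pandas.
# The inline sample stands in for reading the downloaded file from disk.
import io
import pandas as pd

csv_data = io.StringIO(
    "Glucose,BMI,Age,Outcome\n"
    "148,33.6,50,1\n"
    "85,26.6,31,0\n"
    "183,23.3,32,1\n"
)
df = pd.read_csv(csv_data)

print(df.shape)       # (rows, columns)
print(df.describe())  # per-column summary statistics

X = df.drop(columns="Outcome")  # feature columns
y = df["Outcome"]               # label column
```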
You can also check out these cheat sheets for fast reference to the most common libraries:

  * [[https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf|Python]]
  * [[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf|NumPy]]
  * [[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf|Matplotlib]]
  * [[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf|scikit-learn]]
  * [[https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf|pandas]]
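As a taste of the evaluation techniques covered in the lab, here is a minimal sketch of clustering evaluation with scikit-learn: generate 2D blobs, fit K-means, and score the result with the silhouette coefficient. The blob parameters and cluster count are illustrative assumptions.

```python
# Sketch of clustering evaluation: fit K-means on synthetic 2D blobs and
# measure cluster separation with the silhouette coefficient.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated 2D blobs as toy data.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
score = silhouette_score(X, kmeans.labels_)

print(f"silhouette score: {score:.2f}")  # in [-1, 1]; closer to 1 is better
```

Increasing `cluster_std` blurs the blobs together and drives the silhouette score down, which is a quick way to see how the metric reacts to poorly separated clusters.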
<solution -hidden>
[[https://colab.research.google.com/drive/1aeV9PGF_uxBA3FoKNMEzsiXMxjVSCcm4?usp=sharing|Solution]]
</solution>
===== Tasks =====
==== Google Colab Notebook ====
For this lab, we will use Google Colab to explore performance evaluation in machine learning. Please solve the tasks [[https://github.com/vladastefanescu/machine-learning-introduction/blob/main/Machine_Learning_Introduction.ipynb|here]] by clicking "**Open in Colaboratory**".

You can then export the Python notebook as a PDF (**File -> Print**) and upload it to **Moodle**.
===== Feedback =====
Please take a minute to fill in the **[[https://forms.gle/NpSRnoEh9NLYowFr5 | feedback form]]** for this lab.