This shows you the differences between two versions of the page.
ep:labs:10 [2021/12/04 16:13] vlad.stefanescu [⚠️ [10p] Task 4.A]
ep:labs:10 [2022/09/24 14:46] (current) emilian.radoi
Line 9:
  * Be able to compare multiple machine learning models
- ===== Exercises =====
+ ===== Resources =====
- The exercises will be solved in Python, using various popular libraries that are usually integrated in machine learning projects:
+ In this lab, we will study basic performance evaluation techniques used in machine learning, covering elementary concepts such as classification, regression, data fitting, clustering and more.
+
+ You will work in an easy-to-use environment that provides tools for manipulating data and visualizing results. We will use a **Jupyter Notebook** hosted on **Google Colab**, which comes with a variety of useful tools already installed.
+
+ The exercises will be solved in Python, using popular libraries that are usually integrated into machine learning projects:
  * [[https://scikit-learn.org/stable/documentation.html|Scikit-Learn]]: fast model development, performance metrics, pipelines, dataset splitting
Line 18: | Line 22:
  * [[https://matplotlib.org/3.1.1/users/index.html|Matplotlib]]: data plotting
- All tasks are tutorial-based and every exercise is associated with at least one "**TODO**" within the code. Those tasks can be found in the //exercises// package, but our recommendation is to follow the entire skeleton code for a better understanding of the concepts presented in this laboratory class. Each functionality is properly documented and, for some exercises, there are also hints placed in the code.
- <note important>
- Because the various **tasks** and **exercises** are **spread throughout the laboratory text**, they are marked with a ⚠️ emoji. Make sure you look for this emoji so that you don't miss any of them!
- </note>
+ As datasets, we will use public corpora provided by the Kaggle community:
+ * [[https://www.kaggle.com/uciml/pima-indians-diabetes-database/data|Classification Dataset]]
+ * [[https://www.kaggle.com/zaraavagyan/weathercsv|Regression dataset]]
+
+ You can also check out these cheat sheets for quick reference to the most common libraries:
+
+ **Cheat sheets:**
+
+ * [[https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf|python]]
+ * [[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf|numpy]]
+ * [[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf|matplotlib]]
+ * [[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf|sklearn]]
+ * [[https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf|pandas]]
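As a quick orientation for these libraries, here is a minimal sketch of the usual split/train/evaluate workflow. It uses a synthetic stand-in generated with scikit-learn rather than the Kaggle CSVs, so the sample size, feature count and model choice below are illustrative assumptions only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a binary classification corpus
# (8 features, like the Pima diabetes data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")
```

With the real Kaggle data, the only change would be loading the CSV (e.g. with pandas) in place of `make_classification`.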
<solution -hidden>
- Solution: {{:ep:labs:lab_12_ml_revisited_solution.zip}}
+ [[https://colab.research.google.com/drive/1aeV9PGF_uxBA3FoKNMEzsiXMxjVSCcm4?usp=sharing|Solution]]
</solution>
+ ===== Tasks =====
+ ==== Google Colab Notebook ====
+ For this lab, we will use Google Colab to explore performance evaluation in machine learning. Please solve your tasks [[https://github.com/vladastefanescu/machine-learning-introduction/blob/main/Machine_Learning_Introduction.ipynb|here]] by clicking "**Open in Colaboratory**".
+ You can then export this Python notebook as a PDF (**File -> Print**) and upload it to **Moodle**.
+ ===== Feedback =====
+ Please take a minute to fill in the **[[https://forms.gle/LWBWYsMiJq8FsYdN9|feedback form]]** for this lab.
- ==== ⚠️ [5p] Task 4.B ====
-
- Comment on the results, specifying which is the **best model** in terms of fitting and which models **overfit** or **underfit** the dataset.
-
- ⚠️⚠️ **NON-DEMO TASK**
-
- Solve the tasks marked with **TODO - TASK B**.
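One hedged way to see over- and underfitting, independent of the lab skeleton: fit polynomials of several degrees to noisy 1-D data and compare training versus validation error. The data, degrees and model below are illustrative stand-ins, not the skeleton's actual models:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of a sine curve (illustrative data, not the lab's dataset).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.2, 200)

x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):
    feats = PolynomialFeatures(degree)
    model = LinearRegression().fit(feats.fit_transform(x_train), y_train)
    train_err = mean_squared_error(y_train, model.predict(feats.transform(x_train)))
    val_err = mean_squared_error(y_val, model.predict(feats.transform(x_val)))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, val MSE {val_err:.3f}")
```

The typical reading: a too-low degree underfits (both errors high), a too-high degree overfits (training error keeps dropping while validation error does not), and the best model is the one with the lowest validation error.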
- | |||
- | ==== ⚠️ [15p] Exercise 5 ==== | ||
- | |||
- | In this exercise, you will learn how to properly evaluate a **clustering model**. We chose a **K-means clustering algorithm** for this example, but feel free to explore other alternatives. You can find out more about K-means clustering algorithms [[https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1|here]]. For all the associated tasks, you don't have to use any input file, because the clusters are generated in the skeleton. The model must learn how to group together **points in a 2D space**. | ||
- | |||
- | <note important> | ||
- | The solution for this exercise should be written in the **TODO** sections marked in the //**clustering.py**// file. Please follow the skeleton code and understand what it does. To run the code, uncomment **perform_clustering()** in //**app.py**//. | ||
- | </note> | ||
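For reference, the overall flow the skeleton follows (generate 2D point groups, then cluster them) can be sketched directly with scikit-learn. The sample count, number of centres and random seed below are illustrative assumptions, not the skeleton's exact values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 points in 2D, grouped around 4 centres.
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=1.0, random_state=42)

# Fit K-means and assign every point to one of the 4 clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(X.shape, labels.shape)
```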
- | |||
- | |||
- | ==== ⚠️ [5p] Task 5.A ==== | ||
- | |||
- | Compute the **silhouette score** of the model by using a //Scikit-learn// function found in the **metrics** package. | ||
- | |||
- | ⚠️⚠️ **NON-DEMO TASK** | ||
- | |||
- | Solve the tasks marked with **TODO - TASK A**. | ||
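A minimal sketch of the metric itself, using synthetic blobs in place of the skeleton's generated clusters. Scikit-learn exposes it as **sklearn.metrics.silhouette_score**; it ranges from -1 (points assigned to the wrong cluster) to +1 (dense, well-separated clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data; in the lab, use the points from the skeleton instead.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette coefficient over all samples, in [-1, 1].
score = silhouette_score(X, labels)
print(f"silhouette score: {score:.3f}")
```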
- | |||
- | ==== ⚠️ [10p] Task 5.B ==== | ||
- | |||
- | Fetch the **centres of the clusters** (the model should already have them ready for you :-)) and **plot** them together with a **colourful 2D representation** of the data groups. Your plot should look similar to the one below: | ||
- | |||
- | {{ :ep:labs:22._clustering_plot.png?600 |}} | ||
- | |||
- | You can also play around with the **standard deviation** of the generated blobs and observe the different outcomes of the clustering algorithm: | ||
- | |||
- | <code> | ||
- | CLUSTERS_STD = 2 | ||
- | </code> | ||
- | You should be able to discuss these observations with the assistant. | ||
- | <note> | ||
- | **HINT: **The **plotting code** is very similar to the one found in the skeleton. You can also [[https://lmgtfy.com/?q=plot+k+means+clusters+python|Google]] it out. ;-) | ||
- | </note> | ||
- | ⚠️⚠️ **NON-DEMO TASK** | ||
- | Look at the hint above and solve the tasks marked with **TODO - TASK B**. Make **at least 3** changes to the standard deviation. That means that **3 plots should be generated**. Save each plot **in a separate file**. | ||
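A sketch of one such plot, again with synthetic blobs: on a fitted scikit-learn **KMeans** model the centres live in the **cluster_centers_** attribute. The blob parameters and the per-deviation file name pattern are illustrative choices, not the skeleton's:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also runs headless
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

CLUSTERS_STD = 2  # try e.g. 0.5, 2 and 4 and compare the resulting plots

X, _ = make_blobs(n_samples=300, centers=4,
                  cluster_std=CLUSTERS_STD, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Colour each point by its assigned cluster, then overlay the centres.
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", s=20)
centres = kmeans.cluster_centers_
plt.scatter(centres[:, 0], centres[:, 1], c="red", marker="x", s=120)

# One file per standard deviation, as the task asks for separate plots.
plt.savefig(f"clusters_std_{CLUSTERS_STD}.png")
```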
- ==== ⚠️ [10p] Exercise 6 ====
- ⚠️⚠️ **NON-DEMO TASK**
- Please take a minute to fill in the **[[https://forms.gle/KHMVUhNfCPoR71Ew7|feedback form]]** for this lab.
- ===== References =====
- [[https://www.kaggle.com/uciml/pima-indians-diabetes-database/data|Classification Dataset]]
- [[https://towardsdatascience.com/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-83a8f7ae2b4f|Regression dataset]]
- {{namespace>:ep:labs:10:contents:tasks&nofooter&noeditbutton}}