# Lab 10 - Machine Learning

## 0.1 Objectives

• Understand basic concepts of machine learning
• Remember examples of real-world problems that can be solved with machine learning
• Learn the most common performance evaluation metrics for machine learning models
• Analyse the behaviour of typical machine learning algorithms using the most popular techniques
• Be able to compare multiple machine learning models

## 0.2 Exercises

The exercises will be solved in Python, using various popular libraries that are usually integrated in machine learning projects:

• Scikit-Learn: fast model development, performance metrics, pipelines, dataset splitting
• Pandas: data frames, csv parser, data analysis
• NumPy: scientific computation
• Matplotlib: data plotting

All tasks are tutorial based and every exercise will be associated with at least one “TODO” within the code. Those tasks can be found in the exercises package, but our recommendation is to follow the entire skeleton code for a better understanding of the concepts presented in this laboratory class. Each functionality is properly documented and for some exercises, there are also hints placed in the code.

Because the various tasks and exercises are spread throughout the laboratory text, they are marked with a ⚠️ emoji. Make sure you look for this emoji so that you don't miss any of them!

### ⚠️ Exercise 0

Fill out the feedback form for this course at https://acs.curs.pub.ro

## 1. Introduction

Just like chess players improve their technique by watching hundreds of games from top players, computers can learn to perform certain tasks by “looking” at data. These computers are sometimes called “machines” and this data observation step is also known as “learning”. Simple, isn’t it? So let’s define “Machine Learning” (ML) more formally, as stated by Tom Mitchell. A program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Let’s make this more human-readable and provide an example. Say that T is the process of playing chess, E is a set of chess matches on which the algorithm was trained and P is the probability for the program to win the next game of chess.

What is truly important, however, is that machine learning algorithms, unlike traditional ones, do not have to be explicitly programmed to solve a particular task. Just like in the previous example, they learn from the data fed into them and, afterwards, are able to predict results for new scenarios. As you might expect, this isn’t just black magic. The majority of ML algorithms are based on some maths and statistics, but also on a fair amount of engineering.

Machine learning algorithms learn from data and they don’t have to be programmed explicitly!

### ⚠️ (0.5p) Exercise 1

In this exercise, we are going to download the skeleton and set up the Python environment for this lab. For that, you should:

2. Make sure Python 3 is installed:
sudo apt-get update
sudo apt-get install python3.6
3. Make sure pip and virtualenv are installed:
sudo apt-get install python3-pip
sudo pip3 install virtualenv
4. Create a Python 3 virtual environment in the skeleton directory:
virtualenv venv
5. Activate your Python 3 virtual environment:
source venv/bin/activate
6. Install all the necessary dependencies specified in the requirements.txt file:
pip3 install -r requirements.txt

If you want to deactivate the virtual environment, simply type in this command:

deactivate

We usually code within a virtual environment because it's very simple to separate the dependencies between projects.

If you face an issue with pip or matplotlib, please see the following troubleshooting:

pip:

• ImportError: cannot import name 'main'
$ sudo su
$ pip install ...
• AttributeError: 'module' object has no attribute 'SSL_ST_INIT'
$ sudo python -m easy_install --upgrade pyOpenSSL

matplotlib:

• ImportError: No module named '_tkinter', please install the python3-tk package
$ sudo apt-get install python3-tk   # ignore errors

## 2. Supervised Learning vs Unsupervised Learning

Generally, in machine learning, there are two types of tasks: supervised and unsupervised.

### A. Supervised Learning

We refer to the former when we have prior knowledge of what the output of the system should look like. More formally, after going through lots of (X, y) pairs, the model should be able to determine that mapping function ĥ which approximates f(X)=y as accurately as possible. In machine learning, the entire set of (X, y) pairs is often called corpus or dataset and an algorithm that learned from this corpus is sometimes called a trained model or simply a model.

For example, let’s say that you want to build a smart car selling platform and you want to provide sellers with the ability to know what should be a “fair price” based on the specs and age of their cars. For that, you would need to provide your model with a set of tuples that might look like this:

(bhp, fuel_type, displacement, weight, torque, suspension, car_age, car_price)

In this case, X is formed from the tuple slices (bhp, fuel_type, displacement, weight, torque, suspension, car_age) and each slice component is called a feature. Likewise, y is represented by the car_price and it's usually called label or ground truth. After learning from the provided corpus, a process that is commonly called model fitting, our model should be able to predict an approximation of the car price ŷ for new ages and specs X.

In the dataset of each supervised machine learning task, there will always be a label y that can be associated with a set of features X!
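As a rough illustration of model fitting and prediction, the sketch below trains a Scikit-learn linear regressor on a handful of made-up cars. The numbers, the reduced feature set (only bhp and car_age) and the variable names are assumptions made for this example and are not part of the lab skeleton.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical corpus: each row of X is (bhp, car_age), y is the car price
X = np.array([[90, 10], [110, 8], [150, 5], [200, 2], [250, 1]])
y = np.array([3000, 4500, 9000, 18000, 25000])

# Model fitting: learn a mapping from the features X to the label y
model = LinearRegression()
model.fit(X, y)

# Predict an approximate price for a car the model has never seen
X_new = np.array([[130, 6]])
y_hat = model.predict(X_new)
print(y_hat)
```

Every row of X is paired with exactly one entry of y, which is what makes this a supervised task.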

Because a numerical or continuous output is expected as a result, the aforementioned scenario is an example of a regression task. If y was a label that can take the values “cheap”, “fair” or “expensive”, then that would be called a classification task and y can also be referred to as a class.

Two examples of supervised learning tasks are regression and classification!

### B. Unsupervised Learning

In the case of unsupervised learning, the training data is not labeled anymore, so the machine learning algorithm must take decisions without having a ground truth as reference. That means that your dataset is no longer made out of (X, y) pairs, but only of X entries. Consequently, the job of the model would be to provide meaningful insights about those X entries, by finding various patterns in the data.

The most common tasks for this learning approach are clustering tasks. In these types of problems, the model must learn how to group various data together, forming components or clusters. Say that, having a table with information about all the Facebook users, you would like to recommend personalized ads for every single person. This sounds like a very difficult task, because there is a huge number of users on the platform. But what if we group users together based on their traits (or features, as the Data Scientist within you might say)? Then, we can suggest the same ads bundle to a whole group of users, making the advertising process easier and cheaper.

In the dataset of each unsupervised machine learning task, there will exist only X entries used to identify patterns and insights about data.
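To make the idea of grouping unlabeled entries a bit more concrete, here is a minimal sketch that clusters a few made-up users with Scikit-learn's KMeans. The two features (age and hours spent online per day), the values and the choice of 2 clusters are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical users described by (age, hours_online_per_day); there is no label y
X = np.array([[18, 6.0], [20, 5.5], [22, 7.0],
              [55, 1.0], [60, 0.5], [58, 1.5]])

# Ask the algorithm to split the users into 2 groups
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
groups = kmeans.fit_predict(X)

# Each user receives the index of the cluster it was assigned to
print(groups)
```

The cluster indices themselves carry no meaning; only the fact that some users share the same index does.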

### ⚠️ (4.0p) Exercise 2

In this exercise, you will learn how to properly evaluate a classifier. We chose a decision tree for this example, but feel free to explore other alternatives. You can find out more about decision trees here. For all the associated tasks, you will use the diabetes.csv file placed in the resources directory. The model must learn to determine whether the patient suffers from diabetes (0 or 1) by looking at a dataset consisting of the following features:

• Number of pregnancies
• Glucose level
• Blood pressure
• Skin thickness
• Insulin level
• Body Mass Index (BMI)
• Diabetes pedigree function (likelihood of diabetes based on family history)
• Age

The solution for this exercise should be written in the TODO sections marked in the classification.py file.

Follow the skeleton code and understand what it does. Afterwards, you must be able to answer the assistant's questions.

## 3. Performance Evaluation in Machine Learning

Generally, with machine learning algorithms, in the early phase of development, the focus is less on the computational resources (RAM, CPU, IO etc.) and more on the ability of the model to generalize well on the data. That means that our algorithm is trained well enough to provide accurate answers for data that has never been seen before (predict a car price for a new car posted on the platform, classify a new vehicle or assign a freshly registered Facebook user to a group). Hence, at first, we try to build a robust and accurate model and then we work towards making it function as computationally inexpensive as possible.

In machine learning, traditional evaluation metrics (RAM, CPU, IO) are not used as frequently!

### A. Training & Test Sets

Evaluating machine learning algorithms depends on the type of problem we are trying to solve. But in most cases, we would want to split the data into a training set and a test set. As the names suggest, the former will be used to actually train the model and the latter will be used to verify how well the model generalizes on unseen data. For the most common machine learning tasks, the split ratio ranges from 80-20 to 90-10, depending on the size of the corpus. When a huge amount of data is involved (say millions or billions of entries), a smaller proportion of the dataset is used as a test set (even 1%) because the actual number of test entries is considered large enough. For instance, 1% of 10M entries is 100k and that’s generally regarded as a lot of data to test on.

In most problems, the data is split into a training set used in model training and a test set used in performance evaluation!
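A minimal sketch of an 80-20 split, using Scikit-learn's train_test_split on a small synthetic dataset (the arrays below are placeholders, not the lab's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset: 100 entries with 2 features each and a binary label
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Keep 20% of the entries aside for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
```

The model should only ever be fitted on X_train and y_train, so that the test entries remain unseen until evaluation.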

### B. Classification Problems

Let’s begin our performance analysis journey by analyzing a classification problem. Let’s suppose that you have a model that was trained to predict whether a given image corresponds to a cat or not. As you already know, this is a supervised learning task in which the model must learn to predict one of two classes - “cat” or “non-cat”. In this case, the model is also called a binary classifier.

#### Confusion Matrix

You fit the model with your training data, you ask it for predictions on the test set and now you have two results: the ground truth y that was part of your dataset and the predictions ŷ that have just been yielded by your machine learning algorithm. So how do you assess your model in this particular scenario? One way is to build a confusion matrix, like the one below:

Now don’t get confused (excuse the pun). It’s actually quite simple to interpret. There are 4 possible outcomes for each prediction yielded by your binary classifier:

• True Positives (y = ŷ and ŷ=”cat”):

The model identified a cat in the image and there actually is a cat in that image.

• False Positives (y != ŷ and ŷ=”cat”):

The model identified a cat in the image, but there was no cat in the image.

• False Negatives (y != ŷ and ŷ=”non-cat”):

The model did not recognize a cat in the image, but there was a cat in the image.

• True Negatives (y = ŷ and ŷ=”non-cat”):

The model did not recognize a cat in the image and in the image there was no cat indeed.

The first step in analyzing the performance of a classifier is to build a confusion matrix!
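The four outcomes can also be counted by hand. The sketch below does this with NumPy for ten made-up predictions, encoding “cat” as 1 and “non-cat” as 0 (both the encoding and the values are assumptions for this example).

```python
import numpy as np

# 1 = "cat", 0 = "non-cat"
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # ground truth
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])  # model predictions

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted cat, was a cat
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted cat, was not a cat
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed a cat
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # correctly rejected

print(tp, fp, fn, tn)  # 3 2 1 4
```

The four counts always add up to the total number of predictions.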

These 4 scenarios can have different levels of importance, depending on the problem one wants to solve. For a smart antivirus, keeping the number of false negatives as low as possible is crucial, even if that means an increase in the false positives. Naturally, it’s better to have annoying warnings than to have your computer infected because the antivirus was unable to identify the threat. But if you don’t have any unusual requirements for your model and you just want to assess its generic performance, there are 3 metrics that can be inferred from this matrix:

#### Accuracy

This is the most intuitive metric and represents the number of correctly predicted labels over the total number of predictions, the actual value of the prediction (positive or negative) being irrelevant. However, judging the model by this single metric can be misleading. Consider the case of credit card fraud. The vast majority of transactions are perfectly valid and only a handful of them are fraudulent. A model that trains on this data could learn to predict that all of the transactions are safe, treating the fraudulent ones as mere outliers. On a dataset of 1 million transactions, 1000 of which are fraudulent, such a model would reach an accuracy of 99.9%. Yet it is useless, because letting 1000 frauds go unnoticed cannot be allowed.
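The arithmetic behind the fraud example can be checked with a few lines of NumPy. This sketch assumes 1000 frauds among 1 million transactions and a hypothetical model that labels every transaction as safe.

```python
import numpy as np

n_total = 1_000_000
n_fraud = 1_000

# 1 = fraud, 0 = safe
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A "lazy" model that predicts safe for every transaction
y_pred = np.zeros(n_total, dtype=int)

accuracy = np.mean(y_true == y_pred)
frauds_caught = int(np.sum((y_pred == 1) & (y_true == 1)))

print(accuracy, frauds_caught)
```

The accuracy comes out at 0.999 even though not a single fraud is caught, which is why accuracy alone says little on heavily imbalanced data.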

#### Precision

The precision is the total number of correctly classified positive examples divided by the total number of predicted positive examples. If the precision is very high, the probability for our model of classifying non-cat images as cat images is quite low.

#### Recall

The recall is the total number of correctly classified positive examples divided by the total number of actual positive examples. If the recall is very high, the probability for our model of misclassifying cat images is quite low.

When you have high recall and low precision, most of the cat images are correctly recognized, but there are a lot of false positives. In contrast, when you have low recall and high precision, we miss a lot of cat images, but those predicted as cat images have a high probability of being indeed cat images and not something else.
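The two situations described above can be reproduced with made-up counts. In this sketch, precision_recall is a hypothetical helper written for this example, and the TP, FP and FN values are invented.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) computed from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# An "eager" model: finds 9 of the 10 cats, but raises 30 false alarms
eager = precision_recall(tp=9, fp=30, fn=1)

# A "cautious" model: never raises a false alarm, but finds only 2 of the 10 cats
cautious = precision_recall(tp=2, fp=0, fn=8)

print(eager)     # low precision, high recall
print(cautious)  # high precision, low recall
```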

#### F1 Score

Ideally, you would want to have both high recall and high precision, but that is not always possible. So you can choose the trade-off you are most comfortable with or you can combine the 2 metrics into 1, by using the F1 score, which is actually the harmonic mean of the 2:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

This single metric is more generic, goes from 0 (worst) to 1 (best) and together with accuracy, can give you a solid intuition on the performance of your model.
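A short sketch of why the harmonic mean is used: unlike the arithmetic mean, it stays low when one of the two metrics is low. The helper name f1_from and the input values are assumptions for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_from(0.8, 0.8)      # both metrics are decent
lopsided = f1_from(1.0, 0.2)      # perfect precision, poor recall
arithmetic_mean = (1.0 + 0.2) / 2

print(balanced, lopsided, arithmetic_mean)
```

For the lopsided model, the arithmetic mean is 0.6, while the F1 score is roughly 0.33 and therefore reflects the poor recall much more clearly.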

From the confusion matrix you can extract the accuracy, precision and recall!

Evaluate the classifier by manually computing the accuracy, precision, recall and F1 score. These metrics are derived from the confusion matrix but you don't have to build this matrix yourself. You can use Scikit-learn for that (Confusion Matrix).

Before anything else, we must compute the confusion matrix. Luckily, the metrics package of the Scikit-learn library has just the right function for this task:

cm = confusion_matrix(y_test, y_pred)
if (isBinaryClassification):
    # Scikit-learn returns [[TN, FP], [FN, TP]]; swap the diagonal so that TP comes first
    cm[0][0], cm[1][1] = cm[1][1], cm[0][0]

Please note that in order to compute a binary confusion matrix that looks just like the ones used in the above diagrams, a small swap has to be performed.

For this exercise we simply identify the TP, TN, FP, FN terms on the confusion matrix and then compute the various metrics using their formulas. For instance, for the precision, we can write the following lines of code:

# After the swap, cm[0][0] holds TP and cm[0][1] holds FP
precision = cm[0][0] / (cm[0][0] + cm[0][1])
precision_l = [precision]

You should be able to compute the accuracy and recall by yourself.

Note that when we computed the precision, we built precision_l which is a list of a single precision value. For a binary classifier this seems pointless, but when multiple classes are to be predicted, the 2 lists of precisions and recalls will be used to compute the final F1 score as a mean of all the intermediate binary F1 scores (more on this later):

f1_l = [2 * (x * y) / (x + y) if (x + y) > 0 else 0 for (x, y) in zip(precision_l, recall_l)]
f1 = np.mean(f1_l)

Evaluate the classifier using the metrics package from the Scikit-learn library. Again, accuracy, precision, recall and F1 score are required.

HINT: There are 2 Scikit-learn functions that will help you with this computation: accuracy_score and precision_recall_fscore_support. Because we are computing these metrics on a binary classifier, think about what is the suitable value for the average parameter of the precision_recall_fscore_support function.

#### Generic Confusion Matrix

Don’t take the terms “positive” and “negative” written in the confusion matrix above literally! They are actually placeholders for the 2 classes (“cat” and “non-cat”) the model is trying to predict. This means that you can use more generic classes like “cat” and “dog” or “human” and “animal”, depending on the problem you want to solve. As a consequence, precision, recall and F1 score are each computed per class, while accuracy is still computed over all the predictions. If you want to obtain a generic score for the model, you must average the per-class values. For instance, if you have a recall of 0.6 for “cat” and 0.5 for “dog”, the average recall of the model will be 0.55.

Consequently, the confusion matrix can be generalized to more than 2 classes (see the image below), having the same metrics and ways of computing them.

Extend the solution for Task 2.B so that it can accommodate more than 2 classes (if you haven't done that already). Test your performance evaluation functions by using the diabetes_multi.csv file as input.

HINT: Check the utils package for constants. Around 100 classes were assigned to the features. That is why some classes that can be found in the training set, might not appear in the test set. As a consequence, some warnings can occur. You can safely ignore them.

Because we no longer work on a 2 x 2 confusion matrix, we must sum various elements across the entire matrix. Luckily the numpy library comes in handy. We can compute the accuracy like this:

cm_diag_sum = np.sum(np.diag(cm))
cm_sum = np.sum(cm)
accuracy = cm_diag_sum / cm_sum

and the precision like this:

cm_diag = np.diag(cm)
cm_col_sums = np.sum(cm, axis=0)
precision_l = [x / y if y != 0 else 0 for (x, y) in zip(cm_diag, cm_col_sums)]
precision = np.mean(precision_l)

As we discussed previously, we are computing a precision for each class, resulting in a list of precisions of length N (where N x N is the dimension of the confusion matrix). In the end, we are computing the mean of all these values. The same applies to the recall and F1 score.

Comment on the difference in score values between the binary classifier and the multiclass classifier. Again, you must be able to answer the assistant's questions.

### C. Regression Problems

As it was previously mentioned, supervised machine learning can also imply solving regression tasks, where a numerical or continuous value is used as a label. Let’s take an example. Suppose that you want to predict the budget for an advertisement campaign, based on the revenue of the company. For this task, your model will learn from a training set of (X, y) pairs and will try to find the best ĥ that approximates the behaviour of the function f(X) = y. If we assume that the relationship between the two is linear, your model must learn how to draw a line that fits the points (X, y) the best. A visual representation can be seen in the figure below:

### ⚠️ (3.5p) Exercise 3

In this exercise, you will learn how to properly evaluate a regression model. We chose a simple linear regressor for this example, but feel free to explore other alternatives. You can find out more about linear regressors here. For all the associated tasks, you will use the weather.csv file placed in the resources directory. The model must learn to determine what is the maximum temperature for a certain day (y) based on the minimum temperature (X).

The solution for this exercise should be written in the TODO sections marked in the regression.py file.

Follow the skeleton code and understand what it does. Afterwards, you must be able to answer the assistant's questions.

#### Root Mean Squared Error

But how do we know if that line is drawn correctly? Well, first we must define an error or loss function that can tell us mathematically how far from the points the line has been drawn. In linear regression tasks, the most common approach is to use the root mean squared error (RMSE), depicted below for a set of n entries. You can sometimes get rid of the root and just compute the MSE to avoid extra computations.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y}_j)^2}$$

In this formula, yj is the ground truth and ŷj is the result of the partially learned function ĥ(Xj). Basically, the model must apply successive corrections to ĥ, such that the predicted ŷj values lead to a smaller RMSE. This process of minimizing the loss is also called optimization and is one of the foundational principles of machine learning. You can find more about it here.

Similarly to the classification tasks, for regression problems, the value of the RMSE can be used as a performance metric for the model.
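The idea of applying successive corrections that lower the RMSE can be sketched with gradient descent, one common optimization technique. This is only an illustration under stated assumptions: the data, the learning rate and the number of steps are made up, and Scikit-learn's LinearRegression does not necessarily find its solution this way.

```python
import numpy as np

# Hypothetical data lying roughly on the line y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

w, b = 0.0, 0.0        # parameters of the learned function h(x) = w * x + b
learning_rate = 0.05
history = []

for step in range(500):
    y_hat = w * X + b
    history.append(rmse(y, y_hat))
    error = y_hat - y
    # Gradients of the MSE with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    # Correct the parameters in the direction that reduces the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(history[0], history[-1])  # RMSE before and after training
print(w, b)                     # learned slope and intercept
```

The gradients are taken on the MSE rather than the RMSE, which is the shortcut mentioned above: both are minimized by the same line.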

#### R² Correlation

Now let’s say that your model was properly trained and you have some predicted labels ŷ and some ground truth values y. In the case of classification problems, with these two pieces of information you could immediately compute the F1 score or the accuracy. But with regression, a 0 to 1 score cannot be simply derived. One metric that can be used, however, is the R² correlation and it is computed using the following formula:

$$R^2 = 1 - \frac{\sum_{j} (y_j - \hat{y}_j)^2}{\sum_{j} (y_j - \bar{y})^2}$$

where ŷj is the predicted value, ȳ is the mean of the ground truth labels and yj is the ground truth. This score lies between -∞ and 1 and has the following interpretation:

• Close to 1: the predictions explain almost all of the variation in y, so the model captured a strong linear relationship between X and y
• Close to 0: the model does no better than always predicting the mean of y, so a linear relationship between X and y cannot be identified
• Negative (down to -∞): the model fits the data worse than always predicting the mean of y

These 3 scenarios are visually represented in the figures below:

For regression tasks you can use RMSE and R-squared score as performance metrics!
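As a quick sanity check of the R² formula, the sketch below evaluates three made-up sets of predictions: perfect ones, ones that always return the mean of y and ones that are worse than the mean. The helper r2 is a hypothetical function written for this example, not a Scikit-learn call.

```python
import numpy as np

def r2(y_true, y_pred):
    """R-squared: 1 minus (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = r2(y, y)                               # predictions match the ground truth
mean_only = r2(y, np.full_like(y, np.mean(y)))   # always predict the mean of y
reversed_pred = r2(y, y[::-1])                   # predictions that go the wrong way

print(perfect, mean_only, reversed_pred)  # 1.0 0.0 -3.0
```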

Evaluate the regressor by manually computing the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R² score.

For the RMSE we simply apply the formula above:

rmse = list(np.sqrt(np.sum((y_test - y_pred) ** 2) / len(y_test)))

You should be able to compute MAE and R² score on your own!

We know that MAE was not covered in the first section of this laboratory class and that is why you must use your powerful Google skills to solve this one.

Evaluate the regressor using the metrics package from the Scikit-learn library. Again, RMSE, MAE and R² score are required.

You might not find a function that computes the actual RMSE, but the MSE. Can you do the R? 🏴‍☠️

### ⚠️ (1.0p) Task 3.D (Bonus)

Train the model on variously sized chunks of the original dataset and notice how the RMSE changes. To better illustrate the behaviour, you should build a plot having the data size on the X axis and the RMSE value on the Y axis. Moreover, you should be able to explain the observed behaviour to the assistant.

Because this is rather a plotting task, we will do the hard work for you and compute the list of RMSE values for the various chunk sizes:

# The dataset cannot provide more entries than it has rows
n = min(X.shape[0], max_chunk_size)
chunk_size = int((n - min_chunk_size) / chunks)

# Create 2 lists used in the plotting logic
size_list = []
rmse_list = []

# Train a model for each chunk
for i in range(0, chunks):

    # Compute the size of the current chunk
    size = min_chunk_size + (i + 1) * chunk_size
    size_list.append(size)

    # Select a chunk from the whole dataset (same seed, so X and y rows stay aligned)
    sample_X = X.sample(n=size, random_state=42)
    sample_y = y.sample(n=size, random_state=42)

    # Split the data into a training set and a test set with a 80-20 ratio
    X_train, X_test, y_train, y_test = train_test_split(sample_X, sample_y, test_size=0.2, random_state=42)

    # Build a linear regressor
    regressor = LinearRegression()

    # Fit data from the training set to the regressor
    regressor.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = regressor.predict(X_test)

    # Compute the rmse
    rmse, _, _ = evaluate_regressor(y_test, y_pred, 'smart')
    rmse_list.append(rmse)

Please read the code and try to understand it by following the comments. Building the plot is on you! Play with the following parameters and observe how the plot changes:

CHUNKS = 100
MIN_CHUNK_SIZE = 1000
MAX_CHUNK_SIZE = 1000000

### D. Underfitting vs Overfitting

Splitting the data into a training set and a test set is not only helping us to determine the accuracy or error of the model’s predictions, but can also give us an insight about its behaviour or generalization capabilities.

Let’s take a trained binary classifier as an example and the accuracy as a performance metric. If it has a small test score, but a high training score, the model relied too much on the training data and is not able to generalize well on entries that it has never seen before. In this case, we say that it has overfit the training data and the model is said to have a high variance. In contrast, if the model learned a function that is too generic, the problem of underfitting or high bias occurs. This time, the problem can be inferred from a training score that is too small. These situations are visually represented in the figures below:

The underlying causes of the aforementioned problems depend heavily on the model and how its hyperparameters were fine-tuned. Discussing them is outside the scope of this laboratory class, but one can learn more about those topics from these articles: Underfitting & Overfitting, Hyperparameters.

Besides looking at the usual performance metrics, one must also notice if the model appropriately fits the data!
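The behaviour described above can be observed by comparing training and test accuracy for decision trees of different depths. The sketch below uses a synthetic dataset generated with Scikit-learn; the dataset parameters and the chosen depths are assumptions for illustration and differ from the models used in the skeleton.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (flip_y)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scores = {}
# max_depth=None lets the tree grow until it fits every training point
for depth in (1, 4, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in scores.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

A very shallow tree tends to score poorly even on the training set (underfitting), while the unrestricted tree reaches a perfect training score that it cannot reproduce on the test set (overfitting).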

### ⚠️ (1.5p) Exercise 4

In this exercise, you will learn how to properly evaluate the data fitting behaviour of the binary classifier from exercise 2. For all the associated tasks, you will use the diabetes.csv file again, which is placed in the same resources directory. This time, 3 models will be trained with different parameters and your job is to analyse their behaviour. For that, you will look at the accuracy of each model computed on both the training set and the test set.

The solution for this exercise should be written in the TODO sections marked in the fitting.py file.

For each model, make predictions on both the training set and test set and compute the corresponding accuracy values.

The model is already trained so you can directly use it to yield predictions on the two sets. And in order to evaluate these predictions, you can use the already familiar evaluate_classifier function.

Comment on the results by specifying which is the best model in terms of fitting and which are the models that overfit or underfit the dataset.

### E. Clustering Algorithms

In the end, we should talk a little bit about unsupervised learning. As you already know, for such tasks, there is no ground truth - just a set of X values that might have some underlying structure or pattern that can be learned. Evaluating a model without having something as reference might seem a pretty difficult task. And in some scenarios, you might be right!

Say that we wanted to group images containing handwritten digits into clusters and at the end of the learning process, our model grouped the data like this:

At first glance, the clustering outcome looks good, but how can we express this “good-looking” result in a more formal manner? One solution is to measure how compact the clusters are, yet distant from one another, by computing a silhouette score. Because the formulas are too cumbersome to write in here, I will leave you a link, where everything is explained clearly. However, one might understand the concept by looking at this simple example:
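To make the notion of compact yet well-separated clusters more tangible, the following sketch computes the silhouette score directly from its usual definition: for each point, a is the mean distance to the other points of its own cluster, b is the mean distance to the points of the nearest other cluster, and the point's score is (b - a) / max(a, b). The helper silhouette and the six points are assumptions made for this example; it is not the Scikit-learn function.

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient, computed directly from its definition."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        # a: mean distance to the other points of the same cluster
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(dists[i, labels == other].mean()
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two obvious groups of points in a 2D space
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]

good = silhouette(points, [0, 0, 0, 1, 1, 1])  # clusters follow the groups
bad = silhouette(points, [0, 1, 0, 1, 0, 1])   # clusters mix the groups

print(good, bad)
```

The sensible grouping scores close to 1, while the mixed-up grouping scores around 0, matching the intuition that a higher silhouette means tighter and better separated clusters.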

### ⚠️ (1.5p) Exercise 5

In this exercise, you will learn how to properly evaluate a clustering model. We chose a K-means clustering algorithm for this example, but feel free to explore other alternatives. You can find out more about K-means clustering algorithms here. For all the associated tasks, you don't have to use any input file, because the clusters are generated in the skeleton. The model must learn how to group together points in a 2D space.

The solution for this exercise should be written in the TODO sections marked in the clustering.py file.

Compute the silhouette score of the model by using a Scikit-learn function found in the metrics package.

Fetch the centres of the clusters (the model should already have them ready for you) and plot them together with a colourful 2D representation of the data groups. Your plot should look similar to the one below:

You can also play around with the standard deviation of the generated blobs and observe the different outcomes of the clustering algorithm:

CLUSTERS_STD = 2

You should be able to discuss these observations with the assistant.

HINT: The plotting code is very similar to the one found in the skeleton. You can also Google it.

## Feedback

Please take a minute to fill in the feedback form for this lab.