Lab 7. Statistical models. Linear regression

Introduction to scikit-learn

Scikit-learn is a machine learning package for Python. This package contains different algorithms for classification, regression and clustering, of which the algorithms used in the following laboratories are highlighted in bold.

Classification algorithms:

• Support Vector Machines
• K-Nearest Neighbors
• Naive Bayes
• Decision trees
• Ensemble methods: Random Forest, Extremely Randomized Trees, AdaBoost, etc.

Regression algorithms:

• Linear Regression
• Logistic Regression

Clustering Algorithms:

• K-means
• K-means++
• DBSCAN
• OPTICS
• Birch

Scikit-learn also offers support for constructing the training and testing sets:

• K-Folds
• Stratified K-Folds

It also offers support for cross-validation and parameter selection:

• GridSearchCV
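
As a rough illustration of these two helpers, the sketch below splits a small synthetic dataset with K-Folds and then runs GridSearchCV over a few candidate values of one parameter. The data and the candidate values are assumptions made for the example, and Ridge (a regularized linear regression that is not covered in this lab) is used only because it has a parameter, alpha, that can be tuned.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# synthetic data (assumption): 20 samples, 1 feature, roughly y = 3x + 1
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = 3 * X.ravel() + 1 + rng.normal(scale=0.1, size=20)

# K-Folds: split the 20 samples into 5 train/test partitions
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kfold.split(X)]
print(fold_sizes)

# GridSearchCV: evaluate each candidate alpha with 5-fold cross-validation
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0]}, cv=kfold)
search.fit(X, y)
print(search.best_params_)
```

Each fold uses 16 samples for training and 4 for testing; the best parameter value depends on the data.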

Datasets

To use any of the regression/classification algorithms, we will need a labeled dataset. Usually X is the observation matrix and y is the target vector.

You can find examples of datasets suitable for machine learning and data science on:

To experiment with data science algorithms, you can find public datasets stored as CSV files which can be processed directly in a Python script. You should read the description of the dataset to find out how you can formulate a regression/classification problem (define X - observation column(s) and y - target column).
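
A minimal sketch of this step is shown below. The CSV content and the column names (rooms, age, price) are hypothetical and are embedded in the script only to keep the example self-contained; with a real dataset, a file path would be passed to read_csv instead.

```python
import io
import pandas as pd

# hypothetical CSV content; in practice this would be a file such as "data.csv"
csv_text = """rooms,age,price
6.5,65.2,24.0
6.4,78.9,21.6
7.1,61.1,34.7
"""

df = pd.read_csv(io.StringIO(csv_text))

# X: observation columns, y: target column
X = df[["rooms", "age"]].to_numpy()
y = df["price"].to_numpy()

print(X.shape, y.shape)
```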

Scikit-learn comes with some predefined datasets for testing.

The Boston Housing Dataset is a commonly used dataset in Data Science for testing algorithms and contains information about housing in the area of Boston such as:

• CRIM - per capita crime rate by town
• ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
• INDUS - proportion of non-retail business acres per town.
• NOX - nitric oxides concentration (parts per 10 million)
• RM - average number of rooms per dwelling
• AGE - proportion of owner-occupied units built prior to 1940
• DIS - weighted distances to five Boston employment centres
• TAX - full-value property-tax rate per 10000 USD
• MEDV - Median value of owner-occupied homes in thousands of dollars

To use a dataset from sklearn.datasets module we will need to first import the dataset and then to declare the observation matrix and the target vector as follows:

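from sklearn.datasets import load_boston

# note: load_boston was removed in scikit-learn 1.2, so an older version is required
boston_dataset = load_boston()
input_data = boston_dataset.data
target_data = boston_dataset.target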

Working with data

The Pandas library (review Lab 3) provides some straight-forward methods to work with data sets. For tabular data, the DataFrame structure is used to load the data from different sources and formats. A key feature is the ability to access a group of rows and columns by label(s) or a boolean array. In the following example, we create a DataFrame from dictionary data:

import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}

df = pd.DataFrame(data=d)
print(df)
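
Access by label or by boolean array, as mentioned above, could look as follows. The row labels 'a' and 'b' are an assumption added for this example.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=['a', 'b'])

# select by label: row 'b', column 'col2'
value = df.loc['b', 'col2']

# select by boolean array: rows where col1 is greater than 1
selected = df.loc[df['col1'] > 1]

print(value)
print(selected)
```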

Datasets can be stored in files, databases, and other external sources. A common representation of tabular data that is useful for later processing is given by CSV (Comma Separated Values) files. Pandas can directly load the dataset from a CSV file. Check out Lab 3 for an example of working with CSV files.

In the following example, we create a DataFrame using the Boston Housing Dataset from scikit-learn:

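import pandas as pd
from sklearn.datasets import load_boston

# load the dataset (load_boston requires scikit-learn older than 1.2)
boston_dataset = load_boston()
input_data = boston_dataset.data
target_data = boston_dataset.target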

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)

# print the entire table content
print(boston)

# print table column
print(boston["TAX"])

# create a new table column with data
boston['MEDV'] = target_data

Data Models

Data models are used to describe and predict data. Different types of models are suited to different problems. The model may represent a simple linear equation (e.g. first order linear regression), or it may be a complex neural network, mapped out by sophisticated software and mathematical models (e.g. deep learning). In this lab, we will focus on linear regression models.

Linear regression is widely used in domains such as finance, economics, marketing and biology to evaluate trends and make predictions. While it is traditionally a statistical method, it is also one of the fundamental supervised machine-learning algorithms.

Simple Linear Regression

The first order regression model (Simple Linear Regression) can be defined as follows:

$y_i = \beta_0 + \beta_1x_i$

where $\beta_0$ - intercept and $\beta_1$ - slope are the parameters that we need to find after training the model.
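
For the simple model, the least-squares parameters have a closed form: $\beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\beta_0 = \bar{y} - \beta_1\bar{x}$. The sketch below evaluates these formulas with NumPy on a small noise-free dataset that was made up for the example.

```python
import numpy as np

# assumed data, generated exactly from y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# slope: sum of cross-deviations divided by the sum of squared deviations of x
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# intercept: the fitted line passes through the point (mean of x, mean of y)
beta_0 = y.mean() - beta_1 * x.mean()

print(beta_0, beta_1)
```

On noise-free data the formulas recover the intercept 1 and the slope 2 used to generate the points.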

Using a first order linear regression model results in a trend line, which tells whether the values in a particular data set (e.g. GDP, oil prices or stock prices) have increased or decreased over a period of time.

Polynomial Linear Regression

Polynomial regression is a form of linear regression where higher order powers (2nd, 3rd or higher) of an independent variable are included. These models can still be considered linear models since the regression function is linear in terms of the unknown parameters.

The polynomial regression model can be defined as follows:

$y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + ... + \beta_nx_i^n$

where $\beta_0, \beta_1, ..., \beta_n$ are the parameters that we need to find after training the model.
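
One possible way to fit such a model with scikit-learn is to expand x into its powers with PolynomialFeatures and then fit an ordinary LinearRegression on the expanded columns. This is a sketch on made-up, noise-free quadratic data and is not necessarily how the lab scripts do it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# assumed data: noise-free quadratic y = 1 + 2x + 3x^2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x ** 2

# expand x into the columns [x, x^2]; the intercept is handled by LinearRegression
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x.reshape(-1, 1))

# the model is linear in the parameters, so a linear regression can fit it
model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)
```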

Multiple Linear Regression

If we have multiple attributes, the regression model (Multiple Linear Regression) can be defined in matrix form as follows:

$y = X\beta$

where $\beta$ is the parameter vector that we need to find after training the model.
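
The matrix form can be solved directly in the least-squares sense. In the sketch below, the data is made up, and a column of ones is prepended to X so that the first entry of $\beta$ plays the role of the intercept (an assumption about how the intercept is represented).

```python
import numpy as np

# assumed data: 5 samples, 2 attributes, generated from y = 1 + 2*a + 3*b
attributes = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1 + 2 * attributes[:, 0] + 3 * attributes[:, 1]

# prepend a column of ones so that the first entry of beta is the intercept
X = np.column_stack([np.ones(len(attributes)), attributes])

# least-squares solution of y = X beta
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```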

Therefore, regression models are not limited to single attributes. Take for example the Boston Housing Dataset. There are multiple (input) variables that have an effect on the housing prices (target variable) and we are interested in training a regression model to be able to predict housing prices based on this data.

Training. Validation. Testing.

After a model is selected (in our case a linear regression model), the process of training the model requires a training dataset to learn from. The training data should be labeled (target attribute).

After training the model, it can be used to get predictions on new data for which the target attribute is unknown (unlabeled).

Evaluation methods should be used to check the accuracy of a model on a given dataset.

The standard approach in statistics and machine learning is to split the (existing) dataset into three subsets:

• The training dataset is used to fit the parameters of the model. In linear regression models, this means finding the coefficients of the linear model.
• The validation dataset is used to evaluate the model while tuning the model hyperparameters. In polynomial regression models, this means adjusting the order of the model until the best fitting model is found.
• The test dataset is used to evaluate the final model on data that has never been used for training or validation.
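
A three-way split can be sketched with two calls to train_test_split from scikit-learn. The 60/20/20 proportions and the synthetic data are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic data (assumption): 100 samples, 1 attribute
X = np.arange(100).reshape(-1, 1)
y = 2 * X.ravel() + 1

# first hold out 20% of the samples as the test dataset
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# then split the remaining 80 samples: 25% of them (20 samples) for validation
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```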

Model evaluation

A well-fitting regression model results in predicted values close to the observed data values. Three statistical metrics are used in Ordinary Least Squares (OLS) regression to evaluate model fit: R-squared, the overall F-test, and the Root Mean Square Error (RMSE).

RMSE

RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data (how close the observed data points are to the model's predicted values). Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion if the main purpose of the model is prediction.

$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2}$
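
The formula translates directly into NumPy. The observed and predicted values below are made up for the example.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])    # observed values (assumed)
y_pred = np.array([2.0, 5.0, 8.0, 11.0])   # model predictions (assumed)

# residuals are [1, 0, -1, -2], so the mean squared error is 6 / 4 = 1.5
residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))
print(rmse)
```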

R-squared

R-squared, also known as the coefficient of determination, is a metric that gives some information about the goodness of fit of a model, typically as a value between 0 and 1 (it can be negative when the model predicts worse than the mean). In regression, it shows how well the regression predictions approximate the real data points. A value of 1 indicates that the regression predictions perfectly fit the data. R-squared basically checks whether our fitted regression line predicts y better than the mean.

$R^2 = 1-\frac{\sum_{i=1}^N(y_i - \hat{y}_i)^2}{\sum_{i=1}^N(y_i - \bar{y})^2}$

Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit.
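
As a small check of the R-squared formula, the sketch below computes it by hand on made-up values and compares the result with r2_score from scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])    # observed values (assumed)
y_pred = np.array([2.0, 5.0, 8.0, 11.0])   # model predictions (assumed)

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares: 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean: 20
r2 = 1 - ss_res / ss_tot

print(r2, r2_score(y_true, y_pred))
```

Both values are 0.7 for this data, while the RMSE of the same predictions is expressed in the units of y.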

In the following plot, the prediction results are shown for a Multiple Linear Regression model trained on the Boston Housing Dataset. While the model cannot be directly represented (only 2d/3d is possible to represent), the predicted values and prediction error are plotted for each sample (see Task 3 for more details and source code).

Overfitting. Underfitting.

Overfitting and Underfitting are commonly used in the context of Data Science for describing the quality of data models in real world conditions:

• Overfitting means that the model we trained has learned the training data “too well” and is fit too closely to the training dataset. This usually happens when the model is too complex (i.e. too many features/variables compared to the number of observations). This model will be very accurate on the training data but will probably not be very accurate on new data. In linear regression, a higher polynomial order may lead to overfitting.
• Underfitting means that the model does not fit the training data and therefore misses the trends in the data. It could also happen when, for example, we fit a linear model (simple linear regression) to data that is not linear.

Check the plot for polynomial linear regression and try to guess which polynomial order would overfit and which one would underfit. Which criteria can you use to select the best model?

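RMSE and R-Squared provide a rough estimation of model over/underfitting on a given dataset when comparing training and validation results: a model that scores much better on the training data than on the validation data is likely overfitting, while poor scores on both suggest underfitting. Cross-validation can be further used to evaluate the model on independent datasets.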
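
The comparison can be sketched as follows. The noisy quadratic data, the interleaved train/validation split and the polynomial orders 1, 2 and 8 are all assumptions chosen for illustration; np.polyfit is used here instead of scikit-learn to keep the example short.

```python
import numpy as np

# assumed data: noisy quadratic
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1 + 2 * x + 3 * x ** 2 + rng.normal(scale=0.3, size=x.size)

# interleaved split: even indices for training, odd indices for validation
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# fit one polynomial per order and store (training RMSE, validation RMSE)
results = {}
for degree in (1, 2, 8):
    coefficients = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        rmse(y_train, np.polyval(coefficients, x_train)),
        rmse(y_val, np.polyval(coefficients, x_val)),
    )
    print(degree, results[degree])
```

The training RMSE cannot increase as the order grows, so it is the gap between the training and validation values that hints at overfitting.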

Scikit-learn

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# define some test data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 3, 3, 4, 7, 8, 12, 14, 16, 20])

# split train/test data
n_train = int(len(x)*0.8)
x_train = x[:n_train]
y_train = y[:n_train]
x_predict = x[n_train:]

# create and fit the LR model
model = LinearRegression()
reg = model.fit(x_train.reshape(-1, 1), y_train)

# print model parameters
print(model.coef_)
print(model.intercept_)

# use the model to make predictions on the whole dataset (training and test points)
model_output = model.predict(x.reshape(-1, 1))

# plot the results compared to the target values
plt.plot(x, y)
plt.plot(x, model_output)
plt.title("Linear regression")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(["target", "predicted"])
plt.show()

Exercises

Download the project archive and unzip it on your PC. Install the requirements using pip (e.g. py -3 -m pip install -r requirements.txt). The script (task1.py) uses linear regression to fit a sample of generated data.

Run the program and solve the following scenarios:

• Experiment with different polynomial orders
• Plot the RMSE and R-Squared values for each case

Based on the experimental results in Task 1, answer the following questions:

• Q1: Which is the optimal polynomial order for this dataset with regards to the RMSE and model complexity? Hint: the RMSE improvement starts to decrease after a certain polynomial order on the RMSE chart.
• Q2: Which is the optimal polynomial order for this dataset with regards to the R-Squared coefficient and model complexity? Hint: the R2 coefficient improvement starts to decrease after a certain polynomial order on the R-Squared chart.
• Q3: Explain the results based on the provided function that is used to generate the dataset.

The script (task3.py) loads the Boston Housing Dataset and trains a linear model over multiple features. The prediction results (median housing prices in thousands of dollars) are shown in the plot and compared to the original dataset. Run the program and solve the following scenarios:

• [TODO 1] Change the size of the training dataset (percent) and evaluate the models that are obtained in each case using RMSE
• [TODO 2] View the RMSE values for each case on a single plot

Based on the experimental results, answer the following questions:

• Q1. Does the recommended split of 80% training data and 20% validation data provide good results in this case?
• Q2. What is the amount (percent) of training data that provides the best results in terms of prediction accuracy on validation data?
• Q3. What happens if the amount of training data is small, e.g. 10%, with regards to the prediction accuracy and the over/underfitting of the regression model?