Scikit-learn is a machine learning package for Python. It contains different algorithms for Classification, Regression and Clustering; the algorithms used in the following laboratories are highlighted in bold.
Classification algorithms:
Regression algorithms:
Clustering algorithms:
Offers support for constructing the training and testing sets:
Offers support for cross-validation and parameter selection (a short example is sketched below):
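For instance, the model_selection module provides helpers such as cross_val_score for cross-validation (and GridSearchCV for parameter selection). A minimal sketch on made-up data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# made-up data: a noisy linear trend
rng = np.random.default_rng(0)
X = np.arange(20).reshape(-1, 1)
y = 2 * np.arange(20) + rng.normal(0, 1, 20)

# 5-fold cross-validation; each score is the R^2 on one held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)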
To use any of the regression/classification algorithms, we will need a labeled dataset. Usually X is the observation matrix and y is the target vector.
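As a minimal illustration (the values below are made up), X has one row per observation and one column per feature, while y has one entry per observation:

import numpy as np

# made-up labeled dataset: 3 observations, 2 features each
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
# one target value per observation
y = np.array([1.0, 2.0, 3.0])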
You can find examples of datasets suitable for machine learning and data science on:
To experiment with data science algorithms, you can find public datasets stored as CSV files which can be processed directly in a Python script. You should read the description of the dataset to find out how you can formulate a regression/classification problem (define X - observation column(s) and y - target column).
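For example, such a CSV file can be loaded with Pandas; the file and column names below are placeholders for whatever dataset you choose:

import pandas as pd

# hypothetical file and column names - replace them with your dataset's
df = pd.read_csv("dataset.csv")
X = df[["feature_column"]]  # observation column(s)
y = df["target_column"]     # target column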
Scikit-learn comes with some predefined datasets for testing.
The Boston Housing Dataset is a commonly used dataset in Data Science for testing algorithms and contains information about housing in the area of Boston such as:
To use a dataset from the sklearn.datasets module, we first import the dataset and then declare the observation matrix and the target vector as follows:
from sklearn.datasets import load_boston

# note: load_boston is only available in scikit-learn versions before 1.2
boston_dataset = load_boston()
input_data = boston_dataset.data
target_data = boston_dataset.target
The Pandas library (review Lab 3) provides some straightforward methods to work with datasets. For tabular data, the DataFrame structure is used to load the data from different sources and formats. A key feature is the ability to access a group of rows and columns by label(s) or a boolean array. In the following example, we create a DataFrame from dictionary data:
import pandas as pd

# loading the data from a dictionary
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
print(df)
In the following example, we create a DataFrame using the Boston Housing Dataset from scikit-learn:
import pandas as pd
from sklearn.datasets import load_boston

# loading the Boston dataset
boston_dataset = load_boston()
input_data = boston_dataset.data
target_data = boston_dataset.target

# loading the data into a Pandas DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)

# print the first rows of the table
print(boston.head())

# print a table column
print(boston["TAX"])

# create a new table column and fill it with the target data
boston['MEDV'] = target_data
Data Models are used to describe and predict data. Different types of models are best suited for different problems. A model may be a simple linear equation (e.g. first order linear regression), or a complex neural network mapped out by sophisticated software and mathematical models (e.g. deep learning). In this lab, we will focus on Linear Regression models.
Linear regression is widely used in domains such as finance, economics, marketing and biology to evaluate trends and make predictions. While it is traditionally a statistical method, it is also one of the fundamental supervised machine-learning algorithms.
The first order regression model (Simple Linear Regression) can be defined as follows:
$ y_i = \beta_0 + \beta_1x_i $
where $\beta_0$ (intercept) and $\beta_1$ (slope) are the parameters that are determined by training the model.
Using a first order linear regression model results in a trend line, showing whether a particular data series (e.g. GDP, oil prices or stock prices) has increased or decreased over a period of time:
Polynomial regression is a form of linear regression where higher order powers (2nd, 3rd or higher) of an independent variable are included. These models can still be considered linear models since the regression function is linear in terms of the unknown parameters.
The polynomial regression model can be defined as follows:
$ y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + ... + \beta_nx_i^n $
where $\beta_i$ are the parameters that are determined by training the model.
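As an illustration, one common way to fit such a model in scikit-learn is to expand the input with polynomial features and then apply ordinary linear regression; the data below is made up for the sketch:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# made-up sample data following a quadratic trend (y = x^2 + 1)
x = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.array([2, 5, 10, 17, 26, 37])

# expand x into [1, x, x^2]; the model remains linear in the parameters
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)

print(model.intercept_)  # estimate of beta_0
print(model.coef_)       # estimates of the remaining parameters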
If we have multiple attributes, the regression model (Multiple Linear Regression) can be defined in matrix form as follows:
$ y = X\beta $
where $\beta$ is the parameter vector that is determined by training the model (by convention, a column of ones is included in $X$ so that the intercept $\beta_0$ is part of $\beta$).
Therefore, regression models are not limited to single attributes. Take for example the Boston Housing Dataset. There are multiple (input) variables that have an effect on the housing prices (target variable) and we are interested in training a regression model to be able to predict housing prices based on this data.
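A sketch of such a model in scikit-learn is shown below (note that load_boston requires a scikit-learn version older than 1.2):

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

# load the dataset: multiple input variables per sample
boston_dataset = load_boston()
X = boston_dataset.data
y = boston_dataset.target  # median housing prices

# fit a multiple linear regression model: y = X * beta
model = LinearRegression()
model.fit(X, y)

# one coefficient per input variable, plus the intercept
print(model.coef_)
print(model.intercept_)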
After a model is selected (in our case a linear regression model), training the model requires a training dataset to learn from. The training data must be labeled (i.e. it must include the target attribute).
After training the model, it can be used to get predictions on new data for which the target attribute is unknown (unlabeled).
Evaluation methods should be used to check the accuracy of a model on a given dataset.
The standard approach in statistics and machine learning is to split the (existing) dataset into subsets: a training set, used to fit the model, and a testing set, used to evaluate it on unseen data (a minimal sketch is shown below).
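scikit-learn provides train_test_split for this purpose; the values and the 80/20 split ratio below are just an example:

import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 10 samples, 1 feature
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# reserve 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))  # 8 2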
A well-fitting regression model results in predicted values close to the observed data values. Three statistical metrics are used in Ordinary Least Squares (OLS) regression to evaluate model fit: R-squared, the overall F-test, and the Root Mean Square Error (RMSE).
RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data (how close the observed data points are to the model's predicted values). Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion if the main purpose of the model is prediction.
$ RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2} $
R-squared, also known as the coefficient of determination, is a metric that indicates the goodness of fit of a model as a value between 0 and 1. In regression, it shows how well the regression predictions approximate the real data points. A value of 1 indicates that the regression predictions perfectly fit the data. R-squared essentially checks whether the fitted regression line predicts y better than simply predicting the mean of y.
$ R^2 = 1-\frac{\sum_{i=1}^N(y_i - \hat{y}_i)^2}{\sum_{i=1}^N(y_i - \bar{y})^2} $
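Both metrics can be computed with sklearn.metrics; a minimal sketch with made-up values:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# made-up observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(rmse)
print(r2)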
In the following plot, the prediction results are shown for a Multiple Linear Regression model trained on the Boston Housing Dataset. While the model itself cannot be plotted directly (only 2D/3D relationships can be visualized), the predicted values and the prediction error are plotted for each sample (see Task 3 for more details and source code).
Overfitting and underfitting are terms commonly used in Data Science to describe how well a data model performs in real-world conditions:
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# define some test data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 3, 3, 4, 7, 8, 12, 14, 16, 20])

# split train/test data (first 80% for training)
n_train = int(len(x) * 0.8)
x_train = x[:n_train]
y_train = y[:n_train]
x_predict = x[n_train:]  # held-out inputs (not used further in this example)

# create and fit the LR model
model = LinearRegression()
model.fit(x_train.reshape(-1, 1), y_train)

# print model parameters (slope and intercept)
print(model.coef_)
print(model.intercept_)

# use the model to make predictions over the full range of x
model_output = model.predict(x.reshape(-1, 1))

# plot the predictions compared to the target values
plt.plot(x, y, x, model_output)
plt.title("Linear regression")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(["target", "predicted"])
plt.show()
Download the project archive and unzip it on your PC. Install the requirements using pip (e.g. py -3 -m pip install -r requirements.txt). The script (task1.py) uses linear regression to fit a sample of generated data.
Run the program and solve the following scenarios:
Based on the experimental results in Task 1, answer the following questions:
The script (task2.py) loads a dataset from a CSV file. Run the script as in Task 1 and present your results.
The script (task3.py) loads the Boston Housing Dataset and trains a linear model over multiple features. The prediction results (median housing prices in thousands of dollars) are shown in the plot and compared to the original dataset. Run the program and solve the following scenarios:
Based on the experimental results, answer the following questions: