Differences

ewis:laboratoare:07 [2022/04/19 22:07]
alexandru.predescu [Training. Validation. Testing.]
ewis:laboratoare:07 [2023/04/19 18:08] (current)
alexandru.predescu [Exercises]
Line 182: Line 182:
 == RMSE ==
 
-**RMSE** is the square root of the mean of the squared residuals (when the residuals have zero mean, this is their standard deviation). It indicates the absolute fit of the model to the data (how close the observed data points are to the model's predicted values). Lower values of RMSE indicate a better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion if the main purpose of the model is prediction.
+RMSE is the square root of the mean of the squared residuals (when the residuals have zero mean, this is their standard deviation). It indicates the absolute fit of the model to the data (how close the observed data points are to the model's predicted values). Lower values of RMSE indicate a better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion if the main purpose of the model is prediction.
 
 $ RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2} $
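As a minimal sketch of the formula above, RMSE can be computed directly with NumPy. The function name ''rmse'' and the sample values are made up for illustration and are not part of the lab project.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return float(np.sqrt(np.mean(residuals ** 2)))


# illustrative values only (not taken from the lab dataset)
y_observed = [3, 5, 7, 9]
y_predicted = [2, 5, 8, 11]
print(rmse(y_observed, y_predicted))
```

RMSE is expressed in the same units as y, so its values are only comparable between models evaluated on the same data.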
Line 188: Line 188:
 == R-squared ==
 
-**R-squared**, also known as the coefficient of determination, is a metric that describes the goodness of fit of a model, typically as a value between 0 and 1 (it becomes negative when the model predicts worse than the mean of y). In regression, it shows how well the regression predictions approximate the real data points. A value of 1 indicates that the regression predictions perfectly fit the data. In effect, R-squared compares the fitted regression line against a baseline that always predicts the mean of y.
+R-squared, also known as the coefficient of determination, is a metric that describes the goodness of fit of a model, typically as a value between 0 and 1 (it becomes negative when the model predicts worse than the mean of y). In regression, it shows how well the regression predictions approximate the real data points. A value of 1 indicates that the regression predictions perfectly fit the data. In effect, R-squared compares the fitted regression line against a baseline that always predicts the mean of y.
 
 $ R^2 = 1-\frac{\sum_{i=1}^N(y_i - \hat{y}_i)^2}{\sum_{i=1}^N(y_i - \bar{y})^2} $
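The formula above can be sketched in a few lines of NumPy. The function name ''r_squared'' and the sample values are illustrative assumptions, chosen so that the three typical cases (good fit, mean baseline, worse than the mean) are easy to check by hand.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
    return float(1.0 - ss_res / ss_tot)


# illustrative values only
y_observed = np.array([2.0, 4.0, 6.0, 8.0])

print(r_squared(y_observed, [3, 4, 6, 7]))                    # close fit
print(r_squared(y_observed, np.full(4, y_observed.mean())))   # always predicting the mean
print(r_squared(y_observed, [8, 6, 4, 2]))                    # worse than the mean
```

The second call shows why the mean is the reference point: a model that always predicts the mean of y has a residual sum of squares equal to the total sum of squares.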
Line 202: Line 202:
 === Overfitting. Underfitting. ===
 
-Overfitting and Underfitting are commonly used in the context of Data Science for describing the quality of data models:
+Overfitting and Underfitting are commonly used in the context of Data Science for describing the quality of data models in real-world conditions:
 
   * **Overfitting** means that the model we trained has been trained "too well" and fits the training dataset too closely. This usually happens when the model is too complex (i.e. too many features/variables compared to the number of observations). This model will be very accurate on the training data but will probably not be very accurate on new data. In linear regression, a higher polynomial order may lead to overfitting.
   * **Underfitting** means that the model does not fit the training data and therefore misses the trends in the data. It can also happen when, for example, we fit a linear model (simple linear regression) to data that is not linear.
  
-<note>Check the plot for polynomial linear regression and try to guess which polynomial order would overfit and which one would underfit.</note>
+<note>Check the plot for polynomial linear regression and try to guess which polynomial order would overfit and which one would underfit. Which criteria can you use to select the best model?</note>
  
 <note tip>
-RMSE and R-Squared provide a rough estimate of model over/underfitting on a given dataset. Cross-validation is then used to evaluate the model on independent datasets.
+RMSE and R-Squared provide a rough estimate of model over/underfitting on a given dataset when comparing test and validation results. Cross-validation can further be used to evaluate the model on independent datasets.
 </note>
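The effect of the polynomial order on over/underfitting can be illustrated with a small experiment. This is only a sketch under stated assumptions: the data is synthetic (a quadratic trend with Gaussian noise, not the lab dataset), the split is a simple interleaved one, and ''np.polyfit'' is used for the least-squares polynomial fit.

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# synthetic data (assumption): quadratic trend plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1 + 2 * x + 3 * x ** 2 + rng.normal(scale=0.2, size=x.size)

# interleaved split: even indices for training, odd indices for validation
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

train_rmse, val_rmse = {}, {}
for order in range(1, 10):
    # least-squares polynomial fit of the given order on the training data
    coeffs = np.polyfit(x_train, y_train, order)
    train_rmse[order] = rmse(y_train, np.polyval(coeffs, x_train))
    val_rmse[order] = rmse(y_val, np.polyval(coeffs, x_val))
    print(f"order {order}: train RMSE = {train_rmse[order]:.3f}, "
          f"validation RMSE = {val_rmse[order]:.3f}")
```

The training RMSE can only decrease (or stay the same) as the order grows, because each higher-order polynomial contains the lower-order ones as special cases. The validation RMSE is therefore the value to watch when choosing the order.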
 +
 +==== Scikit-learn ====
 +
 +<code python>
 +import numpy as np
 +from sklearn.linear_model import LinearRegression
 +import matplotlib.pyplot as plt
 +
 +# define some test data
 +x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
 +y = np.array([2, 3, 3, 4, 7, 8, 12, 14, 16, 20])
 +
 +# split train/test data
 +n_train = int(len(x)*0.8)
 +x_train = x[:n_train]
 +y_train = y[:n_train]
 +x_predict = x[n_train:]
 +
 +# create and fit the LR model
 +model = LinearRegression()
 +reg = model.fit(x_train.reshape(-1, 1), y_train)
 +
 +# print model parameters
 +print(model.coef_)
 +print(model.intercept_)
 +
 +# use the model to make predictions over the full x range (training and held-out points)
 +model_output = model.predict(x.reshape(-1, 1))
 +
 +# plot the predictions against the target values, both over the same x axis
 +plt.plot(x, y)
 +plt.plot(x, model_output)
 +plt.title("Linear regression")
 +plt.xlabel("x")
 +plt.ylabel("y")
 +plt.legend(["target", "predicted"])
 +plt.show()
 +
 +</code>
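The scikit-learn example above defines a train/test split but does not score the model on the held-out points. A hedged sketch of that step follows, using the same data and the same 80/20 split. To keep it dependency-light, ''np.polyfit'' with degree 1 stands in for ''LinearRegression'' (both compute an ordinary least-squares line); the names ''rmse'', ''x_test'' and ''y_test'' are introduced here for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# same data as in the scikit-learn example
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 3, 3, 4, 7, 8, 12, 14, 16, 20])

# first 80% of the points for training, last 20% held out for testing
n_train = int(len(x) * 0.8)
x_train, y_train = x[:n_train], y[:n_train]
x_test, y_test = x[n_train:], y[n_train:]

# degree-1 least-squares fit (stand-in for LinearRegression)
slope, intercept = np.polyfit(x_train, y_train, 1)

# score the fitted line separately on training and held-out points
rmse_train = rmse(y_train, slope * x_train + intercept)
rmse_test = rmse(y_test, slope * x_test + intercept)

print("slope:", slope, "intercept:", intercept)
print("train RMSE:", rmse_train)
print("test RMSE:", rmse_test)
```

Because the split keeps the first points for training, the held-out points lie outside the training range, so the test score measures extrapolation rather than interpolation.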
  
 ==== Exercises ====
Line 218: Line 256:
  
 Download the {{:ewis:laboratoare:lab7:project_lab7.zip|project archive}} and unzip it on your PC. Install the requirements using pip (e.g. //py -3 -m pip install -r requirements.txt//).
-The code sample (//task12.py//) uses linear regression to fit a sample of generated data.
+The script (//task1.py//) uses linear regression to fit a sample of generated data.
 Run the program and solve the following scenarios:
   * Experiment with different polynomial orders
   * Plot the RMSE and R-Squared values for each case
- 
-*This task is required for solving Task 2. 
- 
-=== Task 2 (3p) ===    ​ 
    
 Based on the experimental results in Task 1, answer the following questions:
-  * Q1: Which is the optimal polynomial order for this dataset with regard to the RMSE and model complexity? Tip: the RMSE improvement starts to decrease after a certain polynomial order on the RMSE chart.
-  * Q2: Which is the optimal polynomial order for this dataset with regard to the R-Squared coefficient and model complexity? Tip: the R2 coefficient improvement starts to decrease after a certain polynomial order on the R-Squared chart.
+  * Q1: Which is the optimal polynomial order for this dataset with regard to the RMSE and model complexity? Hint: the RMSE improvement starts to decrease after a certain polynomial order on the RMSE chart.
+  * Q2: Which is the optimal polynomial order for this dataset with regard to the R-Squared coefficient and model complexity? Hint: the R2 coefficient improvement starts to decrease after a certain polynomial order on the R-Squared chart.
   * Q3: Explain the results based on the provided function that is used to generate the dataset.
  
-Submit your answers on Moodle as PDF report.
+=== Task 2 (3p) ===
 
-=== Task 3 (4p) ===
+The script (//task2.py//) loads a dataset from a CSV file. Run a script similar to the one in Task 1 and present your results.
 
-The code sample (//task3.py//) loads the Boston Housing Dataset and trains a linear model over multiple features. The prediction results (median housing prices in thousands of dollars) are shown in the plot and compared to the original dataset.
+[[https://www.kaggle.com/datasets/meetnagadia/bitcoin-stock-data-sept-17-2014-august-24-2021|Bitcoin Price Dataset]]
 + 
 +=== Task 3 (3p) ===  
 + 
 +The script ​(//​task3.py//​) loads the Boston Housing Dataset and trains a linear model over multiple features. The prediction results (median housing prices in thousands of dollars) are shown in the plot and compared to the original dataset.
 Run the program and solve the following scenarios:
   * [TODO 1] Change the size of the training dataset (percent) and evaluate the models that are obtained in each case using RMSE
Line 245: Line 284:
   * Q2. What is the amount (percent) of training data that provides the best results in terms of prediction accuracy on validation data?
   * Q3. What happens if the amount of training data is small, e.g. 10%, with regard to the prediction accuracy and the over/underfitting of the regression model?
- 
-Submit your answers on Moodle as PDF report. 
  
 ==== Resources ====
 
-  * {{:ewis:laboratoare:lab7:project_lab7.zip|Project}}
+  * {{:ewis:laboratoare:lab7:lab7.zip|Project}}
   * {{:ewis:laboratoare:python_workflow.pdf|Python Workflow}}
   * [[https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html|The Boston Housing Dataset]]
ewis/laboratoare/07.1650395268.txt.gz · Last modified: 2022/04/19 22:07 by alexandru.predescu
CC Attribution-Share Alike 3.0 Unported