This shows you the differences between two versions of the page.
ewis:laboratoare:07 [2022/04/19 22:08] alexandru.predescu [Training. Validation. Testing.] |
ewis:laboratoare:07 [2023/04/19 18:08] (current) alexandru.predescu [Exercises] |
||
---|---|---|---|
Line 202: | Line 202: | ||
=== Overfitting. Underfitting. === | === Overfitting. Underfitting. === | ||
- | Overfitting and Underfitting are commonly used in the context of Data Science for describing the quality of data models: | + | Overfitting and Underfitting are commonly used in Data Science to describe the quality of data models in real-world conditions: |
* **Overfitting** means that the model has been trained "too well" and fits the training dataset too closely. This usually happens when the model is too complex (i.e. too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably not be very accurate on new data. In linear regression, a higher polynomial order may lead to overfitting. | * **Overfitting** means that the model has been trained "too well" and fits the training dataset too closely. This usually happens when the model is too complex (i.e. too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably not be very accurate on new data. In linear regression, a higher polynomial order may lead to overfitting. | ||
* **Underfitting** means that the model does not fit the training data and therefore misses the trends in the data. It could also happen when, for example, we fit a linear model (simple linear regression) to data that is not linear. | * **Underfitting** means that the model does not fit the training data and therefore misses the trends in the data. It could also happen when, for example, we fit a linear model (simple linear regression) to data that is not linear. | ||
- | <note>Check the plot for polynomial linear regression and try to guess which polynomial order would overfit and which one would underfit.</note> | + | <note>Check the plot for polynomial linear regression and try to guess which polynomial order would overfit and which one would underfit. Which criteria can you use to select the best model?</note> |
<note tip> | <note tip> | ||
- | RMSE and R-Squared provide a rough estimation of model over/underfitting on a given dataset. Cross-validation is then used to evaluate the model on independent datasets. | + | RMSE and R-Squared provide a rough estimation of model over/underfitting on a given dataset when comparing training and validation results. Cross-validation can then be used to evaluate the model on independent datasets. |
</note> | </note> | ||
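+ |||
+ | The sketch below is not part of the lab project; it uses a small synthetic dataset (made-up values) and //numpy.polyfit// to compare training and validation RMSE for several polynomial orders, illustrating how over/underfitting shows up in the two scores. | ||
+ |||
+ | <code python> | ||
+ | import numpy as np | ||
+ | |||
+ | # synthetic data: quadratic trend plus noise (illustrative values only) | ||
+ | rng = np.random.default_rng(0) | ||
+ | x = np.linspace(0, 10, 30) | ||
+ | y = 0.5 * x**2 - 2 * x + 3 + rng.normal(0, 2, size=x.size) | ||
+ | |||
+ | # hold out every third point as a validation set | ||
+ | val_mask = np.arange(x.size) % 3 == 0 | ||
+ | x_train, y_train = x[~val_mask], y[~val_mask] | ||
+ | x_val, y_val = x[val_mask], y[val_mask] | ||
+ | |||
+ | def rmse(y_true, y_pred): | ||
+ |     return np.sqrt(np.mean((y_true - y_pred) ** 2)) | ||
+ | |||
+ | # compare training and validation RMSE for increasing polynomial orders | ||
+ | for order in [1, 2, 5, 9]: | ||
+ |     coefs = np.polyfit(x_train, y_train, order) | ||
+ |     train_rmse = rmse(y_train, np.polyval(coefs, x_train)) | ||
+ |     val_rmse = rmse(y_val, np.polyval(coefs, x_val)) | ||
+ |     print(order, round(train_rmse, 2), round(val_rmse, 2)) | ||
+ | </code> | ||
+ |||
+ | A training RMSE that keeps dropping while the validation RMSE stops improving (or gets worse) is a sign of overfitting; order 1 underfits the quadratic trend in this example. | ||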
+ | |||
+ | ==== Scikit-learn ==== | ||
+ | |||
+ | <code python> | ||
+ | import numpy as np | ||
+ | from sklearn.linear_model import LinearRegression | ||
+ | import matplotlib.pyplot as plt | ||
+ | |||
+ | # define some test data | ||
+ | x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) | ||
+ | y = np.array([2, 3, 3, 4, 7, 8, 12, 14, 16, 20]) | ||
+ | |||
+ | # split train/test data | ||
+ | n_train = int(len(x)*0.8) | ||
+ | x_train = x[:n_train] | ||
+ | y_train = y[:n_train] | ||
+ | x_predict = x[n_train:] | ||
+ | |||
+ | # create and fit the LR model | ||
+ | model = LinearRegression() | ||
+ | reg = model.fit(x_train.reshape(-1, 1), y_train) | ||
+ | |||
+ | # print model parameters | ||
+ | print(model.coef_) | ||
+ | print(model.intercept_) | ||
+ | |||
+ | # use the model to make predictions over the full dataset (training and test range) | ||
+ | model_output = model.predict(x.reshape(-1, 1)) | ||
+ | |||
+ | # plot the results compared to the target values | ||
+ | plt.plot(x, y, x, model_output) | ||
+ | plt.title("Linear regression") | ||
+ | plt.xlabel("x") | ||
+ | plt.ylabel("y") | ||
+ | plt.legend(["target", "predicted"]) | ||
+ | plt.show() | ||
+ | |||
+ | </code> | ||
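+ |||
+ | RMSE and R-Squared can be computed with //sklearn.metrics//. The snippet below is a minimal, self-contained sketch: the target values match the example above, while the predicted values are made-up numbers standing in for a model output. | ||
+ |||
+ | <code python> | ||
+ | import numpy as np | ||
+ | from sklearn.metrics import mean_squared_error, r2_score | ||
+ | |||
+ | # example target values and model predictions (predictions are made-up numbers) | ||
+ | y_true = np.array([2, 3, 3, 4, 7, 8, 12, 14, 16, 20]) | ||
+ | y_pred = np.array([1.5, 3.4, 5.3, 7.2, 9.1, 11.0, 12.9, 14.8, 16.7, 18.6]) | ||
+ | |||
+ | # RMSE: square root of the mean squared error between target and prediction | ||
+ | rmse = np.sqrt(mean_squared_error(y_true, y_pred)) | ||
+ | |||
+ | # R-Squared: fraction of the target variance explained by the model | ||
+ | r2 = r2_score(y_true, y_pred) | ||
+ | |||
+ | print("RMSE:", round(rmse, 3)) | ||
+ | print("R-Squared:", round(r2, 3)) | ||
+ | </code> | ||
+ |||
+ | In Task 1, these two metrics can be computed for each polynomial order and plotted against the order. | ||
+ |||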
==== Exercises ==== | ==== Exercises ==== | ||
Line 218: | Line 256: | ||
Download the {{:ewis:laboratoare:lab7:project_lab7.zip|project archive}} and unzip it on your PC. Install the requirements using pip (e.g. //py -3 -m pip install -r requirements.txt//). | Download the {{:ewis:laboratoare:lab7:project_lab7.zip|project archive}} and unzip it on your PC. Install the requirements using pip (e.g. //py -3 -m pip install -r requirements.txt//). | ||
- | The code sample (//task12.py//) uses linear regression to fit a sample of generated data. | + | The script (//task1.py//) uses linear regression to fit a sample of generated data. |
Run the program and solve the following scenarios: | Run the program and solve the following scenarios: | ||
* Experiment with different polynomial orders | * Experiment with different polynomial orders | ||
* Plot the RMSE and R-Squared values for each case | * Plot the RMSE and R-Squared values for each case | ||
- | |||
- | *This task is required for solving Task 2. | ||
- | |||
- | === Task 2 (3p) === | ||
Based on the experimental results in Task 1, answer the following questions: | Based on the experimental results in Task 1, answer the following questions: | ||
- | * Q1: Which is the optimal polynomial order for this dataset with regards to the RMSE and model complexity? Tip: the RMSE improvement starts to decrease after a certain polynomial order on the RMSE chart. | + | * Q1: Which is the optimal polynomial order for this dataset with regard to the RMSE and model complexity? Hint: the RMSE improvement starts to decrease after a certain polynomial order on the RMSE chart. |
- | * Q2: Which is the optimal polynomial order for this dataset with regards to the R-Squared coefficient and model complexity? Tip: the R2 coefficient improvement starts to decrease after a certain polynomial order on the R-Squared chart. | + | * Q2: Which is the optimal polynomial order for this dataset with regard to the R-Squared coefficient and model complexity? Hint: the R2 coefficient improvement starts to decrease after a certain polynomial order on the R-Squared chart. |
* Q3: Explain the results based on the provided function that is used to generate the dataset. | * Q3: Explain the results based on the provided function that is used to generate the dataset. | ||
- | Submit your answers on Moodle as PDF report. | + | === Task 2 (3p) === |
- | === Task 3 (4p) === | + | The script (//task2.py//) loads a dataset from a CSV file. Run a similar analysis as in Task 1 and present your results. |
- | The code sample (//task3.py//) loads the Boston Housing Dataset and trains a linear model over multiple features. The prediction results (median housing prices in thousands of dollars) are shown in the plot and compared to the original dataset. | + | [[https://www.kaggle.com/datasets/meetnagadia/bitcoin-stock-data-sept-17-2014-august-24-2021|Bitcoin Price Dataset]] |
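+ |||
+ | As a starting point, a CSV dataset can be loaded with pandas. In the sketch below the file name //BTC-USD.csv// and the //Close// column are assumptions based on the Kaggle dataset description; adjust them to match the file in the project archive. | ||
+ |||
+ | <code python> | ||
+ | import numpy as np | ||
+ | import pandas as pd | ||
+ | |||
+ | # load the CSV file (the file name is an assumption; adjust to the actual file) | ||
+ | df = pd.read_csv("BTC-USD.csv") | ||
+ | print(df.head()) | ||
+ | |||
+ | # example: use the day index as x and the closing price as y for regression | ||
+ | x = np.arange(len(df)) | ||
+ | y = df["Close"].to_numpy() | ||
+ | </code> | ||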
+ | |||
+ | === Task 3 (3p) === | ||
+ | |||
+ | The script (//task3.py//) loads the Boston Housing Dataset and trains a linear model over multiple features. The prediction results (median housing prices in thousands of dollars) are shown in the plot and compared to the original dataset. | ||
Run the program and solve the following scenarios: | Run the program and solve the following scenarios: | ||
* [TODO 1] Change the size of the training dataset (percent) and evaluate the models that are obtained in each case using RMSE | * [TODO 1] Change the size of the training dataset (percent) and evaluate the models that are obtained in each case using RMSE | ||
Line 245: | Line 284: | ||
* Q2. What is the amount (percent) of training data that provides the best results in terms of prediction accuracy on validation data? | * Q2. What is the amount (percent) of training data that provides the best results in terms of prediction accuracy on validation data? | ||
* Q3. What happens if the amount of training data is small, e.g. 10%, with regard to the prediction accuracy and the over/underfitting of the regression model? | * Q3. What happens if the amount of training data is small, e.g. 10%, with regard to the prediction accuracy and the over/underfitting of the regression model? | ||
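+ |||
+ | Related to the questions above, the sketch below shows one way to vary the training percentage and evaluate the RMSE on the held-out data. It uses a small synthetic multi-feature dataset (made-up coefficients and noise) instead of the lab dataset, so the numbers are only illustrative. | ||
+ |||
+ | <code python> | ||
+ | import numpy as np | ||
+ | from sklearn.linear_model import LinearRegression | ||
+ | from sklearn.model_selection import train_test_split | ||
+ | from sklearn.metrics import mean_squared_error | ||
+ | |||
+ | # synthetic multi-feature data standing in for the real dataset (illustrative only) | ||
+ | rng = np.random.default_rng(1) | ||
+ | X = rng.normal(size=(200, 3)) | ||
+ | y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, size=200) | ||
+ | |||
+ | # evaluate RMSE on the held-out data for several training set sizes | ||
+ | for train_percent in [0.1, 0.3, 0.5, 0.7, 0.9]: | ||
+ |     X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_percent, random_state=0) | ||
+ |     model = LinearRegression().fit(X_train, y_train) | ||
+ |     rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test))) | ||
+ |     print(train_percent, round(rmse, 3)) | ||
+ | </code> | ||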
- | |||
- | Submit your answers on Moodle as PDF report. | ||
==== Resources ==== | ==== Resources ==== | ||
- | * {{:ewis:laboratoare:lab7:project_lab7.zip|Project}} | + | * {{:ewis:laboratoare:lab7:lab7.zip|Project}} |
* {{:ewis:laboratoare:python_workflow.pdf|Python Workflow}} | * {{:ewis:laboratoare:python_workflow.pdf|Python Workflow}} | ||
* [[https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html|The Boston Housing Dataset]] | * [[https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html|The Boston Housing Dataset]] |