This shows you the differences between two versions of the page.
|
ewis:laboratoare:07 [2022/04/19 22:20] alexandru.predescu [Training. Validation. Testing.] |
ewis:laboratoare:07 [2023/04/19 18:08] (current) alexandru.predescu [Exercises] |
||
|---|---|---|---|
| Line 210: | Line 210: | ||
| <note tip> | <note tip> | ||
| - | RMSE and R-Squared provide a rough estimation of model over/underfitting on a given dataset. Cross-validation is then used to evaluate the model on independent datasets. | + | RMSE and R-Squared provide a rough estimation of model over/underfitting on a given dataset when comparing test and validation results. Cross-validation can be further used to evaluate the model on independent datasets. |
| </note> | </note> | ||
| + | |||
| + | ==== Scikit-learn ==== | ||
| + | |||
| + | <code python> | ||
| + | import numpy as np | ||
| + | from sklearn.linear_model import LinearRegression | ||
| + | import matplotlib.pyplot as plt | ||
| + | |||
| + | # define some sample data | ||
| + | x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) | ||
| + | y = np.array([2, 3, 3, 4, 7, 8, 12, 14, 16, 20]) | ||
| + | |||
| + | # split train/test data | ||
| + | n_train = int(len(x)*0.8) | ||
| + | x_train = x[:n_train] | ||
| + | y_train = y[:n_train] | ||
| + | x_predict = x[n_train:] | ||
| + | |||
| + | # create and fit the LR model | ||
| + | model = LinearRegression() | ||
| + | reg = model.fit(x_train.reshape(-1, 1), y_train) | ||
| + | |||
| + | # print model parameters | ||
| + | print(model.coef_) | ||
| + | print(model.intercept_) | ||
| + | |||
| + | # use the model to make predictions over the full dataset (training and held-out points) | ||
| + | model_output = model.predict(x.reshape(-1, 1)) | ||
| + | |||
| + | # plot the results compared to the target values | ||
| + | plt.plot(x, y, x, model_output) | ||
| + | plt.title("Linear regression") | ||
| + | plt.xlabel("x") | ||
| + | plt.ylabel("y") | ||
| + | plt.legend(["target", "predicted"]) | ||
| + | plt.show() | ||
| + | |||
| + | </code> | ||
| ==== Exercises ==== | ==== Exercises ==== | ||
| Line 218: | Line 256: | ||
| Download the {{:ewis:laboratoare:lab7:project_lab7.zip|project archive}} and unzip it on your PC. Install the requirements using pip (e.g. //py -3 -m pip install -r requirements.txt//). | Download the {{:ewis:laboratoare:lab7:project_lab7.zip|project archive}} and unzip it on your PC. Install the requirements using pip (e.g. //py -3 -m pip install -r requirements.txt//). | ||
| - | The code sample (//task12.py//) uses linear regression to fit a sample of generated data. | + | The script (//task1.py//) uses linear regression to fit a sample of generated data. |
| Run the program and solve the following scenarios: | Run the program and solve the following scenarios: | ||
| * Experiment with different polynomial orders | * Experiment with different polynomial orders | ||
| * Plot the RMSE and R-Squared values for each case | * Plot the RMSE and R-Squared values for each case | ||
| - | |||
| - | *This task is required for solving Task 2. | ||
| - | |||
| - | === Task 2 (3p) === | ||
| Based on the experimental results in Task 1, answer the following questions: | Based on the experimental results in Task 1, answer the following questions: | ||
| - | * Q1: Which is the optimal polynomial order for this dataset with regards to the RMSE and model complexity? Tip: the RMSE improvement starts to decrease after a certain polynomial order on the RMSE chart. | + | * Q1: Which is the optimal polynomial order for this dataset with regards to the RMSE and model complexity? Hint: the RMSE improvement starts to decrease after a certain polynomial order on the RMSE chart. |
| - | * Q2: Which is the optimal polynomial order for this dataset with regards to the R-Squared coefficient and model complexity? Tip: the R2 coefficient improvement starts to decrease after a certain polynomial order on the R-Squared chart. | + | * Q2: Which is the optimal polynomial order for this dataset with regards to the R-Squared coefficient and model complexity? Hint: the R2 coefficient improvement starts to decrease after a certain polynomial order on the R-Squared chart. |
| * Q3: Explain the results based on the provided function that is used to generate the dataset. | * Q3: Explain the results based on the provided function that is used to generate the dataset. | ||
| - | Submit your answers on Moodle as PDF report. | + | === Task 2 (3p) === |
| - | === Task 3 (4p) === | + | The script (//task2.py//) loads a dataset from a CSV file. Run an analysis similar to the one in Task 1 and present your results. | ||
| - | The code sample (//task3.py//) loads the Boston Housing Dataset and trains a linear model over multiple features. The prediction results (median housing prices in thousands of dollars) are shown in the plot and compared to the original dataset. | + | [[https://www.kaggle.com/datasets/meetnagadia/bitcoin-stock-data-sept-17-2014-august-24-2021|Bitcoin Price Dataset]] |
| + | |||
| + | === Task 3 (3p) === | ||
| + | |||
| + | The script (//task3.py//) loads the Boston Housing Dataset and trains a linear model over multiple features. The prediction results (median housing prices in thousands of dollars) are shown in the plot and compared to the original dataset. | ||
| Run the program and solve the following scenarios: | Run the program and solve the following scenarios: | ||
| * [TODO 1] Change the size of the training dataset (percent) and evaluate the models that are obtained in each case using RMSE | * [TODO 1] Change the size of the training dataset (percent) and evaluate the models that are obtained in each case using RMSE | ||
| Line 245: | Line 284: | ||
| * Q2. What is the amount (percent) of training data that provides the best results in terms of prediction accuracy on validation data? | * Q2. What is the amount (percent) of training data that provides the best results in terms of prediction accuracy on validation data? | ||
| * Q3. What happens if the amount of training data is small, e.g. 10%, with regards to the prediction accuracy and the over/underfitting of the regression model? | * Q3. What happens if the amount of training data is small, e.g. 10%, with regards to the prediction accuracy and the over/underfitting of the regression model? | ||
| - | |||
| - | Submit your answers on Moodle as PDF report. | ||
| ==== Resources ==== | ==== Resources ==== | ||
| - | * {{:ewis:laboratoare:lab7:project_lab7.zip|Project}} | + | * {{:ewis:laboratoare:lab7:lab7.zip|Project}} |
| * {{:ewis:laboratoare:python_workflow.pdf|Python Workflow}} | * {{:ewis:laboratoare:python_workflow.pdf|Python Workflow}} | ||
| * [[https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html|The Boston Housing Dataset]] | * [[https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html|The Boston Housing Dataset]] | ||
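
The RMSE and R-Squared comparison across polynomial orders described in Task 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: it reuses the small sample from the Scikit-learn example, fits with NumPy's ''polyfit'' instead of the lab's //task1.py// (whose code is not shown here), and the helper names ''rmse'', ''r_squared'' and ''evaluate_orders'' are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    # Root mean squared error between targets and predictions
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def evaluate_orders(x, y, orders):
    # Fit one polynomial per order and score it on the same data
    # (hypothetical helper, not part of the lab project)
    results = {}
    for order in orders:
        coeffs = np.polyfit(x, y, order)
        y_pred = np.polyval(coeffs, x)
        results[order] = (rmse(y, y_pred), r_squared(y, y_pred))
    return results


# sample data taken from the Scikit-learn example
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 3, 3, 4, 7, 8, 12, 14, 16, 20], dtype=float)

results = evaluate_orders(x, y, orders=[1, 2, 3])
for order, (err, r2) in results.items():
    print(f"order={order}  RMSE={err:.3f}  R2={r2:.3f}")
```

Because each model is scored on the data it was fitted to, the RMSE here can only stay the same or drop as the order grows. Signs of overfitting show up only when the same metrics are also computed on data held out from fitting.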