Differences

fp2024:hw4 [2024/05/09 18:47]
pdmatei
fp2024:hw4 [2024/05/09 18:51] (current)
pdmatei
Line 31: Line 31:
 $$ y = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix} \cdot \begin{pmatrix} a \\ b \\ c \end{pmatrix}$$
  
-If $math[X] is a matrix with three columns (for the two features of the model), and $math[n] lines, one for each entry in the dataset, then
+If $math[X] is a matrix with three columns (for the two features of the model, //plus// $math[x_3 = 1]), and $math[n] lines, one for each entry in the dataset, then
 evaluating our price estimations for each **multi-dimensional point** in the dataset is given by the vector ($math[n] lines, one column):
  
Line 68: Line 68:
 def split(percentage: Double): (Dataset, Dataset) = ???
 </code>
-Generally, when a dataset is being used to implement linear regression, we need to put aside part of the dataset (usually 20%) for evaluation. It is essential that this part is not used in the training process, in order to faithfully evaluated how the hypothesis behaves on unseen data. At the same time, it is important that this 20% data is **representative** for the entire dataset (hence it cannot be the first or last 20% part of the dataset), but a **representative** sample. For instance if $math[(x_1, y_1), (x_2,y_2), \ldots, (x_{20},y_{20}] is the set of surface-to-price points, sorted after surfaces, and we decide to keep 20% (or 0.2) for evaluation, then the points that will be put aside might be: $math[(x_1,y_1), (x_5,y_5), (x_{10}, y_{10}) (x_{15}, y_{15})].
+Generally, when a dataset is being used to implement linear regression, we need to put aside part of the dataset (usually 20%) for evaluation. It is essential that this part is not used in the training process, in order to faithfully evaluate how the hypothesis behaves on unseen data. At the same time, it is important that this 20% data is **representative** for the entire dataset (hence it cannot be the first or last 20% part of the dataset), but a **representative** sample. For instance if $math[(x_1, y_1), (x_2,y_2), \ldots, (x_{20},y_{20})] is the set of surface-to-price points, sorted after surfaces, and we decide to keep 20% (or 0.2) for evaluation, then the points that will be put aside might be: $math[(x_1,y_1), (x_5,y_5), (x_{10}, y_{10}) (x_{15}, y_{15})]. This is very similar to **sampling** the dataset.
  
 In the split function, ''percentage'' is expressed as a value between ''0'' and ''1'', and represents the amount of evaluation data to be put aside from the entire dataset. In the returned pair, the first component is the training part of the dataset, and the second - the evaluation.
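
The sampling idea behind ''split'' can be sketched as follows. This is only an illustration under stated assumptions, not the expected solution: the ''Dataset'' type is not shown in this revision, so the hypothetical helper ''splitRows'' works on a plain ''List'' of rows that is assumed to be already sorted. It sets aside every k-th row, with k obtained by rounding ''1 / percentage'', which is one possible way of picking a representative sample and is only meaningful for small percentages such as ''0.2''.

```scala
// Hypothetical helper (not part of the homework skeleton): splits any list of
// rows into (training, evaluation), mirroring the order of the returned pair
// described for split.
def splitRows[A](rows: List[A], percentage: Double): (List[A], List[A]) = {
  require(percentage > 0 && percentage < 1, "percentage must be strictly between 0 and 1")
  // Assumption: every step-th row is put aside (step = 5 for percentage = 0.2).
  val step = math.round(1 / percentage).toInt
  // partition returns (rows matching the predicate, all other rows).
  val (evaluation, training) =
    rows.zipWithIndex.partition { case (_, i) => (i + 1) % step == 0 }
  (training.map(_._1), evaluation.map(_._1))
}
```

With 20 sorted rows and ''percentage = 0.2'', this particular sketch sets aside the 5th, 10th, 15th and 20th rows, so the evaluation part is spread over the whole range instead of being taken from one end.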