====== Objective: house price prediction ======
  
Given this dataset, your task is to **predict** prices of **other** houses, outside of the dataset, based on their surface. In other words, for a given house surface $math[x] you need to **predict** its price $math[y(x)]. More generally, we need to find a suitable price prediction function $math[y], which is usually called a **hypothesis**.
  
By examining the distribution of points in the figure, one can see that there is a **linear** dependency between house area and house price. This dependency is not perfect, but it is strong. Therefore, the hypothesis we will use in this homework is $math[y = a*x + b]. We now need to find the values for $math[a] and $math[b] in such a way that the line $math[y] best **fits** our dataset. One example of a line $math[y] is given by two of its points, shown in green in the dataset. By simply examining the figure, we can see that this line is a good fit for the dataset.
  
In order to find $math[a] and $math[b] we need a formal condition to express //**best fit**//. Let $math[ (x_i,y_i) ] be a point from the dataset. Then, the value $math[ \mid y_i - (a*x_i + b) \mid ] is the error between our price estimation $math[a*x_i + b] and the real price $math[y_i]. We will choose $math[a] and $math[b] in such a way as to **minimise** the **sum** of all such errors, for the **entire** dataset.
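Written out over all $math[n] points in the dataset, the best-fit condition amounts to solving:

$$ \min_{a,b} \sum_{i=1}^{n} \mid y_i - (a*x_i + b) \mid $$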
  
In order to solve this homework, it is not necessary to understand how minimisation is performed, but you can read more [[https://mubaris.com/posts/linear-regression/|details]] about linear regression to get a better perspective of this homework.
In the previous figure, you might have already seen that a linear hypothesis starts to work poorly for surfaces over 2000 square feet. As it happens, there are also other features of such properties that influence their price. One such feature is the Garage Area, which is also expressed in square feet. Hence, we can improve our hypothesis into: $math[y = a*x_1 + b*x_2 + c], where $math[x_1] represents the surface area of the house and $math[x_2] represents the garage area. Now, our hypothesis has three **parameters** ($math[a,b,c]) which must be computed in such a way as to achieve a **best fit**.
  
When using more than one feature (more than one $math[x]), it is much more convenient to use a matrix representation. Suppose we also add a third feature $math[x_3] to the dataset, and have it be always equal to 1:
$$ y = a * x_1 + b * x_2 + c * x_3 $$
  
$$ y = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix} \cdot \begin{pmatrix} a \\ b \\ c \end{pmatrix} $$
  
If $math[X] is a matrix with three columns (the two features of the model, //plus// the constant feature $math[x_3 = 1]) and $math[n] lines, one for each entry in the dataset, then
evaluating our price estimations for each **multi-dimensional point** in the dataset is given by the vector ($math[n] lines, one column):
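$$ \begin{pmatrix} y_1 \\ y_2 \\ \ldots \\ y_n \end{pmatrix} = X \cdot \begin{pmatrix} a \\ b \\ c \end{pmatrix} $$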
  
<code scala>
def split(percentage: Double): (Dataset, Dataset) = ???
</code>
Generally, when a dataset is being used to implement linear regression, we need to put aside part of the dataset (usually 20%) for evaluation. It is essential that this part is not used in the training process, in order to faithfully evaluate how the hypothesis behaves on unseen data. At the same time, it is important that this 20% is not simply the first or last 20% of the dataset, but a **representative** sample. For instance, if $math[(x_1, y_1), (x_2,y_2), \ldots, (x_{20},y_{20})] is the set of surface-to-price points, sorted by surface, and we decide to keep 20% (or 0.2) for evaluation, then the points that will be put aside might be: $math[(x_1,y_1), (x_5,y_5), (x_{10}, y_{10}), (x_{15}, y_{15})]. This is very similar to **sampling** the dataset.
  
In the split function, ''percentage'' is expressed as a value between ''0'' and ''1'', and represents the amount of evaluation data to be put aside from the entire dataset. In the returned pair, the first component is the training part of the dataset, and the second - the evaluation part.
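A minimal sketch of such a sampling-based split, assuming ''Dataset'' stores its rows in a field ''data'' and offers a matching constructor (both are assumptions about the template):

<code scala>
// Sketch only: `data` and the Dataset(...) constructor are assumptions about the template.
def split(percentage: Double): (Dataset, Dataset) = {
  val k = math.max((1 / percentage).round.toInt, 1)  // percentage = 0.2 => keep every 5th row
  val (eval, train) = data.zipWithIndex.partition { case (_, i) => i % k == 0 }
  (Dataset(train.map(_._1)), Dataset(eval.map(_._1)))
}
</code>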
Matrices are represented as boxes storing an inner object ''m: Option[List[List[Double]]]''. The object can be a value ''Some(mp)'', where ''mp'' is a matrix of doubles, or it can be ''Nothing''. This latter value will be produced whenever the current matrix operation fails (for instance, in addition, if the dimensions of the matrices do not match).
  
**2.1.** Implement the apply methods in the companion object ''Matrix'', which allow creating matrices in various ways:
<code scala>
  def apply(): Matrix = ??? // should return the 'Nothing' matrix when no parameter is supplied
  def apply(raw: List[List[Double]]): Matrix = ???
  def apply(dataset: Dataset): Matrix = ???
  def apply(s: String): Matrix = ??? // parses the string as a matrix, with whitespace as the column delimiter and newline '\n' as the line delimiter
</code>
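For the string constructor, one possible parsing sketch (trimming behaviour is an assumption about the tests):

<code scala>
// Sketch: split on newlines, then on whitespace, and delegate to apply(raw).
def apply(s: String): Matrix =
  Matrix(s.split("\n").toList.map(_.trim.split("\\s+").toList.map(_.toDouble)))
</code>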
**2.2.** Implement method ''transpose'' which returns the transposition of a matrix. If the current matrix is a ''Nothing'', the returned value should also be ''Nothing''.
<code scala>
def transpose: Matrix = ???
</code>
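A minimal sketch, assuming the inner field ''m'' from above and the 2.1 constructors; it leans on the standard library's ''List.transpose'':

<code scala>
// Sketch: List.transpose flips lines and columns; None stays the Nothing matrix.
def transpose: Matrix = m match {
  case Some(lines) => Matrix(lines.transpose)
  case None        => Matrix()
}
</code>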
  
**2.3.** Implement matrix multiplication. If the dimensions of the two matrices do not match for multiplication, the resulting matrix should be ''Nothing''.
<code scala>
def *(other: Matrix): Matrix = ???
</code>
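One possible shape for the ''Some''/''Some'' case (a sketch, reusing ''List.transpose'' and the 2.1 constructors):

<code scala>
// Sketch: classic line-by-column product; a mismatch or a Nothing operand gives Matrix().
def *(other: Matrix): Matrix = (m, other.m) match {
  case (Some(a), Some(b)) if a.head.length == b.length =>
    val bt = b.transpose  // columns of b, as lines
    Matrix(a.map(line => bt.map(col => line.zip(col).map { case (x, y) => x * y }.sum)))
  case _ => Matrix()
}
</code>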
  
**2.4.** Implement matrix subtraction. If the dimensions of the two matrices do not match, the result should be a ''Nothing'' matrix.
<code scala>
def -(other: Matrix): Matrix = ???
</code>
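Along the same lines, a sketch for element-wise subtraction:

<code scala>
// Sketch: zip equally sized matrices element by element, else Matrix().
def -(other: Matrix): Matrix = (m, other.m) match {
  case (Some(a), Some(b)) if a.length == b.length && a.head.length == b.head.length =>
    Matrix(a.zip(b).map { case (la, lb) => la.zip(lb).map { case (x, y) => x - y } })
  case _ => Matrix()
}
</code>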
  
**2.5.** Implement the operation ''normalize'' which takes a matrix $math[m] and returns the matrix $math[1/n \cdot m], where $math[n] is the number of entries in the dataset. Normalization will be used in regression:
<code scala>
def normalize: Matrix = ???
</code>
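A sketch, reading $math[n] as the matrix's own number of lines (an assumption; the statement says "entries in the dataset"):

<code scala>
// Sketch: divide every element by the number of lines; adjust if n must come from the dataset.
def normalize: Matrix = m match {
  case Some(lines) => Matrix(lines.map(_.map(_ / lines.length)))
  case None        => Matrix()
}
</code>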
  
**2.6.** Implement the operation ''map'' which applies a ''Double => Double'' transformation on each element of the matrix:
<code scala>
def map(f: Double => Double): Matrix = ???
</code>
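A one-line sketch using ''Option.fold'':

<code scala>
// Sketch: apply f to every element, propagating the Nothing matrix.
def map(f: Double => Double): Matrix =
  m.fold(Matrix())(lines => Matrix(lines.map(_.map(f))))
</code>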
  
**2.7.** Implement the operation which adds another column at the end of a matrix, whose values are all equal to the constant ''x'':
<code scala>
def ++(x: Double): Matrix = ???
</code>
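A sketch in the same style (this is what will later append the column of 1s):

<code scala>
// Sketch: append the constant x to every line of the matrix.
def ++(x: Double): Matrix =
  m.fold(Matrix())(lines => Matrix(lines.map(_ :+ x)))
</code>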
  
**2.8.** Implement the method ''dimensions'' that returns a string formatted as a pair ''"(n,m)"'', where ''n'' is the number of lines and ''m'' - the number of columns in the matrix. This function is useful for troubleshooting when performing matrix operations.
<code scala>
def dimensions: String = ???
</code>
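A possible sketch (what to report for the ''Nothing'' matrix is unspecified; ''"(0,0)"'' below is an assumption):

<code scala>
// Sketch: string interpolation over the matrix's dimensions.
def dimensions: String = m match {
  case Some(lines) => s"(${lines.length},${lines.headOption.map(_.length).getOrElse(0)})"
  case None        => "(0,0)"
}
</code>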
  
**2.9.** Implement the method ''meanSquaredError'' which, given an ''n x 1'' matrix $math[\begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}^T] (n lines, one column), computes the root mean squared error: $math[ \sqrt{1/n \cdot (x_1^2 + x_2^2 + \ldots + x_n^2)}]. If the matrix has other dimensions or is ''Nothing'', the result should also be ''Nothing''.

<code scala>
def meanSquaredError: Option[Double] = ???
</code>
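A sketch of the computation for a valid ''n x 1'' matrix:

<code scala>
// Sketch: sqrt of the averaged squared entries; any other shape (or Nothing) gives None.
def meanSquaredError: Option[Double] = m match {
  case Some(lines) if lines.nonEmpty && lines.forall(_.length == 1) =>
    val xs = lines.map(_.head)
    Some(math.sqrt(xs.map(x => x * x).sum / xs.length))
  case _ => None
}
</code>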
==== 3. Linear regression ====

{{ :fp2024:631731_p7z2bkhd0r-9uyn9thdasa.png?300 |}}

There are several ways to implement linear regression. We will opt for the **gradient descent** approach. Over a given number of **iterations**, we take an estimate vector of parameters $math[a,b] (or $math[a,b,c]) and produce better ones, i.e. ones closer to the **ideal** ones, which minimise the root mean square error.
**3.1.** Implement the function ''rootMeanSquareError'' in the executable object ''Regression'', which takes a matrix of features $math[X] and one of real prices $math[Y], together with an n-line, one-column vector of **parameters** (e.g. for n=2, $math[\begin{pmatrix} a & b \end{pmatrix}^T]), and computes the root mean square error of the vector of errors (the values $math[y_i - (a*x_i + b)] for two parameters). Use the function ''meanSquaredError'' from the class ''Matrix''.

<code scala>
def rootMeanSquareError(X: Matrix, Y: Matrix, parameters: Matrix): Option[Double] = ???
</code>
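Given the matrix operations from task 2, a sketch can be a one-liner:

<code scala>
// Sketch: the error vector is X * parameters - Y (n lines, one column).
def rootMeanSquareError(X: Matrix, Y: Matrix, parameters: Matrix): Option[Double] =
  (X * parameters - Y).meanSquaredError
</code>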
**3.2.** Implement the function ''gradientDescentStep'' which, given $math[X] and $math[Y], takes a vector ''parameters'' of parameters and produces an improved version of them. These are the steps you should follow in your implementation:
  * Compute the current estimation of prices given the current parameters ''parameters'' and the feature matrix $math[X]. Let $math[h] be this matrix (what dimensionality should it have?).
  * Compute the //**loss**//, i.e. the error of the current estimation, as the difference with the real prices from $math[Y]. (Again, take care of matrix dimensionality.)
  * Compute the //**gradient**//. For a single entry $math[x_i] whose loss is $math[e_i], the gradient is $math[(x_i * e_i)/n], where $math[n] is the size of the dataset. To understand this formula, please read this [[https://mubaris.com/posts/linear-regression/|post]]. Generalise this formula to matrices and take care with matrix dimensionality. For the division by $math[n] use the method ''normalize''.
  * Compute the improved version of the parameters by //**combining**// (adding or subtracting) the gradient. For a single parameter $math[p], the update is $math[p_{new} = p_{old} - (x_i * e_i) \cdot \alpha/n], where $math[\alpha] is a very small constant which prevents the gradient from shifting the current optimum too aggressively. This value is a constant in the code stub. Generalise this formula to matrices.
Ultimately, this function should return the updated parameters.
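Putting the four steps together in matrix form (a hint, with $math[\theta] standing for the parameter vector):

$$ \theta_{new} = \theta_{old} - \frac{\alpha}{n} \cdot X^T \cdot (X \cdot \theta_{old} - Y) $$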
<code scala>
def gradientDescentStep(X: Matrix, Y: Matrix, parameters: Matrix): Matrix = ???
</code>
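A sketch of those steps, using only the ''Matrix'' operations above (''alpha'' is the constant from the code stub):

<code scala>
// Sketch: follows the bullet list above step by step.
def gradientDescentStep(X: Matrix, Y: Matrix, parameters: Matrix): Matrix = {
  val h = X * parameters                         // current estimations (n lines, one column)
  val loss = h - Y                               // errors against the real prices
  val gradient = (X.transpose * loss).normalize  // one line per parameter, divided by n
  parameters - gradient.map(_ * alpha)           // shift the parameters against the gradient
}
</code>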
**3.3.** Implement the function ''linearRegression'' which performs linear regression over the dataset ''dataset'' (defined as a constant in the ''Regression'' object), with initial parameters ''parameters'' (also defined as constants) and with the features to be used given as a list. Proceed as follows:
  * Create a dataset from the dataset file and select the desired ''features'' columns.
  * Split the dataset into training and evaluation parts (80% training, 20% evaluation). However, in this homework we will not evaluate your prediction.
  * Create a matrix $math[X] of features from the training dataset.
  * Add a feature of 1s at the end of the matrix.
  * Create a matrix $math[Y] of prices (predictions) from the training dataset.
  * Repeat ''gradientDescentStep'' for ''steps'' times over ''parameters'' to get your model parameters (see the sketch after this list).
  * Return a pair consisting of your model (the matrix ''parameters'') and the ''rootMeanSquareError'' over the training dataset.
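The iteration alone might look like this (''steps'' being the constant from the stub):

<code scala>
// Sketch: apply gradientDescentStep `steps` times, starting from the initial parameters.
val trained = (1 to steps).foldLeft(parameters) { (params, _) =>
  gradientDescentStep(X, Y, params)
}
</code>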
Beyond our tests, you can run the ''linearRegression'' function by yourself to see your prediction.
**3.4.** Use [[https://chart-studio.plotly.com/create/#/|this online plotly app]] to test your results. Use the app (or your Scala code) to extract the features from the dataset, and plot two predictions of your own (e.g. for $math[x_1 = 1000] and $math[x_2 = 2000]) to see how your model behaves. Add a screenshot similar to the one below to your project to validate your work:

{{ :fp2024:sampleregression.png?800 | Sample regression}}
===== Submission rules =====

  * Please follow the [[fp2024:submission-guidelines|Submission guidelines]], which are the same for all homeworks.
  * To solve your homework, download the {{:fp2024:linearregression.zip|Project template}}, import it in IntelliJ, and you are all set. Do not rename the project manually, as this may cause problems with IntelliJ.