====== Homework 4 - Linear regression ======

===== Introduction =====

In the figure below, we have plotted a dataset containing, on the X axis, the overall surface of houses (in square feet) for sale, as well as, on the Y axis, their sale price. These house prices are part of a real dataset from a certain region of the US.

{{:fp2024:regression_plot.png?600|}}

===== Objective: house price prediction =====

Given this dataset, your task is to **predict** the prices of **other** houses, outside of the dataset, based on their surface. In other words, for a given house surface $math[x] you need to **predict** its price $math[y(x)]. More generally, we need to find a suitable function $math[y], which is usually called a **hypothesis**.

By examining the distribution of points in the figure, one can see that there is a **linear** dependency between house area and house price. This dependency is not perfect, but it is strong. Therefore, the hypothesis we will use in this homework is $math[y = a*x + b]. We now need to find values for $math[a] and $math[b] such that the line $math[y] **fits** our dataset best. One example of such a line $math[y] is given by two of its points, shown in green in the figure. By simply examining the figure, we can see that this line is a good fit for the dataset.

In order to find $math[a] and $math[b] we need a formal condition expressing what //**best fit**// means. Let $math[(x_i, y_i)] be a point from the dataset. Then, the value $math[\mid y_i - (a*x_i + b) \mid] is the error between our price estimation $math[a*x_i + b] and the real price $math[y_i]. We will choose $math[a] and $math[b] in such a way as to **minimise** the **sum** of all such errors, $math[\sum_{i} \mid y_i - (a*x_i + b) \mid], over the **entire** dataset. In order to solve this homework, it is not necessary to understand how this minimisation is performed, but you can read more [[https://mubaris.com/posts/linear-regression/|details]] about linear regression to get a better perspective on this homework.

===== Multi-dimensional regression =====

In the previous figure, you might have already noticed that a linear hypothesis starts to work poorly for surfaces over 2000 square feet. As it happens, there are other features of such properties that influence their price. One such feature is the Garage Area, which is also expressed in square feet. Hence, we can improve our hypothesis to: $math[y = a*x_1 + b*x_2 + c], where $math[x_1] represents the surface area of the house and $math[x_2] represents the garage area. Now, our hypothesis has three **parameters** ($math[a,b,c]) which must be computed in such a way as to achieve a **best fit**. When using more than one feature (more than one $math[x]), it is much more convenient to use a matrix representation.
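As a quick illustration of what the hypothesis computes, here is a minimal Scala sketch that evaluates the two-feature hypothesis point by point. The coefficient values below are made up purely for the example; in the homework they are the unknowns that regression computes from the dataset. Evaluating predictions one house at a time becomes awkward as features are added, which is the motivation for the matrix form introduced next.

<code scala>
// Hypothetical coefficient values, chosen only for illustration;
// the homework computes the real ones by linear regression.
val (a, b, c) = (100.0, 50.0, 20000.0)

// The two-feature hypothesis: x1 = house surface, x2 = garage area (square feet).
def predict(x1: Double, x2: Double): Double = a * x1 + b * x2 + c

// Predicting prices for two (made-up) houses, one at a time.
val houses = List((1500.0, 400.0), (2100.0, 600.0))
val predictions = houses.map { case (x1, x2) => predict(x1, x2) }
// List(190000.0, 260000.0)
</code>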
Suppose we also add a third feature $math[x_3] to the dataset, whose value is always equal to 1:

$$ y = a * x_1 + b * x_2 + c * x_3 $$

We can now express our hypothesis as:

$$ y = \begin{pmatrix} a & b & c \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$$

or equivalently as:

$$ y = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix} \cdot \begin{pmatrix} a \\ b \\ c \end{pmatrix}$$

If $math[X] is a matrix with three columns (the two features of the model plus the constant feature) and $math[n] lines, one for each entry in the dataset, then our price estimations for each **multi-dimensional point** in the dataset are given by the vector ($math[n] lines, one column):

$$ Y = X \cdot \begin{pmatrix} a \\ b \\ c \end{pmatrix} $$

===== Implementation =====

==== 1. Reading and extracting data from a dataset ====

The file ''houseds.csv'' (from the ''datasets'' folder) contains a large dataset with over 60 features (columns) for properties, as well as their sale price. In order to implement regression we may use some or all of these features. In this homework we will use only two of them.

The class ''Dataset'' from the ''Dataset.scala'' file already contains a companion object with an ''apply'' method which extracts a table (as a ''List[List[String]]'' object) from a ''csv'' file. Your first task is to extract a **subset** of features (columns) from a dataset, by providing a list of columns to be extracted. For this task:

**1.1.** Implement the function ''zipWith'' from ''Helpers.scala''

<code scala>
def zipWith[A, B, C](op: (A, B) => C)(l1: List[A], l2: List[B]): List[C] = ???
</code>

which takes an operation and two lists, and uses the operation to ''zip'' the lists together. For instance, ''zipWith(_ + _)(List(1,2,3), List(4,5,6))'' will produce ''List(5,7,9)''.

**1.2.** Use ''zipWith'' to implement the member method ''selectColumns'', which creates a new dataset by extracting the columns specified as a list of strings in the function parameter.

<code scala>
def selectColumns(cols: List[String]): Dataset = ???
</code>

Be careful to preserve the first line, which contains the column names of the dataset.

**1.3.** Use ''selectColumns'' to implement the helper function ''selectColumn''.

**1.4.** Implement the function ''split'', which splits the dataset into a training part and an evaluation part.

<code scala>
def split(percentage: Double): (Dataset, Dataset) = ???
</code>

Generally, when a dataset is used for linear regression, we need to put aside part of it (usually 20%) for evaluation. It is essential that this part is not used in the training process, in order to faithfully evaluate how the hypothesis behaves on unseen data. At the same time, it is important that this 20% is **representative** of the entire dataset (hence it cannot simply be the first or last 20% of the dataset). For instance, if $math[(x_1, y_1), (x_2, y_2), \ldots, (x_{20}, y_{20})] is the set of surface-to-price points, sorted by surface, and we decide to keep 20% (or 0.2) for evaluation, then the points put aside might be: $math[(x_1, y_1), (x_5, y_5), (x_{10}, y_{10}), (x_{15}, y_{15})].

In the ''split'' function, ''percentage'' is expressed as a value between ''0'' and ''1'', and represents the amount of evaluation data to be put aside from the entire dataset. In the returned pair, the first component is the training part of the dataset, and the second is the evaluation part. Be careful to preserve the first line, which contains the column names of the dataset.
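To make the sampling idea behind ''split'' concrete, here is a hedged sketch that works on plain lists rather than on the provided ''Dataset'' class. The name ''splitRows'' and the every-k-th-row strategy are illustrative assumptions, not the required solution, and the sketch ignores the header line, which your ''split'' must keep in both resulting datasets.

<code scala>
// A sketch of representative sampling on plain rows (no header handling here).
// Roughly every k-th row goes to evaluation, where k is about 1 / percentage.
def splitRows[A](percentage: Double)(rows: List[A]): (List[A], List[A]) = {
  val k = (1.0 / percentage).round.toInt.max(1)
  val indexed = rows.zipWithIndex
  val eval  = indexed.collect { case (row, i) if i % k == 0 => row }
  val train = indexed.collect { case (row, i) if i % k != 0 => row }
  (train, eval)
}

splitRows(0.2)(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
// (List(2, 3, 4, 5, 7, 8, 9, 10), List(1, 6))
</code>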
==== 2. Matrix operations ====

Matrices are represented as boxes storing an inner object ''m: Option[List[List[Double]]]''. The object can be a value ''Some(mp)'', where ''mp'' is a matrix of doubles, or it can be ''None''. The latter value will be produced whenever the current matrix operation fails (for instance, in addition, if the dimensions of the matrices do not match).

**2.1.** Implement the method ''transpose'', which returns the transpose of the matrix. If the current matrix is ''None'', the returned value should also be ''None''.

<code scala>
def transpose: Matrix = ???
</code>

**2.2.** Implement matrix multiplication. If the dimensions of the two matrices do not match for multiplication, the resulting matrix should be ''None''.

<code scala>
def *(other: Matrix): Matrix = ???
</code>

**2.3.** Implement matrix subtraction. If the dimensions of the two matrices do not match, the result should be a ''None'' matrix.

<code scala>
def -(other: Matrix): Matrix = ???
</code>

**2.4.** Implement the operation ''normalize'', which takes a matrix $math[m] and returns the matrix $math[1/n \cdot m], where $math[n] is the number of entries in the dataset. Normalization will be used in regression:

<code scala>
def normalize: Matrix = ???
</code>

**2.5.** Implement the operation ''map'', which applies a ''Double => Double'' transformation on each element of the matrix:

<code scala>
def map(f: Double => Double): Matrix = ???
</code>

**2.6.** Implement the operation ''++'', which adds another column at the end of the matrix, whose values are all equal to the constant ''x'':

<code scala>
def ++(x: Double): Matrix = ???
</code>

**2.7.** Implement the method ''dimensions'', which returns a string formatted as a pair ''"(n,m)"'', where ''n'' is the number of lines and ''m'' is the number of columns in the matrix. This function is useful for troubleshooting when performing matrix operations.

<code scala>
def dimensions: String = ???
</code>
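To make the semantics of these operations concrete, here is a hedged sketch of transposition, multiplication and constant-column appending written directly on ''List[List[Double]]'' values. The provided ''Matrix'' class wraps such a value in an ''Option'', so a real implementation must additionally propagate ''None'' and produce ''None'' on dimension mismatches; the helper names below are illustrative only.

<code scala>
type Table = List[List[Double]]

// Transposition: the j-th column becomes the j-th row
// (the standard library already provides this for List[List[A]]).
def transposeT(m: Table): Table = m.transpose

// Multiplication: entry (i, j) is the dot product of row i of a and column j of b.
// A real implementation should first check that a's column count equals b's
// row count, and produce a None matrix otherwise.
def multiplyT(a: Table, b: Table): Table = {
  val bt = b.transpose
  a.map(row => bt.map(col => row.zip(col).map { case (x, y) => x * y }.sum))
}

// Appending a constant column, like ++(x): this is how the always-1 feature is added.
def appendConstT(m: Table, x: Double): Table = m.map(row => row :+ x)

// A tiny example: X holds two houses (surface, garage area) plus the constant 1,
// multiplied with a column vector of made-up parameters a, b, c.
val x      = appendConstT(List(List(1500.0, 400.0), List(2100.0, 600.0)), 1.0)
val params = List(List(100.0), List(50.0), List(20000.0))
multiplyT(x, params) // List(List(190000.0), List(260000.0))
</code>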