The purpose of an **information system** is to extract useful information from raw data. **Data science** is a field of study that aims to understand and analyze data by means of **statistics, big data and machine learning**, and to provide support for decision makers and autonomous systems. While this sounds complicated, the tools are based on mathematical models and specialized software components that are already available (e.g. Python packages). In the following labs we will learn about Machine Learning and its two main classes: **Supervised Learning** and **Unsupervised Learning**. The general idea is to write programs that use Machine Learning algorithms to learn from the available data, identify patterns and make decisions with minimal human intervention.
-<note tip>
-An introduction into Machine Learning territory can be already found in the previous lab, as linear regression is a form of Supervised Learning, extracting patterns from the data based on a linear model.
==== Machine Learning. Supervised Learning ====
p = [inv_map[e] for e in p]
print(p)
+</​code>​
+
+See the next example on how you can plot the decision tree:
+
+<code python>
+# continue from previous example
+
+# plot decision tree
+from dtreeplt import dtreeplt
+dtree = dtreeplt(
+    model=classifier,
+    feature_names=features,
+    target_names=labels
+)
+fig = dtree.view(interactive=False)
+fig.savefig("dtree.png")
+</​code>​
+
+== Random forests (overview) ==
+
+  * A "​forest"​ of decision trees
+  * Decision trees are susceptible to overfitting. ​
+  * One solution is to construct several trees and let them “vote” on the final classification.
+  * We do this by randomly re-sampling the input data for each tree (fancy term: bootstrap aggregating).
+
+In Python (scikit-learn), we can simply use the //RandomForestClassifier// instead of the //DecisionTreeClassifier//. Some parameters have to be defined, such as the number of trees (//n_estimators//) and the random state (//random_state//), which seeds the random re-sampling used when building the trees. Setting it to a fixed value such as 0 makes the results reproducible; it does not disable the randomness.
+
+<code python>
+from sklearn.ensemble import RandomForestClassifier
+model = RandomForestClassifier(n_estimators=10, random_state=0)
</code>
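
The voting idea behind random forests can be tried out with a minimal sketch that trains a single decision tree and a small forest on the same data. The synthetic dataset from //make_classification// and the 70/30 split are assumptions made for illustration only; they are not part of the lab data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A single tree, and a forest of 10 trees trained on bootstrap samples
tree_model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(
    X_train, y_train
)

# score() returns the fraction of correctly classified test samples
tree_acc = tree_model.score(X_test, y_test)
forest_acc = forest_model.score(X_test, y_test)
print("decision tree accuracy:", round(tree_acc, 3))
print("random forest accuracy:", round(forest_acc, 3))
print("trees in the forest:", len(forest_model.estimators_))
```

Which model scores higher depends on the data and on the split, so it is worth repeating the comparison with different values of //random_state// before drawing conclusions.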

<code python>
import pandas as pd
-input_file = "./past_hires.csv"
+input_file = "./data/past_hires.csv"
+
# format the data, map classes to numbers
d = {'Y': 1, 'N': 0}
Line 116: Line 142:
d = {'BS': 0, 'MS': 1, 'PhD': 2}
df['Level of Education'] = df['Level of Education'].map(d)
+
target = df['Hired']

+print(target)
</code>
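
The mapping step can also be tried without the CSV file. The sketch below builds a tiny in-memory table; the four rows are invented for illustration (an assumption, not taken from //past_hires.csv//), and only the two column names that appear in the example above are used.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
df = pd.DataFrame({
    "Level of Education": ["BS", "MS", "PhD", "BS"],
    "Hired": ["Y", "N", "Y", "N"],
})

# Map the class labels to numbers, as in the example above
df["Hired"] = df["Hired"].map({"Y": 1, "N": 0})
df["Level of Education"] = df["Level of Education"].map(
    {"BS": 0, "MS": 1, "PhD": 2}
)

# Separate the input features from the target column
features = [c for c in df.columns if c != "Hired"]
X = df[features]
target = df["Hired"]

print(X)
print(target)
```

Note that values missing from the mapping dictionary become //NaN// after //map//, so it is worth checking the column for missing values before training.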

from sklearn import tree
import numpy as np
+
+# load the data (see previous example)

# print features and data
Line 154: Line 183:
-Download the project archive and install the required packages via //requirements.txt//
+Download the {{:ewis:laboratoare:lab8:lab8.zip|Project Archive}} and install the required packages via //requirements.txt//

-**Run //gen_ucode.py// to generate your unique code (//UCODE//) that you will use in the exercises when required. Write it down and include it in the pdf report.**
+<note important>**Get your unique code (//UCODE//) via Moodle.** </note>

=== Task 1 (2p) ===
Line 165: Line 194:
  * The predictions are evaluated to find out the accuracy of the model and the decision tree is then shown as (pseudo)code (if else statements) and graph representation as //dtree1.png//.

-<note important>Use //n_train_percent// to change the amount of data used for training the model and evaluate the results. Set //n_train_percent// as the generated //code// **(//UCODE//)** and report the results:
+<note important>Use //n_train_percent// to change the amount of data used for training the model and evaluate the results. Set //n_train_percent=UCODE// and report the results:
  * prediction accuracy and generated output //dtree1.png//
-  * how large is the decision tree regarding the number of leaf nodes
+  * how large is the decision tree regarding the number of leaf nodes?
</note>
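
The size of a trained tree and its if/else form can be read directly from a scikit-learn classifier. The sketch below is only an illustration: it uses the iris dataset bundled with scikit-learn (an assumption, not the lab data) and is not the code from the project archive.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Bundled example dataset (assumption: stands in for the lab data)
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Size of the trained tree
n_leaves = clf.get_n_leaves()
depth = clf.get_depth()
print("leaf nodes:", n_leaves, "depth:", depth)

# Text form of the tree, one indented rule per node
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```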

Line 179: Line 208:
  * The results are plotted on a chart, showing the effect of the amount (percent) of training data on the prediction accuracy

-<note important>Write down your observations regarding the observed results in this case and include them into your report. How much training data (percent) is required in this case to obtain most accurate predictions?</note>
+<note important>Write down your observations regarding the results
+  * How much training data (percent) is required in this case to obtain most accurate predictions?​
+</​note>​
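
The effect of the training percentage on accuracy can be sketched with a simple loop. This is a minimal illustration under stated assumptions, not the script from the project archive: it uses the bundled iris dataset, a stratified split, and steps of 10 percent, and it prints the values instead of plotting them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled example dataset (assumption: stands in for the lab data)
X, y = load_iris(return_X_y=True)

results = {}
for n_train_percent in range(10, 100, 10):
    # Use n_train_percent of the samples for training, the rest for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=n_train_percent / 100, random_state=0, stratify=y
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # Accuracy on the samples that were not used for training
    results[n_train_percent] = model.score(X_test, y_test)

for percent, accuracy in results.items():
    print(percent, "% training data -> accuracy", round(accuracy, 3))
```

With very high training percentages only a few samples remain for testing, so the measured accuracy becomes noisy; this is worth keeping in mind when reading the chart.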

=== Task 3 (3p) ===
Line 206: Line 237:
</code>

-<note important>Use //n_train_percent// to change the amount of data used for training the model and evaluate the results. Set //n_train_percent// as the generated //code// **(//UCODE//)** and report the results:
+<note important>Use //n_train_percent// to change the amount of data used for training the model and evaluate the results. Set //n_train_percent=UCODE// and report the results:
  * prediction accuracy and generated output //dtree31.png//
-  * how large is the decision tree regarding the number of leaf nodes
+  * how large is the decision tree regarding the number of leaf nodes?
</note>

Line 218: Line 249:
  * Run //task31_sol.py// for both red (//winequality-red.csv//) and white (//winequality-white.csv//) wine datasets

-<note important>Write down your observations regarding the observed results in this case and include them into your report
+<note important>Write down your observations regarding the results:
-  * How much training data (percent) is required in this case to obtain most accurate predictions
+  * How much training data (percent) is required in this case to obtain most accurate predictions?
  * What is the average accuracy for each model (decision tree, random forest) **(+1p)**
  * Which type of wine (red/white) is easier to predict (more accurate) based on the results **(+1p)**
</note>

=== Bonus (4p + 2p) ===

Line 248: Line 279:
Use //n_train_percent// to change the amount of data used for training the model and evaluate the results. Set //n_train_percent// as the generated //code// **(//UCODE//)** and report the results:
  * prediction accuracy and generated output //dtree32.png//
-  * how large is the decision tree regarding the number of leaf nodes
+  * how large is the decision tree regarding the number of leaf nodes?

Create a new script similar to //task31_sol.py// to compare the decision trees with random forest models using variable amounts (percent) of training data:
-  * How much training data (percent) is required in this case to obtain most accurate predictions
+  * How much training data (percent) is required in this case to obtain most accurate predictions?
  * What is the average accuracy for each model (decision tree, random forest)
  * Explain the low accuracy obtained for this case study. What would be required to improve the results? **(+2p)**
</note>
==== Resources ====

+{{:​ewis:​laboratoare:​lab8:​lab8.zip|Project Archive}}

[[https://data-flair.training/blogs/machine-learning-datasets/]]
Line 263: Line 296:
[[https://archive.ics.uci.edu/ml/datasets/wine+quality]]

