
The purpose of an **information system** is to extract useful information from raw data. **Data science** is a field of study that aims to understand and analyze data by means of **statistics, big data, machine learning** and to provide support for decision makers and autonomous systems. While this sounds complicated, the tools are based on mathematical models and specialized software components that are already available (e.g. Python packages). In the following labs we will learn about... learning. Machine Learning, to be more specific, and its two main classes: **Supervised Learning** and **Unsupervised Learning**. The general idea is to write software programs that can learn from the available data, identify patterns and make decisions with minimal human intervention, based on Machine Learning algorithms.

==== Machine Learning. Supervised Learning ====

# plot decision tree
from matplotlib import pyplot as plt

tree.plot_tree(classifier)
plt.show()
</code>
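As a self-contained sketch of the whole flow, a decision tree can be trained and plotted as below. The tiny hiring-style data set, the feature names and the class names here are made up for illustration; they are not the lab's dataset.

<code python>
# Minimal sketch: train a DecisionTreeClassifier on a tiny
# hypothetical hiring dataset and plot the resulting tree.
from matplotlib import pyplot as plt
from sklearn import tree

# Illustrative data (not the lab dataset): each row is a candidate,
# columns are [years_experience, interned (0/1), top_school (0/1)].
X = [[10, 1, 0], [0, 0, 1], [7, 0, 0], [2, 1, 1], [20, 0, 0], [0, 1, 0]]
y = [1, 0, 0, 1, 1, 1]  # 1 = Hired, 0 = Not Hired

classifier = tree.DecisionTreeClassifier()
classifier = classifier.fit(X, y)

# plot the fitted tree with readable feature/class labels
tree.plot_tree(classifier,
               feature_names=["experience", "interned", "top_school"],
               class_names=["Not Hired", "Hired"],
               filled=True)
plt.savefig("dtree.png")  # or plt.show() for an interactive window
</code>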


{{ :ewis:laboratoare:lab8:dtree_edit.png?800 |}}
<note tip>The greedy ID3 algorithm walks down the tree and, at each step, picks the attribute for partitioning the data set that minimizes the entropy of the data at the next step. The Gini index (impurity, between 0 and 1) measures the probability that a randomly chosen instance would be misclassified if labeled according to the class distribution.</note>
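The two impurity measures mentioned in the note can be computed directly; a short sketch (the label values are illustrative):

<code python>
# Sketch: entropy and Gini impurity of a set of class labels.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: probability that a random sample is misclassified
    when labeled according to the class distribution."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(entropy(["Hired"] * 5 + ["Not Hired"] * 5))  # 1.0 (maximum for 2 classes)
print(gini(["Hired"] * 5 + ["Not Hired"] * 5))     # 0.5 (maximum for 2 classes)
print(entropy(["Hired"] * 10))                     # 0.0 (pure node)
</code>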

Interpreting the tree is straightforward. At each "decision" (internal node) there are two branches: left (false) and right (true), which represent the possible outcomes for the current test attribute (e.g. Interned). A leaf node is reached when all the samples are aligned to one outcome; it holds the class label (e.g. Hired/Not Hired) and is shown with a **different color** for each class (in this case there are 2 classes: Hired/Not Hired). From the example, the decision for hiring a new candidate can be described as follows:

== The algorithm. DecisionTreeClassifier ==

In Python we use the //DecisionTreeClassifier// from the //scikit-learn// package, which builds the tree for us. We train the model on the data set and can then visualize the decisions. Finally, we validate the model by comparing the target values to the predicted values, which gives the prediction accuracy.
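A minimal sketch of this train/validate workflow; a synthetic dataset from //make_classification// stands in for the lab's CSV files:

<code python>
# Sketch: train a decision tree, then validate it on held-out data.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Validate: compare target values to predicted values.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
</code>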

=== Task 1 (1p) ===

  * The predictions are evaluated to find out the accuracy of the model, and the decision tree is then shown both as (pseudo)code (if/else statements) and as a graph representation in //dtree1.png//.

Change the amount of data used for training the model and evaluate the results:
  * prediction accuracy and generated output
  * how large is the decision tree in terms of the number of leaf nodes?
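The tree size asked for above can be read directly off a fitted model via //get_n_leaves()//; a sketch on synthetic data (the lab script's training-percentage setup is not reproduced here):

<code python>
# Sketch: report the size of a fitted decision tree and print it
# as if/else-style rules.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the lab dataset.
X, y = make_classification(n_samples=100, n_features=3,
                           n_informative=3, n_redundant=0, random_state=1)
clf = DecisionTreeClassifier(random_state=1).fit(X, y)

print("leaf nodes:", clf.get_n_leaves())
print("max depth:", clf.get_depth())
print(export_text(clf))  # the tree as (pseudo)code-like rules
</code>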

  * The results are plotted on a chart, showing the effect of the amount (percent) of training data on the prediction accuracy

Evaluate the results:
  * How much training data (percent) is required in this case to obtain the most accurate predictions?
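Such a chart can be sketched by varying the training percentage in a loop; synthetic data stands in for the lab dataset here:

<code python>
# Sketch: effect of the training-data percentage on prediction accuracy.
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

percents = range(10, 91, 10)
scores = []
for n_train_percent in percents:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=n_train_percent / 100, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

plt.plot(list(percents), scores, marker="o")
plt.xlabel("training data (%)")
plt.ylabel("prediction accuracy")
plt.savefig("accuracy_vs_percent.png")
</code>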

  * //task3.py// is similar to //task1.py//, using another dataset about wine quality: //winequality_white.csv//, //winequality_red.csv// to train a decision tree that should predict the quality of the wine based on measured properties.
  * A brief description of the dataset:

</code>
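Loading the wine-quality data can be sketched as below. The '';'' separator follows the original UCI CSV files and ''quality'' is the dataset's target column; adjust both if the lab files differ.

<code python>
# Sketch: load a wine-quality CSV and split features from the target.
import pandas as pd

def load_wine(path):
    # UCI wine-quality files are semicolon-separated.
    df = pd.read_csv(path, sep=";")
    X = df.drop(columns=["quality"])  # measured properties
    y = df["quality"]                 # target: wine quality score
    return X, y

# Example (file name as used in the lab):
# X, y = load_wine("winequality_white.csv")
</code>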

Use //n_train_percent// to change the amount of data used for training the model and evaluate the results:
  * prediction accuracy and generated output
  * how large is the decision tree in terms of the number of leaf nodes?

=== Task 4 (4p) ===

  * //task4.py// is similar to //task2.py// and should evaluate the accuracy on the wine quality dataset using both decision tree and random forest models. The accuracy of the two models is compared on the plot for different amounts of training data, specified by //n_train_percent//.
  * Run //task4.py// for both red (//winequality_red.csv//) and white (//winequality_white.csv//) wine datasets

Evaluate the results:
  * How much training data (percent) is required in this case to obtain the most accurate predictions?
  * What is the average accuracy for each model (decision tree, random forest)?
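The decision tree vs. random forest comparison can be sketched on a single split as below (synthetic data stands in for the wine CSVs; the lab script additionally varies //n_train_percent//):

<code python>
# Sketch: compare a decision tree with a random forest on one split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, random_state=0)

dt = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

acc_dt = dt.score(X_te, y_te)
acc_rf = rf.score(X_te, y_te)
print(f"decision tree: {acc_dt:.3f}, random forest: {acc_rf:.3f}")
</code>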

/*
Use //n_train_percent// to change the amount of data used for training the model and evaluate the results. Set //n_train_percent// as the generated //code// **(//UCODE//)** and report the results:
  * prediction accuracy and generated output //dtree32.png//
  * how large is the decision tree in terms of the number of leaf nodes?

Create a new script similar to //task31_sol.py// to compare the decision trees with random forest models using variable amounts (percent) of training data:
  * How much training data (percent) is required in this case to obtain the most accurate predictions?
* What is the average accuracy for each model (decision tree, random forest)   * What is the average accuracy for each model (decision tree, random forest)
* Explain the low accuracy obtained for this case study. What would be required to improve the results? **(+2p)**   * Explain the low accuracy obtained for this case study. What would be required to improve the results? **(+2p)**
[[https://archive.ics.uci.edu/ml/datasets/wine+quality]]

/*[[http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights]]*/