===== Lab 8. Supervised Learning. Decision Trees =====

The purpose of an **information system** is to extract useful information from raw data. **Data science** is a field of study that aims to understand and analyze data by means of **statistics, big data, machine learning** and to provide support for decision makers and autonomous systems. While this sounds complicated, the tools are based on mathematical models and specialized software components that are already available (e.g. Python packages). In the following labs we will learn about... learning. Machine Learning, to be more specific, and its two main classes: **Supervised Learning** and **Unsupervised Learning**. The general idea is to write software programs that can learn from the available data, identify patterns and make decisions with minimal human intervention, based on Machine Learning algorithms.

<note tip>
An introduction to Machine Learning territory can already be found in the previous lab, as linear regression is a form of Supervised Learning, extracting patterns from the data based on a linear model.
</note>
  
==== Machine Learning. Supervised Learning ====

In **Supervised Learning**, a model is trained on labelled data (inputs with known outcomes) and is then used to predict the outcome for new data. **Classification** algorithms predict a discrete class label.

For example, when provided with a dataset about houses, a classification algorithm can try to predict whether each house will "sell for more or less than the recommended retail price". Examples of common classification algorithms include logistic regression, Naïve Bayes, **decision trees**, and K Nearest Neighbors.
  
== Decision Trees (overview) ==

A decision tree is a classification and prediction tool having a tree-like structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label:

  * Input: historical data with known outcomes
<code python>
# end of the first example: map the numeric predictions back to the original labels
p = [inv_map[e] for e in p]
print(p)
</code>

See the following example for how to plot the decision tree:

<code python>
# continue from the previous example

# plot the trained decision tree
from matplotlib import pyplot as plt

tree.plot_tree(classifier)
plt.show()
</code>
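The figure can also be saved to an image file, and the learned rules can be printed as text (similar to the (pseudo)code output mentioned in the exercises). A minimal sketch, assuming the //classifier// variable from the previous example; the file name //dtree.png// is only an example:

<code python>
# continue from the previous example
from sklearn.tree import export_text

# plot the tree again and save the figure to a file (example file name)
tree.plot_tree(classifier)
plt.savefig("dtree.png")

# print the decision rules as text (if/else style)
print(export_text(classifier))
</code>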

== Random forests (overview) ==

  * A "forest" of decision trees
  * Decision trees are susceptible to overfitting.
  * One solution is to construct several trees and let them "vote" on the final classification.
  * We do this by randomly re-sampling the input data for each tree (fancy term: bootstrap aggregating).

In Python (scikit-learn), we can just use the //RandomForestClassifier// instead of the //DecisionTreeClassifier//. There are some parameters that have to be defined, such as the number of trees (//n_estimators//) and the random state (//random_state//, which controls the randomness of the re-sampling; setting it to a fixed value such as 0 makes the results reproducible).

<code python>
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10, random_state=0)
</code>
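As a usage example, the random forest is trained and used in the same way as a decision tree. A minimal sketch, assuming a feature matrix //X// and target //y// prepared as in the decision tree examples:

<code python>
# minimal sketch: train the random forest and make predictions
# (assumes X and y have been prepared as in the decision tree examples)
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)

predictions = model.predict(X)
print(predictions)
</code>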
  

You want to build a system to filter out resumes based on historical hiring data.
You have a database of some important attributes of job candidates: Years Experience, Employed, Previous employers, Level of Education, Top-tier school, Interned, Hired.
You can train a decision tree on this data and arrive at a system for predicting whether a candidate will get hired based on it!
  
  * Making predictions on the test dataset/new data
  
A decision tree is a classification and prediction tool having a tree-like structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

The mapping function between input/output data can be represented as a flowchart. In this case study, a decision tree is trained on the HR dataset with the following visual representation:

{{ :ewis:laboratoare:lab8:dtree_edit.png?800 |}}

<note tip>The greedy algorithm ID3 walks down the tree and (at each step) picks the attribute to partition the data set that minimizes the entropy of the data at the next step. The Gini Index (or Impurity, between 0 and 1) measures the probability of a random instance being misclassified.</note>
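To make these measures concrete, the following minimal sketch (not part of the lab code) computes the entropy and the Gini impurity for a list of class labels:

<code python>
# minimal sketch: entropy and Gini impurity of a set of class labels
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# example: 5 hired (1), 5 not hired (0) -> maximum impurity for 2 classes
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(entropy(labels))  # 1.0
print(gini(labels))     # 0.5
</code>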

Interpreting the data is straightforward. At each "decision" (internal node) there are two branches: left (false) and right (true), which represent the possible outcomes for the current test attribute (e.g. Interned). The leaf nodes are reached when all the samples are aligned to either outcome; they hold the class labels (e.g. Hired/Not Hired) and are shown with a **different color** for each class (in this case there are 2 classes: Hired/Not Hired). From the example, a new candidate is predicted to be hired in the following cases:

  * If already employed
  * If not already employed, but has interned at the company
  * If not already employed, has not interned at the company, but has had previous employers
  * If not already employed, has not interned at the company, has not had previous employers, but has been to a top-tier school

<note important>The flowchart can be used to represent the decision making that results from training Decision Trees on the existing data. Unlike traditional programming using if/else statements, the process of creating decision models is based on labelled data (observations) and Supervised Learning algorithms. On a dataset with many attributes and relationships between data (not known in advance), Decision Trees sometimes reveal unexpected patterns.</note>
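For illustration, the hiring decision described above can be written as equivalent hand-coded if/else logic. This is a simplified sketch with illustrative parameter names; in practice the tree and its decision thresholds are learned from the data, not written by hand:

<code python>
# simplified sketch: hand-written equivalent of the learned hiring decision
# (for illustration only; the trained tree is derived from the data)
def predict_hired(employed, interned, previous_employers, top_tier_school):
    if employed:
        return True
    if interned:
        return True
    if previous_employers > 0:
        return True
    if top_tier_school:
        return True
    return False

# example: a candidate who is not employed but has interned at the company
print(predict_hired(employed=False, interned=True, previous_employers=0, top_tier_school=False))  # True
</code>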
  
== Data format and preprocessing ==
  
<code python>
import pandas as pd

input_file = "./data/past_hires.csv"
df = pd.read_csv(input_file, header=0)

# format the data, map classes to numbers
d = {'Y': 1, 'N': 0}
df['Employed'] = df['Employed'].map(d)
df['Top-tier school'] = df['Top-tier school'].map(d)
df['Interned'] = df['Interned'].map(d)
df['Hired'] = df['Hired'].map(d)

d = {'BS': 0, 'MS': 1, 'PhD': 2}
df['Level of Education'] = df['Level of Education'].map(d)

target = df['Hired']

print(target)
</code>
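The numeric class values can be mapped back to the original labels when needed, which is what the //inv_map// dictionary in the first example does. A minimal sketch, continuing from the preprocessing code above:

<code python>
# continue from the previous example

# build an inverse map to translate numeric classes back to the original labels
d = {'Y': 1, 'N': 0}
inv_map = {v: k for k, v in d.items()}

print(df.head())                      # check the preprocessed data
print([inv_map[e] for e in target])   # numeric target values shown as Y/N labels
</code>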

== The algorithm. DecisionTreeClassifier ==

In Python we use the //DecisionTreeClassifier// from the //scikit-learn// package, which creates the tree for us. We train the model using the data set and then we can visualize the decisions. We can then validate the model by comparing the target values to the predicted values, showing the prediction accuracy.

Here we continue the example with the HR dataset, training a Decision Tree model to predict future hires:

<code python>
from sklearn import tree
import numpy as np

# load and preprocess the data (see the previous example)

# select the features and the target
# (assumption: the first 6 columns are the input attributes, 'Hired' is the target)
features = list(df.columns[:6])

# print features and data
print(features)
print(df.head())

X = df[features]
y = target

# now actually build the decision tree using the training data set
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
</code>
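The model can then be validated by comparing the predicted values with the known target values. A minimal sketch of this kind of accuracy check (the lab scripts perform a similar evaluation):

<code python>
# continue from the previous example

# make predictions on the same data used for training
y_predict = clf.predict(X)

# compare predictions to the known target values and compute the accuracy
y_a = np.array(y)
y_predict_a = np.array(y_predict)
accuracy = np.sum(y_predict_a == y_a) / len(y_a) * 100
print("accuracy: " + str(round(accuracy, 2)) + " %")

# note: the accuracy is measured on the training data itself,
# so it is expected to be very high (often 100%)
</code>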
  
==== Exercises ====
  
=== Setup ===

Download the {{:ewis:laboratoare:lab8:lab8.zip|Project Archive}} and install the required packages via //requirements.txt//.
  
=== Task 1 (1p) ===

Run //task1.py//:

  * The script loads the HR dataset described in the example, using the //load_dataset// function from //loader.py//, performs some preprocessing using the //format_data// function (mapping text data to numeric values for classification) and creates a decision tree classifier to predict future outcomes (Hired/Not Hired) using the //create_decision_tree// function from //classifiers.py//.
  * The predictions are evaluated to find out the accuracy of the model, and the decision tree is then shown both as (pseudo)code (if/else statements) and as a graph representation saved as //dtree1.png//.
  
Change the amount of data used for training the model and evaluate the results:

  * prediction accuracy and generated output
  * how large is the decision tree in terms of the number of leaf nodes?

=== Task 2 (2p) ===

Run //task2.py//:

  * The script loads the HR dataset described in the example (same as Task 1)
  * The script creates and trains decision trees using variable amounts of training data (specified by the range //n_train_percent_vect//). The accuracy for each case is saved into a list.
  * The results are plotted on a chart, showing the effect of the amount (percent) of training data on the prediction accuracy (a simplified sketch of such an evaluation is shown below).

Evaluate the results:
  * How much training data (percent) is required in this case to obtain the most accurate predictions?
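The following minimal sketch outlines this kind of evaluation; it is only an illustration, since //task2.py// relies on the helper functions from the lab archive (//loader.py//, //classifiers.py//) rather than the code shown here:

<code python>
# simplified sketch (not the actual task2.py): accuracy vs. amount of training data
from sklearn import tree
from matplotlib import pyplot as plt

# assumes X (features) and y (target) are already prepared as in the HR example
n_train_percent_vect = range(10, 100, 10)
accuracy_vect = []

for n_train_percent in n_train_percent_vect:
    n_train = int(len(X) * n_train_percent / 100)
    clf = tree.DecisionTreeClassifier()
    clf.fit(X[:n_train], y[:n_train])
    # evaluate on the remaining (test) data
    accuracy = clf.score(X[n_train:], y[n_train:]) * 100
    accuracy_vect.append(accuracy)

plt.plot(list(n_train_percent_vect), accuracy_vect)
plt.xlabel("training data (%)")
plt.ylabel("accuracy (%)")
plt.show()
</code>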

=== Task 3 (3p) ===

Run //task3.py//:

  * //task3.py// is similar to //task1.py//, using another dataset about wine quality (//winequality_white.csv//, //winequality_red.csv//) to train a decision tree that should predict the quality of the wine based on measured properties.
  * A brief description of the dataset:

<code>
Input variables (based on measurements):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):
12 - quality (score between 0 and 10)
</code>

Use //n_train_percent// to change the amount of data used for training the model and evaluate the results:
  * prediction accuracy and generated output
  * how large is the decision tree in terms of the number of leaf nodes?

=== Task 4 (4p) ===

Create //task4.py//:

  * //task4.py// is similar to the previous tasks and should evaluate the accuracy on the wine quality dataset using both decision tree and random forest models. The accuracy of the two models is compared on the plot for different amounts of training data, specified by //n_train_percent//.
  * Run //task4.py// for both the red (//winequality_red.csv//) and white (//winequality_white.csv//) wine datasets.

Evaluate the results:
  * How much training data (percent) is required in this case to obtain the most accurate predictions?
  * What is the average accuracy for each model (decision tree, random forest)?

/*
=== Bonus (4p + 2p) ===

Run //task32.py//:

  * //task32.py// is similar to //task1.py//, using another dataset about Major League Baseball players (//mlb.csv//) to train a decision tree that should predict the recommended position of a new player based on previous data.
  * The data is preprocessed to exclude unwanted columns (Name, Team), and to map the positions ['First_Baseman', 'Designated_Hitter', 'Relief_Pitcher', 'Catcher', 'Shortstop', 'Second_Baseman', 'Third_Baseman', 'Starting_Pitcher', 'Outfielder'] to numeric classes [0, 1, 2, 3, 4, 5, 6, 7, 8].
  * A brief description of the dataset:

<code>
Input variables:
1 - Name: MLB Player Name
2 - Team: The Baseball team the player was a member of at the time the data was acquired
3 - Height(inches): Player height in inches
4 - Weight(pounds): Player weight in pounds
5 - Age: Player age at time of record
Output variable:
6 - Position: Player field position
</code>

<note important>
Use //n_train_percent// to change the amount of data used for training the model and evaluate the results. Set //n_train_percent// as the generated //code// **(//UCODE//)** and report the results:
  * prediction accuracy and generated output //dtree32.png//
  * how large is the decision tree in terms of the number of leaf nodes?

Create a new script similar to //task31_sol.py// to compare the decision trees with random forest models using variable amounts (percent) of training data:
  * How much training data (percent) is required in this case to obtain the most accurate predictions?
  * What is the average accuracy for each model (decision tree, random forest)?
  * Explain the low accuracy obtained for this case study. What would be required to improve the results? **(+2p)**
</note>
*/
  
==== Resources ====

{{:ewis:laboratoare:lab8:lab8.zip|Project Archive}}

[[https://data-flair.training/blogs/machine-learning-datasets/]]

[[https://archive.ics.uci.edu/ml/datasets/wine+quality]]

/*[[http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights]]*/
  