This shows you the differences between two versions of the page.
| ewis:laboratoare:09 [2021/05/14 10:30] alexandru.predescu [Exercises] | ewis:laboratoare:09 [2023/05/10 18:02] (current) alexandru.predescu [K-Means Clustering] | ||
|---|---|---|---|
| Line 50: | Line 50: | ||
| In general, K-means is a heuristic algorithm that partitions a data set into K clusters by minimizing the sum of squared distances within each cluster (WCSS). K-means is often referred to as Lloyd’s algorithm. | In general, K-means is a heuristic algorithm that partitions a data set into K clusters by minimizing the sum of squared distances within each cluster (WCSS). K-means is often referred to as Lloyd’s algorithm. | ||
| + | |||
| + | The goal is to group the data into classes (clusters) such that: | ||
| + | * intra-class similarity is high | ||
| + | * inter-class similarity is low | ||
| === The algorithm === | === The algorithm === | ||
| Line 55: | Line 59: | ||
| K-Means is a simple unsupervised learning algorithm using a fixed number of clusters (k): | K-Means is a simple unsupervised learning algorithm using a fixed number of clusters (k): | ||
| - | * Random initialization of K centroids (k-means): For each cluster, a centroid (cluster center) is defined (random choice). | + | * Random initialization of K centroids (k-means): For each cluster, a centroid (cluster center) is defined by random choice. | 
| - | * Loop: For each step, data points are assigned to a cluster based on the distance from the centroid. | + | * Loop: For each iteration, data points are assigned to a cluster based on the distance from the centroid. | 
| - | * The centroids are (re)calculated based on the average position of points within each cluster: | + | * The centroids are recalculated based on the average position of points within each cluster | 
| - | * The loop is repeated until the centroids do not change anymore (stop condition) | + | * The loop is repeated until the centroids do not change anymore (stop condition) | 
| There are some things to consider with k-Means Clustering: | There are some things to consider with k-Means Clustering: | ||
| Line 67: | Line 71: | ||
| <note tip>Clustering can also be used to predict new data based on the identified patterns. If you want to predict the cluster for new points, just find the centroid they're closest to</note> | <note tip>Clustering can also be used to predict new data based on the identified patterns. If you want to predict the cluster for new points, just find the centroid they're closest to</note> | ||
| + | |||
| + | The following code generates a random array of points and performs K-Means Clustering. | ||
| + | |||
| + | <code python> | ||
| + | import numpy as np | ||
| + | import matplotlib.pyplot as plt | ||
| + | from sklearn.cluster import KMeans | ||
| + | from sklearn import metrics | ||
| + | |||
| + | X = 10 * np.random.randn(100, 2) + 6 | ||
| + | kmeans_model = KMeans(n_clusters=3) | ||
| + | kmeans_model.fit(X) | ||
| + | |||
| + | plt.scatter(X[:, 0], X[:, 1], c=kmeans_model.labels_, | ||
| + | cmap='rainbow', label="points") | ||
| + | |||
| + | plt.show() | ||
| + | </code> | ||
| + | |||
| + | {{ :ewis:laboratoare:lab9:random_points_clustering.png?400 |}} | ||
| === Choosing the optimal number of clusters === | === Choosing the optimal number of clusters === | ||
| - | Typically, we want to be able to understand the data, so we are looking for the lowest number of clusters. We also want enough detail in the clustering so that we can find the most relevant patterns | + | Typically, we want to be able to understand the data, so we are looking for the lowest number of clusters. We also want enough detail in the clustering so that we can find the most relevant patterns. | 
| + | |||
| + | We now define the following measures to evaluate the clusters: | ||
| + | |||
| + | * **Distortion**: the average of the squared Euclidean distances from the samples to the centroids of their respective clusters. | ||
| + | * **Inertia**: the sum of squared distances of samples to their closest cluster center. | ||
| == WCSS (Within Cluster Sum of Squares) == | == WCSS (Within Cluster Sum of Squares) == | ||
| - | WCSS is the sum of squares of the (Euclidean) distance of each data point to the cluster it was assigned to. This measure is used in the K-Means clustering algorithm to evaluate the "good" clustering in terms of the optimal number of clusters. A cluster that has a small WCSS is more compact, and therefore "better" than a cluster that has a large WCSS. | + | WCSS (inertia) is the sum of the squared (Euclidean) distances from each data point to the centroid of the cluster it was assigned to. This measure can be used with K-Means clustering to compare candidate numbers of clusters. A cluster that has a small WCSS is more compact, and therefore "better" than a cluster that has a large WCSS. | 
| $ WCSS(k) = \sum_{j=1}^{k} \sum_{i=1}^{n} \left \| x_i - \bar{x_j} \right \|^2 $ | $ WCSS(k) = \sum_{j=1}^{k} \sum_{i=1}^{n} \left \| x_i - \bar{x_j} \right \|^2 $ | ||
| Line 83: | Line 112: | ||
| * $x_i$ = data point $i$ | * $x_i$ = data point $i$ | ||
| * $\bar{x_j}$ = cluster centroid $j$ | * $\bar{x_j}$ = cluster centroid $j$ | ||
| + | |||
| + | After fitting, scikit-learn exposes the WCSS as the //inertia_// attribute of the model, so it does not have to be computed by hand. | ||
| + | |||
| + | <code python> | ||
| + | inertia = kmeans_model.inertia_ | ||
| + | print(inertia) | ||
| + | </code> | ||
| == The Elbow Method == | == The Elbow Method == | ||
| Line 89: | Line 125: | ||
| {{ :ewis:laboratoare:lab9:elbow_method.png?400 |}} | {{ :ewis:laboratoare:lab9:elbow_method.png?400 |}} | ||
| + | |||
| == The Silhouette Coefficient == | == The Silhouette Coefficient == | ||
| - | Selecting the number of clusters can be done with silhouette analysis for K-Means clustering. The Silhouette Coefficient is defined for each sample and is composed of two scores: | + | The Silhouette Coefficient is a measure of the similarity of a point with the points of the same cluster, and its dissimilarity with the points of other clusters. Selecting the number of clusters can be done with silhouette analysis for K-Means clustering. | 
| - | * The mean distance between a sample and all other points in the same class | + | |
| - | * The mean distance between a sample and all other points in the next nearest cluster | + | |
| This measure has a range of [-1, 1]. | This measure has a range of [-1, 1]. | ||
| Line 101: | Line 136: | ||
| * -1 – the sample is assigned to the wrong cluster | * -1 – the sample is assigned to the wrong cluster | ||
| - | The clustering evaluation using both Elbow Method and Silhouette Coefficient is shown below. In this example, the optimal number of clusters is 4, as shown by both methods (looks like an arm, has the highest silhouette coefficient, k=4). | + | The Silhouette Score is calculated using the //silhouette_score// function provided by scikit-learn. | 
| - | + | ||
| - | {{ :ewis:laboratoare:lab9:clustering_evaluation.png?400 |}} | + | |
| - | + | ||
| - | The following code generates a random array of points and performs K-Means Clustering. The Silhouette Score is then calculated using the scikit-learn provided function //silhouette_score//. | + | |
| <code python> | <code python> | ||
| - | import numpy as np | + | s = metrics.silhouette_score(X, kmeans_model.labels_, metric='euclidean') | 
| - | from sklearn.cluster import KMeans | + | |
| - | from sklearn import metrics | + | |
| - | X = 10 * np.random.randn(100, 2) + 6 | + | |
| - | kmeans_model = KMeans(n_clusters=3, random_state=1) | + | |
| - | kmeans_model.fit(X) | + | |
| - | labels = kmeans_model.labels_ | + | |
| - | s = metrics.silhouette_score(X, labels, metric='euclidean') | + | |
| print(s) | print(s) | ||
| </code> | </code> | ||
| + | |||
| + | The clustering evaluation using both the Elbow Method and the Silhouette Coefficient is shown below. In this example, both methods indicate that the optimal number of clusters is 4: the WCSS curve bends like the elbow of an arm at k=4, and the silhouette coefficient is highest there. | ||
| + | |||
| + | {{ :ewis:laboratoare:lab9:clustering_evaluation.png?400 |}} | ||
| <note tip> | <note tip> | ||
| Line 234: | Line 262: | ||
| Download the {{:ewis:laboratoare:lab9:lab9.zip|Project Archive}} and install the required packages via //requirements.txt// | Download the {{:ewis:laboratoare:lab9:lab9.zip|Project Archive}} and install the required packages via //requirements.txt// | ||
| - | <note important>Run //gen_ucode.py// to generate your unique codes (//UCODES//) that you will use in the exercises when required. Write them down and include them in the pdf report.</note> | + | === Task 0 (2p). Random Dataset === | 
| - | + | ||
| - | === Task 0 (2p) === | + | |
| Run //task0.py//: | Run //task0.py//: | ||
| Line 243: | Line 269: | ||
| * We want to find out how these points can be assigned to clusters using the K-Means algorithm.  | * We want to find out how these points can be assigned to clusters using the K-Means algorithm.  | ||
| * The K-Means algorithm is found in //clustering.py// via //clustering_kmeans(X, k)//, which uses the //KMeans// class from //scikit-learn//.  | * The K-Means algorithm is found in //clustering.py// via //clustering_kmeans(X, k)//, which uses the //KMeans// class from //scikit-learn//.  | ||
| - | * The WCSS and Silhouette Score is calculated using the euclidean distance and the //silhouette_score// function from //scikit-learn// | + | * The WCSS and Silhouette Score are calculated using the Euclidean distance and the //silhouette_score// function from //scikit-learn// | 
| * The data points and cluster centroids are shown on a scatter plot (2D). | * The data points and cluster centroids are shown on a scatter plot (2D). | ||
| - | <note important>Change the number of clusters in //task0.py// and run the script for each case. Use the generated //codes// **(//UCODES//)** as number of clusters and report the results as images. | + | **Task: Change the number of clusters in //task0.py// and run the script for each case. Report the results as plots.** | 
| - | </note> | + | |
| Run //task0_test.py//: | Run //task0_test.py//: | ||
| - | * The script generates a random dataset of 100 rows and 2 columns and performs clustering as in //task0.py//, now with variable number of clusters. | + | * The script generates a random dataset and runs the clustering for a variable number of clusters: //range(2, 20)//. | 
| - | * The range of clusters is defined as //range(2, 20)//. For each number of clusters, the clustering algorithm is run and the WCSS and Silhouette Scores are saved into a list. | + | * For each selection, the clustering algorithm is evaluated with the WCSS and Silhouette Score. | 
| - | * The optimal number of clusters is evaluated using the Silhouette Score | + | * The optimal number of clusters is determined based on the Silhouette Score. | 
| * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters) | * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters) | ||
| - | <note important>What is the optimal number of clusters for your random generated data? Present the results (plot, optimal number of clusters) in your report. | + | **Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.** | 
| - | </note> | + | |
| - | === Task 1 (4p) === | + | === Task 1 (4p). Countries and Continents === | 
| Run //task1.py//: | Run //task1.py//: | ||
| - | * The script loads the dataset from a CSV file which contains the data about countries and continents as described in the first case study. | + | * The script loads the dataset from a CSV file which contains data about countries and continents.  | 
| * We want to find out how these countries can be assigned to clusters using the K-Means algorithm.  | * We want to find out how these countries can be assigned to clusters using the K-Means algorithm.  | ||
| - | * The data now contains country names (text), which have to be converted to numbers to be able to run the clustering algorithm. We are not interested in the actual country names, and the continents can be assigned to numbers. | + | * The names of the continents are converted to numbers to run the clustering. | 
| - | * The script runs the clustering algorithm using the number of clusters given as the actual number of continents in the dataset. | + | * The script clusters the data using a defined number of clusters. | 
| * The data points and cluster centroids are shown on a scatter plot (2D). | * The data points and cluster centroids are shown on a scatter plot (2D). | ||
| - | <note important>Change the number of clusters in //task1.py// and run the script for each case. Use the generated //codes// **(//UCODES//)** as number of clusters and report the results as images. | + | **Task: Change the number of clusters in //task1.py// and report the results as plots.** | 
| - | </note> | + | |
| Run //task1_test.py//: | Run //task1_test.py//: | ||
| - | * The script loads the dataset from a CSV file which contains the data about countries and continents and performs clustering as in //task1.py//, now with variable number of clusters. | + | * The script loads the dataset about countries and continents and performs clustering as in //task1.py//, with a variable number of clusters: //range(2, 20)//. | 
| - | * The range of clusters is defined as //range(2, 20)//. For each number of clusters, the clustering algorithm is run and the WCSS and Silhouette Scores are saved into a list. | + | * For each selection, the clustering algorithm is evaluated with the WCSS and Silhouette Score. | 
| - | * The optimal number of clusters is evaluated using the Silhouette Score. | + | * The optimal number of clusters is determined based on the Silhouette Score. | 
| * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters) | * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters) | ||
| - | <note important> | + | **Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.** | 
| - | What is the optimal number of clusters in this case? | + | |
| - | * Present the results (plot, optimal number of clusters) in your report. | + | |
| - | * Go back to //task2.py// and plot the results with the optimal number of clusters.  | + | |
| - | * Include your interpretation of the results into your report. | + | |
| - | </note> | + | |
| - | + | === Task 2 (4p). Market Segmentation === | |
| - | === Task 2 (4p) === | + | |
| Run //task2.py//: | Run //task2.py//: | ||
| - | * The script loads the dataset from a CSV file which contains the data about customer behavior as described in the second case study (Market Segmentation). | + | * The script loads the dataset from a CSV file which contains the data about customer behavior in the Market Segmentation case study. | 
| * We want to find out how these behaviors can be assigned to clusters using the K-Means algorithm.  | * We want to find out how these behaviors can be assigned to clusters using the K-Means algorithm.  | ||
| * The features now have different ranges, so they have to be scaled to obtain good results with the clustering algorithm.  | * The features now have different ranges, so they have to be scaled to obtain good results with the clustering algorithm.  | ||
| * The script runs the clustering algorithm and plots the data points and cluster centroids on a scatter plot (2D). | * The script runs the clustering algorithm and plots the data points and cluster centroids on a scatter plot (2D). | ||
| - | <note important>Change the number of clusters in //task2.py// and run the script for each case. | + | **Task: Change the number of clusters in //task2.py// and run the script for each case. Report the results as plots.** | 
| - | * Use the generated //codes// **(//UCODES//)** as number of clusters and report the results as images. | + | |
| - | * Provide an interpretation of the results for each case | + | |
| - | </note> | + | |
| Write //task2_test.py//: | Write //task2_test.py//: | ||
| - | * The script has to load the dataset from a CSV file which contains the data about customer behavior as described in the second case study (Market Segmentation). | + | * The script has to load the customer behavior dataset. | 
| * The idea is similar to //task1_test.py// | * The idea is similar to //task1_test.py// | ||
| * The range of clusters is defined as //range(2, 10)//. For each number of clusters, the clustering algorithm is run and the WCSS and Silhouette Scores are saved into a list. | * The range of clusters is defined as //range(2, 10)//. For each number of clusters, the clustering algorithm is run and the WCSS and Silhouette Scores are saved into a list. | ||
| Line 307: | Line 321: | ||
| * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters) | * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters) | ||
| - | <note important> | + | **Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.** | 
| - | What is the optimal number of clusters in this case? | + | |
| - | * Present the results (plot, optimal number of clusters) in your report. | + | |
| - | * Go back to //task2.py// and plot the results with the optimal number of clusters.  | + | |
| - | * Include your interpretation of the results into your report. | + | |
| - | </note> | + | |
| ==== Resources ==== | ==== Resources ==== | ||
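
The evaluation loop described for the //_test.py// scripts (fit K-Means for each k, record the WCSS and the Silhouette Score, then pick the k with the highest score) might look roughly like the sketch below. It is not the content of the lab scripts, which are not shown on this page: the helper name //evaluate_k_range//, the fixed seed, //n_init=10// and the range of k are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics


def evaluate_k_range(X, k_values, seed=0):
    """Hypothetical helper: fit K-Means for each k and collect both scores."""
    wcss, silhouette = [], []
    for k in k_values:
        # n_init and random_state are assumptions, chosen for repeatable runs
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        wcss.append(model.inertia_)  # WCSS as exposed by scikit-learn
        silhouette.append(
            metrics.silhouette_score(X, model.labels_, metric='euclidean'))
    return wcss, silhouette


# Random 2D dataset, generated in the same spirit as the example on this page
rng = np.random.default_rng(0)
X = 10 * rng.standard_normal((100, 2)) + 6

k_values = list(range(2, 10))
wcss, silhouette = evaluate_k_range(X, k_values)

# Pick the k with the highest mean silhouette score
best_k = k_values[int(np.argmax(silhouette))]
print("best k by silhouette:", best_k)
```

Plotting //wcss// and //silhouette// against //k_values// gives the two curves used by the Elbow Method and the silhouette analysis. On purely random data there is no real cluster structure, so the selected k should not be over-interpreted.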