Differences

ewis:laboratoare:09 [2023/05/10 17:43]
alexandru.predescu [Exercises]
ewis:laboratoare:09 [2023/05/10 18:02] (current)
alexandru.predescu [K-Means Clustering]
Line 71: Line 71:
  
 <note tip>​Clustering can also be used to predict new data based on the identified patterns. If you want to predict the cluster for new points, just find the centroid they'​re closest to</​note>​ <note tip>​Clustering can also be used to predict new data based on the identified patterns. If you want to predict the cluster for new points, just find the centroid they'​re closest to</​note>​
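The nearest-centroid rule from the tip above can be sketched as follows. This is a minimal illustration, not the lab's code: the fixed seed, the //n_init// and //random_state// arguments, and the two //new_points// are assumptions chosen for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Fixed seed (an assumption) so that the example is reproducible
rng = np.random.default_rng(0)
X = 10 * rng.standard_normal((100, 2)) + 6

kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Hypothetical new points that were not part of the fitted data
new_points = np.array([[0.0, 0.0], [20.0, 15.0]])

# Euclidean distance from every new point to every centroid,
# shape (n_new_points, n_clusters)
distances = np.linalg.norm(
    new_points[:, None, :] - kmeans_model.cluster_centers_[None, :, :], axis=2
)

# The predicted cluster is the index of the closest centroid
nearest = distances.argmin(axis=1)

# KMeans.predict applies the same nearest-centroid rule
predicted = kmeans_model.predict(new_points)
print(nearest, predicted)
```

Both arrays should contain the same cluster indices, since //predict// assigns each point to its closest centroid.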
 +
 +The following code generates a random array of points and performs K-Means Clustering.
 +
 +<code python>
 +import numpy as np
 +import matplotlib.pyplot as plt
 +from sklearn.cluster import KMeans
 +from sklearn import metrics
 +
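 +# 100 random 2D points drawn from a normal distribution (std 10, mean 6)
 +X = 10 * np.random.randn(100, 2) + 6
 +
 +# Fit K-Means with 3 clusters; the assigned cluster of each point is in labels_
 +kmeans_model = KMeans(n_clusters=3)
 +kmeans_model.fit(X)
 +
 +# Color each point by its assigned cluster
 +plt.scatter(X[:, 0], X[:, 1], c=kmeans_model.labels_,
 +            cmap='rainbow', label="points")
 +
 +plt.show()
 +</code>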
 +
 +{{ :​ewis:​laboratoare:​lab9:​random_points_clustering.png?​400 |}}
  
 === Choosing the optimal number of clusters === === Choosing the optimal number of clusters ===
Line 93: Line 113:
   * $\bar{x_j}$ = cluster centroid $j$   * $\bar{x_j}$ = cluster centroid $j$
  
-== The Elbow Method == +The fitted model already provides the WCSS through its //inertia_// attribute.
- +
-Below is a plot of sum of squared distances (WCSS). If the plot looks like an arm, then the elbow on the arm is optimal k. In this example, the optimal number of clusters is 4. +
- +
-{{ :​ewis:​laboratoare:​lab9:​elbow_method.png?​400 |}} +
- +
-The following code generates a random array of points and performs K-Means Clustering. ​The WCSS (inertia) is already provided in the result.+
  
 <code python> <code python>
-import numpy as np 
-from sklearn.cluster import KMeans 
-from sklearn import metrics 
-X = 10 * np.random.randn(100,​ 2) + 6 
-kmeans_model = KMeans(n_clusters=3) 
-kmeans_model.fit(X) 
-labels = kmeans_model.labels_ 
 inertia = kmeans_model.inertia_ inertia = kmeans_model.inertia_
 print(inertia) print(inertia)
 </​code>​ </​code>​
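As a sanity check, the reported inertia can be compared with the WCSS definition (the sum of squared distances from each sample to its cluster centroid). The sketch below is self-contained; the fixed seed and the //n_init// and //random_state// arguments are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Fixed seed (an assumption) so that the example is reproducible
rng = np.random.default_rng(0)
X = 10 * rng.standard_normal((100, 2)) + 6

kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centroid assigned to each sample, shape (n_samples, n_features)
assigned_centroids = kmeans_model.cluster_centers_[kmeans_model.labels_]

# WCSS: sum of squared Euclidean distances from each sample to its centroid
wcss = np.sum((X - assigned_centroids) ** 2)

print(wcss, kmeans_model.inertia_)
```

The two printed values should agree up to floating-point rounding.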
 +
 +== The Elbow Method ==
 +
 +The plot below shows the sum of squared distances (WCSS) for each number of clusters k. If the curve looks like an arm, the elbow (the point where the WCSS stops decreasing sharply) marks the optimal k. In this example, the optimal number of clusters is 4.
 +
 +{{ :​ewis:​laboratoare:​lab9:​elbow_method.png?​400 |}}
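The values behind an elbow plot can be obtained by fitting K-Means once per candidate k and collecting the inertia. This is a minimal sketch on synthetic data: the //make_blobs// dataset with 4 centers, the range of k, and the //n_init// and //random_state// arguments are assumptions and do not reproduce the data behind the figure above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 blobs (assumed for illustration only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

k_values = range(1, 9)
wcss = []
for k in k_values:
    # Fit one model per candidate number of clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

# Print the WCSS for each k; the elbow is where the decrease flattens out
for k, value in zip(k_values, wcss):
    print(k, round(value, 1))
```

Plotting //wcss// against //k_values// (for example with //plt.plot//) gives the arm-shaped curve described above.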
  
  
Line 123: Line 136:
   * -1 – the sample is assigned to the wrong cluster   * -1 – the sample is assigned to the wrong cluster
  
-The clustering evaluation using both Elbow Method and Silhouette Coefficient is shown below. In this example, the optimal number of clusters is 4, as shown by both methods (looks like an arm, has the highest silhouette coefficient,​ k=4). +The Silhouette Score is calculated using the scikit-learn provided function //​silhouette_score//​.
- +
-{{ :​ewis:​laboratoare:​lab9:​clustering_evaluation.png?​400 |}} +
- +
-The following code generates a random array of points and performs K-Means Clustering. ​The Silhouette Score is then calculated using the scikit-learn provided function //​silhouette_score//​.+
  
 <code python> <code python>
-import numpy as np +s = metrics.silhouette_score(X, kmeans_model.labels_, metric='euclidean')
-from sklearn.cluster import KMeans +
-from sklearn import metrics +
-X = 10 * np.random.randn(100,​ 2) + 6 +
-kmeans_model = KMeans(n_clusters=3) +
-kmeans_model.fit(X) +
-labels = kmeans_model.labels_ +
-s = metrics.silhouette_score(X, ​labels, metric='​euclidean'​)+
 print(s) print(s)
 </​code>​ </​code>​
 +
 +The plot below evaluates the clustering with both the Elbow Method and the Silhouette Coefficient. Both methods indicate that the optimal number of clusters is 4: the WCSS curve has its elbow at k=4, and the silhouette coefficient is highest at k=4.
 +
 +{{ :​ewis:​laboratoare:​lab9:​clustering_evaluation.png?​400 |}}
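Selecting k by the highest silhouette score can be sketched as a loop over candidate values. As before, the synthetic //make_blobs// data, the range of k, and the //n_init// and //random_state// arguments are assumptions for illustration, not the data used for the figure above.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 blobs (assumed for illustration only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = metrics.silhouette_score(X, labels, metric='euclidean')

# The candidate k with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

The loop starts at k=2 because the silhouette score is undefined for a single cluster, whereas the WCSS can also be computed for k=1.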
  
 <note tip> <note tip>
Line 266: Line 272:
   * The data points and cluster centroids are shown on a scatter plot (2D).   * The data points and cluster centroids are shown on a scatter plot (2D).
  
-Task: Change the number of clusters in //​task0.py//​ and run the script for each case. Report the results as plots.+**Task: Change the number of clusters in //​task0.py//​ and run the script for each case. Report the results as plots.**
  
 Run //​task0_test.py//:​ Run //​task0_test.py//:​
Line 275: Line 281:
   * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters)   * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters)
  
-Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.+**Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.**
  
 === Task 1 (4p). Countries and Continents === === Task 1 (4p). Countries and Continents ===
Line 287: Line 293:
   * The data points and cluster centroids are shown on a scatter plot (2D).   * The data points and cluster centroids are shown on a scatter plot (2D).
  
-Task: Change the number of clusters in //​task1.py//​ and report the results as plots.+**Task: Change the number of clusters in //​task1.py//​ and report the results as plots.**
  
 Run //​task1_test.py//:​ Run //​task1_test.py//:​
Line 295: Line 301:
   * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters)   * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters)
  
-Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.+**Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.**
  
 === Task 2 (4p). Market Segmentation === === Task 2 (4p). Market Segmentation ===
Line 306: Line 312:
   * The script runs the clustering algorithm and plots the data points and cluster centroids on a scatter plot (2D).   * The script runs the clustering algorithm and plots the data points and cluster centroids on a scatter plot (2D).
  
-Task: Change the number of clusters in //​task2.py//​ and run the script for each case. Report the results as plots.+**Task: Change the number of clusters in //​task2.py//​ and run the script for each case. Report the results as plots.**
  
 Write //​task2_test.py//:​ Write //​task2_test.py//:​
Line 315: Line 321:
   * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters)   * The results are shown on a plot (WCSS, Silhouette Score) for each k (number of clusters)
  
-Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.+**Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.**
  
 ==== Resources ==== ==== Resources ====
ewis/laboratoare/09.1683729782.txt.gz · Last modified: 2023/05/10 17:43 by alexandru.predescu
CC Attribution-Share Alike 3.0 Unported