What you will learn:
What is Unsupervised Machine Learning and when it can be useful:
The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or groupings in data.
Clustering is an important unsupervised learning problem and deals with finding a structure in a collection of unlabeled data. Clustering can uncover interesting groupings of people/things/behaviors such as (example):
Clustering requires a proximity measure between two data points: either a similarity measure $S(x_a, x_b)$ or a dissimilarity (distance) measure $D(x_a, x_b)$.
Various proximity measures can be used. For points in $d$-dimensional space, a common dissimilarity measure is the Euclidean distance, given by the formula:
$ D(x, y) = \sqrt{\sum_{i=1}^{d} \left ( x_i - y_i \right )^2} $
where:
$ x = [x_1, x_2, ..., x_d] $
$ y = [y_1, y_2, ..., y_d] $
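As a quick check of the Euclidean distance formula, here is a minimal NumPy sketch. The helper name `euclidean_distance` and the two sample points are made up for illustration.

```python
import numpy as np

def euclidean_distance(x, y):
    """Return the Euclidean distance between two points of equal dimension."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # square the coordinate differences, sum them, take the square root
    return np.sqrt(np.sum((x - y) ** 2))

a = [0.0, 0.0]
b = [3.0, 4.0]
dist = euclidean_distance(a, b)
print(dist)  # 5.0 (the 3-4-5 right triangle)
```

The same value can be obtained with `np.linalg.norm(np.array(a) - np.array(b))`.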
K-means is a heuristic algorithm that partitions a data set into K clusters by minimizing the within-cluster sum of squares (WCSS), i.e. the sum of squared distances between each point and the centroid of its cluster. The standard K-means procedure is often referred to as Lloyd’s algorithm.
The goal is to group together data into similar classes such that:
K-Means is a simple unsupervised learning algorithm using a fixed number of clusters (k):
There are some things to consider with k-Means Clustering:
The following code generates a random array of points and performs K-Means Clustering.
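    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn import metrics

    # generate 100 random 2-D points (normally distributed, scaled and shifted)
    X = 10 * np.random.randn(100, 2) + 6

    # fit K-Means with 3 clusters
    kmeans_model = KMeans(n_clusters=3)
    kmeans_model.fit(X)

    # plot the points, colored by their assigned cluster
    plt.scatter(X[:, 0], X[:, 1], c=kmeans_model.labels_, cmap='rainbow', label="points")
    plt.show()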
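To give some intuition for what happens inside `KMeans`, the following is a minimal sketch of Lloyd's iteration in plain NumPy. It is not scikit-learn's implementation: the helper name `lloyd_kmeans`, the random initialization from data points (scikit-learn defaults to k-means++) and the fixed seeds are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's iteration (illustrative); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # assumption: initialize the centroids with k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(42)
X = 10 * rng.standard_normal((100, 2)) + 6
labels, centroids = lloyd_kmeans(X, k=3)
print(centroids)
```

Because the result depends on the initial centroids, different seeds can give different clusterings. This is one reason library implementations typically run several initializations and keep the best one.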
Choosing the number of clusters is a trade-off: we want as few clusters as possible so that the result stays easy to interpret, but enough clusters to capture the most relevant patterns in the data.
We now define the following measures to evaluate the clusters:
WCSS (inertia) is the sum of the squared (Euclidean) distances between each data point and the centroid of the cluster it was assigned to. This measure can be used with the K-Means clustering algorithm to evaluate the number of clusters. For a given k, a clustering with a small WCSS is more compact, and therefore “better”, than one with a large WCSS. Since adding clusters generally lowers the WCSS, it cannot simply be minimized over k.
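$ WCSS(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \left \| x_i - \bar{x}_j \right \|^2 $
where $C_j$ is the set of data points assigned to cluster $j$ and $\bar{x}_j$ is the centroid (mean) of cluster $j$.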
scikit-learn already provides the WCSS of a fitted model through its inertia_ attribute.

    inertia = kmeans_model.inertia_
    print(inertia)
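To connect the WCSS definition with the value reported by scikit-learn, the sketch below recomputes the sum of squared distances by hand and prints it next to `inertia_`. The data, the seeds and the variable names (`model`, `wcss_manual`) are chosen here for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = 10 * rng.standard_normal((100, 2)) + 6

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# look up, for every point, the centroid of the cluster it was assigned to
assigned_centroids = model.cluster_centers_[model.labels_]

# sum of squared Euclidean distances to those centroids
wcss_manual = np.sum((X - assigned_centroids) ** 2)

print(wcss_manual, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding.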
Below is a plot of the WCSS as a function of the number of clusters k. If the curve looks like an arm, the elbow of the arm (the point where the decrease in WCSS levels off) marks the optimal k. In this example, the optimal number of clusters is 4.
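The values behind such an elbow plot can be produced with a simple loop over k. The sketch below is one possible way to do it. It uses synthetic data from `make_blobs` with 4 groups, which is an assumption for illustration and not the data set used in this lab.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with 4 groups (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

ks = range(1, 9)
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

# print the WCSS for each k; plotting ks against wcss gives the elbow curve
for k, value in zip(ks, wcss):
    print(k, round(value, 1))
```

Plotting `wcss` against `ks` (for example with `plt.plot`) gives the arm-shaped curve. The elbow is read off by eye, so the method is a heuristic rather than an exact criterion.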
The Silhouette Coefficient measures how similar a point is to the points of its own cluster compared with the points of other clusters. Silhouette analysis can be used to select the number of clusters for K-Means clustering.
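This measure has a range of [-1, 1]. Values near 1 indicate that a point is well matched to its own cluster, values near 0 indicate a point on the border between two clusters, and negative values suggest that the point may have been assigned to the wrong cluster.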
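For reference, the usual definition of the coefficient for a single data point $i$ can be written as follows. The symbols $a(i)$ and $b(i)$ are notation introduced here, not taken from the text above:
$ s(i) = \frac{b(i) - a(i)}{\max \left ( a(i), b(i) \right )} $
where $a(i)$ is the mean distance from point $i$ to the other points of its own cluster and $b(i)$ is the smallest mean distance from point $i$ to the points of any other cluster. The score of a whole clustering is the mean of $s(i)$ over all points.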
The Silhouette Score (the mean Silhouette Coefficient over all points) is calculated with the scikit-learn function silhouette_score.
    s = metrics.silhouette_score(X, kmeans_model.labels_, metric='euclidean')
    print(s)
The clustering evaluation using both the Elbow Method and the Silhouette Coefficient is shown below. In this example, both methods indicate that the optimal number of clusters is 4: the WCSS curve has its elbow at k=4, and the silhouette coefficient is highest at k=4.
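Silhouette analysis over several values of k can be sketched in the same way as the elbow method. The example below again uses synthetic `make_blobs` data (an assumption for illustration), and the names `scores` and `best_k` are made up here.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with 4 groups (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = metrics.silhouette_score(X, labels, metric='euclidean')

# pick the k with the highest mean silhouette coefficient
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Unlike the elbow method, this gives a numeric criterion (the maximum score), but it is still worth comparing it with the WCSS curve before settling on a value of k.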
The following case studies represent possible applications for K-Means clustering. The exercises in this lab are based on these examples.
In this case study, we have a dataset with all the countries in the world, their location (latitude, longitude) and the continent they belong to. This looks like a clustering problem. Let's say we don't know the continents and we want to find them using clustering. The algorithm has to find out which are the continents based on the data about countries and their location.
Maybe we want to partition the world differently than by continents, say into 3 clusters. No problem: we simply set the number of clusters to 3. The results are shown in the figure below:
Here is the relevant code for clustering in this scenario:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # read the csv dataset
    data = pd.read_csv('countries_continents.csv', encoding='latin-1')
    X, head = data.values, data.columns.values

    # map the continent names (column 3) to numbers
    print("continents: ")
    continents = {e: i for i, e in enumerate(np.unique(X[:, 3]))}
    print(continents)
    X[:, 3] = np.vectorize(continents.__getitem__)(X[:, 3])

    # keep only the location columns (drop the first and the last column)
    X = X[:, 1: np.size(X, 1)-1].astype(float)

    # run K-Means clustering algorithm
    kmeans = KMeans(n_clusters=3, init="k-means++")
    clusters = kmeans.fit_predict(X)

    # show the label assigned to each data point
    print("cluster labels: ")
    print(clusters)

    # show the cluster centers (centroids)
    print("cluster centers")
    centroids = kmeans.cluster_centers_
    print(centroids)
In this case study, we have a dataset with information about customers: age, amount spent, satisfaction, brand loyalty. We are interested in revealing patterns in customer behavior so that we can define a data-aware business strategy. Let's assume that we have customer satisfaction (CSAT) scores from 1 to 10 (self-reported discrete data, where 1 = very dissatisfied and 10 = very satisfied). We also have scores for the customer's level of brand loyalty (a trickier measure based on churn rate, retention rate and customer lifetime value (CLV), in the range [-2.5, 2.5]).
In this case, we have two measures with different ranges ([1, 10] and [-2.5, 2.5]). Because K-Means relies on Euclidean distance, the feature with the larger range would dominate the result, so we standardize the data before running the clustering algorithm:
    import pandas as pd
    from sklearn import preprocessing

    # read the csv dataset
    data = pd.read_csv('market_segmentation_data.csv', encoding='latin-1')
    X, head = data.values, data.columns.values

    # standardize each feature to zero mean and unit variance
    X = preprocessing.scale(X)
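To see what `preprocessing.scale` does to the two measures, the sketch below applies it to a few made-up rows and compares the result with a manual computation. The sample values are hypothetical and are not taken from market_segmentation_data.csv.

```python
import numpy as np
from sklearn import preprocessing

# hypothetical rows: (satisfaction in [1, 10], loyalty in [-2.5, 2.5])
X = np.array([[1.0, -2.0], [4.0, 0.5], [7.0, 1.5], [10.0, 2.0]])

X_scaled = preprocessing.scale(X)

# the same transformation by hand: subtract the column mean,
# divide by the column standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1 (up to rounding), so neither measure dominates the Euclidean distance.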
The chart can be divided into 4 squares based on the measured level of satisfaction and brand loyalty:
Let's first take 2 clusters. These should reveal the extremes in the customer behaviors. Now it's our job to interpret the results. In this case, the results show the two extreme behaviors:
Now, as we've defined the 4 squares, let's take 4 clusters to represent a more useful pattern for business purpose. The results are shown in the chart below:
The 4 clusters reveal a clearer picture of customer behaviors, which actually fits the 4 squares defined before:
Great! Now we can define business strategies based on the actual customer behavior patterns, and turn those supporters and roamers into fans.
Download the Project Archive and install the required packages via requirements.txt
Run task0.py:
Task: Change the number of clusters in task0.py and run the script for each case. Report the results as plots.
Run task0_test.py:
Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.
Run task1.py:
Task: Change the number of clusters in task1.py and report the results as plots.
Run task1_test.py:
Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.
Run task2.py:
Task: Change the number of clusters in task2.py and run the script for each case. Report the results as plots.
Write task2_test.py:
Task: What is the optimal number of clusters? Present the results as a plot and number of clusters.