2. The k-means clustering algorithm
3. Writing the k-means algorithm with NumPy
4. Writing a Scikit-Learn compatible estimator
5. Using the elbow method and Pipelines
Clustering algorithms and unsupervised learning methods have been gaining popularity recently. This is partly because the amount of data being generated has increased exponentially, but also because labels for this data are often still hard to come by. Labeling data is time-consuming and requires human effort, which can be expensive. Unsupervised learning methods are machine learning methods that can be used to glean information from unlabeled data. Clustering algorithms specifically take unlabeled points within a dataset and try to group them into "clusters": within a cluster, datapoints are very "similar" (in a sense that will be made precise later), while datapoints in different clusters are very "dissimilar".
I have mixed feelings about clustering. It's often hard to quantify how well a model is performing when you don't have a measure of what is correct. On the other hand, without labeled data, clustering methods are often all we've got! Even though their performance is hard to quantify, clustering methods can be helpful for semi-supervised learning, where there is a small amount of labeled data and a large amount of unlabeled data.
In this post, I will go over how to write a k-means clustering algorithm from scratch using NumPy. The algorithm will be explained in the next section, and while seemingly simple, it can be tricky to implement efficiently! As an added bonus, I will go over how to implement the algorithm in a way that is Scikit-Learn compatible so that we can use Scikit-Learn's framework, including Pipelines and GridSearchCV (which admittedly isn't particularly helpful for this model).
Let's start out by talking about the technical details of the k-means clustering algorithm.
The k-means clustering algorithm is a means of partitioning a dataset, $\textbf{X} \in \mathbb{R}^{n \times p}$, of $n$ points and $p$ features into $k$ clusters. It specifically assumes,
explicitly, that the number of clusters can be defined a priori
implicitly, that all points in the dataset belong to clusters that can be well separated
The main idea of the algorithm is to find a centroid for each of the $k$ clusters, $\boldsymbol \mu_{k} \in \mathbb{R}^{p}$, that represents the "center" of the cluster. The algorithm then assigns each point in the dataset to the cluster it is closest to. This assignment requires us to define a metric $d(x_{1}, x_{2})$ that tells the algorithm how close two points are within our feature space, i.e. how "similar" they are. Most often the distance function $d$ is taken to be the Euclidean distance.
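As a quick illustration, the Euclidean distance between two points can be computed with NumPy's norm function (a minimal sketch; the variable names here are just for illustration),
import numpy as np

x1 = np.array([1.0, 2.0])
x2 = np.array([4.0, 6.0])

# Euclidean distance d(x1, x2) = sqrt(sum((x1 - x2)^2))
d = np.linalg.norm(x1 - x2)
d  # 5.0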
The k-means clustering algorithm tries to form clusters containing points that are similar with respect to the distance metric $d$. It turns out this is equivalent to minimizing the variance within each cluster. Let the set of all clusters be $\mathbf{S} = \{S_1, \, \ldots, \, S_k\}$; then the cost function is defined as,
\begin{align} \mathcal{J}(k) \; &= \; \sum_{j=1}^k |S_j| \operatorname{Var} S_j \\ &= \; \sum_{i=1}^{n} \sum_{j=1}^{k} r_{i,j} \, \Vert \textbf{x}_i - \boldsymbol \mu_j \Vert^{2} \end{align}where for each $\textbf{x}_i \in \mathbb{R}^{p}$ we have the definition of the indicator function,
$$ r_{i,k} \; = \; \left\{ \begin{array}{cc} 1, & \text{if} \; k = \underset{j} {\operatorname{arg\,min}} \Vert \textbf{x}_i - \boldsymbol \mu_j \Vert^{2} \\ 0, & \text{otherwise} \end{array} \right. $$It can be seen that the sum $|S_k| \; = \; \sum_{i=1}^n r_{i,k}$ is the number of points assigned to each cluster!
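To make the cost function concrete, here is a small sketch of how $\mathcal{J}$ could be computed with NumPy given a dataset, a set of centroids, and the cluster assignments (the names X, centroids, and labels here mirror the variables we build up later in the post),
import numpy as np

def kmeans_cost(X: np.array, centroids: np.array, labels: np.array) -> float:
    # sum of squared distances from each point to its assigned centroid
    return sum(
        np.sum(np.linalg.norm(X[labels == j] - centroids[j], axis=1) ** 2)
        for j in range(len(centroids))
    )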
Minimizing the cost function $\mathcal{J}(k)$ is an NP-hard problem, so a heuristic is usually applied to approximate the optimal centroids. This heuristic is a greedy iterative method called the Lloyd–Forgy algorithm, which has the following steps,
Initialize the $k$ centroids, for instance by choosing $k$ points from the dataset at random.
Then, while not converged,
assign each point in the dataset to the cluster whose centroid it is closest to,
update each centroid to be the mean of the points currently assigned to its cluster.
Convergence is usually taken to mean that the distance between each cluster's centroid before and after an iteration is less than some predefined tolerance. One can also set a maximum number of iterations the algorithm may take.
This heuristic method is used to train the k-means model and attempts to find a grouping of the data that minimizes the cost function, the weighted sum of squared errors of the clusters. Because the cost function is a sum of squared errors, it is sensitive to outliers, just like linear regression. In addition, the algorithm is sensitive to the initial choice of centroids, so in practice a number of different random initializations are tried and the one with the lowest cost is kept as the final model.
k-means++ is an algorithm for choosing the initial values (or "seeds") for the heuristic solution: it specifies a procedure to initialize the cluster centroids before proceeding with the standard iterative k-means algorithm discussed above. The initialization algorithm has the following steps,
Choose the first centroid uniformly at random from the points in the dataset.
For each point $\textbf{x}_i$, compute the distance $D(\textbf{x}_i)$ to the nearest centroid chosen so far.
Choose the next centroid from the dataset at random, with probability proportional to $D(\textbf{x}_i)^{2}$.
Repeat the previous two steps until $k$ centroids have been chosen.
With the k-means++ initialization, the algorithm is guaranteed, in expectation, to find a solution that is $\mathcal{O}(\log k)$-competitive with the optimal k-means solution. A sketch of how this seeding could be implemented is shown below.
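The post doesn't implement k-means++, but a minimal NumPy sketch of the seeding procedure might look like the following (the function name kmeans_pp_init and its details are my own, not something used later in the post),
import numpy as np

def kmeans_pp_init(X: np.array, K: int) -> np.array:
    N = X.shape[0]
    # choose the first centroid uniformly at random
    centroids = [X[np.random.choice(N)]]
    for _ in range(K - 1):
        # squared distance from each point to its nearest chosen centroid
        dists = np.min(
            np.linalg.norm(X[:, None, :] - np.stack(centroids)[None, :, :], axis=2) ** 2,
            axis=1)
        # sample the next centroid with probability proportional to D(x)^2
        probs = dists / dists.sum()
        centroids.append(X[np.random.choice(N, p=probs)])
    return np.stack(centroids)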
First, let's start out by creating a simple dataset in 2 dimensions. We can use 10,000 points belonging to 3 clusters as shown below,
import numpy as np
N = 10000
k = 3
# three Gaussian blobs centered at (5, 6), (-5, -6) and (-10, 3)
data1 = np.random.randn(N//3, 2) + np.array([5, 6])
data2 = np.random.randn(N//3, 2) + np.array([-5, -6])
data3 = np.random.randn(N//3, 2) + np.array([-10, 3])
data = np.concatenate((data1, data2, data3))
Now let's plot the points,
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(data[:,0], data[:,1])
To initialize the centroids, we need a function that takes a dataset of N points and chooses K of them to be the centroids. This is a naive initialization compared to k-means++, but for our purposes it will do. We can use the np.random.choice function with replace=False to get the indices of K distinct points, as shown below,
np.random.choice(N, 3, replace=False)
We can then write a function to initialize these centroids,
def init_centroids(X: np.array, K: int) -> np.array:
    # choose K distinct points from the dataset at random to act as the initial centroids
    N = X.shape[0]
    return X[np.random.choice(N, K, replace=False)]
centroids = init_centroids(data, 3)
The three initial points for the centroids are,
centroids
We can plot these to see where they are with respect to the rest of the points,
plt.figure()
plt.scatter(data[:,0], data[:,1])
for c in centroids:
plt.scatter(*c, color="red")
Two of the centroids look like they are in the same cluster (note that due to random seeding you may not get the same results). Now let's write a function that assigns each point in the dataset to a cluster by finding the cluster that each point is closest to.
Let's take an example with the first data point,
point = data[0]
point
We can remind ourselves what the centroids look like,
centroids
Now we want to find the distance from the point to each of the centroids, and we will use the concept of broadcasting in NumPy to do so. We can see the shapes of point and centroids,
print(f"point.shape = {point.shape}")
print(f"centroids.shape = {centroids.shape}")
Now subtracting the two using broadcasting results in
point - centroids
We broadcast the point from shape (2,) to shape (3,2) to match the shape of the centroids array and then perform elementwise subtraction.
Now we can calculate the distances by using the norm from NumPy's numerical linear algebra module. Using the norm lets this work for data of arbitrary dimension!
dists = np.linalg.norm(point - centroids, axis=1)
dists
Now we assign the point to the cluster it is closest to using the argmin function from NumPy.
np.argmin(dists)
This means the point is closest to the third centroid, centroid 2 (since Python is 0-indexed)! We can see this below,
plt.figure()
plt.scatter(data[:,0], data[:,1])
for c in centroids:
plt.scatter(*c, color="red")
plt.scatter(*point, color='purple')
Now we want to do this for every point in the dataset. We create a new vector called labels
that holds the cluster each point belongs to. We use the NumPy empty function to create an empty array of size N (so that the memory is allocated up front).
labels = np.empty(data.shape[0])
labels
Now we can write a function that assigns every point in the dataset to the cluster it is closest to. The function takes the dataset, the labels array, and the current centroids, and uses the enumerate function to keep track of the index in the labels array while looping over each point in the dataset,
def update_labels(X: np.array, labels: np.array, centroids: np.array) -> None:
    for i, point in enumerate(X):
        # distance from this point to every centroid (norm along the rows)
        dists = np.linalg.norm(point - centroids, axis=1)
        labels[i] = np.argmin(dists)
Note that this function edits the labels array in place (by reference) instead of returning a new array.
update_labels(data, labels, centroids)
labels
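As an aside, the per-point loop in update_labels can be replaced with a fully vectorized computation using broadcasting; this is just a sketch of the idea (using the data and centroids arrays from above), not something we rely on later,
# distances between every point and every centroid, shape (N, K)
all_dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
# each point's label is the index of its nearest centroid
vectorized_labels = np.argmin(all_dists, axis=1)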
The last function we need to write is one that updates the centroids. We update a centroid by finding the points that belong to its cluster and then taking the mean (or some other average) of those points' features to form the new centroid for that cluster.
We can find all the points that belong to each cluster using the concept of masking. To see how this works, we can find all the points labeled as belonging to a given cluster. For instance, we can find which points belong to cluster 0 with the following,
(labels==0)
Note that this returns a boolean array that says whether each entry of labels is 0 or not. We can then use masking to get all points that are in cluster 0,
data[labels==0]
We can obtain the new centroid by taking the average along each column; this can be calculated with the mean function from NumPy with axis=0 to signify that we are averaging over the rows in each column,
data[labels==0].mean(axis=0)
Writing this as a function that takes in the data array and labels, we can loop over the clusters and then use the stack function to collect the centroids into a single array,
def update_centroids(X: np.array, labels: np.array, K: int) -> np.array:
    # each new centroid is the mean of the points currently assigned to that cluster
    centroids = np.stack([
        X[labels==i].mean(axis=0) for i in range(K)
    ])
    return centroids
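One caveat worth flagging (my own aside, not something the original code handles): if a cluster ends up with no points assigned to it, the mean of an empty selection is nan and NumPy will emit a warning. A simple guard is to keep the previous centroid for an empty cluster, for example,
def update_centroids_safe(X: np.array, labels: np.array, K: int, old_centroids: np.array) -> np.array:
    # fall back to the previous centroid when a cluster is empty
    return np.stack([
        X[labels == i].mean(axis=0) if np.any(labels == i) else old_centroids[i]
        for i in range(K)
    ])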
We can use this to calculate one iteration of the k-means algorithm and plot the updated centroids,
centroids = update_centroids(data, labels, 3)
plt.figure()
plt.scatter(data[:,0], data[:,1])
for c in centroids:
plt.scatter(*c, color="red")
Now let's write the fit function, which uses all the functions we wrote above to iteratively fit a k-means model. The last thing we need to discuss is convergence, which tells us when the iterative method should terminate. The iteration terminates when either,
the maximum number of iterations max_iter has been reached, or
the distance between the centroids before and after an iteration is less than the tolerance tol
Note that the method is considered to have converged only when the second condition is met; the first simply terminates the iteration when the method has not converged in a reasonable timeframe. The fit function is below and returns the centroids of the clusters,
def fit(X: np.array, K: int, max_iters: int = 100, tol: float = 1e-10) -> np.array:
    centroids = init_centroids(X=X, K=K)
    labels = np.empty(X.shape[0])
    for _ in range(max_iters):
        prev_centroids = centroids
        # assign each point to the cluster of its nearest centroid
        update_labels(X=X, labels=labels, centroids=centroids)
        # update the centroids from the new assignments
        centroids = update_centroids(X=X, labels=labels, K=K)
        # converged if the centroids have barely moved
        if np.linalg.norm(prev_centroids - centroids) < tol:
            break
    return centroids
We can use this to fit our model and verify the results after it has converged,
centroids = fit(X=data, K=3)
plt.figure()
plt.scatter(data[:,0], data[:,1])
for c in centroids:
plt.scatter(*c, color="red")
Looks pretty good! Before moving on to the Scikit-Learn compatible version, one more note on initialization.
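As mentioned earlier, the algorithm is sensitive to its random initialization, so in practice one would run the fit several times and keep the run with the lowest cost. This is a minimal sketch of that idea (the restart loop and the cost computation here are my own additions, not part of the original code),
best_cost, best_centroids = np.inf, None
for restart in range(5):
    c = fit(X=data, K=3)
    # cost of this run: sum of squared distances to the nearest centroid
    dists = np.linalg.norm(data[:, None, :] - c[None, :, :], axis=2)
    cost = np.sum(dists.min(axis=1) ** 2)
    if cost < best_cost:
        best_cost, best_centroids = cost, c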
Now we want to create a Scikit-Learn compatible k-means clustering model so that we can use the library's built-in Pipeline and GridSearchCV. This relies on our ability to create a custom estimator for Scikit-learn.
In order to accomplish this, we create a KMeans clustering class for our model that extends the BaseEstimator class, and since this is a clustering algorithm we also extend the ClusterMixin class. This approach uses the concepts of inheritance and abstract base classes or interfaces, though not quite so formally.
One thing we change in this implementation is that the number of clusters becomes a class member K, which is passed to the constructor along with the maximum number of iterations max_iter, the tolerance tol used to determine whether the method has converged, and a random_state to seed the choice of datapoints used as the initial cluster centroids. The other changes to the functions written above are mostly either cosmetic or required to make them methods of the class. Specifically, we write the functions as private methods of the class, so they take the self parameter and get a "_" prefix, for all but the fit method, which remains public. Lastly, we change the centroid-related methods so that they no longer return the centroids, but rather update the object's centroids member.
The class is written below,
from __future__ import annotations  # must come before any other imports

import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
class KMeans(BaseEstimator, ClusterMixin):
def __init__(self,
K: int=2,
max_iter: int=10,
random_state: int = 42,
tol: float = 1e-6):
self.K = K
self.max_iter = max_iter
self.centroids = None
self.random_state = random_state
self.tol = tol
np.random.seed(random_state)
def _init_centroids(self, X: np.array) -> None:
N = X.shape[0]
self.centroids = X[np.random.choice(N,self.K,replace=False)]
    def _update_labels(self, X: np.array, labels: np.array) -> None:
        for i, point in enumerate(X):
            # distance from this point to every centroid (norm along the rows)
            dists = np.linalg.norm(point - self.centroids, axis=1)
            labels[i] = np.argmin(dists)
def _update_centroids(self, X: np.array, labels: np.array) -> None:
self.centroids = np.stack([
X[labels==i].mean(axis=0) for i in range(self.K)
])
def fit(self, X: np.array, y: np.array=None) -> KMeans:
self._init_centroids(X)
labels = np.empty(X.shape[0])
for _ in range(self.max_iter):
# label points belonging to clusters
prev_centroids = self.centroids
# update labels
self._update_labels(X, labels)
# update centroids
self._update_centroids(X, labels)
if np.linalg.norm(prev_centroids - self.centroids) < self.tol:
break
return self
def predict(self, X: np.array) -> np.array:
labels = np.empty(X.shape[0])
self._update_labels(X, labels)
return labels
    def score(self, X: np.array, y: np.array = None) -> float:
        if y is None:
            y = self.predict(X)
        # mean within-cluster sum of squared distances to the centroids (SSE per point)
        variance = np.sum([np.sum(np.linalg.norm(X[y==i] - self.centroids[i], axis=1) ** 2)
                           for i in range(self.K)]) / X.shape[0]
        return variance
In addition to the fit method, we also have the public predict method that is required for the BaseEstimator and "Mixin" classes. The predict function predicts which cluster datapoints belong to by calling the _update_labels method under the hood.
In order for our KMeans class to be compatible with Pipeline and GridSearchCV, we also need a score method to measure the performance of the model. For our purposes, the score function is just the sum of squared errors (SSE) averaged over the number of points.
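Concretely, with centroids $\boldsymbol \mu_j$ and clusters $S_j$ defined as before, the quantity computed by score is,
$$ \text{score}(\textbf{X}) \; = \; \frac{1}{n} \sum_{j=1}^{k} \sum_{\textbf{x}_i \in S_j} \Vert \textbf{x}_i - \boldsymbol \mu_j \Vert^{2} $$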
We can now instantiate a k-means object and fit it to the dataset.
kmeans = KMeans(3)
kmeans = kmeans.fit(data)
Note that the fit function returns the object itself, just as Scikit-learn estimators do!
We can see the centroids,
kmeans.centroids
plt.figure()
plt.scatter(data[:,0], data[:,1])
for c in kmeans.centroids:
plt.scatter(*c, color="red")
And predict the cluster for any new points; note that predict expects a 2D array with one row per point,
kmeans.predict(np.array([[10.3242, 5.321]]))
Lastly we can find the SSE of the model with the score method,
kmeans.score(data)
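As a sanity check (my own addition, not part of the original post), we could compare against scikit-learn's own k-means implementation; note that sklearn's inertia_ attribute is the total SSE, whereas our score divides by the number of points,
from sklearn.cluster import KMeans as SklearnKMeans

sk_kmeans = SklearnKMeans(n_clusters=3, random_state=42).fit(data)
sk_kmeans.cluster_centers_          # should land close to kmeans.centroids (up to ordering)
sk_kmeans.inertia_ / data.shape[0]  # comparable to kmeans.score(data)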
We can attempt to determine the appropriate number of clusters a posteriori by using the elbow method over various values of $k$.
The idea behind the elbow method is that as the number of clusters increases, the sum of the internal variance, or SSE, decreases rapidly because the samples within each cluster become more and more homogeneous. At a certain point (the elbow point) the clusters contain relatively homogeneous samples and the reduction in SSE becomes less significant as the number of clusters increases. This inflection point is thought to occur because we are no longer separating legitimate clusters, but rather arbitrarily splitting up existing clusters, so the drop in SSE is less meaningful.
Let's show how this works on a dataset where we know the number of clusters beforehand, to see whether the elbow method gives us a reasonable approximation. We'll use the dataset from above, which clearly has 3 well-separated clusters.
Note that clustering is sensitive to scaling, so we apply the StandardScaler class before the clustering model, which means we need to build a pipeline object!
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pipeline = make_pipeline(StandardScaler(), KMeans(K=k))
model = pipeline.fit(data)
We can loop over different numbers of clusters and collect the scores, which we can then use with the elbow method to determine the number of clusters to use.
K_max = 7
scores = [make_pipeline(StandardScaler(), KMeans(K=k)).fit(data).score(data)
for k in range(1,K_max)]
We can plot the number of clusters vs. the SSE to find the optimal number of clusters,
plt.plot(range(1, K_max), scores)
plt.xlabel("# of clusters")
plt.ylabel("SSE", rotation=90)
plt.title("Elbow Curve")
Since the 3 clusters are so well separated, the elbow is pretty well defined at 3.
Let's take another example where the clusters are not quite so well separated,
new_data = np.concatenate((
np.random.randn(N//3,2) + np.array([3,1]),
np.random.randn(N//3,2) + np.array([-3,0]),
np.random.randn(N//3,2) + np.array([12,5]),
np.random.randn(N//3,2) + np.array([8,8]))
)
plt.scatter(new_data[:,0], new_data[:,1])
This time we have 4 clusters, but two pairs are somewhat overlapping. We can retrain the k-means model for various values of $K$ and plot the elbow curve again.
K_max = 10
scores = [make_pipeline(StandardScaler(), KMeans(K=k)).fit(new_data).score(new_data)
for k in range(1,K_max)]
plt.plot(range(1,K_max), scores)
plt.xlabel("# of clustetrs")
plt.ylabel("SSE", rotation=90)
plt.title("Elbo Curve")
The optimal number of clusters could be 2, 3, or 4. Clearly the separating hyperplanes between clusters are not as well defined as they were in the prior case!
Lastly, let's talk about how to use GridSearchCV, though again it doesn't make much sense for unsupervised learning since there is no real metric we're trying to optimize. Rather, this is just to show how the custom estimator we created works with the rest of Scikit-learn's tooling.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler',StandardScaler()),
('model', KMeans())
])
Now we can define our grid of cluster sizes and then fit the data,
params = {"model__K": [1,2,3,4,5]}
grid = GridSearchCV(estimator=pipeline,
param_grid=params)
results = grid.fit(data)
And view the best fitting model,
results.best_estimator_
Again, the results don't make much sense: since the grid search looks for the configuration that maximizes the score, and our score is the SSE, it chooses 1 cluster. However, the main idea was to show how to make a Scikit-Learn compatible estimator!
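If you wanted the search to at least optimize in the right direction, one option (my own aside, not something the post does) is to hand GridSearchCV a scoring callable that negates the SSE, so that maximizing the score means minimizing the error,
def neg_sse(estimator, X, y=None):
    # GridSearchCV maximizes the score, so return the negative SSE
    return -estimator.score(X)

grid = GridSearchCV(estimator=pipeline, param_grid=params, scoring=neg_sse)
grid.fit(data)
Even then, the SSE keeps shrinking as the number of clusters grows, so the search would simply favor the largest value of K in the grid; this is exactly why the elbow method, rather than grid search, is the more natural tool for choosing the number of clusters.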
In this blog post we went over how to create a Scikit-learn compatible k-means clustering algorithm using NumPy, and how to use the elbow method to try to determine how many clusters are in our dataset.
I made extensive use of the following references for writing a k-means algorithm from scratch,
as well as Scikit-learn's own documentation for writing the custom estimator.
I hope you enjoyed this post!