How to Implement K-Means Clustering: A Step-by-Step Guide with Sklearn

Ever wondered how Netflix groups similar movies or how marketers segment their customers? The magic often happens through a powerful technique called k-means clustering. It’s one of the most popular unsupervised machine learning algorithms for partitioning data into distinct groups.

If you’re looking for a practical, hands-on guide on how to implement k-means clustering using sklearn, you’ve come to the right place. This tutorial will walk you through the entire process, from understanding the basics to writing the code yourself. No advanced math required—just a clear, step-by-step approach.

Let’s dive in and unlock the power of grouping data.

What Is K-Means Clustering?

In simple terms, k-means clustering aims to partition data points into a predefined number of clusters, ‘k’. Each data point belongs to the cluster with the nearest mean. Think of it as a way to find natural groupings in your data.

The algorithm works in four straightforward steps:

  1. Choose the number of clusters, k.
  2. Initialize k points (called centroids) at random.
  3. Assign each data point to the closest centroid, forming clusters.
  4. Recalculate each centroid as the mean of all points in its cluster.

Steps 3 and 4 repeat until the centroids no longer move significantly (see the sketch below).
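
To make these steps concrete, here is a minimal from-scratch sketch of that loop in NumPy (a library scikit-learn already depends on). The function name and structure are my own illustration, for intuition only; the KMeans class we use later handles initialization, empty clusters, and convergence far more carefully.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    # Step 2: initialize the centroids by picking k random data points
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (assumes no cluster ends up empty, which is fine for a toy sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer move significantly
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids

Calling kmeans_sketch(data, 4) on a two-dimensional array returns one cluster label per row plus the final centroids.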

Prerequisites: Let’s Get Set Up

To follow along, you’ll need Python installed along with three essential libraries: scikit-learn (sklearn), pandas, and matplotlib. You can install them using pip:

pip install scikit-learn pandas matplotlib

Step 1: Import the Necessary Libraries

First, we import all the tools we need. This is standard practice for any data science project.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # To create a sample dataset

Step 2: Prepare a Sample Dataset

For learning, it’s best to use a simple, generated dataset. Sklearn’s make_blobs function is perfect for this. It creates clusters of data points around specified centers.

# Generate a sample dataset
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Let's visualize the raw data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Raw Unclustered Data")
plt.show()

This code creates 300 random points forming 4 natural clusters. Visualizing it helps us see what we’re working with before we even start clustering.

Step 3: Choose the Right Number of Clusters (k)

This is the trickiest part. How do you know the right number of clusters? The Elbow Method is a common technique. We run the algorithm for a range of k values and plot the inertia (the sum of squared distances of points to their closest centroid).

We look for the “elbow” in the graph: the point where the curve bends because adding more clusters stops producing large drops in inertia.

# Find the optimal k using the Elbow Method
inertia = []
k_range = range(1, 10)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Graph
plt.plot(k_range, inertia)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('The Elbow Method')
plt.show()

The graph will likely show a sharp elbow at k=4, which matches our generated data.

Step 4: Implement K-Means with Sklearn

Now for the main event! Since we know k=4 is optimal, we’ll initialize the K-Means model and fit it to our data.

# Initialize the KMeans model with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)

# Fit the model to the data and predict the clusters
y_kmeans = kmeans.fit_predict(X)

The fit_predict() method both trains the algorithm on our data X and assigns each data point to a cluster. The results are stored in y_kmeans.
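
If it helps to see this, fit_predict() is documented as a convenience that is equivalent to calling fit() and then reading the labels_ attribute for the training data, so the following check should print True:

# fit_predict(X) above is equivalent to fit(X) followed by reading labels_
kmeans.fit(X)
print((kmeans.labels_ == y_kmeans).all())  # True: identical cluster assignments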

Step 5: Visualize the Results

The best way to understand the results is to see them. Let’s plot our data points, color-coding them by their assigned cluster. We’ll also plot the final centroids.

# Visualize the clusters and centroids
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=50, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=50, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=50, c='green', label='Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=50, c='cyan', label='Cluster 4')

# Plot the centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='black', marker='X', label='Centroids')
plt.legend()
plt.title('K-Means Clustering Results')
plt.show()

You should see a clean plot with four distinct, colored clusters and a black ‘X’ marking the center of each.

Taking It Further: Using Your Own Data

The process is identical for your own datasets! Instead of using make_blobs, you would:

  1. Load your data with pd.read_csv('your_data.csv').
  2. Perform any necessary data cleaning and preprocessing (e.g., handling missing values, scaling features).
  3. Follow the same steps from the Elbow Method onward (a rough sketch follows below).
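
As a rough sketch of that workflow—with a hypothetical file name and column names that you would replace with your own—it might look like this:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical file and columns -- swap in your own dataset
df = pd.read_csv('your_data.csv')
features = df[['annual_income', 'spending_score']].dropna()

# Scale the features so a large-valued column can't dominate the distances
X_scaled = StandardScaler().fit_transform(features)

# Suppose the Elbow Method suggested k=3 for this data
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
features = features.assign(cluster=kmeans.fit_predict(X_scaled))
print(features.head())

Scaling matters here because k-means relies on Euclidean distance, so an unscaled feature with a large range would otherwise dominate the clustering.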

Conclusion

And there you have it! You’ve just learned how to implement k-means clustering with sklearn. This powerful algorithm is a cornerstone of unsupervised learning, perfect for customer segmentation, image compression, and anomaly detection.

The key steps are always the same: prepare your data, find the optimal k, instantiate the model, and fit it. Remember, the quality of your results depends heavily on choosing the right k and having well-preprocessed data.

Now it’s your turn. Fire up a Python notebook, try this code, and experiment with it. Happy clustering!
