
K-means Clustering – Introduction

K-means clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into different clusters. This article explores the fundamentals and workings of k-means clustering, along with an implementation.

Table of Contents

  • What is K-means Clustering?
  • What is the objective of k-means clustering?
  • How k-means clustering works
  • Implementation of k-means clustering in Python

Unsupervised machine learning is the process of teaching a computer to operate on unlabeled, unclassified data without supervision. With no prior training on labeled examples, the machine's job is to organize unsorted data according to similarities, patterns, and differences.

K-means clustering assigns data points to one of K clusters based on their distance from the cluster centers. It starts by randomly placing the cluster centroids in the feature space. Each data point is then assigned to the cluster whose centroid is nearest. After all points are assigned, new cluster centroids are computed. This process runs iteratively until the clusters stabilize. Throughout the analysis we assume that the number of clusters is given in advance and that every point must be placed in one of the groups.

In some cases, K is not clearly defined, and we have to think about the optimal number of clusters. K-means performs best when the data is well separated; when data points overlap, this clustering is not suitable. K-means is fast compared to other clustering techniques and provides strong coupling between the data points, but it does not give clear information about the quality of the clusters. Different initial assignments of cluster centroids may lead to different clusters, the algorithm is sensitive to noise, and it may get stuck in local minima.

The goal of clustering is to divide the population or set of data points into a number of groups so that the data points within each group are more comparable to one another and different from the data points within the other groups. It is essentially a grouping of things based on how similar and different they are to one another. 

We are given a data set of items, with certain features, and values for these features (like a vector). The task is to categorize those items into groups. To achieve this, we will use the K-means algorithm, an unsupervised learning algorithm. ‘K’ in the name of the algorithm represents the number of groups/clusters we want to classify our items into.

(It will help if you think of items as points in an n-dimensional space). The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean distance as a measurement.

The algorithm works as follows:  

  • First, we randomly initialize k points, called means or cluster centroids.
  • We categorize each item to its closest mean, and we update the mean’s coordinates, which are the averages of the items categorized in that cluster so far.
  • We repeat the process for a given number of iterations and at the end, we have our clusters.

The “points” mentioned above are called means because they are the mean values of the items categorized in them. To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature x the items have values in [0,3], we will initialize the means with values for x in [0,3]).

The above algorithm in pseudocode is as follows:  
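A plain-language sketch of that loop:

Initialize k means with random values
For a given number of iterations:
    For each item in the dataset:
        Find the mean closest to the item (by Euclidean distance)
        Assign the item to that mean
        Update the mean: shift it to the average of the items assigned to it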

Import the necessary Libraries

We are importing NumPy for numerical computations, Matplotlib to plot the graphs, and make_blobs from sklearn.datasets to generate a sample dataset.
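A minimal import block matching that description:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs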

Create the custom dataset with make_blobs and plot it
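A sketch of this step; the sample count, number of centers, and seed are assumptions:

# generate a toy 2-D dataset with three blobs
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

plt.figure()
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()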

[Figure: scatter plot of the generated clustering dataset]

Initialize the random centroids

The code initializes three clusters for K-means clustering: it sets a random seed, generates random cluster centers within a specified range, and creates an empty list of points for each cluster.
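A sketch matching that description; the seed and the value range for the random centers are assumptions:

k = 3
np.random.seed(23)

clusters = {}
for idx in range(k):
    # random center with each coordinate drawn from [-2, 2]
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    clusters[idx] = {'center': center, 'points': []}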

   

Plot the randomly initialized centers with the data points

[Figure: data points with randomly initialized centers]

The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks the initial cluster centers (red stars) generated for K-means clustering.
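A sketch of the plotting code:

plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()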

Define Euclidean distance
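A small helper, assuming the NumPy import from above:

def distance(p1, p2):
    # straight-line (L2) distance between two points
    return np.sqrt(np.sum((p1 - p2) ** 2))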

Create the function to Assign and Update the cluster center

The E-step assigns data points to the nearest cluster center, and the M-step updates cluster centers based on the mean of assigned points in K-means clustering.
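A sketch of both functions, reusing the clusters dictionary and the distance helper defined above:

def assign_clusters(X, clusters):
    # E-step: attach every point to its nearest cluster center
    for idx in range(X.shape[0]):
        dists = [distance(X[idx], clusters[i]['center']) for i in range(k)]
        clusters[int(np.argmin(dists))]['points'].append(X[idx])
    return clusters

def update_clusters(X, clusters):
    # M-step: move each center to the mean of its assigned points
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            clusters[i]['center'] = points.mean(axis=0)
        clusters[i]['points'] = []
    return clusters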

       

Create the function to predict the cluster for the data points
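A sketch:

def pred_cluster(X, clusters):
    # label each point with the index of its nearest center
    pred = []
    for i in range(X.shape[0]):
        dists = [distance(X[i], clusters[j]['center']) for j in range(k)]
        pred.append(int(np.argmin(dists)))
    return pred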

Assign, Update, and predict the cluster center

Plot the data points with their predicted cluster center.
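Putting the pieces together; a single E-M pass is shown here, and a real run would repeat the two steps until the centers stop moving:

clusters = assign_clusters(X, clusters)
clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)

plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()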

[Figure: data points colored by predicted cluster, with updated centers]

The plot shows data points colored by their predicted clusters. The red markers represent the updated cluster centers after the E-M steps in the K-means clustering algorithm.

Import the necessary libraries

Load the Dataset
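The petal/sepal plots further down suggest the Iris data; a sketch under that assumption:

from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data  # sepal length, sepal width, petal length, petal width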

Elbow Method 

Finding the ideal number of groups to divide the data into is a basic stage in any unsupervised algorithm. One of the most common techniques for figuring out this ideal value of k is the elbow approach.

Plot the Elbow graph to find the optimum number of clusters
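A sketch of the elbow loop; the K range and seed are assumptions:

sse = []  # sum of squared errors (inertia) for each k
for num_clusters in range(1, 11):
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=2)
    km.fit(X)
    sse.append(km.inertia_)

plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()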

[Figure: Elbow Method plot]

From the above graph, we can observe elbow-like bends at both k=2 and k=3, so we consider K=3.

Build the KMeans clustering model

Find the cluster centers, predict the cluster for each data point, and plot the cluster centers together with the data points.
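A sketch of the final model and the two subplots described below; the feature indices follow the Iris column order assumed above:

km = KMeans(n_clusters=3, n_init=10, random_state=2)
labels = km.fit_predict(X)
centers = km.cluster_centers_

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 2], X[:, 3], c=labels)                        # petal length vs. petal width
ax1.scatter(centers[:, 2], centers[:, 3], marker='*', c='red')
ax2.scatter(X[:, 0], X[:, 1], c=labels)                        # sepal length vs. sepal width
ax2.scatter(centers[:, 0], centers[:, 1], marker='*', c='red')
plt.show()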

 

[Figure: K-means clusters of the Iris data with cluster centers]

The subplot on the left displays petal length vs. petal width, with data points colored by cluster and red markers indicating the K-means cluster centers. The subplot on the right shows sepal length vs. sepal width in the same way.

In conclusion, K-means clustering is a powerful unsupervised machine learning algorithm for grouping unlabeled datasets. Its objective is to divide data into clusters, making similar data points part of the same group. The algorithm initializes cluster centroids and iteratively assigns data points to the nearest centroid, updating centroids based on the mean of points in each cluster.

Frequently Asked Questions (FAQs)

1. What is k-means clustering for data analysis?

K-means is a partitioning method that divides a dataset into ‘k’ distinct, non-overlapping subsets (clusters) based on similarity, aiming to minimize the variance within each cluster.

2. What is an example of k-means in real life?

Customer segmentation in marketing, where k-means groups customers based on purchasing behavior, allowing businesses to tailor marketing strategies for different segments.

3. What type of data does k-means clustering work with?

K-means works well with numerical data, where the concept of distance between data points is meaningful. It’s commonly applied to continuous variables.

4. Is K-means used for prediction?

K-means is primarily used for clustering and grouping similar data points. It does not predict labels for new data; it assigns them to existing clusters based on similarity.

5. What is the objective of k-means clustering?

The objective is to partition data into ‘k’ clusters, minimizing the intra-cluster variance. It seeks to form groups where data points within each cluster are more similar to each other than to those in other clusters.


Machine Learning – K-means

K-means is an unsupervised learning method for clustering data points. The algorithm iteratively divides data points into K clusters by minimizing the variance in each cluster.

Here, we will show you how to estimate the best value for K using the elbow method, then use K-means clustering to group the data points into clusters.

How does it work?

First, each data point is randomly assigned to one of the K clusters. Then, we compute the centroid (functionally the center) of each cluster, and reassign each data point to the cluster with the closest centroid. We repeat this process until the cluster assignments for each data point are no longer changing.

K-means clustering requires us to select K, the number of clusters we want to group the data into. The elbow method lets us graph the inertia (a distance-based metric) and visualize the point at which it starts decreasing linearly. This point is referred to as the "elbow" and is a good estimate for the best value for K based on our data.

Start by visualizing some data points:

[Figure: scatter plot of the data points]


Now we utilize the elbow method to visualize the inertia for different values of K:

[Figure: elbow plot of inertia for K = 1 to 10]

The elbow method shows that 2 is a good value for K, so we retrain and visualize the result:

[Figure: data points colored by their two clusters]

Example Explained

Import the modules you need.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

You can learn about the Matplotlib module in our Matplotlib Tutorial.

scikit-learn is a popular library for machine learning.

Create arrays that resemble two variables in a dataset. Note that while we only use two variables here, this method will work with any number of variables:

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

Turn the data into a set of points:

data = list(zip(x, y))
print(data)

[(4, 21), (5, 19), (10, 24), (4, 17), (3, 16), (11, 25), (14, 24), (6, 22), (10, 21), (12, 21)]

In order to find the best value for K, we need to run K-means across our data for a range of possible values. We only have 10 data points, so the maximum number of clusters is 10. So for each value K in range(1,11), we train a K-means model and plot the inertia at that number of clusters:

inertias = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

We can see that the "elbow" on the graph above (where the inertia becomes more linear) is at K=2. We can then fit our K-means algorithm one more time and plot the different clusters assigned to the data:

kmeans = KMeans(n_clusters=2)
kmeans.fit(data)

plt.scatter(x, y, c=kmeans.labels_)
plt.show()


K-means clustering algorithm from scratch

  • April 26, 2020
  • Venmani A D

K-Means Clustering is an unsupervised learning algorithm that aims to group the observations in a given dataset into clusters. The number of clusters is provided as an input. It forms the clusters by minimizing the sum of the distance of points from their respective cluster centroids.

  • Basic Overview of Clustering
  • Introduction to K-Means Clustering
  • Steps Involved
  • Maths Behind K-Means Clustering
  • Implementing K-Means from Scratch
  • Elbow Method to find the optimal number of clusters
  • Grouping mall customers using K-Means

Basic Overview of Clustering

Clustering is a type of unsupervised learning which is used to split unlabeled data into different groups.

Now, what does unlabeled data mean?

Unlabeled data means we don’t have a dependent variable (response variable) for the algorithm to compare as the ground truth. Clustering is generally used in Data Analysis to get to know about the different groups that may exist in our dataset.

We try to split the dataset into different groups, such that the data points in the same group have more similar characteristics to one another than to data points in different groups.

Now how to find whether the points are similar?

Use a good distance metric to compute the distance between a point and every other point; points separated by a smaller distance are more similar. Euclidean distance is the most common metric.


Clustering algorithms are generally used in network traffic classification and in customer and market segmentation. They can be used on any tabular dataset where you want to know which rows are similar to each other and to form meaningful groups out of the dataset. First, I am going to install the libraries that I will be using.
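A typical install line; the exact package list is an assumption based on what is used below:

pip install numpy pandas matplotlib scikit-learn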

Let’s look into one of the most common clustering algorithms, K-Means, in detail.

K-Means follows an iterative process in which it tries to minimize the distance of the data points from the centroid points. It’s used to group the data points into k number of clusters based on their similarity.

Euclidean distance is used to calculate the similarity.

Let’s see a simple example of how K-Means clustering can be used to segregate a dataset. In this example, I am going to use the make_blobs function to generate isotropic Gaussian blobs, which can be used for clustering.

I set the number of samples to be generated to 100 and the number of centers to 5.
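A sketch of that call; the seed is an assumption:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=100, centers=5, random_state=42)
plt.scatter(X[:, 0], X[:, 1])
plt.show()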


If you look at the value of y, you can see that the points are already labeled by cluster, but I won’t be using this to compute the clusters; I will use it only for evaluation purposes. To use K-Means you need to import KMeans from the sklearn.cluster module.

When using KMeans, you need to specify the number of clusters as an argument. In this case, as we can see from the graph, there are 5 clusters, so I will pass 5. In general, though, you should use the Elbow Method to find the optimal number of clusters; I will discuss this method in detail in an upcoming section.

After passing the arguments, I have fitted the model and predicted the results. Now let’s visualize our predictions in a scatter plot.
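A sketch of the fit-predict-plot sequence:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=10)
y_pred = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='*', c='red')
plt.show()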


There are 3 important steps in K-Means Clustering.

  • 1. Initialize centroids – This is done by randomly choosing K points; these can be points from the dataset or arbitrary random points.
  • 2. Assign clusters – Each point in the dataset is assigned to a cluster by calculating its distance from every centroid and choosing the centroid with the minimum distance.
  • 3. Re-calculate the centroids – Each centroid is updated by calculating the centroid of the cluster we have created.


Maths Behind K-Means Working

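The quantity K-means minimizes is the within-cluster sum of squares; with centroids $\mu_1, \ldots, \mu_K$ and clusters $S_1, \ldots, S_K$:

$$J = \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$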

One thing that you need to keep in mind is that, as we are using Euclidean distance as the main parameter, it is better to standardize your dataset if the scales of x and y differ greatly (say, 10 versus 100).

Also, it’s recommended to try several widely spread initial centroids and check whether we get the same output; there is a possibility of getting stuck at a local minimum rather than the global minimum.

Let’s use the same make_blobs example we used at the beginning. We will try to do the clustering without using the KMeans library.

I have set the K value to be 5 as before and also initialized the centroids randomly at first using the random.randint() function.

Then I am going to find the distance between the points. Euclidean distance is most commonly used for finding the similarity.

I have also stored all the minimum values in a variable minimum. Then I regrouped the dataset based on the minimum values we got and calculated the centroid values.

Then we need to repeat the above 2 steps over and over again until we reach convergence.
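A compact NumPy sketch of those steps, reusing X from the make_blobs example; it uses NumPy's Generator in place of random.randint, the iteration cap and stopping rule are assumptions, and it assumes no cluster ever goes empty:

import numpy as np

K = 5
rng = np.random.default_rng(0)
centroids = X[rng.integers(0, X.shape[0], K)]  # start from random data points

for _ in range(100):
    # distance of every point to every centroid, shape (n_points, K)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    minimum = dists.argmin(axis=1)             # index of the closest centroid per point
    new_centroids = np.array([X[minimum == j].mean(axis=0) for j in range(K)])
    if np.allclose(new_centroids, centroids):  # converged
        break
    centroids = new_centroids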

Let’s plot it.

[Figure: clusters computed from scratch]

Elbow method to find the optimal number of clusters

One of the important steps in K-Means Clustering is to determine the optimal number of clusters to give as an input. This can be done by iterating over a range of n values and then finding the optimal one.

For finding this optimal n, the Elbow Method is used.

You have to plot the loss values against the n values and find the point where the graph flattens; this point is considered the optimal n value.

Let’s look at the example we saw at first to see the working of the elbow method. I am going to iterate through a series of n values ranging from 1 to 20 and then plot their loss values.
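A sketch of that loop:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

losses = []
for n in range(1, 21):
    model = KMeans(n_clusters=n, n_init=10).fit(X)
    losses.append(model.inertia_)

plt.plot(range(1, 21), losses, marker='o')
plt.xlabel('Number of clusters n')
plt.ylabel('Loss (inertia)')
plt.show()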

[Figure: elbow plot of loss vs. number of clusters]

I am going to be using the Mall_Customers Dataset. You can download the dataset from the given link


Let’s try to find if there are certain clusters between the customers based on their Age and Spending Score.
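A sketch; the file and column names follow the common Kaggle version of this dataset and are assumptions, and k would come from the elbow plot:

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

df = pd.read_csv('Mall_Customers.csv')
X_cust = df[['Age', 'Spending Score (1-100)']].values

km = KMeans(n_clusters=4, n_init=10).fit(X_cust)
plt.scatter(X_cust[:, 0], X_cust[:, 1], c=km.labels_)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.show()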

[Figure: customer clusters by Age and Spending Score]



K means clustering Python implementation example

Unsupervised Learning uses Machine Learning algorithms to analyze and cluster unlabeled datasets and discover hidden patterns or data groupings without human intervention. There are two types of Unsupervised Learning algorithms: Clustering and Association Rules. The K-means clustering algorithm belongs to the clustering methods of Data Science.

This article contains the K means clustering Python implementation and visualization examples for data classification.


Clustering is one of the most common exploratory data analysis techniques for understanding the structure of data. Clustering algorithms allow you to discover homogeneous subgroups within the dataset, so that the data points in each cluster are comparable and similar based on a similarity measure such as Euclidean or correlation-based distance.


K-means clustering algorithm overview

K-means is an Unsupervised Machine Learning algorithm that splits a dataset into K non-overlapping subgroups (clusters). It allows us to split the data into different groups or categories. For example, if K=2 there will be two clusters, if K=3 there will be three clusters, etc. The K-means algorithm is a convenient way to discover the categories of groups in an unlabeled dataset.

The K-means algorithm (also known as a centroid-based algorithm) aims to minimize the sum of distances between each data point and the centroid of its cluster, associating each group of objects (a cluster) with its centroid.

The K-means algorithm does two things:

  • Iteratively determines the best positions for the centroids (the center points of the K clusters)
  • Assigns each data point to its closest K-center. The data points nearest a given K-center form a cluster.

As a result, each cluster will have data points with similar features:


The K-means algorithm description:

  • Define the number of clusters based on the provided K value
  • Select K random points as centroids (they can differ from the points of the input dataset)
  • Form the K clusters by assigning each data point to its closest centroid
  • Calculate the variance and define a new centroid for each cluster
  • Repeat from the third step, reassigning each data point to the new closest centroid, until the algorithm converges on the best possible solution

Let’s visualize the steps involved in training the K-means clustering algorithm.

Suppose that we have plotted the data and we want to apply the K-means algorithm to find the clusters/hidden patterns. Let’s, for example, split the dataset into two clusters (K=2). The algorithm will choose K random points as centroids to form the clusters. These points can be either points from the dataset or any other points. Let’s say the algorithm selects the following two centroids:


Next, the algorithm will assign each data point to its closest centroid, computing the distance between the data points and the centroids to define the boundary between the two clusters:


Next, the K-means clustering algorithm will choose new centroids based on each cluster’s calculated center of gravity, reassign each data point to its new centroid, and find the new boundary between the clusters:


The exact process will repeat until the algorithm finds the optimum clusters:


Finally, the algorithm will split all data points from the dataset into the best possible clusters or groups:


Elbow method: Getting the value of K

The K-means clustering algorithm’s performance is highly dependent on the number of clusters, and choosing the correct value for K is very important but can be challenging. You can use the Elbow method to calculate the ideal number of clusters for the dataset.

The Elbow method uses the Within Cluster Sum of Squares (WCSS) value to define total variations. Here’s a simple formula for two clusters:

$$\mathrm{WCSS} = \sum_{P_i \in \mathrm{Cluster}_1} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in \mathrm{Cluster}_2} \mathrm{distance}(P_i, C_2)^2$$

Each term $\sum_{P_i \in \mathrm{Cluster}_k} \mathrm{distance}(P_i, C_k)^2$ represents the sum of the squared distances (Euclidean or Manhattan) between each data point and its cluster’s centroid.

The following steps calculate the optimal value for K:

  • For various K values, use K-means clustering to split the dataset
  • Calculate the WCSS value for each K
  • Plot a curve of the calculated WCSS values against the number of clusters K
  • The sharp point of the bend (the point of the plot that looks like an arm’s elbow) is the best value for K

[Figure: elbow graph of WCSS vs. K]

One of the limitations of the Elbow method is that it doesn’t always work well, primarily when the data points are hard to split into well-separated clusters. We’ll show this case in the next section.

When the K-means algorithm produces incorrect results

Now that we have described the concepts of the K-means and Elbow methods, let’s try to apply them to a real dataset and use Python to solve the classification problem.

We’ll use the Facebook Live Sellers in Thailand dataset, which you can download here, along with several Python modules for the implementation and visualization.

You can install them in your AWS Sagemaker Jupyter Notebook by running the following commands in its cell:
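The package list here is an assumption based on what is used below:

%pip install pandas numpy matplotlib seaborn scikit-learn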

Exploring and cleaning the dataset

We will use the pandas module and its pd.read_csv() method to import our input data from the dataset:

Let’s get the shape of the dataset:
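A sketch covering the import and the shape check; the CSV file name is an assumption:

import pandas as pd

df = pd.read_csv('Live.csv')  # Facebook Live Sellers in Thailand dataset
print(df.shape)               # (7050, 16)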


The output shows 7050 records in the dataset (each record contains 16 attributes).

The dataset’s published description, however, says it contains 7051 records with 12 attributes per record. Let’s dive deeper into the data.

Let’s preview that dataset by using the head() method:
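For the preview:

df.head()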


Next, we can check for missing values in the dataset. Pandas provides a built-in method, isnull(), which will help us identify the missing values:
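A sketch, counting nulls per column:

df.isnull().sum()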


The output shows that the last four columns contain 7050 null values each. These columns are useless, so let’s delete them from the dataset:
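In the UCI copy of this dataset the empty trailing columns are named Column1 through Column4; that naming is an assumption here:

df.drop(['Column1', 'Column2', 'Column3', 'Column4'], axis=1, inplace=True)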


We used the inplace parameter to modify the original dataset in place instead of making a copy.

Let us now use the info() method to get more information about the types of data each attribute has.
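A sketch:

df.info()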


The output shows three columns with string values ( object in the Dtype column) and nine numerical columns ( int64 in the Dtype column) in the dataset.

Now let us explore each column with string values one by one.

First, let’s take a look at the status_id variable:
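A sketch:

print(df['status_id'].nunique())  # 6997 unique values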


The output from the above code shows that there are 6997 unique labels in the status_id column. The total number of instances in the dataset is 7050. So, it might be a unique identifier for each record in the dataset, which we can’t use for clustering. So we need to delete it from our dataset too:
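Dropping it:

df.drop('status_id', axis=1, inplace=True)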

Next, let’s take a look at the status_published column:
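A sketch:

print(df['status_published'].nunique())  # 6913 unique values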


The output shows 6913 unique values out of 7050 records in this column. We can’t use this column for clustering either. Let’s drop it:
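df.drop('status_published', axis=1, inplace=True)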

Let us now explore the last string column ( status_type ):

[Output: the four unique labels of status_type]

The output shows that there are only four unique labels in the  status_type column, so we can keep it.

Let’s again use the info() method to see the details about the dataset:
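The same call as before:

df.info()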


The output shows that there is now only one column with non-numeric values.

We can use the seaborn module and the pairplot graph to visualize data correlation in the dataset:
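A sketch; pairplot draws the numeric columns against each other:

import seaborn as sns

sns.pairplot(df)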

[Figure: seaborn pairplot of the dataset]

Since the K-means algorithm requires numeric data as input, let’s replace the status_type column values with numeric values:
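One way to do the replacement, using a LabelEncoder (the encoding scheme is an assumption):

from sklearn.preprocessing import LabelEncoder

df['status_type'] = LabelEncoder().fit_transform(df['status_type'])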

If we print the dataset, we will see numeric values under the status_type variable:

[Output: df.head() showing numeric status_type values]

Note: the values of the status_type column have been changed to discrete numeric values.

Splitting the dataset into two clusters

Now, let’s find relationships between items in the dataset and how many clusters the dataset should have, using sklearn’s KMeans class with K=2.
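A sketch; the min-max feature scaling is an assumption, added so that the large count columns do not dominate the distances:

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(df)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.cluster_centers_)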


The K-Means algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as inertia.

Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares across one cluster.

The K-means algorithm chooses cluster centroids that minimize the inertia value, so one way to evaluate the K-means algorithm is to check whether the inertia is small or large. The smaller the inertia value, the better the algorithm performs.

So, let’s print the inertia value and check the algorithm’s performance for 2 clusters:
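A one-liner:

print(kmeans.inertia_)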


We get a very high value for the inertia, which means our model is not performing well.

Let’s check how many data points have been classified correctly.
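A naive check that matches cluster ids directly against the encoded status_type labels; this direct alignment is itself an assumption:

labels = kmeans.labels_
correct = (df['status_type'].values == labels).sum()
print(f'{correct} out of {len(df)} samples were correctly labeled')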


The output proves that the algorithm did not work well, and we need to change the number of clusters to get better results.

Splitting the dataset into three clusters

For our dataset, the K-means algorithm did not work well for two clusters (K=2). Let’s try to increase the value of K to K=3.
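Refitting with three clusters:

kmeans3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans3.inertia_)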


The inertia value is lower than in the previous attempt, which means the model performed better than last time, when the value of K was 2.

Let’s check the quality of the classification:
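The same naive check as before, now for the three-cluster model:

correct = (df['status_type'].values == kmeans3.labels_).sum()
print(f'{correct} out of {len(df)} samples were correctly labeled')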


This time our model performed better, but still not well enough. We need to change the value of K to get better results.

Using the Elbow algorithm to get a better K value

Instead of iterating over different values for K manually, we can plot the Elbow algorithm results and get the optimal value for K from the graph.
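A sketch of the elbow loop:

import matplotlib.pyplot as plt

wcss = []
for k in range(1, 11):
    wcss.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()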

[Figure: elbow plot for the Facebook Live dataset]

As you can see, the graph shows that we can use K=3 for the K-Means algorithm, but we failed to get the correct results because we’re dealing with categorical data.

Visualizing dataset clusters

Let’s look at the centroids and data distribution for K=3.

[Figure: cluster visualization for K=3]

The chart shows that we’re dealing with a categorical data distribution based on the status_type column values, and the K-means clustering algorithm fails to perform well because the values of the status_type column do not correlate with the other columns’ values.

K Means clustering Python algorithm for data classification

Let’s take a different dataset that contains information about the mall’s customers. You can get access to the dataset here .

In this example, we’ll be interested in segmenting customers’ info (separating customers into groups based on their sales, needs, etc.). This segmentation might improve relationships between the company and its customers.

Now, we’ll implement K means clustering in Python. Let’s quickly walk through the dataset:
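A sketch; the file name is an assumption:

mall = pd.read_csv('Mall_Customers.csv')
mall.head()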


For simplicity, let’s take two attributes and build a scatter plot to confirm that our data is well dispersed.
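Using annual income and spending score as the two attributes; the column names follow the common Kaggle version of the dataset and are assumptions:

plt.scatter(mall['Annual Income (k$)'], mall['Spending Score (1-100)'])
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()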


Let’s use the K-Means algorithm to split customers into three (K=3) clusters.
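A sketch:

X_mall = mall[['Annual Income (k$)', 'Spending Score (1-100)']].values

km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_mall)
plt.scatter(X_mall[:, 0], X_mall[:, 1], c=km3.labels_)
plt.show()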


Looks good, but you will get a different result if you start playing with the random_state variable:

K means clustering Python algorithm - Evaluating results

As you can see from the graphs, we get different results when we rerun the K-means algorithm with shuffled data. So we need a way to evaluate which clustering of the dataset is better.

Evaluating the K-Mean algorithm using inertia

The section above showed that the K-Means algorithm produced different results from the same dataset with shuffled data. So, how could we know which graph is more efficient? We need to use inertia . The lower the value of inertia, the better the algorithm performed.

Let’s look at which of the graphs has the lowest inertia value.
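One way to compare runs, using a single initialization per seed so the differences show (the seeds are arbitrary):

for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X_mall)
    print(seed, km.inertia_)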


The output shows that the last graph is more efficient as it has the lowest inertia score among the three.

Finding the optimal number of clusters

Let’s use the Elbow method to find the optimal number of clusters for the given dataset.

[Figure: elbow plot for the mall customers dataset]

Note: the K-means algorithm performed well when the number of clusters was five.

Splitting the dataset into five clusters

Let’s plot the classification of our dataset into five clusters on a scatter plot.
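A sketch:

km5 = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_mall)
plt.scatter(X_mall[:, 0], X_mall[:, 1], c=km5.labels_)
plt.scatter(km5.cluster_centers_[:, 0], km5.cluster_centers_[:, 1],
            marker='*', c='red')
plt.show()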



K-means clustering algorithm is a type of Unsupervised Learning used to split unlabeled data into several categories. The Elbow method allows you to find your dataset’s optimal number of clusters. This article covered the implementation and visualization of the K-means clustering algorithm and Elbow method using Python.


Implementing K-means Clustering from Scratch - in Python

K-means Clustering

The K-means algorithm is one of the simplest and most popular unsupervised machine learning algorithms. It solves the well-known clustering problem with no pre-determined labels, meaning that we don’t have any target variable, as in the case of supervised learning. It is often referred to as Lloyd’s algorithm.

K-means simply partitions the given dataset into various clusters (groups).

K refers to the total number of clusters to be defined in the entire dataset. A centroid is chosen for each cluster and is used to calculate the distance to a given data point. The distance essentially represents how similar a data point’s features are to a cluster.

You’ll define a target number K, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster. These centroids should be placed in a cunning way, because different locations cause different results. So the better choice is to place them as far away from each other as possible.

In other words, the K-means algorithm identifies K centroids and then allocates every data point to the nearest cluster, while keeping the clusters as compact as possible. The ‘means’ in K-means refers to averaging the data; that is, finding the centroid.

Every data point is allocated to each of the clusters through reducing the in-cluster sum of squares. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.

In K-means, each cluster is described by a single mean, or centroid (hard clustering), so as not to confuse this model with an actual probabilistic model. There is no underlying probability model in K-means. The goal is to group data into K clusters. K-means (and some other clustering methods) have hard boundaries, meaning a data point either belongs to a cluster or it does not. On the other hand, clustering methods such as Gaussian Mixture Models (GMM) have soft boundaries (soft clustering), where data points can belong to multiple clusters at the same time but with different degrees of belief; e.g., a data point can have a $60\%$ chance of belonging to cluster $1$ and a $40\%$ chance of belonging to cluster $2$. Additionally, in probabilistic clustering, clusters can overlap (K-means doesn’t allow this).


An important observation for K-means is that the cluster models must be circular in 2D (or spherical in 3D or higher, i.i.d. Gaussian). In other words, K-means requires that each blob be a fixed size and completely symmetrical. K-means has no built-in way of accounting for oblong or elliptical clusters. When clusters are non-circular, trying to fit circular clusters would be a poor fit. This results in a mixing of cluster assignments where the resulting circles overlap.


Unfortunately, K-means will not work for non-spherical clusters.

These two disadvantages of K-means—its lack of flexibility in cluster shape and lack of probabilistic cluster assignment—mean that for many datasets (especially low-dimensional datasets) it may not perform as well as you might hope. K-means is also very sensitive to outliers and noise in the dataset.

K-means is a widely used method in cluster analysis. One might easily think that this method does NOT require ANY assumptions, i.e., give me a data set and a pre-specified number of clusters, K, and I just apply this algorithm, which minimizes the total within-cluster square error (intra-cluster variance). See the “Constraints of the algorithm” section for more details!

When to use?

This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:

  • Image Segmentation
  • Clustering Gene Segmentation Data
  • News Article Clustering
  • Clustering Languages
  • Species Clustering
  • Anomaly Detection

The Κ-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters Κ and the data set. The data set is a collection of features for each data point.

The algorithm starts with initial estimates for the Κ centroids, which can either be randomly generated or randomly selected from the data set. Random initialization is not an efficient way to start, as it sometimes leads to an increased number of required clustering iterations to reach convergence, a greater overall runtime, and a less efficient algorithm overall. So there are many techniques to solve this problem, like K-means++.

We randomly pick K cluster centers (centroids). Let’s assume these are $c_1, c_2, \ldots, c_K$, and we can say that

$$C = \{c_1, c_2, \ldots, c_K\}$$

where $C$ is the set of all centroids.

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if $c_{i}$ is a centroid in set $C$, then each data point $x$ is assigned to a cluster based on

$$\underset{c_i \in C}{\arg\min} \; \mathrm{dist}(c_i, x)^2$$

where $\mathrm{dist}(\cdot)$ is the standard (L2) Euclidean distance. Let the set of data point assignments for the $i$th cluster centroid be $S_{i}$. Note that the distance function in the cluster assignment step can be chosen specifically for your problem, and is arbitrary.

In this step, the centroids are recomputed by taking the mean of all data points assigned to each centroid’s cluster:

$$c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$$

where $S_{i}$ is the set of all points assigned to the $i$th cluster.

The algorithm iterates between steps one and two until a stopping criteria is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).

The best number of clusters K, leading to the greatest separation (distance), is not known a priori and must be computed from the data. The objective of K-Means clustering is to minimize the total intra-cluster variance, or, the squared error function:

$$J = \sum_{j=1}^{K} \sum_{x_i \in S_j} \lVert x_i - c_j \rVert^2$$

NOTE : Unfortunately, although the algorithm is guaranteed to converge, it may not converge to the right solution (i.e., it may converge to a local optimum, not necessarily the best possible outcome). This highly depends on the centroid initialization. As a result, the computation is often done several times, with different initializations of the centroids. One method to help address this issue is the K-means++ initialization scheme, which has been implemented in scikit-learn (use the init='k-means++' parameter). This initializes the centroids to be (generally) distant from each other, leading to probably better results than random initialization. One idea for initializing K-means is to use a farthest-first traversal on the data set, to pick K points that are far away from each other. However, this is too sensitive to outliers. But, K-means++ procedure picks the K centers one at a time, but instead of always choosing the point farthest from those picked so far, choose each point at random, with probability proportional to its squared distance from the centers chosen already.

NOTE : The computational complexity of the algorithm is generally linear with regards to the number of instances, the number of clusters and the number of dimensions. However, this is only true when the data has a clustering structure. If it does not, then in the worst case scenario the complexity can increase exponentially with the number of instances. In practice, however, this rarely happens, and K-means is generally one of the fastest clustering algorithms.

Choosing the Value of K

Determining the right number of clusters in a data set is important, not only because some clustering algorithms like k-means require such a parameter, but also because the appropriate number of clusters controls the proper granularity of cluster analysis. Determining the number of clusters is far from easy, often because the right number is ambiguous. The interpretation of the number of clusters often depends on the shape and scale of the distribution in a data set, as well as the clustering resolution required by the user. There are many possible ways to estimate the number of clusters. Here, we briefly introduce a few simple yet popular and effective methods.

We often know the value of K; in that case, we simply use it. In general, there is no method for determining the exact value of K.

A simple rule-of-thumb method is to set the number of clusters to about $\sqrt{n/2}$ for a data set of $n$ points; in expectation, each cluster then has $\sqrt{2n}$ points. Another approach is the Elbow Method: we run the algorithm for different values of K (say K = 1 to 10) and plot the K values against the WCSSE (Within-Cluster Sum of Squared Errors). WCSS is also called “inertia”. Then, select the value of K at which the sum of squared distances drops suddenly, i.e., the elbow point, as shown in the figure.

[Figure: elbow point on the WCSS vs. K plot]

HOWEVER, it is important to note that inertia heavily relies on the assumption that the clusters are convex (of spherical shape).

A number of other techniques exist for validating K, including cross-validation, information criteria, the information theoretic jump method, the silhouette method (we want to have high silhouette coefficient for the number of clusters we want to use), and the G-means algorithm.
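As a quick illustration of the silhouette method with scikit-learn (a sketch, assuming X is the feature matrix being clustered):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))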


In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K. Some researchers also use Hierarchical clustering first to create dendrograms and identify the distinct groups from there.

Constraints of the algorithm

Only numerical data can be used. Generally, K-means works best for 2-dimensional numerical data. Visualization is possible with 2D or 3D data; in reality, though, there are always multiple features to consider at a time. We must also be careful about the curse of dimensionality: any more than a few tens of dimensions means that distance interpretation is no longer obvious and must be guarded against. Appropriate dimensionality reduction techniques and distance measures must be used.

K-Means clustering is sensitive to the initial seeding, i.e., the random initialization of centroids required to kick off the iterative clustering process. A bad initialization may end up producing bad clusters. The Leader Algorithm can be used instead.

The standard K-means algorithm isn’t directly applicable to categorical data, for various reasons. The sample space for categorical data is discrete, and doesn’t have a natural origin. A Euclidean distance function on such a space is not really meaningful. However, the clustering algorithm is free to choose any distance metric / similarity score. Euclidean is the most popular. But any other metric can be used that scales according to the data distribution in each dimension/attribute, for example the Mahalanobis metric.

The use of Euclidean distance as the measure of dissimilarity can also make the determination of the cluster means non-robust to outliers and noise in the data.

Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations.

Categorical data (i.e., category labels such as gender, country, browser type) needs to be encoded (e.g., one-hot encoding for nominal categorical variable or label encoding for ordinal categorical variable) or separated in a way that can still work with the algorithm, which is still not perfectly right. There’s a variation of K-means known as K-modes, introduced in this paper by Zhexue Huang, which is suitable for categorical data.

K-Means does not behave very well when the clusters have varying sizes, different densities, or non-spherical shapes. In that case, one can use Mixture Models with the EM algorithm or Fuzzy K-means, both of which allow soft assignments (every object belongs to every cluster with a membership weight between 0, absolutely does not belong, and 1, absolutely belongs). As a matter of fact, K-means is a special variant of the EM algorithm with the assumption that the clusters are spherical. The EM algorithm also starts with random initializations; it is an iterative algorithm, it has strong assumptions that the data points must fulfill, it is sensitive to outliers, and it requires prior knowledge of the number of desired clusters. The results produced by EM are also non-reproducible.

The above paragraph shows the drawbacks of this algorithm. K-means assumes that the variance of the distribution of each attribute (variable) is spherical; that all variables have the same variance; and that the prior probability of all K clusters is the same, i.e., each cluster has a roughly equal number of observations. If any one of these 3 assumptions is violated, then K-means will fail. This Stackoverflow answer explains it perfectly!


It is important to scale the input features before you run K-Means, or else the clusters may be very stretched, and K-Means will perform poorly. Scaling the features does not guarantee that all the clusters will be nice and spherical, but it generally improves things.

K-Means clustering just cannot deal with missing values. Any observation even with one missing dimension must be specially handled. If there are only few observations with missing values then these observations can be excluded from clustering. However, this must have equivalent rule during scoring about how to deal with missing values. Since in practice one cannot just refuse to exclude missing observations from segmentation, often better practice is to impute missing observations. There are various methods available for missing value imputation but care must be taken to ensure that missing imputation doesn’t distort distance calculation implicit in k-Means algorithm. For example, replacing missing age with -1 or missing income with 999999 can be misleading!

Clustering analysis is not negatively affected by heteroscedasticity but the results are negatively impacted by multicollinearity of features/ variables used in clustering as the correlated feature/ variable will carry extra weight on the distance calculation than desired.

K-means has no notion of outliers, so all points are assigned to a cluster even if they do not belong in any. In the domain of anomaly detection, this causes problems as anomalous points will be assigned to the same cluster as “normal” data points. The anomalous points pull the cluster centroid towards them, making it harder to classify them as anomalous points.

The K-Means clustering algorithm might converge on a local minimum, which might coincide with the global minimum in some cases, but not always. Therefore, it’s advised to run the K-Means algorithm multiple times before drawing inferences about the clusters. Note, however, that it’s possible to get the same clustering results from K-means by setting the same seed value for each run; that simply makes the algorithm choose the same set of random numbers for each run.

DATA: Iris Flower Dataset

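Loading the data; a sketch:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data    # four numeric features
y = iris.target  # true species labels, kept only for comparison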

K-means in Sci-kit Learn

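A minimal scikit-learn run on X, using the k-means++ initialization discussed above:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_)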

K-means from Scratch

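A from-scratch sketch following the assignment and update steps defined earlier; the iteration cap and convergence test are assumptions, and it assumes no cluster goes empty:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(X.shape[0], k, replace=False)]  # random data points
    for _ in range(n_iters):
        # assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(X, 3)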

K-median is another clustering algorithm closely related to K-means. The practical difference between the two is as follows:

  • In K-means, centroids are determined by minimizing the sum of the squares of the distance between a centroid candidate and each of its examples.
  • In K-median, centroids are determined by minimizing the sum of the distance between a centroid candidate and each of its examples.

K-medians owes its use to the robustness of the median as a statistic. The mean is a measurement that is highly vulnerable to outliers: even just one drastic outlier can pull the value of the mean away from the majority of the data set, which can be a high concern when operating on very large data sets. The median, on the other hand, is a statistic incredibly resistant to outliers: to drag the median away from the bulk of the information, it requires at least 50% of the data to be contaminated.

K-medians uses the median as the statistic to determine the center of each cluster.

Note that the definitions of distance are also different:

K-means relies on the Euclidean distance from the centroid to an example. (In two dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the hypotenuse.) For example, the K-means distance between $(2,2)$ and $(5,-2)$ would be:

$$\sqrt{(2-5)^2 + (2-(-2))^2} = \sqrt{9 + 16} = 5$$

K-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the absolute deltas in each dimension. For example, the K-median distance between $(2,2)$ and $(5,-2)$ would be:

$$|2-5| + |2-(-2)| = 3 + 4 = 7$$

Note that K-medians is also very sensitive to the initialization points of its K centers, each center having the tendency to remain roughly in the same cluster in which it is first placed.

Mini Batch K-Means

The Mini-batch K-Means is a variant of the K-Means algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of K-means, mini-batch K-means produces results that are generally only slightly worse than the standard algorithm.

The algorithm iterates between two major steps, similar to vanilla K-means. In the first step, samples are drawn randomly from the dataset, to form a mini-batch. These are then assigned to the nearest centroid. In the second step, the centroids are updated. In contrast to k-means, this is done on a per-sample basis. For each sample in the mini-batch, the assigned centroid is updated by taking the streaming average of the sample and all previous samples assigned to that centroid. This has the effect of decreasing the rate of change for a centroid over time. These steps are performed until convergence or a predetermined number of iterations is reached.

Mini-batch K-Means converges faster than K-Means, but the quality of the results is reduced. In practice this difference in quality can be quite small.
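As a concrete reference, here is a minimal scikit-learn sketch of the mini-batch variant (the synthetic dataset and parameter values are invented for illustration):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))  # a largish synthetic dataset

# batch_size controls how many samples are drawn per iteration.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.inertia_)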

For details, look here

  • https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data
  • https://blog.bioturing.com/2018/10/17/k-means-clustering-algorithm-and-example/
  • https://www.datascience.com/blog/k-means-clustering
  • http://worldcomp-proceedings.com/proc/p2015/CSC2663.pdf

Reasonable Deviations, © Robert Andrew Martin 2023

Implementing k-means clustering from scratch in C++

I have a somewhat complicated history when it comes to C++. When I was 15 and teaching myself to code, I couldn’t decide between python and C++ and as a result tried to learn both at the same time. One of my first non-trivial projects was a C++ program to compute orbits – looking back on it now, I can see that what I was actually doing was a (horrifically inefficient) implementation of Euler’s method. I just couldn’t wrap my head around fixed-size arrays (not to mention pointers!). In any case, I soon realised that juggling C++ and python was untenable – not only was I new to the concepts (such as type systems and OOP), I was having to learn two sets of syntax in addition to two flavours of these concepts. I decided to commit to python and haven’t really looked back since.

Now, almost 6 years later (tempus fugit!), having completed the first-year computer science course at Cambridge, I feel like I am in a much better place to have a proper go at C++. My motivation is helped by the fact that all of the second-year computational practicals for physics are done in C++, not to mention that C++ is incredibly useful in quantitative finance (which I am deeply interested in).

To that end, I decided to jump straight in and implement a machine learning algorithm from scratch. I chose k-means because of its personal significance to me: when I was first learning about ML, k-means was one of the first algorithms that I fully grokked and I spent quite a while experimenting with different modifications and implementations in python. Also, given that the main focus of this post is to learn C++, it makes sense to use an algorithm I understand relatively well.

Please let me add the disclaimer that this is certainly not going to be an optimal solution – this post is very much a learning exercise for me and I'd be more than happy to receive constructive criticism. As always, all code for this project can be found on GitHub.

What is k-means clustering?

The post covers: the k-means algorithm; C++ preambles; representing a datapoint; reading in data from a file; pointers: an old enemy revisited; initialising the clusters; assigning points to a cluster; computing new centroids; and writing to a file.

I have decided to give four brief explanations with increasing degrees of rigour. Nothing beyond the first explanation is really essential for the rest of this post, so feel free to stop whenever.

  • k-means clustering allows us to find groups of similar points within a dataset.
  • k-means clustering is the task of finding groups of points in a dataset such that the total variance within groups is minimised.
  • k-means clustering is the task of partitioning feature space into k subsets to minimise the within-cluster sum-of-squares (WCSS), which is the sum of squared Euclidean distances between each datapoint and its centroid.
  • Formally, k-means clustering is the task of finding a partition $S = \{S_1, S_2, \ldots, S_k\}$ where $S$ satisfies:

$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

where $\mu_i$ is the mean of the points in $S_i$.

The k-means clustering problem is actually incredibly difficult to solve. Let's say we just have $N=120$ and $k=5$, i.e. we have 120 datapoints which we want to group into 5 clusters. The number of possible partitions is more than the number of atoms in the universe ($5^{120} \approx 10^{83}$) – for each one, we would then need to calculate the WCSS (read: variance) and choose the best partition.

Clearly, any kind of brute-force solution is intractable (to be specific, the problem has exponential complexity). Hence, we need to turn to approximate solutions. The most famous approximate algorithm is Lloyd's algorithm, which is often confusingly called the "k-means algorithm". In this post I will silence my inner pedant and use the terms k-means algorithm and k-means clustering interchangeably, but it should be remembered that they are slightly distinct. With that aside, Lloyd's algorithm is incredibly simple:

1. Initialise the clusters

The algorithm needs to start somewhere, so we need to come up with a crude way of clustering points. To do this, we randomly select k points which become ‘markers’, then assign each datapoint to its nearest marker point. The result of this is k clusters. While this is a naive initialisation method, it does have some nice properties - more densely populated regions are more likely to contain centroids (which makes logical sense).

2. Compute the centroid of each cluster

Technically Lloyd’s algorithm computes the centroid of each partition of 3D space via integration, but we use the reasonable approximation of computing the centre of mass of the points in a given partition. The rational behind this is that the centroid of a cluster ‘characterises’ the cluster in some sense.

3. Assign each point to the nearest centroid and redefine the cluster

If a point currently in cluster 1 is actually closer to the centroid of cluster 2, surely it makes more sense for it to belong to cluster 2? This is exactly what we do, looping over all points and assigning them to clusters based on which centroid is the closest.

4. Repeat steps 2 and 3

We then repeatedly recompute centroids and reassign points to the nearest centroid. There is actually a very neat proof that this converges: essentially, there is only a finite (though massive) number of possible partitions, and each k-means update can only improve (or leave unchanged) the WCSS. Hence the algorithm must converge.

Implementation

Our goal today is to implement a C++ version of the k-means algorithm that successfully clusters a two-dimensional subset of the famous mall customers dataset (available here). It should be noted that the k-means algorithm certainly works in more than two dimensions (the Euclidean distance metric easily generalises to higher-dimensional space), but for the purposes of visualisation, this post will only implement k-means to cluster 2D data. A plot of the raw data is shown below:

[Figure: scatter plot of the raw mall customers data]

By eye, it seems that there are five different clusters. The question is whether our k-means algorithm can successfully figure this out. We are actually going to cheat a little bit and tell the algorithm that there will be five clusters (i.e. $k=5$). There are methods to avoid this, but they essentially involve testing different values of k and finding the best fit, so they don't add much value to this post.

Firstly, we need to define our imports and namespace.

In general, using namespace std is not considered best practice (particularly in larger projects) because it can lead to ambiguity (for example, if I define a function or variable called vector ) and unexpected behaviour. However, the alternative is to have things like std::cout or vector::vector everywhere – for an educational post, the loss in clarity is worse than the potential ambiguity.

To represent a datapoint for this program, we will be using a C++ struct . Structs caused me a great deal of confusion when I was learning about C++ because I couldn’t quite figure out how they differ from classes. As it happens, they are really quite similar – possibly the only relevant difference is that members of a struct are public by default. In any case, I would think of a struct as a way of defining a more complicated data type, though it is more than just a container for primitive datatypes because you can also define some functionality.

The first few lines are self-explanatory: we define the coordinates of a point, as well as the cluster it belongs to and the distance to that cluster. Annoyingly, you can't directly set the default value in the struct (e.g. double x = 0) – you need to do this via initialisation lists. Initially the point belongs to no cluster, so we arbitrarily set that to -1. Accordingly, we must set minDist to infinity (or the next best thing, __DBL_MAX__).

We also define a distance function, which computes the (squared) Euclidean distance between this point and another.

If we wanted to represent a datapoint in p -dimensions, we could replace the x and y members with a vector or array of doubles, with each entry corresponding to a coordinate in a given dimension. The distance function would similarly need to be modified to loop over the vectors/arrays and sum all of the squared differences.

Having decided how we are going to store datapoints within our C++ script, we must then read in the data from a CSV file. This is rather unexciting, but actually took me a long time to figure out. Essentially, we loop over all the lines in the CSV file and break them down based on the commas.

Note that the readcsv function returns a vector of points. I decided to use a vector instead of an array because vectors handle all of the memory management for you (though are slightly less performant) and are functionally quite similar to python lists.

Suppose your friend wants to visit your house. You have two options (the relevance of this thought experiment will be clear shortly).

  • Give them your postcode and let them find your house.
  • Hire a team of builders to replicate your house brick-for-brick right outside their front door.

The readcsv function returns a vector of points. One might assume that we can then just pass this to whatever k -means function we define and be done with it.

However, we must be aware that depending on the size of our dataset, points might take up quite a large chunk of memory, so we must handle it carefully to be efficient. The problem with the above code is that we are passing the values of the points to the function, i.e. we are making a copy of them. This is inefficient from a memory perspective. Luckily, C++ offers a way around this, called pass by reference. Essentially, instead of giving the value of the points vector to the function, we pass the location (read: postcode) of the points vector in memory.

The prototype of our kMeansClustering function is then along the lines of void kMeansClustering(vector<Point>* points, int epochs, int k).

Because we are now passing an address (and thus not technically a vector<Point> ), we must include an asterisk. Read the first argument as “a reference to a vector of Point objects”.

I have also added two other arguments:

  • epochs is the number of iterations over which we will do our main k -means loop
  • k is the number of clusters.

We first need to assign each point to a cluster. The easiest way of doing this is to randomly pick 5 “marker” points and give them labels 1-5 (or actually 0-4 since our arrays index from 0).

The code for this is quite simple. We will use another vector of points to store the centroids (markers), where the index of the centroid is its label. We then select a random point from the points vector we made earlier (from reading in the csv) and set that as a centroid.

One brief C++ note: because points is actually a pointer rather than a vector, in order to access an item at a certain index we can't write points[i] – we must first 'dereference' it with (*points)[i]. This is quite ugly, so fortunately we have the syntactic shortcut points->at(i).

Once the centroids have been initialised, we can begin the k-means algorithm iterations. We now turn to the "meat" of k-means: assigning points to a cluster and computing new centroids.

The logic here is quite simple. We loop through every datapoint and assign it to its nearest centroid. Because there are k centroids, the result is a partition of the datapoints into k clusters.

In terms of the actual code, I had to spend some time thinking about the best way to represent that a point belonged to a certain cluster. In my python implementation (now many years old), I used a dictionary with cluster IDs as keys and a list of points as values. However, for this program I decided to use a quicker solution: I gave each point a cluster attribute which can hold an integer ID. We then set this ID to the index of the cluster that is closest to the point.

After our first iteration, the clusters are really quite crude – we've randomly selected 5 points then formed clusters based on the closest random point. There's no reason why this should produce meaningful clusters, and indeed it doesn't. However, the heart of k-means is the update step, wherein we compute the centroids of the previous clusters and subsequently reassign points.

As previously stated, we are going to majorly simplify the problem by computing the centroid of the points within a cluster rather than the partition of space. Thus all we really have to do is compute the mean coordinates of all the points in a cluster.

To do this, I created two new vectors: one to keep track of the number of points in each cluster and the other to keep track of the sum of coordinates (the average is then just the latter divided by the former).

We then iterate through all the points and increment the correct indices of the above vectors (based on the point’s cluster ID). Importantly, now is a convenient time to reset the minDist attribute of the point, so that the subsequent iteration works as intended.

Now that we have the new centroids, the k-means algorithm repeats. We recompute distances and reassign points to their nearest centroids. Then we can find the new centroids, recompute distances, etc.

One final detail: after all of our k -means iterations, we would like to be able to write the output to a file so that we can analyse the clustering. This is quite simple - we will just iterate through the points then print their coordinates and cluster IDs to a csv file.

In order to test that my k -means implementation was working properly, I wrote a simple plotting script. I am somewhat embarrassed (in the context of a C++ post) to say that I wrote this in python.

The result is quite pretty and it shows that – bar a few contentious points around the centre cluster – the clustering has worked as expected.

[Figure: scatter plot of the final clusters, coloured by cluster]

In conclusion, we have successfully implemented a simple k-means algorithm in C++. Obviously there is much that could be improved about my program. Firstly, many simplifications were made: for example, we restricted the problem to two dimensions and also pre-set the number of clusters. However, there are more subtle issues that we neglected to discuss, including the random initialisation, which may result in suboptimal clusters. In fact, there are algorithms like k-means++ that offer major improvements over k-means by specifying better procedures to find the initial clusters.

It is also worth mentioning the fundamental difficulties with k-means: it acutely suffers from the 'curse of dimensionality', as data becomes more sparse in high dimensions, and it is relatively inefficient since there are four loops (over iterations, points, clusters, and dimensions). However, k-means is often a great solution for quickly clustering small data, and the algorithm is just about simple enough to explain to business stakeholders.

In any case, the merits/disadvantages of k-means aside, writing this program has given me a lot more confidence in C++ and I am keen to develop a more advanced understanding. I think it's a good complement to my current interests in scientific/financial computing and it is pleasing to see that I am making more progress than I was a few years back.



Written by Chris Piech. Based on a handout by Andrew Ng.

The Basic Idea

Say you are given a data set where each observed example has a set of features, but has no labels. Labels are an essential ingredient to a supervised algorithm like Support Vector Machines, which learns a hypothesis function to predict labels given features. So we can't run supervised learning. What can we do?

One of the most straightforward tasks we can perform on a data set without labels is to find groups of data in our dataset which are similar to one another -- what we call clusters.

K-Means is one of the most popular "clustering" algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.


Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses. (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations of k-means. In each iteration, we assign each training example to the closest cluster centroid (shown by "painting" the training examples the same color as the cluster centroid to which it is assigned); then we move each cluster centroid to the mean of the points assigned to it. Images courtesy of Michael Jordan.

The Algorithm

In the clustering problem, we are given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$, and want to group the data into a few cohesive "clusters." Here, we are given feature vectors for each data point $x^{(i)} \in \mathbb{R}^n$ as usual; but no labels $y^{(i)}$ (making this an unsupervised learning problem). Our goal is to predict $k$ centroids and a label $c^{(i)}$ for each datapoint. The k-means clustering algorithm is as follows:

1. Initialize cluster centroids $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$ randomly.
2. Repeat until convergence: for every $i$, set $c^{(i)} := \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$; then, for every $j$, set $\mu_j := \dfrac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$.

Implementation

Here is Python code which runs k-means on a dataset; the helper bodies below are one straightforward NumPy completion of the original comment-only stubs. It is a short algorithm made longer by verbose commenting.

import numpy as np

MAX_ITERATIONS = 100

# Function: K Means
# -------------
# K-Means is an algorithm that takes in a dataset and a constant
# k and returns k centroids (which define clusters of data in the
# dataset which are similar to one another).
def kmeans(dataSet, k):
    # Initialize centroids randomly
    numFeatures = dataSet.shape[1]
    centroids = getRandomCentroids(numFeatures, k)
    # Initialize book keeping vars.
    iterations = 0
    oldCentroids = None
    # Run the main k-means algorithm
    while not shouldStop(oldCentroids, centroids, iterations):
        # Save old centroids for convergence test. Book keeping.
        oldCentroids = centroids
        iterations += 1
        # Assign labels to each datapoint based on centroids
        labels = getLabels(dataSet, centroids)
        # Assign centroids based on datapoint labels
        centroids = getCentroids(dataSet, labels, k)
    # We can get the labels too by calling getLabels(dataSet, centroids)
    return centroids

# Function: Should Stop
# -------------
# Returns True or False if k-means is done. K-means terminates either
# because it has run a maximum number of iterations OR the centroids
# stop changing.
def shouldStop(oldCentroids, centroids, iterations):
    if iterations > MAX_ITERATIONS:
        return True
    return np.array_equal(oldCentroids, centroids)

# Function: Get Random Centroids
# -------------
# Returns k random centroids, each of dimension n.
def getRandomCentroids(n, k):
    return np.random.rand(k, n)

# Function: Get Labels
# -------------
# Returns a label for each piece of data in the dataset: for each
# element, choose the closest centroid and make that centroid the
# element's label.
def getLabels(dataSet, centroids):
    dists = np.linalg.norm(dataSet[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Function: Get Centroids
# -------------
# Returns k centroids. Each centroid is the mean of the points that
# have that centroid's label. Important: if a centroid is empty (no
# points have that centroid's label), randomly re-initialize it.
def getCentroids(dataSet, labels, k):
    centroids = np.empty((k, dataSet.shape[1]))
    for j in range(k):
        members = dataSet[labels == j]
        if len(members) == 0:
            centroids[j] = getRandomCentroids(dataSet.shape[1], 1)[0]
        else:
            centroids[j] = members.mean(axis=0)
    return centroids

Important note: You might be tempted to calculate the distance between two points manually, by looping over values. This will work, but it will lead to a slow k-means! And a slow k-means will mean that you have to wait longer to test and debug your solution.

Let's define three vectors: x and y, stored as 5x1 column vectors, and z, a matrix whose columns are length-5 vectors.

To calculate the distance between x and y we can use: np.sqrt(sum((x - y) ** 2))

To calculate the distance between all the length 5 vectors in z and x we can use: np.sqrt(((z-x)**2).sum(axis=0))
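The original definitions of x, y, and z were lost from this copy; here is one self-contained setup (values invented) under which both snippets above work as written:

import numpy as np

# x and y as 5x1 column vectors.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([5.0, 4.0, 3.0, 2.0, 1.0]).reshape(-1, 1)

# Euclidean distance between x and y.
print(np.sqrt(sum((x - y) ** 2)))

# z holds three length-5 vectors, one per column.
z = np.array([[1.0, 0.0, 2.0],
              [2.0, 1.0, 3.0],
              [3.0, 2.0, 4.0],
              [4.0, 3.0, 5.0],
              [5.0, 4.0, 6.0]])

# x broadcasts across z's columns, giving all three distances at once.
print(np.sqrt(((z - x) ** 2).sum(axis=0)))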

Expectation Maximization

K-Means is really just the EM (Expectation Maximization) algorithm applied to a particular naive bayes model.

To demonstrate this remarkable claim, consider the classic naive bayes model with a class variable which can take on discrete values (with domain size $k$) and a set of feature variables, each of which can take on a continuous value (see figure 2). The conditional probability distribution $P(f_i = x \mid C = c)$ is going to be slightly different than usual. Instead of storing this conditional probability as a table, we are going to store it as a single normal (Gaussian) distribution, with its own mean and a standard deviation of 1. Specifically, this means that: $P(f_i = x \mid C = c) \sim \mathcal{N}(\mu_{c,i}, 1)$

Learning the values of $\mu_{c,i}$ given a dataset with assigned values to the features but not the class variables is provably identical to running k-means on that dataset.


Figure 2: The K-Means algorithm is the EM algorithm applied to this Bayes Net.

If we know that this is the structure of our bayes net, but we don't know any of the conditional probability distributions, then we have to run Parameter Learning before we can run Inference.

In the dataset we are given, all the feature variables are observed (for each data point) but the class variable is hidden. Since we are running Parameter Learning on a bayes net where some variables are unobserved, we should use EM.

Let's review EM. In EM, you randomly initialize your model parameters, then you alternate between (E) assigning values to hidden variables based on the parameters and (M) computing parameters based on the fully observed data.

E-Step: Coming up with values for hidden variables, based on parameters. If you work out the math of choosing the best values for the class variable based on the features of a given piece of data in your data set, it comes out to "for each data-point, choose the centroid that it is closest to, by Euclidean distance, and assign that centroid's label." The proof of this is within your grasp! See lecture.

M-Step: Coming up with parameters, based on full assignments. If you work out the math of choosing the best parameter values based on the features of a given piece of data in your data set, it comes out to "take the mean of all the data-points that were labeled as c."

So what? Well this gives you an idea of the qualities of k-means. Like EM, it is provably going to find a local optimum. Like EM, it is not necessarily going to find a global optimum. It turns out those random initial values do matter.

Figure 1 shows k-means with a 2-dimensional feature vector (each point has two dimensions, an x and a y). In your applications, you will probably be working with data that has a lot of features. In fact, each data-point may be hundreds of dimensions. We can visualize clusters in up to 3 dimensions (see figure 3), but beyond that you have to rely on a more mathematical understanding.


Figure 3: KMeans in other dimensions. (left) K-means in 2d. (right) K-means in 3d. You have to imagine k-means in 4d.



K-Means Clustering Algorithm: Applications, Types, Demos and Use Cases


Every Machine Learning engineer wants to achieve accurate predictions with their algorithms. Such learning algorithms are generally broken down into two types: supervised and unsupervised. K-Means clustering is one of the unsupervised algorithms, used when the available input data does not have a labeled response.

Types of Clustering

Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.

The various types of clustering are:

  • Hierarchical clustering
  • Partitioning clustering 

Hierarchical clustering is further subdivided into:

  • Agglomerative clustering
  • Divisive clustering

Partitioning clustering is further subdivided into:

  • K-Means clustering 
  • Fuzzy C-Means clustering 


Hierarchical Clustering

Hierarchical clustering uses a tree-like structure, like so:

[Figure: hierarchical clustering tree structure]

In agglomerative clustering, there is a bottom-up approach. We begin with each element as a separate cluster and merge them into successively more massive clusters, as shown below:

[Figure: agglomerative (bottom-up) clustering]

Divisive clustering is a top-down approach. We begin with the whole set and proceed to divide it into successively smaller clusters, as you can see below:

[Figure: divisive (top-down) clustering]

Partitioning Clustering 

Partitioning clustering is split into two subtypes - K-Means clustering and Fuzzy C-Means.

In k-means clustering, the objects are divided into several clusters mentioned by the number ‘K.’ So if we say K = 2, the objects are divided into two clusters, c1 and c2, as shown:

[Figure: objects divided into two clusters, c1 and c2]


Here, the features or characteristics are compared, and all objects having similar characteristics are clustered together. 

Fuzzy c-means is very similar to k-means in the sense that it clusters objects that have similar characteristics together. In k-means clustering, a single object cannot belong to two different clusters. But in c-means, objects can belong to more than one cluster, as shown. 

[Figure: fuzzy c-means with overlapping clusters]

What is Meant by the K-Means Clustering Algorithm?

K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster. 

The term 'K' is a number. You need to tell the system how many clusters you need to create. For example, K = 2 refers to two clusters. There is a way of finding out the best or optimum value of K for given data.

For a better understanding of k-means, let's take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, which gives information on the runs scored by each player and the wickets taken by them in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.

Let's take a look at the steps to create these clusters.

Assign data points

Here, we have our data set plotted on ‘x’ and ‘y’ coordinates. The information on the y-axis is about the runs scored, and on the x-axis about the wickets taken by the players. 

If we plot the data, this is how it would look:

[Figure: scatter plot of runs scored vs. wickets taken]

Perform Clustering

We need to create the clusters, as shown below:

[Figure: the two clusters to be created]

Considering the same data set, let us solve the problem using K-Means clustering (taking K = 2).

The first step in k-means clustering is the allocation of two centroids randomly (as K=2). Two points are assigned as centroids. Note that the points can be anywhere, as they are random points. They are called centroids, but initially, they are not the central point of a given data set.

[Figure: two randomly assigned centroids]

The next step is to determine the distance of each data point from each of the randomly assigned centroids. For every point, the distance is measured from both centroids, and the point is assigned to the centroid whose distance is smaller. You can see the data points attached to the centroids, represented here in blue and yellow.

[Figure: data points assigned to the nearest centroid]

The next step is to determine the actual centroid for these two clusters. The original randomly allocated centroid is to be repositioned to the actual centroid of the clusters.

[Figure: centroids repositioned to the actual cluster centers]

This process of calculating the distance and repositioning the centroid continues until we obtain our final cluster. Then the centroid repositioning stops.

[Figure: final clusters after convergence]

As seen above, the centroid doesn't need any more repositioning, which means the algorithm has converged, and we have the two clusters, each with a centroid.

Advantages of K-Means

  • Simple and easy to implement: The k-means algorithm is easy to understand and implement, making it a popular choice for clustering tasks.
  • Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
  • Scalability: K-means can handle large datasets with a large number of data points and can be easily scaled to handle even larger datasets.
  • Flexibility: K-means can be easily adapted to different applications and can be used with different distance metrics and initialization methods.

Disadvantages of K-Means

  • Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal solution.
  • Requires specifying the number of clusters: The number of clusters k needs to be specified before running the algorithm, which can be challenging in some applications.
  • Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.

Applications of K-Means Clustering

K-Means clustering is used in a variety of examples or business cases in real life, like:

  • Academic performance
  • Diagnostic systems
  • Search engines
  • Wireless sensor networks

Academic Performance

Based on the scores, students are categorized into grades like A, B, or C. 

Diagnostic systems

The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.

Search engines

Clustering forms a backbone of search engines. When a search is performed, the search results need to be grouped, and the search engines very often use clustering to do this. 

Wireless Sensor Networks

In wireless sensor networks, the clustering algorithm plays the role of finding the cluster heads, which collect all the data in their respective clusters.

Distance Measure 

Distance measure determines the similarity between two elements and influences the shape of clusters.

K-Means clustering supports various kinds of distance measures, such as: 

  • Euclidean distance measure
  • Manhattan distance measure 
  • Squared Euclidean distance measure
  • Cosine distance measure 

Euclidean Distance Measure 

The most common case is determining the distance between two points. If we have a point P and point Q, the euclidean distance is an ordinary straight line. It is the distance between the two points in Euclidean space. 

The formula for distance between two points is shown below:

$d(P, Q) = \sqrt{\sum_{i=1}^{n} (P_i - Q_i)^2}$

Squared Euclidean Distance Measure

This is identical to the Euclidean distance measurement but does not take the square root at the end. The formula is shown below:

$d(P, Q) = \sum_{i=1}^{n} (P_i - Q_i)^2$

Manhattan Distance Measure

The Manhattan distance is the simple sum of the horizontal and vertical components or the distance between two points measured along axes at right angles.

Note that we are taking the absolute value so that the negative values don't come into play. 

The formula is shown below:

$d(P, Q) = \sum_{i=1}^{n} \lvert P_i - Q_i \rvert$

Cosine Distance Measure

In this case, we take the angle between the two vectors formed by joining the origin point. The formula is shown below:

$d(P, Q) = 1 - \dfrac{\sum_{i=1}^{n} P_i Q_i}{\sqrt{\sum_{i=1}^{n} P_i^2}\,\sqrt{\sum_{i=1}^{n} Q_i^2}}$
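For concreteness, here is a small sketch computing all four measures for the pair of points $(2,2)$ and $(5,-2)$ used earlier (a hypothetical example, not part of the original article):

import numpy as np

p = np.array([2.0, 2.0])
q = np.array([5.0, -2.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))   # 5.0
squared_euclidean = np.sum((p - q) ** 2)    # 25.0
manhattan = np.sum(np.abs(p - q))           # 7.0
# Cosine distance: 1 minus the cosine of the angle between p and q.
cosine = 1 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))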


K-means on Geyser's Eruptions Segmentation

K-means can be used to segment the Geyser's Eruptions dataset, which records the duration and waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park. The algorithm can be used to cluster the eruptions based on their duration and waiting time and identify different patterns of eruptions.

K-means on Image Compression

K-means can also be used for image compression, where it can be used to reduce the number of colors in an image while maintaining its visual quality. The algorithm can be used to cluster the colors in the image and replace the pixels with the centroid color of each cluster, resulting in a compressed image.
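A sketch of this colour-quantisation idea using scikit-learn and Pillow (the file name is a placeholder, and the cluster count of 16 is an arbitrary choice):

import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

# Load an RGB image as an (H, W, 3) array; "photo.jpg" is a placeholder path.
img = np.asarray(Image.open("photo.jpg"), dtype=np.float64) / 255.0
pixels = img.reshape(-1, 3)  # one row per pixel

# Cluster the pixel colours, then replace each pixel by its centroid colour.
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)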

Evaluation Methods

Evaluation methods are used to measure the performance of clustering algorithms. Common evaluation methods include:

Sum of Squared Errors (SSE): This measures the sum of the squared distances between each data point and its assigned centroid.

Silhouette Coefficient: This measures the similarity of a data point to its own cluster compared to other clusters. A high silhouette coefficient indicates that a data point is well-matched to its own cluster and poorly matched to neighboring clusters.

Silhouette Analysis

Silhouette analysis is a graphical technique used to evaluate the quality of the clusters generated by a clustering algorithm. It involves calculating the silhouette coefficient for each data point and plotting them in a histogram. The width of the histogram indicates the quality of the clustering. A wide histogram indicates that the clusters are well-separated and distinct, while a narrow histogram indicates that the clusters are poorly separated and may overlap.
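With scikit-learn, the mean silhouette coefficient can be computed directly; here is a minimal sketch on synthetic data (values invented):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Mean silhouette coefficient over all points; close to 1 is good.
print(silhouette_score(X, labels))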

How Does K-Means Clustering Work?

The flowchart below shows how k-means clustering works:

[Figure: flowchart of the k-means clustering process]

The goal of the K-Means algorithm is to find clusters in the given input data. There are a couple of ways to accomplish this. We can use the trial and error method by specifying the value of K (e.g., 3,4, 5). As we progress, we keep changing the value until we get the best clusters. 

Another method is to use the Elbow technique to determine the value of K. Once we get the K's value, the system will assign that many centroids randomly and measure the distance of each of the data points from these centroids. Accordingly, it assigns those points to the corresponding centroid from which the distance is minimum. So each data point will be assigned to the centroid, which is closest to it. Thereby we have a K number of initial clusters.

For the newly formed clusters, it calculates the new centroid position. The position of the centroid moves compared to the randomly allocated one.

Once again, the distance of each point is measured from this new centroid point. If required, the data points are relocated to the new centroids, and the mean position or the new centroid is calculated once again. 

If the centroid moves, the iteration continues, indicating no convergence. But once the centroid stops moving (which means that the clustering process has converged), the result is final.

Let's use a visualization example to understand this better. 

We have a data set for a grocery shop, and we want to find out how many clusters this has to be spread across. To find the optimum number of clusters, we break it down into the following steps:

The Elbow method is the best way to find the number of clusters. It involves running K-Means clustering on the dataset for a range of values of K.

Next, we use the within sum of squares as a measure to find the optimum number of clusters that can be formed for a given data set. The within sum of squares (WSS) is defined as the sum of the squared distance between each member of a cluster and its centroid.

$\mathrm{WSS} = \sum_{i=1}^{m} \lVert x_i - c_{(i)} \rVert^2$, where $c_{(i)}$ denotes the centroid of the cluster to which $x_i$ is assigned.

The WSS is measured for each value of K. The value of K at which the WSS stops decreasing sharply (the elbow) is taken as the optimum value.

Now, we draw a curve between WSS and the number of clusters. 

[Figure: elbow curve of WSS versus number of clusters]

Here, WSS is on the y-axis and number of clusters on the x-axis.

You can see that there is a very gradual change in the value of WSS as the K value increases from 2. 

So, you can take the elbow point value as the optimal value of K: here it should be either two, three, or at most four. Beyond that, increasing the number of clusters does not dramatically change the value of WSS; it stabilizes.

Let's assume that these are our delivery points:

[Figure: scatter plot of delivery points]

We can randomly initialize two points called the cluster centroids.

Here, C1 and C2 are the centroids assigned randomly. 

Now the distance of each location from the centroid is measured, and each data point is assigned to the centroid, which is closest to it.

This is how the initial grouping is done:

[Figure: initial grouping of delivery points]

Compute the actual centroid of data points for the first group.

Reposition the random centroid to the actual centroid.

[Figure: centroid repositioned for the first group]

Compute the actual centroid of data points for the second group.

[Figure: centroid repositioned for the second group]

Once the cluster becomes static, the k-means algorithm is said to be converged. 

The final cluster with centroids c1 and c2 is as shown below:

[Figure: final clusters with centroids c1 and c2]

K-Means Clustering Algorithm 

Let's say we have $x_1, x_2, x_3, \ldots, x_n$ as our inputs, and we want to split this into K clusters.

The steps to form clusters are:

Step 1: Choose K random points as cluster centers called centroids. 

Step 2: Assign each x(i) to the closest cluster by implementing euclidean distance (i.e., calculating its distance to each centroid)

Step 3: Identify new centroids by taking the average of the assigned points.

Step 4: Keep repeating step 2 and step 3 until convergence is achieved

Let's take a detailed look at each of these steps.

We randomly pick K points as centroids. We name them $c_1, c_2, \ldots, c_k$, and we can say that

$C = \{c_1, c_2, \ldots, c_k\}$

Where C is the set of all centroids.

We assign each data point to its nearest center, which is accomplished by calculating the euclidean distance.

$\underset{c_i \in C}{\arg\min}\ \mathrm{dist}(c_i, x)^2$

Where dist() is the Euclidean distance.

Here, we calculate each x value's distance from each c value, i.e. the distance between x1-c1, x1-c2, x1-c3, and so on. Then we find which is the lowest value and assign x1 to that particular centroid.

Similarly, we find the minimum distance for x2, x3, etc. 

We identify the actual centroid by taking the average of all the points assigned to that cluster. 

$c_i = \dfrac{1}{\lvert S_i \rvert} \sum_{x_j \in S_i} x_j$

Where $S_i$ is the set of all points assigned to the $i$th cluster.

It means the original point, which we thought was the centroid, will shift to the new position, which is the actual centroid for each of these groups. 

Keep repeating step 2 and step 3 until convergence is achieved.

How to Choose the Value of "K number of clusters" in K-Means Clustering?

Although there are many choices available for choosing the optimal number of clusters, the Elbow Method is one of the most popular and appropriate methods. The Elbow Method uses the idea of the WCSS value, which is short for Within Cluster Sum of Squares. WCSS defines the total variation within a cluster. This is the formula used to calculate the value of WCSS (for three clusters), provided courtesy of Javatpoint:

$\mathrm{WCSS} = \sum_{P_i \in C_1} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in C_2} \mathrm{distance}(P_i, C_2)^2 + \sum_{P_i \in C_3} \mathrm{distance}(P_i, C_3)^2$
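In scikit-learn terms, WCSS is what a fitted KMeans model exposes as its inertia_ attribute, which is the quantity collected in the elbow-method code below. A minimal sketch (synthetic data, invented for illustration):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)  # the WCSS of the fitted clustering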

Python Implementation of the K-Means Clustering Algorithm

Here’s how to use Python to implement the K-Means Clustering Algorithm. These are the steps you need to take:

  • Data pre-processing
  • Finding the optimal number of clusters using the elbow method
  • Training the K-Means algorithm on the training data set
  • Visualizing the clusters

1. Data Pre-Processing. Import the libraries, datasets, and extract the independent variables.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values

2. Find the optimal number of clusters using the elbow method. Here’s the code you use:

#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  #Initializing the list for the values of WCSS

#Using for loop for iterations from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()

3. Train the K-means algorithm on the training dataset. Use the same two lines of code used in the previous section. However, instead of using i, use 5, because there are 5 clusters that need to be formed. Here’s the code:

#training the K-means model on a dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

4. Visualize the Clusters. Since this model has five clusters, we need to visualize each one.

#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

Coding provided by Javatpoint.

Demo: K-Means Clustering

Problem Statement - Walmart wants to open a chain of stores across the state of Florida, and it wants to find the optimal store locations to maximize revenue.

The issue here is if they open too many stores close to each other, they will not make a profit. But, if the stores are too far apart, they do not have enough sales coverage. 

Solution - An organization like Walmart already has the addresses of its customers in its database, so it can use this information to perform K-Means Clustering and find the optimal store locations.

Conclusion 

Considered the job of the future, Machine Learning engineers are in demand as well as highly paid. A report by MarketWatch predicts a machine learning growth rate of over 45% for the period between 2017 and 2025. So why restrict your learning to merely the K-means clustering algorithm? Enroll in Simplilearn's Machine Learning Course and expand your knowledge of the broader concepts of Machine Learning. Get certified and become a part of the Artificial Intelligence talent that companies constantly look for.


About the Author

Mayank Banoula

Mayank is a Research Analyst at Simplilearn. He is proficient in Machine Learning and Artificial Intelligence with Python.



Implementation of Partitioning based clustering algorithm K-means and Kernel K-means

Prashant47/Clustering-algorithm-K-means


Programming assignment - python version.

####Prerequisites:####

  • python: If you are working on your own machine, you will probably need to install Python. The code in this assignment works for python 2.7.
  • linux (recommended) or windows (you may not be able to apply the make commands, but you can use your own IDE, such as Visual Studio and Code Blocks.)

####Goals####

  • Implement the following clustering algorithms: K-means and Kernel K-means.
  • Implement the following supervised clustering evaluation metrics: purity and NMI.

####Step 1. K-means####

  • Complete the following two key functions of K-means in k_means.py
  • Write the purity and NMI metrics in evaluation.py (a reference sketch follows this list)
  • Use the following command line to run the python script
  • If your implementation is correct, you should have information printed on your screen that is very similar to the information given below.
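
As a sanity check for the metrics in Step 1, here is a minimal sketch of purity in NumPy, with NMI taken from scikit-learn (the assignment presumably expects your own NMI implementation, so treat this only as a reference):

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(true_labels, cluster_labels):
    # Assumes both arrays hold non-negative integer labels.
    # For each cluster, count its most common true label, then divide
    # the total of those counts by the number of points.
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

true_labels = np.array([0, 0, 1, 1, 2, 2])
cluster_labels = np.array([1, 1, 0, 0, 2, 0])
print(purity(true_labels, cluster_labels))
print(normalized_mutual_info_score(true_labels, cluster_labels))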

####Step 2. Kernel K-means####

  • Once you have done K-means, you only need to implement a wrapper to transform the data points into the kernel space for kernel K-means. In this homework, we are going to implement the RBF kernel. Please complete the following coordinates transformation function, in file kernel_k_means.py


K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems in machine learning or data science. In this topic, we will learn what is K-means clustering algorithm, how the algorithm works, along with the Python implementation of k-means clustering.

K-Means Clustering is an unsupervised learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.

It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data point and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means algorithm mainly performs two tasks:

  • It determines the best value for the K center points (centroids) by an iterative process.
  • It assigns each data point to its closest centroid.

Hence each cluster has datapoints with some commonalities, and it is away from the other clusters.

The working of the K-Means algorithm is explained in the following steps:

Step 1: Select the number K to decide the number of clusters.

Step 2: Select K random points as centroids. (They can be points other than those in the input dataset.)

Step 3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step 4: Calculate the variance and place a new centroid in each cluster.

Step 5: Repeat the third step, reassigning each datapoint to the new closest centroid of each cluster.

Step 6: If any reassignment occurs, go to step 4; else go to FINISH.

Step 7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:

From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.

To choose the new centroids, we compute the center of gravity of the points in each cluster and find the new centroids, as below:

From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:

The performance of the K-means clustering algorithm depends upon the quality of the clusters that it forms. But choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters, but here we discuss the most appropriate method to find the number of clusters, or the value of K. The method is given below:

The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the concept of the WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total variation within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:

$\mathrm{WCSS} = \sum_{P_i \in C_1} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in C_2} \mathrm{distance}(P_i, C_2)^2 + \sum_{P_i \in C_3} \mathrm{distance}(P_i, C_3)^2$

In the above formula of WCSS, $\sum_{P_i \in C_1} \mathrm{distance}(P_i, C_1)^2$ is the sum of the squares of the distances between each data point and its centroid within cluster 1, and the same holds for the other two terms.

To measure the distance between data points and centroid, we can use any method such as Euclidean distance or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below steps:

  • It executes K-means clustering on the given dataset for different K values (ranging from 1 to 10).
  • For each value of K, it calculates the WCSS value.
  • It plots a curve between the calculated WCSS values and the number of clusters K.
  • The sharp point of bend, where the plot looks like an arm, is considered the best value of K.

Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow method. The graph for the elbow method looks like the below image:

In the above section, we have discussed the K-means algorithm; now let's see how it can be implemented using Python.

Before implementation, let's understand what type of problem we will solve here. We have a dataset of mall customers, which is the data of customers who visit the mall and spend there.

In the given dataset, we have the customer ID, gender, age, annual income, and spending score (which is the calculated value of how much a customer has spent in the mall; the higher the value, the more he has spent). From this dataset, we need to identify some patterns; as this is an unsupervised method, we don't know what to calculate exactly.

The steps to be followed for the implementation are given below:

  • Data pre-processing
  • Finding the optimal number of clusters using the elbow method
  • Training the K-means algorithm on the training dataset
  • Visualizing the clusters

The first step will be the data pre-processing, as we did in our earlier topics of Regression and Classification. But for the clustering problem, it will be different from other models. Let's discuss it:


As we did in previous topics, firstly, we will import the libraries for our model, which is part of data pre-processing. The code is given below:

In the above code, we have imported numpy for performing mathematical calculations, matplotlib.pyplot for plotting the graph, and pandas for managing the dataset.


Next, we will import the dataset that we need to use. So here, we are using the Mall_Customer_data.csv dataset. It can be imported using the below code:

By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset looks like the below image:

Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea about what to determine. So we will just add a line of code for the matrix of features.

As we can see, we are extracting only the 3rd and 4th features. This is because we need a 2D plot to visualize the model, and some features, such as customer_id, are not required.

In the second step, we will try to find the optimal number of clusters for our clustering problem. So, as discussed above, here we are going to use the elbow method for this purpose.

As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the value for WCSS for different k values ranging from 1 to 10. Below is the code for it:

As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable to initialize an empty list, which is used to hold the WCSS values computed for different values of k ranging from 1 to 10.

After that, we have initialized a for loop to iterate over values of k from 1 to 10; since Python's range() excludes the upper bound, it is written as range(1, 11) so that k = 10 is included.

The rest of the code is similar to what we did in earlier topics: we fit the model on the matrix of features and then plot the graph between the number of clusters and WCSS.

After executing the above code, we get the elbow plot: the WCSS curve drops steeply at first and then flattens, with a sharp bend at k = 5.

Now that we have the optimal number of clusters, we can train the model on the dataset.

To train the model, we will use the same two lines of code as we have used in the above section, but here instead of using i, we will use 5, as we know there are 5 clusters that need to be formed. The code is given below:
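A sketch of the training step, reusing the KMeans import from above:

    kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
    y_predict = kmeans.fit_predict(x)  # cluster label (0 to 4) for each customer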

The first line is the same as above for creating the object of KMeans class.

In the second line of code, we have created the dependent variable y_predict, which holds the cluster label assigned to each customer.

By executing the above lines of code, we will get the y_predict variable. We can check it under the Variable Explorer option in the Spyder IDE and compare its values with our original dataset: for example, the first customer belongs to cluster 3 (as the index starts from 0, the label 2 corresponds to cluster 3), the second customer belongs to cluster 4, and so on.

The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one.

To visualize the clusters, we will draw a scatter plot using the mtp.scatter() function of matplotlib:
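A sketch of the visualization step (marker sizes and colors are arbitrary choices):

    mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')
    mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')
    mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')
    mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')
    mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')
    # Plot the final centroids on top of the clusters
    mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
    mtp.title('Clusters of customers')
    mtp.xlabel('Annual Income (k$)')
    mtp.ylabel('Spending Score (1-100)')
    mtp.legend()
    mtp.show()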

In the above lines of code, we have written one mtp.scatter() call for each of the 5 clusters. The first coordinate, e.g., x[y_predict == 0, 0], selects the x values (annual income) of the rows in the matrix of features whose predicted label is 0; the labels in y_predict range from 0 to 4.

The first cluster shows the customers with average income and average spending, so we can categorize these customers as careful spenders. In the same way, each of the other clusters can be profiled from its income and spending levels.




Genetic Algorithm-Based Optimization of Clustering Algorithms for the Healthy Aging Dataset


1. Introduction

  • RQ1: What is the efficacy of a genetic algorithm in the process of feature selection for enhancing clustering performance?
  • RQ2: Which clustering algorithm is most effective when applied to the selected NPHA dataset?
  • RQ3: Does the iterative process of selection, crossover, and mutation in a genetic algorithm have the potential to enhance clustering performance across numerous generations?
  • By simulating the principles of natural evolution, the genetic algorithm utilized in this study optimizes feature selection for clustering.
  • The most relevant subset of features from the dataset is identified based on the “variance” score for feature selection.
  • Clustering algorithms, including KMeans++, DBSCAN, BIRCH, and agglomerative clustering, were applied to this set of features.
  • Finally, the outcomes of the clustering with the best performance metrics are reported.

2. Materials and Methods

2.1. Selecting Features Using Genetic Algorithms

2.2. Genetic Algorithm

  • Initialization: The method commences by generating an initial population of candidate solutions (individuals) to the problem. Each individual is represented as a string of values, which may be binary, integer, or real, depending on the specific problem.
  • Selection: The algorithm chooses individuals from the population based on their fitness, a metric of how effectively an individual solves the task. Individuals with higher fitness are more likely to be chosen for the subsequent generation.
  • Crossover: The selected individuals are paired, and a crossover operation is performed to generate new offspring. The crossover procedure involves selecting a random point in the string representation of the parents and exchanging the values beyond that point between them to generate two new children.
  • Mutation: Following the crossover process, a mutation operation is applied to induce minor random alterations in the offspring’s strings. This introduces novel genetic material into the population and prevents the algorithm from converging prematurely on a suboptimal answer.
  • Replacement: Individuals in the present population are replaced with offspring according to a predetermined replacement scheme. This guarantees that the population size remains constant across successive generations.
  • Termination: The algorithm continues executing the selection, crossover, mutation, and replacement phases until a specified termination condition is satisfied, such as a maximum number of generations, an acceptable fitness level, or a predefined time limit. The overall flow of these stages is shown in Figure 2, and a minimal code sketch of the loop follows this list.
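To ground these stages, below is a minimal sketch of genetic-algorithm feature selection for clustering; the population size, mutation rate, and the use of silhouette score as the fitness function are illustrative assumptions, not the paper's exact configuration:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)

    def fitness(mask, X):
        # Fitness of a binary feature mask: silhouette score of
        # KMeans++ clustering on the selected feature subset.
        if mask.sum() == 0:
            return -1.0
        X_sel = X[:, mask.astype(bool)]
        labels = KMeans(n_clusters=3, init='k-means++', n_init=10,
                        random_state=0).fit_predict(X_sel)
        return silhouette_score(X_sel, labels)

    def evolve(X, pop_size=20, generations=30, mutation_rate=0.1):
        n_features = X.shape[1]
        # Initialization: one random binary string (one bit per feature) each.
        pop = rng.integers(0, 2, size=(pop_size, n_features))
        for _ in range(generations):
            scores = np.array([fitness(ind, X) for ind in pop])
            # Selection: keep the fitter half as parents.
            parents = pop[np.argsort(scores)[::-1][:pop_size // 2]]
            children = []
            while len(children) < pop_size - len(parents):
                a, b = parents[rng.integers(len(parents), size=2)]
                point = rng.integers(1, n_features)            # crossover point
                child = np.concatenate([a[:point], b[point:]])
                flip = rng.random(n_features) < mutation_rate  # mutation
                child[flip] = 1 - child[flip]
                children.append(child)
            # Replacement: parents plus offspring form the next generation.
            pop = np.vstack([parents] + children)
        scores = np.array([fitness(ind, X) for ind in pop])
        return pop[scores.argmax()]  # best feature mask found

Termination here is simply a fixed number of generations; a fitness threshold or a time limit could be substituted.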

2.3. Clustering/Cluster Analysis

2.4. Clustering Algorithms

2.4.1. KMeans/KMeans++

2.4.2. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

2.4.3. Balanced Iterative Reducing and Clustering Utilizing Hierarchies (BIRCH)

2.4.4. Agglomerative

3. Workflow

4.1. Performance Evaluation Parameters

4.2. Dataset

4.3. Implementation Details

4.4. Analysis

5. Discussion

5.1. GA-KMeans++ vs. Other GA-Based Clustering Algorithms

5.2. KMeans++ vs. GA-KMeans++

5.3. Insight into the Clustering Based on Features Selected by the Best-Performing Algorithms

5.4. Health-Related Recommendations for All Clusters

  • For patients falling in cluster 1, to maintain dental health, regular dental check-ups are advised. Activities promoting mental health such as social engagements and hobbies should be included in regular routines. Even though sleep issues are minor, the patients can be advised to consider lifestyle adjustments like sleep environment improvements and relaxation techniques before bed.
  • Since there are more female patients in cluster 2, gentle exercise and physical therapy programs can be introduced to move patients from fair to good physical health. Gender-specific programs may be developed, tailored to older females, addressing specific health issues like osteoporosis and cardiovascular health. Also, for this cluster of patients, activities promoting mental health, such as social engagements and hobbies, may be included in the regular routine. Advice regarding improving dental health by regular dental check-ups and lifestyle changes to improve sleep quality may be given.
  • Cluster 3 comprises a balance of males and females. Activities such as tailored fitness classes and nutritional guidance to improve physical health should be promoted. To improve dental health, regular dental check-ups and education on oral hygiene are advised. Social interactions, mental health workshops, and mental stimulation activities can help patients in this cluster maintain good mental health.

6. Conclusions

  • Q1: When genetic algorithms are used to select features, which clustering performance metric is most indicative of successful clustering outcomes, thus indicating the best parameters for healthy aging?
  • Q2: How does the feature selection technique contribute to enhancing performance parameters?

Author Contributions

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest



| Features | Type | Description |
| Age | Categorical | The patient’s age group = {1: 50–64; 2: 65–80} |
| Physical Health | Categorical | A self-assessment of the patient’s physical well-being = {−1: Refused; 1: Excellent; 2: Very Good; 3: Good; 4: Fair; 5: Poor} |
| Mental Health | Categorical | A self-evaluation of the patient’s mental or psychological health = {−1: Refused; 1: Excellent; 2: Very Good; 3: Good; 4: Fair; 5: Poor} |
| Dental Health | Categorical | A self-assessment of the patient’s oral or dental health = {−1: Refused; 1: Excellent; 2: Very Good; 3: Good; 4: Fair; 5: Poor} |
| Employment | Categorical | The patient’s employment status or work-related information = {−1: Refused; 1: Working full-time; 2: Working part-time; 3: Retired; 4: Not working at this time} |
| Stress Keeps Patient from Sleeping | Categorical | Whether stress affects the patient’s ability to sleep = {0: No; 1: Yes} |
| Medication Keeps Patient from Sleeping | Categorical | Whether medication impacts the patient’s sleep = {0: No; 1: Yes} |
| Pain Keeps Patient from Sleeping | Categorical | Whether physical pain disturbs the patient’s sleep = {0: No; 1: Yes} |
| Bathroom Needs Keeps Patient from Sleeping | Categorical | Whether the need to use the bathroom affects the patient’s sleep = {0: No; 1: Yes} |
| Unknown Keeps Patient from Sleeping | Categorical | Unidentified factors affecting the patient’s sleep = {0: No; 1: Yes} |
| Trouble Sleeping | Categorical | General issues or difficulties the patient faces with sleeping = {−1: Refused; 1: No; 2: Mild; 3: Yes} |
| Prescription Sleep Medication | Categorical | Information about any sleep medication prescribed to the patient = {−1: Refused; 1: Use regularly; 2: Use occasionally; 3: Do not use} |
| Race | Categorical | The patient’s racial or ethnic background = {−2: Not asked; −1: Refused; 1: White, Non-Hispanic; 2: Black, Non-Hispanic; 3: Other, Non-Hispanic; 4: Hispanic; 5: 2+ Races, Non-Hispanic} |
| Gender | Categorical | The gender identity of the patient = {−2: Not asked; −1: Refused; 1: Male; 2: Female} |
| Number of Doctors Visited (target variable) | Categorical | The total count of different doctors the patient has seen = {1: 0–1 doctors; 2: 2–3 doctors; 3: 4 or more doctors} |
NPHA Dataset

| Model | Silhouette Score | Davies–Bouldin Score | Calinski–Harabasz Score |
| BIRCH | 0.3816 | 0.8433 | 68.67 |
| DBSCAN | 0.4653 | 1.5441 | 4.78 |
| Agglomerative (Agg) | 0.2867 | 1.0995 | 90.7 |
| KMeans++ | 0.7284 | 0.4743 | 97.46 |
| GA with BIRCH (GA-B) | 0.6497 | 0.6024 | 229.007 |
| GA with DBSCAN (GA-DB) | 0.8844 | 1.2082 | 140.69 |
| GA & Agglomerative (GA-Agg) | 0.7044 | 0.5462 | 83.24 |
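For reference, the three scores reported above can be computed with scikit-learn; a small sketch, assuming X holds the selected features and labels holds the cluster assignments:

    from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                                 calinski_harabasz_score)

    # Higher silhouette and Calinski-Harabasz values are better;
    # a lower Davies-Bouldin value is better.
    print('Silhouette:', silhouette_score(X, labels))
    print('Davies-Bouldin:', davies_bouldin_score(X, labels))
    print('Calinski-Harabasz:', calinski_harabasz_score(X, labels))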
| Features | Cluster 1 Representative | Cluster 1 Mean | Cluster 2 Representative | Cluster 2 Mean | Cluster 3 Representative | Cluster 3 Mean |
| Age | 1 | 1 | 2 | 2 | 2 | 2 |
| Physical health | 2 | 2.37 | 3 | 3.01 | 3 | 3.37 |
| Mental health | 2 | 1.61 | 2 | 2.17 | 2 | 2.51 |
| Dental health | 2 | 2.16 | 3 | 3.28 | 4 | 4.25 |
| Employment | 3 | 2.76 | 3 | 2.82 | 3 | 2.86 |
| Stress keeps patient from sleeping | 0 | 0.23 | 0 | 0.34 | 0 | 0.24 |
| Medication keeps patient from sleeping | 0 | 0.04 | 0 | 0.06 | 0 | 0.07 |
| Pain keeps patient from sleeping | 0 | 0.17 | 0 | 0.21 | 0 | 0.28 |
| Bathroom needs keeps patient from sleeping | 1 | 0.5 | 1 | 0.52 | 1 | 0.5 |
| Unknown keeps patient from sleeping | 0 | 0.41 | 0 | 0.37 | 0 | 0.42 |
| Trouble sleeping | 3 | 2.46 | 3 | 2.41 | 2 | 2.3 |
| Prescription sleep medication | 3 | 2.85 | 3 | 2.88 | 3 | 2.76 |
| Race | 1 | 1.1 | 4 | 4.1 | 1 | 1.12 |
| Gender | 2 | 1.55 | 2 | 1.61 | 2 | 1.52 |
| Features | Cluster 1 | Cluster 2 | Cluster 3 |
| Age | 50–64 | 65–80 | 65–80 |
| Physical health | Between very good and good | Between good and fair | Good but more towards fair |
| Mental health | Mostly very good | Mostly very good | Good |
| Dental health | Very good | Good | Fair |
| Trouble sleeping | Mild to yes | Mild to yes | Mild to yes |
| Gender | Balanced male and female | More female | Balanced male and female |

Source: Kouser, K.; Priyam, A.; Gupta, M.; Kumar, S.; Bhattacharjee, V. Genetic Algorithm-Based Optimization of Clustering Algorithms for the Healthy Aging Dataset. Appl. Sci. 2024, 14, 5530. https://doi.org/10.3390/app14135530


