K Means Clustering and various Use-cases

Jayesh Jain
4 min readJul 19, 2021

What is Clustering?

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Let’s understand this with an example. Suppose, you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But, what you can do is to cluster all of your customers into say 10 groups based on their purchasing habits and use a separate strategy for customers in each of these 10 groups. And this is what we call clustering.

Different types:

Broadly speaking, clustering can be divided into two subgroups :

  • Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For example, in the above example, each customer is put into one group out of the 10 groups.
  • Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point to be in those clusters is assigned. For example, from the above scenario, each customer is assigned a probability to be in either of 10 clusters of the retail store.

K-Means Clustering

the K-means algorithm in data mining starts with the first group of randomly selected centroids, which are used as the beginning points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids

It halts creating and optimizing clusters when either:

  • The centroids have stabilized — there is no change in their values because the clustering has been successful.
  • The defined number of iterations has been achieved.

Use cases of K-Means

1. Identifying Fake News

Fake news is not a new phenomenon, but it is one that is becoming prolific.

What the problem is: Fake news is being created and spread at a rapid rate due to technological innovations such as social media. The issue gained attention recently during the 2016 US presidential campaign. During this campaign, the term Fake News was referenced an unprecedented number of times.

How clustering works: In a paper recently published by two computer science students at the University of California, Riverside, they are using clustering algorithms to identify fake news based on the content.

The way that the algorithm works is by taking in the content of the fake news article, the corpus, examining the words used, and then clustering them. These clusters are what helps the algorithm determine which pieces are genuine and which are fake news. Certain words are found more commonly in sensationalized, click-bait articles. When you see a high percentage of specific terms in an article, it gives a higher probability of the material being fake news.

2. Marketing and Sales

Personalization and targeting in marketing is big business.

This is achieved by looking at specific characteristics of a person and sharing campaigns with them that have been successful with other similar people.

What the problem is: If you are a business trying to get the best return on your marketing investment, it is crucial that you target people in the right way. If you get it wrong, you risk not making any sales, or worse, damaging your Customer trust.

How clustering works: Clustering algorithms are able to group together people with similar traits and likelihood to purchase. Once you have the groups, you can run tests on each group with different marketing copy that will help you better target your messaging to them in the future.

3. Identifying fraudulent or criminal activity

In this scenario, we are going to focus on fraudulent taxi driver behavior. However, the technique has been used in multiple scenarios.

What is the problem: You need to look into fraudulent driving activity. The challenge is how do you identify what is true and which is false?

How clustering works: By analyzing the GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of the groups you are then able to classify them into those that are real and which are fraudulent.

So there you have it, those were 3 innovative uses of clustering algorithms. As you can see, while the technique remains reasonably constant, you can apply it to many different scenarios.

Looking at the characteristics of different groups of data can help you make better predictions of behavior. In this scenario, the real value of the algorithms is to help you create the best possible groups of data.

Once you have a solid foundation of grouped data to work with, the opportunities become infinite.

Thank you for taking some precious time to read this article!

Feel free to ask any queries. Don’t forget to share this article with your fellow acquaintances. Follow me on LinkedIn using the link provided below:

https://www.linkedin.com/in/jayesh-jain-0522061aa

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

No responses yet

Write a response