Imagine you are opening a small bookstore. You have a lot of different books and three shelves. Your goal is to place similar books on the same shelf. You would pick three books, one for each shelf, to establish a theme for each shelf. These books will now dictate which of the remaining books go on each shelf.
Each time you take a new book from the stack, compare it with the first three books and put this new book on the shelf that has similar books. You can repeat this process until all the books have been placed.
Once you are done, you may find that changing the number of shelves, or picking different initial books and so changing the theme of each shelf, would group the books better. Therefore, you repeat the process in hopes of a better result.
Well, the K-Means algorithm works in much the same way.
“K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.”
A cluster is a collection of data points grouped together because of certain similarities, and clustering is the task of finding such groups in a dataset.
K-Means’ goal is clear: to cluster similar observations to discover patterns unknown to the naked eye. To achieve this, the algorithm searches for a fixed number (k) of clusters in the dataset.
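Once you have defined the number of clusters, K, the model performs its calculations and assigns a cluster to each data point. Each cluster is represented by a centroid, the mean of the points currently assigned to it. The model calculates the distance between each data point and every centroid, and assigns the point to the cluster with the closest centroid. The centroids are then recomputed from their new members, and the two steps repeat until the assignments stop changing.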
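To make the assignment idea concrete, here is a minimal sketch in plain Python of the standard iterative K-Means procedure. It is an illustration, not a production implementation. It assumes Euclidean distance and picks K random data points as the starting centroids, much like picking the first book for each shelf. The function names (`closest_centroid`, `mean_point`, `k_means`) and the sample data are hypothetical and chosen only for this example.

```python
import math
import random


def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iters=100, seed=0):
    """Sketch of iterative K-Means on a list of equal-length tuples of floats."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen data points (the "first books").
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the closest centroid.
        new_labels = [closest_centroid(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = mean_point(members)
    return labels, centroids


# Hypothetical 2-D data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
        (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

On data this clearly separated, the first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random starting centroids. On messier data, different seeds or a different K can give different groupings, which is why the process is often repeated.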
Some examples of use cases are:
- Segment by purchase history.
- Segment by activities in the application, website or platform.
- Define individuals based on their interests.
- Create profiles based on activity monitoring.
- Group inventory by sales activity.
- Group images.
In summary, K-Means is a versatile algorithm with many potential uses, and it can be applied to a wide range of clustering problems. But, of course, you must be aware of its assumptions and the way it works if you do not want to be led to wrong results.
Written by César Aveleyra