Clustering is the process of grouping similar observations into smaller groups within a larger population. It is widely applied in business analytics. One question facing businesses is how to organize the huge amounts of available data into meaningful structures, or how to break a large heterogeneous population into smaller homogeneous groups. Cluster analysis is an exploratory data analysis tool that sorts objects into groups such that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise.
A grocery retailer used clustering to segment its 1.3 million loyalty card customers into five groups based on their buying behavior. It then adopted a customized marketing strategy for each segment in order to target it more effectively.
One of the groups was called ‘Fresh food lovers’. It consisted of customers who purchased a high proportion of organic food, fresh vegetables, salads, etc. A marketing campaign that emphasized the freshness of the fruits and vegetables and the year-round availability of organic produce in the stores appealed to this customer group.
Another cluster was called ‘Convenience junkies’. It consisted of people who shopped for cooked or semi-cooked, easy-to-prepare meals. A marketing campaign focusing on the retailer’s in-house line of frozen meals, as well as the speed of the check-out counters at the store, worked well with this audience.
In this way the retailer was able to deliver the right message to the right customer and maximize the effectiveness of its marketing.
Features of clustering
Clustering is an undirected data mining technique. This means it can be used to identify hidden patterns and structures in the data without formulating a specific hypothesis. There is no target variable in clustering. In the above case, the grocery retailer was not actively trying to identify fresh food lovers at the start of the analysis. It was just attempting to understand the different buying behaviors of its customer base.
Clustering is performed to identify similarities with respect to specific behaviors or dimensions. In our example, the objective was to identify customer segments with similar buying behavior. Hence, clustering was performed using variables that represent the customer buying patterns.
Cluster analysis discovers structures in data without providing an explanation or interpretation. In other words, it finds patterns without explaining why they exist. The resulting clusters are meaningless by themselves. They need to be profiled extensively to build their identity, i.e., to understand what they represent and how they differ from the parent population.
In the retailer’s case, each cluster was profiled on its buying behavior. Customers in cluster 1 spent a quarter of their total spend on fresh, organic produce. This was significantly higher than for other customers, who spent less than 5% on this category. This segment was called ‘Fresh food lovers’ because that is what distinguished it from the rest of the customers.
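The kind of profiling described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the purchase records, category names and function names below are hypothetical, not the retailer's actual data or method. The idea is to compare each cluster's share of spend in a category against the share for the whole customer base.

```python
from collections import defaultdict

# Hypothetical records: (cluster label, product category, spend).
# The figures are made up to mirror the example in the text.
purchases = [
    (1, "fresh_produce", 25.0), (1, "other", 75.0),
    (2, "fresh_produce", 4.0), (2, "other", 96.0),
    (3, "fresh_produce", 3.0), (3, "other", 97.0),
]

def category_share_by_cluster(records, category):
    """Return {cluster: fraction of that cluster's spend in `category`}."""
    total = defaultdict(float)
    in_category = defaultdict(float)
    for cluster, cat, spend in records:
        total[cluster] += spend
        if cat == category:
            in_category[cluster] += spend
    return {c: in_category[c] / total[c] for c in total}

def overall_share(records, category):
    """Fraction of all spend, across every cluster, that falls in `category`."""
    total = sum(spend for _, _, spend in records)
    in_cat = sum(spend for _, cat, spend in records if cat == category)
    return in_cat / total

shares = category_share_by_cluster(purchases, "fresh_produce")
baseline = overall_share(purchases, "fresh_produce")

# An index above 1 means the cluster over-spends on the category
# relative to the parent population; below 1 means it under-spends.
index = {cluster: share / baseline for cluster, share in shares.items()}
```

A cluster whose index is well above 1 for a category is a candidate for a descriptive name based on that category, which is how a label like ‘Fresh food lovers’ could be arrived at.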
Types of clustering
There are different algorithms available for clustering, and each of them may give a different set of clusters. The choice of a particular method will depend on the objective of clustering, the type of output desired, the hardware and software facilities available and the size of the dataset. In general, clustering techniques may be divided into two categories based on the cluster structure which they produce.
The non-hierarchical methods divide a dataset of N objects into M clusters. K-means, a non-hierarchical technique, is the most commonly used one in business analytics.
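As a rough illustration of how K-means works, here is a minimal sketch in plain Python. It is not a production implementation: the customer data is hypothetical, the initial centroids are simply a random sample of the points, and the number of clusters k is assumed to be chosen in advance by the analyst. The algorithm alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids

# Hypothetical customers: (share of spend on fresh produce, share on ready meals).
customers = [
    (0.26, 0.03), (0.24, 0.05), (0.28, 0.02),
    (0.04, 0.30), (0.03, 0.27), (0.05, 0.33),
]
labels, centroids = k_means(customers, k=2)
```

With data this well separated, the first three customers end up in one cluster and the last three in the other. On real data the result can depend on the starting centroids, so it is common to run the algorithm several times with different seeds.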
The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.
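The nesting described above can be illustrated with a small bottom-up (agglomerative) sketch in plain Python. Two assumptions are made here that the text does not specify: the distance between clusters is taken as the distance between their closest members (single linkage), and the data points are hypothetical. Each point starts as its own cluster, and the two closest clusters are merged repeatedly until only one remains.

```python
def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(cluster_a, cluster_b, points):
    """Distance between two clusters = distance between their closest members."""
    return min(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)

def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge history.

    Clusters are tuples of point indices. Each history entry is
    (left cluster, right cluster, distance at which they were merged).
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical one-dimensional attribute (for example, pack size) for five products.
products = [(1.0,), (1.2,), (5.0,), (5.1,), (9.0,)]
history = agglomerate(products)
```

The merge history is the nested structure: cutting it at a chosen distance yields a flat set of clusters, so the analyst does not have to fix the number of clusters before running the method.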
When to use clustering?
Clustering is primarily used to perform segmentation, be it of customers, products or stores. We have already talked about customer segmentation using cluster analysis in the example above. Similarly, products can be clustered into hierarchical groups based on attributes such as use, size, brand and flavor, and stores with similar characteristics (sales, size, customer base, etc.) can be clustered together.
Clustering can also be used for anomaly detection, for example, identifying fraudulent transactions. Cluster detection methods can be applied to a sample containing only good transactions to determine the shape and size of the “normal” cluster. When a transaction comes along that falls outside that cluster for any reason, it is suspect. This approach has been used in medicine to detect the presence of abnormal cells in tissue samples and in telecommunications to detect calling patterns indicative of fraud.
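A very simplified sketch of this idea in plain Python follows. It assumes the “normal” cluster can be summarized by a center and a radius, which is a strong simplification of “shape and size”; the transaction features, values and function names are hypothetical. In practice the features would usually be scaled first so that no single one dominates the distance.

```python
def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def fit_normal_cluster(good_points):
    """Summarize known-good points as (center, radius of the farthest member)."""
    center = centroid(good_points)
    radius = max(distance(p, center) for p in good_points)
    return center, radius

def is_suspect(point, center, radius, tolerance=1.0):
    """Flag a point that lies farther from the center than radius * tolerance."""
    return distance(point, center) > radius * tolerance

# Hypothetical good transactions: (amount, number of items).
good_transactions = [(20.0, 1.0), (25.0, 2.0), (22.0, 1.0), (30.0, 3.0)]
center, radius = fit_normal_cluster(good_transactions)

ordinary = is_suspect((23.0, 2.0), center, radius)   # inside the normal cluster
unusual = is_suspect((500.0, 1.0), center, radius)   # far outside it
```

A flagged transaction is only a candidate for review, not proof of fraud. The tolerance parameter is an illustrative knob for trading missed anomalies against false alarms.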
Clustering is often used to break a large set of data into smaller groups that are more amenable to other techniques. For example, logistic regression results can sometimes be improved by fitting the model separately on smaller clusters that behave differently and may follow slightly different distributions.
In summary, clustering is a powerful technique for exploring patterns and structures within data and has wide applications in business analytics. There are various methods for clustering. An analyst should be familiar with multiple clustering algorithms and be able to apply the technique most relevant to the business need.