Clustering with Python — Hierarchical Clustering

Hierarchical Clustering Algorithm

Hierarchical clustering, also called hierarchical cluster analysis (HCA), is an unsupervised clustering algorithm that builds a hierarchy of clusters, ordered from top to bottom.

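There are two types. Both follow the same idea but work in opposite directions:

  1. Agglomerative Hierarchical Clustering (bottom-up): every point starts as its own cluster, and the closest clusters are merged step by step.
  2. Divisive Hierarchical Clustering (top-down): all points start in a single cluster, which is split step by step.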

Linkage Methods (distance between two clusters)

There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. Some of the common linkage methods are:

  • Complete-linkage: the distance between two clusters is the longest distance between a point in one cluster and a point in the other.
  • Single-linkage: the distance between two clusters is the shortest distance between a point in one cluster and a point in the other. This linkage can help detect extreme values in your dataset that may be outliers, because they tend to be merged last.
  • Average-linkage: the distance between two clusters is the average distance from each point in one cluster to every point in the other cluster.
  • Centroid-linkage: finds the centroid of cluster 1 and the centroid of cluster 2, and uses the distance between the two centroids to decide which clusters to merge.
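To make the four definitions above concrete, here is a minimal sketch that computes each linkage distance by hand for two tiny clusters. The point values and the names `cluster_a` and `cluster_b` are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny hypothetical clusters of 2-D points (made-up values).
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between a point in A and a point in B.
pairwise = cdist(cluster_a, cluster_b)

complete_link = pairwise.max()   # longest pairwise distance
single_link = pairwise.min()     # shortest pairwise distance
average_link = pairwise.mean()   # mean of all pairwise distances
# Distance between the two cluster centroids.
centroid_link = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

print(complete_link, single_link, average_link, centroid_link)
```

In practice you would not compute these yourself; the linkage method is passed as a parameter to the clustering routine.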

What is a Dendrogram?

A Dendrogram is a type of tree diagram showing hierarchical relationships between different sets of data.

A dendrogram acts as the memory of the hierarchical clustering algorithm: it records every merge and the distance at which it happened, so just by looking at the dendrogram you can tell how the clusters were formed.

The merging process stops once all clusters have been merged into one (the big circle that encloses every point).
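The stopping point can be seen in the linkage matrix that SciPy returns. The sketch below uses a handful of made-up points: with n points there are always n - 1 merges, and the last merge produces one cluster containing all n points.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five made-up 2-D points, used only for illustration.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0], [10.0, 0.0]])

# Each row of the result describes one merge:
# [index of cluster i, index of cluster j, merge distance, size of new cluster]
merges = linkage(points, method="ward")

n = len(points)
print(merges.shape)        # one row per merge, so n - 1 rows
print(int(merges[-1, 3]))  # size of the final cluster, which holds all n points
```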

Form the clusters (number of dissimilar clusters)

Cut the dendrogram with a horizontal line at the height where the line can move the maximum distance up and down without crossing a merging point.

For example, in the figure below, L3 can move the maximum distance up and down without crossing a merging point. So we draw the horizontal line there, and the number of vertical lines it intersects is the optimal number of clusters.

  1. Mark all the vertical lines. The height of each one represents the distance between the clusters being merged.
  2. Extend the horizontal lines across the whole plot.
  3. Find the vertical lines that are not cut by any of the extended horizontal lines. These separate clusters that are clearly different.
  4. Find the tallest vertical line that remains uncut.
  5. Draw the cut through that line. Here it crosses 3 vertical lines, so the number of clusters is 3.
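The steps above can be approximated in code. This is a sketch under assumptions, not a standard library routine: it treats the tallest uncut vertical line as the largest gap between consecutive merge heights, cuts halfway through that gap, and uses synthetic data with three well-separated groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: three well-separated blobs of 20 points each (made up).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

merges = linkage(points, method="ward")
heights = merges[:, 2]  # merge distances; non-decreasing for Ward linkage

# The largest gap between consecutive merge heights plays the role of the
# tallest vertical line that no horizontal merge line crosses.
gaps = np.diff(heights)
i = int(np.argmax(gaps))
cut_height = (heights[i] + heights[i + 1]) / 2

# Cut the tree at that height and count the resulting clusters.
labels = fcluster(merges, t=cut_height, criterion="distance")
n_clusters = len(np.unique(labels))
print(n_clusters)
```

On real data the largest gap is only a heuristic, so it is worth checking the suggested number against the dendrogram plot itself.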

Distance between clusters

Python implementation

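import scipy.cluster.hierarchy as sch

# data: array of shape (n_samples, n_features)
# Build the linkage matrix with Ward's method and draw the dendrogram.
dendrogram = sch.dendrogram(sch.linkage(data, method='ward'))

Note that method must be set explicitly here: SciPy's linkage defaults to 'single'. 'ward' is the default linkage in sklearn's AgglomerativeClustering.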

Fit the model and predict the results

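from sklearn.cluster import AgglomerativeClustering

# Ward linkage works only with Euclidean distance, which is the default metric.
# Older scikit-learn releases accepted affinity='euclidean'; that parameter was
# renamed to metric in version 1.2 and removed in version 1.4.
agglomerative = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = agglomerative.fit_predict(data)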

Plot the results
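One way to plot the result is a scatter plot colored by cluster label. The sketch below is self-contained and makes several assumptions: the data is a synthetic 2-D set from `make_blobs`, the output file name is arbitrary, and the non-interactive `Agg` backend is selected so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic 2-D data standing in for the real dataset (an assumption).
data, _ = make_blobs(n_samples=150, centers=5, random_state=42)

agglomerative = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = agglomerative.fit_predict(data)

# Color each point by the cluster it was assigned to.
fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1], c=labels, cmap="viridis", s=20)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.set_title("Agglomerative clustering (Ward linkage, 5 clusters)")
fig.savefig("agglomerative_clusters.png")  # hypothetical output file name
```

For data with more than two features, you would first pick two columns or project the data to two dimensions before plotting.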

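Predict New Data Points

This is not available in sklearn: AgglomerativeClustering has no predict method, so it cannot assign new data points to the clusters it found. It only labels the data it was fitted on, through fit or fit_predict.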
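A common workaround, shown here as a sketch and not as part of AgglomerativeClustering itself, is to train a simple classifier on the cluster labels and use it to assign new points. The example uses sklearn's `NearestCentroid` as the swapped-in technique. The data and new points are made up.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import NearestCentroid

# Made-up data with two obvious groups.
data = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.4],
                 [10.0, 10.0], [10.3, 9.8], [9.9, 10.4]])

labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(data)

# Treat the cluster labels as classes and fit a nearest-centroid classifier.
assigner = NearestCentroid().fit(data, labels)

# Assign unseen points to the cluster whose centroid is closest.
new_points = np.array([[0.2, 0.1], [9.5, 10.1]])
new_labels = assigner.predict(new_points)
print(new_labels)
```

This assigns each new point to the nearest cluster centroid, which matches the hierarchy only approximately. The clusters themselves are not updated by the new points.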