K-Means Clustering

KMeans is an unsupervised machine learning algorithm commonly used to partition a dataset
into K distinct, non-overlapping clusters. It aims to group similar data points together
while keeping dissimilar data points in separate clusters. The algorithm works by
iteratively assigning each data point to the nearest cluster centroid and then updating
each centroid to the mean of all points assigned to it.
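
For readers who want to try this assign-and-update loop directly, here is a minimal sketch using scikit-learn's KMeans. This is an illustration only, not necessarily what Spectrify uses internally, and the small `spectra` array is made-up example data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up example data: six spectra, each with intensities at three wavelengths.
spectra = np.array([
    [0.2, 0.5, 0.8],
    [0.3, 0.6, 0.7],
    [0.8, 0.3, 0.2],
    [0.7, 0.4, 0.3],
    [0.25, 0.55, 0.75],
    [0.75, 0.35, 0.25],
])

# KMeans repeats the assign-to-nearest-centroid / recompute-centroid loop
# until the assignments stop changing (or an iteration limit is hit).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(spectra)

print("Cluster labels:", labels)              # which cluster each spectrum was assigned to
print("Centroids:\n", kmeans.cluster_centers_)
```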



Configurable Parameters:
  • Dimensionality reduction: (COMING SOON) Method for reducing the number of features in spectra, such as PCA or Peaks.

  • Number of clusters: Specifies the number of groups to divide your data into. This is a crucial parameter that determines the granularity of the clustering (see the sketch after this list).
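
These two parameters correspond to a standard reduce-then-cluster workflow. Since the dimensionality reduction option is not yet available in Spectrify, the sketch below only illustrates the general idea, with scikit-learn's PCA standing in for that step; the data shape, number of components, and number of clusters are assumed values for the example:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
spectra = rng.random((50, 200))   # made-up data: 50 spectra x 200 wavelength bins

# Dimensionality reduction (here PCA) summarizes each spectrum with a few
# components; "Number of clusters" maps to the n_clusters argument of KMeans.
pipeline = make_pipeline(
    PCA(n_components=5),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(spectra)
print(labels[:10])
```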


Visual Example

K-Means clustering groups similar data together, allowing distinct patterns or sample types
within a dataset to be identified.

Figure: Data before and after KMeans

Spectrify allows users to easily identify similar spectra and create a scatter plot of selected
peaks. Centroids are also indicated on the plot.

Figure: KMeans applied to spectra

Key Concepts

  • Centroids: The mean point of all data points within a cluster. These serve as the representatives of each cluster.

    Centroids are initially placed randomly and are iteratively updated as the algorithm progresses.


  • Inertia: The sum of squared distances between each data point and its assigned centroid. It measures how internally coherent clusters are.

    Lower inertia indicates more compact clusters. This metric is often used to determine the optimal number of clusters, for example via the elbow method (see the sketch below).
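
Most K-Means implementations expose both quantities after fitting; scikit-learn, for instance, reports them as cluster_centers_ and inertia_. The sketch below uses made-up two-blob data to scan inertia over several values of K, which is the basis of the elbow method mentioned above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Made-up data: two well-separated blobs in a 3-D feature space.
data = np.vstack([
    rng.normal(loc=0.3, scale=0.05, size=(20, 3)),
    rng.normal(loc=0.7, scale=0.05, size=(20, 3)),
])

# Inertia always decreases as K grows; the "elbow" where the curve flattens
# is a common heuristic for picking the number of clusters.
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(f"K={k}  inertia={model.inertia_:.4f}")

# Centroids of the last fitted model (K=5 in this loop).
print(model.cluster_centers_)
```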


Data Example

Let's consider a simplified dataset of spectral intensities at different wavelengths for various samples. Imagine plotting them in a 3D scatter plot.

Sample   450 nm   550 nm   650 nm
A        0.2      0.5      0.8
B        0.3      0.6      0.7
C        0.8      0.3      0.2
D        0.7      0.4      0.3

After applying KMeans clustering with 2 clusters, we might get two main groups:

Sample   Cluster
A        0
B        0
C        1
D        1

And the centroids for these clusters could be:

Cluster   450 nm   550 nm   650 nm
0         0.25     0.55     0.75
1         0.75     0.35     0.25
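
As a check, the sketch below runs K-Means with two clusters on exactly these four samples using scikit-learn. Note that which group receives label 0 or 1 is arbitrary, so the labels may come out swapped relative to the table:

```python
import numpy as np
from sklearn.cluster import KMeans

samples = ["A", "B", "C", "D"]
# Intensities at 450 nm, 550 nm, 650 nm (same values as the table above).
X = np.array([
    [0.2, 0.5, 0.8],   # A
    [0.3, 0.6, 0.7],   # B
    [0.8, 0.3, 0.2],   # C
    [0.7, 0.4, 0.3],   # D
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for name, label in zip(samples, kmeans.labels_):
    print(f"Sample {name} -> cluster {label}")

# Each centroid is the mean of its members, e.g. mean(A, B) = (0.25, 0.55, 0.75).
print("Centroids:\n", kmeans.cluster_centers_)
```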

Mathematical Explanation

  1. Initialize K centroids randomly.

  2. Assign each data point to the nearest centroid: For each data point $x_i$, assign it to
     cluster $C_j$ if:

$$j = \arg\min_j \lVert x_i - \mu_j \rVert^2$$

Where $\mu_j$ is the centroid of cluster $j$.


  3. Update centroids: For each cluster $j$, update its centroid to be the mean of all points
     assigned to it:

$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$

  4. Repeat steps 2 and 3 until convergence or a maximum number of iterations is reached.

  5. Calculate inertia:

$$\text{Inertia} = \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2$$

Where $n$ is the number of data points and $C$ is the set of all centroids.
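
These steps translate almost line for line into code. The following NumPy sketch is one way to implement them; the initialization scheme (sampling K data points), the iteration cap, and the convergence tolerance are implementation choices made for this example, not part of the definition above:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids (here: K distinct data points chosen at random).
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        # (argmin over distances to all centroids).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: update each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop once the centroids stop moving (convergence).
        if np.allclose(new_centroids, centroids, atol=tol):
            centroids = new_centroids
            break
        centroids = new_centroids

    # Final assignment and Step 5: inertia is the sum of squared distances
    # from every point to its assigned centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    inertia = (distances[np.arange(len(X)), labels] ** 2).sum()
    return labels, centroids, inertia

# Example: the four samples from the Data Example section above.
X = np.array([[0.2, 0.5, 0.8], [0.3, 0.6, 0.7], [0.8, 0.3, 0.2], [0.7, 0.4, 0.3]])
labels, centroids, inertia = kmeans(X, k=2)
print("labels:", labels)
print("centroids:\n", centroids)
print("inertia:", inertia)
```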