Kmeans | Spectrify

K-Means Clustering

K-Means is an unsupervised machine learning algorithm that partitions a dataset
into K distinct, non-overlapping clusters. It aims to group similar data points together
while keeping dissimilar data points in different clusters. The algorithm
works by iteratively assigning each data point to the nearest cluster centroid and then
updating each centroid to the mean of the points assigned to it.

Configurable Parameters:

Dimensionality reduction: (COMING SOON) Method for reducing the number of features in spectra, such as PCA or Peaks.
Number of clusters: Specifies the number of different groups to divide your data into. This is a crucial parameter that determines the granularity of the clustering.

Visual Example

K-Means clustering can label similar data together, allowing for the identification
of distinct patterns or sample types within a dataset.

Spectrify allows users to easily identify similar spectra and create a scatter plot of selected peaks, with the cluster centroids marked on the plot.

Key Concepts

Centroids: The mean point of all data points within a cluster. Each centroid serves as the representative of its cluster. Centroids are initially placed randomly and are iteratively updated as the algorithm progresses.

Inertia: The sum of squared distances between each data point and its assigned centroid. It measures how internally coherent clusters are. Lower inertia indicates more compact clusters. Because inertia generally keeps decreasing as the number of clusters grows, the number of clusters is usually chosen at the "elbow": the point where adding another cluster yields only a small further drop in inertia.
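As a rough illustration of the inertia definition, the sketch below computes it for a few hand-picked centroid sets. The toy points and the helper names `squared_distance` and `inertia` are hypothetical and chosen only for this example; they are not part of Spectrify.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def inertia(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Hypothetical toy data: two tight pairs of 2-D points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_centroid = [(5.0, 1.0)]                # mean of all four points
two_centroids = [(0.0, 1.0), (10.0, 1.0)]  # mean of each pair
four_centroids = list(points)              # one centroid per point

print(inertia(points, one_centroid))    # 104.0
print(inertia(points, two_centroids))   # 4.0
print(inertia(points, four_centroids))  # 0.0
```

The large drop from one to two centroids, followed by a much smaller drop afterwards, is the kind of "elbow" that suggests two clusters fit this toy data well.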

Data Example

Let's consider a simplified dataset of spectral intensities at different wavelengths for various samples. Imagine plotting them in a 3D scatter plot.

Sample	450 nm	550 nm	650 nm
A	0.2	0.5	0.8
B	0.3	0.6	0.7
C	0.8	0.3	0.2
D	0.7	0.4	0.3

After applying K-Means clustering with 2 clusters, we might get two main groups:

Sample	Cluster
A	0
B	0
C	1
D	1

And the centroids for these clusters could be:

Cluster	450 nm	550 nm	650 nm
0	0.25	0.55	0.75
1	0.75	0.35	0.25
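The centroid values above can be checked by averaging the members of each cluster wavelength by wavelength. The following is a minimal sketch using only the Python standard library; the variable names `spectra`, `labels`, and `centroid` are illustrative and not taken from Spectrify.

```python
# Intensities at 450, 550 and 650 nm, copied from the sample table.
spectra = {
    "A": (0.2, 0.5, 0.8),
    "B": (0.3, 0.6, 0.7),
    "C": (0.8, 0.3, 0.2),
    "D": (0.7, 0.4, 0.3),
}

# Cluster assignment copied from the cluster table.
labels = {"A": 0, "B": 0, "C": 1, "D": 1}


def centroid(points):
    """Per-wavelength mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


centroids = {}
for cluster in sorted(set(labels.values())):
    members = [spectra[name] for name, c in labels.items() if c == cluster]
    centroids[cluster] = centroid(members)

for cluster, values in centroids.items():
    print(cluster, [round(v, 2) for v in values])
```

Up to floating-point rounding, the printed values should match the centroid table: cluster 0 averages samples A and B, and cluster 1 averages samples C and D.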

Mathematical Explanation

1. Initialize K centroids randomly.
2. Assign each data point to the nearest centroid: For each data point $x_i$, assign it to cluster $C_j$ where:

$j = \arg\min_{k} \lVert x_i - \mu_k \rVert^2$

Here $\mu_k$ is the centroid of cluster $k$.

3. Update centroids: For each cluster $j$, update its centroid to be the mean of all points assigned to it:

$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$

4. Repeat steps 2 and 3 until convergence (the assignments no longer change) or a maximum number of iterations is reached.
5. Calculate inertia:

$\text{Inertia} = \sum_{i=1}^n \min_{j \in \{1, \dots, K\}} \lVert x_i - \mu_j \rVert^2$

Where $n$ is the number of data points and $\mu_1, \dots, \mu_K$ are the centroids.
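The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not Spectrify's implementation: the function name `kmeans` and its parameters `max_iter` and `seed` are hypothetical, and the random initialization is assumed to pick K distinct data points as the starting centroids.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids, inertia). Assumes max_iter >= 1."""
    rng = random.Random(seed)
    # Step 1 (assumption): use k distinct data points as initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    # Step 5: sum of squared distances to the assigned centroids.
    inertia = sum(
        squared_distance(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return labels, centroids, inertia


# Samples A, B, C, D from the Data Example (450, 550, 650 nm).
spectra = [(0.2, 0.5, 0.8), (0.3, 0.6, 0.7), (0.8, 0.3, 0.2), (0.7, 0.4, 0.3)]
labels, centroids, inertia = kmeans(spectra, k=2)
print(labels)
print(centroids)
print(inertia)
```

On this small dataset the run should group A with B and C with D, although which group receives label 0 depends on the random initialization. In general K-Means only reaches a local optimum, so results on larger datasets can vary between seeds.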
