KMeans Clustering
KMeans is an unsupervised machine learning algorithm that partitions a dataset
into K distinct, non-overlapping clusters. It aims to group similar data points together
while keeping dissimilar data points in different clusters. The algorithm
works by iteratively assigning each data point to the nearest cluster centroid and then
updating each centroid to the mean of all points assigned to its cluster.
Configurable Parameters:

Dimensionality reduction: (COMING SOON) Method for reducing the number of features in the spectra before clustering, such as PCA or Peaks.

Number of clusters: Specifies the number of different groups to divide your data into. This is a crucial parameter that determines the granularity of the clustering.
Visual Example
KMeans clustering can label similar data together, allowing for the identification
of distinct patterns or sample types within a dataset.
Spectrify allows users to easily identify similar spectra and create a scatter plot of selected peaks. Centroids are also indicated on the plot.
Key Concepts

Centroids: The mean point of all data points within a cluster. These serve as the representatives of each cluster.
Centroids are initially placed randomly and are iteratively updated as the algorithm progresses.

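Inertia: The sum of squared distances between each data point and its assigned centroid. It measures how internally coherent clusters are.
Lower inertia indicates more compact clusters. Because inertia generally decreases as the number of clusters grows, it is usually compared across several values of K, and the point where the improvement levels off (the "elbow") is taken as a reasonable number of clusters.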
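To make the two concepts concrete, here is a minimal sketch in plain Python. The helper names `centroid` and `inertia` and the four 2D points are hypothetical and chosen only for illustration; they are not part of Spectrify. The sketch compares a compact grouping of the points with a mixed one and shows that the compact grouping has the lower inertia.

```python
def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def inertia(clusters):
    """Sum of squared distances from each point to its own cluster's centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total


# Hypothetical 2D points that form two obvious groups along the first axis.
tight = [[(1.0, 1.0), (2.0, 1.0)], [(9.0, 1.0), (10.0, 1.0)]]
mixed = [[(1.0, 1.0), (9.0, 1.0)], [(2.0, 1.0), (10.0, 1.0)]]

print(centroid(tight[0]))  # (1.5, 1.0)
print(inertia(tight))      # 1.0
print(inertia(mixed))      # 64.0
```

Both groupings use the same points and the same number of clusters, so the difference in inertia comes only from how compact the clusters are.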
Data Example
Let's consider a simplified dataset of spectral intensities at different wavelengths for various samples. Imagine plotting them in a 3D scatter plot.
| Sample | 450 nm | 550 nm | 650 nm |
|--------|--------|--------|--------|
| A      | 0.2    | 0.5    | 0.8    |
| B      | 0.3    | 0.6    | 0.7    |
| C      | 0.8    | 0.3    | 0.2    |
| D      | 0.7    | 0.4    | 0.3    |
After applying KMeans clustering with 2 clusters, we might get two main groups:
| Sample | Cluster |
|--------|---------|
| A      | 0       |
| B      | 0       |
| C      | 1       |
| D      | 1       |
And the centroids for these clusters could be:
| Cluster | 450 nm | 550 nm | 650 nm |
|---------|--------|--------|--------|
| 0       | 0.25   | 0.55   | 0.75   |
| 1       | 0.75   | 0.35   | 0.25   |
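The centroid values can be checked by hand: each one is the column-wise mean of the samples in its cluster. The following sketch, in plain Python, recomputes them from the sample intensities and cluster labels given in this example, and also computes the resulting inertia. The variable names are illustrative only.

```python
# Intensities at 450 nm, 550 nm and 650 nm, taken from the example data.
samples = {
    "A": (0.2, 0.5, 0.8),
    "B": (0.3, 0.6, 0.7),
    "C": (0.8, 0.3, 0.2),
    "D": (0.7, 0.4, 0.3),
}
# Cluster labels from the example result.
labels = {"A": 0, "B": 0, "C": 1, "D": 1}

# Group the sample vectors by cluster label.
clusters = {}
for name, label in labels.items():
    clusters.setdefault(label, []).append(samples[name])

# Each centroid is the column-wise mean of its cluster's samples.
centroids = {
    label: tuple(sum(col) / len(points) for col in zip(*points))
    for label, points in clusters.items()
}

# Inertia: squared distance of every sample to its assigned centroid, summed.
inertia = sum(
    sum((x - c) ** 2 for x, c in zip(samples[name], centroids[label]))
    for name, label in labels.items()
)

for label, c in sorted(centroids.items()):
    print(label, [round(v, 2) for v in c])
# 0 [0.25, 0.55, 0.75]
# 1 [0.75, 0.35, 0.25]
print(round(inertia, 4))  # 0.03
```

Every sample lies 0.05 away from its centroid on each of the three axes, so each contributes 3 × 0.05² = 0.0075 and the four samples together give an inertia of 0.03.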
Mathematical Explanation

1. Initialize K centroids randomly.

2. Assign each data point to the nearest centroid: For each data point $x_i$, assign it to cluster $C_j$ if:
$j = \arg\min_{k \in \{1, \dots, K\}} \lVert x_i - \mu_k \rVert^2$
Where $\mu_k$ is the centroid of cluster $k$.

3. Update centroids: For each cluster $j$, update its centroid to be the mean of all points assigned to it:
$\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i$
Where $\lvert C_j \rvert$ is the number of points in cluster $j$.

4. Repeat steps 2 and 3 until convergence (the assignments no longer change) or a maximum number of iterations is reached.

5. Calculate inertia:
$\text{Inertia} = \sum_{i=1}^n \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2$
Where $n$ is the number of data points and $C$ is the set of all centroids.
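The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not Spectrify's implementation: the function names `sq_dist` and `kmeans`, the `max_iter` and `seed` parameters, and the choice to initialize centroids by picking K random data points are all assumptions made for the example. An empty cluster simply keeps its previous centroid, which is one of several possible conventions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: use K distinct data points as the initial centroids (an assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    # Step 5: inertia is the summed squared distance to the assigned centroids.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


# The four samples A, B, C, D from the data example.
data = [(0.2, 0.5, 0.8), (0.3, 0.6, 0.7), (0.8, 0.3, 0.2), (0.7, 0.4, 0.3)]
labels, centroids, inertia = kmeans(data, k=2)
print(labels)
print(round(inertia, 4))  # 0.03
```

The cluster numbers themselves are arbitrary and depend on the random initialization, so A and B may come out as cluster 0 or cluster 1. What stays the same for this data is the grouping: A and B share one label, and C and D share the other.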