K-Means Clustering
K-Means is an unsupervised machine learning algorithm commonly used to partition a dataset
into K distinct, non-overlapping clusters. It aims to group similar data points
while keeping dissimilar data points in different clusters. The algorithm
works by iteratively assigning data points to the nearest cluster centroid and then
updating the centroids based on the mean of all points in each cluster.
Configurable Parameters:
- Dimensionality reduction: (COMING SOON) Method for reducing the number of features in each spectrum, such as PCA or Peaks.
- Number of clusters: Specifies the number of groups (K) to divide your data into. This is a crucial parameter that determines the granularity of the clustering.
Visual Example
K-Means clustering can label similar data together, allowing for the identification
of distinct patterns or sample types within a dataset.
Spectrify allows users to easily identify similar spectra and create a scatter plot of selected peaks. Centroids are also indicated.
Key Concepts
- Centroids: The mean point of all data points within a cluster. These serve as the representatives of each cluster. Centroids are initially placed randomly and are iteratively updated as the algorithm progresses.
- Inertia: The sum of squared distances between each data point and its assigned centroid. It measures how internally coherent clusters are. Lower inertia indicates more compact clusters, although it does not directly measure how well separated they are. Because inertia generally decreases as more clusters are added, it is usually used to choose the number of clusters by looking for the "elbow" beyond which adding clusters brings little improvement.
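To make the inertia definition concrete, here is a minimal NumPy sketch. The points, the two candidate centroid sets, and the helper name `inertia` are made up for illustration and are not part of Spectrify. The helper treats each point's nearest centroid as its assigned centroid, which is what the assignment step of K-Means produces.

```python
import numpy as np

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # Pairwise squared distances, shape (n_points, n_centroids).
    diffs = points[:, None, :] - centroids[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Each point contributes the squared distance to its closest centroid.
    return sq_dists.min(axis=1).sum()

# Hypothetical 2D points forming two tight groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

good = np.array([[0.0, 0.5], [10.0, 0.5]])  # one centroid inside each group
poor = np.array([[5.0, 0.0], [5.0, 1.0]])   # both centroids between the groups

inertia_good = inertia(points, good)
inertia_poor = inertia(points, poor)
print(inertia_good, inertia_poor)  # 1.0 100.0
```

The centroids that sit inside the groups give a far lower inertia than the ones placed between them, which is the sense in which lower inertia means more compact clusters.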
Data Example
Let's consider a simplified dataset of spectral intensities at different wavelengths for various samples. Imagine plotting them in a 3D scatter plot.
Sample | 450 nm | 550 nm | 650 nm |
---|---|---|---|
A | 0.2 | 0.5 | 0.8 |
B | 0.3 | 0.6 | 0.7 |
C | 0.8 | 0.3 | 0.2 |
D | 0.7 | 0.4 | 0.3 |
After applying K-Means clustering with 2 clusters, we might get two main groups:
Sample | Cluster |
---|---|
A | 0 |
B | 0 |
C | 1 |
D | 1 |
And the centroids for these clusters could be:
Cluster | 450 nm | 550 nm | 650 nm |
---|---|---|---|
0 | 0.25 | 0.55 | 0.75 |
1 | 0.75 | 0.35 | 0.25 |
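The centroid values can be checked directly from the two tables above. The following is a small NumPy sketch, assuming the cluster labels shown in the assignment table; the variable names are illustrative. It recomputes each centroid as the mean of its samples and then repeats the assignment step to confirm that no sample would switch clusters.

```python
import numpy as np

# Intensities at 450, 550 and 650 nm for samples A-D (from the data table).
X = np.array([
    [0.2, 0.5, 0.8],  # A
    [0.3, 0.6, 0.7],  # B
    [0.8, 0.3, 0.2],  # C
    [0.7, 0.4, 0.3],  # D
])
labels = np.array([0, 0, 1, 1])  # cluster assignments from the cluster table

# Centroid of each cluster = mean of the samples assigned to it.
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(centroids)

# Repeat the assignment step: every sample should already be closest
# to its own centroid, so another iteration would change nothing.
sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
nearest = sq_dists.argmin(axis=1)
print(nearest)  # [0 0 1 1]
```

Since the recomputed assignments equal the original labels, this clustering is a fixed point of the algorithm.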
Mathematical Explanation
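1. Initialize $K$ centroids randomly.
2. Assign each data point to the nearest centroid: For each data point $x_i$, assign it to cluster $j$ if:

   $$\lVert x_i - \mu_j \rVert^2 \le \lVert x_i - \mu_l \rVert^2 \quad \text{for all } l = 1, \dots, K$$

   Where $\mu_j$ is the centroid of cluster $j$.
3. Update centroids: For each cluster $j$, update its centroid to be the mean of all points assigned to it:

   $$\mu_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i$$

   Where $S_j$ is the set of data points currently assigned to cluster $j$.
4. Repeat steps 2 and 3 until convergence or a maximum number of iterations is reached.
5. Calculate inertia:

   $$\text{Inertia} = \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2$$

   Where $n$ is the number of data points and $C$ is the set of all centroids.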
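The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not Spectrify's actual code. The function name `kmeans` and the parameters `max_iter` and `seed` are hypothetical. Initial centroids are drawn from the data points themselves, which is one common way to "initialize randomly".

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means sketch. Returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids (assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 5: inertia from the final centroids and assignments.
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    return labels, centroids, sq_dists.min(axis=1).sum()

# Usage on the four-sample dataset from the Data Example.
X = np.array([
    [0.2, 0.5, 0.8],  # A
    [0.3, 0.6, 0.7],  # B
    [0.8, 0.3, 0.2],  # C
    [0.7, 0.4, 0.3],  # D
])
labels, centroids, total_inertia = kmeans(X, k=2)
print(labels, centroids, total_inertia)
```

On this dataset the sketch groups A with B and C with D. Which group receives label 0 depends on the random initialization, so the labels may be swapped relative to the tables in the Data Example.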