PySpark - Iris and Digits Clustering using K-Means

Description: Clustering data using the K-Means algorithm, focusing on minimizing centroid distances to improve cluster quality.
Source code link: https://github.com/ompranavagrawal/Iris_and_Images_Clustering_using_kMeans.git

Objective

The project aims to accurately cluster data using the K-Means algorithm, focusing on minimizing the distances between data points and their assigned centroids to improve cluster quality.
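
For reference, the quantity K-Means minimizes is the within-cluster sum of squared distances to the centroids, i.e. the inertia used later for model selection; a standard formulation is

    J = \sum_{i=1}^{n} \min_{1 \le k \le K} \lVert x_i - \mu_k \rVert^2

where the x_i are the data points and \mu_1, ..., \mu_K are the K centroids.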


Preprocessing

Data normalization and image smoothing techniques, including Gaussian smoothing, were employed to enhance clustering performance.


Gaussian smoothing for noise removal
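
As a rough sketch of this preprocessing step, assuming the smoothing is applied to the 8x8 Digits images with scipy and scikit-learn (the write-up does not name the exact libraries or the sigma used):

  import numpy as np
  from scipy.ndimage import gaussian_filter
  from sklearn.datasets import load_digits

  # Load the 8x8 Digits images (assumed data source for the image experiments).
  digits = load_digits()
  images = digits.images                      # shape (n_samples, 8, 8)

  # Gaussian smoothing to suppress pixel-level noise before clustering.
  smoothed = np.stack([gaussian_filter(img, sigma=1.0) for img in images])

  # Flatten to feature vectors and normalize to the [0, 1] range.
  X = smoothed.reshape(len(smoothed), -1)
  X = (X - X.min()) / (X.max() - X.min())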


Dimensionality Reduction

Truncated SVD and t-SNE were used for reducing dimensions, alongside MinMaxScaler for feature scaling, to prepare data for clustering.
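
A minimal sketch of this pipeline, assuming the scikit-learn implementations of these transformers and continuing from the smoothed, flattened matrix X above (the component counts are illustrative, not the project's actual settings):

  from sklearn.preprocessing import MinMaxScaler
  from sklearn.decomposition import TruncatedSVD
  from sklearn.manifold import TSNE

  # Scale every feature to a common [0, 1] range before reduction.
  X_scaled = MinMaxScaler().fit_transform(X)

  # Truncated SVD keeps only the leading components of the data.
  X_svd = TruncatedSVD(n_components=16, random_state=0).fit_transform(X_scaled)

  # t-SNE projects to 2-D, mainly useful for visualizing cluster structure.
  X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_svd)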


Clustering


K-Means Algorithm Application

  • Initial centroids were chosen from existing data points rather than arbitrary coordinates, to avoid poor initializations.
  • Euclidean distance was used to assign each data point to its nearest centroid.
  • The assignment and centroid-update steps were repeated until the k centroids stabilized, optimizing cluster quality (a sketch of this loop follows the list).
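
A minimal NumPy sketch of the loop described above, assuming centroids are seeded by sampling k distinct data points (the function and parameter names are illustrative, not the project's):

  import numpy as np

  def kmeans(X, k, n_iters=100, seed=0):
      rng = np.random.default_rng(seed)
      # Seed centroids with k distinct data points to avoid bad initializations.
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      labels = np.zeros(len(X), dtype=int)
      for _ in range(n_iters):
          # Assign each point to its nearest centroid by Euclidean distance.
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # Recompute each centroid as the mean of its assigned points.
          new_centroids = np.array([
              X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
              for j in range(k)
          ])
          if np.allclose(new_centroids, centroids):
              break  # Centroids have stabilized; the clustering has converged.
          centroids = new_centroids
      return centroids, labels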


Observations

  • Choosing initial centroids from existing data points yielded better clustering results.
  • Optimal clustering for the Iris dataset was achieved with 3 clusters, as indicated by its silhouette scores and inertia values.
  • For the Digits dataset, 10 clusters gave the best silhouette scores, indicating optimal clustering (a sketch of this model-selection sweep follows the list).
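
A sketch of how such a sweep over candidate cluster counts can be run, assuming scikit-learn's KMeans and silhouette_score (the range of k shown is illustrative):

  from sklearn.cluster import KMeans
  from sklearn.metrics import silhouette_score

  # Evaluate inertia and silhouette score for a range of candidate k values.
  for k in range(2, 12):
      model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
      sil = silhouette_score(X, model.labels_)
      print(f"k={k:2d}  inertia={model.inertia_:10.2f}  silhouette={sil:.3f}")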


Inertia and Silhouette score for Iris


Inertia and Silhouette score for the Digits (image) dataset


Conclusion

K-Means clustering effectively assigned cluster labels to both the Iris and Digits datasets, demonstrating the algorithm's ability to cluster tabular and image data efficiently.