Iris and Digits Clustering using K-Means
Clustering data using the K-Means algorithm, focusing on minimizing centroid distances to improve cluster quality.
Title: PySpark - Iris and Digits Clustering using K-Means
Source code link: https://github.com/ompranavagrawal/Iris_and_Images_Clustering_using_kMeans.git
Objective
The project aims to cluster data accurately using the K-Means algorithm, minimizing the within-cluster distances of points to their centroids to improve cluster quality.
Preprocessing
Data normalization and image smoothing techniques, including Gaussian smoothing, were employed to enhance clustering performance.

Figure: Gaussian smoothing for noise removal
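As a rough illustration of this step, the sketch below applies Gaussian smoothing and min-max normalization to the Digits images. It assumes SciPy's gaussian_filter and scikit-learn's load_digits; the sigma value is illustrative, not necessarily the project's setting.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.datasets import load_digits

# Load the 8x8 grayscale Digits images (1797 samples).
images = load_digits().images

# Smooth each image with a Gaussian kernel to suppress pixel-level noise.
smoothed = np.stack([gaussian_filter(img, sigma=0.5) for img in images])

# Flatten to feature vectors and rescale all pixel values to [0, 1].
X = smoothed.reshape(len(smoothed), -1)
X = (X - X.min()) / (X.max() - X.min())
```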
Dimensionality Reduction
Truncated SVD and t-SNE were used for reducing dimensions, alongside MinMaxScaler for feature scaling, to prepare data for clustering.
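A minimal sketch of that pipeline, assuming scikit-learn's MinMaxScaler, TruncatedSVD, and TSNE; the component counts and perplexity below are illustrative rather than the project's exact settings:

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

X = load_digits().data                                # (1797, 64) flattened images
X_scaled = MinMaxScaler().fit_transform(X)            # features rescaled to [0, 1]
X_svd = TruncatedSVD(n_components=20, random_state=0).fit_transform(X_scaled)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_svd)
```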

Clustering
K-Means Algorithm Application
- Initial centroids were chosen from data points to avoid bad initializations.
- The Euclidean distance was used for assigning data points to the nearest centroid.
- The assignment and centroid-update steps were repeated until the k centroids stabilized, optimizing cluster quality (see the sketch after this list).
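The following is a minimal NumPy sketch of the loop described above: initial centroids drawn from the data points, Euclidean-distance assignments, and iteration until the centroids stop moving. Details such as the stopping criterion and iteration cap are assumptions and may differ from the project's code.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initial centroids are drawn from the data points themselves.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):  # converged: centroids stopped moving
            break
        centroids = updated
    return centroids, labels

# Example usage: 3 clusters for Iris-like data, 10 for Digits-like data.
# centroids, labels = kmeans(X, k=3)
```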
Observations
- Choosing initial centroids from actual data points yielded better results than arbitrary initializations.
- Optimal clustering for the Iris dataset was achieved with 3 clusters, as indicated by silhouette scores and inertia calculations.
- For the Digits dataset, 10 clusters provided the best silhouette scores, indicating optimal clustering.

Figure: Inertia and silhouette score for the Iris dataset

Figure: Inertia and silhouette score for the Digits images
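Curves like those referenced above can be reproduced by sweeping the number of clusters and recording both metrics; a minimal sketch assuming scikit-learn's KMeans and silhouette_score on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")
```

Per the observations above, the elbow in inertia and the peak silhouette land around k = 3 for Iris, while the same sweep on the Digits features points to k = 10.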
Conclusion
K-Means effectively clustered the Iris and Digits datasets, demonstrating the algorithm's capability to handle and cluster large datasets efficiently.