IMAGE DATA CLUSTERING BASED ON THE VGG16 MODEL AND THE K-MEANS ALGORITHM

Muminov B.B., Egamberdiyev E.H. Image data clustering based on the VGG16 model and the k-means algorithm // Universum: технические науки, 2025, 1(130). URL: https://7universum.com/ru/tech/archive/item/19094 (accessed: 17.03.2025).
DOI: 10.32743/UniTech.2025.130.1.19094

 

ABSTRACT

Clustering is a fundamental task in computer vision, yet it is difficult to perform well on raw image or video data. This study proposes a novel image clustering framework that extracts features using the VGG16 model and performs clustering using the k-means algorithm. VGG16 is a pre-trained convolutional neural network that extracts high-level, low-dimensional features from images while retaining their semantic content. K-means is then applied to these features to obtain clusters of similar images. Experimental results on benchmark datasets show that this method significantly outperforms clustering applied directly to raw images or their pixel values. Combining deep feature extraction with traditional clustering proves to be an effective approach for grouping images. This work demonstrates how deep learning models can complement classical machine learning approaches in complex analytics and pattern recognition tasks.


 

Keywords: image clustering, VGG16 model, k-means algorithm, feature extraction, deep learning.


 

Introduction. Clustering plays an important role in computer vision, allowing the analysis of large amounts of image data by grouping similar images based on common features. Applications such as content-based image retrieval, object categorization, and discovering groups in data all rely on clustering [1, p. 68]. However, clustering high-resolution raw image data is challenging due to the complexity of image features, the presence of noise, and the high computational costs associated with such tasks [2, p. 87]. To address this problem, researchers have turned to deep learning models to derive meaningful representations of image data that can improve clustering performance.

Deep convolutional neural networks (CNNs) have become the backbone of computer vision by demonstrating their ability to learn hierarchical feature representations from images [3, p. 90]. Among these models, VGG16, introduced by Simonyan and Zisserman [4, p. 7], became popular for its simplicity and efficiency. VGG16, with its deep, uniform architecture and weights pre-trained on the ImageNet dataset [5, p. 248], captures high-level semantic features, allowing the transformation of raw image data into compact and meaningful feature vectors. These feature vectors significantly reduce the dimensionality of the data while preserving its underlying structure, making them well suited for clustering tasks [6, p. 423]. The k-means algorithm, introduced by Lloyd [7, p. 129] and MacQueen [8, p. 281], is one of the most widely used clustering methods. It minimizes intra-cluster variance, grouping data points into distinct clusters based on their similarity. Despite its efficiency, k-means struggles when applied directly to high-dimensional data, such as raw image pixels, due to the curse of dimensionality [2, p. 93]. By combining the deep feature extraction capabilities of VGG16 with the clustering power of k-means, this paper aims to overcome these challenges and achieve more accurate and efficient clustering.

In this paper, we propose a hybrid approach that uses VGG16 for feature extraction and k-means for clustering. A pre-trained VGG16 model is used to extract feature vectors from an image dataset, reducing dimensionality and highlighting semantic similarities. These extracted features are then fed into the k-means algorithm for clustering. The performance of the proposed approach is evaluated on a benchmark dataset, and the results are compared with clustering methods applied directly to raw pixel data.

This paper contributes to the growing intersection of deep learning and traditional machine learning by demonstrating the effectiveness of integrating CNN-based feature extraction with classical clustering methods. In particular, we:

  • Demonstrate the usefulness of VGG16 in extracting semantically meaningful features for clustering.
  • Evaluate the performance of k-means on VGG16-extracted features compared to raw image pixels.
  • Provide insights into the practical application of the hybrid deep learning and clustering framework for unsupervised learning tasks.

Related Work. Image clustering, the task of grouping similar images without supervision, has long been a major topic in computer vision and machine learning. This section reviews the existing literature on clustering methods, the role of feature extraction in clustering, and the integration of deep learning models such as VGG16 with clustering algorithms [9, p. 245].

Clustering algorithms, particularly k-means, have been widely used for organizing image datasets ([7, p. 129], [8, p. 284]). K-means is a straightforward and efficient algorithm for partitioning data into clusters by minimizing the variance within clusters. However, its performance is highly dependent on the input data’s feature representation. Clustering raw image pixels is often ineffective due to high dimensionality and noise, which degrade the algorithm's ability to discern meaningful patterns [2, p. 93].

Other advanced clustering methods, such as spectral clustering and hierarchical clustering, have been proposed to address these limitations. However, these methods often come with increased computational complexity and scalability issues, especially for large image datasets [1, p. 73].

Feature extraction plays a critical role in improving the effectiveness of clustering by transforming raw data into compact and meaningful representations. Traditional approaches, such as Scale-Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG), have been used to extract image features for clustering tasks [6, p. 423]. While these methods are computationally efficient, they struggle to capture high-level semantic information.

The emergence of deep learning has significantly advanced feature extraction techniques. CNNs, such as AlexNet [3, p. 84], ResNet [10, p. 770], and VGG16 [4, p. 7], have proven highly effective in extracting rich feature representations from images. These deep features outperform traditional methods by encoding complex patterns and hierarchical structures, which are crucial for clustering tasks.

Recent studies have explored the integration of deep learning models with clustering algorithms. Caron et al. [11, p. 139] proposed a deep clustering framework that learns visual features and cluster assignments simultaneously. Similarly, Xie et al. [12, p. 8] introduced an unsupervised deep embedding method that optimizes clustering loss during feature extraction. These studies highlight the potential of deep learning models in improving clustering performance.

The VGG16 model, in particular, has been extensively used for feature extraction in clustering applications. Its weights, pre-trained on the ImageNet dataset [5, p. 248], provide robust feature representations, enabling more accurate clustering. Yang et al. [13, p. 12] demonstrated that combining VGG16 with traditional clustering algorithms, such as k-means, improves clustering accuracy and efficiency compared to using raw image pixels.

The combination of deep feature extraction and traditional clustering methods has gained traction in recent years. Zhou et al. [14, p. 2921] used CNN-based feature extraction followed by k-means to cluster images for object recognition tasks. Ji et al. [15, p. 9872] proposed an invariant information clustering framework that utilizes deep features for unsupervised classification and segmentation.

While these approaches showcase the effectiveness of deep learning-based clustering, challenges remain in selecting appropriate feature extraction layers, optimizing the clustering process, and scaling the methods for large datasets. This study builds upon this growing body of research by leveraging VGG16 for feature extraction and k-means for clustering, aiming to provide a robust and scalable solution for image clustering.

Methodology.

Dataset: The dataset consists of 2311 images representing 13 fruit classes. The distribution of images across these classes is shown in Figure 1.

 

Figure 1. Distribution of Images in the Fruit Dataset [17]

 

Preprocessing Steps:

Resizing: Each image I is resized to a fixed dimension of 224×224 pixels to match the input requirement of the VGG16 model:

I′ = resize(I, 224 × 224)

Normalization: The pixel values of the resized images are normalized to the range [0, 1] by dividing by the maximum pixel value (255):

I_norm = I′ / 255

These preprocessing steps ensure consistency in image dimensions and intensity values, facilitating effective feature extraction by the VGG16 model.
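As a concrete illustration, the sketch below implements these two preprocessing steps in Python with Keras; the function name and file-path interface are illustrative assumptions, not taken from the paper.

```python
from tensorflow.keras.preprocessing import image

def load_and_preprocess(path):
    # Resize to the fixed 224x224 input size expected by VGG16.
    img = image.load_img(path, target_size=(224, 224))
    arr = image.img_to_array(img)   # float array of shape (224, 224, 3)
    return arr / 255.0              # normalize pixel values to [0, 1]
```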

Feature Extraction using VGG16

Overview of the VGG16 Architecture The VGG16 model is a deep convolutional neural network consisting of 16 layers: 13 convolutional layers and 3 fully connected layers. It uses small 3×3 convolution filters and applies ReLU activation after each layer, followed by max pooling to reduce the spatial dimensions. The architecture is divided into two parts:

Convolutional Base: Extracts spatial and hierarchical features.

Fully Connected Layers: Processes extracted features for classification or regression.

Using Pre-Trained Weights for Feature Extraction: The VGG16 model pre-trained on the ImageNet dataset is utilized for feature extraction. The pre-trained weights provide a robust initialization that captures general visual patterns. During feature extraction, the fully connected layers are excluded, and only the convolutional base is used to extract feature maps F from the input image I:

F = ConvBase(I)

Here, ConvBase represents the convolutional layers of VGG16.

Process of Obtaining Feature Vectors.

1. Input Image Processing: Each input image I is preprocessed (resized and normalized as described earlier).

2. Feature Map Extraction: The processed image I is passed through the convolutional base to obtain the feature map F:

F = ConvBase(I)

F has dimensions H×W×C, where H and W are the height and width, and C is the number of channels.

3. Flattening: The feature map F is flattened into a one-dimensional feature vector v:

v = flatten(F)

The dimension of v is H·W·C.

4. Global Average Pooling: To reduce the dimensionality, global average pooling can be applied:

v_c = (1 / (H·W)) Σ_{h=1..H} Σ_{w=1..W} F_{h,w,c},  c = 1, …, C

This produces a vector of size C, representing global feature embeddings. Below is the process diagram showing how features are extracted using VGG16 (Figure 2), followed by a code sketch of the same pipeline:

 

Figure 2. Feature extraction using VGG16
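A minimal Keras sketch of this extraction step is given below. It assumes the /255 normalization described earlier (Keras's own VGG16 preprocess_input uses mean subtraction instead), and the helper name is an illustrative assumption.

```python
import numpy as np
from tensorflow.keras.applications import VGG16

# include_top=False drops the three fully connected layers, keeping only the
# convolutional base; pooling='avg' applies global average pooling, so each
# image yields a single 512-dimensional feature vector.
conv_base = VGG16(weights='imagenet', include_top=False,
                  pooling='avg', input_shape=(224, 224, 3))

def extract_features(images):
    # images: preprocessed float array of shape (n, 224, 224, 3)
    return conv_base.predict(np.asarray(images))   # shape (n, 512)
```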

 

Clustering with k-means

Overview of the k-means Algorithm. The k-means algorithm is a widely used clustering method that partitions a dataset X = {x_1, x_2, …, x_n} into k clusters C_1, C_2, …, C_k. Each cluster C_j is represented by its centroid μ_j, and the algorithm iteratively performs the following steps:

1. Assignment Step: Each data point x_i is assigned to the cluster with the nearest centroid:

c(x_i) = argmin_j ‖x_i − μ_j‖²

2. Update Step: The centroids are updated as the mean of the points in each cluster:

μ_j = (1 / |C_j|) Σ_{x_i ∈ C_j} x_i

These steps are repeated until the centroids converge or the cluster assignments stabilize.
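For illustration, a bare-bones NumPy implementation of these two steps is sketched below; in practice a library implementation such as scikit-learn's KMeans would be used, and the random initialization here is only one of several common choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        centroids_new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(centroids_new, centroids):   # converged
            break
        centroids = centroids_new
    return labels, centroids
```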

Determining the Optimal Number of Clusters

The number of clusters (k) is critical for the performance of k-means. To determine the optimal k, the silhouette method is employed, which measures the quality of clustering by evaluating how well each data point fits within its cluster compared to other clusters.

For a data point i, the silhouette score s(i) is calculated as:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) is the average distance between i and all other points in its cluster (intra-cluster distance), and b(i) is the average distance between i and the points in the nearest neighboring cluster (inter-cluster distance).

The overall silhouette score for a clustering result is the mean silhouette score across all points:

S = (1/n) Σ_{i=1..n} s(i)

The optimal k is determined by maximizing S across a range of cluster values:

k* = argmax_k S(k)

This method ensures that clusters are both compact and well-separated. The silhouette score is plotted against k, and the value of k corresponding to the highest silhouette score is selected.
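A sketch of this selection procedure using scikit-learn is given below; the candidate range of k values is an assumption for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_k(features, k_range=range(2, 21)):
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        scores[k] = silhouette_score(features, labels)   # mean silhouette S(k)
    best = max(scores, key=scores.get)   # k* with the highest mean silhouette
    return best, scores
```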

Integration of Extracted Features with k-means

Feature Input: Feature vectors v extracted using the VGG16 model are used as input for the k-means algorithm. The clustering process follows these steps:

1. Compute the silhouette score S(k) for different values of k.

2. Identify the optimal k* using the silhouette method.

3. Apply k-means clustering with k = k* to group the images into clusters C_1, …, C_{k*}.

This systematic approach to determining k enhances the accuracy of clustering, ensuring meaningful grouping of images. Visualizations, such as the silhouette plot and cluster assignments, further validate the quality of clustering.
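Putting the pieces together, a pipeline sketch under the setup reported in the Experiments section below (PCA to 50 components, k = 13 as selected by the silhouette method) might look as follows; the parameter values come from the paper, while the function interface is an illustrative assumption.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_features(features, k=13, n_components=50):
    # Reduce the VGG16 feature vectors to 50 dimensions with PCA.
    reduced = PCA(n_components=n_components, random_state=0).fit_transform(features)
    # Cluster the reduced features with k-means.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
```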

Experiments and Results

The experiments were conducted to evaluate the effectiveness of clustering image data using the VGG16 model for feature extraction combined with the k-means clustering algorithm. The dataset consisted of 2311 images across 13 fruit classes, resized to 224×224 pixels for processing. Features were extracted using the pre-trained VGG16 model, and the resulting high-dimensional feature vectors were reduced to 50 dimensions using Principal Component Analysis (PCA). The clustering was performed using k-means, and the optimal number of clusters (k) was determined using the silhouette method.

 

Figure 3. Silhouette Method for Determining the Optimal k

 

The silhouette method was employed to identify the optimal k by plotting the silhouette score across a range of cluster values. The silhouette score peaked at k = 13, matching the number of fruit classes and indicating that dividing the dataset into thirteen clusters provided the best balance between intra-cluster cohesion and inter-cluster separation.


 

Figure 4. Cluster Visualization Using t-SNE

 

The high-dimensional features were further reduced to two dimensions using t-SNE to visualize the clustering results [16, p. 2581]. The resulting plot shows distinct, well-separated clusters, each representing a group of similar images, which validates the effectiveness of the approach. To demonstrate the coherence of the clusters, representative images from each cluster were displayed (Figures 5–8). Each cluster contained visually similar and consistent images, confirming the ability of the model to group related images effectively.
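A minimal sketch of this visualization step is shown below; the plot styling is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_clusters(features, labels):
    # Project the high-dimensional features to 2-D for visualization only.
    emb = TSNE(n_components=2, random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='tab20', s=8)
    plt.title('t-SNE projection of VGG16 features')
    plt.show()
```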

 

Figure 5. Representative Images from Cluster 0

Figure 6. Representative Images from Cluster 1

Figure 7. Representative Images from Cluster 2

Figure 8. Representative Images from Cluster 3

 

Discussion

Visual analysis of representative images from each cluster confirmed that the clustering effectively grouped semantically similar images, such as fruits with similar shapes, textures, or colors. VGG16 extracts hierarchical, high-level features that capture semantic information from images, such as textures, edges, and object structures, enabling more meaningful clustering than raw pixel values. The use of weights pre-trained on ImageNet ensures robust feature extraction without requiring additional training, making the approach computationally efficient for clustering tasks. Through VGG16's convolutional base, high-dimensional raw image data is transformed into compact feature vectors, reducing the computational burden on clustering algorithms such as k-means. The method also generalizes well to a variety of datasets, as VGG16 extracts universal features relevant to different types of image data.

At the same time, the PCA dimensionality reduction step and k-means clustering are sensitive to initial parameters (e.g., the number of PCA components and the initial cluster centers). Dynamic optimization of these parameters could further improve performance.

 

References:

  1. L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley, 1990.
  2. C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
  3. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, 2017, doi: 10.1145/3065386.
  4. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. 3rd International Conference on Learning Representations (ICLR), 2015.
  5. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.
  6. C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
  7. S. P. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, 1982, doi: 10.1109/TIT.1982.1056489.
  8. J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281–297.
  9. E. Egamberdiyev, “Data integration in a multi-type data environment,” in DTAI–2024, vol. 1, 2024, pp. 243–246.
  10. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
  11. M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proc. European Conference on Computer Vision (ECCV), LNCS vol. 11218, 2018, pp. 139–156, doi: 10.1007/978-3-030-01264-9_9.
  12. J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in Proc. 33rd International Conference on Machine Learning (ICML), PMLR, 2016.
  13. J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  14. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921–2929, doi: 10.1109/CVPR.2016.319.
  15. X. Ji, A. Vedaldi, and J. Henriques, “Invariant information clustering for unsupervised image classification and segmentation,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2019, pp. 9864–9873, doi: 10.1109/ICCV.2019.00996.
  16. L. van der Maaten and G. E. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
  17. Fruit image dataset (electronic resource): https://colab.research.google.com/drive/12THPbf6A7HGIuIbL5m62auNmVAQHK8nn#scrollTo=0AMzRqQ8Rar4
Information about the authors

Doctor of Technical Sciences, Tashkent State University of Economics, Tashkent, Uzbekistan

Teacher, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent, Uzbekistan
