Comparison of Traditional Bag of Words Approach to
Deep Learning for Image Retrieval
Pablo Laso
p.lasomielgo@student.utwente.nl
s2808161
University of Twente
Marleen van Gent
j.m.vangent@student.utwente.nl
s3069486
University of Twente
Abstract
An Information Retrieval experiment was designed and conducted to compare multiple different Information Retrieval systems for Image Retrieval. These methods include a traditional Bag of Words approach, with the ORB, SIFT, and KAZE feature detection methods, and state-of-the-art Convolutional Neural Networks. The impact of traditional K-means and fuzzy K-means clustering techniques for learning the visual vocabulary was studied. Cosine, Euclidean, and Manhattan distance metrics were compared to find the best metric for retrieving images. Comparing the Mean Average Precision of the models, the Convolutional Neural Network performed better on the Mapillary Street-Level Sequences (MSLS) data set than the KAZE-based K-means (K=55) model.
Keywords: Information Retrieval, Image Retrieval, Bag of Words, Descriptor Extraction, Visual Vocabulary, SIFT, ORB, KAZE, Convolutional Neural Networks
ACM Reference Format:
Pablo Laso and Marleen van Gent. 2022. Comparison of Traditional
Bag of Words Approach to Deep Learning for Image Retrieval.
In Proceedings of November 11, 2022 (Foundation of Information
Retrieval ’22). Twente, Enschede, NL, 4 pages.
1 Introduction
Image Retrieval is the process of finding images in a data set that are similar to a query image. In this paper, both content-based image retrieval and deep learning methods will be used to build an Image Retrieval system. The performance of these models will be evaluated based on the Mean Average Precision (MAP).
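For reference, with P(k) the precision at cut-off k, rel(k) equal to 1 if the image at rank k is relevant and 0 otherwise, R the number of relevant images for a query, n the number of retrieved images, and Q the number of queries, the standard definitions are

    AP = \frac{1}{R} \sum_{k=1}^{n} P(k)\, rel(k), \qquad MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q.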
1.1 Keypoint Descriptors
Keypoint descriptors are used to extract local features from images. Multiple methods which extract these keypoint descriptors exist. These methods differ in precision, efficiency, and compactness [7].
The Scale Invariant Feature Transform algorithm, also known
as SIFT, is perhaps the best-known keypoint descriptor extraction method for content-based image retrieval. SIFT is invariant to scale, translation, and rotation [2]. The strength of the SIFT keypoint descriptor is its distinctiveness, which makes it possible to correctly identify a keypoint among a large set of images [2].
Oriented FAST and Rotated BRIEF (ORB) has been proposed as an efficient alternative to SIFT [6]. Like SIFT, ORB is invariant to rotation; in addition, it is noise-resistant. ORB is based on the BRIEF feature descriptor, which is similar to SIFT in performance [6]. The ORB algorithm uses the FAST keypoint detector to find FAST points in the image, and the intensity centroid is used to measure corner orientation [6]. The main strengths of ORB are its good performance and the low computational cost of the algorithm.
Alcantarilla et al. (2012) [1] introduced KAZE, an algorithm for image feature detection and description in nonlinear scale spaces. First, KAZE performs a scale-space discretization in logarithmic steps. Instead of Gaussian approaches, which smooth all parts of the image equally, KAZE uses AOS techniques and variable conductance diffusion [1]. The Hessian is then used to detect the image features. Despite the higher computational cost, the authors claim higher performance than other keypoint descriptors such as SIFT [1].
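The paper does not reproduce its extraction code; as an illustration, all three extractors are available in OpenCV (SIFT is in the main module since OpenCV 4.4). A minimal sketch, assuming a grayscale query image on disk (the file name is hypothetical):

    import cv2

    # Load one image in grayscale (file name is hypothetical).
    img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

    # Each extractor returns keypoints plus a (num_keypoints, dim) descriptor
    # matrix: ORB yields 32-byte binary descriptors, SIFT 128-D floats,
    # and KAZE 64-D floats by default.
    for name, det in [("ORB", cv2.ORB_create()),
                      ("SIFT", cv2.SIFT_create()),
                      ("KAZE", cv2.KAZE_create())]:
        keypoints, descriptors = det.detectAndCompute(img, None)
        print(name, len(keypoints),
              None if descriptors is None else descriptors.shape)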
We can expect different performance from each algorithm. In a comparative study of these feature detectors, Tareen et al. [8] found that SIFT is the most scale-invariant, while ORB and KAZE are more invariant to rotation. For general geometric transformations, SIFT is found to be the most accurate algorithm, whereas ORB is the fastest in terms of total image-matching speed. The main purpose of this research is to compare the performance of these algorithms to each other and to a Convolutional Neural Network (CNN) on the Mapillary Street-Level Sequences (MSLS) data set.
1.2 Bag of Words
The Bag of Words approach was first used for Text Retrieval before it became widely applied to Image Retrieval [9]. Bag of Words is a methodology for content-based image retrieval that focuses on finding similarities between images based on their semantics [3]. The keypoint descriptors extracted from images can form a tremendous
amount of data to analyze. The Bag of Words approach can be used to address this issue, as it identifies basic components of an image and compares images based on the presence of these components. The methodology involves detecting the basic components, i.e., visual words, of the images; measuring the frequency of these visual words and storing the frequencies in a Bag of Words vector; and finding similar images by comparing the images' visual-word frequencies [9]. To learn the visual vocabulary of a set of images, clustering methods can be used; each cluster represents a visual word of the visual vocabulary. The similarity between images is represented by the distance between the Bag of Words vectors of these images. Different distance metrics can be used to compute this distance.
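As an illustration of this pipeline, a minimal sketch using scikit-learn's KMeans; the function names are our own, not the paper's implementation, and binary ORB descriptors must be cast to floats before standard K-means can be applied:

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(all_descriptors, k):
        # Cluster all training descriptors; each of the k centroids acts as
        # one visual word. n_init controls the number of runs with
        # different centroid seeds.
        return KMeans(n_clusters=k, n_init=10).fit(all_descriptors)

    def bow_vector(image_descriptors, vocabulary):
        # Map every descriptor to its nearest visual word, then count
        # the frequency of each word.
        words = vocabulary.predict(image_descriptors)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / hist.sum()  # normalize so image size does not dominate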
1.3 Convolutional Neural Networks
Convolutional Neural Networks can also be used in Image Retrieval. These non-linear feature extractors are trained on the images and produce an output vector, that is, a holistic descriptor of the image content.
These features are built during training, so there is no need for feature engineering. The architecture is based on multiple feature maps, or filters, that ideally respond to the most relevant features. Max pooling is then used to downsize the images, so that the most important parts of the image are retained while computational costs are reduced. These features finally pass through fully connected layers so that all the CNN parameters can be optimized. After that, a mathematical function known as softmax is used to convert a vector of K real numbers into a probability distribution over K possible outcomes.
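Concretely, for a vector z of K real numbers, softmax is defined as

    \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.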
Feature extraction networks have been used many times for this purpose, in very different domains [4, 5].
1.4 Problem Statement
In this paper, we aim to compare multiple approaches for image retrieval, namely the traditional content-based image retrieval methods, such as descriptor extraction and clustering techniques used with the Bag of Words approach, and state-of-the-art CNN-based methods. This comparison is relevant for investigating the potential and success of CNN-based methods in Image Retrieval.
R1: Do Convolutional Neural Networks perform better than an optimized content-based image retrieval method?
Furthermore, we aim to investigate the optimal parameters for content-based image retrieval with descriptor extraction and the Bag of Words approach.
R2a: Which of the keypoint descriptor methods SIFT, ORB, and KAZE gives the best results?
R2b: Does K-means clustering or fuzzy K-means clustering perform better at establishing the visual vocabulary of images?
R2c: Is Cosine, Euclidean, or Manhattan distance a more precise metric for determining the similarity between Bag of Words vectors?
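For reference, for Bag of Words vectors x and y of dimension n, these three metrics are defined as

    d_{cos}(x,y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}, \qquad
    d_{euc}(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad
    d_{man}(x,y) = \sum_{i=1}^{n} |x_i - y_i|.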
2 Methods
2.1 Experiment Procedure
The Mapillary Street-Level Sequences (MSLS) data set, containing 1000 images, was used to train and compare multiple Information Retrieval methods. The local keypoint features were extracted from the images using the ORB, SIFT, and KAZE descriptor extraction methods. K-means and fuzzy K-means clustering were used to establish the visual vocabulary of the images in the training set. Multiple values for the number of clusters were tried to find the ideal number of centroids for both clustering algorithms and each descriptor method: the number of centroids was varied from 30 to 60 in steps of 5. The K-means clustering algorithm was run 10 times with different centroid seeds.
Next, the visual vocabulary was trained with the ORB, SIFT, and KAZE descriptors. For each image in the Mapillary Street-Level Sequences (MSLS) data set, the Bag of Words vector was calculated using three different distance metrics: Cosine, Euclidean, and Manhattan distance. The descriptors of five photos from the test set were extracted using the ORB, SIFT, and KAZE extraction methods, and their Bag of Words vectors were computed using the Cosine, Euclidean, and Manhattan distance metrics. The distances between the Bag of Words vectors of the five test photos and the Bag of Words vectors from the data set were calculated, and the images with the smallest distances were retrieved. The retrieved images were compared to the relevant images. The Mean Average Precision was calculated and used to evaluate the overall performance of the models.
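The paper's retrieve function is not reproduced here; the following sketch, reusing the map_bow_vectors and retrieve names mentioned below but otherwise our own, shows how the retrieval and Average Precision steps can be implemented (SciPy calls the Manhattan metric "cityblock"):

    import numpy as np
    from scipy.spatial.distance import cdist

    def retrieve(query_bow, map_bow_vectors, metric="cosine", top_k=10):
        # Rank all data-set images by distance to the query's BoW vector.
        dists = cdist(query_bow[None, :], map_bow_vectors, metric=metric)[0]
        return np.argsort(dists)[:top_k]

    def average_precision(ranked_ids, relevant_ids):
        # Mean of the precision values at the ranks of relevant images.
        hits, precisions = 0, []
        for rank, idx in enumerate(ranked_ids, start=1):
            if idx in relevant_ids:
                hits += 1
                precisions.append(hits / rank)
        return float(np.mean(precisions)) if precisions else 0.0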
To implement a CNN, we used PyTorch to retrieve a pre-trained model, namely ResNet34. This is a 34-layer Convolutional Neural Network (CNN) that can be used as a state-of-the-art model for image classification. It has been trained on the ImageNet data set, which contains over a million images across 1000 classes.
Its main dierence when compared to other models is
the residuals this CNN takes from previous layers for the
subsequent ones. Each one of the four main layers of the
RESNET34 follows the same pattern, i.e., a 3x3 convolution,
bypassing the input every two layers (residuals). The feature
map dimensions are, respectively and in order, of 64, 128,
256, and 512. Overall, it contains 21.282M parameters.
We use this CNN for feature extraction. We obtain a (1000, 512) array that takes the place of the map_bow_vectors used by the retrieve function in the clustering experiments. We then compute the MAP to compare the CNN with the results of these other methods.
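A minimal sketch of this feature-extraction step with torchvision; the preprocessing and batching of the 1000 data-set images are omitted, and the random tensor is only a placeholder:

    import torch
    from torchvision import models

    resnet = models.resnet34(pretrained=True)
    # Drop the final fully connected layer; what remains ends in global
    # average pooling and outputs 512-dimensional feature vectors.
    feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
    feature_extractor.eval()

    with torch.no_grad():
        images = torch.randn(1000, 3, 224, 224)  # placeholder for real images
        feats = feature_extractor(images).flatten(1)  # shape (1000, 512)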
2.2 Comparison of Methods
The results from the best K-means model were compared
to the results of the Convolutional Neural Network model.
The Mean Average Precision scores of these models were compared with a two-sided Wilcoxon signed-rank test.
3 Results
In Table 1, the results of the Content-based Image Retrieval methods are shown. These results will be discussed in detail in the sections below. We group them by the descriptor algorithm (KAZE, SIFT, or ORB), the clustering algorithm (K-means or fuzzy K-means (FCM)) with its K value, and the distance metric (Cosine, Euclidean, or Manhattan).
Table 1. Results of the Content-based Image Retrieval Methods

Descriptor   Clustering   K    Distance    MAP
KAZE         K-means      55   Cosine      0.01554
SIFT         K-means      60   Euclidean   0.00924
ORB          K-means      45   Cosine      0.00635
KAZE         Fuzzy        30   Euclidean   0.00073
SIFT         Fuzzy        45   Cosine      0.00138
ORB          Fuzzy        30   Euclidean   0.00123
3.1 Keypoint Descriptors
From Table 1, we can see that no descriptor consistently stands out among the others. KAZE achieves the best results with K-means (K=55) and the cosine distance, but does not maintain its advantage with FCM.
3.2 Clustering Methods
When taking a closer look at the clustering methods, it is clear that fuzzy K-means, apart from being significantly more computationally expensive, does not reach the same level of performance as K-means.
The difference between the two algorithms is that fuzzy K-means (FCM) does not directly assign a point to a cluster, but rather computes a weighted association with each cluster. We can therefore expect FCM to perform better on elongated clusters, i.e., when points are spread out along a particular dimension. If this is not the case for our data set, it may actually account for FCM's lower performance.
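For reference, the standard FCM membership of a point x_i in cluster j, given centroids c_1, ..., c_K and a fuzzifier m > 1 (commonly m = 2), is

    u_{ij} = \left( \sum_{k=1}^{K} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)} \right)^{-1},

so every point contributes to every centroid, weighted by its membership.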
Overall, Table 1 shows a higher performance in terms of Mean Average Precision for the K-means algorithm than for FCM.
3.3 Distance Methods
The Cosine, Euclidean, and Manhattan distance metrics were used to determine the distance between the Bag of Words vectors of the images, i.e., to determine their similarity.
As can be seen in Table 1, the Manhattan distance metric is not included, as the models in which it was used resulted in a lower Mean Average Precision. The Cosine and Euclidean distance metrics both proved to be more effective at estimating vector similarity.
3.4 Convolutional Neural Network
After performing image retrieval with the feature vectors obtained from the pre-trained ResNet34, we computed the MAP. The MAP score obtained for our CNN is 0.096 when using cosine as the distance metric.
3.5 Comparison of Models
A Wilcoxon test was used to check whether the models were statistically different, that is, whether the CNN really offered an advantage over the KAZE-based K-means (K=55) model.
If the p-value is lower than 0.05, we can reject the null hypothesis in favor of the alternative with a confidence level of 95%, i.e., the models are statistically different. The more different the two distributions, the lower the p-value. However, if the samples are very small, the confidence with which we can reject the null hypothesis is lower, which makes it harder to obtain a low p-value.
To perform the Wilcoxon test, we took the five Average Precision scores (one for each of the five test images) of each model. The two samples were compared by means of the non-parametric, two-sided Wilcoxon signed-rank test. The resulting p-value was 0.44, meaning the null hypothesis cannot be rejected and no statistically significant difference between the models was found.
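This test is available in SciPy; a minimal sketch (the per-query Average Precision scores themselves are not listed in this paper, so the function takes them as arguments):

    from scipy.stats import wilcoxon

    def compare_models(ap_model_a, ap_model_b):
        # Paired, two-sided Wilcoxon signed-rank test on per-query AP scores.
        stat, p_value = wilcoxon(ap_model_a, ap_model_b, alternative="two-sided")
        return p_value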
4 Discussion
4.1 Conclusion
We have conducted a comparison between traditional, clustering-based Bag of Words (BoW) approaches and state-of-the-art CNNs for image classification, such as ResNet34.
Among the traditional methods, K-means proved to be more effective than FCM, especially when KAZE was used as the keypoint detector and descriptor extractor. When estimating the distance between BoW vectors, both the Euclidean and Cosine metrics appeared to outperform Manhattan.
It is not apparent that any one descriptor algorithm is better than the others: although the best results were obtained with a KAZE-based model, KAZE also performed worst with FCM.
When the aforementioned algorithms were compared to a CNN, the results clearly show a higher performance for the latter. The statistical test (Wilcoxon) showed no statistically significant difference; however, we believe this is because the samples, consisting of just five images, were too small (for computational reasons).
A pre-trained CNN, namely ResNet34, was used for feature extraction and accounted for the highest MAP score among all algorithms tried in this project.
4.2 Limitations
Due to limited processing power, it was not possible to test the models on all 500 photos of the test set of the Mapillary Street-Level Sequences (MSLS) data set. As a solution, five photos from the test set were randomly chosen to test the models.
Although the test data were randomly sampled, the limited size of the test set may have affected the reliability of our comparisons and results. In future research, the full test set should be used to test the models and compare the results.
4.3 Future Research
When using the Bag of Words approach, keypoints are extracted from grayscale images, which results in the loss of the color information of the images. Therefore, future research could focus on finding a method to include the color information of images in the Bag of Words methodology to improve the models.
Our results show that the implemented Convolutional Neural Network (a pre-trained ResNet34) outperformed the content-based image retrieval models. Future research should integrate state-of-the-art Deep Learning algorithms into content-based image retrieval methods, such that the benefits of both can be combined into more precise and accurate image retrieval models.
References
[1] Pablo Fernández Alcantarilla, Adrien Bartoli, and Andrew J. Davison. 2012. KAZE features. In European Conference on Computer Vision. Springer, 214–227.
[2] Suraya Abu Bakar, Muhammad Suzuri Hitam, and Wan Nural Jawahir Hj Wan Yussof. 2013. Content-based image retrieval using SIFT for binary and greyscale images. In 2013 IEEE International Conference on Signal and Image Processing Applications. IEEE, 83–88.
[3] Jialu Liu. 2013. Image retrieval based on bag-of-words model. arXiv preprint arXiv:1304.5168 (2013).
[4] T.M. Navamani. 2019. Chapter 7 - Efficient Deep Learning Approaches for Health Informatics. In Deep Learning and Parallel Computing Environment for Bioengineering Systems, Arun Kumar Sangaiah (Ed.). Academic Press, 123–137. https://doi.org/10.1016/B978-0-12-816718-2.00014-2
[5] S. Selva Nidhyananthan, R. Newlin Shebiah, B. Vijaya Kumari, and K. Gopalakrishnan. 2022. Chapter 15 - Deep learning for accident avoidance in a hostile driving environment. In Cognitive Systems and Signal Processing in Image Processing, Yu-Dong Zhang and Arun Kumar Sangaiah (Eds.). Academic Press, 337–357. https://doi.org/10.1016/B978-0-12-824410-4.00002-7
[6] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision. IEEE, 2564–2571.
[7] Nicola Strisciuglio. 2022. Lecture 06: Multimedia Retrieval - Content-based Image Retrieval. University of Twente. https://canvas.utwente.nl
[8] Shaharyar Ahmed Khan Tareen and Zahra Saleem. 2018. A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK. https://doi.org/10.1109/ICOMET.2018.8346440
[9] Chih-Fong Tsai. 2012. Bag-of-words representation in image annotation: A review. International Scholarly Research Notices 2012 (2012).
Received 11 November 2022