Comparison of Traditional Bag of Words Approach to
Deep Learning for Image Retrieval
Pablo Laso
p.lasomielgo@student.utwente.nl
s2808161
University of Twente
Marleen van Gent
j.m.vangent@student.utwente.nl
s3069486
University of Twente
Abstract
An Information Retrieval experiment was designed and conducted to compare multiple different Information Retrieval systems for Image Retrieval. These methods include a traditional Bag of Words approach, with the ORB, SIFT, and KAZE feature detection methods, and state-of-the-art Convolutional Neural Networks. The impact of traditional K-means and fuzzy K-means clustering techniques for learning the visual vocabulary was studied. Cosine, Euclidean, and Manhattan distance metrics were compared to find the best metric for retrieving images. Comparing the Mean Average Precision of the models, the Convolutional Neural Network performed better on the Mapillary Street-Level Sequences (MSLS) data set than the KAZE-based K-means (K=55) model.
Keywords: Information Retrieval, Image Retrieval, Bag of Words, Descriptor Extraction, Visual Vocabulary, SIFT, ORB, KAZE, Convolutional Neural Networks
ACM Reference Format:
Pablo Laso and Marleen van Gent. 2022. Comparison of Traditional
Bag of Words Approach to Deep Learning for Image Retrieval.
In Proceedings of November 11, 2022 (Foundation of Information
Retrieval ’22). Twente, Enschede, NL, 4 pages.
1 Introduction
Image Retrieval is the process of finding images in a data set that are similar to a query image. In this paper, both content-based image retrieval and deep learning methods will be used to build an Image Retrieval system. The performance of these models will be evaluated based on the Mean Average Precision (MAP).
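For reference, with P(k) the precision at cut-off k, rel(k) equal to 1 if the image at rank k is relevant and 0 otherwise, R the number of relevant images for a query, n the number of retrieved images, and Q the number of queries, the standard definitions are

    AP = \frac{1}{R} \sum_{k=1}^{n} P(k)\, rel(k), \qquad MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q.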
1.1 Keypoint Descriptors
Keypoint descriptors are used to extract local features from images. Multiple methods which extract these keypoint descriptors exist. These methods differ in precision, efficiency, and compactness [7].
The Scale Invariant Feature Transform algorithm, also known
as SIFT, is perhaps the best-known keypoint descriptor extraction method for content-based image retrieval. SIFT is invariant to scale, translation, and rotation [2]. The strength of the SIFT keypoint descriptor is its distinctiveness, which makes it possible to correctly identify a keypoint among a large set of images [2].
Oriented FAST and Rotated BRIEF (ORB) has been proposed as an efficient alternative to SIFT [6]. Like SIFT, ORB is invariant to rotation; in addition, it is noise-resistant. ORB is based on the BRIEF feature descriptor, which is similar to SIFT in performance [6]. The ORB algorithm uses the FAST keypoint detector to find FAST points in the image, and the intensity centroid is used to measure corner orientation [6]. The main strengths of ORB are its good performance and the low computational cost of the algorithm.
Alcantarilla et al. (2012) [1] introduced KAZE, an algorithm for image feature detection and description in nonlinear scale spaces. First, KAZE performs a scale-space discretization in logarithmic steps. Instead of Gaussian approaches, which smooth all parts of the image equally, KAZE uses AOS techniques and variable conductance diffusion [1]. The Hessian is then used to detect the image features. Despite the higher computational cost, the authors claim higher performance than other keypoint descriptors such as SIFT [1].
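The paper does not reproduce its extraction code; as an illustration, all three extractors are available in OpenCV (SIFT is in the main module since OpenCV 4.4). A minimal sketch, assuming a grayscale query image on disk (the file name is hypothetical):

    import cv2

    # Load one image in grayscale (file name is hypothetical).
    img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

    # Each extractor returns keypoints plus a (num_keypoints, dim) descriptor
    # matrix: ORB yields 32-byte binary descriptors, SIFT 128-D floats,
    # and KAZE 64-D floats by default.
    for name, det in [("ORB", cv2.ORB_create()),
                      ("SIFT", cv2.SIFT_create()),
                      ("KAZE", cv2.KAZE_create())]:
        keypoints, descriptors = det.detectAndCompute(img, None)
        print(name, len(keypoints),
              None if descriptors is None else descriptors.shape)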
We can expect different performance from each algorithm. In a comparative study of these feature detectors, Tareen et al. [8] found that SIFT is the most scale-invariant, while ORB and KAZE are more invariant to rotation. For general geometric transformations, SIFT is found to be the most accurate algorithm, whereas ORB is the fastest in terms of total image-matching speed. The main purpose of this research is to compare the performance of these algorithms to each other and to a Convolutional Neural Network (CNN) on the Mapillary Street-Level Sequences (MSLS) data set.
1.2 Bag of Words
The Bag of Words approach was first used for Text Retrieval before it became widely applied to Image Retrieval [9]. Bag of Words is a methodology for content-based image retrieval that focuses on finding similarities between images based on their semantics [3]. The keypoint descriptors extracted from images can form a tremendous
amount of data to analyze. The Bag of Words approach can be used to address this issue, as it identifies basic components of an image and compares images based on the presence of these components. The methodology involves detecting the basic components, i.e., visual words, of the images; measuring the frequency of these visual words and storing the frequencies in a Bag of Words vector; and finding similar images by comparing the images' visual-word frequencies [9]. To learn the visual vocabulary of a set of images, clustering methods can be used; each cluster represents a visual word of the visual vocabulary. The similarity between images is represented by the distance between the Bag of Words vectors of these images. Different distance metrics can be used to compute this distance.
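As an illustration of this pipeline, a minimal sketch using scikit-learn's KMeans; the function names are our own, not the paper's implementation, and binary ORB descriptors must be cast to floats before standard K-means can be applied:

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(all_descriptors, k):
        # Cluster all training descriptors; each of the k centroids acts as
        # one visual word. n_init controls the number of runs with
        # different centroid seeds.
        return KMeans(n_clusters=k, n_init=10).fit(all_descriptors)

    def bow_vector(image_descriptors, vocabulary):
        # Map every descriptor to its nearest visual word, then count
        # the frequency of each word.
        words = vocabulary.predict(image_descriptors)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / hist.sum()  # normalize so image size does not dominate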
1.3 Convolutional Neural Networks
Convolutional Neural Networks can also be used in Image Retrieval. These non-linear feature extractors are trained on the images and produce an output vector, that is, a holistic descriptor of the image content.
These features are built during training, so there is no need for feature engineering. The architecture is based on multiple feature maps, or filters, that ideally respond to the most relevant features. Max pooling is then used to downsize the images, so that the most important parts of the image are retained while computational costs are reduced. These features finally pass through fully connected layers so that all the CNN parameters can be optimized. After that, a mathematical function known as softmax is used to convert a vector of K real numbers into a probability distribution over K possible outcomes.
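Concretely, for a vector z of K real numbers, softmax is defined as

    \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.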
Feature extraction networks have been used many times for this purpose, in very different domains [4, 5].
1.4 Problem Statement
In this paper, we aim to compare multiple approaches for image retrieval, namely the traditional content-based image retrieval methods, such as descriptor extraction and clustering techniques used with the Bag of Words approach, and state-of-the-art CNN-based methods. This comparison is relevant for investigating the potential and success of CNN-based methods in Image Retrieval.
R1: Do Convolutional Neural Networks perform better than an optimized content-based image retrieval method?
Furthermore, we aim to investigate the optimal parameters for content-based image retrieval with descriptor extraction and the Bag of Words approach.
R2a: Which of the keypoint descriptor methods SIFT, ORB, and KAZE gives the best results?
R2b: Does K-means clustering or fuzzy K-means clustering perform better at establishing the visual vocabulary of images?
R2c: Is Cosine, Euclidean, or Manhattan distance a more precise metric for determining the similarity between Bag of Words vectors?
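For reference, for Bag of Words vectors x and y of dimension n, these three metrics are defined as

    d_{cos}(x,y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}, \qquad
    d_{euc}(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad
    d_{man}(x,y) = \sum_{i=1}^{n} |x_i - y_i|.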
2 Methods
2.1 Experiment Procedure
The Mapillary Street-Level Sequences (MSLS) data set, containing 1000 images, was used to train and compare multiple Information Retrieval methods. The local keypoint features were extracted from the images using the ORB, SIFT, and KAZE descriptor extraction methods. K-means and fuzzy K-means clustering were used to establish the visual vocabulary of the images in the training set. Multiple values for the number of clusters were tried to find the ideal number of centroids for both clustering algorithms and each descriptor method: the number of centroids was varied from 30 to 60 in steps of 5. The K-means clustering algorithm was run 10 times with different centroid seeds.
Next, the visual vocabulary was trained with the ORB, SIFT, and KAZE descriptors. For each image in the Mapillary Street-Level Sequences (MSLS) data set, the Bag of Words vector was calculated using three different distance metrics: Cosine, Euclidean, and Manhattan distance. The descriptors of five photos from the test set were extracted using the ORB, SIFT, and KAZE extraction methods, and their Bag of Words vectors were computed using the Cosine, Euclidean, and Manhattan distance metrics. The distances between the Bag of Words vectors of the five test photos and the Bag of Words vectors from the data set were calculated, and the images with the smallest distances were retrieved. The retrieved images were compared to the relevant images. The Mean Average Precision was calculated and used to evaluate the overall performance of the models.
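The paper's retrieve function is not reproduced here; the following sketch, reusing the map_bow_vectors and retrieve names mentioned below but otherwise our own, shows how the retrieval and Average Precision steps can be implemented (SciPy calls the Manhattan metric "cityblock"):

    import numpy as np
    from scipy.spatial.distance import cdist

    def retrieve(query_bow, map_bow_vectors, metric="cosine", top_k=10):
        # Rank all data-set images by distance to the query's BoW vector.
        dists = cdist(query_bow[None, :], map_bow_vectors, metric=metric)[0]
        return np.argsort(dists)[:top_k]

    def average_precision(ranked_ids, relevant_ids):
        # Mean of the precision values at the ranks of relevant images.
        hits, precisions = 0, []
        for rank, idx in enumerate(ranked_ids, start=1):
            if idx in relevant_ids:
                hits += 1
                precisions.append(hits / rank)
        return float(np.mean(precisions)) if precisions else 0.0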
To implement a CNN, we used PyTorch to retrieve a pre-trained model, namely ResNet34. This is a 34-layer Convolutional Neural Network (CNN) that can be used as a state-of-the-art model for image classification. It has been trained on the ImageNet data set, which contains over a million images across 1000 classes.
Its main dierence when compared to other models is
the residuals this CNN takes from previous layers for the
subsequent ones. Each one of the four main layers of the
RESNET34 follows the same pattern, i.e., a 3x3 convolution,
bypassing the input every two layers (residuals). The feature
map dimensions are, respectively and in order, of 64, 128,
256, and 512. Overall, it contains 21.282M parameters.
We use this CNN for feature extraction. We obtain a (1000, 512) array that takes the place of the map_bow_vectors used by the retrieve function in the clustering experiments. We then compute the MAP to compare the CNN with the results of these other methods.
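A minimal sketch of this feature-extraction step with torchvision; the preprocessing and batching of the 1000 data-set images are omitted, and the random tensor is only a placeholder:

    import torch
    from torchvision import models

    resnet = models.resnet34(pretrained=True)
    # Drop the final fully connected layer; what remains ends in global
    # average pooling and outputs 512-dimensional feature vectors.
    feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
    feature_extractor.eval()

    with torch.no_grad():
        images = torch.randn(1000, 3, 224, 224)  # placeholder for real images
        feats = feature_extractor(images).flatten(1)  # shape (1000, 512)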
2.2 Comparison of Methods
The results from the best K-means model were compared
to the results of the Convolutional Neural Network model.
The Mean Average Precision scores of these models were compared with a two-sided Wilcoxon signed-rank test.
3 Results
In Table 1, the results of the Content-based Image Retrieval methods are shown. These results will be discussed in detail in the sections below. We group them by the descriptor algorithm (KAZE, SIFT, or ORB), the clustering algorithm (K-means or fuzzy K-means (FCM)) with its K value, and the distance metric (Cosine, Euclidean, or Manhattan).
Table 1. Results of the Content-based Image Retrieval Methods

Descriptor   Clustering   K    Distance    MAP
KAZE         K-means      55   Cosine      0.01554
SIFT         K-means      60   Euclidean   0.00924
ORB          K-means      45   Cosine      0.00635
KAZE         Fuzzy        30   Euclidean   0.00073
SIFT         Fuzzy        45   Cosine      0.00138
ORB          Fuzzy        30   Euclidean   0.00123
3.1 Keypoint Descriptors
From Table 1, we can see that no descriptor consistently stands out among the others. KAZE achieves the best results with K-means (K=55) and the cosine distance, but does not maintain its advantage with FCM.
3.2 Clustering Methods
When taking a closer look at the clustering methods, it is clear that fuzzy K-means, apart from being significantly more computationally expensive, does not reach the same level of performance as K-means.
The difference between the two algorithms is that fuzzy K-means (FCM) does not directly assign a point to a cluster, but rather computes a weighted association with each cluster. We can therefore expect FCM to perform better on elongated clusters, i.e., when points are spread out along a particular dimension. If this is not the case for our data set, it may actually account for FCM's lower performance.
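For reference, the standard FCM membership of a point x_i in cluster j, given centroids c_1, ..., c_K and a fuzzifier m > 1 (commonly m = 2), is

    u_{ij} = \left( \sum_{k=1}^{K} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)} \right)^{-1},

so every point contributes to every centroid, weighted by its membership.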
Overall, Table 1 shows a higher performance in terms of Mean Average Precision for the K-means algorithm than for FCM.
3.3 Distance Methods
The Cosine, Euclidean, and Manhattan distance metrics were used to determine the distance between the Bag of Words vectors of the images, i.e., to determine their similarity.
As can be seen in Table 1, the Manhattan distance metric is not included, as the models in which it was used resulted in a lower Mean Average Precision. The Cosine and Euclidean distance metrics both proved to be more effective at estimating vector similarity.
3.4 Convolutional Neural Network
After performing image retrieval with the feature vectors obtained from the pre-trained ResNet34, we computed the MAP. The MAP score obtained for our CNN is 0.096 when using cosine as the distance metric.
3.5 Comparison of Models
A Wilcoxon test was used to check whether the models were statistically different, that is, whether the CNN really offered an advantage over the KAZE-based K-means (K=55) model.
If the p-value is lower than 0.05, we can reject the null hypothesis in favor of the alternative with a confidence level of 95%, i.e., the models are statistically different. The more different the two distributions, the lower the p-value. However, if the samples are very small, the confidence with which we can reject the null hypothesis is lower, which makes it harder to obtain a low p-value.
To perform the Wilcoxon test, we took the five Average Precision scores (one for each of the five test images) of each model. The two samples were compared by means of the non-parametric, two-sided Wilcoxon signed-rank test. The resulting p-value was 0.44, meaning the null hypothesis cannot be rejected and no statistically significant difference between the models was found.
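This test is available in SciPy; a minimal sketch (the per-query Average Precision scores themselves are not listed in this paper, so the function takes them as arguments):

    from scipy.stats import wilcoxon

    def compare_models(ap_model_a, ap_model_b):
        # Paired, two-sided Wilcoxon signed-rank test on per-query AP scores.
        stat, p_value = wilcoxon(ap_model_a, ap_model_b, alternative="two-sided")
        return p_value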
4 Discussion
4.1 Conclusion
We have conducted a comparison between traditional, clustering-based Bag of Words (BoW) approaches and state-of-the-art CNNs for image classification, such as ResNet34.
Among the traditional methods, K-means proved to be more effective than FCM, especially when KAZE was used as the keypoint detector and descriptor extractor. When estimating the distance between BoW vectors, both the Euclidean and Cosine metrics appeared to outperform Manhattan.
It is not apparent that any one descriptor algorithm is better than the others: although the best results were obtained with a KAZE-based model, KAZE also performed worst with FCM.
When the aforementioned algorithms were compared to a CNN, the results clearly show a higher performance for the latter. The statistical test (Wilcoxon) showed no statistically significant difference; however, we believe this is because the samples, consisting of just five images, were too small (for computational reasons).
A pre-trained CNN, namely ResNet34, was used for feature extraction and accounted for the highest MAP score among all algorithms tried in this project.
4.2 Limitations
Due to limited processing power, it was not possible to test the models on all 500 photos of the test set of the Mapillary Street-Level Sequences (MSLS) data set. As a solution, five photos from the test set were randomly chosen to test the models.
Although the test data were randomly sampled, the limited size of the test set may have affected the reliability of our comparisons and results. In future research, the full test set should be used to test the models and compare the results.
4.3 Future Research
When using the Bag of Words approach, keypoints are extracted from grayscale images, which results in the loss of the color information of the images. Therefore, future research could focus on finding a method to include the color information of images in the Bag of Words methodology to improve the models.
Our results show that the implemented Convolutional Neural Network (a pre-trained ResNet34) outperformed the content-based image retrieval models. Future research should integrate state-of-the-art Deep Learning algorithms into content-based image retrieval methods, such that the benefits of both can be combined into more precise and accurate image retrieval models.
References
[1] Pablo Fernández Alcantarilla, Adrien Bartoli, and Andrew J. Davison. 2012. KAZE features. In European Conference on Computer Vision. Springer, 214–227.
[2] Suraya Abu Bakar, Muhammad Suzuri Hitam, and Wan Nural Jawahir Hj Wan Yussof. 2013. Content-based image retrieval using SIFT for binary and greyscale images. In 2013 IEEE International Conference on Signal and Image Processing Applications. IEEE, 83–88.
[3] Jialu Liu. 2013. Image retrieval based on bag-of-words model. arXiv preprint arXiv:1304.5168 (2013).
[4] T.M. Navamani. 2019. Chapter 7 - Efficient Deep Learning Approaches for Health Informatics. In Deep Learning and Parallel Computing Environment for Bioengineering Systems, Arun Kumar Sangaiah (Ed.). Academic Press, 123–137. https://doi.org/10.1016/B978-0-12-816718-2.00014-2
[5] S. Selva Nidhyananthan, R. Newlin Shebiah, B. Vijaya Kumari, and K. Gopalakrishnan. 2022. Chapter 15 - Deep learning for accident avoidance in a hostile driving environment. In Cognitive Systems and Signal Processing in Image Processing, Yu-Dong Zhang and Arun Kumar Sangaiah (Eds.). Academic Press, 337–357. https://doi.org/10.1016/B978-0-12-824410-4.00002-7
[6] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision. IEEE, 2564–2571.
[7] Nicola Strisciuglio. 2022. Lecture 06: Multimedia Retrieval - Content-based Image Retrieval. University of Twente. https://canvas.utwente.nl
[8] Shaharyar Ahmed Khan Tareen and Zahra Saleem. 2018. A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK. https://doi.org/10.1109/ICOMET.2018.8346440
[9] Chih-Fong Tsai. 2012. Bag-of-words representation in image annotation: A review. International Scholarly Research Notices 2012 (2012).
Received 11 November 2022