Foundation of Information Retrieval ’22,
Pablo Laso and Marleen van Gent
amount of data to analyze. The Bag of Words approach can
be used to solve this issue, as it identifies basic components
of an image and compares images based on the presence of
these components. The Bag of Words methodology involves
detecting the basic components, i.e., visual words, of the
images, measuring the frequency of these visual words,
storing these frequencies in a Bag of Words vector, and
finding similar images by comparing the images' frequencies
over the visual vocabulary [9]. To learn the visual vocabulary
of a set of images, clustering methods can be used. Each
cluster represents a visual word of the visual vocabulary.
The similarity between two images is represented by the
distance between their Bag of Words vectors. Different
distance metrics can be used to compute this distance.
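As a minimal sketch of this pipeline, assuming a visual vocabulary (the cluster centres) has already been learned, the Bag of Words vector of an image can be built by assigning each local descriptor to its nearest visual word and normalising the resulting histogram. All names, sizes, and the random descriptors below are illustrative, not the paper's data:

```python
import numpy as np
from scipy.spatial.distance import cdist

def bow_vector(descriptors, vocabulary):
    """Normalised histogram of nearest visual words for one image."""
    # Assign each local descriptor to its closest cluster centre (visual word).
    words = cdist(descriptors, vocabulary).argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()  # relative frequency of each visual word

# Toy example: a 3-word vocabulary, two images with random 8-D descriptors.
rng = np.random.default_rng(0)
vocab = rng.normal(size=(3, 8))
img_a = bow_vector(rng.normal(size=(20, 8)), vocab)
img_b = bow_vector(rng.normal(size=(25, 8)), vocab)

# Similarity is the distance between the two Bag of Words vectors.
for metric in ("cosine", "euclidean", "cityblock"):  # cityblock = Manhattan
    d = cdist(img_a[None], img_b[None], metric=metric)[0, 0]
    print(metric, round(d, 4))
```

Swapping the `metric` string is all that is needed to compare the three distance metrics studied later in the paper.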
1.3 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) can also be used in
Image Retrieval. These non-linear feature extractors are
trained on the images and produce an output vector, that is,
a holistic descriptor of the image content.
These features are built during training, so there is no need
for manual feature engineering. The architecture is based on
multiple feature maps, or filters, that ideally respond to the
most relevant features. Max pooling is then used to downsize
the feature maps, so that the most important parts of the
image are retained while computational costs are reduced.
These features finally pass through fully connected layers so
that all the CNN parameters can be jointly optimized. After
that, the softmax function is used to convert a vector of K
real numbers into a probability distribution over K possible
outcomes.
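The softmax step can be written compactly; this is the standard formulation, not code from the paper:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant to all scores.
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # K = 3 raw class scores
probs = softmax(scores)
print(probs)  # non-negative values that sum to 1
```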
Feature extraction networks have been used multiple
times for this purpose, in very different domains [4] [5].
1.4 Problem Statement
In this paper, we aim to compare multiple approaches for
image retrieval, namely traditional content-based image
retrieval methods, such as descriptor extraction and clustering
techniques used with the Bag of Words approach, against
state-of-the-art CNN-based methods. This comparison is
relevant to investigate the potential and success of CNN-based
methods in Image Retrieval.
R1: Do Convolutional Neural Networks perform better than an
optimized content-based image retrieval method?
Furthermore, we aim to investigate the optimal parameters
for content-based image retrieval with descriptor extraction
and the Bag of Words approach.
R2a: Which of the keypoint descriptor methods SIFT, ORB, and
KAZE gives the best results?
R2b: Does K-means clustering or fuzzy K-means clustering
perform better at establishing the visual vocabulary of
images?
R2c: Is the Cosine, Euclidean, or Manhattan distance the most
precise distance metric for determining the similarity between
Bag of Words vectors?
2 Methods
2.1 Experiment Procedure
The Mapillary Street-Level Sequences (MSLS) data set, con-
taining 1000 images, was used to train and compare multiple
Information Retrieval methods. The local keypoint features
were extracted from the images using the ORB, SIFT, and
KAZE descriptor extraction methods. K-means and fuzzy
K-means clustering were used to establish the visual vocabulary
of the images in the training set. Multiple values for the
number of clusters were tested to find the ideal number of
centroids for both the K-means and fuzzy K-means clustering
algorithms for each descriptor method. The number of centroids
was varied between 30 and 60 in steps of 5. The K-means
clustering algorithm was run 10 times with different centroid
seeds.
Subsequently, the visual vocabulary was trained with the ORB,
SIFT, and KAZE descriptors. For each image in the Mapillary
Street-Level Sequences (MSLS) data set, the Bag of Words
vector was calculated using three different distance metrics:
Cosine, Euclidean, and Manhattan distance. The descriptors of
five photos from the test set were extracted using the ORB,
SIFT, and KAZE extraction methods, and their Bag of Words
vectors were likewise computed with the Cosine, Euclidean,
and Manhattan distance metrics. The distances between the
Bag of Words vectors of the five test photos and the Bag of
Words vectors of the data set were calculated, and the images
with the smallest distances were retrieved. The retrieved
images were compared to the relevant images. The Mean Average
Precision (MAP) was calculated and used to evaluate the
overall performance of the models.
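The retrieval and evaluation loop can be sketched as below. The `retrieve` and `average_precision` helpers, the random Bag of Words vectors, and the ground-truth sets are illustrative assumptions, not the paper's code or data:

```python
import numpy as np
from scipy.spatial.distance import cdist

def retrieve(query_bow, map_bow_vectors, metric="cosine", top_k=5):
    """Indices of the dataset images closest to the query's BoW vector."""
    d = cdist(query_bow[None], map_bow_vectors, metric=metric)[0]
    return np.argsort(d)[:top_k]

def average_precision(ranked, relevant):
    """Average precision of one ranked result list vs. relevant image ids."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranked, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant hit
    return total / max(len(relevant), 1)

rng = np.random.default_rng(0)
map_bow = rng.random((100, 40))  # 100 dataset images, 40 visual words
queries = rng.random((5, 40))    # the five test photos
# Hypothetical ground truth: relevant dataset indices per query.
relevant = {0: {3, 17}, 1: {8}, 2: {21, 22}, 3: {5}, 4: {40, 41}}

aps = [average_precision(retrieve(q, map_bow), relevant[i])
       for i, q in enumerate(queries)]
print("MAP:", np.mean(aps))  # mean over the five queries
```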
When implementing a CNN, we used PyTorch to load a
pre-trained model, ResNet-34. This is a 34-layer
Convolutional Neural Network (CNN) that serves as a
state-of-the-art model for image classification. It was
pre-trained on the ImageNet dataset, which contains over a
million images across 1,000 classes.
Its main difference compared to other models is the residual
connections this CNN carries from earlier layers to subsequent
ones. Each of the four main stages of ResNet-34 follows the
same pattern, i.e., 3x3 convolutions, with the input bypassed
and added back every two layers (the residuals). The feature
map dimensions of the four stages are, in order, 64, 128,
256, and 512. Overall, the network contains 21.282M parameters.
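The residual pattern described above, two 3x3 convolutions with the input added back, can be sketched in PyTorch. This is the standard basic block of ResNet-34 (here without the downsampling variant), not the paper's exact code:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with the input added back (the residual)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # bypass the input every two conv layers

x = torch.randn(1, 64, 56, 56)  # e.g. the 64-channel stage of ResNet-34
y = BasicBlock(64)(x)
print(y.shape)  # same channels and spatial size as the input
```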
We use this CNN for feature extraction, obtaining a (1000,
512) array that is used as the map_bow_vectors argument of
the retrieve function from our clustering experiments. We
then compute the MAP to compare it with the results from the
other methods.