Jun 23, 2012

Image retrieval using Histogram of Visual Words

A simple example of image retrieval based on building a vocabulary and then retrieving the closest match to a test image among the training images. We are given a set of test images and a set of training images. We construct the vocabulary by first computing the SIFT features for each training image, then clustering all the descriptors (pooled from every image) into a fixed number of clusters. The descriptors are the data points in the clustering. Each image has a variable number of descriptors, but the size of each descriptor is fixed (128 for SIFT). Once the vocabulary is computed, we construct the Histogram of Visual Words for each training and test image as follows:
  • All the SIFT features are stored for each of the training images. If F SIFT features are detected in an image, there are F descriptors, each a 128-dimensional vector. vl_feat returns each descriptor as a column vector. These descriptors are used as the data points in the next step for building the vocabulary. Each of the F features also has 4-dimensional information describing the location (x, y), scale, and orientation of the frame disc. These are returned in the variable f, so f is a 4xF matrix.

  • Building the vocabulary from the descriptors of the training images: training_data contains the 128-dimensional descriptor column vectors of all the images. We cluster these data points into vocab_size centers. A is a vector with one entry per point, and its i-th entry gives the cluster (1 to vocab_size) that the i-th point belongs to. For example, suppose the first image has 683 SIFT features. Then its 683 descriptors occupy indices 1 through 683 of training_data, and each of the first 683 entries of A holds a number between 1 and vocab_size identifying the cluster that descriptor belongs to. Our goal is then to construct the HISTOGRAM OF VISUAL WORDS (h_vw). For each image we have an h_vw, a histogram of length vocab_size. The value at each index is the number of that image's descriptors assigned to that cluster.

  • Computing the histogram of visual words for each test image: for a test image we construct the HISTOGRAM OF VISUAL WORDS in a different manner. We already have the cluster centers from building the vocabulary. Each center is a 128-dimensional point, the same dimension as a descriptor. Running SIFT on a test image gives a set of features along with their descriptors. For each descriptor we compute the distance to every center and record the closest one. As we go, we count how many times each center is the closest. Completing the process gives a histogram of cluster-center frequencies for the test image.
A sample implementation can be found here.
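
Once every image has a histogram, retrieval reduces to finding the training histogram closest to the test histogram. The post does not say which distance is used, so the sketch below assumes an L1 distance between histograms normalized to sum to 1 (normalizing keeps images with many features from dominating). The function names and the small example histograms are hypothetical.

```python
def normalize(h):
    # Scale a histogram so its entries sum to 1 (all zeros stay zeros).
    total = sum(h)
    return [x / total for x in h] if total else [0.0] * len(h)

def l1_distance(h1, h2):
    # Sum of absolute differences between two histograms.
    return sum(abs(a - b) for a, b in zip(h1, h2))

def retrieve(test_hist, train_hists):
    # Return the index of the closest training histogram and all distances.
    q = normalize(test_hist)
    dists = [l1_distance(q, normalize(h)) for h in train_hists]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best, dists

# Made-up histograms for a vocabulary of 4 visual words.
train_hists = [[10, 0, 5, 1], [2, 8, 2, 4], [0, 1, 1, 14]]
test_hist = [1, 4, 1, 2]

best_index, distances = retrieve(test_hist, train_hists)
print(best_index)  # 1: the second training image has the same word proportions
```

Other histogram distances, such as chi-squared or histogram intersection, could be substituted in l1_distance without changing the rest of the sketch.
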
