KMeans Clustering And Visual Bag of Words

Chris Tralie

We will now discuss an interesting combination of unsupervised and supervised learning known as "visual bag of words." When we did bag of words with text, it was fairly intuitive: the words we used in our histograms were the words that occurred in our training documents. With images, however, it's less clear what the "words" should be, so we learn them using KMeans. We'll divide up an image into a bunch of tiny 11x11 grayscale patches, each of which can be thought of as a point in 121-dimensional space. We'll then throw all of these patches together into a point cloud and perform KMeans on it. The cluster centers that we come up with will be the "words" that we'll use when we go to classify things later. Below is some code to sample patches and to examine the clusters.
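
Here is a minimal sketch of that sampling and clustering, assuming images are loaded as 2D grayscale NumPy arrays; the function and variable names (sample_patches, imgs, etc.) are illustrative and not necessarily those of the original code:

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_patches(img, dim=11, n_patches=100):
    """Sample random dim x dim patches from a 2D grayscale image,
    flattening each patch into a vector with dim*dim entries"""
    rows, cols = img.shape
    patches = np.zeros((n_patches, dim * dim))
    for i in range(n_patches):
        r = np.random.randint(rows - dim)
        c = np.random.randint(cols - dim)
        patches[i, :] = img[r:r + dim, c:c + dim].flatten()
    return patches

# Stand-in for real training images; swap in actual grayscale arrays
imgs = [np.random.rand(128, 128) for _ in range(10)]

# Pool every image's patches into one point cloud in 121-dimensional space
X = np.concatenate([sample_patches(img) for img in imgs], axis=0)

# The 100 cluster centers of this point cloud are our "visual words"
kmeans = KMeans(n_clusters=100, n_init=10).fit(X)
words = kmeans.cluster_centers_  # shape (100, 121)
```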

Below is what the 100 cluster centers (visual words) look like. You can see that they're picking up on some crucial features, like oriented edges.
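
As a rough sketch of how one might generate such a plot, each 121-dimensional cluster center can be reshaped back into an 11x11 patch (this assumes the words array from the sketch above):

```python
import matplotlib.pyplot as plt

# Reshape each 121-dimensional word into an 11x11 patch and plot on a grid
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(words[i].reshape(11, 11), cmap='gray')
    ax.axis('off')
plt.show()
```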

Our next step is to train bag of words models for different classes of images using these words. We'll load 50 images from each class. For each image, we'll sample all of its patches and find the closest word to each one. At that point, we can create a histogram for that image. We'll then add together the histograms of all of the images in the same class, and that sum will be our model for that class.
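
A sketch of this step, reusing sample_patches and kmeans from above; for brevity it samples a fixed number of patches per image rather than literally all of them, and class_imgs is a hypothetical stand-in for the real training data:

```python
def get_histogram(img, kmeans, dim=11, n_patches=1000):
    """Quantize an image's patches to their nearest visual words and
    return a histogram counting how often each word occurs"""
    patches = sample_patches(img, dim, n_patches)
    labels = kmeans.predict(patches)  # index of nearest cluster center
    return np.bincount(labels, minlength=kmeans.n_clusters)

# Hypothetical stand-in: a dict from class name to 50 grayscale images each
class_imgs = {cls: [np.random.rand(128, 128) for _ in range(50)]
              for cls in ["car_side", "lotus", "revolver", "stop_sign"]}

# Sum the histograms of all images in a class to form that class's model
models = {cls: sum(get_histogram(img, kmeans) for img in imgs)
          for cls, imgs in class_imgs.items()}
```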

Finally, we'll create histograms for some new images and figure out the maximum likelihood class, following a procedure similar to that of bag of words with text. We'll create a confusion matrix to keep track of which class gets classified as what.
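
Here is one way this might look, again as a sketch rather than the original code: a naive Bayes style score that smooths each class's word counts into probabilities, plus a tally into a confusion matrix (test_imgs is a hypothetical stand-in for held-out data):

```python
def classify(img, models, kmeans, smoothing=1e-5):
    """Return the class maximizing the log likelihood of the image's
    word histogram, just as in naive Bayes bag of words for text"""
    h = get_histogram(img, kmeans)
    best_cls, best_score = None, -np.inf
    for cls, model in models.items():
        # Smooth the class's word counts into a probability distribution
        p = (model + smoothing) / np.sum(model + smoothing)
        score = np.sum(h * np.log(p))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# Hypothetical held-out test images for each class
classes = list(models.keys())
test_imgs = {cls: [np.random.rand(128, 128) for _ in range(5)]
             for cls in classes}

# Row = true class, column = predicted class
confusion = np.zeros((len(classes), len(classes)), dtype=int)
for i, cls in enumerate(classes):
    for img in test_imgs[cls]:
        j = classes.index(classify(img, models, kmeans))
        confusion[i, j] += 1
```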

When we look at the confusion matrix, we see that the results aren't great (37% accuracy), though we do reasonably well on a few classes (car_side, lotus, revolver, stop_sign). However, we should note that this is a tough problem, and that randomly guessing a class label on a 10-class problem would only be correct 10% of the time. So it's actually kind of amazing that we can do this well with a jumbled-up histogram of visual words that are completely taken out of their spatial context.

In the last unit, we will see a much more sophisticated way to analyze images known as a "convolutional neural network," which will perform much better. But the two approaches have some things in common; namely, it's helpful to learn little patch features in images, and piecing them together the right way into an image summary can lead to good classifiers.

In the meantime, check out this 5 minute video on visual bag of words, where the author explains a slight tweak we can make with something called "TF-IDF," or "term frequency inverse document frequency," weighting. This makes sure that words that just happen to occur often don't have a disproportionately strong influence on the final score for a class. The analogy in text would be words like "the" or "and," which show up very often but don't tell us much about what a document is about. You can see their code here: https://github.com/ovysotska/in_simple_english/blob/master/bag_of_visual_words.ipynb
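
To make the idea concrete, here is a small sketch of TF-IDF reweighting for visual word histograms (my own illustration, not the code from the linked notebook): each word's count is scaled by the log of how rarely it appears across images, so ubiquitous words contribute less to any class's score.

```python
import numpy as np

def tfidf_weights(histograms):
    """Compute inverse document frequency weights from an
    (n_images x n_words) matrix of raw word count histograms"""
    n_images = histograms.shape[0]
    # Number of images in which each word shows up at least once
    doc_freq = np.sum(histograms > 0, axis=0)
    return np.log(n_images / np.maximum(doc_freq, 1))

def tfidf_histogram(h, idf):
    """Reweight one histogram: term frequency times inverse document
    frequency, so words that occur everywhere are down-weighted"""
    tf = h / max(np.sum(h), 1)
    return tf * idf

# Toy example with 4 words: word 0 occurs in every image, so its IDF
# (and hence its influence on any score) drops to zero
H = np.array([[5, 0, 1, 0],
              [3, 2, 0, 0],
              [4, 0, 0, 1],
              [6, 1, 0, 0]])
idf = tfidf_weights(H)
weighted = np.array([tfidf_histogram(h, idf) for h in H])
```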