Week 5: Bag of Words Exercise / Theory of Bayesian Classifiers

Chris Tralie


Bag of Words Exercise

In homework 3, you've been modeling text as a sequence using Markov chains. We're going to do something seemingly much dumber: completely forget the sequence of words and focus just on how often words occur across entire documents. For example, from the first 2016 debate transcript on the homework, you get the following "bag of words" for Clinton (courtesy of wordclouds.com)

and the following for Trump

We can estimate the probability of a single word as simply the number of times that word was said divided by the total number of words, which will be proportional to the size of the words in the above word clouds. For example, Clinton said 6257 words total, and she said "Donald" 20 times, so her probability of saying "Donald" would be estimated as

\[ 20/6257 \]

By contrast, Trump only said "Donald" once out of 8367 words, so his probability would be

\[ 1/8367 \]
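In code, estimating a single word's probability is just counting. Here is a minimal sketch; the token list is made up for illustration, and on the homework it would come from splitting an actual transcript into words.

```python
from collections import Counter

# Made-up token list standing in for a speaker's transcript,
# already lowercased and split into words
tokens = ["donald", "said", "jobs", "jobs", "donald", "the", "economy", "donald"]

counts = Counter(tokens)             # number of times each word occurs
total = len(tokens)                  # total number of words spoken

p_donald = counts["donald"] / total  # estimated probability of saying "donald"
print(p_donald)                      # 3/8 = 0.375 for this toy list
```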

Your task will be to estimate these probabilities for all words from the data, assume complete independence between words (the "Naive Bayes" assumption), and multiply them together to get a total probability of each phrase under each model (Trump's and Clinton's). For example, if you got the text "Donald job job Donald Donald", and given that Trump said "job" 6 times out of his 8367 words, we'd compute the likelihood of the Trump model as

\[ (1/8367) \times (6/8367) \times (6/8367) \times (1/8367) \times (1/8367) = (1/8367)^3 \times (6/8367)^2 \]

You'd then also want to compute such a probability using the Clinton model, and choose the model with the higher probability as your estimate for who said what. You should try to classify all of the Trump and Clinton quotes from the third debate this way and record how many of your guesses are correct. As you go along, we will talk about some pitfalls, and after class, we will dig into some more theory about why this is actually correct and what the "Naive Bayes Model" is precisely.
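To make that arithmetic concrete, here is a tiny sketch of the computation under the Trump model, using only the two word probabilities quoted above; you'd do the same with the Clinton model's probabilities and keep whichever product is bigger.

```python
# Word probabilities under the Trump model, from the counts quoted above
probs_trump = {"donald": 1 / 8367, "job": 6 / 8367}

phrase = "Donald job job Donald Donald"

likelihood = 1.0
for word in phrase.lower().split():
    likelihood *= probs_trump[word]
print(likelihood)  # (1/8367)^3 * (6/8367)^2, an extremely small number
```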

Tweaks To The Technique

Here are a few things to think about as you go:

  • What's the problem with multiplying a lot of probabilities together? How would you fix this?
  • What happens if we never saw a particular word in our "training data"? What should we do about that?

Programming tips

Here's some code to load the debate text for Clinton
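(A minimal sketch; the file path below is a placeholder for wherever the homework stores the transcripts.)

```python
# Placeholder path: point this at wherever the debate transcripts live
fin = open("text/2016Debates/clinton1.txt")
text = fin.read().lower()  # read the whole transcript and lowercase it
fin.close()
words = text.split()       # list of words Clinton said in the first debate
```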

Here's a slick way to loop through all of the test data (the third debate) for Clinton
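(A sketch of one way to do that, assuming each of Clinton's third-debate quotes sits in its own file matching a pattern like the one below; the path and pattern are placeholders.)

```python
import glob

# Placeholder pattern: one file per Clinton quote from the third debate
for filename in glob.glob("text/2016Debates/clinton3*.txt"):
    quote = open(filename).read().lower()
    words = quote.split()
    # ... classify this quote here ...
```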

To get a list of words in a string, use the split() method of the string class. For example:
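With a made-up sentence, a quick check of what split() returns:

```python
sentence = "the bag of words model forgets word order"
words = sentence.split()
print(words)
# ['the', 'bag', 'of', 'words', 'model', 'forgets', 'word', 'order']
```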

Solution

Below is some code I wrote to implement this on the 2016 debate text. There were two problems that needed to be addressed which people pointed out in class:

  1. To prevent numerical underflow by multiplying tons of small numbers, I summed the logs of probabilities instead of multiplying them directly. The log is a monotonic function, so it preserves maxes, and this won't change which model wins.
  2. If we see a word when we're classifying a new document that wasn't in the training model, it will have a probability of 0 by default. But this will completely kill off the probability of the whole thing when multiplied through. So we need to do something more intelligent. We opt to do something called additive smoothing, in which each unique word has an extra count added to it. This means we also have to add the number of unique words to the denominator so that the probabilities still sum to 1. If we see a word that wasn't in the training data, we simply assign it a probability of 1/denominator.

Below is the code. If you run it, you'll see that it's lightning fast! This is one of the advantages of Naive Bayes, even in situations where it doesn't classify as accurately as other methods.
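A minimal sketch of what that code does, with placeholder file paths and a placeholder train/test layout (debates 1 and 2 for training, the third debate's quotes for testing, one quote per file); the homework's actual file organization may differ.

```python
import glob
import math

def build_model(filenames):
    """Build an additively smoothed bag of words model from training files.
    Returns (counts, denom), where a word w gets probability (counts[w]+1)/denom
    and a word never seen in training gets probability 1/denom."""
    counts = {}
    total = 0
    for filename in filenames:
        for word in open(filename).read().lower().split():
            counts[word] = counts.get(word, 0) + 1
            total += 1
    # Add the number of unique words to the denominator so the smoothed
    # probabilities still sum to 1
    denom = total + len(counts)
    return counts, denom

def log_likelihood(s, counts, denom):
    """Sum of log word probabilities of the string s under a model, which
    avoids numerical underflow from multiplying lots of tiny numbers."""
    res = 0.0
    for word in s.lower().split():
        res += math.log((counts.get(word, 0) + 1) / denom)
    return res

if __name__ == '__main__':
    # Placeholder layout: debates 1 and 2 are training data, debate 3 is test data
    models = {
        "clinton": build_model(glob.glob("text/2016Debates/clinton[12].txt")),
        "trump": build_model(glob.glob("text/2016Debates/trump[12].txt"))
    }
    num_correct = 0
    num_total = 0
    for speaker in models:
        for filename in glob.glob("text/2016Debates/%s3*.txt" % speaker):
            quote = open(filename).read()
            # Classify by taking the model with the highest log likelihood
            guess = max(models, key=lambda m: log_likelihood(quote, *models[m]))
            num_correct += (guess == speaker)
            num_total += 1
    print(num_correct, "/", num_total, "quotes classified correctly")
```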


Theory of Bayesian Classifiers

Below is some theory on why the above actually works.

The Bayesian Classifier Framework

First, let's define a Bayesian framework for classifying data based on models:

\[ P(C | X) = P(X | C) P(C) / P(X) \]

where P(X | C) is the likelihood, P(C) is the prior, and P(X) is the evidence.

In other words, we want to know how likely a class is given a particular observation, but we're usually only able to model things the other way around, as how likely we are to have made an observation given that we're in a class. In this context, a "class" is a different type of event. In the above example, it would either be that Trump is speaking or that Clinton is speaking.

Let's look at the above formula applied to two different classes C1 and C2

\[ p(C_1 | X) = P(X | C_1) P(C_1) / P(X) \]

\[ p(C_2 | X) = P(X | C_2) P(C_2) / P(X) \]

We notice that while the likelihoods and prior probabilities are different, they have a common factor of the "evidence." The expression for the evidence is often nasty, as it involves sums of probabilities over many disjoint events. Since it's a common factor, though, it won't change the relative order of the probabilities of different classes, so we can omit it and instead define something called a "posterior likelihood," which is proportional to the probability:

Def. Posterior Likelihood

\[ \ell(C | X) = P(X | C) P(C) \propto P(C | X) \]

Another neat thing we can do is take the log of this to separate the likelihood and prior terms into a sum instead of a product. We also saw in the above exercise that taking the log prevents the numerical problem of underflow.

Def. Log Posterior Likelihood

\[ \log \ell(C | X) = \log (P(X | C)) + \log(P(C)) \]

Def. Classify

When we say we want to "classify" something in the Bayesian framework, it means we want to find the class that maximizes the (log) posterior likelihood. In computer science terms, it's the "argmax" of the (log) likelihood over all classes. Whichever class maximizes this value is referred to as the maximum a posteriori estimate, and it's the label we will guess for our data.
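In code, that argmax is a one-liner; here's a minimal sketch in which log_posterior_likelihood stands in for whatever scoring function your model provides:

```python
def classify(X, classes, log_posterior_likelihood):
    # Maximum a posteriori estimate: the class whose log posterior
    # likelihood of the observation X is largest
    return max(classes, key=lambda C: log_posterior_likelihood(C, X))
```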

Bag of Words / Naive Bayes

We can be a little more specific about this if we link it to the bag of words example. Bag of words is an example of something called "Naive Bayes." The assumption behind Naive Bayes is that all of the components of our observation are independent. In the context of bag of words, this means that we draw words completely at random out of our bag of words in our model, and the probability of the next word we draw is the same regardless of all of the other words we've drawn.

Let's suppose we have the following setup for a set of different classes:

  1. A bag of words model with M words in it, with pij being the probability of the jth word under class i
  2. A new string s that we want to classify, which has N words in it. Let kj be the number of times the jth word in the dictionary occurs in s. Since we have N words total, then
    \[ k_1 + k_2 + ... + k_M = N \]

If x_j is the random variable for the count of the jth dictionary word in a string, we seek the following posterior likelihoods:

\[ \ell(C_i | X) = p(x_1 = k_1, x_2 = k_2, ..., x_M = k_M | C_i) P(C_i)\]

As an example, suppose we had a bag of words model with only two words: {Computer, Science}. Let's suppose we make an observation that has 5 words, with 3 counts of "computer" and 2 counts of "science." For example, the following sentence would match the above counts

s = "Science Computer Science Computer Computer"

Let pc be the probability of drawing a "computer" and ps be the probability of drawing a "science." Since the events are assumed to be independent in Naive Bayes, we can simply multiply the probabilities together, yielding the following probability

\[ P(s | C) = p_s \times p_c \times p_s \times p_c \times p_c = p_c^3 p_s^2 \]

Notice, though, that there are many strings that would give these same counts; for instance,

s = "Science Science Computer Computer Computer"

Each one of these strings is a disjoint event, so we can add up their probabilities. In fact, there are 10 possible 5-word sentences that have 3 "computer" strings and 2 "science" strings, so, letting k1 count the "computer" strings and k2 count the "science" strings,

\[ P(k_1=3, k_2=2 | C) = 10 p_c^3 p_s^2 \]
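That 10 is just the number of ways to choose which 3 of the 5 positions hold "computer" (equivalently, which 2 hold "science"):

\[ \frac{5!}{3! \, 2!} = \frac{120}{6 \times 2} = 10 \]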

In general, the counts can be obtained from multinomial coefficients. The formula turns out like this:

\[ \ell (C_i | x_1 = k_1, x_2 = k_2, ... , x_M = k_M) = \left( \frac{N!}{k_1! k_2! ... k_M!} \right) p_{i1}^{k_1} p_{i2}^{k_2} ... p_{iM}^{k_M} \, P(C_i) \]

Here's an example where taking the log posterior likelihood really helps us, because we have a ton of things multiplied together which we can split into sums

\[ \log \ell (C_i | X) = \log \left( \frac{N!}{k_1! k_2! ... k_M!} \right) + \log \left(p_{i1}^{k_1} p_{i2}^{k_2} ... p_{iM}^{k_M} \right) + \log( P(C_i)) \]

In fact, we can break this apart even further by remembering the rules log(ab) = log(a) + log(b) and log(a^b) = b log(a)

\[ \log \ell (C_i | X) = \log \left( \frac{N!}{k_1! k_2! ... k_M!} \right) + k_1 \log(p_{i1}) + k_2 \log(p_{i2}) + ... + k_M \log(p_{iM}) + \log( P(C_i)) \]

If we're simply trying to find the max of the log posterior likelihood, then the ugly first term doesn't actually affect us at all, and we can ignore it! So we can define a new quantity

\[ h_i(X) = k_1 \log(p_{i1}) + k_2 \log(p_{i2}) + ... + k_M \log(p_{iM}) + \log( P(C_i)) \]

This is nearly exactly what we actually did in our code in the bag of words exercise. The only extra term is log(P(Ci)). If a class has a small prior, it is inherently less likely, so we can see this term as a penalty or a boost for how likely the class is in the absence of any other evidence (for instance, knowing nothing else, your guess that my CS 477 lectures will be awesome is more likely than not 😉). Since we ignored this in the code above, we were assuming that Trump and Clinton were equally likely options.
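For concreteness, here is h_i(X) evaluated on the two-word {Computer, Science} model with made-up numbers (the probabilities and the prior are arbitrary):

```python
import math

# Made-up model: under class i, p_computer = 0.7 and p_science = 0.3,
# the observation has k_computer = 3 and k_science = 2, and the prior P(C_i) = 0.5
h_i = 3 * math.log(0.7) + 2 * math.log(0.3) + math.log(0.5)
print(h_i)  # whichever class gives the largest such score is our MAP guess
```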