Homework 3: Markov Chains for Text Processing

Chris Tralie

Overview / Logistics

The purpose of this assignment is to introduce you to natural language processing via Markov chains, which are incredibly powerful tools for representing sequences mathematically. First, you will use this tool to classify text, seeing how it performs at telling whether Clinton or Trump was speaking during the 2016 presidential debates, whether a text is spam or ham, or whether a movie review is positive or negative. You will also use this tool to synthesize new text that mimics the style of some given text. For instance, you'll be able to synthesize new, previously unheard Spongebob quotes. You can even mix styles together (such as my Ursinus sith lord). We will have a class gallery of such "Markov hybrids" that students devise.

Learning Objectives

  • Implement a statistical technique for analysis and synthesis of natural language.
  • Follow the train set / test set paradigm for evaluating the performance of an algorithm, and learn how to "wrangle datasets" (i.e. load files and prepare them for testing).
  • Search for optimal model parameters, and explore how these parameters vary across datasets.
  • Use jupyter notebooks to organize experiments into a computational essay.
  • Reflect on the potential harms of technology that you've created, and what can be done to mitigate ethical threats.

What to submit

Finally, submit a comment on canvas with the answer to the following questions:

  1. Your markov.py file and the jupyter notebooks you created in the experimental section.
  2. Your poetic statement, a title that should go along with it, and what name/pseudonym you would like to use? (results will be displayed on the class web site). Also submit the text files that you trained it on.
  3. Your ethics reflection.
  4. Did you work with a buddy? If so, who?
  5. How long did this take you? I'm simply using this to gauge if things are too easy, too hard, or just right.
  6. Any feedback to help improve the assignment for future students? Alternatively, if there's nothing you would change, let me know what specifically worked for you about it so I know to keep doing things like that.
  7. Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write that here. The more you articulate the problem the more partial credit you will receive (fine to leave this blank)

Background: Markov Chains for Text Representation

Human languages, often referred to as ``natural languages,'' are rich and full of meaning. Compared to computer languages, they are much harder to automatically interpret and process. However, as you will see in this assignment, we can make some progress at representing natural languages using a few simple statistical ideas.

In this assignment, our focus will be on creating an algorithm to capture the style of some given text from a set of examples. Then, we will be able to classify documents or "synthesize" (create new text) in this style.

In the discussion below, we refer to documents that we use to build a statistical model as training data.

Zero-Order Statistical Modeling

We'll start by discussing synthesis. The simplest possible thing we could do to model text statistically is try to match the frequency of occurrence of the characters in a document. For example, if we use a collection of spongebob quotes, we find that we have the following character counts across all quotes:

Then, we can then try drawing characters according to these counts. For example, we'd be more likely to draw an 'i' than we would to draw a 'z', and we're more likely to get a space ' ' than any other character. If we draw 100 characters independently with these probabilities, we get text that looks like this:

  • onc 5donps fwyce agoti'tm ne edoi a e Iueogd ei IralcrltItmo.g mimyheat tgesue nwierayekra fP he
  • l rOxttsu Iogrmetucafo ewa khtois!e bttcatnht,r Cyfhr Pngkswhnwl oiet lyoatrl atumr e lenriadb Ie
  • Gi dyuh b .di Po mmceooet'd'nne'n gdo dkimeo aanti is0o i 'uttj'Sstopfsasep!. mosltayaaSso?lraV l

Interestingly, the size of the "words" looks about right since we are more likely to choose a space than any other character, but they are total gibberish. This is because the process has no memory; each character is chosen completely independently from the character preceding it.

K-Prefixes

As it turns out, if we have just a very simple memory of a few characters, we can do a much better job at synthesizing sequences. Surprisingly, we can shoehorn the "memoryless" Markov chain framework from our class exercise into having a small memory if we change our notion of what a "state" is. In particular, a "state" will now be a small sequence of characters of length k instead of just a single character. This is referred to as a k-prefix.

As an example, let's consider that we had the string aaabaaacaabbcabcc. If we took all of the 2-prefixes of this string, we would have the following counts

In the Markov framework, we also have transitions that occur every time a new character is seen. On the above example, they would look like this:

Let's think about what these mean as transitions in the Markov chain. If we're at aa and we get an a, then we stay at aa. If we get a b, then we chop the a off of the left and move to ab. If we're at ca and we see a b, we chop the c off of the left and move to ab. And so on. Drawn as a state diagram, all such transitions look like this:

One issue with this is that it runs into a dead end at cc. This is fine if we're analyzing probabilities, but if we're doing a random walk through our model to synthesize text, we'll run into a dead end there. To make it so we can keep going, we'll loop the text around by padding the end of the string with the first k characters. In this example, this means padding the original string with the first two characters, so we create the prefixes of the string aaabaaacaabbcabccaa. That leads to the following counts for the prefixes and transitions

Notice how the count of ca->a has gone up by 1, and now there is a transition from cc->a. Now there are no dead ends in the state transitions, and we can walk through this model randomly forever to generate random strings.

Below is a more in-depth example with some real text from the spongebob quotes text file provided with the assignment

Spongebob Example

As an example here are all of the counts of the 4-prefix 'you ' in the Spongebob text:

New SentenceCharacter Counts for 'you ' After Sentence
Gary, go away, can't you see I'm trying to forget you?{"s":1}
Why are you mad? Because I can't see my forehead!{"s":1, "m":1}
Can you take the crust off my Krabby Patty?{"s":1, "t":1, "m":1}
Did you see my underwear?{"s":2, "t":1, "m":1}
Well, it makes you look like a girl!{"s":2, "t":1, "l":1, "m":1}
East, oh I thought you said Weest!{"s":3, "t":1, "l":1, "m":1}
That's it, mister, you just lost your brain priveleges!{"s":3, "t":1, "j":1, "l":1, "m":1}
I wumbo, you wumbo, he she we, wumbo, wumboing, wumbology, the study of wumbo? {"s":3, "t":1, "w":1, "j":1, "l":1, "m":1}
It's not you that's got me... it's me that's got me!{"s":3, "t":2, "w":1, "j":1, "l":1, "m":1}
Why don't you ask CowBob RanchPants and his friend Sir Eats-a-lot?{"a":1, "s":3, "t":2, "w":1, "j":1, "l":1, "m":1}
Krusty Krab Pizza, it's the pizza for you and meeeee!{"a":2, "s":3, "t":2, "w":1, "j":1, "l":1, "m":1}
If you believe in yourself, with a tiny pinch of magic all your dreams can come true!{"a":2, "b":1, "s":3, "t":2, "w":1, "j":1, "l":1, "m":1}
Goodbye everyone, I'll remember you all in therapy.{"a":3, "b":1, "s":3, "t":2, "w":1, "j":1, "l":1, "m":1}
Don't you have to be stupid somewhere else?{"a":3, "b":1, "s":3, "t":2, "w":1, "h":1, "j":1, "l":1, "m":1}
Squidward, you can't eat all those patties at one time!{"a":3, "b":1, "s":3, "c":1, "t":2, "w":1, "h":1, "j":1, "l":1, "m":1}
I'll have you know, I stubbed my toe last week, while watering my spice garden, and I only cried for 20 minutes.{"a":3, "b":1, "s":3, "c":1, "t":2, "w":1, "h":1, "j":1, "k":1, "l":1, "m":1}
Squidward, you and your nose will definitely not fit in.{"a":4, "b":1, "s":3, "c":1, "t":2, "w":1, "h":1, "j":1, "k":1, "l":1, "m":1}
Who you callin' pinhead?{"a":4, "b":1, "s":3, "c":2, "t":2, "w":1, "h":1, "j":1, "k":1, "l":1, "m":1}
Gee Patrick, I didn't know you could speak bird.{"a":4, "b":1, "s":3, "c":3, "t":2, "w":1, "h":1, "j":1, "k":1, "l":1, "m":1}
Any particular reason you took your pants off.{"a":4, "b":1, "s":3, "c":3, "t":3, "w":1, "h":1, "j":1, "k":1, "l":1, "m":1}
Let me get this straight, you two ordered a giant screen television just so you could play in the box?{"a":4, "b":1, "s":3, "c":4, "t":4, "w":1, "h":1, "j":1, "k":1, "l":1, "m":1}
I'd hate you even if I didn't hate you.{"a":4, "b":1, "s":3, "c":4, "t":4, "e":1, "w":1, "h":1, "j":1, "k":1, "l":1, "m":1}
You're a man now, Spongebob, and it's time you started acting like one.{"a":4, "b":1, "s":4, "c":4, "t":4, "e":1, "w":1, "h":1, "j":1, "k":1, "l":1, "m":1}
Can you give Spongebob his brain back, I had to borrow it for a week.{"a":4, "b":1, "c":4, "e":1, "g":1, "h":1, "j":1, "k":1, "l":1, "m":1, "s":4, "t":4, "w":1}

Synthesizing Text

Now, let's say we do the following steps, starting with the prefix 'you ':
  1. We randomly choose one of the characters that's to follow, and we choose a 'c'
  2. We then slide over by one character move onto the next prefix, which is 'ou c'. We then see the character counts {"a":2, "o":2} for that prefix.
  3. We make another random choice at this point, and we draw the character 'a'. So then we slide onto the prefix 'u ca', and we see the counts {"l":1, "n":1} for that prefix.
  4. We now make a random choice and draw the character 'n'. We then slide over to the prefix ' can', and we see the counts {" ":3, "'":4}
  5. We now make a random choice and draw a space, so we slide over to the prefix 'can ', and we see the counts {"c":1, "h":1, "I":1}
  6. We now make a random choice of an h, moving us to the prefix 'an h', and so on and so forth
So far in this example, we have synthesized the text "you can h", and we could keep going. Here are a few different outcomes if we keep following this process:
  • you can have facial hair!Now than 24, 25.You don't need a new I stupid
  • you can have to die right.You know the true!If your secrets is hot burns down
  • you can have feet?Since who can't you.I'd hate you and I'm absorbing
  • you can have to be alright.If I wish is nothere ther.No, Patty?

As you can see, this text is starting to make a lot more sense than choosing each character independently, even with a very short memory.

The Probability of A Sequence

In addition to synthesizing text in a model trained on k-prefixes from a particular set of documents, we can assess how likely a different document is to be in the style that we've modeled with our Markov chains. To do this, we will compute the probability of a particular sequence given a model. Markov chains have a simplifying assumption of independence that will help make this easier to compute. In particular, the next character is chosen only based on the current prefix, and none of the previous prefixes influence this decision.

Independent events are nice to work with because the probability of independent events occurring is a simple multiplication. For example, it's reasonable to assume that the type of weather in Beijing on a particular day is independent of the weather in Collegeville. So if the probability it rains in Collegeville is 0.4 and the probability it rains in Beijing is 0.6, then the joint probability of both occurring is 0.4 x 0.6 = 0.24

To compute the probability of a particular sequence of characters, we must first compute the probability that each character was drawn under our model, and then we may compute the joint probability of the entire sequence by multiplying them together. The probability of a particular character c preceded by the prefix p is given as \[ p(c) = \frac{N(p.c) + 1}{N(p)+S} \] where

  • N(p) is the number of times the prefix occurs in the model (which can be 0)
  • N(p.c) is the number of times the character c follows prefix p (which can be 0, and which is automatically 0 if the prefix doesn't exist in the model)
  • S is the size of the alphabet in the model (i.e. the number of unique characters across all prefixes)
So the joint probability of all characters is obtained by multiplying a bunch of these together. Note that this gracefully handles the case where we never saw a particular substring in any of the training data; in this case N(p.c) and N(p) are 0, and the probability is 1/S

There is a slight numerical issue with the above scheme, however. Since there are many characters in most sequences of interest, and since the probabilities of each character are small, we can run into arithmetic underflow where the multiplication of many small numbers ends up just bottoming out at zero numerically. To get around this, we can instead compute the ``log probability''. Since \[ \log(AB) = \log(A) + \log(B) \] We can compute the log of the product probabilities as the sum of the log of each probability. So simply modify the formula as \[ \log \left( p(c) \right) = \log \left( \frac{N(p.c) + 1}{N(p)+S} \right) \] and then sum all of these up for each character to get the final log probability.

Simple Example

Let's consider a fully fledge simple example. Let's suppose that we added the string aaacbaaaacabacbbcabcccbccaaac to the model and we considered all 3-prefixes, so that we have these counts in the model (considering the wrap-around)

Now let's say we wanted to compute the probability of the string aacabaa given the the model. There are three unique characters in the alphabet here: a, b, and c, so S = 3. Then, we compute the probability of each chunk aac, aca, cab, aba (all substrings of length 3 with at least one character following them)

pN(p)cN(p.c) \[ \log \left( \frac{N(p.c) + 1}{N(p)+S} \right) \]
aac 3 a 2 = log(2 + 1) / (3 + 3) = -0.6931
aca 2 b 1 = log(1 + 1) / (2 + 3) = -0.9163
cab 2 a 1 = log(1 + 1) / (2 + 3) = -0.9163
aba 1 a 0 = log(0 + 1) / (1 + 3) = -1.386

Summing all of these up, we get -3.91

Assignment Tasks

Starter Code

Click here to download the starter code for this assignment. The main file you will be editing is markov.py. This contains bare bones code for a Markov class for encapsulating all data in a model, and some (not necessarily all) of the instance methods that you need to write, as explained below.

The folder also contains a number of text documents you will use to test your code.

Data Structures / Adding Text (10 Points)

Choose appropriate data structures to represent your Markov chain (dictionaries are highly recommended), and initialize them as instance variables in the constructor. Then, fill in the method add_string to add prefixes and character counts to your data structures for a string that's passed in. You should loop through all possible prefixes in the string passed as a parameter and update the counts of the prefixes (or add a new prefix if this is the first time you're seeing it). You should also update the counts of the characters that follow each prefix. Two things to watch out for here:

  • To prevent your model from running into dead ends when you go to synthesize new text, you should loop around once you run out of room to make a prefix at the end. For example, if you had the text

    CS 477 rocks

    And you chose to use prefix lengths of 5, you should have all of these prefixes

    	CS 47
    	S 477
    	 477 
    	477 r
    	77 ro
    	7 roc
    	 rock	
    

    But also all of these prefixes once you start looping around

    	rocks
    	ocksC
    	cksCS
    	ksCS 
    	sCS 4
    
    in code, you'd construct a Markov model and add this string as follows
  • Do not add strings with a length less than the chosen prefix length. Simply do nothing if such a string is passed.

As an example to check to make sure this is working properly, if you add all of the lines from the the spongebob file using 5-prefixes you should have 4032 unique prefixes, and the prefix counts for the string " you " should be as follows:

Computing Probabilities (10 Pts)

Given a model and some text, we can compute the likelihood that this text was drawn from the model by following the procedure discussed in the background. Fill in the method get_log_probability to return the log probability of a sequence according to your model.

You should first test the simple example to make sure you're agreeing

For a longer example, if you run the following code

You should get a log probability of -40.5

If you then say

You should get a log probability of -69.6. Note how much lower this probability is, even though the sequences are of roughly the same length. This is our model telling is spongebob is much more likely to have said "I'm ready, I'm ready" than "Artificial intelligence".

Experiment 1: 2016 US Presidental Debate (10 Pts)

You will now perform an experiment in the "train/test" paradigm, which is common in machine learning; that is, you will build models on some labeled data that is designated as "training data," and you will see how well your models predict new labeled examples in the "test data" set. You should perform the experiment and write your reflections all in a single Jupyter notebook file

The first experiment you will do is inspired by very similar assignment I did nearly 14 years ago on figuring out if a quote during the 2004 presidential debate was said by Bush or Kerry. To freshen this up, I have obtained the debate transcripts from the 2016 election from Politico. You will train your models on the first two debates and see if you can predict who said a particular quote from the third. In particular, you will initialize a trump and clinton model separately for some prefix length K

You will then loop through 40 different segments from the third debate and classify each one as having come from Trump or Clinton. Here's an example of how to load all of the segments from Trump as strings

For each segment from both Trump and Clinton, you can compute the log probability under both models and see which one has a higher probability. If the correct one has a higher probability, then you can consider this a correct classification. If not, then it is a mistake. You can report the accuracy as the total number of correct classifications over the total number of items in the test set. For example, if I got 60 correct and there were 80 items, the accuracy would be 60/80 = 0.75.

Below is pseudocode on how you might check the Clinton debate examples to see which ones your models got correct. You'd do a similar thing for the Trump debate examples.

In the end, you can think of generating something like a confusion matrix as follows

Guessed that Trump said it (probability of Trump is higher)Guessed that Clinton said it (probability of Clinton is higher)
Quote from Trump x y
Quote from Clinton z w

The elements along the diagonals represent correct classifications, so the proportion of correctly guessed examples would be

\[ \frac{x+w}{x+y+z+w} \]

Here are some things you should explore and show in your notebook

  1. It's unclear a priori what prefix length K would be best for this task. Create a loop where you report accuracies on the test data over a range of K values from 1 to 20, and plot the accuracy versus K. You can use matplotlib to create such a plot. What trends do you notice in the plot? Given what you know about these models, can you explain these trends? Answer this in a markdown cell in your notebook directly under your code.

  2. For the best performing K in your experiment above, print out the text segments that the models got incorrect. Do these make sense to you? Write about your thoughts in the notebook

  3. For the best performing K, print the correctly classified sequences with the top 10 highest absolute difference in normalized log probability between the models; that is, divide the probability that clinton said each string s by len(s)-K, and do the same for Trump, then compute the absolute value of these differences. This normalization controls for the length of the sequences and gives us an idea of how different the log probabilities of each prefix are on average. The top 10 largest differences tell us which sequences the models are the most confident came from one speaker or the other. Do these make sense to you? Write about your thoughts in the notebook

    Hint: The numpy argsort method may come in handy here

Interpretable models

In the interest of time, I decided to cut a task that was there before about digging into the models to see why they're doing what they're doing. One way to do this is to narrow in on which prefixes are leading to the largest disparity in probability for sequences which are said to be very different between the two models. If you did this, you'd see prefixes involving "Hillary", "Clinton", "disaster", "invest", "i will tell you", "Sanders", "you think", "horrible", "millions of dollars", "unbelievable" were much more probable under Trump's model than Clinton's model, for example.

Because we can actually dig into our model and find out meaningful things about how it's working, it's referred to as interpretable. This is a nice feature to have in a model because we can explain more easily why it is or isn't working in practice. This will not be true of all of the models we examine in this class, but it's often important for us to have this property to trust models.

Experiment 2: Spam vs Ham (5 Pts)

Now, you will try your hand at the spam classification with data from the UCI SMS Spam Collection. Given a test set of spam texts and of ham texts, you will train models to classify a test set of spam and ham texts.

As in the debate task, you should loop through a range of K values to see which one yields the best classification accuracy. What do you see about this plot that's different from the debate plot? Why do you think this is? Write about this in your notebook.

Experiment 3: Movie Reviews (5 Points)

Finally, you will train a model to recognize negative or positive movie reviews using the Cornell movie sentiment analysis dataset. There are 1000 positive movie reviews in the text/movies/pos directory and 1000 negative movie reviews in the text/movies/neg directory. Train on the first 900 of each, and test on the last 100 of each. As in the other experiments, perform these classifications on a range of K values (they may take a bit longer since there is a lot more text). What do you notice about the optimal K value and the performance compared to the other two? Why do you think this is? Write about this in your notebook.

Finally, find your own positive and negative movie reviews online, and classify them in your notebook using models built on all of the movie data with the optimal K you found above. Define the reviews as strings in your notebook (Python multiline strings may come in handy). Does your classifier get it correct? How different are the reported probabilities? Write about this in your notebook.

Synthesizing Text (10 Pts)

Now that we're finished the analysis portion of the model, it's time to synthesize new test. Fill in the method synthesize_text to create new text of a particular length drawn with probabilities from your model.

For example, if I run the code

I might get

s birthday, to apply.The safety featured on @BartramsGarden, the event art prof. Carlita Favero. #MLKDay of single fathers participate in hanging to #Ursinus alum Phil DeSimone ’12 of @Carbon back to

Of course, this is only one of many, many possibilities, but it should look roughly like an amalgamation of things that were in the training data, just spat back with no long term memory.

To accomplish this, you should choose a prefix at random from the training text to start with, and then synthesize the rest of the text one character at a time, updating the current prefix you're on to be the last k characters in the string you're synthesizing as you go along. Be sure that the next character you choose respects the probability of the model. You may want to refer back to our class exercise to see how we were able to draw characters respecting probabilities.

Poetic Statement (5 Pts)

Come up with your own collection of text and train a model. Find at least 100 lines of text in a particular style, and place them all in a text file on a different line (you should choose something other than Ursinus tweets, sith lords, spongebob quotes, or Homer Simpson quotes). Then, play around with the prefix size and see what kinds of outputs you get. Once you are satisfied, generate three or more examples that you believe capture the spirit of your synthesizer, which will be displayed on a class gallery.

One thing you might try to make this really fun is something I'll call Markov Hybrids. The idea is to load in two different styles in the same model and to have the synthesizer mash them up. For example, if you load both the Sith quotes and Ursinus tweets in, you may get some results like this:

  • "the cpd has yet another tip for undergraduates to explore the dark side"
  • "check out this scholarship alert: before your eyes. i can drive you mad with fear, shred your sanity, and leave you a raving lunatic"
  • "thomas j. watson fellowship, a yearlong grant that anyone who knows the words and scientific committee of the force"
  • "kill the spring! 78 days until opening day!"
  • "vanessa wilson-marshall '02 recalls the words and actions of significance is the result of conquest"
If you choose to do this, at least one of the styles should be different from the examples I've given.

Ethics Reflection (5 Pts)

With great power comes great responsibility. Even this relatively simple model can be used for nefarious purposes, and we should take care with how such code is deployed. Reflect on this, and see if you can come up with at least three such examples. What safeguards do we have against the examples you came up with?