Markov Chains (#74)

This week's Ruby Quiz is about text generation. That's right, we're going to teach your computer to weave tall tales.

At its most basic level, a solution might be:

>> (1..30).map { (("a".."z").to_a + [" "] * 10)[rand(36)] }.join
=> "fb mcr hhluesjbhtf swm eehokmi"

However, let's make our goal to get as close to English looking sentences as possible. One way you might do this is using a technique called Markov Chains.

To use Markov Chains, you read some text document(s), making note of which characters commonly follow which characters or which words commonly follow other words (it works for either scale). Then, when generating text, you just select a character or word to output, based on the characters or words that came before it.

The number of previous characters or words considered is called the "order" and you can adjust that to try and find a natural feel. For example, here is some generated text using a second order word chain derived from the Sherlock Holmes novel "The Hound of the Baskervilles" by Arthur Conan Doyle:

The stars shone cold and bright, while a crushing weight of responsibility
from my shoulders. Suddenly my thoughts with sadness. Then on the lady's
face. "What can I assist you?"

If you need text's to prime your program with, I suggest searching Project Gutenberg:

Project Gutenberg


Quiz Summary

I owe myself a signed copy of my book. Err, I mean, this was a popular quiz! That's always great because I have so many wonderful selections of code to talk about in the summary, but it can also be hard because I have so many wonderful selections of code to talk about in the summary. All of the solutions were terrificly educational, even the ones I don't have time to discuss below. Everyone who reads this summary is bound by Ruby Law (can be tried in the Ruby Courts) to go look through the rest of the submissions.

Now that we have the legal formalities out of the way, let's dig into a solution. Jon Egil Strand wrote in with descriptions of several practical applications for Markov Chains and this bit of code:

ruby
class MarkovChain
def initialize(text)
@words = Hash.new
wordlist = text.split
wordlist.each_with_index do |word, index|
add(word, wordlist[index + 1]) if index <= wordlist.size - 2
end
end

def add(word, next_word)
@words[word] = Hash.new(0) if !@words[word]
@words[word][next_word] += 1
end

def get(word)
return "" if !@words[word]
followers = @words[word]
sum = followers.inject(0) {|sum,kv| sum += kv[1]}
random = rand(sum)+1
partial_sum = 0
next_word = followers.find do |word, count|
partial_sum += count
partial_sum >= random
end.first
next_word
end
end

mc = MarkovChain.new(
File.read("Agatha Christie - The Mysterious Affair at Styles.txt")
)

sentence = ""
word = "Murder"
until sentence.count(".") == 4
sentence << word << " "
word = mc.get(word)
end
puts sentence << "\n\n"

Let's examine the MarkovChain object first, as it is where the real work is done. In initialize() you can see that a Hash is created, the text is split() on whitespace, and then each pair of adjacent words are passed to the add() method.

In add(), the first word is used to key the Hash we saw created in initialize(), adding a frequency Hash as the value the first time a new word is seen. The following word is then added to the counts in the frequency Hash. Obviously, we are just working with a first order chain here, since we only ever track the word that immediately preceded the current word.

The real work is done in the get() method, where we pass the current word and are returned a reasonable follow-up word. The word is checked to make sure we have seen it, then the frequency Hash is pulled for that word. Next, the frequencies of all possible follow-up words are summed and a random selection is made in that range. Finally, the frequency Hash is walked one last time, and the random number used to select and return the indicated word.

The rest of the solution code follows the class. First a MarkovChain is constructed from an Agatha Christie novel. Then a variable is prepared to hold the output (called sentence, but it will actually hold a few sentences) and a starting word appropriate to the content is hand-picked. The script then loops, using get() to choose words one by one until it has collected at least four sentences of output, which are dumped to the user before we exit.

The above code does not deal with punctuation at all, which can be both good and bad. The upside is that you will see some natural punctuation, like commas and they will generally even appear in reasonable places since they are attached to the word from the original text. The downside is that you may get something like an opening quote, but never get the closing quote. Sometimes it works out naturally, and sometimes it doesn't.

Now, in this whole discussion, I was loosely throwing around this "frequency Hash" term that I never bothered to define. Allow me to fix that now. The frequency Hash is obviously just a Hash with keys being the words that can follow and values being a count of how many times that word was seen to follow in the original text. Dominik Bathon included excellent examples of how this looks in his submission email. Using Dominik's trivial example text of:

q s a w s. a s w q s. q w s a a w. s q a w s.

Jon's code will produce a Hash tree that looks like:

ruby
{ "w." => {"s" => 1},
"w" => {"s." => 2, "q" => 1, "s" => 1},
"a" => {"w." => 1, "a" => 1, "w" => 2, "s" => 1},
"s." => {"a" => 1, "q" => 1},
"q" => {"a" => 1, "w" => 1, "s." => 1, "s" => 1},
"s" => {"w" => 1, "a" => 2, "q" => 1} }

You can really see how punctuation changes things there, since "w" and "w." are different words.

The right hand side of that (the values of the top level Hash), shows the frequency breakdown. "w" occurs 2 times after an "a", but "s" occurs just once following the same word.

Domink then went on to show how a person might change that representation. Here are two significant changes:

ruby
{ "w" => { "q" => ["s"],
"s" => [".", "a", "."] },
"a" => { "a" => ["w"],
"w" => ["s", ".", "s"],
"s" => ["w"] },
"q" => { "a" => ["w"],
"w" => ["s"],
"s" => ["a", "."] },
"s" => { "w" => ["q"],
"a" => ["w", "a"],
"q" => ["a"] } }

Note that punctuation is handled differently here, because I'm now using Dominik's examples and his code used a different implementation. In Dominik's code, periods, exclamation marks, and question marks are treated as their own words and most other punctuation is ignored.

Now try to ignore the tree structure of the above for now and just focus again on that right-hand side. This is still just a frequency Hash, though it is now disguised as an Array. If you have the Hash {"a" => 1, "." => 2}, you could select the next word just as we saw Jon do earlier. However, the Array [".", "a", "."] represents the same data and we can select a word just by randomly selecting a member. The duplication ensures that the odds are still the same.

The other obvious change is that we are now taking into account an order for the chain. This works out pretty naturally, just by nesting Hashes to the depth of the order. We then walk the tree to find out what comes next. For example, if we have a "w" followed by a "a", we get to our now familiar Array of choices for the next word [".", "a", "."].

But wait, Dominik goes one step further:

ruby
{ ["q", "a"] => ["w"],
["q", "w"] => ["s"],
["s", "q"] => ["a"],
["w", "q"] => ["s"],
["s", "w"] => ["q"],
["a", "s"] => ["w"],
["w", "s"] => [".", "a", "."],
["s", "a"] => ["w", "a"],
["q", "s"] => ["a", "."],
["a", "a"] => ["w"],
["a", "w"] => ["s", ".", "s"] }

This is exactly the same as the tree above, except that the Hash tree of preceding words has been flattened into a single Array of words, used to key the Hash. This is possible because Ruby's Arrays are hashable. Isn't that handy? The gain is speed. Since we only have to make one Hash lookup to get to the frequency Array, it works a little quicker.

Now that we have gone through the trouble of understanding Dominik's data structure, the code should be a breeze. Let's take a look:

ruby
# ...

if $0 == __FILE__
order = 2
n = 10
while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
if $1 == "o"
order = Integer($2)
else
n = Integer($2)
end
ARGV.shift
end
mc = MarkovChainer.new(order)
mc.add_text(ARGF.read)
n.times {
puts mc.generate_sentence
}
end

This is the last little bit of Dominik's code, but it kicks the process off. You can see that it selects defaults for the order and number of sentences to produce, but then updates them from the provided command-line options. Next, a MarkovChainer is built and fed the combined contents of ARGF. Finally, the requested number of sentences are generated with a simple method call repeated inside of an iterator.

Let's look at the pieces of MarkovChainer used to read text:

ruby
class MarkovChainer
attr_reader :order
def initialize(order)
@order = order
@beginnings = []
@freq = {}
end

def add_text(text)
# make sure each paragraph ends with some sentence terminator
text.gsub!(/\n\s*\n/m, ".")
text << "."
seps = /([.!?])/
sentence = ""
text.split(seps).each { |p|
if seps =~ p
add_sentence(sentence, p)
sentence = ""
else
sentence = p
end
}
end

# ...

private

def add_sentence(str, terminator)
words = str.scan(/[\w']+/)
return unless words.size > order # ignore short sentences
words << terminator
buf = []
words.each { |w|
buf << w
if buf.size == order + 1
(@freq[buf[0..-2]] ||= []) << buf[-1]
buf.shift
end
}
@beginnings << words[0, order]
end

# ...

end

The initialize() method shows a familiar setup, save that new Array we haven't talked about yet. We will tackle that when it comes up.

add_text() isn't much more complicated. It ensures the document it is dealing with contains complete sentences, then breaks it on the end-of-line punctuation characters I mentioned earlier. The trick here is that you need to notice the parentheses in the Regexp handed to split(). Ordinarily, split() will remove the match used to divide the chunks, but when the match includes capturing parentheses, those captures are returned as items of their own. So, the final iterator of the method reads the sentence then the punctuation that followed it and hands those over to add_sentence().

add_sentence() is where the data structure we analyzed earlier gets built. The sentence is separated into words, including the end-of-line punctuation don't forget. Then those words are pushed into a buffer one at a time. When the buffer has enough words stored to satisfy the order, it starts filling the Hash of frequency Arrays for this chain. Finally, we see what that extra Array is used for. It holds beginnings, or groups of words used to start a sentence (when we wouldn't have previous words to check frequencies for). You can see those being stored in the last line of this method.

Here's the other side of that class, text writing:

ruby
class MarkovChainer
# ...

def generate_sentence
res = @beginnings[rand(@beginnings.size)]
loop {
unless nw = next_word_for(res[-order, order])
return res[0..-2].join(" ") + res.last
end
res << nw
}
end

private

# ...

def next_word_for(words)
arr = @freq[words]
arr && arr[rand(arr.size)]
end

end

Not much here, as you can see. With the data structure built, the problem practically solves itself.

The interface method, generate_sentence(), starts by selecting a beginning from the Array of possible beginnings. It then iterates using next_word_for() to grab follow-up words until a nil is returned (after an end-of-line character, since we didn't give follow-ups for them). Whatever we have at that point becomes the returned sentence, after adding a sprinkle of whitespace between the words.

The helper method, next_word_for(), just does the lookup from the frequency Array. Don't let that fancy last line throw you, it's just a shortcut to keep from calling []() when arr is nil.

A huge thank you to all who came out of hiding to get the quiz rolling again. I just knew I would find a problem you couldn't resist eventually.

The response to the Ruby Quiz contest has been strong and tomorrow we will start our run of contributed quizzes with an offering from Pat Eyler...