Proper Case (#89)

by elliot temple

sometimes i type in all or mostly lowercase. a friend of mine says it's hard to read essays with no capital letters. so the problem is to write a method which takes a string (which could include many paragraphs), and capitalizes words that should be capitalized. at minimum it should do the starts of sentences.

solutions could range from simple (a few regexes) to complex (lots of special cases are possible, like abbreviations that use a period). an addition would be using a dictionary to find proper nouns and capitalize those. it could also ask the user about cases the program can't figure out. or log them.

i can provide an example solution (regex based) and a list of reasons it doesn't work very well, if you want.

sample input:

- this email itself works nicely

- this one is hard. sometimes i might want to write about gsub vs. gsub! without the "." or "!" causing any capitalization (or the punctuation in quotes).

one problem is maybe dealing with sentences that contain periods is too hard. i don't know.


Quiz Summary

I'm sad no one but the quiz creator himself gave this problem a shot. This is a very real problem with all manner of source texts and fixing it is tricky. There was even a discussion on the mailing list about how you can't count on there being two spaces at the end of a sentence.

You really need natural language processing to correctly determine which words to capitalize. Unfortunately, natural language processing is complex and often not a perfect solution anyway.

The good news is that we can use some heuristics to get close. A heuristic is a loosely defined rule or, put another way, the computer science equivalent of a guess. These are often developed by just trying to get close to a solution and then tweaking little things here and there to close in on the target. The result won't be perfect, of course, but it may be good enough. It's a very agile process and Rubyists love that.

Let's see what heuristics Elliot came up with now, starting with some code used to correct common Netspeak misspellings:

ruby
Abbreviations = { "ppl" => "people",
                  "btwn" => "between",
                  "ur" => "your",
                  "u" => "you",
                  "diff" => "different",
                  "ofc" => "of course",
                  "liek" => "like",
                  "rly" => "really",
                  "i" => "I",
                  "i'm" => "I'm" }

def fix_abbreviations text
  Abbreviations.each_key do |abbrev|
    text = text.gsub %r[(^|(\s))#{abbrev}((\s)|[.,?!]|$)]i do |m|
      m.gsub(/\w+/, "#{Abbreviations[abbrev]}")
    end
  end
  text
end

# ...

This code is fairly trivial, but still quite effective. Using a predefined Hash, the method just scans the text for the keys, swapping them out for the provided values when found. Note that the expression used to find the key tries to ensure it is not in the middle of some larger word by looking for leading and trailing whitespace or punctuation.

That expression could probably be simplified to %r[\b#{abbrev}\b] which looks for word boundaries (a \W\w or \w\W transition) and means close to the same thing. This would allow Elliot to do the search and replace in a single call to gsub(), instead of the current nested call to avoid replacing the surrounding space or punctuation. (You can do it with a single gsub() call even without using \b, just FYI: text.gsub(%r[(^|(\s))#{abbrev}((\s)|[.,?!]|$)]i, "\\1#{Abbreviations[abbrev]}\\3"). The trailing delimiter is the third capture group, because the nested (\s) inside the first group counts as the second.)
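
As a rough illustration of the word-boundary idea, here is a minimal sketch. The constant ABBREVIATIONS (a trimmed copy of Elliot's Hash) and the method name fix_abbreviations_with_boundaries are made up for this example, and Regexp.escape is an added precaution that the original code does not use.

```ruby
# Trimmed, hypothetical copy of the abbreviation table for illustration.
ABBREVIATIONS = {
  "ppl" => "people",
  "u"   => "you",
  "i"   => "I"
}.freeze

# Replace each abbreviation wherever it appears as a whole word.
# \b matches at a word boundary without consuming the neighboring
# character, so one gsub per key is enough and no nested call is needed.
def fix_abbreviations_with_boundaries(text)
  ABBREVIATIONS.reduce(text) do |result, (abbrev, expansion)|
    result.gsub(/\b#{Regexp.escape(abbrev)}\b/i, expansion)
  end
end

puts fix_abbreviations_with_boundaries("ppl say u and i agree")
```

Because \b consumes nothing, back-to-back abbreviations such as "u u" are both replaced. The trade-off is that \b also treats apostrophes and hyphens as boundaries, so the "i" rule fires inside "i'm" as well.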

The important aspect of this solution though is that it knows it's not perfect and gives you the Hash as a means to make it better. If it doesn't handle your text correctly, you can always add or delete entries from the Hash to improve the results.

Let's look at some more code, this time for capitalizing proper nouns:

ruby
require "yaml"

# ...

def capitalize_proper_nouns text
  # File.exist? replaces File.exists?, which is gone from current Ruby versions
  if not File.exist?("proper_nouns.yaml")
    make_capitalize_proper_nouns_file
  end
  proper_nouns = YAML.load_file "proper_nouns.yaml"
  text = text.gsub /\w+/ do |word|
    proper_nouns[word] || word
  end
  text
end

def make_capitalize_proper_nouns_file
  words = File.read("/Users/curi/me/words.txt").split "\n"
  lowercase_words = words.select {|w| w =~ /^[a-z]/}.map{|w| w.downcase}
  words = words.map{|w| w.downcase} - lowercase_words
  proper_nouns = words.inject({}) { |h, w| h[w] = w.capitalize; h }
  File.open("proper_nouns.yaml", "w") {|f| YAML.dump(proper_nouns, f)}
end

# ...

This is an interesting two-tiered approach. If the program can locate a proper_nouns.yaml file, a Hash is pulled from it and used to capitalize the listed nouns. If the file cannot be found, a hand-off is made to make_capitalize_proper_nouns_file(). The code in that method appears to read a word list file and build up its own list of proper nouns. This list is then flushed to the YAML file, so it will be found on future loads.
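
The load-or-build pattern described above can be sketched in a few lines. This is an illustration only, not Elliot's code: the method names build_proper_nouns and load_proper_nouns and both path parameters are hypothetical, and it assumes Ruby 2.6 or newer plus a word list with one word per line in which proper nouns are the entries that never appear in lowercase.

```ruby
require "yaml"

# Keep only words that never occur with a lowercase first letter, and map
# each one (downcased) to its capitalized form.
def build_proper_nouns(words)
  lowercase = words.grep(/\A[a-z]/).map(&:downcase)
  (words.map(&:downcase) - lowercase).uniq.to_h { |w| [w, w.capitalize] }
end

# Tier one: reuse the cached YAML file if it exists.
# Tier two: build the Hash from the word list and cache it for next time.
def load_proper_nouns(cache_path, word_list_path)
  return YAML.load_file(cache_path) if File.exist?(cache_path)

  nouns = build_proper_nouns(File.readlines(word_list_path, chomp: true))
  File.write(cache_path, YAML.dump(nouns))
  nouns
end
```

Note that a name such as Temple is dropped whenever the list also contains the common word temple, which is one reason a hand-edited YAML file can give better results than the generated one.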

What I liked about this was how I could customize it, yet again. When testing Elliot's code against the quiz text, I just built a quick Hash with the needed keys and values:

$ ruby -r yaml -e 'y Hash[*%w[Elliot Temple].map { |pn| [pn.downcase, pn] }.
flatten]' > proper_nouns.yaml
$ cat proper_nouns.yaml
---
temple: Temple
elliot: Elliot

Getting back to the code, we're again using a trivial regular expression based swap, which you can see in the second half of capitalize_proper_nouns(). It matches all words (well, a run of \w characters) and replaces them with the proper noun capitalization, if there is such a thing, or the word itself, causing no change.
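
The swap itself boils down to a Hash lookup inside a gsub() block. A minimal sketch, using a hypothetical method name and passing the Hash in as a parameter instead of loading it from YAML:

```ruby
# Replace every run of word characters with its proper-noun form when the
# Hash has one; otherwise leave the word untouched.
def apply_proper_nouns(text, proper_nouns)
  text.gsub(/\w+/) { |word| proper_nouns.fetch(word, word) }
end

nouns = { "elliot" => "Elliot", "temple" => "Temple" }
puts apply_proper_nouns("by elliot temple", nouns)
```

The lookup is case-sensitive, so the keys need to be lowercase, and a word typed as ELLIOT passes through unchanged.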

Now we can put all of that together with a few more heuristics to get a complete solution:

ruby
# ...

def capitalize text
  return "" if text.nil?
  text = fix_abbreviations text
  text = text.gsub /([?!.-]\s+)(\w+)/ do |m|
    "#$1#{$2.capitalize}"
  end
  text = text.gsub /(\n)(\w+)/ do |m|
    "#$1#{$2.capitalize}"
  end
  text = text.gsub /\A(\w+)/ do |m|
    "#{$1.capitalize}"
  end
  text = text.gsub %r[\sHttp://] do |m|
    "#{$&.downcase}"
  end
  text = capitalize_proper_nouns text
  text
end

puts capitalize(ARGF.read)

This method triggers the fixes for abbreviations and proper nouns that we have already examined. In addition, it uses regular expressions to capitalize word characters following sentence-ending punctuation, as well as word characters at the beginning of a line or the document. It then corrects the protocol identifier for inline links it may have damaged in the process.
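
To make the sentence-start heuristic concrete, here is a condensed sketch that folds the three capitalization passes into one expression and then repairs links. The method name capitalize_sentence_starts is hypothetical, and the combined pattern is a simplification for illustration, not the code from the solution.

```ruby
def capitalize_sentence_starts(text)
  # Capitalize the first word of the document, of each line, and after
  # sentence-ending punctuation (or a hyphen) followed by whitespace.
  capitalized = text.gsub(/(\A|\n|[?!.-]\s+)(\w+)/) { "#{$1}#{$2.capitalize}" }
  # Undo the damage to links that happen to start a sentence.
  capitalized.gsub(%r{(\s)Http://}) { "#{$1}http://" }
end

puts capitalize_sentence_starts("this is hard. is it? see http://example.com")
```

One thing to keep in mind is that String#capitalize also downcases the rest of the word, so a mixed-case word such as iPhone at the start of a sentence would come out as Iphone.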

So, how does this do on the quiz document? Generally quite good. It makes only two obvious errors:

By Elliot Temple

and:

Sometimes I might want to write about gsub vs. Gsub! Without the...

The first error is that we generally do not capitalize the by in a byline. That could probably be worked around with another regular expression correction.
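
One possible shape for such a correction, offered as a guess rather than a tested rule: lowercase a leading By when the rest of the line consists only of capitalized words. The method name fix_byline and the pattern are hypothetical.

```ruby
# Lowercase "By" at the start of a line when every remaining word on that
# line is capitalized, as in a byline.
def fix_byline(text)
  text.gsub(/^By(?=(?:[ \t]+[A-Z]\w*)+[ \t]*$)/, "by")
end

puts fix_byline("By Elliot Temple\n\nBy the way, this line is left alone.")
```

A short line such as By Monday would be lowercased too, which shows how quickly these rules pile up exceptions of their own.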

The second issue is much harder to get right and here is where we start to miss a natural language processing facility. When humans read that line we know that gsub!() and without should not be capitalized because of the context they are used in. The script is not-so-clever though and the period and exclamation point throw it off. You could add rules to work around these cases as well, but you will definitely be fighting an uphill battle at that point.
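
As a sketch of what such a rule might look like, the following splits the text on whitespace and skips capitalization after tokens on an exception list. Everything here is hypothetical: the NON_TERMINAL list, the method name, and the token-based approach are illustrations, not part of Elliot's solution.

```ruby
# Hypothetical list of tokens whose trailing punctuation does not end a sentence.
NON_TERMINAL = %w[vs. e.g. i.e. etc. gsub!].freeze

def capitalize_sentences(text, exceptions = NON_TERMINAL)
  previous = nil
  # The capture group keeps the whitespace separators in the result,
  # so joining the tokens reproduces the original spacing.
  text.split(/(\s+)/).map do |token|
    next token if token.strip.empty? # separators pass through unchanged

    starts_sentence = previous.nil? ||
                      (previous.match?(/[?!.]\z/) && !exceptions.include?(previous.downcase))
    previous = token
    # Upcase the first word character, leaving the rest of the token alone.
    starts_sentence ? token.sub(/\w/) { |char| char.upcase } : token
  end.join
end

puts capitalize_sentences("i write about gsub vs. gsub! without trouble. it works.")
```

The cost is visible right away: a sentence that really does end in gsub! or etc. no longer triggers capitalization of the next word.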

I still say the end result is quite good though. Count how many characters are wrong in the quiz and subtract from that the three miscapitalized words in the output (By, Gsub, and Without). It's a big improvement.

My thanks to Elliot Temple for the problem and being brave enough to put together a solution.

Tomorrow we'll try our hand at another simple pen and paper game and see who can solve it in record time...