Banned Words (#9)

by Fredrik Jagenheim

At work we discovered that they installed a spam filter that throws away e-mail that it considers spam. It doesn't use any Bayes Filter, it simply checks for certain words that it considers 'banned'. One word we discovered was 'sex', which is a Swedish word for the number six. So the phrase "I'll be home at six o'clock." will be classified as spam, thrown away and never seen.

The Ruby Quiz I propose is to figure out which words are banned. Since the filter is a black box, we can only find out which words they are by sending email through it. The real problem is to find out how to do it with as *few* emails as possible.

Of course I don't want the ruby community to do a Denial-of-Service on my employer's mailserver, so do it as a local filter; perhaps something like this:

# by JEG2

# A simple class for managing a filter that prevents to use
# of a given _banned_words_ list.
class LanguageFilter
# Create a new LanguageFilter object that will
# disallow _banned_words_.
# Accepts a list of words, arrays of words,
# or a combination of the two.
def initialize( *banned_words )
@banned_words = banned_words.flatten.sort
@clean_calls = 0

# A count of the calls to <i>clean?</i>.
attr_reader :clean_calls

# Test if provided _text_ is allowable by this filter.
# Returns *false* if _text_ contains _banned_words_,
# *true* if it does not.
def clean?( text )
@clean_calls += 1
@banned_words.each do |word|
return false if text =~ /\b#{word}\b/

# Verify a _suspect_words_ list against the actual
# _banned_words_ list.
# Returns *false* if the two lists are not identical or
# *true* if the lists do match.
# Accepts a list of words, arrays of words,
# or a combination of the two.
def verify( *suspect_words )
suspect_words.flatten.sort == @banned_words

filter = "six"

filter.clean?("I'll be home at six.") # => false
filter.clean?("Do not taunt the happy fun ball!") # => true

filter.verify("ball") # => false
filter.verify("six") # => true

filter.clean_calls # => 2

So, figure out how to find the hidden words, using as few calls to LanguageFilter#clean? as possible.

Which algorithms do you find are effective when many words are blocked (like 10%) and which are effective when very few are blocked(1 in 20000)?

All solutions should do better than this: ;)

dict = ["foo", "bar", "six", "baz"]
filter = "six"

p dict - (dict.dup.delete_if { |word| not filter.clean?(word) })

Quiz Summary

The general idea behind a lot of the solutions to this week's quiz was pretty basic. Try a big list (probably the whole list, in this problem), if that gets blocked divide it into smaller lists and try again.

When one of these chunks of words gets through, we know that every words in that chunk was clean. The higher up in our search that happens, the more work that saves us. Because of that, this solution is ideal when there aren't a lot of banned words, as would probably be the case in the real world example of this quiz.

Here's my own solution as the most basic example of this process:

def isolate( list, test )
if test.clean? list.join(" ")
elsif list.size == 1
left, right = list[0...(list.size / 2)],
list[(list.size / 2)..-1]
isolate(left, test) + isolate(right, test)

isolate() is a recursive routine that takes an Array of words and a test filter and returns the banned words. If the entire words list passes a clean?() test, we return an empty Array (no banned words in the given list). If we don't get an okay from clean?() and we only have one word in the list, we've found a banned word and we return the one word Array itself to show that. Finally, if we didn't pass clean?() and we have more than one word, we divide the list in half, call isolate() on each half, and combine the results of both of those calls. Eventually, this drills down to find all the banned words.

Of course, that's just the basics. Wayne Vucenic found a very clever optimization. If we check a big list and it gets banned, then we split it up and check the first half, which comes back clean, we know the second have would be banned and can skip the check. That could save a significant amount of messages that we would otherwise need to send.

Brian Schröder made a search for the optimal divisor for the word list, when breaking it into chunks. Brian landed on 3 as the magic number, but there has been some follow up discussion on Ruby Talk about his findings and if they are complete. Regardless, do download Brian's code to see just how much thought and effort he has put into it (and why he wants to sue me).

Jannis Harder didn't cut the word list in half at each step either. I'm not sure exactly how Jannis arrived at the progression used, but it's an interesting scale:

@steps = [110000,700000,300000,200000,110000,

In my own testing, this doesn't seem to best Wayne's optimization though.

Jannis Harder did post a test framework that ended up being used by multiple entries. I thought that was very helpful.

The final entry was by Michael Ulm, who used knowledge of the dictionary size and banned word probability to optimize the number of messages needed to find the banned words. This is a nice example of what can be done with a little bit of knowledge, though that's probably not too likely in real world scenarios.

The main issue brought up in discussion was that my filter class had issues. Several alternatives where proposed, each with their own issues. Matching a word can be surprisingly tricky, when you take into account simple things like plurals, international characters, and apostrophes. In the end, everybody seem to just use what they could live with.

My thanks to the submitters and discussers, informative as always.

We'll try a classic programming challenge tomorrow from Knuth...