DictionaryMatcher (#103)

by Ken Bloom

From time to time someone asks on ruby-talk how they can write a regexp of the form:

ruby
/alligator|crocodile|bear|dinosaur|...|seven-thousandth-word/

It's not hard to write such a regexp, but Ruby has in internal limit on how big the regular expression can be, so users find they can't do this matching function easily.

Implement a class DictionaryMatcher that determines whether any of the strings added to it are substrings of a string S. This should function as almost a drop-in replacement for a Regexp, therefore your implementation should support the following operations:

ruby
# creates a new empty matcher
dm=DictionaryMatcher.new

# adds strings to the matcher
dm << "string"
dm << "Ruby"

# determines whether a given word was one of those added to the matcher
dm.include?("Ruby") # => true
dm.include?("missing") # => false
dm.include?("stringing you along") # => false

# Regexp-like substing search
dm =~ "long string" # => 5
dm =~ "rub you the wrong way" # => nil

# will automatically work as a result of implementing
# DictionaryMatcher#=~ (see String#=~)
"long string" =~ dm # => true

# implement the rest of the interface implemented by Regexps (well, almost)
class DictionaryMatcher
alias_method :===, :=~
alias_method :match, :=~
end

If you can add additional features, like a case insensitivity option when creating a new DictionaryMatcher this is also very useful.


Quiz Summary

As stated in the quiz, this question comes up quite a bit. There are a few ways to address including a fairly simple approach. Let's examine Jamie Macey's rather short solution for an example of that:

ruby
class DictionaryMatcher < Array
alias_method :===, :=~
alias_method :match, :=~

def initialize(default = [], options = nil)
super(default)

unless options.nil? or options.is_a? Fixnum
options = Regexp::IGNORECASE
end
@regexp_options = options
end

def =~(string)
self.map{|e| Regexp.new(e, @regexp_options) =~ string }.compact.min
end
end

Jamie's idea is easy enough to follow: DictionaryMatcher is just an Array of expressions and we can hit the String with them one at a time to find a match. This code even has an advantage over many of the solutions in that the individual expressions themselves can be arbitrarily complex (full regular expressions).

The big downside to this approach though is speed. Unfortunately, it's a pretty big downside in this case because if you had been dealing with a small number of matches to begin with, you probably wouldn't have needed a DictionaryMatcher. The main reason performance is bad is that all expressions must be tested, to find the match that occurs first in the String. All that context shifting in and out of the regular expression engine just takes time.

If we're going to get around that, we need a clever way to store the terms we are looking for and a custom match process that takes advantage of that data structure.

The structure used by the majority of the solutions to store the words is called a trie or prefix tree. Louis J Scoras wrote a solution using Trie objects, and even taught them how to display themselves. This might help you see how this works. Have a look at this example (with corrected indentation):

ruby
>> t = Trie.new
=>
>> t["cat"] = true
=> true
>> t["cab"] = true
=> true
>> t["cate"] = true
=> true
>> t["bob"] = true
=> true
>> t
=> b =>
o =>
b =>
value: true
c =>
a =>
b =>
value: true
t =>
value: true
e =>
value: true

This structure is similar to a Hash, in that it has keys and values. (Values are only used to indicate word boundaries, so we will leave those as true and focus on the keys.) Where a Trie differs from a Hash is how it stores the keys.

The key "cat" isn't stored as one complete String, instead it is stored as a separate "c", "a", and "t". Those individual letters are nested beneath each other in the structure to indicate order. "c" comes before the "a" beneath it, which comes before the "t" beneath it.

Now, to add "cab" or even "cate" into the structure just involves adding in the new letters at the right depth. "bob", on the other hand, begins a new tree, since it starts with a different letter.

This structure can be used to make rather efficient large matches. Where a Regexp has to ask does "cat", "cab", or "cate" match here, the tree version just checks to see if the "c" matches here. When it does, more letters still need to be checked, but when it doesn't we instantly know that all "c" words are no good and we can move on.

To see how that comes together, let's examine some of Ross Bamford's solution:

ruby
class DictionaryMatcher
# ...

def initialize(*words)
@pt = {}
words.each { |word| self << word }
end

def <<(word)
# small memory optimization - if there's a longer word that shares
# this prefix, we can discard it since we'll only ever take the
# shortest match anyway.
word.split('').inject(@pt) do |pt, chr|
pt[chr] ||= {}
end.clear[:__WORD__] = true

self
end

# ...

def match(str, start_ofs = 0)
start_ofs.upto(str.length) do |i|
word = ""
next_pt = @pt
si = i
while next_pt = next_pt[chr = str[i,1]]
word << chr
return MatchData.new(si, i, word, str) if next_pt[:__WORD__]
i+=1
end
end

nil
end

def =~(str)
m = match(str) and m.start_offset
end

# ...
end

I've trimmed a lot of code here, but what I've shown is the heart of the matching algorithm. Ross uses a simple set of nested Hashes to build his prefix tree (@pt) in initialize().

Words are inserted into the tree in the <<() method. It splits the word into letters, and walks the nested Hashes inserting each letter. When it reaches the end of the word, any further nesting is cleared (an optimization explained in the comment) and the special :__WORD__ marker is inserted to indicate a word boundary.

The rest of the magic is in the match() method. It works much as I explained before. It walks the String index by index. If a character is found in the prefix tree, it tries to find a run of matches by walking the tree forward (the while loop). When it makes it all the way to a word boundary marker, it declares victory by returning Ross's custom MatchData object (not shown). If the tree walk fails, the code advances to the next index and when the indices are exhausted, nil is tossed to indicate that no match could be found.

The =~() method also uses match(), but changes the return value to mimic Ruby.

There's certainly more to see in the solutions and I hope this gives you enough of a map to encourage your own spelunking expidition into the code. A big thank you to all of the programmers who helped reduce this FAQ to a link we can pass on to those who ask.

Sharpen your turtles folks, because it's pretty picture drawing time with tomorrow's Ruby Quiz...