Crossword Solver (#132)

by Eugene Kalenkovich

Write a Ruby crossword solver. Randomly fill a crossword template with words from a dictionary. Use any dictionary you want (/usr/share/dict/words, text of pickaxe, etc.).

Template is provided in a file formatted very similarly to one in quiz #10 (http://www.rubyquiz.com/quiz10.html), with "X" substituted with "#". Simple template example:

_ _ _ _ _

_ # _ # _

_ _ _ _ _

_ # _ # _

_ _ _ _ _

Format output any way you want, just make it readable. Each run of the program with big enough dictionary should give a different solution. Example results for a simple template:

F U G U E

U U R

D E I G N

G D S

E J E C T



R E S T S

I K T

N I E C E

D W E

S U S A N

Bonus 1 (simple). Allow partially pre-filled templates:

# # _ # # # # M

_ _ _ _ _ _ # A

# # _ # # _ # T

R U B Y Q U I Z

U # _ # # _ # #

B # _ _ _ _ _ _

Y # # # # _ # #

One of result variants:

U M

P A S C A L A

A O T

R U B Y Q U I Z

U L N

B Y E A G E R

Y E

Bonus 2: Avoid combinatorial explosion. Fill a big template within reasonable time:

_ _ _ _ # _ _ _ _ _ _ _ _ _ # _ _ _ _

_ # _ # _ # _ # _ # _ # _ # _ # _ # _

_ _ _ _ _ _ _ _ _ # _ _ _ _ _ _ _ _ _

_ # _ # _ # _ # _ _ _ # _ # _ # _ # _

# _ _ _ _ # _ # _ # _ # _ # _ _ _ _ #

_ # _ # # # _ _ _ _ _ _ _ # # # _ # _

_ _ _ _ _ _ # # _ # _ # # _ _ _ _ _ _

_ # _ # # _ # _ _ _ _ _ # _ # # _ # _

_ _ _ _ _ _ _ _ # _ # _ _ _ _ _ _ _ _

_ # # _ # _ # _ _ _ _ _ # _ # _ # # _

_ _ _ _ _ _ _ _ # _ # _ _ _ _ _ _ _ _

_ # _ # # _ # _ _ _ _ _ # _ # # _ # _

_ _ _ _ _ _ # # _ # _ # # _ _ _ _ _ _

_ # _ # # # _ _ _ _ _ _ _ # # # _ # _

# _ _ _ _ # _ # _ # _ # _ # _ _ _ _ #

_ # _ # _ # _ # _ _ _ # _ # _ # _ # _

_ _ _ _ _ _ _ _ _ # _ _ _ _ _ _ _ _ _

_ # _ # _ # _ # _ # _ # _ # _ # _ # _

_ _ _ _ # _ _ _ _ _ _ _ _ _ # _ _ _ _



C H A O L E N I E N T L Y C O I L

Y V L M D O E O K A

S H A R E A B L E O E D I P A L L Y

T I A R N U N G E A S

F L I P Y T T E C O H N

I A O L I V I E R O I

R E B U F F F M A S H M A N

R L A B Y T E S D A T

I N E R T I A L Y I S S U A N C E

T U R A S P E N O D N

A L T E R E R S E U P R O O T E D

T O S E A S E S B O I

E X T A N T B N S U N T A N

S T U T R E C H T A G

W E E P N A L R D E L L

P R U T M A O U A L M

R E I N S E R T S S O M E W H E R E

O N H U O E A N R A

S A G S B E R N A D I N E U S E D


Quiz Summary

Many of these solutions got pretty sluggish when run on large crosswords. The reason is simple: the search space for this problem is quite large. You have to try a significant number of words in each position before you will find a good fit for the board.

The good news is that this summer Google has paid to bring a very powerful search tool to Ruby. I've been lucky enough to have a ring-side seat for this process and now I want to show you a little about this new tool.

You've probably seen Andreas Launila solving a fair number of the recent quizzes using his Gecode/R library. Gecode/R is a wrapper over the C Gecode library for constraint programming. Constraint programming is a technique for describing problems in a way that a tool like Gecode can then use to search the solution space and find answers for you. Gecode is a very smart little searcher though and will heavily prune the search space based on your constraints. This leads to some mighty quick results, as you can see:

$ time ruby solve_crossword.rb /path/to/scowl-6/final/english-words.20
< /path/to/test_board.txt
Reading the dictionary...
Please enter the template (end with ^D)
Building the model...
Searching for a solution...
A B I D E

N C A

G H O S T

E N E

L A S E R

real 0m0.430s
user 0m0.360s
sys 0m0.068s

Let's have a look at Andreas's code to see how it sets things up for Gecode. Here's the start of the code:

ruby
require 'enumerator'
require 'rubygems'
require 'gecoder'

# The base we use when converting words to and from numbers.
BASE = ('a'..'z').to_a.size
# The offset of characters compared to digits in word-numbers.
OFFSET = 'a'[0]
# The range of integers that we allow converted words to be in. We are
# only using the unsigned half, we could use both halves, but it would
# complicate things without giving a larger allowed word length.
ALLOWED_INT_RANGE = 0..Gecode::Raw::Limits::Int::INT_MAX
# The maximum length of a word allowed.
MAX_WORD_LENGTH = (Math.log(ALLOWED_INT_RANGE.last) /
Math.log(BASE)).floor

# ...

You can see that Andreas loads Enumerator and Gecode/R here as well as setting up some constants. The constants relate to how this code will model words in the database. The plan here is to represent words as numbers made up of base 26 digits which represent letters of the alphabet. This will allow Andreas to use Gecode's integer variables to model the problem. The downside is that word size will be limited by the maximum size of an integer and thus this solution has a weakness in that it can't be used to solve the larger puzzles.

You can see this conversion from numbers to words in the Dictionary class:

ruby
# ...

# Describes an immutable dictionary which represents all contained words
# as numbers of base BASE where each digit is the corresponding letter
# itself converted to a number of base BASE.
class Dictionary
# Creates a dictionary from the contents of the specified dictionary
# file which is assumed to contain one word per line and be sorted.
def initialize(dictionary_location)
@word_arrays = []
File.open(dictionary_location) do |dict|
previous_word = nil
dict.each_line do |line|
word = line.chomp.downcase
# Only allow words that only contain the characters a-z and are
# short enough.
next if previous_word == word or word.size > MAX_WORD_LENGTH or
word =~ /[^a-z]/
(@word_arrays[word.length] ||= []) << self.class.s_to_i(word)
previous_word = word
end
end
end

# Gets an enumeration containing all numbers representing word of the
# specified length.
def words_of_size(n)
@word_arrays[n] || []
end

# Converts a string to a number of base BASE (inverse of #i_to_s ).
def self.s_to_i(string)
string.downcase.unpack('C*').map{ |x| x - OFFSET }.to_number(BASE)
end

# Converts a number of base BASE back to the corresponding string
# (inverse of #s_to_i ).
def self.i_to_s(int)
res = []
loop do
digit = int % BASE
res << digit
int /= BASE
break if int.zero?
end
res.reverse.map{ |x| x + OFFSET }.pack('C*')
end
end

# ...

We've already talked about the number representation which is the majority of the code here. Do have a look at initialize() and words_of_size() though, to see how words are being stored. An Array is created where indices represent word lengths and the values at those indices are nested Arrays of words with that length. This makes getting a list of words that could work in a given slot of the puzzle easy and fast.

The s_to_i() method above relies on a helper method added to Array, which is simply this:

ruby
class Array
# Computes a number of the specified base using the array's elements
# as digits.
def to_number(base = 10)
inject{ |result, variable| variable + result * base }
end
end

Again, this is just another piece of the conversion I explained earlier.

With a Dictionary created, it's time to model the problem in Gecode constraints:

ruby
# Models the solution to a partially completed crossword.
class Crossword < Gecode::Model
# The template should take the format described in RubyQuiz #132 . The
# words used are selected from the specified dictionary.
def initialize(template, dictionary)
@dictionary = dictionary

# Break down the template and create a corresponding square matrix.
# We let each square be represented by integer variable with domain
# -1...BASE where -1 signifies # and the rest signify letters.
squares = template.split(/\n\s*\n/).map!{ |line| line.split(' ') }
@letters = int_var_matrix(squares.size, squares.first.size,
-1...BASE)

# Do an initial pass, filling in the prefilled squares.
squares.each_with_index do |row, i|
row.each_with_index do |letter, j|
unless letter == '_'
# Prefilled letter.
@letters[i,j].must == self.class.s_to_i(letter)
end
end
end

# Add the constraint that sequences longer than one letter must form
# words. @words will accumulate all word variables created.
@words = []
# Left to right pass.
left_to_right_pass(squares, @letters)
# Top to bottom pass.
left_to_right_pass(squares.transpose, @letters.transpose)

branch_on wrap_enum(@words), :variable => :largest_degree,
:value => :min
end

# Displays the solved crossword in the same format as shown in the
# quiz examples.
def to_s
output = []
@letters.values.each_slice(@letters.column_size) do |row|
output << row.map{ |x| self.class.i_to_s(x) }.join(' ')
end
output.join("\n\n").upcase.gsub('#', ' ')
end

# ...

After storing the dictionary, this code breaks the crossword template down into an integer matrix created using the Gecode/R helper method int_var_matrix(). This will be our puzzle of words Gecode is expected to fill in.

The next two sections of the initialize() method build up the constraints. These are the rules that must be satisfied when we have found a correct solution.

The first of these chunks of code sets rules for any letters that were given to us in the template. This code tells Gecode that the number in that position of the matrix must equal the value of the provided letter. Take a good look at this RSpec like syntax because Andreas has spent a considerable effort on making it easy to express your constraints in a natural syntax and I hope you will agree with me that the end result is quite nice.

The other chunk of constraints are defined using a helper method we will examine in just a moment. The result of this code though is to ensure that the numbers selected represent letters that form actual words.

The final step in describing the problem to Gecode is to choose a branching strategy. This tells Gecode which variables it will need to make guesses about in order to find a solution as well as selecting a heuristic to use when guesses must be made. In this case, words will be selected based on how much of the overall puzzle they affect, hopefully ruling out large sections of the search space quickly.

The problem model we just examined is pretty much always how constraint programming goes. You just need to remember the three steps: create some variables for Gecode to fill in, define the rules for the values you want Gecode to find, and select a strategy for Gecode to use in solving the problem.

The to_s() method above just creates the output used in the quiz examples for display to the user.

Let's have a look at the helper methods used in the model definition now:

ruby
# ...

private

# Parses the template from left to right, line for line, constraining
# sequences of two or more subsequent squares to form a word in the
# dictionary.
def left_to_right_pass(template, variables)
template.each_with_index do |row, i|
letters = []
row.each_with_index do |letter, j|
if letter == '#'
must_form_word(letters) if letters.size > 1
letters = []
else
letters << variables[i,j]
end
end
must_form_word(letters) if letters.size > 1
end
end

# Converts a word from integer form to string form, including the #.
def self.i_to_s(int)
if int == -1
return '#'
else
Dictionary.i_to_s(int)
end
end

# Converts a word from string form to integer form, including the #.
def self.s_to_i(string)
if string == '#'
return -1
else
Dictionary.s_to_i(string)
end
end

# Constrains the specified variables to form a word contained in the
# dictionary.
def must_form_word(letter_vars)
raise 'The word is too long.' if letter_vars.size > MAX_WORD_LENGTH
# Create a variable for the word with the dictionary's words as
# domain and add the constraint.
word = int_var @dictionary.words_of_size(letter_vars.size)
letter_vars.to_number(BASE).must == word
@words << word
end
end

# ...

The i_to_s() and s_to_i() methods here are mostly just wrappers over the Dictionary counterparts we examined earlier. The real interest is the other two methods that together define the word constraints.

First, left_to_right_pass() is used to walk the puzzle looking for runs of two or more letters that will need to become words in the Dictionary. Each time it finds one, a hand-off is made to must_form_word(), which builds the actual constraint.

With the problem modeled, it takes just a touch more code to turn this into a full solution:

ruby
# ...

puts 'Reading the dictionary...'
dictionary = Dictionary.new(ARGV.shift || '/usr/share/dict/words')
puts 'Please enter the template (end with ^D)'
template = ''
loop do
line = $stdin.gets
break if line.nil?
template << line
end
puts 'Building the model...'
model = Crossword.new(template, dictionary)
puts 'Searching for a solution...'
puts((model.solve! || 'Failed').to_s)

You can see that this application code is just reading in the dictionary and template, then constructing a model and pulling a solution with the solve!() method. That's all it really takes to get answers after describing your problem to Gecode.

If you want to continue exploring Andreas's Gecode/R wrapper, drop by the Web site which has links to many useful resources:

Gecode/R

My thanks to all who put a good deal of effort into a hard search problem.

Tomorrow we will take another peek at our dictionary, this time to see what numbers are hiding in there...