Regexp.build() (#4)

There's been some discussion on Ruby Talk lately about Range.member? which tests if a given element (often a number) is a member of the set the Range object iterates over. Obviously, this kind of test is useful in many aspects of programming, but let's approach this problem from a different angle.

This week's quiz is to build a library that adds a class method called build() to Regexp. build() should accept a variable number of arguments which can include integers and ranges of integers. Have build() return a Regexp object that will match only integers in the set of passed arguments.

Here are some examples of possible usage:

ruby
lucky = Regexp.build( 3, 7 )
"7" =~ lucky # => true
"13" =~ lucky # => false
"3" =~ lucky # => true

month = Regexp.build( 1..12 )
"0" =~ month # => false
"1" =~ month # => true
"12" =~ month # => true
day = Regexp.build( 1..31 )
"6" =~ day # => true
"16" =~ day # => true
"Tues" =~ day # => false
year = Regexp.build( 98, 99, 2000..2005 )
"04" =~ year # => false
"2004" =~ year # => true
"99" =~ year # => true

num = Regexp.build( 0..1_000_000 )
"-1" =~ num # => false

Some issues you may want to consider while building you're library:

* How should leading zeros be handled?

Match the hour from a clock formatted in military time (0 to 23). Hours 0
through 9 may or may not have a single leading zero.

* Should anything be captured by the returned Regexp?

* How should anchoring work?

ruby
"2004" =~ Regexp.build( 4 ) # => ???

Quiz Summary

The first thing to consider in this quiz is what does a Regexp to match a number look like? Here's the most basic answer to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

Note: You might want to reverse the order of that, unless you can count on your anchoring to match the right thing.

Obviously, the above works and is dirt simple to implement. Here's a submitted solution by Tanaka Akira that does pretty much that:

ruby
def Regexp.build(*args)
args = args.map {|arg| Array(arg) }.flatten.uniq.sort
neg, pos = args.partition {|arg| arg < 0 }
/ \A (?: -0*#{Regexp.union(*neg.map {|arg| (-arg).to_s })} |
0*#{Regexp.union(*pos.map {|arg| arg.to_s })} ) \z /x

end

The first line of that method is pretty clever, calling Array() on all the passed arguments. That turns Range objects into the Array equivalent and wraps simple Integers in an Array of their own. Following that up with flatten() yields a single Array of all the elements we're trying to match.

The second line just separates the arguments into positive and negative groups.

Finally the third line builds a Regexp object from the created groups using the nifty Regexp.union() that I wasn't even aware of when I made this quiz.

(This solution handles negative numbers and allows for arbitrary leading zeros.)

Is this quiz really this easy to solve? Obviously it can be, for simple data sets. However, Tanaka's solution has limits. On my box, it only takes:

ruby
Regexp.build(1..10_000)

to get a "...regular expression too big..." error. Clearly, if your data set is big you'll need to dig a little deeper.

That get's us back to our original question, but now with a qualification: What is a short way to match a number with a Regexp? The most obvious optimization to apply to our patterns is the use of character classes. Going back to our 1..12 example that might give us something like:

\d|1[0-2]

That's getting a lot more reasonable. Going to a serious example, even 1..1_000_000 is only:

[1-9]|[1-9]\d|[1-9]\d\d|[1-9]\d\d\d|[1-9]\d\d\d\d|[1-9]\d\d\d\d\d|1000000

Technically, we could keep going and get to something like:

[1-9]\d{0,5}|1000000

But that's a little trickier to build algorithmically and none of the submitted solutions went quite that far.

The character class approach, on the other hand, was very popular.

The main trick to building character classes is to break down the passed Range objects. You could also lump in the individual Integer arguments, but these are pretty insignificant. Several solutions solved this by adding a method to the Range class to convert them into Regexps.

Adding a Regexp.build() over that is trivial. Here's a nice example from Mark Hubbart's second submission:

ruby
def Regexp.build(*args)
ranges, numbers = args.partition{|item| Range === item}
re = ranges.map{|r| r.to_re } + numbers.map{|n| /0*#{n}/ }
/^#{Regexp.union(*re)}$/
end


class Range
def to_re
# normalize the range format; we want end inclusive,
# integer ranges this part passes the load off to a
# newly built range if needed.
if exclude_end?
return( (first.to_i..last.to_i - 1).to_re )
elsif ! (first + last).kind_of?(Integer)
return( (first.to_i .. last.to_i).to_re )
end

# Deal with ranges that are wholly or partially negative.
# If range is only partially negative, split in half and
# get two regexen. join them together for the finish.
# If the range is wholly negative, make it positive, then
# add a negative sign to the regexp
if first < 0 and last < 0
# return a negatized version of the regexp
return /-#{(-last..-first).to_re}/
elsif first < 0
neg = (first..-1).to_re
pos = (0..last).to_re
return /(?:#{neg}|#{pos})/
end

### First, create an array of new ranges that are more
### suited to regex conversion.

# a and z will be the remainders of the endpoints
# of the range as we slice it
a, z = first, last

# build the first part of the list of new ranges.
list1 = []
num = first
until num > z
a = num # recycle the value
# get the first power of ten that leaves a remainder
v = 10
v *= 10 while num % v == 0 and num != 0
# compute the next value up
num += v - num % v
# store the value unless it's too high
list1 << (a..num-1) unless num > z
end

# build the second part of the list; counting down.
list2 = []
num = last + 1
until num < a
z = num - 1 # recycle the value
# slice to the nearest power of ten
v = 10
v *= 10 while num % v == 0 and num != 0
# compute the next value down
num -= num % v
# store the value if it fits
list2 << (num..z) unless num < a
end
# get the chewey center part, if needed
center = a < z ? [a..z] : []
# our new list
list = list1 + center + list2.reverse

### Next, convert each range to a regexp.
list.map! do |rng|
a, z = rng.first.to_s, rng.last.to_s
a.split(//).zip(z.split(//)).map do |(f,l)|
case
when f == l then f
when f.to_i + 1 == l.to_i then "[%s%s]" % [f,l]
when f+l == "09" then "\\d"
else
"[%s-%s]" % [f,l]
end
end.join # returns the regexp for *that* range
end

### Last, return the final regexp
/0*#{ list.join("|") }/
end
end

The first third of the to_re() method just deals with normalizing Ranges and is very well commented.

The middle third divides the Range into Regexp friendly chunks which are groups that share the same number of digits. For example, here is what to_re() builds into the local "list" variable for 1..1_000:

1..9
10..99
100..999
1000..1000

The final third of to_re() builds character classes from these grouped Ranges. The code inside "list.map! do ... end" is pretty clever and I recommend working through it until you can follow how it works.

A big part of using these solutions is a question of how long you'll have to wait for a Regexp object to be built and how quickly the result can find a match. Here are some benchmarks, first for build times:

user system total real
James Edward Gray II 63.490000 0.020000 63.510000 ( 63.512904)
James Edward Gray II (2) 20.270000 0.000000 20.270000 ( 20.307565)
Jamis Buck 0.490000 0.000000 0.490000 ( 0.488664)
Mark Hubbart 0.290000 0.010000 0.300000 ( 0.297225)
Mark Hubbart (2) 0.030000 0.000000 0.030000 ( 0.028206)
Tanaka Akira 0.440000 0.000000 0.440000 ( 0.442430)
Thomas Leitner 0.020000 0.000000 0.020000 ( 0.013544)
Warren Brown 0.020000 0.000000 0.020000 ( 0.015618)

Or focusing in on the faster solutions over a bigger test:

user system total real
Mark Hubbart (2) 3.390000 0.000000 3.390000 ( 3.415515)
Thomas Leitner 2.120000 0.000000 2.120000 ( 2.138968)
Warren Brown 2.290000 0.000000 2.290000 ( 2.280555)

As you can see, Thomas Leitner and Warren Brown's solutions are also worth a look, if you haven't checked them out already. Warren's even has a clever feature to tell you which parameter of build() caused a match.

And here are some matching benchmarks (build times excluded from results):

user system total real
James Edward Gray II 0.070000 0.000000 0.070000 ( 0.079497)
James Edward Gray II (2) 0.100000 0.000000 0.100000 ( 0.115083)
Jamis Buck 5.480000 0.000000 5.480000 ( 5.502583)
Mark Hubbart 4.590000 0.000000 4.590000 ( 4.617739)
Mark Hubbart (2) 0.100000 0.010000 0.110000 ( 0.124735)
Tanaka Akira 4.570000 0.010000 4.580000 ( 4.741717)
Thomas Leitner 0.100000 0.000000 0.100000 ( 0.121431)
Warren Brown 0.130000 0.000000 0.130000 ( 0.123779)

As usual, my thanks go out to all who participated as well as to those who just silently followed the discussion.

I've been unexpectedly called out of town this weekend, so we'll take a break from Ruby Quiz. Sink all the effort you intended to give next week's quiz into filling my inbox with quiz suggestions. ;)

I'll send out a new quiz on the 29th that's all fun and games...