NDiff (#46)

by Bil Kleb

This week's quiz is to write a version of the diff command that compares files numerically. In Unix-speak,

$ ndiff --help
Usage: ndiff [options] file1 file2

Numerically compare files line by line, numerical field by numerical field.

-d INT --digits INT Maximum number of significant digits that
should match. (default: 0)
-h --help Output this help.
-q --quiet No output, just exit code.
-s --statistics Provide comparison statistics only. (extra credit)
-t DBL --tolerance DBL Tolerate <= DBL distance between numbers.
(default: 0.0)

For example, given fileA,

1.000001
2.00
-3
Cy=0.11278889E-01 Cx=-1.343e+02

And fileB,

1.000000
1.99
-3.4
Cy=0.11278890E-01 Cx=-1.343e+02

the following scenarios could play out:

$ ndiff fileA fileB
1,4c1,4
< 1.000001
< 2.00
< -3
< Cy=0.11278889E-01 Cx=-1.343e+02
---
> 1.000000
> 1.99
> -3.4
> Cy=0.11278890E-01 Cx=-1.343e+02

$ ndiff -t 0.000001 fileA fileB
2,3c2,3
< 2.00
< -3
---
> 1.99
> -3.4

$ ndiff --tolerance 0.01 fileA fileB
3c3
< -3
---
> -3.4

$ ndiff --digits 1 fileA fileB (zero exit code)

$ ndiff -q fileA fileB (non-zero exit code)

and, for extra credit,

$ ndiff --statistics fileA fileB
Numbers compared: 5
Distance range: 0.0..0.4
Average distance: 0.99987e-01 [guess]
Mean distance: 1.0e-06 [guess]

FWIW, the results of this quiz will be used by NASA's FUN3D team for regression testing aerothermodynamic simulation software.


Quiz Summary

by James Edward Gray II

Both solutions had some really interesting material in them. I would certainly talk about both, if time and space allowed.

Paul Vaillant had some interesting parsing code (see DataFile.parse_line()) and comparison code (see DataFile::CompareResults.compare()). Paul's entire solution was also very clean and readable, though it did come out a bit longer than Daniel Sheppard's code. Finally, Paul's code is over three times faster, which could be a real asset for significant data sets.

Daniel's code is the one I want to look at below, because of the way it handles an interesting Ruby problem. What makes this challenge fun, I think, is that it's one of the edge cases where Ruby's internal iterator can feel awkward. You're always walking through two separate collections in this problem. Ruby has some nice solutions for this, buried in the dark corners of its standard library, but many of us just aren't aware of them. One of these tools, SyncEnumerator from the generator library, is heavily used in the following code. Let's see how something like that can make a real difference in the end result:

ruby
require 'generator'
require 'pathname'
require 'optparse'
require 'bigdecimal'

class NumberLine
include Enumerable
def initialize(line)
@line = line
end
def each
@line.scan( / ((-)?([0-9]+|\.)
\.?([0-9]+)?[Ee]?
([-+]?[0-9]+)?) /x
) do |match|
yield BigDecimal.new(match[0])
end
end
def to_s
@line
end
end

# ...

This is a simple encapsulation of a single line of one or more numbers. The only method of interest here is each(), which scan()s the line looking for all the numbers it can find. That lengthy Regexp just matches an optional minus sign, followed by some digits (or a period), optionally followed by a period and some more digits, optionally following by an E or e and another number. This isn't a fool-proof match since it will catch a String like "-." or even "..", but it's probably good enough for most cases. Each number matched is fed to BigDecimal and then yielded to a given block. The use of the library here slows things down a bit, but it avoids floating point inaccuracies too, which is more important to this challenge.

ruby
# ...

class NumericalDifferencer
attr_accessor :digits, :tolerance
def initialize
@digits = 0
@tolerance = 0.0
end
def each_diff(file1, file2)
line = 0
diff = nil
SyncEnumerator.new(file1, file2).each do |line1, line2|
line1, line2 = NumberLine.new(line1), NumberLine.new(line2)
line += 1
num_enum = SyncEnumerator.new(line1, line2)
if num_enum.all? do |a,b|
unless @digits == 0
a,b = a.round(@digits-1), b.round(@digits-1)
end
(a - b).abs <= @tolerance
end
if diff
yield diff
diff = nil
end
else
diff = NumericalDiff.new(line) unless diff
diff.add_lines(line1, line2)
end
end
yield diff if diff
end
end

# ...

Here we get into the actual file comparison code and some SyncEnumerator fun. The first SyncEnumerator created here walks through the two files given in step. Each iteration of the block receives the next line from both files. Those lines are turned into NumberLine objects, which we have already seen. Then a SyncEnumerator is made from those objects, which allows the code to traverse each number found in both lines, again in step. That object is used in the if condition to ensure that all?() numbers in the line are within the needed tolerance (rounding if requested). When any numbers of the lines don't match close enough a NumericalDiff object is created and the current lines, plus any following lines that also differ, are added to it. When the NumericalDiff is complete, because the code has returned to matching lines or completed running through the files, the diff object is yielded to the provided block.

Let's take a look at that NumercialDiff class:

ruby
# ...

class NumericalDiff
attr_reader :start_line, :left, :right
def initialize(start_line)
@start_line = start_line
@left = []
@right = []
end
def add_lines(line1, line2)
@left << line1
@right << line2
end
def to_s
lines = "#{@start_line},#{@start_line + @left.length - 1}"
str = "#{lines}c#{lines}\n"
str << @left.collect { |x| "< #{x}" }.join
str << "---\n"
str << @right.collect { |x| "> #{x}" }.join
end
end

# ...

No black magic in here. The work method is to_s() and it should be easy enough to see that it's just creating the quiz format there.

That's really the majority of the solution. The rest is mainly option parsing:

ruby
# ...

differ = NumericalDifferencer.new
quiet = false
statistics = false

opts = OptionParser.new do |opts|
opts.banner = "Usage: #{$0} [options] file1 file2"

opts.separator ""
opts.separator "Numerically compare files line by line, " +
"numerical field by numerical field."
opts.separator ""
opts.on("-d", "--digits INT", Integer,
"Maximum number of significant digits that should match.",
"(default: 0)") do |digits|
differ.digits = digits
end
opts.on("-t", "--tolerance DBL", String,
"Tolerate <= DBL distance between numbers.",
"(default: 0.0)") do |tolerance|
differ.tolerance = BigDecimal.new(tolerance)
end
opts.on("-h", "--help", "Output this help.") do |help|
puts opts
exit 0
end
opts.on("-q", "--quiet", "No output, just exit code.") do |value|
quiet = value
end
opts.on("-s", "--statistics",
"Provide comparison statistics only.") do |value|
statistics = value
end
end

# ...

The top few lines there create a variables to hold the diff tools and setting information. Then the last chunk of code is OptionParser 101.

Finally, here's the code that kicks it all off:

ruby
# ...

begin
opts.parse!(ARGV)
if quiet && statistics
raise "--quiet and --statistics are mutually exclusive"
end
raise "Must pass two filenames" unless ARGV.length == 2
files = ARGV.collect { |x| Pathname.new(x) }
files.each do |f|
unless f.exist? && f.file?
raise "'#{f}' does not exist"
end
end
File.open(files[0]) do |file1|
File.open(files[1]) do |file2|
if(statistics)
distances = []
SyncEnumerator.new(file1, file2).each do |line1, line2|
line1, line2 = NumberLine.new(line1),
NumberLine.new(line2)
SyncEnumerator.new(line1, line2).each do |num1, num2|
distances << (num1 - num2).abs
end
end
class << distances
def median
sorted = self.sort
mid = sorted.size / 2
if sorted.size % 2 == 0
(sorted[mid] + sorted[mid - 1]) / 2
else
sorted[mid]
end
end
def mean
self.inject(0) { |sum, x| sum + x / self.size}
end
end
puts(<<-EOF)
Numbers compared: #{distances.size}
Distance range: #{distances.min}..#{distances.max}
Median distance: #{distances.median}
Mean distance: #{distances.mean}
EOF

elsif(quiet)
differ.each_diff(file1, file2) do |diff|
exit 1
end
exit 0
else
different = false
differ.each_diff(file1, file2) do |diff|
different = true
puts diff
end
exit(different ? 1 : 0)
end
end
end
rescue => e
warn e
warn opts
exit 1
end

I know that looks like a lot of code, but the majority of it is the extra credit part of the challenge. If you can spot that big if statement in the middle, you'll see that the majority of the code is in the first branch. That part calculates statistics, by implementing its own inline custom traversal of the files (again with SyncEnumerator). The really interesting feature in this code is that the results are just compiled into an Array and then that Array is modified with extra methods to make outputting the results easy. If you're not familiar with Ruby's class << some_object ... end idiom, it simply adds the methods to just that one object. A handy trick put to good use here, I think.

The other two very short branches (elsif and else), actually solve the quiz. They just react according to the each_diff() method we examined earlier.

It's probably worth mentioning again here, the generator library is pure Ruby and makes use of continuations, which can be quite slow. BigDecimal is never going to be as fast as using Floats either, for obvious reasons. The result is that this code is quite a bit slower than Paul's solution. (Over three times slower, in my tests.) That doesn't affect small data sets, like the ones in the quiz, very much, but you may start waiting a bit on longer inputs.

My thanks to both quiz solvers, for showing off some super cool Ruby techniques.

Tomorrows quiz is a simple challenge for all you web application junkies, that may just prove interesting to the community at large. Join in the fun and show us just how cool your favorite web application framework can be...