by Gavin Kistner
You have been hired to work for a small city government. The city recently bought a vehicle counter, one of those kinds that uses pneumatic rubber hoses stretched across the road. The company that sells the machine also sells software to interpret the raw data. However, the city has asked you to see if you can interpret it instead, saving them some money.
The data from the machine looks like this:
A268981
A269123
A604957
B604960
A605128
B605132
A1089807
B1089810
A1089948
B1089951
The numbers are the number of milliseconds since midnight when the mark occurred. Thus, the first line above represents a pair of tires driving by at 12:04:28am. The second line represents another pair of tires going by 142ms later (almost certainly the 2nd axle of the car).
The vehicle counter has two tubes - one stretches across both lanes of traffic, and one goes just across traffic in one direction. Each hose independently records when tires drive over it. As such, cars going in one direction (say, northbound) only record on one sensor (preceded with an 'A'), while cars going in the direction (say, southbound) are recorded on both sensors. Lines 3-6 above represent a second car going in the other direction. The first set of tires hit the 'A' sensor at 12:10:04am, and then hit the 'B' sensor 3ms later. The second set of tires then hit the 'A' sensor 171ms later, and then the 'B' sensor 4ms later.
The machine was left to run for 5 days in a row (starting on a Monday). This is obvious because the times in the data make several sudden drops:
A86328771
B86328774
A86328899
B86328902
A582668
B582671
A582787
B582789
The city has asked you to see how many analysis features you can provide from what the manufacturer's software offers:
* Total vehicle counts in each direction: morning versus evening,
per hour, per half hour, per 20 minutes, and per 15 minutes.
* The above counts can be displayed for each day of the session,
or you can see averages across all the days.
* Peak volume times.
* The (rough) speed distribution of traffic.
* Rough distance between cars during various periods.
Luckily for you, you know that:
* The speed limit on the road where this is recorded is 40mph
(that doesn't mean that everyone drives this speed, or that
no one exceeds it, but it's a good starting point).
* The average wheelbase of cars in the city is around 100 inches
between axles.
* Only 2-axle vehicles were allowed on this road during the
recording sessions.
The full data can be found at:
Quiz Summary
This is a cool little data mining problem. Unfortunately, it is a fair bit of work to get a good solution. Justin Ethier sent in the only full solution and even he chose to cut a few corners. His code does produce almost 200 lines of interesting feedback when run on the sample data set though, so it's fun to play around with. The downside is that it's close to 300 lines of code. Because of that, I'm going to try giving a pretty high-level overview of how it works.
Let's work backwards this time and just follow the code:
if ARGV.size < 1
puts "Usage: vehicle_counter.rb datafile [-avg]"
else
vc = VehicleCounter.new
vc.process(ARGV[0], ARGV[1] != nil)
end
This is the code that actually gets executed when you run the program. I always call that the application code.
Here we see a simple branch that either presents a usage message or builds a VehicleCounter and calls process() on it. This could actually be just a module, since it doesn't maintain any state. It may help you to think of the following code that way:
class VehicleCounter
# ...
def process(file, report_averages)
raw_data = parse(file)
for sensor in 0..1
sensor_report = DirectionDataReport.new(raw_data[sensor])
sensor_report.report(sensor, report_averages)
end
end
# ...
end
# ...
This defines the overall process of this program which is just two steps: read in the data and build a report. The parse() method manages the reading step while reporting is encapsulated in DirectionDataReport objects. We will begin with the reading:
Dir_A = 0
Dir_B = 1
def parse(file)
times = []
dirs = [[], []]
f = File.open(file)
f.each_line do |line|
sensor, time = parse_record(line)
times << time
if (times.size % 2) == 0
if (times.size == 2 and sensor == Dir_A) or
(times.size == 4 and sensor == Dir_B)
# Remove "B" records from second direction
times = [times[0], times[2]] if sensor == Dir_B
dirs[sensor] << times
times = []
elsif (times.size == 4 and sensor != Dir_B)
puts "Parse error"
times = []
end
elsif (times.size % 2) == 1 and sensor == Dir_B
puts "Parse error - Unexpected B record"
end
end
f.close
dirs
end
def parse_record(data)
unpacked = data.unpack("a1a*")
return unpacked[0] == "A" ? Dir_A : Dir_B, unpacked[1].to_i
end
# ...
This parser is what I meant by cutting corners earlier. The data provided is quite simple and you don't have to worry about cars going both directions at the same time when working with it. Given that, a pair of A times is a northbound car and a set of A, B, A, B times is a southbound car. This parse just hunts for those sets.
Both sets of times are maintained in the Array of Arrays called dirs. The constants are used to select the right group based on the indicator found in unpack()ing the last record of the group. These two Arrays are the ultimate results of this method.
Note that the file reading loop could be reduced to a File.foreach() call.
The rest of the code is reporting. We saw a DirectionDataReport initialized in the process() method and then witnessed a call to report() on that object. These two steps are the primary interface:
MSecPerHour = MSecPerMin * 60
InchesPerMile = 63360
class DirectionDataReport
Verbose = false
Sensors = %w(Northbound Southbound)
Days = %w(Mon Tue Wed Thu Fri Sat Sun)
def initialize(raw_data)
@raw_data = raw_data
end
def report(sensor, report_averages)
puts "Direction: #{Sensors[sensor]}"
puts "Total Cars: #{self.total_count}"
report_time_periods(sensor, report_averages, MSecPerHour * 12)
report_time_periods(sensor, report_averages, MSecPerHour)
report_time_periods(sensor, report_averages, MSecPerHour / 2)
report_time_periods(sensor, report_averages, MSecPerHour / 4)
puts ""
end
def total_count()
@raw_data.size
end
# ...
You can see here that the constructor just squirrels away the data from the parse. The report() method, on the other, hand drives the output process. After reporting the direction and a total car count, it displays time oriented breakdowns for various scales: 12 hours, one hour, 30 minutes, and 15 minutes.
Here is the report_time_periods() method and three of the methods it relies on:
def report_time_periods(sensor, report_averages, time_period_length)
days = create_time_periods(time_period_length)
num_time_periods = (MSecPerHour * 24) / time_period_length
counts = count_per_time_period(days)
avg_speeds = speed_per_time_period(days)
avg_dists = dist_per_time_period(days)
puts("\nTime Interval: #{time_period_length/MSecPerMin} Minutes")
if (num_time_periods > 2)
peaks = find_peak_times(days)
puts("Peak Times")
for i in 0...peaks.size
printf("#{Days[i]}:")
peaks[i].size.times {|p|
printf(format_time_interval_index(peaks[i][p][1],
time_period_length))
}
puts ""
end
end
puts("Statistics")
printf("Data ")
printf("\tDay") if not report_averages
num_time_periods.times{|i|
printf(format_time_interval_index(i, time_period_length))
}
report_column_data(days, num_time_periods, report_averages, counts,
report_averages ? "Avg Count" : "Count ", "% 5d")
report_column_data(days, num_time_periods, report_averages, avg_speeds,
"Avg Speed", "%02.02f")
report_column_data(days, num_time_periods, report_averages, avg_dists,
"Avg Dist ", "%02.02f")
puts ""
end
def create_time_periods(time_period_length = MSecPerHour)
days = []
time_periods = nil
prev_start = @raw_data[0][0] + 1
for data in @raw_data
if prev_start > data[0]
days << time_periods if time_periods != nil
cur_time_period = 0
time_periods = [[]]
puts "New day: data=#{data[0]}" if Verbose
elsif data[0] > ((cur_time_period + 1) * time_period_length)
while data[0] > ((cur_time_period + 1) * time_period_length)
cur_time_period += 1
time_periods[cur_time_period] = []
end
puts "New time period: data=#{data[0]}" if Verbose
end
time_periods[cur_time_period] << data
prev_start = data[0]
end
days << time_periods
days
end
def format_time_interval_index(index, time_period_length)
sprintf("\t%02d:%02d",
index * time_period_length / MSecPerHour,
(index * time_period_length / MSecPerMin) % 60)
end
def report_column_data(days, num_time_periods, report_averages, data,
data_label, format_string)
if report_averages
printf("\n#{data_label}")
for time in 0...num_time_periods
avg = 0
days.size.times {|day| avg += data[day][time] }
printf("\t#{format_string}", avg / days.size)
end
else
for day in 0...days.size
printf("\n#{data_label}\t#{Days[day]}")
for time in 0...num_time_periods
printf("\t#{format_string}", data[day][time])
end
end
end
end
# ...
There is a lot of code here obviously, but none of it is very complex. First, report_time_periods() breaks the data down into time intervals using create_time_periods(). That method is just a partition tool for the data. Once it's divided out, a few utility method we will examine next are used to separate the interesting information out.
After that, the entire rest of the of the method is printing code for what it found. (The peak data section actually calculates, again using another helper, and prints for appropriate data ranges.) Most of the print work is delegated to the bottom two methods and much or their magic is for printing columnar data.
Here are the helpers I glossed over and the helpers they rely on:
def find_peak_times(days, num_peaks=4)
days.map do |day|
find_daily_peak_times(day, num_peaks)
end
end
def find_daily_peak_times(daily_time_periods, num_peaks)
peaks = []
daily_time_periods.size.times {|i|
peaks << [daily_time_periods[i].size, i]
}
peaks.sort.reverse.slice(0, num_peaks).sort {|a,b| a[1]<=>b[1]}
end
def count_per_time_period(days)
days.map do |day|
day.map {|time_period| time_period.size}
end
end
def speed_per_time_period(days)
days.map do |day|
day.map {|time_period| calc_average_speed(time_period) }
end
end
def dist_per_time_period(days)
days.map do |day|
day.map {|time_period| calc_average_distance(time_period) }
end
end
def calc_average_speed(time_period)
return 0 if time_period.size == 0
sum = 0
for time in time_period
sum += calc_speed(time[0], time[1])
end
sum / (time_period.size)
end
def calc_speed(start_time, end_time)
return (100.0 / (end_time - start_time)) *
MSecPerHour / (InchesPerMile * 1.0)
end
def calc_average_distance(time_period)
return 0 if time_period.size <= 1 # Need at least 2 cars
sum = 0
for i in 0...(time_period.size - 1)
sum += calc_distance(time_period[0], time_period[1])
end
return sum / (time_period.size - 1)
end
def calc_distance(leader_time, follower_time)
follower_speed = calc_speed(follower_time[0], follower_time[1])
dist = follower_speed * ((follower_time[0] - leader_time[0]) /
(MSecPerHour * 1.0))
return dist
end
end
# ...
There are three kinds of helpers here. The peak methods just sort the data and take so many entries off the top. The *_per_time_period() method are wrappers that apply the calc_*() methods over periods of data. Which means the calc_*() methods are where the action is. They figure out car speeds, distances between cars, and the averages for both using simple math. This is the primary data that we see in the report output.
Again it's a lot of code to cover, but Justin has produced a great end result. I'm not going to include the output here because it too is large, but go run the program and take a look at what it builds.
My thanks to Justin for the heroic effort and Gavin for the interesting problem.
The Ruby Quiz began with an encoding problem that came out of a book and tomorrow we have another one for you...