Years ago, on a job developing custom reporting software, this was one of the side tasks. Parsing a report may sound boring, but I urge you to at least download this report and peek inside. It's a tragic example of database output gone wrong.
(I've doctored the report heavily to protect my client, but the spirit of the data remains the same. I went out of my way to keep the report's little quirks while fudging all the data.)
My job was literally described as, "I need to take this straight into Excel, no clean up or Wizards required." Some of you may want to stop reading there and make your own interpretation of that. Feel free.
I actually felt uncomfortable with that description and asked to look over an employee's shoulder as they used the report. I learned a lot from that. Here are some hints for those who like a little more direction:
* CSV files go "straight into Excel," for all practical purposes.
* Page headers are not needed and actually get in the way.
* The "Period" of the report is useful information, though, as are the column headers. This is a Qty/Pounds report. They also run Qty/Dollars reports.
* Dashed lines are not helpful in Excel.
* The report footer is not needed.
* Excel is a much better tool when actually working with numbers.
* Everything should fit comfortably into cells. A single piece of data should not be broken up between two or more cells.
* However, the employees are very used to this report format, and something familiar to them would be preferred.
* Be warned, this is one of their small reports. Run in February, it covers only two months of sales. December reports are many times larger.
I did a few other things to this report, but that should be plenty for the quiz.
This report is used daily by over 20 employees. Think about how much clean up time that adds up to. They've done it for many years!
Quiz Summary
Probably the first question to ask when looking at this quiz is: what makes this report so unfriendly, not just to Excel, but to work with in general?
1. There's a lot of junk we don't need. Headers, footers, dashed lines. All that needs to go, so we can focus on the data.
2. The data itself isn't pure. Have a look at these two fragments:
50,589 358 %%%% 56,652 430 %%%%
924,553 704,028 31 1,612K 1,226K
Those aren't numbers in there. Yuck.
3. This report is ALMOST a fixed width file, which would help a lot, but it fails in a few areas:
TEE_X_101 26-Mar-2002 15:26 1: GE
---------------
Current LastYr
Part Code Description Period Period
--------------- ------------------------ ------- -------
Salesperson 00 NOBODY
Customer 1036 COMPANY 501
SA Sort Code 1.43 WATER DOLLS
78-143FS 17/8# SS MODEL 10 0
------- -------
SA Sort Code subtotals 10 0
SA Sort Code 3.3 REMOTE CONTROL CARS
74270 Model 35357-DBL 0 0
921137-73 LARGE 19 X 18 X 14 30 0
------- -------
SA Sort Code subtotals 30 0
------- -------
Customer subtotals 40 0
Customer 14457 COMPANY 518
SA Sort Code 11.5 KITCHEN SETS
943437 19/8# SS MODEL 0 0
------- -------
SA Sort Code subtotals 0 0
------- -------
Customer subtotals 0 0
Customer 1824 COMPANY 529
SA Sort Code 19.4 SLIDES
8394 2.5 OZ 0 20
8341 .21 SIZE PLASTIC, NO BATT 0 10
18363 .29 SIZE PLASTIC, NO BATT 0 24
------- -------
SA Sort Code subtotals 0 54
------- -------
Customer subtotals 0 54
Use the dashed lines as a guide and you can see how the sub-headers (Salesperson, Customer, and SA Sort Code) don't fit into columns. Subtotal lines have the same problem. Also notice that the Description content is allowed to overflow its column by one character. That's a mess.
4. This one is less obvious, but equally real. The data isn't very useful in this format. Christian Neukirchen did a nice job of recognizing this and taking steps to correct it.
Now that we've identified the issues, let's work through them. When I begin working with a foreign report like this, I do just that: play around a bit. My first script may not end up being the one I keep, but I need to get familiar with the data. I'll show you what works for me, step by step.
Whenever I think, "I need to translate this into..." I mentally switch that to:
while line = ARGF.gets
  print line
end
It may not be much, but it's a start. Now, let's tackle problem number 1 and jettison the junk. The majority of it is those silly page headers. Those should be easy enough to drop. We stop printing when we enter a header and start up again when we leave.
I think in patterns, so I would then start looking for a way to spot that I'm entering a header. I'm leery of anchoring against arbitrary text if I don't have to, so I would like to find something better. If you're familiar with these page-print style reports, you might have a good guess of what comes before that header. Even if you're not, it pays to look at a foreign document with the "invisible characters" turned on, if your editor can do that. Here's what a header really looks like:
SA Sort Code subtotals 20 0
\f <-- Look what's hiding here! A form-feed character.
TEE_X_101 26-Mar-2002 15:26 1: GE
---------------
Current LastYr
Part Code Description Period Period
--------------- ------------------------ ------- -------
SA Sort Code 5.207 BUILDING BLOCKS
That makes spotting the headers easy. The very first one doesn't have the form-feed, but that's not hard to work around.
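(If your editor can't show invisible characters, Ruby itself makes a passable viewer. This is just a throwaway check, with report.txt standing in for whatever the report file is actually called:

ruby -ne 'p $_' report.txt

Every line prints in inspected form, so characters like the form-feed show up as a literal \f instead of hiding.)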
Once we're in a header, the last line is just a solid line of dashes and spaces. Just be careful to distinguish it from the other dashed line in the header and you can build a solution:
header = true                             # the very first header has no form-feed
while line = ARGF.gets
  if header                               # we're inside the header
    header = false if line =~ /^-[- ]+-$/ # watch for the end
  else                                    # we're not in the header
    if line =~ /\f/                       # watch for header beginning
      header = true
      next
    end
    print line
  end
end
If you run that on the report, you'll see that we did indeed remove all the page headers. I showed (and even submitted) that version because it's easy to break down and understand, but if all that code bothers you, we can shorten it up:
while line = ARGF.gets
  # the next line skips all headers
  next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
  print line
end
That does exactly the same thing, making use of Ruby's versatile Ranges. The range acts as a toggle here. When the first condition becomes true, the Range will evaluate to true (skipping lines with next()) until the second condition becomes true. Then the process begins again. My first condition searches for line one of the report or a form-feed character. The second condition looks for the solid line of dashes. Again, the code works the same, so you can use whatever you are comfortable with.
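If that flip-flop behavior of a Range in a condition is new to you, here's a minimal sketch on toy data, completely unrelated to the report, showing the toggle in action:

while line = DATA.gets
  next if line =~ /BEGIN/ .. line =~ /END/  # skip from BEGIN through END
  puts line
end

__END__
keep me
BEGIN
skip me
END
keep me too

Run that and only the two keep me lines survive. Everything from the BEGIN line through the END line is skipped, then the toggle resets and waits for the next BEGIN.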
We're not done dropping junk yet! Let's toss out dashed lines and the report footer. Both are trivial:
while line = ARGF.gets
  next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
  next if line =~ /--$/              # skip dashed lines
  print line
  break if line =~ /^Report Totals/  # stop with "Report Totals"
end
What's the report look like now?
Salesperson 00 NOBODY
Customer 1036 COMPANY 501
SA Sort Code 1.43 WATER DOLLS
78-143FS 17/8# SS MODEL 10 0
SA Sort Code subtotals 10 0
SA Sort Code 3.3 REMOTE CONTROL CARS
74270 Model 35357-DBL 0 0
921137-73 LARGE 19 X 18 X 14 30 0
SA Sort Code subtotals 30 0
Customer subtotals 40 0
Customer 14457 COMPANY 518
SA Sort Code 11.5 KITCHEN SETS
943437 19/8# SS MODEL 0 0
SA Sort Code subtotals 0 0
Customer subtotals 0 0
We're making progress. It's almost readable.
There are four types of lines left in this report that we need to deal with. Yes, I'm sure it's four and not three. Sub-headers, product sales, subtotals and blank lines. At this point, I would try to build selectors to handle each of them:
while line = ARGF.gets
  next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
  next if line =~ /--$/
  # handle blank lines...
  if line !~ /\S/
    print "  BLANK: ", line
  # handle subtotals...
  elsif line =~ /^.+?totals(?:\s+(?:-?[\d,]+K?|%+)){12}\s*$/i
    print "  TOTAL: ", line
  # handle product sales...
  elsif line =~ /^\S/
    print "PRODUCT: ", line
  # handle sub-headers...
  else
    print " HEADER: ", line
  end
  break if line =~ /^Report Totals/
end
Uh oh, scary Regexp in there. Let's break it down:
^           # the beginning of the line
.+?         # skip some of the beginning, looking for...
totals      # the word totals (matching subtotals or Report Totals with /i)
(?:         # start outer group
  \s+       # some whitespace
  (?:       # start inner group
    -?      # an optional minus sign
    [\d,]+  # a digit sequence, possibly with commas
    K?      # an optional trailing K
    |       # or...
    %+      # a run of % signs
  )         # end inner group
){12}       # end outer group -- find exactly 12 of those
\s*         # optional trailing whitespace
$           # the end of the line
In short, it's just looking for the word "totals" followed by 12 things that could be numbers in this report. You really don't even need an expression that complex in this case, but since the sub-headers can hold arbitrary text (company names), I want to make sure I'm matching exactly what I'm looking for. Ruby will even let you drop that commented Regexp straight into your code with the /x modifier, if you like.
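For the curious, here's roughly how that could look (just a sketch, not the code I submitted, and the constant name is made up):

SUBTOTAL_PATTERN = /
  ^ .+? totals          # the word totals, somewhere after the start
  (?:                   # followed by...
    \s+
    (?: -? [\d,]+ K?    # a number, maybe negative, with commas or a trailing K
      | %+              # or a run of % signs
    )
  ){12}                 # exactly 12 of those
  \s* $                 # then nothing but trailing whitespace
/xi

With that in place, the subtotal test reads line =~ SUBTOTAL_PATTERN and the comments travel with the pattern.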
Notice that I'm being careful to use generic patterns. For example, I never match Salesperson, Customer, or SA Sort Code. What if the company happens to have another report that includes Brokers? (Trick question. I know it to be true in this case!) These patterns will treat that like any other sub-header and just work as expected.
So does that successfully locate everything?
HEADER: Salesperson 00 NOBODY
HEADER: Customer 1036 COMPANY 501
HEADER: SA Sort Code 1.43 WATER DOLLS
PRODUCT: 78-143FS 17/8# SS MODEL 10 0
TOTAL: SA Sort Code subtotals 10 0
BLANK:
HEADER: SA Sort Code 3.3 REMOTE CONTROL CARS
PRODUCT: 74270 Model 35357-DBL 0 0
PRODUCT: 921137-73 LARGE 19 X 18 X 14 30 0
TOTAL: SA Sort Code subtotals 30 0
TOTAL: Customer subtotals 40 0
BLANK:
HEADER: Customer 14457 COMPANY 518
HEADER: SA Sort Code 11.5 KITCHEN SETS
PRODUCT: 943437 19/8# SS MODEL 0 0
TOTAL: SA Sort Code subtotals 0 0
TOTAL: Customer subtotals 0 0
Sure does. Now it's easy to just transform that data into CSV, which will fix our not-quite-fixed-width-columns problem:
require "csv"

while line = ARGF.gets
  next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
  next if line =~ /--$/
  if line !~ /\S/
    puts CSV.generate_line([""])
  elsif line =~ /^(.+?totals)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/i
    puts CSV.generate_line(["", $1.lstrip, *$2.split(" ")])
  elsif line =~ /^(\S+)\s+(.+?)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/
    puts CSV.generate_line([$1, $2, *$3.split(" ")])
  else
    puts CSV.generate_line(["", line.strip])
  end
  break if line =~ /^Report Totals/
end
I snuck another nasty Regexp in there, didn't I? Actually, if you look closely, you'll see that it's nearly identical to the other one, save that it also captures the product descriptions. It's just there to help me locate all the parts of the line. After capturing the parts, I use split() to break them up, strip() to clean them, and let Ruby's CSV library handle the rest.
If you open the CSV output from this in Excel, you'll see that I've just shifted sub-headers and subtotals into the second column. That pretty much maintains the familiar report format, while eliminating the column breakup issues.
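One small bonus of leaning on the CSV library instead of a simple join(",") is that it takes care of quoting for us. Descriptions in this report are arbitrary text and can contain commas, as this quick sketch shows:

require "csv"

# a description with an embedded comma, straight from the sample above
puts CSV.generate_line(["8341", ".21 SIZE PLASTIC, NO BATT", 0, 10])
# prints: 8341,".21 SIZE PLASTIC, NO BATT",0,10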
That's really a lot of the cleanup work done already. The nagging problem is that Excel doesn't consider 1,612K or %%%% numbers. We can fix the first one easily enough, but what the heck is causing the %%%%s?
More detective work is needed.
Those %%%% entries only happen in the small "Pct Var" column, which is four characters wide. "Var" huh? Variance maybe? If you look around a bit, the formula is easy to calculate from a row like this:
Current LastYr Pct
Part Code Description Period Period Var
--------------- ------------------------ ------- ------- ----
613433 .22 SIZE PLASTIC, NO BATT 65 1 6400
That looks like (65 - 1) * 100. That doesn't make much sense, though, because we're probably talking about a percentage change. More likely it's (65 - 1) / 1 * 100, which happens to give the same 6400 here only because last year's value is 1. You can test the percentage version on other rows to prove that it is indeed correct.
Okay, so why are we getting %%%%s?
Current LastYr Pct
YTD YTD Var
------- ------- ----
125 1 %%%%
(125 - 1) / 1 * 100 = 12400. That's five characters and we're only allowed four. It's a column overflow problem. Our CSV approach has no such limitation, so we can recalculate those numbers and add them to the output.
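As a quick sanity check (this is just scratch work, not part of the solution), here's the formula in Ruby against the two rows we've seen:

pct_var = lambda { |current, last_yr| ((current - last_yr) / last_yr.to_f * 100).to_i }

pct_var.call( 65, 1)  # => 6400   -- fits in the four character column
pct_var.call(125, 1)  # => 12400  -- five characters, hence the %%%%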
Here's a routine to clean up the numbers:
def clean(numbers)
  # turn them back into numbers...
  numbers.map! do |n|
    n.gsub!(",", "")      # drop commas
    if n.sub!(/K$/, "")   # when there's a trailing K...
      n.to_i * 1000       # multiply by 1,000
    elsif n !~ /%/
      n.to_i              # numify
    else
      n                   # leave the %%%% entries for the next step
    end
  end

  # recalculate %%%% columns...
  numbers.each_with_index do |n, i|
    if n.to_s =~ /%/
      numbers[i] = ( (numbers[i - 2] - numbers[i - 1]) /
                     numbers[i - 1].to_f * 100 ).to_i
    end
  end

  numbers  # return results
end
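Before wiring it in, here's a quick check with a few of the troublesome values from the report (the comments show what I expect back, not captured output):

clean(%w[1,612K 924,553])  # => [1612000, 924553]
clean(%w[125 1 %%%%])      # => [125, 1, 12400]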
To use that, we just call it twice:
while line = ARGF.gets
  next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
  next if line =~ /--$/
  if line !~ /\S/
    puts CSV.generate_line([""])
  elsif line =~ /^(.+?totals)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/i
    # first call...
    puts CSV.generate_line(["", $1.lstrip, *clean($2.split(" "))])
  elsif line =~ /^(\S+)\s+(.+?)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/
    # second call...
    puts CSV.generate_line([$1, $2, *clean($3.split(" "))])
  else
    puts CSV.generate_line(["", line.strip])
  end
  break if line =~ /^Report Totals/
end
The rest of my solution (not shown) was boring header parsing, to print out the column labels and the report "Period". You can follow the link in the sidebar if you want to examine that.
At this point, we've solved problems 1, 2 and 3 from the beginning of this summary. I want to talk a little about problem 4, then I promise to end this ridiculously long message.
Christian Neukirchen's short solution (not shown) does away with sub-header and subtotal lines. That filter tacks on the Salesperson, Customer and SA Sort Code to each product entry. This makes all the needed information available on every line. This is a big win for using Excel's filters, making the report a lot easier to examine and change.
When I did this project for work, I did the same thing. However, I put the metadata at the back of the line and left in the sub-headers, subtotals, and blanks. I also added a column to identify each type of line: header, total, data, or blank. This gave me all the same benefits as Christian's code, while keeping the familiar format. I could filter down to the "data" lines to get nearly identical results. This feature turned out to be a big hit, drastically increasing company productivity.
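If you want to play with that idea, here's a rough sketch of the type column half of it, built on the code we already have. (This is a reconstruction for illustration, not my original work code.) Carrying the sub-header values onto each data row works the same way: stash them in variables as the header lines go by and tack them onto the end of each data row.

require "csv"

while line = ARGF.gets
  next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
  next if line =~ /--$/
  if line !~ /\S/
    puts CSV.generate_line(["Blank"])
  elsif line =~ /^(.+?totals)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/i
    puts CSV.generate_line(["Total", "", $1.lstrip, *clean($2.split(" "))])
  elsif line =~ /^(\S+)\s+(.+?)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/
    puts CSV.generate_line(["Data", $1, $2, *clean($3.split(" "))])
  else
    puts CSV.generate_line(["Header", "", line.strip])
  end
  break if line =~ /^Report Totals/
end

Open that in Excel, turn on AutoFilter for the first column, and filtering down to the Data rows gets you something very close to Christian's flattened output, while the full, familiar report is still one click away.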
Sorry for sending in a quiz that "looked too much like real work". I promise, we're back to fun and games tomorrow...