Quoted Printable (#23)

The quoted printable encoding is used primarily in email, though it has recently seen some use in XML areas as well. The encoding is simple to translate to and from.

This week's quiz is to build a filter that handles quoted printable translation.

Your script should be a standard Unix filter, reading from files listed on the command-line or STDIN and writing to STDOUT. In normal operation, the script should encode all text read in the quoted printable format. However, your script should also support a -d command-line option and when present, text should be decoded from quoted printable instead. Finally, your script should understand a -x command-line option and when given, it should encode <, > and & for use with XML.

Here are the rules we will use, from the quoted printable format:

1. Bytes with ASCII values from 33 (exclamation point) through 60 (less
than) and values from 62 (greater than) through 126 (tilde) should be
passed through the encoding process unchanged. Note that the -x switch
modifies this rule slightly, as stated above.

2. Other bytes are to be encoded as an equals sign (=) followed by two
hexadecimal digits. For example, when -x is active less than (<) will
become =3C. Use only capital letters for hex digits.

3. The exceptions are spaces and tabs. They should remain unencoded as
long as any non-whitespace character follows them on the line. Spaces
and tabs at the end of a line must be encoded per rule 2 above.

4. Native line endings should be translated to carriage return-line feed
pairs.

5. Quoted printable lines are limited to 76 characters of length (not
counting the line ending pair). Longer lines must be divided up. Any
line endings added by the encoding process should be preceded by an
equals sign, so the decoder will know to remove them. The equals sign
must be the last character on the line, followed immediately by the line
end pair. Such an equals sign does count as a non-whitespace character
for rule 3, allowing preceding spaces and tabs to remain unencoded.
The equals sign must fit inside the 76 character limit.

To decode, just reverse the process.


Quiz Summary

It was pointed out, first in private email and later on Ruby Talk, that your quiz editor isn't quite up on all of Ruby's features. Support for the Quoted Printable encoding is already in the language. You can access this with the "M" format specification of Array.pack() and String.unpack(). Dave Burt posted a modification to his solution using these features. Here's that class:

ruby
class String
  def to_quoted_printable(*args)
    [self].pack("M").gsub(/\n/, "\r\n")
  end
  def from_quoted_printable
    self.gsub(/\r\n/, "\n").unpack("M").first
  end
end

Ruby's Quoted Printable encoder uses standard Unix line endings, which is why you see the gsub() translations to the specified carriage-return line-feed pairs above. That doesn't handle the XML aspect of the quiz, but you can add that with a few more calls to gsub() at both ends.
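As a rough sketch of that idea, the XML step can be bolted onto the pack("M") approach with one extra gsub() after encoding. The method names and the XML_ESCAPES table below are invented for illustration and are not from any submission. In this sketch decoding needs no extra step, since unpack("M") already turns =3C, =3E and =26 back into the original characters. One caveat: swapping a single character for a three-character escape after Ruby has wrapped the lines can push a line past the 76 character limit.

```ruby
# Hypothetical helpers, not taken from any submission.
XML_ESCAPES = { "<" => "=3C", ">" => "=3E", "&" => "=26" }

def encode_for_xml(text)
  encoded = [text].pack("M")                              # Ruby's built-in encoder
  encoded = encoded.gsub(/[<>&]/) { |c| XML_ESCAPES[c] }  # the -x characters
  encoded.gsub(/\n/, "\r\n")                              # quiz rule 4 line endings
end

def decode_from_xml(text)
  text.gsub(/\r\n/, "\n").unpack("M").first               # escapes decode as usual
end
```

For example, encode_for_xml("if a < b\n") should give "if a =3C b\r\n", and decode_from_xml() turns that back into the original line.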

Ignoring my knowledge gap, we still have some interesting solutions to discuss.

Let's start with a solution. Here's Glenn Parker's code:

ruby
#!/usr/bin/env ruby -w

require 'getoptlong'

MaxLength = 76

def main
  opts = GetoptLong.new(
    [ "-d", GetoptLong::NO_ARGUMENT ],
    [ "-x", GetoptLong::NO_ARGUMENT ]
  )
  $opt_decode = false
  $opt_xml = false
  opts.each do |opt, arg|
    # "then" is used here because "when ...:" is not valid in Ruby 1.9+.
    case opt
    when "-d" then $opt_decode = true
    when "-x" then $opt_xml = true
    end
  end

  if $opt_decode
    decode_input
  else
    encode_input
  end
end

def encode_input
  STDOUT.binmode    # We need to control the line-endings.
  while (line = gets) do
    # Note: String#chomp! swallows more than just $/.
    line.sub!(/#{$/}$/o, "")
    # Encode the entire line.
    line.gsub!(/[^\t -<>-~]+/) { |str| encode_str(str) }
    line.gsub!(/[&<>]+/) { |str| encode_str(str) } if $opt_xml
    line.sub!(/\s*$/) { |str| encode_str(str) }
    # Split the line up as needed.
    while line.length > MaxLength
      ### original code ###
      # split = line.index("=", MaxLength - 4) - 1
      # split = (MaxLength - 2) if split.nil? or (split > MaxLength - 2)
      ### BUGFIX:  index() can return nil, so don't subtract  -JEG2 ###
      split = line.index("=", MaxLength - 4)
      # Break just before the "=", but never past column MaxLength - 2.
      split = split.nil? ? MaxLength - 2 : [split - 1, MaxLength - 2].min
      ### END BUGFIX ###
      print line[0..split], "=\r\n"
      line = line[(split + 1)..-1]
    end
    print line, "\r\n"
  end
end

def encode_str(str)
  encoded = ""
  str.each_byte { |c| encoded << "=%02X" % c }
  encoded
end

def decode_input
  while (line = gets) do
    line.chomp!
    line.gsub!(/=([\dA-F]{2})/) { $1.hex.chr }
    if line[-1] == ?=
      print line[0..-2]
    else
      print line, $/
    end
  end
end

main

Let me talk a little about that shebang line. It doesn't work on my system:

$ chmod +x quoted_printable.rb
$ ./quoted_printable.rb
env: ruby -w: No such file or directory

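That's one of the minuses of using the "env ruby" trick. On many systems, everything after the interpreter path is handed to env as a single argument, so env goes looking for a program literally named "ruby -w". If you don't want to hardcode the path and still want to enable warnings inside the script, the following works: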

ruby
#!/usr/bin/env ruby
$VERBOSE = true # enable warnings

That doesn't have anything to do with the quiz, of course, and you could still run Glenn's code with "ruby quoted_printable.rb", but having been bitten by that same problem myself, I wanted to mention it.

Getting back to the code, Glenn pulls in getoptlong, defines a constant to hold the line length, and then defines a method called main(). main() just parses command line options (setting the globals $opt_decode and $opt_xml as needed), then hands off work to either decode_input() or encode_input().
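For comparison, the same two switches could be handled with optparse, which also ships with Ruby. This is only a sketch of an alternative, not what Glenn's script does. The parse_options() name and the hash keys are invented for this example, and it returns a hash instead of setting globals.

```ruby
require 'optparse'

# Hypothetical helper: returns the switches as a hash and leaves any
# file names in argv, where gets() can still pick them up.
def parse_options(argv)
  options = { :decode => false, :xml => false }
  parser = OptionParser.new do |opts|
    opts.on("-d", "Decode instead of encode") { options[:decode] = true }
    opts.on("-x", "Also encode <, > and &")   { options[:xml] = true }
  end
  parser.parse!(argv)   # removes the recognized switches from argv
  options
end
```

A script would call it as parse_options(ARGV) before reading any input.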

For encoding, encode_input() handles most of the work. It starts by shutting off line ending translation with a call to binmode(). I believe that's only needed when your code is running on Windows, but it's still a great habit to form anytime you're going to muck with raw line endings.

From there, encode_input() loops over STDIN with a line-by-line read. Note that it performs its own chomp() with a call to sub!(). The author explains why in his submission email:

I found it a bit more frustrating that String#chomp! is greedier than
you might expect, discarding all sorts of potential line endings,
instead of limiting itself to $/.
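A quick illustration of the difference, assuming the default record separator ($/ is "\n"). strip_separator() is a hypothetical name standing in for the sub!() call used in the script:

```ruby
# Remove only the record separator, as the sub!() call in the script does.
def strip_separator(line, separator = $/)
  line.sub(/#{Regexp.escape(separator)}\z/, "")
end

dos_line = "text\r\n"
dos_line.chomp              # => "text"    the "\r" goes too
strip_separator(dos_line)   # => "text\r"  the "\r" survives to be encoded
```

With chomp!(), a carriage return at the end of the input line would silently vanish instead of being encoded as =0D.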

The next three substitutions encode the needed characters on the line. They're just a combination of simple Regexps and calls to encode_str(). If you glance down at encode_str(), you can see that it's a very simple byte to hex translator.
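To see what those substitutions do to a single line, here they are pulled out into a standalone method. This is a sketch for experimenting: encode_line() and its xml parameter are invented names, it uses the non-destructive gsub() and sub() rather than the in-place versions, and encode_str() is written with map() and join() but produces the same output as Glenn's version.

```ruby
# Byte to hex translator, equivalent to encode_str() in the script.
def encode_str(str)
  str.each_byte.map { |byte| "=%02X" % byte }.join
end

# Hypothetical wrapper around the three substitutions.
def encode_line(line, xml = false)
  line = line.gsub(/[^\t -<>-~]+/) { |run| encode_str(run) }   # rule 2
  line = line.gsub(/[&<>]+/) { |run| encode_str(run) } if xml  # -x switch
  line.sub(/\s*$/) { |run| encode_str(run) }                   # rule 3
end

encode_line("1 + 1 = 2 \t")   # => "1 + 1 =3D 2=20=09"
encode_line("a < b", true)    # => "a =3C b"
```

Note how the spaces inside the line are left alone, while the space and tab at the end are encoded.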

The final while loop in encode_input() breaks up long lines. It looks more complex above because I added a bug fix to it. When I ran tests on the code, Glenn's script crashed on me. The issue was that String.index() can return nil, and you can't subtract 1 from nil. I moved the "- 1" down a line, so the subtraction only happens after the nil check.

index() looks for an "=" near the end of the line so that the break never lands inside an already encoded character. If there is no encoded character there, the line is split so that the piece plus its trailing "=" is exactly MaxLength characters long.
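The break-point rule can be isolated into a small function, which makes it easier to test. This is only a sketch: the split_point() name and the MAX_LENGTH constant are hypothetical, and the logic assumes the line has already been encoded, so every "=" starts a three-character escape.

```ruby
MAX_LENGTH = 76

# Index of the last character to keep on the current output line.
# One column is reserved for the soft-break "=" that follows it.
def split_point(line, max = MAX_LENGTH)
  last   = max - 2
  escape = line.index("=", max - 4)   # an escape near the cut, if any
  escape && escape - 1 < last ? escape - 1 : last
end

split_point("a" * 100)                      # => 74
split_point("a" * 72 + "=3C" + "b" * 10)    # => 71, just before the "="
```

The piece to print is then line[0..split_point(line)] followed by "=" and the line ending pair.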

This method of breaking up the lines can break lines mid-word. You might want to consider breaking them at word boundaries instead. A big advantage of Quoted Printable is that it's a Base64-like encoding that keeps plain text pretty readable. That's why I suggested its use to embed data in XML. Breaking lines on word boundaries just enhances that characteristic.
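One way to do that, sketched under a few assumptions: the line has already been encoded, only spaces count as word boundaries, and soft_wrap() is a hypothetical name, not something from the submissions. It prefers to break right after the last space that fits, and falls back to a hard break that avoids cutting an escape in two.

```ruby
# Hypothetical word-aware wrapper for one already-encoded line.
def soft_wrap(line, max = 76)
  pieces = []
  while line.length > max
    limit = max - 1                          # keep one column for the "="
    space = line.rindex(" ", limit - 1)      # last space that still fits
    if space
      cut = space + 1                        # the space stays on this piece
    else
      escape = line.index("=", limit - 2)    # an escape straddling the limit?
      cut = (escape && escape < limit) ? escape : limit
    end
    pieces << line[0, cut] + "="
    line = line[cut..-1]
  end
  pieces << line
  pieces.join("\r\n")
end
```

The caller still appends the final line ending pair. A space left at the end of a piece is legal here, because the "=" that follows it counts as a non-whitespace character under rule 5.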

Getting back to the code one last time, decode_input() is even easier to follow. It too is a line-by-line read, with a gsub!() used to decode and a basic if statement used to unwrap lines (by dropping the = and not printing a line ending).
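For text that is already in memory rather than read line by line, the same steps can be applied to a whole string. A minimal sketch with a hypothetical method name follows. It works on a binary copy of the input, so decoded bytes above 127 do not clash with the string's encoding, and it converts line endings before decoding escapes, so an encoded =0D=0A is left intact.

```ruby
# Hypothetical whole-string decoder following the same steps.
def decode_quoted_printable(text)
  text.b                                       # work on raw bytes
      .gsub(/\r\n/, "\n")                      # CRLF back to "\n"
      .gsub(/=\n/, "")                         # drop soft line breaks
      .gsub(/=([0-9A-F]{2})/) { $1.hex.chr }   # =XX back to one byte
end
```

The result is a binary string. Call force_encoding() on it if you know the encoding of the original text.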

The other solutions are all quite interesting and I do encourage everyone to check them out. Most submissions modified String to add the conversions. Matthew Moss also added foreach() style readers to IO. Dave Burt included a nice set of test cases, used by himself and at least one other person. Good stuff all around.

My thanks to all who endure my mental lapses, and to those who gently correct me. I need all the help I can get.

Great news: We have a record four quizzes queued up right now, all of them including some contribution from others! I'm so pleased. We'll start our run tomorrow with a quiz for people who know when to Hold'em and when to fold 'em...