Huffman Encoder (#123)

by Harlan

Huffman Coding is a common form of data compression where none of the original data gets lost. It begins by analyzing a string of data to determine which pieces occur with the highest frequencies. These frequencies and pieces are used to construct a binary tree. It is the “path” from root node to the leaf with this data that forms its encoding. The following example should explain things:

Data: ABRRKBAARAA (11 bytes)

Frequency counts:
A 5
R 3
B 2
K 1

In Huffman Tree form, with frequency weights in parentheses:
ARBK (11)
/ \
0 1
/ \
A (5) RBK (6)
/ \
0 1
/ \
R (3) BK (3)
/ \
0 1
/ \
B (2) K (1)

The encoding for each character is simply the path to that character:
A 0
R 10
B 110
K 111

Here is the original data encoded:
01101010 11111000 1000 (fits in 3 bytes)

We have compressed the original information by 80%!

A key point to note is that every character encoding has a unique prefix, corresponding to the unique path to that character within the tree. If this were not so, then decoding would be impossible due to ambiguity.

The quiz this time is to write a program to implement a compression program using Huffman encoding.

Extra Credit:
Perform the actual encoding using your tree. You may encounter one issue
during the decompression/decoding phase. Your encoded string may not be a
multiple of 8. This means that when you compress your encoding into a
binary number, padding 0’s get added. Then, upon decompression, you may
see extra characters. To counter this, one solution is to add your own
padding of 1 extra character every time. And then simply strip it off
once you have decoded.

You may also wish to provide a way to serialize the Huffman Tree so it
can be shared among copies of your program.

./huffman_encode.rb I want this message to get encoded!

Encoded:
11111111 11111110 11111111 11101111 10111111
01100110 11111111 11110111 11111111 11011100
11111111 11010111 01110111 11011110 10011011
11111100 11110101 10010111 11101111 11111011
11111101 11111101 01111111 01111111 11111110
Encoded Bytes:
25

Original:
I want this message to get encoded!
Original Bytes:
35

Compressed: 28%


Quiz Summary

With the extra credits, this problem is a little involved and some people did write a lot of code for it. Building the tree was our main interest in this problem though.

The quiz didn't detail that process too much, but several submitters found write-ups like the one at Wikipedia. The trick is generally to use two queues. The first starts with all of the letters queued lowest frequency to the highest and the second starts empty. While there is more than one node in the combined queues, you dequeue the two with the lowest weights, build a new node with them as children and a combined weight, then enqueue that in the second queue. When you get down to just one node, you are finished. That single node is the root of the tree.

A variation on this strategy is to use a single priority queue. When working this way you can always just pull the two lowest entries, since the queue will keep them coming in the proper order.

Drew Olson has some pretty easy to follow code using the priority queue approach, so let's look into that now. First, Drew had to build a priority queue since one doesn't ship with Ruby:

ruby
# priority queue for nodes
class NodeQueue
def initialize
@queue = []
end

def enqueue node
@queue << node
@queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
end

def dequeue
@queue.pop
end

def size
@queue.size
end
end

This is a trivial implementation that just resorts the queue after each new entry. Note that the sort is on the opposite of the weights to put the lowest entries at the front.

This is not ideal, of course, but likely to be reasonably quick if you are just encoding simple text. That's because the sort is largely in C. For a better priority queue, have a peek at Daniel Martin's code.

Drew also used a trivial class to represent nodes in the tree:

ruby
# class to hold nodes in the huffman tree
class Node
attr_accessor :val,:weight,:left,:right

def initialize(val="",weight=0)
@val,@weight = val,weight
end

def children?
return @left || @right
end
end

As you can see, Nodes are pretty much just a Struct for tracking value, weight, and children. The additional method just checks to see if this node is a branch, meaning that it has at least one child node.

With those tools to build on, Drew is now ready to create a HuffmanTree:

ruby
# HuffmanTree represents the tree with which we perform
# the encoding
class HuffmanTree

# initialize the tree based on data
def initialize data
@freqs = build_frequencies(data)
@root = build_tree
end

#encode the given data
def encode data
data.downcase.split(//).inject("") do |code,char|
code + encode_char(char)
end
end

def decode data
node = @root

if !@root.children?
return @root.val
end

data.split(//).inject("") do |phrase,digit|
if digit == "0"
node = node.left
else
node = node.right
end
if !node.children?
phrase += node.val
node = @root
end
phrase
end
end

# ...

These three methods define the external interface for this class. First, you create HuffmanTree objects by passing in the data a tree should be constructed from. Frequencies are counted for the characters in the data and a tree is built from those counts.

The encode() method takes the data you wish to apply the encoding to and returns a String of ones and zeros representing the data. This implementation just iterates over the characters, using a helper method to translate them. Note that Drew's implementation normalizes case which results in smaller encodings, but this step needs to be removed if you want lossless compression.

The decode method is the most complex in the set, but still not too hard to grasp. It starts at the root node and iterates over each one and zero, selecting the correct child node. Each time it reaches a leaf node (no children), that character value is added to the translation and the search resets to the root node.

Now we just need to see the helper methods used in those methods. This first one is the reverse of the decoder we just examined:

ruby
# ...

private

# this method encodes a given character based on our
# tree representation
def encode_char char
node = @root
coding = ""

# encode to 0 if only one character
if !@root.children?
return "0"
end

# we do a binary search, building the representation
# of the character based on which branch we follow
while node.val != char
if node.right.val.include? char
node = node.right
coding += "1"
else
node = node.left
coding += "0"
end
end
coding
end

# ...

Again, the search begins with the root node and advances down the tree branches. This time the search is for nodes containing the character and we can stop as soon as we reach a leaf. The encoding is the path of one and zero branches that lead to the character.

These last two methods handle tree construction:

ruby
# ...

# get word frequencies in a given phrase
def build_frequencies phrase
phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
hash[item] += 1
hash
end
end

# build huffmantree using the priority queue method
def build_tree
queue = NodeQueue.new

# build a node for each character and place in pqueue
@freqs.keys.each do |char|
queue.enqueue(Node.new(char,@freqs[char]))
end

while !queue.size.zero?

# if only one node exists, it is the root. return it
return queue.dequeue if queue.size == 1

# dequeue two lightest nodes, create parent,
# add children and enqueue newly created node
node = Node.new
node.right = queue.dequeue
node.left = queue.dequeue
node.val = node.left.val+node.right.val
node.weight = node.left.weight+node.right.weight
queue.enqueue node
end
end
end

The first method, build_frequencies(), is just a character counter. The counts are returned in a Hash keyed by the character for a given count.

The main work is done in build_tree(). It begins by creating a priority queue and queuing each of the characters from the frequency count. After that, the while loop is a direct translation of the process I described at the beginning of this summary.

The final bit of code puts the tree to work creating Drew's solution:

ruby
# get command lines args, build tree and encode data
if __FILE__ == $0
require 'enumerator'

data = ARGV.join(" ")
tree = HuffmanTree.new data

# get encoded data and split into bits
code = tree.encode(data)
encoded_bits = code.scan(/\d{1,8}/)

# output
puts
puts "Original"
puts data
puts "#{data.size} bytes"
puts
puts "Encoded"
encoded_bits.each_slice(5) do |slice|
puts slice.join(" ")
end
puts "#{encoded_bits.size} bytes"
puts
puts "%d percent compression" %
(100.0 - (encoded_bits.size.to_f/data.size.to_f)*100.0)
puts
puts "Decoded"
puts tree.decode(code)
end

The first few chunks of this code just run the interface methods we have been examining. The last big chunk is simply the output of results using some straightforward printing logic.

My thanks to all who took on this challenge. Several of you wrote library quality solutions. It was impressive to see.

Tomorrow we will try some magical math, as quick as we can...