by John Miller
This week's task is to implement the Rope data structure as a Ruby class. This topic comes out of the ICFP programming competition (http://www.icfpcontest.com/) which had competitors manipulating a 7.5 million character string this year.
What is a Rope:
You may not realize it, but for many changes to the content of a String, Ruby creates a new copy of the original with the modifications applied. For small strings that are created once and read often this is actually a very efficient way to do thing, but what happens when the string starts to get long, and is undergoing a lot of changes? First, the program will spend more and more of its processing cycles just copying bits around. Second, the garbage collector will be called more and more often to pick up the little stringy scraps you've left all over the memory.
Ropes (the name is a pun for a heavy duty string) are tree structures where a node represents the concatenation of its left branch with its right, and leaves are flat strings. (This is a little sloppy. A rope may contain shared subtrees, and is thus really a directed acyclic graph, where the out-edges of each vertex are ordered. We will continue to be sloppy.) E.g. To prepend Text A to Text B, one creates a Node (call it N1) with A as its left branch and B as its right. To further append Text C create a new Node N2 with its left branch pointing to N1 and its right to C. Easy, right? To find out more see Boehm, Atkinson and Plass "Ropes: an Alternative to Strings" at:
Ropes: an Alternative to Strings
The task comes in three parts, each increasing in difficulty:
Part one:
Create a Rope class that can do the following:
a. 'append' or 'prepend' a String or another Rope
(alias the << operator to the append function)
b. Return the fully concatenated text with 'to_s'
c. define 'slice' to call to_s.slice
d. provide a 'length' method
Part two:
Add the following:
a. Add the ability to 'index' a single character given a 0-based offset
from the beginning of the string.
b. Add the ability to 'shift' a single character from the front of a Rope.
(Remove and return the character)
c. Add your own 'slice' method that returns a String. Implement as many of
the String method's forms as possible. To run the example code this
function will only need to understand the slice(offset,length) form.
Major Bonus for Regex and Substring forms.
d. "Balance" the tree with a 'normalize' method.
(see Boehm, Atkinson and Plass 1319 Rebalancing)
Part three: (bonus)
Add the following:
a. Change the 'slice' method to return a Rope. Ideally this method should
do as little string copying as possible. (Doing this will well
dramatically increase the speed of the example code)
b. Allow 'shift' to optionally accept an integer number of characters to
remove and return.
c. modify the '<<' operator so that can efficiently append a few
characters at a time. (see Boehm, Atkinson and Plass 1318 para. 4)
d. *Major Bonus* Add the ability to append and prepend IO classes in a
lazy fashion. (see Boehm, Atkinson and Plass 1318 para. 2)
The following code may help you explore how efficient your code is and show where Ropes are useful. `ruby -r /path/to/your/rope/class this_script.rb Rope` will run the test with your code. Run the script without arguments to see how well String does. Also play around with the SIZE and CHUNKS constants to get a feel for how they affect performance.
#This code make a String/Rope of CHUNCKS chunks of text
#each chunck is SIZE bytes long. Each chunck starts with
#an 8 byte number. Initially the chuncks are shuffled the
#qsort method sorts them into ascending order.
#
#pass the name of the class to use as a parameter
#ruby -r rope.rb this_file Rope
puts 'preparing data...'
TextClass = Object.const_get(ARGV.shift || :String)
def qsort(text)
return TextClass.new if text.length == 0
pivot = text.slice(0,8).to_s.to_i
less = TextClass.new
more = TextClass.new
offset = 8+SIZE
while (offset < text.length)
i = text.slice(offset,8).to_s.to_i
(i < pivot ? less : more) << text.slice(offset,8+SIZE)
offset = offset + 8+SIZE
end
print "*"
return qsort(less) << text.slice(0,8+SIZE) << qsort(more)
end
SIZE = 512 * 1024
CHUNCKS = 128
CHARS = %w[R O P E]
data = TextClass.new
bulk_string =
TextClass.new(Array.new(SIZE) { CHARS[rand(4)] }.join)
puts 'Building Text...'
build = Benchmark.measure do
(0..CHUNCKS).sort_by { rand }.each do |n|
data<< sprintf("%08i",n) << bulk_string
end
data.normalize if data.respond_to? :normalize
end
GC.start
sort = Benchmark.measure do
puts "Sorting Text..."
qsort(data)
puts"\nEND"
end
puts "Build: #{build}Sort: #{sort}"
Quiz Summary
The discussion for this quiz was very interesting. It's probably worth your time to go back and read those messages, if you haven't already. Some points I got out of the discussion were:
* It's tricky to get ropes right. Clever implementations may loose key
rope qualities, like the O(1) concatenation time.
* The functional approach, building immutable ropes, is probably
superior considering how ropes are intended to be used.
* Autorebalancing didn't hurt much, at least in the quiz test case.
* Copy-On-Write is helpful, when it works.
Many of these points came out due to Eric Mahurin's work in benchmarking the submitted solutions. Eric also submitted several variations of rope classes, a few of which I want to take a look at below. Let's begin with the class Eric uses to build the rope trees:
class Rope
include Enumerable
# form a binary tree from two ropes (possibly sub-trees)
def initialize(left,right)
@left = left
@right = right
@llength = @left.length
@length = @llength+@right.length
@depth = [left.depth, right.depth].max+1
end
# number of elements in this rope
def length
@length
end
# depth of the tree (to help keep the tree balanced)
def depth
@depth
end
# left rope (not needed when depth==0)
def left
@left
end
# right rope (not needed when depth==0)
def right
@right
end
# ...
Here we see the setup and relevant attributes of Rope objects. First we have the fact that they are binary trees, with left and right subtrees. Next we see that Eric is going to track two lengths for Rope objects, both the total length and the length of just the left subtree. The reasoning for that will become apparent when we examine indexing. Finally, Eric tracks a depth, for use in rebalancing.
There are really two major operations that are key to a Rope implementation: concatenation and indexing. Here's the concatenation side of the puzzle:
# appended rope (non-modifying)
def +(other)
# balance as an AVL tree
balance = other.depth-@depth
if balance>+1
left = other.left
right = other.right
if left.depth>right.depth
# rotate other to right before rotating self+other to left
(self + left.left) + (left.right + right)
else
# rotate self+other to left
(self + left) + right
end
elsif balance<-1
if @right.depth>@left.depth
# rotate self to left before rotating self+other to right
(@left + @right.left) + (@right.right + other)
else
# rotate self+other to right
@left + (@right + other)
end
else
self.class.new(self, other)
end
end
alias_method(:<<, :+)
# ...
This method is only this long because it automatically rebalances the tree as needed. In fact, if you glance down to the final else clause, you will see the trivial implementation, which is just to construct a new Rope from the current Rope and the concatenated element.
The rebalancing done here is, as the comment suggests, a textbook AVL implementation. With an AVL tree, you subtract the left depth from the right depth to get a tree's balance factor. Anything in the range of -1 to 1 is a balanced tree. If the factor is outside of that range, one or two rotations are required to rebalance the tree.
I'm not going to go into the specific rotations. If you would like to read up on them, I recommend:
The AVL Tree Rotations Tutorial
Let's move on to the indexing methods used to pull information back out of a Rope:
# slice of the rope
def slice(start, len)
return self if start.zero? and len==@length
rstart = start-@llength
return @right.slice(rstart, len) if rstart>=0
llen = @llength-start
rlen = len-llen
if rlen>0
@left.slice(start, llen) + @right.slice(0, rlen)
else
@left.slice(start, len)
end
end
# element at a certain position in the rope
def at(index)
rindex = index-@llength
if rindex<0
@left.at(index)
else
@right.at(rindex)
end
end
# ...
Have a look at the at() method first, because it's easier to digest and it shows the concept of indexing this tree structure. Essentially, to index in the tree you check a position against the left length. If it's less than, index the left side. If it's greater than, index the right. This search tactic is the hallmark attribute of binary trees.
The slice() method works the same way. It's just more complicated because it has to work with two indices instead of one. Finding the start index is the same strategy we saw in at(). If that index is in the right subtree, the end will be as well and the code makes the recursive hand-off without further checks. When it's in the left, the end must be located. If that end point turns out to also be in the left, the hand-off is made to the left side. When it is in the right, a partial slice() is made from both halves and combined. This covers all the cases.
Eric added a couple more methods to the Rope class that cover iteration and converting to a String:
# iterate through the elements in the rope
def each(&block)
@left.each(&block)
@right.each(&block)
end
# flatten the rope into a string (optionally starting with a prefix)
def to_s(s="")
@right.to_s(@left.to_s(s))
end
end
# ...
That covers the tree implementation. What we haven't seen yet though, are the leaf nodes. We need two types for the implementation I want to examine, EmptyRope and StringRope. Here's the first of those:
EmptyRope = Object.new
class << EmptyRope
include Enumerable
def length
0
end
def depth
0
end
def +(other)
other
end
alias_method(:<<, :+)
def slice(start, len)
self
end
def each
end
def to_s
""
end
end
# ...
This implementation is kind of a poor man's singleton instance (the design pattern, not the Ruby concept, though we see both here). There shouldn't be any surprises in this code. The attributes are zeroed, concatenation results in whatever the concatenated element is, and slice()ing just returns self which doubles as an empty String.
That leaves just one last leaf, StringRope:
class StringRope
include Enumerable
def self.new(*args)
if args.empty?
EmptyRope
else
super
end
end
def initialize(data)
@data = data
end
def length
@data.length
end
def depth
0
end
# ...
This class just wraps a String to support the Rope API we've been examining. About the only interesting trick here is the is that the default new() for this class is overridden to allow returning an EmptyRope as needed. Anytime the argument is provided though, this method does hand-off to the default new() implementation.
Here's the concatenation method:
def +(other)
balance = other.depth
if balance>1
left = other.left
right = other.right
if left.depth>right.depth
# rotate other to right before rotating self+other to left
(self + left.left) + (left.right + right)
else
# rotate self+other to left
(self + left) + right
end
else
Rope.new(self, other)
end
end
alias_method(:<<, :+)
# ...
We see some more balance work here, but the algorithm is simplified since only the right side can be a subtree. Beyond that, we've seen this code before.
Here are the final few methods:
def slice(start, len)
return self if start.zero? and len==@data.length
# depend on ruby's COW mechanism to just reference the slice data
self.class.new(@data.slice(start, len))
end
def at(index)
@data[index]
end
def each(&block)
@data.each_char(&block)
end
def to_s(s="")
s.concat(@data.to_s)
end
end
end
Note here that slice() was overridden to return a StringRope instead of a String. As the comment says, Ruby internally uses some Copy On Write semantics to reference sliced Strings. This should keep it from wildly duplicating the data, but it was found that a couple of solutions had problems with this for unknown reasons.
That covers a basic Rope implementation. We won't bother to go into the destructive methods as you are probably better off working without them.
My thanks to all who explored this newly popular data structure with us. It looks like there will be a talk on ropes at this year's Rubyconf, so hopefully we gave the speaker some extra material to work with.
This week's Ruby Quiz will start a day early, to adjust for my Lone Star Rubyconf schedule this weekend. The no-spoiler period is still 48 hours and I will still summarize it at the same time, you just get an extra day to work on it. Stay tuned for that problem in just a few minutes...