Editing Text (#145)

by Eric Mahurin

Have you ever wondered how a text buffer might be represented in a text editor or word processor? A simple string to represent text buffer isn't efficient enough because inserting (i.e. typing) and deleting (backspace) in the middle would result in moving all of the text to the end for each operation. A data structure that can efficiently insert and delete in the middle is needed.

The task is to implement data structures for efficiently editing text. One common data structure is a gap buffer:

Gap buffer

Other options to consider include: ropes (see quiz #137), linked lists, simple strings, or a multi-level combination of data structures (i.e. for lines vs. characters in a line). There are many data structures that may work efficiently with simple editing operations, but not all of those data structures will work well for more complex functionality.

All of the basic operations occur around a text cursor. The minimal operations on/at the cursor should be:

* Insert a character before or after the cursor.
* Delete a character before or after the cursor and return the
deleted character.
* Move the cursor left or right (crossing line boundaries if
necessary) and return the character or nil at the beginning
or end of the buffer.
* Move the cursor up or down a line and return nil/false only if a
line boundary could not be crossed. The cursor may be placed in
the most natural column for the data structure.

Additional useful operations that you might find in a typical text editor can also be added:

* Get current line and column numbers
* Copy some amount of text before or after the cursor and return
this buffer.
* Display some context around the cursor.
* Cut some amount of text before or after the cursor and return
this buffer.
* Paste a copy/cut buffer before or after the cursor.
* Insert a file.
* Write to a file.
* Goto a specific line or column.
* Goto the begin/end of the line/buffer.
* Copy or cut to a specific line/column.
* Filter some text through a ruby block.
* Search (and possibly replace) using a regular expression.
* Undo/redo.

Major bonus points for the following where gap buffers probably won't work:

* Only store changes to a file.
* Handle multiple efficient cursors in a text buffer.

Although the focus is on data structures and making the ruby DSL equivalent to unix "ed" or DOS "edlin", a GUI could be added to make a full-screen/window text editor.

Here is a benchmark for testing that needs the minimal implementation (#insert_before, #insert_after, #delete_before, #delete_after, #left, #right, #up, #down):

ruby
# edit_test.rb
# usage: ruby -r klass.rb edit_test.rb <iter> \
# [<constructor> [<lines> <columns>] ...]

require 'benchmark'
require 'test/unit/assertions'
include Test::Unit::Assertions

# char = byte pre 1.9, each_char already defined in 1.9
unless "".respond_to?(:each_char)
class String;alias_method(:each_char, :each_byte);end
end

iterations = ARGV.shift.to_i

while cursor = ARGV.shift
nlines = (ARGV.shift || 10000).to_i
ncolumns = (ARGV.shift || 100).to_i
n = nlines*ncolumns
chars = (?a..?z).to_a
line = (0...ncolumns).inject("") { |line, i|
line << chars[i%chars.length]
}
line[-1] = ?\n

iterations.times { eval(cursor).instance_eval {
Benchmark.benchmark(
"#{cursor}: #{nlines}x#{ncolumns}\n",16,nil,"total"
) { |b|

total = b.report("insert_before") {
nlines.times { line.each_char { |ch| insert_before(ch) } }
}
i = 0
total += b.report("left") { i += 1 while left }
assert_equal(n, i)
i = 0
total += b.report("right") { i += 1 while right }
assert_equal(n, i)
i = 0
total += b.report("up") { i += 1 while up }
assert_equal(nlines, i)
i = 0
total += b.report("down") { i += 1 while down }
assert_equal(nlines, i)
total += b.report("insert_after") {
nlines.times { line.each_char { |ch| insert_after(ch) } }
}
i = 0
total += b.report("delete_before") {
i += 1 while delete_before
}
assert_equal(n, i)
i = 0
total += b.report("delete_after") {
i += 1 while delete_after
}
assert_equal(n, i)

[total]

}
} }
end

Quiz Summary

Implementing an efficient data structure for this problem can be tricky. A couple of the solutions turned out to have a higher complexity than it first appeared. That doesn't mean they aren't valuable to examine though.

Let's have a look at Holger's code below. It's an implementation of the quiz-recommended gap buffer data structure. It's a very clean implementation of the idea that helped me understand what a gap buffer really is. Here's the constructor:

ruby
class GapBuffer
def initialize(data="", i=0)
@data = data
@gap_start = i
@gap_len = 0
@GAP = " "*64
end

# ...

The actual content of our text editing session will be kept in @data. However, that data may contain some extra material called a "gap," which will begin at the index @gap_start and be @gap_len characters long. @GAP is just a constant used to insert new gaps when that last one is exhausted.

How this interacts in usage is well shown by the insert methods:

ruby
# ...

def insert_before(ch)
if @gap_len.zero?
@data[@gap_start, 0] = @GAP
@gap_len = @GAP.length
end
@data[@gap_start] = ch
@gap_start += 1
@gap_len -= 1
end

def insert_after(ch)
if @gap_len.zero?
@data[@gap_start, 0] = @GAP
@gap_len = @GAP.length
end
@data[@gap_start+@gap_len-1] = ch
@gap_len -= 1
end

# ...

The idea for both of these methods is simple. To add content, we place it in either end of the gap, shrinking the space available there. To insert a character before the caret we must add the character to the beginning of the gap, bump the gap start index beyond the new character, and decrease our count of available space in the gap. Adding after the caret just means placing the character at the end and decreasing the length counter.

The if statement at the beginning of both of these methods watches for the gap to be exhausted. When we run out of space, a new gap is inserted and our length counter is reset.

Deleting characters is just the opposite operation:

ruby
# ...

def delete_before
return if @gap_start.zero?
@gap_start -= 1
@gap_len += 1
@data[@gap_start]
end

def delete_after
return if @gap_start+@gap_len >= @data.length
@gap_len += 1
@data[@gap_start+@gap_len-1]
end

# ...

To remove something all we really have to do is to manipulate our index and/or length to put the discarded character back into the gap. Once there, later operations can overwrite it normally.

Note that the last lines of these methods return the "deleted" character and the first lines skip the operations when there are no characters beyond the gap.

Now the harder operations of a gap buffer are moving the caret. We'll start with left() and right(), which are the easier moves:

ruby
# ...

def left
return if @gap_start.zero?
@data[@gap_start+@gap_len-1] = @data[@gap_start-1]
@gap_start -= 1
@data[@gap_start]
end

def right
return if @gap_start+@gap_len>=@data.length
@data[@gap_start] = @data[@gap_start + @gap_len]
@gap_start += 1
@data[@gap_start - 1]
end

# ...

Moving the caret to either side means transferring a character from one end of the buffer to the other. Then we can just expand the gap to include the moved character's previous location, so future editing operations will swallow in up. For example, given the following data:

The gap buffer[ ], or our caret, is shown here with brackets.

A move to the left() is gives us:

The gap buffe[r ]r, or our caret, is shown here with brackets.

The moved character is on the right of the buffer while the original is inside. Characters inside the buffer don't really exist in the content so we don't have any duplication here.

Now for the hard moves:

ruby
# ...

def up
col = column
cursor = @gap_start-col
return if cursor.zero?
cursor_line = @data.rindex(?\n, cursor-2)
cursor_line = 0 if cursor_line.nil?
cursor_line += col+1
if cursor_line > cursor-1
cursor_line = cursor-1
end
left while @gap_start > cursor_line
true
end

def down
col = column
cursor = @data.index(?\n, @gap_start + @gap_len)
return if cursor.nil?
cursor_line = cursor+1+col
cursor = @data.index(?\n, cursor+1)
cursor = @data.length if cursor.nil?
cursor_line = cursor if cursor_line > cursor
right while @gap_start + @gap_len < cursor_line
true
end

# ...

While this looks like a lot of code, the procedure is simple.

The column() helper method, which we will see shortly, figures out where we are in the current line. For the down() method, that's really all we need to mark our current location. When moving up() though, we want to reach the line before ours, or somewhere between one and two newlines back. To make that easier, up() first backs up by the column count to start the search from the beginning of the line.

Then we have a terrific use index()/rindex() with the optional minimum index parameter to find the next newline past the end of the gap buffer. When that search fails, the caret is moved to the beginning or end of the data which must be the last line we needed to cross. Having found the desired line, the current column is added on, and trimmed to match the data if it exceeds the line.

Together, these operations give us a where we are and where we need to get to. The actual move is performed with repeated calls to left()/right(), one for each character we need to travel.

Here are a couple of location methods, one of which we just saw in use:

ruby
# ...

def position
@gap_start
end

def column
lbreak = @data.rindex(?\n, @gap_start-1)
lbreak.nil? ? @gap_start : (@gap_start-(lbreak+1))
end

# ...

As you can see, position() is just an alias for @gap_start. The column() method though is just more clever use of the rindex() method. It really works just like we saw in up(), save that it doesn't need to skip over a newline.

The last operation supported by this solution is to stringify the data:

ruby
# ...

def to_s
return @data[0, @gap_start] +
@data[@gap_start+@gap_len, @data.length-@gap_start-@gap_len]
end
end

The actual data is just the concatenation of what's on both sides of the gap. The code above assembles that with simple indexing.

Holger had hoped the solution was O(n), but unfortunately it turns out to be O(n**2). Eric explains why this is and provides tips for how to get it to O(n) in this email:

Eric's Analysis of Holger's Algorithm

While the code does need the mentioned changes to reduce its complexity, I still felt it was a very clean implementation of the gap buffer and quite easy to follow. This should make fixing it up a snap.

My thanks to those who did manage to find the time for the quiz. It was a tricky problem to do well and all of the solutions showed off neat approaches.

Tomorrow we will do some vehicle counting, programmer style...