Bytecode Compiler (#100)

by Ross Bamford

Note: This quiz isn't really as much work as it might seem!

This quiz involves writing (in Ruby, of course) a compiler for basic arithmetic expressions. The output from this compiler should be an array of unsigned byte-sized ints, which can be fed into the included interpreter (http://www.rubyquiz.com/interp.rb) in order to execute the compiled expression.

The bytecode format is very simple, while the interpreter is also very simple, implemented as a stack machine. They have the following general characteristics:

* Bytecode is stored as an array of unsigned byte-sized Fixnums.
* All stack-bound numbers are signed integers
* The following operations are supported:
* Addition
* Subtraction
* Multiplication
* Division
* Raise to power
* Integer modulo
* Where an operator would return a floating point value,
the value is truncated to an integer.
* Short CONST and long LCONST instructions allow constants
to be pushed to the stack. These instructions expect their
operands to hold a signed short or long, respectively,
in network byte order.

Your compiler interface should be via a singleton method on a module 'Compiler', taking a string, such that:

ruby
Compiler.compile('3+2')

Returns an array of instructions (and operands) that, when fed to the interpreter, will execute the expression 3+2 and return the result (hopefully 5). For example, a correct (but non-optimal) result from the above call would be:

ruby
[2,0,0,0,3, # LCONST 3
2,0,0,0,2, # LCONST 2
10] # ADD

Your compiler should support all basic arithmetic operations and explicit precedence (parenthesis). As standard, syntax/precedence/ associativity/etc. should follow Ruby itself. Obviously, specific implementation is entirely up to you, though bear in mind that your compiler must be capable of running inline in the same Ruby process as the interpreter, without affecting any code outside itself.

The quiz also includes a number of tests (http://www.rubyquiz.com/test_bytecode.rb) that will test your compiler's functionality, with expressions becoming more complex as the tests go on. To pass all the tests, a compiler will have to not only generate correct bytecode, but it will also need to generate the shortest code it can for a given expression.

Here is the bytecode spec:

# 0x01: CONST (cbyte1, cbyte2) ... => ..., const
Push a 15-bit signed integer to the stack.
The two bytes immediately following the instruction represent the
constant.

# 0x02: LCONST (cbyte1, cbyte2, cbyte3, cbyte4) ... => ..., const
Push a 31-bit signed integer to the stack.
The four bytes immediately following the instruction represent the
constant.

# 0x0a: ADD () ..., value1, value2 => ..., result
Pop the top two values from the stack, add them, and push the result
back onto the stack.

# 0x0b: SUB () ..., value1, value2 => ..., result
Pop the top two values from the stack, subtract value2 from value1,
and push the result back onto the stack.

# 0x0c: MUL () ..., value1, value2 => ..., result
Pop the top two values from the stack, multiply value1 by value2,
and push the result back onto the stack.

# 0x0d: POW () ..., value1, value2 => ..., result
Pop the top two values from the stack, raise value1 to the power of
value2, and push the result back onto the stack.

# 0x0e: DIV () ..., value1, value2 => ..., result
Pop the top two values from the stack, divide value1 by value2,
and push the result back onto the stack.

# 0x0f: MOD () ..., value1, value2 => ..., result
Pop the top two values from the stack, modulo value1 by value2,
and push the result back onto the stack.

# 0xa0: SWAP () ..., value1, value2 => ..., value2, value1
Swap the top two stack values.


Quiz Summary

I'm very glad this quiz came along, because I honestly had no idea what "bytecode" meant before I read this quiz. I think it was good for dumbing the idea down for the masses (or just me, if everyone else already knew this stuff). Not all of us port entire virtual machines, as the quiz creator does.

There were a couple of approaches to this week's quiz. One strategy was to convert the expression into valid Ruby that would emit the bytecode and let Ruby do the parsing. This works for this parsing problem because Ruby's expression rules are comparable to those specified by the quiz. I think we can break down one of those solutions to better understand what was involved with this problem.

Here's a chunk of code from Cameron Pope's solution:

ruby
module CreateExpressions
def +(other) Expr.new(:add, self, other) end
def -(other) Expr.new(:sub, self, other) end
def *(other) Expr.new(:mul, self, other) end
def /(other) Expr.new(:div, self, other) end
def %(other) Expr.new(:mod, self, other) end
def **(other) Expr.new(:pow, self, other) end
end

You can see a set of operator definitions here, matching those needed by the quiz. Instead of these methods doing what they normally do (math), the versions we see here just create and return an Expr object. Let's have a look at that definition now:

ruby
class Expr
include CreateExpressions
OPCODES = {:add => 0x0a, :sub => 0x0b, :mul => 0x0c, :pow => 0x0d,
:div => 0x0e, :mod => 0x0f}

def initialize(op, a, b)
@op = op
@first = a
@second = b
end

def to_s
"(#{@op.to_s} #{@first.to_s} #{@second.to_s})"
end

def emit
@first.emit << @second.emit << OPCODES[@op]
end
end

Ignoring the to_s() method, which is just for debugging, this object is trivial. It stores an operator and operands, and can emit() bytecode. To do that, it forwards the emit() call to both operands, which may be nested Expr objects (say from parenthetical expressions) or something else we will discuss in just a moment. emit() also looks up the operator's bytecode and appends that to the results of the operand emit()s.

Note that this class include()s all of the operator definitions we saw earlier. This means you can add, multiply, etc. Expr objects, which yields a new Expr with the original Epxrs nested in the operands.

That covers expressions, but what about plain old numbers? For that, there is another class:

ruby
class Const
include CreateExpressions
OPCODES = {2 => 0x01, 4 => 0x02}

def initialize(i)
@value = i
end

def to_s
@value
end

def emit
case @value
when (-32768..32767): bytes = [@value].pack("n").unpack("C*")
else bytes = [@value].pack("N").unpack("C*")
end
bytes.insert 0, OPCODES[bytes.size]
end
end

We have pretty much the same pattern here, save that emit() converts the number to the bytecode for the properly sized constant plus the packed bytes. Again we see the arithmetic operators include()ed, so these too can be combined with other Const and Expr objects.

Now we need a means to turn the normal numbers of the expression into Const objects:

ruby
class Fixnum
def to_const
Const.new(self)
end
end

Yep, that will do just that.

All that's left is to put it to use:

ruby
class Compiler
def self.compile(expr)
self.mangle(expr).emit.flatten
end

def self.explain(expr)
self.mangle(expr).to_s
end

private
def self.mangle(expr)
eval(expr.gsub(/\d+/) {|s| "#{s}.to_const()"})
end
end

The mangle() method is the heart of this solution. A simple Regexp is used to tack a to_const() call on to each number of the expression and a hand-off is made to eval() to build up the indicated combination of Const and Expr objects. Once mangle()d, the result is emit()ted and flatten()ed into the final bytecode Array in compile().

That's easy enough to understand and it works just fine in this instance. However, what if the rules differed from Ruby's and you couldn't lean on Ruby's parser? In that case, you would have to roll your own parser, which some decided to do.

Luckily there is already a popular algorithm for unrolling infix arithmetic expressions into the RPN order required by bytecode spec:

Shunting Yard Algorithm

Several people employed this strategy, including Daniel Martin:

ruby
# This is a solution to Ruby Quiz #100
#
# It's basically just a shunting algorithm, but with a twist
# since it needs to distinguish between a "-" that's part of
# a number and a "-" that's an operator. To do that, I use
# a state machine while parsing to remember if I need next
# an operator or an integer.

require 'strscan'
class Compiler
# A small class made so that I can use case ... when
# with a StringScanner
class Token < Regexp
def initialize(re)
super(re)
end
# Using is_a? instead of respond_to? isn't very duck-typey,
# but unfortunately String#scan and StringScanner#scan mean
# completely different things.
def ===(s)
if (s.is_a?(StringScanner))
s.scan(self)
else
super(s)
end
end
end

# ...

This first class of Daniel's solution is really just a Regexp with a custom case equals method. This sets up an elegant syntax for the solution proper we will encounter in a bit.

ruby
# ...

# The tokens I need
WSPACE = Token.new(/\s+/)
LPAREN = Token.new(/\(/)
RPAREN = Token.new(/\)/)
OP = Token.new(/\*\*|[+*%\/-]/)
NEG = Token.new(/-/)
INT = Token.new(/\d+/)

OpValMap = {'+' => 0x0a, '-' => 0x0b, '*' => 0x0c,
'**' => 0x0d, '/' => 0x0e, '%' => 0x0f}

# ...

Here you see the class used to build the Tokens for lexxing the expression content. These are very basic regular expressions.

Here are the interface methods required by the quiz solution:

ruby
# ...

def initialize(instring)
@scanner = StringScanner.new(instring)
@opstack = Array.new
@outarr = Array.new
end

def compile()
state = :state_int
while state != :state_end
case @scanner
when WSPACE
next
else
state = send(state)
raise "Syntax error at index #{@scanner.pos}" if ! state
end
end
while ! @opstack.empty?
op = @opstack.pop
raise "Mismatched parens" if LPAREN === op
@outarr << OpValMap[op]
end
@outarr
end

# Class method as required by the test harness
def self.compile(instring)
new(instring).compile
end

# ...

Notice that initialize() sets up the stack and output Arrays needed by the Shunting Yard algorithm. The compile() method wraps a state machine we will examine the workings of shortly. You can see that is discards whitespace as needed, forwards to state methods, pop()s the final operators as the algorithm requires, and returns the resulting output Array. The class compile() is just a wrapper over the other two methods.

The heart of what is going on here is hidden in the state methods. From compile() we saw that the initial state for the machine is :state_int, so let's begin with that method:

ruby
# ...

private

# ...

# state where we're expecting an integer or left paren
def state_int
case @scanner
when LPAREN
@opstack << @scanner.matched
:state_int
when INT
integer(@scanner.matched.to_i)
:state_op
when NEG
:state_neg
end
end

# ...

Here we begin to see the magic of the Token class. Matching a Token advances the StringScanner index and matched() can then be used to grab the element.

The LPAREN when clause is right out of the algorithm description, and the INT clause is pretty obviously if you know that integer() handles the constant conversion. The NEG clause is needed to distinguish an unary minus from the binary operator. Note that each case returns the next expected state for the machine.

Here's the integer() helper we have seen before:

ruby
# ...

# Handle an integer
def integer(i)
if (i <= 32767 and i >= -32768)
@outarr << 0x01
@outarr.push(*([i].pack("n").unpack("C*")))
else
@outarr << 0x02
@outarr.push(*([i].pack("N").unpack("C*")))
end
end

# ...

The only difference here is that constants are pushed onto the output Array as required by the algorithm.

Let's return to the state methods:

ruby
# ...

# Expecting an operator or right paren
def state_op
case @scanner
when RPAREN
while not LPAREN === @opstack[-1]
raise "Mismatched parens" if @opstack.empty?
@outarr << OpValMap[@opstack.pop]
end
@opstack.pop
:state_op
when OP
op = @scanner.matched
while is_lower(@opstack[-1], op)
@outarr << OpValMap[@opstack.pop]
end
@opstack << op
:state_int
else
# I would handle this with an EOS token, but unfortunately
# StringScanner is broken w.r.t. @scanner.scan(/$/)
:state_end if @scanner.eos?
end
end

# ...

Again, the RPAREN clause is right out of the algorithm description. The OP clause is as well and you can see that it handles the precedence check via the is_lower() helper method. The else clause gives us our exit state when the expression has been exhausted.

Here's the is_lower() helper:

ruby
# ...

# Define the precedence order
# One thing to note is that for an operator a,
# is_lower(a,a) being true will make that operator
# left-associative, while is_lower(a,a) being false
# makes that operator right-associative. Note that
# we want ** to be right associative, but all other
# operators to be left associative.
def is_lower(op_on_stack, op_in_hand)
case op_on_stack
when nil, LPAREN; false
when /\*\*|[*\/%]/; op_in_hand =~ /^.$/
when /[+-]/; op_in_hand =~ /[+-]/
end
end

# ...

The comment surely explains this better than I can, but the point of this method is to resolve whether or not the operator we just matched is lower in precedence than the operator on the stack. For example, in the last line, when we have a plus or minus on the stack only another plus or minus would trigger the true result.

Here's the final state:

ruby
# ...

# The state where we've seen a minus and are expecting
# the rest of the integer
def state_neg
case @scanner
when INT
integer(-(@scanner.matched.to_i))
:state_op
end
end

# ...
end

This just reads the constant following a negation operator and ensures that it is negated.

Those are two different approaches that pass the quiz tests. It's probably slightly easier to lean on Ruby in cases like this where you know you can get away with it. If you can't though, the custom parser isn't too much more work as Daniel shows.

My thanks to all who taught me many things I didn't know about bytecode compilation through their solutions and to Ross Bamford for the educational quiz.

Tomorrow we will play with VCRs and I'm told that dates me...