evan.musing << current

life and tech stuff by Evan Phoenix

Archive for May 2008

CS Nerds Anonymous

with 4 comments

I’m also holding a last minute addition to the RailsConf schedule, CS Nerds Anonymous.

I’ll be chairing, getting the conversation started, basically being the ring leader. My hope is that people show up with fun topics to discuss, and we’ll take it from there. There is a good chance that we’ll have some lightning talks too. Again, doesn’t have to be Rails or even Ruby specific. Just fun topics that you think your fellow CS nerds would enjoy!

So wake up early on Sunday and come hang out! Bring your opinions on everything from optional type annotations in Ruby to Erlang monads!

Written by evanphx

May 26, 2008 at 5:19 pm

Posted in speaking

Code Drive at RailsConf

leave a comment »

A heads up that I’ll be participating in the RailsConf Community CodeDrive, hacking on Rubinius

Come hang out with me and other project hackers and share the love!

Bring your questions, quandaries, criticisms, and code!

See you Thursday morning at 10am!

Written by evanphx

May 26, 2008 at 5:06 pm

Posted in rubinius, speaking

Simple VM JIT with LLVM

with 12 comments

I’ve been investigating using LLVM for Rubinius, so I’ve been doing some very small scale experiments. I typically do this on most projects, to get a mental handle on the problem.

In doing this, I’ve written a very tiny VM to play with how LLVM handles it. Here is the breakdown of the entire VM:

  • Only operates on ints.
  • Uses numbered registers for operations.
  • 3 Instructions:
    1. set(reg, val) Set register number reg to integer value val
    2. add(result, reg, val) Add register reg to val and put the result in result
    3. show(reg) Print out the contents of register reg

Pretty trivial. Not turing complete because there is no branching. But it’s simple enough to explore how to handle bytecode in LLVM.

Sample Program

So, I’ve written a sample program, encoded directly as bytecode:

  [ 0, 0, 3,
    1, 0, 0, 4,
    2, 0 ]

This is:

  • Set register 0 to 3
  • Add register 0 to 4 and put the result in register 0
  • Show register 0

Again, totally trivial. Should just output 7.

C Switch

So, the VM Design 101 way is to build a C switch statement to interpret each bytecode and perform it’s operations. It would look like this, given that int ops[] contains the above integer sequence for the program:

void add(int* ops, int* registers) {
  registers[ops[1]] = registers[ops[2]] + ops[3];
}

void set(int* ops, int* registers) {
  registers[ops[1]] = ops[2];
}

void show(int* ops, int* registers) {
  printf("=> %d\n", registers[ops[1]]);
}

void run(int* ops, int* registers) {
  switch(*ops) {
  case 0:
	set(ops, registers);
	ops += 3;
    break;
  case 1:
	add(ops, registers);
	ops += 4;
    break;
  case 2:
	show(ops, registers);
    return;
  }
}

Very easy. We increments ops to move forward, using ops as a pointer to the current instruction. This is how every VM starts. But this code is very slow when compared to machine code because it obscures the execution flow from the CPU. And besides, this doesn’t use LLVM, the whole point of this post.

A static result…

As we look at the code that was run, we see that the program is set, add, then show. Lets pretend for a sec that given the above functions, we want to perform the same operation, we’d write:

void my_program() {
  int registers[2] = {0, 0};
  int program[10] = [ 0, 0, 3,
                      1, 0, 0, 4,
                      2, 0 ]

  int* ops = (int*)program;

  set(ops, registers);
  ops += 3;
  add(ops, registers);
  ops += 4;
  show(ops, registers);
  ops += 2;
}

So, my_program would perform the same operations as your bytecode above, and considerable faster because we avoid all the overhead the switch statement.

Combining the 2

If you look at the bytecode, than the at the hand written C, we can see that there there is a programmatic way to go from our array of numbers to the C code.

A common approach that people have taken in the past to actually write a C code emitter, that would parse the integers once, spit out a .c file, which you’d compile, then run. This works ok for some situations, but it doesn’t allow for any kind of dynamic abilities to run code. And besides, it doesn’t use LLVM.

The idea is to write a function thats takes as input the array of numbers and dynamically builds the equivalent of the above C function. And thats where LLVM comes in.

Part 1: importing the functions

A key part to this scheme is to have the add/set/show functions we defined above available to LLVM. Doing so lets us use our normal tools to write the each instruction as a small operation that can easily be tested. So, given that we have put those 3 functions into ops.c, we run:
llvm-gcc -emit-llvm -O3 -c ops.c, which generates ops.o as a LLVM bitcode file. Using llvm-dis < ops.o we see it contains:

@.str = internal constant [7 x i8] c"=> %dA0"		;  [#uses=1]

define void @add(i32* %ops, i32* %registers) nounwind  {
entry:
	%tmp1 = getelementptr i32* %ops, i32 1		;  [#uses=1]
	%tmp2 = load i32* %tmp1, align 4		;  [#uses=1]
	%tmp4 = getelementptr i32* %ops, i32 2		;  [#uses=1]
	%tmp5 = load i32* %tmp4, align 4		;  [#uses=1]
	%tmp7 = getelementptr i32* %registers, i32 %tmp5		;  [#uses=1]
	%tmp8 = load i32* %tmp7, align 4		;  [#uses=1]
	%tmp10 = getelementptr i32* %ops, i32 3		;  [#uses=1]
	%tmp11 = load i32* %tmp10, align 4		;  [#uses=1]
	%tmp12 = add i32 %tmp11, %tmp8		;  [#uses=1]
	%tmp14 = getelementptr i32* %registers, i32 %tmp2		;  [#uses=1]
	store i32 %tmp12, i32* %tmp14, align 4
	ret void
}

define void @set(i32* %ops, i32* %registers) nounwind  {
entry:
	%tmp1 = getelementptr i32* %ops, i32 1		;  [#uses=1]
	%tmp2 = load i32* %tmp1, align 4		;  [#uses=1]
	%tmp4 = getelementptr i32* %ops, i32 2		;  [#uses=1]
	%tmp5 = load i32* %tmp4, align 4		;  [#uses=1]
	%tmp7 = getelementptr i32* %registers, i32 %tmp2		;  [#uses=1]
	store i32 %tmp5, i32* %tmp7, align 4
	ret void
}

declare i32 @printf(i8*, ...) nounwind 

define void @show(i32* %ops, i32* %registers) nounwind  {
entry:
	%tmp1 = getelementptr i32* %ops, i32 1		;  [#uses=1]
	%tmp2 = load i32* %tmp1, align 4		;  [#uses=1]
	%tmp4 = getelementptr i32* %registers, i32 %tmp2		;  [#uses=1]
	%tmp5 = load i32* %tmp4, align 4		;  [#uses=1]
	%tmp7 = tail call i32 (i8*, ...)* @printf( i8* getelementptr ([7 x i8]* @.str, i32 0, i32 0), i32 %tmp5 ) nounwind 		;  [#uses=0]
	ret void
}

Now we have our functions ready for easy importing.

Phase 2: LLVM C++ API

When I did this experiment, I decided that the function that, given an array of ints, would drive the LLVM, wasn’t necessary. After all, all I was concerned was if the output of that process would actually work. So instead, I hand wrote a function that uses the LLVM C++ api the same way it would be if this were dynamic:

Function* create(Module** out) {
  std::string error;
  Module* jit;

  // Load in the bitcode file containing the functions for each
  // bytecode operation.
  if(MemoryBuffer* buffer = MemoryBuffer::getFile("ops.o", &error)) {
    jit = ParseBitcodeFile(buffer, &error);
    delete buffer;
  }

  // Pull out references to them.
  Function* set =  jit->getFunction(std::string("set"));
  Function* add =  jit->getFunction(std::string("add"));
  Function* show = jit->getFunction(std::string("show"));

  // Now, begin building our new function, which calls the
  // above functions.
  Function* body = cast<Function>(jit->getOrInsertFunction("body",
        Type::VoidTy,
        PointerType::getUnqual(Type::Int32Ty),
        PointerType::getUnqual(Type::Int32Ty), (Type*)0));

  // Our function will be passed the ops pointer and the
  // registers pointer, just like before.
  Function::arg_iterator args = body->arg_begin();
  Value* ops = args++;
  ops->setName("ops");
  Value* registers = args++;
  registers->setName("registers");

  BasicBlock *bb = BasicBlock::Create("entry", body);

  // Set up our arguments to be passed to set.
  std::vector<Value*> params;
  params.push_back(ops);
  params.push_back(registers);

  // Call out to set, passing ops and registers down
  CallInst* call = CallInst::Create(set, params.begin(), params.end(), "", bb);
  ConstantInt* const_3 = ConstantInt::get(APInt(32,  "3", 10));
  ConstantInt* const_4 = ConstantInt::get(APInt(32,  "4", 10));

  // add 3 to the ops pointer.
  GetElementPtrInst* ptr1 = GetElementPtrInst::Create(ops, const_3, "tmp3", bb);

  // Setup and call add, notice we pass down the updated ops pointer
  // rather than the original, so that we've moved down.
  std::vector<Value*> params2;
  params2.push_back(ptr1);
  params2.push_back(registers);
  CallInst* call2 = CallInst::Create(add, params2.begin(), params2.end(), "", bb);

  // Push the ops pointer down another 4.
  GetElementPtrInst* ptr2 = GetElementPtrInst::Create(ops, const_4, "tmp3", bb);

  // Setup and call show.
  std::vector<Value*> params3;
  params3.push_back(ptr2);
  params3.push_back(registers);
  CallInst* call3 = CallInst::Create(show, params3.begin(), params3.end(), "", bb);

  // And we're done!
  ReturnInst::Create(bb);

  *out = jit;
  return body;
}

Now, lets write a function that calls create() and executes the result:

int main() {
  // The registers.
  int registers[2] = {0, 0};

  // Our program.
  int program[20] = {0, 0, 3,
                     1, 0, 0, 4,
                     2, 0};

  int* ops = (int*)program;

  // Create our function and give us the Module and Function back.
  Module* jit;
  Function* func = create(&jit);

  // Add in optimizations. These were taken from a list that 'opt', LLVMs optimization tool, uses.
  PassManager p;

  /* Comment out optimize
  p.add(new TargetData(jit));
  p.add(createVerifierPass());
  p.add(createLowerSetJmpPass());
  p.add(createRaiseAllocationsPass());
  p.add(createCFGSimplificationPass());
  p.add(createPromoteMemoryToRegisterPass());
  p.add(createGlobalOptimizerPass());
  p.add(createGlobalDCEPass());
  p.add(createFunctionInliningPass());
  */

  // Run these optimizations on our Module
  p.run(*jit);

  // Setup for JIT
  ExistingModuleProvider* mp = new ExistingModuleProvider(jit);
  ExecutionEngine* engine = ExecutionEngine::create(mp);

  // Show us what we've created!
  std::cout << "Created\n" << *jit;

  // Have our function JIT'd into machine code and return. We cast it to a particular C function pointer signature so we can call in nicely.
  void (*fp)(int*, int*) = (void (*)(int*, int*))engine->getPointerToFunction(func);

  // Call what we've created!
  fp(ops, registers);
}

We drive our create() function and then execute the result. As you can see, we’ve commented out all optimizations as a first try. The output from running this is:

<snip same LLVM as before>

define void @body(i32* %ops, i32* %registers) {
entry:
	call void @set( i32* %ops, i32* %registers )
	%tmp3 = getelementptr i32* %ops, i32 3		;  [#uses=1]
	call void @add( i32* %tmp3, i32* %registers )
	%tmp31 = getelementptr i32* %ops, i32 4		;  [#uses=1]
	call void @show( i32* %tmp31, i32* %registers )
	ret void
}
=> 7

Hey! Look at that! It runs! And we can see what it ran. We see it perform the calls to our functions that implement each opcode. Going back, it would be easily to write a function that dynamically drivers the LLVM C++ API to generate this code, it’s simply one call per bytecode, mapped directly.
Even if that were all that LLVM let us do, it would be worth it, but…

Wait, there’s more!

Before, we ran without optimizations to keep the output simple, lets see what happens we turn them on:

define void @body(i32* %ops, i32* %registers) {
entry:
	%tmp1.i = getelementptr i32* %ops, i32 1		;  [#uses=1]
	%tmp2.i = load i32* %tmp1.i, align 4		;  [#uses=1]
	%tmp4.i = getelementptr i32* %ops, i32 2		;  [#uses=1]
	%tmp5.i = load i32* %tmp4.i, align 4		;  [#uses=1]
	%tmp7.i = getelementptr i32* %registers, i32 %tmp2.i		;  [#uses=1]
	store i32 %tmp5.i, i32* %tmp7.i, align 4
	%tmp3 = getelementptr i32* %ops, i32 3		;  [#uses=3]
	%tmp1.i7 = getelementptr i32* %tmp3, i32 1		;  [#uses=1]
	%tmp2.i8 = load i32* %tmp1.i7, align 4		;  [#uses=1]
	%tmp4.i9 = getelementptr i32* %tmp3, i32 2		;  [#uses=1]
	%tmp5.i10 = load i32* %tmp4.i9, align 4		;  [#uses=1]
	%tmp7.i11 = getelementptr i32* %registers, i32 %tmp5.i10		;  [#uses=1]
	%tmp8.i = load i32* %tmp7.i11, align 4		;  [#uses=1]
	%tmp10.i = getelementptr i32* %tmp3, i32 3		;  [#uses=1]
	%tmp11.i = load i32* %tmp10.i, align 4		;  [#uses=1]
	%tmp12.i = add i32 %tmp11.i, %tmp8.i		;  [#uses=1]
	%tmp14.i = getelementptr i32* %registers, i32 %tmp2.i8		;  [#uses=1]
	store i32 %tmp12.i, i32* %tmp14.i, align 4
	%tmp31 = getelementptr i32* %ops, i32 4		;  [#uses=1]
	%tmp1.i2 = getelementptr i32* %tmp31, i32 1		;  [#uses=1]
	%tmp2.i3 = load i32* %tmp1.i2, align 4		;  [#uses=1]
	%tmp4.i4 = getelementptr i32* %registers, i32 %tmp2.i3		;  [#uses=1]
	%tmp5.i5 = load i32* %tmp4.i4, align 4		;  [#uses=1]
	%tmp7.i6 = call i32 (i8*, ...)* @printf( i8* getelementptr ([7 x i8]* @.str, i32 0, i32 0), i32 %tmp5.i5 ) nounwind 		;  [#uses=0]
	ret void
}
=> 7

WOW! Now we’re cooking with hot flaming nuclear power! LLVM has the extra mile and rather than just calling our functions that implement each instruction, it’s inlined all that code directly into our dynamically generated function! This means A LOT of additional overhead is discarded because, since our functions themselves did simple operations like store into memory or add 2 numbers, that code is now run a lot more quickly.

It’s commonly known that inlining can dramatically improve performance because it eliminates the over head of function calls (spilling and reload registers, stack frames, etc). And LLVM has just given us a powerful form of that for free.
This doesn’t allow for inlining across things like the kind of method call that Ruby does, but it puts us on the track to being able to feed LLVM enough information to do just that.

LLVM is an amazing piece of software. I can’t wait to start using it more.

Written by evanphx

May 23, 2008 at 6:03 pm

Posted in CS, LLVM

Rails on Rubinius

with 70 comments

We hit a major milestone tonight. As most people know, we’ve been working to run Rails on Rubinius by RailsConf to have something to show off, even if it’s pretty slow.

Well, I’m super proud to say that tonight, rails served up both static and dynamic pages under Rubinius. Previous to tonight, we’d been blocked just trying to get Rails to even load. I decided to just try loading it up and bang on it enough to get it up and going.

In a scary way, it didn’t take very much code. Which meant we were very close already.

It’s pretty late, so I’m going to keep this short. Big thanks to everyone who’s contributed to Rubinius and had faith in us. Enormous thanks to Engine Yard, without whom I don’t know if we’d been able to reach this amazing height.

More updates to come…

Written by evanphx

May 17, 2008 at 1:22 am

Posted in rubinius

Welcome to the Club

leave a comment »

I’ve like to formally welcome the maglev development team over at Gemstone to the Ruby environment club.

For those of you that haven’t yet heard of maglev, it’s a brand new Ruby VM being developed by the folks over at Gemstone. Gemstone is the makers of probably the most advanced object-oriented database used today, and have traditionally been a Smalltalk shop till recently.
With the tide rising on Ruby, I’m happy to see another player enter the field. This only means that Ruby is continuing to mature and see that the community is healthy.

I was personally excited to read an interview with Bob Walker and Avi Bryant concerning maglev, because Rubinius is mentioned more than a few times. They’re looking at Rubinius for a couple of reasons. For one, the RubySpec suite we’ve developing and are about to spin off. The more people that we see using the suite and depending on it, the more mature it will become. Not having a spec for Ruby is commonly touted as a reason that it’s a toy, immature language, and anything we can do to dispel that thinking is good for the community.

The other reason that I’m excited about maglev is that they’re taking a very similar approach to the problem of building a Ruby environment. Like Rubinius, the VM is minimal and most of the kernel is implemented in Ruby.
My hope is that the kernel of Rubinius can be refactored and developed to be generic enough for other environments to use. While I know little about maglev’s current environment, they’re a natural build off the work in the Rubinius kernel. I’d hate to see people develop the code functionality of a ruby environment yet again (I count 5 code bases to this effect currently: MRI, JRuby, Ruby.NET, IronRuby, and Rubinius).

Being able to use a generic Ruby kernel is not unique to a smalltalk style VM. With some luck, it could be used by the folks in other environments as well. In my eyes, this is a big win for everyone. For one, this would mean a common code base that consists of the primary Ruby functionality, and thus would mean a vastly reduced worry of fragmentation. Plus it would alleviate the need for this code to be written again, letting future environment developers focus on taking Ruby to the next level in terms of platform integration, performance, etc.

Written by evanphx

May 3, 2008 at 11:00 pm

Posted in rubinius, ruby