CPP work branch change
Hi everyone. I’m super happy to announce that we’ve gotten the C++ branch stable enough that we’re making it the default branch. This means that those of you with existing clones will likely need to do a little work to get them sane, though.
Here is what was done:
- The old master branch was renamed shotgun.
- The cpp branch was copied to the name master.
- The cpp branch was then deleted.
Anyone that has up to now been working on the cpp branch has a couple of options.
- Delete your clone and re-clone. This is the easiest. The default checkout will be code in the cpp branch and you’re off and going.
- Fix up your current repo. I did this by doing the following commands:
git checkout master
git reset --hard origin/master
git branch -D cpp
This will get your local master branch repointed and properly checked out. In addition, the old cpp local branch can be deleted.
Hopefully no one experiences much pain due to this change. It’s been a long time coming and I’m really excited.
If you do run into problems, post a comment or stop on by IRC and we’ll work it out for ya.
Thanks!
Rubinius Status
Hey folks, sorry for the quietness here. Thought I’d fill everyone in on the current status of Rubinius.
We (the Rubinius team) have been hard at work on a couple of fronts:
- a new C++ VM: the team has been hard at work getting a new VM up and running, back to the level we had the old system at. Our output has slowed a little bit as the rest of the team has gotten up to speed on this new code base. Now, I’m sure you’re wondering why we’ve begun working on a new VM. Well, there are a few reasons:
- Better organized. We’ve learned a lot in the building of the last VM about how to structure things. For instance, using C++ lets us model Ruby classes as C++ classes, providing the VM with the same familiar structure and execution as their Ruby counterparts. This lowers the barrier for understanding and using the code base.
- Better tested. The old VM, I’m ashamed to say, had no unit tests. From day one of the new VM, we’ve been writing unit and integration tests. This has helped us a lot to keep the code base under control, as everyone who writes unit tests knows.
- More potential. One of the big changes is keeping a lot of parts of the system open. For example, we’re actively experimenting with using LLVM to speed up method execution a lot. The old code base, with no tests, was quite tangled and sadly didn’t provide any easy way forward for a lot of experiments we wanted to do.
- We’ve been working a lot with Ryan Davis’s ruby_parser project lately. We’re actively looking to use that code base as Rubinius’ internal parser. This is towards our goal of more Ruby code, but as anyone who’s written a parser will tell ya, it can be a real pain.
Ryan has made great progress getting it working and integrating it with Rubinius’ Compiler.
- Conferences: Wilson Bilkovich was just in Berlin, talking technical about Rubinius at RailsConf Europe. I’m here in Austin, at Lone Star Ruby Conf, finishing up my keynote that I’ll be giving later today.
The whole team will be at RubyConf in November, and a few are likely going to OOPSLA as well.
I know that a lot of people are eagerly anticipating Rubinius, and I want to thank you all for your patience with me and the rest of the team.
This is a dream project, and turns out to be pretty darn hard and a lot of work. I’ve made the mistake in the past of talking about when I think that we’ll release 1.0. Something I don’t think I properly understood a year ago was the level that people are looking for a 1.0 to operate at. If we released a 1.0 that was 10x slower than MRI, we’d probably be in pretty tough shape.
At the same time, I know people want to play with Rubinius. We fell off doing monthly releases a while back, and we’re going to start getting back on that soon. Hopefully that will give people more insight into the project’s progress.
Again, thanks to Engine Yard and everyone else for wonderful support.
The Rubinius Summer
Hi everyone. Been too long since the last update, so wanted to get everyone up to speed.
Rubinius
Things have been a little quiet on the Rubinius front, as I’m sure a lot of you have noticed. We’re still hard at work, currently getting the new C++ VM into shape.
This new C++ VM fixes a lot of fundamental problems the shotgun VM had (type safety, expression ordering, etc), which is a major reason we’re migrating our work to it.
Things have been a little quieter, commit wise, as the rest of the team gets up to speed on the new VM that I’ve been working on. Shotgun has been put into maintenance mode, with updates to the current main coming mainly in the form of bug fixes to the kernel.
I know everyone will love the advances we’re making in the new VM, from more performance to fewer crashes to better code organization.
Comic-Con
I’ve just returned from Comic-Con, having spent 4 days in the sun down in San Diego with the rest of nerdom.
It was quite a fun con, though quite tiring. We managed to get into some good panels, but didn’t make it into the Heroes and Lost panels. Seems you had to arrive at 6am to even attempt to get a seat. The line to get in was literally a mile long (no really, I’m not kidding.)
Conferences
Since we’ve been hard at work trying to get Rubinius to 1.0, I haven’t done too many conferences this summer. The next one I’ll be at is Lone Star Ruby Conf down in Austin. Should be fun, I’ve never been to Austin before and people seem to like the city.
Rubinius version 0.9.0 released!
I’m super proud to say that we’ve released version 0.9.0. It’s a snapshot of the work we’ve already been doing, but we’re trying to formalize our releases a bit more.
We’re going to be doing another release, 0.10, next month, as well. We’re working to do more releases, more often.
CS Nerds Anonymous
I’m also holding a last minute addition to the RailsConf schedule, CS Nerds Anonymous.
I’ll be chairing, getting the conversation started, basically being the ring leader. My hope is that people show up with fun topics to discuss, and we’ll take it from there. There is a good chance that we’ll have some lightning talks too. Again, doesn’t have to be Rails or even Ruby specific. Just fun topics that you think your fellow CS nerds would enjoy!
So wake up early on Sunday and come hang out! Bring your opinions on everything from optional type annotations in Ruby to Erlang monads!
Code Drive at RailsConf
A heads up that I’ll be participating in the RailsConf Community CodeDrive, hacking on Rubinius.
Come hang out with me and other project hackers and share the love!
Bring your questions, quandaries, criticisms, and code!
See you Thursday morning at 10am!
Simple VM JIT with LLVM
I’ve been investigating using LLVM for Rubinius, so I’ve been doing some very small scale experiments. I typically do this on most projects, to get a mental handle on the problem.
In doing this, I’ve written a very tiny VM to play with how LLVM handles it. Here is the breakdown of the entire VM:
- Only operates on ints.
- Uses numbered registers for operations.
- 3 Instructions:
- set(reg, val) Set register number reg to integer value val
- add(result, reg, val) Add register reg to val and put the result in result
- show(reg) Print out the contents of register reg
Pretty trivial. Not Turing complete because there is no branching. But it’s simple enough to explore how to handle bytecode in LLVM.
Sample Program
So, I’ve written a sample program, encoded directly as bytecode:
[ 0, 0, 3,
1, 0, 0, 4,
2, 0 ]
This is:
- Set register 0 to 3
- Add register 0 to 4 and put the result in register 0
- Show register 0
Again, totally trivial. Should just output 7.
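To make the encoding concrete, here is a minimal sketch of a decoder for this bytecode. It assumes the opcode numbers implied by the sample (0 = set, 1 = add, 2 = show) and that each instruction is the opcode followed by its operands; the names `instruction_length` and `disassemble` are made up for this illustration and aren't part of the experiment.

```c
#include <assert.h>
#include <stdio.h>

/* Opcode numbers as used in the sample program. The enum names are
   labels chosen for this sketch. */
enum { OP_SET = 0, OP_ADD = 1, OP_SHOW = 2 };

/* Number of ints an instruction occupies: the opcode plus its operands.
   Returns 0 for an unknown opcode. */
int instruction_length(int opcode) {
    switch (opcode) {
    case OP_SET:  return 3;  /* set(reg, val) */
    case OP_ADD:  return 4;  /* add(result, reg, val) */
    case OP_SHOW: return 2;  /* show(reg) */
    default:      return 0;
    }
}

/* Print one line per instruction and return how many were decoded.
   Stops early at an unknown opcode or a truncated instruction. */
int disassemble(const int* ops, int count) {
    assert(ops != NULL && count >= 0);
    int decoded = 0;
    int i = 0;
    while (i < count) {
        int len = instruction_length(ops[i]);
        if (len == 0 || i + len > count) break;
        switch (ops[i]) {
        case OP_SET:
            printf("set  r%d, %d\n", ops[i + 1], ops[i + 2]);
            break;
        case OP_ADD:
            printf("add  r%d, r%d, %d\n", ops[i + 1], ops[i + 2], ops[i + 3]);
            break;
        case OP_SHOW:
            printf("show r%d\n", ops[i + 1]);
            break;
        }
        i += len;
        decoded++;
    }
    return decoded;
}
```

Calling `disassemble` on the nine ints of the sample program walks three instructions, starting at offsets 0, 3, and 7.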
C Switch
So, the VM Design 101 way is to build a C switch statement to interpret each bytecode and perform its operations. It would look like this, given that int ops[] contains the above integer sequence for the program:
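#include <stdio.h>

// add(result, reg, val): registers[result] = registers[reg] + val
void add(int* ops, int* registers) {
  registers[ops[1]] = registers[ops[2]] + ops[3];
}

// set(reg, val): registers[reg] = val
void set(int* ops, int* registers) {
  registers[ops[1]] = ops[2];
}

// show(reg): print registers[reg]
void show(int* ops, int* registers) {
  printf("=> %d\n", registers[ops[1]]);
}

void run(int* ops, int* registers) {
  // Keep dispatching until we hit show, which ends the program.
  for(;;) {
    switch(*ops) {
    case 0:
      set(ops, registers);
      ops += 3;
      break;
    case 1:
      add(ops, registers);
      ops += 4;
      break;
    case 2:
      show(ops, registers);
      return;
    default:
      // Unknown opcode, stop rather than run off the end.
      return;
    }
  }
}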
Very easy. We increment ops to move forward, using ops as a pointer to the current instruction. This is how every VM starts. But this code is very slow when compared to machine code because it obscures the execution flow from the CPU. And besides, this doesn’t use LLVM, the whole point of this post.
A static result…
As we look at the code that was run, we see that the program is set, add, then show. Let’s pretend for a sec that, given the above functions, we want to perform the same operations by hand. We’d write:
void my_program() {
  int registers[2] = {0, 0};
  int program[10] = { 0, 0, 3,
                      1, 0, 0, 4,
                      2, 0 };
  int* ops = program;
  set(ops, registers);
  ops += 3;
  add(ops, registers);
  ops += 4;
  show(ops, registers);
  ops += 2;
}
So, my_program would perform the same operations as our bytecode above, and considerably faster because we avoid all the overhead of the switch statement.
Combining the 2
If you look at the bytecode, then at the hand written C, you can see that there is a programmatic way to go from our array of numbers to the C code.
A common approach that people have taken in the past is to actually write a C code emitter that would parse the integers once and spit out a .c file, which you’d compile, then run. This works ok for some situations, but it doesn’t allow for any kind of dynamic abilities to run code. And besides, it doesn’t use LLVM.
The idea is to write a function that takes as input the array of numbers and dynamically builds the equivalent of the above C function. And that’s where LLVM comes in.
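For illustration, the C emitter approach might look roughly like this. This is only a sketch: `emit_c`, `op_length`, and `op_name` are hypothetical names, and it assumes the emitted source would be compiled together with the set/add/show functions from earlier.

```c
#include <assert.h>
#include <stdio.h>

/* Ints occupied by each instruction, indexed by opcode
   (0 = set, 1 = add, 2 = show), and the C function implementing it. */
static const int op_length[] = {3, 4, 2};
static const char* const op_name[] = {"set", "add", "show"};

/* Write C source for a function called `name` that makes one call per
   bytecode instruction, passing a pointer to where that instruction
   starts. Returns the number of calls emitted, or -1 on an unknown
   opcode or truncated instruction (the output is then incomplete). */
int emit_c(const int* ops, int count, const char* name, FILE* out) {
    assert(ops != NULL && name != NULL && out != NULL);
    int emitted = 0;
    int i = 0;
    fprintf(out, "void %s(int* ops, int* registers) {\n", name);
    while (i < count) {
        int opcode = ops[i];
        if (opcode < 0 || opcode > 2) return -1;
        if (i + op_length[opcode] > count) return -1;
        fprintf(out, "    %s(ops + %d, registers);\n", op_name[opcode], i);
        i += op_length[opcode];
        emitted++;
    }
    fprintf(out, "}\n");
    return emitted;
}
```

For the sample program this emits three calls, at `ops + 0`, `ops + 3`, and `ops + 7`. Rather than incrementing ops between calls, the sketch bakes each offset into the call, which keeps the emitter to a single pass.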
Part 1: importing the functions
A key part to this scheme is to have the add/set/show functions we defined above available to LLVM. Doing so lets us use our normal tools to write each instruction as a small operation that can easily be tested. So, given that we have put those 3 functions into ops.c, we run:
llvm-gcc -emit-llvm -O3 -c ops.c
which generates ops.o as an LLVM bitcode file. Using llvm-dis < ops.o we see it contains:
@.str = internal constant [7 x i8] c"=> %d\0A\00" ; [#uses=1]
define void @add(i32* %ops, i32* %registers) nounwind {
entry:
%tmp1 = getelementptr i32* %ops, i32 1 ; [#uses=1]
%tmp2 = load i32* %tmp1, align 4 ; [#uses=1]
%tmp4 = getelementptr i32* %ops, i32 2 ; [#uses=1]
%tmp5 = load i32* %tmp4, align 4 ; [#uses=1]
%tmp7 = getelementptr i32* %registers, i32 %tmp5 ; [#uses=1]
%tmp8 = load i32* %tmp7, align 4 ; [#uses=1]
%tmp10 = getelementptr i32* %ops, i32 3 ; [#uses=1]
%tmp11 = load i32* %tmp10, align 4 ; [#uses=1]
%tmp12 = add i32 %tmp11, %tmp8 ; [#uses=1]
%tmp14 = getelementptr i32* %registers, i32 %tmp2 ; [#uses=1]
store i32 %tmp12, i32* %tmp14, align 4
ret void
}
define void @set(i32* %ops, i32* %registers) nounwind {
entry:
%tmp1 = getelementptr i32* %ops, i32 1 ; [#uses=1]
%tmp2 = load i32* %tmp1, align 4 ; [#uses=1]
%tmp4 = getelementptr i32* %ops, i32 2 ; [#uses=1]
%tmp5 = load i32* %tmp4, align 4 ; [#uses=1]
%tmp7 = getelementptr i32* %registers, i32 %tmp2 ; [#uses=1]
store i32 %tmp5, i32* %tmp7, align 4
ret void
}
declare i32 @printf(i8*, ...) nounwind
define void @show(i32* %ops, i32* %registers) nounwind {
entry:
%tmp1 = getelementptr i32* %ops, i32 1 ; [#uses=1]
%tmp2 = load i32* %tmp1, align 4 ; [#uses=1]
%tmp4 = getelementptr i32* %registers, i32 %tmp2 ; [#uses=1]
%tmp5 = load i32* %tmp4, align 4 ; [#uses=1]
%tmp7 = tail call i32 (i8*, ...)* @printf( i8* getelementptr ([7 x i8]* @.str, i32 0, i32 0), i32 %tmp5 ) nounwind ; [#uses=0]
ret void
}
Now we have our functions ready for easy importing.
Part 2: LLVM C++ API
When I did this experiment, I decided that the function that, given an array of ints, would drive LLVM wasn’t necessary. After all, all I was concerned with was whether the output of that process would actually work. So instead, I hand wrote a function that uses the LLVM C++ API the same way it would be used if this were dynamic:
Function* create(Module** out) {
std::string error;
Module* jit;
// Load in the bitcode file containing the functions for each
// bytecode operation.
if(MemoryBuffer* buffer = MemoryBuffer::getFile("ops.o", &error)) {
jit = ParseBitcodeFile(buffer, &error);
delete buffer;
}
// Pull out references to them.
Function* set = jit->getFunction(std::string("set"));
Function* add = jit->getFunction(std::string("add"));
Function* show = jit->getFunction(std::string("show"));
// Now, begin building our new function, which calls the
// above functions.
Function* body = cast<Function>(jit->getOrInsertFunction("body",
Type::VoidTy,
PointerType::getUnqual(Type::Int32Ty),
PointerType::getUnqual(Type::Int32Ty), (Type*)0));
// Our function will be passed the ops pointer and the
// registers pointer, just like before.
Function::arg_iterator args = body->arg_begin();
Value* ops = args++;
ops->setName("ops");
Value* registers = args++;
registers->setName("registers");
BasicBlock *bb = BasicBlock::Create("entry", body);
// Set up our arguments to be passed to set.
std::vector<Value*> params;
params.push_back(ops);
params.push_back(registers);
// Call out to set, passing ops and registers down
CallInst* call = CallInst::Create(set, params.begin(), params.end(), "", bb);
ConstantInt* const_3 = ConstantInt::get(APInt(32, "3", 10));
ConstantInt* const_4 = ConstantInt::get(APInt(32, "4", 10));
// add 3 to the ops pointer.
GetElementPtrInst* ptr1 = GetElementPtrInst::Create(ops, const_3, "tmp3", bb);
// Setup and call add, notice we pass down the updated ops pointer
// rather than the original, so that we've moved down.
std::vector<Value*> params2;
params2.push_back(ptr1);
params2.push_back(registers);
CallInst* call2 = CallInst::Create(add, params2.begin(), params2.end(), "", bb);
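// Push the ops pointer down another 4. Note that this offsets from the
// original ops (giving ops + 4), not from ptr1 (which would give ops + 7,
// where show actually starts). It still prints register 0 here only
// because ops[5] happens to be 0.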
GetElementPtrInst* ptr2 = GetElementPtrInst::Create(ops, const_4, "tmp3", bb);
// Setup and call show.
std::vector<Value*> params3;
params3.push_back(ptr2);
params3.push_back(registers);
CallInst* call3 = CallInst::Create(show, params3.begin(), params3.end(), "", bb);
// And we're done!
ReturnInst::Create(bb);
*out = jit;
return body;
}
Now, let’s write a function that calls create() and executes the result:
int main() {
// The registers.
int registers[2] = {0, 0};
// Our program.
int program[20] = {0, 0, 3,
1, 0, 0, 4,
2, 0};
int* ops = (int*)program;
// Create our function and give us the Module and Function back.
Module* jit;
Function* func = create(&jit);
// Add in optimizations. These were taken from a list that 'opt', LLVM's optimization tool, uses.
PassManager p;
/* Comment out optimize
p.add(new TargetData(jit));
p.add(createVerifierPass());
p.add(createLowerSetJmpPass());
p.add(createRaiseAllocationsPass());
p.add(createCFGSimplificationPass());
p.add(createPromoteMemoryToRegisterPass());
p.add(createGlobalOptimizerPass());
p.add(createGlobalDCEPass());
p.add(createFunctionInliningPass());
*/
// Run these optimizations on our Module
p.run(*jit);
// Setup for JIT
ExistingModuleProvider* mp = new ExistingModuleProvider(jit);
ExecutionEngine* engine = ExecutionEngine::create(mp);
// Show us what we've created!
std::cout << "Created\n" << *jit;
// Have our function JIT'd into machine code and return. We cast it to a particular C function pointer signature so we can call it nicely.
void (*fp)(int*, int*) = (void (*)(int*, int*))engine->getPointerToFunction(func);
// Call what we've created!
fp(ops, registers);
}
We drive our create() function and then execute the result. As you can see, we’ve commented out all optimizations as a first try. The output from running this is:
<snip same LLVM as before>
define void @body(i32* %ops, i32* %registers) {
entry:
call void @set( i32* %ops, i32* %registers )
%tmp3 = getelementptr i32* %ops, i32 3 ; [#uses=1]
call void @add( i32* %tmp3, i32* %registers )
%tmp31 = getelementptr i32* %ops, i32 4 ; [#uses=1]
call void @show( i32* %tmp31, i32* %registers )
ret void
}
=> 7
Hey! Look at that! It runs! And we can see what it ran. We see it perform the calls to our functions that implement each opcode. Going back, it would be easy to write a function that dynamically drives the LLVM C++ API to generate this code. It’s simply one call per bytecode, mapped directly.
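The one-call-per-bytecode mapping can be sketched without LLVM at all. The code below is not the LLVM API: it builds a plain list of function pointers and operand offsets, which stands in for the calls and pointer offsets a real generator would hand to LLVM. The names `planned_call`, `build_plan`, and `run_plan` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

typedef void (*op_fn)(int* ops, int* registers);

/* The three opcode implementations from earlier in the post. */
static void set(int* ops, int* registers)  { registers[ops[1]] = ops[2]; }
static void add(int* ops, int* registers)  { registers[ops[1]] = registers[ops[2]] + ops[3]; }
static void show(int* ops, int* registers) { printf("=> %d\n", registers[ops[1]]); }

/* Indexed by opcode: 0 = set, 1 = add, 2 = show. */
static const op_fn handlers[] = {set, add, show};
static const int op_length[] = {3, 4, 2};

/* One planned call: which handler to invoke, and the offset from the
   start of the bytecode where its instruction begins. */
struct planned_call {
    op_fn fn;
    int offset;
};

/* Walk the bytecode once and record one call per instruction.
   Returns the number of calls, or -1 on an unknown opcode, a
   truncated instruction, or a plan that is too small. */
int build_plan(const int* ops, int count, struct planned_call* plan, int max_calls) {
    assert(ops != NULL && plan != NULL);
    int calls = 0;
    int i = 0;
    while (i < count) {
        int opcode = ops[i];
        if (opcode < 0 || opcode > 2) return -1;
        if (i + op_length[opcode] > count || calls == max_calls) return -1;
        plan[calls].fn = handlers[opcode];
        plan[calls].offset = i;
        calls++;
        i += op_length[opcode];
    }
    return calls;
}

/* Execute the plan: straight-line calls, no opcode dispatch. */
void run_plan(const struct planned_call* plan, int calls, int* ops, int* registers) {
    for (int i = 0; i < calls; i++) {
        plan[i].fn(ops + plan[i].offset, registers);
    }
}
```

Note that the offsets are cumulative from the start of the bytecode: for the sample program they come out as 0, 3, and 7.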
Even if that were all that LLVM let us do, it would be worth it, but…
Wait, there’s more!
Before, we ran without optimizations to keep the output simple. Let’s see what happens when we turn them on:
define void @body(i32* %ops, i32* %registers) {
entry:
%tmp1.i = getelementptr i32* %ops, i32 1 ; [#uses=1]
%tmp2.i = load i32* %tmp1.i, align 4 ; [#uses=1]
%tmp4.i = getelementptr i32* %ops, i32 2 ; [#uses=1]
%tmp5.i = load i32* %tmp4.i, align 4 ; [#uses=1]
%tmp7.i = getelementptr i32* %registers, i32 %tmp2.i ; [#uses=1]
store i32 %tmp5.i, i32* %tmp7.i, align 4
%tmp3 = getelementptr i32* %ops, i32 3 ; [#uses=3]
%tmp1.i7 = getelementptr i32* %tmp3, i32 1 ; [#uses=1]
%tmp2.i8 = load i32* %tmp1.i7, align 4 ; [#uses=1]
%tmp4.i9 = getelementptr i32* %tmp3, i32 2 ; [#uses=1]
%tmp5.i10 = load i32* %tmp4.i9, align 4 ; [#uses=1]
%tmp7.i11 = getelementptr i32* %registers, i32 %tmp5.i10 ; [#uses=1]
%tmp8.i = load i32* %tmp7.i11, align 4 ; [#uses=1]
%tmp10.i = getelementptr i32* %tmp3, i32 3 ; [#uses=1]
%tmp11.i = load i32* %tmp10.i, align 4 ; [#uses=1]
%tmp12.i = add i32 %tmp11.i, %tmp8.i ; [#uses=1]
%tmp14.i = getelementptr i32* %registers, i32 %tmp2.i8 ; [#uses=1]
store i32 %tmp12.i, i32* %tmp14.i, align 4
%tmp31 = getelementptr i32* %ops, i32 4 ; [#uses=1]
%tmp1.i2 = getelementptr i32* %tmp31, i32 1 ; [#uses=1]
%tmp2.i3 = load i32* %tmp1.i2, align 4 ; [#uses=1]
%tmp4.i4 = getelementptr i32* %registers, i32 %tmp2.i3 ; [#uses=1]
%tmp5.i5 = load i32* %tmp4.i4, align 4 ; [#uses=1]
%tmp7.i6 = call i32 (i8*, ...)* @printf( i8* getelementptr ([7 x i8]* @.str, i32 0, i32 0), i32 %tmp5.i5 ) nounwind ; [#uses=0]
ret void
}
=> 7
WOW! Now we’re cooking with hot flaming nuclear power! LLVM has gone the extra mile and, rather than just calling our functions that implement each instruction, it’s inlined all that code directly into our dynamically generated function! This means A LOT of overhead is discarded: since our functions themselves did simple operations like storing into memory or adding 2 numbers, that code now runs a lot more quickly.
It’s commonly known that inlining can dramatically improve performance because it eliminates the overhead of function calls (spilling and reloading registers, stack frames, etc). And LLVM has just given us a powerful form of that for free.
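To see what inlining buys in C terms, here is roughly what the three calls look like once their bodies are pasted in by hand. This is a hand-written sketch, not LLVM output; `body_inlined` is a made-up name, and it assumes the sample program layout where add starts at offset 3 and show at offset 7.

```c
#include <assert.h>
#include <stdio.h>

/* The set, add, and show bodies written inline, one after another.
   No calls are made except the printf that show needs. */
void body_inlined(int* ops, int* registers) {
    assert(ops != NULL && registers != NULL);

    /* set(reg, val): instruction starts at ops + 0 */
    registers[ops[1]] = ops[2];

    /* add(result, reg, val): instruction starts at ops + 3 */
    int* add_ops = ops + 3;
    registers[add_ops[1]] = registers[add_ops[2]] + add_ops[3];

    /* show(reg): instruction starts at ops + 7 */
    int* show_ops = ops + 7;
    printf("=> %d\n", registers[show_ops[1]]);
}
```

The loads from ops are still there, just as in the optimized IR, because the bytecode is only known at run time. What’s gone is the call and return around each operation.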
This doesn’t allow for inlining across things like the kind of method call that Ruby does, but it puts us on the track to being able to feed LLVM enough information to do just that.
LLVM is an amazing piece of software. I can’t wait to start using it more.