Simple VM JIT with LLVM
I’ve been investigating using LLVM for Rubinius, so I’ve been doing some very small scale experiments. I typically do this on most projects, to get a mental handle on the problem.
In doing this, I’ve written a very tiny VM to play with how LLVM handles it. Here is the breakdown of the entire VM:
- Only operates on ints.
- Uses numbered registers for operations.
- 3 Instructions:
- set(reg, val) Set register number reg to integer value val
- add(result, reg, val) Add register reg to val and put the result in result
- show(reg) Print out the contents of register reg
Pretty trivial. Not turing complete because there is no branching. But it’s simple enough to explore how to handle bytecode in LLVM.
Sample Program
So, I’ve written a sample program, encoded directly as bytecode:
[ 0, 0, 3,
1, 0, 0, 4,
2, 0 ]
This is:
- Set register 0 to 3
- Add register 0 to 4 and put the result in register 0
- Show register 0
Again, totally trivial. Should just output 7.
C Switch
So, the VM Design 101 way is to build a C switch statement to interpret each bytecode and perform it’s operations. It would look like this, given that int ops[] contains the above integer sequence for the program:
void add(int* ops, int* registers) {
registers[ops[1]] = registers[ops[2]] + ops[3];
}
void set(int* ops, int* registers) {
registers[ops[1]] = ops[2];
}
void show(int* ops, int* registers) {
printf("=> %d\n", registers[ops[1]]);
}
void run(int* ops, int* registers) {
switch(*ops) {
case 0:
set(ops, registers);
ops += 3;
break;
case 1:
add(ops, registers);
ops += 4;
break;
case 2:
show(ops, registers);
return;
}
}
Very easy. We increments ops to move forward, using ops as a pointer to the current instruction. This is how every VM starts. But this code is very slow when compared to machine code because it obscures the execution flow from the CPU. And besides, this doesn’t use LLVM, the whole point of this post.
A static result…
As we look at the code that was run, we see that the program is set, add, then show. Lets pretend for a sec that given the above functions, we want to perform the same operation, we’d write:
void my_program() {
int registers[2] = {0, 0};
int program[10] = [ 0, 0, 3,
1, 0, 0, 4,
2, 0 ]
int* ops = (int*)program;
set(ops, registers);
ops += 3;
add(ops, registers);
ops += 4;
show(ops, registers);
ops += 2;
}
So, my_program would perform the same operations as your bytecode above, and considerable faster because we avoid all the overhead the switch statement.
Combining the 2
If you look at the bytecode, than the at the hand written C, we can see that there there is a programmatic way to go from our array of numbers to the C code.
A common approach that people have taken in the past to actually write a C code emitter, that would parse the integers once, spit out a .c file, which you’d compile, then run. This works ok for some situations, but it doesn’t allow for any kind of dynamic abilities to run code. And besides, it doesn’t use LLVM.
The idea is to write a function thats takes as input the array of numbers and dynamically builds the equivalent of the above C function. And thats where LLVM comes in.
Part 1: importing the functions
A key part to this scheme is to have the add/set/show functions we defined above available to LLVM. Doing so lets us use our normal tools to write the each instruction as a small operation that can easily be tested. So, given that we have put those 3 functions into ops.c, we run:
llvm-gcc -emit-llvm -O3 -c ops.c, which generates ops.o as a LLVM bitcode file. Using llvm-dis < ops.o we see it contains:
@.str = internal constant [7 x i8] c"=> %dA0" ; [#uses=1]
define void @add(i32* %ops, i32* %registers) nounwind {
entry:
%tmp1 = getelementptr i32* %ops, i32 1 ; [#uses=1]
%tmp2 = load i32* %tmp1, align 4 ; [#uses=1]
%tmp4 = getelementptr i32* %ops, i32 2 ; [#uses=1]
%tmp5 = load i32* %tmp4, align 4 ; [#uses=1]
%tmp7 = getelementptr i32* %registers, i32 %tmp5 ; [#uses=1]
%tmp8 = load i32* %tmp7, align 4 ; [#uses=1]
%tmp10 = getelementptr i32* %ops, i32 3 ; [#uses=1]
%tmp11 = load i32* %tmp10, align 4 ; [#uses=1]
%tmp12 = add i32 %tmp11, %tmp8 ; [#uses=1]
%tmp14 = getelementptr i32* %registers, i32 %tmp2 ; [#uses=1]
store i32 %tmp12, i32* %tmp14, align 4
ret void
}
define void @set(i32* %ops, i32* %registers) nounwind {
entry:
%tmp1 = getelementptr i32* %ops, i32 1 ; [#uses=1]
%tmp2 = load i32* %tmp1, align 4 ; [#uses=1]
%tmp4 = getelementptr i32* %ops, i32 2 ; [#uses=1]
%tmp5 = load i32* %tmp4, align 4 ; [#uses=1]
%tmp7 = getelementptr i32* %registers, i32 %tmp2 ; [#uses=1]
store i32 %tmp5, i32* %tmp7, align 4
ret void
}
declare i32 @printf(i8*, ...) nounwind
define void @show(i32* %ops, i32* %registers) nounwind {
entry:
%tmp1 = getelementptr i32* %ops, i32 1 ; [#uses=1]
%tmp2 = load i32* %tmp1, align 4 ; [#uses=1]
%tmp4 = getelementptr i32* %registers, i32 %tmp2 ; [#uses=1]
%tmp5 = load i32* %tmp4, align 4 ; [#uses=1]
%tmp7 = tail call i32 (i8*, ...)* @printf( i8* getelementptr ([7 x i8]* @.str, i32 0, i32 0), i32 %tmp5 ) nounwind ; [#uses=0]
ret void
}
Now we have our functions ready for easy importing.
Phase 2: LLVM C++ API
When I did this experiment, I decided that the function that, given an array of ints, would drive the LLVM, wasn’t necessary. After all, all I was concerned was if the output of that process would actually work. So instead, I hand wrote a function that uses the LLVM C++ api the same way it would be if this were dynamic:
Function* create(Module** out) {
std::string error;
Module* jit;
// Load in the bitcode file containing the functions for each
// bytecode operation.
if(MemoryBuffer* buffer = MemoryBuffer::getFile("ops.o", &error)) {
jit = ParseBitcodeFile(buffer, &error);
delete buffer;
}
// Pull out references to them.
Function* set = jit->getFunction(std::string("set"));
Function* add = jit->getFunction(std::string("add"));
Function* show = jit->getFunction(std::string("show"));
// Now, begin building our new function, which calls the
// above functions.
Function* body = cast<Function>(jit->getOrInsertFunction("body",
Type::VoidTy,
PointerType::getUnqual(Type::Int32Ty),
PointerType::getUnqual(Type::Int32Ty), (Type*)0));
// Our function will be passed the ops pointer and the
// registers pointer, just like before.
Function::arg_iterator args = body->arg_begin();
Value* ops = args++;
ops->setName("ops");
Value* registers = args++;
registers->setName("registers");
BasicBlock *bb = BasicBlock::Create("entry", body);
// Set up our arguments to be passed to set.
std::vector<Value*> params;
params.push_back(ops);
params.push_back(registers);
// Call out to set, passing ops and registers down
CallInst* call = CallInst::Create(set, params.begin(), params.end(), "", bb);
ConstantInt* const_3 = ConstantInt::get(APInt(32, "3", 10));
ConstantInt* const_4 = ConstantInt::get(APInt(32, "4", 10));
// add 3 to the ops pointer.
GetElementPtrInst* ptr1 = GetElementPtrInst::Create(ops, const_3, "tmp3", bb);
// Setup and call add, notice we pass down the updated ops pointer
// rather than the original, so that we've moved down.
std::vector<Value*> params2;
params2.push_back(ptr1);
params2.push_back(registers);
CallInst* call2 = CallInst::Create(add, params2.begin(), params2.end(), "", bb);
// Push the ops pointer down another 4.
GetElementPtrInst* ptr2 = GetElementPtrInst::Create(ops, const_4, "tmp3", bb);
// Setup and call show.
std::vector<Value*> params3;
params3.push_back(ptr2);
params3.push_back(registers);
CallInst* call3 = CallInst::Create(show, params3.begin(), params3.end(), "", bb);
// And we're done!
ReturnInst::Create(bb);
*out = jit;
return body;
}
Now, lets write a function that calls create() and executes the result:
int main() {
// The registers.
int registers[2] = {0, 0};
// Our program.
int program[20] = {0, 0, 3,
1, 0, 0, 4,
2, 0};
int* ops = (int*)program;
// Create our function and give us the Module and Function back.
Module* jit;
Function* func = create(&jit);
// Add in optimizations. These were taken from a list that 'opt', LLVMs optimization tool, uses.
PassManager p;
/* Comment out optimize
p.add(new TargetData(jit));
p.add(createVerifierPass());
p.add(createLowerSetJmpPass());
p.add(createRaiseAllocationsPass());
p.add(createCFGSimplificationPass());
p.add(createPromoteMemoryToRegisterPass());
p.add(createGlobalOptimizerPass());
p.add(createGlobalDCEPass());
p.add(createFunctionInliningPass());
*/
// Run these optimizations on our Module
p.run(*jit);
// Setup for JIT
ExistingModuleProvider* mp = new ExistingModuleProvider(jit);
ExecutionEngine* engine = ExecutionEngine::create(mp);
// Show us what we've created!
std::cout << "Created\n" << *jit;
// Have our function JIT'd into machine code and return. We cast it to a particular C function pointer signature so we can call in nicely.
void (*fp)(int*, int*) = (void (*)(int*, int*))engine->getPointerToFunction(func);
// Call what we've created!
fp(ops, registers);
}
We drive our create() function and then execute the result. As you can see, we’ve commented out all optimizations as a first try. The output from running this is:
<snip same LLVM as before>
define void @body(i32* %ops, i32* %registers) {
entry:
call void @set( i32* %ops, i32* %registers )
%tmp3 = getelementptr i32* %ops, i32 3 ; [#uses=1]
call void @add( i32* %tmp3, i32* %registers )
%tmp31 = getelementptr i32* %ops, i32 4 ; [#uses=1]
call void @show( i32* %tmp31, i32* %registers )
ret void
}
=> 7
Hey! Look at that! It runs! And we can see what it ran. We see it perform the calls to our functions that implement each opcode. Going back, it would be easily to write a function that dynamically drivers the LLVM C++ API to generate this code, it’s simply one call per bytecode, mapped directly.
Even if that were all that LLVM let us do, it would be worth it, but…
Wait, there’s more!
Before, we ran without optimizations to keep the output simple, lets see what happens we turn them on:
define void @body(i32* %ops, i32* %registers) {
entry:
%tmp1.i = getelementptr i32* %ops, i32 1 ; [#uses=1]
%tmp2.i = load i32* %tmp1.i, align 4 ; [#uses=1]
%tmp4.i = getelementptr i32* %ops, i32 2 ; [#uses=1]
%tmp5.i = load i32* %tmp4.i, align 4 ; [#uses=1]
%tmp7.i = getelementptr i32* %registers, i32 %tmp2.i ; [#uses=1]
store i32 %tmp5.i, i32* %tmp7.i, align 4
%tmp3 = getelementptr i32* %ops, i32 3 ; [#uses=3]
%tmp1.i7 = getelementptr i32* %tmp3, i32 1 ; [#uses=1]
%tmp2.i8 = load i32* %tmp1.i7, align 4 ; [#uses=1]
%tmp4.i9 = getelementptr i32* %tmp3, i32 2 ; [#uses=1]
%tmp5.i10 = load i32* %tmp4.i9, align 4 ; [#uses=1]
%tmp7.i11 = getelementptr i32* %registers, i32 %tmp5.i10 ; [#uses=1]
%tmp8.i = load i32* %tmp7.i11, align 4 ; [#uses=1]
%tmp10.i = getelementptr i32* %tmp3, i32 3 ; [#uses=1]
%tmp11.i = load i32* %tmp10.i, align 4 ; [#uses=1]
%tmp12.i = add i32 %tmp11.i, %tmp8.i ; [#uses=1]
%tmp14.i = getelementptr i32* %registers, i32 %tmp2.i8 ; [#uses=1]
store i32 %tmp12.i, i32* %tmp14.i, align 4
%tmp31 = getelementptr i32* %ops, i32 4 ; [#uses=1]
%tmp1.i2 = getelementptr i32* %tmp31, i32 1 ; [#uses=1]
%tmp2.i3 = load i32* %tmp1.i2, align 4 ; [#uses=1]
%tmp4.i4 = getelementptr i32* %registers, i32 %tmp2.i3 ; [#uses=1]
%tmp5.i5 = load i32* %tmp4.i4, align 4 ; [#uses=1]
%tmp7.i6 = call i32 (i8*, ...)* @printf( i8* getelementptr ([7 x i8]* @.str, i32 0, i32 0), i32 %tmp5.i5 ) nounwind ; [#uses=0]
ret void
}
=> 7
WOW! Now we’re cooking with hot flaming nuclear power! LLVM has the extra mile and rather than just calling our functions that implement each instruction, it’s inlined all that code directly into our dynamically generated function! This means A LOT of additional overhead is discarded because, since our functions themselves did simple operations like store into memory or add 2 numbers, that code is now run a lot more quickly.
It’s commonly known that inlining can dramatically improve performance because it eliminates the over head of function calls (spilling and reload registers, stack frames, etc). And LLVM has just given us a powerful form of that for free.
This doesn’t allow for inlining across things like the kind of method call that Ruby does, but it puts us on the track to being able to feed LLVM enough information to do just that.
LLVM is an amazing piece of software. I can’t wait to start using it more.
So…. does this mean yet another change for the Rubinius VM? Is the recent C++ VM soon to be replaced?
rubyfan
23 May 08 at 6:23 pm
This isn’t instead of the C++ VM at all. Using LLVM to execute methods is actually only one small part of the over all VM, and we’re potentially using it with the C++ VM. I’ve had this kind of thing in the back of my mind for a while, so the C++ VM is actually setup so that this can be easily plugged in.
evanphx
23 May 08 at 6:26 pm
I’ve done some work on using LLVM with the VisualWorks Smalltalk VM, and I’ve been looking at Rubinius with an eye to doing exactly what you are suggesting.
“This doesn’t allow for inlining across things like the kind of method call that Ruby does” - actually, you can if you implement partial specialisation, which I think could generate enormous wins, especially if you specialise code that dispatches to immediates. It’s like a PIC, but with the code inlined at the call site rather than just caching the method lookup.
Antony Blakey
23 May 08 at 7:47 pm
Antony: Interesting! Is your work with VW available anywhere? Rubinius actually implements a MethodContext caching scheme I found described in a cincom paper on VW’s context cache.
evanphx
24 May 08 at 12:12 am
There are two parts to what I was doing.
1) Allowing methods to be written in C, inline. This used the clang project associated with LLVM, plus the LLVM jitter infrastructure. This would also be useful in Ruby. I published a WIP version of that work with VW but it’s not useful in this forum.
2) Using LLVM as you have described, to replace the existing VW jitter and ‘outsource’ most of the platform details to LLVM. Unfortunately you need to be a Cincom commercial customer to get the VM source (I am), and then there’s the problem of how to distribute it. This problem contributed to me putting two of my projects on hold (the other was embedding WebKit with full GC integration). This work wasn’t far enough along to produce results, although as you have seen, results can actually come very easily with LLVM.
You should post the x86 machine code generated from your exampl, because in my experience that’s where the OMG kicks in because those getelementptr lines disappear.
Over the last month I have been looking to restart this work in a Ruby context, because as you’ve discovered LLVM is absolutely the dogs bollocks, and it’s beautifully written and architected. It really could be like standing of the shoulders of giants as far as what it could do for Ruby performance, particularly with, as I mentioned, partial specialisation keyed off something like multimethod dispatch signatures.
One of the nice things that can be done is for nearly all of the work to be done in Ruby, rather than using C. Theoretically you can do the entire thing in LLVM + Ruby, although bootstrapping can be a curly problem.
I’d love to work on this, I’ve just got to get up to speed on the Rubinius architecture.
Antony Blakey
24 May 08 at 1:26 am
http://etoileos.com/news/archive/2008/05/12/1719/
Someone else is doing work with LLVM which will enable the use of Smalltalk as a first class citizen next to Objective C, a Smalltalk compiler that will be able to inherit Objective C classes. Maybe that work could be reused, if it ever gets done, for Ruby, as the Ruby object model is quite like the one in Smalltalk.
anon
24 May 08 at 2:58 am
[...] Simple VM JIT with LLVM I’ve been investigating using LLVM for Rubinius, so I’ve been doing some very small scale experiments. I [...] [...]
Top Posts « WordPress.com
25 May 08 at 5:03 pm
[...] Buradan esinlenerek yazdigim basit bir sanal makine. Sanal makinenin kendisi C++ ile yazildi. Assembler’i ise python ile. [...]
» Basit bir sanal makine yonetim danışmanlığı ekonomi sermet sandıkcı
28 May 08 at 1:23 pm
Evan, definitely check out what David Chisnall is doing with the Étoilé project. A unified Objective-C, Smalltalk, and Ruby runtime built on top of LLVM is pretty close to Nirvana (the state, not the band). Throw Javascript into the mix and we’re pretty close to heaven.
My work on Ruby and LLVM got sidetracked when I read Ian Piumarta’s papers about Cola (http://piumarta.com/software/cola/). Especially check out the minimal object model (http://piumarta.com/software/cola/). Now I’m spending my time on just enough icing on top of LLVM to build multiple dynamic object models.
How are you planning on implementing closures? Garbage collected stack frames and CPS? Also, the GC interface in LLVM is pretty sweet, you can tag roots with meta info when you create them and have this passed to your GC at collection time.
Tomy
8 Jun 08 at 7:56 pm
I put together Ruby LLVM bindings that make experimenting with this sort of thing faster/more fun. Certainly beats waiting for C++ build and link. I’ve blogged some examples of using the extension at http://llvmruby.org and the source lives at http://github.com/tombagby/llvmruby
Tom Bagby
7 Sep 08 at 10:01 pm
Greetings Evan,
I am very, very surprised that you have left a message for me.
Although I was talking about your examples and LLVM and Rubinius,
most of contents were written in Chinese. I didn’t expect there
would be any reader who’s first language wasn’t Chinese.
So, I am very surprised and glad to see your message.
If you were interested in some content of that blog post,
I could translate some of them to English for you,
if you don’t mind my poor, strange English sentence.
*
Let’s get back to your examples. At first, I couldn’t find a way to
compile the examples. After searching header files and llvm-config,
the examples ran fine. I was glad to reach here.
But I thought it’s a bit strange that your “create” function which
creates JIT module, was skipping “function label”, lacking a switch
case to find out which function it should call this time.
By talking about “function label”, I mean the 0, 1, 2 in our source
program. 0 stands for set, 1 stands for add, and 2 stands for show.
Then I started to add the switch case in your create function,
thinking it should let me rewrite our source program to such as:
int program[] = {
0, 0, 5, // res[0] = 5
2, 0, // puts res[0]
1, 0, 0, 4, // res[0] = res[0] + 4
2, 0, // puts res[0]
1, 1, 0, 7, // res[1] = res[0] + 7
2, 1 // puts res[1]
};
Problems occurred here.
Segmentation fault, or strange result, e.g. -2153546.
blah blah blah… time passed.
I started to think perhaps there would be a bug in your examples.
The body bytecode (or bitcode?) emitted by llvm, i.e.,
define void @body(i32* %ops, i32* %registers) {
entry:
call void @set( i32* %ops, i32* %registers )
%tmp3 = getelementptr i32* %ops, i32 3 ; [#uses=1]
call void @add( i32* %tmp3, i32* %registers )
%tmp31 = getelementptr i32* %ops, i32 4 ; [#uses=1]
call void @show( i32* %tmp31, i32* %registers )
ret void
}
I thought it’s very strange that the second getelementptr told
%ops to step 4, just like “ops + 4″ in C. Then it passed %tmp31,
i.e. “ops + 4″ to show. Shouldn’t it be “ops + 3 + 4″ ?
Then I took out the line:
GetElementPtrInst* ptr2 = GetElementPtrInst::Create(ops, const_4, “tmp3″, bb);
and changed it to:
GetElementPtrInst* ptr2 = GetElementPtrInst::Create(ptr1, const_4, “tmp3″, bb);
save, compile, and it runs fine. Of course it’s not the exactly
modification I made, because I built a loop and a switch case to
create many CallInst(s). The exactly lines were:
const_n = ConstantInt::get(APInt(32, lexical_cast<std::string>(step), 10));
ptr = GetElementPtrInst::Create(params.back(), const_n, “tmp”, bb);
You can find the source code at the bottom of previous post.
[LLVM & Rubinius]
Many thanks for your listening and kindness!
cheers,
Lin Jen-Shin (a.k.a. godfat 真常)
p.s. below is another copy of above message.
http://blogger.godfat.org/2008/10/llvm-rubinius-2.html
Lin Jen-Shin (godfat 真常)
31 Oct 08 at 8:30 am