Simple VM JIT with LLVM

I’ve been investigating using LLVM for Rubinius, so I’ve been doing some very small scale experiments. I typically do this on most projects, to get a mental handle on the problem.

In doing this, I’ve written a very tiny VM to play with how LLVM handles it. Here is the breakdown of the entire VM:

  • Only operates on ints.
  • Uses numbered registers for operations.
  • 3 Instructions:
    1. set(reg, val) Set register number reg to integer value val
    2. add(result, reg, val) Add register reg to val and put the result in result
    3. show(reg) Print out the contents of register reg

Pretty trivial. Not Turing complete, because there is no branching. But it’s simple enough to explore how to handle bytecode in LLVM.

Sample Program

So, I’ve written a sample program, encoded directly as bytecode:

  [ 0, 0, 3,
    1, 0, 0, 4,
    2, 0 ]

This is:

  • Set register 0 to 3
  • Add register 0 to 4 and put the result in register 0
  • Show register 0

Again, totally trivial. Should just output 7.

C Switch

So, the VM Design 101 way is to build a C switch statement that interprets each bytecode and performs its operations. It would look like this, given that int ops[] contains the above integer sequence for the program:

void add(int* ops, int* registers) {
  registers[ops[1]] = registers[ops[2]] + ops[3];
}

void set(int* ops, int* registers) {
  registers[ops[1]] = ops[2];
}

void show(int* ops, int* registers) {
  printf("=> %d\n", registers[ops[1]]);
}

void run(int* ops, int* registers) {
  for(;;) {
    switch(*ops) {
    case 0:
      set(ops, registers);
      ops += 3;
      break;
    case 1:
      add(ops, registers);
      ops += 4;
      break;
    case 2:
      show(ops, registers);
      return;
    }
  }
}

Very easy. We increment ops to move forward, using ops as a pointer to the current instruction. This is how every VM starts. But this code is quite slow compared to machine code because it obscures the execution flow from the CPU. And besides, it doesn’t use LLVM, which is the whole point of this post.
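For completeness, here is a minimal driver for the interpreter above. This is a sketch of my own rather than code from the original experiment, and it assumes run(), set(), add() and show() all live in the same file:

#include <stdio.h>

int main() {
  // Two registers, both starting at 0.
  int registers[2] = {0, 0};

  // The sample program: set register 0 to 3, add 4 to it, show it.
  int program[] = {0, 0, 3,
                   1, 0, 0, 4,
                   2, 0};

  // run() walks the bytecode until it hits a show instruction.
  run(program, registers);
  return 0;
}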

A static result…

As we look at the code that was run, we see that the program is set, add, then show. Let’s pretend for a second that, given the above functions, we want to perform the same operations by hand; we’d write:

void my_program() {
  int registers[2] = {0, 0};
  int program[] = { 0, 0, 3,
                    1, 0, 0, 4,
                    2, 0 };

  int* ops = (int*)program;

  set(ops, registers);
  ops += 3;
  add(ops, registers);
  ops += 4;
  show(ops, registers);
  ops += 2;
}

So, my_program would perform the same operations as our bytecode above, and considerably faster, because we avoid all the overhead of the switch statement.

Combining the 2

If you look at the bytecode, then at the hand-written C, you can see that there is a programmatic way to go from our array of numbers to the C code.

A common approach people have taken in the past is to write a C code emitter that parses the integers once and spits out a .c file, which you’d then compile and run. This works OK for some situations, but it doesn’t allow for any kind of dynamic ability to run code. And besides, it doesn’t use LLVM.

The idea is to write a function that takes as input the array of numbers and dynamically builds the equivalent of the above C function. And that’s where LLVM comes in.

Part 1: importing the functions

A key part of this scheme is to have the add/set/show functions we defined above available to LLVM. Doing so lets us use our normal tools to write each instruction as a small operation that can easily be tested. So, given that we have put those 3 functions into ops.c, we run llvm-gcc -emit-llvm -O3 -c ops.c, which generates ops.o as an LLVM bitcode file. Using llvm-dis < ops.o we see it contains:

@.str = internal constant [7 x i8] c"=> %d\0A\00"		;  [#uses=1]

define void @add(i32* %ops, i32* %registers) nounwind  {
entry:
	%tmp1 = getelementptr i32* %ops, i32 1		;  [#uses=1]
	%tmp2 = load i32* %tmp1, align 4		;  [#uses=1]
	%tmp4 = getelementptr i32* %ops, i32 2		;  [#uses=1]
	%tmp5 = load i32* %tmp4, align 4		;  [#uses=1]
	%tmp7 = getelementptr i32* %registers, i32 %tmp5		;  [#uses=1]
	%tmp8 = load i32* %tmp7, align 4		;  [#uses=1]
	%tmp10 = getelementptr i32* %ops, i32 3		;  [#uses=1]
	%tmp11 = load i32* %tmp10, align 4		;  [#uses=1]
	%tmp12 = add i32 %tmp11, %tmp8		;  [#uses=1]
	%tmp14 = getelementptr i32* %registers, i32 %tmp2		;  [#uses=1]
	store i32 %tmp12, i32* %tmp14, align 4
	ret void
}

define void @set(i32* %ops, i32* %registers) nounwind  {
entry:
	%tmp1 = getelementptr i32* %ops, i32 1		;  [#uses=1]
	%tmp2 = load i32* %tmp1, align 4		;  [#uses=1]
	%tmp4 = getelementptr i32* %ops, i32 2		;  [#uses=1]
	%tmp5 = load i32* %tmp4, align 4		;  [#uses=1]
	%tmp7 = getelementptr i32* %registers, i32 %tmp2		;  [#uses=1]
	store i32 %tmp5, i32* %tmp7, align 4
	ret void
}

declare i32 @printf(i8*, ...) nounwind 

define void @show(i32* %ops, i32* %registers) nounwind  {
entry:
	%tmp1 = getelementptr i32* %ops, i32 1		;  [#uses=1]
	%tmp2 = load i32* %tmp1, align 4		;  [#uses=1]
	%tmp4 = getelementptr i32* %registers, i32 %tmp2		;  [#uses=1]
	%tmp5 = load i32* %tmp4, align 4		;  [#uses=1]
	%tmp7 = tail call i32 (i8*, ...)* @printf( i8* getelementptr ([7 x i8]* @.str, i32 0, i32 0), i32 %tmp5 ) nounwind 		;  [#uses=0]
	ret void
}

Now we have our functions ready for easy importing.

Part 2: the LLVM C++ API

When I did this experiment, I decided that the function that, given an array of ints, would drive LLVM wasn’t necessary. After all, all I was concerned with was whether the output of that process would actually work. So instead, I hand-wrote a function that uses the LLVM C++ API the same way it would be used if this were done dynamically:

Function* create(Module** out) {
  std::string error;
  Module* jit;

  // Load in the bitcode file containing the functions for each
  // bytecode operation.
  if(MemoryBuffer* buffer = MemoryBuffer::getFile("ops.o", &error)) {
    jit = ParseBitcodeFile(buffer, &error);
    delete buffer;
  }

  // Pull out references to them.
  Function* set =  jit->getFunction(std::string("set"));
  Function* add =  jit->getFunction(std::string("add"));
  Function* show = jit->getFunction(std::string("show"));

  // Now, begin building our new function, which calls the
  // above functions.
  Function* body = cast<Function>(jit->getOrInsertFunction("body",
        Type::VoidTy,
        PointerType::getUnqual(Type::Int32Ty),
        PointerType::getUnqual(Type::Int32Ty), (Type*)0));

  // Our function will be passed the ops pointer and the
  // registers pointer, just like before.
  Function::arg_iterator args = body->arg_begin();
  Value* ops = args++;
  ops->setName("ops");
  Value* registers = args++;
  registers->setName("registers");

  BasicBlock *bb = BasicBlock::Create("entry", body);

  // Set up our arguments to be passed to set.
  std::vector<Value*> params;
  params.push_back(ops);
  params.push_back(registers);
  
  // Call out to set, passing ops and registers down
  CallInst* call = CallInst::Create(set, params.begin(), params.end(), "", bb);
  ConstantInt* const_3 = ConstantInt::get(APInt(32,  "3", 10));
  ConstantInt* const_4 = ConstantInt::get(APInt(32,  "4", 10));

  // add 3 to the ops pointer.
  GetElementPtrInst* ptr1 = GetElementPtrInst::Create(ops, const_3, "tmp3", bb);

  // Setup and call add, notice we pass down the updated ops pointer
  // rather than the original, so that we've moved down.
  std::vector<Value*> params2;
  params2.push_back(ptr1);
  params2.push_back(registers);
  CallInst* call2 = CallInst::Create(add, params2.begin(), params2.end(), "", bb);

  // Push the ops pointer down another 4.
  GetElementPtrInst* ptr2 = GetElementPtrInst::Create(ops, const_4, "tmp3", bb);
 
  // Setup and call show.
  std::vector<Value*> params3;
  params3.push_back(ptr2);
  params3.push_back(registers);
  CallInst* call3 = CallInst::Create(show, params3.begin(), params3.end(), "", bb);

  // And we're done!
  ReturnInst::Create(bb);

  *out = jit;
  return body;
}

Now, let’s write a function that calls create() and executes the result:

int main() {
  // The registers.
  int registers[2] = {0, 0};

  // Our program.
  int program[20] = {0, 0, 3,
                     1, 0, 0, 4,
                     2, 0};

  int* ops = (int*)program;

  // Create our function and give us the Module and Function back.
  Module* jit;
  Function* func = create(&jit);

  // Add in optimizations. These were taken from a list that 'opt',
  // LLVM's optimization tool, uses.
  PassManager p;

  /* Comment out optimize
  p.add(new TargetData(jit));
  p.add(createVerifierPass());
  p.add(createLowerSetJmpPass());
  p.add(createRaiseAllocationsPass());
  p.add(createCFGSimplificationPass());
  p.add(createPromoteMemoryToRegisterPass());
  p.add(createGlobalOptimizerPass());
  p.add(createGlobalDCEPass());
  p.add(createFunctionInliningPass()); 
  */

  // Run these optimizations on our Module
  p.run(*jit);

  // Setup for JIT
  ExistingModuleProvider* mp = new ExistingModuleProvider(jit);
  ExecutionEngine* engine = ExecutionEngine::create(mp);

  // Show us what we've created!
  std::cout << "Created\n" << *jit;

  // Have our function JIT'd into machine code and get a pointer to it. We cast it
  // to a particular C function pointer signature so we can call it nicely.
  void (*fp)(int*, int*) = (void (*)(int*, int*))engine->getPointerToFunction(func);

  // Call what we've created!
  fp(ops, registers);
}
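
One practical note: building this needs LLVM’s headers and libraries on the compile line. Something along the lines of g++ main.cpp `llvm-config --cxxflags --ldflags --libs engine bitreader` should do it, where main.cpp is just my name for the file above and the exact llvm-config component names vary between LLVM versions.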

We drive our create() function and then execute the result. As you can see, we’ve commented out all optimizations as a first try. The output from running this is:

<snip same LLVM as before>

define void @body(i32* %ops, i32* %registers) {
entry:
	call void @set( i32* %ops, i32* %registers )
	%tmp3 = getelementptr i32* %ops, i32 3		;  [#uses=1]
	call void @add( i32* %tmp3, i32* %registers )
	%tmp31 = getelementptr i32* %ops, i32 4		;  [#uses=1]
	call void @show( i32* %tmp31, i32* %registers )
	ret void
}
=> 7

Hey! Look at that! It runs! And we can see what it ran: it performs the calls to our functions that implement each opcode. Going back, it would be easy to write a function that dynamically drives the LLVM C++ API to generate this code; it’s simply one call per bytecode, mapped directly, as the sketch below shows.
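
To make that concrete, here is a rough sketch of what such a bytecode-driven builder could look like. This is my own guess, not code from the experiment; it assumes the same headers and LLVM 2.x-era API as create() above, and that the module containing set/add/show has already been loaded into jit:

Function* compile(Module* jit, int* program, int size) {
  // One implementation function and one instruction width per opcode:
  // 0 = set (3 ints), 1 = add (4 ints), 2 = show (2 ints).
  Function* fns[3] = {
    jit->getFunction(std::string("set")),
    jit->getFunction(std::string("add")),
    jit->getFunction(std::string("show"))
  };
  int widths[3] = {3, 4, 2};

  Function* body = cast<Function>(jit->getOrInsertFunction("body",
        Type::VoidTy,
        PointerType::getUnqual(Type::Int32Ty),
        PointerType::getUnqual(Type::Int32Ty), (Type*)0));

  Function::arg_iterator args = body->arg_begin();
  Value* ops = args++;
  ops->setName("ops");
  Value* registers = args++;
  registers->setName("registers");

  BasicBlock* bb = BasicBlock::Create("entry", body);

  Value* cursor = ops;  // where the current instruction starts
  int offset = 0;

  while(offset < size) {
    int op = program[offset];

    // Emit a call to the function implementing this opcode, passing
    // the current position in the bytecode and the registers.
    std::vector<Value*> params;
    params.push_back(cursor);
    params.push_back(registers);
    CallInst::Create(fns[op], params.begin(), params.end(), "", bb);

    offset += widths[op];
    if(offset >= size) break;

    // Advance the pointer from its current position, so the widths
    // accumulate across instructions.
    ConstantInt* step = ConstantInt::get(APInt(32, widths[op]));
    cursor = GetElementPtrInst::Create(cursor, step, "ops", bb);
  }

  ReturnInst::Create(bb);
  return body;
}

The widths table is what drives the pointer arithmetic, so supporting a new opcode would just be one more entry there plus its implementation function in ops.c.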
Even if that were all that LLVM let us do, it would be worth it, but…

Wait, there’s more!

Before, we ran without optimizations to keep the output simple. Let’s see what happens when we turn them on:

define void @body(i32* %ops, i32* %registers) {
entry:
	%tmp1.i = getelementptr i32* %ops, i32 1		;  [#uses=1]
	%tmp2.i = load i32* %tmp1.i, align 4		;  [#uses=1]
	%tmp4.i = getelementptr i32* %ops, i32 2		;  [#uses=1]
	%tmp5.i = load i32* %tmp4.i, align 4		;  [#uses=1]
	%tmp7.i = getelementptr i32* %registers, i32 %tmp2.i		;  [#uses=1]
	store i32 %tmp5.i, i32* %tmp7.i, align 4
	%tmp3 = getelementptr i32* %ops, i32 3		;  [#uses=3]
	%tmp1.i7 = getelementptr i32* %tmp3, i32 1		;  [#uses=1]
	%tmp2.i8 = load i32* %tmp1.i7, align 4		;  [#uses=1]
	%tmp4.i9 = getelementptr i32* %tmp3, i32 2		;  [#uses=1]
	%tmp5.i10 = load i32* %tmp4.i9, align 4		;  [#uses=1]
	%tmp7.i11 = getelementptr i32* %registers, i32 %tmp5.i10		;  [#uses=1]
	%tmp8.i = load i32* %tmp7.i11, align 4		;  [#uses=1]
	%tmp10.i = getelementptr i32* %tmp3, i32 3		;  [#uses=1]
	%tmp11.i = load i32* %tmp10.i, align 4		;  [#uses=1]
	%tmp12.i = add i32 %tmp11.i, %tmp8.i		;  [#uses=1]
	%tmp14.i = getelementptr i32* %registers, i32 %tmp2.i8		;  [#uses=1]
	store i32 %tmp12.i, i32* %tmp14.i, align 4
	%tmp31 = getelementptr i32* %ops, i32 4		;  [#uses=1]
	%tmp1.i2 = getelementptr i32* %tmp31, i32 1		;  [#uses=1]
	%tmp2.i3 = load i32* %tmp1.i2, align 4		;  [#uses=1]
	%tmp4.i4 = getelementptr i32* %registers, i32 %tmp2.i3		;  [#uses=1]
	%tmp5.i5 = load i32* %tmp4.i4, align 4		;  [#uses=1]
	%tmp7.i6 = call i32 (i8*, ...)* @printf( i8* getelementptr ([7 x i8]* @.str, i32 0, i32 0), i32 %tmp5.i5 ) nounwind 		;  [#uses=0]
	ret void
}
=> 7

WOW! Now we’re cooking with hot flaming nuclear power! LLVM has gone the extra mile: rather than just calling our functions that implement each instruction, it has inlined all of that code directly into our dynamically generated function! This discards a lot of additional overhead; since our functions themselves only do simple operations, like storing into memory or adding two numbers, that code now runs much more quickly.

It’s commonly known that inlining can dramatically improve performance because it eliminates the overhead of function calls (spilling and reloading registers, stack frames, etc). And LLVM has just given us a powerful form of that for free.
This doesn’t allow for inlining across things like the kind of method call that Ruby does, but it puts us on the track to being able to feed LLVM enough information to do just that.
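
If you only care about the inlining, you probably don’t need the whole list borrowed from opt. A stripped-down setup along these lines (my guess, using only passes that already appear in the list above) should be enough to get set/add/show folded into body:

PassManager p;
p.add(new TargetData(jit));            // layout information the passes need
p.add(createVerifierPass());           // sanity-check the generated IR
p.add(createFunctionInliningPass());   // fold set/add/show into body
p.add(createCFGSimplificationPass());  // tidy up the result
p.run(*jit);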

LLVM is an amazing piece of software. I can’t wait to start using it more.


14 thoughts on “Simple VM JIT with LLVM”

  1. This isn’t instead of the C++ VM at all. Using LLVM to execute methods is actually only one small part of the overall VM, and we’re potentially using it with the C++ VM. I’ve had this kind of thing in the back of my mind for a while, so the C++ VM is actually set up so that this can be easily plugged in.

  2. I’ve done some work on using LLVM with the VisualWorks Smalltalk VM, and I’ve been looking at Rubinius with an eye to doing exactly what you are suggesting.

    “This doesn’t allow for inlining across things like the kind of method call that Ruby does” – actually, you can if you implement partial specialisation, which I think could generate enormous wins, especially if you specialise code that dispatches to immediates. It’s like a PIC, but with the code inlined at the call site rather than just caching the method lookup.

  3. Antony: Interesting! Is your work with VW available anywhere? Rubinius actually implements a MethodContext caching scheme I found described in a Cincom paper on VW’s context cache.

  4. There are two parts to what I was doing.

    1) Allowing methods to be written in C, inline. This used the clang project associated with LLVM, plus the LLVM jitter infrastructure. This would also be useful in Ruby. I published a WIP version of that work with VW but it’s not useful in this forum.

    2) Using LLVM as you have described, to replace the existing VW jitter and ‘outsource’ most of the platform details to LLVM. Unfortunately you need to be a Cincom commercial customer to get the VM source (I am), and then there’s the problem of how to distribute it. This problem contributed to me putting two of my projects on hold (the other was embedding WebKit with full GC integration). This work wasn’t far enough along to produce results, although as you have seen, results can actually come very easily with LLVM.

    You should post the x86 machine code generated from your example, because in my experience that’s where the OMG kicks in: those getelementptr lines disappear.

    Over the last month I have been looking to restart this work in a Ruby context, because as you’ve discovered LLVM is absolutely the dog’s bollocks, and it’s beautifully written and architected. It really could be like standing on the shoulders of giants as far as what it could do for Ruby performance, particularly with, as I mentioned, partial specialisation keyed off something like multimethod dispatch signatures.

    One of the nice things that can be done is for nearly all of the work to be done in Ruby, rather than using C. Theoretically you can do the entire thing in LLVM + Ruby, although bootstrapping can be a curly problem.

    I’d love to work on this, I’ve just got to get up to speed on the Rubinius architecture.

  5. http://etoileos.com/news/archive/2008/05/12/1719/

    Someone else is doing work with LLVM which will enable the use of Smalltalk as a first class citizen next to Objective C, a Smalltalk compiler that will be able to inherit Objective C classes. Maybe that work could be reused, if it ever gets done, for Ruby, as the Ruby object model is quite like the one in Smalltalk.

  6. Evan, definitely check out what David Chisnall is doing with the Étoilé project. A unified Objective-C, Smalltalk, and Ruby runtime built on top of LLVM is pretty close to Nirvana (the state, not the band). Throw Javascript into the mix and we’re pretty close to heaven.

    My work on Ruby and LLVM got sidetracked when I read Ian Piumarta’s papers about Cola (http://piumarta.com/software/cola/). Especially check out the minimal object model (http://piumarta.com/software/cola/). Now I’m spending my time on just enough icing on top of LLVM to build multiple dynamic object models.

    How are you planning on implementing closures? Garbage collected stack frames and CPS? Also, the GC interface in LLVM is pretty sweet, you can tag roots with meta info when you create them and have this passed to your GC at collection time.

  7. Greetings Evan,

    I am very, very surprised that you have left a message for me.
    Although I was talking about your examples and LLVM and Rubinius,
    most of the contents were written in Chinese. I didn’t expect there
    would be any reader whose first language wasn’t Chinese.

    So, I am very surprised and glad to see your message.
    If you were interested in some content of that blog post,
    I could translate some of it to English for you,
    if you don’t mind my poor, strange English sentences.

    *

    Let’s get back to your examples. At first, I couldn’t find a way to
    compile the examples. After searching header files and llvm-config,
    the examples ran fine. I was glad to reach here.

    But I thought it was a bit strange that your “create” function, which
    creates the JIT module, was skipping the “function label”, lacking a switch
    case to find out which function it should call each time.

    By “function label”, I mean the 0, 1, 2 in our source
    program: 0 stands for set, 1 stands for add, and 2 stands for show.

    Then I started to add the switch case in your create function,
    thinking it should let me rewrite our source program to something such as:

    int program[] = {
      0, 0, 5,    // res[0] = 5
      2, 0,       // puts res[0]
      1, 0, 0, 4, // res[0] = res[0] + 4
      2, 0,       // puts res[0]
      1, 1, 0, 7, // res[1] = res[0] + 7
      2, 1        // puts res[1]
    };

    Problems occurred here.
    Segmentation fault, or strange result, e.g. -2153546.

    blah blah blah… time passed.

    I started to think perhaps there would be a bug in your examples.
    The body bytecode (or bitcode?) emitted by llvm, i.e.,

    define void @body(i32* %ops, i32* %registers) {
    entry:
    call void @set( i32* %ops, i32* %registers )
    %tmp3 = getelementptr i32* %ops, i32 3 ; [#uses=1]
    call void @add( i32* %tmp3, i32* %registers )
    %tmp31 = getelementptr i32* %ops, i32 4 ; [#uses=1]
    call void @show( i32* %tmp31, i32* %registers )
    ret void
    }

    I thought it was very strange that the second getelementptr told
    %ops to step 4, just like "ops + 4" in C. Then it passed %tmp31,
    i.e. "ops + 4", to show. Shouldn’t it be "ops + 3 + 4"?

    Then I took out the line:

    GetElementPtrInst* ptr2 = GetElementPtrInst::Create(ops, const_4, "tmp3", bb);

    and changed it to:

    GetElementPtrInst* ptr2 = GetElementPtrInst::Create(ptr1, const_4, "tmp3", bb);

    save, compile, and it runs fine. Of course it’s not exactly the
    modification I made, because I built a loop and a switch case to
    create many CallInst(s). The exact lines were:

    const_n = ConstantInt::get(APInt(32, lexical_cast<std::string>(step), 10));
    ptr = GetElementPtrInst::Create(params.back(), const_n, "tmp", bb);

    You can find the source code at the bottom of the previous post.
    [LLVM & Rubinius]

    Many thanks for your listening and kindness!

    cheers,
    Lin Jen-Shin (a.k.a. godfat 真常)

    p.s. Below is another copy of the above message.

    http://blogger.godfat.org/2008/10/llvm-rubinius-2.html

    • I haven’t looked into Parrot in a lot of depth, but from what I understand it’s a higher-level virtual machine than LLVM. LLVM really models a RISC processor with an infinite number of write-once registers (i.e. a static single assignment machine); it doesn’t have primitives for anything like method dispatch, nor does it imply support for garbage collection or anything like that.

    The point of LLVM is that it provides a great, flexible and platform-independent target for other things like Parrot. For example, since LLVM is structured as a set of libraries, Parrot could use it to perform just-in-time compilation from the higher-level abstraction it offers to an instruction stream that can be used on a particular piece of hardware. (Thus saving Parrot the work of implementing its own JIT.)

  8. Hi

    Did you consider using libJIT? I am interested in your opinion, if you have compared both LLVM and libJIT for your work. As our experiments with the ECMA-335 Common Language Infrastructure (the Common Language Runtime and the Common Intermediate Language) show, libJIT is much faster, platform independent, and remarkably easier to use than LLVM (or GNU Lightning). That was one of the reasons we actually started development of libJIT: the other libraries were not suitable for our needs when we set out to implement an advanced Just-In-Time compiler that would run much faster than the Mono Just-In-Time compiler, which as of today we have actually achieved, as seen in the results at the end of this text.

    Also, LLVM is not suitable for Just-In-Time compilation in real projects at all, because of its huge compilation time. Moreover, it does not support a large range of features needed to implement ECMA-335. These are just a few of them:

    * the whole spectrum of ECMA 335 for CLR types and operations

    * async exception handling

    * precise stack marking

    * multiple custom call conventions

    * fast Just-In-Time compilation

    libJIT from the end of 2008 runs 26% faster on PNETMARK than the commercial Mono 2.0 from August 2008, and 12.5% faster than Mono 2.2 from January 2009 and Mono 2.4 from March 2009.

    You can try this:

    http://code.google.com/p/libjit-linear-scan-register-allocator/

    This is my research experimental branch of libJIT.

    There is a tutorial with libJIT here:

    http://www.gnu.org/software/dotgnu/libjit-doc/libjit.html

  9. Hi, I wanted to write something about how to keep track of a variable, which means I want to know about the variable’s memory accesses, including reads and writes. Is there any API in LLVM I can use?
