Hey guys,

As some of you may have guessed, virtual machine people regularly ask me questions such as:

  • Why don’t you use LLVM instead of reimplementing it with Sista?
  • Why don’t you use LLVM for low level optimizations?
  • Why don’t you rewrite Cog using LLVM?

I think today is the day I need to explain how LLVM and Sista could (or could not) interact, and why, to avoid being asked these questions all the time in the future.

Question 1: Why don’t you use LLVM instead of reimplementing it?

In short, I am *not* reimplementing LLVM. Sista and LLVM could work together, but LLVM could not be used instead of Sista.

If you look at a virtual machine for a dynamic object-oriented language that uses LLVM (and there are not that many in production for industrial projects), you will notice that it is not running on top of LLVM, but that it uses LLVM for low-level optimizations. Let’s take the example of the brand new WebKit JavaScript VM implemented by guys who love LLVM (I believe this VM was released in the past few weeks). Their architecture has 4 tiers:

  • Tier 1: the LLInt interpreter
  • Tier 2: the Baseline JIT
  • Tier 3: the DFG (Data Flow Graph) JIT
  • Tier 4: the FTL (Fourth Tier LLVM) JIT

The general idea is that tier N+1 needs more compilation time than tier N to generate the native code, but the generated code is much faster. Therefore, the more frequently a portion of code is executed, the higher the tier the VM uses to compile it. Typically, a method is interpreted for its first 6 executions, then the Baseline JIT generates basic native code that is used for the next 60 executions, then the DFG JIT generates more evolved native code that is used for the next 600 executions, and lastly the FTL JIT generates very efficient native code that is used from then on.
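
To make the tiering policy concrete, here is a minimal sketch in Smalltalk of counter-based tier-up. The selectors (executionCount, tier, threshold, recompile:withTier:) are hypothetical; this is neither WebKit’s nor Cog’s actual code.

    "Each method counts its own executions; once the count crosses the
    current tier's threshold, the method is recompiled by the next tier
    and the new native code is used from then on."
    Interpreter >> execute: aMethod
        aMethod executionCount: aMethod executionCount + 1.
        aMethod executionCount >= aMethod tier threshold
            ifTrue: [ self recompile: aMethod withTier: aMethod tier next ].
        ^ self run: aMethod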

As you can see, they use the Baseline JIT and the DFG JIT before using LLVM. There is a good reason for it: before that point, LLVM cannot optimize anything. LLVM is designed to optimize low-level code such as C. Smalltalk or JavaScript code, on the other hand, consists mostly of message sends. As each message send may indirectly trigger a GC, a stack frame creation, a process switch, a user interrupt, the opening of a debugger, etc., LLVM can only optimize the code that lies in between two message sends. And let me tell you that there is not a lot of code between two message sends.

The problem is even worse in Smalltalk, where every operation is a message send, including the addition of two SmallIntegers. Between message sends, Smalltalk code has basically 2 things: inline cache checks and jumps. It happens that the current JIT specifically optimizes these two cases, generating very efficient instructions with very good register allocation. Therefore the current JIT generates code as efficient as LLVM would, with however a much lower compilation time (the current JIT compiler is very basic but compiles very fast).
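
As an illustration, consider this ordinary Smalltalk method (items and offset are hypothetical instance variables, nothing Cog-specific). Everything in it is a message send, so inline cache checks and jumps really are all the JIT has to work with:

    "Every operation here is a message send: #between:and:, #size, #at:
    and #+. For each send site the baseline JIT emits an inline cache
    check (does the receiver's class match the class cached at this
    site?) followed by a jump, either to the cached native method or to
    the slow look-up routine."
    valueAt: anIndex
        ^ (anIndex between: 1 and: items size)
            ifTrue: [ (items at: anIndex) + offset ]
            ifFalse: [ 0 ]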

To be able to use LLVM, one first needs to remove the message sends, by inlining or eliminating them, up to the point where you end up with lots of primitive operations such as “+” on int32, assignments to temporaries, and jumps. WebKit’s DFG JIT and Sista do exactly that: inlining message sends, removing bounds checks for array accesses and unboxing numbers to int32 or double. After these steps, LLVM has some code to optimize and may be worth using, as the sketch below illustrates.
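
Here is a sketch of that idea on a hypothetical method (this is not actual Sista output):

    "Before optimization, #size, #at: and #+ are all message sends,
    each with its cache check and its possible side entries (GC,
    debugger, process switch...)."
    sum: anArray
        | sum |
        sum := 0.
        1 to: anArray size do: [ :i |
            sum := sum + (anArray at: i) ].
        ^ sum
    "After inlining the sends, removing the bounds check of #at: (i
    provably stays between 1 and the array size) and unboxing sum to an
    int32, the loop body is plain arithmetic and jumps: exactly the
    kind of code LLVM is good at optimizing."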

So Sista is *not* in competition with LLVM to generate the fastest native code; it is one of the steps required in an optimizer that may (or may not) use LLVM.

Question 2: Why don’t you use LLVM for low level optimizations?

Now that I have explained how LLVM could interact with Cog and Sista, I guess that you, readers, understand why the question is very relevant. In theory, what we could do is use LLVM to generate efficient native code from the optimized compiled methods that Sista produces. Let’s compare the pros and cons.

If I use LLVM, the unique pro is that the native code LLVM generates will be faster than the code I will produce manually, as there are dozens of developers improving and maintaining LLVM. In particular, LLVM maintains several platform-dependent optimizations that I will not implement, as well as exotic optimizations such as automatic parallelization.

Let’s look at the performance reports: in the WebKit JavaScript engine, adding the tier 4 FTL JIT based on LLVM improved performance by 35% compared to the VM with only the first 3 tiers.

If you talk with the WebKit guys, you may notice that around 9 months ago they released many LLVM optimization passes that they now use for their JIT. This means that integrating LLVM is far from working “out of the box”: you need to integrate new optimization passes into LLVM, which in our case means spreading the VM implementation from Slang and C to Slang, C and the C++ of LLVM.

Some people told me: “using LLVM, the native code generation is assembly-independent, so you get a JIT for all the existing processors supported by LLVM”. That is true, but in an architecture such as WebKit’s, you still need to port the native code generation of the Baseline JIT and of the DFG JIT to your new assembly. Therefore only tier 4 is platform-independent “for free”, not the VM. I will discuss building the full VM on top of LLVM later in the article (Question 3).

Now let’s look at the cons of using LLVM:

  • The memory footprint of Cog (around 300kb, plus the native code cache zone that is typically 1 Mb to 2 Mb) will increase by 3.5 Mb, which is the memory footprint of LLVM.
  • Cog will rely on LLVM, so if LLVM is not maintained any more, we are screwed (notice that Smalltalk has been running since 1980 and that most common libraries from 1980 are not maintained any more, so the Smalltalk philosophy of writing simple but working libraries has paid off).
  • Cog relies on a very different stack management scheme than common projects, in order to support the stack-to-context mapping efficiently (including on-the-fly stack editing from the language). You would need to write many new passes over the LLVM IR to recover the stack frame to context mapping information. The cost of this implementation is massive.
  • Cog relies on a garbage collector. Integrating our garbage collector with LLVM may not be simple. Note also that Smalltalk supports exchanging two objects in memory (#become:, illustrated after this list). I am not sure this kind of feature would work out of the box with LLVM-generated code.
  • The Smalltalk runtime relies heavily on non-local returns (in Smalltalk, a non-local return is a common case; see the example after this list). It happens that the only way to represent a non-local return in LLVM is to use exceptions, and the “zero cost” exception model of LLVM is very slow when the exception is actually raised, which is common in Smalltalk. Therefore one also needs to write a lot of code and optimization passes in LLVM to handle non-local returns efficiently.
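
For readers unfamiliar with these last two features, here are plain Smalltalk illustrations (ordinary code, nothing VM-specific):

    "Non-local return: the ^ inside the block returns from
    #firstNegativeIn:, not merely from the block, unwinding the frame
    of #do: on the way. LLVM can only model such unwinding as a thrown
    exception."
    firstNegativeIn: aCollection
        aCollection do: [ :each |
            each < 0 ifTrue: [ ^ each ] ].
        ^ nil

    "#become: swaps two objects in memory: afterwards, every reference
    that pointed to a points to b, and vice versa."
    | a b |
    a := OrderedCollection new.
    b := Dictionary new.
    a become: b.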

Am I going to try using LLVM for Cog + Sista?

When the Sista optimizer is in production, the next step will be to improve the quality of the generated native code by improving the bytecode to native code translation. One solution is to use LLVM for this purpose.

Let’s look at the Cog JIT tiers:

  • Tier 1: the Stack interpreter
  • Tier 2: Cogit (the baseline JIT)
  • Tier 3: Sista, either with on-the-fly stack replacement or postponing some optimizations to the background process

In our model, it may be that in tier 3, when we detect that a method has reached its maximum optimization potential, generating the native code of the optimized method with LLVM would be worthwhile. However, the 40% performance gain compared to the huge amount of work that using LLVM implies (adding and maintaining LLVM passes for non-local returns, for the integration with the garbage collector, and for the stack frame to context mapping) simply does not work out for us. Therefore, if someone is interested in plugging an LLVM back end into Sista to compile its optimized methods, I will be really happy to help them, to see the experimental results and, if they are good, to help integrate it into the production VM. But I will not do it myself: only a few people work on the Pharo VM nowadays, and our limited resources cannot support an LLVM back end for only a 40% performance boost. You guys can still try to convince me in the blog post comments (for example, if you tell me that you will add and maintain the LLVM passes for non-local returns, for the integration with the Cog garbage collector and for the Cog stack frame to context mapping, and that you will do all that within a year, that will be very convincing).

Now, if I were to build a low-level compiler, such as a C compiler, I would definitely use LLVM.

Question 3: Why don’t you rewrite Cog using LLVM?

This question is easy. LLVM is a slow compiler: compiling from bytecode to native code with LLVM would take a hell of a long time. For optimized methods it may be worth it, as there are few optimized methods and they are performance-critical. However, for common methods, using LLVM would slow down code generation a lot without improving performance because, as we discussed before, LLVM cannot improve the performance of methods that consist only of message sends. The performance of Cog on top of LLVM would therefore be very poor. As a proof, you can see that in the WebKit JavaScript VM, which was implemented by guys who really *really* like LLVM, they didn’t use it for the first 3 tiers.

I hope you guys enjoyed the post 🙂
