For the unaware developers, the sista runtime is a high-performance layer on top of the current Cog / Spur VM for Pharo / Squeak and other Smalltalk languages. It aims to reach greater performance by speculatively optimizing and deoptimizing code at runtime so frequently used code is faster to run.
The benchmarks are kind of getting stable on the sista runtime. Binary trees, Fibonnacci, Richards, integer benchmark, Nbody and Threadring are running with reliable performance, while chameneosRedux and DeltaBlue are still crashing very often. For development, I am now starting to use the sista runtime, though it’s still crashing every half an hour or so. I think it’s time to share the code with new people. If you’re interested to contribute in the sista runtime, please contact me. You need as a requirement to be able to compile the VM with different settings and to be willing to contribute. I had multiple reviewers already, but new persons -especially, contributors, not reviewers- would be nice.
The sista runtime relies both on full block closures and the new bytecode set.
I’ve been using the new bytecode set (SistaV1) for a couple weeks now and it looks quite stable, both from the debugging and the execution points of view. There are a couple remaining bugs that Marcus Denker is fixing, such as a UI bug (when there are errorOnDraw in Morphic, sometimes the image freezes as it cannot correctly map the temporary name to the value…).
I’ve introduced even more recently FullBlockClosures, which are however quite buggy. I compile the VM with interpreter mode for the FullBlockClosure execution to avoid most issues, leading to incorrect benchmark comparison (usually optimized code has no closures, they’re inlined, so it’s faster than interpreted closures). Currently I compile only part of the image (typically only the benchmarks) with full block closures to avoid crashes. Code with old closures can’t be inlined.
On the sista runtime itself, the debugger conceptually works (each context with optimized code can be requested to be deoptimized at any interrupt point). However the debugger integration is not there yet at all. Many optimizations and deoptimizations are happening during the first couple minutes of development.
The current running benchmarks are still between 1.2 and 1.8x. The thing is that very few optimizations are in, and the focus is on integration and stabilization, not performance. There’s no doubt that the sista runtime is very far from its full performance potential. Right now we’re looking with Eliot into easy-to-implement optimizations such as store checks elimination and object allocation optimizations, that should bring some benchmarks over the 2x.
As requested by Eliot, I try to give here an insight of the optimizing compiler. It’s difficult to give a full overview of the optimizer as there is no direct optimization flow. For example, the method to optimize is decompiled to the SSA-like IR (a.k.a. Scorch IR), but then each time a block or method is inlined the decompiler is called again. Another example is the elimination of dead branches, that usually happen early on to fold inlined constants, but it can happen again at later stage if some operation between SmallIntegers has been resolved due to ranges / inlined constants.
The Optimizer is called on a context. The first step is to search the best method to optimize on stack (usually the optimizer picks a method quite close to the context with the tripping counter, but it can be a little bit further to inline properly some closures into their outer contexts). The method chosen to be optimized is decompiled to the Scorch-IR and then optimized.
The decompiler transforms the bytecode in the Scorch IR. The Scorch IR is a control flow graph with a SSA-like property (I understood too late the Sea-of-node style IR, so the Scorch IR is not this way). I say SSA-like as it’s a high-level IR, so it’s different from LLVM-style IR (many instructions are not really typed, etc). I’ll describe it later in the post, but important points are:
- It’s not stack-based as it was really not convenient for the SSA property and for optimizations.
- At the exception of the basic blocks control flow, all the instructions are scheduled linearly inside a basic block, there is no tree, nested instructions or anything like that.
The decompiler is very similar to Cogit’s decompiler or the in-image bytecode to AST decompilers, it uses a simulated stack to construct expressions.
Inspired from an early version of V8’s Crankshaft optimizer, the conversion to SSA is done quite naively: phis nodes are added aggressively for each variable that could need it, and a second passes removes unnecessary phis. In practice it’s very efficient as Smalltalk methods and blocks have small control flows and small number of local variables, while the phis needs to be simplified one method/block at a time.
The control flow graph is slightly canonicalized during the decompilation process:
- Critical edges are split (Similarly to LLVM, see here).
- Loops are canonicalized so that a basic block having a predecessor through a back-edge can have only 2 predecessors, one through a back edge and one through a forward edge.
- Returns are canonicalized so each method or block has a single return to caller (everything jumps to the same return), this is very convenient for inlining.
The last step of decompilation:
- Computes the dominator tree (in a naive way, which works well as the control flow are simple and the dominator tree can be composed while inlining, but it would be interesting to see if the Tarjan-Lengauer algorithm performs better).
- Reorders the basicBlocks in post order. In addition to the post order property, the reordering keeps loop body contiguous, which changes sometimes the Smalltalk default control flow.
- Removes redondant or unused phis.
These last steps require each to iterate over the control flow graph, but only over the basic blocks, not over each instruction.
There is no real order of optimization. Inlining is done mostly at the beginning, and passes that won’t trigger any more optimizations are done at the end.
Call graph inliner
The first phase is the CallGraphInliner. It basically iterates over every sends, starting by inner loop sends up to outer most sends. It inlines aggressively every send with reliable information (from inline caches or by inferring the types). It inlines:
- message sends calling non primitive & quick primitive Smalltalk methods
- closure activations if the closure is used only once (most closures).
- “perform: aSelector” to a send if the selector is constant to cover the case “aCollection do: #foo” which is common in some frameworks.
- primitives that relies only on type information (primitives corresponding to #bitAnd:, #class, #<, #size etc.).
After multiple attempts, I disabled inlining of blocks with non local returns, except if the block is created in the outer most method. It’s simpler for me and already most blocks can be inlined (3% of blocks have non local returns, and part of them can be inlined). One can fix that later.
Dead branch elimination
Dead branch elimination removes branches on boolean, which are frequents mostly because of the Smalltalk pattern “isSomething”, which once inlined often leads to “true ifTrue: ” or things like that. Dead branch elimination is sometimes called again from later pass when it seems it will remove additional branches.
ABCD inlines primitive operations and fold branches based on types and ranges. It’s based on this paper. This pass:
- inlines primitives corresponding to #at:, #+, #-
- removes dead branches when primitives matching #<, #> etc. can lead to a single branch due to the range information.
The pass temporarily transforms the IR in e-SSA (Addition of Pi nodes) as described in the ABCD paper and revert it back to SSA state at the end.
There is so many more primitives we could uncheck there. I’ve done the most important, I’ll do the others (SmallInteger multiplication, etc) later.
LICM and GVN
The last 2 main passes simplifies data computation. The first one, LICM (Loop Invariant Code Motion) attempts to move code up. It mainly moves traps out of loops. The second pass, GVN (Global value numbering) attempts to move code down and resolve statically some computation. It eliminates common sub expressions, solves primitives operations between constants, and all sorts of things like that.
The back-end is responsible for generating back bytecodes.
It works as follow:
- ExpandAndReduce: that pass expands macro instructions, i.e., instructions that are present to simplify the IR but that can’t be generated to bytecode without being converted to multiple instructions. For example, trapIfNotInstanceOf are transformed to BranchIfNotInstanceOf to another basic block with immediate trap as the back-end can’t generate trapIfNotInstanceOf instructions. The pass also reduces other instructions. For example, it looks for returns and try to move them up in branch to avoid jumping to a return.
- Spill analysis: that pass analyses the instructions to find out which instruction needs to be a temp, which one can be a spilled value on stack and which one is effect only for the stack-based bytecode. In general it tends to generate more temps than spills as it easier to handle, but sometimes spills are critical for performance as Cog’s JIT expects specific bytecode patterns to generate efficient instructions (typically JumpBelow instructions).
- Liveness analysis: this pass computes both the liveness and interferences between the different variables that will be assigned to temps.
- Temp index allocation: based on liveness analysis results, a phi coalescing and a graph coloring algorithm, it assigns temporary variable indexes to all the instructions that require a temporary slot.
- Bytecode generation: this pass walks the graph and generates bytecodes. It also map the optimized code bytecode pc to the deoptimization metadata.
The Scorch IR
Let’s try to describe a bit the IR. It’s not perfect, but it works just fine. It’s a control flow graph of basic blocks, each instruction has the SSA property.
Let’s try to give an example with this method. I know it’s a simple example with no inlining but I can’t show everything at once:
exampleBasicRange1 | array | array := #(1 2 3 4 5). 1 to: array size do: [ :i | self nonInlinedEval: (array at: i) ]
25 <20> pushConstant: #(1 2 3 4 5) 26 <D0> popIntoTemp: 0 27 <40> pushTemp: 0 28 <72> send: size 29 <D1> popIntoTemp: 1 30 <51> pushConstant: 1 31 <D2> popIntoTemp: 2 32 <42> pushTemp: 2 33 <41> pushTemp: 1 34 <64> send: <= 35 <EF 0E> jumpFalse: 51 37 <4C> self 38 <40> pushTemp: 0 39 <42> pushTemp: 2 40 <70> send: at: 41 <91> send: nonInlinedEval: 42 <D8> pop 43 <42> pushTemp: 2 44 <51> pushConstant: 1 45 <60> send: + 46 <D2> popIntoTemp: 2 47 <E1 FF ED ED> jumpTo: 32 51 <58> returnSelf
After decompilation, if I look at the send (self nonInlinedEval: (array at: i)), it looks like that:
[3.2] (S) self nonInlinedEval: [3.1].
- [3.2]: it means this instructions is the instruction 2 of the basic block 3.
- (S): it means this instruction has deoptimization information attached.
- self: As the IR is very high level, specific instructions (self, argument read) are considered as immediate instruction in a similar way to constants. It means the receiver of the send is self.
- nonInlinedEval: it’s the send’s selector.
- [3.1]: it means that the first argument of the send is the result of the instruction [3.1].
Control flow graph
Let’s now look at the basic block 3.
[3.1] (S) #(1 2 3 4 5) at: [2.1]. [3.2] (S) self nonInlinedEval: [3.1]. [3.3] (S) [2.1] + 1. [3.4] (S) backTo: 2.
The basic block is composed of 4 instructions. All the instructions but the last one are “body” instructions, by opposition to the last instruction which is a control flow instruction. Each basicBlock has always a control flow instruction at the end. In the case of basic block 3, the last instruction is a back jump. As you can see, every instruction requires deoptimization information.
Finally let’s look at the whole method:
[1.1] (S) #(1 2 3 4 5) size . [1.2] (S) loopHead. [1.3] goTo: 2. [2.1] phi: 1'1 [3.3]'3 . [2.2] (S) [2.1] <= [1.1]. [2.3] (S) [2.2] ifTrue: 3 ifFalse: 4. [3.1] (S) #(1 2 3 4 5) at: [2.1]. [3.2] (S) self nonInlinedEval: [3.1]. [3.3] (S) [2.1] + 1. [3.4] (S) backTo: 2. [4.1] ^ self.
At the exception of message sends and control flow instructions, we have:
- a phi instruction (Look at what is SSA on google if you don’t understand what is a phi).
- a loop head, which is there to record deoptimization information so the optimizer can hoist instructions such as traps out of loops.
This is the control flow graph. Every instruction is scheduled.
Optimized control flow graph
Let’s look at it again after running the optimization passes. The send nonInlinedEval: won’t be inlined because I forced it not to be inlined for this example. In practice this example would be created from the inlining of #do: on an array.
[1.1] goTo: 2. [2.1] phi: [3.3]'3 1'1 . [2.2] [2.1] USmiLessOrEqual 5. [2.3] [2.2] ifTrue: 3 ifFalse: 4. [3.1] #(1 2 3 4 5) UPointerAt [2.1]. [3.2] (S) self nonInlinedEval: [3.1]. [3.3] [2.1] USmiAdd 1. [3.4] (S) backTo: 2. [4.1] ^ self.
After the optimizations, multiple operations have been unchecked and the size operation have been resolved at compilation time. The branch is always used on a boolean, so it does not require deoptimization information. In fact, only 2 instructions now require deoptimization information for bytecode generation, the send and the back-jump. I guess because the loop has a fixed small number of iteration I could remove the interrupt point on the loop, but I am not sure of the side-effects with the non inlined send, so I didn’t do it for now.
The instructions are shared between the control flow graph and the def-uses graph.
Let’s look at the def-use graph of the instruction [2.2]. It looks like this:
[2.2] [2.1] USmiLessOrEqual 5. [2.3] [2.2] ifTrue: 3 ifFalse: 4.
The instruction [2.2] is used only once, in instruction [2.3]. Instructions can be used in deoptimization information, but that’s not the case of instruction [2.2].
Let’s look at the deoptimization information of the backjump. In this case, we didn’t inline aggressively and no object allocation was removed. Hence the deoptimization information is pretty simple.
The deoptimization information is a list of objects to reconstruct. The first object to reconstruct is the bottom context of the stack. In our case, a single object needs to be reconstructed, the context with non optimized method. Hence the inner collection looks like this:
A sunk object is an object which allocation has been postponed from runtime to deoptimization time. In our case, it’s a context, so it’s a Pointer sunk object (by opposition to byte sunk object and co). A “PSunkObj” describes the state of the object to reconstruct at a given point in the program. In this case, the object to reconstruct is a context.
This case is simple, but after inlining the optimizer needs currently to be able to recreate multiple contexts, temp vectors and closures; and maybe more in the future.
If we inspect the object, we can see that:
- The class of the object to reconstruct is Context.
- It has 7 fixed fields, which are reconstructed, in this case, using constants and the receiver of the optimized method.
- It has 3 variable fields, which are reconstructed for the 2 first using constants and for the later using the value of the phi instruction.
I hope you got a feeling of how the IR looks like. I know it can be improved, and hopefully it will happen.
In term of integration, the focus needs to be on the new bytecode set polishing (it works almost entirely), on FullBlockClosure integration (compiler, VM, debugger) and lastly one needs to integrate better sista in the IDE to be sure that all the new methods installed are caught and that the debugger works just fine.
In term of optimizations, the next thing to do is to stabilize store check and immutability check removal, optimizations on object allocations, branches on values guaranteed to be boolean. The other thing to do is to work on splitting, as it seems to be critical in current methods in the benchmarks and in the IDE.
I hope you enjoyed the post.