It is now time to make an open alpha release of the Sista VM. As all alpha releases, it is reserved for VM developers (the release is not relevant for non VM developers, and clearly no one should deploy any application on it yet). Last year we had a closed apha release with a couple people involved such as Tim Felgentreff who added support for the Sista VM builds in the Squeak speed center after tuning the optimisation settings.
The main goal of the Sista VM is to add adaptive optimisations such as speculative inlining in Cog’s JIT compiler using type information present in the inline caches. Such optimisations both improve Cog’s performance and allow developers to write easy-to-read code over fast-to-execute code without performance overhead (typically, #do: is the same performance as #to:do:).
As shown in the following figure generated from the Squeak speed center data, benchmarks are typically in-between 1.5x and 5x faster on the Sista VM than the current production VM. On the figure, the time to run the bench is represented (hence, smaller columns implies less time spent in the bench and faster VM). Four columns are shown for each benchmark:
- Cog: the current production VM.
- Cog+Counters: the current production VM with the overhead of profiling counters used to detect hot spot and provide basic block usage profiling.
- Sista (Cold): the Sista runtime from an image with no optimised code, approximate start-up performance.
- Sista (Warm): the Sista runtime started on an image with optimised code present, approximate peak performance.
The image is extracted from my Ph.D, where one can find all the relevant data to reproduce the benchmarks.
In practice, on real-application benchmarks (such as the TCAP benchmark not shown in the figure), the Sista runtime is around 1.5x times faster. Specific smaller benchmarks sometimes show more significant speed-ups (JSON parsing, bench (c) in the figure, showing 5x), or no speed-up at all (Mandelbrot, bench (i) in the figure, the time is spent in double floating-pointer arithmetic and I did not implement double optimisations in Sista).
For this first release, the main focus has been on closure inlining and to get some decent benchmark results to get people looking for an efficient Smalltalk interested.
Several optimisations (String comparisons, inlined object allocations, unchecked array accesses) show a 1.5x times speed-up on benchmarks where those operations are intensive, but based on profiling on larger application (for example the Pharo IDE) the speed-up comes mainly from closure inlining.
Some benchmarks in the benchmark suite focus on other things such as 32bits large integers or double floating pointers arithmetic. These benchmarks typically used inlined loops (#to:do:, etc.) and hence don’t really benefit much from the runtime compiler.
Just a little discussion on the names not to confuse everyone…
Sista is the name of the overall infrastructure/runtime.
Scorch is the bytecode to bytecode optimising JIT, written in Smalltalk. It relies on Cogit as a back-end to generate machine code. It can ask Cogit for specific things such as the inline cache data of a specific method.
Cogit is the bytecode the machine code JIT compiler. It is used alone as a baseline JIT and can be combined with Scorch to be used as an optimising JIT.
The following figure summarises the Sista architecture and the interactions between the frameworks:
Overview of the runtime compiler Scorch
Scorch is called by Cogit on a context with a tripping counter (i.e., a portion of code executed many times). The optimiser then:
- Select a context to optimise (always a method context)
- Decompiles the context method to a SSA IR.
- Performs a set of optimisations
- Generates back an optimised compiled method
- Installs the optimised method and register its dependencies
In step 1, Scorch looks for the context defining closures on stack. The typical case is that the Array>>#do: method has a tripping counter. Optimising Array>>#do: won’t make any sense if the optimiser cannot optimise the closure evaluated at each iteration of the loop. The optimiser typically selects the sender context of Array>>#do: for optimisation, so later in the optimisation process the closure creation and evaluation will be removed.
In step 2, Scorch generates a control flow graph of basic blocks, each basic block having a linear sequence of instructions. This step includes heavy canonicalisation and annotation of the representation (basicBlocks are sorted in reverse postOrder, annotated with the dominator tree, loops are canonicalised, sends are annotated with runtime information from Cogit, the minimum set of phis is computed, etc.).
In step 3, Scorch performs a set of optimisations on the control flow graph. Multiple inlining phases happen, where the goal is to inline code in nested loops, to inline short methods and to inline code that would lead to constant folding or closure inlining. Part of the inlining phase consists in removing temp vectors (once the closures are inlined). Aside from inlining, one optimisation phase focuses on loops, hoisting code out and in rare cases unrolling them. The other phases consist of dead branches removal, better SmallInteger comparisons / branches pipeline, redundant type-check removal, common sub-expression elimination, unused side-effect free instruction elimination, head read/write redundancy elimination and other minor things like that.
In step 4, Scorch makes small changes to get the representation in a proper state for code generation (some instructions are expanded, the single return point is split in multiple ones, etc.). It then analyses the representation to figure out which value will become a temporary variable and which value will become spilled on stack. Future temporaries are then assigned a temp index. The temp index is assigned first by coalescing phis (to decrease temp writes) and second through graph coloring (to use the least number of temps). Once done, the representation is traversed generating bytecodes for each basic block. The size of each jump is then computed, and the final optimised method is generated.
In step 5, Scorch installs the optimised method, potentially in a subclass of the original method (customisation). The optimised method has a special literal which includes all the deoptimisation metadata to reconstruct the runtime stack with non optimised code at each interrupt point. In addition, Scorch adds to the dependency manager a list of selectors which requires the optimised method to be discarded if a new method with this selector is installed (look-up results could change, confusing speculative inlining, etc.).
Next optimisations to implement
Apart from rethinking the optimisation planning and improving all the existing optimisations, new optimisations may be added. On the top of my head, the major next things I can think of are probably:
- Full object sinking: Right now read/writes to objects are analysed and simplified, but there is nothing entirely removing object allocations if they don’t escape the method. That was not done because almost every object escape, but if one would direct the inliner to remove escapes or do partial escape analysis, I guess we could see significant speed-up.
- Something for large integers arithmetic. The minimum would be to be able to permute the arithmetic operations using associative/commutative properties, solve the ones between constants and hoist others out of loops.
- Float boxing / unboxing management. That would remove a lot of overhead in float intensive benchmarks.
There are also multiple minor things to do here and there. Improving loop optimisations would likely yield significant speed-up too.
How to get/build a Sista image and VM
1) Get the Pharo 6 release image and VM, for example by doing in the command line:
wget --quiet -O - get.pharo.org/60+vm | bash
2) Execute the following code (DoIt) to prepare the image:
"Add special selector for trap instruction"
Smalltalk specialObjectsArray at: 60 put: #trapTripped.
"Disable hot spot detection (to load the Scorch code)"
Smalltalk specialObjectsArray at: 59 put: nil.
"Recompile the fetch mourner primitive which has strange side-effect with alternate bytecode set and closures"
WeakArray class compile: 'primitiveFetchMourner ^ nil' classified: #patch.
"Enable FullBlockClosure and alternate bytecode set"
CompilationContext bytecodeBackend: OpalEncoderForSistaV1.
CompilationContext usesFullBlockClosure: true.
3) Load Scorch by using this DoIt:
4) Go to https://github.com/OpenSmalltalk/opensmalltalk-vm
and compile a squeak.sista.spur VM.
Alternatively, pre-compiled VMs are available on Cog’s Bintray (Some OS/processors may not be available though).
5) Restart your image with the Sista VM. You can now execute:
25 tinyBenchmarks logCr.
"Enable Scorch optimizations"
Smalltalk specialObjectsArray at: 59 put: #conditionalBranchCounterTrippedOn:.
25 tinyBenchmarks logCr.
"Disable Scorch optimizations"
Smalltalk specialObjectsArray at: 59 put: nil.
It should show on Transcript something like that (copied from my machine):
'2486945962 bytecodes/sec; 150270417 sends/sec'
Counter tripped in Integer>>#benchmark
Installed SmallInteger>>#tinyBenchmarks in SmallInteger
Counter tripped in Integer>>#benchmark
Installed SmallInteger>>#benchmark in SmallInteger
Counter tripped in Integer>>#benchFib
Installed SmallInteger>>#benchFib in SmallInteger
'3849624060 bytecodes/sec; 271220541 sends/sec'
That code was run on the Sista runtime.
6) Optionally, add in Monticello the repo http://www.hpi.uni-potsdam.de/hirschfeld/squeaksource/BenchmarkRunner and load the 2 packages to have a set of benchmarks to toy with.
Note when toying around
If you want to experiment with the Sista runtime, you need to note:
- A certain number of iterations is needed to reach peak performance.
- DoIts are not optimised (hence if you don’t put your code in a method, it won’t get optimised).
- It’s still not very mature, hence it is possible to build benchmarks where the performance is not that good.
- Expect crashes.
Another interesting thing is to do:
optimisedMethod metadata printDebugInfo
which shows [most of] inlined code in the given optimised method, and allows one to try to understand the optimiser optimisation decisions. In the case of
tinyBenchmark, the method benchmark would show something like (based on my machine):
52) atAllPut: Inlined (SequenceableCollection>>#atAllPut:) 
41) from:to:put: Inlined (SequenceableCollection>>#from:to:put:) 
56) min: Inlined (Magnitude>>#min:) 
The number at the beginning (52) is the bytecode offset of the send inlined, followed by the selector of the send (atAllPut:), followed by the method inlined (SequenceableCollection>>#atAllPut:). In some cases there may be several methods inlined. The last number () is the order in which the methods are inlined.
The indentation means the inlining depth (#from:to:put: is inlined in #atAllPut: itself inlined in #benchmark for example).
In the case of
benchmark, other methods were inlined, but they were proven to be non-failing primitives, so they are not shown here.
In the case of non-local return inlining, more complex logic is involved and the debug info may be incomplete.
System integration: Some TODOs
Many things are partially done in the IDE. Customised methods are currently shown in the class browser. It is possible at each interrupt point to deoptimise an optimised context to multiple deoptimised contexts, but the debugger ode needs to be updated to do so. Hooks for method installation need to be added to correctly ask Scorch to discard optimised methods that are dependent on the selector.
Another thing is the thisContext keyword, which now shows sometimes optimised context. Again, on interrupt point, it is possible to request deoptimisation, but no IDE tool is doing so right now.
Lastly, the deoptimiser is written in Pharo, but is completely independent from the rest of the code and needs love. Some parts have still dependencies, leading to crashes.
I hope you enjoyed the post. Please report on the vm-dev mailing list any experiment with the Sista VM.