Hi everyone,
It’s been a long time since I last wrote about the Sista project that Eliot and I are working on. For new readers, Sista is the infrastructure we are adding to the Cog VM to support runtime optimisations through dynamic recompilation / deoptimization. There is a good reason for the silence: when I started writing about Sista I lacked knowledge about compiler IRs, and I did not want to keep writing about things I did not fully understand, to avoid making mistakes. My knowledge of compiler design has grown every week over the past years, and I now feel confident enough to blog about it again.
For context, we started this work two and a half years ago. At that time I was working on Sista one day a week. I am now full-time on the project… but! I teach on average one day per week, I write research papers for my PhD, I am writing tutorials on Smalltalk virtual machine development (see here) and I am also contributing to side projects like read-only objects (upcoming IWST paper). So… in practice I work around three days per week on the project in good weeks, and about four hours per week when I am writing a paper for a research conference. Eliot has dedicated a month every year to working with me (April, three years in a row) and gives me a day of work from time to time to help me.
Work done
The first version of the runtime optimizer I worked on was based on previous people’s work. This version had a stack-based IR and copied many objects. It was very difficult to evolve without memory blow-ups, and the stack representation was awkward for compiler optimizations. I made two iterations on this work, the first without dynamic deoptimization metadata generated by the optimizing compiler, the second with it, and neither was stable enough to run benchmarks reliably.
I then made a third version using a control-flow-graph IR in static single assignment (SSA) form, designed after the LLVM IR and JavaScript V8’s Crankshaft optimizing compiler. With that new version I was able, last fall, to run some benchmarks. On the Shootout benchmarks (nbody, threadring, binarytrees) and integer benchmarks (fibonacci and the like), the performance boost was between 1.6x and 1.8x. At that point I had worked mainly on stability and spent close to no time tweaking performance; I had only implemented a couple of optimizations (inlining, bounds and overflow check elimination in loops, constant propagation/folding). As expected, optimized methods could be saved as part of the snapshot, and starting from a hot image lets the VM reach peak performance more quickly.
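To give a concrete idea of the loop optimizations, here is an example of my own (not taken from the benchmark suites). In a loop like the one below, the optimizer can prove that the index always stays within the array bounds and that the counter increment cannot overflow, so the per-iteration checks can be removed:

    | array sum |
    array := (1 to: 1000) asArray.
    sum := 0.
    1 to: array size do: [ :i |
        "#at: normally bounds-checks i on every iteration; here
         1 <= i <= array size holds throughout the loop, so the
         check is redundant. Likewise i + 1 in the loop increment
         can be proven not to overflow a SmallInteger."
        sum := sum + (array at: i) ].
    sum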
Then I worked on stabilizing the optimizer, as bigger benchmarks were crashing. In particular, I tried to validate that each uncommon deoptimization path is correct, so the optimizer could be deployed without risking crashes. Unfortunately, many deoptimization points were not restoring the stack correctly…
The closure problem
The main problem I had was with closures. In the Cog VM’s clients, a closure’s bytecodes are inlined in the enclosing method or closure. In addition, at closure creation time, the closure fetches from the outer context the receiver and the method whose bytecodes it has to execute. These specifics make it really hard to inline methods holding closures: there is no instruction to create a closure whose bytecodes and receiver are fetched from anywhere other than the outer context, and after inlining the outer context no longer holds the correct information. I ended up with very large and complex deoptimization metadata that was very hard to stabilize. Deoptimization sometimes required deoptimizing multiple optimized frames, which was very tricky to handle.
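To make the problem concrete, here is a small example of my own:

    Collection >> firstEvenOr: default
        "Both blocks below are compiled as bytecodes inlined in
         this method; at run time each closure fetches its method
         and receiver from the context executing firstEvenOr:."
        ^ self detect: [ :each | each even ] ifNone: [ default ]

If firstEvenOr: is itself inlined into a caller, the outer context the closures would fetch from no longer exists as such, so the closure creation instructions have no correct place to fetch their method and receiver from.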
Last spring I visited Eliot and we decided (I convinced him after some debate) to move to a simpler closure model. In the new model, called FullBlockClosure, a closure’s bytecodes live in a separate compiled code object (in fact, we created CompiledCode with two subclasses, CompiledMethod and CompiledBlock). The full closure creation bytecode specifies the literal index of the compiled block and the number of copied values, plus whether the receiver has to be fetched from the outer context or from the stack, and whether a reference to the outer context is needed. This new design led to many simplifications in Cog’s JIT (methods and blocks are now handled in a similar way) as well as in the runtime optimizer.
The FullBlockClosure design was added as part of the SistaV1 bytecode set, a set designed to replace the existing SqueakV3PlusClosure bytecode set. As described in our IWST’14 paper (A Bytecode Set for Adaptive Optimizations, Clément Béra and Eliot Miranda), the new bytecode set not only introduces new instructions for adaptive optimisations but also solves several problems of the existing set, such as encoding limits on the number of literals, the number of instance variables and the jump sizes.
Road map
As of now, I am adding an abstraction layer over the bytecode sets in the Pharo image. I’ve done most of the work with the help of Eliot, who did the same in Squeak, but there are still some corner cases to fix.
Once this is finished, I will create a Pharo image using only FullBlockClosures (no inlined BlockClosures) with the SistaV1 bytecode set and put it in production. This requires some changes to the bytecode compiler.
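Once such an image exists, the change should be observable directly from the language side. A hypothetical snippet (the accessor name #compiledBlock is my assumption, not necessarily the final API):

    | closure |
    closure := [ :x | x + 1 ].
    closure class.          "FullBlockClosure rather than BlockClosure"
    closure compiledBlock.  "assumed accessor: the separate CompiledBlock
                             holding the block's bytecodes, itself stored
                             as a literal of the enclosing CompiledMethod"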
Lastly, I am now changing the runtime optimizer to deal with FullBlockClosures, making this the 4th version of the optimizer. FullBlockClosures are easier to handle than the previous closures, but the change still requires deep modifications in the optimizer.
While implementing the 3rd iteration of the optimizer I learnt how Cog’s JIT works and made many small improvements to it, such as the introduction of read-only objects. I noticed some minor problems in the optimizer that need to be fixed, such as the handling of forwarding objects when accessing the instance variables of remote objects (see A Partial Read Barrier for Efficient Support of Live Object-Oriented Programming, Eliot Miranda and Clément Béra) and the handling of context instance variables. I also need to fix the deoptimization process, which should be much easier to implement now that all the closure nonsense is gone.
Hopefully it will take me a couple of months to get that running, and I will then be able to show some new benchmark results. I am still focusing on stability right now; the massive performance boost will come later, once a first version with only ~1.7x is in production. It is much easier to work on performance when the whole thing is working, and we are still missing some inlined operations, so we cannot really optimize object allocation, which explains why we have trouble reaching 2x-3x.
I hope you enjoyed the post.
Stéphane Ducasse said:
Keep pushing Clement!
Henry said:
Cool, I guess that means #once will start working as intended as well 😉
Do you have any estimates of the impact on image size due to the overhead associated with full block objects vs inlined ones?
Also, with the current closure bytecodes:

    A class >> #t
        ^ [ 1 ]

    A t == A t. "=> false"

I guess this will change when switching to a block object in a literal slot; then they are created at compile time, right?
Do you have any numbers on improvements in GC overhead with the new vs old approach in block-heavy benchmarks?
Clement Bera said:
Hi Henry,
On my 49.5 MB image (Pharo 6 plus the code I work on), switching to FullBlockClosure grows the image size by ~200 kB. There might be some minor differences in the final implementation. So a 0.5% memory increase or something like that.
We designed the full block closure (bytecode and classes) to support 3 kinds of blocks (the percentages below are the fraction of blocks each form can express; examples follow the list):
– clean blocks (I don’t remember exactly; between 15 and 30% of blocks): a clean block can’t access any external variable and has no non-local return. The closure is created at compile time and no allocation is needed at runtime.
– copying blocks (97% of blocks): a copying block can access external variables but can’t have a non-local return. The closure is created at runtime, and if needed a temp vector as well, but there is no need to reify the outer context.
– full blocks (the general form, able to express 100% of blocks): both the closure and the outer context are created at runtime, and a temp vector as well, if needed.
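Here are examples of my own for the three kinds (MyCollection is a made-up class):

    "clean block: touches no outer state (no outer temps, no self,
     no instance variables, no non-local return)"
    [ :x | x * 2 ]

    "copying block: reads an enclosing temp; copying its value into
     the closure at creation time is enough"
    | limit |
    limit := 10.
    [ :x | x < limit ]

    "full block: the non-local return (^) forces the outer context
     to be reified so the home method can be returned from"
    MyCollection >> detectFirst: aBlock
        self do: [ :each | (aBlock value: each) ifTrue: [ ^ each ] ].
        ^ nil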
In Pharo right now there are 2 or 3 allocations per block creation (all blocks are treated as full blocks); if we used the 3 types of blocks there would be between 0 and 3 allocations. More precisely, 97% of blocks would need one less allocation, and a small portion of blocks would need no allocation at all. The problem is that clean blocks then behave differently, as you’ve just shown, and they limit debugging, while Pharo’s strength lies largely in its debugging capabilities. Hence I could add the block optimisations as a bytecode compiler setting, but I won’t enable them by default. Obviously I don’t have much spare time, so it might not happen. The Sista inliner removes many blocks anyway, which hopefully will solve most of the problem once it hits production…
I don’t have numbers on the GC improvement expected from the different types of blocks. I’d guess around 1.2x faster in block-intensive benchmarks? Maybe more, but no more than 1.5x for sure. I need to implement it to measure. If I do, I’ll make another post.
gasche said:
My understanding of the “bytecode-to-bytecode optimizer” idea was that optimization decisions from profiling data resulted in a transformation from the un-optimized bytecode to an optimized bytecode sequence with deoptimization guards.
In this post (thanks! I was wondering about Sista), you mention that you are now using an SSA-like IR. Does it mean that you generate the native code directly from the SSA IR, and thus that the design is not bytecode-to-bytecode anymore? If I understand correctly, you start from compiled bytecodes, but the JIT transforms them into SSA, optimizes at this level, and then generates native code. Is this correct? How are you representing de-optimization edges in the SSA form?
Clement Bera said:
Hi, sorry I’ve just seen your comment, I should have approved it much earlier.
It’s close to impossible to optimize directly on the bytecodes, so I convert them to an SSA-like IR, run the optimization passes, then convert back to bytecodes. It is indeed a bytecode-to-bytecode optimizer.
The baseline JIT translates the optimized bytecodes to machine code as it would translate normal methods. There are few optimizations left to do in the back-end for object-oriented languages (based on the experience of other optimizing runtime compilers), so reusing the baseline JIT as a back-end seems to be a good idea.
Deoptimization edges are represented as frame states. Some values in a frame state are nodes defined elsewhere in the IR. When a node is replaced by another node, all references to it are automatically updated, so the frame state is edited along with the rest of the graph.
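A minimal sketch of what such a frame state might look like (the class and selector names are mine, not the optimizer’s actual code):

    Object subclass: #FrameStateSketch
        instanceVariableNames: 'method bytecodePC receiver temporaries stack'
        classVariableNames: ''
        package: 'Sista-Sketch'

    FrameStateSketch >> replaceUses: oldNode with: newNode
        "Frame state values are IR nodes; when the optimizer replaces
         a node, the frame state must point at the replacement so that
         deoptimization can rebuild the unoptimized frame correctly."
        temporaries replaceAll: oldNode with: newNode.
        stack replaceAll: oldNode with: newNode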
Michel said:
Hi Clément,
do you know about the open project Eclipse OMR? In short, it is an effort by IBM to de-Javafy their J9 JVM while keeping everything else (GCs, threading, tooling, diagnostics, monitoring, hardware exploitation (SIMD, GPU, etc.)), including the JIT compiler (for several platforms). Soon they will even open-source J9 itself as a proof-of-concept example and more.
OK, it’s probably too late for your work on the Pharo VM. However, if I were to start a language VM I would probably delve into Eclipse OMR. It also enables a polyglot environment.
Greetings
Clement Bera said:
Hi Michel,
Thanks for your remark.
Just to be clear, I did not build a new VM; I built an optimising JIT on top of an existing VM. Multiple people have tried to convince me to use other frameworks to be more efficient, especially Truffle and the RPython toolchain. Eclipse OMR is also interesting, but it is very recent, so I would guess less mature.
My main concern is that the Pharo VM is used in production in many places. For our customers, if switching to another VM implies having to drop support for a feature (and Smalltalk has lots of strange features), they won’t use the new VM. I built my optimising JIT on top of the existing production VM to be able to move this work to production at some point, for example next year, when I will maintain and evolve the Pharo VM as a job after my PhD.
I hope the Eclipse OMR project will succeed and that the VM community will get new and interesting ideas out of it.
Best,
Michel said:
Yes, I understand your position.
Truffle/Graal isn’t well suited because the object model is still that of Java. They have proposed one, but it seems like overkill for a class-based dynamic language such as Smalltalk.
It is correct that Eclipse OMR seems rather new, but in fact IBMers use it themselves to maintain production J9 and other languages on all their platforms. This is a formidable proof of the overall design, I think. Further, it seems flexible enough for Smalltalk as well, as OMR does not impose any object model (besides a specific byte in the header for GC purposes).
Anyway, have fun with your PhD and your work on the Pharo VM.