Hi everyone,

It’s been a long time I have not written on the Sista project we are doing with Eliot. For the new readers, Sista is the infrastructure we are adding to the Cog VM to support runtime optimisations through dynamic recompilation / deoptimization. There is a good reason for that, when I started to write about Sista I was lacking knowledge about compiler IR and I didn’t want to write anymore things I don’t fully understand not to make mistakes. My knowledge about compiler design increased every week for the past years, and I feel confident again to write blog posts about it.

Just as a context, we started this work 2 years and a half ago. At that time, I was working on Sista 1 day a week. I am now full-time on the project… But ! I teach on average one day per week, I write research papers for my phd, I am writing tutorials on Smalltalk virtual machine development (See here) and I am also contributing on side-projects like read-only objects (incoming IWST paper). So… In practice I’m working around three days per week on the project on good weeks, 4 hours per week when I am writing a paper for a research conference. Eliot dedicated a month every year to work with me (3 times April) and gives me a day of work from time to time to help me.

Work done

The first version of the runtime optimizer I worked on was based on previous people work. This version had a stack based-IR and copied many objects. It was very difficult to evolve without memory blow-ups and the stack representation was difficult to tackle for compiler optimizations. I made two iterations on this work, the first one without dynamic deoptimization metadata generated by the optimizing compiler, the second one with it, and none of them was stable enough to run efficiently benchmark.

I then made a third version using a Control Flow Graph Single Static Assignment IR, designed based on the LLVM IR and Javascript V8’s Crankshaft optimizing compiler. With that new version, I was able last fall to run some benchmarks. On the Shoutout benchmarks (nbody, threadring, binarytrees) and the integer benchmarks (fibonacci, integer benchmarks), the performance boost was between 1.6x and 1.8x. At that point, I had worked mainly on stability and spent close to no time tweaking the performance. I’ve only implemented a couple of optimizations (inlining, bounds and overflow check elimination in loops, constant propagation/folding). As expected, the optimized method were able to be saved as part of the snapshot and starting a hot image allows the VM to reach peak performance quicker.

Then I worked on the optimizer stabilization as bigger benchmarks were crashing. Especially, I tried to validate that each uncommon deoptimization path is correct to be able to deploy the optimizer without risking crashes. Unfortunately, many deoptimization points were not restoring correctly the stack…

The closure problem

The main problem I had was with closures. In the Cog VM clients, the closure bytecodes is inlined in the enclosing method or closure. In addition, at closure creation time, the closure fetches the receiver and the method from which the bytecodes has to be executed from the outer context. These specificities makes it really hard to inline methods holding closure, as there is no instruction to create a closure with bytecodes / receiver that have to be fetched elsewhere than from the outer context, while the outer context does not hold the correct information in case of inlining. I ended up with very large and complex deoptimization metadata, very hard to stabilize. Deoptimization sometimes required to deoptimize multiple optimize frames, which was very tricky to handle.

Last spring I visited Eliot and we decided (I convinced him after some debate) to move to a simpler closure model. In the new model, called FullBlockClosure, a closure’s bytecode is in a separated compiled method object (in fact, we created CompiledCode and two subclasses CompiledMethod and CompiledBlock). The full closure creation bytecode specifies the literal index of the compiled block, the number of copied values and in addition if the receiver has to be fetched from the outer context or on stack as well as if the outer context reference is needed. This new design implied lots of simplification in Cog’s JIT (now methods and bloks are handled in a similar way) as well as in the runtime optimizer.

The FullBlockClosure design was added as part of the SistaV1 bytecode set, a set designed to replace the existing SqueakV3PlusClosure bytecode set. As described in our IWST’14 paper (A bytecode set for adaptive optimizations Clement Bera, Eliot Miranda), the new bytecode set does not only introduce new instructions for adaptive optimisations but also solves different problems of the existing bytecode set, such as encoding limitations for the number of literals, instance variables or jump size.

Road map

As for now, I am trying to add abstraction over the bytecode sets in the Pharo image. I’ve done most of the work with the help of Eliot who did it in Squeak, but there are still some narrow cases to fix.

Once this will be finished, I will create a Pharo image using only FullBlockClosures (no inlined BlockClosure) with the SistaV1 bytecode set and put it in production. This requires some changes to the bytecode compiler.

Lastly, I am now changing the runtime optimizer to deal with FullBlockClosure, so it’ll be the 4th version of the optimizer. The FullBlockClosures are easier to handle than the previous closures, but it still requires deep changes in the optimizer.

While implementing the 3rd iteration of the optimizer I learnt how Cog’s JIT work and made lots of different small evolutions on it, such as the introduction of read-only objects. I noticed some minor problems in the optimizer that needed to be fixed, such as the management of forwarding objects for remote object instance variable accessing (See A partial Read Barrier for Efficient Support of Live Object-Oriented Programming Eliot Miranda Clement Bera) or the management of context instance variables. I also need to fix the deoptimization process, which should be much easier to implement, as there is no more all the closure non-sense.

Hopefully it’ll take me a couple of months to get that running, and I will be able to show some new bench results. I am still focusing on stability right now, and the massive performance boost will come later, once the first version will be in production with only ~1.7x. It’s much easier to work on performance while the whole thing is working, and we’re missing some inlined operations so we can’t really optimize object allocation, which explains the trouble we have to reach 2x-3x.

I hope you enjoyed the post.

 

Advertisements