This article is out-of-date! Please look at this article, or articles written after June 2016.
For the last 3 weeks, I have been in San Francisco to work on Sista with Eliot Miranda, especially on the communication between the in-image optimizer and the Cog runtime / Cog JIT. Last tuesday, for the first time, the VM detected a hot spot via a tripping counter, then called the in-image optimizer, which optimized the method, installed it, and at the next call the optimized method was used (Do not misunderstand me, on-the-fly stack replacement does not work yet). This is a big step. However, the VM right now does not implement all the additional sista bytecodes and inlined primitives so the optimized methods performance is quite low :-(.
With the current settings we (we = Eliot Miranda and I) have, while running the Pharo kernel tests, 5350 counters tripped, which in fact corresponds to 187 different tripping methods (so yeah, the assumption that the code usually run almost always the same methods is verified). Basically the optimizer was able to optimize 80 methods out of the 187. The optimizer failures were mostly due to:
- (2750/5350 cases) send data information forbidding inlining (Right now the optimization is cancelled if some critical sends are not inlined):
- 950 megamorphic inline cache on the inlining path
- 600 polymorphic inline cache on the inlining path
- 1150 missing send data on the inlining path. This could mean that the counter trip on an uncommon case, i.e., the tripping method was called 65000 times from a certain method, but the last time, when the counter tripped, it was called from another method rarely used. It can also be due to some unexpected machine code garbage collection
- (1250/5350 cases) the VM not being stable enough right now in the mapping machine code pc -> bytecode pc (only the Sista VM is not stable)
Interestingly, some non local returns are very hard to optimize, but in the current model, I had only 17 optimizations out of 5350 that cancelled due to unoptimizable non local returns. Therefore I assume this is not a common case.
Another interesting point, out of over a thousand successful method optimizations, only a 5 of them exceeded 50 ms, which is the maximum pause a user can tolerate. As the Spur garbage collector of Pharo 4 will double Pharo’s performance, this case is therefore also uncommon (at least on my Macbook 2.5Ghz Intel Core i5 with lots of applications running aside).
There is a long way to go before an alpha version though. I noticed that the optimizer decompilation and the inlining optimizations passes are *very* stable (thanks Ronie Salgado for showing me a good design for the inliner), but the range optimizations are not finished, and the byte code generation is not very good and even sometimes leads to incorrect byte codes. My next move will be on the code generation part to stabilize the overall optimizer.
The V8’s design influence on sista is nice, as I was for a long time strongly influenced by LLVM’s design (information about LLVM here), whereas the implementation of static C compiler usually prefers slow algorithm finding the perfect solution for an issue instead of a fast algorithm finding a suboptimal but almost perfect solution. However, you can still feel the C compilers influence by looking at the control flow graph manipulations which are much more aggressive than V8’s ones. In some way we could say that Sista is an hybrid optimizer based on GCC’s, LLVM’s and V8’s design. This is partly because some of Sista’s features (postponing the optimization to the a background process when it takes more than 50 ms, incremental optimizations) allows to spend more time on the optimizer.
The byte code generation have a main issues right now: the number of temporaries in optimized method is huge. There are several reasons for that:
- The deSSA (algorithm that removes phi nodes) performs poorly. This is because I took it from an old prototype, and it is supposed to merge multiple temporaries only if they were the same temporaries in their unoptimized method, forcing the optimizer to create extra temporaries at each inlining pass. The idea was to keep values for the deoptimization process. However, having now implemented the deoptimization and having it working, I can tell you that this is complete non sense. The Sista intermediate representation is completely independent of any temporary representation, in fact a temporary in the Sista IR just means a slot on the stack that have the SSA property. And I didn’t invent anything there, Crankshaft’s intermediate representation hydrogen is also temporary variable independent.
- After inlining multiple methods, the number of temps of each method, and in addition their “hidden” temporaries due to loops or stack top values, are added.
- Currently the deoptimization requires that each value needed in the deoptimization scopes to be put in a temporary variables in the byte code, whereas for most of them I could just use the top of the stack. This is not a real problem, as the solution mostly consists in spending time on improving it.
To improve this issue, I am completely rewriting the conversion from the SSA control flow graph to Opal’s IR, the bytecode generator input. The main difference is that the deSSA algorithm will now merge as many temps as possible thanks to Briggs-Chaitin graph coloring algorithm and use a V8-inspired algorithm for liveness analysis. This should greatly decrease the number of temporaries (especially after inlining).
[BEGIN graph coloring digression …]
The general idea of Briggs-Chaitin graph coloring algorithm is that you want to color each node of a graph such as there can’t be 2 nodes with a relationship having the same color. In the real world, each node is a sista IR temporary (let’s say a stack location, it’s not really a temporary) before the algorithm, and after the algorithm each node having the same color can share the same temporary variable index in the byte code. This algorithm is usually used for register allocation, but sista doesn’t allocate the registers, this is left to the low level JIT. However, my issue of reducing the number of temporary variables is clearly similar to register allocation.
As I am able to visualize everything I do in a few minutes thanks to Roassal (It took me only 4 min 42 sec to program this visualization, Roassal is *so* helpful, thank you Roassal guys), here are some visualizations of the Graph Coloring algorithm:
Now I just need to make it work nicely with my deSSA and byte code generation passes :-). I will be in vacation for the next two weeks so it may take a while.
[END graph coloring digression …]
By the way, I know I still need to write down what method I optimize when a counter trips based on the current execution stack and why (and I said I would do so, I’m sorry I haven’t done it yet). For the ones that have read Urs Hölzle phd, you know that the method to optimize is not necessarily the method that has the tripping counter. Therefore you need to add some logic to find the best method to optimize. That is done and working, but I need to spend time to write about it…
Anyway I still recommend to read Urs phd, my explanation will obviously be less good than Urs’ ones, as he has a better English style and as I discussed with Eliot Miranda recently, the way he explains things make the optimizer’s concepts looking like they are very simple whereas they are very complex, which shows how good he is.
Hope you enjoyed the post !
PS: If you are interested in a specific aspect of Sista that I do not mention in my blog posts, please say it in the comment, I will answer you or perhaps even write a dedicated post.