This article is out-of-date! Please look at this article, or articles written after June 2016.
In this blog post I will explain which methods and block closures I inline when a hot spot is detected in the virtual machine, and why. In other words, this article is the tale of an inliner couple: the man, the stack-based inliner, inlines messages very aggressively and is quite unreliable, whereas his sweet wife, the hot path inliner, swiftly polishes the output of her husband’s brutal work.
Introduction to the problem
For the optimizer, there are two main hooks from the VM:
- hook 1: #conditionalCounterTrippedOn: When the JIT has detected a hot spot, it sends the message #conditionalCounterTrippedOn: to the active context (which then triggers the in-image optimizer).
- hook 2: the sista primitive This primitive allows one to access the send and branch data for a method from the image (native code inline caches and hot spot detector counter values).
The optimizer knows the active context where the hot spot was detected (it’s the receiver of #conditionalCounterTrippedOn:). In addition, it can know the send and branch data for each method through the sista primitive. One could think that the optimizer just needs to optimize the active context’s method, because this is the place where the hot spot was detected. But as stated in Urs Hölzle’s PhD thesis, this is not the case, because the hot spot could have been detected on a getter or some similar method, which you cannot usefully optimize. Therefore, the optimizer should walk up the stack from the active stack frame and look for the best spot to optimize, inlining all messages in-between. Fortunately, Smalltalk provides an easy way to do this, because the stack is reified as a linked list of contexts, and the optimizer has access to the active context.
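The walk over the reified stack can be sketched roughly as follows. This is an illustrative Python sketch, not the Sista code: Context, is_worth_optimizing and find_spot_to_optimize are invented names, and the real predicate deciding what is "worth optimizing" is of course more subtle than a selector check.

```python
# Hypothetical sketch: walk up the sender chain (the linked list of contexts)
# from the active context until a method worth optimizing is found.
class Context:
    def __init__(self, method, sender=None):
        self.method = method   # selector of the method activated here
        self.sender = sender   # caller context, or None at the top of the stack

def find_spot_to_optimize(active_context, is_worth_optimizing):
    """Return the context to optimize and the methods to inline in-between
    (e.g. a getter where the hot spot tripped but which cannot be optimized)."""
    context = active_context
    to_inline = []
    while context.sender is not None and not is_worth_optimizing(context.method):
        to_inline.append(context.method)
        context = context.sender
    return context, to_inline

# Usage: the counter tripped in a trivial getter; its caller absorbs it.
outer = Context("indexOf:")
getter = Context("size", sender=outer)
spot, inlined = find_spot_to_optimize(getter, lambda m: m != "size")
```

Here `spot` ends up being the `indexOf:` activation and `inlined` lists the getter that must be inlined into it.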
In our case, the hot spots can be detected only on methods with conditional jumps. Therefore, in addition to the stack-based inlining pass, one needs to inline further methods in order to optimize methods that cannot be detected as hot spots. This is why there are two inliners: one inlines based on the current stack, the other inlines additional messages based on the usage frequency of each basic block.
The manly inliner: stack based
The manly inliner is a classic of the literature on efficient non-meta-tracing JIT compilers. His main job is to walk up the stack from the active context and look for the best activation record to optimize. In the Self VM, there used to be heuristics on how many stack frames to walk up to find the best fit. However, according to the Strongtalk reports (those are very hard to find, because they’re probably highly confidential *cough*, and you can only get snippets of information by discussing directly or indirectly with one of the Strongtalk implementors), it seems that a simpler heuristic works as well if not better.
I consider that the optimizer is incremental, meaning that it can reoptimize an optimized method several times, up to the point where it cannot be optimized any further. I mention that explicitly because it seems that V8’s optimizing compiler, Crankshaft, can only optimize completely unoptimized methods, and cannot reoptimize a method it has already optimized (I guess this is due to V8’s need to reach its maximum speed very fast, but I may be wrong, please correct me if I am). Therefore, it is enough to inline one method at a time based on the current stack: if the method needs additional inlining to reach maximum performance, the hot spot will be detected again and the method will be optimized again, up to the point where the method is no longer detected as a hot spot or has reached its maximum optimization potential.
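The incremental strategy can be pictured with a small Python sketch. Everything here is illustrative: the call chain, the `still_hot` predicate and the one-frame-per-round rule are a simplification of what the text above describes, not the real optimizer loop.

```python
# Hedged sketch of incremental reoptimization: each time the hot spot trips,
# the optimizer inlines exactly one caller frame; repeated trips widen the
# optimized scope one level at a time.
def optimize_incrementally(call_chain, still_hot):
    """call_chain lists selectors from the hot method up through its callers.
    Each round inlines everything below into the next frame up."""
    depth = 1
    while depth < len(call_chain) and still_hot(depth):
        depth += 1   # one more caller frame absorbed per detected hot spot
    return call_chain[depth - 1]   # the method that ends up holding the code

# With the indexOf: example, two rounds are needed before the block passed
# to indexOf:ifAbsent: can finally be inlined at the indexOf: level.
result = optimize_incrementally(
    ['indexOf:startingAt:ifAbsent:', 'indexOf:ifAbsent:', 'indexOf:'],
    still_hot=lambda depth: depth < 3)
```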
SequenceableCollection>>indexOf: anElement startingAt: start ifAbsent: exceptionBlock
	start to: self size do:
		[ :index |
		(self at: index) = anElement ifTrue: [ ^ index ] ].
	^ exceptionBlock value

SequenceableCollection>>indexOf: anElement ifAbsent: exceptionBlock
	^ self indexOf: anElement startingAt: 1 ifAbsent: exceptionBlock

SequenceableCollection>>indexOf: anElement
	^ self indexOf: anElement ifAbsent: [ 0 ]
Let’s say you have a stack with these three methods and that a counter trips in indexOf:startingAt:ifAbsent:. A good heuristic would determine that you need to inline the stack at least up to indexOf: to be able to inline the block and increase performance (note that while explaining this example I realized that the block is useless, we can have just “0”, so I complained and reported a bug, and this method may be implemented differently later on if you look at it). However, my inliner will just inline indexOf:startingAt:ifAbsent: into indexOf:ifAbsent:. But as the method will still be able to be further optimized, the virtual machine will detect it again as a hot spot, and this time it will inline one additional method based on the stack, so up to indexOf:.
The approach I take has several advantages. The end result is that both indexOf:ifAbsent: and indexOf: will be optimized. Therefore, if indexOf:ifAbsent: has several callers, they will all be faster. In addition, the manly inliner is unreliable: it may be that for the current activation indexOf: ended up calling indexOf:startingAt:ifAbsent:, but the most common case may be another method calling indexOf:startingAt:ifAbsent: (in fact the current stack may be an uncommon case). With a multi-frame heuristic, you may end up spending time optimizing an uncommon case; this is less likely to happen here, and if it does happen the optimizer will lose less time. Moreover, if a trap trips, triggering deoptimization, it may be that indexOf: will be deoptimized, but the deoptimized method will call indexOf:startingAt:ifAbsent:, which will still be optimized and may not need to be deoptimized. Therefore you end up with some partial deoptimization.
There are a few drawbacks. The most noticeable issue is that the optimizer takes a bit more time to reach its maximum speed. But we do not really care, because it usually takes less than 30 seconds to reach maximum speed and a Pharo application usually runs for days (if not months or years). In addition, Sista has a feature allowing the optimizations to be saved, so they do not need to be done again at the next start-up. The other problem is that I may end up with code blow-up, having too many optimized methods. However, this can be handled by discarding the least recently used optimized methods on a regular basis (maybe at each GC?).
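The discard policy mentioned above is a plain LRU cache. A minimal Python sketch, purely illustrative (the class and method names are made up, and the real VM would evict compiled code rather than dictionary entries):

```python
from collections import OrderedDict

# Keep optimized methods in least-recently-used order and evict the stalest
# ones when the cache grows past a capacity (e.g. on a regular basis, at GC).
class OptimizedMethodCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.methods = OrderedDict()   # selector -> optimized code

    def activate(self, selector, code=None):
        if selector in self.methods:
            self.methods.move_to_end(selector)   # mark as recently used
        elif code is not None:
            self.methods[selector] = code        # register a new optimization

    def discard_least_recently_used(self):
        while len(self.methods) > self.capacity:
            self.methods.popitem(last=False)     # drop the stalest entry

cache = OptimizedMethodCache(capacity=2)
cache.activate('foo', 'opt-foo')
cache.activate('bar', 'opt-bar')
cache.activate('foo')                    # foo becomes most recently used
cache.activate('baz', 'opt-baz')
cache.discard_least_recently_used()      # evicts bar, the LRU entry
```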
In the current implementation, for a method activation, the manly inliner walks up the stack by one stack frame and inlines the message send aggressively. As the manly inliner is merciless, he cancels the optimization if he does not succeed in inlining the message. I designed it this way because I may end up optimizing methods which cannot be optimized if I don’t cancel the optimization. But this could be changed later, or become a setting. His unreliability comes both from this cancellation policy and from the unreliability of the stack (the current activation may be an uncommon case).
The manly inliner takes special care of block activations. He considers that a block activation can only be inlined if he can inline it up to its home context. Therefore, for a block activation, he walks up the stack to the blockClosure’s homeContext, and aggressively inlines all the messages in-between, failing the optimization if one of them cannot be inlined. His twisted mind has limitations, however. After walking up 10 stack frames or reaching the top of the stack, if he hasn’t found the blockClosure’s home context, he gets bored and cancels the optimization altogether.
I omit some details, but basically you also need to check, while inlining a blockClosure activation up to its homeContext, that there are no block activations in-between; if there are, the manly inliner inlines them too or cancels the optimization.
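The block-activation rule can be sketched as follows. Again an illustrative Python sketch: Frame, home_context and the helper name are stand-ins, and the real inliner walks reified contexts, not these toy objects; only the 10-frame limit and the give-up behaviour come from the text above.

```python
# Walk up at most 10 frames looking for the closure's home context;
# return the frames to inline in-between, or None to cancel the optimization.
MAX_WALK = 10

class Frame:
    def __init__(self, sender=None, home_context=None):
        self.sender = sender                # caller frame, None at top of stack
        self.home_context = home_context    # home of the block, if this is a block

def frames_to_inline_for_block(block_activation):
    frames = []
    context = block_activation.sender
    for _ in range(MAX_WALK):
        if context is None:
            return None                     # reached the top of the stack: give up
        frames.append(context)
        if context is block_activation.home_context:
            return frames                   # found the home context: inline all these
        context = context.sender
    return None                             # walked 10 frames: he gets bored

# Usage: a block whose home context sits two frames up.
home = Frame()
mid = Frame(sender=home)
block = Frame(sender=mid, home_context=home)
frames = frames_to_inline_for_block(block)  # both intermediate frames to inline
```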
The girly inliner: hot path based
After the aggressive work of her husband, the girly inliner polishes the resulting method. Instead of relying on the unreliable stack, she relies on branch data. The JIT compiler provides information about each conditional jump: it makes available the number of times each branch has been taken.
These counter data may not be accurate in a specific context. Let’s say you have inlined SequenceableCollection>>indexOf: in your method; you may know that in your context indexOf: always finds the index of the element. On the other hand, SequenceableCollection>>indexOf: may also be called from other methods, and it may happen that the branch data of SequenceableCollection>>indexOf: tells you that sometimes the element is not found and the exception block is executed, even if this is not true in your context. But the girly inliner is very sweet: these inaccuracies may prevent her from performing some inlinings, but the method will always result in correct executable code. Moreover, as the optimizer is incremental, the method may trip again, and the next time the branch data will be correct, because the branch information will come from the inlined branch data, which corresponds to the method’s code in this specific context only.
Based on the branch data, the girly inliner computes a heat for each basic block. The higher the heat of a basic block, the more frequently the basic block is used. Then, she inlines the message sends of the method node in a single pass over all the basic block statements, with an aggressiveness relative to the basic block’s heat. If the basic block is never used (heat = 0), she does not inline anything, because the basic block will be removed by the dead code elimination pass. If the basic block is cold, she inlines only direct blockClosure evaluations. If it’s tepid, she inlines methods with a small number of bytecodes. If it’s hot, she inlines as many messages as possible.
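The tiered policy can be sketched like this. Only the four tiers (dead, cold, tepid, hot) come from the text; the thresholds, the heat formula and all the names are invented for illustration, since the post does not give the actual values.

```python
# Illustrative thresholds: anything below COLD_HEAT is cold, below TEPID_HEAT
# is tepid, the rest is hot. A heat of exactly 0 means the block is dead.
COLD_HEAT, TEPID_HEAT = 0.1, 0.5
SMALL_METHOD_BYTECODES = 10

def heat(taken_count, total_count):
    """Heat of a basic block: fraction of executions that reach it."""
    return 0.0 if total_count == 0 else taken_count / total_count

def should_inline(block_heat, send):
    if block_heat == 0:
        return False          # dead: the block is removed by dead code elimination
    if block_heat < COLD_HEAT:
        return send['kind'] == 'block-evaluation'        # cold: only direct [ ] value
    if block_heat < TEPID_HEAT:
        return send['bytecodes'] <= SMALL_METHOD_BYTECODES   # tepid: small methods only
    return True               # hot: inline as many messages as possible

hot_send = {'kind': 'send', 'bytecodes': 40}
```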
Contrary to her husband, the girly inliner never cancels the optimization: if she finds a message that she cannot inline (no send data, polymorphic send data targeting different methods, or megamorphic send data), she just does not inline it.
The girly inliner is really important because, as the hot spot detection is based on conditional jump counters, hot spots in methods with no conditional jumps are not detected. Therefore, without her, her husband would end up with lots of performance-critical unoptimized methods.
foo
	self condition
		ifTrue: [ self bar ]
		ifFalse: [ self baz ]
Let’s say a hot spot is detected on the conditional jump in #foo.
Here is the call graph:
As you can see, after the hot spot detection, the manly inliner inlines #foo into #example. In fact, as it is a method activation, he always inlines it one frame up the stack. Then, the girly inliner inlines all the other messages at the first level of depth (she does not inline the #logCr in #bar and #baz because they are at the second depth level). As the call to #baz in #foo is in a cold basic block (in fact, this branch is never taken, so #baz is never called, and the heat of the basic block is 0), she does not inline that message.
Side note: data flow and control flow analysis
As I was looking deep into V8’s optimizing compiler, Crankshaft, and especially its hydrogen optimizations, which correspond to what I am doing, I noticed that Crankshaft first performs the control flow oriented optimizations (inlining, dead branch elimination) and then the data flow oriented ones (casting from Number to int32/double, range analysis, and lastly common subexpression elimination and loop invariant code motion; by the way, these last two passes are done in a single pass named GlobalValueNumbering, but the name seems to be there just to fool young padawans, as it seems it does not perform any value numbering at all).
So I wanted to do the same. But it does not work. Why ?
The issue is that in Smalltalk, BlockClosures are very common and performance-critical. To inline a BlockClosure, I need to be sure that the receiver of BlockClosure>>#value (and similar block activation messages) is a specific block closure whose code I have. Therefore I need data flow oriented analysis to find out that the receiver of a message is a specific block closure and not another object. So at the beginning I have both control flow and data flow oriented optimizations, and then only data flow optimizations. It is convenient to perform as many data flow optimizations as possible at the same time, because you can then assume that the control flow graph will not be edited; this lets you do things such as building a collection of all the basic blocks sorted in the order you like (pre-order traversal for me) and iterating over this collection to visit all the basic blocks (which is very fast compared to iterating over a graph where you only know the successor and predecessor nodes of a given node).
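The pre-order trick above can be sketched in a few lines of Python. BasicBlock is an invented stand-in for the optimizer’s real node class; the point is only that, once the CFG is frozen, a flat pre-order list can be computed once and reused by every data flow pass.

```python
# Compute a pre-order list of basic blocks once; later passes iterate over
# the flat list instead of re-walking successor edges through the graph.
class BasicBlock:
    def __init__(self, name, successors=()):
        self.name = name
        self.successors = list(successors)

def preorder(entry):
    """Pre-order traversal of the CFG, visiting each block exactly once."""
    seen, order = set(), []
    def visit(block):
        if block.name in seen:
            return
        seen.add(block.name)
        order.append(block)          # visit the block before its successors
        for succ in block.successors:
            visit(succ)
    visit(entry)
    return order

# Diamond CFG: entry branches to two blocks that join at exit.
exit_ = BasicBlock('exit')
then_ = BasicBlock('then', [exit_])
else_ = BasicBlock('else', [exit_])
entry = BasicBlock('entry', [then_, else_])
names = [b.name for b in preorder(entry)]   # a flat, reusable ordering
```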
By the way, readers, if you think I am wrong in some aspects of one of the Sista articles, I’d be very happy if you could prove it and explain to me why I am wrong (but I’m not interested in people telling me “Your project sucks” without concrete explanations of why the project is not going in the right direction).
Note: I will take a short break from writing blog posts in the next few months as I have many things to do. On the other hand, I am writing a paper about the new Pharo byte code set, also known as the Sista V1 byte code set, for the ESUG (European Smalltalk User Group) workshop, and I will put the paper on this blog as soon as it is ready, so that there is only one place with all the sista documentation.
As always, I hope you enjoyed this post 🙂