A couple of weeks ago, the new closure implementation I co-designed and co-implemented with Eliot Miranda reached the Pharo 6 alpha image. We are now looking into the failing tests so that the new closures can be used by default in Pharo 6. That should happen before the Pharo 6 freeze in mid-November.
The new closure design, called FullBlockClosure, allows, among other things, the implementation of the copying block and clean block optimizations. The VM support is already there and only in-image development is required (especially Opal compiler changes).
The main performance problem with blocks is the allocation of multiple objects. The creation of a block in Pharo requires the allocation of:
- the FullBlockClosure object.
- the outerContext object if it’s not already allocated (for example by another block creation in the same context).
- a temp vector to represent efficiently some remote variables, if required by the semantic analysis performed at bytecode compilation time.
In total, a block creation requires between 1 and 3 object allocations, with 2 allocations on average.
The only allocation already optimized to its full potential is the temp vector allocation: it is allocated only when needed, and its absence has no side effect. However, the two other allocations (the outer context and the closure) are always performed, even when they are not required. The two optimizations I am going to describe avoid allocating these objects. Unlike the temp vector optimization, these optimizations may have a cost, as some debugging features may no longer be available.
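To make the temp vector case concrete, here is a small Playground sketch. The variable names are made up for illustration; the point is only the difference between a temporary that is read inside a block and one that is assigned inside it.

```smalltalk
"Playground sketch: two blocks that capture their environment differently."
| copied counter readOnly increment |
copied := 10.
counter := 0.

"'copied' is only read, and never assigned after the block is created:
 its value can be copied into the closure, so no temp vector is needed."
readOnly := [ copied + 1 ].

"'counter' is assigned inside the block: it has to live in a temp vector
 shared between the block and the enclosing code."
increment := [ counter := counter + 1 ].

increment value.
increment value.
readOnly value.   "answers 11"
counter.          "answers 2"
```

Creating `increment` is the kind of block creation that needs the extra temp vector allocation; creating `readOnly` is not.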
With FullBlockClosures, the closure’s outerContext is needed *only* to perform non local returns. Hence, if the bytecode compiler detects at compilation time that the closure does not have a non local return, it can generate a different bytecode instruction to create the closure without allocating the outer context.
It’s difficult to give a precise estimate as it depends on the programmer’s style, but usually fewer than 5% of the closures present in the code have a non local return. This optimization therefore allows most closures (> 95%) to be more efficient by allocating only the closure and, if necessary, the temp vector at closure creation time, without allocating the outer context.
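As an illustration, here are two methods written for this post on a hypothetical class Example (the selectors are made up). The first one contains a non local return inside a block, the second one does not.

```smalltalk
Example >> firstNegativeIn: aCollection
    "The do: block contains ^ (the ifTrue: block is inlined by the compiler).
     The ^ returns from firstNegativeIn: itself, so the closure needs
     its outer context."
    aCollection do: [ :each | each < 0 ifTrue: [ ^ each ] ].
    ^ nil

Example >> sumOf: aCollection
    "No ^ inside the block: the closure never needs its outer context,
     so it is a candidate for the copying block optimization.
     'sum' is assigned inside the block, so a temp vector is still needed."
    | sum |
    sum := 0.
    aCollection do: [ :each | sum := sum + each ].
    ^ sum
```

With the optimization enabled, only the block in `firstNegativeIn:` would still force the allocation of the outer context.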
Some closures not only have no non local return, but also do not use any remote variables. In this case, the closure is in fact just a simple function. If the bytecode compiler detects such a closure, it can create the FullBlockClosure instance at compilation time, avoiding all allocations at runtime.
In practice, though it depends on the programmer’s style and on the application, roughly 30% of closures neither access any remote temporary variable nor perform any non local return. This optimization provides a huge speed-up for these closures.
This optimization can also be used to write code in places where object allocation is not allowed (real-time libraries, etc.).
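For illustration, here is a Playground sketch with the kind of block that could be compiled as a clean block, next to one that could not. The variable names are made up.

```smalltalk
| doubled factor scaled |
"This block uses only its own argument: nothing from the enclosing code
 and no ^. It is a candidate for the clean block optimization."
doubled := #(3 1 2) collect: [ :each | each * 2 ].

"This block reads the outer temporary 'factor', so it is not clean
 (it can still be a copying block)."
factor := 10.
scaled := #(3 1 2) collect: [ :each | each * factor ].

doubled.   "answers #(6 2 4)"
scaled.    "answers #(30 10 20)"
```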
Issue 1: debugging
Copying blocks and clean blocks do not have any reference to their outer context as the reference is not used for normal execution.
However, when the programmer edits code from a closure activation in the debugger, the debugger restarts the home context code or, in the rare cases where the outer context is dead, shows “method not found on stack”. For copying and clean blocks, the outer context reference is not present, hence the debugger would always display “method not found on stack”.
The debugger can be improved with some workarounds, but they lead to non-obvious bugs while debugging.
So the question is: do we want closure performance over this kind of debugging? Is this debugging feature heavily used or not?
Other Smalltalk runtimes have the optimization and nobody seems to be complaining.
Issue 2: IDE tools
In the case of clean blocks, the literal frame of a compiled code object can now hold FullBlockClosure instances, which was not possible in the past. This leads to some complications: for example, a tool that wants to scan the bytecodes of both a method and its inner blocks needs to look for FullBlockClosure literals, reach the compiled block from there, etc.
Issue 3: Block identity
Another problem with clean blocks is identity. Normally a method answering a block answers a different instance at each execution, which is not the case for clean blocks.
Example>>cleanBlock
    ^ [ ]
Example new cleanBlock == Example new cleanBlock
The DoIt answers true with the clean block optimization and false without it.
I will add these optimizations as an option (not activated by default), then I will evaluate the performance and reconsider.
In the context of my work on the runtime optimizer, copying blocks look interesting, as they drastically reduce the amount of deoptimization metadata in multiple cases without making anything more complex.
On the other hand, clean blocks have more drawbacks than copying blocks and make things more complex in the optimizer (like handling FullBlockClosure literal inlining).