Hey guys,

    Today I’ll explain how to use NativeBoost as an FFI (foreign function interface) to bring your Pharo application up to C speed.

EDIT: I fixed the FFI plugin speed comparison, which was not correct.

Introduction

    A few weeks ago, Esteban Lorenzano pointed me to a video game, EVE Online, which is partially implemented in Python. It is nice to see a video game written in Python, because the language is higher level than C or C++ and allows a faster and easier development process. However, EVE Online is a real-time 3D video game, so it needs a lot of performance, and Python is much slower than C++ or C. A few days later, I talked with Igor Stasenko, the guy who implemented most of NativeBoost, and asked him:
    – If you were to implement an application that needs a lot of performance, for example a real-time 3D video game, would you do it in Pharo, or would you consider that Pharo cannot reach that performance level and implement it in C++ or C?
Igor answered:
    – Actually I would do it in Pharo. I think the way video game companies should implement video games is the following: implement the game in a high-level language such as Pharo, and in the last development cycles, when the game turns out to be too slow, profile it, find the slowest methods, reimplement them in C, and link them to the Pharo application through FFI calls.

    Profiling is easy: Pharo provides Andreas’ profiler (World Menu >> Tools >> Time profiler). But I didn’t know how to link a C library to Pharo through an FFI. To satisfy my curiosity, I had to find out how to do it.
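
You can also invoke the profiler on a block straight from a workspace. Here is a minimal sketch using MessageTally (which, as far as I know, is what the Time profiler tool is built on); the block itself is just an arbitrary example:

"Profile a block and show where the time is spent"
MessageTally spyOn: [ 1000 timesRepeat: [ 200 factorial ] ]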

Why did I use NativeBoost as an FFI?

    First of all, Pharo has three virtual machine plugins that implement FFI calls: the Alien plugin, the FFI plugin and the NativeBoost plugin. As I wanted to use FFI calls to improve speed, I checked the benchmarks of these plugins. Basically, Alien is slower than FFI, which is itself slower than NativeBoost. Depending on the call, the Alien and FFI plugins can be from 10% to 1000% slower than NativeBoost. So the best solution is to use NativeBoost as the FFI.

Implementing the FFI calls

    To check the performance, I needed a benchmark. I chose a fibonacci-like bench for multi-core computers: fib4, which sums the previous four values.

In Pharo:

Integer>>fib4
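    "Fibonacci variant that sums the previous four values; used here as a CPU-bound benchmark."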
    ^ self < 4
        ifTrue: [1]
        ifFalse: [(self-1) fib4 + (self-2) fib4 + (self-3) fib4
            + (self-4) fib4]

Now I needed to implement the fib4 method in C (full C file here):

#include <stdio.h>
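
/* fib4: same fibonacci variant as the Pharo method above (sum of the previous four values) */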
int fib4(int k) {
    if (k < 4) {
        return 1;
    } else {
        return (fib4(k-1) + fib4(k-2) + fib4(k-3) + fib4(k-4));
    }
}

    Then, since it is C, it has to be compiled with a C compiler. I saved the C file as ‘fib4.c’, so the compilation commands with gcc are the following (the -m32 flag is needed because the Pharo VM is 32-bit):
-> Generating the fib4.o
    gcc -c -m32 fib4.c
-> Generating the dynamic library fib4.dylib. Use .dll on Windows or .so on Linux instead of .dylib
    gcc -shared -m32 -o fib4.dylib fib4.o

Back to Pharo, I added the method:

Integer>>fib4NB
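    "Call the C function fib4 from the shared library below: the receiver is passed as a C int and the C int result is returned."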
    <primitive: #primitiveNativeCall module: #NativeBoostPlugin error: errorCode>
    ^ self
        nbCall: #( int fib4 (int self) )
        module: '/Users/bera/Desktop/fib4.dylib'

    As my fib4.dylib is on the Desktop, the absolute path is ‘/Users/bera/Desktop/fib4.dylib’. I had to be careful: when I changed the C code to experiment and recompiled the dylib, NativeBoost still had references to the old dylib in memory, so each time I recompiled the dynamic library I had to restart the Pharo image.

I checked in a workspace that both methods work fine.
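
For example, evaluating the following two lines in a workspace should return the same value twice (30 is just an example argument, not the value used for the benchmarks below):

30 fib4.    "pure Pharo implementation"
30 fib4NB.  "C implementation called through NativeBoost"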

Benchmarks

    My machine is a MacBook Pro running Mac OS X 10.8, with a 2.5GHz Intel Core i5 and 8GB of RAM. The Pharo image runs on top of the Pharo VM, which is a fork of the CogVM. However, my benchmarks may not be very accurate (I didn’t close all the other applications while running them, which can lead to side effects).

    First, I ran both fib4 and fib4NB several times to warm up the JIT and to let NativeBoost link the external library (the link is done only at the first call). Then I used the SMark framework for the benchmarks. Here are the results:

Report for: BenchFFISuite
    Benchmark Fib4NB
    Fib4NB total: iterations=20 runtime: 428.20ms +/-0.62
    Benchmark Fib4
    Fib4 total: iterations=20 runtime: 825.6ms +/-3.7
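
For reference, a suite producing a report like this one looks roughly as follows. This is a minimal sketch assuming SMark’s usual conventions (subclass SMarkSuite, prefix benchmark methods with ‘bench’, run the suite with run:); the argument 30 is just a placeholder value:

SMarkSuite subclass: #BenchFFISuite
    instanceVariableNames: ''
    classVariableNames: ''
    category: 'BenchFFI'

BenchFFISuite>>benchFib4
    "Pure Pharo implementation"
    30 fib4

BenchFFISuite>>benchFib4NB
    "C implementation called through NativeBoost"
    30 fib4NB

Evaluating BenchFFISuite run: 20 in a workspace runs each bench method 20 times and prints a report like the one above.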

    Of course, fibonacci is not the best example to show the performance difference, and GCC is not the fastest C compiler either. But we can see that in this case the C implementation is twice as fast as the Pharo one.

What is important to note is:
    – In 2 minutes, I can link an external C library to my Pharo application with NativeBoost and use it.
    – Within a few minutes, I can massively speed up my Pharo app with NativeBoost FFI and some C code.

That’s all. Hope you enjoyed it 🙂
