Marshaling object code for a Numba function - llvm

I have a problem that could be solved by Numba: creating Numpy ufuncs for a query server to (a) coalesce simple operations into a single pass over data, reducing my #1 hotspot (memory bandwidth), and (b) to wrap up third party C functions as ufuncs on the fly, providing more functionality to users of the query system.
I have an accumulator node that splits up the query and collects results and compute nodes that actually run Numpy (distinct computers in a network). If the Numba compilation happens on the compute nodes, it will be duplicated effort since they're working on different data partitions for the same query--- same query means same Numba compilation. Moreover, even the simplest Numba compilation takes 96 milliseconds--- as long as running a query calculation over millions of points, which is time better served on the compute nodes.
So I want to do a Numba compilation once on the accumulator node, then send it to the compute nodes so they can run it. I can guarantee that both have the same hardware, so the object code is compatible.
I've been searching the Numba API for this functionality and haven't found it (apart from a numba.serialize module with no documentation; I'm not sure what its purpose is). The solution might not be a "feature" of the Numba package, but a technique that takes advantage of someone's insider knowledge of Numba and/or LLVM. Does anyone know how to get at the object code, marshal it, and reconstitute it? I can have Numba installed on both machines if that helps, I just can't do anything too expensive on the destination machines.

Okay, it's possible, and the solution makes heavy use of the llvmlite library under Numba.
Getting the serialized function
First we define some function with Numba.
import numba
@numba.jit("f8(f8)", nopython=True)
def example(x):
    return x + 1.1
We can get access to the object code with
cres = list(example.overloads.values())[0]  # 0: first and only type signature
elfbytes = cres.library._compiled_object
If you print out elfbytes, you'll see that it's an ELF-encoded byte array (bytes object, not a str if you're in Python 3). This is what would go into a file if you were to compile a shared library or executable, so it's portable to any machine with the same architecture, same libraries, etc.
There are several functions inside this bundle, which you can see by dumping the LLVM IR, for example with print(cres.library.get_llvm_str()).
The one we want is named __main__.example$1.float64 and we can see its type signature in the LLVM IR:
define i32 @"__main__.example$1.float64"(double* noalias nocapture %retptr, { i8*, i32 }** noalias nocapture readnone %excinfo, i8* noalias nocapture readnone %env, double %arg.x) #0 {
  %.14 = fadd double %arg.x, 1.100000e+00
  store double %.14, double* %retptr, align 8
  ret i32 0
}
Take note for future reference: the first argument is a pointer to a double that gets overwritten with the result, the second and third arguments are pointers that never get used, and the last argument is the input double.
(Also note that we can get the function names programmatically with [x.name for x in cres.library._final_module.functions]. The entry point that Numba actually uses is cres.fndesc.mangled_name.)
We transmit this ELF and function signature from the machine that does all the compiling to the machine that does all the computing.
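The answer does not say how the bytes travel between machines. As a minimal sketch, assuming a simple length-prefixed frame is acceptable: the frame layout and the helper names pack_payload and unpack_payload are hypothetical, and fake_elf stands in for the real elfbytes.

```python
import struct

HEADER = "<IQ"  # hypothetical layout: 4-byte name length, 8-byte code length

def pack_payload(symbol_name: str, object_code: bytes) -> bytes:
    """Bundle the entry-point name and the object code into one frame."""
    name = symbol_name.encode("utf-8")
    return struct.pack(HEADER, len(name), len(object_code)) + name + object_code

def unpack_payload(frame: bytes):
    """Inverse of pack_payload; returns (symbol_name, object_code)."""
    name_len, code_len = struct.unpack_from(HEADER, frame, 0)
    start = struct.calcsize(HEADER)
    name = frame[start:start + name_len].decode("utf-8")
    code = frame[start + name_len:start + name_len + code_len]
    return name, code

fake_elf = b"\x7fELF" + bytes(16)  # placeholder, not a real object file
frame = pack_payload("__main__.example$1.float64", fake_elf)
name, code = unpack_payload(frame)
```

Sending the mangled name along with the bytes means the compute node does not need Numba to work out which symbol to look up.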
Reading it back
Now on the compute machine, we're going to use llvmlite with no Numba at all (following this page). Initialize it:
import llvmlite.binding as llvm
llvm.initialize_native_target()
llvm.initialize_native_asmprinter() # yes, even this one
Create an LLVM execution engine:
target = llvm.Target.from_default_triple()
target_machine = target.create_target_machine()
backing_mod = llvm.parse_assembly("")
engine = llvm.create_mcjit_compiler(backing_mod, target_machine)
And now hijack its caching mechanism to have it load our ELF, named elfbytes:
def object_compiled_hook(ll_module, buf):
    pass  # nothing to store: we never compile anything here
def object_getbuffer_hook(ll_module):
    return elfbytes
engine.set_object_cache(object_compiled_hook, object_getbuffer_hook)
Finalize the engine as though we had just compiled an IR, but in fact we skipped that step. The engine will load our ELF, thinking it's coming from its disk-based cache.
engine.finalize_object()
We should now find our function in this engine's space. If the following returns 0, something's wrong. It should be a nonzero function pointer.
func_ptr = engine.get_function_address("__main__.example$1.float64")
Now we need to interpret func_ptr as a ctypes function we can call. We have to set up the signature manually.
import ctypes
pdouble = ctypes.c_double * 1
out = pdouble()
pointerType = ctypes.POINTER(None)
dummy1 = pointerType()
dummy2 = pointerType()
# restype first then argtypes...
cfunc = ctypes.CFUNCTYPE(ctypes.c_int32, pdouble, pointerType, pointerType, ctypes.c_double)(func_ptr)
And now we can call it:
cfunc(out, dummy1, dummy2, ctypes.c_double(3.14))
print(out[0])
# 4.24, which is 3.14 + 1.1. Yay!
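The casting step can be tried without llvmlite. The following is only a sketch of the same technique: a ctypes callback plays the role of the compiled function (it is not Numba output), its raw address stands in for func_ptr, and the names PROTO and _impl are made up for the illustration.

```python
import ctypes

# Same shape as the LLVM signature: i32 f(double* retptr, i8* excinfo, i8* env, double x)
PROTO = ctypes.CFUNCTYPE(
    ctypes.c_int32,
    ctypes.POINTER(ctypes.c_double),  # retptr: overwritten with the result
    ctypes.c_void_p,                  # excinfo: unused here
    ctypes.c_void_p,                  # env: unused here
    ctypes.c_double,                  # the actual input
)

def _impl(retptr, excinfo, env, x):
    retptr[0] = x + 1.1  # write the result through the pointer
    return 0             # status code, like "ret i32 0"

callback = PROTO(_impl)  # keep a reference so the thunk stays alive
func_ptr = ctypes.cast(callback, ctypes.c_void_p).value  # plain integer address

cfunc = PROTO(func_ptr)  # reinterpret the bare address as a callable
out = ctypes.c_double()
status = cfunc(ctypes.byref(out), None, None, 3.14)
```

After the call, out.value holds the result and status holds the integer return code, which mirrors how the real entry point reports its result through retptr.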
More complications
If the JITed function has array inputs (after all, you want to do the tight loop over many values in the compiled code, not in Python), Numba generates code that recognizes Numpy arrays. The calling convention for this is quite complex, including pointers-to-pointers to exception objects and all the metadata that accompanies a Numpy array as separate parameters. It does not generate an entry point that you can use with Numpy's ctypes interface.
However, it does provide a very high-level entry point, which takes a Python *args, **kwds as arguments and parses them internally. Here's how you use that.
First, find the function whose name starts with "cpython.":
name = [x.name for x in cres.library._final_module.functions if x.name.startswith("cpython.")][0]
There should be exactly one of them. Then, after serialization and deserialization, get its function pointer using the method described above:
func_ptr = engine.get_function_address(name)
and cast it with three PyObject* arguments and one PyObject* return value. (LLVM thinks these are i8*.)
class PyTypeObject(ctypes.Structure):
    _fields_ = ("ob_refcnt", ctypes.c_int), ("ob_type", ctypes.c_void_p), ("ob_size", ctypes.c_int), ("tp_name", ctypes.c_char_p)
class PyObject(ctypes.Structure):
    _fields_ = ("ob_refcnt", ctypes.c_int), ("ob_type", ctypes.POINTER(PyTypeObject))
PyObjectPtr = ctypes.POINTER(PyObject)
cpythonfcn = ctypes.CFUNCTYPE(PyObjectPtr, PyObjectPtr, PyObjectPtr, PyObjectPtr)(func_ptr)
The first of these three arguments is a closure (global variables that the function accesses), and I'm going to assume we didn't need that. Use explicit arguments instead of closures. We can use the fact that CPython's id() implementation returns the pointer value to make PyObject pointers.
def wrapped(*args, **kwds):
    closure = ()
    return cpythonfcn(ctypes.cast(id(closure), PyObjectPtr), ctypes.cast(id(args), PyObjectPtr), ctypes.cast(id(kwds), PyObjectPtr))
Now the function can be called as
wrapped(whatever_numpy_arguments, ...)
just like the original Numba dispatcher function.
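The id() trick relies on a CPython implementation detail, so it is worth checking in isolation. A small sketch, assuming CPython and using ctypes.py_object for the round trip rather than the PyObject structure above:

```python
import ctypes

args = (1.5, "two")  # any live Python object will do

# CPython detail: id() is the address of the object's PyObject header.
address = id(args)

# Reinterpret that address as an object reference and read it back.
recovered = ctypes.cast(address, ctypes.py_object).value
same_object = recovered is args
```

The object must stay referenced for as long as the raw address is in use; once it is garbage collected, the address is dangling.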
Bottom line
After all that, was it worth it? Doing the end-to-end compilation with Numba--- the easy way--- takes 50 ms for this simple function. Asking for -O3 instead of the default -O2, I can make this 40% slower.
Splicing in a pre-compiled ELF file, however, takes 0.5 ms: a factor of 100 faster. Moreover, compilation times will increase with more complex functions but the splicing-in procedure should always take 0.5 ms for any function.
For my application, this is absolutely worth it. It means that I can perform computations on 10 MB at a time and be spending most of my time computing (doing real work), rather than compiling (preparing to work). Scale this up by a factor of 100 and I'd have to perform computations on 1 GB at a time. Since a machine is limited to order-of 100 GB and it has to be shared among order-of 100 processes, I'd be in greater danger of hitting resource limitations, load balancing issues, etc., because the problem would be too granular.
But for other applications, 50 ms is nothing. It all depends on your application.


How to execute separate compiled binary file from inside program on MCU?

I have an MCU (say an STM32) running, and I would like to 'pass' it a separately compiled binary file over UART/USB and use it like calling a function, where I can pass it data and collect its output. After it's complete, a second, different binary would be sent to be executed, and so on.
How can I do this? Does this require an OS be running? I'd like to avoid that overhead.
It is somewhat specific to the mcu what the exact call function is but you are just making a function call. You can try the function pointer thing but that has been known to fail with thumb (on gcc)(stm32 uses the thumb instruction set from arm).
First off, you need to decide in your overall system design whether you want to use a specific address for this code, for example 0x20001000, or whether you want to have several of these resident at the same time and load them at any one of multiple possible addresses. This will determine how you link this code. Is this code standalone, with its own variables, or does it need to know how to call functions in other code? All of this determines how you build this code. The easiest, at least to first try this out, is a fixed address. Build like you build your normal application but based at a ram address like 0x20001000. Then you load the program sent to you at that address.
In any case, the normal way to "call" a function in thumb (say on an stm32) is the bl or blx instruction. But normally in this situation you would use bx, and to make it a call you need a return address. The way arm/thumb works is that for bx and other related instructions the lsbit of the address determines the mode you switch to or stay in when branching: lsbit set is thumb, lsbit clear is arm. This is all documented in the arm documentation, which completely covers your question BTW, not sure why you are asking...
Gcc, and I assume llvm, struggle to get this right, and then some users know enough to be dangerous and do the worst thing of ADDing one (rather than ORRing one) or even attempting to put the one there themselves. Sometimes putting the one there helps the compiler (this is if you try the function pointer approach and hope the compiler does all the work for you, the *myfun = 0x10000 kind of thing). But it has been shown on this site that subtle changes to the code, or the exact situation, decide whether the compiler gets it right or wrong, and without looking at the generated code you cannot know whether you have to help with the orr one thing. As with most things when you need an exact instruction, just do this in asm (not inline please, use real asm) yourself, make your life 10000 times easier...and your code significantly more reliable.
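Why ORRing is safe where ADDing is not can be seen with plain integer arithmetic. A small sketch; the helper names are made up for illustration and the addresses are the example values from this answer:

```python
def thumb_target_orr(address: int) -> int:
    """Set the lsbit; harmless if it is already set."""
    return address | 1

def thumb_target_add(address: int) -> int:
    """Add one; only correct if the lsbit was clear."""
    return address + 1

load_address = 0x20001000    # raw load address, lsbit clear
already_tagged = 0x20001001  # a tool or compiler already set the lsbit

orr_twice = thumb_target_orr(thumb_target_orr(load_address))
add_on_tagged = thumb_target_add(already_tagged)
```

ORR is idempotent, so orr_twice is still 0x20001001. ADD applied to an address that already has the bit set gives 0x20001002, whose lsbit is clear, which asks bx for arm mode at the wrong address.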
So here is my trivial solution, extremely reliable, port the asm to your assembly language.
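.globl HOP
HOP:
    bx r0
In C it looks like this
void HOP ( unsigned int );
Now if you loaded to address 0x20001000 then after loading there, call it with the lsbit set
HOP(0x20001000|1);
Or you can do the orr in the asm and pass the plain address, HOP(0x20001000);
.globl HOP
HOP:
    orr r0,#1
    bx r0
The compiler generates a bl to HOP which means the return path is covered.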
If you want to send say a parameter...
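.globl HOP
HOP:
    orr r1,#1
    bx r1
void HOP ( unsigned int, unsigned int );
Here the first argument (r0) is the parameter, left untouched for the loaded code, and the second (r1) is the address to branch to.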
Easy and extremely reliable, compiler cannot mess this up.
If you need to have functions and global variables between the main app and the downloaded app, then there are a few solutions and they involve resolving addresses, if the loaded app and the main app are not linked at the same time (doing a copy and jump and single link is generally painful and should be avoided, but...) then like any shared library you need to have a mechanism for resolving addresses. If this downloaded code has several functions and global variables and/or your main app has several functions and global variables that the downloaded library needs, then you have to solve this. Essentially one side has to have a table of addresses in a way that both sides agree on the format, could be as a simple array of addresses and both sides know which address is which simply from position. Or you create a list of addresses with labels and then you have to search through the list matching up names to addresses for all the things you need to resolve. You could for example use the above to have a setup function that you pass an array/structure to (structures across compile domains is of course a very bad thing). That function then sets up all the local function pointers and variable pointers to the main app so that subsequent functions in this downloaded library can call the functions in the main app. And/or vice versa this first function can pass back an array structure of all the things in the library.
Alternatively a known offset in the downloaded library there could be an array/structure for example the first words/bytes of that downloaded library. Providing one or the other or both, that the main app can find all the function addresses and variables and/or the caller can be given the main applications function addresses and variables so that when one calls the other it all works... This of course means function pointers and variable pointers in both directions for all of this to work. Think about how .so or .dlls work in linux or windows, you have to replicate that yourself.
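The "array of addresses at a known offset, resolved by position" idea can be prototyped on the host before touching the target. This is only a sketch of one possible convention: the slot names, the 32-bit little-endian layout, and the example addresses are all hypothetical, and both sides would have to agree on them.

```python
import struct

# Hypothetical agreement: the first words of the downloaded image are
# function addresses, identified purely by position.
SLOT_NAMES = ("setup", "process", "teardown")
TABLE_FORMAT = "<%dI" % len(SLOT_NAMES)  # little-endian 32-bit words

def build_table(addresses) -> bytes:
    """What the build step of the downloaded library would emit first."""
    return struct.pack(TABLE_FORMAT, *addresses)

def read_table(image: bytes) -> dict:
    """What the main app would do after receiving the image."""
    values = struct.unpack_from(TABLE_FORMAT, image, 0)
    return dict(zip(SLOT_NAMES, values))

# Thumb entry points, so each address has the lsbit set.
image = build_table([0x20001011, 0x20001041, 0x20001091]) + b"\x00" * 32
table = read_table(image)
```

On the target, the main app would do the same thing in C: read the words at the agreed offset and treat each one as a function pointer.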
Or you go the path of linking at the same time, then the downloaded code has to have been built along with the code being run, which is probably not desirable, but some folks do this, or they do this to load code from flash to ram for various reasons. but that is a way to resolve all the addresses at build time. then part of the binary in the build you extract separately from the final binary and then pass it around later.
If you do not want a fixed address, then you need to build the downloaded binary as position independent, and you should link that with .text and .bss and .data at the same address.
MEMORY
{
    hello : ORIGIN = 0x20001000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > hello
    .rodata : { *(.rodata*) } > hello
    .bss : { *(.bss*) } > hello
    .data : { *(.data*) } > hello
}
You should obviously do this anyway, but with position independence you then have it all packed in along with the GOT (you might need a .got entry, but I think it knows to use .data). Note: with gnu at least, if you put .data after .bss and make sure you have at least one .data item (even a bogus variable you do not use), then .bss is zero padded and allocated for you in the binary, so there is no need to set it up in a bootstrap.
If you build for position independence then you can load it almost anywhere, clearly on arm/thumb at least on a word boundary.
In general for other instruction sets the function pointer thing works just fine. In ALL cases you simply look at the documentation for the processor and see the instruction(s) used for calling and returning or branching and simply use that instruction, be it by having the compiler do it or forcing the right instruction so that you do not have it fail down the road in a re-compile (and have a very painful debug). arm and mips have 16 bit modes that require specific instructions or solutions for switching modes. x86 has different modes 32 bit and 64 bit and ways to switch modes, but normally you do not need to mess with this for something like this. msp430, pic, avr, these should be just a function pointer thing in C should work fine. In general do the function pointer thing then see what the compiler generates and compare that to the processor documentation. (compare it to a non-function pointer call).
If you do not know these basic C concepts of function pointer, linking a bare metal app on an mcu/processor, bootstrap, .text, .data, etc. You need to go learn all that.
The times you decide to switch to an operating system are....if you need a filesystem, networking, or a few things like this where you just do not want to do that yourself. Now sure there is lwip for networking and some embedded filesystem libraries. And multithreading then an os as well, but if all you want to do is generate a branch/jump/call instruction you do not need an operating system for that. Just generate the call/branch/whatever.
Loading and executing a fully linked binary and loading and calling a single function (and returning to the caller) are not really the same thing. The latter is somewhat complicated and involves "dynamic linking", where the code effectively executes in the same execution environment as the caller.
Loading a complete stand-alone executable, on the other hand, is more straightforward and is the function of a bootloader. A bootloader loads and jumps to the loaded executable, which then establishes its own execution environment. Returning to the bootloader requires a processor reset.
In this case it would make sense to have the bootloader load and execute code in RAM if you are going to be frequently loading different code. However be aware that on Harvard Architecture devices like STM32, RAM execution may slow down execution because data and instruction fetch share the same bus.
The actual implementation of a bootloader will depend on the target architecture, but for Cortex-M devices is fairly straightforward and dealt with elsewhere.
STM32 actually includes an on-chip bootloader (you need to configure the boot source pins to invoke it), which I believe can load and execute code in RAM. It is normally used to load a secondary bootloader to load and program flash, but it can be used for loading any code.
You do need to build and link your code to run from RAM at the address where the loader locates it, or, if supported, build position-independent code that can run from anywhere.

operating with complex struct/class in Julia from C++

In the recent project, I am trying to write a simple wrapper of C++ library (OpenCV) to be utilized in Julia with the use of CxxWrap.
The case of C++ code (where arguments and return types are my own, rather simple, structs) is working.
The problem I have is with more complex data structures (defined in, let's say OpenCV); in our case (I want it to be simple to understand) I want to get information about the frame, so I execute:
using PyCall
const cv2 = pyimport("cv2")
module CppHello
    using CxxWrap
    function __init__()
    end
end
cap = CppHello.openVideo() # (*)
About the above I have two questions:
1. Do I have to explicitly define the type returned by openVideo()? Suppose for the moment that I want to use only my own C++ library to start any of the OpenCV functions.
2. If the answer to the above is "no", can I do something like this:
cap = cv2.VideoCapture(0) # from library
cap.isOpened() && (frm = cap.frame())
The point is that I am interested only in a few operations at frame along with passing returned value to other procedures (show frame utilizing C++ on the screen or save in file).
The motivation is the low performance of imshow() when executed at the Julia level through PyCall (in contrast to goodFeaturesToTrack or calcOpticalFlowPyrLK) and the drastically lower FPS compared with C++.
Maybe there is another solution I have overlooked.
As I have a problem with (*) I thought that maybe I can simply write a struct (of known elements) of pointers to hold the data returned by C++ functions?
As it is my first edition of the question, I will be grateful for any info about correctness and completeness.

Passing value as a function argument vs calculating it twice?

I recall from Agner Fog's excellent guide that 64-bit Linux can pass 6 integer function parameters via registers:
(page 8)
I have the following function:
void x(signed int a, uint b, char c, uint d, uint e, signed short f);
and I need to pass an additional unsigned short parameter, which would make 7 in total. However, I can actually derive the value of the 7th from one of the existing 6.
So my question is which of the following is a better practice for performance:
Passing the already-calculated value as a 7th argument on 64-bit Linux
Not passing the already-calculated value, but calculating it again for a second time using one of the existing 6 arguments.
The operation in question is a simple bitwise AND:
unsigned short g = c & 1;
Not fully understanding x86 assembler I am not too sure how precious registers are and whether it is better to recalculate a value as a local variable, than pass it through function calls as an argument?
My belief is that it would be better to calculate the value twice because it is such a simple 1 CPU cycle task.
EDIT I know I can just profile this- but I'd like to also understand what is happening under the hood with both approaches. Having a 7th argument does this mean cache/memory is involved, rather than registers?
The machine convention for passing arguments is part of the application binary interface (or ABI), and for Linux x86-64 it is described in the x86-64 ABI spec. See also the x86 calling conventions wikipage.
In your case, it is probably not worthwhile to pass c & 1 as an additional parameter (since that 7th parameter is passed on stack).
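To see where each argument lands, here is a rough sketch of the System V x86-64 rule for integer-class arguments. It ignores floating-point and aggregate arguments, and the helper name assign_integer_args and the seventh parameter g are illustrative only.

```python
# Integer/pointer arguments go in these registers, in this order.
INTEGER_ARG_REGISTERS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")

def assign_integer_args(names):
    """Map each integer-class parameter to its register, or to the stack."""
    placement = {}
    for index, name in enumerate(names):
        if index < len(INTEGER_ARG_REGISTERS):
            placement[name] = INTEGER_ARG_REGISTERS[index]
        else:
            placement[name] = "stack"
    return placement

# The six existing parameters of x() plus the proposed seventh, g.
placement = assign_integer_args(["a", "b", "c", "d", "e", "f", "g"])
```

Here c arrives in rdx, so recomputing c & 1 inside the function is a single register operation, while g would have to be written to and read back from the stack.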
Don't forget that current processor cores (on desktop or laptop computers) are often doing out-of-order execution and are superscalar, so the c & 1 operation could be done in parallel with other operations and might cost "nothing".
But leave such micro-optimizations to the compiler. If you care a lot about performance, use a recent GCC 4.8 compiler with gcc-4.8 -O3 -flto both for compiling and for linking (i.e. enable link-time optimization).
BTW, cache performance is much more relevant than such micro-optimizations. A single cache miss may take the same time (e.g. 250 nanoseconds) as hundreds of CPU machine instructions. Current CPUs are rumored to mostly wait for the caches. You might want to add a few explicit (and judicious) calls to __builtin_prefetch (see this question and this answer). But adding too many of these prefetches would slow down your code.
Finally, readability and maintainability of your code should matter much more than raw performance!
Basile's answer is good, I'll just point out another thing to keep in mind:
a) The stack is very likely to be in L1 cache, so passing arguments on the stack should not take more than ~3 cycles extra.
b) The ABI (x86-64 System V, in this case) requires clobbered registers to be restored. Some are saved by the caller, others by the callee. Obviously, the registers used to pass arguments must be saved by the caller if the original contents were needed again. But when your function uses more registers than the caller saved, any additional temporary results the function needs to calculate must go into a callee-saved register. So the function ends up spilling a register on the stack, reusing the register for your temporary variable, and then pops the original value back.
The only way you can avoid accessing memory is by using a smaller, simpler function that needs fewer temporary variables.

C++: Why does this speed my code up?

I have the following function
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
                          int image_hash_key,
                          Mat* preloaded_images,
                          int* random_values){
    int first_pixel_row = patch_top_left_row + random_values[0];
    int first_pixel_col = patch_top_left_col + random_values[1];
    int second_pixel_row = patch_top_left_row + random_values[2];
    int second_pixel_col = patch_top_left_col + random_values[3];
    int channel = random_values[4];
    Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
    Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
    return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
Which is called about one and a half million times with different values for patch_top_left_row and patch_top_left_col. This takes about 2 seconds to run, now when I change the calculation of first_pixel_row etc to not use the arguments but hard coded numbers instead (shown below), the thing runs sub second and I don't know why. Is the compiler doing something smart here ( I am using gcc cross compiler)?
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
                          int image_hash_key,
                          Mat* preloaded_images,
                          int* random_values){
    int first_pixel_row = 5 + random_values[0];
    int first_pixel_col = 6 + random_values[1];
    int second_pixel_row = 8 + random_values[2];
    int second_pixel_col = 10 + random_values[3];
    int channel = random_values[4];
    Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
    Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
    return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
I have pasted the assembly from the two versions of the function
using arguments:
using constants:
After compiling with -O3 I get the following clock ticks and speeds:
using arguments: 1990000 ticks and 1.99seconds
using constants: 330000 ticks and 0.33seconds
using arguments with -O3 compilation:
using constants with -O3 compilation:
On the x86 platform there are instructions that very quickly add small integers to a register. These instructions are the lea (aka 'load effective address') instructions and they are meant for computing address offsets for structures and the like. The small integer being added is actually part of the instruction. Smart compilers know that these instructions are very quick and use them for addition even when addresses are not involved.
I bet if you changed the constants to some random value that was at least 24 bits long that you would see much of the speedup disappear.
Secondly, those constants are known values. The compiler can do a lot to arrange for those values to end up in a register in the most efficient way possible. With an argument, unless the argument is passed in a register (and I think your function has too many arguments for that calling convention to be used), the compiler has no choice but to fetch the number from memory using a stack offset load instruction. That isn't a particularly slow instruction or anything, but with constants the compiler is free to do something much faster that may involve simply fetching the number from the instruction itself. The lea instructions are simply the most extreme example of this.
Edit: Now that you've pasted the assembly things are much clearer
In the non-constant code, here is how the add is done:
addl -68(%rbp), %eax
This fetches a value from the stack at offset -68(%rbp) and adds it to the %eax register.
In the constant code, here is how the add is done:
addl $5, %eax
and if you look at the actual numbers, you see this:
0138 83C005
It's pretty clear that the constant being added is encoded directly into the instruction as a small value. This is going to be much faster to fetch than fetching a value from a stack offset for a number of reasons. First it's smaller. Secondly, it's part of an instruction stream with no branches. So it will be pre-fetched and pipelined with no possibility for cache stalls of any kind.
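The three bytes in that listing can be decoded by hand. A small sketch that covers only this one encoding form (opcode 0x83 with a register operand and a sign-extended 8-bit immediate); it is nowhere near a general disassembler, and the leading 0138 is taken to be the listing offset, not part of the instruction.

```python
GROUP1_OPS = ("add", "or", "adc", "sbb", "and", "sub", "xor", "cmp")
REG32 = ("eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi")

code = bytes.fromhex("83C005")
opcode, modrm, imm8 = code

mod = modrm >> 6        # 0b11 means the operand is a register, not memory
reg = (modrm >> 3) & 7  # for opcode 0x83 this field selects the operation
rm = modrm & 7          # which register is the destination

# The immediate is a single byte, sign-extended to 32 bits by the CPU.
immediate = imm8 - 256 if imm8 >= 128 else imm8

assert opcode == 0x83 and mod == 0b11  # the only form this sketch handles
text = "%sl $%d, %%%s" % (GROUP1_OPS[reg], immediate, REG32[rm])
```

The result is the AT&T-syntax text of the instruction shown above, and the whole operation, constant included, fits in three bytes.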
So while my surmise about the lea instruction wasn't correct, I was still on the right track. The constant version uses a small instruction specifically oriented towards adding a small integer to a register. The non-constant version has to fetch an integer that may be of indeterminate size (so it has to fetch ALL the bits, not just the low ones) from a stack offset (which adds in an additional add to compute the actual address from the offset and stack base address).
Edit 2: Now that you've posted the -O3 results
Well, it's much more confusing now. It's apparently inlined the function in question and it jumps around a whole ton between the code for the inlined function and the code for the calling function. I'm going to need to see the original code for the whole file to make a proper analysis.
But what I strongly suspect is happening now is that the unpredictability of the values retrieved from get_random_number_in_range is severely limiting the optimization options available to the compiler. In fact, it looks like in the constant version it doesn't even bother to call get_random_number_in_range because the value is tossed out and never used.
I'm assuming that the values of patch_top_left_row and patch_top_left_col are generated in a loop somewhere. I would push this loop into this function. If the compiler knows the values are generated as part of a loop, there are a very large number of optimization options open to it. In the extreme case it could use some of the SIMD instructions that are part of the various SSE or 3dnow! instruction suites to make things a whole ton faster than even the version you have that uses constants.
The other option would be to make this function inline, which would hint to the compiler that it should try inserting it into the loop in which it's called. If the compiler takes the hint (this function is a bit largish, so the compiler might not) it will have much the same effect as if you'd stuffed the loop into the function.
Well, binary arithmetic operations of immediate constant vs. memory format are expected to produce faster code than the ones of memory vs. memory format, but the timing effect you observe appears to be too extreme, especially considering that there are other operations inside that function.
Could it be that the compiler decided to inline your function? Inlining would allow the compiler to easily eliminate everything related to the unused patch_top_left_row and patch_top_left_col parameters in the second version, including any steps that prepare/calculate these parameters in the calling code.
Technically, this can be done even if the function is not inlined, but it is generally more complicated.

LLVM: Figure out if a variable is a function of other variables

Is there a way in LLVM, using static analysis, to find out whether one variable is a particular function of other variables?
E.g., in a CUDA program, I want to find out whether a given variable tid stores a global thread ID or not:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
Edit: I am trying to figure out if I can write a pass that analyzes the program and sees whether any divergence or array access is based on this global ID alone and not on other values like the block ID or the local thread ID. I am trying to identify cases where changing the CUDA program's gridDim and blockDim doesn't change the program output. For instance, for a vector add I can have gridDim as 128 and blockDim as 4, or gridDim as 8 and blockDim as 64, and the output is not affected. I am doing this in LLVM because I am trying to use a compilation framework called Ocelot, which converts CUDA to x86.
The closest I can find is the memdeps pass, but this is primarily about other operations on memory, which does not necessarily correspond to operations on "variables" in the usual sense - they may be in registers. It appears to be a reasonable standard dependence analysis problem, however, so perhaps you could modify this pass to your needs. The alias analysis passes might be helpful too, though not in the presence of operations that stop variables from aliasing one another (e.g., copies, arithmetic).
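The underlying idea, walking definitions backwards until only inputs remain, can be shown on a toy representation. This is not LLVM's API: the defs dictionary, the temporary t0, and the helper leaf_sources are invented for the sketch, and a real pass would follow the operands of each instruction instead.

```python
def leaf_sources(var, defs, seen=None):
    """Collect the inputs (names with no definition) that var depends on."""
    seen = set() if seen is None else seen
    if var in seen:
        return set()   # already visited; avoids looping on cycles
    seen.add(var)
    if var not in defs:
        return {var}   # no definition: treat it as an input
    _op, operands = defs[var]
    leaves = set()
    for operand in operands:
        leaves |= leaf_sources(operand, defs, seen)
    return leaves

# tid = blockIdx.x * blockDim.x + threadIdx.x, split into two steps.
defs = {
    "t0": ("mul", ("blockIdx.x", "blockDim.x")),
    "tid": ("add", ("t0", "threadIdx.x")),
}
sources = leaf_sources("tid", defs)
```

This only answers "which inputs does tid depend on". Checking that the expression has exactly the mul-then-add shape of a global thread ID would need an additional structural match on the operations.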
Incidentally, your question is rather under-specified. This is usually the kind of analysis (alias analysis, for example) that would make a lot more sense in the source language (cuda, for example), not in the target language (LLVM, for example).