Passing value as a function argument vs calculating it twice? - c++

I recall from Agner Fog's excellent guide that 64-bit Linux can pass 6 integer function parameters via registers:
http://www.agner.org/optimize/optimizing_cpp.pdf
(page 8)
I have the following function:
void x(signed int a, uint b, char c, uint d, uint e, signed short f);
and I need to pass an additional unsigned short parameter, which would make 7 in total. However, I can actually derive the value of the 7th from one of the existing 6.
So my question is which of the following is a better practice for performance:
Passing the already-calculated value as a 7th argument on 64-bit Linux
Not passing the already-calculated value, but calculating it again for a second time using one of the existing 6 arguments.
The operation in question is a single bitwise AND:
unsigned short g = c & 1;
Not fully understanding x86 assembler, I am not too sure how precious registers are: is it better to recalculate the value as a local variable, or to pass it through the function call as an argument?
My belief is that it would be better to calculate the value twice because it is such a simple 1 CPU cycle task.
EDIT: I know I can just profile this, but I'd also like to understand what is happening under the hood with both approaches. Does having a 7th argument mean that cache/memory is involved, rather than registers?

The machine convention for passing arguments is called the application binary interface (or ABI), and for Linux x86-64 it is described in the x86-64 ABI spec. See also the x86 calling conventions wikipage.
In your case, it is probably not worthwhile to pass c & 1 as an additional parameter (since that 7th parameter would be passed on the stack).
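To make the two options concrete, here is a minimal sketch; the typedef and the function names below are stand-ins for the declaration in the question:
typedef unsigned int uint;      // assumed to match the typedef used in the question

// Option 1: pass the derived value as a 7th argument. On x86-64 Linux the
// first six integer arguments travel in registers and g goes on the stack.
void x_pass(signed int a, uint b, char c, uint d, uint e,
            signed short f, unsigned short g);

// Option 2: recompute it inside the callee from the 3rd argument, which is
// already sitting in a register when the function starts.
void x_recalc(signed int a, uint b, char c, uint d, uint e, signed short f)
{
    unsigned short g = c & 1;   // a single AND on a register value
    /* ... use g ... */
}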
Don't forget that current processor cores (on desktop or laptop computers) are often doing out-of-order execution and are superscalar, so the c & 1 operation could be done in parallel with other operations and might cost "nothing".
But leave such micro-optimizations to the compiler. If you care a lot about performance, use a recent GCC 4.8 compiler with gcc-4.8 -O3 -flto both for compiling and for linking (i.e. enable link-time optimization).
BTW, cache performance is much more relevant than such micro-optimizations. A single cache miss may take the same time (e.g. 250 nanoseconds) as hundreds of CPU machine instructions. Current CPUs are rumored to mostly wait for the caches. You might want to add a few explicit (and judicious) calls to __builtin_prefetch (see this question and this answer). But adding too many of these prefetches would slow down your code.
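For illustration, a minimal sketch of what a judicious prefetch might look like; the access pattern and the look-ahead distance of 8 elements are assumptions you would have to tune:
#include <cstddef>

void scale(double* dst, const double* src, std::size_t n, double k)
{
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(&src[i + 8]);   // hint: start fetching a future element
        dst[i] = k * src[i];
    }
}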
At last, readability and maintainability of your code should matter much more than raw performance!

Basile's answer is good, I'll just point out another thing to keep in mind:
a) The stack is very likely to be in L1 cache, so passing arguments on the stack should not take more than ~3 cycles extra.
b) The ABI (x86-64 System V, in this case) requires clobbered registers to be restored. Some are saved by the caller, others by the callee. Obviously, the registers used to pass arguments must be saved by the caller if their original contents are needed again. But when your function needs more scratch registers than the caller-saved ones available, any additional temporaries must go into callee-saved registers. So the function ends up spilling such a register to the stack, reusing it for your temporary variable, and then popping the original value back.
The only way you can avoid accessing memory is by using a smaller, simpler function that needs fewer temporary variables.

Related

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
So, at least for that purpose you are motivated to have something like (edit):
enum : unsigned int { warp_size = 32 };
somewhere in your headers. But now, which should I prefer, and when: warpSize or warp_size?
Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.
Let's get a couple of points straight. The warp size isn't a compile-time constant and shouldn't be treated as one. It is an architecture-specific runtime immediate constant (and its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX; however, that changed at least 6 years ago, if my memory doesn't fail me.
The value is available:
In CUDA C via warpSize, where it is not a compile-time constant (the PTX WARP_SZ variable is emitted by the compiler in such cases).
In PTX assembler via WARP_SZ, where it is a runtime immediate constant
From the runtime API as a device property
Don't declare your own constant for the warp size, that is just asking for trouble. The normal use case for an in-kernel array dimensioned to be some multiple of the warp size would be to use dynamically allocated shared memory. You can read the warp size from the host API at runtime. If you have a statically declared in-kernel array you need to dimension from the warp size, use templates and select the correct instance at runtime. The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
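For example, a minimal host-side sketch of reading it from the runtime API (plain C++ against the CUDA runtime; the dynamic shared-memory sizing in the comments is an assumption for illustration):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);             // query device 0
    std::printf("warp size: %d\n", prop.warpSize);
    // size_t shmem_bytes = n_warps * prop.warpSize * sizeof(float);
    // kernel<<<grid, block, shmem_bytes>>>(...);  // dynamic shared memory sized at launch
    return 0;
}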
Contrary to talonmies's answer I find the warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language - on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).
The day we get a different warp size (e.g. 64) many things will change:
The warpSize will have to be adjusted obviously
Many warp-level intrinsics will need their signatures adjusted, or a new version produced, e.g. int __ballot; and while int does not need to be 32-bit, it most commonly is!
Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:
for (int i = 0; i < log2(warpSize); ++i) ...
that would be overly complex in something that is usually a time-critical piece of code.
warpIdx and laneIdx computation out of threadIdx would need to be adjusted. Currently, the most typical code I see for it is:
warpIdx = threadIdx.x/32;
laneIdx = threadIdx.x%32;
which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize this suddenly becomes a quite expensive operation!
At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant.
Also, if the amount of shared memory depends on the warpSize this forces you to use the dynamically allocated shmem (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays -- this forces you to do pointer arithmetic yourself and manually compute the sum of all memory usage.
Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:
deviceFunction<warp_size>(params)
This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
My suggestion would be to have a single header that controls all the model-specific constants, e.g.
#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32;
#endif
Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for a newer architecture, you just need to alter this one piece of code.
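A small usage sketch of that header constant (the header name, block_warps and the launch math below are illustrative assumptions, not part of the answer):
#include "model_constants.h"   // hypothetical name for the header above

const int block_warps = 8;
const int threads_per_block = block_warps * warp_size;   // 256 on current hardware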

Why does ARM documentation recommend having only 4 function arguments? Is there a specific performance cost of having more?

I am currently working with ARM embedded systems and would like to know the maximum number of arguments a function/method should have so that the code remains readable and efficient at the same time.
Currently I am passing 6 arguments to a function. However, the ARM documentation recommends only 4 arguments for better code. If I have more than 4 arguments, will it affect the performance of the system?
The ARM EABI, which defines amongst other things the calling convention for C/C++ code, requires that the first four arguments be passed in the general-purpose registers R0 to R3.
Additional arguments are passed on the call stack, so they involve RAM accesses to store and retrieve them. Apart from RAM access often being slower than register access, the transfer to and from RAM requires more instructions in any case.
This of course applies to arguments that are 32 bits in length. Double-precision floating-point types and aggregate types (structures) passed by copy cannot be passed in a single register.
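As a small illustration of that rule (a sketch, assuming plain 32-bit int arguments and the register assignment described above):
int sum6(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;   // a..d arrive in r0-r3; e and f must be loaded from the stack
}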
In the end it is probably academic; if a function needs the arguments, it needs them! Alternative methods of passing large amounts of data have similar overheads that make them little or no better. For example you could pass a single pointer or reference to a structure or object containing the data, but that data is still in RAM and the access overhead remains.
Processors don't have unlimited registers in general, so any calling convention defined for any compiler has to find some balance between the number of registers used for parameters and the rest going on the stack (which is relatively unlimited). Some conventions simply say all parameters go on the stack.
The shorter answer is that there has to be a limit and the ABI/EABI chose four; it is a nice balance between the number of registers the processor has and the number of parameters you find in typical programs.
The cost is that when you need more than four registers' worth of arguments (which you can easily do with fewer than four parameters), the rest of the parameters go on the stack. That has a cost, and if you didn't otherwise need a stack frame it costs even more.
tl;dr - You shouldn't care on ARM CPUs. This is a micro-optimization. Let the compiler choose how arguments are passed by making functions static.
All CPUs are limited in registers. ARM is mainly a RISC architecture, and one factor that guided its development was running compiled code well. You may pass as many arguments as you like. The only trade-off is efficiency.
It is always good to group related parameters together in a structure, especially if the parameters pass through a function to a sub-function. On the ARM, arguments are slightly cheaper if they fall within the first four slots. Consider the following example,
int bar(int a, int b) { return a+b; }
int baz(int a, int b, int c) { return a+bar(b,c); }
int foo(int a, int b, int c, int d, int e, int f) { return a+b+baz(d,e,f); }
The calculations are nonsensical. The example is only meant to provide a reference for different types of functions. bar is a leaf function, baz is an intermediate, and foo is a top-level or higher-level function. It makes a lot of sense to package the 'e' and 'f' parameters to foo in a structure and to pass a pointer to the structure around. In fact, this is perhaps a hallmark of object-oriented design. The structure will pass through baz to be used by bar. When the ARM call list exceeds four, passing a structure is much the same as putting things on the stack. The same addressing modes are available via the pointer. Also, by grouping the variables together you get cache and memory pipeline optimizations.
If a function will use an argument, then passing it in the first four parameters is beneficial. If a function does not use an argument, it can actually be beneficial to pass it later on. So, if you have the latitude to choose, it is better to put frequently used parameters first.
Another issue is that it is beneficial to keep arguments in the same positions between calls. Here is another example of the above where the foo() caller reverses the order.
int bar(int a, int b) { return a+b; }
int baz(int a, int b, int c) { return c+bar(a,b); }
int foo(int a, int b, int c, int d, int e, int f) { return e+f+baz(a,b,c); }
Here is a third example with the data packaging,
struct bar_data {int a; int b;};
int bar(struct bar_data *p) { return p->a+p->b; }
int baz(struct bar_data *p, int a) { return a+bar(p); }
int foo(struct bar_data *p, int a, int b, int c) { return b+c+baz(p,a); }
First example is 19 instructions.
Second example is 14 instructions.
Third example is 16 instructions.
I chose the -O1 option, as compilers will not be able to inline if the functions are in separate source files. They will probably all be the same at a higher optimization level, as the compiler will re-order the arguments as it best sees fit.
So, you can see that this really only matters for extern functions; follow the links and add the -O3 option.
That said, clear, understandable code is always beneficial. No one should prematurely optimize. Frankly, I wouldn't worry about it. The ARM is generally designed to be quite efficient at handling compiled code. Most compilers are quite good at re-ordering and analyzing code. You are better off writing object-oriented static functions and doing higher-level design, for a great many reasons. It will also naturally result in better code.
See the ARM Procedure Call Standard for the register use; all extern functions should follow these rules on the ARM. Depending on the machine, some floating-point arguments may also be passed in registers. So four is not a hard-and-fast rule, even if performance is your ultimate goal.
The design and the algorithm are always more important than the actual number of parameters. If the design is good, the number of parameters is justified, and they are all used and not redundant, everything should be fine. It's important to have them well documented.
This depends on the function-calling convention. Generally it is optimal to put function arguments in CPU registers, as accessing CPU registers is usually much faster than accessing memory. However, the number of CPU registers is limited, so it is not possible to put all arguments into registers if the number of arguments exceeds the number of registers (less the special-purpose ones). Additionally, operating systems' calling conventions further limit the number of arguments put into registers (see e.g. this link). I am not familiar with ARM CPUs, but the general rules (OS/calling convention, number of registers, etc.) should apply to ARM too.
It really depends upon the CPU and instruction set that you are using. For example, the AMD64 (x86-64) instruction set has 16 general-purpose registers, and 6 of them can be used for passing arguments to a function call. Arguments beyond that go on the stack, which is obviously slower to read than registers. By comparison, 32-bit x86 has fewer registers (8), and hence fewer arguments are passed in registers.

Design elements for inline asm in concurrent usage

I can't find a neat explanation about how I'm supposed to write a piece of inline asm, and what problems can possibly arise from concurrent use of a foo function that contains asm code in it.
The problem that I see is that in asm the registers are uniquely named, so one name is strictly tied to one precise part of your CPU, and that looks like a big problem if you are writing one piece of code that is supposed to run concurrently, because you can't simply conjure up extra registers with the same name.
The other problem is that asm doesn't really use a calling convention; you simply operate on registers and/or values, and sometimes an instruction implies a silent effect on another register that doesn't even show up explicitly in your code. So I can't even expect that my C/C++ function foo will be packed and sealed inside its own stack frame if it contains asm code.
Now with what gcc calls extended asm I can basically declare where the input and the output go, so each function can use its own parameters "as registers", and the pattern is the following
asm ( assembler template
: output
: input
: registers
);
Assuming that my main target for now is mathematical operations, and my function is only supposed to provide a certain functionality and perform some computation (no internal locking), is extended asm good for concurrency? How should I design a piece of asm that is supposed to be used by a concurrent application?
For now I'm using gcc, but I would like a generic answer about the general asm design that I'm supposed to give to this kind of code snippets.
You seem to be misunderstanding what threading actually is. Let's consider a single-processor system first. The threads don't actually run concurrently, since there is only one unit that can decode and execute them. Your operating system only creates the illusion of running multiple threads (and processes, too) by employing scheduling: every thread or process is allocated a certain amount of time to execute on the processor.
This is why, when threads are executed, they don't overwrite each other's registers. When the currently executing thread or process is switched, the operating system asks the processor to perform something called a context switch. In a nutshell, the processor saves the state it had while executing the previous task/thread/process into some memory area controlled by the OS. The new task/thread/process has its context restored from the previously stored state and continues its execution. When this task/thread/process' time slice on the CPU is up, the scheduler decides which task/thread/process to resume next. The time slice is usually very small, which is why you're given the illusion of multiple streams of code running at the same time. Keep in mind that this is a very, very simplified description: refer to CPU manuals or books on operating systems for more detail.
The situation is analogous on multi-processor systems, with the exception that there is more than one unit that can execute the instructions. This is also true for multi-core processors: every one of the cores has its own set of registers. The basic picture stays the same - the scheduler in your OS decides whether the code being executed is actually executed at the same time by multiple cores in one processor.
Thus, your concerns in this case are not valid. However, they were raised for very valid reasons. Remember that the only thing threads share is the main memory: each thread has its own registers and its own stack.
Let me come back to the actual question about gcc's extended inline assembly. The compiler itself cannot work out which registers are modified by the assembly you wrote. That's why you need to specify it. However, it is very rare that an instruction modifies a register without you being able to control it, and it happens only with a small number of instructions - assuming that we're talking about x86. Moreover, gcc can work out the destination/source operands by itself when you want to refer to a C/C++ variable from inside the assembly. In fact, this is the preferred method, since it leaves the compiler much more room for optimization.
Consider this piece of code:
unsigned int get_cr0(void)
{
    unsigned int rc;
    __asm__ (
        "movl %%cr0, %0\n"
        : "=r"(rc)
        :
        :
    );
    return rc;
}
This function's purpose is to return the contents of the control register cr0. This is a privileged instruction, so the program will not work when you run it in user mode, but this is not important right now. See how I put %0 in the instruction, and then specified "=r"(rc) in the output list. This means that %0 will be automagically aliased by the compiler to your rc variable. You can do this for every variable you specify on the input/output list. They are numbered starting from zero, as you can see.
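Since your stated target is mathematical operations, here is a small sketch of a pure computation in the same extended-asm style, written for x86 with GCC (the function name and the choice of addl are just illustrative):
static inline unsigned int add_asm(unsigned int a, unsigned int b)
{
    unsigned int r;
    __asm__ (
        "addl %2, %0\n"     /* r (initially a) += b */
        : "=r"(r)           /* output: any general-purpose register */
        : "0"(a), "r"(b)    /* a is tied to the output register, b goes in any register */
        : "cc"              /* addl modifies the flags */
    );
    return r;
}
Because the compiler picks the actual registers at each call site, and each thread executes with its own register context, two threads calling this at the same time never fight over a named register.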
I can't really remember the instructions which used registers that were not encoded as operands, so I can't give you an example right now. In this case, you would need to put them on the clobber list (the last one). I'm pretty sure you can refer to this for more information.
I also can't answer anything regarding "general asm design", since this is a non-standard extension and thus varies between compilers. The 64-bit Visual Studio compilers don't support it at all, for example.

C++: Why does this speed my code up?

I have the following function
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
                          int image_hash_key,
                          Mat* preloaded_images,
                          int* random_values){
    int first_pixel_row = patch_top_left_row + random_values[0];
    int first_pixel_col = patch_top_left_col + random_values[1];
    int second_pixel_row = patch_top_left_row + random_values[2];
    int second_pixel_col = patch_top_left_col + random_values[3];
    int channel = random_values[4];
    Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
    Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
    return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
Which is called about one and a half million times with different values for patch_top_left_row and patch_top_left_col. This takes about 2 seconds to run; now, when I change the calculation of first_pixel_row etc. to use hard-coded numbers instead of the arguments (shown below), it runs in under a second and I don't know why. Is the compiler doing something smart here (I am using a gcc cross compiler)?
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
                          int image_hash_key,
                          Mat* preloaded_images,
                          int* random_values){
    int first_pixel_row = 5 + random_values[0];
    int first_pixel_col = 6 + random_values[1];
    int second_pixel_row = 8 + random_values[2];
    int second_pixel_col = 10 + random_values[3];
    int channel = random_values[4];
    Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
    Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
    return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
EDIT:
I have pasted the assembly from the two versions of the function
using arguments: http://pastebin.com/tpCi8c0F
using constants: http://pastebin.com/bV0d7QH7
EDIT:
After compiling with -O3 I get the following clock ticks and speeds:
using arguments: 1990000 ticks and 1.99 seconds
using constants: 330000 ticks and 0.33 seconds
EDIT:
using arguments with -O3 compilation: http://pastebin.com/fW2HCnHc
using constants with -O3 compilation: http://pastebin.com/FHs68Agi
On the x86 platform there are instructions that very quickly add small integers to a register. These instructions are the lea (aka 'load effective address') instructions and they are meant for computing address offsets for structures and the like. The small integer being added is actually part of the instruction. Smart compilers know that these instructions are very quick and use them for addition even when addresses are not involved.
I bet if you changed the constants to some random value that was at least 24 bits long that you would see much of the speedup disappear.
Secondly, those constants are known values. The compiler can do a lot to arrange for those values to end up in a register in the most efficient way possible. With an argument, unless the argument is passed in a register (and I think your function has too many arguments for that calling convention to be used), the compiler has no choice but to fetch the number from memory using a stack-offset load instruction. That isn't a particularly slow instruction or anything, but with constants the compiler is free to do something much faster, which may involve simply fetching the number from the instruction itself. The lea instructions are simply the most extreme example of this.
Edit: Now that you've pasted the assembly things are much clearer
In the non-constant code, here is how the add is done:
addl -68(%rbp), %eax
This fetches a value from the stack at offset -68 from %rbp and adds it to the %eax register.
In the constant code, here is how the add is done:
addl $5, %eax
and if you look at the actual numbers, you see this:
0138 83C005
It's pretty clear that the constant being added is encoded directly into the instruction as a small value. This is going to be much faster to fetch than fetching a value from a stack offset for a number of reasons. First it's smaller. Secondly, it's part of an instruction stream with no branches. So it will be pre-fetched and pipelined with no possibility for cache stalls of any kind.
So while my surmise about the lea instruction wasn't correct, I was still on the right track. The constant version uses a small instruction specifically oriented towards adding a small integer to a register. The non-constant version has to fetch an integer that may be of indeterminate size (so it has to fetch ALL the bits, not just the low ones) from a stack offset (which adds in an additional add to compute the actual address from the offset and stack base address).
Edit 2: Now that you've posted the -O3 results
Well, it's much more confusing now. The compiler has apparently inlined the function in question, and it jumps around a whole ton between the code for the inlined function and the code for the calling function. I'm going to need to see the original code for the whole file to make a proper analysis.
But what I strongly suspect is happening now is that the unpredictability of the values retrieved from get_random_number_in_range is severely limiting the optimization options available to the compiler. In fact, it looks like in the constant version it doesn't even bother to call get_random_number_in_range because the value is tossed out and never used.
I'm assuming that the values of patch_top_left_row and patch_top_left_col are generated in a loop somewhere. I would push this loop into this function. If the compiler knows the values are generated as part of a loop, there are a very large number of optimization options open to it. In the extreme case it could use some of the SIMD instructions that are part of the various SSE or 3dnow! instruction suites to make things a whole ton faster than even the version you have that uses constants.
The other option would be to make this function inline, which would hint to the compiler that it should try inserting it into the loop in which it's called. If the compiler takes the hint (this function is a bit largish, so the compiler might not) it will have much the same effect as if you'd stuffed the loop into the function.
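Here is a rough sketch of the first suggestion (pushing the loop into the function); it reuses the names from the question, but the way the rows/columns are supplied is an assumption for illustration:
double sum_over_patches(int n_patches, const int* rows, const int* cols,
                        int image_hash_key, Mat* preloaded_images,
                        int* random_values)
{
    double total = 0.0;
    // With the whole loop visible here, the compiler can hoist the hash lookup,
    // keep pointers in registers, and possibly vectorize across patches.
    for (int i = 0; i < n_patches; ++i)
        total += single_channel_add(rows[i], cols[i], image_hash_key,
                                    preloaded_images, random_values);
    return total;
}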
Well, binary arithmetic operations with an immediate operand are expected to produce faster code than ones that have to read an operand from memory, but the timing effect you observe appears to be too extreme, especially considering that there are other operations inside that function.
Could it be that the compiler decided to inline your function? Inlining would allow the compiler to easily eliminate everything related to the unused patch_top_left_row and patch_top_left_col parameters in the second version, including any steps that prepare/calculate these parameters in the calling code.
Technically, this can be done even if the function is not inlined, but it is generally more complicated.
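A minimal, hypothetical illustration of that effect (the names below are made up, not from the question):
static int make_row() { return 5; }                      // stands in for real preparation work

static int g(int /*row*/, int val) { return 5 + val; }   // the 'row' parameter is never used

int caller(int val)
{
    int row = make_row();   // once g() is inlined and 'row' is seen to be dead,
    return g(row, val);     // this call and the variable can be removed entirely
}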

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cphMHz.
So, results may vary but all in all we were told that we should get a result close to 1 cycle-length. Actual result doesn't matter as long as programming is correct.
My question is: what is a good operation to profile?
There are two points I need to consider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instructions. This means I have to choose an operation that wouldn't be collapsed into a single instance by compiler optimization (we can't use the -O0 flag, as the school staff does not allow it).
Bad example: var = i; - the compiler would only perform the last assignment.
What is a real 'simple instruction'? How do I know the number of operations that are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough; any ideas would be great.
Thanks anyway.
P.S. I don't know if it matters, but I write in C++.
1) This sounds (to me) like an impossible task if optimizations are (or might be) enabled. You can never be sure what the compiler will do during optimization. I'd definitely do something like reusing the previous result (a sketch of this follows after point 2). If allowed/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it still could be optimized).
2) As for instructions: One assembler command is one instruction. E.g. a += i will - depending on available instruction set and stuff - most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"). x86 assemblers (and those for most other common processors) will prefer instruction target, source, while DSPs prefer to use instruction source, target. Just important to know: moving data has to happen through registers. So even a single assignment like a = b will result in two instructions (b to register and register to a).
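Here is the sketch promised in point 1: reusing the previous result creates a dependency chain, and an empty GCC extended-asm statement acts as an optimization barrier so the unrolled adds are not folded into one. This is GCC-specific, and a sufficiently clever compiler may still rearrange things, so treat it as an illustration rather than a guarantee:
static inline void keep(unsigned int &x)
{
    __asm__ volatile("" : "+r"(x));   // tells the compiler x may have changed, keeps it in a register
}

unsigned int chain(unsigned int x, unsigned long iterations)
{
    for (unsigned long i = 0; i < iterations; ++i) {
        // unrolled by 10; every add depends on the previous result
        x += 1; keep(x);   x += 1; keep(x);
        x += 1; keep(x);   x += 1; keep(x);
        x += 1; keep(x);   x += 1; keep(x);
        x += 1; keep(x);   x += 1; keep(x);
        x += 1; keep(x);   x += 1; keep(x);
    }
    return x;
}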
In general, if this answer goes in the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.