I have seen this kind of code in many SO answers:
template <typename T>
inline T imax (T a, T b)
{
return (a > b) * a + (a <= b) * b;
}
where the authors say that this is branchless.
But is this really branchless on current architectures? (x86, ARM...)
And is there a real standard guarantee that this is branchless?
x86 has the SETcc family of instructions which set a byte register to 1 or 0 depending on the value of a flag. This is commonly used by compilers to implement this kind of code without branches.
If you use the “naïve” approach
int imax(int a, int b) {
return a > b ? a : b;
}
the compiler will typically generate even more efficient branchless code using the CMOVcc (conditional move) family of instructions.
ARM (in its classic 32-bit encoding) can conditionally execute almost every instruction, which allows the compiler to compile both your version and the naïve implementation efficiently, with the naïve implementation being faster.
I stumbled upon this SO question because I was asking myself the same thing… it turns out it's not always branchless. For instance, the following code…
const struct op {
const char *foo;
int bar;
int flags;
} ops[] = {
{ "foo", 5, 16 },
{ "bar", 9, 16 },
{ "baz", 13, 0 },
{ 0, 0, 0 }
};
extern int foo(const struct op *, int);
int
bar(void *a, void *b, int c, const struct op *d)
{
c |= (a == b) && (d->flags & 16);
return foo(d, c) + 1;
}
… compiles to branching code with both gcc 3.4.6 (i386) and 8.3.0 (amd64, i386) at all optimisation levels. The output from 3.4.6 is easier to read, so I'll demonstrate with gcc -O2 -S -masm=intel x.c; less x.s:
[…]
.text
.p2align 2,,3
.globl bar
.type bar , #function
bar:
push %ebp
mov %ebp, %esp
push %ebx
push %eax
mov %eax, DWORD PTR [%ebp+12]
xor %ecx, %ecx
cmp DWORD PTR [%ebp+8], %eax
mov %edx, DWORD PTR [%ebp+16]
mov %ebx, DWORD PTR [%ebp+20]
je .L4
.L2:
sub %esp, 8
or %edx, %ecx
push %edx
push %ebx
call foo
inc %eax
mov %ebx, DWORD PTR [%ebp-4]
leave
ret
.p2align 2,,3
.L4:
test BYTE PTR [%ebx+8], 16
je .L2
mov %cl, 1
jmp .L2
.size bar , . - bar
It turns out the pointer comparison is compiled to a conditional branch, with a separate code path that sets the flag byte to 1.
Not even using !!(a == b) makes a difference here.
tl;dr
Check the actual compiler output (assembly with -S or disassembly with objdump -d -Mintel x.o; drop the -Mintel if not on x86, it merely makes the assembly more legible) of the actual compilation; compilers are unpredictable beasts.
Related
Why does the following code return zero when compiled with clang?
#include <stdint.h>
uint64_t square() {
return ((uint32_t)((double)4294967296.0)) - 10;
}
The assembly it produces is:
push rbp
mov rbp, rsp
xor eax, eax
pop rbp
ret
I would have expected that double to become zero (as an integer) and the subtraction to wrap it around. In other words, why doesn't it matter what number is subtracted; why does it always produce zero? Note that gcc does produce different numbers, as expected:
push rbp
mov rbp, rsp
mov eax, 4294967285
pop rbp
ret
I assume casting 4294967296.0 to uint32_t is undefined behaviour, but even then I would expect it to produce different results for different subtrahends.
The behaviour of casting an out-of-range double to an unsigned type is indeed undefined.
Wrap-around does not apply, even for an unsigned type.
This is an oft-forgotten rule.
Once program control reaches an undefined construct, the entire program is undefined, including, somewhat paradoxically, statements that have already run.
Reference: https://timsong-cpp.github.io/cppwp/conv.fpint#1
g++ 5.4.0 gives 4294967285 (-11 unsigned), so it is up to the compiler what it wants to do with undefined behavior.
__Z6squarev:
LFB1023:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
movl $-11, %eax
movl $0, %edx
popl %ebp
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
I know that C++ compilers optimize empty (static) functions.
Based on that knowledge I wrote a piece of code that should get optimized away whenever some identifier is defined (using the -D option of the compiler).
Consider the following dummy example:
#include <iostream>
#ifdef NO_INC
struct T {
static inline void inc(int& v, int i) {}
};
#else
struct T {
static inline void inc(int& v, int i) {
v += i;
}
};
#endif
int main(int argc, char* argv[]) {
int a = 42;
for (int i = 0; i < argc; ++i)
T::inc(a, i);
std::cout << a;
}
The desired behavior would be the following:
Whenever the NO_INC identifier is defined (using -DNO_INC when compiling), all calls to T::inc(...) should be optimized away (due to the empty function body). Otherwise, the call to T::inc(...) should trigger an increment by some given value i.
I got two questions regarding this:
Is my assumption correct that calls to T::inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
I wonder if the variables (a and i) are still loaded into the cache when T::inc(a, i) is called (assuming they are not there yet) although the function body is empty.
Thanks for any advice!
Compiler Explorer is a very useful tool for looking at the assembly of your generated program, because there is no other way to figure out for sure whether the compiler optimized something away. Demo.
With actually incrementing, your main looks like:
main: # #main
push rax
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
shr rcx
lea esi, [rcx + rdi]
add esi, 41
jmp .LBB0_3
.LBB0_1:
mov esi, 42
.LBB0_3:
mov edi, offset std::cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
As you can see, the compiler completely inlined the call to T::inc and does the incrementing directly.
For an empty T::inc you get:
main: # #main
push rax
mov edi, offset std::cout
mov esi, 42
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
The compiler optimized away the entire loop!
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
Yes.
If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
No, for some definition of "complex". Compilers use heuristics to decide whether it's worth inlining a function, and base their decision on that.
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
No, as demonstrated above, the loop doesn't even exist.
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized? If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
You are right. I have modified your example (i.e. removed cout which clutters the assembly) in compiler explorer to make it more obvious what happens.
The compiler optimizes everything away and outputs
main: # #main
movl $42, %eax
retq
Only 42 is loaded into eax and returned.
For the more complex case, however, more instructions are needed to compute the return value. See here
main: # #main
testl %edi, %edi
jle .LBB0_1
leal -1(%rdi), %eax
leal -2(%rdi), %ecx
imulq %rax, %rcx
shrq %rcx
leal (%rcx,%rdi), %eax
addl $41, %eax
retq
.LBB0_1:
movl $42, %eax
retq
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
They are only loaded when the compiler cannot prove that they are unused. See the second example on Compiler Explorer.
By the way: you do not need to make an instance of T (i.e. T t;) in order to call a static member function; that defeats the purpose. Call it as T::inc(...) rather than t.inc(...).
Because the inline keyword is used, you can safely assume 1: using these functions shouldn't negatively affect performance.
Running your code through
g++ -c -Os -g
objdump -S
confirms this. An extract:
int main(int argc, char* argv[]) {
T t;
int a = 42;
1020: b8 2a 00 00 00 mov $0x2a,%eax
for (int i = 0; i < argc; ++i)
1025: 31 d2 xor %edx,%edx
1027: 39 fa cmp %edi,%edx
1029: 7d 06 jge 1031 <main+0x11>
v += i;
102b: 01 d0 add %edx,%eax
for (int i = 0; i < argc; ++i)
102d: ff c2 inc %edx
102f: eb f6 jmp 1027 <main+0x7>
t.inc(a, i);
return a;
}
1031: c3 retq
(I replaced the cout with return for better readability)
Consider this code:
struct A{
volatile int x;
A() : x(12){
}
};
A foo(){
A ret;
//Do stuff
return ret;
}
int main()
{
A a;
a.x = 13;
a = foo();
}
Using g++ -std=c++14 -pedantic -O3 I get this assembly:
foo():
movl $12, %eax
ret
main:
xorl %eax, %eax
ret
According to my estimation the variable x should be written to at least three times (possibly four), yet it is not written even once (the function foo isn't even called!)
Even worse when you add the inline keyword to foo this is the result:
main:
xorl %eax, %eax
ret
I thought that volatile means that every single read or write must happen even if the compiler can not see the point of the read/write.
What is going on here?
Update:
Putting the declaration of A a; outside main like this:
A a;
int main()
{
a.x = 13;
a = foo();
}
Generates this code:
foo():
movl $12, %eax
ret
main:
movl $13, a(%rip)
xorl %eax, %eax
movl $12, a(%rip)
ret
movl $12, a(%rip)
ret
a:
.zero 4
which is closer to what you would expect… I am even more confused than ever.
Visual C++ 2015 does not optimize away the assignments:
A a;
mov dword ptr [rsp+8],0Ch <-- write 1
a.x = 13;
mov dword ptr [a],0Dh <-- write2
a = foo();
mov dword ptr [a],0Ch <-- write3
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
}
xor eax,eax
ret
The same happens both with /O2 (Maximize speed) and /Ox (Full optimization).
The volatile writes are kept also by gcc 3.4.4 using both -O2 and -O3
_main:
pushl %ebp
movl $16, %eax
movl %esp, %ebp
subl $8, %esp
andl $-16, %esp
call __alloca
call ___main
movl $12, -4(%ebp) <-- write1
xorl %eax, %eax
movl $13, -4(%ebp) <-- write2
movl $12, -8(%ebp) <-- write3
leave
ret
Using both of these compilers, if I remove the volatile keyword, main() becomes essentially empty.
I'd say you have a case where the compiler over-aggressively (and incorrectly IMHO) decides that since 'a' is not used, operations on it aren't necessary, and overlooks the volatile member. Making 'a' itself volatile could get you what you want, but as I don't have a compiler that reproduces this, I can't say for sure.
Also (while this is admittedly Microsoft-specific), https://msdn.microsoft.com/en-us/library/12a04hfd.aspx says:
If a struct member is marked as volatile, then volatile is propagated to the whole structure.
Which also points towards the behavior you are seeing being a compiler problem.
Last, if you make 'a' a global variable, it is somewhat understandable that the compiler is less eager to deem it unused and drop it. Global variables are extern by default, so it is not possible to say that a global 'a' is unused just by looking at the main function. Some other compilation unit (.cpp file) might be using it.
GCC's page on Volatile access gives some insight into how it works:
The standard encourages compilers to refrain from optimizations concerning accesses to volatile objects, but leaves it implementation defined as to what constitutes a volatile access. The minimum requirement is that at a sequence point all previous accesses to volatile objects have stabilized and no subsequent accesses have occurred. Thus an implementation is free to reorder and combine volatile accesses that occur between sequence points, but cannot do so for accesses across a sequence point. The use of volatile does not allow you to violate the restriction on updating objects multiple times between two sequence points.
In C standardese:
§5.1.2.3
2 Accessing a volatile object, modifying an object, modifying a file,
or calling a function that does any of those operations are all side
effects, 11) which are changes in the state of the
execution environment. Evaluation of an expression may produce side
effects. At certain specified points in the execution sequence called
sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have
taken place. (A summary of the sequence points is given in annex C.)
3 In the abstract machine, all expressions are evaluated as specified
by the semantics. An actual implementation need not evaluate part of
an expression if it can deduce that its value is not used and that no
needed side effects are produced (including any caused by calling a
function or accessing a volatile object).
[...]
5 The least requirements on a conforming implementation are:
At sequence points, volatile objects are stable in the sense that previous accesses are complete and subsequent accesses have not yet
occurred. [...]
I chose the C standard because the language is simpler but the rules are essentially the same in C++. See the "as-if" rule.
Now on my machine, -O1 doesn't optimize away the call to foo(), so let's use -fdump-tree-optimized to see the difference:
-O1
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
a = foo ();
a ={v} {CLOBBER};
return 0;
}
And -O3:
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A ret;
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
ret.x ={v} 12;
ret ={v} {CLOBBER};
a ={v} {CLOBBER};
return 0;
}
gdb reveals in both cases that a is ultimately optimized out, but we're worried about foo(). The dumps show us that GCC reordered the accesses so that foo() is not even necessary and subsequently all of the code in main() is optimized out. Is this really true? Let's see the assembly output for -O1:
foo():
mov eax, 12
ret
main:
call foo()
mov eax, 0
ret
This essentially confirms what I said above. Everything is optimized out: the only difference is whether or not the call to foo() is as well.
This question already has answers here:
What is the lifetime of a static variable in a C++ function?
C++, unlike some other languages, allows static data to be of any arbitrary type, not just plain-old-data. Plain-old-data is trivial to initialize (the compiler just writes the value at the appropriate address in the data segment), but the other, more complex types, are not.
How is initialization of non-POD types typically implemented in C++? In particular, what exactly happens when the function foo is executed for the first time? What mechanisms are used to keep track of whether str has already been initialized or not?
#include <string>
void foo() {
static std::string str("Hello, Stack Overflow!");
}
C++11 requires the initialization of function local static variables to be thread-safe. So at least in compilers that are compliant, there'll typically be some sort of synchronization primitive in use that'll need to be checked each time the function is entered.
For example, here's the assembly listing for the code from this program:
#include <string>
void foo() {
static std::string str("Hello, Stack Overflow!");
}
int main() {}
.LC0:
.string "Hello, Stack Overflow!"
foo():
cmpb $0, guard variable for foo()::str(%rip)
je .L14
ret
.L14:
pushq %rbx
movl guard variable for foo()::str, %edi
subq $16, %rsp
call __cxa_guard_acquire
testl %eax, %eax
jne .L15
.L1:
addq $16, %rsp
popq %rbx
ret
.L15:
leaq 15(%rsp), %rdx
movl $.LC0, %esi
movl foo()::str, %edi
call std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)
movl guard variable for foo()::str, %edi
call __cxa_guard_release
movl $__dso_handle, %edx
movl foo()::str, %esi
movl std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string(), %edi
call __cxa_atexit
jmp .L1
movq %rax, %rbx
movl guard variable for foo()::str, %edi
call __cxa_guard_abort
movq %rbx, %rdi
call _Unwind_Resume
main:
xorl %eax, %eax
ret
The __cxa_guard_acquire, __cxa_guard_release etc. are guarding initialization of the static variable.
The implementation that I've seen uses a hidden boolean variable to check if the variable is initialized. Modern compilers will do this thread-safely, but IIRC some older compilers did not, and if the function was called from several threads at the same time you could get the constructor called twice.
Something along the lines of:
static bool __str_initialized = false;
static char __mem_for_str[...]; //std::string str("Hello, Stack Overflow!");
void foo() {
if (!__str_initialized)
{
lock();
__str_initialized = true;
new (__mem_for_str) std::string("Hello, Stack Overflow!");
unlock();
}
}
Then, in the finalization code of the program:
if (__str_initialized)
((std::string&)__mem_for_str).~std::string();
It's implementation specific.
Typically, there'll be a flag (statically initialised to zero) to indicate whether it's initialised, and (in C++11, or earlier thread-safe implementations) some kind of mutex, also statically initialisable, to protect against multiple threads trying to initialise it.
The generated code would typically behave something along the lines of
static __atomic_flag_type __initialised = false;
static __mutex_type __mutex = __MUTEX_INITIALISER;
if (!__initialised) {
__lock_type __lock(__mutex);
if (!__initialised) {
__initialise(str);
__initialised = true;
}
}
You can check what your compiler does by generating an assembler listing.
MSVC2008 in debug mode generates this code (excluding exception handling prolog/epilog etc):
mov eax, DWORD PTR ?$S1@?1??foo@@YA_NXZ@4IA
and eax, 1
jne SHORT $LN1@foo
mov eax, DWORD PTR ?$S1@?1??foo@@YA_NXZ@4IA
or eax, 1
mov DWORD PTR ?$S1@?1??foo@@YA_NXZ@4IA, eax
mov DWORD PTR __$EHRec$[ebp+8], 0
mov esi, esp
push OFFSET ??_C@_0BH@ENJCLPMJ@Hello?0?5Stack?5Overflow?$CB?$AA@
mov ecx, OFFSET ?str@?1??foo@@YA_NXZ@4V?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@A
call DWORD PTR __imp_??0?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@QAE@PBD@Z
cmp esi, esp
call __RTC_CheckEsp
push OFFSET ??__Fstr@?1??foo@@YA_NXZ@YAXXZ ; `foo'::`2'::`dynamic atexit destructor for 'str''
call _atexit
add esp, 4
mov DWORD PTR __$EHRec$[ebp+8], -1
$LN1@foo:
i.e. there is a static guard variable referenced as ?$S1@?1??foo@@YA_NXZ@4IA. It is checked to see whether its low bit is set; if so, initialization is skipped by branching to the label $LN1@foo:. Otherwise the code ORs 1 into the flag, constructs the string at a known location, and registers a call to its destructor at program exit using atexit. The function then continues as normal.
I presume that the following will give me 10 volatile ints
volatile int foo[10];
However, I don't think the following will do the same thing.
volatile int* foo;
foo = malloc(sizeof(int)*10);
Please correct me if I am wrong about this, and show me how I can get a volatile array of items using malloc.
Thanks.
int volatile * foo;
read from right to left "foo is a pointer to a volatile int"
so whatever int you access through foo, the int will be volatile.
P.S.
int * volatile foo; // "foo is a volatile pointer to an int"
!=
volatile int * foo; // "foo is a pointer to a volatile int"
In the first case the pointer itself is volatile; in the second, it is the int it points to. The leading-qualifier spelling is really just a leftover exception to the general right-to-left rule.
The lesson to be learned is to get in the habit of writing
char const * foo;
instead of the more common
const char * foo;
If you want more complicated things like "pointer to function returning pointer to int" to make any sense.
P.S., and this is a biggy (and the main reason I'm adding an answer):
I note that you included "multithreading" as a tag. Do you realize that volatile does little or nothing useful with respect to multithreading?
volatile int* foo;
is the way to go. The volatile type qualifier works just like the const type qualifier. If you wanted a pointer to a constant integer you would write:
const int* foo;
whereas
int* const foo;
is a constant pointer to an integer: the pointer itself cannot be changed, but the integer it points to can. volatile works the same way.
Yes, that will work. There is nothing special about the actual memory that is volatile; the qualifier is just a way to tell the compiler how to interact with that memory.
I think the second declares the pointer to be volatile, not what it points to. To get that, I think it should be
int * volatile foo;
This syntax is acceptable to gcc, but I'm having trouble convincing myself that it does anything different.
I found a difference with gcc -O3 (full optimization). For this (silly) test code:
volatile int v [10];
int * volatile p;
int main (void)
{
v [3] = p [2];
p [3] = v [2];
return 0;
}
With volatile, and omitting (x86) instructions which don't change:
movl p, %eax
movl 8(%eax), %eax
movl %eax, v+12
movl p, %edx
movl v+8, %eax
movl %eax, 12(%edx)
Without volatile, it skips reloading p:
movl p, %eax
movl 8(%eax), %edx ; different since p being preserved
movl %edx, v+12
; 'p' not reloaded here
movl v+8, %edx
movl %edx, 12(%eax) ; p reused
Beyond that reload, after many more science experiments trying to find a difference, I conclude there is none: volatile simply turns off optimizations that would reuse a previously loaded value of the variable. At least with x86 gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33). :-)
Thanks very much to wallyk; using his method I was able to devise some code that generates assembly and proves to myself the difference between the pointer declarations.
Using the following code and compiling with -O3:
int main (void)
{
while(p[2]);
return 0;
}
When p is declared as a plain pointer, we get stuck in a loop that is impossible to get out of. Note that even if this were a multithreaded program and a different thread wrote p[2] = 0, the program would never break out of the while loop, because p[2] is read only once before the loop.
int * p;
============
LCFI1:
movq _p(%rip), %rax
movl 8(%rax), %eax
testl %eax, %eax
jne L6
xorl %eax, %eax
leave
ret
L6:
jmp L6
notice that the only instruction at L6 is a jump back to L6.
==
when p is a volatile pointer
int * volatile p;
==============
L3:
movq _p(%rip), %rax
movl 8(%rax), %eax
testl %eax, %eax
jne L3
xorl %eax, %eax
leave
ret
Here the pointer p gets reloaded on each loop iteration, and as a consequence the array item gets reloaded too. However, this is still not correct if we want an array of volatile integers, because the following would be possible:
int* volatile p;
..
..
int* j;
j = &p[2];
while(*j);
and would result in a loop that is impossible to terminate in a multithreaded program, because the access through j is no longer volatile.
==
Finally, this is the correct solution, as Tony nicely explained.
int volatile * p;
LCFI1:
movq _p(%rip), %rdx
addq $8, %rdx
.align 4,0x90
L3:
movl (%rdx), %eax
testl %eax, %eax
jne L3
leave
ret
In this case the address of p[2] is kept in a register and not reloaded from memory, but the value of p[2] is reloaded from memory on every loop cycle.
also note that
int volatile * p;
..
..
int* j;
j = &p[2];
while(j);
will generate a compile error: the assignment j = &p[2] discards the volatile qualifier.