While playing around with godbolt.org, I noticed that gcc (6.2, 7.0 snapshot), clang (3.9) and icc (17), when compiling something close to
int a(int* a, int* b) {
if (b - a < 2) return *a = ~*a;
// register intensive code here e.g. sorting network
}
compile (-O2/-O3) this into something like this:
push r15
mov rax, rcx
push r14
sub rax, rdx
push r13
push r12
push rbp
push rbx
sub rsp, 184
mov QWORD PTR [rsp], rdx
cmp rax, 7
jg .L95
not DWORD PTR [rdx]
.L162:
add rsp, 184
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
which obviously has a huge overhead in the case of b - a < 2. With -Os, gcc compiles it to:
mov rax, rcx
sub rax, rdx
cmp rax, 7
jg .L74
not DWORD PTR [rdx]
ret
.L74:
This leads me to believe that there is nothing keeping the compiler from emitting the shorter code.
Is there a reason why compilers do this? Is there a way to get them to emit the shorter version without compiling for size?
Here's an example on Godbolt that reproduces this. It seems to have something to do with the complex part being recursive.
This is a known compiler limitation, see my comments on the question. IDK why it exists; maybe it's hard for compilers to decide what they can do without spilling when they haven't finished saving regs yet.
Pulling the early-out check into a wrapper is often useful when it's small enough to inline.
Looks like modern gcc can actually sidestep this compiler limitation sometimes.
Using your example on the Godbolt compiler explorer, adding a second caller is enough to get even gcc6.1 -O2 to split the function for you, so it can inline the early-out into the second caller and into the externally visible square() (which ends with jmp square(int*, int*) [clone .part.3] if the early-out return path isn't taken).
Code on Godbolt; note I added -std=gnu++14, which is required for clang to compile your code.
void square_inlinewrapper(int* a, int* b) {
//if (b - a < 16) return; // gcc inlines this part for us, and calls a private clone of the function!
return square(a, b);
}
# gcc6.1 -O2 (default / generic -march= and -mtune=)
mov rax, rsi
sub rax, rdi
cmp rax, 63
jg .L9
rep ret
.L9:
jmp square(int*, int*) [clone .part.3]
square() itself compiles to the same thing, calling the private clone which has the bulk of the code. The recursive calls from inside the clone call the wrapper function, so they don't do the extra push/pop work when it's not needed.
Even gcc7 doesn't do this when there's no other caller, even at -O3. It does still transform one of the recursive calls into a loop, but the other one just calls the big function again.
Clang 3.9 and icc17 don't clone the function, either, so you should write the inlineable wrapper manually (and change the main body of the function to use it for recursive calls, if the check is needed there).
You might want to name the wrapper square, and rename just the main body to a private name (like static void square_impl).
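A minimal sketch of that manual split (hand-written for illustration, using the question's b - a < 2 early-out; square_impl is a hypothetical name):

static void square_impl(int* a, int* b);   // the big register-intensive body

// Small, inlineable wrapper: do the cheap check before paying the
// push/pop and stack-frame cost of the full function.
inline void square(int* a, int* b) {
    if (b - a < 2) { *a = ~*a; return; }   // early-out path, no spills needed
    square_impl(a, b);                     // heavy path
}

Recursive calls inside square_impl should go through square (or repeat the check) if the early-out is needed there, so the cheap path stays cheap.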
Related
I've been experimenting with code on different compilers.
I've been trying to look up the advantages of disabling exceptions on certain functions (via the binary footprint) and to compare that to functions that don't disable exceptions, and I've actually stumbled onto a weird case where it's better to have exceptions than not.
I've been using Matt Godbolt's Compiler Explorer to do these checks, and it was checked on x86-64 clang 12.0.1 without any flags (on GCC this weird behavior doesn't exist).
Looking at this simple code:
auto* allocated_int()
{
return new int{};
}
int main()
{
delete allocated_int();
return 0;
}
Very straightforward: it just deletes the allocated pointer returned from allocated_int().
As expected, the binary footprint is minimal, as well:
allocated_int(): # #allocated_int()
push rbp
mov rbp, rsp
mov edi, 4
call operator new(unsigned long)
mov rcx, rax
mov rax, rcx
mov dword ptr [rcx], 0
pop rbp
ret
Also very straightforward.
But the moment I apply the noexcept keyword to the allocated_int() function, the binary bloats. I'll paste the resulting assembly here:
allocated_int(): # #allocated_int()
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 4
call operator new(unsigned long)
mov rcx, rax
mov qword ptr [rbp - 8], rcx # 8-byte Spill
jmp .LBB0_1
.LBB0_1:
mov rcx, qword ptr [rbp - 8] # 8-byte Reload
mov rax, rcx
mov dword ptr [rcx], 0
add rsp, 16
pop rbp
ret
mov rdi, rax
call __clang_call_terminate
__clang_call_terminate: # #__clang_call_terminate
push rax
call __cxa_begin_catch
call std::terminate()
Why is clang doing this extra code for us? I didn't request any other action but calling new(), and I was expecting the binary to reflect that.
Thanks to anyone who can explain!
Why is clang doing this extra code for us?
Because the behaviour of the function is different.
I didn't request any other action but calling new()
By declaring the function noexcept, you've requested std::terminate to be called in case an exception propagates out of the function.
allocated_int in the first program never calls std::terminate, while
allocated_int in the second program may call std::terminate. Note that the amount of added code is much less if you remember to enable the optimiser. Comparing non-optimised assembly is mostly futile.
You can use non-throwing allocation to prevent that:
return new(std::nothrow) int{};
It's indeed an astute observation that doing potentially throwing things inside a non-throwing function can introduce some extra work that wouldn't be needed if the same things were done in a potentially throwing function.
I've been trying to look up the advantages of disabling exceptions on certain functions
The advantage of using non-throwing is potentially realised where such function is called; not within the function itself.
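For illustration, here's a sketch (mine, not from the question) combining noexcept with non-throwing allocation, so nothing inside the function can throw and no std::terminate() landing pad is needed:

#include <new>

int* allocated_int() noexcept
{
    // new (std::nothrow) returns nullptr on failure instead of throwing,
    // so no exception can propagate out of this noexcept function.
    return new (std::nothrow) int{};
}

The caller must now check for nullptr instead of relying on std::bad_alloc.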
Without noexcept, your function just acts as a front end to the allocation function you call. It doesn't have any real behavior of its own. In fact, in a real executable, if you do link-time optimization there's a pretty good chance that it'll completely disappear.
When you add noexcept, your code is silently transformed into something roughly like this:
auto* allocated_int()
{
try {
return new int{};
}
catch(...) {
terminate();
}
}
The extra code you see generated is what's needed to catch the exception and call terminate when/if needed.
Consider the following program, which just times a loop:
#include <cstdlib>
std::size_t count(std::size_t n)
{
#ifdef VOLATILEVAR
volatile std::size_t i = 0;
#else
std::size_t i = 0;
#endif
while (i < n) {
#ifdef VOLATILEASM
asm volatile("": : :"memory");
#endif
++i;
}
return i;
}
int main(int argc, char* argv[])
{
return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
For readability, the version with both the volatile variable and the volatile asm reads as follows:
#include <cstdlib>
std::size_t count(std::size_t n)
{
volatile std::size_t i = 0;
while (i < n) {
asm volatile("": : :"memory");
++i;
}
return i;
}
int main(int argc, char* argv[])
{
return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
Compilation under g++ 8 with g++ -Wall -Wextra -g -std=c++11 -O3 loop.cpp -o loop gives roughly the following timings:
default: 0m0.001s
-DVOLATILEASM: 0m1.171s
-DVOLATILEVAR: 0m5.954s
-DVOLATILEVAR -DVOLATILEASM: 0m5.965s
The question I have is: why is that? The default version makes sense, since the loop is optimized away by the compiler. But I have a harder time understanding why -DVOLATILEVAR takes so much longer than -DVOLATILEASM, since both should force the loop to run.
Compiler explorer gives the following count function for -DVOLATILEASM:
count(unsigned long):
mov rax, rdi
test rdi, rdi
je .L2
xor edx, edx
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
.L2:
ret
and for -DVOLATILEVAR (and the combined -DVOLATILEASM -DVOLATILEVAR):
count(unsigned long):
mov QWORD PTR [rsp-8], 0
mov rax, QWORD PTR [rsp-8]
cmp rdi, rax
jbe .L2
.L3:
mov rax, QWORD PTR [rsp-8]
add rax, 1
mov QWORD PTR [rsp-8], rax
mov rax, QWORD PTR [rsp-8]
cmp rax, rdi
jb .L3
.L2:
mov rax, QWORD PTR [rsp-8]
ret
What is the exact reason for that? Why does the volatile qualification of the variable prevent the compiler from generating the same loop as the one with asm volatile?
When you make i volatile, you tell the compiler that something it doesn't know about can change its value. That means it is forced to load its value every time you use it, and it has to store it every time you write to it. When i is not volatile, the compiler can optimize that synchronization away.
-DVOLATILEVAR forces the compiler to keep the loop counter in memory, so the loop bottlenecks on the latency of store/reload (store forwarding): ~5 cycles, plus the 1-cycle latency of an add.
Every assignment to and read from volatile int i is considered an observable side-effect of the program that the optimizer has to make happen in memory, not just a register. This is what volatile means.
There's also a reload for the compare, but that's only a throughput issue, not latency. The ~6-cycle loop-carried data dependency means your CPU doesn't bottleneck on any throughput limits.
This is similar to what you'd get from -O0 compiler output, so have a look at my answer on Adding a redundant assignment speeds up code when compiled without optimization for more about loops like that, and x86 store-forwarding.
With only VOLATILEASM, the empty asm template ("") has to run the right number of times. Being empty, it doesn't add any instructions to the loop, so you're left with a 2-uop add / cmp+jne loop that can run at 1 iteration per clock on modern x86 CPUs.
Critically, the loop counter can stay in a register, despite the compiler memory barrier. A "memory" clobber is treated like a call to a non-inline function: it might read or modify any object that it could possibly have a reference to, but that does not include local variables whose address has never escaped the function. (I.e. we never called sscanf("0", "%d", &i) or posix_memalign(&i, 64, 1234). But if we did, then the "memory" barrier would have to spill/reload it, because an external function could have saved a pointer to the object.)
i.e. a "memory" clobber is only a full compiler barrier for objects that could possibly be visible outside the current function. This is really only an issue when messing around and looking at compiler output to see what barriers do what, because a barrier can only matter for multi-threading correctness for variables that other threads could possible have a pointer to.
And BTW, your asm statement is already implicitly volatile because it has no output operands. (See Extended-Asm#Volatile in the gcc manual).
You can add a dummy output to make a non-volatile asm statement that the compiler can optimize away, but unfortunately gcc still keeps the empty loop after eliminating a non-volatile asm statement from it. If i's address has escaped the function, removing the asm statement entirely turns the loop into a single compare-and-jump over a store, right before the function returns. I think it would be legal to simply return without ever storing to that local, because there's no way a correct program can know that it managed to read i from another thread before i went out of scope.
But anyway, here's the source I used. As I said, note that there's always an asm statement here, and I'm controlling whether it's volatile or not.
#include <stdlib.h>
#include <stdio.h>
#ifndef VOLATILEVAR // compile with -DVOLATILEVAR=volatile to apply that
#define VOLATILEVAR
#endif
#ifndef VOLATILEASM // Different from your def; yours drops the whole asm statement
#define VOLATILEASM
#endif
// note I ported this to also be valid C, but I didn't try -xc to compile as C.
size_t count(size_t n)
{
int dummy; // asm with no outputs is implicitly volatile
VOLATILEVAR size_t i = 0;
sscanf("0", "%zd", &i);
while (i < n) {
asm VOLATILEASM ("nop # operand = %0": "=r"(dummy) : :"memory");
++i;
}
return i;
}
compiles (with gcc4.9 and newer, -O3, with neither VOLATILE macro enabled) to this weird asm.
(Godbolt compiler explorer with gcc and clang):
# gcc8.1 -O3 with sscanf(.., &i) but non-volatile asm
# the asm nop doesn't appear anywhere, but gcc is making clunky code.
.L8:
mov rdx, rax # i, <retval>
.L3: # first iter entry point
lea rax, [rdx+1] # <retval>,
cmp rax, rbx # <retval>, n
jb .L8 #,
Nice job, gcc.... gcc4.8 -O3 avoids pulling an extra mov inside the loop:
# gcc4.8 -O3 with sscanf(.., &i) but non-volatile asm
.L3:
add rdx, 1 # i,
cmp rbx, rdx # n, i
ja .L3 #,
mov rax, rdx # i.0, i # outside the loop
Anyway, without the dummy output operand, or with volatile, gcc8.1 gives us:
# gcc8.1 with sscanf(&i) and asm volatile("nop" ::: "memory")
.L3:
nop # operand = eax # dummy
mov rax, QWORD PTR [rsp+8] # tmp96, i
add rax, 1 # <retval>,
mov QWORD PTR [rsp+8], rax # i, <retval>
cmp rax, rbx # <retval>, n
jb .L3 #,
So we see the same store/reload of the loop counter, the only difference from volatile i being that the cmp doesn't need to reload it.
I used nop instead of just a comment because Godbolt hides comment-only lines by default, and I wanted to see it. For gcc, it's purely a text substitution: we're looking at the compiler's asm output with operands substituted into the template before it's sent to the assembler. For clang, there might be some effect because the asm has to be valid (i.e. actually assemble correctly).
If we comment out the scanf and remove the dummy output operand, we get a register-only loop with the nop in it. But keep the dummy output operand and the nop doesn't appear anywhere.
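That register-only loop has roughly this shape (a hand-written sketch of the expected output, not pasted compiler output):

.L3:
    nop                 # the asm template, emitted once per iteration
    add rax, 1          # ++i, counter stays in a register
    cmp rax, rbx        # i < n
    jb .L3

With the counter in a register, the loop is back to about 1 iteration per clock; the nop costs a pipeline slot but doesn't add to the loop-carried dependency.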
Take this simple function that increments an integer under a lock implemented by std::mutex:
#include <mutex>
std::mutex m;
void inc(int& i) {
std::unique_lock<std::mutex> lock(m);
i++;
}
I would expect this (after inlining) to compile in a straightforward way to a call of m.lock(), an increment of i, and then a call of m.unlock().
Checking the generated assembly for recent versions of gcc and clang, however, we see an extra complication. Taking the gcc version first:
inc(int&):
mov eax, OFFSET FLAT:__gthrw___pthread_key_create(unsigned int*, void (*)(void*))
test rax, rax
je .L2
push rbx
mov rbx, rdi
mov edi, OFFSET FLAT:m
call __gthrw_pthread_mutex_lock(pthread_mutex_t*)
test eax, eax
jne .L10
add DWORD PTR [rbx], 1
mov edi, OFFSET FLAT:m
pop rbx
jmp __gthrw_pthread_mutex_unlock(pthread_mutex_t*)
.L2:
add DWORD PTR [rdi], 1
ret
.L10:
mov edi, eax
call std::__throw_system_error(int)
It's the first couple of lines that are interesting. The assembled code examines the address of __gthrw___pthread_key_create (which is the implementation for pthread_key_create - a function to create a thread-local storage key), and if it is zero, it branches to .L2 which implements the increment in a single instruction without any locking at all.
If it is non-zero it proceeds as expected: locking the mutex, doing the increment, and unlocking.
Clang does even more: it checks the address of the function twice, once before the lock and once before the unlock:
inc(int&): # #inc(int&)
push rbx
mov rbx, rdi
mov eax, __pthread_key_create
test rax, rax
je .LBB0_4
mov edi, m
call pthread_mutex_lock
test eax, eax
jne .LBB0_6
inc dword ptr [rbx]
mov eax, __pthread_key_create
test rax, rax
je .LBB0_5
mov edi, m
pop rbx
jmp pthread_mutex_unlock # TAILCALL
.LBB0_4:
inc dword ptr [rbx]
.LBB0_5:
pop rbx
ret
.LBB0_6:
mov edi, eax
call std::__throw_system_error(int)
What's the purpose of this check?
Perhaps it is to support the case where the object file is ultimately compiled into a binary without pthreads support, and to fall back to a version without locking in that case? I couldn't find any documentation on this behavior.
Your guess looks to be correct. From the libgcc/gthr-posix.h file in gcc's source repository (https://github.com/gcc-mirror/gcc.git):
/* For a program to be multi-threaded the only thing that it certainly must
be using is pthread_create. However, there may be other libraries that
intercept pthread_create with their own definitions to wrap pthreads
functionality for some purpose. In those cases, pthread_create being
defined might not necessarily mean that libpthread is actually linked
in.
For the GNU C library, we can use a known internal name. This is always
available in the ABI, but no other library would define it. That is
ideal, since any public pthread function might be intercepted just as
pthread_create might be. __pthread_key_create is an "internal"
implementation symbol, but it is part of the public exported ABI. Also,
it's among the symbols that the static libpthread.a always links in
whenever pthread_create is used, so there is no danger of a false
negative result in any statically-linked, multi-threaded program.
For others, we choose pthread_cancel as a function that seems unlikely
to be redefined by an interceptor library. The bionic (Android) C
library does not provide pthread_cancel, so we do use pthread_create
there (and interceptor libraries lose). */
#ifdef __GLIBC__
__gthrw2(__gthrw_(__pthread_key_create),
__pthread_key_create,
pthread_key_create)
# define GTHR_ACTIVE_PROXY __gthrw_(__pthread_key_create)
#elif defined (__BIONIC__)
# define GTHR_ACTIVE_PROXY __gthrw_(pthread_create)
#else
# define GTHR_ACTIVE_PROXY __gthrw_(pthread_cancel)
#endif
static inline int
__gthread_active_p (void)
{
static void *const __gthread_active_ptr
= __extension__ (void *) &GTHR_ACTIVE_PROXY;
return __gthread_active_ptr != 0;
}
Then throughout the remainder of the file, many of the pthread APIs are wrapped inside checks of the __gthread_active_p() function. If __gthread_active_p() returns 0, nothing is done and success is returned.
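The wrappers follow roughly this pattern (paraphrased from gthr-posix.h; see the file itself for the exact definitions):

static inline int
__gthread_mutex_lock (__gthread_mutex_t *mutex)
{
  if (__gthread_active_p ())
    return __gthrw_(pthread_mutex_lock) (mutex);
  else
    return 0;  /* single-threaded: locking is a no-op that reports success */
}

That is exactly the branch you see at the top of the compiled inc(): if the proxy symbol's address is null, the code skips the pthread calls entirely.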
Is a reference compiled as a plain pointer, or is there other machinery behind it?
And how does it differ in clang?
You can think of a reference as an immutable pointer that is automatically dereferenced on use. This isn't what the C++ standard says, so you cannot rely on that being the actual implementation.
Practically speaking, though, it is likely to be what you see in many cases.
Take the following example in the case of parameter passing:
#include <stdio.h>
void function (int *const n){
printf("%d",*n);
}
void function (int & n){
printf("%d",n);
}
int main(){
int n = 123;
function(&n);
function(n);
}
Both gcc and clang produce identical code for the functions without any optimizations enabled:
function(int*):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
function(int&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
How does a reference translate to asm in gcc?
In general: It depends.
To find that out in a specific case, you can test by reading the generated assembly.
Is a reference compiled as a plain pointer, or is there other machinery behind it?
The implementation of code using a reference is practically identical to one using a pointer to achieve the same indirection. How each is implemented is not guaranteed by the standard, but there is no need to implement them differently.
References differ from pointers only in how the rules of C++ allow them to be used. Of course, because the rules are different, pointers can be used in ways that references cannot, and in such cases you cannot compare whether they generate the same assembly.
The limitations of a reference might make some optimizations easier, so there might be a difference, but such an optimization would also have been possible with pointers, so there is no guarantee of different assembly output when using references instead of pointers.
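For example (my illustration, not from the original answer), a pointer can be reseated or left null, while a reference is bound exactly once:

int x = 1, y = 2;

int* p = &x;    // a pointer can be reseated...
p = &y;
p = nullptr;    // ...and may also be null

int& r = x;     // a reference is bound once, at initialization
r = y;          // assigns y's value to x; it does not rebind r

Code in the pointer style above has no direct reference equivalent, so there is nothing to compare the generated assembly against.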
And how does it differ in clang?
In general: It depends.
Both compilers are bound by the same rules of the standard. They might generate identical assembly, or they might not. How the assembly generated by a particular version of one compiler differs (if at all) from that generated by a particular version of another compiler, with particular compilation options, on a particular processor architecture and operating system, in a particular use of a reference, can only be found by inspecting and comparing the generated assembly in each particular case.
I have a question about performance. I think this also applies to other languages (not only C++).
Imagine that I have this function:
int addNumber(int a, int b){
int result = a + b;
return result;
}
Is there any performance improvement if I write the code above like this?
int addNumber(int a, int b){
return a + b;
}
I have this question because the second function doesn't declare a third variable. But would the compiler detect this in the first version?
To answer this question, you can look at the generated assembly. With -O2, x86-64 gcc 6.2 generates exactly the same code for both functions:
addNumber(int, int):
lea eax, [rdi+rsi]
ret
addNumber2(int, int):
lea eax, [rdi+rsi]
ret
Only with optimization turned off is there a difference:
addNumber(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov edx, DWORD PTR [rbp-20]
mov eax, DWORD PTR [rbp-24]
add eax, edx
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
addNumber2(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
However, comparing performance without optimization is meaningless.
In principle there is no difference between the two approaches. The majority of compilers have handled this type of optimisation for some decades.
Additionally, if the function can be inlined (e.g. its definition is visible to the compiler when compiling code that uses such a function) the majority of compilers will eliminate the function altogether, and simply emit code to add the two variables passed and store the result as required by the caller.
Obviously, the comments above assume compiling with a relevant optimisation setting (e.g. not doing a debug build without optimisation).
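As a sketch (a hypothetical caller, not from the question): if the definition of addNumber is visible, an optimizing compiler typically inlines it, so the call vanishes entirely.

// With addNumber's definition visible, this typically compiles to a
// single lea/add with no call instruction at all.
int twiceSum(int a, int b)
{
    return addNumber(a, b) * 2;
}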
Personally, I would not write such a function anyway. It is easier, in the caller, to write c = a + b instead of c = addNumber(a, b), so a function like that offers no benefit to either the programmer (effort to understand) or the program (performance, etc.). You might as well write comments that give no useful information.
c = a + b; // add a and b and store into c
Any self-respecting code reviewer would complain bitterly about uninformative functions or uninformative comments.
I'd only use such a function if its name conveyed some special meaning (i.e. more than just adding two values) for the application:
c = FunkyOperation(a,b);
int FunkyOperation(int a, int b)
{
/* Many useful ways of implementing this operation.
One of those ways happens to be addition, but we need to
go through 25 pages of obscure mathematical proof to
realise that
*/
return a + b;
}