I'm writing (or at least trying to write) some high-performance C++ code. I've come across a part where I need to do a large number of integer comparisons, namely, checking whether a result is equal to zero.
Which is more efficient? That is, which requires fewer processor instructions?
if (i == 0) {
    // do stuff
}
or
if (!i) {
    // do stuff
}
I'm running it on an x86-64 architecture, if that makes any difference.
Let's look at the assembly (with no optimizations) of this code with gcc:
void foo(int& i)
{
    if(!i)
        i++;
}

void bar(int& i)
{
    if(i == 0)
        i++;
}

int main()
{
    int i = 0;
    foo(i);
    bar(i);
}
foo(int&): # #foo(int&)
movq %rdi, -8(%rsp)
movq -8(%rsp), %rdi
cmpl $0, (%rdi)
jne .LBB0_2
movq -8(%rsp), %rax
movl (%rax), %ecx
addl $1, %ecx
movl %ecx, (%rax)
.LBB0_2:
ret
bar(int&): # #bar(int&)
movq %rdi, -8(%rsp)
movq -8(%rsp), %rdi
cmpl $0, (%rdi)
jne .LBB1_2
movq -8(%rsp), %rax
movl (%rax), %ecx
addl $1, %ecx
movl %ecx, (%rax)
.LBB1_2:
ret
main: # #main
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq -8(%rbp), %rdi
movl $0, -4(%rbp)
movl $0, -8(%rbp)
callq foo(int&)
leaq -8(%rbp), %rdi
callq bar(int&)
movl -4(%rbp), %eax
addq $16, %rsp
popq %rbp
ret
Bottom line:
The generated assembly is identical (even without optimizations enabled), so it doesn't matter: choose the clearer, more readable syntax, which is probably if (i == 0) in your case.
In C++ you almost never need to care about such micro-optimizations; compilers and optimizers are very good at this game, so trust them. If you don't, and you have a performance bottleneck, profile and look at the assembly for your particular platform.
Note:
You can use godbolt.org to generate such assembly; it is a very handy tool.
You can also use the -S option on gcc to produce the assembly (other compilers have similar options).
Unless you have an insane compiler, the two should compile identically. That said, for the sanity of future people reading your code, use i == 0 only when i is a numeric type and !i only when i is a bool type.
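A minimal illustration of that convention (a sketch; the variable names are mine):

void example()
{
    int count = 0;      // numeric type: compare explicitly against zero
    bool done = false;  // bool type: logical negation reads naturally

    if (count == 0) { /* handle the zero case */ }
    if (!done)      { /* keep working */ }
}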
None of the better-known compilers will compile these two forms to anything that differs significantly enough to matter once you do what everyone must do before applying manual optimizations: measure.
Related
I'm trying to reproduce the example code from cppreference in order to better understand the as-if rule of C++:
int& preinc(int& n) { return ++n; }
int add(int n, int m) { return n + m; }

// volatile input to prevent constant folding
volatile int input = 7;

// volatile output to make the result a visible side-effect
volatile int result;

int main()
{
    int n = input;
    // using built-in operators would invoke undefined behavior
    // int m = ++n + ++n;
    // but using functions makes sure the code executes as-if
    // the functions were not overlapped
    int m = add(preinc(n), preinc(n));
    result = m;
}
I use g++ -s main.cpp to get the assembler output from the source; the main() function of the output file main.s is shown below:
main:
.LFB2:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $24, %rsp
.cfi_offset 3, -24
movq %fs:40, %rax
movq %rax, -24(%rbp)
xorl %eax, %eax
movl input(%rip), %eax
movl %eax, -32(%rbp)
leaq -32(%rbp), %rax
movq %rax, %rdi
call _Z6preincRi
movl (%rax), %ebx
leaq -32(%rbp), %rax
movq %rax, %rdi
call _Z6preincRi
movl (%rax), %eax
movl %ebx, %esi
movl %eax, %edi
call _Z3addii
movl %eax, -28(%rbp)
movl -28(%rbp), %eax
movl %eax, result(%rip)
movl $0, %eax
movq -24(%rbp), %rdx
xorq %fs:40, %rdx
je .L7
call __stack_chk_fail
Judging from the output file, I think the g++ compiler translated the source code statement by statement without any optimization, even when I added the -O3 compile option.
But the output is supposed to look like this:
# full code of the main() function as produced by the GCC compiler
# x86 (Intel) platform:
movl input(%rip), %eax # eax = input
leal 3(%rax,%rax), %eax # eax = 3 + eax + eax
movl %eax, result(%rip) # result = eax
xorl %eax, %eax # eax = 0 (the return value of main())
ret
I want to know how to get the assembler output shown above.
Something went wrong when I tested the example code. Here is the answer I put together, combining some of my own thoughts with the comments from others above.
Compilers will not optimize the code unless the -O2 or -O3 compilation option is added, just as #Balázs Kovacsics and #molbdnilo said in the comments. Using the command g++ -S main.cpp produces the statement-by-statement assembler output shown in the question.
Once the -O2 or -O3 option is added, the programmer allows the compiler to perform any code transformations that do not change the observable behavior of the program. The main() function of the output file main.s, produced with g++ -S -O3 main.cpp, then looks like this:
main:
.LFB2:
.cfi_startproc
movl input(%rip), %eax
leal 3(%rax,%rax), %eax
movl %eax, result(%rip)
xorl %eax, %eax
ret
.cfi_endproc
Be careful that the option must be written in upper case: -S tells the compiler to stop after generating assembly, while lower-case -s is an unrelated option that strips symbols.
Here is the Compiler Explorer website (godbolt.org) that #JulianH gave, which is really convenient for viewing assembler output across different platforms and compilers.
Getting the assembler output helped me understand the as-if rule better. I hope what I wrote helps anyone who is also confused by the abstract description on cppreference.
I was fooling around and found that the following
#include <stdio.h>
void f(int& x){
    x += 1;
}

int main(){
    int a = 12;
    f(a);
    printf("%d\n", a);
}
when translated by g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4 via g++ main.cpp -S, produces this assembly (showing only the relevant parts):
_Z1fRi:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movl (%rax), %eax
leal 1(%rax), %edx
movq -8(%rbp), %rax
movl %edx, (%rax)
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $12, -4(%rbp)
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Z1fRi
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
Question: Why would the compiler choose to use leal instead of incq? Or am I missing something?
You compiled without optimization. GCC does not make any effort to select particularly well-fitting instructions when building in "debug" mode; it just focuses on generating the code as quickly as possible (and with an eye to making debugging easier—e.g., the ability to set breakpoints on source code lines).
When I enable optimizations by passing the -O2 switch, I get:
_Z1fRi:
addl $1, (%rdi)
ret
With generic tuning, addl is preferred because inc and dec update only part of the flags register, which creates a false flags dependency on some Intel processors (notably the Pentium 4, and possibly also Knights Landing).
With -march=k8, incl is used instead.
There is sometimes a use-case for leal in optimized code, though, and that is when you want to increment a register's value and store the result in a different register. Using leal in this way would allow you to preserve the register's original value, without needing an additional movl instruction. Another advantage of leal over incl/addl is that leal doesn't affect the flags, which can be useful in instruction scheduling.
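As a hedged illustration of that last point, take a function like the following (my example, not from the question). gcc -O2 on x86-64 typically compiles it to leal 1(%rdi), %eax followed by ret, computing x + 1 directly into the return register without modifying %rdi and without writing the flags:

int successor(int x)
{
    return x + 1;  // expected: leal 1(%rdi), %eax ; ret
}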
While reading the sources of the GNU C++ standard library, I found some code for copying (or moving, if possible) a range of iterators (file stl_algobase.h) that uses template specialization for some optimizations. The corresponding comment says:
All of these auxiliary structs serve two purposes. (1) Replace calls to copy with memmove whenever possible. (Memmove, not memcpy, because the input and output ranges are permitted to overlap.) (2) If we're using random access iterators, then write the loop as a for loop with an explicit count.
The specialization using the second optimization looks like this:
template<>
struct __copy_move<false, false, random_access_iterator_tag>
{
    template<typename _II, typename _OI>
    static _OI
    __copy_m(_II __first, _II __last, _OI __result)
    {
        typedef typename iterator_traits<_II>::difference_type _Distance;
        for(_Distance __n = __last - __first; __n > 0; --__n)
        {
            *__result = *__first;
            ++__first;
            ++__result;
        }
        return __result;
    }
};
So, I have two questions concerning this:
How can memmove increase the speed of copying? Is it implemented somehow more efficiently than a simple loop?
How can using an explicit count in the for loop affect the performance?
Some clarification: I would like to see some optimization examples actually used by compilers, not elaboration on the possibility of those.
Edit: the first question is quite nicely answered here.
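For a flavor of how the first optimization works, here is a minimal sketch of that kind of dispatch (simplified and hypothetical; the real libstdc++ code selects a specialization through the template machinery quoted above):

#include <cstddef>
#include <cstring>
#include <type_traits>

// For trivially copyable element types, copying n elements collapses into a
// single memmove call (memmove because the ranges may overlap); otherwise we
// fall back to an element-by-element loop.
template <typename T>
T* copy_range(const T* first, const T* last, T* result)
{
    const std::size_t n = static_cast<std::size_t>(last - first);
    if constexpr (std::is_trivially_copyable_v<T>) {
        if (n != 0)
            std::memmove(result, first, n * sizeof(T));
    } else {
        for (std::size_t i = 0; i != n; ++i)
            result[i] = first[i];
    }
    return result + n;
}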
Answering the second question, the explicit count does indeed lead to more opportunities for loop unrolling, though even with pointers iterating through a fixed size array, gcc does not perform aggressive unrolling unless asked to do so with -funroll-loops. The other gain comes from a potentially simpler end-of-loop comparison test for non-trivial iterators.
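For reference, here is a hedged reconstruction of the two loop shapes being compared (the original benchmark source was not shown; the bodies below are my guess based on the signatures visible in the assembly):

// "while" version: the end-of-loop test is a pointer comparison.
void array_copy_while(int (&src)[2048], int (&dst)[2048])
{
    const int* first = src;
    const int* last = src + 2048;
    int* out = dst;
    while (first != last)
        *out++ = *first++;
}

// "count" version: an explicit count, as in __copy_m above.
void array_copy_count(int (&src)[2048], int (&dst)[2048])
{
    for (int n = 0; n < 2048; ++n)
        dst[n] = src[n];
}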
On a Core i7-4770, I benchmarked the time spent copying a maximally-aligned 2048-long integer array with while-loop and explicit-count copy implementations. (Times are in microseconds and include call overhead; minimum of 200 samples of a timing loop with warm-up.)
                                      while   count
gcc -O3                               0.179   0.178
gcc -O3 -march=native                 0.097   0.095
gcc -O3 -march=native -funroll-loops  0.066   0.066
In each case, the generated code is very similar; the while version does a bit more work at the end in each case, handling checks that there aren't any entries left to copy that didn't fill out a whole 128-bit (SSE) or 256-bit (AVX) register, but these are pretty much taken care of by the branch predictor. The gcc -O3 assembly for each is as follows (leaving out assembler directives). while version:
array_copy_while(int (&) [2048], int (&) [2048]):
leaq 8192(%rdi), %rax
leaq 4(%rdi), %rdx
movq %rax, %rcx
subq %rdx, %rcx
movq %rcx, %rdx
shrq $2, %rdx
leaq 1(%rdx), %r8
cmpq $8, %r8
jbe .L11
leaq 16(%rsi), %rdx
cmpq %rdx, %rdi
leaq 16(%rdi), %rdx
setae %cl
cmpq %rdx, %rsi
setae %dl
orb %dl, %cl
je .L11
movq %r8, %r9
xorl %edx, %edx
xorl %ecx, %ecx
shrq $2, %r9
leaq 0(,%r9,4), %r10
.L9:
movdqa (%rdi,%rdx), %xmm0
addq $1, %rcx
movdqa %xmm0, (%rsi,%rdx)
addq $16, %rdx
cmpq %rcx, %r9
ja .L9
leaq 0(,%r10,4), %rdx
addq %rdx, %rdi
addq %rdx, %rsi
cmpq %r10, %r8
je .L1
movl (%rdi), %edx
movl %edx, (%rsi)
leaq 4(%rdi), %rdx
cmpq %rdx, %rax
je .L1
movl 4(%rdi), %edx
movl %edx, 4(%rsi)
leaq 8(%rdi), %rdx
cmpq %rdx, %rax
je .L20
movl 8(%rdi), %eax
movl %eax, 8(%rsi)
ret
.L11:
movl (%rdi), %edx
addq $4, %rdi
addq $4, %rsi
movl %edx, -4(%rsi)
cmpq %rdi, %rax
jne .L11
.L1:
rep ret
.L20:
rep ret
count version:
array_copy_count(int (&) [2048], int (&) [2048]):
leaq 16(%rsi), %rax
movl $2048, %ecx
cmpq %rax, %rdi
leaq 16(%rdi), %rax
setae %dl
cmpq %rax, %rsi
setae %al
orb %al, %dl
je .L23
movw $512, %cx
xorl %eax, %eax
xorl %edx, %edx
.L29:
movdqa (%rdi,%rax), %xmm0
addq $1, %rdx
movdqa %xmm0, (%rsi,%rax)
addq $16, %rax
cmpq %rdx, %rcx
ja .L29
rep ret
.L23:
xorl %eax, %eax
.L31:
movl (%rdi,%rax,4), %edx
movl %edx, (%rsi,%rax,4)
addq $1, %rax
cmpq %rax, %rcx
jne .L31
rep ret
When the iterators are more complicated however, the difference becomes more pronounced. Consider a hypothetical container that stores values in a series of fixed-sized allocated buffers. An iterator comprises a pointer to the chain of blocks, a block index and a block offset. Comparison of two iterators requires potentially two comparisons. Incrementing the iterator requires checking if we pop over a block boundary.
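A minimal sketch of what such an iterator might look like (hypothetical types and names, assuming 512-int blocks as in the benchmark below):

#include <cstddef>

// Hypothetical segmented-container iterator: a pointer to the chain of
// blocks plus a block index and an offset within the current block.
struct seg_iterator
{
    static constexpr std::size_t block_size = 512;

    int**       blocks; // chain of fixed-size allocated buffers
    std::size_t block;  // which block we are in
    std::size_t offset; // position within that block

    int& operator*() const { return blocks[block][offset]; }

    seg_iterator& operator++()
    {
        if (++offset == block_size) { // check if we pop over a block boundary
            ++block;
            offset = 0;
        }
        return *this;
    }

    // comparing two iterators potentially requires two comparisons
    bool operator==(const seg_iterator& o) const
    { return block == o.block && offset == o.offset; }
    bool operator!=(const seg_iterator& o) const
    { return !(*this == o); }
};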
I made such a container, and performed the same benchmark for copying a 2000-long container of int, with a block size of 512 ints.
                                      while   count
gcc -O3                               1.560   2.818
gcc -O3 -march=native                 1.660   2.854
gcc -O3 -march=native -funroll-loops  1.432   2.858
That looks weird! Oh wait, it's because gcc 4.8 has a misoptimisation, where it uses conditional moves instead of nice, branch-predictor friendly comparisons. (gcc bug 56309).
Let's try icc on a different machine (Xeon E5-2670).
                while   count
icc -O3         3.952   3.704
icc -O3 -xHost  3.898   3.624
This is closer to what we'd expect: a small but significant improvement from the simpler loop condition. On a different architecture the gain is more pronounced; here is clang targeting a PowerPC A2 at 1.6 GHz:
             while    count
bgclang -O3  36.528   31.623
I'll omit the assembly, as it's quite long!
Should one use dynamic memory allocation when one knows that a variable will not be needed before it goes out of scope?
For example in the following function:
void func(){
    int i = 56;
    // do something with i; i is not needed past this point
    for(int t = 0; t < 1000000; t++){
        // code
    }
}
Say one only needs i for a small section of the function; is it worthwhile deleting i, given that it is not needed during the very long for loop?
As Borgleader said:
A) This is micro (and most probably premature) optimization, meaning don't worry about it. B) In this particular case, dynamically allocating i might even hurt performance. tl;dr: profile first, optimize later.
As an example, I compiled the following two programs into assembly (using the g++ -S flag with no optimisation enabled).
Creating i on the stack:
int main(void)
{
    int i = 56;
    i += 5;
    for(int t = 0; t < 1000; t++) {}
    return 0;
}
Dynamically:
int main(void)
{
    int* i = new int(56);
    *i += 5;
    delete i;
    for(int t = 0; t < 1000; t++) {}
    return 0;
}
The first program compiled to:
movl $56, -8(%rbp) # Store 56 on stack (int i = 56)
addl $5, -8(%rbp) # Add 5 to i (i += 5)
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
And the second:
subq $16, %rsp # Allocate memory (new)
movl $4, %edi
call _Znwm
movl $56, (%rax) # Store 56 in *i
movq %rax, -16(%rbp)
movq -16(%rbp), %rax # Add 5
movl (%rax), %eax
leal 5(%rax), %edx
movq -16(%rbp), %rax
movl %edx, (%rax)
movq -16(%rbp), %rax # Free memory (delete)
movq %rax, %rdi
call _ZdlPv
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
In the assembly output above, you can see straight away that there is a significant difference in the number of instructions being executed. If I compile the same programs with optimisation turned on, the first program produces:
xorl %eax, %eax # Equivalent to return 0;
The second produces:
movl $4, %edi
call _Znwm
movl $61, (%rax) # A smart compiler knows 56+5 = 61
movq %rax, %rdi
call _ZdlPv
xorl %eax, %eax
addq $8, %rsp
With optimisation on, the compiler becomes a pretty powerful tool for improving your code; in certain cases it can even detect that a program only returns 0 and strip out all the unnecessary code. When you use dynamic memory as in the code above, however, the program still has to request and then free that memory; the compiler cannot optimise it away.
Inside a large loop, I currently have a statement similar to
if (ptr == NULL || ptr->calculate() > 5)
{do something}
where ptr is an object pointer set before the loop and never changed.
I would like to avoid comparing ptr to NULL in every iteration of the loop. (The current final program does that, right?) A simple solution would be to write the loop code once for (ptr == NULL) and once for (ptr != NULL). But this would increase the amount of code making it more difficult to maintain, plus it looks silly if the same large loop appears twice with only one or two lines changed.
What can I do? Use dynamically-valued constants maybe and hope the compiler is smart? How?
Many thanks!
EDIT by Luther Blissett. The OP wants to know if there is a better way to remove the pointer check here:
loop {
A;
if (ptr==0 || ptr->calculate()>5) B;
C;
}
than duplicating the loop as shown here:
if (ptr==0)
loop {
A;
B;
C;
}
else loop {
A;
if (ptr->calculate()>5) B;
C;
}
I just wanted to inform you that GCC can apparently do the requested hoisting in the optimizer (the transformation is known as loop unswitching). Here's a model loop (in C):
struct C
{
    int (*calculate)();
};

void sideeffect1();
void sideeffect2();
void sideeffect3();

void foo(struct C *ptr)
{
    int i;
    for (i = 0; i < 1000; i++)
    {
        sideeffect1();
        if (ptr == 0 || ptr->calculate() > 5) sideeffect2();
        sideeffect3();
    }
}
Compiling this with gcc 4.5 and -O3 gives:
.globl foo
.type foo, #function
foo:
.LFB0:
pushq %rbp
.LCFI0:
movq %rdi, %rbp
pushq %rbx
.LCFI1:
subq $8, %rsp
.LCFI2:
testq %rdi, %rdi # ptr==0? -> .L2, see below
je .L2
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L4:
xorl %eax, %eax
call sideeffect1 # sideeffect1
xorl %eax, %eax
call *0(%rbp) # call p->calculate, no check for ptr==0
cmpl $5, %eax
jle .L3
xorl %eax, %eax
call sideeffect2 # ok, call sideeffect2
.L3:
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L4
addq $8, %rsp
.LCFI3:
xorl %eax, %eax
popq %rbx
.LCFI4:
popq %rbp
.LCFI5:
ret
.L2: # here's the loop with ptr==0
.LCFI6:
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L6:
xorl %eax, %eax
call sideeffect1 # does not try to call ptr->calculate() anymore
xorl %eax, %eax
call sideeffect2
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L6
addq $8, %rsp
.LCFI7:
xorl %eax, %eax
popq %rbx
.LCFI8:
popq %rbp
.LCFI9:
ret
And so does clang 2.7 (-O3):
foo:
.Leh_func_begin1:
pushq %rbp
.Llabel1:
movq %rsp, %rbp
.Llabel2:
pushq %r14
pushq %rbx
.Llabel3:
testq %rdi, %rdi # ptr==NULL -> .LBB1_5
je .LBB1_5
movq %rdi, %rbx
movl $1000, %r14d
.align 16, 0x90
.LBB1_2:
xorb %al, %al # here's the loop with the ptr->calculate check()
callq sideeffect1
xorb %al, %al
callq *(%rbx)
cmpl $6, %eax
jl .LBB1_4
xorb %al, %al
callq sideeffect2
.LBB1_4:
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_2
jmp .LBB1_7
.LBB1_5:
movl $1000, %r14d
.align 16, 0x90
.LBB1_6:
xorb %al, %al # and here's the loop for the ptr==NULL case
callq sideeffect1
xorb %al, %al
callq sideeffect2
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_6
.LBB1_7:
popq %rbx
popq %r14
popq %rbp
ret
In C++, although it is completely overkill, you can put the loop in a function and make the null check a template parameter. This generates the body of the function twice, but eliminates the runtime check: in each instantiation the condition is a compile-time constant and is optimized out. While I certainly don't recommend it, here is the code:
template<bool ptr_is_null>
void loop() {
    // x, y and ptr are assumed to be visible in the enclosing scope
    for(int i = x; i != y; ++i) {
        /**/
        if(ptr_is_null || ptr->calculate() > 5) {
            /**/
        }
        /**/
    }
}
You call it with:
if (ptr==NULL) loop<true>(); else loop<false>();
You are better off without this "optimization", the compiler will probably do the RightThing(TM) for you.
Why do you want to avoid comparing to NULL?
Creating a variant for each of the NULL and non-NULL cases just gives you almost twice as much code to write, test, and, more importantly, maintain.
A 'large loop' smells like an opportunity to refactor the loop into separate functions, in order to make the code easier to maintain. Then you can easily have two variants of the loop, one for ptr == null and one for ptr != null, calling different functions, with just a rough similarity in the overall structure of the loop.
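A minimal sketch of that refactoring (placeholder function names, following the A/B/C structure from the question):

struct C { int (*calculate)(); };

void do_a() { /* A */ }
void do_b() { /* B */ }
void do_c() { /* C */ }

void loop_null()
{
    for (int i = 0; i < 1000; ++i) { do_a(); do_b(); do_c(); }
}

void loop_nonnull(C* ptr)
{
    for (int i = 0; i < 1000; ++i) {
        do_a();
        if (ptr->calculate() > 5) do_b();
        do_c();
    }
}

void run(C* ptr)
{
    if (ptr == 0) loop_null(); else loop_nonnull(ptr);
}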
Since "ptr is an object pointer set before the loop and never changed", can't you just check whether it is null once before the loop and not check it again, since you never change it?
If it is not valid for your pointer to be NULL, you could use a reference instead.
If it is valid for your pointer to be NULL, but if so then you skip all processing, then you could either wrap your code with one check at the beginning, or return early from your function:
if (ptr != NULL)
{
    // your function
}
or
if (ptr == NULL) { return; }
If it is valid for your pointer to be NULL, but only some processing is skipped, then keep it like it is.
if (ptr == NULL || ptr->calculate() > 5)
{do something}
I would simply think in terms of what is done if the condition is true.
If "do something" is really the exact same stuff for (ptr == NULL) or (ptr->calculate() > 5), then I hardly see a reason to split up anything.
If "do something" contains particular cases for either condition, then I would consider to refactor into separate loops to get rid of extra special case checking. Depends on the special cases involved.
Eliminating code duplication is good up to a point. You should not care too much about optimizing until your program does what it should do and until performance becomes a problem.
[...] Premature optimization is the root of all evil
http://en.wikipedia.org/wiki/Program_optimization