Which of the two code snippets below will be more efficient when compiled with gcc as C/C++: the first function or the second?
// First Function
if ( A && B && C ) {
UpdateData();
} else if ( A && B ){
ResetData();
}
//Second Function
if ( A && B) {
if (C) {
UpdateData();
} else {
ResetData();
}
}
Do we get any performance improvement from the second function?
If the first function is used, can the compiler optimize it into the second form on its own?
A large portion of this question will depend on what A, B and C really are (and the compiler will optimise it, as shown below). Simple types, definitely not worth worrying about. If they are some kind of "big number math" objects, or some complicated data type that needs 1000 instructions for each "is this true or not", then there will be a big difference if the compiler decides to make different code.
As always when it comes to performance: measure in your own code, use profiling to detect where the code spends MOST of its time, and then measure again with changes to that code. Repeat until it runs fast enough [whatever that is] and/or your manager tells you to stop fiddling with the code. Typically, however, unless it's REALLY a high-traffic area of the code, re-arranging the conditions in an if-statement will make little difference; it is the overall algorithm that makes the most impact in the general case.
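For the "measure in your own code" step, a minimal timing sketch could look like the one below. The helper name averageNanoseconds is made up for illustration (it is not from the original post), and a real profiler will tell you far more than a stopwatch like this.

```cpp
#include <chrono>

// Hypothetical helper (not from the original post): runs `work` `repeats`
// times and returns the average wall-clock time per run in nanoseconds.
template <typename F>
double averageNanoseconds(F work, int repeats)
{
    using clock = std::chrono::steady_clock;   // monotonic, suited to intervals
    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i)
        work();
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

You would pass it a lambda that calls the code under test, and make sure the result of that code is actually used (for example written to a volatile variable) so the optimizer cannot drop it.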
If we assume A, B and C are simple types, such as int, we can write some code to investigate:
extern int A, B, C;
extern void UpdateData();
extern void ResetData();
void func1()
{
if ( A && B && C ) {
UpdateData();
} else if ( A && B ){
ResetData();
}
}
void func2()
{
if ( A && B) {
if (C) {
UpdateData();
} else {
ResetData();
}
}
}
Given this, gcc 4.8.2 with -O1 produces the following code:
_Z5func1v:
cmpl $0, A(%rip)
je .L6
cmpl $0, B(%rip)
je .L6
subq $8, %rsp
cmpl $0, C(%rip)
je .L3
call _Z10UpdateDatav
jmp .L1
.L3:
call _Z9ResetDatav
.L1:
addq $8, %rsp
.L6:
rep ret
_Z5func2v:
.LFB1:
cmpl $0, A(%rip)
je .L12
cmpl $0, B(%rip)
je .L12
subq $8, %rsp
cmpl $0, C(%rip)
je .L9
call _Z10UpdateDatav
jmp .L7
.L9:
call _Z9ResetDatav
.L7:
addq $8, %rsp
.L12:
rep ret
In other words: No difference at all
Using clang++ 3.7 (as of about 3 weeks ago) with -O1 gives this:
_Z5func1v: # @_Z5func1v
cmpl $0, A(%rip)
setne %cl
cmpl $0, B(%rip)
setne %al
andb %cl, %al
movzbl %al, %ecx
cmpl $1, %ecx
jne .LBB0_2
movl C(%rip), %ecx
testl %ecx, %ecx
je .LBB0_2
jmp _Z10UpdateDatav # TAILCALL
.LBB0_2: # %if.else
testb %al, %al
je .LBB0_3
jmp _Z9ResetDatav # TAILCALL
.LBB0_3: # %if.end8
retq
_Z5func2v: # @_Z5func2v
cmpl $0, A(%rip)
je .LBB1_4
movl B(%rip), %eax
testl %eax, %eax
je .LBB1_4
cmpl $0, C(%rip)
je .LBB1_3
jmp _Z10UpdateDatav # TAILCALL
.LBB1_4: # %if.end4
retq
.LBB1_3: # %if.else
jmp _Z9ResetDatav # TAILCALL
.Ltmp1:
The chaining of the and operations in clang's func1 MAY be of benefit, but it's probably such a small difference that you should concentrate on what makes more sense from a logical perspective of the code.
In summary: Not worth it
Higher optimisation in g++ makes it do the same tailcall optimisation that clang does, otherwise no difference.
However, if we make A, B and C into external functions, which the compiler can't "understand", then we get a difference:
_Z5func1v: # @_Z5func1v
pushq %rax
.Ltmp0:
.cfi_def_cfa_offset 16
callq _Z1Av
testl %eax, %eax
je .LBB0_3
callq _Z1Bv
testl %eax, %eax
je .LBB0_3
callq _Z1Cv
testl %eax, %eax
je .LBB0_3
popq %rax
jmp _Z10UpdateDatav # TAILCALL
.LBB0_3: # %if.else
callq _Z1Av
testl %eax, %eax
je .LBB0_5
callq _Z1Bv
testl %eax, %eax
je .LBB0_5
popq %rax
jmp _Z9ResetDatav # TAILCALL
.LBB0_5: # %if.end12
popq %rax
retq
_Z5func2v: # @_Z5func2v
pushq %rax
.Ltmp2:
.cfi_def_cfa_offset 16
callq _Z1Av
testl %eax, %eax
je .LBB1_4
callq _Z1Bv
testl %eax, %eax
je .LBB1_4
callq _Z1Cv
testl %eax, %eax
je .LBB1_3
popq %rax
jmp _Z10UpdateDatav # TAILCALL
.LBB1_4: # %if.end6
popq %rax
retq
.LBB1_3: # %if.else
popq %rax
jmp _Z9ResetDatav # TAILCALL
Here we DO see the difference between func1 and func2: func1 will call A and B twice, since the compiler can't assume that calling those functions ONCE will do the same thing as calling them twice. [Consider that the functions A and B may be reading data from a file, calling rand, or whatever; the result of NOT calling the function a second time may be that the program behaves differently.]
(In this case I only posted clang code, but g++ produces code that has the same outcome, but slightly different ordering of the different lumps of code)
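To make the double evaluation visible without reading assembly, here is a small sketch in which A, B and C are stand-in functions that count their own calls. These bodies are assumptions for illustration only; the original post does not show what A, B and C do.

```cpp
// Hypothetical stand-ins for A, B and C (the original post does not show
// them): each call bumps a counter so the extra evaluations are visible.
int callsA = 0, callsB = 0, callsC = 0;
int updates = 0, resets = 0;

bool A() { ++callsA; return true; }
bool B() { ++callsB; return true; }
bool C() { ++callsC; return false; }   // false forces the "else" path

void UpdateData() { ++updates; }
void ResetData()  { ++resets; }

void resetCounters() { callsA = callsB = callsC = updates = resets = 0; }

void func1()
{
    if (A() && B() && C()) {
        UpdateData();
    } else if (A() && B()) {   // A and B are evaluated a second time here
        ResetData();
    }
}

void func2()
{
    if (A() && B()) {          // A and B are evaluated exactly once
        if (C()) {
            UpdateData();
        } else {
            ResetData();
        }
    }
}
```

With C returning false, func1 ends up calling A and B twice each before it reaches ResetData, while func2 calls each of them once. If A or B had side effects beyond a counter, the two forms would not even be equivalent.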
Related
I've been told many times that recursion is slow due to function calls, but in this code, it seems much faster than the iterative solution. At best, I typically expect a compiler to optimize recursion into iteration (which looking at the assembly, did seem to happen).
#include <iostream>
bool isDivisable(int x, int y)
{
for (int i = y; i != 1; --i)
if (x % i != 0)
return false;
return true;
}
bool isDivisableRec(int x, int y)
{
if (y == 1)
return true;
return x % y == 0 && isDivisableRec(x, y-1);
}
int findSmallest()
{
int x = 20;
for (; !isDivisable(x,20); ++x);
return x;
}
int main()
{
std::cout << findSmallest() << std::endl;
}
Assembly here: https://gist.github.com/PatrickAupperle/2b56e16e9e5a6a9b251e
I'd love to know what is going on here. I'm sure it is some tricky compiler optimization that I can be amazed to learn about.
Edit: I just realized I forgot to mention that if I use the recursive version, it runs in about .25 seconds, the iterative, about .6.
Edit 2: I am compiling with -O3 using
$ g++ --version
g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
Though, I'm not really sure why that matters.
Edit 3:
Better benchmarking:
Source: http://gist.github.com/PatrickAupperle/ee8241ac51417437d012
Output: http://gist.github.com/PatrickAupperle/5870136a5552b83fd0f1
Running with 100 iterations shows very similar results
Edit 4:
At Roman's suggestion, I added -fno-inline-functions -fno-inline-small-functions to the compilation flags. The effect is extremely bizarre to me. The code runs about 15x faster, but the ratio between the recursive version and the iterative version remains similar.
https://gist.github.com/PatrickAupperle/3a87eb53a9f11c1f0bec
Using this code I also see large timing difference (in favor of the recursive version) with GCC 4.9.3 in Cygwin. I get
13.411 seconds for iterative
4.29101 seconds for recursive
Looking at the assembly code it generated with -O3, I see two things
The compiler replaced tail recursion in isDivisableRec with a cycle and then unrolled the cycle: each iteration of the cycle in the machine code covers two levels of the original recursion.
_Z14isDivisableRecii:
.LFB1467:
.seh_endprologue
movl %edx, %r8d
.L15:
cmpl $1, %r8d
je .L18
movl %ecx, %eax ; First unrolled divisibility check
cltd
idivl %r8d
testl %edx, %edx
je .L20
.L19:
xorl %eax, %eax
ret
.p2align 4,,10
.L20:
leal -1(%r8), %r9d
cmpl $1, %r9d
jne .L21
.p2align 4,,10
.L18:
movl $1, %eax
ret
.p2align 4,,10
.L21:
movl %ecx, %eax ; Second unrolled divisibility check
cltd
idivl %r9d
testl %edx, %edx
jne .L19
subl $2, %r8d
jmp .L15
.seh_endproc
The compiler inlined several iterations of isDivisableRec by lifting them into findSmallestRec. Since the value of y parameter of isDivisableRec is hardcoded as 20 the compiler managed to replace the iterations for 20, 19...15 with some "magical" code inlined directly into findSmallestRec. The actual call to isDivisableRec happens only for y parameter value of 14 (if it happens at all).
Here's the inlined code in findSmallestRec
movl $20, %ebx
movl $1717986919, %esi ; Magic constants
movl $1808407283, %edi ; for divisibility tests
movl $954437177, %ebp ;
movl $2021161081, %r12d ;
movl $-2004318071, %r13d ;
jmp .L28
.p2align 4,,10
.L29: ; The main cycle
addl $1, %ebx
.L28:
movl %ebx, %eax ; Divisibility by 20 test
movl %ebx, %ecx
imull %esi
sarl $31, %ecx
sarl $3, %edx
subl %ecx, %edx
leal (%rdx,%rdx,4), %eax
sall $2, %eax
cmpl %eax, %ebx
jne .L29
movl %ebx, %eax ; Divisibility by 19 test
imull %edi
sarl $3, %edx
subl %ecx, %edx
leal (%rdx,%rdx,8), %eax
leal (%rdx,%rax,2), %eax
cmpl %eax, %ebx
jne .L29
movl %ebx, %eax ; Divisibility by 18 test
imull %ebp
sarl $2, %edx
subl %ecx, %edx
leal (%rdx,%rdx,8), %eax
addl %eax, %eax
cmpl %eax, %ebx
jne .L29
movl %ebx, %eax ; Divisibility by 17 test
imull %r12d
sarl $3, %edx
subl %ecx, %edx
movl %edx, %eax
sall $4, %eax
addl %eax, %edx
cmpl %edx, %ebx
jne .L29
testb $15, %bl ; Divisibility by 16 test
jne .L29
movl %ebx, %eax ; Divisibility by 15 test
imull %r13d
leal (%rdx,%rbx), %eax
sarl $3, %eax
subl %ecx, %eax
movl %eax, %edx
sall $4, %edx
subl %eax, %edx
cmpl %edx, %ebx
jne .L29
movl $14, %edx
movl %ebx, %ecx
call _Z14isDivisableRecii ; call isDivisableRecii(x, 14)
...
The above blocks of machine instructions before each jne .L29 jump are divisibility tests for 20, 19...15 lifted directly into findSmallestRec. Apparently, they are more efficient than the tests used inside isDivisableRec for a run-time value of y. As you can see, the divisibility by 16 test is implemented simply as testb $15, %bl. Because of this, non-divisibility of x by high values of y is caught early by the above highly optimized code.
None of this happens for isDivisable and findSmallest - they are basically translated literally. Even the cycle is not unrolled.
I believe it is the second optimization that makes for the most of the difference. The compiler used highly optimized methods of checking divisibility for higher y values, which happen to be known at compile time.
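For the curious, the "divisibility by 20" and "divisibility by 16" blocks in the listing can be written back out in C++ roughly as follows. This is my reading of the assembly, not code from the compiler or the original post; the function names are made up, and the sketch assumes that right-shifting a negative signed integer is an arithmetic shift (true on mainstream compilers, guaranteed since C++20).

```cpp
#include <cstdint>

// Divisibility by 16: 16 is a power of two, so only the low four bits
// matter. This corresponds to `testb $15, %bl` in the listing.
bool divisibleBy16(std::int32_t x)
{
    return (x & 15) == 0;
}

// Divisibility by 20 without a divide instruction, mirroring the
// imull / sarl / subl / leal / sall / cmpl sequence in the listing.
bool divisibleBy20(std::int32_t x)
{
    const std::int64_t magic = 1717986919;   // ceil(2^35 / 20)
    const std::int64_t product = static_cast<std::int64_t>(x) * magic;
    // imull leaves the high 32 bits in %edx; sarl $3 on that half is the
    // same as shifting the full 64-bit product right by 35.
    std::int32_t q = static_cast<std::int32_t>(product >> 35);
    q -= x >> 31;            // x >> 31 is -1 for negative x: rounds q toward zero
    return q * 20 == x;      // q is x / 20; x is divisible iff q * 20 gives x back
}
```

A multiply and a few shifts are much cheaper than idivl on most x86 CPUs, which is why knowing the divisor at compile time matters so much here.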
If you replace the second argument of isDivisableRec with an "unpredictable" run-time value of 20 (instead of hard-coded compile-time constant 20), it should disable this optimization and bring the timings in line. I just tried this and ended up with
12.9 seconds for iterative
13.26 seconds for recursive
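One way to make the 20 "unpredictable" is to read it from a volatile variable, as sketched below. This is an assumption about how to reproduce the experiment, not necessarily how the timings above were obtained; runtimeLimit is a made-up name, and findSmallestRec is written here to match the name used in the assembly discussion.

```cpp
bool isDivisableRec(int x, int y)
{
    if (y == 1)
        return true;
    return x % y == 0 && isDivisableRec(x, y - 1);
}

// Reading a volatile object must happen at run time, so the compiler can
// no longer treat the limit as the compile-time constant 20.
volatile int runtimeLimit = 20;

int findSmallestRec()
{
    const int limit = runtimeLimit;   // single run-time load
    int x = 20;
    for (; !isDivisableRec(x, limit); ++x)
        ;
    return x;
}
```

With the limit hidden like this, the compiler has to fall back to a generic division for every value of y, in the recursive version just as in the iterative one.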
I have a simple piece of code, that addresses this (poorly stated, out of place) question :
template<typename It>
bool isAlpha(It first, It last)
{
return (first != last && *first != '\0') ?
(isalpha(static_cast<int>(*first)) && isAlpha(++first, last)) : true;
}
I'm trying to figure out how can I go about implementing it in a tail recursive fashion, and although there are great sources like this answer, I can't wrap my mind around it.
Can anyone help ?
EDIT
I'm placing the disassembly below. The compiler is gcc 4.9.0, compiling with -std=c++11 -O2 -Wall -pedantic; the assembly output is:
bool isAlpha<char const*>(char const*, char const*):
cmpq %rdi, %rsi
je .L5
movzbl (%rdi), %edx
movl $1, %eax
testb %dl, %dl
je .L12
pushq %rbp
pushq %rbx
leaq 1(%rdi), %rbx
movq %rsi, %rbp
subq $8, %rsp
.L3:
movsbl %dl, %edi
call isalpha
testl %eax, %eax
jne .L14
xorl %eax, %eax
.L2:
addq $8, %rsp
popq %rbx
popq %rbp
.L12:
rep ret
.L14:
cmpq %rbp, %rbx
je .L7
addq $1, %rbx
movzbl -1(%rbx), %edx
testb %dl, %dl
jne .L3
.L7:
movl $1, %eax
jmp .L2
.L5:
movl $1, %eax
ret
To clarify cdhowie's point, the function can be rewritten as follows (unless I made a mistake):
template<typename It>
bool isAlpha(It first, It last)
{
if (first == last)
return true;
if (*first == '\0')
return true;
if (!isalpha(static_cast<int>(*first)))
return false;
return isAlpha(++first, last);
}
Which would indeed allow for trivial tail call elimination.
This is normally a job for the compiler, though.
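If you would rather not depend on the compiler, the tail-recursive form translates by hand into a plain loop. The sketch below is my own rewrite (the name isAlphaIter is made up), and it casts to unsigned char before calling std::isalpha, which differs from the original code: passing a negative char value to isalpha is undefined behaviour.

```cpp
#include <cctype>

// Iterative equivalent of the tail-recursive isAlpha: stops at `last` or at
// the first '\0', and fails on the first non-alphabetic character.
template <typename It>
bool isAlphaIter(It first, It last)
{
    for (; first != last && *first != '\0'; ++first) {
        if (!std::isalpha(static_cast<unsigned char>(*first)))
            return false;
    }
    return true;
}
```

This is essentially the loop that the gcc listing above already contains, with the jump back to .L3 playing the role of the for statement.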
So the question is which of these implementations has better performance and readability.
Imagine you have to write code in which each step depends on the success of the previous one, something like:
bool function()
{
bool isOk = false;
if( A.Func1() )
{
B.Func1();
if( C.Func2() )
{
if( D.Func3() )
{
...
isOk = true;
}
}
}
return isOk;
}
Let's say there are up to 6 nested ifs. Since I don't want the indentation to grow too far to the right, and I don't want to nest the function calls because there are several parameters involved, the first alternative would be to use inverted logic:
bool function()
{
if( ! A.Func1() ) return false;
B.Func1();
if( ! C.Func2() ) return false;
if( ! D.Func3() ) return false;
...
return true;
}
But what about avoiding so many returns, like this:
bool function()
{
bool isOk = false;
do
{
if( ! A.Func1() ) break;
B.Func1();
if( ! C.Func2() ) break;
if( ! D.Func3() ) break;
...
isOk = true;
break;
}while(false);
return isOk;
}
Compilers will break down your code to simple instructions, using branch instructions to form loops, if/else etc, and it's unlikely your code will be any different at all once the compiler has gone over it.
Write the code that you think makes most sense for the solution you require.
If I were to "vote" for one of the three variants, I'd say my code is mostly variant 2. However, I don't follow it religiously. If it makes more sense (from a "how you think about it" perspective) to write in variant 1, then I will do that.
I don't think I've ever written, or even seen code written like variant 3 - I'm sure it happens, but if your goal is to have a single return, then I'd say variant 1 is the clearer and more obvious choice. Variant 3 is really just a "goto by another name" (see my most rewarded answer [and that's after I had something like 80 down-votes for suggesting goto as a solution]). I personally don't see variant 3 as any better than the other two, and unless the function is short enough to see do and the while on the same page, you also don't actually know that it won't loop without scrolling around - which is really not a good thing.
If you then, after profiling the code, discover a particular function is taking more time than you think is "right", study the assembly code.
Just to illustrate this, I will take your code and compile all three examples with g++ and clang++, and show the resulting code. It will probably take a few minutes because I have to actually make it compileable first.
Your source, after some massaging to make it compile as a single source file:
class X
{
public:
bool Func1();
bool Func2();
bool Func3();
};
X A, B, C, D;
bool function()
{
bool isOk = false;
if( A.Func1() )
{
B.Func1();
if( C.Func2() )
{
if( D.Func3() )
{
isOk = true;
}
}
}
return isOk;
}
bool function2()
{
if( ! A.Func1() ) return false;
B.Func1();
if( ! C.Func2() ) return false;
if( ! D.Func3() ) return false;
return true;
}
bool function3()
{
bool isOk = false;
do
{
if( ! A.Func1() ) break;
B.Func1();
if( ! C.Func2() ) break;
if( ! D.Func3() ) break;
isOk = true;
}while(false);
return isOk;
}
Code generated by clang 3.5 (compiled from sources a few days ago):
_Z8functionv: # @_Z8functionv
pushq %rax
movl $A, %edi
callq _ZN1X5Func1Ev
testb %al, %al
je .LBB0_2
movl $B, %edi
callq _ZN1X5Func1Ev
movl $C, %edi
callq _ZN1X5Func2Ev
testb %al, %al
je .LBB0_2
movl $D, %edi
popq %rax
jmp _ZN1X5Func3Ev # TAILCALL
.LBB0_2:
xorl %eax, %eax
popq %rdx
retq
_Z9function2v: # @_Z9function2v
pushq %rax
movl $A, %edi
callq _ZN1X5Func1Ev
testb %al, %al
je .LBB1_1
movl $B, %edi
callq _ZN1X5Func1Ev
movl $C, %edi
callq _ZN1X5Func2Ev
testb %al, %al
je .LBB1_3
movl $D, %edi
callq _ZN1X5Func3Ev
# kill: AL<def> AL<kill> EAX<def>
jmp .LBB1_5
.LBB1_1:
xorl %eax, %eax
jmp .LBB1_5
.LBB1_3:
xorl %eax, %eax
.LBB1_5:
# kill: AL<def> AL<kill> EAX<kill>
popq %rdx
retq
_Z9function3v: # @_Z9function3v
pushq %rax
.Ltmp4:
.cfi_def_cfa_offset 16
movl $A, %edi
callq _ZN1X5Func1Ev
testb %al, %al
je .LBB2_2
movl $B, %edi
callq _ZN1X5Func1Ev
movl $C, %edi
callq _ZN1X5Func2Ev
testb %al, %al
je .LBB2_2
movl $D, %edi
popq %rax
jmp _ZN1X5Func3Ev # TAILCALL
.LBB2_2:
xorl %eax, %eax
popq %rdx
retq
In the clang++ code, the second function is very marginally worse due to an extra jump; one would have hoped the compiler could work out that it is the same as the others. But I doubt any realistic code where Func1, Func2 and Func3 actually do anything meaningful will show any measurable difference.
And g++ 4.8.2:
_Z8functionv:
subq $8, %rsp
movl $A, %edi
call _ZN1X5Func1Ev
testb %al, %al
jne .L10
.L3:
xorl %eax, %eax
addq $8, %rsp
ret
.L10:
movl $B, %edi
call _ZN1X5Func1Ev
movl $C, %edi
call _ZN1X5Func2Ev
testb %al, %al
je .L3
movl $D, %edi
addq $8, %rsp
jmp _ZN1X5Func3Ev
_Z9function2v:
subq $8, %rsp
movl $A, %edi
call _ZN1X5Func1Ev
testb %al, %al
jne .L19
.L13:
xorl %eax, %eax
addq $8, %rsp
ret
.L19:
movl $B, %edi
call _ZN1X5Func1Ev
movl $C, %edi
call _ZN1X5Func2Ev
testb %al, %al
je .L13
movl $D, %edi
addq $8, %rsp
jmp _ZN1X5Func3Ev
_Z9function3v:
.LFB2:
subq $8, %rsp
movl $A, %edi
call _ZN1X5Func1Ev
testb %al, %al
jne .L28
.L22:
xorl %eax, %eax
addq $8, %rsp
ret
.L28:
movl $B, %edi
call _ZN1X5Func1Ev
movl $C, %edi
call _ZN1X5Func2Ev
testb %al, %al
je .L22
movl $D, %edi
addq $8, %rsp
jmp _ZN1X5Func3Ev
I challenge you to spot the difference aside from the label names between the different functions.
I think performance (and most likely even binary code) will be the same with any modern compiler.
Readability is somewhat a matter of conventions and habits.
I personally would prefer the first form, and probably you would need a new function to group some of the conditions (I think some of them can be grouped together in some meaningful way). The third form looks most cryptic to me.
As C++ has RAII and automatic cleanups, I tend to prefer the bail-out-with-return-as-soon-as-possible solution (your second one), because the code gets much cleaner IMHO. Obviously, it's a matter of opinion, taste and YMMV...
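A small sketch of why early returns combine well with RAII: every return path runs the destructors of whatever has been constructed so far, so no cleanup code has to be repeated. Guard, liveGuards and process are made-up names for illustration; in real code the guard would own a file, a lock, or similar.

```cpp
// Hypothetical resource wrapper: the counter only exists so that the
// automatic cleanup can be observed from outside.
int liveGuards = 0;

struct Guard {
    Guard()  { ++liveGuards; }
    ~Guard() { --liveGuards; }
    Guard(const Guard&) = delete;
    Guard& operator=(const Guard&) = delete;
};

bool process(bool step1Ok, bool step2Ok)
{
    Guard first;                     // acquired before step 1
    if (!step1Ok) return false;      // `first` is released here

    Guard second;                    // acquired only if step 1 succeeded
    if (!step2Ok) return false;      // `second`, then `first`, are released

    return true;                     // same cleanup on the success path
}
```

Whichever return is taken, liveGuards is back to zero afterwards, which is the property that makes the bail-out style safe in C++.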
I thought one compare must be faster than two. But after my test, I found that in debug mode the short compare is a bit faster, and in release mode the char compare is faster. I want to know the real reason.
Following are the test code and test results. I wrote two simple functions: func1() uses two char compares, and func2() uses one short compare. The main function returns the accumulated value so that the compiler cannot optimize my test code away. My compiler is GCC 4.7.2, CPU Intel® Xeon® CPU E5-2430 0 @ 2.20GHz (VM).
inline int func1(unsigned char word[2])
{
if (word[0] == 0xff && word[1] == 0xff)
return 1;
return 0;
}
inline int func2(unsigned char word[2])
{
if (*(unsigned short*)word == 0xffff)
return 1;
return 0;
}
int main()
{
int n_ret = 0;
for (int j = 0; j < 10000; ++j)
for (int i = 0; i < 70000; ++i)
n_ret += func2((unsigned char*)&i);
return n_ret;
}
Debug mode:
func1 func2
real 0m3.621s 0m3.586s
user 0m3.614s 0m3.579s
sys 0m0.001s 0m0.000s
Release mode:
func1 func2
real 0m0.833s 0m0.880s
user 0m0.831s 0m0.878s
sys 0m0.000s 0m0.002s
func1 edition's assembly code:
.cfi_startproc
movl $10000, %esi
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L6:
movl $1, %edx
xorl %ecx, %ecx
.p2align 4,,10
.p2align 3
.L8:
movl %edx, -24(%rsp)
addl $1, %edx
addl %ecx, %eax
cmpl $70001, %edx
je .L3
xorl %ecx, %ecx
cmpb $-1, -24(%rsp)
jne .L8
xorl %ecx, %ecx
cmpb $-1, -23(%rsp)
sete %cl
jmp .L8
.p2align 4,,10
.p2align 3
.L3:
subl $1, %esi
jne .L6
rep
ret
.cfi_endproc
func2 edition's assembly code:
.cfi_startproc
movl $10000, %esi
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L4:
movl $1, %edx
xorl %ecx, %ecx
jmp .L3
.p2align 4,,10
.p2align 3
.L7:
movzwl -24(%rsp), %ecx
.L3:
cmpw $-1, %cx
movl %edx, -24(%rsp)
sete %cl
addl $1, %edx
movzbl %cl, %ecx
addl %ecx, %eax
cmpl $70001, %edx
jne .L7
subl $1, %esi
jne .L4
rep
ret
.cfi_endproc
In GCC 4.6.3 the code is different for the first and second pieces of code, and the runtime for the func1 option is noticeably slower if you run it for long enough. Unfortunately, with your very short runtime, the two appear similar in time.
Increasing the outer loop by a factor of 10 means it takes about 6 seconds for func2, and 10 seconds for func1. This is using gcc -std=c99 -O3 to compile the code.
The main difference, I expect, is from the extra branch introduced with the && statement. And the extra xorl %ecx, %ecx doesn't help much (I get the same, although my code looks subtly different when it comes to label names).
Edit: I did try to come up with a branchless solution using and instead of a branch, but the compiler refuses to inline the function, so it takes 30 seconds instead of 10.
Benchmarks run on:
AMD Phenom(tm) II X4 965
Runs at 3.4 GHz.
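The branchless variant mentioned in the edit is not shown, so the sketch below is only a guess at what it might look like: a bitwise & in place of the short-circuiting &&. The second function is a side note of mine, showing a memcpy-based way to do the 16-bit compare, since the *(unsigned short*)word cast in the question formally breaks the aliasing rules. Both function names are made up.

```cpp
#include <cstdint>
#include <cstring>

// Bitwise & evaluates both comparisons unconditionally, so there is no
// short-circuit branch between them (the compiler may still choose one).
inline int func1Branchless(const unsigned char word[2])
{
    return (word[0] == 0xff) & (word[1] == 0xff);
}

// One 16-bit compare without the pointer cast: memcpy into a properly typed
// object. Compilers typically turn this into a single 16-bit load.
inline int func2Memcpy(const unsigned char word[2])
{
    std::uint16_t value;
    std::memcpy(&value, word, sizeof value);
    return value == 0xffff;
}
```

Whether either of these is faster than the originals is exactly the kind of thing that has to be measured on the target compiler and CPU.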
Inside a large loop, I currently have a statement similar to
if (ptr == NULL || ptr->calculate() > 5)
{do something}
where ptr is an object pointer set before the loop and never changed.
I would like to avoid comparing ptr to NULL in every iteration of the loop. (The current final program does that, right?) A simple solution would be to write the loop code once for (ptr == NULL) and once for (ptr != NULL). But this would increase the amount of code making it more difficult to maintain, plus it looks silly if the same large loop appears twice with only one or two lines changed.
What can I do? Use dynamically-valued constants maybe and hope the compiler is smart? How?
Many thanks!
EDIT by Luther Blissett. The OP wants to know if there is a better way to remove the pointer check here:
loop {
A;
if (ptr==0 || ptr->calculate()>5) B;
C;
}
than duplicating the loop as shown here:
if (ptr==0)
loop {
A;
B;
C;
}
else loop {
A;
if (ptr->calculate()>5) B;
C;
}
I just wanted to point out that GCC can apparently do the requested hoisting in its optimizer. Here's a model loop (in C):
struct C
{
int (*calculate)();
};
void sideeffect1();
void sideeffect2();
void sideeffect3();
void foo(struct C *ptr)
{
int i;
for (i=0;i<1000;i++)
{
sideeffect1();
if (ptr == 0 || ptr->calculate()>5) sideeffect2();
sideeffect3();
}
}
Compiling this with gcc 4.5 and -O3 gives:
.globl foo
.type foo, @function
foo:
.LFB0:
pushq %rbp
.LCFI0:
movq %rdi, %rbp
pushq %rbx
.LCFI1:
subq $8, %rsp
.LCFI2:
testq %rdi, %rdi # ptr==0? -> .L2, see below
je .L2
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L4:
xorl %eax, %eax
call sideeffect1 # sideeffect1
xorl %eax, %eax
call *0(%rbp) # call p->calculate, no check for ptr==0
cmpl $5, %eax
jle .L3
xorl %eax, %eax
call sideeffect2 # ok, call sideeffect2
.L3:
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L4
addq $8, %rsp
.LCFI3:
xorl %eax, %eax
popq %rbx
.LCFI4:
popq %rbp
.LCFI5:
ret
.L2: # here's the loop with ptr==0
.LCFI6:
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L6:
xorl %eax, %eax
call sideeffect1 # does not try to call ptr->calculate() anymore
xorl %eax, %eax
call sideeffect2
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L6
addq $8, %rsp
.LCFI7:
xorl %eax, %eax
popq %rbx
.LCFI8:
popq %rbp
.LCFI9:
ret
And so does clang 2.7 (-O3):
foo:
.Leh_func_begin1:
pushq %rbp
.Llabel1:
movq %rsp, %rbp
.Llabel2:
pushq %r14
pushq %rbx
.Llabel3:
testq %rdi, %rdi # ptr==NULL -> .LBB1_5
je .LBB1_5
movq %rdi, %rbx
movl $1000, %r14d
.align 16, 0x90
.LBB1_2:
xorb %al, %al # here's the loop with the ptr->calculate check()
callq sideeffect1
xorb %al, %al
callq *(%rbx)
cmpl $6, %eax
jl .LBB1_4
xorb %al, %al
callq sideeffect2
.LBB1_4:
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_2
jmp .LBB1_7
.LBB1_5:
movl $1000, %r14d
.align 16, 0x90
.LBB1_6:
xorb %al, %al # and here's the loop for the ptr==NULL case
callq sideeffect1
xorb %al, %al
callq sideeffect2
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_6
.LBB1_7:
popq %rbx
popq %r14
popq %rbp
ret
In C++, although it is complete overkill, you can put the loop in a function and use a template. This generates the body of the function twice, but the extra check is optimized out of each instantiation. While I certainly don't recommend it, here is the code:
template<bool ptr_is_null>
void loop() {
for(int i = x; i != y; ++i) {
/**/
if(ptr_is_null || ptr->calculate() > 5) {
/**/
}
/**/
}
}
You call it with:
if (ptr==NULL) loop<true>(); else loop<false>();
You are better off without this "optimization", the compiler will probably do the RightThing(TM) for you.
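For completeness, here is a self-contained version of the template trick that actually compiles. Calc, countHits and countHitsDispatch are made-up names, and counting hits stands in for the unspecified "do something"; treat it as a sketch of the idea rather than a recommendation.

```cpp
// Hypothetical object type; the original post only says that ptr points to
// something with a calculate() member.
struct Calc {
    int value;
    int calculate() const { return value; }
};

template <bool PtrIsNull>
int countHits(const Calc* ptr, int iterations)
{
    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        // PtrIsNull is a compile-time constant: in the <true> instantiation
        // the || short-circuits, so ptr is never dereferenced there.
        if (PtrIsNull || ptr->calculate() > 5)
            ++hits;
    }
    return hits;
}

// The null check happens once, outside the loop, and selects the variant.
int countHitsDispatch(const Calc* ptr, int iterations)
{
    return ptr == nullptr ? countHits<true>(ptr, iterations)
                          : countHits<false>(ptr, iterations);
}
```

The loop body is written once in the source, but two copies end up in the binary, which is much the same result the gcc and clang listings above show the optimizer reaching on its own.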
Why do you want to avoid comparing to NULL?
Creating a variant for each of the NULL and non-NULL cases just gives you almost twice as much code to write, test and more importantly maintain.
A 'large loop' smells like an opportunity to refactor the loop into separate functions, in order to make the code easier to maintain. Then you can easily have two variants of the loop, one for ptr == null and one for ptr != null, calling different functions, with just a rough similarity in the overall structure of the loop.
Since
ptr is an object pointer set before the loop and never changed
can't you just check whether it is null before the loop and not check again, since you don't change it?
If it is not valid for your pointer to be NULL, you could use a reference instead.
If it is valid for your pointer to be NULL, but if so then you skip all processing, then you could either wrap your code with one check at the beginning, or return early from your function:
if (ptr != NULL)
{
// your function
}
or
if (ptr == NULL) { return; }
If it is valid for your pointer to be NULL, but only some processing is skipped, then keep it like it is.
if (ptr == NULL || ptr->calculate() > 5)
{do something}
I would simply think in terms of what is done if the condition is true.
If "do something" is really the exact same stuff for (ptr == NULL) or (ptr->calculate() > 5), then I hardly see a reason to split up anything.
If "do something" contains particular cases for either condition, then I would consider to refactor into separate loops to get rid of extra special case checking. Depends on the special cases involved.
Eliminating code duplication is good up to a point. You should not care too much about optimizing until your program does what it should do and until performance becomes a problem.
[...] Premature optimization is the root of all evil
http://en.wikipedia.org/wiki/Program_optimization