Does this optimization even matter? - c++

This page recommends "loop unrolling" as an optimization:
Loop overhead can be reduced by reducing the number of iterations and
replicating the body of the loop.
Example:
In the code fragment below, the body of the loop can be replicated
once and the number of iterations can be reduced from 100 to 50.
for (i = 0; i < 100; i++)
g ();
Below is the code fragment after loop unrolling.
for (i = 0; i < 100; i += 2)
{
g ();
g ();
}
With GCC 5.2, loop unrolling isn't enabled unless you use -funroll-loops (it's not enabled in either -O2 or -O3). I've inspected the assembly to see if there's a significant difference.
g++ -std=c++14 -O3 -funroll-loops -c -Wall -pedantic -pthread main.cpp && objdump -d main.o
Version 1:
0: ba 64 00 00 00 mov $0x64,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
# ... etc ...
a1: 83 c1 01 add $0x1,%ecx
a4: 83 ea 0a sub $0xa,%edx
a7: 89 0d 00 00 00 00 mov %ecx,0x0(%rip) # ad <main+0xad>
ad: 0f 85 55 ff ff ff jne 8 <main+0x8>
b3: 31 c0 xor %eax,%eax
b5: c3 retq
Version 2:
0: ba 32 00 00 00 mov $0x32,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
11: 89 05 00 00 00 00 mov %eax,0x0(%rip) # 17 <main+0x17>
17: 8b 0d 00 00 00 00 mov 0x0(%rip),%ecx # 1d <main+0x1d>
1d: 83 c1 01 add $0x1,%ecx
# ... etc ...
143: 83 c7 01 add $0x1,%edi
146: 83 ea 0a sub $0xa,%edx
149: 89 3d 00 00 00 00 mov %edi,0x0(%rip) # 14f <main+0x14f>
14f: 0f 85 b3 fe ff ff jne 8 <main+0x8>
155: 31 c0 xor %eax,%eax
157: c3 retq
Version 2 produces more iterations. What am I missing?

Yes, there are cases where loop unrolling will make the code more efficient.
The theory is that unrolling reduces loop overhead (the branch back to the top of the loop and the increment of the loop counter).
Most processors hate branch instructions. They love data-processing instructions. Every iteration costs at least one branch instruction. By "duplicating" a set of code, the number of branches is reduced and the number of data-processing instructions between branches is increased.
Many modern compilers have optimization settings to perform loop unrolling.

It doesn’t produce more iterations; you’ll notice that the loop that calls g() twice runs half as many times. (What if you have to call g() an odd number of times? Look up Duff’s Device.)
In your listings, you’ll notice that the assembly-language instruction jne 8 <main+0x8> appears once in both. This tells the processor to go back to the start of the loop. In the original loop, this instruction will run 99 times. In the unrolled loop, it will only run 49 times. Imagine if the body of the loop is something very short, just one or two instructions. These jumps might be a third or even half of the instructions in the most performance-critical part of your program! (And there is even a useful loop with zero instructions: BogoMIPS. But the article about optimizing that was a joke.)
So, unrolling the loop trades code size for speed, right? Not so fast. Maybe you’ve made your unrolled loop so big that the code at the top of the loop is no longer in the cache, and the CPU needs to fetch it. In the real world, the only way to know if it helps is to profile your program.


How to properly check if a value is infinite or NaN in C++ (MSVC2010)

I'm almost 100% positive that this has been asked before, but my search didn't lead to a satisfying answer.
So let's begin. All of my problems come from this little issue: -1.#IND000.
Basically, my value was either NaN or infinite, so the calculations blew up and caused errors.
Since I'm working with floats, I've been using float.IsNaN() and float.IsInfinity() in C#,
but when I started coding in C++ I couldn't quite find equivalent functions.
So I wrote a template for checking if the float is nan, like this:
template <typename T> bool isnan (T value)
{ return value != value; }
But how should I write a function to determine whether a float is infinite? And is my NaN check done properly? Also, I'm doing the checks in a timed loop, so the template should be fast.
Thanks for your time!
You are looking for std::isnan() and std::isinf(). You should not attempt to write these functions yourself given that they exist as part of the standard library.
Now, I have a nagging doubt that these functions are not present in the standard library that ships with VS2010. In that case you can work around the omission by using functions provided by the CRT, specifically the following functions declared in float.h: _isnan(), _finite() and _fpclass().
Note that:
x is NaN if and only if x != x.
x is NaN or an infinity if and only if x - x != 0.
x is a zero or an infinity if and only if x + x == x.
x is a zero if and only if x == 0.
If FLT_EVAL_METHOD is 0 or 1, then x is an infinity if and only if x + DBL_MAX == x.
x is positive infinity if and only if x + infinity == x.
I do not think there is anything wrong with using comparisons like the above instead of standard library functions, even if those standard library functions exist. In fact, after a discussion with David Heffernan in the comments, I would recommend using the arithmetic comparisons above over the isinf/isfinite/isnan macros/functions.
I see that you are using a Microsoft compiler here. I do not have one installed. What follows is all done with reference to the gcc on my Arch box, namely gcc version 4.9.0 20140521 (prerelease) (GCC), so this is at most a portability note for you. Try something similar with your compiler and see which variants, if any, tell the compiler what's going on and which just make it give up.
Consider the following code:
#include <cstdio>

int foo(double x) {
    return x != x;
}

void tva(double x) {
    if (!foo(x)) {
        x += x;
        if (!foo(x)) {
            printf(":(");
        }
    }
}
Here foo is an implementation of isnan. x += x will not result in a NaN unless x was NaN before. Here is the code generated for tva:
0000000000000020 <_Z3tvad>:
20: 66 0f 2e c0 ucomisd %xmm0,%xmm0
24: 7a 1a jp 40 <_Z3tvad+0x20>
26: f2 0f 58 c0 addsd %xmm0,%xmm0
2a: 66 0f 2e c0 ucomisd %xmm0,%xmm0
2e: 7a 10 jp 40 <_Z3tvad+0x20>
30: bf 00 00 00 00 mov $0x0,%edi
35: 31 c0 xor %eax,%eax
37: e9 00 00 00 00 jmpq 3c <_Z3tvad+0x1c>
3c: 0f 1f 40 00 nopl 0x0(%rax)
40: f3 c3 repz retq
Note that the branch containing the printf was not generated. What happens if we replace foo with isnan?
00000000004005c0 <_Z3tvad>:
4005c0: 66 0f 28 c8 movapd %xmm0,%xmm1
4005c4: 48 83 ec 18 sub $0x18,%rsp
4005c8: f2 0f 11 4c 24 08 movsd %xmm1,0x8(%rsp)
4005ce: e8 4d fe ff ff callq 400420 <__isnan@plt>
4005d3: 85 c0 test %eax,%eax
4005d5: 75 17 jne 4005ee <_Z3tvad+0x2e>
4005d7: f2 0f 10 4c 24 08 movsd 0x8(%rsp),%xmm1
4005dd: 66 0f 28 c1 movapd %xmm1,%xmm0
4005e1: f2 0f 58 c1 addsd %xmm1,%xmm0
4005e5: e8 36 fe ff ff callq 400420 <__isnan@plt>
4005ea: 85 c0 test %eax,%eax
4005ec: 74 0a je 4005f8 <_Z3tvad+0x38>
4005ee: 48 83 c4 18 add $0x18,%rsp
4005f2: c3 retq
4005f3: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
4005f8: bf 94 06 40 00 mov $0x400694,%edi
4005fd: 48 83 c4 18 add $0x18,%rsp
400601: e9 2a fe ff ff jmpq 400430 <printf@plt>
400606: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
It appears that gcc has no idea what isnan does! It generates the dead branch with the printf and it generates two separate calls to isnan.
My point here is that using the isnan macro/function confounds gcc's value analysis. It has no idea that isnan(x) if and only if x is NaN. Having compiler optimisations work is often much more important than generating the fastest possible code for a given primitive.

Why does vectorization fail?

I want to optimize my code for vectorization using
-msse2 -ftree-vectorizer-verbose=2.
I have the following simple code:
int main() {
    int a[2048], b[2048], c[2048];
    int i;
    for (i = 0; i < 2048; i++) {
        b[i] = 0;
        c[i] = 0;
    }
    for (i = 0; i < 2048; i++) {
        a[i] = b[i] + c[i];
    }
    return 0;
}
Why do I get the note
test.cpp:10: note: not vectorized: not enough data-refs in basic block.
Thanks!
For what it's worth, after adding an asm volatile("": "+m"(a), "+m"(b), "+m"(c)::"memory"); near the end of main, my copy of gcc emits this:
400610: 48 81 ec 08 60 00 00 sub $0x6008,%rsp
400617: ba 00 20 00 00 mov $0x2000,%edx
40061c: 31 f6 xor %esi,%esi
40061e: 48 8d bc 24 00 20 00 lea 0x2000(%rsp),%rdi
400625: 00
400626: e8 b5 ff ff ff callq 4005e0 <memset@plt>
40062b: ba 00 20 00 00 mov $0x2000,%edx
400630: 31 f6 xor %esi,%esi
400632: 48 8d bc 24 00 40 00 lea 0x4000(%rsp),%rdi
400639: 00
40063a: e8 a1 ff ff ff callq 4005e0 <memset@plt>
40063f: 31 c0 xor %eax,%eax
400641: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400648: c5 f9 6f 84 04 00 20 vmovdqa 0x2000(%rsp,%rax,1),%xmm0
40064f: 00 00
400651: c5 f9 fe 84 04 00 40 vpaddd 0x4000(%rsp,%rax,1),%xmm0,%xmm0
400658: 00 00
40065a: c5 f8 29 04 04 vmovaps %xmm0,(%rsp,%rax,1)
40065f: 48 83 c0 10 add $0x10,%rax
400663: 48 3d 00 20 00 00 cmp $0x2000,%rax
400669: 75 dd jne 400648 <main+0x38>
So it recognised that the first loop was just doing memset on a couple of arrays, and that the second loop was doing a vector addition, which it duly vectorised.
I'm using gcc version 4.9.0 20140521 (prerelease) (GCC).
An older machine with gcc version 4.7.2 (Debian 4.7.2-5) also vectorises the loop, but in a different way. Your -ftree-vectorizer-verbose=2 setting makes it emit the following output:
Analyzing loop at foo155.cc:10
Vectorizing loop at foo155.cc:10
10: LOOP VECTORIZED.
foo155.cc:1: note: vectorized 1 loops in function.
You probably goofed your compiler flags (I used g++ -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 -march=native foo155.cc -o foo155 to build) or have a really old compiler.
Remove the first loop and zero-initialize the arrays instead:
int a[2048], b[2048] = {0}, c[2048] = {0};
Also try the flag
-ftree-vectorize
instead of
-msse2 -ftree-vectorizer-verbose=2

Difference between “static const”, “#define”, and “enum” in performance and memory usage aspects

There might be a difference because of the inlining of #define statements.
I understand that the answer may be compiler-dependent, so let's assume GCC.
There already are similar questions about C and about C++, but they are more about usage aspects.
The compiler will treat them the same given basic optimization.
It's fairly easy to check; consider the following C code:
#include <stdio.h>

#define a 1
static const int b = 2;
typedef enum { FOUR = 4 } enum_t;

int main() {
    enum_t c = FOUR;
    printf("%d\n", a);
    printf("%d\n", b);
    printf("%d\n", c);
    return 0;
}
compiled with gcc -O3:
0000000000400410 <main>:
400410: 48 83 ec 08 sub $0x8,%rsp
400414: be 01 00 00 00 mov $0x1,%esi
400419: bf 2c 06 40 00 mov $0x40062c,%edi
40041e: 31 c0 xor %eax,%eax
400420: e8 cb ff ff ff callq 4003f0 <printf@plt>
400425: be 02 00 00 00 mov $0x2,%esi
40042a: bf 2c 06 40 00 mov $0x40062c,%edi
40042f: 31 c0 xor %eax,%eax
400431: e8 ba ff ff ff callq 4003f0 <printf@plt>
400436: be 04 00 00 00 mov $0x4,%esi
40043b: bf 2c 06 40 00 mov $0x40062c,%edi
400440: 31 c0 xor %eax,%eax
400442: e8 a9 ff ff ff callq 4003f0 <printf@plt>
Absolutely identical assembly code, and hence exactly the same performance and memory usage.
Edit: As Damon stated in the comments, there may be some corner cases such as complicated non literals, but that goes a bit beyond the question.
When used as a constant expression there will be no difference in performance. If used as an lvalue, the static const will need to be defined (memory) and accessed (cpu).

odd compiled code

I've compiled some Qt code with google's nacl compiler, but the ncval validator does not grok it. One example among many:
src/corelib/animation/qabstractanimation.cpp:165
Here's the relevant code:
#define Q_GLOBAL_STATIC(TYPE, NAME) \
static TYPE *NAME() \
{ \
static TYPE thisVariable; \
static QGlobalStatic<TYPE > thisGlobalStatic(&thisVariable); \
return thisGlobalStatic.pointer; \
}
#ifndef QT_NO_THREAD
Q_GLOBAL_STATIC(QThreadStorage<QUnifiedTimer *>, unifiedTimer)
#endif
which compiles to:
00000480 <_ZL12unifiedTimerv>:
480: 55 push %ebp
481: 89 e5 mov %esp,%ebp
483: 57 push %edi
484: 56 push %esi
485: 53 push %ebx
486: 83 ec 2c sub $0x2c,%esp
489: c7 04 24 28 00 2e 10 movl $0x102e0028,(%esp)
490: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
494: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
49b: e8 fc ff ff ff call 49c <_ZL12unifiedTimerv+0x1c>
4a0: 84 c0 test %al,%al
4a2: 74 1c je 4c0 <_ZL12unifiedTimerv+0x40>
4a4: 0f b6 05 2c 00 2e 10 movzbl 0x102e002c,%eax
4ab: 83 f0 01 xor $0x1,%eax
4ae: 84 c0 test %al,%al
4b0: 74 0e je 4c0 <_ZL12unifiedTimerv+0x40>
4b2: b8 01 00 00 00 mov $0x1,%eax
4b7: eb 27 jmp 4e0 <_ZL12unifiedTimerv+0x60>
4b9: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
4c0: b8 00 00 00 00 mov $0x0,%eax
4c5: eb 19 jmp 4e0 <_ZL12unifiedTimerv+0x60>
4c7: 90 nop
4c8: 90 nop
4c9: 90 nop
4ca: 90 nop
4cb: 90 nop
Check the call instruction at 49b: it is what the validator cannot grok. What on earth could induce the compiler to issue an instruction that calls into the middle of itself? Is there a way around this? I've compiled with -g -O0 -fno-inline. Compiler bug?
Presumably it's really a call to an external symbol, which will get filled in at link time. Actually what will get called is externalSymbol-4, which is a bit strange -- perhaps this is what is throwing the ncval validator off the scent.
Is this a dynamic library or a static object that is not linked to an executable yet?
In a dynamic library, this likely came about because the code was built as position-dependent and then linked into a dynamic library. Try "objdump -d -r -R" on it; if you see TEXTREL, that is the case. TEXTREL is not supported in NaCl's dynamic linking story. (This is solved by passing the -fPIC flag when compiling the code.)
With a static object try to validate after it was linked into a static executable.

Performance difference of "if if" vs "if else if"

I was just wondering: is there any performance difference between the following two statements in C/C++?
Case 1:
if (p == 0)
    do_this();
else if (p == 1)
    do_that();
else if (p == 2)
    do_these();
Case 2:
if (p == 0)
    do_this();
if (p == 1)
    do_that();
if (p == 2)
    do_these();
Assuming simple types (in this case, int) and no funny business (no user-defined operator== in play), at least with GCC 4.6 on AMD64, there is no difference. The generated code is identical:
0000000000000000 <case_1>: 0000000000000040 <case_2>:
0: 85 ff test %edi,%edi 40: 85 ff test %edi,%edi
2: 74 14 je 18 <case_1+0x18> 42: 74 14 je 58 <case_2+0x18>
4: 83 ff 01 cmp $0x1,%edi 44: 83 ff 01 cmp $0x1,%edi
7: 74 27 je 30 <case_1+0x30> 47: 74 27 je 70 <case_2+0x30>
9: 83 ff 02 cmp $0x2,%edi 49: 83 ff 02 cmp $0x2,%edi
c: 74 12 je 20 <case_1+0x20> 4c: 74 12 je 60 <case_2+0x20>
e: 66 90 xchg %ax,%ax 4e: 66 90 xchg %ax,%ax
10: f3 c3 repz retq 50: f3 c3 repz retq
12: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 52: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
18: 31 c0 xor %eax,%eax 58: 31 c0 xor %eax,%eax
1a: e9 00 00 00 00 jmpq 1f <case_1+0x1f> 5a: e9 00 00 00 00 jmpq 5f <case_2+0x1f>
1f: 90 nop 5f: 90 nop
20: 31 c0 xor %eax,%eax 60: 31 c0 xor %eax,%eax
22: e9 00 00 00 00 jmpq 27 <case_1+0x27> 62: e9 00 00 00 00 jmpq 67 <case_2+0x27>
27: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 67: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
2e: 00 00 6e: 00 00
30: 31 c0 xor %eax,%eax 70: 31 c0 xor %eax,%eax
32: e9 00 00 00 00 jmpq 37 <case_1+0x37> 72: e9 00 00 00 00 jmpq 77 <case_2+0x37>
37: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
3e: 00 00
The extra instruction at the end of case_1 is just for padding (to get the next function aligned).
This isn't really surprising, figuring out that p isn't changed in that function is fairly basic optimization. If p could be changed (e.g., passed-by-reference or pointer to the various do_… functions, or was a reference or pointer itself, so there could be an alias) then the behavior is different, and of course the generated code would be too.
In the former case, conditions after the one that matched are not evaluated.
if/else is faster: if a match is found before the last if, the remaining checks are skipped; if the match is in the first condition, all the other checks are skipped.
if/if is slower: even if a match is found by the first if statement, it keeps evaluating the conditions in the other statements.
Yes, there is a performance difference:
the second version evaluates every if.
As it has already been demonstrated... it varies.
If we are talking about primitive (built-ins) types like int, then the compiler may be smart enough so that it does not matter (or not). In any case though, the performance impact will be minor because the cost of calling a function is much higher than that of a if, so the difference will probably get lost in the noise if you ever attempt to measure it.
The semantics, however, are quite different.
When I read the first case:
if (...) {
// Branch 1
} else if (...) {
// Branch 2
}
Then I know that no matter what the two branches might do, only one can ever be executed.
However, when I read the second case:
if (...) {
}
if (...) {
}
Then I have to wonder whether there is a possibility that both branches be taken or not, which mean that I have to scrutinize the code in the first to determine whether it is likely to influence the second test or not. And when I finally conclude it's not, I curse the bloody developer who was too lazy to write that damn else that would have saved me the last 10 minutes of scrutiny.
So, help yourself and your future maintainers, and concentrate on getting the semantics right and clear.
And on this subject, one could argue that this dispatch logic might be better expressed with other constructs, such as a switch or perhaps a map<int, void(*)()>? (Beware of the latter, and avoid over-engineering. ;))
You probably won’t notice any difference in performance for such a limited number of expressions. But in theory, the if..if..if requires checking every single expression. If the expressions are mutually exclusive, you can save that evaluation by using if..else if.. instead. That way, a later expression is checked only when the previous cases fail.
Note that when just checking an int for equality, you could also use the switch statement. That way you can still maintain some level of readability for a long sequence of checks.
switch (p)
{
case 0:
    do_this();
    break;
case 1:
    do_that();
    break;
case 2:
    do_these();
    break;
}
The major difference is that the if/else construct stops evaluating the ifs once one of them is true. That means it MAY execute only 1 or 2 of the checks before bailing out. The other version checks all 3 ifs, regardless of the outcome of the others.
So... if/else has an operational cost of "up to 3 checks", while the if/if/if version always does 3 checks. Assuming the three values are equally likely, the if/else version performs (1 + 2 + 3) / 3 = 2 checks on average, while the if/if version always does 3. In the long term, you save yourself about one check's worth of CPU time with the "else" construct.
The if..if case could be improved by adding a done flag to all subsequent if checks (conditions are evaluated left to right and && short-circuits), to avoid matching and doing the work twice:
bool bDone = false;
if ( <condition> ) {
    <work>;
    bDone = true;
}
if ( !bDone && <condition> ) {
    <work>;
    bDone = true;
}
Or you could use some logic like this:
while (true) {
    if ( <condition> ) {
        <work>;
        break;
    }
    if ( <condition> ) {
        <work>;
        break;
    }
    ....
    break;
}
though it's somewhat confusing to read (a reviewer will ask "why do it this way?").