Why is this no-op loop not optimized away?

Why is this no-op loop not optimized away? - c++

The following code does some copying from one array of zeroes interpreted as floats to another one, and prints timing of this operation. As I've seen many cases where no-op loops are just optimized away by compilers, including gcc, I was waiting that at some point of changing my copy-arrays program it will stop doing the copying.
#include <iostream>
#include <cstring>
#include <sys/time.h>
static inline long double currentTime()
{
timespec ts;
clock_gettime(CLOCK_MONOTONIC,&ts);
return ts.tv_sec+(long double)(ts.tv_nsec)*1e-9;
}
int main()
{
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
memset(data1,0,W*H*sizeof(float));
memset(data2,0,W*H*sizeof(float));
long double time1=currentTime();
for(int q=0;q<16;++q) // take more time
for(int k=0;k<W*H;++k)
data2[k]=data1[k];
long double time2=currentTime();
std::cout << (time2-time1)*1e+3 << " ms\n";
delete[] data1;
delete[] data2;
}
I compiled this with g++ 4.8.1 command g++ main.cpp -o test -std=c++0x -O3 -lrt. This program prints 6952.17 ms for me. (I had to set ulimit -s 2000000 for it to not crash.)
I also tried changing creation of arrays with new to automatic VLAs, removing memsets, but this doesn't change g++ behavior (apart from changing timings by several times).
It seems the compiler could prove that this code won't do anything sensible, so why didn't it optimize the loop away?

Anyway it isn't impossible (clang++ version 3.3):
clang++ main.cpp -o test -std=c++0x -O3 -lrt
The program prints 0.000367 ms for me... and looking at the assembly language:
...
callq clock_gettime
movq 56(%rsp), %r14
movq 64(%rsp), %rbx
leaq 56(%rsp), %rsi
movl $1, %edi
callq clock_gettime
...
while for g++:
...
call clock_gettime
fildq 32(%rsp)
movl $16, %eax
fildq 40(%rsp)
fmull .LC0(%rip)
faddp %st, %st(1)
.p2align 4,,10
.p2align 3
.L2:
movl $1, %ecx
xorl %edx, %edx
jmp .L5
.p2align 4,,10
.p2align 3
.L3:
movq %rcx, %rdx
movq %rsi, %rcx
.L5:
leaq 1(%rcx), %rsi
movss 0(%rbp,%rdx,4), %xmm0
movss %xmm0, (%rbx,%rdx,4)
cmpq $200000001, %rsi
jne .L3
subl $1, %eax
jne .L2
fstpt 16(%rsp)
leaq 32(%rsp), %rsi
movl $1, %edi
call clock_gettime
...
EDIT (g++ v4.8.2 / clang++ v3.3)
SOURCE CODE - ORIGINAL VERSION (1)
...
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
...
SOURCE CODE - MODIFIED VERSION (2)
...
const size_t W=20000;
const size_t H=10000;
float data1[W*H];
float data2[W*H];
...
Now the case that isn't optimized is (1) + g++

The code in this question has changed quite a bit, invalidating correct answers. This answer applies to the 5th version: as the code currently attempts to read uninitialized memory, an optimizer may reasonably assume that unexpected things are happening.
Many optimization steps have a similar pattern: there's a pattern of instructions that's matched to the current state of compilation. If the pattern matches at some point, the matched pattern is (parametrically) replaced by a more efficient version. A very simple example of such a pattern is the definition of a variable that's not subsequently used; the replacement in this case is simply a deletion.
These patterns are designed for correct code. On incorrect code, the patterns may simply fail to match, or they may match in entirely unintended ways. The first case leads to no optimization, the second case may lead to totally unpredictable results (certainly if the modified code if further optimized)

Why do you expect the compiler to optimise this? It’s generally really hard to prove that writes to arbitrary memory addresses are a “no-op”. In your case it would be possible, but it would require the compiler to trace the heap memory addresses through new (which is once again hard since these addresses are generated at runtime) and there really is no incentive for doing this.
After all, you tell the compiler explicitly that you want to allocate memory and write to it. How is the poor compiler to know that you’ve been lying to it?
In particular, the problem is that the heap memory could be aliased to lots of other stuff. It happens to be private to your process but like I said above, proving this is a lot of work for the compiler, unlike for function local memory.

The only way in which the compiler could know that this is a no-op is if it knew what memset does. In order for that to happen, the function must either be defined in a header (and it typically isn't), or it must be treated as a special intrinsic by the compiler. But barring those tricks, the compiler just sees a call to an unknown function which could have side effects and do different things for each of the two calls.

Related

What are these "abundant" assembly codes used for and why are they compiled out?

I'm not familiar with the implementation of C++ compilers and I'm writing a C++ snippet like this (for learning):
#include <vector>
void vector8_inc(std::vector<unsigned char>& v) {
for (std::size_t i = 0; i < v.size(); i++) {
v[i]++;
}
}
void vector32_inc(std::vector<unsigned int>& v) {
for (std::size_t i = 0; i < v.size(); i++) {
v[i]++;
}
}
int main() {
std::vector<unsigned char> my{ 10,11,2,3 };
vector8_inc(my);
std::vector<unsigned int> my2{ 10,11,2,3 };
vector32_inc(my2);
}
By generating its assembly codes and disassemble its binary, I found the segments for
vector8_inc and vector32_inc:
g++ -S test.cpp -o test.a -O2
_Z11vector8_incRSt6vectorIhSaIhEE:
.LFB853:
.cfi_startproc
endbr64
movq (%rdi), %rdx
cmpq 8(%rdi), %rdx
je .L1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
addb $1, (%rdx,%rax)
movq (%rdi), %rdx
addq $1, %rax
movq 8(%rdi), %rcx
subq %rdx, %rcx // line C: calculate size of the vector
cmpq %rcx, %rax
jb .L3
.L1:
ret
.cfi_endproc
......
_Z12vector32_incRSt6vectorIjSaIjEE:
.LFB854:
.cfi_startproc
endbr64
movq (%rdi), %rax // line A
movq 8(%rdi), %rdx
subq %rax, %rdx // line C': calculate size of the vector
movq %rdx, %rcx
shrq $2, %rcx // Line B: ← What is this used for?
je .L6
addq %rax, %rdx
.p2align 4,,10
.p2align 3
.L8:
addl $1, (%rax)
addq $4, %rax
cmpq %rax, %rdx
jne .L8
.L6:
ret
.cfi_endproc
......
but in fact, these two functions are "inlined" within main(), and the instructions shown above will never be executed but compiled and loaded into memory (I verified this by adding a breakpoint to line A and checking the memory):
main() contains duplicate logic of the above segment
So here are my questions:
What are these two segments used for and why are they compiled out? _Z12vector32_incRSt6vectorIjSaIjEE, _Z11vector8_incRSt6vectorIhSaIhEE
Does the instruction (Line B) in _Z12vector32_incRSt6vectorIjSaIjEE (i.e. shrq $2, %rcx ) means 'to shift size integer of the vector right by 2 bits'? If it does, what's this used for?
Why is size() computed every turn of the loop in vector8_inc, but in vector32_inc it's computed before executing the loop? (see Line C and Line C', under -O2 optimization level of g++)

First, function names are a lot more readable if you demangle them, with a tool such as c++filt:
_Z12vector32_incRSt6vectorIjSaIjEE
-> vector32_inc(std::vector<unsigned int, std::allocator<unsigned int> >&)
_Z11vector8_incRSt6vectorIhSaIhEE
-> vector8_inc(std::vector<unsigned char, std::allocator<unsigned char> >&)
Q1
You defined those functions as global (external linkage). The compiler compiles one module at a time. It doesn't know whether this file is your entire program, or if you will link it with other object files later. In the latter case, some function from another module might call vector8_inc or vector32_inc, and so the compiler needs to create an out-of-line version to be available for such a call.
A traditional linker links code at the level of modules, and isn't able to selectively exclude functions from individual object files. It has to include your test.o module in the link, so that means the binary contains all the code from that module, including the out-of-line vector_inc functions. You know they can never be called, but the compiler didn't know that and the linker doesn't check.
Here are three ways you can avoid having them included in your binary:
Declare those functions as static in your source code. This ensures they cannot be called from other modules, so if the compiler can inline all calls from within this module, it need not emit the out-of-line versions.
Compile with -fwhole-program which tells GCC to assume that functions cannot be called from anywhere it can't see.
Link using -flto to enable link time optimization. This provides more data to the linker, so that it can be smart and do things like omit individual functions that are never called (and much more). When I used -flto, again I no longer saw the out-of-line code in the binary.
Q2
Shift right by 2 is division by 4. It looks like the std::vector class, in GCC's implementation, contains a start pointer (loaded into %rax) and a finish pointer (loaded into %rdx). Subtracting these pointers yields the number of bytes in the vector. For a vector of unsigned int, each element is 4 bytes in size, so dividing by 4 yields the number of elements.
This value is only used as a test of whether the vector is empty, by comparing it to zero. In this case the loop is skipped and the function does nothing. (shr will set the zero flag if the result is zero, and je jumps if the zero flag is set.) So the only way the shift matters, instead of just comparing the pointer difference to 0, is if the pointer difference were 1, 2 or 3. That should not happen, since pointers to unsigned int should be 4-byte aligned (unless the library is doing clever things like storing metadata in the low bits). Thus I think the shift is actually unnecessary, and this may be a minor missed optimization.
Q3
This is most likely to handle aliasing.
There is a general rule in C++ that, roughly speaking, writing through a pointer of one type may not modify an object of another type. So in vector32_inc, the compiler knows that writing to elements of the vector, through a pointer to unsigned int, cannot modify the vector's metadata (which apparently is objects of some other type, most likely pointers). Therefore it is safe to hoist the size() computation out of the loop, since it cannot change over the course of the loop's execution. Likewise, the pointer used to iterate over the object can be kept in a register throughout.
However, character types have a special exemption from this rule, partly so that things like memcpy can work. So when you do vec[i]++, writing through a pointer to unsigned char, the compiler has to account for the possibility that the pointer may actually point to some of the vector metadata. (This should not actually happen if vector is being used properly, but the compiler doesn't know that.) If it did, then vec[i]++ might modify that metadata in memory. The next loop iteration would then be supposed to use the newly modified metadata, so it has to be reloaded from memory just in case. Note that the vector's start pointer is also reloaded, and the new value used, instead of keeping it in a register throughout.

Inlining of vararg functions

While playing about with optimisation settings, I noticed an interesting phenomenon: functions taking a variable number of arguments (...) never seemed to get inlined. (Obviously this behavior is compiler-specific, but I've tested on a couple of different systems.)
For example, compiling the following small programm:
#include <stdarg.h>
#include <stdio.h>
static inline void test(const char *format, ...)
{
va_list ap;
va_start(ap, format);
vprintf(format, ap);
va_end(ap);
}
int main()
{
test("Hello %s\n", "world");
return 0;
}
will seemingly always result in a (possibly mangled) test symbol appearing in the resulting executable (tested with Clang and GCC in both C and C++ modes on MacOS and Linux). If one modifies the signature of test() to take a plain string which is passed to printf(), the function is inlined from -O1 upwards by both compilers as you'd expect.
I suspect this is to do with the voodoo magic used to implement varargs, but how exactly this is usually done is a mystery to me. Can anybody enlighten me as to how compilers typically implement vararg functions, and why this seemingly prevents inlining?

At least on x86-64, the passing of var_args is quite complex (due to passing arguments in registers). Other architectures may not be quite so complex, but it is rarely trivial. In particular, having a stack-frame or frame pointer to refer to when getting each argument may be required. These sort of rules may well stop the compiler from inlining the function.
The code for x86-64 includes pushing all the integer arguments, and 8 sse registers onto the stack.
This is the function from the original code compiled with Clang:
test: # #test
subq $200, %rsp
testb %al, %al
je .LBB1_2
# BB#1: # %entry
movaps %xmm0, 48(%rsp)
movaps %xmm1, 64(%rsp)
movaps %xmm2, 80(%rsp)
movaps %xmm3, 96(%rsp)
movaps %xmm4, 112(%rsp)
movaps %xmm5, 128(%rsp)
movaps %xmm6, 144(%rsp)
movaps %xmm7, 160(%rsp)
.LBB1_2: # %entry
movq %r9, 40(%rsp)
movq %r8, 32(%rsp)
movq %rcx, 24(%rsp)
movq %rdx, 16(%rsp)
movq %rsi, 8(%rsp)
leaq (%rsp), %rax
movq %rax, 192(%rsp)
leaq 208(%rsp), %rax
movq %rax, 184(%rsp)
movl $48, 180(%rsp)
movl $8, 176(%rsp)
movq stdout(%rip), %rdi
leaq 176(%rsp), %rdx
movl $.L.str, %esi
callq vfprintf
addq $200, %rsp
retq
and from gcc:
test.constprop.0:
.cfi_startproc
subq $216, %rsp
.cfi_def_cfa_offset 224
testb %al, %al
movq %rsi, 40(%rsp)
movq %rdx, 48(%rsp)
movq %rcx, 56(%rsp)
movq %r8, 64(%rsp)
movq %r9, 72(%rsp)
je .L2
movaps %xmm0, 80(%rsp)
movaps %xmm1, 96(%rsp)
movaps %xmm2, 112(%rsp)
movaps %xmm3, 128(%rsp)
movaps %xmm4, 144(%rsp)
movaps %xmm5, 160(%rsp)
movaps %xmm6, 176(%rsp)
movaps %xmm7, 192(%rsp)
.L2:
leaq 224(%rsp), %rax
leaq 8(%rsp), %rdx
movl $.LC0, %esi
movq stdout(%rip), %rdi
movq %rax, 16(%rsp)
leaq 32(%rsp), %rax
movl $8, 8(%rsp)
movl $48, 12(%rsp)
movq %rax, 24(%rsp)
call vfprintf
addq $216, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
In clang for x86, it is much simpler:
test: # #test
subl $28, %esp
leal 36(%esp), %eax
movl %eax, 24(%esp)
movl stdout, %ecx
movl %eax, 8(%esp)
movl %ecx, (%esp)
movl $.L.str, 4(%esp)
calll vfprintf
addl $28, %esp
retl
There's nothing really stopping any of the above code from being inlined as such, so it would appear that it is simply a policy decision on the compiler writer. Of course, for a call to something like printf, it's pretty meaningless to optimise away a call/return pair for the cost of the code expansion - after all, printf is NOT a small short function.
(A decent part of my work for most of the past year has been to implement printf in an OpenCL environment, so I know far more than most people will ever even look up about format specifiers and various other tricky parts of printf)
Edit: The OpenCL compiler we use WILL inline calls to var_args functions, so it is possible to implement such a thing. It won't do it for calls to printf, because it bloats the code very much, but by default, our compiler inlines EVERYTHING, all the time, no matter what it is... And it does work, but we found that having 2-3 copies of printf in the code makes it REALLY huge (with all sorts of other drawbacks, including final code generation taking a lot longer due to some bad choices of algorithms in the compiler backend), so we had to add code to STOP the compiler doing that...

The variable arguments implementation generally have the following algorithm: Take the first address from the stack which is after the format string, and while parsing the input format string use the value at the given position as the required datatype. Now increment the stack parsing pointer with the size of the required datatype, advance in the format string and use the value at the new position as the required datatype ... and so on.
Some values automatically get converted (ie: promoted) to "larger" types (and this is more or less implementation dependant) such as char or short gets promoted to int and float to double.
Certainly, you do not need a format string, but in this case you need to know the type of the arguments passed in (such as: all ints, or all doubles, or the first 3 ints, then 3 more doubles ..).
So this is the short theory.
Now, to the practice, as the comment from n.m. above shows, gcc does not inline functions which have variable argument handling. Possibly there are pretty complex operations going on while handling the variable arguments which would increase the size of the code to an un-optimal size so it is simply not worth inlining these functions.
EDIT:
After doing a quick test with VS2012 I don't seem to be able to convince the compiler to inline the function with the variable arguments.
Regardless of the combination of flags in the "Optimization" tab of the project there is always a call totest and there is always a test method. And indeed:
http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx
says that
Even with __forceinline, the compiler cannot inline code in all circumstances. The compiler cannot inline a function if:
...
The function has a variable argument list.

The point of inlining is that it reduces function call overhead.
But for varargs, there is very little to be gained in general.
Consider this code in the body of that function:
if (blah)
{
printf("%d", va_arg(vl, int));
}
else
{
printf("%s", va_arg(vl, char *));
}
How is the compiler supposed to inline it? Doing that requires the compiler to push everything on the stack in the correct order anyway, even though there isn't any function being called. The only thing that's optimized away is a call/ret instruction pair (and maybe pushing/popping ebp and whatnot). The memory manipulations cannot be optimized away, and the parameters cannot be passed in registers. So it's unlikely that you'll gain anything notable by inlining varargs.

I do not expect that it would ever be possible to inline a varargs function, except in the most trivial case.
A varargs function that had no arguments, or that did not access any of its arguments, or that accessed only the fixed arguments preceding the variable ones could be inlined by rewriting it as an equivalent function that did not use varargs. This is the trivial case.
A varargs function that accesses its variadic arguments does so by executing code generated by the va_start and va_arg macros, which rely on the arguments being laid out in memory in some way. A compiler that performed inlining simply to remove the overhead of a function call would still need to create the data structure to support those macros. A compiler that attempted to remove all the machinery of function call would have to analyse and optimise away those macros as well. And it would still fail if the variadic function made a call to another function passing va_list as an argument.
I do not see a feasible path for this second case.

Why does tree vectorization make this sorting algorithm 2x slower?

The sorting algorithm of this question becomes twice faster(!) if -fprofile-arcs is enabled in gcc (4.7.2). The heavily simplified C code of that question (it turned out that I can initialize the array with all zeros, the weird performance behavior remains but it makes the reasoning much much simpler):
#include <time.h>
#include <stdio.h>
#define ELEMENTS 100000
int main() {
int a[ELEMENTS] = { 0 };
clock_t start = clock();
for (int i = 0; i < ELEMENTS; ++i) {
int lowerElementIndex = i;
for (int j = i+1; j < ELEMENTS; ++j) {
if (a[j] < a[lowerElementIndex]) {
lowerElementIndex = j;
}
}
int tmp = a[i];
a[i] = a[lowerElementIndex];
a[lowerElementIndex] = tmp;
}
clock_t end = clock();
float timeExec = (float)(end - start) / CLOCKS_PER_SEC;
printf("Time: %2.3f\n", timeExec);
printf("ignore this line %d\n", a[ELEMENTS-1]);
}
After playing with the optimization flags for a long while, it turned out that -ftree-vectorize also yields this weird behavior so we can take -fprofile-arcs out of the question. After profiling with perf I have found that the only relevant difference is:
Fast case gcc -std=c99 -O2 simp.c (runs in 3.1s)
cmpl %esi, %ecx
jge .L3
movl %ecx, %esi
movslq %edx, %rdi
.L3:
Slow case gcc -std=c99 -O2 -ftree-vectorize simp.c (runs in 6.1s)
cmpl %ecx, %esi
cmovl %edx, %edi
cmovl %esi, %ecx
As for the first snippet: Given that the array only contains zeros, we always jump to .L3. It can greatly benefit from branch prediction.
I guess the cmovl instructions cannot benefit from branch prediction.
Questions:
Are all my above guesses correct? Does this make the algorithm slow?
If yes, how can I prevent gcc from emitting this instruction (other than the trivial -fno-tree-vectorization workaround of course) but still doing as much optimizations as possible?
What is this -ftree-vectorization? The documentation is quite
vague, I would need a little more explanation to understand what's happening.
Update: Since it came up in comments: The weird performance behavior w.r.t. the -ftree-vectorize flag remains with random data. As Yakk points out, for selection sort, it is actually hard to create a dataset that would result in a lot of branch mispredictions.
Since it also came up: I have a Core i5 CPU.
Based on Yakk's comment, I created a test. The code below (online without boost) is of course no longer a sorting algorithm; I only took out the inner loop. Its only goal is to examine the effect of branch prediction: We skip the if branch in the for loop with probability p.
#include <algorithm>
#include <cstdio>
#include <random>
#include <boost/chrono.hpp>
using namespace std;
using namespace boost::chrono;
constexpr int ELEMENTS=1e+8;
constexpr double p = 0.50;
int main() {
printf("p = %.2f\n", p);
int* a = new int[ELEMENTS];
mt19937 mt(1759);
bernoulli_distribution rnd(p);
for (int i = 0 ; i < ELEMENTS; ++i){
a[i] = rnd(mt)? i : -i;
}
auto start = high_resolution_clock::now();
int lowerElementIndex = 0;
for (int i=0; i<ELEMENTS; ++i) {
if (a[i] < a[lowerElementIndex]) {
lowerElementIndex = i;
}
}
auto finish = high_resolution_clock::now();
printf("%ld ms\n", duration_cast<milliseconds>(finish-start).count());
printf("Ignore this line %d\n", a[lowerElementIndex]);
delete[] a;
}
The loops of interest:
This will be referred to as cmov
g++ -std=c++11 -O2 -lboost_chrono -lboost_system -lrt branch3.cpp
xorl %eax, %eax
.L30:
movl (%rbx,%rbp,4), %edx
cmpl %edx, (%rbx,%rax,4)
movslq %eax, %rdx
cmovl %rdx, %rbp
addq $1, %rax
cmpq $100000000, %rax
jne .L30
This will be referred to as no cmov, the -fno-if-conversion flag was pointed out by Turix in his answer.
g++ -std=c++11 -O2 -fno-if-conversion -lboost_chrono -lboost_system -lrt branch3.cpp
xorl %eax, %eax
.L29:
movl (%rbx,%rbp,4), %edx
cmpl %edx, (%rbx,%rax,4)
jge .L28
movslq %eax, %rbp
.L28:
addq $1, %rax
cmpq $100000000, %rax
jne .L29
The difference side by side
cmpl %edx, (%rbx,%rax,4) | cmpl %edx, (%rbx,%rax,4)
movslq %eax, %rdx | jge .L28
cmovl %rdx, %rbp | movslq %eax, %rbp
| .L28:
The execution time as a function of the Bernoulli parameter p
The code with the cmov instruction is absolutely insensitive to p. The code without the cmov instruction is the winner if p<0.26 or 0.81<p and is at most 4.38x faster (p=1). Of course, the worse situation for the branch predictor is at around p=0.5 where the code is 1.58x slower than the code with the cmov instruction.

Note: Answered before graph update was added to the question; some assembly code references here may be obsolete.
(Adapted and extended from our above chat, which was stimulating enough to cause me to do a bit more research.)
First (as per our above chat), it appears that the answer to your first question is "yes". In the vector "optimized" code, the optimization (negatively) affecting performance is branch predication, whereas in the original code the performance is (positively) affected by branch prediction. (Note the extra 'a' in the former.)
Re your 3rd question: Even though in your case, there is actually no vectorization being done, from step 11 ("Conditional Execution") here it appears that one of the steps associated with vectorization optimizations is to "flatten" conditionals within targeted loops, like this bit in your loop:
if (a[j] < a[lowerElementIndex]
lowerElementIndex = j;
Apparently, this happens even if there is no vectorization.
This explains why the compiler is using the conditional move instructions (cmovl). The goal there is to avoid a branch entirely (as opposed to trying to predict it correctly). Instead, the two cmovl instructions will be sent down the pipeline before the result of the previous cmpl is known and the comparison result will then be "forwarded" to enable/prevent the moves prior to their writeback (i.e., prior to them actually taking effect).
Note that if the loop had been vectorized, this might have been worth it to get to the point where multiple iterations through the loop could effectively be accomplished in parallel.
However, in your case, the attempt at optimization actually backfires because in the flattened loop, the two conditional moves are sent through the pipeline every single time through the loop. This in itself might not be so bad either, except that there is a RAW data hazard that causes the second move (cmovl %esi, %ecx) to have to wait until the array/memory access (movl (%rsp,%rsi,4), %esi) is completed, even if the result is going to be ultimately ignored. Hence the huge time spent on that particular cmovl. (I would expect this is an issue with your processor not having complex enough logic built into its predication/forwarding implementation to deal with the hazard.)
On the other hand, in the non-optimized case, as you rightly figured out, branch prediction can help to avoid having to wait on the result of the corresponding array/memory access there (the movl (%rsp,%rcx,4), %ecx instruction). In that case, when the processor correctly predicts a taken branch (which for an all-0 array will be every single time, but [even] in a random array should [still] be roughlymore than [edited per #Yakk's comment] half the time), it does not have to wait for the memory access to finish to go ahead and queue up the next few instructions in the loop. So in correct predictions, you get a boost, whereas in incorrect predictions, the result is no worse than in the "optimized" case and, furthermore, better because of the ability to sometimes avoid having the 2 "wasted" cmovl instructions in the pipeline.
[The following was removed due to my mistaken assumption about your processor per your comment.]
Back to your questions, I would suggest looking at that link above for more on the flags relevant to vectorization, but in the end, I'm pretty sure that it's fine to ignore that optimization given that your Celeron isn't capable of using it (in this context) anyway.
[Added after above was removed]
Re your second question ("...how can I prevent gcc from emitting this instruction..."), you could try the -fno-if-conversion and -fno-if-conversion2 flags (not sure if these always work -- they no longer work on my mac), although I do not think your problem is with the cmovl instruction in general (i.e., I wouldn't always use those flags), just with its use in this particular context (where branch prediction is going to be very helpful given #Yakk's point about your sort algorithm).

What does the compiler do in assembly when optimizing code? ie -O2 flag

So when you add an optimization flag when compiling your C++, it runs faster, but how does this work? Could someone explain what really goes on in the assembly?

It means you're making the compiler do extra work / analysis at compile time, so you can reap the rewards of a few extra precious cpu cycles at runtime. Might be best to explain with an example.
Consider a loop like this:
const int n = 5;
for (int i = 0; i < n; ++i)
cout << "bleh" << endl;
If you compile this without optimizations, the compiler will not do any extra work for you -- assembly generated for this code snippet will likely be a literal translation into compare and jump instructions. (which isn't the fastest, just the most straightforward)
However, if you compile WITH optimizations, the compiler can easily inline this loop since it knows the upper bound can't ever change because n is const. (i.e. it can copy the repeated code 5 times directly instead of comparing / checking for the terminating loop condition).
Here's another example with an optimized function call. Below is my whole program:
#include <stdio.h>
static int foo(int a, int b) {
return a * b;
}
int main(int argc, char** argv) {
fprintf(stderr, "%d\n", foo(10, 15));
return 0;
}
If i compile this code without optimizations using gcc foo.c on my x86 machine, my assembly looks like this:
movq %rsi, %rax
movl %edi, -4(%rbp)
movq %rax, -16(%rbp)
movl $10, %eax ; these are my parameters to
movl $15, %ecx ; the foo function
movl %eax, %edi
movl %ecx, %esi
callq _foo
; .. about 20 other instructions ..
callq _fprintf
Here, it's not optimizing anything. It's loading the registers with my constant values and calling my foo function. But look if i recompile with the -O2 flag:
movq (%rax), %rdi
leaq L_.str(%rip), %rsi
movl $150, %edx
xorb %al, %al
callq _fprintf
The compiler is so smart that it doesn't even call foo anymore. It just inlines it's return value.

Most of the optimization happens in the compiler's intermediate representation before the assembly is generated. You should definitely check out Agner Fog's Software optimization resources. Chapter 8 of the 1st manual describes optimizations performed by the compiler with examples.

Which is faster : if (bool) or if(int)?

Which value is better to use? Boolean true or Integer 1?
The above topic made me do some experiments with bool and int in if condition. So just out of curiosity I wrote this program:
int f(int i)
{
if ( i ) return 99; //if(int)
else return -99;
}
int g(bool b)
{
if ( b ) return 99; //if(bool)
else return -99;
}
int main(){}
g++ intbool.cpp -S generates asm code for each functions as follows:
asm code for f(int)
__Z1fi:
LFB0:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
cmpl $0, 8(%ebp)
je L2
movl $99, %eax
jmp L3
L2:
movl $-99, %eax
L3:
leave
LCFI2:
ret
asm code for g(bool)
__Z1gb:
LFB1:
pushl %ebp
LCFI3:
movl %esp, %ebp
LCFI4:
subl $4, %esp
LCFI5:
movl 8(%ebp), %eax
movb %al, -4(%ebp)
cmpb $0, -4(%ebp)
je L5
movl $99, %eax
jmp L6
L5:
movl $-99, %eax
L6:
leave
LCFI6:
ret
Surprisingly, g(bool) generates more asm instructions! Does it mean that if(bool) is little slower than if(int)? I used to think bool is especially designed to be used in conditional statement such as if, so I was expecting g(bool) to generate less asm instructions, thereby making g(bool) more efficient and fast.
EDIT:
I'm not using any optimization flag as of now. But even absence of it, why does it generate more asm for g(bool) is a question for which I'm looking for a reasonable answer. I should also tell you that -O2 optimization flag generates exactly same asm. But that isn't the question. The question is what I've asked.

Makes sense to me. Your compiler apparently defines a bool as an 8-bit value, and your system ABI requires it to "promote" small (< 32-bit) integer arguments to 32-bit when pushing them onto the call stack. So to compare a bool, the compiler generates code to isolate the least significant byte of the 32-bit argument that g receives, and compares it with cmpb. In the first example, the int argument uses the full 32 bits that were pushed onto the stack, so it simply compares against the whole thing with cmpl.

Compiling with -03 gives the following for me:
f:
pushl %ebp
movl %esp, %ebp
cmpl $1, 8(%ebp)
popl %ebp
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
g:
pushl %ebp
movl %esp, %ebp
cmpb $1, 8(%ebp)
popl %ebp
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
.. so it compiles to essentially the same code, except for cmpl vs cmpb.
This means that the difference, if there is any, doesn't matter. Judging by unoptimized code is not fair.
Edit to clarify my point. Unoptimized code is for simple debugging, not for speed. Comparing the speed of unoptimized code is senseless.

When I compile this with a sane set of options (specifically -O3), here's what I get:
For f():
.type _Z1fi, #function
_Z1fi:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
cmpl $1, %edi
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
.cfi_endproc
For g():
.type _Z1gb, #function
_Z1gb:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
cmpb $1, %dil
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
.cfi_endproc
They still use different instructions for the comparison (cmpb for boolean vs. cmpl for int), but otherwise the bodies are identical. A quick look at the Intel manuals tells me: ... not much of anything. There's no such thing as cmpb or cmpl in the Intel manuals. They're all cmp and I can't find the timing tables at the moment. I'm guessing, however, that there's no clock difference between comparing a byte immediate vs. comparing a long immediate, so for all practical purposes the code is identical.
edited to add the following based on your addition
The reason the code is different in the unoptimized case is that it is unoptimized. (Yes, it's circular, I know.) When the compiler walks the AST and generates code directly, it doesn't "know" anything except what's at the immediate point of the AST it's in. At that point it lacks all contextual information needed to know that at this specific point it can treat the declared type bool as an int. A boolean is obviously by default treated as a byte and when manipulating bytes in the Intel world you have to do things like sign-extend to bring it to certain widths to put it on the stack, etc. (You can't push a byte.)
When the optimizer views the AST and does its magic, however, it looks at surrounding context and "knows" when it can replace code with something more efficient without changing semantics. So it "knows" it can use an integer in the parameter and thereby lose the unnecessary conversions and widening.

With GCC 4.5 on Linux and Windows at least, sizeof(bool) == 1. On x86 and x86_64, you can't pass in less than an general purpose register's worth to a function (whether via the stack or a register depending on the calling convention etc...).
So the code for bool, when un-optimized, actually goes to some length to extract that bool value from the argument stack (using another stack slot to save that byte). It's more complicated than just pulling a native register-sized variable.

Yeah, the discussion's fun. But just test it:
Test code:
#include <stdio.h>
#include <string.h>
int testi(int);
int testb(bool);
int main (int argc, char* argv[]){
bool valb;
int vali;
int loops;
if( argc < 2 ){
return 2;
}
valb = (0 != (strcmp(argv[1], "0")));
vali = strcmp(argv[1], "0");
printf("Arg1: %s\n", argv[1]);
printf("BArg1: %i\n", valb ? 1 : 0);
printf("IArg1: %i\n", vali);
for(loops=30000000; loops>0; loops--){
//printf("%i: %i\n", loops, testb(valb=!valb));
printf("%i: %i\n", loops, testi(vali=!vali));
}
return valb;
}
int testi(int val){
if( val ){
return 1;
}
return 0;
}
int testb(bool val){
if( val ){
return 1;
}
return 0;
}
Compiled on a 64-bit Ubuntu 10.10 laptop with:
g++ -O3 -o /tmp/test_i /tmp/test_i.cpp
Integer-based comparison:
sauer#trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.203s
user 0m8.170s
sys 0m0.010s
sauer#trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.056s
user 0m8.020s
sys 0m0.000s
sauer#trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.116s
user 0m8.100s
sys 0m0.000s
Boolean test / print uncommented (and integer commented):
sauer#trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.254s
user 0m8.240s
sys 0m0.000s
sauer#trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.028s
user 0m8.000s
sys 0m0.010s
sauer#trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m7.981s
user 0m7.900s
sys 0m0.050s
They're the same with 1 assignment and 2 comparisons each loop over 30 million loops. Find something else to optimize. For example, don't use strcmp unnecessarily. ;)

At the machine level there is no such thing as bool
Very few instruction set architectures define any sort of boolean operand type, although there are often instructions that trigger an action on non-zero values. To the CPU, usually, everything is one of the scalar types or a string of them.
A given compiler and a given ABI will need to choose specific sizes for int and bool and when, like in your case, these are different sizes they may generate slightly different code, and at some levels of optimization one may be slightly faster.
Why is bool one byte on many systems?
It's safer to choose a char type for bool because someone might make a really large array of them.
Update: by "safer", I mean: for the compiler and library implementors. I'm not saying people need to reimplement the system type.

It will mostly depend on the compiler and the optimization. There's an interesting discussion (language agnostic) here:
Does "if ([bool] == true)" require one more step than "if ([bool])"?
Also, take a look at this post: http://www.linuxquestions.org/questions/programming-9/c-compiler-handling-of-boolean-variables-290996/

Approaching your question in two different ways:
If you are specifically talking about C++ or any programming language that will produce assembly code for that matter, we are bound to what code the compiler will generate in ASM. We are also bound to the representation of true and false in c++. An integer will have to be stored in 32 bits, and I could simply use a byte to store the boolean expression. Asm snippets for conditional statements:
For the integer:
mov eax,dword ptr[esp] ;Store integer
cmp eax,0 ;Compare to 0
je false ;If int is 0, its false
;Do what has to be done when true
false:
;Do what has to be done when false
For the bool:
mov al,1 ;Anything that is not 0 is true
test al,1 ;See if first bit is fliped
jz false ;Not fliped, so it's false
;Do what has to be done when true
false:
;Do what has to be done when false
So, that's why the speed comparison is so compile dependent. In the case above, the bool would be slightly fast since cmp would imply a subtraction for setting the flags. It also contradicts with what your compiler generated.
Another approach, a much simpler one, is to look at the logic of the expression on it's own and try not to worry about how the compiler will translate your code, and I think this is a much healthier way of thinking. I still believe, ultimately, that the code being generated by the compiler is actually trying to give a truthful resolution. What I mean is that, maybe if you increase the test cases in the if statement and stick with boolean in one side and integer in another, the compiler will make it so the code generated will execute faster with boolean expressions in the machine level.
I'm considering this is a conceptual question, so I'll give a conceptual answer. This discussion reminds me of discussions I commonly have about whether or not code efficiency translates to less lines of code in assembly. It seems that this concept is generally accepted as being true. Considering that keeping track of how fast the ALU will handle each statement is not viable, the second option would be to focus on jumps and compares in assembly. When that is the case, the distinction between boolean statements or integers in the code you presented becomes rather representative. The result of an expression in C++ will return a value that will then be given a representation. In assembly, on the other hand, the jumps and comparisons will be based in numeric values regardless of what type of expression was being evaluated back at you C++ if statement. It is important on these questions to remember that purely logicical statements such as these end up with a huge computational overhead, even though a single bit would be capable of the same thing.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js