Is extra time required for float vs int comparisons? - c++

Suppose you have a floating-point number double num_float = 5.0; and the following two conditionals:
if (num_float > 3)
{
    //...
}
if (num_float > 3.0)
{
    //...
}
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
Obviously I'm assuming the time delay would be negligible, but compounded in a while(1) loop, I suppose a decent chunk of time could be lost over the long run (if it really is slower).

Because of the "as-if" rule, the compiler is allowed to do the conversion of the literal to a floating point value at compile time. A good compiler will do so if that results in better code.
In order to answer your question definitively for your compiler and your target platform(s), you'd need to check what the compiler emits, and how it performs. However, I'd be surprised if any mainstream compiler did not turn either of the two if statements into the most efficient code possible.

If the value is a constant, then there shouldn't be any difference, since the compiler will convert the constant to float as part of the compilation [unless the compiler decides to use a "compare float with integer" instruction].
If the value is an integer VARIABLE, then there will be an extra instruction to convert the integer value to a floating point [again, unless the compiler can use a "compare float with integer" instruction].
How much, if any, time that adds to the whole process depends HIGHLY on what processor, how the floating point instructions work, etc, etc.
As with anything where performance really matters, measure the alternatives. Preferably on more than one type of hardware (e.g. both AMD and Intel processors if it's a PC), and then decide which is the better choice. Otherwise, you may find yourself tuning the code to work well on YOUR hardware, but worse on some other hardware. Which isn't a good optimisation - unless the ONLY machine you ever run on is your own.
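If you do want to measure it yourself, a minimal timing sketch along these lines would do (not a rigorous benchmark; the volatile qualifiers are only there to stop the compiler from hoisting or deleting the comparisons, and the iteration count is arbitrary):
#include <chrono>
#include <cstdio>

int main() {
    volatile double num_float = 5.0;  // volatile: reload and compare on every iteration
    volatile bool sink = false;       // volatile: the result must actually be stored
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000L; ++i) sink = num_float > 3;    // int literal
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000L; ++i) sink = num_float > 3.0;  // double literal
    auto t2 = std::chrono::steady_clock::now();
    using ms = std::chrono::milliseconds;
    std::printf("int literal:    %lld ms\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
    std::printf("double literal: %lld ms\n",
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}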

Note: This will need to be repeated with your target hardware. The code below just demonstrates nicely what has been said.
with constants:
bool with_int(const double num_float) {
    return num_float > 3;
}
bool with_float(const double num_float) {
    return num_float > 3.0;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
with_float(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
.LC0:
.long 0
.long 1074266112
clang 3.0 (-O3 -march=native):
.LCPI0_0:
.quad 4613937818241073152 # double 3.000000e+00
with_int(double): # @with_int(double)
ucomisd .LCPI0_0(%rip), %xmm0
seta %al
ret
.LCPI1_0:
.quad 4613937818241073152 # double 3.000000e+00
with_float(double): # @with_float(double)
ucomisd .LCPI1_0(%rip), %xmm0
seta %al
ret
Conclusion: No difference if comparing against constants.
with variables:
bool with_int(const double a, const int b) {
    return a > b;
}
bool with_float(const double a, const float b) {
    return a > b;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double, int):
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float):
unpcklps %xmm1, %xmm1
cvtps2pd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
clang 3.0 (-O3 -march=native):
with_int(double, int): # @with_int(double, int)
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float): # @with_float(double, float)
cvtss2sd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
Conclusion: the emitted instructions differ when comparing against variables, as Mats Petersson's answer already explained.

Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
A) Just specify it as an integer. Some chips have a special instruction for comparing against an integer at runtime, but that is not important, since the compiler will choose whatever is best. In some cases it might convert the literal to 3.0 at compile time, depending on the target architecture; in other cases it will leave it as an int. But since you mean 3 specifically, just write 3.
Obviously I'm assuming the time delay would be negligible at best, but compounded in a while(1) loop I suppose over the long run a decent chunk of time could be lost (if it really is slower).
A) The compiler will not do anything odd with such code; it will choose whatever is best, so there should be no time delay. With numeric constants the compiler is free to do whatever it likes, as long as it produces the same result. However, you would not want to compound this kind of comparison in a while loop at all. Rather, use an integer loop counter: a floating point loop counter is going to be much slower. If you have to use floats as a loop counter, prefer a single-precision 32-bit data type and compare as little as possible.
For example, you could break the problem down into multiple loops.
int x = 0;
float y = 0.0f;
float finc = 0.1f;
int total = 1000;
int num_times = total / finc; // how many increments of finc fit into total
num_times -= 2;               // safety margin: stop the fast loop early
// Run the bulk of the loop in a safe zone using integer compares
while (x < num_times) {
    // Do stuff
    y += finc;
    x++;
}
// Now complete the loop using float compares
while (y < total) {
    y += finc;
}
And that can noticeably improve the speed of the comparisons.

Related

Nasm, C++, passing class object

I have some class in my cpp file.
class F {
private:
    int id;
    float o;
    float p;
    float s;
    static int next;
public:
    F(double o, double s = 0.23, double p = 0.0):
        id(next++), o(o),
        p(p), s(s) {}
};
int F::next = 0;
extern "C" float pod(F f);
int main() {
    F bur(1000, 0.23, 100);
    pod(bur);
    return 0;
}
and I'm trying to pass the class object bur to the function pod, which is defined in my asm file. However, I'm having big problems getting values out of this class object.
In the asm program I have 0.23 in XMM1 and 100 in XMM2, but I can't find where 1000 is stored.
I don't know why you are seeing 100 in xmm2; I suspect that is entirely coincidence. The easiest way to see how your struct is being passed is to compile the C++ code.
With cruft removed, my compiler does this:
main:
.LFB3:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl _ZN1F4nextE(%rip), %edi # load F::next into edi/rdi
movq .LC3(%rip), %xmm0 # load { 0.23, 100 } into xmm0
leal 1(%rdi), %eax # store rdi + 1 into eax
movl %eax, _ZN1F4nextE(%rip) # store eax back into F::next
movabsq $4934256341737799680, %rax # load { 1000.0, 0 } into rax
orq %rax, %rdi # or rax into pre-increment F::next in rdi
call pod
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.LC3:
.quad 4497835022170456064
The constant 4497835022170456064 is 3E6B851F42C80000 in hex, and if you look at the most significant four bytes (3E6B851F), this is 0.23 when interpreted as a single precision float, and the least significant four bytes (42C80000) are 100.0.
Similarly the most significant four bytes of the constant 4934256341737799680 (hex 447A000000000000) are 1000.0.
So, bur.id and bur.o are passed in rdi and bur.p and bur.s are passed in xmm0.
The reason for this is documented in the x86-64 ABI reference. In extreme summary: because the first two fields are small enough, of mixed type, and one of them is integer, they are passed in a general purpose register (rdi being the first general purpose parameter register); because the next two fields are both float, they are passed in an SSE register.
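You can check the bit-pattern claims for yourself with a few lines of C++ (a standalone sketch, independent of the original code):
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    std::uint32_t hi = 0x3E6B851F, lo = 0x42C80000;  // the two halves of .LC3
    float a, b;
    std::memcpy(&a, &hi, sizeof a);  // reinterpret the bits as floats
    std::memcpy(&b, &lo, sizeof b);
    std::printf("%f %f\n", a, b);    // prints 0.230000 100.000000
}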
You want to have a look at the calling-convention manual from Agner Fog. Depending on the compiler, the operating system, and whether you are in 32 or 64 bits, different things may happen (see Table 5, chapter 7).
For 64-bit Linux, for instance, since your object contains values of different types (see Table 6), the R case seems to apply:
Entire object is transferred in integer registers and/or XMM registers if the size is no bigger than 128 bits, otherwise on the stack. Each 64-bit part of the object is transferred in an XMM register if it contains only float or double, or in an integer register if it contains integer types or mixed integer and float. Two consecutive floats can be packed into the lower half of one XMM register.
In your case, the class fits in 128 bits. The experiment from @CharlesBailey illustrates this behavior. According to the convention
... or in an integer register if it contains integer types or mixed integer and float. Two consecutive floats can be packed into the lower half of one XMM register. Examples: int and float: RDI.
the first integer register rdi should hold id and o, while xmm0 should hold p and s.
Seeing 100 in xmm2 might be a side effect of initialization, as the value is passed as a double to the struct constructor.

Performance difference between member function and global function in release version

I have implemented two functions to perform the outer product of two Vectors (not std::vector); one is a member function and the other is a global one. Here is the key code (additional parts are omitted):
//for member function
template <typename Scalar>
SquareMatrix<Scalar,3> Vector<Scalar,3>::outerProduct(const Vector<Scalar,3> &vec3) const
{
    SquareMatrix<Scalar,3> result;
    for(unsigned int i = 0; i < 3; ++i)
        for(unsigned int j = 0; j < 3; ++j)
            result(i,j) = (*this)[i]*vec3[j];
    return result;
}
//for global function: Dim = 3
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m)
{
    for (unsigned int i=0; i<Dim; i++)
        for (unsigned int j=0; j<Dim; j++)
        {
            m(i,j) = v1[i]*v2[j];
        }
}
They are almost the same, except that one is a member function with a return value, while the other is a global function where the computed values are assigned directly to a square matrix, thus requiring no return value.
Actually, I meant to replace the member one with the global one to improve performance, since the first one involves copy operations. The strange thing, however, is that the global function costs almost twice as much time as the member one. Furthermore, I find that the execution of
m(i,j) = v1[i]*v2[j]; // in global function
requires much more time than that of
result(i,j) = (*this)[i]*vec3[j]; // in member function
So the question is, how does this performance difference between member and global function arise?
Can anyone tell me the reasons?
I hope I have presented my question clearly, and sorry for my poor English!
//----------------------------------------------------------------------------------------
More information added:
The following is the code I use to test the performance:
//the code below is run in a loop
Vector<double, 3> vec1;
Vector<double, 3> vec2;
Timer timer;
timer.startTimer();
for (unsigned int i=0; i<100000; i++)
{
    SquareMatrix<double,3> m = vec1.outerProduct(vec2);
}
timer.stopTimer();
std::cout<<"time cost for member function: "<< timer.getElapsedTime()<<std::endl;
timer.startTimer();
SquareMatrix<double,3> m;
for (unsigned int i=0; i<100000; i++)
{
    outerProduct(vec1, vec2, m);
}
timer.stopTimer();
std::cout<<"time cost for global function: "<< timer.getElapsedTime()<<std::endl;
std::system("pause");
And the result captured shows that the member function is almost twice as fast as the global one.
Additionally, my project is built on a 64-bit Windows system, and the code is in fact used to generate static lib files with the Scons construction tool, along with generated VS2010 project files.
I have to point out that the strange performance difference only occurs in a release build; in a debug build, the global function is almost five times faster than the member one (about 0.10s vs 0.02s).
One possible explanation:
With inlining, in the first case the compiler may know that result(i, j) (a local variable) doesn't alias (*this)[i] or vec3[j], and so neither the Scalar array of this nor that of vec3 is modified.
In the second case, from the function's point of view, the variables may alias: each write into m might modify the Scalars of v1 or v2, so neither v1[i] nor v2[j] can be cached in a register.
You may try the restrict keyword extension to check if my hypothesis is correct.
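A sketch of what that test could look like, using the __restrict extension accepted by GCC, Clang, and MSVC (an extension, not standard C++, so the exact effect is compiler-dependent):
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim>& __restrict v1,
                  const Vector<Scalar, Dim>& __restrict v2,
                  SquareMatrix<Scalar, Dim>& __restrict m)
{
    // Promising the compiler that m aliases neither v1 nor v2 lets it
    // keep v1[i] and v2[j] in registers across the writes to m.
    for (unsigned int i = 0; i < Dim; i++)
        for (unsigned int j = 0; j < Dim; j++)
            m(i, j) = v1[i] * v2[j];
}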
EDIT: loop elision in the original assembly has been corrected
[paraphrased] Why is the performance different between the member function and the static function?
I'll start with the simplest things mentioned in your question, and progress to the more nuanced points of performance testing / analysis.
It is a bad idea to measure performance of debug builds. Compilers take liberties in many places, such as zeroing arrays that are uninitialized, generating extra code that isn't strictly necessary, and (obviously) not performing any optimization past the trivial ones such as constant propagation. This leads to the next point...
Always look at the assembly. C and C++ are high level languages when it comes to the subtleties of performance. Many people even consider x86 assembly a high level language since each instruction is decomposed into possibly several micro-ops during decoding. You cannot tell what the computer is doing just by looking at C++ code. For example, depending on how you implemented SquareMatrix, the compiler may or may not be able to perform copy elision during optimization.
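One convenient way to do this with gcc, as a minimal example (both are standard gcc options):
g++ -O2 -S -o - yourfile.cpp
-S stops after compilation proper and emits assembly rather than an object file, and -o - sends it to stdout so you can page or grep it.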
Entering the somewhat more nuanced topics when testing for performance...
Make sure the compiler is actually generating loops. Using your example test code, g++ 4.7.2 doesn't actually generate loops with my implementation of SquareMatrix and Vector. I implemented them to initialize all components to 0.0, so the compiler can statically determine that the values never change, and so only generates a single set of mov instructions instead of a loop. In my example code, I use COMPILER_NOP which (with gcc) is __asm__ __volatile__("":::) inside the loop to prevent this (as compilers cannot predict side-effects from manual assembly, and so cannot elide the loop). Edit: I DO use COMPILER_NOP but since the output values from the functions are never used, the compiler is still able to remove the bulk of the work from the loop, and reduce the loop to this:
.L7:
subl $1, %eax
jne .L7
I have corrected this by performing additional operations inside the loop. The loop now assigns a value from the output to the inputs, preventing this optimization and forcing the loop to cover what was originally intended.
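Roughly, the corrected loop looks like this (a sketch; the exact element-access syntax of my Vector/SquareMatrix is immaterial, the feedback assignment is the important part):
for (unsigned int i = 0; i < 100000; ++i) {
    SquareMatrix<double, 3> m = vec1.outerProduct(vec2);
    vec1[0] = m(0, 0);  // feed the output back into the input so the
                        // compiler cannot hoist the work out of the loop
    COMPILER_NOP;       // __asm__ __volatile__("":::) -- opaque to the optimizer
}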
To (finally) get around to answering your question: When I implemented the rest of what is needed to get your code to run, and verified by checking the assembly that loops are actually generated, the two functions execute in the same amount of time. They even have nearly identical implementations in assembly.
Here's the assembly for the member function:
movsd 32(%rsp), %xmm7
movl $100000, %eax
movsd 24(%rsp), %xmm5
movsd 8(%rsp), %xmm6
movapd %xmm7, %xmm12
movsd (%rsp), %xmm4
movapd %xmm7, %xmm11
movapd %xmm5, %xmm10
movapd %xmm5, %xmm9
mulsd %xmm6, %xmm12
mulsd %xmm4, %xmm11
mulsd %xmm6, %xmm10
mulsd %xmm4, %xmm9
movsd 40(%rsp), %xmm1
movsd 16(%rsp), %xmm0
jmp .L7
.p2align 4,,10
.p2align 3
.L12:
movapd %xmm3, %xmm1
movapd %xmm2, %xmm0
.L7:
movapd %xmm0, %xmm8
movapd %xmm1, %xmm3
movapd %xmm1, %xmm2
mulsd %xmm1, %xmm8
movapd %xmm0, %xmm1
mulsd %xmm6, %xmm3
mulsd %xmm4, %xmm2
mulsd %xmm7, %xmm1
mulsd %xmm5, %xmm0
subl $1, %eax
jne .L12
and the assembly for the static function:
movsd 32(%rsp), %xmm7
movl $100000, %eax
movsd 24(%rsp), %xmm5
movsd 8(%rsp), %xmm6
movapd %xmm7, %xmm12
movsd (%rsp), %xmm4
movapd %xmm7, %xmm11
movapd %xmm5, %xmm10
movapd %xmm5, %xmm9
mulsd %xmm6, %xmm12
mulsd %xmm4, %xmm11
mulsd %xmm6, %xmm10
mulsd %xmm4, %xmm9
movsd 40(%rsp), %xmm1
movsd 16(%rsp), %xmm0
jmp .L9
.p2align 4,,10
.p2align 3
.L13:
movapd %xmm3, %xmm1
movapd %xmm2, %xmm0
.L9:
movapd %xmm0, %xmm8
movapd %xmm1, %xmm3
movapd %xmm1, %xmm2
mulsd %xmm1, %xmm8
movapd %xmm0, %xmm1
mulsd %xmm6, %xmm3
mulsd %xmm4, %xmm2
mulsd %xmm7, %xmm1
mulsd %xmm5, %xmm0
subl $1, %eax
jne .L13
In conclusion: You probably need to tighten your code up a bit before you can tell whether the implementations differ on your system. Make sure your loops are actually being generated (look at the assembly) and see whether the compiler was able to elide the return value from the member function.
If those things are true and you still see differences, can you post the implementations here for SquareMatrix and Vector so we can give you some more info?
Full code, a makefile, and the generated assembly for my working example is available as a GitHub gist.
Do explicit instantiations of the template function produce a performance difference?
Some experiments I have done to look for the performance difference:
1.
Firstly, I suspected that the performance difference might be caused by the implementation itself. In fact, we have two sets of implementations: one written by ourselves (quite similar to the code by @black), and another that serves as a wrapper around Eigen::Matrix, switched by a macro. But switching between these two implementations does not change anything; the global one is still slower than the member one.
2.
Since this code (class Vector<Scalar, Dim> & SquareMatrix<Scalar, Dim>) is implemented in a large project, I guessed that the performance difference might be influenced by other code (though I thought it unlikely, it was still worth a try). So I extracted all the necessary code (using our own implementation) and put it in my manually generated VS2010 project. Surprisingly, but also reassuringly, I found that the global one is slightly faster than the member one, the same result as @black and @Myles Hathcock got, even though I left the implementation unchanged.
3.
In our project, outerProduct is put into a release lib file, while in my manually generated project it is compiled straight to .obj files and linked into the .exe. To exclude this issue, I used the extracted code to produce a lib file through VS2010 and applied this lib file to another VS project to test the performance difference, but the global one is still slightly faster than the member one. So both versions have the same implementation and both are put into lib files, one produced by Scons and the other by the VS project, yet they perform differently. Is Scons causing this problem?
4.
For the code shown in my question, the global function outerProduct is declared and defined in a .h file, then #included by a .cpp file. So when compiling this .cpp file, outerProduct is instantiated. But I can change this to another arrangement (note that this code is now compiled by Scons to produce the lib file, not by the manually generated VS2010 project):
First, I declare the global function outerProduct in the .h file:
//outerProduct.h
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m);
then, in the .cpp file,
//outerProduct.cpp
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m)
{
    for (unsigned int i=0; i<Dim; i++)
        for (unsigned int j=0; j<Dim; j++)
        {
            m(i,j) = v1[i]*v2[j];
        }
}
Since it is a template function, it requires some explicit instantiations:
//outerProduct.cpp
template void outerProduct<double, 3>(const Vector<double, 3> &, const Vector<double, 3> &, SquareMatrix<double, 3> &);
template void outerProduct<float, 3>(const Vector<float, 3> &, const Vector<float, 3> &, SquareMatrix<float, 3> &);
Finally, in the .cpp file calling this function:
//use_outerProduct.cpp
#include "outerProduct.h" //note: outerProduct.cpp is not needed.
...
outerProduct(v1, v2, m)
...
The strange thing now is that the global one is finally slightly faster than the member one.
But this only happens in the Scons environment. In the manually generated VS2010 project, the global one is always slightly faster than the member one. So does this performance difference only arise in the Scons environment, and does it become normal once the template function is explicitly instantiated?
Things are still strange! It seems that Scons has done something I didn't expect.
//------------------------------------------------------------------------
Additionally, the test code has now been changed to the following to avoid loop elision:
Vector<double, 3> vec1(0.0);
Vector<double, 3> vec2(1.0);
Timer timer;
while(true)
{
    timer.startTimer();
    for (unsigned int i=0; i<100000; i++)
    {
        vec1 = Vector<double, 3>(i);
        SquareMatrix<double,3> m = vec1.outerProduct(vec2);
    }
    timer.stopTimer();
    cout<<"time cost for member function: "<< timer.getElapsedTime()<<endl;
    timer.startTimer();
    SquareMatrix<double,3> m;
    for (unsigned int i=0; i<100000; i++)
    {
        vec1 = Vector<double, 3>(i);
        outerProduct(vec1, vec2, m);
    }
    timer.stopTimer();
    cout<<"time cost for global function: "<< timer.getElapsedTime()<<endl;
    system("pause");
}
@black @Myles Hathcock, great thanks to you warm-hearted people!
@Myles Hathcock, your explanation is really a subtle and abstruse one, but I think I will benefit a lot from it.
Finally, the entire implementation is at
https://github.com/FeiZhu/Physika
which is a physics engine we are developing, and from which you can find more info including the whole source code. Vector and SquareMatrix are defined in the Physika_Src/Physika_Core folder! But the global function outerProduct is not uploaded; you can add it somewhere appropriate.

Why does tree vectorization make this sorting algorithm 2x slower?

The sorting algorithm of this question becomes twice as fast(!) if -fprofile-arcs is enabled in gcc (4.7.2). Here is the heavily simplified C code of that question (it turned out that I can initialize the array with all zeros; the weird performance behavior remains, but it makes the reasoning much, much simpler):
#include <time.h>
#include <stdio.h>
#define ELEMENTS 100000
int main() {
    int a[ELEMENTS] = { 0 };
    clock_t start = clock();
    for (int i = 0; i < ELEMENTS; ++i) {
        int lowerElementIndex = i;
        for (int j = i+1; j < ELEMENTS; ++j) {
            if (a[j] < a[lowerElementIndex]) {
                lowerElementIndex = j;
            }
        }
        int tmp = a[i];
        a[i] = a[lowerElementIndex];
        a[lowerElementIndex] = tmp;
    }
    clock_t end = clock();
    float timeExec = (float)(end - start) / CLOCKS_PER_SEC;
    printf("Time: %2.3f\n", timeExec);
    printf("ignore this line %d\n", a[ELEMENTS-1]);
}
After playing with the optimization flags for a long while, it turned out that -ftree-vectorize also yields this weird behavior so we can take -fprofile-arcs out of the question. After profiling with perf I have found that the only relevant difference is:
Fast case gcc -std=c99 -O2 simp.c (runs in 3.1s)
cmpl %esi, %ecx
jge .L3
movl %ecx, %esi
movslq %edx, %rdi
.L3:
Slow case gcc -std=c99 -O2 -ftree-vectorize simp.c (runs in 6.1s)
cmpl %ecx, %esi
cmovl %edx, %edi
cmovl %esi, %ecx
As for the first snippet: Given that the array only contains zeros, we always jump to .L3. It can greatly benefit from branch prediction.
I guess the cmovl instructions cannot benefit from branch prediction.
Questions:
Are all my above guesses correct? Does this make the algorithm slow?
If yes, how can I prevent gcc from emitting this instruction (other than the trivial -fno-tree-vectorize workaround, of course) while still doing as much optimization as possible?
What is this -ftree-vectorize? The documentation is quite vague; I would need a little more explanation to understand what's happening.
Update: Since it came up in comments: The weird performance behavior w.r.t. the -ftree-vectorize flag remains with random data. As Yakk points out, for selection sort, it is actually hard to create a dataset that would result in a lot of branch mispredictions.
Since it also came up: I have a Core i5 CPU.
Based on Yakk's comment, I created a test. The code below (online without boost) is of course no longer a sorting algorithm; I kept only the inner loop. Its only goal is to examine the effect of branch prediction: we skip the if branch in the for loop with probability p.
#include <algorithm>
#include <cstdio>
#include <random>
#include <boost/chrono.hpp>
using namespace std;
using namespace boost::chrono;
constexpr int ELEMENTS=1e+8;
constexpr double p = 0.50;
int main() {
    printf("p = %.2f\n", p);
    int* a = new int[ELEMENTS];
    mt19937 mt(1759);
    bernoulli_distribution rnd(p);
    for (int i = 0 ; i < ELEMENTS; ++i){
        a[i] = rnd(mt)? i : -i;
    }
    auto start = high_resolution_clock::now();
    int lowerElementIndex = 0;
    for (int i=0; i<ELEMENTS; ++i) {
        if (a[i] < a[lowerElementIndex]) {
            lowerElementIndex = i;
        }
    }
    auto finish = high_resolution_clock::now();
    printf("%ld ms\n", duration_cast<milliseconds>(finish-start).count());
    printf("Ignore this line %d\n", a[lowerElementIndex]);
    delete[] a;
}
The loops of interest:
This will be referred to as cmov
g++ -std=c++11 -O2 -lboost_chrono -lboost_system -lrt branch3.cpp
xorl %eax, %eax
.L30:
movl (%rbx,%rbp,4), %edx
cmpl %edx, (%rbx,%rax,4)
movslq %eax, %rdx
cmovl %rdx, %rbp
addq $1, %rax
cmpq $100000000, %rax
jne .L30
This will be referred to as no cmov; the -fno-if-conversion flag was pointed out by Turix in his answer.
g++ -std=c++11 -O2 -fno-if-conversion -lboost_chrono -lboost_system -lrt branch3.cpp
xorl %eax, %eax
.L29:
movl (%rbx,%rbp,4), %edx
cmpl %edx, (%rbx,%rax,4)
jge .L28
movslq %eax, %rbp
.L28:
addq $1, %rax
cmpq $100000000, %rax
jne .L29
The difference side by side
cmpl %edx, (%rbx,%rax,4) | cmpl %edx, (%rbx,%rax,4)
movslq %eax, %rdx | jge .L28
cmovl %rdx, %rbp | movslq %eax, %rbp
| .L28:
The execution time as a function of the Bernoulli parameter p
The code with the cmov instruction is absolutely insensitive to p. The code without the cmov instruction is the winner if p<0.26 or 0.81<p and is at most 4.38x faster (p=1). Of course, the worse situation for the branch predictor is at around p=0.5 where the code is 1.58x slower than the code with the cmov instruction.
Note: Answered before graph update was added to the question; some assembly code references here may be obsolete.
(Adapted and extended from our above chat, which was stimulating enough to cause me to do a bit more research.)
First (as per our above chat), it appears that the answer to your first question is "yes". In the vector "optimized" code, the optimization (negatively) affecting performance is branch predication, whereas in the original code the performance is (positively) affected by branch prediction. (Note the extra 'a' in the former.)
Re your 3rd question: Even though in your case, there is actually no vectorization being done, from step 11 ("Conditional Execution") here it appears that one of the steps associated with vectorization optimizations is to "flatten" conditionals within targeted loops, like this bit in your loop:
if (a[j] < a[lowerElementIndex])
    lowerElementIndex = j;
Apparently, this happens even if there is no vectorization.
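Expressed back in C, the flattened form is roughly this (a sketch of what the if-conversion produces):
int cond = a[j] < a[lowerElementIndex];            // the cmpl
lowerElementIndex = cond ? j : lowerElementIndex;  // the cmovl: executed every
                                                   // iteration, no branch to predict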
This explains why the compiler is using the conditional move instructions (cmovl). The goal there is to avoid a branch entirely (as opposed to trying to predict it correctly). Instead, the two cmovl instructions will be sent down the pipeline before the result of the previous cmpl is known and the comparison result will then be "forwarded" to enable/prevent the moves prior to their writeback (i.e., prior to them actually taking effect).
Note that if the loop had been vectorized, this might have been worth it to get to the point where multiple iterations through the loop could effectively be accomplished in parallel.
However, in your case, the attempt at optimization actually backfires because in the flattened loop, the two conditional moves are sent through the pipeline every single time through the loop. This in itself might not be so bad either, except that there is a RAW data hazard that causes the second move (cmovl %esi, %ecx) to have to wait until the array/memory access (movl (%rsp,%rsi,4), %esi) is completed, even if the result is going to be ultimately ignored. Hence the huge time spent on that particular cmovl. (I would expect this is an issue with your processor not having complex enough logic built into its predication/forwarding implementation to deal with the hazard.)
On the other hand, in the non-optimized case, as you rightly figured out, branch prediction can help to avoid having to wait on the result of the corresponding array/memory access there (the movl (%rsp,%rcx,4), %ecx instruction). In that case, when the processor correctly predicts a taken branch (which for an all-0 array will be every single time, but [even] in a random array should [still] be more than half the time [edited per @Yakk's comment]), it does not have to wait for the memory access to finish before going ahead and queuing up the next few instructions in the loop. So on correct predictions you get a boost, whereas on incorrect predictions the result is no worse than in the "optimized" case and, furthermore, better because of the ability to sometimes avoid having the two "wasted" cmovl instructions in the pipeline.
[The following was removed due to my mistaken assumption about your processor per your comment.]
Back to your questions, I would suggest looking at that link above for more on the flags relevant to vectorization, but in the end, I'm pretty sure that it's fine to ignore that optimization given that your Celeron isn't capable of using it (in this context) anyway.
[Added after above was removed]
Re your second question ("...how can I prevent gcc from emitting this instruction..."), you could try the -fno-if-conversion and -fno-if-conversion2 flags (not sure if these always work -- they no longer work on my mac), although I do not think your problem is with the cmovl instruction in general (i.e., I wouldn't always use those flags), just with its use in this particular context (where branch prediction is going to be very helpful given @Yakk's point about your sort algorithm).

Which is faster: if (bool) or if (int)?

Which value is better to use? Boolean true or Integer 1?
The above topic made me do some experiments with bool and int in an if condition. So just out of curiosity I wrote this program:
int f(int i)
{
    if ( i ) return 99; //if(int)
    else return -99;
}
int g(bool b)
{
    if ( b ) return 99; //if(bool)
    else return -99;
}
int main(){}
g++ intbool.cpp -S generates asm code for each function as follows:
asm code for f(int)
__Z1fi:
LFB0:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
cmpl $0, 8(%ebp)
je L2
movl $99, %eax
jmp L3
L2:
movl $-99, %eax
L3:
leave
LCFI2:
ret
asm code for g(bool)
__Z1gb:
LFB1:
pushl %ebp
LCFI3:
movl %esp, %ebp
LCFI4:
subl $4, %esp
LCFI5:
movl 8(%ebp), %eax
movb %al, -4(%ebp)
cmpb $0, -4(%ebp)
je L5
movl $99, %eax
jmp L6
L5:
movl $-99, %eax
L6:
leave
LCFI6:
ret
Surprisingly, g(bool) generates more asm instructions! Does it mean that if(bool) is a little slower than if(int)? I used to think bool was especially designed to be used in conditional statements such as if, so I was expecting g(bool) to generate fewer asm instructions, thereby making g(bool) more efficient and fast.
EDIT:
I'm not using any optimization flag as of now. But even in its absence, why it generates more asm for g(bool) is the question for which I'm looking for a reasonable answer. I should also tell you that the -O2 optimization flag generates exactly the same asm. But that isn't the question; the question is what I've asked.
Makes sense to me. Your compiler apparently defines a bool as an 8-bit value, and your system ABI requires it to "promote" small (< 32-bit) integer arguments to 32-bit when pushing them onto the call stack. So to compare a bool, the compiler generates code to isolate the least significant byte of the 32-bit argument that g receives, and compares it with cmpb. In the first example, the int argument uses the full 32 bits that were pushed onto the stack, so it simply compares against the whole thing with cmpl.
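You can check what your implementation actually uses with a trivial sketch (the sizes are implementation-defined; 1 and 4 are merely the common values):
#include <cstdio>

int main() {
    std::printf("sizeof(bool) = %zu, sizeof(int) = %zu\n",
                sizeof(bool), sizeof(int));  // commonly prints 1 and 4
}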
Compiling with -O3 gives the following for me:
f:
pushl %ebp
movl %esp, %ebp
cmpl $1, 8(%ebp)
popl %ebp
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
g:
pushl %ebp
movl %esp, %ebp
cmpb $1, 8(%ebp)
popl %ebp
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
.. so it compiles to essentially the same code, except for cmpl vs cmpb.
This means that the difference, if there is any, doesn't matter. Judging by unoptimized code is not fair.
Edit, to clarify my point: unoptimized code is for simple debugging, not for speed. Comparing the speed of unoptimized code is senseless.
When I compile this with a sane set of options (specifically -O3), here's what I get:
For f():
.type _Z1fi, #function
_Z1fi:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
cmpl $1, %edi
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
.cfi_endproc
For g():
.type _Z1gb, #function
_Z1gb:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
cmpb $1, %dil
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
.cfi_endproc
They still use different instructions for the comparison (cmpb for boolean vs. cmpl for int), but otherwise the bodies are identical. A quick look at the Intel manuals tells me: ... not much of anything. There's no such thing as cmpb or cmpl in the Intel manuals. They're all cmp and I can't find the timing tables at the moment. I'm guessing, however, that there's no clock difference between comparing a byte immediate vs. comparing a long immediate, so for all practical purposes the code is identical.
edited to add the following based on your addition
The reason the code is different in the unoptimized case is that it is unoptimized. (Yes, it's circular, I know.) When the compiler walks the AST and generates code directly, it doesn't "know" anything except what's at the immediate point of the AST it's in. At that point it lacks all contextual information needed to know that at this specific point it can treat the declared type bool as an int. A boolean is obviously by default treated as a byte and when manipulating bytes in the Intel world you have to do things like sign-extend to bring it to certain widths to put it on the stack, etc. (You can't push a byte.)
When the optimizer views the AST and does its magic, however, it looks at surrounding context and "knows" when it can replace code with something more efficient without changing semantics. So it "knows" it can use an integer in the parameter and thereby lose the unnecessary conversions and widening.
With GCC 4.5 on Linux and Windows at least, sizeof(bool) == 1. On x86 and x86_64, you can't pass less than a general-purpose register's worth to a function (whether via the stack or a register, depending on the calling convention, etc.).
So the code for bool, when un-optimized, actually goes to some length to extract that bool value from the argument stack (using another stack slot to save that byte). It's more complicated than just pulling a native register-sized variable.
Yeah, the discussion's fun. But just test it:
Test code:
#include <stdio.h>
#include <string.h>
int testi(int);
int testb(bool);
int main (int argc, char* argv[]){
    bool valb;
    int vali;
    int loops;
    if( argc < 2 ){
        return 2;
    }
    valb = (0 != (strcmp(argv[1], "0")));
    vali = strcmp(argv[1], "0");
    printf("Arg1: %s\n", argv[1]);
    printf("BArg1: %i\n", valb ? 1 : 0);
    printf("IArg1: %i\n", vali);
    for(loops=30000000; loops>0; loops--){
        //printf("%i: %i\n", loops, testb(valb=!valb));
        printf("%i: %i\n", loops, testi(vali=!vali));
    }
    return valb;
}
int testi(int val){
    if( val ){
        return 1;
    }
    return 0;
}
int testb(bool val){
    if( val ){
        return 1;
    }
    return 0;
}
Compiled on a 64-bit Ubuntu 10.10 laptop with:
g++ -O3 -o /tmp/test_i /tmp/test_i.cpp
Integer-based comparison:
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.203s
user 0m8.170s
sys 0m0.010s
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.056s
user 0m8.020s
sys 0m0.000s
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.116s
user 0m8.100s
sys 0m0.000s
Boolean test / print uncommented (and integer commented):
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.254s
user 0m8.240s
sys 0m0.000s
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.028s
user 0m8.000s
sys 0m0.010s
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m7.981s
user 0m7.900s
sys 0m0.050s
They're the same with 1 assignment and 2 comparisons each loop over 30 million loops. Find something else to optimize. For example, don't use strcmp unnecessarily. ;)
At the machine level there is no such thing as bool
Very few instruction set architectures define any sort of boolean operand type, although there are often instructions that trigger an action on non-zero values. To the CPU, usually, everything is one of the scalar types or a string of them.
A given compiler and a given ABI will need to choose specific sizes for int and bool and when, like in your case, these are different sizes they may generate slightly different code, and at some levels of optimization one may be slightly faster.
Why is bool one byte on many systems?
It's safer to choose a char type for bool because someone might make a really large array of them.
Update: by "safer", I mean: for the compiler and library implementors. I'm not saying people need to reimplement the system type.
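For example (a sketch; the exact sizes are implementation-defined):
#include <cstdio>

int main() {
    static bool flags[10000000];  // ten million flags
    // With a 1-byte bool this is ~10 MB; a 4-byte bool would need ~40 MB.
    std::printf("%zu bytes\n", sizeof flags);
}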
It will mostly depend on the compiler and the optimization. There's an interesting discussion (language agnostic) here:
Does "if ([bool] == true)" require one more step than "if ([bool])"?
Also, take a look at this post: http://www.linuxquestions.org/questions/programming-9/c-compiler-handling-of-boolean-variables-290996/
Approaching your question in two different ways:
If you are specifically talking about C++, or any programming language that produces assembly code for that matter, we are bound to what code the compiler will generate in ASM. We are also bound to the representation of true and false in C++. An integer will have to be stored in 32 bits, whereas I could simply use a byte to store the boolean expression. Asm snippets for the conditional statements:
For the integer:
mov eax,dword ptr[esp] ;Store integer
cmp eax,0 ;Compare to 0
je false ;If int is 0, its false
;Do what has to be done when true
false:
;Do what has to be done when false
For the bool:
mov al,1 ;Anything that is not 0 is true
test al,1 ;See if the first bit is flipped
jz false ;Not flipped, so it's false
;Do what has to be done when true
false:
;Do what has to be done when false
So, that's why the speed comparison is so compiler-dependent. In the case above, the bool would be slightly faster, since cmp implies a subtraction for setting the flags. It also contradicts what your compiler generated.
Another, much simpler approach is to look at the logic of the expression on its own and try not to worry about how the compiler will translate your code; I think this is a much healthier way of thinking. I still believe, ultimately, that the code generated by the compiler is actually trying to give a truthful resolution. What I mean is that maybe, if you increase the test cases in the if statement and stick with boolean on one side and integer on the other, the compiler will make it so the generated code executes faster with boolean expressions at the machine level.
I'm considering this a conceptual question, so I'll give a conceptual answer. This discussion reminds me of discussions I commonly have about whether code efficiency translates to fewer lines of code in assembly. It seems that this concept is generally accepted as true. Considering that keeping track of how fast the ALU will handle each statement is not viable, the second option is to focus on jumps and compares in assembly. When that is the case, the distinction between booleans and integers in the code you presented becomes rather representative. The result of an expression in C++ will return a value that will then be given a representation. In assembly, on the other hand, the jumps and comparisons will be based on numeric values regardless of what type of expression was being evaluated back in your C++ if statement. It is important on these questions to remember that purely logical statements such as these end up with a huge computational overhead, even though a single bit would be capable of the same thing.

C++: Structs slower to access than basic variables?

I found some code that had "optimization" like this:
void somefunc(SomeStruct param){
    float x = param.x; // param.x and x are both floats. supposedly this makes it faster access
    float y = param.y;
    float z = param.z;
}
And the comments said that it makes variable access faster, but I've always thought struct element access is just as fast as access to a plain variable.
Could someone clear this up for me?
The usual rules for optimization (Michael A. Jackson) apply:
1. Don't do it.
2. (For experts only:) Don't do it yet.
That being said, let's assume it's the innermost loop that takes 80% of the time of a performance-critical application. Even then, I doubt you will ever see any difference. Let's use this piece of code for instance:
struct Xyz {
    float x, y, z;
};
float f(Xyz param){
    return param.x + param.y + param.z;
}
float g(Xyz param){
    float x = param.x;
    float y = param.y;
    float z = param.z;
    return x + y + z;
}
Running it through LLVM shows: only with no optimizations do the two act as expected (g copies the struct members into locals, then sums those; f sums the values fetched from param directly). With standard optimization levels, both result in identical code (extracting the values once, then summing them).
For short code, this "optimization" is actually harmful, as it copies the floats needlessly. For longer code using the members in several places, it might help a teensy bit if you actively tell your compiler to be stupid. A quick test with 65 (instead of 2) additions of the members/locals confirms this: with no optimizations, f repeatedly loads the struct members while g reuses the already extracted locals. The optimized versions are again identical and both extract the members only once. (Surprisingly, there's no strength reduction turning the additions into multiplications even with LTO enabled, but that just indicates the LLVM version used isn't optimizing too aggressively anyway - so it should work just as well in other compilers.)
So, the bottom line is: unless you know your code will have to be compiled by a compiler that's so outrageously stupid and/or ancient that it won't optimize anything, you now have proof that the compiler will make both ways equivalent, and can thus do away with this crime against readability and brevity committed in the name of performance. (Repeat the experiment for your particular compiler if necessary.)
Rule of thumb: it's not slow, unless profiler says it is. Let the compiler worry about micro-optimisations (they're pretty smart about them; after all, they've been doing it for years) and focus on the bigger picture.
I'm no compiler guru, so take this with a grain of salt. I'm guessing that the original author of the code is assuming that by copying the values from the struct into local variables, the compiler has "placed" those variables into floating point registers which are available on some platforms (e.g., x86). If there aren't enough registers to go around, they'd be put in the stack.
That being said, unless this code was in the middle of an intensive computation/loop, I'd strive for clarity rather than speed. It's pretty rare that anyone is going to notice a few instructions difference in timing.
You'd have to look at the compiled code on a particular implementation to be sure, but there's no reason in principle why your preferred code (using the struct members) should necessarily be any slower than the code you've shown (copying into variables and then using the variables).
someFunc takes a struct by value, so it has its own local copy of that struct. The compiler is perfectly at liberty to apply exactly the same optimizations to the struct members, as it would apply to the float variables. They're both automatic variables, and in both cases the "as-if" rule allows them to be stored in register(s) rather than in memory provided that the function produces the correct observable behavior.
This is unless of course you take a pointer to the struct and use it, in which case the values need to be written in memory somewhere, in the correct order, pointed to by the pointer. This starts to limit optimization, and other limits are introduced by the fact that if you pass around a pointer to an automatic variable, the compiler can no longer assume that the variable name is the only reference to that memory and hence the only way its contents can be modified. Having multiple references to the same object is called "aliasing", and does sometimes block optimizations that could be made if the object was somehow known not to be aliased.
Then again, if this is an issue, and the rest of the code in the function somehow does use a pointer to the struct, then of course you could be on dodgy ground copying the values into variables from the POV of correctness. So the claimed optimization is not quite so straightforward as it looks in that case.
Now, there may be particular compilers (or particular optimization levels) which fail to apply to structs all the optimizations that they're permitted to apply, but do apply equivalent optimizations to float variables. If so then the comment would be right, and that's why you have to check to be sure. For example, maybe compare the emitted code for this:
float somefunc(SomeStruct param){
    float x = param.x; // param.x and x are both floats. supposedly this makes it faster access
    float y = param.y;
    float z = param.z;
    for (int i = 0; i < 10; ++i) {
        x += (y +i) * z;
    }
    return x;
}
with this:
float somefunc(SomeStruct param){
    for (int i = 0; i < 10; ++i) {
        param.x += (param.y +i) * param.z;
    }
    return param.x;
}
There may also be optimization levels where the extra variables make the code worse. I'm not sure I put much trust in code comments that say "supposedly this makes it faster access", sounds like the author doesn't really have a clear idea why it matters. "Apparently it makes it faster access - I don't know why but the tests to confirm this and to demonstrate that it makes a noticeable difference in the context of our program, are in source control in the following location" is a lot more like it ;-)
In unoptimised code:
function parameters (which are not passed by reference) are on the stack
local variables are also on the stack
Unoptimised access to local variables and function parameters in assembly language looks more or less like this:
movl compile-time-constant(%ebp), %eax
where %ebp is a frame pointer (sort of 'this' pointer for a function).
It makes no difference if you access a parameter or a local variable.
The fact that you are accessing an element from a struct makes absolutely no difference from the assembly/machine point of view. Structs are constructs made in C to make programmer's life easier.
So, ultimately, my answer is: no, there is absolutely no benefit in doing that.
There are good and valid reasons to do that kind of optimization when pointers are used, because consuming all inputs first frees the compiler from possible aliasing issues which prevent it from producing optimal code (there's restrict nowadays too, though).
For non-pointer types, there is in theory an overhead because every member is accessed via the struct's this pointer. This may in theory be noticeable within an inner loop and will in theory be a diminutive overhead otherwise. In practice, however, a modern compiler will almost always (unless there is a complex inheritance hierarchy) produce the exact same binary code.
I had asked myself the exact same question as you did about two years ago and did a very extensive test case using gcc 4.4. My findings were that unless you really try to throw sticks between the compiler's legs on purpose, there is absolutely no difference in the generated code.
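To illustrate the pointer case mentioned above, here is a sketch with hypothetical function names (not from the code in the question), reusing the question's SomeStruct:
// 'out' might point at s->x, so the compiler must reload s->x from
// memory on every iteration:
float accumulate(const SomeStruct* s, float* out, int n) {
    for (int i = 0; i < n; ++i)
        *out += s->x;
    return *out;
}
// Copying the member into a local first removes the aliasing concern,
// so x can live in a register for the whole loop:
float accumulate_local(const SomeStruct* s, float* out, int n) {
    const float x = s->x;
    for (int i = 0; i < n; ++i)
        *out += x;
    return *out;
}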
The real answer is given by Piotr. This one is just for fun.
I have tested it. This code:
float somefunc(SomeStruct param, float &sum){
    float x = param.x;
    float y = param.y;
    float z = param.z;
    float xyz = x * y * z;
    sum = x + y + z;
    return xyz;
}
And this code:
float somefunc(SomeStruct param, float &sum){
    float xyz = param.x * param.y * param.z;
    sum = param.x + param.y + param.z;
    return xyz;
}
Generate identical assembly code when compiled with g++ -O2. They do generate different code with optimization turned off, though. Here is the difference:
< movl -32(%rbp), %eax
< movl %eax, -4(%rbp)
< movl -28(%rbp), %eax
< movl %eax, -8(%rbp)
< movl -24(%rbp), %eax
< movl %eax, -12(%rbp)
< movss -4(%rbp), %xmm0
< mulss -8(%rbp), %xmm0
< mulss -12(%rbp), %xmm0
< movss %xmm0, -16(%rbp)
< movss -4(%rbp), %xmm0
< addss -8(%rbp), %xmm0
< addss -12(%rbp), %xmm0
---
> movss -32(%rbp), %xmm1
> movss -28(%rbp), %xmm0
> mulss %xmm1, %xmm0
> movss -24(%rbp), %xmm1
> mulss %xmm1, %xmm0
> movss %xmm0, -4(%rbp)
> movss -32(%rbp), %xmm1
> movss -28(%rbp), %xmm0
> addss %xmm1, %xmm0
> movss -24(%rbp), %xmm1
> addss %xmm1, %xmm0
The lines marked < correspond to the version with "optimization" variables. It seems to me that the "optimized" version is even slower than the one with no extra variables. This is to be expected, though, as x, y and z are allocated on the stack, exactly like the param. What's the point of allocating more stack variables to duplicate existing ones?
If the one who did that "optimization" knew the language better, he would probably have declared those variables as register, but even that leaves the "optimized" version slightly slower and longer, at least on G++/x86-64.
The compiler may generate faster code for a float-to-float copy.
But when x is used, it will be converted to the internal FPU representation anyway.
When you specify a "simple" variable (not a struct/class) to be operated upon, the system only has to go to that place and fetch the data it wants.
But when you refer to a variable inside a struct or class, like A.B, the system needs to calculate where B is inside that area called A (because there may be other variables declared before it), and that calculation takes a bit more than the plain access described above.