I have implemented two functions to compute the outer product of two Vectors (not std::vector); one is a member function and the other is a global one. Here is the key code (additional parts are omitted):
//for member function
template <typename Scalar>
SquareMatrix<Scalar,3> Vector<Scalar,3>::outerProduct(const Vector<Scalar,3> &vec3) const
{
SquareMatrix<Scalar,3> result;
for(unsigned int i = 0; i < 3; ++i)
for(unsigned int j = 0; j < 3; ++j)
result(i,j) = (*this)[i]*vec3[j];
return result;
}
//for global function: Dim = 3
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m)
{
for (unsigned int i=0; i<Dim; i++)
for (unsigned int j=0; j<Dim; j++)
{
m(i,j) = v1[i]*v2[j];
}
}
They are almost the same, except that one is a member function with a return value, while the other is a global function that assigns the computed values straight into a square matrix and therefore needs no return value.
Actually, I meant to replace the member function with the global one to improve performance, since the former involves copy operations. The strange thing, however, is that the global function takes almost twice as long as the member one. Furthermore, I find that the execution of
m(i,j) = v1[i]*v2[j]; // in global function
requires much more time than that of
result(i,j) = (*this)[i]*vec3[j]; // in member function
So the question is: how does this performance difference between the member and the global function arise? Can anyone explain the reason?
I hope I have presented my question clearly; apologies for my poor English!
//----------------------------------------------------------------------------------------
More information added:
The following is the code I use to test the performance:
// the code below is inside a loop
Vector<double, 3> vec1;
Vector<double, 3> vec2;
Timer timer;
timer.startTimer();
for (unsigned int i=0; i<100000; i++)
{
SquareMatrix<double,3> m = vec1.outerProduct(vec2);
}
timer.stopTimer();
std::cout<<"time cost for member function: "<< timer.getElapsedTime()<<std::endl;
timer.startTimer();
SquareMatrix<double,3> m;
for (unsigned int i=0; i<100000; i++)
{
outerProduct(vec1, vec2, m);
}
timer.stopTimer();
std::cout<<"time cost for global function: "<< timer.getElapsedTime()<<std::endl;
std::system("pause");
The captured result (screenshot omitted) shows that the member function is almost twice as fast as the global one.
Additionally, my project is built on a 64-bit Windows system. The code is used to generate static lib files with the SCons build tool, with VS2010 project files produced alongside.
I should add that this strange performance difference only occurs in a release build; in a debug build the global function is almost five times faster than the member one (about 0.10 s vs 0.02 s).
One possible explanation:
With inlining, in the first case the compiler may know that result(i, j) (a local variable) doesn't alias (*this)[i] or vec3[j], and therefore that neither the Scalar array of this nor that of vec3 is modified by the stores.
In the second case, from the function's point of view the variables may alias: each write into m might modify the Scalars of v1 or v2, so neither v1[i] nor v2[j] can be kept cached in a register.
You may try the restrict keyword extension to check whether this hypothesis is correct.
EDIT: loop elision in the original assembly has been corrected
[paraphrased] Why does the performance differ between the member function and the global function?
I'll start with the simplest things mentioned in your question, and progress to the more nuanced points of performance testing / analysis.
It is a bad idea to measure performance of debug builds. Compilers take liberties in many places, such as zeroing arrays that are uninitialized, generating extra code that isn't strictly necessary, and (obviously) not performing any optimization past the trivial ones such as constant propagation. This leads to the next point...
Always look at the assembly. C and C++ are high level languages when it comes to the subtleties of performance. Many people even consider x86 assembly a high level language since each instruction is decomposed into possibly several micro-ops during decoding. You cannot tell what the computer is doing just by looking at C++ code. For example, depending on how you implemented SquareMatrix, the compiler may or may not be able to perform copy elision during optimization.
Entering the somewhat more nuanced topics when testing for performance...
Make sure the compiler is actually generating loops. Using your example test code, g++ 4.7.2 doesn't actually generate loops with my implementation of SquareMatrix and Vector. I implemented them to initialize all components to 0.0, so the compiler can statically determine that the values never change, and therefore generates only a single set of mov instructions instead of a loop. In my example code I use COMPILER_NOP, which (with gcc) is __asm__ __volatile__("":::), inside the loop to prevent this (compilers cannot predict side effects of hand-written assembly, and so cannot elide the loop). Edit: I DO use COMPILER_NOP, but since the output values of the functions are never used, the compiler is still able to remove the bulk of the work from the loop and reduce it to this:
.L7
subl $1, %eax
jne .L7
I have corrected this by performing additional operations inside the loop. The loop now assigns a value from the output to the inputs, preventing this optimization and forcing the loop to cover what was originally intended.
To (finally) get around to answering your question: When I implemented the rest of what is needed to get your code to run, and verified by checking the assembly that loops are actually generated, the two functions execute in the same amount of time. They even have nearly identical implementations in assembly.
Here's the assembly for the member function:
movsd 32(%rsp), %xmm7
movl $100000, %eax
movsd 24(%rsp), %xmm5
movsd 8(%rsp), %xmm6
movapd %xmm7, %xmm12
movsd (%rsp), %xmm4
movapd %xmm7, %xmm11
movapd %xmm5, %xmm10
movapd %xmm5, %xmm9
mulsd %xmm6, %xmm12
mulsd %xmm4, %xmm11
mulsd %xmm6, %xmm10
mulsd %xmm4, %xmm9
movsd 40(%rsp), %xmm1
movsd 16(%rsp), %xmm0
jmp .L7
.p2align 4,,10
.p2align 3
.L12:
movapd %xmm3, %xmm1
movapd %xmm2, %xmm0
.L7:
movapd %xmm0, %xmm8
movapd %xmm1, %xmm3
movapd %xmm1, %xmm2
mulsd %xmm1, %xmm8
movapd %xmm0, %xmm1
mulsd %xmm6, %xmm3
mulsd %xmm4, %xmm2
mulsd %xmm7, %xmm1
mulsd %xmm5, %xmm0
subl $1, %eax
jne .L12
and the assembly for the static function:
movsd 32(%rsp), %xmm7
movl $100000, %eax
movsd 24(%rsp), %xmm5
movsd 8(%rsp), %xmm6
movapd %xmm7, %xmm12
movsd (%rsp), %xmm4
movapd %xmm7, %xmm11
movapd %xmm5, %xmm10
movapd %xmm5, %xmm9
mulsd %xmm6, %xmm12
mulsd %xmm4, %xmm11
mulsd %xmm6, %xmm10
mulsd %xmm4, %xmm9
movsd 40(%rsp), %xmm1
movsd 16(%rsp), %xmm0
jmp .L9
.p2align 4,,10
.p2align 3
.L13:
movapd %xmm3, %xmm1
movapd %xmm2, %xmm0
.L9:
movapd %xmm0, %xmm8
movapd %xmm1, %xmm3
movapd %xmm1, %xmm2
mulsd %xmm1, %xmm8
movapd %xmm0, %xmm1
mulsd %xmm6, %xmm3
mulsd %xmm4, %xmm2
mulsd %xmm7, %xmm1
mulsd %xmm5, %xmm0
subl $1, %eax
jne .L13
In conclusion: You probably need to tighten your code up a bit before you can tell whether the implementations differ on your system. Make sure your loops are actually being generated (look at the assembly) and see whether the compiler was able to elide the return value from the member function.
If those things are true and you still see differences, can you post the implementations here for SquareMatrix and Vector so we can give you some more info?
Full code, a makefile, and the generated assembly for my working example is available as a GitHub gist.
Do explicit instantiations of a template function produce a performance difference?
Here are some experiments I have done to track down the performance difference:
1.
At first I suspected that the performance difference might be caused by the implementation itself. In fact, we have two sets of implementations: one written by ourselves (quite similar to the code by @black), and another that wraps Eigen::Matrix, controlled by a macro switch. But switching between these two implementations does not change anything: the global one is still slower than the member one.
2.
Since this code (the classes Vector<Scalar, Dim> and SquareMatrix<Scalar, Dim>) is implemented inside a large project, I guessed the performance difference might be influenced by other code (though I thought it unlikely, it was still worth a try). So I extracted all the necessary code (using our own implementation) and put it into a manually-generated VS2010 project. Surprisingly, I found that the global one is now slightly faster than the member one, which matches the results of @black and @Myles Hathcock, even though the implementation itself is unchanged.
3.
In our project, outerProduct goes into a release lib file, whereas my manually-generated project simply produces .obj files that are linked into the .exe. To exclude this factor, I took the extracted code, produced a lib file with VS2010, and used that lib in another VS project to test the performance; still, the global one is slightly faster than the member one. So both versions have the same implementation and both end up in lib files, one produced by SCons and the other by the VS project, yet they perform differently. Is SCons causing this problem?
4.
For the code shown in my question, the global function outerProduct is declared and defined in a .h file and then #included by a .cpp file, so outerProduct is instantiated when that .cpp file is compiled. But then I changed this to another arrangement (note that this code is compiled by SCons into a lib file, not in the manually-generated VS2010 project):
First, I declare the global function outerProduct in .h file:
// outerProduct.h
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m);
then, in the .cpp file:
// outerProduct.cpp
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m)
{
for (unsigned int i=0; i<Dim; i++)
for (unsigned int j=0; j<Dim; j++)
{
m(i,j) = v1[i]*v2[j];
}
}
Since it is a template function, it requires some explicit instantiations:
// outerProduct.cpp
template void outerProduct<double, 3>(const Vector<double, 3> &, const Vector<double, 3> &, SquareMatrix<double, 3> &);
template void outerProduct<float, 3>(const Vector<float, 3> &, const Vector<float, 3> &, SquareMatrix<float, 3> &);
Finally, in the .cpp file that calls the function:
// use_outerProduct.cpp
#include "outerProduct.h" // note: outerProduct.cpp is not needed
...
outerProduct(v1, v2, m)
...
The strange thing now is that the global one is finally slightly faster than the member one (shown in a captured screenshot, omitted here).
But this only happens in the SCons environment; in the manually-generated VS2010 project the global one is always slightly faster than the member one. So does this performance difference only arise in a SCons environment, and does explicitly instantiating the template function make it behave normally?
Things are still strange! It seems SCons does something I didn't expect.
//------------------------------------------------------------------------
Additionally, the test code has been changed to the following to avoid loop elision:
Vector<double, 3> vec1(0.0);
Vector<double, 3> vec2(1.0);
Timer timer;
while(true)
{
timer.startTimer();
for (unsigned int i=0; i<100000; i++)
{
vec1 = Vector<double, 3>(i);
SquareMatrix<double,3> m = vec1.outerProduct(vec2);
}
timer.stopTimer();
cout<<"time cost for member function: "<< timer.getElapsedTime()<<endl;
timer.startTimer();
SquareMatrix<double,3> m;
for (unsigned int i=0; i<100000; i++)
{
vec1 = Vector<double, 3>(i);
outerProduct(vec1, vec2, m);
}
timer.stopTimer();
cout<<"time cost for global function: "<< timer.getElapsedTime()<<endl;
system("pause");
}
@black, @Myles Hathcock: great thanks to you warm-hearted people!
@Myles Hathcock, your explanation is a subtle and abstruse one, but I think I will benefit a lot from it.
Finally, the entire implementation is at
https://github.com/FeiZhu/Physika
which is a physics engine we are developing, and where you can find more info including the whole source code. Vector and SquareMatrix are defined in the Physika_Src/Physika_Core folder! The global function outerProduct is not uploaded, but you can add it somewhere appropriate.
Related
I want to write fast simd code to compute the multiplicative reduction of a complex array. In standard C this is:
#include <complex.h>
complex float f(complex float x[], int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}
n will be at most 50.
Gcc can't auto-vectorize complex multiplication, but since I am happy to assume the gcc compiler, and if I knew I wanted to target sse3, I could follow "How to enable sse3 autovectorization in gcc" and write:
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x[], int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}
This indeed produces fast vectorized assembly code with gcc, although you still need to pad your input to a multiple of 4. The assembly you get is:
.L3:
vmovaps xmm0, XMMWORD PTR 16[rsi]
add rsi, 32
vmulps xmm1, xmm0, xmm2
vmulps xmm0, xmm0, xmm3
vfmsubps xmm1, xmm3, XMMWORD PTR -32[rsi], xmm1
vmovaps xmm3, xmm1
vfmaddps xmm2, xmm2, XMMWORD PTR -32[rsi], xmm0
cmp rdx, rsi
jne .L3
However, it is tied to one specific SIMD instruction set and is not optimal for AVX2 or AVX-512, for example, for which you would need to change the code.
How can you write C or C++ code for which gcc will produce optimal code when compiled for any of SSE, AVX2, or AVX-512? That is, do you always have to write separate functions by hand for each SIMD register width?
Are there any open source libraries that make this easier?
Here would be an example using the Eigen library:
#include <Eigen/Core>
std::complex<float> f(const std::complex<float> *x, int n)
{
return Eigen::VectorXcf::Map(x, n).prod();
}
If you compile this with clang or g++ and sse or avx enabled (and -O2), you should get fairly decent machine code. It also works for some other architectures like Altivec or NEON. If you know that the first entry of x is aligned, you can use MapAligned instead of Map.
You get even better code, if you happen to know the size of your vector at compile time using this:
template<int n>
std::complex<float> f(const std::complex<float> *x)
{
return Eigen::Matrix<std::complex<float>, n, 1>::MapAligned(x).prod();
}
Note: The functions above directly correspond to the function f of the OP.
However, as @PeterCordes pointed out, it is generally bad to store complex numbers interleaved, since this requires lots of shuffling for multiplication. Instead, one should store the real and imaginary parts in a way that lets a whole packet of each be loaded at once.
Edit/Addendum: To implement a structure-of-arrays like complex multiplication, you can actually write something like:
typedef Eigen::Array<float, 8, 1> v8sf; // Eigen::Array allows element-wise standard operations
typedef std::complex<v8sf> complex8;
complex8 prod(const complex8& a, const complex8& b)
{
return a*b;
}
Or more generic (using C++11):
template<int size, typename Scalar = float> using complexX = std::complex<Eigen::Array<Scalar, size, 1> >;
template<int size>
complexX<size> prod(const complexX<size>& a, const complexX<size>& b)
{
return a*b;
}
When compiled with -mavx -O2, this compiles to something like this (using g++-5.4):
vmovaps 32(%rsi), %ymm1
movq %rdi, %rax
vmovaps (%rsi), %ymm0
vmovaps 32(%rdi), %ymm3
vmovaps (%rdi), %ymm4
vmulps %ymm0, %ymm3, %ymm2
vmulps %ymm4, %ymm1, %ymm5
vmulps %ymm4, %ymm0, %ymm0
vmulps %ymm3, %ymm1, %ymm1
vaddps %ymm5, %ymm2, %ymm2
vsubps %ymm1, %ymm0, %ymm0
vmovaps %ymm2, 32(%rdi)
vmovaps %ymm0, (%rdi)
vzeroupper
ret
For reasons not obvious to me, this is actually hidden in a method which is called by the actual method, which just moves around some memory -- I don't know why Eigen/gcc does not assume that the arguments are already properly aligned. If I compile the same with clang 3.8.0 (and the same arguments), it is compiled to just:
vmovaps (%rsi), %ymm0
vmovaps %ymm0, (%rdi)
vmovaps 32(%rsi), %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps (%rdi), %ymm1
vmovaps (%rdx), %ymm2
vmovaps 32(%rdx), %ymm3
vmulps %ymm2, %ymm1, %ymm4
vmulps %ymm3, %ymm0, %ymm5
vsubps %ymm5, %ymm4, %ymm4
vmulps %ymm3, %ymm1, %ymm1
vmulps %ymm0, %ymm2, %ymm0
vaddps %ymm1, %ymm0, %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps %ymm4, (%rdi)
movq %rdi, %rax
vzeroupper
retq
Again, the memory movement at the beginning is weird, but at least it is vectorized. For both gcc and clang this gets optimized away when called in a loop, however:
complex8 f8(complex8 x[], int n) {
if(n==0)
return complex8(v8sf::Ones(),v8sf::Zero()); // I guess you want p = 1 + 0*i at the beginning?
complex8 p = x[0];
for (int i = 1; i < n; i++) p = prod(p, x[i]);
return p;
}
The difference here is that clang will unroll that outer loop to 2 multiplications per loop. On the other hand, gcc will use fused-multiply-add instructions when compiled with -mfma.
The f8 function can of course also be generalized to arbitrary dimensions:
template<int size>
complexX<size> fX(complexX<size> x[], int n) {
using S= typename complexX<size>::value_type;
if(n==0)
return complexX<size>(S::Ones(),S::Zero());
complexX<size> p = x[0];
for (int i = 1; i < n; i++) p *=x[i];
return p;
}
And for reducing the complexX<N> to a single std::complex the following function can be used:
// only works for powers of two
template<int size> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<size>& var) {
complexX<size/2> a(var.real().template head<size/2>(), var.imag().template head<size/2>());
complexX<size/2> b(var.real().template tail<size/2>(), var.imag().template tail<size/2>());
return redux(a*b);
}
template<> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<1>& var) {
return std::complex<float>(var.real()[0], var.imag()[0]);
}
However, depending on whether I use clang or g++, I get quite different assembler output. Overall, g++ has a tendency to fail to inline loading the input arguments, and clang fails to use FMA operations (YMMV ...)
Essentially, you need to inspect the generated assembler code anyway. And more importantly, you should benchmark the code (not sure, how much impact this routine has in your overall problem).
Also, I want to note that Eigen is actually a linear algebra library; exploiting it purely for portable SIMD code generation is not really what it is designed for.
If portability is your main concern, there are many libraries that provide SIMD instructions in their own syntax. Most of them make explicit vectorization simpler and more portable than intrinsics. The recently published UME::SIMD library shows great performance:
In the UME::SIMD paper, an interface based on Vc has been established, named UME::SIMD. It allows the programmer to access SIMD capabilities without the need for extensive knowledge of SIMD ISAs. UME::SIMD provides a simple, flexible and portable abstraction for explicit vectorization without performance losses compared to intrinsics.
I don't think there is a fully general solution for this. You can increase the vector_size to 32 and widen all the types to hold 8 elements:
typedef float v8sf __attribute__ ((vector_size (32)));
typedef union {
v8sf v;
float e[8];
} float8;
typedef struct {
float8 x;
float8 y;
} complex8;
static complex8 complex8_mul(complex8 a, complex8 b) {
return (complex8){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
This lets the compiler generate 256-bit AVX code (compile with -mavx2; for AVX-512 you would increase vector_size to 64 and add -mavx512f), but it will make your code slightly worse for plain SSE by making the memory transfers sub-optimal. However, it will certainly not disable SSE vectorization.
You could keep both versions (with 4 and with 8 array elements), switching between them by some flag, but it might be too tedious for little benefit.
The following code copies from one array of zeroes, interpreted as floats, to another, and prints the timing of this operation. As I've seen many cases where no-op loops are simply optimized away by compilers, including gcc, I expected that at some point while modifying my copy-arrays program it would stop doing the copying.
#include <iostream>
#include <cstring>
#include <sys/time.h>
static inline long double currentTime()
{
timespec ts;
clock_gettime(CLOCK_MONOTONIC,&ts);
return ts.tv_sec+(long double)(ts.tv_nsec)*1e-9;
}
int main()
{
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
memset(data1,0,W*H*sizeof(float));
memset(data2,0,W*H*sizeof(float));
long double time1=currentTime();
for(int q=0;q<16;++q) // take more time
for(int k=0;k<W*H;++k)
data2[k]=data1[k];
long double time2=currentTime();
std::cout << (time2-time1)*1e+3 << " ms\n";
delete[] data1;
delete[] data2;
}
I compiled this with g++ 4.8.1 using g++ main.cpp -o test -std=c++0x -O3 -lrt. The program prints 6952.17 ms for me. (I had to set ulimit -s 2000000 for it to not crash.)
I also tried changing the arrays from new to automatic VLAs and removing the memsets, but this doesn't change g++'s behavior (apart from changing the timings by several times).
It seems the compiler could prove that this code does nothing observable, so why didn't it optimize the loop away?
It isn't impossible, anyway (clang++ version 3.3):
clang++ main.cpp -o test -std=c++0x -O3 -lrt
The program prints 0.000367 ms for me... and looking at the assembly language:
...
callq clock_gettime
movq 56(%rsp), %r14
movq 64(%rsp), %rbx
leaq 56(%rsp), %rsi
movl $1, %edi
callq clock_gettime
...
while for g++:
...
call clock_gettime
fildq 32(%rsp)
movl $16, %eax
fildq 40(%rsp)
fmull .LC0(%rip)
faddp %st, %st(1)
.p2align 4,,10
.p2align 3
.L2:
movl $1, %ecx
xorl %edx, %edx
jmp .L5
.p2align 4,,10
.p2align 3
.L3:
movq %rcx, %rdx
movq %rsi, %rcx
.L5:
leaq 1(%rcx), %rsi
movss 0(%rbp,%rdx,4), %xmm0
movss %xmm0, (%rbx,%rdx,4)
cmpq $200000001, %rsi
jne .L3
subl $1, %eax
jne .L2
fstpt 16(%rsp)
leaq 32(%rsp), %rsi
movl $1, %edi
call clock_gettime
...
EDIT (g++ v4.8.2 / clang++ v3.3)
SOURCE CODE - ORIGINAL VERSION (1)
...
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
...
SOURCE CODE - MODIFIED VERSION (2)
...
const size_t W=20000;
const size_t H=10000;
float data1[W*H];
float data2[W*H];
...
Now the case that isn't optimized is (1) + g++
The code in this question has changed quite a bit, invalidating correct answers. This answer applies to the 5th version: as the code currently attempts to read uninitialized memory, an optimizer may reasonably assume that unexpected things can happen.
Many optimization steps follow a similar pattern: a pattern of instructions is matched against the current state of compilation. If the pattern matches at some point, the matched instructions are (parametrically) replaced by a more efficient version. A very simple example of such a pattern is the definition of a variable that is not subsequently used; the replacement in this case is simply a deletion.
These patterns are designed for correct code. On incorrect code, the patterns may simply fail to match, or they may match in entirely unintended ways. The first case leads to no optimization; the second may lead to totally unpredictable results (certainly if the modified code is further optimized).
Why do you expect the compiler to optimise this? It’s generally really hard to prove that writes to arbitrary memory addresses are a “no-op”. In your case it would be possible, but it would require the compiler to trace the heap memory addresses through new (which is once again hard since these addresses are generated at runtime) and there really is no incentive for doing this.
After all, you tell the compiler explicitly that you want to allocate memory and write to it. How is the poor compiler to know that you’ve been lying to it?
In particular, the problem is that the heap memory could be aliased to lots of other stuff. It happens to be private to your process but like I said above, proving this is a lot of work for the compiler, unlike for function local memory.
The only way in which the compiler could know that this is a no-op is if it knew what memset does. In order for that to happen, the function must either be defined in a header (and it typically isn't), or it must be treated as a special intrinsic by the compiler. But barring those tricks, the compiler just sees a call to an unknown function which could have side effects and do different things for each of the two calls.
If you have a floating point number double num_float = 5.0; and the following two conditionals.
if(num_float > 3)
{
//...
}
if(num_float > 3.0)
{
//...
}
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
Obviously I'm assuming the time delay would be negligible at best, but compounded in a while(1) loop I suppose over the long run a decent chunk of time could be lost (if it really is slower).
Because of the "as-if" rule, the compiler is allowed to do the conversion of the literal to a floating point value at compile time. A good compiler will do so if that results in better code.
In order to answer your question definitively for your compiler and your target platform(s), you'd need to check what the compiler emits, and how it performs. However, I'd be surprised if any mainstream compiler did not turn either of the two if statements into the most efficient code possible.
If the value is a constant, then there shouldn't be any difference, since the compiler will convert the constant to float as part of the compilation [unless the compiler decides to use a "compare float with integer" instruction].
If the value is an integer VARIABLE, then there will be an extra instruction to convert the integer value to a floating point [again, unless the compiler can use a "compare float with integer" instruction].
How much, if any, time that adds to the whole process depends HIGHLY on what processor, how the floating point instructions work, etc, etc.
As with anything where performance really matters, measure the alternatives. Preferably on more than one type of hardware (e.g. both AMD and Intel processors if it's a PC), and then decide which is the better choice. Otherwise, you may find yourself tuning the code to work well on YOUR hardware, but worse on some other hardware. Which isn't a good optimisation - unless the ONLY machine you ever run on is your own.
Note: This will need to be repeated with your target hardware. The code below just demonstrates nicely what has been said.
with constants:
bool with_int(const double num_float) {
return num_float > 3;
}
bool with_float(const double num_float) {
return num_float > 3.0;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
with_float(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
.LC0:
.long 0
.long 1074266112
clang 3.0 (-O3 -march=native):
.LCPI0_0:
.quad 4613937818241073152 # double 3.000000e+00
with_int(double): # #with_int(double)
ucomisd .LCPI0_0(%rip), %xmm0
seta %al
ret
.LCPI1_0:
.quad 4613937818241073152 # double 3.000000e+00
with_float(double): # #with_float(double)
ucomisd .LCPI1_0(%rip), %xmm0
seta %al
ret
Conclusion: No difference if comparing against constants.
with variables:
bool with_int(const double a, const int b) {
return a > b;
}
bool with_float(const double a, const float b) {
return a > b;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double, int):
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float):
unpcklps %xmm1, %xmm1
cvtps2pd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
clang 3.0 (-O3 -march=native):
with_int(double, int): # #with_int(double, int)
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float): # #with_float(double, float)
cvtss2sd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
Conclusion: the emitted instructions differ when comparing against variables, as Mats Peterson's answer already explained.
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
A) Just specify it as an integer. Some chips have a special instruction for comparing against an integer at runtime, but that is not important, since the compiler will choose whatever is best. In some cases it might convert the literal to 3.0 at compile time, depending on the target architecture; in other cases it will leave it as an int. Since you mean the value 3 specifically, just write 3.
Obviously I'm assuming the time delay would be negligible at best, but compounded in a while(1) loop I suppose over the long run a decent chunk of time could be lost (if it really is slower).
A) The compiler will not do anything odd with such code; it will choose whatever is best, so there should be no time penalty. With numeric constants, the compiler is free to do whatever it likes as long as it produces the same result. However, you would not want to build this kind of comparison into a while loop at all. Prefer an integer loop counter: a floating-point loop counter will be much slower. If you must use a float as a loop counter, prefer the single-precision 32-bit type and compare as little as possible.
For example, you could break the problem down into multiple loops.
int x = 0;
float y = 0;
float finc = 0.1;
int total = 1000;
int num_times = total / finc;
num_times -= 2;// safety
// Run the loop in a safe zone using integer compares
while (x < num_times) {
// Do stuff
y += finc;
x++;
}
// Now complete the loop using float compares
while (y < total) {
y+= finc;
}
That should give a significant improvement in comparison speed.
I am creating a multi-dimensional vector (mathematical vector) where I allow basic mathematical operations +,-,/,*,=. The template takes in two parameters, one is the type (int, float etc.) while the other is the size of the vector. Currently I am applying the operations via a for loop. Now considering the size is known at compile time, will the compiler unroll the loop? If not, is there a way to unroll it with no (or minimal) performance penalty?
template <typename T, u32 size>
class Vector
{
public:
// Various functions for mathematical operations.
// The functions take in a Vector<T, size>.
// Example:
void add(const Vector<T, size>& vec)
{
for (u32 i = 0; i < size; ++i)
{
values[i] += vec[i];
}
}
private:
T values[size];
};
Before somebody comments "profile, then optimize", please note that this is the basis for my 3D graphics engine and it must be fast. Second, I want to know for the sake of educating myself.
You can use the following disassembly trick to see how a particular piece of code is compiled.
Vector<int, 16> a, b;
Vector<int, 65536> c, d;
asm("xxx"); // marker
a.add(b);
asm("yyy"); // marker
c.add(d);
asm("zzz"); // marker
Now compile
gcc -O3 1.cc -S -o 1.s
And see the disasm
xxx
# 0 "" 2
#NO_APP
movdqa 524248(%rsp), %xmm0
leaq 524248(%rsp), %rsi
paddd 524184(%rsp), %xmm0
movdqa %xmm0, 524248(%rsp)
movdqa 524264(%rsp), %xmm0
paddd 524200(%rsp), %xmm0
movdqa %xmm0, 524264(%rsp)
movdqa 524280(%rsp), %xmm0
paddd 524216(%rsp), %xmm0
movdqa %xmm0, 524280(%rsp)
movdqa 524296(%rsp), %xmm0
paddd 524232(%rsp), %xmm0
movdqa %xmm0, 524296(%rsp)
#APP
# 36 "1.cc" 1
yyy
# 0 "" 2
#NO_APP
leaq 262040(%rsp), %rdx
leaq -104(%rsp), %rcx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L2:
movdqa (%rcx,%rax), %xmm0
paddd (%rdx,%rax), %xmm0
movdqa %xmm0, (%rdx,%rax)
addq $16, %rax
cmpq $262144, %rax
jne .L2
#APP
# 38 "1.cc" 1
zzz
As you can see, the first call was small enough to be fully unrolled; the second one remained a loop.
First: Modern CPUs are pretty smart about predicting branches, so unrolling the loop might not help (and could even hurt).
Second: Yes, modern compilers know how to unroll a loop like this, if it is a good idea for your target CPU.
Third: Modern compilers can even auto-vectorize the loop, which is even better than unrolling.
Bottom line: Do not think you are smarter than your compiler unless you know a lot about CPU architecture. Write your code in a simple, straightforward way, and do not worry about micro-optimizations until your profiler tells you to.
The loop can be unrolled using recursive template instantiation. This may or may not be faster on your C++ implementation.
I adjusted your example slightly, so that it would compile.
typedef unsigned u32; // or something similar
template <typename T, u32 size>
class Vector
{
    // need to use an inner class, because member templates of an
    // unspecialized template cannot be explicitly specialized.
    template<typename Vec, u32 index>
    struct Inner
    {
        static void add(Vec& a, const Vec& b)
        {
            a.values[index] += b.values[index];
            // triggers recursive instantiation of Inner
            Inner<Vec, index-1>::add(a, b);
        }
    };

    // this specialization terminates the recursion
    template<typename Vec>
    struct Inner<Vec, 0>
    {
        static void add(Vec& a, const Vec& b)
        {
            a.values[0] += b.values[0];
        }
    };

public:
    // Various functions for mathematical operations.
    // The functions take in a Vector<T, size>.
    // Example:
    void add(const Vector<T, size>& vec)
    {
        Inner<Vector, size-1>::add(*this, vec);
    }
T values[size];
};
The only way to figure this out is to try it on your own compiler with your own optimization parameters. Make one test file with your "does it unroll" code, test.cpp:
#include "myclass.hpp"

void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
    a.add(b);
}
then a reference code snippet reference.cpp:
#include "myclass.hpp"

void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
    a[0] += b[0];
    a[1] += b[1];
    a[2] += b[2];
}
and now use GCC to compile them and spit out only the assembly:
for x in *.cpp; do g++ -c "$x" -Wall -Wextra -O2 -S -o "out/$x.s"; done
In my experience, GCC will by default unroll loops of 3 iterations or fewer when the trip count is known at compile time; -funroll-loops will make it unroll even larger loops.
First of all, it is not at all certain that unrolling the loop would be beneficial.
The only possible answer to your question is "it depends" (on the compiler flags, on the value of size, etc).
If you really want to know, ask your compiler: compile into assembly code with typical values of size and with the optimization flags you'd use for real, and examine the result.
Many compilers will unroll this loop, no idea if "the compiler" you are referring to will. There isn't just one compiler in the world.
If you want to guarantee that it's unrolled, then TMP (with inlining) can do that. (This is actually one of the more trivial applications of TMP, often used as an example of metaprogramming).
This question already has answers here:
Do temp variables slow down my program?
(5 answers)
Closed 5 years ago.
Let's say we have two functions:
int f();
int g();
I want to get the sum of f() and g().
First way:
int fRes = f();
int gRes = g();
int sum = fRes + gRes;
Second way:
int sum = f() + g();
Will be there any difference in performance in this two cases?
Same question for complex types instead of ints
EDIT
Do I understand correctly that I should not worry about performance in such cases (including frequently executed code) and should use temporary variables to improve readability and simplify the code?
You can answer questions like this for yourself by compiling to assembly language (with optimization on, of course) and inspecting the output. If I flesh your example out to a complete, compilable program...
extern int f();
extern int g();

int direct()
{
    return f() + g();
}

int indirect()
{
    int F = f();
    int G = g();
    return F + G;
}
and compile it (g++ -S -O2 -fomit-frame-pointer -fno-exceptions test.cc; the last two switches eliminate a bunch of distractions from the output), I get this (further distractions deleted):
__Z8indirectv:
pushq %rbx
call __Z1fv
movl %eax, %ebx
call __Z1gv
addl %ebx, %eax
popq %rbx
ret
__Z6directv:
pushq %rbx
call __Z1fv
movl %eax, %ebx
call __Z1gv
addl %ebx, %eax
popq %rbx
ret
As you can see, the code generated for both functions is identical, so the answer to your question is no, there will be no performance difference. Now let's look at complex numbers -- same code, but s/int/std::complex<double>/g throughout and #include <complex> at the top; same compilation switches --
__Z8indirectv:
subq $72, %rsp
call __Z1fv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 48(%rsp)
movq 8(%rsp), %rax
movq %rax, 56(%rsp)
call __Z1gv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 32(%rsp)
movq 8(%rsp), %rax
movq %rax, 40(%rsp)
movsd 48(%rsp), %xmm0
addsd 32(%rsp), %xmm0
movsd 56(%rsp), %xmm1
addsd 40(%rsp), %xmm1
addq $72, %rsp
ret
__Z6directv:
subq $72, %rsp
call __Z1gv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 32(%rsp)
movq 8(%rsp), %rax
movq %rax, 40(%rsp)
call __Z1fv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 48(%rsp)
movq 8(%rsp), %rax
movq %rax, 56(%rsp)
movsd 48(%rsp), %xmm0
addsd 32(%rsp), %xmm0
movsd 56(%rsp), %xmm1
addsd 40(%rsp), %xmm1
addq $72, %rsp
ret
That's a lot more instructions, and it looks like the compiler isn't doing a perfect optimization job, but nonetheless the code generated for the two functions is identical (apart from the order of the two calls).
In the second way the result of each call is stored in a temporary anyway, since a function's return value has to live somewhere. The difference becomes significant when you need the values of f() and g() more than once, in which case storing them in named variables instead of recalculating them each time can help.
If you have optimization turned off, there likely will be a difference. If you have it turned on, the two will likely compile to identical code. This is especially true if you declare fRes and gRes as const.
Because the compiler is allowed to elide the call to the copy constructor, the two versions will not differ in performance for complex types either.
Someone mentioned using fRes and gRes more than once. In that case, of course, not storing the results is potentially less optimal, since you would have to call f() or g() again.
As you wrote it, there's only a subtle difference (which another answer addresses, that there's a sequence point in the one vs the other).
They would be different if you had done this instead:
int fRes;
int gRes;
fRes = f();
gRes = g();
int sum = fRes + gRes;
(Imagine that int is actually some other type with a non-trivial default constructor.)
In the case here, you invoke default constructors and then assignment operators, which is potentially more work.
It depends entirely on what optimizations the compiler performs. The two could compile to slightly different or exactly the same machine code. Even if slightly different, you could not measure a statistically significant difference in time or space for these particular samples.
On my platform with full optimization turned on, a function returning the sum from both different cases compiled to exactly the same machine code.
The only minor difference between the two examples is that the first guarantees the order in which f() and g() are called, so in theory the second allows the compiler slightly more flexibility. Whether this ever makes a difference would depend on what f() and g() actually do and, perhaps, whether they can be inlined.
There is a slight difference between the two examples. In expression f() + g() there is no sequence point, whereas when the calls are made in different statements there are sequence points at the end of each statement.
The absence of a sequence point means the order in which the two functions are called is unspecified; they can be called in either order, which may give the compiler more room to optimize.