I am creating a multi-dimensional vector (a mathematical vector) that supports the basic mathematical operations +, -, /, *, =. The template takes two parameters: the element type (int, float, etc.) and the size of the vector. Currently I apply the operations via a for loop. Now, considering that the size is known at compile time, will the compiler unroll the loop? If not, is there a way to unroll it with no (or minimal) performance penalty?
template <typename T, u32 size>
class Vector
{
public:
    // Various functions for mathematical operations.
    // The functions take in a Vector<T, size>.
    // Example:
    void add(const Vector<T, size>& vec)
    {
        for (u32 i = 0; i < size; ++i)
        {
            values[i] += vec[i];
        }
    }

private:
    T values[size];
};
Before somebody comments "profile, then optimize": please note that this is the basis for my 3D graphics engine, so it must be fast. Second, I want to know for the sake of educating myself.
You can use the following trick with the generated assembly to see how particular code is compiled.
Vector<int, 16> a, b;
Vector<int, 65536> c, d;

asm("xxx"); // marker
a.add(b);
asm("yyy"); // marker
c.add(d);
asm("zzz"); // marker
Now compile:

gcc -O3 1.cc -S -o 1.s

and look at the resulting assembly:
xxx
# 0 "" 2
#NO_APP
movdqa 524248(%rsp), %xmm0
leaq 524248(%rsp), %rsi
paddd 524184(%rsp), %xmm0
movdqa %xmm0, 524248(%rsp)
movdqa 524264(%rsp), %xmm0
paddd 524200(%rsp), %xmm0
movdqa %xmm0, 524264(%rsp)
movdqa 524280(%rsp), %xmm0
paddd 524216(%rsp), %xmm0
movdqa %xmm0, 524280(%rsp)
movdqa 524296(%rsp), %xmm0
paddd 524232(%rsp), %xmm0
movdqa %xmm0, 524296(%rsp)
#APP
# 36 "1.cc" 1
yyy
# 0 "" 2
#NO_APP
leaq 262040(%rsp), %rdx
leaq -104(%rsp), %rcx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L2:
movdqa (%rcx,%rax), %xmm0
paddd (%rdx,%rax), %xmm0
movdqa %xmm0, (%rdx,%rax)
addq $16, %rax
cmpq $262144, %rax
jne .L2
#APP
# 38 "1.cc" 1
zzz
As you can see, the first loop was small enough to be unrolled; the second one remains a loop.
First: Modern CPUs are pretty smart about predicting branches, so unrolling the loop might not help (and could even hurt).
Second: Yes, modern compilers know how to unroll a loop like this, if it is a good idea for your target CPU.
Third: Modern compilers can even auto-vectorize the loop, which is even better than unrolling.
Bottom line: Do not think you are smarter than your compiler unless you know a lot about CPU architecture. Write your code in a simple, straightforward way, and do not worry about micro-optimizations until your profiler tells you to.
The loop can be unrolled using recursive template instantiation. This may or may not be faster on your C++ implementation.
I adjusted your example slightly, so that it would compile.
typedef unsigned u32; // or something similar

template <typename T, u32 size>
class Vector
{
    // Need to use an inner class, because member templates of an
    // unspecialized template cannot be explicitly specialized.
    template<typename Vec, u32 index>
    struct Inner
    {
        static void add(Vec& a, const Vec& b)
        {
            a.values[index] += b.values[index];
            // triggers recursive instantiation of Inner
            Inner<Vec, index-1>::add(a, b);
        }
    };

    // this specialization terminates the recursion
    template<typename Vec>
    struct Inner<Vec, 0>
    {
        static void add(Vec& a, const Vec& b)
        {
            a.values[0] += b.values[0];
        }
    };

public:
    // Various functions for mathematical operations.
    // The functions take in a const Vector<T, size>, since the
    // argument is not modified.
    // Example:
    void add(const Vector<T, size>& vec)
    {
        Inner<Vector, size-1>::add(*this, vec);
    }

    T values[size];
};
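For context, a hypothetical usage sketch (names and sizes are illustrative): once the compiler inlines the recursive calls, the chain flattens into straight-line adds.

// Hypothetical usage: with optimization enabled, the Inner<...>::add
// chain inlines into four consecutive += statements -- a fully
// unrolled loop with no branch or induction variable.
int main()
{
    Vector<float, 4> a, b;
    // ... fill a.values and b.values (public in this sketch) ...
    a.add(b); // equivalent to a.values[i] += b.values[i] for i = 0..3
}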
The only way to figure this out is to try it on your own compiler with your own optimization parameters. Make one test file with your "does it unroll" code, test.cpp:
#include "myclass.hpp"
void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
a.add( b );
}
then a reference code snippet reference.cpp:
#include "myclass.hpp"
void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
a[0] += b[0];
a[1] += b[1];
a[2] += b[2];
}
and now use GCC to compile them and spit out only the assembly:
for x in *.cpp; do g++ -c "$x" -Wall -Wextra -O2 -S -o "out/$x.s"; done
In my experience, GCC will by default unroll loops of three or fewer iterations when the trip count is known at compile time; -funroll-loops will make it unroll more aggressively.
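If you'd rather not change unrolling behavior for the whole translation unit, newer compilers also accept per-loop hints. A hedged sketch: #pragma GCC unroll needs GCC 8 or later, and Clang spells it #pragma unroll; either way it is a request, not a guarantee.

template <typename T, unsigned size>
void add_arrays(T (&dst)[size], const T (&src)[size])
{
    // Ask the compiler to unroll up to 8 iterations of this loop.
    #pragma GCC unroll 8
    for (unsigned i = 0; i < size; ++i)
        dst[i] += src[i];
}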
First of all, it is not at all certain that unrolling the loop would be beneficial.
The only possible answer to your question is "it depends" (on the compiler flags, on the value of size, etc).
If you really want to know, ask your compiler: compile into assembly code with typical values of size and with the optimization flags you'd use for real, and examine the result.
Many compilers will unroll this loop; I have no idea whether "the compiler" you are referring to will, since there isn't just one compiler in the world.
If you want to guarantee that it's unrolled, then TMP (with inlining) can do that. (This is actually one of the more trivial applications of TMP, often used as an example of metaprogramming).
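For reference, on a C++17 compiler the same guarantee needs no recursion; here is a hedged sketch (type and member names are illustrative) using pack expansion over an index sequence:

#include <utility>

template <typename T, unsigned size>
struct Vec
{
    T values[size];

    void add(const Vec& v)
    {
        add_impl(v, std::make_integer_sequence<unsigned, size>{});
    }

private:
    template <unsigned... I>
    void add_impl(const Vec& v, std::integer_sequence<unsigned, I...>)
    {
        // C++17 fold expression: expands to `size` discrete += statements,
        // i.e. the loop is unrolled by construction.
        ((values[I] += v.values[I]), ...);
    }
};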
Related
I want to write fast simd code to compute the multiplicative reduction of a complex array. In standard C this is:
#include <complex.h>
complex float f(complex float x[], int n) {
    complex float p = 1.0;
    for (int i = 0; i < n; i++)
        p *= x[i];
    return p;
}
n will be at most 50.
GCC can't auto-vectorize complex multiplication, but since I am happy to assume GCC, and if I knew I wanted to target SSE3, I could follow "How to enable sse3 autovectorization in gcc" and write:
typedef float v4sf __attribute__ ((vector_size (16)));

typedef union {
    v4sf v;
    float e[4];
} float4;

typedef struct {
    float4 x;
    float4 y;
} complex4;

static complex4 complex4_mul(complex4 a, complex4 b) {
    return (complex4){a.x.v*b.x.v - a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}

complex4 f4(complex4 x[], int n) {
    v4sf one = {1, 1, 1, 1};
    complex4 p = {one, one};
    for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
    return p;
}
This indeed produces fast, vectorized assembly code using gcc, although you still need to pad your input to a multiple of 4. The assembly you get is:
.L3:
vmovaps xmm0, XMMWORD PTR 16[rsi]
add rsi, 32
vmulps xmm1, xmm0, xmm2
vmulps xmm0, xmm0, xmm3
vfmsubps xmm1, xmm3, XMMWORD PTR -32[rsi], xmm1
vmovaps xmm3, xmm1
vfmaddps xmm2, xmm2, XMMWORD PTR -32[rsi], xmm0
cmp rdx, rsi
jne .L3
However, it is tied to one exact SIMD instruction set and is not optimal for AVX2 or AVX512, for example, for which you would need to change the code.

How can you write C or C++ code for which gcc will produce optimal code when compiled for any of SSE, AVX2 or AVX512? That is, do you always have to write separate functions by hand for each SIMD register width?
Are there any open source libraries that make this easier?
Here would be an example using the Eigen library:
#include <Eigen/Core>
std::complex<float> f(const std::complex<float> *x, int n)
{
    return Eigen::VectorXcf::Map(x, n).prod();
}
If you compile this with clang or g++ and sse or avx enabled (and -O2), you should get fairly decent machine code. It also works for some other architectures like Altivec or NEON. If you know that the first entry of x is aligned, you can use MapAligned instead of Map.
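For instance, a minimal aligned variant might look like this (a sketch; the name f_aligned is illustrative, and it assumes x really is aligned to the packet size, e.g. 32 bytes for AVX, otherwise the behavior is undefined):

std::complex<float> f_aligned(const std::complex<float> *x, int n)
{
    // Same reduction, but tells Eigen it may use aligned loads.
    return Eigen::VectorXcf::MapAligned(x, n).prod();
}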
You get even better code, if you happen to know the size of your vector at compile time using this:
template<int n>
std::complex<float> f(const std::complex<float> *x)
{
    return Eigen::Matrix<std::complex<float>, n, 1>::MapAligned(x).prod();
}
Note: The functions above directly correspond to the function f of the OP.
However, as @PeterCordes pointed out, it is generally bad to store complex numbers interleaved, since this requires lots of shuffling for multiplication. Instead, one should store the real and imaginary parts so that each can be loaded one packet at a time.
Edit/Addendum: To implement a structure-of-arrays like complex multiplication, you can actually write something like:
typedef Eigen::Array<float, 8, 1> v8sf; // Eigen::Array allows element-wise standard operations
typedef std::complex<v8sf> complex8;

complex8 prod(const complex8& a, const complex8& b)
{
    return a*b;
}
Or more generic (using C++11):
template<int size, typename Scalar = float>
using complexX = std::complex<Eigen::Array<Scalar, size, 1> >;

template<int size>
complexX<size> prod(const complexX<size>& a, const complexX<size>& b)
{
    return a*b;
}
When compiled with -mavx -O2, this compiles to something like this (using g++-5.4):
vmovaps 32(%rsi), %ymm1
movq %rdi, %rax
vmovaps (%rsi), %ymm0
vmovaps 32(%rdi), %ymm3
vmovaps (%rdi), %ymm4
vmulps %ymm0, %ymm3, %ymm2
vmulps %ymm4, %ymm1, %ymm5
vmulps %ymm4, %ymm0, %ymm0
vmulps %ymm3, %ymm1, %ymm1
vaddps %ymm5, %ymm2, %ymm2
vsubps %ymm1, %ymm0, %ymm0
vmovaps %ymm2, 32(%rdi)
vmovaps %ymm0, (%rdi)
vzeroupper
ret
For reasons not obvious to me, this is actually hidden inside a method which is called by the actual method and which just moves some memory around -- I don't know why Eigen/gcc does not assume that the arguments are already properly aligned. If I compile the same code with clang 3.8.0 (and the same arguments), it is compiled to just:
vmovaps (%rsi), %ymm0
vmovaps %ymm0, (%rdi)
vmovaps 32(%rsi), %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps (%rdi), %ymm1
vmovaps (%rdx), %ymm2
vmovaps 32(%rdx), %ymm3
vmulps %ymm2, %ymm1, %ymm4
vmulps %ymm3, %ymm0, %ymm5
vsubps %ymm5, %ymm4, %ymm4
vmulps %ymm3, %ymm1, %ymm1
vmulps %ymm0, %ymm2, %ymm0
vaddps %ymm1, %ymm0, %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps %ymm4, (%rdi)
movq %rdi, %rax
vzeroupper
retq
Again, the memory movement at the beginning is weird, but at least it is vectorized. For both gcc and clang this gets optimized away when called in a loop, however:
complex8 f8(complex8 x[], int n) {
    if (n == 0)
        return complex8(v8sf::Ones(), v8sf::Zero()); // I guess you want p = 1 + 0*i at the beginning?
    complex8 p = x[0];
    for (int i = 1; i < n; i++) p = prod(p, x[i]);
    return p;
}
The difference here is that clang will unroll that outer loop to 2 multiplications per loop. On the other hand, gcc will use fused-multiply-add instructions when compiled with -mfma.
The f8 function can of course also be generalized to arbitrary dimensions:
template<int size>
complexX<size> fX(complexX<size> x[], int n) {
    using S = typename complexX<size>::value_type;
    if (n == 0)
        return complexX<size>(S::Ones(), S::Zero());
    complexX<size> p = x[0];
    for (int i = 1; i < n; i++) p *= x[i];
    return p;
}
And for reducing the complexX<N> to a single std::complex the following function can be used:
// only works for powers of two
template<int size> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<size>& var) {
    complexX<size/2> a(var.real().template head<size/2>(), var.imag().template head<size/2>());
    complexX<size/2> b(var.real().template tail<size/2>(), var.imag().template tail<size/2>());
    return redux(a*b);
}

template<> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<1>& var) {
    return std::complex<float>(var.real()[0], var.imag()[0]);
}
However, depending on whether I use clang or g++, I get quite different assembler output. Overall, g++ has a tendency to fail to inline loading the input arguments, and clang fails to use FMA operations (YMMV ...)
Essentially, you need to inspect the generated assembler code anyway. And more importantly, you should benchmark the code (not sure, how much impact this routine has in your overall problem).
Also, I want to note that Eigen actually is a linear algebra library. Exploiting it for pure portable SIMD code generation is not really what it is designed for.
If portability is your main concern, there are many libraries which provide SIMD instructions with their own syntax. Most of them make explicit vectorization simpler and more portable than intrinsics. The library UME::SIMD was published recently and shows great performance.

In the UME::SIMD paper, an interface based on Vc was established, named UME::SIMD. It allows the programmer to access SIMD capabilities without the need for extensive knowledge of SIMD ISAs. UME::SIMD provides a simple, flexible and portable abstraction for explicit vectorization without performance losses compared to intrinsics.
I don't think there is a fully general solution for this. You can increase your "vector_size" to 32 bytes:

typedef float v8sf __attribute__ ((vector_size (32)));
Also increase all arrays to have 8 elements:
typedef float v8sf __attribute__ ((vector_size (32)));

typedef union {
    v8sf v;
    float e[8];
} float8;

typedef struct {
    float8 x;
    float8 y;
} complex8;

static complex8 complex8_mul(complex8 a, complex8 b) {
    return (complex8){a.x.v*b.x.v - a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
This lets the compiler generate 256-bit AVX code (don't forget to add -mavx; 64-byte vectors with -mavx512f would similarly target AVX512), but it will make your code slightly worse in SSE builds by making memory transfers sub-optimal. However, it will certainly not disable SSE vectorization.
You could keep both versions (with 4 and with 8 array elements), switching between them by some flag, but it might be too tedious for little benefit.
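If you did want to try it, a sketch of the switch might look like this (hedged: SIMD_WIDTH is a hypothetical build-time macro, e.g. -DSIMD_WIDTH=8 when compiling with -mavx and 4 for plain SSE):

#ifndef SIMD_WIDTH
#define SIMD_WIDTH 4 /* floats per vector; 4 = SSE, 8 = AVX */
#endif

typedef float vnsf __attribute__ ((vector_size (SIMD_WIDTH * sizeof(float))));

typedef struct {
    vnsf x; /* real parts */
    vnsf y; /* imaginary parts */
} complexn;

static complexn complexn_mul(complexn a, complexn b) {
    /* element-wise (a.x + i*a.y) * (b.x + i*b.y) */
    return (complexn){a.x*b.x - a.y*b.y, a.y*b.x + a.x*b.y};
}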
I have implemented two functions to compute the outer product of two Vectors (not std::vector); one is a member function and the other is a global function. Here is the key code (additional parts are omitted):
// member function
template <typename Scalar>
SquareMatrix<Scalar,3> Vector<Scalar,3>::outerProduct(const Vector<Scalar,3>& vec3) const
{
    SquareMatrix<Scalar,3> result;
    for (unsigned int i = 0; i < 3; ++i)
        for (unsigned int j = 0; j < 3; ++j)
            result(i,j) = (*this)[i]*vec3[j];
    return result;
}

// global function: Dim = 3
template <typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim>& v1, const Vector<Scalar, Dim>& v2, SquareMatrix<Scalar, Dim>& m)
{
    for (unsigned int i = 0; i < Dim; i++)
        for (unsigned int j = 0; j < Dim; j++)
        {
            m(i,j) = v1[i]*v2[j];
        }
}
They are almost the same, except that one is a member function with a return value and the other is a global function where the computed values are assigned directly to a square matrix, thus requiring no return value.

Actually, I meant to replace the member function with the global one to improve performance, since the former involves copy operations. The strange thing, however, is that the global function takes almost twice as long as the member function. Furthermore, I find that the execution of
m(i,j) = v1[i]*v2[j]; // in global function
requires much more time than that of
result(i,j) = (*this)[i]*vec3[j]; // in member function
So the question is: how does this performance difference between the member and the global function arise? Can anyone explain the reasons?

I hope I have presented my question clearly; apologies for my poor English!
//----------------------------------------------------------------------------------------
More information added:
The following is the code I use to test the performance:
// the code below is inside a loop
Vector<double, 3> vec1;
Vector<double, 3> vec2;
Timer timer;

timer.startTimer();
for (unsigned int i = 0; i < 100000; i++)
{
    SquareMatrix<double,3> m = vec1.outerProduct(vec2);
}
timer.stopTimer();
std::cout << "time cost for member function: " << timer.getElapsedTime() << std::endl;

timer.startTimer();
SquareMatrix<double,3> m;
for (unsigned int i = 0; i < 100000; i++)
{
    outerProduct(vec1, vec2, m);
}
timer.stopTimer();
std::cout << "time cost for global function: " << timer.getElapsedTime() << std::endl;

std::system("pause");
The captured result shows that the member function is almost twice as fast as the global one.
Additionally, my project is built on a 64-bit Windows system, and the code is actually used to generate static lib files via the Scons build tool, with VS2010 project files produced alongside.
I should point out that this strange performance difference only occurs in a release build; in a debug build, the global function is almost five times faster than the member one (about 0.10s vs 0.02s).
One possible explanation:
With inlining, in the first case the compiler may know that result(i, j) (a local variable) doesn't alias this[i] or vec3[j], so the Scalar arrays of this and vec3 are known not to be modified.

In the second case, from the function's point of view the arguments may alias, so each write into m might modify the Scalars of v1 or v2; therefore neither v1[i] nor v2[j] can be cached in a register.

You may try the restrict keyword extension to check whether my hypothesis is correct.
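For example, a sketch of that experiment (hedged: restrict is not standard C++; GCC/Clang spell it __restrict__ or __restrict, MSVC __restrict, and applying it to references is a compiler extension):

// If aliasing is the culprit, this version should run about as fast as
// the member function, because the compiler may now cache v1[i] across
// the inner loop instead of reloading it after every write to m.
template <typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim>& __restrict v1,
                  const Vector<Scalar, Dim>& __restrict v2,
                  SquareMatrix<Scalar, Dim>& __restrict m)
{
    for (unsigned int i = 0; i < Dim; i++)
        for (unsigned int j = 0; j < Dim; j++)
            m(i, j) = v1[i] * v2[j];
}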
EDIT: loop elision in the original assembly has been corrected
[paraphrased] Why is the performance different between the member function and the static function?
I'll start with the simplest things mentioned in your question, and progress to the more nuanced points of performance testing / analysis.
It is a bad idea to measure the performance of debug builds. Compilers take liberties in many places, such as zeroing uninitialized arrays, generating extra code that isn't strictly necessary, and (obviously) not performing any optimizations beyond trivial ones such as constant propagation. This leads to the next point...
Always look at the assembly. C and C++ are high level languages when it comes to the subtleties of performance. Many people even consider x86 assembly a high level language since each instruction is decomposed into possibly several micro-ops during decoding. You cannot tell what the computer is doing just by looking at C++ code. For example, depending on how you implemented SquareMatrix, the compiler may or may not be able to perform copy elision during optimization.
Entering the somewhat more nuanced topics when testing for performance...
Make sure the compiler is actually generating loops. Using your example test code, g++ 4.7.2 doesn't actually generate loops with my implementation of SquareMatrix and Vector. I implemented them to initialize all components to 0.0, so the compiler can statically determine that the values never change, and it therefore emits a single set of mov instructions instead of a loop. In my example code, I use COMPILER_NOP, which (with gcc) is __asm__ __volatile__("":::), inside the loop to prevent this (as compilers cannot predict side effects from manual assembly and so cannot elide the loop). Edit: I DO use COMPILER_NOP, but since the output values from the functions are never used, the compiler is still able to remove the bulk of the work from the loop and reduce the loop to this:
.L7:
subl $1, %eax
jne .L7
I have corrected this by performing additional operations inside the loop. The loop now assigns a value from the output to the inputs, preventing this optimization and forcing the loop to cover what was originally intended.
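A sketch of that adjustment (hypothetical: it assumes Vector exposes a writable operator[]; any feedback from output to input works):

// Feeding a result back into an input makes the loop body "used",
// so the optimizer can no longer reduce the loop to a bare counter.
for (int i = 0; i < 100000; ++i) {
    SquareMatrix<double, 3> m = vec1.outerProduct(vec2);
    vec1[0] = m(0, 0); // loop-carried dependence: the work must be done
    COMPILER_NOP();    // and must not be reordered across the marker
}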
To (finally) get around to answering your question: When I implemented the rest of what is needed to get your code to run, and verified by checking the assembly that loops are actually generated, the two functions execute in the same amount of time. They even have nearly identical implementations in assembly.
Here's the assembly for the member function:
movsd 32(%rsp), %xmm7
movl $100000, %eax
movsd 24(%rsp), %xmm5
movsd 8(%rsp), %xmm6
movapd %xmm7, %xmm12
movsd (%rsp), %xmm4
movapd %xmm7, %xmm11
movapd %xmm5, %xmm10
movapd %xmm5, %xmm9
mulsd %xmm6, %xmm12
mulsd %xmm4, %xmm11
mulsd %xmm6, %xmm10
mulsd %xmm4, %xmm9
movsd 40(%rsp), %xmm1
movsd 16(%rsp), %xmm0
jmp .L7
.p2align 4,,10
.p2align 3
.L12:
movapd %xmm3, %xmm1
movapd %xmm2, %xmm0
.L7:
movapd %xmm0, %xmm8
movapd %xmm1, %xmm3
movapd %xmm1, %xmm2
mulsd %xmm1, %xmm8
movapd %xmm0, %xmm1
mulsd %xmm6, %xmm3
mulsd %xmm4, %xmm2
mulsd %xmm7, %xmm1
mulsd %xmm5, %xmm0
subl $1, %eax
jne .L12
and the assembly for the static function:
movsd 32(%rsp), %xmm7
movl $100000, %eax
movsd 24(%rsp), %xmm5
movsd 8(%rsp), %xmm6
movapd %xmm7, %xmm12
movsd (%rsp), %xmm4
movapd %xmm7, %xmm11
movapd %xmm5, %xmm10
movapd %xmm5, %xmm9
mulsd %xmm6, %xmm12
mulsd %xmm4, %xmm11
mulsd %xmm6, %xmm10
mulsd %xmm4, %xmm9
movsd 40(%rsp), %xmm1
movsd 16(%rsp), %xmm0
jmp .L9
.p2align 4,,10
.p2align 3
.L13:
movapd %xmm3, %xmm1
movapd %xmm2, %xmm0
.L9:
movapd %xmm0, %xmm8
movapd %xmm1, %xmm3
movapd %xmm1, %xmm2
mulsd %xmm1, %xmm8
movapd %xmm0, %xmm1
mulsd %xmm6, %xmm3
mulsd %xmm4, %xmm2
mulsd %xmm7, %xmm1
mulsd %xmm5, %xmm0
subl $1, %eax
jne .L13
In conclusion: you probably need to tighten your code up a bit before you can tell whether the implementations differ on your system. Make sure your loops are actually being generated (look at the assembly) and check whether the compiler is able to elide the return value from the member function.
If those things are true and you still see differences, can you post the implementations here for SquareMatrix and Vector so we can give you some more info?
Full code, a makefile, and the generated assembly for my working example is available as a GitHub gist.
Explicit instantiations of template function produce performance difference?
Some experiments I have done to investigate the performance difference:
1.
Firstly, I suspected that the performance difference might be caused by the implementation itself. In fact we have two sets of implementations: one written by ourselves (quite similar to the code by @black), and another implemented as a wrapper over Eigen::Matrix, selected by a macro switch. But switching between these two implementations does not change anything: the global function is still slower than the member one.
2.
Since this code (class Vector<Scalar, Dim> and SquareMatrix<Scalar, Dim>) lives in a large project, I guessed that the performance difference might be influenced by other code (though I thought it unlikely, it was still worth a try). So I extracted all the necessary code (using our own implementation) and put it into a manually-created VS2010 project. Surprisingly, but consistent with what @black and @Myles Hathcock found, the global function is now slightly faster than the member one, even though I left the implementation unchanged.
3.
In our project, outerProduct goes into a release lib file, while my manually-created project directly produces .obj files linked into the .exe. To exclude this difference, I used the extracted code to produce a lib file through VS2010 and applied this lib in another VS project to test the performance; still, the global function is slightly faster than the member one. So both versions have the same implementation and both are put into lib files, one produced by Scons and the other by the VS project, yet they perform differently. Is Scons causing this problem?
4.
For the code shown in my question, the global function outerProduct is declared and defined in a .h file and then #included by a .cpp file, so outerProduct is instantiated when that .cpp file is compiled. But things change if I restructure it as follows (note that this code is now compiled by Scons into a lib file, not by the manually-created VS2010 project):
First, I declare the global function outerProduct in the .h file:

// outerProduct.h
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim>& v1, const Vector<Scalar, Dim>& v2, SquareMatrix<Scalar, Dim>& m);
then, in the .cpp file:

// outerProduct.cpp
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim>& v1, const Vector<Scalar, Dim>& v2, SquareMatrix<Scalar, Dim>& m)
{
    for (unsigned int i = 0; i < Dim; i++)
        for (unsigned int j = 0; j < Dim; j++)
        {
            m(i,j) = v1[i]*v2[j];
        }
}
Since it is a template function, it requires explicit instantiations:

// outerProduct.cpp
template void outerProduct<double, 3>(const Vector<double, 3>&, const Vector<double, 3>&, SquareMatrix<double, 3>&);
template void outerProduct<float, 3>(const Vector<float, 3>&, const Vector<float, 3>&, SquareMatrix<float, 3>&);
Finally, in the .cpp file calling this function:

// use_outerProduct.cpp
#include "outerProduct.h" // note: outerProduct.cpp is not needed
...
outerProduct(v1, v2, m);
...
The strange thing now is that the global function is finally slightly faster than the member one, as the captured timings show.

But this only happens in the Scons environment; in the manually-created VS2010 project, the global function is always slightly faster than the member one. So does this performance difference only result from the Scons environment? And does it become normal once the template function is explicitly instantiated?

Things are still strange! It seems Scons has done something I didn't expect.
//------------------------------------------------------------------------
Additionally, the test code has now been changed to the following to avoid loop elision:
Vector<double, 3> vec1(0.0);
Vector<double, 3> vec2(1.0);
Timer timer;

while (true)
{
    timer.startTimer();
    for (unsigned int i = 0; i < 100000; i++)
    {
        vec1 = Vector<double, 3>(i);
        SquareMatrix<double,3> m = vec1.outerProduct(vec2);
    }
    timer.stopTimer();
    cout << "time cost for member function: " << timer.getElapsedTime() << endl;

    timer.startTimer();
    SquareMatrix<double,3> m;
    for (unsigned int i = 0; i < 100000; i++)
    {
        vec1 = Vector<double, 3>(i);
        outerProduct(vec1, vec2, m);
    }
    timer.stopTimer();
    cout << "time cost for global function: " << timer.getElapsedTime() << endl;

    system("pause");
}
@black, @Myles Hathcock: great thanks to you warm-hearted people!

@Myles Hathcock, your explanation is a really subtle and abstruse one, but I think I will benefit a lot from it.
Finally, the entire implementation is at

https://github.com/FeiZhu/Physika

which is a physics engine we are developing, and where you can find more info including the whole source code. Vector and SquareMatrix are defined in the Physika_Src/Physika_Core folder. The global function outerProduct is not uploaded, but you can add it somewhere appropriate.
The following code copies one array of zeros, interpreted as floats, to another, and prints the timing of this operation. As I've seen many cases where no-op loops are simply optimized away by compilers, including gcc, I expected that at some point of changing my copy-arrays program it would stop doing the copying.
#include <iostream>
#include <cstring>
#include <time.h> // clock_gettime, CLOCK_MONOTONIC (link with -lrt)

static inline long double currentTime()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + (long double)(ts.tv_nsec)*1e-9;
}

int main()
{
    size_t W = 20000, H = 10000;
    float* data1 = new float[W*H];
    float* data2 = new float[W*H];
    memset(data1, 0, W*H*sizeof(float));
    memset(data2, 0, W*H*sizeof(float));

    long double time1 = currentTime();
    for (int q = 0; q < 16; ++q) // take more time
        for (int k = 0; k < W*H; ++k)
            data2[k] = data1[k];
    long double time2 = currentTime();

    std::cout << (time2-time1)*1e+3 << " ms\n";

    delete[] data1;
    delete[] data2;
}
I compiled this with g++ 4.8.1 using the command g++ main.cpp -o test -std=c++0x -O3 -lrt. This program prints 6952.17 ms for me. (I had to set ulimit -s 2000000 for it not to crash.)

I also tried changing the creation of the arrays from new to automatic VLAs and removing the memsets, but this doesn't change g++'s behavior (apart from changing the timings by several times).

It seems the compiler could prove that this code does nothing sensible, so why didn't it optimize the loop away?
Anyway it isn't impossible (clang++ version 3.3):
clang++ main.cpp -o test -std=c++0x -O3 -lrt
The program prints 0.000367 ms for me... and looking at the assembly language:
...
callq clock_gettime
movq 56(%rsp), %r14
movq 64(%rsp), %rbx
leaq 56(%rsp), %rsi
movl $1, %edi
callq clock_gettime
...
while for g++:
...
call clock_gettime
fildq 32(%rsp)
movl $16, %eax
fildq 40(%rsp)
fmull .LC0(%rip)
faddp %st, %st(1)
.p2align 4,,10
.p2align 3
.L2:
movl $1, %ecx
xorl %edx, %edx
jmp .L5
.p2align 4,,10
.p2align 3
.L3:
movq %rcx, %rdx
movq %rsi, %rcx
.L5:
leaq 1(%rcx), %rsi
movss 0(%rbp,%rdx,4), %xmm0
movss %xmm0, (%rbx,%rdx,4)
cmpq $200000001, %rsi
jne .L3
subl $1, %eax
jne .L2
fstpt 16(%rsp)
leaq 32(%rsp), %rsi
movl $1, %edi
call clock_gettime
...
EDIT (g++ v4.8.2 / clang++ v3.3)
SOURCE CODE - ORIGINAL VERSION (1)
...
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
...
SOURCE CODE - MODIFIED VERSION (2)
...
const size_t W=20000;
const size_t H=10000;
float data1[W*H];
float data2[W*H];
...
Now the case that isn't optimized is (1) + g++
The code in this question has changed quite a bit, invalidating correct answers. This answer applies to the 5th version: as the code currently attempts to read uninitialized memory, an optimizer may reasonably assume that unexpected things are happening.
Many optimization steps have a similar pattern: there's a pattern of instructions that's matched to the current state of compilation. If the pattern matches at some point, the matched pattern is (parametrically) replaced by a more efficient version. A very simple example of such a pattern is the definition of a variable that's not subsequently used; the replacement in this case is simply a deletion.
These patterns are designed for correct code. On incorrect code, the patterns may simply fail to match, or they may match in entirely unintended ways. The first case leads to no optimization; the second case may lead to totally unpredictable results (certainly if the modified code is further optimized).
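As a toy illustration of such a pattern (hedged: real optimizer patterns are considerably more elaborate), both functions below typically compile to identical code at -O2, because the unused local matches a dead-code pattern and is deleted:

int with_dead_code(int x)
{
    int unused = x * 2; // defined, never used: matches the pattern
    return x + 1;
}

int without_dead_code(int x)
{
    return x + 1; // what the first function is effectively rewritten into
}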
Why do you expect the compiler to optimise this? It’s generally really hard to prove that writes to arbitrary memory addresses are a “no-op”. In your case it would be possible, but it would require the compiler to trace the heap memory addresses through new (which is once again hard since these addresses are generated at runtime) and there really is no incentive for doing this.
After all, you tell the compiler explicitly that you want to allocate memory and write to it. How is the poor compiler to know that you’ve been lying to it?
In particular, the problem is that the heap memory could be aliased by lots of other things. It happens to be private to your process, but like I said above, proving this is a lot of work for the compiler, unlike for function-local memory.
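A contrasting sketch (an assumption about typical behavior, not a guarantee): with function-local arrays the compiler can trivially see that nothing escapes, so the dead copy is usually removed, whereas the new[]-based version forces it to reason about heap aliasing:

void local_copy()
{
    float a[64] = {}, b[64] = {};
    for (int i = 0; i < 64; ++i)
        b[i] = a[i]; // nothing escapes: typically elided entirely at -O3
}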
The only way in which the compiler could know that this is a no-op is if it knew what memset does. In order for that to happen, the function must either be defined in a header (and it typically isn't), or it must be treated as a special intrinsic by the compiler. But barring those tricks, the compiler just sees a call to an unknown function which could have side effects and do different things for each of the two calls.
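For illustration (hedged): GCC does treat memset as a builtin by default, which is what gives it a chance to reason about such calls; compiling with -fno-builtin-memset turns it back into an opaque external call:

#include <string.h>

int sum_after_memset(void)
{
    int buf[16];
    memset(buf, 0, sizeof buf); // semantics known only as a builtin
    int s = 0;
    for (int i = 0; i < 16; ++i)
        s += buf[i];            // foldable to 0 when memset is understood
    return s;
}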
The sorting algorithm of this question becomes twice as fast(!) if -fprofile-arcs is enabled in gcc (4.7.2). The heavily simplified C code of that question (it turned out that I can initialize the array with all zeros; the weird performance behavior remains, but it makes the reasoning much, much simpler):
#include <time.h>
#include <stdio.h>

#define ELEMENTS 100000

int main() {
    int a[ELEMENTS] = { 0 };
    clock_t start = clock();
    for (int i = 0; i < ELEMENTS; ++i) {
        int lowerElementIndex = i;
        for (int j = i+1; j < ELEMENTS; ++j) {
            if (a[j] < a[lowerElementIndex]) {
                lowerElementIndex = j;
            }
        }
        int tmp = a[i];
        a[i] = a[lowerElementIndex];
        a[lowerElementIndex] = tmp;
    }
    clock_t end = clock();
    float timeExec = (float)(end - start) / CLOCKS_PER_SEC;
    printf("Time: %2.3f\n", timeExec);
    printf("ignore this line %d\n", a[ELEMENTS-1]);
}
After playing with the optimization flags for a long while, it turned out that -ftree-vectorize also yields this weird behavior, so we can take -fprofile-arcs out of the question. After profiling with perf, I found that the only relevant difference is:
Fast case gcc -std=c99 -O2 simp.c (runs in 3.1s)
cmpl %esi, %ecx
jge .L3
movl %ecx, %esi
movslq %edx, %rdi
.L3:
Slow case gcc -std=c99 -O2 -ftree-vectorize simp.c (runs in 6.1s)
cmpl %ecx, %esi
cmovl %edx, %edi
cmovl %esi, %ecx
As for the first snippet: Given that the array only contains zeros, we always jump to .L3. It can greatly benefit from branch prediction.
I guess the cmovl instructions cannot benefit from branch prediction.
Questions:
Are all my above guesses correct? Does this make the algorithm slow?

If yes, how can I prevent gcc from emitting this instruction (other than the trivial -fno-tree-vectorize workaround, of course) while still doing as many optimizations as possible?

What is -ftree-vectorize? The documentation is quite vague; I would need a little more explanation to understand what's happening.
Update: Since it came up in comments: The weird performance behavior w.r.t. the -ftree-vectorize flag remains with random data. As Yakk points out, for selection sort, it is actually hard to create a dataset that would result in a lot of branch mispredictions.
Since it also came up: I have a Core i5 CPU.
Based on Yakk's comment, I created a test. The code below (online, without boost) is of course no longer a sorting algorithm; I only took out the inner loop. Its only goal is to examine the effect of branch prediction: we skip the if branch in the for loop with probability p.
#include <algorithm>
#include <cstdio>
#include <random>
#include <boost/chrono.hpp>

using namespace std;
using namespace boost::chrono;

constexpr int ELEMENTS = 1e+8;
constexpr double p = 0.50;

int main() {
    printf("p = %.2f\n", p);
    int* a = new int[ELEMENTS];
    mt19937 mt(1759);
    bernoulli_distribution rnd(p);
    for (int i = 0; i < ELEMENTS; ++i) {
        a[i] = rnd(mt) ? i : -i;
    }
    auto start = high_resolution_clock::now();
    int lowerElementIndex = 0;
    for (int i = 0; i < ELEMENTS; ++i) {
        if (a[i] < a[lowerElementIndex]) {
            lowerElementIndex = i;
        }
    }
    auto finish = high_resolution_clock::now();
    printf("%ld ms\n", duration_cast<milliseconds>(finish-start).count());
    printf("Ignore this line %d\n", a[lowerElementIndex]);
    delete[] a;
}
The loops of interest:
This will be referred to as cmov
g++ -std=c++11 -O2 -lboost_chrono -lboost_system -lrt branch3.cpp
xorl %eax, %eax
.L30:
movl (%rbx,%rbp,4), %edx
cmpl %edx, (%rbx,%rax,4)
movslq %eax, %rdx
cmovl %rdx, %rbp
addq $1, %rax
cmpq $100000000, %rax
jne .L30
This will be referred to as no cmov, the -fno-if-conversion flag was pointed out by Turix in his answer.
g++ -std=c++11 -O2 -fno-if-conversion -lboost_chrono -lboost_system -lrt branch3.cpp
xorl %eax, %eax
.L29:
movl (%rbx,%rbp,4), %edx
cmpl %edx, (%rbx,%rax,4)
jge .L28
movslq %eax, %rbp
.L28:
addq $1, %rax
cmpq $100000000, %rax
jne .L29
The difference side by side
cmpl %edx, (%rbx,%rax,4) | cmpl %edx, (%rbx,%rax,4)
movslq %eax, %rdx | jge .L28
cmovl %rdx, %rbp | movslq %eax, %rbp
| .L28:
The execution time as a function of the Bernoulli parameter p
The code with the cmov instruction is absolutely insensitive to p. The code without the cmov instruction is the winner if p < 0.26 or 0.81 < p, and is at most 4.38x faster (at p = 1). Of course, the worst situation for the branch predictor is around p = 0.5, where the code is 1.58x slower than the code with the cmov instruction.
Note: Answered before graph update was added to the question; some assembly code references here may be obsolete.
(Adapted and extended from our above chat, which was stimulating enough to cause me to do a bit more research.)
First (as per our above chat), it appears that the answer to your first question is "yes". In the vector "optimized" code, the optimization (negatively) affecting performance is branch predication, whereas in the original code the performance is (positively) affected by branch prediction. (Note the extra 'a' in the former.)
Re your 3rd question: Even though in your case, there is actually no vectorization being done, from step 11 ("Conditional Execution") here it appears that one of the steps associated with vectorization optimizations is to "flatten" conditionals within targeted loops, like this bit in your loop:
if (a[j] < a[lowerElementIndex]
lowerElementIndex = j;
Apparently, this happens even if there is no vectorization.
This explains why the compiler is using the conditional move instructions (cmovl). The goal there is to avoid a branch entirely (as opposed to trying to predict it correctly). Instead, the two cmovl instructions will be sent down the pipeline before the result of the previous cmpl is known and the comparison result will then be "forwarded" to enable/prevent the moves prior to their writeback (i.e., prior to them actually taking effect).
Note that if the loop had been vectorized, this might have been worth it to get to the point where multiple iterations through the loop could effectively be accomplished in parallel.
However, in your case, the attempt at optimization actually backfires because in the flattened loop, the two conditional moves are sent through the pipeline every single time through the loop. This in itself might not be so bad either, except that there is a RAW data hazard that causes the second move (cmovl %esi, %ecx) to have to wait until the array/memory access (movl (%rsp,%rsi,4), %esi) is completed, even if the result is going to be ultimately ignored. Hence the huge time spent on that particular cmovl. (I would expect this is an issue with your processor not having complex enough logic built into its predication/forwarding implementation to deal with the hazard.)
On the other hand, in the non-optimized case, as you rightly figured out, branch prediction can help to avoid having to wait on the result of the corresponding array/memory access there (the movl (%rsp,%rcx,4), %ecx instruction). In that case, when the processor correctly predicts a taken branch (which for an all-0 array will be every single time, and even in a random array should still be more than half the time, per @Yakk's comment), it does not have to wait for the memory access to finish before queueing up the next few instructions in the loop. So on correct predictions you get a boost, whereas on incorrect predictions the result is no worse than in the "optimized" case and, furthermore, better, because of the ability to sometimes avoid having the two "wasted" cmovl instructions in the pipeline.
[The following was removed due to my mistaken assumption about your processor per your comment.]
Back to your questions, I would suggest looking at that link above for more on the flags relevant to vectorization, but in the end, I'm pretty sure that it's fine to ignore that optimization given that your Celeron isn't capable of using it (in this context) anyway.
[Added after above was removed]
Re your second question ("...how can I prevent gcc from emitting this instruction..."): you could try the -fno-if-conversion and -fno-if-conversion2 flags (not sure if these always work -- they no longer work on my mac), although I do not think your problem is with the cmovl instruction in general (i.e., I wouldn't always use those flags), just with its use in this particular context (where branch prediction is going to be very helpful, given @Yakk's point about your sort algorithm).
If you have a floating point number double num_float = 5.0; and the following two conditionals:

if (num_float > 3)
{
    // ...
}

if (num_float > 3.0)
{
    // ...
}
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating-point value, or would there really be no difference at all?

Obviously I'm assuming the time delay would be negligible at best, but compounded in a while(1) loop I suppose a decent chunk of time could be lost over the long run (if it really is slower).
Because of the "as-if" rule, the compiler is allowed to do the conversion of the literal to a floating point value at compile time. A good compiler will do so if that results in better code.
In order to answer your question definitively for your compiler and your target platform(s), you'd need to check what the compiler emits, and how it performs. However, I'd be surprised if any mainstream compiler did not turn either of the two if statements into the most efficient code possible.
If the value is a constant, then there shouldn't be any difference, since the compiler will convert the constant to float as part of the compilation [unless the compiler decides to use a "compare float with integer" instruction].

If the value is an integer VARIABLE, then there will be an extra instruction to convert the integer value to floating point [again, unless the compiler can use a "compare float with integer" instruction].

How much, if any, time that adds to the whole process depends HIGHLY on what processor you have, how the floating point instructions work, etc.
As with anything where performance really matters, measure the alternatives, preferably on more than one type of hardware (e.g. both AMD and Intel processors if it's a PC), and then decide which is the better choice. Otherwise you may find yourself tuning the code to work well on YOUR hardware but worse on some other hardware, which isn't a good optimisation - unless the ONLY machine you ever run on is your own.
Note: This will need to be repeated with your target hardware. The code below just demonstrates nicely what has been said.
with constants:
bool with_int(const double num_float) {
    return num_float > 3;
}

bool with_float(const double num_float) {
    return num_float > 3.0;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
with_float(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
.LC0:
.long 0
.long 1074266112
clang 3.0 (-O3 -march=native):
.LCPI0_0:
.quad 4613937818241073152 # double 3.000000e+00
with_int(double): # #with_int(double)
ucomisd .LCPI0_0(%rip), %xmm0
seta %al
ret
.LCPI1_0:
.quad 4613937818241073152 # double 3.000000e+00
with_float(double): # #with_float(double)
ucomisd .LCPI1_0(%rip), %xmm0
seta %al
ret
Conclusion: No difference if comparing against constants.
with variables:
bool with_int(const double a, const int b) {
    return a > b;
}

bool with_float(const double a, const float b) {
    return a > b;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double, int):
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float):
unpcklps %xmm1, %xmm1
cvtps2pd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
clang 3.0 (-O3 -march=native):
with_int(double, int): # #with_int(double, int)
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float): # #with_float(double, float)
cvtss2sd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
Conclusion: the emitted instructions differ when comparing against variables, as Mats Peterson's answer already explained.
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
A) Just specify it as an integer. Some chips have a special instruction for comparing against an integer at runtime, but that is not important, since the compiler will choose what is best. In some cases it might convert it to 3.0 at compile time, depending on the target architecture; in other cases it will leave it as an int. But because you want 3 specifically, specify 3.
Obviously I'm assuming the time delay would be negligible at best, but compounded in a while(1) loop I suppose over the long run a decent chunk of time could be lost (if it really is slower).
A) The compiler will not do anything odd with such code; it will choose the best option, so there should be no time delay. With numeric constants the compiler is free to do whatever is best, regardless of how the source looks, as long as it produces the same result. However, you would not want to compound this type of comparison in a while loop at all. Rather, use an integer loop counter: a floating point loop counter is going to be much slower. If you have to use a float as a loop counter, prefer the single-precision 32-bit type and compare as little as possible.
For example, you could break the problem down into multiple loops.
int x = 0;
float y = 0;
float finc = 0.1f;
int total = 1000;
int num_times = total / finc;
num_times -= 2; // safety margin

// Run the loop in a safe zone using integer compares
while (x < num_times) {
    // Do stuff
    y += finc;
    x++;
}

// Now complete the loop using float compares
while (y < total) {
    y += finc;
}
And that should give a significant improvement in comparison speed.