Is there a tool that can do alias analysis on a program and tell you where gcc / g++ are having to generate sub-optimal instruction sequences due to potential pointer aliasing?
I don't know of anything that gives 100% coverage, but for vectorizing code (which aliasing often prevents) use the -ftree-vectorizer-verbose=n option, where n is an integer between 1 and 6. This prints out some information about why a loop couldn't be vectorized.
For instance, with g++ 4.1, the code
//#define RSTR __restrict__
#define RSTR
void addvec(float* RSTR a, float* b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
results in
$ g++ -ftree-vectorizer-verbose=1 -ftree-vectorize -O3 -c aliastest.cpp
aliastest.cpp:6: note: vectorized 0 loops in function.
Now, switch to the other definition for RSTR and you get
$ g++ -ftree-vectorizer-verbose=1 -ftree-vectorize -O3 -c aliastest.cpp
aliastest.cpp:6: note: LOOP VECTORIZED.
aliastest.cpp:6: note: vectorized 1 loops in function.
Interestingly, if one switches to g++ 4.4, it can vectorize the first non-restrict case by versioning and a runtime check:
$ g++44 -ftree-vectorizer-verbose=1 -O3 -c aliastest.cpp
aliastest.cpp:6: note: created 1 versioning for alias checks.
aliastest.cpp:6: note: LOOP VECTORIZED.
aliastest.cpp:4: note: vectorized 1 loops in function.
And this is done for both of the RSTR definitions.
In the past I've tracked down aliasing slowdowns with some help from a profiler. Some of the game console profilers will highlight parts of the code that are causing lots of load-hit-store penalties; these often occur because the compiler assumes some pointers are aliased and has to generate extra load instructions. Once you know where in the code they're occurring, you can backtrack from the assembly to the source to see what might be considered aliased, and add restrict as needed (or use other tricks to avoid the extra loads).
I'm not sure if there are any freely available profilers that will let you get into this level of detail, however.
The side benefit of this approach is that you only spend your time examining cases that actually slow your code down.
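To illustrate the kind of aliasing-induced reload mentioned above, here is a minimal sketch (my own example, not from the original code): without __restrict__ the compiler must assume the scale factor may alias the output array and reload it every iteration, while the qualified variant lets it hoist the load.
// Without __restrict__, the store to a[i] could modify *scale, so *scale is
// reloaded on every iteration.
void scale_all(float* a, const float* scale, int n)
{
    for (int i = 0; i < n; i++)
        a[i] *= *scale;
}

// With __restrict__, the compiler may keep *scale in a register for the whole loop.
void scale_all_restrict(float* __restrict__ a, const float* __restrict__ scale, int n)
{
    for (int i = 0; i < n; i++)
        a[i] *= *scale;
}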
Example: https://www.godbolt.org/z/ahfcaj7W8
From https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Optimize-Options.html
It says
-ftree-loop-vectorize
Perform loop vectorization on trees. This flag is enabled by default at -O2 and by -ftree-vectorize, -fprofile-use, and -fauto-profile.
However, it seems I have to pass a flag explicitly to turn on loop unrolling & SIMD. Did I misunderstand something here? It is enabled at -O3, though.
It is enabled at -O2 in GCC12, but only with a much more conservative cost model (-fvect-cost-model=very-cheap) than at -O3, e.g. often only vectorizing when the loop trip count is a compile-time constant and known to be a multiple of the vector width (e.g. 8 for 32-bit elements with AVX2 vectors). See https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=2b8453c401b699ed93c085d0413ab4b5030bcdb8
https://godbolt.org/z/3xjdrx6as shows some loops at -O2 vs. -O3, with a sum of an array of integers only vectorizing with a constant count, not a runtime variable. Even for (int i=0 ; i < (len&-16) ; i++) sum += arr[i] to make the length a multiple of 16 doesn't make gcc -O2 auto-vectorize.
Before GCC12, -ftree-vectorize wasn't enabled at all by -O2.
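As a rough illustration of that cost-model difference (my own sketch, mirroring the godbolt example above; exact behaviour depends on GCC version and target), a constant trip count that is a multiple of the vector width may vectorize at -O2, while the runtime-length version typically needs -O3. The -fopt-info-vec-missed flag reports which loops were rejected and why.
// g++-12 -O2 -fopt-info-vec-missed sum.cpp   vs.   g++-12 -O3 ...
int sum_fixed(const int* a)
{
    int s = 0;
    for (int i = 0; i < 1024; i++)   // compile-time constant trip count
        s += a[i];
    return s;
}

int sum_runtime(const int* a, int len)
{
    int s = 0;
    for (int i = 0; i < len; i++)    // runtime trip count: usually rejected by the very-cheap model
        s += a[i];
    return s;
}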
I'm working on a parallel STL implementation of the Barnes-Hut-Algorithm.
To improve performance, I wanted to try the parallel mode of some algorithms from libstdc++:
https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
This extension will also come with the new C++17 standard.
To calculate the effective acceleration for each body, I use the for_each algorithm from the namespace __gnu_parallel. To use the sequential algorithm, you can replace it with std.
To compile the program I use g++ 5.4.0 and invoke it as g++-5 -fopenmp -O0 -g -Wall -fmessage-length=0 -std=c++1z -c -o BarnesHutCPU.o BarnesHutCPU.cpp
For the parallel algorithms OpenMP is used. This is the reason for -fopenmp.
However, the runtime for the sequential and the parallel use of for_each is nearly the same, and calling omp_get_num_threads() inside the for_each loop shows that only one thread is used for the whole loop.
So my question is: why is the algorithm not executed in parallel, and what do I have to do to get a parallel execution?
I also tried it with OMP_NUM_THREADS=4 ./BarnesHutCPU.
I don't want to use a normal for loop, because I have to use the STL algorithms. (One reason is that I want to use Thrust later.)
This is the important code part with N=750:
void calcAcc()
{
    double theta = 0.5;
    __gnu_parallel::for_each(counting_iterator<int>(0), counting_iterator<int>(N),
        [&](const int &i){
            ...
        });
}
counting_iterator<T> is from boost::counting_iterator<T>
Greetings
Tommekk
Okay, so the reason was the -O0 flag. With -O3 it uses my 4 CPUs, which I can also see in the system monitor. At first I didn't see any effect because my N was too small.
Thanks for your help!
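For completeness, the libstdc++ parallel mode also lets you override its size heuristics at run time via its settings API. A minimal sketch (based on the parallel-mode manual, so treat the details as an assumption for your particular GCC version) that forces the parallel variant regardless of problem size:
#include <parallel/algorithm>
#include <parallel/settings.h>
#include <vector>

int main()
{
    // Force parallel execution even for small inputs, bypassing the heuristics.
    __gnu_parallel::_Settings s;
    s.algorithm_strategy = __gnu_parallel::force_parallel;
    __gnu_parallel::_Settings::set(s);

    std::vector<double> v(750, 1.0);
    __gnu_parallel::for_each(v.begin(), v.end(),
                             [](double& x){ x = x * x + 1.0; });   // independent per-element work
    return 0;
}
Compile with -fopenmp and optimizations enabled (e.g. g++-5 -fopenmp -O3). Note that the lambda body must not write to shared state, otherwise you introduce a data race once it actually runs in parallel.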
From what I've read about Eigen (here), it seems that operator=() acts as a "barrier" of sorts for lazy evaluation -- e.g. it causes Eigen to stop returning expression templates and actually perform the (optimized) computation, storing the result into the left-hand side of the =.
This would seem to mean that one's "coding style" has an impact on performance -- i.e. using named variables to store the result of intermediate computations might have a negative effect on performance by causing some portions of the computation to be evaluated "too early".
To try to verify my intuition, I wrote up an example and was surprised at the results (full code here):
using ArrayXf = Eigen::Array<float, Eigen::Dynamic, Eigen::Dynamic>;
using ArrayXcf = Eigen::Array<std::complex<float>, Eigen::Dynamic, Eigen::Dynamic>;
float test1( const MatrixXcf & mat )
{
    ArrayXcf arr = mat.array();
    ArrayXcf conj = arr.conjugate();
    ArrayXcf magc = arr * conj;
    ArrayXf mag = magc.real();
    return mag.sum();
}

float test2( const MatrixXcf & mat )
{
    return ( mat.array() * mat.array().conjugate() ).real().sum();
}

float test3( const MatrixXcf & mat )
{
    ArrayXcf magc = ( mat.array() * mat.array().conjugate() );
    ArrayXf mag = magc.real();
    return mag.sum();
}
The above gives 3 different ways of computing the coefficient-wise sum of magnitudes in a complex-valued matrix.
test1 sort of takes each portion of the computation "one step at a time."
test2 does the whole computation in one expression.
test3 takes a "blended" approach -- with some amount of intermediate variables.
I sort of expected that since test2 packs the entire computation into one expression, Eigen would be able to take advantage of that and globally optimize the entire computation, providing the best performance.
However, the results were surprising (numbers shown are in total microseconds across 1000 executions of each test):
test1_us: 154994
test2_us: 365231
test3_us: 36613
(This was compiled with g++ -O3 -- see the gist for full details.)
The version I expected to be fastest (test2) was actually slowest. Also, the version that I expected to be slowest (test1) was actually in the middle.
So, my questions are:
Why does test3 perform so much better than the alternatives?
Is there a technique one can use (short of diving into the assembly code) to get some visibility into how Eigen is actually implementing your computations?
Is there a set of guidelines to follow to strike a good tradeoff between performance and readability (use of intermediate variables) in your Eigen code?
In more complex computations, doing everything in one expression could hinder readability, so I'm interested in finding the right way to write code that is both readable and performant.
It looks like a problem with GCC; the Intel compiler gives the expected result.
$ g++ -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 200087
test2_us: 320033
test3_us: 44539
$ icpc -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 214537
test2_us: 23022
test3_us: 42099
Compared to the icpc version, gcc seems to have trouble optimizing your test2.
For more precise results, you may want to turn off the debug assertions with -DNDEBUG as shown here.
EDIT
For question 1
@ggael gives an excellent answer: gcc fails to vectorize the sum loop. My experiments also find that test2 is as fast as the hand-written naive for loop, both with gcc and icc, suggesting that vectorization is the reason, and that no temporary memory allocation is detected in test2 by the method mentioned below, suggesting that Eigen evaluates the expression correctly.
For question 2
Avoiding intermediate memory is the main reason Eigen uses expression templates. So Eigen provides the macro EIGEN_RUNTIME_NO_MALLOC and a simple function to let you check whether any intermediate memory is allocated while evaluating an expression. You can find sample code here. Please note this may only work in debug mode.
EIGEN_RUNTIME_NO_MALLOC - if defined, a new switch is introduced which
can be turned on and off by calling set_is_malloc_allowed(bool). If
malloc is not allowed and Eigen tries to allocate memory dynamically
anyway, an assertion failure results. Not defined by default.
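For example, a minimal sketch of how this switch can wrap one of the test functions above (my own illustration, assuming assertions are enabled, i.e. no -DNDEBUG):
// EIGEN_RUNTIME_NO_MALLOC must be defined before including Eigen.
#define EIGEN_RUNTIME_NO_MALLOC
#include <Eigen/Dense>

float test2_checked(const Eigen::MatrixXcf& mat)
{
    Eigen::internal::set_is_malloc_allowed(false);   // any heap allocation now triggers an assertion
    float s = (mat.array() * mat.array().conjugate()).real().sum();
    Eigen::internal::set_is_malloc_allowed(true);
    return s;
}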
For question 3
There is a way to use intermediate variables and still get the performance benefit of lazy evaluation/expression templates.
The way is to use intermediate variables of the correct data type. Instead of using Eigen::Matrix/Array, which forces the expression to be evaluated, you should use the expression types Eigen::MatrixBase/ArrayBase/DenseBase, so that the expression is only buffered but not evaluated. This means you should store the expression as the intermediate, rather than the result of the expression, with the condition that this intermediate will only be used once in the following code.
As determining the template parameters in the expression type Eigen::MatrixBase/... can be painful, you can use auto instead. You can find some hints on when you should/should not use auto/expression types on this page. Another page also tells you how to pass expressions as function parameters without evaluating them.
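As a small illustration of that last point (my own sketch, not from the linked pages), a function templated on the derived expression type accepts mat.array() without evaluating it, so the whole computation stays one lazy expression:
#include <Eigen/Dense>

// The argument is an unevaluated expression; no temporary Array is materialized.
template <typename Derived>
float magnitude_sum(const Eigen::ArrayBase<Derived>& expr)
{
    return (expr * expr.conjugate()).real().sum();
}

float test4(const Eigen::MatrixXcf& mat)      // hypothetical fourth variant
{
    return magnitude_sum(mat.array());        // mat.array() is itself a lightweight expression
}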
According to the instructive experiment about .abs2() in @ggael's answer, I think another guideline is to avoid reinventing the wheel.
What happens is that because of the .real() step, Eigen won't explicitly vectorize test2. It will thus call the standard complex::operator*, which, unfortunately, is never inlined by gcc. The other versions, on the other hand, use Eigen's own vectorized product implementation for complexes.
In contrast, ICC does inline complex::operator*, making test2 the fastest for ICC. You can also rewrite test2 as:
return mat.array().abs2().sum();
to get even better performance on all compilers:
gcc:
test1_us: 66016
test2_us: 26654
test3_us: 34814
icpc:
test1_us: 87225
test2_us: 8274
test3_us: 44598
clang:
test1_us: 87543
test2_us: 26891
test3_us: 44617
The extremely good score of ICC in this case is due to its clever auto-vectorization engine.
Another way to work around the inlining failure of gcc without modifying test2 is to define your own operator* for complex<float>. For instance, add the following at the top of your file:
namespace std {
    complex<float> operator*(const complex<float> &a, const complex<float> &b) {
        return complex<float>(real(a)*real(b) - imag(a)*imag(b),
                              imag(a)*real(b) + real(a)*imag(b));
    }
}
and then I get:
gcc:
test1_us: 69352
test2_us: 28171
test3_us: 36501
icpc:
test1_us: 93810
test2_us: 11350
test3_us: 51007
clang:
test1_us: 83138
test2_us: 26206
test3_us: 45224
Of course, this trick is not always recommended as, in contrast to the standard library version, it might lead to overflow or numerical cancellation issues, but this is what icpc and the other vectorized versions compute anyway.
One thing I have done before is to make use of the auto keyword a lot. Keeping in mind that most Eigen expressions return special expression datatypes (e.g. CwiseBinaryOp), an assignment back to a Matrix may force the expression to be evaluated (which is what you are seeing). Using auto lets the compiler deduce the variable's type as whatever expression type it is, which will avoid evaluation as long as possible:
float test1( const MatrixXcf & mat )
{
    auto arr = mat.array();
    auto conj = arr.conjugate();
    auto magc = arr * conj;
    auto mag = magc.real();
    return mag.sum();
}
This should essentially be closer to your second test case. In some cases I have had good performance improvements while keeping readability (you do not want to have to spell out the expression template types). Of course, your mileage may vary, so benchmark carefully :)
I just want to point out that you did the profiling in a non-optimal way, so the issue could simply be your profiling method. Since there are many things like cache locality to take into account, you should profile like this:
int warmUpCycles = 100;
int profileCycles = 1000;

// TEST 1
for (int i = 0; i < warmUpCycles; i++)
    doTest1();

auto tick = std::chrono::steady_clock::now();
for (int i = 0; i < profileCycles; i++)
    doTest1();
auto tock = std::chrono::steady_clock::now();

auto test1_us = std::chrono::duration_cast<std::chrono::microseconds>(tock - tick).count();

// TEST 2
// TEST 3
Once you have done the test properly, you can draw conclusions.
I highly suspect that, since you are profiling one operation at a time, you end up using already-cached data in the third test, since operations are likely to be reordered by the compiler.
Also, you should try different compilers to see if the problem is the unrolling of templates (there is a depth limit to optimizing templates; it is likely you can hit it with a single big expression).
Also, if Eigen supports move semantics, there's no reason why one version should be faster, since it is not always guaranteed that expressions can be optimized.
Please try it and let me know; that's interesting. Also be sure to enable optimizations with flags like -O3: profiling without optimization is meaningless.
To prevent the compiler from optimizing everything away, read the initial input from a file or cin and then feed that input into the functions.
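For example, a minimal sketch of that idea (doTest1 here is a trivial placeholder standing in for your real test functions): reading the input at run time and writing every result to a volatile sink keeps the compiler from constant-folding or dead-code-eliminating the work being measured.
#include <chrono>
#include <iostream>

volatile float g_sink;                    // writes to a volatile cannot be elided

// Placeholder for one of the test functions (doTest1/2/3 in the snippet above).
float doTest1(float x) { return x * x + 1.0f; }

int main()
{
    float seed = 0.f;
    std::cin >> seed;                     // runtime input the compiler cannot constant-fold

    const int warmUpCycles  = 100;
    const int profileCycles = 1000;

    for (int i = 0; i < warmUpCycles; ++i)
        g_sink = doTest1(seed);

    auto tick = std::chrono::steady_clock::now();
    for (int i = 0; i < profileCycles; ++i)
        g_sink = doTest1(seed);
    auto tock = std::chrono::steady_clock::now();

    auto test1_us = std::chrono::duration_cast<std::chrono::microseconds>(tock - tick).count();
    std::cout << "test1_us: " << test1_us << "\n";
}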
I would like to see the disassembled code in the same order the compiler generates it after instruction rescheduling. By the way, I am using GDB, and when I give the command disas /m FunctionName it gives me the disassembled code in source-code order. I am trying to look at the effectiveness of instruction rescheduling by my compiler (GCC 4.1) and would like to see how instructions are rescheduled.
Thanks!
EDIT
After looking at disassembled code for a line of code:
double w_n = (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11]) ;
I could see it's 83 bytes of instructions. But after unrolling it by 2 iterations:
double w_n[2] = { (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11]), (A_n_2[2] * x[0] + A_n_2[5] * y + A_n_2[8] * z + A_n_2[11]) };
The block of code is 226 bytes, an enormous increase in instruction count. Could anyone tell me why this is happening? I can also see from VTune that the number of instructions retired has increased after unrolling.
A possible reason I could think of: with the increased block size the compiler has more opportunity to generate simple instructions, so as to maximize the throughput of the instruction prefetch and decode units.
Any help is greatly appreciated. Thanks!!
If rescheduling has been done by the compiler, you really should see that when disassembling in gdb.
Otherwise, you can use objdump directly on the command line; that's my preferred way of seeing the code in an ELF binary:
$ objdump --disassemble a.out | less
It doesn't reference the source at all, so it should really show what's in the binary itself.
In the step in which you compile the code into an object file, you could also simply tell the GCC driver (gcc) that you want to get assembly code:
gcc -S -c file.c
gcc -O2 -S -c file.c
gcc -S -masm=intel -c file.c
(the latter generates Intel instead of AT&T syntax assembly)
You can even then feed that assembly code to the assembler (e.g. gas) later on to get an object file which can be linked.
As to why the code is bigger, there are a number of reasons. The heuristics we humans used to hand-optimize assembly haven't held true for quite some time. One big goal is pipelining, another is vectorization. All in all, it's about parallelizing as much as possible and invalidating as little as possible of the (already read) cache at any given time in order to speed up execution.
Even though it seems counter-intuitive, this can lead to bigger, yet faster, code.
I am trying to compile my code with auto-vectorization flags but I encounter a failure in a very simple reduction loop:
double node3::GetSum(void){
    double sum=0.;
    for(int i=0;i<8;i++) sum+=c_value[i];
    return sum;
}
where the c_value[i] array is defined as
class node3{
private:
    double c_value[9];
The auto-vectorization compilation returns:
Analyzing loop at node3.cpp:10
node3.cpp:10: note: step unknown.
node3.cpp:10: note: reduction: unsafe fp math optimization: sum_6 = _5 + sum_11;
node3.cpp:10: note: Unknown def-use cycle pattern.
node3.cpp:10: note: Unsupported pattern.
node3.cpp:10: note: not vectorized: unsupported use in stmt.
node3.cpp:10: note: unexpected pattern.
node3.cpp:8: note: vectorized 0 loops in function.
node3.cpp:10: note: Failed to SLP the basic block.
node3.cpp:10: note: not vectorized: failed to find SLP opportunities in basic block.
I really do not understand why it can't determine the basic block for SLP for example.
Moreover, I guess I did not understand what "unsupported use in stmt" really means: the loop here simply sums a sequentially accessed array.
Could such problems be caused by the fact that c_value[] is defined in the private section of the class?
Thanks in advance.
Note: compiled as g++ -c -O3 -ftree-vectorizer-verbose=2 -march=native node3.cpp, and also tried the more specific -march=corei7, but got the same results. GCC version: 4.8.1.
I managed to vectorize the loop at the end with the following trick:
double node3::GetSum(void){
    double sum=0., tmp[8];
    tmp[0]=c_value[0]; tmp[1]=c_value[1]; tmp[2]=c_value[2]; tmp[3]=c_value[3];
    tmp[4]=c_value[4]; tmp[5]=c_value[5]; tmp[6]=c_value[6]; tmp[7]=c_value[7];
    for(int i=0;i<8;i++) sum+=tmp[i];
    return sum;
}
where I created the dummy array tmp[]. This trick, together with one more compilation flag, -funsafe-math-optimizations (@Mysticial: this is actually the only flag I need; -ffast-math enables other things I apparently don't need), makes the auto-vectorization successful.
Now, I don't really know if this solution really speeds up the execution. It does vectorize, but I added an extra copy, so I'm not sure whether it should run faster. My feeling is that in the long run (calling the function many times) it does speed things up a little, but I can't prove it.
Anyway, this is a possible solution to the vectorization problem, so I posted it as an answer.
It's annoying that the freedom to vectorize reductions is coupled with other (literally) unsafe optimizations. In my examples, a bug surfaces (with gcc but not g++) with the combination of -mavx and -funsafe-math-optimizations, where a pointer which should never be touched gets clobbered.
Auto-vectorization doesn't consistently speed up such short loops, particularly because the sum reduction epilogue with the hadd instruction is slow on the more common CPUs.
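One way to decouple the reduction from the rest of the unsafe-math flags (my own suggestion, not from the answers above; it requires GCC 4.9 or later, so it would not have helped the 4.8.1 setup in the question) is an OpenMP SIMD reduction clause compiled with -fopenmp-simd, which permits reassociating only this particular sum:
// g++ -O3 -fopenmp-simd -c node3.cpp   (no -ffast-math / -funsafe-math-optimizations)
class node3 {
private:
    double c_value[9];
public:
    double GetSum();
};

double node3::GetSum()
{
    double sum = 0.;
    #pragma omp simd reduction(+:sum)   // explicitly allows reordering this FP sum
    for (int i = 0; i < 8; i++)
        sum += c_value[i];
    return sum;
}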