Is `-ftree-loop-vectorize` not enabled by `-O2` in GCC v12? - c++

Example: https://www.godbolt.org/z/ahfcaj7W8
From https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Optimize-Options.html
It says:
-ftree-loop-vectorize
     Perform loop vectorization on trees. This flag is enabled by default at -O2 and by -ftree-vectorize, -fprofile-use, and -fauto-profile.
However, it seems I have to pass a flag explicitly to turn on loop unrolling & SIMD. Did I misunderstand something here? It is enabled at -O3, though.

It is enabled at -O2 in GCC 12, but only with a much more conservative cost model than at -O3 (the -O2 default is -fvect-cost-model=very-cheap). In practice that often means it only vectorizes when the loop trip count is a compile-time constant and known to be a multiple of the vector width (e.g. 8 for 32-bit elements with AVX2 vectors). See https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=2b8453c401b699ed93c085d0413ab4b5030bcdb8
https://godbolt.org/z/3xjdrx6as shows some loops at -O2 vs. -O3: a sum over an array of integers only vectorizes with a constant count, not a runtime-variable one. Even writing the loop as for (int i = 0; i < (len & -16); i++) sum += arr[i]; to force the trip count to be a multiple of 16 doesn't get gcc -O2 to auto-vectorize.
Before GCC 12, -ftree-vectorize wasn't enabled at all by -O2.
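A minimal sketch of the difference (hypothetical function and array names, not from the question), based on the cost-model behavior described above:

// Sketch: what GCC 12's -O2 (very-cheap cost model) will and won't vectorize.
int sum_const_count(const int *arr) {
    int sum = 0;
    for (int i = 0; i < 1024; i++)  // trip count is a compile-time constant
        sum += arr[i];              // vectorizes at -O2
    return sum;
}

int sum_runtime_count(const int *arr, int len) {
    int sum = 0;
    for (int i = 0; i < len; i++)   // runtime trip count
        sum += arr[i];              // typically needs -O3 or -fvect-cost-model=cheap
    return sum;
}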

Related

Why doesn't vectorization speed up these loops?

I'm getting up to speed with vectorization, since my current PC supports it. I have an Intel i7-7600U. It has 2 cores running at 2.8/2.9 GHz and supports SSE4.1, SSE4.2 and AVX2. I'm not sure of the vector register size; I believe it is 256 bits, so it will work on four 64-bit double-precision values at a time. I believe this should give a peak rate of:
(2.8 GHz)(2 cores)(4 doubles per vector)(2 for a paired add and multiply) ≈ 45 GFlops.
I am using GNU Gfortran and g++.
I have a set of Fortran loops I built up back in my days of working on various supercomputers.
One loop I tested is:
do j = 1, m
  s(:) = s(:) + a(:,j)*b(:,j)
enddo
The vector length is 10000, m = 200, and the nest was executed 500 times to give 2e9 operations. I ran it with the j loop unrolled 0, 1, 2, 3 and 5 times. Unrolling should reduce the number of times s is loaded and stored. The loop is also a good candidate because all the memory accesses are stride-one and it has a paired add and multiply. I ran it both using array syntax as shown above and using an inner do loop, but that seems to make little difference. With do loops and no unrolling it looks like:
do j = 1, m
  do i = 1, n
    s(i) = s(i) + a(i,j)*b(i,j)
  end do
end do
The build looks like:
gfortran -O3 -w -fimplicit-none -ftree-vectorize -fopt-info-vec loops.f90
The compiler says the loops are all vectorized. The best result I have gotten is about 2.8 GFlops, which is one flop per cycle. If I run it with:
gfortran -O2 -w -fimplicit-none -fno-tree-vectorize -fopt-info-vec loops.f90
No vectorization is reported. It executes a little slower without unrolling, but the same with unrolling. Can someone tell me what is going on here? Do I have the characterization of my processor wrong? Why doesn't vectorization speed it up? I was expecting to get at least some improvement. I apologize if this plows old ground, but I could not find a clean example similar to this.

vectorized sum in Fortran

I am compiling my Fortran code using gfortran with -mavx and have verified via objdump that some instructions are vectorized, but I'm not getting the speed improvement I was expecting, so I want to make sure the following operation is being vectorized (this single line is ~50% of the runtime).
I know that some instructions can be vectorized, while others cannot, so I want to make sure this can be:
sum(A(i1:i2,ir))
Again, this single line takes about 50% of the runtime since I am doing this over a very large matrix. I can give more information on why I am doing this, but suffice it to say that it is necessary, though I can restructure the memory layout if needed (for example, I could do the sum as sum(A(ir,i1:i2)) if that could be vectorized instead).
Is this line being vectorized? How can I tell? How do I force vectorization if it is not being vectorized?
EDIT: Thanks to the comments, I now realize that I can check on the vectorization of this summation via -ftree-vectorizer-verbose and see that this is not vectorizing. I have restructured the code as follows:
tsum = 0.0d0
tn = i2 - i1 + 1
tvec(1:tn) = A(i1:i2, ir)
do ii = 1, tn
  tsum = tsum + tvec(ii)
enddo
and this ONLY vectorizes when I turn on -funsafe-math-optimizations, but I do see another 70% speed increase due to vectorization. The question still holds: Why does sum(A(i1:i2,ir)) not vectorize and how can I get a simple sum to vectorize?
It turns out that I am not able to make use of the vectorization unless I include -ffast-math or -funsafe-math-optimizations.
The two code snippets I played with are:
tsum = 0.0d0
tvec(1:n) = A(i1:i2, ir)
do ii = 1, n
  tsum = tsum + tvec(ii)
enddo
and
tsum = sum(A(i1:i2,ir))
and here are the times I get when running the first code snippet with different compilation options:
10.62 sec ... None
10.35 sec ... -mtune=native -mavx
7.44 sec ... -mtune=native -mavx -ffast-math
7.49 sec ... -mtune=native -mavx -funsafe-math-optimizations
Finally, with these same optimizations, I am able to vectorize tsum = sum(A(i1:i2,ir)) to get
7.96 sec ... None
8.41 sec ... -mtune=native -mavx
5.06 sec ... -mtune=native -mavx -ffast-math
4.97 sec ... -mtune=native -mavx -funsafe-math-optimizations
Comparing the sum version at -mtune=native -mavx against -mtune=native -mavx -funsafe-math-optimizations shows a ~70% speedup. (Note that these were each run only once; before we publish we will do true benchmarking over multiple runs.)
I do take a small hit though. My values change slightly when I use the -f options. Without them, the errors for my variables (v1, v2) are :
v1 ... 5.60663e-15 9.71445e-17 1.05471e-15
v2 ... 5.11674e-14 1.79301e-14 2.58127e-15
but with the optimizations, the errors are :
v1 ... 7.11931e-15 5.39846e-15 3.33067e-16
v2 ... 1.97273e-13 6.98608e-14 2.17742e-14
which indicates that there truly is something different going on.
Your explicit loop version still does the FP adds in a different order than a vectorized version would. A vector version uses 4 accumulators, each one getting every 4th array element.
You could write your source code to match what a vector version would do:
tsum0 = 0.0d0
tsum1 = 0.0d0
tsum2 = 0.0d0
tsum3 = 0.0d0
tn = i2 - i1 + 1
tvec(1:tn) = A(i1:i2, ir)
do ii = 1, tn, 4   ! count by 4
  tsum0 = tsum0 + tvec(ii)
  tsum1 = tsum1 + tvec(ii+1)
  tsum2 = tsum2 + tvec(ii+2)
  tsum3 = tsum3 + tvec(ii+3)
enddo
tsum = (tsum0 + tsum1) + (tsum2 + tsum3)
This might vectorize without -ffast-math.
FP add has multi-cycle latency, but one or two per clock throughput, so you need the asm to use multiple vector accumulators to saturate the FP add unit(s). Skylake can do two FP adds per clock, with latency=4. Previous Intel CPUs do one per clock, with latency=3. So on Skylake, you need 8 vector accumulators to saturate the FP units. And of course they have to be 256b vectors, because AVX instructions are as fast but do twice as much work as SSE vector instructions.
Writing the source with 8 vector accumulators' worth of scalar variables would be ridiculous, so I guess you need -ffast-math, or an OpenMP pragma that tells the compiler different orders of operations are OK (see the sketch after the next paragraph).
Explicitly unrolling your source means you have to handle loop counts that aren't a multiple of the vector width * unroll. If you put restrictions on things, it can help the compiler avoid generating multiple versions of the loop or extra loop setup/cleanup code.
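Here is a minimal C++ sketch of the OpenMP approach mentioned above (the Fortran equivalent directive would be !$omp simd reduction(+:tsum)); the function and variable names are made up, and it needs -fopenmp-simd or -fopenmp:

// Sketch: grant reassociation freedom for just this reduction,
// instead of the global -ffast-math. Compile with e.g. g++ -O2 -fopenmp-simd.
double sum_reduce(const double *arr, int n) {
    double tsum = 0.0;
    #pragma omp simd reduction(+:tsum)
    for (int i = 0; i < n; i++)
        tsum += arr[i];
    return tsum;
}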

gcc auto-vectorization fails in a reduction loop

I am trying to compile my code with auto-vectorization flags but I encounter a failure in a very simple reduction loop:
double node3::GetSum(void) {
    double sum = 0.;
    for (int i = 0; i < 8; i++) sum += c_value[i];
    return sum;
}
where the c_value[i] array is defined as
class node3 {
private:
    double c_value[9];
The auto-vectorization compilation returns:
Analyzing loop at node3.cpp:10
node3.cpp:10: note: step unknown.
node3.cpp:10: note: reduction: unsafe fp math optimization: sum_6 = _5 + sum_11;
node3.cpp:10: note: Unknown def-use cycle pattern.
node3.cpp:10: note: Unsupported pattern.
node3.cpp:10: note: not vectorized: unsupported use in stmt.
node3.cpp:10: note: unexpected pattern.
node3.cpp:8: note: vectorized 0 loops in function.
node3.cpp:10: note: Failed to SLP the basic block.
node3.cpp:10: note: not vectorized: failed to find SLP opportunities in basic block.
I really do not understand why it can't find SLP opportunities in the basic block, for example.
Moreover, I guess I did not understand what "unsupported use in stmt" really means: the loop simply sums a sequentially accessed array.
Could such problems be caused by the fact that c_value[] is declared in the private section of the class?
Thanks in advance.
Note: compiled as g++ -c -O3 -ftree-vectorizer-verbose=2 -march=native node3.cpp and also tried with more specific -march=corei7 but same results. GCC Version: 4.8.1
I eventually managed to vectorize the loop with the following trick:
double node3::GetSum(void) {
    double sum = 0., tmp[8];
    tmp[0] = c_value[0]; tmp[1] = c_value[1]; tmp[2] = c_value[2]; tmp[3] = c_value[3];
    tmp[4] = c_value[4]; tmp[5] = c_value[5]; tmp[6] = c_value[6]; tmp[7] = c_value[7];
    for (int i = 0; i < 8; i++) sum += tmp[i];
    return sum;
}
where I created the dummy array tmp[]. This trick, together with one more compilation flag, -funsafe-math-optimizations (@Mysticial: this is actually the only flag I need; -ffast-math enables other things I apparently don't need), makes the auto-vectorization successful.
Now, I don't really know if this solution actually speeds up execution. It does vectorize, but I added an extra copy, so I'm not sure the result runs faster. My feeling is that in the long run (calling the function many times) it speeds things up a little, but I can't prove it.
Anyway, this is a possible solution to the vectorization problem, so I posted it as an answer.
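A hedged alternative (a sketch, not from the original post): spelling out a pairwise reduction tree makes the association order explicit, so the compiler may SLP-vectorize the fixed-size sum without any unsafe-math flags:

// Sketch: explicit pairwise reduction over the 8 elements.
// The parenthesization matches what a SIMD reduction computes, so no
// reassociation (and no -funsafe-math-optimizations) is required.
double node3::GetSum(void) {
    double s01 = c_value[0] + c_value[1];
    double s23 = c_value[2] + c_value[3];
    double s45 = c_value[4] + c_value[5];
    double s67 = c_value[6] + c_value[7];
    return (s01 + s23) + (s45 + s67);
}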
It's annoying that the freedom to vectorize reductions is coupled with other (literally) unsafe optimizations. In my examples, a bug surfaces (with gcc but not g++) from the combination of -mavx and -funsafe-math-optimizations, where a pointer which should never be touched gets clobbered.
Auto-vectorization doesn't consistently speed up such short loops, particularly because the sum reduction epilogue with the hadd instruction is slow on the more common CPUs.
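For reference, a sketch (mine, not from the thread) of a horizontal-sum epilogue that avoids haddpd; a shuffle plus a scalar add is cheaper on most CPUs:

#include <emmintrin.h>
// Sketch: horizontal sum of a __m128d without the slow haddpd.
static double hsum_pd(__m128d v) {
    __m128d hi = _mm_unpackhi_pd(v, v);      // move the high element down
    return _mm_cvtsd_f64(_mm_add_sd(v, hi)); // low + high
}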

SSE2 double multiplication slower than with standard multiplication

I'm wondering why the following code with SSE2 instructions performs the multiplication slower than the standard C++ implementation.
Here is the code:
m_win = (double*)_aligned_malloc(size * sizeof(double), 16);
__m128d* pData = (__m128d*)input().data;
__m128d* pWin  = (__m128d*)m_win;
__m128d* pOut  = (__m128d*)m_output.data;
__m128d tmp;
for (int i = 0; i < m_size/2; i++)
    pOut[i] = _mm_mul_pd(pData[i], pWin[i]);
The memory for m_output.data and input().data has been allocated with _aligned_malloc.
However, the time to execute this code for a 2^25-element array is identical to the time for this code (350 ms):
for (int i = 0; i < m_size; i++)
    m_output.data[i] = input().data[i] * m_win[i];
How is that possible? It should theoretically take only 50% of the time, right? Or is the overhead for the memory transfer from SIMD registers to the m_output.data array so expensive?
If I replace the line from the first snippet
pOut[i] = _mm_mul_pd(pData[i], pWin[i]);
by
tmp = _mm_mul_pd(pData[i], pWin[i]);
(with tmp declared as __m128d above), then the code executes blazingly fast, in less time than the resolution of my timer function.
Is that because everything is just stored in the registers and not the memory?
And even more surprising, if I compile in debug mode, the SSE code takes only 93ms while the standard multiplication takes 309ms.
DEBUG: 93ms (SSE2) / 309ms (standard multiplication)
RELEASE: 350ms (SSE2) / 350ms (standard multiplication)
What's going on here???
I'm using MSVC2008 with QtCreator 2.2.1 in release mode.
Here are my compiler switches for RELEASE:
cl -c -nologo -Zm200 -Zc:wchar_t- -O2 -MD -GR -EHsc -W3 -w34100 -w34189
and these are for DEBUG:
cl -c -nologo -Zm200 -Zc:wchar_t- -Zi -MDd -GR -EHsc -W3 -w34100 -w34189
EDIT
Regarding the RELEASE vs DEBUG issue:
I just wanted to note that I profiled the code and the SSE code is in fact slower in release mode!
That somewhat confirms the hypothesis that VS2008 can't handle intrinsics properly with the optimizer.
Intel VTune gives me 289ms for the SSE loop in DEBUG and 504ms in RELEASE mode.
Wow... just wow...
First of all, VS 2008 is a bad choice for intrinsics, as it tends to add many more register moves than necessary and in general does not optimize very well (for instance, it has issues with loop induction variable analysis when SSE instructions are present).
So, my wild guess is that for the scalar version the compiler generates mulsd instructions which the CPU can trivially reorder and execute in parallel (no dependencies between iterations), while the intrinsics result in lots of register moves/complex SSE code; it might even blow the trace cache on some CPUs. VS2008 is notorious for doing all its calculations in registers, and I guess there will be some hazards the CPU cannot skip (like xor reg; mov mem->reg; mul; mov reg->mem, which forms a dependency chain, while the scalar code might just be mov mem->reg; mul with a memory operand; mov reg->mem). You should definitely look at the generated assembly or try VS 2010, which has much better support for intrinsics.
Finally, and most important: your code is not compute bound at all, so no amount of SSE will make it significantly faster. On each iteration you are reading four double values and writing two, which means FLOPs are not your problem. In that case you're at the mercy of the cache/memory subsystem, and that probably explains the variance you see. (Roughly: 2^25 elements at 24 bytes of traffic each is about 0.8 GB per pass, so 350 ms corresponds to roughly 2.3 GB/s, i.e. memory-bandwidth territory.) The debug multiplication shouldn't be faster than release; if you see it being faster, then you should do more runs and check what else is going on (be careful if your CPU supports a turbo mode, which adds another 20% variation). A context switch which empties the cache might be enough in this case.
So, overall, the test you made is pretty much meaningless and just shows that for memory-bound cases there is no difference between using SSE or not. You should use SSE where there is code that is actually compute-dense and parallel, and even then I would spend a lot of time with a profiler nailing down the exact location to optimize. A simple dot product is not suitable for seeing any performance improvement from SSE.
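To illustrate "compute-dense" (my sketch, with made-up names, not code from the answer): doing several dependent multiply-adds per element loaded shifts the bottleneck from memory to the FP units, and then SIMD can pay off:

#include <emmintrin.h>
// Sketch: each pair of doubles loaded feeds 16 multiply-adds, so the loop
// is limited by FP throughput/latency rather than by memory bandwidth.
void compute_dense(const double* in, double* out, int n) {
    for (int i = 0; i + 1 < n; i += 2) {
        __m128d x   = _mm_loadu_pd(in + i);
        __m128d acc = x;
        for (int k = 0; k < 16; k++)
            acc = _mm_add_pd(_mm_mul_pd(acc, x), x); // acc = acc*x + x
        _mm_storeu_pd(out + i, acc);
    }
}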
Several points:
- as has already been pointed out, MSVC generates pretty bad code for SSE intrinsics
- your code is almost certainly memory-bandwidth limited, since you perform only one operation between the loads and the store
- most modern x86 CPUs have two floating-point ALUs, so there may be little to gain from using SSE for double-precision floating-point math, even if you're not bandwidth-limited

Tool for detecting pointer aliasing problems in C / C++

Is there a tool that can do alias analysis on a program and tell you where gcc / g++ are having to generate sub-optimal instruction sequences due to potential pointer aliasing?
I don't know of anything that gives "100%" coverage, but for vectorizing code (which aliasing often prevents) use the -ftree-vectorizer-verbose=n option, where n is an integer between 1 and 6. This prints out some information about why a loop couldn't be vectorized.
For instance, with g++ 4.1, the code
//#define RSTR __restrict__
#define RSTR
void addvec(float* RSTR a, float* b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
results in
$ g++ -ftree-vectorizer-verbose=1 -ftree-vectorize -O3 -c aliastest.cpp
aliastest.cpp:6: note: vectorized 0 loops in function.
Now, switch to the other definition for RSTR and you get
$ g++ -ftree-vectorizer-verbose=1 -ftree-vectorize -O3 -c aliastest.cpp
aliastest.cpp:6: note: LOOP VECTORIZED.
aliastest.cpp:6: note: vectorized 1 loops in function.
Interestingly, if one switches to g++ 4.4, it can vectorize the first non-restrict case by versioning and a runtime check:
$ g++44 -ftree-vectorizer-verbose=1 -O3 -c aliastest.cpp
aliastest.cpp:6: note: created 1 versioning for alias checks.
aliastest.cpp:6: note: LOOP VECTORIZED.
aliastest.cpp:4: note: vectorized 1 loops in function.
And this is done for both of the RSTR definitions.
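Conceptually, the "versioning for alias checks" that g++ 4.4 reports amounts to something like the following sketch (my illustration, not compiler output): a runtime overlap test selects between the vectorizable loop and a conservative fallback.

// Sketch of loop versioning: if the arrays provably don't overlap at
// runtime, the compiler's vectorized copy of the loop is safe to run.
void addvec_versioned(float* a, float* b, int n)
{
    if (a + n <= b || b + n <= a) {
        for (int i = 0; i < n; i++)  // no overlap: the vectorized version
            a[i] = a[i] + b[i];
    } else {
        for (int i = 0; i < n; i++)  // possible overlap: scalar fallback
            a[i] = a[i] + b[i];
    }
}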
In the past I've tracked down aliasing slowdowns with some help from a profiler. Some of the game-console profilers will highlight parts of the code that cause lots of load-hit-store penalties; these can often occur because the compiler assumes some pointers are aliased and has to generate extra load instructions. Once you know the part of the code where they're occurring, you can backtrack from the assembly to the source to see what might be considered aliased, and add "restrict" as needed (or use other tricks to avoid the extra loads).
I'm not sure if there are any freely available profilers that will let you get into this level of detail, however.
The side benefit of this approach is that you only spend your time examining cases that actually slow your code down.