Why doesn't vectorization speed up these loops? - fortran

I'm getting up to speed with vectorization, since my current PC supports it. I have an Intel i7-7600U. It has 2 cores running at 2.8/2.9 GHz and supports SSE4.1, SSE4.2 and AVX2. I'm not sure of the vector register size; I believe it is 256 bits, so it will work with four 64-bit double-precision values at a time. I believe this should give a peak rate of:
(2.8 GHz)(2 cores)(4 doubles per vector)(2 for a paired add/multiply) = 44.8, call it 45 GFlops.
I am using GNU Gfortran and g++.
I have a set of fortran loops I built up back in my days of working on various supercomputers.
One loop I tested is:
do j=1,m
  s(:) = s(:) + a(:,j)*b(:,j)
enddo
The vector length is 10000, m = 200, and the nest was executed 500 times to give 2e9 operations. I ran it with the j loop unrolled 0, 1, 2, 3 and 5 times. Unrolling should reduce the number of times s is loaded and stored. The loop should also be a good candidate because all the memory accesses are stride one and it has a paired add and multiply. I ran it both using array syntax as shown above and using an inner do loop, but that seems to make little difference. With do loops and no unrolling it looks like:
do j=1,m
  do i=1,n
    s(i)=s(i)+a(i,j)*b(i,j)
  end do
end do
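For reference, unrolling the j loop by 2 would look something like this (my sketch of the approach described above, assuming m is even; the remainder iteration is omitted):
do j=1,m,2
  do i=1,n
    s(i)=s(i)+a(i,j)*b(i,j)+a(i,j+1)*b(i,j+1)
  end do
end do
Each pass over s now does two multiply-add pairs per load and store of s(i).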
The build looks like:
gfortran -O3 -w -fimplicit-none -ftree-vectorize -fopt-info-vec loops.f90
The compiler says the loops are all vectorized. The best result I have gotten is about 2.8 GFlops, which is one flop per cycle. If I run it with:
gfortran -O2 -w -fimplicit-none -fno-tree-vectorize -fopt-info-vec loops.f90
No vectorization is reported. It executes a little slower without unrolling, but the same with unrolling. Can someone tell me what is going on here? Do I have the characterization of my processor wrong? Why doesn't vectorization speed it up? I was expecting to get at least some improvement. I apologize if this plows old ground, but I could not find a clean example similar to this.
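For completeness, a minimal self-contained harness along the lines described above (my reconstruction, not the original code; the checksum print keeps the compiler from discarding the result) would be:
program flops_bench
  implicit none
  integer, parameter :: n = 10000, m = 200, nrep = 500
  real(8), allocatable :: s(:), a(:,:), b(:,:)
  real(8) :: t0, t1
  integer :: i, j, irep
  allocate(s(n), a(n,m), b(n,m))
  call random_number(a)
  call random_number(b)
  s = 0.0d0
  call cpu_time(t0)
  do irep = 1, nrep
    do j = 1, m
      do i = 1, n
        s(i) = s(i) + a(i,j)*b(i,j)
      end do
    end do
  end do
  call cpu_time(t1)
  ! 2 flops per element per j pass: 2*n*m*nrep = 2e9 flops total
  print *, 'GFlops:', 2.0d0*n*m*nrep/(t1-t0)/1.0d9, ' checksum:', sum(s)
end program flops_bench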

Slow vpermpd instruction being generated; why?

I have a filter m_f which acts on an input vector v via
Real d2v = m_f[0]*v[i];
for (size_t j = 1; j < m_f.size(); ++j)
{
    d2v += m_f[j] * (v[i + j] + v[i - j]);
}
perf tells us where this loop is hot: the counts concentrate on vaddpd, vfmadd231pd and vpermpd.
The vaddpd and vfmadd231pd make sense; surely we can't perform this operation without them. But the slow vpermpd is baffling to me. What is it accomplishing?
vpermpd should only be slowing you down here if your bottleneck is front-end throughput (feeding uops into the out-of-order core).
vpermpd isn't particularly "slow" unless you're on an AMD CPU. (Lane-crossing YMM shuffles are slow-ish on AMD's CPUs, because they have to decode into more than the normal 2 128-bit uops that 256-bit instructions are split into. vpermpd is 3 uops on Ryzen, or 4 with a memory source.)
On Intel, vpermpd with a memory source is always 2 uops for the front-end (even a non-indexed addressing mode can't micro-fuse). But if your loop only runs for a tiny number of iterations, then OoO exec may be able to hide the FMA latency and you may actually bottleneck on the front end for this loop plus the surrounding code. That's possible, given how many counts the (inefficient) horizontal-sum code outside the loop is getting.
In that case, maybe unrolling by 2 would help, but maybe the extra overhead to check if you can run even one iteration of the main loop could get costly for very small counts.
Otherwise (for large counts) your bottleneck is probably on the 4 to 5 cycle loop-carried dependency of doing an FMA with d2v as an input/output operand. Unrolling with multiple accumulators, and pointer-increments instead of indexing, would be a huge performance win. Like 2x or 3x.
Try clang; it will usually do that for you, and its Skylake/Haswell tunings unroll pretty aggressively (e.g. clang -O3 -march=native -ffast-math).
GCC with -funroll-loops doesn't actually use multiple accumulators, IIRC. I haven't looked in a while, I might be wrong, but I think it will just repeat the loop body using the same accumulator register, not helping at all to run more dep chains in parallel. Clang will actually use 2 or 4 different vector registers to hold partial sums for d2v, and add them at the end outside the loop. (But for large sizes, 8 or more would be better. Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?)
Unrolling would also make it worthwhile to use pointer increments, saving 1 uop in each of the vaddpd and vfmadd instructions on Intel SnB-family.
Why is m_f.size() being kept in memory (cmp rax, [rsp+0x50]) instead of in a register? Are you compiling with strict-aliasing disabled? The loop doesn't write memory, so that's just strange. Unless the compiler thinks the loop will run very few iterations, so it's not worth emitting code outside the loop to load the limit into a register?
Copying and negating j every iteration looks like a missed optimization. It would obviously be more efficient to start with 2 registers outside the loop and do add rax, 0x20 / sub rbx, 0x20 every loop iteration instead of MOV+NEG.
If you have a minimal reproducible example of this, it looks like several missed optimizations that could be reported as compiler bugs. This asm looks like gcc output to me.
It's disappointing that gcc uses such a terrible horizontal-sum idiom. VHADDPD is 3 uops, 2 of which need the shuffle port. Maybe try a newer version of GCC, like 8.2. Although I'm not sure if avoiding VHADDPS/PD was part of closing GCC bug 80846 as fixed. That link is to my comment on the bug analyzing GCC's hsum code using packed-single, using vhaddps twice.
It looks like your hsum following the loop is actually "hot", so you're suffering from gcc's compact but inefficient hsum.
That shuffle handles the v[i - j] term. Since that access moves backwards through memory as j increases, the shuffle is necessary to reverse the order of the 4 values that are read from memory.

When should I use DO CONCURRENT and when OpenMP?

I am aware of this and this, but I ask again as the first link is pretty old now, and the second link did not seem to reach a conclusive answer. Has any consensus developed?
My problem is simple:
I have a DO loop whose iterations may be run concurrently. Which method do I use?
Below is code to generate particles on a simple cubic lattice.
npart is the number of particles
npart_edge & npart_face are that along an edge and a face, respectively
space is the lattice spacing
Rx, Ry, Rz are position arrays
x, y, z are temporary variables used to decide the position on the lattice
Note the difference: x, y and z have to be arrays in the DO CONCURRENT case, but not in the OpenMP case, because there they can be declared PRIVATE.
So do I use DO CONCURRENT (which, as I understand from the links above, uses SIMD) :
DO CONCURRENT (i = 1, npart)
  x(i) = MODULO(i-1, npart_edge)
  Rx(i) = space*x(i)
  y(i) = MODULO( ( (i-1) / npart_edge ), npart_edge)
  Ry(i) = space*y(i)
  z(i) = (i-1) / npart_face
  Rz(i) = space*z(i)
END DO
Or do I use OpenMP?
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(x,y,z)
!$OMP DO
DO i = 1, npart
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO( ( (i-1) / npart_edge ), npart_edge)
  Ry(i) = space*y
  z = (i-1) / npart_face
  Rz(i) = space*z
END DO
!$OMP END DO
!$OMP END PARALLEL
My tests:
Placing 64 particles in a box of side 10:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 6.870000000000001E-003
Real time = 3.600000000000000E-003
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 6.699999999999979E-005
Real time = 0.000000000000000E+000
Placing 100000 particles in a box of side 100:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 8.213300000000000E-002
Real time = 1.280000000000000E-002
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 2.385000000000000E-003
Real time = 2.400000000000000E-003
Using the DO CONCURRENT construct seems to be giving me at least an order of magnitude better performance. This was done on an i7-4790K. Also, the advantage of concurrency seems to decrease with increasing size.
DO CONCURRENT does not do any parallelization per se. The compiler may decide to parallelize it using threads, use SIMD instructions, or even offload to a GPU. For threads you often have to instruct it to do so. For GPU offloading you need a particular compiler with particular options. Or (often!) the compiler just treats DO CONCURRENT as a regular DO and uses SIMD if it would have used SIMD for the regular DO.
OpenMP is also not just threads; the compiler can use SIMD instructions if it wants. There is also the omp simd directive, but that is only a suggestion to the compiler to use SIMD, and it can be ignored.
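As a rough illustration (a hypothetical sketch based on your loop; it creates no threads), asking for SIMD explicitly would look like:
!$OMP SIMD
DO i = 1, npart
  Rx(i) = space*MODULO(i-1, npart_edge)
  Ry(i) = space*MODULO( (i-1)/npart_edge, npart_edge)
  Rz(i) = space*( (i-1)/npart_face )
END DO
Even then, the compiler is free to decide the loop is not worth vectorizing.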
You should try, measure and see. There is no single definitive answer, not even for a given compiler, let alone for all compilers.
If you would not otherwise use OpenMP, I would give DO CONCURRENT a try to see if the automatic parallelizer does a better job with this construct. Chances are good that it will help. If your code is already in OpenMP, I do not see any point in introducing DO CONCURRENT.
My practice is to use OpenMP and try to make sure the compiler vectorizes (SIMD) what it can. Especially because I use OpenMP all over my program anyway. DO CONCURRENT still has to prove it is actually useful. I am not convinced, yet, but some GPU examples look promising - however, real codes are often much more complex.
Your specific examples and the performance measurement:
Too little code is given, and there are subtle points in any benchmark. I wrote some simple code around your loops and did my own tests. I was careful NOT to include the thread creation in the timed block; you should not include the !$OMP PARALLEL region creation in your timing. I also took the minimum real time over multiple runs, because sometimes the first run is longer (certainly with DO CONCURRENT); the CPU has various throttle states and may need some time to spin up. I also added SCHEDULE(STATIC).
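The structure I mean is roughly this (a sketch, not my exact code; nrep, tmin and t0 are helper variables introduced here, and omp_get_wtime needs USE omp_lib):
tmin = HUGE(tmin)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(irep, i, x, y, z)
DO irep = 1, nrep
  !$OMP SINGLE
  t0 = omp_get_wtime()                     ! timer starts after the threads already exist
  !$OMP END SINGLE
  !$OMP DO SCHEDULE(STATIC)
  DO i = 1, npart
    x = MODULO(i-1, npart_edge)
    Rx(i) = space*x
    y = MODULO( (i-1)/npart_edge, npart_edge)
    Ry(i) = space*y
    z = (i-1)/npart_face
    Rz(i) = space*z
  END DO
  !$OMP END DO
  !$OMP SINGLE
  tmin = MIN(tmin, omp_get_wtime() - t0)   ! keep the best of nrep repeats
  !$OMP END SINGLE
END DO
!$OMP END PARALLEL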
npart=10000000
ifort -O3 concurrent.f90: 6.117300000000000E-002
ifort -O3 concurrent.f90 -parallel: 5.044600000000000E-002
ifort -O3 concurrent_omp.f90: 2.419600000000000E-002
npart=10000, default 8 threads (hyper-threading)
ifort -O3 concurrent.f90: 5.430000000000000E-004
ifort -O3 concurrent.f90 -parallel: 8.899999999999999E-005
ifort -O3 concurrent_omp.f90: 1.890000000000000E-004
npart=10000, OMP_NUM_THREADS=4 (ignore hyper-threading)
ifort -O3 concurrent.f90: 5.410000000000000E-004
ifort -O3 concurrent.f90 -parallel: 9.200000000000000E-005
ifort -O3 concurrent_omp.f90: 1.070000000000000E-004
Here, DO CONCURRENT seems to be somewhat faster for the small case, but not too much if we make sure to use the right number of cores. It is clearly slower for the big case. The -parallel option is clearly necessary for the automatic parallelization.

vectorized sum in Fortran

I am compiling my Fortran code using gfortran with -mavx and have verified that some instructions are vectorized via objdump, but I'm not getting the speed improvements that I was expecting, so I want to make sure the following line is being vectorized (this single line is ~50% of the runtime).
I know that some instructions can be vectorized, while others cannot, so I want to make sure this can be:
sum(A(i1:i2,ir))
Again, this single line takes about 50% of the runtime since I am doing this over a very large matrix. I can give more information on why I am doing this, but suffice it to say that it is necessary, though I can restructure the memory if needed (for example, I could do the sum as sum(A(ir,i1:i2)) if that could be vectorized instead).
Is this line being vectorized? How can I tell? How do I force vectorization if it is not being vectorized?
EDIT: Thanks to the comments, I now realize that I can check on the vectorization of this summation via -ftree-vectorizer-verbose and see that this is not vectorizing. I have restructured the code as follows:
tsum = 0.0d0
tn = i2 - i1 + 1
tvec(1:tn) = A(i1:i2, ir)
do ii = 1,tn
  tsum = tsum + tvec(ii)
enddo
and this ONLY vectorizes when I turn on -funsafe-math-optimizations, but I do see another 70% speed increase due to vectorization. The question still holds: Why does sum(A(i1:i2,ir)) not vectorize and how can I get a simple sum to vectorize?
It turns out that I am not able to make use of the vectorization unless I include -ffast-math or -funsafe-math-optimizations.
The two code snippets I played with are:
tsum = 0.0d0
tvec(1:n) = A(i1:i2, ir)
do ii = 1,n
  tsum = tsum + tvec(ii)
enddo
and
tsum = sum(A(i1:i2,ir))
and here are the times I get when running the first code snippet with different compilation options:
10.62 sec ... None
10.35 sec ... -mtune=native -mavx
7.44 sec ... -mtune=native -mavx -ffast-math
7.49 sec ... -mtune=native -mavx -funsafe-math-optimizations
Finally, with these same optimizations, I am able to vectorize tsum = sum(A(i1:i2,ir)) to get
7.96 sec ... None
8.41 sec ... -mtune=native -mavx
5.06 sec ... -mtune=native -mavx -ffast-math
4.97 sec ... -mtune=native -mavx -funsafe-math-optimizations
Comparing sum with -mtune=native -mavx against -mtune=native -mavx -funsafe-math-optimizations shows a ~70% speedup. (Note that these were only run once each; before we publish we will do true benchmarking on multiple runs.)
I do take a small hit though. My values change slightly when I use the -f options. Without them, the errors for my variables (v1, v2) are :
v1 ... 5.60663e-15 9.71445e-17 1.05471e-15
v2 ... 5.11674e-14 1.79301e-14 2.58127e-15
but with the optimizations, the errors are :
v1 ... 7.11931e-15 5.39846e-15 3.33067e-16
v2 ... 1.97273e-13 6.98608e-14 2.17742e-14
which indicates that there truly is something different going on.
Your explicit loop version still does the FP adds in a different order than a vectorized version would. A vector version uses 4 accumulators, each one getting every 4th array element.
You could write your source code to match what a vector version would do:
tsum0 = 0.0d0
tsum1 = 0.0d0
tsum2 = 0.0d0
tsum3 = 0.0d0
tn = i2 - i1 + 1
tvec(1:tn) = A(i1:i2, ir)
do ii = 1,tn,4   ! count by 4
  tsum0 = tsum0 + tvec(ii)
  tsum1 = tsum1 + tvec(ii+1)
  tsum2 = tsum2 + tvec(ii+2)
  tsum3 = tsum3 + tvec(ii+3)
enddo
tsum = (tsum0 + tsum1) + (tsum2 + tsum3)
This might vectorize without -ffast-math.
FP add has multi-cycle latency, but one or two per clock throughput, so you need the asm to use multiple vector accumulators to saturate the FP add unit(s). Skylake can do two FP adds per clock, with latency=4. Previous Intel CPUs do one per clock, with latency=3. So on Skylake, you need 8 vector accumulators to saturate the FP units. And of course they have to be 256b vectors, because AVX instructions are as fast but do twice as much work as SSE vector instructions.
Writing the source with 8 * 8 accumulator variables would be ridiculous, so I guess you need -ffast-math, or an OpenMP pragma that tells the compiler different orders of operations are ok.
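One such pragma is an OpenMP simd reduction, which explicitly grants permission to reassociate the sum (a sketch, assuming a compiler with OpenMP 4 support and the same variables as above; with gfortran it also needs -fopenmp or -fopenmp-simd, and it iterates over A directly so the tvec copy isn't needed):
tsum = 0.0d0
!$omp simd reduction(+:tsum)
do ii = i1, i2
  tsum = tsum + A(ii, ir)
enddo
With the reduction clause the compiler is allowed to keep several partial sums in vector registers and combine them after the loop, without turning on -ffast-math for the whole file.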
Explicitly unrolling your source means you have to handle loop counts that aren't a multiple of the vector width * unroll. If you put restrictions on things, it can help the compiler avoid generating multiple versions of the loop or extra loop setup/cleanup code.

Why is this for loop not faster using OpenMP?

I have extracted this simple member function from a larger 2D program, all it does is a for loop accessing from three different arrays and doing a math operation (1D convolution). I have been testing with using OpenMP to make this particular function faster:
void Image::convolve_lines()
{
  const int *ptr0 = tmp_bufs[0];
  const int *ptr1 = tmp_bufs[1];
  const int *ptr2 = tmp_bufs[2];
  const int width = Width;
  #pragma omp parallel for
  for ( int x = 0; x < width; ++x )
  {
    const int sum = 0
      + 1 * ptr0[x]
      + 2 * ptr1[x]
      + 1 * ptr2[x];
    output[x] = sum;
  }
}
If I use gcc 4.7 on debian/wheezy amd64, the overall program performs a lot slower on an 8-CPU machine. If I use gcc 4.9 on debian/jessie amd64 (only 4 CPUs on this machine), the overall program performs with very little difference.
Using time to compare:
single core run:
$ ./test black.pgm out.pgm 94.28s user 6.20s system 84% cpu 1:58.56 total
multi core run:
$ ./test black.pgm out.pgm 400.49s user 6.73s system 344% cpu 1:58.31 total
Where:
$ head -3 black.pgm
P5
65536 65536
255
So Width is set to 65536 during execution.
If that matters, I am using CMake for compilation:
add_executable(test test.cxx)
set_target_properties(test PROPERTIES COMPILE_FLAGS "-fopenmp" LINK_FLAGS "-fopenmp")
And CMAKE_BUILD_TYPE is set to:
CMAKE_BUILD_TYPE:STRING=Release
which implies -O3 -DNDEBUG
My question: why is this for loop not faster using multiple cores? There is no overlap on the arrays, and OpenMP should split the work equally. I do not see where the bottleneck is coming from.
EDIT: as it was commented, I changed my input file into:
$ head -3 black2.pgm
P5
33554432 128
255
So Width is now set to 33554432 during execution (which should be big enough). Now the timing reveals:
single core run:
$ ./test ./black2.pgm out.pgm 100.55s user 5.77s system 83% cpu 2:06.86 total
multi core run (for some reason cpu% was always below 100%, which would suggest no extra threads were actually running):
$ ./test ./black2.pgm out.pgm 117.94s user 7.94s system 98% cpu 2:07.63 total
I have some general comments:
1. Before optimizing your code, make sure the data is 16-byte aligned. This is extremely important for whatever optimization one wants to apply. If the data is separated into 3 pieces, it is better to add some padding elements so that the starting addresses of all 3 pieces are 16-byte aligned. That way the CPU can load your data into cache lines easily.
2. Make sure the simple function is vectorized before bringing in OpenMP. In most cases, using the AVX/SSE instruction sets should give you a decent 2x to 8x single-thread improvement. And it is very simple for your case: broadcast the constant 2 into an __m256i register, and load 8 integers from each of the three arrays into __m256i registers. On a Haswell processor, one addition and one multiplication can be done together, so in theory the loop should speed up by a factor of 12 if the AVX pipeline can be kept full!
3. Sometimes parallelization can degrade performance. A modern CPU needs several hundred to several thousand clock cycles to warm up, entering high-performance states and scaling up its frequency. If the task is not large enough, it is very likely the task is done before the CPU warms up, and you gain nothing by going parallel. And don't forget that OpenMP has overhead as well: thread creation, synchronization and destruction. Another case is poor memory management: if data accesses are too scattered, all CPU cores sit idle waiting for data from RAM.
My Suggestion:
You might want to try Intel MKL; don't reinvent the wheel. The library is optimized to the extreme, with no clock cycle wasted. You can link against either the serial library or the parallel version, and if parallelism can help, the speed boost comes for free.

Puzzling performance difference between ifort and gfortran

Recently, I read a post on Stack Overflow about finding integers that are perfect squares. As I wanted to play with this, I wrote the following small program:
PROGRAM PERFECT_SQUARE
  IMPLICIT NONE
  INTEGER*8 :: N, M, NTOT
  LOGICAL :: IS_SQUARE
  N=Z'D0B03602181'
  WRITE(*,*) IS_SQUARE(N)
  NTOT=0
  DO N=1,1000000000
    IF (IS_SQUARE(N)) THEN
      NTOT=NTOT+1
    END IF
  END DO
  WRITE(*,*) NTOT ! should find 31622 squares
END PROGRAM
LOGICAL FUNCTION IS_SQUARE(N)
  IMPLICIT NONE
  INTEGER*8 :: N, M
  ! check if negative
  IF (N.LT.0) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  ! check if ending 4 bits belong to (0,1,4,9)
  M=IAND(N,15)
  IF (.NOT.(M.EQ.0 .OR. M.EQ.1 .OR. M.EQ.4 .OR. M.EQ.9)) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  ! try to find the nearest integer to sqrt(n)
  M=DINT(SQRT(DBLE(N)))
  IF (M**2.NE.N) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  IS_SQUARE=.TRUE.
  RETURN
END FUNCTION
When compiling with gfortran -O2, running time is 4.437 seconds, with -O3 it is 2.657 seconds. Then I thought that compiling with ifort -O2 could be faster since it might have a faster SQRT function, but it turned out running time was now 9.026 seconds, and with ifort -O3 the same. I tried to analyze it using Valgrind, and the Intel compiled program indeed uses many more instructions.
My question is why? Is there a way to find out where exactly the difference comes from?
EDITS:
gfortran version 4.6.2 and ifort version 12.0.2
times are obtained from running time ./a.out and is the real/user time (sys was always almost 0)
this is on Linux x86_64, both gfortran and ifort are 64-bit builds
ifort inlines everything, gfortran only at -O3, but the latter assembly code is simpler than that of ifort, which uses xmm registers a lot
fixed line of code, added NTOT=0 before loop, should fix issue with other gfortran versions
When the complex IF statement is removed, gfortran takes about 4 times as much time (10-11 seconds). This is to be expected, since the statement throws out about 75% of the numbers (only 4 of the 16 possible values of the low 4 bits can belong to a square), avoiding the SQRT on them. On the other hand, ifort only uses slightly more time. My guess is that something goes wrong when ifort tries to optimize the IF statement.
EDIT2:
I tried with ifort version 12.1.2.273 and it's much faster, so it looks like they fixed that.
What compiler versions are you using?
Interestingly, it looks like a case where there is a performance regression from 11.1 to 12.0 -- e.g. for me, 11.1 (ifort -fast square.f90) takes 3.96s, and 12.0 (same options) took 13.3s.
gfortran (4.6.1) (-O3) is still faster (3.35s).
I have seen this kind of a regression before, although not quite as dramatic.
BTW, replacing the if statement with
is_square = any(m == [0, 1, 4, 9])
if(.not. is_square) return
makes it run twice as fast with ifort 12.0, but slower in gfortran and ifort 11.1.
It looks like part of the problem is that 12.0 is overly aggressive in trying to vectorize things: adding
!DEC$ NOVECTOR
right before the DO loop (without changing anything else in the code) cuts the run time down to 4.0 sec.
Also, as a side benefit: if you have a multi-core CPU, try adding -parallel to the ifort command line :)