vectorized sum in Fortran

I am compiling my Fortran code with gfortran and -mavx and have verified via objdump that some instructions are vectorized, but I'm not getting the speed improvements I was expecting, so I want to make sure the following expression is being vectorized (this single line is ~50% of the runtime).
I know that some instructions can be vectorized, while others cannot, so I want to make sure this can be:
sum(A(i1:i2,ir))
Again, this single line takes about 50% of the runtime since I am doing this over a very large matrix. I can give more information on why I am doing this, but suffice it to say that it is necessary, though I can restructure the memory layout if needed (for example, I could do the sum as sum(A(ir,i1:i2)) if that could be vectorized instead).
Is this line being vectorized? How can I tell? How do I force vectorization if it is not being vectorized?
EDIT: Thanks to the comments, I now realize that I can check on the vectorization of this summation via -ftree-vectorizer-verbose and see that this is not vectorizing. I have restructured the code as follows:
tsum = 0.0d0
tn = i2 - i1 + 1
tvec(1:tn) = A(i1:i2, ir)
do ii = 1,tn
  tsum = tsum + tvec(ii)
enddo
and this ONLY vectorizes when I turn on -funsafe-math-optimizations, but I do see another 70% speed increase due to vectorization. The question still holds: Why does sum(A(i1:i2,ir)) not vectorize and how can I get a simple sum to vectorize?

It turns out that I am not able to make use of the vectorization unless I include -ffast-math or -funsafe-math-optimizations.
The two code snippets I played with are:
tsum = 0.0d0
tvec(1:n) = A(i1:i2, ir)
do ii = 1,n
  tsum = tsum + tvec(ii)
enddo
and
tsum = sum(A(i1:i2,ir))
and here are the times I get when running the first code snippet with different compilation options:
10.62 sec ... None
10.35 sec ... -mtune=native -mavx
7.44 sec ... -mtune=native -mavx -ffast-math
7.49 sec ... -mtune=native -mavx -funsafe-math-optimizations
Finally, with these same optimizations, I am able to vectorize tsum = sum(A(i1:i2,ir)) to get
7.96 sec ... None
8.41 sec ... -mtune=native -mavx
5.06 sec ... -mtune=native -mavx -ffast-math
4.97 sec ... -mtune=native -mavx -funsafe-math-optimizations
Comparing the sum version built with -mtune=native -mavx against -mtune=native -mavx -funsafe-math-optimizations shows a ~70% speedup. (Note that these were only run once each; before we publish we will do true benchmarking on multiple runs.)
I do take a small hit though. My values change slightly when I use the -f options. Without them, the errors for my variables (v1, v2) are:
v1 ... 5.60663e-15 9.71445e-17 1.05471e-15
v2 ... 5.11674e-14 1.79301e-14 2.58127e-15
but with the optimizations, the errors are:
v1 ... 7.11931e-15 5.39846e-15 3.33067e-16
v2 ... 1.97273e-13 6.98608e-14 2.17742e-14
which indicates that there truly is something different going on.
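To see that the drift comes purely from the order of the additions (which is exactly what -funsafe-math-optimizations lets the compiler change), here is a small self-contained sketch with made-up data, not my actual matrix; the strictly sequential sum and a 4-accumulator sum of the same numbers typically agree only up to the last few bits:
program order_demo
  implicit none
  integer, parameter :: n = 100000
  double precision :: x(n), s_seq, s0, s1, s2, s3
  integer :: i
  call random_number(x)
  x = x * 1.0d3                  ! spread the magnitudes a little
  s_seq = 0.0d0
  do i = 1, n                    ! strictly sequential order
    s_seq = s_seq + x(i)
  end do
  s0 = 0.0d0; s1 = 0.0d0; s2 = 0.0d0; s3 = 0.0d0
  do i = 1, n, 4                 ! 4-accumulator order, like a vectorized sum
    s0 = s0 + x(i)
    s1 = s1 + x(i+1)
    s2 = s2 + x(i+2)
    s3 = s3 + x(i+3)
  end do
  print *, 'sequential:', s_seq
  print *, '4-way     :', (s0 + s1) + (s2 + s3)
  print *, 'difference:', s_seq - ((s0 + s1) + (s2 + s3))
end program order_demo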

Your explicit loop version still does the FP adds in a different order than a vectorized version would. A vector version uses 4 accumulators, each one getting every 4th array element.
You could write your source code to match what a vector version would do:
tsum0 = 0.0d0
tsum1 = 0.0d0
tsum2 = 0.0d0
tsum3 = 0.0d0
tn = i2 - i1 + 1
tvec(1:tn) = A(i1:i2, ir)
do ii = 1,tn,4 ! count by 4
  tsum0 = tsum0 + tvec(ii)
  tsum1 = tsum1 + tvec(ii+1)
  tsum2 = tsum2 + tvec(ii+2)
  tsum3 = tsum3 + tvec(ii+3)
enddo
tsum = (tsum0 + tsum1) + (tsum2 + tsum3)
This might vectorize without -ffast-math.
FP add has multi-cycle latency, but one or two per clock throughput, so you need the asm to use multiple vector accumulators to saturate the FP add unit(s). Skylake can do two FP adds per clock, with latency=4. Previous Intel CPUs do one per clock, with latency=3. So on Skylake, you need 8 vector accumulators to saturate the FP units. And of course they have to be 256b vectors, because AVX instructions are as fast but do twice as much work as SSE vector instructions.
Writing the source with 8 * 8 accumulator variables would be ridiculous, so I guess you need -ffast-math, or an OpenMP pragma that tells the compiler different orders of operations are ok.
Explicitly unrolling your source means you have to handle loop counts that aren't a multiple of the vector width * unroll. If you put restrictions on things, it can help the compiler avoid generating multiple versions of the loop or extra loop setup/cleanup code.
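For the OpenMP route mentioned above, a hedged sketch in Fortran: the OpenMP 4.0 simd directive with a reduction clause explicitly permits the compiler to reorder the additions, so -ffast-math is not needed (compile with -fopenmp, or -fopenmp-simd for just the SIMD directives). This assumes tvec and tn are set up as in the snippets above:
tsum = 0.0d0
!$omp simd reduction(+:tsum)
do ii = 1, tn
  tsum = tsum + tvec(ii)
end do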

Related

Is `-ftree-loop-vectorize` not enabled by `-O2` in GCC v12?

Example: https://www.godbolt.org/z/ahfcaj7W8
From https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Optimize-Options.html
It says:
"-ftree-loop-vectorize
     Perform loop vectorization on trees. This flag is enabled by default at -O2 and by -ftree-vectorize, -fprofile-use, and -fauto-profile."
However, it seems I have to pass a flag explicitly to turn on loop unrolling & SIMD. Did I misunderstand something here? It is enabled at -O3, though.
It is enabled at -O2 in GCC12, but only with a much lower cost threshold than at -O3, e.g. often only vectorizing when the loop trip count is a compile-time constant and known to be a multiple of the vector width (e.g. 8 for 32-bit elements with AVX2 vectors). See https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=2b8453c401b699ed93c085d0413ab4b5030bcdb8
https://godbolt.org/z/3xjdrx6as shows some loops at -O2 vs. -O3, with a sum of an array of integers only vectorizing with a constant count, not a runtime variable. Even `for (int i=0 ; i < (len&-16) ; i++) sum += arr[i]` to make the length a multiple of 16 doesn't make gcc -O2 auto-vectorize.
Before GCC12, -ftree-vectorize wasn't enabled at all by -O2.
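To illustrate the cost-model point in Fortran (a sketch only, assuming gfortran 12, which shares the same middle-end vectorizer and the very-cheap -O2 cost model): the first routine, with a compile-time-constant trip count that is a multiple of the vector width, is expected to vectorize at -O2, while the second, with a runtime trip count, typically only vectorizes at -O3. Check the reports with -fopt-info-vec:
subroutine add_fixed(a, b, c)
  implicit none
  real, intent(in)  :: a(1024), b(1024)   ! trip count is a compile-time constant
  real, intent(out) :: c(1024)
  integer :: i
  do i = 1, 1024
    c(i) = a(i) + b(i)
  end do
end subroutine add_fixed

subroutine add_var(a, b, c, n)
  implicit none
  integer, intent(in) :: n                ! trip count only known at run time
  real, intent(in)    :: a(n), b(n)
  real, intent(out)   :: c(n)
  integer :: i
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
end subroutine add_var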

How can I multithread this code snippet in C++ with Eigen

I'm trying to implement a faster version of the following code fragment:
Eigen::VectorXd dTX = (( (XPSF.array() - x0).square() + (ZPSF.array() - z0).square() ).sqrt() + txShift)*fs/c + t0*fs;
Eigen::VectorXd Zsq = ZPSF.array().square();
Eigen::MatrixXd idxt(XPSF.size(),nc);
for (int i = 0; i < nc; i++) {
    idxt.col(i) = ((XPSF.array() - xe(i)).square() + Zsq.array()).sqrt()*fs/c + dTX.array();
    idxt.col(i) = (abs(XPSF.array()-xe(i)) <= ZPSF.array()*0.5/fnumber).select(idxt.col(i),-1);
}
The sample array sizes I'm working with right now are:
XPSF: Column Vector of 591*192 coefficients (113,472 total values in the column vector)
ZPSF: Same size as XPSF
xe: RowVector of 192 coefficients
idxt: Matrix of 113,472x192 size
Current runs with gcc, -msse2 and -O3 optimization yield an average time of ~0.08 seconds for the first line of the loop and ~0.03 seconds for the second line. I know that runtimes are platform dependent, but I believe this can still be much faster. A commercial software package performs the operations I'm trying to do here in ~two orders of magnitude less time. Also, I suspect my code is a bit amateurish right now!
I've tried reading over Eigen documentation to understand how vectorization works, where it is implemented and how much of this code might be "implicitly" parallelized by Eigen, but I've struggled to keep track of the details. I'm also a bit new to C++ in general, but I've seen the documentation and other resources regarding std::thread and have tried to combine it with this code, but without much success.
Any advice would be appreciated.
Update 2:
I would upvote Soleil's answer, since it contains helpful information, if I had the reputation score for it. However, I should clarify that I would like to first figure out what optimizations I can do without a GPU. I'm convinced that (absent OpenMP) Eigen's inherent multithreading and vectorization won't speed it up any further (unless there are unnecessary temporaries being generated). How could I use something like std::thread to explicitly parallelize this? I'm struggling to combine std::thread and Eigen to this end.
OpenMP
If your CPU has enough cores and threads, a simple and quick first step is usually to invoke OpenMP by adding the pragma:
#pragma omp parallel for
for (int i = 0; i < nc; i++)
and compile with /openmp (cl) or -fopenmp (gcc), or just -ftree-parallelize-loops=n with gcc to auto-parallelize the loops.
This will do a map reduce and the map will occur over the number of parallel threads your CPU can handle (8 threads with the 7700HQ).
In general you can also add a num_threads(n) clause, where n is the desired number of threads:
#pragma omp parallel num_threads(8)
Where I used 8 since the 7700HQ can handle 8 concurrent threads.
TBB
You can also unroll your loop with TBB:
#pragma unroll
for (int i = 0; i < nc; i++)
threading integrated with eigen
With Eigen you can set the number of threads with any of:
OMP_NUM_THREADS=n ./my_program
omp_set_num_threads(n);
Eigen::setNbThreads(n);
remarks on multithreading with eigen
However, in the FAQ:
"currently Eigen parallelizes only general matrix-matrix products (bench), so it doesn't by itself take much advantage of parallel hardware."
In general, the improvement with OpenMP is not always there, so benchmark the release build. Another way is to make sure that you're using vectorized instructions.
Again, from the FAQ/vectorization:
How can I enable vectorization?
You just need to tell your compiler to enable the corresponding
instruction set, and Eigen will then detect it. If it is enabled by
default, then you don't need to do anything. On GCC and clang you can
simply pass -march=native to let the compiler enables all instruction
set that are supported by your CPU.
On the x86 architecture, SSE is not enabled by default by most
compilers. You need to enable SSE2 (or newer) manually. For example,
with GCC, you would pass the -msse2 command-line option.
On the x86-64 architecture, SSE2 is generally enabled by default, but
you can enable AVX and FMA for better performance
On PowerPC, you have to use the following flags: -maltivec
-mabi=altivec, for AltiVec, or -mvsx for VSX-capable systems.
On 32-bit ARM NEON, the following: -mfpu=neon -mfloat-abi=softfp|hard,
depending if you are on a softfp/hardfp system. Most current
distributions are using a hard floating-point ABI, so go for the
latter, or just leave the default and just pass -mfpu=neon.
On 64-bit ARM, SIMD is enabled by default, you don't have to do
anything extra.
On S390X SIMD (ZVector), you have to use a recent gcc (version >5.2.1)
compiler, and add the following flags: -march=z13 -mzvector.
multithreading with cuda
Given the size of your arrays, you want to try offloading to a GPU to reach microsecond timings; in that case you would (typically) have as many threads as there are elements in your array.
For a simple start, if you have an NVIDIA card, you want to look at cuBLAS, which also lets you use the tensor cores (fused multiply-add, etc.) of the latest generations, unlike regular kernels.
Since Eigen is a header-only library, it makes sense that you could use it in a CUDA kernel.
You may also implement everything "by hand" (i.e., without Eigen) with regular kernels. This makes little sense in terms of engineering, but it is common practice in education/university projects, in order to understand everything.
multithreading with OneAPI and Intel GPU
Since you have a Skylake architecture, you can also unroll your loop on your CPU's integrated GPU with oneAPI:
// Unroll loop as specified by the unroll factor.
#pragma unroll unroll_factor
for (int i = 0; i < nc; i++)
(from the sample).

Do we need vectorization in C++ or are for loops already fast enough?

In Matlab we use vectorization to speed up code. For example, here are two ways of performing the same calculation:
% Loop
tic
i = 0;
for t = 0:.01:1e5
    i = i + 1;
    y(i) = sin(t);
end
toc
% Vectorization
tic
t = 0:.01:1e5;
y = sin(t);
toc
The results are:
Elapsed time is 1.278207 seconds. % For loop
Elapsed time is 0.099234 seconds. % Vectorization
So the vectorized code is almost 13 times faster. Actually, if we run it again we get:
Elapsed time is 0.200800 seconds. % For loop
Elapsed time is 0.103183 seconds. % Vectorization
The vectorized code is now only 2 times as fast instead of 13 times as fast. So it appears we get a huge speedup on the first run of the code, but on future runs the speedup is not as great since Matlab appears to know that the for loop hasn't changed and is optimizing for it. In any case the vectorized code is still twice as fast as the for loop code.
Now I have started using C++ and I am wondering about vectorization in this language. Do we need to vectorize for loops in C++ or are they already fast enough? Maybe the compiler vectorizes them automatically? Actually, I don't know if Matlab-style vectorization is even a concept in C++; maybe it's only needed in Matlab because it is an interpreted language? How would you write the above code in C++ to make it as efficient as possible?
Do we need vectorization in C++
Vectorisation is not always needed, but it can make some programs faster.
C++ compilers support auto-vectorisation, although if you need vectorisation you might not be able to rely on such optimisation, because not every loop can be vectorised automatically.
are [loops] already fast enough?
Depends on the loop, the target CPU, the compiler and its options, and crucially: How fast does it need to be.
Some things that you could do to potentially achieve vectorisation in standard C++:
Enable compiler optimisations that perform auto vectorisation. (See the manual of your compiler)
Specify a target CPU that has vector operations in their instruction set. (See the manual of your compiler)
Use standard algorithms with the std::execution::par_unseq or std::execution::unseq execution policies.
Ensure that the data being operated on is sufficiently aligned for SIMD instructions. You can use alignas. See the manual of the target CPU for what alignment you need.
Ensure that the optimiser knows as much as possible by using link time optimisation.
Partially unroll your loops. The limitation of this is that you hard-code the amount of parallelisation:
for (int i = 0; i < count; i += 4) {
    operation(i + 0);
    operation(i + 1);
    operation(i + 2);
    operation(i + 3);
}
Outside of standard, portable C++, there are implementation specific ways:
Some compilers provide language extensions for writing explicitly vectorised programs. These are portable across different CPUs but not to compilers that don't implement the extension.
using v4si = int __attribute__ ((vector_size (16)));
v4si a, b, c;
a = b + 1; /* a = b + {1,1,1,1}; */
a = 2 * b; /* a = {2,2,2,2} * b; */
Some compilers provide "builtin" functions to invoke specific CPU instructions which can be used to invoke SIMD vector instructions. Using these is not portable across incompatible CPUs.
Some compilers support the OpenMP API, which has #pragma omp simd.

Why doesn't vectorization speed up these loops?

I'm getting up to speed with vectorization, since my current PC supports it. I have an Intel i7-7600U. It has 2 cores running at 2.8/2.9 GHz and supports SSE4.1, SSE4.2 and AVX2. I'm not sure of the vector register size. I believe it is 256 bits, so it will work with four 64-bit double-precision values at a time. I believe this should give a peak rate of:
(2.8 GHz)(2 cores)(4-wide vectors)(2 add/mult) ≈ 45 GFlops.
I am using GNU Gfortran and g++.
I have a set of Fortran loops I built up back in my days of working on various supercomputers.
One loop I tested is:
do j=1,m
  s(:) = s(:) + a(:,j)*b(:,j)
enddo
The vector length is 10000, m = 200 and the nest was executed 500 times to give 2e9 operations. I ran it with the j loop unrolled 0, 1, 2, 3 and 5 times. Unrolling should reduce the number of times s is loaded and stored. The loop should also be ideal for vectorization because all the memory accesses are stride one and it has a paired add and multiply. I ran it using both array syntax as shown above and using an inner do loop, but that seems to make little difference. With do loops and no unrolling it looks like:
do j=1,m
  do i=1,n
    s(i) = s(i) + a(i,j)*b(i,j)
  end do
end do
The build looks like:
gfortran -O3 -w -fimplicit-none -ftree-vectorize -fopt-info-vec loops.f90
The compiler says the loops are all vectorized. The best result I have gotten is about 2.8 GFlops, which is one flop per cycle. If I run it with:
gfortran -O2 -w -fimplicit-none -fno-tree-vectorize -fopt-info-vec loops.f90
No vectorization is reported. It executes a little slower without unrolling, but the same with unrolling. Can someone tell me what is going on here? Do I have the characterization of my processor wrong? Why doesn't vectorization speed it up? I was expecting to get at least some improvement. I apologize if this plows old ground, but I could not find a clean example similar to this.

When should I use DO CONCURRENT and when OpenMP?

I am aware of this and this, but I ask again as the first link is pretty old now, and the second link did not seem to reach a conclusive answer. Has any consensus developed?
My problem is simple:
I have a DO loop whose iterations may be run concurrently. Which method do I use?
Below is code to generate particles on a simple cubic lattice.
npart is the number of particles
npart_edge & npart_face are that along an edge and a face, respectively
space is the lattice spacing
Rx, Ry, Rz are position arrays
x, y, z are temporary variables to decide the position on the lattice
Note the difference that x, y and z have to be arrays in the CONCURRENT case, but not in the OpenMP case, because there they can be declared PRIVATE.
So do I use DO CONCURRENT (which, as I understand from the links above, uses SIMD):
DO CONCURRENT (i = 1, npart)
  x(i) = MODULO(i-1, npart_edge)
  Rx(i) = space*x(i)
  y(i) = MODULO( ( (i-1) / npart_edge ), npart_edge)
  Ry(i) = space*y(i)
  z(i) = (i-1) / npart_face
  Rz(i) = space*z(i)
END DO
Or do I use OpenMP?
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(x,y,z)
!$OMP DO
DO i = 1, npart
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO( ( (i-1) / npart_edge ), npart_edge)
  Ry(i) = space*y
  z = (i-1) / npart_face
  Rz(i) = space*z
END DO
!$OMP END DO
!$OMP END PARALLEL
My tests:
Placing 64 particles in a box of side 10:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 6.870000000000001E-003
Real time = 3.600000000000000E-003
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 6.699999999999979E-005
Real time = 0.000000000000000E+000
Placing 100000 particles in a box of side 100:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 8.213300000000000E-002
Real time = 1.280000000000000E-002
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 2.385000000000000E-003
Real time = 2.400000000000000E-003
Using the DO CONCURRENT construct seems to be giving me at least an order of magnitude better performance. This was done on an i7-4790K. Also, the advantage of concurrency seems to decrease with increasing size.
DO CONCURRENT does not do any parallelization per se. The compiler may decide to parallelize it using threads, use SIMD instructions, or even offload to a GPU. For threads you often have to instruct it to do so. For GPU offloading you need a particular compiler with particular options. Or (often!) the compiler just treats DO CONCURRENT as a regular DO and uses SIMD if it would have used it for the regular DO.
OpenMP is also not just threads; the compiler can use SIMD instructions if it wants. There is also the omp simd directive, but that is only a suggestion to the compiler to use SIMD, and it can be ignored.
You should try, measure and see. There is no single definitive answer, not even for a given compiler, let alone for all compilers.
If you would not otherwise use OpenMP, I would give DO CONCURRENT a try to see if the automatic parallelizer does a better job with this construct. Chances are good that it will help. If your code already uses OpenMP, I do not see any point in introducing DO CONCURRENT.
My practice is to use OpenMP and try to make sure the compiler vectorizes (SIMD) what it can. Especially because I use OpenMP all over my program anyway. DO CONCURRENT still has to prove it is actually useful. I am not convinced, yet, but some GPU examples look promising - however, real codes are often much more complex.
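As a concrete example of that practice (a sketch only; this is the OpenMP 4.0 combined construct, and whether the compiler actually emits SIMD instructions is still up to it), the OpenMP version of the loop above can request both threading and SIMD in a single directive:
!$OMP PARALLEL DO SIMD DEFAULT(SHARED) PRIVATE(x,y,z) SCHEDULE(STATIC)
DO i = 1, npart
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO( ( (i-1) / npart_edge ), npart_edge)
  Ry(i) = space*y
  z = (i-1) / npart_face
  Rz(i) = space*z
END DO
!$OMP END PARALLEL DO SIMD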
Your specific examples and the performance measurement:
Too little code is given, and there are subtle points in any benchmarking. I wrote some simple code around your loops and did my own tests. I was careful NOT to include the thread creation in the timed block; you should not include !$omp parallel in your timing. I also took the minimum real time over multiple runs, because sometimes the first run is longer (certainly with DO CONCURRENT); the CPU has various throttle modes and may need some time to spin up. I also added SCHEDULE(STATIC).
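For reference, a hedged sketch of the kind of harness meant here (hypothetical sizes and names, not the code that produced the numbers below): the parallel region is entered once, so thread creation stays outside the timed block, each repetition times only the worksharing loop, and the minimum over all repetitions is reported:
program bench_sketch
  use omp_lib
  implicit none
  integer, parameter :: npart_edge = 216
  integer, parameter :: npart_face = npart_edge**2
  integer, parameter :: npart = npart_edge**3
  integer, parameter :: nreps = 20
  double precision, parameter :: space = 1.0d0
  double precision, allocatable :: Rx(:), Ry(:), Rz(:)
  double precision :: t0, t1, tmin
  integer :: i, rep, x, y, z

  allocate(Rx(npart), Ry(npart), Rz(npart))
  tmin = huge(1.0d0)
  !$omp parallel default(shared) private(x, y, z, rep, i)
  do rep = 1, nreps
    !$omp single
    t0 = omp_get_wtime()
    !$omp end single                 ! implicit barrier: all threads start together
    !$omp do schedule(static)
    do i = 1, npart
      x = MODULO(i-1, npart_edge)
      Rx(i) = space*x
      y = MODULO( ( (i-1) / npart_edge ), npart_edge)
      Ry(i) = space*y
      z = (i-1) / npart_face
      Rz(i) = space*z
    end do
    !$omp end do                     ! implicit barrier: all threads have finished
    !$omp single
    t1 = omp_get_wtime()
    tmin = min(tmin, t1 - t0)
    !$omp end single
  end do
  !$omp end parallel
  print *, 'minimum real time (s): ', tmin
end program bench_sketch
The DO CONCURRENT variant can reuse the same skeleton, with the worksharing loop replaced by the DO CONCURRENT block and the timing done with omp_get_wtime (or system_clock) outside any parallel region.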
npart=10000000
ifort -O3 concurrent.f90: 6.117300000000000E-002
ifort -O3 concurrent.f90 -parallel: 5.044600000000000E-002
ifort -O3 concurrent_omp.f90: 2.419600000000000E-002
npart=10000, default 8 threads (hyper-threading)
ifort -O3 concurrent.f90: 5.430000000000000E-004
ifort -O3 concurrent.f90 -parallel: 8.899999999999999E-005
ifort -O3 concurrent_omp.f90: 1.890000000000000E-004
npart=10000, OMP_NUM_THREADS=4 (ignore hyper-threading)
ifort -O3 concurrent.f90: 5.410000000000000E-004
ifort -O3 concurrent.f90 -parallel: 9.200000000000000E-005
ifort -O3 concurrent_omp.f90: 1.070000000000000E-004
Here, DO CONCURRENT seems to be somewhat faster for the small case, but not too much if we make sure to use the right number of cores. It is clearly slower for the big case. The -parallel option is clearly necessary for the automatic parallelization.