Do we need vectorization in C++ or are for loops already fast enough? - c++

In Matlab we use vectorization to speed up code. For example, here are two ways of performing the same calculation:
% Loop
tic
i = 0;
for t = 0:.01:1e5
i = i + 1;
y(i) = sin(t);
end
toc
% Vectorization
tic
t = 0:.01:1e5;
y = sin(t);
toc
The results are:
Elapsed time is 1.278207 seconds. % For loop
Elapsed time is 0.099234 seconds. % Vectorization
So the vectorized code is almost 13 times faster. Actually, if we run it again we get:
Elapsed time is 0.200800 seconds. % For loop
Elapsed time is 0.103183 seconds. % Vectorization
The vectorized code is now only 2 times as fast instead of 13 times as fast. So it appears we get a huge speedup on the first run of the code, but on future runs the speedup is not as great since Matlab appears to know that the for loop hasn't changed and is optimizing for it. In any case the vectorized code is still twice as fast as the for loop code.
Now I have started using C++ and I am wondering about vectorization in this language. Do we need to vectorize for loops in C++ or are they already fast enough? Maybe the compiler automatically vectorizes them? Actually, I don't know if Matlab type vectorization is even a concept in C++, maybe its just needed for Matlab because this is an interpreted language? How would you write the above function in C++ to make it as efficient as possible?

Do we need vectorization in C++
Vectorisation is not necessarily needed always, but it can make some programs faster.
C++ compilers support auto-vectorisation, although if you need to have vectorisation, then you might not be able to rely on such optimisation because not every loop can be vectorised automatically.
are [loops] already fast enough?
Depends on the loop, the target CPU, the compiler and its options, and crucially: How fast does it need to be.
Some things that you could do to potentially achieve vectorisation in standard C++:
Enable compiler optimisations that perform auto vectorisation. (See the manual of your compiler)
Specify a target CPU that has vector operations in their instruction set. (See the manual of your compiler)
Use standard algorithms with std::parallel_unsequenced_policy or std::unsequenced_policy.
Ensure that the data being operated on is sufficiently aligned for SIMD instructions. You can use alignas. See the manual of the target CPU for what alignment you need.
Ensure that the optimiser knows as much as possible by using link time optimisation.
Partially unroll your loops. Limitation of this is that you hard code the amount of parallelisation:
for (int i = 0; i < count; i += 4) {
operation(i + 0);
operation(i + 1);
operation(i + 2);
operation(i + 3);
}
Outside of standard, portable C++, there are implementation specific ways:
Some compilers provide language extension to write explicitly vectorised programs. This is portable across different CPUs but not portable to compilers that don't implement the extension.
using v4si = int __attribute__ ((vector_size (16)));
v4si a, b, c;
a = b + 1; /* a = b + {1,1,1,1}; */
a = 2 * b; /* a = {2,2,2,2} * b; */
Some compilers provide "builtin" functions to invoke specific CPU instructions which can be used to invoke SIMD vector instructions. Using these is not portable across incompatible CPUs.
Some compilers support OpenMP API which has #pragma omp simd.

Related

How can I multithread this code snippet in C++ with Eigen

I'm trying to implement a faster version of the following code fragment:
Eigen::VectorXd dTX = (( (XPSF.array() - x0).square() + (ZPSF.array() - z0).square() ).sqrt() + txShift)*fs/c + t0*fs;
Eigen::VectorXd Zsq = ZPSF.array().square();
Eigen::MatrixXd idxt(XPSF.size(),nc);
for (int i = 0; i < nc; i++) {
idxt.col(i) = ((XPSF.array() - xe(i)).square() + Zsq.array()).sqrt()*fs/c + dTX.array();
idxt.col(i) = (abs(XPSF.array()-xe(i)) <= ZPSF.array()*0.5/fnumber).select(idxt.col(i),-1);
}
The sample array sizes I'm working with right now are:
XPSF: Column Vector of 591*192 coefficients (113,472 total values in the column vector)
ZPSF: Same size as XPSF
xe: RowVector of 192 coefficients
idxt: Matrix of 113,472x192 size
Current runs with gcc and -msse2 and -o3 optimization yield an average time of ~0.08 seconds for the first line of the loop and ~0.03 seconds for the second line of the loop. I know that runtimes are platform dependent, but I believe that this still can be much faster. A commercial software performs the operations I'm trying to do here in ~two orders of magnitude less time. Also, I suspect my code is a bit amateurish right now!
I've tried reading over Eigen documentation to understand how vectorization works, where it is implemented and how much of this code might be "implicitly" parallelized by Eigen, but I've struggled to keep track of the details. I'm also a bit new to C++ in general, but I've seen the documentation and other resources regarding std::thread and have tried to combine it with this code, but without much success.
Any advice would be appreciated.
Update:
Update 2
I would upvote Soleil's answer because it contains helpful information if I had the reputation score for it. However, I should clarify that I would like to first figure out what optimizations I can do without a GPU. I'm convinced (albeit without OpenMP) Eigen's inherent multithreading and vectorization won't speed it up any further (unless there are unnecessary temporaries being generated). How could I use something like std::thread to explicitly parellelize this? I'm struggling to combine both std::thread and Eigen to this end.
OpenMP
If your CPU has enough many cores and threads, usually a simple and quick first step is to invoke OpenMP by adding the pragma:
#pragma omp parallel for
for (int i = 0; i < nc; i++)
and compile with /openmp (cl) or -fopenmp (gcc) or just -ftree-parallelize-loops with gcc in order to auto unroll the loops.
This will do a map reduce and the map will occur over the number of parallel threads your CPU can handle (8 threads with the 7700HQ).
In general you also can set a clause num_threads(n) where n is the desired number of threads:
#pragma omp parallel num_threads(8)
Where I used 8 since the 7700HQ can handle 8 concurrent threads.
TBB
You also can unroll your loop with TBB:
#pragma unroll
for (int i = 0; i < nc; i++)
threading integrated with eigen
With Eigen you can add
OMP_NUM_THREADS=n ./my_program
omp_set_num_threads(n);
Eigen::setNbThreads(n);
remarks with multithreading with eigen
However, in the FAQ:
currently Eigen parallelizes only general matrix-matrix products (bench), so it doesn't by itself take much advantage of parallel hardware."
In general, the improvement with OpenMP is not always here, so benchmark the release build. Another way is to make sure that you're using vectorized instructions.
Again, from the FAQ/vectorization:
How can I enable vectorization?
You just need to tell your compiler to enable the corresponding
instruction set, and Eigen will then detect it. If it is enabled by
default, then you don't need to do anything. On GCC and clang you can
simply pass -march=native to let the compiler enables all instruction
set that are supported by your CPU.
On the x86 architecture, SSE is not enabled by default by most
compilers. You need to enable SSE2 (or newer) manually. For example,
with GCC, you would pass the -msse2 command-line option.
On the x86-64 architecture, SSE2 is generally enabled by default, but
you can enable AVX and FMA for better performance
On PowerPC, you have to use the following flags: -maltivec
-mabi=altivec, for AltiVec, or -mvsx for VSX-capable systems.
On 32-bit ARM NEON, the following: -mfpu=neon -mfloat-abi=softfp|hard,
depending if you are on a softfp/hardfp system. Most current
distributions are using a hard floating-point ABI, so go for the
latter, or just leave the default and just pass -mfpu=neon.
On 64-bit ARM, SIMD is enabled by default, you don't have to do
anything extra.
On S390X SIMD (ZVector), you have to use a recent gcc (version >5.2.1)
compiler, and add the following flags: -march=z13 -mzvector.
multithreading with cuda
Given the size of your arrays, you want to try to offload to a GPU to reach the microsecond; in that case you would have (typically) as many threads as the number of elements in your array.
For a simple start, if you have an nvidia card, you want to look at cublas, which also allows you to use the tensor registers (fused multiply add, etc) of the last generations, unlike regular kernel.
Since eigen is a header only library, it makes sense that you could use it in a cuda kernel.
You also may implements everything "by hand" (ie., without eigen) with regular kernels. This is a nonsense in terms of engineering, but common practice in an education/university project, in order to understand everything.
multithreading with OneAPI and Intel GPU
Since you have a skylake architecture, you also can unroll your loop on your CPU's GPU with OneAPI:
// Unroll loop as specified by the unroll factor.
#pragma unroll unroll_factor
for (int i = 0; i < nc; i++)
(from the sample).

Why is vectorization not beneficial in this for loop?

I am trying to vectorize this for loop. After using the Rpass flag, I am getting the following remark for it:
int someOuterVariable = 0;
for (unsigned int i = 7; i != -1; i--)
{
array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}
Remark:
The cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial
I want to understand what this means. Does "interleaving is not benificial" mean the array indexing is not proper?
It's hard to answer without more details about your types. But in general, starting a loop incurs some costs and vectorising also implies some costs (such as moving data to/from SIMD registers, ensuring proper alignment of data)
I'm guessing here that the compiler tells you that the vectorisation cost here is bigger than simply running the 8 iterations without it, so it's not doing it.
Try to increase the number of iterations, or help the compiler for computing alignement for example.
Typically, unless the type of array's item are exactly of the proper alignment for SIMD vector, accessing an array from a "unknown" offset (what you've called someOuterVariable) prevents the compiler to write an efficient vectorisation code.
EDIT: About the "interleaving" question, it's hard to guess without knowning your tool. But in general, interleaving usually means mixing 2 streams of computations so that the compute units of the CPU are all busy. For example, if you have 2 ALU in your CPU, and the program is doing:
c = a + b;
d = e * f;
The compiler can interleave the computation so that both the addition and multiplication happens at the same time (provided you have 2 ALU available). Typically, this means that the multiplication which is a bit longer to compute (for example 6 cycles) will be started before the addition (for example 3 cycles). You'll then get the result of both operation after only 6 cycles instead of 9 if the compiler serialized the computations. This is only possible if there is no dependencies between the computation (if d required c, it can not work). A compiler is very cautious about this, and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.

How to optimize a simple loop?

The loop is simple
void loop(int n, double* a, double const* b)
{
#pragma ivdep
for (int i = 0; i < n; ++i, ++a, ++b)
*a *= *b;
}
I am using intel c++ compiler and using #pragma ivdep for optimization currently. Any way to make it perform better like using multicore and vectorization together, or other techniques?
This loop is absolutely vectorizable by compiler. But make sure that loop was actually vectorized (using Compiler' -qopt-report5, assembly output, Intel (vectorization) Advisor, whatever other techniques). One more overkill way to do that is creating performance baseline using -no-vec option (which will disable ivdep-driven and auto-vectorization) and then compare execution time against it. This is not good way for checking vectorization presence, but it's useful for general performance analysis for next bullets.
If loop hasn't been actually vectorized, make sure you push compiler to auto-vectorize it. In order to push compiler see next bullet. Note that next bullet could be useful even if loop was succesfully auto-vectorized.
To push compiler to vectorize it use: (a) restrict keyword to "disambiguate" a and b pointers (someone has already suggested it to you). (b) #pragma omp simd (which has extra bonus of being more portable and much more flexible than ivdep, but also has a drawback of being unsupported in old compilers before intel compiler version 14 and for other loops is more "dangerous"). To re-emphasize: given bullet may seem to do the same thing as ivdep, but depending on various circumstances it could be better and more powerful option.
Given loop has fine-grain iterations (too small amount of computations per single iteration) and overall is not purely compute-bound (so effort/cycles spent by CPU to load/store data from/to cache/memory is comparable if not bigger to effort/cycles spent to perform multiplication). Unrolling is often good way to slightly mitigate such disadvantages. But I would recommend to explicitly ask compiler to unroll it, by using #pragma unroll. In fact, for certain compiler versions the unrolling will happen automatically. Again, you can check whenever compiler did it by using -qopt-report5, loop assembly or Intel (Vectorization) Advisor:
In given loop you deal with "streaming" access pattern. I.e. you are contiguously loading/store data from/to memory (and cache sub-system will not help a lot for big "n" values). So, depending on target hardware, usage of multi-threading (atop of SIMD), etc, your loop will likely become memory bandwidth bound in the end. Once you become memory bandwidth bound, you could use techniques like loop blocking, non-temporal stores, aggressive prefetching. All of these techniques worth separate article, although for prefetching/NT-stores you have some pragmas in Intel Compiler to play with.
If n is huge, and you already got prepared to memory bandwidth troubles, you could use things like #pragma omp parallel for simd, which will simulteneously thread-parallelize and vectorize the loop. However quality of this feature has been made decent only in very fresh compiler versions AFAIK, so maybe you'd prefer to split n semi-manually. I.e. n=n1xn2xn3, where n1 - is number of iterations to distribute among threads, n2 - for cache blocking, n3 - for vectorization. Rewrite given loop to make it loopnest of 3 nested loops, where outer loop has n1 iterations (and #pragma omp parallel for is applied), next level loop has n2 iterations, n3 - is innermost (where #pragma omp simd is applied).
Some up to date links with syntax examples and more info:
unroll: https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling
OpenMP SIMD pragma (not so fresh and detailed, but still relevant): https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40
restrict vs. ivdep
NT-stores and prefetching : https://software.intel.com/sites/default/files/managed/22/a3/mtaap2013-prefetch-streaming-stores.pdf
Note1: I apologize that I don't provide various code snippets here. There are at least 2 justifiable reasons for not providing them here: 1. My 5 bullets are pretty much applicable to very many kernels, not just to yours. 2. On the other hand specific combination of pragmas/manual rewriting techniques and corresponding performance results will vary depending on target platform, ISA and Compiler version.
Note2: Last comment regarding your GPU question. Think of your loop vs. simple industry benchmarks like LINPACK or STREAM. In fact your loop could become somewhat very similar to some of them in the end. Now think of x86 CPUs and especially Intel Xeon Phi platform characteristics for LINPACK/STREAM. They are very good indeed and will become even better with High Bandwidth Memory platforms (like Xeon Phi 2nd gen). So theoretically there is no any single reason to think that your given loop is not well mapped to at least some variants of x86 hardware (note that I didn't say similar thing for arbitrary kernel in universe).
Assuming the data pointed to by a can't overlap the data pointed to by b the most important information to give the compiler to let it optimize the code is that fact.
In older ICC version "restrict" was the only clean way to provide that key information to the compiler. In newer versions there are a few cleaner ways to give a much stronger guarantee than ivdep gives (in fact ivdep is a weaker promise to the optimizer than it appears and generally doesn't have the intended effect).
But if n is large, the whole thing will be dominated by the cache misses, so no local optimization can help.
Loop unrolling manually is a simple way to optimize your code, and following is my code. Original loop costs 618.48 ms, while loop2 costs 381.10 ms in my PC, the compiler is GCC with option '-O2'. I don't have Intel ICC to verify the code, but I think the optimization principles are the same.
Similarly, I did some experiments that compare the execution time of two programs to XOR two blocks of memories, and one program is vectorized with the help of SIMD instructions, while the other is manually loop-unrolled. If you are interested, see here.
P.S. Of course loop2 only works when n is even.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#define LEN 512*1024
#define times 1000
void loop(int n, double* a, double const* b){
int i;
for(i = 0; i < n; ++i, ++a, ++b)
*a *= *b;
}
void loop2(int n, double* a, double const* b){
int i;
for(i = 0; i < n; i=i+2, a=a+2, b=b+2)
*a *= *b;
*(a+1) *= *(b+1);
}
int main(void){
double *la, *lb;
struct timeval begin, end;
int i;
la = (double *)malloc(LEN*sizeof(double));
lb = (double *)malloc(LEN*sizeof(double));
gettimeofday(&begin, NULL);
for(i = 0; i < times; ++i){
loop(LEN, la, lb);
}
gettimeofday(&end, NULL);
printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0\
+(end.tv_usec-begin.tv_usec)/1000.0);
gettimeofday(&begin, NULL);
for(i = 0; i < times; ++i){
loop2(LEN, la, lb);
}
gettimeofday(&end, NULL);
printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0\
+(end.tv_usec-begin.tv_usec)/1000.0);
free(la);
free(lb);
return 0;
}
I assume, that n is large. You can distribute the workload on k CPUs by starting k threads and provide each with n/k elements. Use big chunks of consecutive data for each thread, don't do finegrained interleaving. Try to align the chunks with cache lines.
If you plan to scale to more than one NUMA node, consider to explicitly copy the chunks of workload to the node, the thread runs on, and copy back the results. In this case, it might not really help, because the workload for each step is very simple. You'll have to run tests for that.

How is cilk reduce done (thread vs smid)

I have something like that :
for (b=from; b<to; b++)
{
for (a=from2; a<to2; a++)
{
dest->ac[b] += srcvec->ac[a] * srcmatrix->weight[a+(b+from)*matrix_width];
}
}
that i'd like to parallelize using cilk. I have written the following code :
for ( b=from; b<to; b++)
{
dest->ac[b] =+ __sec_reduce_add(srcvec->ac[from2:to2-from2] * (srcmatrix->weight+(b*matrix_width))[from2:to2-from2]);
}
but the thing is, I could use a cilk_for on the primary loop, but if the reduce operation is already spawning thread, won't the cilk_for augment the thread overhead, and slow the whole thing down ?
And should I add restrict to dest and src args to further help the compiler ? or is it implicit in this case ?
(ps: I can't try the code right now because of
internal compiler error: in find_rank, at
c-family/array-notation-common.c:244
on
neu1b->ac[0:layer1_size]=neu1->ac[0:layer1_size];
that i'am trying to solve also.)
restrict is not implicitely the case. Furthermore Cilk is implemented using the work-stealing concept. Cilk does not necessarily spawn extra threads for extra work. It works with pushing tasks on a work stack. More info about the internal working can be found on the Cilk FAQ. The Intel compiler might handle things differently than GCC with Cilk. Intel vTune and the intel vectorization report can help you to measure performance differences and indicate whether it's compiled to SIMD or not. With the Intel compiler you can also indicate SIMD operations as follows:
#pragma simd above your loop
or
array notations:
a[:] = b[:] + c[:] to program vectorized array operations.

Make g++ produce a program that can use multiple cores?

I have a c++ program with multiple For loops; each one runs about 5 million iterations. Is there any command I can use with g++ to make the resulting .exe will use multiple cores; i.e. make the first For loop run on the first core and the second For loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases, my cpu usage still only hovers at around 25%.
EDIT:
Here is my code, in case in helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
using namespace std;
#include <math.h>
int main()
{
float *bob = new float[50102133];
float *jim = new float[50102133];
float *joe = new float[50102133];
int i,j,k,l;
//cout << "Starting test...";
for (i=0;i<50102133;i++)
bob[i] = sin(i);
for (j=0;j<50102133;j++)
bob[j] = sin(j*j);
for (k=0;k<50102133;k++)
bob[k] = sin(sqrt(k));
for (l=0;l<50102133;l++)
bob[l] = cos(l*l);
cout << "finished test.";
cout << "the 100120 element is," << bob[1001200];
return 0;
}
The most obviously choice would be to use OpenMP. Assuming your loop is one that's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma openmp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>
static const int size = 1024 * 1024 * 128;
int main(){
double total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
std::cout << total << "\n";
}
The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.
Use Threads or Processes, you may want to look to OpenMp
C++11 got support for threading but c++ compilers won't/can't do any threading on their own.
As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation parallel using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems as the body of each loop very obviously has no dependancies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.