How to optimize a simple loop? - C++

The loop is simple
void loop(int n, double* a, double const* b)
{
#pragma ivdep
    for (int i = 0; i < n; ++i, ++a, ++b)
        *a *= *b;
}
I am using the Intel C++ compiler and currently use #pragma ivdep for optimization. Is there any way to make it perform better, such as using multicore and vectorization together, or other techniques?

This loop is absolutely vectorizable by the compiler. But make sure the loop was actually vectorized (using the compiler's -qopt-report5, the assembly output, Intel (Vectorization) Advisor, or other techniques). One more, overkill, way is to create a performance baseline using the -no-vec option (which disables both ivdep-driven and auto-vectorization) and then compare execution times against it. This is not a reliable way to check for vectorization, but it is useful for the general performance analysis in the next bullets.
If the loop hasn't actually been vectorized, make sure you push the compiler to auto-vectorize it, as described in the next bullet. Note that the next bullet can be useful even if the loop was successfully auto-vectorized.
To push the compiler to vectorize it, use: (a) the restrict keyword to "disambiguate" the a and b pointers (someone has already suggested this to you); (b) #pragma omp simd (which has the extra bonus of being more portable and much more flexible than ivdep, but also the drawback of being unsupported in compilers older than Intel Compiler version 14, and of being more "dangerous" for other kinds of loops). To re-emphasize: this bullet may seem to do the same thing as ivdep, but depending on circumstances it can be the better and more powerful option. A sketch follows below.
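For illustration, a minimal sketch of (a) plus (b) applied to your kernel (the __restrict spelling is a common compiler extension in C++; verify it, and the OpenMP flag such as -qopenmp for ICC, against your compiler's documentation):
// Sketch: restrict-qualified pointers plus an OpenMP SIMD hint.
// __restrict promises the compiler that a and b never alias.
void loop_simd(int n, double* __restrict a, double const* __restrict b)
{
#pragma omp simd
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}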
The given loop has fine-grained iterations (too little computation per iteration) and overall is not purely compute-bound (the effort/cycles the CPU spends loading/storing data from/to cache/memory is comparable to, if not bigger than, the effort/cycles spent on the multiplication). Unrolling is often a good way to slightly mitigate such disadvantages, and I would recommend explicitly asking the compiler to unroll it using #pragma unroll. In fact, for certain compiler versions the unrolling happens automatically. Again, you can check whether the compiler did it using -qopt-report5, the loop assembly, or Intel (Vectorization) Advisor.
In the given loop you deal with a "streaming" access pattern: you load/store data contiguously from/to memory (and the cache subsystem will not help much for big n values). So, depending on the target hardware and on whether you add multi-threading on top of SIMD, your loop will likely become memory-bandwidth-bound in the end. Once you are memory-bandwidth-bound, you can use techniques like loop blocking, non-temporal stores, and aggressive prefetching. Each of these techniques is worth a separate article, although for prefetching and NT-stores the Intel Compiler has some pragmas to play with.
If n is huge and you are already prepared for memory bandwidth troubles, you can use something like #pragma omp parallel for simd, which simultaneously thread-parallelizes and vectorizes the loop. However, the quality of this feature became decent only in very fresh compiler versions AFAIK, so maybe you'd prefer to split n semi-manually: n = n1 x n2 x n3, where n1 is the number of iterations to distribute among threads, n2 is for cache blocking, and n3 is for vectorization. Rewrite the given loop as a nest of 3 loops, where the outer loop has n1 iterations (with #pragma omp parallel for applied), the next level has n2 iterations, and n3 is the innermost (with #pragma omp simd applied); a sketch of this loop nest follows.
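A hedged sketch of that 3-level split (the chunk sizes CHUNK and BLOCK are placeholders to tune per platform, not recommendations):
// Sketch: split n into thread-level (n1), cache-block (n2), and SIMD-level (n3) chunks.
// Assumes n is a multiple of CHUNK and CHUNK a multiple of BLOCK; handle remainders in real code.
void loop_nest(int n, double* __restrict a, double const* __restrict b)
{
    const int CHUNK = 1 << 20;   // per-thread chunk size (tune)
    const int BLOCK = 4096;      // cache-block size (tune)
#pragma omp parallel for
    for (int ic = 0; ic < n; ic += CHUNK)                 // n1: distributed among threads
        for (int ib = ic; ib < ic + CHUNK; ib += BLOCK)   // n2: cache blocking
#pragma omp simd
            for (int i = ib; i < ib + BLOCK; ++i)         // n3: vectorized innermost loop
                a[i] *= b[i];
}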
Some up to date links with syntax examples and more info:
unroll: https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling
OpenMP SIMD pragma (not so fresh and detailed, but still relevant): https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40
restrict vs. ivdep
NT-stores and prefetching : https://software.intel.com/sites/default/files/managed/22/a3/mtaap2013-prefetch-streaming-stores.pdf
Note 1: I apologize for not providing full, tuned code snippets here. There are at least 2 justifiable reasons: 1. My 5 bullets are applicable to very many kernels, not just yours. 2. The specific combination of pragmas/manual rewriting techniques and the corresponding performance results will vary depending on the target platform, ISA, and compiler version.
Note 2: A last comment regarding your GPU question. Think of your loop vs. simple industry benchmarks like LINPACK or STREAM; in fact, your loop could end up looking very similar to some of them. Now think of the LINPACK/STREAM characteristics of x86 CPUs, and especially of the Intel Xeon Phi platform. They are very good indeed and will become even better with High Bandwidth Memory platforms (like the 2nd-generation Xeon Phi). So theoretically there is no reason to think that your given loop is not well mapped to at least some variants of x86 hardware (note that I didn't say the same thing for an arbitrary kernel).

Assuming the data pointed to by a can't overlap the data pointed to by b, the most important thing you can do to let the compiler optimize the code is to tell it that fact.
In older ICC versions, restrict was the only clean way to provide that key information to the compiler. In newer versions there are a few cleaner ways to give a much stronger guarantee than ivdep gives (in fact, ivdep is a weaker promise to the optimizer than it appears and generally doesn't have the intended effect). A sketch of the restrict route follows.
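A minimal sketch of that route (in C++ the keyword is a compiler extension, commonly spelled __restrict; ICC also accepts the C99 spelling restrict with the -restrict option):
// Sketch: the no-overlap guarantee expressed in the signature itself.
void loop(int n, double* __restrict a, double const* __restrict b)
{
    for (int i = 0; i < n; ++i, ++a, ++b)
        *a *= *b;
}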
But if n is large, the whole thing will be dominated by the cache misses, so no local optimization can help.

Manually unrolling the loop is a simple way to optimize your code; my code follows below. The original loop costs 618.48 ms, while loop2 costs 381.10 ms on my PC; the compiler is GCC with option -O2. I don't have Intel ICC to verify the code, but I think the optimization principles are the same.
Similarly, I did some experiments comparing the execution time of two programs that XOR two blocks of memory, where one program is vectorized with the help of SIMD instructions while the other is manually loop-unrolled. If you are interested, see here.
P.S. Of course loop2 only works when n is even.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#define LEN 512*1024
#define times 1000

void loop(int n, double* a, double const* b){
    int i;
    for(i = 0; i < n; ++i, ++a, ++b)
        *a *= *b;
}

void loop2(int n, double* a, double const* b){
    int i;
    for(i = 0; i < n; i=i+2, a=a+2, b=b+2){  /* braces are required here; without them */
        *a *= *b;                            /* only the first statement is in the loop */
        *(a+1) *= *(b+1);
    }
}

int main(void){
    double *la, *lb;
    struct timeval begin, end;
    int i;
    la = (double *)malloc(LEN*sizeof(double));
    lb = (double *)malloc(LEN*sizeof(double));
    /* Initialize the buffers: repeatedly multiplying uninitialized memory
       can produce NaNs or denormals and skew the timings. */
    for(i = 0; i < LEN; ++i){ la[i] = 1.0; lb[i] = 1.0; }
    gettimeofday(&begin, NULL);
    for(i = 0; i < times; ++i){
        loop(LEN, la, lb);
    }
    gettimeofday(&end, NULL);
    printf("Time cost : %.2f ms\n", (end.tv_sec-begin.tv_sec)*1000.0
                                  + (end.tv_usec-begin.tv_usec)/1000.0);
    gettimeofday(&begin, NULL);
    for(i = 0; i < times; ++i){
        loop2(LEN, la, lb);
    }
    gettimeofday(&end, NULL);
    printf("Time cost : %.2f ms\n", (end.tv_sec-begin.tv_sec)*1000.0
                                  + (end.tv_usec-begin.tv_usec)/1000.0);
    free(la);
    free(lb);
    return 0;
}

I assume that n is large. You can distribute the workload over k CPUs by starting k threads and giving each n/k elements. Use big chunks of consecutive data for each thread; don't do fine-grained interleaving. Try to align the chunks with cache lines.
If you plan to scale to more than one NUMA node, consider explicitly copying each chunk of the workload to the node its thread runs on, and copying the results back. In this case it might not really help, because the work per element is very simple. You'll have to run tests for that.
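A minimal sketch of the chunked-threads idea using std::thread (the chunk boundaries here are not cache-line-aligned; that refinement is left out for brevity):
#include <thread>
#include <vector>

// Sketch: split [0, n) into one big contiguous chunk per thread.
void loop_mt(int n, double* a, double const* b, int k)
{
    std::vector<std::thread> workers;
    const int chunk = n / k;
    for (int t = 0; t < k; ++t) {
        const int begin = t * chunk;
        const int end   = (t == k - 1) ? n : begin + chunk;  // last thread takes the remainder
        workers.emplace_back([=]{
            for (int i = begin; i < end; ++i)
                a[i] *= b[i];
        });
    }
    for (auto& w : workers) w.join();
}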

Related

How to make this OpenMP run faster?

I am new to OpenMP and I have this code for a sparse matrix-vector multiplication; it runs in between 40 and 50 seconds at a total of 4237 MFlops/s. Is there any way to make it faster?
I have edited the post with the complete code, and as input I have 2 matrices, one with 50000 elements and the second with 400000.
The main problem is that whenever I try something different, the time gets even worse.
#pragma omp parallel for schedule (static,50)
for (int i=0; i< (tInput->stNumRows); ++i) {
    y[i] = 0.0;
    for (int j=Arow[i]; j<Arow[i+1]; ++j)
        y[i] += Aval[j]*x[Acol[j]];
}
One thing you can do to improve the performance of the code is to use vectorization (thanks to SIMD instructions). Here is the resulting code:
for (int i=0; i< (tInput->stNumRows); ++i) {
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (int j=Arow[i]; j<Arow[i+1]; ++j)
        s += Aval[j] * x[Acol[j]];
    y[i] = s;
}
Note that y[i] is no longer repeatedly read/written inside the inner loop, which enables further compiler optimizations. Please take care to compile the code with -O3 (or /O2 for MSVC) for it to be effectively vectorized. However, this is probably not enough for this code to be vectorized.
Indeed, one issue with this code is the memory indirection x[Acol[j]], which is very hard to vectorize efficiently. Recent x86-64 processors (the ones with AVX2) and very recent ARM processors (the ones with SVE) have SIMD gather instructions for this (although they are still not great due to the memory access pattern). Without these instructions, no compiler will likely vectorize the code. Thus, you should tell your compiler it can use these instructions (assuming the target processor is actually recent). For GCC/Clang, one way is to use the non-portable -march=native. Another way is to use -mavx2 combined with -mfma on x86-64 processors (although this does not seem to be as good as -march=native in this case, for very complex reasons).
Another way to improve the code is to mitigate possible load-balancing issues and unwanted overheads. Indeed, load-balancing issues can appear in your code if the row length Arow[i+1]-Arow[i] varies a lot across i values. In that case, you can use a guided schedule or a dynamic one. However, keep in mind that a non-static schedule may introduce significant overhead (especially if the loop is very small or the gap between values is huge). Finally, you can move the omp parallel directive outside the timing loop body, since thread creation can introduce a significant overhead (depending on the target OpenMP runtime).
Note that the above solutions assume the input matrices are big enough for parallelism to be useful. Moreover, if x is huge, the code will likely be bound by the memory hierarchy and there is not much you can do. Sparse matrix computations are often slow because of such issues.
Here is the final code:
#pragma omp parallel
{
    // Timing loop
    // [...]

    #pragma omp for schedule(guided)
    for (int i=0; i< (tInput->stNumRows); ++i) {
        double s = 0.0;
        #pragma omp simd reduction(+:s)
        for (int j=Arow[i]; j<Arow[i+1]; ++j)
            s += Aval[j] * x[Acol[j]];
        y[i] = s;
    }

    // [...]
}
EDIT: with your input data, the best solution on my machine (with Clang/IOMP) is not to use multiple threads at all since 400000 elements can be computed in roughly 0.3 ms and the overhead of sharing the work between threads is bigger.

Do we need vectorization in C++ or are for loops already fast enough?

In Matlab we use vectorization to speed up code. For example, here are two ways of performing the same calculation:
% Loop
tic
i = 0;
for t = 0:.01:1e5
    i = i + 1;
    y(i) = sin(t);
end
toc

% Vectorization
tic
t = 0:.01:1e5;
y = sin(t);
toc
The results are:
Elapsed time is 1.278207 seconds. % For loop
Elapsed time is 0.099234 seconds. % Vectorization
So the vectorized code is almost 13 times faster. Actually, if we run it again we get:
Elapsed time is 0.200800 seconds. % For loop
Elapsed time is 0.103183 seconds. % Vectorization
The vectorized code is now only 2 times as fast instead of 13 times as fast. So it appears we get a huge speedup on the first run of the code, but on later runs the speedup is not as great, since Matlab appears to notice that the for loop hasn't changed and optimizes for it. In any case, the vectorized code is still twice as fast as the for loop code.
Now I have started using C++ and I am wondering about vectorization in this language. Do we need to vectorize for loops in C++, or are they already fast enough? Maybe the compiler automatically vectorizes them? Actually, I don't know if Matlab-style vectorization is even a concept in C++; maybe it's only needed in Matlab because it is an interpreted language? How would you write the above function in C++ to make it as efficient as possible?
Do we need vectorization in C++
Vectorisation is not always necessary, but it can make some programs faster.
C++ compilers support auto-vectorisation, although if you need vectorisation you might not be able to rely on such optimisation, because not every loop can be vectorised automatically.
are [loops] already fast enough?
Depends on the loop, the target CPU, the compiler and its options, and crucially: How fast does it need to be.
Some things that you could do to potentially achieve vectorisation in standard C++:
Enable compiler optimisations that perform auto vectorisation. (See the manual of your compiler)
Specify a target CPU that has vector operations in their instruction set. (See the manual of your compiler)
Use standard algorithms with the std::execution::par_unseq or std::execution::unseq execution policies (see the sketch after this list).
Ensure that the data being operated on is sufficiently aligned for SIMD instructions. You can use alignas. See the manual of the target CPU for what alignment you need.
Ensure that the optimiser knows as much as possible by using link time optimisation.
Partially unroll your loops. A limitation of this is that you hard-code the amount of parallelisation (and the snippet below assumes count is a multiple of 4):
for (int i = 0; i < count; i += 4) {
    operation(i + 0);
    operation(i + 1);
    operation(i + 2);
    operation(i + 3);
}
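As an illustration of the standard-algorithm bullet above, a minimal sketch using std::transform with an unsequenced policy (std::execution::unseq requires C++20; std::execution::par_unseq is available from C++17, and with libstdc++ the parallel policies need TBB at link time):
#include <algorithm>
#include <execution>
#include <vector>

// Sketch: let the library/compiler vectorise (and optionally parallelise) the loop.
void scale(std::vector<double>& a, const std::vector<double>& b)
{
    std::transform(std::execution::unseq,
                   a.begin(), a.end(), b.begin(), a.begin(),
                   [](double x, double y) { return x * y; });
}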
Outside of standard, portable C++, there are implementation-specific ways:
Some compilers provide language extensions to write explicitly vectorised programs. These are portable across different CPUs, but not to compilers that don't implement the extension. For example, GCC's vector extensions:
using v4si = int __attribute__ ((vector_size (16)));
v4si a, b, c;
a = b + 1; /* a = b + {1,1,1,1}; */
a = 2 * b; /* a = {2,2,2,2} * b; */
Some compilers provide "builtin" functions to invoke specific CPU instructions which can be used to invoke SIMD vector instructions. Using these is not portable across incompatible CPUs.
Some compilers support OpenMP API which has #pragma omp simd.
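To make this concrete for your example, a hedged sketch of how the Matlab snippet might look in C++ (the compiler is free to auto-vectorise this plain loop at -O2/-O3 with a suitable -march; sin is usually the limiting factor because it is a libm call, though vectorised math libraries such as SVML or libmvec can kick in):
#include <cmath>
#include <vector>

int main()
{
    const std::size_t n = 10000001;      // number of points in t = 0:.01:1e5
    std::vector<double> y(n);
    for (std::size_t i = 0; i < n; ++i)
        y[i] = std::sin(i * 0.01);       // same values as the Matlab version
}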

Adding arrays of L1 cache size: large arrays are faster in absolute terms than short arrays

Good morning,
I wrote the following program to add two arrays:
#include <iostream>

#define line 32

inline void add(float a[], float b[]){
    for (int i=0; i<line; i++){ a[i]+=b[i]; }
}

int main(){
    float a[line]; for (int i=0; i<line; i++){ a[i]=0.; }
    float b[line]; for (int i=0; i<line; i++){ b[i]=0.; }
    for (int i=0; i<1024*1024*512; i++){ add(a,b); }                  // Add the arrays many times
    for (int i=0; i<line; i++){ std::cout << a[i] << std::endl; }     // Print the arrays, else -O5 optimizes them away
}
I compiled it with (g++ version 4.8.4 / my hardware is older)
g++ add.c++ -O5 -o Test
and run it with
time ./Test
if line=32 then it needs 1.3 seconds
if line=16 then it needs 2.3 seconds
I tried it several times and the run-time is always the same (so it's stable).
I understand that a large array can be relatively faster (vector processors etc.), but I don't understand why it is faster in absolute terms. I wrote this program to figure out how to achieve peak performance. My question: what is going on in the CPU, and how can I improve it?
Your compiler will be able to unroll loops with hard-coded limits like 16/32/64 very easily. It's likely that unrolling your "32 times" loop and using AVX results in exactly 4 AVX additions, as sketched below. This is going to be faster than a "16 times" loop resulting in 2 AVX additions, as there will likely be a pipeline stall when the branch happens for each of your "lines" (unless you are running on something like a Xeon, which can speculatively execute several paths).
Microbenchmarks can be misleading unless done carefully, so you should always look at the generated assembly. And since your benchmark just hits the same memory over and over, you should think about whether this is actually representative of what will happen in production.
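For reference, a hedged sketch of what those "4 AVX additions" look like when written out by hand with intrinsics (assuming 256-bit AVX; the compiler will normally produce something equivalent on its own from the plain loop):
#include <immintrin.h>

// Sketch: line == 32 floats processed as 4 x 8-float AVX additions.
inline void add32(float* a, const float* b){
    for (int i = 0; i < 32; i += 8){
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
    }
}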
From your compile statement it's clear that you are using -O5 optimization to compile the code.
Note that g++ does not define any optimization level above -O3: any higher number is treated the same as -O3, so your -O5 behaves exactly like -O3. The levels above -O2 enable aggressive transformations (like the heavy unrolling and vectorization discussed in the other answer) whose effect on a given program is hard to predict, which can make timing experiments confusing.
While I was coding I faced this type of problem, and for testing purposes I stopped using optimization levels above -O2.
I hope this helps.

Make g++ produce a program that can use multiple cores?

I have a C++ program with multiple for loops; each one runs about 5 million iterations. Is there any command I can use with g++ to make the resulting .exe use multiple cores, i.e., make the first for loop run on the first core and the second for loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases my CPU usage still only hovers around 25%.
EDIT:
Here is my code, in case it helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
#include <math.h>
using namespace std;

int main()
{
    float *bob = new float[50102133];
    float *jim = new float[50102133];
    float *joe = new float[50102133];
    int i,j,k,l;
    //cout << "Starting test...";
    for (i=0;i<50102133;i++)
        bob[i] = sin(i);
    for (j=0;j<50102133;j++)
        bob[j] = sin(j*j);
    for (k=0;k<50102133;k++)
        bob[k] = sin(sqrt(k));
    for (l=0;l<50102133;l++)
        bob[l] = cos(l*l);
    cout << "finished test.";
    cout << "the 100120 element is," << bob[1001200];
    return 0;
}
The most obvious choice would be to use OpenMP. Assuming your loop is one where it's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma omp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write code that does more computation and less pure reading and writing of memory. For example, if we collapse your computations together, do all of them on a single item, and then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
    total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>

static const int size = 1024 * 1024 * 128;

int main(){
    double total = 0;
#pragma omp parallel for reduction(+:total)
    for (int i = 0; i < size; i++)
        total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
    std::cout << total << "\n";
}
The compiler has no way to tell whether the code inside your loop can safely be executed on multiple cores. If you want to use all your cores, use threads.
Use threads or processes; you may also want to look at OpenMP.
C++11 has support for threading, but C++ compilers won't/can't do any threading on their own.
As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention targets the SIMD vector units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation-parallel within a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems, as the body of each loop very obviously has no dependencies on the previous iteration and the operations in the loop are linear.
On some ARM Cortex-A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.

How do I force the compiler not to skip my function calls?

Let's say I want to benchmark two competing implementations of some function double a(double b, double c). I already have a large std::array<double, 1000000> vals from which I can take input values, so my benchmarking would look roughly like this:
//start timer here
double r;
for (int i = 0; i < 1000000; i += 2) {
    r = a(vals[i], vals[i+1]);
}
//stop timer here
Now, a clever compiler could realize that I only ever use the result of the last iteration and simply eliminate the rest, leaving me with double r = a(vals[999998], vals[999999]). This of course defeats the purpose of benchmarking.
Is there a good way (bonus points if it works on multiple compilers) to prevent this kind of optimization while keeping all other optimizations in place?
(I have seen other threads about inserting empty asm blocks, but I'm worried that might prevent inlining or reordering. I'm also not particularly fond of adding sum += r; in each iteration, because that's extra work that should not be included in the resulting timings. For the purposes of this question it would be great if we could focus on other alternative solutions, although for anyone interested there is a lively discussion in the comments, where the consensus is that += is the most appropriate method in many cases.)
Put a in a separate compilation unit and do not use LTO (link-time optimization). That way:
The loop is always identical (no difference due to optimizations based on a)
The overhead of the function call is always the same
To measure the pure overhead and to have a baseline to compare implementations, just benchmark an empty version of a
Note that the compiler cannot assume that the call to a has no side effects, so it cannot optimize the loop away and replace it with just the last call. A sketch of this setup follows.
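A minimal sketch of that setup (the file names and build flags are illustrative; the key point is that the definition of a is invisible to the translation unit doing the timing, and that -flto is not used):
// a_impl.cpp -- one competing implementation; compiled separately, no LTO
double a(double b, double c) { return b * c; }

// bench.cpp
#include <array>
#include <chrono>
#include <cstdio>

double a(double b, double c);  // declaration only: the optimizer cannot see the body

int main()
{
    static std::array<double, 1000000> vals{};  // fill with real inputs in practice
    double r = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; i += 2)
        r = a(vals[i], vals[i+1]);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%f ns, r=%f\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count(), r);
}

// Build (illustrative): g++ -O3 -c a_impl.cpp && g++ -O3 bench.cpp a_impl.o -o bench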
A totally different approach is to use RDTSC, an x86 instruction that reads the CPU core's time-stamp counter (clock cycles). It's sometimes useful for micro-benchmarks, but it's not exactly trivial to interpret the results correctly. For example, google/search SO for more information on RDTSC.