basic openmp program runs slower [duplicate]

This question already has answers here:
No performance gain after using openMP on a program optimize for sequential running
(3 answers)
Closed 7 years ago.
I am trying to make my program run faster, so I want to use parallel computing. Before doing that, I tried it on a simple for loop, but it runs slower.
Before OpenMP:
int a[100000] = { 0 };
clock_t begin = clock();
for (int i = 0; i < 100000; i++)
{
a[i] = i;
}
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
printf("%lf", elapsed_secs);
After OpenMP:
int a[100000] = { 0 };
clock_t begin = clock();
#pragma omp parallel for
for (int i = 0; i < 100000; i++)
{
a[i] = i;
}
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
printf("%lf", elapsed_secs);

You say that your code runs slower, but you don't actually know that. The reason is that you measure time with clock(), and this function counts the CPU time of the current thread and possibly that of all the threads it spawns. To evaluate speed-ups, you need to measure elapsed wall-clock time, and for this purpose OpenMP offers omp_get_wtime(). Try using it in your code and then you'll really know whether or not your code gets any benefit from OpenMP.
Now, let's be clear: your code does nothing more than write to memory, so there is a strong likelihood that you'll saturate your memory bandwidth pretty quickly. Therefore, unless you have multiple memory controllers, it is unlikely that you will gain much from adding threads in this case. Please have a look at this answer to convince yourself.
And finally, make sure you do something with your data before exiting the program; otherwise, the compiler is likely to just optimise the loop out, leaving code that does pretty much nothing (but does it very fast).
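A minimal sketch of the measurement advice above, assuming a C++11 compiler. It takes wall-clock time from std::chrono::steady_clock so that it also builds without OpenMP; omp_get_wtime() from <omp.h> would serve the same purpose when compiling with -fopenmp. The names TimedResult and fill_and_checksum are made up for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical helper type: both timings plus a checksum of the data.
struct TimedResult {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time as reported by clock()
    long long checksum;   // using the data keeps the loop from being optimised away
};

TimedResult fill_and_checksum(std::vector<int>& a) {
    const auto wall_begin = std::chrono::steady_clock::now();
    const std::clock_t cpu_begin = std::clock();

    const long long n = static_cast<long long>(a.size());
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        a[static_cast<std::size_t>(i)] = static_cast<int>(i);
    }

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    // Read every element back so the stores above have an observable effect.
    long long sum = 0;
    for (int v : a) sum += v;

    return {std::chrono::duration<double>(wall_end - wall_begin).count(),
            static_cast<double>(cpu_end - cpu_begin) / CLOCKS_PER_SEC,
            sum};
}
```

Printing wall_seconds, cpu_seconds and checksum from main shows how far the two clocks diverge once several threads are running.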

To be successful with your first OpenMP parallel (multi-threaded) code examples, you need to improve your test cases from the following two perspectives:
1. Make your examples testable. To do that:
- make sure that your code is complex enough not to give compilers any chance to "optimize" the whole loop out (i.e. to prevent compilers from replacing the whole loop with a single expression)
- you may end up needing to wrap your loop in a function and pass an argument to this function at runtime (via argc/argv) to keep the compiler from seeing through it, while keeping the code very simple
- make sure you use proper compilation flags (-O2 -fopenmp for GCC, some other flags for other compilers)
- make sure your loop takes enough time and that you use a proper way to measure the time spent in the loop (other respondents, including Gilles, have already pointed this out very well)
2. Make sure that your loop does enough (ideally computational) work (i.e. additions, multiplications, etc.) in every iteration, so that the various overheads of the under-the-hood work inside the OpenMP runtime library (required to "schedule"/plan/distribute iterations between threads) are not "bigger" than the amount of useful work done in a bunch of your loop iterations.
The second and third "parallel for" examples in the Wikipedia OpenMP article already mostly satisfy these criteria (while your example does not). You are at the point where just following the Wikipedia examples will help you gain some basic understanding.
After you learn these basics, your next steps would be (a) understanding "Data Races" / "Race Conditions" / "Loop Carried Dependencies" and (b) understanding the "difference" between #pragma omp parallel and #pragma omp for (again, you will need to find simple examples from books or basic OpenMP courses).
(To be honest, all other topics, like OpenMP imbalance, dynamic vs. static scheduling, and memory bandwidth, will make sense only after you spend at least a couple of days reading about and practicing the simpler notions.)
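A small sketch of the two "next steps" mentioned above, assuming GCC-style OpenMP support (-fopenmp). The function name sum_of_roots and the arithmetic inside it are invented for illustration. #pragma omp parallel creates the team of threads once, each #pragma omp for splits one loop among them, and the reduction clause avoids the data race that a plain shared total += ... would cause.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical workload: some arithmetic per element, size chosen at run time.
double sum_of_roots(int n) {
    std::vector<double> partial(static_cast<std::size_t>(n));
    double total = 0.0;

    #pragma omp parallel                    // creates the team of threads
    {
        #pragma omp for                     // distributes the iterations of this loop
        for (int i = 0; i < n; ++i) {
            partial[i] = std::sqrt(static_cast<double>(i)) * 2.0 + 1.0;
        }
        // Implicit barrier: all of partial[] is written before the next loop starts.

        #pragma omp for reduction(+:total)  // per-thread partial sums, combined at the end
        for (int i = 0; i < n; ++i) {
            total += partial[i];
        }
    }
    return total;
}
```

Without -fopenmp the pragmas are ignored and the function computes the same result sequentially, which makes it easy to compare both builds.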

Related

How to optimize a simple loop?

The loop is simple
void loop(int n, double* a, double const* b)
{
#pragma ivdep
for (int i = 0; i < n; ++i, ++a, ++b)
*a *= *b;
}
I am using the Intel C++ compiler and currently use #pragma ivdep for optimization. Is there any way to make it perform better, such as using multicore and vectorization together, or other techniques?

This loop is absolutely vectorizable by the compiler, but make sure that it was actually vectorized (using the compiler's -qopt-report5 option, assembly output, Intel (Vectorization) Advisor, or any other technique). A more heavyweight way to check is to create a performance baseline with the -no-vec option (which disables both ivdep-driven and auto-vectorization) and compare execution times against it. This is not a good way to check for the presence of vectorization, but it is useful for the general performance analysis in the next bullets.
If the loop hasn't actually been vectorized, push the compiler to vectorize it as described in the next bullet. Note that the next bullet can be useful even if the loop was successfully auto-vectorized.
To push the compiler to vectorize it, use: (a) the restrict keyword to "disambiguate" the a and b pointers (someone has already suggested this to you); (b) #pragma omp simd (which has the extra bonus of being more portable and much more flexible than ivdep, but also has the drawbacks of being unsupported in old compilers before Intel compiler version 14 and of being more "dangerous" for other loops). To re-emphasize: this bullet may seem to do the same thing as ivdep, but depending on circumstances it can be a better and more powerful option.
The given loop has fine-grained iterations (too little computation per iteration) and overall is not purely compute-bound (the effort/cycles the CPU spends loading/storing data from/to cache/memory is comparable to, if not bigger than, the effort/cycles spent on the multiplication). Unrolling is often a good way to slightly mitigate such disadvantages, but I would recommend explicitly asking the compiler to unroll the loop with #pragma unroll. In fact, for certain compiler versions the unrolling will happen automatically. Again, you can check whether the compiler did it by using -qopt-report5, the loop assembly or Intel (Vectorization) Advisor.
In the given loop you deal with a "streaming" access pattern, i.e. you are contiguously loading/storing data from/to memory (and the cache subsystem will not help much for big values of n). So, depending on the target hardware, the use of multi-threading (on top of SIMD), etc., your loop will likely become memory-bandwidth bound in the end. Once you are memory-bandwidth bound, you could use techniques like loop blocking, non-temporal stores and aggressive prefetching. Each of these techniques is worth a separate article, although for prefetching/NT-stores the Intel compiler has some pragmas to play with.
If n is huge, and you are already prepared for memory bandwidth troubles, you could use things like #pragma omp parallel for simd, which will simultaneously thread-parallelize and vectorize the loop. However, the quality of this feature has become decent only in very recent compiler versions AFAIK, so maybe you'd prefer to split n semi-manually, i.e. n = n1 * n2 * n3, where n1 is the number of iterations to distribute among threads, n2 is for cache blocking and n3 is for vectorization. Rewrite the given loop as a nest of 3 loops, where the outer loop has n1 iterations (and #pragma omp parallel for is applied), the next-level loop has n2 iterations, and the innermost loop has n3 iterations (where #pragma omp simd is applied).
Some up to date links with syntax examples and more info:
unroll: https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling
OpenMP SIMD pragma (not so fresh and detailed, but still relevant): https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40
restrict vs. ivdep
NT-stores and prefetching : https://software.intel.com/sites/default/files/managed/22/a3/mtaap2013-prefetch-streaming-stores.pdf
Note1: I apologize for not providing code snippets here. There are at least 2 justifiable reasons for that: 1. My 5 bullets are applicable to very many kernels, not just to yours. 2. On the other hand, the specific combination of pragmas/manual rewriting techniques and the corresponding performance results will vary depending on the target platform, ISA and compiler version.
Note2: A last comment regarding your GPU question. Compare your loop with simple industry benchmarks like LINPACK or STREAM. In fact, your loop could end up being very similar to some of them. Now think of the characteristics of x86 CPUs, and especially of the Intel Xeon Phi platform, for LINPACK/STREAM. They are very good indeed and will become even better with High Bandwidth Memory platforms (like Xeon Phi 2nd gen). So theoretically there is no reason to think that your loop does not map well to at least some variants of x86 hardware (note that I am not saying the same for an arbitrary kernel).
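As a rough illustration of the restrict, #pragma omp simd and blocking suggestions above, here is a hedged sketch. It is a two-level simplification of the n1/n2/n3 split (threads over blocks, SIMD inside each block), not a tuned implementation. The function names and the block parameter are assumptions, __restrict is a compiler extension (GCC, Clang, MSVC and the Intel compiler accept it) rather than standard C++, and the code is only correct if a and b do not overlap.

```cpp
#include <algorithm>

// Innermost level: the vectorizable kernel.
// __restrict promises the compiler that a and b never alias.
void multiply_simd(int n, double* __restrict a, const double* __restrict b) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        a[i] *= b[i];
    }
}

// Outer level: contiguous blocks are distributed among threads.
// block is a tuning parameter (assumption), e.g. chosen to fit in cache.
void multiply_blocked(int n, double* a, const double* b, int block) {
    const int num_blocks = (n + block - 1) / block;  // round up
    #pragma omp parallel for
    for (int blk = 0; blk < num_blocks; ++blk) {
        const int begin = blk * block;
        const int len = std::min(block, n - begin);  // last block may be shorter
        multiply_simd(len, a + begin, b + begin);
    }
}
```

Whether this beats a plain #pragma omp parallel for simd depends on the compiler version and the target hardware, so it needs to be measured.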

Assuming the data pointed to by a can't overlap the data pointed to by b, the most important information to give the compiler so that it can optimize the code is that fact.
In older ICC versions, "restrict" was the only clean way to provide that key information to the compiler. In newer versions there are a few cleaner ways to give a much stronger guarantee than ivdep gives (in fact, ivdep is a weaker promise to the optimizer than it appears and generally doesn't have the intended effect).
But if n is large, the whole thing will be dominated by cache misses, so no local optimization can help.

Manual loop unrolling is a simple way to optimize your code; my code follows. On my PC the original loop takes 618.48 ms while loop2 takes 381.10 ms, compiled with GCC and the option '-O2'. I don't have Intel ICC to verify the code, but I think the optimization principles are the same.
Similarly, I did some experiments comparing the execution time of two programs that XOR two blocks of memory: one program is vectorized with the help of SIMD instructions, while the other is manually loop-unrolled. If you are interested, see here.
P.S. Of course, loop2 only works when n is even.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#define LEN 512*1024
#define times 1000
void loop(int n, double* a, double const* b){
int i;
for(i = 0; i < n; ++i, ++a, ++b)
*a *= *b;
}
void loop2(int n, double* a, double const* b){
int i;
/* Two elements per iteration; the braces keep both statements in the loop body. */
for(i = 0; i < n; i=i+2, a=a+2, b=b+2){
*a *= *b;
*(a+1) *= *(b+1);
}
}
int main(void){
double *la, *lb;
struct timeval begin, end;
int i;
la = (double *)malloc(LEN*sizeof(double));
lb = (double *)malloc(LEN*sizeof(double));
gettimeofday(&begin, NULL);
for(i = 0; i < times; ++i){
loop(LEN, la, lb);
}
gettimeofday(&end, NULL);
printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0\
+(end.tv_usec-begin.tv_usec)/1000.0);
gettimeofday(&begin, NULL);
for(i = 0; i < times; ++i){
loop2(LEN, la, lb);
}
gettimeofday(&end, NULL);
printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0\
+(end.tv_usec-begin.tv_usec)/1000.0);
free(la);
free(lb);
return 0;
}
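The P.S. above notes that loop2 only works when n is even. A hedged sketch of one way to lift that restriction is to handle the leftover element after the unrolled loop. The name loop_unrolled2 is made up for illustration, and indexing is used instead of pointer increments purely for readability.

```cpp
// Unrolled by two, with a tail step so that odd n is handled as well.
void loop_unrolled2(int n, double* a, double const* b) {
    int i = 0;
    for (; i + 1 < n; i += 2) {  // two elements per iteration
        a[i]     *= b[i];
        a[i + 1] *= b[i + 1];
    }
    if (i < n) {                 // leftover element when n is odd
        a[i] *= b[i];
    }
}
```

Modern compilers often unroll such loops on their own at higher optimization levels, so any gain from doing it by hand should be measured rather than assumed.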

I assume that n is large. You can distribute the workload over k CPUs by starting k threads and giving each of them n/k elements. Use big chunks of consecutive data for each thread; don't do fine-grained interleaving. Try to align the chunks with cache lines.
If you plan to scale to more than one NUMA node, consider explicitly copying each chunk of the workload to the node its thread runs on, and copying the results back. In this case it might not really help, because the work in each step is very simple. You'll have to run tests for that.

How to generate computation intensive code in C++ that will not be removed by compiler? [duplicate]

This question already has an answer here:
How to prevent optimization of busy-wait
(1 answer)
Closed 7 years ago.
I am doing some experiments on CPU performance. I wonder if anyone knows a formal way or a tool to generate simple code that runs for a period of time (several seconds) and consumes a significant amount of a CPU's computational resources.
I know there are a lot of CPU benchmarks, but their code is pretty complicated. What I want is a more straightforward program.
As the compiler is very smart, writing redundant code such as the following will not work.
for (int i = 0; i < 100; i++) {
int a = i * 200 + 100;
}

Put the benchmark code in a function in a separate translation unit from the code that calls it. This prevents the code from being inlined, which can lead to aggressive optimizations.
Use parameters for the fixed values (e.g., the number of iterations to run) and return the resulting value. This prevents the optimizer from doing too much constant folding and it keeps it from eliminating calculations for a variable that it determines you never use.
Building on the example from the question:
int TheTest(int iterations) {
int a;
for (int i = 0; i < iterations; i++) {
a = i * 200 + 100;
}
return a;
}
Even in this example, there's still a chance that the compiler might realize that only the last iteration matters and completely omit the loop and just return 200*(iterations - 1) + 100, but I wouldn't expect that to happen in many real-life cases. Examine the generated code to be certain.
Other ideas, like using volatile on certain variables, can inhibit some reasonable optimizations, which might make your benchmark perform worse than actual code.
There are also frameworks, like this one, for writing benchmarks like these.
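One hedged variation on the idea above is to give the loop a carried dependency, so that each iteration needs the result of the previous one and the final value is hard to derive without running the loop. The function name burn and the xorshift-style mixing steps are illustrative choices, not part of the original answer.

```cpp
#include <cstdint>

// Hypothetical benchmark kernel. Put it in its own translation unit and pass
// both arguments at run time, as the answer above suggests.
std::uint64_t burn(std::uint64_t iterations, std::uint64_t seed) {
    std::uint64_t x = seed;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        x ^= x << 13;  // each step depends on the previous value of x
        x ^= x >> 7;
        x ^= x << 17;
        x += i;        // mix in the counter so a zero seed does not stay at zero
    }
    return x;          // returning the result keeps the computation live
}
```

The caller should print or otherwise use the returned value. As with the original example, inspecting the generated code is the only way to be sure the loop survived.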

It's not necessarily your optimiser that removes the code. CPUs these days are very powerful, and you need to increase the challenge level. However, note that your original code is not a good general benchmark: you only use a very small subset of a CPU's instruction set. A good benchmark will try to challenge the CPU with different kinds of operations, to predict the performance in real-world scenarios. Very good benchmarks will even put load on various components of your computer, to test their interplay.
Therefore, just stick to a well-known published benchmark for your problem. There is a very good reason why they are more involved. However, if you really just want to benchmark your setup and code, then this time just go for higher counter values:
double j=10000;
for (double i = 0; i < j*j*j*j*j; i++)
{
}
This should work better for now. Note that there are just more iterations. Change j according to your needs.

sleep without system or IO calls

I need a sleep that does not issue any system or IO calls for a scenario with Hardware Transactional Memory (these calls would lead to an abort). Sleeping for 1 microsecond as in usleep(1) would be just fine.
This question suggests implementing nested loops to keep the program busy and delay it for some time. However, I want to be able to compile with optimization, which would delete these loops.
An idea could be to calculate some sophisticated math equation. Are there approaches to this? The actual time waited does not have to be precise - it should be vaguely the same for multiple runs however.

Try a nop loop with a volatile asm directive:
for (int i = 0; i < 1000; i++) {
asm volatile ("nop");
}
The volatile should prevent the optimizer from getting rid of it. If that doesn't do it, then try __volatile__.

The tricky part here is the timing. Querying any sort of timer may well count as an I/O function, depending on the OS.
But if you just want a delay loop, when timing isn't that important, you should look to platform-specific code. For example, there is an Intel-specific intrinsic called _mm_pause that translates to a CPU pause instruction, which basically halts the pipeline until the next memory bus sync comes through. It was designed to be put into a spinlock loop (no point in spinning and requerying an atomic variable until there is a possibility of new information), but it might (might - read the documentation) inhibit the compiler from removing your delay loop as empty.

You can use this code:
#include <time.h>
void delay(int n)
{
n *= CLOCKS_PER_SEC / 1000;
clock_t t1 = clock();
while (clock() <= t1 + n && clock() >= t1);
}
Sometimes (not very often) this function will cause less delay than specified due to clock counter overflow.
Update
Another option is to use loops like this with volatile counters.
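A minimal sketch of such a volatile-counter loop, since the linked example is not reproduced here. The function name spin_delay is an assumption, and the time per iteration is machine-dependent, so the iteration count has to be calibrated for the target system.

```cpp
// Busy-waits for `iterations` loop trips without any system or I/O calls.
unsigned long spin_delay(unsigned long iterations) {
    // volatile forces every read and write of the counter to really happen,
    // so the optimizer cannot delete the loop.
    volatile unsigned long counter = 0;
    while (counter < iterations) {
        counter = counter + 1;
    }
    return counter;
}
```

The delay is only roughly repeatable between runs, which matches the requirement stated in the question.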

Make g++ produce a program that can use multiple cores?

I have a C++ program with multiple for loops; each one runs about 5 million iterations. Is there any command I can use with g++ to make the resulting .exe use multiple cores, i.e. make the first for loop run on the first core and the second for loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases my CPU usage still only hovers at around 25%.
EDIT:
Here is my code, in case it helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
using namespace std;
#include <math.h>
int main()
{
float *bob = new float[50102133];
float *jim = new float[50102133];
float *joe = new float[50102133];
int i,j,k,l;
//cout << "Starting test...";
for (i=0;i<50102133;i++)
bob[i] = sin(i);
for (j=0;j<50102133;j++)
bob[j] = sin(j*j);
for (k=0;k<50102133;k++)
bob[k] = sin(sqrt(k));
for (l=0;l<50102133;l++)
bob[l] = cos(l*l);
cout << "finished test.";
cout << "the 100120 element is," << bob[1001200];
return 0;
}

The most obvious choice would be to use OpenMP. Assuming your loop is one where it's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma omp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
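The question also asks about running two separate loops at the same time. A hedged sketch of how OpenMP's sections construct can express that, assuming the loops write to different arrays (unlike the code in the question, where all four loops overwrite bob). The function name fill_both is invented for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Runs two independent loops concurrently, one per section.
void fill_both(std::vector<float>& sines, std::vector<float>& cosines) {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (std::size_t i = 0; i < sines.size(); ++i)
                sines[i] = std::sin(static_cast<float>(i));
        }
        #pragma omp section
        {
            for (std::size_t j = 0; j < cosines.size(); ++j)
                cosines[j] = std::cos(static_cast<float>(j));
        }
    }
}
```

With only two sections, at most two threads do useful work, so splitting the iterations of each loop with parallel for usually scales better on machines with more cores.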
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>
static const int size = 1024 * 1024 * 128;
int main(){
double total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
std::cout << total << "\n";
}

The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.

Use threads or processes; you may want to look into OpenMP.

C++11 got support for threading, but C++ compilers won't/can't do any threading on their own.

As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (a.k.a. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs, such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation-parallel within a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems, as the body of each loop very obviously has no dependencies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.

Algorithm: taking out every 4th item of an array

I have two huge arrays (int source[1000], dest[1000] in the code below, but having millions of elements in reality). The source array contains a series of ints of which I want to copy 3 out of every 4.
For example, if the source array is:
int source[1000] = {1,2,3,4,5,6,7,8....};
int dest[1000];
Here is my code:
for (int count_small = 0, count_large = 0; count_large < 1000; count_small += 3, count_large +=4)
{
dest[count_small] = source[count_large];
dest[count_small+1] = source[count_large+1];
dest[count_small+2] = source[count_large+2];
}
In the end, dest console output would be:
1 2 3 5 6 7 9 10 11...
But this algorithm is so slow! Is there an algorithm or an open source function that I can use / include?
Thank you :)
Edit: The actual length of my array would be about 1 million (640*480*3)
Edit 2: Processing this for loop takes about 0.98 to 2.28 seconds, while the other code only takes 0.08 to 0.14 seconds, so the device spends at least 90% of its CPU time on the loop alone.

Well, the asymptotic complexity there is as good as it's going to get. You might be able to achieve slightly better performance by loading in the values as four 4-way SIMD integers, shuffling them into three 4-way SIMD integers, and writing them back out, but even that's not likely to be hugely faster.
With that said, though, the time to process 1000 elements (Edit: or one million elements) is going to be utterly trivial. If you think this is the bottleneck in your program, you are incorrect.

Before you do much more, try profiling your application to determine whether this is the best place to spend your time. Then, if this is a hot spot, determine how fast it is, and how fast you need it to be or might be able to make it. Then test the alternatives; the overhead of threading or OpenMP might even slow it down (especially, as you have now noted, if you are on a single-core processor, in which case it won't help at all). For single threading, I would look to memcpy as per Sean's answer.
@Sneftel has also referenced other options below involving SIMD integers.
One option would be to try parallel processing the loop, and see if that helps. You could try using the OpenMP standard (see Wikipedia link here), but you would have to try it for your specific situation and see if it helped. I used this recently on an AI implementation and it helped us a lot.
#pragma omp parallel for
for (...)
{
... do work
}
Other than that, you are limited to the compiler's own optimisations.
You could also look at the recent threading support in C++11, though you might be better off using pre-implemented framework tools like parallel_for (available in the new Windows Concurrency Runtime through the PPL in Visual Studio, if that's what you're using) than rolling your own.
parallel_for(0, max_iterations,
[...] (int i)
{
... do stuff
}
);
Inside the for loop, you still have other options. You could try a for loop that iterates over every element and skips every fourth one, instead of doing 3 copies per iteration (just skip when (i+1) % 4 == 0), or do block memcpy operations for groups of 3 integers as per Sean's answer. You might get slightly different compiler optimisations for some of these, but it is unlikely (memcpy is probably as fast as you'll get).
for (int i = 0, j = 0; i < 1000; i++)
{
if ((i+1) % 4 != 0)
{
dest[j] = source[i];
j++;
}
}
You should then develop a test rig so you can quickly performance test and decide on the best one for you. Above all, decide how much time is worth spending on this before optimising elsewhere.
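If the OpenMP route is tried, the loop has to be written so that each iteration is independent. A hedged sketch: index by group of four instead of carrying two counters through the loop. The function name is hypothetical, dest is assumed to have room for 3 * (n / 4) ints, and an incomplete trailing group is ignored.

```cpp
#include <cstddef>

// Copies the first three ints of every complete group of four.
// Returns the number of ints written to dest.
std::size_t copy_three_of_four_parallel(const int* source, int* dest, std::size_t n) {
    const long long groups = static_cast<long long>(n / 4);
    #pragma omp parallel for
    for (long long g = 0; g < groups; ++g) {
        // Both indices are derived from g alone, so iterations do not depend
        // on each other and can run on different threads.
        dest[3 * g]     = source[4 * g];
        dest[3 * g + 1] = source[4 * g + 1];
        dest[3 * g + 2] = source[4 * g + 2];
    }
    return static_cast<std::size_t>(groups) * 3;
}
```

For about a million ints the thread start-up cost may well outweigh the copy itself, which is why the profiling advice above comes first.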

You could try memcpy instead of the individual assignments:
memcpy(&dest[count_small], &source[count_large], sizeof(int) * 3);
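For context, a sketch of that memcpy call inside a complete loop. The function name and the size_t counters are illustrative, and dest is assumed to be large enough for the result.

```cpp
#include <cstddef>
#include <cstring>

// Copies three out of every four ints from source to dest.
// Returns the number of ints written.
std::size_t copy_three_of_four(const int* source, int* dest, std::size_t n) {
    std::size_t out = 0;
    // Continue while at least three source elements remain in the current group.
    for (std::size_t in = 0; in + 3 <= n; in += 4) {
        std::memcpy(&dest[out], &source[in], sizeof(int) * 3);
        out += 3;
    }
    return out;
}
```

Compilers typically expand a fixed-size memcpy like this into a few plain loads and stores, so the generated code may end up very close to the three assignments in the question.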

Is your array size only 1000? If so, how is it slow? It should be done in no time!
As long as you are creating a new array in a single-threaded application, this is the only way AFAIK.
However, if the datasets are huge, you could try a multi-threaded application.
You could also explore using a bigger data type to hold the values, so that the array size decreases, if that is viable for your real-life application.

If you have an Nvidia card, you can consider using CUDA. If that's not the case, you can try other parallel programming methods/environments as well.
