Why isn't a basic loop accelerated by OpenMP? - C++

I have a basic loop:
int i, n=50000000;
for (i=0 ; i<n ; i++)
{
    register float val = rand()/(float)RAND_MAX;
}
that I want to accelerate with OpenMP. Beforehand I set:
omp_set_dynamic(0);
omp_set_num_threads(nths);
with nths=4
And the final loop is:
int i, n=50000000;
#pragma omp parallel for firstprivate(n) default(shared)
for (i=0 ; i<n ; i++)
{
    register float val = rand()/(float)RAND_MAX;
}
The non-parallelized loop takes 1.12 s to execute and the parallel one takes 21.04 s (it can vary a lot depending on my Linux process priority). I am on an x86 platform with Ubuntu and 4 CPUs with 1 thread each. I compile with g++ (which I need) with the -fopenmp flag, and I link against -lgomp.
Why doesn't OpenMP accelerate this basic loop?
EDIT:
Following the answers, I changed the inside of the loop to:
for (i=0 ; i<n ; i++)
{
    a[i]=i;
    b[i]=i;
    c[i]=a[i]+b[i];
}
with n=500000 and the pragma:
#pragma omp parallel for firstprivate(n) default(shared) schedule(dynamic) num_threads(4)
I also changed the code to use only gcc and I have the same problem:
With 1 Thread
Test ms = 0.003000
Test Omp ms = 19.695000
With 4 Threads
Test ms = 0.003000
Test Omp ms = 240.990000
EDIT2:
I changed the way I was measuring time when using OpenMP. Instead of the clock() function I used omp_get_wtime(), and the results are way better.

I ran your code quickly through my system.
First of all, in the array-addition case, 50M iterations is barely enough to show the win, but it does show one - if OpenMP is set up correctly.
In your case, schedule(dynamic) is killing you - it tells the runtime to hand work out to the thread team on demand at run time. That would make sense if you could not predetermine your workload, but here it is perfectly predictable, since the effort per iteration is exactly the same.
I get the following results after editing your example (see below) and running it on a hyperthreaded CPU with all cores fixed at their lowest frequency. I compiled using gcc 4.9.3:
time ./testseq && time ./testpar
real 0m0.576s
user 0m0.504s
sys 0m0.072s
real 0m0.285s
user 0m0.968s
sys 0m0.123s
As you can see, the real value, which is the "wallclock time", roughly halves. The user time increases, because of thread startup and shutdown.
The parallelized results change considerably if I add the schedule(dynamic) clause:
real 0m4.181s
user 0m14.886s
sys 0m1.283s
All of the extra work comes from threads finishing a small chunk of work and then looking for the next batch. That requires taking a lock - and that lock is what kills your second example. Please only use schedule(dynamic) when you have a load-balancing problem, i.e. when the amount of work per iteration varies wildly.
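To make the comparison concrete, here is a minimal sketch (not the poster's code) that times the same regular loop under both schedules; with uniform work per iteration, static chunking avoids the per-chunk handout that dynamic scheduling pays for:
#include <omp.h>
#include <cstdio>

constexpr int n = 50*1000*1000;
float a[n], b[n], c[n];

int main() {
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)   // contiguous blocks decided up front
    for (int i = 0; i < n; i++) {
        a[i] = i; b[i] = i; c[i] = a[i] + b[i];
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel for schedule(dynamic)  // chunks handed out one at a time at run time
    for (int i = 0; i < n; i++) {
        a[i] = i; b[i] = i; c[i] = a[i] + b[i];
    }
    double t2 = omp_get_wtime();

    std::printf("static: %.3f s, dynamic: %.3f s (c[1]=%.1f)\n",
                t1 - t0, t2 - t1, c[1]);
}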
To give full disclosure, I ran with the following Makefile:
CXXFLAGS=-std=c++11 -I. -Wall -Wextra -g -pthread

all: testseq testpar

testpar: test.cpp
	${CXX} -o $@ $^ -fopenmp ${CXXFLAGS}

testseq: test.cpp
	${CXX} -o $@ $^ ${CXXFLAGS}

clean:
	rm -f *.o *~ testseq testpar
and test.cpp:
#include <omp.h>

constexpr int n = 50*1000*1000;

float a[n];
float b[n];
float c[n];

int main(void) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
        c[i] = a[i] + b[i];
    }
}
Note that I also took away the other clauses from your parallel for - you need none of them.

rand() is not a thread-safe function. There is a PRNG inside it that has state and therefore cannot be invoked by multiple threads without synchronization. Use a different PRNG (C++11 has a bunch of them, Boost as well), use one generator per thread, and don't forget to seed them with different values (beware of time(NULL)).
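A minimal sketch of that suggestion (assuming C++11 and OpenMP; one std::mt19937 per thread, each seeded differently, and the values consumed so the loop can't be optimized away):
#include <omp.h>
#include <random>
#include <cstdio>

int main() {
    const int n = 50000000;
    double sum = 0.0;   // consume the values so the compiler keeps the loop

    #pragma omp parallel reduction(+:sum)
    {
        // One engine per thread; mix the thread id into the seed so the streams differ.
        std::mt19937 gen(std::random_device{}() + omp_get_thread_num());
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);

        #pragma omp for
        for (int i = 0; i < n; i++)
            sum += dist(gen);
    }

    std::printf("sum = %f\n", sum);
}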
UPDATE
An n of 500k may be too small to get any speedup given the overhead of thread creation. Could you test with, e.g., n set to 500M? Moreover, the times without OpenMP are suspiciously low (maybe not for such a small n). What are you doing with the a, b, and c arrays after the loop? If nothing, the compiler may optimize the whole loop away in the sequential code. Do something with these arrays, e.g., print their sum (outside the measured section).

C++ with OpenMP try to avoid the false sharing for tight looped array

I'm trying to introduce OpenMP into my C++ code to improve performance, using the simple case shown below:
#include <omp.h>
#include <chrono>
#include <iostream>
#include <cmath>

using std::cout;
using std::endl;

#define NUM 100000

int main()
{
    double data[NUM] __attribute__ ((aligned (128)));

#ifdef _OPENMP
    auto t1 = omp_get_wtime();
#else
    auto t1 = std::chrono::steady_clock::now();
#endif

    for(long int k=0; k<100000; ++k)
    {
        #pragma omp parallel for schedule(static, 16) num_threads(4)
        for(long int i=0; i<NUM; ++i)
        {
            data[i] = cos(sin(i*i + k*k));
        }
    }

#ifdef _OPENMP
    auto t2 = omp_get_wtime();
    auto duration = t2 - t1;
    cout<<"OpenMP Elapsed time (second): "<<duration<<endl;
#else
    auto t2 = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    cout<<"No OpenMP Elapsed time (second): "<<duration/1e6<<endl;
#endif

    double tempsum = 0.;
    for(long int i=0; i<NUM; ++i)
    {
        int nextind = (i == 0 ? 0 : i-1);
        tempsum += i + sin(data[i]) + cos(data[nextind]);
    }
    cout<<"Raw data sum: "<<tempsum<<endl;

    return 0;
}
The inner loop accesses a tight array (size NUM = 100000) and changes its elements, either in parallel or in a non-parallel way.
Build as
g++ -o test test.cpp
or
g++ -o test test.cpp -fopenmp
The program reported results as:
No OpenMP Elapsed time (second): 427.44
Raw data sum: 5.00009e+09
OpenMP Elapsed time (second): 113.017
Raw data sum: 5.00009e+09
Intel 10th CPU, Ubuntu 18.04, GCC 7.5, OpenMP 4.5.
I suspect that false sharing within cache lines leads to the bad performance of the OpenMP version of the code.
I updated the test results after increasing the loop size; the OpenMP version now runs faster as expected.
Thank you!
Since you're writing C++, use the C++ random number generators, which can be used thread-safely, unlike the legacy C one you're using.
Also, you're not using your data array, so the compiler is actually at liberty to remove your loop completely.
You should touch all your data once before the timed loop. That way you ensure that pages are instantiated and that data is in (or out of) cache, as appropriate.
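For example, a minimal sketch (reusing the question's data/NUM names; everything else is assumed) with an untimed warm-up pass before the measured loop:
#include <omp.h>
#include <cstdio>

#define NUM 100000

double data[NUM];

int main() {
    // Untimed warm-up pass: instantiate the pages and pull the array into cache,
    // with first-touch placement matching the parallel access pattern below.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < NUM; ++i)
        data[i] = 0.0;

    double t1 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < NUM; ++i)
        data[i] = (double)i;
    double t2 = omp_get_wtime();

    std::printf("timed loop: %g s, data[1]=%g\n", t2 - t1, data[1]);
}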
Your loop is pretty short.
rand() is not thread-safe (see here). Use an array of C++ random-number generators instead, one for each thread; see std::uniform_int_distribution for details.
You can drop the #ifdef _OPENMP variations in your code. In a Bash terminal, you can run your application as OMP_NUM_THREADS=1 ./test. See here for details.
So you can remove num_threads(4) as well, because you can specify the amount of parallelism explicitly that way.
Use Google Benchmark or command-line parameters so you can parameterize the number of threads and the array size (see the sketch at the end of this answer).
From here, I expect you will see:
The performance when you run OMP_NUM_THREADS=1 ./test is close to your non-OpenMP version.
The array of C++ RNG generators is faster than calling rand() from multiple threads.
The multi-threaded version is still slower than the single-threaded version when using a 10,000 element array.
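Here is the sketch mentioned above (the names and defaults are mine, not from the question): the thread count and array size come from the command line, so one binary covers all configurations:
#include <omp.h>
#include <cmath>
#include <cstdlib>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    const int  nthreads = argc > 1 ? std::atoi(argv[1]) : 1;
    const long n        = argc > 2 ? std::atol(argv[2]) : 100000;
    omp_set_num_threads(nthreads);

    std::vector<double> data(n);
    const double t1 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        data[i] = std::cos(std::sin((double)i * (double)i));
    const double t2 = omp_get_wtime();

    std::printf("%d thread(s), n=%ld: %g s (data[n-1]=%g)\n",
                nthreads, n, t2 - t1, data[n - 1]);
    return 0;
}
Run it as, e.g., ./bench 1 1000000 and then ./bench 4 1000000 to compare.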

warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored

I'm currently trying to use OpenMP for parallel computing.
I've written the following basic code.
However it returns the following warning:
warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored.
Changing the number of threads does not change the running time, since the OpenMP directives are ignored for a reason that is unclear to me.
Can someone help me out?
#include <stdio.h>
#include <omp.h>
#include <math.h>
#include <time.h>

int main(void)
{
    double ts;
    double something;
    clock_t begin = clock();

    #pragma omp parallel num_threads(4)
    #pragma omp parallel for
    for (int i = 0; i < pow(10,7); i++)
    {
        something = sqrt(123456);
    }

    clock_t end = clock();
    ts = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Time elapsed is %f seconds", ts);
}
In order to get OpenMP support you need to explicitly tell your compiler.
g++, gcc and clang need the option -fopenmp
MSVC needs the option /openmp (more info here if you use Visual Studio)
Aside from the obvious need to compile with the -fopenmp flag, your code has some problems worth pointing out, namely:
To measure time, use omp_get_wtime() instead of clock(); clock() gives you the number of clock ticks accumulated across all threads, not wall-clock time (a minimal timing sketch appears at the end of this answer).
The other problem is:
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i < pow(10,7); i++)
{
    something = sqrt(123456);
}
the iterations of the loop are not being distributed among threads as you wanted. Because you added the parallel clause again in #pragma omp parallel for, and assuming that you have nested parallelism disabled (which it is by default), each of the threads created in the outer parallel region will execute the code within that region "sequentially". Consequently, taking for simplicity n = 6 (instead of pow(10,7)) and 4 threads, you would have the following block of code:
for (int i=0; i<n; i++) {
    something = sqrt(123456);
}
being executed 6 x 4 = 24 times (i.e., the total number of loop iterations multiplied by the total number of threads). For a more in-depth explanation, check this SO thread about a similar issue.
To fix this adapt your code to the following:
#pragma omp parallel for num_threads(4)
for (int i = 0; i < pow(10,7); i++)
{
    something = sqrt(123456);
}
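And for the timing point above, here is a minimal sketch (not the asker's exact code) that uses omp_get_wtime() for wall-clock time, with a reduction so the accumulation is not a data race:
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(void)
{
    double something = 0.0;
    double begin = omp_get_wtime();            /* wall-clock start */

    #pragma omp parallel for num_threads(4) reduction(+:something)
    for (int i = 0; i < 10000000; i++)
        something += sqrt(123456);

    double end = omp_get_wtime();              /* wall-clock end */
    printf("Time elapsed is %f seconds (result %f)\n", end - begin, something);
    return 0;
}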

OpenMP excessive synchronization

I am trying to add OpenMP parallelization to quite a big project, and I found that OpenMP does too much synchronization outside the parallel blocks.
This synchronization is done for all of the variables, even those not used in the parallel block, and it is done continuously, not only before entering the block.
I made an example proving this:
#include <cmath>

int main()
{
    double dummy1 = 1.234;
    int const size = 1000000;
    int const size1 = 2500;
    int const size2 = 500;

    for(unsigned int i=0; i<size; ++i){
        //for (unsigned int j=0; j<size1; j++){
        //    dummy1 = pow(dummy1/2 + 1, 1.5);
        //}

        #pragma omp parallel for
        for (unsigned int j=0; j<size2; j++){
            double dummy2 = 2.345;
            dummy2 = pow(dummy2/2 + 1, 1.5);
        }
    }
}
If I run this code (with the first inner for cycle commented out), the runtimes are 6.75 s with parallelization and 30.6 s without. Great.
But if I uncomment that for cycle and run it again, the excessive synchronization kicks in and I get 67.9 s with parallelization and 73 s without. If I increase size1, I even get slower results with parallelization than without it.
Is there a way to disable this synchronization and force it only before the second for cycle? Or any other way to improve the speed?
Note that neither the outer loop nor the first inner loop is parallelizable in the real example. The outer one is in fact an ODE solver, and the first inner one updates a load of internal values.
I am using gcc (SUSE Linux) 4.8.5
Thanks for your answers.
In the end, the solution to my problem was to set the number of threads equal to the number of processor cores; it seems that hyper-threading was causing the problems. So, using (my processor has 4 physical cores)
#pragma omp parallel for num_threads(4)
I get times of 8.7 s without the first for loop and 51.9 s with it. There is still about 1.2 s of overhead, but that is acceptable. Using the default (8 threads)
#pragma omp parallel for
the times are 6.65 s and 68 s. Here the overhead is about 19 s.
So hyper-threading helps when no other code is present, but when there is, it might not always be a good idea to use it.
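For reference, a small sketch (mine, not part of the original code) showing that the same cap can also be set once for the whole program, either with OMP_NUM_THREADS=4 in the environment or programmatically:
#include <omp.h>

int main()
{
    omp_set_num_threads(4);   // applies to all subsequent parallel regions
                              // (4 = number of physical cores on this machine)

    #pragma omp parallel for  // no num_threads clause needed any more
    for (int j = 0; j < 500; j++) {
        /* work */
    }
}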

How should I interpret these VTune results?

I'm trying to parallelize this code using OpenMP. OpenCV (built with IPP for best efficiency) is used as an external library.
I'm having problems with uneven CPU usage in the parallel fors, yet it seems that there is no load imbalance. As you will see, this could be because of KMP_BLOCKTIME=0, but that setting may be necessary because of the external libraries (IPP, TBB, OpenMP, OpenCV). In the rest of the question you will find more details and data that you can download.
These are the Google Drive links to my VTune results:
c755823 basic KMP_BLOCKTIME=0 30 runs : basic hotspot with environment variable KMP_BLOCKTIME set to 0 on 30 runs of the same input
c755823 basic 30 runs : same as above, but with default KMP_BLOCKTIME=200
c755823 advanced KMP_BLOCKTIME=0 30 runs : same as first, but advanced hotspot
For those who are interested, I can send you the original code somehow.
On my Intel i7-4700MQ, the actual wall-clock time of the application, averaged over 10 runs, is around 0.73 seconds. I compile the code with icpc 2017 update 3 and the following compiler flags:
INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl
In addition, I set KMP_BLOCKTIME=0 because the default value (200) was generating a huge overhead.
We can divide the code into 3 parallel regions (wrapped in a single #pragma omp parallel for efficiency) plus a preceding serial part, which is around 25% of the algorithm (and cannot be parallelized).
I'll try to describe them (or you can skip to the code structure directly):
We create one parallel region in order to avoid the overhead of creating a new parallel region each time. The final result is to populate the rows of a matrix object, cv::Mat descriptors. We have the following shared objects: (a) blurs, a chain of blurs (not parallelizable) computed with GaussianBlur from OpenCV (which uses the IPP implementation of Gaussian blurs); (b) hessResps (size known, say 32); (c) findAffineShapeArgs (size unknown, but on the order of thousands of elements, say 2.3k); (d) cv::Mat descriptors (size unknown, the final result). In the serial part, we populate blurs, which is a read-only vector.
In the first parallel region, hessResps is populated using blurs, without any synchronization mechanism.
In the second parallel region, findLevelKeypoints is called with hessResps as read-only input. Since the size of findAffineShapeArgs is unknown, we need a local vector localfindAffineShapeArgs, which will be appended to findAffineShapeArgs in the next step.
Since findAffineShapeArgs is shared and its size is unknown, we need a critical section where each localfindAffineShapeArgs is appended to it.
In the third parallel region, each FindAffineShapeArgs is used to generate rows of the final cv::Mat descriptors. Again, since descriptors is shared, we need a local version, cv::Mat localDescriptors.
A final critical section push_backs each localDescriptors into descriptors. Notice that this is extremely fast, since cv::Mat is "kind of" a smart pointer, so we push_back pointers.
This is the code structure:
cv::Mat descriptors;
std::vector<Mat> blurs(blursSize);
std::vector<Mat> hessResps(32);
std::vector<FindAffineShapeArgs> findAffineShapeArgs; // we don't know its size in advance
#pragma omp parallel
{
    // compute all the Hessian responses
    #pragma omp for collapse(2) schedule(dynamic)
    for (int i = 0; i < levels; i++)
        for (int j = 1; j <= scaleCycles; j++)
        {
            hessResps[/**/] = hessianResponse(/*...*/);
        }

    std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
    #pragma omp for collapse(2) schedule(dynamic) nowait
    for (int i = 0; i < levels; i++)
        for (int j = 2; j < scaleCycles; j++)
        {
            findLevelKeypoints(localfindAffineShapeArgs, hessResps[/*...*/], /*...*/); // populate localfindAffineShapeArgs with push_back
        }

    #pragma omp critical
    {
        findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
    }

    #pragma omp barrier

    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < findAffineShapeArgs.size(); i++)
    {
        findAffineShape(findAffineShapeArgs[i]);
    }

    #pragma omp critical
    {
        for (size_t i = 0; i < localRes.size(); i++)
            descriptors.push_back(localRes[i].descriptor);
    }
}
At the end of the question, you can find FindAffineShapeArgs.
I'm using Intel Amplifier to see hotspots and evaluate my application.
The OpenMP Potential Gain analysis says that the potential gain with perfect load balancing would be 5.8%, so we can say that the workload is balanced across the different CPUs.
This is the CPU usage histogram for the OpenMP region (remember that this is the result of 10 consecutive runs):
So as you can see, the Average CPU Usage is 7 cores, which is good.
This OpenMP Region Duration Histogram shows that in these 10 runs the parallel region is executed always with the same time (with a spread around 4 milliseconds):
This is the Caller/Callee tab:
For your information:
interpolate is called in the last parallel region
l9_ownFilter* functions are all called in the last parallel region
samplePatch is called in the last parallel region.
hessianResponse is called in the second parallel region
Now, my first question is: how should I interpret the data above? As you can see, in many of the functions half of the time the "Effective Time by Utilization" is "Ok", which would probably become "Poor" with more cores (for example on a KNL machine, where I'll test the application next).
Finally, this is the Wait and Lock analysis result:
Now, this is the first weird thing: the Join Barrier at line 276 (which corresponds to the most expensive wait object) is the #pragma omp parallel, i.e. the beginning of the parallel region. So it seems that someone spawned threads before. Am I wrong? In addition, the wait time is longer than the program itself (0.827 s vs the 1.253 s of the Join Barrier I'm talking about)! But maybe that refers to the waiting summed over all threads (and not wall-clock time, which would clearly be impossible since it's longer than the program itself).
Then, the Explicit Barrier at line 312 is the #pragma omp barrier of the code above, and its duration is 0.183 s.
Looking at the Caller/Callee tab:
As you can see, most of the wait time is Poor, so it refers to just one thread. But I'm not sure I'm understanding this correctly. My second question is: can we interpret this as "all the threads are waiting for just one thread that is lagging behind"?
FindAffineShapeArgs definition:
struct FindAffineShapeArgs
{
    FindAffineShapeArgs(float x, float y, float s, float pixelDistance, float type, float response, const Wrapper &wrapper) :
        x(x), y(y), s(s), pixelDistance(pixelDistance), type(type), response(response), wrapper(std::cref(wrapper)) {}

    float x, y, s;
    float pixelDistance, type, response;
    std::reference_wrapper<Wrapper const> wrapper;
};
The Top 5 Parallel Regions by Potential Gain in the summary view shows only one region (the only one)
Looking at the "/OpenMP Region/OpenMP Barrier-to-Barrier" grouping, this is the order of the most expensive loops:
The third loop:
#pragma omp for schedule(dynamic) nowait
for (int i = 0; i < findAffineShapeArgs.size(); i++)
is the most expensive one (as I already knew), and here's a screenshot of the expanded view:
As you can see, many functions are from OpenCV, which exploits IPP and is (or should be) already optimized. Expanding the two other functions (interpolate and samplePatch) shows [No call stack information]. The same holds for all the other functions (in the other regions too).
The 2nd most expensive region is the second parallel for:
#pragma omp for collapse(2) schedule(dynamic) nowait
for (int i = 0; i < levels; i++)
    for (int j = 2; j < scaleCycles; j++)
    {
        findLevelKeypoints(localfindAffineShapeArgs, hessResps[/*...*/], /*...*/); // populate localfindAffineShapeArgs with push_back
    }
Here's the expanded view:
And finally, the 3rd most expensive is the first loop:
#pragma omp for collapse(2) schedule(dynamic)
for (int i = 0; i < levels; i++)
    for (int j = 1; j <= scaleCycles; j++)
    {
        hessResps[/**/] = hessianResponse(/*...*/);
    }
Here's the expanded view:
If you want to know more, please use my attached VTune files or just ask!
Try reading the info in this link, especially the part about "Nested OpenMP", since Intel IPP already uses OpenMP in its implementation. From my experience with Intel IPP and OpenMP, if you do some other type of multi-threading and each of the created threads then hits the OpenMP calls, performance is really bad. Also, you can try using #pragma omp parallel for for each of the parallel regions instead of #pragma omp for, and getting rid of the outer #pragma omp parallel, as sketched below.
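A minimal, self-contained sketch of that restructuring (illustrative only, not the question's actual code): each loop gets its own #pragma omp parallel for instead of several #pragma omp for loops living inside one enclosing #pragma omp parallel.
#include <vector>
#include <cmath>
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    std::vector<float> stage1(n), stage2(n);

    // First stage: its own parallel region (fork/join happens here).
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        stage1[i] = std::sqrt((float)i);

    // Second stage: another independent parallel region. Code between the two
    // regions runs serially, so a threaded library (e.g. IPP) called there is
    // not entered from inside an already active OpenMP team.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        stage2[i] = stage1[i] * 2.0f;

    std::printf("%f\n", stage2[n - 1]);   // keep the results live
}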

Strange slowdown when using openmp

I am trying to increase the performance of a rather complex iterative algorithm by parallelizing the matrix multiplication that is called on each iteration.
The algorithm takes 500 iterations and approximately 10 seconds, but after parallelizing the matrix multiplication it slows down to 13 seconds.
However, when I tested matrix multiplication of the same dimension alone, there was an increase in speed. (I am talking about 100x100 matrices.)
Finally, I switched off all parallelization inside the algorithm and added, on each iteration, the following piece of code, which does absolutely nothing and presumably shouldn't take long:
int j;
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
    j = i;
And again, there is a 30% slowdown compared to the same algorithm without this piece of code.
Thus, calling any OpenMP parallelization 500 times inside the main algorithm somehow slows things down. This behavior looks very strange to me; does anybody have any clue what the problem is?
The main algorithm is called from a desktop application, compiled with VS2010, Win32 Release.
I work on an Intel Core i3 (parallelization creates 4 threads), 64-bit Windows 7.
Here is a structure of a program:
int internal_method(..)
{
    ...//no openmp here

    // the following code does nothing, has nothing to do with the rest of the program and shouldn't take long,
    // but somehow adding this code caused a 3 sec slowdown of Huge_algorithm()
    double sum;
    #pragma omp parallel for private(sum)
    for (int i = 0; i < 10; i++)
        sum = i*i*i / (1.0 + i*i*i*i);

    ...//no openmp here
}

int Huge_algorithm(..)
{
    ...//no openmp here
    for (int i = 0; i < 500; i++)
    {
        .....// no openmp
        internal_method(..);
        ......//no openmp
    }
    ...//no openmp here
}
So, the final point is:
calling the parallel piece of code 500 times on its own (when the rest of the algorithm is omitted) takes less than 0.01 sec, but when you call it 500 times inside a huge algorithm it causes a 3 sec delay for the entire algorithm.
And what I don't understand is how the small parallel part affects the rest of the algorithm.
For 10 iterations and a simple assignment, I guess there is too much OpenMP overhead compared to the computation itself. What looks lightweight here is actually managing and synchronizing multiple threads which may not even come from a thread pool. There might be some locking involved, and I don't know how good MSVC is at estimating whether to parallelize at all.
Try with bigger loop bodies or a bigger number of iterations (say 1024*1024 iterations, just for starters).
Example OpenMP Magick:
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
This might be approximately expanded by a compiler to:
const unsigned __cpu_count = __get_cpu_count();
unsigned *__j = alloca (sizeof (unsigned) * __cpu_count);
__thread *__threads = alloca (sizeof (__thread) * __cpu_count);
for (unsigned u=0; u!=__cpu_count; ++u) {
    __init_thread (__threads+u);
    __run_thread ([u]{for (int i=u; i<10; i+=__cpu_count)
                          __j[u] = i;}); // assume lambdas
}
for (unsigned u=0; u!=__cpu_count; ++u)
    __join (__threads+u);
with __init_thread(), __run_thread() and __join() being non-trivial function that invoke certain system calls.
In case thread-pools are used, you would replace the first alloca() by something like __pick_from_pool() or so.
(note that the names and emitted code here are all imaginary; an actual implementation will look different)
Regarding your updated question:
You seem to be parallelizing at the wrong granularity. Put as much workload as possible into each thread, so instead of
for (...) {
    #pragma omp parallel ...
    for (...) {}
}
try
#pragma omp parallel ...
for (...) {
    for (...) {}
}
Rule of thumb: keep the workload per thread big enough to reduce the relative overhead.
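A compilable version of that schematic (a sketch with made-up names and sizes, not the asker's algorithm), with one parallel region over the outer loop instead of a region opened inside it on every iteration:
#include <cmath>
#include <cstdio>

int main()
{
    const int outer = 500;
    const int inner = 1000;
    static double acc[500] = {0.0};

    #pragma omp parallel for              // a single fork/join for all the work
    for (int i = 0; i < outer; i++)
        for (int j = 0; j < inner; j++)
            acc[i] += std::sqrt((double)(i + j));

    std::printf("%f\n", acc[0]);          // keep the result live
}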
Maybe just j = i is not enough work to make good use of the CPU cores; try a more demanding calculation (for example, taking i*i*i*i*i*i and dividing it by i+i+i).
Are you running this on a multi-core CPU or a GPU?