openmp with nested loops and function call - c++

I have a few nested loops, and I put the outermost one in parallel mode. apar and mpar are structs whose values are modified in the loop; the function breakLogic is then called, which generates a struct that I store in a pre-created vector of those structs.
one, two ... have been declared earlier in the function.
I have tried to include ordered and critical to ensure accuracy, but I am still getting incorrect results.
#pragma omp parallel for ordered private(appFlip, atur, apar, mpar, i, j, k, l, m, n) shared(rawFlip)
for(i=0; i<oneL; i++)
{
    // initialize mpar
    #pragma omp critical
    apar.one = one[i];
    for(j=0; j<twoL; j++)
    {
        apar.two = two[j];
        for(k=0; k<threeL; k++)
        {
            apar.three = floor(three[k]*apar.two);
            appFlip = applyParamSin(rawFlip, apar);
            for(l=0; l<fourL; l++)
            {
                mpar.four = four[l];
                for(m=0; m<fiveL; m++)
                {
                    mpar.five = five[m];
                    for(n=0; n<sixL; n++)
                    {
                        mpar.six = add[n];
                        atur = breakLogic(appFlip, mpar, dt);
                        #pragma omp ordered
                        {
                            sinResVec[itr] = atur;
                            itr++;
                        }
                    }
                }
            }
            r0(appFlip);
        }
    }
}
Or is this code not conducive to parallelism? Are there any tools for g++ that can profile code for parallel processing and indicate potential issues?
This modified code works but gives no performance improvement.

Your original code can be parallelized with a few modifications:
1. Make apar and mpar firstprivate. They should be thread-local variables, initialized on entry to the parallel for region.
2. Remove all ordered and critical clauses, including the one in the parallel for directive; they do not work the way you expect.
3. Calculate itr from i, j, k, l, m, n to remove the dependency on the shared counter:
itr = ((((i*twoL + j)*threeL + k)*fourL + l)*fiveL + m)*sixL + n;
sinResVec[itr] = atur;
Putting these together gives the sketch below.
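As a rough, untested sketch (assuming the declarations from the question: the arrays one..add, the bounds oneL..sixL, applyParamSin, breakLogic, r0, dt, and that apar and mpar are fully initialized before the pragma), the restructured loop could look like this:
#pragma omp parallel for firstprivate(apar, mpar) private(appFlip, atur)
for (int i = 0; i < oneL; i++)
{
    apar.one = one[i];
    for (int j = 0; j < twoL; j++)
    {
        apar.two = two[j];
        for (int k = 0; k < threeL; k++)
        {
            apar.three = floor(three[k] * apar.two);
            appFlip = applyParamSin(rawFlip, apar);
            for (int l = 0; l < fourL; l++)
            {
                mpar.four = four[l];
                for (int m = 0; m < fiveL; m++)
                {
                    mpar.five = five[m];
                    for (int n = 0; n < sixL; n++)
                    {
                        mpar.six = add[n];
                        atur = breakLogic(appFlip, mpar, dt);
                        // each (i,j,k,l,m,n) maps to a unique slot,
                        // so no ordered/critical section is needed
                        int itr = ((((i*twoL + j)*threeL + k)*fourL + l)*fiveL + m)*sixL + n;
                        sinResVec[itr] = atur;
                    }
                }
            }
            r0(appFlip);
        }
    }
}
Every thread now writes to its own disjoint set of indices, so the ordered clause, which effectively serialized the loop, can be dropped.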
Update:
See here for more details of OpenMP, especially the differences between private and firstprivate.
http://msdn.microsoft.com/en-us/library/tt15eb9t.aspx

Related

Using consistent RNG with OpenMP

So I have a function, let's call it dostuff(), for which it is sometimes beneficial for my application to parallelize within it, and sometimes to call it multiple times and parallelize the whole thing. The function itself does not change between the two use cases.
Note: the object is large enough that it cannot viably be stored in a list, so it must be discarded with each iteration.
So, let's say our code looks like this:
bool parallelize_within = argv[1];
if (parallelize_within) {
    // here we assume parallelization is handled within the function dostuff()
    for (int i = 0; i < 100; ++i) {
        object randomized = function_that_accesses_rand();
        dostuff(i, randomized, parallelize_within);
    }
} else {
    #pragma omp parallel for
    for (int i = 0; i < 100; ++i) {
        object randomized = function_that_accesses_rand();
        dostuff(i, randomized, parallelize_within);
    }
}
Obviously, we run into the issue that dostuff() will have threads access the random object at different times in different iterations of the same program. This is not the case when parallelize_within == true, but when we run dostuff() in parallel individually per thread, is there a way to guarantee that the random object is accessed in order based on the iteration? I know that I could do:
#pragma omp parallel for schedule(dynamic)
which will guarantee that eventually, as iterations are assigned to threads dynamically at runtime, the objects will access rand in order of the iteration number, but for the first set of iterations it will be totally random. Any suggestions on how to avoid this?
First of all, you have to make sure that both function_that_accesses_rand and dostuff are thread-safe.
You do not have to duplicate your code if you use the if clause:
#pragma omp parallel for if(!parallelize_within)
To make sure that, in the call dostuff(i, randomized, ...), i reflects the order of creation of the randomized object, you have to do something like this:
int j = 0;
#pragma omp parallel for if(!parallelize_within)
for (int i = 0; i < 100; ++i) {
    int k;
    object randomized;
    #pragma omp critical
    {
        k = j++;
        randomized = function_that_accesses_rand();
    }
    dostuff(k, randomized, parallelize_within);
}
You may eliminate the critical section if your function_that_accesses_rand makes it possible, but I cannot be more specific without knowing your function. One solution is to have this function return a value representing the order. Do not forget that this function has to be thread-safe!
#pragma omp parallel for if(!parallelize_within)
for (int i = 0; i < 100; ++i) {
    int k;
    object randomized = function_that_accesses_rand(k);
    dostuff(k, randomized, parallelize_within);
}

... function_that_accesses_rand(int& k){
    ...
    #pragma omp atomic capture
    k = some_internal_counter++;
    ...
}
You could pre-generate the random objects and store them in a list, then index that list with the loop variable inside the omp loop.
// pre-generate the random objects serially, in iteration order
std::vector<object> rand_obj(100);
for (int i = 0; i < 100; ++i)
    rand_obj[i] = function_that_accesses_rand();

// consume them in parallel; rand_obj[i] matches iteration i
#pragma omp parallel for
for (int i = 0; i < 100; ++i)
    dostuff(i, rand_obj[i], parallelize_within);

OpenMP: Is array reduction always needed for updating an array in parallel?

I am quite new to OpenMP. I have the following simple loop that I want to run in parallel with OpenMP:
double rij[3];
double r;
#ifdef _OPENMP
#pragma omp parallel for private(rij,r)
#endif
for (int i=0; i<n; ++i)
{
    for (int j=0; j<n; ++j)
    {
        if (i != j)
        {
            distance(X,rij,r,i,j);
            V[i] += ke * Q[j] / r;
            for (int k=0; k<3; ++k)
            {
                F[3*i+k] += ke * Q[j] * rij[k] / pow(r,3);
            }
        }
    }
}
From what I understood, variables are shared by default, which is why I only declared private(rij,r). But according to these questions (first, second, third), I should do an array reduction in this case.
It's clear to me that if many threads need to sum into the same variable, this has to be done with #pragma omp parallel for reduction(+:A[:n]) for summing into an array A of size n. This is what I do in another part of my code, and it works as expected.
However, in this case the workers never sum into the same variable: every worker performs its sum for its own index i. Is it correct to do as I do here, i.e. not using any array reduction and not using any critical section?
If my implementation is correct, I believe it would avoid the overhead of the critical section while being simpler code. Feel free to give your advice on how this could be better optimized.
Thank you
You don't need a reduction here. Reduction is a convenience feature for a recurring problem, so you don't have to write the same boilerplate over and over (try to think of how you would implement a sum reduction without OpenMP).
What you do right now operates on disjoint data: each iteration of the parallelized loop writes only its own V[i], so the writes never overlap (as you state in the question), because the work is divided by i itself. The writes to F[...] don't overlap either, because the index 3*i+k depends only on i and k.
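For illustration only (a self-contained toy example, not taken from the question's code; the loop body is just a stand-in term): a hand-rolled sum reduction, which is roughly what reduction(+:total) generates for you. Something like this is only needed when all threads accumulate into the same variable, which is not the case for V[i] and F[3*i+k] above.
#include <cstdio>

int main() {
    const int n = 1000000;
    double total = 0.0;
    #pragma omp parallel
    {
        double partial = 0.0;            // per-thread accumulator
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            partial += 1.0 / (i + 1.0);  // stand-in for the real per-element term
        #pragma omp critical             // combine the partial sums, once per thread
        total += partial;
    }
    std::printf("total = %f\n", total);
    return 0;
}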

How to stop parallel region of OpenMP 2.0

I am looking for a better way to cancel my threads.
In my approach, I use a shared variable and, if this variable is set, I just issue a continue. This finishes my threads fast, but the threads theoretically keep spawning and ending, which seems inelegant.
So, is there a better way to solve the issue (break is not supported by my OpenMP)?
I have to work with Visual Studio, so my OpenMP library is outdated and there is no way around that. Consequently, I think #pragma omp cancel will not work.
int progress_state = RunExport;
#pragma omp parallel
{
    #pragma omp for
    for (int k = 0; k < foo.z; k++)
        for (int j = 0; j < foo.y; j++)
            for (int i = 0; i < foo.x; i++) {
                if (progress_state == StopExport) {
                    continue;
                }
                // do some fancy shit
                // yeah here is a condition for speed due to the critical
                #pragma omp critical
                if (condition) {
                    progress_state = StopExport;
                }
            }
}
You should do it the simple way of "just continue in all remaining iterations if cancellation is requested". That can just be the first check in the outermost loop (and given that you have several nested loops, that will probably not have any measurable overhead).
std::atomic<int> progress_state{RunExport};
// You could just write #pragma omp parallel for instead of these two nested blocks.
#pragma omp parallel
{
    #pragma omp for
    for (int k = 0; k < foo.z; k++)
    {
        if (progress_state == StopExport)
            continue;
        for (int j = 0; j < foo.y; j++)
        {
            // You can add break statements in these inner loops.
            // OMP only parallelizes the outermost loop (at least given the way you wrote this)
            // so it won't care here.
            for (int i = 0; i < foo.x; i++)
            {
                // ...
                if (condition) {
                    progress_state = StopExport;
                }
            }
        }
    }
}
Generally speaking, OMP will not suddenly spawn new threads or end existing ones, especially not within one parallel region. This means there is little overhead associated with running a few more tiny iterations. This is even more true given that the default scheduling in your case is most likely static, meaning that each thread knows its start and end index right away. Other scheduling modes would have to call into the OMP runtime every iteration (or every few iterations) to request more work, but that won't happen here. The compiler will basically see this code for the threaded work:
// Not real omp functions.
int myStart = __omp_static_for_my_start();
int myEnd = __omp_static_for_my_end();
for (int k = myStart; k < myEnd; ++k)
{
    if (progress_state == StopExport)
        continue;
    // etc.
}
You might try a non-atomic, thread-local "should I cancel?" flag that starts as false and can only be flipped to true (which the compiler may understand and fold into the loop condition). But I doubt you will see significant overhead either way, at least on x86, where loads and stores of an aligned int are atomic anyway.
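A minimal sketch of that idea, reusing the variables from the code above (untested; whether the compiler actually folds the flag into the loop condition is an assumption):
#pragma omp parallel
{
    bool cancelled = false;                      // thread-local, non-atomic
    #pragma omp for
    for (int k = 0; k < foo.z; k++)
    {
        if (cancelled || progress_state == StopExport)
        {
            cancelled = true;                    // latch: once set, never re-read the atomic
            continue;
        }
        // ... inner loops / work ...
        if (condition)
        {
            progress_state = StopExport;         // signal the other threads
            cancelled = true;
        }
    }
}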
"which seems not elegant"
OMP 2.0 does not exactly shine with respect to elegance. I mean, iterating over a std::vector requires at least one static_cast to silence signed -> unsigned conversion warnings. So unless you have specific evidence of this pattern causing a performance problem, there is little reason not to use it.
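For instance (a trivial sketch of that std::vector point): OpenMP 2.0 only accepts a signed loop index, so the size_t returned by size() has to be cast.
#include <vector>

int main() {
    std::vector<double> data(1000, 1.0);
    // OpenMP 2.0 requires a signed index, hence the static_cast on size()
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(data.size()); ++i)
        data[i] *= 2.0;
    return 0;
}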

omp parallel for: How to make threads write to private arrays and merge all the arrays once all the threads have finished processing

I have a requirement to calculate z values and push them into arrays B and s2.
I tried to parallelize the processing using omp parallel for.
One problem I see is that if I don't put the B[i][j] += z and s2[i] += z statements in a critical section, I see a lot of NaN values being generated.
Just wondering if there is a way to write the z values to separate arrays (one array per thread) and merge them at the end.
Any help is greatly appreciated.
#pragma omp parallel
{
    double z;
    #pragma omp parallel for
    for(int t=1; t<n; t++) {
        double phi_i[N];
        double obs_j_seq_t[N];
        for(int i=0; i<N; i++) {
            for(int j=0; j<N; j++) {
                z=phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t];
                #pragma omp critical
                {
                    B[i][j] += z;
                    s2[i] += z;
                }
            }
        }
    }
}
Your code exposes a few issues, each a potential killer for its performance and/or validity:
You start by using a #pragma omp parallel and then you add a #pragma omp parallel for. That means you are trying to generate nested parallelism (a parallel region within another parallel region). This is, first, a bad idea and, second, disabled by default. Therefore, your second parallel directive is ignored, the work of your loop never gets distributed, and it is executed in full by every thread you spawned with your initial parallel directive. As a consequence, you have race conditions on the writes of the results into B and s2 by all the threads at once. You work around the issue by adding a critical section, but fundamentally the code is wrong.
Even if you didn't have this initial parallel directive, or if nested parallelism were enabled, your code would still be wrong for the following reasons:
Your z variable is shared across the threads of the second parallel region, and since it is modified by all of them, its value is undefined as soon as more than one thread is spawned in the region.
Even more fundamentally, you try to parallelize the loop over t, but the results are indexed over i. That means all threads will compete to update the same indexes, leading once more to race conditions and invalid results. You could again use a critical directive to address that, but it would only make the code painfully slow. You'd better parallelize the loop over i (while possibly swapping the loops over t and i to make the latter the outermost one).
Your code could become something like this (not tested):
#pragma omp parallel for
for(int i=0; i<N; i++) {
    for(int t=1; t<n; t++) {
        double phi_i[N]; // I guess these need some initialization
        double obs_j_seq_t[N]; // Idem
        for(int j=0; j<N; j++) {
            double z=phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t];
            B[i][j] += z;
            s2[i] += z;
        }
    }
}

Why is my C code slower using OpenMP?

I'm trying to do multi-threaded programming on the CPU using OpenMP. I have lots of for loops which are good candidates to be parallelized. I attached a part of my code here. When I use the first #pragma omp parallel for reduction, my code is faster, but when I try to use the same directive to parallelize other loops, it gets slower. Does anyone have any idea why that is?
.
.
.
omp_set_dynamic(0);
omp_set_num_threads(4);
float *h1=new float[nvi];
float *h2=new float[npi];
while(tol>0.001)
{
    std::fill_n(h2, npi, 0);
    int k,i;
    float h222=0;
    #pragma omp parallel for private(i,k) reduction (+: h222)
    for (i=0;i<npi;++i)
    {
        int p1=ppi[i];
        int m = frombus[p1];
        for (k=0;k<N;++k)
        {
            h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                    + B[m-1][k]*sin(del[m-1]-del[k]));
        }
        h2[i]=h222;
    }
    //*********** h3 *****************
    std::fill_n(h3, nqi, 0);
    float h333=0;
    #pragma omp parallel for private(i,k) reduction (+: h333)
    for (int i=0;i<nqi;++i)
    {
        int q1=qi[i];
        int m = frombus[q1];
        for (int k=0;k<N;++k)
        {
            h333 += v[m-1]*v[k]*(G[m-1][k]*sin(del[m-1]-del[k])
                    - B[m-1][k]*cos(del[m-1]-del[k]));
        }
        h3[i]=h333;
    }
    .
    .
    .
}
I don't think your OpenMP code gives the same result as the code without OpenMP. Let's concentrate on the h2[i] part of the code (the h3[i] part has the same logic). There is a dependency of h2[i] on the index i (i.e. h2[1] depends on h2[0]), so the OpenMP reduction you're doing won't give the correct result. If you want to do the reduction with OpenMP, you need to do it on the inner loop, like this:
float h222 = 0;
for (int i=0; i<npi; ++i) {
    int p1=ppi[i];
    int m = frombus[p1];
    #pragma omp parallel for reduction(+:h222)
    for (int k=0;k<N; ++k) {
        h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}
However, I don't know if that will be very efficient. An alternative method is to fill h2[i] in parallel over the outer loop without a reduction, and then take care of the dependency serially. Even though the serial loop is not parallelized, it should add only a little to the computation time since it does not contain the inner loop over k. This should give the same result with and without OpenMP and still be fast.
#pragma omp parallel for
for (int i=0; i<npi; ++i) {
    int p1=ppi[i];
    int m = frombus[p1];
    float h222 = 0;
    for (int k=0;k<N; ++k) {
        h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}
//take care of the dependency serially
for(int i=1; i<npi; i++) {
    h2[i] += h2[i-1];
}
Keep in mind that creating and destroying threads is a time-consuming process; clock the execution time of the process and see for yourself. You only use the parallel reduction twice, which may be faster than a serial reduction, but the initial cost of creating the threads may still be higher. Try parallelizing the outermost loop (if possible) to see if you can obtain a speedup.
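A quick way to clock a region is omp_get_wtime(). Here is a small self-contained sketch (the loop body is just a stand-in for your real work):
#include <cstdio>
#include <omp.h>

int main() {
    double t0 = omp_get_wtime();                 // wall-clock time in seconds
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100000000; ++i)          // stand-in for the real work
        sum += 1.0 / (i + 1.0);
    double t1 = omp_get_wtime();
    std::printf("sum = %f, elapsed = %f s\n", sum, t1 - t0);
    return 0;
}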