I've added OpenMP to an existing code base in order to parallelize a for loop. Several variables are created inside the scope of the parallel for region, including a pointer:
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    [....]
    Model *lm;
    lm->myfunc();
    lm->anotherfunc();
    [....]
}
In the resulting output files I noticed inconsistencies, presumably caused by a race condition. I ultimately resolved the race condition by using an omp critical. My question remains, though: is lm private to each thread, or is it shared?
Yes, all variables declared inside the parallel region are private, and this includes pointers: each thread will have its own copy of the pointer variable (the object it points to is only private as well if each thread makes its pointer refer to its own object).
This lets you do things like the following:
int threads = 8;
int size_per_thread = 10000000;
int *ptr = new int[size_per_thread * threads];   // one big shared buffer

#pragma omp parallel num_threads(threads)
{
    int id = omp_get_thread_num();
    int *my_ptr = ptr + size_per_thread * id;    // private pointer into this thread's slice
    // Do work on "my_ptr".
}

delete[] ptr;
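Applied to your loop, each iteration works with its own lm. A minimal sketch, assuming Model is default-constructible and that you actually allocate the object (in the snippet from the question lm is dereferenced without ever being initialised):

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    Model *lm = new Model();  // the pointer is private, and so is the object it points to
    lm->myfunc();
    lm->anotherfunc();
    delete lm;                // release it before the next iteration
}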
So I have a function, let's call it dostuff(), which it is sometimes beneficial for my application to parallelize internally, and sometimes to call many times and parallelize the surrounding loop instead. The function itself does not change between the two use cases.
Note: object is large enough that it cannot viably be stored in a list, and so it must be discarded with each iteration.
So, let's say our code looks like this:
bool parallelize_within = std::atoi(argv[1]);
if (parallelize_within) {
    // here we assume parallelization is handled within the function dostuff()
    for (int i = 0; i < 100; ++i) {
        object randomized = function_that_accesses_rand();
        dostuff(i, randomized, parallelize_within);
    }
} else {
    #pragma omp parallel for
    for (int i = 0; i < 100; ++i) {
        object randomized = function_that_accesses_rand();
        dostuff(i, randomized, parallelize_within);
    }
}
Obviously, we run into the issue that dostuff() will have threads access the random object at different times in different iterations of the same program. This is not the case when parallelize_within == true, but when we run dostuff() in parallel individually per thread, is there a way to guarantee that the random object is accessed in order based on the iteration? I know that I could do:
#pragma omp parallel for schedule(dynamic)
which guarantees that eventually, since iterations are assigned to threads dynamically at runtime, the objects will access rand in order of the iteration number; but for the first set of iterations it will be totally random. Any suggestions on how to avoid this?
First of all, you have to make sure that both function_that_accesses_rand and dostuff are thread-safe.
You do not have to duplicate your code if you use the if clause:
#pragma omp parallel for if(!parallelize_within)
To make sure that the i passed to dostuff(i, randomized, ...) reflects the order in which the randomized objects were created, you have to do something like this:
int j = 0;
#pragma omp parallel for if(!parallelize_within)
for (int i = 0; i < 100; ++i) {
    int k;
    object randomized;
    #pragma omp critical
    {
        k = j++;
        randomized = function_that_accesses_rand();
    }
    dostuff(k, randomized, parallelize_within);
}
You may be able to eliminate the critical section if your function_that_accesses_rand makes it possible, but I cannot be more specific without knowing the function. One solution is for this function to return the value representing the order. Do not forget that the function still has to be thread-safe!
#pragma omp parallel for if(!parallelize_within)
for (int i = 0; i < 100; ++i) {
    int k;
    object randomized = function_that_accesses_rand(k);
    dostuff(k, randomized, parallelize_within);
}

... function_that_accesses_rand(int& k) {
    ...
    #pragma omp atomic capture
    k = some_internal_counter++;
    ...
}
You could pre-generate the random objects and store them in a list, then index into that list with the loop counter inside the OpenMP loop:

// generate the random objects serially, so they follow the iteration order
for (int i = 0; i < 100; ++i)
    rand_obj[i] = function_that_accesses_rand();

#pragma omp parallel for
for (int i = 0; i < 100; ++i)
    dostuff(i, rand_obj[i], parallelize_within);
My code is threaded using OpenMP. I'm now using the Intel Inspector to check for data races between the threads and I got some inside a critical section.
The relevant code is:
#pragma omp parallel num_threads(current_num_threads) default(none) shared(max_particle_radius_,min_particle_radius_,...)
{
    double thread_max_radius = max_particle_radius_;
    double thread_min_radius = min_particle_radius_;

    #pragma omp for schedule(static)
    for (ID i = 0; i < n_particles_; ++i) {
        // ...
        thread_max_radius = max(thread_max_radius, radius);
        thread_min_radius = min(thread_min_radius, radius);
        // ...
    }

    #pragma omp critical (reduce_minmax_radius)
    {
        max_particle_radius_ = max(max_particle_radius_, thread_max_radius);
        min_particle_radius_ = min(min_particle_radius_, thread_min_radius);
    }
}
The two variables max_particle_radius_ and min_particle_radius_ are actually members of the class this code runs in, so they are automatically shared in the parallel region.
Intel Inspector then flags a data race inside this critical section, which I can't really understand.
How can there be a data race if only one thread at a time can enter the critical section? Am I missing something about how OpenMP works?
I have a few nested loops and I have put the outermost one in parallel mode. apar and mpar are structs whose values are modified in the loop; the function breakLogic is then called, which generates a struct that I store in a pre-created vector of those structs.
The arrays one, two, ... have been declared earlier in the function.
I have tried adding ordered and critical to ensure correctness, but I am still getting incorrect results.
#pragma omp parallel for ordered private(appFlip, atur, apar, mpar, i, j, k, l, m, n) shared(rawFlip)
for (i = 0; i < oneL; i++)
{
    // initialize mpar
    #pragma omp critical
    apar.one = one[i];
    for (j = 0; j < twoL; j++)
    {
        apar.two = two[j];
        for (k = 0; k < threeL; k++)
        {
            apar.three = floor(three[k]*apar.two);
            appFlip = applyParamSin(rawFlip, apar);
            for (l = 0; l < fourL; l++)
            {
                mpar.four = four[l];
                for (m = 0; m < fiveL; m++)
                {
                    mpar.five = five[m];
                    for (n = 0; n < sixL; n++)
                    {
                        mpar.six = add[n];
                        atur = breakLogic(appFlip, mpar, dt);
                        #pragma omp ordered
                        {
                            sinResVec[itr] = atur;
                            itr++;
                        }
                    }
                }
            }
            r0(appFlip);
        }
    }
}
Or is this code simply not conducive to parallelism? Are there any tools for g++ which can profile code for parallel processing and indicate potential issues?
This modified code works but gives no performance improvement.
Your original code can be parallelized with a few modifications:
Make apar and mpar firstprivate. They should be thread-local variables, and firstprivate initialises each thread's copy on entry to the parallel for region.
Remove all critical and ordered clauses, including the ordered on the parallel for directive; they are not doing what you expect.
Calculate itr from i, j, k, l, m, n so that the shared counter, and the dependency it creates, disappears.
itr = ((((i*twoL + j)*threeL + k)*fourL + l)*fiveL + m)*sixL + n;
sinResVec[itr] = atur;
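Putting those three changes together, a rough sketch of the restructured loop could look like the one below. The names are taken from your code; it assumes apar and mpar are fully initialized before the parallel region, that the types of appFlip and atur are default-constructible, and that applyParamSin, breakLogic and r0 are thread-safe:

#pragma omp parallel for firstprivate(apar, mpar) private(appFlip, atur)
for (int i = 0; i < oneL; i++)
{
    apar.one = one[i];
    for (int j = 0; j < twoL; j++)
    {
        apar.two = two[j];
        for (int k = 0; k < threeL; k++)
        {
            apar.three = floor(three[k]*apar.two);
            appFlip = applyParamSin(rawFlip, apar);
            for (int l = 0; l < fourL; l++)
            {
                mpar.four = four[l];
                for (int m = 0; m < fiveL; m++)
                {
                    mpar.five = five[m];
                    for (int n = 0; n < sixL; n++)
                    {
                        mpar.six = add[n];
                        atur = breakLogic(appFlip, mpar, dt);
                        // index derived from the loop counters: no shared counter, no ordering needed
                        int itr = ((((i*twoL + j)*threeL + k)*fourL + l)*fiveL + m)*sixL + n;
                        sinResVec[itr] = atur;
                    }
                }
            }
            r0(appFlip);
        }
    }
}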
Update: see the link below for more details on OpenMP, especially the differences between private and firstprivate.
http://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
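As a quick illustration of that difference (a minimal sketch, not taken from the question):

int x = 42;

#pragma omp parallel private(x)
{
    // every thread gets its own x, but it is NOT initialised: its value is indeterminate here
}

#pragma omp parallel firstprivate(x)
{
    // every thread gets its own x, initialised to 42 (a copy of the value before the region)
}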
I'm trying to make a for loop multi-threaded in C++ so that the calculation gets divided among multiple threads. However, the loop produces data that needs to be joined together in its original order.
So the idea is to first join the small bits on many cores (25,000+ iterations) and then join the combined data once more at the end.
std::vector<int> ids;              // mappings
std::map<int, myData> combineData; // data per id
myData outputData;                 // combined data based on the mappings
myData threadData;                 // data per thread

#pragma omp parallel for default(none) private(threadData) shared(combineData, ids)
for (int i = 0; i < 30000; i++)
{
    threadData += combineData[ids[i]];
}
// Then here I would like to get all the seperate thread data and combine them in a similar manner
// I.e.: for each threadData: outputData += threadData
What would be the efficient and good way to approach this?
How can I schedule the OpenMP loop so that the iterations are split evenly into contiguous chunks?
For example, for 2 threads:
[0, 1, 2, 3, 4, .., 14999] & [15000, 15001, 15002, 15003, 15004, .., 29999]
If there's a better way to join the data (which involves joining a lot of std::vectors together and some matrix math) while preserving the order of the additions, pointers to that would help as well.
Added information
The addition is associative, though not commutative.
myData is not an intrinsic type. It's a class containing data as multiple std::vectors (and some data related to the Autodesk Maya API.)
Each cycle applies a similar matrix multiplication to many points and appends these points to a vector (in theory the calculation time should stay roughly the same per cycle).
Basically it's adding mesh data (consisting of vectors of data) together (combining meshes), but the overall order determines the index values of the vertices. The vertex indices should be consistent and rebuildable.
This depends on a few properties of the addition operator of myData. If the operator is both associative, (A + B) + C = A + (B + C), and commutative, A + B = B + A, then you can use a critical section or, if the data is plain old data (e.g. a float, an int, ...), a reduction.
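For that commutative, plain-old-data case a reduction clause is all you need. A minimal sketch, assuming purely for illustration that myData were just a double and that every id already exists in the map (at() is used so the lookup never inserts into the shared map):

double outputData = 0.0;
#pragma omp parallel for reduction(+:outputData)
for (int i = 0; i < 30000; i++)
    outputData += combineData.at(ids[i]);  // each thread sums into a private copy; OpenMP combines them at the end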
However, if it is not commutative, as you say (the order of operations matters), but still associative, then you can fill an array with one element per thread holding the partial results in parallel, and then merge them in order serially (see the code below). Using schedule(static) will split the chunks more or less evenly and in order of increasing thread number, which is what you want.
If the operator is neither associative nor commutative then I don't think you can parallelize it (efficiently - e.g. try parallelizing a Fibonacci series efficiently).
std::vector<int> ids;              // mappings
std::map<int, myData> combineData; // data per id
myData outputData;                 // combined data based on the mappings

myData *threadData;
int nthreads;
#pragma omp parallel
{
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        threadData = new myData[nthreads];
    }
    myData tmp;
    #pragma omp for schedule(static)
    for (int i = 0; i < 30000; i++) {
        tmp += combineData[ids[i]];
    }
    threadData[omp_get_thread_num()] = tmp;
}
for (int i = 0; i < nthreads; i++) {
    outputData += threadData[i];
}
delete[] threadData;
Edit: I'm not 100% sure at this point whether the chunks will be assigned in order of increasing thread number with #pragma omp for schedule(static) (though I would be surprised if they are not). There is an ongoing discussion on this issue. Meanwhile, if you want to be 100% sure, then instead of
#pragma omp for schedule(static)
for (int i=0; i<30000; i++) {
tmp += combineData[ids[i]];
}
you can do
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int start = ithread*30000/nthreads;
const int finish = (ithread+1)*30000/nthreads;
for(int i = start; i<finish; i++) {
tmp += combineData[ids[i]];
}
Edit: I found a more elegant way to fill in parallel but merge in order:
#pragma omp parallel
{
    myData tmp;
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < 30000; i++) {
        tmp += combineData[ids[i]];
    }
    #pragma omp for schedule(static) ordered
    for (int i = 0; i < omp_get_num_threads(); i++) {
        #pragma omp ordered
        outputData += tmp;
    }
}
This avoids allocating data for each thread (threadData) and merging outside the parallel region.
If you really want to preserve the same order as in the serial case, then there is no other way than doing it serially. In that case you can maybe try to parallelize the operations done in operator+=.
If the operations can be done in any order, but the reduction of the blocks has a specific order, then it may be worth having a look at TBB parallel_reduce. It will require you to write more code, but if I remember correctly you can define complex custom reduction operations.
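For example, with the lambda form of tbb::parallel_reduce it might look roughly like this. This is only a sketch under a few assumptions not stated in the question: myData is default-constructible and copyable, operator+= is associative, and every id is already present in the map (at() is used so the lookup never modifies the shared map):

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

myData total = tbb::parallel_reduce(
    tbb::blocked_range<size_t>(0, ids.size()),
    myData(),                                             // identity element
    [&](const tbb::blocked_range<size_t> &r, myData acc) {
        for (size_t i = r.begin(); i != r.end(); ++i)
            acc += combineData.at(ids[i]);                // accumulate one contiguous block
        return acc;
    },
    [](myData left, const myData &right) {                // join adjacent blocks, left (earlier) block first
        left += right;
        return left;
    });
outputData += total;

Since blocks are always joined with the earlier block as the left operand, the order of the additions is preserved as long as operator+= is associative.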
If the order of the operations doesn't matter, then your snippet is almost complete. What it lacks is possibly a critical construct to aggregate private data:
std::vector<int> ids;              // mappings
std::map<int, myData> combineData; // data per id
myData outputData;                 // combined data based on the mappings

#pragma omp parallel
{
    myData threadData; // data per thread

    #pragma omp for nowait
    for (int ii = 0; ii < total_iterations; ii++)
    {
        threadData += combineData[ids[ii]];
    }

    #pragma omp critical
    {
        outputData += threadData;
    }

    #pragma omp barrier
    // From here on you are ensured that every thread sees
    // the correct value of outputData
}
The schedule of the for loop in this case is not important for the semantics. If the overload of operator+= is a relatively stable operation (in terms of the time needed to compute it), then you can use schedule(static), which divides the iterations evenly among threads. Otherwise you can resort to other scheduling to balance the computational burden (e.g. schedule(guided)).
Finally if myData is a typedef of an intrinsic type, then you can avoid the critical section and use a reduction clause:
#pragma omp for reduction(+:outputData)
for (int ii =0; ii < total_iterations; ii++)
{
outputData += combineData[ids[ii]];
}
In this case you don't need to declare anything explicitly as private.
I'm sitting here with some code trying to make orphaning work, and to reduce the overhead by reducing the number of #pragma omp parallel calls.
What I'm trying is something like:
#pragma omp parallel default(none) shared(mat,mat2,f,max_iter,tol,N,conv) private(diff,k)
{
    #pragma omp master // I'm not against using #pragma omp single or whatever will work
    {
        while (diff > tol) {
            do_work(mat, mat2, f, N);
            swap(mat, mat2);
            if ( !(k % 100) ) // Only test stop criteria every 100 iterations
                diff = conv[k] = do_more_work(mat, mat2);
            k++;
        } // end while
    } // end master
} // end parallel
do_work depends on the previous iteration, so the while loop has to run sequentially.
But I would like to be able to run do_work in parallel, so it would look something like:
void do_work(double *mat, double *mat2, double *f, int N)
{
    int i, j;
    double scale = 1/4.0;
    #pragma omp for schedule(runtime) // Just so I can test different settings without having to recompile
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            mat[i*N+j] = scale*(mat2[(i+1)*N+j] + mat2[(i-1)*N+j] + ... + f[i*N+j]);
}
I hope this can be accomplished in some way; I'm just not sure how, so any help I can get is greatly appreciated (also if you're telling me this isn't possible). By the way, I'm working with OpenMP 3.0, the GCC compiler and the Sun Studio compiler.
The outer parallel region in your original code contains only a serial piece (#pragma omp master), which makes no sense and effectively results in purely serial execution (no parallelism). Since do_work() depends on the previous iteration, but you want to run it in parallel, you must use synchronisation. The OpenMP tool for that is an (explicit or implicit) synchronisation barrier.
For example (code similar to yours):
#pragma omp parallel
for (int j = 0; diff > tol; ++j)  // must be the same condition for each thread!
    #pragma omp for               // note: implicit synchronisation after for loop
    for (int i = 0; i < N; ++i)
        work(j, i);
Note that the implicit synchronisation ensures that no thread enters the next j if any thread is still working on the current j.
The alternative
for (int j = 0; diff > tol; ++j)
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        work(j, i);
should be less efficient, as it creates a new team of threads at each iteration, instead of merely synchronising.
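Coming back to the code from the question, the barrier-based approach might map onto the while loop roughly as follows. This is only a sketch, assuming diff and k are shared and initialised before the region, and that do_work() contains the orphaned #pragma omp for from the question (its implicit barrier then separates consecutive iterations):

#pragma omp parallel default(none) shared(mat, mat2, f, tol, N, conv, diff, k)
{
    while (diff > tol) {              // every thread evaluates the same shared diff
        do_work(mat, mat2, f, N);     // contains an orphaned "omp for"; implicit barrier at its end
        #pragma omp single
        {
            swap(mat, mat2);
            if ( !(k % 100) )         // only test the stop criterion every 100 iterations
                diff = conv[k] = do_more_work(mat, mat2);
            k++;
        }                             // implicit barrier: all threads see the updated diff and the swapped mat
    }
}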