C++ OpenMP Parallel For Loop - Alternatives to std::vector [closed] - c++

Based on this thread, OpenMP and STL vector, which data structures are good alternatives for a shared std::vector in a parallel for loop? The main aspect is speed, and the vector might require resizing during the loop.

I think you can use std::vector with OpenMP most of the time and still have good performance. The following code, for example, fills thread-private std::vectors in parallel and then combines them at the end. As long as the main loop/fill function is the bottleneck, this should work well in general and be thread safe.
std::vector<int> vec;
#pragma omp parallel
{
    std::vector<int> vec_private;
    #pragma omp for nowait // fill vec_private in parallel
    for (int i = 0; i < 100; i++) {
        vec_private.push_back(i);
    }
    #pragma omp critical
    vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
Edit:
OpenMP 4.0 allows user-defined reductions using #pragma omp declare reduction. The code above can be simplified to
#pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
std::vector<int> vec;
#pragma omp parallel for reduction(merge: vec)
for(int i=0; i<100; i++) vec.push_back(i);
Edit:
What I have shown so far does not fill the vector in order. If the order matters, it can be done like this:
std::vector<int> vec;
#pragma omp parallel
{
    std::vector<int> vec_private;
    #pragma omp for nowait schedule(static)
    for (int i = 0; i < N; i++) {
        vec_private.push_back(i);
    }
    #pragma omp for schedule(static) ordered
    for (int i = 0; i < omp_get_num_threads(); i++) {
        #pragma omp ordered
        vec.insert(vec.end(), vec_private.begin(), vec_private.end());
    }
}
This avoids having to save a std::vector for each thread and then merge them serially outside of the parallel region. I learned about this "trick" here. It's not possible to do this with user-defined reductions.
I just realized that the critical section is not necessary, which I figured out from this question: parallel-cumulative-prefix-sums-in-openmp-communicating-values-between-thread. This method also gets the order correct:
std::vector<int> vec;
size_t *prefix;
#pragma omp parallel
{
    int ithread  = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    #pragma omp single
    {
        prefix = new size_t[nthreads+1];
        prefix[0] = 0;
    }
    std::vector<int> vec_private;
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < 100; i++) {
        vec_private.push_back(i);
    }
    prefix[ithread+1] = vec_private.size();
    #pragma omp barrier
    #pragma omp single
    {
        for (int i = 1; i < (nthreads+1); i++) prefix[i] += prefix[i-1];
        vec.resize(vec.size() + prefix[nthreads]);
    }
    std::copy(vec_private.begin(), vec_private.end(), vec.begin() + prefix[ithread]);
}
delete[] prefix;

The question you link to was talking about the fact that the "STL vector container is not thread-safe in the situation where multiple threads write to a single container". This is only true, as correctly stated there, if you call methods that can cause reallocation of the underlying array that std::vector holds. push_back(), pop_back() and insert() are examples of these dangerous methods.
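For contrast, here is a minimal sketch (not from the original answer) of the pattern that is safe without any synchronization: the vector is sized once before the parallel region, and each thread only writes to distinct, already-allocated elements, so no reallocation can happen.
#include <vector>

int main() {
    const int n = 100;
    std::vector<int> vec(n);   // allocate all elements up front
    // Safe: no push_back/insert inside the loop, so no reallocation,
    // and every iteration writes to a different element.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        vec[i] = i * i;
    }
}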
If you need thread-safe reallocation, then the Intel Threading Building Blocks (TBB) library offers you concurrent vector containers. You should not use tbb::concurrent_vector in single-threaded programs, because the time it takes to access random elements is higher than the O(1) time std::vector needs for the same operation. However, concurrent_vector calls push_back(), pop_back() and insert() in a thread-safe way, even when reallocation happens.
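As a rough sketch of what that looks like (assuming TBB is installed and linked; the container can be filled from OpenMP threads just as well as from TBB ones):
#include <tbb/concurrent_vector.h>

int main() {
    tbb::concurrent_vector<int> cv;
    // push_back is thread safe even when the container has to grow,
    // so no critical section is needed; the final order is unspecified.
    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        cv.push_back(i);
    }
}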
EDIT 1: Slides 46 and 47 of the following Intel presentation give an illustrative example of concurrent reallocation using tbb::concurrent_vector.
EDIT 2: By the way, if you start using Intel Threading Building Blocks (it is open source, it works with most compilers, and it is much better integrated with C++/C++11 features than OpenMP), then you don't need OpenMP to create a parallel for. Here is a nice example of parallel_for using TBB.
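A minimal sketch of such a parallel_for, assuming the index-based overload and a C++11 lambda:
#include <tbb/parallel_for.h>
#include <tbb/concurrent_vector.h>

int main() {
    tbb::concurrent_vector<int> cv;
    // tbb::parallel_for(first, last, body) calls the lambda for every index
    // in [0, 100), distributing the iterations over the TBB worker threads.
    tbb::parallel_for(0, 100, [&](int i) {
        cv.push_back(i);
    });
}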

Related

parallel programming in OpenMP

I have the following piece of code.
for (i = 0; i < n; ++i) {
    ++cnt[offset[i]];
}
where offset is an array of size n containing values in the range [0, m) and cnt is an array of size m initialized to 0. I use OpenMP to parallelize it as follows.
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
    ++cnt[offset[i]];
}
According to the discussion in this post, if offset[i1] == offset[i2] for i1 != i2, the above piece of code may result in incorrect cnt. What can I do to avoid this?
This code:
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
    ++cnt[offset[i]];
}
contains a race condition during the updates of the array cnt. To solve it you need to guarantee mutual exclusion of those updates, which can be achieved with (for instance) #pragma omp atomic update. But, as already pointed out in the comments:
However, this resolves just correctness and may be terribly inefficient due to heavy cache contention and synchronization needs (including false sharing). The only solution then is to have each thread its private copy of cnt and reduce these copies at the end.
The alternative solution is to have a private array per thread and, at the end of the parallel region, perform a manual reduction of all those arrays into one. An example of such an approach can be found here.
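A minimal sketch of that manual reduction, assuming cnt points to m zero-initialized counters (the function name is just for illustration):
#include <vector>

void count_offsets(const int* offset, int n, int* cnt, int m) {
    #pragma omp parallel
    {
        // Each thread fills its own private histogram without synchronization.
        std::vector<int> cnt_private(m, 0);
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            ++cnt_private[offset[i]];
        // Merge the private copies into the shared array, one thread at a time.
        #pragma omp critical
        for (int j = 0; j < m; ++j)
            cnt[j] += cnt_private[j];
    }
}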
Fortunately, with OpenMP 4.5 you can reduce arrays using a dedicated reduction clause, namely:
#pragma omp parallel for reduction(+:cnt)
You can have a look at this example of how to apply that feature.
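If cnt is a plain fixed-size array, the reduction(+:cnt) clause shown above works directly; if it is a pointer, OpenMP 4.5 array sections can be used instead. A sketch of the same loop with that syntax (the function name is again illustrative):
void count_offsets(const int* offset, int n, int* cnt, int m) {
    // Each thread gets a private, zero-initialized copy of cnt[0..m-1];
    // the copies are summed into the original array at the end of the loop.
    #pragma omp parallel for reduction(+:cnt[0:m])
    for (int i = 0; i < n; ++i)
        ++cnt[offset[i]];
}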
It is worth mentioning, regarding the reduction of arrays versus the atomic approach, as kindly pointed out by @Jérôme Richard:
Note that this is fast only if the array is not huge (the atomic based solution could be faster in this specific case regarding the platform and if the values are not conflicting). So that is m << n.
As always, profiling is key; hence, you should test your code with the aforementioned approaches to find out which one is the most efficient.

How to do a parallel_reduce in TBB to append std::vector?

I've defined this reduction in OpenMP:
std::vector<FindAffineShapeArgs> v;
#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector<FindAffineShapeArgs> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for collapse(2) schedule(dynamic,1) reduction(mergeFindAffineShapeArgs : v)
for(int i=0; i<n; i++){
    v.push_back(/* something */);
}
In a few words, the reduction operation appends each thread's local version of v to the global one.
I've never used TBB before; I've read this, this and this tutorial, but I still don't understand whether this is even possible in TBB.
Can someone help me with this please?
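One way this could be expressed in TBB (a sketch only, with a stand-in struct for FindAffineShapeArgs) is the functional form of tbb::parallel_reduce, where the body fills a local vector for each sub-range and the join step concatenates the partial results:
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <vector>

struct FindAffineShapeArgs { int value; };   // stand-in for the real type

std::vector<FindAffineShapeArgs> compute(int n) {
    return tbb::parallel_reduce(
        tbb::blocked_range<int>(0, n),
        std::vector<FindAffineShapeArgs>{},                            // identity: empty vector
        [](const tbb::blocked_range<int>& r, std::vector<FindAffineShapeArgs> local) {
            for (int i = r.begin(); i != r.end(); ++i)
                local.push_back(FindAffineShapeArgs{i});               // "something"
            return local;                                              // partial result
        },
        [](std::vector<FindAffineShapeArgs> a, const std::vector<FindAffineShapeArgs>& b) {
            a.insert(a.end(), b.begin(), b.end());                     // merge two partial results
            return a;
        });
}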

Converting for loop to OpenMP

So I have a loop where I iterate over the elements of a vector, call a function on each element, and if it meets a certain criterion, I push it onto a list.
my_list li;
for (auto itr = Obj.begin(); itr != Obj.end(); ++itr) {
    if ((*itr).function_call())
        li.push_back(*itr);
}
I've been thinking of ways to optimize my program, and I came across OpenMP, but a lot of the sample code is hard to follow.
Could someone walk me through how to convert the above loop to utilize multiple cores in parallel?
Thanks.
There are a few points you need to take care of to parallelize that code snippet:
If you're using OpenMP 3.0 (or above) you can parallelize the for loop with #pragma omp parallel for; if you're using an older version of OpenMP, you need a for loop that accesses the vector with integer indexes.
You need to guard the li.push_back(*itr); statement with a lock or put it in a critical section.
If function_call is not a really slow function, or your vector does not contain many items, it may not be worth parallelizing, as thread creation introduces overhead.
So a pseudo-code implementation would be
my_list li;
#pragma omp parallel for
for (auto itr = Obj.begin(); itr != Obj.end(); ++itr) {
    if ((*itr).function_call())
    {
        #pragma omp critical(CRIT_1)
        {
            li.push_back(*itr);
        }
    }
}
The time has come to discuss efficient ways to use container classes such as std::list or std::vector with OpenMP (since the OP wants to optimize his code using lists with OpenMP). Let me list four ways in increasing order of efficiency.
Fill the container in a parallel section in a critical block
Make private versions of the container for each thread, fill them in parallel, and then merge them in a critical section
Don't use STL containers. STL was not designed with efficiency in mind. Instead, either write your own or use something like Agner Fog's containers, which are designed for efficiency. For example, instead of using the heap for memory allocation they use a memory pool.
In some special cases it's possible to merge the private versions of the containers in parallel as well.
Example code for the first case is given in the accepted answer. This defeats most of the purpose of using threaded code, since each iteration fills the container inside a critical section.
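For reference, case 1 looks roughly like this sketch; the critical section around every single insertion is exactly what serializes the work:
#include <list>

void fill_case1(std::list<int>& li, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        // Every iteration takes the same lock, so threads spend most of
        // their time waiting rather than working.
        #pragma omp critical
        li.push_back(i);
    }
}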
Example code for the second case can be found at C++ OpenMP Parallel For Loop - Alternatives to std::vector. Rather than re-post the code here, let me give an example for the third case using Agner Fog's container classes:
DynamicArray<int> vec;
#pragma omp parallel
{
    DynamicArray<int> vec_private;
    #pragma omp for nowait // fill vec_private in parallel
    for (int i = 0; i < 100; i++) {
        vec_private.Push(i);
    }
    // Merging here is probably not optimal.
    // DynamicArray needs an append function:
    // vec should reserve a size equal to the sum of the sizes of each vec_private,
    // then use memcpy to append each vec_private to vec in a critical section.
    #pragma omp critical
    {
        for (int i = 0; i < vec_private.GetNum(); i++) {
            vec.Push(vec_private[i]);
        }
    }
}
Finally, in special cases, for example with histograms (probably the most common data structure in experimental particle physics), it's possible to merge the private arrays in parallel as well. For histograms this is equivalent to an array reduction. This is a bit tricky. An example showing how to do this can be found at Fill histograms (array reduction) in parallel with OpenMP without using a critical section.
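The idea from that link, sketched with std::vector and assuming every data[i] lies in [0, nbins): each thread fills a private histogram, and the merge is itself a parallel loop in which each thread sums a different slice of the bins, so no critical section is needed.
#include <vector>
#include <omp.h>

std::vector<int> histogram(const int* data, int n, int nbins) {
    std::vector<int> hist(nbins, 0);
    std::vector<std::vector<int>> hist_private;
    #pragma omp parallel
    {
        #pragma omp single
        hist_private.assign(omp_get_num_threads(), std::vector<int>(nbins, 0));
        const int ithread = omp_get_thread_num();
        // Fill this thread's private histogram without synchronization.
        // The implicit barrier at the end of this loop guarantees that all
        // private histograms are complete before the merge below starts.
        #pragma omp for
        for (int i = 0; i < n; i++)
            ++hist_private[ithread][data[i]];
        // Merge in parallel: each thread accumulates a different range of bins.
        #pragma omp for
        for (int b = 0; b < nbins; b++)
            for (int t = 0; t < (int)hist_private.size(); t++)
                hist[b] += hist_private[t][b];
    }
    return hist;
}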

Parallelization of nested loops with OpenMP

I was trying to parallelize the following loop in my code with OpenMP
double pottemp, pot2body;
pot2body = 0.0;
pottemp = 0.0;
#pragma omp parallel for reduction(+:pot2body) private(pottemp) schedule(dynamic)
for (int i = 0; i < nc2; i++)
{
    pottemp = ener2body[i]->calculatePot(ener2body[i]->m_mols);
    pot2body += pottemp;
}
For the function 'calculatePot', a very important loop inside it has also been parallelized with OpenMP:
CEnergymulti::calculatePot(vector<CMolecule*> m_mols)
{
    ...
    #pragma omp parallel for reduction(+:dev) schedule(dynamic)
    for (int i = 0; i < i_max; i++)
    {
        ...
    }
}
So it seems that my parallelization involves nested loops. When I removed the parallelization of the outermost loop, the program ran much faster than the one with the outermost loop parallelized. The test was performed on 8 cores.
I think this low parallel efficiency might be related to the nested loops. Someone suggested using 'collapse' while parallelizing the outermost loop. However, since there is still something between the outermost loop and the inner loop, it was said that 'collapse' cannot be used in this situation. Are there any other ways I could try to make this parallelization more efficient while still using OpenMP?
Thanks a lot.
If i_max is independent of the i in the outer loop, you can try fusing the loops (essentially what collapse does). It's something I do often and it usually gives me a small boost. I also prefer fusing the loops "by hand" rather than with OpenMP, because Visual Studio only supports OpenMP 2.0, which does not have collapse, and I want my code to work on Windows and Linux.
#pragma omp parallel for reduction(+:pot2body) schedule(dynamic)
for (int n = 0; n < (nc2*i_max); n++) {
    int i = n/i_max; // index of the outer loop
    int j = n%i_max; // index of the inner loop
    double pottmp_j = ...
    pot2body += pottmp_j;
}
If i_max depends on j then this won't work; in that case follow Grizzly's advice. But there is one more thing you can try. OpenMP has overhead, so if i_max is too small, using OpenMP could actually be slower. If you add an if clause to the pragma, the loop will only run in parallel when the condition is true. Like this:
const int threshold = ... // smallest value for which OpenMP gives a speedup.
#pragma omp parallel for reduction(+:dev) schedule(dynamic) if(i_max > threshold)

Using openMP with SETS

I wanted to parallelize this code with the help of OpenMP with something like
#pragma omp parallel for to divide the work amongst the different threads.
What would be an efficient way? Here level is shared between the various threads, and make is a set.
for (iter = make.at(level).begin(); iter != make.at(level).end(); iter++)
{
    Function(*iter);
}
If the type returned by make.at(level) has random-access iterators with constant access time, and if your compiler supports a recent enough OpenMP version (read: it is not MSVC++), then you can directly use the parallel for worksharing directive:
obj = make.at(level);
#pragma omp parallel for
for (iter = obj.begin(); iter != obj.end(); iter++)
{
    Function(*iter);
}
If the type does not provide random-access iterators but your compiler still supports OpenMP 3.0 or newer, then you can use OpenMP tasks:
#pragma omp parallel
{
    #pragma omp single
    {
        obj = make.at(level);
        for (iter = obj.begin(); iter != obj.end(); iter++)
        {
            #pragma omp task
            Function(*iter);
        }
    }
}
Here a single thread executes the for loop and creates a number of OpenMP tasks. Each task will make a single call to Function() using the corresponding value of *iter. Then each idle thread will start picking from the list of unfinished tasks. At the end of the parallel region there is an implicit barrier so the master thread will dutifully wait for all tasks to finish before continuing execution past the parallel region.
If you are unfortunate enough to use MS Visual C++, then you don't have much of a choice other than to create an array of object pointers and iterate over it using a simple integer loop:
obj = make.at(level);
obj_type **elements = new obj_type*[obj.size()];
for (i = 0, iter = obj.begin(); i < obj.size(); i++)
{
    elements[i] = &(*iter++);
}
#pragma omp parallel for
for (i = 0; i < obj.size(); i++)
{
    Function(*elements[i]);
}
delete [] elements;
It's not the most elegant solution but it should work.
Edit: If I understand correctly from the title of your question, you are working with sets. That rules out the first algorithm, since sets do not support random-access iterators. Use either the second or the third algorithm depending on your compiler's support for OpenMP tasks.
It seems that the loop variable in a parallel for must be a signed int, but I'm not sure. Here is a topic about this: Why must loop variables be signed in a parallel for?
To use this iterator pattern with OpenMP probably requires some rethinking of how to perform the loop - you can't use #pragma omp for since your loop isn't a simple integer loop. I wonder if the following would work:
iter = make.at(level).begin();
end  = make.at(level).end();
#pragma omp parallel private(iter) shared(make, level, end)
{
    #pragma omp single
    func(iter); /* only one thread does the first item */
    while (1)
    {
        #pragma omp critical
        iter = make.at(level).next(); /* each thread gets a different item */
        if (iter == end)
            break;
        func(iter);
    }
} /* end parallel block */
Note that I had to change your iter++ into a next() call in a critical section to make it work. The reason for this is that the shared make.at(level) object needs to remember which items have already been processed.