OpenMP and reduction on std::vector? - C++

I want to make this code parallel:
std::vector<float> res(n,0);
std::vector<float> vals(m);
std::vector<size_t> indexes(m);
// fill vals, and fill indexes with values in range [0,n)
for(size_t i=0; i<m; i++){
    res[indexes[i]] += /* something using vals[i] */;
}
In this article it's suggested to use:
#pragma omp parallel for reduction(+:myArray[:6])
In this question the same approach is proposed in the comments section.
I have two questions:
I don't know m at compile time, and from these two examples it seems that's required. Is that so? Or, if I can use this approach in my case, what do I have to replace ? with in the directive #pragma omp parallel for reduction(+:res[:?])? m or n?
Is it relevant that the loop index runs over indexes and vals and not over res, especially considering that the reduction is done on the latter? If so, how can I solve this problem?

It is fairly straightforward to do a user-declared reduction for C++ vectors of a specific type:
#include <algorithm>
#include <vector>

#pragma omp declare reduction(vec_float_plus : std::vector<float> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<float>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

std::vector<float> res(n,0);
#pragma omp parallel for reduction(vec_float_plus : res)
for(size_t i=0; i<m; i++){
    res[...] += ...;
}
1a) Not knowing m at compile time is not a requirement.
1b) You cannot use the array section reduction on std::vectors, because they are not arrays (and std::vector::data is not an identifier). If it were possible, you'd have to use n, as this is the number of elements in the array section.
2) As long as you are only reading indexes and vals, there is no issue.
Edit: The original initializer clause was simpler: initializer(omp_priv = omp_orig). However, if the original copy is not full of zeroes, the result will be wrong. Therefore, I suggest the more complicated initializer, which always creates a zero-filled vector of the right size.
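As an aside on 1b: if you really want the array-section syntax, one possible workaround is to reduce over the vector's underlying storage through a raw pointer. This is only a minimal sketch, assuming an OpenMP 4.5 compiler; the loop body using vals[i] is a placeholder:
std::vector<float> res(n, 0.0f);
float* res_ptr = res.data();   // raw pointer to the vector's storage
// Pointer-based array sections are allowed in reduction clauses since OpenMP 4.5.
// The section length is n (the size of res), not m, and n can be a runtime value.
#pragma omp parallel for reduction(+:res_ptr[:n])
for(size_t i=0; i<m; i++){
    res_ptr[indexes[i]] += vals[i];   // placeholder for "something using vals[i]"
}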

Related

OpenMP reduction on container elements

I have a nested loop with a few outer and many inner iterations. In the inner loop, I need to calculate a sum, so I want to use an OpenMP reduction. The outer loop is on a container, so the reduction is supposed to happen on an element of that container.
Here's a minimal contrived example:
#include <omp.h>
#include <vector>
#include <iostream>

int main(){
    constexpr int n { 128 };
    std::vector<int> vec (4, 0);

    for (unsigned int i {0}; i<vec.size(); ++i){
        /* this does not work */
        //#pragma omp parallel for reduction (+:vec[i])
        //for (int j=0; j<n; ++j)
        //    vec[i] += j;

        /* this works */
        int* val { &vec[0] };
        #pragma omp parallel for reduction (+:val[i])
        for (int j=0; j<n; ++j)
            val[i] += j;

        /* this is allowed, but looks very wrong. Produces wrong results
         * for std::vector, but on an Eigen type, it worked. */
        #pragma omp parallel for reduction (+:val[i])
        for (int j=0; j<n; ++j)
            vec[i] += j;
    }

    for (unsigned int i=0; i<vec.size(); ++i) std::cout << vec[i] << " ";
    std::cout << "\n";
    return 0;
}
The problem is that if I write the reduction clause as (+:vec[i]), I get the error ‘vec’ does not have pointer or array type, which is descriptive enough to find a workaround. However, that means I have to introduce a new variable and somewhat change the code logic, and I find it less obvious to see what the code is supposed to do.
My main question is, whether there is a better/cleaner/more standard way to write a reduction for container elements.
I'd also like to know why and how the third way shown in the code above somewhat works. I'm actually working with the Eigen library, on whose containers that variant seems to work just fine (haven't extensively tested it though), but on std::vector, it produces results somewhere between zero and the actual result (8128). I thought it should work, because vec[i] and val[i] should both evaluate to dereferencing the same address. But alas, apparently not.
I'm using OpenMP 4.5 and gcc 9.3.0.
I'll answer your question in three parts:
1. What is the best way to perform OpenMP reductions in your example above with a std::vector?
i) Use your approach, i.e. create a pointer int* val { &vec[0] };
ii) Declare a new shared variable as #1201ProgramAlarm answered.
iii) Declare a user-defined reduction (not really applicable in your simple case, but see 3. below for a more efficient pattern).
2. Why doesn't the third loop work, and why does it work with Eigen?
As the previous answer states, you are telling OpenMP to perform a reduction sum on memory address X, but you are performing additions on memory address Y, which means the reduction declaration is ignored and your additions are subject to the usual race conditions.
You don't really provide much detail into your Eigen venture, but here are some possible explanations:
i) You're not really using multiple threads (check n = Eigen::nbThreads( ))
ii) You didn't disable Eigen's own parallelism, which can disrupt your own usage of OpenMP (e.g. by defining EIGEN_DONT_PARALLELIZE).
iii) The race condition is there, but you're not seeing it because Eigen operations take longer, you're using a low number of threads, and you're only writing a small number of values, so threads interfere with each other less often and the wrong result shows up more rarely.
3. How should I parallelize this scenario using OpenMP (technically not a question you asked explicitly)?
Instead of parallelizing only the inner loop, you should parallelize both at the same time. The less serial code you have, the better. In this scenario each thread has its own private copy of the vec vector, which gets reduced after all the elements have been summed by their respective thread. This solution is optimal for your presented example, but might run into RAM problems if you're using a very large vector and very many threads (or have very limited RAM).
#pragma omp parallel for collapse(2) reduction(vsum : vec)
for (unsigned int i {0}; i<vec.size(); ++i){
    for (int j = 0; j < n; ++j) {
        vec[i] += j;
    }
}
where vsum is a user-defined reduction, i.e.
#pragma omp declare reduction(vsum : std::vector<int> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<int>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))
Declare the reduction before the function where you use it, and you'll be good to go.
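For completeness, here is a minimal, self-contained sketch with the declaration at file scope, before the function that uses it. The sizes are just the ones from your example, so each element should end up as 0 + 1 + ... + 127 = 8128:
#include <algorithm>
#include <iostream>
#include <vector>

#pragma omp declare reduction(vsum : std::vector<int> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<int>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

int main(){
    constexpr int n { 128 };
    std::vector<int> vec (4, 0);

    // both loops are parallelized; each thread sums into its own zero-initialized copy of vec
    #pragma omp parallel for collapse(2) reduction(vsum : vec)
    for (unsigned int i = 0; i < 4; ++i)    // 4 == vec.size()
        for (int j = 0; j < n; ++j)
            vec[i] += j;

    for (unsigned int i = 0; i < vec.size(); ++i) std::cout << vec[i] << " ";   // 8128 8128 8128 8128
    std::cout << "\n";
}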
For the second example, rather than storing a pointer then always accessing the same element, just use a local variable:
int val = vec[i];
#pragma omp parallel for reduction (+:val)
for (int j=0; j<n; ++j)
    val += j;
vec[i] = val;
With the 3rd loop, I suspect the problem is that the reduction clause names a variable, but you never update that variable by that name in the loop, so there is nothing the compiler sees to reduce. Using Eigen may make the code a bit more complicated to analyze, resulting in the loop working.

OpenMP: ordered clause without index

Let's suppose that I want want to populate in parallel a std::vector object in an ordered way, like this:
std::vector<T> v;
#pragma omp parallel for ordered
for (int i=0;i<n;i++){
    T result = /* ...some expensive computation here... */;
    #pragma omp ordered
    v.push_back(result);
}
As you can see, the instruction v.push_back(result) doesn't depend on i.
My question is: will v still be populated in order according to i?

Situations faced in OpenMP on for() loops

I'm using OpenMP for this and I'm not confident of my answer. I really need your help with this: I've been wondering which method (serial or parallel) runs faster here. My #pragma directives (commented out) are shown in the code below.
Triangle Triangle::t_ID_lookup(Triangle a[], int ID, int n)
{
    Triangle res; int i;

    //#pragma omp for schedule(static) ordered
    for(i=0; i<n; i++)
    {
        if(ID==a[i].t_ID)
        {
            //#pragma omp ordered
            return (res=a[i]); // <- changed into "res = a[i]" instead of "return(...)"
        }
    }
    return res;
}
It depends on n. If n is small, then the overhead required for the OMP threads makes the OMP version slower. This can be overcome by adding an if clause: #pragma omp parallel if (n > YourThreshold)
If the a[i].t_ID values are not all unique, then you may receive different results from the same data when using OMP.
If your function does more than just a single comparison, consider adding a shared flag variable indicating that the ID was found, so that a check like if(found) continue; can be added at the beginning of the loop (see the sketch below).
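Here is a hedged sketch of that flag idea (not your original function; the threshold value is a placeholder, and the atomic/critical directives are just one way to avoid a data race on the flag):
Triangle t_ID_lookup_parallel(Triangle a[], int ID, int n)
{
    Triangle res;
    int found = 0;                                            // shared "stop searching" flag
    #pragma omp parallel for schedule(static) if(n > 10000)   // threshold is a placeholder
    for (int i = 0; i < n; i++)
    {
        int f;
        #pragma omp atomic read
        f = found;
        if (f) continue;                                      // cheap early-out once a match exists

        if (ID == a[i].t_ID)
        {
            #pragma omp critical                              // protect the shared result
            res = a[i];
            #pragma omp atomic write
            found = 1;
        }
    }
    return res;
}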
I have no experience with ordered, so if that was the crux of your question, ignore all the above and consider this answer.
Profile. In the end, there is no better answer.
If you still want a theoretical answer, then a lookup of a random ID is O(n) with a mean of n/2 comparisons, while the OMP version takes roughly n/k, where k is the number of threads/cores, not counting overhead.
For an alternative way of writing your loop, see Z Boson's answer to a different question.

Join array results in OpenMP

I am writing C++ code using OpenMP. I have a huge global array (100,000+ elements) that is modified by adding values in a for loop. Is there a way to have each thread created by OpenMP efficiently maintain its own local copy of the array and then join them after the loop? Since the number of threads is a variable, I cannot create the local copies of the array beforehand. If I use a global copy and address the race condition with a synchronization lock, the performance is terrible.
Thanks!
Edited:
Sorry for not being clear. Here's some pseudo-code that hopefully clarifies the scenario:
int* huge_array = new int[N];
memset(huge_array, 0, N*sizeof(int));

#pragma omp parallel for
for (i=0; i<n; i++)
{
    // get a value v independently
    // get a position p independently

    // I have to set a lock here
    omp_set_lock(&lock);
    huge_array[p] += v;
    omp_unset_lock(&lock);
}
Is there a way to improve the performance of the code above?
Okay, I finally understood what you want to do. Yes, you do it the same way as with pthreads.
std::vector<int> A(N,0);
std::vector<int*> local(omp_max_num_threads());
#pragma omp parallel
{
    int np = omp_get_num_threads();
    std::vector<int> localA(N);
    local[omp_get_thread_num()] = localA.data();

    // add values to local array
    #pragma omp for
    for(int i=0; i<num_values; ++i)
        localA[position()] += value();           // (1)
    // implicit barrier ensures all local copies are ready for aggregation

    // aggregate local copies into global array
    #pragma omp for
    for(int k=0; k<N; ++k)
        for(int p=0; p<np; ++p)
            A[k] += local[p][k];                 // (2)
    // implicit barrier ensures no local copy is deleted before aggregation is done
}
but it is important to do the aggregation in parallel as well.
In Walter's answer, I believe instead of
std::vector<int*> local(omp_max_num_threads());
It should be
std::vector<int*> local(omp_get_max_threads());
omp_max_num_threads() is not a routine in OpenMP.
What about using the directive
#pragma omp parallel for private(VARIABLE)
for your program?
EDIT:
For your code I would use my directive; you won't lose so much time locking and unlocking your variable...
EDIT 2:
Sorry, you cannot use my code for your problem; it only works if you first create a temporary array where you store your data temporarily...
As far as I can tell, you are essentially filling a histogram, where position is the bin of the histogram to fill and value is the weight/value that you will add to that bin. Filling a histogram in parallel is equivalent to doing an array reduction. The C++ implementation of OpenMP does not have direct support for this; however, as far as I understand, some versions of the Fortran implementation do. To do an array reduction in C++ with OpenMP, I have two suggestions.
1.) If the number of bins of the histogram (array) is much smaller than the number of values that will fill it (which is often the preferred case, since one wants reasonable statistics in each bin), then you can fill private versions of the histogram in parallel and merge them in a critical section in serial. Since the number of bins is much smaller than the number of values, this should be efficient.
2.) However, if the number of bins is large (as your example seems to imply), then it's possible to merge the private histograms in parallel as well, but this is a bit more tricky. Additionally, one needs to be careful with cache alignment and false sharing.
I showed how to do both these methods and discuss some of the cache issues in the following question:
Fill histograms (array reduction) in parallel with openmp without using a critical section.
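For convenience, here is a minimal sketch of the first approach (private histograms merged in a critical section); N, n, position() and value() are placeholders standing in for your pseudo-code:
std::vector<int> huge_array(N, 0);     // shared histogram
#pragma omp parallel
{
    std::vector<int> local(N, 0);      // per-thread private histogram, no locking needed
    #pragma omp for nowait
    for (int i = 0; i < n; i++)
        local[position(i)] += value(i);

    #pragma omp critical               // each thread merges its copy once, serially
    for (int k = 0; k < N; ++k)
        huge_array[k] += local[k];
}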

Is it possible to do a reduction on an array with openmp?

Does OpenMP natively support reduction of a variable that represents an array?
This would work something like the following...
float* a = (float*) calloc(4, sizeof(float));
omp_set_num_threads(13);
#pragma omp parallel reduction(+:a)
for(i=0;i<4;i++){
    a[i] += 1; // Thread-local copy of a incremented by something interesting
}
// a now contains [13 13 13 13]
Ideally, there would be something similar for an omp parallel for, and if you have a large enough number of threads for it to make sense, the accumulation would happen via binary tree.
Array reduction is now possible with OpenMP 4.5 for C and C++. Here's an example:
#include <iostream>

int main()
{
    int myArray[6] = {};

    #pragma omp parallel for reduction(+:myArray[:6])
    for (int i=0; i<50; ++i)
    {
        double a = 2.0; // Or something non-trivial justifying the parallelism...
        for (int n = 0; n<6; ++n)
        {
            myArray[n] += a;
        }
    }

    // Print the array elements to see them summed
    for (int n = 0; n<6; ++n)
    {
        std::cout << myArray[n] << " " << std::endl;
    }
}
Outputs:
100
100
100
100
100
100
I compiled this with GCC 6.2. You can see which common compiler versions support the OpenMP 4.5 features here: https://www.openmp.org/resources/openmp-compilers-tools/
Note from the comments above that, while this is convenient syntax, it may incur significant overhead from creating a private copy of the array section for each thread.
Only in Fortran in OpenMP 3.0, and probably only with certain compilers.
See the last example (Example 3) on:
http://wikis.sun.com/display/openmp/Fortran+Allocatable+Arrays
The latest OpenMP 4.5 spec now supports reduction of C/C++ arrays.
http://openmp.org/wp/2015/11/openmp-45-specs-released/
The latest GCC 6.1 also supports this feature.
http://openmp.org/wp/2016/05/gcc-61-released-supports-openmp-45/
But I haven't given it a try yet. I hope others can test this feature.
OpenMP cannot perform reductions on array or structure type variables (see restrictions).
You also might want to read up on the private and shared clauses. private declares a variable to be private to each thread, whereas shared declares a variable to be shared among all threads. I also found the answer to this question very useful with regards to OpenMP and arrays.
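As a small illustration of the private/shared distinction (the values here are arbitrary):
int total = 0;        // shared: one variable visible to all threads
int scratch = 0;      // private: each thread gets its own copy inside the region
#pragma omp parallel for shared(total) private(scratch)
for (int i = 0; i < 100; ++i) {
    scratch = i * i;              // safe: scratch is thread-private
    #pragma omp atomic            // total is shared, so the update must be synchronized
    total += scratch;
}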
OpenMP can perform this operation as of OpenMP 4.5, and GCC 6.3 (and possibly earlier) supports it. An example program looks as follows:
#include <vector>
#include <iostream>

int main(){
    std::vector<int> vec;

    #pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

    #pragma omp parallel for default(none) schedule(static) reduction(merge: vec)
    for(int i=0;i<100;i++)
        vec.push_back(i);

    for(const auto x: vec)
        std::cout<<x<<"\n";

    return 0;
}
Note that omp_out and omp_in are special variables, and that the type in the declare reduction must match the type of the vector you are planning to reduce over.