I have the following piece of code.
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
where offset is an array of size n containing values in the range [0, m) and cnt is an array of size m initialized to 0. I use OpenMP to parallelize it as follows.
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
According to the discussion in this post, if offset[i1] == offset[i2] for i1 != i2, the above piece of code may result in incorrect cnt. What can I do to avoid this?
This code:
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
contains a race-condition during the updates of the array cnt, to solve it you need to guarantee mutual exclusion of those updates. That can be achieved with (for instance) #pragma omp atomic update but as already pointed out in the comments:
However, this resolves just correctness and may be terribly
inefficient due to heavy cache contention and synchronization needs
(including false sharing). The only solution then is to have each
thread its private copy of cnt and reduce these copies at the end.
The alternative solution is to have a private array per thread, and at end of the parallel region you perform the manual reduction of all those arrays into one. An example of such approach can be found here.
Fortunately, with OpenMP 4.5 you can reduce arrays using a dedicate pragma, namely:
#pragma omp parallel for reduction(+:cnt)
You can have look at this example on how to apply that feature.
Worth mentioning that regarding the reduction of arrays versus the atomic approach as kindly point out by #Jérôme Richard:
Note that this is fast only if the array is not huge (the atomic based
solution could be faster in this specific case regarding the platform
and if the values are not conflicting). So that is m << n. –
As always profiling is the key!; Hence, you should test your code with aforementioned approaches to find out which one is the most efficient.
Related
I have a nested loop, with few outer, and many inner iterations. In the inner loop, I need to calculate a sum, so I want to use an OpenMP reduction. The outer loop is on a container, so the reduction is supposed to happen on an element of that container.
Here's a minimal contrived example:
#include <omp.h>
#include <vector>
#include <iostream>
int main(){
constexpr int n { 128 };
std::vector<int> vec (4, 0);
for (unsigned int i {0}; i<vec.size(); ++i){
/* this does not work */
//#pragma omp parallel for reduction (+:vec[i])
//for (int j=0; j<n; ++j)
// vec[i] +=j;
/* this works */
int* val { &vec[0] };
#pragma omp parallel for reduction (+:val[i])
for (int j=0; j<n; ++j)
val[i] +=j;
/* this is allowed, but looks very wrong. Produces wrong results
* for std::vector, but on an Eigen type, it worked. */
#pragma omp parallel for reduction (+:val[i])
for (int j=0; j<n; ++j)
vec[i] +=j;
}
for (unsigned int i=0; i<vec.size(); ++i) std::cout << vec[i] << " ";
std::cout << "\n";
return 0;
}
The problem is, that if I write the reduction clause as (+:vec[i]), I get the error ‘vec’ does not have pointer or array type, which is descriptive enough to find a workaround. However, that means I have to introduce a new variable and somewhat change the code logic, and I find it less obvious to see what the code is supposed to do.
My main question is, whether there is a better/cleaner/more standard way to write a reduction for container elements.
I'd also like to know why and how the third way shown in the code above somewhat works. I'm actually working with the Eigen library, on whose containers that variant seems to work just fine (haven't extensively tested it though), but on std::vector, it produces results somewhere between zero and the actual result (8128). I thought it should work, because vec[i] and val[i] should both evaluate to dereferencing the same address. But alas, apparently not.
I'm using OpenMP 4.5 and gcc 9.3.0.
I'll answer your question in three parts:
1. What is the best way to perform to OpenMP reductions in your example above with a std::vec ?
i) Use your approach, i.e. create a pointer int* val { &vec[0] };
ii) Declare a new shared variable like #1201ProgramAlarm answered.
iii) declare a user defined reduction (which is not really applicable in your simple case, but see 3. below for a more efficient pattern).
2. Why doesn't the third loop work and why does it work with Eigen ?
Like the previous answer states you are telling OpenMP to perform a reduction sum on a memory address X, but you are performing additions on memory address Y, which means that the reduction declaration is ignored and your addition is subjected to the usual thread race conditions.
You don't really provide much detail into your Eigen venture, but here are some possible explanations:
i) You're not really using multiple threads (check n = Eigen::nbThreads( ))
ii) You didn't disable Eigen's own parallelism which can disrupt your own usage of OpenMP, e.g. EIGEN_DONT_PARALLELIZE compiler directive.
iii) The race condition is there, but you're not seeing it because Eigen operations take longer, you're using a low number of threads and only writing a low number of values => lower occurrence of threads interfering with each other to produce the wrong result.
3. How should I parallelize this scenario using OpenMP (technically not a question you asked explicitly) ?
Instead of parallelizing only the inner loop, you should parallelize both at the same time. The less serial code you have, the better. In this scenario each thread has its own private copy of the vec vector, which gets reduced after all the elements have been summed by their respective thread. This solution is optimal for your presented example, but might run into RAM problems if you're using a very large vector and very many threads (or have very limited RAM).
#pragma omp parallel for collapse(2) reduction(vsum : vec)
for (unsigned int i {0}; i<vec.size(); ++i){
for (int j = 0; j < n; ++j) {
vec[i] += j;
}
}
where vsum is a user defined reduction, i.e.
#pragma omp declare reduction(vsum : std::vector<int> : std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<int>())) initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))
Declare the reduction before the function where you use it, and you'll be good to go
For the second example, rather than storing a pointer then always accessing the same element, just use a local variable:
int val = vec[i];
#pragma omp parallel for reduction (+:val)
for (int j=0; j<n; ++j)
val +=j;
vec[i] = val;
With the 3rd loop, I suspect that the problem is because the reduction clause names a variable, but you never update that variable by that name in the loop so there is nothing that the compiler sees to reduce. Using Eigen may make the code a bit more complicate to analyze, resulting in the loop working.
I'm using OpenMP for this and I'm not confident of my answer as well. Really need your help in this. I've been wondering which method (serial or parallel) is faster in run speed in this. My #pragma commands (set into comments) are shown below.
Triangle Triangle::t_ID_lookup(Triangle a[], int ID, int n)
{
Triangle res; int i;
//#pragma omp for schedule(static) ordered
for(i=0; i<n; i++)
{
if(ID==a[i].t_ID)
{
//#pragma omp ordered
return (res=a[i]); // <-changed into "res = a[i]" instead of "return(...)"
}
}
return res;
}
It depends on n. If n is small, then the overhead required for the OMP threads makes the OMP version slower. This can be overcome by adding an if clause: #pragma omp parallel if (n > YourThreshhold)
If all a[i].t_ID are not unique, then you may receive different results from the same data when using OMP.
If you have more in your function than just a single comparison, consider adding a shared flag variable that would indicate that it was found so that a comparison if(found) continue; can be added at the beginning of the loop.
I have no experience with ordered, so if that was the crux of your question, ignore all the above and consider this answer.
Profile. In the end, there is no better answer.
If you still want a theoretical answer, then a random lookup would be O(n) with a mean of n/2 while the OMP version would be a constant n/k where k is the number of threads/cores not including overhead.
For an alternative way of writing your loop, see Z Boson's answer to a different question.
Performance wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for( ; i < n; i++){
...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for(int i = 0; i < n; i++){
...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for(i = 0; i < n; i++){
...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low level explanation and how OpenMP handles this.
For example, if parallelizing for a large number of threads I assume the second implementation would be more efficient, since initializing a variable using xor is far more efficient than copying the variable to all the threads
There is not much of a difference in terms of performance among the 3 versions you presented, since each one of them is using #pragma omp parallel for. Hence, OpenMP will automatically assign each for iteration to different threads. Thus, variable i will became private to each thread, and each thread will have a different range of for iterations to work with. The variable 'i' was automatically set to private in order to avoid race conditions when updating this variable. Since, the variable 'i' will be private on the parallel for anyway, there is no need to put private(i) on the #pragma omp parallel for.
Nevertheless, your first version will produce an error since OpenMP is expecting that the loop right underneath of #pragma omp parallel for have the following format:
for(init-expr; test-expr;incr-expr)
inorder to precompute the range of work.
The for directive places restrictions on the structure of all
associated for-loops. Specifically, all associated for-loops must
have the following canonical form:
for (init-expr; test-expr;incr-expr) structured-block (OpenMP Application Program Interface pag. 39/40.)
Edit: I tested your two last versions, and inspected the generated assembly. Both version produce the same assembly, as you can see -> version 2 and version 3.
I need to parallelize this loop, I though that to use was a good idea, but I never studied them before.
#pragma omp parallel for
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it){
worst_q = std::min(worst_q, mesh->element_quality(*it));
}
In this case the loop is not parallelized because it uses iterator and the compiler cannot
understand how to slit it.
Can You help me?
OpenMP requires that the controlling predicate in parallel for loops has one of the following relational operators: <, <=, > or >=. Only random access iterators provide these operators and hence OpenMP parallel loops work only with containers that provide random access iterators. std::set provides only bidirectional iterators. You may overcome that limitation using explicit tasks. Reduction can be performed by first partially reducing over private to each thread variables followed by a global reduction over the partial values.
double *t_worst_q;
// Cache size on x86/x64 in number of t_worst_q[] elements
const int cb = 64 / sizeof(*t_worst_q);
#pragma omp parallel
{
#pragma omp single
{
t_worst_q = new double[omp_get_num_threads() * cb];
for (int i = 0; i < omp_get_num_threads(); i++)
t_worst_q[i * cb] = worst_q;
}
// Perform partial min reduction using tasks
#pragma omp single
{
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it) {
size_t elem = *it;
#pragma omp task
{
int tid = omp_get_thread_num();
t_worst_q[tid * cb] = std::min(t_worst_q[tid * cb],
mesh->element_quality(elem));
}
}
}
// Perform global reduction
#pragma omp critical
{
int tid = omp_get_thread_num();
worst_q = std::min(worst_q, t_worst_q[tid * cb]);
}
}
delete [] t_worst_q;
(I assume that mesh->element_quality() returns double)
Some key points:
The loop is executed serially by one thread only, but each iteration creates a new task. These are most likely queued for execution by the idle threads.
Idle threads waiting at the implicit barrier of the single construct begin consuming tasks as soon as they are created.
The value pointed by it is dereferenced before the task body. If dereferenced inside the task body, it would be firstprivate and a copy of the iterator would be created for each task (i.e. on each iteration). This is not what you want.
Each thread performs partial reduction in its private part of the t_worst_q[].
In order to prevent performance degradation due to false sharing, the elements of t_worst_q[] that each thread accesses are spaced out so to end up in separate cache lines. On x86/x64 the cache line is 64 bytes, therefore the thread number is multiplied by cb = 64 / sizeof(double).
The global min reduction is performed inside a critical construct to protect worst_q from being accessed by several threads at once. This is for illustrative purposes only since the reduction could also be performed by a loop in the main thread after the parallel region.
Note that explicit tasks require compiler which supports OpenMP 3.0 or 3.1. This rules out all versions of Microsoft C/C++ Compiler (it only supports OpenMP 2.0).
Random-Access Container
The simplest solution is to just throw everything into a random-access container (like std::vector) and use the index-based loops that are favoured by OpenMP:
// Copy elements
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, complexCalc(neListVector[i]));
}
Apart from being incredibly simple, in your situation (tiny elements of type size_t that can easily be copied) this is also the solution with the best performance and scalability.
Avoiding copies
However, in a different situation than yours you may have elements that aren't copied as easily (larger elements) or cannot be copied at all. In this case you can just throw the corresponding pointers in a random-access container:
// Collect pointers
std::vector<const nonCopiableObjectType *> neListVector;
for (const auto &entry : mesh->NEList[vid]) {
neListVector.push_back(&entry);
}
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, mesh->element_quality(*neListVector[i]));
}
This is slightly more complex than the first solution, still has the same good performance on small elements and increased performance on larger elements.
Tasks and Dynamic Scheduling
Since someone else brought up OpenMP Tasks in his answer, I want to comment on that to. Tasks are a very powerful construct, but they have a huge overhead (that even increases with the number of threads) and in this case just make things more complex.
For the min reduction the use of Tasks is never justified because the creation of a Task in the main thread costs much more than just doing the std::min itself!
For the more complex operation mesh->element_quality you might think that the dynamic nature of Tasks can help you with load-balancing problems, in case that the execution time of mesh->element_quality varies greatly between iterations and you don't have enough iterations to even it out. But even in that case, there is a simpler solution: Simply use dynamic scheduling by adding the schedule(dynamic) directive to your parallel for line in one of my previous solutions. It achieves the same behaviour which far less overhead.
With OpenMP 3.1, it is possible to have a reduction clause with min:
double m;
#pragma omp parallel for reduction(min:m)
for (int i=0;i< n; i++){
if (a[i]*2 < m) {
m = a[i] * 2;
}
return m;
Suppose I also need the index of the minimal element; is there a way to use the reduction clause for this? I believe the alternative is writing the reduction manually using nowait and critical.
Suppose I also need the index of the minimal element; is there a way to use the reduction clause for this?
Unfortunately, no. the list of possible reductions in OpenMP is very … small. In particular, min and max are the only “higher-level” functions and they aren’t customisable. At all.
I have to admit that I don’t like OpenMP’s approach to reductions, precisely because it’s not extensible in the slightest, it is designed only to work on special cases. Granted, those are interesting special cases but it’s still fundamentally a bad approach.
For such operations, you need to implement the reduction yourself by accumulating thread-local results into thread-local variables and combining them at the end.
The easiest way of doing this (and indeed quite close to how OpenMP implements reductions) is to have an array with elements for each thread, and using omp_get_thread_num() to access an element. Note however that this will lead to a performance degradation due to false sharing if the elements in the array share a cache line. To mitigate this, pad the array:
struct min_element_t {
double min_val;
size_t min_index;
};
size_t const CACHE_LINE_SIZE = 1024; // for example.
std::vector<min_element_t> mins(threadnum * CACHE_LINE_SIZE);
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
size_t const index = omp_get_thread_num() * CACHE_LINE_SIZE;
// operate on mins[index] …
}