How to collect data from each thread in OpenMP (C++)

I'm new to OpenMP and am trying to work out how to collect data from threads. I am studying an example that applies OpenMP to the Monte Carlo method (estimating the area of a circle inscribed in a square).
I understood how the following code works:
unsigned pointsInside = 0;
#pragma omp parallel for num_threads(threadNum) shared(threadNum) reduction(+: pointsInside)
for (unsigned i = 0; i < threadNum; i++) { ... }
Am I right that pointsInside is originally a variable, but OpenMP represents it as an array, and then the mantra reduction(+: pointsInside) sums over the elements of that "array"?
But the main question is how to collect information directly into an array or vector. I tried declaring an array or vector, passing a pointer or address to it into OpenMP via shared, and having each thread store its information at its own index. But this runs slower than the version with the scalar variable and reduction. I need the vector or array approach for my current project. Thanks a lot!
UPD:
When I said above that "it works slower", I was comparing two implementations of the Monte Carlo method: 1) via shared and a vector/array, and 2) via a scalar variable and reduction. The first case is slower. My guess and my question about it are below.
Let me rephrase my question more clearly. I create a vector/array and pass it to the OpenMP region via shared. I want each thread to store its data at its own index in the vector/array. With this approach I don't need any synchronization of access to the vector/array. Is it true that OpenMP enables synchronization by default when I use shared? If so, how can I disable it, or do other approaches exist? If not, how do I share a vector/array with the parallel region correctly and without synchronized access?
I'd like to apply this technique to my project, where I want to go through different permutations in the parallel part and collect each permutation and its scalar result outside of the parallel part. Then I sort the results and choose the best one.

A partial answer:
Am I right that pointsInside is originally a variable, but OpenMP represents it as an array, and then the mantra reduction(+: pointsInside) sums over the elements of that "array"?
I think it is better to think of pointsInside as a scalar. When the parallel region starts the run-time takes care of creating individual scalars, perhaps you might think of them as myPointsInside, one such scalar for each thread. When the parallel region finishes the run-time reduces the values of all the thread scalars onto the original scalar pointsInside. This is just about what OpenMP actually does behind the scenes.
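The description above can be sketched as two functions that should return the same count: one uses the reduction clause, the other spells out a private scalar per thread and a single combine step. This is only an illustration of the idea, not what any particular OpenMP runtime literally generates. The names in_circle, count_with_reduction and count_by_hand are hypothetical, and a deterministic grid of points stands in for random sampling so the two results can be compared.

```cpp
// Hypothetical helper: is the midpoint of grid cell (i, j) of an n x n grid
// over the unit square inside the unit quarter circle?
static bool in_circle(unsigned i, unsigned j, unsigned n) {
    double x = (i + 0.5) / n, y = (j + 0.5) / n;
    return x * x + y * y <= 1.0;
}

// Version 1: OpenMP creates one private copy per thread and combines them.
unsigned count_with_reduction(unsigned n) {
    unsigned pointsInside = 0;
    #pragma omp parallel for reduction(+: pointsInside)
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < n; j++)
            if (in_circle(i, j, n)) pointsInside++;
    return pointsInside;
}

// Version 2: roughly the same thing written by hand.
unsigned count_by_hand(unsigned n) {
    unsigned pointsInside = 0;
    #pragma omp parallel
    {
        unsigned myPointsInside = 0;      // private scalar, one per thread
        #pragma omp for
        for (unsigned i = 0; i < n; i++)
            for (unsigned j = 0; j < n; j++)
                if (in_circle(i, j, n)) myPointsInside++;
        #pragma omp atomic
        pointsInside += myPointsInside;   // combine once per thread
    }
    return pointsInside;
}
```

Calling count_with_reduction(200) and count_by_hand(200) should give the same value, close to 200 * 200 * pi / 4.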
As to the rest of your question:
Yes, you can perform reductions onto arrays - but this was only added to OpenMP, for C and C++ programs, in OpenMP 4.5 (I think). What goes on is much the same as for the scalar case. This Q&A provides some assistance - Is it possible to do a reduction on an array with openmp?
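A minimal sketch of such an array reduction, assuming a compiler that supports OpenMP 4.5 array sections in the reduction clause. The function name histogram and the constant kBins are made up for the illustration; the loop body is a placeholder for real work.

```cpp
const int kBins = 4;   // hypothetical number of counters

// Counts, for each bin b, how many i in [0, n) satisfy i % kBins == b.
// hist must point to at least kBins elements.
void histogram(unsigned n, unsigned* hist) {
    for (int b = 0; b < kBins; b++) hist[b] = 0;

    // hist[:kBins] is an array section: each thread gets a private,
    // zero-initialized copy of these kBins elements, and the copies are
    // summed element-wise into hist when the loop ends.
    #pragma omp parallel for reduction(+: hist[:kBins])
    for (unsigned i = 0; i < n; i++)
        hist[i % kBins]++;
}
```

Because every thread works on its own private copy inside the loop, no atomic or critical construct is needed around the increment.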
As to the speed, it's difficult to answer that without a much clearer understanding of what comparisons you are making. But it's very easy to write parallel reductions on arrays which incur a significant penalty in performance from the phenomenon of false sharing, about which you may wish to inform yourself.
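One common way to collect a value per thread while limiting false sharing is to accumulate in a private local variable and write to the shared vector only once per thread. The sketch below assumes that pattern; collect_per_thread and the i % 3 test are placeholders, and the serial fallbacks exist only so the code also builds without OpenMP enabled. Note that the shared clause by itself only makes a variable visible to all threads; it does not add any locking.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins so the sketch still builds without -fopenmp.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Returns one partial count per thread; the caller can sum or sort them.
std::vector<unsigned> collect_per_thread(unsigned n) {
    // One slot per possible thread, indexed by thread number.
    std::vector<unsigned> perThread(omp_get_max_threads(), 0);

    #pragma omp parallel shared(perThread)
    {
        unsigned local = 0;                 // private accumulator
        #pragma omp for
        for (unsigned i = 0; i < n; i++)
            if (i % 3 == 0) local++;        // placeholder for the real work
        // A single write per thread, each to its own index: no lock needed.
        perThread[omp_get_thread_num()] = local;
    }
    return perThread;
}
```

Writing perThread[omp_get_thread_num()]++ inside the loop would also be correct, but neighbouring slots usually share a cache line, which is where the slowdown mentioned above tends to come from.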

Related

how can I get good speedup for a parallel write to memory?

I'm new to OpenMP and trying to get some very basic loops in my code parallelized with OpenMP, with good speedup on multiple cores. Here's a function in my program:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
if ((source_value < 0.0) || (std::isnan(source_value)))
return true;
#pragma omp parallel for schedule(static) default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
for (size_t value_index = 0; value_index < p_values_size; ++value_index)
((Individual *)(p_values[value_index]))->fitness_scaling_ = source_value;
return false;
}
So the goal is to set the fitnessScaling ivar of every object pointed to by pointers in the buffer that p_values points to, to the same double value source_value. Those various objects might be more or less anywhere in memory, so each write probably hits a different cache line; that's an aspect of the code that would be difficult to change, but I'm hoping that by spreading it across multiple cores I can at least divide that pain by a good speedup factor. The cast to (Individual *) is safe, by the way; checks were already done external to this function that guarantee its safety.
You can see my first attempt at parallelizing this, using the default static schedule (so each thread gets its own contiguous block in p_values), making the loop limit shared, and making p_values and source_value be firstprivate so each thread gets its own private copy of those variables, initialized to the original value. The threshold for parallelization, EIDOS_OMPMIN_SET_FITNESS_S1, is set to 900. I test this with a script that passes in a million values, with between 1 and 8 cores (and a max thread count to match), so the loop should certainly run in parallel. I have followed these same practices in some other places in the code and have seen a good speedup. [EDIT: I should say that the speedup I observe for this, for 2/4/6/8 cores/threads, is always about 1.1x-1.2x the single-threaded performance, so there's a very small win but it is realized already with 2 cores and does not get any better with 8 cores.] The notable difference with this code is that this loop spends its time writing to memory; the other loops I have successfully parallelized spend their time doing things like reading values from a buffer and summing across them, so they might be limited by memory read speeds, but not by memory write speeds.
It occurred to me that with all of this writing through a pointer, my loop might be thrashing due to things like aliasing (making the compiler force a flush of the cache after each write), or some such. I attempted to solve that kind of issue as follows, using const and __restrict:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
if ((source_value < 0.0) || (std::isnan(source_value)))
return true;
#pragma omp parallel default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
{
EidosObject * const * __restrict local_values = p_values;
#pragma omp for schedule(static)
for (size_t value_index = 0; value_index < p_values_size; ++value_index)
((Individual *)(local_values[value_index]))->fitness_scaling_ = source_value;
}
return false;
}
This made no difference to the performance, however. I still suspect that some kind of memory contention, cache thrash, or aliasing issue is preventing the code from parallelizing effectively, but I don't know how to solve it. Or maybe I'm barking up the wrong tree?
These tests are done with Xcode 13 (i.e., using Apple clang 13.0.0) on macOS, on an M1 Mac mini (2020).
[EDIT: In reply to comments below, a few points. (1) There is nothing fancy going on inside the class here, no operator= or similar; the assignment of source_value into fitness_scaling_ is, in effect, simply the assignment of a double into a field in a struct. (2) The use of firstprivate(p_values, source_value) is to ensure that repeated reading from those values across threads doesn't introduce some kind of between-thread contention that slows things down. It is recommended in Mattson, He, & Koniges' book "The OpenMP Common Core"; see section 6.3.2, figure 6.10 with the corrected Mandelbrot code using firstprivate, and the quote on p. 111: "An easy solution is to change the storage attribute for eps to firstprivate. This gives each thread its own copy of the variable but with a specified value. Notice that eps is read-only. It is not updated inside the parallel region. Therefore, another solution is to let it be shared (shared(eps)) or not specify eps in a data environment clause and let its default, shared behavior be used. While this would result in correct code, it would potentially increase overhead. If eps is shared, every thread will be reading the same address in memory... Some compilers will optimize for such read-only variables by putting them into registers, but we should not rely on that behavior." I have observed this change speeding up parallelized loops in other contexts, so I have adopted it as my standard practice in such cases; if I have misunderstood, please do let me know. (3) No, keeping the fitness_scaling_ values in their own buffer is not a workable solution for several reasons. Most importantly, this method may be called with any arbitrary buffer of pointers to Individual; it is not necessarily setting the fitness_scaling_ of all Individual objects, just an effectively random subset of them, so this operation will never be reducible to a simple memset(). 
Also, I am going to need to similarly optimize the setting of many other properties on Individual and on other classes in my code, so a general solution is needed; I can't very well put all of the ivars of all of my classes into separately allocated buffers external to the objects themselves. And third, Individual objects are being dynamically allocated and deallocated independently of each other, so an external buffer of fitness_scaling_ values for the objects would have big implementation problems.]

Efficient parallel union of sets with OpenMP

I need to calculate a global std::set (or equivalently a global std::unordered_set) in an OpenMP-parallelised program. At the moment every thread has a local std::set, and the union is later calculated from these using
#pragma omp critical //critical as std container inserting is not thread safe
global_set.insert(local_set.begin(), local_set.end());
However this creates an effectively serial section of code, where each thread inserts its local set into the global set one after the other.
How can I improve on this by parallelising the union of the sets? The union is preceded by a large block of work, is there a convenient way to give all threads different amounts of work to let the others work while one is inserting the elements in the set? Or can the union itself be efficiently parallelised (for example by unioning sets in a 'binary tree' fashion)?
You should read up on OpenMP reductions, and, in particular user-defined reductions. That lets you pass the problem to the OpenMP implementation, which will very likely perform the reduction up a tree.
Of course, whether that is beneficial is not clear; it may be that it simply introduces a lot of copying and memory allocation which is still slower than the style of code you show.
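A user-defined reduction for this case might look like the sketch below. The reduction identifier merge_sets and the function build_global_set are hypothetical names, and the loop body is a placeholder for the real work. Whether the runtime actually combines the partial sets in a tree is up to the implementation.

```cpp
#include <set>

// Declare how two partial sets are combined (omp_in is merged into omp_out)
// and how each thread's private copy starts out (empty).
#pragma omp declare reduction(merge_sets : std::set<int> : \
        omp_out.insert(omp_in.begin(), omp_in.end())) \
        initializer(omp_priv = std::set<int>())

std::set<int> build_global_set(int n) {
    std::set<int> global_set;
    #pragma omp parallel for reduction(merge_sets : global_set)
    for (int i = 0; i < n; i++) {
        // Inside the loop, global_set names the thread's private copy,
        // so no critical section is needed here.
        global_set.insert(i % 10);   // placeholder for the real work
    }
    return global_set;
}
```

With this placeholder body, build_global_set(100) should contain the ten values 0 through 9.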

Stream compaction (or Array Packing) with prefix scan using Openmp

I am using openmp to parallelize my code. I have an original array:
A=[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]
and a marks array:
M=[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]
Using array M, I can compact my original array into this packed array:
A=[3,2,-4,-3,1,-1,2]
I'd like to solve this problem using a multi-threaded approach. The 'Thrust' library for C++ solves this problem, but I am not able to find a similar tool for Fortran.
Is there a library, like 'thrust' for C++, that I can use to execute a stream compaction?
Alternatively, is there an algorithm that I can write myself using Fortran and OpenMP to solve this?
Is there a library, like 'thrust' for C++, that I can use to execute a stream compaction?
It shouldn't be that difficult to call a thrust routine from Fortran (if you're willing to write a little bit of C++ code). Furthermore, thrust can target an OMP backend instead of a GPU backend.
Alternatively, is there an algorithm that I can write myself using Fortran and OpenMP to solve this?
The basic parallel stream compaction algorithm is as follows. We will assume that there is one thread assigned per element in your data array, initially.
1. Perform a parallel prefix sum (inclusive scan) on the M array:
M =[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]
sM=[1,1,2,2,2,2,3,3,3,4,5,5,5,5,6,7]
2. Each thread will then inspect its element in the M array, and if that element is non-zero, it will copy its corresponding element in the A array to the output array (let's call it O), at the position given by its sM value:
M =[1,0,1,0,0,0, 1,0,0, 1,1,0,0,0, 1,1]
sM=[1,1,2,2,2,2, 3,3,3, 4,5,5,5,5, 6,7]
A =[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]
O =[3,  2,      -4,    -3,1,      -1,2]
If you were doing this in OMP, you would need an OMP barrier between steps 1 and 2. The work in step 2 is relatively simple and completely independent, so you could use an OMP parallel do loop and break the work up in any fashion you wish. Step 1 is more complicated, and I suggest following the outline provided in the chapter you and I linked. The OMP code there will require various barriers along the way, but it is parallelizable.
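The two steps above can be sketched in C++ with OpenMP (the same structure carries over to Fortran). Note that this is a block-wise variant, not the one-thread-per-element scan described above: each thread counts the marked elements in its own contiguous block, one thread turns those counts into offsets, and then each thread copies its block. The function name compact and the block partitioning are assumptions made for the illustration; the serial fallbacks only let the code build without OpenMP.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins so the sketch still builds without -fopenmp.
static int omp_get_max_threads() { return 1; }
static int omp_get_num_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

std::vector<int> compact(const std::vector<int>& A, const std::vector<int>& M) {
    const std::size_t n = A.size();
    // offset[t] will hold the output position where thread t starts writing.
    std::vector<std::size_t> offset(omp_get_max_threads() + 1, 0);
    std::vector<int> O;

    #pragma omp parallel
    {
        const std::size_t t = omp_get_thread_num();
        const std::size_t nt = omp_get_num_threads();
        const std::size_t lo = n * t / nt;          // this thread's block
        const std::size_t hi = n * (t + 1) / nt;

        // Step 1a: count the marked elements in this block.
        std::size_t count = 0;
        for (std::size_t i = lo; i < hi; i++)
            if (M[i]) count++;
        offset[t + 1] = count;
        #pragma omp barrier

        // Step 1b: one thread turns the counts into running totals.
        #pragma omp single
        {
            for (std::size_t k = 1; k <= nt; k++) offset[k] += offset[k - 1];
            O.resize(offset[nt]);
        }   // implicit barrier at the end of single

        // Step 2: copy marked elements to their final positions.
        std::size_t pos = offset[t];
        for (std::size_t i = lo; i < hi; i++)
            if (M[i]) O[pos++] = A[i];
    }
    return O;
}
```

For the A and M arrays from the question, compact(A, M) should return 3, 2, -4, -3, 1, -1, 2.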
As mentioned already in the comments, if this is the only piece of work you want to parallelize, I wouldn't recommend a GPU, because the cost of transferring the data to/from the GPU would probably outweigh any parallel execution time benefits you might accrue. But as I mentioned already, thrust can target an OMP realization rather than a GPU realization. It might be worth a try.
Regarding thrust from fortran, most of what you need is here. That is admittedly CUDA fortran, but the only differences should be not using the device attribute, and using thrust::host_vector instead of thrust::device_vector (at least, to get started).

Parallel tasks get better performances with boost::thread than with ppl or OpenMP

I have a C++ program which could be parallelized. I'm using Visual Studio 2010, 32bit compilation.
In short the structure of the program is the following
#define num_iterations 64 //some number
struct result
{
//some stuff
};
result best_result=initial_bad_result;
for(i=0; i<many_times; i++)
{
result *results[num_iterations];
for(j=0; j<num_iterations; j++)
{
some_computations(results+j);
}
// update best_result;
}
Since each some_computations() call is independent (some global variables are read, but none are modified), I parallelized the inner for-loop.
My first attempt was with boost::thread,
thread_group group;
for(j=0; j<num_iterations; j++)
{
group.create_thread(boost::bind(&some_computations, this, results+j));
}
group.join_all();
The results were good, but I decided to try more.
I tried the OpenMP library
#pragma omp parallel for
for(j=0; j<num_iterations; j++)
{
some_computations(results+j);
}
The results were worse than the boost::thread's ones.
Then I tried the ppl library and used parallel_for():
Concurrency::parallel_for(0,num_iterations, [=](int j) {
some_computations(results+j);
});
The results were the worst.
I found this behaviour quite surprising. Since OpenMP and ppl are designed for the parallelization, I would have expected better results, than boost::thread. Am I wrong?
Why is boost::thread giving me better results?
OpenMP or PPL do no such thing as being pessimistic. They just do as they are told; however, there are some things you should take into consideration when you try to parallelize loops.
Without seeing how you implemented these things, it's hard to say what the real cause may be.
Also, if the operations in each iteration have some dependency on any other iterations in the same loop, then this will create contention, which will slow things down. You haven't shown what your some_computations function actually does, so it's hard to tell if there are data dependencies.
A loop that can be truly parallelized has to be able to have each iteration run totally independent of all other iterations, with no shared memory being accessed in any of the iterations. So preferably, you'd write stuff to local variables and then copy at the end.
Not all loops can be parallelized, it is very dependent on the type of work being done.
For example, something that is good for parallelizing is work being done on each pixel of a screen buffer. Each pixel is totally independent from all other pixels, and therefore, a thread can take one iteration of a loop and do the work without needing to be held up waiting for shared memory or data dependencies within the loop between iterations.
Also, if you have a contiguous array, neighbouring elements may sit in the same cache line. If you are editing element 5 in thread A and then changing element 6 in thread B, you may get cache contention, which will also slow things down, because both elements reside in the same cache line. This phenomenon is known as false sharing.
There are many aspects to think about when doing loop parallelization.
In short, OpenMP is built around shared memory and adds the cost of task management and memory management. PPL is designed to handle generic patterns over common data structures and algorithms, which brings additional complexity. Both carry extra CPU overhead that your plain Boost threads do not (Boost threads are just a thin API wrapper). That is why both are slower than your Boost version. Since the computations in your example are independent of each other and need no synchronization, OpenMP should come close to the Boost version.
This holds for simple scenarios; for complicated ones, with complex data layouts and algorithms, the outcome depends on the context.
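A rough sketch of one way to keep per-thread data out of each other's cache lines is to pad each slot to a full line. The 64-byte line size is an assumption (some CPUs use 128 bytes), and PaddedCounter and count_padded are hypothetical names; the serial fallbacks only let the code build without OpenMP.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins so the sketch still builds without -fopenmp.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Assumed cache line size of 64 bytes: alignas makes sizeof(PaddedCounter)
// equal to 64, so consecutive counters are a full line apart.
struct alignas(64) PaddedCounter {
    unsigned long value = 0;
};

unsigned long count_padded(long n) {
    std::vector<PaddedCounter> counters(omp_get_max_threads());

    #pragma omp parallel
    {
        PaddedCounter& mine = counters[omp_get_thread_num()];
        #pragma omp for
        for (long i = 0; i < n; i++)
            mine.value++;   // repeated writes, but only to this thread's slot
    }

    unsigned long total = 0;
    for (const PaddedCounter& c : counters) total += c.value;
    return total;
}
```

Replacing PaddedCounter with a plain unsigned long packs several threads' counters into one cache line, which is the situation described above.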

Proper use of "atomic directive" to lock STL container

I have a large number of sets of integers, which I have, in turn, put into a vector of pointers. I need to be able to update these sets of integers in parallel without causing a race condition. More specifically. I am using OpenMP's "parallel for" construct.
For dealing with shared resources, OpenMP offers a handy "atomic directive," which allows one to avoid a race condition on a specific piece of memory without using locks. It would be convenient if I could use the "atomic directive" to prevent simultaneous updating to my integer sets, however, I'm not sure whether this is possible.
Basically, I want to know whether the following code could lead to a race condition
vector< set<int>* > membershipDirectory(numSets, new set<int>);
#pragma omp for schedule(guided,expandChunksize)
for(int i=0; i<100; i++)
{
set<int>* sp = membershipDirectory[rand()];
#pragma omp atomic
sp->insert(45);
}
Note that I use a random integer for the index, because in my application, any thread might access any index (there is a random element in my larger application, but I need not go into details).
I have seen a similar example of this for incrementing an integer, but I'm not sure whether it works when working with a pointer to a container as in my case.
After searching around, I found the OpenMP C and C++ API manual on openmp.org, and in section 2.6.4, the limitations of the atomic construct are described.
Basically, the atomic directive can only be used with the following operators:
Unary:
++, -- (prefix and postfix)
Binary:
+,-,*,/,^,&,|,<<,>>
So I will just use locks!
(In some situations critical sections might be preferable, but in my case locks will provide fine grained access to the shared resource, yielding better performance than a critical section.)
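A minimal sketch of the lock-per-set idea using the OpenMP lock routines. The function fill_sets, the use of value-typed sets instead of pointers, and the (i * 7) % numSets index (a deterministic stand-in for the random index in the question) are all assumptions made for the illustration; the serial fallbacks only let the code build without OpenMP.

```cpp
#include <set>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins so the sketch still builds without -fopenmp.
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t*) {}
static void omp_set_lock(omp_lock_t*) {}
static void omp_unset_lock(omp_lock_t*) {}
static void omp_destroy_lock(omp_lock_t*) {}
#endif

std::vector<std::set<int> > fill_sets(int numSets, int numInserts) {
    std::vector<std::set<int> > sets(numSets);
    std::vector<omp_lock_t> locks(numSets);      // one lock per set
    for (int s = 0; s < numSets; s++) omp_init_lock(&locks[s]);

    #pragma omp parallel for
    for (int i = 0; i < numInserts; i++) {
        int idx = (i * 7) % numSets;   // stand-in for the random index
        omp_set_lock(&locks[idx]);     // blocks only threads using set idx
        sets[idx].insert(i);
        omp_unset_lock(&locks[idx]);
    }

    for (int s = 0; s < numSets; s++) omp_destroy_lock(&locks[s]);
    return sets;
}
```

Threads that touch different sets do not wait for each other, whereas a single unnamed critical section would serialize every insert.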
You should not use atomic where the expression is a function call; it only applies to simple expressions (possibly with built-ins such as power or square root).
Instead, use a critical section (either named or default).
Your code is not clear. Assuming that membershipDirectory[5] is actually membershipDirectory[i], the atomic directive is not needed. With two processors, for example, OpenMP creates two threads: one handles i = 0-49 and the other i = 50-99. In this case, there is no need to protect membershipDirectory[i]. The atomic directive is required to protect a common resource that does not depend on the loop index, for example a total sum.