Stream compaction (or array packing) with prefix scan using OpenMP - Fortran

I am using OpenMP to parallelize my code. I have an original array:
A=[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]
and a marks array:
M=[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]
Using array M I can compact my original array into this packed array:
A=[3,2,-4,-3,1,-1,2]
I'd like to solve this problem using a multi-threaded approach. The 'Thrust' library for C++ solves this problem, but I am not able to find a similar tool for Fortran.
Is there a library, like 'Thrust' for C++, that I can use to execute a stream compaction?
Alternatively, is there an algorithm that I can write myself using Fortran and OpenMP to solve this?

Is there a library, like 'Thrust' for C++, that I can use to execute a stream compaction?
It shouldn't be that difficult to call a thrust routine from Fortran (if you're willing to write a little bit of C++ code). Furthermore, thrust can target an OMP backend instead of a GPU backend.
Alternatively, is there an algorithm that I can write myself using Fortran and OpenMP to solve this?
The basic parallel stream compaction algorithm is as follows. We will assume that there is one thread assigned per element in your data array, initially.
Step 1: Perform a parallel prefix sum (inclusive scan) on the M array:
M=[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]
sM=[1,1,2,2,2,2,3,3,3,4,5,5,5,5,6,7]
Step 2: Each thread then inspects its element in the M array and, if that element is non-zero, copies its corresponding element in the A array to the output array (let's call it O):
M=[1,0,1,0,0,0, 1,0,0, 1,1,0,0,0, 1,1]
sM=[1,1,2,2,2,2, 3,3,3, 4,5,5,5,5, 6,7]
A=[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]
O=[3, 2, -4, -3,1, -1,2]
If you do this in OMP, you will need an OMP barrier between steps 1 and 2. The work in step 2 is relatively simple and completely independent, so you could use an OMP parallel do loop and break the work up in any fashion you wish. Step 1 will be more complicated, and I suggest following the outline provided in the chapter you and I linked. The OMP code there will require various barriers along the way, but it is parallelizable.
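Here is a rough C++/OpenMP sketch of both steps, since that is often easier to translate than to describe. The blocked scan, the 64-thread cap, and all variable names are my own simplifications rather than anything from the question; the same structure maps directly onto a Fortran OMP PARALLEL region with DO loops and BARRIERs.
#include <cstdio>
#include <omp.h>

int main() {
    const int N = 16;
    int A[N] = {3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2};
    int M[N] = {1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1};
    int sM[N], O[N];
    int block_sum[64 + 1] = {0};          // assumes at most 64 threads
    int nthreads = 1;

    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads(); // implicit barrier after single

        int t  = omp_get_thread_num();
        int lo = t * N / nthreads, hi = (t + 1) * N / nthreads;

        // Step 1a: each thread scans its own contiguous block of M
        int run = 0;
        for (int i = lo; i < hi; ++i) { run += M[i]; sM[i] = run; }
        block_sum[t + 1] = run;
        #pragma omp barrier

        // Step 1b: one thread turns the per-block totals into running offsets
        #pragma omp single
        for (int b = 1; b <= nthreads; ++b) block_sum[b] += block_sum[b - 1];

        // Step 1c: add the offset of the preceding blocks -> inclusive scan sM
        for (int i = lo; i < hi; ++i) sM[i] += block_sum[t];
        #pragma omp barrier

        // Step 2: scatter -- sM[i] is the 1-based output position of A[i]
        #pragma omp for
        for (int i = 0; i < N; ++i)
            if (M[i]) O[sM[i] - 1] = A[i];
    }

    for (int i = 0; i < block_sum[nthreads]; ++i) std::printf("%d ", O[i]);
    std::printf("\n");                    // expected: 3 2 -4 -3 1 -1 2
    return 0;
}
With 4 threads on the example data, the per-block scans plus the block offsets reproduce exactly the sM shown above, and the scatter writes the 7 surviving elements.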
As mentioned already in the comments, if this is the only piece of work you want to parallelize, I wouldn't recommend a GPU, because the cost of transferring the data to/from the GPU would probably outweigh any parallel execution time benefits you might accrue. But as I mentioned already, thrust can target an OMP realization rather than a GPU realization. It might be worth a try.
Regarding thrust from Fortran, most of what you need is here. That is admittedly CUDA Fortran, but the only differences should be not using the device attribute, and using thrust::host_vector instead of thrust::device_vector (at least, to get started).
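To make the thrust route concrete, here is a small C++ sketch using thrust::copy_if with the stencil overload on host vectors. The backend-selection macro in the comment is how I understand thrust's configuration; double-check the thrust documentation for your version, and note that calling this from Fortran would still need a thin wrapper via ISO_C_BINDING (not shown).
// I believe building with -fopenmp and -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_OMP
// makes the host algorithms run on thrust's OpenMP backend; check your thrust docs.
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <iostream>

struct nonzero {
    bool operator()(int m) const { return m != 0; }
};

int main() {
    int a[] = {3, 5, 2, 5, 7, 9, -4, 6, 7, -3, 1, 7, 6, 8, -1, 2};
    int m[] = {1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1};
    thrust::host_vector<int> A(a, a + 16), M(m, m + 16), O(16);

    // copy_if with a stencil: keep A(i) wherever M(i) is non-zero;
    // internally this is a scan plus scatter, much like the two steps above
    thrust::host_vector<int>::iterator end =
        thrust::copy_if(A.begin(), A.end(), M.begin(), O.begin(), nonzero());
    O.resize(end - O.begin());

    for (int i = 0; i < (int)O.size(); ++i)
        std::cout << O[i] << ' ';          // prints 3 2 -4 -3 1 -1 2
    std::cout << std::endl;
    return 0;
}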

Related

Efficient parallel union of sets with OpenMP

I need to calculate a global std::set (or equivalently a global std::unordered_set) in an OpenMP-parallelised program. At the moment every thread has a local std::set, and the union of these is later calculated using
#pragma omp critical //critical as std container inserting is not thread safe
global_set.insert(local_set.begin(), local_set.end());
However this creates an effectively serial section of code, where each thread inserts its local set into the global set one after the other.
How can I improve on this by parallelising the union of the sets? The union is preceded by a large block of work, is there a convenient way to give all threads different amounts of work to let the others work while one is inserting the elements in the set? Or can the union itself be efficiently parallelised (for example by unioning sets in a 'binary tree' fashion)?
You should read up on OpenMP reductions and, in particular, user-defined reductions. That lets you pass the problem to the OpenMP implementation, which will very likely perform the reduction up a tree.
Of course, whether that is beneficial is not clear; it may be that it simply introduces a lot of copying and memory allocation which is still slower than the style of code you show.
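For what it's worth, a minimal sketch of such a user-defined reduction for std::set<int> might look like this (requires OpenMP 4.0 or later; the set_union identifier and the toy loop body are mine, not from the question):
#include <set>
#include <cstdio>

// Combiner merges one thread's private set into another; each private copy
// starts out empty.
#pragma omp declare reduction(set_union : std::set<int> :                  \
    omp_out.insert(omp_in.begin(), omp_in.end()))                          \
    initializer(omp_priv = std::set<int>())

int main() {
    std::set<int> global_set;
    #pragma omp parallel for reduction(set_union : global_set)
    for (int i = 0; i < 1000; ++i) {
        global_set.insert(i % 97);   // each thread inserts into its private copy
    }
    std::printf("%zu unique values\n", global_set.size());
    return 0;
}
Each thread accumulates into its own private set and the implementation combines them at the end, quite possibly pairwise up a tree as described above.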

How to collect data for each thread OpenMP

I'm new to OpenMP and am trying to sort out the issue of collecting data from threads. I am studying the example of applying OpenMP to the Monte-Carlo method (the area of a circle inscribed in a square).
I understood how the following code works:
unsigned pointsInside = 0;
#pragma omp parallel for num_threads(threadNum) shared(threadNum) reduction(+: pointsInside)
for (unsigned i = 0; i < threadNum; i++) { ... }
Am I right that pointsInside is originally a single variable, but OpenMP represents it as an array, and then the mantra reduction(+: pointsInside) sums over the elements of that "array"?
But the main question is how to collect information directly into an array or vector. I tried to declare an array or vector and pass its pointer or address into OpenMP via shared, collecting information for each thread at the corresponding index. But it works slower than the way with the variable and reduction. Such an approach with a vector or array is needed for my current project. Thanks a lot!
UPD:
When I said above that "it works slower" I meant a comparison of two realizations of the Monte-Carlo method: 1) via shared and a vector/array, and 2) via a scalar variable and reduction. The first case is faster. My guess and question about it are below.
I would like to rephrase my question more clearly. I create a vector/array and pass it into OpenMP via shared. I want to collect data for each thread at the corresponding index in the vector/array. Under this approach I don't need any synchronization of access to the vector/array. Is it true that OpenMP enables synchronization by default when I use shared? If so, how do I disable it? Or maybe another approach exists? If it is not so, then how do I share a vector/array into the parallel part correctly and without synchronization of access?
I'd like to apply this technique to my project, where I want to run through different permutations in the parallel part, collect each permutation and a scalar result for it outside of the parallel part, then sort the results and choose the best one.
A partial answer:
Am I right that originally pointsInside is a variable but OpenMP represents it as an array and than the mantra reduction(+: pointsInside) sums over the elements of the "array"?
I think it is better to think of pointsInside as a scalar. When the parallel region starts the run-time takes care of creating individual scalars, perhaps you might think of them as myPointsInside, one such scalar for each thread. When the parallel region finishes the run-time reduces the values of all the thread scalars onto the original scalar pointsInside. This is just about what OpenMP actually does behind the scenes.
As to the rest of your question:
Yes, you can perform reductions onto arrays - but this was only added to OpenMP, for C and C++ programs, in OpenMP 4.5 (I think). What goes on is much the same as for the scalar case. This Q&A provides some assistance - Is it possible to do a reduction on an array with openmp?
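A minimal sketch of such an array reduction, assuming an OpenMP 4.5 compiler (the histogram-style loop is only an illustration, not your Monte-Carlo code):
#include <cstdio>

int main() {
    int counts[4] = {0, 0, 0, 0};

    // Each thread gets a private copy of counts, initialized to zero,
    // and the copies are summed element-wise at the end of the loop.
    #pragma omp parallel for reduction(+ : counts[0:4])
    for (int i = 0; i < 1000; ++i) {
        counts[i % 4] += 1;
    }

    std::printf("%d %d %d %d\n", counts[0], counts[1], counts[2], counts[3]);
    return 0;
}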
As to the speed, it's difficult to answer that without a much clearer understanding of what comparisons you are making. But it's very easy to write parallel reductions on arrays which incur a significant penalty in performance from the phenomenon of false sharing, about which you may wish to inform yourself.
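And, to illustrate the "one slot per thread" pattern you describe without running into false sharing, here is a hedged sketch in which each slot is padded out to a cache line; the 64-byte line size, the padded_slot name, the 64-thread cap, and the toy workload are all my own assumptions:
#include <cstdio>
#include <omp.h>

// One slot per thread, padded so adjacent slots sit on different cache lines.
// 64 bytes is a typical line size; adjust for your hardware.
struct alignas(64) padded_slot {
    unsigned value = 0;
};

static padded_slot slots[64];                 // assumes at most 64 threads

int main() {
    #pragma omp parallel
    {
        int t = omp_get_thread_num();         // each thread writes only its own slot
        #pragma omp for
        for (int i = 0; i < 10000000; ++i)
            slots[t].value += 1;              // no synchronization, no false sharing
    }

    unsigned total = 0;
    for (const padded_slot& s : slots)        // combine the per-thread slots afterwards
        total += s.value;
    std::printf("%u\n", total);
    return 0;
}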

Using OpenMP in C++ with Timsort Algorithm

I've been looking for a way to implement Timsort for C++ (implementation found on GitHub) with multithreading, and I've tried using OpenMP in this process.
I'm sure I'm using the correct compiler flags, but whenever I try to use Timsort as I do below:
#pragma omp parallel shared(DataVector)
{
gfx::timsort(DataVector.begin(), DataVector.end(), comp_1);
}
Note: the data being sorted is a vector containing strings of individual words, and I'm using my own comparator.
It seems to sort in the same amount of time that it takes to run without using OpenMP. Using the appropriate includes for chrono and such, I timed values that were within .01 seconds of each other on average, hovering around 1.24 seconds for my sort.
Is there a reason the threading doesn't seem to work with my sorting method, or is it a problem with the way I'm implementing OpenMP?
Note on purpose: I have been using __gnu_parallel::sort as well, with better results, but I'm looking to compare these methods in practice myself.
omp parallel needs to see the loop it is going to parallelize. The way you've declared it, OpenMP just executes that single block of code redundantly on every thread, which does not give any benefit.
Check your docs on omp parallel usage.
To parallelize a for loop you need to use omp parallel for with the for statement following. The way you have it now, it will run your entire timsort once on every core you have.
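One way to actually get a speedup out of a serial sort like timsort - not something spelled out in either answer here - is to sort disjoint chunks in parallel and then merge them. A hedged sketch, where the chunk bounds, the "timsort.hpp" header path, std::less standing in for your comp_1, and the std::inplace_merge step are all my own assumptions:
#include <vector>
#include <string>
#include <algorithm>
#include <functional>
#include <cstddef>
#include "timsort.hpp"   // header name in the gfx/cpp-TimSort repository; adjust to your copy

void parallel_timsort(std::vector<std::string>& data, int nchunks) {
    std::vector<std::size_t> bounds(nchunks + 1);
    for (int c = 0; c <= nchunks; ++c)
        bounds[c] = data.size() * c / nchunks;

    // Each chunk is disjoint, so the sorts are fully independent.
    #pragma omp parallel for
    for (int c = 0; c < nchunks; ++c)
        gfx::timsort(data.begin() + bounds[c], data.begin() + bounds[c + 1],
                     std::less<std::string>());

    // Merge the sorted chunks serially (this part could also be tree-structured).
    for (int c = 1; c < nchunks; ++c)
        std::inplace_merge(data.begin(), data.begin() + bounds[c],
                           data.begin() + bounds[c + 1], std::less<std::string>());
}
Called as, say, parallel_timsort(DataVector, omp_get_max_threads()), this should actually spread the sorting work across cores on a large enough vector, unlike wrapping the whole sort in a parallel region.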
OpenMP is not as smart as you think...
If you want gfx::timsort itself to run in parallel, you can't do it from the outside...
You would have to add code like this inside the gfx::timsort function:
#pragma omp parallel for
for(int i=0;i<num;i++)
...
Besides, shared is just a data-sharing clause: it marks a variable as shared between all the threads; it does not make anything run in parallel by itself.

Parallel tasks get better performances with boost::thread than with ppl or OpenMP

I have a C++ program which could be parallelized. I'm using Visual Studio 2010, 32bit compilation.
In short the structure of the program is the following
#define num_iterations 64 //some number
struct result
{
    //some stuff
};
result best_result = initial_bad_result;
for(i = 0; i < many_times; i++)
{
    result *results[num_iterations];
    for(j = 0; j < num_iterations; j++)
    {
        some_computations(results + j);
    }
    // update best_result;
}
Since each some_computations() is independent (some global variables are read, but no global variables are modified) I parallelized the inner for-loop.
My first attempt was with boost::thread,
thread_group group;
for(j = 0; j < num_iterations; j++)
{
    group.create_thread(boost::bind(&some_computations, this, results + j));
}
group.join_all();
The results were good, but I decided to try more.
I tried the OpenMP library
#pragma omp parallel for
for(j = 0; j < num_iterations; j++)
{
    some_computations(results + j);
}
The results were worse than the boost::thread's ones.
Then I tried the ppl library and used parallel_for():
Concurrency::parallel_for(0, num_iterations, [=](int j) {
    some_computations(results + j);
});
The results were the worst.
I found this behaviour quite surprising. Since OpenMP and PPL are designed for parallelization, I would have expected better results than with boost::thread. Am I wrong?
Why is boost::thread giving me better results?
OpenMP and PPL do no such thing as being pessimistic. They just do as they are told; however, there are some things you should take into consideration when you try to parallelize loops.
Without seeing how you implemented these things, it's hard to say what the real cause may be.
Also, if the operations in each iteration have some dependency on any other iteration of the same loop, this will create contention, which will slow things down. You haven't shown what your some_computations function actually does, so it's hard to tell whether there are data dependencies.
A loop that can be truly parallelized has to be able to have each iteration run totally independent of all other iterations, with no shared memory being accessed in any of the iterations. So preferably, you'd write stuff to local variables and then copy at the end.
Not all loops can be parallelized, it is very dependent on the type of work being done.
For example, something that is good for parallelizing is work being done on each pixel of a screen buffer. Each pixel is totally independent from all other pixels, and therefore, a thread can take one iteration of a loop and do the work without needing to be held up waiting for shared memory or data dependencies within the loop between iterations.
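A toy version of that per-pixel case (the function name and the brightness tweak are mine, not from the answer) - every iteration writes only its own element, so the loop parallelizes with no synchronization at all:
#include <vector>

// Each iteration touches exactly one pixel and nothing else, so the loop
// body is fully independent across iterations.
void brighten(std::vector<unsigned char>& pixels, int delta) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(pixels.size()); ++i) {
        int v = pixels[i] + delta;
        pixels[i] = static_cast<unsigned char>(v > 255 ? 255 : (v < 0 ? 0 : v));
    }
}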
Also, if you have a contiguous array, parts of that array will share cache lines, and if you are editing element 5 in thread A while changing element 6 in thread B, you may get cache contention, which will also slow things down, as these elements would be residing in the same cache line - a phenomenon known as false sharing.
There are many aspects to think about when parallelizing loops.
In short, OpenMP is mainly based on shared memory, with the additional costs of tasking management and memory management. PPL is designed to handle generic patterns of common data structures and algorithms, and it brings additional complexity cost. Both of them have extra CPU cost, but your plain boost::thread version does not (boost::thread is just a thin API wrapper). That's why both of them are slower than your boost version. And, since the example computation is independent for each iteration, with no synchronization, OpenMP should be close to the boost version.
That holds in simple scenarios; for complicated scenarios, with complicated data layouts and algorithms, the outcome is context dependent.

Elegant (and typical) workaround for OpenMP reduction on complex variable in C++?

I realize that reduction is only usable for POD types in C++. What would you do to implement a reduction for a complex type accumulator?
complex<double> x(0.0,0.0), y(1.0,1.0);
#pragma omp parallel for reduction(+:x)
for(int i=0; i<5; i++)
{
x += y;
}
(noting that I may have left some syntax out). It seems an obvious solution would be to split real and imaginary components into temporary doubles, then accumulate on those. I guess I'm looking for elegance, and that seems ... less than pretty. Would that be the typical approach here?
The typical workaround in absence of user-defined reductions in OpenMP is even uglier than what you suggested. Usually, prior to the parallel region people create an array of (at least) as many elements as there will be threads in the region, accumulate partial results separately for each thread using omp_get_thread_num() as an index to the array, and do final reduction of the accumulated results in a loop after the parallel region.
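A sketch of that workaround for the complex accumulator in the question - the partial array, its size from omp_get_max_threads(), and the variable names are mine:
#include <complex>
#include <iostream>
#include <vector>
#include <cstddef>
#include <omp.h>

int main() {
    std::complex<double> x(0.0, 0.0), y(1.0, 1.0);

    // one partial sum per thread; padding each slot would avoid false
    // sharing, which is ignored here for brevity
    std::vector<std::complex<double> > partial(omp_get_max_threads(),
                                               std::complex<double>(0.0, 0.0));

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < 5; ++i)
            partial[t] += y;               // each thread touches only its own slot
    }

    for (std::size_t t = 0; t < partial.size(); ++t)
        x += partial[t];                   // final reduction after the parallel region

    std::cout << x << std::endl;           // prints (5,5)
    return 0;
}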
As far as I know, the OpenMP language committee is working on adding user-defined reductions to the specification, so maybe it will finally be resolved in a few years.
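For reference, user-defined reductions did eventually land in OpenMP 4.0, so with a new enough compiler the clean form would look roughly like this (the cadd identifier is arbitrary and this is only a sketch, not anything available to the answer above when it was written):
#include <complex>

// requires an OpenMP 4.0 (or later) compiler
#pragma omp declare reduction(cadd : std::complex<double> : omp_out += omp_in) \
    initializer(omp_priv = std::complex<double>(0.0, 0.0))

int main() {
    std::complex<double> x(0.0, 0.0), y(1.0, 1.0);
    #pragma omp parallel for reduction(cadd : x)
    for (int i = 0; i < 5; ++i)
        x += y;                            // each thread accumulates into its private copy
    return 0;                              // x == (5,5)
}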
Sorry, OpenMP simply doesn't support that at this time. Unfortunately, you need to do the parallel reduction in the ugly way you already described.
However, if such a parallel reduction is really frequent, I'd like to write a construct similar to parallel_reduce in TBB. Implementing such a construct is fairly straightforward. Cilk Plus has a more powerful reducer object, but I didn't check whether it supports non-POD types.
FYI, this kind of restriction can also be found in the threadprivate pragma. I've tested with VC++ 2008/2010 and the Intel compiler (icc). VC++ can't support threadprivate with a struct/class that has a constructor or destructor (or a scalar variable that requires a function call to be initialized); it throws error C3057: "dynamic initialization of 'threadprivate' symbols". You may read this MSDN link as well. However, icc is okay with the case of C3057. As you can see, at least two major implementations differ here.
I guess that supporting parallel reduction on non-POD types would run into a similar problem. In order to support parallel reduction, each parallel section needs to allocate a thread-local copy of the reduction variable, so if a given reduction variable is non-POD, the implementation may need to call a user-defined constructor. This raises the same problem I mentioned in the case of C3057.