Using OpenMP in C++ with Timsort Algorithm

I've been looking for a way to use a Timsort implementation for C++ (found on GitHub) with multithreading, and I've tried using OpenMP in the process.
I'm sure I'm using the correct compiler flags, but whenever I try to use Timsort as I do below:
#pragma omp parallel shared(DataVector)
{
gfx::timsort(DataVector.begin(), DataVector.end(), comp_1);
}
Note: the data being sorted is a vector containing strings of individual words, and I'm using my own comparator.
It seems to sort in the same amount of time that it takes without OpenMP. Timing with chrono and the appropriate includes, the values were within .01 seconds of each other on average, hovering around 1.24 seconds for my sort.
Is there a reason the threading doesn't seem to work with my sorting method, or is it a problem with the way I'm implementing OpenMP?
A note on my purpose: I have been using __gnu_parallel::sort as well, with better results, but I'm looking to compare these methods in practice myself.

omp parallel needs to see the loop it is going to parallelize. The way you've declared it, OpenMP just hands the same block of code to every thread, which does not give any benefit.
Check your docs on omp parallel usage.
To divide a for loop among threads you need omp parallel for, with the for statement immediately following the pragma. The way you have it now, it will run your entire timsort once on every core you have.
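For illustration, here is a minimal sketch (my own, not the asker's code) of how the work could actually be divided among threads: sort disjoint chunks in parallel, then merge. std::sort stands in for the sort call; gfx::timsort(first, last, comp) takes the same iterator/comparator arguments, so it could be dropped in instead.
#include <omp.h>
#include <algorithm>
#include <string>
#include <vector>

// Sketch: each thread sorts one chunk, then the chunks are merged sequentially.
void parallel_chunk_sort(std::vector<std::string>& data)
{
    const int nthreads = omp_get_max_threads();
    const std::size_t n = data.size();
    std::vector<std::size_t> bounds(nthreads + 1);
    for (int t = 0; t <= nthreads; ++t)
        bounds[t] = n * t / nthreads;                    // chunk boundaries

    // The parallelism lives here: one chunk per loop iteration.
    #pragma omp parallel for
    for (int t = 0; t < nthreads; ++t)
        std::sort(data.begin() + bounds[t], data.begin() + bounds[t + 1]);

    // Sequential pairwise merges of the already-sorted chunks.
    for (int t = 1; t < nthreads; ++t)
        std::inplace_merge(data.begin(), data.begin() + bounds[t],
                           data.begin() + bounds[t + 1]);
}
A custom comparator such as the question's comp_1 would need to be passed to both the per-chunk sort and std::inplace_merge.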

I think OpenMP is not as smart as you think...
If you want gfx::timsort to run as a parallel for, you can't do it from the outside...
You would have to add code like this inside gfx::timsort itself:
#pragma omp parallel for
for(int i=0;i<num;i++)
...
Besides, shared is a keyword that marks a variable as shared among all of the threads (rather than giving each thread its own private copy); it does not protect the variable from concurrent modification.
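To illustrate what shared actually means, here is a tiny example of my own (not from the question): every thread sees and updates the same variable, which is why the update still needs atomic protection.
#include <omp.h>
#include <cstdio>

int main() {
    int counter = 0;
    #pragma omp parallel shared(counter)
    {
        #pragma omp atomic
        ++counter;                 // all threads modify the one shared variable
    }
    std::printf("number of threads that ran: %d\n", counter);
}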

Related

How to collect data for each thread OpenMP

I'm new to OpenMP and am trying to sort out the issue of collecting data from threads. I'm studying the example of applying OpenMP to the Monte-Carlo method (the area of a circle inscribed in a square).
I understood how the following code works:
unsigned pointsInside = 0;
#pragma omp parallel for num_threads(threadNum) shared(threadNum) reduction(+: pointsInside)
for (unsigned i = 0; i < threadNum; i++) { ... }
Am I right that originally pointsInside is a variable, but OpenMP represents it as an array, and then the mantra reduction(+: pointsInside) sums over the elements of the "array"?
But the main question is how to collect information directly into an array or vector. I tried declaring an array or vector, passing a pointer or address into OpenMP via shared, and collecting information for each thread at the corresponding index. But it works slower than the version with the scalar variable and reduction. Such an approach with a vector or array is what I need for my current project. Thanks a lot!
UPD:
When I said above that "it works slower", I meant a comparison of two implementations of the Monte-Carlo method: 1) via shared and a vector/array, and 2) via a scalar variable and reduction. The second case is faster. My guess and question about it are below.
I would like to rephrase my question more clearly. I create a vector/array and pass it into OpenMP via shared. I want to collect data for each thread at the corresponding index in the vector/array. Under this approach I don't need any synchronization of access to the vector/array. Is it true that OpenMP enables synchronization by default when I use shared? If so, how do I disable it, or do other approaches exist? If not, how do I share the vector/array into the parallel part correctly and without synchronization of access?
I'd like to apply this technique to my project, where I want to work through different permutations in the parallel part, collect each permutation and its scalar result outside of the parallel part, then sort the results and choose the best one.
A partial answer:
Am I right that originally pointsInside is a variable but OpenMP represents it as an array and then the mantra reduction(+: pointsInside) sums over the elements of the "array"?
I think it is better to think of pointsInside as a scalar. When the parallel region starts the run-time takes care of creating individual scalars, perhaps you might think of them as myPointsInside, one such scalar for each thread. When the parallel region finishes the run-time reduces the values of all the thread scalars onto the original scalar pointsInside. This is just about what OpenMP actually does behind the scenes.
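Roughly, the reduction clause arranges something like the hand-written sketch below (illustrative only, with hypothetical names; the real run-time does this internally):
#include <omp.h>

unsigned count_points(unsigned trials)
{
    unsigned pointsInside = 0;
    #pragma omp parallel
    {
        unsigned myPointsInside = 0;        // one private scalar per thread
        #pragma omp for
        for (unsigned i = 0; i < trials; i++)
            myPointsInside += 1;            // stand-in for the Monte-Carlo test
        #pragma omp atomic
        pointsInside += myPointsInside;     // the "reduction" onto the original scalar
    }
    return pointsInside;
}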
As to the rest of your question:
Yes, you can perform reductions onto arrays - but this was only added to OpenMP, for C and C++ programs, in OpenMP 4.5 (I think). What goes on is much the same as for the scalar case. This Q&A provides some assistance - Is it possible to do a reduction on an array with openmp?
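For reference, a small sketch of the OpenMP 4.5 array-section reduction syntax (my own example, a simple histogram): each thread gets a private copy of the array and the copies are summed element-wise at the end of the loop.
#include <cstdio>

int main() {
    const int NBINS = 8;
    int hist[NBINS] = {0};
    #pragma omp parallel for reduction(+: hist[:NBINS])
    for (int i = 0; i < 1000; i++)
        hist[i % NBINS]++;                  // no race: each thread fills its own copy
    for (int b = 0; b < NBINS; b++)
        std::printf("bin %d: %d\n", b, hist[b]);
}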
As to the speed, it's difficult to answer that without a much clearer understanding of what comparisons you are making. But it's very easy to write parallel reductions on arrays which incur a significant penalty in performance from the phenomenon of false sharing, about which you may wish to inform yourself.
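To make the false-sharing point concrete, here is a hypothetical sketch of the per-thread-slot pattern and a common mitigation (assuming perThread has at least as many elements as there are threads):
#include <omp.h>
#include <vector>

// Problematic: adjacent int slots share a cache line, so the counters
// ping-pong the line between cores on every increment.
void slow_per_thread_counts(std::vector<int>& perThread, long trials)
{
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        for (long i = 0; i < trials; i++)
            perThread[tid]++;               // neighbouring slots -> false sharing
    }
}

// Common fix: accumulate in a thread-local variable, write the slot once.
void faster_per_thread_counts(std::vector<int>& perThread, long trials)
{
    #pragma omp parallel
    {
        int local = 0;
        for (long i = 0; i < trials; i++)
            local++;
        perThread[omp_get_thread_num()] = local;
    }
}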

Not able to achieve desired speed up using openmp

I am trying to use an OpenMP directive to parallelize a piece of code but am not able to achieve any speed up. Following is the piece of code that I am trying to parallelize:
#pragma omp parallel private(i,j) shared(a,x,n) default(none)
{
for(j=n-1;j>=0;j--)
{
x[j] = A(j,n,n)/A(j,j,n);
#pragma omp for schedule(dynamic)
for (i=0;i<=j-1;i++)
{
A(i,n,n )= A(i,n,n) - A(i,j,n)*x[j];
}
}
}
The value of n is 1000. A(i,n,n) is a macro defined to access the array a.
As I increase the number of threads, the execution time increases or remains the same. The machine I am working on has 4 cores. I am surprised that there is no speed up even when the number of threads is 2.
I am not able to figure out what I am doing wrong.
Since n >> #CPUs (I don't think you have 1000 CPUs), it is not wise to parallelize only the inner loop: in your example, you redistribute the work at every iteration of the outer loop.
Instead, it is wiser to parallelize the outer loop. This way, the value of x[j] won't be updated concurrently by different threads (as Zulan mentioned), and you will have much less work re-distribution.
In that case, using dynamic scheduling is wise, since the quantity of work changes at each iteration.
Note: you will have to change the order of the calculation; the current implementation does not allow you to simply move the parallelization to the outer loop, since all of the threads would update the same values (A(i,n,n)).
Although it is true that thread creation takes time, the threads are not recreated at each iteration. They are created only once, at the top #pragma omp parallel statement, and run concurrently for the entire following block.
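For illustration only, here is a generic sketch of the outer-loop shape being suggested (this is not the asker's back-substitution, which, as noted above, needs its calculation reordered first; the names are placeholders):
#include <cmath>

// Outer iterations are assumed independent; dynamic scheduling hands out
// whole rows, so the uneven inner work is balanced without per-iteration
// redistribution of the inner loop.
void outer_loop_parallel(double* a, const int* work, int n)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < work[i]; j++)
            a[i] += std::sin(j);            // placeholder computation
}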

Using OpenMP to parallelize a for loop

I'm new to OpenMP. When I parallelize a for loop using
#pragma omp parallel for num_threads(4)
for(i=0;i<4;i++){
//some parallelizable code
}
Is it guaranteed that every thread takes one and only one value of i? How is the loop work divided among the threads in general when num_threads does not equal, or does not evenly divide, the total number of iterations of the for loop? Is there a command I can use to specify that each thread takes only one value of i, or to control the number of values of i each thread takes?
The work division in a loop construct is decided by the schedule. If no schedule clause is present, the def-sched-var schedule is used, which is implementation defined.
You could use schedule (static, 1), which in your case guarantees that each thread will get exactly one value.
I highly recommend taking a look at the OpenMP specification, Table 2.5 and Section 2.7.1.1.
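A quick sketch of the schedule(static, 1) behaviour (my own example): with 4 threads and 4 iterations, thread t should get exactly iteration t.
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel for num_threads(4) schedule(static, 1)
    for (int i = 0; i < 4; i++)
        std::printf("iteration %d ran on thread %d\n", i, omp_get_thread_num());
}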
There may be legitimate reasons for making this kind of assumption, but in general the correctness of your loop code should not depend on it. Primarily I would treat this as a performance hint.
Depending on your use case you may want to consider tasks or plain parallel constructs. If you rely on such details for loops, make sure they are well specified in the standard, and don't just happen to work in your particular implementation.

Stream compaction (or Array Packing) with prefix scan using Openmp

I am using openmp to parallelize my code. I have an original array:
A=[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]
and a marks array:
M=[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]
Using the array M, I can compact my original array into this packed array:
A=[3,2,-4,-3,1,-1,2]
I'd like to solve this problem using a multi-threaded approach. The 'Thrust' library for C++ solves this problem, but I am not able to find a similar tool for Fortran.
Is there a library, like 'Thrust' for C++, that I can use to perform stream compaction?
Alternatively, is there an algorithm that I can write myself using Fortran and OpenMP to solve this?
Is there a library, like 'Thrust' for C++, that I can use to perform stream compaction?
It shouldn't be that difficult to call a thrust routine from Fortran (if you're willing to write a little bit of C++ code). Furthermore, thrust can target an OMP backend instead of a GPU backend.
Alternatively, is there an algorithm that I can write myself using Fortran and OpenMP to solve this?
The basic parallel stream compaction algorithm is as follows. We will assume that there is one thread assigned per element in your data array, initially.
Perform a parallel prefix sum (inclusive scan) on the M array:
M=[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]
sM=[1,1,2,2,2,2,3,3,3,4,5,5,5,5,6,7]
Each thread will then inspect its element in the M array, and if that element is non-zero, it will copy its corresponding element in the A array to the output array (let's call it O):
M=[1,0,1,0,0,0, 1,0,0, 1,1,0,0,0, 1,1]
sM=[1,1,2,2,2,2, 3,3,3, 4,5,5,5,5, 6,7]
A=[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]
O=[3, 2, -4, -3,1, -1,2]
If you were doing this in OMP, you would need an OMP barrier between steps 1 and 2. The work in step 2 is relatively simple and completely independent, so you could use an OMP parallel do loop and break the work up in any fashion you wish. Step 1 will be more complicated, and I suggest following the outline provided in the chapter you and I linked. The OMP code there will require various barriers along the way, but is parallelizable.
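To make the two steps concrete, here is a hedged C++/OpenMP sketch (the question asks about Fortran, but the structure carries over directly): step 1 is a block-wise parallel inclusive scan of M, step 2 is the independent scatter into O.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> A = {3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2};
    std::vector<int> M = {1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1};
    const int n = static_cast<int>(A.size());
    std::vector<int> sM(n);                 // inclusive scan of M
    std::vector<int> offset;                // per-block totals, then block offsets

    #pragma omp parallel
    {
        const int nt  = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        #pragma omp single
        offset.assign(nt + 1, 0);           // implicit barrier after single

        // Step 1a: each thread scans its own contiguous block of M.
        const int lo = static_cast<int>(static_cast<long long>(n) * tid / nt);
        const int hi = static_cast<int>(static_cast<long long>(n) * (tid + 1) / nt);
        int sum = 0;
        for (int i = lo; i < hi; ++i) { sum += M[i]; sM[i] = sum; }
        offset[tid + 1] = sum;
        #pragma omp barrier

        // Step 1b: turn per-block totals into block offsets.
        #pragma omp single
        for (int t = 1; t <= nt; ++t) offset[t] += offset[t - 1];

        // Step 1c: add the offset of all preceding blocks.
        for (int i = lo; i < hi; ++i) sM[i] += offset[tid];
    }                                       // implicit barrier at end of parallel region

    // Step 2: independent scatter -- marked element i goes to slot sM[i]-1.
    std::vector<int> O(sM[n - 1]);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        if (M[i]) O[sM[i] - 1] = A[i];

    for (int v : O) std::printf("%d ", v);  // prints: 3 2 -4 -3 1 -1 2
    std::printf("\n");
}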
As mentioned already in the comments, if this is the only piece of work you want to parallelize, I wouldn't recommend a GPU, because the cost of transferring the data to/from the GPU would probably outweigh any parallel execution time benefits you might accrue. But as I mentioned already, thrust can target an OMP realization rather than a GPU realization. It might be worth a try.
Regarding using Thrust from Fortran, most of what you need is here. That is admittedly CUDA Fortran, but the only differences should be not using the device attribute, and using thrust::host_vector instead of thrust::device_vector (at least, to get started).

Signaling in OpenMP

I am writing computational code that more or less has the following schematic:
#pragma omp parallel
{
#pragma omp for nowait
// Compute elements of some array A[i] in parallel
#pragma omp single
for (i = 0; i < N; ++i) {
// Do some operation with A[i].
// This time it is important that operations are sequential. e.g.:
result = compute_new_result(result, A[i]);
}
}
Both computing A[i] and compute_new_result are rather expensive. So my idea is to compute the array elements in parallel, and if any of the threads becomes free, it starts doing the sequential operations. There is a good chance that the starting array elements are already computed and the rest will be provided by the other threads that are still working on the first loop.
However, to make the concept work I have to achieve two things:
To make OpenMP split the loop iterations in an alternating (round-robin) way, i.e. for two threads: thread 1 computing A[0], A[2], A[4] and thread 2 computing A[1], A[3], A[5], etc.
To provide some signaling system. I am thinking about an array of flags indicating that A[i] has already been computed. Then compute_new_result should wait for the flag for respective A[i] to be released before proceeding.
I would be glad for any hints how to achieve both goals. I need the solution to be portable across Linux, Windows and Mac. I am writing the whole code in C++11.
Edit:
I have figured out the answer to the first question. It looks like it is sufficient to add a schedule(static,1) clause to the #pragma omp for directive.
However, I am still thinking on the elegant solution of the second issue...
If you don't mind replacing the OpenMP for worksharing construct with a loop that generates tasks instead, you can use OpenMP task to implement both parts of your application.
In the first loop you would create, instead of the loop chunks, tasks that take on the compute load of the iterations. Each iteration of the second loop then also becomes an OpenMP task. The important part will be to synchronize the tasks between the different phases.
For that you can use task dependencies (introduced with OpenMP 4.0):
#pragma omp task depend(out:A[0])
{ A[0] = a(); }
#pragma omp task depend(in:A[0])
{ b(A[0]); }
This will make sure that the task running b does not start before the task running a has completed.
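For completeness, here is a fuller sketch of that pattern, using hypothetical placeholders for the asker's routines (expensive_compute is mine; compute_new_result is from the question). The inout dependency on result keeps the accumulation sequential and in order, while each accumulation task can start as soon as its own A[i] is ready:
#include <omp.h>
#include <cstdio>
#include <vector>

static double expensive_compute(int i)               { return i * 0.5; }  // placeholder
static double compute_new_result(double r, double a) { return r + a;   }  // placeholder

int main() {
    const int N = 100;
    std::vector<double> A(N);
    double result = 0.0;
    double *Ap = A.data();                  // depend() needs addressable elements

    #pragma omp parallel
    #pragma omp single                      // one thread creates tasks, all threads run them
    {
        for (int i = 0; i < N; ++i) {
            #pragma omp task depend(out: Ap[i]) firstprivate(i) shared(Ap)
            Ap[i] = expensive_compute(i);
        }
        for (int i = 0; i < N; ++i) {
            // in: Ap[i] waits only for "its" element; inout: result serializes the chain
            #pragma omp task depend(in: Ap[i]) depend(inout: result) firstprivate(i) shared(Ap, result)
            result = compute_new_result(result, Ap[i]);
        }
    }                                       // all tasks complete at the barrier here
    std::printf("result = %f\n", result);
}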
Cheers,
-michael
This is probably an extended comment rather than an answer ...
So, you have a two-phase computation. In phase 1 you can compute, independently, each entry in your array A. It is therefore straightforward to parallelise this with an OpenMP parallel for loop. But there is an issue here: naive allocation of work to threads is likely to lead to a (severely?) unbalanced load across threads.
In phase 2 there is a computation which is not so easily parallelised and which you plan to give to the first thread to finish its share of phase 1.
Personally I'd split this into 2 phases. In the first, use a parallel for loop. In the second, drop OpenMP and just have sequential code. Sort out the load balancing within phase 1 by tuning the arguments to a schedule clause; I'd be tempted to try schedule(guided) first.
If tuning the schedule can't provide the balance you want then investigate replacing parallel for by task-ing.
Do not complicate the code for phase 2 by rolling your own signalling technique. My concern is not that the complication will overwhelm you, though you might be concerned about that, but that the complication will fail to deliver any benefits unless you sort out the load balance in phase 1. And when you've done that you don't need to put phase 2 inside an OpenMP parallel region.
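A short sketch of the two-phase structure being suggested (placeholder function names; compute_new_result is taken from the question, expensive_element is mine):
#include <vector>

static double expensive_element(int i)                { return i * 1.0; }  // placeholder
static double compute_new_result(double r, double a)  { return r + a;   }  // placeholder

double two_phase(int N)
{
    std::vector<double> A(N);

    // Phase 1: independent element computations, guided schedule for load balance.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < N; ++i)
        A[i] = expensive_element(i);

    // Phase 2: the order-dependent accumulation, plain sequential code.
    double result = 0.0;
    for (int i = 0; i < N; ++i)
        result = compute_new_result(result, A[i]);
    return result;
}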