Is this the correct use of OpenMP firstprivate?

I need to parallelize the following:
for (i = 0; i < n/2; i++)
    a[i] = a[i+1] + a[2*i];
Run in parallel, the output will differ from the sequential version, because values that still need to be read may already have been overwritten. To get the sequential output while still parallelizing, I want to use firstprivate(a), because firstprivate gives each thread its own copy of a.
Let's imagine 4 threads and a loop of 100.
Thread 1 --> i = 0 to 24
Thread 2 --> i = 25 to 49
Thread 3 --> i = 50 to 74
Thread 4 --> i = 75 to 99
That means each thread will overwrite 25% of the array.
When the parallel region ends, all the threads "merge". Does that mean you get the same a as if you had run it sequentially?
#pragma omp parallel for firstprivate(a)
for (i = 0; i < n/2; i++)
    a[i] = a[i+1] + a[2*i];
Question:
Is my way of thinking correct?
Is the code parallelized in the right way to get the sequential output?

As you noted, using firstprivate to copy the data for each thread does not really help you get the data back.
The easiest solution is in fact to separate input and output and have both be shared (the default).
To avoid a copy back, just use the new variable instead of a from there on in the code. Alternatively, you could keep two pointers and swap them (see the sketch below the code).
int out[100];
#pragma omp parallel for
for (i = 0; i < n/2; i++)
    out[i] = a[i+1] + a[2*i];
// use out from here on wherever you would have used a.
There is no easy, general way to give each thread a private copy of a and then merge the copies afterwards: lastprivate just copies one incomplete output array from the thread that executes the last iteration, and reduction doesn't know which elements to take from which copy. Even if it did, copying the entire array for each thread would be wasteful. Having shared in-/outputs here is much more efficient.
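For the pointer-swap alternative mentioned above, a minimal sketch (the function name and signature are illustrative, not from the original code). Note that only the first n/2 elements of out are produced, so copy the untouched tail from in first if the rest of the array is still needed:
#include <utility>

// Read from `in`, write to `out`, then swap, so `in` points at the freshly
// computed values without copying the whole array.
void step(int *&in, int *&out, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n/2; i++)
        out[i] = in[i+1] + in[2*i];
    std::swap(in, out);
}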

Related

How do I translate this ACC code to SYCL?

My question is:
I have this code:
#pragma acc parallel loop
for (i = 0; i < bands; i++)
{
    #pragma acc loop seq
    for (j = 0; j < lines_samples; j++)
        r_m[i] += image_vector[i*lines_samples+j];
    r_m[i] /= lines_samples;
    #pragma acc loop
    for (j = 0; j < lines_samples; j++)
        R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
}
I'm trying to translate it to SYCL, and I thought about replacing the first parallel loop with a kernel, using the typical "queue.submit(...)" over "i". But then I realized that inside the first big loop there is a loop that must be executed serially. Is there a way to tell SYCL to execute a loop inside a kernel serially?
I can't think of another way to solve this, as I need to make both the first big for and the last for inside the main one parallel.
Thank you in advance.
You have a couple of options here. The first one, as you suggest, is to create a kernel with a 1D range over i:
q.submit([&](sycl::handler &cgh){
    cgh.parallel_for(sycl::range<1>(bands), [=](sycl::item<1> i){
        for (int j = 0; j < lines_samples; j++)
            r_m[i] += image_vector[i*lines_samples+j];
        r_m[i] /= lines_samples;
        for (int j = 0; j < lines_samples; j++)
            R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
    });
});
Note that for the inner loops, the kernel will just iterate serially over j in both cases. SYCL doesn't apply any magic to your loops like a #pragma would - loops are loops.
This is fine, but you're missing out on a higher degree of parallelism which could be achieved by writing a kernel with a 2D range over i and j: sycl::range<2>(bands, lines_samples). This can be made to work relatively easily, assuming your first loop is doing what I think it's doing, which is computing the average of a line of an image. In this case, you don't really need a serial loop - you can achieve this using work-groups.
Work-groups in SYCL have access to fast on-chip shared memory, and are able to synchronise. This means that you can have a work-group load all the pixels from a line of your image, then the work-group can collaboratively compute the average of that line, synchronize, then each member of the work-group uses the computed average to compute a single value of R_o, your output. This approach maximises available parallelism.
The collaborative reduction operation to find the average of the given line is probably best achieved through tree-reduction. Here are a couple of guides which go through this workgroup reduction approach:
https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers/examples
https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/kernels/reduction.html
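For reference, a minimal sketch of that work-group approach (not code from either guide; the function name, the SYCL 2020 API with USM device pointers, and the power-of-two work-group size of 256 are assumptions to adapt to your setup):
#include <sycl/sycl.hpp>

// One work-group per band: tree-reduce the line sum in local memory,
// then every work-item subtracts the mean from its own elements.
void mean_subtract(sycl::queue &q, const float *image_vector,
                   float *r_m, float *R_o,
                   size_t bands, size_t lines_samples)
{
    constexpr size_t WG = 256;  // work-group size, assumed to be a power of two
    q.submit([&](sycl::handler &cgh) {
        sycl::local_accessor<float, 1> scratch(sycl::range<1>(WG), cgh);
        cgh.parallel_for(
            sycl::nd_range<2>({bands, WG}, {1, WG}),
            [=](sycl::nd_item<2> it) {
                const size_t i   = it.get_global_id(0);  // band / line index
                const size_t lid = it.get_local_id(1);   // lane within the work-group

                // Each work-item accumulates a strided partial sum of its line.
                float partial = 0.0f;
                for (size_t j = lid; j < lines_samples; j += WG)
                    partial += image_vector[i * lines_samples + j];
                scratch[lid] = partial;

                // Collaborative tree reduction in local memory.
                for (size_t s = WG / 2; s > 0; s /= 2) {
                    sycl::group_barrier(it.get_group());
                    if (lid < s) scratch[lid] += scratch[lid + s];
                }
                sycl::group_barrier(it.get_group());

                const float mean = scratch[0] / static_cast<float>(lines_samples);
                if (lid == 0) r_m[i] = mean;

                // Each work-item subtracts the mean from its own elements.
                for (size_t j = lid; j < lines_samples; j += WG)
                    R_o[i * lines_samples + j] =
                        image_vector[i * lines_samples + j] - mean;
            });
    }).wait();
}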

OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

I have some C++ code that I am running for an optimisation task, and I am trying to parallelise it using OpenMP. I tried using #pragma omp parallel for on both loops, but realised pretty quickly that it didn't work, so I want to set up a conditional to decide whether to parallelise the outer or inner loop, depending on how many outer iterations there are.
Here is the code:
int N_OUTER;    // typically 1-8
int N_INNER;    // typically > 100
int PAR_THRESH; // this is the parameter I am interested in setting

std::vector<Sol> seeds;                  // vector with initial solutions
std::vector<Sol> sols (N_OUTER*N_INNER); // vector for output solutions

#pragma omp parallel for if (N_OUTER >= PAR_THRESH)
for (int outer = 0; outer < N_OUTER; ++outer){
    #pragma omp parallel for if (N_OUTER < PAR_THRESH)
    for (int inner = 0; inner < N_INNER; ++inner){
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}
This works fine to decide which loop (inner or outer) gets parallelised; however, I am trying to determine what is the best value for PAR_THRESH.
My intuition says that if N_OUTER is 1, then it shouldn't parallelise the outer loop; and if N_OUTER is greater than the number of threads available, then the outer loop should be the one parallelised, because that uses the maximum available threads and the threads are as long as possible. My question is about when N_OUTER is 2 or 3 (4 being the number of threads available).
Is it better to run, say, 2 or 3 threads that are long, in parallel; but not use up all of the available threads? Or is it better to run the 2 or 3 outer loops in serial, while utilising the maximum number of threads for the inner loop?
Or is there a kind of trade off in play, and maybe 2 outer loop iterations might be wasting threads, but if there are 3 outer loop iterations, then having longer threads is beneficial, despite the fact that one thread is remaining unused?
EDIT:
edited code to replace N_ITER with N_INNER in two places
I don't have much experience with OpenMP, but I have found the collapse directive:
https://software.intel.com/en-us/articles/openmp-loop-collapse-directive
Understanding the collapse clause in openmp
It seems even more appropriate when the number of inner-loop iterations varies.
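For illustration, a minimal sketch of collapse applied to the loops from the question (reusing the question's names and assuming solve() is thread-safe): collapse(2) fuses the two loops into a single iteration space of N_OUTER*N_INNER tasks, so the threads stay busy even when N_OUTER is smaller than the thread count.
#pragma omp parallel for collapse(2)
for (int outer = 0; outer < N_OUTER; ++outer) {
    for (int inner = 0; inner < N_INNER; ++inner) {
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}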
--
On the other hand:
It seems to me that solve(...) is side-effect free. It also seems that N_ITER is N_INNER.
Currently you call solve N_INNER*N_OUTER times.
While reducing that won't improve the big-O complexity, solve presumably has a very large constant factor, so it should save a lot of time. You cannot cache the result with collapse, so maybe this could be even better:
std::vector<Sol> sols_tmp (N_OUTER);
#pragma omp parallel for
for (int i = 0; i < N_OUTER; ++i) {
sols_tmp[i] = solve(seeds[i]);
}
This calls solve only N_OUTER times.
Because solve returns the same value for each row, you can then fill in the full output:
#pragma omp parallel for
for (int i = 0; i < N_OUTER*N_INNER; ++i) {
sols[i] = sols_tmp[i/N_INNER];
}
Of course, you must measure whether parallelization actually pays off for these loops.

Working with work groups and their sizes in opencl on a single array

I am using OpenCL C++ for the implementation of my project. I want to get the maximum speed/performance out of my GPU(s), depending on whether I have multiple GPUs or a single one, but for the purpose of this question let's assume I have only one device.
Suppose I have an array of length 100.
double arr[100];
Currently I am calling the kernel in the following way:
kernelGA(cl::EnqueueArgs(queue[iter],
cl::NDRange(100)),
d_arr, // and some other buffers.
)
Now on the kernel side I have one global id, that is:
int idx = get_global_id(0);
The way I want my kernel to work is the following:
Each of the 100 work groups will take care of one element.
There are some rules by which each work group updates its element of the array, e.g.:
if (arr[idx] < 5) {
arr[idx] = 10; // a very simple example.
}
For the most part it is okay. But then there is one point where I want to interchange values, and where I want the threads/work items to communicate with each other. At that point, they don't seem to work and they don't seem to communicate.
e.g.:
if(arr[idx] < someNumber) {
arr[idx] = arr[idx + 1];
}
At this point, nothing seems to work. I tried to implement a for loop and to create a barrier
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
but it also doesn't work. It doesn't change the values of the array elements.
I have the following questions:
1. Why doesn't it work? Is my implementation wrong? The threads seem to update their own indexed array element correctly. But when it comes to communication between them, they don't work. Why?
2. Is my implementation of the barriers and letting only one work item wrong? Is there a better way to let one item take care of this part while the other items are waiting for this one to finish?
The code you wrote is serial:
if(arr[idx] < someNumber) {
arr[idx] = arr[idx + 1];
}
Worker N reads arr[N + 1], which worker N + 1 may or may not have overwritten already; worker N + 1 in turn depends on worker N + 2, and so on.
So to reproduce the sequential result, every worker would have to be ordered with respect to its neighbour, which means the code is not parallel and never will be. You are far better off computing this on a CPU than on a GPU.
The OpenCL design model allows you to run many work items in parallel, but the synchronization model only lets you synchronize within a work group.
If you need global synchronization, that is a clear sign that your algorithm is not a good fit for OpenCL.
Now, if I assume you just want the value of the last element, and that what you really want is a "sum" over the whole array, then this is a reduction problem, and it can be performed in log(N) passes by parallelizing it in this fashion:
1st step, array[x] = array[x] + array[N/2+x] (x from 0 to N/2)
2nd step, array[x] = array[x] + array[N/4+x] (x from 0 to N/4)
...
log(N) passes
Each step will be a separate kernel, and therefore ensures all work items have finished before starting the next batch.
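A minimal sketch of one such pass as an OpenCL kernel (the kernel and parameter names are illustrative): the host enqueues it repeatedly with n_half = N/2, N/4, ..., 1, and each enqueue acts as a global synchronization point between passes.
__kernel void reduce_pass(__global double *data, const int n_half)
{
    int x = get_global_id(0);
    if (x < n_half)
        data[x] += data[n_half + x];
}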
Another, faster option is to perform the reduction inside the work group: if the work group size is 256, you can sum groups of 256 elements together in each pass, which is faster than just reducing by a factor of 2 in each pass.
I suspect that your problem has limited ability to be made parallel, and is thus a poor fit for any kind of GPGPU solution.
Consider the following array of elements:
1 5 2 6 5 3 6 7 2 8 1 8 3 4 2
Now suppose we perform the following transformation on this data:
//A straightforward, serially-executed iteration on the entire array.
for(int i = 0; i < arr.size() - 1; i++) {
if(arr[i] < 5) arr[i] = arr[i + 1];
}
The result will then be
5 5 6 6 5 6 6 7 8 8 8 8 4 2 2
But what happens if the for loop executes in reverse?
for(int i = arr.size() - 2; i >= 0; i--) {
if(arr[i] < 5) arr[i] = arr[i + 1];
}
The result will then be
5 5 6 6 5 6 6 7 8 8 8 8 2 2 2
Note how the third-to-last number differs depending on the order of execution. My example input doesn't change much, but if your data has lots of numbers below the chosen threshold, the order could completely change the entire array. GPGPU APIs make no guarantees about the order in which individual work items execute: the effective order could be like the first for loop I wrote, like the second, or a completely random shuffle. So you've written non-deterministic code, and the only way to make it deterministic is to use so many barriers that you guarantee sequential ordering, at which point there is literally no reason to use a GPGPU API in the first place.
You could write something like the following instead, which would be deterministic:
if(arr[i] < 5) output[i] = arr[i + 1];
else output[i] = arr[i];
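As a complete kernel, a minimal sketch of that deterministic variant (kernel and parameter names are illustrative): every work item reads only from in and writes only its own slot of out, so no ordering between work items can change the result.
__kernel void shift_if_small(__global const double *in,
                             __global double *out,
                             const double threshold,
                             const int n)
{
    int idx = get_global_id(0);
    if (idx >= n) return;
    // The last element has no right-hand neighbour, so it is copied unchanged.
    if (in[idx] < threshold && idx + 1 < n)
        out[idx] = in[idx + 1];
    else
        out[idx] = in[idx];
}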
But that might require a reconsideration of your design constraints. I don't know, as I don't know what your program is ultimately doing.
Either way though, you need to spend some time reconsidering what you're actually trying to do.

call a function and loops in parallel

I don't have any experience with OpenMP, so I want to know how to do the following:
for (int i = 1; i <= NumImages; i++) {
    // call a function
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            // do 2 summing-up calculations inside a while loop
        } // end l loop
    } // end k loop
} // end i loop
Now, I have 40 cores at my disposal.
NumImages will be from 50 to 150, most often 150.
SumNumber will be around 200.
ElNum will be around 5000.
So, is the best way to deal with this to assign a function call to every thread and also execute the l loop in parallel?
And if yes, will it be like:
#pragma omp parallel for num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        #pragma omp for
        for (int l = 0; l < ElNum; l++) {
And does the above mean (for NumImages = 150) that myfunction and the l loop will be executed 40 times in parallel, and then, when the l and k loops finish, the next 40 threads will call the function again, and then the next 40, so 3*40 = 120, and then the remaining 30?
Generally the best way is the one that splits the work evenly, to maintain efficiency (no cores are left waiting). E.g. in your case static scheduling of the outer loop is probably not a good idea, because 40 does not divide 150 evenly; in the last round you would lose 25% of the computing power. So it might turn out to be better to put the parallel clause before the second loop. It all depends on the scheduling mode you choose and on how the work is really distributed within the loops. E.g., if myfunction does 99% of the work it is a bad idea; if 99% of the work is within the two inner loops it might be good.
Not really. There are 3 scheduling modes, but none of them works in a way that blocks other threads. There is a pool of tasks (iterations) that is distributed among the threads; the scheduling mode describes the strategy of assigning tasks to threads. When one thread finishes, it just gets the next task, with no waiting. The strategies are described in more detail here: http://en.wikipedia.org/wiki/OpenMP#Scheduling_clauses (I am not sure if a blatant copy-paste from the wiki is a good idea, so I'll just leave the link; it's good material.)
Maybe what is not written there is that the modes are listed in order of the amount of overhead they introduce: static is fastest, then dynamic, then guided. My advice on when to use which (not exact, but a good rule of thumb IMO):
static if you know the tasks will be divided evenly among the threads and take the same amount of time
dynamic if you know the tasks will not be divided evenly or their execution times are uneven
guided for rather long tasks about which you pretty much cannot tell anything in advance
If your tasks are rather small you can see overhead even with static scheduling (e.g. "Why is my OpenMP C++ code slower than serial code?"), but I think in your case dynamic should be fine and the best choice.
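For illustration, a minimal sketch using the names from the question (myfunction is assumed to be thread-safe): dynamic scheduling on the outer loop lets threads that finish their images early pick up new ones, which avoids idle cores when 40 does not divide NumImages evenly.
#pragma omp parallel for schedule(dynamic) num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            // the two summing-up calculations stay serial inside each thread
        }
    }
}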

How can writing to a shared array (over a pointer) in a nested for loop parallelized with OpenMP produce wrong results?

I have a very strange problem that I'm trying to solve and understand. I have a nested for loop of the following form:
#pragma omp parallel for schedule(guided) shared(Array) collapse(3)
for (int i = istart; i < iend; i++)
{
    for (int j = jstart; j < jend; j++)
    {
        for (int k = kstart; k < kend; k++)
        {
            Int IJK = (i*(jend-jstart) + (j-jstart))*(kend-kstart) + (k-kstart);
            Array[3*IJK + 2] = an operation with some shared values;
        }
    }
}
There are three loops of this form, with Array[3*IJK], Array[3*IJK + 1] and Array[3*IJK + 2] respectively. Array is actually accessed through a shared pointer, and the value of IJK is actually computed by a function call (inlined).
I first tried parallelizing all the loops and the program runs to completion, but the results differ from my serial results.
Now come the strange parts.
The for loop of this same structure that has Array[3*IJK + 1] instead produces correct results when it is parallelized (with the other loops kept serial in this case). But as soon as I parallelize one of the other loops, I get different results. Only this single loop produces correct results when parallelized by itself.
Also, if I don't use collapse, or use collapse(2) instead of collapse(3), I get different results. Only with the #pragma statement exactly as above do I get correct results in the Array[3*IJK + 1] loop.
I thought it might have something to do with the order in which Array was written to, but with an ordered clause and construct, I still get wrong results.
What can be the cause of this?
Are you sure your serial case is correct?
Your IJK calculation makes no sense to me; for one thing, it doesn't depend on j at all. As it is, if two threads get the same (i,k) pair with different j -- certainly possible with collapse(3) -- there's going to be a race condition as they both will be trying to write to the same IJK.
Are you sure you don't want something like
Int IJK = (i*(jend-jstart) + (j-jstart))*(kend-kstart) + (k-kstart);
?
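For illustration, a minimal sketch of the whole loop nest with a row-major index of that form (assuming this is indeed the indexing you intended; the right-hand side stays a placeholder): each (i, j, k) triple then maps to a distinct IJK, so no two collapsed iterations write the same element of Array.
#pragma omp parallel for schedule(guided) shared(Array) collapse(3)
for (int i = istart; i < iend; i++)
    for (int j = jstart; j < jend; j++)
        for (int k = kstart; k < kend; k++)
        {
            const long long IJK = ((long long)i*(jend-jstart) + (j-jstart))*(kend-kstart) + (k-kstart);
            Array[3*IJK + 2] = 0.0; // placeholder for the operation with shared values
        }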