Parallelizing the inner selection sort loop skips a few numbers - C++

I am trying to write a program where you can set the number of threads, and it will parallelize the selection sort algorithm over the given data using that many threads. I know I should just use another algorithm in this case, but it's for educational purposes only. The problem I run into is that when I parallelize the inner loop of selection sort, some nearby numbers are left unsorted: the whole array is sorted apart from those few pairs of numbers, and I can't figure out why.
int* selectionSort(int arr[], int size, int numberOfThreads)
{
    int i, j;
    int me, n, min_idx;
    bool canSwap = false;
    #pragma omp parallel num_threads(numberOfThreads) private(i, j, me, n)
    {
        me = omp_get_thread_num();
        n = omp_get_num_threads();
        printf("Hello from %d/%d\n", me, n);
        for (i = 0; i < size - 1; i++) {
            min_idx = i;
            canSwap = true;
            #pragma omp barrier
            #pragma omp for
            for (j = i + 1; j < size; j++) {
                if (arr[j] < arr[min_idx])
                    min_idx = j;
                //printf("I am %d processing %d,%d\n", me, i, j);
            }
            printf("Min value %d ---- %d \n", arr[min_idx], min_idx);
            #pragma omp critical(swap)
            if (canSwap)
            {
                swap(&arr[min_idx], &arr[i]);
                canSwap = false;
            }
            #pragma omp barrier
        }
    }
    return arr;
}

I found out that the problem is that you can't really parallelize this algorithm, at least not the way I'm doing it. Since every thread compares arr[j] against the shared arr[min_idx], min_idx can change at just the wrong moment: one thread finishes evaluating if (arr[j] < arr[min_idx]), and right after that another thread updates min_idx, which can make the just-completed comparison no longer true.
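For reference, here is a minimal sketch of one way to remove that race (not the only way, and assuming std::swap may stand in for the question's swap() helper): each thread finds a private candidate minimum over its chunk of the inner loop, the candidates are merged in a critical section, and the swap is done serially after the region's implicit barrier. Re-opening the parallel region on every outer iteration is expensive, but it keeps the logic simple.

#include <utility>
#include <omp.h>

void selectionSortParallel(int arr[], int size, int numberOfThreads)
{
    for (int i = 0; i < size - 1; i++) {
        int min_idx = i;                       // shared result for this pass
        #pragma omp parallel num_threads(numberOfThreads)
        {
            int local_min = i;                 // each thread's private candidate
            #pragma omp for nowait
            for (int j = i + 1; j < size; j++) {
                if (arr[j] < arr[local_min])
                    local_min = j;
            }
            #pragma omp critical
            {
                // merge this thread's candidate into the shared result
                if (arr[local_min] < arr[min_idx])
                    min_idx = local_min;
            }
        }   // implicit barrier: all partial minima are merged here
        std::swap(arr[min_idx], arr[i]);       // serial swap between passes
    }
}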

Related

OpenMP Segmentation Fault in C++

I have a very straightforward function that counts how many inner entries of an N by N 2D matrix (represented by a pointer arr) are below a certain threshold, and updates a counter below_threshold that is passed by reference:
void count(float *arr, const int N, const float threshold, int &below_threshold) {
    below_threshold = 0; // make sure it is reset
    bool comparison;
    float temp;
    #pragma omp parallel for shared(arr, N, threshold) private(temp, comparison) reduction(+:below_threshold)
    for (int i = 1; i < N-1; i++) // count only the inner N-2 rows
    {
        for (int j = 1; j < N-1; j++) // count only the inner N-2 columns
        {
            temp = *(arr + i*N + j);
            comparison = (temp < threshold);
            below_threshold += comparison;
        }
    }
}
When I do not use OpenMP, it runs fine (thus, the allocation and initialization were done correctly already).
When I use OpenMP with an N that is less than around 40000, it runs fine.
However, once I start using a larger N with OpenMP, it keeps giving me a segmentation fault (I am currently testing with N = 50000 and would like to eventually get it up to ~100000).
Is there something wrong with this at a software level?
P.S. The allocation was done dynamically ( float *arr = new float [N*N] ), and here is the code used to randomly initialize the entire matrix, which didn't have any issues with OpenMP with large N:
void initialize(float *arr, const int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            *(arr + i*N + j) = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        }
    }
}
UPDATE:
I have tried changing i, j, and N to long long int, and it still has not fixed my segmentation fault. If this was the issue, why has it already worked without OpenMP? It is only once I add #pragma omp ... that it fails.
I think it is because your index value (50000*50000 = 2500000000) exceeds INT_MAX (2147483647) in C++. The int expression i*N + j overflows, so the array access behaviour is undefined.
So you should do the index arithmetic in a wider type, such as long long or size_t, that suits your use case.
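A minimal sketch of that fix: the loop counters stay as int, but one operand is cast before the multiply so that i*N + j is computed in a 64-bit-capable type. Note that the allocation new float[N*N] has the same problem and needs the same cast, e.g. new float[(size_t)N * N].

#include <cstddef>

void count(float *arr, const int N, const float threshold, int &below_threshold) {
    below_threshold = 0;
    #pragma omp parallel for reduction(+:below_threshold)
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            // cast before multiplying: (size_t)i * N cannot wrap around
            float temp = arr[(std::size_t)i * N + j];
            below_threshold += (temp < threshold);
        }
    }
}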

OpenMP: How to utilize recursive function in each thread?

#include <stdio.h>
#include <array>
#include <vector>
#include <omp.h>

std::vector<int> pNum;
std::array<int, 4> arr;
int pGen(int);

int main()
{
    pNum.push_back(2);
    pNum.push_back(3);
    pGen(10);
    for (int i = 0; i < (int)pNum.size(); i++)
    {
        printf("%d \n", pNum[i]);
    }
    printf("top say: %zu", pNum.size());  // %zu: pNum.size() is a size_t
    getchar();
}
int pGen(int ChunkSize)
{
    if (pNum.size() == 50) return 0;
    int i, k, n, id;
    int state = 0;
    #pragma omp parallel for schedule(dynamic) private(k, n, id) num_threads(4)
    for (i = 1; i < pNum.back() * pNum.back(); i++)
    {
        id = omp_get_thread_num();
        n = pNum.back() + i * 2;
        for (k = 1; k < (int)pNum.size(); k++)
        {
            if (n % pNum[k] == 0) break;
            if (n / pNum[k] <= pNum[k])
            {
                #pragma omp critical
                {
                    if (state == 0)
                    {
                        state = 1; pNum.push_back(n); printf("id: %d; number: %d \n", id, n); pGen(ChunkSize); break;
                    }
                }
            }
        }
        if (state == 1) break;
    }
    return 0;  // was missing: falling off the end of a non-void function is UB
}
This is my code above. I am trying to find the first 50 prime numbers with OpenMP scheduling, once for each of dynamic, static and guided. I started with dynamic. Somehow I realized I have to use a recursive function, since I can't use do-while inside parallel constructs.
When I debug the code above, the console opens up and closes down immediately; I can only see "id: 0; number: 5" and an "error: blablabla(something)".
The strange thing is I never get to getchar() or print the vector I use to store the prime numbers. I think this is related to the recursive function. Any other theories?
edit: I happened to catch the error (a screenshot of the error was attached here).
I don't know if this is significant for your algorithm, but since you add numbers to your pNum vector during the main loop, pNum.back() will change across iterations. Therefore, the bounds of the parallelised loop change during the loop itself: for (i = 1; i < pNum.back() * pNum.back(); i++)
This isn't supported by OpenMP. Loops can only be parallelised with OpenMP if they are in Canonical Loop Form. The link explains it in detail, but it boils down to this for you: the bounds must be known and fixed prior to entering the loop:
lb and b: Loop invariant expressions of a type compatible with the type of var
Therefore, your code has Undefined Behaviour. It may or may not compile, may or may not run, and can give whatever result if any (or just reformat your hard drive).
If it is not important that pNum.back() evolves over iterations, then you can simply evaluate it prior to the loop and use that value as the upper bound in the for statement. But if it is important, then you'll have to find another method to parallelise your loop.
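A minimal sketch of that first option, inside pGen(): the bound is frozen in a local variable before the loop, so it is loop-invariant and the loop is back in Canonical Loop Form.

// Evaluate the bound once, before entering the parallelised loop.
int upper = pNum.back() * pNum.back();
#pragma omp parallel for schedule(dynamic) private(k, n, id) num_threads(4)
for (i = 1; i < upper; i++)
{
    // ... loop body unchanged ...
}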
Finally, a side note: this algorithm uses nested parallelism, but you didn't explicitly enable it; since nested parallelism is disabled by default, only the outermost call to pGen() will spawn OpenMP threads.
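For completeness, a sketch of how nested parallelism would be enabled if it were actually wanted (note that omp_set_nested has been deprecated since OpenMP 5.0 in favour of omp_set_max_active_levels):

omp_set_nested(1);             // or set the environment variable OMP_NESTED=true
omp_set_max_active_levels(2);  // allow two levels of active parallel regions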

openMP for loop increment statement handling

for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            XY[2*nind] = i;
            XY[2*nind + 1] = j;
            nind++;
        }
    }
}
Here x = 512, z = 512, nind = 0 initially, and XY has size 2*x*z.
I want to optimize these for loops with OpenMP, but the nind variable is serially bound to the loops: the increment only happens when the if condition holds, so some iterations skip it, and with OpenMP the threads would increment nind in whatever order they happen to arrive. Is there any way to unbind it? (By 'bound' I mean it can only be updated serially.)
A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:
#pragma omp parallel
{
    uint myXY[2*z*x];   // private per-thread buffer for the (i,j) pairs
    uint mynind = 0;
    #pragma omp for collapse(2) schedule(dynamic,N)
    for (uint i = 0; i < x; i++) {
        for (uint j = 0; j < z; j++) {
            if (inFunc(p, index)) {
                myXY[2*mynind] = i;
                myXY[2*mynind + 1] = j;
                mynind++;
            }
        }
    }
    #pragma omp critical(concat_arrays)
    {
        memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
        nind += mynind;
    }
}
// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);

int compar(const void *pa, const void *pb)   // qsort expects const void* arguments
{
    const uint *p1 = (const uint *)pa;
    const uint *p2 = (const uint *)pb;
    if (p1[0] < p2[0])
        return -1;
    else if (p1[0] > p2[0])
        return 1;
    else
    {
        if (p1[1] < p2[1])
            return -1;
        else if (p1[1] > p2[1])
            return 1;
    }
    return 0;
}
You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written more efficiently.
The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.
Here is a variation on Hristo Iliev's good answer.
The important parameter to act on here is the index of the pairs rather than the pairs themselves.
We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).
The following function merges two sorted arrays
void merge(uint *a, uint *b, uint *c, int na, int nb) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
    while (i < na) c[k++] = a[i++];
    while (j < nb) c[k++] = b[j++];
}
Here is the remaining code
uint nind = 0;
uint *P = NULL;   // NULL so the first free(P) in the critical section is valid
#pragma omp parallel
{
    uint myP[x*z];   // private per-thread buffer of pair indices
    uint mynind = 0;
    #pragma omp for schedule(dynamic) nowait
    for (uint k = 0; k < x*z; k++) {
        if (inFunc(p, index)) myP[mynind++] = k;
    }
    #pragma omp critical
    {
        uint *t = (uint*)malloc(sizeof *P * (nind + mynind));
        merge(P, myP, t, nind, mynind);
        free(P);
        P = t;
        nind += mynind;
    }
}
Then given an index k in P the pair is (k/z, k%z).
The merging can be improved. Right now it runs in O(omp_get_num_threads()) steps, but it could be done in O(log2(omp_get_num_threads())). I did not bother with this.
Hristo Iliev pointed out that dynamic scheduling does not guarantee that the iterations per thread increase monotonically. I think in practice they do, but it's not guaranteed in principle.
If you want to be 100% sure that the iterations increase monotonically, you can implement dynamic scheduling by hand, as in the sketch below.
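A minimal sketch of what that could look like (the chunk size of 256 is an arbitrary tuning parameter, an assumption on my part): threads claim fixed-size chunks from a shared counter with an atomic capture, so the chunks each thread receives, and hence the indices it stores, increase monotonically.

uint next = 0;           // shared chunk counter
const uint chunk = 256;  // tuning parameter; adjust to taste
#pragma omp parallel
{
    for (;;) {
        uint start;
        #pragma omp atomic capture
        { start = next; next += chunk; }          // claim the next chunk
        if (start >= x*z) break;                  // no work left
        uint end = (start + chunk < x*z) ? start + chunk : x*z;
        for (uint k = start; k < end; k++) {
            // ... same body as the k loop above: if (inFunc(p, index)) myP[mynind++] = k;
        }
    }
}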
The code you provide looks like you are trying to fill the XY data in sequential order. In that case OpenMP multithreading is probably not the tool for the job, as threads should ideally avoid communicating as much as possible. You could introduce an atomic counter, but then again, it is probably going to be faster to just do it sequentially.
Also, what do you want to achieve by optimizing it? x and z are not too big, so I doubt you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.
If you do want parallel execution - map your indexes to the array, e.g. (not tested, but should do)
#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            uint idx = 2 * (i * z + j);   // each (i,j) pair gets its own fixed slot
            XY[idx] = i;
            XY[idx + 1] = j;
        }
    }
}
However, you will have gaps in your array XY then. Which may or may not be a problem for you.

How to parallelize do-while and while loops in OpenMP?

I'm trying to learn parallel programming with OpenMP, and I'm interested in parallelizing the following do-while loop, which has several while loops inside it:
do {
    while (left < (length - 1) && data[left] <= pivot) left++;
    while (right > 0 && data[right] >= pivot) right--;
    /* swap elements */
    if (left < right) {
        temp = data[left];
        data[left] = data[right];
        data[right] = temp;
    }
} while (left < right);
I haven't actually figured out how to parallelize while and do-while loops; I couldn't find any resource that specifically describes how to do so. I have found instructions for for loops, but I couldn't derive anything for while and do-while loops from them. So, could you please describe how I can parallelize the loops I provided here?
EDIT
I have transformed the do-while loop into the following code, which uses only for loops.
for (i = 1; i < length - 1; i++)
{
    if (data[left] > pivot)
    {
        i = length;
    }
    else
    {
        left = i;
    }
}
for (j = length - 1; j > 0; j--)
{
    if (data[right] < pivot)
    {
        j = 0;
    }
    else
    {
        right = j;
    }
}
/* swap elements */
if (left < right)
{
    temp = data[left];
    data[left] = data[right];
    data[right] = temp;
}
int leftCopy = left;
int rightCopy = right;
for (int leftCopy = left; leftCopy < right; leftCopy++)
{
    for (int new_i = left; new_i < length - 1; new_i++)
    {
        if (data[left] > pivot)
        {
            new_i = length;
        }
        else
        {
            left = new_i;
        }
    }
    for (int new_j = right; new_j > 0; new_j--)
    {
        if (data[right] < pivot)
        {
            new_j = 0;
        }
        else
        {
            right = new_j;
        }
    }
    leftCopy = left;
    /* swap elements */
    if (left < right)
    {
        temp = data[left];
        data[left] = data[right];
        data[right] = temp;
    }
}
This code works fine and produces correct result, but when I tried to parallelize the parts of above stated code, by changing the first two for loops to the following:
#pragma omp parallel default(none) firstprivate(left) private(i, tid) shared(length, pivot, data)
{
    #pragma omp for
    for (i = 1; i < length - 1; i++)
    {
        if (data[left] > pivot)
        {
            i = length;
        }
        else
        {
            left = i;
        }
    }
}
#pragma omp parallel default(none) firstprivate(right) private(j) shared(length, pivot, data)
{
    #pragma omp for
    for (j = length - 1; j > 0; j--)
    {
        if (data[right] < pivot)
        {
            j = 0;
        }
        else
        {
            right = j;
        }
    }
}
The speed is worse than the non-parallelized code. Please help me identify my problem.
Thanks
First of all, sorting algorithms are very hard to parallelize with OpenMP parallel loops, because the loop trip count is not deterministic: it depends on the input values, which are read at every iteration.
I don't think loop conditions such as data[left] <= pivot are going to work well, since the OpenMP runtime does not know how to partition the iteration space among the threads.
If you are still interested in parallel sorting algorithms, I suggest you read the literature first, to see which algorithms are really worth implementing due to their scalability. If you just want to learn OpenMP, I suggest you start with easier algorithms such as bucket sort, where the number of buckets is well known and does not frequently change.
Regarding the example you are trying to parallelize: while loops are not directly supported by OpenMP, because the number of iterations (the loop trip count) is not known in advance (otherwise they are easy to transform into for loops), so the iterations cannot be distributed among the threads. In addition, it is common for while loops to check a condition using the last iteration's result. This is called a read-after-write or true dependency and cannot be parallelized.
Your slowdown problem might be alleviated if you minimize the number of omp parallel regions and move them out of all your loops. Each parallel region may create and join the threads used in the parallel parts of the code, which is expensive.
You can still synchronize threads inside parallel blocks, so that the outcome is similar. In fact, all threads wait for each other at the end of an omp for construct by default, which makes things even easier.
#pragma omp parallel default(none) firstprivate(right, left) private(i, j) shared(length, pivot, data)
{
    #pragma omp for
    for (i = 1; i < length - 1; i++)
    {
        if (data[left] > pivot)
        {
            i = length;
        }
        else
        {
            left = i;
        }
    }
    #pragma omp for
    for (j = length - 1; j > 0; j--)
    {
        if (data[right] < pivot)
        {
            j = 0;
        }
        else
        {
            right = j;
        }
    }
} // end omp parallel

Parallelize counting for-loops with openmp

I have a 2D image in which I want to count all colors and store the result in an array. I know the number of colors, so I can set the size of the array beforehand. My problem is that the counting takes too long. How can I speed the counting up with OpenMP?
My current serial code is
std::vector<int> ref_color_num_thread;
ref_color_num.resize(ref_color.size());
std::fill(ref_color_num.begin(), ref_color_num.end(), 0);
ref_color_num_thread.resize(ref_color.size());
std::fill(ref_color_num_thread.begin(), ref_color_num_thread.end(), 0);

for (int i = 0; i < image.width(); i++)
{
    for (int j = 0; j < image.height(); j++)
    {
        for (int k = 0; k < (int)ref_color.size(); k++)
        {
            if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B)
                ref_color_num_thread[k]++;
        }
    }
}
My first approach was to put #pragma omp parallel for on each loop (one at a time), but every time I get a program crash because of invalid memory access. Do I have to use private() for my vector?
What you're doing is filling a histogram of your colors. This is equivalent to doing an array reduction in C/C++ with OpenMP. In C/C++, OpenMP does not have built-in support for this (but Fortran does, because the array size is known there, whereas in C/C++ it's only known for static arrays). However, it's easy to do an array reduction in C/C++ with OpenMP yourself.
#pragma omp parallel
{
    std::vector<int> ref_color_num_thread_private(ref_color.size(), 0);  // per-thread histogram
    #pragma omp for
    for (int i = 0; i < image.width(); i++) {
        for (int j = 0; j < image.height(); j++) {
            for (int k = 0; k < (int)ref_color.size(); k++) {
                if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B)
                    ref_color_num_thread_private[k]++;
            }
        }
    }
    #pragma omp critical
    {
        for (int i = 0; i < (int)ref_color.size(); i++) {
            ref_color_num_thread[i] += ref_color_num_thread_private[i];
        }
    }
}
I went into a lot more detail about this here: Fill histograms (array reduction) in parallel with OpenMP without using a critical section.
There I showed how to do an array reduction without a critical section, but it's a lot more tricky. You should test the first case and see if it works well for you first. As long as the number of colors (ref_color.size()) is small compared to the number of pixels, it should parallelize well. Otherwise, you might need to try the second case without a critical section.
There is a race condition if either of the outer two loops (i or j) is parallelized, because the inner loop (k) iterates over the shared vector. I think your crash is because of that.
You have to restructure your program. It is not trivial, but one idea is that each thread uses a local copy of the ref_color_num_thread vector. Once the computation is finished, you can sum up all the vectors.
If k is large enough to provide enough parallelism, you could exchange the loops: instead of iterating in the order i, j, k, iterate in the order k, i, j. If I'm not mistaken, there are no violated dependencies. Then you can parallelize the outer k loop and let the inner i and j loops execute sequentially, as in the sketch below.
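A minimal sketch of that loop exchange, under the same assumptions as the question's code (image, ref_color and ref_color_num_thread as defined above): each thread owns distinct values of k, so no two threads ever write the same counter.

// Parallelize over the color index k: thread-safe because each thread
// updates only its own elements of ref_color_num_thread.
#pragma omp parallel for
for (int k = 0; k < (int)ref_color.size(); k++) {
    for (int i = 0; i < image.width(); i++) {
        for (int j = 0; j < image.height(); j++) {
            if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B)
                ref_color_num_thread[k]++;
        }
    }
}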
Update:
#pragma omp for also supports reductions, for example:
#pragma omp for reduction(+ : nSum)
Here is a link to some documentation.
Maybe that can help you to restructure your program.
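As an additional note beyond the original answers: OpenMP 4.5 and later also support array-section reductions in C/C++, which turn the manual per-thread copy plus critical section above into a single clause. A minimal sketch, reusing the names from the question:

// Element-wise reduction over the vector's buffer (requires OpenMP 4.5+):
// each thread gets a private zero-initialized copy of counts[0..nbins),
// and the copies are summed into the original at the end of the loop.
int *counts = ref_color_num_thread.data();
const int nbins = (int)ref_color.size();
#pragma omp parallel for reduction(+ : counts[:nbins])
for (int i = 0; i < image.width(); i++) {
    for (int j = 0; j < image.height(); j++) {
        for (int k = 0; k < nbins; k++) {
            if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B)
                counts[k]++;
        }
    }
}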