How can I use openMP for this loop? - c++

I'm wondering if it is feasible to make this loop parallel using openMP.
Of coarse there is the issue with the race conditions. I'm unsure how to deal with the n in the inner loop being generated by the outerloop, and the race condition with where D=A[n]. Do you think it is practical to try and make this parallel?
for(n=0; n < 10000000; ++n) {
for (n2=0; n2< 100; ++n2) {
A[n]=A[n]+B[n2][n+C[n2]+200];
}
D=D+A[n];
}

Yes, this is indeed parallelizable assuming none of the pointers are aliased.
int D = 0; // Or whatever the type is.
#pragma omp parallel for reduction(+:D) private(n2)
for (n=0; n < 10000000; ++n) {
for (n2 = 0; n2 < 100; ++n2) {
A[n] = A[n] + B[n2][n + C[n2] + 200];
}
D += A[n];
}
It could actually be optimized somewhat as follows:
int D = 0; // Or whatever the type is.
#pragma omp parallel for reduction(+:D) private(n2)
for (n=0; n < 10000000; ++n) {
int tmp = A[n]
for (n2 = 0; n2 < 100; ++n2) {
tmp += B[n2][n + C[n2] + 200];
}
A[n] = tmp;
D += tmp;
}

Related

OMP for loop condition not a simple relational comparison

I have this program that isn't compiling due to error: condition of OpenMP for loop must be a relational comparison ('<', '<=', '>', '>=', or '!=') of loop variable 'i', referring to for (size_t i = 2; i * i <= n; i++). How can I modify and fix it without affecting performance? Is this an issue due to having an old OpenMP version? (Because I remember having a different issue before on another computer with an older version that is resolved now.)
#include <iostream>
#include <cstdio>
int main(int argc, char **argv)
{
if (argc != 2)
return EXIT_FAILURE;
size_t n;
if (sscanf(argv[1], "%zu", &n) == 0)
return EXIT_FAILURE;
auto *prime = new size_t[n + 1];
for (size_t i = 2; i <= n; i++)
prime[i] = i;
#pragma omp parallel for
for (size_t i = 2; i * i <= n; i++)
for (size_t j = i * i; j <= n; j += i)
prime[j] = 0;
size_t N = 0;
for (size_t i = 2; i <= n; i++)
if (prime[i] != 0)
N++;
std::cout << N << '\n';
}
Loop after the #pragma omp parallel for have to be a canonical form to be confirming. In your case the problem is with the test expression, which should be one of the following:
var relational-op b
b relational-op var
So, you have to use what was suggested by #Yakk: calculate the sqrt of n and compare it to i:
const size_t max_i=sqrt(n);
#pragma omp parallel for
for (size_t i = 2; i <= max_i; i++)
....

Why every time getting different output in openmp code?

Every time I execute cal() function with the same parameter I get different output. Function g() always calculate the same result for same input. Are threads overwriting any variable?
void cal(uint_fast64_t n) {
Bint num = N(n);
Bint total = 0, i, max_size(__UINT64_MAX__);
for(i = 1; i <= num; i+= max_size){
#pragma omp parallel shared(i,num,total)
{
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
Bint sum(0), k;
for(uint64_t j = id; (j < __UINT64_MAX__); j+=numthreads){
k = i+j;
if(k > num){
i = k;
break;
}
sum = sum + g(k);
}
#pragma omp critical
total += sum;
}
}
std::cout << total << std::endl;
}
if(k > num){
i = k;
break;
}
Here you modify the shared variable i (possibly multiple times in parallel) while other threads may be reading from it (for k = i+j), all without synchronization. This is a race condition and your code thus has Undefined Behavior.
The value of j depends on the value of id. If different threads are used to do the math, you'll get different results.
int id = omp_get_thread_num(); // <---
int numthreads = omp_get_num_threads();
Bint sum(0), k;
for(uint64_t j = id; (j < __UINT64_MAX__); j+=numthreads){ // <---
k = i+j;
if(k > num){
i = k; // <---
break;
}
sum = sum + g(k);
Further, you change i to k when k > num. This can happen much sooner or much later depending on which thread is picked up first to run the inner loop.
You may want to look at this question and answer.
Does an OpenMP ordered for always assign parts of the loop to threads in order, too?

Very slow mutex in LLVM/OpenMP

I wrote code to test the performance of openmp on win (Win7 x64, Corei7 3.4HGz) and on Mac (10.12.3 Core i7 2.7 HGz).
In xcode I made a console application setting the compiled default. I use LLVM 3.7 and OpenMP 5 (in opm.h i searched define KMP_VERSION_MAJOR=5, define KMP_VERSION_MINOR=0 and KMP_VERSION_BUILD = 20150701, libiopm5) on macos 10.12.3 (CPU - Corei7 2700GHz)
For win I use VS2010 Sp1. Additional I set c/C++ -> Optimization -> Optimization = Maximize Speed (O2), c/C++ -> Optimization ->Favor Soze Or Speed = Favor Fast code (Ot).
If I run the application in a single thread, the time difference corresponds to the frequency ratio of processors (approximately). But if you run 4 threads, the difference becomes tangible: win program be faster then mac program in ~70 times.
#include <cmath>
#include <mutex>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <omp.h>
#include <boost/chrono/chrono.hpp>
static double ActionWithNumber(double number)
{
double sum = 0.0f;
for (std::uint32_t i = 0; i < 50; i++)
{
double coeff = sqrt(pow(std::abs(number), 0.1));
double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
sum += sqrt(res);
}
return sum;
}
static double TestOpenMP(void)
{
const std::uint32_t len = 4000000;
double *a;
double *b;
double *c;
double sum = 0.0;
std::mutex _mutex;
a = new double[len];
b = new double[len];
c = new double[len];
for (std::uint32_t i = 0; i < len; i++)
{
c[i] = 0.0;
a[i] = sin((double)i);
b[i] = cos((double)i);
}
boost::chrono::time_point<boost::chrono::system_clock> start, end;
start = boost::chrono::system_clock::now();
double k = 2.0;
omp_set_num_threads(4);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
c[i] = k*a[i] + b[i] + k;
if (c[i] > 0.0)
{
c[i] += ActionWithNumber(c[i]);
}
else
{
c[i] -= ActionWithNumber(c[i]);
}
std::lock_guard<std::mutex> scoped(_mutex);
sum += c[i];
}
end = boost::chrono::system_clock::now();
boost::chrono::duration<double> elapsed_time = end - start;
double sum2 = 0.0;
for (std::uint32_t i = 0; i < len; i++)
{
sum2 += c[i];
c[i] /= sum2;
}
if (std::abs(sum - sum2) > 0.01) printf("Incorrect result.\n");
delete[] a;
delete[] b;
delete[] c;
return elapsed_time.count();
}
int main()
{
double sum = 0.0;
const std::uint32_t steps = 5;
for (std::uint32_t i = 0; i < steps; i++)
{
sum += TestOpenMP();
}
sum /= (double)steps;
std::cout << "Elapsed time = " << sum;
return 0;
}
I specifically use a mutex here to compare the performance of openmp on the "mac" and "win". On the "Win" function returns the time of 0.39 seconds. On the "Mac" function returns the time of 25 seconds, i.e. 70 times slower.
What is the cause of this difference?
First of all, thank for edit my post (i use translater to write text).
In the real app, I update the values in a huge matrix (20000х20000) in random order. Each thread determines the new value and writes it in a particular cell. I create a mutex for each row, since in most cases different threads write to different rows. But apparently in cases when 2 threads write in one row and there is a long lock. At the moment I can't divide the rows in different threads, since the order of records is determined by the FEM elements.
So just to put a critical section in there comes out, as it will block writes to the entire matrix.
I wrote code like in real application.
static double ActionWithNumber(double number)
{
const unsigned int steps = 5000;
double sum = 0.0f;
for (u32 i = 0; i < steps; i++)
{
double coeff = sqrt(pow(abs(number), 0.1));
double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
sum += sqrt(res);
}
sum /= (double)steps;
return sum;
}
static double RealAppTest(void)
{
const unsigned int elementsNum = 10000;
double* matrix;
unsigned int* elements;
boost::mutex* mutexes;
elements = new unsigned int[elementsNum*3];
matrix = new double[elementsNum*elementsNum];
mutexes = new boost::mutex[elementsNum];
for (unsigned int i = 0; i < elementsNum; i++)
for (unsigned int j = 0; j < elementsNum; j++)
matrix[i*elementsNum + j] = (double)(rand() % 100);
for (unsigned int i = 0; i < elementsNum; i++) //build FEM element like Triangle
{
elements[3*i] = rand()%(elementsNum-1);
elements[3*i+1] = rand()%(elementsNum-1);
elements[3*i+2] = rand()%(elementsNum-1);
}
boost::chrono::time_point<boost::chrono::system_clock> start, end;
start = boost::chrono::system_clock::now();
omp_set_num_threads(4);
#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
unsigned int* elems = &elements[3*i];
for (unsigned int j = 0; j < 3; j++)
{
//in here set mutex for row with index = elems[j];
boost::lock_guard<boost::mutex> lockup(mutexes[i]);
double res = 0.0;
for (unsigned int k = 0; k < 3; k++)
{
res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
}
for (unsigned int k = 0; k < 3; k++)
{
matrix[elems[j]*elementsNum + elems[k]] = res;
}
}
}
end = boost::chrono::system_clock::now();
boost::chrono::duration<double> elapsed_time = end - start;
delete[] elements;
delete[] matrix;
delete[] mutexes;
return elapsed_time.count();
}
int main()
{
double sum = 0.0;
const u32 steps = 5;
for (u32 i = 0; i < steps; i++)
{
sum += RealAppTest();
}
sum /= (double)steps;
std::cout<<"Elapsed time = " << sum;
return 0;
}
You're combining two different sets of threading/synchronization primitives - OpenMP, which is built into the compiler and has a runtime system, and manually creating a posix mutex with std::mutex. It's probably not surprising that there's some interoperability hiccups with some compiler/OS combinations.
My guess here is that in the slow case, the OpenMP runtime is going overboard to make sure that there's no interactions between higher-level ongoing OpenMP threading tasks and the manual mutex, and that doing so inside a tight loop causes the dramatic slowdown.
For mutex-like behaviour in the OpenMP framework, we can use critical sections:
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
//...
// replacing this: std::lock_guard<std::mutex> scoped(_mutex);
#pragma omp critical
sum += c[i];
}
or explicit locks:
omp_lock_t sumlock;
omp_init_lock(&sumlock);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
//...
// replacing this: std::lock_guard<std::mutex> scoped(_mutex);
omp_set_lock(&sumlock);
sum += c[i];
omp_unset_lock(&sumlock);
}
omp_destroy_lock(&sumlock);
We get much more reasonable timings:
$ time ./openmp-original
real 1m41.119s
user 1m15.961s
sys 1m53.919s
$ time ./openmp-critical
real 0m16.470s
user 1m2.313s
sys 0m0.599s
$ time ./openmp-locks
real 0m15.819s
user 1m0.820s
sys 0m0.276s
Updated: There's no problem with using an array of openmp locks in exactly the same way as the mutexes:
omp_lock_t sumlocks[elementsNum];
for (unsigned idx=0; idx<elementsNum; idx++)
omp_init_lock(&(sumlocks[idx]));
//...
#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
unsigned int* elems = &elements[3*i];
for (unsigned int j = 0; j < 3; j++)
{
//in here set mutex for row with index = elems[j];
double res = 0.0;
for (unsigned int k = 0; k < 3; k++)
{
res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
}
omp_set_lock(&(sumlocks[i]));
for (unsigned int k = 0; k < 3; k++)
{
matrix[elems[j]*elementsNum + elems[k]] = res;
}
omp_unset_lock(&(sumlocks[i]));
}
}
for (unsigned idx=0; idx<elementsNum; idx++)
omp_destroy_lock(&(sumlocks[idx]));

Parallelize four and more nested loops with CUDA

I am working on a compiler generating parallel C++ code. I am new to CUDA programming but I am trying to parallelize the C++ code with CUDA.
Currently if I have the following sequential C++ code:
for(int i = 0; i < a; i++) {
for(int j = 0; j < b; j++) {
for(int k = 0; k < c; k++) {
A[i*y*z + j*z + k*z +l] = 1;
}
}
}
and this results in the following CUDA code:
__global__ void kernelExample() {
int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);
int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);
int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);
A[_cu_x*y*z + _cu_y*z + _cu_z] = 1;
}
so each loop nest is mapped to one dimension, but what would be the correct way to parallelize four and more nested loops:
for(int i = 0; i < a; i++) {
for(int j = 0; j < b; j++) {
for(int k = 0; k < c; k++) {
for(int l = 0; l < d; l++) {
A[i*x*y*z + j*y*z + k*z +l] = 1;
}
}
}
}
Is there any similar way? Noteworthy: all loop dimensions are parallel and there are no dependencies between iterations.
Thanks in advance!
EDIT: the goal is to map all iterations to CUDA threads, since all iterations are independent and could be executed concurrently.
You could keep the outer loop unchanged. Also it is better to use .x as inner most loop so you can access the global memory efficiently.
__global__ void kernelExample() {
int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);
int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);
int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);
for(int i = 0; i < a; i++) {
A[i*x*y*z + _cu_z*y*z + _cu_y*z + _cu_x] = 1;
}
}
However if your a,b,c,d are all very small, you may not be able to get enough parallelism. In that case you could convert a linear index to n-D indices.
__global__ void kernelExample() {
int tid = ((blockIdx.x*blockDim.x)+threadIdx.x);
int i = tid / (b*c*d);
int j = tid / (c*d) % b;
int k = tid / d % c;
int l = tid % d;
A[i*x*y*z + j*y*z + k*z + l] = 1;
}
But be careful that calculating i,j,k,l may introduce a lot of overhead as integer division and mod are slow on GPU. As an alternative you could map i,j to .z and .y, and calculate only k,l and more dimensions from .x in a similar way.

OpenMP: Nested for-loop, barely any difference in execution time

I am doing some image processing and have a nested for loop. I want to implement multiprocessing using OpenMP. The for loop looks like this, where I have added the pragma tags and declared some of the variables private as well.
int a,b,j, idx;
#pragma omp parallel for private(b,j,sumG,sumGI)
for(a = 0; a < ny; ++a)
{
for(b = 0; b < nx; ++b)
{
idx = a*ny+b;
if (imMask[idx] == 0)
{
Wshw[idx] = 0;
continue;
}
sumG = 0;
sumGI = 0;
for(j = a; j < ny; ++j)
{
sumG += shadowM[j-a];
sumGI += shadowM[j-a] * imBlurred[nx*j + b];
}
Wshw[idx] = sumGI / sumG;
}
}
The size of both nx and ny is large and I thought that, using OpenMP, I would get a descent decrease in execution time, instead there is almost no difference. Am I doing something wrong when I implement the multi-threading maybe?
You have a race conditon in idx. You need to make it private as well.
However, instead you could try something like this.
int a,b,j, idx;
#pragma omp parallel for private(a,b,j,sumG,sumGI)
for(idx=0; idx<ny*nx; ++idx) {
if (imMask[idx] == 0)
{
Wshw[idx] = 0;
continue;
}
sumG = 0;
sumGI = 0;
a=idx/ny;
b=idx%ny;
for(j = a; j < ny; ++j) {
sumG += shadowM[j-a];
sumGI += shadowM[j-a] * imBlurred[nx*j + b];
}
Wshw[idx] = sumGI / sumG;
}
You might be able to simiply the inner loop as well as a functcion of idx instead a and b.