OpenMP and #pragma omp atomic - c++

I have an issue with OpenMP. The MSVS compiler gives me the error "pragma omp atomic has improper form".
I don't have any idea why.
Code: (the program computes pi by numerical integration)
#include <stdio.h>
#include <time.h>
#include <omp.h>
long long num_steps = 1000000000;
double step;
int main(int argc, char* argv[])
{
clock_t start, stop;
double x, pi, sum=0.0;
int i;
step = 1./(double)num_steps;
start = clock();
#pragma omp parallel for
for (i=0; i<num_steps; i++)
{
x = (i + .5)*step;
#pragma omp atomic //this part contains error
sum = sum + 4.0/(1.+ x*x);
}
pi = sum*step;
stop = clock();
// some printf to show results
return 0;
}

Your program is a perfectly syntactically correct OpenMP code by the current OpenMP standards (e.g. it compiles unmodified with GCC 4.7.1), except that x should be declared private (which is not a syntactic but rather a semantic error). Unfortunately Microsoft Visual C++ implements a very old OpenMP specification (2.0 from March 2002) which only allows the following statements as acceptable in an atomic construct:
x binop= expr
x++
++x
x--
--x
Later versions included x = x binop expr, but MSVC is forever stuck at OpenMP version 2.0 even in VS2012. Just for comparison, the current OpenMP version is 3.1 and we expect 4.0 to come up in the following months.
In OpenMP 2.0 your statement should read:
#pragma omp atomic
sum += 4.0/(1.+ x*x);
But as already noted, it would be better (and generally faster) to use a reduction:
#pragma omp parallel for private(x) reduction(+:sum)
for (i=0; i<num_steps; i++)
{
x = (i + .5)*step;
sum = sum + 4.0/(1.+ x*x);
}
(you could also write sum += 4.0/(1.+ x*x);)
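For reference, the reduction version can be written as a self-contained function. This is only a sketch: compute_pi is a hypothetical name that does not appear in the question, and x is declared inside the loop so that no private clause is needed. The code also builds (and runs serially) when OpenMP is disabled, because the pragma is then ignored.

```cpp
#include <cassert>

// Midpoint-rule estimate of pi as the integral of 4/(1+x^2) over [0,1].
// compute_pi is a hypothetical helper name, not taken from the question.
double compute_pi(long long num_steps)
{
    assert(num_steps > 0);
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    // reduction(+:sum) gives every thread a private partial sum and
    // adds the partial sums together once the loop has finished.
#pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < num_steps; i++) {
        const double x = (i + 0.5) * step;  // local to the iteration, hence private
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}
```

Calling compute_pi(1000000) should give a value close to 3.14159; the partial sums are combined in an unspecified order, so the last digits may differ from run to run.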

Try changing sum = sum + 4.0/(1.+ x*x) to sum += 4.0/(1.+ x*x), although I'm afraid this might not work either. If it doesn't, you can split the work like this:
x = (i + .5)*step;
double xx = 4.0/(1.+ x*x);
#pragma omp atomic
sum += xx;
This should work, but I am not sure whether it fits your needs.

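Replace:
#pragma omp atomic
with #pragma omp critical, or add a reduction(+:sum) clause to the parallel for directive. Note that reduction is a clause, not a standalone directive, so it cannot be placed directly in front of the statement.
I guess the reduction will be the better option, as your update has the form sum += var;
Do it like this:
#pragma omp parallel for private(x) reduction(+:sum)
for (i=0; i<num_steps; i++)
{
x = (i + .5)*step;
sum += 4.0/(1.+ x*x);
}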

You probably need a recap of #pragma more than a solution to this particular problem.
The #pragma directive itself is part of the C and C++ standards, but what each pragma does is implementation-defined: pragmas are compiler-specific and often platform-specific instructions, and a compiler ignores the ones it does not recognize. The behaviour can therefore differ between compilers, or between machines with different setups.
As a consequence, an issue with a pragma can only be solved by consulting the official documentation of your compiler for your platform of choice; here are 2 links.
http://msdn.microsoft.com/en-us/library/d9x1s805.aspx
http://msdn.microsoft.com/en-us/library/0ykxx45t.aspx
The C and C++ standards do not define the OpenMP pragmas; the OpenMP specification and your compiler's documentation do.

Related

warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored

I'm currently trying to use OpenMP for parallel computing.
I've written the following basic code.
However it returns the following warning:
warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored.
Changing the number of threads does not change the required running time since omp.h is ignored for some reason which is unclear to me.
Can someone help me out?
#include <stdio.h>
#include <omp.h>
#include <math.h>
int main(void)
{
double ts;
double something;
clock_t begin = clock();
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
clock_t end = clock();
ts = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Time elapsed is %f seconds", ts);
}
To get OpenMP support you need to tell your compiler explicitly.
g++, gcc and clang need the option -fopenmp
MSVC needs the option /openmp
Aside from the obvious need to compile with the -fopenmp flag, your code has some problems worth pointing out, namely:
To measure time, use omp_get_wtime() instead of clock(): clock() gives you the number of clock ticks accumulated across all threads, whereas omp_get_wtime() returns the elapsed wall-clock time.
The other problem is:
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
the iterations of the loop are not being assigned to threads as you wanted. Because you have added the parallel clause again to #pragma omp for, and assuming that nested parallelism is disabled (which it is by default), each of the threads created in the outer parallel region will "sequentially" execute the whole loop within that region. Consequently, for n = 6 (i.e., replacing pow(10,7) with 6 for illustration) and 4 threads, you would have the following block of code:
for (int i=0; i<n; i++) {
something=sqrt(123456);
}
being executed in full by each of the 4 threads, for a total of 6 x 4 = 24 loop iterations (i.e., the number of loop iterations multiplied by the number of threads).
To fix this adapt your code to the following:
#pragma omp parallel for num_threads(4)
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
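Both points can be combined in one small sketch. The names wall_seconds and sqrt_sum are hypothetical, and the loop body sums sqrt(i) through a reduction instead of writing to the shared variable something, which is an assumption made so that the result can be checked. The fallback to clock() is only there so that the code still builds when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cmath>
#include <ctime>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall-clock seconds. Falls back to clock() when OpenMP is not enabled,
// in which case the pragma below is ignored and the loop runs serially.
static double wall_seconds()
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)std::clock() / CLOCKS_PER_SEC;
#endif
}

// Hypothetical helper: sums sqrt(i) for i in [0, n) and reports the
// elapsed time through *elapsed.
double sqrt_sum(int n, double *elapsed)
{
    assert(n >= 0 && elapsed != nullptr);
    double total = 0.0;
    const double begin = wall_seconds();
    // One combined directive: a team of (at most) 4 threads is created once
    // and the n iterations are divided among its members.
#pragma omp parallel for num_threads(4) reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += std::sqrt((double)i);
    }
    *elapsed = wall_seconds() - begin;
    return total;
}
```

With OpenMP enabled, the elapsed time reported here is wall-clock time, so it should shrink as threads are added, provided the loop does enough work to outweigh the cost of starting the threads.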

How to optimize OpenMP code, for example a histogram

I am dealing with huge point cloud data and am trying to use OpenMP.
But I have found that it's very hard for a beginner to optimize the code.
For example, I want to compute the histogram of the point cloud (each point carries other information besides x, y, z). I wrote the code below:
#pragma omp parallel num_threads(N_THREAD) shared(hist,partHist)
{
int tId = omp_get_thread_num();
int index = tId * partCount;
#pragma omp for nowait
for(int i =0;i<partCount;++i)
{
if (index + i < size)
#pragma omp atomic
partHist[tId][(int)floor((array[index + i] - minValue) / stride)]++;
}
#pragma omp critical
{
for (int i = 0; i < binCount; ++i)
hist[i] += partHist[tId][i];
}
}
The code is run on Linux on an i7-9700K, compiled with g++ and using OpenMP 4.0.
I have two questions:
The data set has at least about 10^8 points and I use 128 threads, but it is slower than the serial version. How can I optimize the code?
Are there rules I can follow to optimize the code when similar problems come up?
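One common way to restructure such a histogram is sketched below, under several assumptions: the function name histogram and its signature are hypothetical, out-of-range values are clamped into the first or last bin, and the thread count is left at the runtime default instead of requesting far more threads than the CPU has cores. Each thread counts into its own private array, so the counting loop needs no atomic, and #pragma omp for splits the index range itself, so no manual tId * partCount offset is needed.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: counts values into binCount bins of width stride,
// starting at minValue. Out-of-range values are clamped (an assumption).
std::vector<long> histogram(const std::vector<double> &values,
                            double minValue, double stride, int binCount)
{
    assert(stride > 0.0 && binCount > 0);
    std::vector<long> hist(binCount, 0);
    const long n = (long)values.size();
#pragma omp parallel
    {
        // Private to each thread, so the increments below cannot collide.
        std::vector<long> local(binCount, 0);
#pragma omp for nowait
        for (long i = 0; i < n; ++i) {
            int bin = (int)std::floor((values[i] - minValue) / stride);
            if (bin < 0) bin = 0;
            if (bin >= binCount) bin = binCount - 1;
            ++local[bin];
        }
        // One short merge per thread; this is the only synchronized part.
#pragma omp critical
        for (int b = 0; b < binCount; ++b)
            hist[b] += local[b];
    }
    return hist;
}
```

Whether this beats the serial version still depends on the data size, the number of bins and the memory bandwidth, so it is worth measuring with a thread count close to the number of cores.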

OpenMP parallel-for efficiency query

Please consider the following simple code for summing up values in a parallel for loop:
int nMaxThreads = omp_get_max_threads();
int nTotalSum = 0;
#pragma omp parallel for num_threads(nMaxThreads) \
reduction(+:nTotalSum)
for (int i = 0; i < 4; i++)
{
nTotalSum += i;
cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}
When I run this on a two-core machine, the output I get is
0: nTotalSum is 0
0: nTotalSum is 1
1: nTotalSum is 2
1: nTotalSum is 5
This suggests to me that the critical section, i.e. the update of nTotalSum, is being executed on each loop. This seems like a waste, when all each thread has to do is calculate a 'local' sum of the values it is adding then update nTotalSum with this 'local sum' after it has done so.
Is my interpretation of the output correct, and if so, how can I make it more efficient? Note I tried the following:
#pragma omp parallel for num_threads(nMaxThreads) \
reduction(+:nTotalSum)
int nLocalSum = 0;
for (int i = 0; i < 4; i++)
{
nLocalSum += i;
}
nTotalSum += nLocalSum;
...but the compiler complained stating that it was expecting a for loop following the pragma omp parallel for statement...
Your output does in fact not indicate a critical section during the loop. Each thread has its own zero-initialized copy, thread 0 working on i = 0,1, thread 1 working on i = 2,3. At the end OpenMP takes care of adding the local copies to the original.
You should not try to implement it yourself unless you have specific evidence that you can do it more efficiently. See for example this question / answer.
Your manual version would work if you split the parallel / for into two directives:
int nTotalSum = 0;
#pragma omp parallel
{
// Declare the local variable here!
// Then it's implicitly private and properly initialized
int localSum = 0;
#pragma omp for
for (int i = 0; i < 4; i++) {
localSum += i;
cout << omp_get_thread_num() << ": localSum is " << localSum << endl;
}
// Do not forget the atomic, or it would be a race condition!
// Alternative would be a critical, but that's less efficient
#pragma omp atomic
nTotalSum += localSum;
}
I think it's likely that your OpenMP implementation does the reduction just like that.
Each OMP thread has its own copy of nTotalSum. At the end of the OMP section these are combined back into the original nTotalSum. The output you're seeing comes from running loop iterations (0,1) in one thread, and (2,3) in another thread. If you output nTotalSum at the end of your loop, you should see the expected result of 6.
In your nLocalSum example, move the declaration of nLocalSum to before the #pragma omp line: the for loop must immediately follow the pragma. Note that nLocalSum is then shared between the threads unless it is also listed in the reduction clause.
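The behaviour described in these answers can be checked with a minimal sketch. The function name sum_below is hypothetical; the loop is the one from the question without the printing.

```cpp
#include <cassert>

// Hypothetical helper: sums the integers 0..n-1 using a reduction clause.
int sum_below(int n)
{
    assert(n >= 0);
    int total = 0;
#pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += i;  // updates this thread's private copy, no locking involved
    }
    // The private copies have been added into total by this point.
    return total;
}
```

For n = 4 the function returns 6 regardless of how the iterations were distributed, because the combination of the per-thread copies happens once, after the loop.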
from my parallel programming in openmp book:
reduction clause can be trickier to understand, has both private and shared storage behavior. The reduction attribute is used on objects that are the target of an arithmetic reduction. This can be important in many applications...reduction allows it to be implemented by the compiler efficiently... this is such a common operation that openmp has the reduction data scope clause just to handle them...most common example is final summation of temporary local variables at the end of the parallel construct.
Correction to your second example:
total_sum = 0; /* do all variable initialization prior to the omp pragma */
#pragma omp parallel for \
reduction(+:total_sum)
for (int i = 0; i < 4; i++)
{
total_sum += i; /* you used nLocalSum here */
}
/* at this point in the code,
all threads will have run their share of the `for` loop, with total_sum local to each thread;
OpenMP then '+'s together the values of `total_sum` coming from each thread because we used reduction.
Do not do an explicit nTotalSum += nLocalSum after the omp for loop; it's not needed, the reduction clause takes care of this.
(i is declared in the for statement, so it is already private and needs no private(i) clause; C/C++ has no `end parallel for` directive, the construct ends with the loop.)
*/
In your first example, I'm not sure what num_threads(nMaxThreads) is doing in #pragma omp parallel for num_threads(nMaxThreads) reduction(+:nTotalSum). But I suspect the weird output might be caused by print buffering.
In any case, the reduction clause is very useful and very efficient if used properly. It would be more obvious in a more complicated, real-world example.
Your posted example is so simple that it doesn't show off the usefulness of the reduction clause. Strictly speaking, since all threads are just doing a summation, you could simply make total_sum a shared variable in the parallel section and have all threads add into it; the answer would still be correct at the end, provided each update is protected by a critical directive.

OpenMP loop runs code slower than serial loop

I'm running this neat little gravity simulation. In serial execution it takes a little more than 4 minutes; when I parallelize one loop inside it, the run time increases to about 7 minutes, and if I try parallelizing more loops it increases to more than 20 minutes. I'm posting a slightly shortened version without some initializations, but I think they don't matter. I'm posting the 7-minute version, with some comments where I wanted to add parallelization to loops. Thank you for helping me with my messy code.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#define numb 1000
int main(){
double pos[numb][3],a[numb][3],a_local[3],v[numb][3];
memset(v, 0.0, numb*3*sizeof(double));
double richtung[3];
double t,deltat=0.0,r12 = 0.0,endt=10.;
unsigned seed;
int tcount=0;
#pragma omp parallel private(seed) shared(pos)
{
seed = 25235 + 16*omp_get_thread_num();
#pragma omp for
for(int i=0;i<numb;i++){
for(int j=0;j<3;j++){
pos[i][j] = (double) (rand_r(&seed) % 100000 - 50000);
}
}
}
for(t=0.;t<endt;t+=deltat){
printf("\r%le", t);
tcount++;
#pragma omp parallel for shared(pos,v)
for(int id=0; id<numb; id++){
for(int l=0;l<3;l++){
pos[id][l] = pos[id][l]+(0.5*deltat*v[id][l]);
v[id][l] = v[id][l]+a[id][l]*(deltat);
}
}
memset(a, 0.0, numb*3*sizeof(double));
memset(a_local, 0.0, 3*sizeof(double));
#pragma omp parallel for private(r12,richtung) shared(a,pos)
for(int id=0; id <numb; ++id){
for(int id2=0; id2<id; id2++){
for(int k=0;k<3;k++){
r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
}
for(int k=0; k<3;k++){
richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
a[id][k] += richtung[k]/(((r12)*(r12)));
a_local[k] += (-1.0)*richtung[k]/(((r12)*(r12)));
#pragma omp critical
{
a[id2][k] += a_local[k];
}
}
r12=0.0;
}
}
#pragma omp parallel for shared(pos)
for(int id =0; id<numb; id++){
for(int k=0;k<3;k++){
pos[id][k] = pos[id][k]+(0.5*deltat*v[id][k]);
}
}
deltat= 0.01;
}
return 0;
}
I'm using
g++ -fopenmp -o test_grav test_grav.c
to compile the code and I'm measuring time in the shell just by
time ./test_grav.
When I used
get_numb_threads()
to get the number of threads, it displayed 4. top also shows more than 300% (sometimes ~380%) CPU usage. An interesting little fact: if I start the parallel region before the time loop (meaning the outermost for loop) and without any actual #pragma omp for, it is equivalent to making one parallel region for every major loop (the three second-outermost loops). So I think it is an optimization issue, but I don't know how to solve it. Can anyone help me?
Edit: I made the example verifiable and lowered numbers like numb to make it easier to test, but the problem still occurs, even when I remove the critical region as suggested by TheQuantumPhysicist, just not as severely.
I believe that critical section is the cause of the problem. Consider taking all critical sections outside the parallelized loop and running them after the parallelization is over.
Try this:
#pragma omp parallel shared(a,pos)
{
#pragma omp for private(id2,k,r12,richtung,a_local)
for(id=0; id <numb; ++id){
for(id2=0; id2<id; id2++){
for(k=0;k<3;k++){
r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
}
for(k =0; k<3;k++){
richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
a[id][k] += richtung[k]/(((r12)*(r12))+epsilon);
a_local[k]+= richtung[k]/(((r12)*(r12))+epsilon)*(-1.0);
}
}
}
}
for(id=0; id <numb; ++id){
for(id2=0; id2<id; id2++){
for(k=0;k<3;k++){
a[id2][k] += a_local[k];
}
}
}
Critical sections lead to locking and blocking. If you can keep these sections serial, you'll gain a lot in performance.
Notice that I'm talking about a syntactic solution; I don't know whether it works for your case. But to be clear: if every point in your series depends on the next one, then parallelizing is not a solution for you, at least not simple parallelization using OpenMP.
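A different way to get rid of the critical section, which is not what the code above does, is to let every loop iteration write only to its own row of a. The sketch below swaps in that technique under stated assumptions: compute_accel, Vec3 and the coupling constant g are hypothetical names, the distance is the Euclidean one (the question sums sqrt of the squared components instead), and two bodies at the same position would cause a division by zero, which an epsilon like the one above would have to guard against.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

typedef std::array<double, 3> Vec3;  // hypothetical alias for one 3D vector

// Hypothetical helper: a[i] receives the acceleration of body i caused by
// all other bodies; g is an assumed coupling constant.
void compute_accel(const std::vector<Vec3> &pos, std::vector<Vec3> &a, double g)
{
    assert(a.size() == pos.size());
    const int n = (int)pos.size();
#pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        Vec3 acc = {{0.0, 0.0, 0.0}};  // declared in the loop, hence private
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            double d[3];
            double r2 = 0.0;
            for (int k = 0; k < 3; ++k) {
                d[k] = pos[j][k] - pos[i][k];
                r2 += d[k] * d[k];
            }
            const double r = std::sqrt(r2);  // Euclidean distance (assumption)
            for (int k = 0; k < 3; ++k)
                acc[k] += g * d[k] / (r2 * r);
        }
        a[i] = acc;  // only iteration i writes row i, so no critical is needed
    }
}
```

The price is that every pair is evaluated twice, once from each side. In exchange, no thread ever writes to a row that another thread owns, so the loop needs neither critical nor atomic.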

Multithreaded Program for Sparse Matrices

I am a newbie to multithreading. I am trying to design a program that solves a sparse matrix. In my code I call a vector-vector dot product and a matrix-vector product as subroutines many times to arrive at the final solution. I am trying to parallelize the code using OpenMP (especially the above two subroutines).
I also have sequential code in between which I do not intend to parallelize.
My question is: how do I handle the threads created when a subroutine is called? Should I put a barrier at the end of every subroutine call?
Also, where should I set the number of threads?
Mat_Vec_Mult(MAT,x0,rm);
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
rm[i] = b[i] - rm[i];
#pragma omp barrier
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
xm[i] = x0[i];
#pragma omp barrier
double* pm = (double*) malloc(numcols*sizeof(double));
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
pm[i] = rm[i];
#pragma omp barrier
scalarProd(rm,rm,numcols);
Thanks
EDIT:
for the scalar dotproduct, I am using the following piece of code:
double scalarProd(double* vec1, double* vec2, int n){
double prod = 0.0;
int chunk = 10;
int i;
//double* c = (double*) malloc(n*sizeof(double));
omp_set_num_threads(4);
// #pragma omp parallel shared(vec1,vec2,c,prod) private(i)
#pragma omp parallel
{
double pprod = 0.0;
#pragma omp for
for(i=0;i<n;i++) {
pprod += vec1[i]*vec2[i];
}
//#pragma omp for reduction (+:prod)
#pragma omp critical
for(i=0;i<n;i++) {
prod += pprod;
}
}
return prod;
}
I have now added the time calculation code in my ConjugateGradient function as below:
start_dotprod = omp_get_wtime();
rm_rm_old = scalarProd(rm,rm,MAT->ncols);
run_dotprod = omp_get_wtime() - start_dotprod;
fprintf(timing,"Time taken by rm_rm dot product : %lf \n",run_dotprod);
Observed results: time taken for the dot product, sequential version: 0.000007 s; parallel version: 0.002110 s
I am doing a simple compile using the gcc -fopenmp command on Linux on my Intel i7 laptop.
I am currently using a matrix of size n = 5000.
I am getting a huge slowdown overall, since the same dot product gets called many times until convergence is achieved (around 80k times).
Please suggest some improvements. Any help is much appreciated!
Honestly, I would suggest parallelizing at a higher level. By this I mean trying to minimize the number of #pragma omp parallels you are using. Every time you try and split up the work among your threads, there is an OpenMP overhead. Try and avoid this whenever possible.
So in your case at the very least I would try:
Mat_Vec_Mult(MAT,x0,rm);
double* pm = (double*) malloc(numcols*sizeof(double)); // must be performed once outside of parallel region
// all threads forked and created once here
#pragma omp parallel for schedule(static)
for(int i = 0; i < numcols; i++) {
rm[i] = b[i] - rm[i]; // (1)
xm[i] = x0[i]; // (2) does not require (1)
pm[i] = rm[i]; // (3) requires (1) at this i, not (2)
}
// implicit barrier at the end of omp for
// implicit join of all threads at the end of omp parallel
scalarProd(rm,rm,numcols);
Notice how I show that no barriers are actually necessary between your loops anyway.
If the majority of your time had been spent in this computation stage, you will surely be seeing considerable improvement. However, I'm reasonably confident that the majority of your time is being spent in Mat_Vec_Mult() and maybe also scalarProd(), so the amount of time you'll be saving is probably minimal.
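Going one step further, the element-wise updates and the following dot product can share a single parallel region, so that the threads are started only once. This is a sketch under assumptions: init_residual is a hypothetical name, std::vector replaces the raw pointers, and the dot product is computed inline instead of calling scalarProd.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: performs the three element-wise updates and returns
// the dot product rm.rm, all inside one parallel region.
double init_residual(const std::vector<double> &b, const std::vector<double> &x0,
                     std::vector<double> &rm, std::vector<double> &xm,
                     std::vector<double> &pm)
{
    const int n = (int)b.size();
    assert(x0.size() == b.size() && rm.size() == b.size());
    assert(xm.size() == b.size() && pm.size() == b.size());
    double rr = 0.0;  // shared in the region, target of the reduction
#pragma omp parallel
    {
#pragma omp for schedule(static)
        for (int i = 0; i < n; ++i) {
            rm[i] = b[i] - rm[i];
            xm[i] = x0[i];
            pm[i] = rm[i];
        }
        // The implicit barrier of the loop above guarantees that every
        // rm[i] is final before the dot product starts.
#pragma omp for schedule(static) reduction(+:rr)
        for (int i = 0; i < n; ++i)
            rr += rm[i] * rm[i];
    }
    return rr;
}
```

For vectors as small as n = 5000 the thread start-up cost can still exceed the work itself, so this mainly illustrates how to reuse one team of threads for several loops.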
** EDIT **
And as per your edit, I am seeing a few problems. (1) Always compile with -O3 when you are testing the performance of your algorithm. (2) You won't be able to improve the runtime of something that takes 0.000007 s to complete; that's nearly instantaneous. This goes back to what I said previously: try to parallelize at a higher level. The CG method is inherently a sequential algorithm, but there are certainly research papers detailing parallel CG. (3) Your implementation of the scalar product is not optimal; it is also incorrect, because the loop under #pragma omp critical is executed in full by every thread and adds pprod to prod n times instead of once. Indeed, I suspect your implementation of the matrix-vector product is not optimal either. I would personally do the following:
double scalarProd(double* vec1, double* vec2, int n) {
double prod = 0.0;
int i;
// omp_set_num_threads(4); this should be done once during initialization somewhere previously in your program
#pragma omp parallel for private(i) reduction(+:prod)
for (i = 0; i < n; ++i) {
prod += vec1[i]*vec2[i];
}
return prod;
}
(4) There are entire libraries (LAPACK, BLAS, etc.) that have highly optimized matrix-vector, vector-vector, etc. operations. Most linear algebra libraries are built upon them. Therefore, I'd suggest using one of those libraries for your two operations before you start reinventing the wheel and implementing your own.