I use android-ndk-r10d to write C++ code in Android Studio. Now I want to use OpenMP, so I added the following to Android.mk:
LOCAL_CFLAGS += -fopenmp
LOCAL_LDFLAGS += -fopenmp
and added the following to myapp.cpp:
#include <omp.h>
#pragma omp parallel for
for (int i = 1, ii = 0; i < outImage[0]->height; i += 2, ii = i >> 1) {
    /* Do work... */
}
but the Gradle build fails with an error pointing exactly at the #pragma omp parallel for line.
How can I fix this syntax error?
Maybe the compiler does not like the complex structure of the for() statement. OpenMP needs to be able to compute the number of iterations before the loop starts, and I also doubt that it can deal with two loop variables (i and ii). Try a simple loop with fixed limits, like:
int kmax = outImage[0]->height / 2; // number of iterations of i = 1, 3, 5, ... < height
#pragma omp parallel for
for (int k = 0; k < kmax; k++)
{
    int i = k*2 + 1; // reconstruct the original loop variable
    int ii = i >> 1; // always equal to k
    /* Do work ... */
}
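If your compiler accepts strided canonical loops (OpenMP has long allowed a constant increment such as i += 2), an even simpler variant, sketched here and untested on the r10d toolchain, is to keep i as the only control variable and derive ii inside the body:

#pragma omp parallel for
for (int i = 1; i < outImage[0]->height; i += 2)
{
    int ii = i >> 1; // computed per iteration instead of carried as a second loop variable
    /* Do work... */
}

The part that actually violates the canonical loop form is the comma-separated second variable in the init and increment expressions, not the stride.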
I am trying to understand/test OpenMP with GPU offload. However, I am confused because some examples/info (1, 2, 3) on the internet are analogous or similar to mine, yet my example does not work as I think it should. I am using g++ 9.4 on Ubuntu 20.04 LTS and have also installed gcc-9-offload-nvptx.
My example that does not work but is similar to this one:
#include <cstdio>  // printf
#include <cstdlib> // atoi, exit
#include <vector>
int main(int argc, char *argv[]) {
    typedef double myfloat;
    if (argc != 2) exit(1);
    size_t size = atoi(argv[1]);
    printf("Size: %zu\n", size);

    std::vector<myfloat> data_1(size, 2);
    myfloat *data1_ptr = data_1.data();

    myfloat sum = -1;
    #pragma omp target map(tofrom:sum) map(from: data1_ptr[0:size])
    #pragma omp teams distribute parallel for simd reduction(+:sum) collapse(2)
    for (size_t i = 0; i < size; ++i) {
        for (size_t j = 0; j < size; ++j) {
            myfloat term1 = data1_ptr[i] * i;
            sum += term1 / (1 + term1 * term1 * term1);
        }
    }
    printf("sum: %.2f\n", sum);
    return 0;
}
When I compile it with g++ main.cpp -o test -fopenmp -fcf-protection=none -fno-stack-protector, I get the following:
stack_example.cpp: In function ‘main._omp_fn.0.hsa.0’:
cc1plus: warning: could not emit HSAIL for the function [-Whsa]
cc1plus: note: support for HSA does not implement non-gridified OpenMP parallel constructs.
It does compile, but when running it with
./test 10000
the printed sum is still -1. I think the sum value computed on the GPU was not returned properly, but I explicitly map it, so shouldn't it be copied back? Or what am I doing wrong?
EDIT 1
I was asked to modify my code because there was a historically grown, redundant for loop, and also because sum was initialized with -1. I fixed both and also compiled with gcc-11, which, unlike gcc-9, did not emit any warning or note. However, the behavior is similar:
Size: 100
Number of devices: 2
sum: 0.00
I checked with nvtop: the GPU is used. Because there are two GPUs I can even switch the device, and the switch is visible in nvtop.
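(The "Number of devices" line in the output above presumably comes from omp_get_num_devices(); a minimal, self-contained check of this kind, my sketch rather than the exact code from the edit, would be:)

#include <cstdio>
#include <omp.h>

int main()
{
    // Reports how many offload targets the OpenMP runtime can see;
    // 0 means target regions will silently fall back to the host.
    printf("Number of devices: %d\n", omp_get_num_devices());
    return 0;
}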
Solution:
The fix is very easy and stupid. Changing
map(from: data1_ptr[0:size])
to
map(tofrom: data1_ptr[0:size])
did the trick.
Even though I am not writing to the array, this turned out to be the problem: map(from:) only copies data back from the device at the end of the target region and does not initialize the device copy from the host, so the kernel was reading uninitialized device memory. Since the array is only read inside the region, map(to: data1_ptr[0:size]) should work as well.
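For completeness, a sketch of the corrected directives (the map(to:) variant is my suggestion following the mapping rules above, not something from the original post):

// sum must travel in both directions; the input array only needs to reach the device
#pragma omp target map(tofrom: sum) map(to: data1_ptr[0:size])
#pragma omp teams distribute parallel for simd reduction(+:sum) collapse(2)
for (size_t i = 0; i < size; ++i) {
    for (size_t j = 0; j < size; ++j) {
        myfloat term1 = data1_ptr[i] * i;
        sum += term1 / (1 + term1 * term1 * term1);
    }
}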
I am dealing with huge point-cloud data and I am trying to use OpenMP, but I have found it very hard, as a beginner, to optimize the code. For example, when I want to compute the histogram of the point cloud (each point carries additional attributes beyond x, y, z), I write the code below:
#pragma omp parallel num_threads(N_THREAD) shared(hist, partHist)
{
    int tId = omp_get_thread_num();
    int index = tId * partCount;
    #pragma omp for nowait
    for (int i = 0; i < partCount; ++i)
    {
        if (index + i < size)
        {
            #pragma omp atomic
            partHist[tId][(int)floor((array[index + i] - minValue) / stride)]++;
        }
    }
    #pragma omp critical
    {
        for (int i = 0; i < binCount; ++i)
            hist[i] += partHist[tId][i];
    }
}
The code runs on Linux on an i7-9700K, compiled with g++ and using OpenMP 4.0.
I have two questions:
The data set holds about 10^8 points at least, and I use 128 threads, but it is slower than the serial version. How can I optimize this code?
Are there general rules I can follow to optimize the code when similar problems come up?
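For reference, a minimal sketch of the usual per-thread-histogram idiom (the names array, size, minValue, stride, and binCount are taken from the question; everything else is an assumption on my part). Each thread fills a private histogram with no atomics, the #pragma omp for does the partitioning instead of manual index arithmetic, and the merge runs once per thread at the end:

#include <vector> // for the thread-local buffer below

std::vector<long long> hist(binCount, 0);
#pragma omp parallel
{
    std::vector<long long> local(binCount, 0); // private to this thread, so no atomic needed
    #pragma omp for nowait
    for (long long i = 0; i < size; ++i)
    {
        int bin = (int)((array[i] - minValue) / stride);
        if (bin >= 0 && bin < binCount)
            ++local[bin];
    }
    #pragma omp critical // one short merge per thread, not one lock per point
    {
        for (int b = 0; b < binCount; ++b)
            hist[b] += local[b];
    }
}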
I'm running this neat little gravity simulation. In serial execution it takes a little more than 4 minutes; when I parallelize one loop inside it, the time increases to about 7 minutes, and if I try parallelizing more loops it increases to more than 20 minutes. I'm posting a slightly shortened version without some initializations, but I think they don't matter. This is the 7-minute version, with some comments where I wanted to parallelize further loops. Thank you for helping me with my messy code.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define numb 1000

int main(){
    double pos[numb][3], a[numb][3], a_local[3], v[numb][3];
    memset(v, 0.0, numb*3*sizeof(double));
    double richtung[3];
    double t, deltat = 0.0, r12 = 0.0, endt = 10.;
    unsigned seed;
    int tcount = 0;

    #pragma omp parallel private(seed) shared(pos)
    {
        seed = 25235 + 16*omp_get_thread_num();
        #pragma omp for
        for(int i = 0; i < numb; i++){
            for(int j = 0; j < 3; j++){
                pos[i][j] = (double) (rand_r(&seed) % 100000 - 50000);
            }
        }
    }

    for(t = 0.; t < endt; t += deltat){
        printf("\r%le", t);
        tcount++;
        #pragma omp parallel for shared(pos, v)
        for(int id = 0; id < numb; id++){
            for(int l = 0; l < 3; l++){
                pos[id][l] = pos[id][l] + (0.5*deltat*v[id][l]);
                v[id][l] = v[id][l] + a[id][l]*(deltat);
            }
        }
        memset(a, 0.0, numb*3*sizeof(double));
        memset(a_local, 0.0, 3*sizeof(double));
        #pragma omp parallel for private(r12, richtung) shared(a, pos)
        for(int id = 0; id < numb; ++id){
            for(int id2 = 0; id2 < id; id2++){
                for(int k = 0; k < 3; k++){
                    r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
                }
                for(int k = 0; k < 3; k++){
                    richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
                    a[id][k] += richtung[k]/(((r12)*(r12)));
                    a_local[k] += (-1.0)*richtung[k]/(((r12)*(r12)));
                    #pragma omp critical
                    {
                        a[id2][k] += a_local[k];
                    }
                }
                r12 = 0.0;
            }
        }
        #pragma omp parallel for shared(pos)
        for(int id = 0; id < numb; id++){
            for(int k = 0; k < 3; k++){
                pos[id][k] = pos[id][k] + (0.5*deltat*v[id][k]);
            }
        }
        deltat = 0.01;
    }
    return 0;
}
I'm using
g++ -fopenmp -o test_grav test_grav.c
to compile the code and I'm measuring time in the shell just by
time ./test_grav.
When I used omp_get_num_threads() to query the number of threads, it reported 4, and top also shows more than 300% (sometimes ~380%) CPU usage. An interesting little fact: if I open the parallel region before the time loop (i.e. the outermost for loop) without any actual #pragma omp for inside, it behaves the same as creating one parallel region for each major loop (the three loops just inside the time loop). So I think it is an optimization thing, but I don't know how to solve it. Can anyone help me?
Edit: I made the example verifiable and lowered numbers like numb to make it easier to test, but the problem still occurs, even when I remove the critical region as suggested by TheQuantumPhysicist, just not as severely.
I believe the critical section is the cause of the problem. Consider getting rid of it entirely. One way to do that here (a restructuring on my part, not just a syntax fix) is to give up the Newton's-third-law shortcut: let each thread compute its own body's acceleration from all other bodies, so that no two threads ever write to the same element of a.
Try this:
#pragma omp parallel for shared(a, pos)
for (int id = 0; id < numb; ++id)
{
    double r12 = 0.0;
    double richtung[3];
    // Each pair is now evaluated twice (once per body), which costs extra
    // arithmetic, but no thread ever writes to another body's slot in a,
    // so neither critical nor atomic sections are needed.
    for (int id2 = 0; id2 < numb; id2++)
    {
        if (id2 == id) continue;
        for (int k = 0; k < 3; k++)
        {
            r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
        }
        for (int k = 0; k < 3; k++)
        {
            richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
            a[id][k] += richtung[k]/((r12)*(r12));
        }
        r12 = 0.0; // reset for the next pair, as in the original code
    }
}
Critical sections lead to locking and blocking. If you can keep them out of the hot loop, you'll win a lot in performance.
Note that I'm talking about a syntactic solution; I don't know whether it works for your case. But to be clear: if every point in your series depends on the previous one, then parallelization is not a solution for you, at least not simple parallelization using OpenMP.
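If the pairwise update to a[id2][k] has to stay inside the loop, a cheaper middle ground (again a sketch, untested against this code) is an atomic update of the single element instead of a named critical section, since an atomic protects exactly one scalar operation rather than serializing a whole block:

for (int k = 0; k < 3; k++)
{
    double contrib = (-1.0)*richtung[k]/((r12)*(r12));
    #pragma omp atomic
    a[id2][k] += contrib; // only this one addition is synchronized
}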
I'm trying to do multi-threaded programming on a CPU using OpenMP. I have lots of for loops which are good candidates to be parallelized, and I have attached a part of my code here. When I use the first #pragma omp parallel for reduction, my code is faster, but when I try to use the same directive to parallelize other loops it gets slower. Does anyone have any idea why?
.
.
.
omp_set_dynamic(0);
omp_set_num_threads(4);
float *h1 = new float[nvi];
float *h2 = new float[npi];
while(tol > 0.001)
{
    std::fill_n(h2, npi, 0);
    int k, i;
    float h222 = 0;
    #pragma omp parallel for private(i, k) reduction (+: h222)
    for (i = 0; i < npi; ++i)
    {
        int p1 = ppi[i];
        int m = frombus[p1];
        for (k = 0; k < N; ++k)
        {
            h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                  + B[m-1][k]*sin(del[m-1]-del[k]));
        }
        h2[i] = h222;
    }
    //*********** h3 *****************
    std::fill_n(h3, nqi, 0);
    float h333 = 0;
    #pragma omp parallel for private(i, k) reduction (+: h333)
    for (int i = 0; i < nqi; ++i)
    {
        int q1 = qi[i];
        int m = frombus[q1];
        for (int k = 0; k < N; ++k)
        {
            h333 += v[m-1]*v[k]*(G[m-1][k]*sin(del[m-1]-del[k])
                  - B[m-1][k]*cos(del[m-1]-del[k]));
        }
        h3[i] = h333;
    }
.
.
.
}
I don't think your OpenMP code gives the same result as the code without OpenMP. Let's concentrate on the h2[i] part (the h3[i] part has the same logic). Since h222 is never reset inside the loop, there is a dependency of h2[i] on the index i (i.e. h2[1] = h2[1] + h2[0]), and the OpenMP reduction you're doing won't reproduce that. If you want to do the reduction with OpenMP, you need to do it on the inner loop, like this:
float h222 = 0;
for (int i = 0; i < npi; ++i) {
    int p1 = ppi[i];
    int m = frombus[p1];
    #pragma omp parallel for reduction(+:h222)
    for (int k = 0; k < N; ++k) {
        h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
              + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}
However, I don't know if that will be very efficient. An alternative method is to fill h2[i] in parallel over the outer loop without a reduction, and then take care of the dependency serially. Even though the fix-up loop is not parallelized, it should add only a little to the computation time, since it does not contain the inner loop over k. This gives the same result with and without OpenMP and is still fast:
#pragma omp parallel for
for (int i = 0; i < npi; ++i) {
    int p1 = ppi[i];
    int m = frombus[p1];
    float h222 = 0;
    for (int k = 0; k < N; ++k) {
        h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
              + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}
// take care of the dependency serially
for (int i = 1; i < npi; i++) {
    h2[i] += h2[i-1];
}
Keep in mind that creating and destroying threads is a time-consuming process; clock the execution time of the process and see for yourself. You only use parallel reduction twice, which may be faster than a serial reduction, but the initial cost of creating the threads may still be higher. Try parallelizing the outermost loop (if possible) to see whether you can obtain a speedup.
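To sketch what parallelizing the outermost loop could look like here (an assumption-laden sketch on my part: it requires tol to be shared and updated in exactly one place), the idea is to enter the parallel region once, so the thread team is created a single time rather than once per while iteration:

#pragma omp parallel
{
    while (tol > 0.001) // every thread evaluates the same shared condition
    {
        #pragma omp for
        for (int i = 0; i < npi; ++i)
        {
            /* ... fill h2[i] as above ... */
        }
        // more worksharing loops (h3, ...) go here
        #pragma omp single
        {
            /* update tol here; the implicit barrier at the end of 'single'
               makes the new value visible to all threads before the
               while condition is evaluated again */
        }
    }
}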
I have an issue with OpenMP: the MSVC compiler throws "pragma omp atomic has improper form" at me, and I have no idea why.
Code (the program computes an approximation of pi by the integral method):
#include <stdio.h>
#include <time.h>
#include <omp.h>

long long num_steps = 1000000000;
double step;

int main(int argc, char* argv[])
{
    clock_t start, stop;
    double x, pi, sum = 0.0;
    int i;
    step = 1./(double)num_steps;
    start = clock();
    #pragma omp parallel for
    for (i = 0; i < num_steps; i++)
    {
        x = (i + .5)*step;
        #pragma omp atomic // this part triggers the error
        sum = sum + 4.0/(1. + x*x);
    }
    pi = sum*step;
    stop = clock();
    // some printf to show results
    return 0;
}
Your program is perfectly syntactically correct OpenMP code by the current OpenMP standards (e.g. it compiles unmodified with GCC 4.7.1), except that x should be declared private (which is a semantic rather than a syntactic error). Unfortunately, Microsoft Visual C++ implements a very old OpenMP specification (2.0 from March 2002), which only accepts the following statement forms in an atomic construct:
x binop= expr
x++
++x
x--
--x
Later versions also allow x = x binop expr, but MSVC is forever stuck at OpenMP version 2.0, even in VS2012. Just for comparison, the current OpenMP version is 3.1, and we expect 4.0 to come out in the following months.
In OpenMP 2.0 your statement should read:
#pragma omp atomic
sum += 4.0/(1.+ x*x);
But as already noticed, it would be better (and generally faster) to use reduction:
#pragma omp parallel for private(x) reduction(+:sum)
for (i = 0; i < num_steps; i++)
{
    x = (i + .5)*step;
    sum = sum + 4.0/(1. + x*x);
}
(you could also write sum += 4.0/(1.+ x*x);)
Try changing sum = sum + 4.0/(1. + x*x) to sum += 4.0/(1. + x*x), but I'm afraid this won't work either. You can try to split the work like this:
x = (i + .5)*step;
double xx = 4.0/(1. + x*x);
#pragma omp atomic // now in the x binop= expr form that MSVC accepts
sum += xx;
This should work, but I am not sure whether it fits your needs.
Replace the #pragma omp atomic with a reduction(+:sum) clause on the enclosing #pragma omp parallel for, or with #pragma omp critical. (Note that reduction is a clause, not a standalone directive.) The reduction clause is the better option here, since the loop body is of the form sum += var;
Do it like this:
#pragma omp parallel for private(x) reduction(+:sum)
for (i = 0; i < num_steps; i++)
{
    x = (i + .5)*step;
    sum += 4.0/(1. + x*x);
}
You probably need a recap on #pragma more than a solution to the specific problem.
The #pragma mechanism itself is part of standard C and C++, but the meaning of each individual pragma is compiler-specific and often platform/system-specific, so the behaviour can differ between compilers on the same OS, or even between machines with different setups; a compiler is free to ignore any pragma it does not recognize.
As a consequence, any issue with a pragma can only be resolved by consulting the official documentation of your compiler on your platform of choice. Here are two links:
http://msdn.microsoft.com/en-us/library/d9x1s805.aspx
http://msdn.microsoft.com/en-us/library/0ykxx45t.aspx
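Because unrecognized pragmas are ignored (possibly with a warning), OpenMP programs commonly guard their runtime calls with the standard _OPENMP macro, which every compiler defines when OpenMP mode is enabled. A minimal sketch:

#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
    // _OPENMP is defined by the compiler whenever -fopenmp / /openmp is in effect
    printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
#else
    printf("Compiled without OpenMP; the omp pragmas were ignored\n");
#endif
    return 0;
}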