I'm using GCC to compile the following program which uses OpenMP's target directives to offload work to a GPU:
#include <iostream>
#include <cmath>
int main() {
    const int SIZE = 400000;
    double *m = new double[SIZE];

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < SIZE; i++)
        m[i] = std::sin((double)i);

    for (int i = 0; i < SIZE; i++)
        std::cout << m[i] << "\n";

    delete[] m;
}
My compilation string is as follows:
g++ -O3 test2.cpp -fopenmp -omptargets=nvptx64sm_35-nvidia-linux
Compilation succeeds, but quietly.
Using PGI+OpenACC, I'm used to a series of outputs that tell me what the compiler actually did with the directive, like so:
main:
8, Accelerator kernel generated
Generating Tesla code
11, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
8, Generating implicit copyout(m[:400000])
How can I get similar information out of GCC? -fopt-info-all is a big mess.
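Two things that may help narrow the output, both hedged: the `omp` group of -fopt-info requires a reasonably recent GCC (10 or newer, to the best of my knowledge), and GCC selects offload targets with -foffload=, not a clang-style -omptargets= flag:

```shell
# Compile time: restrict -fopt-info to the OpenMP-related group instead
# of -fopt-info-all (group name "omp" is available in GCC 10+)
g++ -O3 -fopenmp -foffload=nvptx-none -fopt-info-omp test2.cpp

# Run time: libgomp's GOMP_DEBUG traces device initialization and
# kernel launches, which shows whether offloading actually happened
GOMP_DEBUG=1 ./a.out
```

These are compiler/runtime invocation sketches, not verified output; check `g++ --version` and the libgomp manual for your installation.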
I am currently working with OpenMP offloading using LLVM/clang-16 (built from the GitHub repository). Using clang's built-in profiling hooks (environment variables such as LIBOMPTARGET_PROFILE=profile.json and LIBOMPTARGET_INFO), I was able to confirm that my code executes on the GPU. However, when I try to profile the code using nvprof or ncu (from the NVIDIA Nsight tool suite), I get a warning stating that the profiler did not detect any kernel launches:
> ncu ./saxpy
Time of kernel: 0.000004
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
This is my test code:
#include <cstdio>
#include <cstdlib>
#include <omp.h>

void saxpy(float a, float* x, float* y, int sz) {
    double tb = omp_get_wtime();
    #pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
    for (int i = 0; i < sz; i++) {
        y[i] = a * x[i] + y[i];
    }
    double te = omp_get_wtime();
    printf("Time of kernel: %lf\n", te - tb);
}

int main() {
    auto x = (float*) malloc(1000 * sizeof(float));
    auto y = (float*) calloc(1000, sizeof(float));
    for (int i = 0; i < 1000; i++) {
        x[i] = i;
    }
    saxpy(42, x, y, 1000);
    free(x);
    free(y);
    return 0;
}
Compiled using the following command:
> clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda main.cpp -o saxpy --cuda-path=/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/10.2 --offload-arch=sm_61 -fopenmp-offload-mandatory
What do I need to do to enable profiling? I have seen others use ncu on clang-compiled OpenMP offloading code without additional steps, so maybe I am completely missing something.
By looking at the debug output generated when the program is executed with LIBOMPTARGET_DEBUG=1, and with help from other forums, I was able to fix this issue. The program cannot find the OpenMP CUDA runtime libraries whenever it is started through ncu (or nsys).
A workaround is to add the path to those libraries to the LD_LIBRARY_PATH environment variable (e.g. export LD_LIBRARY_PATH=/opt/llvm/lib:$LD_LIBRARY_PATH).
NVIDIA is now aware of this problem and is "looking into why that is the case".
I used nvc++ to compile and run a C++ program with OpenACC, and at run time it failed with the error 'libgomp: TODO'.
Here is the test code:
#include <opencv2/imgcodecs.hpp>
#include <opencv2/highgui.hpp>
#include <openacc.h>
#include <iostream>
using namespace std;
using namespace cv;

int main() {
    cv::Mat srcImg = cv::imread("/home/testSpace/images/blue-mountains.jpg");
    if (!srcImg.data) {
        cout << "The file is not loaded or does not exist" << endl;
        return -1;
    }
    cout << "Matrix" << srcImg.rows << " " << srcImg.cols << endl;
    Mat duplicate(srcImg.rows, srcImg.cols, CV_8UC1, Scalar::all(255));
    #pragma acc enter data copyin(srcImg[:srcImg.rows][:srcImg.cols])
    #pragma acc enter data copyin(duplicate[:duplicate.rows][:duplicate.cols])
    #pragma acc parallel
    {
        #pragma acc loop
        for (int i = 0; i < srcImg.rows; i++) {
            #pragma acc loop
            for (int j = 0; j < srcImg.cols; j++) {
                duplicate.at<uchar>(i, j) = srcImg.at<uchar>(i, j);
            }
        }
        #pragma acc data copyout(duplicate[:duplicate.rows][:duplicate.cols])
        #pragma acc data copyout(srcImg[:srcImg.rows][:srcImg.cols])
    }
    cout << "duplicate" << ": " << (int)duplicate.at<uchar>(23, 45) << endl;
    return 0;
}
Compiling the file with nvc++ produced the following -Minfo output:
main:
2216, Loop unrolled 4 times (completely unrolled)
36, Generating enter data copyin(duplicate,srcImg)
Generating NVIDIA GPU code
38, #pragma acc loop gang /* blockIdx.x */
40, #pragma acc loop vector(128) /* threadIdx.x */
36, Generating implicit copyin(duplicate.step.p[:1],srcImg.step.p[:1],srcImg,duplicate)[if not already present]
40, Loop is parallelizable
Loop not vectorized/parallelized: not countable
cv::Matx<double, (int)4, (int)1>::Matx():
Running the program then prints:
Matrix810 1440
libgomp: TODO
Can anyone please provide a hint? Moreover, I don't understand why the compiler reports
Generating implicit copyin(duplicate.step.p[:1],srcImg.step.p[:1],srcImg,duplicate) [if not already present]
since I allocated device memory for srcImg and duplicate the same way, with explicit enter data directives.
The build script contains:
nvc++ -g -O3 -acc -gpu=cc60,cc70 -Minfo $(pkg-config opencv4 --cflags --libs) -nomp -o nvcpp.out test.cpp
Please check lib_information here
I am trying to understand and test OpenMP with GPU offload. However, I am confused because some examples (1, 2, 3) on the internet are similar to mine, yet my example does not work as I think it should. I am using g++ 9.4 on Ubuntu 20.04 LTS and have also installed gcc-9-offload-nvptx.
My example that does not work but is similar to this one:
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char *argv[]) {
    typedef double myfloat;
    if (argc != 2) exit(1);
    size_t size = atoi(argv[1]);
    printf("Size: %zu\n", size);

    std::vector<myfloat> data_1(size, 2);
    myfloat *data1_ptr = data_1.data();

    myfloat sum = -1;
    #pragma omp target map(tofrom:sum) map(from: data1_ptr[0:size])
    #pragma omp teams distribute parallel for simd reduction(+:sum) collapse(2)
    for (size_t i = 0; i < size; ++i) {
        for (size_t j = 0; j < size; ++j) {
            myfloat term1 = data1_ptr[i] * i;
            sum += term1 / (1 + term1 * term1 * term1);
        }
    }

    printf("sum: %.2f\n", sum);
    return 0;
}
When I compile it with: g++ main.cpp -o test -fopenmp -fcf-protection=none -fno-stack-protector I get the following
stack_example.cpp: In function ‘main._omp_fn.0.hsa.0’:
cc1plus: warning: could not emit HSAIL for the function [-Whsa]
cc1plus: note: support for HSA does not implement non-gridified OpenMP parallel constructs.
It does compile, but when I run it with
./test 10000
the printed sum is still -1. I think the sum value passed to the GPU was not returned properly, but I explicitly map it, so shouldn't it be returned? Or what am I doing wrong?
EDIT 1
I was asked to modify my code because there was a historically grown redundant for loop, and sum was initialized with -1. I fixed both and also compiled it with gcc-11, which, unlike gcc-9, did not emit the warning or note. However, the behavior is similar:
Size: 100
Number of devices: 2
sum: 0.00
I checked with nvtop that the GPU is used. Because there are two GPUs, I can even switch the device and see the change in nvtop.
Solution:
The fix is very easy and stupid. Changing
map(from: data1_ptr[0:size])
to
map(tofrom: data1_ptr[0:size])
did the trick.
Even though the loop only reads the array, this was the problem: map(from:) allocates the array on the device without copying the host data in, so the kernel read uninitialized values. Since the array is only read, map(to: data1_ptr[0:size]) would work as well.
I am getting "call to cuMemcpyDtoHsync returned error 700: Illegal address during kernel execution" error when I try to parallelize this simple loop.
#include <vector>
#include <iostream>
using namespace std;
int main() {
    vector<float> xF = {0, 1, 2, 3};

    #pragma acc parallel loop
    for (int i = 0; i < 4; ++i) {
        xF[i] = 0.0;
    }

    return 0;
}
Compiled with: $ pgc++ -fast -acc -std=c++11 -Minfo=accel -o test test.cpp
main:
6, Accelerator kernel generated
Generating Tesla code
9, #pragma acc loop gang, vector(4) /* blockIdx.x threadIdx.x */
std::vector<float, std::allocator<float>>::operator [](unsigned long):
1, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
T3 std::__copy_move_a2<(bool)0, const float *, decltype((std::allocator_traits<std::allocator<float>>::_S_pointer_helper<std::allocator<float>>((std::allocator<float>*)0)))>(T2, T2, T3):
1, include "vector"
64, include "stl_vector.h"
/usr/bin/ld: error in /tmp/pgc++cAUEgAXViQSY.o(.eh_frame); no .eh_frame_hdr table will be created.
$ ./test
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
The code runs normally without the #pragma, but I would like to make it parallel. What am I doing wrong?
Try compiling with "-ta=tesla:managed".
The problem here is that you aren't explicitly managing the data movement between the host and device, and the compiler can't implicitly manage it for you: a std::vector is just a class wrapping pointers, so the compiler can't tell the size of the data. Hence the device uses host addresses, causing the illegal memory accesses.
While you can manage the data yourself by grabbing the vector's raw pointers and then using a data clause to copy the vector's data as well as copying the vector itself to the device, it's much easier to use CUDA Unified Memory (i.e. the "managed" flag) and have the CUDA runtime manage the data movement for you.
As Jerry notes, it's generally not recommended to use vectors in parallel code since they are not thread-safe. In this case it's fine, but you may encounter other issues, especially if you try to push or pop data. It's better to use arrays, which are also easier to manage between the host and device copies.
% cat test.cpp
#include <vector>
#include <iostream>
using namespace std;
int main() {
vector<float> xF = {0, 1, 2, 3};
#pragma acc parallel loop
for (int i = 0; i < 4; ++i) {
xF[i] = 0.0;
}
for (int i = 0; i < 4; ++i) {
std::cout << xF[i] << std::endl;
}
return 0;
}
% pgc++ -ta=tesla:cc70,managed -Minfo=accel test.cpp --c++11 ; a.out
main:
6, Accelerator kernel generated
Generating Tesla code
9, #pragma acc loop gang, vector(4) /* blockIdx.x threadIdx.x */
6, Generating implicit copy(xF)
std::vector<float, std::allocator<float>>::operator [](unsigned long):
1, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
0
0
0
0
I use android-ndk-r10d to write C++ in Android Studio. Now I want to use OpenMP, so I added the following to Android.mk:
LOCAL_CFLAGS += -fopenmp
LOCAL_LDFLAGS += -fopenmp
and the following to myapp.cpp:
#include <omp.h>
#pragma omp parallel for
for(int i = 1, ii = 0; i < outImage[0]->height; i+=2, ii = i>>1) {
/* Do work... */
}
but the Gradle build fails with an error on the #pragma omp parallel for line alone.
How can I fix the syntax error?
Maybe the compiler does not like the complex structure of the for() statement. OpenMP needs to know the number of iterations in advance, and a parallel for loop cannot have two loop variables (i and ii). Try a simple loop with fixed limits, like:
int kmax = outImage[0]->height / 2;  // trip count of the original loop
#pragma omp parallel for
for (int k = 0; k < kmax; k++)
{
    int i = k*2 + 1;   // i = 1, 3, 5, ... as before
    int ii = i >> 1;   // equals k, matching the original ii
    /* Do work ... */
}