I am trying to understand/test OpenMP with GPU offload. However, I am confused because some examples/info (1, 2, 3) on the internet are analogous or similar to mine, but my example does not work as I think it should. I am using g++ 9.4 on Ubuntu 20.04 LTS and have also installed gcc-9-offload-nvptx.
My example, which does not work but is similar to this one:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char *argv[]) {
    typedef double myfloat;

    if (argc != 2) exit(1);
    size_t size = atoi(argv[1]);
    printf("Size: %zu\n", size);

    std::vector<myfloat> data_1(size, 2);
    myfloat *data1_ptr = data_1.data();

    myfloat sum = -1;
    #pragma omp target map(tofrom:sum) map(from: data1_ptr[0:size])
    #pragma omp teams distribute parallel for simd reduction(+:sum) collapse(2)
    for (size_t i = 0; i < size; ++i) {
        for (size_t j = 0; j < size; ++j) {
            myfloat term1 = data1_ptr[i] * i;
            sum += term1 / (1 + term1 * term1 * term1);
        }
    }

    printf("sum: %.2f\n", sum);
    return 0;
}
When I compile it with g++ main.cpp -o test -fopenmp -fcf-protection=none -fno-stack-protector I get the following:
stack_example.cpp: In function ‘main._omp_fn.0.hsa.0’:
cc1plus: warning: could not emit HSAIL for the function [-Whsa]
cc1plus: note: support for HSA does not implement non-gridified OpenMP parallel constructs.
It does compile but when using it with
./test 10000
the printed sum is still -1. I think the sum value passed to the GPU was not returned properly but I explicitly map it, so shouldn't it be returned? Or what am I doing wrong?
EDIT 1
I was asked to modify my code because it contained a historically grown, redundant for loop and because sum was initialized with -1. I fixed that and also compiled it with gcc-11, which did not emit the warning or note that gcc-9 did. However, the behavior is similar:
Size: 100
Number of devices: 2
sum: 0.00
I checked with nvtop: the GPU is used. Because there are two GPUs I can even switch the device, which can also be seen in nvtop.
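For reference, this is roughly how I switch between the two GPUs (a minimal sketch; it assumes the second GPU shows up as device 1):
#include <omp.h>
#include <cstdio>

int main() {
    printf("Number of devices: %d\n", omp_get_num_devices());
    // Select the second GPU (device ids start at 0); assumes at least two devices exist.
    omp_set_default_device(1);
    // ... target regions placed here now run on device 1 ...
    return 0;
}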
Solution:
The fix is very easy and stupid. Changing
map(from: data1_ptr[0:size])
to
map(tofrom: data1_ptr[0:size])
did the trick.
Even though I am not writing to the array, this seemed to be the problem: with map(from:) the array is never copied to the device in the first place, so the kernel reads uninitialized device memory.
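For completeness: map(tofrom:) fixes it, but since the array is only read inside the target region, map(to: data1_ptr[0:size]) should be sufficient as well. A minimal sketch of the fixed directive (with the redundant inner loop already removed, as in EDIT 1):
#pragma omp target map(tofrom: sum) map(to: data1_ptr[0:size])
#pragma omp teams distribute parallel for simd reduction(+: sum)
for (size_t i = 0; i < size; ++i) {
    myfloat term1 = data1_ptr[i] * i;
    sum += term1 / (1 + term1 * term1 * term1);
}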
Related
I have a very simple matrix multiplication program to try out OpenMP:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    const int N = 1000;
    double matrix[N][N];
    double vector[N];
    double outvector[N] = {0.0};  // zero-initialize the accumulator

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            matrix[i][j] = rand() % 50;
        }
    }
    for (int i = 0; i < N; i++) {
        vector[i] = rand() % 50;
    }

    double t = omp_get_wtime();
    omp_set_num_threads(12);

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            outvector[i] += matrix[i][j] * vector[j];
        }
    }

    t = omp_get_wtime() - t;
    printf("%g\n", t);
    return 0;
}
I compile it two ways:
Using my Windows Subsystem for Linux (WSL) shell directly, with g++ main.cpp -o main -fopenmp. This way, it runs significantly faster than if I comment out #pragma omp parallel for (as one would expect).
Using my CLion toolchain. It's the default WSL toolchain, with the following CMakeLists file:
cmake_minimum_required(VERSION 3.9)
project(Fernuni)
set(CMAKE_CXX_STANDARD 14)

# There was no difference between using this or find_package
#set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -fopenmp")

find_package(OpenMP)
if (OPENMP_FOUND)
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
endif()

add_executable(Fernuni main.cpp)
This way, it runs about ten times slower if #pragma omp parallel for is there.
Why?
edit:
Here are my specific timings (in seconds) with various combinations of optimization settings and the pragma directive enabled or disabled:
         OMP,O0     OMP,O3     O0         O3       OMP only   nothing
CLion    0.0332578  0.0234029  0.0023873  6.4e-06  0.0058386  0.0094753
WSL/g++  0.007106   0.0012252  0.0038349  5.1e-06  0.0008419  0.0021912
You are mainly measuring the overheads of the operating system, of the hardware and of the OpenMP runtime, plus the effect of compiler optimizations.
Indeed, the computation time should be less than 1 ms, while the time to create the OS threads, schedule them and initialize OpenMP can easily be of the same order of magnitude (it takes between 0.3 and 0.8 ms on my machine to do only that). Not to mention that the caches are cold (many cache misses occur) and the processor frequency may not be high yet (because of frequency scaling).
Moreover, the compiler can optimize the loop away entirely, as it has no side effects and the result is never read. Actually, GCC does that with -O3 without OpenMP but not with it (see here). This is likely because, with OpenMP, the GCC optimizer does not see that the output is never read (due to internal escaping references). This explains why you get better timings without OpenMP. You should read the results, at least to check that they are correct! Put shortly, the benchmark is flawed.
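For example, something like this at the end of main forces the result to be observed (a minimal sketch; how the values are consumed does not matter, only that they are read):
double checksum = 0.0;
for (int i = 0; i < N; i++) {
    checksum += outvector[i];  // read every element so the loop cannot be optimized away
}
printf("checksum: %g\n", checksum);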
Note that using -march=native or -mavx plus -mfma can help a lot to speed up the loop in this case, since these instruction sets can theoretically divide the number of executed instructions by 4. In practice, the program will likely be memory-bound (due to the size of the matrix).
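For example (assuming the machine the benchmark runs on actually supports these instruction sets):
g++ -O3 -march=native -fopenmp main.cpp -o main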
I am trying to make a simple GPU offloading program using OpenMP. However, when I try to offload, it still runs on the default device, i.e. my CPU.
I have installed a compiler, g++ 7.2.0, that has CUDA support (it is on a cluster that I use). When I run the code below it shows me that it can see the 8 GPUs, but when I try to offload it says that it is still on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>

#define n 10000
#define m 10000

using namespace std;

int main()
{
    double tol = 1E-10;
    double err = 1;
    size_t iter_max = 10;
    size_t iter = 0;
    bool notGPU[1] = {true};
    double Anew[n][m];
    double A[n][m];
    int target[1];
    target[0] = omp_get_initial_device();

    cout << "Total Devices: " << omp_get_num_devices() << endl;
    cout << "Target: " << target[0] << endl;

    for (int iter = 0; iter < iter_max; iter++){
        #pragma omp target
        {
            err = 0.0;
            #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; ++j){
                target[0] = omp_is_initial_device();
                for (int i = 1; i < m-1; i++){
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                    err = fmax(err, fabs(Anew[j][i] - A[j][i]));
                }
            }
        }
    }

    if (target[0]){
        cout << "not on GPU" << endl;
    } else {
        cout << "On GPU" << endl;
    }

    return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018 it just didn't work. At that time target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked with Nvidia's PGI compiler. I find that PGI does a worse job of compiling C/C++ than the others (it seems inefficient and uses non-standard flags), but a Community Edition is available for free, and a little translating will get you running in OpenACC quickly.
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC7 and get ahold of GCC8 or GCC9. GPU offloading is a fast-moving area and you'll want the latest compilers to take best advantage of it.
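One quick sanity check I use (treat the exact output as an assumption for your installation): a GCC build that supports offloading lists the configured targets in its verbose output, for example
g++ -v 2>&1 | grep -i offload
OFFLOAD_TARGET_NAMES=nvptx-none
If nothing like that appears, the target regions silently fall back to running on the host.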
Looks like you're missing a device(id) in your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
Without that, you haven't explicitly asked OpenMP to run anywhere but your CPU.
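A small self-contained way to check where a target region actually runs (a sketch of mine, separate from the code in the question):
#include <omp.h>
#include <cstdio>

int main() {
    int on_host = 1;
    // Map the flag both ways so the value written on the device is copied back to the host.
    #pragma omp target map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();
    }
    printf("target region ran on: %s\n", on_host ? "host CPU" : "GPU");
    return 0;
}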
I am getting a "call to cuMemcpyDtoHsync returned error 700: Illegal address during kernel execution" error when I try to parallelize this simple loop.
#include <vector>
#include <iostream>
using namespace std;

int main() {
    vector<float> xF = {0, 1, 2, 3};

    #pragma acc parallel loop
    for (int i = 0; i < 4; ++i) {
        xF[i] = 0.0;
    }

    return 0;
}
Compiled with: $ pgc++ -fast -acc -std=c++11 -Minfo=accel -o test test.cpp
main:
6, Accelerator kernel generated
Generating Tesla code
9, #pragma acc loop gang, vector(4) /* blockIdx.x threadIdx.x */
std::vector<float, std::allocator<float>>::operator [](unsigned long):
1, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
T3 std::__copy_move_a2<(bool)0, const float *, decltype((std::allocator_traits<std::allocator<float>>::_S_pointer_helper<std::allocator<float>>((std::allocator<float>*)0)))>(T2, T2, T3):
1, include "vector"
64, include "stl_vector.h"
/usr/bin/ld: error in /tmp/pgc++cAUEgAXViQSY.o(.eh_frame); no .eh_frame_hdr table will be created.
$ ./test
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
The code runs normally without the #pragma, but I would like to make it parallel. What am I doing wrong?
Try compiling with "-ta=tesla:managed".
The problem here is that you aren't explicitly managing the data movement between the host and the device, and the compiler can't implicitly manage it for you, since a std::vector is just a class containing pointers, so the compiler can't tell the size of the data. Hence, the device ends up using host addresses, which causes the illegal memory accesses.
While you can manage the data yourself by grabbing the vector's raw pointers and then using a data clause to copy the vector's data as well as copying the vector itself to the device, it's much easier to use CUDA Unified Memory (i.e. the "managed" flag) and have the CUDA runtime manage the data movement for you.
As Jerry notes, it's generally not recommended to use vectors in parallel code since they are not thread safe. In this case it's fine, but you may encounter other issues especially if you try to push or pop data. Better to use arrays. Plus arrays are easier to manage between the host and device copies.
% cat test.cpp
#include <vector>
#include <iostream>
using namespace std;

int main() {
    vector<float> xF = {0, 1, 2, 3};

    #pragma acc parallel loop
    for (int i = 0; i < 4; ++i) {
        xF[i] = 0.0;
    }

    for (int i = 0; i < 4; ++i) {
        std::cout << xF[i] << std::endl;
    }

    return 0;
}
% pgc++ -ta=tesla:cc70,managed -Minfo=accel test.cpp --c++11 ; a.out
main:
6, Accelerator kernel generated
Generating Tesla code
9, #pragma acc loop gang, vector(4) /* blockIdx.x threadIdx.x */
6, Generating implicit copy(xF)
std::vector<float, std::allocator<float>>::operator [](unsigned long):
1, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
0
0
0
0
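For reference, the manual data management mentioned above would look roughly like this (my own sketch, not part of the original answer): grab the vector's raw pointer so the compiler knows the extent of the data, and copy it explicitly with a data clause.
#include <vector>
#include <iostream>

int main() {
    std::vector<float> xF = {0, 1, 2, 3};
    float *p = xF.data();                       // raw pointer with a known extent
    const int n = static_cast<int>(xF.size());

    // Copy the underlying array to the device before the kernel and back afterwards.
    #pragma acc parallel loop copy(p[0:n])
    for (int i = 0; i < n; ++i) {
        p[i] = 0.0f;
    }

    for (int i = 0; i < n; ++i) {
        std::cout << p[i] << std::endl;
    }
    return 0;
}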
I use android-ndk-r10d to write C++ code in Android Studio. Now I want to use OpenMP, so I added the following to Android.mk:
LOCAL_CFLAGS += -fopenmp
LOCAL_LDFLAGS += -fopenmp
and added this code to myapp.cpp:
#include <omp.h>
#pragma omp parallel for
for (int i = 1, ii = 0; i < outImage[0]->height; i += 2, ii = i >> 1) {
    /* Do work... */
}
but the Gradle build finished with an error just because of the #pragma omp parallel for line.
How can I fix the syntax error?
Maybe the compiler does not like the complex structure of the for() statement. OpenMP likes to know the number of iterations in advance. I also doubt that OpenMP can deal with two loop variables (i and ii). Try a simple loop with fixed limits, like
int kmax = outImage[0]->height / 2;  // number of odd values of i below height

#pragma omp parallel for
for (int k = 0; k < kmax; k++)
{
    int i = k * 2 + 1;  // reconstruct the original loop variable
    int ii = i >> 1;    // equals k
    /* Do work ... */
}
The performance difference between C++ vectors and plain arrays has been extensively discussed, for example here and here. Usually the discussions conclude that vectors and arrays are similar in terms of performance when accessed with the [] operator and the compiler is allowed to inline functions. That is what I expected, but I came across a case where it seems that is not true. The functionality of the lines below is quite simple: a 3D volume is taken and swept, and some kind of small 3D mask is applied a certain number of times. Depending on the VERSION macro, the volume will be declared as a vector and accessed through the at operator (VERSION=2), declared as a vector and accessed via [] (VERSION=1), or declared as a plain array (VERSION=0).
#include <vector>
#define NX 100
#define NY 100
#define NZ 100
#define H 1
#define C0 1.5f
#define C1 0.25f
#define T 3000
#if !defined(VERSION) || VERSION > 2 || VERSION < 0
#error "Bad version"
#endif
#if VERSION == 2
#define AT(_a_,_b_) (_a_.at(_b_))
typedef std::vector<float> Field;
#endif
#if VERSION == 1
#define AT(_a_,_b_) (_a_[_b_])
typedef std::vector<float> Field;
#endif
#if VERSION == 0
#define AT(_a_,_b_) (_a_[_b_])
typedef float* Field;
#endif
#include <iostream>
#include <omp.h>
int main(void) {
#if VERSION != 0
    Field img(NX*NY*NZ);
#else
    Field img = new float[NX*NY*NZ];
#endif
    double end, begin;
    begin = omp_get_wtime();

    const int csize = NZ;
    const int psize = NZ * NX;

    for (int t = 0; t < T; t++) {
        /* Sweep the 3D volume and apply the "blurring" coefficients */
        #pragma omp parallel for
        for (int j = H; j < NY-H; j++) {
            for (int i = H; i < NX-H; i++) {
                for (int k = H; k < NZ-H; k++) {
                    int eindex = k + i*NZ + j*NX*NZ;
                    AT(img, eindex) = C0 * AT(img, eindex) +
                                      C1 * (AT(img, eindex - csize) +
                                            AT(img, eindex + csize) +
                                            AT(img, eindex - psize) +
                                            AT(img, eindex + psize));
                }
            }
        }
    }

    end = omp_get_wtime();
    std::cout << "Elapsed " << (end-begin) << " s." << std::endl;

    /* Access the img field so we force it to be deleted after accounting time */
#define WHATEVER 12.f
    if (img[NZ] == WHATEVER) {
        std::cout << "Whatever" << std::endl;
    }

#if VERSION == 0
    delete[] img;
#endif
}
One would expect the code to perform the same with VERSION=1 and VERSION=0, but the output is as follows:
VERSION 2 : Elapsed 6.94905 s.
VERSION 1 : Elapsed 4.08626 s.
VERSION 0 : Elapsed 1.97576 s.
If I compile without OpenMP (I've got only two cores), I get a similar pattern:
VERSION 2 : Elapsed 10.9895 s.
VERSION 1 : Elapsed 7.14674 s.
VERSION 0 : Elapsed 3.25336 s.
I always compile with GCC 4.6.3 and the options -fopenmp -finline-functions -O3 (of course I remove -fopenmp when I compile without OpenMP). Is there something I am doing wrong, for example when compiling? Or should we really expect this difference between vectors and arrays?
PS: I cannot use std::array because the compiler I depend on doesn't support the C++11 standard. With ICC 13.1.2 I get similar behavior.
I tried your code and used chrono to measure the time.
I compiled with clang (version 3.5) and libc++:
clang++ test.cc -std=c++1y -stdlib=libc++ -lc++abi -finline-functions -O3
The result is exactly the same for VERSION 0 and VERSION 1; there is no big difference. They are both 3.4 seconds on average (I use a virtual machine, so it is slower).
Then I tried g++ (version 4.8.1):
g++ test.cc -std=c++1y -finline-functions -O3
The result shows that, for VERSION 0, it is 4.4 seconds (roughly), and for VERSION 1, it is 5.2 seconds (roughly).
Then I tried clang++ with libstdc++:
clang++ test.cc -std=c++11 -finline-functions -O3
Voilà, the result is back to 3.4 seconds again.
So, it's purely an optimization "bug" of g++.