OpenMP offloading target map(alloc) - how does it work - C++

I have always been confused by, and never really understood, how the alloc map-type of the map clause of the target (or target data) construct works.
My use case: I would like a temporary array that lives only on the device and is used only on the device: initialized on the device, read on the device, everything on the device. The host does not touch its contents at all. For the sake of simplicity, consider the following code, which copies an array to another array via a temporary array (using just a single team and thread, but that does not matter):
#include <cstdio>

int main()
{
    const int count = 10;
    int * src = new int[count];
    int * tmp = new int[count];
    int * dst = new int[count];

    for(int i = 0; i < count; i++) src[i] = i;
    for(int i = 0; i < count; i++) printf(" %3d", src[i]);
    printf("\n");

    #pragma omp target map(to:src[0:count]) map(from:dst[0:count]) map(alloc:tmp[0:count])
    {
        for(int i = 0; i < count; i++) tmp[i] = src[i];
        for(int i = 0; i < count; i++) dst[i] = tmp[i];
    }

    for(int i = 0; i < count; i++) printf(" %3d", dst[i]);
    printf("\n");

    delete[] src;
    delete[] tmp;
    delete[] dst;
    return 0;
}
This code works with pgc++ -mp=gpu on an Nvidia GPU and with icpx -fiopenmp -fopenmp-targets=spir64 on an Intel GPU.
But the thing is, I don't want to allocate the tmp array on the host. If I just use int * tmp = nullptr;, the code fails on Nvidia (on Intel it still works). If I leave tmp uninitialized (just int * tmp;, removing the delete), execution fails on Intel too. If I do not declare the tmp variable at all, compilation fails (which kinda makes sense). I made sure the code really offloads to the device (does not fall back to the CPU) by setting OMP_TARGET_OFFLOAD=MANDATORY.
This was weird to me, since I don't use the tmp array on the host at all. As I understand it, the tmp array is allocated on the device, and the kernel then uses that device array. Is that right? Why do I have to allocate and/or initialize the pointer on the host if I never use it there?
So my question is: what are the exact requirements for using map(alloc) in OpenMP offloading? How does it work? How should I use it? I would appreciate an example and references to tutorials or documentation.
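To make the goal concrete, this is the kind of pattern I imagine map(alloc) is meant for (a sketch based on my understanding, reusing the variables from the code above): allocate the temporary once in an enclosing target data region and reuse it across kernels, with no host copy ever made.

#pragma omp target data map(alloc:tmp[0:count])
{
    #pragma omp target map(to:src[0:count])
    {
        for(int i = 0; i < count; i++) tmp[i] = src[i];
    }
    #pragma omp target map(from:dst[0:count])
    {
        for(int i = 0; i < count; i++) dst[i] = tmp[i];
    }
}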
I wasn't able to find any useful information regarding this. The standard was not helpful at all, and the tutorials I attended and watched did not go into such depth.
I understand that the code should work even without OpenMP enabled (as if the pragmas were just ignored), so let's assume there is an #ifdef to actually allocate the tmp array if OpenMP is disabled.
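That guard might look like this (a minimal sketch; _OPENMP is the standard feature-test macro defined whenever OpenMP is enabled):

int * tmp = nullptr;
#ifndef _OPENMP
// Without OpenMP the pragmas are ignored, so tmp must be real host memory.
tmp = new int[count];
#endif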
I am also aware of manual memory management via omp_target_alloc(), omp_target_memcpy() and omp_target_free(), but I wanted to use the target map(alloc).
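For reference, that manual route would look roughly like this (a sketch, untested, assuming OpenMP 4.5 or later; the is_device_ptr clause tells the construct that tmp is already a device pointer):

#include <omp.h>

// Allocate tmp directly in device memory; no host counterpart exists.
int dev = omp_get_default_device();
int * tmp = (int *) omp_target_alloc(count * sizeof(int), dev);

#pragma omp target map(to:src[0:count]) map(from:dst[0:count]) is_device_ptr(tmp)
{
    for(int i = 0; i < count; i++) tmp[i] = src[i];
    for(int i = 0; i < count; i++) dst[i] = tmp[i];
}

omp_target_free(tmp, dev);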
I am reading the OpenMP 5.2 standard, and using pgc++ 22.2-0 and icpx 2022.0.0.20211123.

Related

In Torch C++ API, how to write to the internal data of a tensor quickly?

I am using the torch C++ frontend and want to create a tensor with specified values in it. To achieve this, one may allocate memory and set the values by hand, then use torch::from_blob to build a tensor on the memory block, but that does not seem clean enough to me.
At the very bottom of this document I found that I can use the subscript operator to directly access and modify the data. However, this approach has a big runtime overhead, likely because subscript access treats each element of the tensor as a 0-d tensor. The following code costs more than 2 seconds on my machine (at the -O3 optimization level), which is unreasonably long for a modern CPU.
torch::Tensor tensor = torch::empty({1000, 1000});
for(int i = 0; i < 1000; i++)
{
    for(int j = 0; j < 1000; j++)
    {
        tensor[i][j] = calc_tensor_data(i,j);
    }
}
Is there a clean and fast way to achieve this goal?
After hours of fruitless searching on the Internet, I formed a hypothesis and decided to give it a shot. It turns out that the accessor mentioned in the same document works well as an lvalue too, although this feature is not mentioned by the document at all. The following code is just fine, and it is as fast as manipulating a raw pointer directly.
torch::Tensor tensor = torch::empty({1000, 1000});
auto accessor = tensor.accessor<float,2>();
for(int i = 0; i < 1000; i++)
{
    for(int j = 0; j < 1000; j++)
    {
        accessor[i][j] = calc_tensor_data(i,j);
    }
}
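For comparison, the raw-pointer version this is being measured against might look like the following (a sketch; it assumes the tensor is contiguous and row-major, and calc_tensor_data is the asker's own function):

torch::Tensor tensor = torch::empty({1000, 1000});
float * data = tensor.data_ptr<float>(); // direct pointer into the underlying buffer
for(int i = 0; i < 1000; i++)
{
    for(int j = 0; j < 1000; j++)
    {
        data[i * 1000 + j] = calc_tensor_data(i, j);
    }
}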

CUDA - separating CPU code from CUDA code

I was looking to use system functions (such as rand()) within the CUDA kernel. However, ideally these would just run on the CPU. Can I separate the files (.cu and .cpp) while still making use of GPU matrix addition? For example, something along these lines:
in main.cpp:
#include <vector>
#include <cstdlib>
#include <ctime>

void selfSquare(std::vector<int> arr, int n); // defined in cudaFuncs.cu

int main(){
    std::vector<int> myVec;
    srand(time(NULL));
    for (int i = 0; i < 1024; i++){
        myVec.push_back(rand() % 26);
    }
    selfSquare(myVec, 1024);
}
and in cudaFuncs.cu:
#include <vector>

__global__ void selfSquare_cu(int *arr, int n){
    int i = threadIdx.x;
    if (i < n){
        arr[i] = arr[i] * arr[i];
    }
}

void selfSquare(std::vector<int> arr, int n){
    int *cuArr;
    cudaMallocManaged(&cuArr, n * sizeof(int));
    for (int i = 0; i < n; i++){
        cuArr[i] = arr[i];
    }
    selfSquare_cu<<<1, n>>>(cuArr, n);
}
What are best practices in situations like these? Would it be a better idea to use curand and write everything in the kernel? It looks to me like the above example has an extra step of copying the vector into the shared CUDA memory.
In this case the only thing you need is to have the array initialized with random values, and each value can be initialized independently.
The CPU is involved in your code during the initialization and during the transfer of the data to the device and back to the host.
In your case, do you really need the CPU to initialize the data, only for all those values to then be moved to the GPU?
The best approach is to allocate some device memory and then initialize the values using a kernel, as sketched below.
This will save time because:

- the elements are initialized in parallel;
- no memory transfer is required from the host to the device.

As a rule of thumb, always avoid communication between host and device if possible.
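A sketch (untested) of that approach, using the cuRAND device API so the random values are generated in the kernel itself; the block size and seed are arbitrary choices:

#include <curand_kernel.h>

__global__ void initRandom(int *arr, int n, unsigned long long seed){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n){
        curandState state;
        curand_init(seed, i, 0, &state); // one independent RNG stream per element
        arr[i] = curand(&state) % 26;    // same range as the host-side rand() % 26
    }
}

int main(){
    const int n = 1024;
    int *arr;
    cudaMalloc(&arr, n * sizeof(int));
    initRandom<<<(n + 255) / 256, 256>>>(arr, n, 1234ULL);
    cudaDeviceSynchronize();
    // ... launch selfSquare_cu or other kernels on arr ...
    cudaFree(arr);
}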

Why in C++ is overwriting slower than writing?

USELESS QUESTION - ASKED TO BE DELETED
I have to run a piece of code that manages a video stream from a camera.
I am trying to speed it up, and I noticed some weird C++ behaviour. (I have to admit I am realizing I do not know C++.)
The first piece of code below runs faster than the two that follow. Why? Could it be that the stack is almost full?
Faster version
double* temp = new double[N];
for(int i = 0; i < N; i++){
    temp[i] = operation(x[i],y[i]);
    res = res + (temp[i]*temp[i])*coeff[i];
}

Slower version 1

double temp;
for(int i = 0; i < N; i++){
    temp = operation(x[i],y[i]);
    res = res + (temp*temp)*coeff[i];
}

Slower version 2

for(int i = 0; i < N; i++){
    double temp = operation(x[i],y[i]);
    res = res + (temp*temp)*coeff[i];
}
EDIT
I realized the compiler was optimizing away the product between the elements of coeff and temp. I beg your pardon for the useless question. I will delete this post.
This has obviously nothing to do with "writing vs overwriting".
Assuming your results are indeed correct, my guess is that your "faster" version can be vectorized (i.e. pipelined) by the compiler more efficiently.
The difference is that in this version you allocate separate storage for temp: each iteration uses its own element of the array, hence all the iterations can be executed independently.
Your "slow 1" version creates a (sort of) false dependence on the single temp variable. A primitive compiler might "buy" it, producing non-pipelined code.
Your "slow 2" version actually seems to be OK; the loop iterations are independent.
Why is this one still slower?
My guess is that it is due to the use of the same CPU registers: arithmetic on double is usually implemented via FPU stack registers, and this creates interference between loop iterations.

C++/RAII: Could this cause a memory leak?

I have a weird problem. I have written some MEX/Matlab functions using C++. On my computer everything works fine. However, on the institute's cluster, the code sometimes simply stops running without any error (a core file is created which says "CPU limit time exceeded"). Unfortunately, on the cluster I cannot really debug my code and I also cannot reproduce the error.
What I do know is that the error only occurs for very large runs, i.e., when a lot of memory is required. My assumption is therefore that my code has some memory leaks.
The only real part where I could think of is the following bit:
#include <vector>
using std::vector;

vector<int> createVec(int length) {
    vector<int> out(length);
    for(int i = 0; i < length; ++i)
        out[i] = 2.0 + i; // the real vector is obviously filled differently, but it's just simple computations
    return out;
}

void someFunction() {
    int numUser = 5;
    int numStages = 3;
    // do some stuff
    for(int user = 0; user < numUser; ++user) {
        vector< vector<int> > userData(numStages);
        for(int stage = 0; stage < numStages; ++stage) {
            userData[stage] = createVec(42);
            // use the vector for some computations
        }
    }
}
My question now is: could this bit produce memory leaks, or is it safe due to RAII (which I would think it is)? A question for the MATLAB experts: does this behave any differently when run as a MEX file?
Thanks

Where is the loop-carried dependency here?

Does anyone see anything obvious about the loop code below that I'm not seeing as to why this cannot be auto-vectorized by VS2012's C++ compiler?
All the compiler gives me is info C5002: loop not vectorized due to reason '1200' when I use the /Qvec-report:2 command-line switch.
Reason 1200 is documented in MSDN as:
Loop contains loop-carried data dependences that prevent vectorization. Different iterations of the loop interfere with each other such that vectorizing the loop would produce wrong answers, and the auto-vectorizer cannot prove to itself that there are no such data dependences.
I know (or I'm pretty sure that) there aren't any loop-carried data dependencies but I'm not sure what's preventing the compiler from realizing this.
The source and dest pointers never overlap or alias the same memory, and I'm trying to give the compiler that hint via __restrict.
pitch is always a positive integer value, something like 4096, depending on the screen resolution, since this is an 8bpp->32bpp rendering/conversion function operating column-by-column.
byte * __restrict source;
DWORD * __restrict dest;
int pitch;

for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
The parens around each source[] are remnants of a function call which I have elided here, because the loop still won't be auto-vectorized without the function call, even in its simplest form.
EDIT:
I've simplified the loop to the most trivial form I can:

for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
This still produces the same 1200 reason code.
EDIT (2):
This minimal test case with local allocations and identical pointer types still fails to auto-vectorize. I'm simply baffled at this point.
const byte * __restrict source;
byte * __restrict dest;

source = (const byte * __restrict) new byte[1600];
dest = (byte * __restrict) new byte[1600];

for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
Let's just say there's more than just a couple of things preventing this loop from vectorizing...
Consider this:
int main(){
    byte *source = new byte[1000];
    DWORD *dest = new DWORD[1000];

    for (int i = 0; i < 200; ++i) {
        dest[(i*2*4096)+0] = (source[(i*8)+0]);
    }
    for (int i = 0; i < 200; ++i) {
        dest[i*2*4096] = source[i*8];
    }
    for (int i = 0; i < 200; ++i) {
        dest[i*8192] = source[i*8];
    }
    for (int i = 0; i < 200; ++i) {
        dest[i] = source[i];
    }
}
Compiler Output:
main.cpp(10) : info C5002: loop not vectorized due to reason '1200'
main.cpp(13) : info C5002: loop not vectorized due to reason '1200'
main.cpp(16) : info C5002: loop not vectorized due to reason '1203'
main.cpp(19) : info C5002: loop not vectorized due to reason '1101'
Let's break this down:
The first two loops are the same, so they give the original reason 1200, the loop-carried dependency.
The 3rd loop is the same as the 2nd loop, yet the compiler gives a different reason, 1203:
Loop body includes non-contiguous accesses into an array
Okay... Why a different reason? I dunno. But this time, the reason is correct.
The 4th loop gives 1101:
Loop contains a non-vectorizable conversion operation (may be implicit)
So VC++ isn't smart enough to issue the SSE4.1 pmovzxbd instruction.
That's a pretty niche case; I wouldn't have expected any modern compiler to be able to do this. And if it could, you'd need to specify SSE4.1.
So the only thing that's out of the ordinary is why the first loop reports a loop-carried dependency. Well, that's a tough call... I would go so far as to say that the compiler just isn't issuing the correct reason (when it really should be non-contiguous access).
Getting back to the point, I wouldn't expect MSVC or any compiler to be able to vectorize your original loop. It has accesses grouped in chunks of 4, which makes it contiguous enough to vectorize, but it's a long shot to expect the compiler to recognize that.
So if it matters, I suggest manually vectorizing this loop. The intrinsic that you will need is _mm_cvtepu8_epi32().
Your original loop:
for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
vectorizes as follows:
for (int i = 0; i < count; ++i) {
    __m128i s0 = _mm_loadl_epi64((__m128i*)(source + i*8));
    __m128i s1 = _mm_unpackhi_epi64(s0, s0);
    *(__m128i*)(dest + (i*2 + 0)*pitch) = _mm_cvtepu8_epi32(s0);
    *(__m128i*)(dest + (i*2 + 1)*pitch) = _mm_cvtepu8_epi32(s1);
}
Disclaimer: This is untested and ignores alignment.
From the MSDN documentation, here is a case in which error 1203 would be reported:
void code_1203(int *A)
{
    // Code 1203 is emitted when non-vectorizable memory references
    // are present in the loop body. Vectorization of some non-contiguous
    // memory access is supported - for example, the gather/scatter pattern.
    for (int i = 0; i < 1000; ++i)
    {
        A[i] += A[0] + 1;     // constant memory access not vectorized
        A[i] += A[i*2+2] + 2; // non-contiguous memory access not vectorized
    }
}
It really could be the computations at the indexes that mess with the auto-vectorizer. Funny that the error code shown is not 1203, though.
MSDN Parallelizer and Vectorizer Messages