OpenCL Pipeline failed to allocate buffer with cl_mem_object_allocation_failure - c++

I have an OpenCL pipeline that processes images/video, and it can be greedy with memory sometimes. It is crashing on a cl::Buffer() allocation like this:
cl_int err = CL_SUCCESS;
cl::Buffer tmp = cl::Buffer(m_context, CL_MEM_READ_WRITE, sizeData, NULL, &err);
with error -4 (cl_mem_object_allocation_failure).
This occurs at a fixed point in my pipeline when using very large images. If I just downscale the image a bit, it passes through the pipeline at this very memory-intensive part.
I have access to an NVIDIA card with 4 GB that busts at a certain point, and I also tried on an AMD GPU with 2 GB, which busts earlier.
According to this thread, there is no need to know the current allocation because of swapping with VRAM, but it seems that my pipeline busts the memory of my device.
So here are my questions:
1) Are there any settings on my computer or in my pipeline that would let me use more VRAM?
2) Is it okay to use CL_DEVICE_GLOBAL_MEM_SIZE as the reference for the maximum size to allocate, or do I need to use CL_DEVICE_GLOBAL_MEM_SIZE - (local memory + private memory), or something like that?
According to my own memory profiler, I have 92% of CL_DEVICE_GLOBAL_MEM_SIZE allocated at the crash. By resizing a bit, the pipeline reports that I use 89% on the resized image and it passes, so I assume that my large image is right on the edge.

Some parts of your device's VRAM may be used for the pixel buffer, constant memory, or other purposes. For AMD cards, you can set the environment variables GPU_MAX_HEAP_SIZE and GPU_MAX_ALLOC_PERCENT to use a larger part of the VRAM, though this may have unintended side effects. Both are expressed as percentages of the physically available memory on the card.
Additionally, there is a limit on the size of each individual memory allocation. You can get the maximum size for a single allocation by querying CL_DEVICE_MAX_MEM_ALLOC_SIZE, which may be less than CL_DEVICE_GLOBAL_MEM_SIZE. For AMD cards, this size can be controlled with GPU_SINGLE_ALLOC_PERCENT.
None of this requires changes to your code; simply set the variables before you call your executable:
export GPU_MAX_ALLOC_PERCENT="100"
export GPU_MAX_HEAP_SIZE="100"
export GPU_SINGLE_ALLOC_PERCENT="100"
./your_program
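
If it helps, here is a minimal sketch (using the same OpenCL C++ bindings as in the question; the header name may differ depending on your bindings version) that prints both limits for each GPU, so you can compare them against what your pipeline actually tries to allocate:

#include <CL/cl.hpp>
#include <iostream>
#include <vector>

int main()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

    for (const cl::Device& dev : devices)
    {
        // Total global memory vs. the largest single allocation the device allows.
        cl_ulong globalMem = dev.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
        cl_ulong maxAlloc  = dev.getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>();
        std::cout << dev.getInfo<CL_DEVICE_NAME>() << "\n"
                  << "  global memory:    " << globalMem << " bytes\n"
                  << "  max single alloc: " << maxAlloc  << " bytes\n";
    }
    return 0;
}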


Can I reuse host buffer memory ad libitum, or should I re-map it every frame?

My app has a VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT buffer and a permanent command buffer that uploads the memory to a VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT buffer.
I have two questions about this setup. This is question 1, question 2 is separate.
For better performance, resources (buffers, buffer memories, memory mappings, command buffers, etc.) are allocated outside the main loop. The only thing I do in the main loop (per frame) is trigger the command buffer with a vkQueueSubmit(), which transfers the data from host memory to device-local memory. I took several significant "shortcuts" with respect to the literature (the classic Vulkan tutorial everybody starts with): by writing directly into stagingMemory I need no separate memory and no memcpy(), and doing most of it outside the loop is even more of a shortcut. This is the pseudocode:
void* stagingMemory;
vkMapMemory(logicalDevice, stagingBufferMemory, 0, size, 0, &stagingMemory);
while (running)
{
    // write directly into stagingMemory by fiddling with pointers and offsets
    if (its_time_to_update_ubo_on_device)
    {
        vkQueueSubmit(...) // transfer stagingBufferMemory to device-local buffer
    }
}
// only on exit
vkUnmapMemory(logicalDevice, stagingBufferMemory);
This works, and I understand it is performant because I minimize instantiations (such as VkSubmitInfo structures and command buffers) and several other operations. But I wonder if this is safe in the long run. What happens when memory pressure triggers virtual memory pages to be paged out to disk? Can this happen, or is stagingMemory safe?
What raises doubts in me is that I've always read about a very different approach, like this:
while (running)
{
    // write to memory (not staging memory!)
    void* stagingMemory;
    vkMapMemory(logicalDevice, stagingBufferMemory, 0, size, 0, &stagingMemory);
    memcpy(stagingMemory, memory, size);
    vkUnmapMemory(logicalDevice, stagingBufferMemory);
    if (its_time_to_update_ubo_on_device)
    {
        VkSubmitInfo info {}; // re-initialize every time anew
        vkQueueSubmit(... info ...) // upload to device-local memory
    }
}
Is this less-optimized approach just there for didactic reasons, or does it prevent problems I don't yet envision, which will ruin everything later on?
Am I doing what is described in this NVIDIA blog post as Pinned Host Memory, or is this something different still?
What happens when memory pressure triggers virtual memory pages to be paged out to disk?
Um... that's not a thing that happens, actually.
Virtual pages are never "paged out"; only physical storage gets paged out. The storage underneath a virtual address range can get paged out, but the actual virtual addresses are fine.
Perhaps you're thinking that Vulkan would have to ensure that physical pages associated with a mapped range can't be paged out, lest a DMA operation fail to complete. Well, that's not how Vulkan transfer operations work. They don't require that the memory is mapped during the transfer (nor do they require that it is unmapped prior to the transfer. Vulkan doesn't care). So it doesn't matter to Vulkan whether there is some virtual address range bound to the storage; internally, it could be using the actual physical addresses for its DMA operations.
If the GPU needs that range of memory to not be paged out all the time, then it will need it regardless of whether it is mapped. If the GPU is fine with it being paged out, and will page it back in prior to any DMA operations from/to it, then that's a thing that has nothing to do with the memory being mapped.
In short, your question is a non sequitur: keeping it mapped or not will not affect memory pressure. The only thing it might affect is how much virtual address space your program uses, which in the days of 64-bit programs is really kind of academic. Unless you think you're going to allocate 2^48 bytes of storage.
So the only reason to unmap memory (besides when you're about to delete it) is if you're writing a 32-bit application, you need to be careful with virtual address space, and you know that the implementation will not assign virtual addresses to CPU-accessible memory unless you map it (implementations are free to always give such memory virtual address space).
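
For completeness, a minimal sketch of the keep-it-mapped pattern under the question's setup (logicalDevice, stagingBufferMemory, size, queue and the pre-recorded submitInfo are assumed to exist as in the question). The explicit flush is only needed if the staging memory type lacks VK_MEMORY_PROPERTY_HOST_COHERENT_BIT; with a coherent type it can be dropped:

void* stagingMemory = nullptr;
vkMapMemory(logicalDevice, stagingBufferMemory, 0, size, 0, &stagingMemory);

while (running)
{
    // ... write directly into stagingMemory, as before ...

    // Only required for non-HOST_COHERENT memory types.
    VkMappedMemoryRange range{};
    range.sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
    range.memory = stagingBufferMemory;
    range.offset = 0;
    range.size   = VK_WHOLE_SIZE;
    vkFlushMappedMemoryRanges(logicalDevice, 1, &range);

    if (its_time_to_update_ubo_on_device)
    {
        vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE); // pre-recorded transfer
    }
}

vkUnmapMemory(logicalDevice, stagingBufferMemory); // only on exit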

Allocate CUDA device memory for a point cloud with increasing dimension (number of points)

I'm writing a program in which I need to:
run a test on each pixel of an image
if the test result is TRUE, add a point to a point cloud
if the test result is FALSE, do nothing
I've already written working code for the CPU side in C++.
Now I need to speed it up using CUDA. My idea was to have blocks/threads (one thread per pixel, I guess) execute the test in parallel and, if the test result is TRUE, have the thread add a point to the cloud.
Here comes my trouble: how can I allocate space in device memory for a point cloud (using cudaMalloc or similar) if I don't know a priori the number of points that I will insert in the cloud?
Do I have to allocate a fixed amount of memory and then increase it every time the point cloud reaches that limit? Or is there a method to "dynamically" allocate the memory?
When you allocate memory on the device from within a kernel, you may do so with two API calls. One is malloc, as described by Taro, but it is limited by an internal driver limit (8 MB by default), which can be increased by setting the appropriate limit with cudaDeviceSetLimit and the parameter cudaLimitMallocHeapSize.
Alternately, you may use cudaMalloc within a kernel, as it is both a host and device API method.
In both cases, Taro's observation stands: you will allocate a new, different buffer, just as you would on the CPU. Hence, sticking with a single buffer might require copying data. Note that cudaMemcpy is not a device API method, so you may need to write your own copy.
To my knowledge, there is no such thing as realloc in the CUDA API.
Back to your original issue: you might want to implement your algorithm in three phases. The first phase counts the number of samples you need, the second allocates the data array, and the third fills the data array. To implement this, you can use atomic functions to increment an int that counts the number of samples, as in the sketch below.
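For illustration, a rough sketch of those phases in CUDA (the threshold test stands in for whatever per-pixel predicate you already have, and error checking is omitted):

#include <cuda_runtime.h>

// Phase 1: count how many pixels pass the test.
__global__ void countPoints(const unsigned char* img, int width, int height,
                            unsigned char threshold, unsigned int* counter)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height && img[y * width + x] > threshold)
        atomicAdd(counter, 1u);
}

// Phase 3: each passing pixel claims a unique slot and writes its point.
__global__ void fillPoints(const unsigned char* img, int width, int height,
                           unsigned char threshold, float3* cloud, unsigned int* cursor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height && img[y * width + x] > threshold)
    {
        unsigned int slot = atomicAdd(cursor, 1u);
        cloud[slot] = make_float3((float)x, (float)y, (float)img[y * width + x]);
    }
}

Phase 2 happens on the host: copy the counter back with cudaMemcpy, cudaMalloc exactly that many float3 elements, reset the counter to zero, and then launch fillPoints.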
I would like to post this as a comment, as it only partially answers, but it is too long for this.
Yes, you can dynamically allocate memory from the kernels.
You can call malloc() and free() within your kernels to dynamically allocate and free memory during computation, as explained in section B-16 of the CUDA 7.5 Programming Guide:
#include <cstdio>
#include <cstdlib>

__global__ void mallocTest()
{
    size_t size = 123;
    char* ptr = (char*)malloc(size);
    memset(ptr, 0, size);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}

int main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}
(You will need compute capability 2.x or higher.)
But by doing this you allocate a new, different buffer in memory; you don't make the buffer you previously allocated from the host "grow" like a CPU dynamic container (vector, list, etc.).
I think you should define a constant for the maximum size of your array, allocate that maximum size, and have your kernel increment the "really used size" within this maximum buffer.
If you do so, don't forget to make this increment atomic/synchronized, so that each increment from each concurrent thread is counted.
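As a host-side sketch of that "allocate the maximum, track the really used size" idea, reusing the fillPoints kernel sketched in the previous answer (d_img, width, height and threshold are assumed to be set up already; error checking omitted):

int maxPoints = width * height;                   // upper bound: one point per pixel

float3* d_cloud = nullptr;
unsigned int* d_count = nullptr;
cudaMalloc(&d_cloud, maxPoints * sizeof(float3)); // worst-case allocation, done once
cudaMalloc(&d_count, sizeof(unsigned int));
cudaMemset(d_count, 0, sizeof(unsigned int));     // "really used size" starts at zero

dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
fillPoints<<<grid, block>>>(d_img, width, height, threshold, d_cloud, d_count);

unsigned int used = 0;                            // how many points were actually written
cudaMemcpy(&used, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);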

Apparent CUDA magic

I'm using CUDA (in reality I'm using pyCUDA, if the difference matters) and performing some computation over arrays. I'm launching a kernel with a grid of 320*600 threads. Inside the kernel I'm declaring two linear arrays of 20000 components using:
float test[20000];
float test2[20000];
With these arrays I perform simple calculations, for example filling them with constant values. The point is that the kernel executes normally and performs the computations correctly (you can see this by filling an array with a random component of test and sending that array from device to host).
The problem is that my NVIDIA card has only 2 GB of memory, and the total amount of memory needed to allocate test and test2 is 320*600*20000*4 bytes, which is much more than 2 GB.
Where is this memory coming from? And how can CUDA perform the computation in every thread?
Thank you for your time.
The actual sizing of the local/stack memory requirement is not what you suppose (the entire grid, all at once) but is based on a formula described by @njuffa here.
Basically, the local/stack memory requirement is sized based on the maximum instantaneous capacity of the device you are running on, rather than on the size of the grid.
Based on the information provided by njuffa, the available stack size limit (per thread) is the lesser of:
The maximum local memory size (512 KB for cc 2.x and higher)
Available GPU memory / (# of SMs) / (max threads per SM)
For your first case:
float test[20000];
float test2[20000];
That total is 160 KB (per thread), so we are under the maximum limit of 512 KB per thread. What about the second limit?
The GTX 650M has 2 cc 3.0 (Kepler) SMs (each Kepler SM has 192 cores). Therefore, the second limit above gives, if all the GPU memory were available:
2 GB / 2 / 2048 = 512 KB
(Kepler has a maximum of 2048 threads per multiprocessor.)
So it is the same limit in this case. But this assumes all the GPU memory is available.
Since you're suggesting in the comments that this configuration fails:
float test[40000];
float test2[40000];
i.e. 320 KB, I would conclude that your actual available GPU memory at the point of this bulk allocation attempt is somewhere above (160/512)*100%, i.e. above 31%, but below (320/512)*100%, i.e. below 62.5%, of 2 GB. So your available GPU memory at the time of this bulk allocation request for the stack frame would be something less than 1.25 GB.
You could try to see if this is the case by calling cudaMemGetInfo right before the kernel launch in question (although I don't know how to do this in pycuda). Even though your GPU starts out with 2 GB, if you are running the display from it, you are likely starting with a number closer to 1.5 GB. And dynamic (e.g. cudaMalloc) and/or static (e.g. __device__) allocations that occur prior to this bulk allocation request at kernel launch will all impact available memory.
This is all to explain some of the specifics. The general answer to your question is that the "magic" arises due to the fact that the GPU does not necessarily allocate the stack frame and local memory for all threads in the grid, all at once. It need only allocate what is required for the maximum instantaneous capacity of the device (i.e. SMs * max threads per SM), which may be a number that is significantly less than what would be required for the whole grid.
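As a rough illustration of that check in CUDA C++ (pycuda aside), something along these lines, called right before the kernel launch, would show how much memory is actually left for the stack frames:

#include <cstdio>
#include <cuda_runtime.h>

// Report free vs. total device memory, e.g. just before the kernel launch in question.
void reportDeviceMemory(const char* tag)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("[%s] free: %zu MB / total: %zu MB\n",
           tag, freeBytes >> 20, totalBytes >> 20);
}

// usage:
//   reportDeviceMemory("before kernel");
//   myKernel<<<grid, block>>>(...);   // the launch whose stack allocation may fail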

OpenCL Buffer Instantiation in a Multi Device Environment

I'm wondering how the system-side cl::Buffer objects instantiate in a multi-device context.
Let's say I have an OCL environment class, which generates, from cl::Platform, ONE cl::Context:
this->ocl_context = cl::Context(CL_DEVICE_TYPE_GPU, con_prop);
And then, the corresponding set of devices:
this->ocl_devices = this->ocl_context.getInfo<CL_CONTEXT_DEVICES>();
I generate one cl::CommandQueue object and one set of cl::Kernel(s) for EACH device.
Let's say I have 4 GPUs of the same type. Now I have 4x cl::Device objects in ocl_devices.
Now, what happens when I have a second handler class to manage computation on each device:
oclHandler h1(cl::Context* c, cl::CommandQueue cq1, std::vector<cl::Kernel*> k1);
...
oclHandler h2(cl::Context* c, cl::CommandQueue cq4, std::vector<cl::Kernel*> k4);
And then INSIDE EACH CLASS, I both instantiate:
oclHandler::classMemberFunction(
    ...
    this->buffer = cl::Buffer(
        *(this->ocl_context),
        CL_MEM_READ_WRITE,
        mem_size,
        NULL,
        NULL
    );
    ...
)
and then after, write to
oclHandler::classMemberFunction(
    ...
    this->ocl_cq->enqueueWriteBuffer(
        this->buffer,
        CL_FALSE,
        static_cast<unsigned int>(0),
        mem_size,
        ptr,
        NULL,
        NULL
    );
    ...
    this->ocl_cq->finish();
    ...
)
each buffer. There is a concern that, because the instantiation is for a cl::Context and not tied to a particular device, the buffer might end up allocated four times over, once on each device. I can't determine when the operation that says "on the device, this buffer runs from 0xXXXXXXXXXXXXXXXX for N bytes" occurs.
Should I instantiate one context per device? That seems unnatural, because I'd have to instantiate a context, see how many devices there are, and then instantiate d-1 more contexts... which seems inefficient. My concern is with limiting available memory on the device side. I am doing computations on massive sets, and I'd probably be using all of the 6 GB available on each card.
Thanks.
EDIT: Is there a way to instantiate a buffer and fill it asynchronously without using a command queue? Say I have 4 devices and one host-side buffer full of static, read-only data, around 500 MB in size. If I want to just use clCreateBuffer, with
shared_buffer = new cl::Buffer(
    this->ocl_context,
    CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
    total_mem_size,
    ptr,
    NULL
);
that will start a blocking write, and I cannot do anything host-side until all of ptr's contents are copied to the newly allocated memory. I have a multithreaded device management system, and I've created one cl::CommandQueue for each device, always passing along &shared_buffer for every kernel::setArg that requires it. I'm having a hard time wrapping my head around what to do.
When you have a context that contains multiple devices, any buffers that you create within that context are visible to all of its devices. This means that any device in the context can read from any buffer in the context, and the OpenCL implementation is in charge of making sure the data is actually moved to the correct devices as and when they need it. There are some grey areas around what should happen if multiple devices try to access the same buffer at the same time, but this kind of behaviour is generally avoided anyway.
Although all of the buffers are visible to all of the devices, this doesn't necessarily mean that they will be allocated on all of the devices. All of the OpenCL implementations that I've worked with use an 'allocate-on-first-use' policy, whereby the buffer is allocated on the device only when it is needed by that device. So in your particular case, you should end up with one buffer per device, as long as each buffer is only used by one device.
In theory an OpenCL implementation might pre-allocate all of the buffers on all the devices just in case they are needed, but I wouldn't expect this to happen in practice (and I've certainly never seen it happen). If you are running on a platform that has a GPU profiler available, you can often use the profiler to see when and where buffer allocations and data movement actually occur, to convince yourself that the system isn't doing anything undesirable.
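Regarding the EDIT: creating the buffer without CL_MEM_COPY_HOST_PTR and pushing the data with a non-blocking write keeps the host free while the transfer runs. A minimal sketch under the question's setup (ocl_context, ocl_devices, a flat vector of per-device kernels, ptr, total_mem_size and global_size are assumed to exist; error handling omitted):

#include <CL/cl.hpp>
#include <vector>

void uploadAndLaunch(cl::Context& ocl_context,
                     std::vector<cl::Device>& ocl_devices,
                     std::vector<cl::Kernel>& kernels,
                     const void* ptr, size_t total_mem_size, size_t global_size)
{
    // One buffer in the shared context; creating it does not copy or block.
    cl::Buffer shared(ocl_context, CL_MEM_READ_ONLY, total_mem_size);

    // One command queue per device, all in the same context.
    std::vector<cl::CommandQueue> queues;
    for (cl::Device& dev : ocl_devices)
        queues.emplace_back(ocl_context, dev);

    // Push the data once, non-blocking, through one queue.
    cl::Event uploaded;
    queues[0].enqueueWriteBuffer(shared, CL_FALSE, 0, total_mem_size, ptr,
                                 nullptr, &uploaded);

    // Each device's kernel waits on the upload event; the runtime migrates the
    // data to a device when that device first needs it.
    std::vector<cl::Event> waitList{uploaded};
    for (size_t i = 0; i < queues.size(); ++i)
    {
        kernels[i].setArg(0, shared);
        queues[i].enqueueNDRangeKernel(kernels[i], cl::NullRange,
                                       cl::NDRange(global_size), cl::NullRange,
                                       &waitList, nullptr);
    }

    for (auto& q : queues)
        q.finish();
}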

Is there a limit on the size of array that can be used in CUDA?

I've written a program that calculates the integral of a simple function. When testing it, I found that if I used an array of more than 10 million elements it produced the wrong answer. I found that the error seemed to occur once the array had been manipulated in a CUDA kernel. Ten million elements and below worked fine and produced the correct result.
Is there a size limit on the number of elements that can be transferred to the GPU or calculated on the GPU?
P.S. I'm using C-style arrays containing floats.
There are many different kinds of memory that you can use with CUDA. In particular, you have
Linear Memory (cuMemAlloc)
Pinned Memory (cuMemHostAlloc)
Zero-Copy Memory (cuMemAllocHost)
Pitch Allocation (cuMemAllocPitch)
Textures Bound to Linear Memory
Textures Bound to CUDA Arrays
Textures Bound to Pitch Memory
...and cube maps and surfaces, which I will not list here.
Each kind of memory is associated with its own hardware resource limits, many of which you can find by using cuDeviceGetAttribute. The function cuMemGetInfo returns the amount of free and total memory on the device, but because of alignment requirements, allocating 1,000,000 floats may result in more than 1,000,000 * sizeof(float) bytes being consumed. The maximum number of blocks that you can schedule at once is also a limitation: if you exceed it, the kernel will fail to launch (you can easily find this number using cuDeviceGetAttribute). You can find out the alignment requirements for different amounts of memory using the CUDA Driver API, but for a simple program you can make a reasonable guess and check the return value of the allocation function to determine whether the allocation succeeded.
There is no restriction on the number of bytes that you can transfer; using asynchronous functions, you can overlap kernel execution with memory copying (provided that your card supports this). Exceeding the maximum number of blocks you can schedule, or consuming the available memory on your device, means that you will have to split up your task and use multiple kernels to handle it.
For compute capability >= 3.0, the max grid dimensions are 2147483647 x 65535 x 65535, so that should cover any 1-D array of up to 2147483647 x 1024 = 2.199e+12 elements.
One-billion-element arrays are definitely fine.
1,000,000,000 / 1024 = 976562.5, which rounds up to 976563 blocks. Just make sure that if threadIdx.x + blockIdx.x * blockDim.x >= number of elements, you return from the kernel without processing, as in the sketch below.
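For illustration, a sketch of that pattern (the doubling is just a stand-in for the real per-element work):

#include <cuda_runtime.h>

// One thread per element; threads past the end of the array exit immediately.
__global__ void processArray(float* data, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;                   // bounds check for the last, partially filled block
    data[i] = data[i] * 2.0f;     // stand-in for the real computation
}

// launch for a one-billion-element array:
//   size_t n = 1000000000;
//   int threads = 1024;
//   int blocks = (int)((n + threads - 1) / threads);  // 976563 blocks
//   processArray<<<blocks, threads>>>(d_data, n);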