libGL heap usage - C++

I am working on a Linux-based C++ OpenGL application using the NVIDIA 290.10 64-bit drivers. I am trying to reduce its memory footprint, as it makes use of quite a lot of live data.
I've been using Valgrind/Massif to analyze heap usage, and while it helped me optimize various things, by now the largest chunk of heap memory is allocated by libGL. No matter how I set the threshold, Massif doesn't let me see in detail where those allocations come from, only that they come from libGL. At peak times I see about 250 MB allocated by libGL (out of 900 MB total heap usage). I hold a similar amount of memory on the graphics card as VBOs and textures (mostly one big 4096×4096 texture).
So it appears that roughly the same amount of memory I upload to the GPU is also allocated on the heap by libGL. The libGL allocations also peak when the volume of VBO data peaks. Is that normal? I thought one of the benefits of having a lot of GPU memory was that it keeps the RAM free.

What you are experiencing is perfectly normal: an OpenGL implementation must keep a copy of the data in system memory for various reasons.
In OpenGL there is no exclusive access to the GPU, so depending on how it is used, it may become necessary to swap data out (or simply evict some objects from GPU memory). GPUs can also crash, and drivers then silently reset them without the user noticing; this, too, requires a full copy of all the buffer data.
And don't forget that there is a major difference between address-space allocation (the value Valgrind reports) and actual memory utilization.
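That last distinction can be demonstrated without OpenGL at all. The sketch below (Linux-only, since it reads /proc/self/status; the helper names are my own) reserves a large block of anonymous memory without touching it, and reports how much the address space grew versus how much physical memory was actually consumed:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <utility>

// Read a field such as "VmSize" (address space) or "VmRSS" (resident
// physical memory) from /proc/self/status, in kB.
static long status_kb(const char* key) {
    FILE* f = std::fopen("/proc/self/status", "r");
    char line[256];
    long val = -1;
    while (f && std::fgets(line, sizeof line, f)) {
        if (std::strncmp(line, key, std::strlen(key)) == 0) {
            std::sscanf(line, "%*s %ld", &val);  // skip "VmXXX:", read number
            break;
        }
    }
    if (f) std::fclose(f);
    return val;
}

// Reserve `bytes` of anonymous memory without touching it, and return the
// (VmSize delta, VmRSS delta) in kB.
std::pair<long, long> reserve_untouched(std::size_t bytes) {
    long vsz0 = status_kb("VmSize"), rss0 = status_kb("VmRSS");
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return {0, 0};
    long dvsz = status_kb("VmSize") - vsz0;
    long drss = status_kb("VmRSS") - rss0;
    munmap(p, bytes);
    return {dvsz, drss};
}
```

On a typical Linux system, reserving 256 MiB this way grows the address space by roughly 262144 kB while the resident set barely moves, which is why "allocated" numbers can look much scarier than the RAM actually in use.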

Related

Vulkan: Separate vertex buffers or dynamic vertex buffers?

So I'm starting to learn Vulkan, and I haven't done anything with multiple objects yet. The thing is, if I'm making a game engine and implementing some kind of "drag and drop" feature, where you drag, for example, a cube from a panel and drop it into the scene, is it better to:
Have separate vertex buffers, or
Have just one and make it grow, kind of like std::vector? This sounds really slow, though, considering I'd have to perform a transfer operation with command buffers every time a resize needs to happen (i.e. every time a new object is added to the scene).
A growing vertex buffer is usually the way to go. Keep in mind that Vulkan allows only a very limited number of memory allocations and is designed for sub-allocation (much like arena allocation in old-school C). An excerpt from NVIDIA's Vulkan recommendations:
Use memory sub-allocation. vkAllocateMemory() is an expensive operation on the CPU. Cost can be reduced by suballocating from a large memory object. Memory is allocated in pages which have a fixed size; sub-allocation helps to decrease the memory footprint.
The only note here is to aggressively allocate your buffer up front, as large as you believe it will grow. The other point on that page warns against pushing memory to its limits; fail this and the OS will give up and kill your program:
When memory is over-committed on Windows, the OS may temporarily suspend a process from the GPU runlist in order to page out its allocations to make room for a different process’ allocations. There is no OS memory manager on Linux that mitigates over-commitment by automatically performing paging operations on memory objects.
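To make the sub-allocation idea concrete, here is a minimal bump (linear) sub-allocator sketched in plain C++. In real Vulkan code the backing block would come from a single vkAllocateMemory() call and the returned offsets would be fed to vkBindBufferMemory(); the class and its interface here are purely illustrative:

```cpp
#include <cstdint>
#include <optional>

// Minimal bump sub-allocator over one large "device memory" block.
// Hands out aligned offsets; everything is freed at once with reset().
class SubAllocator {
public:
    explicit SubAllocator(uint64_t size) : size_(size) {}

    // Returns the offset of `size` bytes aligned to `alignment`
    // (e.g. the VkMemoryRequirements alignment, a power of two),
    // or nullopt if the block is exhausted.
    std::optional<uint64_t> allocate(uint64_t size, uint64_t alignment) {
        uint64_t offset = (head_ + alignment - 1) & ~(alignment - 1);
        if (offset + size > size_) return std::nullopt;
        head_ = offset + size;
        return offset;
    }

    void reset() { head_ = 0; }  // release all sub-allocations at once

private:
    uint64_t size_;       // total size of the backing memory block
    uint64_t head_ = 0;   // first free byte
};
```

A real allocator would also track frees and handle multiple blocks, but even this trivial scheme turns thousands of would-be vkAllocateMemory() calls into one.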

Does cudaMallocManaged() create a synchronized buffer in RAM and VRAM?

In the NVIDIA developer blog post An Even Easier Introduction to CUDA, the writer explains:
To compute on the GPU, I need to allocate memory accessible by the GPU. Unified Memory in CUDA makes this easy by providing a single memory space accessible by all GPUs and CPUs in your system. To allocate data in unified memory, call cudaMallocManaged(), which returns a pointer that you can access from host (CPU) code or device (GPU) code.
I found this both interesting (since it seems potentially convenient) and confusing:
returns a pointer that you can access from host (CPU) code or device (GPU) code.
For this to be true, it seems like cudaMallocManaged() must be keeping two buffers in sync across VRAM and RAM. Is this the case, or is my understanding lacking?
In my work so far with GPU acceleration on top of the WebGL abstraction layer via GPU.js, I learned the distinct performance difference between passing VRAM-based buffers (textures, in WebGL) from kernel to kernel (keeping the buffer on the GPU, which is highly performant) and retrieving the buffer value outside the kernels to access it in RAM through JavaScript (pulling the buffer off the GPU, which takes a performance hit, since buffers in VRAM don't magically move to RAM).
Forgive my highly abstracted understanding / description of the topic, since I know most CUDA / C++ devs have a much more granular understanding of the process.
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for the convenience of the developer?
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
Does the compiler perhaps just check whether we ever reference that buffer from the CPU, and never create the CPU side of the synced buffer if it's not needed?
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
Yes, more or less. The "synchronization" is referred to in the managed memory model as migration of data. Virtual-address carveouts are made for all visible processors, and the data is migrated to (i.e. moved to, and given a physical allocation on) the processor that attempts to access it.
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
If you never touch the buffer on the CPU, then the VA carveout will be made in the CPU VA space, but no physical allocation will back it. When the GPU first attempts to access the data, the allocation will "appear" and use up GPU memory. Although there are certainly "costs", no CPU (physical) memory is used in this case. Furthermore, once instantiated in GPU memory, there should be no ongoing additional cost for the GPU to access it; it should run at "full" speed. The instantiation/migration process is a complex one, and what I describe here is what I would consider the "principal" modality or behavior. There are many factors that could affect this.
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
No, this is handled by the runtime, not at compile time.
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
No, you don't have it all wrong. Yes, we are talking about VRAM.
The blog post you reference barely touches on managed memory, which is a fairly involved subject. There are numerous online resources to learn more about it; here is one. There are good GTC presentations on managed memory, including here. There is also an entire section of the CUDA programming guide covering managed memory.
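For a concrete picture of the migration behavior described above, here is a minimal managed-memory sketch (it requires nvcc and an NVIDIA GPU to run; the kernel, names, and sizes are purely illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));  // one pointer, one VA range
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // pages instantiated by CPU access
    scale<<<(n + 255) / 256, 256>>>(x, n);     // pages migrate to the GPU
    cudaDeviceSynchronize();                   // wait before the CPU touches x
    printf("x[0] = %f\n", x[0]);               // pages migrate back on access
    cudaFree(x);
}
```

There is only ever one logical copy of the data; the runtime moves its physical backing between RAM and VRAM as each processor faults on it.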

With OpenCL, How to get GPU memory usage?

I'm looking for a reliable way to determine current GPU memory usage with OpenCL.
I have found the NVIDIA API cudaMemGetInfo(size_t* free, size_t* total) to get the free and total memory on the current device.
But I'm looking for a solution for AMD and OpenCL. I could not find similar functionality in OpenCL, and I don't know whether AMD has something equivalent.
I don't want to know how much free memory there is on OpenCL devices before allocating buffers, but rather the free memory after allocating buffers.
A priori, as indicated in How do I determine available device memory in OpenCL?, with OpenCL there is no way to do this, and no need to know it.
I don't want to know how much free memory there is on OpenCL devices before allocating buffers, but rather the free memory after allocating buffers.
For AMD, perhaps try CL_DEVICE_GLOBAL_FREE_MEMORY_AMD from the cl_amd_device_attribute_query extension - this extension will probably only work with proprietary drivers, though.
In the general case, it's impossible, because AFAIK there's no way to know when buffers are allocated on the device. In this sense OpenCL is higher-level than CUDA: buffers belong to contexts, not devices. Calling clCreateBuffer() can, but doesn't have to, allocate any memory on any device; implementations automatically migrate buffers to device memory before executing a kernel that needs them, and move them off the device if they need to free memory for the next kernel. Even if you get the free memory of a device, you can't 100% reliably use it to decide whether to run a kernel, because clEnqueueNDRangeKernel() doesn't necessarily launch the kernel immediately (it only enqueues it; if there's something else in the queue, it can be delayed), and some other application on the same machine could get scheduled on the GPU in the meantime.
If you want to avoid swapping memory, you'll have to make sure that (1) your application is the only one using the GPU, and (2) for each of your kernels, the total size of the buffer arguments is <= CL_DEVICE_GLOBAL_MEM_SIZE.
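Rule (2) is easy to enforce on the host side. A hedged sketch (the function name is made up) that checks whether a kernel's buffer arguments fit into the device's global memory, i.e. the value you would query via clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_SIZE:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Returns true if all buffer arguments of a kernel fit into the device's
// global memory at once; call this before enqueueing the kernel.
bool kernel_fits(const std::vector<uint64_t>& buffer_sizes,
                 uint64_t global_mem_size) {
    uint64_t total = std::accumulate(buffer_sizes.begin(),
                                     buffer_sizes.end(), uint64_t{0});
    return total <= global_mem_size;
}
```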

What is texture memory, allocated with OpenGL, limited by?

I'm making a 2D game with OpenGL. Something I'm concerned about is texture memory consumption. 2D games use a few orders of magnitude more texture memory than 3D games, most of it coming from animation frames and backgrounds.
Obviously there's a limit to how much texture memory a program can allocate, but what determines that limit? Does the limit come from available general memory to the program, or is it limited by how much video memory is available on the GPU? Can it use swap space at all? What happens when video memory is exhausted?
OpenGL's memory model is very abstract. Up to and including version 2.1, there were two kinds of memory: fast "server" memory and slow "client" memory. But there are no limits that can be queried in any way. When an OpenGL implementation runs out of "server" (= GPU) memory, it may start swapping or simply report "out of memory" errors.
OpenGL 3 mostly did away with the two different kinds of memory (OpenGL 4 finished the job). There's just "memory", and the limits are quite arbitrary, depending on the OpenGL implementation (= GPU + driver) the program is running on. All OpenGL implementations are perfectly capable of swapping out textures that haven't been used in a while, so the only situation in which you'd run into an out-of-memory condition is attempting to create a single very large texture. More recent GPUs are in fact capable of swapping parts of textures in and out on an as-needed basis. Things will get slow, but they keep working.
Obviously there's a limit to how much texture memory a program can allocate, but what determines that limit?
The details of the system's OpenGL implementation and, depending on that, the amount of memory installed in the system.

CUDA program kernel code in device memory space

Is there any way to find out how much memory the kernel code occupies in GPU (device) memory?
If I have 512 MB of device memory, how can I know how much is available for allocation?
Could the Visual Profiler show such info?
Program code uses up very little memory. The rest of the CUDA context (local memory, constant memory, printf buffers, heap and stack) uses a lot more. The CUDA runtime API includes the cudaMemGetInfo call, which returns the amount of free memory available to your code. Note that because of fragmentation and page-size constraints, you won't be able to allocate every last free byte of memory. The best strategy is to start with the maximum and repeatedly attempt successively smaller allocations until one succeeds.
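That retry strategy can be sketched in a few lines. The sketch below uses plain malloc() as a stand-in for cudaMalloc() (the function name is my own invention), halving the request until an allocation succeeds or falls below a floor:

```cpp
#include <cstddef>
#include <cstdlib>

// Try to allocate `request` bytes; on failure, halve the request and retry
// until an allocation succeeds or the size drops below `minimum` (> 0).
// On success, *got holds the size actually obtained.
void* alloc_largest(std::size_t request, std::size_t minimum,
                    std::size_t* got) {
    for (std::size_t size = request; size >= minimum; size /= 2) {
        void* p = std::malloc(size);  // stand-in for cudaMalloc()
        if (p) {
            *got = size;
            return p;
        }
    }
    *got = 0;
    return nullptr;  // nothing at or above `minimum` could be allocated
}
```

With cudaMalloc() in place of malloc(), the same loop yields the largest device allocation the fragmented free space can actually satisfy.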
You can see a fuller explanation of device memory consumption in my answer to an earlier question along similar lines.