GPU Memory Management in OpenGL and DirectX 12

GPU Memory Management in OpenGL and DirectX 12 - c++

I am currently improving my knowledge in OpenGL and DirectX 12 in order to create graphics applications with both APIs. I studied several tutorials but I still do not completely understand, how the memory is managed on the GPU side.
In OpenGL (my application runs an OpenGL 3.3 context), the frame buffers are created implicitly so I assume, that they are also freed implicitly by the API. In my example program, I created vertex and index buffers using glGenBuffers and uploaded them to the GPU using glBufferData. In case I want to update my vertex buffer every frame, I could simply do this using glBufferSubData. Let's assume instead, that I want to re-upload my vertex buffer every frame using glBufferData. According to the OpenGL documentation, this function creates and initializes the buffer's data store on the GPU. So I assume, that the GPU memory, mapped to this VBO is reused after another call to glBufferData in the next frame.
In DirectX 12, the frame buffers must be created by the graphics programmer. Those are managed and reused by the swap chain during the life time of the program. In my DirectX 12 test program, I also create vertex and index buffers using upload heaps and the ID3D12Device::CreateCommittedResource function. I also do this every frame for testing purposes. The buffers are stored in Microsoft::WRL::ComPtr<ID3D12Resource> variables. At the end of the render method, the use count of those buffer pointers should hit 0, which will free the memory behind on the CPU side. Nevertheless, I do not understand, what happens to the data and the underlying heap on the GPU side. Are they released, whenever the buffer pointer's use count hits 0, do they need to be freed manually, are they discarded by the GPU when reaching the fence or none of them.
I would really appreciate it, if you could provide some clarifications on this topic and my assumptions.
Can you also please provide an explanation, if and how GPU data needs to be freed by the graphics programmer.
Best regards.

For DirectX 12, it uses the same lifetime model as previous versions of Direct3D: The object is kept alive until the reference count hits 0. It's then eligible for destruction. The exact time of cleanup is up to the driver/runtime as it typically does 'delayed destruction' (it actually has both an 'internal' and 'external' reference count, and both have to be 0 before it's really eligible for destruction).
See Microsoft Docs.
You should watch this YouTube video if you want an overview of the underlying details.

Related

Why use Vertex Buffer Objects for dynamic objects?

Not sure what the DX parlance is for these, but I'm sure they have a similar notion.
As far as I'm aware the advantage of VBO's is that they allocate memory that's directly available by the GPU. We can then upload data to this buffer, and keep it there for an extended number of frames, preventing all the overhead of uploading the data every frame. Additionally, we're able to alter this data on a per-datum basis, if we choose to.
Therefore, I can see the advantage of using VBO's for static geo, but I don't see any benefit at all for dynamic objects - since you pretty much have to update all the data every frame anyways?

There are several methods of updating buffers in OpenGL. If you have dynamic data, you can simply reinitialize the buffer storage every frame with the new data (eg. with glBufferData). You can also use client vertex buffer pointers, in compatibility contexts. However, these methods can cause 'churn' in the memory allocation of the driver. The new data storage essentially has to sit in system memory until the GPU driver handles it, and it's not possible to get feedback on this process.
In later OpenGL versions (4.4, and via extensions in earlier versions), some functionality was introduced to try and reduce the overhead of updating dynamic buffers, allowing for GPU allocated memory to be written without direct driver synchronization. This essentially requires that you have the glBufferStorage and glMapBufferRange functionality available. You create the buffer storage with the GL_DYNAMIC_STORAGE_BIT, and then map it with GL_MAP_PERSISTENT_BIT (you may require other flags, depending on whether you are reading and/or writing the data). However, this technique also requires that you use GPU fencing to ensure you are not overwriting the data as the GPU is reading it. Using this method makes updating VBOs much more efficient than reinitializing the data store, or using client buffers.
There is a good presentation on GDC Vault about this technique (skip to the DynamicStreaming heading).

AFAIK, by creating dynamic vertex buffer, you are giving graphic adapter driver a hint to place vertex buffer in memory which fast for CPU to write but also reasonably fast for GPU to read it. Driver usually will manage it to minimize GPU stall by giving non-overlapping memory area so that CPU can write while GPU read other memory area.
If you do not give hint, it is assume a static resource so it will be placed in memory which fast for GPU to read/write but very slow for CPU to write.

OpenGL multithreading/shared context and glGenBuffers

If I plan to use multithreading in OpenGL, should I have separate buffers (from glGenBuffers) for each context?
I do not know much about OpenGL multithreading yet (for now I work in "single" thread). I need to know if I can share buffers already pushed to Video Memory (with glBufferData/glBufferSubData), or I have to keep copy of buffer for another thread.

You do not want to use several contexts with several threads. You really don't.
While this sounds like a good idea, in practice muli-context-multi-thread is complicated, troublesome, and badly supported on the driver side, and it only marginally improves (possibly even reduces!) performance.
What you really want is to have only one thread talk to OpenGL (with one context, obviously), map a buffer, and pass the memory pointer to another thread, preferrably using 3 buffers (3 subbuffers of a 3x sized buffer) with immutable storage and persistent mapping, if this is available.
That, and doing indirect render calls, where a second thread feeds the buffers the indirect call reads from.
Further info on the persistent mapping topic: See in particular slides 22-25 of this GDC2014 presentation, which is basically a remake of Cass Everitt's 2013 SIGGRAPH talk.
See also Everitt's original talk: Beyond porting.

Vaos aren't shared so you'll need to generate a new vao for each object per context or else the behavior will become unpredictable and incorrect upon deletion / creation of a new one. This can be a major source of error. Vbos can be shared, so you just need one vbo per object.

Efficient way of updating texture in OpenGL

I wanted to render multiple video streams using OpenGL. Currently I am performing using glTexImage2D provided by JOGL and rendering on Swing window.
For updating texture content for each video frame I am calling glTexImage2D. I want to know is there any faster method to update texture without calling glTexImage2D for each frame.

You will always be using glTexImage2D, but with the difference that data comes from a buffer object (what is this?) rather than from a pointer.
What's slow in updating a texture is not updating the texture, but synchronizing (blocking) with the current draw operation, and the PCIe transfer. When you call glTexImage, OpenGL must wait until it is done drawing the last frame during which it is still reading from the texture. During that time, your application is blocked and does nothing (this is necessary because otherwise you could modify or free the memory pointed to before OpenGL can copy it!). Then it must copy the data and transfer it to the graphics card, and only then your application continues to run.
While one can't make that process much faster, one can make it run asynchronously, so this latency pretty much disappears.
The easiest way of doing this is to for video frames is to create a buffer name, bind it, and reserve-initialize it once.
Then, on each subsequent frame, discard-initialize it by calling glBufferData with a null data pointer, and fill it either with a non-reserving call, or by mapping the buffer's complete range.
The reason why you want to do this strange dance instead of simply overwriting the buffer is that this will not block. OpenGL will synchronize access to buffer objects so you do not overwrite data while it is still reading from it. glBufferData with a null data pointer is a way of telling OpenGL that you don't really care about the buffer and that you don't necessary want the same buffer. So it will just allocate another one and give you that one, keep reading from the old one, and secretly swap them when it's done.
Since the word "synchronization" was used already, I shall explain my choice of glMapBufferRange in the link above, when in fact you want to map the whole buffer, not some range. Why would one want that?
Even if OpenGL can mostly avoid synchronizing when using the discard technique above, it may still have to, sometimes.
Also, it still has to run some kind of memory allocation algorithm to manage the buffers, which takes driver time. glMapBufferRange lets you specify additional flags, in particular (in later OpenGL versions) a flag that says "don't synchronize". This allows for a more complicated but still faster approach in which you create a single buffer twice the size you need once, and then keep mapping/writing either the lower or upper half, telling OpenGL not to synchronize at all. It is then your responsibility to know when it's safe (presumably by using a fence object), but you avoid all overhead as much as possible.

You can't update the texture without updating the texture.
Also I don't think that one call to glTexImage can be a real performance problem. If you are so oh concerned about it though, create two textures and map one of them for writing when using the other for drawing, then swap (just like double-buffering works).
If you could move processing to GPU you wouldn't have to call the function at all, which is about 100% speedup.

Confusion regarding memory management in OpenGL

I'm asking this question because I don't want to spend time writing some code that duplicates functionalities of the OpenGL drivers.
Can the OpenGL driver/server hold more data than the video card? Say, I have enough video RAM to hold 10 textures. Can I ask OpenGL to allocate 15 textures without getting an GL_OUT_OF_MEMORY error?
If I can rely on the driver to cleverly send the textures/buffers/objects from the 'normal' RAM to the video RAM when needed then I don't really need to Gen/Delete these objects myself. I become limited by the 'normal' RAM which is often plentiful when compared to the video RAM.

The approach "memory is abundant so I don't need to delete" is bad, and the approach "memory is abundant, so I'll never get out of memory errors" is flawed.
OpenGL memory management is obscure, both for technical reasons (see t.niese's comment above) and for ideological reasons ("you don't need to know, you don't want to know"). Though there exist vendor extensions (such as ATI_meminfo) that let you query some non-authorative numbers (non-authorative insofar as they could change the next millisecond, and they do not take effects like fragmentation into account).
Generally, for the most part, your assumption that you can use more memory than there is GPU memory is correct.
However, you are not usually not able to use all available memory. More likely, there is a limit well below "all available RAM" due to constraints on what memory regions (and how large regions) the driver can allocate, lock, and DMA to/from. And even though you can normally use more memory than will fit on the GPU (even if you used it exclusively), this does not mean careless allocations can't and won't eventually fail.
Usually, but not necessarily, you consume as much system memory as GPU memory, too (without knowing, the driver does that secretly). Since the driver swaps resources in and out as needed, it needs to maintain a copy. Sometimes, it is necessary to keep 2 or 3 copies (e.g. when streaming or for ARB_copy_buffer operations). Sometimes, mapping a buffer object is yet another copy in a specially allocated block, and sometimes you're allowed to write straight into the driver's memory.
On the other hand, PCIe 2.0 (and PCIe 3.0 even more so) is fast enough to stream vertices from main memory, so you do not even strictly need GPU memory (other than a small buffer). Some drivers will stream dynamic geometry right away from system memory.
Some GPUs do not even have separate system and GPU memory (Intel Sandy Bridge or AMD Fusion).
Also, you should note that deleting objects does not necessarily delete them (at least not immediately). Usually, with very few exceptions, deleting an OpenGL object is merely a tentative delete which prevents you from further referencing the object. The driver will keep the object valid for as long as it needs to.
On the other hand, you really should delete what you do not need any more, and you should delete early. For example, you should delete a shader immediately after attaching it to the program object. This ensures that you do not leak resources, and it is guaranteed to work. Deleting and re-specifying the in-use vertex or pixel buffer when streaming (by calling glBufferData(... NULL); is a well-known idiom. This only affects your view of the object, and it allows the driver to continue using the old object in parallel for as long as it needs to.

Some additional information to my comment that did not fit in there.
There are different reasons why this is not part of OpenGL.
It isn't an easy task for the system/driver to guess which resources are and will be required. The driver for sure could create an internal heuristic if resource will be required often or rarely (like CPU does for if statements and doing pre executing code certain code parts on that guess). But the GPU will not know (without knowing the application code) what resource will be required next. It even has no knowledge where the geometry is places in the scene (because you do this with you model and view martix you pass to your shader yourself)
If you e.g. have a game where you can walk through a scene, you normally won't render the parts that are out of the view. So the GPU could think that these resources are not required anymore, but if you turn around then all this textures and geometry is required again and needs to be moved from system memory to gpu memory, which could result in really bad performance. But the Game Engine itself has, because of the use of octrees (or similar techniques) and the possible paths that can be walked, an in deep knowledge about the scene and which resource could be removed from the GPU and which one could be move to the GPU while playing and where it would be necessary to display a loading screen.
If you look at the evolution of OpenGL and which features become deprecated you will see that they go to the direction to remove everything except the really required features that can be done best by the graphic card, driver and system. Everything else is up to the user to implement on it's own to get the best performance. (you e.g. create your projection matrix yourself to pass it to the shader, so OpenGl even does not know where the object is placed in the scene).

Here's my TL;DR answer, I recommend reading Daemon's and t.niese's answers as well:
Can the OpenGL driver/server hold more data than the video card?
Yes
Say, I have enough video RAM to hold 10 textures. Can I ask OpenGL to allocate 15 textures without getting an GL_OUT_OF_MEMORY error?
Yes. Depending on the driver / GPU combination it might even be possible to allocate a single texture that exceeds the GPU's memory, and actually use it for rendering. At my current occupation I exploit that fact to extract slices of arbitrary orientation and geometry from large volumetric datasets, using shaders to apply filters on the voxel data in situ. Works well, but doesn't work for interactive frame rates.

Allocating PBOs on separate thread?

I've got a multithreaded OpenGL application using PBOs for data transfers between cpu and gpu.
I have pooled the allocation of PBOs, however, when the pools are empty my non-opengl threads have to block for a while until the OpenGL thread reaches a point where it can allocate the buffers (i.e. finished rendering the current frame). This waiting is causing some lag spikes in certain situations which I'd like to avoid.
Is it possible to allocate PBO's on another thread which are then used by the "main" OpenGL thread?

Yes, you can create objects on one thread which can be used on another. You'll need a new GL context to do this.
That being said, there are two concerns.
First, it depends on what you mean by "allocation of PBOs". You should never be allocating buffer objects in the middle of a frame. You should allocate all the buffers you need up front. When the time comes to use them, then you can simply use what you have.
By "allocate", I mean call glBufferData on a previously allocated buffer using a different size or driver hint than was used before. Or by using glGenBuffers and glDeleteBuffers in any way. Neither of these should happen within a frame.
Second, invalidating the buffer should never cause "lag spikes". By "invalidate", I mean reallocating the buffer with glBufferData using the same size and usage hint, or using glMapBufferRange with the GL_INVALIDATE_BUFFER bit. You should look at this page on how to stream buffer object data for details. If you're getting issues, then you're probably on NVIDIA hardware and using the wrong buffer object hint (ie: use STREAM).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js