OpenGL Buffer Object internal workings? - c++

I've started to use Pixel Buffer Objects and while I understand how to use them and the gist of what they're doing, I really don't know what's going on under the hood. I'm aware that the OpenGL spec allows for leeway in regards to the exact implementation, but that's still beyond me.
So far as I understand, the Buffer Object typically resides server side in GRAM, though this apparently may vary depending on target and usage. This makes perfect sense, as it would explain why OpenGL calls on the BOs operate so fast. But in which cases would it reside in AGP or system memory? (Side question: does PCI-e have an equivalent of AGP memory?)
Also, glMapBuffer() returns a pointer to a block of memory of the BO so the data may be read/written/changed. But how is this done? The manipulations are taking place client side, so the data still has to go from server to client somehow. If it does, how is it better than glReadPixels()?
PBOs are clearly better than glReadPixels(), as the performance difference shows; I just don't understand how.
I haven't used FBOs yet, but I've heard they're better to use. Is this true? If so, why?

I can't tell you in what memory the buffer object will be allocated. Actually you mostly answered that question yourself, so you can hope that a good driver will actually do it this way.
glMapBuffer can be implemented the same way as memory mapped files. Remember the difference between physical memory and virtual address space: when you write to a memory location, the address is mapped through a page table to a physical location. If the required page is marked as swapped out, a page fault occurs and the system loads the required page from swap into RAM. This mechanism can be used to map files and other resources (like GPU memory) into your process's virtual address space. When you call glMapBuffer, the system allocates some address range (not memory, just addresses) and prepares the relevant entries in the page table. When you try to read from or write to these addresses, the system loads the data from, or sends it to, the GPU. Of course this would be slow, so some buffering is done on the way.
If you constantly transfer data between CPU and GPU, I doubt that PBOs will be faster. They are faster when you make many manipulations on the GPU (like loading from the frame buffer, changing a few texels with the CPU and using the result as a texture again on the GPU). Well, they can be faster in the case of an integrated graphics processor or AGP memory, because in that case glMapBuffer can map the addresses directly to the physical memory, effectively eliminating one copy operation.
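The memory-mapped-file analogy can be made concrete with a small sketch. It does not use OpenGL at all: it assumes a POSIX system and uses mmap on a scratch file, so that a plain memory write ends up in a resource that lives elsewhere. The function name write_through_mapping is made up for this illustration, and whether any particular driver implements glMapBuffer this way is not something the code can show.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: writes `text` into a scratch file through a memory
// mapping, then reads it back with pread() to show the write reached the file.
// Returns an empty string if any system call fails.
std::string write_through_mapping(const std::string& text) {
    assert(!text.empty());
    char path[] = "/tmp/map_demo_XXXXXX";
    int fd = mkstemp(path);                 // create the scratch file
    if (fd < 0) return {};
    unlink(path);                           // file disappears once fd is closed
    const std::size_t size = text.size();
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) { close(fd); return {}; }

    // Reserve an address range backed by the file (compare: glMapBuffer).
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return {}; }

    std::memcpy(p, text.data(), size);      // an ordinary memory write
    msync(p, size, MS_SYNC);                // push dirty pages to the file
    munmap(p, size);                        // compare: glUnmapBuffer

    std::string out(size, '\0');
    ssize_t n = pread(fd, &out[0], size, 0); // read through the file API
    close(fd);
    if (n != static_cast<ssize_t>(size)) return {};
    return out;
}
```

Calling write_through_mapping("hello buffer") should return the same string: the bytes were stored with memcpy, yet they are visible through the file descriptor.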
Are FBOs better? For what? They are better when you need to render to texture. That's again because they eliminate one data copy operation.

Related

Does cudaMallocManaged() create a synchronized buffer in RAM and VRAM?

In an Nvidia developer blog: An Even Easier Introduction to CUDA the writer explains:
To compute on the GPU, I need to allocate memory accessible by the
GPU. Unified Memory in CUDA makes this easy by providing a single
memory space accessible by all GPUs and CPUs in your system. To
allocate data in unified memory, call cudaMallocManaged(), which
returns a pointer that you can access from host (CPU) code or device
(GPU) code.
I found this both interesting (since it seems potentially convenient) and confusing:
returns a pointer that you can access from host (CPU) code or device
(GPU) code.
For this to be true, it seems like cudaMallocManaged() must be syncing 2 buffers across VRAM and RAM. Is this the case? Or is my understanding lacking?
In my work so far with GPU acceleration on top of the WebGL abstraction layer via GPU.js, I learned the distinct performance difference between passing VRAM based buffers (textures in WebGL) from kernel to kernel (keeping the buffer on the GPU, highly performant) and retrieving the buffer value outside of the kernels to access it in RAM through JavaScript (pulling the buffer off the GPU, taking a performance hit since buffers in VRAM on the GPU don't magically move to RAM).
Forgive my highly abstracted understanding / description of the topic, since I know most CUDA / C++ devs have a much more granular understanding of the process.
So is cudaMallocManaged() creating synchronized buffers in both RAM
and VRAM for convenience of the developer?
If so, wouldn't doing so come with an unnecessary cost in cases where
we might never need to touch that buffer with the CPU?
Does the compiler perhaps just check if we ever reference that buffer
from CPU and never create the CPU side of the synced buffer if it's
not needed?
Or do I have it all wrong? Are we not even talking VRAM? How does
this work?
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
Yes, more or less. The "synchronization" is referred to in the managed memory model as migration of data. Virtual address carveouts are made for all visible processors, and the data is migrated (i.e. moved to, and provided a physical allocation for) the processor that attempts to access it.
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
If you never need to touch the buffer on the CPU, then what will happen is that the VA carveout will be made in the CPU VA space, but no physical allocation will be made for it. When the GPU attempts to actually access the data, it will cause the allocation to "appear" and use up GPU memory. Although there are "costs" to be sure, there is no usage of CPU (physical) memory in this case. Furthermore, once instantiated in GPU memory, there should be no ongoing additional cost for the GPU to access it; it should run at "full" speed. The instantiation/migration process is a complex one, and what I am describing here is what I would consider the "principal" modality or behavior. There are many factors that could affect this.
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
No, this is managed by the runtime, not compile time.
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
No you don't have it all wrong. Yes we are talking about VRAM.
The blog you reference barely touches on managed memory, which is a fairly involved subject. There are numerous online resources to learn more about it, and you might want to review some of them. There are good GTC presentations on managed memory, and there is also an entire section of the CUDA programming guide covering it.
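The "principal" behavior described in this answer (one virtual range, physical backing created on first touch, pages moving to whichever processor accesses them) can be sketched as a toy model in plain C++. Nothing here calls the CUDA runtime; ManagedRange, Where and touch are names invented for the illustration, and the real migration logic has many more factors, as the answer notes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Where { Unbacked, Host, Device };

// Toy model of a managed allocation: address space is reserved for every
// page up front, but no page has physical backing until it is touched.
class ManagedRange {
public:
    explicit ManagedRange(std::size_t pages) : residency_(pages, Where::Unbacked) {}

    // Simulates an access to `page` from the host or the device.
    void touch(std::size_t page, Where who) {
        assert(page < residency_.size() && who != Where::Unbacked);
        Where& r = residency_[page];
        if (r == who) return;                      // already resident here
        if (r != Where::Unbacked) ++migrations_;   // page moves between memories
        r = who;                                   // first touch or migration
    }

    // Number of pages currently backed by memory of kind `w`.
    std::size_t pages_on(Where w) const {
        std::size_t n = 0;
        for (Where r : residency_) if (r == w) ++n;
        return n;
    }

    std::size_t migrations() const { return migrations_; }

private:
    std::vector<Where> residency_;
    std::size_t migrations_ = 0;
};
```

In this model, a range that is only ever touched with Where::Device never gets a host page and never records a migration, which mirrors the "no usage of CPU (physical) memory" case above.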

How to avoid VRAM fragmentation?

I realize there's no way to avoid it for certain, as OpenGL says nothing about VRAM fragmentation.
But all the same, I have fragmentation in my app and I want to try reducing it on common platforms.
The only thing I found on the topic was this:
The best way to prevent heavy memory fragmentation is to try to restrict the number of varying resolutions in a project. When an asset is swapped out for one that is the same resolution, it can often take its place in memory.
Which makes a lot of sense.
Is it really a good idea? And are there other things to keep in mind about this?
Note that in my use case virtually all my VRAM usage consists of textures (and the back/front/depth buffers). Hardly any is buffer objects and such.
Basically you are right. Of all the resource types, textures are the heavy ones. Textures can be used as buffers as well as ordinary images, and you can have multiple resolutions of the same texture. Remember that the driver applies its own criteria when deciding whether to load a resource into GPU RAM or unload it. If your textures share the same resolution, then when one of them is unloaded its memory block can be given to another without causing fragmentation. Also, when you create a texture (or any resource, for that matter) you can't be sure that its memory is in GPU RAM. It could be anywhere, depending on the driver implementation, and the driver might move the resource from one memory to another during its lifetime. Beyond that, at least in OpenGL, you don't have any control over the different memories and how they are allocated. That is why Vulkan is preferred when you need more control and know what you are doing: you can specify not only the type of memory you want but also your own allocators for that memory.
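Why same-sized assets help can be shown with a toy first-fit allocator. This is a sketch under a strong assumption: real drivers are free to allocate, compact and move memory however they like, so ToyHeap (a made-up name) only illustrates the general fragmentation argument, not any actual driver.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy first-fit allocator over a linear "VRAM" of `capacity` units.
class ToyHeap {
public:
    explicit ToyHeap(std::size_t capacity) : capacity_(capacity) {}

    // Returns the offset of the new block, or -1 if no contiguous gap fits.
    long allocate(std::size_t size) {
        assert(size > 0);
        std::size_t cursor = 0;                      // end of the previous block
        auto it = blocks_.begin();
        for (; it != blocks_.end(); ++it) {
            if (it->offset - cursor >= size) break;  // gap before this block fits
            cursor = it->offset + it->size;
        }
        if (it == blocks_.end() && capacity_ - cursor < size) return -1;
        blocks_.insert(it, Block{cursor, size});     // keeps blocks_ sorted
        return static_cast<long>(cursor);
    }

    // Frees the block that starts at `offset`, if there is one.
    void release(long offset) {
        for (auto it = blocks_.begin(); it != blocks_.end(); ++it)
            if (static_cast<long>(it->offset) == offset) { blocks_.erase(it); return; }
    }

    // Total free units, regardless of whether they are contiguous.
    std::size_t free_total() const {
        std::size_t used = 0;
        for (const Block& b : blocks_) used += b.size;
        return capacity_ - used;
    }

private:
    struct Block { std::size_t offset, size; };
    std::size_t capacity_;
    std::vector<Block> blocks_;                      // sorted by offset
};

// Four 4-unit "textures" fill a 16-unit heap; two non-adjacent ones are freed.
bool demo_same_size_fits() {
    ToyHeap heap(16);
    long a = heap.allocate(4);
    heap.allocate(4);
    long c = heap.allocate(4);
    heap.allocate(4);
    heap.release(a);
    heap.release(c);                                 // 8 units free, in two holes
    bool enough_total = heap.free_total() == 8;
    bool big_fails = heap.allocate(8) == -1;         // no contiguous 8-unit gap
    bool same_fits = heap.allocate(4) == a;          // same size reuses the hole
    return enough_total && big_fails && same_fits;
}
```

The 8-unit request fails even though 8 units are free in total, while a 4-unit replacement drops straight into the freed slot.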

When is data sent to the GPU with OpenGL

I've been looking into writing applications using OpenGL to render data on-screen, and there is one thing that constantly comes up -- it is slow to copy data into the GPU.
I am currently switching between reading the OpenGL SuperBible 7th Edition and reading various tutorials online, and I have not come across when data is actually sent to the GPU, I only have guesses.
Is space allocated in the GPU's RAM when I make calls to glBufferStorage/glCreateVertexArrays? Or is this space allocated in my application's memory and then copied over at a later time?
Is the pointer returned from glMapBuffer* a pointer to GPU memory, or is it a pointer to space allocated in my application's memory that is then copied over at a later time?
Assuming that the data is stored in my application's memory and copied over to the GPU, when is the data actually copied? When I make a call to glDrawArrays?
1: glCreateVertexArrays doesn't have anything to do with buffer objects or GPU memory (of that kind), so it's kinda irrelevant.
As for the rest, when OpenGL decides to allocate actual GPU memory is up to the OpenGL implementation. It can defer the actual allocation as long as it wants.
If you're asking about when your data is uploaded to OpenGL, OpenGL will always be finished with any pointer you pass it when that function call returns. So the implementation will either copy the data to the GPU-accessible memory within the call itself, or it will allocate some CPU memory and copy your data into that, scheduling the transfer to the actual GPU storage for later.
As a matter of practicality, you should assume that copying to the buffer doesn't happen immediately. This is because DMAs usually require certain memory alignment, and the pointer you pass may not have that alignment.
But usually, you shouldn't care. Let the implementation do its job.
2: Like the above, the implementation can do whatever it wants when you map memory. It might give you a genuine pointer to GPU-accessible memory. Or it might just allocate a block of CPU memory and DMA it up when you unmap the memory.
The only exception to this is persistent mapping. That feature requires that OpenGL give you an actual pointer to the actual GPU-accessible memory that the buffer resides in. This is because you never actually tell the implementation when you're finished writing to/reading from the memory.
This is also (part of) why OpenGL requires you to allocate buffer storage immutably to be able to use persistent mapping.
3: It is copied whenever the implementation feels that it needs to be.
OpenGL implementations are a black box. What they do is more-or-less up to them. The only requirement the specification makes is that their behavior act "as if" it were doing things the way the specification says. As such, the data can be copied whenever the implementation feels like copying it, so long as everything still works "as if" it had copied it immediately.
Making a draw call does not require that any buffer DMAs that this draw command relies on have completed at that time. It merely requires that those DMAs will happen before the GPU actually executes that drawing command. The implementation could do that by blocking in the glDraw* call until the DMAs have completed. But it can also use internal GPU synchronization mechanisms to tie the drawing command being issued to the completion of the DMA operation(s).
The only thing that will guarantee that the upload has actually completed is to call a function that will cause the GPU to access the buffer, then synchronizing the CPU with that command. Synchronizing after only the upload doesn't guarantee anything. The upload itself is not observable behavior, so synchronizing there may not have an effect.
Then again, it might. That's the point; you cannot know.
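The "as if" behavior described above can be sketched with a toy model in plain C++. It assumes one possible strategy, not what any particular implementation does: the update call copies the caller's data into a staging area before returning, and the staged data is only moved into the "GPU" storage when something needs to read it. ToyBuffer, sub_data and draw_reads are hypothetical names, not OpenGL functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Toy model of a buffer whose uploads are staged and applied lazily.
class ToyBuffer {
public:
    explicit ToyBuffer(std::size_t size) : gpu_(size, 0) {}

    // Like a buffer-update call: the caller's pointer is fully consumed
    // before this function returns, but the "GPU" copy is deferred.
    void sub_data(std::size_t offset, const void* data, std::size_t size) {
        assert(size > 0 && offset + size <= gpu_.size());
        Pending p{offset, std::vector<unsigned char>(size)};
        std::memcpy(p.bytes.data(), data, size);   // copy into staging memory
        pending_.push_back(std::move(p));
    }

    // Like a draw call: staged uploads must land before the "GPU" reads.
    unsigned char draw_reads(std::size_t index) {
        flush();
        return gpu_[index];
    }

    std::size_t pending_uploads() const { return pending_.size(); }

private:
    struct Pending { std::size_t offset; std::vector<unsigned char> bytes; };

    // Applies staged uploads in submission order.
    void flush() {
        for (const Pending& p : pending_)
            std::memcpy(gpu_.data() + p.offset, p.bytes.data(), p.bytes.size());
        pending_.clear();
    }

    std::vector<unsigned char> gpu_;
    std::vector<Pending> pending_;
};

// The caller overwrites its source array right after the call returns;
// the later "draw" must still see the values that were submitted.
bool demo_pointer_consumed_on_return() {
    ToyBuffer buf(4);
    unsigned char src[4] = {1, 2, 3, 4};
    buf.sub_data(0, src, 4);
    std::memset(src, 0, 4);                        // caller reuses its memory
    bool deferred = buf.pending_uploads() == 1;    // nothing copied to "GPU" yet
    bool correct = buf.draw_reads(2) == 3;         // draw sees submitted data
    return deferred && correct && buf.pending_uploads() == 0;
}
```

From the caller's side the result is the same as an immediate copy, which is all the specification requires; when the copy really happened is not observable.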

Why use Vertex Buffer Objects for dynamic objects?

Not sure what the DX parlance is for these, but I'm sure they have a similar notion.
As far as I'm aware, the advantage of VBOs is that they allocate memory that's directly accessible by the GPU. We can then upload data to this buffer and keep it there for an extended number of frames, avoiding the overhead of uploading the data every frame. Additionally, we're able to alter this data on a per-datum basis, if we choose to.
Therefore, I can see the advantage of using VBOs for static geo, but I don't see any benefit at all for dynamic objects, since you pretty much have to update all the data every frame anyway?
There are several methods of updating buffers in OpenGL. If you have dynamic data, you can simply reinitialize the buffer storage every frame with the new data (eg. with glBufferData). You can also use client vertex buffer pointers, in compatibility contexts. However, these methods can cause 'churn' in the memory allocation of the driver. The new data storage essentially has to sit in system memory until the GPU driver handles it, and it's not possible to get feedback on this process.
In later OpenGL versions (4.4, and via extensions in earlier versions), some functionality was introduced to try to reduce the overhead of updating dynamic buffers, allowing GPU-allocated memory to be written without direct driver synchronization. This essentially requires that you have the glBufferStorage and glMapBufferRange functionality available. You create the buffer storage with GL_MAP_PERSISTENT_BIT (together with GL_MAP_WRITE_BIT and/or GL_MAP_READ_BIT, depending on whether you are writing and/or reading the data), and then map it with glMapBufferRange using GL_MAP_PERSISTENT_BIT as well. Note that GL_DYNAMIC_STORAGE_BIT only controls whether the storage may be updated with glBufferSubData; it is not what enables persistent mapping. This technique also requires that you use GPU fencing to ensure you are not overwriting the data while the GPU is reading it. Using this method makes updating VBOs much more efficient than reinitializing the data store or using client buffers.
There is a good presentation on GDC Vault about this technique (skip to the DynamicStreaming heading).
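The fencing requirement is the part that is easiest to get wrong, so here is a minimal sketch of the bookkeeping in plain C++. It assumes a common arrangement (one persistently mapped buffer split into N regions, one region per frame) and replaces the GPU with a simple "last completed frame" counter. In real code the check would be done with sync objects (glFenceSync after the draw, glClientWaitSync before reuse); StreamRing and acquire are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tracks, per region of a ring buffer, the last frame that reads from it.
class StreamRing {
public:
    explicit StreamRing(std::size_t regions) : fence_(regions, 0) {
        assert(regions > 0);
    }

    // Tries to acquire the region for frame `frame` (frames start at 1).
    // `gpu_completed` is the newest frame the GPU has finished executing.
    // Returns the region index, or -1 if the GPU may still be reading the
    // data written for the frame that last used this region.
    int acquire(std::uint64_t frame, std::uint64_t gpu_completed) {
        std::size_t region = frame % fence_.size();
        if (fence_[region] > gpu_completed) return -1;  // would stomp in-flight data
        fence_[region] = frame;   // region is busy until the GPU completes `frame`
        return static_cast<int>(region);
    }

private:
    std::vector<std::uint64_t> fence_;
};
```

With three regions, frames 1 to 3 can be written while the GPU has completed nothing; frame 4 maps to the same region as frame 1 and has to wait until the GPU reports frame 1 as complete.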
AFAIK, by creating a dynamic vertex buffer you are giving the graphics adapter driver a hint to place the vertex buffer in memory that is fast for the CPU to write but also reasonably fast for the GPU to read. The driver will usually manage it to minimize GPU stalls, handing out non-overlapping memory areas so that the CPU can write to one area while the GPU reads from another.
If you do not give a hint, the buffer is assumed to be a static resource, so it will be placed in memory that is fast for the GPU to read/write but very slow for the CPU to write.

Confusion regarding memory management in OpenGL

I'm asking this question because I don't want to spend time writing some code that duplicates functionalities of the OpenGL drivers.
Can the OpenGL driver/server hold more data than the video card? Say, I have enough video RAM to hold 10 textures. Can I ask OpenGL to allocate 15 textures without getting a GL_OUT_OF_MEMORY error?
If I can rely on the driver to cleverly send the textures/buffers/objects from the 'normal' RAM to the video RAM when needed then I don't really need to Gen/Delete these objects myself. I become limited by the 'normal' RAM which is often plentiful when compared to the video RAM.
The approach "memory is abundant so I don't need to delete" is bad, and the approach "memory is abundant, so I'll never get out of memory errors" is flawed.
OpenGL memory management is obscure, both for technical reasons (see t.niese's comment above) and for ideological reasons ("you don't need to know, you don't want to know"). There do exist vendor extensions (such as ATI_meminfo) that let you query some non-authoritative numbers (non-authoritative insofar as they could change the next millisecond, and they do not take effects like fragmentation into account).
Generally, for the most part, your assumption that you can use more memory than there is GPU memory is correct.
However, you are usually not able to use all available memory. More likely, there is a limit well below "all available RAM" due to constraints on what memory regions (and how large regions) the driver can allocate, lock, and DMA to/from. And even though you can normally use more memory than will fit on the GPU (even if you used it exclusively), this does not mean careless allocations can't and won't eventually fail.
Usually, but not necessarily, you consume as much system memory as GPU memory, too (without knowing, the driver does that secretly). Since the driver swaps resources in and out as needed, it needs to maintain a copy. Sometimes, it is necessary to keep 2 or 3 copies (e.g. when streaming or for ARB_copy_buffer operations). Sometimes, mapping a buffer object is yet another copy in a specially allocated block, and sometimes you're allowed to write straight into the driver's memory.
On the other hand, PCIe 2.0 (and PCIe 3.0 even more so) is fast enough to stream vertices from main memory, so you do not even strictly need GPU memory (other than a small buffer). Some drivers will stream dynamic geometry right away from system memory.
Some GPUs do not even have separate system and GPU memory (Intel Sandy Bridge or AMD Fusion).
Also, you should note that deleting objects does not necessarily delete them (at least not immediately). Usually, with very few exceptions, deleting an OpenGL object is merely a tentative delete which prevents you from further referencing the object. The driver will keep the object valid for as long as it needs to.
On the other hand, you really should delete what you do not need any more, and you should delete early. For example, you should delete a shader immediately after attaching it to the program object. This ensures that you do not leak resources, and it is guaranteed to work. Deleting and re-specifying the in-use vertex or pixel buffer when streaming (by calling glBufferData(... NULL);) is a well-known idiom. This only affects your view of the object, and it allows the driver to continue using the old object in parallel for as long as it needs to.
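The "tentative delete" and re-specification idiom can be pictured with reference counting. The sketch below is a toy model in plain C++ and only an analogy: it assumes the driver keeps its own reference to the storage for every queued command, which is one plausible way to get the described behavior, not a statement about any real driver. ToyBufferObject, QueuedDraw and orphan are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

using Storage = std::vector<unsigned char>;

// The application's view of a buffer object.
struct ToyBufferObject {
    std::shared_ptr<Storage> storage;

    // Like re-specifying with a NULL data pointer: drop our reference to
    // the old storage and get a fresh block to write into.
    void orphan(std::size_t size) { storage = std::make_shared<Storage>(size); }
};

// A queued "draw" holds its own reference to the storage it reads from.
struct QueuedDraw {
    std::shared_ptr<const Storage> reads_from;
};

bool demo_orphaning() {
    ToyBufferObject vbo;
    vbo.orphan(4);
    (*vbo.storage)[0] = 7;                     // data for frame N

    QueuedDraw draw{vbo.storage};              // "GPU" still needs frame N data
    vbo.orphan(4);                             // app moves on without waiting
    (*vbo.storage)[0] = 9;                     // data for frame N+1

    bool old_intact = (*draw.reads_from)[0] == 7;
    bool distinct = draw.reads_from != vbo.storage;

    std::weak_ptr<const Storage> watch = draw.reads_from;
    draw.reads_from.reset();                   // "GPU" finished the draw
    bool released = watch.expired();           // old storage is gone only now
    return old_intact && distinct && released;
}
```

The application never waits: the old block stays alive exactly as long as the queued draw needs it, and the new block is writable immediately.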
Some additional information to my comment that did not fit in there.
There are different reasons why this is not part of OpenGL.
It isn't an easy task for the system/driver to guess which resources are and will be required. The driver could certainly use an internal heuristic to estimate whether a resource will be required often or rarely (much as a CPU predicts the outcome of if statements and speculatively executes code based on that guess). But the GPU cannot know, without knowing the application code, which resource will be required next. It does not even know where the geometry is placed in the scene (because you do this yourself with the model and view matrices you pass to your shader).
If you have, for example, a game where you can walk through a scene, you normally won't render the parts that are out of view. So the GPU could conclude that these resources are not required anymore, but if you turn around then all these textures and geometry are required again and need to be moved from system memory to GPU memory, which could result in really bad performance. The game engine itself, however, has in-depth knowledge of the scene thanks to octrees (or similar techniques) and the possible paths that can be walked. It knows which resources could be removed from the GPU, which ones could be moved to the GPU while playing, and where it would be necessary to display a loading screen.
If you look at the evolution of OpenGL and which features have become deprecated, you will see that it is moving toward removing everything except the features that are really required and that can be done best by the graphics card, driver and system. Everything else is up to the user to implement on their own to get the best performance. (For example, you create your projection matrix yourself and pass it to the shader, so OpenGL does not even know where the object is placed in the scene.)
Here's my TL;DR answer, I recommend reading Daemon's and t.niese's answers as well:
Can the OpenGL driver/server hold more data than the video card?
Yes
Say, I have enough video RAM to hold 10 textures. Can I ask OpenGL to allocate 15 textures without getting a GL_OUT_OF_MEMORY error?
Yes. Depending on the driver / GPU combination it might even be possible to allocate a single texture that exceeds the GPU's memory, and actually use it for rendering. At my current occupation I exploit that fact to extract slices of arbitrary orientation and geometry from large volumetric datasets, using shaders to apply filters on the voxel data in situ. Works well, but doesn't work for interactive frame rates.
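The 15-textures-in-10-slots situation from the question can be modeled with a small residency cache. This is a sketch that assumes a least-recently-used eviction policy purely for illustration; the answers above make clear that what a driver actually does is unspecified. ResidencyCache, use and uploads_for_pass are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Toy model: VRAM has room for `slots` textures; the rest stay in system RAM.
class ResidencyCache {
public:
    explicit ResidencyCache(std::size_t slots) : slots_(slots) { assert(slots > 0); }

    // Marks `texture` as used by a draw.
    // Returns true if it had to be uploaded (it was not resident).
    bool use(int texture) {
        auto found = where_.find(texture);
        if (found != where_.end()) {
            lru_.splice(lru_.begin(), lru_, found->second);  // now most recent
            return false;
        }
        if (lru_.size() == slots_) {       // full: evict least recently used
            where_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(texture);
        where_[texture] = lru_.begin();
        return true;
    }

    bool resident(int texture) const { return where_.count(texture) != 0; }

private:
    std::size_t slots_;
    std::list<int> lru_;                                   // front = most recent
    std::unordered_map<int, std::list<int>::iterator> where_;
};

// Uses textures 0 .. count-1 once each and returns how many uploads occurred.
std::size_t uploads_for_pass(ResidencyCache& cache, int count) {
    std::size_t uploads = 0;
    for (int t = 0; t < count; ++t)
        if (cache.use(t)) ++uploads;
    return uploads;
}
```

In this model all 15 textures can be used, but cycling through them every frame with only 10 slots makes every use an upload, which fits the remark above that oversubscription works yet is not suitable for interactive frame rates.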