I realize there's no way to avoid it for certain, as OpenGL says nothing about VRAM fragmentation.
But all the same, I have fragmentation in my app and I want to try reducing it on common platforms.
The only thing I found on the topic was this:
The best way to prevent heavy memory fragmentation is to try to and restrict the amount of varying resolutions in a project. When an asset is swapped out for one that is the same resolution, often times it can take it's place in the memory.
Which makes a lot of sense.
Is it really a good idea? And are there other things to keep in mind about this?
Note that in my usecase virtually all my VRAM usage consists of textures (and the back/front/depth buffers). Hardly any is buffer objects and such.
Basically you are right. Out of all the resources texture is heavy one. Not only you can use textures as buffers but as normal images also. Plus you can have multiple resolutions of same texture. You have to remember on driver side there are certain criteria that makes the resource to loaded in GPU ram or unload. So if you have textures with same resolution. Even if one of it is unloaded its memory block can be allocated to other without any fragmentation. Also when you create texture or any resource for that matter you cant be sure the memory is present in GPU Ram. It could be any where based on Driver implementation. Also during the lifetime of the resource driver might move that resource from one memory to other. Other than that at least in OpenGL you dont have any control over different memories and how they are allocated. Hence Vulkan is preferred in such cases when you need more control and you know what you are doing as you can specify not only type of memory you want but also you can specify your own allocators for that memory.
Related
So basically whenever I create buffer objects Opengl allocates some memory on the GPU.
Consider scenario 1 where I generate 2 uniform buffers for 2 uniform variables.
Now consider scenario 2 where I create a single buffer and enclose the 2 uniform variables inside an interface block.
My understanding is that for scenario 1, two separate regions of memory get allocate while for scenario 2, one big contiguous block of memory gets allocated. If so, then Scenario 1 might be susceptible to memory fragmentation and if this happens is it managed by OpenGL or something else OR should we keep this in mind before writing performance critical code?
Actually I have to fix that for you. It's
So basically whenever I create buffer objects Opengl allocates some memory.
You don't know – and it's invalid to make assumptions – about the whereabouts of where this memory is located. You just get the assurance that it's there (somewhere) and that you can make use of it.
managed by OpenGL or something
Yes. In fact, and reasonable OpenGL implementations do have to move around data on a regular basis. Think about it: On a modern computer system several applications do use the GPU in parallel, and neither process (usually) does care about or respect the inner working of the other processes that coinhabit the same machine. Yet the user (naturally) expects, that all processes will "just work" independent of the situation.
The GPU drivers do a lot of data pushing in the background, moving stuff between the system memory, the GPU memory or even swap space on storage devices without processes noticing any of that.
OR should we keep this in mind before writing performance critical code?
Average-joe-programmer will get the best performance by just using the OpenGL API in a straightforward way, without trying to outsmart the implementation. Every OpenGL implementation (= combination of GPU model + driver version) has "fast paths", however short of having access to intimately detailed knowledge about the GPU and driver details those are very difficult to hit.
Usually only the GPU makers themselves have this knowledge; if you're a AAA game studio, you're usually having a few GPU vendor guys on quick dial to come for a visit to your office and do their voodoo; most people visiting this site probably don't.
My game engine tries to allocate large texture arrays to be able to batch majority (if not all) of its draw together. This array may become large enough that fails to allocate, at which point I'd (continually) split the texture array in halves.
Is it bad design to push the boundaries until receiving a glGetError:Out of memory and scale back from there?
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory? As in, is it less ideal for the graphics driver to be manipulating a few large texture arrays rather than many individual textures when dealing with other OS operations?
It is hard to evaluate how well drivers handle large texture arrays. Behaviour of different drivers may vary a lot.
While using texture array can improve performance by reducing the number of draw calls, that should not be the main goal. Reduction of draw calls is somewhat important on mobile platforms, and even there, several dozens of them is not a problem. I'm not sure about your concerns and what exactly you try to optimise, but I would recommend using profiling tools from GPU vendor before doing any optimisation.
Is it bad design to push the boundaries until receiving a glGetError:Out of memory and scale back from there?
This is what typically done when data is dynamically loaded to the GPU. Once the error is received, old data should be unloaded to load a new one.
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory?
There is no way to check if data was swapped to GTT or not (if driver supports GTT at all). The driver handles it on its own, and there is no access to that from OpenGL API. You may need to use profiling tools like Nsight, if you are using a GPU from NVidia.
However, if you are planning to have one giant textures array, it must fit into VRAM as a whole, it can not be partially in VRAM and in GTT. I would not recommend relying on GTT at all.
It must fit into VRAM, because when you bind it, the driver can not know beforehand which layers will be used and which won't since selection happens in the shader.
Despite the fact that textures array and 3dtexture are conceptually different, at hardware level they work very similarly, the difference is that the first one uses filtering in two dimensions and the second one - in three dimensions.
I was playing with large 3d textures for a while. I did experiments with GeForce 1070 (it has 6GB), and it handles textures ~1GB very good. The largest texture I managed to load was around 3GB (2048x2048x7**), but often it throws an error. Despite the fact that it should have a large amount of free VRAM that would fit the texture, it may fail to allocate such big chunk due to various reasons. So I would not recommend allocating textures that are comparable to the total size of VRAM unless it is absolutely necessary.
I'm somewhat confused about registers in DirectX 11. Let me give an example of the situation: Assume you have 3 models. They each have a texture that is mapped to register t0. Models 1 and 3 use the same texture, and model 2 uses a different texture. When drawing model 1, I set the texture resource view to register 0 and draw the model. Then I do the same things for models 2 and 3, but use the same resource view for model 3. When I set the the texture for model 2, does the GPU replace the texture in the GPU memory with a different one, or does it maintain that texture memory until space is needed and just moves some pointers around? I would like to minimize data transfer to GPU and I'm wondering if I should handle situations like these myself or does DX handle it for me. Btw, I am NOT using the Effects 11 framework.
Thanks in advance.
In general, you should assume that once you have created a resource on the GPU (e.g. using CreateTexture2D), that memory is reserved and resident for that resource for use by the 3D pipeline. Note that this is independent of data transfer to the GPU, which is also explicit via Map/Unmap or UpdateSubresource.
There are some cases where the OS will swap memory in and out, but usually this should be avoided if possible. For example, if you create a bunch of large textures but never access them, eventually the video memory manager will page them out to system memory for other tasks (e.g. watching Netflix / browsing the internet on another display). You can also run into real problems if you overcommit video memory (using more than what is available on the system). This used to be impossible (you would just get E_OUTOFMEMORY) but now the memory manager tries to make it work by paging things to system memory or even disk. This is something you should really strive to avoid since if you ever bind and use a paged-out resource, you'll get a glitch waiting for the memory manager to page it back in for use.
Note that the above really just applies to discrete GPU configs. On integrated systems e.g. from Intel or AMD, you get unified memory which has completely different characteristics. But in general you should target discrete configs first, since there are more performance cliffs you have to worry about if you screwed something up, and they would be unlikely to show up on integrated.
Going back to your original question, changing SRVs between draw calls is not that expensive - it's more than a pointer swap, but nowhere near the cost of transferring the entire texture across the bus. You should feel free to swap SRVs at the same frequency as your draw calls and expect no adverse performance impact.
I think what you're asking depends a lot on hardware and driver implementation. That's also why DX11 documentation doesn't do any claims on how memory is managed for graphics resources. I can't really give you any valid sources but I believe it's safe to assume that textures/buffers for which you've made a view will reside in GPU's memory (especially the ones you're accessing more frequently). The graphics driver will do a lot of access optimization. However, it's good practice to change the state of graphics pipeline as little as possible.
You can read an in depth discussion of graphics pipeline here
I'm asking this question because I don't want to spend time writing some code that duplicates functionalities of the OpenGL drivers.
Can the OpenGL driver/server hold more data than the video card? Say, I have enough video RAM to hold 10 textures. Can I ask OpenGL to allocate 15 textures without getting an GL_OUT_OF_MEMORY error?
If I can rely on the driver to cleverly send the textures/buffers/objects from the 'normal' RAM to the video RAM when needed then I don't really need to Gen/Delete these objects myself. I become limited by the 'normal' RAM which is often plentiful when compared to the video RAM.
The approach "memory is abundant so I don't need to delete" is bad, and the approach "memory is abundant, so I'll never get out of memory errors" is flawed.
OpenGL memory management is obscure, both for technical reasons (see t.niese's comment above) and for ideological reasons ("you don't need to know, you don't want to know"). Though there exist vendor extensions (such as ATI_meminfo) that let you query some non-authorative numbers (non-authorative insofar as they could change the next millisecond, and they do not take effects like fragmentation into account).
Generally, for the most part, your assumption that you can use more memory than there is GPU memory is correct.
However, you are not usually not able to use all available memory. More likely, there is a limit well below "all available RAM" due to constraints on what memory regions (and how large regions) the driver can allocate, lock, and DMA to/from. And even though you can normally use more memory than will fit on the GPU (even if you used it exclusively), this does not mean careless allocations can't and won't eventually fail.
Usually, but not necessarily, you consume as much system memory as GPU memory, too (without knowing, the driver does that secretly). Since the driver swaps resources in and out as needed, it needs to maintain a copy. Sometimes, it is necessary to keep 2 or 3 copies (e.g. when streaming or for ARB_copy_buffer operations). Sometimes, mapping a buffer object is yet another copy in a specially allocated block, and sometimes you're allowed to write straight into the driver's memory.
On the other hand, PCIe 2.0 (and PCIe 3.0 even more so) is fast enough to stream vertices from main memory, so you do not even strictly need GPU memory (other than a small buffer). Some drivers will stream dynamic geometry right away from system memory.
Some GPUs do not even have separate system and GPU memory (Intel Sandy Bridge or AMD Fusion).
Also, you should note that deleting objects does not necessarily delete them (at least not immediately). Usually, with very few exceptions, deleting an OpenGL object is merely a tentative delete which prevents you from further referencing the object. The driver will keep the object valid for as long as it needs to.
On the other hand, you really should delete what you do not need any more, and you should delete early. For example, you should delete a shader immediately after attaching it to the program object. This ensures that you do not leak resources, and it is guaranteed to work. Deleting and re-specifying the in-use vertex or pixel buffer when streaming (by calling glBufferData(... NULL); is a well-known idiom. This only affects your view of the object, and it allows the driver to continue using the old object in parallel for as long as it needs to.
Some additional information to my comment that did not fit in there.
There are different reasons why this is not part of OpenGL.
It isn't an easy task for the system/driver to guess which resources are and will be required. The driver for sure could create an internal heuristic if resource will be required often or rarely (like CPU does for if statements and doing pre executing code certain code parts on that guess). But the GPU will not know (without knowing the application code) what resource will be required next. It even has no knowledge where the geometry is places in the scene (because you do this with you model and view martix you pass to your shader yourself)
If you e.g. have a game where you can walk through a scene, you normally won't render the parts that are out of the view. So the GPU could think that these resources are not required anymore, but if you turn around then all this textures and geometry is required again and needs to be moved from system memory to gpu memory, which could result in really bad performance. But the Game Engine itself has, because of the use of octrees (or similar techniques) and the possible paths that can be walked, an in deep knowledge about the scene and which resource could be removed from the GPU and which one could be move to the GPU while playing and where it would be necessary to display a loading screen.
If you look at the evolution of OpenGL and which features become deprecated you will see that they go to the direction to remove everything except the really required features that can be done best by the graphic card, driver and system. Everything else is up to the user to implement on it's own to get the best performance. (you e.g. create your projection matrix yourself to pass it to the shader, so OpenGl even does not know where the object is placed in the scene).
Here's my TL;DR answer, I recommend reading Daemon's and t.niese's answers as well:
Can the OpenGL driver/server hold more data than the video card?
Yes
Say, I have enough video RAM to hold 10 textures. Can I ask OpenGL to allocate 15 textures without getting an GL_OUT_OF_MEMORY error?
Yes. Depending on the driver / GPU combination it might even be possible to allocate a single texture that exceeds the GPU's memory, and actually use it for rendering. At my current occupation I exploit that fact to extract slices of arbitrary orientation and geometry from large volumetric datasets, using shaders to apply filters on the voxel data in situ. Works well, but doesn't work for interactive frame rates.
I've started to use Pixel Buffer Objects and while I understand how to use them and the gist of what they're doing, I really don't know what's going on under the hood. I'm aware that the OpenGL spec allows for leeway in regards to the exact implementation, but that's still beyond me.
So far as I understand, the Buffer Object typically resides server side in GRAM; though this apparently may vary depending on target and usage. This makes perfect sense as this would be why OpenGL calls on the BOs would operate so fast. But in what such instances would it reside in AGP or system memory? (side question: does PCI-e have an equivalent of AGP memory?)
Also, glMapBuffers() returns a pointer to a block of memory of the BO so the data may be read/written/changed. But how is this done? The manipulations are taking place client side, so the data still has to go from server to client some how. If it is, how is is better than glReadPixels()?
PBOs are obviously better than glReadPixels() as is obvious by the performance difference, I just don't understand how.
I haven't used FBOs yet, but I've heard they're better to use. Is this true? if so, why?
I can't tell you in what memory the buffer object will be allocated. Actually you mostly answered that question yourself, so you can hope that a good driver will actually do it this way.
glMapBuffer can be implemented the same way as memory mapped files. Remember the difference between physical memory and virtual address space: when you write to a memory location, the address is mapped through a page table to a physical location. If the required page is marked as swapped out an interrupt occurs and the system loads the required page from the swap to the RAM. This mechanism can be used to map files and other resources (like GPU memory) to your process's virtual address space. When you call glMapBuffer, the system allocates some address range (not memory, just addresses) and prepares the relevant entries in page table. When you try to read/write to these addresses the system loads/sends it to the GPU. Of course this would be slow, so some buffering is done on the way.
If you constantly transfer data between CPU and GPU, I doubt that PBOs will be faster. They are faster when you make many manipulations on the GPU (like load from frame buffer, change a few texels with CPU and use it as a texture again on the GPU). Well, they can be faster in case of integrated graphics processor or AGP memory, because in that case glMapBuffer can map the addresses directly to the physical memory, effectively eliminating one copy operation.
Are FBOs better? For what? They are better when you need to render to texture. That's again because they eliminate one data copy operation.