How to put many small vertex buffers into GPU memory efficiently - opengl

I have some tiny objects using 4-8 vertices. Currently they use client side memory (I'm using OpenGL 2.1/OpenGL ES 2.0) but I want to use VBO (and optionally VAO) to be more compatible with WebGL 1.0 and newer GL versions.
Is there any recommended practice to do this without sacrificing performance and wasting GPU memory?
If my objects are static and use the same vertex configuration, I can put them all into a single VBO and use an offset in glDrawArrays. But what should I do if the configurations differ (a shared VBO is still possible, but not a shared VAO), or if the objects are dynamic (allocate one VBO and orphan it on every write?)?
I am especially interested in what to do with different vertex configurations (use one VBO and multiple VAOs?).
How are such cases handled in modern GL?

Related

Accessing large amounts of memory from compute shader?

I need to access large amounts of data from a GLSL compute shader (read and write).
For reference, I work with an nvidia A6000 gpu with 50GB of memory, the driver is up to date.
Here is what I've tried so far:
1. Using a SSBO: glBufferData() can allocate arbitrarily large buffers but the shader will only be able to access 2GB of memory (according to my tests).
2. Using a texture: glTextureStorage3D() can allocate very large textures but the shader will only be able to access 4GB of memory (according to my tests).
3. Using multiple textures: I break the data in multiple bindless 3D textures (GL_ARB_bindless_texture extension) that work like pages of memory. I store the texture handles in a UBO. It effectively does what I want but there are several downsides:
   - The texture/image handle pairs take space in the UBO, which could be needed by something else.
   - The textures are not accessed in a uniform control flow: two invocations of the same subgroup can fetch data in different pages. This is allowed on Nvidia gpus with the GL_NV_gpu_shader5 extension, but I could not find a similar extension on AMD.
4. Using NV_shader_buffer_load and NV_shader_buffer_store to get a pointer to gpu memory. I haven't tested this method yet, but I suspect it could be more efficient than solution 3 since there is no need to dereference the texture/image handles which introduce an indirection.
Thus, I would like to know: Would solution 4 work for my use case? Is there a better/faster way? Is there a cross-platform way?

What is GPU driven rendering? [closed]

Nowadays I keep hearing about so-called GPU driven rendering, a new rendering paradigm which doesn't require draw calls at all and is supported by the new versions of the OpenGL and Vulkan APIs. Can someone explain how it actually works at a conceptual level and what the main differences are from the traditional approach?
Overview
In order to render a scene, a number of things have to happen. You need to walk your scene graph to figure out which objects exist. For each object which exists, you now need to determine if it is visible. For each object which is visible, you need to figure out where its geometry is stored, which textures and buffers will be used to render that object, which shaders to use to render the object, and so forth. Then you render that object.
The "traditional" method is for the CPU to handle this process. The scene graph lives in CPU-accessible memory. The CPU does visibility culling on that scene graph. The CPU takes the visible objects and accesses some CPU data about the geometry (OpenGL buffer object and texture names, Vulkan descriptor sets and VkBuffers, etc.), shaders, and so on, transferring this as state data to the GPU. Then the CPU issues a GPU command to render that object with that state.
Now, if we go back farther, the most "traditional" method doesn't involve a GPU at all. The CPU would just take this mesh and texture data, do vertex transformations, rasterization, and so forth, producing an image in CPU memory. However, we started off-loading some of this to a separate processor. We started with the rasterization stuff (the earliest graphics chips were just rasterizers; the CPU did all the vertex T&L). Then we incorporated the vertex transformations into the GPU. When we did that, we started having to store vertex data in GPU accessible memory so the GPU could read it on its own time.
We did all of that, off-loading these things to a separate processor for two reasons: the GPU was (much) faster at it, and the CPU can now spend its time doing something else.
GPU driven rendering is just the next stage in that process. We went from no GPU, to rasterization GPU, to vertex GPU, and now to scene-graph-level GPU. The "traditional" method offloads how to render to the GPU; GPU driven rendering offloads the decision of what to render.
Mechanism
Now, the reason we haven't been doing this all along is because the basic rendering commands all take data that comes from the CPU. glDrawArrays/Elements takes a number of parameters from the CPU. So even if we used the GPU to generate that data, we would need a full GPU/CPU synchronization so that the CPU could read the data... and give it right back to the GPU.
That's not helpful.
OpenGL 4 gave us indirect rendering of various forms. The basic idea is that, instead of taking those parameters from a function call, they're just data stored in GPU memory. The CPU still has to make a function call to start the rendering operation, but the actual parameters to that call are just data stored in GPU memory.
The other half of that requires the ability of the GPU to write data to GPU memory in a format that indirect rendering can read. Historically, data on GPUs goes in one direction: data gets read for the purpose of being converted into pixels in a render target. We need a way to generate semi-arbitrary data from other arbitrary data, all on the GPU.
The older mechanism for this was to (ab)use transform feedback, but nowadays we just use SSBOs or, failing that, image load/store. Compute shaders help here as well, since they are designed to be outside of the standard rendering pipeline and therefore are not bound to its limitations.
The ideal form of GPU-driven rendering makes the scene-graph part of the rendering operation. There are lesser forms, such as having the GPU do nothing more than per-object viewport culling. But let's look at the most ideal process. From the perspective of the CPU, this looks like:
1. Update the scene graph in GPU memory.
2. Issue one or more compute shaders that generate multi-draw indirect rendering commands.
3. Issue a single multi-draw indirect call that draws everything.
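To make the middle step more concrete, here is a rough CPU-side sketch of what such a compute shader would produce. The DrawArraysIndirectCommand layout is the one OpenGL reads for glMultiDrawArraysIndirect; everything else (SceneObject, buildCommands, the one-dimensional visibility test) is made up for illustration, and a real implementation would run on the GPU and write into a buffer object rather than a std::vector.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Layout OpenGL reads for each sub-draw of glMultiDrawArraysIndirect:
// four tightly packed 32-bit unsigned integers.
struct DrawArraysIndirectCommand {
    std::uint32_t count;         // number of vertices in this sub-draw
    std::uint32_t instanceCount; // number of instances (0 skips the sub-draw)
    std::uint32_t first;         // index of the first vertex in the bound arrays
    std::uint32_t baseInstance;  // offset applied to instanced attributes
};
static_assert(sizeof(DrawArraysIndirectCommand) == 16, "must match the GL layout");

// Hypothetical scene entry: where the object's vertices live, plus a bounding
// sphere reduced to one dimension to keep the sketch short.
struct SceneObject {
    std::uint32_t firstVertex;
    std::uint32_t vertexCount;
    float centerX;
    float radius;
};

// Stand-in for the culling compute shader: keep the objects whose bounding
// interval overlaps [minX, maxX] and append one command per survivor.
std::vector<DrawArraysIndirectCommand>
buildCommands(const std::vector<SceneObject>& scene, float minX, float maxX) {
    assert(minX <= maxX);
    std::vector<DrawArraysIndirectCommand> commands;
    for (const SceneObject& obj : scene) {
        const bool visible = obj.centerX + obj.radius >= minX &&
                             obj.centerX - obj.radius <= maxX;
        if (visible) {
            commands.push_back({obj.vertexCount, 1u, obj.firstVertex, 0u});
        }
    }
    return commands;
}
```

The resulting array is exactly the data the final multi-draw indirect call would consume; the CPU never needs to look at its contents.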
Now of course, there's no such thing as a free lunch. Doing full scene graph processing on the GPU requires building your scene graph in a way that is efficient for GPU processing. Even more importantly, visibility culling mechanisms have to be engineered with efficient GPU processing in mind. That's complexity I'm not going to address here.
Implementation
Instead, let's look at the nuts-and-bolts of making the drawing part work. We have to sort out a lot of things here.
See, the indirect rendering command is still a regular old rendering command. While the multi-draw form draws multiple distinct "objects", it's still one CPU rendering command. This means that, for the duration of this command, all rendering state is fixed.
So everything under the purview of this multi-draw operation must use the same shader, bound buffers and textures, blending parameters, stencil state, and so forth. This makes implementing a GPU-driven rendering operation a bit complicated.
State and Shaders
If you need blending, or similar state-based differences in rendering operations, then you are going to have to issue another rendering command. So in the blending case, your scene-graph processing is going to have to compute multiple sets of rendering commands, with each set being for a specific set of blending modes. You may also need to have this system sort transparent objects (unless you're rendering them with an OIT mechanism). So instead of having just one rendering command, you have a small number of them.
But the point of this exercise isn't to have only one rendering command; the point is that the number of CPU rendering commands does not change with how much stuff you're rendering. It shouldn't matter how many objects are in the scene; the CPU will be issuing the same number of rendering commands.
When it comes to shaders, this technique requires some degree of "ubershader" style, where you have a very small number of rather flexible shaders. You want to parameterize your shader rather than having dozens or hundreds of them.
However, things were probably going to fall out that way anyway, particularly with regard to deferred rendering. The geometry pass of deferred renderers tends to use the same kind of processing, since they're just doing vertex transformation and extracting material parameters. The biggest difference usually is with regard to doing skinned vs. non-skinned rendering, but that's really only 2 shader variations, which you can handle similarly to the blending case.
Speaking of deferred rendering, the GPU driven processes can also walk the graph of lights, thus generating the draw calls and rendering data for the lighting passes. So while the lighting pass will need a separate draw call, it will still only need a single multidraw call regardless of the number of lights.
Buffers
Here's where things start to get interesting. See, if the GPU is processing the scene graph, that means that the GPU needs to somehow associate a particular draw within the multi-draw command with the resources that particular draw needs. It may also need to put the data into those resources, like the matrix transforms for a given object and so forth.
Oh, and you also somehow need to tie the vertex input data to a particular sub-draw.
That last part is probably the most complicated. The buffers which OpenGL/Vulkan's standard vertex input method pull from are state data; they cannot change between sub-draws of a multi-draw operation.
Your best bet is to try to put every object's data in the same buffer object, using the same vertex format. Essentially, you have one gigantic array of vertex data. You can then use the drawing parameters for the sub-draw to select which parts of the buffer(s) to use.
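As a rough illustration of the "one gigantic array" idea, the sketch below packs several small meshes into one vertex array and one index array, and records for each mesh the drawing parameters that select its slice. The DrawElementsIndirectCommand layout is the one OpenGL reads for glMultiDrawElementsIndirect; Vertex, Mesh, PackedGeometry and packMeshes are hypothetical names, and uploading the arrays to buffer objects is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vertex { float position[3]; float normal[3]; };

// Hypothetical CPU-side mesh; indices are relative to the mesh's own vertices.
struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;
};

// Layout OpenGL reads for each sub-draw of glMultiDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices
    std::uint32_t instanceCount; // number of instances
    std::uint32_t firstIndex;    // first index within the shared index array
    std::int32_t  baseVertex;    // value added to every fetched index
    std::uint32_t baseInstance;
};

struct PackedGeometry {
    std::vector<Vertex> vertices;                    // would become one vertex buffer
    std::vector<std::uint32_t> indices;              // would become one index buffer
    std::vector<DrawElementsIndirectCommand> draws;  // one entry per mesh
};

PackedGeometry packMeshes(const std::vector<Mesh>& meshes) {
    PackedGeometry packed;
    for (const Mesh& mesh : meshes) {
        for (std::uint32_t index : mesh.indices) {
            assert(index < mesh.vertices.size());
        }
        DrawElementsIndirectCommand cmd{};
        cmd.count = static_cast<std::uint32_t>(mesh.indices.size());
        cmd.instanceCount = 1;
        cmd.firstIndex = static_cast<std::uint32_t>(packed.indices.size());
        cmd.baseVertex = static_cast<std::int32_t>(packed.vertices.size());
        cmd.baseInstance = 0;
        packed.draws.push_back(cmd);

        // Indices are copied unchanged: baseVertex relocates them at draw time.
        packed.vertices.insert(packed.vertices.end(),
                               mesh.vertices.begin(), mesh.vertices.end());
        packed.indices.insert(packed.indices.end(),
                              mesh.indices.begin(), mesh.indices.end());
    }
    return packed;
}
```

Because baseVertex does the relocation, each mesh keeps its original indices, which also makes it cheap to move a mesh to a different place in the big buffer later.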
But what do we do about per-object data (matrices, etc), things you would typically use a UBO or global uniform for? How do you effectively change the buffer binding state within a CPU rendering command?
Well... you can't. So you cheat.
First, you realize that SSBOs can be arbitrarily large. So you don't really need to change buffer binding state. What you need is a single SSBO that contains everyone's per-object data. For each vertex, the VS simply needs to pick out the correct data for that sub-draw from the giant list of data.
This is done via a special vertex shader input: gl_DrawID. When you issue a multi-draw command, the VS gets an input value that represents the index of this sub-draw operation within the multidraw command. So you can use gl_DrawID to index into a table of per-object data to fetch the appropriate data for that particular object.
This also means that the compute shader which generates this sub-draw also needs to use the index of that sub-draw to define where in the array to put the per-object data for that sub-draw. So the CS that writes a sub-draw also needs to be responsible for setting up the per-object data that matches the sub-draw.
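The lockstep between the command array and the per-object array can be sketched on the CPU as follows. This is only an emulation under assumptions: std::atomic stands in for a shader atomic counter, fetchPerObject stands in for the vertex shader indexing the SSBO with gl_DrawID, and Candidate, PerObjectData, DrawLists and emitDraws are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

struct DrawCommand {
    std::uint32_t count, instanceCount, first, baseInstance;
};

// Hypothetical per-object record, the kind of thing stored in the big SSBO.
struct PerObjectData {
    float modelMatrix[16];
    std::uint32_t textureLayer;
};

struct Candidate {
    std::uint32_t first;
    std::uint32_t count;
    bool visible;
    PerObjectData data;
};

struct DrawLists {
    std::vector<DrawCommand> commands;    // indirect command buffer
    std::vector<PerObjectData> perObject; // per-object table
    std::uint32_t drawCount = 0;          // number of slots actually filled
};

// Stand-in for the compute shader: every visible object claims one slot and
// writes BOTH its command and its per-object data at that same slot.
DrawLists emitDraws(const std::vector<Candidate>& candidates) {
    DrawLists out;
    // Unclaimed slots stay zeroed, so their instanceCount of 0 draws nothing.
    out.commands.resize(candidates.size());
    out.perObject.resize(candidates.size());
    std::atomic<std::uint32_t> nextSlot{0};
    for (const Candidate& c : candidates) {
        if (!c.visible) continue;
        const std::uint32_t slot = nextSlot.fetch_add(1);
        out.commands[slot] = {c.count, 1u, c.first, 0u};
        out.perObject[slot] = c.data;
    }
    out.drawCount = nextSlot.load();
    return out;
}

// Stand-in for the vertex shader: the sub-draw index selects the record.
const PerObjectData& fetchPerObject(const DrawLists& lists, std::uint32_t drawId) {
    assert(drawId < lists.drawCount);
    return lists.perObject[drawId];
}
```

The important property is that slot i of one array always describes the same object as slot i of the other, regardless of the order in which objects claimed their slots.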
Textures
OpenGL and Vulkan have pretty strict limits on the number of textures that can be bound. Well actually those limits are quite large relative to traditional rendering, but in GPU driven rendering land, we need a single CPU rendering call to potentially access any texture. That's harder.
Now, we do have gl_DrawID; coupled with the table mentioned above, we can retrieve per-object data. So: how do we convert this to a texture?
There are multiple ways. We could put a bunch of our 2D textures into an array texture. We can then use gl_DrawID to fetch an array index from our SSBO of per-object data; that array index becomes the array layer we use to fetch "our" texture. Note that we don't use gl_DrawID directly because multiple different sub-draws could use the same texture, and because the GPU code that sets up the array of draw calls does not control the order in which textures appear in our array.
Array textures have obvious downsides, the most notable of which is that we must respect the limitations of an array texture. All elements in the array must use the same image format. They must all be of the same size. Also, there are limits on the number of array layers in an array texture, so you might encounter them.
The alternatives to array textures differ along API lines, though they basically boil down to the same thing: convert a number into a texture.
In OpenGL land, you can employ bindless texturing (for hardware that supports it). This system provides a mechanism that allows one to generate a 64-bit integer handle which represents a particular texture, pass this handle to the GPU (since it is just an integer, use whatever mechanism you want), and then convert this 64-bit handle into a sampler type. So you use gl_DrawID to fetch a 64-bit handle from the per-object data, then convert that into a sampler of the appropriate type and use it.
In Vulkan land, you can employ sampler arrays (for hardware that supports it). Note that these are not array textures; in GLSL, these are sampler types which are arrayed: uniform sampler2D my_2D_textures[6000];. In OpenGL, this would be a compile error because each array element represents a distinct bind point for a texture, and you cannot have 6000 distinct bind points. In Vulkan, an arrayed sampler only represents a single descriptor, no matter how many elements are in that array. Vulkan implementations have limits on how many elements there can be in such arrays, but hardware that supports the feature you need to employ this (shaderSampledImageArrayDynamicIndexing) will typically offer a generous limit.
So your shader uses gl_DrawID to get an index from the per-object data. The index is turned into a sampler by just fetching the value from the sampler array. The only limitation for textures in that arrayed descriptor is that they must all be of the same type and basic data format (floating-point 2D for sampler2D, unsigned integer cubemap for usamplerCube, etc). The specifics of formats, texture sizes, mipmap counts, and the like are all irrelevant.
And if you're concerned about the cost difference of Vulkan's array of samplers compared to OpenGL's bindless, don't be; implementations of bindless are just doing this behind your back anyway.

What is the reasoning behind OpenGL texture units as opposed to regular buffers and uniforms?

I am very new to the OpenGL API and just discovered textures and how to use them. The generated texture buffers are not bound in the same way as regular uniforms are bound and instead use glActiveTexture, followed by a bind rather than just supplying the texture to the shaders via glUniform as we do with other constants.
What is the reasoning behind this convention?
The only reason I can think of is to utilize the graphics card's full potential and texture processing capabilities instead of just binding buffers directly. Is this correct reasoning, or is it simply the way the API was implemented?
No reasoning is given on the official wiki, just says that it's weird: "Binding textures for use in OpenGL is a little weird" https://www.khronos.org/opengl/wiki/Texture
Your question can be interpreted in two ways.
"Why do we bind textures to the context rather than to the shader?"
Because that would make it needlessly difficult to have multiple shaders use the same textures. Note that pretty much no graphics API directly attaches the texture to the program. Not D3D of any version, not Metal, nor even Vulkan.
Textures are resources used by shaders. But they are not part of the shader.
"Why do we treat textures differently from a general array of values?"
In modern OpenGL, shaders have access to several different kinds of resources: UBOs, SSBOs, textures, and images. Each of these kinds of resources ultimately represents a potentially distinct part of the graphics hardware.
A storage block is not just a uniform block that can be bigger. Storage buffers represent the shader doing global memory accesses, while uniform blocks are often copied directly into the shader execution units. In the latter case, accessing their data is much faster, but that also means that you're limited by how much storage those execution units can have.
Now, this is not true for all hardware (AMD's GCN hardware treats the two almost identically, which is why their UBO limits are so large). But it is true of much hardware.
Textures are even more complicated, because implementations need to be able to store their data in an optimal way for performance reasons. As such, texture storage formats are opaque. They're even opaque in ostensibly low-level APIs like Vulkan. Oh sure, linear formats exist, but implementations aren't required to let you read from them at all.
So textures are not just constant arrays.
You are comparing two completely different things
A texture object can (somewhat) be compared to a buffer object. The texture is bound to a texture unit by a combination of glActiveTexture and glBindTexture, whereas a buffer is bound by glBindBuffer, which is kind of similar.
A texture sampler uniform is a uniform in the shader and should thus be compared with other uniforms. This sampler is set by a glUniform1i call.

Does using a VAO eliminate the overhead produced by using multiple VBOs?

It is my understanding that having several VBO bind calls is not recommended since it produces overhead from the CPU. However, does binding several VBOs to one VAO, then binding that VAO reduce or completely remove the overhead from binding several VBOs, since you are letting OpenGL rebind those VBOs automatically? This is assuming that the GPU knows what to do and that OpenGL isn't doing it for you on the CPU.
I've searched around and could not come up with any results. I am currently stuck with an OpenGL <2.1 laptop so I cannot really test for myself (yet).
With a well-tuned driver implementation, I would still expect it to be more efficient if all your attributes are in the same VBO, even when using a VAO. derhass already touched on a major aspect. Most GPUs cannot directly access memory using CPU addresses. Buffers have to be mapped into the GPU address space.
So, every time a VAO is bound, the driver has to at least check if those mappings already exist, and map the buffers that are not already mapped. There's also overhead for tracking the buffers to your draw call, to make sure that they remain mapped until the draw call has finished executing. The more buffers you use, the more work is needed for all of this.
Another aspect is the memory access pattern. If your attributes are interleaved in a single buffer, you get much better memory access locality. Say you have positions and normals with 3 floats each, adding up to 24 bytes per vertex. If they are interleaved, the 24 bytes will likely be in the same cache line. If they are in two separate buffers, you could get two cache misses for a single vertex.
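For reference, the interleaved layout from that example could look like the sketch below. The stride and offset values are what you would pass as the stride and pointer arguments of glVertexAttribPointer; InterleavedVertex, AttributeLayout and attributeByteOffset are names I made up, and the sketch assumes 4-byte floats.

```cpp
#include <cassert>
#include <cstddef>

// Position and normal of one vertex stored next to each other.
struct InterleavedVertex {
    float position[3];
    float normal[3];
};
static_assert(sizeof(InterleavedVertex) == 24, "3 + 3 floats of 4 bytes each");

// Hypothetical description of one attribute inside the shared buffer.
struct AttributeLayout {
    std::size_t stride; // bytes from one vertex to the next
    std::size_t offset; // bytes from the start of a vertex to this attribute
};

// Byte position of an attribute of a given vertex within the buffer.
constexpr std::size_t attributeByteOffset(AttributeLayout layout,
                                          std::size_t vertexIndex) {
    return layout.offset + layout.stride * vertexIndex;
}

const AttributeLayout kPosition{sizeof(InterleavedVertex),
                                offsetof(InterleavedVertex, position)};
const AttributeLayout kNormal{sizeof(InterleavedVertex),
                              offsetof(InterleavedVertex, normal)};
```

Both attributes share the same stride and differ only in their offset, so fetching one vertex touches a single contiguous 24-byte range instead of two unrelated addresses.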
On the question if VAOs are more efficient:
Yes, binding a VAO should be more efficient if it's well implemented. For example, without using VAO, you typically have a bunch of glVertexAttribPointer calls when binding a buffer. The state passed in with these calls (size, type, stride, etc) has to be translated into GPU commands. With a VAO, the GPU commands for setting up this state can be rebuilt only when the VAO changes, and then simply reused every time the VAO is bound.
While state setup for plain VBO usage can also be well optimized, it will still take more work in the driver. You also have the overhead for making more API calls.
Such things are completely implementation-specific. It will remove some function calls into the driver, which might already be worth it. But how the implementation handles VAOs internally is not really predictable. The way OpenGL is designed will still lead to lots of validation when a VAO is bound. For example, an implementation will not be able to just cache, at VAO setup time, the memory pointers that are the relevant thing for the GPU, since it is the VBO name that is being referenced. Since you can create a new and totally different buffer storage for the same buffer object, the implementation has to at least make sure that the buffer object is still the same as before. So there is lots of stuff to be done here which is not going to happen on the GPU, and I would not expect too big a decrease in CPU overhead in this situation. However, I haven't profiled it...

OpenGL Rendering Modes

So far I know about immediate, display list, vertex buffer and vertex buffer object rendering. Which is the fastest? Which OpenGL version does each require? What should I use?
The best (and pretty much only) method of rendering now is to use general purpose buffers, AKA Vertex Buffer Objects. They have been in core since OpenGL 1.5, and before that were available as an extension (ARB_vertex_buffer_object). They have hardware support, which means they can be and probably will be stored directly in GPU memory.
When you load data into them, you specify the suggested usage. You can read more about it in the glBufferData manual. For example, GL_STATIC_DRAW is something very similar to a static display list. This allows your graphics card to optimize access to them.
Modern (read: non-ancient) hardware really dislikes immediate mode. I've seen a nearly 2-order-of-magnitude performance improvement by replacing immediate mode with vertex arrays.
In the core profile of OpenGL 3 and above, only buffer objects are supported; all other rendering modes are deprecated.
Display lists are a serious pain to use correctly, and not worth it on non-ancient hardware.
To summarize: if you use OpenGL 3+, you have to use (V)BOs. If you target OpenGL 2, use vertex arrays or VBOs as appropriate.