Which is more efficient between texelfetch() and imageLoad() in OpenGL? - opengl

If I just need to read data from BO,I have two methods.First is to load data to buffer textures and then read data by texelfetch() in shaders. Second,load data to buffer textures and bind buffer textures to image unit, then I can read data by imageLoad().I want to know which is more efficient if there is great deal of data?

It really depends on the specific case and the hardware being used. In general, imageLoad() will be more efficient than texelfetch() if there is a great deal of data. However, the performance difference between the two methods may be negligible in many cases. It's best to benchmark both methods on the hardware you are using to get an accurate comparison.

Reading data from a buffer texture using texelfetch() in a shader is generally more efficient than binding the buffer texture to an image unit and using imageLoad() to read the data. This is because texelfetch() is designed specifically for reading texture data and can be optimized for this purpose, while imageLoad() is more general-purpose and may not be as efficient when used to read data from a buffer texture. The choice between texelfetch() and imageLoad() will depend on your specific use case and how you want to access the data in the buffer texture.

Related

Vulkan storage buffer vs image sampler

I am currently building an application in vulkan where I will be sampling a lot of data from a buffer. I will be using as much storage as possible, but sampling speed is also important. My data is in the form of a 2D array of 32 bit integers. I can either upload it as a texture and use a texture sampler for it, or as a storage buffer. I read that storage buffers are generally slow, so I was considering using the image sampler to read my data in a fragment shader. I would have to disable mipmapping and filtering, and convert UV coordinates to array indices, but if it's faster I think it might be worth it.
My question is, would it generally be worth it to store my data in an image sampler, or should I do the obvious and use a storage buffer? What are the pros/cons of each approach?
Guarantees about performance do not exist.
But Vulkan API tries not to decieve you. The obvious way is likely the right way.
If you want to sample then sample. If you want to do raw access then obviously do raw access. Generally, you should not be forcefully trying to put a square in a round hole.

What is GPU driven rendering? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
Nowadays I'm hearing from different places about the so called GPU driven rendering which is a new paradigm of rendering which doesn't require draw calls at all, and that it is supported by the new versions of OpenGL and Vulkan APIs. Can someone explain how it actually works on conceptual level and what are the main differences with the traditional approach?
Overview
In order to render a scene, a number of things have to happen. You need to walk your scene graph to figure out which objects exist. For each object which exists, you now need to determine if it is visible. For each object which is visible, you need to figure out where its geometry is stored, which textures and buffers will be used to render that object, which shaders to use to render the object, and so forth. Then you render that object.
The "traditional" method handling this is for the CPU to handle this process. The scene graph lives in CPU-accessible memory. The CPU does visibility culling on that scene graph. The CPU takes the visible objects and access some CPU data about the geometry (OpenGL buffer object and texture names, Vulkan descriptor sets and VkBuffers, etc), shaders, etc, transferring this as state data to the GPU. Then the CPU issues a GPU command to render that object with that state.
Now, if we go back farther, the most "traditional" method doesn't involve a GPU at all. The CPU would just take this mesh and texture data, do vertex transformations, rasterizatization, and so forth, producing an image in CPU memory. However, we started off-loading some of this to a separate processor. We started with the rasterization stuff (the earliest graphics chips were just rasterizers; the CPU did all the vertex T&L). Then we incorporated the vertex transformations into the GPU. When we did that, we started having to store vertex data in GPU accessible memory so the GPU could read it on its own time.
We did all of that, off-loading these things to a separate processor for two reasons: the GPU was (much) faster at it, and the CPU can now spend its time doing something else.
GPU driven rendering is just the next stage in that process. We went from no GPU, to rasterization GPU, to vertex GPU, and now to scene-graph-level GPU. The "traditional" method offloads how to render to the GPU; GPU driven rendering offloads the decision of what to render.
Mechanism
Now, the reason we haven't been doing this all along is because the basic rendering commands all take data that comes from the CPU. glDrawArrays/Elements takes a number of parameters from the CPU. So even if we used the GPU to generate that data, we would need a full GPU/CPU synchronization so that the CPU could read the data... and give it right back to the GPU.
That's not helpful.
OpenGL 4 gave us indirect rendering of various forms. The basic idea is that, instead of taking those parameters from a function call, they're just data stored in GPU memory. The CPU still has to make a function call to start the rendering operation, but the actual parameters to that call are just data stored in GPU memory.
The other half of that requires the ability of the GPU to write data to GPU memory in a format that indirect rendering can read. Historically, data on GPUs goes in one direction: data gets read for the purpose of being converted into pixels in a render target. We need a way to generate semi-arbitrary data from other arbitrary data, all on the GPU.
The older mechanism for this was to (ab)use transform feedback for this purpose, but nowadays we just use SSBOs or failing that, image load/store. Compute shaders help here as well, since they are designed to be outside of the standard rendering pipeline and therefore are not bound to its limitations.
The ideal form of GPU-driven rendering makes the scene-graph part of the rendering operation. There are lesser forms, such as having the GPU do nothing more than per-object viewport culling. But let's look at the most ideal process. From the perspective of the CPU, this looks like:
Update the scene graph in GPU memory.
Issue one or more compute shaders that generate multi-draw indirect rendering commands.
Issue a single multi-draw indirect call that draws everything.
Now of course, there's no such thing as a free lunch. Doing full scene graph processing on the GPU requires building your scene graph in a way that is efficient for GPU processing. Even more importantly, visibility culling mechanisms have to be engineered with efficient GPU processing in mind. That's complexity I'm not going to address here.
Implementation
Instead, let's look at the nuts-and-bolts of making the drawing part work. We have to sort out a lot of things here.
See, the indirect rendering command is still a regular old rendering command. While the multi-draw form draws multiple distinct "objects", it's still one CPU rendering command. This means that, for the duration of this command, all rendering state is fixed.
So everything under the purview of this multi-draw operation must use the same shader, bound buffers&textures, blending parameters, stencil state, and so forth. This makes implementing a GPU-driven rendering operation a bit complicated.
State and Shaders
If you need blending, or similar state-based differences in rendering operations, then you are going to have to issue another rendering command. So in the blending case, your scene-graph processing is going to have to compute multiple sets of rendering commands, with each set being for a specific set of blending modes. You may also need to have this system sort transparent objects (unless you're rendering them with an OIT mechanism). So instead of having just one rendering command, you have a small number of them.
But the point of this exercise however isn't to have only one rendering command; the point is that the number of CPU rendering commands does not change with regard to how much stuff you're rendering. It shouldn't matter how many objects are in the scene; the CPU will be issuing the same number of rendering commands.
When it comes to shaders, this technique requires some degree of "ubershader" style: where you have a very few number of rather flexible shaders. You want to parameterize your shader rather than having dozens or hundreds of them.
However things were probably going to fall out that way anyway, particularly with regard to deferred rendering. The geometry pass of deferred renderers tends to use the same kind of processing, since they're just doing vertex transformation and extracting material parameters. The biggest difference usually is with regard to doing skinned vs. non-skinned rendering, but that's really only 2 shader variations. Which you can handle similarly to the blending case.
Speaking of deferred rendering, the GPU driven processes can also walk the graph of lights, thus generating the draw calls and rendering data for the lighting passes. So while the lighting pass will need a separate draw call, it will still only need a single multidraw call regardless of the number of lights.
Buffers
Here's where things start to get interesting. See, if the GPU is processing the scene graph, that means that the GPU needs to somehow associate a particular draw within the multi-draw command with the resources that particular draw needs. It may also need to put the data into those resources, like the matrix transforms for a given object and so forth.
Oh, and you also somehow need to tie the vertex input data to a particular sub-draw.
That last part is probably the most complicated. The buffers which OpenGL/Vulkan's standard vertex input method pull from are state data; they cannot change between sub-draws of a multi-draw operation.
Your best bet is to try to put every object's data in the same buffer object, using the same vertex format. Essentially, you have one gigantic array of vertex data. You can then use the drawing parameters for the sub-draw to select which parts of the buffer(s) to use.
But what do we do about per-object data (matrices, etc), things you would typically use a UBO or global uniform for? How do you effectively change the buffer binding state within a CPU rendering command?
Well... you can't. So you cheat.
First, you realize that SSBOs can be arbitrarily large. So you don't really need to change buffer binding state. What you need is a single SSBO that contains everyone's per-object data. For each vertex, the VS simply needs to pick out the correct data for that sub-draw from the giant list of data.
This is done via a special vertex shader input: gl_DrawID. When you issue a multi-draw command, the VS gets an input value that represents the index of this sub-draw operation within the multidraw command. So you can use gl_DrawID to index into a table of per-object data to fetch the appropriate data for that particular object.
This also means that the compute shader which generates this sub-draw also needs use the index of that sub-draw to define where in the array to put the per-object data for that sub-draw. So the CS that writes a sub-draw also needs to be responsible for setting up the per-object data that matches the sub-draw.
Textures
OpenGL and Vulkan have pretty strict limits on the number of textures that can be bound. Well actually those limits are quite large relative to traditional rendering, but in GPU driven rendering land, we need a single CPU rendering call to potentially access any texture. That's harder.
Now, we do have gl_DrawID; coupled with the table mentioned above, we can retrieve per-object data. So: how do we convert this to a texture?
There are multiple ways. We could put a bunch of our 2D textures into an array texture. We can then use gl_DrawID to fetch an array index from our SSBO of per-object data; that array index becomes the array layer we use to fetch "our" texture. Note that we don't use gl_DrawID directly because multiple different sub-draws could use the same texture, and because the GPU code that sets up the array of draw calls does not control the order in which textures appear in our array.
Array textures have obvious downsides, the most notable of which is that we must respect the limitations of an array texture. All elements in the array must use the same image format. They must all be of the same size. Also, there are limits on the number of array layers in an array texture, so you might encounter them.
The alternatives to array textures differ along API lines, though they basically boil down to the same thing: convert a number into a texture.
In OpenGL land, you can employ bindless texturing (for hardware that supports it). This system provides a mechanism that allows one to generate a 64-bit integer handle which represents a particular texture, pass this handle to the GPU (since it is just an integer, use whatever mechanism you want), and then convert this 64-bit handle into a sampler type. So you use gl_DrawID to fetch a 64-bit handle from the per-object data, then convert that into a sampler of the appropriate type and use it.
In Vulkan land, you can employ sampler arrays (for hardware that supports it). Note that these are not array textures; in GLSL, these are sampler types which are arrayed: uniform sampler2D my_2D_textures[6000];. In OpenGL, this would be a compile error because each array element represents a distinct bind point for a texture, and you cannot have 6000 distinct bind points. In Vulkan, an arrayed sampler only represents a single descriptor, no matter how many elements are in that array. Vulkan implementations have limits on how many elements there can be in such arrays, but hardware that supports the feature you need to employ this (shaderSampledImageArrayDynamicIndexing) will typically offer a generous limit.
So your shader uses gl_DrawID to get an index from the per-object data. The index is turned into a sampler by just fetching the value from the sampler array. The only limitation for textures in that arrayed descriptor is that they must all be of the same type and basic data format (floating-point 2D for sampler2D, unsigned integer cubemap for usamplerCube, etc). The specifics of formats, texture sizes, mipmap counts, and the like are all irrelevant.
And if you're concerned about the cost difference of Vulkan's array of samplers compared to OpenGL's bindless, don't be; implementations of bindless are just doing this behind your back anyway.

What's the most efficient way to write 1 MB from a compute shader and read it in a fragment shader?

In OpenGL, there are so many different kinds of storage options: buffers, textures, framebuffers, etc. in OpenGL it seems very difficult to figure out which to use when.
I want to write a compute buffer that on each frame writes up to around 1 MB to some type of storage that is then read (read-only, random access) by a fragment shader. What is the most efficient mechanism to write that data from the compute shader so it's accessible in the fragment shader?
The data is made up of custom data structures: it is not a specific rendering type of data such as a texture.
Generally there are multiple ways you could handle data transfer across pipeline.
1. UBO
Use when you have fixed amount of data large enough or you can group individual uniforms go with UBO. They are faster that SSBO but have size limitation and cant have variable size.
I see some vendors are even supporting UBO of size 2GB. But I am not sure if that still hold performance benefit of older UBOs. You can query though the limit on UBO size and decide for your self.
2. SSBO
When you have really large data then generally prefer SSBO. They are slower than UBO but can hold much larger data. Internally they are implemented using texture buffer memory.
Generally I prefer limit of 64 K data in UBO and anything larger than that in SSBO.
Another possible way to communicate is to use Transform Feedback. But I am not sure that will be efficient. But considering case when data computed by the compute shader will be processed by application before sending it to fragment shader. This case Transform feedback could be ideal.

One big OpenGL vertex buffer, or many small ones?

Let's say I have 5 entities (objects) with a method Render(). Every entity needs to set its own vertices in a buffer for rendering.
Which of the following two options is better?
Use one big pre-allocated buffer created with glGenBuffer, which every entity will use (id of buffer passed as argument to Render methods) by writing its vertices to the buffer with glBufferSubData.
Every entity creates and uses its own buffer.
If one big buffer is better, how can I render all vertices in this buffer (from all entities) properly, with proper shaders and everything?
Having multiple VBOs is fine as long as they have a certain size. What you want to avoid is to have a lot of small draw calls, and to have to bind different buffers very frequently.
How large the buffers have to be to avoid excessive overhead depends on so many factors that it's almost impossible to even give a rule of thumb. Factors that come into play include:
Hardware performance characteristics.
Driver efficiency.
Number of vertices relative to number of fragments (triangle size).
Complexity of shaders.
Generally it can make sense to keep similar/related objects that you typically draw at the same time in a single vertex buffer.
Putting everything in a single buffer seems extreme, and could in fact have adverse effects. Say you have a large "world", where you only render a small subset in any given frame. If you go to the extreme, an have all vertices in one giant buffer, that buffer needs to be accessible to the GPU for each draw call. Depending on the architecture, and how the buffer is allocated, this could mean:
Attempting to keep the buffer in dedicate GPU memory (e.g. VRAM), which could be problematic if it's too large.
Mapping the memory into the GPU address space.
Pinning/wiring the memory.
If any of the above needs to be applied to a very large buffer, but you end up using only a small fraction of it to render a frame, there is significant waste in these operations. In a system with VRAM, it could also prevent other allocations, like textures, to fit in VRAM.
If rendering is done with calls that can only access a subset of the buffer given by the arguments, like glDrawArrays() or glDrawRangeElements(), it might be possible for the driver to avoid making the whole buffer GPU accessible. But I wouldn't necessarily count on that happening.
It's easier to use one VBO (Vertex Buffer Object) with glGenBuffer for each entity you have but it's not always the best things to do, this depend on the use. But, in most cases, this is not a problem to have 1 VBO for each entity and the rendering is rarely affected.
Good info is located at: OpenGL Vertex Specification Best Practices

Framebuffer Texture behavior in OpenGL/OpenGLES

In OpenGL/ES you have to be careful to not cause a feedback loop (reading pixels from the same texture you are writing to) when implementing render to texture functionality. For obvious reasons the behavior is undefined when you are reading and writing to the same pixels of a texture. However, is it also undefined behavior if you are reading and writing to different pixels of the same texture? An example would be if I was trying to make a texture atlas with a render texture inside. While I am rendering to the texture, I read the pixels from another texture stored in the texture atlas.
As I am not reading and writing the same pixels in the texture is the behavior still considered undefined, just because the data is coming from the same texture?
However, is it also undefined behavior if you are reading and writing to different pixels of the same texture?
Yes.
Caching is the big problem here. When you write pixel data, it is not necessarily written to the image immediately. The write is stored in a cache, so that multiple pixels can be written all at once.
Texture accesses do the same thing. The problem is that they don't have the same cache. So you can have written some data that is in the write cache, but the texture cache doesn't know about it.
Now, the specification is a bit heavy-handed here. It is theoretically possible that you can read from one area of a texture and write to another (but undefined by the spec), so long as you never read from any location you've written to, and vice versa. Obviously, that's not very helpful.
The NV_texture_barrier extension allows you to get around this. Despite being an NVIDIA extension, it is supported on ATI hardware too. The way it works is that you call the glTextureBarrierNV function when you want to flush all of the caches. That way, you can be sure that when you read from a location, you have written to it.
So the idea is that you designate one area of the texture as the write area, and another as the read area. After you have rendered some stuff, and you need to do readback, you fire off a barrier and swap texture areas. It's like texture ping-ponging, but without the heavy operation of attaching a new texture or binding an FBO, or changing the drawbuffers.
The problem is not so much the possibility of feedback loops (technically this would not result in a loop, but an undefined order in which pixels are read/written causing an undefineable behaviour), but the limitations of the access modes GPUs implement: A buffer can either only be read from or written to at any given time (gather vs. scatter access). And the GPU always sees a buffer as a whole. This is the main reason for that limitation.
Yes it still is, GPUs are massively parallel so you can't really say that you write 'one' pixel at a time, there are also cache systems that are populated when you ready a texture. If you write to the same texture, the cache would need to be synchronized, and so on.
For some insight, you can take a look at the NV_texture_barrier OpenGL extension, that is meant to add some flexibility is this area.
Yes, it's also undefined to read/write different areas of the texture. But why care if it's undefined or not, just write into another texture and avoid the problem altogether!