Binding textures in OpenGL

So I'm rendering my scene in batches, to try and minimize state changes.
Since my shaders usually need multiple textures, I have a few of them bound at the same time to different texture units. Some of these textures are used in multiple batches, so could even stay bound.
Now my question is, is it fine to just re-bind all of the textures that I need during a batch, even though some of them may already be bound? Or should I check which ones are bound, and only bind new ones? How expensive is glBindTexture? I'm using shaders, is it bad to have unused textures bound to texture units that the shader won't sample from, or should I unbind them?
I'm mostly asking "how to make it fast", not "how to".
EDIT: Would it help if I supplied code?

The holy rule of GPU programming is that the less you talk to the GPU, the more it loves you.
glBindTexture is much more expensive than a single condition check on the CPU, so for frequent state changes I think you should check whether you actually need to hand a new texture binding to the GPU before making the call.
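For example, a thin wrapper can remember what is bound to each unit and skip redundant calls. This is only a sketch, under the assumption that nothing else binds textures behind its back; bind_texture_cached, bound_tex and MAX_UNITS are made-up names, not part of GL:

    #define MAX_UNITS 16

    static GLuint bound_tex[MAX_UNITS];   /* 0 = nothing bound yet */

    void bind_texture_cached(GLuint unit, GLuint tex)
    {
        if (bound_tex[unit] == tex)
            return;                        /* already bound: skip the GL calls */
        glActiveTexture(GL_TEXTURE0 + unit);
        glBindTexture(GL_TEXTURE_2D, tex);
        bound_tex[unit] = tex;
    }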

The answer is difficult, because any decent GL implementation should already make binding an already-bound texture a no-op as an optimization. You should benchmark to see whether adding the check on your side makes a difference or not.
The same applies to unused textures in texture units: since the GL implementation knows which texture units the shader actually samples, having textures left bound to unused units should not affect performance (again, as an optimization).

Track which textures you've bound (don't use glGet(GL_TEXTURE_BINDING_2D)) and only rebind if it changes. Some drivers will do this for you, some won't.
If you know what platform you're targeting in advance, it's pretty easy to do a simple benchmark to see how much difference it makes.
You can also sort your batches by what textures they are using to minimize texture changes.
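A rough sketch of that idea, assuming a hypothetical Batch struct that records which texture a batch samples; sorting by the GL texture name keeps draws that share a texture adjacent, so each texture only has to be bound once:

    #include <stdlib.h>  /* qsort */

    typedef struct {
        GLuint texture;
        /* ...other per-batch draw data... */
    } Batch;

    static int by_texture(const void *a, const void *b)
    {
        GLuint ta = ((const Batch *)a)->texture;
        GLuint tb = ((const Batch *)b)->texture;
        return (ta > tb) - (ta < tb);
    }

    /* before rendering: qsort(batches, batch_count, sizeof(Batch), by_texture); */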

I'd say the best way is to bind all the textures at once using glBindTextures (https://www.khronos.org/opengl/wiki/GLAPI/glBindTextures). Or use bindless textures :)
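A minimal sketch, assuming OpenGL 4.4 or ARB_multi_bind is available; the texture names are placeholders:

    /* Bind three textures to texture units 0, 1 and 2 in a single call.
       Each texture is bound to the target it was created with. */
    GLuint textures[3] = { albedo_tex, normal_tex, shadow_tex };
    glBindTextures(0, 3, textures);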

Related

Performance gain of glColorMask()/glDepthMask() on modern hardware?

In my application I have some shaders which write only to the depth buffer, to use it later for shadowing. I also have some other shaders which render a fullscreen quad whose depth will not affect any subsequent draw calls, so its depth values may be thrown away.
Assuming the application runs on modern hardware (produced within the last five years), will I gain any additional performance if I disable color buffer writing (glColorMask with everything set to GL_FALSE) for the shadow map shaders, and depth buffer writing (with glDepthMask()) for the fullscreen quad shaders?
In other words, do these functions really disable some memory operations, or do they just alter some mask bits that feed fixed bitwise logic in that part of the rendering pipeline?
And the same question about testing. If I know beforehand that all fragments will pass depth test, will disabling depth test improve performance?
My FPS measurements don't show any significant difference, but the result may be different on another machine.
Finally, if rendering runs faster with the depth/color test/write disabled, how much faster does it run? Wouldn't this performance gain be negated by the overhead of the extra GL function calls?
Your question is missing a very important thing: you have to do something.
Every fragment has color and depth values. Even if your FS doesn't generate a value, there will still be a value there. Therefore, every fragment produced that is not discarded will write these values, so long as:
1. The color is routed to a color buffer via glDrawBuffers.
2. There is an appropriate color/depth buffer attached to the FBO.
3. The color/depth write mask allows the value to be written.
So if you're rendering and you don't want to write to one of those color buffers or to the depth buffer, you've got to change one of these. Changing #1 or #2 is an FBO state change, which is among the most heavyweight operations you can do in OpenGL. Therefore, your choices are to make an FBO change or to change the write mask. The latter will always be the more performance-friendly operation.
Maybe in your case, your application doesn't stress the GPU or CPU enough for such a change to matter. But in general, changing write masks is a better idea than playing with the FBO.
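A short sketch of the write-mask approach for the two passes described in the question (the surrounding draw code is omitted):

    /* Shadow map pass: keep depth writes, turn off all color writes. */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    /* ... render shadow casters ... */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

    /* Fullscreen quad whose depth values should be thrown away. */
    glDepthMask(GL_FALSE);
    /* ... render the fullscreen quad ... */
    glDepthMask(GL_TRUE);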
If I know beforehand that all fragments will pass depth test, will disabling depth test improve performance?
Are you changing other state at the same time, or is that the only state you're interested in?
One good way to look at these kinds of a priori performance questions is to look at Vulkan or D3D12 and see what it would require in that API. Changing any pipeline state there is a big deal. But changing two pieces of state is no bigger of a deal than one.
So if changing the depth test correlates with changing other state (blend modes, shaders, etc), it's probably not going to hurt any more.
At the same time, if you really care enough about performance for this sort of thing to matter, you should do application testing. And that should happen after you implement this, and across all hardware of interest. And your code should be flexible enough to easily switch from one to the other as needed.

What is the reasoning behind OpenGL texture units as opposed to regular buffers and uniforms?

I am very new to the OpenGL API and just discovered textures and how to use them. The generated texture objects are not bound in the same way as regular uniforms: instead of just supplying the texture to the shader via glUniform, as we do with other constants, we use glActiveTexture followed by a bind.
What is the reasoning behind this convention?
The only reason I can think of is to utilize the graphics card's full potential and texture processing capabilities instead of just binding buffers directly. Is this correct reasoning, or is it simply the way the API was implemented?
No reasoning is given on the official wiki; it just says that it's weird: "Binding textures for use in OpenGL is a little weird" https://www.khronos.org/opengl/wiki/Texture
Your question can be interpreted in two ways.
"Why do we bind textures to the context rather than to the shader?"
Because that would make it needlessly difficult to have multiple shaders use the same textures. Note that pretty much no graphics API directly attaches the texture to the program. Not D3D of any version, not Metal, nor even Vulkan.
Textures are resources used by shaders. But they are not part of the shader.
"Why do we treat textures differently from a general array of values?"
In modern OpenGL, shaders have access to several different kinds of resources: UBOs, SSBOs, textures, and images. Each of these kinds of resources ultimately represents a potentially distinct part of the graphics hardware.
A storage block is not just a uniform block that can be bigger. Storage buffers represent the shader doing global memory accesses, while uniform blocks are often copied directly into the shader execution units. In the latter case, accessing their data is much faster, but that also means that you're limited by how much storage those execution units can have.
Now, this is not true for all hardware (AMD's GCN hardware treats the two almost identically, which is why their UBO limits are so large). But it is true of much hardware.
Textures are even more complicated, because implementations need to be able to store their data in an optimal way for performance reasons. As such, texture storage formats are opaque. They're even opaque in ostensibly low-level APIs like Vulkan. Oh sure, linear formats exist, but implementations aren't required to let you read from them at all.
So textures are not just constant arrays.
You are comparing two completely different things
A texture object can (loosely) be compared to a buffer object. The texture is bound to a texture unit by a combination of glActiveTexture and glBindTexture, whereas a buffer is bound by glBindBuffer, which is kind of similar.
A texture sampler uniform is a uniform in the shader and should thus be compared with other uniforms. This sampler is set by a glUniform1i call.
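A small sketch of that two-sided setup; "u_diffuse", program and tex are placeholder names:

    /* The sampler uniform holds a texture *unit* index, not a texture name. */
    glUseProgram(program);
    glUniform1i(glGetUniformLocation(program, "u_diffuse"), 0); /* sample unit 0 */

    /* The texture object is bound to that unit on the context side. */
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, tex);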

Does using a VAO eliminate the overhead produced by using multiple VBOs?

It is my understanding that having several VBO bind calls is not recommended since it produces CPU overhead. However, does binding several VBOs to one VAO, then binding that VAO, reduce or completely remove the overhead of binding several VBOs, since you are letting OpenGL rebind those VBOs automatically? This is assuming that the GPU knows what to do and that OpenGL isn't doing it for you on the CPU.
I've searched around and I could not come up with any results. I am currently stuck with an OpenGL <2.1 laptop so I cannot really test for myself (yet).
With a well tuned driver implementation, I would still expect it to be more efficient if all your attributes are in the same VBO, even when using VAO. derhass already touched on a major aspect. Most GPUs cannot directly access memory using CPU addresses. Buffers have to be mapped into the GPU address space.
So, every time a VAO is bound, the driver has to at least check if those mappings already exist, and map the buffers that are not already mapped. There's also overhead for tracking the buffers used by your draw call, to make sure that they remain mapped until the draw call has finished executing. The more buffers you use, the more work is needed for all of this.
Another aspect is memory access patterns. If your attributes are interleaved in a single buffer, you get much better memory access locality. Say you have positions and normals with 3 floats each, adding up to 24 bytes per vertex. If they are interleaved, the 24 bytes will likely be in the same cache line. If they are in two separate buffers, you could get two cache misses for a single vertex.
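As an illustration of the interleaved case (a sketch; the struct and variable names are made up, and the VBO is assumed to already contain an array of these vertices, with the attribute arrays enabled elsewhere):

    #include <stddef.h>  /* offsetof */

    typedef struct {
        float pos[3];    /* attribute 0 */
        float nrm[3];    /* attribute 1 */
    } Vertex;            /* 24 bytes per vertex, both attributes adjacent */

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const void *)offsetof(Vertex, pos));
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const void *)offsetof(Vertex, nrm));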
On the question if VAOs are more efficient:
Yes, binding a VAO should be more efficient if it's well implemented. For example, without using VAO, you typically have a bunch of glVertexAttribPointer calls when binding a buffer. The state passed in with these calls (size, type, stride, etc) has to be translated into GPU commands. With a VAO, the GPU commands for setting up this state can be rebuilt only when the VAO changes, and then simply reused every time the VAO is bound.
While state setup for plain VBO usage can also be well optimized, it will still take more work in the driver. You also have the overhead for making more API calls.
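A sketch of what that looks like in practice (vao, vbo, ebo and index_count are placeholder names); the attribute setup is recorded once, and a single bind replays it at draw time:

    /* One-time setup: record vertex attribute state in the VAO. */
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 24, (const void *)0);
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 24, (const void *)12);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo); /* the index buffer is VAO state */
    glBindVertexArray(0);

    /* Per draw: one bind instead of re-specifying every pointer. */
    glBindVertexArray(vao);
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, (const void *)0);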
Such things are completely implementation-specific. It will reduce some function calls into the driver, which might already be worth it. But how the implementation handles VAOs internally is not really predictable. The way OpenGL is designed will still lead to lots of validation when a VAO is bound: for example, an implementation cannot just cache the memory pointers (which are the relevant thing for the GPU) at VAO setup time, since the VAO references the VBO by name. Because you can create a new and totally different buffer storage for the same buffer object, the implementation has to at least make sure that the buffer object is still the same as before. So there is lots of stuff to be done here which is not going to happen on the GPU, and I would not expect too big a decrease in CPU overhead in this situation. However, I haven't profiled it...

Should I sort by buffer use when rendering?

I'm designing the sorting part of my rendering engine. I know that changing the render target, shader program, texture bindings, and more are expensive and therefore one should sort the draw order based on them to reduce state changes. However, what about sorting based on what index buffer is bound, and which vertex buffers are used for attributes?
I'm confused about these because VAOs are mandatory and they encapsulate all of that state. So should I peek behind the scenes of vertex array objects (VAOs), see what state they set and sort based on it? Or should I just not care in what order VAOs are called?
This is what confuses me so much about vertex array objects. It makes sense to me not to switch which buffers are in use over and over, and yet VAOs just seem to force one not to care about that.
Is there a general, even vague or not fully agreed-upon, order in which to sort things for rendering/game engines?
I know that binding a buffer simply changes some global state, but surely it must be beneficial to the hardware to draw from the same buffer multiple times, maybe through some small cache-coherency gain?
While VAOs are mandated in GL 3.1 without GL_ARB_compatibility or core 3.2+, you do not have to use them the way they are intended... that is to say, you can bind a single VAO for the duration of your application's lifetime and continue to bind and unbind VBOs, etc. the traditional way if this somehow makes your life easier. Valve is famous for advocating doing this in their presentation on porting the Source engine from D3D to GL... I tend to disagree with them on some points though. A lot of things that they mention in their presentation make me cringe as someone who has years of experience with both D3D and OpenGL; they are making suggestions on how to port something to an API they have a minimal working knowledge of.
Getting back to your performance concern though, there can be validation overhead for changing bound resources frequently, so it is actually more than just "simply changing a global state." All GL commands have to do validation in order to determine if they need to set an error state. They will validate your input parameters (which is pretty trivial), as well as the state of any resource the command needs to use (this can be complicated).
Other types of GL objects like FBOs, textures and GLSL programs have more rigorous validation and more complicated memory dependencies than buffer objects and vertex arrays do. Swapping a vertex pointer should be cheaper in the grand scheme of things than most other kinds of object bindings, especially since a lot of stuff can be deferred by an implementation until you actually issue a glDrawElements (...) command.
Nevertheless, the best way to tackle this problem is just to increase reuse of vertex buffers. Object reuse is pretty high to begin with for vertex buffers: if you have 200 instances of the same opaque model in a scene, you can potentially draw all 200 of them back-to-back and never have to change a vertex pointer. Materials tend to change far more frequently than actual vertex buffers, and so you would generally sort your drawing first and foremost by material (sub-sorted by associated states like opaque/translucent, texture(s), shader(s), etc.). You can add another level to batch sorting to draw all batches that share the same vertex data after they have been sorted by material. The ultimate goal is usually to minimize the number of draw commands necessary to complete your frame, and using priority/hierarchy-based sorting with emphasis on material often delivers the best results.
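One common way to implement that priority sort is to pack the states into a single integer key, with the most expensive state in the highest bits. This is only a sketch; the field widths and names are made up:

    #include <stdint.h>

    typedef struct {
        uint64_t key;
        /* ...handles to the actual draw data... */
    } DrawCmd;

    /* Translucency first, then shader, then texture, then vertex buffer. */
    uint64_t make_sort_key(unsigned translucent, unsigned shader,
                           unsigned texture, unsigned vbo)
    {
        return ((uint64_t)(translucent & 1)   << 63) |
               ((uint64_t)(shader  & 0xFFFF)  << 40) |
               ((uint64_t)(texture & 0xFFFF)  << 24) |
               ((uint64_t)(vbo & 0xFFFFFF));
    }

    /* Sort DrawCmds by key, then walk the list and only change the state
       whose bits differ from the previous command. */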
Furthermore, if you can fit multiple LODs of your model into a single vertex buffer, instead of swapping between different vertex buffers sometimes you can just draw different sets of indices or even just a different range of indices from a single index buffer. In a very similar way, texture swapping pressure can be alleviated by using packed texture atlases / sprite sheets instead of a single texture object for each texture.
You can definitely squeeze out some performance by reducing the number of changes to vertex array state, but the takeaway message here is that vertex array state is pretty cheap compared to a lot of other states that change frequently. If you can quickly implement a secondary sort to reduce vertex state changes then go for it, but I would not invest a lot of time in anything more sophisticated unless you know it is a bottleneck. Prioritize texture, shader and framebuffer state first as a general rule.

OpenGL: Using only one framebuffer and switching target textures

Instead of using multiple framebuffer objects, can I also create only one and achieve the same results by switching its target texture when needed?
Is this a bad idea in all cases? If yes, why?
I've been implementing a function render.SetTargetTexture() in my program's API, and logically it wouldn't work if there were more framebuffers used behind the scenes. I'd have to fully expose framebuffers then.
An FBO itself is just a logical construct, maintained by the implementation, and it will consume only the little memory that its parameters require. The main purpose of an FBO is to have an object that maintains a consistent state.
Whenever you change an FBO's structure, the implementation must check it for validity and completeness. The OpenGL specification doesn't say anything about the complexity of this operation, but it's safe to assume that changing an FBO's structure is among the most time-consuming operations you can do (maybe by a large margin).
Since the FBO itself does not consume noteworthy memory (only the attachments take memory, and those are independent objects), allocating multiple FBOs and switching between them is what I recommend.
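A sketch of that approach, with placeholder names (fbo_shadow, fbo_scene, shadow_depth_tex): each FBO gets its attachments and completeness check once at setup, and per frame you only switch bindings:

    /* One-time setup for the shadow FBO. */
    glGenFramebuffers(1, &fbo_shadow);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo_shadow);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                           GL_TEXTURE_2D, shadow_depth_tex, 0);
    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
        /* handle the error once, here, instead of re-validating every frame */
    }
    /* ...set up fbo_scene the same way with its own attachments... */

    /* Per frame: no attachment changes, only binds. */
    glBindFramebuffer(GL_FRAMEBUFFER, fbo_shadow);
    /* ... render the shadow pass ... */
    glBindFramebuffer(GL_FRAMEBUFFER, fbo_scene);
    /* ... render the scene ... */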
Switching framebuffers is usually faster for the reasons people have given. In my experience, there are also problems when attaching different texture types to FBOs (for example, attaching a 2D texture to an FBO which previously had a 1D texture attached: the first texture attached to the FBO set some kind of target state which could not be altered later). A solution I used was to have one FBO per target type. This error occurred on NVIDIA Quadro devices; I'm not sure whether it's a driver issue.