Using Texture1D instead of uniforms - opengl

I'm using a color palette of 5 colors in my game, and at the moment I pass each color to the program as a separate uniform vec3. Would it be more efficient to use a one-dimensional texture that contains all 5 colors (15 floats in the texture)?
That's just one of the situations where I would like to do this kind of thing. Another would be to send all the matrices/variables to the shader program at once. It seems a little inefficient to send every variable one at a time, every time I want to render. Would it be better to group them all in a single texture and send them all at once?
Is there maybe another, even more efficient way of doing what I'm trying to do?

Would it be more efficient to use a one-dimensional texture that contains all 5 colors (15 floats in the texture)?
No. Texture reads are most likely going to be slower than uniform reads, although in your case it shouldn't make much difference.
Is there maybe another, even more efficient way of doing what I'm trying to do?
Well, if the colors are always the same, as you say, then just put them in the shaders as constants.
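For instance, a minimal sketch of what that could look like, assuming a GL 3.3 core profile; the color values and the vColorIndex input are illustrative placeholders, not taken from your code:

// Hedged sketch: the 5 palette colors baked into the fragment shader as
// compile-time constants. Everything here is illustrative.
const char* paletteFragmentSource = R"GLSL(
    #version 330 core

    const vec3 PALETTE[5] = vec3[5](
        vec3(0.95, 0.26, 0.21),   // placeholder colors
        vec3(0.13, 0.59, 0.95),
        vec3(0.30, 0.69, 0.31),
        vec3(1.00, 0.76, 0.03),
        vec3(0.61, 0.15, 0.69)
    );

    flat in int vColorIndex;      // which palette entry this fragment uses
    out vec4 fragColor;

    void main() {
        fragColor = vec4(PALETTE[vColorIndex], 1.0);
    }
)GLSL";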
That's just one of the situations where I would like to do this kind of thing. Another would be to send all the matrices/variables to the shader program at once. It seems a little inefficient to send every variable one at a time, every time I want to render. Would it be better to group them all in a single texture and send them all at once?
Modifying textures is not going to be any faster than setting a few uniforms.
And if you are going to use your matrices per-vertex or per-fragment, then it will cause a lot of texture reads. That might actually cause a significant performance drop, depending on the number of vertices/fragments/matrices you have. And even if that texture data ends up in the L1 texture cache, it still won't outperform uniform reads.
If you don't want to send all the variables independently, you could use Uniform Buffer Objects.
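A minimal sketch of how a UBO groups everything into one update, assuming a current GL 3.1+ context and a compiled program object; the FrameData name, the GLM types, and the buildFrameData() helper are my own assumptions:

// GLSL side: one std140 block holding everything that changes per frame.
//     layout(std140) uniform FrameData {
//         mat4 viewProj;
//         mat4 model;
//         vec4 colors[5];   // vec4 instead of vec3 to match std140 alignment
//     };

// C++ side (assuming GLM; this struct matches the std140 layout above).
struct FrameData {
    glm::mat4 viewProj;
    glm::mat4 model;
    glm::vec4 colors[5];
};

GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(FrameData), nullptr, GL_DYNAMIC_DRAW);

// Once at startup: tie the block in the program and the buffer to binding point 0.
GLuint blockIndex = glGetUniformBlockIndex(program, "FrameData");
glUniformBlockBinding(program, blockIndex, 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);

// Per frame: fill a FrameData struct and upload everything in a single call.
FrameData frame = buildFrameData();   // hypothetical helper
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(FrameData), &frame);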

Related

Performance gain of glColorMask()/glDepthMask() on modern hardware?

In my application I have some shaders which write only to the depth buffer, to use it later for shadowing. I also have some other shaders which render a fullscreen quad whose depth will not affect any subsequent draw calls, so its depth values may be thrown away.
Assuming the application runs on modern hardware (produced within the last 5 years), will I gain any additional performance if I disable color buffer writes (glColorMask with all channels set to GL_FALSE) for the shadow map shaders, and depth buffer writes (with glDepthMask()) for the fullscreen quad shaders?
In other words, do these functions really disable some memory operations, or do they just alter some mask bits that are used in fixed bitwise logic in that part of the rendering pipeline?
And the same question about testing. If I know beforehand that all fragments will pass depth test, will disabling depth test improve performance?
My FPS measurements don't show any significant difference, but the result may be different on another machine.
Finally, if rendering runs faster with depth/color test/write disabled, how much faster does it run? Wouldn't this performance gain be negated by the overhead of the extra gl function calls?
Your question is missing a very important point: to avoid those writes, you have to actively do something.
Every fragment has color and depth values. Even if your FS doesn't generate a value, there will still be a value there. Therefore, every fragment produced that is not discarded will write these values, so long as:
1. The color is routed to a color buffer via glDrawBuffers.
2. There is an appropriate color/depth buffer attached to the FBO.
3. The color/depth write mask allows it to be written.
So if you're rendering and you don't want to write one of those colors or the depth value, you've got to change one of these. Changing #1 or #2 is an FBO state change, which is among the most heavyweight operations you can do in OpenGL. Therefore, your choices are to make an FBO change or to change the write mask. The latter will always be the more performance-friendly operation.
Maybe in your case, your application doesn't stress the GPU or CPU enough for such a change to matter. But in general, changing write masks is a better idea than playing with the FBO.
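Concretely, the mask-based version of your two passes could look roughly like this, assuming a current GL context; the draw helpers are placeholders for whatever your engine does:

// Shadow-map pass: keep depth writes, turn color writes off.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
drawShadowCasters();                               // placeholder
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);   // restore for later passes

// Fullscreen-quad pass: keep color writes, drop the depth writes.
glDepthMask(GL_FALSE);
drawFullscreenQuad();                              // placeholder
glDepthMask(GL_TRUE);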
If I know beforehand that all fragments will pass depth test, will disabling depth test improve performance?
Are you changing other state at the same time, or is that the only state you're interested in?
One good way to look at these kinds of a priori performance questions is to look at Vulkan or D3D12 and see what it would require in that API. Changing any pipeline state there is a big deal. But changing two pieces of state is no bigger of a deal than one.
So if changing the depth test correlates with changing other state (blend modes, shaders, etc), it's probably not going to hurt any more.
At the same time, if you really care enough about performance for this sort of thing to matter, you should do application testing. And that should happen after you implement this, and across all hardware of interest. And your code should be flexible enough to easily switch from one to the other as needed.

glUniform vs. single draw call performance

Suppose I have many meshes I'd like to render. I have two choices:
1. Bake transforms and colors for each mesh into a VBO and render with a single draw call.
2. Use glUniform for transforms and colors and use many draw calls (but still a single VBO).
Assuming the scene changes very little between frames, which method tends to be better?
There are more than those two choices. At least one more comes to mind:
1. ...
2. ...
3. Use attributes for transforms and colors and use many draw calls.
Choice 3 is similar to choice 2, but setting attributes (using calls like glVertexAttrib4f) is typically faster than setting uniforms. The efficiency of setting uniforms is highly platform dependent, but they're generally not intended to be modified very frequently. They are called uniform for a reason. :)
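A rough sketch of choice 3, assuming a current GL context and a shared VBO; the Mesh struct, the attribute location, and the loop are illustrative, not from the question:

struct Mesh { GLint first; GLsizei count; float r, g, b; };   // illustrative
const GLuint COLOR_ATTRIB_LOCATION = 3;                        // assumed location

// With the attribute array disabled, the current generic attribute value set
// by glVertexAttrib4f is used for every vertex of the following draw call.
glDisableVertexAttribArray(COLOR_ATTRIB_LOCATION);
for (const Mesh& mesh : meshes) {
    glVertexAttrib4f(COLOR_ATTRIB_LOCATION, mesh.r, mesh.g, mesh.b, 1.0f);
    glDrawArrays(GL_TRIANGLES, mesh.first, mesh.count);
}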
That being said, choice 1 might be the best for your use case where the transforms/colors change rarely. If you're not doing this yet, you could try keeping the attributes that are modified in a separate VBO (with usage GL_DYNAMIC_DRAW), and the attributes that remain constant in their own VBO (with usage GL_STATIC_DRAW). Then make the necessary updates to the dynamic buffer with glBufferSubData.
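For example, a sketch of that static/dynamic split; buffer names, sizes, and data pointers are placeholders:

GLuint staticVbo, dynamicVbo;
glGenBuffers(1, &staticVbo);
glGenBuffers(1, &dynamicVbo);

// Attributes that never change (positions, texture coordinates, ...).
glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
glBufferData(GL_ARRAY_BUFFER, staticBytes, staticData, GL_STATIC_DRAW);

// Attributes that change often (per-vertex colors in this sketch).
glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
glBufferData(GL_ARRAY_BUFFER, colorBytes, colorData, GL_DYNAMIC_DRAW);

// Later, when only the colors have changed:
glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
glBufferSubData(GL_ARRAY_BUFFER, 0, colorBytes, colorData);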
The reality is that there are no simple rules to predict what is going to perform best. It will depend on the size of your data and draw calls, how frequent and large the data changes are, and also very much on the platform you run on. If you want to be confident that you're using the most efficient solution, you need to implement all of them, and start benchmarking.
Generally, option 1 (minimize number of draw calls) is the best advice. There are a couple of caveats:
I have seen performance fall off a cliff when using very large VBOs on at least one mobile device (assuming the opengl-es tag is relevant here). The explanation (from the vendor) involved internal buffers exceeding a certain size.
If putting all the information that would otherwise be conveyed with uniforms into vertex attributes significantly increases the size of the vertex buffer, the price you pay for reading that redundant information (which doesn't really vary per vertex, and may involve costly memory reads) might negate the savings of using fewer draw calls.
As always, the best (but tiresome) advice is to test (I know this is particularly hard when developing for mobile, where there are many potential implementations your code could be running on). Try to keep your pipeline/toolchain flexible enough that you can easily try out and compare the different options.

OpenGL: recommended way of making lots of edits to VBO

This question comes in two (mostly) independent parts
My current setup is that I have a lot of Objects in gamespace. Each has a VBO assigned to it, which holds the vertex attribute data for each vertex. If an Object wants to change its vertex data (position etc.), it does so in an internal array and then calls glBufferSubDataARB to update the version on the GPU.
Now I understand that this is a horrible thing to do, so I am looking for alternatives. One that presents itself is to have some manager that owns a large VBO from the start, from which Objects can request space and edit points in it. This drops the overhead of loading lots of separate VBOs, but comes with a large energy/time expenditure in creating and debugging such a beast (basically an entire memory management system).
My question (part (a)) is if this is the "best" method for doing this, or if there is something better that I have not thought of.
Such a system should allow easy addition/removal of vertices and editing them, as fast as possible.
Part (b) is about some simple actions taken on every object, i.e. rotation and translation. At the moment I am moving each vertex myself (ouch), but there must be a better option. I am considering uploading rotation and translation matrices to my shader and doing the work there. This seems fine, but I am slightly worried about the overhead of changing uniform variables. Would it ultimately be to my advantage to do this? How fast is changing uniform variables?
Last time I checked, the preferred way to do buffer updates was orphaning.
Basically, whenever you want to update your buffered data, you re-specify the buffer's storage by calling glBufferData on it with a NULL data pointer (and the same size and usage), which orphans the current content of the buffer, and then you write your new data with glMapBuffer / glBufferSubData.
Hence:
Using a single big VBO for your static data is indeed a good idea. You must take care of the maximum allowed VBO size, and split your static data into multiple VBOs if necessary. But this is probably an over-optimization in most cases (i.e. "I wouldn't bother").
Data which is updated frequently should be grouped in the same VBO (with usage = GL_STREAM_DRAW), and you should use orphaning to update it, as sketched below.
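A sketch of what orphaning that streamed VBO can look like, assuming a current GL context; buffer names, sizes, and data pointers are placeholders:

glBindBuffer(GL_ARRAY_BUFFER, streamVbo);

// Re-specifying the store with a NULL pointer "orphans" the old contents: the
// driver can hand back fresh memory instead of stalling on the GPU.
glBufferData(GL_ARRAY_BUFFER, bufferBytes, nullptr, GL_STREAM_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0, bufferBytes, newVertexData);

// Alternative: map with an explicit invalidate flag (GL 3.0+).
void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bufferBytes,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
std::memcpy(dst, newVertexData, bufferBytes);   // needs <cstring>
glUnmapBuffer(GL_ARRAY_BUFFER);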
Unfortunately, the actual performance of this stuff varies across implementations. This guy ran some tests in an actual game; they may be worth reading.
For the second part of your question, using uniforms is obviously the way to do it. Yes, there is some (small) overhead, but it's surely a thousand times better than streaming all your vertex data to the GPU every frame.
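For example, one model matrix uniform per object, set right before each draw call; this sketch assumes GLM, a "uModel" uniform, and an Object type with position/orientation fields, none of which come from the question:

GLint modelLoc = glGetUniformLocation(program, "uModel");   // "uModel" assumed
for (const Object& obj : objects) {
    // Build the per-object transform on the CPU (glm::translate / mat4_cast
    // come from GLM's matrix_transform and quaternion headers).
    glm::mat4 model = glm::translate(glm::mat4(1.0f), obj.position)
                    * glm::mat4_cast(obj.orientation);
    glUniformMatrix4fv(modelLoc, 1, GL_FALSE, glm::value_ptr(model));
    obj.draw();
}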

Geometry Shader Additional Primitives

I wanted to use a GLSL geometry shader to look at a line strip and determine where to put a textured annotation, taking into account the current ModelView. It seems I'm limited to getting only 4 vertices per invocation (using GL_LINE_STRIP_ADJACENCY), but what I need to evaluate is the entire line strip.
I could use some other primitive type (such as a multi-point, if there is an equivalent in GL), but the important point is that I want to consider all the geometry, not just a portion of it.
Is there an extension of some kind that would provide additional vertices to the geometry shader? Or is there a better way to do this other than using the geometry shader?
There is no mechanism that will give you access to an entire rendered primitive stream. Primitives can be arbitrarily large, so they can easily blow past any reasonable internal buffer sizes that GPUs have. Thus implementing this would be impractical.
You could bind your array as a buffer texture and just read the data from there. But that's going to be exceedingly slow, since every GS invocation is going to have to process hundreds of vertices. That's not exactly taking advantage of GPU parallelism.
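For reference, the buffer-texture route would look roughly like this (again, not a recommendation; names and sizes are placeholders, and a current GL 3.1+ context is assumed):

GLuint tbo, tboTexture;
glGenBuffers(1, &tbo);
glBindBuffer(GL_TEXTURE_BUFFER, tbo);
glBufferData(GL_TEXTURE_BUFFER, vertexBytes, vertexData, GL_STATIC_DRAW);

glGenTextures(1, &tboTexture);
glBindTexture(GL_TEXTURE_BUFFER, tboTexture);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo);   // one vec4 per vertex

// Geometry shader side (GLSL), where each invocation walks the whole strip:
//     uniform samplerBuffer allVertices;
//     uniform int vertexCount;
//     ...
//     for (int i = 0; i < vertexCount; ++i) {
//         vec4 v = texelFetch(allVertices, i);
//         // accumulate whatever the annotation placement needs
//     }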
If you just want to put a text tag near something, you should designate a particular vertex (or some other marker) as the place where the annotation should go.

Why is the fragment shader faster than just rendering a texture?

I tested something and got a weird performance result with C++ & OpenGL & GLSL.
In the first program I drew the pixels into a texture with the fragment shader and then rendered that texture.
The texture's mag/min filter was GL_NEAREST.
In the second program I took the fragment shader and rendered directly to the screen with it.
Why is the second program faster? Isn't rendering a texture faster than repeating the same computation?
It's like recording a video of a AAA game and then playing it back on the same computer and getting a lower FPS from the video.
The fragment shader is:
uniform int mx, my;

void main(void) {
    vec2 p = gl_FragCoord.xy;
    p -= vec2(mx, my);
    if (p.x < 0.0)
        p.x = -p.x;
    if (p.y < 0.0)
        p.y = -p.y;
    float dis = sqrt(p.x*p.x + p.y*p.y);
    dis += (abs(p.x) + abs(p.y)) - (abs(p.x) - abs(p.y));
    p.x /= dis;
    p.y /= dis;
    gl_FragColor = vec4(p.x, p.y, 0.0, 1.0);
}
As usual with performance questions, about the only way to be really certain would be to use a profiler.
That said, my guess would be that this is mostly a question of processing bandwidth versus memory bandwidth. To render a texture, the processor has to read data from one part of memory, and write that same data back to another part of memory.
To directly render from the shader, the processor only has to write the output to memory, but doesn't have to read data in from memory.
Therefore, it's a question of which is faster: reading that particular data from memory, or generating it with the processing units? The math in your shader is pretty simple (essentially the only part that's at all complex is the sqrt) -- so at least with your particular hardware, it appears that it's a little faster to compute the result than read it from memory (at least given the other memory accesses going on at the same time, etc.)
Note that the two (shader vs. texture) have quite different characteristics. Reading a texture is going to run at nearly constant speed, regardless of how simple or complex the computation that created it was. Not to state the obvious, but a shader is going to run fast if the computation is simple, and slow down (potentially a lot) as the computation gets more complex. In the AAA games you mention, it's fair to guess that at least some of the shaders use calculations complex enough that they'll almost certainly be slower than a texture read. At the opposite extreme, a really trivial shader (e.g., one that just passes the fragment color through from input to output) is probably quite a lot faster than reading from a texture.