Do conditional statements in shaders come at a higher cost? - opengl

I can use conditional statements to reduce the average number of texture reads, but if conditionals are as expensive as they can be on CPUs (where they can defeat branch prediction, etc.), the result might be more complex code with no gain, or even lower efficiency. Do they come at a higher cost?

Do conditional statements in shaders come at a higher cost?
As always: it depends. Nothing is free. Modern GPUs handle branching on uniform conditions quite well. What really hurts is non-uniform control flow, which naturally leaves some of the available shader ALUs idle. With nested conditionals and/or loops, you can easily end up with a single active core per SIMT group, which means throwing away 31/32 to 63/64 of the potential computational power.
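To make the 31/32 figure concrete, here is a minimal sketch of a simplified SIMT cost model. It assumes the group executes every branch path taken by at least one lane, one path after another, with the other lanes masked off. The function name, the path costs, and the group size of 32 are illustrative assumptions, not measurements of any real GPU.

```python
def simt_utilization(path_ids, path_costs):
    """Fraction of lane-cycles that do useful work in one SIMT group.

    path_ids:   one entry per lane, naming the branch path that lane takes.
    path_costs: hypothetical cost in cycles of each path.
    Simplified model: every path taken by at least one lane is executed
    serially by the whole group, with non-participating lanes masked off.
    """
    taken = set(path_ids)
    total_cycles = sum(path_costs[p] for p in taken)
    useful_lane_cycles = sum(path_costs[p] for p in path_ids)
    return useful_lane_cycles / (len(path_ids) * total_cycles)


GROUP_SIZE = 32  # assumed SIMT group width
costs = {path: 10 for path in range(GROUP_SIZE)}  # every path costs 10 cycles

# Uniform condition: all lanes take the same path.
uniform = simt_utilization([0] * GROUP_SIZE, costs)

# Two-way divergence: half the lanes take each side of an if/else.
two_way = simt_utilization([0] * 16 + [1] * 16, costs)

# Worst case: every lane ends up on its own path (deeply nested branches).
fully_divergent = simt_utilization(list(range(GROUP_SIZE)), costs)

print(uniform, two_way, fully_divergent)
```

In this model the fully divergent case leaves only one lane in 32 doing useful work at any time, which is the "throwing away 31/32" situation described above.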
I can use conditional statements to minimize average reads from texture
Be careful with texture sampling in non-uniform control flow. As per the spec, this will lead to undefined values (for sampling functions that rely on implicit derivatives). The problem is that it breaks the calculation of the derivatives of the texcoords, so the GPU can't decide whether it needs the magnification or the minification filter, or which mipmap level to use.
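The dependence on derivatives can be illustrated with a rough sketch of how a level of detail is derived from the screen-space derivatives of the texcoords. This is a simplified version of the usual log2-of-footprint rule. It ignores LOD bias, clamping, and anisotropic filtering, and the function name and parameters are hypothetical.

```python
import math


def approx_mip_level(duv_dx, duv_dy, tex_size):
    """Approximate level of detail from texcoord derivatives (simplified).

    duv_dx, duv_dy: change of (u, v) per pixel step in x and in y.
    tex_size:       texture width/height in texels (square texture assumed).
    A result <= 0 means magnification; > 0 means minification, and the
    value selects the mipmap level.
    """
    # Convert the derivatives from normalized texcoords to texels.
    footprint_x = math.hypot(duv_dx[0] * tex_size, duv_dx[1] * tex_size)
    footprint_y = math.hypot(duv_dy[0] * tex_size, duv_dy[1] * tex_size)
    rho = max(footprint_x, footprint_y)
    if rho == 0.0:
        return float("-inf")  # constant texcoords: fully magnified
    return math.log2(rho)


# One texel per pixel on a 256x256 texture: base level.
lod_one_to_one = approx_mip_level((1 / 256, 0.0), (0.0, 1 / 256), 256)

# Four texels per pixel: two levels down the mip chain.
lod_minified = approx_mip_level((4 / 256, 0.0), (0.0, 4 / 256), 256)

print(lod_one_to_one, lod_minified)
```

The derivatives are normally obtained by differencing texcoords between neighbouring fragments. If some of those neighbours skipped the branch, the inputs to this calculation are not meaningful, and so neither is the chosen filter or mipmap level.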

Related

Does any graphics API allow efficient per-primitive branching?

When writing fragment shaders in OpenGL, one can branch either on compile-time constants, on uniform variables or on varying variables.
How performant that branching is depends on the hardware and driver implementation, but branching on a compile-time constant is usually free, and branching on a uniform is generally faster than branching on a varying.
In the case of a varying, the rasterizer still has to interpolate the variable for each fragment, and the branch has to be decided on each fragment shader invocation, even if the value of the varying is the same for every fragment in the current primitive.
What I wonder is whether any graphics API or extension allows some fragment shader branching that is executed only once per rasterized primitive (or, in the case of tiled rendering, once per primitive per bin)?
Dynamic branching is only expensive when it causes divergence of instances executing at the same time. The cost of interpolating a "varying" is trivial.
Furthermore, different GPUs handle primitive rasterization differently. Some GPUs ensure that wavefronts for fragment shaders only contain instances that are executing on the same primitive. On these GPUs, branching based on values that don't change per-primitive will be fast.
However, other GPUs will pack instances from different primitives into the same wavefronts. On these GPUs, divergence will happen if the value is different for different primitives. How much divergence? It rather depends on how often you get multiple instances in a primitive. If many of your primitives are small in rasterized space, then you'll get a lot more divergence than if you have a lot of large primitives.
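The effect of primitive size on divergence can be sketched with a toy packing model. It assumes fragments are packed into wavefronts strictly in primitive order, that the wavefront size is 64, and that every primitive carries a different value of the branch condition. All three are illustrative assumptions, and the function name is hypothetical.

```python
def mixed_wavefront_fraction(fragments_per_primitive, wave_size=64):
    """Fraction of wavefronts that contain fragments of more than one primitive.

    Toy model: fragments are packed back to back in primitive order, so a
    wavefront is 'mixed' (and may diverge on a per-primitive value) whenever
    it spans a primitive boundary.
    """
    # Tag every fragment with the index of the primitive that produced it.
    owners = [
        prim
        for prim, count in enumerate(fragments_per_primitive)
        for _ in range(count)
    ]
    waves = [owners[i:i + wave_size] for i in range(0, len(owners), wave_size)]
    if not waves:
        return 0.0
    mixed = sum(1 for wave in waves if len(set(wave)) > 1)
    return mixed / len(waves)


# Same total fragment count (2560), very different primitive sizes.
large_primitives = mixed_wavefront_fraction([640] * 4)   # 4 big triangles
small_primitives = mixed_wavefront_fraction([8] * 320)   # 320 tiny triangles

print(large_primitives, small_primitives)
```

With a few large primitives, almost every wavefront holds fragments from a single primitive. With many tiny primitives, every wavefront mixes several of them, so a branch on a per-primitive value diverges in each one.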
GPUs that pack instances from different primitives into a wavefront are trying to maximize how much their cores get utilized. It's a tradeoff: you're minimizing the overall number of wavefronts you have to execute, but a particular cause of divergence (data that is constant within a primitive but not between them) will be penalized.
In any case, try to avoid divergence when you can. But if your algorithm requires it... then your algorithm requires it, and the performance you get is the performance you get. The best you can do is let the GPU know that the "varying" will be constant per-primitive by using flat interpolation.

Should I use uniform variable to reduce the amount of matrix multiplication?

I just wrote a program to rotate an object. It updates a variable theta in the idle function, and that variable is used to create rotation matrices. Then I do this:
gl_Position = rx * ry * rz * vPosition;
rx, ry and rz (matrices) are the same for every point during a given frame, but they are multiplied for every single point in the object. Should I use a uniform mat4 that stores the product rx * ry * rz and pass it to the shader, or let the shader do the multiplication for every single point? Which is faster?
While profiling is essential to measure how your application responds to optimizations, in general, passing a concatenated matrix to the vertex shader is desirable. This is for two reasons:
The amount of data passed from CPU to GPU is reduced. If rx, ry and rz are all 4x4 matrices, and their product (say rx_ry_rz = rx * ry * rz) is also a 4x4 matrix, then you will be transferring two fewer 4x4 matrices (128 bytes, assuming 32-bit floats) as uniforms on each update. If you use this shader to render 1000 objects per frame at 60 Hz, and the uniforms are updated for each object, that's 7 MB+ per second of saved bandwidth. Maybe not extremely significant, but every bit helps, especially if bandwidth is your bottleneck.
The amount of work the vertex stage must do is reduced (assuming a non-trivial number of vertices). Generally the vertex stage is not a bottleneck, however, many drivers implement load balancing in their shader core allocation between stages, so reducing work in the vertex stage could give benefits in the pixel stage (for example). Again, profiling will give you a better idea of if/how this benefits performance.
The drawback is added CPU time taken to multiply the matrices. If your application's bottleneck is CPU execution, doing this could potentially slow down your application, as it will require the CPU to do more work than it did before.
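The trade-off above can be sketched on the CPU side. The snippet below is a minimal, hedged illustration: it builds three rotation matrices from a single angle, multiplies them once (as the application would do once per frame before uploading one uniform), and checks that transforming a vertex with the combined matrix gives the same result as applying the three matrices one after another. The helper names, the value of theta, and the sample vertex are assumptions for illustration. The last line reproduces the bandwidth estimate, assuming 4-byte floats.

```python
import math


def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def rot_x(t):
    c, s = math.cos(t), math.sin(t)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]


def rot_y(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]


def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


theta = 0.7  # hypothetical angle updated by the idle function
rx, ry, rz = rot_x(theta), rot_y(theta), rot_z(theta)

# Done once per frame on the CPU, then uploaded as a single mat4 uniform.
rx_ry_rz = mat_mul(mat_mul(rx, ry), rz)

# What the vertex shader would otherwise recompute for every vertex.
v_position = [1.0, 2.0, 3.0, 1.0]
per_vertex = mat_vec(rx, mat_vec(ry, mat_vec(rz, v_position)))
combined = mat_vec(rx_ry_rz, v_position)

# Two fewer mat4 uniforms (16 floats * 4 bytes each), 1000 objects, 60 Hz.
saved_bytes_per_second = 2 * 16 * 4 * 1000 * 60

print(per_vertex, combined, saved_bytes_per_second)
```

The two results agree up to floating-point rounding, so the shader only needs a single matrix-vector product per vertex, at the price of two 4x4 matrix products per update on the CPU.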
I wouldn't count on this repeated multiplication being optimized out, unless you have convinced yourself that it is indeed happening on all platforms you care about. To do that:
One option is benchmarking, but it will probably be difficult to isolate this operation well enough to measure a possible difference reliably.
I believe some vendors provide development tools that let you see assembly code for the compiled shader. I think that's the only reliable way for you to see what exactly happens with your GLSL code in this case.
This is a very typical example for a much larger theme. At least in my personal opinion, what you have is an example of code that uses OpenGL inefficiently. Making calculations that are the same for each vertex in the vertex shader, which at least conceptually is executed for each vertex, is not something you should do.
In reality, driver optimizations to work around inefficient use of the API are done based on the benefit they offer. If a high profile app/game uses certain bad patterns (and many of them do!), and they are identified as having a negative effect on performance, drivers are optimized to work around them, and still provide the best possible performance. This is particularly true if the app/game is commonly used for benchmarks. Ironically, those optimizations may hurt the performance of well written software that is considered less important.
So if there ever was an important app/game that did the same thing you're doing, which seems quite likely in this case, chances are that many drivers will contain optimizations to deal with it efficiently.
Still, I wouldn't rely on it. The reasons are philosophical as well as practical:
If I work on an app, I feel that it's my job to write efficient code. I don't want to write poor code, and hope that somebody else happened to optimize their code to compensate for it.
You can't count on all of the platforms the app will ever run on to contain these types of optimizations. Particularly since app code can have a long lifetime, and those platforms might not even exist yet.
Even if the optimizations are in place, they will most likely not be free. You might trigger driver code that ends up consuming more resources than it would take for your code to provide the combined matrix yourself.

Cost of Branching on uniforms on modern GPUs

When using GLSL on modern (GL3.3+) GPUs, what is the likely cost of branching on a uniform?
In my engine I'm getting to the point where I have a lot of shaders. And I have several different quality presets for a lot of those. As it stands, I'm using uniforms with if() in the shaders to choose different quality presets. I'm however worried that I might achieve better performance by recompiling the shaders and using #ifdef. The problem with that is the need to worry about tracking and resetting other uniforms when I recompile a shader.
Basically what I want to know is if my fears are unfounded. Is branching on a uniform cheap on modern GPUs? I have done a few tests myself and found very little difference either way, but I've only tested on an nVidia 680.
I will admit that I'm not an expert, but perhaps my speculation is better than nothing.
I would think that branching on uniforms is indeed fairly cheap. It's clearly much different from branching on texture or attribute data, since all the ALUs in the SIMD will follow the same code path from the shader, so it is a "real" branch rather than an execution mask. I'm not too sure how shader processors suffer from branch bubbles in their pipeline, but the pipeline is certainly bound to be more shallow than in general-purpose CPUs (particularly given the much lower clock-speeds they typically run at).
I wish I could be more helpful and I'd also appreciate if someone else can answer more authoritatively. I, for one, wouldn't worry too much about branching on uniforms, however. But as always, if you have the possibility, do profile your shader and see if it makes any noticeable difference.

Gaining an understanding of performance implications of shader stages, particularly the GS

I am confused about what's faster versus what's slower when it comes to coding algorithms that execute in the pipeline.
I made a program with a GS that seemingly bottlenecked from fillrate, because timer queries showed it to execute much faster with no rasterisation enabled.
So then I made a different multi-pass algorithm using transform feedback. It still uses a GS in every pass, but it theoretically does much less work overall by executing in stages, and it significantly reduces the fill rate because it renders far fewer triangles. In my early tests, however, it appears to run slower.
My original thought was that the fillrate bottleneck had been traded for the bottleneck of issuing multiple draw calls. But how expensive is another draw call really? How much overhead is involved on the CPU and GPU?
Then I read the answer of a different stack question regarding the GS:
No one has ever accused Geometry Shaders of being fast. Especially when increasing the size of geometry.
Your GS is taking a line and not only doing a 30x amplification of vertex data, but also doing lighting computations on each of those new vertices. That's not going to be terribly fast, in large part due to a lack of parallelism. Each GS invocation has to do 60 lighting computations, rather than having 60 separate vertex shader invocations doing 60 lighting computations in parallel.
You're basically creating a giant bottleneck in your geometry shader.
It would probably be faster to put the lighting stuff in the fragment shader (yes, really).
and it makes me wonder how it's possible for a geometry shader to be slower if its use results in less work overall. I know things execute in parallel, but my understanding is that there is only a relatively small group of shader cores, so starting a number of threads much larger than that group will result in the bottleneck being something proportional to program complexity (instruction count) times the number of threads (using "thread" here to refer to an invocation of a shader). If you can have some instruction execute once per vertex in the geometry shader instead of once per fragment, why would it ever be slower?
Help me gain a better understanding so I don't waste time designing algorithms that are inefficient.

A triangle with 3 varyings of same value.. does GPU interpolate / waste performance?

I have a simple question about which I was unable to find solid facts: how does the GPU behave when all 3 vertices of a triangle have the same varying output from the vertex shader?
Does the GPU notice that case, or does it try to interpolate when it's not even needed?
This might be interesting, as there are quite a few cases where you want a per-triangle constant varying available in the fragment shader. Please don't just guess; try to bring up references, or at least reasons why you think it's one way or the other.
The GPU does the interpolation, no matter if it's needed or not.
The reason is quite simple: checking if the varying variable has already been changed is very expensive.
Shaders are small programs, that are executed concurrently on different GPU cores. So if you would like to avoid that two different cores are computing the same value, you would have to "reserve" the output variable. So you need an additional data structure (like a flag or mutex) that every core can read. In your case this would mean, that three different cores have to read the same flag, and the first of them has to reserve it if it's not already reserved.
This has to happen atomically, meaning that the reserving core has to be the only one setting the flag at a time. To do this, all other cores would, for example, have to be stopped for a tick. As you don't know which cores are computing the vertex shader, you would have to stop ALL other cores (on a GTX Titan this would be 2687 others).
Additionally, when the variable is set and a new frame is rendered, all the flags would have to be reset, so the race for the flag can begin again.
To conclude: you would need additional hardware in your GPU, that is expensive and slows down the rendering pipeline.
It is the programmer's job to avoid having multiple shaders produce the same output. So if you are doing your job right, this does not happen, or you know that avoiding it (on the CPU) would cost more than ignoring it.
An example would be the stitching between different levels of detail (as on a height map), where most methods create some fragments twice. This has a very small impact on rendering performance but would require a lot of CPU time to avoid.
If the behavior isn't mandated in the OpenGL specification, then the answer is that it's up to the implementation.
The comments and other answers are almost certainly spot on that there is no optimization path for identical values because there would be little to no benefit from the added complexity to make such a path.