performance of conditional branch in fragment shader [duplicate] - opengl

I want to know if "if-statements" inside shaders (vertex / fragment / pixel...) are really slowing down the shader performance. For example:
Is it better to use this:
vec3 output;
output = input*enable + input2*(1-enable);
instead of using this:
vec3 output;
if(enable == 1)
{
output = input;
}
else
{
output = input2;
}
There was a discussion about this in another forum (2013): http://answers.unity3d.com/questions/442688/shader-if-else-performance.html
There, people say that if-statements are really bad for shader performance.
This thread (2012) also discusses how much code is inside the if/else branches:
https://www.opengl.org/discussion_boards/showthread.php/177762-Performance-alternative-for-if-(-)
Maybe hardware or shader compilers have improved since then and somehow fix this (possibly nonexistent) performance issue.
EDIT:
What about this case? Let's say enable is a uniform variable and it is always set to 0:
if(enable == 1) //never happens
{
output = vec4(0,0,0,0);
}
else //always happens
{
output = calcPhong(normal, lightDir);
}
I think we still have a branch inside the shader here, which slows it down. Is that correct?
Does it make more sense to write two different shaders, one for the if part and one for the else part?

What is it about shaders that makes if statements even potentially a performance problem? It has to do with how shaders are executed and where GPUs get their massive computing performance from.
Separate shader invocations are usually executed in parallel, executing the same instructions at the same time. They're simply executing them on different sets of input values; they share uniforms, but they have different internal registers. One term for a group of invocations all executing the same sequence of operations is "wavefront" (on NVIDIA hardware, a "warp").
The potential problem with any form of conditional branching is that it can screw all that up. It causes different invocations within the wavefront to have to execute different sequences of code. That is a very expensive process, whereby a new wavefront has to be created, data copied over to it, etc.
Unless... it doesn't.
For example, if the condition is one that is taken by every invocation in the wavefront, then no runtime divergence is needed. As such, the cost of the if is just the cost of checking a condition.
So, let's say you have a conditional branch, and let's assume that all of the invocations in the wavefront will take the same branch. There are three possibilities for the nature of the expression in that condition:
Compile-time static. The conditional expression is entirely based off of compile-time constants. As such, you know from looking at the code which branches will be taken. Pretty much any compiler handles this as part of basic optimization.
Statically uniform branching. The condition is based off of expressions involving things which are known at compile-time to be constant (specifically, constants and uniform values). But the value of the expression will not be known at compile-time. So the compiler can statically be certain that wavefronts will never be broken by this if, but the compiler cannot know which branch will be taken.
Dynamic branching. The conditional expression contains terms other than constants and uniforms. Here, a compiler cannot tell a priori if a wavefront will be broken up or not. Whether that will need to happen depends on the runtime evaluation of the condition expression.
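The three cases can be illustrated with a small fragment shader; the names here are made up for illustration:

```glsl
#version 330 core

const bool USE_FOG = true;   // 1. compile-time static: the compiler folds this away
uniform int lightingMode;    // 2. statically uniform: same for every invocation,
                             //    but the value is unknown at compile time
in vec3 normal;              // 3. dynamic: varies per fragment
out vec4 fragColor;

void main() {
    vec3 c = vec3(0.0);
    if (USE_FOG)              // resolved entirely at compile time
        c += vec3(0.1);
    if (lightingMode == 1)    // can never split a wavefront
        c += vec3(0.5);
    if (normal.z > 0.0)       // may diverge within a wavefront
        c += vec3(0.4);
    fragColor = vec4(c, 1.0);
}
```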
Different hardware can handle different branching types without divergence.
Also, even if a condition would diverge within a wavefront, the compiler could restructure the code to not require actual branching. You gave a fine example: output = input*enable + input2*(1-enable); is functionally equivalent to the if statement. A compiler could detect that an if is only being used to set a variable, and thus execute both sides and select the result. This is frequently done for cases of dynamic conditions where the bodies of the branches are small.
Pretty much all hardware can handle var = bool ? val1 : val2 without having to diverge. This was possible way back in 2002.
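As a sketch, both of the following forms are typically lowered to a conditional-select instruction rather than a real branch (the names input1/input2 are illustrative, since input and output are reserved words in GLSL; this assumes enable holds 0 or 1):

```glsl
vec3 a = (enable == 1) ? input1 : input2;      // ternary select, no divergence
vec3 b = mix(input2, input1, float(enable));   // arithmetic select via mix()
```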
Since this is very hardware-dependent, it... depends on the hardware. There are however certain epochs of hardware that can be looked at:
Desktop, Pre-D3D10
There, it's kinda the wild west. NVIDIA's compiler for such hardware was notorious for detecting such conditions and actually recompiling your shader whenever you changed uniforms that affected such conditions.
In general, this era is where about 80% of the "never use if statements" advice comes from. But even here, it's not necessarily true.
You can expect optimization of static branching. You can hope that statically uniform branching won't cause any additional slowdown (though the fact that NVIDIA thought recompilation would be faster than executing it makes it unlikely at least for their hardware). But dynamic branching is going to cost you something, even if all of the invocations take the same branch.
Compilers of this era do their best to optimize shaders so that simple conditions can be executed simply. For example, your output = input*enable + input2*(1-enable); is something that a decent compiler could generate from your equivalent if statement.
Desktop, Post-D3D10
Hardware of this era is generally capable of handling statically uniform branches with little slowdown. For dynamic branching, you may or may not encounter slowdown.
Desktop, D3D11+
Hardware of this era is pretty much guaranteed to be able to handle dynamically uniform conditions with little performance impact. Indeed, the condition doesn't even have to be dynamically uniform; so long as all of the invocations within the same wavefront take the same path, you won't see any significant performance loss.
Note that some hardware from the previous epoch probably could do this as well. But this is the one where it's almost certain to be true.
Mobile, ES 2.0
Welcome back to the wild west. Though unlike Pre-D3D10 desktop, this is mainly due to the huge variance of ES 2.0-caliber hardware. There's such a huge amount of stuff that can handle ES 2.0, and they all work very differently from each other.
Static branching will likely be optimized. But whether you get good performance from statically uniform branching is very hardware-dependent.
Mobile, ES 3.0+
Hardware here is rather more mature and capable than ES 2.0. As such, you can expect statically uniform branches to execute reasonably well. And some hardware can probably handle dynamic branches the way modern desktop hardware does.

It's highly dependent on the hardware and on the condition.
If your condition is a uniform: don't bother, let the compiler deal with it.
If your condition is something dynamic (like a value computed from an attribute, or fetched from a texture, or something), then it's more complicated.
For that latter case you'll pretty much have to test and benchmark, because it'll depend on the complexity of the code in each branch and how 'consistent' the branch decision is.
For instance, if one of the branches is taken 99% of the time and discards the fragment, then most likely you want to keep the conditional. But on the other hand, in your simple example above, if enable is some dynamic condition, the arithmetic select might be better.
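The discard case might look like this (a hypothetical alpha-test fragment shader):

```glsl
uniform sampler2D tex;
in vec2 uv;
out vec4 fragColor;

void main() {
    vec4 c = texture(tex, uv);
    if (c.a < 0.01)   // dynamic condition; suppose it is taken the vast majority of the time
        discard;      // keep the branch: discarded fragments do no further work
    fragColor = c;    // expensive shading for the surviving fragments would go here
}
```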
Unless you have a clear-cut case like the above, or unless you're optimizing for one fixed, known architecture, you're probably better off letting the compiler figure that out for you.


Can I use if else in GLSL shaders? [duplicate]

How much does using if else affect the performance of the vertex and fragment shader?
I know that the vertex shader is called for each vertex; there may not be many of them, so using if/else there is unlikely to lead to performance problems.
But the fragment shader is called for each pixel, which means there are a lot of calls, so it seems to me that using if/else could have a bad impact on performance.
I would like to be able to configure shaders using uniform variables and if/else. Is that a reasonable approach?
What can you say about this?
From what I understand, introducing branching does have an impact. There is a lot of discussion over on this Stack Overflow question and this Stack Overflow question.
In short, it seems to depend on two factors: the age/make of the GPU, and what kind of if statement you have.
On the GPU's age/make: the compiler on the GPU will handle branching in different ways. It may be able to make optimisations or assumptions, or it might force all branches to be executed and then pick the valid result afterwards; you can imagine that a lot of branching really starts to eat at performance in that case.
On the if statement: if you're checking a uniform value, which evaluates the same in all executions of your shader program, then the performance hit is relatively minor. But if you're checking a value calculated during this vertex/fragment execution, then the performance hit can be big.
I would consider either:
Having a shader per configuration: do the if/else checks in your app code, select the appropriate shader, and run without any shader branching.
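The per-configuration approach is often done with the preprocessor: compile the same source twice, with and without a macro defined from app code (ENABLE_PHONG is a made-up name, calcPhong comes from the question above):

```glsl
// Compiled once with "#define ENABLE_PHONG" prepended by the app,
// and once without; each resulting program runs with no runtime branch.
#ifdef ENABLE_PHONG
    fragColor = calcPhong(normal, lightDir);
#else
    fragColor = vec4(0.0);
#endif
```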
And/or, look into the alternatives to branching, e.g. uniforms that can be applied in a way where, if set to certain values, they don't actually change the visual result. For example, a blend colour that tints your final result: if set to white, it doesn't change anything you've already calculated.
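The blend-colour idea might be sketched like this (blendColour and tex are assumed names):

```glsl
uniform sampler2D tex;
uniform vec4 blendColour;  // set to vec4(1.0) for "no tint": the multiply
                           // is then a no-op, so one shader covers both cases
in vec2 uv;
out vec4 fragColor;

void main() {
    fragColor = texture(tex, uv) * blendColour;
}
```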
But in my experience, nothing works better than trying it out and profiling.

Using switch statements in GLSL 4.1

I read a terrifying post recently where someone claimed that a switch statement in GLSL uses no conditional branching, and in fact causes every possible outcome to be run every time the switch is entered. This is worrying for me because I'm currently working on a deferred rendering engine which uses a few nested switch statements.
Does anyone know if there's any truth to this, and can it be backed up with any evidence?
Thanks!
I read a terrifying post recently where someone claimed that a switch statement in GLSL uses no conditional branching, and in fact causes every possible outcome to be run every time the switch is entered.
Neither of these is necessarily true. At least, not on today's hardware.
What happens is very dependent on the compiler and the underlying hardware architecture. So there is no one answer. But it is also very dependent on one other thing: what the condition actually is.
See, the reason why a compiler would execute both sides of a condition has to do with how GPUs work. GPUs gain their performance by grouping threads together and executing them in lock-step, with each thread group executing the exact same sequence of steps. With a conditional branch, this is impossible. So to do a true branch, you have to break up a group depending on which individual threads execute which branch.
So instead, if the two branches are fairly short, it'll execute them both and discard the particular values from the not-taken branch. The particular discarding of values doesn't require breaking thread groups, due to specialized opcodes and such.
Well, if the condition is based on an expression which is dynamically uniform (i.e., an expression that evaluates to the same value for every invocation within a draw call/context), then there is a good chance the compiler will not execute both sides; that it will emit a true branch.
The reason being that, because the condition is dynamically uniform, all threads in the group will execute the same code. So there is no need to break a group of threads up to do a proper condition.
So if you have a switch statement that is based on a uniform variable, or expressions only involving uniform and compile-time constant variables, then there is no reason to expect it to execute multiple branches simultaneously.
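So a switch like the following, driven entirely by a uniform, is dynamically uniform, and there is no reason to expect the compiler to execute every case (lambert and phong are hypothetical helper functions):

```glsl
uniform int shadingModel;

vec3 shade(vec3 n, vec3 l) {
    switch (shadingModel) {   // same value for every invocation in the draw call
    case 0:  return lambert(n, l);
    case 1:  return phong(n, l);
    default: return vec3(0.0);
    }
}
```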
It should also be noted that even if the expression is not dynamically uniform, the compiler will not always execute both branches. If the branches are too long or too different, it can choose to break up thread groups instead. This can lower performance, but potentially not as much as executing both branches. It's really up to the compiler to work out how to do it.

Should I use uniform variable to reduce the amount of matrix multiplication?

I just wrote a program to rotate an object. It just updates a variable theta in the idle function. That variable is used to create a rotation matrix, and then I do this:
gl_Position = rx * ry * rz * vPosition;
rx, ry, and rz (matrices) are the same for every vertex during the same frame, but the multiplication is performed for every single vertex in the object. Should I use a uniform mat4 that stores the product rx * ry * rz and pass it to the shader, or let the shader handle the multiplication for every vertex? Which is faster?
While profiling is essential to measure how your application responds to optimizations, in general, passing a concatenated matrix to the vertex shader is desirable. This is for two reasons:
The amount of data passed from CPU to GPU is reduced. If rx, ry and rz are all 4x4 matrices, and the product of them (say rx_ry_rz = rx * ry * rz) is also a 4x4 matrix, then you will be transferring 2 fewer 4x4 matrices (128 bytes) as uniforms each update. If you use this shader to render 1000 objects per frame at 60 Hz, and the uniform updates with each object, that's 7 MB+ per second of saved bandwidth. Maybe not extremely significant, but every bit helps, especially if bandwidth is your bottleneck.
The amount of work the vertex stage must do is reduced (assuming a non-trivial number of vertices). Generally the vertex stage is not a bottleneck, however, many drivers implement load balancing in their shader core allocation between stages, so reducing work in the vertex stage could give benefits in the pixel stage (for example). Again, profiling will give you a better idea of if/how this benefits performance.
The drawback is added CPU time taken to multiply the matrices. If your application's bottleneck is CPU execution, doing this could potentially slow down your application, as it will require the CPU to do more work than it did before.
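A sketch of the premultiplied version: the app computes the product once per frame and uploads a single uniform (rotation is an assumed name):

```glsl
// App side, per frame rather than per vertex:
//   mat4 rotation = rx * ry * rz;
//   glUniformMatrix4fv(rotationLoc, 1, GL_FALSE, value_ptr(rotation));

uniform mat4 rotation;   // premultiplied rx * ry * rz
in vec4 vPosition;

void main() {
    // one mat4*vec4 per vertex, instead of two mat4*mat4
    // products plus a mat4*vec4
    gl_Position = rotation * vPosition;
}
```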
I wouldn't count on this repeated multiplication being optimized out, unless you convinced yourself that it is indeed happening on all platforms you care about. To do that:
One option is benchmarking, but it will probably be difficult to isolate this operation well enough to measure a possible difference reliably.
I believe some vendors provide development tools that let you see assembly code for the compiled shader. I think that's the only reliable way for you to see what exactly happens with your GLSL code in this case.
This is a very typical example for a much larger theme. At least in my personal opinion, what you have is an example of code that uses OpenGL inefficiently. Making calculations that are the same for each vertex in the vertex shader, which at least conceptually is executed for each vertex, is not something you should do.
In reality, driver optimizations to work around inefficient use of the API are done based on the benefit they offer. If a high profile app/game uses certain bad patterns (and many of them do!), and they are identified as having a negative effect on performance, drivers are optimized to work around them, and still provide the best possible performance. This is particularly true if the app/game is commonly used for benchmarks. Ironically, those optimizations may hurt the performance of well written software that is considered less important.
So if there ever was an important app/game that did the same thing you're doing, which seems quite likely in this case, chances are that many drivers will contain optimizations to deal with it efficiently.
Still, I wouldn't rely on it. The reasons are philosophical as well as practical:
If I work on an app, I feel that it's my job to write efficient code. I don't want to write poor code, and hope that somebody else happened to optimize their code to compensate for it.
You can't count on all of the platforms the app will ever run on to contain these types of optimizations. Particularly since app code can have a long lifetime, and those platforms might not even exist yet.
Even if the optimizations are in place, they will most likely not be free. You might trigger driver code that ends up consuming more resources than it would take for your code to provide the combined matrix yourself.