How much does using if/else affect the performance of vertex and fragment shaders?
I know that the vertex shader is called once per vertex; there usually aren't many of those, so using if/else there is unlikely to cause performance problems.
But the fragment shader is called for every pixel, which means a lot of invocations, so it seems to me that using if/else there could hurt performance.
I would like to configure my shaders with uniform variables and if/else. Is that a reasonable approach?
What can you say about this?
From what I understand, introducing branching does have an impact. There is a lot of discussion over on this Stack Overflow question and this Stack Overflow question.
In short, it seems to depend on two factors: the age/make of the GPU and what kind of if statement you have.
GPU age/make: the compiler on the GPU will handle branching in different ways. It may be able to make optimisations or assumptions, or it might force all branches to be executed and then pick the valid result afterwards; you can imagine how a lot of branching really starts to eat at performance in that case.
The 'if' statement itself: if you're checking a uniform value, which evaluates the same in every execution of your shader program, the performance hit is relatively minor. But if you're checking a value calculated within this vertex/fragment execution, the performance hit is big.
I would consider either:
Having a shader per configuration: do the if/else checks in your app code, select the appropriate shader, and run without any branching in the shader itself.
And/or looking into alternatives to branching, e.g. uniforms that can be applied in a way where, if set to certain values, they don't actually change the visual result. For example, a blend colour that tints your final result changes nothing when set to white. (See the sketch after this list.)
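For illustration, a rough GLSL sketch of the first idea (USE_TINT, uTint and uTexture are hypothetical names, not from the answer above):

#version 330 core
uniform sampler2D uTexture;
in vec2 vTexCoord;
out vec4 fragColor;

// Per-configuration shaders: the app prepends "#define USE_TINT" to the
// source before compiling, so the untinted program simply contains no
// tint code and neither variant carries a runtime branch.
#ifdef USE_TINT
uniform vec4 uTint;
#endif

void main()
{
    vec4 color = texture(uTexture, vTexCoord);
#ifdef USE_TINT
    color *= uTint;
#endif
    fragColor = color;
}

The second idea would instead always declare uTint and always apply the multiply, with the app setting it to vec4(1.0) (white) whenever tinting should be off; either way the shader stays branchless.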
But in my experience, nothing works better than trying it out and profiling.
Related
I want to know if "if-statements" inside shaders (vertex / fragment / pixel...) really slow down shader performance. For example:
Is it better to use this:
vec3 output;
output = input*enable + input2*(1-enable);
instead of using this:
vec3 output;
if(enable == 1)
{
output = input;
}
else
{
output = input2;
}
There was a discussion about this in another forum (2013): http://answers.unity3d.com/questions/442688/shader-if-else-performance.html
There the posters say that if-statements are really bad for shader performance.
And here they talk about how much code is inside the if/else statements (2012):
https://www.opengl.org/discussion_boards/showthread.php/177762-Performance-alternative-for-if-(-)
Maybe the hardware or the shader compilers are better now and somehow fix this (possibly nonexistent) performance issue.
EDIT:
What about this case? Here, let's say enable is a uniform variable and it is always set to 0:
if(enable == 1) //never happens
{
output = vec4(0,0,0,0);
}
else //always happens
{
output = calcPhong(normal, lightDir);
}
I think we still have a branch inside the shader here, which slows it down. Is that correct?
Does it make more sense to write two different shaders, one for the else part and one for the if part?
What is it about shaders that even potentially makes if statements a performance problem? It has to do with how shaders get executed and where GPUs get their massive computing performance from.
Separate shader invocations are usually executed in parallel, executing the same instructions at the same time. They're simply executing them on different sets of input values; they share uniforms, but they have different internal registers. One term for a group of shaders all executing the same sequence of operations is "wavefront".
The potential problem with any form of conditional branching is that it can screw all that up. It causes different invocations within the wavefront to have to execute different sequences of code. That is a very expensive process, whereby a new wavefront has to be created, data copied over to it, etc.
Unless... it doesn't.
For example, if the condition is one that is taken by every invocation in the wavefront, then no runtime divergence is needed. As such, the cost of the if is just the cost of checking a condition.
So, let's say you have a conditional branch, and let's assume that all of the invocations in the wavefront will take the same branch. There are three possibilities for the nature of the expression in that condition:
Compile-time static. The conditional expression is entirely based off of compile-time constants. As such, you know from looking at the code which branches will be taken. Pretty much any compiler handles this as part of basic optimization.
Statically uniform branching. The condition is based off of expressions involving things which are known at compile-time to be constant (specifically, constants and uniform values). But the value of the expression will not be known at compile-time. So the compiler can statically be certain that wavefronts will never be broken by this if, but the compiler cannot know which branch will be taken.
Dynamic branching. The conditional expression contains terms other than constants and uniforms. Here, a compiler cannot tell a priori if a wavefront will be broken up or not. Whether that will need to happen depends on the runtime evaluation of the condition expression.
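In GLSL terms, the three cases might look like this (a minimal sketch; MODE, uMode and vNormal are hypothetical names):

#version 330 core
const int MODE = 1;   // compile-time constant
uniform int uMode;    // uniform: fixed across a draw call
in vec3 vNormal;      // varying: differs per fragment
out vec4 fragColor;

void main()
{
    vec3 c = vec3(0.0);

    // 1. Compile-time static: the compiler folds this branch away.
    if (MODE == 1)
        c += vec3(0.1);

    // 2. Statically uniform: every invocation takes the same path, but
    //    the compiler cannot know which path at compile time.
    if (uMode == 1)
        c += vec3(0.2);

    // 3. Dynamic: invocations within one wavefront may disagree here,
    //    so this branch can diverge.
    if (vNormal.z > 0.0)
        c += vec3(0.3);

    fragColor = vec4(c, 1.0);
}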
Different hardware can handle different branching types without divergence.
Also, even when a condition could diverge within a wavefront, the compiler can restructure the code to not require actual branching. You gave a fine example: output = input*enable + input2*(1-enable); is functionally equivalent to the if statement. A compiler could detect that an if is being used to set a variable, and thus execute both sides. This is frequently done for dynamic conditions where the bodies of the branches are small.
Pretty much all hardware can handle var = bool ? val1 : val2 without having to diverge. This was possible way back in 2002.
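In GLSL that select can be spelled either way; mix() with a 0.0/1.0 factor is the built-in form of the arithmetic trick from the question. A minimal sketch (names changed, since input and output are reserved words in GLSL):

uniform int enable; // 0 or 1, as in the question

vec3 selectColor(vec3 colorA, vec3 colorB)
{
    // Both forms are selects that the hardware can execute as a
    // conditional move, with no wavefront divergence:
    vec3 a = (enable == 1) ? colorA : colorB;     // ternary select
    vec3 b = mix(colorB, colorA, float(enable));  // arithmetic select
    return a; // b computes the same value
}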
Since this is very hardware-dependent, it... depends on the hardware. There are however certain epochs of hardware that can be looked at:
Desktop, Pre-D3D10
There, it's kinda the wild west. NVIDIA's compiler for such hardware was notorious for detecting such conditions and actually recompiling your shader whenever you changed uniforms that affected such conditions.
In general, this era is where about 80% of the "never use if statements" comes from. But even here, it's not necessarily true.
You can expect optimization of static branching. You can hope that statically uniform branching won't cause any additional slowdown (though the fact that NVIDIA thought recompilation would be faster than executing it makes it unlikely at least for their hardware). But dynamic branching is going to cost you something, even if all of the invocations take the same branch.
Compilers of this era do their best to optimize shaders so that simple conditions can be executed simply. For example, your output = input*enable + input2*(1-enable); is something that a decent compiler could generate from your equivalent if statement.
Desktop, Post-D3D10
Hardware of this era is generally capable of handling statically uniform branches with little slowdown. For dynamic branching, you may or may not encounter slowdown.
Desktop, D3D11+
Hardware of this era is pretty much guaranteed to be able to handle dynamically uniform conditions with little performance issues. Indeed, it doesn't even have to be dynamically uniform; so long as all of the invocations within the same wavefront take the same path, you won't see any significant performance loss.
Note that some hardware from the previous epoch probably could do this as well. But this is the one where it's almost certain to be true.
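As a hedged illustration of such a wavefront-coherent dynamic branch, consider cascaded shadow maps (all names below are hypothetical): neighbouring fragments almost always fall in the same cascade, so even though the condition is per-fragment, whole wavefronts tend to agree on it.

uniform sampler2DShadow uShadowNear;
uniform sampler2DShadow uShadowFar;
uniform float uCascadeSplit;
in float vViewDepth;
in vec4 vShadowCoordNear;
in vec4 vShadowCoordFar;

float shadowFactor()
{
    // Dynamic branch: vViewDepth differs per fragment. But depth varies
    // smoothly across the screen, so nearly every wavefront takes one
    // side unanimously and hardware of this class pays no divergence
    // penalty.
    if (vViewDepth < uCascadeSplit)
        return textureProj(uShadowNear, vShadowCoordNear);
    else
        return textureProj(uShadowFar, vShadowCoordFar);
}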
Mobile, ES 2.0
Welcome back to the wild west. Though unlike Pre-D3D10 desktop, this is mainly due to the huge variance of ES 2.0-caliber hardware. There's such a huge amount of stuff that can handle ES 2.0, and they all work very differently from each other.
Static branching will likely be optimized. But whether you get good performance from statically uniform branching is very hardware-dependent.
Mobile, ES 3.0+
Hardware here is rather more mature and capable than ES 2.0. As such, you can expect statically uniform branches to execute reasonably well. And some hardware can probably handle dynamic branches the way modern desktop hardware does.
It's highly dependent on the hardware and on the condition.
If your condition is a uniform: don't bother, let the compiler deal with it.
If your condition is something dynamic (like a value computed from an attribute, or fetched from a texture, or such), then it's more complicated.
For that latter case you'll pretty much have to test and benchmark, because it'll depend on the complexity of the code in each branch and how 'consistent' the branch decision is.
For instance, if one of the branches is taken in 99% of cases and discards the fragment, then most likely you want to keep the conditional (see the sketch after this answer). But OTOH, in your simple example above, if enable is some dynamic condition, the arithmetic select might be better.
Unless you have a clear-cut case like the above, or unless you're optimizing for one fixed, known architecture, you're probably better off letting the compiler figure it out for you.
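To make that 99%-discard case concrete, a rough sketch (uAlphaCutoff and expensiveLighting are hypothetical):

uniform sampler2D uTexture;
uniform float uAlphaCutoff;
in vec2 vTexCoord;
out vec4 fragColor;

vec3 expensiveLighting(vec3 albedo)
{
    return albedo; // stand-in for whatever heavy shading you actually do
}

void main()
{
    vec4 base = texture(uTexture, vTexCoord);

    // If the vast majority of fragments end here, a real branch wins:
    // the expensive path below is genuinely skipped, which an arithmetic
    // select could never do.
    if (base.a < uAlphaCutoff)
        discard;

    fragColor = vec4(expensiveLighting(base.rgb), base.a);
}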
I have a simple question about which I was unable to find solid facts: how does the GPU behave when three vertices produce the same varying output from the vertex shader?
Does the GPU notice that case, or does it try to interpolate even when it's not needed?
This might be interesting, as there are quite a few cases where you want a constant-ish varying available per triangle in the fragment shader. Please don't just guess; try to bring up references, or at least reasons why you think it's one way or the other.
The GPU does the interpolation, no matter if it's needed or not.
The reason is quite simple: checking whether a varying value has already been computed would be very expensive.
Shaders are small programs that are executed concurrently on different GPU cores. So if you wanted to avoid two different cores computing the same value, you would have to "reserve" the output variable. That requires an additional data structure (like a flag or mutex) that every core can read. In your case this would mean that three different cores have to read the same flag, and the first of them has to reserve it if it's not already reserved.
This has to happen atomically, meaning the reserving core has to be the only one setting the flag at that moment. To do this, all the other cores would e.g. have to be stopped for a tick. And since you don't know which cores are computing the vertex shader, you would have to stop ALL other cores (on a GTX Titan that would be 2687 others).
Additionally, when the variable is set and a new frame is rendered, all the flags would have to be reset, so the race for the flag can begin again.
To conclude: you would need additional hardware in your GPU, that is expensive and slows down the rendering pipeline.
It is the programmer's job to avoid multiple shaders producing the same output. So if you are doing your job right, this does not happen, or you know that avoiding it (on the CPU) would cost more than ignoring it.
An example would be the stitching between different levels of detail (like on a height map), where most methods create some fragments twice. This has a very small impact on rendering performance but would require a lot of CPU time to avoid.
If the behavior isn't mandated in the OpenGL specification, then the answer is that it's up to the implementation.
The comments and other answers are almost certainly spot on that there is no optimization path for identical values because there would be little to no benefit from the added complexity to make such a path.
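As an aside on the 'constant-ish varying per triangle' use case from the question: GLSL (1.30+ on desktop, ES 3.0+ on mobile) has the flat interpolation qualifier, which skips interpolation by construction and hands every fragment the provoking vertex's value, so there is no need to hope the GPU detects identical outputs. A minimal sketch:

// Vertex shader
#version 330 core
in vec3 aPosition;
in vec3 aColor;
flat out vec3 vFaceColor; // flat: no interpolation is performed
void main()
{
    vFaceColor = aColor;
    gl_Position = vec4(aPosition, 1.0);
}

// Fragment shader
#version 330 core
flat in vec3 vFaceColor;  // qualifier must match the vertex side
out vec4 fragColor;
void main()
{
    fragColor = vec4(vFaceColor, 1.0);
}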
Let's consider the following situation.
The scene contains these objects: A B C D E
Their order from the camera (nearest to farthest) is:
A E B D C
Objects A and C use shader 1, E and D use shader 2, and B uses shader 3.
Objects A and C use the same shader but different textures.
Now, how should this situation be handled?
Render everything front to back (5 shader swaps).
Render grouped by shader, with the groups sorted (3 shader swaps).
Merge all shader programs into one (1 swap; a rough sketch of this option follows below).
Do calls like glUniform, glBindTexture, etc., which change values in the program already in use, cause overhead?
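For reference, the merged "uber-shader" of option 3 usually selects behaviour with a uniform, which brings back the branching trade-offs discussed earlier on this page. A rough sketch (uMaterialId and uTexture are hypothetical names):

#version 330 core
uniform int uMaterialId; // statically uniform branch: cheap on most
                         // hardware, but all three materials' code now
                         // lives in one larger program
uniform sampler2D uTexture;
in vec2 vTexCoord;
out vec4 fragColor;

void main()
{
    vec4 base = texture(uTexture, vTexCoord);
    if (uMaterialId == 1)
        fragColor = base;                         // shader 1 behaviour
    else if (uMaterialId == 2)
        fragColor = vec4(base.rgb * 0.5, base.a); // shader 2 behaviour
    else
        fragColor = vec4(1.0) - base;             // shader 3 behaviour
}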
There is no one answer to this question. Does changing OpenGL state "cause overhead"? Of course it does; nothing is free. The question is whether the overhead caused by the state changes will be worse than the cost of less effective depth testing.
That cannot be answered, because the answer depends on how much overdraw there is, how costly your fragment shaders are, how many state changes a particular sequence of draw calls will require, and numerous other intangibles that cannot be known beforehand.
That's why profiling before optimization is important.
Profile, profile and even more profile :)
I would like to add one thing though:
In your situation you can use the idea of a rendering queue. It is a sort of manager for drawing objects: instead of drawing an object directly, you call renderQueue.add(myObject). Then, once you have add()ed all the needed objects, you call renderQueue.renderAll(). This method can handle all the sorting (by distance, by shader, by material, etc.), and that makes it easier to profile (and then change the way you render).
Of course this is only a rough idea.