I read a terrifying post recently where someone claimed that a switch statement in GLSL uses no conditional branching, and in fact causes every possible outcome to be run every time the switch is entered. This is worrying for me because I'm currently working on a deferred rendering engine which uses a few nested switch statements.
Does anyone know if there's any truth to this, and can it be backed up with any evidence?
Thanks!
I read a terrifying post recently where someone claimed that a switch statement in GLSL uses no conditional branching, and in fact causes every possible outcome to be run every time the switch is entered.
Neither of these is necessarily true. At least, not on today's hardware.
What happens is very dependent on the compiler and the underlying hardware architecture. So there is no one answer. But it is also very dependent on one other thing: what the condition actually is.
See, the reason why a compiler would execute both sides of a condition has to do with how GPUs work. GPUs gain their performance by grouping threads together and executing them in lock-step, with every thread in a group executing the exact same sequence of instructions. A conditional branch breaks that model: if different threads in a group need to take different paths, the hardware has to split the group up according to which threads execute which branch.
So instead, if the two branches are fairly short, the compiler will emit code that executes both of them and discards the values produced by the not-taken branch. That discarding doesn't require breaking up thread groups, thanks to specialized select/predication opcodes and such.
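As a sketch of the idea (cond, f and g here are hypothetical stand-ins for the condition and the two branch bodies, not real functions), the compiler effectively turns if (cond) x = f(); else x = g(); into:

vec3 a = f();      // both branch bodies are executed unconditionally...
vec3 b = g();
x = cond ? a : b;  // ...and a select op keeps only the value from the "taken" side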
Well, if the condition is based on an expression which is dynamically uniform (ie: an expression that evaluates to the same value everywhere within a draw call/context), then there is a good chance the compiler will not execute both sides, and will emit a true branch instead.
The reason being that, because the condition is dynamically uniform, all threads in the group will take the same path. So there is no need to break a group of threads up to branch properly.
So if you have a switch statement that is based on a uniform variable, or expressions only involving uniform and compile-time constant variables, then there is no reason to expect it to execute multiple branches simultaneously.
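For instance, a sketch like this (the uniform name and the cases are made up for illustration) gives the compiler a dynamically uniform switch, so every invocation in a group takes the same case:

uniform int lightMode;   // same value for every invocation in the draw call

vec3 shade(vec3 albedo)
{
    switch (lightMode)   // dynamically uniform: no divergence within a group
    {
        case 0:  return albedo;              // unlit
        case 1:  return albedo * 0.5;        // darkened
        default: return vec3(1.0, 0.0, 1.0); // debug magenta
    }
}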
It should also be noted that even if the expression is not dynamically uniform, the compiler will not always execute both branches. If the branches are too long or too different or whatever, it can choose to break up thread groups instead. This can lower performance, but potentially not as much as executing both branches. It's really up to the compiler to work out how to do it.
Related
In my code there's a regular pattern: if an if statement's condition is true, it tends to stay true for a while, and once it flips to false, it stays false for a while. Since performance matters in this code, I want to make the branch prediction more efficient.
What I've tried so far is writing two versions of this if statement, one annotated with "likely" and the other with "unlikely", and using a function pointer to remember which version to use. But since the indirect call through the function pointer stalls the pipeline as well, the benchmark shows no difference from a plain if statement. So I'm curious: is there any technique to let the CPU "remember" the last outcome of this if statement?
Or, do I really need to care about this?
If it stays the same for a while, the branch predictor will figure that out pretty quickly. That's why sorting an input sometimes makes code run significantly faster; the random unsorted data keeps changing the test result back and forth with no pattern the branch predictor can use, but with sorted data, it has long runs where the branch is always taken or always not taken, and that's the easiest case for branch predictors to handle.
Don't overthink this; let the branch predictor do its job. You don't need to care about it.
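As a rough demonstration of that effect (a minimal sketch; the threshold, array size, and pass count are arbitrary), time the loop below with and without the sort and compare:

#include <algorithm>
#include <cstdlib>
#include <vector>

int main()
{
    std::vector<int> data(1 << 20);
    for (int& x : data) x = std::rand() % 256;

    std::sort(data.begin(), data.end()); // remove this line to see the mispredicting case

    long long sum = 0;
    for (int pass = 0; pass < 100; ++pass)
        for (int x : data)
            if (x >= 128)   // sorted input gives long, predictable runs of this branch
                sum += x;

    return static_cast<int>(sum & 1); // keep sum live so the loops aren't optimized out
}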
I want to know if "if-statements" inside shaders (vertex / fragment / pixel...) are really slowing down the shader performance. For example:
Is it better to use this:
vec3 output;
output = input*enable + input2*(1-enable);
instead of using this:
vec3 output;
if(enable == 1)
{
output = input;
}
else
{
output = input2;
}
In another forum there was a discussion about this (2013): http://answers.unity3d.com/questions/442688/shader-if-else-performance.html
There, people are saying that if-statements are really bad for shader performance.
Here they also talk about how much code is inside the if/else bodies (2012):
https://www.opengl.org/discussion_boards/showthread.php/177762-Performance-alternative-for-if-(-)
Maybe the hardware or the shader compilers are better now and have somehow fixed this (possibly nonexistent) performance issue.
EDIT:
What about this case? Here, let's say enable is a uniform variable and it is always set to 0:
if(enable == 1) //never happens
{
output = vec4(0,0,0,0);
}
else //always happens
{
output = calcPhong(normal, lightDir);
}
I think here we have a branch inside the shader which slows the shader down. Is that correct?
Does it make more sense to make two different shaders, one for the else part and one for the if part?
What is it about shaders that makes if statements even potentially a performance problem? It has to do with how shaders get executed and where GPUs get their massive computing performance from.
Separate shader invocations are usually executed in parallel, executing the same instructions at the same time. They're simply executing them on different sets of input values; they share uniforms, but they have different internal registers. One term for a group of shaders all executing the same sequence of operations is "wavefront".
The potential problem with any form of conditional branching is that it can screw all that up. It causes different invocations within the wavefront to have to execute different sequences of code. That is a very expensive process, whereby a new wavefront has to be created, data copied over to it, etc.
Unless... it doesn't.
For example, if the condition is one that is taken by every invocation in the wavefront, then no runtime divergence is needed. As such, the cost of the if is just the cost of checking a condition.
So, let's say you have a conditional branch, and let's assume that all of the invocations in the wavefront will take the same branch. There are three possibilities for the nature of the expression in that condition:
Compile-time static. The conditional expression is entirely based off of compile-time constants. As such, you know from looking at the code which branches will be taken. Pretty much any compiler handles this as part of basic optimization.
Statically uniform branching. The condition is based off of expressions involving things which are known at compile-time to be constant (specifically, constants and uniform values). But the value of the expression will not be known at compile-time. So the compiler can statically be certain that wavefronts will never be broken by this if, but the compiler cannot know which branch will be taken.
Dynamic branching. The conditional expression contains terms other than constants and uniforms. Here, a compiler cannot tell a priori if a wavefront will be broken up or not. Whether that will need to happen depends on the runtime evaluation of the condition expression.
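In GLSL terms, the three cases look roughly like this (kDebug, uTime and vNormal are hypothetical names chosen for illustration):

const bool kDebug = false;  // compile-time constant
uniform float uTime;        // uniform: fixed for the whole draw call
in vec3 vNormal;            // per-fragment input

vec3 classify(vec3 c)
{
    if (kDebug)              // compile-time static: folded away by the compiler
        c = vec3(1.0, 0.0, 1.0);
    if (uTime > 5.0)         // statically uniform: every invocation agrees
        c *= 0.5;
    if (vNormal.z > 0.0)     // dynamic: invocations within a wavefront may disagree
        c += vec3(0.1);
    return c;
}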
Different hardware can handle different branching types without divergence.
Also, even if a condition could diverge within a wavefront, the compiler can restructure the code to not require actual branching. You gave a fine example: output = input*enable + input2*(1-enable); is functionally equivalent to the if statement. A compiler could detect that an if is only being used to select a value, and thus execute both sides and select the result. This is frequently done for cases of dynamic conditions where the bodies of the branches are small.
Pretty much all hardware can handle var = bool ? val1 : val2 without having to diverge. This was possible way back in 2002.
Since this is very hardware-dependent, it... depends on the hardware. There are however certain epochs of hardware that can be looked at:
Desktop, Pre-D3D10
There, it's kinda the wild west. NVIDIA's compiler for such hardware was notorious for detecting statically uniform conditions and actually recompiling your shader behind your back whenever you changed a uniform that affected such a condition.
In general, this era is where about 80% of the "never use if statements" advice comes from. But even here, it's not necessarily true.
You can expect optimization of static branching. You can hope that statically uniform branching won't cause any additional slowdown (though the fact that NVIDIA thought recompilation would be faster than just executing the branch suggests otherwise, at least for their hardware). But dynamic branching is going to cost you something, even if all of the invocations take the same branch.
Compilers of this era do their best to optimize shaders so that simple conditions can be executed simply. For example, your output = input*enable + input2*(1-enable); is something that a decent compiler could generate from your equivalent if statement.
Desktop, Post-D3D10
Hardware of this era is generally capable of handling statically uniform branches with little slowdown. For dynamic branching, you may or may not encounter slowdown.
Desktop, D3D11+
Hardware of this era is pretty much guaranteed to be able to handle dynamically uniform conditions with little performance impact. Indeed, it doesn't even have to be dynamically uniform; so long as all of the invocations within the same wavefront take the same path, you won't see any significant performance loss.
Note that some hardware from the previous epoch probably could do this as well. But this is the one where it's almost certain to be true.
Mobile, ES 2.0
Welcome back to the wild west. Though unlike Pre-D3D10 desktop, this is mainly due to the huge variance of ES 2.0-caliber hardware. There's such a huge amount of stuff that can handle ES 2.0, and they all work very differently from each other.
Static branching will likely be optimized. But whether you get good performance from statically uniform branching is very hardware-dependent.
Mobile, ES 3.0+
Hardware here is rather more mature and capable than ES 2.0. As such, you can expect statically uniform branches to execute reasonably well. And some hardware can probably handle dynamic branches the way modern desktop hardware does.
It's highly dependent on the hardware and on the condition.
If your condition is a uniform: don't bother, let the compiler deal with it.
If your condition is something dynamic (like a value computed from an attribute, fetched from a texture, or the like), then it's more complicated.
For that latter case you'll pretty much have to test and benchmark, because it depends on the complexity of the code in each branch and on how 'consistent' the branch decision is.
For instance, if one of the branches is taken 99% of the time and discards the fragment, then most likely you want to keep the conditional. But OTOH, in your simple example above, if enable is some dynamic condition, the arithmetic select might be better.
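Alpha testing is the classic instance of that first case; a sketch (the texture, variable names, and threshold are all made up for illustration):

uniform sampler2D uDiffuse;
in vec2 vUV;
out vec4 fragColor;

void main()
{
    vec4 color = texture(uDiffuse, vUV);
    if (color.a < 0.5)  // dynamic branch, but most fragments in a region agree
        discard;        // skips the rest of the shading for most fragments
    fragColor = color;  // imagine the expensive lighting math happening here
}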
Unless you have a clear-cut case like the above, or unless you're optimizing for one fixed known architecture, you're probably better off letting the compiler figure it out for you.
Does MSVC automatically optimize computation on dual core architecture?
void Func()
{
Computation1();
Computation2();
}
Given two computations with no relation to each other inside a function, does the Visual Studio compiler automatically parallelize them and allocate them to different cores?
Don't quote me on it but I doubt it. The OpenMP pragmas are the closest thing to what you're trying to do here, but even then you have to tell the compiler to use OpenMP and delineate the tasks.
Barring linking to libraries which are inherently multi-threaded, if you want to use both cores you have to set up threads and divide the work you want done intelligently.
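For what it's worth, delineating the tasks with OpenMP sections looks roughly like this (a sketch assuming Computation1 and Computation2 really are independent; compile with /openmp in MSVC):

void Func()
{
    #pragma omp parallel sections
    {
        #pragma omp section
        Computation1();   // may run on one core...

        #pragma omp section
        Computation2();   // ...while this runs on another
    }
}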
No. It is up to you to create threads (or fibers) and specify what code runs on each one. The function as defined will run sequentially. It may switch to another core during execution (thanks, Drew), but it will still be sequential. In order for two functions to run concurrently on two different cores, they must first be running in two separate threads.
As greyfade points out, the compiler is unable to detect whether it is possible. In fact, I suspect that this is in the class of NP-Complete problems. If I am wrong, I am sure one of the compiler gurus will let me know.
There's no reliable way for the compiler to detect that the two functions are completely independent and that they have no state. Therefore, there's no way for the compiler to know that it's safe to break them out into separate threads of execution. In fact, threads aren't even part of the C++ standard (until C++1x), and even when they will be, they won't be an intrinsic feature - you must use the feature explicitly to benefit from it.
If you want your two functions to run in independent threads, then create independent threads for them to execute in. Check out boost::thread (which is also available in the std::tr1 namespace if your compiler has it). It's easy to use and works perfectly for your use case.
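For reference, the same idea with std::thread, the standardized descendant of boost::thread (C++11 and later); a minimal sketch assuming the two functions share no mutable state:

#include <thread>

void Func()
{
    std::thread t1(Computation1);  // each computation gets its own thread
    std::thread t2(Computation2);
    t1.join();                     // wait for both to finish before returning
    t2.join();
}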
No. Madness would ensue if compilers did such a thing behind your back; what if Computation2 depended on side effects of Computation1?
If you're using VC10, look into the Concurrency Runtime (ConcRT or "concert") and its partner, the Parallel Patterns Library (PPL).
Similar solutions include OpenMP (kind of old and busted IMO, but widely supported) and Intel's Threading Building Blocks (TBB).
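With the PPL mentioned above, the two-task case is nearly a one-liner via parallel_invoke (a sketch; it lives in <ppl.h> and, again, assumes the two functions are truly independent):

#include <ppl.h>

void Func()
{
    // Runs both lambdas potentially in parallel and blocks until both finish.
    concurrency::parallel_invoke(
        [] { Computation1(); },
        [] { Computation2(); });
}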
The compiler can't tell if it's a good idea.
First, of course, the compiler must be able to prove that it would be a safe optimization: that the functions can safely be executed in parallel. In general, that's an NP-complete problem, but in many simple cases the compiler can figure it out (it already does a lot of dependency analysis).
Some bigger problems are:
it might turn out to be slower. Creating threads is a fairly expensive operation. The cost of that may just outweigh the gain from parallelizing the code.
it has to work well regardless of the number of CPU cores. The compiler doesn't know how many cores will be available when you run the program. So it'd have to insert some kind of optional forking code: if a core is available, follow this code path and branch out into a separate thread; otherwise follow this other code path. And again, more code and more conditionals have an effect on performance. Will the result still be worth it? Perhaps, but how is the compiler supposed to know that?
it might not be what the programmer expects. What if I already create precisely two CPU-heavy threads on a dual-core system? I expect them both to be running 99% of the time. Suddenly the compiler decides to create more threads under the hood, and suddenly I have three CPU-heavy threads, meaning that mine get less execution time than I'd expected.
How many times should it do this? If you run the code in a loop, should it spawn a new thread in every iteration? Sooner or later the added memory usage starts to hurt.
Overall, it's just not worth it. There are too many cases where it might backfire. Added to the fact that the compiler could only safely apply the optimization in fairly simple cases in the first place, it's just not worth the bother.