How to force GLSL branching? - opengl

I'm working on a fragment shader. It works, but it still needs some optimisation.
As far as I know, in most cases branches in GLSL are flattened, so both sides are executed. I've eliminated most of the if-else conditions, but some of them have to stay as they are, because both branches are expensive to execute. I know that in HLSL there is a [branch] attribute for this problem. But how is it possible to solve it in GLSL?
My code looks like this (the conditions are not uniform, their results depend on calculations in the shader):
if( condition ) {
    expensive calculations...
}
if( condition2 ) {
    expensive calculations...
}
if( condition3 ) {
    expensive calculations...
}
...
One "expensive calculation" can modify the variables, on which a condition will depend. It is possible, that more than one calculation is executed.
I know that there are older and mobile GPUs which do not support branching at all. In that case there is nothing to be done about this issue.

GLSL has no mechanism to enforce branching (or to enforce flattening a branch).

Related

Is there a compiler hint for GCC to force branch prediction to always go a certain way?

For the Intel architectures, is there a way to instruct the GCC compiler to generate code that always forces branch prediction a particular way in my code? Does the Intel hardware even support this? What about other compilers or hardware?
I would use this in C++ code where I know which case I want to run fast, and I do not care about the slowdown when the other branch needs to be taken, even if it has recently taken that branch.
for (;;) {
    if (normal) { // How to tell compiler to always branch predict true value?
        doSomethingNormal();
    } else {
        exceptionalCase();
    }
}
As a follow-on question for Evdzhan Mustafa: can the hint apply only the first time the processor encounters the instruction, with all subsequent branch prediction functioning normally?
GCC supports the function __builtin_expect(long exp, long c) to provide this kind of feature. You can check the documentation here.
Where exp is the condition used and c is the expected value. For example, in your case you would want
if (__builtin_expect(normal, 1))
Because of the awkward syntax this is usually used by defining two custom macros like
#define likely(x) __builtin_expect (!!(x), 1)
#define unlikely(x) __builtin_expect (!!(x), 0)
just to ease the task.
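For instance, a minimal usage sketch with the macros just defined (the function and data below are invented for illustration):

// The null check is the rare case, so it is marked unlikely; the compiler then
// lays out the non-null path as the fall-through, straight-line code.
int sum_positive(const int *data, int n)
{
    if (unlikely(data == nullptr))
        return 0;                    // cold path
    int total = 0;
    for (int i = 0; i < n; ++i)
        if (likely(data[i] >= 0))    // expected to hold most of the time
            total += data[i];
    return total;
}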
Mind that:
this is non-standard
the compiler/CPU branch predictor is likely more skilled than you at deciding such things, so this could be a premature micro-optimization
No, there is not. (At least on modern x86 processors.)
__builtin_expect mentioned in other answers influences the way gcc arranges the assembly code. It does not directly influence the CPU's branch predictor. Of course, there will be indirect effects on branch prediction caused by reordering the code. But on modern x86 processors there is no instruction that tells the CPU "assume this branch is/isn't taken".
See this question for more detail: Intel x86 0x2E/0x3E Prefix Branch Prediction actually used?
To be clear, __builtin_expect and/or the use of -fprofile-arcs can improve the performance of your code, both by giving hints to the branch predictor through code layout (see Performance optimisations of x86-64 assembly - Alignment and branch prediction), and also improving cache behaviour by keeping "unlikely" code away from "likely" code.
gcc has long __builtin_expect (long exp, long c) (emphasis mine):
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that exp == c. For example:
if (__builtin_expect (x, 0))
  foo ();
indicates that we do not expect to call foo, since we expect x to be zero. Since you are limited to integral expressions for exp, you should use constructions such as
if (__builtin_expect (ptr != NULL, 1))
  foo (*ptr);
when testing pointer or floating-point values.
As the documentation notes, you should prefer to use actual profile feedback, and this article shows a practical example of this and how, in their case at least, it ends up being an improvement over using __builtin_expect. Also see How to use profile guided optimizations in g++?.
We can also find a Linux Kernel Newbies article on the kernel macros likely() and unlikely(), which use this feature:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
Note the !! used in the macros; the explanation for this can be found in Why use !!(condition) instead of (condition)?.
Just because this technique is used in the Linux kernel does not mean it always makes sense to use it. We can see from this question I recently answered, difference between the function performance when passing parameter as compile time constant or variable, that many hand-rolled optimization techniques don't work in the general case. We need to profile code carefully to understand whether a technique is effective. Many old techniques may not even be relevant with modern compiler optimizations.
Note that although builtins are not portable, clang also supports __builtin_expect.
Also on some architectures it may not make a difference.
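Since the builtins are compiler extensions, a common pattern is to guard the macros so the code still builds elsewhere; a sketch:

#if defined(__GNUC__) || defined(__clang__)
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x) (x)     // fall back to a no-op on other compilers
#define unlikely(x) (x)
#endif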
The correct way to define likely/unlikely macros in C++11 is the following:
#define LIKELY(condition) __builtin_expect(static_cast<bool>(condition), 1)
#define UNLIKELY(condition) __builtin_expect(static_cast<bool>(condition), 0)
This method is compatible with all C++ versions, unlike [[likely]], but relies on the non-standard extension __builtin_expect.
When these macros are defined this way:
#define LIKELY(condition) __builtin_expect(!!(condition), 1)
they may change the meaning of if statements and break the code. Consider the following code:
#include <iostream>

struct A
{
    explicit operator bool() const { return true; }
    operator int() const { return 0; }
};

#define LIKELY(condition) __builtin_expect((condition), 1)

int main() {
    A a;
    if(a)
        std::cout << "if(a) is true\n";
    if(LIKELY(a))
        std::cout << "if(LIKELY(a)) is true\n";
    else
        std::cout << "if(LIKELY(a)) is false\n";
}
And its output:
if(a) is true
if(LIKELY(a)) is false
As you can see, the definition of LIKELY that uses !! as a cast to bool breaks the semantics of if.
The point here is not that operator int() and operator bool() should agree (keeping them consistent is good practice anyway).
Rather, using !!(x) instead of static_cast<bool>(x) loses the context for C++11 contextual conversions, so the explicit operator bool() is no longer the one selected.
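For comparison, a sketch of the same program using the static_cast<bool> definition recommended above. static_cast<bool> is a direct-initialization, so the explicit operator bool() is chosen and both branches agree:

#include <iostream>

struct A
{
    explicit operator bool() const { return true; }
    operator int() const { return 0; }
};

#define LIKELY(condition) __builtin_expect(static_cast<bool>(condition), 1)

int main() {
    A a;
    if(a)
        std::cout << "if(a) is true\n";
    if(LIKELY(a))
        std::cout << "if(LIKELY(a)) is true\n";   // now this branch is taken as well
    else
        std::cout << "if(LIKELY(a)) is false\n";
}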
As the other answers have all adequately suggested, you can use __builtin_expect to give the compiler a hint about how to arrange the assembly code. As the official docs point out, in most cases, the assembler built into your brain will not be as good as the one crafted by the GCC team. It's always best to use actual profile data to optimize your code, rather than guessing.
Along similar lines, but not yet mentioned, is a GCC-specific way to force the compiler to generate code on a "cold" path. This involves the use of the noinline and cold attributes, which do exactly what they sound like they do. These attributes can only be applied to functions, but with C++11 you can declare a lambda in place and apply the two attributes to it.
Although this still falls into the general category of a micro-optimization, and thus the standard advice applies (test, don't guess), I feel like it is more generally useful than __builtin_expect. Hardly any generation of the x86 processor uses branch prediction hints (reference), so the only thing you're going to be able to affect anyway is the order of the assembly code. Since you know what is error-handling or "edge case" code, you can use this annotation to ensure that the compiler won't ever predict a branch to it and will keep it away from the "hot" code when optimizing for size.
Sample usage:
void FooTheBar(void* pFoo)
{
    if (pFoo == nullptr)
    {
        // Oh no! A null pointer is an error, but maybe this is a public-facing
        // function, so we have to be prepared for anything. Yet, we don't want
        // the error-handling code to fill up the instruction cache, so we will
        // force it out-of-line and onto a "cold" path.
        [&]() __attribute__((noinline, cold)) {
            HandleError(...);
        }();
    }

    // Do normal stuff
    ⋮
}
Even better, GCC will automatically ignore this in favor of profile feedback when it is available (e.g., when compiling with -fprofile-use).
See the official documentation here: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
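Alternatively, assuming the error handling can be factored into its own helper (the names below are invented for the sketch), the same two attributes can be applied to an ordinary function instead of a lambda:

#include <cstdio>

// Declared noinline so the error path stays a real call, and cold so GCC
// optimizes it for size and places it away from the hot code.
static void ReportNullFoo() __attribute__((noinline, cold));

static void ReportNullFoo()
{
    std::fprintf(stderr, "FooTheBar: null pointer\n");
}

void FooTheBar(void* pFoo)
{
    if (pFoo == nullptr)
    {
        ReportNullFoo();   // rarely taken branch
        return;
    }
    // Do normal stuff
}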
As of C++20, the likely and unlikely attributes are standardized and are already supported in g++ 9. So, as discussed here, you can write
if (a > b) {
    /* code you expect to run often */
    [[likely]] /* last statement here */
}
e.g. in the following code the else block gets inlined thanks to the [[unlikely]] in the if block
int oftendone( int a, int b );
int rarelydone( int a, int b );
int finaltrafo( int );

int divides( int number, int prime ) {
    int almostreturnvalue;
    if ( ( number % prime ) == 0 ) {
        auto k = rarelydone( number, prime );
        auto l = rarelydone( number, k );
        [[unlikely]] almostreturnvalue = rarelydone( k, l );
    } else {
        auto a = oftendone( number, prime );
        almostreturnvalue = oftendone( a, a );
    }
    return finaltrafo( almostreturnvalue );
}
godbolt link comparing the presence/absence of the attribute
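The attribute can also be written directly after the condition so that it marks the whole branch; a minimal sketch:

int count_zeros(const int* data, int n)
{
    int zeros = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] == 0) [[unlikely]]   // the increment is laid out off the hot path
            ++zeros;
    return zeros;
}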
__builtin_expect can be used to tell the compiler which way you expect a branch to go. This can influence how the code is generated. Typical processors run code faster sequentially. So if you write
if (__builtin_expect (x == 0, 0)) ++count;
if (__builtin_expect (y == 0, 0)) ++count;
if (__builtin_expect (z == 0, 0)) ++count;
the compiler will generate code like
if (x == 0) goto if1;
back1: if (y == 0) goto if2;
back2: if (z == 0) goto if3;
back3: ;
...
if1: ++count; goto back1;
if2: ++count; goto back2;
if3: ++count; goto back3;
If your hint is correct, this will execute the code without any branches actually performed. It will run faster than the normal sequence, where each if statement would branch around the conditional code and would execute three branches.
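For contrast, without the hints the increments typically stay inline, so the common path (x, y and z all nonzero) has to jump around each one, roughly:

if (x != 0) goto skip1;
++count;
skip1: if (y != 0) goto skip2;
++count;
skip2: if (z != 0) goto skip3;
++count;
skip3: ;
...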
x86 has instruction prefixes to mark branches as expected taken or not taken, but modern processors generally ignore them. They are not very useful anyway, because branch prediction will handle this just fine. So I don't think you can actually influence the branch prediction.
With regard to the OP: no, there is no way in GCC to tell the processor to always assume the branch is or isn't taken. What you have is __builtin_expect, which does what others say it does. Furthermore, I think you don't want to tell the processor whether the branch is taken or not, always. Today's processors, such as those of the Intel architecture, can recognize fairly complex patterns and adapt effectively.
However, there are times when you want to assume control of whether a branch is predicted taken or not by default: when you know the code will be called "cold" with respect to branching statistics.
One concrete example: exception management code. By definition the management code will happen exceptionally, but perhaps when it occurs maximum performance is desired (there may be a critical error to take care of as soon as possible), hence you may want to control the default prediction.
Another example: you may classify your input and jump into the code that handles the result of your classification. If there are many classifications, the processor may collect statistics but lose them because the same classification does not happen again soon enough and the prediction resources are devoted to recently called code. I wish there were a primitive to tell the processor "please do not devote prediction resources to this code", the way you can sometimes say "do not cache this".

Is it good practice to construct long circuit statements?

Question context: [C++] I want to know what is theoretically the fastest, and what the compiler will do. I don't want to hear that premature optimization is the root of all evil, etc.
I was writing some code like this:
bool b0 = ...;
bool b1 = ...;
if (b0 && b1)
{
...
}
But then I was thinking: the code, as-is, will compile into two TEST instructions, if compiled without optimizations. This means two branches. So I was thinking that it might be better to write:
if (b0 & b1)
Which will produce only one TEST instruction, if no optimization is done by the compiler. But then I feel that this is against my code-style. I usually write && and ||.
Q: What will the compiler do if I turn on optimization flags (-O1, -O2, -O3, -Os and -Ofast)? Will the compiler automatically compile it like &, even if I have used a && in the code? And what is theoretically faster? Does the behavior change if I do this:
if (b0 && b1)
{ ... }
else if (b0)
{ ... }
else if (b1)
{ ... }
else
{ ... }
Q: As I could have guessed, this is very dependent on the situation, but is it a common trick for a compiler to replace a && with a &?
Q: What will the compiler do if I turn on optimization flags (-O1, -O2, -O3, -Os and -Ofast)?
Most likely nothing more that would increase the optimization here.
As stated in my comments, you really can't optimize the evaluation any further than:
AND B0 WITH B1 (sets condition flags)
JUMP ZERO TO ...
Although, if you have a lot of simple boolean logic or data operations, some processors may conditionally execute them.
Will the compiler automatically compile it like &, even if I have used a && in the code?
And what is theoretically faster?
In most platforms, there is no difference in evaluation of A & B versus A && B.
In the final evaluation, either a compare or an AND instruction is executed, then a jump based on the status. Two instructions.
Most processors don't have Boolean registers. It's all numbers and bits.
Optimize By Boolean Logic
Your best option is to review the design and set up your algorithms to use Boolean algebra. You can then simplify the Boolean expressions.
Another option is to implement the code so that the compiler can generate conditional assembly instructions, if the platform supports them.
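For example, with made-up predicates just to show the algebra, factoring out a common term removes one test:

// Before: b0 is tested in both clauses.
bool needs_redraw_before(bool b0, bool b1, bool b2)
{
    return (b0 && b1) || (b0 && b2);
}

// After: distributive law, same truth table, one fewer test of b0.
bool needs_redraw_after(bool b0, bool b1, bool b2)
{
    return b0 && (b1 || b2);
}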
Optimize: Reduce jumps
Processors favor arithmetic and data transfers over jumps.
Many processors are always feeding an instruction pipeline. When it comes to a conditional branch instruction, the processor has to wait (suspend the instruction prefetching) until the condition status is determined. Then it can determine where the next instruction will be fetched.
If you can't remove the jumps, such as in a loop, increase the ratio of data processing to jumping. Search for "Loop Unrolling"; many compilers will perform this when optimization levels are increased.
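A hand-unrolled sketch of that idea (it assumes the element count is a multiple of four; a real version needs a remainder loop):

void scale(float* data, int n, float factor)
{
    for (int i = 0; i < n; i += 4)   // one loop-closing branch per four elements of work
    {
        data[i]     *= factor;
        data[i + 1] *= factor;
        data[i + 2] *= factor;
        data[i + 3] *= factor;
    }
}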
Optimize: Data Cache
You may notice increased performance by organizing your data for best data cache usage.
For example, instead of 3 large arrays, use one array of a structure containing 3 elements. This allows the elements in use to be close to each other (and reduces the likelihood of accessing data outside of the cache).
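As a sketch of that layout change (the field names are invented for the example):

// Before: three parallel arrays; the i-th elements live far apart in memory.
// float xs[1024], ys[1024], zs[1024];

// After: one array of a structure; the three values used together for element i
// sit next to each other and tend to land in the same cache line.
struct Sample { float x, y, z; };
Sample samples[1024];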
Summary
The difference in evaluation of A && B versus A & B as conditional expressions is a micro-optimization. You will achieve improved performance by using Boolean algebra to reduce the quantity of conditional expressions. Jumps, or changes in execution path, slow down instruction execution. Fetching data outside of the data cache also slows down execution. You will most likely get better performance by redesigning your code, helping the compiler to reduce the branches, and making more effective use of the data cache.
If you care about what's fastest, why do you care what the compiler will do without optimisation?
Q: As I could have guessed, this is very dependent on the situation, but is it a common trick for a compiler to replace a && with a &?
This question seems to assume that the compiler transforms C++ code into more C++ code. It doesn't. It transforms your code into machine instructions (including the assembler as part of the compiler for argument's sake). You should not assume there is a one-to-one mapping from a C++ operator like && or & to a particular instruction.
With optimisation the compiler will do whatever it thinks will be faster. If a single instruction would be faster the compiler will generate a single instruction for if (b0 && b1), you don't need to bugger up your code with micro-optimisations to help it make such a simple transformation.
The compiler knows the instruction set it's using, it knows the context the condition is in and whether it can be removed entirely as dead code, or moved elsewhere to help the pipeline, or simplified by constant propagation, etc. etc.
And if you really care about what's fastest, why would you compute b1 until you know it's actually needed? If obtaining the value of b1 has no side effects the compiler could even transform your code to:
bool b0 = ...;
if (b0)
{
    bool b1 = ...;
    if (b1)
    {
        ...
    }
}
Does that mean two if conditions are faster than a &?! Of course not.
In other words, the whole premise of the question is flawed. Do not compromise the readability and simplicity of your code in the misguided pursuit of the "theoretically fastest" micro-optimisation. Spend your time improving the algorithms and data structures used, not trying to second-guess which instructions the compiler will generate.

GLSL equivalent of HLSL clip()?

The HLSL clip() function is described here.
I intend to use this for alpha cutoff, in OpenGL. Would the equivalent in GLSL simply be
if (gl_FragColor.a < cutoff)
{
discard;
}
Or is there some more efficient equivalent?
OpenGL has no such function. And it doesn't need one.
Or is there some more efficient equivalent?
The question assumes that this conditional statement is less efficient than calling HLSL's clip function. It's very possible that it's more efficient (though even then, it's a total micro-optimization). clip checks if the value is less than 0, and if it is, discards the fragment. But you're not testing against zero; you're testing against cutoff, which probably isn't 0. So, you must call clip like this (using GLSL-style): clip(gl_FragColor.a - cutoff)
If clip is not directly supported by the hardware, then your call is equivalent to if(gl_FragColor.a - cutoff < 0) discard;. That's a math operation and a conditional test. That's slower than just a conditional test. And if it's not... the driver will almost certainly rearrange your code to do the conditional test that way.
The only way the conditional would be slower than clip is if the hardware had specific support for clip and the driver were too stupid to turn if(gl_FragColor.a < cutoff) discard; into clip(gl_FragColor.a - cutoff). If the driver is that stupid, if it lacks that basic peephole optimization, then you've got bigger performance problems than this to deal with.
In short: don't worry about it.

GLSL break command

Currently I am learning how to create shaders in GLSL for a game engine I am working on, and I have a question regarding the language which puzzles me. I have learned that in shader versions lower than 3.0 you cannot use uniform variables in the condition of a loop. For example the following code would not work in shader versions older than 3.0.
for (int i = 0; i < uNumLights; i++)
{
...............
}
But isn't it possible to replace this with a loop with a fixed number of iterations, containing a conditional statement which would break the loop if i, in this case, reaches uNumLights? For example:
for (int i = 0; i < MAX_LIGHTS; i++)
{
    if(i >= uNumLights)
        break;
    ..............
}
Aren't these equivalent? Should the latter work in older versions of GLSL? And if so, isn't this more efficient and easier to implement than other techniques that I have read about, like using a different version of the shader for different numbers of lights?
I know this might be a silly question, but I am a beginner and I cannot find a reason why this shouldn't work.
GLSL can be confusing insofar as for() suggests to you that there must be conditional branching, even when there isn't any because the hardware is unable to do it at all (the same applies to if()).
What really happens on pre-SM3 hardware is that the HAL inside your OpenGL implementation will completely unroll your loop, so there is actually no jump any more. This also explains why it has difficulty doing so with non-constants.
While technically possible to do it with non-constants anyway, the implementation would have to recompile the shader every time you change that uniform, and it might run against the maximum instruction count if you're just allowed to supply any haphazard number.
That is a problem because... what then? That's a bad situation.
If you supply too big a constant, it will give you a "too many instructions" compiler error when you build the shader. Now, if you supply a silly number in a uniform, and the HAL thus has to produce new code and runs against this limit, what can OpenGL do?
You most probably validated your program after compiling and linking, and you most probably queried the shader info log, and OpenGL kept telling you that everything was fine. This is, in some way, a binding promise; it cannot just decide otherwise all of a sudden. Therefore, it must make sure that this situation cannot arise, and the only workable solution is to not allow uniforms in conditions on hardware generations that don't support dynamic branching.
Otherwise, there would need to be some form of validation inside glUniform that rejects bad values. However, since this depends on successful (or unsuccessful) shader recompilation, this would mean that it would have to run synchronously, which makes it a "no go" approach. Also, consider that GL_ARB_uniform_buffer_object is exposed on some SM2 hardware (for example GeForce FX), which means you could throw a buffer object with unpredictable content at OpenGL and still expect it to work somehow! The implementation would have to scan the buffer's memory for invalid values after you unmap it, which is insane.
Similar to a loop, an if() statement does not branch on SM2 hardware, even though it looks like it. Instead, it will calculate both branches and do a conditional move.
(I'm assuming you are talking about pixel shaders).
The second variant is going to work only on a GPU which supports shader model >= 3, because dynamic branching (such as putting the variable uNumLights into an if condition) is not supported on shader model < 3 either.
Here you can compare what is and isn't supported between different shader models.
There is a fun workaround I just figured out. It seems stupid and I can't promise you that it's a healthy choice, but it appears to work for me right now:
Set your for loop to the maximum you allow. Put a condition inside the loop to skip over the heavy routines, if the count goes beyond your uniform value.
uniform int iterations;

for(int i=0; i<10; i++){
    if(i<iterations){
        //do your thing...
    }
}

Is there a faster alternative to if-else in this case?

while(some_condition){
    if(FIRST)
    {
        do_this;
    }
    else
    {
        do_that;
    }
}
In my program the possibility of if(FIRST) succeeding is about 1 in 10000. Is there any alternative in C/C++ that lets us avoid checking the condition on every iteration inside the while loop, in the hope of seeing better performance in this case?
Ok! Let me put in some more detail.
I am writing code for a signal acquisition and tracking scheme where the state of my system will remain in TRACKING mode more often than in ACQUISITION mode.
while(signal_present)
{
    if(ACQUISITION_SUCCEEDED)
    {
        do_tracking(); // this function can change the state from TRACKING to ACQUISITION
    }
    else
    {
        do_acquisition(); // this function can change the state from ACQUISITION to TRACKING
    }
}
So what happens here is that the system usually remains in tracking mode, but it can enter acquisition mode when tracking fails, although that is not a common occurrence. (Assume the incoming data to be infinite in number.)
The performance cost of a single branch is not going to be a big deal. The only thing you really can do is put the most likely code first, to save on some instruction cache. Maybe. This is really deep into micro-optimization.
There is no particularly good reason to try to optimize this. Almost all modern architectures incorporate branch predictors. These speculate that a branch (an if or else) will be taken essentially the way it has been in the past. In your case, the speculation will always succeed, eliminating all overhead. There are non-portable ways to hint that a condition is taken one way or another, but any branch predictor will work just as well.
One thing you might want to do to improve instruction-cache locality is to move do_that out of the while loop (unless it is a function call).
GCC has a __builtin_expect "function" that you can use to indicate to the compiler which branch will likely be taken. You could use it like this:
if(__builtin_expect(FIRST, 1)) …
Is this useful? I have no idea. I have never used it, never seen it used (except allegedly in the Linux kernel). The GCC documentation actually discourages its usage in favour of using profiling information to achieve a more reliable metric.
On recent x86 processor systems, the final execution speed will barely depend on the source code implementation.
You can have a look at this page http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/ to see the amount of optimization that occurs inside the processor.
If this test is really consuming significant time compared to the implementation of do_acquisition, then you might get a boost by having a function table:
typedef void (*trackfunc)(void);
trackfunc tracking_action[] = {do_acquisition, do_tracking};

while (signal_present)
{
    tracking_action[ACQUISITION_STATE]();
}
The effects of these kinds of manual optimizations are very dependent on the platform, the compiler, and the optimization settings.
You will most likely get a much greater performance gain by spending your time measuring and tuning the do_acquisition and do_tracking algorithms.
If you don't know when "FIRST" will be true, then no.
The issue is whether FIRST is time consuming or not; maybe you could evaluate FIRST before the loop (or part of it) and just test the boolean.
I'd change moonshadow's code a little bit to
while( some_condition )
{
    do_that;
    if( FIRST )
    {
        do_this; // overwrite what you did earlier.
    }
}
Based on your new information, I'd say something like the following:
while(some_condition)
{
    while(ACQUISITION_SUCCEEDED)
    {
        do_tracking();
    }

    if (some_condition)
        while(!ACQUISITION_SUCCEEDED)
        {
            do_acquisition();
        }
}
The point is that the ACQUISITION_SUCCEEDED state must include the some_condition information to a certain extent (i.e. it will break out of the inner loops if some_condition is false - hence there is a chance to break out of the outer loop)
This is a classic in optimization. You should avoid putting conditionals within loops if you can. This code:
while(...)
{
    if( a )
    {
        foo();
    }
    else
    {
        bar();
    }
}
is often better to rewrite as:
if( a )
{
    while(...)
    {
        foo();
    }
}
else
{
    while(...)
    {
        bar();
    }
}
It's not always possible though, and whenever you try to optimize something you should always measure the performance before and after.
There is not much more useful optimizing you can do with your example.
The call/branch to do_this and do_that may negate any savings you earned by optimizing the if-then-else statement.
One of the rules of performance optimizing is to reduce branches. Most processors prefer to execute sequential code. They can take a chunk of sequential code and haul it into their caches. Branching interrupts this pleasantry and may cause a complete reload of the instruction cache (which loses valuable execution time).
Before you micro-optimize at this level, review your design to see if you can:
Eliminate unnecessary branching.
Split up code so it fits into the cache.
Organize the data to reduce fetches from memory or hard drive.
I'm sure that the above steps will gain you more performance than optimizing your posted loop.