Are recent GLSL compilers smart/well optimized?
In other words, if I go brainless and write stuff like the following, would recent compilers save my day and optimize away the unnecessary code, or should I be always careful with what I write?
// All of the values are constants
if (3.7 == 3.7) // Will the condition be executed or removed at build time?
x++;
// Will this whole block be entirely removed? (or should I use macros)
if (1 == 2)
x++;
for (i = 0; i < 0; ++i) // Remove this
x++;
for (i = 0; i < varA * varB; ++i) // Compute varA * varB once, outside the loop
x++;
vec3 v = vec3(0);
if (length(v) > 0) // Remove
x++;
float p = mix(varA, varB, 1); // p = varB
float p = 5;
p *= uniform * 0; // Just set a = 0 from the start
float p = 5;
p *= 1; // Remove that
There is a lot more things that I can't get out now, but you should have got the point.
Also, can recent compilers automatically detect less obvious optimizations, like the ones described there:
http://www.opengl.org/wiki/GLSL_Optimizations
Is there a known trade off between "compilation time" versus "time spent optimizing", set up by the implementors or the specification?
GLSL compilers are very conservative on floating point optimizations. You cannot be sure for any particular optimization. The rule of thumb is: optimize whatever you can and don't hope for any help from a GLSL compiler.
Read Low-Level Thinking in High-Level Shading Languages by Emil Persson for interesting details and case-studies.
P.S: This answer may sound pessimistic. However, GLSL compilers still do a lot of great job optimizing your code. Just don't count on it and do your best.
Related
In a function that updates all particles I have the following code:
for (int i = 0; i < _maxParticles; i++)
{
// check if active
if (_particles[i].lifeTime > 0.0f)
{
_particles[i].lifeTime -= _decayRate * deltaTime;
}
}
This decreases the lifetime of the particle based on the time that passed.
It gets calculated every loop, so if I've 10000 particles, that wouldn't be very efficient because it doesn't need to(it doesn't get changed anyways).
So I came up with this:
float lifeMin = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
// check if active
if (_particles[i].lifeTime > 0.0f)
{
_particles[i].lifeTime -= lifeMin;
}
}
This calculates it once and sets it to a variable that gets called every loop, so the CPU doesn't have to calculate it every loop, which would theoretically increase performance.
Would it run faster than the old code? Or does the release compiler do optimizations like this?
I wrote a program that compares both methods:
#include <time.h>
#include <iostream>
const unsigned int MAX = 1000000000;
int main()
{
float deltaTime = 20;
float decayRate = 200;
float foo = 2041.234f;
unsigned int start = clock();
for (unsigned int i = 0; i < MAX; i++)
{
foo -= decayRate * deltaTime;
}
std::cout << "Method 1 took " << clock() - start << "ms\n";
start = clock();
float calced = decayRate * deltaTime;
for (unsigned int i = 0; i < MAX; i++)
{
foo -= calced;
}
std::cout << "Method 2 took " << clock() - start << "ms\n";
int n;
std::cin >> n;
return 0;
}
Result in debug mode:
Method 1 took 2470ms
Method 2 took 2410ms
Result in release mode:
Method 1 took 0ms
Method 2 took 0ms
But that doesn't work. I know it doesn't do exactly the same, but it gives an idea.
In debug mode, they take roughly the same time. Sometimes Method 1 is faster than Method 2(especially at fewer numbers), sometimes Method 2 is faster.
In release mode, it takes 0 ms. A little weird.
I tried measuring it in the game itself, but there aren't enough particles to get a clear result.
EDIT
I tried to disable optimizations, and let the variables be user inputs using std::cin.
Here are the results:
Method 1 took 2430ms
Method 2 took 2410ms
It will almost certainly make no difference what so ever, at least if
you compile with optimization (and of course, if you're concerned with
performance, you are compiling with optimization). The opimization in
question is called loop invariant code motion, and is universally
implemented (and has been for about 40 years).
On the other hand, it may make sense to use the separate variable
anyway, to make the code clearer. This depends on the application, but
in many cases, giving a name to the results of an expression can make
code clearer. (In other cases, of course, throwing in a lot of extra
variables can make it less clear. It's all depends on the application.)
In any case, for such things, write the code as clearly as possible
first, and then, if (and only if) there is a performance problem,
profile to see where it is, and fix that.
EDIT:
Just to be perfectly clear: I'm talking about this sort of code optimization in general. In the exact case you show, since you don't use foo, the compiler will probably remove it (and the loops) completely.
In theory, yes. But your loop is extremely simple and thus likeley to be heavily optimized.
Try the -O0 option to disable all compiler optimizations.
The release runtime might be caused by the compiler statically computing the result.
I am pretty confident that any decent compiler will replace your loops with the following code:
foo -= MAX * decayRate * deltaTime;
and
foo -= MAX * calced ;
You can make the MAX size depending on some kind of input (e.g. command line parameter) to avoid that.
In the follow two code snippets, is there actually any different according to the speed of compiling or running?
for (int i = 0; i < 50; i++)
{
if (i % 3 == 0)
continue;
printf("Yay");
}
and
for (int i = 0; i < 50; i++)
{
if (i % 3 != 0)
printf("Yay");
}
Personally, in the situations where there is a lot more than a print statement, I've been using the first method as to reduce the amount of indentation for the containing code. Been wondering for a while so found it about time I ask whether it's actually having an effect other than visually.
Reply to Alf (i couldn't get code working in comments...)
More accurate to my usage is something along the lines of a "handleObjectMovement" function which would include
for each object
if object position is static
continue
deal with velocity and jazz
compared with
for each object
if object position is not static
deal with velocity and jazz
Hence me not using return. Essentially "if it's not relevant to this iteration, move on"
The behaviour is the same, so the runtime speed should be the same unless the compiler does something stupid (or unless you disable optimisation).
It's impossible to say whether there's a difference in compilation speed, since it depends on the details of how the compiler parses, analyses and translates the two variations.
If speed is important, measure it.
If you know which branch of the condition has higher probability you may use GCC likely/unlikely macro
How about getting rid of the check altogether?
for (int t = 0; t < 33; t++)
{
int i = t + (t >> 1) + 1;
printf("%d\n", i);
}
Recently i was working with an application that had code similar to:
for (auto x = 0; x < width - 1 - left; ++x)
{
// store / reset points
temp = hPoint = 0;
for(int channel = 0; channel < audioData.size(); channel++)
{
if (peakmode) /* fir rms of window size */
{
for (int z = 0; z < sizeFactor; z++)
{
temp += audioData[channel][x * sizeFactor + z + offset];
}
hPoint += temp / sizeFactor;
}
else /* highest sample in window */
{
for (int z = 0; z < sizeFactor; z++)
{
temp = audioData[channel][x * sizeFactor + z + offset];
if (std::fabs(temp) > std::fabs(hPoint))
hPoint = temp;
}
}
.. some other code
}
... some more code
}
This is inside a graphical render loop, called some 50-100 times / sec with buffers up to 192kHz in multiple channels. So it's a lot of data running through the innermost loops, and profiling showed this was a hotspot.
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries. It looked something like this:
if ((const float &&)(*((int *)&temp) & ~0x80000000) > (const float &&)(*((int *)&hPoint) & ~0x80000000))
hPoint = temp;
This gave a 12x reduction in render time, while still producing the same, valid output. Note that everything in the audiodata is sanitized beforehand to not include nans/infs/denormals, and only have a range of [-1, 1].
Are there any corner cases where this optimization will give wrong results - or, why is the standard library function not implemented like this? I presume it has to do with handling of non-normal values?
e: the layout of the floating point model is conforming to ieee, and sizeof(float) == sizeof(int) == 4
Well, you set the floating-point mode to IEEE conforming. Typically, with switches like --fast-math the compiler can ignore IEEE corner cases like NaN, INF and denormals. If the compiler also uses intrinsics, it can probably emit the same code.
BTW, if you're going to assume IEEE format, there's no need for the cast back to float prior to the comparison. The IEEE format is nifty: for all positive finite values, a<b if and only if reinterpret_cast<int_type>(a) < reinterpret_cast<int_type>(b)
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries.
No, you can't, because this violates the strict aliasing rule.
Are there any corner cases where this optimization will give wrong results
Technically, this code results in undefined behavior, so it always gives wrong "results". Not in the sense that the result of the absolute value will always be unexpected or incorrect, but in the sense that you can't possibly reason about what a program does if it has undefined behavior.
or, why is the standard library function not implemented like this?
Your suspicion is justified, handling denormals and other exceptional values is tricky, the stdlib function also needs to take those into account, and the other reason is still the undefined behavior.
One (non-)solution if you care about performance:
Instead of casting and pointers, you can use a union. Unfortunately, that only works in C, not in C++, though. That won't result in UB, but it's still not portable (although it will likely work with most, if not all, platforms with IEEE-754).
union {
float f;
unsigned u;
} pun = { .f = -3.14 };
pun.u &= ~0x80000000;
printf("abs(-pi) = %f\n", pun.f);
But, granted, this may or may not be faster than calling fabs(). Only one thing is sure: it won't be always correct.
You would expect fabs() to be implemented in hardware. There was an 8087 instruction for it in 1980 after all. You're not going to beat the hardware.
How the standard library function implements it is .... implementation dependent. So you may find different implementation of the standard library with different performance.
I imagine that you could have problems in platforms where int is not 32 bits. You 'd better use int32_t (cstdint>)
For my knowledge, was std::abs previously inlined ? Or the optimisation you observed is mainly due to suppression of the function call ?
Some observations on how refactoring may improve performance:
as mentioned, x * sizeFactor + offset can be factored out of the inner loops
peakmode is actually a switch changing the function's behaviour - make two functions rather than test the switch mid-loop. This has 2 benefits:
easier to maintain
fewer local variables and code paths to get in the way of optimisation.
The division of temp by sizeFactor can be deferred until outside the channel loop in the peakmode version.
abs(hPoint) can be pre-computed whenever hPoint is updated
if audioData is a vector of vectors you may get some performance benefit by taking a reference to audioData[channel] at the start of the body of the channel loop, reducing the array indexing within the z loop down to one dimension.
finally, apply whatever specific optimisations for the calculation of fabs you deem fit. Anything you do here will hurt portability so it's a last resort.
In VS2008, using the following to track the absolute value of hpoint and hIsNeg to remember whether it is positive or negative is about twice as fast as using fabs():
int hIsNeg=0 ;
...
//Inside loop, replacing
// if (std::fabs(temp) > std::fabs(hPoint))
// hPoint = temp;
if( temp < 0 )
{
if( -temp > hpoint )
{
hpoint = -temp ;
hIsNeg = 1 ;
}
}
else
{
if( temp > hpoint )
{
hpoint = temp ;
hIsNeg = 0 ;
}
}
...
//After loop
if( hIsNeg )
hpoint = -hpoint ;
Would there be any noticeable speed difference between these two snippets of code? Naively, I think the second snippet would be faster because branch instructions are encountered a lot less, but on the other hand the branch predictor should solve this problem. Or will it have a noticeable overhead despite the predictable pattern? Assume that no conditional move instruction is used.
Snippet 1:
for (int i = 0; i < 100; i++) {
if (a == 3)
output[i] = 1;
else
output[i] = 0;
}
Snippet 2:
if (a == 3) {
for (int i = 0; i < 100; i++)
output[i] = 1;
} else {
for (int i = 0; i < 100; i++)
output[i] = 0;
}
I'm not intending to optimise these cases myself, but I would like to know more about the overhead of branches even with a predictable pattern.
Since a remains unchanged once you enter into the loop, there shouldn't be much difference between the two code-snippet.
Personally, I would prefer the former, unless branch predictor fails to predict the branch which is really unlikely, given that a remains unchanged in the loop.
Moreover, the compiler may perform this optimization:
Loop unswitching
thereby making both code-snippets emit exactly same machine instructions.
You asked a performance question without specifying hardware (although from the question we can infer that it's one of the architectures that have branch prediction), toolchain, or compile options.
Overall, this is just another space vs speed tradeoff, where space often itself affects speed (CPU instruction and microcode caches).
The only reasonable answer is "Performance will vary depending on processor hardware and compiler optimizations."
I'd like to write a function that would have some optional code to be executed or not depending on user settings. The function is cpu-intensive and having ifs in it would be slow since the branch predictor is not that good.
My idea is making a copy in memory of the function and replace NOPs with a jump when I don't want to execute some code. My working example goes like this:
int Test()
{
int x = 2;
for (int i=0 ; i<10 ; i++)
{
x *= 2;
__asm {NOP}; // to skip it replace this
__asm {NOP}; // by JMP 2 (after the goto)
x *= 2; // Op to skip or not
x *= 2;
}
return x;
}
In my test's main, I copy this function into a newly allocated executable memory and replace the NOPs by a JMP 2 so that the following x *= 2 is not executed. JMP 2 is really "skip the next 2 bytes".
The problem is that I would have to change the JMP operand every time I edit the code to be skipped and change its size.
An alternative that would fix this problem would be:
__asm {NOP}; // to skip it replace this
__asm {NOP}; // by JMP 2 (after the goto)
goto dont_do_it;
x *= 2; // Op to skip or not
dont_do_it:
x *= 2;
I would then want to skip or not the goto, which has a fixed size. Unfortunately, in full optimization mode, the goto and the x*=2 are removed because they are unreachable at compilation time.
Hence the need to keep that dead code.
I'm using VStudio 2008.
You can cut the cost of the branch by up to 10, just by moving it out of the loop:
int Test()
{
int x = 2;
if (should_skip) {
for (int i=0 ; i<10 ; i++)
{
x *= 2;
x *= 2;
}
} else {
for (int i=0 ; i<10 ; i++)
{
x *= 2;
x *= 2;
x *= 2;
}
}
return x;
}
In this case, and others like it, that might also provoke the compiler into doing a better job of optimising the loop body, since it will consider the two possibilities separately rather than trying to optimise conditional code, and it won't optimise anything away as dead.
If this results in too much duplicated code to be maintainable, use a template that takes x by reference:
int x = 2;
if (should_skip) {
doLoop<true>(x);
} else {
doLoop<false>(x);
}
And check that the compiler inlines it.
Obviously this increases code size a bit, which will occasionally be a concern. Whichever way you do it though, if this change doesn't produce a measurable performance improvement then I'd guess that yours won't either.
If the number of permutations for the code is reasonable, you can define your code as C++ templates and generate all variants.
You do not specify what compiler and platform you are using, which will prevent most people from being able to help you. For example, on some platforms, the code section is not going to be writeable, so you won't be able to replace the NOPs with a JMP.
You are trying to pick-and-choose the optimizations offered to you by the compiler and second-guessing it. In general, it's a bad idea. Either write the whole inner loop block in assembly, which would prevent the compiler eliminating is as dead code, or put the damn if statement in there and let the compiler do its thing.
I'm also dubious that the branch prediction is bad enough where you will gain any sort of a net win from doing what you're proposing. Are you sure this isn't a case of premature optimization? Have you written the code in the most obvious way possible and only then determined that its performance isn't good enough? That would be my suggested start.
Here's an actual answer to the actual question!
volatile int y = 0;
int Test()
{
int x = 2;
for (int i=0 ; i<10 ; i++)
{
x *= 2;
__asm {NOP}; // to skip it replace this
__asm {NOP}; // by JMP 2 (after the goto)
goto dont_do_it;
keep_my_code:
x *= 2; // Op to skip or not
dont_do_it:
x *= 2;
}
if (y) goto keep_my_code;
return x;
}
Is this x64? You might be able to use function pointers and a conditional move to avoid the branch predictor. Load the address of the procedure based on the user settings; one of the procedures could be a dummy that does nothing. You should be able to do this without any inline ASM at all.
This may give insight:
#pragma optimize for Visual Studio.
That said, for this particular problem I would hand-code into ASM, using the VS asm output as a reference point.
At the meta level, I would have to be very certain this was the best design & algorithm for what I was doing before I started optimizing for the CPU pipe.
If you get this to work then I would profile it to make sure that it really is faster for you. On modern CPUs there is very little you can do that is slower than modifying code that is already in the cpu cache, or worse, the cpu pipeline. The cpu basically has to throw out all the work that is in the pipeline and start again.