My 9600GT hates me.
Fragment shader:
#version 130
uint aa[33] = uint[33](
0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,
0,0,0
);
void main() {
int i=0;
int a=26;
for (i=0; i<a; i++) aa[i]=aa[i+1];
gl_FragColor=vec4(1.0,0.0,0.0,1.0);
}
If a = 25, the program runs at 3000 fps.
If a = 26, the program runs at 20 fps.
If the size of aa is <= 32, the issue doesn't appear.
Viewport size is 1000x1000.
The problem occurs only when the size of aa is > 32.
The threshold value of a varies with the array accesses inside the loop (aa[i]=aa[i+1]+aa[i-1] gives a different threshold).
I know gl_FragColor is deprecated. But that's not the issue.
My guess is that GLSL doesn't automatically unroll the loop if a > 25 and the size of aa is > 32. Why? And why it depends on the size of the array remains unknown to mankind.
Quite similar behavior is explained here:
http://www.gamedev.net/topic/519511-glsl-for-loops/
Unrolling the loop manually does solve the issue (3000 fps), even if the size of aa is > 32:
aa[0]=aa[1];
aa[1]=aa[2];
aa[2]=aa[3];
aa[3]=aa[4];
aa[4]=aa[5];
aa[5]=aa[6];
aa[6]=aa[7];
aa[7]=aa[8];
aa[8]=aa[9];
aa[9]=aa[10];
aa[10]=aa[11];
aa[11]=aa[12];
aa[12]=aa[13];
aa[13]=aa[14];
aa[14]=aa[15];
aa[15]=aa[16];
aa[16]=aa[17];
aa[17]=aa[18];
aa[18]=aa[19];
aa[19]=aa[20];
aa[20]=aa[21];
aa[21]=aa[22];
aa[22]=aa[23];
aa[23]=aa[24];
aa[24]=aa[25];
aa[25]=aa[26];
aa[26]=aa[27];
aa[27]=aa[28];
aa[28]=aa[29];
aa[29]=aa[30];
aa[30]=aa[31];
aa[31]=aa[32];
I am just posting a summary of the comments here so this does not show up as unanswered anymore.
"#pragma optionNV (unroll all)"
fixes the immediate issue on NVIDIA.
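For reference, a minimal sketch of where the pragma goes in the shader from the question (the initializers are written as 0u here to keep the uint array strictly valid; the pragma is NVIDIA-specific and other implementations should simply ignore it):
#version 130
#pragma optionNV (unroll all)

uint aa[33] = uint[33](
0u,0u,0u,0u,0u,0u,0u,0u,0u,0u,
0u,0u,0u,0u,0u,0u,0u,0u,0u,0u,
0u,0u,0u,0u,0u,0u,0u,0u,0u,0u,
0u,0u,0u
);

void main() {
    int a = 26;
    for (int i = 0; i < a; i++) aa[i] = aa[i + 1];   // the loop the driver refused to unroll on its own
    gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);
}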
In general though, GLSL compilers are very implementation dependent. The reason why there is a drop-off at exactly 32 is easily explained by hitting a compiler heuristic like "don't unroll loops longer than 32". Also, the huge speed difference might come from an unrolled loop using constant indices while a dynamic loop requires addressable array memory. Another reason could be that, when unrolling, dead code elimination and constant folding kick in, reducing the entire loop to nothing.
The most portable way to fix this is really manual unrolling, or even better, manual constant folding. It is always questionable to compute constants in a fragment shader that could be computed outside it. Some drivers might catch it in some cases, but it is better not to rely on that.
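As a small sketch of what manual constant folding means here (the weights array and its use are made up for illustration, not taken from the shader above):
#version 130
// Folding done per fragment, hoping the compiler removes it:
const float weights[4] = float[4](0.1, 0.2, 0.3, 0.4);

void main() {
    float sum = 0.0;
    for (int i = 0; i < 4; i++)
        sum += weights[i];

    // Folded by hand instead (or computed on the CPU and passed in as a uniform):
    const float weightSum = 1.0;   // the precomputed value of 0.1 + 0.2 + 0.3 + 0.4

    gl_FragColor = vec4(vec3(weightSum), 1.0);   // same result as using sum
}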
In my GLSL code, I have large functions being called multiple times.
As an example, inside my Perlin noise function I run a hash function many times.
float perlinNoise (...) {
...
float x0y0z0 = pcgUnit3(seedminX,seedminY,seedminZ);
float x0y0z1 = pcgUnit3(seedminX,seedminY,seedmaxZ);
float x0y1z0 = pcgUnit3(seedminX,seedmaxY,seedminZ);
float x0y1z1 = pcgUnit3(seedminX,seedmaxY,seedmaxZ);
float x1y0z0 = pcgUnit3(seedmaxX,seedminY,seedminZ);
float x1y0z1 = pcgUnit3(seedmaxX,seedminY,seedmaxZ);
float x1y1z0 = pcgUnit3(seedmaxX,seedmaxY,seedminZ);
float x1y1z1 = pcgUnit3(seedmaxX,seedmaxY,seedmaxZ);
...
}
In my procedural generation algorithm, I call the Perlin noise function multiple times.
I notice that every time I add a call to the Perlin noise function, compiling the shaders takes longer. I think the problem is that GLSL inlines the function calls, and the shader code thereby becomes very large.
Am I correct about the inlining, and if so, how do I prevent it from happening?
Page 13 of the GLSL_ES_Specification_1.00 describes how to set pragma directives:
#pragma optimize(off)
#pragma debug(on)
That would be the closest thing to an "inlining" switch. As for "Am I correct about the inlining ...": to my knowledge, no. I am also not quite sure what is happening in the background. On Windows the shader gets translated through ANGLE, so having a look at that source could help.
In general: try to lower the precision; consider that three digits (1.000) could be "nice" enough, since floats only store about 6-7 significant digits anyway. Avoid branching. That should be the first aid kit.
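For example, a minimal GLSL ES fragment shader sketch of lowering precision with qualifiers (whether mediump/lowp are acceptable depends on your content and target device, so treat the choices below as assumptions; vUv and uTint are made-up names):
precision mediump float;        // default float precision for this shader

varying vec2 vUv;               // inherits mediump from the default above
uniform lowp vec4 uTint;        // colors in [0,1] survive lowp fine

void main() {
    lowp vec4 base = vec4(vUv, 0.0, 1.0);
    gl_FragColor = base * uTint;
}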
Some debugging tools out in the wild (please do your own research):
http://glsl-debugger.github.io/
http://shader-playground.timjones.io/
Optimizer
https://zz85.github.io/glsl-optimizer/
I've written my first couple of GLSL programs for Processing (a visual language similar to Java that can load shaders) recently that make fractals. In the loop that handles the fractal code, I have an escape conditional that breaks if a point would tend to infinity.
It works fine and is similar to how I would generally write the code outside GLSL. However, someone told me that both paths are calculated every time a conditional is executed. I've had a hard time finding exactly how much of a penalty is caused by conditionals in GLSL.
Edit: To the best of my understanding, in non-GLSL code, when an if is encountered a path is assumed. If the "correct" path was assumed, everything is great. If the "wrong" path was assumed, the "bad" work is discarded and instructions continue along the "correct" path. The penalty might be, say, 3 (or whatever number) instructions. I want to know if there is some fixed number (3 or whatever) of penalty instructions, or if both paths are calculated all the way through.
Here is the code if the explanation is not clear enough:
// Mandelbrot Set code
int i = 0;
float zr = x;
float zi = y;
for (; i < maxIterations; i++) {
float sqZr = zr*zr;
float sqZi = zi*zi;
float twoZri = 2.0*zr*zi;
zr = sqZr-sqZi+x;
zi = twoZri+y;
if (sqZr+sqZi > 16.0) break;
}
On old GPUs, both sides of an if() clause were executed and the correct result chosen at the end. On newer ones, this is only the case if the compiler thinks it would be more efficient. if() clauses are not free: the generic rule of thumb I have used for some time is "if() costs 14 clock cycles", though the latest GPUs may be cheaper.
Why is this so? Because GPUs are stream processors, they want to have identical data-loading profiles for all pixels (especially for gradient values like texture colors or values from vertex registers). The principle of SIMD -- even when the devices are not strictly SIMD -- is usually the way to get the most performance from such devices.
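To make that concrete, here is a hand-written sketch of the question's loop with the divergent break replaced by a conditional select (x, y and maxIterations are from the question; escaped and count are names invented for this sketch; whether this is actually faster than the break depends on the GPU, so measure both):
float zr = x;
float zi = y;
float escaped = 0.0;                 // latches to 1.0 once the point has escaped
float count   = 0.0;                 // iterations spent before escaping
for (int i = 0; i < maxIterations; i++) {
    float sqZr   = zr * zr;
    float sqZi   = zi * zi;
    float twoZri = 2.0 * zr * zi;
    zr = sqZr - sqZi + x;
    zi = twoZri + y;
    escaped = max(escaped, step(16.0, sqZr + sqZi));   // step() plays the role of the "if"
    count  += 1.0 - escaped;                           // every invocation runs all iterations
}
// count now holds the iteration at which the point escaped (or maxIterations).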
When in doubt, see if you can use one of the NVIDIA perf analysis tools on your code, or just try writing the code (it's short!) a few different ways and comparing your performance for your specific GPU.
(BTW Processing is not Java-like: it's Java)
I'm looking at the source of an OpenGL application that uses shaders. One particular shader looks like this:
uniform float someConstantValue;
void main()
{
// Use someConstantValue
}
The uniform is set once from code and never changes throughout the application run-time.
In what cases would I want to declare someConstantValue as a uniform and not as const float?
Edit:
Just to clarify, the constant value is a physical constant.
Huge reason:
Error: Loop index cannot be compared with non-constant expression.
If I use:
uniform float myfloat;
...
for (float i = 0.0; i < myfloat; i++)
I get an error because myfloat isn't a constant expression.
However this is perfectly valid:
const float myfloat = 10.0;
...
for (float i = 0.0; i < myfloat; i++)
Why?
When GLSL (and HLSL for that matter) is compiled to GPU assembly instructions, loops are unrolled in a very verbose (yet optimized, using jumps, etc.) way. That means the myfloat value is used at compile time to unroll the loop; if that value is a uniform (i.e. it can change each render call) then the loop cannot be unrolled until run time (and GPUs don't do that kind of just-in-time compilation, at least not in WebGL).
First off, the performance difference between using a uniform or a constant is probably negligible. Secondly, just because a value is always constant in nature doesn't mean that you will always want it to be constant in your program. Programmers will often tweak physical values to produce the best-looking result, even when that doesn't match reality. For instance, the acceleration due to gravity is often increased in certain types of games to make them more fast paced.
If you don't want to have to set the uniform in your code you could provide a default value in GLSL:
uniform float someConstantValue = 12.5;
That said, there is no reason not to use const for something like pi, where there would be little value in modifying it.
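For instance, a value nobody will ever want to tweak can simply be declared in the shader itself:
const float PI = 3.14159265358979;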
I can think of two reasons:
The developer reuses a library of shaders in several applications. So instead of customizing each shader for every app, the developer tries to keep them general.
The developer anticipates this variable will later be a user-controlled setting. So declaring it as uniform is preparation for that upcoming feature.
If I were the developer and none of the above applied, then I would declare it as const instead, because it can give a performance benefit and I wouldn't have to set the uniform from my code.
Currently I am learning how to create shaders in GLSL for a game engine I am working on, and I have a question regarding the language which puzzles me. I have learned that on hardware below Shader Model 3.0 you cannot use uniform variables in the condition of a loop. For example, the following code would not work on shader models older than 3.0.
for (int i = 0; i < uNumLights; i++)
{
...............
}
But isn't it possible to replace this with a loop with a fixed number of iterations, containing a conditional statement that breaks out of the loop if i, in this case, reaches uNumLights? For example:
for (int i = 0; i < MAX_LIGHTS; i++)
{
if(i >= uNumLights)
break;
..............
}
Aren't these equivalent? Should the latter work in older versions of GLSL? And if so, isn't this more efficient and easier to implement than other techniques I have read about, like using a different version of the shader for different numbers of lights?
I know this might be a silly question, but I am a beginner and I cannot find a reason why this shouldn't work.
GLSL can be confusing insofar as for() suggests to you that there must be conditional branching, even when there isn't because the hardware is unable to do it at all (which applies to if() in the same way).
What really happens on pre-SM3 hardware is that the HAL inside your OpenGL implementation will completely unroll your loop, so there is actually no jump any more. That also explains why it has difficulty doing so with non-constants.
While it is technically possible to do it with non-constants anyway, the implementation would have to recompile the shader every time you change that uniform, and it might run against the maximum instruction count if you were just allowed to supply any haphazard number.
That is a problem because... what then? That's a bad situation.
If you supply a constant that is too big, you get a "too many instructions" compiler error when you build the shader. Now, if you supply a silly number in a uniform, and the HAL thus has to produce new code and runs against this limit, what can OpenGL do?
You most probably validated your program after compiling and linking, and you most probably queried the shader info log, and OpenGL kept telling you that everything was fine. This is, in some way, a binding promise; it cannot just decide otherwise all of a sudden. Therefore, it must make sure that this situation cannot arise, and the only workable solution is to not allow uniforms in conditions on hardware generations that don't support dynamic branching.
Otherwise, there would need to be some form of validation inside glUniform that rejects bad values. However, since this depends on successful (or unsuccessful) shader recompilation, this would mean that it would have to run synchronously, which makes it a "no go" approach. Also, consider that GL_ARB_uniform_buffer_object is exposed on some SM2 hardware (for example GeForce FX), which means you could throw a buffer object with unpredictable content at OpenGL and still expect it to work somehow! The implementation would have to scan the buffer's memory for invalid values after you unmap it, which is insane.
Similar to a loop, an if() statement does not branch on SM2 hardware, even though it looks like it. Instead, it will calculate both branches and do a conditional move.
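As an illustration of that "calculate both and select", here is a hand-written sketch of the transformation (not what any particular compiler emits; radius, lightColor, ambientColor and the distance computation are made up for the example):
uniform float radius;
uniform vec3  lightColor;
uniform vec3  ambientColor;

void main() {
    float dist = length(gl_FragCoord.xy - vec2(400.0, 300.0));

    // What you write:
    vec3 branched;
    if (dist < radius) branched = lightColor;
    else               branched = ambientColor;

    // What SM2-class hardware effectively executes: both candidates are
    // computed, then one is picked with a conditional move.
    float inside   = step(dist, radius);                  // 1.0 when dist <= radius
    vec3  selected = mix(ambientColor, lightColor, inside);

    gl_FragColor = vec4(selected, 1.0);                    // branched and selected are identical
}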
(I'm assuming you are talking about pixel shaders).
The second variant is going to work only on a GPU that supports Shader Model >= 3, because dynamic branching (such as putting the variable uNumLights into an if condition) is not supported on GPUs below Shader Model 3 either.
Here you can compare what is and isn't supported between different shader models.
There is a fun work-around I just figured out. It seems stupid and I can't promise that it's a healthy choice, but it appears to work for me right now:
Set your for loop to the maximum you allow. Put a condition inside the loop to skip over the heavy routines if the count goes beyond your uniform value.
uniform int iterations;
for (int i = 0; i < 10; i++) {
    if (i < iterations) {
        //do your thing...
    }
}
How can I optimize this line drawing routine? Would memcpy make it faster?
void ScreenDriver::HorizontalLine(int wXStart, int wXEnd, int wYPos,
                                  COLORVAL Color, int wWidth)
{
    int iLen = wXEnd - wXStart + 1;
    if (iLen <= 0)
    {
        return;
    }
    while (wWidth-- > 0)
    {
        COLORVAL *Put = mpScanPointers[wYPos] + wXStart;
        int iLen1 = iLen;
        while (iLen1--)
        {
            *Put++ = Color;
        }
        wYPos++;
    }
}
I think you mean to say "memset" instead of "memcpy". Replacing this bit of the code:
while (iLen--)
{
*Put++ = Color;
}
with
memset(Put, Color, iLen);
could be faster but so much depends on your target CPU, memory architecture and the typical values of iLen encountered. It's not likely to be a big win, but if you've got the time I encourage you to measure the alternatives as that kind of exercise is the only way to really understand optimization.
Of course, this memset() use will only work if COLORVAL is character sized.
No, not really. memcpy copies memory; that's a read and a write, and you don't need the read. memset only writes, but it writes bytes, so that isn't going to work either unless COLORVAL is also a byte. No, leave it as is; the compiler should produce a fairly good bit of code. Don't forget that you are probably limited by memory bandwidth.
Your best bet before doing anything else is to employ whatever low-level profiling tools you have available. At the very least, get an overall timing for a hefty test case or three. Without a baseline measurement you're shooting in the dark. (I should know, I'm as guilty of this as anyone else!)
That said, I note that your code looks like it has a fair bit of overhead per pixel:
A memset() call could be a win (if sizeof(COLORVAL) == sizeof(char)).
Alternately, unrolling the loop may help; this is heavily dependent on your input data, machine architecture, etc.
If your iLen value is reasonably bounded you might consider writing a custom function for each iLen value that is fully unrolled (inline the first few smallish cases in a switch) and call the bigger cases through an array of function pointers.
The fastest option of course is usually to resort to assembly.
I've found through personal experience that memcpy is slightly faster than direct pointer access... but only slightly, it isn't usually a ground-breaking optimization.
One of the fastest ways to draw a horizontal line, i.e. fill an array with a value, in assembly is to use the stosb, stosw, or stosd instructions. memset is typically optimized to use rep stosb. To use dword values we can write code like the one below to draw a line:
__asm {
    cld                     ; clear the direction flag so edi increments
    mov eax, color          ; 32-bit color value to store
    mov ecx, screen_width   ; number of dwords to write
    mov edi, video_buffer   ; destination pointer
    rep stosd               ; store eax into ecx consecutive dwords at [edi]
}
But I'm quite sure that your inner while loop will be optimized by the compiler to use stosd anyway.
You could try unrolling the inner loop, but really it's only going to matter for lines close to horizontal.
For lines that are not close to horizontal it could be you spend more time setting up the table of scan pointers.
Frankly, for more realistic situations, where you have not only colors but widths, line styles and end styles, not to mention drawing modes like XOR and aliasing, the way I've seen it done is:
each "line" is really a polygon-fill, for which there are pretty fast algorithms (which is actually what your algorithm is), and/or
a special-purpose machine-language routine is generated on-the-fly (stored on the stack) because there are too many options to have option-specific special routines, and you don't want the algorithm continually questioning pixel-by-pixel what the options are.