Is there a maximum number of assembly language instructions that can be loaded into the fragment program unit?
I have an algorithm to port from CPU to GPU, and apparently it doesn't fit on the GPU.
There are several hard and soft limits, some of which are not immediately obvious:
Instruction slots: The total number of instructions that the hardware can accommodate in local memory.
Executed instructions: The maximum number of instructions that will actually execute (including instructions that run several times in a loop).
A single GLSL instruction can map to a dozen or more hardware instructions.
Several GLSL instructions can map to a single hardware instruction, depending on the optimizer's quality (e.g. multiply-add, dot product, lerp); see the snippet after this list.
Limited temporary registers (only 32) may require more instructions than strictly necessary on pre-SM4 hardware (no such problem on SM4, which has 4096).
Swizzling usually costs no extra instructions nowadays, but does on some older hardware, and may in some situations on some hardware (writes to gl_FragColor are a typical candidate).
Regardless of actual instructions, OpenGL 2.0 compatible hardware is limited to 8 dependent texture fetches (unlimited on hardware that can do OpenGL 2.1 or better)
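To make the mapping points above concrete, here is an illustrative GLSL fragment, embedded as a C++ string; the per-line counts in the comments are plausible outcomes, not guarantees, since actual code generation is entirely up to the driver's compiler:

    // Illustrative only: the instruction counts below are typical, not guaranteed.
    const char* mappingExample = R"(
        float f = a * b + c;     // often fuses into a single MAD instruction
        float d = dot(u, v);     // a single DP3/DP4 on most hardware
        vec3  n = normalize(w);  // may expand to DP3 + RSQ + MUL (3+ instructions)
        float p = pow(x, y);     // can expand to LG2 + MUL + EX2, or more
    )";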
You have these guaranteed minimums (most cards have more; the actual limits can be queried, as sketched after this list):
512 instruction slots for vertex and pixel shaders on OpenGL 2.x (SM3) capable hardware
65536 executed instructions
4096 vertex and 65536 pixel shader instruction slots on OpenGL 3.x (SM4) hardware
65536 executed vertex shader instructions, unlimited pixel shader instructions
At least 24 dynamic branches possible on 2.x (SM3) hardware
Fully dynamic branching (no limits) on SM4 hardware
Only conditional moves are available on SM2.x; everything else must be accommodated by code duplication and loop unrolling, or must fail.
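Most of these limits can be queried at runtime. A minimal sketch, assuming the ARB_fragment_program assembly interface from the question and GLEW as the extension loader:

    #include <GL/glew.h>
    #include <cstdio>

    // Query the fragment-program limits of the current context (requires
    // that GL_ARB_fragment_program is supported).
    void printFragmentProgramLimits() {
        GLint slots = 0, native = 0, indirections = 0;
        glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB,
                          GL_MAX_PROGRAM_INSTRUCTIONS_ARB, &slots);
        glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB,
                          GL_MAX_PROGRAM_NATIVE_INSTRUCTIONS_ARB, &native);
        glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB,
                          GL_MAX_PROGRAM_TEX_INDIRECTIONS_ARB, &indirections);
        printf("instruction slots: %d (native: %d), texture indirections: %d\n",
               slots, native, indirections);
    }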
There is a limit on the maximum number of instructions a shader can have. As far as I know, it varies from GPU to GPU. If your shader is too large, compilation will fail with an error.
When writing fragment shaders in OpenGL, one can branch either on compile-time constants, on uniform variables or on varying variables.
How performant that branching is depends on the hardware and driver implementation, but generally, branching on a compile-time constant is free, and branching on a uniform is faster than branching on a varying.
In the case of a varying, the rasterizer still has to interpolate the variable for each fragment, and the branch has to be decided on each fragment shader invocation, even if the value of the varying is the same for every fragment in the current primitive.
What I wonder is whether any graphics api or extension allows some fragment shader branching that is executed only once per rasterized primitive (or in the case of tiled rendering once per primitive per bin)?
Dynamic branching is only expensive when it causes divergence of instances executing at the same time. The cost of interpolating a "varying" is trivial.
Furthermore, different GPUs handle primitive rasterization differently. Some GPUs ensure that wavefronts for fragment shaders only contain instances that are executing on the same primitive. On these GPUs, branching based on values that don't change per-primitive will be fast.
However, other GPUs will pack instances from different primitives into the same wavefronts. On these GPUs, divergence will happen if the value differs between primitives. How much divergence? That depends on how many fragments each primitive produces: if many of your primitives are small in rasterized space, you'll get a lot more divergence than if you have a lot of large primitives.
GPUs that pack instances from different primitives into a wavefront are trying to maximize how much their cores get utilized. It's a tradeoff: you're minimizing the overall number of wavefronts you have to execute, but a particular cause of divergence (data that is constant within a primitive but not between them) will be penalized.
In any case, try to avoid divergence when you can. But if your algorithm requires it, then your algorithm requires it, and the performance you get is the performance you get. The best you can do is let the GPU know that the "varying" will be constant per-primitive by using flat interpolation, as in the sketch below.
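For illustration, a minimal sketch of such a per-primitive-constant input, with the GLSL embedded as a C++ string (the input names are made up):

    // 'flat' tells the GPU the value is constant across the primitive, so a
    // branch on it cannot diverge within one primitive.
    const char* fragmentShaderSrc = R"(
        #version 330 core
        flat in int v_materialId;  // not interpolated; constant per primitive
        in vec2 v_uv;              // interpolated per fragment as usual
        out vec4 fragColor;
        void main() {
            if (v_materialId == 0)
                fragColor = vec4(v_uv, 0.0, 1.0);
            else
                fragColor = vec4(1.0);
        }
    )";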
I have an application which has many different shaders, with both an OpenGL and a DirectX 11 implementation. Other than language differences (between GLSL and HLSL), the shaders are identical on both platforms. In DirectX, the shaders are compiled offline using fxc (with the /Fh option) and compiled into the executable. In OpenGL, the text of the shader is embedded into the executable and compiled at runtime. It is then saved as a binary with glGetProgramBinary (if the platform is capable, which it is in the test case I have), and reloaded on subsequent runs with glProgramBinary.
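For reference, that save/reload pattern looks roughly like this (a simplified sketch assuming GL 4.1 or ARB_get_program_binary, with file I/O and error handling omitted):

    #include <GL/glew.h>
    #include <vector>

    // Serialize a linked program; the blob and format are what get persisted.
    std::vector<char> saveBinary(GLuint program, GLenum* formatOut) {
        GLint length = 0;
        glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);
        std::vector<char> blob(length);
        glGetProgramBinary(program, length, nullptr, formatOut, blob.data());
        return blob;
    }

    // Reload on a later run; the driver may reject the blob (e.g. after a
    // driver update), so the link status must be checked.
    bool loadBinary(GLuint program, GLenum format, const std::vector<char>& blob) {
        glProgramBinary(program, format, blob.data(), (GLsizei)blob.size());
        GLint ok = GL_FALSE;
        glGetProgramiv(program, GL_LINK_STATUS, &ok);
        return ok == GL_TRUE;
    }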
I have noticed that in OpenGL, the total memory reported as used by my application (reported in procexp) grows by an average of 49kb per call to glProgramBinary. The increase does not seem directly related to the size of the program's binary: some programs increase memory by roughly the size of their binary, whereas others raise it by many times that size. Also, the binary sizes seem quite large, being ~12kb even for the simplest programs.
Conversely, in DirectX 11, the equivalent calls to ID3D11Device::CreateVertexShader and ID3D11Device::CreatePixelShader do increase the reported memory usage, but only by an average of 6kb per pair (again, the increase seems not directly proportional to the size of the binary). Also, the shader binary blobs compiled offline seem to be drastically smaller, with the smallest being less than 2kb.
Is there some way to reduce the amount of memory associated with shaders in OpenGL, such that it's more in line with the memory usage in DirectX? Or is this a reporting problem (DirectX is actually using the memory, but it doesn't show up in the process's count)?
(Platform: NVIDIA driver 373.34, on Windows 10.)
As I understand it, we store textures in GPU memory with the glTexImage2D function, and theoretically there is no limit on their number. However, when we try to use them, there is a limitation of 32 texture units, GL_TEXTURE0 to GL_TEXTURE31. Why does this limitation exist?
The limitation in question (which, as mentioned in the comments, is the wrong limit) is on the number of textures that can be used with a single rendering command at one time. Each shader stage has its own limit on the number of textures it can access.
This limitation exists because hardware is not unlimited in its capabilities. Textures require resources, and some hardware has limits on how many of these resources can be used at once, just as it has limits on the number of uniforms a shader stage can have, and so forth.
Now, there is some hardware which is capable of accessing texture data in a more low-level way. This functionality is exposed via the ARB_bindless_texture extension. It's not a core feature because not all recent, GL 4.x-capable hardware can support it.
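The per-stage and combined limits can be queried at runtime; a minimal sketch, assuming a GL 2.0+ context with GLEW as the loader:

    #include <GL/glew.h>
    #include <cstdio>

    void printTextureUnitLimits() {
        GLint frag = 0, vert = 0, combined = 0;
        glGetIntegerv(GL_MAX_TEXTURE_IMAGE_UNITS, &frag);              // fragment stage
        glGetIntegerv(GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS, &vert);       // vertex stage
        glGetIntegerv(GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS, &combined); // all stages
        printf("fragment: %d, vertex: %d, combined: %d\n", frag, vert, combined);
    }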
How many instructions can a vertex and a fragment shader each have in WebGL in Chrome, without taking rendering time per frame into account?
from: https://groups.google.com/forum/#!topic/angleproject/5Z3EiyqfbQY
So the only way to know an instruction count limit has been exceeded is to compile it?
Unfortunately yes. Note that you should probably try to compile and link it to really be sure, since some systems may not actually do much at the compilation phase.
Does anyone have some rough numbers, or know what the limiting factor is?
There is no specific limit. It's up to the driver. A high end GPU will have a larger limit than a low end mobile GPU. There's also no way to know how many instructions a particular line of GLSL will translate into since that's also up to the particular GLSL compiler in that particular driver.
This is an aspect that is not mandated by the Khronos specification, and hence varies depending on the GPU, or the shader compiler if ANGLE is used.
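Following the advice above, the only portable check is to compile and link and then inspect both status flags. A minimal desktop-GL sketch (the WebGL entry points are analogous; GLEW assumed):

    #include <GL/glew.h>
    #include <cstdio>

    static bool compiled(GLuint shader) {
        glCompileShader(shader);
        GLint ok = GL_FALSE;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
        if (!ok) {
            char log[1024];
            glGetShaderInfoLog(shader, sizeof(log), nullptr, log);
            fprintf(stderr, "compile failed: %s\n", log);
        }
        return ok == GL_TRUE;
    }

    bool buildProgram(GLuint program, GLuint vs, GLuint fs) {
        if (!compiled(vs) || !compiled(fs)) return false;
        glAttachShader(program, vs);
        glAttachShader(program, fs);
        glLinkProgram(program);  // some drivers defer the real work to link time
        GLint ok = GL_FALSE;
        glGetProgramiv(program, GL_LINK_STATUS, &ok);
        if (!ok) {
            char log[1024];
            glGetProgramInfoLog(program, sizeof(log), nullptr, log);
            fprintf(stderr, "link failed: %s\n", log);
        }
        return ok == GL_TRUE;
    }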
I've been tuning my game's renderer for my laptop, which has a Radeon HD 3850. This chip has a decent amount of processing power, but rather limited memory bandwidth, so I've been trying to move more shader work into fewer passes.
Previously, I was using a simple multipass model (sketched in code after this list):
Bind and clear FP16 blend buffer (with depth buffer)
Depth-only pass
For each light, do an additive light pass
Bind backbuffer, use blend buffer as a texture
Tone mapping pass
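A rough sketch of that pass structure, with hypothetical bind*/draw* helpers standing in for the renderer's own code:

    #include <GL/glew.h>

    // Hypothetical helpers implemented elsewhere in the renderer.
    void bindHdrTarget();               // FP16 color + depth FBO
    void bindBackbuffer();
    void drawSceneDepthOnly();
    void drawSceneLit(int lightIndex);
    void drawToneMapQuad();             // samples the FP16 buffer as a texture

    void renderFrame(int lightCount) {
        bindHdrTarget();
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        // Depth-only pass: disable color writes while priming the depth buffer.
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        drawSceneDepthOnly();
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

        // One additive pass per light, shading only visible fragments.
        glDepthFunc(GL_EQUAL);
        glEnable(GL_BLEND);
        glBlendFunc(GL_ONE, GL_ONE);
        for (int i = 0; i < lightCount; ++i)
            drawSceneLit(i);
        glDisable(GL_BLEND);
        glDepthFunc(GL_LESS);

        // Tone map the accumulated HDR result to the backbuffer.
        bindBackbuffer();
        drawToneMapQuad();
    }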
In an attempt to improve the performance of this method, I wrote a new rendering path that counts the number and type of lights to dynamically build custom GLSL shaders. These shaders accept all light parameters as uniforms and do all lighting in a single pass. I was expecting to run into some kind of limit, so I tested it first with one light. Then three. Then twenty-one, with no errors or artifacts, and with great performance. This leads me to my actual questions:
Is the maximum number of uniforms retrievable?
Is this method viable on older hardware, or are uniforms much more limited?
If I push it too far, at what point will I get an error? Shader compilation? Program linking? Using the program?
Shader uniforms are typically implemented by the hardware as registers (or sometimes by patching the values into shader microcode directly, e.g. nVidia fragment shaders). The limit is therefore highly implementation dependent.
You can retrieve the maximums by querying GL_MAX_VERTEX_UNIFORM_COMPONENTS_ARB and GL_MAX_FRAGMENT_UNIFORM_COMPONENTS_ARB for vertex and fragment shaders respectively.
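A minimal sketch of those queries, with a derived light budget on top (GLEW assumed; the components-per-light figure is made up for illustration):

    #include <GL/glew.h>
    #include <cstdio>

    void printUniformBudget() {
        GLint vertComps = 0, fragComps = 0;
        glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS_ARB, &vertComps);
        glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_COMPONENTS_ARB, &fragComps);

        // Hypothetical: three vec4s per light (position, color, attenuation).
        const GLint componentsPerLight = 3 * 4;
        printf("vertex: %d, fragment: %d components (~%d lights in the fragment stage)\n",
               vertComps, fragComps, fragComps / componentsPerLight);
    }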
See section 4.3.5 "Uniform" of The OpenGL® Shading Language specification:
There is an implementation dependent limit on the amount of storage for uniforms that can be used for each type of shader and if this is exceeded it will cause a compile-time or link-time error. Uniform variables that are declared but not used do not count against this limit.
It will fail at compile time or link time, not when using the program.
For how to get the max number supported by your OpenGL implementation, see moonshadow's answer.
For an idea of where the limit actually is for arbitrary GPUs, I'd recommend looking at which DX version that GPU supports.
DX9 level hardware:
vs2_0 supports 256 vec4. ps2_0 supports 32 vec4.
vs3_0 is 256 vec4, ps3_0 is 224 vec4.
DX10 level hardware:
vs4_0/ps4_0 is a minimum of 4096 constants per constant buffer, and you can have 16 of them.
In short, it's unlikely you'll run out with anything that is DX10-based.
I guess the maximum number of uniforms is determined by the amount of video memory, as a uniform is just a variable. Normal variables on the CPU are limited by your RAM too, right?