I've looked everywhere and couldn't find a definitive answer.
Vertex and fragment shaders do in fact have a limit on the maximum size of the shader and the number of instructions, but I've never heard about those limitations for a compute shader.
Since I need to port an existing CPU path tracer with many different BRDFs, I need to know in advance whether this could be an issue and I should move to CUDA, or whether OpenGL's compute shaders can handle the work just fine.
There are always limits. But these limits are implementation-defined; they're not expressed in any a-priori determinable way. So the only way you'll find out what they are is to cross them.
CUDA has limits too.
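For what it's worth, the work-group-related limits are queryable even though instruction counts are not; a minimal sketch, assuming a GL 4.3+ context and a loader such as GLEW:

```c
/* Sketch of the compute limits that *are* queryable. Instruction-count /
   code-size limits are not exposed; those only show up when compilation
   or linking fails. */
#include <stdio.h>
#include <GL/glew.h>

static void print_compute_limits(void)
{
    GLint count[3], size[3], invocations, shared_bytes;

    for (int i = 0; i < 3; ++i) {
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, i, &count[i]);
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE,  i, &size[i]);
    }
    glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &invocations);
    glGetIntegerv(GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, &shared_bytes);

    printf("max work group count: %d x %d x %d\n", count[0], count[1], count[2]);
    printf("max work group size:  %d x %d x %d\n", size[0], size[1], size[2]);
    printf("max invocations per group: %d\n", invocations);
    printf("max shared memory per group: %d bytes\n", shared_bytes);
}
```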
Related
I was recently looking for ray tracing via OpenGL tutorials. Most tutorials prefer compute shaders. I wonder why they don't just render to a texture, then render the texture to the screen as a quad.
What are the advantages and disadvantages of the compute shader method over a screen quad?
Short answer: because compute shaders give you more effective tools to perform complex computations.
Long answer:
Perhaps the biggest advantage that they afford (in the case of tracing) is the ability to control exactly how work is executed on the GPU. This is important when you're tracing a complex scene. If your scene is trivial (e.g., Cornell Box), then the difference is negligible. Trace some spheres in your fragment shader all day long. Check http://shadertoy.com/ to witness the madness that can be achieved with modern GPUs and fragment shaders.
But. If your scene and shading are quite complex, you need to control how work is done. Rendering a quad and doing the tracing in a frag shader is going to, at best, make your application hang while the driver cries, changes its legal name, and moves to the other side of the world...and at worst, crash the driver. Many drivers will abort if a single operation takes too long (which virtually never happens under standard usage, but will happen awfully quickly when you start trying to trace 1M poly scenes).
So you're doing too much work in the frag shader...next logical thought? OK, limit the workload. Draw smaller quads to control how much of the screen you're tracing at once. Or use glScissor. Make the workload smaller and smaller until your driver can handle it.
Guess what we've just re-invented? Compute shader work groups! Work groups are compute shader's mechanism for controlling job size, and they're a far better abstraction for doing so than fragment-level hackery (when we're dealing with this kind of complex task). Now we can very naturally control how many rays we dispatch, and we can do so without being tightly-coupled to screen-space. For a simple tracer, that adds unnecessary complexity. For a 'real' one, it means that we can easily do sub-pixel raycasting on a jittered grid for AA, huge numbers of raycasts per pixel for pathtracing if we so desire, etc.
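To make the work-group idea concrete, here is a minimal sketch (the GLSL is shown as a C string; trace_pixel() is a hypothetical placeholder for your own tracing code, and the 8x8 local size is just an example):

```c
/* One 8x8 group of invocations, each tracing "its" pixel and writing the
   result with imageStore; the dispatch size, not the screen, defines the job. */
#include <GL/glew.h>

static const char *cs_src =
    "#version 430\n"
    "layout(local_size_x = 8, local_size_y = 8) in;\n"
    "layout(rgba32f, binding = 0) uniform image2D out_img;\n"
    "void main() {\n"
    "    ivec2 p = ivec2(gl_GlobalInvocationID.xy);\n"
    "    vec4 color = vec4(0.0); // = trace_pixel(p); -- your tracer goes here\n"
    "    imageStore(out_img, p, color);\n"
    "}\n";

static void dispatch_trace(GLuint prog, GLuint out_tex, int width, int height)
{
    glUseProgram(prog);   /* prog = compiled/linked program built from cs_src */
    glBindImageTexture(0, out_tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
    glDispatchCompute((width + 7) / 8, (height + 7) / 8, 1); /* job size is explicit */
    glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);           /* before sampling out_tex */
}
```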
Other features of compute shaders that are useful for performant, industrial-strength tracers:
Shared Memory between thread groups (allows, for example, packet tracing, wherein an entire packet of spatially-coherent rays is traced at the same time to exploit memory coherence & the ability to communicate with nearby rays)
Scatter Writes allow compute shaders to write to arbitrary image locations (note: image and texture are different in subtle ways, but the advantage remains relevant); you no longer have to trace directly from a known pixel location (see the sketch after this list)
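A rough sketch of both features, again as GLSL in a C string: a 'shared' array visible to the whole work group, and a scatter write to an image location that is not tied to the invocation's own pixel. The packet_origins array and the indices are illustrative only.

```c
static const char *packet_src =
    "#version 430\n"
    "layout(local_size_x = 64) in;\n"
    "layout(rgba32f, binding = 0) uniform image2D out_img;\n"
    "shared vec3 packet_origins[64];              // visible to the whole work group\n"
    "void main() {\n"
    "    uint i = gl_LocalInvocationID.x;\n"
    "    packet_origins[i] = vec3(0.0);            // e.g. set up a coherent ray packet\n"
    "    barrier();                                // sync before reading a neighbour\n"
    "    vec3 neighbour = packet_origins[(i + 1u) % 64u];\n"
    "    ivec2 anywhere = ivec2(i, 0);             // scatter: any pixel, not 'this' one\n"
    "    imageStore(out_img, anywhere, vec4(neighbour, 1.0));\n"
    "}\n";
```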
In general, the architecture of modern GPUs is designed to support this kind of task more naturally using compute. Personally, I have written a real-time progressive path tracer using MLT, kd-tree acceleration, and a number of other computationally-expensive techniques (PT is already extremely expensive). I tried to remain in a fragment shader / full-screen quad as long as I could. Once my scene was complex enough to require an acceleration structure, my driver started choking no matter what hackery I pulled. I re-implemented in CUDA (not quite the same as compute, but leveraging the same fundamental GPU architectural advances), and all was well with the world.
If you really want to dig in, have a glance at section 3.1 here: https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2007/guenther_07_BVHonGPU/Guenter_et_al._-_Realtime_Ray_Tracing_on_GPU_with_BVH-based_Packet_Traversal.pdf. Frankly the best answer to this question would be an extensive discussion of GPU micro-architecture, and I'm not at all qualified to give that. Looking at modern GPU tracing papers like the one above will give you a sense of how deep the performance considerations go.
One last note: any performance advantage of compute over frag in the context of raytracing a complex scene has absolutely nothing to do with rasterization / vertex shader overhead / blending operation overhead, etc. For a complex scene with complex shading, bottlenecks are entirely in the tracing computations, which, as discussed, compute shaders have tools for implementing more efficiently.
I am going to complement Josh Parnell's answer.
One problem with both the fragment shader and the compute shader is that they both lack recursion.
A ray tracer is recursive by nature (yes, I know it is always possible to transform a recursive algorithm into a non-recursive one, but it is not always easy to do).
So another way to look at the problem is the following:
Instead of having "one thread" per pixel, one idea is to have one thread per path (a path being the segment of a ray between two bounces).
Going that way, you dispatch over your "bunch" of rays instead of over your "pixel grid". Doing so simplifies the potential recursion of the ray tracer and avoids divergence in complex materials (a rough sketch follows the link below):
More information here:
http://research.nvidia.com/publication/megakernels-considered-harmful-wavefront-path-tracing-gpus
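To give a flavour of the wavefront approach, here is a very rough sketch; the Ray layout, the extend_ray() call and the atomic counter are placeholders, not the paper's actual scheme:

```c
/* "One thread per path segment": rays live in a buffer, each dispatch advances
   every live ray by one bounce, and the host re-dispatches until the queue is empty. */
static const char *wavefront_src =
    "#version 430\n"
    "layout(local_size_x = 64) in;\n"
    "struct Ray { vec4 origin; vec4 dir; vec4 throughput; };\n"
    "layout(std430, binding = 0) buffer RayQueue { Ray rays[]; };\n"
    "layout(binding = 1) uniform atomic_uint live_rays;\n"
    "void main() {\n"
    "    uint i = gl_GlobalInvocationID.x;\n"
    "    if (i >= uint(rays.length())) return;\n"
    "    // Intersect rays[i] with the scene, sample the BRDF, and either write\n"
    "    // the next segment back into the queue or retire the ray:\n"
    "    // bool alive = extend_ray(rays[i]);\n"
    "    // if (alive) atomicCounterIncrement(live_rays);\n"
    "}\n";
/* Host side: one glDispatchCompute per bounce, reading back (or indirectly
   dispatching on) live_rays to decide when every path has terminated. */
```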
As I understand it, we store textures in GPU memory with the glTexImage2D function, and theoretically there is no limit on their number. However, when we try to use them there is a limitation of 32 texture units, GL_TEXTURE0 to GL_TEXTURE31. Why is there this limitation?
The limitation in question (which, as mentioned in the comments, is the wrong limit) is on the number of textures that can be used with a single rendering command at one time. Each shader stage has its own limit on the number of textures it can access.
This limitation exists because hardware is not unlimited in its capabilities. Textures require resources, and some hardware has limits on how many of these resources can be used at once. Just like hardware has limits on the number of uniforms a shader stage can have and so forth.
Now, there is some hardware which is capable of accessing texture data in a more low-level way. This functionality is exposed via the ARB_bindless_texture extension. It's not a core feature because not all recent, GL 4.x-capable hardware can support it.
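If you want to know the real limits on your implementation rather than the fixed 32 suggested by the enum names, you can query them; a small sketch, assuming a current GL context:

```c
static void print_texture_unit_limits(void)
{
    GLint frag_units, vert_units, combined_units;
    glGetIntegerv(GL_MAX_TEXTURE_IMAGE_UNITS,          &frag_units);     /* fragment stage */
    glGetIntegerv(GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS,   &vert_units);     /* vertex stage */
    glGetIntegerv(GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS, &combined_units); /* all stages together */
    printf("fragment: %d, vertex: %d, combined: %d texture units\n",
           frag_units, vert_units, combined_units);
}
```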
How many instructions can a vertex and fragment shader each have in WebGL in Chrome, without taking rendering time per frame into account?
from: https://groups.google.com/forum/#!topic/angleproject/5Z3EiyqfbQY
So the only way to know an instruction count limit has been exceeded is to compile it?
Unfortunately yes. Note that you should probably try to compile and link it to really be sure since some systems may not actually do much at the compilation phase.
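In desktop GL terms, that advice looks roughly like the following sketch (WebGL's API is analogous); the logging and buffer sizes are arbitrary:

```c
/* The helpers assume the shader/program objects already exist and have their
   source attached; a limit may surface at either step, depending on the driver. */
#include <stdio.h>
#include <GL/glew.h>

static int shader_compiles(GLuint shader)
{
    GLint ok = GL_FALSE;
    glCompileShader(shader);
    glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
    if (!ok) {
        char log[4096];
        glGetShaderInfoLog(shader, sizeof log, NULL, log);
        fprintf(stderr, "compile failed: %s\n", log);   /* a limit may surface here */
    }
    return ok;
}

static int program_links(GLuint prog)
{
    GLint ok = GL_FALSE;
    glLinkProgram(prog);
    glGetProgramiv(prog, GL_LINK_STATUS, &ok);
    if (!ok) {
        char log[4096];
        glGetProgramInfoLog(prog, sizeof log, NULL, log);
        fprintf(stderr, "link failed: %s\n", log);      /* ...or only here, on some drivers */
    }
    return ok;
}
```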
Surely someone must have some rough samples or a limiting factor?
There is no specific limit. It's up to the driver. A high end GPU will have a larger limit than a low end mobile GPU. There's also no way to know how many instructions a particular line of GLSL will translate into since that's also up to the particular GLSL compiler in that particular driver.
This is an aspect that is not mandated by the Khronos specification, and hence varies depending on the GPU, or the shader compiler if ANGLE is used.
I plan on writing a program that will take some parameters as input and will generate its own fragment shader string, which will then be compiled, linked, and used as a fragment shader (this will only be done once, at the start of the program).
I'm not an expert in computer graphics, so I don't know if this is standard practice, but I definitely think it has the potential for some interesting applications - not necessarily graphics applications, but possibly computational ones.
My question is: what is the code size limit of a shader in OpenGL, i.e. how much memory can OpenGL reasonably allocate to a program on the graphics processor?
There is no code size limit. OK, there is, but:
OpenGL doesn't give you a way to query it because:
Such a number would be meaningless, since it does not translate to anything you can directly control in GLSL.
A long GLSL shader might compile while a short shader can't. Why? Because the compiler may have been able to optimize the long shader down to size, while the short shader expanded to lots of opcodes. In short, GLSL is too high-level to be able to effectively quantify such limitations.
In any case, given the limitations of GL 2.x-class hardware, you probably won't hit any length limitations unless you're trying to do so or are doing GPGPU work.
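As an illustration of the "short source, many opcodes" point above, here is a sketch of a tiny fragment shader that some compilers (especially for older hardware with no real flow control) may unroll into hundreds of instructions:

```c
/* Three lines of GLSL source, but a compiler that fully unrolls the loop may
   emit ~200 ALU instructions, while a much longer straight-line shader
   compiles to far fewer. */
static const char *fs_src =
    "#version 120\n"
    "void main() {\n"
    "    vec4 acc = vec4(0.001);\n"
    "    for (int i = 0; i < 200; ++i)          // may be unrolled into ~200 ops\n"
    "        acc = acc * acc + gl_FragCoord * 1e-4;\n"
    "    gl_FragColor = acc;\n"
    "}\n";
```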
I've been tuning my game's renderer for my laptop, which has a Radeon HD 3850. This chip has a decent amount of processing power, but rather limited memory bandwidth, so I've been trying to move more shader work into fewer passes.
Previously, I was using a simple multipass model:
Bind and clear FP16 blend buffer (with depth buffer)
Depth-only pass
For each light, do an additive light pass
Bind backbuffer, use blend buffer as a texture
Tone mapping pass
In an attempt to improve the performance of this method, I wrote a new rendering path that counts the number and type of lights to dynamically build custom GLSL shaders. These shaders accept all light parameters as uniforms and do all lighting in a single pass. I was expecting to run into some kind of limit, so I tested it first with one light. Then three. Then twenty-one, with no errors or artifacts, and with great performance. This leads me to my actual questions:
Is the maximum number of uniforms retrievable?
Is this method viable on older hardware, or are uniforms much more limited?
If I push it too far, at what point will I get an error? Shader compilation? Program linking? Using the program?
Shader uniforms are typically implemented by the hardware as registers (or sometimes by patching the values into shader microcode directly, e.g. nVidia fragment shaders). The limit is therefore highly implementation dependent.
You can retrieve the maximums by querying GL_MAX_VERTEX_UNIFORM_COMPONENTS_ARB and GL_MAX_FRAGMENT_UNIFORM_COMPONENTS_ARB for vertex and fragment shaders respectively.
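As a small sketch of that query (the non-ARB enums are equivalent in core GL 2.0 and later):

```c
static void print_uniform_limits(void)
{
    GLint vert_comps, frag_comps;
    glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS,   &vert_comps);
    glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_COMPONENTS, &frag_comps);
    printf("vertex: %d float components (%d vec4), fragment: %d (%d vec4)\n",
           vert_comps, vert_comps / 4, frag_comps, frag_comps / 4);
}
```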
See 4.3.5 Uniform of The OpenGL® Shading Language specs:
There is an implementation dependent limit on the amount of storage for uniforms that can be used for each type of shader and if this is exceeded it will cause a compile-time or link-time error. Uniform variables that are declared but not used do not count against this limit.
It will fail at compile or link time, but not when using the program.
For how to get the max number supported by your OpenGL implementation, see moonshadow's answer.
For an idea of where the limit actually is for arbitrary GPUs, I'd recommend looking at which DX version that GPU supports.
DX9 level hardware:
vs2_0 supports 256 vec4. ps2_0 supports 32 vec4.
vs3_0 is 256 vec4, ps3_0 is 224 vec4.
DX10 level hardware:
vs4_0/ps4_0 is a minimum of 4096 constants per constant buffer - and you can have 16 of them.
In short, it's unlikely you'll run out with anything that is DX10-based.
I guess the maximum number of uniforms is determined by the amount of video memory, as it's just a variable. Normal variables on the CPU are limited by your RAM too, right?