If I have a very complicated geometry shader, is there a limit on the memory footprint of the code?
Shaders are not limited by "complexity", but by runtime. If a shader executes for too long, the GPU/OS will assume that the shader has entered an infinite loop and kill the shader's execution (and probably the application that launched it). How long that is depends on the GPU and possibly the shader stage.
I'm designing an OO program (in C++) which deals with moderately-simple graphics implemented with OpenGL. I've written a couple of vertex shader programs for my Drawable objects to use. In addition, I've encapsulated shader management (compilation, linking and usage) in a Shader class.
My question is, since all my classes will be rendering visual objects using this small set of shaders and they all have a pointer to a Shader object, does it make sense to provide a common reference for all of them to use, so that it would avoid having "the same" shader code compiled more than once?
If so, why? Is it really important to prevent duplication of shader code? (My program will likely have thousands of independent visual elements that need to be rendered together). I'm new to OpenGL and performance/efficiency issues are still very obscure to me...
EDIT: Moreover, I wonder what will then happen with my shader uniforms; will they be shared as well? How's that supposed to allow me to, e.g. rotate my elements at a different rate? Is it better to write element-uniforms (i.e. the model matrix) every time I want to draw each element, than to have replicated shader code?
I would wager that in most if not all OpenGL implementations, compiling and linking the same shader multiple times would result in multiple copies of the shader binaries and of the space for uniforms, etc. Calling glUseProgram to switch between your duplicate copies will still cause a state change, even though the same code runs on your GPU cores before and after the call. With a sufficiently complex scene, you'll probably be switching textures as well, so there will be a state change anyway.
It may not be your bottleneck, but it certainly is wasteful. A good pattern for static content like shaders and textures is to have one or more Manager classes (AssetManager, TextureManager, etc.) that will lazy-load (or pre-load) all of your assets and give you a shared pointer (or some other memory-management strategy) when you ask for it, typically by some string ID.
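A minimal sketch of that manager pattern, assuming a hypothetical ShaderProgram type and a caller-supplied loader function (neither comes from the question). In a real program the loader would compile and link the GLSL and the struct would own the program handle; here it is stubbed out so the caching logic can be read on its own:

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a linked program. In a real renderer this
// would own the handle returned by glCreateProgram.
struct ShaderProgram {
    unsigned int id;
};

// Hypothetical manager: builds each program at most once per string ID
// and hands out shared ownership of the cached object.
class ShaderManager {
public:
    using Loader = std::function<ShaderProgram(const std::string&)>;

    explicit ShaderManager(Loader loader) : loader_(std::move(loader)) {}

    // Lazy-load: the loader runs only the first time a name is requested.
    std::shared_ptr<ShaderProgram> get(const std::string& name) {
        auto it = cache_.find(name);
        if (it != cache_.end()) {
            return it->second;  // already built, reuse it
        }
        auto program = std::make_shared<ShaderProgram>(loader_(name));
        cache_.emplace(name, program);
        return program;
    }

private:
    Loader loader_;
    std::unordered_map<std::string, std::shared_ptr<ShaderProgram>> cache_;
};
```

Every Drawable that asks for the same ID gets a pointer to the same object, so the shader is compiled once no matter how many elements use it.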
About the edit:
Yes, your uniforms will be shared and will also remain loaded after you unbind. This is the preferred way to do it because updating a uniform is more than an order of magnitude faster than binding a new shader. You would just set the model matrix uniforms for every new object but keep the same shader.
Uniforms are stored with the shader, so switching shaders means loading in all of the uniforms anyways.
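To illustrate the difference, here is a rough sketch of a draw loop that shares programs and uploads one uniform per object. The Drawable and DrawStats types and the sort-by-program step are my own assumptions, and the GL calls are left as comments so the snippet runs without a context; it only counts how often each kind of call would happen:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Drawable {
    unsigned int program;  // hypothetical handle of the shared program
    float angle;           // per-object state that would feed the model matrix
};

struct DrawStats {
    std::size_t programBinds = 0;
    std::size_t uniformUploads = 0;
};

// Sort by program, bind only when the program changes, and upload the
// per-object uniform for every draw.
DrawStats drawAll(std::vector<Drawable> objects) {
    std::stable_sort(objects.begin(), objects.end(),
                     [](const Drawable& a, const Drawable& b) {
                         return a.program < b.program;
                     });
    DrawStats stats;
    bool haveBound = false;
    unsigned int bound = 0;
    for (const Drawable& d : objects) {
        if (!haveBound || d.program != bound) {
            // real code: glUseProgram(d.program);
            bound = d.program;
            haveBound = true;
            ++stats.programBinds;
        }
        // real code: build the model matrix from d.angle, then
        // glUniformMatrix4fv(modelLocation, 1, GL_FALSE, matrix);
        ++stats.uniformUploads;
        // real code: glDrawArrays(...) or glDrawElements(...)
    }
    return stats;
}
```

With thousands of elements and a handful of shaders, the number of program binds stays at the number of distinct shaders, while the cheap uniform upload happens once per element.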
I plan on writing a program that will take some parameters as input and generate its own fragment shader string, which will then be compiled, linked and used as a fragment shader (this will only be done once, at the start of the program).
I'm not an expert in computer graphics, so I don't know if this is standard practice, but I definitely think it has the potential for some interesting applications - not necessarily graphics applications but possibly computational ones.
My question is what is the code size limit of a shader in OpenGL i.e. how much memory can OpenGL reasonably allocate to a program on the graphics processor?
There is no code size limit. OK, there is, but:
OpenGL doesn't give you a way to query it because:
Such a number would be meaningless, since it does not translate to anything you can directly control in GLSL.
A long GLSL shader might compile while a short shader can't. Why? Because the compiler may have been able to optimize the long shader down to size, while the short shader expanded to lots of opcodes. In short, GLSL is too high-level to be able to effectively quantify such limitations.
In any case, given the limitations of GL 2.x-class hardware, you probably won't hit any length limitations unless you're trying to do so or are doing GPGPU work.
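For what it's worth, generating the source is just string building. A small sketch, assuming a hypothetical makeFragmentShader function that unrolls a polynomial evaluation into one statement per term (the polynomial is only an illustration and is not from the question); the returned string is what you would hand to glShaderSource:

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical generator: emits legacy-style GLSL (gl_FragColor) that
// evaluates a polynomial with `terms` coefficients by Horner's rule,
// fully unrolled so the shader length grows with the parameter.
std::string makeFragmentShader(int terms) {
    if (terms < 1) {
        throw std::invalid_argument("terms must be >= 1");
    }
    std::ostringstream src;
    src << "uniform float coeff[" << terms << "];\n"
        << "void main(void) {\n"
        << "    float x = gl_FragCoord.x;\n"
        << "    float acc = 0.0;\n";
    for (int i = 0; i < terms; ++i) {
        src << "    acc = acc * x + coeff[" << i << "];\n";
    }
    src << "    gl_FragColor = vec4(acc, acc, acc, 1.0);\n"
        << "}\n";
    return src.str();
}
```

Whether a given generated shader is too long can only be found out by compiling and linking it on the target driver and checking the status and info log, since the source length says little about the compiled size.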
I tested something and got a weird result about performance with C++, OpenGL and GLSL.
In the first program I drew pixels to a texture with a fragment shader and then rendered the texture.
The texture's mag/min filter was GL_NEAREST.
In the second program I took the same fragment shader and rendered directly to the screen.
Why is the second program faster? Isn't rendering a texture faster than repeating the same computation?
It's like recording a video of an AAA game, then playing it back on the same computer and getting a lower FPS from the video.
The fragment shader is:
uniform int mx,my;
void main(void) {
    vec2 p=gl_FragCoord.xy;
    p-=vec2(mx,my);
    if (p.x<0.0)
        p.x=-p.x;
    if (p.y<0.0)
        p.y=-p.y;
    float dis=sqrt(p.x*p.x+p.y*p.y);
    dis+=(abs(p.x)+abs(p.y))-(abs(p.x)-abs(p.y));
    p.x/=dis;
    p.y/=dis;
    gl_FragColor=vec4(p.x,p.y,0.0,1.0);
}
As usual with performance questions, about the only way to be really certain would be to use a profiler.
That said, my guess would be that this is mostly a question of processing bandwidth versus memory bandwidth. To render a texture, the processor has to read data from one part of memory, and write that same data back to another part of memory.
To directly render from the shader, the processor only has to write the output to memory, but doesn't have to read data in from memory.
Therefore, it's a question of which is faster: reading that particular data from memory, or generating it with the processing units? The math in your shader is pretty simple (essentially the only part that's at all complex is the sqrt) -- so at least with your particular hardware, it appears that it's a little faster to compute the result than read it from memory (at least given the other memory accesses going on at the same time, etc.)
Note that the two (shader vs. texture) have quite different characteristics. Reading a texture is going to be nearly constant speed, regardless of how simple or complex the computation that created it was. Not to state the obvious, but a shader is going to run fast if the computation is simple, but slow down (potentially a lot) as the computation gets complex. In the AAA games you mention, it's fair to guess that at least some of the shaders use complex enough calculations that they'll almost certainly be slower than a texture read. At the opposite extreme, a really trivial shader (e.g., one that just passes the fragment color through from input to output) is probably quite a lot faster than reading from a texture.
Is there any simple way to know whether some code is executed on the GPU rather than the CPU?
I think you need to get your concept of the separation of work between CPU and GPU straight. If you write some code and compile it with a regular compiler that's not targeted at GPU execution, the code will always execute on the CPU.
All calls to OpenGL or DirectX functions in your main program are executed on the CPU; there's no "magical" translation layer. However, some of those calls make the GPU do something, like drawing triangles.
CUDA and OpenCL are languages aimed at data parallel execution architectures. The GPU is such an architecture. But CUDA and OpenCL code require some host program, which in turn will be executed on the CPU. The very same goes for programmable shaders (HLSL, GLSL).
So: the host part of the program (setting up the work environment, issuing rendering calls or GPU execution) will run on the CPU. The code running on the GPU is compiled in a separate compilation unit (i.e. GLSL shader code uploaded to OpenGL, OpenCL/CUDA code compiled with an OpenCL/CUDA compiler).
As datenwolf said, any code you write that is compiled via a standard compiler (gcc, etc.) will be run on the CPU. The programs which are run on the GPU are called shaders. The variable types in shaders are different than C/C++ programs, and the syntax is also stricter and more limited.
Older graphics applications operated with two types of shaders: vertex and fragment. The vertex shader operated on any vertex of geometry sent to the renderer. The fragment shader would receive output from the vertex shader (interpolated across the geometry faces) and would operate on each pixel, or fragment, of the geometry that would be drawn to the screen.
Modern graphics has introduced the idea of General Purpose GPU Programming. OpenGL's compute shaders, OpenCL and Nvidia's CUDA can carry out general purpose programming on the GPU.
To summarize: Compiled shaders run on the GPU, and compiled C/C++ runs on the CPU.
Depending on which OS you're using, and on whether you use OpenGL, OpenCL or something else, you can use a system profiler to give you this information. A system profiler is a piece of software that tracks system-wide activity and presents it in a readable form after the tracking is done.
For example, on Windows you can use VTune, which monitors both CPU and GPU.
Hope this helps.
I've been tuning my game's renderer for my laptop, which has a Radeon HD 3850. This chip has a decent amount of processing power, but rather limited memory bandwidth, so I've been trying to move more shader work into fewer passes.
Previously, I was using a simple multipass model:
Bind and clear FP16 blend buffer (with depth buffer)
Depth-only pass
For each light, do an additive light pass
Bind backbuffer, use blend buffer as a texture
Tone mapping pass
In an attempt to improve the performance of this method, I wrote a new rendering path that counts the number and type of lights to dynamically build custom GLSL shaders. These shaders accept all light parameters as uniforms and do all lighting in a single pass. I was expecting to run into some kind of limit, so I tested it first with one light. Then three. Then twenty-one, with no errors or artifacts, and with great performance. This leads me to my actual questions:
Is the maximum number of uniforms retrievable?
Is this method viable on older hardware, or are uniforms much more limited?
If I push it too far, at what point will I get an error? Shader compilation? Program linking? Using the program?
Shader uniforms are typically implemented by the hardware as registers (or sometimes by patching the values into shader microcode directly, e.g. nVidia fragment shaders). The limit is therefore highly implementation dependent.
You can retrieve the maximums by querying GL_MAX_VERTEX_UNIFORM_COMPONENTS_ARB and GL_MAX_FRAGMENT_UNIFORM_COMPONENTS_ARB for vertex and fragment shaders respectively.
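A sketch of that query, written so it can be compiled without GL headers: the two enum values are the ones the OpenGL headers define for these names (the _ARB names share them), and the glGetIntegerv entry point is passed in as a function pointer. The GetIntegervFn alias and the UniformLimits struct are my own names, and on platforms where glGetIntegerv uses a special calling convention you would call it directly instead:

```cpp
// Values of GL_MAX_FRAGMENT_UNIFORM_COMPONENTS and
// GL_MAX_VERTEX_UNIFORM_COMPONENTS (the _ARB variants are identical).
constexpr unsigned int kMaxFragmentUniformComponents = 0x8B49;
constexpr unsigned int kMaxVertexUniformComponents = 0x8B4A;

// Hypothetical alias matching the shape of glGetIntegerv(GLenum, GLint*).
using GetIntegervFn = void (*)(unsigned int pname, int* data);

// Hypothetical result type; the limits are counted in single components
// (floats), so a vec4 uses four of them.
struct UniformLimits {
    int vertexComponents;
    int fragmentComponents;
};

// Must be called with a current OpenGL context when given the real
// glGetIntegerv.
UniformLimits queryUniformLimits(GetIntegervFn getIntegerv) {
    UniformLimits limits{0, 0};
    getIntegerv(kMaxVertexUniformComponents, &limits.vertexComponents);
    getIntegerv(kMaxFragmentUniformComponents, &limits.fragmentComponents);
    return limits;
}
```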
See section 4.3.5, Uniform, of The OpenGL® Shading Language spec:
"There is an implementation dependent limit on the amount of storage for uniforms that can be used for each type of shader and if this is exceeded it will cause a compile-time or link-time error. Uniform variables that are declared but not used do not count against this limit."
It will fail at compile time or link time, not when you use the program.
For how to get the max number supported by your OpenGL implementation, see moonshadow's answer.
For an idea of where the limit actually is for arbitrary GPUs, I'd recommend looking at which DX version that GPU supports.
DX9 level hardware:
vs2_0 supports 256 vec4. ps2_0 supports 32 vec4.
vs3_0 is 256 vec4, ps3_0 is 224 vec4.
DX10 level hardware:
vs4_0/ps4_0 is a minimum of 4096 constants per constant buffer - and you can have 16 of them.
In short, it's unlikely you'll run out with anything that is DX10 based.
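As a rough worked example of what those numbers mean for a many-lights shader, assume each light takes three vec4 slots (position, color, attenuation). That layout is an assumption of mine, not something from the question, so adjust the constant to the real shader:

```cpp
// Hypothetical layout: three vec4 slots per light
// (position, color, attenuation).
constexpr int kVec4PerLight = 3;

// vec4 slots needed for `lights` lights plus `reserved` slots for
// everything else (matrices, material parameters, ...).
constexpr int slotsNeeded(int lights, int reserved) {
    return reserved + lights * kVec4PerLight;
}

// True if the shader's uniforms fit within `limitVec4` slots.
constexpr bool fits(int lights, int reserved, int limitVec4) {
    return slotsNeeded(lights, reserved) <= limitVec4;
}

// Largest light count that stays within the limit (0 if none fit).
constexpr int maxLights(int reserved, int limitVec4) {
    return limitVec4 > reserved ? (limitVec4 - reserved) / kVec4PerLight : 0;
}
```

With these assumed numbers, the 21 lights from the question need 63 vec4 slots, which is more than the 32 of ps2_0 but well within the 224 of ps3_0. Keep in mind that the OpenGL query returns components rather than vec4s, so divide it by four before comparing.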
I guess the maximum number of uniforms is determined by the amount of video memory, as a uniform is just a variable. Normal variables on the CPU are limited by your RAM too, right?