I have learned enough OpenGL/GLUT (using PyOpenGL) to come up with a simple program that sets up a fragment shader, draws a full screen quad, and produces frames in sync with the display (shadertoy-style). I also to some degree understand the graphics pipeline.
What I don't understand is how the OpenGL program and the graphics pipeline fit together. In particular, in my GLUT display callback,
# set uniforms
glDrawArrays(GL_TRIANGLE_STRIP, 0, 4) # draw quad
glutSwapBuffers()
I suppose I activate the vertex shader by giving it vertices through glDrawArrays, which produces fragments (pixels). But then, does the fragment shader kick in immediately after glDrawArrays? There are fragments, so it can do something. On the other hand, it is still possible that there are further draw commands creating further vertices, which can a) produce new fragments and b) overwrite existing fragments.
I profiled the program and found that 99% of the time is spent in glutSwapBuffers. That is of course partially due to waiting for the vertical sync, but it stays that way when I use a very demanding fragment shader which significantly reduces the frame rate. That suggests that the fragment shader is only activated somewhere in glutSwapBuffers. Is that correct?
I understand that the fragment shader is executed on the GPU, not the CPU, but it still appears that the CPU (program) waits until the GPU (shader) is finished, within glutSwapBuffers...
I profiled the program and found that 99% of the time is spent in glutSwapBuffers. That is of course partially due to waiting for the vertical sync, but it stays that way when I use a very demanding fragment shader which significantly reduces the frame rate. That suggests that the fragment shader is only activated somewhere in glutSwapBuffers. Is that correct?
No. That logic is completely flawed. The main point here is that the fragment shader runs on the GPU, which works asynchronously to the CPU. You are not measuring the fragment shader, you are measuring some implicit CPU-GPU synchronization - it looks like your implementation syncs on the buffer swap (probably when too many frames are queued up), so all you measure is the time the CPU has to wait for the GPU. And if you increase the GPU workload without significantly increasing the CPU workload, your CPU will just spend more time waiting.
OpenGL itself does not define any of this, so all the details are ultimately completely implementation-specific. It is just guaranteed by the spec that the implementation will behave as if the fragments were generated in the order in which you draw the primitives (e.g. with blending enabled, the actual order becomes relevant even in overdraw scenarios). But at what point the fragments will be generated, and which optimizations might happen in-between vertex processing and invocation of your fragment shader, is totally out of your control. GPUs might employ tile-based rasterization schemes, where the actual fragment shading is delayed a bit (if possible) to improve efficiency and avoid overshading.
Note that most GPU drivers work completely asynchronously. When you call a gl*() command, it returns before it has been processed. It might only be queued up for later processing (e.g. in another driver thread), and will ultimately be transformed into some GPU-specific command buffers which are transferred to the GPU. You might end up with implicit CPU-GPU synchronization (or CPU-CPU synchronization with a driver thread). For example, when you read back framebuffer data after a draw call, all previous GL commands have to be flushed for processing, and the CPU will wait for the processing to be done before retrieving the image data - and that is also what makes such readbacks so slow.
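To make the "CPU just waits in the swap" effect concrete, here is a toy model in plain Python. It is not how any real driver works: the function name `simulate`, the fixed per-frame costs and the `max_queued` limit are all assumptions made up for illustration. Draw calls return immediately, the GPU works through frames one after another, and the swap blocks only when too many frames are still unfinished.

```python
def simulate(cpu_ms, gpu_ms, frames, max_queued=2):
    """Toy model of an asynchronous driver (all parameters are hypothetical).

    Returns (total CPU wall time, time the CPU spent blocked in "swap"), in ms.
    """
    cpu_time = 0.0       # CPU clock
    gpu_free_at = 0.0    # time at which the GPU has finished all queued work
    finish_times = []    # GPU completion time of every submitted frame
    blocked = 0.0

    for i in range(frames):
        # The CPU records the frame's commands; the calls return right away.
        cpu_time += cpu_ms
        # The GPU starts the frame once it is submitted and the GPU is idle.
        start = max(cpu_time, gpu_free_at)
        gpu_free_at = start + gpu_ms
        finish_times.append(gpu_free_at)
        # "Swap": block until at most max_queued frames are still unfinished.
        if i >= max_queued:
            wait_until = finish_times[i - max_queued]
            if wait_until > cpu_time:
                blocked += wait_until - cpu_time
                cpu_time = wait_until
    return cpu_time, blocked


for gpu_ms in (10, 20):
    total, blocked = simulate(cpu_ms=1, gpu_ms=gpu_ms, frames=100)
    print(f"GPU {gpu_ms} ms/frame: {blocked / total:.0%} of CPU time spent in swap")
```

In this model a heavier shader (larger `gpu_ms`) only increases the share of time the CPU spends blocked in the swap, which matches what the profiler showed, even though no shading happens "inside" the swap call.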
As a consequence, any CPU-side measures of OpenGL code are completely meaningless. You need to measure the timing on the GPU, and that's what Timer Queries are for.
Related
Is there any way to check if an OpenGL draw command has produced any fragments? For example, in the case of depth peeling for a transparent object, I don't want to render subsequent iterations if no additional fragments were produced during the previous iteration's draw calls due to discards or depth test failures.
Even if there was a way (and there sort of is), you shouldn't do this. To do what you're talking about would require having the CPU wait until the GPU has finished rendering the particular iteration before issuing the rendering command for the next one. This is pretty bad for performance, both CPU and GPU.
I've successfully updated my rendering engine to use uniform buffer objects and instancing.
The problem is that, since I do a first frustum culling pass every frame in order to know the objects I need to draw, I have to update the buffers every frame because the objects I draw could change every time, and this isn't the most efficient thing.
How could I make this more efficient?
The only thing I could think of is to skip the frustum culling, so all the buffers remain static and I don't need to update them all the time. But without frustum culling I'd end up drawing a lot of unnecessary objects.
Updating uniform buffers is fairly cheap, to be honest. You are quite limited in size and that prevents you from doing anything too crazy.
What you need to focus on to make this efficient is actually accommodating incomplete commands that are queued up. You are more likely to run into problems where the driver/GPU is forced to stop working on the next frame/command due to poor data write patterns than you are to run into data transfer rate limitations. The problem is always going to be avoiding situations where you might write to portions of data that are still in use by the GPU (it is often working on data 1-2 frames behind the CPU).
You have multiple options depending on your target version, and the OpenGL Wiki has a general overview of buffer streaming approaches.
You will have to do some performance testing to say for sure, but I suspect that CPU-side frustum culling combined with buffer orphaning of your instance UBO will give good results. Rather than reusing any data from previous frames, you would just stream the entire instance UBO from CPU to GPU each frame and let the GPU discard the old UBO when it finishes each frame.
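A minimal sketch of the orphaning idea, assuming PyOpenGL. The function name `stream_instance_ubo` and its parameters are made up for illustration, and the `gl` argument is assumed to be the `OpenGL.GL` module (passed in explicitly so the upload logic can be exercised without a GL context). Whether the driver really hands out fresh storage on the `glBufferData` call with no data is implementation-specific, so treat this as something to benchmark rather than a guarantee.

```python
def stream_instance_ubo(gl, ubo, payload, capacity):
    """Orphan the UBO's old storage, then upload this frame's instance data.

    gl       -- assumed to be the OpenGL.GL module from PyOpenGL
    ubo      -- buffer object name created earlier (e.g. with glGenBuffers)
    payload  -- bytes with the per-instance data of the visible objects
    capacity -- fixed allocation size in bytes, must be >= len(payload)
    """
    if len(payload) > capacity:
        raise ValueError("instance data does not fit into the UBO")

    gl.glBindBuffer(gl.GL_UNIFORM_BUFFER, ubo)
    # Re-specify the data store with the same size and no data. The old
    # storage may still be read by frames in flight; the driver can keep it
    # alive until they finish and give us new storage to write into.
    gl.glBufferData(gl.GL_UNIFORM_BUFFER, capacity, None, gl.GL_STREAM_DRAW)
    # Upload only the part that is actually used this frame.
    gl.glBufferSubData(gl.GL_UNIFORM_BUFFER, 0, len(payload), payload)
```

In a real program this would be called once per frame after culling, roughly as `stream_instance_ubo(GL, ubo, visible_bytes, capacity)` with `from OpenGL import GL`, where `visible_bytes` is the packed data of whatever survived the frustum test.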
I am working on an automatic OpenGL batching method in my game engine, to reduce draw calls and redundant calls.
My batch tree design begins with the most expensive states and adds leaves further down for each less expensive state.
Example:
Tree Root: Shaders / Programs
Siblings: Blend states ...
and so on.
So my question is what are most likely the most expensive calls, in this list:
binding program
binding textures
binding buffers
buffering texture, vertex data
binding render targets
glEnable / glDisable
blend state equation, color, functions, colorWriteMask
depth stencil state depthFunc, stencilOperations, stencilFunction, writeMasks
Also wondering which method will be faster:
- Collect all batchable draw commands into a single vertex buffer and issue only 1 draw call (this method would also force me to update matrix transforms per vertex on the CPU side)
- Do not batch at all and render many small draw calls, only batch particle system ...
PS: Render targets will always be changed pre or post, depending on usage.
Progress so far:
Andon M. Coleman: Cheapest Uniform & Vertex Array Binding, Expensive FBO, Texture Bindings
datenwolf: Programs invalidate State Cache
1: Framebuffer states
2: Program
3: Texture Binding
...
N: Vertex Array binding, Uniform binding
Current execution Tree in WebGL:
Program
Attribute Pointers
Texture
Blend State
Depth State
Stencil Front / Back State
Rasterizer State
Sampler State
Bind Buffer
Draw Arrays
Each step is a sibling hash tree, to avoid checking against the state cache inside of the main render queue
Loading Textures / Programs / Shaders / Buffers happens before rendering in an extra queue, for future multi threading and also to be sure that the context is initialized before doing anything with it.
The biggest problem of self rendering objects is that you cannot control when something happens, for example if a developer calls these methods before gl is initialized, he wouldn't know why but he would have some bugs or problems...
The relative costs of such operations will of course depend on the usage pattern and your general scenario. But you might find Nvidia's "Beyond Porting" presentation slides a useful guide. Let me reproduce slide 48 in particular here:
Relative Cost of state changes
In decreasing cost...
Render Target ~60K/s
Program ~300K/s
ROP
Texture Bindings ~1.5M/s
Vertex Format
UBO Bindings
Uniform Updates ~10M/s
This does not directly match all of the bullet points of your list. E.g. glEnable/glDisable might affect anything. Also GL's buffer bindings are nothing the GPU directly sees. Buffer bindings are mainly a client side state, depending on the target, of course. Change of blending state would be a ROP state change, and so on.
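To get a feel for what those rates mean per frame, here is a small back-of-the-envelope calculation in Python. The helper names are made up, and it assumes the slide's numbers can be read as "this many changes per second if the GPU did nothing else", which is a simplification; entries without a number on the slide are left out.

```python
# Rates as listed on the slide (state changes per second).
SLIDE_RATES = {
    "Render Target": 60_000,
    "Program": 300_000,
    "Texture Bindings": 1_500_000,
    "Uniform Updates": 10_000_000,
}


def microseconds_per_change(rate_per_second):
    """Average cost of one state change implied by a rate."""
    return 1_000_000 / rate_per_second


def max_changes_per_frame(rate_per_second, fps=60):
    """Upper bound on changes per frame if nothing else were done."""
    return rate_per_second / fps


for name, rate in SLIDE_RATES.items():
    print(f"{name:17s} ~{microseconds_per_change(rate):6.2f} us each, "
          f"at most ~{max_changes_per_frame(rate):9.0f} per frame at 60 FPS")
```

Read this way, a render target switch comes out at roughly 17 microseconds, or about 1000 switches per frame at 60 FPS with no time left for anything else, while uniform updates are around two orders of magnitude cheaper.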
This tends to be highly platform/vendor dependent. Any numbers you may find apply to a specific GPU, platform and driver version. And there are a lot of myths floating around on the internet about this topic. If you really want to know, you need to write some benchmarks, and run them across a range of platforms.
With all these caveats:
Render target (FBO) switching tends to be quite expensive. Highly platform and architecture dependent, though. For example if you have some form of tile based architecture, pending rendering that would ideally be deferred until the end of the frame may have to be completed and flushed out. Or on more "classic" architectures, there might be compressed color buffers or buffers used for early depth testing that need consideration when render targets are switched.
Updating texture or buffer data is impossible to evaluate in general terms. It obviously depends heavily on how much data is being updated. Contrary to some claims on the internet, calls like glBufferSubData() and glTexSubImage2D() do not typically cause a synchronization. But they involve data copies.
Binding programs should not be terribly expensive, but is typically still more heavyweight than the state changes below.
Texture binding is mostly relatively cheap. But it really depends on the circumstances. For example, if you use a GPU that has VRAM, and the texture is not in VRAM at the moment, it might trigger a copy of the texture data from system memory to VRAM.
Uniform updates. This is supposedly very fast on some platforms. But it's actually moderately expensive on others. So there's a lot of variability here.
Vertex state setup (including VBO and VAO binding) is typically fast. It has to be, because it's done so frequently by most apps that it can very quickly become a bottleneck. But there are similar considerations as for textures, where buffer memory may have to be copied/mapped if it was not used recently.
General state updates, like blend states, stencil state, or write masks, are generally very fast. But there can be very substantial exceptions.
Just a typical example of why characteristics can be so different between architectures: If you change blend state, that might be sending a couple of command words on one architecture, with minimal overhead. On other architectures, blending is done as part of the fragment shader. So if you change blend state, the shader program has to be modified to patch in the code for the new blending calculation.
I am working on an application that needs to render the scene from multiple points of view. I notice that if I render once, even if the frag shader is long and complicated (writing to multiple 3D textures) it runs at 65 FPS. As soon as I add another rendering pass before that (simply rendering to 2 targets, colour and normals+depth) the framerate drops to 40. If I add a shadowmap pass it drops even further to 25-30 FPS. What is the best way to cope with multiple renderings and still retain a high framerate?
Right now I have 1 shader for doing both the normal+depth map and the shadowmap, 1 shader to write to 3D textures and 1 shader to do the final rendering by reading from all the maps. If I run only the last shader (hence reading gibberish values for normal+depth and shadowmap) it runs at 65 FPS (and the calculation is simply a series of operations, no loops or conditionals).
Measuring FPS can be misleading. 65 FPS corresponds to 15ms per frame whereas 40 FPS corresponds to 25ms per frame. 30 FPS corresponds to 33ms per frame.
So, the complicated shader alone takes 15ms, and the complicated shader plus switching rendertargets plus switching shaders plus doing the actual processing of the second render pass takes an additional 10ms. That's not bad at all, the normal/depth shader takes 1/3 less time, which is pretty much "as expected". The shadow map adds another 8ms.
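The arithmetic behind those numbers, as a short Python sketch (the function name `frame_ms` is just for illustration, and 30 FPS is taken as the figure for the shadow map case, as in the text above):

```python
def frame_ms(fps):
    """Convert a frame rate into the time spent on one frame, in milliseconds."""
    return 1000.0 / fps


# Frame rates reported in the question, in the order the passes were added.
measurements = [
    ("final shader only", 65),
    ("+ normal/depth pass", 40),
    ("+ shadow map pass", 30),
]

previous = 0.0
for label, fps in measurements:
    total = frame_ms(fps)
    print(f"{label:20s} {fps:3d} FPS = {total:5.1f} ms (this pass: {total - previous:4.1f} ms)")
    previous = total
```

Looking at the per-pass deltas rather than the FPS values shows that each added pass costs a similar, fairly modest amount of time, which is easy to miss when the frame rate appears to fall off a cliff.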
Unless you have noticeable pipeline stalls, rendering is nowadays first and foremost limited by ROP, which means nothing else but the more pixels you touch the more time it takes, proportionally.
Of course, 15ms is already quite a heavy frame time unless the scene is massive. You should make sure that you do not have a lot of stalls due to shader and texture changes (which break batches), and that you don't stall because of buffer syncs.
Try to batch together draw calls, and be sure to avoid state changes. That will make sure the GPU doesn't go idle in between. The cost of state changes, in decreasing order of importance, is (courtesy of Cass Everitt):
Render target
Shader
ROP
Texture
Vertex Format
UBO/Vertex buffer bindings
Uniform updates
It seems like you can't avoid the render target change (since you have two of them) but in fact you can render to two targets at the same time. Sorting by shader (before sorting by texture or other stuff) may avoid those state changes, etc. etc.
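As a rough illustration of "sort by the expensive state first", here is a Python sketch. The `Draw` record, its fields and the sample draws are invented for this example; a real engine would use its own handles and would also have to respect ordering constraints such as transparency. The trick is that a tuple sorts field by field, so listing the fields in decreasing cost order gives the desired sort key for free.

```python
from typing import NamedTuple


class Draw(NamedTuple):
    # Fields are listed from most to least expensive state change, so that
    # plain tuple ordering sorts by render target first, then program, etc.
    target: int      # pass index (0 = shadow map, 1 = scene), hypothetical
    program: str
    texture: str
    mesh: str


def count_changes(draws):
    """Count, per state field, how often it differs from the previous draw."""
    changes = {field: 0 for field in Draw._fields}
    previous = None
    for draw in draws:
        for field in Draw._fields:
            if previous is None or getattr(draw, field) != getattr(previous, field):
                changes[field] += 1
        previous = draw
    return changes


draws = [
    Draw(1, "lit", "brick", "wall"),
    Draw(0, "depth", "none", "wall"),
    Draw(1, "unlit", "sky", "dome"),
    Draw(0, "depth", "none", "crate"),
    Draw(1, "lit", "wood", "crate"),
    Draw(1, "lit", "brick", "floor"),
]

print("submission order:", count_changes(draws))
print("sorted by state: ", count_changes(sorted(draws)))
```

With this sample data, sorting cuts the render target and program changes at the price of more vertex buffer (mesh) changes, which is the trade you want according to the list above.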
Duplicate the geometry you are rendering in the geometry shader, and perform whatever transformations you require. You will only need to make one render pass this way.
More info: http://www.geeks3d.com/20111117/simple-introduction-to-geometry-shader-in-glsl-part-2/
Is anyone familiar with some sort of OpenGL magic for calculating a bunch of pixels in a fragment shader instead of only 1? This issue is especially hot for OpenGL ES, given the limitations of mobile platforms and the need to do things there more carefully (in terms of performance).
Are any conclusions or ideas out there?
P.S. I know that, due to the way the GPU architecture is organised, the shader is run in parallel for each texture monad. But maybe there are techniques to raise it from one pixel to a group of pixels, or to implement your own glTexture organisation. A lot of work could be done faster within the GPU this way.
OpenGL does not support writing to multiple fragments (meaning fragments with distinct coordinates) in a shader, and for good reason: it would obstruct the GPU's ability to compute each fragment in parallel, which is its greatest strength.
The structure of shaders may appear weird at first because an entire program is written for only one vertex or fragment. You might wonder why can't you "see" what is going on in neighboring parts?
The reason is that an instance of the shader program runs for each output fragment, on each core/thread simultaneously, so they must all be independent of one another.
Parallel, independent, processing allows GPUs to render quickly, because the total time to process a batch of pixels is only as long as the single most intensive pixel.
Adding outputs with differing coordinates greatly complicates this.
Suppose a single fragment was written to by two or more instances of a shader.
To ensure correct results, the GPU can either assign one to be an authority and ignore the other (how does it know which will write?)
Or you can add a mutex, and have one wait around for the other to finish.
The other option is to allow a race condition regarding whichever one finishes first.
Either way, this would immensely slow down the process, make the shaders ugly, and introduce incorrect and unpredictable behaviour.
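The race described above can be mimicked on the CPU with a toy 1D "image" in Python. Everything here is invented for illustration (the function names, the data, and the use of an explicit execution order to stand in for unpredictable GPU scheduling). `scatter` lets each instance write to its own pixel and to its neighbour, while `gather` lets each instance write only its own pixel but read whatever inputs it likes, which is how the problem is usually reformulated for fragment shaders.

```python
def scatter(src, order):
    """Each instance i writes its own pixel AND pixel i+1 (last writer wins)."""
    out = [0] * len(src)
    for i in order:
        out[i] = src[i]
        if i + 1 < len(src):
            out[i + 1] = src[i]  # collides with instance i+1's own write
    return out


def gather(src, order):
    """Each instance i writes only pixel i, but may read any input pixel."""
    out = [0] * len(src)
    for i in order:
        left = src[i - 1] if i > 0 else 0
        out[i] = max(src[i], left)
    return out


src = [3, 1, 4, 1, 5]
forward = list(range(len(src)))
backward = forward[::-1]

print("scatter:", scatter(src, forward), "vs", scatter(src, backward))
print("gather: ", gather(src, forward), "vs", gather(src, backward))
```

The scatter version gives different images depending on which instance happens to run last, while the gather version gives the same image for any execution order, so it can run fully in parallel without locks.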
Well, firstly, you can calculate multiple outputs from a single fragment shader in OpenGL 3 and up. A framebuffer object can have more than one RGBA surface (Renderbuffer Object) attached, and the shader can generate an RGBA value for each of them by using gl_FragData[n] instead of gl_FragColor. See chapter 8 of the 5th edition OpenGL SuperBible.
However, the multiple outputs can only be generated for the same X,Y pixel coordinates in each buffer. This is for the same reason that an older style fragment shader can only generate one output, and can't change gl_FragCoord. OpenGL guarantees that in rendering any primitive, one and only one fragment shader will write to any X,Y pixel in the destination framebuffer(s).
If a fragment shader could generate multiple pixel values at different X,Y coords, it might try to write to the same destination pixel as another execution of the same fragment shader. Same if the fragment shader could change the pixel X or Y. This is the classic multiple threads trying to update shared memory problem.
One way to solve it would be to say "if this happens, the results are unpredictable" which sucks from the programmer point of view because it's completely out of your control. Or fragment shaders would have to lock the pixels they are updating, which would make GPUs far more complicated and expensive, and the performance would suck. Or fragment shaders would execute in some defined order (eg top left to bottom right) instead of in parallel, which wouldn't need locks but the performance would suck even more.