Improving performance of OpenGL rendering: rendering multiple times from different POVs - c++

I am working on an application that needs to render the scene from multiple points of view. I notice that if I render once, even if the frag shader is long and complicated (writing to multiple 3D textures) it runs at 65 FPS. As soon as I add another rendering pass before that (simply rendering to 2 targets, colour and normals+depth) the framerate drops to 40. If I add a shadowmap pass it drops even further to 25-30 FPS. What is the best way to cope with multiple renderings and still retain a high framerate?
Right now I have 1 shader for doing both the normal+depth map and the shadowmap, 1 shader to write to 3D textures and 1 shader to do the final rendering by reading from all the maps. If I run only the last shader (hence reading gibberish values for normal+depth and shadowmap) it runs at 65 FPS (and the calculation is simply a series of operations, no loops or conditionals).

Measuring FPS can be misleading. 65 FPS corresponds to 15ms per frame whereas 40 FPS corresponds to 25ms per frame. 30 FPS corresponds to 33ms per frame.
So, the complicated shader alone takes 15ms, and the complicated shader plus switching render targets plus switching shaders plus doing the actual processing of the second render pass takes an additional 10ms. That's not bad at all: the normal/depth pass takes about a third less time than the main pass, which is pretty much "as expected". The shadow map adds another 8ms.
Unless you have noticeable pipeline stalls, rendering is nowadays first and foremost limited by fill rate (ROP throughput), which means nothing other than: the more pixels you touch, the more time it takes, roughly in proportion.
Of course, 15ms is already quite a heavy frame time unless the scene is massive. You should make sure that you do not have a lot of stalls due to shader and texture changes (which break batches), and that you don't stall because of buffer syncs.
Try to batch together draw calls, and be sure to avoid state changes. That will make sure the GPU doesn't go idle in between. The cost of state changes, in decreasing order of importance, is (courtesy of Cass Everitt):
Render target
Shader
ROP
Texture
Vertex Format
UBO/Vertex buffer bindings
Uniform updates
It seems like you can't avoid the render target change (since you have two of them), but in fact you can render to both targets at the same time with multiple render targets. Sorting draw calls by shader (before sorting by texture or other state) avoids many of those state changes, and so on.
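As an illustration of the sorting idea (not taken from the question), here is a minimal sketch assuming a hypothetical DrawItem record per draw call; it sorts by program first, then texture, and only switches state when it actually changes:
#include <algorithm>
#include <vector>
// assumes a GL 3.x context and a loader header (e.g. GLEW) are already set up
struct DrawItem {
    GLuint program;      // shader program this draw needs
    GLuint texture;      // texture this draw needs
    GLuint vao;
    GLsizei vertexCount;
};
void drawSorted(std::vector<DrawItem>& items)
{
    // Sort by the most expensive state first (program), then by texture, so that
    // consecutive draws share as much state as possible.
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  if (a.program != b.program) return a.program < b.program;
                  return a.texture < b.texture;
              });
    GLuint boundProgram = 0, boundTexture = 0;
    for (const DrawItem& item : items) {
        if (item.program != boundProgram) {              // switch shader only when it changes
            glUseProgram(item.program);
            boundProgram = item.program;
        }
        if (item.texture != boundTexture) {              // switch texture only when it changes
            glBindTexture(GL_TEXTURE_2D, item.texture);
            boundTexture = item.texture;
        }
        glBindVertexArray(item.vao);
        glDrawArrays(GL_TRIANGLES, 0, item.vertexCount);
    }
}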

Duplicate the geometry you are rendering in the geometry shader, and perform whatever transformations you require. You will only need to make one render pass this way.
More info: http://www.geeks3d.com/20111117/simple-introduction-to-geometry-shader-in-glsl-part-2/
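On the C++ side, such a single-pass approach would typically render into a layered framebuffer attachment, one layer per point of view, with the geometry shader selecting the layer through gl_Layer. A rough sketch (the sizes, formats and the layer count of 4 are assumptions, not from the answer):
const GLsizei width = 1024, height = 1024;     // assumed render target size
// One 2D texture array with a layer per point of view (4 assumed here).
GLuint colorArray;
glGenTextures(1, &colorArray);
glBindTexture(GL_TEXTURE_2D_ARRAY, colorArray);
glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8, width, height, 4, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
// Attaching the whole array (no specific layer) makes the attachment "layered";
// the geometry shader then emits each primitive once per view and writes gl_Layer.
GLuint fbo;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, colorArray, 0);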

Related

Timing in OpenGL program?

I have learned enough OpenGL/GLUT (using PyOpenGL) to come up with a simple program that sets up a fragment shader, draws a full screen quad, and produces frames in sync with the display (shadertoy-style). I also to some degree understand the graphics pipeline.
What I don't understand is how the OpenGL program and the graphics pipeline fit together. In particular, in my GLUT display callback,
# set uniforms
glDrawArrays(GL_TRIANGLE_STRIP, 0, 4) # draw quad
glutSwapBuffers()
I suppose I activate the vertex shader by giving it vertices through glDrawArrays, which produces fragments (pixels). But then, does the fragment shader kick in immediately after glDrawArrays? There are fragments, so it can do something. On the other hand, it is still possible that there are further draw commands creating further vertices, which can a) produce new fragments and b) overwrite existing fragments.
I profiled the program and found that 99% of the time is spent in glutSwapBuffers. That is of course partially due to waiting for the vertical sync, but it stays that way when I use a very demanding fragment shader which significantly reduces the frame rate. That suggests that the fragment shader is only activated somewhere in glutSwapBuffers. Is that correct?
I understand that the fragment shader is executed on the GPU, not the CPU, but it still appears that the CPU (program) waits until the GPU (shader) is finished, within glutSwapBuffers...
No, that logic is completely flawed. The main point here is that the fragment shader runs on the GPU, which works totally asynchronously from the CPU. You are not measuring the fragment shader; you are measuring some implicit CPU-GPU synchronization. It looks like your implementation syncs on the buffer swap (probably when too many frames are queued up), so all you measure is the time the CPU has to wait for the GPU. And if you increase the GPU workload without significantly increasing the CPU workload, your CPU will just spend more time waiting.
OpenGL itself does not define any of this, so all the details are ultimately completely implementation-specific. It is just guaranteed by the spec that the implementation will behave as if the fragments were generated in the order in which you draw the primitives (e.g. with blending enabled, the actual order becomes relevant even in overdraw scenarios). But at what point the fragments will be generated, and which optimizations might happen between vertex processing and the invocation of your fragment shader, is totally out of your control. GPUs might employ tile-based rasterization schemes, where the actual fragment shading is delayed a bit (if possible) to improve efficiency and avoid overshading.
Note that most GPU drivers work completely asynchronously. When you call a gl*() command it returns before it has been processed. It might only be queued up for later processing (e.g. in another driver thread), and will ultimately be translated into GPU-specific command buffers which are transferred to the GPU. You might end up with implicit CPU-GPU synchronization (or CPU-CPU synchronization with a driver thread): for example, when you read back framebuffer data after a draw call, all previous GL commands will be flushed for processing, and the CPU will wait for that processing to be done before retrieving the image data - that is also what makes such readbacks so slow.
As a consequence, any CPU-side measurements of OpenGL code are completely meaningless. You need to measure the timing on the GPU, and that's what timer queries are for.
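For example, a timer query (core since OpenGL 3.3, ARB_timer_query before that) around one render pass might look roughly like this; drawScene() is just a placeholder for the work being measured:
#include <cstdio>
GLuint query;
glGenQueries(1, &query);
glBeginQuery(GL_TIME_ELAPSED, query);
drawScene();                                   // placeholder for the GL work you want to time
glEndQuery(GL_TIME_ELAPSED);
// In real code you would read the result a frame or two later; polling here keeps the sketch short.
GLint available = 0;
while (!available)
    glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
GLuint64 elapsed = 0;                          // nanoseconds spent on the GPU
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed);
printf("GPU time: %.3f ms\n", elapsed / 1.0e6);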

Is there an alternative to Stencil Pass

At the moment I've implemented deferred rendering using OpenGL; it's fairly simple just now. However, I'm having major performance issues due to the stencil pass (at least in the way I currently use it). I've mainly been using the ogldev.atspace tutorial (only got 2 links per post atm, sorry!) as a reference, alongside a few dozen tidbits of information from other articles.
It works like:
1) G-buffer pass (render geometry and fill normals, diffuse, ambient, etc.)
2) For each light:
2a) stencil pass
2b) light pass
3) Swap to screen
The thing is, using the stencil pass in this fashion incurs huge costs, as I need to swap between light-pass mode and stencil mode for every light in the scene. That's a lot of GL state swaps.
An alternate method without the stencil pass would look like this:
Gbuffer fill
Set light pass
Compute for all lights
Swap to screen
Doing this skips the need to swap all of the OpenGL states (and buffer clears etc.) for each light in the scene.
I've tested/profiled this using CodeXL and basic FPS counting via std::cout. The state-change functions in the stencil pass method take up 44% of my GL calls (in comparison to 6% for draws and 6% for textures), and buffer swaps/clears etc. also cost a fair bit more. When I switch to the second method, the GL state changes drop to 2.98% and the others drop a fair margin as well. The FPS also changes drastically: for example, I have ~65 lights in my scene, dynamically moving. The stencil pass method gives me around 20-30 FPS if I'm lucky in release mode (with rendering taking the majority of the total time). The second method gives me ~71 (with rendering taking the smaller part of the total time).
Now why not just use the second method? Well, it causes serious lighting issues that I don't get with the first. I have no idea how to get rid of them. Here's an example:
2nd non-stencil version (it essentially bleeds and overlaps onto things outside its range):
http://imgur.com/BNn9SP2
1st stencil version (how it should look):
http://imgur.com/kVGRwH2
So my main question is: is there a way to avoid using the stencil pass (while keeping results that look like the first version, without the graphical glitch) without completely changing the algorithm to something like tiled deferred rendering?
And if not, is there an alternate deferred rendering method that isn't too much of a jump from the style of deferred renderer I'm using?
Getting rid of the stencil pass isn't a new problem for me; I was looking for an alternative to it about 6 months back when I first implemented it, because I thought it might be a bit too much of an overhead for what I had in mind. But I couldn't find anything at the time and still can't.
Another technique, which is used in Doom 3 for lighting, is the following:
http://fabiensanglard.net/doom3/renderer.php
For each light
render the geometry affected only by 1 light
accumulate the light result (clamping to 255)
As an optimization you can add a scissor test so that you only render the visible part of the geometry for each light.
The advantage of this over stencil lights is that you can do complex light calculations if you want, or just keep the lights simple. And all the work stays on the GPU: you don't have redundant state changes (you set up only 1 shader and 1 VBO, and re-draw each time changing only the uniform parameters of the light and the scissor test area). You don't even need a G-buffer.
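A sketch of what that per-light loop can look like; computeScissorRect(), drawGeometryAffectedBy(), lightProgram and the uLight* uniform locations are hypothetical names, not from the article:
// Additive accumulation of one light at a time, clamped by the framebuffer format.
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);
glDepthFunc(GL_EQUAL);                         // only shade surfaces laid down by the depth pre-pass
glEnable(GL_SCISSOR_TEST);
glUseProgram(lightProgram);                    // the single light shader, reused for every light
for (const Light& light : lights) {
    Rect r = computeScissorRect(light);        // hypothetical: project the light volume to screen space
    glScissor(r.x, r.y, r.width, r.height);
    glUniform3fv(uLightPosition, 1, &light.position[0]);   // only uniforms change between lights
    glUniform3fv(uLightColor, 1, &light.color[0]);
    glUniform1f(uLightRadius, light.radius);
    drawGeometryAffectedBy(light);             // hypothetical: draw just the geometry this light touches
}
glDisable(GL_SCISSOR_TEST);
glDisable(GL_BLEND);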

Is it possible to write a bunch of pixels in gl_FragColor?

Is anyone familiar with some sort of OpenGL magic for computing a whole bunch of pixels in the fragment shader instead of only one? This issue is especially hot for OpenGL ES, given the limitations of mobile platforms and the need to do things in a more performance-conscious way there.
Are there any conclusions or ideas out there?
P.S. I know that, due to how the GPU architecture is organised, the shader runs in parallel for each fragment. But maybe there are techniques to go from one pixel to a group of pixels, or to implement your own texture organisation. A lot of work could be done faster this way on the GPU.
OpenGL does not support writing to multiple fragments (meaning fragments with distinct coordinates) in a shader, for good reason: it would obstruct the GPU's ability to compute each fragment in parallel, which is its greatest strength.
The structure of shaders may appear weird at first, because an entire program is written for only one vertex or fragment. You might wonder why you can't "see" what is going on in neighboring parts.
The reason is that an instance of the shader program runs for each output fragment, on each core/thread simultaneously, so they must all be independent of one another.
Parallel, independent processing allows GPUs to render quickly, because the total time to process a batch of pixels is only as long as the single most intensive pixel.
Adding outputs with differing coordinates greatly complicates this.
Suppose a single fragment was written to by two or more instances of a shader.
To ensure correct results, the GPU could assign one instance to be the authority and ignore the other (but how would it know which one will write?),
or it could add a mutex and have one instance wait around for the other to finish,
or it could simply allow a race condition and accept whichever one finishes first.
Either way this would immensely slow down the process, make the shaders ugly, and introduce incorrect and unpredictable behaviour.
Well, firstly, you can calculate multiple outputs from a single fragment shader in OpenGL 3 and up. A framebuffer object can have more than one RGBA surface (Renderbuffer Object or texture) attached, and the shader can generate an RGBA value for each of them by using gl_FragData[n] instead of gl_FragColor. See chapter 8 of the 5th edition OpenGL SuperBible.
However, the multiple outputs can only be generated for the same X,Y pixel coordinates in each buffer. This is for the same reason that an older style fragment shader can only generate one output, and can't change gl_FragCoord. OpenGL guarantees that in rendering any primitive, one and only one fragment shader will write to any X,Y pixel in the destination framebuffer(s).
If a fragment shader could generate multiple pixel values at different X,Y coords, it might try to write to the same destination pixel as another execution of the same fragment shader. Same if the fragment shader could change the pixel X or Y. This is the classic multiple threads trying to update shared memory problem.
One way to solve it would be to say "if this happens, the results are unpredictable" which sucks from the programmer point of view because it's completely out of your control. Or fragment shaders would have to lock the pixels they are updating, which would make GPUs far more complicated and expensive, and the performance would suck. Or fragment shaders would execute in some defined order (eg top left to bottom right) instead of in parallel, which wouldn't need locks but the performance would suck even more.
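For completeness, the C++ side of that might look like the following sketch (the formats and sizes are arbitrary assumptions); glDrawBuffers is what routes gl_FragData[0] and gl_FragData[1] to the two attachments:
const GLsizei width = 1024, height = 768;      // assumed render target size
GLuint colorTex, normalTex, fbo;
glGenTextures(1, &colorTex);
glBindTexture(GL_TEXTURE_2D, colorTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glGenTextures(1, &normalTex);
glBindTexture(GL_TEXTURE_2D, normalTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F, width, height, 0, GL_RGBA, GL_FLOAT, nullptr);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, colorTex, 0);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1, GL_TEXTURE_2D, normalTex, 0);
// Both attachments receive output from the same fragment shader invocation:
// gl_FragData[0] -> colorTex, gl_FragData[1] -> normalTex (same x,y in both).
const GLenum buffers[] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };
glDrawBuffers(2, buffers);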

Simple 2D Culling in OpenGL using VBOs

I am looking into using a VBO instead of immediate mode for performance reasons. I am creating a 2D orthographic scene filled with sprites. I do not want to draw sprites that are off-screen. I do this by checking their position against the screen size and position of the camera.
In immediate mode this is simple; there is a draw method for each sprite. Using a VBO this seems non-trivial; I render an entire section of a VBO at one time. There would be no way for me (that I can think of) to opt out of rendering sprites that are off-screen.
I'll just assume that you do indeed animate the sprites on the CPU, because that's the only thing that makes sense in the light of your question (otherwise, how would you draw them in immediate mode initially, and how would you skip drawing some).
AGP/PCIe behaves much like a harddisk from a performance point of view. Bandwidth is huge, but access time is quite noticeable. In other words, doing a transfer at all is painful, but once you do it, a few kilobytes more don't really make any difference. Uploading 500 sprites and uploading 1000 sprites is the same thing.
Since you animate the sprites on the CPU, you already must do one transfer (glBufferSubData or glMapBuffer/glUnmapBuffer) every frame, there is no other way.
Be sure to use a "fresh" buffer e.g. by applying the glBufferData(null) idiom. This avoids pipeline stalls by allowing OpenGL to continue using (drawing from) the buffer while giving you a different buffer (without you knowing) at the same time. Later when it is done drawing, it just secretly flips buffers and throws the old one away. That way, you achieve good parallelism (which is key to performance and much more important than culling a few thousand vertices).
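The orphaning idiom mentioned above might look like this, assuming a std::vector<SpriteVertex> named vertices that the CPU rebuilds every frame and a spriteVbo created earlier:
// Re-specify the data store with a null pointer first ("orphaning"): the driver can hand
// back fresh memory while the GPU keeps drawing from the old store, then discard the old one.
glBindBuffer(GL_ARRAY_BUFFER, spriteVbo);
glBufferData(GL_ARRAY_BUFFER, vertices.size() * sizeof(SpriteVertex), nullptr, GL_STREAM_DRAW);
// Then upload this frame's sprite vertices into the new store.
glBufferSubData(GL_ARRAY_BUFFER, 0, vertices.size() * sizeof(SpriteVertex), vertices.data());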
Also, graphics cards are reasonably good at culling geometry (this includes discarding entire triangles that are off-screen before fragments are generated). Hundreds? Thousands? Hundreds of thousands? No issue. Let the graphics card do it.
Unless you have a million sprites of which one half is visible at a time and the other half isn't, it is not unlikely that writing the entire buffer continuously and without branches is not only just as fast, but even faster due to cache and pipeline effects.

GLSL Interlacing

I would like to efficiently render in an interlaced mode using GLSL.
I can already do this like:
vec4 background = texture2D(plane[5], gl_TexCoord[1].st);
if (is_even_row(gl_TexCoord[1].t))
{
    vec4 foreground = get_my_color();
    gl_FragColor = vec4(foreground.rgb * foreground.a + background.rgb * (1.0 - foreground.a),
                        background.a + foreground.a);
}
else
    gl_FragColor = background;
However, as far as I understand the nature of branching in GLSL, both branches will actually be executed, since "is_even_row" is considered a run-time value.
Is there any trick I can use here in order to avoid unnecessarily calling the rather heavy function "get_my_color"? The behavior of is_even_row is quite static.
Or is there some other way to do this?
NOTE: glPolygonStipple will not work since I have custom blend functions in my GLSL code.
(comment to answer, as requested)
The problem with interlacing is that GPUs run shaders in 2x2 clusters, which means that you gain nothing from interlacing (a good software implementation might possibly only execute the actual pixels that are needed, unless you ask for partial derivatives).
At best, interlacing runs at the same speed, at worst it runs slower because of the extra work for the interlacing. Some years ago, there was an article in ShaderX4, which suggested interlaced rendering. I tried that method on half a dozen graphics cards (3 generations of hardware of each the "two big" manufacturers), and it ran slower (sometimes slightly, sometimes up to 50%) in every case.
What you could do is do all the expensive rendering at 1/2 the vertical resolution; this will reduce the pixel shader work (and texture bandwidth) by 1/2. You can then upscale the texture (GL_NEAREST) and discard every other line.
The stencil test can be used to discard pixels before the pixel shader is executed. Of course the hardware still runs shaders in 2x2 groups, so in this pass you do not gain anything. However, that does not matter if it's just the very last pass, which is a trivial shader writing out a single fetched texel. The more costly composition shaders (the ones that matter!) run at half resolution.
You can find a detailed description including code here: fake dynamic branching. This demo avoids lighting pixels by using the stencil to discard those that are outside the light's range.
Another way which does not need the stencil buffer is to use "explicit Z culling". This may in fact be even easier and faster.
For this, clear Z, disable color writes (glColorMask), and draw a fullscreen quad whose vertices have some "close" Z coordinate, and have the shader kill fragments in every odd line (or use the deprecated alpha test if you want, or whatever). gl_FragCoord.y is a very simple way of knowing which line to kill, using a small texture that wraps around would be another (if you must use GLSL 1.0).
Now draw another fullscreen quad with "far away" Z values in the vertices (and with depth test, of course). Simply fetch your half-res texture (GL_NEAREST filtering), and write it out. Since the depth buffer has a value that is "closer" in every other row, it will discard those pixels.
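Put together, the two passes might look roughly like this; maskProgram (a shader that kills rows via gl_FragCoord.y), compositeProgram (a plain texture fetch), halfResTexture and drawFullscreenQuad() are assumed helpers, not from the answer:
// Pass 1: depth-only mask. The mask shader discards exactly the rows to be filled this
// frame, so every other row receives a "close" depth value.
glClear(GL_DEPTH_BUFFER_BIT);                  // depth cleared to the far value (1.0)
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glUseProgram(maskProgram);
drawFullscreenQuad(0.1f);                      // hypothetical helper: fullscreen quad at a near depth
// Pass 2: composite. The far quad passes the depth test only on the rows the mask shader
// discarded; the other rows are rejected before any shading cost matters.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glUseProgram(compositeProgram);
glBindTexture(GL_TEXTURE_2D, halfResTexture);
drawFullscreenQuad(0.9f);                      // fullscreen quad at a far depth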
How does glPolygonStipple compare to this? Polygon stipple is a deprecated feature, because it is not directly supported by the hardware and has to be emulated by the driver either by "secretly" rewriting the shader to include extra logic or by falling back to software.
This is probably not the right way to do interlacing. If you really need to achieve this effect, don't do it in the fragment shader like this. Instead, here is what you could do:
Initialize a full screen 1-bit stencil buffer, where each bit stores the parity of its corresponding row.
Render your scene like usual to a temporary FBO with 1/2 the vertical resolution.
Turn on the stencil test, and switch the stencil func depending on which set of scan lines you are going to draw.
Draw a rescaled version of the aforementioned FBO (containing the contents of your frame) to the screen, with the stencil test masking out the other rows.
Note that you could skip the offscreen FBO step and draw directly using the stencil buffer, but this would waste some fill rate testing those pixels that are just going to be clipped anyway. If your program is shader heavy, the solution I just mentioned would be optimal. If it is not, you may end up being marginally better off drawing directly to the screen.
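A rough sketch of that stencil-based version, with the same kind of assumed helpers (rowParityProgram discards every other row via gl_FragCoord.y, compositeProgram fetches the half-resolution texture, drawFullscreenQuad() draws a screen-covering quad):
// One-time (or per-resize) setup: mark one field's rows in the stencil buffer.
glClearStencil(0);
glClear(GL_STENCIL_BUFFER_BIT);
glEnable(GL_STENCIL_TEST);
glStencilFunc(GL_ALWAYS, 1, 0xFF);
glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glUseProgram(rowParityProgram);                // assumed shader: keeps even rows, discards odd ones
drawFullscreenQuad();
// Per frame: draw the upscaled half-resolution frame only where the stencil matches the
// field being displayed; use GL_EQUAL with ref 1 for one field and ref 0 for the other.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glStencilFunc(GL_EQUAL, 1, 0xFF);
glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
glUseProgram(compositeProgram);
glBindTexture(GL_TEXTURE_2D, halfResTexture);
drawFullscreenQuad();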