Optimal matrix structure and CPU/GPU communication in modern OpenGL with GLSL

I want to know your thoughts regarding where matrix multiplication should be processed, where to store the results and general communication between the CPU and GPU.
Since the MVP matrix is the same for every vertex of an object, calculating it on the CPU and then sending it to the shaders seems like the way to go, yes?
If I have multiple shaders and some only need the model matrix, some only the MVP matrix, and some both, how should I proceed?
I currently have a shader object for each vertex/fragment shader pair that stores the location of each uniform in the shaders. This seems like good practice, yes?

This may depend on how many vertices you process in a single draw call. If you only process a small handful of vertices, then you're probably better off sending M, V, and P separately, as GPUs tend to be much faster at floating-point operations (most cases wouldn't take this approach, though; you'd want to profile with your specific models to be sure). On the other hand, if you're rendering a lot of vertices in each call, you may be better off doing the multiplication on the CPU once (even though a single CPU multiply is slower), because that way you won't keep recomputing the same value over and over again in the shader.
If you're sending the whole MVP as one matrix, then in the case where you need both MVP and M (I think lighting was where I ran into this), just send both, as you don't really have many other options. And if you only need M, then only send M.
I don't understand what you mean here (and I don't have enough rep to ask in a comment). If you mean that you store the uniform locations externally instead of querying them each render loop, then yes, that is pretty standard procedure.
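For illustration, here is a minimal sketch of that pattern, assuming a GLM-style math library on the CPU side (the uniform names and location variables are hypothetical):

// Query uniform locations once, right after linking the program:
GLint locMVP = glGetUniformLocation(program, "uMVP");
GLint locM   = glGetUniformLocation(program, "uModel");

// Each frame, upload only what this shader actually uses:
glm::mat4 mvp = projection * view * model;   // concatenated once on the CPU
glUseProgram(program);
glUniformMatrix4fv(locMVP, 1, GL_FALSE, glm::value_ptr(mvp));
glUniformMatrix4fv(locM,   1, GL_FALSE, glm::value_ptr(model));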
More generally when grappling with performance questions like these your best bet is to use a profiler (something graphics specific like GPUPerfStudio would be good for this sort of problem) and find out which method is faster in your specific case, as many performance tweaks have varying degrees of usefulness depending on the more specific scenarios.

Related

Ray tracing via Compute Shader vs Screen Quad

I was recently looking for ray tracing via OpenGL tutorials. Most tutorials prefer compute shaders. I wonder why they don't just render to a texture, then render the texture to the screen as a quad.
What are the advantages and disadvantages of the compute shader method over the screen quad?
Short answer: because compute shaders give you more effective tools to perform complex computations.
Long answer:
Perhaps the biggest advantage that they afford (in the case of tracing) is the ability to control exactly how work is executed on the GPU. This is important when you're tracing a complex scene. If your scene is trivial (e.g., Cornell Box), then the difference is negligible. Trace some spheres in your fragment shader all day long. Check http://shadertoy.com/ to witness the madness that can be achieved with modern GPUs and fragment shaders.
But. If your scene and shading are quite complex, you need to control how work is done. Rendering a quad and doing the tracing in a frag shader is going to, at best, make your application hang while the driver cries, changes its legal name, and moves to the other side of the world...and at worst, crash the driver. Many drivers will abort if a single operation takes too long (which virtually never happens under standard usage, but will happen awfully quickly when you start trying to trace 1M poly scenes).
So you're doing too much work in the frag shader...next logical thought? OK, limit the workload. Draw smaller quads to control how much of the screen you're tracing at once. Or use glScissor. Make the workload smaller and smaller until your driver can handle it.
Guess what we've just re-invented? Compute shader work groups! Work groups are compute shader's mechanism for controlling job size, and they're a far better abstraction for doing so than fragment-level hackery (when we're dealing with this kind of complex task). Now we can very naturally control how many rays we dispatch, and we can do so without being tightly-coupled to screen-space. For a simple tracer, that adds unnecessary complexity. For a 'real' one, it means that we can easily do sub-pixel raycasting on a jittered grid for AA, huge numbers of raycasts per pixel for pathtracing if we so desire, etc.
Other features of compute shaders that are useful for performant, industrial-strength tracers:
Shared memory within a work group (allows, for example, packet tracing, wherein an entire packet of spatially-coherent rays is traced at the same time to exploit memory coherence and the ability to communicate with nearby rays)
Scatter writes allow compute shaders to write to arbitrary image locations (note: images and textures are different in subtle ways, but the advantage remains relevant); you no longer have to trace directly from a known pixel location (see the sketch below)
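To make the work-group and scatter-write points concrete, here is a rough sketch of a compute dispatch; the tracing itself is elided and all names are illustrative:

// GLSL 4.3 compute shader: one invocation per pixel, 8x8 work groups.
const char* raySrc = R"(
    #version 430
    layout(local_size_x = 8, local_size_y = 8) in;
    layout(rgba32f, binding = 0) uniform image2D outImage;
    void main() {
        ivec2 p = ivec2(gl_GlobalInvocationID.xy);
        vec4 color = vec4(0.0);          // the actual trace(p) would go here
        imageStore(outImage, p, color);  // scatter write: any location, not just "this pixel"
    }
)";

// C++ side: bind the output image and pick the job size explicitly.
glUseProgram(rayProgram);
glBindImageTexture(0, outTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
glDispatchCompute((width + 7) / 8, (height + 7) / 8, 1);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);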
In general, the architecture of modern GPUs is designed to support this kind of task more naturally using compute. Personally, I have written a real-time progressive path tracer using MLT, kd-tree acceleration, and a number of other computationally-expensive techniques (PT is already extremely expensive). I tried to remain in a fragment shader / full-screen quad as long as I could. Once my scene was complex enough to require an acceleration structure, my driver started choking no matter what hackery I pulled. I re-implemented in CUDA (not quite the same as compute, but leveraging the same fundamental GPU architectural advances), and all was well with the world.
If you really want to dig in, have a glance at section 3.1 here: https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2007/guenther_07_BVHonGPU/Guenter_et_al._-_Realtime_Ray_Tracing_on_GPU_with_BVH-based_Packet_Traversal.pdf. Frankly the best answer to this question would be an extensive discussion of GPU micro-architecture, and I'm not at all qualified to give that. Looking at modern GPU tracing papers like the one above will give you a sense of how deep the performance considerations go.
One last note: any performance advantage of compute over frag in the context of raytracing a complex scene has absolutely nothing to do with rasterization / vertex shader overhead / blending operation overhead, etc. For a complex scene with complex shading, bottlenecks are entirely in the tracing computations, which, as discussed, compute shaders have tools for implementing more efficiently.
I am going to complement Josh Parnell's information.
One problem with both fragment shaders and compute shaders is that they both lack recursion.
A ray tracer is recursive by nature (yes, I know it is always possible to transform a recursive algorithm into a non-recursive one, but it is not always that easy to do).
So another way to look at the problem could be the following:
Instead of having "one thread" per pixel, one idea could be to have one thread per path segment (the part of a ray between two bounces).
Going that way, you dispatch over your "bunch" of rays instead of over your "pixel grid". Doing so simplifies the potential recursion of the ray tracer and avoids divergence in complex materials.
More information here:
http://research.nvidia.com/publication/megakernels-considered-harmful-wavefront-path-tracing-gpus

Should I use a uniform variable to reduce the amount of matrix multiplication?

I just wrote a program to rotate an object. It updates a variable theta in the idle function. That variable is used to create a rotation matrix, and then I do this:
gl_Position = rx * ry * rz * vPosition;
rx, ry, and rz (matrices) are the same for every point during the same frame, but the multiplication is performed for every single point in the object. Should I use a uniform mat4 variable that stores the product rx * ry * rz and pass it to the shader, or let the shader handle the multiplication for every single point? Which is faster?
While profiling is essential to measure how your application responds to optimizations, in general, passing a concatenated matrix to the vertex shader is desirable. This is for two reasons:
The amount of data passed from CPU to GPU is reduced. If rx, ry, and rz are all 4x4 matrices, and the product of them (say rx_ry_rz = rx * ry * rz) is also a 4x4 matrix, then you will be transferring 2 fewer 4x4 matrices (128 bytes) as uniforms each update. If you use this shader to render 1,000 objects per frame at 60 Hz, and the uniform updates with each object, that's over 7 MB per second of saved bandwidth (128 bytes × 1,000 × 60 ≈ 7.7 MB/s). Maybe not extremely significant, but every bit helps, especially if bandwidth is your bottleneck.
The amount of work the vertex stage must do is reduced (assuming a non-trivial number of vertices). Generally the vertex stage is not a bottleneck; however, many drivers implement load balancing in their shader core allocation between stages, so reducing work in the vertex stage could give benefits in the pixel stage (for example). Again, profiling will give you a better idea of whether and how this benefits performance.
The drawback is added CPU time taken to multiply the matrices. If your application's bottleneck is CPU execution, doing this could potentially slow down your application, as it will require the CPU to do more work than it did before.
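A minimal sketch of the CPU-side concatenation, assuming GLM (rotLoc and the uniform name uRot are hypothetical):

// CPU side, once per frame: concatenate and upload a single matrix.
glm::mat4 rot = rx * ry * rz;
glUniformMatrix4fv(rotLoc, 1, GL_FALSE, glm::value_ptr(rot));

// The vertex shader then does a single matrix-vector multiply:
//     uniform mat4 uRot;
//     ...
//     gl_Position = uRot * vPosition;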
I wouldn't count on this repeated multiplication being optimized out, unless you convinced yourself that it is indeed happening on all platforms you care about. To do that:
One option is benchmarking, but it will probably be difficult to isolate this operation well enough to measure a possible difference reliably.
I believe some vendors provide development tools that let you see assembly code for the compiled shader. I think that's the only reliable way for you to see what exactly happens with your GLSL code in this case.
This is a very typical example of a much larger theme. At least in my personal opinion, what you have is an example of code that uses OpenGL inefficiently. Performing calculations that are the same for every vertex inside the vertex shader, which at least conceptually is executed once per vertex, is not something you should do.
In reality, driver optimizations to work around inefficient use of the API are done based on the benefit they offer. If a high profile app/game uses certain bad patterns (and many of them do!), and they are identified as having a negative effect on performance, drivers are optimized to work around them, and still provide the best possible performance. This is particularly true if the app/game is commonly used for benchmarks. Ironically, those optimizations may hurt the performance of well written software that is considered less important.
So if there ever was an important app/game that did the same thing you're doing, which seems quite likely in this case, chances are that many drivers will contain optimizations to deal with it efficiently.
Still, I wouldn't rely on it. The reasons are philosophical as well as practical:
If I work on an app, I feel that it's my job to write efficient code. I don't want to write poor code, and hope that somebody else happened to optimize their code to compensate for it.
You can't count on all of the platforms the app will ever run on to contain these types of optimizations. Particularly since app code can have a long lifetime, and those platforms might not even exist yet.
Even if the optimizations are in place, they will most likely not be free. You might trigger driver code that ends up consuming more resources than it would take for your code to provide the combined matrix yourself.

glUniform vs. single draw call performance

Suppose I have many meshes I'd like to render. I have two choices:
Bake transforms and colors for each mesh into a VBO and render with a single draw call.
Use glUniform for transforms and colors and use many draw calls (but still a single VBO)
Assuming the scene changes very little between frames, which method tends to be better?
There are more than those two choices. At least one more comes to mind:
Use attributes for transforms and colors and use many draw calls.
Choice 3 is similar to choice 2, but setting attributes (using calls like glVertexAttrib4f) is generally faster than setting uniforms. The efficiency of setting uniforms is highly platform dependent, but uniforms are generally not intended to be modified very frequently. They are called uniform for a reason. :)
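As a rough sketch of choice 3 (colorAttrib is a hypothetical attribute location): with the attribute array disabled, the current generic attribute value applies to every vertex of the draw:

// A per-draw constant attribute instead of a uniform.
glDisableVertexAttribArray(colorAttrib);      // use the current value, not an array
glVertexAttrib4f(colorAttrib, r, g, b, a);    // one value for the entire draw call
glDrawArrays(GL_TRIANGLES, 0, vertexCount);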
That being said, choice 1 might be the best for your use case where the transforms/colors change rarely. If you're not doing this yet, you could try keeping the attributes that are modified in a separate VBO (with usage GL_DYNAMIC_DRAW), and the attributes that remain constant in their own VBO (with usage GL_STATIC_DRAW). Then make the necessary updates to the dynamic buffer with glBufferSubData.
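A sketch of that buffer split (buffer names and sizes are illustrative):

// Attributes that never change live in their own buffer:
glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
glBufferData(GL_ARRAY_BUFFER, staticBytes, staticData, GL_STATIC_DRAW);

// Attributes that change (transforms, colors) get a dynamic buffer:
glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
glBufferData(GL_ARRAY_BUFFER, dynamicBytes, nullptr, GL_DYNAMIC_DRAW);

// When something changes, update only the affected range:
glBufferSubData(GL_ARRAY_BUFFER, offset, changedBytes, changedData);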
The reality is that there are no simple rules to predict what is going to perform best. It will depend on the size of your data and draw calls, how frequent and large the data changes are, and also very much on the platform you run on. If you want to be confident that you're using the most efficient solution, you need to implement all of them, and start benchmarking.
Generally, option 1 (minimize number of draw calls) is the best advice. There are a couple of caveats:
I have seen performance fall off a cliff when using very large VBOs on at least one mobile device (assuming the opengl-es tag is relevant here). The explanation (from the vendor) involved internal buffers exceeding a certain size.
If moving the information that would otherwise be conveyed with uniforms into vertex attributes significantly increases the size of the vertex buffer, the price you pay in (possibly costly) memory reads of redundant information (it doesn't really vary per vertex) might negate the savings of using fewer draw calls.
As always the best (but tiresome) advice is to test (I know this is particularly hard developing for mobile where there are many potential implementations your code could be running on). Try to keep your pipeline/toolchain flexible enough that you can easily try out and compare different options.

Front-to-back rendering vs shader swapping

Let's consider the following situation.
The scene contains these objects: A, B, C, D, E.
Their order from the camera (nearest to farthest) is:
A, E, B, D, C
Objects A and C use shader 1, E and D use shader 2, and B uses shader 3.
Objects A and C use the same shader but different textures.
How should I deal with this situation?
Render everything from front to back (5 shader swaps).
Render grouped by shader, with the groups sorted (3 shader swaps).
Merge all shader programs into one (1 swap).
Do calls like glUniform and glBindTexture, which change values in a program that is already in use, cause overhead?
There is no one answer to this question. Does changing OpenGL state cause overhead? Of course it does; nothing is free. The question is whether the overhead caused by the state changes will be worse than the cost of less effective depth testing.
That cannot be answered, because the answer depends on how much overdraw there is, how costly your fragment shaders are, how many state changes a particular sequence of draw calls will require, and numerous other intangibles that cannot be known beforehand.
That's why profiling before optimization is important.
Profile, profile and even more profile :)
I would like to add one thing though:
In your situation you can use the idea of a rendering queue. It is a sort of manager for drawing objects. Instead of drawing an object directly, you call renderQueue.add(myObject). Then, once you have added all the needed objects, you call renderQueue.renderAll(). This method can handle all the sorting (by distance, by shader, by material, etc.), and that way it is also more flexible when profiling (and then changing the way you render).
Of course this is only a rough idea.
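Here is a minimal sketch of such a queue, sorting by shader first and front to back within each shader; all the types and fields are hypothetical:

#include <algorithm>
#include <vector>

struct DrawItem {
    GLuint shader;
    GLuint texture;
    float  depth;   // distance from the camera
    // ...whatever else is needed to issue the draw
};

void renderAll(std::vector<DrawItem>& items) {
    // Group by shader to minimize program swaps; front to back within a group.
    std::sort(items.begin(), items.end(), [](const DrawItem& a, const DrawItem& b) {
        if (a.shader != b.shader) return a.shader < b.shader;
        return a.depth < b.depth;
    });
    GLuint current = 0;
    for (const DrawItem& it : items) {
        if (it.shader != current) { glUseProgram(it.shader); current = it.shader; }
        glBindTexture(GL_TEXTURE_2D, it.texture);
        // set uniforms and issue the draw call here
    }
}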

Should particle systems be updated entirely in the geometry shader?

Should particle systems be updated entirely in the geometry shader, or should the geometry shader be passed updated data for positions, life, etc.? At the moment I update everything in the geometry shader, but I am not sure if this is the best idea in case some of the data is needed on the C++ side.
It's possible to do almost everything in shaders (especially if you're going for SM4+). I don't recommend requiring anything over SM3 if you want any sort of market penetration. I still regret that we didn't provide an SM2 fallback for our latest game, because quite a few people still use old, crappy SM2 cards.
On to the question. You can use RTT (render-to-texture) and never do a round trip back to main memory (that trip is slow as hell; minimize transfers from graphics memory to main memory), but the downside is that you need some rather elaborate tricks to compute AABBs (which you'll want on the CPU side of things) if you go pure GPU.
Instead, we do everything that requires changing the state of a particle on the CPU side. We then have a tight memory representation of that data, which gets uploaded to the GPU. The vertex shader is rather meaty (but that's totally fine; do as much as you possibly can in the vertex shader!): it extracts this compressed representation of a particle, transforms it, and passes the uncompressed data on to the pixel shader. An important observation here is that you can, and should, split per-vertex and per-particle data. This implies using instancing (which is just a way of saying: use frequency dividers). We represent particle rotation with a normal plus a rotation about that normal.
Another reason for doing the particle state changes on the CPU side is that it's a heck of a lot easier to composite behavior there. Any at least half-decent particle system needs quite a few knobs to turn to be able to create interesting particle effects.
EDIT: And if you have anything resembling Particle::Update that can't be inlined, you've failed. Minimize per-particle function calls, especially virtual ones, and keep the memory representation of a particle tightly packed!
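As a rough sketch of such a packed representation and the instanced upload (the struct layout and all names are illustrative):

// A tightly packed per-particle record; one of these per particle.
struct ParticleGpu {
    float pos[3];
    float normal[3];  // orientation axis
    float rotation;   // angle about the normal
    float life;
};

// Stream the whole array to the GPU each frame...
glBindBuffer(GL_ARRAY_BUFFER, particleVbo);
glBufferData(GL_ARRAY_BUFFER, count * sizeof(ParticleGpu), particles, GL_STREAM_DRAW);

// ...and advance the per-particle attributes once per instance (a frequency divider):
glVertexAttribDivisor(particleAttrib, 1);
glDrawArraysInstanced(GL_TRIANGLES, 0, vertsPerParticle, count);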
That depends on what kind of particle system you have. In most cases you have a software representation in C++ and a hardware representation for your shader. The geometry data for the shader is computed from the software representation and should be as small as possible, because in most cases the limiting resource is not computational power but the transfer rate to the graphics card.
If you can decrease transfers even further with your method, you can still keep a software representation in memory for further usage. Even if this implies computing data twice, it can be faster than the transfer process.