Should I calculate matrices on the GPU or on the CPU? - opengl

Should I prefer to calculate matrices on the CPU or GPU?
Let's say I have the following matrices P * V * M , should I calculate them on the CPU so that I can send the final matrix to the GPU (GLSL) or should I send those three matrices separately to the GPU so that GLSL can calculate the final matrix?
I mean in this case GLSL would have to calculate the MVP matrix for every vertex, so it is probably faster to precompute it on the CPU.
But let's say that GLSL only has to calculate the MVP matrix once: would the GPU calculate the final matrix faster than the CPU?

General rule: If you can pass it to a shader in form of a uniform, always precalculate on the CPU; no exceptions. Calculations on the shader side make sense only for values that vary between vertices and fragments. Everything that's constant among a whole batch of vertices is most efficiently dealt with on the CPU.
GPUs are not magic "can do faster everything" machines. There are certain tasks where a CPU can easily outperform a GPU, even for very large datasets. So a very simple guideline is: If you can move it to the CPU without spending more CPU time doing the calculation than it takes for the GPU in total overhead to process it, then do it on the CPU. The calculation of a single matrix is among those tasks.
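For illustration, here is a minimal sketch of that rule using GLM on the CPU side. The uniform name uMVP and the use of GLM/GLEW are assumptions for the example, not anything taken from the question:

// Minimal sketch: precompute P * V * M once per draw on the CPU and upload it
// as a single uniform. Assumes GLM and an already-linked program whose vertex
// shader declares "uniform mat4 uMVP;" (placeholder name).
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>
#include <GL/glew.h>

void uploadMVP(GLuint program, const glm::mat4& model,
               const glm::mat4& view, const glm::mat4& projection)
{
    // Two mat4 multiplies on the CPU, instead of per-vertex work on the GPU.
    glm::mat4 mvp = projection * view * model;

    GLint loc = glGetUniformLocation(program, "uMVP");
    glUseProgram(program);
    glUniformMatrix4fv(loc, 1, GL_FALSE, glm::value_ptr(mvp));
}

The per-vertex work in the vertex shader is then just gl_Position = uMVP * aPosition;.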

Like most situations with OpenGL, it depends.
In most cases, a single calculation can be done faster on the CPU than on the GPU. The GPU's advantage is that it can do lots of calculations in parallel.
On the other hand, it also depends where your bottlenecks are. If your CPU is doing lots of other work, but your shaders are not a bottleneck yet on the lowest-powered target system, then you could easily see some performance improvement by moving some matrix multiplications to the vertex shader.
Generally, you should avoid any work in the fragment shader that could also be done in the vertex shader or on the CPU, but beyond that, it depends on the situation. Unless you are running into performance issues, just do whichever way is easiest for you; if you are having performance issues, implement it both ways and profile to see which works better.
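If you do profile both variants, GPU timer queries are one way to compare them on the actual target hardware. A rough sketch (OpenGL 3.3+; GLEW assumed, the measured draw calls are left as a placeholder):

// Time a block of draw calls with a GL_TIME_ELAPSED query.
#include <GL/glew.h>
#include <cstdio>

void timeDrawCalls()
{
    GLuint query;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    // ... issue the draw calls you want to measure here ...
    glEndQuery(GL_TIME_ELAPSED);

    // Blocks until the result is available; fine for ad-hoc profiling,
    // but for continuous measurement you would double-buffer queries.
    GLuint64 elapsedNs = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
    std::printf("GPU time: %.3f ms\n", elapsedNs / 1.0e6);

    glDeleteQueries(1, &query);
}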

Related

Does any graphics API allow efficient per-primitive branching?

When writing fragment shaders in OpenGL, one can branch either on compile-time constants, on uniform variables or on varying variables.
How performant that branching is depends on the hardware and driver implementation, but generally branching on a compile-time constant is free and branching on a uniform is faster than branching on a varying.
In the case of a varying, the rasterizer still has to interpolate the variable for each fragment, and the branch has to be decided on each fragment shader invocation, even if the value of the varying is the same for every fragment in the current primitive.
What I wonder is whether any graphics API or extension allows fragment shader branching that is executed only once per rasterized primitive (or, in the case of tiled rendering, once per primitive per bin)?
Dynamic branching is only expensive when it causes divergence of instances executing at the same time. The cost of interpolating a "varying" is trivial.
Furthermore, different GPUs handle primitive rasterization differently. Some GPUs ensure that wavefronts for fragment shaders only contain instances that are executing on the same primitive. On these GPUs, branching on values that don't change within a primitive will be fast.
However, other GPUs will pack instances from different primitives into the same wavefronts. On these GPUs, divergence will happen if the value is different for different primitives. How much divergence? It rather depends on how often you get multiple instances in a primitive. If many of your primitives are small in rasterized space, then you'll get a lot more divergence than if you have a lot of large primitives.
GPUs that pack instances from different primitives into a wavefront are trying to maximize how much their cores get utilized. It's a tradeoff: you're minimizing the overall number of wavefronts you have to execute, but a particular cause of divergence (data that is constant within a primitive but not between them) will be penalized.
In any case, try to avoid divergence when you can. But if your algorithm requires it... then your algorithm requires it, and the performance you get is the performance you get. The best you can do is let the GPU know that the "varying" will be constant per-primitive by using flat interpolation.
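As a sketch of that last suggestion: the flat qualifier takes the value from the provoking vertex and does not interpolate it, so the branch condition really is constant per primitive. Shader sources are shown here as C++ string literals, and all names are illustrative:

const char* vertexSrc = R"(
    #version 330 core
    layout(location = 0) in vec4 aPosition;
    layout(location = 1) in int  aMaterialId;
    uniform mat4 uMVP;
    flat out int vMaterialId;          // "flat": no per-fragment interpolation
    void main() {
        vMaterialId = aMaterialId;
        gl_Position = uMVP * aPosition;
    }
)";

const char* fragmentSrc = R"(
    #version 330 core
    flat in int vMaterialId;
    out vec4 fragColor;
    void main() {
        // Branch on a value that is constant across the whole primitive.
        if (vMaterialId == 0)
            fragColor = vec4(1.0, 0.0, 0.0, 1.0);
        else
            fragColor = vec4(0.0, 0.0, 1.0, 1.0);
    }
)";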

Should I use a uniform variable to reduce the amount of matrix multiplication?

I just wrote a program to rotate an object. It updates a variable theta in the idle function, and that variable is used to build a rotation matrix. Then I do this:
gl_Position = rx * ry * rz * vPosition;
rx, ry and rz (matrices) are the same for every vertex during a given frame, but the product is recomputed for every single vertex in the object. Should I instead use a uniform mat4 that stores the precomputed value of rx * ry * rz and pass that to the shader, or let the shader handle the multiplication for every vertex? Which is faster?
While profiling is essential to measure how your application responds to optimizations, in general, passing a concatenated matrix to the vertex shader is desirable. This is for two reasons:
The amount of data passed from CPU to GPU is reduced. If rx, ry and rz are all 4x4 matrices, and their product (say rx_ry_rz = rx * ry * rz) is also a 4x4 matrix, then you transfer two fewer 4x4 matrices (128 bytes) as uniforms each update. If you use this shader to render 1000 objects per frame at 60 Hz, and the uniforms update with each object, that's over 7 MB per second of saved bandwidth. Maybe not extremely significant, but every bit helps, especially if bandwidth is your bottleneck.
The amount of work the vertex stage must do is reduced (assuming a non-trivial number of vertices). Generally the vertex stage is not a bottleneck, however, many drivers implement load balancing in their shader core allocation between stages, so reducing work in the vertex stage could give benefits in the pixel stage (for example). Again, profiling will give you a better idea of if/how this benefits performance.
The drawback is added CPU time taken to multiply the matrices. If your application's bottleneck is CPU execution, doing this could potentially slow down your application, as it will require the CPU to do more work than it did before.
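For illustration, a sketch of the uniform approach using GLM; the uniform name uRotation and the helper below are hypothetical, not from the question. The per-vertex line in the shader then becomes gl_Position = uRotation * vPosition;.

// Build the three rotation matrices on the CPU once per frame (angles in
// radians), multiply them once, and upload a single uniform.
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <glm/gtc/type_ptr.hpp>
#include <GL/glew.h>

void updateRotation(GLuint program, float thetaX, float thetaY, float thetaZ)
{
    glm::mat4 rx = glm::rotate(glm::mat4(1.0f), thetaX, glm::vec3(1, 0, 0));
    glm::mat4 ry = glm::rotate(glm::mat4(1.0f), thetaY, glm::vec3(0, 1, 0));
    glm::mat4 rz = glm::rotate(glm::mat4(1.0f), thetaZ, glm::vec3(0, 0, 1));

    // Two mat4 multiplies per frame on the CPU instead of two per vertex on the GPU.
    glm::mat4 rotation = rx * ry * rz;

    glUseProgram(program);
    glUniformMatrix4fv(glGetUniformLocation(program, "uRotation"),
                       1, GL_FALSE, glm::value_ptr(rotation));
}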
I wouldn't count on this repeated multiplication being optimized out, unless you convinced yourself that it is indeed happening on all platforms you care about. To do that:
One option is benchmarking, but it will probably be difficult to isolate this operation well enough to measure a possible difference reliably.
I believe some vendors provide development tools that let you see assembly code for the compiled shader. I think that's the only reliable way for you to see what exactly happens with your GLSL code in this case.
This is a very typical example of a much larger theme. In my personal opinion, what you have here is code that uses OpenGL inefficiently: performing calculations that are the same for every vertex inside the vertex shader, which at least conceptually is executed once per vertex, is not something you should do.
In reality, driver optimizations to work around inefficient use of the API are done based on the benefit they offer. If a high profile app/game uses certain bad patterns (and many of them do!), and they are identified as having a negative effect on performance, drivers are optimized to work around them, and still provide the best possible performance. This is particularly true if the app/game is commonly used for benchmarks. Ironically, those optimizations may hurt the performance of well written software that is considered less important.
So if there ever was an important app/game that did the same thing you're doing, which seems quite likely in this case, chances are that many drivers will contain optimizations to deal with it efficiently.
Still, I wouldn't rely on it. The reasons are philosophical as well as practical:
If I work on an app, I feel that it's my job to write efficient code. I don't want to write poor code, and hope that somebody else happened to optimize their code to compensate for it.
You can't count on all of the platforms the app will ever run on to contain these types of optimizations. Particularly since app code can have a long lifetime, and those platforms might not even exist yet.
Even if the optimizations are in place, they will most likely not be free. You might trigger driver code that ends up consuming more resources than it would take for your code to provide the combined matrix yourself.

Gaining an understanding of performance implications of shader stages, particularly the GS

I am confused about what's faster versus what's slower when it comes to coding algorithms that execute in the pipeline.
I made a program with a GS that appeared to be bottlenecked by fill rate, because timer queries showed it executing much faster with rasterisation disabled.
So I made a different multi-pass algorithm using transform feedback, still using a GS in every pass, that theoretically does much less work overall by executing in stages, and it significantly reduces the fill rate because it renders far fewer triangles. Yet in my early tests it appears to run slower.
My original thought was that the fill-rate bottleneck was traded for the cost of issuing multiple draw calls. But how expensive is another draw call, really? How much overhead is involved on the CPU and GPU?
Then I read the answer of a different stack question regarding the GS:
No one has ever accused Geometry Shaders of being fast. Especially when increasing the size of geometry.
Your GS is taking a line and not only doing a 30x amplification of vertex data, but also doing lighting computations on each of those new vertices. That's not going to be terribly fast, in large part due to a lack of parallelism. Each GS invocation has to do 60 lighting computations, rather than having 60 separate vertex shader invocations doing 60 lighting computations in parallel.
You're basically creating a giant bottleneck in your geometry shader.
It would probably be faster to put the lighting stuff in the fragment shader (yes, really).
and it makes me wonder how it's possible for a geometry shader to be slower when using it results in less work overall. I know things execute in parallel, but my understanding is that there is only a relatively small pool of shader cores, so starting many more threads than that pool can hold should make the bottleneck roughly proportional to program complexity (instruction count) times the number of threads (using "thread" here to mean a shader invocation). If an instruction can execute once per vertex in the geometry shader instead of once per fragment, why would that ever be slower?
Help me gain a better understanding so I don't waste time designing algorithms that are inefficient.

Why the modelview matrix?

I am sorry if this is a silly question, but I have wondered for a long time why so many example vertex shaders out there contain a modelview matrix. In my program I have the following situation:
the projection matrix hardly ever changes (e.g. on resize of the app window) and it is kept separate, which is fine,
the model matrix changes often (e.g. transforms on the model),
the view matrix changes fairly often as well (e.g. changing the viewing direction, moving around, ...).
If I were to use a modelview matrix in the vertex shader, I'd have to perform a matrix multiplication on the CPU and upload a single matrix. The alternative is uploading both model and view matrices and doing the multiplication on the GPU. The point is, that the view matrix does not necessarily change at the same time as the model matrix, but if one uses a modelview matrix, one has to perform the CPU multiplication and the upload, whenever either of them changes. Why not then use separate view and model matrices for a fast GPU multiplication and probably approximately the same number of GPU matrix uploads?
Because having the matrices multiplied in the vertex shader makes the GPU do the full computation for each and every vertex that goes into it (note that recent GLSL compilers will detect the product to be uniform over all vertices and may move the calculation off the GPU to the CPU).
Also when it comes to performing a single 4×4 matrix computation a CPU actually outperforms a GPU, because there's no data transfer and command queue overhead.
The general rule for GPU computing is: If it's uniform over all vertices, and you can easily precompute it on the CPU, do it on the CPU.
Because you only need to calculate the MV matrix once per model. If you upload the two separately to the GPU, it will do that calculation for every vertex.
Now, if you are CPU bound, it may still be a performance gain: even though you are adding (potentially) thousands of additional matrix multiplications on the GPU, you are pushing them off the CPU. But I'd consider that an optimization rather than a standard technique.
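As an illustration of the trade-off, one possible way to keep a single modelview uniform without re-multiplying more often than necessary is to cache the product and mark it dirty when either input changes. This is only a sketch with made-up names, assuming GLM and a "uniform mat4 uModelView;" in the vertex shader:

#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>
#include <GL/glew.h>

struct ModelViewCache {
    glm::mat4 model{1.0f};
    glm::mat4 view{1.0f};
    glm::mat4 modelView{1.0f};
    bool dirty = true;

    void setModel(const glm::mat4& m) { model = m; dirty = true; }
    void setView(const glm::mat4& v)  { view = v;  dirty = true; }

    void upload(GLuint program) {
        if (dirty) {
            modelView = view * model;   // one CPU multiply per change, not per vertex
            dirty = false;
        }
        glUniformMatrix4fv(glGetUniformLocation(program, "uModelView"),
                           1, GL_FALSE, glm::value_ptr(modelView));
    }
};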

Depth vs Position

I've been reading about reconstructing a fragment's position in world space from a depth buffer, but I was thinking about storing position in a high-precision three channel position buffer. Would doing this be faster than unpacking the position from a depth buffer? What is the cost of reconstructing position from depth?
This question is essentially unanswerable for two reasons:
There are several ways of "reconstructing position from depth", with different performance characteristics.
It is very hardware-dependent.
The last point is important. You're essentially comparing the performance of a texture fetch of a GL_RGBA16F (at a minimum) to the performance of a GL_DEPTH24_STENCIL8 fetch followed by some ALU computations. Basically, you're asking whether the cost of fetching an additional 32 bits per fragment (the difference between the 24x8 fetch and the RGBA16F fetch) is equivalent to the ALU computations.
That's going to change with various things. The performance of fetching memory, texture cache sizes, and so forth will all have an effect on texture fetch performance. And the speed of ALUs depends on how many are going to be in-flight at once (i.e., the number of shading units), as well as clock speeds and so forth.
In short, there are far too many variables here to know an answer a priori.
That being said, consider history.
In the earliest days of shaders, back in the GeForce 3 days, people would need to re-normalize a normal passed from the vertex shader. They did this by using a cubemap, not by doing math computations on the normal. Why? Because it was faster.
Today, there's pretty much no common programmable GPU hardware, in the desktop or mobile spaces, where a cubemap texture fetch is faster than a dot-product, reciprocal square-root, and a vector multiply. Computational performance in the long-run outstrips memory access performance.
So I'd suggest going with history and finding a quick means of computing it in your shader.
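For reference, one common reconstruction of view-space position from depth looks roughly like the following; it assumes the default [0,1] depth range and that uInvProjection is the inverse of the projection matrix used when the depth buffer was rendered. The GLSL is shown as a C++ string literal and all names are illustrative:

const char* reconstructSrc = R"(
    #version 330 core
    uniform sampler2D uDepthTex;
    uniform mat4      uInvProjection;
    in  vec2 vUV;                 // full-screen quad texture coordinate
    out vec4 fragColor;

    vec3 viewPositionFromDepth(vec2 uv) {
        float depth = texture(uDepthTex, uv).r;          // window-space depth in [0,1]
        vec4 ndc    = vec4(uv * 2.0 - 1.0, depth * 2.0 - 1.0, 1.0);
        vec4 view   = uInvProjection * ndc;              // a few ALU ops, no extra fetch
        return view.xyz / view.w;                        // perspective divide
    }

    void main() {
        vec3 posVS = viewPositionFromDepth(vUV);
        fragColor  = vec4(posVS, 1.0);
    }
)";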