Why the modelview matrix? - OpenGL

I am sorry if this is a silly question, but I have wondered for a long time why there are so many example vertex shaders out there, containing a modelview matrix. In my program I have the following situation:
projection matrix hardly ever changes (e.g. on resize of app window) and it is separate, which is fine,
model matrix changes often (e.g. transforms on the model),
view matrix changes fairly often as well (e.g. changing direction of viewing, moving around, ...).
If I were to use a modelview matrix in the vertex shader, I'd have to perform a matrix multiplication on the CPU and upload a single matrix. The alternative is uploading both model and view matrices and doing the multiplication on the GPU. The point is that the view matrix does not necessarily change at the same time as the model matrix, but if one uses a modelview matrix, one has to perform the CPU multiplication and the upload whenever either of them changes. Why not then use separate view and model matrices for a fast GPU multiplication and roughly the same number of matrix uploads?

Because having the matrices multiplied in the vertex shader makes the GPU do the full computation for each and every vertex that goes into it (note that recent GLSL compilers will detect the product to be uniform over all vertices and may move the calculation off the GPU to the CPU).
Also, when it comes to performing a single 4×4 matrix multiplication, a CPU actually outperforms a GPU, because there's no data transfer and command queue overhead.
The general rule for GPU computing is: If it's uniform over all vertices, and you can easily precompute it on the CPU, do it on the CPU.
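For concreteness, here is a minimal sketch of that rule applied to the question above, using GLM on the CPU side; the uniform name u_ModelView and the GLEW/GLM setup are illustrative assumptions, not something from the original posts.

```cpp
// Sketch: precompute the modelview on the CPU and upload it once per draw call,
// assuming a vertex shader with a "u_ModelView" uniform (name made up).
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>
#include <GL/glew.h>

void drawObject(GLuint program, GLuint vao, GLsizei vertexCount,
                const glm::mat4& view, const glm::mat4& model)
{
    // One 4x4 multiply on the CPU per object, instead of one per vertex on the GPU.
    glm::mat4 modelView = view * model;

    glUseProgram(program);
    GLint loc = glGetUniformLocation(program, "u_ModelView"); // uniform name assumed
    glUniformMatrix4fv(loc, 1, GL_FALSE, glm::value_ptr(modelView));

    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}
```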

Because you only need to calculate the MV matrix once per model. If you upload the two separately to the GPU, it will do that calculation for every vertex.
Now, if you are CPU-bound, it may still be a performance gain: even though you are adding (potentially) thousands of additional matrix multiplications, you are pushing them off the CPU. But I'd consider that an optimization rather than a standard technique.

Related

Optimal matrix structure and CPU/GPU communication in modern OpenGL with GLSL

I want to know your thoughts regarding where matrix multiplication should be processed, where to store the results and general communication between the CPU and GPU.
Since the MVP matrix is the same for each object, calculating it on the CPU and then sending it to the shaders seems like the way to go, yes?
If I have multiple shaders, and some need only the model matrix, some need only the MVP matrix, and some need both, how should I proceed?
I currently have a shader object for each vertex/fragment shader pair that stores the location of each uniform in the shaders. This seems like a good practice, yes?
This may depend on how many vertices you process in a single draw call. If you only process a small handful of vertices then you're probably better off sending M, V and P separately, as GPUs tend to be much faster at floating-point operations (most cases wouldn't take this approach though; you'd want to profile with your specific models to be sure). On the other hand, if you're rendering a lot of vertices in each call you may be better off doing the calculation on the CPU once (even though it will be a bit slower) because this means you won't keep recomputing the same value over and over again in the shader.
If you're sending the whole MVP as one matrix then in the case where you need both MVP and M (I think lighting was a case where I ran into this) then just send both, as you don't really have many other options. And if you only need M then only send M.
I don't understand what you mean here (and I don't have enough rep to ask in a comment). If you mean that you externally store the uniform IDs instead of querying them each render loop then yes that is pretty standard procedure.
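As a rough sketch of that pattern, uniform locations can be queried once right after the program is linked and cached for later draws; the struct layout and uniform names below are made up for illustration.

```cpp
// Sketch: query uniform locations once after linking, reuse them every frame.
// Struct and uniform names ("u_MVP", "u_Model") are placeholders.
#include <GL/glew.h>

struct ShaderProgram {
    GLuint id       = 0;
    GLint  locMVP   = -1;
    GLint  locModel = -1;
};

ShaderProgram makeProgram(GLuint linkedProgram)
{
    ShaderProgram p;
    p.id       = linkedProgram;
    p.locMVP   = glGetUniformLocation(linkedProgram, "u_MVP");   // -1 if the shader doesn't use it
    p.locModel = glGetUniformLocation(linkedProgram, "u_Model");
    return p;
}

// Per draw call there are no string lookups, just uploads to the cached locations.
// glUniform* calls silently ignore location -1, so shaders that only need one of
// the matrices work with the same code path.
```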
More generally when grappling with performance questions like these your best bet is to use a profiler (something graphics specific like GPUPerfStudio would be good for this sort of problem) and find out which method is faster in your specific case, as many performance tweaks have varying degrees of usefulness depending on the more specific scenarios.

Depth vs Position

I've been reading about reconstructing a fragment's position in world space from a depth buffer, but I was thinking about storing position in a high-precision three channel position buffer. Would doing this be faster than unpacking the position from a depth buffer? What is the cost of reconstructing position from depth?
This question is essentially unanswerable for two reasons:
There are several ways of "reconstructing position from depth", with different performance characteristics.
It is very hardware-dependent.
The last point is important. You're essentially comparing the performance of a texture fetch of a GL_RGBA16F (at a minimum) to the performance of a GL_DEPTH24_STENCIL8 fetch followed by some ALU computations. Basically, you're asking if the cost of fetching an additional 32 bits per fragment (the difference between the 24+8-bit fetch and the RGBA16F fetch) is equivalent to the ALU computations.
That's going to change with various things. The performance of fetching memory, texture cache sizes, and so forth will all have an effect on texture fetch performance. And the speed of ALUs depends on how many are going to be in-flight at once (ie: number of shading units), as well as clock speeds and so forth.
In short, there are far too many variables here to know an answer a priori.
That being said, consider history.
In the earliest days of shaders, back in the GeForce 3 days, people would need to re-normalize a normal passed from the vertex shader. They did this by using a cubemap, not by doing math computations on the normal. Why? Because it was faster.
Today, there's pretty much no common programmable GPU hardware, in the desktop or mobile spaces, where a cubemap texture fetch is faster than a dot-product, reciprocal square-root, and a vector multiply. Computational performance in the long-run outstrips memory access performance.
So I'd suggest going with history and finding a quick means of computing it in your shader.
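For reference, here is a sketch of one of the cheaper reconstruction variants (inverse projection applied to NDC coordinates), written with GLM so the math is explicit; in a real renderer the same arithmetic would live in the fragment shader. The assumptions are the default [0, 1] depth range and that view-space position is wanted (multiply by the inverse view matrix as well if you need world space).

```cpp
// Sketch: reconstruct view-space position from a sampled depth value.
// depth is the value read from the depth buffer in [0, 1], uv is the
// fragment's texture coordinate in [0, 1].
#include <glm/glm.hpp>

glm::vec3 viewPosFromDepth(float depth, glm::vec2 uv, const glm::mat4& invProjection)
{
    // Back to normalized device coordinates in [-1, 1].
    glm::vec4 ndc(uv * 2.0f - 1.0f, depth * 2.0f - 1.0f, 1.0f);

    // Undo the projection, then the perspective divide.
    glm::vec4 view = invProjection * ndc;
    return glm::vec3(view) / view.w;
}
```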

OpenGL Shaders - Should the camera translation happen on the GPU or the CPU?

So currently what I am doing is: before loading my elements onto a VBO, I create a new matrix and add them to it. I do that so I can play with the matrix as much as I want.
So what I did is just add the camera position onto the coordinates in the matrix.
Note: the actual position of the objects is saved elsewhere; the matrix is a translation stage.
Now, this works, but I am not sure if it's correct or if I should translate to the camera location in the shaders instead of on the CPU.
So this is my question:
Should the camera translation happen on the GPU or the CPU?
I am not entirely sure what you are currently doing. But the sane way of doing this is to not touch the VBO. Instead, pass one or more transformation matrices as uniforms to your vertex shader and perform the matrix multiplication on the GPU.
Changing your VBO data on the CPU is insane, it means either keeping a copy of your vertex data on the CPU, iterating over it and uploading or mapping the buffer and iterating over it. Either way, it would be insanely slow. The whole point of having a VBO is so you can upload your vertex data once and work concurrently on the CPU while the GPU buggers off and does its thing with said vertex data.
Instead, you just store your vertices once in the vertex buffer, preferably in object space (just for sanity's sake). Then you keep track of a transformation matrix for each object which transforms the vertices from the object's space to clip space. You pass that matrix to your vertex shader and do the multiplications for each vertex on the GPU.
Obviously the GPU is multiplying every vertex by at least one matrix each frame. But the GPU has parallel hardware which does matrix multiplication insanely fast. So especially when your matrices constantly change (e.g. your objects move) this is much faster than doing it on the CPU and updating a massive buffer. Besides, you free your CPU to do other things like physics or audio or whatever.
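A minimal sketch of this setup, assuming GLM and a single combined u_MVP uniform (both are illustrative choices, not something the answer prescribes): the vertex data stays in object space in the VBO, and only a per-object matrix is uploaded each frame. Moving the camera changes the view matrix on the CPU; the buffer is never touched.

```cpp
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>
#include <GL/glew.h>
#include <vector>

// Vertex shader (GLSL), compiled once at startup:
static const char* kVertexShader = R"(
    #version 330 core
    layout(location = 0) in vec3 aPosition;   // object-space position from the VBO
    uniform mat4 u_MVP;                        // projection * view * model
    void main() { gl_Position = u_MVP * vec4(aPosition, 1.0); }
)";

struct Object { GLuint vao; GLsizei count; glm::mat4 model; };

void drawScene(GLuint program, GLint locMVP,
               const glm::mat4& projection, const glm::mat4& view,
               const std::vector<Object>& objects)
{
    glUseProgram(program);
    for (const Object& obj : objects) {
        glm::mat4 mvp = projection * view * obj.model;   // once per object, on the CPU
        glUniformMatrix4fv(locMVP, 1, GL_FALSE, glm::value_ptr(mvp));
        glBindVertexArray(obj.vao);
        glDrawArrays(GL_TRIANGLES, 0, obj.count);
    }
}
```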
Now I can imagine you would want to NOT do this if your object never moves, however, GPU matrix multiplication is probably about the same speed as a CPU float multiplication (I don't know specifics). So it is questionable if having more shaders for static objects is worth it.
Summary:
Updating buffers on the CPU is SLOW.
Matrix multiplication on the GPU is FAST.
No buffer update? = free up the CPU.
Multiplications on the GPU? = easy and fast to move objects (just change the matrix you upload).
Hope this somewhat helped.

Should I calculate matrices on the GPU or on the CPU?

Should I prefer to calculate matrices on the CPU or GPU?
Let's say I have the following matrices P * V * M. Should I calculate them on the CPU so that I can send the final matrix to the GPU (GLSL), or should I send those three matrices separately to the GPU so that GLSL can calculate the final matrix?
I mean in this case GLSL would have to calculate the MVP matrix for every vertex, so it is probably faster to precompute it on the CPU.
But let's say that GLSL only has to calculate the MVP matrix once; would the GPU calculate the final matrix faster than the CPU?
General rule: If you can pass it to a shader in form of a uniform, always precalculate on the CPU; no exceptions. Calculations on the shader side make sense only for values that vary between vertices and fragments. Everything that's constant among a whole batch of vertices is most efficiently dealt with on the CPU.
GPUs are not magic "can do faster everything" machines. There are certain tasks where a CPU can easily outperform a GPU, even for very large datasets. So a very simple guideline is: If you can move it to the CPU without spending more CPU time doing the calculation than it takes for the GPU in total overhead to process it, then do it on the CPU. The calculation of a single matrix is among those tasks.
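To make the trade-off concrete, here are two illustrative vertex-shader variants (uniform names are made up); in the first, three matrices are uploaded and the product is formed in the shader, in the second the product is computed once on the CPU and uploaded as a single uniform.

```cpp
// Variant 1: the product is (in principle) evaluated for every vertex.
static const char* kPerVertexProduct = R"(
    #version 330 core
    layout(location = 0) in vec3 aPosition;
    uniform mat4 u_P, u_V, u_M;
    void main() {
        // Same result for every vertex of the draw call, recomputed per vertex
        // unless the driver happens to hoist it.
        gl_Position = (u_P * u_V * u_M) * vec4(aPosition, 1.0);
    }
)";

// Variant 2: the product is computed once on the CPU per draw call.
static const char* kPrecomputedProduct = R"(
    #version 330 core
    layout(location = 0) in vec3 aPosition;
    uniform mat4 u_MVP;   // projection * view * model, computed on the CPU
    void main() { gl_Position = u_MVP * vec4(aPosition, 1.0); }
)";
```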
Like most situations with OpenGL, it depends.
In most cases, a single calculation can be done faster on the CPU than on the GPU. The GPU's advantage is that it can do lots of calculations in parallel.
On the other hand, it also depends where your bottlenecks are. If your CPU is doing lots of other work, but your shaders are not a bottleneck yet on the lowest-powered target system, then you could easily see some performance improvement by moving some matrix multiplications to the vertex shader.
Generally, you should avoid any work in the fragment shader that could also be done in the vertex shader or on the CPU, but beyond that, it depends on the situation. Unless you are running into performance issues, just do it whatever way is easiest for you, and if you are having performance issues, do it both ways and profile the performance to see which works better.

Why Do I Need to Convert Quaternion to 4x4 Matrix When Uploading to the Shaders?

I have read several tutorials about skeletal animation in OpenGL, and they all seem to be single-minded in using quaternions for rotation and a 3D vector for translation, rather than matrices.
But when they come to the vertex skinning process, they combine all of the quaternions and 3D vectors into 4×4 matrices and upload the matrices to do the rest of the calculations in shaders. A 4×4 matrix has 16 elements while a quaternion plus a 3D vector has only 7. So why are we converting these to 4×4 matrices before uploading?
Because with only two 4×4 matrices, one for each bone a vertex is assigned and weighted to, you have to do only two 4-vector 4×4-matrix multiplications and a weighted sum.
In contrast to this, if you submitted them as a separate quaternion and translation, you'd have to do the equivalent of two 3-vector 3×3-matrix multiplications plus four 3-vector additions and a weighted sum. Either you first convert your quaternion into a rotation matrix and then do a 3-vector 3×3-matrix multiplication, or you do a direct quaternion-vector multiplication; the computational effort is about the same. And after that you still have to postmultiply with the modelview matrix.
It's perfectly possible to use a 4-element vector uniform as a quaternion, but then you have to chain a lot of computations in the vertex shader: first rotate the vertex by the two quaternions, then translate it, and then multiply it with the modelview matrix. By simply uploading two transformation matrices which are weighted in the shader, you save a lot of computations on the GPU. Doing the quaternion-to-matrix calculation on the CPU performs it only once per bone, whereas doing it in the shader performs it for each single vertex. GPUs are great if you have to do a lot of identical computations with varying input data. But they suck if you have to calculate only a handful of values which are then reused over large amounts of data. CPUs, however, love this kind of task.
The nice thing about homogeneous transformations represented by 4×4 matrices is that a single matrix can contain a whole transformation chain. If you separate rotations and translations, you have to perform the whole chain of operations in order. With only one rotation and one translation it's fewer operations than a single 4×4 matrix transform; add one more transformation and you've reached the break-even point.
The transformation matrices, even in a skeletal pose applied to a mesh, are identical for all vertices. Say the mesh has 100 vertices around a pair of bones (this is a small number, BTW); then you'd have to do the computations outlined above for each and every vertex, wasting precious GPU computation cycles. And for what? To determine some 32 scalar values (or 8 4-vectors). Now compare this: 100 4-vectors (if you only consider vertex position) vs. only 8. This is the order of magnitude of the calculation overhead imposed by processing quaternion poses in the shader. Compute it once on the CPU and give it to the GPU precalculated, to share among the primitives. If you code it right, the whole calculation of a single matrix column will nicely fit into the CPU's pipeline, making it vastly outperform every attempt at parallelizing it. Parallelization doesn't come for free!
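A sketch of what that CPU-side precalculation might look like with GLM; the BonePose struct, the u_Bones uniform array, and the per-frame upload are assumptions for illustration, not details from the answer.

```cpp
// Sketch: each bone's quaternion + translation is converted to a single 4x4 matrix
// once per bone per frame, then the whole array is uploaded. The shader then only
// does two weighted matrix transforms per vertex.
#include <glm/glm.hpp>
#include <glm/gtc/quaternion.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <glm/gtc/type_ptr.hpp>
#include <GL/glew.h>
#include <vector>

struct BonePose {
    glm::quat rotation;
    glm::vec3 translation;
};

void uploadBoneMatrices(GLuint program, const std::vector<BonePose>& pose)
{
    if (pose.empty()) return;

    std::vector<glm::mat4> matrices;
    matrices.reserve(pose.size());
    for (const BonePose& bone : pose) {
        // Rotation and translation combined into one homogeneous transform,
        // done once per bone on the CPU rather than once per vertex on the GPU.
        matrices.push_back(glm::translate(glm::mat4(1.0f), bone.translation)
                           * glm::mat4_cast(bone.rotation));
    }

    GLint loc = glGetUniformLocation(program, "u_Bones"); // assumed: uniform mat4 u_Bones[N]
    glUniformMatrix4fv(loc, static_cast<GLsizei>(matrices.size()),
                       GL_FALSE, glm::value_ptr(matrices[0]));
}
```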
In modern GPUs there is no restriction to what data format you upload to constant buffers.
Of course you need to write your vertex shader differently in order to use quaternions for skinning instead of matrices. In fact, we are using dual quaternion skinning in our engine.
Note that older fixed function hardware skinning indeed only worked with matrices, but that was a long time ago.