Matrix math on CPU or GPU for common 3D operations

Is there any common wisdom about how much matrix math should be done on the CPU vs the GPU for common 3D operations?
A typical 3D shader potentially needs several matrices: a world matrix for surface-to-light calculations, a world inverse transpose matrix for normal calculations, a world view projection matrix for 3D projection, etc.
There are 2 basic ways to approach this.
1. Calculate the matrices on the CPU and upload the computed matrices to the GPU
In some CPU language
worldViewProjection = world * view * projection
worldInverseTranspose = transpose(inverse(world));
upload world, worldViewProjection, worldInverseTranspose to GPU
on GPU use world, worldViewProjection, worldInverseTranspose where needed
2. Pass the various component matrices to the GPU (world, view, projection) and compute the needed matrices on the GPU
In some CPU language
upload world, view, projection to GPU
On GPU
worldViewProjection = world * view * projection
worldInverseTranspose = transpose(inverse(world));
I understand that at some level I probably just have to profile on different machines and GPUs, and that drawing a million vertices in 1 draw call might have different needs than drawing 4 vertices in 1 draw call, but I'm wondering:
Is there any common wisdom about when to do math on the GPU vs the CPU for matrix calculations?
Another way to ask this question is: what should my default be, #1 or #2 above, after which I can later profile for those cases where the default is not the best performer?

When deciding on CPU / GPU compute, the issue is not calculation, but streaming.
GPU calculation is very cheap. Since your calculation world * view * projection involves only uniforms, it's likely that this will be optimised.
However, if you choose to calculate on the GPU, then world, view and projection have to be streamed as individual uniform matrices. This takes more time than streaming a single matrix, and it also uses up more uniform components within your shader.
Note that the streaming time for uniforms is minimal when compared to texture data or buffer data. You're unlikely to hit performance limits because of it, and if you do then it's easy to optimise.
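For reference, approach #1 then boils down to something like this (a sketch using GLM; the uniform locations and variable names are illustrative, not from the question):

#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>
// (GL headers/loader and a bound shader program are assumed)

// Compute the derived matrices once on the CPU. Note: GLM uses the
// column-vector convention, so the composition reads right-to-left.
glm::mat4 worldViewProjection   = projection * view * world;
glm::mat4 worldInverseTranspose = glm::transpose(glm::inverse(world));

// Stream each result as a single uniform; the locations are assumed
// to have been queried with glGetUniformLocation beforehand.
glUniformMatrix4fv(worldLoc, 1, GL_FALSE, glm::value_ptr(world));
glUniformMatrix4fv(wvpLoc,   1, GL_FALSE, glm::value_ptr(worldViewProjection));
glUniformMatrix4fv(witLoc,   1, GL_FALSE, glm::value_ptr(worldInverseTranspose));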

Related

OpenGL 3.0+: How to efficiently draw hierarchical (i.e. chained transform-matrix) meshes with shaders?

This is a sample 3D scene:
Mesh A is the parent of Mesh B (parented as in a 3D modeling program, e.g. Maya or Blender).
The transformation matrices of Mesh A and Mesh B are MA and MB.
In old OpenGL, Mesh A and Mesh B can be drawn by:
glLoadIdentity();
glMultMatrixf(MA);
MeshA.draw();
glMultMatrixf(MB);
MeshB.draw();
In the new shader-based OpenGL 3.0+, they can be drawn by:
shader.bind();
passToShader(MA);
MeshA.draw();
passToShader(MA*MB);
MeshB.draw();
The shader is:
uniform mat4 multiplicationResult;
gl_Position = multiplicationResult * meshPosition;
When MA is changed in a timestep: in the old way, only MA has to be recomputed. But in the new way, using shaders, the whole product MA x MB has to be recomputed on the CPU.
The problem becomes severe in scenes where the mesh hierarchy (parenting) is deep (e.g. 5 levels) and has many branches (e.g. one Mesh A has many Mesh Bs): the CPU has to recompute the whole MA x MB x MC x MD x ME for every related Mesh E, even when only a single MA has changed.
How to optimize it? Or is it the way to go?
My poor solutions:
Add more slots in the shader, like this:
uniform mat4 MA;
uniform mat4 MB;
uniform mat4 MC;
uniform mat4 MD;
uniform mat4 ME;
gl_Position = MA * MB * MC * MD * ME * meshPosition;
But the shader would never know how many MX would be enough. It is hard-coded, wastes GPU resources for shallow hierarchies, lowers maintainability and doesn't support more complex scenes.
use compatibility context - not a good practice
But in the new way, using shaders, the whole product MA x MB has to be recomputed on the CPU.
What did you think that glMultMatrix was doing? It too was computing MA x MB. And that computation almost certainly happened on the CPU.
What you want is a matrix stack that works like OpenGL's matrix stack. So... just write one. There's nothing magical about what OpenGL was doing. You can write a data type that mirrors OpenGL's matrix operations, then pass it around when you render.
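A minimal sketch of such a stack, assuming some 4x4 `matrix` type whose default constructor yields the identity:

#include <vector>

// Mirrors OpenGL's glPushMatrix / glPopMatrix / glMultMatrix semantics.
class MatrixStack
{
    std::vector<matrix> stack{ matrix() };  // starts with the identity
public:
    void push()                { stack.push_back(stack.back()); }
    void pop()                 { stack.pop_back(); }
    void mult(const matrix &m) { stack.back() = stack.back() * m; }
    const matrix &top() const  { return stack.back(); }
};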
Alternatively, you can just use the C++ stack:
void render(const matrix &parent)
{
    // Accumulate this node's local transform onto the parent's matrix.
    matrix me = parent * my_transform;
    passToShader(me);
    my_mesh.draw();
    // Recurse: each child sees this node's full matrix as its parent.
    for (auto &child : children)
        child.render(me);
}
There, problem solved. Each child of an object receives its parent matrix, which it uses to compute its own full modelview matrix.
I hope to use something faster because they are "relatively-static" objects.
OK, let's do a full performance analysis of this.
The code I posted above does the exact same number of CPU matrix multiplications as the glMultMatrix version. So your code is as fast now as it used to be (give or take).
So, let's consider the case where you minimize the number of matrix multiples you do on the CPU. Right now, you're doing one matrix multiplication per-object. Instead, let's do no matrix multiplications per object.
So let's say your shader has 4 matrix uniforms (whether an array of 4 matrices or just 4 separate uniforms, it doesn't matter). So you're limited to a maximum stack depth of 4, but never mind that now.
This way, you only change the matrices that change. So if a parent matrix changes, the child matrix doesn't have to be recomputed.
OK... so what?
You still have to give that child matrix to the shader. So you're still paying the price of changing program uniform state. You're still uploading 16 floats to the shader per-object.
Not only that, consider what your vertex shader has to do now. It must perform 4 vector/matrix multiplications. And it must do this for every single vertex of every single object. After all, the shader doesn't know which of those matrices are empty and which ones aren't. So it must assume that they all have data and it must therefore multiply against them all.
So the question is, which is faster:
A single matrix multiplication per object on the CPU, or
3 extra vector/matrix multiplications for every vertex on the GPU (4 in total, but you need to do at least one anyway).
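To make the second option concrete, here is roughly what that vertex shader would look like (a sketch; the uniform array name and the fixed depth of 4 are assumptions):

// Hypothetical vertex shader for the "no CPU multiplies" variant:
// every vertex pays for four mat4 * vec4 multiplications, whether or
// not the deeper stack levels hold anything but the identity.
const char *stackedVS = R"(
    #version 330 core
    uniform mat4 stack[4];   // fixed-depth matrix stack, identity when unused
    in vec4 meshPosition;
    void main()
    {
        // Parenthesized so each step is a vector/matrix multiply.
        gl_Position = stack[0] * (stack[1] * (stack[2] * (stack[3] * meshPosition)));
    }
)";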

Performance of using OpenGL UBOs for camera matrices

I've read about how to use UBOs in OpenGL from this tutorial, in which it is suggested that both the projection matrix and the camera (view) matrix be stored in a single UBO, shared between shaders, like this:
layout(std140) uniform global_transforms
{
    mat4 camera_orientation;
    mat4 camera_projection;
};
But it seems to me that it would make more sense to store the product of the projection and view matrices in the UBO, so that the matrix multiplication doesn't have to occur once for every vertex shader invocation. You'd be sending the same amount of data to the buffer each step, but trading potentially many matrix multiplications on the GPU for just one on the CPU.
My question: am I right in thinking that this would be even just a wee bit more performant? Perhaps the shader compiler is smart enough to perform operations involving only uniforms once per draw?
(I'd just test it myself with a few thousand polygons, but it's my first time working with a programmable pipeline and I haven't quite gotten UBOs working yet :P)
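For reference, the CPU side of the suggested approach would look something like this (a sketch using GLM; the UBO handle is hypothetical and assumed to be created, sized and bound to the block's binding point already):

#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

// Once per frame: one multiplication on the CPU instead of one per vertex.
glm::mat4 viewProjection = camera_projection * camera_orientation;

// Upload the combined matrix into the UBO at offset 0.
glBindBuffer(GL_UNIFORM_BUFFER, globalTransformsUBO);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(glm::mat4),
                glm::value_ptr(viewProjection));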

Moving/rotating shapes in the vertex shader

I'm writing a program that draws a number of moving/rotating polygons using OpenGL. Each polygon has a location in world coordinates while its vertices are expressed in local coordinates (relative to polygon location). Each polygon also has a rotation.
The only way I can think of doing this is to calculate the vertex positions by translation/rotation each frame and push them to the GPU to be drawn, but I was wondering if this could be performed in the vertex shader.
I thought I might express vertex locations in local coordinates and then add location and rotation attributes to each vertex, but then it occurred to me that this wouldn't be any better than pushing new vertex positions each frame.
Should I do this kind of calculation on the CPU, or is there a way to do it efficiently in the vertex shader?
The vertex shader is indeed responsible for transforming your geometry. However, the vertex shader is run for every single vertex of your scene. If you do these transformations inside the vertex shader, you'll do the same calculation over and over again, yielding the same result every time (as opposed to simply multiplying the model-view-projection matrix with the vertex coordinate). So in terms of efficiency you're best off doing that on the CPU side.
If the models are small, like in your case, I don't expect there to be too much of a difference, because you still have to set the coordinates where the polygons are supposed to be drawn somehow. In this case doing the calculations once on the CPU side is still best: the calculation is done once, independent of the vertex count of your polygons, and it will probably result in clearer code since it's easier to see what you're doing.
These calculations are usually done on the CPU, and doing them on the CPU is efficient in general. Your best bet is to compute the rotation matrices on the CPU, send them in as uniforms, and do the per-vertex multiplication on the GPU. Sending uniforms is not a very expensive operation in general, so you shouldn't be worrying about that.
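Put together, that suggestion might look like this per polygon (a sketch using GLM; polygon.position, polygon.angle and modelLoc are made-up names):

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <glm/gtc/type_ptr.hpp>

// Rebuild the polygon's model matrix from its world position and rotation
// once per frame on the CPU; the local-space vertices stay untouched.
glm::mat4 model = glm::translate(glm::mat4(1.0f), polygon.position)
                * glm::rotate(glm::mat4(1.0f), polygon.angle,
                              glm::vec3(0.0f, 0.0f, 1.0f));

// One cheap uniform upload per polygon; the vertex shader then does a
// single mat4 * vec4 multiply per vertex.
glUniformMatrix4fv(modelLoc, 1, GL_FALSE, glm::value_ptr(model));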

Storing and loading matrices in OpenGL

Is it possible to tell OpenGL to store the current transformation matrix in a specific location (instead of pushing it on a stack) and to load a matrix from a specific location?
I prefer a solution that does not involve additional data transfer between the video device and main memory (i.e. it's better to store the matrix somewhere in video memory).
To answer the first part of your question:
The functions glLoadMatrix and glGet can do that:
// get the current modelview matrix
double matrix[16];
glGetDoublev(GL_MODELVIEW_MATRIX, matrix);
// manipulate matrix
glLoadMatrixd(matrix);
Note that these functions are not supported in OpenGL 4 (core profile) anymore. Matrix operations have to be done on the application side anyway and provided as uniform variables to the shader programs.
In the old fixed-function pipeline, the matrices were loaded into special GPU registers on demand, but they never resided in VRAM; all matrix calculations happened on the CPU. In modern OpenGL, matrix calculations still happen on the CPU (and rightly so), and the results are again loaded into registers, called "uniforms", on demand. However, modern OpenGL also has a feature called "Uniform Buffer Objects" that allows uniform (register) values to be loaded from VRAM. http://www.opengl.org/wiki/Uniform_Buffer_Object
But they are of little use for storing transform matrices. First, you'll change them constantly for animations. Second, the overhead of managing the UBOs for just a single matrix eats much more performance than setting it from the CPU. A matrix is just 16 scalars, the equivalent of a single vertex with position, normal, texture coordinate and tangent attributes.

Why would it be beneficial to have a separate projection matrix, yet combine model and view matrix?

When you are learning 3D programming, you are taught that it's easiest to think in terms of 3 transformation matrices:
The Model Matrix. This matrix is individual to every single model and it rotates and scales the object as desired and finally moves it to its final position within your 3D world. "The Model Matrix transforms model coordinates to world coordinates".
The View Matrix. This matrix is usually the same for a large number of objects (if not for all of them) and it rotates and moves all objects according to the current "camera position". If you imagine that the 3D scene is filmed by a camera and that what is rendered on the screen are the images captured by this camera, the location of the camera and its viewing direction define which parts of the scene are visible and how the objects appear on the captured image. There is little reason to change the view matrix while rendering a single frame, but such reasons do in fact exist (e.g. by rendering the scene twice and changing the view matrix in between, you can create a very simple, yet impressive mirror within your scene). Usually the view matrix changes only once between two frames being drawn. "The View Matrix transforms world coordinates to eye coordinates".
The Projection Matrix. The projection matrix decides how those 3D coordinates are mapped to 2D coordinates, e.g. whether there is a perspective applied to them (objects get smaller the farther they are away from the viewer) or not (orthogonal projection). The projection matrix hardly ever changes at all. It may have to change if you are rendering into a window and the window size changes, or if you are rendering full screen and the resolution changes, but only if the new window size/screen resolution has a different display aspect ratio than before. There are some crazy effects for which you may want to change this matrix, but in most cases it's pretty much constant for the whole life of your program. "The Projection Matrix transforms eye coordinates to screen coordinates".
This all makes a lot of sense to me. Of course one could always combine all three matrices into a single one, since multiplying a vector first by matrix A and then by matrix B is the same as multiplying the vector by matrix C, where C = B * A.
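In code, with a library like GLM, that composition might look as follows (a sketch; the camera and window parameters are made up):

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

glm::mat4 model      = glm::translate(glm::mat4(1.0f), objectPosition); // model -> world
glm::mat4 view       = glm::lookAt(cameraPos, cameraTarget,             // world -> eye
                                   glm::vec3(0.0f, 1.0f, 0.0f));
glm::mat4 projection = glm::perspective(glm::radians(60.0f),            // eye -> clip
                                        aspectRatio, 0.1f, 100.0f);

// C = B * A: with column vectors the combined matrix applies the model
// matrix first, then the view matrix, then the projection.
glm::mat4 modelViewProjection = projection * view * model;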
Now if you look at classical OpenGL (OpenGL 1.x/2.x), OpenGL knows a projection matrix. Yet OpenGL does not offer a model or a view matrix; it only offers a combined model-view matrix. Why? This design forces you to permanently save and restore the "view matrix", since it will get "destroyed" by the model transformations applied to it. Why aren't there three separate matrices?
If you look at the new OpenGL versions (OpenGL 3.x/4.x) and you don't use the classical render pipeline but customize everything with shaders (GLSL), there are no built-in matrices available any longer at all; you have to define your own matrices. Still, most people keep the old concept of a projection matrix and a model-view matrix. Why would you do that? Why not use either three matrices, which means you don't have to permanently save and restore the model-view matrix, or a single combined model-view-projection (MVP) matrix, which saves you a matrix multiplication in your vertex shader for every single vertex rendered (after all, such a multiplication doesn't come for free either)?
So to summarize my question: what advantage does a combined model-view matrix together with a separate projection matrix have over three separate matrices or a single MVP matrix?
Look at it practically. First, the fewer matrices you send, the fewer matrices you have to multiply with positions/normals/etc. And therefore, the faster your vertex shaders.
So point 1: fewer matrices is better.
However, there are certain things you probably need to do. Unless you're doing 2D rendering or some simple 3D demo-applications, you are going to need to do lighting. This typically means that you're going to need to transform positions and normals into either world or camera (view) space, then do some lighting operations on them (either in the vertex shader or the fragment shader).
You can't do that if you only go from model space to projection space. You cannot do lighting in post-projection space, because that space is non-linear. The math becomes much more complicated.
So, point 2: You need at least one stop between model and projection.
So we need at least 2 matrices. Why model-to-camera rather than model-to-world? Because working in world space in shaders is a bad idea. You can encounter numerical precision problems related to translations that are distant from the origin. Whereas, if you worked in camera space, you wouldn't encounter those problems, because nothing is too far from the camera (and if it is, it should probably be outside the far depth plane).
Therefore: we use camera space as the intermediate space for lighting.
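In practice that means something like the following on the CPU side (a sketch with GLM; the normal matrix is derived from the model-view matrix):

#include <glm/glm.hpp>

// Combine model and view once per object on the CPU; positions and
// normals can then be transformed into camera (eye) space for lighting.
glm::mat4 modelView    = view * model;
glm::mat3 normalMatrix = glm::transpose(glm::inverse(glm::mat3(modelView)));
// modelView, normalMatrix and projection are uploaded as separate
// uniforms; the projection is applied last, in the vertex shader.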
In most cases your shader will need the geometry in world or eye coordinates for shading, so you have to separate the projection matrix from the model and view matrices.
Making your shader multiply the geometry by two matrices hurts performance. Assuming each model has thousands (or more) of vertices, it is more efficient to compute a model-view matrix on the CPU once and let the shader do one less matrix-vector multiplication.
I have just solved a z-buffer fighting problem by separating the projection matrix. There is no visible increase in GPU load. Two screenshots (not reproduced here) showed the two results - pay attention to the green and white layers fighting.