The purpose of Model View Projection Matrix - opengl

For what purposes are we using Model View Projection Matrix?
Why do shaders require Model View Projection Matrix?

The model, view and projection matrices are three separate matrices. Model maps from an object's local coordinate space into world space, view from world space to camera space, projection from camera to screen.
If you compose all three, you can use the one result to map all the way from object space to screen space, making you able to work out what you need to pass on to the next stage of a programmable pipeline from the incoming vertex positions.
In the fixed functionality pipelines of old, you'd apply model and view together, then work out lighting using another result derived from them (with some fixes so that e.g. normals are still unit length even if you've applied some scaling to the object), then apply projection. You can see that reflected in OpenGL, which never separates the model and view matrices — keeping them as a single modelview matrix stack. You therefore also sometimes see that reflected in shaders.
So: the composed model view projection matrix is often used by shaders to map from the vertices you loaded for each model to the screen. It's not required, there are lots of ways of achieving the same thing, it's just usual because it allows all possible linear transforms. Because of that, a lesser composed version of it was also the norm in ye olde fixed pipeline world.

Because matrices are convenient. Matrices help to convert locations/directions with respect to different spaces (A space can be defined by 3 perpendicular axes and an origin).
Here is an example from a book specified by #legends2k in comments.
The residents of Cartesia use a map of their city with the origin
centered quite sensibly at the center of town and axes directed along
the cardinal points of the compass. The residents of Dyslexia use a
map of their city with the coordinates centered at an arbitrary point
and the axes running in some arbitrary directions that probably seemed
a good idea at the time. The citizens of both cities are quite happy
with their respective maps, but the State Transportation Engineer
assigned a task of running up a budget for the first highway between
Cartesia and Dyslexia needs a map showing the details of both cities,
which therefore introduces a third coordinate system that is superior
to him, though not necessarily to anybody else.
Here is another example,
Assume that you have created a car object in a game with it's vertex positions using world's co-ordinates. Suppose you have to use this same car in some other game in an entirely different world, you have to define the positions again and the calculations will go complex. This is because you again have to calculate the positions of window, hood, headlight, wheels etc., in the car with respect to new world.
See this video to understand the concepts of model, view and projection. (highly recommended)
Then see this to understand how the vertices in the world are represented as Matrices and how they are transformed.

Related

Transformation of vertex data. Mouse picking algorithm

In my program I have a gizmo wich moves any objects in the scene. As I already know, is the usual way of storing any transformations is store that transformation in model matrices of this objects and execute any transformation directly in the shader. BUT also in my program I implement a classic ray-picking algorythm wich works only with a real transformed data. A ray detect any intersection with real(transformed) vertex position. How is the common way to solve this conflict:
Multiply any transformation immediatly on CPU and store transformed data. I think it's a clear way but it's expensive: for example I drag my object on screen in during 100 frames, and each frame I convert the delta of moving to matrices and multiply whole data by it.
Store any transformation in matrices until the mouse picking will starts and then quick multiply verticies by matrix to prepare data for picking. This is very fancy but there is ways to optimize it.
Which is the more performant way. Maybe there is some other method?
Update for Robinson:
I think you misunderstood me. Or I did not fully understand you. I have a box and a sphere and I move it by gizmo (I edit their model matrix) on 1,0,0 and 0,1,0 respectively. His model matrix is now different. HERE I get data that I need for ray-picking - ever objects has own individual place.
Then I transform the entire scene to eye space (view matrix) and then to clip space (projetion matrix) and render it. My ray makes return journey from viewport to world space (unproject a view and a projection matrix) and should interacts with the actual scene. My ray transformed rather than scene!
My question was how to interact with the objects wich the real place is unknown until it's will render (or transformed)? Or may be I'm not on the right track and I should have done it differently - multiply entire data each step (it's expensive, look at my first question).
You use ray-picking which technically is "get x,y screen coordinates, transform them to NDC and set the z as anyone in the range [-1,1]; and finally transform them all back to world coordinates".
This is useful when you want to intersect a ray from the point of view (the camera) to "mouse coordinates" AND you want to do all of this intersection calculations on CPU side.
Notice you can do it even when nothing is drawn in the screen, just mouse coordinates are needed; well, plus the viewport and the current transformations, but you know them before any glDrawxxx command.
Next, the question is: what are you going to do with that ray or intersections?
You may wish to modify some property (like color) or position. Right?
How many objects are to be modified? If it's just a bunch then it's OK to do it on CPU modifying the data to send to GPU. But if you have thousands of objects then think of the hardware-accelerated way: keep their coordinates but send the new tranformation matrices and properties to GPU and let it do the hard work.
If you are worried about some objects stay as before but others get modified, remember that you can draw groups of objects that share the matrices and other uniforms with a single glDrawxxx call. If you have different groups, use several glDrawxxx calls with different uniforms, even different shaders.

Why are model and view matrices specifically often combined together?

I'm having trouble understanding why the model and view matrices are traditionally combined together. I know that the less matrix multiplying you do in the vertex shader the better, but it makes much more sense to me to combine the projection and view matrices.
This is because they are both intrinsically camera properties. It makes sense to me to first transform vertices into world space with the model matrix, perform lighting etc., then use your combined camera matrix to translate to normalised clip space.
I know that I can do it that way if I want in a programmable pipeline, but I want to know why historically people combined the model and view matrices.
In graphics programming, camera doesn't exist. It's always fixed at (0,0,0) and looking towards (0,0,-1). Camera as everyone knows it is totally artificial and mimics the way we are used to observing objects, as humans, and by that I mean: moving around, pivoting our head and such. To mimic that, cg introduces the concept of camera. It is interesting and well known that it is the same thing whether will you move to camera to the right or to move all other objects in the scene to the left. That invariance is then transffered onto modelMatrix by combining all transformations on object in one matrix - MVMatrix.
View and Projection matrices are seperated because those matrices do very different transformations. One is very similar to modelMatrix and represents 3d, in-space transformations, and the other is used for calculating angles from which the objects are viewed.

What do we use matrices for in openGL?

I'm new to OpenGL and was wondering what do we use matrices for, if someone could explain me in abstract intuitive way, because when reading references or any tutorial, all of these take matrices as known mechanism. I've learned matrices in maths, but as English is not my native language, it's hard to figure out what does some stuff mean.
I found good example at www.learnopengl.com which says:
The model matrix. This matrix is used to place a model somewhere in the “world”. For example, if you have a model of a car and you want it located 1000 meters to the east, you will use the model matrix to do this.
The view matrix. This matrix represents the camera. If we want to view our car which is 1000 meters to the east, we’ll have to move ourselves 1000 meters to the east as well (another way of thinking about it is that we remain stationary, and the rest of the world moves 1000 meters to the west). We use the view matrix to do this.
The projection matrix. Since our screens are flat, we need to do a final transformation to “project” our view onto our screen and get that nice 3D perspective. This is what the projection matrix is used for.
This explains it pretty good. But, how do we build them? How large are they?
Also, I've read in this question:
What does glLoadIdentity() do in OpenGL?
that:
glMatrixMode(GL_PROJECTION) deals with the matrices used by
perspective transformation or orthogonal transformation.
glMatrixMode(GL_MODELVIEW) deals with matrices used by model-view
transformation. That is, to transform your object (aka model) to the
view coordinate space (or camera space).
What those transformation mean and how do they get calculated?
I know that here are many question, but I'm trying to make better conception of all of these to get better view on OpenGL. That's why I need some abstract explanation to dive into all details with understanding of conception beyond.
Translation, rotation, and scaling are all affine transforms, which can be implemented using matrix multiplication and addition. Actually, by augmenting the vector with a w element that's always one, the affine transform in 3 dimensions becomes a linear transformation in 4 dimensions and all you need is a matrix multiply.
Doing it with a matrix multiply is very nice because (1) it's fast and (2) you don't need special logic for any of the operations -- you can even compose as many of these affine operations as you want and still represent it with a single matrix.
Having multiple matrix modes is useful when composing. If you had only one matrix, you could add new operations at either end but not in the middle. By having 3 matrices multiplied together later, you can insert new operations at four different points in the order.
The matrix stack is also very useful because it allows you to do an operation for a few primitives and then remove it. If you tried to undo the operation by doing the inverse operation, eventually rounding errors would get out of control. But by remembering the previous matrix, you can just get it back as if the rotation or whatever never happened.
OpenGL is nice in that rather than working with matrices directly, you can call functions that will manipulate them.
So under the hood (what really happens), is that there are several matrices that transform your objects (a model-view matrix that transforms object to camera space, and projection matrix for perspective / orthogonal transformations).
glMatrixMode is like a switch that allows you to choose which type of matrix to use and manipulate, and you specify using the arguments. So glMatrixMode(GL_PROJECTION) means that you will be manipulating the projection matrix.

Why would it be beneficial to have a separate projection matrix, yet combine model and view matrix?

When you are learning 3D programming, you are taught that it's easiest think in terms of 3 transformation matrices:
The Model Matrix. This matrix is individual to every single model and it rotates and scales the object as desired and finally moves it to its final position within your 3D world. "The Model Matrix transforms model coordinates to world coordinates".
The View Matrix. This matrix is usually the same for a large number of objects (if not for all of them) and it rotates and moves all objects according to the current "camera position". If you imaging that the 3D scene is filmed by a camera and what is rendered on the screen are the images that were captured by this camera, the location of the camera and its viewing direction define which parts of the scene are visible and how the objects appear on the captured image. There are little reasons for changing the view matrix while rendering a single frame, but those do in fact exists (e.g. by rendering the scene twice and changing the view matrix in between, you can create a very simple, yet impressive mirror within your scene). Usually the view matrix changes only once between two frames being drawn. "The View Matrix transforms world coordinates to eye coordinates".
The Projection Matrix. The projection matrix decides how those 3D coordinates are mapped to 2D coordinates, e.g. if there is a perspective applied to them (objects get smaller the farther they are away from the viewer) or not (orthogonal projection). The projection matrix hardly ever changes at all. It may have to change if you are rendering into a window and the window size has changed or if you are rendering full screen and the resolution has changed, however only if the new window size/screen resolution has a different display aspect ratio than before. There are some crazy effects for that you may want to change this matrix but in most cases its pretty much constant for the whole live of your program. "The Projection Matrix transforms eye coordinates to screen coordinates".
This makes all a lot of sense to me. Of course one could always combine all three matrices into a single one, since multiplying a vector first by matrix A and then by matrix B is the same as multiplying the vector by matrix C, where C = B * A.
Now if you look at the classical OpenGL (OpenGL 1.x/2.x), OpenGL knows a projection matrix. Yet OpenGL does not offer a model or a view matrix, it only offers a combined model-view matrix. Why? This design forces you to permanently save and restore the "view matrix" since it will get "destroyed" by model transformations applied to it. Why aren't there three separate matrices?
If you look at the new OpenGL versions (OpenGL 3.x/4.x) and you don't use the classical render pipeline but customize everything with shaders (GLSL), there are no matrices available any longer at all, you have to define your own matrices. Still most people keep the old concept of a projection matrix and a model-view matrix. Why would you do that? Why not using either three matrices, which means you don't have to permanently save and restore the model-view matrix or you use a single combined model-view-projection (MVP) matrix, which saves you a matrix multiplication in your vertex shader for ever single vertex rendered (after all such a multiplication doesn't come for free either).
So to summarize my question: Which advantage has a combined model-view matrix together with a separate projection matrix over having three separate matrices or a single MVP matrix?
Look at it practically. First, the fewer matrices you send, the fewer matrices you have to multiply with positions/normals/etc. And therefore, the faster your vertex shaders.
So point 1: fewer matrices is better.
However, there are certain things you probably need to do. Unless you're doing 2D rendering or some simple 3D demo-applications, you are going to need to do lighting. This typically means that you're going to need to transform positions and normals into either world or camera (view) space, then do some lighting operations on them (either in the vertex shader or the fragment shader).
You can't do that if you only go from model space to projection space. You cannot do lighting in post-projection space, because that space is non-linear. The math becomes much more complicated.
So, point 2: You need at least one stop between model and projection.
So we need at least 2 matrices. Why model-to-camera rather than model-to-world? Because working in world space in shaders is a bad idea. You can encounter numerical precision problems related to translations that are distant from the origin. Whereas, if you worked in camera space, you wouldn't encounter those problems, because nothing is too far from the camera (and if it is, it should probably be outside the far depth plane).
Therefore: we use camera space as the intermediate space for lighting.
In most cases your shader will need the geometry in world or eye coordinates for shading so you have to seperate the projection matrix from the model and view matrices.
Making your shader multiply the geometry with two matrices hurts performance. Assuming each model have thousends (or more) vertices it is more efficient to compute a model view matrix in the cpu once, and let the shader do one less mtrix-vector multiplication.
I have just solved a z-buffer fighting problem by separating the projection matrix. There is no visible increase of the GPU load. The two folowing screenshots shows the two results - pay attention to the green and white layers fighting.

In a big OpenGL game with lots of 3D moving objects, how are the points typically updated?

Points calculated with own physics engine and then sent to OpenGL every time it has to display, e.g. with glBufferSubDataArb, with the updated coordinates of a flying barrel
There are lots of barrels with the same world coordinates but somehow for each one you tell OpenGL to use a different matrix transformation. When a barrel moves you update it's transformation matrix somehow, to reflect which way it rotated/translated in the world.
Some other way
Also, if the answer is #2, is there any easy way to do it, e.g. with abstracted code rather than manipulating the matrices yourself
OpenGL is not a scene graph, it's a drawing API. Most recent versions of OpenGL (OpenGL-3 core and above) reflect this, by not managing matrix state at all. Indeed the answer is 2, more or less. And actually you are expected to deal with the matrix math. OpenGL-3 no longer provides any primitives for that.
Usually a physics engine sees an object as a rigid body with a convex hull. The natural way to represent such a body is using a 4×3 matrix (a 3×3 rotation matrix and a translation vector). So if using a physics engine you're presented with such matrices anyway.
Also you must understand that OpenGL doesn't maintain a scene, so there is nothing you "update". You just draw your data using OpenGL. Matrices are loaded as they are needed.