Projection matrices: What should depth map to? - opengl

I'm running into contradictions while attempting to build a projection matrix for Vulkan, and have yet to find an explanation of how the projection matrix should map Z from the input vector to the output. Mapping x and y is straightforward. My understanding is that OpenGL projection matrices should map the near frustum plane to -1 and the far plane to +1, while Vulkan maps them to 0 and +1 respectively. The mapping should be logarithmic, allowing greater precision in the near field.
The examples below use near (n) = 1, far (f) = 100.
Here's a plot of the z mapping using a matrix I constructed to the Vulkan spec. It produces errors in rendering, but gives the correct result as I understand it:
lambda z: (f / (f-n) * z - f*n/(f-n)) / z
A plot of the most common OpenGL projection I've found online, which should map from -1 to +1:
lambda z: ((-f+n)/(f-n)*z - 2*f*n/(f-n))/-z
And here's one generated from a lib I use, for OpenGL (cgmath in Rust):
I'm unable to build a proper Vulkan projection matrix (of which I've found none via Google) unless I understand what z should map to. I suspect there is an implicit correction applied by the shader after the projection matrix that actually maps to the ranges I listed, but if so, I don't know what range to feed into it via the projection matrix.

The mapping should be logarithmic, allowing greater precision in the near field.
Actually, if you don't do any tricks, the mapping will be hyperbolic, not logarithmic. The key point about the hyperbolic mapping is that it can be interpolated linearly in screen space (which is a very nice property when you want to do some Z buffer optimizations like Hierarchical Z).
A plot of the most common OpenGL projection I've found online, which should map from -1 to +1:
lambda z: ((-f+n)/(f-n)*z - 2*f*n/(f-n))/-z
Nope. You have a sign error in the first term, it should be
(-(f+n)/(f-n)*z - 2*f*n/(f-n))/-z
Hence, your plot is just wrong. With the corrected formula, you would get a plot similar to the one from your cgmath Rust library.
But the important bit is something else: you are plotting the wrong thing!
Note the -z in the denominator of that formula? Classic GL convention has always been to use a right-handed eye space, but a left-handed window space. As a result, a classic GL projection matrix projects along the -z direction. The parameters n and f are still given as distances along the viewing direction, though. This means that the actual clip planes will be at z_eye = -n and z_eye = -f in eye space. What you plotted in your graph is the range behind the camera, where you see the second branch of the hyperbola, the one which is usually clipped away anyway, and which maps to outside the [-1,1] interval.
If you plot the mapping for n=5 and f=100 over the correct range (negative z_eye), the curve stays inside [-1,1] between the two clip planes.
Note that OpenGL's projection along the -z direction is pure convention; it is not enforced by anything, so you can use a +z projection matrix as well.
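As a quick sanity check (a small Python sketch of my own, using the corrected formula and the n=1, f=100 values from your question), plugging the actual eye-space clip-plane positions z_eye = -n and z_eye = -f into the mapping gives -1 and +1, while positive z values land on the clipped branch you plotted:

n, f = 1.0, 100.0

# Corrected classic-GL depth mapping: NDC z as a function of eye-space z.
# The projection goes along -z, so the visible range is z_eye in [-n, -f].
def gl_ndc_z(z_eye):
    return (-(f + n) / (f - n) * z_eye - 2 * f * n / (f - n)) / -z_eye

print(gl_ndc_z(-n))   # -1.0 (near plane)
print(gl_ndc_z(-f))   # +1.0 (far plane)
print(gl_ndc_z(+n))   # ~3.04, outside [-1, 1]: the branch behind the camera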
I'm unable to build a proper Vulkan projection matrix (Of which I've found none via Google) unless I understand what z should map to.
Here's a plot of z mapping using a matrix I constructed to the Vulkan spec. It produces errors in rendering, but produces the correct result as I understand it:
lambda z: (f / (f-n) * z - f*n/(f-n)) / z
Not sure what errors you're seeing with this, but the mapping is correct if you
assume Vulkan's default [0,1] z clip convention, and
want a projection direction along +z in eye space.
Btw., Vulkan's [0,1] clip convention also makes it possible to make better use of the available precision when using a floating-point depth attachment: by reversing the mapping so that the near plane is mapped to 1 and the far plane to 0, you can improve precision considerably. Have a look at the NVIDIA devblog article on reversed depth for more details.
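For illustration, here is a small Python sketch (my own, not taken from the Vulkan spec) of the two [0,1] depth mappings mentioned above: your matrix's mapping (near to 0, far to 1) and a reversed-Z variant obtained by simply swapping n and f in the same formula (near to 1, far to 0):

n, f = 1.0, 100.0

# Your mapping for Vulkan's [0,1] clip convention, projecting along +z.
def vk_depth(z_eye):
    return (f / (f - n) * z_eye - f * n / (f - n)) / z_eye

# Reversed-Z variant: swap n and f (better float depth precision).
def vk_depth_reversed(z_eye):
    return (n / (n - f) * z_eye - f * n / (n - f)) / z_eye

print(vk_depth(n), vk_depth(f))                    # near -> 0, far -> 1
print(vk_depth_reversed(n), vk_depth_reversed(f))  # near -> 1, far -> 0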

Related

OpenGL: confused by the unproject function settings

I need to implement raycasting. For this I need to convert the mouse cursor to world space.
I use the unproject function for this. I first need to find a point on the near plane, then one on the far plane. After that, subtracting the first from the second gives a ray.
But I don't understand how to set winZ correctly. Because in some implementations I see two ways: winZ = 0 (near plane), winZ = 1 (far plane) or winZ = -1 (near plane), winZ = 1 (far plane).
What's the difference in these ranges?
If it really is window-space z, then there is only the [0,1] range. Note that 0 doesn't necessarily mean the near plane and 1 the far plane, though. One can set up a projection matrix where near will end up at 1 and far at 0 (reversed z, which, in combination with the [0,1] z clip convention as explained below, has some precision advantages).
Also note that glDepthRange can further modify which values (inside [0,1]) these two planes will be mapped to.
To understand the unproject operation, you first need to understand the different coordinate spaces. Typically in a render API, you have to deal with these 3 spaces at the end of the transform chain:
clip space: This is what the output of the vertex shader is in, and where the actual clipping happens, at least on a conceptual level. This space is still homogeneous with an arbitrary value for the w coordinate.
normalized device coordinates (NDC). This is what you get after the perspective division by the clip space w coordinate (after the clipping has been applied, which will eliminate the w<=0 cases completely).
window space. The 2D xy part are the actual pixel coordinates inside your output window (or your render target), and the transformation from NDC xy to window space xy is defined by the viewport settings. The z coordinate is the value which will go into the depth test, and depth buffer, it is in the range [0,1] or some sub-range of that (controlled via glDepthRange in the GL).
Other spaces before these in this list, like model/object space, world space, eye/view space, are completely up to you and do not concern the GPU directly at all (legacy fixed-function GL did care about eye space for lighting and fog, but nowadays, implementing this is all your job, and you can use whatever spaces you see fit).
Having established the spaces, the next relevant thing here is the viewing volume. This is the volume of the space which will actually be mapped to your viewport, and it is a 3D volume bounded by six planes: left, right, bottom, top, near, far.
However, the actual view volume is set up by pure convention in the various render APIs (and the actual GPU hardware).
Since you tagged this question with "OpenGL", I'm going to begin with the default GL convention:
The standard GL convention is that the view volume is the completely symmetrical [-1,1] cube in NDC. Actually, this means that the clip condition in clip space is -w <= x,y,z <= w.
Direct3D uses a different convention: they use [-1,1] in NDC for x and y just like GL does, but [0,1] for the z dimension. This also means that the depth range transformation into window space can be the identity in many cases (you seldom need to limit it to a sub-range of [0,1]). This convention has some numerical advantages because the GL approach of moving the [-1,1] range to [0,1] for window space makes it lose precision around the (NDC) zero point.
Modern GL since GL 4.5 optionally allows you to switch to the [0,1] convention for z via glClipControl. Vulkan also supports both conventions, but uses [0,1] as the default.
There is not "the" unproject function, but the concept of "unprojecting" a point means calculating these transformations in reverse, going from window space back to some space before clip space, undoing the proejction matrix. For implementing an unproject function, you need to know which conventions were used.
Because in some implementations I see two ways: winZ = 0 (near plane), winZ = 1 (far plane) or winZ = -1 (near plane), winZ = 1 (far plane). What's the difference in these ranges?
Maybe they are not taking in a window space Z, but NDC z directly. Maybe the parameters are just named in a confusing or wrong manner. Maybe some of the implementations out there are just flat-out wrong.

What is the role of gl_Position.w in Vulkan?

The gl_Position variable output from a GLSL vertex shader must have 4 coordinates. In OpenGL, it seems the w coordinate is used to scale the vector, by dividing the other coordinates by it. What is the purpose of w in Vulkan?
Shaders and projections in Vulkan behave exactly the same as in OpenGL. There are small differences in depth ranges ([-1, 1] in OpenGL, [0, 1] in Vulkan) or in the origin of the coordinate system (lower-left in OpenGL, upper-left in Vulkan), but the principles are exactly the same. The hardware is still the same and it performs calculations in the same way both in OpenGL and in Vulkan.
4-component vectors serve multiple purposes:
Different transformations (translation, rotation, scaling) can be represented in the same way, with 4x4 matrices.
Projection can also be represented with a 4x4 matrix.
Multiple transformations can be combined into one 4x4 matrix.
The .w component you mention is used during perspective projection.
All this we can do with 4x4 matrices and thus we need 4-component vectors (so they can be multiplied by 4x4 matrices). Again, I write about this because the above rules apply both to OpenGL and to Vulkan.
So, as for the purpose of the .w component of the gl_Position variable: it is exactly the same in Vulkan. It is used to scale the position vector. During perspective calculations (projection matrix multiplication), the original depth is modified by the original .w component and stored in the .z component of the gl_Position variable, and additionally the original depth is also stored in the .w component. After that (as a fixed-function step), the hardware performs the perspective division and divides the position stored in the gl_Position variable by its .w component.
In orthographic projection the steps performed by the hardware are exactly the same, but the values used for the calculations are different. So the perspective division step is still performed by the hardware, but it does nothing (the position is divided by 1.0).
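As a tiny illustration (a Python sketch of what that fixed-function step does, not actual driver code), the divide applied to gl_Position looks like this:

def perspective_divide(clip):
    # clip = (x, y, z, w) as written to gl_Position by the vertex shader
    x, y, z, w = clip
    return (x / w, y / w, z / w)              # normalized device coordinates

# Perspective: w holds the original depth, so the divide scales the position.
print(perspective_divide((2.0, 1.0, 4.5, 5.0)))   # (0.4, 0.2, 0.9)
# Orthographic: w stays 1.0, so the divide changes nothing.
print(perspective_divide((0.4, 0.2, 0.9, 1.0)))   # (0.4, 0.2, 0.9)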
gl_Position is a homogeneous coordinate. The w component plays a role in perspective projection.
The projection matrix describes the mapping from 3D points of the view on a scene, to 2D points on the viewport. It transforms from eye space to the clip space, and the coordinates in the clip space are transformed to the normalized device coordinates (NDC) by dividing with the w component of the clip coordinates (Perspective divide).
In perspective projection, the projection matrix describes the mapping from 3D points in the world, as they are seen from a pinhole camera, to 2D points on the viewport. The eye space coordinates in the camera frustum (a truncated pyramid) are mapped to a cube (the normalized device coordinates).
Perspective Projection Matrix:
r = right, l = left, b = bottom, t = top, n = near, f = far
2*n/(r-l)      0              0               0
0              2*n/(t-b)      0               0
(r+l)/(r-l)    (t+b)/(t-b)   -(f+n)/(f-n)    -1
0              0             -2*f*n/(f-n)     0
When a Cartesian coordinate in view space is transformed by the perspective projection matrix, the result is a homogeneous coordinate. The w component grows with the distance from the point of view. This causes objects to become smaller after the perspective divide the further away they are.
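For example (a small numpy sketch; I'm assuming the listing above is written in OpenGL's column-major memory order, so the mathematical matrix used below is its transpose), transforming two view-space points shows w growing with distance from the viewer:

import numpy as np

l, r, b, t, n, f = -1.0, 1.0, -1.0, 1.0, 1.0, 100.0

# Mathematical matrix (clip = P @ eye); the listing above is its transpose.
P = np.array([
    [2*n/(r-l), 0.0,        (r+l)/(r-l),  0.0],
    [0.0,       2*n/(t-b),  (t+b)/(t-b),  0.0],
    [0.0,       0.0,       -(f+n)/(f-n), -2*f*n/(f-n)],
    [0.0,       0.0,       -1.0,          0.0],
])

for z_eye in (-2.0, -50.0):        # points in front of the camera (along -z)
    clip = P @ np.array([0.5, 0.5, z_eye, 1.0])
    print("w =", clip[3], "ndc =", clip[:3] / clip[3])
# w equals -z_eye, so x/w and y/w shrink as the point moves further away.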
In computer graphics, transformations are represented with matrices. If you want something to rotate, you multiply all its vertices (a vector) by a rotation matrix. Want it to move? Multiply by translation matrix, etc.
tl;dr: You can't describe translation along the z-axis with 3D matrices and vectors. You need at least 1 more dimension, so they just added a dummy dimension w. But things break if it's not 1, so keep it at 1 :P.
Anyway, now we begin with a quick review on matrix multiplication:
You basically put x above a, y above b, z above c. Multiply the whole column by the variable you just moved, and sum up everything in the row.
So if you were to translate a vector, you'd want something like:
See how x and y are now translated by az and bz? That's pretty awkward though:
You'd have to account for how big z is whenever you move things (what if z was negative? You'd have to move in opposite directions. That's cumbersome as hell if you just want to move something an inch over...)
You can't move along the z axis. You'll never be able to fly or go underground
But, if you can make sure z = 1 at all times:
Now it's much clearer that this matrix allows you to move in the x-y plane by a, and b amounts. Only problem is that you're conceptually levitating all the time, and you still can't go up or down. You can only move in 2D.
But you see a pattern here? With 3D matrices and 3D vectors, you can describe all the fundamental movements in 2D. So what if we added a 4th dimension?
Looks familiar. If we keep w = 1 at all times:
There we go, now you get translation along all 3 axes. This is what's called homogeneous coordinates.
But what if you were doing some big & complicated transformation, resulting in w != 1, and there's no way around it? OpenGL (and basically any other CG system I think) will do what's called normalization: divide the resultant vector by the w component. I don't know enough to say exactly why ('cause scaling is a linear transformation?), but it has favorable implications (can be used in perspective transforms). Anyway, the translation matrix would actually look like:
And there you go, see how each component is shrunk by w and then translated? That's why w controls scaling.
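Putting that into code (a small numpy sketch of the 4x4 translation idea, with made-up offsets a, b, c):

import numpy as np

a, b, c = 3.0, -1.0, 2.0                 # arbitrary translation amounts

T = np.array([
    [1.0, 0.0, 0.0, a],
    [0.0, 1.0, 0.0, b],
    [0.0, 0.0, 1.0, c],
    [0.0, 0.0, 0.0, 1.0],
])

v = np.array([10.0, 20.0, 30.0, 1.0])    # keep w = 1
print(T @ v)                             # [13. 19. 32.  1.] -> moved by (a, b, c)

# If some transform left w != 1, normalize by dividing by w afterwards:
v2 = np.array([10.0, 10.0, 10.0, 2.0])   # represents the point (5, 5, 5), w = 2
r = T @ v2
print(r / r[3])                          # [8. 4. 7. 1.] -> normalized, then moved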

How to determine the XYZ coords of a point on the back buffer

If I pick a spot on my monitor in screen X/Y, how can I obtain the point in 3D space, based on my projection and view matrices?
For example, I want to put an object at some depth and have it located at 10,10 in screen coords. So when I update its world matrix it will render onscreen at 10,10.
I presume it's fairly straightforward given I have my camera matrices, but I'm not sure offhand how to 'reverse' the normal process.
DirectXTk XMMath would be best, but I can no doubt sort it out from any linear algebra system (OpenGL, D3DX, etc).
What I'm actually trying to do is find a random point on the back clipping plane where I can start an object that then drifts straight towards the camera along its projection line. So I want to keep picking points in deep space that are still within my view (no point creating ones outside in my case) and starting alien ships (or whatever) at that point.
As discussed in my comments, you need four things to do this generally.
ModelView (GL) or View (D3D / general) matrix
Projection matrix
Viewport
Depth Range (let us assume default, [0, 1])
What you are trying to do is locate in world-space a point that lies on the far clipping plane at a specific x,y coordinate in window-space. The point you are looking for is <x,y,1> (z=1 corresponds to the far plane in window-space).
Given this point, you need to transform back to NDC-space
The specifics are actually API-dependent since D3D's definition of NDC is different from OpenGL's -- they do not agree on the range of Z (D3D = [0, 1], GL = [-1, 1]).
Once in NDC-space, you can apply the inverse Projection matrix to transform back to view-space.
These are homogeneous coordinates and division by W is necessary.
From view-space, apply the inverse View matrix to arrive at a point in world-space that satisfies your criteria.
Most math libraries have a function called UnProject (...) that will do all of this for you. I would suggest using that because you tagged this question D3D and OpenGL, and the specifics of some of these transformations are different depending on the API.
You are better off knowing how they work, even if you never implement them yourself. I think the key thing you were missing was the viewport; I have an answer here that explains this step visually.
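For the specific use case in the question (spawning objects on the far plane inside the current view), a small numpy sketch could look like the following. It is a sketch under assumptions, not DirectXTK code: matrices follow the column-vector convention (clip = proj @ view @ world), and it exploits the fact that the far plane is at NDC z = 1 under both the GL and the D3D convention:

import numpy as np

def random_far_plane_spawn(view, proj, rng=np.random.default_rng()):
    # Pick a random visible point in NDC and unproject it at the far plane.
    ndc = np.array([rng.uniform(-1.0, 1.0),   # random x inside the view
                    rng.uniform(-1.0, 1.0),   # random y inside the view
                    1.0,                      # far plane
                    1.0])
    p = np.linalg.inv(proj @ view) @ ndc
    spawn = p[:3] / p[3]                      # world-space spawn position
    # Drift direction: straight toward the camera along the projection line.
    eye = np.linalg.inv(view)[:3, 3]          # camera position in world space
    direction = (eye - spawn) / np.linalg.norm(eye - spawn)
    return spawn, direction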

Perspective divide: Why use the w component?

In OpenGL, I have read that a vertex should be represented by (x,y,z,w), where w = z. This is to enable perspective divide, whereby (x,y,z) are divided by w in order to determine their screen position due to the perspective effect. And if they were just divided by the original z value, then the z would be 1 everywhere.
My question is: Why do you need to divide the z component by w at all? Why can you not just divide the x and y components by z, such that the screen coordinates have the perspective effect applied, and then just used the original unmodified z component for depth testing? In this way, you would not have to use the w component at all....
Obviously I am missing something!
3D computer graphics is typically handled with homogeneous coordinates and in a projective vector space. The math behind this is a bit more than "just divide by w".
Using 4D homogeneous vectors and 4x4 matrices has the nice advantage that all sorts of affine transformations (and this includes especially the translation, which also relies on w) and projective transforms can be represented by simple matrix multiplications.
In OpenGL, I have read that a vertex should be represented by
(x, y, z, w), where w = z.
That is not true. A vertex should be represented by (x, y, z, w), where w is just w. In your typical case, the input w is actually 1, so it is usually not stored in the vertex data, but added on demand in the shaders etc.
Your typical projection matrix will set w_clip = -z_eye. But that is a different thing. This means that you just project along the -z direction in eye space. You could also put w_clip = 2*x_eye - 3*y_eye + 4*z_eye there, and your axis of projection would have the direction (2, -3, 4, 0).
My question is: Why do you need to divide the z component by w at all? Why can you not just divide the x and y components by z, such that the screen coordinates have the perspective effect applied, and then just used the original unmodified z component for depth testing?
Conceptually, the space is distorted along all 3 dimensions, not just in x and y. Furthermore, in the beginning, GPUs had just 16 bit or 24 bit integer precision for the depth buffer. In such a scenario, you definitely want to have a denser representation near the camera, and a sparser one far away.
Nowadays, with programmable vertex shaders and floating-point depth buffer formats, you can basically just store the z_eye value in the depth buffer, and use this for depth testing. However, this is typically referred to as W buffering, because the (clip space) w component is used.
There is another conceptual issue if you were to divide by z: you wouldn't be able to use an orthographic projection; you would always force some kind of perspective. Now one might argue that the division by z doesn't have to happen automatically, but could be applied when needed in the vertex shader. But this won't work either. You must not apply the perspective divide in the vertex shader, because that would project points which lie behind the camera to in front of the camera. Since the vertex shader does not work on whole primitives, this would completely screw up any primitive where at least one vertex lies behind the camera and another one lies in front of it. To deal with that situation, the clipping has to be applied before the divide - hence the name clip space.
In this way, you would not have to use the w component at all.
That is also not true. The w component is further used down the pipeline. It is essential for perspective-correct attribute interpolation.
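To illustrate that last point, here is a small Python sketch of perspective-correct interpolation along one edge (my own illustration of the textbook formula, not any specific GPU's implementation): the attribute is interpolated as attr/w together with 1/w in screen space, and divided at the end:

def perspective_correct_lerp(a0, w0, a1, w1, t):
    # a0, a1: attribute values at the two vertices
    # w0, w1: clip-space w of the two vertices
    # t: interpolation parameter in screen space
    num = (1.0 - t) * a0 / w0 + t * a1 / w1
    den = (1.0 - t) / w0 + t / w1
    return num / den

a0, w0 = 0.0, 1.0      # near vertex
a1, w1 = 1.0, 10.0     # far vertex
print(0.5 * a0 + 0.5 * a1)                             # 0.5  (naive screen-space lerp)
print(perspective_correct_lerp(a0, w0, a1, w1, 0.5))   # ~0.09 (perspective-correct)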

Calculating the perspective projection matrix according to the view plane

I'm working with OpenGL but this is basically a math question.
I'm trying to calculate the projection matrix, I have a point on the view plane R(x,y,z) and the Normal vector of that plane N(n1,n2,n3).
I also know that the eye is at (0,0,0) which I guess in technical terms its the Perspective Reference Point.
How can I arrive at the perspective projection from this data? I know how to do it the regular way, where you get the FOV, aspect ratio, and near and far planes.
I think you created a bit of confusion by putting this question under the "opengl" tag. The problem is that in computer graphics, the term projection is not understood in a strictly mathematical sense.
In maths, a projection is defined (and the following is not the exact mathematical definition, but just my own paraphrasing) as something which doesn't further change the result when applied twice. Think about it. When you project a point in 3d space onto a 2d plane (which is still in that 3d space), each point's projection will end up on that plane. But points which already are on this plane aren't moved at all any more, so you can apply this as many times as you want without changing the outcome any further.
The classic "projection" matrices in computer graphics don't do this. They transfrom the space in a way that a general frustum is mapped to a cube (or cuboid). For that, you basically need all the parameters to describe the frustum, which typically is aspect ratio, field of view angle, and distances to near and far plane, as well as the projection direction and the center point (the latter two are typically implicitely defined by convention). For the general case, there are also the horizontal and vertical asymmetries components (think of it like "lens shift" with projectors). And all of that is what the typical projection matrix in computer graphics represents.
To construct such a matrix from the parameters you have given is not really possible, because you are lacking lots of parameters. Also - and I think this is kind of revealing - you have given a view plane. But the projection matrices discussed so far do not define a view plane - any plane parallel to the near or far plane and in front of the camera can be imagined as the viewing plane (behind the camera would also work, but the image would be mirrored), if you should need one. But in the strict sense, it would only be a "view plane" if all of the projected points would also end up on that plane - which the computer graphics perspective matrix explicitly doesn't do. It instead keeps their 3d distance information - which also means that the operation is invertible, while a classical mathematical projection typically isn't.
From all of that, I simply guess that what you are looking for is a perspective projection from 3D space onto a 2D plane, as opposed to the perspective transformation used for computer graphics. And all the parameters you need for that are just the view point and a plane. Note that this is exactly what you have given: the projection center shall be the origin, and R and N define the plane.
Such a projection can also be expressed in terms of a 4x4 homogeneous matrix. There is one thing that is not defined in your question: the orientation of the normal. I'm assuming standard maths convention again and assume that the view plane is defined as <N,x> + d = 0. From using R in that equation, we get d = -N_x*R_x - N_y*R_y - N_z*R_z. So the projection matrix is just
( 1 0 0 0 )
( 0 1 0 0 )
( 0 0 1 0 )
(-N_x/d -N_y/d -N_z/d 0 )
There are a few properties of this matrix. There is a zero column, so it is not invertible. Also note that for every point (s*x, s*y, s*z, 1) you apply this to, the result (after division by the resulting w, of course) is just the same no matter what s is - so every point on a line between the origin and (x,y,z) will result in the same projected point - which is what a perspective projection is supposed to do. And finally note that w = (N_x*x + N_y*y + N_z*z)/-d, so for every point fulfilling the above plane equation, w = -d/-d = 1 will result. In combination with the identity transform for the other dimensions, this just means that such a point is left unchanged.
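A quick numerical check of that matrix (a numpy sketch; R, N and the test points are made-up values):

import numpy as np

R = np.array([0.0, 0.0, 5.0])     # point on the view plane
N = np.array([0.0, 0.0, 1.0])     # plane normal
d = -N @ R                        # plane: <N,x> + d = 0, so d = -5 here

# The matrix from above, applied as p_projected = M @ p (column vectors)
M = np.array([
    [1.0,      0.0,      0.0,     0.0],
    [0.0,      1.0,      0.0,     0.0],
    [0.0,      0.0,      1.0,     0.0],
    [-N[0]/d, -N[1]/d,  -N[2]/d,  0.0],
])

for s in (1.0, 2.0, 7.5):         # points on one line through the origin
    p = M @ np.array([s * 1.0, s * 2.0, s * 4.0, 1.0])
    print(p[:3] / p[3])           # same point on the plane z = 5 for every s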
The projection center must be at (0,0,0), viewing in the Z+ or Z- direction
This is a must because many things in OpenGL depend on it, like fog, lighting, ... So if your direction or position is different, then you need to move that into the camera matrix. Let's assume your focal point is (0,0,0) as you stated, and the normal vector is (0,0,+/-1).
Z near
is the distance between the focal point and the projection plane, so znear is the perpendicular distance between the plane and (0,0,0). If the assumption above is correct, then
znear=R.z
Otherwise you need to compute it. I think you have everything you need for that:
cast a line from R with direction N,
find the point on it closest to the focal point (0,0,0),
and then znear is the distance from that point to R.
Z far
is determined by the depth buffer bit width and z near
zfar=znear*(1<<(cDepthBits-1))
This is the maximal usable zfar (for my purposes); if you need more precision then lower it a bit. Do not forget that precision is higher near znear and much, much worse near zfar. zfar is usually set to the maximum view distance and znear computed from it, or set to the minimum focus range.
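For example (quick Python arithmetic with the rule above, assuming a 24-bit depth buffer and znear = 0.1):

znear = 0.1
cDepthBits = 24
zfar = znear * (1 << (cDepthBits - 1))   # maximal usable zfar per the rule above
print(zfar)                              # 838860.8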
view angle
I mostly use a 60 degree view: zang=60.0 [deg]
Most men in my region can see up to 90 degrees, but that includes peripheral vision; a 60 degree view is more comfortable to look at.
Women have a slightly wider field of view... but I have never heard any complaints from them about 60 degree views, so let's assume it's comfortable for them too...
Aspect
aspect ratio is determined by your OpenGL window dimensions xs,ys
aspect=(xs/ys)
This is how I set the projection matrix:
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(zang/aspect,aspect,znear,zfar);
// gluPerspective has an inaccurate tangent, so correct the perspective matrix like this:
const double deg=M_PI/180.0;
double perspective[16];
glGetDoublev(GL_PROJECTION_MATRIX,perspective);
perspective[ 0]=    1.0/tan(0.5*zang*deg);
perspective[ 5]= aspect/tan(0.5*zang*deg);
glLoadMatrixd(perspective);
perspective is a copy of the projection matrix; I use it for mouse position conversions etc ...
If you do not correct the matrix, then you will be off when using advanced things like overlapping multiple frustums to get a high-precision depth range. I use this to obtain a <0.1 m, 1000 AU> frustum with a 24-bit depth buffer, and the inaccuracy would cause the images not to fit together perfectly ...
[Notes]
If the focal point is not really (0,0,0), or you are not viewing along the Z axis (e.g. you do not have a camera matrix but instead use the projection matrix for that), then with basic scenes/techniques you will see no problem. The problems start with the use of advanced graphics. If you use GLSL then you can handle this without problems, but the fixed-function OpenGL pipeline cannot handle this properly. This is also called PROJECTION_MATRIX abuse.
[edit1] a few links
If your view is a standard frustum then write the matrix yourself (see gluPerspective); otherwise look at Projections for some ideas on how to construct it.
[edit2]
From your comment I see it like this:
f is your viewing point (the axes are the global world axes)
f' is the viewing point if R were the center of the screen
So create the projection matrix for the f' position (as explained above), and create a transform matrix that transforms f' to f. The transformed f must have the same Z axis as f'; the other axes can be obtained by cross products. Use that as the camera matrix, or multiply both together and use the result as an abused projection matrix.
How to construct the matrix is explained in the Understanding transform matrices link from my earlier comments