Perspective divide: Why use the w component? - opengl

In OpenGL, I have read that a vertex should be represented by (x,y,z,w), where w = z. This is to enable perspective divide, whereby (x,y,z) are divided by w in order to determine their screen position due to the perspective effect. And if they were just divided by the original z value, then the z would be 1 everywhere.
My question is: Why do you need to divide the z component by w at all? Why can you not just divide the x and y components by z, such that the screen coordinates have the perspective effect applied, and then just use the original unmodified z component for depth testing? In this way, you would not have to use the w component at all....
Obviously I am missing something!

3D computer graphics is typically handled with homogeneous coordinates and in a projective vector space. The math behind this is a bit more than "just divide by w".
Using 4D homogeneous vectors and 4x4 matrices has the nice advantage that all sorts of affine transformations (and this includes especially the translation, which also relies on w) and projective transforms can be represented by simple matrix multiplications.
"In OpenGL, I have read that a vertex should be represented by (x, y, z, w), where w = z."
That is not true. A vertex should be represented by (x, y, z, w), where w is just w. In your typical case, the input w is actually 1, so it is usually not stored in the vertex data, but added on demand in the shaders etc.
Your typical projection matrix will set w_clip = -z_eye. But that is a different thing. This means that you just project along the -z direction in eye space. You could also put w_clip = 2*x_eye - 3*y_eye + 4*z_eye there, and your axis of projection would have the direction (2, -3, 4, 0).
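As a worked sketch of that statement: w_clip is simply the dot product of the last row of the projection matrix with the eye-space position. The row-vector notation below is introduced here for illustration and assumes an input w of 1.

```latex
% Typical perspective matrix: last row (0, 0, -1, 0)
w_{clip} = \begin{pmatrix} 0 & 0 & -1 & 0 \end{pmatrix}
           \begin{pmatrix} x_{eye} \\ y_{eye} \\ z_{eye} \\ 1 \end{pmatrix}
         = -z_{eye}

% Hypothetical last row (2, -3, 4, 0) from the example above
w_{clip} = \begin{pmatrix} 2 & -3 & 4 & 0 \end{pmatrix}
           \begin{pmatrix} x_{eye} \\ y_{eye} \\ z_{eye} \\ 1 \end{pmatrix}
         = 2\,x_{eye} - 3\,y_{eye} + 4\,z_{eye}
```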
"My question is: Why do you need to divide the z component by w at all? Why can you not just divide the x and y components by z, such that the screen coordinates have the perspective effect applied, and then just use the original unmodified z component for depth testing?"
Conceptually, the space is distorted along all 3 dimensions, not just in x and y. Furthermore, in the beginning, GPUs had just 16 bit or 24 bit integer precision for the depth buffer. In such a scenario, you definitely want to have a denser representation near the camera, and a sparser one far away.
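To make the "denser near the camera" point concrete, here is a small Python sketch. It assumes the standard OpenGL perspective matrix terms for z_clip and w_clip; the function name and the near/far values are made up for illustration.

```python
def ndc_depth(z_eye, near, far):
    """NDC depth in [-1, 1] for an eye-space z (negative in front of the camera).

    Uses the z and w rows of the standard OpenGL perspective matrix.
    """
    z_clip = -(far + near) / (far - near) * z_eye - 2.0 * far * near / (far - near)
    w_clip = -z_eye
    return z_clip / w_clip


near, far = 0.1, 100.0
for dist in (0.1, 0.2, 1.0, 10.0, 100.0):
    # Print the distance from the camera next to the resulting NDC depth.
    print(dist, ndc_depth(-dist, near, far))
```

With these values, the NDC depth is already positive at twice the near distance, so more than half of the [-1, 1] range is spent on the first 0.1 units behind the near plane.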
Nowadays, with programmable vertex shaders and floating-point depth buffer formats, you can basically just store the z_eye value in the depth buffer, and use this for depth testing. However, this is typically referred to as W buffering, because the (clip space) w component is used.
There is another conceptual issue if you divided by z: you wouldn't be able to use an orthographic projection, you would always force some kind of perspective. Now one might argue that the division by z doesn't have to happen automatically, but one could apply it when needed in the vertex shader. But this won't work either. You must not apply the perspective divide in the vertex shader, because that would project points which lie behind the camera in front of the camera. As the vertex shader does not work on whole primitives, this would completely screw up any primitive where at least one vertex lies behind the camera and another one lies in front of it. To deal with that situation, the clipping has to be applied before the divide - hence the name clip space.
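A tiny Python sketch of why the early divide goes wrong. It assumes the simplest possible perspective setup (x_clip = x_eye, w_clip = -z_eye, camera looking down -z); the function name is hypothetical.

```python
def project_x(x_eye, z_eye):
    """Divide x by w_clip = -z_eye, with no clipping applied beforehand."""
    w_clip = -z_eye
    return x_eye / w_clip


in_front = project_x(1.0, -2.0)  # z = -2 is in front of the camera
behind = project_x(1.0, 2.0)     # z = +2 is behind the camera
print(in_front, behind)
```

The point behind the camera still gets a finite screen position, only mirrored to the other side. Once the divide has happened there is no w left to tell the two cases apart, so a triangle spanning both vertices would be rasterized across the wrong region.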
"In this way, you would not have to use the w component at all."
That is also not true. The w component is further used down the pipeline. It is essential for perspective-correct attribute interpolation.
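As a sketch of what perspective-correct interpolation means between two vertices: the attribute is divided by w, interpolated linearly in screen space together with 1/w, and then divided back. The function names are made up here, and t is assumed to be the screen-space interpolation factor.

```python
def interpolate_linear(a0, a1, t):
    """Plain screen-space interpolation (what you get without using w)."""
    return (1 - t) * a0 + t * a1


def interpolate_perspective(a0, w0, a1, w1, t):
    """Perspective-correct interpolation using the clip-space w of each vertex."""
    numerator = (1 - t) * a0 / w0 + t * a1 / w1
    denominator = (1 - t) / w0 + t / w1
    return numerator / denominator


# Halfway across the screen between a near vertex (w = 1) and a far one (w = 3).
print(interpolate_linear(0.0, 1.0, 0.5))
print(interpolate_perspective(0.0, 1.0, 1.0, 3.0, 0.5))
```

When both vertices have the same w, the two functions agree. They differ only when the primitive recedes from the camera.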

Related

Why is gl_FragCoord.w 1/w? [duplicate]

The gl_FragCoord variable stores four components: (x, y, z, 1/w)
What is the w coordinate and why is it stored as 1/w?
The GLSL and OpenGL specifications are needlessly confusing in this regard. The OpenGL spec is easier to understand: gl_FragCoord stores the X, Y, and Z components of the window-space vertex position. The X, Y, and Z values are calculated as described for computing the window-space position (though the pixel-center and upper-left origin can modify the X and Y values). This is described in the Coordinate Transforms section of the spec.
The W component of gl_FragCoord is (1 / Wc), where Wc is the W component of the clip-space vertex position. It's gl_Position.w from your vertex shader.
The only useful purpose for keeping Wc around is to reverse-transform gl_FragCoord to get the clip-space position back. Which, as that page shows, requires multiplying by Wc. But since gl_FragCoord only stores the inverse of this value, it now requires dividing by gl_FragCoord.w.
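A minimal sketch of that reverse transform, assuming the default depth range [0, 1], a viewport given as (x, y, width, height), and no pixel-center or origin modifiers. The function name and tuple layout are made up for illustration; in a real shader this would be written in GLSL rather than Python.

```python
def fragcoord_to_clip(frag_coord, viewport):
    """Recover the clip-space position from a gl_FragCoord-style tuple.

    frag_coord is (x_window, y_window, z_window, 1 / w_clip).
    viewport is (x, y, width, height), as passed to glViewport.
    """
    fx, fy, fz, inv_w = frag_coord
    vx, vy, width, height = viewport

    # Window space back to normalized device coordinates.
    ndc_x = 2.0 * (fx - vx) / width - 1.0
    ndc_y = 2.0 * (fy - vy) / height - 1.0
    ndc_z = 2.0 * fz - 1.0

    # NDC back to clip space: divide by gl_FragCoord.w, which is 1 / w_clip.
    return (ndc_x / inv_w, ndc_y / inv_w, ndc_z / inv_w, 1.0 / inv_w)


print(fragcoord_to_clip((300.0, 225.0, 0.5, 0.5), (0.0, 0.0, 400.0, 300.0)))
```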
Therefore, we can assume that OpenGL stores it this way because OpenGL isn't allowed to make too much sense ;) See, it's a rule that every part of the OpenGL specification must have something that's a bit nonsensical. The XYZ components made too much sense, so they decided to have it store the inverse of the value you actually want.
OK, technically this is a historical artifact from the days when 3D Labs created GLSL. I'm sure they did it for purely selfish hardware reasons, but I have no real proof of that.
A homogeneous coordinate is given by: (x, y, z, w), which projects to: (x/w, y/w, z/w). gl_FragCoord stores this projection, but rather than storing the (useless) (w/w) = (1) for the last component, it stores (1/w), to preserve useful information.

My understanding on the projection matrix, perspective division, NDC and viewport transform

I was quite confused on how the projection matrix worked so I researched and I discovered a few other things but after researching a few days, I just wanted to confirm my understanding is correct. I might use a few wrong terms but my brain was exhausted after writing this. A few topics I just researched briefly like screen coordinates and window transform so I didn’t write much about it and my knowledge might be incorrect. Is everything I’ve written here correct or mostly correct? Correct me on anything if I’m wrong.
What does the projection matrix do?
So the perspective projection matrix defines a frustum, which is a truncated pyramid. Anything outside of that frustum will be clipped. I'll get to that later. The perspective projection matrix also adds perspective. To make the vertices follow the rules of perspective, the perspective projection matrix manipulates the vertex's w component (the homogeneous component) depending on how far the vertex is from the viewer (the farther away the vertex is, the larger the w coordinate becomes).
Why and how does the w component give the world perspective?
The w component gives the world perspective because, in the perspective division (which happens in the vertex post-processing stage), when x, y and z are divided by the w component, the vertex coordinates are scaled down depending on how big the w component is. So essentially, the w component scales the object smaller the farther away the object is.
Example:
Vertex position (1, 1, 2, 2).
Here, the vertex is 2 away from the viewer. In perspective division the x, y, and z will be divided by 2 because 2 is the w component.
(1/2, 1/2, 2/2) = (0.5, 0.5, 1).
As shown here, the vertex coordinate has been scaled by half.
How does the projection matrix decide what will be clipped?
The near and far plane are the limits of where the viewer can see (anything beyond the far plane and before the near plane will be clipped). Any coordinate will also have to go through a clipping check to see if it has to be clipped. The clipping check is checking whether the vertex coordinate is within a frustum range of -w to w.  If it is outside of that range, it will be clipped.
Let's say I have a vertex with a position of (2, 130, 90, 90).
x value is 2
y value is 130
z value is 90
w value is 90
This vertex must be within the range of -90 to 90. The x and z values are within the range, but the y value goes beyond the range, thus the vertex will be clipped.
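The check described here can be sketched in a few lines of Python. This is a per-vertex test only, with a made-up function name; a real pipeline clips whole primitives against the volume rather than just discarding vertices.

```python
def inside_clip_volume(x, y, z, w):
    """True if a clip-space position satisfies -w <= x, y, z <= w."""
    return all(-w <= c <= w for c in (x, y, z))


print(inside_clip_volume(2, 130, 90, 90))  # the vertex from the example above
print(inside_clip_volume(1, 1, 2, 2))      # the earlier (1, 1, 2, 2) vertex
```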
So after the vertex shader is finished, the next step is vertex post processing. In vertex post processing the clipping happens and also perspective division happens where clip space is converted into NDC (normalized device coordinates). Also, viewport transform happens where NDC is converted to window space.
What does perspective division do?
Perspective division essentially divides the x, y, and z components of a vertex by the w component. Doing this actually does two things: it converts clip space to normalized device coordinates, and it also adds perspective by scaling the vertices.
What is Normalized Device Coordinates?
Normalized Device Coordinates is the coordinate system where all coordinates are condensed into an NDC box where each axis is in the range of -1 to +1.
After the conversion to NDC, the viewport transform happens, where all the NDC coordinates are converted to screen coordinates. NDC space will become window space.
If an NDC coordinate is (0.5, 0.5, 0.3), it will be mapped onto the window based on what the programmer provided to the function glViewport. If the viewport is 400x300 with its origin at (0, 0), the NDC coordinate will be placed at pixel 300 on the x axis and 225 on the y axis, since NDC (0, 0) maps to the viewport center at (200, 150).
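The x and y part of the viewport transform can be sketched as follows, assuming the usual OpenGL mapping of the NDC range [-1, 1] onto the rectangle given to glViewport (the function name is made up). With a 400x300 viewport at origin (0, 0), NDC (0, 0) lands at the center (200, 150) and NDC (0.5, 0.5) lands at (300, 225).

```python
def viewport_transform(ndc_x, ndc_y, vx, vy, width, height):
    """Map NDC x and y in [-1, 1] to window coordinates for a given viewport."""
    win_x = (ndc_x + 1.0) * 0.5 * width + vx
    win_y = (ndc_y + 1.0) * 0.5 * height + vy
    return win_x, win_y


print(viewport_transform(0.0, 0.0, 0, 0, 400, 300))
print(viewport_transform(0.5, 0.5, 0, 0, 400, 300))
```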
The perspective projection matrix does not decide what is clipped. After transforming a view space coordinate with the projection matrix, you get a clip space coordinate. This is a homogeneous coordinate. Based on this coordinate, the rendering pipeline clips the scene. The clipping rule is -w <= x, y, z <= w. In the next step of the rendering pipeline, the clip space coordinate is transformed into normalized device space by the perspective divide (x, y, z)' = (x/w, y/w, z/w). This division by the w component gives the perspective effect. (See also What exactly are eye space coordinates? and Transform the modelMatrix)

When is perspective division necessary?

I'm very confused about when it is necessary for a homogeneous coordinate (x, y, z, w) to be divided by w (i.e. converted to (x/w, y/w, z/w, 1)).
According to this page, when a homogeneous vertex coordinate is passed through an orthographic or perspective projection matrix, the coordinate will be in clip coordinates. Perspective division is necessary for clip coordinates to get NDC. It will produce x, y, and z values in the range (-1, 1).
I was following WebGL tutorial pages on orthographical and perspective projection. These tutorials didn't mention a single word about perspective division. I'm not sure if the division is still necessary after multiplying with projective matrix. Perhaps, the division is performed automatically during matrix multiplication?
The perspective divide (the conversion from clip space to NDC space) is neither necessary nor unnecessary; it is a part of the graphics pipeline. It happens automatically to every vertex that passes through the system.
You can make it into a no-op by setting a vertex's W component to 1.0 (which is what the orthographic projection matrix does, assuming the input position had a W of 1.0). But the division always happens.
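To illustrate the no-op case, here is a Python sketch using the standard OpenGL-style orthographic matrix, written in row-major form for column vectors. The helper names and the sample numbers are made up for illustration.

```python
def ortho(l, r, b, t, n, f):
    """Standard OpenGL-style orthographic projection matrix (row-major)."""
    return [
        [2 / (r - l), 0, 0, -(r + l) / (r - l)],
        [0, 2 / (t - b), 0, -(t + b) / (t - b)],
        [0, 0, -2 / (f - n), -(f + n) / (f - n)],
        [0, 0, 0, 1],
    ]


def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]


clip = mat_vec(ortho(-2, 2, -1, 1, 0.1, 10), [1.0, 0.5, -5.0, 1.0])
ndc = [c / clip[3] for c in clip[:3]]  # the divide still runs, but by 1.0
print(clip, ndc)
```

Because the last row is (0, 0, 0, 1), an input w of 1.0 comes out as 1.0, and the divide leaves x, y, and z unchanged.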

What is the role of gl_Position.w in Vulkan?

Variable gl_Position output from a GLSL vertex shader must have 4 coordinates. In OpenGL, it seems w coordinate is used to scale the vector, by dividing the other coordinates by it. What is the purpose of w in Vulkan?
Shaders and projections in Vulkan behave exactly the same as in OpenGL. There are small differences in depth ranges ([-1, 1] in OpenGL, [0, 1] in Vulkan) or in the origin of the coordinate system (lower-left in OpenGL, upper-left in Vulkan), but the principles are exactly the same. The hardware is still the same and it performs calculations in the same way both in OpenGL and in Vulkan.
4-component vectors serve multiple purposes:
Different transformations (translation, rotation, scaling) can be represented in the same way, with 4x4 matrices.
Projection can also be represented with a 4x4 matrix.
Multiple transformations can be combined into one 4x4 matrix.
The .w component you mention is used during perspective projection.
All this we can do with 4x4 matrices and thus we need 4-component vectors (so they can be multiplied by 4x4 matrices). Again, I write about this because the above rules apply both to OpenGL and to Vulkan.
As for the purpose of the .w component of the gl_Position variable: it is exactly the same in Vulkan. It is used to scale the position vector. During perspective calculations (projection matrix multiplication), the original depth is modified by the original .w component and stored in the .z component of the gl_Position variable. Additionally, the original depth is also stored in the .w component. After that (as a fixed-function step), the hardware performs the perspective division and divides the position stored in the gl_Position variable by its .w component.
In orthographic projection, the steps performed by the hardware are exactly the same, but the values used for the calculations are different. So the perspective division step is still performed by the hardware, but it does nothing (the position is divided by 1.0).
gl_Position is a homogeneous coordinate. The w component plays a role in perspective projection.
The projection matrix describes the mapping from 3D points of the view on a scene to 2D points on the viewport. It transforms from eye space to clip space, and the coordinates in clip space are transformed to normalized device coordinates (NDC) by dividing by the w component of the clip coordinates (perspective divide).
In a perspective projection, the projection matrix describes the mapping from 3D points in the world, as they are seen from a pinhole camera, to 2D points of the viewport. The eye space coordinates in the camera frustum (a truncated pyramid) are mapped to a cube (the normalized device coordinates).
Perspective Projection Matrix:
r = right, l = left, b = bottom, t = top, n = near, f = far (the matrix is listed in column-major order, one column per line)
2*n/(r-l) 0 0 0
0 2*n/(t-b) 0 0
(r+l)/(r-l) (t+b)/(t-b) -(f+n)/(f-n) -1
0 0 -2*f*n/(f-n) 0
When a Cartesian coordinate in view space is transformed by the perspective projection matrix, the result is a homogeneous coordinate. The w component grows with the distance to the point of view. This causes objects to become smaller after the perspective divide if they are farther away.
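A short Python sketch of this behaviour. It reads the matrix listing above as column-major (one column per line) and writes it out in row-major form for column vectors; the helper names and the frustum values are made up for illustration.

```python
def frustum(l, r, b, t, n, f):
    """OpenGL-style perspective projection matrix (row-major, column vectors)."""
    return [
        [2 * n / (r - l), 0, (r + l) / (r - l), 0],
        [0, 2 * n / (t - b), (t + b) / (t - b), 0],
        [0, 0, -(f + n) / (f - n), -2 * f * n / (f - n)],
        [0, 0, -1, 0],
    ]


def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]


proj = frustum(-1, 1, -1, 1, 1, 100)
clip_near = mat_vec(proj, [1.0, 0.0, -1.0, 1.0])   # 1 unit in front of the eye
clip_far = mat_vec(proj, [1.0, 0.0, -10.0, 1.0])   # 10 units in front of the eye

# Same eye-space x, but the larger w shrinks the far point after the divide.
print(clip_near[3], clip_near[0] / clip_near[3])
print(clip_far[3], clip_far[0] / clip_far[3])
```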
In computer graphics, transformations are represented with matrices. If you want something to rotate, you multiply all its vertices (a vector) by a rotation matrix. Want it to move? Multiply by translation matrix, etc.
tl;dr: You can't describe translation along the z-axis with 3D matrices and vectors. You need at least 1 more dimension, so they just added a dummy dimension w. But things break if it's not 1, so keep it at 1 :P.
Anyway, now we begin with a quick review on matrix multiplication:
You basically put x above a, y above b, z above c. Multiply the whole column by the variable you just moved, and sum up everything in the row.
So if you were to translate a vector, you'd want something like:
See how x and y are now translated by az and bz? That's pretty awkward though:
You'd have to account for how big z is whenever you move things (what if z was negative? You'd have to move in opposite directions. That's cumbersome as hell if you just want to move something an inch over...)
You can't move along the z axis. You'll never be able to fly or go underground
But, if you can make sure z = 1 at all times:
Now it's much clearer that this matrix allows you to move in the x-y plane by a, and b amounts. Only problem is that you're conceptually levitating all the time, and you still can't go up or down. You can only move in 2D.
But you see a pattern here? With 3D matrices and 3D vectors, you can describe all the fundamental movements in 2D. So what if we added a 4th dimension?
Looks familiar. If we keep w = 1 at all times:
There we go, now you get translation along all 3 axes. This is what's called homogeneous coordinates.
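The matrices this walkthrough refers to are not shown in the text. A reconstruction that is consistent with the surrounding description (translation amounts a, b, and an assumed third amount c for the z axis) might look like this:

```latex
% 3D matrix, 3D vector: x and y are shifted by az and bz
\begin{pmatrix} 1 & 0 & a \\ 0 & 1 & b \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
=
\begin{pmatrix} x + az \\ y + bz \\ z \end{pmatrix}
\quad\xrightarrow{\;z = 1\;}\quad
\begin{pmatrix} x + a \\ y + b \\ 1 \end{pmatrix}

% 4D matrix, 4D vector: the same trick with one more dimension
\begin{pmatrix} 1 & 0 & 0 & a \\ 0 & 1 & 0 & b \\ 0 & 0 & 1 & c \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix}
=
\begin{pmatrix} x + aw \\ y + bw \\ z + cw \\ w \end{pmatrix}
\quad\xrightarrow{\;w = 1\;}\quad
\begin{pmatrix} x + a \\ y + b \\ z + c \\ 1 \end{pmatrix}
```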
But what if you were doing some big & complicated transformation, resulting in w != 1, and there's no way around it? OpenGL (and basically any other CG system I think) will do what's called normalization: divide the resultant vector by the w component. I don't know enough to say exactly why ('cause scaling is a linear transformation?), but it has favorable implications (can be used in perspective transforms). Anyway, the translation matrix would actually look like:
And there you go, see how each component is shrunken by w, then it's translated? That's why w controls scaling.
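A small Python sketch of that last step, assuming the usual 4x4 translation matrix with amounts a, b, c in the last column and normalization as a plain division by w. The function names are made up for illustration.

```python
def translate(v, a, b, c):
    """Apply a homogeneous translation by (a, b, c) to the vector (x, y, z, w)."""
    x, y, z, w = v
    return (x + a * w, y + b * w, z + c * w, w)


def normalize(v):
    """Divide a homogeneous vector by its w component."""
    x, y, z, w = v
    return (x / w, y / w, z / w, 1.0)


# With w = 2, each component is first halved and then moved by 1.
print(normalize(translate((2.0, 4.0, 6.0, 2.0), 1.0, 1.0, 1.0)))
```

The result equals (x/w + a, y/w + b, z/w + c, 1), which is the "shrunken by w, then translated" behaviour described above.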
