I recently realized that OpenGL performs perspective division not only for x and y, but for z too.
In my understanding, x /= w; and y /= w; would be enough. Of course, we would then need different projection matrices.
So why does OpenGL do z /= w;? To make the z-buffer more precise at short distances but less precise at long ones?
Mathematically, dividing all components is the correct way. That way, z can be interpolated linearly in screen space (perspective-correct interpolation is not done for position data, as it is supposed to be interpolated in screen space).
Linear interpolation in screen space of course means that, seen from eye or object space, z appears nonlinear. For an object not parallel to the image plane, going one pixel to the left on the screen means going a variable amount along +z or -z, depending on the distance, so the perspective actually does distort the z axis, too.
The side effect is that Z buffer precision is highest at the near plane, and that is actually a good thing for most scenes.
Using an "undivided" Z for the depth test is called W buffering. But that means that linear interpolation can't be used any more. However, with modern GPUs, that is not too big an issue.
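To put numbers on the precision argument, here is a minimal sketch. It assumes the classic OpenGL perspective matrix (z_clip = -(f+n)/(f-n) * z_eye - 2fn/(f-n), w_clip = -z_eye); the function name ndcDepth is made up for illustration.

```cpp
#include <cassert>

// NDC depth (z_clip / w_clip) of a point at distance d in front of the camera,
// assuming the classic OpenGL perspective matrix with near plane n and far plane f.
double ndcDepth(double d, double n, double f) {
    assert(n > 0.0 && f > n && d > 0.0);
    double zEye  = -d;                                   // camera looks down -z
    double zClip = -(f + n) / (f - n) * zEye - 2.0 * f * n / (f - n);
    double wClip = -zEye;
    return zClip / wClip;                                // the z /= w step
}
```

With n = 1 and f = 100, the near plane maps to -1 and the far plane to +1, but a point at distance 2 already lands slightly above 0, and a point halfway through the range (50.5) lands above 0.98. So about half of the NDC depth range is spent on the first unit behind the near plane.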
The projection transformation has the role of moving objects from World Space to Projection Space (actually, they pass through Camera Space before Projection Space, but that is out of scope here).
Visually every other space is a cube going from -1 to 1, while the Projection Space is a pyramid section, with the near plane at Z = 0 and the far plane at Z = 1 (or Z = -1, depending on whether the system is right-handed or left-handed). So z gets morphed as well (unless you have an orthographic projection). Z goes from 0 to 1 because it doesn't really make any sense for objects behind the near plane to get into the rendering pipeline.
You mentioned Z-buffer precision yourself, both in the question and in the comment. Its precision won't change; however, after the projection transformation, the Z deltas between objects will be smaller near the far plane and larger near the near plane (in less fancy words: the distance on the Z axis between objects close to the near plane is increased, while the distance on the Z axis between objects close to the far plane is decreased).
This is why reducing the distance between the near plane and the far plane sometimes fixes Z-fighting: the distance between faraway objects is compressed less when the two planes are closer together.
I was quite confused on how the projection matrix worked so I researched and I discovered a few other things but after researching a few days, I just wanted to confirm my understanding is correct. I might use a few wrong terms but my brain was exhausted after writing this. A few topics I just researched briefly like screen coordinates and window transform so I didn’t write much about it and my knowledge might be incorrect. Is everything I’ve written here correct or mostly correct? Correct me on anything if I’m wrong.
What does the projection matrix do?
So the perspective projection matrix defines a frustum that is a truncated pyramid. Anything outside of that frustum/frustum range will be clipped. I'll get more on that later. The perspective projection matrix also adds perspective. To make the vertices follow the rules of perspective, the perspective projection matrix manipulates the vertex's w component (the homogeneous component) depending on how far the vertex is from the viewer (the farther the vertex is, the larger the w coordinate).
Why and how does the w component give the world perspective?
The w component gives the world perspective because in the perspective division (perspective division happens in the vertex post-processing stage), when x, y and z are divided by the w component, the vertex coordinate is scaled smaller depending on how big the w component is. So essentially, the w component scales the object smaller the farther away the object is.
Example:
Vertex position (1, 1, 2, 2).
Here, the vertex is 2 away from the viewer. In perspective division the x, y, and z will be divided by 2 because 2 is the w component.
(1/2, 1/2, 2/2) = (0.5, 0.5, 1).
As shown here, the vertex coordinate has been scaled by half.
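The same arithmetic as a minimal sketch (the Vec3/Vec4 structs and the function name are just illustrative, not from any particular library):

```cpp
#include <cassert>

struct Vec4 { double x, y, z, w; };
struct Vec3 { double x, y, z; };

// Perspective division: clip-space (x, y, z, w) -> NDC (x/w, y/w, z/w).
Vec3 perspectiveDivide(const Vec4& clip) {
    assert(clip.w != 0.0);  // the division is undefined for w == 0
    return { clip.x / clip.w, clip.y / clip.w, clip.z / clip.w };
}
```

Calling perspectiveDivide(Vec4{1, 1, 2, 2}) gives (0.5, 0.5, 1), matching the example above.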
How does the projection matrix decide what will be clipped?
The near and far plane are the limits of where the viewer can see (anything beyond the far plane and before the near plane will be clipped). Any coordinate will also have to go through a clipping check to see if it has to be clipped. The clipping check is checking whether the vertex coordinate is within a frustum range of -w to w. If it is outside of that range, it will be clipped.
Let's say I have a vertex with a position of (2, 130, 90, 90).
x value is 2
y value is 130
z value is 90
w value is 90
This vertex must be within the range of -90 to 90. The x and z values are within the range, but the y value goes beyond it, thus the vertex will be clipped.
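As a minimal sketch, the per-vertex range check looks like this (names are illustrative; note that real hardware clips whole primitives against the volume rather than simply discarding single vertices):

```cpp
struct Vec4 { double x, y, z, w; };

// True if the clip-space coordinate lies inside the view volume,
// i.e. -w <= x, y, z <= w (OpenGL convention).
bool insideClipVolume(const Vec4& c) {
    return -c.w <= c.x && c.x <= c.w &&
           -c.w <= c.y && c.y <= c.w &&
           -c.w <= c.z && c.z <= c.w;
}
```

For (2, 130, 90, 90) this returns false because y = 130 is larger than w = 90. A negative w can never pass the test, since then -w is larger than w.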
So after the vertex shader is finished, the next step is vertex post processing. In vertex post processing the clipping happens and also perspective division happens where clip space is converted into NDC (normalized device coordinates). Also, viewport transform happens where NDC is converted to window space.
What does perspective division do?
Perspective division essentially divides the x, y, and z components of a vertex by the w component. Doing this actually does two things: it converts clip space to normalized device coordinates, and it adds perspective by scaling the vertices.
What is Normalized Device Coordinates?
Normalized Device Coordinates is the coordinate system where all coordinates are condensed into an NDC box where each axis is in the range of -1 to +1.
After the conversion to NDC, the viewport transform happens, where all NDC coordinates are converted to screen coordinates. NDC space becomes window space.
If an NDC coordinate is (0.5, 0.5, 0.3), it will be mapped onto the window based on what the programmer provided in the function glViewport. If the viewport is 400x300 with its origin at (0, 0), the NDC coordinate will be placed at pixel 300 on the x axis and 225 on the y axis (the NDC origin (0, 0) maps to the viewport center, pixel 200, 150).
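A minimal sketch of the x/y part of the viewport transform, assuming the standard OpenGL mapping x_w = (x_ndc + 1) * width / 2 + x0 (the Viewport struct and function name are made up for illustration):

```cpp
struct Viewport { double x, y, width, height; };  // same meaning as the glViewport arguments
struct Vec2 { double x, y; };

// Map NDC x/y in [-1, 1] to window coordinates inside the viewport.
Vec2 ndcToWindow(double xNdc, double yNdc, const Viewport& vp) {
    return { (xNdc + 1.0) * 0.5 * vp.width  + vp.x,
             (yNdc + 1.0) * 0.5 * vp.height + vp.y };
}
```

With a 400x300 viewport at the origin, NDC (0, 0) lands at the center (200, 150) and NDC (0.5, 0.5) lands at (300, 225).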
The perspective projection matrix does not decide what is clipped. After transforming a world coordinate with the projection, you get a clip space coordinate. This is a homogeneous coordinate. Based on this coordinate, the rendering pipeline clips the scene. The clipping rule is -w <= x, y, z <= w. In the following stage of the rendering pipeline, the clip space coordinate is transformed into normalized device space by the perspective divide (x, y, z)' = (x/w, y/w, z/w). This division by the w component gives the perspective effect. (See also What exactly are eye space coordinates? and Transform the modelMatrix)
I have this function to get a 2D pixel location from a 3D coordinate position. The x y z are pre-transform coordinates (1 to -1). This is a model view architecture with the camera permanently at -3.5,0,0 looking at 0,0,0 while the object/scene coordinates are transformed by a horizontal xz rotation and vertical y rotation, etc. to produce the final frame.
This function is mostly used to overlay 2D text on top of the 3D scene. Where the 2D text is positioned relative to the 3D underlying scene.
void My3D::Get2Dfrom3Dx(float x, float y, float z, float* psx, float* psy) {
    // Assumed: screenCoord is a local built from the input point.
    XMFLOAT3 screenCoord(x, y, z);
    XMVECTOR xmScreenCoord = XMLoadFloat3(&screenCoord);
    XMMATRIX xmWorldViewProjection = XMLoadFloat4x4((XMFLOAT4X4*) &m_WorldViewProjection);
    // Transforms the point and divides by w, giving normalized device coordinates.
    XMVECTOR result = XMVector3TransformCoord(xmScreenCoord, xmWorldViewProjection);
    XMStoreFloat3(&screenCoord, result);
    // Map NDC [-1, 1] to pixels; y is flipped because screen y grows downward.
    screenCoord.x = ((screenCoord.x + 1.0f) / 2.0f) * m_nCurrWidth;
    screenCoord.y = ((-screenCoord.y + 1.0f) / 2.0f) * m_nCurrHeight;
    *psx = screenCoord.x;
    *psy = screenCoord.y;
}
This function works perfectly when the scene is fully or mostly visible (the eyeat between -4 and -1.5).
I have a nagging problem with text showing up mirrored in a 3D position where it should not be.
This happens when, for example, I'm viewing the image from below (60+ degrees upward, below the object) and zooming (moving the eyeat location closer, to say -.5,0,0). The text should not be visible as it should be behind the eye (note eyeat is not past 0,0,0, which really messes the image up),
but somehow the above function causes the calculated screen x y coordinates to show within the viewport in situations where they should not.
I seem to think there is a simple solution to this side effect but can't find it. Hopefully someone has seen this 2d mirrored problem/effect before and knows the simple tweak.
I realize I could go down a more complex path of determining if the view vector is opposite the target point and filter this way, but I seem to think there should be a simpler solution.
Again, the camera is permanently on the line -3.5, 0, 0 to say -.5,0,0 as the world is transformed around it.
The problem lies in the way the projection works. Basically, the perspective projection will divide the x and y coordinates by the z coordinate. That's how you get the effect of perspective, i.e., that things that are farther away (larger z coordinate) appear smaller on screen. One issue with this perspective division is (simplified) that it doesn't work correctly for stuff that's behind the camera. Stuff behind the camera will have a negative z coordinate. When you divide x and y by a negative value, you'll have your point reflected around the origin. Which is exactly what you see. Since stuff that's located behind the camera is not going to be visible anyways, one way to solve this problem is to simply clip all geometry before dividing by z such that everything that has a negative z value is cut off and removed.
I assume the division in your code here happens inside XMVector3TransformCoord(). As you note yourself, the text should not be visible in the problematic cases anyways. So I suggest you simply check whether the text is behind the camera and don't render it if it is. One way to do so would be to simply check the result of transforming your world-space position with the xmWorldViewProjection matrix and only continue if it happens to be in front of the camera. xmScreenCoord holds the homogeneous clipspace coordinates of your point. The point will be in front of the camera iff the z coordinate of xmScreenCoord is larger than zero. So I guess you'd want to do something like
if (XMVectorGetZ(xmScreenCoord) > 0)
{
…
}
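To make the idea concrete, here is a minimal sketch in plain C++ rather than DirectXMath. It assumes a Direct3D-style row-vector convention (v * M) and a projection whose clip-space w equals the view-space depth; Mat4, transformPoint and projectToScreen are hypothetical names. The key point is to look at w before the divide. XMVector3TransformCoord has already divided by w, so with DirectXMath one would probably use XMVector3Transform (which does not divide) and test XMVectorGetW of its result.

```cpp
#include <array>

using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major, used as v * M
struct Vec4 { float x, y, z, w; };

// Transform the point (x, y, z, 1) by m WITHOUT dividing by w.
Vec4 transformPoint(float x, float y, float z, const Mat4& m) {
    return {
        x * m[0][0] + y * m[1][0] + z * m[2][0] + m[3][0],
        x * m[0][1] + y * m[1][1] + z * m[2][1] + m[3][1],
        x * m[0][2] + y * m[1][2] + z * m[2][2] + m[3][2],
        x * m[0][3] + y * m[1][3] + z * m[2][3] + m[3][3]
    };
}

// Returns false (and leaves *sx, *sy untouched) if the point is at or behind
// the eye plane; otherwise writes its pixel position and returns true.
bool projectToScreen(float x, float y, float z, const Mat4& worldViewProj,
                     float width, float height, float* sx, float* sy) {
    Vec4 clip = transformPoint(x, y, z, worldViewProj);
    if (clip.w <= 0.0f) {
        return false;                        // dividing would mirror the point
    }
    float ndcX = clip.x / clip.w;
    float ndcY = clip.y / clip.w;
    *sx = (ndcX + 1.0f) * 0.5f * width;
    *sy = (1.0f - ndcY) * 0.5f * height;     // screen y grows downward
    return true;
}
```

The caller simply skips drawing the label when the function returns false.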
Sidenote due to the discussion in the comments below: When one wants to solve a problem involving the projections of objects on screen, one can often avoid explicitly computing the projection by instead transforming the problem into its dual and working directly in projective space on homogeneous coordinates. Since your problem is about placing text in 2D on screen, however, I don't think this is an option here. You could place the geometry for drawing your text in clip-space directly. You would start again by computing the clip-space coordinates of the 3D point to which you want your 2D text attached (by multiplying them with m_WorldViewProjection but not dividing by w). You can then generate homogeneous coordinates for the geometry for drawing your text by simply offsetting the x- and y- coordinates from that point to get the corners of a quad or whatever you need to construct. If you then also scale the size of the quad by the w coordinate of the point, you will get a quad at that position that always projects to the same size on the screen (since the premultiplication with w effectively cancels out the projection). However, all you're effectively doing then is leaving the application of the projection and the necessary clipping to the GPU. If you want to render a large number of quads, that might be an option to consider as it could be done completely on the GPU, e.g., using a geometry shader. However, if you just have a few text elements, it would be much simpler and probably also more efficient to just skip the drawing of text elements that would be behind the camera as described above…
Michael's response was very helpful in making me continue down a path that the solution should be a simple comparison. In my case, I had to re-evaluate the screen coordinates by only applying the World transform, versus the full WorldViewProjection. I call this TargetTransformed. My comparison value was then simply the Eye/Camera location (this never gets adjusted (except zoom) as the world is transformed around the Eye.) And again note my Camera in this case is at -3.5,0,0 looking at 0,0,0 (center of model, really 8,0,0 thus a line through the center). So I had to compare the x component, not the z component. I add a bit of fudge .1F as the mirror artifact happens when the target is significantly behind the camera. In which case I return the final screenCoord locations translated (-8000) way out in outer space as to guarantee they are not seen in the viewport.
if ((Eye.x + 0.1F) > TargetTransformed.x)
{
    // Target is behind the camera: push it far off-screen so it is never drawn.
    screenCoord.x += -8000;
    screenCoord.y += -8000;
    //TRACE("point is behind camera.\n");
}
*psx = screenCoord.x;
*psy = screenCoord.y;
And for completeness, my project has 2 view models: a) looking along a line through the center of the model. Which this can be translated to look from any direction and offset by screen x and y. The first view model works fine with this above code. The second view model b) targets the camera to look at a focal point of the model and then allows full rotation around that random point (not the center of the model) which requires calculating a tricky additional translation matrix and vector I call TargetViewTranslation. And for this additional translation, the formula adds the z component of the additional transform.
if ((Eye.x + 0.1F - m_structTargetViewTranslation.Z) > TargetTransformed.x)
{
    // Target is behind the camera: push it far off-screen so it is never drawn.
    screenCoord.x += -8000;
    screenCoord.y += -8000;
    //TRACE("point is behind camera.\n");
}
*psx = screenCoord.x;
*psy = screenCoord.y;
And success, my mirrored text problem is resolved. Hopefully this helps others with this mirrored text problem. The takeaways: one may need to transform the test point by the World transform only, the check should be a simple comparison, and the location of the camera may determine whether the x or z component is used. If you are translating the world in any additional ways, that translation could also affect whether x or z is compared. Using TRACE and looking at the x y z values was helpful in figuring out which components I needed to use in my specific case.
Variable gl_Position output from a GLSL vertex shader must have 4 coordinates. In OpenGL, it seems w coordinate is used to scale the vector, by dividing the other coordinates by it. What is the purpose of w in Vulkan?
Shaders and projections in Vulkan behave exactly the same as in OpenGL. There are small differences in depth ranges ([-1, 1] in OpenGL, [0, 1] in Vulkan) or in the origin of the coordinate system (lower-left in OpenGL, upper-left in Vulkan), but the principles are exactly the same. The hardware is still the same and it performs calculations in the same way both in OpenGL and in Vulkan.
4-component vectors serve multiple purposes:
Different transformations (translation, rotation, scaling) can be represented in the same way, with 4x4 matrices.
Projection can also be represented with a 4x4 matrix.
Multiple transformations can be combined into one 4x4 matrix.
The .w component you mention is used during perspective projection.
All this we can do with 4x4 matrices and thus we need 4-component vectors (so they can be multiplied by 4x4 matrices). Again, I write about this because the above rules apply both to OpenGL and to Vulkan.
So as for the purpose of the .w component of the gl_Position variable: it is exactly the same in Vulkan. It is used to scale the position vector. During perspective calculations (projection matrix multiplication), the original depth is modified by the original .w component and stored in the .z component of the gl_Position variable. Additionally, the original depth is also stored in the .w component. After that (as a fixed-function step), the hardware performs perspective division and divides the position stored in the gl_Position variable by its .w component.
In orthographic projection, the steps performed by the hardware are exactly the same, but the values used for calculations are different. So the perspective division step is still performed by the hardware but it does nothing (the position is divided by 1.0).
gl_Position is a homogeneous coordinate. The w component plays a role in perspective projection.
The projection matrix describes the mapping from 3D points of the view on a scene to 2D points on the viewport. It transforms from eye space to clip space, and the coordinates in clip space are transformed to normalized device coordinates (NDC) by dividing by the w component of the clip coordinates (perspective divide).
With perspective projection, the projection matrix describes the mapping from 3D points in the world, as they are seen from a pinhole camera, to 2D points of the viewport. The eye space coordinates in the camera frustum (a truncated pyramid) are mapped to a cube (the normalized device coordinates).
Perspective Projection Matrix (listed in column-major order, so each line below is one column of the matrix):
r = right, l = left, b = bottom, t = top, n = near, f = far
2*n/(r-l)      0              0               0
0              2*n/(t-b)      0               0
(r+l)/(r-l)    (t+b)/(t-b)    -(f+n)/(f-n)   -1
0              0              -2*f*n/(f-n)    0
When a Cartesian coordinate in view space is transformed by the perspective projection matrix, the result is a homogeneous coordinate. The w component grows with the distance to the point of view. This causes objects to become smaller after the perspective divide the further away they are.
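A minimal sketch of that effect: it builds the matrix above in column-major order and multiplies it with an eye-space point. The helper names frustum and mul are made up for illustration and are not OpenGL API calls.

```cpp
#include <array>
#include <cassert>

using Mat4 = std::array<double, 16>;   // column-major, element (row, col) at [col * 4 + row]
struct Vec4 { double x, y, z, w; };

// Classic OpenGL perspective (frustum) matrix, one column per line.
Mat4 frustum(double l, double r, double b, double t, double n, double f) {
    assert(r != l && t != b && f != n);
    return {
        2.0 * n / (r - l), 0.0,               0.0,                      0.0,
        0.0,               2.0 * n / (t - b), 0.0,                      0.0,
        (r + l) / (r - l), (t + b) / (t - b), -(f + n) / (f - n),      -1.0,
        0.0,               0.0,               -2.0 * f * n / (f - n),   0.0
    };
}

// Column-vector convention: result = m * v.
Vec4 mul(const Mat4& m, const Vec4& v) {
    return {
        m[0] * v.x + m[4] * v.y + m[8]  * v.z + m[12] * v.w,
        m[1] * v.x + m[5] * v.y + m[9]  * v.z + m[13] * v.w,
        m[2] * v.x + m[6] * v.y + m[10] * v.z + m[14] * v.w,
        m[3] * v.x + m[7] * v.y + m[11] * v.z + m[15] * v.w
    };
}
```

With frustum(-1, 1, -1, 1, 1, 10), the eye-space point (1, 1, -2, 1) gets w = 2 and ends up at x/w = 0.5, while (1, 1, -4, 1) gets w = 4 and ends up at x/w = 0.25. In other words, w_clip = -z_eye, and the same offset from the view axis shrinks on screen as the point moves away.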
In computer graphics, transformations are represented with matrices. If you want something to rotate, you multiply all its vertices (a vector) by a rotation matrix. Want it to move? Multiply by translation matrix, etc.
tl;dr: You can't describe translation along the z-axis with 3D matrices and vectors. You need at least 1 more dimension, so they just added a dummy dimension w. But things break if it's not 1, so keep it at 1 :P.
Anyway, now we begin with a quick review on matrix multiplication:
You basically put x above a, y above b, z above c. Multiply the whole column by the variable you just moved, and sum up everything in the row.
So if you were to translate a vector, you'd want something like:
| 1 0 a |   | x |   | x + a*z |
| 0 1 b | * | y | = | y + b*z |
| 0 0 1 |   | z |   |    z    |
See how x and y are now translated by az and bz? That's pretty awkward though:
You'd have to account for how big z is whenever you move things (what if z was negative? You'd have to move in opposite directions. That's cumbersome as hell if you just want to move something an inch over...)
You can't move along the z axis. You'll never be able to fly or go underground
But, if you can make sure z = 1 at all times:
| 1 0 a |   | x |   | x + a |
| 0 1 b | * | y | = | y + b |
| 0 0 1 |   | 1 |   |   1   |
Now it's much clearer that this matrix allows you to move in the x-y plane by a, and b amounts. Only problem is that you're conceptually levitating all the time, and you still can't go up or down. You can only move in 2D.
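But you see a pattern here? With 3D matrices and 3D vectors, you can describe all the fundamental movements in 2D. So what if we added a 4th dimension?
| 1 0 0 a |   | x |   | x + a*w |
| 0 1 0 b | * | y | = | y + b*w |
| 0 0 1 c |   | z |   | z + c*w |
| 0 0 0 1 |   | w |   |    w    |
Looks familiar. If we keep w = 1 at all times:
| 1 0 0 a |   | x |   | x + a |
| 0 1 0 b | * | y | = | y + b |
| 0 0 1 c |   | z |   | z + c |
| 0 0 0 1 |   | 1 |   |   1   |
There we go, now you get translation along all 3 axes. This is what's called homogeneous coordinates.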
But what if you were doing some big & complicated transformation, resulting in w != 1, and there's no way around it? OpenGL (and basically any other CG system I think) will do what's called normalization: divide the resultant vector by the w component. I don't know enough to say exactly why ('cause scaling is a linear transformation?), but it has favorable implications (can be used in perspective transforms). Anyway, the translation matrix would actually look like:
And there you go, see how each component is shrunken by w, then it's translated? That's why w controls scaling.
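A minimal sketch of the whole argument, assuming the usual 4x4 translation matrix with the offsets (a, b, c) in its last column (the function names are illustrative only):

```cpp
#include <cassert>

struct Vec4 { double x, y, z, w; };

// Result of multiplying (x, y, z, w) by a 4x4 translation matrix
// whose last column is (a, b, c, 1).
Vec4 translate(const Vec4& v, double a, double b, double c) {
    return { v.x + a * v.w, v.y + b * v.w, v.z + c * v.w, v.w };
}

// Normalization: divide everything by w so that w becomes 1 again.
Vec4 normalize(const Vec4& v) {
    assert(v.w != 0.0);
    return { v.x / v.w, v.y / v.w, v.z / v.w, 1.0 };
}
```

translate({1, 2, 3, 1}, 10, 20, 30) gives (11, 22, 33, 1). Starting from (2, 4, 6, 2) instead, translating and then normalizing gives (11, 22, 33, 1) as well: each component is first shrunk by w = 2 (to 1, 2, 3) and then moved by the offset.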
In OpenGL, I have read that a vertex should be represented by (x,y,z,w), where w = z. This is to enable perspective divide, whereby (x,y,z) are divided by w in order to determine their screen position due to the perspective effect. And if they were just divided by the original z value, then the z would be 1 everywhere.
My question is: Why do you need to divide the z component by w at all? Why can you not just divide the x and y components by z, such that the screen coordinates have the perspective effect applied, and then just used the original unmodified z component for depth testing? In this way, you would not have to use the w component at all....
Obviously I am missing something!
3D computer graphics is typically handled with homogeneous coordinates and in a projective vector space. The math behind this is a bit more than "just divide by w".
Using 4D homogeneous vectors and 4x4 matrices has the nice advantage that all sorts of affine transformations (and this includes especially the translation, which also relies on w) and projective transforms can be represented by simple matrix multiplications.
In OpenGL, I have read that a vertex should be represented by
(x, y, z, w), where w = z.
That is not true. A vertex should be represented by (x, y, z, w), where w is just w. In your typical case, the input w is actually 1, so it is usually not stored in the vertex data, but added on demand in the shaders etc.
Your typical projection matrix will set w_clip = -z_eye. But that is a different thing. This means that you just project along the -z direction in eye space. You could also put w_clip=2 *x_eye -3*y_eye + 4 * z_eye there, and your axis of projection would have the direction (2, -3, 4, 0).
My question is: Why do you need to divide the z component by w at all? Why can you not just divide the x and y components by z, such that the screen coordinates have the perspective effect applied, and then just used the original unmodified z component for depth testing?
Conceptually, the space is distorted along all 3 dimensions, not just in x and y. Furthermore, in the beginning, GPUs had just 16 bit or 24 bit integer precision for the depth buffer. In such a scenario, you definitely want to have a denser representation near the camera, and a sparse one far away.
Nowadays, with programmable vertex shaders and floating-point depth buffer formats, you can basically just store the z_eye value in the depth buffer, and use this for depth testing. However, this is typically referred to as W buffering, because the (clip space) w component is used.
There is another conceptual issue if you would divide by z: you wouldn't be able to use an orthogonal projection, you always would force some kind of perspective. Now one might argue that the division by z doesn't have to happen automatically, but one could apply it when needed in the vertex shader. But this won't work either. You must not apply the perspective divide in the vertex shader, because that would project points which lie behind the camera in front of the camera. As the vertex shader does not work on whole primitives, this would completely screw up any primitive there if at least one vertex lies behind the camera and another one lies in front of it. To deal with that situation, the clipping has to be applied before the divide - hence the name clip space.
In this way, you would not have to use the w component at all.
That is also not true. The w component is further used down the pipeline. It is essential for perspective-correct attribute interpolation.
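For illustration, here is a minimal sketch of perspective-correct interpolation between two vertices, using the standard formula that interpolates attribute/w and 1/w linearly in screen space and then divides. The function name and parameters are made up; real GPUs do this in fixed-function hardware.

```cpp
#include <cassert>

// Interpolate an attribute between two vertices with clip-space w values
// w0 and w1, at screen-space parameter t in [0, 1].
double perspectiveCorrect(double a0, double w0, double a1, double w1, double t) {
    assert(w0 > 0.0 && w1 > 0.0);            // both vertices in front of the camera
    double num = (1.0 - t) * a0 / w0 + t * a1 / w1;   // a/w is linear in screen space
    double den = (1.0 - t) / w0 + t / w1;             // so is 1/w
    return num / den;
}
```

With a0 = 0 at w0 = 1 and a1 = 1 at w1 = 3, the screen-space midpoint (t = 0.5) yields 0.25 rather than the naive 0.5, because the half of the edge closer to the viewer covers more pixels. When both w values are equal, the result reduces to plain linear interpolation.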
I am confused about the position of objects in OpenGL. The eye position is 0,0,0, and the projection plane is at z = -1. At this point, will the objects be between the eye position and the plane (z from 0 to -1), or behind the projection plane? Also, is there any particular reason for this?
First of all, there is no eye in modern OpenGL. There is also no camera. There is no projection plane. You define these concepts by yourself; the graphics library does not give them to you. It is your job to transform your object from your coordinate system into clip space in your vertex shader.
I think you are thinking about projection wrong. Projection doesn't move the objects in the same sense that a translation or rotation matrix might. If you take a look at the link above, you can see that in order to render a perspective projection, you calculate the x and y components of the projected coordinate with R = V(ez/pz), where ez is the depth of the projection plane, pz is the depth of the object, V is the coordinate vector, and R is the projection. Almost always you will use ez=1, which makes that equation into R = V/pz, allowing you to place pz in the w coordinate allowing OpenGL to do the "perspective divide" for you. Assuming you have your eye and plane in the correct places, projecting a coordinate is almost as simple as dividing by its z coordinate. Your objects can be anywhere in 3D space (even behind the eye), and you can project them onto your plane so long as you don't divide by zero or invalidate your z coordinate that you use for depth testing.
There is no "projection plane" at z=-1. I don't know where you got this from. The classic GL perspective matrix assumes an eye space where the camera is located at origin and looking into -z direction.
However, there is the near plane at z<0, and everything in front of the near plane (that is, closer to the eye) is going to be clipped. You cannot put the near plane at z=0, because then you would end up with a division by zero when trying to project points on that plane. So that is one reason that the viewing volume isn't a pyramid with the eye point at the top but a pyramid frustum.
This is, by the way, also true for real-world eyes or cameras. The projection center lies behind the lens, so no object can get infinitely close to the optical center in either case.
The other reason why you want a big near clipping distance is the precision of the depth buffer. The whole depth range between the near and the far plane has to be mapped to some depth value with a limited number of bits, typically 24. So you want to keep the far plane as close as possible, and shift the near plane as far away as possible. The non-linear mapping of the screen-space z coordinate makes this even more important, as the precision is non-uniformly distributed over that range.
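To see how much the near plane matters, here is a minimal sketch. It assumes the classic OpenGL perspective matrix, the default depth range [0, 1], and a 24-bit integer depth buffer with simple rounding; depth24 is a hypothetical helper, and real hardware may quantize slightly differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// 24-bit depth value stored for a surface at distance d in front of the camera,
// with near plane n and far plane f.
std::uint32_t depth24(double d, double n, double f) {
    assert(n > 0.0 && f > n && d >= n && d <= f);
    double ndc    = (f + n) / (f - n) - 2.0 * f * n / ((f - n) * d);  // z_clip / w_clip
    double window = 0.5 * ndc + 0.5;                                  // [-1, 1] -> [0, 1]
    return static_cast<std::uint32_t>(std::llround(window * 16777215.0));  // 2^24 - 1 steps
}
```

With f = 1000, two surfaces at distances 900.0 and 900.2 get different depth values when n = 1, but collapse onto the same value when n = 0.01, which is exactly the situation where z-fighting shows up.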