I need to implement raycasting. For this I need to convert the mouse cursor to world space.
I use the unproject function for this. I need to first find a point on the near plane, then one on the far plane. Then I subtract the first from the second to get a ray direction.
But I don't understand how to set winZ correctly, because implementations differ: some use winZ = 0 (near plane) and winZ = 1 (far plane), others use winZ = -1 (near plane) and winZ = 1 (far plane).
What's the difference in these ranges?
If it really is window-space z, then there is only the [0,1] range. Note that 0 doesn't necessarily mean the near plane and 1 the far plane, though. One can set up a projection matrix where near ends up at 1 and far at 0 (reversed z, which, in combination with the [0,1] z clip convention explained below, has some precision advantages).
Also note that glDepthRange can further modify which values (inside [0,1]) these two planes are mapped to.
To understand the unproject operation, you first need to understand the different coordinate spaces. Typically in a render API, you have to deal with these 3 spaces at the end of the transform chain:
clip space: This is what the output of the vertex shader is in, and where the actual clipping at least on a conceptual level happens. This space is still homogeneous with an arbitrary value for the w coordinate.
normalized device coordinates (NDC). This is what you get after the perspective division by the clip space w coordinate (after the clipping has been applied, which will eliminate the w<=0 cases completely).
window space. The 2D xy part are the actual pixel coordinates inside your output window (or your render target), and the transformation from NDC xy to window space xy is defined by the viewport settings. The z coordinate is the value which will go into the depth test, and depth buffer, it is in the range [0,1] or some sub-range of that (controlled via glDepthRange in the GL).
Other spaces before these in this list, like model/object space, world space, eye/view space, are completely up to you and do not concern the GPU directly at all (legacy fixed-function GL did care about eye space for lighting and fog, but nowadays, implementing this is all your job, and you can use whatever spaces you see fit).
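To make the tail of that chain concrete, here is a minimal pure-Python sketch of the two fixed-function steps (clip space → NDC → window space), assuming the default GL conventions; the function name and parameter layout are my own:

```python
def clip_to_window(clip, viewport, depth_range=(0.0, 1.0)):
    """Map a clip-space point (x, y, z, w) to window space.

    viewport = (x0, y0, width, height); depth_range plays the role
    of glDepthRange. Assumes the default GL [-1, 1] NDC convention.
    """
    x, y, z, w = clip
    # Perspective division: clip space -> NDC.
    nx, ny, nz = x / w, y / w, z / w
    x0, y0, width, height = viewport
    near, far = depth_range
    # Viewport transform: NDC xy in [-1, 1] -> pixel coordinates.
    win_x = x0 + (nx * 0.5 + 0.5) * width
    win_y = y0 + (ny * 0.5 + 0.5) * height
    # Depth-range transform: NDC z in [-1, 1] -> [near, far].
    win_z = near + (nz * 0.5 + 0.5) * (far - near)
    return (win_x, win_y, win_z)
```

Under these conventions, a clip-space point (0, 0, -1, 1) in an 800x600 viewport lands at window coordinates (400, 300) with depth 0.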
Having established the spaces, the next relevant thing here is the viewing volume. This is the volume of space which will actually be mapped to your viewport, and it is a 3D volume bounded by six planes: left, right, bottom, top, near, and far.
However, the actual view volume is set up by pure convention in the various render APIs (and the actual GPU hardware).
Since you tagged this question with "OpenGL", I'm going to begin with the default GL convention first:
The standard GL convention is that the view volume is the completely symmetrical [-1,1] cube in NDC. This means that the clip condition in clip space is -w <= x,y,z <= w.
Direct3D uses a different convention: they use [-1,1] in NDC for x and y just like GL does, but [0,1] for the z dimension. This also means that the depth range transformation into window space can be the identity in many cases (you seldom need to limit it to a sub-range of [0,1]). This convention has some numerical advantages because the GL convention of moving the [-1,1] range to [0,1] for window space will lose precision around the (NDC) zero point.
Modern GL since GL 4.5 optionally allows you to switch to the [0,1] convention for z via glClipControl. Vulkan also supports both conventions, but uses [0,1] as the default.
There is not "the" unproject function, but the concept of "unprojecting" a point means calculating these transformations in reverse, going from window space back to some space before clip space, undoing the projection matrix. For implementing an unproject function, you need to know which conventions were used.
Because in some implementations I see two ways: winZ = 0 (near plane), winZ = 1 (far plane) or winZ = -1 (near plane), winZ = 1 (far plane). What's the difference in these ranges?
Maybe they are not taking in a window space Z, but NDC z directly. Maybe the parameters are just named in a confusing or wrong manner. Maybe some of the implementations out there are just flat-out wrong.
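Putting this together, a sketch of an unproject-based mouse ray under the default GL conventions might look like this. It is pure Python; `inv_view_proj` is assumed to be the precomputed inverse of projection*view as a row-major nested list (compute it with your math library of choice), and all names are my own:

```python
def unproject(win, viewport, inv_view_proj, depth_range=(0.0, 1.0)):
    """Window-space point (winX, winY, winZ) -> world space.

    Assumes the default GL [-1, 1] NDC convention on all axes.
    """
    x0, y0, width, height = viewport
    near, far = depth_range
    # Undo the viewport and depth-range transforms: window -> NDC.
    ndc = (
        (win[0] - x0) / width * 2.0 - 1.0,
        (win[1] - y0) / height * 2.0 - 1.0,
        (win[2] - near) / (far - near) * 2.0 - 1.0,
        1.0,
    )
    # NDC -> world: multiply by the inverse matrix, divide by w.
    x, y, z, w = [sum(row[i] * ndc[i] for i in range(4)) for row in inv_view_proj]
    return (x / w, y / w, z / w)

def mouse_ray(mx, my, viewport, inv_view_proj):
    """Unproject the cursor at winZ = 0 and winZ = 1 (near and far
    plane under the default depth range) and subtract to get a ray."""
    p_near = unproject((mx, my, 0.0), viewport, inv_view_proj)
    p_far = unproject((mx, my, 1.0), viewport, inv_view_proj)
    direction = tuple(f - n for n, f in zip(p_near, p_far))
    return p_near, direction
```

Note how winZ = 0 and winZ = 1 here are window-space depths; the -1 only appears when mapping back to the [-1, 1] NDC z range. An implementation that takes winZ = -1 and winZ = 1 is operating on NDC z directly.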
Related
Variable gl_Position output from a GLSL vertex shader must have 4 coordinates. In OpenGL, it seems the w coordinate is used to scale the vector, by dividing the other coordinates by it. What is the purpose of w in Vulkan?
Shaders and projections in Vulkan behave exactly the same as in OpenGL. There are small differences in the NDC depth range ([-1, 1] in OpenGL, [0, 1] in Vulkan) and in the origin of the coordinate system (lower-left in OpenGL, upper-left in Vulkan), but the principles are exactly the same. The hardware is still the same, and it performs calculations the same way in both OpenGL and Vulkan.
4-component vectors serve multiple purposes:
Different transformations (translation, rotation, scaling) can be represented in the same way, with 4x4 matrices.
Projection can also be represented with a 4x4 matrix.
Multiple transformations can be combined into one 4x4 matrix.
The .w component you mention is used during perspective projection.
All this we can do with 4x4 matrices and thus we need 4-component vectors (so they can be multiplied by 4x4 matrices). Again, I write about this because the above rules apply both to OpenGL and to Vulkan.
So as for the purpose of the .w component of the gl_Position variable - it is exactly the same in Vulkan. It is used to scale the position vector: during perspective calculations (projection matrix multiplication), the original depth is modified by the original .w component and stored in the .z component of the gl_Position variable. Additionally, the original depth is also stored in the .w component. After that, as a fixed-function step, the hardware performs the perspective division and divides the position stored in the gl_Position variable by its .w component.
In orthographic projection the steps performed by the hardware are exactly the same, but the values used for the calculations are different. So the perspective division step is still performed by the hardware, but it does nothing (the position is divided by 1.0).
gl_Position is a homogeneous coordinate. The w component plays a role in perspective projection.
The projection matrix describes the mapping from 3D points of the view on a scene to 2D points on the viewport. It transforms from eye space to clip space, and the coordinates in clip space are transformed to normalized device coordinates (NDC) by dividing by the w component of the clip coordinates (the perspective divide).
At perspective projection, the projection matrix describes the mapping from 3D points in the world, as they are seen from a pinhole camera, to 2D points of the viewport. The eye space coordinates in the camera frustum (a truncated pyramid) are mapped to a cube (the normalized device coordinates).
Perspective Projection Matrix (listed in column-major order, as OpenGL stores matrices):

r = right, l = left, b = bottom, t = top, n = near, f = far

2*n/(r-l)      0              0              0
0              2*n/(t-b)      0              0
(r+l)/(r-l)    (t+b)/(t-b)   -(f+n)/(f-n)   -1
0              0             -2*f*n/(f-n)    0
When a Cartesian coordinate in view space is transformed by the perspective projection matrix, the result is a homogeneous coordinate. The w component grows with the distance from the point of view. After the perspective divide, this causes objects to appear smaller the farther away they are.
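As an illustration, here is the matrix above in code (rewritten row-major for readability; OpenGL itself stores it column-major) together with the perspective divide. The helper names are mine:

```python
def perspective(l, r, b, t, n, f):
    """The frustum matrix from the answer above, row-major."""
    return [
        [2*n/(r-l), 0.0,       (r+l)/(r-l),  0.0],
        [0.0,       2*n/(t-b), (t+b)/(t-b),  0.0],
        [0.0,       0.0,       -(f+n)/(f-n), -2*f*n/(f-n)],
        [0.0,       0.0,       -1.0,         0.0],
    ]

def project(m, v):
    """Matrix * vector followed by the perspective divide; also
    returns the clip-space w for inspection."""
    x, y, z, w = [sum(row[i] * v[i] for i in range(4)) for row in m]
    return (x / w, y / w, z / w), w

m = perspective(-1.0, 1.0, -1.0, 1.0, 1.0, 100.0)
# Two points with the same eye-space x = 1, at different depths:
near_pt, w1 = project(m, (1.0, 0.0, -2.0, 1.0))   # w becomes 2
far_pt, w2 = project(m, (1.0, 0.0, -10.0, 1.0))   # w becomes 10
```

With n = 1, the point at eye-space z = -2 gets w = 2 and NDC x = 0.5, while the one at z = -10 gets w = 10 and NDC x = 0.1: the farther point ends up closer to the center of the screen, i.e. smaller.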
In computer graphics, transformations are represented with matrices. If you want something to rotate, you multiply all its vertices (a vector) by a rotation matrix. Want it to move? Multiply by translation matrix, etc.
tl;dr: You can't describe translation along the z-axis with 3D matrices and vectors. You need at least 1 more dimension, so they just added a dummy dimension w. But things break if it's not 1, so keep it at 1 :P.
Anyway, now we begin with a quick review of matrix multiplication:

[a b c]   [x]   [a*x + b*y + c*z]
[d e f] * [y] = [d*x + e*y + f*z]
[g h i]   [z]   [g*x + h*y + i*z]

You basically put x above a, y above b, z above c. Multiply the whole column by the variable you just moved, and sum up everything in the row.
So if you were to translate a vector, you'd want something like:

[1 0 a]   [x]   [x + a*z]
[0 1 b] * [y] = [y + b*z]
[0 0 1]   [z]   [z]

See how x and y are now translated by az and bz? That's pretty awkward though:
You'd have to account for how big z is whenever you move things (what if z was negative? You'd have to move in opposite directions. That's cumbersome as hell if you just want to move something an inch over...)
You can't move along the z axis. You'll never be able to fly or go underground
But, if you can make sure z = 1 at all times:

[1 0 a]   [x]   [x + a]
[0 1 b] * [y] = [y + b]
[0 0 1]   [1]   [1]

Now it's much clearer that this matrix lets you move in the x-y plane by amounts a and b. The only problem is that you're conceptually levitating all the time, and you still can't go up or down. You can only move in 2D.
But do you see a pattern here? With 3D matrices and 3D vectors, you can describe all the fundamental movements in 2D. So what if we added a 4th dimension?

[1 0 0 a]   [x]   [x + a*w]
[0 1 0 b]   [y]   [y + b*w]
[0 0 1 c] * [z] = [z + c*w]
[0 0 0 1]   [w]   [w]

Looks familiar. If we keep w = 1 at all times: there we go, now you get translation along all 3 axes. This is what's called homogeneous coordinates.
But what if you were doing some big & complicated transformation, resulting in w != 1, and there's no way around it? OpenGL (and basically any other CG system, I think) will do what's called normalization: divide the resulting vector by the w component. I don't know enough to say exactly why ('cause scaling is a linear transformation?), but it has favorable implications (it can be used in perspective transforms). Anyway, after normalization the translated vector would actually look like:

(x/w + a, y/w + b, z/w + c, 1)

And there you go, see how each component is shrunken by w, then translated? That's why w controls scaling.
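A tiny sketch of the idea, with names of my own choosing:

```python
def translate(a, b, c):
    """4x4 homogeneous translation matrix (row-major)."""
    return [
        [1.0, 0.0, 0.0, float(a)],
        [0.0, 1.0, 0.0, float(b)],
        [0.0, 0.0, 1.0, float(c)],
        [0.0, 0.0, 0.0, 1.0],
    ]

def transform(m, v):
    """Matrix * vector, then normalize by dividing through by w."""
    x, y, z, w = [sum(row[i] * v[i] for i in range(4)) for row in m]
    return (x / w, y / w, z / w, 1.0)

# With w = 1 the point simply moves along all three axes.
moved = transform(translate(2, 3, 4), (1.0, 1.0, 1.0, 1.0))
# (2, 2, 2, 2) is the same homogeneous point as (1, 1, 1, 1): each
# component is first shrunk by w = 2, then translated.
scaled = transform(translate(2, 3, 4), (2.0, 2.0, 2.0, 2.0))
```

Both calls produce (3, 4, 5, 1), which is exactly the "shrink by w, then translate" behavior described above.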
I heard clipping should be done in clipping coordinate system.
The book describes a situation where a line runs from behind the camera into the viewing volume. (We call this line PQ, where P is the point behind the camera.)
I cannot understand why this can be a problem.
(The book says that after the normalizing transformation, P will end up in front of the camera.)
I think that before reaching the clipping coordinate system, the camera is at the origin (0, 0, 0, 1) because we did the viewing transformation.
However, in NDCS, I cannot figure out the camera's location.
And I have second question.
In the vertex shader, we do the model-view transformation and then the projection transformation. Finally, we output these vertices to the rasterizer.
(some vertices' w is not equal to 1)
Here I am curious: does the rendering pipeline automatically perform the division by w after clipping is finished?
Sometimes not all of the model can be seen on screen, mostly because some of its objects lie behind the camera (or "point of view"). Those objects are clipped out. If only part of an object can't be seen, then just that part must be clipped, leaving the rest visible.
OpenGL clips
OpenGL does this clipping in Clipping Coordinate Space (CCS). This is a cube of size 2w x 2w x 2w, where w is the fourth coordinate resulting from the (4x4) x (4x1) matrix-point multiplication. A mere comparison of coordinates is enough to tell whether the point is clipped or not. If the point passes the test, its coordinates are divided by w (the so-called "perspective division"). Notice that for orthogonal projections w is always 1, while with perspective it's generally not 1.
CPU clips
If the model is too big, perhaps you want to save GPU resources or improve the frame rate, so you decide to skip the objects that are going to be clipped anyhow. Then you do the math on your own (on the CPU) and only send to the GPU the vertices that passed the test. Be aware that some objects may have some vertices clipped while other vertices of the same object may not.
Perhaps you do send those to the GPU and let it handle these special cases.
You have a volume defined where only objects inside are seen. This volume is defined by six planes. Let's put ourselves in the camera and look at this volume: if your projection is perspective, the six planes build a "frustum", a sort of truncated pyramid. If your projection is orthogonal, the planes form a parallelepiped.
In order to decide whether or not to clip a vertex, you must use the distance from the vertex to each of these six planes. You need a signed distance, meaning that the sign tells you which side of the plane the vertex is on. If any of the six distance signs is not the right one, the vertex is discarded, clipped.
If a plane is defined by the equation A*x + B*y + C*z + D = 0, then the signed distance from point (p1, p2, p3) is (A*p1 + B*p2 + C*p3 + D) / sqrt(A*A + B*B + C*C). You only need the sign, so don't bother calculating the denominator.
Now you have all the tools. If you know your planes in "camera view" space, you can calculate the six distances and clip the vertex or not. But this may be an expensive operation, considering that you must transform the vertex coordinates from model to camera (view) space, a ViewModel matrix calculation. At the same cost you can use your precalculated ProjectionViewModel matrix instead and obtain CCS coordinates, which are much easier to compare against '2w', the size of the CCS cube.
Sometimes you want to skip some vertices not because they are clipped, but because their depth fails some criterion you are using. In this case CCS is not the right space to work in, because the z-coordinate is transformed into the [-w, w] range and depth is somehow "lost". Instead, do your test in view space.
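The per-vertex part of that clip test is just the comparison against w. A minimal sketch, assuming the default GL convention:

```python
def is_clipped(clip):
    """Per-vertex clip test in CCS, default GL convention:
    the point survives iff -w <= x, y, z <= w."""
    x, y, z, w = clip
    return not (-w <= x <= w and -w <= y <= w and -w <= z <= w)
```

Note this per-vertex test alone is only a sketch: real clipping also has to split primitives that straddle a plane, keeping the inside part.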
In OpenGL, I have read that a vertex should be represented by (x,y,z,w), where w = z. This is to enable perspective divide, whereby (x,y,z) are divided by w in order to determine their screen position due to the perspective effect. And if they were just divided by the original z value, then the z would be 1 everywhere.
My question is: Why do you need to divide the z component by w at all? Why can you not just divide the x and y components by z, such that the screen coordinates have the perspective effect applied, and then just used the original unmodified z component for depth testing? In this way, you would not have to use the w component at all....
Obviously I am missing something!
3D computer graphics is typically handled with homogeneous coordinates and in a projective vector space. The math behind this is a bit more than "just divide by w".
Using 4D homogeneous vectors and 4x4 matrices has the nice advantage that all sorts of affine transformations (and this includes especially the translation, which also relies on w) and projective transforms can be represented by simple matrix multiplications.
In OpenGL, I have read that a vertex should be represented by
(x, y, z, w), where w = z.
That is not true. A vertex should be represented by (x, y, z, w), where w is just w. In your typical case, the input w is actually 1, so it is usually not stored in the vertex data, but added on demand in the shaders etc.
Your typical projection matrix will set w_clip = -z_eye. But that is a different thing: it means that you project along the -z direction in eye space. You could also put w_clip = 2*x_eye - 3*y_eye + 4*z_eye there, and your axis of projection would have direction (2, -3, 4, 0).
My question is: Why do you need to divide the z component by w at all? Why can you not just divide the x and y components by z, such that the screen coordinates have the perspective effect applied, and then just used the original unmodified z component for depth testing?
Conceptually, the space is distorted along all 3 dimensions, not just in x and y. Furthermore, in the beginning, GPUs had just 16 bit or 24 bit integer precision for the depth buffer. In such a scenario, you definitely want a denser representation near the camera and a sparse one far away.
Nowadays, with programmable vertex shaders and floating-point depth buffer formats, you can basically just store the z_eye value in the depth buffer, and use this for depth testing. However, this is typically referred to as W buffering, because the (clip space) w component is used.
There is another conceptual issue if you were to divide by z: you wouldn't be able to use an orthogonal projection; you would always force some kind of perspective. Now, one might argue that the division by z doesn't have to happen automatically, but could be applied when needed in the vertex shader. But this won't work either: you must not apply the perspective divide in the vertex shader, because that would project points which lie behind the camera to in front of the camera. Since the vertex shader does not work on whole primitives, this would completely screw up any primitive where at least one vertex lies behind the camera and another lies in front of it. To deal with that situation, the clipping has to be applied before the divide - hence the name clip space.
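A small numeric illustration of that last point, assuming the usual w_clip = -z_eye projection (the clip-space values are made up for the example):

```python
def divide(clip):
    """Naive perspective divide, with no clipping beforehand."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)

# With w_clip = -z_eye: a point in front of the camera (z_eye = -2)...
in_front = divide((1.0, 0.0, 1.0, 2.0))
# ...and one behind it (z_eye = +2), so w_clip is negative. The divide
# flips its signs and it lands inside the view volume, mirrored in x.
behind = divide((1.0, 0.0, -1.0, -2.0))
```

The point behind the camera ends up at a perfectly valid NDC position, just mirrored, which is exactly why clipping must discard the w <= 0 cases before the division.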
In this way, you would not have to use the w component at all.
That is also not true. The w component is further used down the pipeline. It is essential for perspective-correct attribute interpolation.
I am trying to understand OpenGL concepts. While reading this tutorial - http://www.arcsynthesis.org/gltut/Positioning/Tut04%20Perspective%20Projection.html -
I came across this statement:
This is because camera space and NDC space have different viewing directions. In camera space, the camera looks down the -Z axis; more negative Z values are farther away. In NDC space, the camera looks down the +Z axis; more positive Z values are farther away. The diagram flips the axis so that the viewing direction can remain the same between the two images (up is away).
I am confused as to why the viewing direction has to change. Could someone please help me understand this with an example?
This is mostly just a convention. OpenGL clip space (and NDC space and screen space) has always been defined as left-handed (with z pointing away into the screen) by the spec.
OpenGL eye space had been defined with the camera at the origin looking in the -z direction (so right-handed). However, this convention was only meaningful in the fixed-function pipeline, where, together with the fixed-function per-vertex lighting carried out in eye space, the viewing direction did matter in cases like when GL_LOCAL_VIEWER was disabled (as was the default).
The classic GL projection matrix typically converts the handedness, and the perspective division is typically done with a divisor of -z_eye, so the last row of the projection matrix is typically (0, 0, -1, 0). The old glFrustum(), glOrtho(), and gluPerspective() actually supported that convention by using the z_near and z_far clipping distances negated, so that you had to specify positive values for clip planes lying in front of the camera at z < 0.
However, with modern GL, this convention is more or less meaningless. There is no fixed-function unit left which works in eye space, so eye space (and anything before it) is totally under the user's control. You can use anything you like here. Clip space and all the later spaces are still used by fixed-function units (clipping, rasterization, ...), so there must be some convention to define the interface, and it is still a left-handed system.
Even in modern GL, the old right-handed eye space convention is still in use. The popular glm library for example reimplements the old GL matrix functions the same way.
There is really no reason to prefer one of the possible conventions over the other, but at some point, you have to choose and stick to one.
OpenGL SuperBible, 4th Edition, page 164:
To apply a camera transformation, we take the camera’s actor transform and flip it so that
moving the camera backward is equivalent to moving the whole world forward. Similarly,
turning to the left is equivalent to rotating the whole world to the right.
I can't understand why?
Imagine yourself placed within a universe that also contains all other things. In order for your viewpoint to appear to move in a forwardly direction, you have two options...
You move yourself forward.
You move everything else in the universe in the opposite direction to 1.
Because you're defining everything in OpenGL in terms of the viewer (you're ultimately rendering a 2D image of a particular viewpoint of the 3D world), it can often make more sense, both mathematically and programmatically, to take the 2nd approach.
Mathematically there is only one correct answer. It is defined that after transforming to eye space by multiplying a world-space position by the view matrix, the resulting vector is interpreted relative to the origin, which is where the camera is conceptually located.
What the SuperBible states is mathematically just a negation of the translation in some direction, which is what you automatically get when using functions that compute a view matrix, like gluLookAt() or glm::lookAt() (although GLU is a library layered on legacy GL stuff, mathematically the two are identical).
Have a look at the API ref for gluLookAt(). You'll see that the first step is setting up an orthonormal basis of the eye space, which first results in a 4x4 matrix basically encoding only the upper 3x3 rotation matrix. The second step is multiplying the former matrix by a translation matrix. In terms of legacy functions, this can be expressed as
glMultMatrixf(M); // where M encodes the eye-space basis
glTranslated(-eyex, -eyey, -eyez);
You can see, the vector (eyex, eyey, eyez) which specifies where the camera is located in world-space is simply multiplied by -1. Now assume we don't rotate the camera at all, but assume it to be located at world-space position (5, 5, 5). The appropriate view-matrix View would be
[1 0 0 -5
0 1 0 -5
0 0 1 -5
0 0 0 1]
Now take a world-space vertex position P = (0, 0, 0, 1) transformed by that matrix: P' = View * P. P' will then simply be P'=(-5, -5, -5, 1).
When thinking in world-space, the camera is at (5, 5, 5) and the vertex is at (0, 0, 0). When thinking in eye-space, the camera is at (0, 0, 0) and the vertex is at (-5, -5, -5).
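The numeric example above can be reproduced with a small sketch (pure Python, row-major matrices; the helper names are mine):

```python
def look_from(eye):
    """View matrix for an unrotated camera at world-space position
    `eye`: identity rotation plus translation by -eye (row-major)."""
    ex, ey, ez = eye
    return [
        [1.0, 0.0, 0.0, -ex],
        [0.0, 1.0, 0.0, -ey],
        [0.0, 0.0, 1.0, -ez],
        [0.0, 0.0, 0.0, 1.0],
    ]

def mul(m, v):
    """4x4 matrix times 4-component vector."""
    return tuple(sum(row[i] * v[i] for i in range(4)) for row in m)

# The example from the answer: camera at (5, 5, 5), vertex at origin.
view = look_from((5.0, 5.0, 5.0))
p_eye = mul(view, (0.0, 0.0, 0.0, 1.0))  # the vertex in eye space
```

The vertex comes out at (-5, -5, -5, 1) in eye space, matching the hand calculation above.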
So in conclusion: conceptually, it's a matter of how you're looking at things. You can either think of it as transforming the camera relative to the world, or you can think of it as transforming the world relative to the camera.
Mathematically, and in terms of the OpenGL transformation pipeline, there is only one answer, and that is: the camera in eye-space (or view-space or camera-space) is always at the origin and world-space positions transformed to eye-space will always be relative to the coordinate system of the camera.
EDIT: Just to clarify, although the transformation pipeline and the involved vector spaces are well defined, you can still use world-space positions of everything, even the camera, for instance in a fragment shader for lighting computation. The important thing here is to know never to mix entities from different spaces, e.g. don't compute stuff based on a world-space and an eye-space position, and so on.
EDIT2: Nowadays, in a time that we all use shaders *cough and roll-eyes*, you're pretty flexible, and theoretically you can pass any position you like to gl_Position in a vertex shader (or the geometry shader or tessellation stages). However, since the subsequent computations are fixed, i.e. clipping, perspective division and the viewport transformation, the resulting position will simply be clipped if it's not inside [-gl_Position.w, gl_Position.w] in x, y and z.
There is a lot to this to really get it down. I suggest you read the entire article on the rendering pipeline in the official GL wiki.