Difference between Camera/Eye coordinates and world coordinates - opengl

For the sake of simplicity let's take OpenGL rendering system as an example. According to what I have learnt,
The camera in OpenGL is aligned with the world space's origin i.e (0,0,0). I also read that we don't move the camera so it will stay at it's original position of (0,0,0) in world coordinates. The camera faces in the negative z direction.
Thus the question is, if you don't move the camera then it will always stay at world space's origin of (0,0,0). If that's the case then there will be no difference between world and eye coordinates. Because for example, Object A position in world coordinates is (2,2,-5). In eye coordinates it will be the same (2-0, 2-0, -5-0) = (2,2,-5)

The camera is the eye. If you don't move your camera, it means that your View matrix is an identity matrix so any coordinate you multiply with it will remain the same.
Lets take your example, we have a point being rendered (2.0, 2.0, -5.0). This is the world space position of the point. It means the point lies at (2.0, 2.0, -5.0) regardless of where the camera is.
If the camera is at origin like you say, yes the eye(camera) space position is the same as the world space position.
For a simple example, lets say you want to move your camera 2 units in the positive z-axis. Now your camera is at (0, 0, 2) and when you look from your camera(or eye), the point will be 2 units farther away in z axis from the origin, since you moved your camera back, so the eye space position is (2.0, 2.0, -7.0).
This is what happens with more complicated transformation on your camera. You will have a view matrix which will just be the inverse of your camera's transformation, and this is applied to every point in world space to convert it to eye (or camera) space.

Related

how do I maintain relative transformation b/w 2 objects after changing transformation of any one without a scenegraph?

Say I have 2 objects, a camera and a cube, both on XZ plane, the cube has some arbitrary rotation, and camera is facing the cube.
now if a transformation R is applied to the camera such that it has a new rotation and position.
I want to move the cube in front of the camera using transformation R1, such that in the view it looks exactly as before R was applied, meaning relative distance, rotation and scale b/w the 2 objects remain same after both R and R1.
Following image gives a gist of the problem.
Assume that there's no scenegraph that we can use.
I've posed the problem mainly in 2D but I'm trying to solve it in 3D, so rotations can have all yaw, pitch and roll, and translations can be anywhere in 3D space.
EDIT:
I forgot to add what I have done so far.
I figured out how to maintain relative distance b/w camera and cube, I can project cube's position to get world to screen point, then unproject the screen point in new camera position to get new world position.
However for rotation, I have tried this
I thought I can apply same rotation as R in R1, this didn't work, it appears to work if rotation happens only in one axis, if rotation happens in more than one axes, it does not work.
I thought I can take delta rotation b/w camera and cube, and simply apply camera's rotation to the cube and then multiply delta rotation, this also didn't work
Let M and V be the model and view matrices before you move the camera, M2 and V2 be the matrices after you move the camera. To be clear: model matrix transforms the coordinates from object local coordinates into world coordinates; a view matrix transforms from world coordinates into clip-space camera coordinates. Consequently V*M*p transforms the position p into clip-space.
For the position on the screen to stay constant, we need V*M*p = V2*M2*p to be true for all p (that's assuming that the FOV doesn't change). Therefore V*M = V2*M2, or
M2 = inverse(V2)*V*M
If you apply the camera transformation on the right (V2 = V*R) then the above expression for M2 can be simplified:
M2 = inverse(R)*M
(that is you apply inverse(R) on the left of the model matrix to compensate).
Alternatively, ask yourself if you really need to keep the object coordinates in the world reference frame. It may be easier to not to apply the view matrix when rendering that object at all; that would effectively keep it relative to the camera at all times without any additional tweaks. That would have better numerical stability too.

What are we trying to achieve by creating a right and down vectors in this ray tracing depiction?

Often, we see the following picture when talking about ray tracing.
Here, I see the Z axis as the sort of direction if the camera pointed straight ahead, and the XY grid as the grid that the camera is seeing here. From the camera's point of view, we see the usual Cartesian grid me and my classmates are used to.
Recently I was examining code that simulates this. One thing that is not obvious from this picture to me is the requirement for the "right" and "down" vectors. Obviously we have look_at, which shows where the camera is looking. And campos is where the camera is located. But why do we need camright and camdown? What are we trying to construct?
Vect X (1, 0, 0);
Vect Y (0, 1, 0);
Vect Z (0, 0, 1);
Vect campos (3, 1.5, -4);
Vect look_at (0, 0, 0);
Vect diff_btw (
campos.getVectX() - look_at.getVectX(),
campos.getVectY() - look_at.getVectY(),
campos.getVectZ() - look_at.getVectZ()
);
Vect camdir = diff_btw.negative().normalize();
Vect camright = Y.crossProduct(camdir);
Vect camdown = camright.crossProduct(camdir);
Camera scene_cam (campos, camdir, camright, camdown);
I was searching about this question recently and found this post as well: Setting the up vector in the camera setup in a ray tracer
Here, the answerer says this: "My answer assumes that the XY plane is the "ground" in world space and that Z is the altitude. Imagine standing on the floor holding a camera. It's position has a positive Z component, and it's view vector is nearly parallel to the ground. (In my diagram, it's pointing slightly upwards.) The plane of the camera's film (the uv grid) is perpendicular to the view grid. A common approach is to transform everything until the film plane coincides with the XY plane and the camera points along the Z axis. That's equivalent, but not what I'm describing here."
I'm not entirely sure why "transformations" are necessary.. How is this point of view different from the picture at the top? Here they also say that they need an "up" vector and "right" vector to "construct an image plane". I'm not sure what an image plane is..
Could someone explain better the relationship between the physical representation and code representation?
How do you know that you always want the camera's "up" to be aligned with the vertical lines in the grid in your image?
Trying to explain it another way: The grid is not really there. That imaginary grid is the result of the calculations of camera's directional vectors and the resolution you are rendering in. The grid is not what decides the camera angle.
When you are holding a physical camera in your hand, like the camera in the cell phone, don't you ever rotate the camera little bit for effect? Or when filming, you may want to slowly rotate the camera? Have you not seen any movies where the camera is rotated?
In the same way, you may want to rotate the "camera" in your ray traced world. And rotating only the camera is much easier than rotating all your objects in the scene(may be millions!)
Check out the example of rotating the camera from the movie Ice Age here:
https://youtu.be/22qniGtZhZ8?t=61
The (up or down) and right vectors constructs the plane you project the scene onto. Since the scene is in 3D you need to project the scene onto a 2D scene in order to render a picture to display on your screen.
If you have the camera position and direction you still don't know whether you're holding the camera right-side up, upside down, or tilted to the left and right.
Using camera position, lookat, up (down) and right vectors we can uniquely define the 3D scene is projected into a 2D picture.
Concretely, if you look at the code and the picture. The 3D scene are the objects displayed. The image/projection plane is the grid infront of the camera. It's orientation is defined by the the camright and camdir vectors (because we are assuming the cameras line of sight is perpendicular to camdir, camdown is uniquely defined by the other two).
The placement of the grid is based on the camera's position and intrinsic properties (it's not being displayed here, but the camera will have a specific field of view).

OpenGL camera direction

So I was reading this tutorial's "Inverting the Camera Orientation Matrix" section and I don't understand why, when calculating the camera's up direction, I need to multiply the inverse of orientation by the up direction vector, and not just orientation.
I drew the following image to illustrate my insight of the tutorial I read.
What did I get wrong?
Well, that tutorial explicitely states:
The way we calculate the up direction of the camera is by taking the
"directly upwards" unit vector (0,1,0) and "unrotate" it by using the
inverse of the camera's orientation matrix. Or, to explain it
differently, the up direction is always (0,1,0) after the camera
rotation has been applied, so we multiply (0,1,0) by the inverse
rotation, which gives us the up direction before the camera rotation
was applied.
The up direction which is calculated here is the up direction in world space. In eye space, the up vector is (0,1,0) (by convention, one could define it differently). As the view matrix will transform coordinates from world space to eye space, we need to use the inverse to transform that up vector from eye space to the world space. Your image is wrong as it does not correctly relate to eye and world space.

Perspective Projection - OpenGL

I am confused about the position of objects in opengl .The eye position is 0,0,0 , the projection plane is at z = -1 . At this point , will the objects be in between the eye position and and the plane (Z =(0 to -1)) ? or its behind the projection plane ? and also if there is any particular reason for being so?
First of all, there is no eye in modern OpenGL. There is also no camera. There is no projection plane. You define these concepts by yourself; the graphics library does not give them to you. It is your job to transform your object from your coordinate system into clip space in your vertex shader.
I think you are thinking about projection wrong. Projection doesn't move the objects in the same sense that a translation or rotation matrix might. If you take a look at the link above, you can see that in order to render a perspective projection, you calculate the x and y components of the projected coordinate with R = V(ez/pz), where ez is the depth of the projection plane, pz is the depth of the object, V is the coordinate vector, and R is the projection. Almost always you will use ez=1, which makes that equation into R = V/pz, allowing you to place pz in the w coordinate allowing OpenGL to do the "perspective divide" for you. Assuming you have your eye and plane in the correct places, projecting a coordinate is almost as simple as dividing by its z coordinate. Your objects can be anywhere in 3D space (even behind the eye), and you can project them onto your plane so long as you don't divide by zero or invalidate your z coordinate that you use for depth testing.
There is no "projection plane" at z=-1. I don't know where you got this from. The classic GL perspective matrix assumes an eye space where the camera is located at origin and looking into -z direction.
However, there is the near plane at z<0 and eveything in front of the near plane is going to be clipped. You cannot put the near plane at z=0, because then, you would end up with a division by zero when trying to project points on that plane. So there is one reasin that the viewing volume isn't a pyramid with they eye point at the top but a pyramid frustum.
This is btw. also true for real-world eyes or cameras. The projection center lies behind the lense, so no object can get infinitely close to the optical center in either case.
The other reason why you want a big near clipping distance is the precision of the depth buffer. The whole depth range between the front and the near plane has to be mapped to some depth value with a limited amount of bits, typically 24. So you want to keep the far plane as close as possible, and shift away the near plane as far as possible. The non-linear mapping of the screen-space z coordinate makes this even more important, as that the precision is non-uniformely distributed over that range.

What exactly are eye space coordinates?

As I am learning OpenGL I often stumble upon so-called eye space coordinates.
If I am right, you typically have three matrices. Model matrix, view matrix and projection matrix. Though I am not entirely sure how the mathematics behind that works, I do know that the convert coordinates to world space, view space and screen space.
But where is the eye space, and which matrices do I need to convert something to eye space?
Perhaps the following illustration showing the relationship between the various spaces will help:
Depending if you're using the fixed-function pipeline (you are if you call glMatrixMode(), for example), or using shaders, the operations are identical - it's just a matter of whether you code them directly in a shader, or the OpenGL pipeline aids in your work.
While there's distaste in discussing things in terms of the fixed-function pipeline, it makes the conversation simpler, so I'll start there.
In legacy OpenGL (i.e., versions before OpenGL 3.1, or using compatibility profiles), two matrix stacks are defined: model-view, and projection, and when an application starts the matrix at the top of each stack is an identity matrix (1.0 on the diagonal, 0.0 for all other elements). If you draw coordinates in that space, you're effectively rendering in normalized device coordinates(NDCs), which clips out any vertices outside of the range [-1,1] in both X, Y, and Z. The viewport transform (as set by calling glViewport()) is what maps NDCs into window coordinates (well, viewport coordinates, really, but most often the viewport and the window are the same size and location), and the depth value to the depth range (which is [0,1] by default).
Now, in most applications, the first transformation that's specified is the projection transform, which come in two varieties: orthographic and perspective projections. An orthographic projection preserves angles, and is usually used in scientific and engineering applications, since it doesn't distort the relative lengths of line segments. In legacy OpenGL, orthographic projections are specified by either glOrtho or gluOrtho2D. More commonly used are perspective transforms, which mimic how the eye works (i.e., objects far from the eye are smaller than those close), and are specified by either glFrustum or gluPerspective. For perspective projections, they defined a viewing frustum, which is a truncated pyramid anchored at the eye's location, which are specified in eye coordinates. In eye coordinates, the "eye" is located at the origin, and looking down the -Z axis. Your near and far clipping planes are specified as distances along the -Z axis. If you render in eye coordinates, any geometry specified between the near and far clipping planes, and inside of the viewing frustum will not be culled, and will be transformed to appear in the viewport. Here's a diagram of a perspective projection, and its relationship to the image plane .
The eye is located at the apex of the viewing frustum.
The last transformation to discuss is the model-view transform, which is responsible for moving coordinate systems (and not objects; more on that in a moment) such that they are well position relative to the eye and the viewing frustum. Common modeling transforms are translations, scales, rotations, and shears (of which there's no native support in OpenGL).
Generally speaking, 3D models are modeled around a local coordinate system (e.g., specifying a sphere's coordinates with the origin at the center). Modeling transforms are used to move the "current" coordinate system to a new location so that when you render your locally-modeled object, it's positioned in the right place.
There's no mathematical difference between a modeling transform and a viewing transform. It's just usually, modeling transforms are used to specific models and are controlled by glPushMatrix() and glPopMatrix() operations, which a viewing transformation is usually specified first, and affects all of the subsequent modeling operations.
Now, if you're doing this modern OpenGL (core profile versions 3.1 and forward), you have to do all these operations logically yourself (you might only specify one transform folding both the model-view and projection transformations into a single matrix multiply). Matrices are specified usually as shader uniforms. There are no matrix stacks, separation of model-view and projection transformations, and you need to get your math correct to emulate the pipeline. (BTW, the perspective division and viewport transform steps are performed by OpenGL after the completion of your vertex shader - you don't need to do the math [you can, it doesn't hurt anything unless you fail to set w to 1.0 in your gl_Position vertex shader output).
Eye space, view space, and camera space are all synonyms for the same thing: the world relative to the camera.
In a rendering, each mesh of the scene usually is transformed by the model matrix, the view matrix and the projection matrix. Finally the projected scene is mapped to the viewport.
The projection, view and model matrix interact together to present the objects (meshes) of a scene on the viewport.
The model matrix defines the position orientation and scale of a single object (mesh) in the world space of the scene.
The view matrix defines the position and viewing direction of the observer (viewer) within the scene.
The projection matrix defines the area (volume) with respect to the observer (viewer) which is projected onto the viewport.
Coordinate Systems:
Model coordinates (Object coordinates)
The model space is the coordinates system, which is used to define or modulate a mesh. The vertex coordinates are defined in model space.
World coordinates
The world space is the coordinate system of the scene. Different models (objects) can be placed multiple times in the world space to form a scene, in together.
The model matrix defines the location, orientation and the relative size of a model (object, mesh) in the scene. The model matrix transforms the vertex positions of a single mesh to world space for a single specific positioning. There are different model matrices, one for each combination of a model (object) and a location of the object in the world space.
View space (Eye coordinates)
The view space is the local system which is defined by the point of view onto the scene.
The position of the view, the line of sight and the upwards direction of the view, define a coordinate system relative to the world coordinate system. The objects of a scene have to be drawn in relation to the view coordinate system, to be "seen" from the viewing position. The inverse matrix of the view coordinate system is named the view matrix. This matrix transforms from world coordinates to view coordinates.
In general world coordinates and view coordinates are Cartesian coordinates
The view coordinates system describes the direction and position from which the scene is looked at. The view matrix transforms from the world space to the view (eye) space.
If the coordinate system of the view space is a Right-handed system, where the X-axis points to the right and the Y-axis points up, then the Z-axis points out of the view (Note in a right hand system the Z-Axis is the cross product of the X-Axis and the Y-Axis).
Clip space coordinates are Homogeneous coordinates. In clip space the clipping of the scene is performed.
A point is in clip space if the x, y and z components are in the range defined by the inverted w component and the w component of the homogeneous coordinates of the point:
-w <= x, y, z <= w.
The projection matrix describes the mapping from 3D points of a scene, to 2D points of the viewport. The projection matrix transforms from view space to the clip space. The coordinates in the clip space are transformed to the normalized device coordinates (NDC) in the range (-1, -1, -1) to (1, 1, 1) by dividing with the w component of the clip coordinates.
At orthographic projection, this area (volume) is defined by 6 distances (left, right, bottom, top, near and far) to the viewer's position.
If the left, bottom and near distance are negative and the right, top and far distance are positive (as in normalized device space), this can be imagined as box around the viewer.
All the objects (meshes) which are in the space (volume) are "visible" on the viewport. All the objects (meshes) which are out (or partly out) of this space are clipped at the borders of the volume.
This means at orthographic projection, the objects "behind" the viewer are possibly "visible". This may seem unnatural, but this is how orthographic projection works.
At perspective projection the viewing volume is a frustum (a truncated pyramid), where the top of the pyramid is the viewing position.
The direction of view (line of sight) and the near and the far distance define the planes which truncated the pyramid to a frustum (the direction of view is the normal vector of this planes).
The left, right, bottom, top distance define the distance from the intersection of the line of sight and the near plane, with the side faces of the frustum (on the near plane).
This causes that the scene looks like, as it would be seen from of a pinhole camera.
One of the most common mistakes, when an object is not visible on the viewport (screen is all "black"), is that the mesh is not within the view volume which is defined by the projection and view matrix.
Normalized device coordinates
The normalized device space is a cube, with right, bottom, front of (-1, -1, -1) and a left, top, back of (1, 1, 1).
The normalized device coordinates are the clip space coordinates divide by the w component of the clip coordinates. This is called Perspective divide
Window coordinates (Screen coordinates)
The window coordinates are the coordinates of the viewport rectangle. The window coordinates are decisive for the rasterization process.
The normalized device coordinates are linearly mapped to the viewport rectangle (Window Coordinates / Screen Coordinates) and to the depth for the depth buffer.
The viewport rectangle is defined by glViewport. The depth range is set by glDepthRange and is by default [0, 1].