I have just finished reading this, but I have 2 questions about multiplications.
gl_Position = gl_ProjectionMatrix * gl_ModelViewMatrix * gl_Vertex;
This gives the final screen coordinates (x, y) when I multiply them all. What if I only multiply gl_ModelViewMatrix*gl_Vertex or gl_ProjectionMatrix*gl_Vertex? And what does gl_Vertex mean alone?

gl_Vertex is the vertex coordinate in object (model) space, exactly as the application submitted it.
Vertices and the eye may be arbitrarily placed and oriented in world space.
After multiplication with the ModelViewMatrix, you get a vertex coordinate in 'eye space', a coordinate system with the eye at 0,0,0. Multiplying by the projection matrix gives clip coordinates, and the homogeneous division (x, y and z divided by w) then gives normalized device coordinates. Those are not pixel coordinates yet, but a normalized coordinate system with 0,0,0 in the center of the screen/window and each axis running from -1 to 1. The viewport transform is last. It maps the image to window (pixel) coordinates.
An explanation is given in:
chapter 3.

What if i only multiply gl_ModelViewMatrix*gl_Vertex or gl_ProjectionMatrix*gl_Vertex ?
Then the projection matrix (in the 1st case) or the model-view matrix (in the 2nd case) will not take effect. If the omitted matrix is the identity, you will see no difference. If it is not, your view angle or position will likely be wrong.
And what does gl_Vertex mean alone ?
gl_Vertex is the vertex coordinate that is passed to the input of the pipeline.
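To make the effect of leaving a matrix out concrete, here is a minimal sketch in plain Python (not GLSL). The helper names mat_vec, translation and frustum are hypothetical, the matrices are written row by row, and the numbers are made up for illustration. The frustum entries follow the usual glFrustum form.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def frustum(l, r, b, t, n, f):
    """Perspective matrix in the glFrustum form, written row by row."""
    return [[2*n/(r-l), 0,         (r+l)/(r-l),  0],
            [0,         2*n/(t-b), (t+b)/(t-b),  0],
            [0,         0,         -(f+n)/(f-n), -2*f*n/(f-n)],
            [0,         0,         -1,           0]]

model_view = translation(0, 0, -5)        # assumed: object sits 5 units in front of the eye
projection = frustum(-1, 1, -1, 1, 1, 10)

vertex = [1.0, 1.0, 0.0, 1.0]             # what gl_Vertex would hold
eye = mat_vec(model_view, vertex)         # gl_ModelViewMatrix * gl_Vertex
clip = mat_vec(projection, eye)           # the value written to gl_Position
ndc = [c / clip[3] for c in clip[:3]]     # perspective divide, done by the pipeline

proj_only = mat_vec(projection, vertex)   # model-view matrix skipped

print("eye space:", eye)
print("clip space:", clip)
print("NDC:", ndc)
print("projection only:", proj_only)
```

With the model-view matrix skipped, the vertex is treated as if it were already in eye space. In this example it then lies in the plane of the eye (w comes out as 0), so it cannot be projected at all.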


My understanding on the projection matrix, perspective division, NDC and viewport transform

I was quite confused on how the projection matrix worked so I researched and I discovered a few other things but after researching a few days, I just wanted to confirm my understanding is correct. I might use a few wrong terms but my brain was exhausted after writing this. A few topics I just researched briefly like screen coordinates and window transform so I didn’t write much about it and my knowledge might be incorrect. Is everything I’ve written here correct or mostly correct? Correct me on anything if I’m wrong.
What does the projection matrix do?
So the perspective projection matrix defines a frustum, which is a truncated pyramid. Anything outside of that frustum will be clipped. I'll get to that later. The perspective projection matrix also adds perspective. To make the vertices follow the rules of perspective, the matrix sets the vertex's w component (the homogeneous component) according to how far the vertex is from the viewer: the farther away the vertex, the larger w becomes.
Why and how does the w component give the world perspective?
The w component creates the perspective effect because in the perspective division (which happens in the vertex post-processing stage) x, y and z are divided by w, so the vertex coordinates shrink in proportion to w. So essentially, the farther away an object is, the smaller the division makes it.
Vertex position (1, 1, 2, 2).
Here, the vertex is 2 away from the viewer. In perspective division the x, y, and z will be divided by 2 because 2 is the w component.
(1/2, 1/2, 2/2) = (0.5, 0.5, 1).
As shown here, the vertex coordinate has been scaled by half.
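The same arithmetic as a tiny Python sketch. The function name perspective_divide is hypothetical, and the second vertex is an extra made-up value to show a larger w shrinking x and y further.

```python
def perspective_divide(clip):
    """Divide x, y and z of a clip-space vertex by its w component."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)

print(perspective_divide((1, 1, 2, 2)))   # (0.5, 0.5, 1.0)
print(perspective_divide((1, 1, 4, 4)))   # (0.25, 0.25, 1.0)
```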
How does the projection matrix decide what will be clipped?
The near and far planes are the limits of what the viewer can see (anything beyond the far plane or in front of the near plane will be clipped). Every vertex also has to go through a clipping check. The check tests whether each of the x, y and z components of the vertex lies within the range -w to w, using that vertex's own w. If a component is outside of that range, the vertex will be clipped.
Let's say I have a vertex with a position of (2, 130, 90, 90).
x value is 2
y value is 130
z value is 90
w value is 90
This vertex must be within the range of -90 to 90. The x and z values are within the range, but the y value is outside it, so the vertex will be clipped.
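The check can be sketched in a few lines of Python. The name inside_clip_volume is hypothetical, and the second vertex is a made-up value that passes the test. Note that the real pipeline clips whole primitives: a triangle with one vertex outside is cut at the boundary rather than thrown away.

```python
def inside_clip_volume(clip):
    """True if -w <= x, y, z <= w holds for a clip-space vertex."""
    x, y, z, w = clip
    return all(-w <= c <= w for c in (x, y, z))

print(inside_clip_volume((2, 130, 90, 90)))   # False, y = 130 is larger than w = 90
print(inside_clip_volume((2, 30, 90, 90)))    # True
```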
So after the vertex shader is finished, the next step is vertex post-processing. This is where clipping happens, followed by the perspective division, which converts clip space into NDC (normalized device coordinates). After that, the viewport transform converts NDC to window space.
What does perspective division do?
Perspective division divides the x, y, and z components of a vertex by its w component. Doing this does two things: it converts clip space to normalized device coordinates, and it adds perspective by scaling the vertices.
What is Normalized Device Coordinates?
Normalized device coordinates are the coordinate system in which everything that survived clipping lies inside the NDC box, where each axis is in the range -1 to +1.
After the conversion to NDC, the viewport transform happens, where all the NDC coordinates are converted to screen coordinates. NDC space becomes window space.
If an NDC coordinate is (0.5, 0.5, 0.3), it will be mapped onto the window based on what the programmer provided in the function glViewport. If the viewport is 400x300 with its origin at (0, 0), the NDC origin (0, 0) lands at pixel 200 on the x axis and 150 on the y axis, and the coordinate (0.5, 0.5) lands at pixel 300 on the x axis and 225 on the y axis.
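A minimal sketch of the x/y part of the viewport mapping, assuming a call like glViewport(0, 0, 400, 300). The function name ndc_to_window is hypothetical. With this mapping the NDC origin lands in the middle of the viewport at (200, 150), and the NDC point (0.5, 0.5) lands at (300, 225).

```python
def ndc_to_window(ndc_x, ndc_y, vp_x, vp_y, vp_width, vp_height):
    """Map NDC x/y in [-1, 1] to window coordinates for the given viewport."""
    win_x = vp_x + (ndc_x + 1.0) * 0.5 * vp_width
    win_y = vp_y + (ndc_y + 1.0) * 0.5 * vp_height
    return (win_x, win_y)

print(ndc_to_window(0.0, 0.0, 0, 0, 400, 300))   # (200.0, 150.0)
print(ndc_to_window(0.5, 0.5, 0, 0, 400, 300))   # (300.0, 225.0)
```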
The perspective projection matrix does not decide what is clipped. After transforming a view space coordinate with the projection matrix, you get a clip space coordinate. This is a homogeneous coordinate. Based on this coordinate the rendering pipeline clips the scene. The clipping rule is -w <= x, y, z <= w. In the next stage of the rendering pipeline, the clip space coordinates are transformed into normalized device space by the perspective divide (x, y, z)' = (x/w, y/w, z/w). This division by the w component gives the perspective effect. (See also What exactly are eye space coordinates? and Transform the modelMatrix)

What is the role of gl_Position.w in Vulkan?

The gl_Position variable output from a GLSL vertex shader must have 4 coordinates. In OpenGL, it seems the w coordinate is used to scale the vector, by dividing the other coordinates by it. What is the purpose of w in Vulkan?
Shaders and projections in Vulkan behave exactly the same as in OpenGL. There are small differences in depth ranges (NDC z in [-1, 1] in OpenGL, [0, 1] in Vulkan) or in the origin of the coordinate system (lower-left in OpenGL, upper-left in Vulkan), but the principles are exactly the same. The hardware is still the same and it performs calculations in the same way both in OpenGL and in Vulkan.
4-component vectors serve multiple purposes:
Different transformations (translation, rotation, scaling) can be represented in the same way, with 4x4 matrices.
Projection can also be represented with a 4x4 matrix.
Multiple transformations can be combined into one 4x4 matrix.
The .w component you mention is used during perspective projection.
All this we can do with 4x4 matrices and thus we need 4-component vectors (so they can be multiplied by 4x4 matrices). Again, I write about this because the above rules apply both to OpenGL and to Vulkan.
So the purpose of the .w component of the gl_Position variable is exactly the same in Vulkan. It is used to scale the position vector: during the perspective calculation (the projection matrix multiplication) the original depth is scaled and offset (the offset is weighted by the original .w component) and stored in the .z component of gl_Position. Additionally, the original depth is stored in the .w component. After that, as a fixed-function step, the hardware performs the perspective division and divides the position stored in gl_Position by its .w component.
In orthographic projection the steps performed by the hardware are exactly the same, but the values used for the calculations are different. So the perspective division step is still performed by the hardware but it has no effect (the position is divided by 1.0).
gl_Position is a homogeneous coordinate. The w component plays a role in perspective projection.
The projection matrix describes the mapping from 3D points of the view on a scene to 2D points on the viewport. It transforms from eye space to clip space, and the coordinates in clip space are transformed to normalized device coordinates (NDC) by dividing by the w component of the clip coordinates (perspective divide).
In perspective projection the projection matrix describes the mapping from 3D points in the world, as they are seen from a pinhole camera, to 2D points of the viewport. The eye space coordinates in the camera frustum (a truncated pyramid) are mapped to a cube (the normalized device coordinates).
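Perspective projection matrix (each line below is one column of the matrix, i.e. the column-major order in which OpenGL stores it):
r = right, l = left, b = bottom, t = top, n = near, f = far
2*n/(r-l)      0              0               0
0              2*n/(t-b)      0               0
(r+l)/(r-l)    (t+b)/(t-b)    -(f+n)/(f-n)    -1
0              0              -2*f*n/(f-n)    0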
When a Cartesian coordinate in view space is transformed by the perspective projection matrix, the result is a homogeneous coordinate. The w component grows with the distance from the point of view. As a result, objects that are farther away become smaller after the perspective divide.
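A small Python sketch of this behaviour, assuming the four lines of the matrix listing are the matrix columns (column-major order). The helper names frustum_columns and transform and the sample points are hypothetical. It shows that the clip-space w equals the distance in front of the eye (w = -z in view space), so the point that is ten times farther away ends up ten times closer to the center after the divide.

```python
def frustum_columns(l, r, b, t, n, f):
    """Perspective matrix stored as a list of four columns."""
    return [
        [2*n/(r-l), 0, 0, 0],
        [0, 2*n/(t-b), 0, 0],
        [(r+l)/(r-l), (t+b)/(t-b), -(f+n)/(f-n), -1],
        [0, 0, -2*f*n/(f-n), 0],
    ]

def transform(columns, v):
    """Matrix * vector for a column-major matrix: row r is the r-th entry of each column."""
    return [sum(columns[c][r] * v[c] for c in range(4)) for r in range(4)]

cols = frustum_columns(-1, 1, -1, 1, 1, 100)

near_point = transform(cols, [1, 1, -2, 1])    # 2 units in front of the eye
far_point = transform(cols, [1, 1, -20, 1])    # 20 units in front of the eye

print("w:", near_point[3], far_point[3])
print("x after divide:", near_point[0] / near_point[3], far_point[0] / far_point[3])
```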
In computer graphics, transformations are represented with matrices. If you want something to rotate, you multiply all its vertices (a vector) by a rotation matrix. Want it to move? Multiply by translation matrix, etc.
tl;dr: You can't describe translation along the z-axis with 3D matrices and vectors. You need at least 1 more dimension, so they just added a dummy dimension w. But things break if it's not 1, so keep it at 1 :P.
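Anyway, now we begin with a quick review of matrix multiplication:
| a b c |   | x |   | a*x + b*y + c*z |
| d e f | * | y | = | d*x + e*y + f*z |
| g h i |   | z |   | g*x + h*y + i*z |
You basically put x above a, y above b, z above c. Multiply the whole column by the variable you just moved, and sum up everything in the row.
So if you were to translate a vector, you'd want something like:
| 1 0 a |   | x |   | x + a*z |
| 0 1 b | * | y | = | y + b*z |
| 0 0 1 |   | z |   | z       |
See how x and y are now translated by az and bz? That's pretty awkward though: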
You'd have to account for how big z is whenever you move things (what if z was negative? You'd have to move in opposite directions. That's cumbersome as hell if you just want to move something an inch over...)
You can't move along the z axis. You'll never be able to fly or go underground
But, if you can make sure z = 1 at all times:
| 1 0 a |   | x |   | x + a |
| 0 1 b | * | y | = | y + b |
| 0 0 1 |   | 1 |   | 1     |
Now it's much clearer that this matrix allows you to move in the x-y plane by a, and b amounts. Only problem is that you're conceptually levitating all the time, and you still can't go up or down. You can only move in 2D.
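But you see a pattern here? With 3D matrices and 3D vectors, you can describe all the fundamental movements in 2D. So what if we added a 4th dimension?
| 1 0 0 a |   | x |   | x + a*w |
| 0 1 0 b | * | y | = | y + b*w |
| 0 0 1 c |   | z |   | z + c*w |
| 0 0 0 1 |   | w |   | w       |
Looks familiar. If we keep w = 1 at all times:
| 1 0 0 a |   | x |   | x + a |
| 0 1 0 b | * | y | = | y + b |
| 0 0 1 c |   | z |   | z + c |
| 0 0 0 1 |   | 1 |   | 1     |
There we go, now you get translation along all 3 axes. This is what's called homogeneous coordinates.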
But what if you were doing some big & complicated transformation, resulting in w != 1, and there's no way around it? OpenGL (and basically any other CG system I think) will do what's called normalization: divide the resultant vector by the w component. I don't know enough to say exactly why ('cause scaling is a linear transformation?), but it has favorable implications (can be used in perspective transforms). Anyway, the translated vector would actually look like this after normalization:
(x + a*w, y + b*w, z + c*w, w) / w = (x/w + a, y/w + b, z/w + c, 1)
And there you go, see how each component is shrunk by w, then it's translated? That's why w controls scaling.
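Here is a short Python sketch of that last point, with hypothetical helper names (mat_vec, translation, normalize) and made-up numbers. The vectors (1, 2, 3, 1) and (2, 4, 6, 2) describe the same 3D point, and after translating and dividing by w they both end up at the same place.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(a, b, c):
    return [[1, 0, 0, a],
            [0, 1, 0, b],
            [0, 0, 1, c],
            [0, 0, 0, 1]]

def normalize(v):
    """Divide every component by w so that w becomes 1."""
    return [component / v[3] for component in v]

move = translation(3, 4, 5)

with_w1 = mat_vec(move, [1, 2, 3, 1])     # w = 1: plain translation
with_w2 = mat_vec(move, [2, 4, 6, 2])     # same point, stored with w = 2

print(with_w1)               # [4, 6, 8, 1]
print(with_w2)               # [8, 12, 16, 2]
print(normalize(with_w2))    # [4.0, 6.0, 8.0, 1.0]
```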

What is the difference between ProjectionTransformMatrix of VTK and GL_PROJECTION of OpenGL?

I am having profound issues understanding the transformations involved in VTK. OpenGL has fairly good documentation and I was under the impression that VTK is very similar to OpenGL (it is, in many ways). But when it comes to transformations, it seems to be an entirely different story.
This is a good OpenGL documentation about transforms involved:
The perspective projection matrix in OpenGL is:
I wanted to see if this formula applied in VTK will give me the projection matrix of VTK (by cross-checking with VTK projection matrix).
Relevant Camera and Renderer Parameters:
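double crSet[2] = {10, 1000};   // clipping range {near, far} used in the question
double windowSize[2];           // renderer window {width, height}; values not shown here
proj = renderer->GetActiveCamera()->GetProjectionTransformMatrix(windowSize[0]/windowSize[1], crSet[0], crSet[1]);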
The projection transform matrix I got for this configuration is:
The (3,3) and (3,4) values of the projection matrix (let's say it is indexed 1 to 4 for rows and columns) should be -(f+n)/(f-n) and -2*f*n/(f-n) respectively. In my VTK camera settings, nearz is 10 and farz is 1000, and hence I should get -1.020 and -20.20 respectively in the (3,3) and (3,4) locations of the matrix. But they are -1010 and -10000.
I have changed my clipping range values to see the changes and the (3,3) position is always nearz+farz which makes no sense to me. Also, it would be great if someone can explain why it is 3.7320 in the (1,1) and (2,2) positions. And this value DOES NOT change when I change the window size of the renderer window. Quite perplexing to me.
I see in VTKCamera class reference that GetProjectionTransformMatrix() returns the transformation matrix that maps from camera coordinates to viewport coordinates.
VTK Camera Class Reference
This is a nice depiction of the transforms involved in OpenGL rendering:
The OpenGL projection matrix is the matrix that maps from eye coordinates to clip coordinates. It is beyond doubt that eye coordinates in OpenGL are the same as camera coordinates in VTK. But is the clip coordinates in OpenGL same as viewport coordinates of VTK?
My aim is to simulate a real webcam camera (already calibrated) in VTK to render a 3D model.
Well, the documentation you linked to actually explains this (emphasis mine):
Return the projection transform matrix, which converts from camera
coordinates to viewport coordinates. This method computes the aspect,
nearz and farz, then calls the more specific signature of
Return the concatenation of the ViewTransform and the
ProjectionTransform. This transform will convert world coordinates to
viewport coordinates. The 'aspect' is the width/height for the
viewport, and the nearz and farz are the Z-buffer values that map to
the near and far clipping planes. The viewport coordinates of a point located inside the frustum are in the range
([-1,+1],[-1,+1], [nearz,farz]).
Note that this matches neither OpenGL's window space nor its normalized device space. I find the term "viewport coordinates" a poor choice for this, but be that as it may. What bugs me more is that the matrix does not actually transform to that "viewport space", but to some clip-space equivalent. Only after the perspective divide will the coordinates be in the range given in the above definition of the "viewport space".
But is the clip coordinates in OpenGL same as viewport coordinates of VTK?
So the answer is a clear no. But it is close. Basically, that projection matrix is just scaled and shifted along the z dimension, and it is easy to convert between the two. You can simply take znear and zfar out of VTK's matrix and put them into the OpenGL projection matrix formula you linked above, replacing just those two matrix elements.
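A rough sketch of that replacement in plain Python (no VTK calls). The function name replace_depth_terms is hypothetical, and the input matrix is only illustrative: the entries 3.7320, -1010 and -10000 are the values quoted in the question, and every other entry is an assumption.

```python
def replace_depth_terms(matrix, n, f):
    """Copy a 4x4 matrix (list of rows) and overwrite the two z-row entries
    with the OpenGL perspective terms -(f+n)/(f-n) and -2*f*n/(f-n)."""
    result = [row[:] for row in matrix]
    result[2][2] = -(f + n) / (f - n)
    result[2][3] = -2.0 * f * n / (f - n)
    return result

# Illustrative only: see the note above about which entries are assumed.
vtk_like = [[3.7320, 0.0, 0.0, 0.0],
            [0.0, 3.7320, 0.0, 0.0],
            [0.0, 0.0, -1010.0, -10000.0],
            [0.0, 0.0, -1.0, 0.0]]

gl_like = replace_depth_terms(vtk_like, 10.0, 1000.0)
print(round(gl_like[2][2], 4), round(gl_like[2][3], 4))   # -1.0202 -20.202
```

These are the two values the question expected for a near plane of 10 and a far plane of 1000.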

OPENGL clip coordinate

I have a question about OpenGL clip coordinates. For example, take a triangle whose three vertices have been transformed to camera coordinates and then multiplied by the perspective projection matrix to get clip coordinates. Clipping then tests
-w <= x <= w, -w <= y <= w, -w <= z <= w.
Do x, y, z, w refer to each vertex's own clip coordinates? So w may not be the same for these three vertices?
Yes, that w will vary per vertex. Most people imagine the clip space as the cube [-1,1]^3. However, that is not the clip space, but the normalized device space (NDC). You get from clip space to NDC by doing the perspective divide, i.e. dividing each vertex by its w component. So, in NDC, that clip condition would transform to -1 <= x/w <= 1. However, the clipping cannot be done in NDC (without extra information).
The problem here is that points which lie behind the camera would appear in front of the camera in NDC space. Think about it: x/w is the same as -x/-w. With a typical GL projection matrix, w_clip == -z_eye of the vertex, so w is positive in front of the camera and negative behind it. Also, a point that lies in the camera plane (the plane parallel to the projection plane, but going through the camera itself) will have w=0 and you can't do any clipping after that divide. The solution is to always do the clipping before the divide, hence the clip space is called "clip space"...
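A tiny Python sketch of why the test has to happen before the divide. The function names and the two sample vertices are hypothetical. The second vertex is the first one mirrored through the camera, so it has a negative w.

```python
def perspective_divide(clip):
    x, y, z, w = clip
    return (x / w, y / w, z / w)

def inside_clip_volume(clip):
    """Clip-space test -w <= x, y, z <= w, done before any division."""
    x, y, z, w = clip
    return all(-w <= c <= w for c in (x, y, z))

front = (1.0, 0.5, 1.0, 2.0)        # in front of the camera, w > 0
behind = (-1.0, -0.5, -1.0, -2.0)   # behind the camera, w < 0

# After the divide both points look identical ...
print(perspective_divide(front), perspective_divide(behind))
# ... but the clip-space test still tells them apart.
print(inside_clip_volume(front), inside_clip_volume(behind))
```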

What exactly are eye space coordinates?

As I am learning OpenGL I often stumble upon so-called eye space coordinates.
If I am right, you typically have three matrices: model matrix, view matrix and projection matrix. Though I am not entirely sure how the mathematics behind that works, I do know that they convert coordinates to world space, view space and screen space.
But where is the eye space, and which matrices do I need to convert something to eye space?
Perhaps the following illustration showing the relationship between the various spaces will help:
Whether you're using the fixed-function pipeline (you are if you call glMatrixMode(), for example) or shaders, the operations are identical. It's just a matter of whether you code them directly in a shader, or the OpenGL pipeline aids in your work.
While there's distaste in discussing things in terms of the fixed-function pipeline, it makes the conversation simpler, so I'll start there.
In legacy OpenGL (i.e., versions before OpenGL 3.1, or using compatibility profiles), two matrix stacks are defined: model-view and projection. When an application starts, the matrix at the top of each stack is an identity matrix (1.0 on the diagonal, 0.0 for all other elements). If you draw coordinates in that space, you're effectively rendering in normalized device coordinates (NDCs), which clips out any vertices outside of the range [-1,1] in X, Y, and Z. The viewport transform (as set by calling glViewport()) is what maps NDCs into window coordinates (well, viewport coordinates, really, but most often the viewport and the window are the same size and location), and the depth value to the depth range (which is [0,1] by default).
Now, in most applications, the first transformation that's specified is the projection transform, which comes in two varieties: orthographic and perspective projections. An orthographic projection preserves parallel lines, and is usually used in scientific and engineering applications, since it doesn't distort the relative lengths of parallel line segments. In legacy OpenGL, orthographic projections are specified by either glOrtho or gluOrtho2D. More commonly used are perspective transforms, which mimic how the eye works (i.e., objects far from the eye are smaller than those close), and are specified by either glFrustum or gluPerspective. Perspective projections define a viewing frustum, which is a truncated pyramid anchored at the eye's location and specified in eye coordinates. In eye coordinates, the "eye" is located at the origin, looking down the -Z axis. Your near and far clipping planes are specified as distances along the -Z axis. If you render in eye coordinates, any geometry specified between the near and far clipping planes, and inside of the viewing frustum, will not be culled, and will be transformed to appear in the viewport. Here's a diagram of a perspective projection, and its relationship to the image plane.
The eye is located at the apex of the viewing frustum.
The last transformation to discuss is the model-view transform, which is responsible for moving coordinate systems (and not objects; more on that in a moment) such that they are well positioned relative to the eye and the viewing frustum. Common modeling transforms are translations, scales, rotations, and shears (for which there's no native support in OpenGL).
Generally speaking, 3D models are modeled around a local coordinate system (e.g., specifying a sphere's coordinates with the origin at the center). Modeling transforms are used to move the "current" coordinate system to a new location so that when you render your locally-modeled object, it's positioned in the right place.
There's no mathematical difference between a modeling transform and a viewing transform. It's just that modeling transforms are usually applied to specific models and are controlled by glPushMatrix() and glPopMatrix() operations, while a viewing transformation is usually specified first, and affects all of the subsequent modeling operations.
Now, if you're doing this in modern OpenGL (core profile versions 3.1 and forward), you have to do all these operations logically yourself (you might only specify one transform folding both the model-view and projection transformations into a single matrix multiply). Matrices are usually specified as shader uniforms. There are no matrix stacks and no separation of model-view and projection transformations, and you need to get your math correct to emulate the pipeline. (BTW, the perspective division and viewport transform steps are performed by OpenGL after the completion of your vertex shader, so you don't need to do the math yourself. You can, and it doesn't hurt anything unless you fail to set w to 1.0 in your gl_Position vertex shader output.)
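As a sketch of the "single matrix multiply" idea, the following Python snippet folds a model-view and a projection matrix into one matrix and checks that it gives the same clip coordinates as applying them one after the other. The helper names and the numbers are made up for illustration. The projection matrix is a glFrustum-style matrix for l=-1, r=1, b=-1, t=1, n=1, f=3, written row by row. In a real program the product would be computed on the CPU and uploaded as one uniform.

```python
def mat_mul(a, b):
    """Product of two 4x4 matrices given as lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def mat_vec(m, v):
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

model_view = [[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, -5],     # move everything 5 units down the -Z axis
              [0, 0, 0, 1]]
projection = [[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, -2, -3],
              [0, 0, -1, 0]]

mvp = mat_mul(projection, model_view)     # note the order: projection on the left

vertex = [1, 2, 0, 1]
two_steps = mat_vec(projection, mat_vec(model_view, vertex))
one_step = mat_vec(mvp, vertex)
print(two_steps, one_step)                # [1, 2, 7, 5] [1, 2, 7, 5]
```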
Eye space, view space, and camera space are all synonyms for the same thing: the world relative to the camera.
In a rendering, each mesh of the scene usually is transformed by the model matrix, the view matrix and the projection matrix. Finally the projected scene is mapped to the viewport.
The projection, view and model matrix interact together to present the objects (meshes) of a scene on the viewport.
The model matrix defines the position, orientation and scale of a single object (mesh) in the world space of the scene.
The view matrix defines the position and viewing direction of the observer (viewer) within the scene.
The projection matrix defines the area (volume) with respect to the observer (viewer) which is projected onto the viewport.
Coordinate Systems:
Model coordinates (Object coordinates)
The model space is the coordinate system which is used to define or model a mesh. The vertex coordinates are defined in model space.
World coordinates
The world space is the coordinate system of the scene. Different models (objects) can be placed multiple times in the world space, and together they form the scene.
The model matrix defines the location, orientation and the relative size of a model (object, mesh) in the scene. The model matrix transforms the vertex positions of a single mesh to world space for a single specific positioning. There are different model matrices, one for each combination of a model (object) and a location of the object in the world space.
View space (Eye coordinates)
The view space is the local system which is defined by the point of view onto the scene.
The position of the view, the line of sight and the upwards direction of the view define a coordinate system relative to the world coordinate system. The objects of a scene have to be drawn in relation to the view coordinate system, to be "seen" from the viewing position. The inverse matrix of the view coordinate system is named the view matrix. This matrix transforms from world coordinates to view coordinates.
In general, world coordinates and view coordinates are Cartesian coordinates.
The view coordinate system describes the direction and position from which the scene is looked at. The view matrix transforms from the world space to the view (eye) space.
If the coordinate system of the view space is a right-handed system, where the X-axis points to the right and the Y-axis points up, then the Z-axis points out of the view (note that in a right-handed system the Z-axis is the cross product of the X-axis and the Y-axis).
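A minimal Python sketch of how such a view matrix can be built from a position, a target and an up direction, in the style of gluLookAt. All helper names and the camera placement are hypothetical. The check at the end shows the two properties described above: the viewing position maps to the origin of view space, and the target ends up on the negative Z axis.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]]

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [c / length for c in v]

def look_at(eye, target, up):
    forward = normalize(sub(target, eye))     # line of sight
    right = normalize(cross(forward, up))     # X axis of the view system
    true_up = cross(right, forward)           # Y axis of the view system
    # Rows are the view axes; the last column moves the eye to the origin.
    return [right + [-dot(right, eye)],
            true_up + [-dot(true_up, eye)],
            [-c for c in forward] + [dot(forward, eye)],
            [0.0, 0.0, 0.0, 1.0]]

def mat_vec(m, v):
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

view = look_at([3.0, 4.0, 5.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0])
eye_in_view = mat_vec(view, [3.0, 4.0, 5.0, 1.0])
target_in_view = mat_vec(view, [0.0, 0.0, 0.0, 1.0])
print(eye_in_view)
print(target_in_view)
```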
Clip space coordinates are Homogeneous coordinates. In clip space the clipping of the scene is performed.
A point is inside the clipping volume if its x, y and z components are in the range defined by the negated w component and the w component of the homogeneous coordinates of the point:
-w <= x, y, z <= w.
The projection matrix describes the mapping from 3D points of a scene, to 2D points of the viewport. The projection matrix transforms from view space to the clip space. The coordinates in the clip space are transformed to the normalized device coordinates (NDC) in the range (-1, -1, -1) to (1, 1, 1) by dividing with the w component of the clip coordinates.
In orthographic projection, the view volume is defined by 6 distances (left, right, bottom, top, near and far) relative to the viewer's position.
If the left, bottom and near distances are negative and the right, top and far distances are positive (as in normalized device space), this can be imagined as a box around the viewer.
All the objects (meshes) which are inside this volume are "visible" on the viewport. All the objects (meshes) which are outside (or partly outside) of it are clipped at the borders of the volume.
This means that in orthographic projection, the objects "behind" the viewer are possibly "visible". This may seem unnatural, but this is how orthographic projection works.
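The point about objects behind the viewer can be checked with a small Python sketch. It assumes the usual glOrtho-style matrix (written row by row) and a box from -1 to 1 on every axis, so the near distance is negative. The helper names and the sample point are hypothetical.

```python
def ortho(l, r, b, t, n, f):
    """Orthographic matrix in the glOrtho form, written row by row."""
    return [[2/(r-l), 0,       0,        -(r+l)/(r-l)],
            [0,       2/(t-b), 0,        -(t+b)/(t-b)],
            [0,       0,       -2/(f-n), -(f+n)/(f-n)],
            [0,       0,       0,        1]]

def mat_vec(m, v):
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def inside_clip_volume(clip):
    x, y, z, w = clip
    return all(-w <= c <= w for c in (x, y, z))

box = ortho(-1, 1, -1, 1, -1, 1)

# The viewer looks down the -Z axis, so a positive z is behind the viewer.
behind_viewer = [0.2, 0.3, 0.5, 1.0]
clip = mat_vec(box, behind_viewer)

print(clip)                        # w stays 1, so the divide changes nothing
print(inside_clip_volume(clip))    # True: the point is still inside the volume
```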
In perspective projection the viewing volume is a frustum (a truncated pyramid), where the top of the pyramid is the viewing position.
The direction of view (line of sight) and the near and far distances define the planes which truncate the pyramid to a frustum (the direction of view is the normal vector of these planes).
The left, right, bottom and top distances define the distances from the intersection of the line of sight and the near plane to the side faces of the frustum (on the near plane).
As a result, the scene looks as if it were seen through a pinhole camera.
One of the most common mistakes, when an object is not visible on the viewport (screen is all "black"), is that the mesh is not within the view volume which is defined by the projection and view matrix.
Normalized device coordinates
The normalized device space is a cube with a left, bottom, near corner of (-1, -1, -1) and a right, top, far corner of (1, 1, 1).
The normalized device coordinates are the clip space coordinates divided by the w component of the clip coordinates. This is called the perspective divide.
Window coordinates (Screen coordinates)
The window coordinates are the coordinates of the viewport rectangle. They are the coordinates the rasterization process works with.
The normalized device coordinates are linearly mapped to the viewport rectangle (Window Coordinates / Screen Coordinates) and to the depth for the depth buffer.
The viewport rectangle is defined by glViewport. The depth range is set by glDepthRange and is by default [0, 1].