What does the resultant matrix of Homography denote? - computer-vision

I have 2 frames of a shaky video. I applied a homography to all the inlier points. The resulting matrices that I get for different frames look like this:
0.2711 -0.0036 0.853
-0.0002 0.2719 -0.2247
0.0000 -0.0000 0.2704
0.4787 -0.0061 0.5514
0.0007 0.4798 -0.0799
0.0000 -0.0000 0.4797
What are those similar values on the diagonal, and how can I retrieve the translation component from this matrix?

Start with the following observation: a homography matrix is only defined up to scale. This means that if you divide or multiply all the matrix coefficients by the same nonzero number, you obtain a matrix that represents the same geometrical transformation. This is because, in order to apply the homography to a point at coordinates (x, y), you multiply its matrix H on the right by the column vector p = [x, y, 1]' (here I use the apostrophe symbol to denote transposition), and then divide the result H * p = [u, v, w]' by the third component w. Therefore, if instead of H you use a scaled matrix (s * H), you end up with [s*u, s*v, s*w]', which represents the same 2D point (u/w, v/w).
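To make the "defined up to scale" point concrete, here is a minimal sketch in plain Python. The helper names `apply_homography` and `scale_matrix`, the test point (10, 20) and the factor 7.5 are my own choices for illustration, not something from the question.

```python
def apply_homography(H, x, y):
    """Map the 2D point (x, y) through the 3x3 matrix H (a list of rows)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by the third component


def scale_matrix(H, s):
    """Multiply every coefficient of H by the same number s."""
    return [[s * h for h in row] for row in H]


# First matrix from the question.
H = [[0.2711, -0.0036, 0.853],
     [-0.0002, 0.2719, -0.2247],
     [0.0, 0.0, 0.2704]]

p = apply_homography(H, 10.0, 20.0)
q = apply_homography(scale_matrix(H, 7.5), 10.0, 20.0)
print(p)
print(q)  # same point as p, up to floating-point rounding
```

The scale factor cancels in the division by w, which is why normalizing by the bottom-right element loses nothing.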
So, to understand what is going on with your matrices, start by dividing both of them by their bottom-right component:
octave:1> a = [
> 0.2711 -0.0036 0.853
> -0.0002 0.2719 -0.2247
> 0.0000 -0.0000 0.2704
> ];
octave:2> b=[
> 0.4787 -0.0061 0.5514
> 0.0007 0.4798 -0.0799
> 0.0000 -0.0000 0.4797];
octave:3> a/a(3,3)
ans =
1.00259 -0.01331 3.15459
-0.00074 1.00555 -0.83099
0.00000 -0.00000 1.00000
octave:4> b/b(3,3)
ans =
0.99792 -0.01272 1.14947
0.00146 1.00021 -0.16656
0.00000 -0.00000 1.00000
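Now suppose, for the moment, that the third column elements in both matrices were [0, 0, 1]'. Then the upper-left 2x2 block would be within about 1% of the identity, so applying the matrix to any point (x, y) would move it by only about 1/100 of its coordinates (say, around a pixel for a point 100 pixels from the origin). Basically, not changing it by much.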
Plugging back the actual values for the third column shows that both matrices are, essentially, translating the whole images by constant amounts.
So, in conclusion, having equal values on the diagonals, and very small values at indices (1,2) and (2,1), means that these homographies are both (essentially) pure translations.

Transformations can involve all the elementary operations: addition, multiplication, division, and addition of a constant. Only the first two can be modeled by regular matrix multiplication. Note that addition of a constant and, in the case of a homography, division are impossible to represent with matrix multiplication in 2D. Adding a third coordinate (that is, converting points to a homogeneous representation) solves this problem. For example, if you want to add the constant 5 to x, you can do it like this:
| 1 0 5 |   | x |   | x+5 |
| 0 1 0 | * | y | = | y   |
            | 1 |
Note that the matrix is 2x3, not 2x2, and the coordinates have three numbers even though they represent 2D points. Also, the last transition converts back from the homogeneous to the Euclidean representation. Thus two results are achieved: first, all operations (multiplication, division, addition of variables and addition of constants) can be represented by matrix multiplication; second, we can chain multiple operations (by multiplying their matrices) and still have only a single matrix as the result.
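As a rough illustration of the "add a constant" and "chain operations" points, here is a small sketch in plain Python. Note one assumption on my side: I pad the matrix with a third row [0, 0, 1] so that two such matrices can be multiplied together. The helper names `matmul`, `translation` and `apply` are made up for this example.

```python
def matmul(A, B):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def translation(tx, ty):
    """3x3 homogeneous matrix that adds tx to x and ty to y."""
    return [[1.0, 0.0, tx],
            [0.0, 1.0, ty],
            [0.0, 0.0, 1.0]]


def apply(M, x, y):
    """Multiply M by [x, y, 1]' and convert back to Euclidean coordinates."""
    col = matmul(M, [[x], [y], [1.0]])
    return (col[0][0] / col[2][0], col[1][0] / col[2][0])


add5 = translation(5.0, 0.0)                    # "add the constant 5 to x"
chained = matmul(translation(0.0, -2.0), add5)  # add5 first, then subtract 2 from y

print(apply(add5, 3.0, 4.0))     # (8.0, 4.0)
print(apply(chained, 3.0, 4.0))  # (8.0, 2.0)
```

The chained matrix is still a single 3x3 matrix, which is the second benefit mentioned above.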
Ok, now let's explain homography. A homography is best considered in the context of the whole family of transformations, moving from simple ones to complex ones. In other words, it is easier to understand the meaning of the homography coefficients by comparing them to the meaning of the coefficients of the simpler Euclidean, similarity and affine transforms. The Euclidean transformation is the simplest and represents a rigid rotation and translation in space (note that the matrix is 2x3). For the 2D case,
cos(a) -sin(a) Tx
sin(a) cos(a) Ty
Similarity adds scaling to the rotation coefficients. So now the matrix looks like this:
scl*cos(a) -scl*sin(a) Tx
scl*sin(a) scl*cos(a) Ty
Affine transformation adds shearing, so the rotation coefficients become unrestricted:
a11 a12 Tx
a21 a22 Ty
Homography adds another row that divides the output x and y (see how we explained the division during the transition from homogeneous to Euclidean coordinates above) and thus introduces projectivity, or non-uniform scaling that is a function of the point coordinates. This is better understood by looking at the transition to Euclidean coordinates.
| a11 a12 Tx  |   | x |   | a11*x+a12*y+Tx  |      (a11*x+a12*y+Tx)/(a31*x+a32*y+a33)
| a21 a22 Ty  | * | y | = | a21*x+a22*y+Ty  |  ->  (a21*x+a22*y+Ty)/(a31*x+a32*y+a33)
| a31 a32 a33 |   | 1 |   | a31*x+a32*y+a33 |
Thus a homography has an extra row compared to other transformations such as affine or similarity. This extra row allows objects to be scaled depending on their coordinates, which is how projectivity is formed.
Finally, speaking of your numbers:
0.4787 -0.0061 0.5514
0.0007 0.4798 -0.0799
0.0000 -0.0000 0.4797
This is not a general homography! Just look at the last row and you will see that the first two coefficients are 0, thus there is no projectivity. Since a11 ≈ a22 (and a12, a21 are close to 0) this is not even a general affine transformation. This is rather a similarity transform. The translation is
Tx = 0.5514/0.4797 ≈ 1.149 and Ty = -0.0799/0.4797 ≈ -0.167
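If you want to do this extraction in code, the following is a minimal sketch in plain Python. It assumes the matrix really is (close to) a similarity transform, as argued above; the function names `normalize` and `similarity_params` are hypothetical, and reading scale and angle off the first column is my own choice of convention.

```python
import math


def normalize(H):
    """Divide all coefficients by the bottom-right element."""
    s = H[2][2]
    return [[h / s for h in row] for row in H]


def similarity_params(H):
    """Return (tx, ty, scale, angle) assuming H is approximately
    [scl*cos(a) -scl*sin(a) Tx; scl*sin(a) scl*cos(a) Ty; 0 0 1] up to scale."""
    N = normalize(H)
    tx, ty = N[0][2], N[1][2]
    scale = math.hypot(N[0][0], N[1][0])   # length of the first column
    angle = math.atan2(N[1][0], N[0][0])   # rotation angle in radians
    return tx, ty, scale, angle


B = [[0.4787, -0.0061, 0.5514],
     [0.0007, 0.4798, -0.0799],
     [0.0, 0.0, 0.4797]]

tx, ty, scale, angle = similarity_params(B)
print(round(tx, 4), round(ty, 4), round(scale, 4), round(angle, 4))
```

For this matrix the scale comes out very close to 1 and the angle very close to 0, which agrees with reading it as an almost pure translation.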

Related

Creation of a compression algorithm so I can access the data to interpolate later?

The following is a more elaborate description of what I wish to achieve, and of how far I have got.
I have a 3D grid (a 3D array) of about 30x30x30 values, so that I can define a function R^3 -> R, f(x, y, z) = v, where x ∈ [0, N_x], y ∈ [0, N_y] and z ∈ [0, N_z] are floats and N_i is the number of points minus 1 along dimension i of the array. v is equal to the value stored in the array if x, y and z are integers, and otherwise to the trilinear interpolation of the closest points in the array. So for f(0.5, 0.5, 0.5) the result would be the trilinear interpolation of the points (0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0) and (1,1,1). To see the idea, imagine a 1D array: a position such as 0.3864 does not exist (there are only integer indices), but one can make up a value by interpolating between the closest actual values, at positions 0 and 1. This extends to 2D, where a bilinear interpolation needs the 4 closest points, i.e. the values at (0,0), (0,1), (1,0) and (1,1), and in the end to any number of dimensions: you need exactly 2^n points, where n is the number of dimensions which have a non-integer coordinate.
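For reference, the lookup described above can be sketched as follows in plain Python. This is only an illustration of trilinear interpolation on a nested-list grid; the function name `trilinear` is made up, and it assumes every dimension has at least 2 points and that the coordinates lie inside [0, N_i].

```python
import math


def trilinear(grid, x, y, z):
    """Interpolate grid[i][j][k] (nested lists) at the float position (x, y, z)."""
    nx, ny, nz = len(grid) - 1, len(grid[0]) - 1, len(grid[0][0]) - 1
    # Lower corner of the enclosing cell, clamped so the upper corner stays in range.
    i = min(int(math.floor(x)), nx - 1)
    j = min(int(math.floor(y)), ny - 1)
    k = min(int(math.floor(z)), nz - 1)
    fx, fy, fz = x - i, y - j, z - k  # fractional position inside the cell
    value = 0.0
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                # Weight of each of the 8 corners of the cell.
                weight = ((fx if di else 1.0 - fx)
                          * (fy if dj else 1.0 - fy)
                          * (fz if dk else 1.0 - fz))
                value += weight * grid[i + di][j + dj][k + dk]
    return value


# Small test grid filled with the linear function i + 2*j + 4*k.
grid = [[[float(i + 2 * j + 4 * k) for k in range(3)]
         for j in range(3)]
        for i in range(3)]

print(trilinear(grid, 0.5, 0.5, 0.5))  # 3.5
print(trilinear(grid, 2.0, 2.0, 2.0))  # 14.0
```

Because the test grid holds a linear function, the interpolated values match the function exactly, which makes it easy to check the weights.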
Simplified:
I have a 3D grid of floats whose values I wish to access in parallel, by the thousands, at random positions. I want to convert this memory-bound process into a CPU-bound one by flattening the 3D array and approximating it with a finite Fourier expansion or something similar, then calculating the values at the required positions from this flattened data and using the calculated values to do the trilinear interpolation. The original code just accesses the values by their array indices, one by one. Because the values are accessed randomly and are far away from each other in memory, I'm looking for a suitable strategy to access (or calculate, if possible) the values based on an index.

Understand Translation Matrix in OpenGL

Assume we want to translate a point p(1, 2, 3, w=1) with a vector v(a, b, c, w=0) to a new point p'
Note: w=0 represents a vector and w=1 represents a point in OpenGL, please correct me if I'm wrong.
In the affine transformation definition, we have:
p + v = p'
=> p(1, 2, 3, 1) + v(a, b, c, 0) = p'(1 + a, 2 + b, 3 + c, 1)
=> point + vector = point (everything works as expected)
In OpenGL, the translation matrix is as follows:
1 0 0 a
0 1 0 b
0 0 1 c
0 0 0 1
I assume (a, b, c, 1) is the vector from the affine transformation definition.
Why do we have w=1, and not w=0, such as
1 0 0 a
0 1 0 b
0 0 1 c
0 0 0 0
Note: w=0 represents a vector and w=1 represents a point in OpenGL, please correct me if I'm wrong.
You are wrong. First of all, this doesn't really have anything to do with OpenGL. This is about homogeneous coordinates, which is a purely mathematical concept. It works by embedding an n-dimensional vector space into an n+1 dimensional vector space. In the 3D case, we use 4D homogeneous coordinates, with the definition that the homogeneous vector (x, y, z, w) represents the 3D point (x/w, y/w, z/w) in Cartesian coordinates.
As a result, for any w != 0, you get a certain finite point, and for w = 0, you are describing an infinitely far away point in a specific direction. This means that homogeneous coordinates are more powerful in the regard that they can actually describe infinitely far away points with finite coordinates (which comes in very handy for perspective transformations, where infinitely far away points are mapped to finite points, and vice versa).
You can, as a shortcut, imagine (x,y,z,0) as some direction vector. But for a point, it is not just w=1, but any w value not equal to 0. Conceptually, this means that any Cartesian 3D point is represented by a line in homogeneous space (we did go up one dimension, so this actually makes sense).
I assume (a, b, c, 1) is the vector from the affine transformation definition. Why do we have w=1, and not w=0?
Your assumption is wrong. One thing about homogeneous coordinates is that we do not apply a translation in the 4D space. We get the effect of the translation in the 3D space by actually doing a shearing operation in 4D space.
So what we really want to do in homogeneous space is
(x + w *a, y + w*b, z+ w*c, w)
since the 3D interpretation of the resulting vector will then be
(x + w*a) / w == x/w + a
(y + w*b) / w == y/w + b
(z + w*c) / w == z/w + c
which will represent the translation that we were after.
So to try to make this even more clear:
What you wrote in your question:
p(1, 2, 3, 1) + v(a, b, c, 0) = p(1 + a, 2 + b, 3 + c, 1)
Is explicitly not what we want to do. What you describe is an affine translation with respect to the 4D vector space.
But what we actually want is a translation in the 3D Cartesian coordinates, so
(1, 2, 3) + (a, b, c) = (1 + a, 2 + b, 3 + c)
Applying your formula would actually mean doing a translation in the homogeneous space, which would have the effect of doing a translation which is scaled by the w coordinate, while the formula I gave will always translate the point by (a,b,c), no matter what w we chose for the point.
This is of course not true if we chose w=0. Then, we will get no change at all, which is also correct because a translation will never change directions, whereas your formula would change the direction. Your formula is correct only for w=1, which is only a special case. But the key point here is that we are not doing a vector addition after all, but a matrix * vector multiplication. And homogeneous coordinates just allow us (among other, more powerful things) to represent a translation via matrix multiplication. But this does not mean that we can just interpret the last column as a translation vector as if we did vector addition.
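Here is a small sketch in plain Python of the behaviour described above. The functions `translate` and `to_cartesian` are hypothetical helpers that just spell out the matrix * vector product for the translation matrix; the numbers are arbitrary.

```python
def translate(p, a, b, c):
    """Multiply (x, y, z, w) by the 4x4 translation matrix with last column (a, b, c, 1)."""
    x, y, z, w = p
    return (x + w * a, y + w * b, z + w * c, w)


def to_cartesian(p):
    """Convert a homogeneous point with w != 0 to 3D Cartesian coordinates."""
    x, y, z, w = p
    return (x / w, y / w, z / w)


point_w1 = (1.0, 2.0, 3.0, 1.0)
point_w2 = (2.0, 4.0, 6.0, 2.0)   # the same Cartesian point (1, 2, 3), with w = 2
direction = (1.0, 2.0, 3.0, 0.0)  # w = 0: a direction

print(to_cartesian(translate(point_w1, 10.0, 20.0, 30.0)))  # (11.0, 22.0, 33.0)
print(to_cartesian(translate(point_w2, 10.0, 20.0, 30.0)))  # (11.0, 22.0, 33.0)
print(translate(direction, 10.0, 20.0, 30.0))               # unchanged
```

Both representations of the point end up translated by exactly (a, b, c), and the direction is left alone, which plain 4D vector addition would not give you.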
Simple Answer
The reason is the way matrix multiplication works. If you multiply a matrix by a vector, then the w-component of the result is the inner product of the 4th row of the matrix with the vector. After applying the transformation, a point should still be a point and a direction should still be a direction. If you set that row to a zero vector, the resulting w will always be 0, and thus the result will have changed from a position (w=1) to a direction (w=0).
More detailed answer
The definition of an affine transformation is:
x' = A * x + t,
where A is a linear map and t a translation. Traditionally, linear maps are written by mathematicians in matrix form. Note that t is here, like x, a 3-dimensional vector. It would be cumbersome (and less general, thinking of projective mappings) if we always had to handle the linear mapping matrix and the translation vector separately. This can be solved by introducing an additional dimension to the mapping, the so-called homogeneous coordinate, which allows us to store the linear mapping as well as the translation vector in a combined 4x4 matrix. This is called the augmented matrix and, by definition,
| x' |   | A  t |   | x |
| 1  | = | 0  1 | * | 1 |
It should also be noted that affine transformations can now be combined very easily by just multiplying their augmented matrices, which would be hard to do in matrix-plus-vector notation.
One should also note, that the bottom-right 1 is not part of the translation vector, which is still 3-dimensional, but of the matrix augmentation.
You might also want to read the section about "Augmented matrix" here: https://en.wikipedia.org/wiki/Affine_transformation#Augmented_matrix
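To illustrate the point about combining transformations, here is a short sketch in plain Python that builds two augmented matrices, multiplies them, and compares the result with applying x' = A * x + t twice. The helper names and the example numbers are mine.

```python
def matmul(A, B):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def augmented(A, t):
    """Build the 4x4 augmented matrix [A t; 0 1] from a 3x3 A and a 3-vector t."""
    return [A[0] + [t[0]], A[1] + [t[1]], A[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


def affine(A, t, x):
    """Apply x' = A * x + t in matrix-plus-vector notation."""
    return [sum(A[i][k] * x[k] for k in range(3)) + t[i] for i in range(3)]


def apply_augmented(M, x):
    """Apply a 4x4 augmented matrix to the 3D point x via [x; 1]."""
    col = matmul(M, [[x[0]], [x[1]], [x[2]], [1.0]])
    return [col[0][0], col[1][0], col[2][0]]


A1 = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]   # uniform scaling by 2
t1 = [1.0, 0.0, 0.0]
A2 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degree rotation about z
t2 = [0.0, 0.0, 5.0]
x = [1.0, 2.0, 3.0]

step_by_step = affine(A2, t2, affine(A1, t1, x))
combined = matmul(augmented(A2, t2), augmented(A1, t1))  # one matrix for both steps
print(step_by_step)
print(apply_augmented(combined, x))  # same result as step_by_step
```

The bottom row [0 0 0 1] of the product is what keeps the w component of a point equal to 1.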

Eigen C++ / Matlab quaternion and rotation matrix mismatch

I noticed that there's a difference in Eigen C++ and Matlab when calculating with quaternions.
In Eigen C++, the code
Eigen::Quaterniond q;
q.x() = 0.270598;
q.y() = 0.653281;
q.z() = -0.270598;
q.w() = 0.653281;
Eigen::Matrix3d R = q.normalized().toRotationMatrix();
std::cout << "R=" << std::endl << R << std::endl;
gives the rotation matrix:
R=
-2.22045e-16 0.707107 0.707107
0 0.707107 -0.707107
-1 0 -2.22045e-16
In Matlab (which uses wxyz), however, I get the following result:
q =
0.6533 0.2706 0.6533 -0.2706
>> quat2dcm(q)
ans =
-0.0000 0 -1.0000
0.7071 0.7072 0
0.7072 -0.7071 -0.0000
which is the transpose! Can somebody explain to me what is going on? I made sure that the positions of wxyz are correct.
Thank you
With Matlab, you are calculating the direction cosine matrix. It is indeed a rotation matrix like the one you are calculating with Eigen C++, and as such is also unitary (for a real matrix: orthogonal, i.e. all rows and all columns have a norm of 1, and the rows, like the columns, form a mutually perpendicular set of vectors).
Now, it so happens that the inverse of a unitary matrix is equal to its conjugate transpose (*), which for a real matrix is simply the transpose, i.e.:
U*U = UU* = I
In other words, what must be happening is that the convention of Matlab is the opposite of that of Eigen C++.
From Wikipedia:
The coordinates of a point P may change due to either a rotation of the coordinate system CS (alias), or a rotation of the point P (alibi).
In most cases the effect of the ambiguity is equivalent to the effect of a rotation matrix inversion (for these orthogonal matrices equivalently matrix transpose).
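To check this numerically without either library, here is a sketch in plain Python. It uses the common textbook formula for the rotation matrix of a unit quaternion; I am assuming (not quoting library source) that this is the convention Eigen follows, based on the fact that it reproduces the Eigen output from the question. The function names are hypothetical.

```python
import math


def quat_to_rotation(w, x, y, z):
    """Rotation matrix of the quaternion (w, x, y, z), normalized first."""
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def transpose(M):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


# Quaternion from the question, in w, x, y, z order.
R = quat_to_rotation(0.653281, 0.270598, 0.653281, -0.270598)
for row in R:
    print([round(v, 4) for v in row])             # matches the Eigen output
for row in transpose(R):
    print([round(v, 4) for v in row])             # matches quat2dcm up to rounding
```

So the two results describe the same rotation under opposite conventions (rotating the point versus rotating the coordinate system), and one is the inverse, i.e. the transpose, of the other.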

Why do we need perspective division?

I know perspective division is done by dividing x,y, and z by w, to get normalized device coordinates. But I am not able to understand the purpose of doing that. Also, does it have anything to do with clipping?
Some details that complement the general answers:
The idea is to project a point (x,y,z) on screen to have (xs,ys,d).
The next figure shows this for the y coordinate.
We know from school that
tan(alpha) = ys / d = y / z
This means that the projection is computed as
ys = d*y/z = y /w
w = z / d
This is enough to apply a projection.
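As a tiny sketch of these formulas in plain Python (the function name `project` and the sample numbers are my own; d is the distance to the screen plane as above):

```python
def project(x, y, z, d):
    """Project (x, y, z) onto the screen plane at distance d: xs = x / w, ys = y / w."""
    w = z / d
    return (x / w, y / w, d)


near = project(2.0, 4.0, 8.0, 2.0)
far = project(2.0, 4.0, 16.0, 2.0)  # same x and y, twice as far away
print(near)  # (0.5, 1.0, 2.0)
print(far)   # (0.25, 0.5, 2.0)
```

Doubling z halves the projected coordinates, which is exactly the effect the division by w is there to produce.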
However in OpenGL, you want (xs,ys,zs) to be normalized device coordinates in [-1,1] and yes this has something to do with clipping.
The extrema values for (xs,ys,zs) represent the unit cube and everything outside it will be clipped.
So a projection matrix usually takes the clipping limits (the frustum) into consideration to make a single transformation that, together with the perspective division, simultaneously applies a projection and transforms the projected coordinates, along with z, to normalized device coordinates.
I mean why do we need that?
In layman terms: To make perspective distortion work. In a perspective projection matrix, the Z coordinate gets "mixed" into the W output component. So the smaller the value of the Z coordinate, i.e. the closer to the origin, the more things get scaled up, i.e. bigger on screen.
To really distill it to the basic concept, and why the op is division (instead of e.g. square root or some such), consider that an object twice as far should appear with dimensions exactly one half as large. Obtain 1/2 from 2 by... division.
There are many geometric ways to arrive at the same conclusion. A diagram serves as visual proof for this, really.
Dividing x, y, z by w is a "trick" you do with "homogeneous coordinates": you convert an R⁴ vector back to R³ by dividing by the 4th component (or w component, as you said). This process is called dehomogenizing.
Why use homogeneous coordinates? That topic is a little more involved, but I will try to explain it. I hope I do it justice.
I will use x1, x2, x3, x4 as the components of a vector instead of x, y, z, w:
Consider a 3x3 matrix M and column vectors x, a, b, c of R³, with x = (x1, x2, x3) and x1, x2, x3 being scalars (the components of x).
With the 3x3 matrix you can do all the linear transformations on a vector x that you could do with the linear combination:
x' = x1*a + x2*b + x3*c (1).
(x' is the transformed vector that holds the result of transforming x).
Khan Academy's Linear Algebra course has a section explaining the fact that every linear transformation can be written as the product of a matrix with a vector.
You can try this out, for example, by putting the column vectors a, b, c in the columns of the matrix M = [ a b c ].
So with the matrix product you essentially get the linear combination above:
x' = M * x = [a b c] * x = a*x1 + b*x2 + c*x3 (2).
However this operation only accounts for rotation, scaling and shearing transformations. The origin (0, 0, 0) will always stay at (0, 0, 0).
For this you need another kind of transformation named "translation" (moving a vector or adding a vector to the vector).
Consider the translation column vector t = (t1, t2, t3) and the linear combination
x' = x1*a + x2*b + x3*c + t (3).
With this linear combination you can translate, rotate, scale and shear a vector. As you can see this Linear Combination does actually move the origin vector (0, 0, 0) to (0+t1, 0+t2, 0+t3).
However you can't put this translation into a 3x3 Matrix.
So what Graphics Programmers or Mathematicians came up with is adding another dimension to the Matrix and Vectors like this:
M is a 4x4 matrix and x~ a vector in R⁴ with x~ = (x1, x2, x3, x4). a, b, c, t are also column vectors of R⁴ (the last components of a, b, c being 0 and the last component of t being 1; I keep the names the same to later show the similarity between the homogeneous linear combination and (3)). x~ is the homogeneous coordinate of x.
Now watch what happens if we take a vector x of R³ and put it into x~ of R⁴.
This vector will be, in homogeneous coordinates in R⁴, x~ = (x1, x2, x3, 1). The last component is simply 1 if it is a point and 0 if it is just a direction (which couldn't be translated anyway).
So you have the linear combination:
x~' = M * x~ = [a b c t] * x~ = x1*a + x2*b + x3*c + x4*t (4).
(x~' is the result vector when transforming the homogeneous vector x~)
Since we took a vector from R³ and put it into R⁴ our x4 component is 1 we have:
x~' = x1*a + x2*b + x3*c + 1*t
<=> x~' = x1*a + x2*b + x3*c + t (5).
which is exactly the linear combination (3) above with the translation t. This is called an affine transformation (linear transformation + translation).
So with a 3x3 matrix and a vector of R³ you couldn't do translations. However, by adding another dimension, with a vector in R⁴ and a 4x4 matrix, you actually can.
However, when you want to return to R³ you have to divide the first three components by the last one, which is the x4 component or, in your variable naming, the w component. This is called "dehomogenizing". So let x be the original coordinate in R³, x~ in R⁴ the homogenized vector of x, and x' in R³ the vector obtained from x~:
x' = (x1/x4, x2/x4, x3/x4) (6).
Then x' is the dehomogenized vector of the vector x~.
Coming back to perspective division:
(I will leave it out, because many here have explained the divide by z already. It comes from the relationship between similar right triangles, which lets you conclude that, with a given focal length f, a point with depth z and coordinate y lands at y' = f*y/z. Also, since you stated (I hope I didn't misread) that you already know why this is done, I simply leave a link to a YT video here; I find it very well explained in the course lecture CMU 15-462/662.)
When dehomogenizing, the division by the w component is a pretty handy property when returning to R³. When you apply a homogeneous 4x4 perspective matrix to a vector, you simply put the z component into the w component and let the dehomogenizing process (as in (6)) perform the perspective divide. So you can set up the w component in such a way that the division by w divides by z and also maps the values to the range from 0 to 1 (basically you put the range of z-near to z-far values into a range floating points are precise at).
This is also described by Ravi Ramamoorthi in his course CSE167, where he explains how to set up the perspective projection matrix.
I hope this helped you understand the rationale of putting z into the w component. Sorry for my horrible formatting and lengthy text. Yet I hope it helped more than it confused.
Best of luck!
Actually, via standard notational convention from a 4x4 perspective matrix with sightline along a 'z' direction, 'w' differs by 1 from the distance ratio. Also that ratio, though interpreted correctly, is normally expressed as -z/d where 'z' is negative (therefore producing the correct ratio) because, again, in common notational convention, the camera is looking in the negative 'z' direction.
The reason for the offset by 1 needs to be explained. Many references put the origin at the image plane rather than the center of projection. With that convention (again with the camera looking along the negative 'z' direction) the distance labeled 'z' in the similar triangles diagram is thereby replaced by (d-z). Then substituting that for 'z' the expression for 'w' becomes, instead of 'z/d', (d-z)/d = [1-z/d]. To some these conventions may seem unorthodox but they are quite popular among analysts.
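A small sketch in plain Python of the convention just described, with the origin at the image plane, the center of projection at distance d behind it, and the camera looking along negative z. The matrix below is one common textbook form that yields w = 1 - z/d; treat it as an assumption for illustration rather than the matrix of any particular API. The names `perspective_matrix` and `project` are made up.

```python
def perspective_matrix(d):
    """4x4 matrix whose last row produces w = 1 - z/d for a point (x, y, z, 1)."""
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, -1.0 / d, 1.0],
    ]


def project(point, d):
    """Return the projected (x, y) on the image plane z = 0, and the w used."""
    x, y, z = point
    hom = [x, y, z, 1.0]
    M = perspective_matrix(d)
    xh, yh, zh, w = (sum(M[r][c] * hom[c] for c in range(4)) for r in range(4))
    return (xh / w, yh / w), w


# A point 2 units in front of the image plane, with the center of projection
# 2 units behind it: total distance 4, so the image is scaled by d / (d - z) = 1/2.
xy, w = project((2.0, 4.0, -2.0), 2.0)
print(xy, w)  # (1.0, 2.0) 2.0
```

Here w is the ratio (d - z)/d from the similar triangles, i.e. the distance ratio offset by 1 as explained above.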

How to calculate extrinsic parameters of one camera relative to the second camera?

I have calibrated 2 cameras with respect to some world coordinate system. I know rotation matrix and translation vector for each of them relative to the world frame. From these matrices how to calculate rotation matrix and translation vector of one camera with respect to the other??
Any help or suggestion please. Thanks!
Here is an easier solution, since you already have the 3x3 rotation matrices R1 and R2, and the 3x1 translation vectors t1 and t2.
These express the motion from the world coordinate frame to each camera, i.e. are the matrices such that, if p is a point expressed in world coordinate frame, then the same point expressed in, say, camera 1 frame is p1 = R1 * p + t1.
The motion from camera 1 to 2 is then simply the composition of (a) the motion FROM camera 1 TO the world frame, and (b) of the motion FROM the world frame TO camera 2. You can easily compute this composition as follows:
Form the 4x4 roto-translation matrices Qw1 = [R1 t1] and Qw2 = [ R2 t2 ], both with the 4th row equal to [0 0 0 1]. These matrices completely express the roto-translation FROM the world coordinate frame TO camera 1 and 2 respectively.
The motion FROM camera 1 TO the world frame is simply Q1w = inv(Qw1). Here inv() is the algebraic inverse matrix, i.e. the one such that inv(X) * X = X * inv(X) = IdentityMatrix, for every nonsingular matrix X.
The roto-translation from camera 1 to 2 is then Q12 = Qw2 * Q1w (Q1w is applied first, since points are column vectors multiplied on the right), and vice versa, the one from camera 2 to 1 is Q21 = Qw1 * Q2w = Qw1 * inv(Qw2).
Once you have Q12 you can extract from it the rotation and translation parts, if you so wish, respectively from its upper 3x3 submatrix and right 3x1 sub-column.
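The steps above can be sketched in plain Python as follows. I am assuming the column-vector convention p1 = R1 * p + t1 stated earlier, so the world-to-camera-2 matrix is applied after (multiplied to the left of) the camera-1-to-world matrix. The helper names (`rigid`, `invert_rigid`, `relative_pose`) and the example poses are hypothetical.

```python
def matmul(A, B):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def rigid(R, t):
    """4x4 roto-translation [R t; 0 0 0 1] from a 3x3 R and a 3-vector t."""
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


def invert_rigid(Q):
    """Inverse of a roto-translation: rotation R', translation -R' * t."""
    Rt = [[Q[j][i] for j in range(3)] for i in range(3)]
    t = [Q[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return rigid(Rt, t_inv)


def relative_pose(R1, t1, R2, t2):
    """Rotation and translation taking camera-1 coordinates to camera-2 coordinates."""
    Q12 = matmul(rigid(R2, t2), invert_rigid(rigid(R1, t1)))
    R12 = [row[:3] for row in Q12[:3]]
    t12 = [row[3] for row in Q12[:3]]
    return R12, t12


# Example: camera 1 is rotated 90 degrees about z, camera 2 is only shifted.
R1 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t1 = [1.0, 2.0, 3.0]
R2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t2 = [0.0, 0.0, 1.0]

R12, t12 = relative_pose(R1, t1, R2, t2)
print(R12)
print(t12)
```

Under these assumptions the result works out to R12 = R2 * R1' and t12 = t2 - R12 * t1, so you can also compute it without forming the 4x4 matrices.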
First convert your rotation matrix into a rotation vector. Now you have 2 3d vectors for each camera, call them A1,A2,B1,B2. You have all 4 of them with respect to some origin O. The rule you need is
A relative to B = (A relative to O)- (B relative to O)
Apply that rule to your 2 vectors and you will have their pose relative to one another.
Some documentation on converting from rotation matrix to euler angles can be found here as well as many other places. If you are using openCV you can just use Rodrigues. Here is some matlab/octave code I found.
Here is a very simple and easy solution. I suppose your 1st camera has R1 and T1, and your 2nd camera has R2 and T2, as rotation matrices and translation vectors with respect to a common reference point.
The translation and rotation from the 1st to the 2nd camera can be calculated with the following two lines of Matlab code:
R=R2*R1';
T=T2-R*T1;
But note that this is true only if you have just one R and T for each camera (I mean rotations and translations for one unique world reference). If you have more reference translations and rotations, you should calculate R and T for every single reference point. Probably they will be very close to each other, but they might be slightly different. Then you can calculate the mean of the translation vectors, convert all the rotation matrices found to rotation vectors, calculate their mean, and then convert it back to a rotation matrix.