Mapping between different camera views - computer-vision

I have a calibrated (virtual) camera in Blender that views a roughly planar object. I make an image from a first camera pose P0 and move the camera to a new pose P1. So I have the 4x4 camera matrix for both views from which I can calculate the transformation between the cameras as given below. I also know the intrinsics matrix K. Using those, I want to map the points from the image for P0 to a new image seen from P1 (of course, I have the ground truth to compare because I can render in Blender after the camera has moved to P1). If I only rotate the camera between P0 and P1, I can calculate the homography perfectly. But if there is translation, the calculated homography matrix does not take that into account. The theory says, after calculating M10, the last row and column should be dropped for a planar scene. However, when I check M10, I see that the translation values are in the rightmost column, which I drop to get the 3x3 homography matrix H10. Then, if there is no rotation, H10 is equal to the identity matrix. What is going wrong here?
Edit: I know that the images are related by a homography because, given the two images from P0 and P1, I can find a homography (by feature matching) that perfectly maps the image from P0 to the image from P1, even in the presence of translational camera movement.

The theory became clearer to me after reading two other books: "Multiple View Geometry" by Hartley and Zisserman (Example 13.2) and particularly "An Invitation to 3-D Vision: From Images to Geometric Models" (Section 5.3.1, Planar homography). Below is an outline; please check the above-mentioned sources for a thorough explanation.
Consider two images of points p on a 2D plane P in 3D space. The transformation between the two camera frames can be written as
X2 = R*X1 + T    (1)
where X1 and X2 are the coordinates of the world point p in camera frames 1 and 2, respectively, R is the rotation and T the translation between the two camera frames. Denoting the unit normal vector of the plane P in the first camera frame as N and the distance from the plane P to the first camera as d, we can use the plane equation to write N.T*X1 = d (.T means transpose), or equivalently
(1/d)*N.T*X1 = 1    (2)
for all X1 on the plane P. Substituting (2) into (1) gives
X2 = R*X1 + T*(1/d)*N.T*X1 = (R + (1/d)*T*N.T)*X1.
Therefore, the planar homography matrix (3x3) can be extracted as H = R + (1/d)*T*N.T, that is, X2 = H*X1. This is a linear transformation from X1 to X2.
The distance d can be computed as the dot product between the plane normal and a point on the plane. Then, the camera intrinsics matrix K should be used to calculate the projective homography G = K * (R + (1/d)*T*N.T) * inv(K). If you are using software such as Blender or Unity, you can set the camera intrinsics yourself and thus obtain K. For Blender, a nice code snippet is given in this excellent answer.
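A minimal NumPy sketch of that computation (assuming R and T map camera-frame-1 coordinates to camera-frame-2 coordinates, N is the unit plane normal in frame 1, X0 is any point on the plane in frame 1, and K is the 3x3 intrinsics matrix; the function name is just illustrative):

import numpy as np

def planar_homography(R, T, N, X0, K):
    # d: distance from the plane to camera 1 (dot product of the normal with a plane point)
    d = float(N @ X0)
    # Euclidean homography: X2 = H * X1 for points X1 on the plane
    H = R + np.outer(T, N) / d
    # Projective homography acting on pixel coordinates
    G = K @ H @ np.linalg.inv(K)
    return G / G[2, 2]

Warping the P0 image with G (for example with cv2.warpPerspective) should then reproduce the view from P1 for the planar part of the scene.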
OpenCV has a nice code example in this tutorial; see "Demo 3: Homography from the camera displacement".

Related

Compute Homography Matrix based on intrinsic and extrinsic camera parameters

I want to perform 360° panorama stitching for 6 fish-eye cameras.
In order to find the relation among the cameras, I need to compute the Homography Matrix. The latter is usually computed by finding features in the images and matching them.
However, for my camera setup I already know:
The intrinsic camera matrix K, which I computed through camera calibration.
Extrinsic camera parameters R and t. The camera orientation is fixed and does not change at any point. The cameras are located on a circle of known diameter d, with each camera positioned at a 60° shift along the circle.
Therefore, I think I could manually compute the Homography Matrix, which I am assuming would result in a more accurate approach than performing feature matching.
In the literature I found the following formula to compute the homography Matrix which relates image 2 to image 1:
H_2_1 = (K_2) * (R_2)^-1 * R_1 * K_1
This formula only takes into account a rotation angle among the cameras but not the translation vector that exists in my case.
How could I plug the translation t of each camera in the computation of H?
I have already tried to compute H without considering the translation, but as d > 1 meter, the images are not accurately aligned in the panorama picture.
EDIT:
Based on Francesco's answer below, I got the following questions:
After calibrating the fisheye lenses, I got a matrix K with focal length f=620 for an image of size 1024 x 768. Is that considered to be a big or small focal length?
My cameras are located on a circle with a diameter of 1 meter. The explanation below makes it clear to me that, due to this "big" translation between the cameras, I get noticeable ghosting effects for objects that are relatively close to them. Therefore, if the homography model cannot fully represent the position of the cameras, is it possible to use another model, like the Fundamental/Essential Matrix, for image stitching?
You cannot "plug" the translation in: its presence along with a nontrivial rotation mathematically implies that the relationship between images is not a homography.
However, if the imaged scene is and appears "far enough" from the camera, i.e. if the translations between cameras are small compared to the distances of the scene objects from the cameras, and the cameras' focal lengths are small enough, then you may use the homography induced by a pure rotation as an approximation.
Your equation is wrong. The correct formula is obtained as follows:
Take a pixel in camera 1: p_1 = (x, y, 1) in homogeneous coordinates
Back project it into a ray in 3D space: P_1 = inv(K_1) * p_1
Decompose the ray in the coordinates of camera 2: P_2 = R_2_1 * P_1
Project the ray into a pixel in camera 2: p_2 = K_2 * P_2
Put the equations together: p_2 = [K_2 * R_2_1 * inv(K_1)] * p_1
The product H = K_2 * R_2_1 * inv(K_1) is the homography induced by the pure rotation R_2_1. The rotation transforms points into frame 2 from frame 1. It is represented by a 3x3 matrix whose columns are the components of the x, y, z axes of frame 1 decomposed in frame 2. If your setup gives you the rotations of all the cameras with respect to a common frame 0, i.e. as R_i_0, then R_2_1 = R_2_0 * R_1_0.transposed.
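A small NumPy sketch of that composition (assuming R1_0 and R2_0 are the rotations of cameras 1 and 2 with respect to the common frame 0, and K1, K2 are the intrinsics matrices; names are illustrative):

import numpy as np

def rotation_homography(R1_0, R2_0, K1, K2):
    # Rotation taking frame-1 coordinates into frame 2
    R_2_1 = R2_0 @ R1_0.T
    # Homography induced by the pure rotation: p_2 ~ H * p_1
    H = K2 @ R_2_1 @ np.linalg.inv(K1)
    return H / H[2, 2]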
Generally speaking, you should use the above homography as an initial estimation, to be refined by matching points and optimizing. This is because (a) the homography model itself is only an approximation (since it ignores the translation), and (b) the rotations given by the mechanical setup (even a calibrated one) are affected by errors. Using matched pixels to optimize the transformation will minimize the errors where it matters, on the image, rather than in an abstract rotation space.
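One possible way to do that refinement (a sketch, not the answer's exact procedure, assuming OpenCV feature matching is acceptable here): warp image 1 by the initial rotation-induced estimate, match the warped image against image 2, estimate the residual homography with RANSAC and compose the two.

import cv2
import numpy as np

def refine_homography(img1, img2, H_init):
    h, w = img2.shape[:2]
    # Pre-warp image 1 with the rotation-induced estimate
    warped = cv2.warpPerspective(img1, H_init, (w, h))
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(warped, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Residual homography between the pre-warped image and image 2
    H_corr, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H_corr @ H_init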

OpenCV Structure from Motion Reprojection Issue

I am currently facing an issue with my Structure from Motion program based on OpenCV.
I'll try to describe what it does and what it is supposed to do.
The program relies on the classic "structure from motion" method.
The basic idea is to take a pair of images, detect their keypoints and compute the descriptors of those keypoints. Then the keypoint matching is done, with a certain number of tests to ensure the result is good. That part works perfectly.
Once this is done, the following computations are performed: fundamental matrix, essential matrix, SVD decomposition of the essential matrix, camera matrix computation and, finally, triangulation.
The result for a pair of images is a set of 3D coordinates, giving us points to be drawn in a 3D viewer. This works perfectly, for a pair.
Indeed, here is my problem: for a pair of images, the 3D point coordinates are calculated in the coordinate system of the first image of the pair, taken as the reference image. When working with more than two images, which is the objective of my program, I have to reproject the computed 3D points into the coordinate system of the very first image in order to get a consistent result.
My question is: how do I reproject 3D point coordinates given in one camera-related system into another camera-related system? With the camera matrices?
My idea was to take the 3D point coordinates and multiply them by the inverse of each preceding camera matrix.
To clarify:
Suppose I am working on the third and fourth image (hence, the third pair of images, because I am working like 1-2 / 2-3 / 3-4 and so on).
I get my 3D point coordinates in the coordinate system of the third image; how do I reproject them properly into the coordinate system of the very first image?
I would have done the following:
Take the matrix of 3D point coordinates, apply the inverse of the camera matrix for images 2 to 3, and then apply the inverse of the camera matrix for images 1 to 2.
Is that even correct?
Because those camera matrices are non-square, I can't invert them.
I am surely mistaken somewhere, and I would be grateful if someone could enlighten me. I am pretty sure this is a relatively easy one, but I am obviously missing something...
Thanks a lot for reading :)
Let us say you have a 3x4 extrinsic parameter matrix called P. To match the notation of the OpenCV documentation, this is [R|t].
This matrix P describes the projection from world space coordinates to the camera space coordinates. To quote the documentation:
[R|t] translates coordinates of a point (X, Y, Z) to a coordinate system, fixed with respect to the camera.
You are wondering why this matrix is non-square. That is because in the usual context of OpenCV, you are not expecting homogeneous coordinates as output. Therefore, to make it square, just add a fourth row containing (0,0,0,1). Let's call this new square matrix Q.
You have one such matrix for each pair of cameras; that is, you have one Qk matrix for each pair of images {k, k+1} that describes the projection from the coordinate space of camera k to that of camera k+1. Those matrices are invertible because they describe isometries in homogeneous coordinates.
To go from the coordinate space of camera 3 to that of camera 1, just apply to your points the inverse of Q2 and then the inverse of Q1.
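A minimal NumPy sketch of that chain (assuming P1 and P2 are the 3x4 [R|t] matrices for the pairs {1,2} and {2,3}; the point-array layout is illustrative):

import numpy as np

def to_homogeneous(P):
    # Extend a 3x4 [R|t] matrix to a square 4x4 matrix Q
    return np.vstack([P, [0.0, 0.0, 0.0, 1.0]])

def camera3_to_camera1(points_cam3, P1, P2):
    Q1, Q2 = to_homogeneous(P1), to_homogeneous(P2)
    # Nx3 points in camera-3 coordinates -> Nx4 homogeneous
    pts_h = np.hstack([points_cam3, np.ones((len(points_cam3), 1))])
    # Undo Q2 (camera 3 -> camera 2), then Q1 (camera 2 -> camera 1)
    pts_cam1 = (np.linalg.inv(Q1) @ np.linalg.inv(Q2) @ pts_h.T).T
    return pts_cam1[:, :3]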

OpenCV stereo vision 3D coordinates to 2D camera-plane projection different than triangulating 2D points to 3D

I get an image point in the left camera (pointL) and the corresponding image point in the right camera (pointR) of my stereo camera using feature matching. The two cameras are parallel and at the same "height". There is only an x-translation between them.
I also know the projection matrices for each camera (projL, projR), which I got during calibration using initUndistortRectifyMap.
For triangulating the point, I call:
triangulatePoints(projL, projR, pointL, pointR, pos3D) (documentation), where pos3D is the output 3D position of the object.
Now, I want to project the 3D-coordinates to the 2D-image of the left camera:
2Dpos = projL*3dPos
The resulting x-coordinate is correct, but the y-coordinate is off by about 20 pixels.
How can I fix this?
Edit:
Of course, I need to use homogeneous coordinates, in order to multiply it with the projection matrix (3x4). For that reason, I set:
3dPos[0] = x;
3dPos[1] = y;
3dPos[2] = z;
3dPos[3] = 1;
Is it wrong to set 3dPos[3] to 1?
Note:
All images are remapped; I do this in a kind of preprocessing step.
Of course, I always use the homogeneous coordinates
You are likely projecting into the rectified camera. You need to apply the inverse of the rectification warp to obtain the point in the original (undistorted) linear camera's coordinates, and then apply the distortion to get into the original image.
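A rough sketch of that suggestion, assuming pos3D is the homogeneous output of triangulatePoints expressed in the rectified left camera, R1 is the left rectification rotation (as returned by stereoRectify), and K_L, distL are the left camera's intrinsics and distortion coefficients (all names illustrative):

import cv2
import numpy as np

def project_to_original_left_image(pos3D, R1, K_L, distL):
    # De-homogenize: 3D point in the rectified left camera frame
    X_rect = (pos3D[:3] / pos3D[3]).reshape(3)
    # Undo the rectification rotation to get back to the original left camera frame
    X_orig = R1.T @ X_rect
    # Project with the original intrinsics and apply the lens distortion
    img_pts, _ = cv2.projectPoints(X_orig.reshape(1, 1, 3),
                                   np.zeros(3), np.zeros(3), K_L, distL)
    return img_pts.reshape(2)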

Reference frame of homography matrix (and how to obtain the motion wrt the first image)?

The OpenCV function findHomography "finds a perspective transformation between two planes". According to this post, if H is the transformation matrix, we have (where X1 is the prior image):
X1 = H · X2
I am struggling to fully comprehend the reference axes of the parameters in the transformation (translation distances, scaling factors and rotation angle, in a simplified affine transformation case without shear).
What I am trying to obtain is the camera motion from one image to another, relative to the first image. That is, motion between images expressed as the translation in the first image's axes. I assume these translations are the (tx,ty) from the homography matrix, but corrected using the rotation angle and scaling parameters. Could someone enlighten me on how to obtain these translation values?

3d object overlay - augmented reality irrlicht + opencv

I am trying to develop an augmented reality program that overlays a 3D object on top of a marker. The model does not move along (proportionately) with the marker. Here is the list of things that I did:
1) Using OpenCV:
a) I used the solvePnP method to find rvecs and tvecs.
b) I also used the Rodrigues method to find the rotation matrix and appended the tvecs vector to get the projection matrix.
c) Just for testing, I made some points and lines and projected them to make a cube. This works perfectly fine and I am getting a good output.
2) Using Irrlicht:
a) I tried to place a 3D model (at position (0,0,0) and rotation (0,0,0)) with the camera feed running in the background.
b) Using the rotation matrix found with Rodrigues in OpenCV, I calculated the pitch, yaw and roll values following this post ("http://planning.cs.uiuc.edu/node103.html") and passed the values to the rotation field. In the position field I passed the tvecs values, as tvecs[0], -tvecs[1], tvecs[2].
The model is moving in the correct directions, but it is not moving proportionately. Meaning, if I move the marker 100 pixels in the x direction, the model only moves 20 pixels (the values 100 and 20 are not measured; I just took arbitrary values to illustrate the example). The same goes for the y and z axes. I do know I have to introduce another transformation matrix that maps the OpenCV camera coordinates to Irrlicht camera coordinates, and that it is a 4x4 matrix, but I do not know how to find it. Also, OpenCV's projection matrix [R|t] is a 3x4 matrix and it yields a 2D point that is to be projected. The 4x4 matrix mapping between OpenCV and Irrlicht requires a 3D point (made homogeneous) to be fed into a 4x4 matrix. How do I achieve that?
The 4x4 matrix you are writing about seems to be M = [R|t; 0 0 0 1], where t is the 3x1 translation vector. To get the transformed coordinates v' of a 4x1 point v = [x y z 1]^T, just compute v' = M*v.
Your problem with scaling may also be caused by a difference between the units used for camera calibration in OpenCV and those used by the other library.
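A small sketch of the M = [R|t; 0 0 0 1] construction described above (assuming rvecs and tvecs are the outputs of cv2.solvePnP, as in the question; the helper name is illustrative):

import cv2
import numpy as np

def transform_point(rvecs, tvecs, point3d):
    # Build M = [R|t; 0 0 0 1] from the solvePnP output
    R, _ = cv2.Rodrigues(rvecs)
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = np.asarray(tvecs).ravel()
    # Homogeneous point v = [x y z 1]^T and v' = M * v
    v = np.append(np.asarray(point3d, dtype=float), 1.0)
    v_prime = M @ v
    return v_prime[:3]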