I am willing to perform a 360° Panorama stitching for 6 fish-eye cameras.
In order to find the relation among cameras I need to compute the Homography Matrix. The latter is usually computed by finding features in the images and matching them.
However, for my camera setup I already know:
The intrinsic camera matrix K, which I computed through camera calibration.
Extrinsic camera parameters R and t. The camera orientation is fixed and does not change at any point. The cameras are located on a circle of known diameter d, being each camera positioned with a shift of 60° degrees with respect to the circle.
Therefore, I think I could manually compute the Homography Matrix, which I am assuming would result in a more accurate approach than performing feature matching.
In the literature I found the following formula to compute the homography Matrix which relates image 2 to image 1:
H_2_1 = (K_2) * (R_2)^-1 * R_1 * K_1
This formula only takes into account a rotation angle among the cameras but not the translation vector that exists in my case.
How could I plug the translation t of each camera in the computation of H?
I have already tried to compute H without considering the translation, but as d>1 meter, the images are not accurate aligned in the panorama picture.
EDIT:
Based on Francesco's answer below, I got the following questions:
After calibrating the fisheye lenses, I got a matrix K with focal length f=620 for an image of size 1024 x 768. Is that considered to be a big or small focal length?
My cameras are located on a circle with a diameter of 1 meter. The explanation below makes it clear for me, that due to this "big" translation among the cameras, I have remarkable ghosting effects with objects that are relative close to them. Therefore, if the Homography model cannot fully represent the position of the cameras, is it possible to use another model like Fundamental/Essential Matrix for image stitching?
You cannot "plug" the translation in: its presence along with a nontrivial rotation mathematically implies that the relationship between images is not a homography.
However, if the imaged scene is and appears "far enough" from the camera, i.e. if the translations between cameras are small compared to the distances of the scene objects from the cameras, and the cameras' focal lengths are small enough, then you may use the homography induced by a pure rotation as an approximation.
Your equation is wrong. The correct formula is obtained as follows:
Take a pixel in camera 1: p_1 = (x, y, 1) in homogeneous coordinates
Back project it into a ray in 3D space: P_1 = inv(K_1) * p_1
Decompose the ray in the coordinates of camera 2: P_2 = R_2_1 * P1
Project the ray into a pixel in camera 2: p_2 = K_2 * P_2
Put the equations together: p_2 = [K_2 * R_2_1 * inv(K_1)] * p_1
The product H = K2 * R_2_1 * inv(K1) is the homography induced by the pure rotation R_2_1. The rotation transforms points into frame 2 from frame 1. It is represented by a 3x3 matrix whose columns are the components of the x, y, z axes of frame 1 decomposed in frame 2. If your setup gives you the rotations of all the cameras with respect to a common frame 0, i.e. as R_i_0, then it is R_2_1 = R_2_0 * R_1_0.transposed.
Generally speaking, you should use the above homography as an initial estimation, to be refined by matching points and optimizing. This is because (a) the homography model itself is only an approximation (since it ignores the translation), and (b) the rotations given by the mechanical setup (even a calibrated one) are affected by errors. Using matched pixels to optimize the transformation will minimize the errors where it matters, on the image, rather than in an abstract rotation space.
Related
I have a calibrated (virtual) camera in Blender that views a roughly planar object. I make an image from a first camera pose P0 and move the camera to a new pose P1. So I have the 4x4 camera matrix for both views from which I can calculate the transformation between the cameras as given below. I also know the intrinsics matrix K. Using those, I want to map the points from the image for P0 to a new image seen from P1 (of course, I have the ground truth to compare because I can render in Blender after the camera has moved to P1). If I only rotate the camera between P0 and P1, I can calculate the homography perfectly. But if there is translation, the calculated homography matrix does not take that into account. The theory says, after calculating M10, the last row and column should be dropped for a planar scene. However, when I check M10, I see that the translation values are in the rightmost column, which I drop to get the 3x3 homography matrix H10. Then, if there is no rotation, H10 is equal to the identity matrix. What is going wrong here?
Edit: I know that the images are related by a homography because given the two images from P0 and P1, I can find a homography (by feature matching) that perfectly maps the image from P0 to the image from P1, even in presence of a translational camera movement.
The theory became more clear to me after reading from two other books: "Multiple View Geometry" from Hartley and Zissermann (Example 13.2) and particularly "An Invitation to 3-D Vision: From Images to Geometric Models" (Section 5.3.1, Planar homography). Below is an outline, please check the above-mentioned sources for a thorough explanation.
Consider two images of points p on a 2D plane P in 3D space, the transformation between the two camera frames can be written as: X2 = R*X1 + T (1) where X1 and X2 are the coordinates of the world point p in camera frames 1 and 2, respectively, R the rotation and T the translation between the two camera frames. Denoting the unit normal vector of the plane P to the first camera frame as N and the distance from the plane P to the first camera as d, we can use the plane equation to write N.T*X1=d (.T means transpose), or equivalently (1/d)*N.T*X1=1 (2) for all X1 on the plane P. Substituting (2) into (1) gives X2 = R*X1+T*(1/d)*N.T*X1 = (R+(1/d)*T*N.T)*X1. Therefore, the planar homography matrix (3x3) can be extracted as H=R+(1/d)*T*N.T, that is X2 = H*X1. This is a linear transformation from X1 to X2.
The distance d can be computed as the dot product between the plane normal and a point on the plane. Then, the camera intrinsics matrix K should be used to calculate the projective homography G = K * R+(1/d)*T*N.T * inv(K). If you are using a software like Blender or Unity, you can set the camera intrinsics yourself and thus obtain K. For Blender, there a nice code snippet is given in this excellent answer.
OpenCV has some nice code example in this tutorial; see "Demo 3: Homography from the camera displacement".
I am working on building 3D point cloud from features matching using OpenCV3.1 and OpenGL.
I have implemented 1) Camera Calibration (Hence I am having Intrinsic Matrix of the camera) 2) Feature extraction( Hence I have 2D points in Pixel Coordinates).
I was going through few websites but generally all have suggested the flow for converting 3D object points to pixel points but I am doing completely backword projection. Here is the ppt that explains it well.
I have implemented film coordinates(u,v) from pixel coordinates(x,y)(With the help of intrisic matrix). Can anyone shed the light on how I can render "Z" of camera coordinate(X,Y,Z) from the film coordinate(x,y).
Please guide me on how I can utilize functions for the desired goal in OpenCV like solvePnP, recoverPose, findFundamentalMat, findEssentialMat.
With single camera and rotating object on fixed rotation platform I would implement something like this:
Each camera has resolution xs,ys and field of view FOV defined by two angles FOVx,FOVy so either check your camera data sheet or measure it. From that and perpendicular distance (z) you can convert any pixel position (x,y) to 3D coordinate relative to camera (x',y',z'). So first convert pixel position to angles:
ax = (x - (xs/2)) * FOVx / xs
ay = (y - (ys/2)) * FOVy / ys
and then compute cartesian position in 3D:
x' = distance * tan(ax)
y' = distance * tan(ay)
z' = distance
That is nice but on common image we do not know the distance. Luckily on such setup if we turn our object than any convex edge will make an maximum ax angle on the sides if crossing the perpendicular plane to camera. So check few frames and if maximal ax detected you can assume its an edge (or convex bump) of object positioned at distance.
If you also know the rotation angle ang of your platform (relative to your camera) Then you can compute the un-rotated position by using rotation formula around y axis (Ay matrix in the link) and known platform center position relative to camera (just subbstraction befor the un-rotation)... As I mention all this is just simple geometry.
In an nutshell:
obtain calibration data
FOVx,FOVy,xs,ys,distance. Some camera datasheets have only FOVx but if the pixels are rectangular you can compute the FOVy from resolution as
FOVx/FOVy = xs/ys
Beware with Multi resolution camera modes the FOV can be different for each resolution !!!
extract the silhouette of your object in the video for each frame
you can subbstract the background image to ease up the detection
obtain platform angle for each frame
so either use IRC data or place known markers on the rotation disc and detect/interpolate...
detect ax maximum
just inspect the x coordinate of the silhouette (for each y line of image separately) and if peak detected add its 3D position to your model. Let assume rotating rectangular box. Some of its frames could look like this:
So inspect one horizontal line on all frames and found the maximal ax. To improve accuracy you can do a close loop regulation loop by turning the platform until peak is found "exactly". Do this for all horizontal lines separately.
btw. if you detect no ax change over few frames that means circular shape with the same radius ... so you can handle each of such frame as ax maximum.
Easy as pie resulting in 3D point cloud. Which you can sort by platform angle to ease up conversion to mesh ... That angle can be also used as texture coordinate ...
But do not forget that you will lose some concave details that are hidden in the silhouette !!!
If this approach is not enough you can use this same setup for stereoscopic 3D reconstruction. Because each rotation behaves as new (known) camera position.
You can't, if all you have is 2D images from that single camera location.
In theory you could use heuristics to infer a Z stacking. But mathematically your problem is under defined and there's literally infinitely many different Z coordinates that would evaluate your constraints. You have to supply some extra information. For example you could move your camera around over several frames (Google "structure from motion") or you could use multiple cameras or use a camera that has a depth sensor and gives you complete XYZ tuples (Kinect or similar).
Update due to comment:
For every pixel in a 2D image there is an infinite number of points that is projected to it. The technical term for that is called a ray. If you have two 2D images of about the same volume of space each image's set of ray (one for each pixel) intersects with the set of rays corresponding to the other image. Which is to say, that if you determine the ray for a pixel in image #1 this maps to a line of pixels covered by that ray in image #2. Selecting a particular pixel along that line in image #2 will give you the XYZ tuple for that point.
Since you're rotating the object by a certain angle θ along a certain axis a between images, you actually have a lot of images to work with. All you have to do is deriving the camera location by an additional transformation (inverse(translate(-a)·rotate(θ)·translate(a)).
Then do the following: Select a image to start with. For the particular pixel you're interested in determine the ray it corresponds to. For that simply assume two Z values for the pixel. 0 and 1 work just fine. Transform them back into the space of your object, then project them into the view space of the next camera you chose to use; the result will be two points in the image plane (possibly outside the limits of the actual image, but that's not a problem). These two points define a line within that second image. Find the pixel along that line that matches the pixel on the first image you selected and project that back into the space as done with the first image. Due to numerical round-off errors you're not going to get a perfect intersection of the rays in 3D space, so find the point where the ray are the closest with each other (this involves solving a quadratic polynomial, which is trivial).
To select which pixel you want to match between images you can use some feature motion tracking algorithm, as used in video compression or similar. The basic idea is, that for every pixel a correlation of its surroundings is performed with the same region in the previous image. Where the correlation peaks is, where it likely was moved from into.
With this pixel tracking in place you can then derive the structure of the object. This is essentially what structure from motion does.
The OpenCV function findHomography "finds a perspective transformation between two planes". According to this post, if H is the transformation matrix, we have (where X1 is the prior image):
X1 = H · X2
I am struggling to fully comprehend the reference axes of the parameters in the transformation (translation distances, scaling factors and rotation angle -in a simplified affine transformation case without shear-).
What I am trying to obtain is the camera motion from one image to another, relative to the first image. That is, motion between images expressed as the translation in the first image's axes. I assume these translations are the (tx,ty) from the homography matrix, but corrected using the rotation angle and scaling parameters. Could someone enlighten me on how to obtain these translation values?
I am reading the following documentation: http://docs.opencv.org/doc/tutorials/calib3d/camera_calibration/camera_calibration.html
I have managed to successfully calibrate the camera obtaining the camera matrix and the distortion matrix.
I had two sub-questions:
1) How do I use the distortion matrix as I don't know 'r'?
2) For all the views I have the rotation and translation vectors which transform the object points (given in the model coordinate space) to the image points (given in the world coordinate space). So a total of 6 coordinates per image(3 rotational, 3 translational). How do I make use of this information to obtain the rotational-translational matrix?
Any help would be appreciated. Thanks!
Answers in order:
1) "r" is the pixel's radius with respect to the distortion center. That is:
r = sqrt((x - x_c)^2 + (y - y_c)^2)
where (x_c, y_c) is the center of the nonlinear distortion (i.e. the point in the image that has zero nonlinear distortion. This is usually (and approximately) identified with the principal point, i.e. the intersection of the camera focal axis with the image plane. The coordinates of the principal point are in the 3rd column of the matrix of the camera intrinsic paramers.
2) Use Rodrigues's formula to convert between rotation vectors and rotation matrices.
I'd like to know how I can go about calculating the angle of some pixel in a photo relative to the webcam that I'm using. I'm new to this sort of thing and I'm using a webcam. Essentially, I take a photo, process it, and I end up with a pixel value in the image that is what I'm looking for. I then need to somehow turn that pixel value into some meaningful quantity---I need to find a line/vector that passes through the pixel and the camera. I don't need magnitude, just phase.
How does one go about doing this? Is camera calibration necessary? I've been reading a bit about it but am unsure.
Thanks
You don't need to know the distance to the object, only the resolution and angle of view of the camera.
Computing the angle requires only simple linear interpolation. For example, let's assume a camera with a resolution of 1920x1080 that covers a 45 degree angle of view across the diagonal.
In this case, sqrt(19202 + 10802) gives 2292.19 pixels along the diagonal. That means each pixel represents 45/2292.19 = .0153994 degrees.
So, compute the distance from the center (in pixels), multiply by .0153994, and you have its angle from the center (for that camera -- for yours, you'll obviously have to use its resolution and angle of view).
Of course, this is somewhat approximate -- its accuracy will depend on how much distortion the lens has. With a zoom lens (especially wider angle) you can generally count on that being fairly high. With a fixed focal length lens (especially if it doesn't cover an angle wider than 90 degrees or so) it'll usually be pretty low.
If you want to improve accuracy, you can start by taking a picture of a flat rectangle with straight lines just inside the angle of view of the camera, then compute the distortion based on the deviation from perfectly straight in the resulting picture. If you're working with an extremely wide angle lens, this may be nearly essential. With a lens covering a narrower angle of view (especially, as already mentioned, if it's fixed focal length) it's rarely likely to be worthwhile (such lenses often have only a fraction of a percent of distortion).
Recipe:
1 - Calibrate the camera, obtaining the camera matrix K and distortion parameters D. In OpenCV this is done as described in this tutorial.
2 - Remove the nonlinear distortion from the pixel positions of interest. In OpenCV is done using the undistortPoints routine, without passing arguments R and P.
3 - Back-project the pixels of interest into rays (unit vectors with the tail at the camera center) in camera 3D coordinates, by multiplying their pixel positions in homogeneous coordinates times the inverse of the camera matrix.
4 - The angle you want is the angle between the above vectors and (0, 0, 1), the vector associated to the camera's focal axis.