Projective or Euclidean 3D Reconstruction? - python-2.7

I have trouble understanding whether I get a Euclidean reconstruction or just a projective one. First, let me describe what I've done:
I have two stereo images. The images are SEM images and are eucentrically tilted. The difference of tilt is 5°. Using SURF-correspondences and RANSAC, I calculate the fundamental matrix with the normalized 8-point algorithm.
Then the images are rectified and I do a dense stereo-matching:
minDisp = -16
numDisp = 16-minDisp
stereo = cv2.StereoSGBM_create(minDisparity = minDisp,
numDisparities = numDisp)
disp = stereo.compute(imgL, imgR).astype(np.float32) / 16.0
That gives me a disparity map, for example this 5x5 matrix (the values range from -16 to 16). I mask out the bad pixels (-17) and compute the z-component of my images using the flattened disp array.
disp =
-0.1875  -0.1250  -0.1250    0
-0.1250  -0.1250  -0.1250  -17
-0.0625  -0.0625  -0.1250  -17
-0.0625  -0.0625   0         0.0625
 0        0        0.0625    0.1250
# create a mask that eliminates the bad pixel values (= minimum values)
mask = disp != disp.min()
dispMasked = disp[mask]
# compute the z-component (np.sin expects tilt in radians)
zWorld = np.float32((dispMasked * p) / (2 * np.sin(tilt)))
It's a simplified form of a real triangulation that assumes a parallel projection and uses trigonometric equations. The pixel constant p was calculated with a calibration object, so I get the height in mm. The disparity was calculated in pixels.
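The height formula above can be sketched as a small function. This is only an illustration of the parallel-projection approximation described in the question; the name `height_from_disparity`, the `invalid_value` parameter, and the choice to pass the tilt in degrees are assumptions made for the example. Whether the angle should be the full tilt difference or half of it depends on the acquisition geometry and is not settled here.

```python
import numpy as np

def height_from_disparity(disp, p, tilt_deg, invalid_value=None):
    """Height from disparity under a parallel-projection assumption.

    disp          : disparity map in pixels
    p             : pixel constant (mm per pixel), from a calibration object
    tilt_deg      : tilt angle in degrees (converted to radians for np.sin)
    invalid_value : marker for bad pixels; defaults to the map minimum,
                    as in the question's masking step (an assumption)
    """
    disp = np.asarray(disp, dtype=np.float32)
    if invalid_value is None:
        invalid_value = disp.min()
    mask = disp != invalid_value          # True for usable pixels
    tilt = np.deg2rad(tilt_deg)           # np.sin works in radians
    z = disp[mask] * p / (2.0 * np.sin(tilt))
    return z, mask

# Hypothetical values, for illustration only
z, mask = height_from_disparity([[1.0, -17.0], [0.5, 0.0]], p=2.0, tilt_deg=30.0)
```

Returning the mask alongside the heights makes it possible to put the z values back at their pixel positions when building the point cloud.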
The resulting point cloud looks quite good, but all points share a small constant tilt, so the reconstructed point cloud (plane) is inclined by some tilt angle.
My question is: is this point cloud in real Euclidean coordinates, or do I have a projective reconstruction (equal to an affine reconstruction?) that still differs from a Euclidean result by an unknown transformation?
The reason why I ask is that I don't have a real calibration matrix and I didn't use a real triangulation method using central projection with camera center coordinates, focal length and image point coordinates.
Any suggestions or literature are appreciated. :)
Best regards and thanks in advance!


KITTI dataset camera projection matrix

I am looking at the KITTI dataset, in particular at how to convert a world point into image coordinates. The README (quoted below) says that I need to transform to camera coordinates first and then multiply by the projection matrix. Coming from a non-computer-vision background, I have two questions:
I looked at the numbers from calib.txt and in particular the matrix is 3x4 with non-zero values in the last column. I always thought this matrix = K[I|0], where K is the camera's intrinsic matrix. So, why is the last column non-zero and what does it mean? e.g P2 is
array([[7.070912e+02, 0.000000e+00, 6.018873e+02, 4.688783e+01],
[0.000000e+00, 7.070912e+02, 1.831104e+02, 1.178601e-01],
[0.000000e+00, 0.000000e+00, 1.000000e+00, 6.203223e-03]])
After applying projection into [u,v,w] and dividing u,v by w, are these values with respect to origin at the center of image or origin being at the top left of the image?
calib.txt: Calibration data for the cameras: P0/P1 are the 3x4
matrices after rectification. Here P0 denotes the left and P1 denotes the
right camera. Tr transforms a point from velodyne coordinates into the
left rectified camera coordinate system. In order to map a point X from the
velodyne scanner to a point x in the i'th image plane, you thus have to
transform it like:
x = Pi * Tr * X
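The README formula can be sketched in NumPy as follows. This is a minimal illustration, not the devkit's code: the function name `project_velodyne_point` is hypothetical, and it assumes `Tr` is given as a 3x4 (or 4x4) rigid transform that is padded to 4x4 before multiplying.

```python
import numpy as np

def project_velodyne_point(P, Tr, X_velo):
    """Apply x = P * Tr * X and return pixel coordinates (u, v)."""
    P = np.asarray(P, dtype=float)                       # 3x4 projection matrix
    Tr = np.asarray(Tr, dtype=float)
    # Keep the top three rows and append [0 0 0 1] to get a 4x4 transform
    Tr_h = np.vstack([Tr[:3], [0.0, 0.0, 0.0, 1.0]])
    X_h = np.append(np.asarray(X_velo, dtype=float), 1.0)  # homogeneous 3D point
    x = P.dot(Tr_h).dot(X_h)                             # homogeneous image point
    return x[:2] / x[2]                                  # divide by w

# Hypothetical calibration: identity Tr and a made-up P with zero last column
P_demo = np.array([[700.0, 0.0, 600.0, 0.0],
                   [0.0, 700.0, 180.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
uv = project_velodyne_point(P_demo, np.eye(4)[:3], [1.0, 2.0, 10.0])
```

With real KITTI data, `P` and `Tr` would come from calib.txt; the values here are placeholders.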
How to understand the KITTI camera calibration files?
Format of parameters in KITTI's calibration file
I strongly recommend you read those references above. They may solve most, if not all, of your questions.
For question 2: The projected points are expressed with respect to an origin at the top left of the image. See refs 2 and 3: a far-away 3D point on the optical axis projects to (center_x, center_y), whose values are provided in the P_rect matrices. You can also verify this with some simple code:
import numpy as np
p = np.array([[7.070912e+02, 0.000000e+00, 6.018873e+02, 4.688783e+01],
[0.000000e+00, 7.070912e+02, 1.831104e+02, 1.178601e-01],
[0.000000e+00, 0.000000e+00, 1.000000e+00, 6.203223e-03]])
x = [0, 0, 1E8, 1] # A far 3D point
y = np.dot(p, x)
y[0] /= y[2]
y[1] /= y[2]
y = y[:2]
You will see some output like:
array([6.018873e+02, 1.831104e+02 ])
which is quite near the (p[0, 2], p[1, 2]), a.k.a. (center_x, center_y).
Each of the 3x4 P matrices has the form:
P(i)rect = [[fu  0   cx  -fu*bx],
            [0   fv  cy  -fv*by],
            [0   0   1   0     ]]
The last column encodes the baselines (in meters) with respect to reference camera 0, multiplied by the negative focal length. P0 has all zeros in its last column because it is the reference camera.
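Assuming the P matrix really has the form given above, the intrinsics and baselines can be read back out of it. The helper name `decode_rectified_projection` is hypothetical, and the sign convention simply follows the layout stated in this answer.

```python
import numpy as np

# P2 as quoted in the question
P2 = np.array([[7.070912e+02, 0.000000e+00, 6.018873e+02, 4.688783e+01],
               [0.000000e+00, 7.070912e+02, 1.831104e+02, 1.178601e-01],
               [0.000000e+00, 0.000000e+00, 1.000000e+00, 6.203223e-03]])

def decode_rectified_projection(P):
    """Split P = [[fu 0 cx -fu*bx], [0 fv cy -fv*by], [0 0 1 0]] into its parts."""
    fu, fv = P[0, 0], P[1, 1]      # focal lengths in pixels
    cx, cy = P[0, 2], P[1, 2]      # principal point in pixels
    bx = -P[0, 3] / fu             # baseline along x in meters (assumed form)
    by = -P[1, 3] / fv             # baseline along y in meters (assumed form)
    return fu, fv, cx, cy, bx, by

fu, fv, cx, cy, bx, by = decode_rectified_projection(P2)
print(bx, by)
```

For P2 this gives a baseline along x of a few centimeters relative to camera 0.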
This post has more details:
How Kitti calibration matrix was calculated?

Converting pixel width to real world width in millimeters using camera calibration

Assume I know the distance between the camera and the object. How do I find out what each pixel width corresponds to in real-world units (such as cm or mm)?
Take an NxM resolution image captured with a focal length of x mm, and a planar object (assume a square) at distance D from the camera. How do I measure the square's width in millimeters or cm after doing camera calibration? Also, how do I find what one pixel width corresponds to in real-world width (cm/mm)?
Can someone explain the procedure and algorithm in detail for the calibration procedure to obtain this?
You don't need the distance of the camera to an object for that. Given the focal length in mm and the focal length in pixels taken from the intrinsics matrix you can compute the height and width of the sensor as:
sensor_width_mm = sensor_width_px * (f_mm / f_px)
sensor_height_mm = sensor_height_px * (f_mm / f_px)
Note that the ratio (f_mm / f_px) is the size of one pixel in mm.
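The relations above can be written as a short sketch. The function names are hypothetical. The first two helpers only restate the formulas from this answer; the third is an addition that is not part of the answer: under a pinhole model, for a fronto-parallel plane, it converts a width in pixels to a width on the object, and for that it does use the distance D mentioned in the question.

```python
def pixel_size_mm(f_mm, f_px):
    """Size of one sensor pixel in mm: ratio of focal length in mm to focal length in px."""
    return f_mm / f_px

def sensor_size_mm(width_px, height_px, f_mm, f_px):
    """Sensor width and height in mm from the image resolution."""
    s = pixel_size_mm(f_mm, f_px)
    return width_px * s, height_px * s

def object_width_mm(width_px, distance_mm, f_px):
    """Width on a fronto-parallel object plane at distance_mm (pinhole assumption)."""
    return width_px * distance_mm / f_px

# Hypothetical numbers: 4 mm lens, f_px = 1000 from calibration, 1920x1080 image
print(pixel_size_mm(4.0, 1000.0))
print(sensor_size_mm(1920, 1080, 4.0, 1000.0))
print(object_width_mm(100, 500.0, 1000.0))
```

The third helper ignores lens distortion and assumes the object plane is parallel to the sensor.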

3D reconstruction using the projection matrices from the trifocal tensor

I have computed the trifocal tensor and corresponding projection matrices P_0, P_1 and P_2 from line correspondences over 3 views, according to 'Multiple View Geometry by Hartley & Zisserman, 2nd edition', Chapter 16. The computed matrices are:
P_0 =
[1 0 0 0
0 1 0 0
0 0 1 0]
P_1 =
[-0.284955 -0.129918 -0.0276358 0.922516
0.122053 0.560496 0.061383 0.385913
0.00455229 -0.0114709 -0.607497 0.00589735]
P_2 =
[0.21558 -0.10182 0.00499782 0.998876
0.0079606 0.11325 0.0226247 0.047112
0.006613 -0.00260303 -0.130705 0.00512245]
Now I want to compute the 3D (Plücker) lines from these projection matrices. I know the intrinsic camera matrix K. What I don't understand is how to combine the intrinsic matrix K with the normalized projection matrices P_0, P_1 and P_2 obtained from the trifocal tensor in order to get correct 3D information. More specifically, I want to follow the triangulation procedure described by Bartoli and Sturm (Section 4, Triangulation).
I appreciate your help.
What do you mean by correct 3D information? The whole coordinate system can only be computed up to scale.
Which algorithm exactly did you use for the computation? Algorithm 16.2 in that chapter?
Why don't you use the triangulation algorithm here:

Calculating scale, rotation and translation from Homography matrix

I am trying to calculate scale, rotation and translation between two consecutive frames of a video. I matched keypoints and then used the OpenCV function findHomography() to calculate the homography matrix.
homography = findHomography(feature1 , feature2 , CV_RANSAC); //feature1 and feature2 are matched keypoints
My question is: How can I use this matrix to calculate scale, rotation and translation?
Can anyone provide code or an explanation of how to do it?
If you can use OpenCV 3.0, this decomposition method is available
The right answer is to use homography as it is defined dst = H ⋅ src and explore what it does to small segments around a particular point.
Given a single point, for translation do
T = dst - src = (H ⋅ src) - src
For rotation, take two points p1 and p2 and map them:
p1' = H ⋅ p1
p2' = H ⋅ p2
Now just calculate the angle between the vectors p1 p2 and p1' p2'.
For scale, you can use the same trick but compare the lengths instead: |p1 p2| and |p1' p2'|.
To be fair, use another segment orthogonal to the first and average the results. You will see that there is no constant scale factor or translation; both depend on the src location.
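The procedure above can be sketched as follows. The names `apply_homography` and `local_motion` and the segment length `eps` are assumptions made for this example; the translation is taken as the displacement of the point, and the two orthogonal segments are averaged as the answer suggests.

```python
import numpy as np

def apply_homography(H, pt):
    """Map a 2D point through H, including the division by w."""
    x, y, w = np.dot(H, [pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])

def local_motion(H, src, eps=1.0):
    """Local translation, rotation (radians) and scale of H around src."""
    src = np.asarray(src, dtype=float)
    dst = apply_homography(H, src)
    translation = dst - src                       # displacement of the point itself
    sines, cosines, scales = [], [], []
    # Two orthogonal segments starting at src, averaged afterwards
    for seg_src in (np.array([eps, 0.0]), np.array([0.0, eps])):
        seg_dst = apply_homography(H, src + seg_src) - dst
        angle = (np.arctan2(seg_dst[1], seg_dst[0])
                 - np.arctan2(seg_src[1], seg_src[0]))
        sines.append(np.sin(angle))
        cosines.append(np.cos(angle))
        scales.append(np.linalg.norm(seg_dst) / np.linalg.norm(seg_src))
    # Circular mean avoids problems when angles wrap around +/- pi
    rotation = np.arctan2(np.mean(sines), np.mean(cosines))
    return translation, rotation, np.mean(scales)

# Hypothetical test homography: a pure similarity (scale 2, rotation 30 degrees)
c, s = np.cos(np.deg2rad(30.0)), np.sin(np.deg2rad(30.0))
H_demo = np.array([[2 * c, -2 * s, 5.0], [2 * s, 2 * c, -3.0], [0.0, 0.0, 1.0]])
t_demo, rot_demo, scale_demo = local_motion(H_demo, [10.0, 20.0])
```

For a general homography, calling `local_motion` at different `src` positions gives different values, which is the point made in this answer.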
Given the homography matrix H:
    |H_00, H_01, H_02|
H = |H_10, H_11, H_12|
    |H_20, H_21, H_22|
Assume H_20 = H_21 = 0 (so H is affine, with 6 DOF rather than the 8 DOF of a general homography) and that H is normalized so that H_22 = 1.
The translations along the x and y axes can then be read directly from H:
tx = H_02
ty = H_12
The 2x2 sub matrix on the top left corner is decomposed to calculate shear, scaling and rotation. An easy and quick decomposition method is explained here.
Note: this method assumes invertible matrix.
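One possible decomposition of the affine case is sketched below. It is not necessarily the method linked above; it is a common choice that factors the 2x2 block as a rotation times an upper-triangular scale/shear matrix. The function name `decompose_affine` and the returned keys are hypothetical.

```python
import numpy as np

def decompose_affine(H):
    """Split an affine H (H_20 = H_21 = 0) into translation, rotation, scale, shear.

    Assumes the 2x2 block factors as R(theta) * [[sx, sx*shear], [0, sy]]
    and that the block is invertible.
    """
    H = np.asarray(H, dtype=float)
    H = H / H[2, 2]                          # normalize so that H_22 = 1
    a, b = H[0, 0], H[0, 1]
    c, d = H[1, 0], H[1, 1]
    sx = np.hypot(a, c)                      # length of the first column
    theta = np.arctan2(c, a)                 # rotation angle in radians
    shear = (a * b + c * d) / (sx * sx)      # shear factor along x
    sy = (a * d - b * c) / sx                # determinant divided by sx
    return {"tx": H[0, 2], "ty": H[1, 2], "theta": theta,
            "sx": sx, "sy": sy, "shear": shear}

# Hypothetical input: scale 2, rotation 30 degrees, translation (5, -3)
c30, s30 = np.cos(np.deg2rad(30.0)), np.sin(np.deg2rad(30.0))
params = decompose_affine([[2 * c30, -2 * s30, 5.0],
                           [2 * s30, 2 * c30, -3.0],
                           [0.0, 0.0, 1.0]])
```

A negative `sy` indicates a reflection (negative determinant).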
Since I had to struggle for a couple of days to write my homography transformation function, I'm going to put it here for the benefit of everyone.
Here you can see the main loop where every input position is multiplied by the homography matrix h. Then the result is used to copy the pixel from the original position to the destination position.
for (tempIn[0] = 0; tempIn[0] < stride; tempIn[0]++) {
    for (tempIn[1] = 0; tempIn[1] < rows; tempIn[1]++) {
        double w = h[6] * tempIn[0] + h[7] * tempIn[1] + 1; // very important!
        //H_20 = H_21 = 0 and normalized to H_22 = 1 to obtain 8 DOF. <-- this is wrong
        tempOut[0] = ((h[0] * tempIn[0]) + (h[1] * tempIn[1]) + h[2]) / w;
        tempOut[1] = ((h[3] * tempIn[0]) + (h[4] * tempIn[1]) + h[5]) / w;
        // copy the pixel only if the destination lies inside the output image
        if (tempOut[1] < destSize && tempOut[0] < destSize && tempOut[0] >= 0 && tempOut[1] >= 0) {
            dest_[destStride * tempOut[1] + tempOut[0]] = src_[stride * tempIn[1] + tempIn[0]];
        }
    }
}
After this process the output image contains a grid-like artifact, so some kind of filter is needed to remove it. In my code I used a simple linear filter.
Note: Only the central part of the original image is really required for producing a correct image. Some rows and columns can be safely discarded.
There are multiple approaches for estimating the three-dimensional rotation and translation induced by a homography. One of them provides closed-form formulas for decomposing the homography, but they are very complex. Also, the solutions are never unique.
Luckily, OpenCV 3 already implements this decomposition (decomposeHomographyMat). Given a homography and a correctly scaled intrinsics matrix, the function returns up to four candidate rotations and translations.
The question seems to be about 2D parameters. Homography matrix captures perspective distortion. If the application does not create much perspective distortion, one can approximate a real world transformation using affine transformation matrix (that uses only scale, rotation, translation and no shearing/flipping). The following link will give an idea about decomposing an affine transformation into different parameters.

Get 3D coordinates from 2D image pixel if extrinsic and intrinsic parameters are known

I am doing camera calibration using Tsai's algorithm. I got the intrinsic and extrinsic matrices, but how can I reconstruct the 3D coordinates from that information?
1) I can use Gaussian elimination to find X, Y, Z, W, and then the point will be X/W, Y/W, Z/W as a homogeneous system.
2) I can use the OpenCV documentation approach: as I know u, v, R, t, I can compute X, Y, Z.
However, both methods end up in different results that are not correct.
What am I doing wrong?
If you got extrinsic parameters then you got everything. That means that you can have Homography from the extrinsics (also called CameraPose). Pose is a 3x4 matrix, homography is a 3x3 matrix, H defined as
H = K*[r1, r2, t], //eqn 8.1, Hartley and Zisserman
with K being the camera intrinsic matrix, r1 and r2 being the first two columns of the rotation matrix, R; t is the translation vector.
Then normalize by dividing everything by t3.
What happens to column r3, don't we use it? No, because it is redundant: it is the cross product of the first two columns of the pose.
Now that you have the homography, project the points. Your 2D points are x, y. Append z = 1 so that they become homogeneous 3-vectors. Project them as follows:
p = [x y 1];
projection = H * p; //project
projnorm = projection / projection(3); //normalize by the third component
Hope this helps.
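A minimal NumPy sketch of this idea follows, under the assumption that the points of interest lie on the world plane Z = 0 and that the extrinsics follow X_cam = R * X_world + t. The function names are hypothetical. Since H as defined above maps plane coordinates to pixels, the sketch solves with H to go from a pixel back to the plane.

```python
import numpy as np

def plane_homography(K, R, t):
    """H = K [r1 r2 t]: maps (X, Y, 1) on the world plane Z = 0 to pixels."""
    return np.dot(K, np.column_stack([R[:, 0], R[:, 1], t]))

def pixel_to_plane(K, R, t, u, v):
    """Back-project pixel (u, v) onto the world plane Z = 0."""
    H = plane_homography(K, R, t)
    X = np.linalg.solve(H, np.array([u, v, 1.0]))   # same as inv(H) * [u v 1]
    return X[:2] / X[2]                             # normalize the homogeneous result

# Hypothetical calibration: camera 5 units in front of the plane, no rotation
K_demo = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
XY = pixel_to_plane(K_demo, np.eye(3), np.array([0.0, 0.0, 5.0]), 480.0, 560.0)
```

For points that are not on that plane, a single image does not determine the 3D position.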
As nicely stated in the comments above, projecting 2D image coordinates into 3D "camera space" inherently requires making up the z coordinates, as this information is totally lost in the image. One solution is to assign a dummy value (z = 1) to each of the 2D image space points before projection as answered by Jav_Rock.
p = [x y 1];
projection = H * p; //project
projnorm = projection / projection(3); //normalize by the third component
One interesting alternative to this dummy solution is to train a model to predict the depth of each point prior to reprojection into 3D camera space. I tried this method and had a high degree of success using a PyTorch CNN trained on 3D bounding boxes from the KITTI dataset. I would be happy to provide code, but it would be a bit lengthy for posting here.
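Once a depth value is available for a pixel, whether from a dummy value or from a predicted depth, the back-projection itself is short. The sketch below is not the CNN mentioned above; it only shows the geometric step, assuming the extrinsics follow X_cam = R * X_world + t, that `depth` is the z coordinate in the camera frame, and that K has last row [0, 0, 1]. The function name `backproject` is hypothetical.

```python
import numpy as np

def backproject(K, R, t, u, v, depth):
    """Recover the world point seen at pixel (u, v) with known camera-frame depth."""
    ray = np.linalg.solve(K, np.array([u, v, 1.0]))  # K^-1 [u v 1], z component is 1
    X_cam = depth * ray                              # point in camera coordinates
    return np.dot(R.T, X_cam - t)                    # invert X_cam = R X_world + t

# Hypothetical round trip: project a known point, then back-project it
K_rt = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R_rt = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_rt = np.array([1.0, 2.0, 3.0])
X_world = np.array([0.5, -0.2, 4.0])
X_cam = np.dot(R_rt, X_world) + t_rt
uvw = np.dot(K_rt, X_cam)
recovered = backproject(K_rt, R_rt, t_rt, uvw[0] / uvw[2], uvw[1] / uvw[2], X_cam[2])
```

With the dummy value depth = 1 the result is only a point on the viewing ray, not the true 3D position.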