kitti dataset camera projection matrix - computer-vision

I am looking at the kitti dataset and particularly how to convert a world point into the image coordinates. I looked at the README and it says below that I need to transform to camera coordinates first then multiply by the projection matrix. I have 2 questions, coming from a non computer vision background
I looked at the numbers from calib.txt and in particular the matrix is 3x4 with non-zero values in the last column. I always thought this matrix = K[I|0], where K is the camera's intrinsic matrix. So, why is the last column non-zero and what does it mean? e.g P2 is
array([[7.070912e+02, 0.000000e+00, 6.018873e+02, 4.688783e+01],
[0.000000e+00, 7.070912e+02, 1.831104e+02, 1.178601e-01],
[0.000000e+00, 0.000000e+00, 1.000000e+00, 6.203223e-03]])
After applying projection into [u,v,w] and dividing u,v by w, are these values with respect to origin at the center of image or origin being at the top left of the image?
README:
calib.txt: Calibration data for the cameras: P0/P1 are the 3x4
projection
matrices after rectification. Here P0 denotes the left and P1 denotes the
right camera. Tr transforms a point from velodyne coordinates into the
left rectified camera coordinate system. In order to map a point X from the
velodyne scanner to a point x in the i'th image plane, you thus have to
transform it like:
x = Pi * Tr * X

Refs:
How to understand the KITTI camera calibration files?
Format of parameters in KITTI's calibration file
http://www.cvlibs.net/publications/Geiger2013IJRR.pdf
Answer:
I strongly recommend you read those references above. They may solve most, if not all, of your questions.
For question 2: The projected points on images are with respect to origin at the top left. See ref 2 & 3, the coordinates of a far 3d point in image are (center_x, center_y), whose values are provided in the P_rect matrices. Or you can verify this with some simple codes:
import numpy as np
p = np.array([[7.070912e+02, 0.000000e+00, 6.018873e+02, 4.688783e+01],
[0.000000e+00, 7.070912e+02, 1.831104e+02, 1.178601e-01],
[0.000000e+00, 0.000000e+00, 1.000000e+00, 6.203223e-03]])
x = [0, 0, 1E8, 1] # A far 3D point
y = np.dot(p, x)
y[0] /= y[2]
y[1] /= y[2]
y = y[:2]
print(y)
You will see some output like:
array([6.018873e+02, 1.831104e+02 ])
which is quite near the (p[0, 2], p[1, 2]), a.k.a. (center_x, center_y).

For all the P matrices (3x4), they represent:
P(i)rect = [[fu 0 cx -fu*bx],
[0 fv cy -fv*by],
[0 0 1 0]]
Last column are baselines in meters w.r.t. the reference camera 0. You can see the P0 has all zeros in the last column because it is the reference camera.
This post has more details:
How Kitti calibration matrix was calculated?

Related

How can I find the angle between two chessboard planes?

I have two chessboard poses obtained with solvePnp:
Mat rotationVector1, translationVector1;
solvePnP(chess1WorldPoints, chess1ImagePoints, intrinsicMatrix, distortCoefficients, rotationVector1, translationVector1);
Mat rotationVector2, translationVector2;
solvePnP(chess2WorldPoints, chess2ImagePoints, intrinsicMatrix, distortCoefficients, rotationVector2, translationVector2);
How can I check if the planes of the poses are parallel, or find the angle between these planes?
More info
I tried obtaining Euler angles and computing the difference between each alpha, beta and gamma but that only tells me relative rotation for each axis I think:
Vec3d eulerAnglesPose1;
Mat rotationMatrix1;
Rodrigues(rotationVector1, rotationMatrix1);
getEulerAngles(rotationMatrix1, eulerAngles1);
Vec3d eulerAnglesPose2;
Mat rotationMatrix2;
Rodrigues(rotationVector2, rotationMatrix2);
getEulerAngles(rotationMatrix2, eulerAngles2);
I used the getEulerAngles implementation from Camera Rotation SolvePnp :
void getEulerAngles(Mat &rotCamerMatrix, Vec3d &eulerAngles)
{
Mat cameraMatrix, rotMatrix, transVect, rotMatrixX, rotMatrixY, rotMatrixZ;
double* _r = rotCamerMatrix.ptr<double>();
double projMatrix[12] =
{
_r[0],_r[1],_r[2],0,
_r[3],_r[4],_r[5],0,
_r[6],_r[7],_r[8],0
};
decomposeProjectionMatrix(Mat(3, 4, CV_64FC1, projMatrix), cameraMatrix, rotMatrix, transVect, rotMatrixX, rotMatrixY, rotMatrixZ, eulerAngles);
}
Edit
In my case a rotation-translation pair (R, T) gives the correspondence between a coordinate system where the camera is at (0,0,0) (the camera coordinate system) to a coordinate system where (0,0,0) is something I defined in the first two parameters of solvePnp (the world coordinate system). So I have two world coordinate systems relative to the same camera coordinate system.
If I could switch from coord. system 2 to coord. system 1 I could use the Z=0 planes for each one to find the normals and solve my problem.
I think that for example switching from coord. system 2 to camera system should be done like in this post:
Rinv = R' (just the transpose as it's a rotation matrix)
Tinv = -Rinv * T (T is 3x1 column vector)
Then if Pw = [X Y Z] is a point in world coord. system 2 I can get its camera system coords.with:
Pc = [ Rinv Tinv] * [X Y Z 1] transposed.
Pc looks like [a b c d]
Following the same logic again I can get the coordinates of Pc relative to coord. system 1:
Pw1 = [ R1 T1] * Pc
Should I normalize Pc or just normalize Pw1 at the end?
I found how to translate points between coordinate systems in this OpenCV Demo.
The explanation from "Demo 3: Homography from the camera displacement" (the section spanning from title right until the first lines of code) shows how to translate between coordinate systems using matrix multiplication. I just had to apply it to my situation (I had CMO1 and CMO2 and needed to find O1MO2).
This way I can get two planes in the same coord. system, get their normals and find the angle between them.
Also it helped to realize that the extrinsic matrix [R T] translates a 3D point from the world coord. system to the camera coord. system (where camera is at (0,0,0)), not the other way around.

How to find an Equivalent point in a Scaled down image?

I would like to calculate the corner points or contours of the star in this in a Larger image. For that I'm scaling down the size to a smaller one & I'm able to get this points clearly. Now How to map this point in original image? I'm using opencv c++.
Consider a trivial example: the image size is reduced exactly by half.
So, the cartesian coordinate (x, y) in the original image becomes coordinate (x/2, y/2) in the reduced image, and coordinate (x', y') in the reduced image corresponds to coordinate (x*2, y*2) in the original image.
Of course, fractional coordinates get typically rounded off, in a reduced scale image, so the exact mapping is only possible for even-numbered coordinates in this example's original image.
Generalizing this, if the image's width is scaled by a factor of w horizontally and h vertically, coordinate (x, y) becomes coordinate(x*w, y*h), rounded off. In the example I gave, both w and h are 1/2, or .5
You should be able to figure out the values of w and h yourself, and be able to map the coordinates trivially. Of course, due to rounding off, you will not be able to compute the exact coordinates in the original image.
I realize this is an old question. I just wanted to add to Sam's answer above, to deal with "rounding off", in case other readers are wondering the same thing I faced.
This rounding off becomes obvious for even # of pixels across a coordinate axis. For instance, along a 1-D axis, a point demarcating the 2nd quartile gets mapped to an inaccurate value:
axis_prev = [0, 1, 2, 3]
axis_new = [0, 1, 2, 3, 4, 5, 6, 7]
w_prev = len(axis_prev) # This is an axis of length 4
w_new = len(axis_new) # This is an axis of length 8
x_prev = 2
x_new = x_prev * w_new / w_prev
print(x_new)
>>> 4
### x_new should be 5
In Python, one strategy would be to linearly interpolate values from one axis resolution to another axis resolution. Say for the above, we wish to map a point from the smaller image to its corresponding point of the star in the larger image:
import numpy as np
from scipy.interpolate import interp1d
x_old = np.linspace(0, 640, 641)
x_new = np.linspace(0, 768, 769)
f = interp1d(x_old, x_new)
x = 35
x_prime = f(x)

An inconsistency in my understanding of the GLM lookAt function

Firstly, if you would like an explanation of the GLM lookAt algorithm, please look at the answer provided on this question: https://stackoverflow.com/a/19740748/1525061
mat4x4 lookAt(vec3 const & eye, vec3 const & center, vec3 const & up)
{
vec3 f = normalize(center - eye);
vec3 u = normalize(up);
vec3 s = normalize(cross(f, u));
u = cross(s, f);
mat4x4 Result(1);
Result[0][0] = s.x;
Result[1][0] = s.y;
Result[2][0] = s.z;
Result[0][1] = u.x;
Result[1][1] = u.y;
Result[2][1] = u.z;
Result[0][2] =-f.x;
Result[1][2] =-f.y;
Result[2][2] =-f.z;
Result[3][0] =-dot(s, eye);
Result[3][1] =-dot(u, eye);
Result[3][2] = dot(f, eye);
return Result;
}
Now I'm going to tell you why I seem to be having a conceptual issue with this algorithm. There are two parts to this view matrix, the translation and the rotation. The translation does the correct inverse transformation, bringing the camera position to the origin, instead of the origin position to the camera. Similarly, you expect the rotation that the camera defines to be inversed before being put into this view matrix as well. I can't see that happening here, that's my issue.
Consider the forward vector, this is where your camera looks at. Consequently, this forward vector needs to be mapped to the -Z axis, which is the forward direction used by openGL. The way this view matrix is suppose to work is by creating an orthonormal basis in the columns of the view matrix, so when you multiply a vertex on the right hand side of this matrix, you are essentially just converting it's coordinates to that of different axes.
When I play the rotation that occurs as a result of this transformation in my mind, I see a rotation that is not the inverse rotation of the camera, like what's suppose to happen, rather I see the non-inverse. That is, instead of finding the camera forward being mapped to the -Z axis, I find the -Z axis being mapped to the camera forward.
If you don't understand what I mean, consider a 2D example of the same type of thing that is happening here. Let's say the forward vector is (sqr(2)/2 , sqr(2)/2), or sin/cos of 45 degrees, and let's also say a side vector for this 2D camera is sin/cos of -45 degrees. We want to map this forward vector to (0,1), the positive Y axis. The positive Y axis can be thought of as the analogy to the -Z axis in openGL space. Let's consider a vertex in the same direction as our forward vector, namely (1,1). By using the logic of GLM.lookAt, we should be able to map (1,1) to the Y axis by using a 2x2 matrix that consists of the forward vector in the first column and the side vector in the second column. This is an equivalent calculation of that calculation http://www.wolframalpha.com/input/?i=%28sqr%282%29%2F2+%2C+sqr%282%29%2F2%29++1+%2B+%28sqr%282%29%2F2%2C+-sqr%282%29%2F2+%29+1.
Note that you don't get your (1,1) vertex mapped the positive Y axis like you wanted, instead you have it mapped to the positive X axis. You might also consider what happened to a vertex that was on the positive Y axis if you applied this transformation. Sure enough, it is transformed to the forward vector.
Therefore it seems like something very fishy is going on with the GLM algorithm. However, I doubt this algorithm is incorrect since it is so popular. What am I missing?
Have a look at GLU source code in Mesa: http://cgit.freedesktop.org/mesa/glu/tree/src/libutil/project.c
First in the implementation of gluPerspective, notice the -1 is using the indices [2][3] and the -2 * zNear * zFar / (zFar - zNear) is using [3][2]. This implies that the indexing is [column][row].
Now in the implementation of gluLookAt, the first row is set to side, the next one to up and the final one to -forward. This gives you the rotation matrix which is post-multiplied by the translation that brings the eye to the origin.
GLM seems to be using the same [column][row] indexing (from the code). And the piece you just posted for lookAt is consistent with the more standard gluLookAt (including the translational part). So at least GLM and GLU agree.
Let's then derive the full construction step by step. Noting C the center position and E the eye position.
Move the whole scene to put the eye position at the origin, i.e. apply a translation of -E.
Rotate the scene to align the axes of the camera with the standard (x, y, z) axes.
2.1 Compute a positive orthonormal basis for the camera:
f = normalize(C - E) (pointing towards the center)
s = normalize(f x u) (pointing to the right side of the eye)
u = s x f (pointing up)
with this, (s, u, -f) is a positive orthonormal basis for the camera.
2.2 Find the rotation matrix R that aligns maps the (s, u, -f) axes to the standard ones (x, y, z). The inverse rotation matrix R^-1 does the opposite and aligns the standard axes to the camera ones, which by definition means that:
(sx ux -fx)
R^-1 = (sy uy -fy)
(sz uz -fz)
Since R^-1 = R^T, we have:
( sx sy sz)
R = ( ux uy uz)
(-fx -fy -fz)
Combine the translation with the rotation. A point M is mapped by the "look at" transform to R (M - E) = R M - R E = R M + t. So the final 4x4 transform matrix for "look at" is indeed:
( sx sy sz tx ) ( sx sy sz -s.E )
L = ( ux uy uz ty ) = ( ux uy uz -u.E )
(-fx -fy -fz tz ) (-fx -fy -fz f.E )
( 0 0 0 1 ) ( 0 0 0 1 )
So when you write:
That is, instead of finding the camera forward being mapped to the -Z
axis, I find the -Z axis being mapped to the camera forward.
it is very surprising, because by construction, the "look at" transform maps the camera forward axis to the -z axis. This "look at" transform should be thought as moving the whole scene to align the camera with the standard origin/axes, it's really what it does.
Using your 2D example:
By using the logic of GLM.lookAt, we should be able to map (1,1) to the Y
axis by using a 2x2 matrix that consists of the forward vector in the
first column and the side vector in the second column.
That's the opposite, following the construction I described, you need a 2x2 matrix with the forward and row vector as rows and not columns to map (1, 1) and the other vector to the y and x axes. To use the definition of the matrix coefficients, you need to have the images of the standard basis vectors by your transform. This gives directly the columns of the matrix. But since what you are looking for is the opposite (mapping your vectors to the standard basis vectors), you have to invert the transformation (transpose, since it's a rotation). And your reference vectors then become rows and not columns.
These guys might give some further insights to your fishy issue:
glm::lookAt vertical camera flips when z <= 0
The answer might be of interest to you?

Calculating a line from a starting point and angle in 3d

I have a point in 3D space and two angles, I want to calculate the resulting line from this information. I have found how to do this with 2D lines, but not 3D. How can this be calculated?
If it helps: I'm using C++ & OpenGL and have the location of the user's mouse click and the angle of the camera, I want to trace this line for intersections.
In trig terms two angles and a point are required to define a line in 3d space. Converting that to (x,y,z) is just polar coordinates to cartesian coordinates the equations are:
x = r sin(q) cos(f)
y = r sin(q) sin(f)
z = r cos(q)
Where r is the distance from the point P to the origin; the angle q (zenith) between the line OP and the positive polar axis (can be thought of as the z-axis); and the angle f (azimuth) between the initial ray and the projection of OP onto the equatorial plane(usually measured from the x-axis).
Edit:
Okay that was the first part of what you ask. The rest of it, the real question after the updates to the question, is much more complicated than just creating a line from 2 angles and a point in 3d space. This involves using a camera-to-world transformation matrix and was covered in other SO questions. For convenience here's one: How does one convert world coordinates to camera coordinates? The answers cover converting from world-to-camera and camera-to-world.
The line can be fathomed as a point in "time". The equation must be vectorized, or have a direction to make sense, so time is a natural way to think of it. So an equation of a line in 3 dimensions could really be three two dimensional equations of x,y,z related to time, such as:
x = ax*t + cx
y = ay*t + cy
z = az*t + cz
To find that set of equations, assuming the camera is at origin, (0,0,0), and your point is (x1,y1,z1) then
ax = x1 - 0
ay = y1 - 0
az = z1 - 0
cx = cy = cz = 0
so
x = x1*t
y = y1*t
z = z1*t
Note: this also assumes that the "speed" of the line or vector is such that it is at your point (x1,y1,z1) after 1 second.
So to draw that line just fill in the points as fine as you like for as long as required, such as every 1/1000 of a second for 10 seconds or something, might draw a "line", really a series of points that when seen from a distance appear as a line, over 10 seconds worth of distance, determined by the "speed" you choose.

Get 3D coordinates from 2D image pixel if extrinsic and intrinsic parameters are known

I am doing camera calibration from tsai algo. I got intrensic and extrinsic matrix, but how can I reconstruct the 3D coordinates from that inormation?
1) I can use Gaussian Elimination for find X,Y,Z,W and then points will be X/W , Y/W , Z/W as homogeneous system.
2) I can use the
OpenCV documentation approach:
as I know u, v, R , t , I can compute X,Y,Z.
However both methods end up in different results that are not correct.
What am I'm doing wrong?
If you got extrinsic parameters then you got everything. That means that you can have Homography from the extrinsics (also called CameraPose). Pose is a 3x4 matrix, homography is a 3x3 matrix, H defined as
H = K*[r1, r2, t], //eqn 8.1, Hartley and Zisserman
with K being the camera intrinsic matrix, r1 and r2 being the first two columns of the rotation matrix, R; t is the translation vector.
Then normalize dividing everything by t3.
What happens to column r3, don't we use it? No, because it is redundant as it is the cross-product of the 2 first columns of pose.
Now that you have homography, project the points. Your 2d points are x,y. Add them a z=1, so they are now 3d. Project them as follows:
p = [x y 1];
projection = H * p; //project
projnorm = projection / p(z); //normalize
Hope this helps.
As nicely stated in the comments above, projecting 2D image coordinates into 3D "camera space" inherently requires making up the z coordinates, as this information is totally lost in the image. One solution is to assign a dummy value (z = 1) to each of the 2D image space points before projection as answered by Jav_Rock.
p = [x y 1];
projection = H * p; //project
projnorm = projection / p(z); //normalize
One interesting alternative to this dummy solution is to train a model to predict the depth of each point prior to reprojection into 3D camera-space. I tried this method and had a high degree of success using a Pytorch CNN trained on 3D bounding boxes from the KITTI dataset. Would be happy to provide code but it'd be a bit lengthy for posting here.