How to convert 2D (x,y) coordinates into 3D (x,y,z) coordinates using Python and a point cloud? - computer-vision

I have been using this GitHub repo: https://github.com/aim-uofa/AdelaiDepth/blob/main/LeReS/Minist_Test/tools/test_shape.py
to figure out how this piece of code can be used to get x, y, z coordinates:
def reconstruct_3D(depth, f):
    """
    Reconstruct depth to 3D pointcloud with the provided focal length.
    Return:
        pcd: N X 3 array, point cloud
    """
    cu = depth.shape[1] / 2
    cv = depth.shape[0] / 2
    width = depth.shape[1]
    height = depth.shape[0]
    row = np.arange(0, width, 1)
    u = np.array([row for i in np.arange(height)])
    col = np.arange(0, height, 1)
    v = np.array([col for i in np.arange(width)])
    v = v.transpose(1, 0)
I want to use these coordinates to find the distance between 2 people in 3D for an object detection model. Does anyone have any advice?
I know how to use 2D images with YOLO to figure out the distance between 2 people, based on this link: Compute the centroid of a rectangle in python
My thinking is that I can use the bounding boxes to get the corners, find the centroid of each of the 2 bounding boxes of people, and use triangulation to find the hypotenuse between the 2 points (which is their distance).
However, I am having a tricky time working out how to use a set of 3D coordinates to find the distance between 2 people. I can get the relative distance from my 2D model.

Given a 2D depth image and the camera's intrinsic matrix, you can convert each pixel (u, v) with depth d to a 3D point as:
z = d
x = (u - cx) * z / f
y = (v - cy) * z / f
// where (cx, cy) is the principal point and f is the focal length.
Alternatively, you can use a third-party library such as Open3D to do the same (in recent Open3D releases this function is named open3d.geometry.PointCloud.create_from_depth_image):
xyz = open3d.geometry.create_point_cloud_from_depth_image(depth, intrinsic)
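To connect the formulas above to the distance question, here is a minimal NumPy sketch. It assumes a depth map stored as an H x W array, a single focal length f in pixels, and a principal point that defaults to the image center. The helper names (depth_to_points, box_center_3d, distance_between) and the (x1, y1, x2, y2) box layout are illustrative assumptions, not part of the linked repo.

```python
import numpy as np

def depth_to_points(depth, f, cx=None, cy=None):
    """Back-project an H x W depth map into an H x W x 3 array of (x, y, z)."""
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    # Assumption: the principal point is the image center unless given.
    cx = w / 2 if cx is None else cx
    cy = h / 2 if cy is None else cy
    v, u = np.indices((h, w))  # v: row index, u: column index
    z = depth
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.stack([x, y, z], axis=-1)

def box_center_3d(points, box):
    """3D point at the centroid pixel of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    u = int(round((x1 + x2) / 2))
    v = int(round((y1 + y2) / 2))
    return points[v, u]

def distance_between(points, box_a, box_b):
    """Euclidean distance between the 3D centroids of two boxes."""
    diff = box_center_3d(points, box_a) - box_center_3d(points, box_b)
    return float(np.linalg.norm(diff))
```

The result is in the same units as the depth map. If the depth is only relative (scale unknown), the distance is relative as well. Sampling a single centroid pixel is fragile, so taking the median depth inside each box may be more robust.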

Related

kitti dataset camera projection matrix

I am looking at the KITTI dataset and particularly at how to convert a world point into image coordinates. I looked at the README, which says (below) that I need to transform to camera coordinates first and then multiply by the projection matrix. I have 2 questions, coming from a non-computer-vision background:
1. I looked at the numbers in calib.txt, and in particular the matrix is 3x4 with non-zero values in the last column. I always thought this matrix = K[I|0], where K is the camera's intrinsic matrix. So why is the last column non-zero, and what does it mean? E.g. P2 is
array([[7.070912e+02, 0.000000e+00, 6.018873e+02, 4.688783e+01],
       [0.000000e+00, 7.070912e+02, 1.831104e+02, 1.178601e-01],
       [0.000000e+00, 0.000000e+00, 1.000000e+00, 6.203223e-03]])
2. After applying the projection to get [u,v,w] and dividing u,v by w, are these values with respect to an origin at the center of the image or at the top left of the image?
README:
calib.txt: Calibration data for the cameras: P0/P1 are the 3x4 projection
matrices after rectification. Here P0 denotes the left and P1 denotes the
right camera. Tr transforms a point from velodyne coordinates into the
left rectified camera coordinate system. In order to map a point X from the
velodyne scanner to a point x in the i'th image plane, you thus have to
transform it like:
x = Pi * Tr * X
Refs:
How to understand the KITTI camera calibration files?
Format of parameters in KITTI's calibration file
http://www.cvlibs.net/publications/Geiger2013IJRR.pdf
Answer:
I strongly recommend you read those references above. They may solve most, if not all, of your questions.
For question 2: The projected points on the image are expressed with respect to an origin at the top left. See refs 2 & 3: a very distant 3D point on the optical axis projects to (center_x, center_y), whose values are provided in the P_rect matrices. You can also verify this with some simple code:
import numpy as np
p = np.array([[7.070912e+02, 0.000000e+00, 6.018873e+02, 4.688783e+01],
              [0.000000e+00, 7.070912e+02, 1.831104e+02, 1.178601e-01],
              [0.000000e+00, 0.000000e+00, 1.000000e+00, 6.203223e-03]])
x = [0, 0, 1E8, 1] # A far 3D point
y = np.dot(p, x)
y[0] /= y[2]
y[1] /= y[2]
y = y[:2]
print(y)
You will see some output like:
array([6.018873e+02, 1.831104e+02 ])
which is quite near the (p[0, 2], p[1, 2]), a.k.a. (center_x, center_y).
All the P matrices (3x4) have the form:
P(i)rect = [[fu  0  cx  -fu*bx],
            [ 0 fv  cy  -fv*by],
            [ 0  0   1       0]]
The last column encodes the baselines bx, by (in meters) w.r.t. the reference camera 0, scaled by the negated focal lengths. You can see that P0 has all zeros in the last column because it is the reference camera.
This post has more details:
How Kitti calibration matrix was calculated?
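As a rough illustration of the matrix form above, the sketch below reads the intrinsics and the baselines back out of the P2 matrix from the question and projects a point to pixel coordinates. The function names split_projection and project are made up for this example. It also assumes the matrix follows the stated form, ignoring the small non-zero entry in the third row of P2.

```python
import numpy as np

P2 = np.array([[7.070912e+02, 0.0, 6.018873e+02, 4.688783e+01],
               [0.0, 7.070912e+02, 1.831104e+02, 1.178601e-01],
               [0.0, 0.0, 1.0, 6.203223e-03]])

def split_projection(P):
    """Return (fu, fv, cx, cy, bx, by), assuming P = K [I | t] as above."""
    fu, fv = P[0, 0], P[1, 1]
    cx, cy = P[0, 2], P[1, 2]
    bx = -P[0, 3] / fu  # baseline along x, in meters
    by = -P[1, 3] / fv  # baseline along y, in meters
    return fu, fv, cx, cy, bx, by

def project(P, point_xyz):
    """Project a 3D point (rectified camera-0 frame) to pixels (top-left origin)."""
    u, v, w = P @ np.append(point_xyz, 1.0)
    return np.array([u / w, v / w])

fu, fv, cx, cy, bx, by = split_projection(P2)
print(f"bx = {bx:.4f} m, by = {by:.6f} m")
```

For this P2 the x baseline comes out at roughly -0.066 m, i.e. a few centimeters of offset relative to camera 0.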

How to determine the intersection between the camera direction and a plane?

I have a 3D scene with an infinite horizontal plane (parallel to the xz coordinates) at a height H along the Y vertical axis.
I would like to know how to determine the intersection between the axis of my camera and this plane.
The camera is defined by a view-matrix and a projection-matrix.
There are two sub-problems here: 1) Extracting the position and view-direction from the camera matrix. 2) Calculating the intersection between the view-ray and the plane.
Extracting position and view-direction
The view matrix describes how points are transformed from world-space to view space. The view-space in OpenGL is usually defined such that the camera is in the origin and looks into the -z direction.
To get the position of the camera, we have to transform the origin [0,0,0] of the view-space back into world-space. Mathematically speaking, we have to calculate:
camera_pos_ws = inverse(view_matrix) * [0,0,0,1]
but looking at the equation, we see that we are only interested in the 4th column of the inverse matrix. For a view matrix built from a rotation R and a translation T, this column is -transpose(R) * T (see notes 1 and 2):
camera_pos_ws = [-(view_matrix[0]*view_matrix[12] + view_matrix[1]*view_matrix[13] + view_matrix[2]*view_matrix[14]),
                 -(view_matrix[4]*view_matrix[12] + view_matrix[5]*view_matrix[13] + view_matrix[6]*view_matrix[14]),
                 -(view_matrix[8]*view_matrix[12] + view_matrix[9]*view_matrix[13] + view_matrix[10]*view_matrix[14])]
The orientation of the camera can be found by a similar calculation. We know that the camera looks in the -z direction in view-space, thus the world-space direction is given by
camera_dir_ws = inverse(view_matrix) * [0,0,-1,0];
Again, looking at the equation, we see that this only takes the third column of the inverse matrix into account, which is the third row of the rotation part of the view matrix (see note 2):
camera_dir_ws = [-view_matrix[2], -view_matrix[6], -view_matrix[10]]
Calculating the intersection
We now know the camera position P and the view direction D, thus we have to find the x,z values along the ray R(x,y,z) = P + l * D where y equals H. Since there is only one unknown, l, we can calculate it from
y = Py + l * Dy
H = Py + l * Dy
l = (H - Py) / Dy
The intersection point is then given by substituting l back into the ray equation. If Dy is 0, the view direction is parallel to the plane and there is no intersection.
Notes
1 The indices assume that the matrix is stored in a column-major linear array.
2 Note that the inverse of a matrix of the form
M = [ R T ]
    [ 0 1 ]
, where R is an orthogonal 3x3 matrix, is given by
inv(M) = [ transpose(R)  -transpose(R) * T ]
         [ 0              1                ]
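The two steps of this answer can be sketched in NumPy as follows. This assumes the view matrix is a rigid transform (rotation plus translation, no scaling) given as a 4x4 array in ordinary row/column layout; a column-major flat array m could be converted with np.asarray(m).reshape(4, 4).T. The look_at helper is only an illustrative way to build a test matrix, and rejecting a negative l (plane behind the camera) is an extra check that the answer itself does not discuss.

```python
import numpy as np

def look_at(eye, target, up=(0.0, 1.0, 0.0)):
    """Illustrative helper: build a right-handed view matrix looking down -z."""
    eye, target, up = (np.asarray(a, dtype=float) for a in (eye, target, up))
    f = target - eye
    f /= np.linalg.norm(f)
    s = np.cross(f, up)
    s /= np.linalg.norm(s)
    u = np.cross(s, f)
    view = np.eye(4)
    view[:3, :3] = np.vstack([s, u, -f])   # rotation R
    view[:3, 3] = -view[:3, :3] @ eye      # translation T = -R * eye
    return view

def camera_ray_from_view(view):
    """Return (position, direction) of the camera in world space."""
    view = np.asarray(view, dtype=float)
    R, T = view[:3, :3], view[:3, 3]
    position = -R.T @ T     # 4th column of inverse(view)
    direction = -R[2, :]    # inverse(view) * [0, 0, -1, 0]
    return position, direction

def intersect_horizontal_plane(position, direction, H, eps=1e-12):
    """Intersect the ray P + l * D with the plane y = H; None if there is no hit."""
    if abs(direction[1]) < eps:
        return None         # ray is parallel to the plane
    l = (H - position[1]) / direction[1]
    if l < 0:
        return None         # plane is behind the camera (extra check)
    return position + l * direction

view = look_at(eye=(0.0, 5.0, 5.0), target=(0.0, 0.0, 0.0))
pos, direction = camera_ray_from_view(view)
hit = intersect_horizontal_plane(pos, direction, H=0.0)
```

With the camera at (0, 5, 5) looking at the origin, the recovered position equals the eye point and the ray meets the plane y = 0 at the origin.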
For a general line-plane intersection there are a lot of answers and tutorials.
Your case is simple because the plane is horizontal.
I suppose the camera is at C(cx, cy, cz) and it looks at T(tx, ty, tz).
Then the line camera-target can be defined by:
 cx - x     cy - y     cz - z
-------- = -------- = --------   /// These are two independent equations
tx - cx    ty - cy    tz - cz
For a horizontal plane, only one equation is needed: y = H.
Substitute this value in the line equations and you get
(cx-x)/(tx-cx) = (cy-H)/(ty-cy)
(cz-z)/(tz-cz) = (cy-H)/(ty-cy)
So
x = cx - (tx-cx)*(cy-H)/(ty-cy)
y = H
z = cz - (tz-cz)*(cy-H)/(ty-cy)
Of course, if your camera looks along a horizontal line, then ty = cy and there is no solution.

how to calculate field of view of the camera from camera intrinsic matrix?

I got camera intrinsic matrix and distortion parameters using camera calibration.
The unit of the focal length is pixels, I guess.
Then, how can I calculate the field of view (along y)?
Is this formula right?
double fov_y = 2*atan(height/2/fy)*180/CV_PI;
I'll use it as a parameter of
gluPerspective()
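OpenCV has a function that does this (calibrationMatrixValues). Looking at the implementation (available on GitHub), given an image with dimensions w x h and a camera matrix
K = [[fx,  0, cx],
     [ 0, fy, cy],
     [ 0,  0,  1]]
the equations for the field of view are:
fov_x = 2 * atan(w / (2 * fx))
fov_y = 2 * atan(h / (2 * fy))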
In continuation of @mallwright's answer, here is a bit of Python/NumPy code to compute the field of view from the image resolution and focal lengths (in pixels):
import numpy as np
# Prepare
w, h = 1280, 720
fx, fy = 1027.3, 1026.9
# Go
fov_x = np.rad2deg(2 * np.arctan2(w, 2 * fx))
fov_y = np.rad2deg(2 * np.arctan2(h, 2 * fy))
print("Field of View (degrees):")
print(f" {fov_x = :.1f}\N{DEGREE SIGN}")
print(f" {fov_y = :.1f}\N{DEGREE SIGN}")
Output:
Field of View (degrees):
fov_x = 63.8°
fov_y = 38.6°
Note that this assumes that the principal point is at the center of the image and that there is no distortion, see this answer.
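If the principal point is not at the image center, one possible variant is to add the two half-angles on either side of it. This is a sketch under the assumption of an undistorted pinhole model, and the function name fov_degrees is made up for this example:

```python
import math

def fov_degrees(size, f, c=None):
    """Field of view along one axis, in degrees.

    size: image extent along that axis in pixels (w or h)
    f:    focal length along that axis in pixels (fx or fy)
    c:    principal point coordinate (cx or cy); defaults to size / 2
    """
    if c is None:
        c = size / 2
    # Sum of the angles on each side of the optical axis.
    return math.degrees(math.atan2(c, f) + math.atan2(size - c, f))

fov_y = fov_degrees(720, 1026.9)           # centered principal point
fov_y_off = fov_degrees(720, 1026.9, 400)  # principal point shifted by 40 px
```

With a centered principal point this reduces to 2 * atan(size / (2 * f)), so it matches the code above. Note that gluPerspective() describes a symmetric frustum, so an off-center principal point cannot be represented by a field-of-view angle alone.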

Depth map values in opencv's reprojectImageTo3D()

OpenCV's reprojectImageTo3D() outputs a "3-channel image representing a 3D surface".
You can access this data by
Vec3f coordinates = _3dImage.at<Vec3f>(y,x);
float depth = _3dImage.at<Vec3f>(y,x)[2];
which returns a vector [X,Y,Z].
In "Learning OpenCV" by Gary Bradski & Adrian Kaehler, it is explained that the depth is calculated by
Z = f T / (x_left - x_right)
where f = focal length, T = eye base/translation between cameras, (x_left - x_right) = disparity
This exact formula is implemented in OpenCV (I checked the source code - however there is for some reason an additional negative sign). The question is: In which unit are the X, Y, Z values specified?
T is in your unit (e.g. mm), x_l - x_r is in pixels, and [ f ] = ?
When you calibrate the camera, you specify the chessboard's size in real world units (e.g. mm). Does the intrinsic matrix therefore have real world units? Or is it specified in px? Unfortunately I cannot find the answer in the documentation.
The underlying equation that performs depth reconstruction is:
Z = fB/d, where
f is the focal length (in pixels)
B is the stereo baseline (in meters), which you called the eye base/translation between cameras (T)
d is the disparity (in pixels), which measures the difference in retinal position between corresponding points
Z is the distance along the camera Z axis (in the same units as B)
The 3D position (X,Y,Z) of an image point (e.g. (u,v) in pixels) can be given in meters, cm, mm or whatever you choose, because the 3D coordinates (X,Y,Z) are in the same units as the chessboard's square size. For example, if you define the square size to be 1 cm then the 3D coordinates will be in cm as well.
i.e.:
Size boardSize(4, 5);       // 4x5 chessboard
float squareSize = 0.025F;  // 0.025 meters
for( int i = 0; i < boardSize.height; i++ )
    for( int j = 0; j < boardSize.width; j++ )
        corners.push_back(Point3f(float(j*squareSize), float(i*squareSize), 0.0F));
p.s.:
After Z is determined, X and Y can be calculated using the usual projective camera equations, with (u, v) measured relative to the principal point:
X = uZ/f
Y = vZ/f
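The unit bookkeeping can be checked with a small sketch. The function name disparity_to_xyz and the sample numbers are assumptions for illustration; f, cx, cy and d are in pixels, so the pixel units cancel and X, Y, Z come out in the units of the baseline B.

```python
def disparity_to_xyz(u, v, d, f, B, cx, cy):
    """Triangulate one pixel (u, v) with disparity d into (X, Y, Z).

    f, cx, cy, u, v, d are in pixels; B is in real-world units (e.g. meters).
    The result is in the same units as B.
    """
    if d <= 0:
        raise ValueError("disparity must be positive")
    Z = f * B / d
    X = (u - cx) * Z / f
    Y = (v - cy) * Z / f
    return X, Y, Z

# Example: f = 700 px, baseline 0.1 m, disparity 35 px.
X, Y, Z = disparity_to_xyz(u=390, v=240, d=35, f=700, B=0.1, cx=320, cy=240)
```

Here Z = 700 * 0.1 / 35 = 2 m. Giving the same baseline as 100 (mm) would yield Z = 2000 mm, which shows that the focal length itself carries no real-world unit.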

opengl select sphere with mouse

I have a number of spheres in 3d space which the user should be able to select with a mouse click. Now I've seen some examples around using gluUnProject so I gave it a shot. So I have (please correct me every step of the way if I'm wrong because I'm not 100% sure of any part of it):
def compute_pos(x, y, z):
    '''
    Compute the 3d opengl coordinates for 3 coordinates.
    @param x,y: coordinates from canvas taken with mouse position
    @param z: coordinate for z-axis
    @return: (gl_x, gl_y, gl_z) tuple corresponding to coordinates in OpenGL context
    '''
    modelview = numpy.matrix(glGetDoublev(GL_MODELVIEW_MATRIX))
    projection = numpy.matrix(glGetDoublev(GL_PROJECTION_MATRIX))
    viewport = glGetIntegerv(GL_VIEWPORT)
    winX = float(x)
    winY = float(viewport[3] - float(y))
    winZ = z
    return gluUnProject(winX, winY, winZ, modelview, projection, viewport)
Then, having the x and y of a mouse click and the position of the center of the sphere:
def is_picking(x, y, point):
    ray_start = compute_pos(x, y, -1)
    ray_end = compute_pos(x, y, 1)
    d = _compute_2d_distance((ray_start[0], ray_start[1]),
                             (ray_end[0], ray_end[1]),
                             (point[0], point[1]))
    if d > CUBE_SIZE:
        return False
    d = _compute_2d_distance((ray_start[0], ray_start[2]),
                             (ray_end[0], ray_end[2]),
                             (point[0], point[2]))
    if d > CUBE_SIZE:
        return False
    d = _compute_2d_distance((ray_start[1], ray_start[2]),
                             (ray_end[1], ray_end[2]),
                             (point[1], point[2]))
    if d > CUBE_SIZE:
        return False
    return True
So, because my 3D geometry is not good at all, I compute two points as the ray's start and end points, then go into 2D three times, eliminating one dimension at a time, and compute the distance there between my line and the center of the sphere. If any of those distances is bigger than my sphere radius, then it's not clicked. I think the formula for the distance is correct, but just in case:
def _compute_2d_distance(p1, p2, target):
    '''
    Compute the distance between the line defined by two points and a target point.
    @param p1: first point that defines the line
    @param p2: second point that defines the line
    @param target: the point to which distance needs to be computed
    @return: distance from point to line
    '''
    if p2[0] != p1[0]:
        if p2[1] == p1[1]:
            return abs(p2[0] - p1[0])
        a = (p2[1] - p1[1])/(p2[0] - p1[0])
        b = -1
        c = p1[1] + p1[0] * (p2[1] - p1[1]) / (p2[0] - p1[0])
        d = abs(a * target[0] + b * target[1] + c) / math.sqrt(a * a + b * b)
        return d
    if p2[0] == p1[0]:
        d = abs(p2[1] - p1[1])
        return d
    return None
Now the code seems to work fine in the start position. But after you use the mouse to rotate the view even a little bit, nothing works as expected anymore.
Hi, there are a lot of solutions for this kind of problem.
Ray casting is one of the best, but it involves a lot of geometry knowledge and it is not easy at all.
Moreover, gluUnProject is not available in other OpenGL implementations such as ES for mobile devices (though you can write it yourself in your matrix manipulation functions).
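For readers who still want to try ray casting, the three 2D projections in the question could be replaced by a single 3D point-to-ray distance test. The sketch below is not the asker's code: the function name ray_hits_sphere is made up, and it assumes the ray endpoints were obtained by unprojecting the mouse position at window depth 0 (near plane) and 1 (far plane), which is the depth range gluUnProject expects by default.

```python
import numpy as np

def ray_hits_sphere(ray_start, ray_end, center, radius):
    """True if the segment ray_start -> ray_end passes within radius of center."""
    p0 = np.asarray(ray_start, dtype=float)
    p1 = np.asarray(ray_end, dtype=float)
    c = np.asarray(center, dtype=float)
    d = p1 - p0
    length = np.linalg.norm(d)
    if length == 0.0:
        return False  # degenerate ray
    d /= length
    # Parameter of the closest point on the segment to the sphere center.
    t = np.clip(np.dot(c - p0, d), 0.0, length)
    closest = p0 + t * d
    return bool(np.linalg.norm(c - closest) <= radius)
```

Because the distance is computed directly in 3D, the test does not depend on how the scene is rotated, provided the ray and the sphere center are expressed in the same coordinate space.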
I personally prefer the color picking solution, which is quite flexible and computationally very fast.
The idea is to render the selectable objects (only the selectable ones, for a performance boost), each with a unique color, to an offscreen buffer.
Then you take the color of the pixel at the coordinates clicked by the user and select the corresponding 3D object.
Cheers
Maurizio Benedetti
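The bookkeeping side of color picking can be sketched as below. This covers only the mapping between object indices and 24-bit RGB colors; the rendering and the pixel read-back (for instance with glReadPixels) are left out. The function names and the convention of reserving black for the background are assumptions for illustration.

```python
def id_to_color(object_id):
    """Map an object id (1 .. 2**24 - 1) to a unique (r, g, b) byte triple.

    Id 0, i.e. (0, 0, 0), is reserved here for the background (assumed convention).
    """
    if not 0 < object_id < 2 ** 24:
        raise ValueError("object_id must be in 1 .. 2**24 - 1")
    return ((object_id >> 16) & 0xFF, (object_id >> 8) & 0xFF, object_id & 0xFF)

def color_to_id(rgb):
    """Inverse of id_to_color; returns 0 when the background was clicked."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b

picked = color_to_id(id_to_color(70000))  # round trip gives 70000
```

For the colors to survive the round trip, the picking pass would need to be drawn without lighting, blending, or anti-aliasing, so that each pixel holds exactly the color assigned to its object.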