OpenCV solvePnP method returns NaN values - c++

I'm doing barcode detection using the zbar library with OpenCV in C++. The barcode detection works well and gives me good results.
Now I want to use cv::solvePnP to get the pose of my (already calibrated) camera. As 3D points I use a template: at program start I run the same barcode detection on it and take the top-left and bottom-right corners. I then compute the world coordinates of these two points with respect to the center of the barcode in this way:
(pt.x - imageSize.width / 2) * px_to_mm,
(pt.y - imageSize.height / 2) * px_to_mm,
0.
imageSize is the size of the barcode (in pixels), px_to_mm is the ratio "length in meters of the barcode height divided by the number of pixels of the barcode height", and pt is the point (either top-left or bottom-right).
The template is
I checked that the resulting points of the barcode detection are correct. The world coordinates are top_left = [0.054160003, 0.025360001, 0] and bottom_right = [0.085200004, 0.046080004, 0]. I assume these are correct, since the dimensions in pixels of the barcode are 388 x 200 and its height in meters is 0.016.
When I run cv::solvePnP I get these results:
translation: [-nan, -nan, -nan]
rotation: [-nan, nan, -nan;
-nan, -nan, nan;
nan, -nan, -nan]
The inputs to the method are the two image points from the barcode detection and the two world points computed using the template. What is the problem?

As api55 said in his comment, the problem was the number of points: with the default iterative solver, cv::solvePnP needs at least four point correspondences. Adding the other two corners made it work.
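For reference, a minimal sketch of the working call with four correspondences (the corner values and camera matrix below are placeholders, not the actual numbers from my setup):

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

int main() {
    // Four object points on the barcode plane (Z = 0), in meters,
    // relative to the barcode center -- placeholder values.
    std::vector<cv::Point3f> objectPoints = {
        {-0.0155f, -0.008f, 0.f},   // top-left
        { 0.0155f, -0.008f, 0.f},   // top-right
        { 0.0155f,  0.008f, 0.f},   // bottom-right
        {-0.0155f,  0.008f, 0.f}    // bottom-left
    };
    // The matching four corners from the zbar detection, in pixels
    // (placeholder values).
    std::vector<cv::Point2f> imagePoints = {
        {210.f, 120.f}, {530.f, 118.f}, {532.f, 290.f}, {208.f, 292.f}
    };
    // Calibrated intrinsics (placeholder values).
    cv::Mat K = (cv::Mat_<double>(3, 3) << 800, 0, 320,
                                             0, 800, 240,
                                             0,   0,   1);
    cv::Mat distCoeffs = cv::Mat::zeros(5, 1, CV_64F);

    cv::Mat rvec, tvec;
    bool ok = cv::solvePnP(objectPoints, imagePoints, K, distCoeffs, rvec, tvec);
    // On success, rvec/tvec hold the pose of the barcode in the camera frame.
    return ok ? 0 : 1;
}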

Related

Estimate the camera pose in the reference system using one marker with ARUCO

I am currently working on a camera pose estimation project using only one marker with ARUCO.
I used ARUCO's marker detector to detect markers and get the marker's rvec and tvec. I understand these two vectors represent the transform from the marker to the camera, which is the marker's pose w.r.t. the camera. I form a 4x4 matrix called T_marker_camera from these two vectors.
Then, I set up a world frame (left handed) and get the marker's world pose, which is a 4 by 4 transform matrix.
I want to calculate the pose of the camera w.r.t the world frame, and I use the following formula to calculate it:
T_camera_world = T_marker_world * T_marker_camera_inv
Before I apply the above formula, I convert the OpenCV coordinates to the left-handed convention (by flipping the sign of the X axis).
However, I didn't get the correct x, y, z of the camera w.r.t the world frame.
What did I miss to get the correct answer?
Thanks
The one equation you gave looks right, so the issue is probably somewhere you didn't show or describe.
A fix to your notation will help clarify things.
Write the pose/source frame on the right (input) and the reference/destination frame on the left (output). Then your matrices "match up" like dominos.
rvec and tvec yield a matrix that should be called T_cam_marker.
If you want the pose of your camera in the world frame, that is
T_world_cam = T_world_marker * T_marker_cam
T_world_cam = T_world_marker * inv(T_cam_marker)
(equivalent to what you wrote, but with the frames matching up like dominos)
Be sure that you do matrix multiplication, not element-wise multiplication.
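A sketch of that composition with OpenCV types (the helper names are mine, just to illustrate; it assumes the detector gives you double-precision rvec/tvec):

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>

// Build a 4x4 homogeneous transform T_cam_marker from the detector's rvec/tvec.
cv::Matx44d toHomogeneous(const cv::Vec3d& rvec, const cv::Vec3d& tvec) {
    cv::Mat R;
    cv::Rodrigues(rvec, R);                    // rotation vector -> 3x3 rotation matrix
    cv::Matx44d T = cv::Matx44d::eye();
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j)
            T(i, j) = R.at<double>(i, j);
        T(i, 3) = tvec[i];
    }
    return T;
}

// T_world_cam = T_world_marker * inv(T_cam_marker), as a true matrix product.
cv::Matx44d cameraPoseInWorld(const cv::Matx44d& T_world_marker,
                              const cv::Vec3d& rvec, const cv::Vec3d& tvec) {
    const cv::Matx44d T_cam_marker = toHomogeneous(rvec, tvec);
    return T_world_marker * T_cam_marker.inv();
}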
To move between left-handed and right-handed coordinate systems, insert a matrix that maps coordinates accordingly. Frames:
OpenCV camera/screen: right-handed, {X right, Y down, Z far}
ARUCO (in OpenCV anyway): right-handed, {X right, Y far, Z up}, first corner is top left (-X+Y quadrant)
whatever leftie frame you have, let's say {X right, Y up, Z far} and it's a screen or something
The hand-change matrix for typical screen frames is an identity matrix with the Y entry set to -1. I don't know why you would flip X instead, but that's "equivalent", ignoring any rotations.
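If it helps, one way to write the hand-change down, assuming the only difference between the two conventions is that single flipped axis (the names below are mine):

#include <opencv2/core.hpp>

// S maps right-handed coordinates to the left-handed convention (Y flipped here).
// S is its own inverse, so a pose expressed in the other convention is obtained
// by inserting S on both sides of the chain: T_lh = S * T_rh * S.
const cv::Matx44d S(1,  0, 0, 0,
                    0, -1, 0, 0,
                    0,  0, 1, 0,
                    0,  0, 0, 1);

cv::Matx44d toLeftHanded(const cv::Matx44d& T_rh) {
    return S * T_rh * S;
}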

How to create point cloud from rgb & depth images?

For a university project I am currently working on, I have to create a point cloud by reading images from this dataset. These are basically video frames, and for each frame there is an rgb image along with a corresponding depth image.
I am familiar with the equation z = f*b/d, however I am unable to figure out how the data should be interpreted. Information about the camera that was used to take the video is not provided, and also the project states the following:
"Consider a horizontal/vertical field of view of the camera 48.6/62
degrees respectively"
I have little to no experience in computer vision, and I have never encountered 2 fields of view being used before. Assuming I use the depth from the image as is (for the z coordinate), how would I go about calculating the x and y coordinates of each point in the point cloud?
Here's an example of what the dataset looks like:
Yes, it's unusual to specify multiple fields of view. Given a typical camera (squarish pixels, minimal distortion, view vector through the image center), usually only one field-of-view angle is given -- horizontal or vertical -- because the other can then be derived from the image aspect ratio.
Specifying a horizontal angle of 48.6 and a vertical angle of 62 is particularly surprising here, since the image is a landscape view, where I'd expect the horizontal angle to be greater than the vertical. I'm pretty sure it's a typo:
When swapped, the ratio tan(62 * pi / 360) / tan(48.6 * pi / 360) is the 640 / 480 aspect ratio you'd expect, given the image dimensions and square pixels.
At any rate, a horizontal angle of t basically says that the horizontal extent of the image, from left edge to right edge, covers an arc of t radians of the visual field, so the pixel at the center of the right edge lies along a ray rotated t / 2 radians to the right of the central view ray. This "righthand" ray runs from the eye at the origin through the point (tan(t / 2), 0, -1) (assuming a right-handed space with positive x pointing right and positive y pointing up, looking down the negative z axis). To get the point in space at distance d from the eye, you can normalize a vector along this ray and multiply it by d. Assuming the samples are linearly distributed across a flat sensor, I'd expect that for a given pixel at (x, y) you could calculate its corresponding ray point with:
p = (dx * tan(hfov / 2), dy * tan(vfov / 2), -1)
where dx is 2 * (x - width / 2) / width, dy is 2 * (y - height / 2) / height, and hfov and vfov are the field-of-view angles in radians.
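A sketch of that back-projection, assuming the swapped field-of-view values discussed above and a depth value measured along the ray (names are mine):

#include <opencv2/core.hpp>
#include <cmath>

// Back-project pixel (x, y) with distance-to-eye d into a 3D point, using the
// frame described above: +x right, +y up, looking down the negative z axis.
cv::Point3d pixelToPoint(double x, double y, int width, int height,
                         double hfov, double vfov, double d) {
    double dx = 2.0 * (x - width  / 2.0) / width;
    double dy = 2.0 * (y - height / 2.0) / height;
    cv::Point3d ray(dx * std::tan(hfov / 2.0),
                    dy * std::tan(vfov / 2.0),
                    -1.0);
    // d is the distance from the eye, so normalize the ray before scaling.
    double len = std::sqrt(ray.dot(ray));
    return ray * (d / len);
}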
Note that the documentation that accompanies your sample data links to a Matlab file that shows the recommended process for converting the depth images into a point cloud and distance field. In it, the fields of view are baked together with the image dimensions into a constant factor of 570.3, which can be used to recover the field-of-view angles the authors believed their recording device had:
2 * atan(320 / 570.3) * (180 / pi) ≈ 58.6
which is indeed pretty close to the 62 degrees you were given.
From the Matlab code, it looks like the value in the image is not distance from a given point to the eye, but instead distance along the view vector to a perpendicular plane containing the given point ("depth", or basically "z"), so the authors can just multiply it directly with the vector (dx * tan(hfov / 2), dy * tan(vfov / 2), -1) to get the point in space, skipping the normalization step mentioned earlier.
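A sketch of that z-depth variant, treating the baked-in 570.3 as the focal length in pixels (the names and the 640x480 defaults are mine):

#include <opencv2/core.hpp>

// z is the depth along the view axis, as stored in the depth image, so x and y
// follow directly from similar triangles without normalizing the ray.
cv::Point3d depthPixelToPoint(int u, int v, double z,
                              int width = 640, int height = 480,
                              double f = 570.3) {
    double x = (u - width  / 2.0) * z / f;
    double y = (v - height / 2.0) * z / f;
    return cv::Point3d(x, y, -z);   // looking down -z, matching the (..., -1) ray above
}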

Rectification fisheye lens onto a plane

I have a fisheye lens of which I know the principal point C= (x_0,y_0) and the relation between r (distorted radial distance) and Theta (angle between optical axis and the incoming ray) which follows the equidistant model r(Theta)= f*Theta
I would like to use these parameters to rectify this image. For that I follow these steps (a code sketch of what I'm attempting follows the list), but I am not sure my approach is correct because I'm left with negative values at the end:
1- shift the origin to the principal point
2- append to each point in the image plane 1 for the z coordinate
(which corresponds to a focal length equal to 1): {x,y} ==> {x,y,1}
3- calculate the angle Theta between {x, y, 1} and the point {0,0,1}
4- calculate the angle Beta in the image plane Beta = ArcTan(y/x)
5- calculate the image rectified coordinates:
x_rec = x_0 +[ Cos(Beta) * r(Theta)]
y_rec = y_0 +[ Sin(Beta) * r(Theta)]
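In code, what I'm attempting looks roughly like this, written as an inverse map (rectified pixel -> fisheye pixel) so it can be fed to cv::remap; f_rect is an assumed pinhole focal length for the rectified image:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <cmath>

// Build remap tables from a rectified (pinhole) image back into the fisheye
// image, using the equidistant model r = f * Theta. (cx, cy) is the principal
// point of the fisheye image.
void buildEquidistantMaps(cv::Size rectSize, double cx, double cy,
                          double f, double f_rect,
                          cv::Mat& mapX, cv::Mat& mapY) {
    mapX.create(rectSize, CV_32FC1);
    mapY.create(rectSize, CV_32FC1);
    for (int v = 0; v < rectSize.height; ++v) {
        for (int u = 0; u < rectSize.width; ++u) {
            // Ray through the rectified pixel, origin at the image center.
            double x = (u - rectSize.width  / 2.0) / f_rect;
            double y = (v - rectSize.height / 2.0) / f_rect;
            double theta = std::atan(std::sqrt(x * x + y * y)); // angle to optical axis
            double beta  = std::atan2(y, x);                    // angle in the image plane
            double r = f * theta;                               // equidistant model
            mapX.at<float>(v, u) = static_cast<float>(cx + r * std::cos(beta));
            mapY.at<float>(v, u) = static_cast<float>(cy + r * std::sin(beta));
        }
    }
}
// Usage: buildEquidistantMaps(...); then cv::remap(fisheye, rectified, mapX, mapY, cv::INTER_LINEAR);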
You cannot correct this distortion blindly, without knowing the relation. You need to calibrate.
Take a picture of a chessboard or a ruler, and plot the relation between the distance to center in the image and in the real world.
A low degree polynomial fit will probably do. There shouldn't be much tangential distortion.
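A minimal sketch of such a fit, assuming you already have matched radii (distance to center in the image vs. in the real world) from the chessboard shots; a cubic with no constant term is just one reasonable choice:

#include <opencv2/core.hpp>
#include <vector>

// Least-squares fit of r_world ~ c1*r_img + c2*r_img^2 + c3*r_img^3,
// with both radii measured from the distortion center.
cv::Vec3d fitRadialPolynomial(const std::vector<double>& rImg,
                              const std::vector<double>& rWorld) {
    const int n = static_cast<int>(rImg.size());
    cv::Mat A(n, 3, CV_64F), b(n, 1, CV_64F);
    for (int i = 0; i < n; ++i) {
        A.at<double>(i, 0) = rImg[i];
        A.at<double>(i, 1) = rImg[i] * rImg[i];
        A.at<double>(i, 2) = rImg[i] * rImg[i] * rImg[i];
        b.at<double>(i, 0) = rWorld[i];
    }
    cv::Mat c;
    cv::solve(A, b, c, cv::DECOMP_SVD);   // over-determined system -> least squares
    return cv::Vec3d(c.at<double>(0), c.at<double>(1), c.at<double>(2));
}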

Panoramic Image Photogrammetry: How to calculate range?

Assume that I took two panoramic images with a vertical offset of H, and that each image is presented in equirectangular projection with size Xm and Ym. To do this, I placed my panoramic camera at position A and took an image, then moved the camera H meters up and took another image.
I know that a point in image 1 with coordinates X1, Y1 is the same point as the one in image 2 with coordinates X2, Y2 (assuming that X1 = X2, as we have only a vertical offset).
My question is: how can I calculate the range of the selected point (the point whose coordinates are X1, Y1 in image 1 and X2, Y2 in image 2) from point A (where the camera was when image 1 was taken)?
Yes, you can do it - hold on! The key thing is that y, below, is the focal length of your lens.
So, I think your question can be re-stated more simply by saying that if you move your camera (on the right in the diagram) up H metres, a point moves down p pixels in the image taken from the new location.
Like this, if you imagine a side view of you taking the picture:
If you know the micron spacing of the camera's CCD from its specification, you can convert p from pixels to metres to match the units of H.
Your range from the camera to the plane of the scene is given by x + y (both in red at the bottom), and
x=H/tan(alpha)
y=p/tan(alpha)
so your range is
R = x + y = H/tan(alpha) + p/tan(alpha)
and
alpha = tan inverse(p/y)
where y is the focal length of your lens. As y is likely to be something like 50mm, it is negligible, so, to a pretty reasonable approximation, your range is
H/tan(alpha)
and
alpha = tan inverse(p in metres/focal length)
Or, by similar triangles:
Range = (H x focal length of lens) / ((Y2 - Y1) x CCD photosite spacing)
being very careful to put everything in metres.
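A tiny worked sketch of that last formula; every number below is made up, purely to show the units working out:

#include <cstdio>

int main() {
    // Everything in metres -- illustrative values only.
    double H        = 2.0;      // vertical camera offset
    double focal    = 0.05;     // 50 mm lens
    double pitch    = 5e-6;     // 5 micron photosite spacing
    double dyPixels = 100.0;    // Y2 - Y1, measured in the images

    double range = (H * focal) / (dyPixels * pitch);
    std::printf("range = %.1f m\n", range);   // 2 * 0.05 / (100 * 5e-6) = 200 m
    return 0;
}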
Here is a shot in the dark. Given my understanding of the problem at hand, you want to do something similar to computer stereo vision; I point you to http://en.wikipedia.org/wiki/Computer_stereo_vision to start. I am not sure this is possible in exactly the manner you are suggesting (it sounds like you may need some more physical constraints), but I do remember being able to correlate two 2D points in images related by a strict translation. Think:
lambda * [x, y, 1]^T = W * [r1, tx; r2, ty; r3, tz] * [X, Y, Z, 1]^T
where lambda is a scale factor, W is a 3x3 matrix covering the intrinsic parameters of your camera, r1, r2, and r3 are the row vectors that make up the 3x3 rotation matrix (in your case you can assume the identity matrix, since you have only applied a translation), and tx, ty, tz are your translation components.
Since your two 2D points are projections of the same 3D point [X, Y, Z], that 3D point is shared by both views. I cannot say whether you can recover the actual X, Y, and Z values, particularly for your depth calculation, but this is where I would start.
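For concreteness, a sketch of that projection for a single 3D point, with a placeholder intrinsic matrix W, the identity rotation and a pure translation t:

#include <opencv2/core.hpp>

// lambda * [x, y, 1]^T = W * (R * P + t), with R = I for a pure translation.
cv::Point2d project(const cv::Matx33d& W, const cv::Vec3d& t, const cv::Vec3d& P) {
    cv::Vec3d p = W * (P + t);
    return cv::Point2d(p[0] / p[2],   // divide out the scale factor lambda
                       p[1] / p[2]);
}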

OpenCV function HoughLines(), where should I consider the origin of the image?

I would like to detect some lines via the standard Hough transform and filter the result according to a theta value so that the remaining lines would be the ones that have some specific orientation.
What I'm curious about is, in the function HoughLines, where is the origin that this function calculates each theta value from? For example, if I have an image of size width x height, what is the origin's coordinates? Is it (0,height) or (0,0)?
I assume it's somewhere between the 4 corners of the image but I'm not so sure. If anybody could clear this out, it would be really appreciated.
The origin, as seen in this picture from the OpenCV docs, is (0, height) in the image, i.e. the bottom-left corner.
lines – Output vector of lines.
Each line is represented by a two-element vector (rho, theta)
rho is the distance from the coordinate origin (0,0) (top-left corner of the image).
theta is the line rotation angle in radians ( 0 = {vertical line}, pi/2 = {horizontal line} ).
from https://docs.opencv.org/4.x/dd/d1a/group__imgproc__feature.html#ga46b4e588934f6c8dfd509cc6e0e4545a
It's the top-left corner of the image.
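For the filtering-by-theta part of the question, a minimal sketch (the edge input, accumulator threshold, target angle and tolerance are placeholders):

#include <opencv2/imgproc.hpp>
#include <vector>
#include <cmath>

// Keep only the lines whose theta lies within `tol` radians of `target`.
// theta follows the convention above: measured against the top-left origin,
// with 0 = vertical line and CV_PI/2 = horizontal line.
std::vector<cv::Vec2f> detectLinesNearAngle(const cv::Mat& edges,
                                            double target, double tol) {
    std::vector<cv::Vec2f> lines, kept;
    cv::HoughLines(edges, lines, 1, CV_PI / 180, 150);   // rho step 1 px, theta step 1 degree
    for (const cv::Vec2f& l : lines) {
        if (std::abs(l[1] - target) < tol)
            kept.push_back(l);
    }
    return kept;
}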