How to create point cloud from rgb & depth images? - computer-vision

For a university project I am currently working on, I have to create a point cloud by reading images from this dataset. These are basically video frames, and for each frame there is an rgb image along with a corresponding depth image.
I am familiar with the equation z = f*b/d, but I am unable to figure out how the data should be interpreted. Information about the camera that was used to take the video is not provided, and the project states the following:
"Consider a horizontal/vertical field of view of the camera 48.6/62 degrees respectively"
I have little to no experience in computer vision, and I have never encountered 2 fields of view being used before. Assuming I use the depth from the image as is (for the z coordinate), how would I go about calculating the x and y coordinates of each point in the point cloud?
Here's an example of what the dataset looks like:

Yes, it's unusual to specify multiple fields of view. Given a typical camera (squarish pixels, minimal distortion, view vector through the image center), usually only one field-of-view angle is given -- horizontal or vertical -- because the other can then be derived from the image aspect ratio.
Specifying a horizontal angle of 48.6 and a vertical angle of 62 is particularly surprising here, since the image is a landscape view, where I'd expect the horizontal angle to be greater than the vertical. I'm pretty sure it's a typo:
When swapped, the ratio tan(62 * pi / 360) / tan(48.6 * pi / 360) is the 640 / 480 aspect ratio you'd expect, given the image dimensions and square pixels.
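As a quick numeric check of that ratio, here is a minimal Python sketch. The function name aspect_from_fov is made up for illustration, and it assumes square pixels and a view ray through the image center, as described above.

```python
import math

def aspect_from_fov(hfov_deg, vfov_deg):
    """Width/height ratio implied by two full field-of-view angles (degrees),
    assuming square pixels and a view ray through the image center."""
    return math.tan(math.radians(hfov_deg) / 2) / math.tan(math.radians(vfov_deg) / 2)

swapped = aspect_from_fov(62.0, 48.6)   # horizontal 62, vertical 48.6
as_given = aspect_from_fov(48.6, 62.0)  # the angles in the order the project states them
print(swapped, as_given, 640 / 480)
```

The swapped pair lands close to 640 / 480, while the pair as given implies an image taller than it is wide.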
At any rate, a horizontal angle of t is basically saying that the horizontal extent of the image, from left edge to right edge, covers an arc of t radians of the visual field, so the pixel at the center of the right edge lies along a ray rotated t / 2 radians to the right from the central view ray. This "right-hand" ray runs from the eye at the origin through the point (tan(t / 2), 0, -1) (assuming a right-handed space with positive x pointing right and positive y pointing up, looking down the negative z axis). To get the point in space at distance d from the eye, you can just normalize a vector along this ray and multiply it by d. Assuming the samples are linearly distributed across a flat sensor, I'd expect that for a given pixel at (x, y) you could calculate its corresponding ray point with:
p = (dx * tan(hfov / 2), dy * tan(vfov / 2), -1)
where dx is 2 * (x - width / 2) / width, dy is 2 * (y - height / 2) / height, and hfov and vfov are the field-of-view angles in radians.
Note that the documentation that accompanies your sample data links to a Matlab file that shows the recommended process for converting the depth images into a point cloud and distance field. In it, the fields of view are baked together with the image dimensions into a constant factor of 570.3, which can be used to recover the field-of-view angles that the authors believed their recording device had:
atan(320 / 570.3) * (360 / pi / 2) * 2 = 58.6
which is indeed pretty close to the 62 degrees you were given.
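The same recovery can be written as a small Python sketch. It assumes a pinhole model in which 570.3 is the focal length in pixels and the image is 640 x 480; the function name fov_degrees is hypothetical.

```python
import math

def fov_degrees(extent_px, focal_px):
    """Full field-of-view angle (degrees) covered by extent_px pixels
    for a pinhole camera whose focal length is focal_px pixels."""
    return 2 * math.degrees(math.atan((extent_px / 2) / focal_px))

hfov = fov_degrees(640, 570.3)  # horizontal angle, about 58.6 degrees
vfov = fov_degrees(480, 570.3)  # vertical angle implied by the same constant
print(hfov, vfov)
```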
From the Matlab code, it looks like the value in the image is not distance from a given point to the eye, but instead distance along the view vector to a perpendicular plane containing the given point ("depth", or basically "z"), so the authors can just multiply it directly with the vector (dx * tan(hfov / 2), dy * tan(vfov / 2), -1) to get the point in space, skipping the normalization step mentioned earlier.
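Putting the pieces together, a rough Python sketch of the per-pixel conversion might look like the following. It is not the authors' Matlab code. It assumes the depth value is distance along the view axis, that a depth of 0 marks a missing reading, and that pixel y grows in the same direction as the camera's +y axis (flip the sign of the y component if your image rows grow downward). The function names are made up for illustration.

```python
import math

def depth_to_point(x, y, depth, width, height, hfov_deg, vfov_deg):
    """Back-project pixel (x, y) with a depth value (distance along the
    view axis) to a camera-space point. The camera looks down -z."""
    dx = 2 * (x - width / 2) / width    # -1 at the left edge, +1 at the right edge
    dy = 2 * (y - height / 2) / height  # -1 at y = 0, +1 at y = height
    ray_x = dx * math.tan(math.radians(hfov_deg) / 2)
    ray_y = dy * math.tan(math.radians(vfov_deg) / 2)
    # The ray (ray_x, ray_y, -1) has unit length along z, so scaling it by
    # depth needs no normalization.
    return (depth * ray_x, depth * ray_y, -depth)

def depth_image_to_cloud(depth_rows, hfov_deg, vfov_deg):
    """Convert a depth image (a list of rows of numbers) to a list of points."""
    height = len(depth_rows)
    width = len(depth_rows[0])
    cloud = []
    for y, row in enumerate(depth_rows):
        for x, depth in enumerate(row):
            if depth > 0:  # assumption: 0 means "no depth reading"
                cloud.append(depth_to_point(x, y, depth, width, height,
                                            hfov_deg, vfov_deg))
    return cloud

# The image center maps straight down the view axis.
print(depth_to_point(320, 240, 2.0, 640, 480, 62.0, 48.6))
```

To color the cloud, you would look up the rgb pixel at the same (x, y) for each point, assuming the two images are registered to each other.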

Related

Raytracer : High FOV distortion

I'm currently writing a C++ raytracer and I've run into a classic raytracing problem. With a high vertical FOV, shapes get more distorted the closer they are to the edges.
I know why this distortion happens, but I don't know how to resolve it (of course, reducing the FOV is an option, but I think there is something to change in my code). I've been browsing different computing forums but didn't find any way to resolve it.
Here's a screenshot to illustrate my problem.
I think that the problem is that the view plane where I'm projecting my rays isn't actually flat, but I don't know how to resolve this. If you have any tip to resolve it, I'm open to suggestions.
I'm on a right-handed oriented system.
The Camera system vectors, Direction vector and Light vector are normalized.
If you need some code to check something, I'll put it in an answer with the part you ask.
Ray generation code:
// PixelScreenX = (pixelx + 0.5) / imageWidth
// PixelCameraX = (2 ∗ PixelScreenx − 1) ∗
// ImageAspectRatio ∗ tan(fov / 2)
float x = (2 * (i + 0.5f) / (float)options.width - 1) *
options.imageAspectRatio * options.scale;
// PixelScreeny = (pixely + 0.5) / imageHeight
// PixelCameraY = (1 − 2 ∗ PixelScreeny) ∗ tan(fov / 2)
float y = (1 - 2 * (j + 0.5f) / (float)options.height) * options.scale;
Vec3f dir;
options.cameraToWorld.multDirMatrix(Vec3f(x, y, -1), dir);
dir.normalize();
newColor = _renderer->castRay(options.orig, dir, objects, options);
There is nothing wrong with your projection. It produces exactly what it should produce.
Let's consider the following figure to see how all the quantities interact:
We have the camera position, the field of view (as an angle) and the image plane. The image plane is the plane that you are projecting your 3D scene onto. Essentially, this represents your screen. When you are viewing your rendering on the screen, your eye serves as the camera. It sees the projected image and, if it is positioned at the right point, it will see exactly what it would see if the actual 3D scene were there (neglecting effects like depth of field).
Obviously, you cannot modify your screen (you could change the window size but let's stick with a constant-size image plane). Then, there is a direct relationship between the camera's position and the field of view. As the field of view increases, the camera moves closer and closer to the image plane. Like this:
Thus, if you are increasing your field of view in code, you need to move your eye closer to the screen to get the correct perception. You can actually try that with your image. Move your eye very close to the screen (I'm talking about 3cm). If you look at the outer spheres now, they actually look like real balls again.
In summary, the field of view should approximately match the geometry of the viewing setup. For a given screen size and average watch distance, this can be calculated easily. If your viewing setup does not match your assumptions in code, 3D perception will break down.
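For illustration, here is a minimal Python sketch of that calculation. It assumes the eye sits on the axis through the screen center; the function names and the example numbers (a 0.30 m tall image viewed from 0.60 m) are made up.

```python
import math

def matching_fov_degrees(screen_extent, viewing_distance):
    """Field of view (degrees) that matches a screen of the given extent
    seen from the given distance (same length unit for both)."""
    return 2 * math.degrees(math.atan((screen_extent / 2) / viewing_distance))

def matching_eye_distance(screen_extent, fov_deg):
    """Distance at which the eye must sit for a rendering made with fov_deg
    to look geometrically correct on a screen of the given extent."""
    return (screen_extent / 2) / math.tan(math.radians(fov_deg) / 2)

print(matching_fov_degrees(0.30, 0.60))   # vertical FOV for this setup
print(matching_eye_distance(0.30, 90.0))  # eye distance a 90 degree FOV asks for
```

The second function shows why a large FOV looks distorted at a normal viewing distance: a 90 degree vertical FOV corresponds to an eye distance of only half the image height.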

Rectification fisheye lens onto a plane

I have a fisheye lens for which I know the principal point C = (x_0, y_0) and the relation between r (distorted radial distance) and Theta (angle between the optical axis and the incoming ray), which follows the equidistant model r(Theta) = f*Theta.
I would like to use these parameters to rectify an image. To do that I follow the steps below, but I am not sure if my approach is correct, because I'm left with negative values at the end:
1- shift the origin to the principal point
2- append to each point in the image plane 1 for the z coordinate
(which corresponds to a focal length equal to 1): {x,y} ==> {x,y,1}
3- calculate the angle Theta between {x, y, 1} and the point {0,0,1}
4- calculate the angle Beta in the image plane Beta = ArcTan(y/x)
5- calculate the image rectified coordinates:
x_rec = x_0 +[ Cos(Beta) * r(Theta)]
y_rec = y_0 +[ Sin(Beta) * r(Theta)]
You cannot correct this distortion blindly, without knowing the relation. You need to calibrate.
Take a picture of a chessboard or a ruler, and plot the relation between the distance to center in the image and in the real world.
A low degree polynomial fit will probably do. There shouldn't be much tangential distortion.
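If the equidistant relation r(Theta) = f*Theta from the question is taken as given (a calibration would confirm or replace it), one common way to apply it is inverse mapping: for every pixel of the rectified image, compute where to sample the fisheye image. The Python sketch below is only an illustration under that assumption. It also assumes the rectified image is an ideal pinhole view with the same f and the same principal point, and the function name is hypothetical.

```python
import math

def rectified_to_fisheye(xu, yu, x0, y0, f):
    """For pixel (xu, yu) of the rectified (pinhole) image, return the
    position in the fisheye image to sample from."""
    dx, dy = xu - x0, yu - y0
    ru = math.hypot(dx, dy)       # radius in the rectified image
    if ru == 0.0:
        return (x0, y0)           # the principal point maps to itself
    theta = math.atan(ru / f)     # pinhole model: ru = f * tan(theta)
    rd = f * theta                # equidistant model: rd = f * theta
    # Scaling the offset vector keeps the direction in the image plane,
    # so no separate angle Beta (and no quadrant handling) is needed.
    scale = rd / ru
    return (x0 + dx * scale, y0 + dy * scale)

print(rectified_to_fisheye(520.0, 240.0, 320.0, 240.0, 200.0))
```

Coordinates that are negative relative to the principal point are expected for pixels left of or above it; they only become a problem if the direction is later recovered with a single-argument arctangent.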

Panoramic Image Photogrametry: How to calculate range?

Assume that I took two panoramic images with a vertical offset of H, and each image is presented in equirectangular projection with size Xm by Ym. To do this, I placed my panoramic camera at a position, say A, and took an image, then moved the camera H metres up and took another image.
I know that a point in image 1 with coordinates (X1, Y1) is the same point in image 2 with coordinates (X2, Y2) (assuming that X1 = X2, as we have only a vertical offset).
My question is: how can I calculate the range from point A (where the camera was when image 1 was taken) to a selected point whose position is (X1, Y1) in image 1 and (X2, Y2) in image 2?
Yes, you can do it.
The key is that y is the focal length of your lens.
So, I think your question can be re-stated more simply by saying that if you move your camera (on the right in the diagram) up H metres, a point moves down p pixels in the image taken from the new location.
Like this, if you imagine looking from the side at yourself taking the picture.
If you know the micron spacing of the camera's CCD from its specification, you can convert p from pixels to metres to match the units of H.
Your range from the camera to the plane of the scene is given by x + y (both in red at the bottom), and
x=H/tan(alpha)
y=p/tan(alpha)
so your range is
R = x + y = H/tan(alpha) + p/tan(alpha)
and
alpha = tan inverse(p/y)
where y is the focal length of your lens. As y is likely to be something like 50mm, it is negligible, so, to a pretty reasonable approximation, your range is
H/tan(alpha)
and
alpha = tan inverse(p in metres/focal length)
Or, by similar triangles
Range = H x focal length of lens
--------------------------------
(Y2-Y1) x CCD photosite spacing
being very careful to put everything in metres.
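As a sketch only, the similar-triangles formula above can be written in Python like this. It follows the pinhole-style approximation used in this answer, with every length in metres. The function name, parameter names and example numbers are made up for illustration.

```python
def range_from_vertical_shift(H, focal_length, y1_px, y2_px, photosite_spacing):
    """Approximate range (metres) to a point that shifts from row y1_px to
    row y2_px when the camera is raised by H metres.
    focal_length and photosite_spacing are in metres."""
    shift_m = abs(y2_px - y1_px) * photosite_spacing  # p converted to metres
    if shift_m == 0:
        raise ValueError("no vertical shift: the point is effectively at infinity")
    return H * focal_length / shift_m

# Hypothetical numbers: 1 m lift, 50 mm lens, 100 pixel shift, 5 micron photosites.
print(range_from_vertical_shift(1.0, 0.050, 400, 500, 5e-6))
```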
Here is a shot in the dark. Given my understanding of the problem at hand, you want to do something similar to computer stereo vision; see http://en.wikipedia.org/wiki/Computer_stereo_vision to start. I am not sure whether this is possible in the manner you are suggesting, and it sounds like you may need some more physical constraints, but I do remember being able to correlate two 2D points in images after a strict translation. Think:
lambda[x,y,1]^t = W[r1, tx;r2, ty;r3, tz][x; y; z; 1]^t
Where lambda is a scale factor, W is a 3x3 matrix covering the intrinsic parameters of your camera, r1, r2, and r3 are row vectors that make up the 3x3 rotation matrix (in your case you can assume the identity matrix since you have only applied a translation), and tx, ty, and tz are your translation components.
Since you are looking at two 2d points at the same 3d point [x,y,z] this 3d point is shared by both 2d points. I cannot say if you can rationalize the actual x,y, and z values particularly for your depth calculation but this is where I would start.

How do you rotate a sprite based on mouse position?

Basically, I have a sprite that I render using SDL 2.0, and I can rotate it clockwise by a variable amount around a center origin point of the texture using SDL_RenderCopyEx(). I want to rotate it based on the mouse position by using the angle x between my physical slope line and my two straight lines based off of my base line. The base line I'm talking about can be represented mathematically as x = orgin_x, where orgin_x is the rotation origin. The other line is a segment along the baseline that connects the horizontal line end point to the orgin_x point vertically. The angle to the mouse cursor is the one I want to find to rotate my character.
Please no complicated math symbols. I would rather the formula be posted in C-style format, and please explain the logic behind the math so I can maybe understand what's happening and fix similar future problems if needed.
This needs only some basic trigonometry. You can use atan2(delta_y, delta_x), which gives you the angle in radians. SDL_RenderCopyEx takes its angle in degrees, so you need to convert. A full circle is 360 degrees or 2*PI radians, so
angle_deg = (atan2(delta_y, delta_x)*180.0000)/3.1416
Now you have the angle to pass to SDL_RenderCopyEx.
BTW :
delta_y = origin_y - mouse_y
AND
delta_x = origin_x - mouse_x
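The same calculation as a small Python sketch, for checking values by hand. It keeps the sign convention of the deltas above. Whether you need to add a fixed offset (for example 90 or 180 degrees) depends on which way the sprite faces in its texture, which is an assumption you would have to check. The function name is made up.

```python
import math

def angle_to_mouse_degrees(origin_x, origin_y, mouse_x, mouse_y):
    """Angle in degrees of the vector (delta_x, delta_y) as defined above."""
    delta_x = origin_x - mouse_x
    delta_y = origin_y - mouse_y
    # atan2 looks at the signs of both arguments, so it works in all four
    # quadrants and does not divide by zero when delta_x is 0.
    return math.degrees(math.atan2(delta_y, delta_x))

print(angle_to_mouse_degrees(100, 100, 0, 100))  # mouse directly left of the origin
print(angle_to_mouse_degrees(100, 100, 100, 0))  # mouse directly above the origin
```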

Proper calculation for the first element of an OpenGL projection Matrix?

Almost all the theoretical stuff I read about projection matrices have the first element being 2n/(r-l), but most of the open source implementations I've seen have it as 2n/((t-b)*a), -- which makes sense to me at first since (r-l) should be ((t-b)*a), but when I actually run the numbers, something feels off.
If we have a vertical field of view of 65 degrees, a near plane of .1, and an aspect ratio of 4:3, then I seem to get:
2n/(r-l) = .2 / (tan(65*(4/3)*.5) * .2) = 1.0599
but
2n/((t-b)*a) = .2 / (tan(65*.5) * (4/3) * .2) = 1.1773
Why is there a difference between everything I read, and everything I see implemented? I didn't notice until I started implementing the same analytical inverse I see whose first element is (r-l)/2n, which isn't the inverse of these other implementations.
You can't multiply the aspect ratio into the angle. The tangent isn't a linear function. Having a 65 degree vertical field of view does not mean that you're going to have an 86.67 degree horizontal FOV with a 4:3 aspect ratio, but ~80.69 degrees.
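To see the numbers, here is a minimal Python sketch. It assumes a symmetric frustum, so that t - b = 2 * n * tan(vfov / 2) and r - l = (t - b) * a; the function names are made up for illustration.

```python
import math

def horizontal_fov_degrees(vfov_deg, aspect):
    """Horizontal FOV for a given vertical FOV and width/height aspect ratio."""
    half_v = math.radians(vfov_deg) / 2
    return 2 * math.degrees(math.atan(aspect * math.tan(half_v)))

def projection_first_element(vfov_deg, aspect):
    """2n / (r - l) for a symmetric frustum. The near distance n cancels,
    since r - l = 2 * n * aspect * tan(vfov / 2)."""
    return 1.0 / (aspect * math.tan(math.radians(vfov_deg) / 2))

hfov = horizontal_fov_degrees(65.0, 4 / 3)
print(hfov)                                       # horizontal FOV in degrees
print(projection_first_element(65.0, 4 / 3))      # first element from the vertical FOV
print(1.0 / math.tan(math.radians(hfov) / 2))     # same element from the horizontal FOV
```

Both routes give the same first element once the horizontal angle is derived through the tangent, so 2n/(r-l) and 2n/((t-b)*a) agree.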