Calculate baseline distance between 2 cameras (images) - computer-vision

I want to estimate the depth map between the left and right images from "http://perso.lcpc.fr/tarel.jean-philippe/syntim/paires/GrRub.html". I understand that I must first calculate depth from disparity using the formula Z = B * F / d, where B is the baseline, F the focal length and d the disparity.
The data set unfortunately does not provide the baseline distance B.
Could you suggest how I can calculate this (if possible), or how I could calculate the depth map from the given data alone?
Thank you for your help.
As I am new to Stack Overflow and computer vision, do let me know if I should provide more details.
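For concreteness, this is how I plan to apply that formula once B is known (a rough NumPy sketch; the values and file name are placeholders):

import numpy as np

B = 0.12                              # baseline in metres (the quantity I am missing)
F = 1100.0                            # focal length in pixels, from the intrinsic calibration
disparity = np.load("disparity.npy")  # per-pixel disparities in pixels, placeholder input

valid = disparity > 0                            # zero disparity has no finite depth
depth = np.zeros_like(disparity, dtype=np.float64)
depth[valid] = B * F / disparity[valid]          # Z = B * F / d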

If you have the extrinsic parameters, the rotation matrices R and translation vectors t, there are two cases:
a) (most probable) one of your cameras (the main camera) is the centre of your coordinate system: its R1 matrix is the identity and the related t1 is equal to [0, 0, 0]. In this case you can take the baseline B as the Euclidean norm of the translation vector t2 of the other camera.
b) if neither of your cameras is the centre of your coordinate system, you should at least have calibrated both cameras with respect to the same reference system. The baseline B is then the Euclidean norm of the difference vector (t1 - t2).
(I was not able to open the left/right camera links, so I could not verify)
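As a minimal sketch of the two cases (assuming NumPy, with t1 and t2 the calibrated translation vectors expressed in the same reference frame; the numbers are only illustrative):

import numpy as np

t1 = np.array([0.0, 0.0, 0.0])    # case (a): the main camera is the origin
t2 = np.array([0.12, 0.0, 0.0])   # translation vector of the other camera

B_case_a = np.linalg.norm(t2)        # case (a): baseline = ||t2||
B_case_b = np.linalg.norm(t1 - t2)   # case (b): baseline = ||t1 - t2||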

Related

Best-fit relative orientations and locations of stereo cameras

I have 2 static cameras being used for stereo 3D positioning of objects. I need to determine the location and orientation of the second camera relative to the first as accurately as possible. I am trying to do this by locating n objects in both cameras' images and correlating them between the two cameras, in order to calibrate my system to locate additional objects later.
Is there a preferred way to use a large number (6+) of correlated points to determine the best-fit relative locations/orientations of 2 cameras, assuming that I have already compensated for any distortive effects and know the correct (but somewhat noisy) angles between the optical axes and the objects, and the distance between the cameras?
My solution is to determine a rotation to perform on the second camera (B) in order to realign its measurements so they are from the point of view of the first camera (A), as if it had been translated to the location of camera B.
I did this using a compound rotation, by first rotating the second camera's measurements about the cross product of vector -AB (B pointing at A from the perspective of A) and BA (B pointing at A from the perspective of B), such that R1*BA = -AB. Doing this rotation just means the vectors pointing between the cameras are aligned, and another rotation must be done in order to account for further degrees of freedom.
That first rotation was done so that the second one can be about -AB. R2 is a rotation of theta radians about -AB. I found theta by taking the cross products of my measurements from camera A and vector AB, and comparing them to the cross products of R1*(the measurements from camera B) and -AB. I numerically minimized the RMS of the angles between the cross-product pairs, because when the cameras are aligned those cross-product vectors should all point in the same directions, since they are normal to coplanar planes.
After that I can use https://math.stackexchange.com/questions/61719/finding-the-intersection-point-of-many-lines-in-3d-point-closest-to-all-lines to find accurate 3D locations of intersection points by applying R1*R2 to any future measurements from camera B.
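A sketch of the first rotation R1 (the one that aligns BA with -AB) using Rodrigues' formula; the vector names follow the description above, and the numeric values are only illustrative:

import numpy as np

def rotation_aligning(a, b):
    # Return a rotation R such that R @ a is parallel to b (a, b are 3D vectors).
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    axis = np.cross(a, b)
    s = np.linalg.norm(axis)   # sin(theta)
    c = np.dot(a, b)           # cos(theta)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)   # already aligned
        raise ValueError("anti-parallel vectors: pick an explicit 180-degree axis")
    k = axis / s
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    # Rodrigues: R = I + sin(theta)*K + (1 - cos(theta))*K^2
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

BA = np.array([0.1, -0.9, 0.4])    # vector from B toward A, in B's frame (illustrative)
AB = np.array([-0.2, 0.9, -0.35])  # vector from A toward B, in A's frame (illustrative)
R1 = rotation_aligning(BA, -AB)    # R1 @ BA is parallel to -AB, as described above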

How to estimate camera translation given relative rotation and intrinsic matrix for stereo images?

I have 2 images (left and right) of a scene captured by a single camera.
I know the intrinsic matrices K_L and K_R for both images and the relative rotation R between the two cameras.
How do I compute the precise relative translation t between the two cameras?
You can only do it up to scale, unless you have a separate means to resolve scale, for example by observing an object of known size, or by having a sensor (e.g. LIDAR) give you the distance from a ground plane or from an object visible in both views.
That said, the solution is quite easy. You could do it by calculating and then decomposing the essential matrix, but here is a more intuitive way. Let xl and xr be two matched pixels in the two views in homogeneous image coordinates, and let X be their corresponding 3D world point, expressed in left camera coordinates. Let Kli and Kri be respectively the inverse of the left and right camera matrices Kl and Kr. Denote with R and t the transform from the right to the left camera coordinates. It is then:
X = sl * Kli * xl = t + sr * R * Kri * xr
where sl and sr are scales for the left and right rays back-projecting to point X from left and right camera respectively.
The second equality above represents 3 scalar equations in 5 unknowns: the 3 components of t, sl and sr. Depending on what additional information you have, you can solve it in different ways.
For example, if you know (e.g. from LIDAR measurements) the distance from the cameras to X, you can remove the scale terms from the equations above and solve directly. If there is a segment of known length [X1, X2] that is visible in both images, you can write two equations like above and again solve directly.
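For instance, with both depths known, t drops straight out of the equation above; a minimal NumPy sketch (all values are illustrative):

import numpy as np

Kl = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])  # left intrinsics
Kr = np.array([[810.0, 0.0, 330.0], [0.0, 810.0, 245.0], [0.0, 0.0, 1.0]])  # right intrinsics
R = np.eye(3)                           # known right-to-left rotation
xl = np.array([400.0, 260.0, 1.0])      # matched left pixel, homogeneous
xr = np.array([350.0, 255.0, 1.0])      # matched right pixel, homogeneous
sl, sr = 4.2, 4.5                       # known scales along each ray (e.g. from LIDAR)

ray_l = np.linalg.solve(Kl, xl)         # Kli * xl
ray_r = np.linalg.solve(Kr, xr)         # Kri * xr
t = sl * ray_l - sr * R @ ray_r         # from X = sl*Kli*xl = t + sr*R*Kri*xr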

camera extrinsic calibration

I have a fisheye camera, which I have already calibrated. I need to calculate the camera pose w.r.t. a checkerboard just by using a single image of said checkerboard, the intrinsic parameters, and the size of the squares of the checkerboard. Unfortunately many calibration libraries first calculate the extrinsic parameters from a set of images and then the intrinsic parameters, which is essentially the "inverse" of the procedure I want. Of course I can just put my checkerboard image inside the set of other images I used for the calibration and run the calibration procedure again, but it's very tedious, and moreover, I can't use a checkerboard of a different size from the ones used for the intrinsic calibration. Can anybody point me in the right direction?
EDIT: After reading francesco's answer, I realized that I didn't explain what I mean by calibrating the camera. My problem begins with the fact that I don't have the classic intrinsic parameters matrix (so I can't actually use the method Francesco described). In fact I calibrated the fisheye camera with Scaramuzza's procedure (https://sites.google.com/site/scarabotix/ocamcalib-toolbox), which basically finds a polynomial that maps 3D world points into pixel coordinates (or, alternatively, the polynomial that backprojects pixels onto the unit sphere). Now, I think this information is enough to find the camera pose w.r.t. a chessboard, but I'm not sure exactly how to proceed.
The solvePnP procedure calculates the extrinsic pose of the chessboard (CB) in camera coordinates. OpenCV added a fisheye library to its 3D reconstruction module to accommodate significant distortions in cameras with a large field of view. Of course, if your intrinsic transformation is not a classical intrinsic matrix, you have to modify PnP:
Undo whatever back-projection you did. Now you have a so-called normalized camera, where the effect of the intrinsic matrix has been eliminated:
k * [u, v, 1]^T = [R|T] * [x, y, z, 1]^T
The way to solve this is to write the expression for k first:
k = R20*x + R21*y + R22*z + Tz
then use the above expression in
k*u = R00*x + R01*y + R02*z + Tx
k*v = R10*x + R11*y + R12*z + Ty
you can rearrange the terms to get A*p = 0, subject to |p| = 1, where the unknown vector is
p = [R00, R01, R02, Tx, R10, R11, R12, Ty, R20, R21, R22, Tz]^T
and A is composed of the known u, v, x, y, z - the pixel and CB corner coordinates;
Then you solve for p = the last column of V, where A = U*L*V^T (the SVD of A), and assemble the rotation and translation matrices from p. Then there are a few 'messy' steps that are actually very typical for this kind of processing:
A. Ensure that you got a real rotation matrix - perform an orthogonal Procrustes: R2 = U*V^T, where R = U*L*V^T is the SVD of the algebraic rotation R;
B. Calculate the scale factor scl = sum(R2(i,j)/R(i,j))/9;
C. Update the translation vector T2 = scl*T and check that Tz > 0; if it is negative, negate T and R;
Now, R2 and T2 give you a good starting point for a non-linear optimization such as Levenberg-Marquardt. It is required because the previous linear step minimizes only an algebraic error in the parameters, while the non-linear one minimizes a correct metric such as the squared error in pixel distances. However, if you don't want to follow all these steps you can take advantage of the fish-eye library of OpenCV.
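A rough sketch of the linear step described above, assuming NumPy and normalized (intrinsics-free) image points; note that it needs non-coplanar 3D points, so for a strictly planar board the homography route in the next answer is the usual linear alternative:

import numpy as np

def linear_pnp(uv, xyz):
    # uv : Nx2 normalized image points (intrinsic / fisheye back-projection already undone)
    # xyz: Nx3 corresponding 3D corner coordinates
    rows = []
    for (u, v), (x, y, z) in zip(uv, xyz):
        rows.append([x, y, z, 1, 0, 0, 0, 0, -u * x, -u * y, -u * z, -u])
        rows.append([0, 0, 0, 0, x, y, z, 1, -v * x, -v * y, -v * z, -v])
    A = np.asarray(rows, dtype=float)

    # Solve A*p = 0 with |p| = 1: p is the right-singular vector of the smallest
    # singular value (the last row of V^T).
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)

    if P[2, 3] < 0:          # overall sign is arbitrary; keep the board in front (Tz > 0)
        P = -P
    R_alg, T = P[:, :3], P[:, 3]

    # A. Orthogonal Procrustes: closest proper rotation to the algebraic R.
    U, L, Vt2 = np.linalg.svd(R_alg)
    R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt2)]) @ Vt2

    # B./C. The algebraic solution is only defined up to scale; here the scale is
    # taken as the mean singular value of R_alg (same intent as the element-wise
    # ratio described above), and T is rescaled accordingly.
    T = T * 3.0 / np.sum(L)
    return R, T

The returned R and T are then the starting point for the Levenberg-Marquardt refinement mentioned above.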
I assume that by "calibrated" you mean that you have a pinhole model for your camera.
Then the transformation between your chessboard plane and the image plane is a homography, which you can estimate from the image of the corners using the usual DLT algorithm. You can then express it as the product, up to scale, of the matrix of intrinsic parameters A and [x y t], where the x and y columns are the x and y unit vectors of the world's (i.e. chessboard's) coordinate frame, and t is the vector from the camera centre to the origin of that same frame. That is:
H = scale * A * [x|y|t]
Therefore
[x|y|t] = 1/scale * inv(A) * H
The scale is chosen so that x and y have unit length. Once you have x and y, the third axis is just their cross product.
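A minimal NumPy sketch of that recipe (H is the estimated board-to-image homography and A the intrinsic matrix, both assumed already known):

import numpy as np

def pose_from_homography(H, A):
    M = np.linalg.solve(A, H)               # inv(A) * H = scale * [x | y | t]
    scale = 1.0 / np.linalg.norm(M[:, 0])   # chosen so that x (and y) have unit length
    if M[2, 2] < 0:                         # keep the board in front of the camera (t_z > 0)
        scale = -scale
    x_axis = scale * M[:, 0]
    y_axis = scale * M[:, 1]
    t = scale * M[:, 2]
    z_axis = np.cross(x_axis, y_axis)       # third axis is the cross product of x and y
    R = np.column_stack([x_axis, y_axis, z_axis])
    U, _, Vt = np.linalg.svd(R)             # re-orthogonalize: with noise, x and y are
    R = U @ Vt                              # only approximately orthogonal
    return R, t

If, as in the edited question, you only have Scaramuzza's polynomial model, one option is to back-project the corner pixels to normalized directions first and estimate H against those, with A taken as the identity.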

How to calculate Rotation and Translation matrices from homography?

I have already done the comparison of 2 images of the same scene, taken by one camera from different view angles (say left and right), using SURF in emgucv (C#). It gave me a 3x3 homography matrix for the 2D transformation. But now I want to place those 2 images in a 3D environment (using DirectX). To do that I need to calculate the relative location and orientation of the 2nd image (right) with respect to the 1st image (left) in 3D. How can I calculate the Rotation and Translation matrices for the 2nd image?
I also need the z value for the 2nd image.
I read about something called 'homography decomposition'. Is that the way?
Is anybody familiar with homography decomposition, and is there an algorithm which implements it?
Thanks in advance for any help.
Homography only works for planar scenes (i.e. all of your points are coplanar). If that is the case then the homography is a projective transformation and it can be decomposed into its components.
But if your scene isn't coplanar (which I think is the case from your description) then it's going to take a bit more work. Instead of a homography you need to calculate the fundamental matrix (which emgucv will do for you). The fundamental matrix is a combination of the camera intrinsic matrix (K), the relative rotation (R) and translation (t) between the two views. Recovering the rotation and translation is pretty straightforward if you know K. It looks like emgucv has methods for camera calibration. I am not familiar with their particular method, but these generally involve taking several images of a scene with known geometry.
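In OpenCV (Emgu CV wraps the same functions) the fundamental-matrix step looks roughly like this, assuming pts_left and pts_right hold your matched SURF keypoint coordinates; the file names here are placeholders:

import numpy as np
import cv2

pts_left = np.load("pts_left.npy")    # Nx2 matched keypoints from the left image
pts_right = np.load("pts_right.npy")  # Nx2 matched keypoints from the right image

# RANSAC rejects bad matches while estimating F.
F, inlier_mask = cv2.findFundamentalMat(pts_left, pts_right, cv2.FM_RANSAC, 1.0, 0.99)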
To figure out the camera motion (exact rotation and translation up to a scaling factor) you need to:
Calculate the fundamental matrix F, for example using the eight-point algorithm
Calculate the essential matrix E = A^T * F * A, where A is the intrinsic camera matrix
Decompose E, which is by definition Tx * R, via SVD into E = U * L * V^T
Create a special 3x3 matrix
W = [ 0  -1   0
      1   0   0
      0   0   1 ]
that helps to run the decomposition:
R = U * W^-1 * V^T, Tx = U * L * W * U^T, where
Tx = [  0  -tz   ty
       tz    0  -tx
      -ty   tx    0 ]
Since E can have an arbitrary sign and W can be replaced by W^-1, there are 4 distinct solutions, and you have to select the one which produces the most points in front of both cameras.
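If you would rather not do the SVD bookkeeping and the four-way disambiguation by hand, OpenCV's recoverPose performs exactly this selection; a hedged sketch (the matched points and the intrinsic matrix are assumed to come from your own pipeline):

import numpy as np
import cv2

pts1 = np.load("pts_left.npy")    # Nx2 matched pixels, view 1 (placeholder input)
pts2 = np.load("pts_right.npy")   # Nx2 matched pixels, view 2
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])  # intrinsics

E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
# recoverPose tries the four (R, t) candidates and keeps the one that puts the most
# points in front of both cameras; t is returned only up to scale.
n_inliers, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)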
It's been a while since you asked this question. By now, there are some good references on this problem.
One of them is "An Invitation to 3-D Vision" by Ma et al.; chapter 5 of it is free here: http://vision.ucla.edu//MASKS/chapters.html
Also, Peter Corke's Machine Vision Toolbox includes the tools to perform this. However, it does not explain much of the math behind the decomposition.

How can I determine distance from an object in a video?

I have a video file recorded from the front of a moving vehicle. I am going to use OpenCV for object detection and recognition, but I'm stuck on one aspect: how can I determine the distance from the camera to a recognized object?
I can know my current speed and real-world GPS position but that is all. I can't make any assumptions about the object I'm tracking. I am planning to use this to track and follow objects without colliding with them. Ideally I would like to use this data to derive the object's real-world position, which I could do if I could determine the distance from the camera to the object.
Your problem's quite standard in the field.
Firstly,
you need to calibrate your camera. This can be done offline (makes life much simpler) or online through self-calibration.
Calibrate it offline - please.
Secondly,
Once you have the calibration matrix of the camera K, determine the projection matrix of the camera in a successive scene (you need to use parallax as mentioned by others). This is described well in this OpenCV tutorial.
You'll have to use the GPS information to find the relative orientation between the cameras in the successive scenes (that might be problematic due to noise inherent in most GPS units), i.e. the R and t mentioned in the tutorial or the rotation and translation between the two cameras.
Once you've resolved all that, you'll have two projection matrices: representations of the cameras at those successive scenes. Using one of these so-called camera matrices, you can "project" a 3D point M of the scene onto the 2D image of the camera at pixel coordinates m (as in the tutorial).
We will use this to triangulate the real 3D point from 2D points found in your video.
Thirdly,
use an interest point detector to track the same point, lying on the object of interest, across your video. There are several detectors available; I recommend SURF since you have OpenCV, which also has several other detectors like Shi-Tomasi corners, Harris, etc.
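One common OpenCV recipe for this tracking step (a sketch only; SURF lives in the separate contrib package, so this uses Shi-Tomasi corners tracked with pyramidal Lucas-Kanade instead, and the video path is a placeholder):

import cv2

cap = cv2.VideoCapture("drive.mp4")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Shi-Tomasi corners in the first frame.
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                   qualityLevel=0.01, minDistance=7)

ok, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Track those corners into the next frame.
next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
good_prev = prev_pts[status.flatten() == 1]
good_next = next_pts[status.flatten() == 1]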
Fourthly,
Once you've tracked points of your object across the sequence and obtained the corresponding 2D pixel coordinates you must triangulate for the best fitting 3D point given your projection matrix and 2D points.
This construction captures the uncertainty in the back-projected rays and how a best-fitting 3D point is computed from them. Of course, in your case the two camera positions are probably one in front of the other!
Finally,
Once you've obtained the 3D points on the object, you can easily compute the Euclidean distance between the camera center (which is the origin in most cases) and the point.
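Putting the last two steps together with OpenCV (a sketch; P1 and P2 are the projection matrices from the second step and pts1, pts2 the tracked pixel coordinates, loaded here from placeholder files):

import numpy as np
import cv2

P1 = np.load("P1.npy")      # 3x4 projection matrix at the first instant
P2 = np.load("P2.npy")      # 3x4 projection matrix at the second instant
pts1 = np.load("pts1.npy")  # 2xN pixel coordinates of the tracked object points, view 1
pts2 = np.load("pts2.npy")  # 2xN pixel coordinates, view 2

X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # 4xN homogeneous 3D points
X = X_h[:3] / X_h[3]                              # Euclidean 3D points, 3xN

# Distance from the first camera centre (taken as the origin) to each point.
distances = np.linalg.norm(X, axis=0)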
Note
This is obviously not easy stuff but it's not that hard either. I recommend Hartley and Zisserman's excellent book Multiple View Geometry which has described everything above in explicit detail with MATLAB code to boot.
Have fun and keep asking questions!
When you have moving video, you can use temporal parallax to determine the relative distance of objects. Parallax: (definition).
The effect would be the same as we get with our eyes, which gain depth perception by looking at the same object from slightly different angles. Since you are moving, you can use two successive video frames to get your slightly different angle.
Using parallax calculations, you can determine the relative size and distance of objects (relative to one another). But, if you want the absolute size and distance, you will need a known point of reference.
You will also need to know the speed and direction being traveled (as well as the video frame rate) in order to do the calculations. You might be able to derive the speed of the vehicle using the visual data but that adds another dimension of complexity.
The technology already exists. Satellites determine topographic prominence (height) by comparing multiple images taken over a short period of time. We use parallax to determine the distance of stars by taking photos of the night sky at different points in Earth's orbit around the sun. I was able to create 3-D images out of an airplane window by taking two photographs in short succession.
The exact technology and calculations (even if I knew them off the top of my head) are way outside the scope of discussion here. If I can find a decent reference, I will post it here.
You need to identify the same points on the same object in two different frames taken a known distance apart. Since you know the location of the camera in each frame, you have a baseline (the vector between the two camera positions). Construct a triangle from the known baseline and the angles to the identified points. Trigonometry gives you the length of the unknown sides of the triangles from the known length of the baseline and the known angles between the baseline and the unknown sides.
You can use two cameras, or one camera taking successive shots. So, if your vehicle is moving at 1 m/s and you take frames every second, then successive frames will give you a 1 m baseline, which should be good for measuring the distance of objects up to, say, 5 m away. If you need to range objects further away, then the frames used need to be further apart - however, more distant objects will be in view for longer.
Observer at F1 sees target at T with angle a1 to velocity vector. Observer moves distance b to F2. Sees target at T with angle a2.
Required to find r1, range from target at F1
The trigonometric identity for cosine gives
cos(90 - a1) = x / r1 = c1
cos(90 - a2) = x / r2 = c2
cos(a1) = (b + z) / r1 = c3
cos(a2) = z / r2 = c4
x is distance to target orthogonal to observer’s velocity vector
z is distance from F2 to intersection with x
Solving for r1
r1 = b / (c3 - c1 * c4 / c2)
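The same formula in code (a small sketch; angles in radians, b in whatever unit you want the range in):

import math

def range_from_parallax(b, a1, a2):
    # b  : baseline, the distance moved from F1 to F2
    # a1 : angle between the velocity vector and the target, seen from F1
    # a2 : the same angle seen from F2
    c1, c2 = math.sin(a1), math.sin(a2)   # cos(90 - a) = sin(a)
    c3, c4 = math.cos(a1), math.cos(a2)
    return b / (c3 - c1 * c4 / c2)

# Example: 1 m baseline, target at 30 degrees from F1 and 35 degrees from F2.
r1 = range_from_parallax(1.0, math.radians(30), math.radians(35))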
Two cameras so you can detect parallax. It's what humans do.
edit
Please see ravenspoint's answer for more detail. Also, keep in mind that a single camera with a splitter would probably suffice.
Use stereo disparity maps. Lots of implementations are afloat; here are some links:
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/OWENS/LECT11/node4.html
http://www.ece.ucsb.edu/~manj/ece181bS04/L14(morestereo).pdf
In your case you don't have a stereo camera, but depth can be evaluated using video:
http://www.springerlink.com/content/g0n11713444148l2/
I think the above is what will help you the most.
Research has progressed to the point that depth can be evaluated (though not to a satisfactory extent) even from a single monocular image:
http://www.cs.cornell.edu/~asaxena/learningdepth/
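Going back to the stereo-disparity suggestion at the top of this answer: if you treat two successive (rectified) frames as a stereo pair, OpenCV's block matcher is the quickest way to get a disparity map; a sketch with placeholder file names and parameters:

import cv2

left = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("frame_t_plus_1.png", cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16 and blockSize must be odd.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(float) / 16.0  # output has 4 fractional bits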
Someone please correct me if I'm wrong, but it seems to me that if you're going to use a single camera and rely purely on a software solution, any processing you might do would be prone to false positives. I highly doubt that there is any processing that could tell the difference between objects that really are at the perceived distance and those which only appear to be at that distance (like the "forced perspective" effect in movies).
Any chance you could add an ultrasonic sensor?
First, you should calibrate your camera so you can get the relation between the objects' positions in the camera plane and their positions in the real-world plane. If you are using a single camera you can use the "optical flow" technique.
If you are using two cameras you can use the triangulation method to find the real position (it will be easy to find the distance of the objects), but the problem with the second method is the matching, which means: how can you find the position of an object 'x' in camera 2 if you already know its position in camera 1? Here you can use the SIFT algorithm.
I just gave you some keywords; I hope they help you.
Put an object of known size in the camera's field of view. That way you can have a more objective metric to measure angular distances. Without a second viewpoint/camera you'll be limited to estimating size/distance, but at least it won't be a complete guess.
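With a pinhole model, the distance to such a reference object follows from similar triangles; a tiny sketch with illustrative numbers:

def distance_from_known_size(focal_px, real_width_m, width_px):
    # Pinhole similar triangles: Z = f * W / w
    return focal_px * real_width_m / width_px

# Example: 1000 px focal length, a 0.5 m wide object spanning 80 px in the image.
Z = distance_from_known_size(1000.0, 0.5, 80.0)   # about 6.25 m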