Reconstructing 3D from some images without calibration? - c++

I want to make a 3D reconstruction from multiple images without using chessboard calibration. I'm using OpenCV and looking for a way to build a 3D model from 30 images without calibrating the camera with a chessboard pattern.
Is this possible? Where can I get the extrinsics params?
Can I make the 3D reconstruction without calibrating?

The calibration grid (chessboard in the typical OpenCV example) is simply an object of known dimensions that lets you estimate the camera's intrinsic parameters, i.e. the mapping from camera coordinates to the image coordinates of a point. This includes focal length, centre of projection, radial distortion parameters et cetera.
If you do away with the calibration object, you will need to find these parameters from the image observations themselves. This approach is called "self-calibration" or "auto-calibration" and can be fairly involved. Basically, you are trying to get a good starting point for the follow-up non-linear optimisation (i.e. bundle adjustment). For a start, you might want to refer to the work of Marc Pollefeys, who came up with a simple linear algorithm for this problem:
http://www.cs.unc.edu/~marc/pubs/PollefeysIJCV04.pdf

Related

Camera pose estimation

I am trying to write a program from scratch that can estimate the pose of a camera. I am open to any programming language and using inbuilt functions/methods for feature detection...
I have been exploring different ways of estimating pose like SLAM, PTAM, DTAM etc... but I don't really need tracking and mapping, I just need the pose.
Can any of you suggest an approach or any resource that can help me? I know what pose is and have a rough idea of how to estimate it, but I am unable to find any resources that explain how it can be done.
I was thinking of starting with a video recorded, extracting features from the video and then using these features and geometry to estimate the pose.
(Please forgive my naivety, I am not a computer vision person and am fairly new to all of this)
In order to compute a camera pose, you need to have a reference frame that is given by some known points in the image.
These known points come for example from a calibration pattern, but can also be some known landmarks in your images (for example, the 4 corners of the base of the Gizeh pyramids).
The problem of estimating the pose of the camera given known landmarks seen by the camera (i.e., finding the 3D position from 2D points) is classically known as PnP.
OpenCV provides you with a ready-made solver for this problem.
However, you first need to calibrate your camera, i.e., you need to determine what makes it unique.
The parameters that you need to estimate are called intrinsic parameters, because they will depend on the camera focal length, sensor size... but not on the camera location or orientation.
These parameters will mathematically explain how world points are projected onto your camera sensor frame.
You can estimate them from known planar patterns (again, OpenCV has some ready-made functions for that).
Generally, you can extract the pose of a camera only relative to a given reference frame.
It is quite common to estimate the relative pose between one view of a camera to another view.
The most general relationship between two views of the same scene from two different cameras is given by the fundamental matrix (google it).
You can calculate the fundamental matrix from correspondences between the images. For example look in the Matlab implementation:
http://www.mathworks.com/help/vision/ref/estimatefundamentalmatrix.html
After calculating this, you can use a decomposition of the fundamental matrix in order to get the relative pose between the cameras. (Look here for example: http://www.daesik80.com/matlabfns/function/DecompPMatQR.m).
You can follow a similar procedure if you have a calibrated camera; in that case you need the essential matrix instead of the fundamental matrix.

how do I re-project points in a camera - projector system (after calibration)

I have seen many blog entries, videos and source code on the internet about how to carry out camera + projector calibration using openCV, in order to produce the camera.yml, projector.yml and projectorExtrinsics.yml files.
I have yet to see anyone discussing what to do with these files afterwards. Indeed I have done a calibration myself, but I don't know what the next step is in my own application.
Say I write an application that now uses the camera - projector system that I calibrated to track objects and project something on them. I will use findContours() to grab some points of interest from the moving objects and now I want to project these points (from the projector!) onto the objects!
what I want to do is (for example) track the centre of mass (COM) of an object and show a point on the camera view of the tracked object (at its COM). Then a point should be projected on the COM of the object in real time.
It seems that projectPoints() is the openCV function I should use after loading the yml files, but I am not sure how I will account for all the intrinsic & extrinsic calibration values of both camera and projector. Namely, projectPoints() requires as parameters the
vector of points to re-project (duh!)
rotation + translation matrices. I think I can use the projectorExtrinsics here. or I can use the composeRT() function to generate a final rotation & a final translation matrix from the projectorExtrinsics (which I have in the yml file) and the cameraExtrinsics (which I don't have. side question: should I not save them too in a file??).
intrinsics matrix. this is tricky now. should I use the camera or the projector intrinsics matrix here?
distortion coefficients. again should I use the projector or the camera coefs here?
other params...
So if I use either projector or camera (which one??) intrinsics + coeffs in projectPoints(), then I will only be 'correcting' for one of the 2 instruments. Where / how will I use the other instrument's intrinsics??
What else do I need to use apart from load() the yml files and projectPoints() ? (perhaps undistortion?)
ANY help on the matter is greatly appreciated .
If there is a tutorial or a book (no, O'Reilly "Learning openCV" does not talk about how to use the calibration yml files either! - only about how to do the actual calibration), please point me in that direction. I don't necessarily need an exact answer!
First, you seem to be confused about the general role of a camera/projector model: its role is to map 3D world points to 2D image points. This sounds obvious, but it means that given extrinsics R,t (for orientation and position), a distortion function D(.) and intrinsics K, you can infer for this particular camera the 2D projection m of a 3D point M as follows: m = K.D(R.M+t). The projectPoints function does exactly that (i.e. 3D to 2D projection) for each input 3D point, hence you need to give it the input parameters associated with the device in which you want your 3D points projected (projector K&D if you want projector 2D coordinates, camera K&D if you want camera 2D coordinates).
Second, when you jointly calibrate your camera and projector, you do not estimate a set of extrinsics R,t for the camera and another for the projector, but only one R and one t, which represent the rotation and translation between the camera's and projector's coordinate systems. For instance, this means that your camera is assumed to have rotation = identity and translation = zero, and the projector has rotation = R and translation = t (or the other way around, depending on how you did the calibration).
Now, concerning the application you mentioned, the real problem is: how do you estimate the 3D coordinates of a given point?
Using two cameras and one projector, this would be easy: you could track the objects of interest in the two camera images, triangulate their 3D positions using the two 2D projections using function triangulatePoints and finally project this 3D point in the projector 2D coordinates using projectPoints in order to know where to display things with your projector.
With only one camera and one projector, this is still possible but more difficult because you cannot triangulate the tracked points from only one observation. The basic idea is to approach the problem like a sparse stereo disparity estimation problem. A possible method is as follows:
project a non-ambiguous image (e.g. black and white noise) using the projector, in order to texture the scene observed by the camera.
as before, track the objects of interest in the camera image
for each object of interest, correlate a small window around its location in the camera image with the projector image, in order to find where it projects in the projector 2D coordinates
Another approach, which unlike the one above would use the calibration parameters, could be to do a dense 3D reconstruction using stereoRectify and StereoBM::operator() (or gpu::StereoBM_GPU::operator() for the GPU implementation), map the tracked 2D positions to 3D using the estimated scene depth, and finally project into the projector using projectPoints.
Anyhow, this is easier, and more accurate, using two cameras.
Hope this helps.

How to verify that the camera calibration is correct? (or how to estimate the error of reprojection)

The quality of calibration is measured by the reprojection error (is there an alternative?), which requires knowledge of the world coordinates of some 3D point(s).
Is there a simple way to produce such known points? Is there a way to verify the calibration in some other way (for example, Zhang's calibration method only requires that the calibration object be planar; the geometry of the system need not be known)?
You can verify the accuracy of the estimated nonlinear lens distortion parameters independently of pose. Capture images of straight edges (e.g. a plumb line, or a laser stripe on a flat surface) spanning the field of view (an easy way to span the FOV is to rotate the camera keeping the plumb line fixed, then add all the images). Pick points on said line images, undistort their coordinates, fit mathematical lines, compute error.
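The "fit mathematical lines, compute error" step could look like the sketch below. It assumes the points have already been undistorted by whatever means you use, fits a total-least-squares line through them and returns the RMS perpendicular distance in pixels. The function name is made up for this illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Point2 {
    double x, y;
};

// RMS perpendicular distance of the points to their best-fit line
// (total least squares). Returns 0 for fewer than two points.
double lineFitRms(const std::vector<Point2>& pts) {
    if (pts.size() < 2) return 0.0;
    const double n = static_cast<double>(pts.size());

    // Centroid of the points; the best-fit line passes through it.
    double mx = 0.0, my = 0.0;
    for (const Point2& p : pts) {
        mx += p.x;
        my += p.y;
    }
    mx /= n;
    my /= n;

    // Entries of the 2x2 scatter matrix.
    double sxx = 0.0, sxy = 0.0, syy = 0.0;
    for (const Point2& p : pts) {
        const double dx = p.x - mx;
        const double dy = p.y - my;
        sxx += dx * dx;
        sxy += dx * dy;
        syy += dy * dy;
    }

    // The smallest eigenvalue equals the sum of squared perpendicular
    // distances to the best-fit line.
    const double diff = sxx - syy;
    const double lambdaMin =
        0.5 * ((sxx + syy) - std::sqrt(diff * diff + 4.0 * sxy * sxy));
    return std::sqrt(std::max(0.0, lambdaMin) / n);
}
```

A residual well below a pixel on lines spanning the whole field of view suggests the distortion model is doing its job; a residual that grows towards the image corners suggests it is not.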
For the linear part, you can also capture images of multiple planar rigs at a known relative pose, either moving one planar target with a repeatable/accurate rig (e.g. a turntable), or mounting multiple planar targets at known angles from each other (e.g. three planes at 90 deg from each other).
As always, a compromise is in order between accuracy requirements and budget. With enough money and a friendly machine shop nearby you can let your fantasy run wild with rig geometry. I had once a dodecahedron about the size of a grapefruit, machined out of white plastic to 1/20 mm spec. Used it to calibrate the pose of a camera on the end effector of a robotic arm, moving it on a sphere around a fixed point. The dodecahedron has really nice properties in regard to occlusion angles. Needless to say, it's all patented.
The images used in generating the intrinsic calibration can also be used to verify it. A good example of this is the camera-calib tool from the Mobile Robot Programming Toolkit (MRPT).
Per Zhang's method, the MRPT calibration proceeds as follows:
1. Process the input images:
1a. Locate the calibration target (extract the chessboard corners)
1b. Estimate the camera's pose relative to the target, assuming that the target is a planar chessboard with a known number of intersections.
1c. Assign points on the image to a model of the calibration target in relative 3D coordinates.
2. Find an intrinsic calibration that best explains all of the models generated in 1b/c.
3. Once the intrinsic calibration is generated, we can go back to the source images. For each image, multiply the estimated camera pose with the intrinsic calibration, then apply that to each of the points derived in 1c.
This will map the relative 3D points from the target model back to the 2D calibration source image. The difference between the original image feature (chessboard corner) and the reprojected point is the calibration error.
MRPT performs this test on all input images and will give you an aggregate reprojection error.
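As a rough illustration of the aggregate number, the sketch below computes an RMS reprojection error from two point lists: the corners detected in the image and the same corners reprojected through the estimated pose and intrinsics. This is not MRPT's code; the function name and the choice of RMS as the aggregate are assumptions made for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

struct Point2 {
    double x, y;
};

// Root-mean-square distance, in pixels, between detected image features
// and their reprojections. Both lists must be in the same order.
double rmsReprojectionError(const std::vector<Point2>& detected,
                            const std::vector<Point2>& reprojected) {
    if (detected.size() != reprojected.size() || detected.empty()) {
        throw std::invalid_argument("point lists must be non-empty and of equal size");
    }
    double sumSquared = 0.0;
    for (std::size_t i = 0; i < detected.size(); ++i) {
        const double dx = detected[i].x - reprojected[i].x;
        const double dy = detected[i].y - reprojected[i].y;
        sumSquared += dx * dx + dy * dy;
    }
    return std::sqrt(sumSquared / static_cast<double>(detected.size()));
}
```

Computing this per image as well as over all images helps to spot a single badly detected target that inflates the overall figure.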
If you want to verify a full system, including both the camera intrinsics and the camera-to-world transform, you will probably need to build a jig that places the camera and target in a known configuration, then test calculated 3D points against real-world measurements.
On Engine's question: the pose matrix is a [R|t] matrix where R is a pure 3D rotation and t a translation vector. If you have computed a homography from the image, section 3.1 of Zhang's Microsoft Technical Report (http://research.microsoft.com/en-us/um/people/zhang/Papers/TR98-71.pdf) gives a closed form method to obtain both R and t using the known homography and the intrinsic camera matrix K. ( I can't comment, so I added as a new answer)
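A minimal sketch of that closed form, assuming H = s.K.[r1 r2 t] for a planar target at Z = 0 and a K with zero skew: r1 = lambda.K^-1.h1, r2 = lambda.K^-1.h2, r3 = r1 x r2, t = lambda.K^-1.h3, with lambda = 1/||K^-1.h1||. The type and function names are made up for this illustration, and the sign handling (forcing the target to be in front of the camera) is an added assumption.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

struct Pose {
    Mat3 R;
    Vec3 t;
};

// Recovers [R|t] from a plane-to-image homography H and intrinsics
// (fx, fy, cx, cy), assuming zero skew.
Pose poseFromHomography(const Mat3& H, double fx, double fy, double cx, double cy) {
    // K^-1 applied to column j of H.
    auto kInvColumn = [&](int j) {
        return Vec3{{(H[0][j] - cx * H[2][j]) / fx,
                     (H[1][j] - cy * H[2][j]) / fy,
                     H[2][j]}};
    };
    const Vec3 a = kInvColumn(0);
    const Vec3 b = kInvColumn(1);
    const Vec3 c = kInvColumn(2);

    // Scale factor that makes r1 a unit vector.
    double lambda = 1.0 / std::sqrt(a[0] * a[0] + a[1] * a[1] + a[2] * a[2]);
    // H is only defined up to sign; pick the sign that gives t_z > 0.
    if (lambda * c[2] < 0.0) lambda = -lambda;

    Vec3 r1{}, r2{}, t{};
    for (int i = 0; i < 3; ++i) {
        r1[i] = lambda * a[i];
        r2[i] = lambda * b[i];
        t[i] = lambda * c[i];
    }
    // Third rotation column as the cross product r1 x r2.
    const Vec3 r3{{r1[1] * r2[2] - r1[2] * r2[1],
                   r1[2] * r2[0] - r1[0] * r2[2],
                   r1[0] * r2[1] - r1[1] * r2[0]}};

    Pose pose{};
    for (int i = 0; i < 3; ++i) {
        pose.R[i][0] = r1[i];
        pose.R[i][1] = r2[i];
        pose.R[i][2] = r3[i];
    }
    pose.t = t;
    return pose;
}
```

With noisy data the resulting R is generally not exactly orthonormal; the report describes how to find the closest true rotation matrix, which this sketch leaves out.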
Should be just variance and bias in calibration (pixel re-projection) errors given enough variability in calibration rig poses. It is better to visualize these errors rather than to look at the values. For example, error vectors pointing to the center would be indicative of wrong focal length. Observing curved lines can give intuition about distortion coefficients.
To calibrate the camera one has to jointly solve for the extrinsic and intrinsic parameters. The latter may be known from the manufacturer; solving for the extrinsic parameters (rotation and translation) involves decomposing the calculated homography, see: Decompose Homography matrix in opencv python
Calculate a Homography with only Translation, Rotation and Scale in Opencv
The homography is used here since most calibration targets are flat.

Rigid motion estimation

Now what I have is the 3D point sets as well as the projection parameters of the camera. Given two 2D point sets projected from the 3D points by using the camera and the transformed camera (by rotation and translation), there should be an intuitive way to estimate the camera motion... I read some parts of Zisserman's book "Multiple View Geometry in Computer Vision", but I still did not get the solution.
Are there any hints, how can the rigid motion be estimated in this case?
THANKS!!
What you are looking for is a solution to the PnP problem. OpenCV has a function which should work called solvePnP. Just to be clear, for this to work you need point locations in world space, a camera matrix, and the points projections onto the image plane. It will then tell you the rotation and translation of the camera or points depending on how you choose to think of it.
Adding to the previous answer, Eigen has an implementation of Umeyama's method for estimation of the rigid transformation between two sets of 3d points. You can use it to get an initial estimation, and then refine it using an optimization algorithm and considering the projections of the 3d points onto the images too. For example, you could try to minimize the reprojection error between 2d points on the first image and projections of the 3d points after you bring them from the reference frame of one camera to the reference frame of the other using the previously estimated transformation. You can do this in both ways, using the transformation and its inverse, and try to minimize the bidirectional reprojection error. I'd recommend the paper "Stereo visual odometry for autonomous ground robots", by Andrew Howard, as well as some of its references for a better explanation, especially if you are considering an outlier removal/inlier detection step before the actual motion estimation.

OpenCV translational/rotational displacement between frames?

I am currently researching the use of a low resolution camera facing vertically at the ground (fixed height) to measure the speed (speed of the camera passing over the surface). Using OpenCV 2.1 with C++.
Since the entire background will be constantly moving, translating and/or rotating between consecutive frames, what would be the most suitable method of determining the displacement of the frames in a 'usable value' form? (Function that returns frame displacement?) Then based on the height of the camera and the frame area captured (dimensions of the frame in the real world), I would be able to calculate the displacement in the real world based on the frame displacement, and then calculate the speed for a measured time interval.
Trying to determine my method of approach or if any example code is available, converting a frame displacement (or displacement of a set of pixels) into a distance displacement based on the height of the camera.
Thanks,
Josh.
It depends on your knowledge of computer vision. For a start, I would use what opencv can offer. Please have a look at the features2d module.
What you need is to first extract feature points (e.g. sift or surf), then use its built-in matching algorithms to match points extracted from two frames. Each match will give you some constraints, and you will end up solving an over-determined Ax=B.
Of course, do your experiments offline, i.e. shooting a video first and then operate on the single images.
UPDATE:
In the case of multi-camera calibration, your goal is to determine the 3D location of each camera, which is exactly what you have. Imagine that instead of moving your single camera around, you have as many cameras as there are images in the video captured by your single camera, and you want to know the 3D location of each of them, which represents the location at which each image was taken by your single moving camera.
There is a matrix with which you can map any 3D point in the world to a 2D point on your image (see wiki). The camera matrix consists of 2 parts, the intrinsic and the extrinsic parameters. I (maybe inexactly) referred to the intrinsic parameters as the internal matrix. The intrinsic parameters consist of static parameters for a single camera (e.g. focal length), while the extrinsic ones consist of the location and rotation of your camera.
Now, once you have the intrinsic parameters of your camera and the matched points, you can then stack a lot of those projection equations on top of each other and solve the system for both the actual 3D location of all your matched points and all the extrinsic parameters.
Given interest points as described above, you can find the translational transformation with opencv's findHomography.
Also, if you can assume that transformations will be somewhat small and near-linear, you can just compare image pixels of two consecutive frames to find the best match. With enough downsampling, this doesn't take too long, and from my experience works rather well.
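A minimal sketch of that pixel-comparison idea, together with the conversion to speed asked about in the question. It assumes a pure translation between two (already downsampled) grayscale frames, tries every integer shift up to maxShift and keeps the one with the lowest mean squared difference over the overlapping area. The Image struct, the function names and the way the ground scale is passed in are all assumptions made for this illustration, not OpenCV API.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Grayscale frame stored row by row.
struct Image {
    int width, height;
    std::vector<float> pixels;
    float at(int x, int y) const { return pixels[y * width + x]; }
};

struct Shift {
    int dx, dy;
};

// Finds (dx, dy) such that content at (x, y) in prev best matches
// content at (x + dx, y + dy) in curr. Both frames must have equal size.
Shift bestShift(const Image& prev, const Image& curr, int maxShift) {
    Shift best{0, 0};
    double bestCost = std::numeric_limits<double>::infinity();
    for (int dy = -maxShift; dy <= maxShift; ++dy) {
        for (int dx = -maxShift; dx <= maxShift; ++dx) {
            double sum = 0.0;
            long count = 0;
            for (int y = 0; y < prev.height; ++y) {
                const int y2 = y + dy;
                if (y2 < 0 || y2 >= curr.height) continue;
                for (int x = 0; x < prev.width; ++x) {
                    const int x2 = x + dx;
                    if (x2 < 0 || x2 >= curr.width) continue;
                    const double d = curr.at(x2, y2) - prev.at(x, y);
                    sum += d * d;
                    ++count;
                }
            }
            if (count == 0) continue;
            const double cost = sum / static_cast<double>(count);
            if (cost < bestCost) {
                bestCost = cost;
                best = {dx, dy};
            }
        }
    }
    return best;
}

// Converts a pixel shift into a ground speed, given the real-world width
// covered by the frame, the frame width in pixels and the time between frames.
double speedMetersPerSecond(Shift s, double groundWidthMeters, int imageWidthPx,
                            double dtSeconds) {
    const double metersPerPixel = groundWidthMeters / imageWidthPx;
    const double pixels = std::hypot(static_cast<double>(s.dx), static_cast<double>(s.dy));
    return pixels * metersPerPixel / dtSeconds;
}
```

If the shift was measured on downsampled frames, imageWidthPx must be the downsampled width. Rotation between frames is not handled here; for that, the feature-based approach above is the better fit.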
Good luck!