Undistorting without calibration images - c++

I'm currently trying to "undistort" fisheye imagery using OpenCV in C++. I know the exact lens and camera model, so I figured that I would be able to use this information to calculate some parameters and ultimately convert fisheye images to rectilinear images. However, all the tutorials I've found online encourage using auto-calibration with checkerboards. Is there a way to calibrate the fisheye camera by just using camera + lens parameters and some math? Or do I have to use the checkerboard calibration technique?
I am trying to avoid having to use the checkerboard calibration technique because I am just receiving some images to undistort, and it would be undesirable to have to ask for images of checkerboards if possible. The lens is assumed to retain a constant zoom/focal length for all images.
Thank you so much!

To un-distord an image, you need to know the intrinsic parameters of the camera which describe the distorsion.
You can't compute them from datasheet values, because they depend on how the lens is manufactured and two lenses of the same vendor & model might have different distorsion coefficients, especially if they are cheap one.
Some raster graphics editor embed a lens database from which you can query distorsion coefficients. But there is no magic, they built it by measuring the lens distorsion and eventually interpolate them after.
But you can still use an empiric method to correct at least barrel effect.
They are plenty of shaders to do so and you can always do your own maths to build a distorsion map.

Related

stereo depth map but with a single moving camera measured with sensors

I've just gotten started learning about calculating depth from stereo images and before I went and committed to learning about this I wanted to check if it was a viable choice for a project I'm doing. I have a drone with a single rgb camera that has sensors that can give the orientation and movement of the drone. Would it be possible to sample two frames, the distance, and orientation differences between the samples and use this to calculate depth? I've seen in most examples the cameras are lined up horizontally. Is this necessary for stereo images or can I use any reasonable angle and distance between the two sampled images? Would it be feasible to do this in real time? My overall goal is to do some sort of monocular slam to have this drone navigate indoor areas. I know that ORB slam exists but I am mostly doing this for a learning experience and so would like to do things from scratch where possible.
Thank you.

Object Detection in Fisheye Images

I have a camera that uses a fish-eye lens and need to run an object detection network like YOLO or SSD with it.
Should I rectify/un-distort the incoming image first? Is that computationally expensive?
Or, should I try to train the network using fish-eye images?
Many thanks for the help.
If you are trying to use a model that was pretrained on perspective rectilinear images, you will probably get poor results either way. On one hand, objects in raw fisheye images have a different appearance from the same objects in perspective images and many will be misdetected. On the other hand, you can't really "undistort" a fisheye image with a large field of view, and when you try - it looks very different from real perspective pictures. Some researchers are doing neither and investigating other alternatives instead.
If you have a training set of fisheye images, you can train a model on the raw fisheye images. It is a more difficult task for the network to learn, because the same object changes appearance when it moves, while convolutional neural networks are shift invariant, but nevertheless it is possible and has been demonstrated in the literature.

Visual Odometry, Camera Parameters

I am studying about visual odometry and watched Prof. Dr. Cyrill Stachniss' video recordings which are available as YouTube 2015/16 Playlist about Photogrammetry I & II .
First, If I want to create my own dataset (like KITTI dataset for VO or like Oxford campus dataset) what should be the properties of the image that I take with a camera.
Are they just images? Or, does they have some special properties ? That is, how can I create my own dataset with a monocular or stereo camera.
Thank you.
To get extrinsic and intrinsic parameters from the image you must have a set of images of known shape from varying views. It's not trivial task to do on your own, by common CV libraries / solution have a built-in utilities for camera calibration (I have to deal with OpenCV library and Matlab CV package and they are generally the same).Usually it's done with a black and white checkboard or another simple geometric pattern.
Then with known camera parameters you can manipulate your own dataset.
Matlab camera calibration reference
OpenCV camera calibration tutorials
If you want to benchmark some visual odometry algorithms with your dataset, you will definitely need the intrinsic parameters of your camera as well as its pose.
As said in #f4f answer, the intrinsic calibration is typically done with some images of a checkerboard that you tilt and rotate (see opencv).
This will give you parameters such as focal length, optical center but also the distortion coefficients which can be important depending on your camera.
Getting the pose of the camera (i.e extrinsic parameters) at each frame is probably trickier. Usually the ground-truth is obtained using information from additional sensors (tracking system, IMU, GPS, ...). You can have a look at : TUM RGB-D SLAM Dataset and the corresponding paper. They explain how they used a motion-capture system to get the ground-truth pose.
Recording the time of acquisition of the camera frames can also be interesting (one timestamp per frame).
Creating your own visual odometry dataset is not trivial. If you just want to create a dataset "for fun" or to do some experiments and if you have only a camera available, I would say you can just try some methods that are known to work well (like ORB-SLAM). This will give you good approximate of the camera poses (you may have to manually fix the unknown scale).

In computer vision, what does MVS do that SFM can't?

I'm a dev with about a decade of enterprise software engineering under his belt, and my hobbyist interests have steered me into the vast and scary realm of computer vision (CV).
One thing that is not immediately clear to me is the division of labor between Structure from Motion (SFM) tools and Multi View Stereo (MVS) tools.
Specifically, CMVS appears to be the best-in-show MVS tool, and Bundler seems to be one of the better open source SFM tools out there.
Taken from CMVS's own homepage:
You should ALWAYS use CMVS after Bundler and before PMVS2
I'm wondering: why?!? My understanding of SFM tools is that they perform the 3D reconstruction for you, so why do we need MVS tools in the first place? What value/processing/features do they add that SFM tools like Bundler can't address? Why the proposed pipeline of:
Bundler -> CMVS -> PMVS2
?
Quickly put, Structure from Motion (SfM) and MultiView Stereo (MVS) techniques are complementary, as they do not deal with the same assumptions. They also differ slightly in their inputs, MVS requiring camera parameters to run, which is estimated (output) by SfM. SfM only gives a coarse 3D output, whereas PMVS2 gives a more dense output, and finally CMVS is there to circumvent some limitations of PMVS2.
The rest of the answer provides an high-level overview of how each method works, explaining why it is this way.
Structure from Motion
The first step of the 3D reconstruction pipeline you highlighted is a SfM algorithm that could be done using Bundler, VisualSFM, OpenMVG or the like. This algorithm takes in input some images and outputs the camera parameters of each image (more on this later) as well as a coarse 3D shape of the scene, often called the sparse reconstruction.
Why does SfM outputs only a coarse 3D shape? Basically, SfM techniques begins by detecting 2D features in every input image and matching those features between pairs of images. The goal is, for example, to tell "this table corner is located at those pixels locations in those images." Those features are described by what we call descriptors (like SIFT or ORB). Those descriptors are built to represent a small region (ie. a bunch of neighboring pixels) in images. They can represent reliably highly textured or rough geometries (e.g., edges), but these scene features need to be unique (in the sense distinguishing) throughout the scene to be useful. For example (maybe oversimplified), a wall with repetitive patterns would not be very useful for the reconstruction, because even though it is highly textured, every region of the wall could potentially match pretty much everywhere else on the wall. Since SfM is performing a 3D reconstruction using those features, the vertices of the 3D scene reconstruction will be located on those unique textures or edges, giving a coarse mesh as output. SfM won't typically produce a vertex in the middle of surface without precise and distinguishing texture. But, when many matches are found between the images, one can compute a 3D transformation matrix between the images, effectively giving the relative 3D position between the two camera poses.
MultiView Stereo
Afterwards, the MVS algorithm is used to refine the mesh obtained by the SfM technique, resulting in what is called a dense reconstruction. This algorithm requires the camera parameters of each image to work, which is output by the SfM algorithm. As it works on a more constrained problem (since they already have the camera parameters of every image like position, rotation, focal, etc.), MVS will compute 3D vertices on regions which were not (or could not be) correctly detected by descriptors or matched. This is what PMVS2 does.
How can PMVS work on regions where 2D feature descriptor would difficultly match? Since you know the camera parameters, you know a given pixel in an image is the projection of a line in another image. This approach is called epipolar geometry. Whereas SfM had to seek through the entire 2D image for every descriptor to find a potential match, MVS will work on a single 1D line to find matches, simplifying the problem quite a deal. As such, MVS usually takes into account illumination and object materials into its optimization, which SfM does not.
There is one issue, though: PMVS2 performs a quite complex optimization that can be dreadfully slow or take an astronomic amount of memory on large image sequences. This is where CMVS comes into play, clustering the coarse 3D SfM output into regions. PMVS2 will then be called (potentially in parallel) on each cluster, simplifying its execution. CMVS will then merge each PMVS2 output in an unified detailed model.
Conclusion
Most of the information provided in this answer and many more can be found in this tutorial from Yasutaka Furukawa, author of CMVS and PMVS2:
http://www.cse.wustl.edu/~furukawa/papers/fnt_mvs.pdf
In essence, both techniques emerge from two different approaches: SfM aims to perform a 3D reconstruction using a structured (but theunknown) sequence of images while MVS is a generalization of the two-view stereo vision, based on human stereopsis.

C++ OpenCV sky image stitching

Some background:
Hi all! I have a project which involves cloud imaging. I take pictures of the sky using a camera mounted on a rotating platform. I then need to compute the amount of cloud present based on some color threshold. I am able to this individually for each picture. To completely achieve my goal, I need to do the computation on the whole image of the sky. So my problem lies with stitching several images (about 44-56 images). I've tried using the stitch function on all and some subsets of image set but it returns an incomplete image (some images were not stitched). This could be because of a lack of overlap of something, I dunno. Also the output image has been distorted weirdly (I am actually expecting the output to be something similar to a picture taken by a fish-eye lense).
The actual problem:
So now I'm trying to figure out the opencv stitching pipeline. Here is a link:
http://docs.opencv.org/modules/stitching/doc/introduction.html
Based on what I have researched I think this is what I want to do. I want to map all the images to a circular shape, mainly because of the way how my camera rotates, or something else that has uses a fairly simple coordinate transformation. So I think I need get some sort of fixed coordinate transform thing for the images. Is this what they call the homography? If so, does anyone have any idea how I can go about my problem? After this, I believe I need to get a mask for blending the images. Will I need to get a fixed mask like the one I want for my homography?
Am I going through a possible path? I have some background in programming but almost none in image processing. I'm basically lost. T.T
"So I think I need get some sort of fixed coordinate transform thing for the images. Is this what they call the homography?"
Yes, the homography matrix is the transformation matrix between an original image and the ideal result. It warps an image in perspective so it can fit in stitching to the other image.
"If so, does anyone have any idea how I can go about my problem?"
Not with the limited information you provided. It would ease the problem a lot if you know the order of pictures (which borders which.. row, column position)
If you have no experience in image processing, I would recommend you use a tutorial covering stitching using more basic functions in detail. There is some important work behind the scenes, and it's not THAT harder to actually do it yourself.
Start with this example. It stitches two pictures.
http://ramsrigoutham.com/2012/11/22/panorama-image-stitching-in-opencv/